Java Searching and Indexing

Efficient searching and indexing of digital information is a complex area of computer science. The most prominent examples are Google and other internet-based search engines. Google builds its search index by periodically crawling all publicly accessible HTTP resources and compiling a huge index of keywords and phrases. Building and maintaining an up-to-date index over such a large data set is an intensive, time-consuming operation, but it is essential for fast querying of information.

Indexing: To build the index, you must first be able to extract meaning from the different types of resources made available over HTTP. The simplest and most basic form of internet data is the static plain-text HTML page (or unstructured plain-text files linked from HTML pages). Dynamically generated HTML pages can be harder to parse: they often implement user sessions through cookies and display different content based on sequences of user actions, the user's geographical location or persisted profile settings (pages that require a login, for example, cannot be spidered at all). JavaScript, ActionScript, Silverlight and JavaFX pages are likewise difficult to parse, for numerous reasons.

In addition to HTML content, HTTP can serve files in any number of other formats. PDF articles, for example, are very common for large product user guides and research journals. A search engine must be able to parse these and extract their text in order to build a thorough index. The same goes for MS Word, PowerPoint, Excel, RTF and any number of other file formats that hold textual content in binary or proprietary form.

Apache Tika and Apache Lucene: Extracting this data can be very difficult. Thankfully, if you ever want to build an internet search engine in Java, or, more realistically, if you want to implement text-search functionality in your desktop or J2EE web applications, there is still hope. A number of open-source libraries take care of all the proprietary format parsing and data extraction. PDFBox, for example, can extract all the text from standard PDF files, and components in the Apache POI project can extract data from the MS Office file formats. This is how Javinder implements its text-search functionality.
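To give a feel for how little code this takes, here is a minimal sketch of PDF text extraction with PDFBox. It assumes PDFBox 2.x is on the classpath (the API has moved around between major versions); to keep it self-contained it first builds a tiny one-page PDF in memory, then runs the same `PDFTextStripper` call you would run on a real file.

```java
// Minimal PDFBox sketch: build a one-page PDF in memory, then extract its text.
// Assumes PDFBox 2.x (org.apache.pdfbox:pdfbox) on the classpath.
import java.io.ByteArrayOutputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextDemo {

    /** Round-trips some text through a PDF and back out as a plain string. */
    public static String roundTrip() throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream content = new PDPageContentStream(doc, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(72, 700);
                content.showText("Hello from PDFBox");
                content.endText();
            }
            doc.save(out);
        }
        // This is the extraction step you would apply to any downloaded PDF.
        try (PDDocument doc = PDDocument.load(out.toByteArray())) {
            return new PDFTextStripper().getText(doc);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

The extracted string is exactly the kind of raw text you would then feed into an index.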

An even better alternative appears to be the Apache Tika project. Whilst this incubator project is still in its early stages, it is integrating the above-mentioned libraries and others into a unified API for text extraction. Support for new file formats will thus be abstracted away from the API user, and your already-compiled tools will be able to take advantage of these additions simply by updating the Tika jar to the latest version.
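In later Tika releases that unified API surfaces as a simple facade class; a sketch, assuming `tika-core` and `tika-parsers` are on the classpath, looks like this. Note that one `parseToString` call works whether the bytes turn out to be PDF, Word, HTML or plain text, which is the whole point of the abstraction:

```java
// Tika facade sketch: one call extracts text from any supported format.
// Assumes org.apache.tika:tika-core and tika-parsers on the classpath.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class TikaDemo {

    /** Detects the format of the given bytes and returns their textual content. */
    public static String extract(byte[] bytes) throws Exception {
        Tika tika = new Tika();
        return tika.parseToString(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws Exception {
        // Plain text here, but the same call handles PDF, DOC, PPT, HTML, ...
        System.out.println(extract("plain text sample".getBytes(StandardCharsets.UTF_8)));
    }
}
```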

And if you want to build a more powerful search engine, have a look at Apache Lucene - “a high-performance, full-featured text search engine library written entirely in Java”. Lucene provides the API and implementation engine for indexing and efficiently querying data, and it can use Tika for the data-extraction part. These projects are really exciting for Java searching. I’ll definitely be keeping a close eye on them, as I want to use Tika in my Javinder tool.
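The core index-then-query cycle is compact. The following sketch assumes a modern Lucene (8.x/9.x, with `lucene-core` and `lucene-queryparser` on the classpath; the API has changed a lot since the early releases). It indexes one in-memory document whose body would, in practice, be the text extracted by Tika or PDFBox, then searches it:

```java
// Lucene sketch: index one document in memory, then query it.
// Assumes Lucene 8.x/9.x (lucene-core, lucene-queryparser) on the classpath.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneDemo {

    /** Indexes one document and returns the hit count for a sample query. */
    public static long indexAndSearch() throws Exception {
        Directory dir = new ByteBuffersDirectory();          // in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();  // tokenising/normalising

        // Indexing: each Document holds named fields of extracted text.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body",
                    "text extracted from a PDF user guide", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Querying: parse the user's query against the same field and analyzer.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("guide");
            return searcher.search(query, 10).totalHits.value;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexAndSearch());
    }
}
```

In a real application you would use a disk-backed `FSDirectory` instead of the in-memory one, so the index survives between runs.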

