Please post comments or corrections to the Author Online forum at
http://www.manning-sandbox.com/forum.jspa?forumID=451
MEAP Edition
Manning Early Access Program
Copyright 2009 Manning Publications
For more information on this and other Manning titles go to
www.manning.com
Contents
Preface
Chapter 1 Meet Lucene
Chapter 2 Indexing
Chapter 3 Adding search to your application
Chapter 4 Analysis
Chapter 5 Advanced search techniques
Chapter 6 Extending search
Chapter 7 Parsing common document formats
Chapter 8 Tools and extensions
Chapter 9 Lucene ports
Chapter 10 Administration and performance tuning
Chapter 11 Case studies
Appendix A Installing Lucene
Appendix B Lucene index format
Appendix C Resources
Appendix D Using the benchmark (contrib) framework
1 Meet Lucene
This chapter covers
Understanding Lucene
General search application architecture
Using the basic indexing API
Working with the search API
Considering alternative products
Lucene is a powerful Java search library that lets you easily add search to any application. In recent years
Lucene has become exceptionally popular and is now the most widely used information retrieval library: it
powers the search features behind many Web sites and desktop tools. While it’s written in Java, thanks to
its popularity and the determination of zealous developers, there are now a number of ports or
integrations to other programming languages (C/C++, C#, Ruby, Perl, Python, PHP, etc.).
One of the key factors behind Lucene’s popularity is its simplicity, but don’t let that fool you: under
the hood there are sophisticated, state-of-the-art information retrieval techniques quietly at work. The
careful exposure of its indexing and searching API is a sign of well-designed software. Consequently,
you don’t need in-depth knowledge of how Lucene’s indexing and retrieval work in order to
start using it. Moreover, Lucene’s straightforward API requires only a handful of classes to get
started.
In this chapter we cover the overall architecture of a typical search application, and where Lucene fits.
It’s crucial to recognize that Lucene is simply a search library, and you’ll need to handle the other
components of a search application (crawling, document filtering, runtime server, user interface,
administration, etc.) yourself as your application requires. We show you how to perform basic indexing
and searching with Lucene with ready-to-use code examples. We then briefly introduce all the core
elements you need to know for both of these processes. We’ll start next with the very modern problem of
information explosion, to understand why we need powerful search functionality in the first place.
NOTE
Lucene is a very active open-source project. By the time you’re reading this, Lucene’s APIs and
features will likely have changed. This book is based on the 3.0 release of Lucene, and thanks to Lucene’s
backwards compatibility policy, all code samples should compile and run fine for all future 3.x releases.
If you have problems, send an email to java-user@lucene.apache.org, and Lucene’s large community
will surely help.
1.1 Evolution of information organization and access
In order to make sense of the perceived complexity of the world, humans have invented categorizations,
classifications, genera, species, and other types of hierarchical organizational schemes. The Dewey
Decimal System for categorizing items in a library collection is a classic example of a hierarchical
categorization scheme.
The explosion of the Internet and electronic data repositories has brought large amounts of
information within our reach. With time, however, the amount of data available has become so vast that
we need alternative, more dynamic ways of finding information (see Figure 1.1). Although we can classify
data, trawling through hundreds or thousands of categories and subcategories of data is no longer an
efficient method for finding information.
The need to quickly locate specific information in a sea of data isn’t limited to
the Internet realm; desktop computers store ever more data as well. Changing directories and expanding
and collapsing hierarchies of folders isn’t an effective way to access stored documents. Furthermore, we
no longer use computers just for their raw computing abilities: they also serve as communication devices,
multimedia players, and media storage devices. Those uses require the ability to quickly find
a specific piece of data; what’s more, we need to make rich media (such as images, video, and audio files
in various formats) easy to locate.
With this abundance of information, and with time being one of the most precious commodities for
most people, we need to be able to make flexible, free-form, ad hoc queries that can quickly cut across
rigid category boundaries and find exactly what we’re after while requiring the least effort possible.