kalani's Tech blog: pdf

Wednesday, August 06, 2008

How to Index PDF Documents with Lucene

Lucene has no built-in support for indexing PDF documents, so the text must be extracted from the document before it can be indexed. One tool that can be used for this purpose is PDFBox, an open source project released under the BSD license. There are many other PDF tools, but in my experience PDFBox fits well with Lucene. The only extra step is extracting the text from the document. The following code snippet shows how to do it.

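// The classes below come from PDFBox; their package names depend on the release
// (org.pdfbox.* in the 0.7.x releases, org.apache.pdfbox.* after the move to Apache).
FileInputStream fi = new FileInputStream(new File("sample.pdf"));

PDFParser parser = new PDFParser(fi);
parser.parse();                               // read the PDF into memory
COSDocument cd = parser.getDocument();        // low-level document object
PDDocument pd = new PDDocument(cd);           // high-level wrapper used by the stripper
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(pd);           // plain text of all pages
pd.close();                                   // release the resources held by the document
fi.close();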

Now this extracted text can be used to build the Lucene index.

Likewise, there are various tools for extracting text from Word documents and other formats. Any kind of document can therefore be added to the Lucene index, as long as its text can be extracted with an external tool. Even when indexing XML, you should extract the text from the XML document first if you only need to index the text values.
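
To illustrate the XML case, here is a minimal sketch that pulls only the text values out of an XML string using the DOM parser that ships with the JDK. The class name XmlTextExtractor, the method extractText and the sample markup are made up for this example; they do not come from Lucene or PDFBox.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, used here only for illustration.
public class XmlTextExtractor {

    // Returns the text values of an XML string with all tags removed.
    public static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates the text nodes below the root element
            String raw = doc.getDocumentElement().getTextContent();
            // collapse the whitespace left over from indentation and line breaks
            return raw.trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document
        String xml = "<doc><title>Sample</title>\n  <body>Hello  world</body></doc>";
        System.out.println(extractText(xml)); // prints "Sample Hello world"
    }
}
```

The returned string can then be handed to Lucene in the same way as the text extracted from the PDF. Note that this simple approach joins the text of neighbouring elements directly when there is no whitespace between them in the source, so for real documents you may prefer to walk the elements and index each value as a separate field.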