Wednesday, August 06, 2008

How to Index PDF Documents with Lucene

Lucene has no built-in support for indexing PDF documents, so the text must be extracted from the document before indexing. One tool for this purpose is PDFBox, an open source project under the BSD license. Although there are many other PDF tools, in my experience PDFBox fits well with Lucene. The only extra step is extracting the text from the document. The following code snippet shows how to do it.
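import java.io.File;
import java.io.FileInputStream;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Parse the PDF file into PDFBox's low-level COS document
FileInputStream fi = new FileInputStream(new File("sample.pdf"));
PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();
// Wrap it in a PDDocument and strip out the plain text
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));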

Now this extracted text can be used to build the Lucene index.

Likewise, there are various tools to extract text from Word documents and other formats. Therefore any kind of document can be added to the Lucene index, as long as its text can be extracted with an external tool. Even when indexing XML, you should extract the text from the XML document first if you only need to index the text values.
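
To illustrate the XML case, here is a minimal sketch that pulls only the text values out of an XML document using the JDK's built-in DOM parser. The class name XmlTextExtractor and the sample XML are made up for this example, and joining the text values with single spaces is just one reasonable choice. The returned string would then be fed to Lucene in the same way as the text extracted from a PDF.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not part of Lucene or PDFBox.
class XmlTextExtractor {

    /** Returns the text values of the XML, without tags or attributes. */
    static String extractText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations so untrusted input cannot load external entities
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString();
    }

    // Walks the DOM tree depth-first and appends every non-blank text node
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String value = node.getNodeValue().trim();
            if (!value.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' '); // keep values from adjacent elements apart
                }
                out.append(value);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Sample</title>\n  <body>Hello PDF</body></doc>";
        System.out.println(extractText(xml));
    }
}
```

Attribute values are deliberately ignored here. If they matter for your search, they would have to be collected separately.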

22 comments:

Chathuranga Chandrasekara said...

Just adding more information.

Microsoft Office files can be parsed with the Apache POI library. I think it is not compatible with the Office 2007 format yet.

http://poi.apache.org/

Nilindra said...

Nice workaround for https://issues.apache.org/jira/browse/PDFBOX-365

An important point that you have missed out!

cd.close(); [COSDocument object needs to be closed]

Nilindra [Another Sri Lankan]

Pieter said...

You could also use Tika (http://lucene.apache.org/tika/). This toolkit is a subproject of Lucene and was made exactly for this: it extracts text and metadata from various documents (PDF, Word, ...) to index them with Lucene. A nice feature is the AutoDetectParser, which will automagically detect the format of the document (PDF, Word, ...) and parse the text.

It is actually a sort of wrapper around POI and PDFBox, to have a common interface.

sparkettin said...

This is great work! I have been trying to index PDFs for the last 3 weeks. All the sample code I found on the internet was very complex and did not work in my case. This is very simple and works perfectly. Thanks for sharing.

Stephane said...

Sorry but I just don't get it. Which files do I need to integrate PDFbox with Lucene? Where do I need to put these files? What do I need to modify? Where do I put the code you provided at the top of this document?

kalani Ruwanpathirana said...

@Stephane: Actually you don't need to worry about any integration. You can have a separate method or class to convert the PDF into text. Then just use that text as the input for your Lucene indexing method.

Stephane said...

I'm gonna need a step-by-step tutorial for this.

Ravikumar said...

getText is causing the following exception. Any idea?

System.NullReferenceException was caught
Message="Object reference not set to an instance of an object."
Source="PDFBox-0.7.3"
StackTrace:
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)

kalani Ruwanpathirana said...

Hi, it seems that this is caused by PDFBox itself. Could you check this with a stable PDFBox version?

zahid said...

Nice information, thanks for sharing.

BigT said...

You can also use Lucene based SearchBlox for indexing PDF documents.

http://www.searchblox.com/

paravena74 said...

Hi Kalani, as I see you know a lot :-) here is my question: do you know how to print a specific page of a PDF file? I tried splitting a PDF using PDFBox and then printing each part, but it doesn't work. Now I'm trying with the PrinterJob object and the PageRanges attribute, but I have had no luck. Anyway, if you have any idea, I will be very thankful.

Have a nice day
Saludos
Pablo

kalani Ruwanpathirana said...

Hi Praveena, I haven't worked on PDF printing but to access a certain page...could you have a look at PDDocument class?

info said...

Thanks for the tips on indexing PDF docs, kalani - much appreciated

soliddirk said...

I also get an "Object reference not set to an instance of an object" exception from the getText method. Are there any solutions or workarounds?

Regards Dirk

soliddirk said...

Does anyone know where to get the pdfbox 1.5.0 dlls for .net? Seems to fix the problem

Regards Dirk

Programmer said...

How can we use this code as part of Lucene in java and not put it on the client who wants to index text?

kalani Ruwanpathirana said...

@Programmer, this code is completely separate from the Lucene code. It is used only to extract text from the PDF. Building the Lucene index from that text is a separate step.

preethi said...

Hi Kalani,

Can you please provide Lucene code for Pdf version 5.2? It's very urgent, please do the needful.

kalani Ruwanpathirana said...

Hi Preethi, I didn't get what you meant by "Lucene code for Pdf version 5.2". If you can extract the text from the PDF somehow, the Lucene code stays the same.

piyas de said...

This is another good way of extracting text. So we are moving further along the path of text mining. I will try to use this idea in my application areas.
