kalani's Tech blog: How to Index PDF Documents with Lucene

Wednesday, August 06, 2008

How to Index PDF Documents with Lucene

Lucene has no built-in support for indexing PDF documents, so the text must be extracted from the document before indexing. One tool that can be used for this purpose is PDFBox, an open source project under the BSD license. Although there are many other PDF tools, in my experience PDFBox fits well with Lucene. The only extra step is extracting the text from the document. The following code snippet shows how to do it.

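// Classes come from PDFBox (org.pdfbox.* in 0.7.x, org.apache.pdfbox.* in 1.x).
FileInputStream fi = new FileInputStream(new File("sample.pdf"));
COSDocument cd = null;
String text;
try {
    PDFParser parser = new PDFParser(fi);
    parser.parse();                          // parse the PDF from the stream
    cd = parser.getDocument();
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(new PDDocument(cd));
} finally {
    if (cd != null) cd.close();              // the COSDocument must be closed (see the comments below)
    fi.close();
}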

Now this extracted text can be used to build the Lucene index.
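
As a rough sketch of how the two steps can be kept apart, extraction and indexing can sit behind two small interfaces. The names TextExtractor, TextIndexer and IndexingPipeline are hypothetical and are not part of PDFBox or Lucene; in a real application the extractor would wrap the PDFBox snippet above and the indexer would add a document through Lucene's IndexWriter.

```java
import java.util.List;

// Hypothetical interface: turns one source file into plain text
// (for PDFs, an implementation would call PDFBox).
interface TextExtractor {
    String extract(String path);
}

// Hypothetical interface: receives the extracted text for indexing
// (an implementation would hand it to Lucene).
interface TextIndexer {
    void index(String path, String text);
}

// Wires the two steps together without either one knowing about the other.
class IndexingPipeline {
    private final TextExtractor extractor;
    private final TextIndexer indexer;

    IndexingPipeline(TextExtractor extractor, TextIndexer indexer) {
        this.extractor = extractor;
        this.indexer = indexer;
    }

    // Returns the number of files that produced text and were passed to the indexer.
    int indexAll(List<String> paths) {
        int indexed = 0;
        for (String path : paths) {
            String text = extractor.extract(path);
            if (text == null || text.trim().isEmpty()) {
                continue; // nothing extractable, so nothing to index
            }
            indexer.index(path, text);
            indexed++;
        }
        return indexed;
    }
}
```

With this split, supporting another document format only means supplying another TextExtractor; the indexing side stays the same.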

Likewise, there are various tools for extracting text from Word documents and other formats. Any kind of document can therefore be added to a Lucene index, as long as its text can be extracted with an external tool. Even when indexing XML, you should extract the text from the XML document first if you only need to index the text values.
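
For the XML case, a minimal sketch using only the XML parser that ships with the JDK could look like this. The class name XmlTextExtractor is made up for illustration, and the sketch assumes the whole document fits in memory and that every text node should be indexed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: returns only the text values of an XML document.
class XmlTextExtractor {
    static String extractText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getTextContent() concatenates all descendant text nodes;
            // tag names and attribute values are left out.
            String raw = doc.getDocumentElement().getTextContent();
            // Collapse the whitespace left over from indentation between elements.
            return raw.trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

The returned string can then be handed to Lucene in the same way as the text extracted from a PDF.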

31 comments:

Chathuranga Chandrasekara 10:37 PM
Just adding more information.

Microsoft Office files can be parsed with the Apache POI library. I think it is not compatible with the Office 2007 format yet.

http://poi.apache.org/

nilindra 5:43 PM
Nice workaround for https://issues.apache.org/jira/browse/PDFBOX-365

There is one important point you missed, though:

cd.close(); [the COSDocument object needs to be closed]

Nilindra [Another Sri Lankan]

Pieter 11:03 PM
You could also use Tika (http://lucene.apache.org/tika/). This toolkit is a subproject of Lucene and was made exactly for this: it extracts text and metadata from various document formats (PDF, Word, ...) so that they can be indexed with Lucene. A nice feature is the AutoDetectParser, which automatically detects the format of the document and parses the text.

It is actually a sort of wrapper around POI and PDFBox that provides a common interface.

Unknown 6:50 PM
This is great work! I have been trying to index PDFs for the last 3 weeks. All the sample code I found on the internet was very complex and did not work in my case. This is very simple and works perfectly. Thanks for sharing.

Stephane 10:25 PM
Sorry, but I just don't get it. Which files do I need to integrate PDFBox with Lucene? Where do I need to put these files? What do I need to modify? Where do I put the code you provided at the top of this page?

kalani Ruwanpathirana 1:30 AM
@Stephane: Actually, you don't need to worry about any integration. You can have a separate method or class that converts the PDF into text. Then just use that text as the input to your Lucene indexing method.

Stephane 8:47 AM
I'm gonna need a step-by-step tutorial for this.

Unknown 6:11 AM
getText is causing the following exception. Any idea?

System.NullReferenceException was caught
Message="Object reference not set to an instance of an object."
Source="PDFBox-0.7.3"
StackTrace:
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)

kalani Ruwanpathirana 10:15 PM
Hi, it seems that this is caused by PDFBox itself. Could you check this with a stable PDFBox version?

Unknown 6:19 PM
Nice information, thanks for sharing.

Unknown 8:54 AM
You can also use the Lucene-based SearchBlox for indexing PDF documents.

http://www.searchblox.com/

Unknown 8:13 AM
Hi Kalani, as I see you know a lot :-) here is my question: do you know how to print a specific page of a PDF file? I tried splitting the PDF using PDFBox and then printing each part, but it doesn't work. Now I'm trying the PrinterJob object and the PageRanges attribute, but with no luck. Anyway, if you have any idea, I will be very thankful.

Have a nice day
Saludos
Pablo

kalani Ruwanpathirana 10:11 PM
Hi Praveena, I haven't worked on PDF printing, but to access a certain page, could you have a look at the PDDocument class?

Unknown 10:15 PM
Thanks for the tips on indexing PDF docs, kalani - much appreciated

soliddirk 2:56 PM
I also get an "Object reference not set to an instance of an object" exception from the getText method. Are there any solutions or workarounds?

Regards Dirk

soliddirk 5:37 PM
Does anyone know where to get the PDFBox 1.5.0 DLLs for .NET? That version seems to fix the problem.

Regards Dirk

Programmer 1:32 PM
How can we use this code as part of Lucene in Java, rather than putting it on the client that wants to index text?

kalani Ruwanpathirana 1:02 AM
@Programmer, this code is completely separate from the Lucene code. It is used just to extract text from the PDF. Building the Lucene index from that text is a separate step.

Anonymous 6:55 PM
Hi Kalani,

Can you please provide Lucene code for PDF version 5.2? It's very urgent, please do the needful.

kalani Ruwanpathirana 7:30 PM
Hi Preethi, I didn't get what you meant by "Lucene code for Pdf version 5.2". If you can extract the text from the PDF somehow, then the Lucene code stays the same.

Piyas De 10:56 AM
This is another good way of extracting text, so we are making progress on the path of text mining. I will try to use this idea in my application areas.

Unknown 12:08 PM
How can I convert the Lucene index into XML format?

Jack 1:19 PM
I came across a local text search tool by chance: AnyTXT Searcher. It is simple and practical. If you know Everything, AnyTXT is like its brother.
You can try it; if you like Everything, you will like it too. Highly recommended. https://sourceforge.net/p/anytxt/