kalani's Tech blog: How to Index PDF Documents with Lucene

Wednesday, August 06, 2008

How to Index PDF Documents with Lucene

Lucene has no built-in support for indexing PDF documents, so the text must be extracted from the document before indexing. One tool that can be used for this is PDFBox, an open source project under the BSD license. Although there are many other PDF tools, I have found that PDFBox fits well with Lucene. The only extra step is extracting the text from the document. The following code snippet shows how to do it.

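// Parse the PDF into PDFBox's low-level COS document model.
FileInputStream fi = new FileInputStream(new File("sample.pdf"));
PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();
String text;
try {
    // Extract the plain text of all pages.
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(new PDDocument(cd));
} finally {
    cd.close(); // the COSDocument must be closed once the text has been read
    fi.close();
}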

Now this extracted text can be used to build the Lucene index.

Likewise, there are various tools to extract text from Word documents and other formats. Therefore, any kind of document can be added to the Lucene index as long as its text can be extracted with an external tool. Even when indexing XML, you should extract the text from the XML document first if you only need to index the text values.
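
For the XML case, a minimal sketch of the extraction step might look like the following. It uses only the DOM parser that ships with the JDK. The class name XmlTextExtractor and the method extractText are made up for this illustration, and the sample XML is invented. The sketch assumes you want every text value in the document, joined by single spaces, with element names and attributes left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Lucene or PDFBox.
public class XmlTextExtractor {

    /** Returns the text values of an XML document, separated by single spaces. */
    public static String extractText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations so untrusted input cannot load external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString();
    }

    // Walks the tree depth-first and appends every non-empty text node.
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // Trim the edges and collapse inner whitespace left over from indentation.
            String value = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!value.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(value);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<note><to>Alice</to><body>Index me</body></note>";
        System.out.println(extractText(xml)); // prints "Alice Index me"
    }
}
```

The returned string plays the same role as the text extracted from the PDF above: it is what you would hand to your Lucene indexing code.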

31 comments:

Chathuranga Chandrasekara said...: Just adding some more information.

Microsoft Office files can be parsed with the Apache POI library. I think it is not compatible with the Office 2007 format yet.

http://poi.apache.org/; 10:37 PM
nilindra said...: Nice work around for https://issues.apache.org/jira/browse/PDFBOX-365

An important point that you have missed out!

cd.close(); [COSDocument object needs to be closed]

Nilindra [Another Sri Lankan]; 5:43 PM
Pieter said...: You could also use Tika (http://lucene.apache.org/tika/). This toolkit is a subproject of Lucene and was made exactly for this: it extracts text and metadata from various documents (PDF, Word, ...) to index them with Lucene. A nice feature is the AutoDetectParser, which will automagically detect the format of the document (PDF, Word, ...) and parse the text.

It is actually a sort of wrapper around POI and PDFBox, to have a common interface.; 11:03 PM
Unknown said...: This is great work! I have been trying to index PDFs for the last 3 weeks. All the sample code I found on the internet was very complex and did not work in my case. This is very simple and works perfectly. Thanks for sharing.; 6:50 PM
Stephane said...: Sorry but I just don't get it. Which files do I need to integrate PDFbox with Lucene? Where do I need to put these files? What do I need to modify? Where do I put the code you provided at the top of this document?; 10:25 PM
kalani Ruwanpathirana said...: @Stephane: Actually you don't need to worry about any integration. You can have a separate method or class that converts the PDF into text. Then just use that text as the input for your Lucene indexing method.; 1:30 AM
Stephane said...: I'm gonna need a step-by-step tutorial for this.; 8:47 AM
Unknown said...: getText is causing the following exception. Any idea?

System.NullReferenceException was caught
Message="Object reference not set to an instance of an object."
Source="PDFBox-0.7.3"
StackTrace:
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc); 6:11 AM
kalani Ruwanpathirana said...: Hi, it seems that this is caused by PDFBox itself. Could you check this with a stable PDFBox version?; 10:15 PM
Unknown said...: nice information thanks for sharing .; 6:19 PM
Unknown said...: You can also use Lucene based SearchBlox for indexing PDF documents.

http://www.searchblox.com/; 8:54 AM
Unknown said...: Hi Kalani, as I see you know a lot :-) here is my question: do you know how to print a specific page of a PDF file? I tried splitting the PDF with PDFBox and then printing each part, but it doesn't work. Now I'm trying the PrinterJob object and the PageRanges attribute, but I have had no luck. If you have any idea, I will be very thankful.

Have a nice day
Saludos
Pablo; 8:13 AM
kalani Ruwanpathirana said...: Hi Praveena, I haven't worked on PDF printing but to access a certain page...could you have a look at PDDocument class?; 10:11 PM
Unknown said...: Thanks for the tips on indexing PDF docs, kalani - much appreciated; 10:15 PM
soliddirk said...: I also get an "Object reference not set to an instance of an object" exception from the getText method. Are there any solutions or workarounds?

Regards Dirk; 2:56 PM
soliddirk said...: Does anyone know where to get the pdfbox 1.5.0 dlls for .net? Seems to fix the problem

Regards Dirk; 5:37 PM
Programmer said...: How can we use this code as part of Lucene in java and not put it on the client who wants to index text?; 1:32 PM
kalani Ruwanpathirana said...: @Programmer, this code is completely separate from the Lucene code. It is used only to extract text from the PDF. Building the Lucene index from that text is a separate step.; 1:02 AM
Anonymous said...: Hi Kalani,

Can you please provide Lucene code for Pdf version 5.2? It is very urgent, please do the needful.; 6:55 PM
kalani Ruwanpathirana said...: Hi Preethi, I didn't get what you meant by "Lucene code for Pdf version 5.2". If you can extract the text from pdf somehow, then Lucene code stays the same.; 7:30 PM
Piyas De said...: This is another good way of extracting text. So we are making progress on the path of text mining. I will try to use this idea in my application areas.; 10:56 AM
Unknown said...: How can I convert the Lucene index into XML format?; 12:08 PM
Jack said...: I found a local text search tool occasionally, it is AnyTXT Searcher. Simple and practical. You should know Everything, AnyTXT is like its brother.
You can try it, and you will like it if you like everything, highly recommended. https://sourceforge.net/p/anytxt/; 1:19 PM