Wednesday, August 06, 2008

How to Index PDF Documents with Lucene

Lucene has no built-in support for indexing PDF documents, so the text must be extracted from a document before it can be indexed. One tool that can be used for this purpose is PDFBox, an open source project released under a BSD license. There are many other PDF tools, but in my experience PDFBox fits well with Lucene. The only extra step is extracting the text from the document. The following code snippet shows how to do it.
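import java.io.FileInputStream;
import org.pdfbox.cos.COSDocument;      // PDFBox 0.7.x packages; later Apache releases use org.apache.pdfbox.*
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

FileInputStream fi = new FileInputStream("sample.pdf");
PDFParser parser = new PDFParser(fi);
parser.parse();                         // parse the PDF into its low-level object model
COSDocument cd = parser.getDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));   // plain text of all pages
cd.close();                             // close the COSDocument to release its resources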

Now this extracted text can be used to build the Lucene index.

Likewise, there are various tools for extracting text from Word documents and other formats. Any kind of document can therefore be added to a Lucene index, as long as its text can be extracted with an external tool. Even when indexing XML, you should first extract the text from the XML document if you only need to index the text values.
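As a rough illustration of the XML case, here is a minimal sketch that pulls only the text values out of an XML document using the DOM parser that ships with the Java standard library. The class name XmlTextExtractor and the method extractText are made up for this example and are not part of Lucene or PDFBox. Tags and attributes are dropped, and the text values are joined with single spaces.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; not part of Lucene or PDFBox.
class XmlTextExtractor {

    // Returns the text values of the given XML, separated by single spaces.
    static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Walks the DOM tree depth-first and appends every non-empty text or CDATA value.
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String value = node.getNodeValue().trim();
            if (!value.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(value);
            }
        }
        // Attributes are not child nodes, so they are skipped automatically.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }
}
```

For an input such as `<doc id="1"><title>Sample report</title><body>Indexing PDF text</body></doc>`, extractText returns "Sample report Indexing PDF text". That string can then be handed to your Lucene indexing code in the same way as the text extracted from a PDF.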

31 comments:

  1. Just adding more information:

    Microsoft Office files can be parsed with the Apache POI library. I think it is not compatible with the Office 2007 format yet.

    http://poi.apache.org/

  2. Nice workaround for https://issues.apache.org/jira/browse/PDFBOX-365

    An important point that you have missed out!

    cd.close(); [COSDocument object needs to be closed]

    Nilindra [Another Sri Lankan]

  3. You could also use Tika (http://lucene.apache.org/tika/). This toolkit is a subproject of Lucene and was made exactly for this: it extracts text and metadata from various documents (PDF, Word, ...) to index them with Lucene. A nice feature is the AutoDetectParser, which will automagically detect the format of the document (PDF, Word, ...) and parse the text.

    It is actually a sort of wrapper around POI and PDFBox, to have a common interface.

  4. This is great work! I have been trying to index PDFs for the last 3 weeks. All the sample code I found on the internet was very complex and did not work in my case. This is very simple and works perfectly. Thanks for sharing.

  5. Sorry, but I just don't get it. Which files do I need to integrate PDFBox with Lucene? Where do I need to put these files? What do I need to modify? Where do I put the code you provided at the top of this document?

  6. @Stephane: Actually, you don't need to worry about any integration. You can have a separate method or class that converts the PDF into text. Then just use that text as the input to your Lucene indexing method.

  7. I'm gonna need a step-by-step tutorial for this.

  8. getText is causing the following exception. Any idea?

    System.NullReferenceException was caught
    Message="Object reference not set to an instance of an object."
    Source="PDFBox-0.7.3"
    StackTrace:
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
    at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
    at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)

  9. Hi, it seems that this is caused by PDFBox itself. Could you check this with a stable PDFBox version?

  10. Nice information, thanks for sharing.

  11. You can also use the Lucene-based SearchBlox for indexing PDF documents.

    http://www.searchblox.com/

  12. Hi Kalani, as I see you know a lot :-) here is my question: do you know how to print a specific page of a PDF file? I tried splitting a PDF using PDFBox and then printing each part, but it doesn't work. Now I'm trying the PrinterJob object and the PageRanges attribute, but I have had no luck. Anyway, if you have any idea, I will be very thankful.

    Have a nice day
    Saludos
    Pablo

  13. Hi Praveena, I haven't worked on PDF printing, but to access a certain page, could you have a look at the PDDocument class?

  14. Thanks for the tips on indexing PDF docs, Kalani. Much appreciated.

  15. I also get an "Object reference not set to an instance of an object" exception from the getText method. Are there any solutions or workarounds?

    Regards Dirk

  16. Does anyone know where to get the PDFBox 1.5.0 DLLs for .NET? That version seems to fix the problem.

    Regards Dirk

  17. How can we use this code as part of Lucene in Java, and not put it on the client who wants to index text?

  18. @Programmer, this code is completely separate from the Lucene code. It is used only to extract text from the PDF. Building the Lucene index from that text is a separate step.

  19. Anonymous 6:55 PM

    Hi Kalani,

    Can you please provide Lucene code for PDF version 5.2? It's very urgent, please do the needful.

  20. Hi Preethi, I didn't get what you meant by "Lucene code for Pdf version 5.2". If you can extract the text from the PDF somehow, the Lucene code stays the same.

  21. Anonymous 9:50 AM

    This comment has been removed by the author.

  22. This is another good way of extracting text. So we are progressing along the path of text mining. I will try to use this idea in my application areas.

  23. How can the Lucene index be converted into XML format?
