Tuesday, August 26, 2008

How to Index Microsoft Format Documents (Word, Excel, Powerpoint) - Lucene

My previous post showed how to index PDF documents with Lucene, so I thought it would be worth posting how to index Microsoft format files too, since those file types are very commonly used. Lucene indexes plain text, so we need to extract the text from the document as a String before giving it to Lucene for indexing. To parse the document we can use Apache POI, which provides a Java API for Microsoft format files.

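The ways to extract text from Word, Excel and PowerPoint documents are shown below.
// Imports needed by the snippets below
import java.io.FileInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hslf.extractor.PowerPointExtractor;

// Word text extraction
POIFSFileSystem wordFs = new POIFSFileSystem(new FileInputStream("filename.doc"));
WordExtractor wordExtractor = new WordExtractor(wordFs);
String wordText = wordExtractor.getText();

// Excel text extraction
POIFSFileSystem excelFs = new POIFSFileSystem(new FileInputStream("filename.xls"));
ExcelExtractor excelExtractor = new ExcelExtractor(excelFs);
String excelText = excelExtractor.getText();

// PowerPoint text extraction
POIFSFileSystem powerFs = new POIFSFileSystem(new FileInputStream("filename.ppt"));
PowerPointExtractor powerExtractor = new PowerPointExtractor(powerFs);
String powerText = powerExtractor.getText();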

However, POI does not yet support the Office 2007 file formats (.docx, .xlsx and .pptx), though it should in the future.
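
When indexing a folder of mixed files, you first have to decide which of the extractors above applies to each file. Here is a minimal sketch of one way to do that, assuming the file extension is a reliable indicator of the format. The names OfficeFormats, Kind and kindOf are made up for illustration and are not part of POI or Lucene.

```java
import java.util.Locale;

// Hypothetical helper (not part of POI or Lucene): picks which kind of
// extractor a file would need, based only on its extension.
final class OfficeFormats {

    enum Kind { WORD, EXCEL, POWERPOINT, UNSUPPORTED }

    static Kind kindOf(String fileName) {
        // Compare case-insensitively so "REPORT.DOC" is treated like "report.doc".
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".doc")) return Kind.WORD;        // would use WordExtractor
        if (lower.endsWith(".xls")) return Kind.EXCEL;       // would use ExcelExtractor
        if (lower.endsWith(".ppt")) return Kind.POWERPOINT;  // would use PowerPointExtractor
        // Everything else, including .docx, .xlsx and .pptx, is skipped.
        return Kind.UNSUPPORTED;
    }

    private OfficeFormats() { }
}
```

A caller could switch on the returned Kind and run the matching extraction snippet, skipping files reported as UNSUPPORTED instead of letting the extractor fail on them.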

22 comments:

  1. Awesome post!!!

    I like it. It saved a day of work for me.
    Thanks Again.

  2. Nice tutorial, thanks!

    Can be simplified using POI 3.7 (which also includes the Office 2007+ XML formats):

    import java.io.File;
    import org.apache.poi.extractor.ExtractorFactory;

    final String text = ExtractorFactory.createExtractor(new File("myfile.docx")).getText();

  3. Hi kalani,
    Can you post about the main differences between Lucene and Solr, and how to build Solr on top of Lucene in Java? I would be really thankful for your reply.

  4. Vishnu,

    I haven't worked with Solr, but these articles may help you.

    http://www.lucenetutorial.com/lucene-vs-solr.html

    http://lucene.apache.org/solr/tutorial.html#Getting+Started

  5. Hi Kalani,
    I am trying to index an Excel sheet using Lucene. The indexing works properly, but the search always returns 0 hits.
    For parsing Excel I am using the jlx library.
    Lucene 3.1
    I have posted the question on Stack Overflow.
    Here is the link; please let me know what's going wrong in the code.
    http://stackoverflow.com/questions/7621073/indexing-and-searching-a-ms-excel-using-lucene-3-1

  7. Hi BCP,

    Can you check the index table in the DB to see whether it was created correctly? (Make sure there is a title column, etc.) Anyway, the indexing seems correct to me, but it would be good if you could modify the searcher as below.

    Searcher searcher = new IndexSearcher(IndexReader.open("d:\\index"));
    Query query = new QueryParser("title",analyzer).parse("the");
    Hits hits = searcher.search(query);

  8. Hi Kalani,
    The index is created properly, as I can see in the file (_0.fdt).
    Lucene 3.1 does not support "Hits". It has been removed.
    So is there any workaround?

    Thanks,
    -Prashanth

  9. Oh! that might be, I was using Lucene 2.3.2. So that is outdated now. The only solution I'm seeing at this time is going through the new docs to see how to search the index. Sorry for that.

  10. Hi Kalinni,

    I switched to Lucene 2.3 and am still getting the same problem.
    CODE is here
    The page asks for a password; please use the password below.
    password:lucene

  11. Well, it seemed correct to me. However, did you try indexing a simple string (instead of an Excel sheet) and searching for a keyword in it?

  12. Yeah, I have tried with a simple string and with txt files, and they work fine.
    It's only with MS Excel that I am facing a problem.

  13. Now, that makes sense. Would you be able to use Apache POI (as explained in this post), instead of the library you are using now, to extract the text from the Excel sheet?

  14. I went through Apache POI, but I actually want to access the Excel sheet column by column, and I didn't find any method for that.
    If you don't mind, can you send me your mail ID so that I can send you the Excel sheet?
    My mail ID is: bcprashanth@yahoo.com

  15. Hi kalani,
    When I take each column from the Excel sheet, write it into a text file, and index those text files, it works properly. But when I try to index the Excel sheet directly after parsing it, it doesn't work.
    I observed the same behavior in both versions 2.3 and 3.1.
    Any help or direction would be very helpful.
    I didn't get any answer on Stack Overflow, the Manning Lucene forum, or 2 more Lucene forums :(

    Thanks,
    -Prashanth

  16. Hi Kalani,

    I switched to the DB and the indexing and search are working great now. By the way, your post on indexing using a DB was very helpful.
    Thanks.

  17. Hi BCP, Sorry for the delay. It's great that you could get it working. However I am wondering why it didn't work for file index while working for DB index. btw it's good you got through it.

  18. Thanks Kalani,
    btw, i am still trying it,will let you know if i find a solution.

  19. Hello,

    When I tried to implement this code, I always got the following error message:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
    at DB_connect.dissertation_araalz.ParseWodDocFile.main(ParseWodDocFile.java:29)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 3 seconds)

    Could you please help me with this problem?

  20. This is a smart blog. I mean it. You have so much knowledge about this issue, and so much passion. You also know how to make people rally behind it, obviously from the responses. Youve got a design here that not too flashy, but makes a statement as big as what your saying. Great job, indeed.

  21. Anonymous 9:50 PM

    This seems to be well maintained blog for the reader to get knowledge about the Ms office formats.

  22. We know that Microsoft office is very beneficial and useful for office work and there are a lot of people who use it for help paper and other purpose. We can learn it from computer institutes and this post is also very beneficial for the users of Microsoft office.
