kalani's Tech blog: How to Index Microsoft Format Documents (Word, Excel, Powerpoint)

Tuesday, August 26, 2008

How to Index Microsoft Format Documents (Word, Excel, Powerpoint) - Lucene

My previous post showed how to index PDF documents with Lucene, so I thought it would be worthwhile to show how to index Microsoft format files too, since those file types are very commonly used. Lucene indexes plain text, so we need to extract the text from a document as a String before handing it to Lucene for indexing. To parse the documents we can use Apache POI, which provides a Java API for Microsoft format files.

The ways to extract text from Word, Excel and Powerpoint documents are shown below.

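import java.io.FileInputStream;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

//Word text extraction
POIFSFileSystem wordFs = new POIFSFileSystem(new FileInputStream("filename.doc"));
WordExtractor wordExtractor = new WordExtractor(wordFs);
String wordText = wordExtractor.getText();

//Excel text extraction
POIFSFileSystem excelFs = new POIFSFileSystem(new FileInputStream("filename.xls"));
ExcelExtractor excelExtractor = new ExcelExtractor(excelFs);
String excelText = excelExtractor.getText();

//Powerpoint text extraction
POIFSFileSystem powerFs = new POIFSFileSystem(new FileInputStream("filename.ppt"));
PowerPointExtractor powerExtractor = new PowerPointExtractor(powerFs);
String powerText = powerExtractor.getText();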

However, POI is not yet compatible with the Office 2007 file formats such as .docx, .xlsx and .pptx, though support is expected in the future.
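Because the two families of formats need different handling, it can help to check which kind of file you have before handing it to an extractor. The sketch below is only an illustration: the class OfficeFormatSniffer and its Kind enum are hypothetical names, not part of POI or Lucene. It relies on two well-known file signatures. The older binary formats (.doc, .xls, .ppt) are OLE2 compound files that start with the bytes D0 CF 11 E0 A1 B1 1A E1, while the Office 2007 formats are ZIP archives that start with "PK" followed by 03 04.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of POI): classifies a file by its leading bytes. */
final class OfficeFormatSniffer {

    enum Kind { OLE2, ZIP_BASED, UNKNOWN }

    // OLE2 compound-file signature shared by .doc, .xls and .ppt.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local-file-header signature ("PK" 03 04) that opens .docx, .xlsx and .pptx.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies the given leading bytes of a file. */
    static Kind classify(byte[] leading) {
        if (startsWith(leading, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(leading, ZIP_MAGIC)) return Kind.ZIP_BASED;
        return Kind.UNKNOWN;
    }

    /** Reads up to 8 bytes from the stream and classifies them. */
    static Kind sniff(InputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) break;
            read += n;
        }
        return classify(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare the signed Java byte as an unsigned value.
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A file classified as OLE2 can go to the extractors shown above. A ZIP_BASED file is probably an Office 2007 document, although any ZIP archive will match, and UNKNOWN files are best skipped. Note that sniff consumes the bytes it reads, so open a fresh stream for the actual extraction.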

22 comments:

Amit Patel 1:24 AM
Awesome post!!!

I like it. It saved a day of work for me.
Thanks Again.

runat 5:09 PM
Nice tutorial, thanks!

Can be simplified using poi 3.7 (also includes the Office 2003+ xml formats)

import org.apache.poi.extractor.ExtractorFactory;

final String text = ExtractorFactory.createExtractor(new File("myfile.docx")).getText();

Vishnu Chilamakuru 6:42 PM
Hi kalani,
Can you post about the main differences between Lucene and Solr, and how to build Solr on top of Lucene in Java? I will be really thankful for your reply.

kalani Ruwanpathirana 8:09 PM
Vishnu,

I haven't worked with Solr, but these articles may help you.

http://www.lucenetutorial.com/lucene-vs-solr.html

http://lucene.apache.org/solr/tutorial.html#Getting+Started

BCP 10:01 PM
Hi Kalani,
I am trying to index an Excel sheet using Lucene. The indexing is working properly, but search always returns 0 hits.
For parsing Excel I am using the jlx library.
Lucene 3.1
I have posted the question on Stack Overflow.
Here is the link, please let me know what's going wrong in the code.
http://stackoverflow.com/questions/7621073/indexing-and-searching-a-ms-excel-using-lucene-3-1

kalani Ruwanpathirana 12:19 AM
Hi BCP,

Can you check the index table in the DB to see whether it was created properly (make sure there is a title column, etc.)? Anyway, the indexing seems correct to me, but it would be good to modify the searcher as below.

Searcher searcher = new IndexSearcher(IndexReader.open("d:\\index"));
Query query = new QueryParser("title",analyzer).parse("the");
Hits hits = searcher.search(query);

BCP 12:30 AM
Hi Kalani,
The index is created properly, as I can see in the file (_0.fdt).
Lucene 3.1 does not support "Hits". It has been removed.
So is there any workaround?

Thanks,
-Prashanth

kalani Ruwanpathirana 12:51 AM
Oh! That might be it. I was using Lucene 2.3.2, so that is outdated now. The only solution I can see at this time is going through the new docs to see how to search the index. Sorry about that.

BCP 1:03 AM
Hi Kalani,

I switched to Lucene 2.3, still getting the same problem.
CODE is here
The page asks for a password, please use the one below.
password:lucene

kalani Ruwanpathirana 2:53 AM
Well, it seemed correct to me. However, did you try indexing a simple string (instead of an Excel sheet) and searching for a keyword in it?

BCP 9:34 AM
Yeah, I have tried with a simple string and txt files; they work fine.
It's only MS Excel that I am facing a problem with.

kalani Ruwanpathirana 9:48 AM
Now, that makes sense. Would you be able to use Apache POI (as explained in this post) instead of the library you are using now to extract the Excel sheet?

BCP 9:56 AM
I went through Apache POI. I actually want to access the Excel sheet column by column, and I didn't find any method to do that.
If you don't mind, can you send me your mail ID so that I can send you the Excel sheet?
My mail ID is: bcprashanth@yahoo.com

BCP 12:17 PM
Hi kalani,
When I take each column from the Excel sheet, write them into a text file and index those text files, it works properly. But when I try to index the Excel sheet after parsing it, it doesn't work.
I observed the same behavior in both the 2.3 and 3.1 versions.
Any help or direction would be very helpful.
I didn't get any answer on Stack Overflow, the Manning Lucene forum or 2 more Lucene forums :(

Thanks,
-Prashanth

BCP 8:11 PM
Hi Kalani,

I switched to DB and the indexing and search are working great now. Btw, your post on indexing using DB was very helpful.
Thanks.

kalani Ruwanpathirana 1:20 AM
Hi BCP, sorry for the delay. It's great that you could get it working. However, I am wondering why it didn't work for the file index while it works for the DB index. Anyway, it's good that you got through it.

BCP 12:34 PM
Thanks Kalani,
Btw, I am still trying it; I will let you know if I find a solution.

Unknown 5:12 AM
Hello,

When I was trying to implement this code, the following error message always came up:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
at DB_connect.dissertation_araalz.ParseWodDocFile.main(ParseWodDocFile.java:29)
Java Result: 1
BUILD SUCCESSFUL (total time: 3 seconds)

Could you please help me with this problem?

Unknown 10:57 AM
This is a smart blog. I mean it. You have so much knowledge about this issue, and so much passion. You also know how to make people rally behind it, obviously from the responses. You've got a design here that's not too flashy, but makes a statement as big as what you're saying. Great job, indeed.

Anonymous 9:50 PM
This seems to be a well-maintained blog for readers to learn about the MS Office formats.

Unknown 1:20 PM
We know that Microsoft office is very beneficial and useful for office work and there are a lot of people who use it for help paper and other purpose. We can learn it from computer institutes and this post is also very beneficial for the users of Microsoft office.