Tuesday, August 26, 2008

How to Index Microsoft Format Documents (Word, Excel, Powerpoint) - Lucene

My previous post showed how to index PDF documents with Lucene, so I thought it would be worth posting how to index Microsoft Office files too, since those file types are very commonly used. Lucene indexes plain text (a String or a Reader), so we need to extract the text from a document before handing it to Lucene for indexing. To parse the documents we can use Apache POI, which provides a Java API for Microsoft Office file formats.

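The ways to extract text from Word, Excel and PowerPoint documents are shown below.

import java.io.FileInputStream;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Note: the Word (hwpf) and PowerPoint (hslf) classes ship in the poi-scratchpad jar.

// Word text extraction
POIFSFileSystem wordFs = new POIFSFileSystem(new FileInputStream("filename.doc"));
WordExtractor wordExtractor = new WordExtractor(wordFs);
String wordText = wordExtractor.getText();

// Excel text extraction
POIFSFileSystem excelFs = new POIFSFileSystem(new FileInputStream("filename.xls"));
ExcelExtractor excelExtractor = new ExcelExtractor(excelFs);
String excelText = excelExtractor.getText();

// PowerPoint text extraction
POIFSFileSystem powerFs = new POIFSFileSystem(new FileInputStream("filename.ppt"));
PowerPointExtractor powerExtractor = new PowerPointExtractor(powerFs);
String powerText = powerExtractor.getText();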
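
When indexing a folder with mixed file types, the right extractor has to be chosen for each file. Here is a minimal sketch of one way to do that by file extension. The class ExtractorDispatchSketch and its methods are my own hypothetical names, not part of POI or Lucene, and the lambdas in main are placeholders standing in for the POI calls shown above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper (not a POI or Lucene class). Uses Map.of, so it assumes Java 9+.
class ExtractorDispatchSketch {

    // Lower-cased extension without the dot, or "" when the name has none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Runs the extractor registered for the file's extension, if there is one.
    static Optional<String> extract(Map<String, Function<String, String>> extractors,
                                    String fileName) {
        Function<String, String> extractor = extractors.get(extensionOf(fileName));
        return extractor == null ? Optional.empty() : Optional.of(extractor.apply(fileName));
    }

    public static void main(String[] args) {
        // Placeholder lambdas: a real indexer would wrap the matching POI extractor here.
        Map<String, Function<String, String>> extractors = Map.of(
                "doc", name -> "[Word text of " + name + "]",
                "xls", name -> "[Excel text of " + name + "]",
                "ppt", name -> "[PowerPoint text of " + name + "]");

        System.out.println(extract(extractors, "Report.DOC").orElse("(skipped)"));
        System.out.println(extract(extractors, "notes.txt").orElse("(skipped)"));
    }
}
```

Files with an unregistered extension come back as an empty Optional, so the indexer can skip them instead of failing halfway through a folder.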

However, POI does not yet support the Office 2007 file formats such as .docx, .xlsx and .pptx, though support is expected in a future release.
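
As background on the newer formats: a .docx file is a ZIP archive whose main body is stored in an entry named word/document.xml, so a rough text extraction is possible with the JDK alone. The sketch below is not how POI does it and is no substitute for a real extractor. It ignores headers, footers, footnotes, tabs and line breaks, and may repeat text for paragraphs nested inside other paragraphs (for example text boxes). DocxTextSketch, extractText and sampleDocx are hypothetical names of mine, and the code assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI or Lucene.
class DocxTextSketch {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text, one line per paragraph, or "" if word/document.xml is absent.
    static String extractText(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    // readAllBytes stops at the end of the current ZIP entry.
                    return textOf(zip.readAllBytes());
                }
            }
            return "";
        } catch (Exception ex) {
            throw new IllegalStateException("Could not read .docx content", ex);
        }
    }

    // Concatenates the <w:t> text runs of every <w:p> paragraph.
    private static String textOf(byte[] documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(documentXml));

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    // Builds a tiny in-memory archive holding only word/document.xml, for demonstration.
    static byte[] sampleDocx(String documentXml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>Lucene</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        System.out.print(extractText(new ByteArrayInputStream(sampleDocx(xml))));
    }
}
```

The returned String can be handed to Lucene in the same way as the text from the POI extractors.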

22 comments:

Amit Patel said...

Awesome post!!!

I like it. It saved a day of work for me.
Thanks Again.

runat said...

Nice tutorial, thanks!

This can be simplified using POI 3.7 (which also handles the Office 2007+ XML formats):

import java.io.File;
import org.apache.poi.extractor.ExtractorFactory;

final String text = ExtractorFactory.createExtractor(new File("myfile.docx")).getText();

Vishnu Chilamakuru said...

Hi kalani,
Could you post about the main differences between Lucene and Solr, and how to build Solr on top of Lucene in Java? I would be really thankful for your reply.

kalani Ruwanpathirana said...

Vishnu,

I haven't worked with Solr, but these articles may help you.

http://www.lucenetutorial.com/lucene-vs-solr.html

http://lucene.apache.org/solr/tutorial.html#Getting+Started

BCP said...

Hi Kalani,
I am trying to index an Excel sheet using Lucene. The indexing is working properly, but the search always returns 0 hits.
For parsing the Excel file I am using the jlx library.
Lucene 3.1
I have posted the question on Stack Overflow.
Here is the link; please let me know what is going wrong in the code.
http://stackoverflow.com/questions/7621073/indexing-and-searching-a-ms-excel-using-lucene-3-1

kalani Ruwanpathirana said...

Hi BCP,

Can you check the index table in the DB to see whether it was created correctly? (Make sure there is a title column, etc.) The indexing seems correct to me, but it would be good if you could modify the searcher as below:

Searcher searcher = new IndexSearcher(IndexReader.open("d:\\index"));
Query query = new QueryParser("title",analyzer).parse("the");
Hits hits = searcher.search(query);

BCP said...

Hi Kalani,
The index is created properly, as I can see in the file (_0.fdt).
Lucene 3.1 does not support "Hits". It has been removed.
So is there any workaround?

Thanks,
-Prashanth

kalani Ruwanpathirana said...

Oh! that might be, I was using Lucene 2.3.2. So that is outdated now. The only solution I'm seeing at this time is going through the new docs to see how to search the index. Sorry for that.

BCP said...

Hi Kalani,

I switched to Lucene 2.3 and am still getting the same problem.
CODE is here
The page asks for a password; please use the one below.
password:lucene

kalani Ruwanpathirana said...

Well, it seemed correct to me. However, did you try indexing a simple string (instead of an Excel sheet) and searching for a keyword in it?

BCP said...

Yeah, I have tried with simple strings and txt files, and they work fine.
It is only with MS Excel files that I am facing the problem.

kalani Ruwanpathirana said...

Now that makes sense. Would you be able to use Apache POI (as explained in this post) instead of the library you are using now to extract the text from the Excel sheet?

BCP said...

I went through Apache POI. I actually want to access the Excel sheet column by column, and I didn't find any method to do that.
If you don't mind, can you send me your mail ID so that I can send you the Excel sheet?
My mail ID is: bcprashanth@yahoo.com

BCP said...

Hi kalani,
When I take each column from the Excel sheet, write it to a text file and index those text files, it works properly. But when I try to index the Excel sheet directly after parsing it, it doesn't work.
I observed the same behavior in both versions 2.3 and 3.1.
Any help or direction would be very welcome.
I didn't get any answer on Stack Overflow, the Manning Lucene forum, or two other Lucene forums :(

Thanks,
-Prashanth

BCP said...

Hi Kalani,

I switched to the DB and the indexing and search are working great now. By the way, your post on indexing using a DB was very helpful.
Thanks.

kalani Ruwanpathirana said...

Hi BCP, Sorry for the delay. It's great that you could get it working. However I am wondering why it didn't work for file index while working for DB index. btw it's good you got through it.

BCP said...

Thanks Kalani,
By the way, I am still trying it and will let you know if I find a solution.

Unknown said...

Hello,

When I tried to implement this code, I always got the following error message:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
at DB_connect.dissertation_araalz.ParseWodDocFile.main(ParseWodDocFile.java:29)
Java Result: 1
BUILD SUCCESSFUL (total time: 3 seconds)

Could you please help me with this problem?

Unknown said...

This is a smart blog. I mean it. You have so much knowledge about this issue, and so much passion. You also know how to make people rally behind it, obviously from the responses. You've got a design here that's not too flashy, but makes a statement as big as what you're saying. Great job, indeed.
usb drive recovery

Anonymous said...

This seems to be well maintained blog for the reader to get knowledge about the Ms office formats.
Microsoft Access Training NY

Unknown said...

We know that Microsoft office is very beneficial and useful for office work and there are a lot of people who use it for help paper and other purpose. We can learn it from computer institutes and this post is also very beneficial for the users of Microsoft office.
