Showing posts with label excel. Show all posts
Showing posts with label excel. Show all posts

Tuesday, August 26, 2008

How to Index Microsoft Format Documents (Word, Excel, Powerpoint) - Lucene

As my previous post shows how to index PDF Documents with Lucene, I thought that it would be worth to post how to index Microsoft format files too because those file types are very commonly used. Lucene always requires a String in order to index the content and therefore we need to extract the text from the document before giving it to Lucene for indexing. To parse the document we can use Apache POI which provides a Java API for Microsoft format files.

The ways to extract text from Word, Excel and Powerpoint documents are shown below.
//Word text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.doc"));
WordExtractor extractor = new WordExtractor(fs);
String wordText = extractor.getText();

//Excel text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.xls"));
ExcelExtractor extractor = new ExcelExtractor(fs);
String excelText = extractor.getText();

//Powerpoint extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.ppt"));
PowerPointExtractor extractor  = new PowerPointExtractor(fs);
String powerText = extractor.getText();

However POI is still not compatible with Office 2007 file formats like .docx, .xlsx and .pptx but it will in the future.
Related Posts with Thumbnails