kalani's Tech blog: August 2008

Tuesday, August 26, 2008

How to Index Microsoft Format Documents (Word, Excel, Powerpoint) - Lucene

As my previous post shows how to index PDF Documents with Lucene, I thought that it would be worth to post how to index Microsoft format files too because those file types are very commonly used. Lucene always requires a String in order to index the content and therefore we need to extract the text from the document before giving it to Lucene for indexing. To parse the document we can use Apache POI which provides a Java API for Microsoft format files.

The ways to extract text from Word, Excel and Powerpoint documents are shown below.

//Word text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.doc"));
WordExtractor extractor = new WordExtractor(fs);
String wordText = extractor.getText();

//Excel text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.xls"));
ExcelExtractor extractor = new ExcelExtractor(fs);
String excelText = extractor.getText();

//Powerpoint extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.ppt"));
PowerPointExtractor extractor  = new PowerPointExtractor(fs);
String powerText = extractor.getText();

However POI is still not compatible with Office 2007 file formats like .docx, .xlsx and .pptx but it will in the future.

Wednesday, August 06, 2008

How to Index PDF Documents with Lucene

There is no built in support in Lucene to index PDF documents. Therefore the text should be extracted from the document before indexing. A tool which can be used for this purpose is PDFBox. PDFBox is an open source project under BSD license. Although there are many other PDF tools, I experienced that this perfectly fits with Lucene. The little extra thing need to be done here is extracting the text from the document. Following code snippet shows how to do it.

FileInputStream fi = new FileInputStream(new File("sample.pdf"));

PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));

Now this extracted text can be used to build the Lucene index.

Likewise there are various tools to extract text from word documents and etc. Therefore any kind of document can be added to the Lucene index if the text can be extracted by using an external tool. Even in XML indexing, you should extract the text from XML document if you need to index text values only.