The snippets below show how to extract text from Word, Excel and PowerPoint documents using Apache POI.
// Word text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.doc"));
WordExtractor extractor = new WordExtractor(fs);
String wordText = extractor.getText();

// Excel text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.xls"));
ExcelExtractor extractor = new ExcelExtractor(fs);
String excelText = extractor.getText();

// PowerPoint text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.ppt"));
PowerPointExtractor extractor = new PowerPointExtractor(fs);
String powerText = extractor.getText();
However, POI is not yet compatible with Office 2007 file formats such as .docx, .xlsx and .pptx, but it will be in the future.
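When indexing a folder of mixed documents, each file has to be routed to the right extractor first. A minimal sketch of that routing in plain Java follows. The `OfficeFormat` enum and its `fromFileName` method are hypothetical names introduced for illustration, not part of POI, and the check relies only on the file extension, which is an assumption about how the files are named.

```java
import java.util.Locale;

/** Hypothetical helper: maps a file name to the extractor family it needs. */
enum OfficeFormat {
    WORD, EXCEL, POWERPOINT, UNSUPPORTED;

    /** Decides by extension only (case-insensitive); the file content is not inspected. */
    static OfficeFormat fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "doc": return WORD;        // handled by WordExtractor
            case "xls": return EXCEL;       // handled by ExcelExtractor
            case "ppt": return POWERPOINT;  // handled by PowerPointExtractor
            default:    return UNSUPPORTED; // includes .docx, .xlsx and .pptx
        }
    }
}
```

A caller could switch on the result and construct a WordExtractor, ExcelExtractor or PowerPointExtractor as in the snippets above, skipping any file reported as UNSUPPORTED.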
22 comments:
Awesome post!!!
I like it. It saved a day of work for me.
Thanks Again.
Nice tutorial, thanks!
This can be simplified using POI 3.7, which also handles the Office 2007+ XML formats:
import org.apache.poi.extractor.ExtractorFactory;
final String text = ExtractorFactory.createExtractor(new File("myfile.docx")).getText();
Hi Kalani,
Could you post about the main differences between Lucene and Solr, and how to build Solr on top of Lucene in Java? I would be really thankful for your reply.
Vishnu,
I haven't worked with Solr, but these articles may help you.
http://www.lucenetutorial.com/lucene-vs-solr.html
http://lucene.apache.org/solr/tutorial.html#Getting+Started
Hi Kalani,
I am trying to index an Excel sheet using Lucene. The indexing works properly, but the search always returns 0 hits.
To parse the Excel file I am using the jlx library.
Lucene 3.1
I have posted the question on Stack Overflow.
Here is the link; please let me know what's going wrong in the code.
http://stackoverflow.com/questions/7621073/indexing-and-searching-a-ms-excel-using-lucene-3-1
Hi BCP,
Can you check the index table in the DB to see whether it was created correctly (make sure there is a title column, etc.)? The indexing looks correct to me, but it would be good to modify the searcher as follows:
Searcher searcher = new IndexSearcher(IndexReader.open("d:\\index"));
Query query = new QueryParser("title",analyzer).parse("the");
Hits hits = searcher.search(query);
Hi Kalani,
The index is created properly, as I can see in the file (_0.fdt).
Lucene 3.1 does not support "Hits"; it has been removed.
Is there any workaround?
Thanks,
-Prashanth
Oh, that might be it! I was using Lucene 2.3.2, which is outdated now. The only solution I can see at the moment is to go through the new docs to find out how to search the index. Sorry about that.
Hi Kalani,
I switched to Lucene 2.3 and am still getting the same problem.
CODE is here
The page asks for a password; please use the one below.
password: lucene
Well, it seemed correct to me. However, did you try indexing a simple string (instead of an Excel sheet) and searching for a keyword in it?
Yeah, I have tried with simple strings and txt files, and they work fine.
It's only with MS Excel that I am facing the problem.
Now that makes sense. Would you be able to use Apache POI (as explained in this post), instead of the library you are using now, to extract the text from the Excel sheet?
I went through Apache POI. I actually want to access the Excel sheet column by column, and I didn't find any method to do that.
If you don't mind, can you send me your mail ID so that I can send you the Excel sheet?
My mail ID is: bcprashanth@yahoo.com
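On the column-by-column question: spreadsheet readers generally hand back rows, so one way to read a column is to take the same cell index from every row. A minimal sketch in plain Java follows, with rows modelled as lists of strings. This is a stand-in for whatever the parser returns, not POI or jxl API, and `ColumnReader` and `column` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads one column out of row-oriented data. */
final class ColumnReader {
    private ColumnReader() {}

    /**
     * Returns the values found at columnIndex in each row, top to bottom.
     * Rows that are too short contribute an empty string, so the result
     * always has one entry per row.
     */
    static List<String> column(List<List<String>> rows, int columnIndex) {
        if (columnIndex < 0) {
            throw new IllegalArgumentException("columnIndex must not be negative");
        }
        List<String> values = new ArrayList<>();
        for (List<String> row : rows) {
            values.add(columnIndex < row.size() ? row.get(columnIndex) : "");
        }
        return values;
    }
}
```

Each column collected this way could then be indexed as its own field or document, which mirrors the text-file workaround of writing out one column at a time.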
Hi Kalani,
When I take each column from the Excel sheet, write it to a text file and index those text files, it works properly. But when I try to index the Excel sheet directly after parsing it, it doesn't work.
I observed the same behavior in both versions 2.3 and 3.1.
Any help or direction would be much appreciated.
I didn't get any answer on Stack Overflow, the Manning Lucene forum or two other Lucene forums :(
Thanks,
-Prashanth
Hi Kalani,
I switched to a DB, and the indexing and search are working great now. By the way, your post on indexing using a DB was very helpful.
Thanks.
Hi BCP, sorry for the delay. It's great that you could get it working. However, I am wondering why it didn't work for the file index while it works for the DB index. Anyway, it's good you got through it.
Thanks Kalani,
By the way, I am still trying it; I will let you know if I find a solution.
Hello,
When I tried to implement this code, I always got the following error message:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryNode;
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
at DB_connect.dissertation_araalz.ParseWodDocFile.main(ParseWodDocFile.java:29)
Java Result: 1
BUILD SUCCESSFUL (total time: 3 seconds)
Could you please help me with this problem?
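A NoSuchMethodError like the one above typically means that a class found at runtime comes from a different library version than the calling code was compiled against. Here that would be consistent with POI jars of different versions on the classpath, though that is an assumption about this setup. A small sketch in plain Java that reports where a class was loaded from may help to check it; `ClassOrigin` and `describe` are hypothetical names.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports which jar or directory a class was loaded from. */
final class ClassOrigin {
    private ClassOrigin() {}

    static String describe(String className) {
        try {
            // Load without running static initializers; only the origin matters here.
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                // JDK classes usually have no code source.
                return className + " -> (no code source, probably a JDK class)";
            }
            return className + " -> " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not loadable: " + e;
        }
    }
}
```

Printing `ClassOrigin.describe("org.apache.poi.poifs.filesystem.POIFSFileSystem")` and `ClassOrigin.describe("org.apache.poi.hwpf.HWPFDocument")` would show whether the two classes named in the stack trace come from jars of matching versions.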