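FileInputStream fi = new FileInputStream(new File("sample.pdf"));
PDFParser parser = new PDFParser(fi);
parser.parse();                              // parse the whole PDF
COSDocument cd = parser.getDocument();
String text;
try {
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(new PDDocument(cd)); // plain text of all pages
} finally {
    cd.close();                              // the COSDocument must be closed
    fi.close();
}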
Now this extracted text can be used to build the Lucene index.
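To make that hand-off concrete, here is a minimal sketch of what "indexing the extracted text" means. It deliberately does not use Lucene: `TinyTextIndex` is a hypothetical in-memory stand-in (a plain term-to-document map) so that the example runs with nothing but the JDK. In a real application, the string passed to `add` would be the `text` returned by `stripper.getText(...)`, and Lucene would take the place of this class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a Lucene index: maps each term to the ids of
// the documents that contain it. Not part of Lucene or PDFBox.
public class TinyTextIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the plain text that an extractor (such as PDFBox) produced.
    public void add(String docId, String text) {
        // Lower-case the text and split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of the documents that contain the given term.
    public Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyTextIndex index = new TinyTextIndex();
        // In the real flow this string would come from the PDF text extraction.
        index.add("sample.pdf", "Lucene indexes plain text");
        System.out.println(index.search("lucene")); // prints [sample.pdf]
    }
}
```

The point of the design is that the index only ever sees a `String`; it neither knows nor cares that the text came from a PDF.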
Likewise, there are various tools for extracting text from Word documents and other formats. Any kind of document can therefore be added to the Lucene index, as long as its text can be extracted with an external tool. Even when indexing XML, you should first extract the text from the XML document if you only need to index the text values.
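For the XML case, one possible approach is sketched below using only the DOM parser that ships with the JDK. The class name `XmlTextExtractor` and the sample document are made up for illustration; the sketch simply drops all markup and keeps the text values, which could then be passed on to the indexing step.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of Lucene: keeps only the text values of an XML document.
public class XmlTextExtractor {

    // Returns the text of all elements, with markup dropped and runs of
    // whitespace collapsed to single spaces.
    public static String extractText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // getTextContent() concatenates the text of all descendant nodes.
        String raw = doc.getDocumentElement().getTextContent();
        return raw.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, used only for illustration.
        String xml = "<doc><title>Indexing PDFs</title> <body>Extract text first.</body></doc>";
        System.out.println(extractText(xml));
    }
}
```

Attribute values are not included by `getTextContent()`, so this only fits the case where the element text is all you want to index.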
31 comments:
Just adding more information:
Microsoft Office files can be parsed with the Apache POI library. I think it is not compatible with the Office 2007 format yet.
http://poi.apache.org/
Nice work around for https://issues.apache.org/jira/browse/PDFBOX-365
An important point that you have missed out!
cd.close(); [COSDocument object needs to be closed]
Nilindra [Another Sri Lankan]
You could also use Tika (http://lucene.apache.org/tika/). This toolkit is a subproject of Lucene and was made exactly for this: it extracts text and metadata from various documents (PDF, Word, ...) to index them with Lucene. A nice feature is the AutoDetectParser, which will automagically detect the format of the document (PDF, Word, ...) and parse the text.
It is actually a sort of wrapper around POI and PDFBox, to have a common interface.
This is great work! I have been trying to index PDFs for the last 3 weeks. All the sample code I found on the internet was very complex and did not work in my case. This is very simple and works perfectly. Thanks for sharing.
Sorry but I just don't get it. Which files do I need to integrate PDFbox with Lucene? Where do I need to put these files? What do I need to modify? Where do I put the code you provided at the top of this document?
@Stephane: Actually, you don't need to worry about any integration. You can have a separate method or class that converts the PDF into text. Then just use that text as the input for your Lucene indexing method.
I'm gonna need a step-by-step tutorial for this.
getText is causing the following exception. Any idea?
System.NullReferenceException was caught
Message="Object reference not set to an instance of an object."
Source="PDFBox-0.7.3"
StackTrace:
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)
Hi, it seems that this is caused by PDFBox itself. Could you check this with a stable PDFBox version?
nice information thanks for sharing .
You can also use Lucene based SearchBlox for indexing PDF documents.
http://www.searchblox.com/
Hi Kalani, as I see you know a lot :-) here is my question: do you know how to print a specific page of a PDF file? I tried splitting the PDF using pdfbox and then printing each part, but it doesn't work. Now I'm trying the PrinterJob object and the PageRanges attribute, but I've had no luck. Anyway, if you have any idea, I will be very thankful.
Have a nice day
Saludos
Pablo
Hi Praveena, I haven't worked on PDF printing but to access a certain page...could you have a look at PDDocument class?
Thanks for the tips on indexing PDF docs, kalani - much appreciated
I also get an "Object reference not set to an instance of an object" exception from the getText method. Are there any solutions or workarounds?
Regards Dirk
Does anyone know where to get the pdfbox 1.5.0 dlls for .net? Seems to fix the problem
Regards Dirk
How can we use this code as part of Lucene in java and not put it on the client who wants to index text?
@Programmer, this code is completely separate from the Lucene code. It is used only to extract text from the PDF. Building the Lucene index from that text is a separate step.
HI!! Kalani,
Can you please provide Lucene code for Pdf version 5.2? It's very urgent, please do the needful.
Hi Preethi, I didn't get what you meant by "Lucene code for Pdf version 5.2". If you can extract the text from pdf somehow, then Lucene code stays the same.
This is another good way of extracting text. So we are moving forward on the path of text mining. I will try to use this idea in my application areas.
How can the Lucene index be converted into XML format?
I found a local text search tool occasionally, it is AnyTXT Searcher. Simple and practical. You should know Everything, AnyTXT is like its brother.
You can try it, and you will like it if you like everything, highly recommended. https://sourceforge.net/p/anytxt/