Monday, July 28, 2008

Extracting the Text from XML Documents for Indexing Purposes

While creating a Lucene index for content searching, I had to index XML documents without their XML tags. In simple terms, I had to extract every text node from the document. I used the SAX API for this, and it was just a matter of writing an event handler for character data. The following piece of code shows one way to do it.
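import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Collects the character data of every text node in the document
final StringBuilder sb = new StringBuilder();

try {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    DefaultHandler handler = new DefaultHandler() {
        // Other event handlers (startElement, endElement) can be overridden in the same way
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // The parser may deliver one text node in several chunks, so always append
            sb.append(ch, start, length);
        }
    };

    saxParser.parse("fileName.xml", handler);
    System.out.println(sb.toString());
} catch (Exception e) {
    e.printStackTrace();
}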
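
One thing to watch when the extracted text is going into an index: the characters handler appends text nodes back to back, so the text of adjacent elements can run together into a single token. The following is a minimal sketch of one way around that, not the only one. It adds a space at every element boundary and then collapses the whitespace. The class name SaxTextExtractor and the helper extractText are hypothetical names chosen for this illustration, and the XML is read from a string only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextExtractor {

    // Hypothetical helper: returns the text content of an XML string,
    // with a single space wherever an element starts or ends.
    public static String extractText(String xml) {
        final StringBuilder sb = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                sb.append(' '); // separate this element's text from what came before
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                sb.append(' '); // separate this element's text from what follows
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }

        // Collapse runs of whitespace (including the separators added above)
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(
            extractText("<doc><title>Lucene</title><body>Indexing XML</body></doc>"));
    }
}
```

Whether a space is the right separator depends on the analyzer used at indexing time; for most tokenizing analyzers any whitespace should do.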

This can be done using the StAX API too, but the StAX libraries are bundled with the JDK only from Java 1.6 onwards. The following code works with Java 1.6. On earlier versions, you may be able to use the Woodstox parser without changing the code.
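import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

try {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    InputStream in = new FileInputStream("fileName.xml");
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    StringBuilder bf = new StringBuilder();

    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        // Append every non-blank character event. Checking isCharacters() avoids the
        // ClassCastException that asCharacters() throws when a start element is
        // followed directly by a child element instead of text.
        if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
            bf.append(event.asCharacters().getData()).append(" ");
        }
    }
    eventReader.close();
    in.close();

    System.out.println(bf.toString());
} catch (Exception e) {
    // Report the failure instead of silently swallowing it
    e.printStackTrace();
}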
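
As a variation on the StAX approach, here is a sketch that uses the cursor-style XMLStreamReader instead of the event iterator. It collects every character event, so nested elements and text that follows a child element are handled. The class name StaxTextExtractor and the helper extractText are hypothetical, and the input is a string only so the example is self-contained. Setting IS_COALESCING is an assumption about what is wanted here: it asks the parser to deliver adjacent character data, including CDATA sections, as one event.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxTextExtractor {

    // Hypothetical helper: returns the non-blank text nodes of an XML string,
    // separated by single spaces.
    public static String extractText(String xml) {
        StringBuilder sb = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent character data (and CDATA) into a single event
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int type = reader.next();
                boolean isText = type == XMLStreamConstants.CHARACTERS
                        || type == XMLStreamConstants.CDATA;
                // Skip the whitespace-only text that pretty-printed XML contains
                if (isText && !reader.isWhiteSpace()) {
                    sb.append(reader.getText().trim()).append(' ');
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(
            extractText("<doc><title>Lucene</title><body>Indexing <b>XML</b> files</body></doc>"));
    }
}
```

The cursor API avoids allocating an event object per node, which may matter when indexing many large documents; the event iterator is arguably easier to read. Either should work with the JDK's built-in parser.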

4 comments:

new trader said...

Dear Kalani,


Can you tell me how to index XML doc along with its nodes? So that I can extract required nodes from>

