kalani's Tech blog: July 2008

Monday, July 28, 2008

Extracting the Text from XML Documents for Indexing Purposes

While creating a Lucene index for content searching, I had to index XML documents without their tags. In simple terms, I had to extract every text node from each document. I used the SAX API for this, and it was just a matter of writing an event handler for character data. The following piece of code shows how to do it.

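import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final StringBuffer sb = new StringBuffer();

try {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    DefaultHandler handler = new DefaultHandler() {
        // Other event handlers (startElement, endElement) can be overridden in the same way.
        // characters() may be called more than once for a single text node, so it only appends.
        public void characters(char[] ch, int start, int length) throws SAXException {
            sb.append(ch, start, length);
        }
    };

    // Passing a File avoids the string argument being interpreted as a URI.
    saxParser.parse(new File("fileName.xml"), handler);
    System.out.println(sb.toString());
} catch (Exception e) {
    e.printStackTrace();
}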
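
One thing to watch when the extracted text goes to an indexer: the handler above appends text nodes back to back, so the text of neighbouring elements can run together into one token. Below is a minimal sketch of one way around that, assuming it is acceptable to insert a space at every end tag and to collapse whitespace afterwards. The class name SaxTextExtractor and the method extractText are made up for illustration, and the XML is read from a string only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextExtractor {

    // Hypothetical helper: returns the text content of an XML string,
    // with a single space between the text of different elements.
    public static String extractText(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Can be called several times for one text node, so only append here.
                sb.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: a space after each element is enough to keep tokens apart.
                sb.append(' ');
            }
        };

        parser.parse(new InputSource(new StringReader(xml)), handler);
        // Collapse indentation, line breaks and the added separators into single spaces.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        // prints "Lucene in Action"
        System.out.println(extractText("<doc><title>Lucene</title><body>in Action</body></doc>"));
    }
}
```

Adding the separator in endElement rather than in characters keeps a text node that arrives in several chunks from being split in the middle of a word.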

This can be done using the StAX API too. StAX is bundled with the JDK only from Java 1.6 onwards, and the following code works with Java 1.6. On earlier versions, however, you may be able to use the Woodstox parser (a standalone StAX implementation) without changing the code.

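import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

try {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    // Deliver each text node as one Characters event, so the space added below never splits a word.
    inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
    InputStream in = new FileInputStream("fileName.xml");
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    StringBuffer bf = new StringBuffer();

    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        // Append every character-data event. Casting whatever follows a start element with
        // asCharacters() fails on nested elements and misses text that comes after a child element.
        if (event.isCharacters()) {
            bf.append(event.asCharacters().getData()).append(" ");
        }
    }
    eventReader.close();
    in.close();

    System.out.println(bf.toString());
} catch (Exception e) {
    // Report the failure instead of silently swallowing it.
    e.printStackTrace();
}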
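
StAX also offers a cursor-style interface, XMLStreamReader, which does not create an event object for each item. The following is only a sketch of the same text extraction with that interface, not a drop-in replacement for the code above. The class name StaxTextExtractor and the method extractText are invented for this example, the input comes from a string to keep it self-contained, and collapsing whitespace at the end is an assumption about what the indexer needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxTextExtractor {

    // Hypothetical helper: collects all character data from an XML string.
    public static String extractText(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Report each text node as a single event so the separator never splits a word.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        try {
            while (reader.hasNext()) {
                int type = reader.next();
                if (type == XMLStreamConstants.CHARACTERS || type == XMLStreamConstants.CDATA) {
                    // Keep the text and add a space so neighbouring nodes stay apart.
                    sb.append(reader.getText()).append(' ');
                }
            }
        } finally {
            reader.close();
        }
        // Collapse indentation, line breaks and the added separators into single spaces.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        // prints "Hello World"
        System.out.println(extractText("<a><b>Hello</b><c>World</c></a>"));
    }
}
```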