
Monday, July 28, 2008

Extracting the Text from XML Documents for Indexing Purposes

While building a Lucene index for content searching, I had to index XML documents without their tags. In other words, I had to extract every text node from each document. I used the SAX API for this, and it was just a matter of writing an event handler for character data. The following code shows how to do it.
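import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Collects the character data of every text node
final StringBuilder sb = new StringBuilder();

try {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    DefaultHandler handler = new DefaultHandler() {
        // Other event handlers (startElement, endElement) can be overridden in the same way
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // The parser may call this several times for one text node, so always append
            sb.append(ch, start, length);
        }
    };

    saxParser.parse(new File("fileName.xml"), handler);
    System.out.println(sb.toString());
} catch (Exception e) {
    e.printStackTrace();
}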
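
One thing to watch with the handler above is that text from adjacent elements is appended with nothing in between, so <title>Hello</title><body>World</body> comes out as HelloWorld, which is not what you want in an index. Below is a minimal sketch of one way around that, assuming it is acceptable to insert a space whenever an element ends. The class name SaxTextExtractor and the method extractText are my own names for illustration, and the input is an in-memory string so the example runs without a file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for illustration only
final class SaxTextExtractor {

    // Returns the text of all text nodes, separated by single spaces
    static String extractText(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: element boundaries should act as word boundaries
                sb.append(' ');
            }
        };

        parser.parse(new InputSource(new StringReader(xml)), handler);

        // Collapse runs of whitespace (including indentation) into one space
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Hello</title><body>SAX <b>text</b> nodes</body></doc>";
        System.out.println(extractText(xml));
    }
}
```

The trade-off is that inline markup inside a single word (for example foo<b>bar</b>) is split at the closing tag, so this suits document-style XML better than heavily inlined markup.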

This can be done using the StAX API too. However, StAX is bundled with the JDK only from Java 1.6 onwards. The following code works with Java 1.6; on earlier versions you may be able to use the Woodstox parser without changing the code.
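import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

try {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    InputStream in = new FileInputStream("fileName.xml");
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    StringBuilder bf = new StringBuilder();

    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        // Append every character event. The event after a start element is not
        // always text (it can be a nested element), so check the type before casting.
        if (event.isCharacters()) {
            bf.append(event.asCharacters().getData()).append(" ");
        }
    }

    eventReader.close();
    in.close();
    System.out.println(bf.toString());
} catch (Exception e) {
    e.printStackTrace();
}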
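
With StAX, the event that follows a start element is not guaranteed to be text: with nested elements it is another start element, and calling asCharacters() on it fails. Here is a small sketch that checks isCharacters() on every event instead and skips whitespace-only nodes such as indentation. It reads from an in-memory string so it can be run as is. The names StaxTextExtractor and extractText are made up for this example, and turning on IS_COALESCING is my own choice so that each run of text arrives as one event.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, named for illustration only
final class StaxTextExtractor {

    // Returns the non-blank text nodes, trimmed and joined by single spaces
    static String extractText(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to merge adjacent character data into one event
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    Characters text = event.asCharacters();
                    // Skip nodes that contain only whitespace
                    if (!text.isWhiteSpace()) {
                        sb.append(text.getData().trim()).append(' ');
                    }
                }
            }
        } finally {
            reader.close();
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Hello</title><body>StAX <b>text</b> nodes</body></doc>";
        System.out.println(extractText(xml));
    }
}
```

Trimming each text node keeps the output tidy for indexing, at the cost of losing the original spacing inside the document.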