
XML Processing with Java

In this article I will give you a brief introduction to XML processing in Java. Although there are some fancy libraries available, I will stick with the default tools that ship with Java 8. They are almost as good as those libraries, because the libraries build on this default Java functionality.

One thing I won't write about is what XML is. I assume you are familiar with the language and you want to know how you can parse XML text with Java.

Let's see what Java has to offer!

DOM

The DOM parser loads the complete document and builds an internal tree structure from it.

Drawback: with larger files parsing can get really slow, because loading the whole contents and building the tree costs both time and memory.

To use DOM you have to implement the following steps:

get a javax.xml.parsers.DocumentBuilderFactory with DocumentBuilderFactory.newInstance() (name it factory)
create a javax.xml.parsers.DocumentBuilder with factory.newDocumentBuilder() (name it builder)
load the org.w3c.dom.Document from a File or InputStream with builder.parse() (name it document)
get the root org.w3c.dom.Element from the DOM with document.getDocumentElement() (name it root)

examine / extract the attributes
examine / extract the sub-elements
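The first four steps can be sketched roughly like this. The class name DomLoadExample, the helper loadRoot and the sample XML string are made up for this illustration, and the checked exceptions are simply wrapped to keep the sketch short:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

class DomLoadExample {

    /** Parses the whole stream into a DOM tree and returns its root element. */
    static Element loadRoot(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(in);
            return document.getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped only to keep the example short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from a real file.
        String xml = "<publications><book id=\"_004\"><title>Sample</title></book></publications>";
        Element root = loadRoot(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(root.getTagName());
    }
}
```

In a real application you would probably pass a FileInputStream or a File to builder.parse() instead of an in-memory string.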

The last two steps are recursive because you can extract the attributes and sub-elements from every node so let's take a look at how it works.

To extract all the attributes of a node you have to iterate over the list of attributes and get them by their index:

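NamedNodeMap attributes = node.getAttributes(); // org.w3c.dom.NamedNodeMap, null for non-element nodes
if (attributes != null) {
    for (int i = 0; i < attributes.getLength(); i++) {
        System.out.println(attributes.item(i).getNodeName());
    }
}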

If you know the attribute you are looking for, you can simply call the getter with the attribute's name: element.getAttribute("attribute"). Note that this method is defined on org.w3c.dom.Element, so you may have to cast the node first. And don't worry: if the attribute doesn't exist, you won't get an exception, just an empty string.
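Here is a small sketch of the lookup by name. The class DomAttributeExample, the helper parseRoot and the attribute name isbn are assumptions made for this example. Since a missing attribute and an empty one both yield an empty string, hasAttribute() is used to tell them apart:

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomAttributeExample {

    /** Parses an XML string and returns the root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // Wrapped only to keep the example short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element book = parseRoot("<book id=\"_004\"/>");
        System.out.println(book.getAttribute("id"));             // value of the existing attribute
        System.out.println(book.getAttribute("isbn").isEmpty()); // missing attribute yields ""
        System.out.println(book.hasAttribute("isbn"));           // explicit existence check
    }
}
```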

To get the child nodes, use a loop just like the one for the attributes: iterate over the org.w3c.dom.NodeList returned by org.w3c.dom.Node.getChildNodes() and pick each org.w3c.dom.Node by its index with item(i).
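A recursive walk over attributes and child nodes could look like the following sketch. The names DomWalkExample, walk and describe are hypothetical, and the output format (one indented line per element) is just one way to make the recursion visible:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWalkExample {

    /** Adds one line per element (indentation, tag name, attributes), then recurses into the children. */
    static void walk(Node node, int depth, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            line.append(node.getNodeName());
            NamedNodeMap attributes = node.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                Node attribute = attributes.item(i);
                line.append(' ').append(attribute.getNodeName())
                    .append('=').append(attribute.getNodeValue());
            }
            out.add(line.toString());
        }
        // Text nodes and other leaves simply have an empty child list.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    /** Parses the XML string and describes every element in document order. */
    static List<String> describe(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> out = new ArrayList<>();
            walk(root, 0, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String xml = "<publications><book id=\"_004\"><title>Sample</title></book></publications>";
        describe(xml).forEach(System.out::println);
    }
}
```

Text nodes are skipped here on purpose; to read an element's text you could call getTextContent() on it.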

This all means that you have to write extractor functions for each node type, and unless you want to resort to reflection you have to know all the node names and attributes you are interested in.

SAX

SAX parses the XML document in an event-based fashion: it sends an event to the application whenever a new node is available to handle, which means that the API has read a tag and you can extract the information you need. This approach therefore does not load the whole contents into the memory of the application. However, it should only be used if you do not have deeply nested XML structures.

The XML file is processed in a forward manner when using SAX: you have no random access to the XML document.

The main interface to use with SAX is the org.xml.sax.ContentHandler because it specifies the callback methods used by the parser.

To use SAX you need the following building blocks:

a javax.xml.parsers.SAXParserFactory from SAXParserFactory.newInstance() (name it factory)
a javax.xml.parsers.SAXParser from factory.newSAXParser() (name it saxParser)
and an extension of org.xml.sax.helpers.DefaultHandler which will handle the events you are interested in (or you can simply implement org.xml.sax.ContentHandler, but in that case you have to write the empty methods yourself, which are already present in DefaultHandler).

The event handling is straightforward: your callback methods in the handler class are called every time the parser encounters a node which qualifies for a given event. The most commonly used callback methods are:

org.xml.sax.ContentHandler.startElement(String, String, String, Attributes) when the parser encounters a new starting node like <book>
org.xml.sax.ContentHandler.endElement(String, String, String) when the parser encounters an ending node like </book>
org.xml.sax.ContentHandler.characters(char[], int, int) when the parser encounters chunks of character data like the text value between starting and ending elements
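Putting the building blocks and the three callbacks together, a minimal sketch might look like this. SaxTraceExample, TraceHandler and trace are names invented for the example. I use qName for the tag name because the default factory is not namespace-aware:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxTraceExample {

    /** Records every start tag, end tag and non-blank text chunk in the order the parser reports them. */
    static class TraceHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            events.add("start " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text " + text);
            }
        }
    }

    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            TraceHandler handler = new TraceHandler();
            saxParser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.events;
        } catch (Exception e) {
            // Wrapped only to keep the example short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        trace("<book><title>Java</title></book>").forEach(System.out::println);
    }
}
```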

In the implementations of these methods you handle the events and do something with the content. A basic implementation just prints out the contents, and from there you can move on to instantiating objects and filling their fields with values read from the XML.
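As a sketch of the second stage, the handler below fills simple objects instead of printing. The Book class with its id and title fields is a hypothetical target type for this example, as are the names SaxBookExample and BookHandler. The text is buffered in a StringBuilder because a SAX parser is allowed to deliver the character data of one element in several characters() calls:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxBookExample {

    /** Hypothetical target class; only id and title are filled in this sketch. */
    static class Book {
        String id;
        String title;
    }

    static class BookHandler extends DefaultHandler {
        final List<Book> books = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private Book current;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0); // a new element starts, so drop the text buffered so far
            if ("book".equals(qName)) {
                current = new Book();
                current.id = attributes.getValue("id");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called more than once per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return; // we are outside of a <book>
            }
            if ("title".equals(qName)) {
                current.title = text.toString().trim();
            } else if ("book".equals(qName)) {
                books.add(current);
                current = null;
            }
        }
    }

    static List<Book> parse(String xml) {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            BookHandler handler = new BookHandler();
            saxParser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.books;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        for (Book book : parse("<publications><book id=\"_004\"><title>Sample</title></book></publications>")) {
            System.out.println(book.id + ": " + book.title);
        }
    }
}
```

Note how the handler has to remember where it is in the document (the current field). This bookkeeping grows with every level of nesting.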

StAX

The StAX parser is similar to the SAX parser, but it implements pull-based parsing, which means that your application asks the API for the next chunk of information whenever it is ready.

As with SAX, this API works best if your XML structure is not deeply nested. If it is, you should look for alternatives. However, I choose StAX when I have a simple XML structure, because with reflection it is very easy to create a generic parser which fills objects with information.

You need the following building blocks to get started with StAX:

a javax.xml.stream.XMLInputFactory from XMLInputFactory.newInstance() (name it factory)
a javax.xml.stream.XMLEventReader from factory.createXMLEventReader(InputStream) (name it eventReader)

Now you only have to pull the events from the eventReader with nextEvent() as long as hasNext() returns true.

Similar to the SAX solution you can listen to some event types and do something when they occur:

javax.xml.stream.events.XMLEvent.isStartElement() tells you that the parser encountered a starting tag like <book>
javax.xml.stream.events.XMLEvent.isEndElement() tells you that the parser encountered an ending tag like </book>
javax.xml.stream.events.XMLEvent.asCharacters() returns the event's contents as javax.xml.stream.events.Characters; call it only if isCharacters() is true. To extract the String content you have to call javax.xml.stream.events.Characters.getData().
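The polling loop with these three event checks can be sketched as follows. StaxExample and trace are names made up for the example, and whitespace-only text between tags is skipped to keep the output readable:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxExample {

    /** Pulls events one by one and records start tags, end tags and non-blank text. */
    static List<String> trace(InputStream in) {
        List<String> events = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLEventReader eventReader = factory.createXMLEventReader(in);
            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.nextEvent();
                if (event.isStartElement()) {
                    events.add("start " + event.asStartElement().getName().getLocalPart());
                } else if (event.isEndElement()) {
                    events.add("end " + event.asEndElement().getName().getLocalPart());
                } else if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
                    events.add("text " + event.asCharacters().getData());
                }
                // All other event types (start/end of document, comments, ...) are ignored.
            }
            eventReader.close();
        } catch (XMLStreamException e) {
            // Wrapped only to keep the example short.
            throw new IllegalStateException("Could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String xml = "<book><title>Java</title></book>";
        trace(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))).forEach(System.out::println);
    }
}
```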

And that’s it. As you can see, it is slightly simpler than the SAX solution: you do not need to implement a custom handler to react to the events, you decide yourself when to fetch the next event, and if you are not interested in an event you can simply skip it.

JAXB

This approach allows you to map Java classes to XML representations and vice versa. With it you do not need to code the handling of parsing events; you just provide the mapping information for the classes' variables where needed. If your instance variables have getters and setters with the same names as the tags in the XML, you do not need any annotations on them, because the mapping is done via reflection.

This tool is my choice when I have to work with nested XML structures, because it is easy to write the unmarshaller and you only need some annotations.

You need the following building blocks to unmarshal an XML stream:

a javax.xml.bind.JAXBContext from JAXBContext.newInstance(Class...), where you provide the target classes to generate (name it context)
a javax.xml.bind.Unmarshaller from context.createUnmarshaller() (name it unmarshaller)
a javax.xml.stream.XMLStreamReader from javax.xml.stream.XMLInputFactory.newInstance().createXMLStreamReader(Reader) (name it xmlReader)

To get the resulting object you have to call unmarshaller.unmarshal(xmlReader, Class<T>), where Class<T> is the class you used when creating the context. This method returns a javax.xml.bind.JAXBElement<T>, from which you extract your target object by calling javax.xml.bind.JAXBElement.getValue().

Now you have your resulting objects.

Actually one more thing is sometimes needed: for a list of Book objects you need a parent class to give to JAXB, because you won't get a list back, and a well-formed XML file needs a single root element anyway. For example, we define the XML like this, where the root element can hold any number of books:

<?xml version="1.0" encoding="UTF-8"?>
<publications xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <book id="_004">
        <title>XML processing and website scraping in Java</title>
        <author>Gabor Laszlo Hajba</author>
        <copyright>2016</copyright>
        <publisher>LeanPub</publisher>
    </book>
</publications>

Now we create a class called Publications, analogous to the XML's root element, where we define the List of Book objects and tell JAXB that they are elements:

import java.util.List;

import javax.xml.bind.annotation.XmlElement;

/**
 * Sample parent class to extract Book objects with JAXB.
 *
 * @author GHajba
 *
 */
public class Publications {

    @XmlElement(name = "book")
    List<Book> books;

    public List<Book> getBooks() {
        return this.books;
    }
}

If we call the unmarshaller for the Publications class, we will get a Publications instance back where we can access the loaded Books through the getBooks method.

Conclusion

As you can see, Java ships with several tools for parsing and loading XML files and XML data. Naturally each one has its downsides, but each is good for a different purpose. And as I mentioned in the introduction: third-party libraries build on these options to make these features more convenient to use.

By Guest | 8/15/2016 | General