Parsing XML

The neon XML interface is exposed by the ne_xml.h header file. This interface wraps the standard SAX API used by XML parsers, adding one further abstraction, stacked SAX handlers, and providing consistent XML namespace support.

Introduction to SAX

A SAX-based parser works by emitting a sequence of events to reflect the tokens being parsed from the XML document. For example, parsing the following document fragment:

<hello>world</hello>

results in the following events:

  1. start-element "hello"
  2. character-data "world"
  3. end-element "hello"

This example demonstrates the three event types used in the subset of SAX exposed by the neon XML interface: start-element, character-data and end-element. In a C API, an “event” is implemented as a function callback; neon uses three callback types, one for each type of event.
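The mapping from events to callbacks can be sketched in plain C. This is a minimal illustration rather than neon's own API: the callback types, the trace structure and the emit_hello function are hypothetical names invented here, and emit_hello merely stands in for a parser that has just read the fragment above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback types, one per event type (not neon's declarations). */
typedef void start_cb(void *userdata, const char *name);
typedef void cdata_cb(void *userdata, const char *text, size_t len);
typedef void end_cb(void *userdata, const char *name);

/* Collects a textual trace of the events received. */
struct trace {
    char buf[128];
};

/* Append one "kind(text);" entry to the trace. */
static void trace_add(struct trace *t, const char *kind,
                      const char *text, size_t len)
{
    size_t used = strlen(t->buf);

    assert(used < sizeof t->buf);
    snprintf(t->buf + used, sizeof t->buf - used, "%s(%.*s);",
             kind, (int)len, text);
}

void on_start(void *userdata, const char *name)
{
    trace_add(userdata, "start", name, strlen(name));
}

void on_cdata(void *userdata, const char *text, size_t len)
{
    trace_add(userdata, "cdata", text, len);
}

void on_end(void *userdata, const char *name)
{
    trace_add(userdata, "end", name, strlen(name));
}

/* Stand-in for a parser that has just read <hello>world</hello>:
 * it reports the three events in document order. */
void emit_hello(start_cb *start, cdata_cb *cdata, end_cb *end, void *userdata)
{
    start(userdata, "hello");
    cdata(userdata, "world", 5);
    end(userdata, "hello");
}
```

Calling emit_hello(on_start, on_cdata, on_end, &t) with a zero-initialised struct trace t leaves the three events recorded in t.buf in the order listed above. The character-data callback takes an explicit length because this sketch assumes the text is not NUL-terminated.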

Stacked SAX handlers

WebDAV property values are represented as fragments of XML, transmitted as parts of larger XML documents over HTTP (notably in the body of the response to a PROPFIND request). When neon parses such documents, the SAX events generated for these property value fragments may need to be handled by the application, since neon has no knowledge of the structure of properties used by the application.

To solve this problem[1] the neon XML interface introduces the concept of a SAX handler. A SAX handler comprises a start-element, a character-data and an end-element callback; the start-element callback is defined such that each handler may accept or decline the start-element event. Handlers are composed into a handler stack before parsing a document. When a new start-element event is generated by the XML parser, neon invokes each start-element callback in the handler stack in turn until one accepts the event. The handler which accepts the event is subsequently passed character-data events if the element contains character data, followed by an end-element event when the element is closed. If no handler in the stack accepts a start-element event, the branch of the tree is ignored.

To illustrate, given a handler A, which accepts the cat and age elements, and a handler B, which accepts the name element, the following document:

Example 2.1. An example XML document

<cat>
  <age>3</age>
  <name>Bob</name>
</cat>


would be parsed as follows:

  1. A start-element "cat" → accept
  2. A start-element "age" → accept
  3. A character-data "3"
  4. A end-element "age"
  5. A start-element "name" → decline
  6. B start-element "name" → accept
  7. B character-data "Bob"
  8. B end-element "name"
  9. A end-element "cat"

The search for a handler which will accept a start-element event begins at the handler of the parent element and continues toward the top of the stack. For the root element, it begins at the base of the stack. In the above example, handler A is at the base, and handler B at the top; if the name element had any children, only B's start-element would be invoked to accept them.
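The search rule described above can be modelled in a few lines of C. This is a sketch of the dispatch logic only, under stated assumptions: struct handler, struct dispatcher, dispatch_start and dispatch_end are hypothetical names, not part of ne_xml.h, and a start callback is assumed to return zero to decline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DECLINE 0      /* assumption: a start callback returns 0 to decline */
#define MAX_DEPTH 16   /* arbitrary nesting limit for this sketch */

typedef int start_cb(const char *name);   /* nonzero means "accept" */

struct handler {
    const char *label;
    start_cb *start;
};

struct dispatcher {
    const struct handler *stack;   /* stack[0] is the base of the stack */
    size_t nhandlers;
    int owner[MAX_DEPTH];          /* handler index per open element, -1 if ignored */
    size_t depth;                  /* number of currently open elements */
};

/* Offer a start-element event to the stack. Returns the label of the
 * accepting handler, or NULL if the element (and its branch) is ignored. */
const char *dispatch_start(struct dispatcher *d, const char *name)
{
    size_t first = 0;   /* the root element is offered to the base first */
    int found = -1;

    assert(d->depth < MAX_DEPTH);
    if (d->depth > 0) {
        int parent = d->owner[d->depth - 1];
        if (parent < 0) {               /* inside an ignored branch */
            d->owner[d->depth++] = -1;
            return NULL;
        }
        first = (size_t)parent;         /* begin at the parent's handler */
    }
    for (size_t i = first; i < d->nhandlers; i++) {
        if (d->stack[i].start(name) != DECLINE) {
            found = (int)i;
            break;
        }
    }
    d->owner[d->depth++] = found;
    return found < 0 ? NULL : d->stack[found].label;
}

/* Close the most recently opened element. */
void dispatch_end(struct dispatcher *d)
{
    assert(d->depth > 0);
    d->depth--;
}

/* Handler A accepts "cat" and "age"; handler B accepts "name". */
int a_start(const char *name)
{
    return strcmp(name, "cat") == 0 || strcmp(name, "age") == 0;
}

int b_start(const char *name)
{
    return strcmp(name, "name") == 0;
}
```

With a stack of { {"A", a_start}, {"B", b_start} }, feeding the start events of Example 2.1 through dispatch_start assigns cat and age to A and name to B. A hypothetical age element nested inside name would be offered only to B, and so would be ignored in this model.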

Maintaining state

To facilitate communication between independent handlers, a state integer is associated with each element being parsed. This integer is returned by the start-element callback and is passed to the subsequent character-data and end-element callbacks associated with the element. The state integer of the parent element is also passed to each start-element callback, with the value zero used for the root element (which by definition has no parent).

To further extend Example 2.1, “An example XML document”: if handler A defines that the state of the root element cat will be 42, the event trace would be as follows:

  1. A start-element (parent = 0, "cat") → accept, state = 42
  2. A start-element (parent = 42, "age") → accept, state = 50
  3. A character-data (state = 50, "3")
  4. A end-element (state = 50, "age")
  5. A start-element (parent = 42, "name") → decline
  6. B start-element (parent = 42, "name") → accept, state = 99
  7. B character-data (state = 99, "Bob")
  8. B end-element (state = 99, "name")
  9. A end-element (state = 42, "cat")

To avoid collisions between state integers used by different handlers, the interface definition of any handler includes the range of integers it will use.
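The trace above can be reproduced with two small sets of callbacks. This is a hedged sketch, not neon's API: the callback signatures, struct cat and the state ranges are assumptions chosen for illustration, as is the convention that a start callback returns zero to decline.

```c
#include <assert.h>
#include <string.h>

#define STATE_ROOT 0   /* parent state passed for the root element */
#define DECLINE 0      /* assumption: returning 0 declines the element */

/* Hypothetical ranges: handler A uses 40-59, handler B uses 90-99. */
enum { A_CAT = 42, A_AGE = 50, B_NAME = 99 };

/* Data shared by both handlers through their userdata pointer. */
struct cat {
    int age;
    char name[32];
};

/* Handler A: accepts <cat> at the root and <age> inside <cat>. */
int a_start(int parent, const char *name)
{
    if (parent == STATE_ROOT && strcmp(name, "cat") == 0)
        return A_CAT;
    if (parent == A_CAT && strcmp(name, "age") == 0)
        return A_AGE;
    return DECLINE;
}

/* Handler B: accepts <name>, but only inside an element in state A_CAT. */
int b_start(int parent, const char *name)
{
    if (parent == A_CAT && strcmp(name, "name") == 0)
        return B_NAME;
    return DECLINE;
}

/* Handler A character data: accumulate digits while inside <age>. */
void a_cdata(struct cat *c, int state, const char *text, size_t len)
{
    if (state != A_AGE)
        return;
    for (size_t i = 0; i < len; i++)
        if (text[i] >= '0' && text[i] <= '9')
            c->age = c->age * 10 + (text[i] - '0');
}

/* Handler B character data: append to the name, truncating if too long. */
void b_cdata(struct cat *c, int state, const char *text, size_t len)
{
    size_t used = strlen(c->name);

    if (state != B_NAME)
        return;
    assert(used < sizeof c->name);
    if (len > sizeof c->name - 1 - used)
        len = sizeof c->name - 1 - used;
    memcpy(c->name + used, text, len);
    c->name[used + len] = '\0';
}
```

Handler B never sees the cat element itself, yet it can tell that name appears in the right place because the parent state 42 is passed to its start callback. Keeping the two ranges disjoint ensures that a state value alone identifies which handler assigned it.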

XML namespaces

To support XML namespaces, every element name is represented as a (namespace, name) pair. The start-element and end-element callbacks are passed namespace and name strings accordingly. If an element in the XML document has no declared namespace, the namespace given will be the empty string, "".
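A start-element callback would then typically compare both halves of the pair. The following is a minimal sketch: elm_is, prop_start and STATE_PROP are hypothetical names, the callback signature is simplified for illustration, and "DAV:" is used as the namespace because it is the one WebDAV elements such as prop belong to.

```c
#include <assert.h>
#include <string.h>

#define DECLINE 0      /* assumption: returning 0 declines the element */
#define STATE_PROP 1   /* hypothetical state for an accepted DAV: prop element */

/* Returns nonzero when (nspace, name) identifies the wanted element. */
int elm_is(const char *nspace, const char *name,
           const char *want_nspace, const char *want_name)
{
    assert(nspace != NULL && name != NULL);
    return strcmp(nspace, want_nspace) == 0 && strcmp(name, want_name) == 0;
}

/* Accepts the "prop" element only when it is in the "DAV:" namespace. */
int prop_start(int parent, const char *nspace, const char *name)
{
    (void)parent;   /* the parent state is not needed in this sketch */
    if (elm_is(nspace, name, "DAV:", "prop"))
        return STATE_PROP;
    return DECLINE;
}
```

An element named prop with no declared namespace arrives with the namespace "", so prop_start declines it even though the local name matches.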



[1] This “problem” only needs solving because the SAX interface is so inflexible when implemented as C function callbacks; a better approach would be to use an XML parser interface which is not based on callbacks.