Configuring Woodstox XML parser: basic Stax properties

@cowtowncoder
3 min read · Jul 21, 2017

As many know (I hope!), Woodstox is the best general-purpose Java XML parser. It implements both main incremental parsing APIs, Stax AND SAX, and offers full XML feature support and conformance, similar to Apache Xerces (which is a great XML parser but only implements SAX, not Stax).
And it does all of this efficiently, that is, fast.

Woodstox configuration basics

Beyond standard Stax and SAX configurability, Woodstox allows a much wider variety of configuration. When using the Stax API, all configuration goes through the same standard setProperty() calls, regardless of whether the setting itself is defined as part of Stax:

XMLInputFactory inputF = XMLInputFactory.newFactory();
inputF.setProperty(<property-to-set>, <value>);
XMLOutputFactory outputF = XMLOutputFactory.newFactory();
outputF.setProperty(<property-to-set>, <value>);

where property-to-set is a String with one of the pre-defined constant values. The value is often of type Boolean, but not always: this depends on the configuration setting in question.
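
To make the template above concrete, here is a minimal sketch that fills in the placeholders with standard Stax constants (described further down) and reads a value back. The class and method names are made up for illustration, and the code uses only JDK classes, so it runs against whichever Stax implementation the factory lookup finds, Woodstox or otherwise.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;

// Illustrative class name; not part of Stax or Woodstox.
class StaxPropertyBasics {
    // Input side: two standard Boolean-valued properties.
    static XMLInputFactory newInputFactory() {
        XMLInputFactory inputF = XMLInputFactory.newFactory();
        inputF.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        inputF.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        return inputF;
    }

    // Output side: the one standard output property.
    static XMLOutputFactory newOutputFactory() {
        XMLOutputFactory outputF = XMLOutputFactory.newFactory();
        outputF.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        return outputF;
    }

    public static void main(String[] args) {
        XMLInputFactory inputF = newInputFactory();
        // getProperty() reads a setting back
        System.out.println(inputF.getProperty(XMLInputFactory.IS_COALESCING));
        // isPropertySupported() lets you probe a property name before setting it
        System.out.println(inputF.isPropertySupported(XMLInputFactory.IS_VALIDATING));
        System.out.println(newOutputFactory().getProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES));
    }
}
```

Passing a property name the implementation does not recognize to setProperty() results in an IllegalArgumentException, so isPropertySupported() is a handy guard when the same code may run with different Stax implementations.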

These constants can be divided into 3 different groups:

  1. Standard Stax properties (see XMLInputFactory and XMLOutputFactory for constants): implemented by all compliant Stax implementations
  2. Stax2 extension API (XMLInputFactory2 and XMLOutputFactory2): implemented by all Stax2-compliant parsers (currently this means Woodstox and Aalto)
  3. Woodstox-specific properties (WstxInputProperties, WstxOutputProperties), supported only by Woodstox itself

In this first entry of a (brief) series, let’s have a look at the first category.

Standard Stax Properties

The set of standard properties is covered by the JDK Javadocs (and I link entries below). Most are Boolean-valued: I only mention the type if it is something different.

XMLInputFactory defines a few settings; most important are:

  • IS_COALESCING: if enabled, the parser ensures that all adjacent text (“cdata”) segments are combined into a single CHARACTERS event. If disabled, text may be returned as an arbitrary number of events (of type CHARACTERS and CDATA), often split at places where entities are used
  • IS_NAMESPACE_AWARE: whether namespace processing is enabled. If disabled, namespace binding does not occur, and the full element/attribute name is reported as the “local name” (for example: <xml:space> would have a local name of “xml:space”, and no namespace prefix or URI). If enabled, namespace declarations are handled and prefix/namespace binding is applied as expected
  • SUPPORT_DTD: whether DTD subset (definition) processing is enabled. If enabled, DTD definitions are read (both internal and external), and parsed entities are expanded. If disabled, the internal DTD subset is skipped and the external subset is not read.
    NOTE: if disabled, no DTD validation occurs, regardless of other settings
  • IS_VALIDATING: whether DTD validation is enabled (note: does not affect XML Schema, Relax NG, or other validation settings).
    NOTE: only takes effect if SUPPORT_DTD is also enabled
  • RESOLVER: unlike the other options, NOT of type Boolean but XMLResolver. Allows overriding how external DTD subsets (and parsed external entities defined there) are read, for example to add caching, or to rewrite, replace or simply remove external DTD definitions. Often used for security purposes to just prevent external reads
  • IS_SUPPORTING_EXTERNAL_ENTITIES: if DTD processing is enabled (see SUPPORT_DTD), external entities (references to resources outside of the XML document or DTD subset itself) are recognized, and this setting controls whether they are expanded. Disabling it is typically done for security reasons: if XML content comes from untrusted sources, enabling expansion is not a good idea.
    If disabled, entities are only reported as entity references; if enabled, entities are expanded as per the XML specification and reported as XML tokens.
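
Two of the settings above are easiest to understand by trying them out. The sketch below is illustrative only (class and method names are made up, and it uses just the JDK's javax.xml.stream API): one method counts how many text events a small document produces with and without IS_COALESCING, and another builds a factory configured for untrusted input by turning off SUPPORT_DTD and IS_SUPPORTING_EXTERNAL_ENTITIES and installing a RESOLVER that refuses every external lookup.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative class name; not part of Stax or Woodstox.
class StaxInputSettings {
    // Counts CHARACTERS and CDATA events reported for the given document.
    static int countTextEvents(String xml, boolean coalescing) {
        XMLInputFactory f = XMLInputFactory.newFactory();
        f.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        try {
            XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (r.hasNext()) {
                int type = r.next();
                if (type == XMLStreamConstants.CHARACTERS || type == XMLStreamConstants.CDATA) {
                    count++;
                }
            }
            r.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to parse: " + xml, e);
        }
    }

    // One possible configuration for content from untrusted sources.
    static XMLInputFactory hardenedFactory() {
        XMLInputFactory f = XMLInputFactory.newFactory();
        f.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        f.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // Extra safety net: a resolver that blocks any external read it is asked for
        XMLResolver blockAll = (publicId, systemId, baseUri, namespace) -> {
            throw new XMLStreamException("External entity blocked: " + systemId);
        };
        f.setProperty(XMLInputFactory.RESOLVER, blockAll);
        return f;
    }

    public static void main(String[] args) {
        String xml = "<root>a<![CDATA[b]]>c</root>";
        // With coalescing the three adjacent segments arrive as a single event;
        // without it, the number of events is up to the implementation.
        System.out.println("coalescing:     " + countTextEvents(xml, true));
        System.out.println("non-coalescing: " + countTextEvents(xml, false));
        System.out.println("SUPPORT_DTD:    " + hardenedFactory().getProperty(XMLInputFactory.SUPPORT_DTD));
    }
}
```

Whether you need all three hardening steps depends on your content: disabling SUPPORT_DTD alone already prevents external subset reads, and the other two are there as defense in depth.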

XMLOutputFactory has only one configuration setting:

  • IS_REPAIRING_NAMESPACES: should more properly be called “automatic namespaces”; enabling it removes the need to declare namespace bindings before use. If enabled, passing namespace prefixes is optional, and all namespace declarations are automatically written by XMLStreamWriter. If disabled, the caller must explicitly write namespace declarations. Note that in the latter case the output may not be namespace-compliant if namespace declarations (bindings) are missing or misplaced.
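
The difference is easy to see by writing the same element in both modes. In this minimal sketch (class name, prefix "ex" and URI "urn:example" are all made up for illustration), the caller never calls writeNamespace(), so only the repairing writer is expected to emit a declaration for the namespace.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Illustrative class name; not part of Stax or Woodstox.
class RepairingNamespacesDemo {
    // Writes one namespaced element WITHOUT declaring its namespace explicitly.
    static String write(boolean repairing) {
        XMLOutputFactory outputF = XMLOutputFactory.newFactory();
        outputF.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, repairing);
        StringWriter sw = new StringWriter();
        try {
            XMLStreamWriter w = outputF.createXMLStreamWriter(sw);
            // Deliberately no w.writeNamespace("ex", "urn:example") here
            w.writeStartElement("ex", "item", "urn:example");
            w.writeCharacters("value");
            w.writeEndElement();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to write XML", e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        // Repairing mode adds the missing declaration for "urn:example"
        System.out.println("repairing:     " + write(true));
        // Non-repairing mode writes the prefixed name as-is, leaving the prefix unbound
        System.out.println("non-repairing: " + write(false));
    }
}
```

In non-repairing mode you would add the writeNamespace() call right after writeStartElement() to get well-formed, namespace-compliant output.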

Stax2 and Woodstox-specific properties

Stay tuned for more on wider sets of configuration beyond basic Stax configuration settings!

Written by @cowtowncoder

Open Source developer, most known for Jackson data processor (nee “JSON library”), author of many, many other OSS libraries for Java, from ClassMate to Woodstox
