Configuring Woodstox XML parser: basic Stax properties
As many know (I hope!), Woodstox is the best general-purpose Java XML parser. It implements both main incremental parsing APIs, Stax and SAX, and offers full XML feature support and conformance, similar to Apache Xerces (which is a great XML parser but only implements SAX, not Stax).
And it does all of this efficiently, that is, fast.
Woodstox configuration basics
Beyond standard Stax and SAX configurability, Woodstox allows a much wider variety of configuration. When using the Stax API, all configuration goes through the same standard Stax calls, regardless of whether the setting itself is defined as part of Stax:
XMLInputFactory inputF = XMLInputFactory.newFactory();
inputF.setProperty(<property-to-set>, <value>);
XMLOutputFactory outputF = XMLOutputFactory.newFactory();
outputF.setProperty(<property-to-set>, <value>);
where property-to-set is a String with one of the pre-defined constant values. The value is often of type Boolean, but not always: this depends on the configuration setting in question.
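As a concrete illustration of the pattern above, here is a minimal sketch that sets one standard Boolean property on each factory and reads it back with getProperty(). The class and method names are made up for this example, and it runs against whichever Stax implementation newFactory() finds (Woodstox if it is on the classpath, otherwise the one bundled with the JDK).

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;

// Hypothetical demo class; not part of Stax or Woodstox.
public class StaxConfigBasics {

    // Creates an input factory with one standard Boolean property set.
    static XMLInputFactory newInputFactory() {
        XMLInputFactory inputF = XMLInputFactory.newFactory();
        inputF.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return inputF;
    }

    // Creates an output factory with its single standard property set.
    static XMLOutputFactory newOutputFactory() {
        XMLOutputFactory outputF = XMLOutputFactory.newFactory();
        outputF.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        return outputF;
    }

    public static void main(String[] args) {
        // Read the values back to confirm the factories accepted them.
        System.out.println(newInputFactory().getProperty(XMLInputFactory.IS_COALESCING));
        System.out.println(newOutputFactory().getProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES));
    }
}
```

Settings apply to readers and writers created from the factory afterwards, so it makes sense to configure the factory once, up front, and then reuse it.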
These constants can be divided into 3 different groups:
- Standard Stax properties (see XMLInputFactory and XMLOutputFactory for constants): implemented by all compliant Stax implementations
- Stax2 extension API (XMLInputFactory2 and XMLOutputFactory2): implemented by all Stax2-compliant parsers (currently this means Woodstox and Aalto)
- Woodstox-specific properties (WstxInputProperties, WstxOutputProperties), supported only by Woodstox itself
In this first entry of a (brief) series, let’s have a look at the first category.
Standard Stax Properties
The set of standard properties is covered by the JDK Javadocs (and I link entries below). Most are Boolean-valued: I only mention the type if it is something different.
XMLInputFactory defines a few settings; the most important are:
- IS_COALESCING: if enabled, the parser will ensure that all adjacent text (“cdata”) segments are combined into a single CHARACTERS event. If disabled, text segments may be returned in an arbitrary number of events (of type CHARACTERS and CDATA), often split at places where entities are used
- IS_NAMESPACE_AWARE: whether namespace processing is enabled or not. If disabled, namespace binding does not occur, and the full element/attribute name is reported as “local name” (for example: <xml:space> would have a local name of “xml:space”, and no namespace prefix or URI). If enabled, namespace declarations are handled and prefix/namespace binding is applied as expected
- SUPPORT_DTD: whether DTD subset (definition) processing is enabled or not. If enabled, DTD definitions are read (both internal and external), and parsed entities are expanded. If disabled, the internal DTD subset is skipped and the external subset is not read. NOTE: if disabled, no DTD validation occurs, regardless of other settings
- IS_VALIDATING: whether DTD validation is enabled or not (note: does not affect XML Schema, Relax NG, or other validation settings). NOTE: only takes effect if SUPPORT_DTD is also enabled
- RESOLVER: unlike other options, NOT of type Boolean but XMLResolver. Allows overriding the reading of external DTD subsets (and parsed external entities defined from there), to (for example) add caching, or allow rewriting, replacing or just removing external DTD definitions. Often used for security purposes to just prevent external reads
- IS_SUPPORTING_EXTERNAL_ENTITIES: if DTD processing is enabled (see SUPPORT_DTD), external entities (references to external resources outside of the XML document or the DTD subset itself) are recognized and processed; this setting controls whether they are also expanded. Disabling it is typically done for security reasons: if XML content comes from untrusted sources, enabling expansion is not a good idea. If disabled, entities are only reported as entity references; if enabled, entities are expanded as per the XML specification and reported as XML tokens.
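To make the input settings above more tangible, here is a small sketch of a factory configured for untrusted content: coalescing on, DTD processing and external entity expansion off, and a resolver that hands back empty content instead of reading anything external. The class name, the helper names and the "empty stream" resolver strategy are assumptions of this example, not something Stax or Woodstox prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Stax or Woodstox.
public class StaxInputSettings {

    // Builds a factory with conservative settings for untrusted input.
    static XMLInputFactory newHardenedFactory() {
        XMLInputFactory f = XMLInputFactory.newFactory();
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        f.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        f.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        f.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // Assumption: answering every external lookup with empty content
        // is an acceptable way to block external reads for this use case.
        XMLResolver emptyResolver =
            (publicId, systemId, baseUri, namespace) -> new ByteArrayInputStream(new byte[0]);
        f.setProperty(XMLInputFactory.RESOLVER, emptyResolver);
        return f;
    }

    // Collects the text of every CHARACTERS/CDATA event in the document.
    static List<String> readText(String xml) {
        List<String> segments = new ArrayList<>();
        try {
            XMLStreamReader r = newHardenedFactory().createXMLStreamReader(new StringReader(xml));
            try {
                while (r.hasNext()) {
                    int type = r.next();
                    if (type == XMLStreamConstants.CHARACTERS || type == XMLStreamConstants.CDATA) {
                        segments.add(r.getText());
                    }
                }
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to parse XML", e);
        }
        return segments;
    }

    public static void main(String[] args) {
        // Plain text, a CDATA section and more plain text, all adjacent.
        System.out.println(readText("<root>a<![CDATA[b]]>c</root>"));
    }
}
```

With IS_COALESCING enabled, the three adjacent segments should arrive as one combined event; switching it off lets the parser report them separately, so calling code would then have to stitch the pieces together itself.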
XMLOutputFactory has only one configuration setting:
- IS_REPAIRING_NAMESPACES: more properly should be called “automatic namespaces”; enabling it removes the need to declare namespace bindings before use. If enabled, passing of namespace prefixes is optional, and all namespace declarations are automatically written by XMLStreamWriter. If disabled, the caller must explicitly write namespace declarations. Note that in the latter case it is possible that output is not namespace-compliant in cases where namespace declarations (bindings) are missing or misplaced.
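The difference is easiest to see by writing the same element in both modes. The sketch below deliberately never calls writeNamespace(), so only the repairing writer is expected to emit the declaration. The class name, the "ex" prefix and the http://example.com/ns URI are placeholders chosen for this example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class; not part of Stax or Woodstox.
public class RepairingNamespacesDemo {

    // Writes one namespaced element without an explicit writeNamespace() call.
    static String write(boolean repairing) {
        XMLOutputFactory f = XMLOutputFactory.newFactory();
        f.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, repairing);
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = f.createXMLStreamWriter(out);
            w.writeStartElement("ex", "root", "http://example.com/ns");
            w.writeCharacters("text");
            w.writeEndElement();
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("repairing:     " + write(true));
        System.out.println("non-repairing: " + write(false));
    }
}
```

In non-repairing mode the fix is a w.writeNamespace("ex", "http://example.com/ns") call right after writeStartElement(); without it the output uses a prefix that is never bound, which is exactly the non-compliant case described above.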
Stax2 and Woodstox-specific properties
Stay tuned for more on wider sets of configuration beyond basic Stax configuration settings!