Configuring Woodstox XML parser: Stax2 properties

As part 2 of the overview of the Woodstox (Java, Stax) XML parser, let’s have a look at another set of configuration options: those of the Stax2 extension API.

Another way to configure: profiles

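As mentioned earlier, the standard Stax way of configuring anything is through factories, using the setProperty(name, value) method. This applies to Stax2 as well. In addition, the Stax2 factories offer “profiles”: methods that set several properties in one call.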

  • configureForLowMemUsage (XMLInputFactory2): tries to reduce the amount of memory retained during processing by disabling coalescing (which lets the parser report smaller segments) and disabling P_PRESERVE_LOCATION
  • configureForRoundTripping (XMLInputFactory2): tries to preserve event information as fully as possible, so that writing events straight back out does not alter physical aspects of the XML: disables coalescing, preserves the distinction between CHARACTERS and CDATA, and disables automatic entity expansion (so entities may be written out)
  • configureForSpeed (XMLInputFactory2): tries to minimize the performance overhead of options: disables coalescing and P_PRESERVE_LOCATION; enables intern()ing of both element/attribute names and namespace URIs
  • configureForXmlConformance (XMLInputFactory2): enables features required to conform to the XML 1.x specification: namespaces, DTD processing
  • configureForXmlConformance (XMLOutputFactory2): enables all validation options to try to prevent potential well-formedness problems (for example with namespace bindings), but not all repairing options
  • configureForSpeed (XMLOutputFactory2): optimizes for output performance by disabling validation operations that require scanning over contents; in a way the opposite of the conformance/robustness profiles
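
Several of the profiles above toggle coalescing, so it may help to see what that setting changes. The following is a minimal sketch, not Woodstox-specific code: it uses only the JDK's javax.xml.stream API so that it runs without extra dependencies, and the class and method names (CoalescingDemo, textSegments) are made up for illustration. How many segments a non-coalescing parser reports depends on the implementation, so no exact output is claimed here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class CoalescingDemo {

    /** Parses the document and returns the text of each CHARACTERS or CDATA event, in order. */
    static List<String> textSegments(String xml, boolean coalescing) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard Stax property; the profiles described above switch this off.
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);

        List<String> segments = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                    segments.add(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to parse sample document", e);
        }
        return segments;
    }

    public static void main(String[] args) {
        String xml = "<root>a<![CDATA[b]]>c</root>";
        System.out.println("coalescing on:  " + textSegments(xml, true));
        System.out.println("coalescing off: " + textSegments(xml, false));
    }
}
```

With coalescing enabled, adjacent text and CDATA are merged before being reported; with it disabled, the parser is free to hand back smaller pieces, which is what the low-memory, round-tripping and speed profiles rely on.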

Stax2 configuration properties

Using a profile sets values for multiple properties (sometimes both plain Stax and Stax2 properties), but it is always possible to set individual properties directly as well. Let’s have a look at which Stax2 extension properties exist and are supported by Woodstox. Note: most are Boolean-valued; I only mention the type if it is something other than Boolean.

  • P_DTD_OVERRIDE (default: null, value type DTDValidationSchema): property that may be set if a specific DTD instance is to be used instead of what the DOCTYPE declaration specifies (if anything).
    NOTE: reading a DTDValidationSchema is worth its own article, but basically the entry point is `XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD)`
  • P_INTERN_NAMES (default: true): Whether element and attribute names (“local name” part) are interned with String.intern() before being returned. Doing so usually saves memory and helps speed, but occasionally it may be necessary to disable this feature if the number of distinct names is unbounded: for example, if names are randomly generated (like UUIDs).
  • P_INTERN_NS_URIS (default: true): similar to above, but applies to namespace URIs.
  • P_LAZY_PARSING (default: true): Controls whether parsing is “lazy” or “eager”: “eager” meaning that each event is completely parsed when XMLStreamReader.next() is called; “lazy” that only a small part is parsed at that point, and the rest is only parsed if and as needed. Benefits of lazy parsing include much faster skipping of unneeded content (esp. textual content, comments and processing instructions); a possible downside is that sometimes error reporting may occur later than expected (during actual content access or skipping, that is, when calling next() for the following event).
  • P_PRESERVE_LOCATION (default: true): Controls whether XMLStreamLocation information is included in XMLEvent instances or not. Disabling this feature reduces memory usage and improves processing speed modestly, but only when using “Event API” (XMLEventReader).
  • P_REPORT_CDATA (default: true): Whether XML CDATA sections are reported as CDATA Stax event (true) or as general CHARACTERS (false)
  • P_REPORT_PROLOG_WHITESPACE (default: false): When disabled (`false`), white-space outside XML root element is skipped and not reported; only possible COMMENTs and PROCESSING_INSTRUCTIONs are reported. But if enabled, additional SPACE events are reported — this is mostly (only) useful if trying to fully replicate document indentation outside of root element
  • P_TEXT_ESCAPER (default: null, value type EscapingWriterFactory): similar to P_ATTR_VALUE_ESCAPER but used for textual segments (“character data”, NOT including CDATA sections, as they do not allow escaping). Similarly used either for changing escaping details, or for more advanced filtering/modifying of textual content to output.
  • P_AUTO_CLOSE_OUTPUT (default: false): similar to P_AUTO_CLOSE_INPUT, determines whether underlying OutputStream or Writer is automatically closed when XMLStreamWriter is closed — default is false due to Stax 1.0 specification mandating this behavior.
  • P_AUTOMATIC_EMPTY_ELEMENTS (default: true): When a sequence of START_ELEMENT and END_ELEMENT is output (with possible attributes in between, but no child elements or textual content), it is possible to output either a so-called empty element (like <element />) or a fully written-out pair (<element></element>). If set to true, an empty element is written; if false, separate start/end tags are written.
  • P_AUTOMATIC_NS_PREFIX (default: "wstxns"): When using the “repairing” writer mode, in which namespace URIs are automatically bound, namespace prefixes are generated using this String as the beginning, followed by a sequence number to keep prefixes unique.
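
All of the properties above go through the factory's setProperty(name, value) method. Since not every Stax implementation knows every property name, one defensive pattern is to check isPropertySupported(name) first. The sketch below shows that pattern using only the standard javax.xml.stream API; the class PropertySetter, the method applySupported and the property name "example.hypothetical.property" are invented for illustration and are not part of Stax, Stax2 or Woodstox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper class, for illustration only.
class PropertySetter {

    /**
     * Sets every property the factory reports as supported and
     * returns the names of the properties that were skipped.
     */
    static List<String> applySupported(XMLInputFactory factory, Map<String, Object> settings) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, Object> entry : settings.entrySet()) {
            if (factory.isPropertySupported(entry.getKey())) {
                factory.setProperty(entry.getKey(), entry.getValue());
            } else {
                skipped.add(entry.getKey());
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newInstance();

        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        settings.put(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        // Made-up name that no implementation is expected to recognize.
        settings.put("example.hypothetical.property", Boolean.TRUE);

        System.out.println("skipped: " + applySupported(factory, settings));
        System.out.println("coalescing: " + factory.getProperty(XMLInputFactory.IS_COALESCING));
    }
}
```

With Woodstox on the classpath, the constants from the list above, such as XMLInputFactory2.P_REPORT_CDATA or XMLInputFactory2.P_INTERN_NAMES, would be passed to setProperty in the same way; output-side properties are set on the output factory instead.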

And last but not least, Woodstox-specific properties

Now that we have covered 2 out of 3 property sets, we are almost ready to have a look at the largest set of properties: ones specific (for now) to Woodstox itself. But that’s worth its own entry. Stay tuned!

Written by

Open Source developer, most known for Jackson data processor (nee “JSON library”), author of many, many other OSS libraries for Java, from ClassMate to Woodstox
