Configuring Woodstox XML parser: Woodstox-specific properties

As part 3 of this overview of the Woodstox (Java, Stax) XML parser, let’s have a look at another set of configuration options: Woodstox-specific properties.
(The first part covered “basic Stax”, the second “Stax2 extensions”.)

Finding Property name definitions

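Woodstox-specific property names are defined in 2 classes (both in package com.ctc.wstx.api):

  • WstxInputProperties: input-side properties, set on the input factory
  • WstxOutputProperties: output-side properties, set on the output factory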

Woodstox-specific input properties: overriding DTD handling

The first set of input-side properties relates to DTD handling. They are undefined by default, but can be set to custom handlers to change the default handling of DTD subsets and the expansion of entities defined within them.

  • P_ENTITY_RESOLVER (of type XMLResolver, default: null): similar to P_DTD_RESOLVER, but overrides resolution of External Parsed Entities defined within DTD subset (internal or external).
    NOTE! If defined, this will override the standard “XMLInputFactory#RESOLVER” setting (for this purpose)
  • P_UNDECLARED_ENTITY_RESOLVER (of type XMLResolver, default: null): similar to P_ENTITY_RESOLVER, but used in cases where the entity has not been declared: allows graceful handling of a situation that would otherwise result in an exception. A common implementation would simply provide either “empty” contents (to effectively remove the entity), or add a marker to indicate an error for further processing.
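
The “marker” variant of an undeclared-entity resolver could look roughly like the sketch below. It only uses the standard javax.xml.stream types, so it compiles without Woodstox; the property key is written out as a String and is an assumption about the value behind WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER (use the constant itself when Woodstox is on the classpath). The marker format, the class name and the idea that the entity name arrives as the last argument of resolveEntity() are likewise assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;

final class UndeclaredEntityExample {
    // ASSUMPTION: string value behind WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER.
    // With Woodstox on the classpath, reference the constant instead.
    static final String P_UNDECLARED_ENTITY_RESOLVER = "com.ctc.wstx.undeclaredEntityResolver";

    /** Hypothetical marker format; pick whatever later processing can detect. */
    static String markerFor(String entityName) {
        return "[UNDECLARED:" + entityName + "]";
    }

    /** Resolver that substitutes a marker; return empty content instead to drop the entity. */
    static XMLResolver markerResolver() {
        // ASSUMPTION: the entity name is passed as the last ("namespace") argument.
        return (publicId, systemId, baseUri, name) ->
            new ByteArrayInputStream(markerFor(name).getBytes(StandardCharsets.UTF_8));
    }

    /** Installs the resolver only if the factory recognizes the property. */
    static boolean install(XMLInputFactory factory) {
        if (!factory.isPropertySupported(P_UNDECLARED_ENTITY_RESOLVER)) {
            return false; // not Woodstox (or the key differs): leave the factory untouched
        }
        factory.setProperty(P_UNDECLARED_ENTITY_RESOLVER, markerResolver());
        return true;
    }

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        System.out.println("resolver installed: " + install(factory));
    }
}
```

Guarding with isPropertySupported() keeps the same code usable when a different Stax implementation happens to be picked up at runtime.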

Woodstox-specific input properties: limits

Another set of properties added in Woodstox 4.2 allows specifying maximum limits for certain input constructs. These are typically used to protect against possible Denial-of-Service (DoS) attacks, wherein XML-based web services may be attacked by specifically crafted documents that could cause processing problems by excessive memory or computing power usage.

  • P_MAX_ATTRIBUTES_PER_ELEMENT (default: 1000): specifies maximum number of distinct attributes allowed for any single XML element.
  • P_MAX_CHARACTERS (default: unlimited): specifies maximum total length of the input XML document, in characters. The check runs as input blocks are read, so it is not exact to the character: with a limit of 5000 characters, for example, a violation may only be reported after 5400 characters (depending on buffer boundaries), but never before the limit has been reached. So it guarantees that you will be able to process documents up to and including the limit.
    NOTE: this refers to raw input size and NOT input size after possible entity expansion (there is no limit for the latter at this point)
  • P_MAX_CHILDREN_PER_ELEMENT (default: unlimited): Similar to P_MAX_ATTRIBUTES_PER_ELEMENT but limits number of child elements within any given element.
  • P_MAX_ELEMENT_COUNT (default: unlimited): specifies maximum total number of elements within a single XML document
  • P_MAX_ELEMENT_DEPTH (default: 1000): specifies maximum nesting level of elements, that is, the number of elements that may be open (nested) at any given point. This is one of the settings that some users have had to increase for legitimate documents.
  • P_MAX_ENTITY_COUNT (default: 100,000): specifies maximum total number of entity expansions allowed per XML document (nested and non-nested)
  • P_MAX_ENTITY_DEPTH (default: 500): specifies maximum nesting level of entity expansion (distinct from total number).
  • P_MAX_TEXT_LENGTH (default: unlimited): specifies maximum contiguous length of any character data segment (either regular text segment or CDATA section). Handling varies between coalescing mode (in which all adjacent textual segments are combined) and non-coalescing; in latter case limit is per-segment
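
As a rough sketch, tightening a few of these limits on a factory might look like the following. The String keys are assumptions about the values behind the corresponding WstxInputProperties constants (P_MAX_ELEMENT_DEPTH and so on), spelled out so the example compiles with the JDK alone; the chosen numbers and the class name are purely illustrative. Each property is applied only if the factory reports support for it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class InputLimitsExample {
    /** Illustrative limits; keys are ASSUMED string values of WstxInputProperties constants. */
    static Map<String, Object> limits() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("com.ctc.wstx.maxElementDepth", 200);          // P_MAX_ELEMENT_DEPTH
        m.put("com.ctc.wstx.maxAttributesPerElement", 100);  // P_MAX_ATTRIBUTES_PER_ELEMENT
        m.put("com.ctc.wstx.maxCharacters", 10_000_000);     // P_MAX_CHARACTERS
        m.put("com.ctc.wstx.maxTextLength", 1_000_000);      // P_MAX_TEXT_LENGTH
        return m;
    }

    /** Creates a factory and applies every limit the implementation recognizes. */
    static XMLInputFactory newLimitedFactory() {
        XMLInputFactory f = XMLInputFactory.newFactory();
        for (Map.Entry<String, Object> e : limits().entrySet()) {
            if (f.isPropertySupported(e.getKey())) {
                f.setProperty(e.getKey(), e.getValue());
            }
        }
        return f;
    }

    /** Counts start elements; a limit violation would surface as an XMLStreamException. */
    static int countElements(XMLInputFactory f, String xml) {
        try {
            XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            r.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed input or limit exceeded", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements(newLimitedFactory(), "<a><b/><b/></a>"));
    }
}
```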

Woodstox-specific input properties: other

Finally, there are a number of other input properties that may be configured:

  • P_CACHE_DTDS (default: true): allows disabling of DTD caching which is enabled by default. Most likely use case is that where caller has separate external caching of DTDs, or where public/system id used as key is not unique (and cache collisions would occur)
  • P_CACHE_DTDS_BY_PUBLIC_ID (default: true): specifies which identifier is used as the cache key (if caching is enabled as per P_CACHE_DTDS): if true, the public id is used; if false, the system id
  • P_INPUT_BUFFER_LENGTH (default: 4000): specifies length of internal read buffer, in characters.
  • P_INPUT_PARSING_MODE (default: WstxInputProperties.PARSING_MODE_DOCUMENT): allows changing operating mode of parser for non-conforming XML content. Default mode requires input of exactly one well-formed XML document (which means one and only one root element). Alternatives are PARSING_MODE_FRAGMENT and PARSING_MODE_DOCUMENTS, both of which allow zero or more root elements; “documents” mode further allows inclusion of full documents with their separate xml and DOCTYPE declarations. “Fragment” mode is most often used to read a subset of a full document, whereas “documents” mode is used to read a stream that contains a sequence of individual well-formed documents.
  • P_MIN_TEXT_SEGMENT (default: 64): when NOT using “coalescing” mode, this setting defines smallest partial text segment (that is, part of one contiguous text segment) that may be reported — intention being to reduce likelihood of returning tiny segments while allowing parser to avoid having to buffer longer segments completely.
    Setting this value to Integer.MAX_VALUE will effectively prevent splitting of individual segments without forcing coalescing of adjacent segments, so that is one common override
  • P_NORMALIZE_LFS (default: true): XML specification requires parsers to convert “alt linefeeds” (that is, \r\n and \r) into canonical linefeed (\n), but disabling this property allows exposing actual linefeed without normalization
  • P_RETURN_NULL_FOR_DEFAULT_NAMESPACE (default: false): whether so-called unbound default namespace (one that non-prefixed attributes have, and non-prefixed elements when there is no explicit binding for default namespace) is reported as null (enabled) or empty String (disabled).
  • P_TREAT_CHAR_REFS_AS_ENTS (default: false): Normally character references (like &#38;) are simply expanded and reported as part of character data; but if this property is set to true they will instead be reported as ENTITY_REFERENCE tokens. This may occasionally be useful when trying to fully reproduce input representation of an XML document, including choice of escaping of special characters.
    NOTE: this only works for textual content; it is not possible to support this for attribute values (as there are no separate tokens; attributes are accessible via START_ELEMENT token only)
  • P_VALIDATE_TEXT_CHARS (default: true): XML specification requires verification that character segments only contain characters legal for XML specification (version 1.0 or 1.1, depending on xml declaration), and reporting error (by throwing exception) if illegal character (such as most of control codes in 0x00–0x1F range) is encountered. By disabling this property it is possible to prevent this validation; either to allow inclusion of otherwise illegal characters (that is, processing of non-wellformed “xml” content), or to achieve minor performance gain (as validation adds some amount of processing cost — however, typically not enough for it to really matter)
    NOTE! As of now (March 2018) this feature is NOT YET implemented — see this issue for details.
  • P_XML10_ALLOW_ALL_ESCAPED_CHARS — to be added in Woodstox 5.2 — will allow decoding of those XML 1.1 character entities (control characters in ASCII range of 0x00–0x1F) that are not otherwise valid in XML 1.0
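
The practical consequence of P_MIN_TEXT_SEGMENT is that in non-coalescing mode a caller should not assume one text event per text segment. Here is a small sketch of a reader loop that copes with split segments by accumulating them, and that applies the Integer.MAX_VALUE override mentioned above when it is supported. It compiles against the JDK alone; the String key is an assumption about the value behind WstxInputProperties.P_MIN_TEXT_SEGMENT, and the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class TextSegmentsExample {
    // ASSUMPTION: string value behind WstxInputProperties.P_MIN_TEXT_SEGMENT.
    static final String P_MIN_TEXT_SEGMENT = "com.ctc.wstx.minTextSegment";

    static XMLInputFactory newFactory() {
        XMLInputFactory f = XMLInputFactory.newFactory();
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        if (f.isPropertySupported(P_MIN_TEXT_SEGMENT)) {
            // Ask not to split individual segments, without coalescing adjacent ones.
            f.setProperty(P_MIN_TEXT_SEGMENT, Integer.MAX_VALUE);
        }
        return f;
    }

    /** Concatenates all text in the document, however many events it arrives in. */
    static String readAllText(String xml) {
        try {
            XMLStreamReader r = newFactory().createXMLStreamReader(new StringReader(xml));
            StringBuilder sb = new StringBuilder();
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                    case XMLStreamConstants.SPACE:
                        sb.append(r.getText()); // may be only part of a longer segment
                        break;
                    default:
                        break;
                }
            }
            r.close();
            return sb.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAllText("<a>hello<![CDATA[ world]]></a>"));
    }
}
```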

Woodstox-specific output properties, validation

On the output side, a large group of settings relates to (optional) verification of the well-formedness of content, along with some related settings that work around problems that would occur if output were written exactly as the calls imply, by writing it in a modified form instead.

  • P_OUTPUT_INVALID_CHAR_HANDLER (default: undefined; type InvalidCharHandler): in case textual content would contain a character that is NOT legal in XML, a custom handler may be installed to convert the invalid char into a valid one (via convertInvalidChar() method). Most often this would produce something like a simple space, or perhaps a question mark. Without a custom handler, default behavior is to throw an XMLStreamException to indicate the problem.
  • P_OUTPUT_VALIDATE_ATTR (default: false): if enabled, will verify that no duplicate attributes (the same attribute written more than once for an element) are output. This requires keeping track of attributes output and adds some memory and processing overhead.
  • P_OUTPUT_VALIDATE_CONTENT (default: true): if enabled, will verify well-formedness of textual content with respect to the problems listed for P_OUTPUT_FIX_CONTENT, and report problems (as XMLStreamExceptions) if an attempt is made to write such output (and output fixing is not enabled)
  • P_OUTPUT_VALIDATE_NAMES (default: false): if enabled, will verify that names of elements and attributes only contain characters legal for XML names. Since there is no escape mechanism for names, inclusion of such characters can only be reported by throwing XMLStreamException.
  • P_OUTPUT_VALIDATE_STRUCTURE (default: true): if enabled, will verify structural well-formedness of output (that all start/end elements match; that there is only one root element). These checks do not impose measurable processing overhead since all bookkeeping (wrt start/end elements) has to be done regardless of whether problems are reported or not.
    The main purpose for disabling this validation is the case where output is not meant to be a single document, but either a fragment or sequence of documents.
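
To make concrete what a handler for P_OUTPUT_INVALID_CHAR_HANDLER has to decide, here is a standalone sketch of the underlying check: is a character legal under the XML 1.0 “Char” production, and if not, what replaces it? This deliberately does not use Woodstox’s InvalidCharHandler type (so it runs with the JDK alone); the class and method names are hypothetical, and the same idea can be used to pre-clean text before it reaches any stream writer.

```java
final class InvalidCharSketch {
    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that is not legal XML 1.0 with the given character. */
    static String replaceInvalid(String text, char replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (isLegalXml10(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(replacement); // e.g. a space or a question mark
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String dirty = "a" + (char) 1 + "b";
        System.out.println(replaceInvalid(dirty, '?'));
    }
}
```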

Woodstox-specific output properties, formatting

  • P_ADD_SPACE_AFTER_EMPTY_ELEM (default: false): setting that determines if an empty element is output as <elem /> (with space) or <elem/> (without).
  • P_AUTOMATIC_END_ELEMENTS (default: true): setting that determines what happens when an END_ELEMENT is written immediately after START_ELEMENT (excluding possible attribute writes in between): if enabled, output will use “empty element” notation like <elem/>; otherwise start and end elements are written separately (<elem></elem>).
    NOTE: has no effect on an explicit call to writeEmptyElement(); it only affects the case of a separate writeStartElement() / writeEndElement() pair
  • P_OUTPUT_CDATA_AS_TEXT (default: false): property that may be enabled to “convert” calls to writeCData() to just produce “regular” text segments. If disabled, this will produce a CDATA segment.
  • P_OUTPUT_EMPTY_ELEMENT_HANDLER (default: undefined, type EmptyElementHandler): alternative to P_AUTOMATIC_EMPTY_ELEM: an optional handler that may be registered to fine-tune the decision on whether to output an empty element when it is possible to do so.
    Main use case for this was to allow following (X)HTML rules wherein some tags do allow empty form (like <br/>) and others do not; there exists a default HtmlEmptyElementHandler implementation that may be of use for this purpose.
  • P_OUTPUT_ESCAPE_CR (default: true): setting that determines whether output of \r character will result in matching XML entity (when enabled) or simply character itself. Main distinction here is that embedded \r characters will be normalized during parsing (… unless prevented by P_NORMALIZE_LFS discussed earlier…), whereas results of entity expansion will be exposed as-is (never normalized).
  • P_USE_DOUBLE_QUOTES_IN_XML_DECL (default: false): if enabled, xml declaration will use double quotes for version, encoding; if disabled, apostrophe (single quote).
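
The difference between an explicit writeEmptyElement() and a writeStartElement() / writeEndElement() pair is easiest to see by writing both. The sketch below uses only the standard Stax writer API; the String key is an assumption about the value behind WstxOutputProperties.P_ADD_SPACE_AFTER_EMPTY_ELEM and is applied only if the factory supports it. The exact output of the start/end pair depends on the implementation and its settings, so no output is claimed here.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class EmptyElementExample {
    // ASSUMPTION: string value behind WstxOutputProperties.P_ADD_SPACE_AFTER_EMPTY_ELEM.
    static final String P_ADD_SPACE_AFTER_EMPTY_ELEM = "com.ctc.wstx.addSpaceAfterEmptyElem";

    static XMLOutputFactory newFactory(boolean spaceAfterEmptyElem) {
        XMLOutputFactory f = XMLOutputFactory.newFactory();
        if (f.isPropertySupported(P_ADD_SPACE_AFTER_EMPTY_ELEM)) {
            f.setProperty(P_ADD_SPACE_AFTER_EMPTY_ELEM, spaceAfterEmptyElem);
        }
        return f;
    }

    /** Writes a root element containing one br element, either way of producing it. */
    static String write(XMLOutputFactory f, boolean explicitEmpty) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = f.createXMLStreamWriter(out);
            w.writeStartElement("root");
            if (explicitEmpty) {
                w.writeEmptyElement("br");   // always written in empty-element form
            } else {
                w.writeStartElement("br");   // form depends on implementation settings
                w.writeEndElement();
            }
            w.writeEndElement();
            w.writeEndDocument();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to write output", e);
        }
    }

    public static void main(String[] args) {
        XMLOutputFactory f = newFactory(false);
        System.out.println(write(f, true));
        System.out.println(write(f, false));
    }
}
```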

Woodstox-specific output properties, other

  • P_COPY_DEFAULT_ATTRS (default: false): when using XMLStreamWriter2 method for copying events from input to output, whether to explicitly write out values of default attributes (values that come from DTD definition and NOT input document) or not
  • P_OUTPUT_UNDERLYING_STREAM (read-only — can NOT be set, accessed via getProperty()): used to get underlying OutputStream that is used for output (unless Writer passed — see next property)
  • P_OUTPUT_UNDERLYING_WRITER (read-only — can NOT be set, accessed via getProperty()): used to get underlying Writer that is used for output (unless OutputStream passed — see previous property)

Open Source developer, most known for Jackson data processor (nee “JSON library”), author of many, many other OSS libraries for Java, from ClassMate to Woodstox