Jackson 2.12: improved XML module

@cowtowncoder
6 min read · Dec 11, 2020

(continuation of “Deeper Dive on Jackson 2.12” mini-series — see “Jackson 2.12 Features” for context)

Aside from the “big 5” features discussed so far, another major area of work this time around was the Jackson XML dataformat module: no fewer than 26 XML-specific issues were resolved for this release.
As usual, you can see the full list of changes in the 2.12 Release Notes; here we’ll dig deeper into the most notable fixes and improvements.

Nested Lists in POJOs should work reliably (esp. unwrapped)

Before 2.12, content models with Collection/List valued properties worked well enough for single-level cases. Deserialization had edge cases, however, with various combinations of unwrapped Lists (see @JacksonXmlElementWrapper) nested deeper, especially in the presence of attributes and polymorphic value handling. Serialization did not have these issues.

During 2.12 development all accumulated failure cases were resolved; see for example [dataformat-xml#257], [dataformat-xml#307], [dataformat-xml#314], [dataformat-xml#390].

Empty elements (esp. for POJOs) as nulls

Before 2.12, content like

<root>
<value />
</root>

could be matched to, for example

public class Root {
public MyValue value;
}

and would result in value being null; similarly for other value types.
With 2.12 this instead becomes an “empty” POJO, Collection or Map, which is usually the more logical result for most users.
You can still enable FromXmlParser.Feature.EMPTY_ELEMENT_AS_NULL to keep the earlier logic if you want, but the default was changed and handling should in general be more robust.

In addition, many cases where an empty element (or a start+end element pair with possible whitespace in between) simply failed with an exception are now handled the same way as “empty element” usage (see [dataformat-xml#318]).

Root value deserialization works similar to property values

Although serialization of non-POJO values directly is generally discouraged (the recommendation is to always have a POJO as the root value), it is now possible to serialize most scalar values as root values in XML, and to read them back, too.

So now the following works:

XmlMapper mapper = new XmlMapper();
Integer value = Integer.valueOf(42);
String xml = mapper.writeValueAsString(value);
// xml == "<Integer>42</Integer>"
assertEquals(value, mapper.readValue(xml, Integer.class));

even if such usage is discouraged. Supported types include Enums (dataformat-xml#121) as well as numbers, wrappers, booleans and Strings. Even third-party types work (see [dataformat-xml#380]) as long as the deserializers in question are updated to support this (there is a particular fix that scalar deserializers require); if this fails for a 3rd party datatype, please file an issue against the GitHub repo of the module that contains the deserializer for the datatype.

Additional XML Attributes allowed for Scalars

XML content sometimes contains optional metadata as attributes, using a specific namespace, like:

<value>
<description xml:lang="en_US">Something</description>
</value>

In this case, for example, the language of the description is specified as a sort of orthogonal aspect: something that may or may not be used by the reading application.
If the user is not interested in language metadata, they might map it to a POJO like:

@JacksonXmlRootElement
public class Value {
public String description;
}

Prior to 2.12 this caused a failure since 2 properties are found (the attribute and the text), and scalar types like String really only expect one “embedded” text value.
Jackson 2.12 handles this case in a way that allows ignoring “unexpected” attributes, while still supporting the ability to map such attributes if needed (the main content would then need a @JacksonXmlText, or similar @XmlValue, annotation; other attributes are fine as-is).

May write xsi:nil (as well as read)

Jackson XML module already understood incoming xsi:nil to indicate “this is null” concept via special attribute, like so:

<root>
<value xsi:nil="true" />
</root>

but now (see dataformat-xml#360) it is also possible to make Jackson write the xsi:nil attribute, which is useful when other tools expect it. This is not done by default (for backwards compatibility), but can be enabled with:

XmlMapper xmlMapper = XmlMapper.builder()
.configure(ToXmlGenerator.Feature.WRITE_NULLS_AS_XSI_NIL, true)
.build();
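
To make the target output shape concrete, here is a minimal sketch of the xsi:nil convention itself, hand-written with the JDK's StAX writer. It is not how Jackson implements the feature; the class name XsiNilSketch and its write method are made up for illustration. It writes each null-valued property as an empty element marked xsi:nil="true" and declares the standard XML Schema instance namespace on the root.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, for illustration only; not part of Jackson.
final class XsiNilSketch {
    // Standard namespace URI for XML Schema instance attributes such as xsi:nil
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // Writes one child element per property; null values become xsi:nil="true"
    static String write(String rootName, Map<String, String> properties) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(rootName);
            writer.writeNamespace("xsi", XSI_NS); // declare the prefix once, on the root
            for (Map.Entry<String, String> property : properties.entrySet()) {
                if (property.getValue() == null) {
                    // null: empty element carrying the xsi:nil marker
                    writer.writeEmptyElement(property.getKey());
                    writer.writeAttribute("xsi", XSI_NS, "nil", "true");
                } else {
                    writer.writeStartElement(property.getKey());
                    writer.writeCharacters(property.getValue());
                    writer.writeEndElement();
                }
            }
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to write XML", e);
        }
        return out.toString();
    }
}
```

Calling XsiNilSketch.write("root", props) with a map that holds a null entry for "value" should give a root element with a value child marked as nil; exact formatting may vary between StAX implementations.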

JsonNode/Object[] now support Repeated Elements (as Arrays)

When getting XML content like:

<root>
<a>Foo</a>
<a>Bar</a>
</root>

and reading it as JsonNode (the name is due to historical reasons; it is not really JSON-specific), formerly only the last value for a was included.
This is because the content would essentially be seen by jackson-databind as equivalent to this JSON:

{ "a" : "Foo",
"a" : "Bar"
}

which, if not throwing an exception (which depends on whether JsonParser is configured to detect duplicates), would just replace the entry for a every time another element is encountered, leaving only the last value seen.

But with Jackson 2.12, a new capability, StreamReadCapability.DUPLICATE_PROPERTIES, was defined to indicate that the underlying format may have seeming duplicates; the deserializer for JsonNode will in turn use secondary logic to convert these into ArrayNodes, effectively making the content appear like

{ "a" : [ "Foo", "Bar" ] }

which is likely more usable representation (see dataformat-xml#403 for details).
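
The “secondary logic” can be pictured with a small, library-free sketch. This is not Jackson's actual code, and the class and method names are invented for illustration; it only shows the folding idea: the first value for a name is stored as-is, and a repeated name upgrades the stored value to a List.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of folding duplicate names; not Jackson's implementation.
final class DuplicateFoldingSketch {
    // Folds name/value pairs into a Map; a repeated name turns its value into a List
    static Map<String, Object> fold(List<Map.Entry<String, String>> pairs) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            String name = pair.getKey();
            if (!result.containsKey(name)) {
                result.put(name, pair.getValue()); // first occurrence: keep the scalar
                continue;
            }
            Object existing = result.get(name);
            if (existing instanceof List) {
                // third or later occurrence: append to the existing List
                @SuppressWarnings("unchecked")
                List<Object> values = (List<Object>) existing;
                values.add(pair.getValue());
            } else {
                // second occurrence: replace the scalar with a two-element List
                List<Object> values = new ArrayList<>();
                values.add(existing);
                values.add(pair.getValue());
                result.put(name, values);
            }
        }
        return result;
    }
}
```

Feeding it the pairs (a, Foo) and (a, Bar) yields a map whose a entry is the list [Foo, Bar], mirroring the JSON shape above. Names that occur only once stay scalars, so the shape of the result depends on the data.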

A similar change was also made for the “untyped” deserializer (the one that matches type java.lang.Object and, by extension, Object[]), as per dataformat-xml#205, so that it is also possible to use:

Object stuff = xmlMapper.readValue(xml, Object.class);
// stuff is a Map<String, Object>; it will now have Lists if multiple elements with same name found

This change should make JsonNode more usable with XML content.
One limitation is that this does NOT change handling for POJO-binding cases (except where the nominal type of a property is JsonNode, Object or Object[]), because the change is not to the actual token stream, so it does not necessarily solve all potential use cases.

NOTE: there is a remaining issue with serialization of JsonNode (see dataformat-xml#441), which adds unnecessary wrapping on output and needs to be resolved in the next minor version; until then at least reading should work better.

Some support for “Mixed Content”

One specific type of XML content that is typically not supported by data-binding XML libraries, called mixed content, is quite common for textual markup use cases like XHTML:

<body>Hello, <b>world</b>!</body>

The challenge here is that whereas most data-oriented XML only has leaf-level Strings (“CDATA”) and all branches are elements, mixed content freely mixes text segments and elements. But since text segments do not have a logical property name, they are not easy to represent in a way that makes sense for Object bindings.

Because of this difficulty, before 2.12 Jackson simply ignored any text segments that were between start elements (like “Hello, ” in the above example) or between end elements (like “!”), only exposing “world” as the textual value contained within the b element (between its start and end elements).
So effectively what databind would see was the equivalent of:

{ "b" : "world" }

But with 2.12 the underlying streaming parser was improved (see dataformat-xml#405) to expose “mixed” textual segments with the nominal name of an empty String; that is, as a token sequence equivalent to:

{
"" : "Hello, ",
"b" : "world",
"" : "!"
}

And this in turn may be read as JsonNode or Object (as per earlier notes on allowing handling of duplicate entries).
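
To illustrate the naming convention (and only that), here is a rough sketch that uses the JDK's StAX reader directly rather than Jackson's parser. The class MixedContentSketch is hypothetical, and it assumes a single root element whose children contain only text. Text found directly inside the root is reported under the empty name "", while each child element is reported under its local name.

```java
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of exposing mixed content; not Jackson's parser.
final class MixedContentSketch {
    // Returns name/value pairs in document order; loose text gets the name ""
    static List<Map.Entry<String, String>> entries(String xml) {
        List<Map.Entry<String, String>> result = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Coalescing ensures adjacent text arrives as a single CHARACTERS event
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            boolean insideRoot = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (!insideRoot) {
                        insideRoot = true; // the root element itself
                    } else {
                        // child element: its text becomes a named entry;
                        // getElementText() also consumes the child's end tag
                        String name = reader.getLocalName();
                        result.add(new SimpleEntry<>(name, reader.getElementText()));
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && insideRoot) {
                    // text segment directly inside the root: nominal name is ""
                    result.add(new SimpleEntry<>("", reader.getText()));
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    insideRoot = false; // only the root's end tag reaches this loop
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
        return result;
    }
}
```

For the body example above, this produces three pairs in document order: the empty name with “Hello, ”, then b with “world”, then the empty name with “!”. The result is a List rather than a Map, so both empty-name entries survive and their order is kept.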

This does not necessarily solve the whole “how do I handle mixed content” use case because

  • it is not quite clear how this would map to a POJO (there is no way to specify properties with no name)
  • if binding to JsonNode, order is not fully retained: content would look like:
{
"" : [ "Hello, ", "!" ],
"b" : "world"
}

Nonetheless, since the content is now at least exposed within the token stream, custom deserializers can access all content and use it in whatever way makes sense.
It should also be possible to build further functionality based on ideas submitted as well — feel free to file an issue with suggestions for improvements!

Miscellaneous other fixes

Aside from bigger changes, there are other notable particular fixes:

Future Work

While a lot of progress was made with 2.12, there remain many challenges regarding handling of XML content.
Aside from gaps mentioned already we have challenges like:

  1. “Attribute-ness” is not preserved by JsonNode or buffering (`TokenBuffer`) — ideally it should ([dataformat-xml#217])
  2. Property names may collide in cases where they should not: either due to namespace being ignored (i.e. local name collision occurs even if namespaces differ), or due to list-wrapper name not being used to avoid collisions, or even element-vs-attribute distinction
  3. Custom escaping of output not yet supported (see [dataformat-xml#75])
  4. Polymorphic type id can not be used as property name (“flattening”, see [dataformat-xml#197])
  5. Customizing output of document aspects: XML Schema attribute ([dataformat-xml#90]), DOCTYPE declaration ([dataformat-xml#150]), custom namespace prefixes ([dataformat-xml#207]), encoding other than “UTF-8” in XML declaration ([dataformat-xml#315]), use of xsi:type for output ([dataformat-xml#324])
  6. Handling of Maps, and especially Map keys, is… challenging, wrt XML name validity rules ([dataformat-xml#244])

So we’ll see what can be tackled with 2.13 and later.


Written by @cowtowncoder

Open Source developer, most known for Jackson data processor (nee “JSON library”), author of many, many other OSS libraries for Java, from ClassMate to Woodstox
