Is Smile format Splittable? Yes

(that is, “Hadoop Splittable”)

What is “Smile format”?

First things first: since some of you may not be familiar with the data format called Smile, here’s a brief introduction:

  1. The specification can be found in the smile-format-specification GitHub repo
  2. Smile encoding is supported by many libraries and frameworks (see the above-mentioned GitHub repo for links): Java support is via the Jackson jackson-dataformat-smile module
  3. It is similar in many ways to other “binary JSON” formats like newer CBOR or (slightly) older MessagePack — but has some benefits over both, especially for longer data streams

General strong points of Smile format

While Smile is quite similar to other binary formats — and in particular “binary JSON formats” like CBOR — it has some specific benefits over them.
For background, you may want to read the “Smile format design goals” document.

  1. Smaller content size for longer streams via String back-references (the ability to refer to previously written [short] String values). The feature is optional, which also allows simpler encoder implementations
  2. Incremental generation: minimal buffering needed when outputting Text/String values
  3. Ability to simply append content: Smile content may be output as a sequence of values, without external framing (or one can view the Smile format itself as having “mini-framing” built in). This also means that “merging” two data files is as simple as writing one after the other, with no other processing.

What does “append content” mean here

Most data formats specify the output as a single discrete document: this is the case with XML for example — an XML document MUST have one and only one root element. It is not legal to write a stream with 2 consecutive XML documents, directly, as the result is not well-formed XML content.
Instead, one is required to use some form of framing: either in-format (use a placeholder root element) or an external framing format (with either length prefixes or some sort of separators). This generally adds complexity and potentially processing overhead, depending on the implementation.
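To make the contrast concrete, here is a minimal sketch of what the Smile-style “merge” amounts to: copy the bytes of one stream after the other and write nothing in between. The class and method names (SmileAppend, appendAll) are made up for illustration, and the byte arrays in main are stand-ins rather than real Smile documents.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not part of Jackson.
final class SmileAppend {
    // Copies every source to the target in order. No length prefix,
    // separator or wrapper element is written between the sources.
    static void appendAll(OutputStream target, InputStream... sources) throws IOException {
        byte[] buffer = new byte[8192];
        for (InputStream in : sources) {
            int count;
            while ((count = in.read(buffer)) != -1) {
                target.write(buffer, 0, count);
            }
        }
        target.flush();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payloads (assumption): real input would be Smile-encoded files.
        byte[] first = { 1, 2, 3 };
        byte[] second = { 4, 5 };
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        appendAll(merged, new ByteArrayInputStream(first), new ByteArrayInputStream(second));
        System.out.println(merged.size() + " bytes written");
    }
}
```

Doing the same with two XML documents would produce content that is not well-formed, which is why those formats need the extra framing layer.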

But wait! It gets better: Smile data streams are splittable!

Being able to efficiently read and write data streams is good, but for Big Data processing there is one orthogonal aspect that can be very valuable as well: Splittability (as it is called in Hadoop context) of data streams.

Smile settings to ensure splittability

Most binary formats are not Splittable for the simple reason that they use the full range of byte values for encoding and there are no limitations on byte combinations beyond individual bytes: splittability typically requires use of marker values that cannot be present in encoded data.
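One way a format can keep a byte value like 0xFF out of its payload is to pack binary data into 7-bit units, so that no encoded byte ever has its high bit set. That is the general idea behind the ENCODE_BINARY_AS_7BIT setting used in the Jackson code later in this post. The sketch below only illustrates the principle: the class name and the exact bit layout are my own assumptions and are not the layout the Smile specification defines.

```java
// Illustrative 7-bit packing; NOT Smile's actual binary layout.
final class SevenBitSketch {
    // Packs 8-bit input bytes into bytes that only use the low 7 bits.
    static byte[] encode7(byte[] input) {
        byte[] out = new byte[(input.length * 8 + 6) / 7];
        int acc = 0, bits = 0, o = 0;
        for (byte b : input) {
            acc = (acc << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 7) {
                bits -= 7;
                out[o++] = (byte) ((acc >> bits) & 0x7F);
            }
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
        if (bits > 0) {
            // left-align the leftover bits in a final, zero-padded unit
            out[o++] = (byte) ((acc << (7 - bits)) & 0x7F);
        }
        return out;
    }

    // Reverses encode7; the caller must know the original length.
    static byte[] decode7(byte[] encoded, int originalLength) {
        byte[] out = new byte[originalLength];
        int acc = 0, bits = 0, o = 0;
        for (byte b : encoded) {
            acc = (acc << 7) | (b & 0x7F);
            bits += 7;
            if (bits >= 8 && o < originalLength) {
                bits -= 8;
                out[o++] = (byte) (acc >> bits);
                acc &= (1 << bits) - 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xFF, (byte) 0xFF, 0x00, (byte) 0x80, 0x7F };
        byte[] packed = encode7(raw);
        for (byte b : packed) {
            if ((b & 0x80) != 0) throw new IllegalStateException("high bit set");
        }
        byte[] back = decode7(packed, raw.length);
        System.out.println(java.util.Arrays.equals(raw, back));
    }
}
```

The cost is roughly one extra byte for every seven bytes of binary data, which is the price for a byte value that is guaranteed never to appear inside a value.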

Writing Smile 0xFF frame marker with Jackson

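There are two ways to get the 0xFF frame marker written into Smile-encoded output:

  1. Enable SmileGenerator.Feature.WRITE_END_MARKER, so that the marker is written whenever a SmileGenerator is closed
  2. Output it explicitly using the SmileGenerator.writeRaw(byte) method

Both snippets below need these imports (Jackson 2.x package names):

    import java.io.OutputStream;
    import com.fasterxml.jackson.core.StreamWriteFeature;
    import com.fasterxml.jackson.dataformat.smile.SmileConstants;
    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import com.fasterxml.jackson.dataformat.smile.SmileGenerator;
    import com.fasterxml.jackson.dataformat.smile.databind.SmileMapper;

The first approach, with WRITE_END_MARKER enabled:

    OutputStream out = ...;
    // Need to configure underlying stream factory:
    SmileFactory f = SmileFactory.builder()
        .enable(SmileGenerator.Feature.WRITE_END_MARKER)
        .enable(SmileGenerator.Feature.ENCODE_BINARY_AS_7BIT)
        // keep "out" open when each per-value generator is closed
        .disable(StreamWriteFeature.AUTO_CLOSE_TARGET)
        .build();
    // and then construct mapper with it:
    SmileMapper mapper = SmileMapper.builder(f).build();
    for (InputValue value : valuesToWrite) {
        mapper.writeValue(out, value);
        // will create and close generator and write END_MARKER after every value
        // There are other ways too by controlling creation of SmileGenerator and batching multiple values
    }

The second approach, writing the marker explicitly:

    OutputStream out = ...;
    SmileFactory f = SmileFactory.builder()
        .enable(SmileGenerator.Feature.ENCODE_BINARY_AS_7BIT)
        // alas! Must avoid back references if we by-pass doc boundary
        .disable(SmileGenerator.Feature.CHECK_SHARED_NAMES)
        .disable(SmileGenerator.Feature.CHECK_SHARED_STRING_VALUES)
        .build();
    SmileMapper mapper = SmileMapper.builder(f).build();
    // createGenerator() is declared to return JsonGenerator, hence the cast
    try (SmileGenerator g = (SmileGenerator) mapper.createGenerator(out)) {
        for (InputValue value : valuesToWrite) {
            mapper.writeValue(g, value);
            // write after every value, or every Nth; whatever
            g.writeRaw(SmileConstants.BYTE_MARKER_END_OF_CONTENT); // 0xFF
        }
    }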

Using this with Hadoop etc

Beyond somewhat low level handling of Smile document end marker, to allow splitting of Smile encoded content, how would one go about actually using this with, say, Hadoop?
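As a rough sketch of the split-handling logic (and only that: this is not Hadoop API code), a reader for one split can borrow the convention Hadoop uses for line-oriented text. A value belongs to the split in which its first byte falls, so a reader that starts mid-file skips ahead to just past the next 0xFF marker, and it finishes the last value it started even if that value runs past the end of the split. The class and method names here are hypothetical, the data is held in a byte array for simplicity, and I am assuming every value is followed by a 0xFF marker and that 0xFF occurs nowhere else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of split handling; a real integration would sit
// inside a Hadoop InputFormat/RecordReader and work on streams.
final class SplitSketch {
    static final byte END_MARKER = (byte) 0xFF;

    // Index of the first marker at or after 'from', or -1 if there is none.
    static int indexOfMarker(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == END_MARKER) return i;
        }
        return -1;
    }

    // Returns [from, to) byte ranges (markers excluded) of all values
    // whose first byte lies inside the split [start, end).
    static List<int[]> valuesInSplit(byte[] data, int start, int end) {
        List<int[]> ranges = new ArrayList<>();
        int pos = 0;
        if (start > 0) {
            // Look from start - 1 so that a value beginning exactly at
            // 'start' (right after a marker) is claimed by this split.
            int marker = indexOfMarker(data, start - 1);
            if (marker < 0) return ranges; // no value starts in this split
            pos = marker + 1;
        }
        while (pos < end && pos < data.length) {
            int marker = indexOfMarker(data, pos);
            int to = (marker < 0) ? data.length : marker;
            ranges.add(new int[] { pos, to }); // may extend past 'end'
            pos = to + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Three stand-in values, each followed by the end marker.
        byte[] data = { 1, 2, END_MARKER, 3, END_MARKER, 4, 5, 6, END_MARKER };
        for (int[] split : new int[][] { { 0, 4 }, { 4, 9 } }) {
            for (int[] r : valuesInSplit(data, split[0], split[1])) {
                System.out.println("split " + split[0] + "-" + split[1]
                        + " reads bytes " + r[0] + ".." + r[1]);
            }
        }
    }
}
```

With this rule every value is read by exactly one split, wherever the split boundaries happen to fall. Each returned range would then be handed to a Smile parser.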

Next time: is LZF compression codec splittable?

I don’t usually do cliff-hangers with my blogging, but I’ll make an exception here: I plan to follow up this post with some musings on the possible splittability of the LZF compression codec (see the Java LZF implementation at https://github.com/ning/compress). Stay tuned!

Open Source developer, most known for Jackson data processor (nee “JSON library”), author of many, many other OSS libraries for Java, from ClassMate to Woodstox