Is Smile format Splittable? Yes

(that is, “Hadoop Splittable”)

@cowtowncoder
Jul 1, 2021

What is “Smile format”?

First things first: since some of you may not be familiar with the data format called Smile, here’s a brief introduction:

  1. It is a “binary JSON” format: it has the same logical model as JSON (the same value types) but uses binary encoding for compactness (smaller data size), allowing faster reading and writing
  2. The specification can be found in the smile-format-specification GitHub repo
  3. Smile encoding is supported by many libraries and frameworks (see the above-mentioned GitHub repo for links): Java support is via the Jackson jackson-dataformat-smile module
  4. It is similar in many ways to other “binary JSON” formats like newer CBOR or (slightly) older MessagePack — but has some benefits over both, especially for longer data streams

For most readers, the TL;DR is that Smile is “like JSON but smaller/faster; well supported by Jackson; works particularly well for processing long/big data streams”.

General strong points of Smile format

While Smile is quite similar to other binary formats — and in particular “binary JSON formats” like CBOR — it has some specific benefits over them.
As background, you may want to read the “Smile format design goals” document.

Smile format strengths include:

  1. Balanced read/write performance: some formats heavily focus on read performance at the cost of write performance; Smile avoids some reader-side optimizations (such as length prefixes) to improve write throughput
  2. Smaller content size for longer streams via String back-references (the ability to refer to previously written [short] String values). This feature is optional, to allow simpler encoder implementations as well
  3. Incremental generation: minimal buffering needed when outputting Text/String values
  4. Ability to simply append content: Smile content may be output as a sequence of values, without external framing (or one can view the Smile format itself as having “mini-framing” built in). This also means that “merging” two data files is as simple as writing one after the other, with no other processing.

I will focus on (4) here, since it may not be obvious from the specification, and it differs from the handling of most other data formats I am familiar with. It is also the foundation for “splittability”, a format property we will have a look at in a bit.

What does “append content” mean here?

Most data formats specify the output as a single discrete document: this is the case with XML for example — an XML document MUST have one and only one root element. It is not legal to write a stream with 2 consecutive XML documents, directly, as the result is not well-formed XML content.
Instead, one is required to use some form of framing: either in-format (use a placeholder root element) or an external framing format (either with length prefixes or some sort of separators). This generally adds complexity and potentially processing overhead, depending on implementation.

The same is true for most other data formats: you cannot simply keep on appending values; there is a difference between individual values and a complete document.
Some formats have variations, however, that effectively allow append style: “Line-delimited JSON” is one example (fully supported by Jackson!).
Other formats may only support such append-style content via library-specific extensions.

But the Smile format natively allows simple appending of root-level values (Objects, Arrays, scalar values), and the Jackson implementation will read such content without problems. You can think of this as the equivalent of line-delimited JSON. It makes processing long data streams simple and efficient.
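For example, here is a minimal sketch of reading such an appended sequence of root-level values with Jackson’s MappingIterator (the MyValue POJO and the file name are hypothetical):

SmileMapper mapper = new SmileMapper();
try (MappingIterator<MyValue> it = mapper.readerFor(MyValue.class)
        .readValues(new File("values.smile"))) {
    while (it.hasNextValue()) {
        MyValue value = it.nextValue();
        // ... process each root-level value as it is decoded
    }
}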

As mentioned earlier, appendability is not only valuable when writing a single content document or stream: it also allows combining (merging) multiple content units by simply writing one after the other, without other processing (in particular, it does NOT require decoding/parsing of content, or even any framing).
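As a concrete illustration, here is a minimal sketch of such a merge (the mergeSmileFiles helper and its paths are hypothetical); it is just a raw byte copy, with no Smile-level processing at all:

static void mergeSmileFiles(Path first, Path second, Path target) throws IOException {
    try (OutputStream out = Files.newOutputStream(target)) {
        Files.copy(first, out);  // raw byte copy; no decoding or parsing
        Files.copy(second, out); // appended content is still valid Smile
    }
}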

But wait! It gets better: Smile data streams are splittable!

Being able to efficiently read and write data streams is good, but for Big Data processing there is one orthogonal aspect that can be very valuable as well: splittability (as it is called in the Hadoop context) of data streams.

Splittability here means the ability to take a unit of input, such as a data file with multiple value entries, and, starting from somewhere in the middle, efficiently find a point where the content can be split into two roughly equal-sized parts, processable as separate units (“efficiently” here meaning that only a small part of the content needs to be processed).
For example, if we have a 10 megabyte file and, starting half-way in, we can find a “split point” after reading no more than, say, 1 kB of data (instead of reading the first 5 megabytes), the content would be considered splittable.
(for a better explanation, read for example “What is meant by a compression codec’s splittability in the context of Hadoop?” or https://blogs.oracle.com/datawarehousing/hadoop-compression-choosing-compression-codec-part2)

But what is the value of this ability to “split” a content unit in two? The simple answer is that it allows parallel processing. Without splitting, that unit has to be decoded sequentially, in single-threaded manner; splitting allows dividing the decoding work across multiple decoding tasks or threads.

With Hadoop, splittability is most often a concern with compressed data streams: a stream (file) that is not splittable cannot be processed in parallel; all processing of such content is effectively single-threaded. But splittability is equally relevant for the actual encoded data when uncompressed. Some data formats are relatively easily splittable: line-delimited JSON can be split along linefeeds (since no pretty-printing is used, and all in-String linefeeds in JSON must use backslash escaping); CSV may be made splittable with some restrictions; whereas others (Avro, protobuf) are simply not splittable at all.

It turns out that, with specific settings, we can ensure that Smile-encoded data is splittable.

Smile settings to ensure splittability

Most binary formats are not Splittable for the simple reason that they use the full range of byte values for encoding and there are no limitations on byte combinations beyond individual bytes: splittability typically requires use of marker values that cannot be present in encoded data.

Compared to most other binary formats, the Smile format explicitly excludes certain byte values from encoded content, for the express purpose of allowing splittability (or efficient finding of document boundaries with random-access reads). For full details of the encoding please refer to the Smile format specification, but here is the key point that interests us:

  • “Values 0xFD through 0xFF are not used as token type markers, key markers, or in values; with exception of optional raw binary data”

So as long as you do NOT enable “raw binary data” (binary data embedded as-is, as opposed to 7-bit encoded binary data, which avoids use of the relevant byte range), the resulting encoded content will NEVER use any of the byte values [0xFD, 0xFE, 0xFF]. This means that, if we choose to, we can use any of these 3 bytes as a boundary marker, to be scanned for and used as a split point.
Byte 0xFF was specifically intended as a sort of frame marker (called the “end-of-content” marker in the Smile specification): it is the ideal candidate for finding a split point, and the Jackson Smile format module recognizes it as an end-of-content marker when reading content.
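To make this concrete, here is a minimal sketch of locating a split point by scanning forward for that marker byte (the findSplitPoint helper is hypothetical, and this assumes raw binary data was NOT enabled when the content was written):

static long findSplitPoint(RandomAccessFile file, long candidateOffset) throws IOException {
    file.seek(candidateOffset);
    int b;
    while ((b = file.read()) != -1) {
        if (b == 0xFF) { // SmileConstants.BYTE_MARKER_END_OF_CONTENT
            // next value starts right after the marker
            return file.getFilePointer();
        }
    }
    return -1; // no marker found: the remainder belongs to the preceding split
}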

Since the ability to split (or rather, to find a split point) relies on the existence of frame markers, how can we write them?

Writing Smile 0xFF frame marker with Jackson

There are two ways to get 0xFF frame marker written into Smile-encoded output:

  1. Make SmileGenerator write it on close()
  2. Output it explicitly using the SmileGenerator.writeRaw(byte) method

Let’s look at the first approach first; the code could look something like this:

OutputStream out = ...;
// Need to configure the underlying stream factory:
SmileFactory f = SmileFactory.builder()
        .enable(SmileGenerator.Feature.WRITE_END_MARKER)
        .enable(SmileGenerator.Feature.ENCODE_BINARY_AS_7BIT)
        .disable(StreamWriteFeature.AUTO_CLOSE_TARGET)
        .build();
// and then construct a mapper with it:
SmileMapper mapper = SmileMapper.builder(f).build();
for (InputValue value : valuesToWrite) {
    // creates and closes a generator, writing END_MARKER after every value
    // (there are other ways, too, by controlling creation of the
    // SmileGenerator and batching multiple values)
    mapper.writeValue(out, value);
}

What happens here is that when the SmileGenerator (constructed either directly from the SmileFactory or, like here, indirectly as part of SmileMapper.writeValue()) is closed, it will write 0xFF (the “end marker”).
You can get fancier, too, and write multiple values in a row, only adding the end marker every Nth write: to do so you would need to construct the SmileGenerator yourself and pass that (instead of the OutputStream) to SmileMapper.writeValue(). I will leave that as an exercise to the interested reader.

The second way is actually quite simple as well:

OutputStream out = ...;
SmileFactory f = SmileFactory.builder()
        .enable(SmileGenerator.Feature.ENCODE_BINARY_AS_7BIT)
        // alas! Must avoid back-references if we by-pass document boundaries
        .disable(SmileGenerator.Feature.CHECK_SHARED_NAMES)
        .disable(SmileGenerator.Feature.CHECK_SHARED_STRING_VALUES)
        .build();
SmileMapper mapper = SmileMapper.builder(f).build();
// createGenerator() is declared to return JsonGenerator, hence the cast
try (SmileGenerator g = (SmileGenerator) mapper.createGenerator(out)) {
    for (InputValue value : valuesToWrite) {
        mapper.writeValue(g, value);
        // write after every value, or every Nth; whatever
        g.writeRaw(SmileConstants.BYTE_MARKER_END_OF_CONTENT); // 0xFF
    }
}

This approach has one major drawback, however: since back-references cannot cross a logical document boundary, and since we do not open/close a document per value (which creating/closing a SmileGenerator does), we must avoid using such references.

Using this with Hadoop etc

Beyond the somewhat low-level handling of the Smile document end marker to allow splitting of Smile-encoded content, how would one go about actually using this with, say, Hadoop?

Good question. Despite this feature having existed for the past 10 years, I am not aware of any frameworks/libraries that make use of it. Since I have not worked on Hadoop-based systems myself, I haven’t had a chance to play with it.
The same goes for other data processing pipelines: I would be very interested in seeing this feature used in production.

But this seems like something someone out there might want to tinker with, eh? :-)

Next time: is LZF compression codec splittable?

I don’t usually do cliffhangers with my blogging, but I’ll make an exception here: I plan to follow up this post with musings on the possible splittability of the LZF compression codec (see the Java LZF implementation at https://github.com/ning/compress). Stay tuned!
