Avro: unions & default values

I recently spent a few hours scratching my head trying to understand Avro’s default values (in particular, when combined with unions), so I’m documenting my findings here.

tl;dr

  • this repository provides complete, runnable examples that document the Avro behavior discussed in this post
  • default values are only used when reading, to fill in a field that the reader schema declares but that is missing from the encoded data; they are never applied when writing data
  • when declaring a union type with a default value, the type of the default value must match the first element in the union
  • always use Avro’s official implementation in your language of choice to avoid decoding issues

Context

Let’s consider the following schema:

{
  "type": "record",
  "name": "nullOrLong",
  "fields" : [
    { "name": "id", "type": "long" },
    { "name": "parentId", "type": [ "null", "long" ], "default": null }
  ]
}

When I looked at it, I asked myself:

Why do I need to provide a null default value when it’s already implicit in the type definition?

The answer is that the default value is ignored by Avro when encoding data and instead only used when decoding.

To put it another way, Avro distinguishes between:

this field can have a null (or long) value

and

this field can have a null (or long) value and be absent in encoded data, in which case we’ll assign it a null value when reading

Let’s use a type other than null to further clarify this behaviour:

{
  "type": "record",
  "name": "stringOrLong",
  "fields" : [
    { "name": "id", "type": "long" },
    { "name": "parentId", "type": [ "string", "long" ], "default": "abc" }
  ]
}

When writing a record conforming to this schema, we are required to assign a value to parentId (in spite of what one might expect, given the defined default value), otherwise the encoder will throw an

org.apache.avro.UnresolvedUnionException: Not in union ["string", "long"]: null

exception at us.

When are default values useful, then? When we want to introduce a new field in the reader schema, for example!

Let’s imagine that our original schema looks like this:

{
  "type": "record",
  "name": "justLong",
  "fields" : [
    { "name": "id", "type": "long" }
  ]
}

and we use it to encode our data (either to JSON or binary, it’s not relevant). We later want to add an additional field but we still want to be able to decode data written with schema justLong. The nullOrLong schema that we defined at the beginning of this post can do just that, because it knows that it should set parentId = null whenever that field is absent in encoded data! (see also this Avro ticket for reference)
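
To make this concrete, here is a minimal sketch of that scenario, assuming the official Java library (org.apache.avro) is on the classpath. The object and method names (SchemaEvolutionSketch, writeThenRead) are made up for illustration; the key detail is that the reader is given both the schema the data was written with and the schema we want to read it as.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical names, for illustration only.
object SchemaEvolutionSketch {
  // Writer schema: the original record, without parentId.
  val justLong: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "justLong", "fields": [
      |  {"name": "id", "type": "long"}
      |]}""".stripMargin)

  // Reader schema: adds parentId with a null default.
  val nullOrLong: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "nullOrLong", "fields": [
      |  {"name": "id", "type": "long"},
      |  {"name": "parentId", "type": ["null", "long"], "default": null}
      |]}""".stripMargin)

  def writeThenRead(id: Long): GenericRecord = {
    // Encode with the old schema (binary format).
    val written = new GenericData.Record(justLong)
    written.put("id", Long.box(id))
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](justLong).write(written, encoder)
    encoder.flush()

    // Decode with (writer schema, reader schema): parentId is not in the
    // data, so the reader falls back to the default declared in nullOrLong.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    new GenericDatumReader[GenericRecord](justLong, nullOrLong).read(null, decoder)
  }
}
```

Calling SchemaEvolutionSketch.writeThenRead(1L) should give back a record whose id is 1 and whose parentId is null, even though parentId was never written.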

There are a couple more things to keep in mind when working with unions and/or defaults.

The type of the default value must match the first element of the union

Which is to say that if we want to have a null default, then this is a correct definition of the parentId field:

{ "name": "parentId", "type": [ "null", "long" ], "default": null }

whereas this is wrong (note the position of "null" in the union types' list):

{ "name": "parentId", "type": [ "long", "null" ], "default": null }
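
Conversely, if we did want "long" to come first in the union, the default would have to be a long. The value 0 below is an arbitrary placeholder that I picked for illustration:

```json
{ "name": "parentId", "type": [ "long", "null" ], "default": 0 }
```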

Union fields are not encoded to JSON as one might expect

I would (I did, in fact) naively think that the following JSON object would conform to the schema:

{
  "id": 2,
  "parentId": 1
}

but I got hit by reality when faced with an Expected start-union. Got VALUE_NUMBER_INT exception thrown by Avro’s JsonDecoder. After looking into the documentation, the correct way to encode unions turned out to be:

{
  "id": 2,
  "parentId": {
    "long": 1
  }
}
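
A quick way to check both shapes is to run them through the official Java library's JSON decoder and encoder. This is only a sketch: it assumes org.apache.avro is on the classpath, and the names UnionJsonSketch, decodeJson and encodeJson are mine, not part of Avro.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical names, for illustration only.
object UnionJsonSketch {
  val nullOrLong: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "nullOrLong", "fields": [
      |  {"name": "id", "type": "long"},
      |  {"name": "parentId", "type": ["null", "long"], "default": null}
      |]}""".stripMargin)

  // Parse Avro-flavoured JSON into a record; throws if the JSON
  // does not follow Avro's JSON encoding rules for the schema.
  def decodeJson(json: String): GenericRecord = {
    val decoder = DecoderFactory.get().jsonDecoder(nullOrLong, json)
    new GenericDatumReader[GenericRecord](nullOrLong).read(null, decoder)
  }

  // Let Avro produce the JSON, so unions are wrapped for us.
  def encodeJson(record: GenericRecord): String = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().jsonEncoder(nullOrLong, out)
    new GenericDatumWriter[GenericRecord](nullOrLong).write(record, encoder)
    encoder.flush()
    out.toString("UTF-8")
  }
}
```

With this in place, decodeJson("""{"id": 2, "parentId": {"long": 1}}""") returns a record, while passing the bare "parentId": 1 variant throws. Going the other way, encodeJson produces the wrapped form on its own, so there is no need to build it by hand.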

About those nulls …

Programmatically writing a record that omits the value of parentId when the union is defined as ["null", "long"] might just work (in spite of what was said above) because null typically represents the absence of a value in programming languages, so an unset field looks exactly like a field explicitly set to null. The same is, however, not true when we’re dealing with other types.

Again, a clarifying example:

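import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

def encode(schema: Schema, input: GenericRecord): String = ...

// the schemas are declared first, so that they exist when the record is built
val nullOrLong: Schema = ... // the "nullOrLong" schema defined above
val stringOrLong: Schema = ... // the "stringOrLong" schema defined above

val input = new GenericData.Record(nullOrLong)
input.put("id", Long.box(1L)) // put takes an Object, so the Long is boxed explicitly
// parentId is implicitly null

encode(nullOrLong, input) // works: null is a member of ["null", "long"]

encode(stringOrLong, input) // throws an exception: null is not in ["string", "long"]

input.put("parentId", "foo")
encode(stringOrLong, input) // works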

Encoding

On a similar note, please be aware that Avro has different encoding rules for null in its JSON and binary formats. In JSON, a field with a null value cannot be omitted and has to be written out explicitly:

{
  "id": 1,
  "parentId": null
}

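The binary format, on the other hand, encodes null with 0 bytes, so the value itself is effectively absent (the binary format doesn’t contain the schema itself, only the data). Note that for a union such as ["null", "long"], the position of the selected branch within the union is still written before the value. See JSON Encoding and Binary Encoding, respectively, for more information.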
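
To see what actually ends up in the binary output, here is a small sketch that encodes a nullOrLong record and returns the raw bytes. As before, it assumes the official Java library is on the classpath, and BinaryNullSketch and encode are names I made up for the example.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Hypothetical names, for illustration only.
object BinaryNullSketch {
  val nullOrLong: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "nullOrLong", "fields": [
      |  {"name": "id", "type": "long"},
      |  {"name": "parentId", "type": ["null", "long"], "default": null}
      |]}""".stripMargin)

  // None is stored as null, Some(n) as a boxed long.
  def encode(id: Long, parentId: Option[Long]): Array[Byte] = {
    val record = new GenericData.Record(nullOrLong)
    record.put("id", Long.box(id))
    record.put("parentId", parentId.map(v => Long.box(v)).orNull)

    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](nullOrLong).write(record, encoder)
    encoder.flush() // the encoder is buffered, nothing is written before this
    out.toByteArray
  }
}
```

Going by the binary encoding rules (longs are zig-zag varints, a union is its branch position followed by the value), encode(1L, None) yields the two bytes 0x02 0x00: one for id and one for the union branch "null", with nothing at all for the null itself. encode(1L, Some(2L)) yields 0x02 0x02 0x04: id, branch position 1, then the long value 2.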

Bottom line: always use Avro’s official encoder to avoid running into errors when decoding data!