Avro: unions & default values
I recently spent a few hours scratching my head trying to understand Avro’s default values (in particular, when combined with unions), so I’m documenting my findings here.
tl;dr #
- this repository provides complete, runnable examples that document the Avro behavior discussed in this post
- default values are only used when reading: the reader schema falls back to them when a field is missing from the encoded data. They are never applied when writing data
- when declaring a union type with a default value, the type of the default value must match the first element in the union
- always use Avro’s official implementation in your language of choice to avoid decoding issues
Context #
Let’s consider the following schema:
{
  "type": "record",
  "name": "nullOrLong",
  "fields" : [
    { "name": "id", "type": "long" },
    { "name": "parentId", "type": [ "null", "long" ], "default": null }
  ]
}
When I looked at it, I asked myself:
Why do I need to provide a null default value when it’s already implicit in the type definition?
The answer is that the default value is ignored by Avro when encoding data and instead only used when decoding.
To put it another way, Avro distinguishes between:
this field can have a null (or long) value
and
this field can have a null (or long) value and be absent in encoded data, in which case we’ll assign it a null value when reading
Let’s use a type other than null to further clarify this behaviour:
{
  "type": "record",
  "name": "stringOrLong",
  "fields" : [
    { "name": "id", "type": "long" },
    { "name": "parentId", "type": [ "string", "long" ], "default": "abc" }
  ]
}
When writing a record conforming to this schema, we are required to assign a value to parentId (in spite of what one might expect, given the defined default value), otherwise the encoder will throw an org.apache.avro.UnresolvedUnionException: Not in union ["string", "long"]: null exception at us.
When are default values useful, then? When we want to introduce a new field in the reader schema, for example!
Let’s imagine that our original schema looks like this:
{
  "type": "record",
  "name": "justLong",
  "fields" : [
    { "name": "id", "type": "long" }
  ]
}
and we use it to encode our data (either to JSON or binary, it’s not relevant). We later want to add an additional field but we still want to be able to decode data written with schema justLong. The nullOrLong schema that we defined at the beginning of this post can do just that, because it knows that it should set parentId = null whenever that field is absent in encoded data! (see also this Avro ticket for reference)
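The scenario above can be sketched roughly as follows. This is a minimal, script-style sketch that assumes Avro’s Java library is on the classpath; the variable names are mine, and it relies on the reader not objecting to the two records having different names, as the example above implies.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val justLong: Schema = new Schema.Parser().parse(
  """{ "type": "record", "name": "justLong", "fields": [
    |  { "name": "id", "type": "long" }
    |] }""".stripMargin)

val nullOrLong: Schema = new Schema.Parser().parse(
  """{ "type": "record", "name": "nullOrLong", "fields": [
    |  { "name": "id", "type": "long" },
    |  { "name": "parentId", "type": [ "null", "long" ], "default": null }
    |] }""".stripMargin)

// Write a record with the old (writer) schema, which has no parentId field
val written = new GenericData.Record(justLong)
written.put("id", 1L)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](justLong).write(written, encoder)
encoder.flush()

// Read it back with the new (reader) schema; the writer schema must be supplied too
val reader = new GenericDatumReader[GenericRecord](justLong, nullOrLong)
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val decoded: GenericRecord = reader.read(null, decoder)

// parentId was never written, so it is taken from the reader schema's default
println(decoded)
```

Note that the reader is given both schemas: the default only comes into play while resolving the writer schema against the reader schema.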
There are a couple more things to keep in mind when working with unions and/or defaults.
The type of the default value must match the first element of the union #
Which is to say that if we want to have a null default, then this is a correct definition of the parentId field:
{ "name": "parentId", "type": [ "null", "long" ], "default": null }
whereas this is wrong (note the position of "null" in the union types’ list):
{ "name": "parentId", "type": [ "long", "null" ], "default": null }
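One way to check a field definition is to hand it to Avro’s schema parser. The sketch below assumes Avro’s Java library is on the classpath; recordWith is a hypothetical helper, and since whether (and how) the parser validates defaults may depend on the Avro version, the outcome is wrapped in a Try and printed rather than assumed.

```scala
import org.apache.avro.Schema
import scala.util.Try

// Hypothetical helper: wraps a parentId field definition in a record schema
def recordWith(parentIdField: String): String =
  s"""{ "type": "record", "name": "example", "fields": [
     |  { "name": "id", "type": "long" },
     |  $parentIdField
     |] }""".stripMargin

val nullFirst = recordWith("""{ "name": "parentId", "type": [ "null", "long" ], "default": null }""")
val longFirst = recordWith("""{ "name": "parentId", "type": [ "long", "null" ], "default": null }""")

// A fresh parser per schema, because a parser remembers the named types it has seen
val nullFirstResult = Try(new Schema.Parser().parse(nullFirst))
val longFirstResult = Try(new Schema.Parser().parse(longFirst))

println(s"null first: $nullFirstResult")
println(s"long first: $longFirstResult")
```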
Union fields are not encoded to JSON as one might expect #
I would (I did, in fact) naively think that the following JSON object would conform to the schema:
{
  "id": 2,
  "parentId": 1
}
but I got hit by reality when faced with an Expected start-union. Got VALUE_LONG exception thrown by Avro’s JsonDecoder. After looking into the documentation, the correct way to encode unions turned out to be:
{
  "id": 2,
  "parentId": {
    "long": 1
  }
}
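Both documents can be fed to Avro’s JSON decoder to see the difference. This is a minimal sketch assuming Avro’s Java library is on the classpath and the nullOrLong schema from the beginning of the post; decode is a hypothetical helper name.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import scala.util.Try

val nullOrLong: Schema = new Schema.Parser().parse(
  """{ "type": "record", "name": "nullOrLong", "fields": [
    |  { "name": "id", "type": "long" },
    |  { "name": "parentId", "type": [ "null", "long" ], "default": null }
    |] }""".stripMargin)

// Hypothetical helper: decodes a single JSON-encoded record with the given schema
def decode(schema: Schema, json: String): GenericRecord = {
  val decoder = DecoderFactory.get().jsonDecoder(schema, json)
  new GenericDatumReader[GenericRecord](schema).read(null, decoder)
}

// A bare number where the decoder expects the union wrapper: fails
val naive = Try(decode(nullOrLong, """{ "id": 2, "parentId": 1 }"""))

// The value wrapped in an object keyed by the branch type: succeeds
val wrapped = Try(decode(nullOrLong, """{ "id": 2, "parentId": { "long": 1 } }"""))

println(s"naive:   $naive")
println(s"wrapped: $wrapped")
```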
About those nulls … #
Programmatically writing a record that omits the value of parentId when the union is defined as ["null", "long"] might just work (in spite of what was said above) because null typically represents the absence of a value in programming languages. The same is, however, not true when we’re dealing with other types.
Again, a clarifying example:
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

def encode(schema: Schema, input: GenericRecord): String = ...
val nullOrLong: Schema = ... // the "nullOrLong" schema defined above
val stringOrLong: Schema = ... // the "stringOrLong" schema defined above

val input = new GenericData.Record(nullOrLong)
input.put("id", 1L)
// parentId is never set, so it is implicitly null

encode(nullOrLong, input) // works: null is a valid branch of ["null", "long"]
encode(stringOrLong, input) // throws an exception: null is not in ["string", "long"]

input.put("parentId", "foo")
encode(stringOrLong, input) // works
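For completeness, here is one possible way to fill in the elided parts. It is only a sketch: it assumes Avro’s Java library is on the classpath and picks the JSON encoder for encode, which is my choice for illustration and not necessarily what the original code did.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import scala.util.Try

// Assumed implementation: encodes a single record to Avro's JSON format
def encode(schema: Schema, input: GenericRecord): String = {
  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().jsonEncoder(schema, out)
  new GenericDatumWriter[GenericRecord](schema).write(input, encoder)
  encoder.flush()
  out.toString("UTF-8")
}

val nullOrLong: Schema = new Schema.Parser().parse(
  """{ "type": "record", "name": "nullOrLong", "fields": [
    |  { "name": "id", "type": "long" },
    |  { "name": "parentId", "type": [ "null", "long" ], "default": null }
    |] }""".stripMargin)

val stringOrLong: Schema = new Schema.Parser().parse(
  """{ "type": "record", "name": "stringOrLong", "fields": [
    |  { "name": "id", "type": "long" },
    |  { "name": "parentId", "type": [ "string", "long" ], "default": "abc" }
    |] }""".stripMargin)

val input = new GenericData.Record(nullOrLong)
input.put("id", 1L) // parentId is left unset

val withNullBranch = Try(encode(nullOrLong, input))      // succeeds, parentId is written as null
val withoutNullBranch = Try(encode(stringOrLong, input)) // fails, the "abc" default is not applied

input.put("parentId", "foo")
val withString = Try(encode(stringOrLong, input))        // succeeds, parentId is wrapped as a string

println(withNullBranch)
println(withoutNullBranch)
println(withString)
```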
Encoding #
On a similar note, please be aware that Avro has different encoding rules for null in its JSON and binary formats. In the JSON format, a field with a null value cannot be omitted and must be written out explicitly:
{
  "id": 1,
  "parentId": null
}
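The binary format, on the other hand, encodes null with zero bytes, so the value itself is effectively absent (the binary format doesn’t contain the schema itself, only the data). Note that for a union such as ["null", "long"], the index of the selected branch is still written before the value. See JSON Encoding and Binary Encoding, respectively, for more information.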
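To see what actually ends up in the binary output, the bytes can be dumped directly. This is a minimal sketch assuming Avro’s Java library is on the classpath; toBinary and hex are hypothetical helper names.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

val nullOrLong: Schema = new Schema.Parser().parse(
  """{ "type": "record", "name": "nullOrLong", "fields": [
    |  { "name": "id", "type": "long" },
    |  { "name": "parentId", "type": [ "null", "long" ], "default": null }
    |] }""".stripMargin)

// Hypothetical helper: encodes a single record to Avro's binary format
def toBinary(schema: Schema, record: GenericRecord): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
  encoder.flush()
  out.toByteArray
}

// Hypothetical helper: renders bytes as space-separated hex
def hex(bytes: Array[Byte]): String = bytes.map(b => f"$b%02x").mkString(" ")

val withNull = new GenericData.Record(nullOrLong)
withNull.put("id", 1L) // parentId stays null

val withLong = new GenericData.Record(nullOrLong)
withLong.put("id", 1L)
withLong.put("parentId", 5L)

// id = 1 (02), union branch 0 = null (00), then nothing for the null itself
println(hex(toBinary(nullOrLong, withNull))) // 02 00

// id = 1 (02), union branch 1 = long (02), then the long 5 (0a)
println(hex(toBinary(nullOrLong, withLong))) // 02 02 0a
```

Longs and branch indexes are written as zig-zag variable-length integers, which is why 1 shows up as 02 and 5 as 0a.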
Bottom line: always use Avro’s official encoder to avoid errors when decoding data!