Deserializing nested dictionaries into complex, typed(!) python objects
Background
Serialization (also known as marshalling) and its inverse is a common task for many programmers. For the unfamiliar, serialization is, for the purposes of this post, the process of taking some object in memory, and creating a serial representation of it in some standard format for the purposes of transferring the object to some other process, where the opposite (serial => memory) will take place. Some popular serialization formats include JSON, Google’s Protocol Buffers, YAML, XML, Python’s Pickle, and many others.
In the above list, there can be found two broad categories of serialization formats: human-readable (JSON, YAML, XML), and machine-readable (Protocol Buffers, Pickle). Machine-readable formats are great if you want really fast de/serialization. They’re pretty terrible for humans to read, though. For example, take a look at how Pickle serializes a simple dictionary:
>>> import pickle
>>> favorite_color = {"lion": "yellow", "kitty": "red"}
>>> pickle.dumps(favorite_color)
"(dp0\nS'lion'\np1\nS'yellow'\np2\nsS'kitty'\np3\nS'red'\np4\ns."
While it’s possible to kind of make out what’s going on, it’s clear to see that once the objects start to become more complicated, and have multiple nested properties, a human reader would quickly be overwhelmed.
Human-readable serialization formats, on the other hand, are expectedly quite nice for humans to read. In JSON, favorite_color would look like this:
{
"lion": "yellow",
"kitty": "red"
}
YAML is even easier for humans to read:
lion: yellow
kitty: red
JSON and YAML are essentially textual representations of arbitrarily deeply-nested trees. Python represents such trees as dicts
. The json
module is included in Python’s standard library, and py-yaml
is easily installed with pip. These modules expose simple APIs that suck in some valid JSON/YAML and spit out a sweet sweet dict
. So for this post, we will work with in-memory dict
s.
With that, let’s explore a technique to easily define complex Python objects which may be deserialized from nested dictionaries.
Problem
To set the stage, here is the problem we would like to solve. Let’s say we have the following complex nested dictionary (as expressed in YAML for ease-of-reading):
name: Tyrion
house:
name: Lannister
age: 700
colors: [Red, Gold]
words: Hear Me Roar!
seat: Casterly Rock
age: 15
sibling_names: [Jaime, Joffrey, Cersei]
We would like to define a set of Python objects to reprsent this nested structure. We would like to be able to access the various attributes naturally as properties, as such:
# intialize the above object into a variable called tyrion...
>>> tyrion.name
"Tyrion"
>>> tyrion.house
<House 'Lannister'>
>>> tyrion.house.age
700
>>> tyrion.age
15
>>> tyrion.house.colors
["Red", "Gold"]
We would also like for it to be easy to intialize the object, given a dictionary, and we would like that initialization to be typed. That is to say, if we are given a str
instead of an int
for tyrion
’s age, we would like an Exception to be raised.
We should also be able to define methods on each of these nested objects. Perhaps one such example might be
>>> tyrion.drink(liters=3, alcohol='wine')
Naive Implementation
We can start with a naive way to just represent the classes we want, along with some basic (and quite ugly IMO) type checking. You really have to work in order to get make things type-checked in Python.
class House(object):
def __init__(self, name, age, colors, words, seat):
self.check_args(name, age, colors, words, seat)
self.name = name
self.age = age
self.colors = colors
self.words = words
self.seat = seat
def check_args(self, name, age, colors, words, seat):
assert type(name) is str
assert type(age) is int
assert type(colors) is list
assert all(type(c) is str for c in colors)
assert type(words) is str
assert type(seat) is str
def fly_banner(self):
print('flying banner')
class GoTPerson(object):
def __init__(self, name, house, age, sibling_names):
self.check_args(name, house, age, sibling_names)
self.name = name
self.house = house
self.age = age
self.sibling_names = sibling_names
def check_args(self, name, house, age, sibling_names):
assert type(name) is str
assert type(house) is House
assert type(age) is int
assert type(sibling_names) is list
assert all(type(s) is str for s in sibling_names)
def drink(liters, alcohol):
print('drinking %d liters of %s' % (liters, alcohol,))
Now that we’ve got a class structure, how do we get from a dictionary to a GoTPerson
? We would like to write as little additional code as possible, because as you can see, the type checks already added a lot of overhead!
It’d be nice to use something like
>>> a_dict
{
'name': 'Tyrion',
'house': {
'name': 'Lannister',
...
},
...
}
>>> tyrion = GoTPersion(**a_dict)
Traceback (most recent call last):
...
File "test.py", line 26, in __init__
self.check_args(name, house, age, sibling_names)
File "test.py", line 34, in check_args
assert type(house) is House
AssertionError
Why doesn’t this work? We are assigning a the dict
house to tyrion
’s house attribute! We need to recursively create a House
from that dict (and create any necessary objects from dicts if House has nested objects)…
So we are forced to make something like this for each of our classes…
class GoTPerson(object):
...
@classmethod
def from_dict(cls, a_dict):
name = a_dict['name']
house = House.from_dict(a_dict['house'])
age = a_dict['age']
sibling_names = a_dict['sibling_names']
return cls(name, house, age, sibling_names)
...
This works… But it’s pretty ugly, and really verbose! There must be a better way.
Awesome Pythonic Implementation of Greatness
There is a better way! The scaffolding of this implementation is heavily inspired from The Python Cookbook, a fabulous resource for getting a large variety stuff done in a really nice way.
First, we define a base class.
class Structure(object):
_fields = []
def _init_arg(self, expected_type, value):
if isinstance(value, expected_type):
return value
else:
return expected_type(**value)
def __init__(self, **kwargs):
field_names, field_types = zip(*self._fields)
assert([isinstance(name, str) for name in field_names])
assert([isinstance(type_, type) for type_ in field_types])
for name, field_type in self._fields:
setattr(self, name, self._init_arg(field_type, kwargs.pop(name)))
# Check for any remaining unknown arguments
if kwargs:
raise TypeError('Invalid arguments(s): {}'.format(','.join(kwargs)))
Now, let’s make House
and GoTPerson
subclass Structure
.
class House(Structure):
_fields = [('house', str), ('age', int), ('colors', list), ('words', str), ('seat', str)]
def fly_banner(self):
print('flying banner')
class Tyrion(Structure):
_fields = [('name', str), ('house', House), ('age', int), ('sibling_names', list)]
def drink(liters, alcohol):
print('drinking %d liters of %s' % (liters, alcohol,))
Awesome! We can now initialize a GoTPerson as such
>>> a_dict
{
'name': 'Tyrion',
'house': {
'name': 'Lannister',
...
},
...
}
>>> tyrion = GoTPersion(**a_dict)
Our Structure
class takes care of type checking, as well as recursively initializing any nested objects from dictionaries. This technique has massively simplified the process of deserializing JSON into Python objects in my own code, and I hope it does the same for yours!
There is one final challenge you might want to undertake. Try to cleanly establish type checks on the items of lists, which are not present in our current implementation.