Reducing boilerplate with Python dataclasses
I used to think of dataclasses as something like a fancy namedtuple. To borrow an example from the Python docs, I might do this
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
rather than this
from collections import namedtuple
Point = namedtuple("Point", ["x", "y"])
But it turns out dataclasses are a lot more useful than that. Rather than "fancy namedtuples", dataclasses are more like "fancy classes". They can
- help classes behave in a Pythonic way
- reduce boilerplate
All at the same time! Here's a simplified Book written as a standard class.
class Book:
    def __init__(self, title, author, year=2022):
        self.title = title
        self.author = author
        self.year = year

    def __repr__(self):
        # !r shows each value's repr, so strings keep their quotes
        return f"Book(title={self.title!r}, author={self.author!r}, year={self.year!r})"

    def cite(self):
        return f"{self.author}, {self.title} ({self.year})"
That works. But there's a chunk of boilerplate to give the Book instance its data and get a nice string representation. Only the cite method has logic that's particularly specific to the Book class. (The attribute names reflect Book characteristics, like title, but nothing uniquely book-like happens in __init__ or __repr__.)
Worse, this class can't be used in some basic ways. Python doesn't know how to compare Book instances, so doesn't recognise when one book is the same as another and can't sort a list of books.
book = Book("Histories", "Herodotus", -430)
identical_book = Book("Histories", "Herodotus", -430)
book == identical_book
# False
book2 = Book("The History of England", "Thomas Macaulay", 1848)
sorted([book, book2])
# TypeError: '<' not supported between instances of 'Book' and 'Book'
The Book class has a __repr__ method, allowing the use of both str(book) and repr(book). But it doesn't have other dunder (double underscore) methods that support operators like == and <. In other words, a little more boilerplate is needed.
from functools import total_ordering

@total_ordering
class Book:
    # Methods hidden for brevity
    # __init__, __repr__, cite

    def __eq__(self, other):
        # If both objects are books, check if all values are equal
        # If the other object is not a Book, return NotImplemented so Python
        # can try the other operand and otherwise treat the two as unequal
        if other.__class__ is self.__class__:
            return (self.title, self.author, self.year) == (other.title, other.author, other.year)
        return NotImplemented

    def __gt__(self, other):
        # If both objects are books, check if values of the book on the left are
        # greater than values on the right
        # If the other object is not a Book, return NotImplemented so Python
        # can try the other operand and otherwise raise a TypeError
        if other.__class__ is self.__class__:
            return (self.title, self.author, self.year) > (other.title, other.author, other.year)
        return NotImplemented
Book instances can now be compared and sorted, thanks to the __eq__ (equality) and __gt__ (greater than) dunder methods and the total_ordering decorator. The decorator isn't essential; it reduces repetition by letting the programmer define __eq__ and one other comparison dunder method (__gt__, in this case). Without total_ordering, it would be necessary to add the other comparison dunder methods: __ge__, __lt__ and __le__.
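To see what total_ordering fills in, here's a minimal sketch. The Version class is a made-up example (it isn't part of the Book code) that defines only __eq__ and __gt__ by hand, then uses the operators it never defined.

```python
from functools import total_ordering


@total_ordering
class Version:
    """Hypothetical class, used only to illustrate total_ordering."""

    def __init__(self, number):
        self.number = number

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return self.number == other.number
        return NotImplemented

    def __gt__(self, other):
        if other.__class__ is self.__class__:
            return self.number > other.number
        return NotImplemented


old, new = Version(1), Version(2)

# Only __eq__ and __gt__ were written by hand; the decorator derives the rest
results = {
    "gt": new > old,
    "lt": old < new,
    "le": old <= new,
    "ge": new >= old,
}
print(results)
```

Every value in results should come out as True, even though __lt__, __le__ and __ge__ never appear in the class body.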
A dataclass provides this behaviour with even less effort. The class and dataclass below, shown side by side, work identically. They have the same init behaviour, same string representation, even the same comparison and sort behaviour. But the dataclass takes care of the boilerplate.
from dataclasses import dataclass

@dataclass(order=True)
class Book:
    title: str
    author: str
    year: int = 2022

    def cite(self):
        return f"{self.author}, {self.title} ({self.year})"
from functools import total_ordering

@total_ordering
class Book:
    def __init__(self, title, author, year=2022):
        self.title = title
        self.author = author
        self.year = year

    def __repr__(self):
        # !r shows each value's repr, so strings keep their quotes
        return f"Book(title={self.title!r}, author={self.author!r}, year={self.year!r})"

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return (self.title, self.author, self.year) == (other.title, other.author, other.year)
        return NotImplemented

    def __gt__(self, other):
        if other.__class__ is self.__class__:
            return (self.title, self.author, self.year) > (other.title, other.author, other.year)
        return NotImplemented

    def cite(self):
        return f"{self.author}, {self.title} ({self.year})"
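As a rough check, the earlier comparison and sorting experiment can be rerun against the dataclass version. This is only a sketch that reuses the same sample books from earlier in the post. The expected output in the comments follows from the standard dataclass behaviour of comparing fields in the order they are defined.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Book:
    title: str
    author: str
    year: int = 2022

    def cite(self):
        return f"{self.author}, {self.title} ({self.year})"


book = Book("Histories", "Herodotus", -430)
identical_book = Book("Histories", "Herodotus", -430)
book2 = Book("The History of England", "Thomas Macaulay", 1848)

print(book)
# Book(title='Histories', author='Herodotus', year=-430)

print(book == identical_book)
# True

# Fields are compared in definition order, so the titles decide the order here
print([b.title for b in sorted([book2, book])])
# ['Histories', 'The History of England']
```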
Dataclasses allow some simple customisation. In the example above, order=True is used to add comparison behaviour to the class. By default, dataclasses only provide __init__, __repr__ and __eq__. (In other words, that's what you get when using the dataclass decorator without any arguments.) Behaviours can be turned off by passing the relevant argument to the decorator, e.g. repr=False, or, in some cases, by defining the method yourself, e.g. adding your own __repr__ to the class.
from dataclasses import dataclass

# Default settings (automatic __init__, __repr__, __eq__)
@dataclass
class Book:
    title: str
    author: str
    year: int = 2022

# Don't create __repr__ method
@dataclass(repr=False)
class Book:
    title: str
    author: str
    year: int = 2022

# Use custom __repr__ method
@dataclass
class Book:
    title: str
    author: str
    year: int = 2022

    def __repr__(self):
        return f"Book: {self.title}"
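Here's a small sketch of what those settings do in practice. The class names BookNoRepr and BookCustomRepr are made up so that both variants can live in one file. Otherwise they match the Book variants above.

```python
from dataclasses import dataclass


@dataclass(repr=False)
class BookNoRepr:  # hypothetical name for the repr=False variant
    title: str
    author: str
    year: int = 2022


@dataclass
class BookCustomRepr:  # hypothetical name for the custom __repr__ variant
    title: str
    author: str
    year: int = 2022

    def __repr__(self):
        return f"Book: {self.title}"


# With repr=False nothing is generated, so the class inherits object.__repr__
uses_default_repr = BookNoRepr.__repr__ is object.__repr__
print(uses_default_repr)
# True

# A __repr__ written in the class body is left alone by the decorator
custom = repr(BookCustomRepr("Histories", "Herodotus", -430))
print(custom)
# Book: Histories

# __init__ and __eq__ are still generated in both cases
still_equal = BookNoRepr("Histories", "Herodotus") == BookNoRepr("Histories", "Herodotus")
print(still_equal)
# True
```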
After a certain point, it's worth implementing the dunder methods manually rather than tweaking and overriding dataclass behaviour. For example, the preferred ordering of Books might not be title, then author, then year. Perhaps ordering should happen author first. Or perhaps the author shouldn't be included in the ordering at all. In cases like those, custom dunder methods are still needed.
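As one possible sketch of the author-first case, the default dataclass (which generates no ordering methods) can be combined with a hand-written __lt__ and total_ordering. This is just one way to do it, not the only one. The third sample book is added purely for illustration.

```python
from dataclasses import dataclass
from functools import total_ordering


@total_ordering
@dataclass  # order is False by default, so no comparison methods are generated
class Book:
    title: str
    author: str
    year: int = 2022

    def __lt__(self, other):
        # Hand-written ordering: author first, then title, then year
        if other.__class__ is self.__class__:
            return (self.author, self.title, self.year) < (other.author, other.title, other.year)
        return NotImplemented


herodotus = Book("Histories", "Herodotus", -430)
macaulay = Book("The History of England", "Thomas Macaulay", 1848)
gibbon = Book("The History of the Decline and Fall of the Roman Empire", "Edward Gibbon", 1776)

authors_in_order = [b.author for b in sorted([macaulay, herodotus, gibbon])]
print(authors_in_order)
# ['Edward Gibbon', 'Herodotus', 'Thomas Macaulay']
```

The dataclass still supplies __init__, __repr__ and __eq__, so only the ordering logic has to be written by hand.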
Dataclasses do a lot, though, and they're a great way to add Pythonic behaviour to classes.
Useful things
For more info about dataclasses, try Eric V. Smith's proposal for adding them to Python, PEP 557 -- Data Classes. It's excellent and I wish I'd read it sooner.
Dunder methods (a.k.a. magic methods) can add a lot more Pythonic behaviour to custom classes, e.g. letting them behave like Python lists with indexing, slicing and looping. I like Rafe Kettler's Guide to Python's Magic Methods. A few things are specific to Python 2, but they're noted in the short appendix and aren't likely to be relevant when you're starting out. In my experience, it's easiest to start with "Comparison magic methods" and then move on to "Making custom sequences", which are things that behave like lists.
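For a taste of the list-like behaviour, here's a minimal sketch. Bookshelf is a made-up class, not something from the guide, and it leans on a plain list internally. It defines only __len__ and __getitem__.

```python
class Bookshelf:
    """Hypothetical container, used only to illustrate sequence dunder methods."""

    def __init__(self, titles):
        self._titles = list(titles)

    def __len__(self):
        return len(self._titles)

    def __getitem__(self, index):
        # Delegating to the list handles integers, negative indexes and slices
        return self._titles[index]


shelf = Bookshelf(["Histories", "The History of England"])

first = shelf[0]                     # indexing
last_one = shelf[-1:]                # slicing (returns a plain list here, not a Bookshelf)
looped = [title for title in shelf]  # looping falls back to __getitem__

print(len(shelf), first, last_one, looped)
```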