Faster CSV export from the Django admin
- A quick walk through the original version
- Why it's slow
- Using a different data structure
- What's the improvement?
- Bonus: Why not DictWriter?
- TLDR
- Useful things
Adding a CSV download button to the Django admin is a common approach when internal users need to be able to download data. It's simple to add and takes advantage of existing features, like the admin itself and the permissions framework.
The Django Admin Cookbook even provides sample code: How to export CSV from Django admin? (The cookbook hasn't been updated in a while but remains a useful resource, especially for developers working with the admin for the first time.) The Cookbook code adds a new option "Export selected" to the "Action" dropdown on each model's list page, which is a nice use of the standard admin UI.
The downside is that it's slow. Luckily, generating the CSV becomes much more efficient after minor changes to the code.
If you just want that revised code, without the explanation, skip to the TLDR.
A quick walk through the original version
Here's the original code.
```python
import csv

from django.http import HttpResponse


class ExportCsvMixin:
    def export_as_csv(self, request, queryset):
        meta = self.model._meta
        field_names = [field.name for field in meta.fields]

        response = HttpResponse(content_type="text/csv")
        response["Content-Disposition"] = "attachment; filename={}.csv".format(meta)
        writer = csv.writer(response)

        writer.writerow(field_names)
        for obj in queryset:
            row = writer.writerow([getattr(obj, field) for field in field_names])

        return response

    export_as_csv.short_description = "Export Selected"
```
It's a simple class, with a single method, that can be used with Django's standard `ModelAdmin` class. The `export_as_csv` method expects a `request` and a `queryset`, which are passed by other methods in the admin class. The queryset holds whatever records the user selected via the Django admin.
In the function body, some metadata is retrieved from the model whose data is being downloaded. The model name becomes part of the CSV filename and the model field names are used to create the CSV's header row.
The main work happens in the `for` loop, where each record in the queryset becomes a row in the CSV.
Finally, the CSV response is returned, which starts the user's file download.
Why it's slow
The problem with the original implementation is the `for` loop. The loop builds up the CSV row by row using `writerow`. There's also a sneaky second loop - a list comprehension that gets the value of each field. This happens inside the `for` loop, so it gets the value for each field of each record.

`writerow` expects a "row" and the Python 3 docs say a row is "an iterable of strings or numbers for Writer objects". The queryset itself is iterable but the objects within it aren't.
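That definition is easy to check with the standard library alone. In this sketch (stdlib only, no Django, with made-up data), lists, tuples and generators all qualify as rows, because each is an iterable of strings or numbers:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)

# Lists, tuples and generators are all iterables of strings or numbers,
# so each one counts as a "row".
writer.writerow(["John", "Snow", 1813])
writer.writerow(("Joseph", "Lister", 1827))
writer.writerow(str(n) for n in range(3))

print(buffer.getvalue())
```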
Here's an example that tries to pass each model instance to `writerow`. It's a simplified version of the original code, without the bits specific to the Django admin. (The example uses models from Django-Hospital, which I wrote to help some junior developers learn Django queries. If you want to try the examples yourself, you can clone Django-Hospital.)
```python
import csv

from hospital.models import Doctor

with open("/home/ellen/doctors.csv", "w") as f:
    field_names = [field.name for field in Doctor._meta.fields]
    writer = csv.writer(f)
    writer.writerow(field_names)
    for obj in Doctor.objects.all():
        writer.writerow(obj)
        # Error: iterable expected, not Doctor
```
As expected, the individual objects within a queryset are not iterable and don't count as rows. Because Django has returned a queryset of not-iterable model instances, the ExportCsvMixin transforms each model instance to a list that can be passed to `writerow`.
Generating the CSV would be faster without that transformation step.
Using a different data structure
Luckily, Django's queryset methods don't have to return a queryset of model instances, they can return other data structures.
The most commonly used methods, e.g. `all` and `filter`, do return a queryset of ORM objects. But there are other methods:

- `values` returns a queryset of dictionaries
- `values_list` returns a queryset of tuples
The docs for Django 4.0 say "values() and values_list() are both intended as optimizations for a specific use case: retrieving a subset of data without the overhead of creating a model instance." In other words, Django does less work before returning the queryset.
Even better, `values_list` returns a queryset of tuples, which is an iterable (queryset) of iterables (tuples). That meets the definition of a "row" expected by `csv.writerow` and the "iterable of rows" expected by `csv.writerows`. So the queryset itself can be passed to `csv.writerows`, avoiding the need for any transformation steps.
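The same point as a stdlib-only sketch, with hypothetical data standing in for the two queryset shapes:

```python
import csv
import io

# Hypothetical rows mirroring the two queryset shapes.
as_dicts = [{"first_name": "John", "last_name": "Snow"}]   # like values()
as_tuples = [("John", "Snow")]                             # like values_list()

buffer = io.StringIO()
writer = csv.writer(buffer)

# Tuples already satisfy csv's definition of a "row", so the whole list
# can be handed to writerows with no per-row transformation...
writer.writerows(as_tuples)

# ...while dicts need an extra transformation step first.
writer.writerows([list(d.values()) for d in as_dicts])

print(buffer.getvalue())
```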
The simplified version looks like this.
```python
import csv

from hospital.models import Doctor

with open("/home/ellen/doctors.csv", "w") as f:
    field_names = [field.name for field in Doctor._meta.fields]
    writer = csv.writer(f)
    writer.writerow(field_names)
    writer.writerows(Doctor.objects.values_list(*field_names))
```
What's the improvement?
While working on this, I did some rough timings to confirm the expected improvement actually happened. Those timings are outlined below. The short version is the new code reduced the time to generate a CSV by 66%.
To do the checks, I created some basic data in the form of 1,000,000 identical doctors.
```python
from hospital.models import Doctor

new_doctors = [
    Doctor(
        first_name="John",
        last_name="Snow",
        birth_year=1813,
        position=Doctor.ATTENDING,
        speciality=Doctor.GENERAL,
    )
    for _ in range(1000000)
]
Doctor.objects.bulk_create(new_doctors, batch_size=3000)
```
Then I used the `timeit` "magic command" in an IPython shell. Because I planned to do a few runs with `timeit`, I didn't want to faff around with files and used StringIO instead. (As a reminder, these timings weren't an exercise in precision, but a simple check on what I was doing.)
Times were taken on a machine with 1.8GHz Intel i7-8550U (8 cores) and 8GB RAM. Your own times will vary.
Original
```python
%%timeit

import csv
import io

from hospital.models import Doctor

buffer = io.StringIO()
field_names = [field.name for field in Doctor._meta.fields]
writer = csv.writer(buffer)

writer.writerow(field_names)
for obj in Doctor.objects.all():
    writer.writerow([getattr(obj, field) for field in field_names])

buffer.close()

# 6.51 s ± 169 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Revised
```python
%%timeit

import csv
import io

from hospital.models import Doctor

buffer = io.StringIO()
field_names = [field.name for field in Doctor._meta.fields]
writer = csv.writer(buffer)

writer.writerow(field_names)
writer.writerows(Doctor.objects.values_list(*field_names))

buffer.close()

# 2.22 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
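Both timings above include the database fetch. To isolate just the Python-side transformation cost, a rough stdlib-only comparison can be sketched with the `timeit` module. The `Record` class and data here are hypothetical stand-ins for the queryset:

```python
import csv
import io
import timeit

field_names = ["first_name", "last_name", "birth_year"]
rows = [("John", "Snow", 1813)] * 100_000  # stand-in for values_list()


class Record:  # stand-in for a model instance
    def __init__(self, first_name, last_name, birth_year):
        self.first_name = first_name
        self.last_name = last_name
        self.birth_year = birth_year


objects = [Record(*row) for row in rows]  # stand-in for all()


def original():
    # Transform each object into a list, one writerow call per row.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for obj in objects:
        writer.writerow([getattr(obj, field) for field in field_names])
    return buffer.getvalue()


def revised():
    # Pass the iterable of tuples straight to writerows.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


assert original() == revised()  # same CSV either way

print("original:", timeit.timeit(original, number=5))
print("revised: ", timeit.timeit(revised, number=5))
```

The exact numbers will differ by machine, but the gap comes entirely from the per-row transformation, since no database is involved.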
Bonus: Why not DictWriter?
Given that the Django `values` queryset method returns a queryset of dictionaries, you might have wondered "Why not use `csv.DictWriter`?"

In this case, it's a less efficient option. `values`, like `values_list`, avoids the overhead of creating model instances. However, a quick look at the DictWriter code shows a problem.
```python
# cpython/Lib/csv.py

class DictWriter:
    # some methods skipped for brevity

    def _dict_to_list(self, rowdict):
        if self.extrasaction == "raise":
            wrong_fields = rowdict.keys() - self.fieldnames
            if wrong_fields:
                raise ValueError("dict contains fields not in fieldnames: "
                                 + ", ".join([repr(x) for x in wrong_fields]))
        return (rowdict.get(key, self.restval) for key in self.fieldnames)

    def writerow(self, rowdict):
        return self.writer.writerow(self._dict_to_list(rowdict))

    def writerows(self, rowdicts):
        return self.writer.writerows(map(self._dict_to_list, rowdicts))
```
The name of the `_dict_to_list` method immediately raises questions. Its return statement is reminiscent of the original export CSV code, looping over each field of each object to create an iterable. The `DictWriter.writerow` and `DictWriter.writerows` methods confirm that DictWriter uses `_dict_to_list` and passes the result to the standard `writer` implementation.
So using `csv.DictWriter` adds a transformation step that isn't immediately obvious because it happens in the CPython code. In this case, it's better to fetch the data from the database as a list of tuples and pass it directly to `csv.writer`.
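The hidden transformation is easy to see from outside: `DictWriter` produces the same output as the plain writer, it just does the dict-to-list work for you on every row. A small stdlib-only comparison, using made-up rows:

```python
import csv
import io

field_names = ["first_name", "last_name"]
dict_rows = [{"first_name": "John", "last_name": "Snow"}]

# DictWriter: convenient, but _dict_to_list runs for every row.
dict_buffer = io.StringIO()
dict_writer = csv.DictWriter(dict_buffer, fieldnames=field_names)
dict_writer.writeheader()
dict_writer.writerows(dict_rows)

# Plain writer fed tuples: no per-row transformation needed.
plain_buffer = io.StringIO()
plain_writer = csv.writer(plain_buffer)
plain_writer.writerow(field_names)
plain_writer.writerows([("John", "Snow")])

assert dict_buffer.getvalue() == plain_buffer.getvalue()
print(dict_buffer.getvalue())
```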
TLDR
The faster CSV mixin, plus an example of use in an admin.py and a snippet of test code.
```python
# mixin.py

import csv

from django.http import HttpResponse


class ExportCsvMixin:
    def export_as_csv(self, request, queryset):
        meta = self.model._meta
        field_names = [field.name for field in meta.fields]

        response = HttpResponse(content_type="text/csv")
        response["Content-Disposition"] = "attachment; filename={}.csv".format(meta)
        writer = csv.writer(response)

        writer.writerow(field_names)
        writer.writerows(queryset.values_list(*field_names))

        return response

    export_as_csv.short_description = "Export Selected"
```
```python
# admin.py

from django.contrib import admin

from hospital.mixins import ExportCsvMixin
from hospital.models import Doctor


@admin.register(Doctor)
class DoctorAdmin(ExportCsvMixin, admin.ModelAdmin):
    # The method must be listed in actions to appear in the dropdown.
    actions = ["export_as_csv"]
```
```python
# snippet for tests

from django.test import RequestFactory

from hospital.models import Doctor

request = RequestFactory().get("/admin")
queryset = Doctor.objects.all()
ExportCsvMixin().export_as_csv(request, queryset)
```
Useful things
- Django ORM Under the Hood - Iterables looks at some of the Django classes that make querysets iterable.
- 8x speedup for Python's csv.DictWriter profiles code using DictWriter and shows a bottleneck. The article was published in 2015 but the "extrasaction" default can still cause unexpectedly slow code.