Faster CSV export from the Django admin
- A quick walk through the original version
- Why it's slow
- Using a different data structure
- What's the improvement?
- Bonus: Why not DictWriter?
- TLDR
- Useful things
Adding a CSV download button to the Django admin is a common approach when internal users need to be able to download data. It's simple to add and takes advantage of existing features, like the admin itself and the permissions framework.
The Django Admin Cookbook even provides sample code: How to export CSV from Django admin? (The cookbook hasn't been updated in a while but remains a useful resource, especially for developers working with the admin for the first time.) The Cookbook code adds a new option "Export selected" to the "Action" dropdown on each model's list page, which is a nice use of the standard admin UI.
The downside is that it's slow. Luckily, generating the CSV becomes much more efficient after minor changes to the code.
If you just want that revised code, without the explanation, skip to the TLDR.
A quick walk through the original version
Here's the original code.
```python
import csv

from django.http import HttpResponse


class ExportCsvMixin:
    def export_as_csv(self, request, queryset):
        meta = self.model._meta
        field_names = [field.name for field in meta.fields]

        response = HttpResponse(content_type="text/csv")
        response["Content-Disposition"] = "attachment; filename={}.csv".format(meta)
        writer = csv.writer(response)

        writer.writerow(field_names)
        for obj in queryset:
            row = writer.writerow([getattr(obj, field) for field in field_names])

        return response

    export_as_csv.short_description = "Export Selected"
```
It's a simple class, with a single method, that can be used with Django's standard `ModelAdmin` class. The `export_as_csv` method expects a `request` and a `queryset`, which are passed by other methods in the admin class. The queryset holds whatever records the user selected via the Django admin.
In the function body, some metadata is retrieved from the model whose data is being downloaded. The model name becomes part of the CSV filename and the model field names are used to create the CSV's header row.
The main work happens in the `for` loop, where each record in the queryset becomes a row in the CSV.
Finally, the CSV response is returned, which starts the user's file download.
Why it's slow
The problem with the original implementation is the `for` loop. The loop builds up the CSV row by row using `writerow`. There's also a sneaky second loop - a list comprehension that gets the value of each field. This happens inside the `for` loop, so it gets the value for each field of each record.

`writerow` expects a "row" and the Python 3 docs say a row is "an iterable of strings or numbers for Writer objects". The queryset itself is iterable but the objects within it aren't.
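That definition is easy to check with the standard library alone. In this sketch (stdlib only, no Django, with made-up data), lists, tuples and generators all qualify as rows, because each is an iterable of strings or numbers:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)

# Lists, tuples and generators are all iterables of strings or numbers,
# so each one counts as a "row".
writer.writerow(["John", "Snow", 1813])
writer.writerow(("Joseph", "Lister", 1827))
writer.writerow(str(n) for n in range(3))

print(buffer.getvalue())
```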
Here's an example that tries to pass each model instance to `writerow`. It's a simplified version of the original code, without the bits specific to the Django admin. (The example uses models from Django-Hospital, which I wrote to help some junior developers learn Django queries. If you want to try the examples yourself, you can clone Django-Hospital.)
```python
import csv

from hospital.models import Doctor

with open("/home/ellen/doctors.csv", "w") as f:
    field_names = [field.name for field in Doctor._meta.fields]
    writer = csv.writer(f)
    writer.writerow(field_names)
    for obj in Doctor.objects.all():
        writer.writerow(obj)
        # Error: iterable expected, not Doctor
```
As expected, the individual objects within a queryset are not iterable and don't count as rows. Because Django has returned a queryset of not-iterable model instances, the ExportCsvMixin transforms each model instance to a list that can be passed to `writerow`.
Generating the CSV would be faster without that transformation step.
Using a different data structure
Luckily, Django's queryset methods don't have to return a queryset of model instances, they can return other data structures.
The most commonly used methods, e.g. `all` and `filter`, do return a queryset of ORM objects. But there are other methods:

- `values` returns a queryset of dictionaries
- `values_list` returns a queryset of tuples
The docs for Django 4.0 say "values() and values_list() are both intended as optimizations for a specific use case: retrieving a subset of data without the overhead of creating a model instance." In other words, Django does less work before returning the queryset.
Even better, `values_list` returns a queryset of tuples, which is an iterable (queryset) of iterables (tuples). That meets the definition of a "row" expected by `csv.writerow` and the "iterable of rows" expected by `csv.writerows`. So the queryset itself can be passed to `csv.writerows`, avoiding the need for any transformation steps.
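The same point as a stdlib-only sketch, with hypothetical data standing in for the two queryset shapes:

```python
import csv
import io

# Hypothetical rows mirroring the two queryset shapes.
as_dicts = [{"first_name": "John", "last_name": "Snow"}]   # like values()
as_tuples = [("John", "Snow")]                             # like values_list()

buffer = io.StringIO()
writer = csv.writer(buffer)

# Tuples already satisfy csv's definition of a "row", so the whole list
# can be handed to writerows with no per-row transformation...
writer.writerows(as_tuples)

# ...while dicts need an extra transformation step first.
writer.writerows([list(d.values()) for d in as_dicts])

print(buffer.getvalue())
```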
The simplified version looks like this.
```python
import csv

from hospital.models import Doctor

with open("/home/ellen/doctors.csv", "w") as f:
    field_names = [field.name for field in Doctor._meta.fields]
    writer = csv.writer(f)
    writer.writerow(field_names)
    writer.writerows(Doctor.objects.values_list(*field_names))
```
What's the improvement?
While working on this, I did some rough timings to confirm the expected improvement actually happened. Those timings are outlined below. The short version is the new code reduced the time to generate a CSV by 66%.
To do the checks, I created some basic data in the form of 1,000,000 identical doctors.
```python
from hospital.models import Doctor

new_doctors = [
    Doctor(
        first_name="John",
        last_name="Snow",
        birth_year=1813,
        position=Doctor.ATTENDING,
        speciality=Doctor.GENERAL,
    )
    for _ in range(1000000)
]
Doctor.objects.bulk_create(new_doctors, batch_size=3000)
```
Then I used the `timeit` "magic command" in an IPython shell. Because I planned to do a few runs with `timeit`, I didn't want to faff around with files and used StringIO instead. (As a reminder, these timings weren't an exercise in precision, but a simple check on what I was doing.)
Times were taken on a machine with 1.8GHz Intel i7-8550U (8 cores) and 8GB RAM. Your own times will vary.
Original
```python
%%timeit

import csv
import io

from hospital.models import Doctor

buffer = io.StringIO()
field_names = [field.name for field in Doctor._meta.fields]
writer = csv.writer(buffer)

writer.writerow(field_names)
for obj in Doctor.objects.all():
    writer.writerow([getattr(obj, field) for field in field_names])

buffer.close()

# 6.51 s ± 169 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Revised
```python
%%timeit

import csv
import io

from hospital.models import Doctor

buffer = io.StringIO()
field_names = [field.name for field in Doctor._meta.fields]
writer = csv.writer(buffer)

writer.writerow(field_names)
writer.writerows(Doctor.objects.values_list(*field_names))

buffer.close()

# 2.22 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
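Both timings above include the database fetch. To isolate just the Python-side transformation cost, a rough stdlib-only comparison can be sketched with the `timeit` module. The `Record` class and data here are hypothetical stand-ins for the queryset:

```python
import csv
import io
import timeit

field_names = ["first_name", "last_name", "birth_year"]
rows = [("John", "Snow", 1813)] * 100_000  # stand-in for values_list()


class Record:  # stand-in for a model instance
    def __init__(self, first_name, last_name, birth_year):
        self.first_name = first_name
        self.last_name = last_name
        self.birth_year = birth_year


objects = [Record(*row) for row in rows]  # stand-in for all()


def original():
    # Transform each object into a list, one writerow call per row.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for obj in objects:
        writer.writerow([getattr(obj, field) for field in field_names])
    return buffer.getvalue()


def revised():
    # Pass the iterable of tuples straight to writerows.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


assert original() == revised()  # same CSV either way

print("original:", timeit.timeit(original, number=5))
print("revised: ", timeit.timeit(revised, number=5))
```

The exact numbers will differ by machine, but the gap comes entirely from the per-row transformation, since no database is involved.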
Bonus: Why not DictWriter?
Given that the Django `values` queryset method returns a queryset of dictionaries, you might have wondered "Why not use `csv.DictWriter`?"

In this case, it's a less efficient option. `values`, like `values_list`, avoids the overhead of creating model instances. However, a quick look at the DictWriter code shows a problem.
```python
# cpython/Lib/csv.py

class DictWriter:
    # some methods skipped for brevity

    def _dict_to_list(self, rowdict):
        if self.extrasaction == "raise":
            wrong_fields = rowdict.keys() - self.fieldnames
            if wrong_fields:
                raise ValueError("dict contains fields not in fieldnames: "
                                 + ", ".join([repr(x) for x in wrong_fields]))
        return (rowdict.get(key, self.restval) for key in self.fieldnames)

    def writerow(self, rowdict):
        return self.writer.writerow(self._dict_to_list(rowdict))

    def writerows(self, rowdicts):
        return self.writer.writerows(map(self._dict_to_list, rowdicts))
```
The name of the `_dict_to_list` method immediately raises questions. Its return statement is reminiscent of the original export CSV code, looping over each field of each object to create an iterable. The `DictWriter.writerow` and `DictWriter.writerows` methods confirm that DictWriter uses `_dict_to_list` and passes the result to the standard `writer` implementation.
So using `csv.DictWriter` adds a transformation step that isn't immediately obvious because it happens in the CPython code. In this case, it's better to fetch the data from the database as a list of tuples and pass it directly to `csv.writer`.
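The hidden transformation is easy to see from outside: `DictWriter` produces the same output as the plain writer, it just does the dict-to-list work for you on every row. A small stdlib-only comparison, using made-up rows:

```python
import csv
import io

field_names = ["first_name", "last_name"]
dict_rows = [{"first_name": "John", "last_name": "Snow"}]

# DictWriter: convenient, but _dict_to_list runs for every row.
dict_buffer = io.StringIO()
dict_writer = csv.DictWriter(dict_buffer, fieldnames=field_names)
dict_writer.writeheader()
dict_writer.writerows(dict_rows)

# Plain writer fed tuples: no per-row transformation needed.
plain_buffer = io.StringIO()
plain_writer = csv.writer(plain_buffer)
plain_writer.writerow(field_names)
plain_writer.writerows([("John", "Snow")])

assert dict_buffer.getvalue() == plain_buffer.getvalue()
print(dict_buffer.getvalue())
```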
TLDR
The faster CSV mixin, plus an example of use in an admin.py and a snippet of test code.
```python
# mixin.py

import csv

from django.http import HttpResponse


class ExportCsvMixin:
    def export_as_csv(self, request, queryset):
        meta = self.model._meta
        field_names = [field.name for field in meta.fields]

        response = HttpResponse(content_type="text/csv")
        response["Content-Disposition"] = "attachment; filename={}.csv".format(meta)
        writer = csv.writer(response)

        writer.writerow(field_names)
        writer.writerows(queryset.values_list(*field_names))

        return response

    export_as_csv.short_description = "Export Selected"
```
```python
# admin.py

from django.contrib import admin

from hospital.mixins import ExportCsvMixin
from hospital.models import Doctor


@admin.register(Doctor)
class DoctorAdmin(ExportCsvMixin, admin.ModelAdmin):
    # The method must be listed in actions to appear in the dropdown.
    actions = ["export_as_csv"]
```
```python
# snippet for tests

from django.test import RequestFactory

from hospital.models import Doctor

request = RequestFactory().get("/admin")
queryset = Doctor.objects.all()
ExportCsvMixin().export_as_csv(request, queryset)
```
Useful things
- Django ORM Under the Hood - Iterables looks at some of the Django classes that make querysets iterable.
- 8x speedup for Python's csv.DictWriter profiles code using DictWriter and shows a bottleneck. The article was published in 2015 but the "extrasaction" default can still cause unexpectedly slow code.