Django testing approaches (fixtures vs. SQL dumps vs. factories)

    August 6, 2019

    In this article, we try to find the most appropriate way to populate a database for tests in our Django project. Django is an open-source, “batteries included” Python framework that lets us write code fast and clean. And what is crucial for keeping a project neat? Tests.

    Importance of quick-running unit* tests

    *Strictly speaking, Django “unit” tests are closer to integration tests (most of the time they combine the DB, models, and views), but let us call them “unit” tests for simplicity, as that is the commonly used term.

    It’s unbearable to have a big project that is poorly tested. Testing an app at different levels (unit testing, integration testing, system testing) is the main way to prevent errors. Unfortunately, when you have a large, thoroughly tested project, running tests locally becomes a burden. One may say it’s okay to have unit tests that run for an hour or even more during the CI process.

    But let’s be honest: developers do run tests locally to make sure everything works fine, or they wait for a response from the remote server before doing anything further. It slows down development. 5–10 minutes is acceptable for local testing; 60 is not. And it’s not only about the time you spend. It’s about the programmers’ motivation as well. A long feedback loop demoralizes developers. As a result, they want neither to test what they have done nor to write new tests.

    So what one should do is keep unit tests running fast. What are the main reasons tests run slow? One of the main bottlenecks in Django tests is the database population. It’s possible to mock model objects without hitting the DB, but since the correctness of the results usually depends on the correctness of the data, it’s not a good choice. So we need to populate the database, and we need to do it as fast and as relevantly as we can.
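
    For illustration, here is a minimal sketch of what testing without the database can look like; it uses the Disease model shown later in this article, and SimpleTestCase, which forbids DB queries:

    from unittest import mock

    from django.test import SimpleTestCase

    from diseases.models import Disease


    class DiseaseNoDbTest(SimpleTestCase):

        def test_str_without_db(self):
            # an unsaved model instance never touches the database
            self.assertEqual(str(Disease(name='flu')), 'flu')

        def test_mocked_manager(self):
            # the manager method is mocked away, so no query is ever executed
            with mock.patch.object(Disease.objects, 'count', return_value=3):
                self.assertEqual(Disease.objects.count(), 3)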

    Actually, this was exactly the reason we decided to redesign our approach to writing tests: with the tests themselves running for around 2 minutes, we were spending around 20 minutes on database population.

    Fixtures vs. SQL dumps vs. factories

    Let’s go through the main pros and cons of each technique. SQL dumps and fixtures represent essentially the same idea, so we will start with their common pros and cons and then cover the specifics of each technique.

    Fixtures and SQL dumps:

    pros

    • easy to create via dumping the existing database
    • you don’t need to know all of the aspects of the app to use them in the tests
    • they are the same for the whole test case
    • they usually contain all the possible related data

    cons

    • most of the time, you load data that you don’t actually need in the test, which makes it slow

    Fixtures:

    pros

    • Django provides hooks for loading them during testing

    cons

    • they are slow to load

    SQL dumps:

    pros

    • they are fast to load

    cons

    • Django doesn’t provide hooks for loading them during testing
    • it’s hard to browse them or change them manually

    Factories:

    pros

    • the data is highly relevant to the test
    • they are easy to change
    • part of the data may be random, which widens coverage

    cons

    • you need to code them manually and think them through
    • you may miss something valuable in a test because you haven’t created all the relevant objects
    • you must know the project well to know what exactly you should create and how to do it
    • if you create a lot of objects, they are slower than the other approaches

    As you can see, each of them has pros and cons, so there is no apparent winner. However, I will explain later why we stuck with the factories approach.

    Let’s look at the usage of each of them and compare loading speed and usability.

    Let’s start with the models we have and the tests:

    from django.core.validators import MaxValueValidator
    from django.db import models


    class Sphere(models.Model):
        name = models.CharField('name', max_length=256, unique=True)
    
        class Meta:
            ordering = ['name']
    
        def __str__(self):
            return self.name
    
    
    class Disease(models.Model):
        sphere = models.ForeignKey(Sphere, on_delete=models.CASCADE)
        name = models.CharField(max_length=256, db_index=True, unique=True)
        chronic = models.BooleanField(default=False)
        # Symptom and DiseaseSymptom are defined elsewhere in the app
        symptoms = models.ManyToManyField(to='Symptom', through='DiseaseSymptom')
        duration = models.PositiveSmallIntegerField(default=10)
        contagiousness = models.PositiveSmallIntegerField(validators=[MaxValueValidator(100)])
        malignancy = models.PositiveSmallIntegerField(validators=[MaxValueValidator(100)])
        description = models.TextField()
        diagnostics = models.TextField(blank=True, null=True)
        treatment = models.TextField(blank=True, null=True)
        passing = models.TextField(blank=True, null=True)
        recommendations = models.TextField(blank=True, null=True)
        # occurrence = models.PositiveIntegerField(default=1)  # How many times this disease has occurred
        number = models.PositiveIntegerField('number of people on average to get disease from 10^6', default=0)
    
        class Meta:
            ordering = ['name']
    
        def __str__(self):
            return self.name

    So these are two simple models connected by a one-to-many relation.
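
    The Symptom and DiseaseSymptom models referenced by the ManyToManyField are left out of the listing; a minimal sketch of what they might look like in the same models module:

    class Symptom(models.Model):
        name = models.CharField(max_length=256, unique=True)

        def __str__(self):
            return self.name


    class DiseaseSymptom(models.Model):
        # the "through" model connecting Disease and Symptom
        disease = models.ForeignKey('Disease', on_delete=models.CASCADE)
        symptom = models.ForeignKey(Symptom, on_delete=models.CASCADE)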

    The main part of the test code:

    from django.test import TestCase

    from diseases.models import Sphere, Disease


    class SymptomFixturesTestCase(TestCase):
        # the fixture file name is illustrative; the fixture itself is a dump
        # of our test database
        fixtures = ['diseases.json']

        def setUp(self):
            super(SymptomFixturesTestCase, self).setUp()
            self.new_sphere = Sphere.objects.create(name='fake')

        def test_update_name(self):
            # update some diseases to run rollbacks
            for disease in Disease.objects.all()[:3]:
                disease.name = 'fake_name' + str(disease.id)
                disease.save()

            self.assertEqual(Disease.objects.filter(name__startswith='fake_name').count(), 3)

        def test_delete(self):
            # delete some diseases to run rollbacks
            disease_count = Disease.objects.count()
            for disease in Disease.objects.all()[:3]:
                disease.delete()

            self.assertEqual(Disease.objects.count(), disease_count - 3)

        def test_create(self):
            diseases_count = Disease.objects.count()
            Disease.objects.create(name='fake', sphere=self.new_sphere, duration=15,
                                   contagiousness=15, malignancy=50, description='fake')

            self.assertEqual(Disease.objects.count(), diseases_count + 1)

        def test_remove_sphere(self):
            # check that deleting a sphere removes all its diseases
            sphere = Sphere.objects.first()
            sphere_disease_count = Disease.objects.filter(sphere_id=sphere.id).count()
            all_disease_count = Disease.objects.count()
            sphere.delete()
            self.assertEqual(Disease.objects.count(), all_disease_count - sphere_disease_count)

    The code is quite simplified, but it shows the idea well.
    Let’s start with the fixtures approach. As is common, I just dumped our test database (e.g., with manage.py dumpdata), which covers most of the edge cases. As a result, we have a bit less than 1,000 objects in total. It may seem excessive, as we surely don’t use most of them in these tests, but as I said, the code is simplified; most of the objects are used in other tests. This is how it usually goes: we start with one TestCase where the data is needed and a neat fixture of 5 objects, and we end up with a monster fixture that is used in 20 test cases and contains 500 objects. The problems fixtures cause are the following:

    – fixtures tend to grow in size, as we usually try to populate them with data suitable for all our tests. As our fixtures grow, we load and process more and more irrelevant data.

    – we have two data sources: the fixtures and the objects created in the setUp method, which is quite a common pattern. We have to maintain both of them, which makes the approach even more error-prone and harder to change.

    – most of the time, we don’t know which object we are dealing with. We just take the first or the last one, which makes the tests obscure.

    Let’s see how much time the tests take.

    Ran 4 tests in 0.745s

    Let’s see how much of that is actually spent on running the tests themselves, as reported by PyCharm:

    Test Results: 26ms

    It’s just ridiculous. Most of the time is spent on the fixture loading.

    Maybe the problem lies in the fixtures themselves and not in the data volume? Fixtures must be parsed, and then the Django ORM steps in to create the model objects, which could cause the slowdown. Let’s substitute our fixtures with an SQL dump. The first problem we face is that Django doesn’t support SQL loading out of the box, especially in tests. So we need to override our TestCase class in the following way:

    SQL approach: TestCase with a setUpClass method

    import os

    import sqlparse
    from django.db import connection
    from django.test import TestCase

    # directory of the current test module, where dump.sql lives
    __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))


    class SymptomFixturesTestCase(TestCase):

        @classmethod
        def setUpClass(cls):
            super(SymptomFixturesTestCase, cls).setUpClass()
            with connection.cursor() as cur:
                with open(os.path.join(__location__, 'dump.sql')) as f:
                    for statement in sqlparse.split(f.read()):
                        if not statement:
                            continue
                        cur.execute(statement)

    This snippet loads the SQL dump when the tests run for the first time. It has to be said that this code is a proof of concept only and relies heavily on the transaction support of our test database. So what are the results of this approach?

    Ran 4 tests in 0.659s

    We have gained a little speedup, but it doesn’t help us much. It also creates some problems of its own (using SQL dumps is not a solution recommended by the Django team; in fact, they intentionally dropped support for it). So the reason our tests run slow is not the speed of the Django ORM and fixture loading; it’s the data volume.

    One way to solve this is to shard our fixtures into smaller ones that are more relevant to each TestCase, as sketched below. Unfortunately, that solves only the first problem with fixtures and makes maintenance even harder.
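
    In practice, sharding just means pointing each test case at its own small fixture files instead of one shared dump (the file names below are illustrative):

    from django.test import TestCase


    class DiseaseUpdateTestCase(TestCase):
        # load only the thin slice of data this test case actually needs
        fixtures = ['spheres_minimal.json', 'diseases_minimal.json']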

    Another solution is to create the objects ourselves in each test, so that the data is easy to change and stays as relevant as possible. Since creating objects directly with the ORM produces a lot of boilerplate, factories, which give us basic capabilities to auto-populate the data, are a good choice. So how does it look with factory_boy?

    import factory
    import factory.fuzzy

    from diseases.models import Sphere, Disease


    class SphereFactory(factory.django.DjangoModelFactory):
        class Meta:
            model = Sphere

        name = factory.Sequence(lambda n: 'Name {0}'.format(n))


    class DiseaseFactory(factory.django.DjangoModelFactory):
        class Meta:
            model = Disease

        sphere = factory.SubFactory(SphereFactory)
        name = factory.Sequence(lambda n: 'Name {0}'.format(n))
        contagiousness = factory.fuzzy.FuzzyInteger(low=1, high=100)
        malignancy = factory.fuzzy.FuzzyInteger(low=1, high=100)
        description = factory.Sequence(lambda n: 'Description {0}'.format(n))
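
    Before wiring the factories into tests, here is a quick sketch of what they give us (to be run where the test database is available):

    # SubFactory transparently creates and saves the related Sphere
    disease = DiseaseFactory()
    assert disease.sphere_id is not None

    # any generated field can be overridden per call
    flu = DiseaseFactory(name='flu', chronic=True)
    assert flu.name == 'flu' and flu.chronic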

    And our test case will look like the following:

    from django.test import TestCase

    from diseases.models import Disease
    # the module path for the factories is illustrative
    from diseases.factories import SphereFactory, DiseaseFactory


    class SymptomFactoriesTestCase(TestCase):
    
        def test_update_name(self):
            # update some disease to run rollbacks
    
            for disease in DiseaseFactory.create_batch(3):
                disease.name = 'fake_name' + str(disease.id)
                disease.save()
    
            self.assertEqual(Disease.objects.filter(name__startswith='fake_name').count(), 3)
    
        def test_delete(self):
            # delete some disease to run rollbacks
            DiseaseFactory.create_batch(5)
            disease_count = Disease.objects.count()
            for disease in Disease.objects.all()[:3]:
                disease.delete()
    
            self.assertEqual(Disease.objects.count(), disease_count - 3)
    
        def test_create(self):
            disease_count = Disease.objects.count()
            DiseaseFactory()
    
            self.assertEqual(Disease.objects.count(), disease_count + 1)
    
        def test_remove_sphere(self):
    
            # check that deletion of sphere removes all disease
            sphere = SphereFactory()
            DiseaseFactory.create_batch(3, sphere=sphere)
            sphere_disease_count = Disease.objects.filter(sphere_id=sphere.id).count()
            all_disease_count = Disease.objects.count()
            sphere.delete()
            self.assertEqual(Disease.objects.count(), all_disease_count - sphere_disease_count)

    And what do the time metrics show?

    Ran 4 tests in 0.028s

    And that’s it: almost no time is spent on data creation anymore.

    As you can see, it has sped up our tests nearly 30 times. Even though this example is quite synthetic, in my experience a speedup of 2 to 10 times is typical.

    Conclusion

    In conclusion, I want to say that there is no silver bullet. You either write your tests carefully, loading only the relevant data but spending more time writing them, or you write them fast and simply but lose that time later when running them. What to choose depends on the specific case. It’s fine to start with fixtures and rewrite them to factories later, when you see a fixture growing into an unmaintainable data source that is large and used across many test cases.
