Optimizing Django Queries: A Comprehensive Guide to select_related and prefetch_related

Optimizing Django Queries: A Comprehensive Guide to select_related and prefetch_related

When I started learning Django, the one thing that fascinated me a lot was the Django ORM. Maybe I was happy that I will not have to write SQL queries and can skip all the pain of writing crazy joins to fetch the data. I was throwing in objects.all() and objects.filter(...) with joy all around my application. Until I reached a point where I had a lot of data in the system and it started to lag. Sometimes as ORM users, we don't bother to see the underlying SQL query that is being built, or how many queries are fired? which becomes expensive over time.

My perspective started to change. I started thinking in SQL first and then how to query the Django ORM so that it builds the desired SQL query. With this approach, I started optimizing the queries and the latency of my APIs began to drop.

I'll be using simple query examples to demonstrate how we can use select_related, prefetch_related? How does it work? Should we use it everywhere or not?

The below function can be used as a decorator to print the SQL queries fired for the Django ORM we use.

from django.db import connection, reset_queries
def log_sql(function):
    def wrapper():
        reset_queries()
        function()
        for query in connection.queries:
            print('-' * 12)
            print(f'{query["time"]}: {query["sql"]}')
    return wrapper

The model design

We have a BookTag model which stores all the book tags, the Author model which stores all the authors, and the Book models which stores the name, description, the book author, and the tags associated with the book.

class BookTag(models.Model):

    title = models.CharField(max_length=64)
    code = models.CharField(max_length=64)

class Author(models.Model):

    first_name = models.CharField(max_length=64)
    last_name = models.CharField(max_length=64)

class Book(models.Model):

    name = models.CharField(max_length=64)
    description = models.TextField()
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    tags = models.ManyToManyField(BookTag)

Get all the book names

We fetch books using Book.objects.all(), then iterate over the queryset and use just the name attribute of each book.

@log_sql
def fetch_data():
    books = Book.objects.all()
    for book in books:
        print(f'Name: {book.name}')

fetch_data()
# OUTPUT:
Name: Lure the dragon
Name: Kingdom of Rak
Name: Loaded Judgement
------------
0.001: SELECT 
    "app_book"."id",
    "app_book"."name",
    "app_book"."description",
    "app_book"."author_id" 
FROM "app_book"

But when you see the above SQL query, it fetches all the columns from the Book table. If we had 20 fields in the book table, the 20 fields would have been fetched just to print the name.

Here, we take the help of values and pass in just the name.

@log_sql
def fetch_data():
    books = Book.objects.values('name')
    for book in books:
        print(f'Name: {book["name"]}')

fetch_data()
OUTPUT:
Name: Lure the dragon
Name: Kingdom of Rak
Name: Loaded Judgement
------------
0.001: SELECT "app_book"."name" FROM "app_book"

Check the SQL query above, only the book name is being selected from the book table. Keep in mind, the values returns the queryset in a list of dicts, therefore, we have to access each book like a dict.

Get the book name and the author's ID

Again, we hit Book.objects.all() for getting all the books, iterate over them and get book.name and book.author.id. Wait, let's check the queries.

@log_sql
def fetch_data():
    books = Book.objects.all()
    for book in books:
        print(f'Name: {book.name} | Author: {book.author.id}')

fetch_data()
Name: Lure the dragon | Author: 2
Name: Kingdom of Rak | Author: 5
Name: Loaded Judgement | Author: 3
------------
0.001: SELECT 
  "app_book"."id", 
  "app_book"."name", 
  "app_book"."description", 
  "app_book"."author_id" 
FROM "app_book"
------------
0.001: SELECT 
  "app_author"."id", 
  "app_author"."first_name", 
  "app_author"."last_name"
FROM "app_author" WHERE "app_author"."id" = 2 LIMIT 21
------------
0.001: SELECT 
  "app_author"."id", 
  "app_author"."first_name", 
  "app_author"."last_name" 
FROM "app_author" WHERE "app_author"."id" = 5 LIMIT 21
------------
0.000: SELECT 
  "app_author"."id", 
  "app_author"."first_name", 
  "app_author"."last_name" 
FROM "app_author" WHERE "app_author"."id" = 3 LIMIT 21

There is one SQL query for fetch all the books and 3 separate queries for fetching the author of each book! What?

If we do .id on a ForeignKeyField, Django actually hits the database to fetch the id. If we had a million books, 1 + additional million queries to get the author ID.

The better solution: use _id, if you need the ID of a foreign key field. Since the author ForeignKey field is saved as author_id in the DB, no extra queries are created to get the author ID.

@log_sql
def fetch_data():
    books = Book.objects.all()
    for book in books:
        print(f'Name: {book.name} | Author: {book.author_id}')

fetch_data()
Name: Lure the dragon | Author: 2
Name: Kingdom of Rak | Author: 5
Name: Loaded Judgement | Author: 3
------------
0.001: SELECT 
  "app_book"."id", 
  "app_book"."name", 
  "app_book"."description",
  "app_book"."author_id" 
FROM "app_book"

Just one query! Nice. Yes, we could have used values('name', 'author_id') to select just the columns that we need, but this example was just to show you how a .id can get expensive.

Get the book name and the author's first_name

Let's go with objects.all() and observe the queries.

@log_s`ql
def fetch_data():
    books = Book.objects.all()
    for book in books:
        print(f'Name: {book.name} | Author: {book.author.first_name}')

fetch_data()
Name: Lure the dragon | Author: Doug
Name: Kingdom of Rak | Author: Basil
Name: Loaded Judgement | Author: Gabrielle
------------
0.001: SELECT 
  "app_book"."id", 
  "app_book"."name", 
  "app_book"."description", 
  "app_book"."author_id"
FROM "app_book"
------------
0.001: SELECT 
  "app_author"."id", 
  "app_author"."first_name", 
  "app_author"."last_name"
FROM "app_author" WHERE "app_author"."id" = 2 LIMIT 21
------------
0.001: SELECT 
  "app_author"."id", 
  "app_author"."first_name", 
  "app_author"."last_name" 
FROM "app_author" WHERE "app_author"."id" = 5 LIMIT 21
------------
0.001: SELECT 
  "app_author"."id", 
  "app_author"."first_name", 
  "app_author"."last_name" 
FROM "app_author" WHERE "app_author"."id" = 3 LIMIT 21

Same as our author ID example, 1 query fetches the books + 3 separate queries to fetch the author data of each book.

What's happening here? Django observes that we are trying to fetch a value from a ForeignKeyField so each iteration hits the DB with a new query to fetch the related data.

We need to somehow tell Django, to fetch the author data with the book data in one query and this is where select_related comes in.

We will add a select_related('author') in the query and observe what happens.

@log_sql
def fetch_data():
    books = Book.objects.select_related('author').all()
    for book in books:
        print(f'Name: {book.name} | Author: {book.author.first_name}')

fetch_data()
Name: Lure the dragon | Author: Doug
Name: Kingdom of Rak | Author: Basil
Name: Loaded Judgement | Author: Gabrielle
------------
0.001: SELECT 
  "app_book"."id", 
  "app_book"."name", 
  "app_book"."description", 
  "app_book"."author_id", 
  "app_author"."id", 
  "app_author"."first_name", 
  "app_author"."last_name"
FROM "app_book" INNER JOIN "app_author"
ON ("app_book"."author_id" = "app_author"."id")

Woah! Django just hit a single query where it joined both the book table and the author table.

Can we optimize it more since we need only 2 columns? Yes, we use values.

Let's remove select_related('author') and add values('name', 'author__first_name').

@log_sql
def fetch_data():
    books = Book.objects.values('name', 'author__first_name')
    for book in books:
        print(f'Name: {book["name"]} | Author: {book["author__first_name"]}')

fetch_data()
Name: Lure the dragon | Author: Doug
Name: Kingdom of Rak | Author: Basil
Name: Loaded Judgement | Author: Gabrielle
------------
0.001: SELECT 
  "app_book"."name", 
  "app_author"."first_name"
FROM "app_book" INNER JOIN "app_author" 
ON ("app_book"."author_id" = "app_author"."id")

Much better, book and author were joined and it selected only the columns that were needed.

Get book name and the associated tags

With Book.objects.all() for all books and for the tags of each book we go book.tags.all().

@log_sql
def fetch_data():
    books = Book.objects.all()
    for book in books:
        print(f'Name: {book.name}')
        print(f'Description: {book.description}')
        for tag in book.tags.all():
            print(f'\t#{tag.name}')

fetch_data()
Name: Lure the dragon
Description: The dragon lorem ipsum dolor met
    #Sci-Fi
    #Crime
Name: Kingdom of Rak
Description: The kingdom lorem ipsum dolor met
    #Romance
    #Fantasy
Name: Loaded Judgement
Description: The loaded lorem ipsum dolor met
    #Fantasy
    #Crime
------------
0.001: SELECT 
  "app_book"."id", 
  "app_book"."name", 
  "app_book"."description", 
  "app_book"."author_id" 
FROM "app_book"
------------
0.001: SELECT 
  "app_booktag"."id", 
  "app_booktag"."name", 
  "app_booktag"."code" 
FROM "app_booktag" INNER JOIN "app_book_tags" 
ON ("app_booktag"."id" = "app_book_tags"."booktag_id") 
WHERE "app_book_tags"."book_id" = 1
------------
0.001: SELECT 
  "app_booktag"."id", 
  "app_booktag"."name", 
  "app_booktag"."code" 
FROM "app_booktag" INNER JOIN "app_book_tags" 
ON ("app_booktag"."id" = "app_book_tags"."booktag_id") 
WHERE "app_book_tags"."book_id" = 2
------------
0.001: SELECT 
  "app_booktag"."id", 
  "app_booktag"."name", 
  "app_booktag"."code" 
FROM "app_booktag" INNER JOIN "app_book_tags" 
ON ("app_booktag"."id" = "app_book_tags"."booktag_id") 
WHERE "app_book_tags"."book_id" = 3

From the above queries we understand, one query to get all the books and an extra query for each book to get its tag information.

Can we use select_related here?

Mmm, No. Since tags is a ManyToManyField we will have to use prefetch_related.

The select_related only works for ForeignKeyField and OneToOneField.

Alright, let's add prefetch_related('tags').

@log_sql
def fetch_data():
    books = Book.objects.prefetch_related('tags').all()
    for book in books:
        print(f'Name: {book.name}')
        print(f'Description: {book.description}')
        for tag in book.tags.all():
            print(f'\t#{tag.name}')

fetch_data()
Name: Lure the dragon
Description: The dragon lorem ipsum dolor met
    #Sci-Fi
    #Crime
Name: Kingdom of Rak
Description: The kingdom lorem ipsum dolor met
    #Romance
    #Fantasy
Name: Loaded Judgement
Description: The loaded lorem ipsum dolor met
    #Fantasy
    #Crime
------------
0.001: SELECT 
  "app_book"."id", 
  "app_book"."name", 
  "app_book"."description", 
  "app_book"."author_id" 
FROM "app_book"
------------
0.001: SELECT 
  ("app_book_tags"."book_id") AS "_prefetch_related_val_book_id", 
  "app_booktag"."id", 
  "app_booktag"."name", 
  "app_booktag"."code" 
FROM "app_booktag" INNER JOIN "app_book_tags" 
ON ("app_booktag"."id" = "app_book_tags"."booktag_id") 
WHERE "app_book_tags"."book_id" IN (1, 2, 3)

Oh, 2 queries, one for fetching all the books and one query for fetching all the tag information for all the books (check the WHERE clause).

We are using just the tag.name and we don't need the tag.code. Can we optimize it further?

Yes, Prefetch can be used to specify a queryset where we use only to get only the name of the tag.

from django.db.models import Prefetch
@log_sql
def fetch_data():
    books = Book.objects.prefetch_related(
        Prefetch('tags', queryset=BookTag.objects.only('name').all())
    ).all()
    for book in books:
        print(f'Name: {book.name}')
        for tag in book.tags.all():
            print(f'\t#{tag.name}')

fetch_data()
Name: Lure the dragon
    #Sci-Fi
    #Crime
Name: Kingdom of Rak
    #Romance
    #Fantasy
Name: Loaded Judgement
    #Fantasy
    #Crime
------------
0.001: SELECT 
  "app_book"."id", 
  "app_book"."name", 
  "app_book"."description", 
  "app_book"."author_id" 
FROM "app_book"
------------
0.001: SELECT 
  ("app_book_tags"."book_id") AS "_prefetch_related_val_book_id", 
  "app_booktag"."id", 
  "app_booktag"."name" 
FROM "app_booktag" INNER JOIN "app_book_tags" 
ON ("app_booktag"."id" = "app_book_tags"."booktag_id") 
WHERE "app_book_tags"."book_id" IN (1, 2, 3)

Check the second query, only the id, and name of the tags are being fetched.

I hope you got a lot of clarity after all these examples.

Should we use select_related or prefetch_related everywhere?

The answer is, it depends. You need to really think of what data you need and use the appropriate methods to get your desired results optimally. If you use select_related in a query where you don't need the related data, still the tables will be joined and the join operation is expensive.

Thanks for reading.

Cover Photo by Julissa Helmuth from Pexels