= ... SELECT COUNT(*) AS "__count" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...6 more queries... Middle Oops. I think this code comes from a slightly rushed job. We can do the COUNT and the SUM at the same time for each query. # New, improved count_and_size() function! def count_and_size(qs, start, end): sub_qs = qs.filter(created_at__gte=start, created_at__lt=end) return sub_qs.aggregate( count=Count('id'), total_size=Sum('size'), ) numbers['uploads'] = { 'today': count_and_size(upload_qs, start_today, today), 'yesterday': count_and_size(upload_qs, start_yesterday, start_today), 'this_month': count_and_size(upload_qs, start_this_month, today), 'this_year': count_and_size(upload_qs, start_this_year, today), } Much better, now there's only one query per time bucket. So 4 queries in total. E.g. SELECT COUNT("upload_upload"."id") AS "count", SUM("upload_upload"."size") AS "total_size" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...3 more queries... After But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there's so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works: def make_q(start, end): return Q(created_at__gte=start, created_at__lt=end) q_today = make_q(start_today, today) q_yesterday = make_q(start_yesterday, start_today) q_this_month = make_q(start_this_month, today) q_this_year = make_q(start_this_year, today) aggregates = upload_qs.aggregate( today_count=Count('pk', filter=q_today), today_total_size=Sum('size', filter=q_today), yesterday_count=Count('pk', filter=q_yesterday), yesterday_total_size=Sum('size', filter=q_yesterday), this_month_count=Count('pk', filter=q_this_month), this_month_total_size=Sum('size', filter=q_this_month), this_year_count=Count('pk', filter=q_this_year), this_year_total_size=Sum('size', filter=q_this_year), ) numbers['uploads'] = { 'today': { 'count': aggregates['today_count'], 'total_size': aggregates['today_total_size'], }, 'yesterday': { 'count': aggregates['yesterday_count'], 'total_size': aggregates['yesterday_total_size'], }, 'this_month': { 'count': aggregates['this_month_count'], 'total_size': aggregates['this_month_total_size'], }, 'this_year': { 'count': aggregates['this_year_count'], 'total_size': aggregates['this_year_total_size'], }, } Voila! One single query to get all those pieces of data. The SQL sent to PostgreSQL looks something like this: SELECT COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size", COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size", ... FROM "upload_upload"; Is this the best thing to do? I'm starting to have my doubts. Watch Out! When I take this now 1 monster query for a spin with an EXPLAIN ANALYZE prefix I notice something worrying! QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Aggregate (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1) -> Seq Scan on upload_upload (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1) Planning time: 0.427 ms Execution time: 0.674 ms (4 rows) A sequential scan! That's terrible. 
The created_at column is indexed in a BTREE so why can't it use the index. The short answer is: I don't know! I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the stackoverflow question I asked. So what did I do? I went back to the "middle" solution. One SELECT query per time bucket. So 4 queries in total, but at least all 4 is able to use an index. " />

January 13th, 2018

Peter Bengtsson: Conditional aggregation in Django 2.0

Programming, Python, by admin.

Django 2.0 came out a couple of weeks ago. It now supports “conditional aggregation”, an SQL standard feature I didn’t even know about.
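Concretely, aggregation functions like Count and Sum can now take a filter keyword argument holding a Q object, so one query can compute several differently filtered aggregates. A tiny preview (the Upload model and some_date variable here are just for illustration):

from django.db.models import Count, Q

# One query; only the rows matching the Q object are counted.
Upload.objects.aggregate(
    recent_count=Count('pk', filter=Q(created_at__gte=some_date)),
)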

Before

So I have a Django app which has an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various time intervals.

First of all, this is how it sets up the time intervals:

import datetime

from django.utils import timezone

today = timezone.now()  # "now", also the end of every bucket
start_today = today.replace(hour=0, minute=0, second=0)
start_yesterday = start_today - datetime.timedelta(days=1)
start_this_month = today.replace(day=1)
start_this_year = start_this_month.replace(month=1)

And then, for each of these, there’s a little function that returns a dict for each time interval:

from django.db.models import Sum

def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return {
        'count': sub_qs.count(),
        'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
    }

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

What you get is exactly 2 x 4 = 8 queries. One COUNT and one SUM for each time interval. E.g.

SELECT SUM("upload_upload"."size") AS "size" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... SELECT COUNT(*) AS "__count" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...6 more queries...

Middle

Oops. I think this code comes from a slightly rushed job. We can do the COUNT and the SUM at the same time for each query.

from django.db.models import Count, Sum

# New, improved count_and_size() function!
def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return sub_qs.aggregate(
        count=Count('id'),
        total_size=Sum('size'),
    )

numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}

Much better: now there’s only one query per time bucket, so 4 queries in total. E.g.

SELECT COUNT("upload_upload"."id") AS "count", SUM("upload_upload"."size") AS "total_size" FROM "upload_upload" WHERE ("upload_upload"."created_at" >= ... ...3 more queries...

After

But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there are so many keyword arguments, but I hope I’ve indented it nicely so it’s easy to see how it works:

from django.db.models import Count, Q, Sum

def make_q(start, end):
    return Q(created_at__gte=start, created_at__lt=end)

q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)

aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),
    yesterday_count=Count('pk', filter=q_yesterday),
    yesterday_total_size=Sum('size', filter=q_yesterday),
    this_month_count=Count('pk', filter=q_this_month),
    this_month_total_size=Sum('size', filter=q_this_month),
    this_year_count=Count('pk', filter=q_this_year),
    this_year_total_size=Sum('size', filter=q_this_year),
)

numbers['uploads'] = {
    'today': {
        'count': aggregates['today_count'],
        'total_size': aggregates['today_total_size'],
    },
    'yesterday': {
        'count': aggregates['yesterday_count'],
        'total_size': aggregates['yesterday_total_size'],
    },
    'this_month': {
        'count': aggregates['this_month_count'],
        'total_size': aggregates['this_month_total_size'],
    },
    'this_year': {
        'count': aggregates['this_year_count'],
        'total_size': aggregates['this_year_total_size'],
    },
}
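Side note: if the long list of keyword arguments bothers you, aggregate() also accepts a dict unpacked with **, so the pairs can be built in a loop. A sketch using the same names as above:

buckets = {
    'today': q_today,
    'yesterday': q_yesterday,
    'this_month': q_this_month,
    'this_year': q_this_year,
}

kwargs = {}
for name, q in buckets.items():
    kwargs[name + '_count'] = Count('pk', filter=q)
    kwargs[name + '_total_size'] = Sum('size', filter=q)

aggregates = upload_qs.aggregate(**kwargs)

numbers['uploads'] = {
    name: {
        'count': aggregates[name + '_count'],
        'total_size': aggregates[name + '_total_size'],
    }
    for name in buckets
}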

Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:

SELECT COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size", COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count", SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size", ... FROM "upload_upload";

Is this the best thing to do? I’m starting to have my doubts.

Watch Out!

When I take this single monster query for a spin with an EXPLAIN ANALYZE prefix, I notice something worrying!

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1)
   ->  Seq Scan on upload_upload  (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1)
 Planning time: 0.427 ms
 Execution time: 0.674 ms
(4 rows)

A sequential scan! That’s terrible. The created_at column has a BTREE index, so why can’t it use the index?

The short answer is: I don’t know!
I’ve uploaded a reduced, but still complete, example demonstrating this in a gist. It’s very similar to the example in the Stack Overflow question I asked.
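In case you want to poke at the plan yourself, here is a rough sketch of one way to get the generated SQL out of Django (it assumes DEBUG = True, since that is when Django records queries), which you can then paste into psql with an EXPLAIN ANALYZE prefix:

from django.db import connection

aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),
)

# connection.queries holds the SQL Django sent (only when DEBUG = True);
# the last entry is the aggregation query above.
print(connection.queries[-1]['sql'])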

So what did I do? I went back to the “middle” solution. One SELECT query per time bucket. So 4 queries in total, but at least all 4 are able to use an index.

