July 31st, 2017

Chris Moffitt: Pandas Grouper and Agg Functions Explained

Programming, Python, by admin.

Introduction

Every once in a while it is useful to take a step back and look at pandas’
functions and see if there is a new or better way to do things. I was recently
working on a problem and noticed that pandas had a Grouper function
that I had never used before. I looked into how it can be used and it turns out
it is useful for the type of summary analysis I tend to do on a frequent basis.

In addition to functions that have been around a while, pandas continues to provide
new and improved capabilities with every release. The updated agg function
is another very useful and intuitive tool for summarizing data.

This article will walk through how and why you may want to use the
Grouper and agg functions on your own data. Along the way, I will include
a few tips and tricks on how to use them most effectively.

Grouping Time Series Data

Pandas’ origins are in the financial industry so it should not be a surprise that
it has robust capabilities to manipulate and summarize time series data. Just look at the
extensive time series documentation to get a feel for all the options.
I encourage you to review it so that you’re aware of the concepts.

In order to illustrate this particular concept better, I will walk through an example of sales
data and some simple operations to get total sales by month, day, year, etc.

For this example, I’ll use my trusty transaction data that I’ve used in other articles.
You can follow along in the notebook as well.

import pandas as pd
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df['date'])
df.head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55

Before I go much further, it’s useful to become familiar with Offset Aliases.
These strings are used to represent various common time frequencies like days vs. weeks
vs. years. I always forget what these are called and how to use the more esoteric ones
so make sure to bookmark the link!
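To see a few of the common aliases in action, here is a quick sketch using a made-up daily series (all ones, so the sums are easy to check) rather than the sales data:

```python
import pandas as pd

# Hypothetical daily series of ones to illustrate a few offset aliases
rng = pd.date_range("2014-01-01", periods=90, freq="D")
s = pd.Series(1.0, index=rng)

monthly = s.resample("M").sum()    # 'M'  = calendar month end
weekly = s.resample("W").sum()     # 'W'  = week ending Sunday
quarterly = s.resample("Q").sum()  # 'Q'  = calendar quarter end
```

Since the 90 days run from January 1 through March 31, the monthly result has three rows and the quarterly result has one, all summing to 90.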

For example, if you were interested in summarizing all of the sales by month, you could use the
resample function. The tricky part about using resample is that it only
operates on an index. In this data set, the data is not indexed by the date column
so resample would not work without restructuring the data. In order to make it work,
use set_index to make the date column an index and then resample:

df.set_index('date').resample('M')["ext price"].sum()
date
2014-01-31 185361.66
2014-02-28 146211.62
2014-03-31 203921.38
2014-04-30 174574.11
2014-05-31 165418.55
2014-06-30 174089.33
2014-07-31 191662.11
2014-08-31 153778.59
2014-09-30 168443.17
2014-10-31 171495.32
2014-11-30 119961.22
2014-12-31 163867.26
Freq: M, Name: ext price, dtype: float64

This is a fairly straightforward way to summarize the data but it gets a little more
challenging if you would like to group the data as well. If we would like to see
the monthly results for each customer, then you could do this (results truncated
to 20 rows):

df.set_index('date').groupby('name')["ext price"].resample("M").sum()
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

This certainly works but it feels a bit clunky. Fortunately Grouper makes
this a little more streamlined. Instead of having to play around with reindexing, we
can use our normal groupby syntax but provide a little more info on how
to group the data in the date column:

df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

Since groupby is one of my standard functions, this approach seems simpler
to me and it is more likely to stick in my brain.
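As a side note, the long customer-by-month series above can be made easier to scan by pivoting the months into columns with unstack. Here is a minimal sketch with a tiny made-up stand-in for the transaction data:

```python
import pandas as pd

# Hypothetical mini version of the transaction data above
df = pd.DataFrame({
    "name": ["Barton LLC", "Barton LLC", "Kulas Inc", "Kulas Inc"],
    "date": pd.to_datetime(["2014-01-15", "2014-02-20",
                            "2014-01-10", "2014-02-25"]),
    "ext price": [100.0, 200.0, 50.0, 75.0],
})

# Same Grouper pattern as above; unstack() then pivots the month
# labels into columns, giving one row per customer
monthly = df.groupby(["name", pd.Grouper(key="date", freq="M")])["ext price"].sum()
table = monthly.unstack()
```

The result is a 2x2 table with customers as rows and month-end dates as columns.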

The nice benefit of this capability is that if you are interested in looking at
data summarized in a different time frame, just change the freq parameter
to one of the valid offset aliases. For instance, an annual summary using December
as the last month would look like this:

df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
name date
Barton LLC 2014-12-31 109438.50
Cronin, Oberbrunner and Spencer 2014-12-31 89734.55
Frami, Hills and Schmidt 2014-12-31 103569.59
Fritsch, Russel and Anderson 2014-12-31 112214.71
Halvorson, Crona and Champlin 2014-12-31 70004.36
Herman LLC 2014-12-31 82865.00
Jerde-Hilpert 2014-12-31 112591.43
Kassulke, Ondricka and Metz 2014-12-31 86451.07
Keeling LLC 2014-12-31 100934.30
Kiehn-Spinka 2014-12-31 99608.77
Koepp Ltd 2014-12-31 103660.54
Kuhn-Gusikowski 2014-12-31 91094.28
Kulas Inc 2014-12-31 137351.96
Pollich LLC 2014-12-31 87347.18
Purdy-Kunde 2014-12-31 77898.21
Sanford and Sons 2014-12-31 98822.98
Stokes LLC 2014-12-31 91535.92
Trantow-Barrows 2014-12-31 123381.38
White-Trantow 2014-12-31 135841.99
Will LLC 2014-12-31 104437.60
Name: ext price, dtype: float64

If your annual sales were on a non-calendar basis, then the data can be easily
changed by modifying the freq parameter. I encourage you to play around
with different offsets to get a feel for how it works. When dealing with summarizing
time series data, this is incredibly handy. To put this in perspective, try doing
this in Excel. It is certainly possible (using pivot tables and
custom grouping) but I do not think it is nearly as intuitive as the pandas approach.
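For example, a company with an assumed June 30 fiscal year end could use the 'A-JUN' alias. A small sketch with two made-up transactions that straddle that boundary:

```python
import pandas as pd

# Two hypothetical transactions straddling an assumed June 30 year end
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-05-15", "2014-08-20"]),
    "ext price": [100.0, 200.0],
})

# 'A-JUN' groups into years ending in June, so these rows land in
# different fiscal years (labeled 2014-06-30 and 2015-06-30)
fy = df.groupby(pd.Grouper(key="date", freq="A-JUN"))["ext price"].sum()
```

The May sale falls into the year ending 2014-06-30 and the August sale into the year ending 2015-06-30.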

New and improved aggregate function

In pandas 0.20.1, there was a new agg function added that makes it a lot simpler
to summarize data in a manner similar to the groupby API.

To illustrate the functionality, let’s say we need to get the total of the
ext price and quantity columns as well as the average of the unit price. The process
is not very convenient:

df[["ext price", "quantity"]].sum()
ext price 2018784.32
quantity 36463.00
dtype: float64
df["unit price"].mean()
55.007526666666664

This works but it’s a bit messy. The new agg makes this simpler:

df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
ext price quantity unit price
sum 2.018784e+06 36463.000000 82511.290000
mean 1.345856e+03 24.308667 55.007527

The results are good but including the sum of the unit price is not really that
useful. Fortunately we can pass a dictionary to agg and specify what
operations to apply to each column.

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
quantity ext price unit price
mean 24.308667 1.345856e+03 55.007527
sum 36463.000000 2.018784e+06 NaN

I find this approach really handy when I want to summarize several columns of data.
In the past, I would run the individual calculations and build up the resulting dataframe
a row at a time. It was tedious. This is a much better approach.

As an added bonus, you can define your own functions. For instance, I frequently
find myself needing to aggregate data and use a mode function that works on text.
I found a lambda function that uses value_counts to do what I need and
frequently use this get_max function:

get_max = lambda x: x.value_counts(dropna=False).index[0]

Then, if I want to include the most frequent sku in my summary table:

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
quantity sku ext price unit price
<lambda> NaN S2-77896 NaN NaN
mean 24.308667 NaN 1.345856e+03 55.007527
sum 36463.000000 NaN 2.018784e+06 NaN

This is pretty cool but there is one thing that has always bugged me about this approach.
The fact that the column says “&lt;lambda&gt;” bothers me. Ideally I want it to say
“most frequent.” In the past I’d jump through some hoops to rename it. But, when
working on this article I stumbled on another approach: explicitly defining the name
of the lambda function.

get_max.__name__ = "most frequent"

Now, when I do the aggregation:

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
quantity sku ext price unit price
most frequent NaN S2-77896 NaN NaN
mean 24.308667 NaN 1.345856e+03 55.007527
sum 36463.000000 NaN 2.018784e+06 NaN

I get a much nicer label! It’s a small thing but I am definitely glad I finally
figured that out.
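An equivalent way to get a readable label, if you prefer it to assigning __name__, is to use a regular def, which picks up its name automatically. A small sketch on a made-up sku column:

```python
import pandas as pd

# A regular def carries its own name, so no __name__ assignment is needed
def most_frequent(x):
    # value_counts sorts by count, so index[0] is the most common value
    return x.value_counts(dropna=False).index[0]

df = pd.DataFrame({"sku": ["S2-77896", "B1-20000", "S2-77896"]})
out = df.agg({"sku": [most_frequent]})
```

The resulting row is labeled "most_frequent" with no extra renaming step.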

As a final bonus, here’s one other trick. The aggregate function using a
dictionary is useful but one challenge is that it does not preserve order. If
you want to make sure your columns are in a specific order, you can use an
OrderedDict:

import collections
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
ext price quantity sku
mean 1.345856e+03 24.308667 NaN
most frequent NaN NaN S2-77896
sum 2.018784e+06 36463.000000 NaN
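As an aside, this sketch assumes a later Python (3.7+), where plain dicts preserve insertion order, so the OrderedDict step becomes optional on modern interpreters:

```python
import pandas as pd

def most_frequent(x):
    return x.value_counts(dropna=False).index[0]

# Hypothetical mini stand-in for the sales data
df = pd.DataFrame({
    "ext price": [100.0, 200.0],
    "quantity": [3, 4],
    "sku": ["S2-77896", "S2-77896"],
})

# On Python 3.7+, a plain dict keeps this key order, so the result
# columns come out as ext price, quantity, sku
f = {"ext price": ["sum", "mean"], "quantity": ["sum", "mean"], "sku": [most_frequent]}
res = df.agg(f)
```

On older interpreters, stick with the OrderedDict version above.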

Conclusion

The pandas library continues to grow and evolve over time. Sometimes it is useful
to step back and check whether there is a simpler approach to the problems
you solve frequently. Pandas’ Grouper function and the updated
agg function are really useful when aggregating and summarizing data. I hope this
article will be useful to you in your data analysis. Are there any other pandas
functions that you just learned about or might be useful to others? Feel free
to give your input in the comments.
