
https://www.analyticsvidhya.com/blog/2020/03/groupby-pandas-aggregating-data-python/
Understanding Pandas Groupby for Data Aggregation

Introduction

What if I told you that we could derive effective and impactful insights from our dataset in just a few lines of code? That's the beauty of Pandas' GroupBy function in Python! I have lost count of the number of times I've relied on GroupBy to quickly summarize data and aggregate it in a way that's easy to interpret.

This helps not only when we’re working on a data science project and need quick
results but also in hackathons! When time is of the essence (and when is it not?),
the GroupBy function in Pandas saves us a ton of effort by delivering super quick
results in a matter of seconds. If you are familiar with GROUP BY in SQL, this article will be even easier for you to understand!

Loving GroupBy already? In this tutorial, I will first explain the GroupBy function
using an intuitive example before picking up a real-world dataset and
implementing GroupBy in Python. Let’s begin aggregating!

Learning Objectives

 Understanding the syntax and functionality of the groupby() method is important for efficient data grouping.
 Familiarizing yourself with different types of aggregation functions available in pandas, including sum(), mean(), count(), max(), and min(), is necessary to perform effective data analysis.
 Knowing how to apply various aggregation functions to grouped data enables data analysts to extract useful insights from large data sets.

If you’re new to the world of Python and Pandas, you’ve come to the right place.
Here are two popular free courses you should check out:

 Python for Data Science
 Pandas for Data Analysis in Python

Table of contents

 Introduction
 What Is the Pandas’ GroupBy Function?
 Understanding the Dataset & Problem Statement
 First Look at Pandas GroupBy
 The Split-Apply-Combine Strategy
 Loop Over GroupBy Groups
 Applying Functions to GroupBy Groups
 Conclusion
 Frequently Asked Questions

What Is the Pandas’ GroupBy Function?

Pandas' GroupBy is a powerful and versatile operation in Python. It allows you to split your data into separate groups and perform computations on each group for better analysis.

Let me take an example to elaborate on this. Let's say we are trying to analyze the weight of a person in a city. We can easily get a fair idea of their weight by determining the mean weight of all the city dwellers. But here's a question – would the weight be affected by the gender of a person?

We can group the city dwellers into different gender groups and compute their
mean weight. This would give us a better insight into the weight of a person living
in the city. But we can probably get an even better picture if we further separate
these gender groups into different age groups and then take their mean weight
(because a teenage boy’s weight could differ from that of an adult male)!

You can see how separating people into groups and then computing a statistic for each group allows us to make better analyses than just looking at a statistic for the entire population. This is what makes GroupBy so great!

GroupBy allows us to group our data based on different features and get a more accurate idea about the data. It is a one-stop shop for deriving deep insights from your data!
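To make this concrete, here is a minimal sketch of both levels of grouping on a small hypothetical table (made-up numbers, not from the original article):

import pandas as pd

# Hypothetical weights of a few city dwellers
people = pd.DataFrame({
    'gender': ['m', 'f', 'm', 'f', 'm', 'f'],
    'age_group': ['teen', 'teen', 'adult', 'adult', 'adult', 'teen'],
    'weight': [58, 52, 82, 64, 90, 49],
})

# Mean weight per gender
print(people.groupby('gender')['weight'].mean())

# Mean weight per gender and age group
print(people.groupby(['gender', 'age_group'])['weight'].mean())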

Understanding the Dataset & Problem Statement

We will be working with the Big Mart Sales dataset from our DataHack platform.
It contains attributes related to the products sold at various stores of BigMart. The
aim is to find out the sales of each product at a particular store.

Right, let’s import the libraries and explore the data:

import pandas as pd
import numpy as np

df = pd.read_csv('train_v9rqX0R.csv')

We have some NaN values in our dataset, mostly in the Item_Weight and Outlet_Size columns. I will handle the missing values for Outlet_Size right now, but we'll handle the missing values for Item_Weight later in the article using the GroupBy function!
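One simple way to do this, offered here as a sketch rather than the author's exact method, is to fill the missing Outlet_Size values with the most common size:

# One possible fix for the missing Outlet_Size values
# (an assumption, not necessarily the original article's approach):
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])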

First Look at Pandas GroupBy


Let's group the dataset based on the outlet location type using GroupBy. The syntax is simple – we just have to use pandas' dataframe.groupby():

df.groupby('Outlet_Location_Type')

GroupBy has conveniently returned a DataFrameGroupBy object. It has split the data into separate groups. However, it won't do anything unless it is explicitly told to do so. So, let's find the count of different outlet location types:

df.groupby('Outlet_Location_Type').count()

We did not tell GroupBy which column we wanted to apply the aggregation function on, so it applied the function to all the relevant columns and returned the output.

But fortunately, the GroupBy object supports column indexing just like a pandas DataFrame!

So let’s find out the total sales for each location type:

df.groupby('Outlet_Location_Type')['Item_Outlet_Sales']

Here, GroupBy has returned a SeriesGroupBy object. No computation will be done until we specify the agg function:

df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].sum()

Awesome! Now, let’s understand the work behind the GroupBy function in
Pandas.

The Split-Apply-Combine Strategy

You just saw how quickly you can get an insight into grouped data using the
GroupBy function. But, behind the scenes, a lot is taking place, which is important
to understand to gauge the true power of GroupBy.

GroupBy employs the Split-Apply-Combine strategy coined by Hadley Wickham in his 2011 paper. Using this strategy, a data analyst can break down a big problem into manageable parts, perform operations on the individual parts, and combine them back together to answer a specific question.

I want to show you how this strategy works in GroupBy by working with a sample
dataset to get the average height for males and females in a group. Let’s create that
dataset:

data = {'Gender': ['m', 'f', 'f', 'm', 'f', 'm', 'm'],
        'Height': [172, 171, 169, 173, 170, 175, 178]}
df_sample = pd.DataFrame(data)
df_sample
Splitting the data into separate groups:

f_filter = df_sample['Gender']=='f'
print(df_sample[f_filter])

m_filter = df_sample['Gender']=='m'
print(df_sample[m_filter])

Applying the operation that we need to perform (average in this case):

f_avg = df_sample[f_filter]['Height'].mean()

m_avg = df_sample[m_filter]['Height'].mean()

print(f_avg,m_avg)

170.0 174.5

Finally, combining the result to output a DataFrame:

df_output = pd.DataFrame({'Gender': ['f', 'm'], 'Height': [f_avg, m_avg]})
df_output
All these three steps can be achieved by using Groupby with just a single line of
code! Here’s how:

df_sample.groupby('Gender').mean()

Now that is smart! You can see how GroupBy simplifies our task by doing the same split-apply-combine work behind the scenes, without us having to worry about a thing!

Now that you understand the Split-Apply-Combine strategy, let's dive deeper into the GroupBy function and unlock its full potential.

Loop Over GroupBy Groups


Remember the GroupBy object we created at the beginning of this article? Don’t
worry, we’ll create it again:

obj = df.groupby('Outlet_Location_Type')
obj

We can display the indices in each group by accessing the groups attribute of the GroupBy object:

obj.groups

We can even iterate over all of the groups:

for name, group in obj:
    print(name, 'contains', group.shape[0], 'rows')

But what if you want to get a specific group out of all the groups? Well, don’t
worry. Pandas has a solution for that too.

Just provide the specific group name when calling get_group on the group object.
Here, I want to check out the features for the ‘Tier 1’ group of locations only:

obj.get_group('Tier 1')
Now isn't that wonderful! You have all the Tier 1 rows to work with and can derive wonderful insights! But wait, didn't I say that GroupBy is lazy and doesn't do anything unless explicitly specified? Alright then, let's see GroupBy in action with the aggregate functions.

Applying Functions to GroupBy Groups

The apply step is unequivocally the most important step of a GroupBy operation, where we can perform a variety of operations using aggregation, transformation, filtration, or even your own custom function!

Let’s have a look at these in detail.

Aggregation

We have looked at some aggregation functions in the article so far, such as count, sum, and mean. These perform statistical operations on a set of data. Have a glance at all the aggregate functions in the Pandas package:

 count() – Number of non-null observations
 sum() – Sum of values
 mean() – Mean of values
 median() – Arithmetic median of values
 min() – Minimum
 max() – Maximum
 mode() – Mode
 std() – Standard deviation
 var() – Variance

But the agg() function in Pandas gives us the flexibility to perform several
statistical computations all at once! Here is how it works:

df.groupby('Outlet_Location_Type').agg([np.mean, np.median])

We can even run GroupBy with multiple indexes to get better insights from our
data:

df.groupby(['Outlet_Location_Type', 'Outlet_Establishment_Year'], as_index=False).agg(
    {'Outlet_Size': pd.Series.mode, 'Item_Outlet_Sales': np.mean})

Notice that I have used different aggregation functions for different column names
by passing them in a dictionary with the corresponding operation to be performed.
This allowed me to group and apply computations on nominal and numeric
features simultaneously.

Also, I have changed the value of the as_index parameter to False. This way, the grouping columns come back as regular columns instead of forming the index.

We can even rename the aggregated columns to improve their comprehensibility. Since we are grouping by two columns here, the result comes back with a multi-index on the rows:

df.groupby(['Outlet_Type', 'Item_Type']).agg(
    mean_MRP=('Item_MRP', np.mean),
    mean_Sales=('Item_Outlet_Sales', np.mean))

It is amazing how a name change can improve the understandability of the output!

Transformation

Transformation allows us to perform some computation on the groups as a whole and then return a result with the same index as the original DataFrame. This is done using the transform() function.

We will impute the null values in the Item_Weight column using the transform() function.

The Item_Fat_Content and Item_Type will affect the Item_Weight, don’t you
think? So, let’s group the DataFrame by these columns and handle the missing
weights using the mean of these groups:

df['Item_Weight'] = df.groupby(['Item_Fat_Content', 'Item_Type'])['Item_Weight'].transform(
    lambda x: x.fillna(x.mean()))

“Using the transform function, a DataFrame calls a function on itself to produce a DataFrame with transformed values.”

You can read more about the transform() function in this article.
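As another quick illustration of transform() (a sketch, not from the original article), you can broadcast a group-level statistic back onto every row, for example attaching each location type's mean sales as a new column:

# Sketch: attach each location type's mean sales to every row in that group
df['Location_Mean_Sales'] = (
    df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].transform('mean')
)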

Filtration

Filtration allows us to discard certain values based on computation and return only
a subset of the group. We can do this using the filter() function in Pandas.

Let’s take a look at the number of rows in our DataFrame presently:

df.shape

(8523, 12)
If I wanted to keep only those groups whose item weights have a standard deviation below 3, I could use the filter function to do the job:

def filter_func(x):
    return x['Item_Weight'].std() < 3

df_filter = df.groupby(['Item_Weight']).filter(filter_func)
df_filter.shape

(8510, 12)

GroupBy has conveniently returned a DataFrame with only those groups whose Item_Weight standard deviation is less than 3.

Applying Our Own Functions

Pandas’ apply() function applies a function along an axis of the DataFrame. When
using it with the GroupBy function, we can apply any function to the grouped
result.

For example, if I wanted to center the Item_MRP values with the mean of their establishment year group, I could use the apply() function to do just that:

df_apply = df.groupby(['Outlet_Establishment_Year'])['Item_MRP'].apply(lambda x: x - x.mean())
df_apply
Here, the values have been centered, and you can check whether the item was sold
at an MRP above or below the mean MRP for that year.

Conclusion

I’m sure you can see how amazing the GroupBy function is and how useful it can
be for analyzing your data. I hope this article helped you understand the function
better! But practice makes perfect, so start with the super impressive datasets on
our very own DataHack platform. Moving forward, you can read about how you
can analyze your data using a pivot table in Pandas.

Key Takeaways

 Groupby() is a powerful function in pandas that allows you to group data based on one or more columns.
 You can apply many operations to a groupby object, including aggregation functions like sum(), mean(), and count(), as well as lambda functions and other custom functions using apply().
 The resulting output of a groupby() operation can be a pandas Series or DataFrame, depending on the operation and data structure.

Frequently Asked Questions


Q1. Can we group by an aggregate function?
A. Yes, we can apply aggregate functions to grouped data in pandas. To do so, we use the groupby method followed by the aggregate function we want to apply.
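For example, a minimal sketch using this article's dataset:

# Mean sales per outlet location type
df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].mean()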

Q2. Can we use groupby without an aggregate function in pandas?
A. Yes, we can use groupby without an aggregate function in pandas. In this case, groupby will return a GroupBy object that can be used to perform further operations.

Q3. What is the difference between groupby and groupby agg?
A. Groupby and groupby agg are both methods in pandas that allow us to group a DataFrame by one or more columns and perform operations on the resulting groups. However, there are some important differences between the two. Groupby returns a GroupBy object, which can be used to perform a variety of operations on the groups, whereas groupby agg is a method specifically for performing aggregation operations on a grouped DataFrame. It allows us to specify one or more aggregation functions to apply to each group and returns a DataFrame containing the results.

https://pbpython.com/groupby-agg.html

Comprehensive Guide to Grouping and Aggregating with Pandas
Posted by Chris Moffitt in articles

Introduction
One of the most basic analysis functions is grouping and aggregating data. In some
cases, this level of analysis may be sufficient to answer business questions. In other
instances, this activity might be the first step in a more complex data science
analysis. In pandas, the groupby function can be combined with one or more
aggregation functions to quickly and easily summarize data. This concept is deceptively simple, and most new pandas users will understand it.
However, they might be surprised at how useful complex aggregation functions can
be for supporting sophisticated analysis.

This article will quickly summarize the basic pandas aggregation functions and show
examples of more complex custom aggregations. Whether you are a new or more
experienced pandas user, I think you will learn a few things from this article.
Aggregating
In the context of this article, an aggregation function is one which takes
multiple individual values and returns a summary. In the majority of the cases,
this summary is a single value.

The most common aggregation functions are a simple average or summation of values. As of pandas 0.20, you may call an aggregation function on one or more columns of a DataFrame.

Here’s a quick example of calculating the total and average fare using the Titanic
dataset (loaded from seaborn):

import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')

df['fare'].agg(['sum', 'mean'])
sum 28693.949300
mean 32.204208
Name: fare, dtype: float64

This simple concept is a necessary building block for more complex analysis.

One area that needs to be discussed is that there are multiple ways to call an
aggregation function. As shown above, you may pass a list of functions to apply to
one or more columns of data.

What if you want to perform the analysis on only a subset of columns? There are two
other options for aggregations: using a dictionary or a named aggregation.

Here is a comparison of the three options:
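As a rough sketch, here are the three call styles side by side (shown in the groupby context used throughout the rest of this article):

# 1. List: apply several functions to one column
df.groupby('embark_town')['fare'].agg(['sum', 'mean'])

# 2. Dictionary: map each column to its function(s)
df.groupby('embark_town').agg({'fare': ['sum', 'mean']})

# 3. Named aggregation: name each output column explicitly
df.groupby('embark_town').agg(fare_sum=('fare', 'sum'),
                              fare_mean=('fare', 'mean'))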


It is important to be aware of these options and know which one to use when.

Choosing an aggregation approach

As a general rule, I prefer to use dictionaries for aggregations. The tuple approach is limited by only being able to apply one aggregation at a time to a specific column. If I need to rename columns, then I will use the rename function after the aggregations are complete. In some specific instances, the list approach is a useful shortcut. I will reiterate, though, that I think the dictionary approach provides the most robust option for the majority of situations.

Groupby
Now that we know how to use aggregations, we can combine this with groupby to
summarize data.

Basic math
The most common built in aggregation functions are basic math functions including
sum, mean, median, minimum, maximum, standard deviation, variance, mean
absolute deviation and product.

We can apply all these functions to the fare column while grouping by embark_town:

agg_func_math = {
    'fare': ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'mad', 'prod']
}
df.groupby(['embark_town']).agg(agg_func_math).round(2)

This is all relatively straightforward math.

As an aside, I have not found a good usage for the prod function which computes the
product of all the values in a group. For the sake of completeness, I am including it.

One other useful shortcut is to use describe to run multiple built-in aggregations at
one time:

agg_func_describe = {'fare': ['describe']}
df.groupby(['embark_town']).agg(agg_func_describe).round(2)

Counting
After basic math, counting is the next most common aggregation I perform on
grouped data. In some ways, this can be a little more tricky than the basic math.
Here are three examples of counting:

agg_func_count = {'embark_town': ['count', 'nunique', 'size']}
df.groupby(['deck']).agg(agg_func_count)

The major distinction to keep in mind is that count will not include NaN values
whereas size will. Depending on the data set, this may or may not be a useful
distinction. In addition, the nunique function will exclude NaN values in the
unique counts. Keep reading for an example of how to include NaN in the unique
value counts.

First and last


In this example, we can select the highest and lowest fare by embarked town. One
important point to remember is that you must sort the data first if you
want first and last to pick the max and min values.

agg_func_selection = {'fare': ['first', 'last']}
df.sort_values(by=['fare'],
               ascending=False).groupby(['embark_town']).agg(agg_func_selection)
In the example above, I would recommend using max and min but I am
including first and last for the sake of completeness. In other applications (such
as time series analysis) you may want to select the first and last values for
further analysis.

Another selection approach is to use idxmax and idxmin to select the index value
that corresponds to the maximum or minimum value.

agg_func_max_min = {'fare': ['idxmax', 'idxmin']}
df.groupby(['embark_town']).agg(agg_func_max_min)

We can check the results:

df.loc[[258, 378]]

Here's another shortcut trick you can use to see the rows with the max fare:
df.loc[df.groupby('class')['fare'].idxmax()]

The above example is one of those places where the list-based aggregation is a
useful shortcut.

Other libraries
You are not limited to the aggregation functions in pandas. For instance, you could
use stats functions from scipy or numpy.

Here is an example of calculating the mode and skew of the fare data.

from scipy.stats import skew, mode

agg_func_stats = {'fare': [skew, mode, pd.Series.mode]}
df.groupby(['embark_town']).agg(agg_func_stats)

The mode results are interesting. The scipy.stats mode function returns the most
frequent value as well as the count of occurrences. If you just want the most frequent
value, use pd.Series.mode.

The key point is that you can use any function you want as long as it knows how to
interpret the array of pandas values and returns a single value.
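To make that concrete, here is a small sketch (not from the original article) of a plain Python function used directly in agg():

# A custom reduction: the range of fares within each group
def fare_range(x):
    return x.max() - x.min()

df.groupby('embark_town').agg({'fare': fare_range})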

Working with text


When working with text, the counting functions will work as expected. You can also
use scipy’s mode function on text data.
One interesting application is that if you have a small number of distinct values, you can use python's set function to display the full list of unique values.

This summary of the class and deck shows how this approach can be useful for
some data sets.

agg_func_text = {'deck': ['nunique', mode, set]}
df.groupby(['class']).agg(agg_func_text)

Custom functions
The pandas standard aggregation functions and pre-built functions from the python
ecosystem will meet many of your analysis needs. However, you will likely want to
create your own custom aggregation functions. There are four methods for creating
your own functions.

To illustrate the differences, let’s calculate the 25th percentile of the data using
four approaches:

First, we can use a partial function:

from functools import partial

# Use partial
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = '25%'

Next, we define our own function (which is a small wrapper around quantile):

# Define a function
def percentile_25(x):
    return x.quantile(.25)

We can define a lambda function and give it a name:

# Define a lambda function
lambda_25 = lambda x: x.quantile(.25)
lambda_25.__name__ = 'lambda_25%'

Or, define the lambda inline:

# Use a lambda function inline
agg_func = {
    'fare': [q_25, percentile_25, lambda_25, lambda x: x.quantile(.25)]
}

df.groupby(['embark_town']).agg(agg_func).round(2)

As you can see, the results are the same but the labels of the column are all a little
different. This is an area of programmer preference but I encourage you to be
familiar with the options since you will encounter most of these in online solutions.

Choosing a custom function style

I prefer to use custom functions or inline lambdas. Like many other areas of programming, this is an element of style and preference, but I encourage you to pick one or two approaches and stick with them for consistency.
Custom function examples
As shown above, there are multiple approaches to developing custom aggregation
functions. I will go through a few specific useful examples to highlight how they are
frequently used.

In most cases, the functions are lightweight wrappers around built in pandas
functions. Part of the reason you need to do this is that there is no way to pass
arguments to aggregations. Some examples should clarify this point.

If you want to count the number of null values, you could use this function:

def count_nulls(s):
    return s.size - s.count()

If you want to include NaN values in your unique counts, you need to
pass dropna=False to the nunique function.

def unique_nan(s):
    return s.nunique(dropna=False)

Here is a summary of all the values together:

agg_func_custom_count = {
    'embark_town': ['count', 'nunique', 'size', unique_nan, count_nulls, set]
}
df.groupby(['deck']).agg(agg_func_custom_count)

If you want to calculate the 90th percentile, use quantile:

def percentile_90(x):
    return x.quantile(.9)

If you want to calculate a trimmed mean where the lowest 10th percent is excluded, use the scipy stats function trim_mean:

from scipy.stats import trim_mean

def trim_mean_10(x):
    return trim_mean(x, 0.1)

If you want the largest value, regardless of the sort order (see the notes above about first and last):

def largest(x):
    return x.nlargest(1)

This is equivalent to max but I will show another example of nlargest below to
highlight the difference.

I wrote about sparklines before. Refer to that article for install instructions. Here’s
how to incorporate them into an aggregate function for a unique view of the data:

import numpy as np
from sparklines import sparklines

def sparkline_str(x):
    bins = np.histogram(x)[0]
    sl = ''.join(sparklines(bins))
    return sl

Here they are all put together:

agg_func_largest = {
    'fare': [percentile_90, trim_mean_10, largest, sparkline_str]
}
df.groupby(['class', 'embark_town']).agg(agg_func_largest)
The nlargest and nsmallest functions can be useful for summarizing the data in
various scenarios. Here is code to show the total fares for the top 10 and bottom
10 individuals:

def top_10_sum(x):
    return x.nlargest(10).sum()

def bottom_10_sum(x):
    return x.nsmallest(10).sum()

agg_func_top_bottom_sum = {
    'fare': [top_10_sum, bottom_10_sum]
}
df.groupby('class').agg(agg_func_top_bottom_sum)
Using this approach can be useful when applying the Pareto principle to your
own data.

Custom functions with multiple columns


If you have a scenario where you want to run multiple aggregations across columns,
then you may want to use the groupby combined with apply as described in
this stack overflow answer.

Using this method, you will have access to all of the columns of the data and can
choose the appropriate aggregation approach to build up your resulting DataFrame
(including the column labels):

def summary(x):
    result = {
        'fare_sum': x['fare'].sum(),
        'fare_mean': x['fare'].mean(),
        'fare_range': x['fare'].max() - x['fare'].min()
    }
    return pd.Series(result).round(0)

df.groupby(['class']).apply(summary)
Using apply with groupby gives maximum flexibility over all aspects of the results. However, there is a downside: the apply function is slow, so this approach should be used sparingly.

Working with group objects


Once you group and aggregate the data, you can do additional calculations on the
grouped objects.

For the first example, we can figure out what percentage of the total fares sold can
be attributed to each embark_town and class combination. We use assign and
a lambda function to add a pct_total column:

df.groupby(['embark_town', 'class']).agg({
    'fare': 'sum'
}).assign(pct_total=lambda x: x / x.sum())
One important thing to keep in mind is that you can actually do this more simply
using a pd.crosstab as described in my previous article:

pd.crosstab(df['embark_town'],
df['class'],
values=df['fare'],
aggfunc='sum',
normalize=True)

While we are talking about crosstab, a useful concept to keep in mind is that agg functions can be combined with pivot tables too. Here's a quick example:

pd.pivot_table(data=df,
index=['embark_town'],
columns=['class'],
aggfunc=agg_func_top_bottom_sum)

Sometimes you will need to do multiple groupby’s to answer your question. For
instance, if we wanted to see a cumulative total of the fares, we can group and
aggregate by town and class then group the resulting object and calculate a
cumulative sum:

fare_group = df.groupby(['embark_town', 'class']).agg({'fare': 'sum'})
fare_group.groupby(level=0).cumsum()

This may be a little tricky to understand. In short: we first collapse the data to one fare total per town and class, then group that result by town (the first index level) and take a cumulative sum within each town.

Here’s another example where we want to summarize daily sales data and convert it
to a cumulative daily and quarterly view. Refer to the Grouper article if you are not
familiar with using pd.Grouper() :

In the first example, we want to include a total daily sales as well as cumulative
quarter amount:
sales = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True')

daily_sales = sales.groupby([pd.Grouper(key='date', freq='D')]).agg(
    daily_sales=('ext price', 'sum')).reset_index()
daily_sales['quarter_sales'] = daily_sales.groupby(
    pd.Grouper(key='date', freq='Q')).agg({'daily_sales': 'cumsum'})

To understand this, you need to look at the quarter boundary (end of March through
start of April) to get a good sense of what is going on.

If you want to just get a cumulative quarterly total, you can chain multiple
groupby functions.

First, group the daily results, then group those results by quarter and use a
cumulative sum:

sales.groupby([pd.Grouper(key='date', freq='D')]).agg(
    daily_sales=('ext price', 'sum')).groupby(
        pd.Grouper(freq='Q')).agg({
            'daily_sales': 'cumsum'
        }).rename(columns={'daily_sales': 'quarterly_sales'})
In this example, I included the named aggregation approach to rename the variable
to clarify that it is now daily sales. I then group again and use the cumulative sum to
get a running sum for the quarter. Finally, I rename the column to quarterly sales.

Admittedly this is a bit tricky to understand. However, if you take it step by step and
build out the function and inspect the results at each step, you will start to get the
hang of it. Don’t be discouraged!

Flattening Hierarchical Column Indices


By default, pandas creates a hierarchical column index on the summary DataFrame.
Here is what I am referring to:

df.groupby(['embark_town', 'class']).agg({'fare': ['sum', 'mean']}).round(0)
At some point in the analysis process you will likely want to “flatten” the columns so
that there is a single row of names.

I have found that the following approach works best for me. I use the
parameter as_index=False when grouping, then build a new collapsed
column name.

Here is the code:

multi_df = df.groupby(['embark_town', 'class'],
                      as_index=False).agg({'fare': ['sum', 'mean']})

multi_df.columns = [
    '_'.join(col).rstrip('_') for col in multi_df.columns.values
]

After flattening, the frame has a single row of column names, such as fare_sum and fare_mean.
I prefer to use _ as my separator but you could use other values. Just keep in mind
that it will be easier for your subsequent analysis if the resulting column names do
not have spaces.

Subtotals
One process that is not straightforward with grouping and aggregating in pandas is
adding a subtotal. If you want to add subtotals, I recommend the sidetable package.
Here is how you can summarize fares by class, embark_town, and sex with a subtotal at each level as well as a grand total at the bottom:

import sidetable

df.groupby(['class', 'embark_town', 'sex']).agg({'fare': 'sum'}).stb.subtotal()

sidetable also allows customization of the subtotal levels and resulting labels. Refer to the package documentation for more examples of how sidetable can summarize your data.

Summary
Thanks for reading this article. There is a lot of detail here, but that is due to how many different uses there are for grouping and aggregating data with pandas. My hope is that this post becomes a useful resource that you can bookmark and come back to when you get stuck with a challenging problem of your own.

If you have other common techniques you use frequently, please let me know in the comments. If I get some broadly useful ones, I will include them in this post or as an updated article.

https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

Use Pandas Groupby to Group and Summarise DataFrame

Pandas – Python Data Analysis Library


I’ve recently started using Python’s excellent Pandas library as a data analysis
tool, and, while finding the transition from R’s excellent data.table
library frustrating at times, I’m finding my way around and finding most
things work quite well.
One aspect that I’ve recently been exploring is the task of grouping large data
frames by different variables, and applying summary functions on each group.
This is accomplished in Pandas using the “groupby()” and “agg()” functions of
Panda’s DataFrame objects.
Update: Pandas version 0.20.1 in May 2017 changed the aggregation
and grouping APIs. This post has been updated to reflect the new
changes.

A Sample DataFrame
In order to demonstrate the effectiveness and simplicity of the grouping
commands, we will need some data. For an example dataset, I have
extracted my own mobile phone usage records. I analysed this type of data
using Pandas during my work on KillBiller. If you’d like to follow along – the
full csv file is available here.
The dataset contains 830 entries from my mobile phone log spanning a total
time of 5 months. The CSV file can be loaded into a pandas DataFrame using the read_csv() function, and looks like this:

date duration item month network network_type

0 15/10/14 06:58 34.429 data 2014-11 data data

1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile

2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile

3 15/10/14 14:48 4.000 call 2014-11 Tesco mobile

4 15/10/14 17:27 4.000 call 2014-11 Tesco mobile

5 15/10/14 18:55 4.000 call 2014-11 Tesco mobile

6 16/10/14 06:58 34.429 data 2014-11 data data

7 16/10/14 15:01 602.000 call 2014-11 Three mobile

8 16/10/14 15:12 1050.000 call 2014-11 Three mobile

9 16/10/14 15:30 19.000 call 2014-11 voicemail voicemail

10 16/10/14 16:21 1183.000 call 2014-11 Three mobile

11 16/10/14 22:18 1.000 sms 2014-11 Meteor mobile

… … … … … … …
Sample CSV file data containing the dates and durations of phone calls made on my mobile
phone.

The main columns in the file are:

1. date: The date and time of the entry.
2. duration: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry.
3. item: A description of the event occurring – can be one of call,
sms, or data.
4. month: The billing month that each entry belongs to – of form
‘YYYY-MM’.
5. network: The mobile network that was called/texted for each
entry.
6. network_type: Whether the number being called was a mobile,
international (‘world’), voicemail, landline, or other (‘special’)
number.
Phone numbers were removed for privacy. The date column can be parsed
using the extremely handy dateutil library.
import pandas as pd
import dateutil

# Load data from csv file (DataFrame.from_csv was removed in later pandas versions)
data = pd.read_csv('phone_data.csv')
# Convert date from string to date times
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)

Summarising the DataFrame


Once the data has been loaded into Python, Pandas makes the calculation of
different statistics very simple. For example, mean, max, min, standard
deviations and more for columns are easily calculable:

# How many rows in the dataset?
data['item'].count()
Out[38]: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out[39]: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out[41]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out[42]: 9
The need for custom functions is minimal unless you have very specific
requirements. The full range of basic statistics that are quickly calculable and
built into the base Pandas package are:
Function   Description
count      Number of non-null observations
sum        Sum of values
mean       Mean of values
mad        Mean absolute deviation
median     Arithmetic median of values
min        Minimum
max        Maximum
mode       Mode
abs        Absolute value
prod       Product of values
std        Unbiased standard deviation
var        Unbiased variance
sem        Unbiased standard error of the mean
skew       Unbiased skewness (3rd moment)
kurt       Unbiased kurtosis (4th moment)
quantile   Sample quantile (value at %)
cumsum     Cumulative sum
cumprod    Cumulative product
cummax     Cumulative maximum
cummin     Cumulative minimum

The .describe() function is a useful summarisation tool that will quickly
display statistics for any variable or group it is applied to. The describe()
output varies depending on whether you apply it to a numeric or character
column.
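
For example, a quick sketch on the phone-log data (not from the original post):

# Summary statistics for the duration column within each month
data.groupby('month')['duration'].describe()
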
Summarising Groups in the DataFrame
There’s further power put into your hands by mastering the
Pandas “groupby()” functionality. Groupby essentially splits the data into
different groups depending on a variable of your choice. For example, the
expression data.groupby(‘month’) will split our current DataFrame by month.
The groupby() function returns a GroupBy object, which essentially describes how the rows of the original dataset have been split. The GroupBy object's .groups variable is a dictionary whose keys are the computed unique groups, with the corresponding values being the axis labels belonging to each group. For example:

data.groupby(['month']).groups.keys()
Out[59]: ['2014-12', '2014-11', '2015-02', '2015-03', '2015-01']
len(data.groupby(['month']).groups['2014-11'])
Out[61]: 230
Functions like max(), min(), mean(), first(), last() can be quickly applied to the
GroupBy object to obtain summary statistics for each group – an immensely
useful function. This functionality is similar to the dplyr and plyr libraries for
R. Different variables can be excluded / included from each summary
requirement.
# Get the first entry for each month
data.groupby('month').first()
Out[69]:
date duration item network network_type
month
2014-11 2014-10-15 06:58:00 34.429 data data data
2014-12 2014-11-13 06:58:00 34.429 data data data
2015-01 2014-12-13 06:58:00 34.429 data data data
2015-02 2015-01-13 06:58:00 34.429 data data data
2015-03 2015-02-12 20:15:00 69.000 call landline landline
# Get the sum of the durations per month
data.groupby('month')['duration'].sum()
Out[70]:
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64
# Get the number of dates / entries in each month
data.groupby('month')['date'].count()
Out[74]:
month
2014-11 230
2014-12 157
2015-01 205
2015-02 137
2015-03 101
Name: date, dtype: int64
# What is the sum of durations, for calls only, to each network
data[data['item'] == 'call'].groupby('network')['duration'].sum()
Out[78]:
network
Meteor 7200
Tesco 13828
Three 36464
Vodafone 14621
landline 18433
voicemail 1775
Name: duration, dtype: float64
You can also group by more than one variable, allowing more complex queries.

# How many calls, sms, and data entries are in each month?
data.groupby(['month', 'item'])['date'].count()
Out[76]:
month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
2015-02 call 67
data 31
sms 39
2015-03 call 47
data 29
sms 25
Name: date, dtype: int64
# How many calls, texts, and data are sent per month, split by network_type?
data.groupby(['month', 'network_type'])['date'].count()
Out[82]:
month network_type
2014-11 data 29
landline 5
mobile 189
special 1
voicemail 6
2014-12 data 30
landline 7
mobile 108
voicemail 8
world 4
2015-01 data 31
landline 11
mobile 160
....
Groupby output format – Series or DataFrame?
The output from a groupby and aggregation operation varies between Pandas
Series and Pandas Dataframes, which can be confusing for new users. As a rule
of thumb, if you calculate more than one column of results, your result will be
a Dataframe. For a single column of results, the agg function, by default, will
produce a Series.

You can change this by selecting your operation column differently:

# produces Pandas Series
data.groupby('month')['duration'].sum()

# Produces Pandas DataFrame
data.groupby('month')[['duration']].sum()
The groupby output will have an index or multi-index on rows corresponding
to your chosen grouping variables. To avoid setting this index, pass
“as_index=False” to the groupby operation.
data.groupby('month', as_index=False).agg({"duration": "sum"})

Using the as_index parameter while grouping data in pandas prevents setting a row index on the result.
Multiple Statistics per Group
The final piece of syntax that we’ll examine is the “agg()” function for Pandas.
The aggregation functionality provided by the agg() function allows multiple
statistics to be calculated per group in one calculation.
Applying a single function to columns in groups
Instructions for aggregation are provided in the form of a python dictionary or
list. The dictionary keys are used to specify the columns upon which you’d like
to perform operations, and the dictionary values to specify the function to run.
For example:

# Group the data frame by month and item and extract a number of stats from each group
data.groupby(
    ['month', 'item']
).agg(
    {
        'duration': sum,          # Sum duration per group
        'network_type': "count",  # get the count of networks
        'date': 'first'           # get the first date per group
    }
)
The aggregation dictionary syntax is flexible and can be defined before the
operation. You can also define functions inline using “lambda” functions to
extract statistics that are not provided by the built-in options.
# Define the aggregation procedure outside of the groupby operation
aggregations = {
    'duration': 'sum',
    # Custom lambda: the range of dates covered within each group
    'date': lambda x: max(x) - min(x)
}
data.groupby('month').agg(aggregations)
Applying multiple functions to columns in groups
To apply multiple functions to a single column in your grouped data, expand the syntax above to pass in a list of functions as the value in your aggregation dictionary. See below:

# Group the data frame by month and item and extract a number of stats from each group
data.groupby(
    ['month', 'item']
).agg(
    {
        # Find the min, max, and sum of the duration column
        'duration': [min, max, sum],
        # find the number of network type entries
        'network_type': "count",
        # minimum, first, and number of unique dates
        'date': [min, 'first', 'nunique']
    }
)
The agg(..) syntax is flexible and simple to use. Remember that you can pass in
custom and lambda functions to your list of aggregated calculations, and each
will be passed the values from the column in your grouped data.
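
For instance (a sketch, not from the original post), a lambda for the duration range can sit alongside built-in functions in the list:

# Mix built-in aggregations with a custom lambda in the list
data.groupby('month').agg({
    'duration': ['sum', 'mean', lambda x: x.max() - x.min()]
})
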
Renaming grouped aggregation columns
We’ll examine two methods to group Dataframes and rename the column
results in your work.

Grouping, calculating, and renaming the results can be achieved in a single command using the agg functionality in Python. A pd.NamedAgg is used for clarity, but normal tuples of the form (column_name, aggregation_function) can also be used.
Recommended: Tuple Named Aggregations
Introduced in Pandas 0.25.0, groupby aggregation with relabelling is
supported using “named aggregation” with simple tuples. Python tuples are
used to provide the column name on which to work on, along with the function
to apply.
For example:

data[data['item'] == 'call'].groupby('month').agg(
    # Get max of the duration column for each group
    max_duration=('duration', max),
    # Get min of the duration column for each group
    min_duration=('duration', min),
    # Get sum of the duration column for each group
    total_duration=('duration', sum),
    # Apply a lambda to the date column to count the days in each group
    num_days=("date", lambda x: (max(x) - min(x)).days)
)

Grouping with named aggregation using new Pandas 0.25 syntax. Tuples are used to specify
the columns to work on and the functions to apply to each grouping.
For clearer naming, Pandas also provides the pd.NamedAgg named tuple, which can be used to achieve the same as normal tuples:

data[data['item'] == 'call'].groupby('month').agg(
    max_duration=pd.NamedAgg(column='duration', aggfunc=max),
    min_duration=pd.NamedAgg(column='duration', aggfunc=min),
    total_duration=pd.NamedAgg(column='duration', aggfunc=sum),
    num_days=pd.NamedAgg(
        column="date",
        aggfunc=lambda x: (max(x) - min(x)).days)
)
Note that in Pandas versions following the 0.25 release, applying lambda functions only works in these named aggregations when a lambda is the only function applied to a given column; otherwise a KeyError is raised.
Renaming index using droplevel and ravel
When multiple statistics are calculated on columns, the resulting dataframe
will have a multi-index set on the column axis. The multi-index can be difficult
to work with, and I typically have to rename columns after a groupby
operation.
One option is to drop the top level (using .droplevel) of the newly created
multi-index on columns using:
grouped = data.groupby('month').agg({"duration": [min, max, 'mean']})
grouped.columns = grouped.columns.droplevel(level=0)
grouped = grouped.rename(columns={
    "min": "min_duration", "max": "max_duration", "mean": "mean_duration"
})
grouped.head()
However, this approach loses the original column names, leaving only the
function names as column headers. A neater approach, as suggested to me by a
reader, is using the ravel() method on the grouped columns. Ravel() turns a
Pandas multi-index into a simpler array, which we can combine into sensible
column names:
grouped = data.groupby('month').agg({"duration": [min, max, 'mean']})
# Using ravel, and a string join, we can create better names for the columns:
grouped.columns = ["_".join(x) for x in grouped.columns.ravel()]

Quick renaming of grouped columns from the groupby() multi-index can be achieved using the ravel() function.
Dictionary groupby format <DEPRECATED>
There were substantial changes to the Pandas aggregation function in
May of 2017. Renaming of variables using dictionaries within the agg()
function as in the diagram below is being deprecated/removed from
Pandas – see notes.

Aggregation of variables in a Pandas Dataframe using the agg() function. Note that in Pandas versions 0.20.1 onwards, the renaming of results needs to be done separately.
In older Pandas releases (< 0.20.1), renaming the newly calculated columns
was possible through nested dictionaries, or by passing a list of functions for a
column. Our final example calculates multiple values from the duration
column and names the results appropriately. Note that the results have multi-
indexed column headers.
Note this syntax will no longer work for new installations of Python
Pandas.
# Define the aggregation calculations
aggregations = {
    # work on the "duration" column
    'duration': {
        # get the sum, and call this result 'total_duration'
        'total_duration': 'sum',
        # get mean, call result 'average_duration'
        'average_duration': 'mean',
        'num_calls': 'count'
    },
    # Now work on the "date" column
    'date': {
        # Find the max, call the result "max_date"
        'max_date': 'max',
        'min_date': 'min',
        # Calculate the date range per group
        'num_days': lambda x: max(x) - min(x)
    },
    # Calculate two results for the 'network' column with a list
    'network': ["count", "max"]
}
# Perform groupby aggregation by "month",
# but only on the rows that are of type "call"
data[data['item'] == 'call'].groupby('month').agg(aggregations)

Summary of Python Pandas Grouping

The groupby functionality in Pandas is well documented in the official docs and performs at speeds on a par (unless you have massive data and are picky with your milliseconds) with R's data.table and dplyr libraries.
If you are interested in another example for practice, I used these same
techniques to analyse weather data for this post, and I’ve put “how-to”
instructions here.
There are plenty of resources online on this functionality, and I'd recommend really conquering this syntax if you're using Pandas in earnest at any point.

 DataQuest Tutorial on Data Analysis: https://www.dataquest.io/blog/pandas-tutorial-python-2/
 Chris Albon notes on Groups: https://chrisalbon.com/python/pandas_apply_operations_to_groups.html
 Greg Reda Pandas Tutorial: http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/