Pivot Tables
Pivot Tables
A pivot table is a statistics tool that summarizes and reorganizes selected columns
and rows of data in a spreadsheet or database table to obtain a desired report. The
tool does not actually change the spreadsheet or database itself, it simply “pivots”
or turns the data to view it from different perspectives.
Pivot tables are especially useful with large amounts of data that would be time-
consuming to calculate by hand. A few data processing functions a pivot table can
perform include identifying sums, averages, ranges or outliers. The table then
arranges this information in a simple, meaningful layout that draws attention to key
values.
1. Columns- When a field is chosen for the column area, only the unique values
of the field are listed across the top.
2. Rows- When a field is chosen for the row area, it populates as the first column.
Similar to the columns, all row labels are the unique values and duplicates are
removed.
3. Values- Each value is kept in a pivot table cell and display the summarized
information. The most common values are sum, average, minimum and
maximum.
For example, a store owner might list monthly sales totals for a large number of
merchandise items in an Excel spreadsheet. If they wanted to know which items
sold better in a particular financial quarter, they could use a pivot table. The sales
quarters would be listed across the top as column labels and the products would be
listed in the first column as rows. The values in the worksheet would show the sum
of sales for each product in each quarter. A filter could then be applied to only
show specific quarters, specific products or averages.
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
In [2]:
titanic.head()
Out[2]:
surviv pcla sibs parc embark clas adult_m dec embark_to aliv alon
sex age fare who
ed ss p h ed s ale k wn e e
In [3]:
titanic.groupby('sex')[['survived']].mean()
Out[3]:
survived
sex
female0.742038
male 0.188908
This immediately gives us some insight: overall, three of every four females on board
survived, while only one in five males survived!
This is useful, but we might like to go one step deeper and look at survival by both
sex and, say, class. Using the vocabulary of GroupBy, we might proceed using
something like this: we group by class and gender, select survival, apply a mean
aggregate, combine the resulting groups, and then unstack the hierarchical index to
reveal the hidden multidimensionality. In code:
In [4]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
Out[4]:
sex
female0.9680850.9210530.500000
male 0.3688520.1574070.135447
This gives us a better idea of how both gender and class affected survival, but the
code is starting to look a bit garbled. While each step of this pipeline makes sense in
light of the tools we've previously discussed, the long string of code is not particularly
easy to read or use. This two-dimensional GroupBy is common enough that Pandas
includes a convenience routine, pivot_table, which succinctly handles this type of
multi-dimensional aggregation.
In [5]:
titanic.pivot_table('survived', index='sex', columns='class')
Out[5]:
sex
female0.9680850.9210530.500000
male 0.3688520.1574070.135447
This is eminently more readable than the groupby approach, and produces the same
result. As you might expect of an early 20th-century transatlantic cruise, the survival
gradient favors both women and higher classes. First-class women survived with
near certainty (hi, Rose!), while only one in ten third-class men survived (sorry,
Jack!).
In [6]:
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
Out[6]:
sex age
(18, 80]0.9729730.9000000.423729
(18, 80]0.3750000.0714290.133663
We can apply the same strategy when working with the columns as well; let's add
info on the fare paid using pd.qcut to automatically compute quantiles:
In [7]:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
Out[7]:
sex age
dropna=True, margins_name='All')
We've already seen examples of the first three arguments; here we'll take a quick
look at the remaining ones. Two of the options, fill_value and dropna, have to do
with missing data and are fairly straightforward; we will not show examples of them
here.
The aggfunc keyword controls what type of aggregation is applied, which is a mean
by default. As in the GroupBy, the aggregation specification can be a string
representing one of several common choices
(e.g., 'sum', 'mean', 'count', 'min', 'max', etc.) or a function that implements an
aggregation (e.g., np.sum(), min(), sum(), etc.). Additionally, it can be specified as a
dictionary mapping a column to any of the above desired options:
In [8]:
titanic.pivot_table(index='sex', columns='class',
aggfunc={'survived':sum, 'fare':'mean'})
Out[8]:
fare survived
sex
At times it's useful to compute totals along each grouping. This can be done via
the margins keyword:
In [9]:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
Out[9]:
sex
female0.9680850.9210530.5000000.742038
male 0.3688520.1574070.1354470.188908
All 0.6296300.4728260.2423630.383838
Here this automatically gives us information about the class-agnostic survival rate by
gender, the gender-agnostic survival rate by class, and the overall survival rate of
38%. The margin label can be specified with the margins_name keyword, which
defaults to "All".
In [10]:
# shell command to download the data:
# !curl -O
https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
In [11]:
births = pd.read_csv('data/births.csv')
Taking a look at the data, we see that it's relatively simple–it contains the number of
births grouped by date and gender:
In [12]:
births.head()
Out[12]:
year monthdaygenderbirths
019691 1 F 4046
119691 1 M 4440
219691 2 F 4454
319691 2 M 4548
419691 3 F 4548
We can start to understand this data a bit more by using a pivot table. Let's add a
decade column, and take a look at male and female births as a function of decade:
In [13]:
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
Out[13]:
genderF M
decade
1970 1626307517121550
1980 1831035119243452
1990 1947945420420553
2000 1822930919106428
We immediately see that male births outnumber female births in every decade. To
see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to
visualize the total number of births by year (see Introduction to Matplotlib for a
discussion of plotting with Matplotlib):
In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');
With a simple pivot table and plot() method, we can immediately see the annual
trend in births by gender. By eye, it appears that over the past 50 years male births
have outnumbered female births by around 5%.
In [15]:
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
This final line is a robust estimate of the sample mean, where the 0.74 comes from
the interquartile range of a Gaussian distribution (You can learn more about sigma-
clipping operations in a book I coauthored with Željko Ivezić, Andrew J. Connolly,
and Alexander Gray: "Statistics, Data Mining, and Machine Learning in
Astronomy" (Princeton University Press, 2014)).
With this we can use the query() method (discussed further in High-Performance
Pandas: eval() and query()) to filter-out rows with births outside these values:
In [16]:
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
Next we set the day column to integers; previously it had been a string because
some columns in the dataset contained the value 'null':
In [17]:
# set 'day' column to integer; it originally was a string due to nulls
births['day'] = births['day'].astype(int)
Finally, we can combine the day, month, and year to create a Date index
(see Working with Time Series). This allows us to quickly compute the weekday
corresponding to each row:
In [18]:
# create a datetime index from the year, month, day
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
births['dayofweek'] = births.index.dayofweek
In [19]:
import matplotlib.pyplot as plt
import matplotlib as mpl
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
Apparently births are slightly less common on weekends than on weekdays! Note
that the 1990s and 2000s are missing because the CDC data contains only the
month of birth starting in 1989.
Another intersting view is to plot the mean number of births by the day of the year.
Let's first group the data by month and day separately:
In [20]:
births_by_date = births.pivot_table('births',
[births.index.month, births.index.day])
births_by_date.head()
Out[20]:
1 1 4009.225
2 4247.400
3 4500.900
4 4571.350
5 4603.625
Name: births, dtype: float64
The result is a multi-index over months and days. To make this easily plottable, let's
turn these months and days into a date by associating them with a dummy year
variable (making sure to choose a leap year so February 29th is correctly handled!)
In [21]:
births_by_date.index = [pd.datetime(2012, month, day)
for (month, day) in births_by_date.index]
births_by_date.head()
Out[21]:
2012-01-01 4009.225
2012-01-02 4247.400
2012-01-03 4500.900
2012-01-04 4571.350
2012-01-05 4603.625
Name: births, dtype: float64
Focusing on the month and day only, we now have a time series reflecting the
average number of births by date of the year. From this, we can use the plot method
to plot the data. It reveals some interesting trends:
In [22]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);
In particular, the striking feature of this graph is the dip in birthrate on US holidays
(e.g., Independence Day, Labor Day, Thanksgiving, Christmas, New Year's Day)
although this likely reflects trends in scheduled/induced births rather than some deep
psychosomatic effect on natural births.
Looking at this short example, you can see that many of the Python and Pandas
tools we've seen to this point can be combined and used to gain insight from a
variety of datasets. We will see some more sophisticated applications of these data
manipulations in future sections!