Pandas Illustrated: The Definitive Visual Guide To Pandas - by Lev Maximov - Jan, 2023 - Better Programming

This document provides a summary of the Pandas library for analyzing data in Python. It discusses how Pandas improves upon NumPy by allowing heterogeneous column types and adding index columns to improve lookup speed. The guide is divided into four parts that cover Pandas motivation and features, Series and Index objects, DataFrames, and the MultiIndex. It provides examples of common operations like sorting, filtering, and aggregating data using Pandas that are more complex to do with NumPy alone. Overall, the document presents Pandas as a powerful tool for working with structured data that builds upon NumPy's functionality.

Pandas Illustrated: The Definitive Visual Guide to Pandas | by Lev Maximov | Jan, 2023 | Better Programming 1/31/23, 11:42 PM

Published in Better Programming

Lev Maximov

Pandas Illustrated: The Definitive Visual Guide to Pandas
Is it a copy or a view? Should I merge or join? And what the heck is MultiIndex?

All images by author

Pandas is an industry standard for analyzing data in Python. With a few keystrokes, you can load, filter, restructure, and visualize gigabytes of heterogeneous information. Built on top of the NumPy library, it borrows many of its concepts and syntax conventions, so if you are comfortable with NumPy, you’ll find Pandas a pretty familiar tool. And even if you’ve never heard of NumPy, Pandas provides a great opportunity to crack down on data analysis problems with little or no programming background.

https://betterprogramming.pub/pandas-illustrated-the-definitive-visual-guide-to-pandas-c31fa921a43 Page 1 of 99

There are a lot of Pandas guides out there. In this particular one, you’re expected to have a basic understanding of NumPy. If you don’t, I’d suggest you skim through the NumPy Illustrated guide to get an idea of what a NumPy array is, in which ways it is superior to a Python list, and how it helps avoid loops in elementary operations.

The key features that Pandas brings to NumPy arrays are:

1. Heterogeneous types: each column is allowed to have its own type;
2. Index: improves lookup speed for the specified column(s).

It turns out these features are enough to make Pandas a powerful competitor to both spreadsheets and databases.

Polars, the recent reincarnation of Pandas (written in Rust, thus faster¹) does not use NumPy under the hood any longer, yet the syntax is pretty similar, so learning Pandas will let you feel at ease with Polars as well.

The article consists of four parts:

1. Motivation
2. Series and Index
3. DataFrames
4. MultiIndex

It is quite lengthy, though easy to read as it is mostly images.

For a 1-minute read of the “first steps” in Pandas I can recommend an excellent Visual Intro to Pandas² by Jay Alammar.

Discussions
Hacker News (257 points, 40 comments)
Reddit r/Python (288 points, 29 comments)

Contents
Motivation and Showcase
Pandas Showcase
Pandas Speed

Series and Index
Index
Finding element by value
Missing values
Comparisons
Appends, inserts, deletions
Statistics
Duplicate data
Group by

DataFrames
Reading and writing CSV files
Building a DataFrame
Basic operations with DataFrames
Indexing DataFrames
DataFrame arithmetic
Combining DataFrames:
Vertical stacking
Horizontal stacking
Stacking via MultiIndex
Joining DataFrames:
1:1 relationship joins
1:n relationship joins
Multiple joins
Inserts and deletes
Group by
Pivoting and ‘unpivoting’

MultiIndex
Visual Grouping
Type conversions
Building DataFrame with MultiIndex
Indexing with MultiIndex
Stacking and unstacking
How to prevent stack/unstack from sorting
Manipulating levels
Converting MultiIndex into flat Index and restoring it back
Sorting MultiIndex
Reading and writing MultiIndexed DataFrames to disk
MultiIndex arithmetic
Part 1. Motivation and Showcase
Suppose you have a file with a million lines of comma-separated values like this:

Spaces after colons are for illustrative purposes only. Usually, there are none.

And you need to give answers to basic questions like “Which cities have an area over 450 km² and a population under 10 million” with NumPy.

The brute-force solution of feeding the whole table into a NumPy array is not a good option: usually, NumPy arrays are homogeneous (all values must be of the same type), so all fields will be interpreted as strings, and comparisons will not work as expected.

Yes, NumPy has structured and record arrays that allow columns of different types, but they are primarily meant for interfacing with C code. When used for general purposes, they have the following downsides:

• not really intuitive (e.g., you’ll be faced with constants like <f8 and <U8 everywhere);
• have some performance issues as compared to regular NumPy arrays;
• stored contiguously in memory, so each column addition or deletion requires reallocation of the whole array;
• still lack a lot of functionality of Pandas DataFrames.

Your next try would probably be to store each column as a separate NumPy vector. And after that, maybe wrap them into a dict so it would be easier to restore the integrity of the ‘database’ if you decide to add or remove a row or two later. Here’s what that would look like:
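As a rough sketch of the dict-of-vectors idea (the city names and figures below are made-up sample values, not data from the article), the question about area and population could be answered like this:

```python
import numpy as np

# Hypothetical sample data; each column is a separate homogeneous NumPy vector.
table = {
    "city": np.array(["Oslo", "Vienna", "Tokyo"]),
    "area": np.array([480.0, 414.8, 2194.0]),          # km², illustrative values
    "population": np.array([0.7e6, 1.9e6, 14.0e6]),    # illustrative values
}

# Which cities have an area over 450 km² and a population under 10 million?
mask = (table["area"] > 450) & (table["population"] < 10e6)
matching_cities = table["city"][mask]

# Removing a row means applying the same filter to every column by hand.
keep = table["city"] != "Vienna"
table = {name: column[keep] for name, column in table.items()}
```

Every row-level operation has to be repeated for each column, which is exactly the bookkeeping a DataFrame takes over.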

If you’ve done that, congratulations! You’ve made your first step in reimplementing Pandas. :)

Now, here’re a couple of examples of what Pandas can do for you that NumPy cannot (or requires significant effort to accomplish).

Pandas Showcase
Consider the following table:

It describes the diverse product line of an online shop with a total of four distinct products. In contrast with the previous example, it can be represented with either a NumPy array or a Pandas DataFrame equally well. Let us look at some common operations with it.

Sorting
Sorting by column is more readable with Pandas, as you can see below:

Here argsort(a[:,1]) calculates the permutation that makes the second column of a to be sorted in ascending order and then the outer a[…] reorders the rows of a, accordingly. Pandas can do it in one step.
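A minimal sketch of the two approaches side by side. The table contents and the column names id, price and weight are assumptions made for illustration; the original table is only available as an image.

```python
import numpy as np
import pandas as pd

# Hypothetical product table with columns id, price, weight.
a = np.array([[1, 30, 5],
              [2, 10, 7],
              [3, 20, 2],
              [4, 10, 3]])
df = pd.DataFrame(a, columns=["id", "price", "weight"])

# NumPy: compute the permutation for column 1, then reorder the rows with it.
sorted_np = a[a[:, 1].argsort(kind="stable")]

# Pandas: name the column and sort in one step.
sorted_pd = df.sort_values("price", kind="stable")
```

The stable sort kind is requested in both calls only so that the two results list the tied prices in the same order.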

Sorting by several columns
If we need to sort by price column breaking ties using the weight column, the situation gets worse for NumPy:

With NumPy, we first order by weight, then apply second ordering by price. A stable sorting algorithm guarantees that the result of the first sort is not lost during the second one. There are other ways to do it with NumPy, but none are as simple and elegant as with Pandas.
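The two-pass stable sort can be sketched as follows, again on a made-up table with assumed column names id, price and weight. np.lexsort is shown as one of the alternative NumPy routes.

```python
import numpy as np
import pandas as pd

# Hypothetical product table with columns id, price, weight.
a = np.array([[1, 30, 5],
              [2, 10, 7],
              [3, 20, 2],
              [4, 10, 3]])
df = pd.DataFrame(a, columns=["id", "price", "weight"])

# NumPy: sort by the secondary key first, then stably by the primary key.
by_weight = a[a[:, 2].argsort()]
by_price_then_weight = by_weight[by_weight[:, 1].argsort(kind="stable")]

# Alternative: lexsort takes the keys in reverse priority (primary key last).
via_lexsort = a[np.lexsort((a[:, 2], a[:, 1]))]

# Pandas: list the columns in order of priority.
sorted_pd = df.sort_values(["price", "weight"])
```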

Adding a column
Adding columns is way better with Pandas, syntactically and architecturally:

Pandas does not need to reallocate memory for the whole array like NumPy; it just adds a reference to a new column and updates a ‘registry’ of the column names.
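For comparison, here is a small sketch of adding a derived column in both libraries. The table and the new total column are hypothetical and serve only to show the syntax.

```python
import numpy as np
import pandas as pd

# Hypothetical product table with columns id, price, weight.
a = np.array([[1, 30, 5],
              [2, 10, 7],
              [3, 20, 2],
              [4, 10, 3]])
df = pd.DataFrame(a, columns=["id", "price", "weight"])

# NumPy: hstack allocates a new, wider array and copies everything into it.
total = a[:, 1] * a[:, 2]
a_wide = np.hstack([a, total[:, np.newaxis]])

# Pandas: assign to a new column name; the existing columns are left alone.
df["total"] = df["price"] * df["weight"]
```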

Fast element search
With NumPy arrays, even if the element you search for is the first one, you’ll need time proportional to the size of the array to find it. With Pandas, you can index the column(s) you expect to be queried most often and reduce search time to a constant.


The index column has the following limitations:

• It requires memory and time to be built.
• It is read-only (needs to be rebuilt after each append or delete operation).
• The values are not required to be unique, but speedup only happens when the elements are unique.
• It requires preheating: the first query is somewhat slower than in NumPy, but the subsequent ones are significantly faster.
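A short sketch of what indexing a column looks like in practice, using a made-up table. The scan-based lookup is included only as a contrast; no timing claims are made here.

```python
import pandas as pd

# Hypothetical product table.
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "price": [30, 10, 20, 10],
                   "weight": [5, 7, 2, 3]})

# Without an index: a boolean mask compares every row.
price_scan = df.loc[df["id"] == 3, "price"].iloc[0]

# With an index: build it once, then look up by label.
indexed = df.set_index("id")
price_indexed = indexed.loc[3, "price"]

# The speedup relies on unique labels, which can be checked up front.
labels_are_unique = indexed.index.is_unique
```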

Joins by column
If you want to complement a table with information from another table based on a common column, NumPy is hardly any help. Pandas is better, especially for 1:n relationships.

Pandas join has all the familiar ‘inner,’ ‘left,’ ‘right,’ and ‘full outer’ join modes.
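The join modes can be tried out on a pair of small made-up tables. This sketch uses DataFrame.merge rather than DataFrame.join, and all table and column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical 1:n relationship: one product can appear in many orders.
products = pd.DataFrame({"product_id": [1, 2, 3],
                         "name": ["pen", "ink", "pad"]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                       "product_id": [1, 1, 3, 4],
                       "qty": [2, 1, 5, 1]})

# inner: only orders whose product_id exists in products
inner = orders.merge(products, on="product_id", how="inner")
# left: all orders; the name is missing where there is no matching product
left = orders.merge(products, on="product_id", how="left")
# outer: all orders plus the products that were never ordered
outer = orders.merge(products, on="product_id", how="outer")
```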

Grouping by column
Yet another common operation in data analysis is grouping by column(s). For example, to get the total quantity of each product sold, you can do the following:

In addition to sum, Pandas supports all kinds of aggregate functions: mean, min, count, etc.
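A minimal sketch of the total-quantity query, with an invented sales table (the column names product and quantity are assumptions):

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({"product": ["pen", "ink", "pen", "pad", "ink"],
                      "quantity": [3, 1, 4, 5, 2]})

# Total quantity of each product sold.
totals = sales.groupby("product")["quantity"].sum()

# Several other aggregate functions in one call.
stats = sales.groupby("product")["quantity"].agg(["mean", "min", "count"])
```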

Pivot tables
One of the most powerful features of Pandas is a “pivot” table. It is something like projecting multi-dimensional space into a two-dimensional plane.

It is certainly possible to implement it with NumPy, but this functionality is missing ‘out of the box,’ though it is present in all major relational databases³ and spreadsheet apps (Excel, Google Sheets).

Pandas also has df.pivot_table which combines grouping and pivoting in one tool.
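To make the idea concrete, here is a small sketch of df.pivot_table on invented data: the long list of sales records is projected onto a product-by-month grid, with duplicates summed along the way.

```python
import pandas as pd

# Hypothetical sales records in "long" form.
sales = pd.DataFrame({"product": ["pen", "pen", "ink", "ink", "pen"],
                      "month": ["Jan", "Feb", "Jan", "Feb", "Jan"],
                      "quantity": [3, 4, 1, 2, 5]})

# Rows become products, columns become months, and repeated
# (product, month) pairs are aggregated with sum.
table = sales.pivot_table(index="product", columns="month",
                          values="quantity", aggfunc="sum")
```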

In a nutshell, the two main differences between NumPy and Pandas are the following:

Now, let’s see whether those features come at the cost of a performance hit.

Pandas Speed
I’ve benchmarked NumPy and Pandas on a workload typical for Pandas: 5–100 columns; 10³–10⁸ rows; integers and floats. Here are the results for 1 row and 100 million rows:

It looks as if in every single operation, Pandas is slower than NumPy!

The situation (predictably) does not change when the number of columns increases. As for the number of rows, the dependency (in the logarithmic scale) looks like this:

Pandas seems to be 30 times slower than NumPy for small arrays (under a hundred rows) and three times slower for large ones (over a million rows).

How can it be? Maybe it is high time to submit a feature request to suggest Pandas reimplement df.column.sum() via df.column.values.sum()? The values property here provides access to the underlying NumPy array and results in a 3x-30x speedup.

The answer is no. Pandas is so slow at those basic operations because it correctly handles the missing values. Pandas needs NaNs (not-a-number) for all of this database-like machinery like grouping and pivoting, plus it is a common thing in the real world. In Pandas, a lot of work has been done to unify the usage of NaN across all the supported data types. By definition (enforced on the CPU level), nan+anything results in nan. So

>>> np.sum([1, np.nan, 2])
nan

but

>>> pd.Series([1, np.nan, 2]).sum()
3.0

A fair comparison would be to use np.nansum instead of np.sum, np.nanmean instead of np.mean and so on. And suddenly…

Pandas becomes 1.5 times faster than NumPy for arrays with over a million elements. It is still 15 times slower than NumPy for smaller arrays, but usually, it does not matter much if the operation is completed in 0.5 ms or 0.05 ms: it is fast anyway.

The bottom line is that if you’re 100% sure you have no missing values in your column(s), it makes sense to use df.column.values.sum() instead of df.column.sum() to have x3-x30 performance boost. In the presence of missing values the speed of Pandas is quite decent and even beats NumPy for huge arrays (over 10⁶ elements).
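The difference in NaN handling can be checked directly. This sketch only compares results, not timings; the series contents are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

plain_sum = np.sum(s.values)         # NaN propagates through plain np.sum
nan_aware_sum = np.nansum(s.values)  # NaN-aware NumPy counterpart
pandas_sum = s.sum()                 # Pandas skips NaN by default

# With no missing values, going through .values gives the same answer.
clean = pd.Series(np.arange(1_000_000, dtype=float))
fast_sum = clean.values.sum()
safe_sum = clean.sum()
```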

Part 2. Series and Index

Series is a counterpart of a 1D array in NumPy and is a basic building block for a DataFrame representing its column. Although its practical importance is diminishing in comparison to a DataFrame (you can perfectly well solve a lot of practical problems without knowing what a Series is), you might have a hard time understanding how DataFrames work without learning Series and Index first.

Internally, Series stores the values in a plain old NumPy vector. As such, it inherits its merits (compact memory layout, fast random access) and demerits (type homogeneity, slow deletions, and insertions). On top of that, Series allows accessing its values by label using a dict-like structure called index. Labels can be of any type (commonly strings and time stamps). They need not be unique, but uniqueness is required to boost the lookup speed and is assumed in many operations.

As you can see, now every element can be addressed in two alternative ways: by ‘label’ (=using the index) and by ‘position’ (=not using the index):

Addressing ‘by position’ is sometimes called ‘by positional index,’ which merely adds to the confusion.

Obviously, one pair of square brackets is not enough for this. In particular:

• s[2:3] is not the most convenient way to address element number 2
• if the labels happen to be integers, s[1:3] becomes ambiguous. It might mean labels 1 to 3 inclusive or positional indices 1 to 3 exclusive.

To address those issues, Pandas has two more ‘flavors’ of square brackets:

• .loc[] always uses labels and includes both ends of the interval;
• .iloc[] always uses positional indices and excludes the right end.

The purpose of using square brackets instead of parentheses here is to get access to the convenient Python slicing: You can use a single or double colon with the familiar meaning of start:stop:step. As usual, missing start (end) means from the start (to the end) of the Series. The step argument allows you to reference even rows with s.iloc[::2] and to get elements in reverse order with s['Paris':'Oslo':-1]
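The slicing rules above can be sketched on a small Series. The city labels echo the slice in the text, while the values are arbitrary placeholders.

```python
import pandas as pd

# Hypothetical Series with string labels; the numbers carry no meaning.
s = pd.Series([10, 20, 30, 40], index=["Oslo", "Vienna", "Tokyo", "Paris"])

by_label = s.loc["Oslo":"Tokyo"]        # labels: both ends are included
by_position = s.iloc[0:2]               # positions: the right end is excluded
even_rows = s.iloc[::2]                 # every second element
reversed_s = s.loc["Paris":"Oslo":-1]   # label slice with a negative step
```

The label-based slice returns three elements and the positional one returns two, which is the inclusive/exclusive difference in action.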

They also support boolean indexing (indexing with an array of booleans), as this image shows:

And you can see how they support ‘fancy indexing’ (indexing with an array of integers) in this image:

The worst thing about Series is its visual representation: for some reason, it didn’t receive a nice rich-text outlook, so it feels like a second-class citizen in comparison with a DataFrame:

I’ve monkey-patched the Series to make it look better, as shown below:

The vertical line means this is a Series, not a DataFrame. Footer is disabled here, but it can be useful for showing dtypes, especially with Categoricals.

You can also display several Series or DataFrames side by side with pdi.sidebyside(obj1, obj2, …):

The pdi (stands for pandas illustrated) is an open-source library on github with this and other functions for this article. To use it, write

pip install pandas-illustrated

The object responsible for getting Series elements (as well as DataFrame rows and columns) by label is called an index. It is fast: you can get the result in constant time, whether you have five elements or 5 billion elements.

Index is a truly polymorphic creature. By default, when you create a Series (or a DataFrame) without index argument, it initializes to a lazy object similar to Python’s range(). Just like range(), it barely uses any memory, and provides the labels coinciding with the positional indexing. Let’s create a Series of a million elements:

>>> s = pd.Series(np.zeros(10**6))
>>> s.index
RangeIndex(start=0, stop=1000000, step=1)
>>> s.index.memory_usage()       # in bytes; the same as for Series([0.])

Now, if we delete an element, the index implicitly morphs into a dict-like structure, as follows:

>>> s.drop(1, inplace=True)
>>> s.index
Int64Index([     0,      2,      3,      4,      5,      6,      7,
            ...
            999993, 999994, 999995, 999996, 999997, 999998, 999999],
           dtype='int64', length=999999)
>>> s.index.memory_usage()
7999992

Now this structure consumes 8 MB of memory! To get rid of it and get back to the lightweight range-like structure, write

>>> s.reset_index(drop=True, inplace=True)
>>> s.index
RangeIndex(start=0, stop=999999, step=1)
>>> s.index.memory_usage()
If you’re new to Pandas, you might wonder why Pandas didn’t do it on its own? Well, for non-numeric labels, it is sort of obvious: why (and how) would Pandas, after deleting a row, relabel all the subsequent rows? For numeric labels, the answer is a bit more convoluted.

First, as we’ve seen already, Pandas allows you to reference rows purely by position, so if you want to address row number 5 after deleting row number 3, you can do it without reindexing (that’s what iloc is for).

Second, keeping original labels is a way to keep a connection with a moment in the past, like a ‘save game’ button. Imagine you have a big 100x1000000 table and need to find some data. You’re making several queries one by one, each time narrowing your search, but looking at only a subset of the columns because it is impractical to see all of the one hundred fields at the same time. Now that you have found the rows of interest, you want to see all the information in the original table about them. A numeric index helps you get it immediately without any additional effort.

Generally, keeping values in the index unique is a good idea. For example, you won’t get a lookup speed boost in the presence of duplicate values in the index. Pandas does not have a ‘unique constraint’ like relational databases (the feature is still experimental), but it has functions to check if values in the index are unique and to get rid of duplicates in various ways.

Sometimes, a single column is not enough to uniquely identify the row. For example, cities of the same name sometimes happen to be found in different countries or even in different regions of the same country. So (city, state) is a better candidate for identifying a place than city alone. In databases, it is called the ‘composite primary key.’ In Pandas, it is called MultiIndex (see Part 4 below), and each column inside the index is called a ‘level.’

Another substantial quality of an index is that it is immutable. In contrast to ordinary columns in the DataFrame, you cannot change it in place. Any change in the index involves getting data from the old index, altering it, and reattaching the new data as the new index. More often than not, it happens transparently, which is why you cannot just write df.City.name = 'city', and you have to resort to a less obvious df.rename(columns={'City': 'city'}, inplace=True)

Index has a name (in the case of MultiIndex, every level has a name). Unfortunately, this name is underused in Pandas. Once you have included the column in the index, you cannot use the convenient df.column_name notation anymore and have to revert to the less readable df.index or the more universal df.loc[]. The situation gets worse with MultiIndex. A prominent exception is df.merge: you can specify the column to merge by name, no matter if it is an index column or not.

The columns are labeled using just the same Index as the rows, although it might not be evident from the arguments of the pd.DataFrame constructor.

Finding element by value
Consider the following Series object:

Index provides a fast and convenient way to find a value by label. But how about finding a label by value?

s.index[s.tolist().index(x)]           # faster for len(s) < 1000
s.index[np.where(s.values==x)[0][0]]   # faster for len(s) > 1000

I’ve written a pair of thin wrappers called find() and findall() that are fast (as they automatically choose the actual command based on the series size) and easier to use. Here’s what the code looks like:

>>> import pdi
>>> pdi.find(s, 2)
'penguin'
>>> pdi.findall(s, 4)
Index(['cat', 'dog'], dtype='object')

Missing values
Pandas developers took special care about the missing values. Usually, you receive a dataframe with NaNs by providing a flag to read_csv. Otherwise, you can use None in the constructor or in an assignment operator (it will work despite being implemented slightly differently for different data types), for example:

The first thing you can do with NaNs is understand if you have any. As seen from the image above, isna() produces a boolean array, and .sum() gives the total number of missing values.

Now that you know they are there, you can opt to get rid of them all at once by filling them with a constant value or through interpolation, as shown below:

fillna() and interpolate()
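A brief sketch of counting and replacing missing values, on an arbitrary example series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

n_missing = s.isna().sum()       # boolean mask, then count the True values
filled = s.fillna(0)             # replace every NaN with a constant
interpolated = s.interpolate()   # linear interpolation between neighbours
```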

On the other hand, you can keep using them. Most Pandas functions happily ignore the missing values:

More advanced functions (median, rank, quantile, etc.) also do.

Arithmetic operations are aligned against the index:

The results are inconsistent in the presence of non-unique values in the index. Do not use arithmetic operations on series with a non-unique index.
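Index alignment can be seen on two small series with partly overlapping labels (the labels and values are arbitrary):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Values are matched by label, not by position; labels present in only
# one operand produce NaN in the result.
total = a + b

# Series.add with fill_value treats a missing operand as 0 instead.
total_filled = a.add(b, fill_value=0)
```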

Comparisons
Comparing arrays with missing values might be tricky. Here’s an example:

>>> np.all(pd.Series([1., None, 3.]) ==
...        pd.Series([1., None, 3.]))
False
>>> np.all(pd.Series([1, None, 3], dtype='Int64') ==
...        pd.Series([1, None, 3], dtype='Int64'))
True
>>> np.all(pd.Series(['a', None, 'c']) ==
...        pd.Series(['a', None, 'c']))
False
To be compared properly, NaNs need to be replaced with something that is guaranteed to be missing from the array. E.g. with '', -1 or ∞:

>>> np.all(s1.fillna(np.inf) == s2.fillna(np.inf))   # works for all dtypes
True

Or, better yet, use a standard NumPy or Pandas comparison function:

>>> s = pd.Series([1., None, 3.])
>>> np.array_equal(s.values, s.values, equal_nan=True)
True
>>> len(s.compare(s)) == 0
True

Here, the compare function returns a list of differences (a DataFrame, actually), and array_equal returns a boolean directly.

When comparing DataFrames with mixed types, NumPy comparison fails (issue #19205), while Pandas works perfectly well. Here’s what that looks like:

>>> df = pd.DataFrame({'a': [1., None, 3.], 'b': ['x', None, 'z']})
>>> np.array_equal(df.values, df.values, equal_nan=True)
TypeError
<...>
>>> len(df.compare(df)) == 0
True

Appends, inserts, deletions
Although Series objects are supposed to be size-immutable, it is possible to append, insert, and delete elements in place, but all those operations are:

• slow, as they require reallocating memory for the whole object and updating the index;
• painfully inconvenient.

Here’s one way of inserting a value and two ways of deleting the values:

The second method for deleting values (via drop) is slower and can lead to intricate errors in the presence of non-unique values in the index.

Pandas has the df.insert method, but it can only insert columns (not rows) into a dataframe (and does not work at all with series).

Another method for appending and inserting is to slice the DataFrame with iloc, apply the necessary conversions, and then put it back with concat. I’ve implemented a function called insert that automates the process:
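The slice-and-concat approach can be sketched for a Series as follows. The helper insert_value is hypothetical and written for this illustration only; it is not the pdi.insert implementation, whose code is not shown in the article.

```python
import pandas as pd

def insert_value(s, pos, value, label):
    """Hypothetical helper: return a copy of s with value inserted at position pos."""
    if not 0 <= pos <= len(s):
        raise IndexError("position out of range")
    new_item = pd.Series([value], index=[label])
    # Slice around the insertion point and glue the three pieces together.
    return pd.concat([s.iloc[:pos], new_item, s.iloc[pos:]])

s = pd.Series([1, 2, 4], index=["a", "b", "d"])
s2 = insert_value(s, 2, 3, "c")   # the original s is left untouched
```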

Note that (just like in df.insert) the place to insert is given by a position 0<=i<=len(s), not the label of the element from the index.

You can provide a label for a new element. For a non-numeric index, it is required. For example:

To specify the insertion point by label, you can combine pdi.find with pdi.insert, as shown below:

Note that unlike df.insert, pdi.insert returns a copy instead of modifying the Series/DataFrame in place.

Statistics
Pandas provides a full spectrum of statistical functions. They can give you an insight into what is in a million-element Series or DataFrame without manually scrolling through the data.

All Pandas statistical functions ignore NaNs, as you can see below:

Note that Pandas std gives different results than NumPy std: by default, Pandas computes the sample standard deviation (ddof=1), while NumPy computes the population one (ddof=0).
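The discrepancy between the two std functions comes from their default ddof (delta degrees of freedom) value, which can be verified on any small sample:

```python
import numpy as np
import pandas as pd

x = [1.0, 2.0, 3.0, 4.0]
s = pd.Series(x)

pandas_std = s.std()    # ddof=1 by default (sample standard deviation)
numpy_std = np.std(x)   # ddof=0 by default (population standard deviation)

# Passing ddof explicitly makes the two libraries agree.
pandas_population_std = s.std(ddof=0)
numpy_sample_std = np.std(x, ddof=1)
```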

Since every element in a series can be accessed by either a label or a positional index, there’s a sister function for argmin (argmax) called idxmin (idxmax), which is shown in the image:

Here’s a list of Pandas’ self-descriptive statistical functions for reference:

• std, sample standard deviation;
• var, unbiased variance;
• sem, unbiased standard error of the mean;
• quantile, sample quantile (s.quantile(0.5) ≈ s.median());
• mode, the value(s) that appears most often;
• nlargest and nsmallest, by default, in order of appearance;
• diff, first discrete difference;
• cumsum and cumprod, cumulative sum, and product;
• cummin and cummax, cumulative minimum and maximum.

And some more specialized stat functions:

• pct_change, percent change between the current and previous element;
• skew, unbiased skewness (third moment);
• kurt or kurtosis, unbiased kurtosis (fourth moment);
• cov, corr and autocorr, covariance, correlation, and autocorrelation;
• rolling, weighted, and exponentially weighted windows.

Duplicate data
Special care is taken to detect and deal with duplicate data, as you can see in the image:

is_unique, nunique, value_counts

drop_duplicates and duplicated can keep the last occurrence instead of the first one.

Note that s.unique() is faster⁴ than np.unique (O(N) vs O(NlogN)) and it preserves the order instead of returning the sorted results as np.unique does.

Missing values are treated as ordinary values, which may sometimes lead to surprising results.
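The functions mentioned in this section can be tried on a short series with repeated values. The data is arbitrary, and the second series exists only to show how repeated NaNs count as duplicates.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 3, 2, 1])

has_no_duplicates = s.is_unique               # False: 3 and 1 repeat
n_distinct = s.nunique()                      # number of distinct values
counts = s.value_counts()                     # occurrences of each value
first_kept = s.drop_duplicates()              # keeps the first occurrence
last_kept = s.drop_duplicates(keep="last")    # keeps the last occurrence

in_order = s.unique()              # order of appearance is preserved
sorted_unique = np.unique(s.values)  # NumPy returns the sorted result

# Repeated NaNs are treated as duplicates of each other.
with_nans = pd.Series([1.0, np.nan, np.nan])
nan_unique = with_nans.is_unique
nan_unique_after_drop = with_nans.dropna().is_unique
```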

https://betterprogramming.pub/pandas-illustrated-the-definitive-visual-guide-to-pandas-c31fa921a43 Page 29 of 99
Pandas Illustrated: The Definitive Visual Guide to Pandas | by Lev Maximov | Jan, 2023 | Better Programming 1/31/23, 11:42 PM

If you want to exclude NaNs, you need to do it explicitly. In this particular example, s.dropna().is_unique == True.
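The behaviour described in this section can be sketched as follows. The series below are made up for illustration and are not the ones from the original image:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 3, 2, 1])

order_kept = s.unique()                     # order of first appearance
sorted_vals = np.unique(s.to_numpy())       # sorted result
first_kept = s.drop_duplicates()            # keeps the first occurrence
last_kept = s.drop_duplicates(keep="last")  # keeps the last occurrence

# Two NaNs count as duplicates of each other.
t = pd.Series([1.0, np.nan, np.nan])
unique_with_nans = t.is_unique
unique_without_nans = t.dropna().is_unique

print(order_kept, sorted_vals)
print(unique_with_nans, unique_without_nans)
```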

There also is a family of monotonic functions with self-describing names (the public ones are properties, so they are accessed without parentheses):

s.is_monotonic_increasing,

s.is_monotonic_decreasing,

s._strict_monotonic_increasing(),

s._strict_monotonic_decreasing(), and, quite unexpectedly,

s.is_monotonic, which is a synonym for s.is_monotonic_increasing and returns False for monotonically decreasing series! (It was deprecated in pandas 1.5 and removed in 2.0.)

Group by
A common operation in data processing is to calculate some statistics not over the whole bunch of data but over certain groups thereof. The first step is to build a lazy object by providing criteria for breaking a series (or a dataframe) into groups. This lazy object has no meaningful representation, but it can be:

iterated (yields the grouping key and the corresponding sub-series, ideal for debugging):


[Image: groupby]

queried in just the same manner as ordinary Series to get a certain property of each group:

[Image: All operations exclude NaNs]

In this example, we break the series into three groups based on the integer part of dividing the values by 10. For each group, we request the sum of the elements, the number of elements, and the average value in each group.
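A minimal sketch of the same idea, with values chosen here for illustration (the original image may use different numbers):

```python
import pandas as pd

s = pd.Series([1, 3, 12, 15, 17, 22])
g = s.groupby(s // 10)   # lazy object: nothing is aggregated yet

# Iterating yields (grouping key, sub-series) pairs.
for key, sub in g:
    print(key, sub.tolist())

sums = g.sum()      # sum of each group
counts = g.count()  # number of elements in each group
means = g.mean()    # average of each group
```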

In addition to those aggregate functions, you can access particular elements based on their position or relative value within a group. Here’s what that looks like:


[Image: min, median, max, first, nth, last]

You can also calculate several functions in one call with g.agg(['min', …]) or display a whole bunch of stats functions at once with g.describe().

If these are not enough, you can also pass the data through your own Python function. It can either be:

a function f that accepts a group x (a Series object) and generates a single value (e.g. sum()) with g.apply(f)

a function f that accepts a group x (a Series object) and generates a Series object of the same size as x (e.g., cumsum()) with g.transform(f)
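The two kinds of custom functions can be sketched like this. The lambdas are hypothetical examples, not the ones used in the article:

```python
import pandas as pd

s = pd.Series([1, 3, 12, 15, 17, 22])
g = s.groupby(s // 10)

# One value per group: the result is indexed by the group keys.
spread = g.apply(lambda x: x.max() - x.min())

# One value per element: the result has the same index as s.
running = g.transform(lambda x: x.cumsum())

print(spread.tolist())
print(running.tolist())
```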


In the examples above, the input data is sorted. This is not required for groupby. Actually, it works equally well if the group elements are not stored consecutively, so it is closer to collections.defaultdict than to itertools.groupby. And it always returns an index without duplicates.

In contrast to defaultdict and relational database GROUP BY clause, Pandas groupby sorts the results by group name. It can be disabled with sort=False, as you’ll see in the code:
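A small sketch of the difference, using an unsorted made-up series:

```python
import pandas as pd

s = pd.Series([22, 1, 15, 3, 12, 17])
keys = s // 10

sorted_groups = s.groupby(keys).sum()                 # keys in sorted order
unsorted_groups = s.groupby(keys, sort=False).sum()   # keys in order of appearance

print(sorted_groups.index.tolist())
print(unsorted_groups.index.tolist())
```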


Disclaimer: Actually, g.apply(f) is more versatile than described above:

if f(x) returns a series of the same size as x, it can mimic transform

if f(x) returns a series of different size or a dataframe, it results in a series with a corresponding MultiIndex.

But the docs warn that those usages can be slower than the corresponding transform and agg methods, so take care.

Part 3. DataFrames


The primary data structure of Pandas is a DataFrame. It bundles a two-dimensional array with labels for its rows and columns. It consists of a number of Series objects (with a shared index), each representing a single column and possibly having different dtypes.

Reading and writing CSV files
A common way to construct a DataFrame is by reading a CSV (comma-separated values) file, as this image shows:

The pd.read_csv() function is a fully-automated and insanely customizable tool. If you want to learn just one thing about Pandas, learn to use read_csv: it will pay off :).

Here’s an example of parsing a non-standard CSV file:


and a brief description of some of the arguments:

Since CSV does not have a strict specification, sometimes it takes a bit of trial and error to read it correctly. What is cool about read_csv is that it automatically detects a lot of things, including:

column names and types

representation of booleans

representation of missing values, etc.

As with any automation, you’d better make sure it has done the right thing. If the results of simply writing df in a Jupyter cell happen to be too lengthy (or too incomplete), you can try the following:

df.head(5) or df[:5] displays the first five rows


df.dtypes returns the column types

df.shape returns the number of rows and columns

df.info() summarizes all the relevant information
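To make the workflow concrete, here is a sketch that parses a small non-standard CSV from memory and then inspects the result. The file content and column names are invented for this example, and only a few of the many read_csv arguments are shown:

```python
import io
import pandas as pd

# Made-up data: semicolon separator, yes/no booleans, one empty cell.
raw = io.StringIO(
    "city;population;capital\n"
    "Alpha;0.7;yes\n"
    "Beta;1.9;yes\n"
    "Gamma;;no\n"
)
df = pd.read_csv(
    raw,
    sep=";",                # non-default field separator
    index_col="city",       # use this column as the index
    true_values=["yes"],    # strings to be read as True
    false_values=["no"],    # strings to be read as False
)

print(df.head(2))   # first two rows
print(df.dtypes)    # column types
print(df.shape)     # (number of rows, number of columns)
df.info()           # prints an overall summary
```

The empty cell is read as NaN, which is why the population column ends up as a float column.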

It is a good idea to set one or several columns as an index. The following image shows this process:

Index has many uses in Pandas:

it makes lookups by indexed column(s) faster;

arithmetic operations, stacking, joining are aligned by index; etc.

All of that comes at the expense of somewhat higher memory consumption and a bit less obvious syntax.

Building a DataFrame
Another option is to construct a dataframe from data already stored in memory. Its constructor is so extraordinarily omnivorous that it can convert (or wrap!) just any kind of data you feed into it:


In the first case, in the absence of row labels, Pandas labeled the rows with consecutive integers. In the second case, it did the same to both rows and columns. It is always a good idea to provide Pandas with names of columns instead of integer labels (using the columns argument) and sometimes with names of rows, too (using the index argument, though rows might sound more intuitive). This image will help:

To assign a name for the index column, write df.index.name = 'city_name' or use pd.DataFrame(..., index=pd.Index(['Oslo', 'Vienna', 'Tokyo'], name='city_name')).
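The labeling rules above can be checked with a short sketch. The numbers are placeholders, not real population or area figures:

```python
import pandas as pd

# No labels given: rows and columns are numbered 0, 1, ...
unlabeled = pd.DataFrame([[1, 2], [3, 4]])

# Column names, row labels, and a name for the index itself.
df = pd.DataFrame(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],   # placeholder values
    columns=["population", "area"],
    index=pd.Index(["Oslo", "Vienna", "Tokyo"], name="city_name"),
)

print(unlabeled)
print(df)
```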


The next option is to construct a DataFrame from a dict of NumPy vectors or a 2D NumPy array:

Note how the population values got converted to floats in the second case. Actually, it happened earlier, during the construction of the NumPy array. Another thing to note here is that constructing a dataframe from a 2D NumPy array is a view by default. That means that changing values in the original array changes the dataframe and vice versa. Plus, it saves memory.

This mode can be enabled in the first case (a dict of NumPy vectors), too, by setting copy=False. It is very fragile, though. Simple operations can turn it into a copy without notice.

Two more (less useful) options to create a DataFrame are:

from a list of dicts (where each dict represents a single row, its keys are column names, and its values are the corresponding cell values)

from a dict of Series (where each Series represents a column; returns a copy by default, it can be told to return a view with copy=False).

If you register streaming data ‘on the fly,’ your best bet is to use a dict of lists or a list of lists because Python transparently preallocates space at the end of a list so that the appends are fast. Neither NumPy arrays nor Pandas dataframes do it. Another possibility (if you know the number of rows beforehand) is to manually preallocate memory with something like DataFrame(np.zeros).

Basic operations with DataFrames
The best thing about DataFrame (in my opinion) is that you can:

easily access its columns, e.g. df.area returns column values (or alternatively, df['area'], which is good for column names containing spaces)

operate the columns as if they were independent variables, for example, after df.population /= 10**6 the population is stored in millions, and the following command creates a new column called ‘density’ calculated from the values in the existing columns:

Note that when creating a new column, square brackets are mandatory even if its name contains no spaces.
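The column operations described above might look as follows. The figures are invented and the bracket syntax is used throughout, which is an editorial choice rather than the article's exact code:

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [2_000_000.0, 500_000.0], "area": [400.0, 250.0]},
    index=["A", "B"],   # placeholder city labels
)

# Rescale an existing column: population is now stored in millions.
df["population"] = df["population"] / 10**6

# Create a new column from existing ones (brackets are required here).
df["density"] = df["population"] / df["area"]

print(df)
```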

Moreover, you can use arithmetic operations on columns even from different DataFrames provided their rows have meaningful labels, as shown below:


Indexing DataFrames
As we’ve already seen in the Series section, ordinary square brackets are simply not enough to fulfill all the indexing needs. You can’t access rows by labels, can’t access disjoint rows by positional index, you can’t even reference a single cell, since df['x', 'y'] is reserved for MultiIndex!

To meet those needs, dataframes, just like series, have two alternative indexing modes: loc for indexing by labels and iloc for indexing by positional index.
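A brief sketch of the two indexing modes on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("abc"))

by_label = df.loc["b", "B"]      # single cell, addressed by labels
by_position = df.iloc[1, 1]      # the same cell, addressed by position
disjoint = df.iloc[[0, 2]]       # disjoint rows by positional index
label_slice = df.loc["a":"b"]    # label slices include the end label

print(by_label, by_position)
print(disjoint)
```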


In Pandas, referencing multiple rows/columns is a copy, not a view. But it is a special kind of copy that allows assignments as a whole:

df.loc['a']=10 works (single row is writable as a whole)

df.loc['a']['A']=10 works (element access propagates to original df)

df.loc['a':'b'] = 10 works (assigning to a subarray as a whole works)

df.loc['a':'b']['A'] = 10 doesn’t (assigning to its elements doesn’t).

In the last case, the value will only be set on a copy of a slice and will not be reflected in the original df (a warning will be displayed accordingly).

Depending on the background of the situation, there’re different solutions:

You want to change the original dataframe df. Then use
df.loc['a':'b', 'A'] = 10

You have made the copy intentionally and want to work on that copy:
df1 = df.loc['a':'b']; df1['A']=10 # SettingWithCopy warning
To get rid of a warning in this situation, make it a real copy:
df1 = df.loc['a':'b'].copy(); df1['A']=10
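Both solutions can be sketched on a toy dataframe (the variable name part is chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("abc"))

# Solution 1: a single loc call modifies the original dataframe.
df.loc["a":"b", "A"] = 10

# Solution 2: an explicit copy can be modified without touching df.
part = df.loc["a":"b"].copy()
part["A"] = 0

print(df)
print(part)
```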

Pandas also supports a convenient NumPy syntax for boolean indexing.

When using several conditions, they must be parenthesized, as you can see below:

When you expect a single value to be returned, you need special care.


Since there could potentially be several rows matching the condition, loc returned a Series. To get a scalar value out of it, you can either use:

float(s) or a more universal s.item(), both of which raise an exception (TypeError and ValueError, respectively) unless there is exactly one value in the Series

s.iloc[0] that will only raise an exception when nothing is found; also, it is the only one that supports assignments: df[…].iloc[0] = 100, but surely you don’t need it when you want to modify all matches: df[…] = 100.

Alternatively, you can use string-based queries:

df.query('name=="Vienna"')

df.query('population>1e6 and area<1000')

They are shorter, work great with the MultiIndex, and comparison operators take precedence over logical operators (so fewer parentheses are required), but they can only filter by rows, and you can’t modify the DataFrame through them.
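The sketch below compares boolean indexing with query and shows two ways of extracting a scalar. The table is made up; X, Y and Z are placeholder names:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["X", "Y", "Z"],
    "population": [2.5e6, 0.4e6, 1.2e6],
    "area": [900.0, 300.0, 2000.0],
})

# Boolean indexing: each condition needs its own parentheses.
by_mask = df[(df["population"] > 1e6) & (df["area"] < 1000)]

# The same filter as a string-based query.
by_query = df.query("population > 1e6 and area < 1000")

# A filter that matches one row still returns a Series...
one = df.loc[df["name"] == "Y", "area"]
value = one.item()     # ...item() requires exactly one element
first = one.iloc[0]    # ...iloc[0] takes the first match

print(by_mask)
print(value, first)
```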

Several third-party libraries allow you to use SQL syntax to query the DataFrames directly (duckdb) or indirectly by copying the dataframe to SQLite and wrapping the results back into Pandas objects (pandasql). Unsurprisingly, the direct method is faster⁵.


DataFrame arithmetic
You can apply ordinary operations like add, subtract, multiply, divide, modulo, power, etc., to dataframes, series, and combinations thereof.

All arithmetic operations are aligned against the row and column labels:

In mixed operations between DataFrames and Series, the Series (God knows why) behaves (and broadcasts) like a row-vector and is aligned accordingly:

Probably to keep in line with lists and 1D NumPy vectors (which are not aligned by labels and are expected to be sized as if the DataFrame was a simple 2D NumPy array):

So in the unlucky (and, by coincidence, the most usual!) case of dividing a dataframe by a column-vector series, you have to use methods instead of the operators, as you can see below:


Because of this questionable decision, whenever you need to perform a mixed operation between a dataframe and a column-like series, you have to look it up in the docs (or memorize it):

[Image: add, sub, mul, div, mod, pow, floordiv]
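A small sketch of the row-vector versus column-vector behaviour, with toy values:

```python
import pandas as pd

df = pd.DataFrame({"x": [2.0, 4.0], "y": [6.0, 8.0]}, index=["a", "b"])

row_like = pd.Series({"x": 2.0, "y": 4.0})   # labels match the columns
col_like = pd.Series({"a": 2.0, "b": 4.0})   # labels match the index

by_row = df / row_like               # operator: aligned against column labels
by_col = df.div(col_like, axis=0)    # method with axis=0: aligned against row labels
misaligned = df / col_like           # operator with a column-like series: all NaN

print(by_row)
print(by_col)
print(misaligned)
```

The last line shows why the operator form is a trap here: no column label matches 'a' or 'b', so every cell of the result is NaN.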

Combining DataFrames
Pandas has three functions, concat (an abbreviation of concatenate), merge, and join, that are doing the same thing: combining information from several dataframes into one. But each of them does it slightly differently, as they are tailored for different use cases.

Vertical stacking
This is probably the simplest way to combine two or more dataframes into one: you take the rows from the first one and append the rows from the second one to the bottom. To make it work, those two dataframes need to have (roughly) the same columns. This is similar to vstack in NumPy, as you can see in the image:


Having duplicate values in the index is bad. You can run into various kinds of problems (see ‘drop’ example below). Even if you don’t care about the index, try to avoid having duplicate values in it:

either use the ignore_index=True argument,

call df.reset_index(drop=True) to reindex the rows from 0 to len(df)-1,

or use the keys argument to resolve the ambiguity with MultiIndex (see below).

If the columns of the DataFrames do not match each other perfectly (different order does not count here), Pandas can either take the intersection of the columns (join='inner') or insert NaNs to mark the missing values (join='outer', the default):
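The options above can be sketched with two toy dataframes that share only one column:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Default join='outer': union of columns, NaN where a value is missing.
outer = pd.concat([df1, df2], ignore_index=True)

# join='inner': only the columns present in every dataframe.
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

# keys: the origin of each row becomes an extra index level.
keyed = pd.concat([df1, df2], keys=["first", "second"])

print(outer)
print(inner)
print(keyed)
```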


Horizontal stacking
You can also perform ‘horizontal’ stacking (similar to hstack in NumPy):

join is more configurable than concat: in particular, it has five join modes as opposed to only two of concat. See the ‘1:1 relationship joins’ section below for details.

Stacking via MultiIndex
If both row and column labels coincide, concat allows you to do a MultiIndex equivalent of vertical stacking (like dstack in NumPy):


If the row and/or the columns partially overlap, Pandas will align the names accordingly, and that’s most probably not what you want:


In general, if the labels overlap, it means that the dataframes are somehow related to each other, and the relations between entities are best described using the terminology of relational databases.

1:1 relationship joins

This is when the information about the same group of objects is stored in several different DataFrames, and you want to combine it into one DataFrame.

If the column you want to merge on is not in the index, use merge.


The first thing it does is discard anything that happens to be in the index. Then it does the join. Finally, it renumbers the results from 0 to n-1.

If the column is already in the index, you can use join (which is just an alias of merge with left_index or right_index set to True and different defaults).


As you can see from this simplified case (see ‘full outer join’ above), Pandas is pretty relaxed about keeping the row order compared to relational databases. Left and right outer joins tend to be more predictable than inner and outer joins (at least, until there are duplicate values in the column to be merged). So, if you want a guaranteed row order, you’ll have to sort the results explicitly, or use CategoricalIndex (pdi.lock can help you with it).

1:n relationship joins

This is the most widely-used relationship in database design, where one row in table A (e.g., ‘State’) can be linked to several rows of table B (e.g., ‘City’), but each row of table B can only be linked to one row of table A (= a city can only be in a single state, but a state consists of multiple cities).

Just like 1:1 relationships, to join a pair of 1:n related tables in Pandas, you have two options. If the column to be merged on is not in the index, and you’re ok with discarding anything that happens to be in the index of both tables, use merge, for example:


[Image: merge() performs inner join by default]

As we’ve seen already, merge keeps row order less rigorously than, say, Postgres. The “preserve key order” statement from the docs only applies to left_index=True and/or right_index=True (that is what join is an alias for) and only in the absence of duplicate values in the column to be merged on. That’s why merge and join have a sort argument.

Now, if the column to merge on is already in the index of the right DataFrame, use join (or merge with right_index=True, which is exactly the same thing):


[Image: join() does left outer join by default]

This time Pandas kept both the index values of the left dataframe and the order of the rows intact.

Note: Be careful, if the second table has duplicate index values, you’ll end up with duplicate index values in the result, even if the left table index is unique!

Sometimes, joined dataframes have columns with the same name. Both merge and join have a way to resolve the ambiguity, but the syntax is slightly different (also, by default, merge will resolve it with '_x', '_y' while join will raise an exception), as you can see in the image below:


To summarize:

merge joins on non-index columns, join requires the ‘right’ column to be indexed;

merge discards the index of the left DataFrame, join keeps it;

by default, merge performs an inner join, join does left outer join;

merge does not keep the order of the rows, join keeps them (some restrictions apply);

join is an alias for merge with left_index=True and/or right_index=True.
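The summary can be illustrated with a toy 1:n pair of tables. All names and values below are placeholders:

```python
import pandas as pd

cities = pd.DataFrame(
    {"city": ["P", "Q", "R"], "state": ["S1", "S1", "S2"]},
    index=[10, 20, 30],   # a non-trivial index, to see what happens to it
)
states = pd.DataFrame({"state": ["S1", "S2"], "code": ["s1", "s2"]})

# merge: joins on a regular column; the left index is discarded.
merged = cities.merge(states, on="state")

# join: the right table must be indexed by the key; the left index is kept.
joined = cities.join(states.set_index("state"), on="state")

print(merged)
print(joined)
```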

Multiple joins
As discussed above, when join is run against two dataframes, e.g. df.join(df1), it acts as an alias to merge. But join also has a ‘multiple join’ mode, which, in its turn, is an alias for concat(axis=1).


This mode is somewhat limited compared to the regular mode:

it does not provide a means for duplicate column resolution;

it only works for 1:1 relationships (index-to-index joins).

So multiple 1:n relationships are supposed to be joined one by one. The repo ‘pandas-illustrated’ has a helper for that, too, as you can see below:

pdi.join is a simple wrapper over join that accepts lists in on, how and suffixes arguments so that you could make several joins in one command. Just like with the original join, on columns pertain to the first DataFrame, and other DataFrames are joined against their indices.

Inserts and deletes
Since a DataFrame is a collection of columns, it is easier to apply these operations to the columns than to the rows. For example, inserting a column is always done in-place, while inserting a row always results in a new DataFrame, as shown below:

Deleting columns is usually worry-free, except that del df['D'] works while del df.D doesn’t (limitation on the Python level).


Deleting rows with drop is surprisingly slow and can lead to intricate bugs if the row labels are not unique. The image below will help explain the concept:

One solution would be to use ignore_index=True, which tells concat to reset the row names after concatenation:


In this case, setting the name column as an index would help. But for more complicated filters, it wouldn’t.

Yet another solution that is fast, universal, and even works with duplicate row names is indexing instead of deletion. I’ve written a (one-line-long) automation to avoid explicitly negating the condition.
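The pitfall with duplicate row labels, and the indexing alternative, can be sketched as follows. This uses plain pandas with an explicitly negated condition, not the author's helper:

```python
import pandas as pd

# Duplicate row labels, e.g. left over from a careless concat.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]},
                  index=[0, 0, 1])

# drop works by label, so it removes every row labeled 0, not just 'a'.
labels_to_drop = df[df["name"] == "a"].index
dropped = df.drop(labels_to_drop)

# Indexing with the negated condition keeps exactly the wanted rows.
kept = df[df["name"] != "a"]

print(dropped)
print(kept)
```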


Group by
This operation has already been described in detail in the Series section. But DataFrame’s groupby has a couple of specific tricks on top of that.

First, you can specify the column to group by using just a name, as the image below shows:

Without as_index=False, Pandas makes the column by which the grouping was performed to be the index column. If this is not desirable, you can use reset_index() or specify as_index=False.

Usually, there’re more columns in the DataFrame than you want to see in the result. By default, Pandas sums anything remotely summable, so you’ll have to narrow your choice, as shown below:


Note that when summing over a single column, you’ll get a Series instead of a DataFrame. If, for some reason, you want a DataFrame, you can:

use double brackets: df.groupby('product')[['quantity']].sum() or

convert explicitly: df.groupby('product')['quantity'].sum().to_frame()

Switching to numeric index will also make a DataFrame out of it:

df.groupby('product', as_index=False)['quantity'].sum() or

df.groupby('product')['quantity'].sum().reset_index()
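These variants can be compared side by side on a made-up sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["apple", "pear", "apple"],
    "quantity": [1, 2, 3],
    "price": [0.5, 0.8, 0.5],
})

as_series = df.groupby("product")["quantity"].sum()     # single brackets: Series
as_frame = df.groupby("product")[["quantity"]].sum()    # double brackets: DataFrame
flat = df.groupby("product", as_index=False)["quantity"].sum()  # numeric index

print(type(as_series).__name__, type(as_frame).__name__)
print(flat)
```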

But despite the unusual appearance, in many cases a Series behaves just like a DataFrame, so maybe a ‘facelift’ of pdi.patch_series_repr() would be enough.

Different columns should sometimes be treated differently when grouped. For example, it is perfectly fine to sum over quantity, but it makes no sense to sum over price. Using .agg allows you to specify different aggregate functions for different columns, as the image shows:


Or you can create several aggregate functions for a single column:

Or, to avoid the cumbersome column renaming, you can do the following:
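One way to get different aggregates with ready-made column names is pandas' named aggregation. Whether this is exactly what the original image shows is an assumption; the table and the output column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["apple", "apple", "pear"],
    "quantity": [1, 3, 2],
    "price": [1.0, 2.0, 4.0],
})

# Each keyword is an output column: (input column, aggregate function).
summary = df.groupby("product").agg(
    total_quantity=("quantity", "sum"),
    mean_price=("price", "mean"),
)

print(summary)
```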

Sometimes, the predefined functions are not good enough to produce the required results. For example, it would be better to use weights when averaging the price. So you can provide a custom function for that. In contrast with Series, the function can access multiple columns of the group (it is fed a sub-dataframe as an argument), as shown below:
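A quantity-weighted average price could be sketched like this. The function name weighted_price and the data are illustrative, and selecting the two columns before apply is an editorial choice to keep the grouping column out of the sub-dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["apple", "apple", "pear"],
    "quantity": [1, 3, 2],
    "price": [1.0, 2.0, 4.0],
})

def weighted_price(group):
    # group is a sub-dataframe holding the rows of one product
    return np.average(group["price"], weights=group["quantity"])

wavg = df.groupby("product")[["price", "quantity"]].apply(weighted_price)

print(wavg)
```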

Unfortunately, you can’t combine predefined aggregates with several-column-wide custom functions, such as the one above, in one command, as agg only accepts one-column-wide user functions. The only thing that one-column-wide user functions can access is the index, which can be handy in certain scenarios. For example, that day, bananas were sold at a 50% discount, which can be seen below:


To access the value of the group by column from the custom function, it was included in the index beforehand.

As usual, the least customized function yields the best performance. So in order of increasing speed:

multi-column-wide custom function via g.apply()

single-column-wide custom function via g.agg() (supports acceleration with Cython or Numba)

predefined functions (Pandas or NumPy function object, or its name as a string).

A useful instrument for looking at the data from a different perspective, often used together with grouping, is pivot tables.

Pivoting and ‘unpivoting’
Suppose you have a variable a that depends on two parameters i and j. There’re two equivalent ways to represent it as a table:


The ‘wide’ format is more appropriate when the data is ‘dense’ (when there’re few zero or missing elements), and the ‘long’ is better when the data is ‘sparse’ (most of the elements are zeros/missing and can be omitted from the table). The situation gets more contrived when there’re more than two parameters.

Naturally, there should be a simple way to transform between those formats. And Pandas provides a simple and convenient solution for it: the pivot table.

As a less abstract example, consider the following table with the sales data. Two clients have bought the designated quantity of two kinds of products. Initially, this data is in the ‘long format.’ To convert it to the ‘wide format’, use df.pivot:


This command discards anything unrelated to the operation (i.e. index and price columns) and transforms the information from the three requested columns into the wide format, placing client names into the result’s index, product titles into its columns, and quantity sold into the ‘body’ of it.

As for the reverse operation, you can use stack. It merges index and columns into the MultiIndex:


[Image: reset_index]

If you want to stack only certain columns, you can use melt:

Note that melt orders the rows of the result in a different manner.

pivot loses the information about the name of the ‘body’ of the result, so for both stack and melt we have to ‘remind’ Pandas about the name of the ‘quantity’ column.
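The round trip between the two formats can be sketched as follows. The clients, products and quantities are placeholders rather than the data from the images:

```python
import pandas as pd

long_df = pd.DataFrame({
    "client": ["c1", "c1", "c2", "c2"],
    "product": ["p1", "p2", "p1", "p2"],
    "quantity": [1, 2, 3, 4],
})

# long -> wide
wide = long_df.pivot(index="client", columns="product", values="quantity")

# wide -> long with stack; the name of the values has to be restored by hand
back = wide.stack().rename("quantity").reset_index()

# wide -> long with melt; note the different row order
melted = wide.reset_index().melt(
    id_vars="client", var_name="product", value_name="quantity"
)

print(wide)
print(back)
print(melted)
```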

In the example above, all the values are present, but it is not a must:

The practice of grouping values and then pivoting the results is so common that groupby and pivot have been bundled together into a dedicated function (and a corresponding DataFrame method) pivot_table:

without the columns argument, it behaves similarly to groupby;

when there’re no duplicate rows to group by, it works just like pivot;

otherwise, it does grouping and pivoting.


The aggfunc parameter controls which aggregate function should be used for grouping the rows (mean by default).

For convenience, pivot_table can calculate the subtotals and grand total:
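The behaviour described above can be tried out with a short sketch. The data is invented; note the duplicate (John, bananas) rows that plain pivot would reject, and the use of margins=True for totals.

```python
import pandas as pd

# Hypothetical sales with a duplicate (client, product) pair.
sales = pd.DataFrame({
    "client": ["John", "John", "John", "Silvia"],
    "product": ["bananas", "bananas", "oranges", "bananas"],
    "quantity": [5, 1, 3, 4],
})

# Without the columns argument the call resembles a groupby.
by_client = sales.pivot_table(index="client", values="quantity", aggfunc="sum")

# With columns it groups and pivots; margins=True adds the 'All' totals,
# fill_value=0 replaces the missing (Silvia, oranges) combination.
totals = sales.pivot_table(
    index="client", columns="product", values="quantity",
    aggfunc="sum", fill_value=0, margins=True,
)
print(totals)
```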

Once created, a pivot table becomes just an ordinary DataFrame, so it can be queried using the standard methods described earlier:

The best way to get a grasp on pivot_table (except to start using it right away!) is to follow a relevant case study. I can highly recommend two of them:

an extremely thorough sales case is described in this blog post⁶

a very well-written generic use case (based on the infamous Titanic dataset) can be found here⁷

Pivot tables are especially handy when used with MultiIndex. We’ve seen lots of examples where Pandas functions return a multi-indexed DataFrame. Let’s have a closer look at it.

Part 4. MultiIndex

The most straightforward usage of MultiIndex for people who have never heard of Pandas is using a second index column as a supplement for the first one to identify each row uniquely. For example, to disambiguate cities from different states, the state’s name is often appended to the city’s name. (Did you know there are about 40 Springfields in the US?) In relational databases, it is called a composite primary key.

You can either specify the columns to be included in the index after the DataFrame is parsed from CSV or right away as an argument to read_csv.

You can also append existing levels to the MultiIndex afterward using append=True, as you can see in the image below:
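The three routes mentioned above can be sketched like this. The CSV text and the population numbers are placeholders invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content; the population values are placeholders.
csv_text = """city,state,population
Springfield,Illinois,100
Springfield,Missouri,200
Portland,Oregon,300
"""

# Option 1: build the two-level index while parsing.
df1 = pd.read_csv(io.StringIO(csv_text), index_col=["city", "state"])

# Option 2: parse first, then promote the columns to index levels.
df2 = pd.read_csv(io.StringIO(csv_text)).set_index(["city", "state"])

# Option 3: start with a single-level index and append another level to it.
df3 = (
    pd.read_csv(io.StringIO(csv_text), index_col="city")
    .set_index("state", append=True)
)

# The (city, state) pair now identifies each row uniquely.
print(df1.loc[("Springfield", "Missouri"), "population"])
```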

Another use case, more typical of Pandas, is representing multiple dimensions: when you have a number of objects with a certain set of properties, or the evolution in time of one object of the kind. For example:

results of a sociological survey,

the ‘Titanic’ dataset,

historical weather observations,

chronology of championship standings.

This is also known as ‘Panel data,’ and Pandas owes its name to it.

Let’s add such a dimension:

Now we have a four-dimensional space, where

years form one (almost continuous) dimension,

city names are placed along the second,

state names along the third, and

particular city properties (‘population,’ ‘density,’ ‘area,’ etc.) act as ‘tick marks’ along the fourth dimension.

The following diagram illustrates the concept:

To allow space for the names of the dimensions corresponding to columns, Pandas shifts the whole header upward:

[Image caption: rename_axis]

Visual grouping
The first thing to note about MultiIndex is that it does not group anything as it might appear. Internally, it is just a flat sequence of labels, as you can see below:

You can get the same groupby effect for row labels by just sorting them:

[Image caption: sort_index]

And you can even disable the visual grouping entirely by setting a corresponding Pandas option: pd.options.display.multi_sparse=False.

Type conversions
Pandas (as well as Python itself) makes a difference between numbers and strings, so it is usually a good idea to convert strings to numbers in case the datatype was not detected automatically:

pdi.set_level(df.columns, 0, pdi.get_level(df.columns, 0).astype('int'))

If you’re feeling adventurous, you can do the same with standard tools:

df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)

But to use them properly, you need to understand what ‘levels’ and ‘codes’ are, whereas pdi allows you to work with MultiIndex as if the levels were ordinary lists or NumPy arrays.
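The pure-Pandas variant can be checked on a toy frame. This is a sketch with a made-up one-row DataFrame whose outer column level holds years as strings; it uses only the standard set_levels call quoted above.

```python
import pandas as pd

# Hypothetical frame: the outer column level contains the strings '2010' and '2020'.
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=pd.MultiIndex.from_product([["2010", "2020"], ["area", "population"]]),
)

# Replace the distinct labels of level 0 with their integer counterparts.
df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)

# The columns can now be addressed by number rather than by string.
print(df[2020])
```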

If you really wonder, ‘levels’ and ‘codes’ are something that a regular list of labels from a certain level is broken into to speed up operations like pivot, join, and so on:

pdi.get_level(df, 0) == Index([2010, 2010, 2020, 2020], dtype='int64')

df.columns.levels[0] == Index([2010, 2020], dtype='int64')

df.columns.codes[0] == array([0, 0, 1, 1], dtype=int8)
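The relationship between the full list of labels, the ‘levels’ and the ‘codes’ can be verified with standard Pandas attributes. The index below is a made-up example; get_level_values plays the role that pdi.get_level plays in the text.

```python
import pandas as pd

# Hypothetical two-level index with repeated year labels.
mi = pd.MultiIndex.from_arrays(
    [[2010, 2010, 2020, 2020], ["area", "population", "area", "population"]],
    names=["year", "property"],
)

labels = mi.get_level_values(0)   # one label per entry: 2010, 2010, 2020, 2020
uniques = mi.levels[0]            # distinct labels only: 2010, 2020
codes = mi.codes[0]               # positions into `uniques`: 0, 0, 1, 1

# The full list of labels is recovered by looking the codes up in the levels.
rebuilt = uniques.take(codes)
print(list(rebuilt))
```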

Building a DataFrame with a MultiIndex
In addition to reading from CSV files and building from the existing columns, there are some more methods to create a MultiIndex. They are less commonly used, mostly for testing and debugging.

The most intuitive way, using Pandas’ own representation of a MultiIndex, does not work for historical reasons.

‘Levels’ and ‘codes’ here are (nowadays) considered implementation details that should not be exposed to the end user, but we have what we have.

Probably the simplest way of building a MultiIndex is the following:

[Image caption: rename_axis]

The downside here is that the names of the levels need to be assigned in a separate line or in a separate chained method⁸. Several alternative constructors bundle the names along with the labels.

[Image caption: from_arrays, from_tuples]

When the levels form a regular structure, you can specify the key elements and let Pandas interleave them automatically, as shown below:

[Image caption: from_product]
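The three constructors named in the captions can be compared side by side. This is a sketch with invented city names and years; all three calls are standard Pandas class methods and accept the level names directly.

```python
import pandas as pd

names = ["city", "year"]

# One array per level.
from_arrays = pd.MultiIndex.from_arrays(
    [["Portland", "Portland", "Salem", "Salem"], [2010, 2020, 2010, 2020]],
    names=names,
)

# One tuple per row.
from_tuples = pd.MultiIndex.from_tuples(
    [("Portland", 2010), ("Portland", 2020), ("Salem", 2010), ("Salem", 2020)],
    names=names,
)

# Only the key elements; Pandas builds the Cartesian product.
from_product = pd.MultiIndex.from_product(
    [["Portland", "Salem"], [2010, 2020]], names=names
)

print(from_product)
```

All three produce the same index here, so the choice mostly depends on the shape in which the labels are already available.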

All the methods listed above apply to columns, too. For example:

Indexing with MultiIndex
The good thing about accessing DataFrame via the MultiIndex is that you can easily reference all levels at once (potentially omitting the inner levels) with a nice and familiar syntax.

Columns: via regular square brackets

Rows and cells: using .loc[]

Now, what if you want to select all cities in Oregon or leave only the columns with population? Python syntax imposes two limitations here:

There’s no way of telling between df['a', 'b'] and df[('a', 'b')]: it is processed the same way, so you can’t just write df[:, 'Oregon']. Otherwise, Pandas would never know if you mean Oregon the column or Oregon the second level of rows.

Python only allows colons inside square brackets, not inside parentheses, so you can’t write df.loc[(:, 'Oregon'), :].

On the technical side, it is not difficult to arrange. I’ve monkey-patched (made a patch that is discarded once the kernel dies) the DataFrame to add such functionality, which you can see here:

[Image caption: Warning! Not a valid Pandas syntax! Only works after pdi.patch_mi_co()]

The only downside of this syntax is that when you use both indexers, it returns a copy, so you can’t write df.mi[:,'Oregon'].co['population'] = 10.

There are many alternative indexers, some of which allow such assignments, but all of them have their own quirks:

You can swap inner layers with outer layers and use the brackets.

[Image caption: swaplevel]

So, df[:, 'population'] can be implemented with

df.swaplevel(axis=1)['population']

This feels hacky and is not convenient for more than two levels.

You can use the xs method:

df.xs('population', level=1, axis=1).

It does not feel Pythonic enough, especially when selecting multiple levels. This method is unable to filter both rows and columns at the same time, so the reasoning behind the name xs (stands for “cross-section”) is not entirely clear. It cannot be used for setting values.

The preferred method for handling this situation is to create an alias for pd.IndexSlice and use it inside .loc:

idx = pd.IndexSlice; df.loc[:, idx[:, 'population']]

That’s more Pythonic, but the necessity of aliasing something to access an element is somewhat of a burden (and it is too long without an alias). You can select rows and columns at the same time. Writable.

You can learn how to use slice instead of a colon. If you know that a[3:10:2] == a[slice(3,10,2)] then you might understand the following, too: df.loc[:, (slice(None), 'population')], but it is barely readable anyway. It can select rows and columns at the same time. Writable.
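The four bracket-based alternatives listed above can be compared on a small frame. The DataFrame below is a made-up stand-in for the article's cities table (the numbers are arbitrary); the calls themselves are standard Pandas.

```python
import pandas as pd

# Hypothetical frame: (city, state) rows and (year, property) columns.
index = pd.MultiIndex.from_tuples(
    [("Portland", "Oregon"), ("Salem", "Oregon"), ("Springfield", "Illinois")],
    names=["city", "state"],
)
columns = pd.MultiIndex.from_product(
    [[2010, 2020], ["area", "population"]], names=["year", "property"]
)
df = pd.DataFrame(
    [[1, 10, 2, 20], [3, 30, 4, 40], [5, 50, 6, 60]],
    index=index, columns=columns,
)
idx = pd.IndexSlice

# Four ways of selecting 'population' from the inner column level.
via_swaplevel = df.swaplevel(axis=1)["population"]    # drops the inner level
via_xs = df.xs("population", level=1, axis=1)         # drops the inner level
via_idx = df.loc[:, idx[:, "population"]]             # keeps both levels
via_slice = df.loc[:, (slice(None), "population")]    # keeps both levels

# IndexSlice can filter rows and columns in one go...
oregon_pop = df.loc[idx[:, "Oregon"], idx[:, "population"]]

# ...and is writable (done on a copy here to keep df intact).
updated = df.copy()
updated.loc[idx[:, "Oregon"], idx[:, "population"]] = 0
print(updated)
```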

As a bottom line, Pandas has a number of ways to access elements of the DataFrame with MultiIndex using brackets, but none of them is convenient enough, so they had to adopt an alternative indexing syntax:

A mini-language for the .query method (it is the only one that is capable of doing ‘or’s, not only ‘and’s):

df.query('state=="Oregon" or city=="Portland"').

It is convenient and fast, but lacks support from IDEs (no autocompletion, no syntax highlighting, etc.), and it only filters the rows, not the columns. That means you can’t implement df[:, 'population'] with it without transposing the DataFrame (which will lose the types unless all the columns are of the same type). Non-writable.
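A quick sketch of the query mini-language applied to index levels. The frame and its numbers are invented; the level names are referenced inside the query string as if they were columns.

```python
import pandas as pd

# Hypothetical frame with a (city, state) MultiIndex; numbers are placeholders.
df = pd.DataFrame(
    {"population": [650, 170, 115, 105]},
    index=pd.MultiIndex.from_tuples(
        [("Portland", "Oregon"), ("Salem", "Oregon"),
         ("Springfield", "Illinois"), ("Portland", "Maine")],
        names=["city", "state"],
    ),
)

# 'or' across two different index levels.
either = df.query('state == "Oregon" or city == "Portland"')

# 'and' works the same way.
both = df.query('state == "Oregon" and city == "Portland"')
print(either)
```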

Stacking and unstacking
Pandas does not have set_index for columns. A common way of adding levels to columns is to ‘unstack’ existing levels from the index:

[Image caption: stack, unstack]

Pandas’ stack is very different from NumPy’s stack. Let’s see what the documentation says about the naming conventions:

“The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).”

The ‘on top’ part does not sound really convincing to me, but at least this explanation helps memorize which one moves things which way. By the way, Series has unstack, but does not have stack because it is ‘stacked already.’ Being one-dimensional, Series can act as either row vectors or column vectors in different situations but are normally thought of as column vectors (e.g., dataframe columns).

For example:

You can also specify which level to stack/unstack by name or by positional index. In this example, df.stack(), df.stack(1) and df.stack('year') produce the same result, as well as df1.unstack(), df1.unstack(2), and df1.unstack('year'). The destination is always ‘after the last level’ and is not configurable. If you need to put the level somewhere else, you can use df.swaplevel().sort_index() or pdi.swap_levels(df, sort=True).
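The equivalence of the three spellings can be checked on a toy frame. This sketch uses a made-up table with a single-level row index, so the stacked result has two index levels and the positional form of unstack is unstack(1) rather than the unstack(2) of the article's three-level example.

```python
import pandas as pd

# Hypothetical frame: cities in rows, (property, year) in columns.
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.Index(["Portland", "Salem"], name="city"),
    columns=pd.MultiIndex.from_product(
        [["area", "population"], [2010, 2020]], names=["property", "year"]
    ),
)

# Three equivalent ways of moving the innermost column level into the index.
a = df.stack()
b = df.stack(1)
c = df.stack("year")

# Three equivalent ways of moving it back to the columns.
df1 = c
x = df1.unstack()
y = df1.unstack(1)
z = df1.unstack("year")
print(x)
```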

The columns must not contain duplicate values to be eligible for stacking (the same applies to the index when unstacking):

How to prevent stack/unstack from sorting
Both stack and unstack have a bad habit of unpredictably sorting the result’s index lexicographically. It might be irritating at times, but it is the only way to give predictable results when there are a lot of missing values.

Consider the following example. In which order would you expect the days of the week to appear in the right table?

You could speculate that if John’s Monday stands to the left of John’s Friday, then ‘Mon’ < ‘Fri’, and similarly, ‘Fri’ < ‘Sun’ for Silvia, so the result should be ‘Mon’ < ‘Fri’ < ‘Sun’. This is legitimate, but what if the remaining columns are in a different order, say, ‘Mon’ < ‘Fri’ and ‘Tue’ < ‘Fri’? Or ‘Mon’ < ‘Fri’ and ‘Wed’ < ‘Sat’?

OK, there aren’t so many days of the week out there, and Pandas could deduce the order based on prior knowledge. But mankind has not arrived at a decisive conclusion on whether Sunday should stand at the end of the week or the beginning. Which order should Pandas use by default? Read regional settings? And what about less trivial sequences, say, the order of the states in the US?

What Pandas does in this situation is simply sort it alphabetically, as you can see below:

While this is a sensible default, it still feels wrong. There should be a solution! And there is one. It is called CategoricalIndex. It remembers the order even if some labels are missing. It has recently been smoothly integrated into the Pandas toolchain. The only thing it lacks is infrastructure. It is difficult to build; it is fragile (falls back to object dtype in certain operations), yet it is perfectly usable, and the pdi library has some helpers to ease the learning curve.

For example, to tell Pandas to lock the order of, say, a simple Index holding the products (which will inevitably get sorted if you decide to unstack days of the week back to columns), you need to write something as horrendous as df.index = pd.CategoricalIndex(df.index, df.index, ordered=True). And it is much more contrived for MultiIndex.
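To make the effect concrete, here is a pure-Pandas sketch (it does not use the pdi helpers). The data is invented, and the use of pd.Categorical on a column before set_index is just one possible way of ending up with a categorical level; it is an assumption of this example rather than the article's recipe.

```python
import pandas as pd

# A simple Index keeps its declared order once promoted to an ordered CategoricalIndex.
products = pd.Index(["oranges", "bananas"])
locked_products = pd.CategoricalIndex(products, categories=products, ordered=True)
print(list(locked_products.sort_values()))

# Hypothetical sales with missing days for each client.
df = pd.DataFrame({
    "client": ["John", "John", "Silvia", "Silvia"],
    "day": ["Mon", "Fri", "Fri", "Sun"],
    "quantity": [1, 2, 3, 4],
})

# Plain string labels: unstack sorts the days alphabetically.
plain = df.set_index(["client", "day"])["quantity"].unstack()

# Ordered categorical labels: unstack follows the declared order.
df["day"] = pd.Categorical(df["day"], categories=["Mon", "Fri", "Sun"], ordered=True)
locked = df.set_index(["client", "day"])["quantity"].unstack()
print(locked)
```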

The pdi library has a helper function locked (and an alias lock having inplace=True by default) for locking the order of a certain MultiIndex level by promoting the level to the CategoricalIndex:

The checkmark ✓ next to a level name means the level is locked. It can be visualized manually with pdi.vis(df) or automatically by monkey-patching DataFrame HTML output with pdi.vis_patch(). After applying the patch, simply writing ‘df’ in a Jupyter cell will show checkmarks for all levels with locked ordering.

lock and locked work automatically in simple cases (such as client names) but need a hint from the user for the more complex cases (such as days of the week with missing days).

After the level has been switched to CategoricalIndex, it keeps the original order in operations like sort_index, stack, unstack, pivot, pivot_table, etc.

It is fragile, though. Even such an innocent operation as adding a column via df['new_col'] = 1 breaks it. Use pdi.insert(df.columns, 0, 'new_col', 1), which processes level(s) with CategoricalIndex correctly.

Manipulating levels
In addition to the already mentioned methods, there are some more:

pdi.get_level(obj, level_id) returns a particular level referenced either by number or by name, works with DataFrames, Series, and MultiIndex;

pdi.set_level(obj, level_id, labels) replaces the labels of a level with the given array (list, NumPy array, Series, Index, etc.):

pdi.insert_level(obj, pos, labels, name) adds a level with the given values (properly broadcasted if necessary);

pdi.drop_level(obj, level_id) removes the specified level from the MultiIndex:

pdi.swap_levels(obj, src=-2, dst=-1) swaps two levels (two innermost levels by default);

pdi.move_level(obj, src, dst) moves a particular level src to the designated position dst:

In addition to the arguments mentioned above, all functions from this section have the following arguments:

axis=None, where None means ‘columns’ for a DataFrame and ‘index’ for a Series (aka ‘info’ axis);

sort=False, optionally sorts the corresponding MultiIndex after the manipulations;

inplace=False, optionally performs the manipulation in-place (does not work with a single Index because it is immutable).

All the operations above understand the word level in the conventional sense (a level has the same number of labels as the number of columns in the DataFrame), hiding the machinery of index.levels and index.codes from the end user.

On the rare occasions when moving and swapping separate levels is not enough, you can reorder all the levels at once with this pure Pandas call:
df.columns = df.columns.reorder_levels(['M','L','K'])

where ['M', 'L', 'K'] is the desired order of the levels.

Generally, it is enough to use get_level and set_level for the necessary fixes of the labels, but if you want to apply a transformation to all levels of the MultiIndex at once, Pandas has an (ambiguously named) function rename that accepts a dict or a function:

[Image caption: rename]

As for renaming the levels, their names are stored in the field .names. This field does not support direct assignments (why not?):
df.index.names[1] = 'x' # TypeError

but can be replaced as a whole:

When you just need to rename a particular level, the syntax is as follows:

[Image caption: set_names, rename_axis]
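The difference between renaming labels and renaming levels can be sketched with standard Pandas calls. The frame is a made-up example; the replacement names ('town', 'region') are arbitrary.

```python
import pandas as pd

# Hypothetical frame with a (city, state) MultiIndex.
df = pd.DataFrame(
    {"population": [1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("Portland", "Oregon"), ("Salem", "Oregon")], names=["city", "state"]
    ),
)

# rename changes the labels (not the level names); it accepts a function or a dict.
upper = df.rename(index=str.upper)
short = df.rename(index={"Oregon": "OR"})

# Element-wise assignment to .names is rejected...
try:
    df.index.names[1] = "x"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# ...but the whole list of names can be replaced at once,
all_names = df.copy()
all_names.index.names = ["town", "region"]

# and a single level can be renamed by its current name.
one_name = df.rename_axis(index={"state": "region"})
via_set_names = df.index.set_names("region", level="state")
print(one_name)
```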

Converting MultiIndex into a flat Index and restoring it
As we’ve seen from above, the convenient query method only solves the complexity of dealing with MultiIndex in the rows. And despite all the helper functions, when some tricky Pandas function returns a MultiIndex in the columns, it has a shock effect for beginners. So, the pdi library has the following:

join_levels(obj, sep='_', name=None) joins all MultiIndex levels into one Index

split_level(obj, sep='_', names=None) splits the Index back into a MultiIndex

Both have optional axis and inplace arguments.

As for a pure-Pandas solution, the following code can do the trick:

join levels:

df.columns = ['_'.join(k) for k in df.columns.to_flat_index()]

split levels:

df.columns = pd.MultiIndex.from_tuples(k.split('_') for k in df.columns)
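A round trip with the pure-Pandas snippets can be sketched as follows. The frame is a made-up example; note that this approach assumes all labels are strings and that none of them contains the separator.

```python
import pandas as pd

# Hypothetical frame with two-level string column labels.
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=pd.MultiIndex.from_product([["area", "population"], ["2010", "2020"]]),
)
original = df.columns

# Join the levels into a flat Index of strings such as 'area_2010'.
flat = df.copy()
flat.columns = ["_".join(k) for k in flat.columns.to_flat_index()]

# Split the flat labels back into a MultiIndex.
restored = flat.copy()
restored.columns = pd.MultiIndex.from_tuples(
    [tuple(k.split("_")) for k in restored.columns]
)
print(list(flat.columns))
```

Level names are not preserved by this round trip, so they would have to be reassigned separately if the original index had them.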

Sorting MultiIndex
Since MultiIndex consists of several levels, sorting is a bit more contrived than for a single Index. It can still be done with the sort_index method, but it could be further fine-tuned with the following arguments:

To sort column levels, specify axis=1.
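Since the illustration listing the arguments is not reproduced here, the sketch below shows the sort_index arguments that seem most relevant (level, ascending, axis). Which of them the original figure covered is an assumption; the data is invented.

```python
import pandas as pd

# Hypothetical frame with an unsorted (city, state) MultiIndex.
df = pd.DataFrame(
    {"population": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [("Salem", "Oregon"), ("Portland", "Oregon"),
         ("Springfield", "Illinois"), ("Portland", "Maine")],
        names=["city", "state"],
    ),
)

# Default: sort by all levels, outermost first.
by_all = df.sort_index()

# Sort by a chosen level first (remaining levels are used as tie-breakers).
by_state = df.sort_index(level="state")

# A separate direction per level: cities ascending, states descending.
mixed = df.sort_index(ascending=[True, False])

# The same machinery applies to column levels with axis=1.
cols_sorted = df.T.sort_index(axis=1)
print(by_state)
```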

Reading and writing MultiIndexed DataFrames to disk
Pandas can write a DataFrame with a MultiIndex into a CSV file in a fully automated manner: df.to_csv('df.csv'). However, when reading such a file, Pandas cannot parse the MultiIndex automatically and needs some hints from the user. For example, to read a DataFrame with three-level-high columns and a four-level-wide index, you need to specify pd.read_csv('df.csv', header=[0,1,2], index_col=[0,1,2,3]).

This means that the first three lines contain the information about the columns, and the first four fields in each of the subsequent lines contain the index levels (if there’s more than one level in the columns, you can’t reference row levels by names in read_csv, only by numbers).
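The hints can be tried out without touching the disk. This sketch writes a small made-up frame (two column levels, two index levels) into an in-memory buffer and reads it back, so the header and index_col lists are shorter than in the three-by-four example above.

```python
import io
import pandas as pd

# Hypothetical frame with two row levels and two column levels.
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.MultiIndex.from_tuples(
        [("Portland", "Oregon"), ("Salem", "Oregon")], names=["city", "state"]
    ),
    columns=pd.MultiIndex.from_product(
        [["area", "population"], [2010, 2020]], names=["property", "year"]
    ),
)

# Writing needs no hints.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Reading needs the number of header rows and index columns spelled out.
restored = pd.read_csv(buffer, header=[0, 1], index_col=[0, 1])

# Column labels read from a multi-row header may come back with a different
# dtype than they were written with, so they are worth checking after loading.
print(restored.columns.get_level_values(1))
```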

It is not convenient to manually decipher the number of levels in the column MultiIndex, so a better idea would be to stack() all but one of the column header levels before saving the DataFrame to CSV, and unstack() them back after reading.

If you need a fire-and-forget solution, you might want to look into the binary formats, such as Python pickle format:

directly: df.to_pickle('df.pkl'), pd.read_pickle('df.pkl')

using the storemagic in Jupyter: %store df, then %store -r df (stores in $HOME/.ipython/profile_default/db/autorestore)

Python pickle is small and fast, but it is only accessible from Python. If you need interoperability with other ecosystems, look into more standard formats such as Excel format (which requires the same hints as read_csv when reading MultiIndex). Here’s the code:

!pip install openpyxl
df.to_excel('df.xlsx')
df1 = pd.read_excel('df.xlsx', header=[0,1,2], index_col=[0,1,2,3])

The Parquet file format supports multi-indexed dataframes with no hints whatsoever, produces smaller files, and works faster (see a benchmark⁹):

df.to_parquet('df.parquet')
df1 = pd.read_parquet('df.parquet')

The official docs have a table listing all ~20 supported formats.

MultiIndex arithmetic
In operations where a multi-indexed dataframe is used as a whole, the same rules as for ordinary dataframes apply (see Part 3). But dealing with a subset of cells has some peculiarities of its own.

You can update a subset of columns referenced via the outer MultiIndex level(s) as simply as the following:

Or if you want to keep the original data intact, df1 = df.assign(population=df.population*10).

You can also easily get the population density with density=df.population/df.area.

But unfortunately, you can’t assign the result to the original dataframe with df.assign.

One approach is to stack all the irrelevant levels of the column index into the row index, perform the necessary calculations, and unstack them back (use pdi.lock to keep the original order of columns).
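The stack, compute, unstack approach can be sketched in pure Pandas. The frame and its numbers are invented, and the pdi.lock step is left out, so the column order of the result simply follows whatever unstack produces.

```python
import pandas as pd

# Hypothetical frame: (property, year) columns; numbers chosen for easy division.
df = pd.DataFrame(
    [[10, 20, 100, 400], [5, 5, 50, 100]],
    index=pd.Index(["Portland", "Salem"], name="city"),
    columns=pd.MultiIndex.from_product(
        [["area", "population"], [2010, 2020]], names=["property", "year"]
    ),
)

# Selecting by the outer level gives two frames aligned on the 'year' columns.
density = df["population"] / df["area"]

# Move the irrelevant 'year' level into the rows, add the new property
# as an ordinary column, then move 'year' back to the columns.
tall = df.stack("year")
tall["density"] = tall["population"] / tall["area"]
wide = tall.unstack("year")
print(wide)
```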

Alternatively, you can use pdi.assign:

pdi.assign is locked-order-aware, so if you feed it a dataframe with locked level(s), it won’t unlock them so that the subsequent stack/unstack/etc. operations will keep the original columns and rows in order.

A great example of processing a real-life sales dataset with a huge MultiIndex can be found here¹⁰.

All in all, Pandas is a great tool for analyzing and processing data. Hopefully this article helped you understand both the ‘hows’ and ‘whys’ of solving typical problems, and to appreciate the true value and beauty of the Pandas library.

Drop me a line (on reddit, hackernews, linkedin, or twitter) if I missed your favorite feature, overlooked a blatant typo, or just if this article proved to be helpful for you!

Acknowledgements
I would like to thank Dr. Irv Lustig from the Pandas development team for reviewing the article and helping me make it better.

The pdi library is still in beta and has not been officially approved by the Pandas development team. It has been thoroughly tested, though (pytest, coverage), and should be safe to use.

References
1. ‘Pandas vs. Polars: A Syntax and Speed Comparison’ by Leonie Monigatti

2. ‘A Gentle Visual Intro to Data Analysis in Python Using Pandas’ by Jay Alammar

3. ‘Pivot — Rows to Columns’, Modern SQL blog

4. ‘A look at Pandas design and development’ by Wes McKinney, NYC Python meetup, 2012

5. ‘Efficient SQL on Pandas with DuckDB’ by Mark Raasveldt and Hannes Mühleisen

6. ‘Pandas Pivot Table Explained’ by Chris Moffitt in ‘Practical Business Python’ blog

7. ‘Pivot tables’ chapter in ‘Python Data Science Handbook’ by Jake VanderPlas

8. ‘Modern Pandas (Part 2): Method Chaining’ by Tom Augspurger

9. ‘The fastest way to read a csv in Pandas’ by Itamar Turner-Trauring in ‘pythonspeed.com’ blog

10. ‘Pandas MultiIndex Tutorial’ by Zax Rosenberg, a blog on GitHub.

License
All rights reserved (=you cannot distribute, alter, translate, etc. without author’s permission).
