Pandas Illustrated: The Definitive Visual Guide To Pandas - by Lev Maximov - Jan, 2023 - Better Programming
A copy or a view? Should I merge or join? And what the heck is MultiIndex?
All images by author
https://betterprogramming.pub/pandas-illustrated-the-definitive-visual-guide-to-pandas-c31fa921a43 Page 1 of 99
Pandas Illustrated: The Definitive Visual Guide to Pandas | by Lev Maximov | Jan, 2023 | Better Programming 1/31/23, 11:42 PM
There are a lot of Pandas guides out there. In this particular one, you’re
expected to have a basic understanding of NumPy. If you don’t, I’d suggest
you skim through the NumPy Illustrated guide to get an idea of what a
NumPy array is, in which ways it is superior to a Python list, and how it helps
avoid loops in elementary operations.
It turns out these features are enough to make Pandas a powerful competitor
to both spreadsheets and databases.
1. Motivation
2. Series and Index
3. DataFrames
4. MultiIndex
Discussions
Hacker News (257 points, 40 comments)
Reddit r/Python (288 points, 29 comments)
Contents
Motivation and Showcase
Pandas Showcase
Pandas Speed
DataFrames
Reading and writing CSV files
Building a DataFrame
Basic operations with DataFrames
Indexing DataFrames
DataFrame arithmetic
Combining DataFrames:
Vertical stacking
Horizontal stacking
Stacking via MultiIndex
Joining DataFrames:
1:1 relationship joins
1:n relationship joins
Multiple joins
Inserts and deletes
Group by
MultiIndex
Visual Grouping
Type conversions
Building DataFrame with MultiIndex
Indexing with MultiIndex
Stacking and unstacking
How to prevent stack/unstack from sorting
Manipulating levels
Converting MultiIndex into flat Index and restoring it back
Sorting MultiIndex
Reading and writing MultiIndexed DataFrames to disk
MultiIndex arithmetic
Part 1. Motivation and Showcase
Suppose you have a file with a million lines of comma-separated values like
this:
Spaces after colons are for illustrative purposes only. Usually, there are none.
And you need to give answers to basic questions like “Which cities have an
area over 450 km² and a population under 10 million” with NumPy.
The brute-force solution of feeding the whole table into a NumPy array is not
a good option: usually, NumPy arrays are homogeneous (all values must be
of the same type), so all fields will be interpreted as strings.
NumPy has structured and record arrays that allow columns of different
types, but they are primarily meant for interfacing with C code. When used
for general purposes, they have the following downsides:
not really intuitive (e.g., you’ll be faced with constants like <f8 and <U8
everywhere);
Pandas Showcase
Consider the following table:
It describes the diverse product line of an online shop with a total of four
distinct products. In contrast with the previous example, it can be
represented with either a NumPy array or a Pandas DataFrame equally well.
Let us look at some common operations with it.
Sorting
Sorting by column is more readable with Pandas, as you can see below:
Sorting by several columns
Suppose we need to sort by the price column, breaking ties using the weight column:
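Here is a minimal sketch of such a sort. The table and its column names (product, price, weight) are made up for illustration and are not the ones from the original image.

```python
import pandas as pd

# Hypothetical product table; names and values are assumptions.
df = pd.DataFrame({
    "product": ["apple", "pear", "plum", "kiwi"],
    "price": [3.0, 2.0, 3.0, 2.0],
    "weight": [150, 180, 60, 90],
})

# Sort by price first; rows with equal prices are ordered by weight.
by_price_then_weight = df.sort_values(["price", "weight"])
print(by_price_then_weight["product"].tolist())
```

Passing a list of column names is all it takes; the order of the names in the list defines the sort priority.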
Adding a column
Adding columns is way better with Pandas, syntactically and architecturally:
Pandas does not need to reallocate memory for the whole array like NumPy;
Fast element search
With NumPy arrays, even if the element you search for is the first one, you’ll
need time proportional to the size of the array to find it. With Pandas,
you can index the column(s) you expect to be queried most often and reduce
search time to a constant.
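As a rough sketch of what indexing a column looks like in practice (the city table below is an assumption made up for this example):

```python
import pandas as pd

# Hypothetical table; names and numbers are for illustration only.
df = pd.DataFrame({
    "name": ["Oslo", "Vienna", "Tokyo"],
    "population": [0.7, 1.9, 14.0],
})

# Move the column we expect to query into the index...
cities = df.set_index("name")

# ...and look rows up by label instead of scanning the whole column.
row = cities.loc["Vienna"]
print(row["population"])
```

The lookup goes through the index rather than through a comparison against every element of the column.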
Joins by column
Pandas join has all the familiar ‘inner,’ ‘left,’ ‘right,’ and ‘full outer’ join
modes.
Grouping by column
Yet another common operation in data analysis is grouping by column(s).
For example, to get the total quantity of each product sold, you can do the
following:
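A minimal sketch of that grouping, assuming a hypothetical orders table with product and quantity columns:

```python
import pandas as pd

# Hypothetical order log; column names are assumptions.
orders = pd.DataFrame({
    "product": ["apple", "pear", "apple", "plum", "pear"],
    "quantity": [2, 1, 3, 5, 6],
})

# Total quantity sold per product.
totals = orders.groupby("product")["quantity"].sum()
print(totals)
```

The result is a Series indexed by product name.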
Pivot tables
One of the most powerful features of Pandas is a “pivot” table. It is
something like projecting multi-dimensional space into a two-dimensional
plane.
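To make the idea a bit more tangible, here is a small sketch with a made-up sales table (product, customer and quantity are assumed names): one column becomes the rows, another becomes the columns, and the values are aggregated in between.

```python
import pandas as pd

# Hypothetical sales records; all names are assumptions.
sales = pd.DataFrame({
    "product": ["apple", "apple", "pear", "pear"],
    "customer": ["Ann", "Bob", "Ann", "Bob"],
    "quantity": [1, 2, 3, 4],
})

# Products go to the rows, customers to the columns, quantities are summed.
table = sales.pivot_table(index="product", columns="customer",
                          values="quantity", aggfunc="sum")
print(table)
```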
In a nutshell, the two main differences between NumPy and Pandas are the
following:
Now, let’s see whether those features come at the cost of a performance hit.
Pandas Speed
I’ve benchmarked NumPy and Pandas on a workload typical for Pandas: 5–
columns; 10³–10⁸ rows; integers and floats. Here are the results for 1 row
and 100 million rows:
Pandas seems to be 30 times slower than NumPy for small arrays (under a
hundred rows) and three times slower for large ones (over a million rows).
Pandas becomes 1.5 times faster than NumPy for arrays with over a million
elements. It is still 15 times slower than NumPy for smaller arrays, but
usually, it does not matter much if the operation is completed in 0.5 ms or
ms — it is fast anyway.
The bottom line is that if you’re 100% sure you have no missing values in your
column(s), it makes sense to use df.column.values.sum() instead of
df.column.sum() to get a 3x to 30x performance boost. In the presence of
missing values, the speed of Pandas is quite decent and even beats NumPy for
huge arrays (over 10⁶ elements).
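The “100% sure” caveat matters because the two calls treat missing values differently. A tiny sketch with an assumed column named x:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan]})

# Pandas skips NaN by default.
pandas_sum = df.x.sum()

# The underlying NumPy array propagates NaN into the result.
numpy_sum = df.x.values.sum()

print(pandas_sum, numpy_sum)
```

So the faster NumPy path is only a drop-in replacement when the column is known to be complete.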
Part 2. Series and Index
Internally, Series stores the values in a plain old NumPy vector. As such, it
inherits its merits (compact memory layout, fast random access) and
demerits (type homogeneity, slow deletions, and insertions). On top of that,
Series allows accessing its values by label using a dict-like structure called
index. Labels can be of any type (commonly strings and time stamps). They
need not be unique, but uniqueness is required to boost the lookup speed
and is assumed in many operations.
As you can see, now every element can be addressed in two alternative ways:
by ‘label’ (=using the index) and by ‘position’ (=not using the index):
Obviously, one pair of square brackets is not enough for this. In particular:
To address those issues, Pandas has two more ‘flavors’ of square brackets:
.loc[] always uses labels and includes both ends of the interval;
.iloc[] always uses positional indices and excludes the right end.
You can see how they support ‘fancy indexing’ (indexing with an array
of integers) in this image:
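The two flavors can be sketched on a small Series (the values and labels below are assumptions chosen for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["b":"c"]        # labels, both ends included
by_position = s.iloc[1:3]        # positions, right end excluded
fancy_pos = s.iloc[[0, 3]]       # fancy indexing with positions
fancy_lab = s.loc[["a", "d"]]    # fancy indexing with labels

print(by_label.tolist(), by_position.tolist())
print(fancy_pos.tolist(), fancy_lab.tolist())
```

Note that the two slices return the same elements even though the right bounds look different: 'c' is included, while position 3 is not.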
The worst thing about Series is its visual representation: for some reason, it
didn’t receive a nice rich-text outlook, so it feels like a second-class citizen in
comparison with a DataFrame:
pip install pandas-illustrated
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(np.zeros(10**6))
>>> s.index
RangeIndex(start=0, stop=1000000, step=1)
>>> s.index.memory_usage()    # in bytes, the same as for Series([0.])
>>> s.drop(1, inplace=True)
>>> s.index
Int64Index([ 0, 2, 3, 4, 5, 6, 7,
...
999993, 999994, 999995, 999996, 999997, 999998, 999999],
dtype='int64', length=999999)
>>> s.index.memory_usage()
7999992
This structure consumes 8 MB of memory! To get rid of it and get back to the
lightweight range-like structure, write
>>> s.reset_index(drop=True, inplace=True)
>>> s.index
RangeIndex(start=0, stop=999999, step=1)
>>> s.index.memory_usage()
If you’re new to Pandas, you might wonder why Pandas didn’t do it on its
own? Well, for non-numeric labels, it is sort of obvious: why (and how)
would Pandas, after deleting a row, relabel all the subsequent rows? For
numeric labels, the answer is a bit more convoluted.
Generally, keeping values in the index unique is a good idea. For example,
you won’t get a lookup speed boost in the presence of duplicate values in the
index. Pandas does not have a ‘unique constraint’ like relational databases
(the feature is still experimental), but it has functions to check if values in
the index are unique and to get rid of duplicates in various ways.
Sometimes, a single column is not enough to uniquely identify the row. For
example, cities of the same name sometimes happen to be found in different
countries or even in different regions of the same country. So (city, state) is a
better candidate for identifying a place than city alone.
ace=True)
Index has a name (in the case of MultiIndex, every level has a name).
Unfortunately, this name is underused in Pandas. Once you have included
the column in the index, you cannot use the convenient df.column_name
notation anymore and have to revert to the less readable df.index or the
more universal df.loc[]. The situation gets worse with MultiIndex. A
prominent exception is df.merge: you can specify the column to merge by
name, no matter if it is an index column or not.
The columns are labeled using just the same Index as the rows, although it
might not be evident from the arguments of the pd.DataFrame constructor.
Finding element by value
Consider the following Series object:
Index provides a fast and convenient way to find a value by label. But how
about finding a label by value?
I’ve written a pair of thin wrappers called find() and findall() that are fast
(as they automatically choose the actual command based on the series size)
and easier to use. Here’s what the code looks like:
>>> import pdi
>>> pdi.find(s, 2)
'penguin'
>>> pdi.findall(s, 4)
Index(['cat', 'dog'], dtype='object')
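For comparison, the same lookup can be sketched in plain Pandas with a boolean mask. This is not how pdi implements it, and the Series below is an assumption: its values were picked only so that the results match the outputs shown above.

```python
import pandas as pd

# Assumed data, chosen to be consistent with the outputs above.
s = pd.Series([4, 2, 4, 6], index=["cat", "penguin", "dog", "butterfly"])

# All labels whose value equals 4.
labels = s[s == 4].index

# First label whose value equals 2 (raises IndexError if there is none).
first = s[s == 2].index[0]

print(list(labels), first)
```

The mask compares every element, so this version is a linear scan regardless of the series size.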
Missing values
Pandas developers took special care about the missing values. Usually, you
receive a dataframe with NaNs by providing a flag to read_csv. Otherwise,
you can use None in the constructor or in an assignment operator (it will
work despite being implemented slightly differently for different data types),
for example:
The first thing you can do with NaNs is understand if you have any. As seen
from the image above, isna() produces a boolean array, and .sum() gives
the total number of missing values.
Now that you know they are there, you can opt to get rid of them all at once
by filling them with a constant value or through interpolation, as shown
below:
fillna() and interpolate()
On the other hand, you can keep using them. Most Pandas functions happily
ignore the missing values:
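The steps above can be sketched on a three-element Series (the numbers are made up for illustration):

```python
import pandas as pd

# None in the constructor becomes NaN in a float Series.
s = pd.Series([1.0, None, 3.0])

n_missing = s.isna().sum()      # count the missing values
filled = s.fillna(0)            # replace them with a constant
interpolated = s.interpolate()  # or fill them by linear interpolation
total = s.sum()                 # aggregations skip NaN by default

print(n_missing, filled.tolist(), interpolated.tolist(), total)
```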
Comparisons
Comparing arrays with missing values might be tricky. Here’s an example:
len(s.compare(s)) == 0
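A short sketch of why the naive comparison fails and what works instead (the Series is an assumption for this example). NaN is not equal to itself, so an element-wise == reports a difference even when a Series is compared with itself:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

elementwise_equal = bool((s == s).all())   # NaN != NaN spoils the result
same_by_equals = s.equals(s)               # treats NaNs in the same place as equal
same_by_compare = len(s.compare(s)) == 0   # compare() lists only real differences

print(elementwise_equal, same_by_equals, same_by_compare)
```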
Appends, inserts, deletions
Although Series objects are supposed to be size-immutable, it is possible to
append, insert, and delete elements in place, but all those operations are:
slow, as they require reallocating memory for the whole object and
updating the index;
painfully inconvenient.
Here’s one way of inserting a value and two ways of deleting the values:
The second method for deleting values (via drop) is slower and can lead to
intricate errors in the presence of non-unique values in the index.
Pandas has the df.insert method, but it can only insert columns (not rows)
into a dataframe (and does not work at all with series).
Another method for appending and inserting is to slice the DataFrame with
iloc, apply the necessary conversions, and then put it back with concat. I’ve
implemented a function called insert that automates the process:
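The slice-and-concat idea can be sketched for a Series as follows. The helper name insert_value is hypothetical and this is not the pdi implementation, just an illustration of the technique.

```python
import pandas as pd

def insert_value(s, pos, value, label):
    """Hypothetical helper: return a new Series with `value` at position `pos`."""
    new_item = pd.Series([value], index=[label])
    # Slice by position, then glue the three pieces back together.
    return pd.concat([s.iloc[:pos], new_item, s.iloc[pos:]])

s = pd.Series([10, 20, 40], index=["a", "b", "d"])
s2 = insert_value(s, 2, 30, "c")
print(s2.tolist(), list(s2.index))
```

The original Series is left untouched; a new object is built, which is why the operation is not cheap for large data.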
To specify the insertion point by label, you can combine pdi.find with
pdi.insert, as shown below:
Statistics
Pandas provides a full spectrum of statistical functions. They can give you an
insight into what is in a million-element Series or DataFrame without
manually scrolling through the data.
, unbiased variance;
Duplicate data
Special care is taken to detect and deal with duplicate data, as you can see in
the image:
is_unique, nunique, value_counts
drop_duplicates and duplicated can keep the last occurrence instead of the
first one.
Missing values are treated as ordinary values, which may sometimes lead to
surprising results.
s.is_monotonic_increasing() ,
s.is_monotonic_decreasing() ,
s._strict_monotonic_increasing() ,
Group by
A common operation in data processing is to calculate some statistics not
over the whole bunch of data but over certain groups thereof. The first step
is to build a lazy object by providing criteria for breaking a series (or a
dataframe) into groups. This lazy object has no meaningful representation,
but it can be:
groupby
All operations exclude NaNs
In this example, we break the series into three groups based on the integer
part of dividing the values by 10. For each group, we request the sum of the
elements, the number of elements, and the average value in each group.
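A sketch of this kind of grouping, with made-up values (not the ones from the image):

```python
import pandas as pd

s = pd.Series([1, 3, 12, 15, 18, 25])

# The group key is the integer part of value / 10: 0, 0, 1, 1, 1, 2.
groups = s.groupby(s // 10)

# Sum, number of elements and average value for each group.
stats = groups.agg(["sum", "count", "mean"])
print(stats)
```

The groupby object itself is lazy; the actual work happens in agg.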
min, median, max, first, nth, last
If these are not enough, you can also pass the data through your own Python
function. It can either be:
In the examples above, the input data is sorted. This is not required for
groupby. Actually, it works equally well if the group elements are not stored
consecutively, so it is closer to collections.defaultdict than to
itertools.groupby. And it always returns an index without duplicates.
The docs warn that those usages can be slower than the corresponding
transform and agg methods, so take care.
Part 3. DataFrames
Reading and writing CSV files
A common way to construct a DataFrame is by reading a CSV (comma-separated
values) file, as this image shows:
Since CSV does not have a strict specification, sometimes it takes a bit of trial
and error to read it correctly. What is cool about read_csv is that it
automatically detects a lot of things, including:
representation of booleans
As with any automation, you’d better make sure it has done the right thing. If
the results of simply writing df in a Jupyter cell happen to be too lengthy (or
too incomplete), you can try the following:
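One possible way to double-check what read_csv has detected is sketched below. The CSV text and its column names are made up for this example, and the calls are a suggestion rather than the list from the original article.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file (contents are assumptions).
csv_text = "name,population,capital\nOslo,0.7,True\nVienna,1.9,True\nGraz,0.3,False\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first rows only
print(df.dtypes)    # detected type of every column
df.info()           # types, non-null counts and memory usage in one report
```

Here the capital column is detected as boolean and population as float without any hints.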
Building a DataFrame
Another option is to construct a dataframe from data already stored in
memory. Its constructor is so extraordinarily omnivorous that it can convert
(or wrap!) just any kind of data you feed into it:
In the first case, in the absence of row labels, Pandas labeled the rows with
consecutive integers. In the second case, it did the same to both rows and
columns. It is always a good idea to provide Pandas with names of columns
instead of integer labels (using the columns argument) and sometimes
names of rows (using the index argument, though rows might sound more
intuitive). This image will help:
='city_name')).
Note how the population values got converted to floats in the second case.
Actually, it happened earlier, during the construction of the NumPy array.
Another thing to note here is that constructing a dataframe from a 2D
NumPy array is a view by default. That means that changing values in the
original array changes the dataframe and vice versa. Plus, it saves memory.
This mode can be enabled in the first case (a dict of NumPy vectors), too, by
setting copy=False. It is very fragile, though. Simple operations can turn it
into a copy without notice.
from a list of dicts (where each dict represents a single row, its keys are
column names, and its values are the corresponding cell values)
If you register streaming data ‘on the fly,’ your best bet is to use a dict of lists
or a list of lists because Python transparently preallocates space at the end of
a list so that the appends are fast. Neither NumPy arrays nor Pandas
dataframes do it. Another possibility (if you know the number of rows
beforehand) is to manually preallocate memory with something like
DataFrame(np.zeros).
Basic operations with DataFrames
The best thing about DataFrame (in my opinion) is that you can:
Note that when creating a new column, square brackets are mandatory even
if its name contains no spaces.
Indexing DataFrames
As we’ve already seen in the Series section, ordinary square brackets are
simply not enough to fulfill all the indexing needs. You can’t access rows by
labels, can’t access disjoint rows by positional index, you can’t even
reference a single cell, since df['x', 'y'] is reserved for MultiIndex!
To meet those needs, dataframes, just like series, have two alternative
indexing modes: loc for indexing by labels and iloc for indexing by
positional index.
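Here is how those three needs could be met with loc and iloc, sketched on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])

rows_by_label = df.loc["a":"b"]    # rows by labels, both ends included
disjoint_rows = df.iloc[[0, 2]]    # disjoint rows by position
cell_by_label = df.loc["b", "B"]   # a single cell by row and column labels
cell_by_pos = df.iloc[1, 1]        # the same cell by positions

print(cell_by_label, cell_by_pos)
```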
In the last case, the value will only be set on a copy of a slice and will not be
reflected in the original df (a warning will be displayed accordingly).
You have made the copy intentionally and want to work on that copy:
df1 = df.loc['a':'b']; df1['A']=10 # SettingWithCopy warning
returned a Series. To get a scalar value out of it, you can either use:
s.iloc[0] that will only raise an exception when nothing is found; also, it
is the only one that supports assignments: df[…].iloc[0] = 100, but
surely you don’t need it when you want to modify all matches: df[…] =
df.query('name=="Vienna"')
They are shorter, work great with the MultiIndex, and comparison operators take
precedence over logical operators (so fewer parentheses are required), but
they can only filter by rows, and you can’t modify the DataFrame through
them.
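The point about parentheses can be seen by writing the same filter both ways. The table below is an assumption made up for this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Oslo", "Vienna", "Tokyo"],
    "area": [454, 415, 2194],
    "population": [0.7, 1.9, 14.0],
})

# Boolean indexing: each comparison needs its own parentheses.
by_mask = df[(df["area"] > 450) & (df["population"] < 10)]

# query(): comparisons are evaluated before &, so no parentheses are needed.
by_query = df.query("area > 450 & population < 10")

print(by_query["name"].tolist())
```

Both expressions select the same rows; the query string is just easier on the eyes.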
Several third-party libraries allow you to use SQL syntax to query the
DataFrames directly (duckdb) or indirectly by copying the dataframe to
SQLite and wrapping the results back into Pandas objects (pandasql).
Unsurprisingly, the direct method is faster⁵.
DataFrame arithmetic
You can apply ordinary operations like add, subtract, multiply, divide,
modulo, power, etc., to dataframes, series, and combinations thereof.
All arithmetic operations are aligned against the row and column labels:
In mixed operations between DataFrames and Series, the Series (God knows
why) behaves (and broadcasts) like a row-vector and is aligned accordingly:
Probably to keep in line with lists and 1D NumPy vectors (which are not
aligned by labels and are expected to be sized as if the DataFrame was a
simple 2D NumPy array):
add, sub, mul, div, mod, pow, floordiv
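A small sketch of the row-vector behavior, and of the method form that lets you align a Series with the rows instead (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# With an operator, the Series labels are matched against the column names.
row = pd.Series({"a": 100, "b": 200})
row_wise = df + row

# With the method form and axis=0, the Series labels are matched
# against the row labels instead.
col = pd.Series([1000, 2000])
col_wise = df.add(col, axis=0)

print(row_wise)
print(col_wise)
```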
Combining DataFrames
Pandas has three functions, concat (an abbreviation of concatenate), merge,
and join, that are doing the same thing: combining information from
several dataframes into one. But each of them does it slightly differently, as
they are tailored for different use cases.
Vertical stacking
This is probably the simplest way to combine two or more dataframes into
one: you take the rows from the first one and append the rows from the
second one to the bottom. To make it work, those two dataframes need to
have (roughly) the same columns. This is similar to vstack in NumPy, as you
can see in the image:
Having duplicate values in the index is bad. You can run into various kinds of
problems (see ‘drop’ example below). Even if you don’t care about the index,
try to avoid having duplicate values in it:
or use the keys argument to resolve the ambiguity with MultiIndex (see
below).
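A sketch of the problem and of two possible ways around it. The keys option is the one mentioned above; ignore_index is another concat argument that simply renumbers the rows. The dataframes are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})

plain = pd.concat([df1, df2])                           # index: 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)   # index: 0, 1, 2, 3
keyed = pd.concat([df1, df2], keys=["first", "second"]) # two-level MultiIndex

print(plain.index.tolist())
print(renumbered.index.tolist())
print(keyed.index.tolist())
```

With keys, every row stays addressable as (source, original label), for example keyed.loc[("second", 0)].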
Horizontal stacking
concat can also perform ‘horizontal’ stacking (similar to hstack in NumPy):
Stacking via MultiIndex
If both row and column labels coincide, concat lets you do a MultiIndex
equivalent of vertical stacking (like dstack in NumPy):
If the row and/or the columns partially overlap, Pandas will align the names
accordingly, and that’s most probably not what you want:
In general, if the labels overlap, it means that the dataframes are somehow
related to each other, and the relations between entities are best described
using the terminology of relational databases.
1:1 relationship joins
If the column is already in the index, you can use join (which is just an alias
of merge with left_index or right_index set to True and different defaults).
As you can see from this simplified case (see ‘full outer join’ above), Pandas
is pretty relaxed about keeping the row order compared to relational
databases. Left and right outer joins tend to be more predictable than inner
and outer joins (at least, until there are duplicate values in the column to be
merged). So, if you want a guaranteed row order, you’ll have to sort the
results explicitly, or use CategoricalIndex (pdi.lock can help you with it).
1:n relationship joins
Just like 1:1 relationships, to join a pair of 1:n related tables in Pandas, you
have two options. If the column to be merged on is not in the index, and
you’re ok with discarding anything that happens to be in the index of both
tables, use merge, for example:
[Figure: merge() performs inner join by default]

As we’ve seen already, merge keeps row order less rigorously than, say,
Postgres. The “preserve key order” statement from the docs only applies to
left_index=True and/or right_index=True (that is what join is an alias for)
and only in the absence of duplicate values in the column to be merged on.
That’s why merge and join have a sort argument.
[Figure: join() does left outer join by default]

This time Pandas kept both the index values of the left dataframe and the
order of the rows intact.

Note: Be careful, if the second table has duplicate index values, you’ll end up with
duplicate index values in the result, even if the left table index is unique!
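The caveat about duplicate index values is easy to reproduce. In this sketch `left` and `right` are hypothetical frames; only `right` has a repeated label:

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])            # unique index
right = pd.DataFrame({"y": [10, 20, 30]}, index=["a", "a", "b"])  # "a" twice

joined = left.join(right)

# The row of `left` labelled "a" is repeated once per match in `right`.
print(joined)
print(joined.index.is_unique)
```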
Sometimes, joined dataframes have columns with the same name. Both merge
and join have a way to resolve the ambiguity, but the syntax is slightly
different (also, by default, merge will resolve it with '_x', '_y' while join
will raise an exception).
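The difference in syntax can be sketched like this, assuming two hypothetical frames that share a `value` column. The suffix strings are arbitrary:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
df2 = pd.DataFrame({"key": ["a", "b"], "value": [3, 4]})

# merge: a single `suffixes` tuple, ('_x', '_y') unless told otherwise.
default = df1.merge(df2, on="key")
merged = df1.merge(df2, on="key", suffixes=("_left", "_right"))

# join: separate `lsuffix` / `rsuffix` arguments, both empty by default,
# so overlapping names have to be resolved explicitly.
joined = df1.set_index("key").join(
    df2.set_index("key"), lsuffix="_left", rsuffix="_right")
```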
To summarize:
merge discards the index of the left DataFrame, join keeps it;
by default, merge performs an inner join, join does a left outer join;
merge does not keep the order of the rows, join keeps them (some
restrictions apply).
Multiple joins

As discussed above, when join is run against two dataframes, e.g.
df.join(df1), it acts as an alias to merge. But join also has a ‘multiple join’
mode, which, in its turn, is an alias for concat(axis=1).
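A small sketch of the ‘multiple join’ mode, assuming three hypothetical frames that share the same unique index:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
df1 = pd.DataFrame({"b": [3, 4]}, index=["x", "y"])
df2 = pd.DataFrame({"c": [5, 6]}, index=["x", "y"])

# Passing a list switches join into the multiple-join mode...
multi = df.join([df1, df2])

# ...which gives the same result as horizontal concatenation here.
same = pd.concat([df, df1, df2], axis=1)
```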
So multiple 1:n relationships are supposed to be joined one by one. The repo
‘pandas-illustrated’ has a helper for that, too, as you can see below:

pdi.join is a simple wrapper over join that accepts lists in on, how and
suffixes arguments so that you could make several joins in one command.
Just like with the original join, on columns pertain to the first DataFrame,
and the other DataFrames are joined against their indices.
Inserts and deletes

Since a DataFrame is a collection of columns, it is easier to apply these
operations to the columns than to the rows. For example, inserting a column
is always done in-place, while inserting a row always results in a new
DataFrame, as shown below:

Deleting columns is usually worry-free, except that del df['D'] works while
del df.D doesn’t (limitation on the Python level).
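The asymmetry between columns and rows can be sketched as follows. The frame and its values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

df["D"] = df["A"] + df["B"]   # adds a column in place
df.insert(1, "C", [5, 6])     # also in place, at position 1

# Adding a row builds a new DataFrame; `df` itself is left untouched.
new_row = pd.DataFrame({"A": [0], "C": [0], "B": [0], "D": [0]})
df2 = pd.concat([df, new_row], ignore_index=True)

del df["D"]                   # works; the attribute form `del df.D` does not
```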
Deleting rows with drop is surprisingly slow and can lead to intricate bugs if
the raw labels are not unique. The image below will help explain the
concept:
In this case, setting the name column as an index would help. But for more
complicated filters, it wouldn’t.

Yet another solution that is fast, universal, and even works with duplicate
row names is indexing instead of deletion. I’ve written a (one-line-long)
automation to avoid explicitly negating the condition.
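With plain pandas, "indexing instead of deletion" amounts to keeping the rows that do not match a condition. This sketch uses a hypothetical frame and spells out the negation by hand; it is not the author's helper:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"], "age": [30, 25, 41]})

# Keep everything that does NOT match, instead of dropping what matches.
# Duplicate names are handled without any special care.
remaining = df[~(df["name"] == "Alice")]
```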
Group by

This operation has already been described in detail in the Series section. But
DataFrame’s groupby has a couple of specific tricks on top of that.

First, you can specify the column to group by using just a name, as the image
below shows:

Usually, there are more columns in the DataFrame than you want to see in the
result. By default, Pandas sums anything remotely summable, so you’ll have
to narrow your choice, as shown below:
Note that when summing over a single column, you’ll get a Series instead of
a DataFrame. If, for some reason, you want a DataFrame, you can:
df.groupby('product', as_index=False)['quantity'].sum() or
df.groupby('product')['quantity'].sum().reset_index()

But despite the unusual appearance, in many cases a Series behaves just like
a DataFrame, so maybe a ‘facelift’ of pdi.patch_series_repr() would be
enough.
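The three variants side by side, on a hypothetical sales table (column names follow the snippets above, values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["apples", "bananas", "apples"],
    "quantity": [3, 5, 4],
    "price": [1.0, 0.5, 1.2],
})

s = df.groupby("product")["quantity"].sum()                   # Series
d1 = df.groupby("product", as_index=False)["quantity"].sum()   # DataFrame
d2 = df.groupby("product")["quantity"].sum().reset_index()     # DataFrame
```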
Sometimes, the predefined functions are not good enough to produce the
desired result.

To access the value of the group by column from the custom function, it was
included in the index beforehand.
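A sketch of that trick under stated assumptions: the table, the function `mean_price` and the 50% rule for bananas are all hypothetical. The point is only that a one-column aggregation function receives a Series, so the group key is reachable through its index:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["apples", "bananas", "apples", "bananas"],
    "price": [2.0, 1.0, 4.0, 3.0],
})

def mean_price(s):
    # `s` holds the prices of one group; the product name is available
    # because it was moved into the index before grouping.
    discount = 0.5 if s.index[0] == "bananas" else 1.0  # hypothetical rule
    return s.mean() * discount

result = (df.set_index("product")
            .groupby(level="product")["price"]
            .agg(mean_price))
```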
Pivoting and ‘unpivoting’

Suppose you have a variable a that depends on two parameters i and j.

As a less abstract example, consider the following table with the sales data.
The clients have bought the designated quantity of two kinds of products.
Initially, this data is in the ‘long format.’ To convert it to the ‘wide format’, use
df.pivot:
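A minimal sketch of the long-to-wide conversion, assuming a hypothetical sales table with `client`, `product` and `quantity` columns:

```python
import pandas as pd

long_df = pd.DataFrame({
    "client": ["John", "John", "Silvia", "Silvia"],
    "product": ["apples", "bananas", "apples", "bananas"],
    "quantity": [3, 5, 4, 2],
})

# Clients go to the index, products to the columns, quantities to the body.
wide = long_df.pivot(index="client", columns="product", values="quantity")
```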
As for the reverse operation, you can use stack. It merges index and
columns into the MultiIndex:
[Figure: reset_index]

pivot loses the information about the name of the ‘body’ of the result, so
with both stack and melt we have to ‘remind’ Pandas about the name of the
‘quantity’ column.
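Both ways back to the long format can be sketched as follows. The sales table is hypothetical, and the name `quantity` has to be supplied again in either case:

```python
import pandas as pd

long_df = pd.DataFrame({
    "client": ["John", "John", "Silvia", "Silvia"],
    "product": ["apples", "bananas", "apples", "bananas"],
    "quantity": [3, 5, 4, 2],
})
wide = long_df.pivot(index="client", columns="product", values="quantity")

# stack: index and columns are merged into a MultiIndex, then flattened.
back = wide.stack().rename("quantity").reset_index()

# melt: same data, with the rows in a different order.
melted = wide.reset_index().melt(
    id_vars="client", var_name="product", value_name="quantity")
```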
In the example above, all the values are present, but it is not a must:

When there are no duplicate rows to group by, pivot_table works just like pivot.
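For the case with duplicates and missing combinations, here is a short sketch on a hypothetical table where one (client, product) pair occurs twice and another is absent:

```python
import pandas as pd

df = pd.DataFrame({
    "client": ["John", "John", "John", "Silvia"],
    "product": ["apples", "apples", "bananas", "apples"],
    "quantity": [3, 1, 5, 4],
})

# Duplicate (client, product) pairs are aggregated before pivoting;
# combinations that never occur are left as NaN.
table = df.pivot_table(index="client", columns="product",
                       values="quantity", aggfunc="sum")
```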
Pivot tables are especially handy when used with MultiIndex. We’ve seen lots
of examples where Pandas functions return a multi-indexed DataFrame.
Let’s have a closer look at it.
Part 4. MultiIndex

You can either specify the columns to be included in the index after the
DataFrame is parsed from CSV or right away as an argument to read_csv.
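Both routes, sketched on a tiny in-memory CSV (the data and the column names `state`, `city`, `population` are hypothetical):

```python
import io
import pandas as pd

csv = io.StringIO(
    "state,city,population\nOregon,Portland,652\nTexas,Austin,961\n")

# Option 1: parse first, then move the columns into the index.
df1 = pd.read_csv(csv).set_index(["state", "city"])

# Option 2: name the index columns right away.
csv.seek(0)
df2 = pd.read_csv(csv, index_col=["state", "city"])
```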
Another use case is multidimensional data, for example, the ‘Titanic’ dataset.
This is also known as ‘Panel data,’ and Pandas owes its name to it.
[Figure: rename_axis]

Grouping

The first thing to note about MultiIndex is that it does not group anything as
it might appear. Internally, it is just a flat sequence of labels, as you can see
below:

You can get the same groupby effect for row labels by just sorting them:
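A quick way to see the flat structure, assuming a hypothetical frame indexed by state and city:

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [652, 961, 175]},
    index=pd.MultiIndex.from_tuples(
        [("Oregon", "Portland"), ("Texas", "Austin"), ("Oregon", "Salem")],
        names=["state", "city"]),
)

# The index is just a sequence of (state, city) tuples in row order.
print(list(df.index))

# Sorting places rows with the same outer label next to each other.
grouped = df.sort_index()
```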
[Figure: sort_index]

Type conversions

Pandas (as well as Python itself) makes a difference between numbers and
strings, so it is usually a good idea to convert strings to numbers in case the
datatype was not detected automatically:

If you’re feeling adventurous, you can do the same with standard tools:

But to use them properly, you need to understand what ‘levels’ and ‘codes’
are, whereas pdi allows you to work with MultiIndex as if the levels were
ordinary lists or NumPy arrays.
If you really wonder, ‘levels’ and ‘codes’ are something that a regular list of
labels from a certain level are broken into to speed up operations like pivot,
join and so on:
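A sketch of what levels and codes look like, and of a level type conversion done with standard pandas tools only. The index `mi` is hypothetical, with years stored as strings:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([("b", "2020"), ("a", "2010"), ("b", "2010")])

print(mi.levels)   # the unique labels of every level
print(mi.codes)    # per-row positions into those unique labels

# The flat list of labels of level 0 is recovered as levels[codes].
level0 = mi.levels[0].take(mi.codes[0])

# A type conversion only has to touch the short list of unique labels.
converted = mi.set_levels(mi.levels[1].astype(int), level=1)
```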
Building a DataFrame with a MultiIndex

In addition to reading from CSV files and building from the existing
columns, there are some more methods to create a MultiIndex. They are less
commonly used, mostly for testing and debugging.
[Figure: rename_axis]

[Figure: from_arrays, from_tuples]

When the levels form a regular structure, you can specify the key elements
and let Pandas interleave them automatically, as shown below:

[Figure: from_product]
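The three constructors named in the figures, sketched with hypothetical labels:

```python
import pandas as pd

# One array per level.
a = pd.MultiIndex.from_arrays(
    [["Oregon", "Oregon", "Texas"], ["Portland", "Salem", "Austin"]],
    names=["state", "city"])

# One tuple per row.
b = pd.MultiIndex.from_tuples(
    [("Oregon", "Portland"), ("Oregon", "Salem"), ("Texas", "Austin")],
    names=["state", "city"])

# Regular structure: every combination of the key elements.
c = pd.MultiIndex.from_product([["population", "area"], [2010, 2020]])
```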
Indexing with MultiIndex

The good thing about accessing DataFrame via the MultiIndex is that you can
easily reference all levels at once (potentially omitting the inner levels) with
a nice and familiar syntax.
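A sketch of that familiar syntax on a hypothetical frame indexed by (state, city):

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [652, 175, 961], "area": [376, 126, 828]},
    index=pd.MultiIndex.from_tuples(
        [("Oregon", "Portland"), ("Oregon", "Salem"), ("Texas", "Austin")],
        names=["state", "city"]),
)

oregon = df.loc["Oregon"]                    # inner level omitted
portland = df.loc[("Oregon", "Portland")]    # all levels at once
cell = df.loc[("Oregon", "Salem"), "area"]   # row key plus column label
```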
Now, what if you want to select all cities in Oregon or leave only the columns
with population? Python syntax imposes two limitations here:
1. There is no way of telling df['a', 'b'] from df[('a', 'b')]: both pass the
same tuple to the indexer.
2. Python only allows colons inside square brackets, not inside parentheses,
so you can’t write df.loc[(:, 'Oregon'), :].
[Figure caption: Warning! Not a valid Pandas syntax! Only works after pdi.patch_mi_co()]

The only downside of this syntax is that when you use both indexers, it
returns a copy, so you can’t write df.mi[:, 'Oregon'].co['population'] = 10.

You can swap inner layers with outer layers and use the brackets.
[Figure: swaplevel]

This feels hacky and is not convenient for more than two levels.

You can learn how to use slice instead of a colon. If you know that
a[3:10:2] == a[slice(3,10,2)] then you might understand the following, too:
df.loc[:, (slice(None), 'population')], but it is barely readable anyway. It
can select rows and columns at the same time. Writable.
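Both alternatives can be sketched on a hypothetical one-row frame with (city, measure) columns:

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["Austin", "Portland"], ["area", "population"]])
df = pd.DataFrame([[828, 961, 376, 652]], columns=cols, index=[2020])

# Swap the levels so that the inner one can be reached with brackets.
by_swap = df.swaplevel(axis=1)["population"]

# Or spell the colon out as slice(None) inside the tuple.
by_slice = df.loc[:, (slice(None), "population")]

# The .loc form is writable.
df.loc[:, (slice(None), "population")] = 0
```

Note that the swapped variant drops the selected level from the columns, while the `.loc` variant keeps both levels.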
A mini-language for the .query method (it is the only one that is capable
of doing ‘or’s, not only ‘and’s):
df.query('state=="Oregon" or city=="Portland"').
It is convenient and fast, but lacks support from IDE (no autocompletion, no
syntax highlighting, etc.), and it only filters the rows, not the columns. That
means you can’t implement df[:, 'population'] with it, without transposing
the DataFrame (which will lose the types unless all the columns are of the
same type). Non-writable.
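A small sketch of such a query, assuming a hypothetical frame whose index levels are named `state` and `city` (level names can be used in the expression like column names):

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [652, 175, 961, 1304]},
    index=pd.MultiIndex.from_tuples(
        [("Oregon", "Portland"), ("Oregon", "Salem"),
         ("Texas", "Austin"), ("Texas", "Dallas")],
        names=["state", "city"]),
)

# An 'or' across two different index levels.
result = df.query('state == "Oregon" or city == "Austin"')
```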
Stacking and unstacking

Pandas does not have set_index for columns. A common way of adding
levels to columns is to ‘unstack’ existing levels from the index:
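A minimal sketch, assuming a hypothetical frame indexed by (city, year):

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [600, 652, 800, 961]},
    index=pd.MultiIndex.from_product(
        [["Portland", "Austin"], [2010, 2020]], names=["city", "year"]),
)

wide = df.unstack("year")   # 'year' moves from the index to the columns
tall = wide.stack("year")   # and back again
```

Depending on the pandas version, `stack` may emit a FutureWarning about its implementation; the values are not affected in this example.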
[Figure: stack, unstack]

Pandas’ stack is very different from NumPy’s stack. Let’s see what the
documentation says about the naming conventions:
“The function is named by analogy with a collection of books being reorganized
from being side by side on a horizontal position (the columns of the dataframe)
to being stacked vertically on top of each other (in the index of the dataframe).”
The ‘on top’ part does not sound really convincing to me, but at least this
explanation helps memorize which one moves things which way. By the way,
Series has unstack, but does not have stack because it is ‘stacked already.’
Being one-dimensional, Series can act as either row-vector or column-vector
in different situations but are normally thought of as column vectors (e.g.,
dataframe columns).

For example:
How to prevent stack/unstack from sorting

Both stack and unstack have a bad habit of unpredictably sorting the result’s
index lexicographically. It might be irritating at times, but it is the only way
to give predictable results when there are a lot of missing values.

Consider the following example. In which order would you expect the days
of the week to appear in the right table?
You could speculate that if John’s Monday stands to the left of John’s Friday,
then 'Mon' < 'Fri', and similarly, 'Fri' < 'Sun' for Silvia, so the result
should be 'Mon' < 'Fri' < 'Sun'. This is legitimate, but what if the
remaining columns are in a different order, say, 'Mon' < 'Fri' and 'Tue' <
'Fri'? Then the relative order of 'Mon' and 'Tue' is undefined.

There aren’t so many days of the week out there, and Pandas could
deduce the order based on prior knowledge. But mankind has not arrived at
a decisive conclusion on whether Sunday should stand at the end of the week
or the beginning. Which order should Pandas use by default? Read regional
settings? And what about less trivial sequences, say, the order of the states?
For example, to tell Pandas to lock the order of, say, simple Index holding
the products (which will inevitably get sorted if you decide to unstack days of
the week back to columns), you need to write something as horrendous as
df.index = pd.CategoricalIndex(df.index, df.index, ordered=True). And it is
much more contrived for MultiIndex.
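The effect of a categorical level can be sketched with plain pandas (this is not the pdi helper; the table and the chosen order of days are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "John", "Silvia", "Silvia"],
    "day": ["Mon", "Fri", "Fri", "Sun"],
    "hours": [8, 6, 7, 3],
})

# Plain string labels: unstack orders the new columns alphabetically.
plain = df.set_index(["name", "day"])["hours"].unstack()

# Categorical labels carry their own order, which survives unstacking.
df["day"] = pd.Categorical(
    df["day"], categories=["Mon", "Fri", "Sun"], ordered=True)
locked = df.set_index(["name", "day"])["hours"].unstack()
```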
The pdi library has a helper function locked (and an alias lock having
inplace=True by default) for locking the order of a certain MultiIndex level
by promoting the level to the CategoricalIndex:
Manipulating levels

In addition to the already mentioned methods, there are some more:
axis=None, where None means ‘columns’ for a DataFrame and ‘index’ for a
Series (aka the ‘info’ axis);
sort=False, which optionally sorts the corresponding MultiIndex after the
manipulations;
[Figure: rename]

As for renaming the levels, their names are stored in the field .names. This
field does not support direct assignments (why not?):
df.index.names[1] = 'x' # TypeError

[Figure: set_names, rename_axis]
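A sketch of the failing assignment and of three working alternatives in standard pandas. The frame and the new level names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [652, 961]},
    index=pd.MultiIndex.from_tuples(
        [("Oregon", "Portland"), ("Texas", "Austin")],
        names=["state", "city"]),
)

try:
    df.index.names[1] = "x"      # element-wise assignment is rejected
except TypeError as err:
    print("TypeError:", err)

df.index.names = ["state", "x"]                  # replace the whole list
df.index = df.index.set_names("town", level=1)   # rename a single level
df = df.rename_axis(index={"state": "region"})   # map old names to new ones
```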
Converting MultiIndex into a flat Index and restoring it

As we’ve seen from above, the convenient query method only solves the
complexity of dealing with MultiIndex in the rows. And despite all the helper
functions, when some tricky Pandas function returns a MultiIndex in the
columns, it has a shock effect for beginners. So, the pdi library has the
following:

Join levels:

Split levels:
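The same idea can be approximated with plain pandas, without the pdi helpers. This is a sketch under the assumption that all labels are strings and that the separator `_` does not occur inside any label:

```python
import pandas as pd

cols = pd.MultiIndex.from_product([["population", "area"], ["2010", "2020"]])
df = pd.DataFrame([[1, 2, 3, 4]], columns=cols)

# Join levels: one flat string label per column.
flat = df.copy()
flat.columns = ["_".join(col) for col in df.columns]

# Split levels: rebuild the MultiIndex from the flat labels.
restored = flat.copy()
restored.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split("_")) for name in flat.columns])
```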
Sorting MultiIndex

Since MultiIndex consists of several levels, sorting is a bit more contrived
than for a single Index. It can still be done with the sort_index method, but
it could be further fine-tuned with the following arguments:
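Two of the sort_index arguments, `level` and `ascending`, sketched on a hypothetical frame indexed by (state, city):

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [652, 961, 175]},
    index=pd.MultiIndex.from_tuples(
        [("Oregon", "Portland"), ("Texas", "Austin"), ("Oregon", "Salem")],
        names=["state", "city"]),
)

# Sort by a specific level first.
by_city = df.sort_index(level="city")

# One direction per level: states ascending, cities descending.
mixed = df.sort_index(ascending=[True, False])
```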
Reading and writing MultiIndexed DataFrames to disk

Pandas can write a DataFrame with a MultiIndex into a CSV file in a fully
automated manner: df.to_csv('df.csv'). However, when reading such a file,
Pandas cannot parse the MultiIndex automatically and needs some hints
from the user. For example, to read a DataFrame with three-level-high
columns and a four-level-wide index, you need to specify
pd.read_csv('df.csv', header=[0,1,2], index_col=[0,1,2,3]).

This means that the first three lines contain the information about the
columns, and the first four fields in each of the subsequent lines contain the
index levels (if there’s more than one level in the columns, you can’t
reference row levels by names in read_csv, only by numbers).
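A scaled-down round trip (two column levels and two index levels instead of three and four), sketched on a hypothetical frame and written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.MultiIndex.from_tuples(
        [("Oregon", "Portland"), ("Texas", "Austin")],
        names=["state", "city"]),
    columns=pd.MultiIndex.from_product(
        [["population", "area"], ["urban", "rural"]]),
)

buffer = io.StringIO()
df.to_csv(buffer)     # writing needs no hints
buffer.seek(0)

# Reading does: two header lines for the column levels,
# two leading fields per line for the index levels.
restored = pd.read_csv(buffer, header=[0, 1], index_col=[0, 1])
```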
If you need a fire-and-forget solution, you might want to look into the binary
formats, such as Python pickle format:
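A minimal pickle round trip, assuming a hypothetical frame with a MultiIndex on both axes and a temporary directory for the file:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.MultiIndex.from_tuples(
        [("Oregon", "Portland"), ("Texas", "Austin")],
        names=["state", "city"]),
    columns=pd.MultiIndex.from_product(
        [["population", "area"], [2010, 2020]]),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df.pkl")
    df.to_pickle(path)                 # no hints on writing
    restored = pd.read_pickle(path)    # and none on reading
```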
(the %store magic of IPython stores in $HOME/.ipython/profile_default/db/autorestore)

Python pickle is small and fast, but it is only accessible from Python. If you
need interoperability with other ecosystems, look into more standard
formats such as Excel format (which requires the same hints as read_csv)
or Parquet:
df.to_parquet('df.parquet')
df1 = pd.read_parquet('df.parquet')
MultiIndex arithmetic

In operations where a multi-indexed dataframe is used as a whole, the same
rules as for ordinary dataframes apply (see Part 3). But dealing with a subset
of cells has some peculiarities of its own.
But unfortunately, you can’t assign the result to the original dataframe with
df.assign.

One approach is to stack all the irrelevant levels of the column index into the
row index, perform the necessary calculations, and unstack them back (use
pdi.lock to keep the original order of columns).
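The stack, compute, unstack approach can be sketched as follows. The frame, its numbers and the derived `density` column are hypothetical, and the order of the resulting columns is not locked here:

```python
import pandas as pd

df = pd.DataFrame(
    [[600, 652, 376, 376], [800, 961, 828, 828]],
    index=["Portland", "Austin"],
    columns=pd.MultiIndex.from_product(
        [["population", "area"], [2010, 2020]], names=[None, "year"]),
)

tall = df.stack("year")                               # year: columns -> rows
tall["density"] = tall["population"] / tall["area"]   # ordinary column math
result = tall.unstack("year")                         # year: rows -> columns
```

Depending on the pandas version, `stack` may emit a FutureWarning and the column order of `result` may differ; the computed values are the same.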
All in all, Pandas is a great tool for analyzing and processing data. Hopefully
this article helped you understand both ‘hows’ and ‘whys’ of solving typical
problems, and to appreciate the true value and beauty of the Pandas library.
Acknowledgements

I would like to thank Dr. Irv Lustig from the Pandas development team for
reviewing the article and helping me make it better.

The pdi library is still in beta and has not been officially approved by the
Pandas development team. It has been thoroughly tested, though (pytest,
coverage), and should be safe to use.