Pandas
Introduction to Pandas
Pandas (https://pandas.pydata.org) is an open source library providing high-performance, easy-to-
use data structures and data analysis tools for the Python programming language. Today, pandas
is actively supported by a community of like-minded individuals around the world who contribute
their valuable time and energy to help make open source pandas possible.
In this section of the course, we will learn to use pandas for data analysis. If you have never used
pandas, you can think of pandas as an extremely powerful version of Excel with a lot more
features.
We will cover the following key concepts in this section (Pandas Essentials) of the course:
Series
DataFrame
Indexing and Selection
Hierarchical Indexing
Data Cleaning, Preparation and Handling the Missing Data
Data Wrangling: Merging/Joining, Combining/Concatenation
Data Aggregation and GroupBy
Several other useful methods and operations, and much more; at the end, two full data
analysis exercises to practice the skills.
Series:
A Series is a one-dimensional array-like object which contains values and an array of labels
associated with the values. A Series can be indexed using labels. What differentiates the NumPy
array from a Series is that a Series can have axis labels, meaning it can be indexed by a label
instead of just a number location. (A Series is similar to a NumPy array -- actually, it is built on top of
the NumPy array object.) A Series can hold any arbitrary Python object.
It has to be remembered that, unlike Python lists, a Series will always contain data of the same
type.
In [1]: # first things first, we need to import NumPy and pandas
# np and pd are the conventional aliases for NumPy and pandas
# pandas documentation: pd.<TAB> and pd?
import numpy as np
import pandas as pd
We can use pd.Series (with a capital S) to convert a Python list object to a pandas Series.
☞ If you press <Shift+Tab>, you see Series takes a wide variety of parameters; at the moment
we will focus on data and index. Let's consider data only and see how it works!
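The input cell below was lost in extraction; a minimal sketch that reproduces the output (the variable name my_data is an assumption):
In [3]: my_data = [100, 200, 300]  # a plain Python list (assumed name)
pd.Series(data=my_data)            # pandas generates a default integer index 0, 1, 2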
Out[3]: 0 100
1 200
2 300
dtype: int64
Column "0 1 2" is automatically generated index for the elements in series with data 100 200 and
300. We can specify index values and call the data points using these indexes.
Let's pass "my_labels" to the Series as index.
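The input cell was lost; presumably (the label list construction is assumed):
In [4]: my_labels = ['x', 'y', 'z']  # assumed variable name
pd.Series(data=my_data, index=my_labels)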
Out[4]: x 100
y 200
z 300
dtype: int64
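The next input cell was also lost; the int32 dtype in the output suggests a NumPy integer array as the data (a sketch, names assumed):
In [5]: arr = np.array([100, 200, 300])  # NumPy integer array; int32 on Windows
pd.Series(arr)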
Out[5]: 0 100
1 200
2 300
dtype: int32
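Presumably the same array with the labels supplied as the index:
In [6]: pd.Series(arr, my_labels)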
Out[6]: x 100
y 200
z 300
dtype: int32
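The int64 output with x/y/z labels matches a Series built from a dictionary (a sketch of the lost cell):
In [7]: d = {'x': 100, 'y': 200, 'z': 300}  # assumed name
pd.Series(d)  # dictionary keys become the index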
Out[7]: x 100
y 200
z 300
dtype: int64
A Series can hold a wide variety of object types; let's see with examples:
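A sketch of the lost input, passing the list of strings itself as the data:
In [8]: pd.Series(my_labels)  # strings are stored with dtype 'object'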
Out[8]: 0 x
1 y
2 z
dtype: object
To understand the concepts, let's create three Series, ser1, ser2, ser3, from dictionaries with
some random data:
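The exact data was lost in extraction; a sketch with assumed city names, keeping 'Calgary' with value 200 (confirmed by the lookup below):
In [9]: ser1 = pd.Series({'Calgary': 200, 'Edmonton': 300, 'Toronto': 400})    # data assumed
In [10]: ser2 = pd.Series({'Calgary': 200, 'Toronto': 500, 'Vancouver': 350})  # data assumed
In [11]: ser3 = pd.Series({'a': 1, 'b': 2, 'c': 3})                            # data assumed
In [12]: ser1['Calgary']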
Out[12]: 200
Note, we are passing the string "Calgary" because our index contains strings (names of the cities). If the index
is a number, we will pass in the number.
In [13]: ser1 # the order of keys is the same as given in the dictionary
In [14]: ser2
Notice that the values found in both Series were added for their matching index; on the other
hand, if there is no match, the value appears as NaN (not a number), which pandas uses to
mark missing or NA values.
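The addition cell itself was lost; presumably (matching the isnull() check on ser4 below):
In [15]: ser4 = ser1 + ser2  # aligned on index labels; non-matching labels become NaN
ser4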
Good to know!
isnull() , notnull()
detect missing data
In [18]: # pd.isnull(ser4)
ser4.isnull()
# shift+tab: its type is a method
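The notnull() counterpart mentioned above was lost; a one-line sketch:
In [19]: ser4.notnull()  # the element-wise complement of isnull()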
axes , values
axes : returns list of the row axis labels/index
values : returns list of values/data
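The demo cells for these two attributes were lost; minimal sketches:
In [20]: ser1.axes   # list containing the index of the Series
In [21]: ser1.values # the underlying NumPy array of data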
head() , tail()
To view a small sample of a Series or DataFrame (we will learn DataFrame in the next lecture)
object, use the head() and tail() methods.
The default number of elements to display is five, but you may pass a custom number.
In [22]: ser1.head(1)
In [23]: ser1.tail(1)
size
To check the number of elements in your data.
In [24]: ser1.size
Out[24]: 4
empty
Returns True if the Series is empty.
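The lost cell behind the output below was presumably:
In [25]: ser1.empty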
Out[25]: False
DataFrame
A very simple way to think about the DataFrame is "a bunch of Series put together such that they
share the same index".
A DataFrame is a rectangular table of data that contains an ordered collection of columns,
each of which can be a different value type (numeric, string, boolean, etc). A DataFrame has
both a row and a column index; it can be thought of as a dictionary of Series all sharing the
same index.
A DataFrame can be created by the following methods:
* Passing data into the DataFrame function as a NumPy array, with separate lists as the row and column index (by default, a number index starting with 0, 1, 2, ... in both cases).
* Passing data into the DataFrame function as a list of column values zipped into row tuples using the zip function, with separate lists as the row and column index (by default, a number index starting with 0, 1, 2, ... in both cases).
* Passing data into the DataFrame function as a dictionary; you will not need to supply column names separately in this case. However, you can pass an index for the rows (by default, the row index will be numbers 0, 1, 2, ...).
* Reading the data from a file using pandas functions.
Just the process of creating them is different; there is no difference in the properties of the end result.
☞ A good read for those who are interested: Python for Data Analysis
(http://shop.oreilly.com/product/0636920023784.do)
Let's start with a simple example, using arange() and reshape() together to create a 2D array
(matrix).
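The setup cells were lost in extraction; a sketch consistent with the outputs below (the label-list construction is an assumption):
In [26]: index = ['r' + str(i) for i in range(1, 11)]    # row labels 'r1'...'r10'
columns = ['c' + str(i) for i in range(1, 11)]           # column labels 'c1'...'c10'
array_2d = np.arange(100).reshape(10, 10)                # 10x10 matrix of 0..99
In [27]: index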
Out[27]: ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
In [28]: columns
Out[28]: ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']
In [29]: array_2d
In [30]: # Let's create our first DataFrame using index, columns and array_2d
df = pd.DataFrame(data=array_2d, index=index, columns=columns)
In [31]: # How does the DataFrame look?
df
Out[31]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
Method-2
In [32]: age = np.random.randint(low=16, high=80, size=[20,])
city = np.random.choice(['Mumbai','Delhi','Chennai','Kolkata'], 20)
default = np.random.choice([0,1], 20)
In [33]: age
Out[33]: array([57, 30, 22, 53, 60, 19, 32, 35, 17, 37, 44, 71, 72, 47, 66, 23, 75,
37, 20, 23])
In [34]: city
In [35]: default
We can zip these to convert them into a single list of tuples; each tuple in that list will correspond to a
row in the dataframe.
In [36]: mydata = list(zip(age, city, default))
In [37]: mydata
In [38]: df1 = pd.DataFrame(data=mydata, columns=['age','city','default'])
In [39]: df1
Out[39]:
age city default
0 57 Delhi 0
1 30 Chennai 0
2 22 Delhi 0
3 53 Chennai 1
4 60 Kolkata 0
5 19 Delhi 1
6 32 Delhi 1
7 35 Chennai 1
8 17 Chennai 0
9 37 Mumbai 0
10 44 Delhi 1
11 71 Delhi 1
12 72 Kolkata 1
13 47 Mumbai 0
14 66 Delhi 1
15 23 Kolkata 0
16 75 Delhi 1
17 37 Chennai 0
18 20 Chennai 0
19 23 Delhi 1
Method-3
Another way is to put them in a dictionary; you will not need to supply column names separately in
this case.
In [40]: df2 = pd.DataFrame({'age': age, 'city': city, 'default': default})
In [41]: df2
Out[41]:
age city default
0 57 Delhi 0
1 30 Chennai 0
2 22 Delhi 0
3 53 Chennai 1
4 60 Kolkata 0
5 19 Delhi 1
6 32 Delhi 1
7 35 Chennai 1
8 17 Chennai 0
9 37 Mumbai 0
10 44 Delhi 1
11 71 Delhi 1
12 72 Kolkata 1
13 47 Mumbai 0
14 66 Delhi 1
15 23 Kolkata 0
16 75 Delhi 1
17 37 Chennai 0
18 20 Chennai 0
19 23 Delhi 1
☞ Let's learn how to grab the data that we need; this is the most important thing we want to learn to
move on!
Columns
In [42]: # Grabbing a single column
df['c1']
# The output looks like a Series, right?
# Also, the returned Series has the same index as the DataFrame
Out[42]: r1 0
r2 10
r3 20
r4 30
r5 40
r6 50
r7 60
r8 70
r9 80
r10 90
Name: c1, dtype: int32
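The cell behind the type check below was lost; presumably:
In [43]: type(df['c1'])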
Out[43]: pandas.core.series.Series
In [44]: # Grabbing more than one column, pass the list of columns you need!
df[['c1', 'c10']]
Out[44]:
c1 c10
r1 0 9
r2 10 19
r3 20 29
r4 30 39
r5 40 49
r6 50 59
r7 60 69
r8 70 79
r9 80 89
r10 90 99
df.column_name (e.g. df.c1, df.c2, etc.) can be used to grab a column as well; it's good to know,
but I don't recommend it.
If you press "TAB" after df., you will see lots of available methods; it's better not to confuse these
options with your column names by using df.column_name.
Let's try this once:
In [45]: df.c5
Out[45]: r1 4
r2 14
r3 24
r4 34
r5 44
r6 54
r7 64
r8 74
r9 84
r10 94
Name: c5, dtype: int32
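The cell that added the 'new' column was lost in extraction; from the values in the output (new = c1 + c2), it was presumably:
In [46]: df['new'] = df['c1'] + df['c2']  # element-wise sum of two columns
df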
Out[46]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 new
r1 0 1 2 3 4 5 6 7 8 9 1
r2 10 11 12 13 14 15 16 17 18 19 21
r3 20 21 22 23 24 25 26 27 28 29 41
r4 30 31 32 33 34 35 36 37 38 39 61
r5 40 41 42 43 44 45 46 47 48 49 81
r6 50 51 52 53 54 55 56 57 58 59 101
r7 60 61 62 63 64 65 66 67 68 69 121
r8 70 71 72 73 74 75 76 77 78 79 141
r9 80 81 82 83 84 85 86 87 88 89 161
r10 90 91 92 93 94 95 96 97 98 99 181
Press Shift+Tab: you see the default axis is 0, which refers to the index (row labels); for columns, we need
to specify axis=1.
☞ rows refer to axis 0 and columns refer to axis 1
☞ Quick Check: df.shape gives tuple (rows, cols) at [0] and [1]
In [47]: # We can delete a column using drop()
# df.drop('new') # ValueError: labels ['new'] not contained in axis
df.drop('new', axis=1)
Out[47]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
In [48]: df
Out[48]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 new
r1 0 1 2 3 4 5 6 7 8 9 1
r2 10 11 12 13 14 15 16 17 18 19 21
r3 20 21 22 23 24 25 26 27 28 29 41
r4 30 31 32 33 34 35 36 37 38 39 61
r5 40 41 42 43 44 45 46 47 48 49 81
r6 50 51 52 53 54 55 56 57 58 59 101
r7 60 61 62 63 64 65 66 67 68 69 121
r8 70 71 72 73 74 75 76 77 78 79 141
r9 80 81 82 83 84 85 86 87 88 89 161
r10 90 91 92 93 94 95 96 97 98 99 181
✅ pandas is careful: it does not want us to lose information by mistake, so it needs
inplace=True to make the change permanent.
In [49]: df.drop('new', axis=1, inplace=True)
df
Out[49]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
Rows
We can retrieve a row by its name or position with loc (https://pandas.pydata.org/pandas-
docs/stable/generated/pandas.DataFrame.loc.html) and iloc
(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html).
loc -- Access a group of rows and columns by label(s)
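The input cell behind the output below was lost; presumably:
In [50]: df.loc['r1']  # the row labeled 'r1', returned as a Series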
Out[50]: c1 0
c2 1
c3 2
c4 3
c5 4
c6 5
c7 6
c8 7
c9 8
c10 9
Name: r1, dtype: int32
In [51]: type(df.loc['r1'])
Out[51]: pandas.core.series.Series
In [52]: ser1 = df.loc['r1']
ser1['c1']
# same as the above 2 lines: df.loc['r1']['c1']
Out[52]: 0
We can use the row's integer location with iloc, even if our index is labeled.
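The lost cell was presumably:
In [53]: df.iloc[0]  # integer position 0 is the row labeled 'r1'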
Out[53]: c1 0
c2 1
c3 2
c4 3
c5 4
c6 5
c7 6
c8 7
c9 8
c10 9
Name: r1, dtype: int32
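The next input cell was lost; a sketch consistent with the two-row output below:
In [54]: df.iloc[0:2]  # presumably; the first two rows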
Out[54]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
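A single value by row and column label (lost cell, presumably):
In [55]: df.loc['r1', 'c1']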
Out[55]: 0
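A subset of rows and columns (lost cell, presumably):
In [56]: df.loc[['r1','r2'], ['c1','c2']]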
Out[56]:
c1 c2
r1 0 1
r2 10 11
In [57]: # another example - selected rows and columns passed as lists
df.loc[['r2','r5'],['c3','c4']]
Out[57]:
c3 c4
r2 12 13
r5 42 43
In [58]: # one more example - selected rows and columns passed as lists
df.loc[['r2','r5'],['c3','c4','c5']]
Out[58]:
c3 c4 c5
r2 12 13 14
r5 42 43 44
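Conditional selection: comparing the whole frame returns a boolean dataframe. The lost cell behind the output below was presumably:
In [59]: df > 5  # element-wise comparison, True/False for every cell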
Out[59]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 False False False False False False True True True True
r2 True True True True True True True True True True
r3 True True True True True True True True True True
r4 True True True True True True True True True True
r5 True True True True True True True True True True
r6 True True True True True True True True True True
r7 True True True True True True True True True True
r8 True True True True True True True True True True
r9 True True True True True True True True True True
r10 True True True True True True True True True True
In [60]: df[df>5]
Out[60]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
In [61]: bool_mask = df % 3 == 0
df[bool_mask]
Out[61]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0.0 NaN NaN 3.0 NaN NaN 6.0 NaN NaN 9.0
r2 NaN NaN 12.0 NaN NaN 15.0 NaN NaN 18.0 NaN
r3 NaN 21.0 NaN NaN 24.0 NaN NaN 27.0 NaN NaN
r4 30.0 NaN NaN 33.0 NaN NaN 36.0 NaN NaN 39.0
r5 NaN NaN 42.0 NaN NaN 45.0 NaN NaN 48.0 NaN
r6 NaN 51.0 NaN NaN 54.0 NaN NaN 57.0 NaN NaN
r7 60.0 NaN NaN 63.0 NaN NaN 66.0 NaN NaN 69.0
r8 NaN NaN 72.0 NaN NaN 75.0 NaN NaN 78.0 NaN
r9 NaN 81.0 NaN NaN 84.0 NaN NaN 87.0 NaN NaN
r10 90.0 NaN NaN 93.0 NaN NaN 96.0 NaN NaN 99.0
☞ It's not common to use such an operation on an entire dataframe. We usually use them on columns
or rows instead.
For example, we don't want a row with NaN values.
What to do?
Let's have a look at one example.
Out[62]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
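We usually build the mask from a single column; the lost cell was presumably:
In [63]: df['c1'] > 11  # a boolean Series over the rows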
Out[63]: r1 False
r2 False
r3 True
r4 True
r5 True
r6 True
r7 True
r8 True
r9 True
r10 True
Name: c1, dtype: bool
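Passing the mask to the dataframe keeps only the matching rows (lost cell, presumably):
In [64]: df[df['c1'] > 11]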
Out[64]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
☞ The above, " df[df['c1']>11] ", is a dataframe with the condition applied; we can select any column
from this dataframe.
For example:
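The lost cell was presumably:
In [65]: df[df['c1'] > 11]['c1']  # select column c1 from the filtered frame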
Out[65]: r3 20
r4 30
r5 40
r6 50
r7 60
r8 70
r9 80
r10 90
Name: c1, dtype: int32
We can do the above operations (filtering and selecting a column) in a single line (stacked
commands).
In [67]: df[df['c1']>11]['c1']
# Could be a little confusing for beginners, but don't worry, we will
# use such operations frequently in the course as well; you will find
# them very handy.
Out[67]: r3 20
r4 30
r5 40
r6 50
r7 60
r8 70
r9 80
r10 90
Name: c1, dtype: int32
In [68]: # let's split the above operation into steps to understand it
bool_ser = df['c1'] > 11 # output: bool_ser
result = df[bool_ser]    # output: result
result['c1']             # output: final
Out[68]: r3 20
r4 30
r5 40
r6 50
r7 60
r8 70
r9 80
r10 90
Name: c1, dtype: int32
In [69]: # let's grab two columns, we need to pass the list ['c1','c9'] here
df[df['c1']>11][['c1','c9']]
Out[69]:
c1 c9
r3 20 28
r4 30 38
r5 40 48
r6 50 58
r7 60 68
r8 70 78
r9 80 88
r10 90 98
In [70]: # We can do this operation on rows using loc
# Passing multiple rows in a list
df[df['c1']>11].loc[['r3','r5']]
Out[70]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r3 20 21 22 23 24 25 26 27 28 29
r5 40 41 42 43 44 45 46 47 48 49
In [71]: df[df['c1']==70]
Out[71]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r8 70 71 72 73 74 75 76 77 78 79
Combine 2 conditions
Let's try on c1 for a value > 60 and on c2 for a value > 80
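The successful cell combined both conditions element-wise with & (see the discussion of the error below); it was presumably:
In [72]: df[(df['c1'] > 60) & (df['c2'] > 80)]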
Out[72]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-73-5de458536cbe> in <module>
----> 1 df[(df['c1']>60) and (df['c2']>80)]
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1476             raise ValueError("The truth value of a {0} is ambiguous. "
   1477                              "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478                              .format(self.__class__.__name__))
   1479
   1480     __bool__ = __nonzero__
This "ambiguous" means, True, only work for a single booleans at a time "True and False". We
need to use "&" instead. ("|" for or)
Try the above code using "and"
The "and" operator gets confused with series of True/False and raise Error
Out[74]:
index c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
0 r1 0 1 2 3 4 5 6 7 8 9
1 r2 10 11 12 13 14 15 16 17 18 19
2 r3 20 21 22 23 24 25 26 27 28 29
3 r4 30 31 32 33 34 35 36 37 38 39
4 r5 40 41 42 43 44 45 46 47 48 49
5 r6 50 51 52 53 54 55 56 57 58 59
6 r7 60 61 62 63 64 65 66 67 68 69
7 r8 70 71 72 73 74 75 76 77 78 79
8 r9 80 81 82 83 84 85 86 87 88 89
9 r10 90 91 92 93 94 95 96 97 98 99
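Next, a new column of labels is added; the cell that created it was lost, presumably (the split() construction is an assumption):
In [75]: newind = 'a b c d e f g h i j'.split()
df['newind'] = newind
df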
Out[75]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 newind
r1 0 1 2 3 4 5 6 7 8 9 a
r2 10 11 12 13 14 15 16 17 18 19 b
r3 20 21 22 23 24 25 26 27 28 29 c
r4 30 31 32 33 34 35 36 37 38 39 d
r5 40 41 42 43 44 45 46 47 48 49 e
r6 50 51 52 53 54 55 56 57 58 59 f
r7 60 61 62 63 64 65 66 67 68 69 g
r8 70 71 72 73 74 75 76 77 78 79 h
r9 80 81 82 83 84 85 86 87 88 89 i
r10 90 91 92 93 94 95 96 97 98 99 j
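The frame was then re-indexed on that column (lost cell, presumably):
In [76]: df.set_index('newind', inplace=True)  # Out[77] shows 'newind' as the index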
In [77]: df
Out[77]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
newind
a 0 1 2 3 4 5 6 7 8 9
b 10 11 12 13 14 15 16 17 18 19
c 20 21 22 23 24 25 26 27 28 29
d 30 31 32 33 34 35 36 37 38 39
e 40 41 42 43 44 45 46 47 48 49
f 50 51 52 53 54 55 56 57 58 59
g 60 61 62 63 64 65 66 67 68 69
h 70 71 72 73 74 75 76 77 78 79
i 80 81 82 83 84 85 86 87 88 89
j 90 91 92 93 94 95 96 97 98 99
head() , tail()
In [78]: # Returns the first n rows
df.head(n=2) # n = 5 by default
Out[78]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
newind
a 0 1 2 3 4 5 6 7 8 9
b 10 11 12 13 14 15 16 17 18 19
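The tail cell was lost; presumably:
In [79]: df.tail(2)  # returns the last n rows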
Out[79]:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
newind
i 80 81 82 83 84 85 86 87 88 89
j 90 91 92 93 94 95 96 97 98 99
info()
Provides a concise summary of the DataFrame.
In [80]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 10 columns):
c1 10 non-null int32
c2 10 non-null int32
c3 10 non-null int32
c4 10 non-null int32
c5 10 non-null int32
c6 10 non-null int32
c7 10 non-null int32
c8 10 non-null int32
c9 10 non-null int32
c10 10 non-null int32
dtypes: int32(10)
memory usage: 480.0+ bytes
describe()
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a
dataset's distribution, excluding NaN values.
In [81]: df.describe()
Out[81]:
              c1         c2         c3         c4         c5         c6         c7         c8         c9        c10
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000
mean   45.000000  46.000000  47.000000  48.000000  49.000000  50.000000  51.000000  52.000000  53.000000  54.000000
std    30.276504  30.276504  30.276504  30.276504  30.276504  30.276504  30.276504  30.276504  30.276504  30.276504
min     0.000000   1.000000   2.000000   3.000000   4.000000   5.000000   6.000000   7.000000   8.000000   9.000000
25%    22.500000  23.500000  24.500000  25.500000  26.500000  27.500000  28.500000  29.500000  30.500000  31.500000
50%    45.000000  46.000000  47.000000  48.000000  49.000000  50.000000  51.000000  52.000000  53.000000  54.000000
75%    67.500000  68.500000  69.500000  70.500000  71.500000  72.500000  73.500000  74.500000  75.500000  76.500000
max    90.000000  91.000000  92.000000  93.000000  94.000000  95.000000  96.000000  97.000000  98.000000  99.000000
Hierarchical Indexing
Hierarchical indexing is an important feature of pandas. It makes it possible to have multiple (two
or more) index levels on an axis. Somewhat abstractly, it provides a way to work with higher
dimensional data in a lower dimensional form.
In [82]: # Create a Series with a list of lists (or arrays) as the index:
index = [['a','a','a','b','b','b','c','c','d','d'], # level 1 index
         [1,2,3,1,2,3,1,2,1,2]]                     # level 2 index
ser = pd.Series(np.random.randn(10), index=index)   # mean 0 and variance 1
ser
Out[82]: a 1 -0.097603
2 -1.020513
3 0.393764
b 1 0.777136
2 0.054634
3 -0.850093
c 1 -0.911214
2 -1.458672
d 1 -0.725856
2 1.015324
dtype: float64
With a hierarchically indexed object, so-called partial indexing is possible, which enables
concise selection of subsets of the data.
In [83]: # Data retrieval
ser['a']
Out[83]: 1 -0.097603
2 -1.020513
3 0.393764
dtype: float64
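Going one level deeper retrieves a single value; the lost cell was presumably:
In [84]: ser['a'][2]  # level-1 label 'a', level-2 label 2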
Out[84]: -1.020513453858637
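The cell that built the hierarchical DataFrame shown next was lost; a sketch that reproduces it:
In [85]: index = [['a','a','b','b'], [1,2,1,2]]  # two index levels
df = pd.DataFrame(np.arange(12).reshape(4,3), index=index, columns=['AB','ON','BC'])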
In [86]: df
Out[86]:
AB ON BC
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [87]: df['AB']
Out[87]: a 1 0
2 3
b 1 6
2 9
Name: AB, dtype: int32
In [88]: df.loc['a']
Out[88]:
AB ON BC
1 0 1 2
2 3 4 5
We want to grab a single value; the idea is to go from the outside to the inside. E.g., we want to grab "11":
In [89]: # df.loc['b']
# df.loc['b'].loc[2]
print(df.loc['b'].loc[2]['BC'])
df.loc['a'].loc[2]['BC']
11
Out[89]: 5
In [90]: df.loc['b'].loc[2,'BC']
Out[90]: 11
The hierarchical levels can have names (as strings or any Python objects). If so, these will show
up in the console output:
In [91]: df.index.names
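No names were set yet, so this returns FrozenList([None, None]); the naming cell was lost, presumably:
In [92]: df.index.names = ['L_1', 'L_2']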
In [93]: df
Out[93]:
AB ON BC
L_1 L_2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
Good to know!
xs()
Let me introduce a very useful built-in method, " xs() ", to grab data from a multilevel index.
xs() has the ability to go inside a multilevel index.
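The lost cell was presumably:
In [94]: df.xs('a')  # everything under the outer label 'a'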
Out[94]:
AB ON BC
L_2
1 0 1 2
2 3 4 5
If we want to grab all the data in df where index L_2 is "1", it is tricky for the loc method; xs will do
the magic here!
For example:
Tell xs() what you want (1 here), and indicate the level (L_2 in this case).
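The lost cell was presumably:
In [95]: df.xs(1, level='L_2')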
Out[95]:
AB ON BC
L_1
a 0 1 2
b 6 7 8
Hi guys,
Welcome back to Pandas Essentials; now we are going to talk about missing data!
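The cell that built the dataframe with missing values was lost; a sketch reconstructed from the later outputs (the column means, drops, and fills that follow are all consistent with these values):
In [97]: df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                            'B': [np.nan]*5,
                            'C': [11, 12, 13, 14, 15],
                            'D': [16, np.nan, 18, 19, 20]})
df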
Out[98]:
A B C D
Out[99]:
A B C D
In [100]: df.isnull().sum(axis=1)
Out[100]: 0 1
1 2
2 2
3 1
4 2
dtype: int64
Out[101]:
A B C D
Out[102]: 7.0
☞ NaN values are ignored for sum() and mean(); missing entries are simply not counted.
In [103]: df
Out[103]:
A B C D
In [104]: df.mean()
Out[104]: A 2.333333
B NaN
C 13.000000
D 18.250000
dtype: float64
In [105]: df['C'].mean()
Out[105]: 13.0
In [106]: # drop any row (the default) with any NaN value
df.dropna()
Out[106]:
A B C D
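Every row contains at least one NaN (column B is all NaN), so the result above is empty. Dropping along the columns instead (lost cell, presumably):
In [107]: df.dropna(axis=1)  # drop any column with any NaN -- only C survives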
Out[107]:
C
0 11
1 12
2 13
3 14
4 15
thresh : int, default None. thresh=3 with axis=1 means it will drop any column that has fewer than 3 non-NaN
values; in other words, a column needs at least 3 non-NaN values to survive.
In [108]: df
Out[108]:
A B C D
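The lost cell was presumably:
In [109]: df.dropna(axis=1, thresh=3)  # B has 0 non-NaN values and is dropped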
Out[109]:
A C D
0 1.0 11 16.0
1 2.0 12 NaN
2 NaN 13 18.0
3 4.0 14 19.0
4 NaN 15 20.0
In [110]: df.fillna(value='Filled')
Out[110]:
A B C D
0 1 Filled 11 16
1 2 Filled 12 Filled
2 Filled Filled 13 18
3 4 Filled 14 19
4 Filled Filled 15 20
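A more sensible fill: replace missing values in a column with that column's mean. The lost cell was presumably:
In [111]: df['A'].fillna(value=df['A'].mean())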
Out[111]: 0 1.000000
1 2.000000
2 2.333333
3 4.000000
4 2.333333
Name: A, dtype: float64
In [112]: # pad / ffill: forward fill, propagate the last valid observation forward to the next NaN
df.fillna(method='ffill')
Out[112]:
A B C D
In [113]: print(df)
df.fillna(method='pad')
A B C D
0 1.0 NaN 11 16.0
1 2.0 NaN 12 NaN
2 NaN NaN 13 18.0
3 4.0 NaN 14 19.0
4 NaN NaN 15 20.0
Out[113]:
A B C D
Out[114]:
A B C D
Out[115]:
A B C D
merge(): connects rows in DataFrames based on one or more keys. (This will be familiar to
SQL or other relational database users, as it implements database join operations.)
concat(): concatenates or "stacks" together objects along an axis.
✅ If you don't know SQL, don't worry; the concepts of merging are presented with very simple
examples so that you can follow the steps. Our focus here is not to learn SQL; we only
want to go through a few widely used and very important inner and outer joining operations
for data wrangling.
If you have questions, please ask and we are more than happy to help!
✅ Important thing you should know: merging operations may give NaN in the output, and these
need to be treated according to the circumstances/requirements during data analysis.
We need data to work with; let's create two DataFrames, df1 and df2.
In [117]: df1 = pd.DataFrame({'key': ['a', 'b', 'c', 'd', 'e'], 'A1': range(5), 'B1': range(5, 10)})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'A2': range(3), 'B2': range(3, 6)})
In [118]: df1
Out[118]:
key A1 B1
0 a 0 5
1 b 1 6
2 c 2 7
3 d 3 8
4 e 4 9
In [119]: df2
Out[119]:
key A2 B2
0 a 0 3
1 b 1 4
2 c 2 5
'how' tells merge() what type of joining operation needs to be done; it could be
'inner', 'outer', 'left', or 'right'. The default value of 'how' is 'inner' if nothing is
provided.
'on' tells the field name to join on, which could be a label or a list.
merge()
Let's overview 'how' and 'on' parameters in 'merge()' .
how:{'inner','outer','left','right'}
'inner': use intersection of keys from both frames, similar to a SQL inner join.
'outer': use union of keys from both frames, similar to a SQL full outer join.
'left': use only keys from left frame, similar to a SQL left outer join.
'right': use only keys from right frame, similar to a SQL right outer join.
on:label or list
how = 'inner'
The key column in the resultant will be the intersection of the 'key' columns in both df1 and df2; in
our case, a, b, c along with the associated data.
☞ I am using print to output the resultant along with the original dataframes df1, df2 for
comparison.
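The input cell was lost; presumably (matching the pattern of In [121] below):
In [120]: print(pd.merge(df1, df2, how='inner', on='key'))
print(df1)
print(df2)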
key A1 B1 A2 B2
0 a 0 5 0 3
1 b 1 6 1 4
2 c 2 7 2 5
key A1 B1
0 a 0 5
1 b 1 6
2 c 2 7
3 d 3 8
4 e 4 9
key A2 B2
0 a 0 3
1 b 1 4
2 c 2 5
☞ d and e did not appear in the merged output; 'inner' returns the intersection of the key columns
only!
how = 'outer'
The key column in the result will be the union of df1['key'] and df2['key'], meaning all
the keys found in both tables.
☞ I am using print to output the resultant along with the original dataframes df1, df2 for
comparison.
In [121]: print(pd.merge(df1, df2, how = 'outer', on='key'))
print(df1)
print(df2)
key A1 B1 A2 B2
0 a 0 5 0.0 3.0
1 b 1 6 1.0 4.0
2 c 2 7 2.0 5.0
3 d 3 8 NaN NaN
4 e 4 9 NaN NaN
key A1 B1
0 a 0 5
1 b 1 6
2 c 2 7
3 d 3 8
4 e 4 9
key A2 B2
0 a 0 3
1 b 1 4
2 c 2 5
☞ NaN appears in the A2, B2 columns for keys d, e. It is a union operation, and A2, B2 values do not
exist in df2 for keys d, e!
how = 'left'
Use only the key column of the left dataframe, similar to a SQL left outer join.
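The input cell was lost; presumably:
In [122]: print(pd.merge(df1, df2, how='left', on='key'))
print(df1)
print(df2)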
key A1 B1 A2 B2
0 a 0 5 0.0 3.0
1 b 1 6 1.0 4.0
2 c 2 7 2.0 5.0
3 d 3 8 NaN NaN
4 e 4 9 NaN NaN
key A1 B1
0 a 0 5
1 b 1 6
2 c 2 7
3 d 3 8
4 e 4 9
key A2 B2
0 a 0 3
1 b 1 4
2 c 2 5
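how = 'right'
Use only the key column of the right dataframe, similar to a SQL right outer join. The lost cell was presumably:
In [123]: print(pd.merge(df1, df2, how='right', on='key'))
print(df1)
print(df2)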
key A1 B1 A2 B2
0 a 0 5 0 3
1 b 1 6 1 4
2 c 2 7 2 5
key A1 B1
0 a 0 5
1 b 1 6
2 c 2 7
3 d 3 8
4 e 4 9
key A2 B2
0 a 0 3
1 b 1 4
2 c 2 5
Merging example with two key columns (key1, key2) -- a little complicated!
Let's create two dataframes such that each has two key columns, key1 & key2.
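The construction cells were lost; a sketch that reproduces the frames printed below:
In [124]: left = pd.DataFrame({'key1': ['a','a','b','c'],
                               'key2': ['a','b','a','b'],
                               'A': ['A0','A1','A2','A3'],
                               'B': ['B0','B1','B2','B3']})
right = pd.DataFrame({'key1': ['a','b','b','c'],
                      'key2': ['a','b','a','a'],
                      'C': ['C0','C1','C2','C3'],
                      'D': ['D0','D1','D2','D3']})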
'inner' is intersection; only the key pairs present in both dataframes will appear in the resultant.
In [125]: print(left)
print(right)
print(pd.merge(left, right, how='inner', on=['key1', 'key2']))
key1 key2 A B
0 a a A0 B0
1 a b A1 B1
2 b a A2 B2
3 c b A3 B3
key1 key2 C D
0 a a C0 D0
1 b b C1 D1
2 b a C2 D2
3 c a C3 D3
key1 key2 A B C D
0 a a A0 B0 C0 D0
1 b a A2 B2 C2 D2
As we know, 'outer' is union; all key pairs present in either dataframe will appear in the
resultant.
In [126]: print(left)
print(right)
print(pd.merge(left, right, how='outer', on=['key1', 'key2']))
key1 key2 A B
0 a a A0 B0
1 a b A1 B1
2 b a A2 B2
3 c b A3 B3
key1 key2 C D
0 a a C0 D0
1 b b C1 D1
2 b a C2 D2
3 c a C3 D3
key1 key2 A B C D
0 a a A0 B0 C0 D0
1 a b A1 B1 NaN NaN
2 b a A2 B2 C2 D2
3 c b A3 B3 NaN NaN
4 b b NaN NaN C1 D1
5 c a NaN NaN C3 D3
For a 'left' join, only the key pairs in left will be used.
In [127]: print(left)
print(right)
print(pd.merge(left, right, how='left', on=['key1', 'key2']))
key1 key2 A B
0 a a A0 B0
1 a b A1 B1
2 b a A2 B2
3 c b A3 B3
key1 key2 C D
0 a a C0 D0
1 b b C1 D1
2 b a C2 D2
3 c a C3 D3
key1 key2 A B C D
0 a a A0 B0 C0 D0
1 a b A1 B1 NaN NaN
2 b a A2 B2 C2 D2
3 c b A3 B3 NaN NaN
For a 'right' join, only the key pairs in right will be used.
In [128]: print(left)
print(right)
print(pd.merge(left, right, how='right', on=['key1', 'key2']))
key1 key2 A B
0 a a A0 B0
1 a b A1 B1
2 b a A2 B2
3 c b A3 B3
key1 key2 C D
0 a a C0 D0
1 b b C1 D1
2 b a C2 D2
3 c a C3 D3
key1 key2 A B C D
0 a a A0 B0 C0 D0
1 b a A2 B2 C2 D2
2 b b NaN NaN C1 D1
3 c a NaN NaN C3 D3
Concatenation
Concatenation is also interchangeably referred to as binding or stacking. This operation basically
glues DataFrames together.
☞ It's important to remember that dimensions should match along the axis we are concatenating
on.
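The cells that built df1 and df2 were lost; a sketch that reproduces them (note df2 continues the index at 4..7):
In [129]: df1 = pd.DataFrame({'A': ['A0','A1','A2','A3'],
                              'B': ['B0','B1','B2','B3'],
                              'C': ['C0','C1','C2','C3'],
                              'D': ['D0','D1','D2','D3']},
                             index=[0, 1, 2, 3])
In [130]: df2 = pd.DataFrame({'A': ['A4','A5','A6','A7'],
                              'B': ['B4','B5','B6','B7'],
                              'C': ['C4','C5','C6','C7'],
                              'D': ['D4','D5','D6','D7']},
                             index=[4, 5, 6, 7])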
In [131]: df1
Out[131]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
In [132]: df2
Out[132]:
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
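The lost cell was presumably:
In [133]: pd.concat([df1, df2])  # default axis=0 stacks the rows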
Out[133]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
In [134]: pd.concat([df1,df2], axis=1) # axis = 1/columns
Out[134]:
A B C D A B C D
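join() works on the index; the cells that built left and right were lost, presumably:
In [135]: left = pd.DataFrame({'A': ['A0','A1','A2'], 'B': ['B0','B1','B2']},
                              index=['K0','K1','K2'])
right = pd.DataFrame({'C': ['C0','C2','C3'], 'D': ['D0','D2','D3']},
                     index=['K0','K2','K3'])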
In [136]: left
Out[136]:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
In [137]: right
Out[137]:
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
In [138]: left.join(right)
Out[138]:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
Groupby
Groupby is one of the most important and key functionalities in pandas. It allows us to group data
together, call aggregate functions, and combine the results in three steps: split-apply-combine.
Before we move on to the hands-on part, let's try to understand how this split-apply-combine
works.
Split: In this process, data contained in a pandas object (e.g. Series, DataFrame) is split into
groups based on one or more keys that we provide. The splitting is performed on a particular
axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its
columns (axis=1).
apply: Once splitting is done, a function is applied to each group independently, producing a
new value.
combine: Finally, the results of all those function applications are combined into a resulting
object. The form of the resulting object will usually depend on what's being done to the data.
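The cell that created the example dataframe was lost. From the aggregates that follow (means 320, 275, 175; Costco's minimum row is Denice with 90), the sales values can be recovered, but most customer names cannot; all names below except Denice are hypothetical placeholders:
In [139]: data = {'Store': ['Walmart','Walmart','Target','Costco','Target','Costco'],
                  'Customer': ['Sam','Sarah','Amy','Denice','Vanessa','Scott'],  # hypothetical except Denice
                  'Sales': [150, 200, 120, 90, 430, 550]}
df = pd.DataFrame(data)
In [140]: df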
Out[140]:
Store Customer Sales
3 Costco Denice 90
In the df, we have a unique Customer name, Sales in numbers, and the Store name.
Let's group the data in df based on the column "Store" using the groupby method. This will create a
DataFrameGroupBy object.
Grab the df, access the groupby method using "." and pass the column we want to group the data
on.
Notice, we get a groupby object, stored at a memory address 0x...
In [141]: df.groupby("Store")
Now that we have the grouped data in the "by_store" object, we can call aggregate methods on it.
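The assignment cell was lost in extraction; presumably:
In [142]: by_store = df.groupby("Store")  # store the DataFrameGroupBy object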
In [143]: by_store.mean()
Out[143]:
Sales
Store
Costco 320
Target 275
Walmart 175
Pandas applies mean() to the numeric column "Sales" and ignores non-numeric columns
automatically. The same is true for sum, std, max, and so on.
Out[144]:
Sales
Store
Costco 320
Target 275
Walmart 175
Notice that the result is a dataframe with "Store" as the index and "Sales" as a column. We can use the loc
method to locate the value for a certain store after the aggregation function. This will give us the
value (e.g. sales) for a single store.
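The demo cell was lost; presumably:
In [145]: by_store.mean().loc['Walmart']  # the aggregated sales for a single store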
In [146]: by_store.min()
Out[146]:
Customer Sales
Store
Costco Denice 90
In [147]: by_store.max()
Out[147]:
Customer Sales
Store
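The standard-deviation cell was lost; presumably:
In [148]: by_store.std()  # standard deviation of sales per store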
Out[148]:
Sales
Store
Costco 325.269119
Target 219.203102
Walmart 35.355339
In [149]: # count the number of instances in the columns, works with strings as well
# we have 2 customers and 2 sales entries for each store
by_store.count()
Out[149]:
Customer Sales
Store
Costco 2 2
Target 2 2
Walmart 2 2
describe is a useful method that gives a bunch of useful information, such as mean, min, quartile
values, etc., for each store.
In [150]: by_store.describe()
Out[150]:
Sales
Store
Let's use transpose() after describe so that the output looks better!
In [151]: by_store.describe().transpose()
Out[151]:
Store Costco Target Walmart
We can call a column name for a selected store to separate its information with transpose() as
well!
In [152]: by_store.describe().transpose()['Costco']
In this lecture, we will explore some more general-purpose operations and revise what we have
learned in the previous lectures.
Let's create a dataframe to get hands-on experience with these operations.
I will repeat some values and also put NaN in our dataframe.
In [153]: # import numpy as np
# import pandas as pd
data_dic = {'col_1': [1,2,3,4,5],
            'col_2': [111,222,333,111,555],
            'col_3': ['alpha','bravo','charlie',np.nan,np.nan],
            }
df = pd.DataFrame(data_dic, index=[1,2,3,4,5])
df
Out[153]:
col_1 col_2 col_3
1 1 111 alpha
2 2 222 bravo
3 3 333 charlie
4 4 111 NaN
5 5 555 NaN
info()
Provides a concise summary of a DataFrame. We will use this function very often in the course.
In [154]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1 to 5
Data columns (total 3 columns):
col_1 5 non-null int64
col_2 5 non-null int64
col_3 3 non-null object
dtypes: int64(2), object(1)
memory usage: 160.0+ bytes
head(n)
Returns the first n rows; the default is 5. This is very useful to get an overview of our data. We will
use this very often in the course.
In [155]: df.head(2)
Out[155]:
col_1 col_2 col_3
1 1 111 alpha
2 2 222 bravo
isnull()
Return a boolean same-sized object indicating if the values are null.
In [156]: df.isnull()
Out[156]:
col_1 col_2 col_3
dropna()
axis = 0/rows, 1/columns -- 0 is the default
inplace = False by default; to make the change permanent, it needs to be True
fillna()
Fill NA/NaN values using the specified method
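The demo cells for these two methods were lost; minimal sketches on the current df:
In [157]: df.dropna()        # drops rows 4 and 5, which contain NaN in col_3
In [158]: df.fillna('FILL')  # replaces each NaN with the given value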
unique()
Finds and returns all the unique values.
Let's see how it works on all the columns in our dataframe.
In [159]: print(df['col_1'].unique())
print(df['col_2'].unique())
print(df['col_3'].unique())
# 111 and NaN are repeated values; unique returns each only once.
[1 2 3 4 5]
[111 222 333 555]
['alpha' 'bravo' 'charlie' nan]
nunique()
Finds and returns how many unique values exist.
☞ Notice the difference: NaN is not counted as a unique value, so it returns "3" for col_3.
In [160]: print(df['col_1'].nunique())
print(df['col_2'].nunique())
print(df['col_3'].nunique())
5
4
3
value_counts()
We want a table with all the values along with the number of times they appear in our data; value_counts
does the work here!
☞ NaN is not counted; missing values do not appear in the output.
In [161]: print(df['col_1'].value_counts())
print(df['col_2'].value_counts())
print(df['col_3'].value_counts())
5 1
4 1
3 1
2 1
1 1
Name: col_1, dtype: int64
111 2
222 1
333 1
555 1
Name: col_2, dtype: int64
alpha 1
charlie 1
bravo 1
Name: col_3, dtype: int64
✅ unique(), nunique(), and value_counts() are three very useful and frequently used
methods associated with finding unique values in the data.
sort_values()
by default:
ascending=True
inplace=False
In [162]: df.sort_values(by='col_2')
Out[162]:
col_1 col_2 col_3
1 1 111 alpha
4 4 111 NaN
2 2 222 bravo
3 3 333 charlie
5 5 555 NaN
Data Selection
Let's talk about selecting data once again. We have learned to grab data in the previous lectures
as well.
We can grab a column by its name, do conditional selection, and much more.
We can use loc and iloc to find rows as well.
Let's revise conditional selection; this also includes data selection based on the column name.
* df['col_1'] > 2 : returns the data where the condition is True (if you remember, this is just a boolean series)
* df['col_2'] == 111 : returns the data where the condition is True
* Let's combine these two conditions with & by putting both conditions in ().
* Wrap them in df[] and see what it returns!
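The lost cell behind the boolean output below was presumably:
In [163]: df['col_1'] > 2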
Out[163]: 1 False
2 False
3 True
4 True
5 True
Name: col_1, dtype: bool
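The next lost cell apparently displayed the column we are about to test; presumably:
In [164]: df['col_2']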
Out[164]: 1 111
2 222
3 333
4 111
5 555
Name: col_2, dtype: int64
In [165]: 1 """We can say, this is a boolean mask on said condition to provide
2 to the dataframe, df, for filtering out the results."""
3 bool_ser = (df['col_1'] > 2) & (df['col_2'] == 111)
4 bool_ser
Out[165]: 1 False
2 False
3 False
4 True
5 False
dtype: bool
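Passing the mask to df keeps only the matching rows (lost cell, presumably):
In [166]: df[bool_ser]  # only row 4 satisfies both conditions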
Out[166]:
col_1 col_2 col_3
4 4 111 NaN
apply()
Indeed, this is one of the most powerful pandas features. Using the apply() method, we can
broadcast our customized functions over our data.
Let's define a customized function, "square", and broadcast it with the "apply" method on col_1
of our DataFrame, df.
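The cell defining the function was lost in extraction. Note that the recorded output (2, 4, 6, 8, 10) shows each value doubled, so the lecture's "square" evidently returned x*2; a sketch consistent with that output:
In [167]: def square(x):
              return x * 2  # doubles the input, per the recorded output below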
In [168]: df['col_1'].apply(square)
Out[168]: 1 2
2 4
3 6
4 8
5 10
Name: col_1, dtype: int64
The same operation can be conveniently carried out using a lambda expression!
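The lost cell was presumably:
In [169]: df['col_1'].apply(lambda x: x * 2)  # same result as apply(square)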
Out[169]: 1 2
2 4
3 6
4 8
5 10
Name: col_1, dtype: int64
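The next output reports the string lengths in col_3; the NaN rows are absent, so they were presumably dropped first (lost cell, a sketch):
In [170]: df['col_3'].dropna().apply(len)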
Out[170]: 1 5
2 5
3 7
Name: col_3, dtype: int64
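The output "float" below matches the type of a missing value, since NaN is a float (lost cell, presumably):
In [171]: type(df['col_3'][4])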
Out[171]: float
Good to know
In [172]: # Getting the index
df.index
col_2 col_3
1 111 alpha
2 222 bravo
3 333 charlie
4 111 NaN
5 555 NaN
col_1 col_2 col_3
1 1 111 alpha
2 2 222 bravo
3 3 333 charlie
4 4 111 NaN
5 5 555 NaN
Out[175]:
col_2 col_3
1 111 alpha
2 222 bravo
3 333 charlie
4 111 NaN
5 555 NaN
In [176]: df.index
pivot_table()
Shift + Tab to read the documentation.
Creates a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored
in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.
pivot_table takes three main arguments: values, index, and columns.
☞ If you are an excel user, you may be familiar with pivot_table. If not, don't worry about this at this
stage, we will discuss it in the coming sections of the course.
In [177]: df
Out[177]:
col_1 col_2 col_3
1 1 111 alpha
2 2 222 bravo
3 3 333 charlie
4 4 111 NaN
5 5 555 NaN
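The pivot cell was lost; from the output header (col_3 as columns, col_1 as index) it was presumably:
In [178]: df.pivot_table(values='col_2', index='col_1', columns='col_3')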
Out[178]:
col_3 alpha bravo charlie
col_1
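The cell creating the second example frame was lost; a sketch that reproduces the output below:
In [179]: data = {'A': ['foo','foo','foo','bar','bar','bar'],
                  'B': ['one','one','two','two','one','one'],
                  'C': ['x','y','x','y','x','y'],
                  'D': [1, 3, 2, 5, 4, 1]}
In [180]: df = pd.DataFrame(data)
df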
Out[180]:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
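The final pivot cell was lost; from the output header (C as columns, A and B as a hierarchical index) it was presumably:
In [181]: df.pivot_table(values='D', index=['A','B'], columns=['C'])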
Out[181]:
C x y
A B