Introduction to Pandas & Data Structures
March 9, 2022
1 Introduction to Pandas
Pandas is an open source library providing high-performance, easy-to-use data structures and data
analysis tools for the Python programming language. Today, pandas is actively maintained by a
worldwide community of contributors who volunteer their time and energy to keep the project
going. We will learn to use pandas for data analysis. If you have never used this library, you
can think of pandas as an extremely powerful version of Excel with many more features.
1.2 Series
A Series is a one-dimensional, array-like object that contains values and an associated array of
labels, called its index. A Series can be indexed using these labels. A Series is similar to a
NumPy array (in fact, it is built on top of the NumPy array object), but it can hold any arbitrary
Python object. Let’s get hands-on and learn the concepts of Series with examples:
[1]: # first things first, we need to import NumPy and pandas
# np and pd are the conventional aliases for NumPy and pandas
import numpy as np
import pandas as pd
So, we have two Python list objects:
• my_labels - a list of strings, and
• my_data - a list of numbers.
We can use pd.Series (with a capital S) to convert a Python list object to a pandas Series.
[3]: 0 100
1 200
2 300
dtype: int64
The left-hand column “0 1 2” is the automatically generated index for the elements of the Series,
and “100 200 300” is the data. We can specify our own index values and grab the respective
data using these labels. Let’s pass my_labels to the Series as the index.
[4]: pd.Series(data=my_data, index=my_labels)
[4]: x 100
y 200
z 300
dtype: int64
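The cell that defines the two lists is not shown above. As a minimal sketch, assuming my_labels and my_data hold the values visible in the printed output, both constructions can be reproduced like this (ser_default and ser_labeled are illustrative names, not from the original):

```python
import pandas as pd

# Assumed inputs, inferred from the printed output above.
my_labels = ['x', 'y', 'z']   # a list of strings
my_data = [100, 200, 300]     # a list of numbers

# Without an index, pandas generates the default integer index 0, 1, 2.
ser_default = pd.Series(data=my_data)

# With an index, each value is paired with the label at the same position.
ser_labeled = pd.Series(data=my_data, index=my_labels)

print(ser_default)
print(ser_labeled)
```

The data and the index must have the same length, otherwise pd.Series raises a ValueError.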
[5]: 0 100
1 200
2 300
dtype: int32
Notice that we got the default index “0 1 2” again. Let’s pass our own index values!
[6]: pd.Series(data=my_data, index=my_labels)
# pd.Series(my_array, my_labels) # data and index are in order
[6]: x 100
y 200
z 300
dtype: int64
1.4 Series using dictionary
[7]: # Let's create a dictionary my_dict
my_dict = {'x': 100, 'y': 200, 'z': 300} # creating a dictionary my_dict
pd.Series(data=my_dict) # creating series from dictionary
[7]: x 100
y 200
z 300
dtype: int64
Notice the difference here: if we pass a dictionary to Series, pandas takes the keys as the
index labels and the values as the data.
[10]: print(ser1)
Toronto 500
Calgary 200
Vancouver 300
Montreal 700
dtype: int64
[11]: # Grabbing information from a Series is very similar to using a dictionary.
# Simply pass the index label and it will return the value!
[11]: 200
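The lookup itself is not shown in the cell above. The output 200 matches the Calgary entry of ser1, so a plausible sketch, assuming ser1 was built from the values printed earlier, is:

```python
import pandas as pd

# ser1 rebuilt from the printed values (the original defining cell is not shown).
ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})

# Label lookup works like a dictionary key lookup.
value = ser1['Calgary']
print(value)

# Membership tests check the index labels, not the values.
print('Calgary' in ser1)

# .get returns None for a missing label instead of raising KeyError.
# 'Ottawa' is a hypothetical label used only for illustration.
print(ser1.get('Ottawa'))
```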
[12]: ser4 = ser1 + ser2 # add the two Series and assign the result to a new variable, ser4
ser4
[12]: Calgary 400.0
Montreal 1400.0
Toronto NaN
Vancouver 600.0
dtype: float64
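The NaN for Toronto comes from index alignment: pandas adds values label by label, and a label present in only one operand has no partner. The definition of ser2 is not shown, so the sketch below uses a hypothetical ser2 that lacks Toronto, which is enough to reproduce the printed result:

```python
import pandas as pd

ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})
# Hypothetical ser2: same values as ser1, but without 'Toronto'.
ser2 = pd.Series({'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})

# Values are matched by label; 'Toronto' has no partner in ser2, so it becomes NaN.
ser4 = ser1 + ser2
print(ser4)

# To treat a missing label as 0 instead, use the add() method with fill_value.
filled = ser1.add(ser2, fill_value=0)
print(filled)
```

The result is float64 rather than int64 because NaN is a floating-point value.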
head(), tail()
To view a small sample of a Series or DataFrame (we will learn DataFrame in the
next lecture) object, use the head() and tail() methods. The default number of elements to display
is five, but you may pass a custom number.
[15]: ser1.head(1) # head(1) will return the first row only
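As a short sketch of both methods, assuming ser1 holds the four cities printed earlier:

```python
import pandas as pd

ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})

first = ser1.head(1)      # first element only
last_two = ser1.tail(2)   # last two elements
default = ser1.head()     # up to five elements; ser1 only has four

print(first)
print(last_two)
print(len(default))
```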
[17]: [Index(['Toronto', 'Calgary', 'Vancouver', 'Montreal'], dtype='object')]
• size: returns the number of elements in the Series.
• empty: True if the Series is empty.
[19]: # True for empty series
ser1.empty
[19]: False
[20]: ser1.size
[20]: 4
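For contrast, here is a small sketch with a Series that really is empty, plus one that contains only NaN. The names empty_ser and nan_ser are illustrative and do not appear in the original notebook.

```python
import pandas as pd

ser1 = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700})
empty_ser = pd.Series([], dtype='float64')   # no elements at all
nan_ser = pd.Series([float('nan')])          # one element, which is NaN

print(ser1.size, ser1.empty)
print(empty_ser.size, empty_ser.empty)
# A Series holding only NaN is not empty: NaN still counts as an element.
print(nan_ser.size, nan_ser.empty)
```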
1.7 DataFrame
A very simple way to think about a DataFrame is as “a bunch of Series put together so that they
share the same index”. A DataFrame is a rectangular table of data that contains an ordered
collection of columns, each of which can hold a different value type (numeric, string, boolean,
etc.). A DataFrame has both a row index and a column index; it can be thought of as a dictionary
of Series all sharing the same row index. Let’s learn DataFrame with examples:
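[21]: # Let’s create two lists of labels:
# index: for rows 'r1' to 'r10'
# columns: for columns 'c1' to 'c10'
import pandas as pd
import numpy as np
# Assumed definitions (not shown in the original cell), chosen to match the printed output
index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
print(index)
print(columns)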
['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']
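[22]: # Let’s start with a simple example, using arange() and reshape() together to create a 2D array (matrix).
# Assumed call (the original line is not shown), chosen to match the printed output
array_2d = np.arange(0, 100).reshape(10, 10)
print(array_2d)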
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]
[23]: # Now, let's create our first DataFrame using index, columns and array_2d!
df = pd.DataFrame(data=array_2d, index=index, columns=columns)
print(df)
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
df is our first DataFrame. We have columns c1 to c10 and their corresponding rows r1 to r10.
Each column is actually a pandas Series, and all of them share a common index: the row labels.
Now we can play with this DataFrame df to learn how to grab the data we need, which is the most
important concept to master before we move on in this course!
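The claim that each column is a Series sharing the row index can be checked directly. The sketch below rebuilds df under the assumption that it holds 0 to 99 in a 10 x 10 grid, as printed above, and then builds a second, hypothetical DataFrame (small) from a dictionary of Series:

```python
import numpy as np
import pandas as pd

index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=index, columns=columns)

# A single column is a Series whose index is the DataFrame's row index.
col = df['c1']
print(type(col))
print(col.index.equals(df.index))

# Going the other way: a dictionary of Series becomes a DataFrame.
# s_a, s_b and small are hypothetical names used only for illustration.
s_a = pd.Series([1, 2, 3], index=['r1', 'r2', 'r3'])
s_b = pd.Series([10.0, 20.0, 30.0], index=['r1', 'r2', 'r3'])
small = pd.DataFrame({'a': s_a, 'b': s_b})
print(small)
```

Note that the two columns of small have different types (integer and float), which a plain 2D NumPy array would not allow.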
Grabbing Columns from dataframe
Just pass the name of the required column in square brackets!
[24]: # Grabbing a single column
df['c1']
[24]: r1 0
r2 10
r3 20
r4 30
r5 40
r6 50
r7 60
r8 70
r9 80
r10 90
Name: c1, dtype: int32
[25]: # We can grab more than one column, simply pass the list of columns you need!
df[['c1', 'c10']]
[25]: c1 c10
r1 0 9
r2 10 19
r3 20 29
r4 30 39
r5 40 49
r6 50 59
r7 60 69
r8 70 79
r9 80 89
r10 90 99
print(df)
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 new
r1 0 1 2 3 4 5 6 7 8 9 1
r2 10 11 12 13 14 15 16 17 18 19 21
r3 20 21 22 23 24 25 26 27 28 29 41
r4 30 31 32 33 34 35 36 37 38 39 61
r5 40 41 42 43 44 45 46 47 48 49 81
r6 50 51 52 53 54 55 56 57 58 59 101
r7 60 61 62 63 64 65 66 67 68 69 121
r8 70 71 72 73 74 75 76 77 78 79 141
r9 80 81 82 83 84 85 86 87 88 89 161
r10 90 91 92 93 94 95 96 97 98 99 181
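The cell that created the “new” column is not shown. Its values (1, 21, 41, ...) equal c1 + c2 in every row, so one plausible reconstruction, offered as an assumption rather than the original code, is:

```python
import numpy as np
import pandas as pd

index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=index, columns=columns)

# Assigning to a column name that does not exist yet creates the column.
# The sum is computed row by row because both Series share the same index.
df['new'] = df['c1'] + df['c2']
print(df[['c1', 'c2', 'new']].head(3))
```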
• axis: to drop a column, we need to pass axis=1 (the default, axis=0, drops rows).
• inplace: the default is False; pass inplace=True to delete the column from df itself. This
default makes sure that we don’t delete a column by mistake: without inplace=True, drop()
returns a new DataFrame and the column is not dropped from the original.
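A minimal sketch of both behaviours of drop(), assuming df carries the extra “new” column computed as c1 + c2 (without_new is an illustrative name):

```python
import numpy as np
import pandas as pd

index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=index, columns=columns)
df['new'] = df['c1'] + df['c2']

# Default (inplace=False): a new DataFrame is returned, df itself is unchanged.
without_new = df.drop('new', axis=1)
still_there = 'new' in df.columns
print(still_there)

# inplace=True modifies df directly and returns None.
df.drop('new', axis=1, inplace=True)
print(df.columns.tolist())
```

Writing df.drop(columns='new') is an equivalent spelling that avoids the axis argument.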
[27]: # So, we have 10 rows and 11 columns in our dataframe df; “new” is the 11th one
# that we have added.
print(df)
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
[28]: df.loc[['r2', 'r3']]
[28]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
[29]: # Using iloc, this will again return rows r2 and r3, but here our selection is
# based on integer position rather than on labels!
[29]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
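The iloc call is not shown in the cell above. Since r2 and r3 sit at integer positions 1 and 2, a sketch of the label-based and position-based selections side by side could look like this (the exact original call is an assumption):

```python
import numpy as np
import pandas as pd

index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=index, columns=columns)

by_label = df.loc[['r2', 'r3']]   # select rows by their labels
by_position = df.iloc[[1, 2]]     # select rows by integer position (0-based)

# Slices behave differently: iloc excludes the stop position, loc includes the stop label.
by_slice = df.iloc[1:3]
label_slice = df.loc['r2':'r3']

print(by_label.equals(by_position))
```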
1.11 Grabbing a single element from a dataframe
[30]: # We need to tell the location of the element, [row, col]
# df.loc[req_row, req_col] -- pass row, col for the element (square brackets, not parentheses)!
df.loc['r2', 'c1']
[30]: 10
[31]: 19
Grabbing sub-set of a dataframe
We can grab a sub-set by passing a list of required rows and a list of required columns.
[32]: # for a sub-set, pass the list
df.loc[['r1', 'r2'], ['c1', 'c2']]
[32]: c1 c2
r1 0 1
r2 10 11
[33]: c3 c4
r2 12 13
r5 42 43
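The cells that produced outputs [31] and [33] are not shown. The values are consistent with the selections below, which are offered as a sketch under that assumption:

```python
import numpy as np
import pandas as pd

index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=index, columns=columns)

# A single element: row label first, then column label.
element = df.loc['r2', 'c10']
print(element)

# A sub-set: a list of row labels and a list of column labels.
subset = df.loc[['r2', 'r5'], ['c3', 'c4']]
print(subset)

# The same sub-set selected by integer position.
same_by_position = df.iloc[[1, 4], [2, 3]]
```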
[34]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 False False False False False False True True True True
r2 True True True True True True True True True True
r3 True True True True True True True True True True
r4 True True True True True True True True True True
r5 True True True True True True True True True True
r6 True True True True True True True True True True
r7 True True True True True True True True True True
r8 True True True True True True True True True True
r9 True True True True True True True True True True
r10 True True True True True True True True True True
[35]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 NaN 2 NaN 4 NaN 6 NaN 8 NaN
r2 10 NaN 12 NaN 14 NaN 16 NaN 18 NaN
r3 20 NaN 22 NaN 24 NaN 26 NaN 28 NaN
r4 30 NaN 32 NaN 34 NaN 36 NaN 38 NaN
r5 40 NaN 42 NaN 44 NaN 46 NaN 48 NaN
r6 50 NaN 52 NaN 54 NaN 56 NaN 58 NaN
r7 60 NaN 62 NaN 64 NaN 66 NaN 68 NaN
r8 70 NaN 72 NaN 74 NaN 76 NaN 78 NaN
r9 80 NaN 82 NaN 84 NaN 86 NaN 88 NaN
r10 90 NaN 92 NaN 94 NaN 96 NaN 98 NaN
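Outputs [34] and [35] look like the results of conditional selection, although the cells are not shown. Assuming they came from df > 5 and df[df % 2 == 0], the pattern can be sketched as:

```python
import numpy as np
import pandas as pd

index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=index, columns=columns)

# A comparison returns a DataFrame of booleans with the same shape.
mask = df > 5
print(mask.head(2))

# Indexing with a boolean DataFrame keeps values where the condition is True
# and puts NaN everywhere else.
evens = df[df % 2 == 0]
print(evens.head(2))

# Indexing with a boolean Series filters whole rows instead.
rows = df[df['c1'] > 50]
print(rows.index.tolist())
```

Depending on the pandas version, the kept values may print as floats (0.0, 2.0, ...) because NaN forces a floating-point column.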
1.12.1 info()
Provides a concise summary of the DataFrame. This is a very useful method.
[36]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, r1 to r10
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 10 non-null int32
1 c2 10 non-null int32
2 c3 10 non-null int32
3 c4 10 non-null int32
4 c5 10 non-null int32
5 c6 10 non-null int32
6 c7 10 non-null int32
7 c8 10 non-null int32
8 c9 10 non-null int32
9 c10 10 non-null int32
dtypes: int32(10)
memory usage: 780.0+ bytes
1.12.2 describe()
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a
dataset’s distribution, excluding NaN values.
[37]: df.describe()
[37]: c1 c2 c3 c4 c5 c6 \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 45.000000 46.000000 47.000000 48.000000 49.000000 50.000000
std 30.276504 30.276504 30.276504 30.276504 30.276504 30.276504
min 0.000000 1.000000 2.000000 3.000000 4.000000 5.000000
25% 22.500000 23.500000 24.500000 25.500000 26.500000 27.500000
50% 45.000000 46.000000 47.000000 48.000000 49.000000 50.000000
75% 67.500000 68.500000 69.500000 70.500000 71.500000 72.500000
max 90.000000 91.000000 92.000000 93.000000 94.000000 95.000000
c7 c8 c9 c10
count 10.000000 10.000000 10.000000 10.000000
mean 51.000000 52.000000 53.000000 54.000000
std 30.276504 30.276504 30.276504 30.276504
min 6.000000 7.000000 8.000000 9.000000
25% 28.500000 29.500000 30.500000 31.500000
50% 51.000000 52.000000 53.000000 54.000000
75% 73.500000 74.500000 75.500000 76.500000
max 96.000000 97.000000 98.000000 99.000000
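The std row is the sample standard deviation (pandas uses ddof=1 by default), and the quartiles are obtained by linear interpolation. A short check on column c1, assuming df holds 0 to 99 as above:

```python
import numpy as np
import pandas as pd

index = [f'r{i}' for i in range(1, 11)]
columns = [f'c{i}' for i in range(1, 11)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=index, columns=columns)

summary = df.describe()
c1 = df['c1']                        # values 0, 10, ..., 90

sample_std = c1.std()                # divides by n - 1 (ddof=1), as in describe()
population_std = c1.std(ddof=0)      # divides by n

print(summary.loc['std', 'c1'], sample_std, population_std)
print(summary.loc['25%', 'c1'], c1.quantile(0.25))
```

For c1 the squared deviations from the mean 45 sum to 8250, so the sample value is sqrt(8250 / 9), which is about 30.2765 and matches the table.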