
Introduction to Pandas & Data Structures

March 9, 2022

1 Introduction to Pandas
Pandas is an open source library providing high-performance, easy-to-use data structures and data
analysis tools for the Python programming language. Today, pandas is actively supported by a
community of like-minded individuals around the world who contribute their valuable time and
energy to help make open source pandas possible. We will learn to use pandas for data analysis. If
you have never used this library, you can think of pandas as an extremely powerful version of Excel with many more features.

1.1 pandas Data Structures


Series and DataFrame are the two workhorse data structures in pandas. Let's talk about Series first:

1.2 Series
A Series is a one-dimensional array-like object that contains values and an associated array of labels,
called its index. A Series can be indexed using these labels. (A Series is similar to a NumPy array; in fact,
it is built on top of the NumPy array object.) A Series can hold any arbitrary Python object. Let's
get hands-on and learn the concepts of Series with examples:
[1]: # first things first, we need to import NumPy and pandas
# np and pd are alias for NumPy and pandas

import numpy as np
import pandas as pd

# just to check the versions we are using


print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)

numpy version: 1.20.3


pandas version: 1.3.4
We can create a Series from a list, a NumPy array, or a dictionary. Let's create these objects and convert
them into pandas Series!

Series using lists Let's create two Python lists, one containing labels and another with data:
[2]: my_labels = ['x', 'y', 'z']
my_data = [100, 200, 300]

So, we have two Python list objects:
• my_labels - a list of strings, and
• my_data - a list of numbers
We can use pd.Series (note the capital S) to convert a Python list into a pandas Series.

[3]: # Converting my_data (Python list) to Series (pandas series)


pd.Series(data=my_data)

[3]: 0 100
1 200
2 300
dtype: int64

The column “0 1 2” is the automatically generated index for the series elements “100 200 300”. We can
specify our own index values and grab the respective data/values using these indexes. Let's pass
my_labels to the Series as the index.
[4]: pd.Series(data=my_data, index=my_labels)

[4]: x 100
y 200
z 300
dtype: int64

1.3 Series using NumPy arrays


[5]: # Let's create NumPy array from my_data and then Series from that array
my_array = np.array(my_data) # creating numpy's array from list
pd.Series(data=my_array) # creating series from numpy's array

[5]: 0 100
1 200
2 300
dtype: int32

Notice we got the default index column “0 1 2” again; let's pass our own index values!
[6]: pd.Series(data=my_data, index=my_labels)
# pd.Series(my_array, my_labels) # data and index are in order

[6]: x 100
y 200
z 300
dtype: int64

1.4 Series using dictionary
[7]: # Let's create a dictionary my_dict
my_dict = {'x': 100, 'y': 200, 'z': 300} # creating a dictionary my_dict
pd.Series(data=my_dict) # creating series from dictionary

[7]: x 100
y 200
z 300
dtype: int64

Notice the difference here: if we pass a dictionary to Series, pandas will take the keys as the index/labels
and the values as the data.
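
As a side note (a minimal sketch, not from the original lecture): if we also pass an index when creating a
Series from a dictionary, pandas aligns the dictionary values on those labels, and any label that is not a
key in the dictionary gets NaN.

pd.Series(data=my_dict, index=['x', 'y', 'w'])  # 'w' is not a key in my_dict, so its value becomes NaN
# x    100.0
# y    200.0
# w      NaN
# dtype: float64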

1.5 Grabbing data from Series


Indexes are the key thing to understand in a Series. pandas uses these indexes (numbers or names)
for fast information retrieval (the index works just like a hash table or a dictionary). To understand
the concepts, let's create three Series, ser1, ser2, ser3, from dictionaries with some random data:
[8]: # Creating three dictionaries dict_1, dict_2, dict_3
dict_1 = {'Toronto': 500, 'Calgary': 200, 'Vancouver': 300, 'Montreal': 700}
dict_2 = {'Calgary': 200, 'Vancouver': 300, 'Montreal': 700}
dict_3 = {'Calgary': 200, 'Vancouver': 300, 'Montreal': 700, 'Jasper': 1000}

[9]: # Creating pandas series from the dictionaries


ser1 = pd.Series(dict_1)
ser2 = pd.Series(dict_2)
ser3 = pd.Series(dict_3)

[10]: print(ser1)

Toronto 500
Calgary 200
Vancouver 300
Montreal 700
dtype: int64

[11]: # Grabbing information from a series is very similar to a dictionary. Simply
# pass the index and it will return the value!

ser1['Calgary'] # it's case sensitive: "calgary" is not the same as "Calgary"

[11]: 200

[12]: ser4 = ser1 + ser2 # adding series and assigning the result to a new variable ser4

ser4

[12]: Calgary 400.0
Montreal 1400.0
Toronto NaN
Vancouver 600.0
dtype: float64
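
Notice that Toronto is NaN: it appears only in ser1, and when two Series are added, any label missing from
either side produces a missing value. A minimal sketch (not part of the original lecture): the add() method
with fill_value treats the missing side as 0 instead.

ser1.add(ser2, fill_value=0)  # Toronto becomes 500.0 instead of NaN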

1.6 Built-in Functions


Below are some commonly used built-in functions and attributes for Series during data processing.

isnull() * detect missing data

[13]: # pd.isnull(ser4) is the same as ser4.isnull()


ser4.isnull()
# shift+tab shows its type is a method

[13]: Calgary False


Montreal False
Toronto True
Vancouver False
dtype: bool

[14]: # notnull() * Detect existing (non-missing) values.


# pd.notnull(ser4) is the same as ser4.notnull()
ser4.notnull()

[14]: Calgary True


Montreal True
Toronto False
Vancouver True
dtype: bool
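
These boolean results are typically used to filter or clean a series; a quick sketch (not from the original
lecture):

ser4[ser4.notnull()]  # keep only the non-missing entries
ser4.dropna()         # same result: drop missing entries
ser4.fillna(0)        # or replace missing entries with 0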

head(), tail() To view a small sample of a Series or DataFrame object (we will cover DataFrames shortly),
use the head() and tail() methods. The default number of elements to display is five, but you may pass a
custom number.
[15]: ser1.head(1) # head(1) will return the first row only

[15]: Toronto 500


dtype: int64

[16]: ser1.tail(1) # tail(1) will return the last row only

[16]: Montreal 700


dtype: int64

[17]: # axes * Returns list of the row axis labels


# the list of row axis labels (the index) can be obtained with .axes
ser1.axes

[17]: [Index(['Toronto', 'Calgary', 'Vancouver', 'Montreal'], dtype='object')]

values * returns list of values/data

[18]: # returns the values/data


ser1.values

[18]: array([500, 200, 300, 700], dtype=int64)

size * returns the number of elements in the series
empty * True if the series is empty
[19]: # True for empty series
ser1.empty

[19]: False

[20]: ser1.size

[20]: 4

1.7 DataFrame
A very simple way to think about a DataFrame is “a bunch of Series put together so that they share the
same index”. * A DataFrame is a rectangular table of data that contains an ordered collection of columns,
each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both a row
and a column index; it can be thought of as a dictionary of Series all sharing the same index (see the
short sketch below). Let's learn DataFrame with examples:
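
As a quick illustration of the “dictionary of Series” idea (a minimal sketch with made-up numbers, not from
the original lecture), we can build a DataFrame directly from Series that share an index; each Series
becomes a column, aligned on the common labels:

pop = pd.Series({'Toronto': 500, 'Calgary': 200, 'Vancouver': 300})
area = pd.Series({'Toronto': 630, 'Calgary': 825, 'Vancouver': 115})  # illustrative numbers only
pd.DataFrame({'population': pop, 'area': area})  # columns aligned on the shared city index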

[21]: # Let's create two lists of labels: index for rows 'r1' to 'r10', and columns for
# columns 'c1' to 'c10'

# Using split() for revision!

import pandas as pd
import numpy as np

index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()


columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()

print(index)
print(columns)

['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

[22]: # Let's start with a simple example, using arange() and reshape() together to
# create a 2D array (matrix).

array_2d = np.arange(0, 100).reshape(10, 10) # creating a 2D array "array_2d"

print(array_2d)

[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]

[23]: # Now, let's create our first DataFrame using index, columns and array_2d!
df = pd.DataFrame(data=array_2d, index=index, columns=columns)

print(df)

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
df is our first dataframe. We have columns c1 to c10 and their corresponding rows r1 to r10. Each column
is actually a pandas Series, and they all share a common index: the row labels. Now we can play with this
dataframe df to learn how to grab the data we need, which is the most important concept we need in order
to move on in this course!

Grabbing Columns from dataframe Just pass the name of the required column in square
brackets!
[24]: # Grabbing a single column
df['c1']

[24]: r1 0
r2 10
r3 20
r4 30
r5 40
r6 50

r7 60
r8 70
r9 80
r10 90
Name: c1, dtype: int32

[25]: # We can grab more than one column, simply pass the list of columns you need!
df[['c1', 'c10']]

[25]: c1 c10
r1 0 9
r2 10 19
r3 20 29
r4 30 39
r5 40 49
r6 50 59
r7 60 69
r8 70 79
r9 80 89
r10 90 99
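
A small aside (not from the original lecture): selecting a single column with df['c1'] returns a Series,
while passing a list, even with a single name, returns a DataFrame.

type(df['c1'])    # pandas.core.series.Series
type(df[['c1']])  # pandas.core.frame.DataFrame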

1.8 Adding new column to dataframe


pandas dataframes are very handy, Let’s add a column ’new into our dataframe df by adding any
two existing columns using simple “+” operator!
[26]: df['new'] = df['c1'] + df['c2'] # adding a column "new" which is the sum of "c1" and "c2"

print(df)

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 new
r1 0 1 2 3 4 5 6 7 8 9 1
r2 10 11 12 13 14 15 16 17 18 19 21
r3 20 21 22 23 24 25 26 27 28 29 41
r4 30 31 32 33 34 35 36 37 38 39 61
r5 40 41 42 43 44 45 46 47 48 49 81
r6 50 51 52 53 54 55 56 57 58 59 101
r7 60 61 62 63 64 65 66 67 68 69 121
r8 70 71 72 73 74 75 76 77 78 79 141
r9 80 81 82 83 84 85 86 87 88 89 161
r10 90 91 92 93 94 95 96 97 98 99 181

1.9 Deleting column from dataframe


drop() We can delete any column from a dataframe using the drop() method. A few important parameters
to consider:
* labels: the column name(s) we need to pass; to drop more than one column, it must be a list of column names.
* axis: the default value is 0, which refers to rows; to drop a column, we need to pass axis=1.
* inplace: the default is False; we need to pass True for a permanent delete. inplace makes sure that we
don't delete a column by mistake: if we don't pass inplace=True, the column will not actually be dropped
from the dataframe.
[27]: # So, we have 10 rows and 11 columns in our dataframe df; "new" is the 11th one
# that we have added.

# Let's delete this column.

df.drop(['new'], axis=1, inplace=True) # If we don't pass inplace=True, the change will not be permanent

print(df)

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 1 2 3 4 5 6 7 8 9
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
r4 30 31 32 33 34 35 36 37 38 39
r5 40 41 42 43 44 45 46 47 48 49
r6 50 51 52 53 54 55 56 57 58 59
r7 60 61 62 63 64 65 66 67 68 69
r8 70 71 72 73 74 75 76 77 78 79
r9 80 81 82 83 84 85 86 87 88 89
r10 90 91 92 93 94 95 96 97 98 99
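
Since axis defaults to 0 (rows), the same method drops a row when given a row label; a minimal sketch (not
from the original lecture):

df.drop(['r10'])  # returns a copy with row r10 removed; df itself is unchanged without inplace=True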

1.10 Grabbing Rows from dataframe


We can retrieve a row by its name or position with loc and iloc.
* loc: access rows by label(s).
* iloc: access rows by integer position (index location).
[28]: # using loc, this will return rows r2 and r3; notice the list ['r2', 'r3'] in
# square brackets

df.loc[['r2', 'r3']]

[28]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29

[29]: # Using iloc, this will again return rows r2 and r3, but here our selection is
# index based!

df.iloc[[1, 2]] # remember, the index starts at 0

[29]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r2 10 11 12 13 14 15 16 17 18 19
r3 20 21 22 23 24 25 26 27 28 29
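
A small extension (not from the original lecture): loc and iloc also accept slices, so a block of rows and
columns can be grabbed in one go.

df.loc['r2':'r4', 'c1':'c3']  # label slices include both endpoints
df.iloc[1:4, 0:3]             # integer slices exclude the stop position; same rows and columns here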

1.11 Grabbing a single element from a dataframe
[30]: # We need to tell the location of the element, [row, col]
# df.loc[req_row, req_col] -- pass row, col for the element!
df.loc['r2', 'c1']

[30]: 10

[31]: # another element, say 19, which is at [r2, c10]


df.loc['r2', 'c10']

[31]: 19
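
As a side note (not from the original lecture), for a single scalar value the at and iat accessors do the
same job and are a little faster:

df.at['r2', 'c10']  # label-based scalar access, returns 19
df.iat[1, 9]        # position-based scalar access, returns 19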

Grabbing a sub-set of a dataframe We can grab a sub-set by passing a list of required rows and a list of
required columns.
[32]: # for a sub-set, pass the list
df.loc[['r1', 'r2'], ['c1', 'c2']]

[32]: c1 c2
r1 0 1
r2 10 11

[33]: # another example - random columns and rows in the list


df.loc[['r2', 'r5'], ['c3', 'c4']]

[33]: c3 c4
r2 12 13
r5 42 43

1.12 Conditional Selection or masking


pandas has excellent features; we can do conditional selection. For example, select all the values that
are greater than some value, e.g. greater than 5 in the case below!
[34]: # We can do a conditional selection as well
df > 5
# df != 0 # try this yourself
# df == 0 # try this yourself

[34]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 False False False False False False True True True True
r2 True True True True True True True True True True
r3 True True True True True True True True True True
r4 True True True True True True True True True True
r5 True True True True True True True True True True
r6 True True True True True True True True True True
r7 True True True True True True True True True True

r8 True True True True True True True True True True
r9 True True True True True True True True True True
r10 True True True True True True True True True True

[35]: # Return Divisible by 2 or even


bool_mask = df % 2 == 0 # creating mask for the required condition
df[bool_mask] # passing mask to get the required results

# df[df % 2 == 0] # Similar to the above 2 lines of code

[35]: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 0 NaN 2 NaN 4 NaN 6 NaN 8 NaN
r2 10 NaN 12 NaN 14 NaN 16 NaN 18 NaN
r3 20 NaN 22 NaN 24 NaN 26 NaN 28 NaN
r4 30 NaN 32 NaN 34 NaN 36 NaN 38 NaN
r5 40 NaN 42 NaN 44 NaN 46 NaN 48 NaN
r6 50 NaN 52 NaN 54 NaN 56 NaN 58 NaN
r7 60 NaN 62 NaN 64 NaN 66 NaN 68 NaN
r8 70 NaN 72 NaN 74 NaN 76 NaN 78 NaN
r9 80 NaN 82 NaN 84 NaN 86 NaN 88 NaN
r10 90 NaN 92 NaN 94 NaN 96 NaN 98 NaN
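
In practice, conditional selection is most often used to filter whole rows based on one column; a minimal
sketch (not from the original lecture):

df[df['c1'] > 30]                       # keeps only the rows where column c1 is greater than 30 (r5 to r10)
df[(df['c1'] > 30) & (df['c10'] < 80)]  # combine conditions with & and |, each wrapped in parentheses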

1.12.1 info()
Provides a concise summary of the DataFrame. This is a very useful method.
[36]: df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, r1 to r10
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 10 non-null int32
1 c2 10 non-null int32
2 c3 10 non-null int32
3 c4 10 non-null int32
4 c5 10 non-null int32
5 c6 10 non-null int32
6 c7 10 non-null int32
7 c8 10 non-null int32
8 c9 10 non-null int32
9 c10 10 non-null int32
dtypes: int32(10)
memory usage: 780.0+ bytes
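
A couple of related quick checks (a small aside, not from the original lecture):

df.shape    # (10, 10) - number of rows and columns
df.columns  # the column labels
df.dtypes   # the dtype of each column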

1.12.2 describe()
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a
dataset’s distribution, excluding NaN values.
[37]: df.describe()

[37]: c1 c2 c3 c4 c5 c6 \
count 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000
mean 45.000000 46.000000 47.000000 48.000000 49.000000 50.000000
std 30.276504 30.276504 30.276504 30.276504 30.276504 30.276504
min 0.000000 1.000000 2.000000 3.000000 4.000000 5.000000
25% 22.500000 23.500000 24.500000 25.500000 26.500000 27.500000
50% 45.000000 46.000000 47.000000 48.000000 49.000000 50.000000
75% 67.500000 68.500000 69.500000 70.500000 71.500000 72.500000
max 90.000000 91.000000 92.000000 93.000000 94.000000 95.000000

c7 c8 c9 c10
count 10.000000 10.000000 10.000000 10.000000
mean 51.000000 52.000000 53.000000 54.000000
std 30.276504 30.276504 30.276504 30.276504
min 6.000000 7.000000 8.000000 9.000000
25% 28.500000 29.500000 30.500000 31.500000
50% 51.000000 52.000000 53.000000 54.000000
75% 73.500000 74.500000 75.500000 76.500000
max 96.000000 97.000000 98.000000 99.000000

