
P03: Introduction to Pandas

1. SECTION 1: INTRODUCTION TO PANDAS


A. Section 1.1: SERIES
a. Creating series object
b. The index of a Series
c. Use Series as though they are numpy arrays
d. Exercise 1
B. Section 1.2: DATAFRAME
a. Generating DataFrames
b. The index, columns and values of a DataFrame
c. Accessing items in a DataFrame
d. Adding a new row or column
e. Removing a row or column
f. Use DataFrame with numerical only attributes like a numpy array
g. Exercise 2
2. SECTION 2: DATA PROCESSING TOOLS
A. Section 2.1: Loading and saving dataset
a. Reading from a csv file
b. Specifying the missing values
B. Section 2.2: Look at the data structure
a. Peeking at the data
b. Showing statistics of all numerical columns
c. Showing distribution of a categorical column
C. Section 2.3: Visualizing data
D. Section 2.4: Filtering data
a. Filtering a Series
b. Filtering a DataFrame
E. Section 2.5: Handling missing data
a. Identifying columns and rows with missing value
b. Dropping rows with missing data
c. Filling missing data with values
F. Exercise 3
3. SECTION 3: WRITING TO A CSV FILE

A. Writing to a csv file


4. SECTION 4: Advanced Topics (Optional)

A. Section 4.1: Creating contingency table for categorical columns


B. Section 4.2: Grouping samples for numerical columns
C. Exercise 4

Reference:

10 minutes to pandas
List of comprehensive pandas tutorials
In [1]:
import pandas as pd
import numpy as np

SECTION 1: INTRODUCTION TO PANDAS


pandas is a data analysis library that contains many high-level data structures and manipulation tools
designed to make data analysis fast and easy in Python. It is built on top of NumPy and makes it easy to use
in NumPy-centric applications.

There are two types of data structures in Pandas:

1. Series : a 1-dimensional, array-like object

2. DataFrame : a 2-dimensional, table-like object

Section 1.1: SERIES


A Series is a 1-D array-like object. It stores an array of items. The following code shows how to create a
Series object.

Creating series object


The following code generates a Series object.

>>> s = pd.Series([4, 7, -5, 3])

>>> s

0 4

1 7

2 -5

3 3

The items in the Series object are referenced by an index and their contents are stored in values . By
default, the position of an item is used as its index . The first column is the index of the data whereas the
second column [4, 7, -5, 3] stores the items ( values of the data) themselves.

In [2]:
s = pd.Series([4, 7, -5, 3])
s

Out[2]: 0 4

1 7

2 -5

3 3

dtype: int64
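You can inspect the two parts directly through the .values and .index attributes; a quick illustrative snippet using the s above (the exact repr may differ slightly across pandas versions):

>>> s.values
array([ 4,  7, -5,  3])
>>> s.index
RangeIndex(start=0, stop=4, step=1)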

The index of a Series


You can specify the index of a Series in two ways:

1. Initialize using a list and specify the index through the parameter index
2. Initialize using a dictionary. The index is specified by the key of the dictionary.

In [4]:
s = pd.Series([21, 'Two', 39, -4], index = ['ONE', 'TWO', 'THREE', 'FOUR'])
s

Out[4]: ONE 21

TWO Two

THREE 39

FOUR -4

dtype: object

In [5]:
data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
s = pd.Series(data)
s

Out[5]: Ohio 35000

Texas 71000

Oregon 16000

Utah 5000

dtype: int64
You can access an item of a series by means of its index name.

In [6]:
s['Ohio']

Out[6]: 35000

You can also access an item of a series by means of its position.

In [7]:
s[0]

Out[7]: 35000

You can access the index of a series through .index

In [8]:
s.index

Out[8]: Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object')

You can also change the index of a series.

In [9]:
s.index = ['Perak', 'Penang', 'Selangor', 'Melaka']
s

Out[9]: Perak 35000

Penang 71000

Selangor 16000

Melaka 5000

dtype: int64

Use Series as though they are numpy arrays


A Series acts very similarly to a numpy array, so you can use it as though you were using a numpy array.

In [10]:
s = pd.Series({'a' : 0.3, 'b' : 1., 'c' : -12., 'd' : -3., 'e' : 15})
s

Out[10]: a 0.3

b 1.0

c -12.0

d -3.0

e 15.0

dtype: float64

In [11]:
s[0]

Out[11]: 0.3

In [12]:
s[1:3]

Out[12]: b 1.0

c -12.0

dtype: float64

In [13]:
s[s > 0]

Out[13]: a 0.3

b 1.0

e 15.0

dtype: float64

In [14]:
s * 2

Out[14]: a 0.6

b 2.0

c -24.0

d -6.0

e 30.0

dtype: float64

In [15]:
np.exp(s)

Out[15]: a 1.349859e+00

b 2.718282e+00

c 6.144212e-06

d 4.978707e-02

e 3.269017e+06

dtype: float64

In [16]:
s.sum()

Out[16]: 1.3000000000000007

In [17]:
s.mean()

Out[17]: 0.2600000000000001

Exercise 1
Q1. Given the following table:

year population

1995 56656

2000 70343

2005 93420

2010 122330

2015 223234

(a) Create a dictionary named data where the key is the year and the value is the population. Then, use
data to create a series object named population .

In [18]:
data = {1995: 56656, 2000: 70343, 2005: 93420, 2010: 122330, 2015: 223234}

population = pd.Series(data)

population

Out[18]: 1995 56656

2000 70343

2005 93420

2010 122330

2015 223234

dtype: int64
(b) Go through the index of population and print all data for years before 2010. (Hint: use a for loop to loop through the index.)

Expected output:

1995 : 56656

2000 : 70343

2005 : 93420

In [19]:
for year in population.index:

if year < 2010:

print(year,':', population[year])

1995 : 56656

2000 : 70343

2005 : 93420

Q2. (a) Create the Series object X as shown below where each item is the square of the item's position. Try
to do this in a single line of code and do not use a dictionary to initialize the series. Use the parameter
index instead.

Hints: Use list comprehension to generate the list of indices where you should use 'Item{:d}'.format(i)
to generate the index at location i .

Item0 0

Item1 1

Item2 4

Item3 9

Item4 16

Item5 25

Item6 36

Item7 49

Item8 64

Item9 81

dtype: int32

In [20]:
X = pd.Series(np.arange(10)**2, index = ['Item{:d}'.format(i) for i in range(10)])
X

Out[20]: Item0 0

Item1 1

Item2 4

Item3 9

Item4 16

Item5 25

Item6 36

Item7 49

Item8 64

Item9 81

dtype: int32
(b) Select all items from X that are divisible by 3.

Ans:

Item0 0

Item3 9

Item6 36

Item9 81

dtype: int32

In [21]:
X[X%3 == 0]

Out[21]: Item0 0

Item3 9

Item6 36

Item9 81

dtype: int32
(c) Standardize the data as follows:

$$x_i = \frac{x_i - \mathrm{mean}(X)}{\mathrm{std}(X)}$$

where $x_i$ is the i-th item in X.

Ans:

Item0 -1.006893

Item1 -0.971564

Item2 -0.865575

Item3 -0.688927

Item4 -0.441620

Item5 -0.123654

Item6 0.264972

Item7 0.724257

Item8 1.254200

Item9 1.854803

dtype: float64

In [22]:
(X - X.mean())/X.std()

Out[22]: Item0 -1.006893

Item1 -0.971564

Item2 -0.865575

Item3 -0.688927

Item4 -0.441620

Item5 -0.123654

Item6 0.264972

Item7 0.724257

Item8 1.254200

Item9 1.854803

dtype: float64

Section 1.2: DATAFRAME


A dataframe is a 2-dimensional data structure. It may consist of columns of potentially different types.

Items in a dataframe are referenced by:

1. index (which indexes the rows) and


2. columns (which indexes the columns).

Generating DataFrames
We can construct DataFrames from a numpy array. The following code generates a DataFrame. By default, the index is the row position and the columns are the column positions of the samples.

In [23]:
df = pd.DataFrame(np.random.randn(6,4))

df

Out[23]: 0 1 2 3

0 0.437944 1.247859 2.238804 0.146882

1 1.057917 -1.177278 -0.623115 0.410554

2 1.715567 -0.129310 1.018508 -0.320621

3 -0.091573 -0.286824 -1.030225 0.782606

4 -0.845978 -1.244619 1.177544 0.672923

5 -0.378508 1.112680 -0.106299 1.765793

We can specify the index names of the rows ( index ) and columns ( columns ).

In [24]:
df = pd.DataFrame(np.random.randn(6,4), index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6'],
                  columns = ['c1', 'c2', 'c3', 'c4'])
df

Out[24]: c1 c2 c3 c4

r1 0.184420 -1.902958 0.131049 -0.446624

r2 0.469671 -0.897160 -0.588562 -0.467154

r3 0.247959 -1.375558 0.786026 -1.183577

r4 0.076139 -0.830494 0.684083 2.250017

r5 -1.485974 0.948744 0.452234 0.640522

r6 0.686538 2.620036 0.128378 0.224094

We can also construct DataFrames using a dictionary of lists. Each column can be of a different type. We can check the types of the columns by using the command df.dtypes .

In [25]:
data = {'year': [2000, 2001, 2003, 2004, 2005],

'state': ['Yokohama', 'Tokyo', 'Kyoto', 'Hokaido', 'Osaka'],

'population': [1.5, 1.7, 3.6, 2.4, 2.9] }

df = pd.DataFrame(data, index = ['1st', '2nd', '3rd', '4th', '5th'])


df

Out[25]: year state population

1st 2000 Yokohama 1.5

2nd 2001 Tokyo 1.7

3rd 2003 Kyoto 3.6

4th 2004 Hokaido 2.4

5th 2005 Osaka 2.9

Most of the time, we would like to arrange the columns in a certain order. You can do so through the order you provide in the parameter columns .

In [26]:
df = pd.DataFrame(data, index = ['1st', '2nd', '3rd', '4th', '5th'],
                  columns = ['state', 'year', 'population'])
df

Out[26]: state year population

1st Yokohama 2000 1.5

2nd Tokyo 2001 1.7

3rd Kyoto 2003 3.6

4th Hokaido 2004 2.4

5th Osaka 2005 2.9

To check the types of each column, use the command <DataFrame>.dtypes

In [27]:
df.dtypes

Out[27]: state object

year int64

population float64

dtype: object

The index, columns and values of a DataFrame


You can explicitly access the index, columns and values of a DataFrame through the index , columns
and values attributes.

The type of values is an ndarray (numpy array).


The types of index and columns remain pandas datatypes ( pandas.indexes.base.Index and pandas.indexes.range.RangeIndex , respectively).

In [28]:
df.index

Out[28]: Index(['1st', '2nd', '3rd', '4th', '5th'], dtype='object')

In [29]:
df.columns

Out[29]: Index(['state', 'year', 'population'], dtype='object')

In [30]:
df.values

Out[30]: array([['Yokohama', 2000, 1.5],

['Tokyo', 2001, 1.7],

['Kyoto', 2003, 3.6],

['Hokaido', 2004, 2.4],

['Osaka', 2005, 2.9]], dtype=object)

Accessing items in a DataFrame


Accessing a single item

We can retrieve a specific item using the index and column names using the command .at[row_index,
column_index]

In [31]:
df.at['5th','state'] # Indexing order: [row, column]

Out[31]: 'Osaka'

In [32]:
df.at['5th', 'state'] = 'Nagasaki'

df

Out[32]: state year population

1st Yokohama 2000 1.5

2nd Tokyo 2001 1.7

3rd Kyoto 2003 3.6

4th Hokaido 2004 2.4

5th Nagasaki 2005 2.9

We can retrieve a specific item using the row and column positions using the command
.iat[row_position, column_position]

In [33]:
df.iat[4, 0]

Out[33]: 'Nagasaki'

In [34]:
df.iat[4, 0] = 'Osaka'

df

Out[34]: state year population

1st Yokohama 2000 1.5

2nd Tokyo 2001 1.7

3rd Kyoto 2003 3.6

4th Hokaido 2004 2.4

5th Osaka 2005 2.9

Accessing a single column


(1) Use the column index to access a particular column:

In [35]:
df['year'] # access column 'year' by dict-like notation

Out[35]: 1st 2000

2nd 2001

3rd 2003

4th 2004

5th 2005

Name: year, dtype: int64

In [36]:
df['year'] = np.arange(2010, 2015) # modifies the content of column 'year'

df

Out[36]: state year population

1st Yokohama 2010 1.5

2nd Tokyo 2011 1.7

3rd Kyoto 2012 3.6

4th Hokaido 2013 2.4

5th Osaka 2014 2.9

To access multiple columns, pass a list containing the column names which we wish to access.

In [37]:
df[['state', 'population']]

Out[37]: state population

1st Yokohama 1.5

2nd Tokyo 1.7

3rd Kyoto 3.6

4th Hokaido 2.4

5th Osaka 2.9

(2) A column can also be accessed by attribute. Each column appears as an individual attribute of the DataFrame object.

For example, df.state is equivalent to df['state'] .

In [38]:
df.state # Similar to df['state']

Out[38]: 1st Yokohama

2nd Tokyo

3rd Kyoto

4th Hokaido

5th Osaka

Name: state, dtype: object

In [39]:
df.year

Out[39]: 1st 2010

2nd 2011

3rd 2012

4th 2013

5th 2014

Name: year, dtype: int32

In [40]:
df.population

Out[40]: 1st 1.5

2nd 1.7

3rd 3.6

4th 2.4

5th 2.9

Name: population, dtype: float64


(3) You can also access a DataFrame's column(s) through the .loc and .iloc attributes.

In [41]:
df.loc[:, 'year']

Out[41]: 1st 2010

2nd 2011

3rd 2012

4th 2013

5th 2014

Name: year, dtype: int32

In [42]:
df.loc[:, ['state', 'population']]

Out[42]: state population

1st Yokohama 1.5

2nd Tokyo 1.7

3rd Kyoto 3.6

4th Hokaido 2.4

5th Osaka 2.9

In [43]:
df.iloc[:, [1,2]]

Out[43]: year population

1st 2010 1.5

2nd 2011 1.7

3rd 2012 3.6

4th 2013 2.4

5th 2014 2.9

Accessing row(s) in a DataFrame

Unlike numpy, we cannot use the row position or row index directly to access a row of a DataFrame.

In [44]:
#df[0] # This is not allowed. Will generate an error

#df['1st'] # This is not allowed. Will generate an error

(1) To access a DataFrame's row by its row position, we have to use slice indexing.
In [45]: df[0:1] # Access the first row

Out[45]: state year population

1st Yokohama 2010 1.5

In [46]:
df[0:1] = [100, 'KL', 1200] # Change the content of the first row

df

Out[46]: state year population

1st 100 KL 1200.0

2nd Tokyo 2011 1.7

3rd Kyoto 2012 3.6

4th Hokaido 2013 2.4

5th Osaka 2014 2.9

In [47]:
df[1:3] # Read the second and third rows

Out[47]: state year population

2nd Tokyo 2011 1.7

3rd Kyoto 2012 3.6

In [48]:
df[1:3] = [[200, 'TAPAH', 1300], [300, 'KAMPAR', 1400]] # Change the second and third rows
df

Out[48]: state year population

1st 100 KL 1200.0

2nd 200 TAPAH 1300.0

3rd 300 KAMPAR 1400.0

4th Hokaido 2013 2.4

5th Osaka 2014 2.9

(2) We can also access specific rows of a DataFrame using .loc and .iloc attributes.

In [49]:
df.loc['3rd']

Out[49]: state 300

year KAMPAR

population 1400.0

Name: 3rd, dtype: object

In [50]:
df.iloc[2] # Read the third row

Out[50]: state 300

year KAMPAR

population 1400.0

Name: 3rd, dtype: object


In [51]:
df.iloc[1:3] # Read the second and third rows

Out[51]: state year population

2nd 200 TAPAH 1300.0

3rd 300 KAMPAR 1400.0

Notes: Indexing can be quite confusing in pandas. In general, for columns, use the column name or a list of column names. For rows, use .iloc or slicing on the items' positions.

df['population'] or df[['population', 'state']] accesses the column(s) of the dataframe


df.iloc[2] or df[2:3] accesses the row(s) of the dataframe

Accessing a block of data using .loc and .iloc attributes

In [52]:
df = pd.DataFrame(data, index = ['1st', '2nd', '3rd', '4th', '5th'],
                  columns = ['state', 'year', 'population'])
df

Out[52]: state year population

1st Yokohama 2000 1.5

2nd Tokyo 2001 1.7

3rd Kyoto 2003 3.6

4th Hokaido 2004 2.4

5th Osaka 2005 2.9

In [53]:
df.iloc[[0,2], [1,2]] # Extract rows 0 and 2 (`1st` and `3rd`), and columns 1 and 2 (`year` and `population`)

Out[53]: year population

1st 2000 1.5

3rd 2003 3.6

In [54]:
df.loc[['1st','3rd'],['year','population']] # same as above

Out[54]: year population

1st 2000 1.5

3rd 2003 3.6

In [55]:
df.iloc[:, 1:3] # Extract items at all rows, 2nd and 3rd columns

Out[55]: year population

1st 2000 1.5

2nd 2001 1.7

3rd 2003 3.6


4th 2004 2.4

5th 2005 2.9

In [56]:
df.iloc[1:3, :] # Extract items at 2nd and 3rd rows, all columns

Out[56]: state year population

2nd Tokyo 2001 1.7

3rd Kyoto 2003 3.6

Adding a new row or column


Adding a new row or column to a DataFrame object in pandas is similar to adding an item to a dict object in Python.

Warning: Unlike a Python list, pandas is designed to load fully populated DataFrames. Minimize adding rows or columns to an existing DataFrame in your system.

In [57]:
df.loc['6th'] = ['City Y', 2006, -2.0]

df

Out[57]: state year population

1st Yokohama 2000 1.5

2nd Tokyo 2001 1.7

3rd Kyoto 2003 3.6

4th Hokaido 2004 2.4

5th Osaka 2005 2.9

6th City Y 2006 -2.0

In [58]:
df['LargeCity'] = df['population'] > 2.5

df

Out[58]: state year population LargeCity

1st Yokohama 2000 1.5 False

2nd Tokyo 2001 1.7 False

3rd Kyoto 2003 3.6 True

4th Hokaido 2004 2.4 False

5th Osaka 2005 2.9 True

6th City Y 2006 -2.0 False

Removing a row or column


df.drop
We can also use df.drop to drop a row or column. If you want to drop a row, set axis to 0. For columns, set axis to 1.

df.drop returns a copy of the original Dataframe object. The original object remains unchanged. If you
wish to drop a row in the original object, set the parameter inplace to True.

In [59]:
df.drop('6th', axis = 0, inplace = True) # Drop the last row

df

Out[59]: state year population LargeCity

1st Yokohama 2000 1.5 False

2nd Tokyo 2001 1.7 False

3rd Kyoto 2003 3.6 True

4th Hokaido 2004 2.4 False

5th Osaka 2005 2.9 True

In [60]:
df.drop(['3rd', '4th'], axis = 0, inplace = True) # Drop the 3rd and 4th row

df

Out[60]: state year population LargeCity

1st Yokohama 2000 1.5 False

2nd Tokyo 2001 1.7 False

5th Osaka 2005 2.9 True

In [61]:
print(df.drop('LargeCity', axis = 1)) # drop the column 'LargeCity' (not in place)
df

state year population

1st Yokohama 2000 1.5

2nd Tokyo 2001 1.7

5th Osaka 2005 2.9

Out[61]: state year population LargeCity

1st Yokohama 2000 1.5 False

2nd Tokyo 2001 1.7 False

5th Osaka 2005 2.9 True

In [62]:
df.drop(['year', 'population'], axis = 1) # drop the columns 'year' and 'population'

Out[62]: state LargeCity

1st Yokohama False

2nd Tokyo False

5th Osaka True

Use DataFrame with numerical only attributes like a numpy array


If the dataframe contains only numerical columns, then you can process it like a numpy array. The only difference is in how you index the DataFrame.

In [63]:
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 3)), index = np.arange(5), columns = list('ABC'))
df

Out[63]: A B C

0 0.003197 0.767135 0.397991

1 -0.530468 0.392681 -0.133591

2 -0.676254 0.866036 -0.300075

3 0.186886 -0.674477 -0.215064

4 -0.729401 0.961789 -0.771256

In [64]:
df['C']*10

Out[64]: 0 3.979915

1 -1.335913

2 -3.000748

3 -2.150644

4 -7.712557

Name: C, dtype: float64

In [65]:
1/df

Out[65]: A B C

0 312.760488 1.303552 2.512617

1 -1.885128 2.546597 -7.485520

2 -1.478735 1.154687 -3.332503

3 5.350843 -1.482631 -4.649770

4 -1.370988 1.039729 -1.296587

In [66]:
df > 0 # Create an indicator matrix signifying if each item in df is positive

Out[66]: A B C

0 True True True

1 False True False

2 False True False

3 True False False

4 False True False

In [67]:
df[df.A > 0] # Only retain rows with positive entries for column A

Out[67]: A B C

0 0.003197 0.767135 0.397991


3 0.186886 -0.674477 -0.215064

In [68]:
df[df > 0] # Only retain items that meet the criteria df > 0; items that fail are set to NaN

Out[68]: A B C

0 0.003197 0.767135 0.397991

1 NaN 0.392681 NaN

2 NaN 0.866036 NaN

3 0.186886 NaN NaN

4 NaN 0.961789 NaN

In [69]:
df['A'] > 0 # Filtering column A for values greater than 0

Out[69]: 0 True

1 False

2 False

3 True

4 False

Name: A, dtype: bool

In [70]:
np.exp(df) # Apply the exponential function on df

Out[70]: A B C

0 1.003202 2.153587 1.488831

1 0.588330 1.480946 0.874948

2 0.508519 2.377467 0.740763

3 1.205490 0.509423 0.806489

4 0.482198 2.616374 0.462432

In [71]:
1/(1+np.exp(-df)) # Apply the sigmoid function on all items in df

Out[71]: A B C

0 0.500799 0.682901 0.598205

1 0.370408 0.596928 0.466652

2 0.337098 0.703920 0.425539

3 0.546586 0.337495 0.446440

4 0.325326 0.723480 0.316208

In [72]:
df.mean() # get the mean of all columns

Out[72]: A -0.349208

B 0.462633

C -0.204399

dtype: float64

In [73]:
df.mean(axis = 1) # get the mean of all rows

Out[73]: 0 0.389441

1 -0.090459

2 -0.036764

3 -0.234218

4 -0.179623

dtype: float64

Exercise 2
Q1. Create the following DataFrame df to store the coursework and final examination marks of a group of
students. You must use the ID as the index of the dataframe.

ID Name Programme Coursework Final

S1 Mandy CS 79.9 100

S2 Joseph CS 34.5 90

S3 Jonathan IA 25.5 30

S4 Linda IA 70.9 95

S5 Sophos CS 80.2 10

In [74]:
data = {'Name': ['Mandy', 'Joseph', 'Jonathan', 'Linda', 'Sophos'],

'Programme': ['CS', 'CS', 'IA', 'IA', 'CS'],

'Coursework': [79.9, 34.5, 25.5, 70.9, 80.2],

'Final': [100, 90, 30, 95, 10]}

df = pd.DataFrame(data, index = ['S' + str(i) for i in range(1, 6)],
                  columns = ['Name', 'Programme', 'Coursework', 'Final'])
df

Out[74]: Name Programme Coursework Final

S1 Mandy CS 79.9 100

S2 Joseph CS 34.5 90

S3 Jonathan IA 25.5 30

S4 Linda IA 70.9 95

S5 Sophos CS 80.2 10

Q2. Get student with ID S3 .

Hints: There are at least three ways to access a desired row. You can use either (1) .loc , (2) .iloc or (3)
slice indexing to do so.

Ans:

Name Jonathan

Programme IA

Coursework 25.5

Final 30

Name: S3, dtype: object


In [75]: # df[2:3] # using slice indexing

# df.loc['S3'] # using .loc

df.iloc[2] # using .iloc

Out[75]: Name Jonathan

Programme IA

Coursework 25.5

Final 30

Name: S3, dtype: object


Q3. Get the final marks of all students.

Hints: There are at least four different ways to access a desired column. You can use either (1) .loc , (2)
.iloc , (3) column indexing or (4) attribute.

Ans:

S1 100

S2 90

S3 30

S4 95

S5 10

Name: Final, dtype: int64

In [76]:
# df.loc[:, 'Final']

# df.iloc[:, 3]

# df['Final']

df.Final

Out[76]: S1 100

S2 90

S3 30

S4 95

S5 10

Name: Final, dtype: int64


Q4. Get the information for the students from ID S3 to S4.

Ans:

In [77]:
df[2:4]

Out[77]: Name Programme Coursework Final

S3 Jonathan IA 25.5 30

S4 Linda IA 70.9 95

Q5. Get the following information for all students: Name, Programme and Final.

Ans:
In [78]:
df[['Name', 'Programme', 'Final']]

Out[78]: Name Programme Final

S1 Mandy CS 100

S2 Joseph CS 90

S3 Jonathan IA 30

S4 Linda IA 95

S5 Sophos CS 10

Q6. Find all students who have a final mark less than 40.

Ans:

In [79]:
df[df.Final<40]

Out[79]: Name Programme Coursework Final

S3 Jonathan IA 25.5 30

S5 Sophos CS 80.2 10

Q7. Create a new column Average that stores the average of the final and coursework marks.

Ans:

In [80]:
df['Average'] = (df.Final + df.Coursework)/2

df

Out[80]: Name Programme Coursework Final Average

S1 Mandy CS 79.9 100 89.95

S2 Joseph CS 34.5 90 62.25

S3 Jonathan IA 25.5 30 27.75

S4 Linda IA 70.9 95 82.95

S5 Sophos CS 80.2 10 45.10

Q8. Drop the students S3 and S5 from our list.

In [81]:
df.drop(['S3', 'S5'], inplace=True)

df

Out[81]: Name Programme Coursework Final Average

S1 Mandy CS 79.9 100 89.95

S2 Joseph CS 34.5 90 62.25

S4 Linda IA 70.9 95 82.95

SECTION 2: DATA PROCESSING TOOLS


It is common to explore dataframes to understand the data and gain insight into what we are dealing with. Exploring the data allows us to identify important features (columns) and get a sense of the integrity and quality of the data (e.g., are there many missing values in the dataset?).

Section 2.1: Loading and saving dataset


There are many ways to store data which include normal text files, csv files, HDF5, etc. In this practical, we
only look at how to read from and write to csv files. A csv file is a common file format used to store tabular
data where its data fields are most often separated by a comma.

Reading from a csv file


(1) With the first row as header

The following code loads the file testresult_withheader.csv in the working directory. Open the file to see what
we are reading. Note that the first row is the header which specifies the names of each column.

The command .head() displays the first few (default 5) rows of a DataFrame. You can pass the number of rows as an argument, e.g. .head(2) .

In [82]:
loaded_df = pd.read_csv('testresult_withheader.csv')

loaded_df.head()

Out[82]: Unnamed: 0 first_name last_name age Gender State Test1 Test2

0 0 Jason Miller 42.0 Male Perak 78.0 90

1 1 Molly Jacobson 52.0 Female Johor 75.0 45

2 2 Joseph - NaN Male Penang 31.0 90

3 3 Jessica Linsey 24.0 Female Johor NaN 62

4 4 Amy Cooze 73.0 NaN Kedah NaN 35

(2) Specifying the index column

Notice that the first column of testresult_withheader.csv is used to index the dataset. We can specify this
through the parameter index_col .

In [83]:
df = pd.read_csv('testresult_withheader.csv', index_col=0)

df.head()

Out[83]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90

1 Molly Jacobson 52.0 Female Johor 75.0 45

2 Joseph - NaN Male Penang 31.0 90

3 Jessica Linsey 24.0 Female Johor NaN 62

4 Amy Cooze 73.0 NaN Kedah NaN 35

(3) Without header file

Some csv files do not come with a header. Open the file testresult_noheader.csv in the working directory to see the file that we are going to load. Note that there is no header row in the file.

In [84]:
df = pd.read_csv('testresult_noheader.csv', header=None)

df.head()

Out[84]: 0 1 2 3 4 5 6 7

0 0 Jason Miller 42.0 Male Perak 78.0 90

1 1 Molly Jacobson 52.0 Female Johor 75.0 45

2 2 Joseph - NaN Male Penang 31.0 90

3 3 Jessica Linsey 24.0 Female Johor NaN 62

4 4 Amy Cooze 73.0 NaN Kedah NaN 35

(4) Specifying the name of each column

For files without headers, we may want to specify the names of the columns ourselves. To do this, we can specify the names of the columns through the parameter names .

In [85]:
df = pd.read_csv('testresult_noheader.csv', names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

df.head()

Out[85]: A B C D E F G

0 Jason Miller 42.0 Male Perak 78.0 90

1 Molly Jacobson 52.0 Female Johor 75.0 45

2 Joseph - NaN Male Penang 31.0 90

3 Jessica Linsey 24.0 Female Johor NaN 62

4 Amy Cooze 73.0 NaN Kedah NaN 35

Specifying the missing values


When we load testresult_withheader.csv, we actually encounter an issue. Look at the types of the columns below. Take note of the type of the column Test2 . It should be loaded as type float64 but it was loaded as object .

In [86]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0)

df

Out[86]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90

1 Molly Jacobson 52.0 Female Johor 75.0 45

2 Joseph - NaN Male Penang 31.0 90

3 Jessica Linsey 24.0 Female Johor NaN 62

4 Amy Cooze 73.0 NaN Kedah NaN 35

5 Jackson - 21.0 Male Penang 85.0 100

6 Amy - 80.0 Female Perak 60.0 NAN

7 Monica Lieber 20.0 Female Kedah 70.0 85

8 Bill Gates 78.0 Male Perak 100.0 100

9 Steve Billy 50.0 Male Penang 80.0 99

In [87]:
df.dtypes

Out[87]: first_name object

last_name object

age float64

Gender object

State object

Test1 float64

Test2 object

dtype: object
This is because one of the samples has 'NAN' for Test2 . Pandas is unable to recognize it as a missing value and therefore treats the whole column as strings. By default, pandas will only consider the following entries as missing values:

1. empty space
2. NA
3. N/A
4. NaN
5. NULL

na_values

You can tell pandas to identify other types of missing value through the parameter na_values . The
following code shows how to specify the missing value in a DataFrame.

In [88]:
df = pd.read_csv('testresult_withheader.csv',

index_col = 0,

na_values=['NAN','-'])

df

Out[88]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

1 Molly Jacobson 52.0 Female Johor 75.0 45.0

2 Joseph NaN NaN Male Penang 31.0 90.0

3 Jessica Linsey 24.0 Female Johor NaN 62.0

4 Amy Cooze 73.0 NaN Kedah NaN 35.0

5 Jackson NaN 21.0 Male Penang 85.0 100.0

6 Amy NaN 80.0 Female Perak 60.0 NaN

7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0

In [89]:
df.dtypes

Out[89]: first_name object

last_name object

age float64

Gender object

State object

Test1 float64

Test2 float64

dtype: object
We will demonstrate how to handle missing values later.

Section 2.2: Look at the data structure


Peeking at the data
df.head , df.tail

To view a small sample of a Series or DataFrame object, use df.head to peek at the first few rows and df.tail to peek at the last few rows. The default number of rows to display is five, but you may pass in a custom number.

In [90]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN','-'])

df.head() # Peek at the first 5 rows of df

Out[90]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

1 Molly Jacobson 52.0 Female Johor 75.0 45.0

2 Joseph NaN NaN Male Penang 31.0 90.0

3 Jessica Linsey 24.0 Female Johor NaN 62.0

4 Amy Cooze 73.0 NaN Kedah NaN 35.0

In [91]:
df.tail(6) # Peek at the last 6 rows of df

Out[91]: first_name last_name age Gender State Test1 Test2

4 Amy Cooze 73.0 NaN Kedah NaN 35.0

5 Jackson NaN 21.0 Male Penang 85.0 100.0

6 Amy NaN 80.0 Female Perak 60.0 NaN

7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0

Showing statistics of all numerical columns


df.describe()

We can use describe() to get the statistics of all numerical columns of a DataFrame (excluding NaN data). The categorical columns are ignored.

In [92]:
stat = df.describe()

stat

Out[92]: age Test1 Test2

count 9.000000 8.000000 9.000000

mean 48.888889 72.375000 78.444444

std 24.204568 20.318447 24.845076

min 20.000000 31.000000 35.000000

25% 24.000000 67.500000 62.000000

50% 50.000000 76.500000 90.000000

75% 73.000000 81.250000 99.000000

max 80.000000 100.000000 100.000000
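If you also want a summary of the non-numerical columns, describe accepts an include parameter; a small illustrative sketch on the same df:

df.describe(include='object') # summarize only the categorical (object) columns
df.describe(include='all') # summarize every column, numerical and categorical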

Showing distribution of a categorical column


value_counts()
Often, we need to compute the frequency of unique items in a Series object. To do that, we can use value_counts() , which computes a histogram of a Series object.

In [93]:
df.Gender.value_counts()

Out[93]: Male 5

Female 4

Name: Gender, dtype: int64

In [94]:
df.State.value_counts()

Out[94]: Perak 3

Penang 3

Johor 2

Kedah 2

Name: State, dtype: int64
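To show the distribution as proportions rather than raw counts, value_counts also accepts normalize=True ; a quick sketch on the same column:

df.Gender.value_counts(normalize=True)
# Male      0.555556
# Female    0.444444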

Section 2.3: Visualizing data


Plotting line graphs
By default, when we plot numerical values, we get line graphs.

In [95]:
df = pd.DataFrame({"date": ["Jan", "Feb", "Mac", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct
"book_sales": 2*np.arange(12) - 2 + 3*np.random.randn(12),

"stationary_sales": 0.5*np.arange(12) + 1 + 2*np.random.randn(12),

"cd_sales": 3*np.arange(12) + 3 + 3*np.random.randn(12)})

df

Out[95]: date book_sales stationary_sales cd_sales

0 Jan -4.511942 2.204728 -0.098397

1 Feb -1.007992 1.435572 5.219675

2 Mac -0.188969 5.241206 7.635777

3 Apr 5.144244 5.689594 13.165905

4 May -0.695870 3.343535 10.277410

5 Jun 3.010572 4.286440 19.123507

6 Jul 10.046888 6.021715 25.104942

7 Aug 10.097558 5.449067 21.044820

8 Sep 14.194817 7.496135 28.785170

9 Oct 16.013140 3.152809 29.728839

10 Nov 21.657639 4.607428 31.623140

11 Dec 20.671968 4.349733 34.335153

By default, .plot plots the line graph of all features.

In [96]:
import matplotlib.pyplot as plt # import Matplotlib plotting package

df.plot()

plt.show()

We can specify which features to plot by using the x and y parameters.

In [97]:
df.plot(x = 'date', y = ['book_sales', 'cd_sales'])

plt.show()

Plotting scatter graphs


In [98]:
df.plot(kind='scatter', x='book_sales', y='cd_sales')

plt.show()

To plot multiple groups on the same axes, repeat the plot method while specifying the target ax . Use different color and label keywords to distinguish the groups.
In [99]:
ax1 = df.plot(kind='scatter', x='book_sales', y='cd_sales', color='Red', label='CD')

df.plot(kind='scatter', x='book_sales', y='stationary_sales', color='Blue', label='Stationary', ax=ax1)


plt.show()

Plotting other kinds of graphs


The kind keyword argument of plot() accepts a handful of values for plots other than the line and scatter plots (a short sketch follows the list below). These include:

bar or barh for bar plots


hist for histogram
box for boxplot
kde or density for density plots
area for area plots
hexbin for hexagonal bin plots
pie for pie plots

Refer to this link for more information.
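For instance, a minimal sketch of two of these kinds, reusing the sales df from above:

df.plot(kind='bar', x='date', y='cd_sales') # one bar of CD sales per month
plt.show()

df['book_sales'].plot(kind='hist') # histogram of the book sales values
plt.show()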

Section 2.4: Filtering data


Filtering a Series
One common operation is to filter the data to extract relevant rows from our dataset. This can be done using boolean operators. Unlike conventional Python syntax, which uses or , and and not for OR, AND and inversion, pandas uses the operators | , & and ~ .

In [100]:
s = pd.Series(range(-3, 4))
s

Out[100]: 0 -3

1 -2

2 -1

3 0

4 1

5 2

6 3

dtype: int64
In [101]:
s[s > 0]

Out[101]: 4 1

5 2

6 3

dtype: int64

In [102]:
s[(s < -1) | (s > 0.5)]

Out[102]: 0 -3

1 -2

4 1

5 2

6 3

dtype: int64

In [103]:
s[(s >= -1) & (s <= 0.5)]

Out[103]: 2 -1

3 0

dtype: int64

In [104]:
s[~(s < 0)]

Out[104]: 3 0

4 1

5 2

6 3

dtype: int64

Filtering a DataFrame
We can also select samples (rows) that fulfils some desired criteria.

In [105]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN', 'NaN','-'])

df

Out[105]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

1 Molly Jacobson 52.0 Female Johor 75.0 45.0

2 Joseph NaN NaN Male Penang 31.0 90.0

3 Jessica Linsey 24.0 Female Johor NaN 62.0

4 Amy Cooze 73.0 NaN Kedah NaN 35.0

5 Jackson NaN 21.0 Male Penang 85.0 100.0

6 Amy NaN 80.0 Female Perak 60.0 NaN

7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0

The following code filters the list of persons who scored more than 50 for Test2.

In [107]:
df[df.Test2 > 50]

Out[107]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

2 Joseph NaN NaN Male Penang 31.0 90.0

3 Jessica Linsey 24.0 Female Johor NaN 62.0

5 Jackson NaN 21.0 Male Penang 85.0 100.0

7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0

The following code filters the list of persons who scored more than 50 for both Test1 and Test2.

In [108]:
df[(df.Test1 > 50) & (df.Test2 > 50)]

Out[108]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

5 Jackson NaN 21.0 Male Penang 85.0 100.0

7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0

Section 2.5: Handling missing data


pandas uses the value np.nan to represent missing data.

In [109]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN','-'])

df

Out[109]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

1 Molly Jacobson 52.0 Female Johor 75.0 45.0

2 Joseph NaN NaN Male Penang 31.0 90.0

3 Jessica Linsey 24.0 Female Johor NaN 62.0

4 Amy Cooze 73.0 NaN Kedah NaN 35.0

5 Jackson NaN 21.0 Male Penang 85.0 100.0

6 Amy NaN 80.0 Female Perak 60.0 NaN

7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0


Identifying columns and rows with missing values
<DataFrame>.isnull()

df.isnull() returns a boolean DataFrame of the same size as df which indicates whether each element is null.

In [110]:
df.isnull()

Out[110]: first_name last_name age Gender State Test1 Test2

0 False False False False False False False

1 False False False False False False False

2 False True True False False False False

3 False False False False False True False

4 False False False True False True False

5 False True False False False False False

6 False True False False False False True

7 False False False False False False False

8 False False False False False False False

9 False False False False False False False

(1) Identifying columns with missing values

In [111]:
df.isnull().any()

Out[111]: first_name False

last_name True

age True

Gender True

State False

Test1 True

Test2 True

dtype: bool
(2) Showing columns with missing values

In [112]:
columns_with_missing_values = df.columns[df.isnull().any()]

columns_with_missing_values

Out[112]: Index(['last_name', 'age', 'Gender', 'Test1', 'Test2'], dtype='object')

In [113]:
df[columns_with_missing_values]

Out[113]: last_name age Gender Test1 Test2

0 Miller 42.0 Male 78.0 90.0

1 Jacobson 52.0 Female 75.0 45.0

2 NaN NaN Male 31.0 90.0

3 Linsey 24.0 Female NaN 62.0


4 Cooze 73.0 NaN NaN 35.0

5 NaN 21.0 Male 85.0 100.0

6 NaN 80.0 Female 60.0 NaN

7 Lieber 20.0 Female 70.0 85.0

8 Gates 78.0 Male 100.0 100.0

9 Billy 50.0 Male 80.0 99.0
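(3) Likewise, rows that contain at least one missing value can be selected by checking along axis 1; a minimal sketch:

df[df.isnull().any(axis=1)] # keep only the rows that have at least one NaN (rows 2 to 6 here)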

Dropping rows with missing data


Sometimes we need to remove samples (rows) with missing values. The following code drops a row if any value is NA ( how='any' , the default). If we wish to drop a row only if all values (columns) are NA, then we need to set how='all' (see the sketch after the example below).

The command df.dropna() only returns a copy of the result and does not update the df itself. To
update df , use the parameter inplace = True .

In [114]:
df.dropna(inplace = True)

df

Out[114]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

1 Molly Jacobson 52.0 Female Johor 75.0 45.0

7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0
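For comparison, a minimal sketch of the how='all' variant; no row of this dataset is entirely NA, so it would drop nothing:

df.dropna(how='all') # drop a row only when every column is NaN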

Filling missing data with values


We can also fill cells that have missing values using the command .fillna() . Again, .fillna() returns a copy of the result without updating df itself. To update df , set the parameter inplace=True .

In [115]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN', 'NaN','-'])

df.fillna(value=5) # replacing missing values with a value of 5

Out[115]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.0 90.0

1 Molly Jacobson 52.0 Female Johor 75.0 45.0

2 Joseph 5 5.0 Male Penang 31.0 90.0

3 Jessica Linsey 24.0 Female Johor 5.0 62.0

4 Amy Cooze 73.0 5 Kedah 5.0 35.0

5 Jackson 5 21.0 Male Penang 85.0 100.0

6 Amy 5 80.0 Female Perak 60.0 5.0


7 Monica Lieber 20.0 Female Kedah 70.0 85.0

8 Bill Gates 78.0 Male Perak 100.0 100.0

9 Steve Billy 50.0 Male Penang 80.0 99.0

In [116]:
df.fillna(df.mean()) # replace missing values for numerical columns with their mean values

C:\Users\kwanl\AppData\Local\Temp/ipykernel_15332/942395719.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
  df.fillna(df.mean()) # replace missing values for numerical columns with their mean values

Out[116]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.000000 Male Perak 78.000 90.000000

1 Molly Jacobson 52.000000 Female Johor 75.000 45.000000

2 Joseph NaN 48.888889 Male Penang 31.000 90.000000

3 Jessica Linsey 24.000000 Female Johor 72.375 62.000000

4 Amy Cooze 73.000000 NaN Kedah 72.375 35.000000

5 Jackson NaN 21.000000 Male Penang 85.000 100.000000

6 Amy NaN 80.000000 Female Perak 60.000 78.444444

7 Monica Lieber 20.000000 Female Kedah 70.000 85.000000

8 Bill Gates 78.000000 Male Perak 100.000 100.000000

9 Steve Billy 50.000000 Male Penang 80.000 99.000000

In [117]:
mean_of_test1 = df['Test1'].mean() # Fill missing values in column 'Test1'
df['Test1'].fillna(mean_of_test1, inplace = True)

df

Out[117]: first_name last_name age Gender State Test1 Test2

0 Jason Miller 42.0 Male Perak 78.000 90.0

1 Molly Jacobson 52.0 Female Johor 75.000 45.0

2 Joseph NaN NaN Male Penang 31.000 90.0

3 Jessica Linsey 24.0 Female Johor 72.375 62.0

4 Amy Cooze 73.0 NaN Kedah 72.375 35.0

5 Jackson NaN 21.0 Male Penang 85.000 100.0

6 Amy NaN 80.0 Female Perak 60.000 NaN

7 Monica Lieber 20.0 Female Kedah 70.000 85.0

8 Bill Gates 78.0 Male Perak 100.000 100.0

9 Steve Billy 50.0 Male Penang 80.000 99.0

Section 2.6: Writing to a csv file


We can also save the dataframe into a csv file.

.to_csv(filename)

In [118]:
df.to_csv('testing.csv')
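By default, the index is written out as the first column of the file. If you do not want it, to_csv also accepts index=False ; a small sketch:

df.to_csv('testing.csv', index=False) # write the data without the index column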

Exercise 3
The following questions process a dataset called chipotle.csv that shows the orders received for different items sold by a company.

1. Each row records the details of a single transaction involving one particular item.
2. Each order can include multiple transactions and hence span multiple rows

Answer the following questions:

Q1. Load the dataset chipotle.csv to a variable called chipo .

In [119]:
chipo = pd.read_csv('chipotle.csv')

Q2. See the first 10 entries

Ans:

In [120]:
chipo.head(10)

Out[120]: order_id quantity item_name choice_description total_item_price

0 1 1 Chips and Fresh Tomato Salsa NIL 2.39

1 1 1 Izze [Clementine] NaN

2 1 1 Nantucket Nectar [Apple] 3.39

3 1 1 Chips and Tomatillo-Green Chili Salsa NIL 2.39

4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98

5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... 10.98

6 3 1 Side of Chips NIL 1.69

7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables... 11.75

8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... 9.25

9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... 9.25

Q3. How many samples are there in the dataset?

Answer: 4622

In [121]:
chipo.info()

# OR

chipo.shape[0]

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4622 entries, 0 to 4621

Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 order_id 4622 non-null int64

1 quantity 4622 non-null int64

2 item_name 4622 non-null object

3 choice_description 4622 non-null object

4 total_item_price 4618 non-null float64

dtypes: float64(1), int64(2), object(2)

memory usage: 180.7+ KB

Out[121]: 4622

Q4. How many columns are there in the dataset?

Answer: 5

In [122]:
chipo.shape[1]

Out[122]: 5

Q5. Print the name of all the columns

Answer: Index(['order_id', 'quantity', 'item_name', 'choice_description',


'total_item_price'], dtype='object')

In [123]:
chipo.columns

Out[123]: Index(['order_id', 'quantity', 'item_name', 'choice_description',

'total_item_price'],

dtype='object')
Q6. How is the dataset indexed?

Answer: RangeIndex(start=0, stop=4622, step=1)

In [124]:
chipo.index

Out[124]: RangeIndex(start=0, stop=4622, step=1)

Q7. Get the list of all sold items and show the number of transactions involving them.

Answer:

Chicken Bowl 726

Chicken Burrito 553

Chips and Guacamole 479

Steak Burrito 368

Canned Soft Drink 301

Steak Bowl 211

...

In [125]:
chipo.item_name.value_counts()

Out[125]: Chicken Bowl 726

Chicken Burrito 553

Chips and Guacamole 479

Steak Burrito 368

Canned Soft Drink 301

Steak Bowl 211

Chips 211

Bottled Water 162

Chicken Soft Tacos 115

Chips and Fresh Tomato Salsa 110

Chicken Salad Bowl 110

Canned Soda 104

Side of Chips 101

Veggie Burrito 95

Barbacoa Burrito 91

Veggie Bowl 85

Carnitas Bowl 68

Barbacoa Bowl 66

Carnitas Burrito 59

Steak Soft Tacos 55

6 Pack Soft Drink 54

Chips and Tomatillo Red Chili Salsa 48

Chicken Crispy Tacos 47

Chips and Tomatillo Green Chili Salsa 43

Carnitas Soft Tacos 40

Steak Crispy Tacos 35

Chips and Tomatillo-Green Chili Salsa 31

Steak Salad Bowl 29

Nantucket Nectar 27

Barbacoa Soft Tacos 25

Chips and Roasted Chili Corn Salsa 22

Izze 20

Chips and Tomatillo-Red Chili Salsa 20

Veggie Salad Bowl 18

Chips and Roasted Chili-Corn Salsa 18

Barbacoa Crispy Tacos 11

Barbacoa Salad Bowl 10

Chicken Salad 9

Veggie Soft Tacos 7

Carnitas Crispy Tacos 7

Veggie Salad 6

Carnitas Salad Bowl 6

Burrito 6

Steak Salad 4

Crispy Tacos 2

Salad 2

Bowl 2

Chips and Mild Fresh Tomato Salsa 1

Veggie Crispy Tacos 1

Carnitas Salad 1

Name: item_name, dtype: int64


Q8. What is the most ordered item? How many of them were ordered? Hints: Use value_counts() , which returns a Series object storing the frequency of a categorical column.

Answer: Most ordered items = Chicken Bowl . Quantity = 726

In [126]:
mostOrdered = chipo.item_name.value_counts()

print('Most ordered items =', mostOrdered.index[0], '. Quantity =', mostOrdered.values[0])

Most ordered items = Chicken Bowl . Quantity = 726

Q9. How many items were ordered in total?

Ans: 4972

In [127]:
chipo.quantity.sum()

Out[127]: 4972

Q10. How many orders were made in the period? Remember that each order can include multiple items and
hence span multiple rows. Hints: The number of orders is given by the number of unique order IDs.

Answer: 1834

In [128]:
total_orders = len(np.unique(chipo.order_id))

# OR

total_orders = chipo.order_id.value_counts().count()

total_orders

Out[128]: 1834
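pandas also provides a dedicated method for counting distinct values; an equivalent one-liner sketch:

total_orders = chipo.order_id.nunique() # 1834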

Q11. Compute the average item price per order.

Answer: 18.795

In [129]:
chipo.total_item_price.sum() / total_orders

Out[129]: 18.795147219193023

Q12. How many different items are sold?

Answer: 50

In [130]:
print(chipo.item_name.value_counts().count())

# OR

print(len(np.unique(chipo.item_name)))

50

50

Q13. How many transactions (rows) are there with total_item_price more than $10?
Answer: 1130

In [131]:
len(chipo[chipo.total_item_price > 10])

Out[131]: 1130

Q14. How many transactions have a quantity equal to 1?

Answer: 4355

In [132]:
len(chipo[chipo.quantity == 1])

Out[132]: 4355

Q15. How many times did people order more than one Canned Soda?

Answer: 20

In [133]:
len(chipo[(chipo.item_name == 'Canned Soda') & (chipo.quantity > 1)])

Out[133]: 20

Q16. Reload the dataset chipotle.csv to the variable chipo but you should also consider 'NIL' as the
missing value. Identify the columns with missing values.

Ans:

Use chipo.info() to identify the two columns with missing values.

In [134]:
chipo = pd.read_csv('chipotle.csv', na_values=['NIL'])

chipo.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4622 entries, 0 to 4621

Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 order_id 4622 non-null int64

1 quantity 4622 non-null int64

2 item_name 4622 non-null object

3 choice_description 3376 non-null object

4 total_item_price 4618 non-null float64

dtypes: float64(1), int64(2), object(2)

memory usage: 180.7+ KB

Q17. Replace all the missing values in choice_description with the value 'NULL'.

Ans:

Check your solution using chipo.info() . Ensure that there are no more missing values for choice_description .

Data columns (total 5 columns):

order_id 4622 non-null int64

quantity 4622 non-null int64

item_name 4622 non-null object

choice_description 4622 non-null object

total_item_price 4618 non-null float64

dtypes: float64(1), int64(2), object(2)

memory usage: 180.6+ KB

In [135]:
chipo.choice_description.fillna('NULL', inplace=True)

chipo.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4622 entries, 0 to 4621

Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 order_id 4622 non-null int64

1 quantity 4622 non-null int64

2 item_name 4622 non-null object

3 choice_description 4622 non-null object

4 total_item_price 4618 non-null float64

dtypes: float64(1), int64(2), object(2)

memory usage: 180.7+ KB

Q18. Replace all the missing values in total_item_price with its mean value.

Ans:

Check your solution using chipo.info() . Make sure that there are no longer any columns with missing values.

Data columns (total 5 columns):

order_id 4622 non-null int64

quantity 4622 non-null int64

item_name 4622 non-null object

choice_description 4622 non-null object

total_item_price 4622 non-null float64

dtypes: float64(1), int64(2), object(2)

memory usage: 180.6+ KB

In [136]:
mean_value = chipo.total_item_price.mean()

chipo.total_item_price.fillna(mean_value, inplace=True)

chipo.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4622 entries, 0 to 4621

Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 order_id 4622 non-null int64

1 quantity 4622 non-null int64

2 item_name 4622 non-null object

3 choice_description 4622 non-null object

4 total_item_price 4622 non-null float64

dtypes: float64(1), int64(2), object(2)

memory usage: 180.7+ KB

Q19. Create a scatterplot of the quantity vs. the total item price.

Ans:
In [137]:
chipo.plot(kind='scatter', x='quantity', y='total_item_price')

plt.show()

Section 4: Advanced Topics (Optional)


Section 4.1: Creating contingency table for categorical columns
pd.crosstab()

It is often useful to create a crosstab or contingency table to show the frequency of categorical columns in a DataFrame. Use pd.crosstab to produce the frequency table of the specified categorical columns.

In [138]:
data = {'state': ['Perak', 'Selangor', 'Perak', 'Selangor', 'Selangor', 'Selangor', 'Selangor', 'Kedah', 'Kedah', 'Perak'],
        'town' : ['Kampar', 'Jitra', 'Kampar', 'Rawang', 'Beruntung', 'Rawang', 'Rawang', 'Jitra', 'Kumit', 'Tapah'],
        'rating' : list('ABABCCBBCA') }

df = pd.DataFrame(data)

df

Out[138]: state town rating

0 Perak Kampar A

1 Selangor Jitra B
state town rating

2 Perak Kampar A

3 Selangor Rawang B

4 Selangor Beruntung C

5 Selangor Rawang C

6 Selangor Rawang B

7 Kedah Jitra B

8 Kedah Kumit C

9 Perak Tapah A

In [139]:
pd.crosstab(df.rating, df.state)

Out[139]: state Kedah Perak Selangor

rating

A 0 3 0

B 1 0 3

C 1 0 2

In [140]:
pd.crosstab([df.state, df.town], df.rating)

Out[140]: rating A B C

state town

Kedah Jitra 0 1 0

Kumit 0 0 1

Perak Kampar 2 0 0

Tapah 1 0 0

Selangor Beruntung 0 0 1

Jitra 0 1 0

Rawang 0 2 1

Crosstabs can also be normalized to show percentages rather than counts using the normalize=True
argument.

In [141]:
pd.crosstab([df.state, df.town], df.rating, normalize = True)

Out[141]: rating A B C

state town

Kedah Jitra 0.0 0.1 0.0

Kumit 0.0 0.0 0.1

Perak Kampar 0.2 0.0 0.0


Tapah 0.1 0.0 0.0

Selangor Beruntung 0.0 0.0 0.1

Jitra 0.0 0.1 0.0

Rawang 0.0 0.2 0.1
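Row and column totals can also be appended with the margins=True argument; a small sketch:

pd.crosstab(df.rating, df.state, margins = True) # adds an 'All' row and column holding the totals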

Section 4.2: Grouping samples for numerical columns


Sometimes, we may want to group the samples. For example, given all dengue patients, we may want to
compute the number of patients grouped by states and/or gender.

df.groupby

We can group a DataFrame based on one or more categorical columns. For the dataframe below, let's say we want to group the dengue cases based on the states.

The command df.groupby('State') generates the sample groups. Since there are two states in the dataset, there are two groups.
The command df.groupby('State').sum() performs the summation operation on each group. You can apply other types of statistical or arithmetic operations such as mean, std, max, min, etc.

In [142]:
df = pd.DataFrame({'Town' : ['Kampar', 'Ipoh', 'Taiping', 'Kuala Kangsar',

'Klang', 'Subang', 'Puchong', 'CyberJaya'],

'State' : ['Perak', 'Perak', 'Perak', 'Perak',

'Selangor', 'Selangor', 'Selangor', 'Selangor'],

'DengueCase': ['low', 'high', 'high', 'low',

'moderate', 'low', 'moderate', 'low'],

'2013' : [300, 1010, 1105, 200, 510, 50, 553, 10],

'2014' : [120, 900, 1200, 180, 590, 45, 600, 20],


'2015' : [250, 1130, 1400, 230, 450, 65, 650, 35],

'ABBR' : ["KPR", "IPH", "TPH", "KK", "KL", "SB", "PC", "CY"]},

columns = ["Town", "State", "2013", "2014", "2015", "DengueCase", "ABBR"])

df

Out[142]: Town State 2013 2014 2015 DengueCase ABBR

0 Kampar Perak 300 120 250 low KPR

1 Ipoh Perak 1010 900 1130 high IPH

2 Taiping Perak 1105 1200 1400 high TPH

3 Kuala Kangsar Perak 200 180 230 low KK

4 Klang Selangor 510 590 450 moderate KL

5 Subang Selangor 50 45 65 low SB

6 Puchong Selangor 553 600 650 moderate PC

7 CyberJaya Selangor 10 20 35 low CY

In [143]:
df.groupby('State').sum()

Out[143]: 2013 2014 2015

State

Perak 2615 2400 3010

Selangor 1123 1255 1200

In [144]:
df.groupby('State').mean()

Out[144]: 2013 2014 2015

State

Perak 653.75 600.00 752.5

Selangor 280.75 313.75 300.0

We can also specify which numerical column to observe.

In [145]:
df.groupby('State')['2013'].sum()

Out[145]: State

Perak 2615

Selangor 1123

Name: 2013, dtype: int64

In [146]:
df.groupby('State')['2013'].mean()

Out[146]: State

Perak 653.75

Selangor 280.75

Name: 2013, dtype: float64
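Several statistics can also be computed in a single pass with .agg ; a minimal sketch:

df.groupby('State')['2013'].agg(['sum', 'mean', 'max']) # one row per state, one column per statistic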


We can perform 2-level indexing by specifying two categorical columns to generate the groups.

In [147]:
df.groupby(['State','DengueCase']).mean()

Out[147]: 2013 2014 2015

State DengueCase

Perak high 1057.5 1050.0 1265.0

low 250.0 150.0 240.0

Selangor low 30.0 32.5 50.0

moderate 531.5 595.0 550.0

Exercise 4
Q20. Create a dataframe to store the following data.

ID Name Programme Year Trimester CGPA

1 Tan Xuan Yong CS 2 1 2.3

2 Rifdean CE 1 2 3.2

3 Sharmila CS 2 1 2.1

4 Subramaniam CE 1 1 3.6

5 Bernard CE 2 2 4.0

In [148]:
df = pd.DataFrame({'Name': ['Tan Xuan Yong', 'Rifdean', 'Sharmila', 'Subramaniam', 'Bernard'],

'Programme': ['CS', 'CE', 'CS', 'CE', 'CE'],

'Year': [2,1,2,1,2],

'Trimester': [1,2,1,1,2],

'CGPA': [2.3, 3.2, 2.1, 3.6, 4.0]},

index = np.arange(1, 6))

df

Out[148]: Name Programme Year Trimester CGPA

1 Tan Xuan Yong CS 2 1 2.3

2 Rifdean CE 1 2 3.2

3 Sharmila CS 2 1 2.1

4 Subramaniam CE 1 1 3.6

5 Bernard CE 2 2 4.0

Q21. Create a crosstab showing the frequency of the students by (1) Year and (2) Programme.

Answer:

Programme CE CS

Year

1 2 0

2 1 2

In [149]:
pd.crosstab(df.Year, df.Programme)

Out[149]: Programme CE CS

Year

1 2 0

2 1 2

Q22. Show the average CGPA for CS and CE students.

Answer:

Programme

CE 3.6

CS 2.2

Name: CGPA, dtype: float64

In [150]:
df.groupby('Programme')['CGPA'].mean()

Out[150]: Programme

CE 3.6

CS 2.2

Name: CGPA, dtype: float64
