2/24/2020 3.2_filtering - Jupyter Notebook
Filtering
In [1]:
import numpy as np
import pandas as pd
In [2]:
# Create a Dictionary
d = {
'Name':['Amarend','Ajay','Preety','Rakesh','Raju','Shyam',
'Kiran','Rishi','Prem','Raj','Ravina','Premjit'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
'Mathematics','Mathematics','Mathematics','Science','Science','Science'],
'Score':[62,47,55,74,31,77,85,63,42,67,89,81]}
# Create a dataframe
df = pd.DataFrame(d,columns=['Name','Exam','Subject','Score'])
df
Out[2]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
1 Ajay Semester 1 Mathematics 47
2 Preety Semester 1 Mathematics 55
3 Rakesh Semester 1 Science 74
4 Raju Semester 1 Science 31
5 Shyam Semester 1 Science 77
6 Kiran Semester 2 Mathematics 85
7 Rishi Semester 2 Mathematics 63
8 Prem Semester 2 Mathematics 42
9 Raj Semester 2 Science 67
10 Ravina Semester 2 Science 89
11 Premjit Semester 2 Science 81
View a column of the dataframe in pandas:
localhost:8888/notebooks/Machine Learning/Python/3.2_filtering.ipynb 1/7
In [5]:
df['Name']
Out[5]:
0 Amarend
1 Ajay
2 Preety
3 Rakesh
4 Raju
5 Shyam
6 Kiran
7 Rishi
8 Prem
9 Raj
10 Ravina
11 Premjit
Name: Name, dtype: object
View two or more columns of the dataframe in pandas:
In [18]:
df[['Name', 'Score']]
Out[18]:
Name Score
0 Amarend 62
1 Ajay 47
2 Preety 55
3 Rakesh 74
4 Raju 31
5 Shyam 77
6 Kiran 85
7 Rishi 63
8 Prem 42
9 Raj 67
10 Ravina 89
11 Premjit 81
View first two rows of the dataframe in pandas:
In [6]:
df[:2]
Out[6]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
1 Ajay Semester 1 Mathematics 47
In [7]:
df.head(2)
Out[7]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
1 Ajay Semester 1 Mathematics 47
View last two rows of the dataframe in pandas:
In [20]:
df[-2:]
Out[20]:
Name Exam Subject Score
10 Ravina Semester 2 Science 89
11 Premjit Semester 2 Science 81
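Just as df.head(2) mirrors df[:2], the last rows can also be fetched with tail(). A minimal sketch, assuming the same data as above (the dataframe is rebuilt with a compact constructor so the cell runs on its own; the name last_two is only illustrative):

```python
import pandas as pd

# Same data as the notebook's df, rebuilt so this cell is self-contained
df = pd.DataFrame({
    'Name': ['Amarend', 'Ajay', 'Preety', 'Rakesh', 'Raju', 'Shyam',
             'Kiran', 'Rishi', 'Prem', 'Raj', 'Ravina', 'Premjit'],
    'Exam': ['Semester 1'] * 6 + ['Semester 2'] * 6,
    'Subject': (['Mathematics'] * 3 + ['Science'] * 3) * 2,
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 67, 89, 81],
})

# tail(2) is the counterpart of head(2): it returns the last two rows
last_two = df.tail(2)
print(last_two)
```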
Filter pandas dataframe by column value
Method 1 : DataFrame Way
In [21]:
# based on one condition
df1 = df[df['Score']>60]
df1
Out[21]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
3 Rakesh Semester 1 Science 74
5 Shyam Semester 1 Science 77
6 Kiran Semester 2 Mathematics 85
7 Rishi Semester 2 Mathematics 63
9 Raj Semester 2 Science 67
10 Ravina Semester 2 Science 89
11 Premjit Semester 2 Science 81
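It may help to look at what the condition inside the brackets produces. The sketch below assumes the same data as above (rebuilt compactly so the cell runs on its own) and uses the illustrative names mask and above_60: the comparison yields a boolean Series, and indexing with it keeps the rows where the value is True.

```python
import pandas as pd

# Same data as the notebook's df, rebuilt so this cell is self-contained
df = pd.DataFrame({
    'Name': ['Amarend', 'Ajay', 'Preety', 'Rakesh', 'Raju', 'Shyam',
             'Kiran', 'Rishi', 'Prem', 'Raj', 'Ravina', 'Premjit'],
    'Exam': ['Semester 1'] * 6 + ['Semester 2'] * 6,
    'Subject': (['Mathematics'] * 3 + ['Science'] * 3) * 2,
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 67, 89, 81],
})

# The comparison returns one True/False value per row
mask = df['Score'] > 60
print(mask.head())

# Indexing with the boolean Series keeps only the rows marked True
above_60 = df[mask]
print(len(above_60))
```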
In [22]:
# based on multiple conditions
df1A = df[(df['Score']>60) & (df['Subject']=='Mathematics')]
df1B = df[(df.Score>60) & (df.Subject=='Mathematics')]
#df1A
df1B
Out[22]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
6 Kiran Semester 2 Mathematics 85
7 Rishi Semester 2 Mathematics 63
In [31]:
# Select only a few columns under some conditions
df1C = df[(df.Score>60) & (df.Subject=='Mathematics')][['Name','Score']]
df1C
Out[31]:
Name Score
0 Amarend 62
6 Kiran 85
7 Rishi 63
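The row filter and the column selection can also be combined in a single loc call instead of two chained bracket operations. This is a sketch of an alternative, not a correction of the cell above; it assumes the same data (rebuilt so the cell runs on its own), and the names mask and df1C_loc are illustrative.

```python
import pandas as pd

# Same data as the notebook's df, rebuilt so this cell is self-contained
df = pd.DataFrame({
    'Name': ['Amarend', 'Ajay', 'Preety', 'Rakesh', 'Raju', 'Shyam',
             'Kiran', 'Rishi', 'Prem', 'Raj', 'Ravina', 'Premjit'],
    'Exam': ['Semester 1'] * 6 + ['Semester 2'] * 6,
    'Subject': (['Mathematics'] * 3 + ['Science'] * 3) * 2,
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 67, 89, 81],
})

mask = (df.Score > 60) & (df.Subject == 'Mathematics')

# loc[rows, columns]: boolean mask for the rows, list of names for the columns
df1C_loc = df.loc[mask, ['Name', 'Score']]
print(df1C_loc)
```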
Method 2 : Query Function
pandas offers several ways to filter rows. The filter above can also be written with query(), as in the code shown below.
This form is often more readable, because you don't need to repeat the dataframe name every time you refer to a column (variable).
In [33]:
df2 = df.query('Score > 60 & Subject == "Mathematics"')
df2
Out[33]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
6 Kiran Semester 2 Mathematics 85
7 Rishi Semester 2 Mathematics 63
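When the threshold or subject comes from a Python variable, query() can refer to it with an @ prefix. A minimal sketch, assuming the same data as above (rebuilt so the cell runs on its own); the variable names min_score, subject and df2_var are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same data as the notebook's df, rebuilt so this cell is self-contained
df = pd.DataFrame({
    'Name': ['Amarend', 'Ajay', 'Preety', 'Rakesh', 'Raju', 'Shyam',
             'Kiran', 'Rishi', 'Prem', 'Raj', 'Ravina', 'Premjit'],
    'Exam': ['Semester 1'] * 6 + ['Semester 2'] * 6,
    'Subject': (['Mathematics'] * 3 + ['Science'] * 3) * 2,
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 67, 89, 81],
})

min_score = 60            # hypothetical threshold held in a variable
subject = 'Mathematics'   # hypothetical subject held in a variable

# @name refers to a Python variable in the enclosing scope
df2_var = df.query('Score > @min_score and Subject == @subject')
print(df2_var)
```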
Method 3 : loc function
loc is short for location. All three methods return the same output; they are just different ways of
filtering rows.
In [36]:
df3 = df.loc[(df.Score>60) & (df.Subject=='Mathematics')]
df3
Out[36]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
6 Kiran Semester 2 Mathematics 85
7 Rishi Semester 2 Mathematics 63
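The claim that the three methods agree can be checked directly with DataFrame.equals. A small sketch, assuming the same data as above (rebuilt so the cell runs on its own); the names by_mask, by_query and by_loc are illustrative.

```python
import pandas as pd

# Same data as the notebook's df, rebuilt so this cell is self-contained
df = pd.DataFrame({
    'Name': ['Amarend', 'Ajay', 'Preety', 'Rakesh', 'Raju', 'Shyam',
             'Kiran', 'Rishi', 'Prem', 'Raj', 'Ravina', 'Premjit'],
    'Exam': ['Semester 1'] * 6 + ['Semester 2'] * 6,
    'Subject': (['Mathematics'] * 3 + ['Science'] * 3) * 2,
    'Score': [62, 47, 55, 74, 31, 77, 85, 63, 42, 67, 89, 81],
})

by_mask = df[(df.Score > 60) & (df.Subject == 'Mathematics')]       # Method 1
by_query = df.query('Score > 60 & Subject == "Mathematics"')        # Method 2
by_loc = df.loc[(df.Score > 60) & (df.Subject == 'Mathematics')]    # Method 3

# equals() compares values, index and column labels
same = by_mask.equals(by_query) and by_mask.equals(by_loc)
print(same)
```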
Difference between loc and iloc function
loc selects rows by index label, whereas iloc selects rows by integer position in the index, so it only
accepts integers. Let's create some sample data for illustration.
In [38]:
x = pd.DataFrame({"col1" : np.arange(1,20,2)}, index=[9,8,7,6,0, 1, 2, 3, 4, 5])
x
Out[38]:
col1
9 1
8 3
7 5
6 7
0 9
1 11
2 13
3 15
4 17
5 19
iloc - Index Position
In [39]:
x.iloc[0:5]
Out[39]:
col1
9 1
8 3
7 5
6 7
0 9
loc - Index Label
In [40]:
x.loc[0:5]
Out[40]:
col1
0 9
1 11
2 13
3 15
4 17
5 19
Note: x.loc[0:5] returns 6 rows (label 5, the 6th element, is included).
This is because loc does not select by index position. It considers only the index labels, which can
also be letters, and it includes both the start and the end point of the slice.
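To illustrate label-based slicing with letters, here is a minimal sketch using a hypothetical dataframe y with a string index (y and its values are made up for this example and are not part of the original notebook). The loc slice includes its end label, while the iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical sample data with letters as index labels
y = pd.DataFrame({'col1': [10, 20, 30, 40, 50]},
                 index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: both 'b' and 'd' are included
by_label = y.loc['b':'d']
print(by_label)

# Position-based slice: position 1 is included, position 3 is excluded
by_position = y.iloc[1:3]
print(by_position)
```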
In [41]:
# more examples - (offline) Data Analytics - Preprocessing 4
In [3]:
df.head()
Out[3]:
Name Exam Subject Score
0 Amarend Semester 1 Mathematics 62
1 Ajay Semester 1 Mathematics 47
2 Preety Semester 1 Mathematics 55
3 Rakesh Semester 1 Science 74
4 Raju Semester 1 Science 31
In [9]:
# df.sort_values('Name')
In [ ]: