Exp 7

This document provides an overview of Pandas, a Python library used for data analysis and manipulation. It discusses Pandas data structures like Series and DataFrames, how to read data from different file types into Pandas, and how to explore and access data in Pandas. Key points covered include indexing and selecting data, applying functions to data, and handling missing values.

Uploaded by

Batool Fassi
Pandas

Prepared by Dr. Mohammad Abdel-majeed


Updated by Dr. Samah Rahamneh, Eng. Abeer Awad

Data Science is a branch of computer science in which we study
how to store, use, and analyze data to derive information from it.

Outline
• Series and Dataframes
• Reading the data
• Exploring the data
• Indexing
• Selection
• Data Analysis
• Grouping
• Applying functions
• Sorting
• Missing values
• Combining

Pandas
• Adds data structures and tools designed to work
with table-like data
• Provides tools for data manipulation: reshaping,
merging, sorting, slicing, aggregation etc.
• Clean messy data sets, and make them readable
and relevant.
• Allows handling missing data
• The name "Pandas" refers to both
"Panel Data" and "Python Data Analysis"

Pandas Data Structures
• Series: a one-dimensional data structure (a
column) that stores values; every value has
an index label.

• DataFrame: a two-dimensional data
structure, basically a table with rows and
columns. The columns have names and the
rows have index labels.
import pandas as pd
pd.command(xxx)# general pattern: pandas functions are called through the pd alias
Series
S = pd.Series([15,20,13,55,67,34,23,1])
print(S)
print(S[0])

0    15
1    20
2    13
3    55
4    67
5    34
6    23
7     1
15

With the index argument, you can name your own labels.

S = pd.Series([15,20,13,55,67,34,23,1],
              index=['i0','i1','i2','i3','i4','i5','i6','i7'])
print(S)
print(S['i0'])
print(S['i5'])

i0    15
i1    20
i2    13
i3    55
i4    67
i5    34
i6    23
i7     1
15
34

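The label-based access shown on the Series slide can also be written with the explicit loc and iloc accessors. The sketch below uses a shortened, made-up Series (the values and the names by_label, by_position and subset are illustrative assumptions, not part of the original slides).

```python
import pandas as pd

# Shortened, illustrative version of the Series from the slide
s = pd.Series([15, 20, 13], index=['i0', 'i1', 'i2'])

by_label = s.loc['i1']          # label-based access
by_position = s.iloc[1]         # position-based access (second element)
subset = s.loc[['i0', 'i2']]    # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```

Using loc and iloc explicitly avoids any ambiguity between labels and positions when the labels are themselves integers.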
Dataframes
• Index by default is an integer and starts from 0

df = pd.DataFrame({'Name':['Mohammad', 'Ahmad','Haneen','Leen'],
'Age':[12,25,40,17]})
print(df)#for a large dataframe, shows only the first 5 rows and the last 5 rows
print(df.to_string())#shows all the rows of the dataframe

Name Age
0 Mohammad 12
1 Ahmad 25
2 Haneen 40
3 Leen 17

DataFrames with Index

df = pd.DataFrame({'Name':['Mohammad',
'Ahmad','Haneen','Leen'],
'Age':[12,25,40,17]},
index = ['s1','s2','s3','s4'])
print(df)

Name Age
s1 Mohammad 12
s2 Ahmad 25
s3 Haneen 40
s4 Leen 17

Reading Data Files
• Several file types can be read; their
content is stored in a Series or DataFrame

import numpy as np
import pandas as pd
filename =r'C:\Users\user\Desktop\movies.xls'
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.shape)#(1604,19)
# Note: you may need to install xlrd package to run the code

Reading Data Files
• Several file types can be read; their
content is stored in a Series or DataFrame
import numpy as np
import pandas as pd
filename =r'C:\Users\user\Desktop\movies.xls'
movies = pd.read_excel(filename,sheet_name=1)#reads the second sheet, xlrd
print(movies.shape)#(1604,19)
#to read .csv file type
df = pd.read_csv('data.csv')
print(df)

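To try read_csv() without a file on disk, the text can be supplied through an in-memory buffer. This is only a sketch: the CSV content below is made up and stands in for the data.csv file named on the slide.

```python
import io
import pandas as pd

# Made-up CSV text standing in for 'data.csv'
csv_text = "Title,Year,Score\nA,2010,7.6\nB,2011,5.2\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # number of rows and columns
print(df.dtypes)   # types inferred for each column
```

The first line of the text is used as the column headers, which matches the default behaviour described for read_excel().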
Other read_*
• https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Exploring the Data
• Data types
print(movies.dtypes)
print(movies.Country.dtype)
movies.Duration.astype('int32')#make sure that you do not have NA values

Title                        object
Year                        float64
Genres                       object
Language                     object
Country                      object
Duration                    float64
Budget                      float64
Gross Earnings              float64
Director                     object
Actor 1                      object
Actor 2                      object
Facebook Likes - Actor 1    float64
Facebook Likes - Actor 2    float64
Facebook likes - Movie        int64
Facenumber in posters       float64
User Votes                    int64
Reviews by Users            float64
Reviews by Crtiics          float64
IMDB Score                  float64

Exploring the Data
filename =r'C:\Users\mohammad\Desktop\movies.xls'
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.shape)
print(movies.head(4))
print(movies.tail())# last five records

(1604, 19)
Title Year ... Reviews by Crtiics IMDB Score
0 127 Hours 2010.0 ... 450.0 7.6
1 3 Backyards 2010.0 ... 20.0 5.2
2 3 2010.0 ... 76.0 6.8
3 8: The Mormon Proposition 2010.0 ... 28.0 7.1

Exploring the Data

import numpy as np
import pandas as pd
filename =r'C:\Users\mohammad\Desktop\movies.xls'
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.columns)# column titles

Index(['Title', 'Year', 'Genres', 'Language', 'Country', 'Duration', 'Budget',
       'Gross Earnings', 'Director', 'Actor 1', 'Actor 2',
       'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2',
       'Facebook likes - Movie', 'Facenumber in posters', 'User Votes',
       'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score'],
      dtype='object')

Exploring the Data/
Data Frames attributes

df.attribute    description
dtypes          list the types of the columns
columns         list the column names
axes            list the row labels and column names
ndim            number of dimensions
size            number of elements
shape           return a tuple representing the dimensionality
values          numpy representation of the data

Add Columns Titles
• By default the first row is treated as the
column headers.
• If the file has no column headers, read the
data as follows:
movies = pd.read_excel(filename,header=None)

• Then create a list of column headers and
assign it to the columns attribute of the
dataframe
movies.columns= list(range(19))
movies.columns = ['col1','col2',...]# one name per column
Add Columns Titles
• With header=None, the headers are integer
values starting from 0.
• To add column headers, create a list of
column header names and assign it to the
columns attribute of the dataframe
movies.columns = ['col1','col2',...]# one name per column

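The header handling described above can be checked on a tiny, headerless table. The sketch uses read_csv() on made-up in-memory text rather than the movies file; the column names assigned at the end are illustrative assumptions.

```python
import io
import pandas as pd

# Two made-up rows with no header line
raw = "127 Hours,2010,7.6\n3 Backyards,2010,5.2\n"

df = pd.read_csv(io.StringIO(raw), header=None)
default_headers = list(df.columns)   # integer headers starting from 0

# Assign one name per column
df.columns = ['Title', 'Year', 'Score']

print(default_headers)
print(df)
```

The list assigned to df.columns must have exactly as many entries as the dataframe has columns, otherwise pandas raises an error.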
Data Access/Column Access
• To access the data you can use the column
header as follows:
print(movies['Title'])
print(movies.Title)

0 127 Hours
1 3 Backyards
2 3
3 8: The Mormon Proposition
4 A Turtle's Tale: Sammy's Adventures
...
1602 Wuthering Heights
1603 Yu-Gi-Oh! Duel Monsters
Name: Title, Length: 1604, dtype: object
Data Access/Column Access
• To access several columns, pass a list of
column headers:
print(movies[['Year','IMDB Score']])

Year IMDB Score


0 2010.0 7.6
1 2010.0 5.2
2 2010.0 6.8
3 2010.0 7.1
4 2010.0 6.1
... ... ...
1600 NaN 7.3
1601 NaN 7.1
1602 NaN 7.7
1603 NaN 7.0
Renaming Columns
• Attribute access (df.column) only works when the
column name has no spaces; replace them if needed
df2 = pd.DataFrame([[1, 1, 1, 1],
[2, 2, 2, 2]],
columns=['A 1','B 1','C 1','D 1'])
print(df2)
df2.columns = [c.replace(' ', '_') for c in df2.columns]
print(df2)

A 1 B 1 C 1 D 1
0 1 1 1 1
1 2 2 2 2
A_1 B_1 C_1 D_1
0 1 1 1 1
1 2 2 2 2
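A short sketch of why the renaming matters: bracket access accepts any column name, while attribute access needs a name that is a valid Python identifier. The two-column frame below is a reduced, illustrative version of df2.

```python
import pandas as pd

df2 = pd.DataFrame([[1, 1], [2, 2]], columns=['A 1', 'B 1'])

# Bracket access works even when the name contains a space
col = df2['A 1']

# Replace spaces in all column names at once
df2.columns = df2.columns.str.replace(' ', '_')

# Attribute access is now possible
same_col = df2.A_1

print(list(df2.columns))
print(same_col.tolist())
```

The str.replace() call on the column index is an alternative to the list comprehension on the slide; both produce the same names.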
Indexing
• The native operators ([] and attribute access) work
just as they do in the rest of the Python ecosystem.
• Pandas has its own access operators
– iloc: index (position) based selection
– loc : label based selection

Index Based Selection
(iloc)

print(movies.iloc[5])#returns row 5
print(movies.iloc[:5])#returns rows 0,1,2,3,4
print(movies.iloc[:5,0])#returns titles (column 0) of the first 5 movies
print(movies.iloc[[0,1,2,3,4],0])#returns titles (column 0) of the first 5 movies
print(movies.iloc[[0,1,2,3,4],18])#returns IMDB scores (column 18) of the first 5 movies

Label Based Selection
(loc)
• When using loc the slicing is inclusive
– The start and end labels are both included.
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.loc[5])#returns row 5
print(movies.loc[:5])#returns rows 0,1,2,3,4,5
print(movies.loc[:5,'Title'])#returns titles of the first 6 movies
print(movies.loc[[0,1,2,3,4],'IMDB Score'])#returns IMDB Scores of the first 5 movies
# loc[-5:] is a label slice; on the default 0..n-1 index it returns every row,
# so the last 5 labels are selected explicitly
print(movies.loc[movies.index[-5:],'IMDB Score'])#returns IMDB Scores of the last 5 movies
print(movies.loc[movies.index[-5:],['IMDB Score','Title']])#returns IMDB Scores and titles of the last 5 movies

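The difference between position-based and label-based slicing is easy to verify on a small frame. The sketch below uses a made-up eight-row dataframe with the default integer index; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Title': list('abcdefgh'), 'Score': range(8)})

n_iloc = len(df.iloc[:5])   # positions 0..4, end excluded
n_loc = len(df.loc[:5])     # labels 0..5, end included

# Negative positions count from the end with iloc
last_two = df.iloc[-2:]

# With loc, -2 is treated as a label; on the default index this
# slice starts before the first label and so returns every row
label_slice = df.loc[-2:]

print(n_iloc, n_loc, len(last_two), len(label_slice))
```

For "the last n rows", iloc (or tail()) is therefore the safer choice.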
set_index()
• An integer index is added to the
dataframe by default
– The range is from 0 to (number of rows - 1)
• set_index() can be used to make any of the
columns the row index
– Duplicates are allowed

set_index()
df = pd.DataFrame({'Name':['Mohammad',
'Mohammad','Haneen','Leen'],
'Age':[12,25,40,17],
'Hobby':['Soccer','Singing','Reading','Reading']})

X = df.set_index('Age')
print(X)
df.set_index('Name',inplace=True)
print(df)
print(df.loc['Mohammad'])

       Hobby      Name
Age
12    Soccer  Mohammad
25   Singing  Mohammad
40   Reading    Haneen
17   Reading      Leen
          Age    Hobby
Name
Mohammad   12   Soccer
Mohammad   25  Singing
Haneen     40  Reading
Leen       17  Reading
          Age    Hobby
Name
Mohammad   12   Soccer
Mohammad   25  Singing

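Because duplicate index values are allowed, the type returned by loc depends on how many rows carry the requested label. A minimal sketch with a reduced, illustrative version of the dataframe above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mohammad', 'Mohammad', 'Haneen'],
                   'Age': [12, 25, 40]})
by_name = df.set_index('Name')

dup = by_name.loc['Mohammad']     # label appears twice: a DataFrame with 2 rows
single = by_name.loc['Haneen']    # label appears once: a Series for that row
is_unique = by_name.index.is_unique

print(type(dup).__name__, type(single).__name__, is_unique)
```

Checking index.is_unique before relying on loc to return a single row can prevent surprises.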
Selection
• Several Techniques can be used to select
certain elements
– Relational Operators >,<,>=….
– isin()
– notnull(), isnull()

Selection/Examples

print(movies.loc[movies.Country== 'Spain'])#all rows with country=Spain
print(movies[movies.Country== 'Spain'])#same result without loc

print(movies[(movies['Country']== 'Spain') & (movies['Reviews by Users']>400)])

print(movies[(movies['Year']== 2012) | (movies['Year']==2011)])
print(movies[movies.Year.isin([2011,2012])])
print(movies.loc[movies.Year.isin([2011,2012])])

print(movies.loc[movies.Budget.notnull()])
print(movies.loc[movies.Budget.isnull()])

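The selection techniques above all work by building a boolean mask and passing it to the dataframe. The sketch below reproduces them on a four-row, made-up stand-in for the movies data (titles and values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Small made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    'Title': ['m1', 'm2', 'm3', 'm4'],
    'Country': ['Spain', 'USA', 'Spain', 'UK'],
    'Year': [2011, 2012, 2013, 2011],
    'Budget': [2e6, np.nan, 5e6, np.nan],
})

spain = movies[movies['Country'] == 'Spain']
# Each condition needs its own parentheses when combined with & or |
spain_recent = movies[(movies['Country'] == 'Spain') & (movies['Year'] > 2011)]
in_years = movies[movies['Year'].isin([2011, 2012])]
has_budget = movies[movies['Budget'].notnull()]

print(len(spain), len(spain_recent), len(in_years), len(has_budget))
```

Note that isin() takes a flat list of values, and that & and | are used instead of the Python keywords and/or.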
Assigning Data
• The assignment operator is used
– Broadcasting is supported

movies.loc[3,'Budget'] = 1500# make budget at row 3 =1500
movies['Title']= 'New Title' #make all data in the Title column = 'New Title'
movies.loc[:5,'Title'] = 'New'# make the first 6 rows of the Title column = 'New'
movies.head(5).Year= 2000# not a reliable way to modify movies; .head() is used to show data only
print(movies.head(10))

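Assignments are most predictable when they go through loc on the original dataframe rather than through the result of head(). A minimal sketch on a made-up three-row frame (the Flag column is an illustrative assumption):

```python
import pandas as pd

df = pd.DataFrame({'Title': ['a', 'b', 'c'], 'Year': [2010, 2011, 2012]})

df.loc[0, 'Year'] = 1999              # a single cell
df.loc[:1, 'Title'] = 'New'           # rows 0 and 1 (loc slicing is inclusive)
df['Flag'] = True                     # one value broadcast to every row

# Set a value on the first two rows, whatever the index labels are
df.loc[df.index[:2], 'Year'] = 2000

print(df)
```

Selecting the target labels with df.index[:n] is one way to express "the first n rows" that keeps the assignment on the original object.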
Data Analysis
describe()
• Generates a high-level summary of the
attributes of the given column.
– It is type-aware, meaning that its output changes
based on the data type of the input
– For numeric data, the result’s index will include
count, mean, std, min, max and 25, 50 and 75
percentiles
– For object data (e.g. strings or timestamps), the
result’s index will include count, unique, top, and
freq

Data Analysis
describe()
• For mixed data types provided via a
DataFrame, the default is to return only an
analysis of numeric columns.
print(movies.describe())

Year Duration ... Reviews by Crtiics IMDB Score


count 1497.000000 1594.000000 ... 1571.000000 1604.000000
mean 2012.773547 103.328733 ... 187.586887 6.337718
std 1.868725 27.429001 ... 165.281572 1.169382
min 2010.000000 7.000000 ... 1.000000 1.600000
25% 2011.000000 92.000000 ... 38.000000 5.700000
50% 2013.000000 102.000000 ... 159.000000 6.400000
75% 2014.000000 114.750000 ... 288.000000 7.100000
max 2016.000000 511.000000 ... 813.000000 9.500000
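The type-aware behaviour of describe() can be seen on a small mixed frame. The sketch below uses made-up data; include='all' is the describe() option that requests a summary of every column rather than only the numeric ones.

```python
import pandas as pd

df = pd.DataFrame({'Title': ['a', 'b', 'a'],
                   'Duration': [90.0, 100.0, 110.0]})

num = df['Duration'].describe()         # count, mean, std, min, quartiles, max
txt = df['Title'].describe()            # count, unique, top, freq
everything = df.describe(include='all') # both kinds of rows; NaN where not applicable

print(num['mean'])
print(txt['top'], txt['freq'])
print(everything)
```

In the combined summary, statistics that do not apply to a column (for example mean for Title) appear as NaN.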
Data Analysis
describe()
• Numerical Fields
print(movies.Duration.describe())
print(type(movies.Duration.describe()))
print(movies.Duration.describe()['max'])
print(movies.Duration.describe().loc['max'])

count 1594.000000
mean 103.328733
std 27.429001
min 7.000000
25% 92.000000
50% 102.000000
75% 114.750000
max 511.000000
Name: Duration, dtype: float64
<class 'pandas.core.series.Series'>
511.0
511.0
Data Analysis/Summary
describe()
• String Fields
print(movies.Title.describe())
print(movies.Title.describe().top)
print(movies.Title.describe()['top'])
print(movies.Title.describe().loc['top'])

count 1604
unique 1551
top Victor Frankenstein
freq 3
Name: Title, dtype: object
Victor Frankenstein
Victor Frankenstein
Victor Frankenstein

Data Analysis/Basic Statistics
print(movies.describe())
print(movies.describe().loc['max','Budget'])
print(movies.Budget.max())
print(movies.Budget.mean())
print(movies.Budget.mode())#most frequent value(s)
print(movies.Duration.mean().round())
movies.Duration += 15#Broadcasting
print(movies.Duration.mean().round())
Year Duration ... Reviews by Crtiics IMDB Score
count 1497.000000 1594.000000 ... 1571.000000 1604.000000
mean 2012.773547 103.328733 ... 187.586887 6.337718
std 1.868725 27.429001 ... 165.281572 1.169382
min 2010.000000 7.000000 ... 1.000000 1.600000
25% 2011.000000 92.000000 ... 38.000000 5.700000
50% 2013.000000 102.000000 ... 159.000000 6.400000
75% 2014.000000 114.750000 ... 288.000000 7.100000
max 2016.000000 511.000000 ... 813.000000 9.500000
600000000.0
600000000.0
40563243.28888889
0 20000000.0
dtype: float64
103.0
118.0
Data Analysis/Summary
Aggregation
• The agg() method is useful when multiple
statistics are computed per column:
print(movies[['Budget','IMDB Score']].agg([len,min,max]))
print(movies[['Budget','IMDB Score']].agg([len,min,max]).loc['max','Budget'])

Budget IMDB Score


len 1604.0 1604.0
min 1400.0 1.6
max 600000000.0 9.5
600000000.0

Data Analysis/Summary
describe()

print(movies.Year.unique())
print(movies.Year.value_counts())
print(movies.Year.value_counts()[2012])

[2010. 2011. 2012. 2013. 2014. 2015. 2016. nan]
2014.0    252
2013.0    237
2010.0    230
2015.0    226
2011.0    225
2012.0    221
2016.0    106
Name: Year, dtype: int64
221

Grouping
• The groupby() method groups the rows
of the dataframe based on the values of certain
column(s).
– movies.groupby(['Country']) groups the rows
based on the Country
• The number of groups equals the number of
countries
– We can perform operations on each group

Grouping
grouped = movies.groupby('Country')
print(grouped.groups)#shows indices
print(grouped.get_group('Spain'))

{'Australia': [16, 54, 65, 194, 294, 360, 473, 692, 859, 860, 880, 1014, 1120, 1138,
1269, 1319, 1514, 1601], 'Bahamas': [978], 'Belgium': [399, 895, 1348, 1349],
'Brazil': [804, 992, 1359], 'Bulgaria': [942], 'Cambodia': [1033],
'Canada': [13, 17, 26, 68, 78, 140, 143, 216, 255, 290, 300, 350, 450, 467, 492, 504, 526,…
Title Year ... Reviews by Crtiics IMDB Score
23 Buried 2010.0 ... 363.0 7.0
258 Blackthorn 2011.0 ... 92.0 6.6
334 Midnight in Paris 2011.0 ... 487.0 7.7
375 Sleep Tight 2011.0 ... 191.0 7.2
430 There Be Dragons 2011.0 ... 77.0 5.9
571 Red Lights 2012.0 ... 195.0 6.2
631 The Impossible 2012.0 ... 371.0 7.6
902 Underdogs 2013.0 ... 82.0 6.7
927 Aloft 2014.0 ... 56.0 5.3
1006 Hidden Away 2014.0 ... 9.0 7.2
1225 Eden 2015.0 ... 5.0 4.8
1295 Regression 2015.0 ... 140.0 5.7

[12 rows x 19 columns]


Grouping
• Example: Find the total budget spent by each country on movie
production

print(movies.groupby(['Country']).Budget.sum().head(5))

Country
Australia 751500000.0
Bahamas 5000000.0
Belgium 49000000.0
Brazil 11000000.0
Bulgaria 7000000.0
Name: Budget, dtype: float64

Grouping
• Example: Find the number of movies produced by each country
print(movies.groupby(['Country']).Title.count().sort_values())
print(movies.groupby(['Country']).Title.count().sort_values().max())

var=movies.groupby(['Country']).Title.count().sort_values()
print(var.index[var.shape[0]-1])

print(movies['Country'].describe().top)

Country
United Arab Emirates 1
Iran 1
.
.
Canada 44
France 54
UK 136
USA 1184
Name: Title, dtype: int64
1184
USA
USA
Grouping and Aggregation
• Use agg() to display more than one function
per group
– Results generated per group
print(movies.groupby(['Country']).Budget.agg([len,min,max]))

len min max


Country
Australia 18.0 2500000.0 150000000.0
Bahamas 1.0 5000000.0 5000000.0
Belgium 4.0 15000000.0 34000000.0

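The per-group aggregation above can also be written with string names instead of the Python builtins len, min and max; some recent pandas releases emit a warning when builtins are passed to agg(). The sketch below uses a made-up four-row frame, not the movies data.

```python
import pandas as pd

movies = pd.DataFrame({'Country': ['USA', 'USA', 'UK', 'Spain'],
                       'Budget': [10.0, 30.0, 5.0, 2.0]})

grouped = movies.groupby('Country')
n_groups = grouped.ngroups                 # one group per distinct country
total = grouped['Budget'].sum()            # one value per group

# String names select the built-in pandas aggregations
stats = grouped['Budget'].agg(['count', 'min', 'max'])

print(n_groups)
print(total)
print(stats)
```

One difference to keep in mind: 'count' counts only non-missing values, whereas len counts every row in the group.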
Grouping
• Notice that the result has a new index, in this case a
multi-index.
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]))
var=movies.groupby(['Country','Language']).Budget.agg([len,min,max])
print(var.loc['Brazil','max'])
print(var.loc['Brazil','max'].loc['English'])

len min max


Country Language
Australia English 18 2500000.0 150000000.0
Bahamas English 1 5000000.0 5000000.0
Belgium English 4 15000000.0 34000000.0
Brazil English 1 3000000.0 3000000.0
Portuguese 2 4000000.0 4000000.0
... ... ... ...
USA English 1174 1400.0 263700000.0
Hebrew 1 NaN NaN
None 1 4000000.0 4000000.0
Spanish 3 1200000.0 6000000.0
United Arab Emirates Arabic 1 125000.0 125000.0
[83 rows x 3 columns]
Language
English 3000000.0
Portuguese 4000000.0
Name: max, dtype: float64
3000000.0
DataFrameGroupBy.filter()
• Return a copy of a DataFrame excluding elements from groups
that do not satisfy the boolean criterion specified by func.
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B' : [1, 2, 3, 4, 5, 6],
                   'C' : [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')
print(grouped.get_group('bar'))
print(grouped.get_group('bar')['B'].mean())
print(grouped.get_group('foo'))
print(grouped.get_group('foo')['B'].mean())
print(grouped.filter(lambda x: x['B'].mean() > 3.))

     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0
4.0
     A  B    C
0  foo  1  2.0
2  foo  3  8.0
4  foo  5  2.0
3.0
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0

DataFrameGroupBy.apply()
• Apply certain function on the group elements
grouped = df.groupby('A')
print(grouped.apply(lambda x: x.describe()))

             B         C
A
bar count  3.0  3.000000
    mean   4.0  5.000000
    std    2.0  4.000000
    min    2.0  1.000000
    25%    3.0  3.000000
    50%    4.0  5.000000
    75%    5.0  7.000000
    max    6.0  9.000000
foo count  3.0  3.000000
    mean   3.0  4.000000
    std    2.0  3.464102
    min    1.0  2.000000
    25%    2.0  2.000000
    50%    3.0  2.000000
    75%    4.0  5.000000
    max    5.0  8.000000
DataFrameGroupBy.apply()

def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})
grouped = df.groupby('A')
print(grouped['C'].apply(f))

Groupby()
• reset_index() can be used to reset the index to
integer values starting from 0
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]))

len min max


Country Language
Australia English 18 2500000.0 150000000.0
Bahamas English 1 5000000.0 5000000.0
Belgium English 4 15000000.0 34000000.0
Brazil English 1 3000000.0 3000000.0
Portuguese 2 4000000.0 4000000.0
... ... ... ...
USA English 1174 1400.0 263700000.0
Hebrew 1 NaN NaN
None 1 4000000.0 4000000.0
Spanish 3 1200000.0 6000000.0
United Arab Emirates Arabic 1 125000.0 125000.0
[83 rows x 3 columns]
Groupby()
print(movies.groupby(['Country','Language']).Budget.agg([len
,min,max]).reset_index())

Country Language len min max


0 Australia English 18 2500000.0 150000000.0
1 Bahamas English 1 5000000.0 5000000.0
2 Belgium English 4 15000000.0 34000000.0
3 Brazil English 1 3000000.0 3000000.0
4 Brazil Portuguese 2 4000000.0 4000000.0
.. ... ... ... ... ...
78 USA English 1174 1400.0 263700000.0
79 USA Hebrew 1 NaN NaN
80 USA None 1 4000000.0 4000000.0
81 USA Spanish 3 1200000.0 6000000.0
82 United Arab Emirates Arabic 1 125000.0 125000.0

[83 rows x 5 columns]

Groupby()
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]).reset_index().iloc[3])
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]).reset_index().iloc[3]['max'])

Country Brazil
Language English
len 1
min 3000000.0
max 3000000.0
Name: 3, dtype: object
3000000.0

Sorting
• The sort_values() method can be used to
sort dataframes according to the values of certain
columns.
print(movies.sort_values('Country').iloc[:,:5])#sort the whole dataframe by the Country column and display the first 5 columns

Sorting
print(movies.groupby(['Country'])['IMDB Score'].max())
print(movies.groupby(['Country'])['IMDB Score'].max().sort_values(ascending=False))

print(movies.groupby(['Country'])['IMDB Score'].agg([max]).sort_values('max',ascending=False))

Country
Australia               8.1
Bahamas                 4.4
Belgium                 7.1
.
.
Thailand                5.7
UK                      8.6
USA                     9.1
United Arab Emirates    8.2
Name: IMDB Score, dtype: float64
Country
Canada     9.5
USA        9.1
Poland     9.1
.
.
Nigeria    5.6
Georgia    5.6
Bahamas    4.4
Name: IMDB Score, dtype: float64

Sorting
• sort_values() works on Dataframes or Series
objects
print(type(movies.groupby(['Country'])))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

print(type(movies.groupby(['Country'])['IMDB Score']))
<class 'pandas.core.groupby.generic.SeriesGroupBy'>

print(type(movies.groupby(['Country'])['IMDB Score'].max()))
<class 'pandas.core.series.Series'>

Sorting
• sort_values() can sort by more than one column.
• sort_index() is used to sort elements by index.
print(movies.sort_values(['Language','Country']).iloc[:,:5])
print(movies.sort_values(['Language','Country']).loc
[:,['Title','Language','Country']].to_string())

Title ... Country


884 The Square ... Egypt
845 The Brain That Sings ... United Arab Emirates
308 In the Land of Blood and Honey ... USA
1026 Kung Fu Killer ... China
1164 Z Storm ... Hong K

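Sorting by several columns, with a different direction per column, and restoring the original order with sort_index() can be sketched on a small made-up frame (the values below are illustrative, not taken from the movies data).

```python
import pandas as pd

df = pd.DataFrame({'Language': ['English', 'Arabic', 'English', 'Arabic'],
                   'Country': ['USA', 'Egypt', 'UK', 'UAE'],
                   'Score': [7.0, 8.1, 6.5, 7.2]})

# Sort by Language, then by Country within each language
by_two = df.sort_values(['Language', 'Country'])

# One direction per column: Language ascending, Score descending
mixed = df.sort_values(['Language', 'Score'], ascending=[True, False])

# sort_index() puts the rows back in index order
restored = by_two.sort_index()

print(by_two)
print(mixed)
```

sort_values() returns a new dataframe and keeps the original index labels, which is why sort_index() can undo the reordering.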
Missing Data
• Several methods are available to deal with
missing data
df.method()                 description
dropna()                    Drop missing observations
dropna(how='all')           Drop observations where all cells are NA
dropna(axis=1, how='all')   Drop column if all the values are missing
dropna(thresh=5)            Drop rows that contain less than 5 non-missing values
fillna(0)                   Replace missing values with zeros
isnull()                    Returns True if the value is missing
notnull()                   Returns True for non-missing values


Missing Data
• To select NaN entries you can use pd.isnull()
(or its companion pd.notnull())

print(movies[pd.isnull(movies.Country)])

Title Year ... Reviews by Crtiics IMDB Score


963 Dawn Patrol 2014.0 ... 9.0 4.8
1497 10,000 B.C. NaN ... NaN 7.2
1529 Gone, Baby, Gone NaN ... NaN 6.6
1551 Preacher NaN ... 18.0 8.3

Missing Data
• To select rows that have NaN in any column, combine
isnull() with any(axis=1)
movies[movies.isnull().any(axis=1)].head()

Title ... IMDB Score


1 3 Backyards ... 5.2
2 3 ... 6.8
4 A Turtle's Tale: Sammy's Adventures ... 6.1
7 All Good Things ... 6.3
10 Anderson's Cross ... 7.2

[5 rows x 19 columns]

Replacing Missing Values
• Replacing missing values is a common operation.
• fillna() provides a few different strategies for mitigating such
data
movies.Country = movies.Country.fillna("X")
print(movies.iloc[963])
movies.Country.fillna("X",inplace = True)
print(movies.iloc[963])

Title                             Dawn Patrol
Year                                   2014.0
Genres                         Drama|Thriller
Language                              English
Country                                     X
Duration                                 88.0
Budget                              3500000.0
Gross Earnings                            NaN
Director                    Daniel Petrie Jr.
Actor 1                          Chris Brochu
Actor 2                            Jeff Fahey
Facebook Likes - Actor 1                795.0
Facebook Likes - Actor 2                535.0
Facebook likes - Movie                    570
Facenumber in posters                     0.0
User Votes                                455
Reviews by Users                         13.0
Reviews by Crtiics                        9.0
IMDB Score                                4.8
Name: 963, dtype: object

Replacing Missing Values
movies.fillna("Y",inplace = True)
print(movies.iloc[4])

Title                       A Turtle's Tale: Sammy's Adventures
Year                                                     2010.0
Genres                               Adventure|Animation|Family
Language                                                English
Country                                                  France
Duration                                                   88.0
Budget                                                        Y
Gross Earnings                                                Y
Director                                            Ben Stassen
Actor 1                                           Ed Begley Jr.
Actor 2                                          Jenny McCarthy
Facebook Likes - Actor 1                                  783.0
Facebook Likes - Actor 2                                  749.0
Facebook likes - Movie                                        0
Facenumber in posters                                       2.0
User Votes                                                 5385
Reviews by Users                                           22.0
Reviews by Crtiics                                         56.0
IMDB Score                                                  6.1
Name: 4, dtype: object

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Removing Records with Missing Values
• dropna() can be used to remove all the rows
with ‘NA’ values.

print(movies.shape) #(1604, 19)


movies.dropna(inplace=True)
print(movies.shape)#(1044, 19)

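The dropna() variants listed in the missing-data table behave quite differently on the same input. A minimal sketch on a made-up three-row frame in which column C is entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                   'B': [2.0, 3.0, np.nan],
                   'C': [np.nan, np.nan, np.nan]})

any_na = df.dropna()                          # drop rows with at least one NaN
all_na = df.dropna(how='all')                 # drop rows where every cell is NaN
no_empty_cols = df.dropna(axis=1, how='all')  # drop columns that are entirely NaN
at_least_two = df.dropna(thresh=2)            # keep rows with >= 2 non-missing values

print(len(any_na), len(all_na))
print(list(no_empty_cols.columns))
print(list(at_least_two.index))
```

Here the default dropna() removes every row because column C is always missing, which is why checking the shape before and after, as on the slide, is a useful habit.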
fillna()/Examples
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
print(df)
df1 = df.fillna(method='ffill')
print(df1)
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df2 = df.fillna(value=values)
print(df2)

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
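In recent pandas releases the method argument of fillna() is deprecated, and forward filling is written with the ffill() method instead. The sketch below shows that form together with a per-column fill; the small frame and the choice of the column mean as a fill value are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 3, np.nan],
                   'B': [2, np.nan, np.nan]})

# Forward fill: each NaN takes the last valid value above it
forward = df.ffill()

# A dictionary gives a separate fill value for each column
per_col = df.fillna({'A': 0, 'B': df['B'].mean()})

print(forward)
print(per_col)
```

A leading NaN (row 0 of column A here) has nothing above it to copy, so forward filling leaves it missing.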
Renaming
• rename() lets you change index names and/or column
names
• Change column name
movies.rename(columns={'IMDB Score':'IMDB_Score'},inplace=True)

• Change index
– Rarely used; set_index() can be used instead

movies.rename(index = {0:'m0',1:'m1'})

Combining
• Dataframes can be combined into one
Dataframe
– concat(), join() and merge() are useful methods for
this purpose.

Combining/Example

df1 = pd.DataFrame([[1, 2, 5, 0],
                    [3, 4, 6, 1]],
                   columns=list('ABCD'))

df2 = pd.DataFrame([[1, 1, 1, 1],
                    [2, 2, 2, 2]],
                   columns=list('ABCD'))

df4 = pd.DataFrame([[1, 1, 1, 1],
                    [2, 2, 2, 2]],
                   columns=list('ABCF'))

Combining/concat()
• Concatenate along an axis
df3 = pd.concat([df1,df2],ignore_index=True)
print(df3)

df3 = pd.concat([df1,df2],axis=1)
print(df3)

A B C D
0 1 2 5 0
1 3 4 6 1
2 1 1 1 1
3 2 2 2 2

A B C D A B C D
0 1 2 5 0 1 1 1 1
1 3 4 6 1 2 2 2 2
Combining/concat()
• Concatenate with different columns labels

df3 = pd.concat([df1,df4],sort=False,ignore_index=True)
print(df3)

A B C D F
0 1 2 5 0.0 NaN
1 3 4 6 1.0 NaN
2 1 1 1 NaN 1.0
3 2 2 2 NaN 2.0

Combining/join()
• Join the columns of another dataframe on the index; the suffixes
distinguish overlapping column names

df3 = df1.join(df4,lsuffix="_X",rsuffix="_Y")
print(df3)

A_X B_X C_X D A_Y B_Y C_Y F


0 1 2 5 0 1 1 1 1
1 3 4 6 1 2 2 2 2

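merge() is listed among the combining methods but is not demonstrated on the slides. The sketch below shows one common use, matching rows on a shared key column; the two small frames and their contents are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Mohammad', 'Ahmad', 'Leen'],
                     'Age': [12, 25, 17]})
right = pd.DataFrame({'Name': ['Ahmad', 'Leen', 'Haneen'],
                      'Hobby': ['Singing', 'Reading', 'Reading']})

# Inner merge (the default): keep only names present in both frames
inner = pd.merge(left, right, on='Name')

# Outer merge: keep every name; missing cells become NaN
outer = pd.merge(left, right, on='Name', how='outer')

print(inner)
print(outer)
```

Unlike join(), which aligns on the index, merge() aligns on the values of the column(s) given in on.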
References
• https://pandas.pydata.org/docs/
• https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html