Exp 7
Outline
• Series and Dataframes
• Reading the data
• Exploring the data
• Indexing
• Selection
• Data Analysis
• Grouping
• Applying functions
• Sorting
• Missing values
• Combining
Pandas
• Adds data structures and tools designed to work
with table-like data
• Provides tools for data manipulation: reshaping,
merging, sorting, slicing, aggregation etc.
• Clean messy data sets, and make them readable
and relevant.
• Allows handling missing data
• The name "Pandas" refers to both
"Panel Data" and "Python Data Analysis"
Pandas Data Structures
• Series: a one-dimensional data structure (a
column) that stores values, and for every
value it holds an index label, too.
With the index argument, you can name your own labels.
S = pd.Series([15,20,13,55,67,34,23,1],
              index=['i0','i1','i2','i3','i4','i5','i6','i7'])
print(S)
print(S['i0'])
print(S['i5'])
i0    15
i1    20
i2    13
i3    55
i4    67
i5    34
i6    23
i7     1
dtype: int64
15
34
Dataframes
• Index by default is an integer and starts from 0
df = pd.DataFrame({'Name':['Mohammad', 'Ahmad','Haneen','Leen'],
'Age':[12,25,40,17]})
print(df)#for a long DataFrame, shows only the first 5 rows and the last 5 rows
print(df.to_string())#shows all the rows of the dataframe
Name Age
0 Mohammad 12
1 Ahmad 25
2 Haneen 40
3 Leen 17
DataFrames with Index
df = pd.DataFrame({'Name':['Mohammad',
'Ahmad','Haneen','Leen'],
'Age':[12,25,40,17]},
index = ['s1','s2','s3','s4'])
print(df)
Name Age
s1 Mohammad 12
s2 Ahmad 25
s3 Haneen 40
s4 Leen 17
Reading Data Files
• Several file types can be read, and their
content is stored in a Series or DataFrame
import numpy as np
import pandas as pd
filename =r'C:\Users\user\Desktop\movies.xls'
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.shape)#(1604,19)
# Note: you may need to install xlrd package to run the code
Reading Data Files
• Several file types can be read, and their
content is stored in a Series or DataFrame
import numpy as np
import pandas as pd
filename =r'C:\Users\user\Desktop\movies.xls'
movies = pd.read_excel(filename,sheet_name=1)#reads the second sheet, xlrd
print(movies.shape)#(1604,19)
#to read .csv file type
df = pd.read_csv('data.csv')
print(df)
Other read_*
• https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
Exploring the Data
• Data types
print(movies.dtypes)
print(movies.Country.dtype)
movies.Duration.astype('int32')#make sure that you do not have NA values
Title                        object
Year                        float64
Genres                       object
Language                     object
Country                      object
Duration                    float64
Budget                      float64
Gross Earnings              float64
Director                     object
Actor 1                      object
Actor 2                      object
Facebook Likes - Actor 1    float64
Facebook Likes - Actor 2    float64
Facebook likes - Movie        int64
Facenumber in posters       float64
User Votes                    int64
Reviews by Users            float64
Reviews by Crtiics          float64
IMDB Score                  float64
Exploring the Data
filename =r'C:\Users\mohammad\Desktop\movies.xls'
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.shape)
print(movies.head(4))
print(movies.tail())# last five records
(1604, 19)
Title Year ... Reviews by Crtiics IMDB Score
0 127 Hours 2010.0 ... 450.0 7.6
1 3 Backyards 2010.0 ... 20.0 5.2
2 3 2010.0 ... 76.0 6.8
3 8: The Mormon Proposition 2010.0 ... 28.0 7.1
Exploring the Data
import numpy as np
import pandas as pd
filename =r'C:\Users\mohammad\Desktop\movies.xls'
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.columns)# column titles
Exploring the Data/Data Frames attributes
df.attribute    description
dtypes          list the types of the columns
columns         list the column names
axes            list the row labels and column names
ndim            number of dimensions
size            number of elements
shape           return a tuple representing the dimensionality
values          numpy representation of the data
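The attributes in the table can be tried on a small frame. The sketch below uses a made-up three-row DataFrame (not the movies data), so the values in the comments apply to this example only.

```python
import pandas as pd

# Hypothetical two-column frame used only to illustrate the attributes.
df = pd.DataFrame({'Name': ['Mohammad', 'Ahmad', 'Haneen'],
                   'Age': [12, 25, 40]})

print(df.dtypes)    # type of each column
print(df.columns)   # column names
print(df.axes)      # [row labels, column names]
print(df.ndim)      # 2
print(df.size)      # 6 (3 rows x 2 columns)
print(df.shape)     # (3, 2)
print(df.values)    # underlying NumPy array
```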
Adding Column Titles
• By default the first row is treated as the
column headers.
• If the file has no column headers, read
the data as follows:
movies = pd.read_excel(filename,header=None)
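A minimal sketch of the effect of header=None, and of supplying titles with the names argument. It reads from an in-memory CSV string with read_csv so that it runs without a file; the two data rows and the column titles are assumptions for illustration. read_excel accepts the same header and names arguments.

```python
import io
import pandas as pd

# Hypothetical headerless data; a real file path would work the same way.
raw = io.StringIO("127 Hours,2010,7.6\n3 Backyards,2010,5.2\n")

# header=None: the first row is data, columns are numbered 0, 1, 2.
no_header = pd.read_csv(raw, header=None)
print(no_header.columns.tolist())

# names=...: supply the column titles while reading.
raw.seek(0)
named = pd.read_csv(raw, header=None, names=['Title', 'Year', 'IMDB Score'])
print(named)
```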
Data Access/Column Access
• To access the data you can use the column
header as follows:
print(movies['Title'])
print(movies.Title)
0 127 Hours
1 3 Backyards
2 3
3 8: The Mormon Proposition
4 A Turtle's Tale: Sammy's Adventures
...
1602 Wuthering Heights
1603 Yu-Gi-Oh! Duel Monsters
Name: Title, Length: 1604, dtype: object
Data Access/Column Access
• To access several columns at once, pass a list of
column headers:
print(movies[['Year','IMDB Score']])
A 1 B 1 C 1 D 1
0 1 1 1 1
1 2 2 2 2
A_1 B_1 C_1 D_1
0 1 1 1 1
1 2 2 2 2
Indexing
• Native Python accessors (the indexing operator []
and attribute access) work just like they do in the rest of the
Python ecosystem.
• Pandas has its own access operators
– iloc: index (integer position) based selection
– loc: label based selection
Index Based Selection
(iloc)
print(movies.iloc[5])#returns the row at position 5
print(movies.iloc[:5])#returns rows 0,1,2,3,4
print(movies.iloc[:5,0])#returns titles (column 0) of the first 5 movies
print(movies.iloc[[0,1,2,3,4],0])#returns titles (column 0) of the first 5 movies
print(movies.iloc[[0,1,2,3,4],18])#returns IMDB scores (column 18) of the first 5 movies
Label Based Selection
(loc)
• When using loc the indexing is inclusive
– The start and end are included.
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.loc[5])#returns the row labelled 5
print(movies.loc[:5])#returns rows 0,1,2,3,4,5
print(movies.loc[:5,'Title'])#returns titles of the first 6 movies
print(movies.loc[[0,1,2,3,4],'IMDB Score'])#returns IMDB Scores of the first 5 movies
print(movies.loc[movies.index[-5:],'IMDB Score'])#returns IMDB Scores of the last 5 movies; loc[-5:] would treat -5 as a label, not a position
print(movies.loc[movies.index[-5:],['IMDB Score','Title']])#returns IMDB Scores and titles of the last 5 movies
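The difference between the two operators is easiest to see side by side. The sketch below uses a made-up eight-row frame with the default integer index; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..7.
df = pd.DataFrame({'Title': list('abcdefgh'), 'Score': range(8)})

by_position = df.iloc[:5]   # positions 0..4, end excluded
by_label = df.loc[:5]       # labels 0..5, end included
print(len(by_position), len(by_label))

# Last two rows: negative positions work with iloc ...
last_iloc = df.iloc[-2:, 1]
# ... while loc needs real labels, taken here from the index itself.
last_loc = df.loc[df.index[-2:], 'Score']
print(last_loc.tolist())
```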
set_index()
• An Index column will be added to the
dataframe by default
– The range is from 0 to (number of rows - 1)
• set_index() can be used to set any of the
columns values to be used as a row index
– Duplicates are allowed
set_index()
df = pd.DataFrame({'Name':['Mohammad',
'Mohammad','Haneen','Leen'],
'Age':[12,25,40,17],
'Hobby':['Soccer','Singing','Reading','Reading']})
X = df.set_index('Age')
print(X)
df.set_index('Name',inplace=True)
print(df)
print(df.loc['Mohammad'])
Selection
• Several Techniques can be used to select
certain elements
– Relational Operators >,<,>=….
– isin()
– notnull(), isnull()
Selection/Examples
print(movies.loc[movies.Budget.notnull()])
print(movies.loc[movies.Budget.isnull()])
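The other techniques listed (relational operators and isin()) follow the same pattern of passing a boolean Series to loc. A minimal sketch on a made-up stand-in for the movies frame; the rows and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movies frame.
films = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Country': ['USA', 'UK', 'Spain', 'USA'],
    'Budget': [1e6, np.nan, 3e6, 5e6],
})

# Relational operator: rows whose budget exceeds 2 million.
big = films.loc[films.Budget > 2e6]

# isin(): rows whose country appears in a given list.
european = films.loc[films.Country.isin(['UK', 'Spain'])]

# Conditions combine with & (and) and | (or); comparisons need parentheses.
usa_with_budget = films.loc[(films.Country == 'USA') & films.Budget.notnull()]

print(big.Title.tolist(), european.Title.tolist(), usa_with_budget.Title.tolist())
```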
Assigning Data
• Assignment operator is used
– Broadcasting is supported
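A short sketch of assignment with and without broadcasting, on a made-up frame (the column names and values are assumptions for illustration).

```python
import pandas as pd

films = pd.DataFrame({'Title': ['A', 'B', 'C'], 'Duration': [90, 100, 110]})

# A scalar is broadcast to every row of the new column.
films['Source'] = 'IMDB'

# Arithmetic with a scalar is broadcast element-wise too.
films['Duration'] = films['Duration'] + 15

# An iterable of matching length assigns one value per row.
films['Rank'] = range(len(films), 0, -1)

print(films)
```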
Data Analysis
describe()
• Generates a high-level summary of the
attributes of the given column.
– It is type-aware, meaning that its output changes
based on the data type of the input
– For numeric data, the result’s index will include
count, mean, std, min, max and the 25th, 50th and 75th
percentiles
– For object data (e.g. strings or timestamps), the
result’s index will include count, unique, top, and
freq
Data Analysis
describe()
• For mixed data types provided via a
DataFrame, the default is to return only an
analysis of numeric columns.
print(movies.describe())
count 1594.000000
mean 103.328733
std 27.429001
min 7.000000
25% 92.000000
50% 102.000000
75% 114.750000
max 511.000000
Name: Duration, dtype: float64
<class 'pandas.core.series.Series'>
511.0
511.0
Data Analysis/Summary
describe()
• String Fields
print(movies.Title.describe())
print(movies.Title.describe().top)
print(movies.Title.describe()['top'])
print(movies.Title.describe().loc['top'])
count 1604
unique 1551
top Victor Frankenstein
freq 3
Name: Title, dtype: object
Victor Frankenstein
Victor Frankenstein
Victor Frankenstein
Data Analysis/Basic Statistics
print(movies.describe())
print(movies.describe().loc['max','Budget'])
print(movies.Budget.max())
print(movies.Budget.mean())
print(movies.Budget.mode())#most frequent value(s)
print(movies.Duration.mean().round())
movies.Duration += 15#Broadcasting
print(movies.Duration.mean().round())
Year Duration ... Reviews by Crtiics IMDB Score
count 1497.000000 1594.000000 ... 1571.000000 1604.000000
mean 2012.773547 103.328733 ... 187.586887 6.337718
std 1.868725 27.429001 ... 165.281572 1.169382
min 2010.000000 7.000000 ... 1.000000 1.600000
25% 2011.000000 92.000000 ... 38.000000 5.700000
50% 2013.000000 102.000000 ... 159.000000 6.400000
75% 2014.000000 114.750000 ... 288.000000 7.100000
max 2016.000000 511.000000 ... 813.000000 9.500000
600000000.0
600000000.0
40563243.28888889
0 20000000.0
dtype: float64
103.0
118.0
Data Analysis/Summary
Aggregation
• The agg() method is useful when multiple
statistics are computed per column:
print(movies[['Budget','IMDB Score']].agg([len,min,max]))
print(movies[['Budget','IMDB Score']].agg([len,min,max]).loc['max','Budget'])
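agg() also accepts function names as strings, and a dict to request different statistics per column. The sketch below is one possible variation on a made-up frame; note that 'count' skips missing values, whereas len counts every row.

```python
import numpy as np
import pandas as pd

films = pd.DataFrame({'Budget': [1e6, np.nan, 3e6],
                      'IMDB Score': [7.6, 5.2, 6.8]})

# String names select pandas' own reductions; 'count' ignores NaN.
summary = films.agg(['count', 'min', 'max'])
print(summary)

# A dict applies a different statistic to each column.
per_column = films.agg({'Budget': 'max', 'IMDB Score': 'mean'})
print(per_column)
```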
Data Analysis/Summary
describe()
Grouping
• The groupby() method is used to group the rows
in the dataframe based on the values of certain
column(s).
– movies.groupby(['Country']) groups the rows
based on the Country
• Number of groups will be equal to the number of
countries
– We can perform operations on each group
Grouping
grouped = movies.groupby('Country')
print(grouped.groups)#shows indices
print(grouped.get_group('Spain'))
{'Australia': [16, 54, 65, 194, 294, 360, 473, 692, 859, 860, 880, 1014, 1120, 1138,
1269, 1319, 1514, 1601], 'Bahamas': [978], 'Belgium': [399, 895, 1348, 1349],
'Brazil': [804, 992, 1359], 'Bulgaria': [942], 'Cambodia': [1033],
'Canada': [13, 17, 26, 68, 78, 140, 143, 216, 255, 290, 300, 350, 450, 467, 492, 504, 526,…
Title Year ... Reviews by Crtiics IMDB Score
23 Buried 2010.0 ... 363.0 7.0
258 Blackthorn 2011.0 ... 92.0 6.6
334 Midnight in Paris 2011.0 ... 487.0 7.7
375 Sleep Tight 2011.0 ... 191.0 7.2
430 There Be Dragons 2011.0 ... 77.0 5.9
571 Red Lights 2012.0 ... 195.0 6.2
631 The Impossible 2012.0 ... 371.0 7.6
902 Underdogs 2013.0 ... 82.0 6.7
927 Aloft 2014.0 ... 56.0 5.3
1006 Hidden Away 2014.0 ... 9.0 7.2
1225 Eden 2015.0 ... 5.0 4.8
1295 Regression 2015.0 ... 140.0 5.7
print(movies.groupby(['Country']).Budget.sum().head(5))
Country
Australia 751500000.0
Bahamas 5000000.0
Belgium 49000000.0
Brazil 11000000.0
Bulgaria 7000000.0
Name: Budget, dtype: float64
Grouping
• Example: Find the number of movies produced by each country
print(movies.groupby(['Country']).Title.count().sort_values())
print(movies.groupby(['Country']).Title.count().sort_values().max())
var=movies.groupby(['Country']).Title.count().sort_values()
print(var.index[var.shape[0]-1])
print(movies['Country'].describe().top)
Country
United Arab Emirates 1
Iran 1
.
.
Canada 44
France 54
UK 136
USA 1184
Name: Title, dtype: int64
1184
USA
USA
Grouping and Aggregation
• Use agg() to display more than one function
per group
– Results generated per group
print(movies.groupby(['Country']).Budget.agg([len,min,max]))
Grouping
• Notice that the result has a new index, in this case a
multi-index with one level per grouping column.
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]))
var=movies.groupby(['Country','Language']).Budget.agg([len,min,max])
print(var.loc['Brazil','max'])
print(var.loc['Brazil','max'].loc['English'])
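A small sketch of how a multi-index result can be addressed. The frame is made up for illustration, and 'size' is used in place of len so that the column label differs from the slides.

```python
import pandas as pd

films = pd.DataFrame({
    'Country': ['Brazil', 'Brazil', 'Brazil', 'Spain'],
    'Language': ['English', 'Portuguese', 'Portuguese', 'Spanish'],
    'Budget': [3e6, 4e6, 4e6, 2e6],
})

stats = films.groupby(['Country', 'Language']).Budget.agg(['size', 'min', 'max'])

print(stats.index.nlevels)                        # number of index levels
print(stats.loc['Brazil'])                        # all rows for one outer label
print(stats.loc[('Brazil', 'English'), 'max'])    # one cell, addressed by a tuple
print(stats.xs('Portuguese', level='Language'))   # select on the inner level
```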
DataFrameGroupBy.apply()
• Apply a certain function to the elements of each group
grouped = df.groupby('A')
print(grouped.apply(lambda x: x.describe()))
             B         C
A
bar count  3.0  3.000000
    mean   4.0  5.000000
    std    2.0  4.000000
    min    2.0  1.000000
    25%    3.0  3.000000
    50%    4.0  5.000000
    75%    5.0  7.000000
    max    6.0  9.000000
foo count  3.0  3.000000
    mean   3.0  4.000000
    std    2.0  3.464102
    min    1.0  2.000000
    25%    2.0  2.000000
    50%    3.0  2.000000
    75%    4.0  5.000000
    max    5.0  8.000000
DataFrameGroupBy.apply()
def f(group):
return pd.DataFrame({'original': group,
'demeaned': group - group.mean()})
grouped = df.groupby('A')
print(grouped['C'].apply(f))
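The slides do not show the df used here. The sketch below assumes one whose per-group statistics match the describe() output on the previous slide, and computes the demeaned column with transform() instead of apply(). This is a swapped-in technique that keeps the result aligned with the original rows.

```python
import pandas as pd

# Assumed input: values chosen to match the per-group statistics shown
# earlier; the original df is not given in the slides.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2, 1, 2, 5, 8, 9]})

grouped = df.groupby('A')

# transform() returns one value per original row, aligned with df.
df['C_demeaned'] = grouped['C'].transform(lambda s: s - s.mean())
print(df)
```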
Groupby()
• reset_index() can be used to reset the index to
integer values starting from 0
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]))
Groupby()
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]).reset_index().iloc[3])
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]).reset_index().iloc[3]['max'])
Country Brazil
Language English
len 1
min 3000000.0
max 3000000.0
Name: 3, dtype: object
3000000.0
Sorting
• sort_values() function/method can be used to
sort dataframes according to certain column
values.
print(movies.sort_values('Country').iloc[:,:5])#sort all the dataframe by the column Country, and display the first 5 columns
Sorting
print(movies.groupby(['Country'])['IMDB Score'].max())
print(movies.groupby(['Country'])['IMDB Score'].max().sort_values(ascending=False))
print(movies.groupby(['Country'])['IMDB Score'].agg([max]).sort_values('max',ascending=False))
Country
Australia 8.1
Bahamas 4.4
Belgium 7.1
.
.
Thailand 5.7
UK 8.6
USA 9.1
United Arab Emirates 8.2
Name: IMDB Score, dtype: float64
Country
Canada 9.5
USA 9.1
Poland 9.1
.
.
Nigeria 5.6
Georgia 5.6
Bahamas 4.4
Name: IMDB Score, dtype: float64
Sorting
• sort_values() works on Dataframes or Series
objects
print(type(movies.groupby(['Country'])))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
print(type(movies.groupby(['Country'])['IMDB Score']))
<class 'pandas.core.groupby.generic.SeriesGroupBy'>
print(type(movies.groupby(['Country'])['IMDB Score'].max()))
<class 'pandas.core.series.Series'>
Sorting
• sort_values() can sort by more than one column.
• sort_index() is used to sort elements by index.
print(movies.sort_values(['Language','Country']).iloc[:,:5])
print(movies.sort_values(['Language','Country']).loc[:,['Title','Language','Country']].to_string())
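A minimal sketch of sorting by two columns with a different direction per column, followed by sort_index() to restore the original row order. The frame and its values are made up for illustration.

```python
import pandas as pd

films = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Language': ['English', 'Arabic', 'English', 'Arabic'],
    'Score': [7.0, 6.5, 8.1, 5.9],
})

# One direction per sort key: Language ascending, Score descending.
ranked = films.sort_values(['Language', 'Score'], ascending=[True, False])
print(ranked)

# sort_index() orders the rows by their index labels again.
restored = ranked.sort_index()
print(restored.index.tolist())
```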
Missing Data
• Several Methods are available to deal with
missing data
df.method()    description
dropna()       Drop missing observations
print(movies[pd.isnull(movies.Country)])
Missing Data
• To select NaN entries you can use pd.isnull()
(or its companion pd.notnull())
movies[movies.isnull().any(axis=1)].head()
[5 rows x 19 columns]
Replacing Missing Values
• Replacing missing values is a common operation.
• fillna() provides a few different strategies for mitigating such
data
movies.Country = movies.Country.fillna("X")
print(movies.iloc[963])
movies.Country.fillna("X",inplace = True)#in-place alternative; recent pandas versions may warn about this chained form, so the assignment above is safer
print(movies.iloc[963])
Title                             Dawn Patrol
Year                                   2014.0
Genres                         Drama|Thriller
Language                              English
Country                                     X
Duration                                 88.0
Budget                              3500000.0
Gross Earnings                            NaN
Director                    Daniel Petrie Jr.
Actor 1                          Chris Brochu
Actor 2                            Jeff Fahey
Facebook Likes - Actor 1                795.0
Facebook Likes - Actor 2                535.0
Facebook likes - Movie                    570
Facenumber in posters                     0.0
User Votes                                455
Reviews by Users                         13.0
Reviews by Crtiics                        9.0
IMDB Score                                4.8
Name: 963, dtype: object
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Removing Records with Missing Values
• dropna() can be used to remove all the rows
with ‘NA’ values.
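A short sketch of dropna() and two of its common options, on a made-up frame (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7, 8, 9]})

rows_complete = df.dropna()             # keep rows with no NaN at all
rows_with_a = df.dropna(subset=['A'])   # only require column A to be present
cols_complete = df.dropna(axis=1)       # drop columns that contain NaN

print(rows_complete)
print(rows_with_a)
print(cols_complete)
```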
fillna()/Examples
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
print(df)
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
df1 = df.ffill()#forward fill; older code writes df.fillna(method='ffill'), which is deprecated in recent pandas
print(df1)
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df2 = df.fillna(value=values)
print(df2)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
Renaming
• rename() lets you change index names and/or column
names
• Change column name
movies.rename(columns={'IMDB Score':'IMDB_Score'},inplace=True)
• Change index
– Rarely used; set_index() can be used instead
movies.rename(index = {0:'m0',1:'m1'})
Combining
• Dataframes can be combined into one
Dataframe
– concat(), join() and merge() are useful methods for
this purpose.
Combining/Example
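The frames used in the combining examples are not shown in the text. The sketch below is an assumed reconstruction of df1, df2 and df4, inferred from the printed concat() results on the following slides; the actual definitions may have differed.

```python
import pandas as pd

# Assumed inputs, reconstructed from the printed concat() results.
df1 = pd.DataFrame([[1, 2, 5, 0], [3, 4, 6, 1]], columns=list('ABCD'))
df2 = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2]], columns=list('ABCD'))
df4 = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2]], columns=list('ABCF'))

print(df1)
print(df2)
print(df4)
```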
Combining/concat()
• Concatenate along an axis
df3 = pd.concat([df1,df2],ignore_index=True)
print(df3)
df3 = pd.concat([df1,df2],axis=1)
print(df3)
A B C D
0 1 2 5 0
1 3 4 6 1
2 1 1 1 1
3 2 2 2 2
A B C D A B C D
0 1 2 5 0 1 1 1 1
1 3 4 6 1 2 2 2 2
Combining/concat()
• Concatenate with different columns labels
df3 = pd.concat([df1,df4],sort=False,ignore_index=True)
print(df3)
A B C D F
0 1 2 5 0.0 NaN
1 3 4 6 1.0 NaN
2 1 1 1 NaN 1.0
3 2 2 2 NaN 2.0
Combining/join()
• Join the columns of two DataFrames on their index; the suffixes distinguish overlapping column labels
df3 = df1.join(df4,lsuffix="_X",rsuffix="_Y")
print(df3)
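A minimal sketch contrasting join(), which aligns rows on the index, with merge(), which aligns rows on column values. The frames and the 'key' column are hypothetical, chosen only to show the difference.

```python
import pandas as pd

# Hypothetical frames sharing the column names 'key' and 'value'.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'value': [20, 30, 40]})

# join() aligns on the index; suffixes separate the overlapping names.
joined = left.join(right, lsuffix='_X', rsuffix='_Y')
print(joined.columns.tolist())

# merge() aligns on column values, like a database join.
merged = left.merge(right, on='key', how='inner', suffixes=('_X', '_Y'))
print(merged)
```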
References
• https://pandas.pydata.org/docs/
• https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html