Exp 7

This document provides an overview of Pandas, a Python library used for data analysis and manipulation. It discusses Pandas data structures like Series and DataFrames, how to read data from different file types into Pandas, and how to explore and access data in Pandas. Key points covered include indexing and selecting data, applying functions to data, and handling missing values.

Uploaded by

Batool Fassi

Pandas

Prepared by Dr. Mohammad Abdel-majeed

Updated by Dr. Samah Rahamneh, Eng. Abeer Awad

Data science is a branch of computer science that studies how to store, use and analyze data in order to derive information from it.

Outline
• Series and Dataframes
• Reading the data
• Exploring the data
• Indexing
• Selection
• Data Analysis
• Grouping
• Applying functions
• Sorting
• Missing values
• Combining

Pandas
• Adds data structures and tools designed to work
with table-like data
• Provides tools for data manipulation: reshaping,
merging, sorting, slicing, aggregation etc.
• Clean messy data sets, and make them readable
and relevant.
• Allows handling missing data
• The name "Pandas" refers to both "Panel Data" and
"Python Data Analysis".

Pandas Data Structures
• Series: a one-dimensional data structure (a single column) that stores values; every value is paired with an index label.

• DataFrame: a two-dimensional data structure, basically a table with rows and columns. The columns have names and the rows have index labels.

import pandas as pd
# Pandas functions are then called through the alias: pd.function_name(...)

Series
S = pd.Series([15,20,13,55,67,34,23,1])
print(S)
print(S[0])

0    15
1    20
2    13
3    55
4    67
5    34
6    23
7     1
15

With the index argument, you can assign your own labels.

S = pd.Series([15,20,13,55,67,34,23,1],
              index=['i0','i1','i2','i3','i4','i5','i6','i7'])
print(S)
print(S['i0'])
print(S['i5'])

i0    15
i1    20
i2    13
i3    55
i4    67
i5    34
i6    23
i7     1
15
34

Dataframes
• By default, the index is an integer range that starts from 0

df = pd.DataFrame({'Name':['Mohammad','Ahmad','Haneen','Leen'],
                   'Age':[12,25,40,17]})
print(df)              # a large DataFrame is truncated to its first 5 and last 5 rows
print(df.to_string())  # shows all the rows of the DataFrame

Name Age
0 Mohammad 12
1 Ahmad 25
2 Haneen 40
3 Leen 17

DataFrames with Index

df = pd.DataFrame({'Name':['Mohammad','Ahmad','Haneen','Leen'],
                   'Age':[12,25,40,17]},
                  index=['s1','s2','s3','s4'])
print(df)

Name Age
s1 Mohammad 12
s2 Ahmad 25
s3 Haneen 40
s4 Leen 17

Reading Data Files
• Several file types can be read, and their contents
are stored in a Series or a DataFrame

import numpy as np
import pandas as pd
filename = r'C:\Users\user\Desktop\movies.xls'
movies = pd.read_excel(filename)  # reads the first sheet
print(movies.shape)  # (1604, 19)
# Note: reading .xls files relies on the xlrd package; you may need to install it

Reading Data Files
• Several file types can be read, and their contents
are stored in a Series or a DataFrame
import numpy as np
import pandas as pd
filename = r'C:\Users\user\Desktop\movies.xls'
movies = pd.read_excel(filename, sheet_name=1)  # reads the second sheet (sheets are numbered from 0)
print(movies.shape)  # (1604, 19)
# to read a .csv file
df = pd.read_csv('data.csv')
print(df)
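
The reading step can be tried without any file on disk. The following is a minimal sketch, assuming a small made-up CSV text (csv_text and sample are hypothetical names, not part of the movies data set); read_csv accepts a file-like object as well as a path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Title,Year,IMDB Score\n127 Hours,2010,7.6\n3 Backyards,2010,5.2\n"

# The first row of the text is used as the column headers.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column headers taken from the first row
```

Wrapping the text in io.StringIO keeps the example self-contained; with a real file, the path string is passed instead.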

Other read_*
• https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Exploring the Data
• Data types

print(movies.dtypes)
print(movies.Country.dtype)
movies.Duration.astype('int32')  # make sure that you do not have NA values

Title                        object
Year                        float64
Genres                       object
Language                     object
Country                      object
Duration                    float64
Budget                      float64
Gross Earnings              float64
Director                     object
Actor 1                      object
Actor 2                      object
Facebook Likes - Actor 1    float64
Facebook Likes - Actor 2    float64
Facebook likes - Movie        int64
Facenumber in posters       float64
User Votes                    int64
Reviews by Users            float64
Reviews by Crtiics          float64
IMDB Score                  float64

Exploring the Data
filename = r'C:\Users\mohammad\Desktop\movies.xls'
movies = pd.read_excel(filename)  # reads the first sheet
print(movies.shape)
print(movies.head(4))  # first four records
print(movies.tail())   # last five records

(1604, 19)
Title Year ... Reviews by Crtiics IMDB Score
0 127 Hours 2010.0 ... 450.0 7.6
1 3 Backyards 2010.0 ... 20.0 5.2
2 3 2010.0 ... 76.0 6.8
3 8: The Mormon Proposition 2010.0 ... 28.0 7.1

Exploring the Data

import numpy as np
import pandas as pd
filename =r'C:\Users\mohammad\Desktop\movies.xls'
movies = pd.read_excel(filename)#reads the first sheet, xlrd
print(movies.columns)  # column titles

Index(['Title', 'Year', 'Genres', 'Language', 'Country', 'Duration', 'Budget',
       'Gross Earnings', 'Director', 'Actor 1', 'Actor 2',
       'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2',
       'Facebook likes - Movie', 'Facenumber in posters', 'User Votes',
       'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score'],
      dtype='object')

Exploring the Data/DataFrame attributes

df.attribute    description
dtypes          list the types of the columns
columns         list the column names
axes            list the row labels and column names
ndim            number of dimensions
size            number of elements
shape           return a tuple representing the dimensionality
values          numpy representation of the data
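
As a quick illustration of the attributes listed above, here is a minimal sketch on the small Name/Age DataFrame used earlier in the slides (the variable name people is an assumption made for this example).

```python
import pandas as pd

# Small DataFrame with 4 rows and 2 columns.
people = pd.DataFrame({'Name': ['Mohammad', 'Ahmad', 'Haneen', 'Leen'],
                       'Age': [12, 25, 40, 17]})

print(people.dtypes)   # type of each column
print(people.columns)  # column names
print(people.axes)     # [row labels, column names]
print(people.ndim)     # number of dimensions
print(people.size)     # number of elements (rows * columns)
print(people.shape)    # (rows, columns)
print(people.values)   # the data as a NumPy array
```

Note that these are attributes, not methods, so they are written without parentheses.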

Add Column Titles
• By default, the first row is treated as the column headers.
• If the file has no column headers, read the data as follows:
movies = pd.read_excel(filename, header=None)

• With header=None, the column headers default to integer values starting from 0.
• To add column headers, create a list of header names and assign it to the columns attribute of the DataFrame:
movies.columns = list(range(19))
movies.columns = ['col1', 'col2', …]
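
The header handling described above can be sketched on a tiny in-memory CSV. This is only an illustration under assumed data: the text in raw and the three header names are made up, and read_csv is used in place of read_excel so that no file is needed (both accept header=None).

```python
import io
import pandas as pd

# Hypothetical data with no header row.
raw = "127 Hours,2010,7.6\n3 Backyards,2010,5.2\n"

# header=None keeps the first line as data instead of using it as headers.
no_header = pd.read_csv(io.StringIO(raw), header=None)
default_cols = no_header.columns.tolist()  # integer headers starting from 0
print(default_cols)

# Assign one header name per column.
no_header.columns = ['Title', 'Year', 'IMDB Score']
print(no_header)
```

The assigned list must have exactly as many names as the DataFrame has columns; otherwise pandas raises an error.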

Data Access/Column Access
• To access a column, use its header, either as a key
or as an attribute:
print(movies['Title'])
print(movies.Title)

0 127 Hours
1 3 Backyards
2 3
3 8: The Mormon Proposition
4 A Turtle's Tale: Sammy's Adventures
...
1602 Wuthering Heights
1603 Yu-Gi-Oh! Duel Monsters
Name: Title, Length: 1604, dtype: object

Data Access/Column Access
• To access several columns at once, pass a list of
column headers:
print(movies[['Year','IMDB Score']])

Year IMDB Score

0 2010.0 7.6
1 2010.0 5.2
2 2010.0 6.8
3 2010.0 7.1
4 2010.0 6.1
... ... ...
1600 NaN 7.3
1601 NaN 7.1
1602 NaN 7.7
1603 NaN 7.0

Renaming Columns
• Attribute-style access (df.column_name) only works when the
column name contains no spaces, so spaces are often replaced with underscores
df2 = pd.DataFrame([[1, 1, 1, 1],
[2, 2, 2, 2]],
columns=['A 1','B 1','C 1','D 1'])
print(df2)
df2.columns = [c.replace(' ', '_') for c in df2.columns]
print(df2)

A 1 B 1 C 1 D 1
0 1 1 1 1
1 2 2 2 2
A_1 B_1 C_1 D_1
0 1 1 1 1
1 2 2 2 2

Indexing
• The indexing operator [] and attribute selection work just like
they do in the rest of the Python ecosystem.
• Pandas has its own access operators
– iloc: index based selection
– loc : label based selection

Index Based Selection
(iloc)

print(movies.iloc[5])               # returns row 5
print(movies.iloc[:5])              # returns rows 0,1,2,3,4
print(movies.iloc[:5,0])            # returns the titles (column 0) of the first 5 movies
print(movies.iloc[[0,1,2,3,4],0])   # returns the titles (column 0) of the first 5 movies
print(movies.iloc[[0,1,2,3,4],18])  # returns the IMDB scores (column 18) of the first 5 movies

Label Based Selection
(loc)
• When using loc, slicing is inclusive
– Both the start and the end labels are included.
movies = pd.read_excel(filename)  # reads the first sheet
print(movies.loc[5])                         # returns the row labeled 5
print(movies.loc[:5])                        # returns rows 0,1,2,3,4,5
print(movies.loc[:5,'Title'])                # returns the titles of the first 6 movies
print(movies.loc[[0,1,2,3,4],'IMDB Score'])  # returns the IMDB scores of the first 5 movies
# loc works on labels, so a negative number does not count from the end:
# with the default integer index, movies.loc[-5:] returns every row.
# Take the last 5 labels from the index instead:
print(movies.loc[movies.index[-5:],'IMDB Score'])            # IMDB scores of the last 5 movies
print(movies.loc[movies.index[-5:],['IMDB Score','Title']])  # IMDB scores and titles of the last 5 movies
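
The difference between position-based and label-based slicing is easy to check on a small DataFrame. The sketch below assumes a made-up eight-row frame (toy and its columns are hypothetical) with the default integer index.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..7.
toy = pd.DataFrame({'Title': list('abcdefgh'),
                    'Score': [7.6, 5.2, 6.8, 7.1, 6.1, 6.3, 7.2, 5.9]})

by_position = toy.iloc[:5]  # positions 0..4, end excluded
by_label = toy.loc[:5]      # labels 0..5, end included
last_three = toy.iloc[-3:]  # negative positions count from the end

# loc treats -3 as a label; every label in 0..7 is >= -3, so all rows come back.
all_rows = toy.loc[-3:]

print(len(by_position), len(by_label), len(last_three), len(all_rows))
```

When the last rows are needed by label, toy.loc[toy.index[-3:]] gives the same rows as toy.iloc[-3:].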

set_index()
• A default index is added to the DataFrame
– It ranges from 0 to (number of rows - 1)
• set_index() can be used to make the values of any
column the row index
– Duplicate index values are allowed

set_index()
df = pd.DataFrame({'Name':['Mohammad','Mohammad','Haneen','Leen'],
                   'Age':[12,25,40,17],
                   'Hobby':['Soccer','Singing','Reading','Reading']})

X = df.set_index('Age')  # returns a new DataFrame indexed by Age
print(X)
df.set_index('Name',inplace=True)  # modifies df itself
print(df)
print(df.loc['Mohammad'])  # a duplicated label returns all matching rows

         Name    Hobby
Age
12   Mohammad   Soccer
25   Mohammad  Singing
40     Haneen  Reading
17       Leen  Reading

          Age    Hobby
Name
Mohammad   12   Soccer
Mohammad   25  Singing
Haneen     40  Reading
Leen       17  Reading

          Age    Hobby
Name
Mohammad   12   Soccer
Mohammad   25  Singing

Selection
• Several Techniques can be used to select
certain elements
– Relational Operators >,<,>=….
– isin()
– notnull(), isnull()

Selection/Examples

print(movies.loc[movies.Country == 'Spain'])  # all rows with Country == 'Spain'
print(movies[movies.Country == 'Spain'])      # same result without loc

print(movies[(movies['Country'] == 'Spain') & (movies['Reviews by Users'] > 400)])

print(movies[(movies['Year'] == 2012) | (movies['Year'] == 2011)])

print(movies[movies.Year.isin([2011,2012])])
print(movies.loc[movies.Year.isin([2011,2012])])

print(movies.loc[movies.Budget.notnull()])
print(movies.loc[movies.Budget.isnull()])
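
Each selection above builds a boolean Series (a mask) and keeps the rows where it is True. A minimal sketch on made-up data follows; films and its values are assumptions for illustration, not rows from the movies file.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with two missing budgets.
films = pd.DataFrame({'Title': ['A', 'B', 'C', 'D'],
                      'Country': ['Spain', 'USA', 'Spain', 'UK'],
                      'Year': [2011, 2012, 2013, 2011],
                      'Budget': [1.0e6, np.nan, 3.0e6, np.nan]})

mask = films['Country'] == 'Spain'  # boolean Series, one value per row
spain = films[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
spain_recent = films[(films['Country'] == 'Spain') & (films['Year'] > 2011)]

in_years = films[films['Year'].isin([2011, 2012])]  # isin takes a flat list
has_budget = films[films['Budget'].notnull()]

print(mask.tolist())
print(len(spain), len(spain_recent), len(in_years), len(has_budget))
```

Python's and/or keywords do not work element-wise on a Series, which is why & and | are used.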

Assigning Data
• The assignment operator is used
– Broadcasting is supported

movies.loc[3,'Budget'] = 1500   # set the budget at row 3 to 1500
movies['Title'] = 'New Title'   # set every value in the Title column to 'New Title'
movies.loc[:5,'Title'] = 'New'  # set the first 6 rows of the Title column to 'New'
movies.head(5).Year = 2000      # does not change movies: head() returns a new object meant for display only
print(movies.head(10))
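
The assignments above can be sketched on a small frame to see broadcasting at work. The data and the column name Checked are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical three-row frame.
small = pd.DataFrame({'Title': ['A', 'B', 'C'],
                      'Year': [2010, 2011, 2012]})

small.loc[:1, 'Year'] = 2000  # labels 0 and 1 (loc is inclusive) get the same value
small['Checked'] = 'yes'      # a single value is broadcast to every row of a new column
small.loc[2, 'Title'] = 'New' # a single cell

print(small)
```

Assigning through loc on the DataFrame itself is the reliable way to change values; assigning to the result of head() only changes a temporary object.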

Data Analysis
describe()
• Generates a high-level summary of the given column(s).
– It is type-aware, meaning that its output changes
based on the data type of the input
– For numeric data, the result’s index will include
count, mean, std, min, max and the 25th, 50th and 75th
percentiles
– For object data (e.g. strings or timestamps), the
result’s index will include count, unique, top, and
freq
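
The type-aware behaviour can be seen by calling describe() on one numeric and one string Series. This is a minimal sketch with made-up values (durations and genres are hypothetical names).

```python
import pandas as pd

durations = pd.Series([7.0, 92.0, 102.0, 511.0])           # numeric data
genres = pd.Series(['Drama', 'Comedy', 'Drama', 'Action'])  # string data

num_summary = durations.describe()
obj_summary = genres.describe()

print(num_summary.index.tolist())  # statistics reported for numeric data
print(obj_summary.index.tolist())  # statistics reported for string data
print(obj_summary['top'], obj_summary['freq'])  # most frequent value and its count
```

The result of describe() is itself a Series, so individual statistics are read with the usual label access.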

Data Analysis
describe()
• For mixed data types provided via a
DataFrame, the default is to return only an
analysis of numeric columns.
print(movies.describe())

Year Duration ... Reviews by Crtiics IMDB Score

count 1497.000000 1594.000000 ... 1571.000000 1604.000000
mean 2012.773547 103.328733 ... 187.586887 6.337718
std 1.868725 27.429001 ... 165.281572 1.169382
min 2010.000000 7.000000 ... 1.000000 1.600000
25% 2011.000000 92.000000 ... 38.000000 5.700000
50% 2013.000000 102.000000 ... 159.000000 6.400000
75% 2014.000000 114.750000 ... 288.000000 7.100000
max 2016.000000 511.000000 ... 813.000000 9.500000

Data Analysis
describe()
• Numerical Fields
print(movies.Duration.describe())
print(type(movies.Duration.describe()))
print(movies.Duration.describe()['max'])
print(movies.Duration.describe().loc['max'])

count 1594.000000
mean 103.328733
std 27.429001
min 7.000000
25% 92.000000
50% 102.000000
75% 114.750000
max 511.000000
Name: Duration, dtype: float64
<class 'pandas.core.series.Series'>
511.0
511.0

Data Analysis/Summary
describe()
• String Fields
print(movies.Title.describe())
print(movies.Title.describe().top)
print(movies.Title.describe()['top'])
print(movies.Title.describe().loc['top'])

count 1604
unique 1551
top Victor Frankenstein
freq 3
Name: Title, dtype: object
Victor Frankenstein
Victor Frankenstein
Victor Frankenstein

Data Analysis/Basic Statistics
print(movies.describe())
print(movies.describe().loc['max','Budget'])
print(movies.Budget.max())
print(movies.Budget.mean())
print(movies.Budget.mode())  # most frequent value(s)
print(movies.Duration.mean().round())
movies.Duration += 15  # broadcasting: adds 15 to every row
print(movies.Duration.mean().round())

Year Duration ... Reviews by Crtiics IMDB Score
count 1497.000000 1594.000000 ... 1571.000000 1604.000000
mean 2012.773547 103.328733 ... 187.586887 6.337718
std 1.868725 27.429001 ... 165.281572 1.169382
min 2010.000000 7.000000 ... 1.000000 1.600000
25% 2011.000000 92.000000 ... 38.000000 5.700000
50% 2013.000000 102.000000 ... 159.000000 6.400000
75% 2014.000000 114.750000 ... 288.000000 7.100000
max 2016.000000 511.000000 ... 813.000000 9.500000
600000000.0
600000000.0
40563243.28888889
0 20000000.0
dtype: float64
103.0
118.0

Data Analysis/Summary
Aggregation
• The agg() method is useful when multiple
statistics are computed per column:
print(movies[['Budget','IMDB Score']].agg([len,min,max]))
print(movies[['Budget','IMDB Score']].agg([len,min,max]).loc['max','Budget'])

Budget IMDB Score

len 1604.0 1604.0
min 1400.0 1.6
max 600000000.0 9.5
600000000.0

Data Analysis/Summary
unique() and value_counts()

print(movies.Year.unique())
print(movies.Year.value_counts())
print(movies.Year.value_counts()[2012])

[2010. 2011. 2012. 2013. 2014. 2015. 2016. nan]
2014.0    252
2013.0    237
2010.0    230
2015.0    226
2011.0    225
2012.0    221
2016.0    106
Name: Year, dtype: int64
221

Grouping
• The groupby() method groups the rows of the
DataFrame based on the values of one or more
columns.
– movies.groupby(['Country']) groups the rows
by Country
• The number of groups equals the number of
distinct countries
– Operations can then be performed on each group
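
The grouping idea can be sketched on a few made-up rows (films and its values are hypothetical): rows that share a Country value fall into the same group, and an aggregation such as sum() or count() then yields one result per group.

```python
import pandas as pd

# Hypothetical five-row frame with three distinct countries.
films = pd.DataFrame({'Country': ['USA', 'UK', 'USA', 'Spain', 'UK'],
                      'Budget': [10.0, 4.0, 6.0, 2.0, 1.0]})

grouped = films.groupby('Country')
n_groups = grouped.ngroups         # one group per distinct country
totals = grouped['Budget'].sum()   # total budget per country
counts = grouped['Budget'].count() # number of non-missing budgets per country

print(n_groups)
print(totals)
print(counts)
```

The result of the aggregation is a Series indexed by the group key, so totals['USA'] reads the value for one country.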

Grouping
grouped = movies.groupby('Country')
print(grouped.groups)#shows indices
print(grouped.get_group('Spain'))

{'Australia': [16, 54, 65, 194, 294, 360, 473, 692, 859, 860, 880, 1014, 1120, 1138,
1269, 1319, 1514, 1601], 'Bahamas': [978], 'Belgium': [399, 895, 1348, 1349],
'Brazil': [804, 992, 1359], 'Bulgaria': [942], 'Cambodia': [1033],
'Canada': [13, 17, 26, 68, 78, 140, 143, 216, 255, 290, 300, 350, 450, 467, 492, 504, 526,…
Title Year ... Reviews by Crtiics IMDB Score
23 Buried 2010.0 ... 363.0 7.0
258 Blackthorn 2011.0 ... 92.0 6.6
334 Midnight in Paris 2011.0 ... 487.0 7.7
375 Sleep Tight 2011.0 ... 191.0 7.2
430 There Be Dragons 2011.0 ... 77.0 5.9
571 Red Lights 2012.0 ... 195.0 6.2
631 The Impossible 2012.0 ... 371.0 7.6
902 Underdogs 2013.0 ... 82.0 6.7
927 Aloft 2014.0 ... 56.0 5.3
1006 Hidden Away 2014.0 ... 9.0 7.2
1225 Eden 2015.0 ... 5.0 4.8
1295 Regression 2015.0 ... 140.0 5.7

[12 rows x 19 columns]

Grouping
• Example: Find the total budget spent on movie production by each country

print(movies.groupby(['Country']).Budget.sum().head(5))

Country
Australia 751500000.0
Bahamas 5000000.0
Belgium 49000000.0
Brazil 11000000.0
Bulgaria 7000000.0
Name: Budget, dtype: float64

Grouping
• Example: Find the number of movies produced by each country
print(movies.groupby(['Country']).Title.count().sort_values())
print(movies.groupby(['Country']).Title.count().sort_values().max())  # largest count

var = movies.groupby(['Country']).Title.count().sort_values()
print(var.index[var.shape[0]-1])  # country with the largest count (last label after sorting)

print(movies['Country'].describe().top)  # most frequent country

Country
United Arab Emirates 1
Iran 1
.
.
Canada 44
France 54
UK 136
USA 1184
Name: Title, dtype: int64
1184
USA
USA

Grouping and Aggregation
• Use agg() to apply more than one function
per group
– Results generated per group
print(movies.groupby(['Country']).Budget.agg([len,min,max]))

len min max

Country
Australia 18.0 2500000.0 150000000.0
Bahamas 1.0 5000000.0 5000000.0
Belgium 4.0 15000000.0 34000000.0

Grouping
• Notice that the result has a new index, which in this case is a
MultiIndex (Country, Language).
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]))
var=movies.groupby(['Country','Language']).Budget.agg([len,min,max])
print(var.loc['Brazil','max'])
print(var.loc['Brazil','max'].loc['English'])

len min max

Country Language
Australia English 18 2500000.0 150000000.0
Bahamas English 1 5000000.0 5000000.0
Belgium English 4 15000000.0 34000000.0
Brazil English 1 3000000.0 3000000.0
Portuguese 2 4000000.0 4000000.0
... ... ... ...
USA English 1174 1400.0 263700000.0
Hebrew 1 NaN NaN
None 1 4000000.0 4000000.0
Spanish 3 1200000.0 6000000.0
United Arab Emirates Arabic 1 125000.0 125000.0
[83 rows x 3 columns]
Language
English 3000000.0
Portuguese 4000000.0
Name: max, dtype: float64
3000000.0

DataFrameGroupBy.filter()
• Returns a copy of the DataFrame that excludes the elements of groups
that do not satisfy the boolean criterion specified by func.

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B' : [1, 2, 3, 4, 5, 6],
                   'C' : [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')
print(grouped.get_group('bar'))
print(grouped.get_group('bar')['B'].mean())
print(grouped.get_group('foo'))
print(grouped.get_group('foo')['B'].mean())
print(grouped.filter(lambda x: x['B'].mean() > 3.))  # keeps only the 'bar' group

     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0
4.0
     A  B    C
0  foo  1  2.0
2  foo  3  8.0
4  foo  5  2.0
3.0
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0

DataFrameGroupBy.apply()
• Applies a function to each group and combines the results

grouped = df.groupby('A')
print(grouped.apply(lambda x: x.describe()))

             B         C
A
bar count  3.0  3.000000
    mean   4.0  5.000000
    std    2.0  4.000000
    min    2.0  1.000000
    25%    3.0  3.000000
    50%    4.0  5.000000
    75%    5.0  7.000000
    max    6.0  9.000000
foo count  3.0  3.000000
    mean   3.0  4.000000
    std    2.0  3.464102
    min    1.0  2.000000
    25%    2.0  2.000000
    50%    3.0  2.000000
    75%    4.0  5.000000
    max    5.0  8.000000

DataFrameGroupBy.apply()

def f(group):
    # pair each value with its deviation from the group mean
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})

grouped = df.groupby('A')
print(grouped['C'].apply(f))

Groupby()
• reset_index() can be used to reset the index to
integer values starting from 0; the former index levels become ordinary columns
print(movies.groupby(['Country','Language']).Budget.agg([len,min,max]).reset_index())

                 Country    Language   len        min          max
0              Australia     English    18  2500000.0  150000000.0
1                Bahamas     English     1  5000000.0    5000000.0
2                Belgium     English     4 15000000.0   34000000.0
3                 Brazil     English     1  3000000.0    3000000.0
4                 Brazil  Portuguese     2  4000000.0    4000000.0
..                   ...         ...   ...        ...          ...
78                   USA     English  1174     1400.0  263700000.0
79                   USA      Hebrew     1        NaN          NaN
80                   USA        None     1  4000000.0    4000000.0
81                   USA     Spanish     3  1200000.0    6000000.0
82  United Arab Emirates      Arabic     1   125000.0     125000.0

[83 rows x 5 columns]
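
The effect of reset_index() on a grouped result can be checked on a few made-up rows. In this sketch the data is hypothetical and the aggregations are named by strings ('count', 'min', 'max') rather than by the built-in functions used in the slides.

```python
import pandas as pd

# Hypothetical rows: two languages for Brazil, one for USA.
films = pd.DataFrame({'Country': ['Brazil', 'Brazil', 'Brazil', 'USA'],
                      'Language': ['English', 'Portuguese', 'Portuguese', 'English'],
                      'Budget': [3.0, 4.0, 2.0, 1.0]})

# Grouping by two columns gives a result indexed by (Country, Language).
stats = films.groupby(['Country', 'Language'])['Budget'].agg(['count', 'min', 'max'])
print(stats.index.nlevels)  # number of index levels

# reset_index() turns the index levels back into columns and renumbers the rows.
flat = stats.reset_index()
print(flat)
```

With the MultiIndex, one row is addressed by a tuple such as stats.loc[('Brazil', 'Portuguese')]; after reset_index() the same row is reached by its integer position.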

Groupby()
stats = movies.groupby(['Country','Language']).Budget.agg([len,min,max]).reset_index()
print(stats.iloc[3])         # fourth row of the flattened result
print(stats.iloc[3]['max'])  # its max value

Country Brazil
Language English
len 1
min 3000000.0
max 3000000.0
Name: 3, dtype: object
3000000.0

Sorting
• The sort_values() method sorts a DataFrame according
to the values of one or more columns.
# sort the whole DataFrame by the Country column and display the first 5 columns
print(movies.sort_values('Country').iloc[:,:5])

Sorting
print(movies.groupby(['Country'])['IMDB Score'].max())
print(movies.groupby(['Country'])['IMDB Score'].max().sort_values(ascending=False))

print(movies.groupby(['Country'])['IMDB Score'].agg([max]).sort_values('max',ascending=False))

Country
Australia               8.1
Bahamas                 4.4
Belgium                 7.1
.
.
Thailand                5.7
UK                      8.6
USA                     9.1
United Arab Emirates    8.2
Name: IMDB Score, dtype: float64

Country
Canada     9.5
USA        9.1
Poland     9.1
.
.
Nigeria    5.6
Georgia    5.6
Bahamas    4.4
Name: IMDB Score, dtype: float64

Sorting
• sort_values() works on DataFrame or Series
objects, not on GroupBy objects, so aggregate the groups first
print(type(movies.groupby(['Country'])))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

print(type(movies.groupby(['Country'])['IMDB Score']))
<class 'pandas.core.groupby.generic.SeriesGroupBy'>

print(type(movies.groupby(['Country'])['IMDB Score'].max()))
<class 'pandas.core.series.Series'>

Sorting
• sort_values() can sort by more than one column.
• sort_index() is used to sort elements by index.
print(movies.sort_values(['Language','Country']).iloc[:,:5])
print(movies.sort_values(['Language','Country']).loc[:,['Title','Language','Country']].to_string())

Title ... Country

884 The Square ... Egypt
845 The Brain That Sings ... United Arab Emirates
308 In the Land of Blood and Honey ... USA
1026 Kung Fu Killer ... China
1164 Z Storm ... Hong K
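
Sorting by several columns and sorting by index can be sketched on a small made-up frame (langs and its values are hypothetical).

```python
import pandas as pd

# Hypothetical four-row frame.
langs = pd.DataFrame({'Language': ['English', 'Arabic', 'English', 'Arabic'],
                      'Country': ['USA', 'Egypt', 'Canada', 'UAE'],
                      'Score': [7.1, 8.0, 6.5, 5.6]})

# Sort by Language first; ties are broken by Country.
by_two = langs.sort_values(['Language', 'Country'])

# Sort by a single column, largest value first.
by_score_desc = langs.sort_values('Score', ascending=False)

# sort_index() restores the original row order from the index labels.
restored = by_two.sort_index()

print(by_two)
print(by_score_desc)
```

sort_values() returns a new DataFrame and keeps the original index labels, which is why sort_index() can undo the reordering.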

Missing Data
• Several methods are available to deal with
missing data

df.method()                  description
dropna()                     Drop missing observations
dropna(how='all')            Drop observations where all cells are NA
dropna(axis=1, how='all')    Drop a column if all of its values are missing
dropna(thresh=5)             Drop rows that contain fewer than 5 non-missing values
fillna(0)                    Replace missing values with zeros
isnull()                     Returns True if the value is missing
notnull()                    Returns True for non-missing values
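
The variants in the table can be compared on a tiny frame. The sketch below assumes made-up data (gaps is a hypothetical name) and uses thresh=2 instead of 5 because the frame has only three columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column C is entirely missing, row 2 is entirely missing.
gaps = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                     'B': [2.0, 5.0, np.nan],
                     'C': [np.nan, np.nan, np.nan]})

any_missing = gaps.dropna()                     # drop rows with at least one NaN
all_missing = gaps.dropna(how='all')            # drop rows where every cell is NaN
no_empty_cols = gaps.dropna(axis=1, how='all')  # drop columns where every cell is NaN
at_least_two = gaps.dropna(thresh=2)            # keep rows with 2 or more non-missing values
filled = gaps.fillna(0)                         # replace every NaN with 0

print(len(any_missing), len(all_missing), len(at_least_two))
print(no_empty_cols.columns.tolist())
```

None of these calls changes gaps itself; each returns a new DataFrame unless inplace=True is passed.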

Missing Data
• To select NaN entries you can use pd.isnull()
(or its companion pd.notnull())

print(movies[pd.isnull(movies.Country)])

Title Year ... Reviews by Crtiics IMDB Score

963 Dawn Patrol 2014.0 ... 9.0 4.8
1497 10,000 B.C. NaN ... NaN 7.2
1529 Gone, Baby, Gone NaN ... NaN 6.6
1551 Preacher NaN ... 18.0 8.3

Missing Data
• To select the rows that contain a NaN in any column,
combine isnull() with any(axis=1)
print(movies[movies.isnull().any(axis=1)].head())

Title ... IMDB Score

1 3 Backyards ... 5.2
2 3 ... 6.8
4 A Turtle's Tale: Sammy's Adventures ... 6.1
7 All Good Things ... 6.3
10 Anderson's Cross ... 7.2

[5 rows x 19 columns]

Replacing Missing Values
• Replacing missing values is a common operation.
• fillna() provides a few different strategies for mitigating such data

movies.Country = movies.Country.fillna("X")
print(movies.iloc[963])

# In-place variant; calling it on a selected column may not update movies
# in recent pandas versions, so the assignment form above is safer
movies.Country.fillna("X",inplace = True)
print(movies.iloc[963])

Title                             Dawn Patrol
Year                                   2014.0
Genres                         Drama|Thriller
Language                              English
Country                                     X
Duration                                 88.0
Budget                              3500000.0
Gross Earnings                            NaN
Director                    Daniel Petrie Jr.
Actor 1                          Chris Brochu
Actor 2                            Jeff Fahey
Facebook Likes - Actor 1                795.0
Facebook Likes - Actor 2                535.0
Facebook likes - Movie                    570
Facenumber in posters                     0.0
User Votes                                455
Reviews by Users                         13.0
Reviews by Crtiics                        9.0
IMDB Score                                4.8
Name: 963, dtype: object

Replacing Missing Values
movies.fillna("Y",inplace = True)
print(movies.iloc[4])

Title                       A Turtle's Tale: Sammy's Adventures
Year                                                     2010.0
Genres                               Adventure|Animation|Family
Language                                                English
Country                                                  France
Duration                                                   88.0
Budget                                                        Y
Gross Earnings                                                Y
Director                                            Ben Stassen
Actor 1                                           Ed Begley Jr.
Actor 2                                          Jenny McCarthy
Facebook Likes - Actor 1                                  783.0
Facebook Likes - Actor 2                                  749.0
Facebook likes - Movie                                        0
Facenumber in posters                                       2.0
User Votes                                                 5385
Reviews by Users                                           22.0
Reviews by Crtiics                                         56.0
IMDB Score                                                  6.1
Name: 4, dtype: object

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Removing Records with Missing Values
• dropna() can be used to remove all the rows
that contain NA values.

print(movies.shape) #(1604, 19)

movies.dropna(inplace=True)
print(movies.shape)#(1044, 19)

fillna()/Examples
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
print(df)
df1 = df.ffill()  # forward fill; replaces fillna(method='ffill'), which is deprecated in recent pandas
print(df1)
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df2 = df.fillna(value=values)  # a different fill value for each column
print(df2)

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4

     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4

Renaming
• rename() lets you change index labels and/or column
names
• Change a column name
movies.rename(columns={'IMDB Score':'IMDB_Score'},inplace=True)

• Change index labels
– Rarely used; set_index() can be used instead

movies.rename(index = {0:'m0',1:'m1'})  # returns a new DataFrame; add inplace=True or assign the result to keep it

Combining
• Dataframes can be combined into one
Dataframe
– concat(), join() and merge() are useful methods for
this purpose.
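
The slides that follow show concat() and join(); for completeness, here is a minimal sketch of merge(), which matches rows on the values of a shared column rather than on position or index. The two frames and their contents are made up for this illustration.

```python
import pandas as pd

# Hypothetical frames that share the key column 'Title'.
left = pd.DataFrame({'Title': ['A', 'B', 'C'],
                     'Year': [2010, 2011, 2012]})
right = pd.DataFrame({'Title': ['A', 'C', 'D'],
                      'Score': [7.6, 6.8, 5.2]})

# Default (inner) merge keeps only the titles present in both frames.
inner = pd.merge(left, right, on='Title')

# An outer merge keeps every title and fills the gaps with NaN.
outer = pd.merge(left, right, on='Title', how='outer')

print(inner)
print(outer)
```

The how argument ('inner', 'outer', 'left', 'right') controls which keys survive, much like the join types in SQL.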

Combining/Example

df1 = pd.DataFrame([[1, 2, 5, 0],
                    [3, 4, 6, 1]],
                   columns=list('ABCD'))

df2 = pd.DataFrame([[1, 1, 1, 1],
                    [2, 2, 2, 2]],
                   columns=list('ABCD'))

df4 = pd.DataFrame([[1, 1, 1, 1],
                    [2, 2, 2, 2]],
                   columns=list('ABCF'))

Combining/concat()
• Concatenate along an axis
df3 = pd.concat([df1,df2],ignore_index=True)  # stack the rows and renumber the index
print(df3)

df3 = pd.concat([df1,df2],axis=1)  # place the DataFrames side by side
print(df3)

   A  B  C  D
0  1  2  5  0
1  3  4  6  1
2  1  1  1  1
3  2  2  2  2

   A  B  C  D  A  B  C  D
0  1  2  5  0  1  1  1  1
1  3  4  6  1  2  2  2  2

Combining/concat()
• Concatenate DataFrames with different column labels

df3 = pd.concat([df1,df4],sort=False,ignore_index=True)
print(df3)

A B C D F
0 1 2 5 0.0 NaN
1 3 4 6 1.0 NaN
2 1 1 1 NaN 1.0
3 2 2 2 NaN 2.0

Combining/join()
• Joins the columns of two DataFrames on their index; lsuffix and rsuffix
disambiguate overlapping column names

df3 = df1.join(df4,lsuffix="_X",rsuffix="_Y")
print(df3)

   A_X  B_X  C_X  D  A_Y  B_Y  C_Y  F
0    1    2    5  0    1    1    1  1
1    3    4    6  1    2    2    2  2

References
• https://pandas.pydata.org/docs/
• https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
