Pandas Complete Notes
RAHUL KUMAR
https://www.linkedin.com/in/rahul-kumar-1212a6141/
Outline
- **Installation of pandas**: Importing pandas, Importing the dataset, DataFrame/Series
- **Basic ops on a DataFrame**: df.info(), df.head(), df.tail(), df.shape
- **Creating a DataFrame from scratch**
- **Basic ops on columns**: Different ways of accessing cols, Check for unique values, Rename column, Deleting col, Creating new cols
- **Basic ops on rows**: Implicit/explicit index, df.index, Indexing in Series, Slicing in Series, loc/iloc, Indexing/Slicing in DataFrame, Adding a row, Deleting a row
- **Working with both rows and columns**
- **More in-built ops in pandas**: sum(), count(), mean()
- **Sorting**
- **Concatenation**: pd.concat(), axis for concat
- **Merge**: Concat vs Merge, `left_on` and `right_on`, Joins
- **Introduction to IMDB dataset**: Reading two datasets
- **Merging the dataframes**: `unique()` and `nunique()`, `isin()`, Using Left Join for `merge()`
- **Feature Exploration**: Create new features
- **Fetching data using pandas**: Querying from a dataframe; Masking, Filtering, `&` and `|`
- **Apply**
- **Grouping**: Split, Apply, Combine; `groupby()`
- **Group-based Aggregates**, **Group-based Filtering**, **Group-based Apply**: `apply()`
- **Restructuring data**: pd.melt(), pd.pivot(), pd.pivot_table(), pd.cut()
- **Dealing with Missing Values**: None and NaN values, isna() and isnull()
- **String methods in pandas**
- **Handling datetime**
- **Writing to a file**
Importing Pandas
You should be able to import Pandas after installing it
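A quick sanity check after installation (a minimal sketch):

import pandas as pd   # the conventional alias
import numpy as np    # used alongside pandas throughout these notes

print(pd.__version__)  # verify the install worked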
Real-world data is heterogeneous — for example, names of places would be strings while their populations would be integers.
==> It is difficult to work with data having such heterogeneous values using NumPy alone; this is where pandas comes in.
McKinsey wants to understand the relation between GDP per capita and life expectancy and various
trends for their clients.
The company has acquired data from multiple surveys in different countries in the past
This contains info of several years about:
country
population size
life expectancy
GDP per Capita
We have to analyse the data and draw inferences meaningful to the company
In [3]: df = pd.read_csv(r"C:\Users\kumar\Downloads\mckinsey.csv")
In [4]: df
(output: the full dataframe — 1704 rows × 6 columns)
In [5]: type(df)
Out[5]: pandas.core.frame.DataFrame
In [6]: df["country"]
Out[6]:
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object
As you can see we get all the values in the column country
In [7]: type(df["country"])
Out[7]: pandas.core.series.Series
How can we find the datatype, name, total entries in each column ?
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 year 1704 non-null int64
2 population 1704 non-null int64
3 continent 1704 non-null object
4 life_exp 1704 non-null float64
5 gdp_cap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
Name/Title of Columns
How many non-null values (blank cells) each column has
Type of values in each column - int, float, etc.
By default, it shows data-type as object for anything other than int or float - Will come back later
Now what if we want to see the first few rows in the dataset ?
In [9]: df.head()
In [10]: df.head(20)
In [12]: df.shape
Out[12]: (1704, 6)
Approach 1: Row-oriented
It takes 2 arguments - Because DataFrame is 2-dimensional
A list of rows
Each row is packed in a list []
All rows are packed in an outside list [[]] - To pass a list of rows
A list of column names/labels
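A working sketch of the row-oriented approach — each row packed in its own inner list:

pd.DataFrame([['Afghanistan', 1952, 8425333, 'Asia', 28.801, 779.445314],
              ['Afghanistan', 1957, 9240934, 'Asia', 30.332, 820.853030]],
             columns=['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'])

Forgetting the outer list (passing a flat list) raises an error, as shown below.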
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
In [15]: pd.DataFrame(['Afghanistan', 1952, 8425333, 'Asia', 28.801, 779.445314],
                      columns=['country','year','population','continent','life_exp','gdp_cap'])
A flat list is treated as a single column of 6 rows, which conflicts with the 6 column labels — hence the ValueError.
Approach 2: Column-oriented
In [17]: pd.DataFrame({'country':['Afghanistan', 'Afghanistan'], 'year':[1952, 1957],
                       'population':[8425333, 9240934], 'continent':['Asia', 'Asia'],
                       'life_exp':[28.801, 30.332], 'gdp_cap':[779.445314, 820.853030]})
We now have a basic idea about the dataset and creating rows and columns. Next we will look at:
- Adding data
- Removing data
- Updating/Modifying data
and so on.
But what if our dataset has 20 cols ? ... or 100 cols ? We can't see their names in one go. Two ways to list them:
1. df.columns
2. df.keys()
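Both return the same Index of column labels (a quick sketch):

df.columns   # Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')
df.keys()    # same as df.columns for a DataFrame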
Note:
Here, Index is a pandas class used to store the row/column labels of a Series/DataFrame.
We can select multiple columns by passing a list of names, e.g. df[['country','life_exp']].head():
0 Afghanistan 28.801
1 Afghanistan 30.332
2 Afghanistan 31.997
3 Afghanistan 34.020
4 Afghanistan 36.088
In [22]: df[['country']].head()
Out[22]: country
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
Note:
Notice how this output type differs from our earlier output: df['country'] returns a Series, while df[['country']] returns a DataFrame.
Now that we know how to access columns, lets answer some questions
In [23]: df['country'].unique()
Now what if you also want to check the count of each country in the dataframe?
In [24]: df['country'].value_counts()
Out[24]:
Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: country, Length: 142, dtype: int64
Note:
rename() returns a new DataFrame by default; assign the result back (or pass inplace=True) to keep the change.
In [26]: df.rename(columns={"country":"Country"})
Note: columns can also be accessed attribute-style:
In [28]: df.Country
Out[28]:
0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: Country, Length: 1704, dtype: object
What do you think could be the problems with using attribute style for accessing the
columns?
Problems such as: column names containing spaces or special characters cannot be used this way, and names that clash with existing DataFrame attributes/methods (e.g. count) return the method instead of the column.
An alternative to the above approach is using the "columns" parameter as we did in rename
In [30]: df.drop(columns=['continent'])
In [31]: df.head()
df.head() still shows continent — drop() returned a new DataFrame without modifying df (by default, inplace=False). To persist the change:
df = df.drop(columns=['continent'])
OR
df = df.drop('continent', axis=1)
OR
df.drop(columns=['continent'], inplace=True)
We can also use values from two columns to form a new column — for instance, a column whose values are the product of the respective values in gdp_cap and population. Alternatively, we can create a pandas Series from a list/numpy array for our new column. See the sketch below.
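A minimal sketch of both approaches (the column names gdp and new_col are illustrative):

df['gdp'] = df['gdp_cap'] * df['population']   # element-wise product of two columns
df['new_col'] = pd.Series(range(len(df)))      # from a list/array via a Series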
Now that we know how to create new cols lets see some basic ops on rows
In [38]: df.index.values
Out[38]: array([   0,    1,    2, ..., 1701, 1702, 1703], dtype=int64)
Now, to understand explicit (non-default) indices, let's take a small subset of our original dataframe — a Series ser whose index starts at 1:
Out[45]:
1     Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
16 Albania
17 Albania
18 Albania
19 Albania
20 Albania
Name: Country, dtype: object
So, how will we then access the thirteenth element (i.e. the thirteenth row)?
In [46]: ser[12]
Out[46]: 'Afghanistan'
In [47]: ser[5:15]
Out[47]:
6     Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
11 Afghanistan
12 Afghanistan
13 Albania
14 Albania
15 Albania
Name: Country, dtype: object
In [48]: df[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
KeyError: 0
Notice that this syntax is exactly the same as how we tried accessing a column — df[0] looks for a column labeled 0, which doesn't exist.
In [ ]: df[5:15]
==> Indexing in a dataframe looks only for explicit (column) labels
==> Slicing, however, checks the implicit (positional) indices
1. loc
Allows indexing and slicing that always references the explicit index
In [49]: df.loc[1]
Out[49]:
Country       Afghanistan
year                 1952
population        8425333
life_exp           28.801
gdp_cap        779.445314
Name: 1, dtype: object
In [50]: df.loc[1:3]
2. iloc
Allows indexing and slicing that always references the implicit Python-style index
In [51]: df.iloc[1]
Out[51]:
Country       Afghanistan
year                 1957
population        9240934
life_exp           30.332
gdp_cap         820.85303
Name: 2, dtype: object
In [52]: df.iloc[0:2]
To select several specific rows at once, we can pack the indices in a list [] and pass it to loc or iloc, as sketched below.
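A minimal sketch of list-based selection:

df.loc[[1, 5, 10]]    # rows by explicit index labels
df.iloc[[0, 4, 9]]    # rows by implicit positions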
In [54]: df.iloc[-1]
In [55]: df.loc[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
KeyError: -1
This is because iloc works with positional indices, while loc works with assigned labels:
[-1] points to the row at the last position for iloc, but there is no row labeled -1 for loc.
Now consider a dataframe temp whose index is the Country column (string labels):
In [57]: temp.loc['Afghanistan']
As you can see, we got all the rows whose index label is 'Afghanistan'.
How can we reset our index without the old index being inserted as a new column?
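A sketch — drop=True discards the old index instead of turning it into a column:

temp.reset_index(drop=True)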
Two ways to add a row:
1. append() — it does not change the DataFrame, but returns a new DataFrame with the row appended.
2. loc/iloc — we will need to provide the label/position at which we will add the new row.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
In [63]: df.loc[len(df.index)] = ['India', 2000, 13500000, "Asia", 37.08, 900.23]
The ValueError occurs because we passed 6 values (including "Asia") while df, after dropping continent, has only 5 columns.
In [64]: df
The new row was added but the data has been duplicated
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
In [66]: df.iloc[len(df.index)] = ['India', 2000, 13500000, 37.08, 900.23]
IndexError — iloc cannot enlarge the DataFrame; it only works with positions that already exist.
When using the loc[] attribute, it’s not mandatory that a row already exists with a specific label.
Out[69]:
Country       Afghanistan
year                 1977
population       14880372
life_exp           38.438
gdp_cap         786.11336
Name: 5, dtype: object
In [73]: df.duplicated()
Out[73]:
0    False
1 False
2 False
3 False
4 False
...
1703 False
1704 True
1705 False
1706 True
1707 False
Length: 1708, dtype: bool
In [75]: df.drop_duplicates()
But how can we decide among all duplicate rows which ones we want to keep ?
Here we can use argument keep:
first
last
False
If first , this considers first value as unique and rest of the same values as duplicate.
In [76]: df.drop_duplicates(keep='first')
If last , This considers last value as unique and rest of the same values as duplicate.
In [77]: df.drop_duplicates(keep='last')
If False , this considers all of the same values as duplicates. All values are dropped.
In [78]: df.drop_duplicates(keep=False)
In [79]: df.drop_duplicates(subset=['Country'],keep='first')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
In [80]: df.drop_duplicates(subset=['Country', 'Continent'], keep='first')
KeyError — there is no 'Continent' column (continent was dropped earlier).
In [82]: df = pd.read_csv('mckinsey.csv')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
In [82]: df = pd.read_csv('mckinsey.csv')
FileNotFoundError — the file is not present at this relative path.
How can we slice the dataframe into, say, first 4 rows and first 3 columns?
We can use iloc
Pass in 2 different ranges for slicing - one for row and one for column just like Numpy
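A sketch of 2-D slicing with iloc:

df.iloc[:4, :3]    # first 4 rows, first 3 columns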
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
In [84]: df.loc[1:5, 1:4]
TypeError: cannot do slice indexing on Index with these indexers [1] of type int
loc expects column labels (strings here), not positions.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
In [85]: df.loc[1:5, ['country','life_exp']]
KeyError — the column was renamed to 'Country' earlier, so 'country' no longer exists. With valid labels, e.g. df.loc[1:5, ['year','population']]:
1 1957  9240934
2 1962 10267083
3 1972 13079460
4 1977 14880372
5 1982 12881816
In [88]: df.iloc[1:10:2]
Out[88]: Country year population life_exp gdp_cap
In [89]: df.loc[1:10:2]
In [90]: le = df['life_exp']
le
Out[90]:
0    28.801
1 30.332
2 31.997
3 36.088
4 38.438
...
1703 37.080
1704 37.080
1705 80.000
1706 80.000
1707 80.000
Name: life_exp, Length: 1708, dtype: float64
In [91]: le.mean()
Out[91]: 59.499171358313774
... and so on
In [92]: le.sum()
Out[92]: 101624.58468
In [93]: le.count()
Out[93]: 1708
Sorting
If you notice, life_exp col is not sorted
In [95]: df.sort_values(['life_exp'])
In a multi-level sort, e.g. df.sort_values(['year', 'life_exp']), rows are first sorted by 'year'; rows with the same 'year' are then sorted by 'life_exp'.
How can we have different sorting orders for different columns in multi-level sorting?
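One way — pass a list to ascending, one flag per column (a sketch):

df.sort_values(['year', 'life_exp'], ascending=[True, False])   # 'year' ascending, 'life_exp' descending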
Concatenating DataFrames
Let's use a mini use-case of users and messages
users --> Stores the user details - IDs and Names of users
users = pd.DataFrame({"userid": [1, 2, 3], "name": ['sharadh', 'shahid', 'khusalli']})
0 1 sharadh
1 2 shahid
2 3 khusalli
msgs --> Stores the messages users have sent - User IDs and messages
In [100… msgs = pd.DataFrame({"userid":[1, 1, 2, 4], "msg":['hmm', "acha", "theek hai", "nice"]})
msgs
0 1 hmm
1 1 acha
2 2 theek hai
3 4 nice
pd.concat([users, msgs]) stacks the two dataframes vertically:
0 1 sharadh NaN
1 2 shahid NaN
2 3 khusalli NaN
0 1 NaN hmm
1 1 NaN acha
3 4 NaN nice
userid , being same in both DataFrames, was combined into a single column
First values of users dataframe were placed, with values of column msg as NaN
Then values of the msgs dataframe were placed, with values of column name as NaN
The original indices of the rows were preserved
Now how can we make the indices unique for each row? By passing ignore_index=True — pd.concat([users, msgs], ignore_index=True):
0 1 sharadh NaN
1 2 shahid NaN
2 3 khusalli NaN
3 1 NaN hmm
4 1 NaN acha
6 4 NaN nice
How can we concatenate them horizontally?
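A sketch of horizontal concatenation:

pd.concat([users, msgs], axis=1)   # axis=1 places the dataframes side by side, aligned on the index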
Merging Dataframes
So far we have only concatenated and not merged data
No ==> pd.concat() does not align rows according to the values in the columns — for that we need merge.
0 1 sharadh hmm
1 1 sharadh acha
Inner Join: by default, merge() keeps only the keys present in both dataframes.
Now what join do we want to get info of all the users and all the messages? An outer join:
0 1 sharadh hmm
1 1 sharadh acha
3 3 khusalli NaN
4 4 NaN nice
Note:
And what if we want the info of all the users in the dataframe?
0 1 sharadh hmm
1 1 sharadh acha
3 3 khusalli NaN
Similarly, what if we want all the messages and info only for the users who sent a
message?
0 1 sharadh hmm
1 1 sharadh acha
3 4 NaN nice
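A sketch of the four join types via the how parameter:

pd.merge(users, msgs, on='userid', how='inner')   # only matching keys (default)
pd.merge(users, msgs, on='userid', how='outer')   # all users and all messages
pd.merge(users, msgs, on='userid', how='left')    # all users
pd.merge(users, msgs, on='userid', how='right')   # all messages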
Note: sometimes the column names might be different even if they contain the same data.
Out[109]: id name
0 1 sharadh
1 2 shahid
2 3 khusalli
Now, how can we merge the 2 dataframes when the key has a different name ?
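A sketch — assuming the renamed user table is called users_new (its key column is id):

pd.merge(users_new, msgs, left_on='id', right_on='userid')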
0 1 sharadh 1 hmm
1 1 sharadh 1 acha
Here, both key columns (id and userid) are retained in the result; we can drop the redundant one after merging.
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: C:\Users\kumar\Jupyter Python Files\Scaler Lectures\movies.csv
Downloading...
From: https://drive.google.com/uc?id=1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm
To: C:\Users\kumar\Jupyter Python Files\Scaler Lectures\directors.csv
movies.csv
directors.csv
Out[114]: (first rows of movies — columns: Unnamed: 0, id, budget, popularity, revenue, title, vote_average, vote_count, director_id, year; titles include Pirates of the Caribbean: At World's End, The Dark Knight Rises, Spider-Man 3)
So what kind of questions can we ask from this dataset?
Top 10 most popular movies, using popularity
Or find some highest rated movies, using vote_average
We can find number of movies released per year too
Or maybe we can find highest budget movies in a year using both budget and year
Notice, there's a column Unnamed: 0 which represents nothing but the index of a row.
Out[115]: (movies after dropping Unnamed: 0 — columns id, budget, popularity, revenue, title, vote_average, vote_count, director_id, year, month, day; e.g. Avatar, 2009, Dec)
In [116]: movies.shape
Out[116]: (1465, 11)
In [118]: directors.shape
Out[118]: (2349, 3)
Directors df contains: an id, the director's name, and gender.
Summary
1. Movie dataset contains info about movies, release, popularity, ratings and the director ID
2. Director dataset contains detailed info about the director
Now, how can we know the details about the Director of a particular movie?
We will have to merge these datasets
If you observe, movies has a director_id column while directors has an id column — both refer to the same director.
Thus we can merge our dataframes using these two columns as keys.
Before that, lets first check number of unique director values in our movies data
In [119]: movies['director_id'].nunique()
Out[119]: 199
Recall, nunique() returns the number of distinct values.
In [120]: directors['id'].nunique()
Out[120]: 2349
Summary:
Movies Dataset: 1465 rows, but only 199 unique directors
Directors Dataset: 2349 unique directors (= no of rows)
In [121]: movies['director_id'].isin(directors['id'])
Out[121]:
0    True
1 True
2 True
3 True
5 True
...
4736 True
4743 True
4748 True
4749 True
4768 True
Name: director_id, Length: 1465, dtype: bool
The isin() method checks if the Dataframe column contains the specified value(s).
If you notice, every value above is True. We can confirm that all movie director_ids exist in directors:
In [122]: np.all(movies['director_id'].isin(directors['id']))
Out[122]: True
Out[123]: (movies merged with directors, presumably via movies.merge(directors, left_on='director_id', right_on='id', how='left') — 1465 rows; note the duplicated key columns id_x/id_y plus the new director columns)
In [124… data.drop(['director_id','id_y'],axis=1,inplace=True)
data.head()
Out[124]: (the merged data after dropping director_id and id_y — columns id_x, budget, popularity, revenue, title, vote_average, vote_count, year, month, day, director_name, gender)
Feature Exploration
Lets explore all the features in the merged dataset
In [125… data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1465 entries, 0 to 1464
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id_x 1465 non-null int64
1 budget 1465 non-null int64
2 popularity 1465 non-null int64
3 revenue 1465 non-null int64
4 title 1465 non-null object
5 vote_average 1465 non-null float64
6 vote_count 1465 non-null int64
7 year 1465 non-null int64
8 month 1465 non-null object
9 day 1465 non-null object
10 director_name 1465 non-null object
11 gender 1341 non-null object
dtypes: float64(1), int64(6), object(5)
memory usage: 148.8+ KB
Looks like only gender column has missing values (will come later)
How can we describe these features to know more about their range of values?
In [126… data.describe()
In [127… data.describe(include=object)
If you notice,
The range of values in the revenue and budget seem to be very high
Generally budget and revenue for Hollywood movies is in million dollars
How can we change the values of revenue and budget into million dollars USD?
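The revenue conversion (elided here, but mirroring the budget cell In [129] below) was presumably:

data['revenue'] = (data['revenue'] / 1000000).round(2)   # revenue in million USD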
Out[128]: (revenue now in millions — e.g. Avatar 2787.97, Pirates of the Caribbean: At World's End 961.00, Spectre 880.67, The Last Waltz 0.32, Clerks 3.15)
In [129… data['budget']=(data['budget']/1000000).round(2)
data.head()
Out[129]: (budget now in millions too — e.g. Avatar 237.0, Pirates of the Caribbean: At World's End 300.0, Spectre 245.0, The Dark Knight Rises 250.0, Spider-Man 3 258.0)
A boolean mask, e.g. data['vote_average'] > 7, marks which rows satisfy a condition:
Out[130]:
0        True
1 False
2 False
3 True
4 False
...
1460 True
1461 True
1462 False
1463 False
1464 False
Name: vote_average, Length: 1465, dtype: bool
But we still don't know the row values — only which rows satisfied the condition.
Out[131]: (data.loc[mask], presumably data.loc[data['vote_average'] > 7] — rows with vote_average above 7, e.g. Avatar, The Dark Knight Rises, the two Hobbit films, Titanic, Eraserhead, The Last Waltz, Clerks)
You can also perform the filtering without even using loc
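A sketch of loc-free filtering with the same mask:

data[data['vote_average'] > 7]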
Out[132]: (the same filtered rows as above, produced without loc)
Now, how can we return a subset of columns, say, only title and director_name ?
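A sketch combining a row mask with a column subset in loc:

data.loc[data['vote_average'] > 7, ['title', 'director_name']]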
Out[134]: (rows selected with a compound mask — 2015 releases such as Furious 7, Mad Max: Fury Road, The Revenant, The Martian, The Man from U.N.C.L.E.)
Note:
Out[135]: (another masking example combining conditions with & / | — e.g. Pirates of the Caribbean: At World's End, Pirates of the Caribbean: On Stranger Tides, Spider-Man 2, Transformers: Revenge of the Fallen)
Now let's try to answer few more Questions from this data
In [136… data.sort_values(['popularity'],ascending=False).head(5)
Out[136]: (top 5 by popularity — Interstellar, Mad Max: Fury Road, Pirates of the Caribbean: The Curse of the Black Pearl, The Hunger Games: Mockingjay - Part 1, The Dark Knight)
In [137… data.sort_values(['title'],ascending=False).head(5)
Out[137]: (top 5 by title in descending order — xXx: State of the Union, xXx, eXistenZ, Zoolander 2, Zoolander)
Now, how will we get the list of movies directed by a particular director, say, 'Christopher Nolan'?
In [138… data.loc[data['director_name'] == 'Christopher Nolan',['title']]
Out[138]: title
58 Interstellar
59 Inception
74 Batman Begins
565 Insomnia
1341 Memento
Note:
Matching on the exact string 'Christopher Nolan' is brittle — the name could have been written differently in the data.
A better way is to use string methods; we will discuss this later.
Apply
Now suppose we want to convert our Gender column data to numerical format
Basically,
0 for Male
1 for Female
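A sketch of the encoding with apply (assuming the column holds the strings 'Male'/'Female'):

def encode_gender(g):
    # assumption: gender values are 'Male' / 'Female'
    if g == 'Male':
        return 0
    if g == 'Female':
        return 1
    return g   # leave missing/unknown values untouched

data['gender'] = data['gender'].apply(encode_gender)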
Out[140]: (the dataframe with gender now encoded numerically)
Say we apply a function to the dataframe, e.g. data[['revenue','budget']].apply(np.sum) — by default it works column-wise:
Out[141]:
revenue    209867.04
budget      70353.62
dtype: float64
But there's a mistake here: we wanted our results per movie (per row), and by default apply works column-wise (axis=0).
==> apply() can be applied on any dataframe along any particular axis; see the sketch below.
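A row-wise sketch with axis=1 (the profit column name is illustrative):

def profit(row):
    # each row is passed as a Series, so columns are accessible by name
    return row['revenue'] - row['budget']

data['profit'] = data.apply(profit, axis=1)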
Out[143]: (the dataframe after the row-wise apply)
Thus, we can access the columns by their names inside the functions too using apply
Importing Data
Let's first import our data and prepare it as we did in the last lecture
Downloading...
From: https://drive.google.com/uc?id=1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd
To: C:\Users\kumar\Jupyter Python Files\Scaler Lectures\movies.csv
Grouping
How can we know the number of movies released by a particular director, say,
Christopher Nolan?
Out[145]:
title    8
dtype: int64
In [146]: data["director_name"].value_counts()
Out[146]:
Steven Spielberg    26
Martin Scorsese 19
Clint Eastwood 19
Woody Allen 18
Ridley Scott 16
..
Tim Hill 5
Jonathan Liebesman 5
Roman Polanski 5
Larry Charles 5
Nicole Holofcener 5
Name: director_name, Length: 199, dtype: int64
We can assume pandas must have grouped the rows internally to find the count
For example, average popularity of each director, or max rating among all movies by a director?
1. Split: Breaking up and grouping a DataFrame depending on the value of the specified key.
2. Apply: Computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
3. Combine: Merging the results of these operations back into a single output.
In [147… data.groupby('director_name')
Notice,
But it's returning an object, we would want to get information out of this object.
How can we know the number of groups our data is divided into?
In [148]: data.groupby('director_name').ngroups
Out[148]: 199
Based on this grouping, how can we find which keys belong to which group?
In [149… data.groupby('director_name').groups
{'Adam McKay': [176, 323, 366, 505, 839, 916],
 'Adam Shankman': [265, 300, 350, 404, 458, 843, 999, 1231],
 'Alejandro González Iñárritu': [106, 749, 1015, 1034, 1077, 1405],
 'Alex Proyas': [95, 159, 514, 671, 873],
 ...
 'Christopher Nolan': [3, 45, 58, 59, 74, 565, 641, 1341],
 ...}
(a dict mapping each director to the row labels of their movies)
Now what if we want to extract data of a particular group from this list?
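The output below matches get_group('Alexander Payne') — a sketch:

data.groupby('director_name').get_group('Alexander Payne')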
Out[150]: (the Alexander Payne group — About Schmidt, The Descendants, Sideways, Nebraska)
Great! We are able to extract the data from our DataFrameGroupBy object
This does give us the max value of the data, but for all the features
In [151]: data.groupby('director_name')['title'].count()
Out[151]:
director_name
Adam McKay 6
Adam Shankman 8
Alejandro González Iñárritu 6
Alex Proyas 5
Alexander Payne 5
..
Wes Craven 10
Wolfgang Petersen 7
Woody Allen 18
Zack Snyder 7
Zhang Yimou 6
Name: title, Length: 199, dtype: int64
For example, the very first year and the latest year a director released a movie — this is simply the min and max of the year column, grouped by director. A sketch follows.
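A minimal sketch:

data.groupby('director_name')['year'].agg(['min', 'max'])   # first and latest release year per director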
high budget director -> any director with at least one movie with budget > 100M
How can we filter out the director names whose max budget exceeds 100M? One way, storing them in names, is sketched below.
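A sketch, assuming the variable name names used in the next cell:

max_budget = data.groupby('director_name')['budget'].max()
names = max_budget[max_budget > 100000000].index   # budgets are raw dollars in this re-imported data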
Finally, how can we filter out the details of the movies by these directors?
In [155… data.loc[data['director_name'].isin(names)]
Out[155]: (movies whose director appears in names — e.g. Avatar, Pirates of the Caribbean: At World's End, Spectre, The Dark Knight Rises, Spider-Man 3, ..., The Last Waltz, Clerks, El Mariachi)
Out[156]: (the same rows, this time produced with a group-based filter)
NOTE
==> The result is not a groupby object but regular pandas DataFrame with the filtered groups
eliminated
Yes! groupby().filter() does this in a single step, as sketched below:
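A sketch of the one-step group-based filter:

data.groupby('director_name').filter(lambda g: g['budget'].max() > 100000000)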
Out[157]: (the filtered dataframe — identical rows to Out[155]/Out[156])
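The risky flag used below marks movies whose budget exceeds the director's average revenue. Its construction is not shown in these notes; a sketch of one way, via a group-based apply (helper name illustrative):

def flag_risky(g):
    g = g.copy()
    g['risky'] = g['budget'] > g['revenue'].mean()   # compare each budget to the group's mean revenue
    return g

data_risky = data.groupby('director_name', group_keys=False).apply(flag_risky)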
In [158… data_risky.loc[data_risky["risky"]]
Out[158]: (the risky movies — e.g. Quantum of Solace, Pirates of the Caribbean: On Stranger Tides, Robin Hood, X-Men: The Last Stand, ..., Dying of the Light, In the Name of the King III)
Yes, there are some 131 movies whose budget was greater than average earnings of its director
Multi-Indexing
Now, let's say you want to find the most productive director.
But what counts as productive — the most movies overall? The most movies per active year?
Or will you also consider the amount of business the movies did?
To simplify, let's start with a simple movie count:
In [159]: data.groupby(['director_name'])['title'].count().sort_values(ascending=False)
Out[159]:
director_name
Steven Spielberg 26
Clint Eastwood 19
Martin Scorsese 19
Woody Allen 18
Robert Rodriguez 16
..
Paul Weitz 5
John Madden 5
Paul Verhoeven 5
John Whitesell 5
Kevin Reynolds 5
Name: title, Length: 199, dtype: int64
Chances are, he might simply have been active for more years than other directors.
How can we calculate multiple aggregates, such as min and max of year along with the count of titles, together? A sketch follows.
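A sketch, assuming the variable name data_agg used below:

data_agg = data.groupby('director_name').agg({'year': ['min', 'max'], 'title': 'count'})   # columns become a MultiIndex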
Notice, the columns of data_agg are now a MultiIndex — min and max sit under year.
What would happen if we print the col year of this multi-index dataframe?
In [162]: data_agg["year"]
(output: the min and max sub-columns under year, indexed by director_name)
Columns look good, but we may want to turn back the row labels into a proper column as well
In [165… data_agg.reset_index()
Recall the min/max/count features we just built. Using them, can we find the most productive director?
First calculate how many years the director has been active; a sketch follows.
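A sketch, assuming flattened column names min, max and count (e.g. Adam McKay: min 2004, max 2015, 6 titles -> 11 years active, 0.545455 titles/year):

data_agg = data_agg.reset_index()
data_agg['years_active'] = data_agg['max'] - data_agg['min']             # span of active years
data_agg['productivity'] = data_agg['count'] / data_agg['years_active']  # movies per active year
data_agg.sort_values('productivity', ascending=False)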
(output: per-director min year, max year, count, years active and productivity)
Conclusion:
==> "Tyler Perry" turns out to be the truly most productive director
Link: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
In [169…
Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: C:\Users\kumar\Jupyter Python Files\Scaler Lectures\Pfizer_1.csv
==> Parameters such as temperature and pressure are recorded at 1-hour intervals every day to monitor the drug stability in a drug development test.
==> These data points are thus used to identify the optimal set of values of parameters for the stability of the drugs.
In [172… data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 18 non-null object
1 Drug_Name 18 non-null object
2 Parameter 18 non-null object
3 1:30:00 16 non-null float64
4 2:30:00 16 non-null float64
5 3:30:00 12 non-null float64
6 4:30:00 14 non-null float64
7 5:30:00 16 non-null float64
8 6:30:00 18 non-null int64
9 7:30:00 16 non-null float64
10 8:30:00 14 non-null float64
11 9:30:00 16 non-null float64
12 10:30:00 18 non-null int64
13 11:30:00 16 non-null float64
14 12:30:00 18 non-null int64
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB
In [173]: data.shape
Out[173]: (18, 15)
In [174… data.head()
Out[174]: (first 5 rows — Date 15-10-2020; Drug_Name diltiazem hydrochloride / docetaxel injection / ketamine hydrochloride; Parameter alternating Temperature and Pressure; one reading column per hour from 1:30:00 to 12:30:00, some values NaN)
In [175… data.tail()
Out[175]: (last 5 rows — Date 17-10-2020, the same drugs and parameters)
Melting in Pandas
As we saw earlier, the dataset has 18 rows and 15 columns
We can create one column containing the timestamps and another containing the corresponding readings.
==> "Melt" the timestamp columns into two columns — the timestamp and its corresponding value.
How can we restructure our data into having every row correspond to a single reading?
And how can we rename the default "variable" and "value" columns to fit our dataframe? A sketch follows.
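A sketch consistent with the column names used in the pivot calls below:

data_melt = pd.melt(data, id_vars=['Date', 'Drug_Name', 'Parameter'],
                    var_name='time', value_name='reading')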
data_melt
Conclusion
The labels of the timestamp columns are conveniently melted into a single column — time
All readings are retained in the reading column
Labels such as 1:30:00 and 2:30:00 have become categories of the time (by default "variable") column
The values from the melted columns are stored in the reading (by default "value") column
Pivot
Now suppose we want to convert our data back to wide format
The reason could be to maintain the structure for storing or some other purpose.
How can we restructure our data back to the original wide format, before it was
melted?
Out[178]: (presumably data_melt.pivot(index=['Date','Drug_Name','Parameter'], columns='time', values='reading') — the wide table restored, rows indexed by (Date, Drug_Name, Parameter) with one column per time from 1:30:00 to 12:30:00)
Notice, the result has a (Date, Drug_Name, Parameter) MultiIndex on the rows; we can flatten it back into regular columns with reset_index():
In [179… data_melt.pivot(index=['Date','Drug_Name','Parameter'],
columns = 'time',
values='reading').reset_index()
Out[179]: (the same wide table with Date, Drug_Name and Parameter back as regular columns — 18 rows, matching the original layout)
In [180… data_melt.head()
Can we further restructure our data by splitting the Parameter column into separate Temperature/Pressure columns?
A format like data_tidy below:
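A sketch of one way to build it (assumed, matching the columns seen later):

data_tidy = data_melt.pivot(index=['Date', 'time', 'Drug_Name'],
                            columns='Parameter', values='reading').reset_index()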
data_tidy
In [184… data_tidy.head()
In [185… pd.pivot_table?
Pivot_table
Now suppose we want to find some insights, like mean temperature day wise
Can we use pivot to find the day-wise mean value of temperature for each drug?
In [186… data_tidy.pivot(index=['Drug_Name'],
columns = 'Date',
values=['Temperature'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
In [186]: data_tidy.pivot(index=['Drug_Name'],
                          columns = 'Date',
                          values=['Temperature'])
pivot() cannot aggregate — each (index, column) pair must map to a single value, hence the index values should uniquely identify each row. Since each drug has many readings per date, we need pivot_table with an aggregation function instead; a sketch follows.
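A sketch with pivot_table:

data_tidy.pivot_table(index='Drug_Name', columns='Date',
                      values='Temperature', aggfunc='mean')   # mean temperature per drug per date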
(output: mean temperature per drug per date — e.g. diltiazem hydrochloride: 21.454545, 37.454545, 15.636364 across the three dates)
Note:
In fact, pivot_table uses groupby in the backend to group the data and perform the aggregration
The only difference is in the type of output we get using both functions
Similarly, what if we want to find the minimum values of temperature and pressure on each date? A sketch follows.
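data_tidy.pivot_table(index='Drug_Name', columns='Date',
                      values=['Temperature', 'Pressure'], aggfunc='min')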
(output: per-drug minimum Temperature and Pressure for each date)
In [189… data_tidy.head()
1. None
2. NaN (short for Not a Number)
In [190]: type(None)
Out[190]: NoneType
In [191]: type(np.nan)
Out[191]: float
E.g., None is used for object (string) data, while NaN is a special floating-point value used for numeric data.
Note:
Pandas uses these values nearly interchangeably, converting between them where appropriate, based on
column datatype
For object type, the None is preserved and not changed to NaN
Now we have the basic idea about missing values
In [195… data.isna().head()
Out[195]: Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 False False False False False True False False False False False False
1 False False False False False True False False False False False False
2 False False False True False False True False False True True False
3 False False False True False False True False False True True False
4 False False False False True True False True False False False False
In [196… data.isnull().head()
Out[196]: Date Drug_Name Parameter 1:30:00 2:30:00 3:30:00 4:30:00 5:30:00 6:30:00 7:30:00 8:30:00 9:30:00
0 False False False False False True False False False False False False
1 False False False False False True False False False False False False
2 False False False True False False True False False True True False
3 False False False True False False True False False True True False
4 False False False False True True False True False False False False
But, why do we have two methods, "isna" and "isnull" for the same operation?
isnull() is just an alias for isna()
In [197]: pd.isnull
Out[197]: <function pandas.core.dtypes.missing.isna(obj)>
In [198]: pd.isna
Out[198]: <function pandas.core.dtypes.missing.isna(obj)>
In [199]: data.isna().sum()
Out[199]:
Date 0
Drug_Name 0
Parameter 0
1:30:00 2
2:30:00 2
3:30:00 6
4:30:00 4
5:30:00 2
6:30:00 0
7:30:00 2
8:30:00 4
9:30:00 2
10:30:00 0
11:30:00 2
12:30:00 0
dtype: int64
In [200]: data.isna().sum(axis=1)
Out[200]:
0 1
1 1
2 4
3 4
4 3
5 3
6 1
7 1
8 1
9 1
10 2
11 2
12 1
13 1
14 0
15 0
16 0
17 0
dtype: int64
Out[201]: (the rows that contain at least one missing value — 14 of the 18 rows: all of 15-10-2020 and 16-10-2020, plus the two diltiazem hydrochloride rows from 17-10-2020)
Note:
We have identified the null counts, but how do we deal with them?
We have two options: drop the missing data, or fill it in.
In [202… data.dropna()
Out[202]: (only the 4 fully observed rows survive — the docetaxel injection and ketamine hydrochloride rows from 17-10-2020)
In [203… data.dropna(axis=1)
=> Every column which had even a single missing value has been deleted — a heavy loss of data.
Instead of dropping, it would be better to fill the missing values with some data
In [204… data.fillna(0).head()
Out[204]: (the same head as before, with every NaN replaced by 0.0)
In [205]: data['2:30:00'].fillna(0)
Out[205]:
0 22.0
1 13.0
2 17.0
3 22.0
4 0.0
5 0.0
6 35.0
7 19.0
8 47.0
9 24.0
10 9.0
11 12.0
12 19.0
13 4.0
14 13.0
15 22.0
16 14.0
17 9.0
Name: 2:30:00, dtype: float64
In [206]: data['2:30:00'].mean()
Out[206]: 18.8125
Now let's fill the NaN values with the mean value of the column
In [207]: data['2:30:00'].fillna(data['2:30:00'].mean())
Out[207]:
0 22.0000
1 13.0000
2 17.0000
3 22.0000
4 18.8125
5 18.8125
6 35.0000
7 19.0000
8 47.0000
9 24.0000
10 9.0000
11 12.0000
12 19.0000
13 4.0000
14 13.0000
15 22.0000
16 14.0000
17 9.0000
Name: 2:30:00, dtype: float64
But this doesn't feel right. What could be wrong with this?
Can we use the mean of all compounds as average for our estimator?
Different drugs have different characteristics
We can't simply do an average and fill the null values
We could fill the null values of respective compounds with their respective means
In [208… # data_tidy.groupby("Drug_Name")["Temperature"].mean()
Now we can form a new column based on the average values of temperature for each drug
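The helper temp_mean used in the next cell is not shown in these notes; a sketch consistent with the Temperature_avg column that appears later:

def temp_mean(g):
    g = g.copy()
    g['Temperature_avg'] = g['Temperature'].mean()   # each drug's mean temperature
    return g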
In [210… data_tidy=data_tidy.groupby(["Drug_Name"]).apply(temp_mean)
data_tidy
Now we fill the null values in Temperature using this new column!
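A sketch of the fill:

data_tidy['Temperature'] = data_tidy['Temperature'].fillna(data_tidy['Temperature_avg'])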
In [212]: data_tidy.isna().sum()
Out[212]:
Date               0
time               0
Drug_Name          0
Pressure          13
Temperature        0
Temperature_avg    0
dtype: int64
Great!!
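Presumably the same trick was then applied to Pressure — a compact sketch using transform:

data_tidy['Pressure_avg'] = data_tidy.groupby('Drug_Name')['Pressure'].transform('mean')
data_tidy['Pressure'] = data_tidy['Pressure'].fillna(data_tidy['Pressure_avg'])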
In [214]: data_tidy.isna().sum()
Out[214]:
Date 0
Date 0
time 0
Drug_Name 0
Pressure 0
Temperature 0
Temperature_avg 0
Pressure_avg 0
dtype: int64
We will further learn more on this during later lectures of feature engineering
Pandas Cut
Sometimes we want our data in categorical format instead of continuous values.
Let's try to use pd.cut on our Temperature column to categorise the data into bins.
But to define the categories, let's first check the min and max temperature values:
In [215… data_tidy
(min 8.0, max 58.0)
Let's keep some buffer for future values and take the range 5-60 (instead of 8-58).
Let's divide this range into 4 bins of width 10-15 each; a sketch follows.
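A sketch — the exact bin edges are assumptions read off the outputs below:

data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'],
                               bins=[5, 20, 35, 50, 60],
                               labels=['low', 'medium', 'high', 'very_high'])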
Out[217]: (data_tidy with the new temp_cat column — e.g. Temperature 20.0 -> low, 22.0 and 23.0 -> medium)
In [218]: data_tidy['temp_cat'].value_counts()
Out[218]:
low 50
medium 38
high 15
very_high 5
Name: temp_cat, dtype: int64
Say,
How can you filter rows containing "hydrochloride" in their drug name?
In [219… data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()
Out[219]: (rows whose Drug_Name contains 'hydrochloride' — the diltiazem hydrochloride and ketamine hydrochloride entries)
> Series.str.function()
Series.str can be used to access the values of the series as strings and apply several methods to it.
Now suppose we want to form a new column based on the year of the experiments?
In [220… data_tidy['Date'].str.split('-')
To extract the year we need to select the last element of each list
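A sketch — .str[-1] picks the last element of each split list:

data_tidy['Date'].str.split('-').str[-1]   # the year, still dtype object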
The dtype of the output is still object; we would prefer a numeric type.
Also, the date format will not always be day-month-year — it can vary.
Thus, to work with such date-time type of data, we can use a special method of pandas
Datetime
Lets start with understanding a date-time type of data
Let's first merge our Date and time columns into a new timestamp column
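A sketch of the concatenation (still plain strings at this point):

data_tidy['timestamp'] = data_tidy['Date'] + ' ' + data_tidy['time']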
In [224… data_tidy.head()
(head: the new timestamp column holds the concatenated strings, e.g. 15-10-2020 10:30:00)
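The conversion to a real datetime was presumably:

data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp'], format='%d-%m-%Y %H:%M:%S')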
data_tidy
(data_tidy: the timestamp column is now parsed, e.g. 2020-10-15 10:30:00 through 2020-10-17 09:30:00)
In [226… data_tidy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108 entries, 0 to 107
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Drug_Name 108 non-null object
1 Pressure 108 non-null float64
2 Temperature 108 non-null float64
3 Temperature_avg 108 non-null float64
4 Pressure_avg 108 non-null float64
5 temp_cat 108 non-null category
6 timestamp 108 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(4), object(1)
memory usage: 10.3+ KB
The type of timestamp column has been changed to datetime from object
In [227]: ts = data_tidy['timestamp'][0]
ts
Out[227]: Timestamp('2020-10-15 10:30:00')
In [228]: ts.year
Out[228]: 2020
Similarly we can also access the month and day using the month and day attributes
In [229]: ts.month
Out[229]: 10
In [230]: ts.day
Out[230]: 15
But what if we want to know the name of the month or the day of the week on that
date ?
We can find it using month_name() and day_name() methods
In [231]: ts.month_name()
Out[231]: 'October'
In [232]: ts.day_name()
Out[232]: 'Thursday'
In [233]: ts.dayofweek
Out[233]: 3
In [234]: ts.hour
Out[234]: 10
In [235]: ts.minute
Out[235]: 30
... and so on
We can similarly extract minutes and seconds
This data parsing from string to date-time makes it easier to work with data
We can use this data from the columns as a whole using .dt object
In [236]: data_tidy['timestamp'].dt
Out[236]: <pandas.core.indexes.accessors.DatetimeProperties object at 0x000001A2D13C7460>
In [237]: data_tidy['timestamp'].dt.year
Out[237]:
0 2020
1 2020
2 2020
3 2020
4 2020
...
103 2020
104 2020
105 2020
106 2020
107 2020
Name: timestamp, Length: 108, dtype: int64
Now, Let's create the new column using these extracted values from the property
We will use strftime, short for "string format time", to modify our datetime format
In [238]: data_tidy['timestamp'][0]
Out[238]: Timestamp('2020-10-15 10:30:00')
Individual format codes pull out each piece — e.g. strftime('%d') gives '15', '%m' gives '10', '%M' gives '30' and '%S' gives '00'.
Similarly, we can combine the format codes to modify the date-time format to our convenience.
In [245… data_tidy['timestamp'][0].strftime('%m-%d')
'10-15'
Out[245]:
Writing to file
How can we write our dataframe to a csv file?
We have to provide the path and file_name in which you want to store the data
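A sketch (the filename is illustrative):

data_tidy.to_csv('drug_readings_clean.csv', index=False)   # index=False skips writing the row index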
To find all the values in a Series that start with a pattern "s": column_name.str.startswith('s')
To find all the values in a Series that end with a pattern "s": column_name.str.endswith('s')
To find all the values in a Series that contain a pattern "s": column_name.str.contains('s')
Thank You!