Python For Machine Learning From Basics To Advance Part 3
"Stacking" and "unstacking" are operations that you can perform on multi-indexed
DataFrames to change the arrangement of the data, essentially reshaping the data between
a wide and a long format (or vice versa).
1. Stacking:
Stacking is the process of "melting" or pivoting the innermost level of column labels to
become the innermost level of row labels.
This operation is typically used when you want to convert a wide DataFrame with multi-
level columns into a long format.
You can use the .stack() method to perform stacking. By default, it will stack the
innermost level of columns.
In [ ]: import numpy as np
import pandas as pd
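The cell that builds and stacks the DataFrame is not shown in this export; a minimal sketch of what it presumably looks like (random values, two column levels A/B and X/Y):
In [ ]: # a sketch, assuming a 4x4 frame of random values with MultiIndex columns
cols = pd.MultiIndex.from_product([['A','B'],['X','Y']])
df = pd.DataFrame(np.random.rand(4,4), columns=cols)
df.stack()   # the innermost column level (X/Y) moves into the row index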
A B
0 X 0.960684 0.900984
Y 0.118538 0.485585
1 X 0.946716 0.444658
Y 0.049913 0.991469
2 X 0.656110 0.759727
Y 0.158270 0.203801
3 X 0.360581 0.797212
Y 0.965035 0.102426
2. Unstacking:
Unstacking is the reverse operation of stacking. It involves pivoting the innermost level
of row labels to become the innermost level of column labels.
You can use the .unstack() method to perform unstacking. By default, it will unstack
the innermost level of row labels.
Example:
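The input cell is again not shown; continuing the sketch above, unstacking reverses the stack:
In [ ]: stacked = df.stack()
stacked.unstack()   # X/Y moves back from the rows into the innermost column level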
A B
X Y X Y
0 0.960684 0.118538 0.900984 0.485585
1 0.946716 0.049913 0.444658 0.991469
2 0.656110 0.158270 0.759727 0.203801
3 0.360581 0.965035 0.797212 0.102426
You can specify the level you want to stack or unstack by passing the level parameter to
the stack() or unstack() methods. For example:
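With the same sketch, the level argument looks like this (levels can be referred to by position or, if they are named, by name):
In [ ]: df.stack(level=-1)        # same as the default: stack the innermost column level
stacked.unstack(level=0)  # move the outer row level (the 0..3 labels) into the columns instead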
Out[ ]: A B
0 X 0.960684 0.900984
Y 0.118538 0.485585
1 X 0.946716 0.444658
Y 0.049913 0.991469
2 X 0.656110 0.759727
Y 0.158270 0.203801
3 X 0.360581 0.797212
Y 0.965035 0.102426
Out[ ]: A B
0 1 2 3 0 1 2 3
In [ ]: index_val = [('cse',2019),('cse',2020),('cse',2021),('cse',2022),
                     ('ece',2019),('ece',2020),('ece',2021),('ece',2022)]
multiindex = pd.MultiIndex.from_tuples(index_val)
multiindex.levels[1]
In [ ]: branch_df1 = pd.DataFrame(
[
[1,2],
[3,4],
[5,6],
[7,8],
[9,10],
[11,12],
[13,14],
[15,16],
],
index = multiindex,
columns = ['avg_package','students']
)
branch_df1
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
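The cell that creates branch_df2 is not shown; judging from the output that follows, it is a year-indexed frame with (city, metric) columns, roughly:
In [ ]: # hypothetical reconstruction of branch_df2, based on the printed output
branch_df2 = pd.DataFrame(
    [[1,2,0,0],[3,4,0,0],[5,6,0,0],[7,8,0,0]],
    index = [2019,2020,2021,2022],
    columns = pd.MultiIndex.from_product([['delhi','mumbai'],['avg_package','students']])
)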
In [ ]: branch_df2
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: branch_df1
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [ ]: branch_df1.unstack().unstack()
In [ ]: branch_df1.unstack().stack()
cse 2019 1 2
2020 3 4
2021 5 6
2022 7 8
ece 2019 9 10
2020 11 12
2021 13 14
2022 15 16
In [ ]: branch_df2
2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: branch_df2.stack()
2019 avg_package 1 0
students 2 0
2020 avg_package 3 0
students 4 0
2021 avg_package 5 0
students 6 0
2022 avg_package 7 0
students 8 0
In [ ]: branch_df2.stack().stack()
Stacking and unstacking can be very useful when you need to reshape your data to make it
more suitable for different types of analysis or visualization. They are common operations in
data manipulation when working with multi-indexed DataFrames in pandas.
In [ ]: branch_df = pd.DataFrame(
[
[1,2,0,0],
[3,4,0,0],
[5,6,0,0],
[7,8,0,0],
[9,10,0,0],
[11,12,0,0],
[13,14,0,0],
[15,16,0,0],
],
index = multiindex,
columns = pd.MultiIndex.from_product([['delhi','mumbai'],['avg_package','students']])
)
branch_df
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
Basic Checks
In [ ]: # HEAD
branch_df.head()
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
In [ ]: # Tail
branch_df.tail()
cse 2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [ ]: #shape
branch_df.shape
(8, 4)
Out[ ]:
In [ ]: # info
branch_df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8 entries, ('cse', 2019) to ('ece', 2022)
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (delhi, avg_package) 8 non-null int64
1 (delhi, students) 8 non-null int64
2 (mumbai, avg_package) 8 non-null int64
3 (mumbai, students) 8 non-null int64
dtypes: int64(4)
memory usage: 632.0+ bytes
In [ ]: # duplicated
branch_df.duplicated().sum()
0
Out[ ]:
In [ ]: # isnull
branch_df.isnull().sum()
delhi avg_package 0
Out[ ]:
students 0
mumbai avg_package 0
students 0
dtype: int64
How to Extract
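The input cell for the output below is missing; it is presumably a label-based lookup of a single row, something like:
In [ ]: # select one row by its full (branch, year) label -> a Series of the four columns
branch_df.loc[('cse', 2022)]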
delhi avg_package 7
Out[ ]:
students 8
mumbai avg_package 0
students 0
Name: (cse, 2022), dtype: int64
In [ ]: branch_df
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
cse 2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
In [ ]: # using iloc
branch_df.iloc[2:5]
cse 2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
In [ ]: branch_df.iloc[2:8:2]
cse 2021 5 6 0 0
ece 2019 9 10 0 0
2021 13 14 0 0
In [ ]: # extracting cols
branch_df['delhi']['students']
cse 2019 2
Out[ ]:
2020 4
2021 6
2022 8
ece 2019 10
2020 12
2021 14
2022 16
Name: students, dtype: int64
In [ ]: branch_df.iloc[:,1:3]
students avg_package
cse 2019 2 0
2020 4 0
2021 6 0
2022 8 0
ece 2019 10 0
2020 12 0
2021 14 0
2022 16 0
In [ ]: # Extracting both
branch_df.iloc[[0,4],[1,2]]
students avg_package
cse 2019 2 0
ece 2019 10 0
Sorting
In [ ]: branch_df.sort_index(ascending=False)
ece 2022 15 16 0 0
2021 13 14 0 0
2020 11 12 0 0
2019 9 10 0 0
cse 2022 7 8 0 0
2021 5 6 0 0
2020 3 4 0 0
2019 1 2 0 0
In [ ]: branch_df.sort_index(ascending=[False,True])
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
In [ ]: branch_df.sort_index(level=0,ascending=[False])
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
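The output below appears to be the transpose of branch_df (the input cell is not shown); the call was presumably:
In [ ]: # swap rows and columns: the (city, metric) column index becomes the row index
branch_df.transpose()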
delhi avg_package 1 3 5 7 9 11 13 15
students 2 4 6 8 10 12 14 16
mumbai avg_package 0 0 0 0 0 0 0 0
students 0 0 0 0 0 0 0 0
In [ ]: # swaplevel
branch_df.swaplevel(axis=1)
cse 2019 1 2 0 0
2020 3 4 0 0
2021 5 6 0 0
2022 7 8 0 0
ece 2019 9 10 0 0
2020 11 12 0 0
2021 13 14 0 0
2022 15 16 0 0
In [ ]: branch_df.swaplevel()
2019 cse 1 2 0 0
2020 cse 3 4 0 0
2021 cse 5 6 0 0
2022 cse 7 8 0 0
2019 ece 9 10 0 0
2020 ece 11 12 0 0
2021 ece 13 14 0 0
2022 ece 15 16 0 0
"Long" and "wide" are terms often used in data analysis and data reshaping in the context of
data frames or tables, typically in software like R or Python. They describe two different ways
of organizing and structuring data.
Long Format:
ID Variable Value
1 Age 25
1 Height 175
1 Weight 70
2 Age 30
2 Height 160
2 Weight 60
Wide Format:
1 25 175 70
2 30 160 60
Converting data between long and wide formats is often necessary depending on the
specific analysis or visualization task. In R and Python there are functions and libraries
for reshaping between the two: gather (tidyr) in R and melt in pandas move data from
wide to long format, while spread / pivot_wider in R and pivot (or pivot_table ) in
pandas move it from long back to wide format.
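As a quick sketch of the reverse direction (long back to wide), using the example table above and assuming pandas is imported as pd as in the cells below:
In [ ]: long_df = pd.DataFrame({
    'ID': [1,1,1,2,2,2],
    'Variable': ['Age','Height','Weight','Age','Height','Weight'],
    'Value': [25,175,70,30,160,60]
})
long_df.pivot(index='ID', columns='Variable', values='Value')   # one row per ID, one column per Variable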
In [ ]: import numpy as np
import pandas as pd
In [ ]: pd.DataFrame({'cse':[120]})
Out[ ]: cse
0 120
In [ ]: pd.DataFrame({'cse':[120]}).melt()
0 cse 120
0 120 100 50
0 cse 120
1 ece 100
2 mech 50
In [ ]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
)
Out[ ]:   branch  2020  2021  2022
        0    cse   100   120   150
        1    ece   150   130   140
        2   mech    60    80    70
In [ ]: pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
).melt()
Out[ ]:    variable value
        0    branch   cse
        1    branch   ece
        2    branch  mech
        3      2020   100
        4      2020   150
        5      2020    60
        6      2021   120
        7      2021   130
        8      2021    80
        9      2022   150
        10     2022   140
        11     2022    70
Melting with id_vars=['branch'] instead keeps branch as an identifier column; the mech rows of that output look like:
           branch variable value
        2    mech     2020    60
        5    mech     2021    80
        8    mech     2022    70
Real-World Example:
In the context of COVID-19 data, data for deaths and confirmed cases are initially stored
in wide formats.
The data is converted to long format, making it easier to conduct analyses.
In the long format, each row represents a specific location, date, and the corresponding
number of deaths or confirmed cases. This format allows for efficient merging and
analysis, as it keeps related data in one place and facilitates further data exploration.
In [ ]: death.head()
Out[ ]: Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/2
In [ ]: confirm.head()
Out[ ]: Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/2
In [ ]: death = death.melt(id_vars=['Province/State','Country/Region','Lat','Long'],var_nam
confirm = confirm.melt(id_vars=['Province/State','Country/Region','Lat','Long'],var
In [ ]: death.head()
In [ ]: confirm.head()
In [ ]: confirm.merge(death,on=['Province/State','Country/Region','Lat','Long','date'])
Out[ ]: [merged long-format frame: one row per (Province/State, Country/Region, Lat, Long, date) with the confirmed-case and death counts side by side; the last row shown is the 'Winter Olympics 2022' entry (Lat 39.9042, Long 116.4074) for 1/2/23 with values 535 and 0]
In [ ]: confirm.merge(death,on=['Province/State','Country/Region','Lat','Long','date'])[['C
0 Afghanistan 1/22/20 0 0
1 Albania 1/22/20 0 0
2 Algeria 1/22/20 0 0
3 Andorra 1/22/20 0 0
4 Angola 1/22/20 0 0
The choice between long and wide data formats depends on the nature of the dataset and
the specific analysis or visualization tasks you want to perform. Converting data between
these formats can help optimize data organization for different analytical needs.
In [ ]: import numpy as np
import pandas as pd
import seaborn as sns
In [ ]: # Sample data
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Product': ['A', 'B', 'A', 'B'],
'Sales': [100, 200, 150, 250],
}
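The cell that builds the DataFrame and the pivot table is not shown; presumably something along these lines sits between the sample data and the printed result:
In [ ]: df = pd.DataFrame(data)
# rows come from Date, columns from Product, cell values from Sales (default aggfunc is 'mean')
pivot_table = pd.pivot_table(df, index='Date', columns='Product', values='Sales')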
In [ ]: print(pivot_table)
Product A B
Date
2023-01-01 100 200
2023-01-02 150 250
In this example, we first create a DataFrame from the sample data and then call
pd.pivot_table to build the pivot table. Here's what each argument does: index='Date'
supplies the row labels, columns='Product' spreads each product into its own column, and
values='Sales' fills the cells (aggregated with the default mean).
Real-world Examples
In [ ]: df = sns.load_dataset('tips')
In [ ]: df.head()
In [ ]: df.pivot_table(index='sex',columns='smoker',values='total_bill')
In [ ]: # aggfunc
df.pivot_table(index='sex',columns='smoker',values='total_bill',aggfunc='std')
In [ ]: # multidimensional
df.pivot_table(index=['sex','smoker'],columns=['day','time'],aggfunc={'size':'mean'})
Out[ ]: [pivot table of mean party size: rows are (sex, smoker), columns are (day, time); only part of the wide output fits, e.g.]
        Male   Yes  2.300000  NaN  1.666667  2.4  2.629630  2.600000  5.00  NaN  2.20 ...
        Female Yes  2.428571  NaN  2.000000  2.0  2.200000  2.500000  5.00  NaN  3.48 ...
In [ ]: # margins
df.pivot_table(index='sex',columns='smoker',values='total_bill',aggfunc='sum',margins=True)
Plotting graph
In [ ]: df = pd.read_csv('Data\Day43\expense_data.csv')
In [ ]: df.head()
Out[ ]:
           Date            Account               Category        Subcategory  Note              INR    Income/Expense  Note.1  Amount ...
        0  3/2/2022 10:11  CUB - online payment  Food            NaN          Brownie           50.0   Expense         NaN     50
        1  3/2/2022 10:11  CUB - online payment  Other           NaN          To lended people  300.0  Expense         NaN     300
        2  3/1/2022 19:50  CUB - online payment  Food            NaN          Dinner            78.0   Expense         NaN     78
        3  3/1/2022 18:56  CUB - online payment  Transportation  NaN          Metro             30.0   Expense         NaN     30
        4  3/1/2022 18:22  CUB - online payment  Food            NaN          Snacks            67.0   Expense         NaN     67
In [ ]: df['Category'].value_counts()
In [ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null object
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: float64(5), object(6)
memory usage: 23.9+ KB
In [ ]: df['Date'] = pd.to_datetime(df['Date'])
In [ ]: df['month'] = df['Date'].dt.month_name()
In [ ]: df.head()
Out[ ]: [the same five rows as above, with Date now parsed to datetime64 values such as 2022-03-02 10:11:00 and the new month column added at the far right of the frame]
In [ ]: df.pivot_table(index='month',columns='Income/Expense',values='INR',aggfunc='sum',fi
<Axes: xlabel='month'>
Out[ ]:
In [ ]: df.pivot_table(index='month',columns='Account',values='INR',aggfunc='sum',fill_valu
<Axes: xlabel='month'>
Out[ ]:
Vectorized string operations in Pandas refer to the ability to apply string functions and
operations to entire arrays of strings (columns or Series containing strings) without the
need for explicit loops or iteration. This is made possible by Pandas' integration with the
NumPy library, which allows for efficient element-wise operations.
When you have a Pandas DataFrame or Series containing string data, you can use
various string methods that are applied to every element in the column simultaneously.
This can significantly improve the efficiency and readability of your code. Some of the
commonly used vectorized string operations in Pandas include methods like
.str.lower() , .str.upper() , .str.strip() , .str.replace() , and many
more.
Vectorized string operations not only make your code more concise and readable but
also often lead to improved performance compared to explicit for-loops, especially
when dealing with large datasets.
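A small self-contained illustration of the idea (the Series below is made up for this example):
In [ ]: names = pd.Series(['  Alice ', 'BOB', 'charlie'])
names.str.strip().str.lower()        # trims and lower-cases every element at once
names.str.len()                      # element-wise string length
names.str.contains('a', case=False)  # boolean mask, computed without an explicit loop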
In [ ]: import numpy as np
import pandas as pd
In [ ]: s = pd.Series(['cat','mat',None,'rat'])
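The string operation that produced the output below is not visible in the export; it is consistent with an element-wise check such as the one sketched here, where the missing value is propagated instead of raising an error. Note also that the df used in the following cells is presumably the Titanic passenger dataset loaded with pd.read_csv in a cell that is not shown.
In [ ]: # a guess at the missing cell: True for 'cat', False for 'mat'/'rat', None passed through
s.str.startswith('c')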
0 True
Out[ ]:
1 False
2 None
3 False
dtype: object
In [ ]: df.head()
Out[ ]:
           PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin ...
        0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN
        1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85
        2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN
        3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123
        4  5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN
In [ ]: df['Name']
Common Functions
In [ ]: # lower/upper/capitalize/title
df['Name'].str.upper()
df['Name'].str.capitalize()
df['Name'].str.title()
In [ ]: # len
df['Name'].str.len().max()   # length of the longest name (82)
In [ ]: df['Name'][df['Name'].str.len() == 82].values[0]
'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallej
Out[ ]:
o)'
In [ ]: # strip
df['Name'].str.strip()
In [ ]: df[['title','firstname']] = df['Name'].str.split(',').str.get(1).str.strip().str.sp
df.head()
Out[ ]: [the same five rows as above, now with two extra columns at the right: title (e.g. Mr., Mrs., Miss.) and firstname, split out of Name]
In [ ]: df['title'].value_counts()
In [ ]: # replace
df['title'] = df['title'].str.replace('Ms.','Miss.')
df['title'] = df['title'].str.replace('Mlle.','Miss.')
In [ ]: df['title'].value_counts()
title
Out[ ]:
Mr. 517
Miss. 185
Mrs. 125
Master. 40
Dr. 7
Rev. 6
Major. 2
Col. 2
Don. 1
Mme. 1
Lady. 1
Sir. 1
Capt. 1
the 1
Jonkheer. 1
Name: count, dtype: int64
filtering
In [ ]: # startswith/endswith
df[df['firstname'].str.endswith('A')]
Out[ ]:
             PassengerId  Survived  Pclass  Name                   Sex     Age  SibSp  Parch  Ticket    Fare     Cabin
        64   65           0         1       Stewart, Mr. Albert A  male    NaN  0      0      PC 17605  27.7208  NaN
        303  304          1         2       Keane, Miss. Nora A    female  NaN  0      0      226593    12.3500  E101
In [ ]: # isdigit/isalpha...
df[df['firstname'].str.isdigit()]
Out[ ]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
slicing
In [ ]: df['Name'].str[::-1]
In Pandas, you can work with dates and times using the datetime data type. Pandas
provides several data structures and functions for handling date and time data, making it
convenient for time series data analysis.
In [ ]: import numpy as np
import pandas as pd
1. Timestamp :
This represents a single timestamp and is the fundamental data type for time series data in
Pandas.
Time stamps reference particular moments in time (e.g., Oct 24th, 2022 at 7:00pm)
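The input cell for the output below is missing; it is presumably a basic construction such as:
In [ ]: # a single moment in time; the time of day defaults to midnight
pd.Timestamp('2023-01-05')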
Timestamp('2023-01-05 00:00:00')
Out[ ]:
In [ ]: # variations
pd.Timestamp('2023-1-5')
pd.Timestamp('2023, 1, 5')
Timestamp('2023-01-05 00:00:00')
Out[ ]:
In [ ]: # only year
pd.Timestamp('2023')
Timestamp('2023-01-01 00:00:00')
Out[ ]:
In [ ]: # using text
pd.Timestamp('5th January 2023')
Timestamp('2023-01-05 00:00:00')
Out[ ]:
In [ ]: # from a python datetime object
import datetime as dt
x = pd.Timestamp(dt.datetime(2023,1,5,9,21,56))
x
Timestamp('2023-01-05 09:21:56')
Out[ ]:
In [ ]: # fetching attributes
x.year
2023
Out[ ]:
In [ ]: x.month
1
Out[ ]:
In [ ]: x.day
x.hour
x.minute
x.second
56
Out[ ]:
Why prefer Pandas' Timestamp over Python's built-in datetime module?
1. Efficiency: The datetime module in Python is flexible and comprehensive, but it may
not be as efficient when dealing with large datasets. Pandas' datetime objects are
optimized for performance and are designed for working with data, making them more
suitable for operations on large time series datasets.
2. Data Alignment: Pandas focuses on data manipulation and analysis, so it provides tools
for aligning data with time-based indices and working with irregular time series. This is
particularly useful in financial and scientific data analysis.
3. Convenience: Pandas provides a high-level API for working with time series data, which
can make your code more concise and readable. It simplifies common operations such
as resampling, aggregation, and filtering.
4. Integration with DataFrames: Pandas seamlessly integrates its date and time objects
with DataFrames. This integration allows you to easily create, manipulate, and analyze
time series data within the context of your data analysis tasks.
5. Time Zones: Pandas has built-in support for handling time zones and daylight saving
time, making it more suitable for working with global datasets and international time
series data.
2. DatetimeIndex :
This is an index that consists of Timestamp objects. It is used to create time series data in
Pandas DataFrames.
In [ ]: # from strings
pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1'])
In [ ]: # from strings
type(pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1']))
pandas.core.indexes.datetimes.DatetimeIndex
Out[ ]:
In [ ]: # using pd.timestamps
dt_index = pd.DatetimeIndex([pd.Timestamp(2023,1,1),pd.Timestamp(2022,1,1),pd.Times
In [ ]: dt_index
pd.Series([1,2,3],index=dt_index)
3. date_range function
In [ ]: # generate daily dates in a given range
pd.date_range(start='2023/1/5',end='2023/2/28',freq='D')
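A few more frequency aliases that date_range accepts (a small sketch; 'W' is weekly, 'M' is month-end, '6H' is every six hours; newer pandas versions prefer 'ME' and '6h'):
In [ ]: pd.date_range(start='2023/1/5', periods=4, freq='W')
pd.date_range(start='2023/1/5', periods=4, freq='M')
pd.date_range(start='2023/1/5', periods=6, freq='6H')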
4. to_datetime function
Converts existing objects (strings, lists, Series) into a pandas Timestamp / DatetimeIndex.
In [ ]: s = pd.Series(['2023/1/1','2022/1/1','2021/1/1'])
pd.to_datetime(s).dt.day_name()
0 Sunday
Out[ ]:
1 Saturday
2 Friday
dtype: object
In [ ]: # with errors
s = pd.Series(['2023/1/1','2022/1/1','2021/130/1'])
pd.to_datetime(s,errors='coerce').dt.month_name()
0 January
Out[ ]:
1 January
2 NaN
dtype: object
In [ ]: df = pd.read_csv('Data\Day43\expense_data.csv')
df.shape
(277, 11)
Out[ ]:
In [ ]: df.head()
Out[ ]: [df.head(): the same five expense rows shown earlier (Brownie, To lended people, Dinner, Metro, Snacks), with Date still stored as a plain string such as '3/2/2022 10:11']
In [ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null object
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: float64(5), object(6)
memory usage: 23.9+ KB
In [ ]: df['Date'] = pd.to_datetime(df['Date'])
In [ ]: df.info()
5. dt accessor
Accessor object for datetimelike properties of the Series values.
In [ ]: df['Date'].dt.is_quarter_start
0 False
Out[ ]:
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool
In [ ]: # plot graph
import matplotlib.pyplot as plt
plt.plot(df['Date'],df['INR'])
[<matplotlib.lines.Line2D at 0x181faeba430>]
Out[ ]:
In [ ]: df['day_name'] = df['Date'].dt.day_name()
In [ ]: df.head()
Out[ ]: [the same five rows, with Date parsed as datetime64 (e.g. 2022-03-02 10:11:00) and the new day_name column at the far right]
In [ ]: df.groupby('day_name')['INR'].mean().plot(kind='bar')
<Axes: xlabel='day_name'>
Out[ ]:
In [ ]: df['month_name'] = df['Date'].dt.month_name()
In [ ]: df.groupby('month_name')['INR'].sum().plot(kind='bar')
<Axes: xlabel='month_name'>
Out[ ]:
Pandas also provides powerful time series functionality, including the ability to resample,
group, and perform various time-based operations on data. You can work with date and
time data in Pandas to analyze and manipulate time series data effectively.
What is Matplotlib?
Matplotlib is a popular data visualization library in Python. It provides a wide range of tools
for creating various types of plots and charts, making it a valuable tool for data analysis,
scientific research, and data presentation. Matplotlib allows you to create high-quality,
customizable plots and figures for a variety of purposes, including line plots, bar charts,
scatter plots, histograms, and more.
Matplotlib is highly customizable and can be used to control almost every aspect of your
plots, from the colors and styles to labels and legends. It provides both a functional and an
object-oriented interface for creating plots, making it suitable for a wide range of users,
from beginners to advanced data scientists and researchers.
Matplotlib can be used in various contexts, including Jupyter notebooks, standalone Python
scripts, and integration with web applications and GUI frameworks. It also works well with
other Python libraries commonly used in data analysis and scientific computing, such as
NumPy and Pandas.
To use Matplotlib, you typically need to import the library in your Python code, create the
desired plot or chart, and then display or save it as needed. Here's a simple example of
creating a basic line plot using Matplotlib:
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 18, 10]

# Create and display a basic line plot
plt.plot(x, y)
plt.show()
This is just a basic introduction to Matplotlib. The library is quite versatile, and you can
explore its documentation and tutorials to learn more about its capabilities and how to
create various types of visualizations for your data.
Line Plot
A 2D line plot is one of the most common types of plots in Matplotlib. It's used to visualize
data with two continuous variables, typically representing one variable on the x-axis and
another on the y-axis, and connecting the data points with lines. This type of plot is useful
for showing trends, relationships, or patterns in data over a continuous range.
Bivariate Analysis
categorical -> numerical and numerical -> numerical
Use case - Time series data
In [ ]: price = [48000,54000,57000,49000,47000,45000]
year = [2015,2016,2017,2018,2019,2020]
plt.plot(year,price)
[<matplotlib.lines.Line2D at 0x28b7c3742e0>]
Out[ ]:
Real-world Dataset
In [ ]: batsman = pd.read_csv('Data\Day45\sharma-kohli.csv')
In [ ]: batsman.head()
[<matplotlib.lines.Line2D at 0x28b7c40ca60>]
Out[ ]:
Multiple Plots:
It's possible to create multiple lines on a single plot, making it easy to compare multiple
datasets or variables. In the example, both Rohit Sharma's and Virat Kohli's career runs are
plotted on the same graph.
[<matplotlib.lines.Line2D at 0x28b7d4e2610>]
Out[ ]:
In [ ]: # labels, title and legend
plt.plot(batsman['index'],batsman['V Kohli'],label='V Kohli')
plt.plot(batsman['index'],batsman['RG Sharma'],label='RG Sharma')
plt.title('Rohit Sharma vs Virat Kohli career runs')
plt.legend()
In [ ]: #colors
plt.plot(batsman['index'],batsman['V Kohli'],color='Red')
plt.plot(batsman['index'],batsman['RG Sharma'],color='Purple')
You can specify different colors for each line in the plot. In the example, colors like 'Red' and
'Purple' are used to differentiate the lines.
You can change the style and width of the lines. Common line styles include 'solid,' 'dotted,'
'dashed,' etc. In the example, 'solid' and 'dashdot' line styles are used.
In [ ]: # Marker
plt.plot(batsman['index'],batsman['V Kohli'],color='#D9F10F',linestyle='solid',line
plt.plot(batsman['index'],batsman['RG Sharma'],color='#FC00D6',linestyle='dashdot',
Markers are used to highlight data points on the line plot. You can customize markers' style
and size. In the example, markers like 'D' and 'o' are used with different colors.
In [ ]: # grid
plt.plot(batsman['index'],batsman['V Kohli'],color='#D9F10F',linestyle='solid',line
plt.plot(batsman['index'],batsman['RG Sharma'],color='#FC00D6',linestyle='dashdot',
plt.grid()
Adding a grid to the plot can make it easier to read and interpret the data. The grid helps in
aligning the data points with the tick marks on the axes.
In [ ]: # show
plt.plot(batsman['index'],batsman['V Kohli'],color='#D9F10F',linestyle='solid',line
plt.plot(batsman['index'],batsman['RG Sharma'],color='#FC00D6',linestyle='dashdot',
plt.grid()
plt.show()
After customizing your plot, you can use plt.show() to display it. This command is often used
in Jupyter notebooks or standalone Python scripts.
2D line plots are valuable for visualizing time series data, comparing trends in multiple
datasets, and exploring the relationship between two continuous variables. Customization
options in Matplotlib allow you to create visually appealing and informative plots for data
analysis and presentation.
A scatter plot, also known as a scatterplot or scatter chart, is a type of data visualization used
in statistics and data analysis. It's used to display the relationship between two variables by
representing individual data points as points on a two-dimensional graph. Each point on the
plot corresponds to a single data entry with values for both variables, making it a useful tool
for identifying patterns, trends, clusters, or outliers in data.
Bivariate Analysis
numerical vs numerical
Use case - Finding correlation
In [ ]: x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
In [ ]: # here x is assumed to be redefined as an array of 50 values in a cell not shown, e.g. x = np.linspace(-10,10,50)
y = 10*x + 3 + np.random.randint(0,300,50)
y
In [ ]: plt.scatter(x,y)
<matplotlib.collections.PathCollection at 0x264627ccc70>
Out[ ]:
In [ ]: import numpy as np
import pandas as pd
In [ ]: # marker
plt.scatter(df['avg'],df['strike_rate'],color='red',marker='+')
plt.title('Avg and SR analysis of Top 50 Batsman')
plt.xlabel('Average')
plt.ylabel('SR')
Scatter plots are particularly useful for visualizing the distribution of data, identifying
correlations or relationships between variables, and spotting outliers. You can adjust the
appearance and characteristics of the scatter plot to suit your needs, including marker size,
color, and transparency. This makes scatter plots a versatile tool for data exploration and
analysis.
Bar plot
A bar plot, also known as a bar chart or bar graph, is a type of data visualization that is used
to represent categorical data with rectangular bars. Each bar's length or height is
proportional to the value it represents. Bar plots are typically used to compare and display
the relative sizes or quantities of different categories or groups.
Bivariate Analysis
Numerical vs Categorical
Use case - Aggregate analysis of groups
In [ ]: # 'colors' and 'children' are assumed to be lists defined in a cell that is not shown here
plt.bar(colors,children,color='Purple')
In [ ]: plt.bar(np.arange(df.shape[0]) - 0.2,df['2015'],width=0.2,color='yellow')
plt.bar(np.arange(df.shape[0]),df['2016'],width=0.2,color='red')
plt.bar(np.arange(df.shape[0]) + 0.2,df['2017'],width=0.2,color='blue')
plt.xticks(np.arange(df.shape[0]), df['batsman'])
plt.show()
Bar plots are useful for comparing the values of different categories and for showing the
distribution of data within each category. They are commonly used in various fields,
including business, economics, and data analysis, to make comparisons and convey
information about categorical data. You can customize bar plots to make them more visually
appealing and informative.
Histogram
A histogram is a type of chart that shows the distribution of numerical data. It's a graphical
representation of data where data is grouped into continuous number ranges and each
range corresponds to a vertical bar. The horizontal axis displays the number range, and the
vertical axis (frequency) represents the amount of data that is present in each range.
A histogram is a set of rectangles with bases along with the intervals between class
boundaries and with areas proportional to frequencies in the corresponding classes. The x-
axis of the graph represents the class interval, and the y-axis shows the various frequencies
corresponding to different class intervals. A histogram is a type of data visualization used to
represent the distribution of a dataset, especially when dealing with continuous or numeric
data. It displays the frequency or count of data points falling into specific intervals or "bins"
along a continuous range. Histograms provide insights into the shape, central tendency, and
spread of a dataset.
Univariate Analysis
Numerical col
Use case - Frequency Count
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [ ]: # simple data
data = [32,45,56,10,15,27,61]
plt.hist(data,bins=[10,25,40,55,70])
In [ ]: # on some data
df = pd.read_csv('Data\Day48\Vk.csv')
df
0 12 62
1 17 28
2 20 64
3 27 0
4 30 10
136 624 75
138 632 54
139 633 0
140 636 54
In [ ]: plt.hist(df['batsman_runs'])
plt.show()
In [ ]: # handling bins
plt.hist(df['batsman_runs'],bins=[0,10,20,30,40,50,60,70,80,90,100,110,120],color='
plt.show()
Pie Chart
A pie chart is a circular graph that's divided into slices to illustrate numerical proportion. The
slices of the pie show the relative size of the data. The arc length of each slice, and
consequently its central angle and area, is proportional to the quantity it represents.
All slices of the pie add up to make the whole equaling 100 percent and 360 degrees. Pie
charts are often used to represent sample data. Each of these categories is represented as a
“slice of the pie”. The size of each slice is directly proportional to the number of data points
that belong to a particular category.
Univariate/Bivariate Analysis
Categorical vs numerical
Use case - To find contribution on a standard scale
In [ ]: # simple data
data = [23,45,100,20,49]
subjects = ['eng','science','maths','sst','hindi']
plt.pie(data,labels=subjects)
plt.show()
In [ ]: # dataset
df = pd.read_csv('Data\Day48\Gayle-175.csv')
df
0 AB de Villiers 31
1 CH Gayle 175
2 R Rampaul 0
3 SS Tiwary 2
4 TM Dilshan 33
5 V Kohli 11
In [ ]: plt.pie(df['batsman_runs'],labels=df['batsman'],autopct='%0.1f%%')
plt.show()
In [ ]: # explode shadow
plt.pie(df['batsman_runs'],labels=df['batsman'],autopct='%0.1f%%',explode=[0.3,0,0,
plt.show()
Advanced Matplotlib(part-1)
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Colored Scatterplots
In [ ]: iris = pd.read_csv('Data\Day49\iris.csv')
iris.sample(5)
In [ ]: iris['Species'] = iris['Species'].replace({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2})
iris.sample(5)
In [ ]: plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'])
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec76de6880>
Out[ ]:
In [ ]: # cmap
plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'],cmap='jet
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec75d12f10>
Out[ ]:
In [ ]: # alpha
plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'],cmap='jet
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec76e54790>
Out[ ]:
In [ ]: # plot size
plt.figure(figsize=(15,7))
plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'],cmap='jet
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec76f3bf40>
Out[ ]:
Annotations
In [ ]: batters = pd.read_csv('Data\Day49\Batter.csv')
In [ ]: sample_df = batters.head(100).sample(25,random_state=5)
In [ ]: sample_df
In [ ]: plt.figure(figsize=(18,10))
plt.scatter(sample_df['avg'],sample_df['strike_rate'],s=sample_df['runs'])
for i in range(sample_df.shape[0]):
plt.text(sample_df['avg'].values[i],sample_df['strike_rate'].values[i],sample_df[
In [ ]: x = [1,2,3,4]
y = [5,6,7,8]
plt.scatter(x,y)
plt.text(1,5,'Point 1')
plt.text(2,6,'Point 2')
plt.text(3,7,'Point 3')
plt.text(4,8,'Point 4',fontdict={'size':12,'color':'brown'})
plt.figure(figsize=(18,10))
plt.scatter(sample_df['avg'],sample_df['strike_rate'],s=sample_df['runs'])
plt.axvline(30,color='red')
for i in range(sample_df.shape[0]):
plt.text(sample_df['avg'].values[i],sample_df['strike_rate'].values[i],sample_df[
Subplots
In [ ]: # A diff way to plot graphs
batters.head()
In [ ]: plt.figure(figsize=(15,6))
plt.scatter(batters['avg'],batters['strike_rate'])
plt.title('Something')
plt.xlabel('Avg')
plt.ylabel('Strike Rate')
plt.show()
In [ ]: fig,ax = plt.subplots(figsize=(15,6))
ax.scatter(batters['avg'],batters['strike_rate'],color='red',marker='+')
ax.set_title('Something')
ax.set_xlabel('Avg')
ax.set_ylabel('Strike Rate')
fig.show()
In [ ]: fig, ax = plt.subplots(nrows=2,ncols=1,sharex=True,figsize=(10,6))
ax[0].scatter(batters['avg'],batters['strike_rate'],color='red')
ax[1].scatter(batters['avg'],batters['runs'])
ax[1].set_title('Avg Vs Runs')
ax[1].set_ylabel('Runs')
ax[1].set_xlabel('Avg')
Text(0.5, 0, 'Avg')
Out[ ]:
In [ ]: fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(10,10))
ax[0,0]
ax[0,1].scatter(batters['avg'],batters['runs'])
ax[1,0].hist(batters['avg'])
ax[1,1].hist(batters['runs'])
(array([499., 40., 19., 19., 9., 6., 4., 4., 3., 2.]),
Out[ ]:
array([ 0. , 663.4, 1326.8, 1990.2, 2653.6, 3317. , 3980.4, 4643.8,
5307.2, 5970.6, 6634. ]),
<BarContainer object of 10 artists>)
In [ ]: fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax1.scatter(batters['avg'],batters['strike_rate'],color='red')
ax2 = fig.add_subplot(2,2,2)
ax2.hist(batters['runs'])
ax3 = fig.add_subplot(2,2,3)
ax3.hist(batters['avg'])
(array([102., 125., 103., 82., 78., 43., 22., 14., 2., 1.]),
Out[ ]:
array([ 0. , 5.56666667, 11.13333333, 16.7 , 22.26666667,
27.83333333, 33.4 , 38.96666667, 44.53333333, 50.1 ,
55.66666667]),
<BarContainer object of 10 artists>)
Advanced Matplotlib(part-2)
3D scatter Plot
A 3D scatter plot is used to represent data points in a three-dimensional space.
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [ ]: batters = pd.read_csv('Data\Day49\Batter.csv')
batters.head()
In [ ]: fig = plt.figure()
ax = plt.subplot(projection='3d')
ax.scatter3D(batters['runs'],batters['avg'],batters['strike_rate'],marker='+')
ax.set_title('IPL batsman analysis')
ax.set_xlabel('Runs')
ax.set_ylabel('Avg')
ax.set_zlabel('SR')
Text(0.5, 0, 'SR')
Out[ ]:
In the example, you created a 3D scatter plot to analyze IPL batsmen based on runs,
average (avg), and strike rate (SR).
The ax.scatter3D function was used to create the plot, where the three variables were
mapped to the x, y, and z axes.
3D Line Plot
A 3D line plot represents data as a line in three-dimensional space.
In [ ]: x = [0,1,5,25]
y = [0,10,13,0]
z = [0,13,20,9]
fig = plt.figure()
ax = plt.subplot(projection='3d')
ax.scatter3D(x,y,z,s=[100,100,100,100])
ax.plot3D(x,y,z,color='red')
[<mpl_toolkits.mplot3d.art3d.Line3D at 0x23ec4988340>]
Out[ ]:
In the given example, you created a 3D line plot with three sets of data points
represented by lists x, y, and z.
The ax.plot3D function was used to create the line plot.
3D Surface Plots
3D surface plots are used to visualize functions of two variables as surfaces in three-
dimensional space.
In [ ]: x = np.linspace(-10,10,100)
y = np.linspace(-10,10,100)
In [ ]: xx, yy = np.meshgrid(x,y)
In [ ]: z = xx**2 + yy**2
z.shape
(100, 100)
Out[ ]:
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot(projection='3d')
p = ax.plot_surface(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec66ca9a0>
Out[ ]:
In [ ]: z = np.sin(xx) + np.cos(yy)
fig = plt.figure(figsize=(12,8))
ax = plt.subplot(projection='3d')
p = ax.plot_surface(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec33aa520>
Out[ ]:
Both surface plots were created with the ax.plot_surface function: in the first example you
plotted a paraboloid (z = xx**2 + yy**2), and in the second a surface built from sine and
cosine functions.
Contour Plots
Contour plots are used to visualize 3D data in 2D, representing data as contours on a
2D plane.
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot(projection='3d')
p = ax.plot_surface(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec616e0d0>
Out[ ]:
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot()
p = ax.contour(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec56e4be0>
Out[ ]:
In [ ]: z = np.sin(xx) + np.cos(yy)
fig = plt.figure(figsize=(12,8))
ax = plt.subplot()
p = ax.contourf(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec8865f40>
Out[ ]:
You created both filled contour plots (ax.contourf) and contour line plots (ax.contour) in 2D
space. These plots are useful for representing functions over a grid.
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot(projection='3d')
p = ax.plot_surface(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec7b7ca00>
Out[ ]:
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot()
p = ax.contour(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec7c698e0>
Out[ ]:
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot()
p = ax.contourf(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec7ed9700>
Out[ ]:
Heatmap
A heatmap is a graphical representation of data in a 2D grid, where individual values are
represented as colors.
In [ ]: delivery = pd.read_csv('Data\Day50\IPL_Ball_by_Ball_2008_2022.csv')
delivery.head()
Out[ ]:
                ID  innings  overs  ballnumber       batter          bowler  non-striker extra_type  batsman_run ...
        0  1312200        1      0           1  YBK Jaiswal  Mohammed Shami   JC Buttler        NaN            0
        1  1312200        1      0           2  YBK Jaiswal  Mohammed Shami   JC Buttler    legbyes            0
        2  1312200        1      0           3   JC Buttler  Mohammed Shami  YBK Jaiswal        NaN            1
        3  1312200        1      0           4  YBK Jaiswal  Mohammed Shami   JC Buttler        NaN            0
        4  1312200        1      0           5  YBK Jaiswal  Mohammed Shami   JC Buttler        NaN            0
In [ ]: # temp_df is assumed to be the deliveries filtered down to sixes, e.g. temp_df = delivery[delivery['batsman_run'] == 6]
grid = temp_df.pivot_table(index='overs',columns='ballnumber',values='batsman_run',aggfunc='count')
In [ ]: plt.figure(figsize=(20,10))
plt.imshow(grid)
plt.yticks(delivery['overs'].unique(), list(range(1,21)))
plt.xticks(np.arange(0,6), list(range(1,7)))
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x23ec9384820>
Out[ ]:
In the given example, we used the imshow function to create a heatmap of IPL
deliveries.
The grid represented ball-by-ball data with the number of sixes (batsman_run=6) in
each over and ball number.
Heatmaps are effective for visualizing patterns and trends in large datasets.
These techniques provide powerful tools for visualizing complex data in three dimensions
and for representing large datasets effectively. Each type of plot is suitable for different
types of data and can help in gaining insights from the data.