IPL Data Analysis
2 1. Importing Libraries
[ ]: import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action = "ignore", category = FutureWarning)
plt.style.use('dark_background')
3 2. Datasets
3.1 2.1 Deliveries Data
[ ]: deliveries=pd.read_csv("deliveries.csv")
deliveries.head()
[5 rows x 21 columns]
team2 toss_winner toss_decision \
0 Royal Challengers Bangalore Royal Challengers Bangalore field
1 Kings XI Punjab Chennai Super Kings bat
2 Delhi Daredevils Rajasthan Royals bat
3 Royal Challengers Bangalore Mumbai Indians bat
4 Kolkata Knight Riders Deccan Chargers bat
3.2.1 Add team score and team extra columns for each match, each inning.
[ ]: team_score = deliveries.groupby(['match_id', 'inning'])['total_runs'].sum().unstack().reset_index()
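The groupby-then-unstack step is easier to follow on a tiny example. The sketch below uses made-up rows, and the output column names `inning1_runs` and `inning2_runs` are hypothetical labels chosen for illustration, not names taken from the notebook.

```python
import pandas as pd

# Made-up ball-by-ball rows; only the column names follow the deliveries data.
toy = pd.DataFrame({
    "match_id":   [1, 1, 1, 1, 2, 2, 2],
    "inning":     [1, 1, 2, 2, 1, 2, 2],
    "total_runs": [4, 1, 6, 0, 2, 1, 1],
})

# Sum runs per (match, inning), then pivot the innings into columns.
toy_score = (
    toy.groupby(["match_id", "inning"])["total_runs"]
    .sum()
    .unstack()
    .reset_index()
)

# Hypothetical, more readable column names.
toy_score.columns = ["match_id", "inning1_runs", "inning2_runs"]
print(toy_score)
```

Each match ends up as one row with one score column per inning, which is the shape needed to merge the scores onto a per-match table.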
cols = ['match_id', 'season','city','date','team1','team2', 'toss_winner', 'toss_decision', 'result', 'dl_applied', 'winner',
matches_agg = matches_agg[cols]
matches_agg.head(2)
umpire2 umpire3
0 RE Koertzen NaN
1 SL Shastri NaN
[2 rows x 27 columns]
batsmen = batsman_grp["batsman_runs"].sum().reset_index()
right_on=["match_id", "inning", "batsman"], how="left")
batsmen.head(2)
3.2.3 Bowler Aggregates
del bowlers["bye_runs"]
del bowlers["legbye_runs"]
del bowlers["total_runs"]
dismissals = deliveries[deliveries["dismissal_kind"].isin(dismissal_kinds_for_bowler)]
bowlers["wickets"] = bowlers["wickets"].fillna(0)
bowlers.head(2)
1 2 0 24 2 0.0 24.0
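As a rough illustration of the wicket count, the sketch below filters dismissals to the kinds credited to a bowler, counts them per bowler, and fills in zero for bowlers without a wicket. The list `credited_kinds` is an assumption made here for illustration (run outs and similar dismissals are left out); the notebook's own `dismissal_kinds_for_bowler` list is not shown in this section. The rows are made up.

```python
import pandas as pd

# Assumption: dismissal kinds that count towards the bowler's wickets.
credited_kinds = ["bowled", "caught", "lbw", "stumped",
                  "caught and bowled", "hit wicket"]

# Made-up deliveries; None means no dismissal on that ball.
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 1],
    "inning": [1, 1, 1, 1, 1],
    "bowler": ["A", "A", "B", "B", "C"],
    "dismissal_kind": ["bowled", None, "run out", "caught", None],
})

# One row per bowler per inning, whether or not they took a wicket.
all_bowlers = toy[["match_id", "inning", "bowler"]].drop_duplicates()

# Keep only the dismissals credited to the bowler and count them.
credited = toy[toy["dismissal_kind"].isin(credited_kinds)]
wickets = (
    credited.groupby(["match_id", "inning", "bowler"])
    .size()
    .reset_index(name="wickets")
)

# A left merge keeps wicketless bowlers; their missing count becomes 0.
toy_bowlers = all_bowlers.merge(
    wickets, on=["match_id", "inning", "bowler"], how="left"
)
toy_bowlers["wickets"] = toy_bowlers["wickets"].fillna(0)
print(toy_bowlers)
```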
• Team wins in home city vs other cities: Each team plays two matches against every
other team, one in its home city and the other in the opponent's home city.
It would be interesting to see whether playing in the home city increases a team's
chances of a win.
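One possible way to split wins into home and away is sketched below. The `home_city` mapping, the team names and the city names are all hypothetical; a real mapping would have to be built by hand for each franchise.

```python
import pandas as pd

# Hypothetical mapping from team to home city.
home_city = {"A": "CityA", "B": "CityB"}

# Made-up match results.
toy = pd.DataFrame({
    "city":   ["CityA", "CityB", "CityA", "CityB", "CityA"],
    "winner": ["A", "A", "B", "B", "A"],
})

# 1 when the winner played in its own home city, else 0.
toy["home_win"] = (toy["winner"].map(home_city) == toy["city"]).astype(int)

# Count home wins and total wins per team; away wins are the difference.
wins = toy.groupby("winner")["home_win"].agg(home_wins="sum", total_wins="size")
wins["away_wins"] = wins["total_wins"] - wins["home_wins"]
print(wins)
```

Raw counts alone do not settle the question, since a fair comparison would also need the number of home and away matches each team played.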
sns.set_palette("Paired", len(matches_agg['city'].unique()))
plot.set_xlabel("Teams")
plot.set_ylabel("No of wins")
plot.legend(loc='best', prop={'size':8})
x+=1
3.2.4 Plot the performance of top 5 batsmen over seasons
Virat Kohli shows a steady improvement over the seasons, while CH Gayle and SK Raina show a slump.
batsman_runsperseason = batsman_runsperseason.groupby(['season', 'batsman'])['batsman_runs'].sum().unstack().T
batsman_runsperseason = batsman_runsperseason.sort_values(by = 'Total', ascending = False).drop('Total', axis = 1)
ax = batsman_runsperseason[:5].T.plot()
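The ranking step can be sketched on made-up data: pivot runs into a batsman-by-season table, add a `Total` column, sort by it, and then drop it so that only the per-season columns are left for plotting. The batsman names and run values are invented for illustration.

```python
import pandas as pd

# Made-up per-delivery runs for three hypothetical batsmen.
toy = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009, 2009],
    "batsman": ["X", "Y", "X", "Y", "Z"],
    "batsman_runs": [30, 50, 40, 10, 5],
})

# Rows are batsmen, columns are seasons.
per_season = (
    toy.groupby(["season", "batsman"])["batsman_runs"].sum().unstack().T
)

# Rank by career total, then remove the helper column again.
per_season["Total"] = per_season.sum(axis=1)
top = per_season.sort_values(by="Total", ascending=False).drop(columns="Total")

# The first N rows are the top N run scorers; here N = 2.
print(top[:2])
```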
3.2.5 Percentage of total runs scored through boundaries for each batsman
The average for top batsmen is around 58-60%, with the exception of CH Gayle at 76%. Interestingly,
MS Dhoni, who is known for helicopter shots (6s), gets close to 45% of his runs through singles.
[ ]: <AxesSubplot: xlabel='batsman'>
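The boundary percentage can be computed roughly as follows. This is a sketch on made-up data, and it assumes that every delivery on which the batsman scored exactly 4 or 6 counts as a boundary.

```python
import pandas as pd

# Made-up deliveries for two hypothetical batsmen.
toy = pd.DataFrame({
    "batsman": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "batsman_runs": [4, 6, 1, 1, 4, 2, 2],
})

# Total runs per batsman.
total = toy.groupby("batsman")["batsman_runs"].sum()

# Runs from deliveries worth exactly 4 or 6 (assumed to be boundaries).
boundary = (
    toy[toy["batsman_runs"].isin([4, 6])]
    .groupby("batsman")["batsman_runs"]
    .sum()
)

# Share of runs from boundaries; batsmen with no boundaries get 0.
boundary_pct = (boundary / total * 100).fillna(0).round(1)
print(boundary_pct)
```

Here X scores 10 of 12 runs in boundaries and Y scores 4 of 8.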
3.2.6 Performance of top bowlers over seasons
Malinga is the highest wicket-taker in the IPL so far.
[ ]: bowlers_wickets = bowlers.groupby(['bowler'])['wickets'].sum()
bowlers_wickets.sort_values(ascending = False, inplace = True)
bowlers_wickets[:10].plot(x= 'bowler', y = 'runs', kind = 'barh', colormap = 'Accent')
[ ]: <AxesSubplot: ylabel='bowler'>
3.2.7 Extra runs conceded by bowlers
bowlers_extras['Total'] = bowlers_extras.sum(axis=1)
#bowlers_extras.sort_values('Total', ascending = False, inplace = True)
bowlers_extras.head()
[ ]: season 2008 2009 2010 2011 2012 2013 2014 2015 2016 Total
bowler
A Ashish Reddy NaN NaN NaN NaN 8.0 1.0 NaN 1.0 0.0 10.0
A Chandila NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN 0.0
A Flintoff NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0
A Kumble 10.0 7.0 14.0 NaN NaN NaN NaN NaN NaN 31.0
A Mishra 4.0 9.0 11.0 4.0 8.0 4.0 5.0 9.0 12.0 66.0
[ ]: matches['player_of_match'].value_counts()[:10].plot(kind = 'bar')
[ ]: <AxesSubplot: >
4 Analysis On IPL Data 2008-2020
[ ]: deliveries2=pd.read_csv("/home/blackheart/Documents/DATA SCIENCE/PROJECT/IPL Analysis/IPL Ball-by-Ball 2008-2020.csv")
deliveries2.head()
3 1 0 1 0 0
4 1 0 1 0 0
bowling_team
0 Royal Challengers Bangalore
1 Royal Challengers Bangalore
2 Royal Challengers Bangalore
3 Royal Challengers Bangalore
4 Royal Challengers Bangalore
[ ]: matches2=pd.read_csv("/home/blackheart/Documents/DATA SCIENCE/PROJECT/IPL Analysis/IPL Matches 2008-2020.csv")
matches2.head()
venue neutral_venue \
0 M Chinnaswamy Stadium 0
1 Punjab Cricket Association Stadium, Mohali 0
2 Feroz Shah Kotla 0
3 Wankhede Stadium 0
4 Eden Gardens 0
team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
4 Deccan Chargers bat Kolkata Knight Riders
• In December 2018, Delhi Daredevils changed their name to Delhi Capitals. Sunrisers
Hyderabad replaced Deccan Chargers in 2012 and debuted in 2013. In this analysis
I treat each of these pairs as a single team. Let's start with some data
preparation:
y = ['SRH','MI','GL','RPS','RCB','KKR','DC','KXIP','CSK','RR','SRH','KTK','PW','RPS','DC']
matches2.replace(x,y,inplace = True)
deliveries2.replace(x,y,inplace = True)
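The abbreviation list contains `'SRH'` and `'DC'` twice, which fits the decision to fold renamed or replaced franchises into one label. The same idea can be written with a dictionary, which keeps each full name next to its abbreviation. The sketch below is illustrative only: the `abbreviations` mapping is a hypothetical subset, not the notebook's full list of names, and the rows are made up.

```python
import pandas as pd

# Hypothetical subset of the mapping; extend it to cover every team name.
abbreviations = {
    "Deccan Chargers": "SRH",
    "Sunrisers Hyderabad": "SRH",
    "Delhi Daredevils": "DC",
    "Delhi Capitals": "DC",
    "Mumbai Indians": "MI",
}

# Made-up match rows.
toy = pd.DataFrame({
    "team1": ["Deccan Chargers", "Delhi Capitals"],
    "team2": ["Mumbai Indians", "Sunrisers Hyderabad"],
    "winner": ["Mumbai Indians", "Sunrisers Hyderabad"],
})

# A flat dict replaces matching values in every column of the frame.
toy = toy.replace(abbreviations)
print(toy)
```

A dictionary also avoids the risk of the two parallel lists drifting out of step when a team is added.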
• Let’s start by looking at the number of matches played in every season of the
IPL:
[ ]: d=matches2['date'].str[:4].astype(int)
plt.hist(d,edgecolor='red')
plt.title("Matches in Every Season",color='blue',weight='bold')
plt.show()
• The year 2013 has the most matches, possibly due to super overs. Also, there
were 10 teams in 2011 and 9 in 2012 and 2013, which is another reason for the
increase in the number of matches.
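Because `plt.hist` uses 10 bins by default, a histogram over 13 seasons can merge neighbouring years into one bar. A per-season count is a quick cross-check. The sketch below uses made-up dates and assumes, as the cell above does, that the `date` string starts with the four-digit year.

```python
import pandas as pd

# Made-up match dates in year-first format.
toy = pd.DataFrame({
    "date": ["2008-04-18", "2008-04-19", "2009-04-18",
             "2010-03-12", "2010-03-13", "2010-03-14"],
})

# Extract the year and count matches per season, ordered by year.
season = toy["date"].str[:4].astype(int)
per_season = season.value_counts().sort_index()
print(per_season)

# Season with the most matches in this toy data.
print(per_season.idxmax())
```

Calling `per_season.plot(kind='bar')` on the result would then draw exactly one bar per season.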
[ ]: matches_played=pd.concat([matches2['team1'],matches2['team2']])
matches_played=matches_played.value_counts().reset_index()
matches_played.columns=['Team','Total Matches']
# Match win counts to teams by name rather than by row position
matches_played['wins']=matches_played['Team'].map(matches2['winner'].value_counts())
matches_played.set_index('Team',inplace=True)
totm = matches_played.reset_index().head(8)
totm
5 KXIP 190 88
6 CSK 178 86
7 RR 161 81
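Win counts have to be paired with match counts by team name, since the two rankings are generally in different orders. The sketch below builds both counts on made-up results and lets pandas align them on the team index. The `win_pct` column name is introduced here for illustration.

```python
import pandas as pd

# Made-up results for three hypothetical teams.
toy = pd.DataFrame({
    "team1":  ["A", "A", "B", "A", "C"],
    "team2":  ["B", "C", "C", "B", "B"],
    "winner": ["B", "C", "B", "B", "C"],
})

# Matches played: appearances as either team1 or team2.
played = pd.concat([toy["team1"], toy["team2"]]).value_counts()
wins = toy["winner"].value_counts()

# Building the frame from two Series aligns them on the team name.
summary = pd.DataFrame({"Total Matches": played, "wins": wins})
summary["wins"] = summary["wins"].fillna(0).astype(int)  # teams with no wins
summary["win_pct"] = (summary["wins"] / summary["Total Matches"] * 100).round(1)

# Sort so that the head of the table really is the best win percentage.
summary = summary.sort_values("win_pct", ascending=False)
print(summary)
```

Sorting by `win_pct` before taking the first rows matters, because the first rows of a table ordered by matches played are not necessarily the teams with the best win rate.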
trace2 = go.Bar(x=matches_played.index,y=matches_played['wins'],
name='Matches Won',marker=dict(color='red'),opacity=0.4)
trace3 = go.Bar(x=matches_played.index,
y=(round(matches_played['wins']/matches_played['Total Matches'],3)*100),
name='Win Percentage',opacity=0.6,marker=dict(color='gold'))
yaxis=dict(title='Count'),bargap=0.2,bargroupgap=0.1, plot_bgcolor='rgb(245,245,245)')
• So MI, SRH and RCB are the top three teams with the highest winning percentage.
Let’s look at the winning percentage of these three teams:
[ ]: win_percentage = round(matches_played['wins']/matches_played['Total Matches'],3)*100
win_percentage.head(3)
[ ]: Team
MI 59.1
SRH 53.3
RCB 50.8
dtype: float64
• Now let’s have a look at the most preferred decision taken by teams after winning the toss:
[ ]: x = matches2["toss_decision"].value_counts()
#y = matches2["toss_decision"].value_counts().values
plt.pie(x)
[ ]: ([<matplotlib.patches.Wedge at 0x7f6cd4ba1610>,
<matplotlib.patches.Wedge at 0x7f6cd40c7a10>],
[Text(-0.3655903556118915, 1.0374698510721025, ''),
Text(0.3655904527468272, -1.037469816843059, '')])
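The pie above is drawn without labels, so it does not show which wedge is which. Printing the shares directly answers the question; the sketch below does that on made-up toss decisions.

```python
import pandas as pd

# Made-up toss decisions.
toss = pd.Series(["field", "bat", "field", "field", "bat"],
                 name="toss_decision")

# Percentage share of each decision, largest first.
share = toss.value_counts(normalize=True).mul(100).round(1)
print(share)

# The most frequent decision is the first index entry.
print(share.index[0])
```

The index of the counts (`x.index` in the cell above) could also be passed to `plt.pie` through its `labels` argument so that the wedges are named.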
[ ]: high_scores=deliveries2.groupby(['id', 'inning','batting_team','bowling_team'])['total_runs'].sum().reset_index()
high_scores=high_scores[high_scores['total_runs']>=200]
hss = high_scores.nlargest(10,'total_runs')
trace = go.Table(
header=dict(values=["Inning","Batting Team","Bowling Team", "Total Runs"],
fill = dict(color = 'red'),
font = dict(color = 'white', size = 14),
align = ['center'],
height = 30),
cells=dict(values=[hss['inning'], hss['batting_team'], hss['bowling_team'], hss['total_runs']],
layout = dict(
width=830,
height=410,
autosize=False,
title='Highest scores of IPL',
showlegend=False,
)
[ ]: hss
5 Reference
• Aman Khawal
• Kaggle
6 Thank You