Capstone Notes-1

Uploaded by

ANIL
Copyright
© © All Rights Reserved

Cricket Win Prediction –

Capstone Project Notes-1

This document summarizes the Exploratory Data Analysis (EDA) of the given cricket data set. The data set contains
data on matches that the Indian team has played in the past.

Windows User
5/1/2022
Table of Contents
1. Introduction of the business problem
a) Defining problem statement
b) Need of the study/project
c) Understanding business/social opportunity
2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required)
3. Exploratory data analysis
a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for
categorical ones)
Skewness Calculation
b) Bivariate analysis (relationship between different variables, correlations)
Multi-Variate Analysis
Pair Plot
c) Removal of unwanted variables (if applicable)
d) Missing Value treatment (if applicable)
e) Outlier Treatment (if applicable)
4. Business insights from EDA
a) Is the data unbalanced? If so, what can be done? Please explain in the context of the business
b) Any business insights using clustering (if applicable)
c) Any other business insights

List of Figures:

Figure 1 – Loading Data Set
Figure 2 – Shape of the Data
Figure 3 – Info of the Data Types
Figure 4 – Descriptive Stats of the Data Types
Figure 5 – Unique Values of each column
Figure 6 – Unique Values of each column after modification
Figure 7 – Count Plot on Target Variable
Figure 8 – Univariate Analysis for Float and Integer data types
Figure 9 – Univariate Analysis for Categorical Variables
Figure 10 – Univariate Analysis with Player Highest Run
Figure 11 – Skewness Calculation
Figure 12 – Bar plot on Opponent and Bowlers_in_team
Figure 13 – Scatter plot on Avg_team_age and All_rounder_in_team
Figure 14 – Violin plot on Opponent and All_rounder_in_team
Figure 15 – Heat Map
Figure 16 – Pair Plot on Data Set
Figure 17 – Data Set after dropping the Game_Number Column
Figure 18 – Null Values in the Data Set
Figure 19 – Box plot on the data set
Figure 20 – Box plot for ‘Avg_team_Age’ column after outlier treatment
Figure 21 – Data Imbalance Check

1. Introduction of the business problem
a) Defining problem statement:
BCCI has hired an external analytics consulting firm for data analytics. The major objective of this tie-up is to extract
actionable insights from the historical match data and make strategic changes that help India win. The primary objective
is to create machine learning models that correctly predict a win for the Indian cricket team. Once a model is
developed, actionable insights and recommendations have to be extracted from it.

b) Need of the study/project:


BCCI needs an algorithm that predicts the winning strategy of Team India in upcoming matches. In addition, the
available data set is used to analyze India's past performance against its opponents and to suggest insights to BCCI,
considering all the impacting factors.

In this project, an EDA is performed considering factors such as season, match type, and bowler and batsman
performance. Univariate, bivariate and multivariate analyses are performed on the preprocessed data.

c) Understanding business/social opportunity:


The main intention of building this model is to find a strategy that helps Team India win, so that BCCI gains
financially and can attract more sponsors. If India wins most of its matches with the help of this model, the team
will receive more funding, which can be a boost for the team to set new records.

The model could also be sold to gaming apps, which gives it a commercial advantage. With slight changes, it could
also be used for other cricket formats such as IPL and CCL, and so capture that market.

2. Data Report
a) Understanding how data was collected in terms of time, frequency and
methodology:
The dataset is provided in Excel format and is loaded into a Jupyter notebook for further analysis. The
dataset has 23 columns, containing match-related factors that may affect India’s chances of winning against its
opponents. The ‘Result’ column of the data set is the target variable, which will be used for model building.

Figure 1 - Loading Data Set

b) Visual inspection of data (rows, columns, descriptive details):
The dataset has 2930 rows and 23 columns.

Figure 2 – Shape of the Data

As per the info() output, the columns have the following data types: float – 9, int – 4, object – 10.

Figure 3 – Info of the Data Types
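
The shape and data-type checks above can be reproduced with a few pandas calls. The following is a minimal sketch on a small made-up frame, not the code from the project notebook; the column names are taken from this report and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the match data. Column names follow this report;
# the values are invented purely for illustration.
df = pd.DataFrame({
    "Avg_team_Age": [30.0, 29.5, 31.0],
    "Bowlers_in_team": [3, 4, 3],
    "Opponent": ["Pakistan", "England", "Srilanka"],
    "Result": ["Win", "Loss", "Win"],
})

# shape returns (number of rows, number of columns)
n_rows, n_cols = df.shape

# Count how many columns share each data type (float, int, text)
dtype_counts = df.dtypes.astype(str).value_counts()

print(n_rows, n_cols)
print(dtype_counts)
```

On the real data set the same two calls would be expected to report 2930 rows, 23 columns and the float/int/object split listed above.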

A descriptive analysis is performed on the data set. Below are the insights.

• The average age of the team is 30 and ranges from 12 to 70, which suggests there are outliers in the age column.
• Bowlers_in_team has a mean of 3 and ranges from 1 to 5. Most of the time the team preferred 3 bowlers, and the
winning rate with 3 bowlers is high, so 3 bowlers appears to be a good number for winning the match.
• Wicket_keeper_in_team is always 1, so we can exclude this variable from our analysis as it has no impact
on the target variable.
• All_rounder_in_team has a mean of 3 and ranges between 1 and 4.
• Max_run_scored_1Over has a mean of 16 and ranges between 11 and 25.
• Max_wicket_taken_1Over has a mean of 3 and ranges between 1 and 4.
• Extra_bowls_bowled has a mean of 11 and ranges between 7 and 40.
• Min_run_given_1Over has a mean of 2 and ranges between 0 and 4.
• Min_run_scored_1Over has a mean of 2 and ranges between 1 and 4.
• Max_run_given_1Over has a mean of 5 and ranges between 6 and 40.
• Extra_bowls_opponent has a mean of 4 and ranges between 0 and 18.
• Player_highest_run has a mean of 65 and ranges between 30 and 100.

Figure 4 – Descriptive Stats of the Data Types

c) Understanding of attributes (variable info, renaming if required):


The data dictionary provided with the data set helps in understanding the variables. A for loop iterates through the
columns and collects the unique values of each column.

The unique values are shown in the figure below.

Figure 5 – Unique Values of each column

Observations:

• The T20 match format is entered in two ways: T20 and 20-20
• First_Selection has two entries for batting: Batting and Bat
• Player_scored_Zero has two entries for three players: 3 and Three
• Player_Highest_Wicket has two entries for three: 3 and Three

These inconsistent entries need to be replaced with a single value each. The data after this processing is shown below.

Figure 6 – Unique Values of each column after modification
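
The inspection and clean-up described above can be sketched as follows. This is an illustrative example on made-up rows, assuming the inconsistent labels are stored as text; `Match_format` is a hypothetical column name, while the other column names are written as they appear in this report.

```python
import pandas as pd

# Toy rows that reproduce the inconsistent labels listed in the observations.
# "Match_format" is a hypothetical column name used only for this sketch.
df = pd.DataFrame({
    "Match_format": ["T20", "20-20", "ODI", "T20"],
    "First_Selection": ["Batting", "Bat", "Bowling", "Batting"],
    "Player_scored_Zero": ["3", "Three", "1", "2"],
    "Player_Highest_Wicket": ["Three", "3", "4", "2"],
})

# Print the unique values of every column to spot duplicate spellings
for col in df.columns:
    print(col, df[col].unique())

# Map each inconsistent label to one canonical value, column by column
replacements = {
    "Match_format": {"20-20": "T20"},
    "First_Selection": {"Bat": "Batting"},
    "Player_scored_Zero": {"Three": "3"},
    "Player_Highest_Wicket": {"Three": "3"},
}
cleaned = df.replace(replacements)

for col in cleaned.columns:
    print(col, cleaned[col].unique())
```

A nested dictionary keeps each replacement limited to its own column, so a label such as "Three" is only changed where it is known to be a duplicate.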

3. Exploratory data analysis


a) Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones):
In this section we perform univariate analysis on the complete data set, starting with the target variable.

Figure 7 – Count Plot on Target Variable

• Based on the data set, India won most of its matches against its opponents. We need to check the
class balance of the data before model building.

Figure 8 – Univariate Analysis for Float and Integer data types

Figure 9 – Univariate Analysis for Categorical Variables

Figure 10 – Univariate Analysis with Player Highest Run


Skewness Calculation:

The skewness of each numeric variable in the data set is calculated.

Figure 11 – Skewness Calculation

• Most of the variables are not evenly distributed; their values are spread irregularly.
• The data set appears to need a normalization treatment before further analysis, as the skewness values are
on the higher side.
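
As an illustration of the skewness check, the sketch below applies the pandas `skew()` method to two made-up columns. It assumes the notebook used a comparable call; the values are invented so that one column has a long right tail and the other is symmetric.

```python
import pandas as pd

# Invented values: one column with a long right tail, one symmetric column
df = pd.DataFrame({
    "Extra_bowls_bowled": [7, 8, 9, 10, 11, 12, 40],
    "Avg_team_Age": [28, 29, 30, 30, 30, 31, 32],
})

# Sample skewness per numeric column: values near 0 indicate a symmetric
# distribution, large positive values indicate a long right tail
skewness = df.skew(numeric_only=True)
print(skewness)
```

Columns whose skewness is far from zero are the candidates for the normalization treatment mentioned above.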

b) Bivariate analysis (relationship between different variables, correlations)


This section deals with the Bivariate Analysis of the data.

A bar plot is drawn between the Opponent and Bowlers_in_team variables, with Result as the hue.

Figure 12 – Bar plot on Opponent and Bowlers_in_team
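
A numeric counterpart to this kind of plot is a cross-tabulation of the opponent against the result. The sketch below is an assumption-laden illustration on invented rows, not the notebook code; it only shows how a win rate per opponent could be computed with `pd.crosstab`.

```python
import pandas as pd

# Invented match rows; opponent names follow this report
df = pd.DataFrame({
    "Opponent": ["Bangladesh", "Bangladesh", "Pakistan",
                 "Pakistan", "Srilanka", "Srilanka"],
    "Result": ["Win", "Win", "Win", "Loss", "Loss", "Loss"],
})

# normalize="index" turns the counts in each row into shares that sum to 1,
# so the "Win" column is the win rate against each opponent
shares = pd.crosstab(df["Opponent"], df["Result"], normalize="index")
win_rate = shares["Win"]
print(win_rate)
```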

A scatter plot is drawn between the variables Avg_team_age and All_rounder_in_team.

Figure 13 – Scatter plot on Avg_team_age and All_rounder_in_team

A violin plot is drawn between the variables Opponent and All_rounder_in_team.

Figure 14 – Violin plot on Opponent and All_rounder_in_team

Multi-Variate Analysis:
In addition, a correlation plot is drawn for all the variables. The observations from the plot are as follows:

• A correlation of 0.57 exists between Audience and extra_bowls_bowled
• A correlation of 0.62 exists between Max_run_given_1Over and extra_bowls_bowled
• A correlation of 0.65 exists between Max_run_given_1Over and extra_bowls_Opponent
• The rest of the variables have very weak correlations. Therefore there is little redundancy among the
features, and multicollinearity is not a concern
• Surprisingly, Avg_team_age and All_rounder_in_team have ‘Zero’ correlation. This indicates that the number of
all-rounders in the team does not vary linearly with the average age of the team.

Figure 15 – Heat Map
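
The correlation values behind a heat map like this can be inspected directly. The following sketch computes a Pearson correlation matrix on invented numbers and lists the pairs above a threshold; the 0.5 cut-off and the data are assumptions made for the example, and only the column names come from this report.

```python
from itertools import combinations

import pandas as pd

# Invented values; the first two columns are built to move together
df = pd.DataFrame({
    "Max_run_given_1Over": [6, 8, 10, 12, 14, 16],
    "Extra_bowls_bowled": [7, 9, 10, 13, 15, 18],
    "Avg_team_Age": [30, 29, 31, 30, 29, 31],
})

# Pearson correlation between every pair of numeric columns
corr = df.corr(numeric_only=True)

# Keep each pair once and report those with |r| >= 0.5 (assumed threshold)
strong_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) >= 0.5
]
print(strong_pairs)
```

The `corr` matrix is what a heat map visualizes; the list of pairs simply makes the stronger relationships easier to read off.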

Pair Plot:
A pair plot is also drawn to view the data distributions, which are consistent with the skewness calculation. For the
complete picture, refer to the Jupyter notebook.

Figure 16 – Pair Plot on Data Set

c) Removal of unwanted variables (if applicable):


Based on the analysis so far, the columns ‘Game_Number’ and ‘Wicket_keeper_in_team’ are
unwanted variables and can be removed before further modelling.

Figure 17 – Data Set after dropping the Game_Number Column
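
Dropping these columns can be sketched as below. The example is illustrative only: it uses invented rows, detects constant columns with `nunique()`, and removes them together with the identifier column named in this report.

```python
import pandas as pd

# Invented rows; column names are written as they appear in this report
df = pd.DataFrame({
    "Game_Number": ["Game_1", "Game_2", "Game_3"],
    "Wicket_keeper_in_team": [1, 1, 1],
    "Bowlers_in_team": [3, 4, 3],
    "Result": ["Win", "Loss", "Win"],
})

# A column with a single unique value carries no information for a model
constant_cols = [col for col in df.columns if df[col].nunique() == 1]

# Drop the match identifier together with the constant columns
df_model = df.drop(columns=["Game_Number"] + constant_cols)
print(list(df_model.columns))
```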

d) Missing Value treatment (if applicable):


The data set has many null values, which are found using the isnull() function. There are a total of 789 null values
out of 67390 entries.

Figure 18 – Null Values in the Data Set

The total percentage of missing values in the dataset is 1.17%, which is negligible. However, the Avg_team_age
column alone has 97 missing values, so deleting the rows with missing values is not a good choice. Instead, we
impute the null values.

All categorical variables are imputed with the mode, and non-categorical variables are imputed with the median.
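
The missing-value percentage and the mode/median imputation can be sketched as follows. The first lines only redo the arithmetic from the figures quoted in this report (789 nulls, 2930 rows, 23 columns); the rest runs on invented rows and is an assumption about how the imputation could be written, not the notebook code.

```python
import numpy as np
import pandas as pd

# Arithmetic check using the figures quoted in this report
total_entries = 2930 * 23                         # 67390 cells
report_pct = round(100 * 789 / total_entries, 2)  # share of null cells in %

# Invented rows with one missing numeric and one missing categorical value
df = pd.DataFrame({
    "Avg_team_Age": [30.0, np.nan, 29.0, 31.0, 30.0],
    "Opponent": ["Pakistan", "England", None, "Pakistan", "Srilanka"],
})

total_missing = int(df.isnull().sum().sum())
pct_missing = 100 * total_missing / df.size

# Median for numeric columns, mode (most frequent value) for the others
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(report_pct, total_missing, pct_missing)
print(df)
```

The median is less sensitive than the mean to extreme values such as the age outliers noted earlier, which is one reason it is a common choice for numeric imputation.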

e) Outlier Treatment (if applicable):


A box plot is a convenient way to check for outliers. Hence a box plot is drawn for all the numeric variables.

Figure 19 – Box plot on the data set

• Variables like Avg_team_age, player_highest_run, Max_run_given_1Over, Extra_bowls_bowled
and Audience number have outliers.
• After analyzing the features, only the Avg_team_age and Audience variables require outlier
treatment, and of these the Audience variable is of no importance in model building.
• Therefore, only ‘Avg_team_age’ is treated for outliers.
• The remaining variables contain values that appear as outliers in the spread, but such values are
all possible in cricket, so treating them blindly would distort the data.
• An outlier function that calculates the IQR values is used to treat the outliers.

Figure 20 – Box plot for ‘Avg_team_Age’ column after outlier treatment
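
The report does not state whether the outlier function removes or caps the extreme values. The sketch below assumes capping at the usual 1.5 × IQR fences; `cap_outliers_iqr` is a hypothetical helper name and the ages are invented, chosen to span the 12 to 70 range mentioned earlier.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest fence."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower=lower, upper=upper)


# Invented ages with one very low and one very high value
ages = pd.Series([12, 28, 29, 30, 30, 30, 31, 32, 70], name="Avg_team_Age")
capped = cap_outliers_iqr(ages)

print(capped.tolist())
```

Capping keeps every row in the data set, which matters here because deleting rows was already ruled out during missing-value treatment.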

4. Business insights from EDA


a) Is the data unbalanced? If so, what can be done? Please explain in the context of the
business
To check for data imbalance, value_counts() is applied to the target variable ‘Result’. The following figure
gives the result.

Figure 21 – Data Imbalance Check

• From the result we can see that the ratio of ‘Win’ to ‘Loss’ is 84% to 16%.
• A data set with a 70:30 ratio is considered well balanced. This data set is more imbalanced than that, but it is
still considered good to go for modeling. SMOTE techniques are not applied, since oversampling would partially bias the data.
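
The imbalance check can be sketched with `value_counts()` as below. The series is made up so that its proportions mirror the 84% and 16% quoted above; the actual counts in the data set are not reproduced here.

```python
import pandas as pd

# Invented target column whose proportions mirror the 84:16 split above
result = pd.Series(["Win"] * 84 + ["Loss"] * 16, name="Result")

# Absolute counts per class
counts = result.value_counts()

# normalize=True returns fractions; multiply by 100 for percentages
shares = (result.value_counts(normalize=True) * 100).round(1)

print(counts)
print(shares)
```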

b) Any business insights using clustering (if applicable)


No clustering technique is applied in the analysis

c) Any other business insights:

From the analysis done so far, the following insights are summarized:

• As per the data, when India bowls 40 extra balls to the opponent, India loses the match
• Overall, India has a high success rate
• If the opponent bowls more than 10 extra balls, there is a high chance that Team India will win
• If the opponent bowls more than 16 extra balls, India wins the match as per the data set
• India played most of its matches with an average team age of 30, and the highest number of wins is recorded at this
average age
• Overall, India has its highest winning rate against Bangladesh, followed by Pakistan, with three full-time
bowlers
• India performs better on home pitches than on foreign tours
• The chances of winning the match are high when 5 wickets are taken by a single player in the match
• India does well in the rainy season compared to other seasons
• The Indian team performs well against West Indies, Bangladesh, England and Pakistan
• The Indian team's win rate is lower against South Africa and Srilanka than against other opponents
• India has lost to Srilanka in the majority of matches played in the summer season
• The Indian team won most of its matches when it opted to bowl first, in ODIs and in day matches