Capstone Notes-1
Windows User
5/1/2022
Table of Contents
1. Introduction of the business problem
a) Defining problem statement:
b) Need of the study/project:
c) Understanding business/social opportunity:
2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology:
b) Visual inspection of data (rows, columns, descriptive details):
c) Understanding of attributes (variable info, renaming if required):
3. Exploratory data analysis
a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for categorical ones):
Skewness Calculation
b) Bivariate analysis (relationship between different variables, correlations)
Multi-Variate Analysis:
Pair Plot:
c) Removal of unwanted variables (if applicable):
d) Missing Value treatment (if applicable):
e) Outlier Treatment (if applicable):
4. Business insights from EDA
a) Is the data unbalanced? If so, what can be done? Please explain in the context of the business
b) Any business insights using clustering (if applicable)
c) Any other business insights:
1. Introduction of the business problem
a) Defining problem statement:
BCCI has hired an external analytics consulting firm for data analytics. The major objective of this tie-up is to extract
actionable insights from the historical match data and make strategic changes that help India win. The primary objective
is to create machine learning models that correctly predict a win for the Indian cricket team. Once a model is
developed, actionable insights and recommendations are to be extracted from it.
In this project, an EDA is performed considering factors such as season, match type, and bowler and batsman
performances. Univariate, bivariate and multivariate analyses are performed on the preprocessed data.
Also, this model can be sold to gaming apps, which gives it a commercial advantage. With slight changes, the model
can also be used for other cricket formats such as IPL and CCL, and so capture that market.
2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology:
The dataset is provided in Excel format and is loaded into a Jupyter notebook for further analysis. The
dataset has 23 columns describing match-related factors that affect India's wins against its
opponents. The 'Result' column of the data set is the target variable used for model building.
b) Visual inspection of data (rows, columns, descriptive details):
The dataset has 2930 rows and 23 columns.
The info command shows the following data types: float (9 columns), int (4 columns) and object (10 columns).
A descriptive analysis is performed on the data set. Below are the insights.
The average age of the team is 30 and ranges from 12 to 70, which suggests there are outliers in the age column.
Bowlers_in_team has a mean of 3 and ranges from 1 to 5. Most of the time the team preferred 3 bowlers, and the win rate is high with that choice, so 3 bowlers looks like a good number for winning the match.
Wicket_Keeper_team is always 1, so this variable can be excluded from the analysis as it has no impact on the target variable.
All_rounder_in_team has a mean of 3 and ranges between 1 and 4.
Max_run_scored_1 over has a mean of 16 and ranges between 11 and 25.
Max_wicket_taken_1 over has a mean of 3 and ranges between 1 and 4.
Extra_bowls_bowled has a mean of 11 and ranges between 7 and 40.
Min_run_given_1 over has a mean of 2 and ranges between 0 and 4.
Min_run_scored_1 over has a mean of 2 and ranges between 1 and 4.
Max_run_given_1 over has a mean of 5 and ranges between 6 and 40.
Extra_bowls_opponent has a mean of 4 and ranges between 0 and 18.
Player_highest_run has a mean of 65 and ranges between 30 and 100.
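The summary statistics listed above, and the check that Wicket_Keeper_team never varies, can be reproduced with a few pandas calls. The following is a minimal sketch on a toy data frame: the rows are made-up stand-ins rather than values from the match data, and only the column names are taken from this report.

```python
import pandas as pd

# Toy stand-in for the match data; the values are illustrative only.
df = pd.DataFrame({
    "Avg_team_age": [30.0, 12.0, 70.0, 30.0],
    "Bowlers_in_team": [3, 1, 5, 3],
    "Wicket_Keeper_team": [1, 1, 1, 1],
    "Result": ["Win", "Loss", "Win", "Win"],
})

# describe() covers numeric columns by default; keep mean, min and max.
summary = df.describe().T[["mean", "min", "max"]]
print(summary)

# A column with a single distinct value cannot help separate Win from Loss.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df_reduced = df.drop(columns=constant_cols)
print("Dropped constant columns:", constant_cols)
```

Checking the number of distinct values, rather than hard-coding the column name, would also catch any other constant column in the full data set.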
Figure 5 – Unique Values of each column
Observations:
Match format type T20 is entered in two ways: T20 and 20-20
First_Selection has two entries for batting: Batting and Bat
Player_scored_Zero has two entries for the value three: 3 and Three
Player_Highest_Wicket has two entries for the value three: 3 and Three
These inconsistent entries need to be replaced with a single value in the data.
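One way to standardise the duplicate entries noted above is a per-column replacement map. This is a sketch under assumptions: Match_format is a hypothetical name for the match format column (the report does not give it), the rows are toy values, and the notebook may have used a different approach.

```python
import pandas as pd

# Toy rows showing the inconsistent spellings; not the real match data.
df = pd.DataFrame({
    "Match_format": ["T20", "20-20", "ODI"],          # hypothetical column name
    "First_Selection": ["Batting", "Bat", "Bowling"],
    "Player_scored_Zero": ["3", "Three", "1"],
    "Player_Highest_Wicket": ["Three", "3", "2"],
})

# Map each variant spelling to the single value that should be kept.
replacements = {
    "Match_format": {"20-20": "T20"},
    "First_Selection": {"Bat": "Batting"},
    "Player_scored_Zero": {"Three": "3"},
    "Player_Highest_Wicket": {"Three": "3"},
}

for col, mapping in replacements.items():
    # Series.replace matches whole values, so "Bat" does not touch "Batting".
    df[col] = df[col].replace(mapping)

# Once the text variants are gone, the count columns can be made numeric.
for col in ["Player_scored_Zero", "Player_Highest_Wicket"]:
    df[col] = pd.to_numeric(df[col])

print(df)
```

Keeping the mappings in one dictionary makes it easy to review every substitution and to rerun the check on unique values afterwards.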
Based on the data set, India won most of the matches against its opponents. We need to check the
class balance of the data before model prediction.
Figure 8 – Univariate Analysis for Float and Integer data types
Figure 9 – Univariate Analysis for Categorical Variables
A bar plot of Bowlers_in_team against Opponent is drawn, with Result as the hue.
A scatter plot is drawn between the variables Avg_team_age and All_round_in_team
Multi-Variate Analysis:
Also, a correlation plot is drawn for all the variables. From the plot following are the observations
Pair Plot:
A pair plot is also drawn to view the data distributions, which were already obtained from the skewness calculation. For
the complete picture, refer to the Jupyter notebook.
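The skewness calculation referred to here can be sketched with the pandas skew method. The values below are toy numbers chosen to show one symmetric and one right-skewed column; they are not taken from the match data.

```python
import pandas as pd

# Toy values: one symmetric column and one with a long right tail.
df = pd.DataFrame({
    "Bowlers_in_team": [1, 2, 3, 4, 5],
    "Extra_bowls_bowled": [7, 8, 9, 10, 40],
})

# Sample skewness per column: near 0 is symmetric, positive means a right tail.
skewness = df.skew()
print(skewness)
```

A strongly positive value, as for the second column here, corresponds to the long right tail that would also be visible on the diagonal of a pair plot.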
Figure 16 – Pair Plot on Data Set
Figure 18 – Null Values in the Data Set
The total percentage of missing values in the dataset is 1.17%, which is negligible. However, the Avg_team_age
column alone has 97 missing values, so deleting the rows with missing values is not a good choice. The null
values are imputed instead.
All categorical variables are imputed with their mode, and all non-categorical variables are imputed with their
median.
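The imputation rule above (mode for categorical columns, median for the rest) can be sketched as follows. The data frame is a toy example with made-up rows; only the rule itself comes from this report.

```python
import numpy as np
import pandas as pd

# Toy rows with gaps; not the real match data.
df = pd.DataFrame({
    "Avg_team_age": [30.0, np.nan, 28.0, 32.0, np.nan],
    "Opponent": ["Pakistan", "Bangladesh", None, "Bangladesh", "England"],
})

# Share of missing cells across the whole table, in percent.
missing_pct = df.isna().sum().sum() / df.size * 100
print(f"Missing values: {missing_pct:.2f} %")

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Numeric column: the median is robust to outliers such as extreme ages.
        df[col] = df[col].fillna(df[col].median())
    else:
        # Categorical column: use the most frequent value.
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

When a column has several equally frequent values, mode() returns them all in sorted order, so iloc[0] picks the first one; that tie-breaking choice is an assumption of this sketch.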
Figure 19 – Box plot on the data set
Figure 20 – Box plot for ‘Avg_team_Age’ column after outlier treatment
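The report does not state which outlier treatment was applied to the age column. A common choice is to cap values at the box-plot whiskers (1.5 times the interquartile range beyond the quartiles). The sketch below assumes that method and uses toy ages, so it illustrates the idea rather than reproducing the notebook's step.

```python
import pandas as pd

# Toy ages with one low and one high extreme; not the real data.
ages = pd.Series([12.0, 28.0, 29.0, 30.0, 30.0, 31.0, 32.0, 70.0], name="Avg_team_age")

# Whisker limits as used by a standard box plot (assumed treatment).
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the whiskers instead of dropping the rows.
capped = ages.clip(lower=lower, upper=upper)
print(f"limits: [{lower}, {upper}]")
print(capped.tolist())
```

Capping keeps every match in the data set, which matters here because the report avoids deleting rows elsewhere as well.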
From the result we can see that the ratio of 'Win' to 'Loss' is 84% to 16%.
A data set with a 70:30 ratio is considered well balanced. This data set is more skewed than that, but it is still considered good enough for modelling.
SMOTE is not applied, since the synthetic oversampling would partially bias the data.
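The class ratio quoted above is the kind of figure that value_counts returns directly. In this minimal sketch the 'Result' values are a toy series built to mirror the 84:16 split, not the actual match records.

```python
import pandas as pd

# Toy target column built to mirror the reported split.
result = pd.Series(["Win"] * 84 + ["Loss"] * 16, name="Result")

# Share of each class in percent.
class_share = result.value_counts(normalize=True) * 100
print(class_share.round(1))
```

Running the same call on the target column of the full data frame is a quick way to recheck the balance after any rows are added or removed.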
c) Any other business insights:
From the analysis done so far, the following insights are summarised:
As per the data, when India bowls 40 extra balls to the opponent, India loses the match.
India wins most of its matches.
If the opponent bowls more than 10 extra balls, there is a high chance that Team India will win.
If the opponent bowls more than 16 extra balls, India wins the match as per the data set.
India played most of its matches with an average team age of 30, and the highest number of wins is recorded at this average age.
Overall, India has its highest win rate against Bangladesh, followed by Pakistan, with three full-time bowlers.
India performs better on home pitches than on foreign tours.
The chance of winning the match is high when 5 wickets are taken by a single player in the match.
India does better in the rainy season than in other seasons.
The Indian team performs well against West Indies, Bangladesh, England and Pakistan.
The Indian team's win rate is lower against South Africa and Sri Lanka than against other opponents.
India has lost to Sri Lanka the majority of the time when the season is summer.
The Indian team won most of its matches when it opted to bowl first, in ODIs and in day matches.