Capstone Notes-2
Windows User
5/14/2022
Table of Contents
1. Introduction of the business problem
a) Defining problem statement
b) Need of the study/project
c) Understanding business/social opportunity
2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required)
3. Exploratory data analysis
a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for categorical ones)
Skewness Calculation
b) Bivariate analysis (relationship between different variables, correlations)
Multi-Variate Analysis
Pair Plot
c) Removal of unwanted variables (if applicable)
d) Missing Value treatment (if applicable)
e) Outlier Treatment (if applicable)
4. Business insights from EDA
a) Is the data unbalanced? If so, what can be done? Please explain in the context of the business
b) Any business insights using clustering (if applicable)
c) Any other business insights
List of Figures:
1. Model building and interpretation
a) Build various models (You can choose to build models for either or all of descriptive,
predictive or prescriptive purposes):
PRECAP:
a) Continuing from Notes-1, these notes build a model that predicts the performance of Team India against
its opponents.
b) Based on the EDA, the unwanted variables ‘Game_Number’, ‘Wicket_Keeper’ and ‘Audience_Number’ are
removed. (The box plots and the EDA showed that audience number has no considerable impact on the
result.)
c) In this section we build four models (Decision Tree, Random Forest, ANN and Logistic Regression, the last
with both sklearn and statsmodels) and select the best model based on the model metrics.
d) All ‘object’ variables are encoded using one-hot encoding, and the target variable is encoded using a
label encoder.
e) For model building, the data is split into train and test sets in the ratio 70:30.
Imp Note: The model is built on the full data set, without splitting it by match format type.
One of the problem statements asks for a winning strategy for Team India against Australia in T20,
but the source data set has no records of India playing Australia in T20, so splitting the data set
by format would not give accurate results. The model is therefore built without splitting the data.
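The encoding and split described in d) and e) above can be sketched as follows. This is a minimal sketch on made-up rows, not the notebook code: the values, the random_state and the variable names X and y are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; column names mirror the report, values are made up.
df = pd.DataFrame({
    "Opponent": ["Australia", "Pakistan", "England", "Pakistan", "Australia",
                 "England", "Pakistan", "Australia", "England", "Pakistan"],
    "Bowlers_in_team": [3, 4, 3, 2, 3, 3, 4, 3, 2, 3],
    "Result": ["Win", "Win", "Loss", "Win", "Win",
               "Win", "Loss", "Win", "Win", "Win"],
})

# One-hot encode the text predictors; numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns="Result"))

# Label-encode the target: classes are sorted, so 'Loss' -> 0 and 'Win' -> 1.
y = LabelEncoder().fit_transform(df["Result"])

# 70:30 train/test split (random_state is an assumption, for repeatability).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(X.columns.tolist())
print(len(X_train), len(X_test))
```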
The best model was selected through parameter tuning. The best parameters are an L2 penalty, the saga solver
and a tolerance of 1e-05, and the predictions are made using this model.
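A parameter search of this kind could look like the sketch below, assuming scikit-learn's GridSearchCV was used (the report does not name the search method). The synthetic data, the grid values, cv=3 and max_iter are assumptions; only the L2 penalty, the saga solver and the 1e-05 tolerance come from the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded match data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid; it only needs to contain the combination the report selected.
param_grid = {
    "penalty": ["l2"],
    "solver": ["saga", "lbfgs"],
    "tol": [1e-4, 1e-5],
}

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# best_estimator_ is refit on all the data passed to fit().
best_model = search.best_estimator_
print(search.best_params_)
```

On the real data, best_params_ would be compared with the values shown in the figure below.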
Figure 2 – Logistic Model – Best model Parameters
Classification Report:
The model has an accuracy of 87% on the train data, with precision = 0.88, recall = 0.98 and f1 = 0.9.
The AUC of the model on the train data is 84.36%. Below is the ROC curve for the train data.
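The metrics quoted in the classification reports can be computed as in the sketch below. The labels and probabilities are made up for illustration (they are not the model's outputs), and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up ground truth (1 = Win, 0 = Loss) and predicted win probabilities.
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.65, 0.2, 0.4, 0.95]

# Class predictions at an assumed threshold of 0.5.
y_pred = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# AUC is computed from the probabilities, not from the thresholded labels.
auc = roc_auc_score(y_true, y_prob)

print(accuracy, precision, recall, f1, auc)
```

F1 is the harmonic mean of precision and recall, 2PR / (P + R), which is why a model with very high recall and slightly lower precision still reports a high F1.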
b) Test your predictive model against the test set using various appropriate
performance metrics
Classification Report:
The model has an accuracy of 87% on the test data, with precision = 0.89, recall = 0.97 and f1 = 0.93.
The AUC of the model on the test data is 84.32%. Below is the ROC curve for the test data.
c) Interpretation of the model(s):
Based on the train and test metrics, the model looks stable, with an accuracy of 87% on both sets, so it does not
appear to overfit or underfit. Precision is also high, at about 88%. The model therefore seems good, but we can
cross-check these metrics by building some more models.
'min_samples_split': [150,300,450],
After multiple iterations, the best parameters for the model were identified; they are shown below.
The model has an accuracy of 85% on the train data, with precision = 0.87, recall = 0.97 and f1 = 0.91.
The AUC of the model on the train data is 78.70%. Below is the ROC curve for the train data.
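The search over 'min_samples_split' shown above could be run as in the following sketch, assuming a scikit-learn decision tree tuned with GridSearchCV. Only the min_samples_split values come from the report; the synthetic data, cv=3 and random_state are assumptions, and the other grid entries are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded match data (assumption).
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

# Only this grid entry appears in the report.
param_grid = {"min_samples_split": [150, 300, 450]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best_tree = search.best_estimator_
print(search.best_params_)
```

Large min_samples_split values such as these stop the tree from splitting small nodes, which limits its depth and helps keep train and test accuracy close.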
Classification Report:
The model has an accuracy of 84% on the test data, with precision = 0.86, recall = 0.97 and f1 = 0.91.
Figure 15 – Decision Tree Model – Classification Report of Test Data
The AUC of the model on the test data is 75.40%. Below is the ROC curve for the test data.
Performance of the Random Forest Model on the Train Data:
Confusion Matrix:
Classification Report:
The model has an accuracy of 84% on the train data, with precision = 0.84, recall = 1 and f1 = 0.91.
The AUC of the model on the train data is 84.98%. Below is the ROC curve for the train data.
Classification Report:
The model has an accuracy of 83% on the test data, with precision = 0.83, recall = 1 and f1 = 0.91.
The AUC of the model on the test data is 83.11%. Below is the ROC curve for the test data.
Classification Report:
The model has an accuracy of 87% on the train data, with precision = 0.88, recall = 0.98 and f1 = 0.93.
The AUC of the model on the train data is 84.30%. Below is the ROC curve for the train data.
Performance of the ANN Model on the Test Data:
Confusion Matrix:
Classification Report:
The model has an accuracy of 84% on the test data, with precision = 0.89, recall = 0.97 and f1 = 0.93.
The AUC of the model on the test data is 84.07%. Below is the ROC curve for the test data.
c) Interpretation of the most optimum model and its implication on the business
1. Data Report
a) Understanding how data was collected in terms of time, frequency and
methodology:
The data set is provided in Excel format and is loaded into a Jupyter notebook for further analysis. It has
23 columns covering match-related factors that affect India’s chances of winning against its opponents.
The ‘Result’ column of the data set is the target variable used for model building.
The info command shows the following data types: float – 9, int – 4, object – 10.
Figure 33 – Info of the Data Types
A descriptive analysis is performed on the data set. The insights are listed below.
The average age of the team is 30, with values ranging from 12 to 70, which suggests outliers in the age column.
Bowlers_in_team has a mean of 3 and ranges from 1 to 5. The team most often fielded 3 bowlers, and the
win rate is high in those matches, so 3 bowlers looks like a good number for winning the match.
Wicket_Keeper_team is always 1, so this variable can be excluded from the analysis as it has no impact
on the target variable.
All_rounder_in_team has a mean of 3 and ranges between 1 and 4.
Max_run_scored_1 over has a mean of 16 and ranges between 11 and 25.
Max_wicket_taken_1 over has a mean of 3 and ranges between 1 and 4.
Extra_bowls_bowled has a mean of 11 and ranges between 7 and 40.
Min_run_given_1 over has a mean of 2 and ranges between 0 and 4.
Min_run_scored_1 over has a mean of 2 and ranges between 1 and 4.
Max_run_given_1 over has a mean of 5 and ranges between 6 and 40.
Extra_bowls_opponent has a mean of 4 and ranges between 0 and 18.
Player_highest_run has a mean of 65 and ranges between 30 and 100.
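The summary above can be reproduced with a few pandas calls, sketched below on made-up rows. The column names follow the report's spelling where it is given; the values are assumptions chosen to match the quoted ranges.

```python
import pandas as pd

# Made-up rows; only the min/max of the age column mirror the report.
df = pd.DataFrame({
    "Avg_team_Age": [30, 29, 31, 12, 70, 30],
    "Bowlers_in_team": [3, 3, 4, 2, 3, 5],
    "Wicket_keeper_in_team": [1, 1, 1, 1, 1, 1],
})

# Mean and range per numeric column, one row per variable.
summary = df.describe().T[["mean", "min", "max"]]

# A column with a single distinct value carries no information about the target.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]

print(summary)
print(constant_cols)
```

Listing constant columns this way gives a direct check for variables such as the wicket-keeper count that can be dropped before modelling.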
Figure 34 – Descriptive Stats of the Data Types
Observations:
Match format type T20 is entered in two ways: T20 and 20-20
First_Selection has two entries for batting: Batting and Bat
Player_scored_Zero has two entries for three players: 3 and Three
Player_Highest_Wicket has two entries for three: 3 and Three
These inconsistent entries need to be replaced with a single value in the data. The data after processing is
shown below.
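One way to make these replacements is a per-column mapping passed to pandas' DataFrame.replace, as in the sketch below. The rows are made up, and 'Match_format' is a hypothetical column name (the report does not give the exact one).

```python
import pandas as pd

# Made-up rows showing the inconsistent entries described above.
df = pd.DataFrame({
    "Match_format": ["T20", "20-20", "ODI", "T20"],
    "First_Selection": ["Batting", "Bat", "Bowling", "Bat"],
    "Player_scored_Zero": ["3", "Three", "1", "2"],
    "Player_Highest_Wicket": ["Three", "3", "4", "2"],
})

# Map each variant spelling to a single canonical value, column by column.
replacements = {
    "Match_format": {"20-20": "T20"},
    "First_Selection": {"Bat": "Batting"},
    "Player_scored_Zero": {"Three": "3"},
    "Player_Highest_Wicket": {"Three": "3"},
}
clean = df.replace(replacements)

# Once the text variants are gone, the count columns can become integers.
count_cols = ["Player_scored_Zero", "Player_Highest_Wicket"]
clean[count_cols] = clean[count_cols].astype(int)

print(clean)
```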
In this data set, India won most of its matches against its opponents, so the class balance of the data needs
to be checked before model prediction.
Figure 38 – Univariate Analysis for Float and Integer data types
Figure 39 – Univariate Analysis for Categorical Variables
Figure 40 – Univariate Analysis with Player Highest Run
Skewness Calculation:
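Skewness per numeric column can be computed with pandas' DataFrame.skew, as in the sketch below. The values are made up; only the column names and the 7 to 40 range of Extra_bowls_bowled echo the report.

```python
import pandas as pd

# Made-up values: one column with a long right tail, one symmetric column.
df = pd.DataFrame({
    "Extra_bowls_bowled": [7, 8, 9, 10, 11, 12, 40],
    "Bowlers_in_team": [1, 2, 3, 3, 3, 4, 5],
})

# Positive skew = long right tail, negative = long left tail, ~0 = symmetric.
skewness = df.skew()
print(skewness)
```

A single extreme value such as 40 extra balls is enough to push the skewness well above 1, which is the pattern the box plots later flag as outliers.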
b) Bivariate analysis (relationship between different variables, correlations)
This section deals with the Bivariate Analysis of the data.
A bar plot is drawn of Opponent against Bowlers_in_team, with Result as the hue.
A violin plot is drawn between the variables Opponent and All_rounder_in_team.
Multi-Variate Analysis:
A correlation plot is also drawn for all the variables. The observations from the plot are as follows:
Figure 45 – Heat Map
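The matrix behind a heat map like this can be obtained as sketched below. The rows are made up, and restricting the calculation to numeric columns is an assumption about how the notebook handled the object columns.

```python
import pandas as pd

# Made-up rows: two numeric columns that move together, one that moves opposite,
# and one text column that must be left out of the correlation.
df = pd.DataFrame({
    "Max_run_scored": [11, 14, 17, 20, 23],
    "Player_highest_run": [30, 45, 60, 75, 90],
    "Extra_bowls_bowled": [20, 17, 14, 11, 8],
    "Opponent": ["Pakistan", "England", "Pakistan", "Australia", "England"],
})

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting square matrix is what a plotting call such as seaborn's heatmap would take as input.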
Pair Plot:
A pair plot is also drawn to view the data distributions, which the skewness calculation already described. For
the complete picture, refer to the Jupyter notebook.
c) Removal of unwanted variables (if applicable):
From the analysis so far, the columns ‘Game_Number’ and ‘Wicket_keeper_in_team’ are unwanted variables and
can be removed before further modelling.
d) Missing Value treatment (if applicable):
The total percentage of missing values in the data set is 1.17%, which is negligible. However, the Avg_team_age
column alone has 97 missing values, so deleting the rows with missing values is not a good choice. The null
values are imputed instead.
All categorical variables are imputed with their mode, and non-categorical variables are imputed with their
median.
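The imputation rule above (mode for categorical columns, median for the rest) can be sketched as follows. The rows are made up and much smaller than the real data, so the missing percentage printed here will not match the 1.17% in the report.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in one numeric and one categorical column.
df = pd.DataFrame({
    "Avg_team_Age": [30.0, np.nan, 29.0, 31.0, np.nan],
    "Opponent": ["Pakistan", None, "Pakistan", "England", "Pakistan"],
})

# Share of missing cells across the whole table, in percent.
missing_pct = df.isna().sum().sum() / df.size * 100

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Numeric column: fill with the median, which is robust to outliers.
        df[col] = df[col].fillna(df[col].median())
    else:
        # Categorical column: fill with the most frequent value.
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(missing_pct)
print(df)
```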
e) Outlier Treatment (if applicable):
A box plot is a convenient way to check for outliers, so box plots are drawn for all the numeric variables.
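A box plot flags points beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule numerically and then caps the flagged values at the whisker limits. Capping is an assumption: the report does not state which treatment was applied, and the ages are made up.

```python
import pandas as pd

# Made-up ages; 12 and 70 echo the extremes reported for the age column.
ages = pd.Series([12, 28, 29, 30, 30, 31, 32, 70], name="Avg_team_Age")

# Quartiles and the box-plot whisker limits.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points a box plot would draw as outliers.
outliers = ages[(ages < lower) | (ages > upper)]

# One possible treatment (assumption): cap values at the whisker limits.
capped = ages.clip(lower=lower, upper=upper)

print(lower, upper)
print(outliers.tolist())
```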
Figure 50 – Box plot for ‘Avg_team_Age’ column after outlier treatment
From the result we can see that the ratio of ‘Win’ to ‘Loss’ is 84% to 16%.
A data set with a 70:30 ratio is usually regarded as well balanced. At 84:16 this data is more skewed, but it is
still usable for modelling.
SMOTE is not applied, since synthetic oversampling would partially bias the data.
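The class ratio can be checked with a normalised value count, as in this sketch. The 100-row series is made up to reproduce the 84:16 ratio quoted above.

```python
import pandas as pd

# Made-up target column with the same Win/Loss ratio as reported.
result = pd.Series(["Win"] * 84 + ["Loss"] * 16, name="Result")

# Share of each class, largest first.
balance = result.value_counts(normalize=True)
print(balance)
```

With a ratio like this, a model that always predicts ‘Win’ would already score 84% accuracy, so precision, recall and AUC are worth reading alongside accuracy.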
c) Any other business insights:
From the analysis done so far, the following insights are captured:
As per the data, India lost the match whenever it bowled 40 extra balls to the opponent
India wins most of its matches
If the opponent bowls more than 10 extra balls, there is a high chance that Team India will win
As per the data set, India wins the match whenever the opponent bowls more than 16 extra balls
India played most of its matches with an average team age of 30, and the highest number of wins is
recorded at this average age
Overall, India has the highest win rate against Bangladesh, followed by Pakistan, when playing three
full-time bowlers
India performs better on home pitches than on foreign tours
The chances of winning the match are high when a single player takes 5 wickets in the match
India does better in the rainy season than in other seasons
The Indian team performs well against West Indies, Bangladesh, England and Pakistan
The Indian team’s win rate is lower against South Africa and Sri Lanka than against other opponents
Most of India’s losses to Sri Lanka came in the summer season
The Indian team won most of its matches when it opted to bowl first, in ODIs and in day matches
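Opponent-level insights like the ones above come from grouped win rates. The sketch below shows the calculation on made-up rows; the numbers are not the report's results.

```python
import pandas as pd

# Made-up match results, two per opponent.
df = pd.DataFrame({
    "Opponent": ["Bangladesh", "Bangladesh", "Pakistan",
                 "Pakistan", "Sri Lanka", "Sri Lanka"],
    "Result": ["Win", "Win", "Win", "Loss", "Loss", "Loss"],
})

# Share of matches won against each opponent, highest first.
win_rate = (
    df["Result"].eq("Win")
    .groupby(df["Opponent"])
    .mean()
    .sort_values(ascending=False)
)
print(win_rate)
```

The same pattern works for season, venue or first selection by swapping the grouping column.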