Capstone Notes-2

Uploaded by ANIL

Cricket Win Prediction –

Capstone Project Notes-2

This document helps in building a strategy based on the models built on the data set.

Windows User
5/14/2022
Table of Contents
1. Introduction of the business problem
a) Defining problem statement
b) Need of the study/project
c) Understanding business/social opportunity
2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required)
3. Exploratory data analysis
a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for categorical ones)
Skewness Calculation
b) Bivariate analysis (relationship between different variables, correlations)
Multi-Variate Analysis
Pair Plot
c) Removal of unwanted variables (if applicable)
d) Missing Value treatment (if applicable)
e) Outlier Treatment (if applicable)
4. Business insights from EDA
a) Is the data unbalanced? If so, what can be done? Please explain in the context of the business
b) Any business insights using clustering (if applicable)
c) Any other business insights

List of Figures:

Figure 1 – Loading Data Set
Figure 2 – Shape of the Data
Figure 3 – Info of the Data Types
Figure 4 – Descriptive Stats of the Data Types
Figure 5 – Unique Values of each column
Figure 6 – Unique Values of each column after modification
Figure 7 – Count Plot on Target Variable
Figure 8 – Univariate Analysis for Float and Integer data types
Figure 9 – Univariate Analysis for Categorical Variables
Figure 10 – Univariate Analysis with Player Highest Run
Figure 11 – Skewness Calculation
Figure 12 – Bar plot on Opponent and Bowlers_in_team
Figure 13 – Scatter plot on Avg_team_age and All_round_in_team
Figure 14 – Violin plot on Opponent and All_rounder_in_team
Figure 15 – Heat Map
Figure 16 – Pair Plot on Data Set
Figure 17 – Data Set after dropping the Game_Number Column
Figure 18 – Null Values in the Data Set
Figure 19 – Box plot on the data set
Figure 20 – Box plot for ‘Avg_team_Age’ column after outlier treatment
Figure 21 – Data Imbalance Check

1. Model building and interpretation
a) Build various models (You can choose to build models for either or all of descriptive,
predictive or prescriptive purposes):
PRECAP:

a) In continuation with Notes-1, in these notes we will create a model that predicts the performance of Team
India against its opponents.
b) Based on the inputs from the EDA performed, it was decided to remove unwanted variables like "Game_Number",
"Wicket_Keeper" and "Audience_Number" (based on the box plot and EDA analysis, we found that the
audience number has no considerable impact on the Result).
c) In this section we will build four models, "Decision Tree, Random Forest, ANN and Logistic Regression (both
sklearn and statsmodels)", and will evaluate the best model based on the model metrics.
d) All the 'object' variables are encoded using the one-hot encoding method and the target variable is encoded using
the label-encoder method.
e) For the model building, a train-test split is performed in the ratio of 70:30.
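The encoding and split described in (d) and (e) can be sketched roughly as below. The toy frame and its column names are invented for illustration and are not the actual data set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-frame standing in for the match data set;
# the real column names may differ.
df = pd.DataFrame({
    "Opponent": ["Australia", "England", "Pakistan", "Australia",
                 "England", "Pakistan", "Australia", "England",
                 "Pakistan", "Australia"],
    "Bowlers_in_team": [3, 4, 3, 2, 3, 4, 3, 2, 3, 4],
    "Result": ["Win", "Loss", "Win", "Win", "Loss",
               "Win", "Win", "Loss", "Win", "Win"],
})

# One-hot encode the object predictors, label-encode the target
X = pd.get_dummies(df.drop(columns="Result"), drop_first=True)
y = LabelEncoder().fit_transform(df["Result"])  # Loss -> 0, Win -> 1

# 70:30 train-test split as described in (e)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape)
```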

Imp Note: I have built the model on the data without splitting the dataset based on the match format type.
This is because one of the problem statements asks for the winning strategy of Team India
against Australia in T20, but as per the source data set we do not have any records of India playing
Australia in the T20 format, so splitting the data set format-wise will not give accurate results. Hence,
the model is built without splitting the data.

a) Logistic Regression Model using Sklearn:


A Logistic Regression model is built on the train data. The following parameter grid is used:

'penalty': ['elasticnet', 'l2', 'none'], 'solver': ['newton-cg', 'saga'], 'tol': [0.001, 0.00001]

A GridSearchCV method is applied to find the best model.

Figure 1 – Logistic Model – “GridSearchCV” Method

Using the best_params_ attribute, the best model is identified. The best parameters are penalty = 'l2', solver = 'saga' and a
tolerance of 1e-05, and the prediction is made using this model.
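A minimal sketch of this grid search on synthetic stand-in data. Note the grid here is restricted to penalty/solver combinations that both listed solvers actually support (for example, 'elasticnet' works only with 'saga'), so it is not a faithful reproduction of the full grid in the notes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded match features
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Grid mirroring the notes, trimmed to valid combinations
param_grid = {
    "penalty": ["l2"],
    "solver": ["newton-cg", "saga"],
    "tol": [0.001, 0.00001],
}
grid = GridSearchCV(LogisticRegression(max_iter=5000),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)
```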

Figure 2 – Logistic Model – Best model Parameters

Performance of the Logistic Model on the Train Data:


Confusion Matrix:

Figure 3 – Logistic Model – Confusion Matrix of Train Data

Classification Report:

The model has an accuracy of 87% on the train data. Correspondingly, precision = 0.88, recall = 0.98, f1 = 0.90

Figure 4 – Logistic Model – Classification Report of Train Data

AUC and ROC Curve:

The AUC of the model on the train data is 84.36%. Below is the ROC curve of the train data.

Figure 5 – Logistic Model – ROC Curve of Train Data
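The train-side metrics reported above (confusion matrix, classification report, AUC, ROC points) can be computed as sketched below. The data here is synthetic, not the project's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
pred = model.predict(X_train)
prob = model.predict_proba(X_train)[:, 1]   # scores for the positive class

print(confusion_matrix(y_train, pred))      # rows: actual, cols: predicted
print(classification_report(y_train, pred)) # precision / recall / f1
print("AUC:", roc_auc_score(y_train, prob))
fpr, tpr, _ = roc_curve(y_train, prob)      # points for the ROC plot
```

The same calls, with `X_test`/`y_test` substituted, give the test-side metrics in the next section.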

b) Test your predictive model against the test set using various appropriate
performance metrics

Performance of the Logistic Model on the Test Data:


Confusion Matrix:

Figure 6 – Logistic Model – Confusion Matrix of Test Data

Classification Report:

The model has an accuracy of 87% on the test data. Correspondingly, precision = 0.89, recall = 0.97, f1 = 0.93

Figure 7 – Logistic Model – Classification Report of Test Data

AUC and ROC Curve:

The AUC of the model on the test data is 84.32%. Below is the ROC curve of the test data.

Figure 8 – Logistic Model – ROC Curve of Test Data

c) Interpretation of the model(s):
Based on the metrics from the train and test data, the model looks stable with an accuracy of 87%, so it does not
appear to be overfit or underfit. Even the precision values are high at 88%. Hence the model seems good, but we can
cross-validate these metrics by building a few more models.

2). Model Tuning and business implication


a) and b) Ensemble modeling, wherever applicable:
To validate the logistic model, three more models have been built and their metrics are compared. The three
models are:

(a) Decision Tree Model


(b) Random Forest Model
(c) Artificial Neural Network (ANN) Model

i) Decision Tree Model:


A Decision Tree model is built on the train data. The following parameter grid is used:

'criterion': ['gini'], 'max_depth': [10, 20, 30, 50], 'min_samples_leaf': [50, 100, 150], 'min_samples_split': [150, 300, 450]

A GridSearchCV method is applied to find the best model,

Figure 9 – Decision Tree Model – “GridSearchCV” Method

After multiple iterations, the best parameters are identified to build the model; they are shown below.

Figure 10 – Decision Tree Model – Best model Parameters
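A hedged sketch of the tree's grid search using the parameter grid quoted above, run on synthetic data rather than the match data set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# The same grid the notes report for the tree
param_grid = {
    "criterion": ["gini"],
    "max_depth": [10, 20, 30, 50],
    "min_samples_leaf": [50, 100, 150],
    "min_samples_split": [150, 300, 450],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```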

Performance of the Decision Tree Model on the Train Data:


Confusion Matrix:

Figure 11 – Decision Tree Model – Confusion Matrix of Train Data


Classification Report:

The model has an accuracy of 85% on the train data. Correspondingly, precision = 0.87, recall = 0.97, f1 = 0.91

Figure 12 – Decision Tree Model – Classification Report of Train Data

AUC and ROC Curve:

The AUC of the model on the train data is 78.70%. Below is the ROC curve of the train data.

Figure 13 – Decision Tree Model – ROC Curve of Train Data

Performance of the Decision Tree Model on the Test Data:


Confusion Matrix:

Figure 14 – Decision Tree Model – Confusion Matrix of Test Data

Classification Report:

The model has an accuracy of 84% on the test data. Correspondingly, precision = 0.86, recall = 0.97, f1 = 0.91

Figure 15 – Decision Tree Model – Classification Report of Test Data

AUC and ROC Curve:

The AUC of the model on the test data is 75.40%. Below is the ROC curve of the test data.

Figure 16 – Decision Tree Model – ROC Curve of Test Data

ii) Random Forest Model:


A Random Forest model is built on the train data. The following parameter grid is used:

'max_depth': [4, 5], 'max_features': [2, 3], 'min_samples_leaf': [8, 9], 'min_samples_split': [46, 50], 'n_estimators': [290]

A GridSearchCV method is applied to find the best model.

Figure 17 – Random Forest Model – “GridSearchCV” Method
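The random-forest search can be sketched as below on synthetic data; n_estimators is reduced from 290 to 50 purely so the sketch runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Grid from the notes, with a smaller n_estimators for runtime
param_grid = {
    "max_depth": [4, 5],
    "max_features": [2, 3],
    "min_samples_leaf": [8, 9],
    "min_samples_split": [46, 50],
    "n_estimators": [50],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```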

Performance of the Random Forest Model on the Train Data:
Confusion Matrix:

Figure 18 – Random Forest Model – Confusion Matrix of Train Data

Classification Report:

The model has an accuracy of 84% on the train data. Correspondingly, precision = 0.84, recall = 1.00, f1 = 0.91

Figure 19 – Random Forest Model – Classification Report of Train Data

AUC and ROC Curve:

The AUC of the model on the train data is 84.98%. Below is the ROC curve of the train data.

Figure 20 – Random Forest Model – ROC Curve of Train Data

Performance of the Random Forest on the Test Data:


Confusion Matrix:

Figure 21 – Random Forest Model – Confusion Matrix of Test Data

Classification Report:

The model has an accuracy of 83% on the test data. Correspondingly, precision = 0.83, recall = 1.00, f1 = 0.91

Figure 22 – Random Forest Model – Classification Report of Test Data

AUC and ROC Curve:

The AUC of the model on the test data is 83.11%. Below is the ROC curve of the test data.

Figure 23 – Random Forest Model – ROC Curve of Test Data

iii) ANN Model:


An ANN model is built on the train data. The following parameter grid is used:

'hidden_layer_sizes': [50, 100, 200], 'max_iter': [2500, 3000, 4000], 'solver': ['adam'], 'tol': [0.01]

A GridSearchCV method is applied to find the best model.

Figure 24 – ANN Model – “GridSearchCV” Method
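A rough sketch of the ANN search with sklearn's MLPClassifier on synthetic, standardized data. The grid is trimmed (two layer sizes, one max_iter) for runtime and is not the full grid from the notes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Neural networks are scale-sensitive, so standardize the features first
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)

param_grid = {
    "hidden_layer_sizes": [(50,), (100,)],
    "max_iter": [2500],
    "solver": ["adam"],
    "tol": [0.01],
}
grid = GridSearchCV(MLPClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train_s, y_train)
print(grid.best_params_)
```

At predict time the test features must go through the same fitted scaler (`scaler.transform(X_test)`).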

Performance of the ANN Model on the Train Data:


Confusion Matrix:
Figure 25 – ANN Model – Confusion Matrix of Train Data

Classification Report:

The model has an accuracy of 87% on the train data. Correspondingly, precision = 0.88, recall = 0.98, f1 = 0.93

Figure 26 – ANN Model – Classification Report of Train Data

AUC and ROC Curve:

The AUC of the model on the train data is 84.30%. Below is the ROC curve of the train data.

Figure 27 – ANN Model – ROC Curve of Train Data

Performance of the ANN Model on the Test Data:
Confusion Matrix:

Figure 28 – ANN Model – Confusion Matrix of Test Data

Classification Report:

The model has an accuracy of 84% on the test data. Correspondingly, precision = 0.89, recall = 0.97, f1 = 0.93

Figure 29 – ANN Model – Classification Report of Test Data

AUC and ROC Curve:

The AUC of the model on the test data is 84.07%. Below is the ROC curve of the test data.

Figure 30 – ANN Model – ROC Curve of Test Data

c) Interpretation of the most optimum model and its implication on the business

1. Data Report
a) Understanding how data was collected in terms of time, frequency and
methodology:
The dataset is provided in Excel format and is loaded into a Jupyter notebook for further analysis. The
dataset has 23 columns, with variables related to the match that affect India's win against its
opponents. The 'Result' column of the data set is the target variable used for model building.

Figure 31 - Loading Data Set

b) Visual inspection of data (rows, columns, descriptive details):


The dataset has 2930 rows and 23 columns.

Figure 32 – Shape of the Data

As per the info command, the data types are: float – 9, int – 4, object – 10.

Figure 33 – Info of the Data Types

A descriptive analysis is performed on the data set. Below are the insights.

 Average age of the team is 30 and ranges from 12 to 70; it seems there are outliers in the age column
 Bowlers_in_team has a mean of 3 and ranges from 1 to 5. Most of the time the team preferred 3
bowlers, and the winning rate with 3 bowlers is high, so 3 bowlers is a good number to win
the match
 Wicket_Keeper_team is always 1, so we can exclude this variable from our analysis as it has no impact
on the target variable
 All_rounder_in_team has an average of 3 and ranges between 1 and 4
 Max_run_scored_1Over has a mean of 16 and ranges between 11 and 25
 Max_wicket_taken_1Over has a mean of 3 and ranges between 1 and 4
 Extra_bowls_bowled has a mean of 11 and ranges between 7 and 40
 Min_run_given_1Over has a mean of 2 and ranges between 0 and 4
 Min_run_scored_1Over has a mean of 2 and ranges between 1 and 4
 Max_run_given_1Over has a mean of 5 and ranges between 6 and 40
 Extra_bowls_opponent has a mean of 4 and ranges between 0 and 18
 Player_highest_run has a mean of 65 and ranges between 30 and 100

Figure 34 – Descriptive Stats of the Data Types

c) Understanding of attributes (variable info, renaming if required):


The data dictionary provided along with the data set helps in understanding the variables. A for loop is
iterated over the columns to inspect their unique values.
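The loop over columns might look like the following sketch; the two columns and their values are invented for illustration:

```python
import pandas as pd

# Hypothetical slice of the raw data with inconsistent spellings
df = pd.DataFrame({
    "Match_format": ["T20", "20-20", "ODI", "T20"],
    "First_selection": ["Batting", "Bat", "Bowling", "Batting"],
})

# Iterate over the object columns and list each column's unique values
for col in df.select_dtypes(include="object").columns:
    print(col, "->", df[col].unique().tolist())
```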

Unique values are shown in the below figure.

Figure 35 – Unique Values of each column

Observations:

 Match format T20 has two ways of entry: T20 and 20-20
 First_Selection has two ways of entry for batting: Batting and Bat
 Player_scored_Zero has two ways of entry for three: 3 and Three
 Player_Highest_Wicket has two ways of entry for three: 3 and Three

So these entries need to be standardized in the data. Data after processing:
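A sketch of the cleanup with pandas' replace. The column names here are assumptions, since the data set's exact names are not reproduced in these notes:

```python
import pandas as pd

# Hypothetical columns containing the duplicated spellings noted above
df = pd.DataFrame({
    "Match_format": ["T20", "20-20", "ODI"],
    "First_selection": ["Batting", "Bat", "Bowling"],
    "Players_scored_zero": ["3", "Three", "2"],
    "Player_highest_wicket": ["3", "Three", "1"],
})

# Collapse each duplicated spelling onto a single canonical value
df = df.replace({
    "Match_format": {"20-20": "T20"},
    "First_selection": {"Bat": "Batting"},
    "Players_scored_zero": {"Three": "3"},
    "Player_highest_wicket": {"Three": "3"},
})
print(df["Match_format"].unique().tolist())
```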

Figure 36 – Unique Values of each column after modification

2. Exploratory data analysis


a) Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones):
In this section we will do univariate analysis for the complete data set, starting with the target variable.

Figure 37 – Count Plot on Target Variable

 Based on the data set, most of the matches are won by India against their opponents. We need to check for
data imbalance before model prediction.

Figure 38 – Univariate Analysis for Float and Integer data types

Figure 39 – Univariate Analysis for Categorical Variables

Figure 40 – Univariate Analysis with Player Highest Run

Skewness Calculation:

The skewness of each variable in the data set is calculated.

Figure 41 – Skewness Calculation

 Most of the data is not uniform and is dispersed randomly.
 It seems the data set needs normalization treatment for further analysis, as the skewness values are
on the higher side.
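Skewness can be computed per column with pandas as sketched below on a made-up frame; a value near 0 means roughly symmetric, while a large positive value indicates a long right tail:

```python
import pandas as pd

# Small numeric frame standing in for the match data: one column with a
# long right tail, one roughly symmetric column
df = pd.DataFrame({
    "Extra_bowls_bowled": [7, 7, 8, 8, 9, 9, 10, 10, 12, 40],
    "Avg_team_age": [28, 29, 30, 30, 30, 31, 31, 32, 33, 34],
})

# pandas' skew() gives the adjusted Fisher-Pearson sample skewness
skewness = df.skew()
print(skewness)
```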

b) Bivariate analysis (relationship between different variables, correlations)
This section deals with the Bivariate Analysis of the data.

A bar plot is drawn between Opponent and Bowlers_in_team variable considering result as the hue.

Figure 42 – Bar plot on Opponent and Bowlers_in_team

A scatter plot is drawn between the variables Avg_team_age and All_round_in_team

Figure 43 – Scatter plot on Avg_team_age and All_round_in_team

A violin plot is drawn between the variables Opponent and All_rounder_in_team

Figure 44 – Violin plot on Opponent and All_rounder_in_team

Multi-Variate Analysis:
Also, a correlation plot is drawn for all the variables. From the plot, the following are the observations:

 A correlation of 0.57 exists between Audience and Extra_bowls_bowled
 A correlation of 0.62 exists between Max_run_given_1Over and Extra_bowls_bowled
 A correlation of 0.65 exists between Max_run_given_1Over and Extra_bowls_Opponent
 The rest of the variables have very weak correlations, so there is no curse of
dimensionality
 Surprisingly, Avg_team_age and All_rounder_in_team have zero correlation. This indicates that
the all-rounders are the experienced players.
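A correlation matrix like the one behind the heat map can be computed with DataFrame.corr(). The columns below are simulated so that one pair is correlated and one column is independent, loosely echoing the pattern above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=200)

# Two columns sharing a common component (so they correlate) plus one
# independent column, standing in for the real variables
df = pd.DataFrame({
    "Max_run_given_1Over": base + rng.normal(scale=0.8, size=200),
    "Extra_bowls_bowled": base + rng.normal(scale=0.8, size=200),
    "Avg_team_age": rng.normal(size=200),
})

# Pearson correlation matrix, the input to a heat map
corr = df.corr()
print(corr.round(2))
```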

Figure 45 – Heat Map

Pair Plot:
A pair plot is also drawn to view the data distribution, which we already inferred from the skewness calculation. For
the complete picture, refer to the Jupyter notebook.

Figure 46 – Pair Plot on Data Set

c) Removal of unwanted variables (if applicable):
As per the analysis done so far, we can claim that the columns 'Game_Number' and 'Wicket_keeper_in_team' are
unwanted variables and can be removed before further modelling.

Figure 47 – Data Set after dropping the Game_Number Column

d) Missing Value treatment (if applicable):


The data set has many null values, found using the isnull() function. There are a total of 789 null values out of
67,390 entries.

Figure 48 – Null Values in the Data Set

The total percentage of missing values in the dataset is 1.17%, which is negligible. But the Avg_team_age
column alone has 97 missing values, so deleting the rows with missing values is not a good choice. Instead, we impute
the null values accordingly.

All categorical variables are imputed with the mode value and non-categorical variables are imputed with the median
value.
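The mode/median imputation can be sketched as below on a toy frame (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both column types
df = pd.DataFrame({
    "Opponent": ["Australia", "England", np.nan, "Australia"],
    "Avg_team_age": [29.0, np.nan, 31.0, 35.0],
})

# Mode for categorical (object) columns, median for numeric ones
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())
print(df)
```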

e) Outlier Treatment (if applicable):
The best way to check for outliers is a box plot, so box plots are drawn for all the numeric variables.

Figure 49 – Box plot on the data set

 Variables like Avg_team_age, Player_highest_run, Max_run_given_1Over, Extra_bowls_bowled
and Audience_number have outliers.
 After analyzing the features, only the Avg_team_age and Audience variables require outlier
treatment, and the Audience variable is of no importance in model building.
 Therefore, only 'Avg_team_age' is treated for outliers.
 The remaining variables appear to have outliers in their dispersion, but such values are possible
in cricket, so treating them blindly would distort the data.
 An outlier function based on the IQR values is used for treating the outliers.
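An IQR-based capping function of the kind described might look like this sketch, with whisker limits at q1 - 1.5*IQR and q3 + 1.5*IQR:

```python
import pandas as pd

def cap_outliers(series):
    """Clip values outside the 1.5*IQR whiskers to the whisker values."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series.clip(lower, upper)

# Toy age column echoing the 12-to-70 range noted in the descriptive stats
ages = pd.Series([12, 28, 29, 30, 30, 31, 31, 32, 33, 70])
capped = cap_outliers(ages)
print(capped.min(), capped.max())
```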

Figure 50 – Box plot for ‘Avg_team_Age’ column after outlier treatment

3. Business insights from EDA


a) Is the data unbalanced? If so, what can be done? Please explain in the context of the
business
From the data provided for analysis, value_counts() is applied on the target variable 'Result' to check the data
imbalance. The following figure gives the result.

Figure 51 – Data Imbalance Check

 From the result we can see that the ratio of 'Win' vs 'Loss' is 84% vs 16%
 A data set with a 70:30 ratio is considered well balanced, but this data is still acceptable for modeling.
No SMOTE technique is needed, since it would partially bias the data.
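The imbalance check with value_counts() can be sketched as below; the series is constructed to reproduce the 84:16 split reported above:

```python
import pandas as pd

# Hypothetical target column with the 84:16 Win/Loss split
result = pd.Series(["Win"] * 84 + ["Loss"] * 16, name="Result")

# Absolute counts and class proportions
print(result.value_counts())
print(result.value_counts(normalize=True))
```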

b) Any business insights using clustering (if applicable)


No clustering technique is applied in the analysis

c) Any other business insights:

From the analysis done so far, the following insights are captured:

 As per the data, when India bowls 40 extra balls to the opponent, India definitely loses the match
 Most of the time India has a high success rate
 If the opponent bowls more than 10 extra balls, there is a high chance that Team India will win
 If the opponent bowls more than 16 extra balls, India wins the match as per the data set
 India played most of its matches with an average team age of 30, and the highest number of wins is
recorded at this average age
 Overall, India has the highest winning rate against Bangladesh, followed by Pakistan, with three full-time
bowlers
 India performs better on home pitches than on foreign tours
 The chance of winning the match is high when 5 wickets are taken by a single player in the match
 India does well in the rainy season compared to other seasons
 The Indian team performs well against West Indies, Bangladesh, England and Pakistan
 The Indian team's win rate is lower against South Africa and Sri Lanka compared to other opponents
 India has lost to Sri Lanka mostly in the summer season
 The Indian team won most of its matches when it opted to bowl first, in ODIs and day matches

