Capstone Notes-1

Uploaded by

ANIL

Cricket Win Prediction –

Capstone Project Notes-1

This document summarizes the Exploratory Data Analysis (EDA) of the given cricket data set, which contains data from past matches played by the Indian team.

Windows User
5/1/2022
Table of Contents
1. Introduction of the business problem
a) Defining problem statement:
b) Need of the study/project:
c) Understanding business/social opportunity:
2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology:
b) Visual inspection of data (rows, columns, descriptive details):
c) Understanding of attributes (variable info, renaming if required):
3. Exploratory data analysis
a) Univariate analysis (distribution and spread for every continuous attribute, distribution of data in categories for categorical ones):
Skewness Calculation
b) Bivariate analysis (relationship between different variables, correlations)
Multi-Variate Analysis:
Pair Plot:
c) Removal of unwanted variables (if applicable):
d) Missing Value treatment (if applicable):
e) Outlier Treatment (if applicable):
4. Business insights from EDA
a) Is the data unbalanced? If so, what can be done? Please explain in the context of the business
b) Any business insights using clustering (if applicable)
c) Any other business insights:

List of Figures:

Figure 1 - Loading Data Set
Figure 2 – Shape of the Data
Figure 3 – Info of the Data Types
Figure 4 – Descriptive Stats of the Data Types
Figure 5 – Unique Values of each column
Figure 6 – Unique Values of each column after modification
Figure 7 – Count Plot on Target Variable
Figure 8 – Univariate Analysis for Float and Integer data types
Figure 9 – Univariate Analysis for Categorical Variables
Figure 10 – Univariate Analysis with Player Highest Run
Figure 11 – Skewness Calculation
Figure 12 – Bar plot on Opponent and Bowlers_in_team
Figure 13 – Scatter plot on Avg_team_age and All_rounder_in_team
Figure 14 – Violin plot on Opponent and All_rounder_in_team
Figure 15 – Heat Map
Figure 16 – Pair Plot on Data Set
Figure 17 – Data Set after dropping the Game_Number Column
Figure 18 – Null Values in the Data Set
Figure 19 – Box plot on the data set
Figure 20 – Box plot for ‘Avg_team_Age’ column after outlier treatment
Figure 21 – Data Imbalance Check

1. Introduction of the business problem
a) Defining problem statement:
BCCI has hired an external analytics consulting firm. The main objective of this tie-up is to extract actionable insights from historical match data and to recommend strategic changes that help India win. The primary task is to build machine learning models that correctly predict a win for the Indian cricket team. Once a model is developed, actionable insights and recommendations are to be extracted from it.

b) Need of the study/project:

BCCI needs a model that predicts the outcome of Team India's upcoming matches and points to a winning strategy. In addition, the available data set is used to analyze India's past performance against its opponents and to suggest insights to BCCI, considering all the influencing factors.

In this project, EDA is performed on factors such as season, match type, and bowler and batsman performance. Univariate, bivariate and multivariate analyses are performed on the preprocessed data.

c) Understanding business/social opportunity:

The main purpose of building this model is to find a strategy that helps Team India win, which in turn brings BCCI financial benefits and more sponsors. If the model helps India win most of its matches, the team is likely to attract more funding, which can encourage it to set new records.

The model could also be sold to gaming apps, giving it a commercial advantage. With slight changes, it could be applied to other cricket formats and leagues such as IPL and CCL and capture that market.

2. Data Report
a) Understanding how data was collected in terms of time, frequency and
methodology:
The data set is provided in Excel format and is loaded into a Jupyter notebook for further analysis. It has 23 columns containing match-related factors that may affect India's result against its opponents. The ‘Result’ column is the target variable that will be used for model building.

Figure 1 - Loading Data Set

b) Visual inspection of data (rows, columns, descriptive details):
The dataset has 2930 rows and 23 columns.

Figure 2 – Shape of the Data

As per the info command, the data types are: float – 9 columns, int – 4 columns, object – 10 columns.

Figure 3 – Info of the Data Types

A descriptive analysis is performed on the data set. The insights are listed below.

• The average age of the team is 30 and ranges from 12 to 70, which suggests outliers in the age column.
• Bowlers_in_team has a mean of 3 and ranges from 1 to 5. The team most often fielded 3 bowlers, and the win rate with 3 bowlers is high, so 3 bowlers appears to be a good number for winning a match.
• Wicket_Keeper_team is always 1, so this variable can be excluded from the analysis because it has no impact on the target variable.
• All_rounder_in_team has a mean of 3 and ranges between 1 and 4.
• Max_run_scored_1 over has a mean of 16 and ranges between 11 and 25.
• Max_wicket_taken_1 over has a mean of 3 and ranges between 1 and 4.
• Extra_bowls_bowled has a mean of 11 and ranges between 7 and 40.
• Min_run_given_1 over has a mean of 2 and ranges between 0 and 4.
• Min_run_scored_1 over has a mean of 2 and ranges between 1 and 4.
• Max_run_given_1 over has a mean of 5 and ranges between 6 and 40. The mean lies below the stated minimum, so these figures should be verified against Figure 4.
• Extra_bowls_opponent has a mean of 4 and ranges between 0 and 18.
• Player_highest_run has a mean of 65 and ranges between 30 and 100.

Figure 4 – Descriptive Stats of the Data Types

c) Understanding of attributes (variable info, renaming if required):

The data dictionary provided with the data set helps in understanding the variables. A for loop iterates over the columns and lists the unique values of each.

The unique values are shown in the figure below.

Figure 5 – Unique Values of each column
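
The loop over the columns can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual code: the values are invented, and the column name Match_format is an assumption (the other two names follow the observations in this section).

```python
import pandas as pd

# Toy stand-in for the match data; values are invented for illustration.
df = pd.DataFrame({
    "Match_format": ["ODI", "T20", "20-20", "Test"],        # assumed column name
    "First_Selection": ["Batting", "Bat", "Bowling", "Bowling"],
    "Player_scored_Zero": ["3", "Three", "1", "2"],
})

# Iterate over every column and collect its distinct entries.
unique_values = {}
for col in df.columns:
    unique_values[col] = sorted(df[col].dropna().unique())
    print(f"{col}: {unique_values[col]}")
```

Listing the values side by side is what exposes duplicate spellings such as Bat and Batting.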

Observations:

• Match format T20 is entered in two ways: T20 and 20-20
• First_Selection has two entries for batting: Batting and Bat
• Player_scored_Zero has two entries for three players: 3 and Three
• Player_Highest_Wicket has two entries for three: 3 and Three

These duplicate entries need to be replaced with a single consistent value. The unique values after this cleanup are shown below.

Figure 6 – Unique Values of each column after modification
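
One way to consolidate the duplicate entries is a per-column replacement map. The sketch below runs on invented toy values; the column name Match_format and the choice of which spelling to keep are assumptions, since the report does not state them.

```python
import pandas as pd

# Toy stand-in for the match data; values are invented for illustration.
df = pd.DataFrame({
    "Match_format": ["ODI", "T20", "20-20", "Test"],        # assumed column name
    "First_Selection": ["Batting", "Bat", "Bowling", "Bowling"],
    "Player_scored_Zero": ["3", "Three", "1", "2"],
})

# Map each duplicate spelling to one canonical label (exact matches only).
replacements = {
    "Match_format": {"20-20": "T20"},
    "First_Selection": {"Bat": "Batting"},
    "Player_scored_Zero": {"Three": "3"},
}
cleaned = df.replace(replacements)

# With the text entry gone, the count column can be stored as integers.
cleaned["Player_scored_Zero"] = cleaned["Player_scored_Zero"].astype(int)
```

Because the replacement matches whole values, "Batting" is left untouched when "Bat" is mapped to "Batting".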

3. Exploratory data analysis

a) Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones):
In this section, univariate analysis is performed on the complete data set, starting with the target variable.

Figure 7 – Count Plot on Target Variable

• Based on the data set, India won most of its matches against its opponents. The class balance therefore needs to be checked before model building.

Figure 8 – Univariate Analysis for Float and Integer data types

Figure 9 – Univariate Analysis for Categorical Variables

Figure 10 – Univariate Analysis with Player Highest Run

Skewness Calculation:

The skewness of each numeric variable in the data set is calculated.

Figure 11 – Skewness Calculation

• Most variables are not symmetrically distributed.

• Several skewness values are high, so the data set appears to need normalization before further analysis.
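
The skewness check can be sketched with pandas as follows. The values are invented, and the cut-off of |skewness| > 1 for "high" is a common rule of thumb assumed here, not a threshold stated in the report.

```python
import pandas as pd

# Toy numeric columns; values are invented for illustration.
df = pd.DataFrame({
    "Avg_team_age": [30.0, 30.0, 29.0, 31.0, 30.0, 70.0],         # one extreme value
    "Player_highest_run": [40.0, 55.0, 65.0, 75.0, 90.0, 65.0],   # symmetric around 65
})

# Sample skewness of every numeric column.
skewness = df.skew(numeric_only=True)

# Flag columns whose skewness exceeds the assumed rule-of-thumb threshold.
highly_skewed = skewness[skewness.abs() > 1].index.tolist()
print(skewness)
print("Highly skewed:", highly_skewed)
```

A single extreme age is enough to push Avg_team_age far from symmetry, while the symmetric run column has skewness of zero.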

b) Bivariate analysis (relationship between different variables, correlations)

This section deals with the Bivariate Analysis of the data.

A bar plot is drawn of Opponent against Bowlers_in_team, with Result as the hue.

Figure 12 – Bar plot on Opponent and Bowlers_in_team

A scatter plot is drawn between the variables Avg_team_age and All_rounder_in_team.

Figure 13 – Scatter plot on Avg_team_age and All_rounder_in_team

A violin plot is drawn between the variables Opponent and All_rounder_in_team.

Figure 14 – Violin plot on Opponent and All_rounder_in_team
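
The numbers behind a bar plot of Opponent against Bowlers_in_team with Result as the hue can also be tabulated directly. The sketch below assumes the bars show the mean of Bowlers_in_team per group, which is the default aggregation in seaborn's bar plot; the report does not state which plotting call was used, and the rows are invented.

```python
import pandas as pd

# Toy match rows; values are invented for illustration.
df = pd.DataFrame({
    "Opponent": ["Bangladesh", "Bangladesh", "Bangladesh",
                 "Pakistan", "Pakistan", "Pakistan", "Pakistan"],
    "Bowlers_in_team": [3, 3, 4, 3, 4, 4, 3],
    "Result": ["Win", "Win", "Loss", "Win", "Loss", "Win", "Loss"],
})

# Mean number of bowlers per opponent, split by match result (the "hue").
summary = (
    df.groupby(["Opponent", "Result"])["Bowlers_in_team"]
      .mean()
      .unstack("Result")
)
print(summary)
```

Each row of the table corresponds to one opponent on the x-axis, and each column to one hue level.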

Multi-Variate Analysis:
A correlation plot is also drawn for all the variables. The observations from the plot are as follows:

• A correlation of 0.57 exists between Audience and extra_bowls_bowled
• A correlation of 0.62 exists between Max_run_given_1Over and extra_bowls_bowled
• A correlation of 0.65 exists between Max_run_given_1Over and extra_bowls_Opponent
• The remaining variables are only weakly correlated, so multicollinearity is not a concern
• Surprisingly, Avg_team_age and All_rounder_in_team have zero correlation, meaning the number of all-rounders in the team has no linear relationship with the average team age

Figure 15 – Heat Map
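
The pairs highlighted above can be pulled out of the correlation matrix programmatically. This is a sketch on invented values; the 0.5 cut-off for a notable correlation is an assumption chosen for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy numeric columns; values are invented for illustration.
df = pd.DataFrame({
    "Max_run_given_1Over": [6, 8, 10, 12, 14],
    "extra_bowls_bowled": [7, 9, 12, 13, 16],
    "Avg_team_age": [30, 29, 31, 30, 30],
})

# Pearson correlation matrix (the values a heat map would display).
corr = df.corr(numeric_only=True)

# Visit each pair of columns once and keep those above the assumed threshold.
threshold = 0.5
strong = {
    (a, b): round(float(corr.loc[a, b]), 2)
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) >= threshold
}
print(strong)
```

Listing the pairs this way avoids reading values off the heat map by eye.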

Pair Plot:
A pair plot is also drawn to view the data distributions, which are consistent with the skewness calculation. For the complete picture, refer to the Jupyter notebook.

Figure 16 – Pair Plot on Data Set

c) Removal of unwanted variables (if applicable):

Based on the analysis so far, the columns ‘Game_Number’ and ‘Wicket_keeper_in_team’ (which is always 1) are unwanted variables and can be removed before further modelling.

Figure 17 – Data Set after dropping the Game_Number Column
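
Dropping these columns can be sketched as below. The rows are invented; detecting constant columns with nunique() is one possible approach and is not necessarily how the notebook identified them.

```python
import pandas as pd

# Toy rows; values are invented for illustration.
df = pd.DataFrame({
    "Game_Number": ["Game_1", "Game_2", "Game_3"],
    "Wicket_keeper_in_team": [1, 1, 1],
    "Bowlers_in_team": [3, 4, 3],
    "Result": ["Win", "Loss", "Win"],
})

# A column with a single distinct value cannot help separate wins from losses.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Drop the named column together with any constant columns.
reduced = df.drop(columns=["Game_Number"] + constant_cols)
print(reduced.columns.tolist())
```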

d) Missing Value treatment (if applicable):

The data set has many null values, found using the isnull() function. There are a total of 789 null values out of 67,390 entries (2,930 rows × 23 columns).

Figure 18 – Null Values in the Data Set

The total percentage of missing values in the data set is 1.17%, which is negligible. However, the Avg_team_age column alone has 97 missing values, so deleting the affected rows is not a good choice. The null values are imputed instead.

All categorical variables are imputed with the mode, and all non-categorical variables with the median.
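
The null count and the mode/median imputation can be sketched as follows, using invented values. On the real data the same percentage formula gives 100 × 789 / 67,390 ≈ 1.17%.

```python
import numpy as np
import pandas as pd

# Toy rows with gaps; values are invented for illustration.
df = pd.DataFrame({
    "Avg_team_age": [30.0, np.nan, 29.0, 31.0, 70.0],
    "Season": ["Rainy", "Summer", None, "Rainy", "Winter"],
})

# Null count per column and overall percentage of missing cells.
null_counts = df.isnull().sum()
pct_missing = 100 * null_counts.sum() / df.size

# Numeric columns get the median, all other columns get the mode.
imputed = df.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])
```

The median is less sensitive than the mean to extreme values such as the age of 70 in the toy column.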

e) Outlier Treatment (if applicable):

A box plot is a convenient way to check for outliers, so a box plot is drawn for all the numeric variables.

Figure 19 – Box plot on the data set

• Avg_team_age, player_highest_run, Max_run_given_1Over, Extra_bowls_bowled and Audience number have outliers.
• After analyzing the features, only Avg_team_age and Audience require outlier treatment, and Audience is of no importance in model building.
• Therefore, only ‘Avg_team_age’ is treated for outliers.
• The extreme values in the remaining variables look like outliers in the distribution, but they are all plausible in cricket. Treating them blindly would distort the data.
• The outliers are treated with a function based on the interquartile range (IQR).

Figure 20 – Box plot for ‘Avg_team_Age’ column after outlier treatment
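
An IQR-based treatment can be sketched as below. The report does not say whether outliers were capped or removed, so capping at the whisker limits with a multiplier of 1.5 is an assumption; the function name cap_outliers_iqr and the values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR outside the quartiles (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower=lower, upper=upper)

# Toy ages with one low and one high extreme; values are invented.
ages = pd.Series([12.0, 29.0, 30.0, 30.0, 30.0, 31.0, 31.0, 70.0], name="Avg_team_age")
capped = cap_outliers_iqr(ages)
print(capped.tolist())
```

Capping keeps every row, which suits a data set where deleting rows was already judged undesirable for missing values.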

4. Business insights from EDA

a) Is the data unbalanced? If so, what can be done? Please explain in the context of the
business
value_counts() is applied to the target variable ‘Result’ to check for data imbalance. The following figure gives the result.

Figure 21 – Data Imbalance Check

• The ratio of ‘Win’ to ‘Loss’ is 84% to 16%.
• A data set with a 70:30 ratio is considered well balanced. At 84:16 this data set is less balanced than that, but it is still usable for modelling. SMOTE is not applied, since synthetic oversampling would partially bias the data.
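
The balance check can be sketched with value_counts(). The series below is a toy with the same 84:16 split reported above, not the real ‘Result’ column.

```python
import pandas as pd

# Toy target column reproducing the reported 84:16 split.
result = pd.Series(["Win"] * 84 + ["Loss"] * 16, name="Result")

# Absolute counts and class shares of the target variable.
counts = result.value_counts()
shares = result.value_counts(normalize=True)

# Ratio of the majority class to the minority class.
imbalance_ratio = counts.max() / counts.min()
print(shares)
print("Majority to minority ratio:", imbalance_ratio)
```

Passing normalize=True returns proportions, which are easier to compare against a reference split such as 70:30 than raw counts.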

b) Any business insights using clustering (if applicable)

No clustering technique is applied in the analysis

c) Any other business insights:

From the analysis so far, the following insights are summarized:

• As per the data, India lost whenever it bowled 40 extra balls to the opponent
• India wins most of its matches
• If the opponent bowls more than 10 extra balls, there is a high chance that Team India will win
• In this data set, India won every match in which the opponent bowled more than 16 extra balls
• India played most of its matches with an average team age of 30, and the highest number of wins is recorded at this average age
• Overall, India has the highest win rate against Bangladesh, followed by Pakistan, with three full-time bowlers
• India performs better on home pitches than on foreign tours
• The chance of winning is high when a single player takes 5 wickets in the match
• India does better in the Rainy season than in other seasons
• The Indian team performs well against West Indies, Bangladesh, England and Pakistan
• The Indian team's win rate is lower against South Africa and Sri Lanka than against other opponents
• India lost the majority of its Summer-season matches against Sri Lanka
• The Indian team won most of its matches when it opted to bowl first, in ODIs and in day matches
