Capstone Final Project Report Cricket Win Prediction
By
Subhadeep Seal
CONTENTS
SL.NO. TITLE
1. Introduction, Problem Statement Understanding & Need to Solve It
6. Feature Selection
7. Model Building
8. Model Validation
9. Conclusions
10. Recommendations
INTRODUCTION
➢ The main aim is to create machine learning models that correctly predict a win for the Indian cricket team.
➢ A further aim is to develop a model that extracts and provides actionable insights and recommendations.
India will also play the following 10 matches in the next 10 days. It is important to predict the outcome of the
matches and, if the prediction is a loss, to suggest some changes and re-run the model until it predicts a win. The same
strategy cannot be used throughout the series, as the opponent will become accustomed to it and come up with
their own counter strategy. As such, for all of the 5 matches below, you must suggest unique strategies to help
India win. The suggestions should correspond to the variables in the dataset. Be sure to carefully
consider whether these suggestions could be implemented. The total number of matches will be 5.
NEED FOR THE STUDY/PROJECT
BCCI aims to make data-driven strategic decisions to improve India’s win rate. This model supports strategic planning for
upcoming matches.
India is one of the successful cricket teams across all formats (Test, ODI and T20) and plays all of them
throughout the year.
Being the best team in the world helps the cricket council maintain standards and also generate more
revenue.
With this intention, historical data is provided with certain parameters.
We need to build an accurate model that can predict the outcome of future matches.
The critical requirement is that, if a loss is predicted, the parameters are changed so that India wins the match.
The parameters to tweak, based on the opponent and other conditions, have to be identified.
The measure of success for this project is whether the team/council can choose the right parameters
so that India wins every match.
Data for the next 5 matches that India is going to play is provided. The model has to be built on the historical data and used to predict
these 5 matches.
The parameters then have to be tweaked for each match, as the opponent will come to understand the strategy as well.
The following are the matches to be predicted:
• Test match with England in England. All the matches are day matches. In England, it will be the rainy season at the time of the match.
• T20 match with Australia in India. All the matches are day and night matches. In India, it will be the winter season at the time of the
match.
• 2 ODI matches with Sri Lanka in India. All the matches are day and night matches. In India, it will be the winter season at the time of the
match.
The study is a supervised learning problem: the class to which each data element belongs is specified, an approach best used when the output
has finite and discrete values.
WORKFLOW & TOOLS
Data Collection → Data Cleaning → Data Exploration → Data Preprocessing → Model Building → Model Evaluation
DATA & VARIABLES DESCRIPTION
• Results of a total of 2929 matches across all three formats (T20, ODI, Test) have been taken for this case study.
• The dataset has 23 variables in total, with three datatypes: float64 (9), int64 (4) and object (10). ‘Result’ is the
target variable.
Variables                Description
Game_number              Unique ID for each match
Result                   Final result of the match
Avg_team_Age             Average age of the playing 11 for that match
Match_light_type         Type of match: day, night, or day & night
Match_format             Format of the match: T20, ODI or Test
Bowlers_in_team          Number of full-time bowlers played in the team
Wicket_keeper_in_team    Number of full-time wicket keepers played in the team
All_rounder_in_team      Number of full-time all-rounders played in the team
First_selection          First innings of the team: batting or bowling
Opponent                 Opponent team in the match
Season                   Season in the city where the match was played
Outlier Treatment
Label Encoding & Feature Selection
Missing Value treatment
The data set contains null values, which were detected using the isnull() function. There are a total of 789 null values out of
67390 entries. These missing values were imputed using KNN Imputer, as null values do not contribute towards model building.
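The detection and imputation step described above might look roughly like the following. This is a minimal sketch, assuming pandas and scikit-learn; the toy values, the choice of `n_neighbors=2` and the three columns shown are illustrative assumptions, not the report's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame standing in for the match data (values are made up).
df = pd.DataFrame({
    "Avg_team_Age": [29.0, 30.0, np.nan, 31.0, 28.0],
    "Bowlers_in_team": [3.0, 4.0, 3.0, np.nan, 4.0],
    "Audience_number": [9000.0, 12000.0, 11000.0, 15000.0, np.nan],
})

# Count missing cells per column and in total.
nulls_per_column = df.isnull().sum()
total_nulls = int(nulls_per_column.sum())

# Fill each gap from the k most similar rows (k=2 is an arbitrary choice here).
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

remaining_nulls = int(df_imputed.isnull().sum().sum())
print(total_nulls, remaining_nulls)
```

KNN imputation is distance based, so in practice columns on very different scales (such as audience counts versus ages) are usually scaled before imputing.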
Figure 3 – Box plot for ‘Avg_team_Age’ column after outlier treatment
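The report does not state which outlier treatment was applied. One common option is capping values at the IQR whiskers, sketched below under that assumption with made-up ages.

```python
import pandas as pd

# Made-up average ages; 45.0 plays the role of an outlier.
ages = pd.Series([28.0, 29.0, 29.5, 30.0, 30.5, 31.0, 45.0], name="Avg_team_Age")

# Whisker limits at 1.5 * IQR beyond the quartiles (the usual box-plot rule).
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the whiskers instead of dropping the rows.
capped = ages.clip(lower=lower, upper=upper)
print(capped.max())
```

Capping keeps every match in the dataset, which matters when the number of rows is modest.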
Exploratory Data Analysis- Univariate
Insights:
➢ Around 72% of matches are played in India and only 28% are played outside India.
➢ Most of the matches are played in the rainy season.
➢ The majority of the matches are played against South Africa (676).
➢ There are some outliers present in predictors such as Avg_team_age, max_run_1over and audience_number.
Figure 5 – Box plot for Categorical Variables VS Target Variables
➢ An inexperienced team (young players) has a higher chance of losing the match.
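Shares such as the home/away split above can be read off with `value_counts`. The following is a small sketch on invented rows; the `Yes`/`No` coding of `Offshore` is an assumption, since the report does not show the raw values.

```python
import pandas as pd

# Invented rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "Offshore": ["No", "No", "No", "Yes", "No", "Yes", "No", "No"],
    "Season": ["Rainy", "Winter", "Rainy", "Rainy", "Summer", "Rainy", "Winter", "Rainy"],
})

# Percentage of matches per category.
offshore_share = df["Offshore"].value_counts(normalize=True) * 100

# Most frequent season.
top_season = df["Season"].value_counts().idxmax()

print(offshore_share)
print(top_season)
```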
Exploratory Data Analysis- Multivariate
Insights:
➢ Multicollinearity can be seen among some variables.
➢ Extra bowls by the opponent is highly correlated with the maximum runs given in an over.
➢ We can also see a negative correlation between extra bowls bowled and all-rounders in the team.
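A correlation matrix is one way to surface such relationships. The sketch below uses invented numbers shaped to mimic the two patterns described; the 0.8 cut-off for flagging a pair is an arbitrary assumption.

```python
import pandas as pd

# Invented values; column names follow the dataset's variable list.
df = pd.DataFrame({
    "extra_bowls_opponent": [2, 5, 8, 11, 14, 17],
    "Max_run_given_1over": [6, 9, 13, 15, 20, 22],
    "Extra_bowls_bowled": [12, 10, 9, 6, 4, 2],
    "All_rounder_in_team": [1, 2, 2, 3, 4, 4],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# List each pair once when its absolute correlation exceeds the threshold.
threshold = 0.8
cols = list(corr.columns)
strong_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```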
Target variable: Result
Eliminated variables: Game_number, Wicket_keeper_in_team
Variables retained:
• Avg_team_Age
• Match_light_type
• Match_format
• Bowlers_in_team
• All_rounder_in_team
• First_selection
• Opponent
• Season
• Audience_number
• Offshore
• Max_run_scored_1over
• Max_wicket_taken_1over
• Extra_bowls_bowled
• Min_run_given_1over
• Min_run_scored_1over
• Max_run_given_1over
• extra_bowls_opponent
• player_highest_run
• Players_scored_zero
• player_highest_wicket
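Dropping the eliminated columns and label-encoding the categorical ones could be sketched as follows. The rows and the `Win`/`Loss` labels are invented, and pandas category codes are used here as a stand-in for whichever label encoder the project actually used.

```python
import pandas as pd

# Invented rows; only a few of the dataset's columns are shown.
df = pd.DataFrame({
    "Game_number": ["Game_1", "Game_2", "Game_3", "Game_4"],
    "Wicket_keeper_in_team": [1, 1, 1, 1],
    "Match_format": ["ODI", "T20", "Test", "ODI"],
    "First_selection": ["Batting", "Bowling", "Bowling", "Batting"],
    "Avg_team_Age": [30.0, 29.0, 31.0, 30.0],
    "Result": ["Win", "Loss", "Win", "Win"],
})

# Separate the target and drop the two eliminated columns.
target = df["Result"].map({"Loss": 0, "Win": 1})
features = df.drop(columns=["Game_number", "Wicket_keeper_in_team", "Result"])

# Label-encode categorical columns: each category becomes an integer code,
# assigned in alphabetical order of the category names.
categorical_cols = ["Match_format", "First_selection"]
encoded = features.copy()
for col in categorical_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Integer codes imply an ordering that does not exist for variables such as Opponent; tree-based models tolerate this, while linear models usually do better with one-hot encoding.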
Model Building
➢ The target variable, Result, is categorical in nature.
➢ For this case study the following classification models will be used:
• Logistic Regression
• AdaBoost
• Random Forest Classifier
• XGBoost
• KNN
• SVM
• Naive Bayes
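Fitting several classifiers in one loop keeps the comparison consistent. The sketch below assumes scikit-learn and synthetic data from `make_classification`; default hyperparameters are used throughout, and XGBoost is left out only to avoid an extra dependency (it could be added to the same dictionary).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary data standing in for the encoded match features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```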
Evaluation Parameters
• Accuracy: the ratio of correctly classified points in the test data to the total number of points in the test data.
• Precision: of all the points the model predicts as positive, the share that are actually positive.
• Recall: of all the actual positives, the percentage that the model predicts as positive.
• F1 score: the harmonic mean of precision and recall. It takes both false positives and false negatives into account.
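These four measures follow directly from the confusion-matrix counts. The helper below is an illustrative sketch (the function name and the example counts are made up) that assumes at least one predicted positive and one actual positive so no division by zero occurs.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total          # correct predictions over all points
    precision = tp / (tp + fp)            # true positives over predicted positives
    recall = tp / (tp + fn)               # true positives over actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 80 wins caught, 5 losses wrongly called wins, no missed wins.
metrics = classification_metrics(tp=80, fp=5, fn=0, tn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With no false negatives, recall is 1.0 while precision stays below 1.0, which is the same pattern as a model reporting 100% recall alongside lower precision.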
Model Validation
➢ All models except the Random Forest model give 96.59% accuracy and 100% recall.
Table 3 – Model Comparison
Model Selected:
➢ The Random Forest model has fewer false positives and false negatives for
both the win and loss classes. Compared with the other models, it performs well on
AUC, recall and accuracy for both train and test data.
Model Evaluation: Random Forest on Test Data
➢ For the test data set we got 96.59% accuracy and 100% recall.
MODEL        RF-Test
Recall       100
Precision    96.02
AUC          95.50
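Test-set figures of this kind can be produced as sketched below. The data is synthetic and the hyperparameters are assumptions, so the numbers it prints will not match the report's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded match features.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Hard labels feed accuracy/precision/recall; AUC needs class-1 probabilities.
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

report = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_prob),
}
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Running the same report on the training split as well makes it easy to spot overfitting, which is the train-versus-test comparison the model selection relies on.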
• When batting first, the team wins 33% of the time and loses only 0.06% of the time.
• When bowling first, the team wins 51% of the time and loses 10% of the time.
• When playing in daylight, winning chances increase by 60%.
• In ODIs, the team has won 1666 games out of 1935.
• Out of 2121 matches played within the country, India lost only 227.
• India has a high probability of winning at its home ground.
• India’s win rate improves by 18% when batting first at home during winter.
• Playing with more than 2 all-rounders in the team increases winning chances by more than 50%.
• The team won 1201 out of 1371 matches played in rainy conditions.
• This shows that rain gives bowlers some extra advantage.
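The match counts quoted above translate into rates with simple arithmetic. The helper name below is illustrative, and the home figure is computed as the share of matches not lost, since only the loss count is given.

```python
def percentage(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Counts taken from the conclusions above.
home_not_lost = percentage(2121 - 227, 2121)   # home matches not lost
odi_win_rate = percentage(1666, 1935)          # ODI wins
rainy_win_rate = percentage(1201, 1371)        # wins in rainy conditions

print(f"Home, not lost: {home_not_lost:.1f}%")   # 89.3%
print(f"ODI wins:       {odi_win_rate:.1f}%")    # 86.1%
print(f"Rainy wins:     {rainy_win_rate:.1f}%")  # 87.6%
```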
• Collect more predictors, such as total score and bowling style, to build a better model.
• Include more than 3 all-rounders in the team to improve team performance.
• If the team opts to bowl first with an average team age of 30 and 4 bowlers in the team, it has
a higher chance of winning against England in a Test match in the rainy season in England.
• If the team opts to bowl first with an average team age of 30 and a minimum of 3 bowlers in the team,
and scores an average of 15 runs per over, it has a higher chance of winning against Australia in a T20 match
in the winter season in India.
• If the team opts to bat first with an average team age of 30 and 3 bowlers in the team, and at
least one player scores a century, it has a higher chance of winning against Sri Lanka in an ODI
match in the winter season in India.
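Recommendations like these come from changing controllable inputs and re-running the model. The following is a minimal what-if sketch on synthetic history: the labelling rule, the 0/1 coding of `First_selection` and the candidate values are all assumptions made for illustration, not the project's data or results.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_cols = ["First_selection", "Avg_team_Age", "Bowlers_in_team", "All_rounder_in_team"]

# Synthetic match history (random values, not the real dataset).
rng = np.random.default_rng(0)
n = 500
history = pd.DataFrame({
    "First_selection": rng.integers(0, 2, n),    # assumed coding: 0 = batting, 1 = bowling
    "Avg_team_Age": rng.integers(25, 34, n),
    "Bowlers_in_team": rng.integers(2, 6, n),
    "All_rounder_in_team": rng.integers(0, 5, n),
})
# Made-up rule to create labels so the toy model has a pattern to learn.
history["Result"] = (
    (history["All_rounder_in_team"] >= 2) | (history["Bowlers_in_team"] >= 4)
).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(history[feature_cols], history["Result"])

# Enumerate candidate team settings and score each one.
grid = itertools.product([0, 1], [30], [3, 4], [1, 2, 3])
candidates = pd.DataFrame(list(grid), columns=feature_cols)
candidates["win_probability"] = model.predict_proba(candidates[feature_cols])[:, 1]

# Highest predicted win probability first.
ranked = candidates.sort_values("win_probability", ascending=False).reset_index(drop=True)
print(ranked.head())
```

Only variables the team can actually control (selection, batting order, first-innings choice) belong in the candidate grid; season and opponent stay fixed at the values for the upcoming match.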
THANK YOU