The Cricket Winner Prediction With Application of Machine Learning and Data Analytics
Abstract: With the evolution of data science, business firms are adopting the latest technologies to grow. They compete to deliver
better management, better quality of evaluation and better services in the market, and the only way to meet these demands is to
analyze data cleanly and accurately. Machine learning is an emerging field that predicts future outcomes from existing data, and
better decisions can be made on the basis of these predictions. Cricket is a well-known game that is played and watched around the
globe in 104 countries. Cricket fans want their team to perform well and be declared the winner. To improve its chances of winning,
a team should work on its strengths and its performance. Predicting the winner of a cricket match depends on many factors, such as
batsmen's performances, team strengths, venues and weather conditions. In this research, various features have been analyzed to
predict the match winner. This paper is about predicting the winner of an IPL match before the match starts. The winner is predicted
by training machine learning models on the selected features. For model building, different machine learning algorithms, namely
Random Forest, SVM, Naive Bayes, Logistic Regression and Decision Tree, have been applied to training and test datasets of different
sizes. The prediction model will benefit cricket boards in tasks such as evaluating a team's strength and cricket analysis. For
gambling applications and match-reporting media, this model will be a blessing in disguise.
Index Terms: Cricket, Data Analysis, Data Science, Machine Learning, Model Classifiers, Modelling, Prediction, Prediction Models
————————————————————
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 8, ISSUE 09, SEPTEMBER 2019 ISSN 2277-8616
IJSTR©2019
www.ijstr.org
Daniel Mago Vistro, Asia Pacific University, Malaysia, [email protected]
Rasheed, Asia Pacific University, Malaysia, [email protected]
Leo Gertrude David, Asia Pacific University, Malaysia, [email protected]
Leo Gertrude David, Kumaraguru College of Liberal Arts and Science, [email protected]
1 INTRODUCTION
The use of statistical analysis in sports has been growing quickly year by year. It has changed the way game strategies are
formed and players are evaluated, and it has also drawn more audience interest towards cricket. Cricket has become one of the
most followed team games in the world, with billions of fans across the globe. It is played globally across the 106 member
states of the International Cricket Council (ICC), which has 1.5 billion fans worldwide according to the ICC. However, much of
the global finance and interest is focused on the 10 full ICC member nations, and more specifically on 'the big three' of
England, Australia and India.
Cricket has evolved considerably over time. Today it is played internationally in three major formats: Test matches, One Day
Internationals (ODIs) and T20 cricket. Besides these international matches, T20 league cricket is gaining attention among fans
because it is the shortest and most exciting format of the game. The Indian Premier League (IPL) is one of the most popular T20
cricket leagues in the world. Since its inception in 2007, the IPL has been a huge success and has become an industry with
billions of dollars of investment. Similarly, England's county cricket (T20 Blast), the Big Bash, PSL and BPL are other big
leagues that are investing a lot of money to promote their franchise-based cricket.
In franchise cricket every team wants to win and improve its performance. For this purpose, every team needs a good management
panel to handle the responsibilities of the whole franchise, and a team selection committee that selects the best possible team,
for example by picking the best batsmen from the draft on the basis of their past performances. The Indian Premier League is a
domestic competition played in India in April and May every year between eight teams. More than 150 players are selected by each
team. Each team consists of 11 players: four overseas players and seven local players. Every team's performance depends on the
key performances of its players, team conditions and other important aspects that decide the outcome of a cricket match. The
model will be built on all the possible factors affecting the outcome of a cricket match.
Ground effects, team quality and home-field advantage were found to be important by the Nagelkerke R² and AIC analysis. This may
be because the ICC rating accounts for the result (win, draw or loss) along with the margin of victory, wickets and the
opponent's rating. Winning the toss was also considered in the model fitting but was found to be insignificant. Playing
conditions differ from ground to ground and from country to country. For instance, playing conditions at the Wankhede in Mumbai
are quite different from those at Headingley in Leeds [14].
Player performance decides a team's chances of winning: every team depends on its players to perform well and to play according
to the match situation. In selecting the lineup, player performance is taken as a major factor. A batsman's performances in
recent matches indicate his form and his ability to score runs at a healthy strike rate, which is a necessity in Twenty20 cricket
today. Pitch conditions are very important in cricket. Cricket is played on several kinds of pitches, and every ground has its
own pitch conditions, being known for either bowling-friendly or batting-friendly pitches. Weather conditions also play a role
in deciding the result of a match, and bad weather can affect the outcome. Players with good batting averages and consistent
performances in recent matches are the ones on whom teams rely, because they can play a major role in posting a good target and
in chasing by handling pressure situations.
Sometimes matches are interrupted by rain or other circumstances. To reset the target in interrupted matches, the Duckworth-Lewis
(D/L) method is used [8]. Multiple linear regression is a useful method for assigning winning probabilities to the competing
teams in One Day International matches. With the use of the D/L approach, this procedure can be readily adapted to deliver
'in the run' predictions. While a conclusive investigation of the efficiency of the betting
market has yet to be conducted, preliminary evidence suggests that punters may tend to over- or underestimate the true
probability of the competing teams winning as the game progresses [2].
2 RELATED WORKS
With the evolution of cricket, the game has become a very hot topic for sports analysts. A lot of research has been done on
cricket, but because of inconsistent and complicated datasets, researchers have not achieved a breakthrough in predicting the
match winner accurately. Many techniques have been used to predict the match winner, such as KNN, Logistic Regression, SVM and
Naïve Bayes, but none has achieved the desired accuracy.
Ahmed & Nazir [1] implemented different statistical approaches to form datasets and tried various classification techniques to
predict the winner of a One Day (50-over) cricket match, predicting the winner with 80% accuracy. Shah [14] predicted One Day
International match results using ICC match ratings, ICC ranking points for batsmen and bowlers, the home factor, ICC rating
differences and ground effects. They applied Logistic Regression to this data and achieved 74.9% accuracy in predicting match
results, and in 81% of matches they predicted the winning team correctly. Jhanwar [5] reported 71% accuracy in predicting the
winner of a One Day International match using binary classification models such as Logistic Regression, KNN, Random Forest and
Decision Trees; a cross-validation procedure was not carried out.
Jhawar [6] studied predicting the winner of a match at the end of an over, using players' recent and past performances and other
statistics necessary for predicting the winner. The first challenge is to estimate the score that the first team will reach at
the end of the first innings. Among the feature combinations used to predict the match outcome, the relative strength of Team B
divided by the relative strength of Team A proved successful in measuring and comparing the strengths of the playing teams. With
a Random Forest classifier (RFC), an accuracy of 84% was achieved.
Jhanwar [5] analyzed One Day International matches played from 2006 to 2016 and reported 86% accuracy in showing that batsmen in
the top three positions are the most likely candidates for the man-of-the-match award, which is better than previous research.
Random Forests, Decision Trees, KNN and Logistic Regression were the techniques used to predict player performances in a match.
Yasir [16] proposed a method for predicting match results and explained how it works, using dynamic team properties such as
players' history, weather conditions, ground history and winning percentage. He applied this technique to 100 matches and
obtained 85% prediction accuracy.
2.1 Factors to Anticipate Cricket Winner
Winning a cricket match depends on multiple factors such as batting, bowling, fielding, team performance and player performance.
Predicting the winner of a cricket match is never an easy task. There are always particular aspects or match conditions that may
favor one team and not the other, such as home advantage, key players, pitch conditions and weather conditions [8].
2.2 Cricket Winner Prediction Models
Machine learning has become a vast field that draws on many domains, such as statistics, artificial intelligence, information
technology and others. Many problems can be solved with machine learning models. Machine learning has evolved so much that
machines can now perform some tasks much like a human brain. It is the learning of computers through algorithms that tell the
computer how to learn, which includes finding patterns or similarities in the data using statistical approaches. Machine
learning algorithms have made prediction much easier by using a classification function to relate the values of the attributes
in the dataset [11].
2.2.1. Naïve Bayes
Naïve Bayes works on Bayes' probability theorem with the assumption that all features are independent of one another given the
class label (the predicted variable), which may be a wrong assumption. The Naïve Bayes model is used in conjunction with
recursive feature elimination [10].
2.2.2. Decision Tree Regressor
A Decision Tree Regressor was used to check for overfitting, which occurs when the tree learns from noise in the data. If the
maximum depth of the tree is high, the decision tree regressor picks up details from the noise in the training data. Decision
Tree classification works on the tree-node principle, in which instances are sorted through a system of tree nodes. Through this
hierarchy, a complex decision-making process is broken down into smaller, simpler decisions, which provides a solution that is
easy to implement [9].
2.2.3. Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm that must be trained with predefined output classes. Given training data, the SVM
classifier outputs an optimal hyperplane by which new examples can be categorized. In two dimensions this hyperplane is a line
that divides the plane into two parts, with each class lying on either side [12]. The Support Vector Machine has been reported
to be the most used component classifier with AdaBoost for prediction tasks such as image recognition, medical diagnosis and
facial recognition. The regularization parameter tells the SVM optimization how much misclassification of training examples to
tolerate [3].
2.2.4. Random Forest Classifier
Random Forest is a method used for both regression and classification. To classify a new instance, the input vector is passed
down each of the trees in the forest, and every tree gives a class label (the target variable) as its vote. The class with the
most votes is chosen by the Random Forest classifier. To improve predictive accuracy and control overfitting, Random Forest fits
a number of decision tree classifiers on sub-samples of the dataset and averages their estimates. The sub-samples are the same
size as the original input [15]. Random Forest is a mechanism versatile enough to deal with both supervised classification and
regression tasks. For the datasets under experimentation,
the DBDP approach achieves the accuracy of the original Random Forest with a smaller number of trees, and the reduction in size
achieved is in the range of 52% to 87% [7].
2.2.5. Accuracy Score
To optimize a model's performance, proper feature selection should be ensured when training a generative classifier. To
calculate a model's performance or accuracy, a confusion matrix is used: a matrix that compares the predicted class with the
actual class and feeds into the classification report [4].
3 RESEARCH METHODOLOGY
Methodology is a process in which data is selected, transformed and prepared for the calculations needed to generate useful
insights [13]. The methodology for this research is SEMMA modeling.
nature of data for the forecasting of values of target variable. The last step is the evaluation of the implemented models:
checking the fitness of each model, that is, whether it is overfit or underfit, and comparing the performance of the models
using different statistical techniques. If a model is not appropriate and does not give the best results, different techniques
are tried to make it appropriate.
3.2. Data Visualizations
Visualizations are an important part of any research for understanding the business and the behavior of the data: how different
attributes relate to the target variable and which attributes should be the focus. Visualizations of data give valuable and
meaningful insights. Through visualizations, every end user can easily represent the data as an understandable, interactive
graphic. Cubes will be generated for different aspects of the data. There are various visual analytics tools for creating
visualizations, but as this research has been done in Python, the visualizations will also be made in Python using the
matplotlib library. As the topic of this research is predicting the match winner, all the cubes will relate to how different
attributes of the data interact with the match winner variable.
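The aggregation behind such a cube can be sketched in a few lines of Python. This is a minimal illustration only: the match records, the field names (`venue`, `toss_winner`, `winner`) and the helper functions `wins_by` and `toss_win_rate` are hypothetical and are not taken from the dataset used in this research.

```python
from collections import Counter, defaultdict

# Hypothetical match records; field names and values are illustrative only.
matches = [
    {"venue": "Mumbai", "toss_winner": "MI", "winner": "MI"},
    {"venue": "Mumbai", "toss_winner": "CSK", "winner": "MI"},
    {"venue": "Chennai", "toss_winner": "CSK", "winner": "CSK"},
    {"venue": "Chennai", "toss_winner": "MI", "winner": "CSK"},
    {"venue": "Chennai", "toss_winner": "MI", "winner": "MI"},
]


def wins_by(records, attribute):
    """Count wins per (attribute value, winner) pair: a two-way 'cube'."""
    cube = defaultdict(Counter)
    for row in records:
        cube[row[attribute]][row["winner"]] += 1
    return cube


def toss_win_rate(records):
    """Share of matches in which the toss winner also won the match."""
    same = sum(1 for row in records if row["toss_winner"] == row["winner"])
    return same / len(records)


venue_cube = wins_by(matches, "venue")
for venue, counts in sorted(venue_cube.items()):
    print(venue, dict(counts))
print("toss winner also won:", toss_win_rate(matches))
```

Counts of this kind are what a bar chart would display, for example by passing the team names and their win counts for one venue to `matplotlib.pyplot.bar`.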
The above confusion matrix shows that the Decision Tree model predicted the values of 'winner' with 76.9% accuracy. This may not
be enough for our model, as the XGBoost predictions were over 90% accurate, so we need to fine-tune our Decision Tree model for
better results.
Parameter's Tuning:
As the results of the Decision Tree model did not meet the requirements, we need to fine-tune the parameters of the Decision
Tree. A machine learning model has various parameters that decide how its computations are performed. Usually predictions are
made with the parameters set by default in the model, but in some cases the results are not good enough because of the nature
of the data. If the parameters are set according to the requirements of the data, the computations result in better model
performance. In Decision Tree modeling, the maximum depth value describes how deep the tree can grow. If it is set to a larger
value, the Decision Tree will be deeper and will capture more detail about the data by splitting more. The max_depth has been
set to 33 in this case. The criterion of the Decision Tree was previously Gini, which was not well suited to measuring
information gain; it has now been changed to entropy for measuring impurity and information gain.
After tuning, the Decision Tree model's performance changed: it predicted the winner with 94.87% accuracy. This means that
tuning the parameters has made the model better and more accurate.
4.2 Model’s Implementation – Random Forest Classifier
Random Forest is one of the best machine learning algorithms; it frequently produces good results without parameter tuning and
is very hard to beat in terms of performance. It is very easy to use because its hyperparameters often give good results with
default values. It helps avoid the overfitting problem and works for both classification and regression problems. A Random
Forest is a mixture of multiple Decision Trees that are combined to give better results. The most frequent training method for
Random Forest is bagging, whose idea is to combine learning models to enhance the performance of the overall model for better
predictions. The basic difference between the Decision Tree algorithm and the Random Forest classifier is that in a Decision
Tree some set of rules needs to be set before applying the model to the features and target variable, whereas in Random Forest
there is no need to set any decision rules. Another difference is that deep decision trees sometimes suffer from overfitting,
whereas Random Forest helps prevent it. However, Random Forest can be slower to compute because it consists of many subtrees,
which is not always practical. In addition, Random Forest has various parameters for increasing the model's performance, such
as n_estimators, min_samples_leaf and max_features. The model's speed can be controlled through hyperparameters such as n_jobs;
for example, it can be set to 1 to use only one processor. The random_state hyperparameter makes the model's output
reproducible, and the last one, oob_score, is used for validation. The Random Forest Classifier will be used in this research
according to the nature of our problem.
Parameter's Tuning
Usually Random Forest is like a black box to which inputs are given and from which predictions are obtained without knowing
what computations take place in the process. This black-box classifier has several levers that we can tune to get better
results. Parameter tuning is sometimes necessary to achieve good results; in our research, 71% is not enough, so the parameters
will be tuned with a range of values to get better results. The first parameter tuned for this purpose is the number of
estimators, which is increased from 100 to 1500. This may slow the model by milliseconds but makes the computations more stable
and stronger. n_jobs is set to 1 so that one processor is used at a time, and the maximum depth of the trees has been set to 565.
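To make the bagging and voting idea behind Random Forest concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation used in this research: it uses one-level decision stumps instead of full trees, the toy data and feature meanings are hypothetical, and the function names (`fit_stump`, `fit_forest`, `predict_forest`) are invented for the example. The `n_estimators` and `seed` arguments only mimic the roles of the n_estimators and random_state hyperparameters discussed above.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Pick the (feature, threshold) split with the fewest training errors.

    Each row is (features, label) with label 0 or 1. The stump predicts
    `left` when the feature is <= threshold, otherwise `right`.
    """
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for threshold in sorted({x[f] for x, _ in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    1 for x, y in rows
                    if (left if x[f] <= threshold else right) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left, right)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, left, right = stump
    return left if x[f] <= threshold else right


def fit_forest(rows, n_estimators, seed):
    """Bagging: fit each stump on a bootstrap sample of the original size."""
    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    forest = []
    for _ in range(n_estimators):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(fit_stump(sample))
    return forest


def predict_forest(forest, x):
    """Every stump votes for a class; the class with the most votes wins."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (recent_win_rate, home_ground) -> 1 if the team won.
data = [
    ((0.2, 0), 0), ((0.3, 0), 0), ((0.35, 1), 0), ((0.4, 0), 0),
    ((0.6, 1), 1), ((0.65, 0), 1), ((0.7, 1), 1), ((0.8, 1), 1),
]

forest = fit_forest(data, n_estimators=25, seed=42)
print(predict_forest(forest, (0.75, 1)), predict_forest(forest, (0.15, 0)))
```

Using an odd number of estimators avoids tied votes in this two-class setting. Adding more estimators makes the majority vote less sensitive to any single unlucky bootstrap sample, at the cost of more computation.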
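The Decision Tree tuning step described earlier switches the split criterion from Gini to entropy. As a hedged illustration of what these two criteria measure, the sketch below computes Gini impurity, Shannon entropy and the resulting information gain for one split. The node contents and the function names are hypothetical and chosen for the example only; they are not values from this research.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical node: 4 wins ('W') and 4 losses ('L'), split into two children.
parent = ["W"] * 4 + ["L"] * 4
left = ["W", "W", "W", "L"]
right = ["W", "L", "L", "L"]

print("gini(parent)    =", gini(parent))
print("entropy(parent) =", entropy(parent))
print("gain (entropy)  =", round(information_gain(parent, [left, right]), 4))
print("gain (gini)     =", round(information_gain(parent, [left, right], gini), 4))
```

Both criteria are zero for a pure node and largest for an evenly mixed one; they differ in how strongly they reward partially pure splits, which is why changing the criterion can change the tree that is learned.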
computation needs to perform very well, data needs to meet all the business problems and business systems. The prediction of the
winner produced through this project required a lot of domain knowledge and expertise regarding the observations and their
relation to the winning team.
ACKNOWLEDGMENT
First of all, we would like to express our deepest gratitude to God, who has given us the knowledge and wisdom to finish this
research. We thank our families and colleagues, who have always supported and encouraged us to work hard.
REFERENCES
[1] Ahmed, W. & Nazir, K., 2015. A Multivariate Data Mining Approach to Predict Match Outcome in One-Day International Cricket. 10.13140/RG.2.2.30683.46880.
[2] Bailey, M. & Clarke, S. R., 2006. Predicting the Match Outcome in One Day International Cricket Matches, while the Game is in Progress. Department of Epidemiology & Preventive Medicine, Monash University, Australia; Swinburne University of Technology, Melbourne, Australia.
[3] Firat, U. S., Vargeloğlu, O. B. & Bingol, S., 2016. A Literature Review of Adabost and SVM Techniques. 3rd International Management Information Systems Conference, İzmir, Turkey.
[4] Hossin, M. & Sulaiman, M., 2015. A Review on Evaluation Metrics for Data Classification Evaluations. International Journal of Data Mining & Knowledge Management Process (IJDKP), 5(2).
[5] Jhanwar, G. M., 2017. Quantitative Assessment of Player Performance and Winner Prediction in ODI Cricket. International Institute of Information Technology, Hyderabad - 500032, India.
[6] Jhawar, M. G., Viswanadha, S., Sivalenka, K. & Pudi, V., 2017. Dynamic Winner Prediction in Twenty20 Cricket: Based on Relative Team Strengths. Machine Learning for Sports Analytics at ECML-PKDD.
[7] Kulkarni, V. & Sinha, P., n.d. Effective Learning and Classification using Random Forest Algorithm. International Journal of Engineering and Innovative Technology (IJEIT).
[8] Lokhande, A., Chawan, R. & Pramila, S., 2018. Prediction of Live Cricket Score and Winning. Computer and IT Dept, Veermata Jeejabai Technological Institute, Mumbai, India, 5(4) (2394-9333).
[9] Mitchell, T. M., 1997. Machine Learning. Burr Ridge, IL: McGraw Hill, 45, 1997.
[10] Murphy, K. P., 2006. Naive Bayes Classifiers. University of British Columbia.
[11] Nasteski, V., 2007. An Overview of the Supervised Machine Learning Methods. Faculty of Information and Communication Technologies.
[12] Patel, S., n.d. Machine Learning 101 [Online].
[13] Available at: https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72
[14] Shah, P. & Shah, M., 2015. Predicting ODI Cricket Result. ISSN (Paper) 2312-5187, ISSN (Online) 2312-5179, An International Peer-reviewed Journal, Volume 5.
[15] Asare-Frempong, J. and Jayabalan, M., 2017. Predicting customer response to bank direct telemarketing campaign. In 2017 International Conference on Engineering Technology and Technopreneurship (ICE2T) (pp. 1-4). IEEE.
[16] Yasir, M. et al., 2017. Ongoing Match Prediction in T20 International. IJCSNS International Journal of Computer Science and Network Security.