0% found this document useful (0 votes)
20 views6 pages

Ipl 2

The document discusses models to predict cricket scores and classify players. It describes using linear regression and MLPRegressor to predict scores after 5 overs of an ODI match based on features like runs scored, wickets fallen, and run rates. For player classification, it uses KNN, SVM, Naive Bayes and MLPClassifier on player statistics to classify players for limited overs or test matches.

Uploaded by

soyff7spxf
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views6 pages

Ipl 2

The document discusses models to predict cricket scores and classify players. It describes using linear regression and MLPRegressor to predict scores after 5 overs of an ODI match based on features like runs scored, wickets fallen, and run rates. For player classification, it uses KNN, SVM, Naive Bayes and MLPClassifier on player statistics to classify players for limited overs or test matches.

Uploaded by

soyff7spxf
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

237

INTERNATIONAL JOURNAL OF SCIENTIFIC & ENGINEERING RESEARCH, VOLUME 9, ISSUE 8, AUGUST-2018


ISSN 2229-5518

Score Prediction and Player Classification Model


in the Game of Cricket Using Machine Learning
Sonu Kumar, Sneha Roy

Abstract— Score prediction is something we always try in our sports life. An early prediction is always helpful for the team management to
work on their plans quickly. Generally, prediction is always biased but this model attempts to predict the innings total in ODI after the first 5
over of the match.Player’s selection is one of the most important tasks for any sports and thus cricket is not any exception. The
performance of the player depends on various factors such as current form, record against the opposition, nature of the pitch, format of the
game, venue etc. The team management and captain select 11 players out of 15-16 squad members. This model classifies the players
based on their stats that who should play the limited over and which player should play in the test format of the game.For predictive model
Linear Regression and MLPRegressor have been used and for classification KNN, SVM, Naïve Bayes, and MLPClassifier have been used.

Keywords— Cricket, KNN, Linear Regression, MLP, Naïve Bayes, Prediction Model, SVM.

——————————  ——————————

1 INTRODUCTION

IJSER
Cricket is like Religion in our Country. Every fan tries to
predict the score and also they want the playing 11 according I ask to Google about it but didn’t understand a single line
to their choice. Cricket is increasingly popular among the when I was in class 8-9. Later, I studied about it and came with
statistical science community, but the unpredictable and these models and my implementation is still in process on the
inconsistent natures of this game make it challenging to apply different datasets of Cricket (including Test and T20s). There
in common probability models. However, numerous are numerous factors that can affect a cricket game’s score. In
researchers successfully applied various statistical methods to Cricket, It is believed that wickets in hand and current run rate
cricket data. I got inspiration from WASP (Winning and Score are very important factor to get a good total.
Prediction) model that was developed by Mr. Samuel back in Like in many sports, ODI cricket has both controllable and
2011. ICC used the same model in 2012, and I noticed that for uncontrollable variables. Playing combination, in and out field
the first time in a match against India vs. New Zealand that tactics including aggressive and offensive playing behaviors
WASP was predicting 282 according to model and if run rate may be considered controllable variables. However, coin-toss
would have been used then the score would have been 226 (go result is the main uncontrollable variables in the ODI format.
through the image) .
2 DATA AND TOOLS
I obtained all the data from different sources on the internet
like www.cricinfo.com, www.cricbuzz.com and Wikipedia
using web scrapping application that is developed by me in
Node JS using “Cheerio” module. The data (for predictive
modeling) contains the matches’ information between the
periods of 2006 to 2017. The data has many crucial factors
which are important for the prediction of the inning total.

Fig. 1: WASP Model Prediction


Fig. 2: ODI Dataset
————————————————
 Sonu Kumar is currently pursuing bachelors degree program in computer
science and engineering in Dr. B. C. Roy Engineering College, Durgapur,
India, PH-8609751381. E-mail: [email protected] For classification of players, I have considered Indian team
 Sneha Roy is currently pursuing bachelors degree program in computer squad statistics which is manually prepared from the ESPN
science and engineering in Dr. B. C. Roy Engineering College, Durgapur, cricinfo site.
India, PH-8944921593. E-mail: [email protected]

IJSER © 2018
http://www.ijser.org
INTERNATIONAL JOURNAL OF SCIENTIFIC & ENGINEERING RESEARCH, VOLUME 9, ISSUE 8, AUGUST-2018 238
ISSN 2229-5518

I have used 9 crucial features to predict the ODI inning total of


a cricket match. The features are as follows:-
1. Current Runs Scored( at least 5 over of play)
2. Current Wickets fall
3. Current over
4. Current Run rate
Fig. 3: Player Career Stats. 5. Runs scored in last 5 over
6. Wickets fell in last 5 over
You can find the data on my Google Drive as follows:-
7. Run rate in last 5 over
https://drive.google.com/open?id=16JoFayeALv_J7WQ9znA7T
8. Max(striker_runs_scored, non_striker_runs_scored)
lTi8J4McUio
9. Min(striker_runs_scored, non_striker_runs_scored)
3 IMPLEMENTATION
All the above features are self-explanatory except the feature 8
and 9. This is an important feature because that matters as
3.1. PREDICTION MODEL both the batsman are settled or new on the crease. The settled
Currently we see that ICC uses Current run rate for the inning batsmen always score more and thus the inning total will be
total and following image is the reason that why ICC should more.
not use the existing system!! Run rate is no way near the best Calculation of Current Run rate:-
feature for the prediction of the inning total. There are many Run rate= (Runs scored/Total no. of over bowled)
crucial factors that are must for the prediction. Before all that, The model has a custom function which has a window of 20

IJSER
we know that Prediction is always bias and Cricket is a game runs. If the predicted scores matches the Actual scores then
of uncertainty. model is acceptable. The function is as follows:-

Fig. 6: Accuracy Calculating Function

3.1.1. REGRESSION ANALYSIS

Multiple linear regression (MLR) is used to determine a


mathematical relationship among a number of random
variables. In other terms, MLR examines how multiple
independent variables are related to one dependent variable.
Once each of the independent factors have been determined to
Fig. 4: Run Rate Based Score Prediction predict the dependent variable, the information on the
multiple variables can be used to create an accurate prediction
on the level of effect they have on the outcome variable. The
model creates a relationship in the form of a straight line
(linear) that best approximates all the individual data points.

The model for multiple linear regression is: yi = B0 + B1xi1 +


B2xi2 + ... + Bpxip + E.

The multiple regression model is based on the following


assumptions:

Fig. 5: Run Rate vs. Total Run


IJSER © 2018
http://www.ijser.org
INTERNATIONAL JOURNAL OF SCIENTIFIC & ENGINEERING RESEARCH, VOLUME 9, ISSUE 8, AUGUST-2018 239
ISSN 2229-5518

 There is a linear relationship between the dependent 3.1.3. RESULT


variables and the independent variables TABLE 1
 The independent variables are not too RESULT TABLE OF PREDICTION MODELS
highly correlated with each other
Match Date Result
 yi observations are selected independently and
randomly from the population
Actual Predicted Score
 Residuals should be normally distributed with
Score
a mean of 0 and variance σ MLP LR
India vs. 02/04/2011 274 256 250
Sri Lanka
(Sri Lanka)

Zimbabwe 20/07/2018 399 390 418


vs.
Pakistan
(Pakistan)
Zimbabwe 22/07/2018 364 359 377
vs.
Pakistan
(Pakistan)
West Indies 23/07/2018 231 222 243
Fig. 7: Linear Regression Algorithm Modeling
vs.

IJSER
Bangladesh
3.1.2. MULTILAYER PERCEPTRON (West
A multilayer perceptron (MLP) is a feed forward artificial Indies)
neural network model that maps sets of input data onto a set
of appropriate outputs. 3.2. CLASSIFICATION MODEL
Constructor Parameters
In Indian Cricket, There are players who play Test cricket only
 inputLayerFeatures (int) - the number of input layer and players who play Limited over only (exception of players
features who play both the formats). Pujara is comfortable in the Test
Cricket only because his strike rate is low, patience is high,
 hiddenLayers (array) - array with the hidden layers
and temperament is of that level. Virat Kohli can play all the
configuration, each value represent number of neurons in formats of the game because his statistics shows everything.
each layers This classification model classifies the players based on their
 classes (array) - array with the different training set stats that which player should play the Test and Limited over
classes (array keys are ignored) Cricket.
 iterations (int) - number of training iterations
 learningRate (float) - the learning rate
 activationFunction (ActivationFunction) - neuron
activation function

Fig. 9: Virat Kohli Stats.

There are several factors which classifies the players (in both
Fig. 8: Multilayer Perceptron Algorithm Modeling ODI and Test) as follows:-
IJSER © 2018
http://www.ijser.org
INTERNATIONAL JOURNAL OF SCIENTIFIC & ENGINEERING RESEARCH, VOLUME 9, ISSUE 8, AUGUST-2018 240
ISSN 2229-5518

1. Total No. of Balls played in his career 3.2.2. SUPPORT VECTOR MACHINES
2. Total No. of Centuries Scored
3. Total Average Vladimir Vapnik, Bernhard Boser and IsabellGuyon
introduced the concept of support vector machine in their
4. Total Strike Rate
paper. SVMs are highly accurate and less prone to overfitting.
5. Total No. of Not outs SVMs can be used for both numeric prediction and
classification. SVM transforms the original data into a higher
The above factors are self explanatory for a cricket fan. A test dimension using a nonlinear mapping. It then
player always plays more no. of balls, scored bigger runs, has searches for a linear optimal hyperplane in this new
more average, low strike rate, more no. of not outs (less in rare dimension separating the tuples of one class from another.
cases) than a ODI player. With an appropriate mapping to a sufficiently high dimension,
The target has 3 classes as follows:- tuples from two classes can always be separated by a
1. Test only- 1 hyperplane. The algorithm finds this hyperplane using
2. ODI only – 2 support vectors and margins defined by the support vectors.
The support vectors found by the algorithm provide a
3. Test and ODI both – 3
compact description of the learned prediction model. A
separating hyperplane can be written as:
For generating the classification models, we used supervised
W∙X+b=0
machine learning algorithms. In supervised learning
algorithms, each training tuple is labeled with the class to where W is a weight vector, W = {w1, w2, w3,..., wn}, n is the
which it belongs. We used Naïve Bayes, K-Nearest Neighbors, number of attributes and b is a scalar often referred to as a
Multilayer Perceptron Classifier and Multiclass Support Vector bias. If we input two attributes A1 and A2, training tuples are
Machines for our experiments. These algorithms are explained 2-D, (e.g., X = (x1, x2)), where x1 and x2 are the values of

IJSER
in brief. attributes A1 and A2, respectively. Thus, any points above the
separating hyperplane belong to Class A1:
3.2.1. NAÏVE BAYES W∙X+b>0
Bayesian classifiers are statistical classifiers that predict the and any points below the separating hyperplane belong to
probability with which a given tuple belongs to a particular Class A2:
class. Naïve Bayes classifier assumes that each attribute has its W∙X+b<0
own individual effect on the class label, independent of the
values of other attributes. This is called class-conditional
independence. Bayesian classifiers are based on Bayes’
theorem.
Bayes Theorem: Let X be a data tuple and C be a class label.
Let X belongs to class C, then
P(C|X) = P(X|C)P(C) / P(X)
where;
• P(C|X) is the posterior probability of class C given predictor Fig. 11: Support Vector Machines Algorithm Modeling
X.
• P(C) is the prior probability of class. 3.2.3. K-NEAREST NEIGHBORS
• P(X|C) is the posterior probability of X given the class C.
In pattern recognition, the k-nearest neighbors algorithm (k-
• P(X) is the prior probability of predictor. NN) is a non-parametric method used
The classifier calculates P(C|X) for every class Ci for a given for classification and regression. In both cases, the input
tuple X. It will then predict that X
consists of the k closest training examples in the feature space.
belongs to the class having the highest posterior probability, The output depends on whether k-NN is used for
conditioned on X. That is X belongs
classification or regression.
to class Ci if and only if
P(Ci|X)> P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. To determine which of the K instances in the training dataset
are most similar to a new input a distance measure is used.
For real-valued input variables, the most popular distance
measure is Euclidean distance.
Euclidean distance is calculated as the square root of the sum
of the squared differences between a new point (x) and an
existing point (xi) across all input attributes j.

Fig. 10: Naïve Bayes Algorithm Modeling Euclidean Distance(x, xi) = sqrt( sum( (xj – xij)^2 ) )
IJSER © 2018
http://www.ijser.org
INTERNATIONAL JOURNAL OF SCIENTIFIC & ENGINEERING RESEARCH, VOLUME 9, ISSUE 8, AUGUST-2018 241
ISSN 2229-5518

When KNN is used for classification, the output can be


calculated as the class with the highest frequency from the K-
most similar instances. Each instance in essence votes for their
class and the class with the most votes is taken as The first is a hyperbolic tangent that ranges from -1 to 1, while
the prediction. the other is the logistic function, which is similar in shape but
ranges from 0 to 1. Here yi is the output of the ith node
(neuron) and vi is the weighted sum of the input connections.
Class probabilities can be calculated as the normalized
Alternative activation functions have been proposed,
frequency of samples that belong to each class in the set of K
including the rectifier and soft plus functions. More
most similar instances for a new data instance. For example, in
specialized activation functions include radial basis functions
a binary classification problem (class is 0 or 1):
(used in radial basis networks, another class of supervised
neural network models).
p(class=0) = count(class=0) / (count(class=0)+count(class=1))
Learning occurs in the perceptron by changing connection
If you are using K and you have an even number of classes weights after each piece of data is processed, based on the
(e.g. 2) it is a good idea to choose a K value with an odd amount of error in the output compared to the expected result.
number to avoid a tie. And the inverse, use an even number This is an example of supervised learning, and is carried out
for K when you have an odd number of classes. through back propagation, a generalization of the least mean
squares algorithm in the linear perceptron.
Here, Cluster of “Total balls played”, “No. of centuries”,
“strike rate” etc. can be mapped by KNN as it picks the

IJSER
nearest neighbor. KNN has application in recommendation
system mostly.

Fig. 12: K-Nearest Neighbors Algorithm Modeling

3.2.4. MULTILAYER PERCEPTRON CLASSIFIER


A multilayer perceptron (MLP) is a class of feed
forward artificial neural network. An MLP consists of at least
three layers of nodes. Except for the input nodes, each node is
a neuron that uses a nonlinear activation function. MLP
utilizes a supervised learning technique called back
propagation for training. Its multiple layers and non-linear
activation distinguish MLP from a linear perceptron. It can Fig. 13: Multilayer Perceptron Classifier Algorithm Modeling
distinguish data that is not linearly separable.
If a multilayer perceptron has a linear activation function in all
neurons, that is, a linear function that maps the weighted
inputs to the output of each neuron, then linear algebra shows
that any number of layers can be reduced to a two-layer input-
output model. In MLPs some neurons use
a nonlinear activation function that was developed to model
the frequency of action potentials, or firing, of biological
neurons.
The two common activation functions are both sigmoid, and
are described by
IJSER © 2018
http://www.ijser.org
INTERNATIONAL JOURNAL OF SCIENTIFIC & ENGINEERING RESEARCH, VOLUME 9, ISSUE 8, AUGUST-2018 242
ISSN 2229-5518

3.2.5. RESULT
[3] Multilayer Perceptron, Wikipedia
https://en.wikipedia.org/wiki/Multilayer_perceptron\

[4] Support Vector Machine, Wikipedia


https://en.wikipedia.org/wiki/Support_vector_machine

[5] K-Nearest Neighbors, Machine Learning Mastery


https://machinelearningmastery.com/k-nearest-neighbors-for-
machine-learning/

Fig. 14: Efficiency of Different Classification Algorithms

4 CONCLUSION

The main limitation in carrying out this project was the limited

IJSER
dataset, which I had at my disposal. The next logical step in
the direction to improve the accuracy of prediction problem at
hand would be to test out the approaches and various
methodologies proposed in this paper using a larger and more
representative dataset. Also I would like to extend the features
like Weather condition, Nature of the pitch and Venue. The
accuracy will be even higher if Deep Neural network
(Tensorflow, Keras and Thaeno) comes into the
implementation. Not only this, Similar model can be
developed for other sports like Tennis, NBA and Football and
newer sports like Pro Kabaddi League.

ACKNOWLEDGMENT

First of all, I am thankful to International Journal of Scientific


and Engineering Research to provide me a platform to display
my research.
I would also like to extend my heartiest thankfulness to Mr.
Chandan Kumar Verma for guiding me through the basics of
machine learning so that I could come up with implementing
this project. Through this project I have reflected on important
aspects of Prediction and Classification Modeling, which
when Predicted with the right amount of accuracy and
classified with a very good efficiency, can help a lot in terms of
prediction modeling and can prove to be extremely crucial in
predicting the Inning score and players classification based on
their stats.

REFERENCES

[1] ESPN cricinfo


http://www.espncricinfo.com/

[2] Linear Regression, Wikipedia


https://en.wikipedia.org/wiki/Linear_regression
IJSER © 2018
http://www.ijser.org

You might also like