IPL Score Prediction

Table of Contents
Introduction
Literature Review
Dataset
Data Preprocessing
Models and Implementation
    Linear Regression
    K Nearest Neighbors Regressors
    Random Forest Regressors
    Recurrent Neural Networks using LSTM
Comparison of Models
Conclusion
References

Introduction
The Indian Premier League (IPL) is a professional Twenty20 cricket league in India contested
during March or April and May of every year by eight teams representing eight different cities in
India[11]. It is a very popular sporting event with a huge fan following across the globe. This
growing popularity of the sport has influenced research on different aspects of the game.
Machine Learning has long been used in the field of sports analytics. Different aspects of the
game have been studied continuously, and efforts have been made to make sense of the data to
improve the performance of teams as well as individuals. We aim to create a Machine Learning
model that predicts the final score of an innings given a set of input parameters.
This information can be crucial for team management when making decisions during a match.
For example, the prediction can help decide the playing XI for a match. It can also inform the
choice of batsman or bowler to use in the match.
Currently, analysts use metrics such as the projected run rate to predict the final score. They
apply different run rates to the current score to project the outcome. These models, though
sometimes effective, are very naïve and do not take the influence of external factors into
account. Machine learning models can identify the features in the dataset that carry more
variance and observe patterns to generate better outputs. The model we propose takes into
consideration additional features such as the current batsman, the bowler, the runs scored so
far and the venue of the match to predict the final score of an innings.
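
The projected run rate baseline mentioned above can be sketched as follows. This is a minimal
illustration rather than the analysts' actual tool: the function name projected_final_score, the
120-ball innings length and the sample numbers are assumptions made for the example.

```python
def projected_final_score(current_score, balls_bowled, run_rate=None, total_balls=120):
    """Extrapolate an innings total from the current score.

    run_rate is in runs per over. If it is omitted, the current run rate
    (runs scored so far divided by overs bowled so far) is used.
    """
    if not 0 < balls_bowled <= total_balls:
        raise ValueError("balls_bowled must be between 1 and total_balls")
    overs_bowled = balls_bowled / 6
    if run_rate is None:
        run_rate = current_score / overs_bowled
    overs_left = (total_balls - balls_bowled) / 6
    return current_score + run_rate * overs_left


# Hypothetical position: 80 runs after 10 overs, projected at the current rate.
print(projected_final_score(80, 60))
# Same position, assuming the batting side accelerates to 10 runs per over.
print(projected_final_score(80, 60, run_rate=10))
```

The projection depends only on the score and the balls bowled, which is why it cannot react to
the batsman, bowler or venue.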

Literature Review
This project is closely related to the idea presented in the paper Live Cricket Score and
Winning Prediction by Rameshwari Lokhande and P.M. Chawan[3], which has been its starting
point. In this paper, the authors present the idea of live prediction for a match in progress.
The paper aims at predicting the final score as well as the winning probabilities of the teams.
The authors consider factors such as the number of wickets fallen, the venue of the match, the
ranking of the teams, the pitch report and home team advantage, and go on to describe the
importance of each of them. These factors have been considered during feature set selection in
our project. The authors then proceed with the implementation of Linear Regression, Naïve
Bayes Classifier and Reinforced Learning Algorithm models.

The problem of live prediction of the final score from previous outcomes is analogous to the
problem of predicting stock prices, which is addressed in the paper Predicting Stock Prices using
LSTM[9]. RNNs are powerful tools for processing sequential data. LSTM introduces a memory cell
to capture dynamic changes in the data over a period. The paper presents different LSTM models
with different topologies and training methods. It starts with pre-processing of the data, then
performs feature extraction, and then describes the methodologies used to train the models and
select the optimum one. The models are used to predict closing values of stocks using NIFTY’s
trading dataset. The authors use an RNN model with two LSTM layers and two dense layers and
evaluate the performance of the models using the Root Mean Squared Error (RMSE) metric. We
plan to adopt a similar methodology while implementing the LSTM model to predict the final score.

Dataset
The dataset has been obtained from the repository Indian Premier League (Cricket)[1] hosted
on Kaggle. The repository has no usage constraints. The dataset comprises two csv files,
“matches.csv” and “deliveries.csv”. The characteristics of the individual files are as below:
1. Matches.csv:
This data file contains records of all matches played in the IPL from season 2008 to
2017. It has 18 features, covering the names of the teams, the venue of the match, the
outcome, the umpires and other details pertaining to the matches played. There are 636
entries in the data file.

Figure 3.1

2. Deliveries.csv:
This data file contains a record of every delivery bowled in each of the matches. The
records are chronologically arranged. The data includes 23 features, including the
outcome of every delivery, the number of runs scored and the way the runs were scored.
There are 150460 entries in the data file.

Figure 3.1

In addition to the above-mentioned files, we create a new file with records of the teams and
numerically encode names of the teams in the matches and deliveries files.

Data Preprocessing
The dataset files are initially loaded using the Python Pandas library. After loading the
dataset files, we check for ‘nan’ values in the data and replace them with blank strings. Unless
told otherwise, pandas represents missing values as NaN, which has the float data type. Many
cells in the dataset are blank for columns like fielder, since not every delivery results in a
catch.

Figure 4.1

The data contains multiple features with categorical values, such as the names of the batsman
and the bowler. For our models to function effectively, we first perform label encoding to
convert the categorical data into numerical data.
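
As an illustration of label encoding, the sketch below maps each distinct label to an integer
code. It is a hand-written stand-in, not the code used in the project: the helper names
fit_label_encoder and encode are hypothetical, and assigning codes in sorted order is an
assumption chosen to make the mapping reproducible.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every label by its integer code."""
    return [mapping[value] for value in values]


# Team names taken from the report; the column itself is a toy example.
batting_team = ["Mumbai Indians", "Kolkata Knight Riders", "Mumbai Indians"]
mapping = fit_label_encoder(batting_team)
print(mapping)
print(encode(batting_team, mapping))
```

The integer codes carry no natural order, which is worth keeping in mind for distance-based
models such as KNN.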
After encoding these features, we perform PCA to assess the variance of the different features
in the data. Based on the PCA, we conclude that the features 'is_super_over',
'wide_runs', 'bye_runs', 'legbye_runs', 'noball_runs', 'penalty_runs', 'batsman_runs',
'extra_runs', 'player_dismissed', 'dismissal_kind', 'fielder' do not contribute much towards our
algorithm. Hence, we drop these fields to reduce the dimensionality of the data and to make
our models less complex.
In addition to the above-mentioned steps, we perform feature extraction to derive the two
additional features that we require to predict the final score: score and final_score. The
feature score denotes the total runs scored by the team up to the end of the current record in
the data, and final_score is the final batting score of the innings.
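
The two derived features can be sketched as a running total per innings. This is only an
illustration under stated assumptions: the field names match_id, inning and total_runs are
assumed to identify the innings and the runs from each delivery, and the function name
add_score_features and the sample records are hypothetical.

```python
from collections import defaultdict


def add_score_features(deliveries):
    """Return copies of the delivery records with 'score' and 'final_score' added.

    The records must be in chronological order within each innings.
    """
    running = defaultdict(int)
    enriched = []
    for delivery in deliveries:
        key = (delivery["match_id"], delivery["inning"])
        running[key] += delivery["total_runs"]
        enriched.append({**delivery, "score": running[key]})
    # After the first pass, running[key] holds the innings total.
    for row in enriched:
        row["final_score"] = running[(row["match_id"], row["inning"])]
    return enriched


sample = [
    {"match_id": 1, "inning": 1, "total_runs": 1},
    {"match_id": 1, "inning": 1, "total_runs": 4},
    {"match_id": 1, "inning": 1, "total_runs": 0},
    {"match_id": 1, "inning": 2, "total_runs": 6},
]
rows = add_score_features(sample)
print([(row["score"], row["final_score"]) for row in rows])
```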

Figure 4.2

Figure 4.3

After pre-processing the data, we set aside the records of a single match to evaluate all the
optimum models together at the end. In our case we have set aside the records
corresponding to Match Id = 7, the match between Mumbai Indians and Kolkata Knight
Riders played in the 2017 season.
After setting aside the evaluation data, we extract the target values from the data and then
split the data into train and test sets in the ratio 4:1. At the end we have train data X_train
and y_train with 120100 records and test data X_test and y_test with 30025 records.
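
The hold-out and the 4:1 split can be sketched as below. The reported sizes are consistent with
such a split: 120100 + 30025 = 150125 records, of which 30025 is exactly one fifth. The function
name holdout_and_split, the random shuffle and the seed are assumptions for illustration; the
report does not state how the records were assigned to the two sets.

```python
import random


def holdout_and_split(records, holdout_match_id, test_fraction=0.2, seed=0):
    """Set one match aside, then split the remaining records into train and test."""
    evaluation = [r for r in records if r["match_id"] == holdout_match_id]
    remaining = [r for r in records if r["match_id"] != holdout_match_id]
    # Shuffle a copy-free list in place with a fixed seed so the split is repeatable.
    random.Random(seed).shuffle(remaining)
    n_test = round(len(remaining) * test_fraction)
    return remaining[n_test:], remaining[:n_test], evaluation


# Toy data: 11 matches with 10 deliveries each.
records = [{"match_id": m, "ball": b} for m in range(1, 12) for b in range(10)]
train, test, evaluation = holdout_and_split(records, holdout_match_id=7)
print(len(train), len(test), len(evaluation))
```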

Models and Implementation
For this project, we have implemented four models: Linear Regression, K Nearest Neighbors
Regressor, Random Forest Regressor and Recurrent Neural Networks using LSTM. We have
selected four regression models of increasing complexity and will later compare them to
identify the tradeoff between performance and complexity. To evaluate the performance of our
regression models, we consider two metrics: R2 score and Mean Squared Error (MSE)[10].
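
For reference, the two metrics can be written out directly from their standard definitions. The
sketch below is a plain-Python illustration with made-up scores, not the library code used in the
project.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical final scores and predictions.
y_true = [150, 160, 170]
y_pred = [155, 160, 165]
print(mean_squared_error(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

An R2 score of 1 means perfect predictions, while a model that always predicts the mean of the
targets scores 0. MSE is expressed in squared runs, so it depends on the scale of the target.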

1. Linear Regression
Linear Regression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual
sum of squares between the observed targets in the dataset and the targets predicted by the
linear approximation[5]. Since Linear Regression does not have any hyperparameters to tune,
we simply create a model and fit it to our training data. We then evaluate the model by
predicting values on the test data. The evaluation results are as below:

Train/Test          R2 Score               Mean Squared Error
Train Data          0.210787129937         678.9248990081413
Cross Validation    0.2105561586301486
Test Data           0.21666095930040563    674.876222704071
Figure 5.1

Looking at the performance of the model on both train and test data, with an R2 score of about
0.21, we can see that a linear model captures little of the relationship between the input
features and the final score. Thus, we need a more complex model that can capture the
non-linear relationship between them.

2. K Nearest Neighbors Regressor


The K Nearest Neighbors Regressor predicts the value of the target from the values of the k
nearest neighbors, based on distance. KNN performs better when the data points that lead to
the same final score lie close together in our n-dimensional space. To identify the optimum
model, we evaluate the training error and the cross-validation error of the models for values
of k from 1 to 20.

Figure 5.2

In addition to the value of k, we set the parameter weights=”distance”, which weights points by
the inverse of their distance. With this weighting, every training point is its own nearest
neighbor at distance zero, and thus all the models have accuracy (R2 score) 1 on train data.
We can see that the Cross-Validation Accuracy increases until k=2 and then decreases. We infer
that the optimum model is the one with k=2; beyond that point, averaging over more neighbors
smooths the predictions too much and the Cross-Validation Accuracy decreases further.
We then evaluate the performance of this optimum model on test data.
Configuration: k=2, weights=distance
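
The effect of inverse-distance weighting can be seen in a small hand-written sketch. It is not
the library implementation used in the project: the function name knn_predict, the Euclidean
distance and the one-dimensional toy data are assumptions for illustration.

```python
import math


def knn_predict(X_train, y_train, x, k=2):
    """Inverse-distance-weighted average of the k nearest training targets."""
    nearest = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))[:k]
    # A query that coincides with training points takes the mean of their targets,
    # which avoids dividing by a zero distance.
    exact = [y for d, y in nearest if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


X_train = [(0.0,), (1.0,), (3.0,)]
y_train = [100, 140, 200]
print(knn_predict(X_train, y_train, (1.0,)))  # a training point
print(knn_predict(X_train, y_train, (2.0,)))  # midway between two neighbors
```

A query that coincides with a training point returns that point's own target, which is why the
score on train data is 1 regardless of k, provided no two training records share the same
features with different targets.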

Figure 5.3

Test / Train        R2 Score               MSE
Train               1                      0
Cross Validation    0.8711764340407484
Test                0.9044300211769578     82.33715283027406
Figure 5.4

Looking at the performance of the model, we infer that in the dataset, the combinations of
features that lead to a similar final score lie close together in the n-dimensional space.

3. Random Forest Regressor


A random forest is a meta estimator that fits a number of decision tree regressors on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting[7]. A random forest uses several base decision tree estimators, each trained on a
random sample of the data using a random subset of the features. Random forests are effective
at reducing variance, without a large increase in bias, by averaging many simple base models.
To identify the optimum model, we have tried multiple configurations using
combinations of the parameters max_depth and n_estimators. The performance of the
different configurations is given below.

Number of     Maximum    Training Accuracy      Cross Validation
Estimators    Depth                             Accuracy
10            10         0.4840049028425052     0.47693872639379864
10            100        0.9864901860876021     0.9286533825286192
50            10         0.491970066893198      0.4813928338003521
50            100        0.9927607632575627     0.9443133126256662
100           10         0.4901200917987477     0.4844173565056498
100           100        0.9933213438845026     0.9458115557919676
1000          10         0.4915126219793772     0.485660999465179
1000          100        0.9938383645011852     0.9468169475827226

Figure 5.5

Figure 5.6

Looking at the figures, we identify the model with the highest Cross Validation Accuracy, i.e.
the model with 1000 estimators and a maximum depth of 100, as the optimum model. Evaluating
this model on the test data, we get the results below:

Test / Train        R2 Score               MSE
Train               0.9938383645011852
Cross Validation    0.9468169475827226
Test                0.9567427701315365     37.267740252289755
Figure 5.7

The Random Forest Regressor performs better than the models discussed previously and performs
well on the test data too. It also gives us an insight into the importance of the features
within the data, which is strikingly similar to what we obtained using PCA. Features such as
the venue of the match, the batsman on strike and the bowler have higher importance than other
factors.

Figure 5.8

4. Recurrent Neural Networks (RNNs) using LSTM layers


RNNs are a class of artificial neural networks that can use their internal state to process
variable-length input sequences. These models work well on temporal data. As discussed in the
Literature Review, our problem is analogous to stock market prediction and thus RNNs using
Long Short-Term Memory layers can be used for prediction.
The input data to the Keras models is reshaped to include a “timesteps” dimension for the LSTM
layers.

Performance before normalization of data: The models used so far were trained on un-normalized
data and performed well. The LSTM model, however, initially performed poorly on the
un-normalized data. For multiple configurations, with different combinations of numbers of
nodes and layers, the models always converged to an MSE of about 660 with an R2 score of about
0.26. The models were tested with different optimizers (‘adam’, ‘SGD’) and different learning
rates.
Performance after normalization of data: The performance of the models changed drastically
when we normalized the data. After running multiple trials, we also observed that the model
converges quickly, within 50 epochs, which helped in identifying the ideal value of the epochs
hyperparameter. This inference is supported by the Training Loss vs Epochs graph. Also, after
multiple trials with different activation functions, we observed that the models perform
better when we use ‘relu’ activation for the dense layers.
Code Excerpt of the optimum model during training:

The different configurations tested to identify the optimum model are as below:

Number of      Nodes in LSTM    Number of       Nodes in Dense    MSE
LSTM Layers    Layers           Dense Layers    Layers
1              50               2               50                0.0057
1              100              2               50                0.0048
1              50               0               50                0.0124
1              100              0               100               0.0123
2              50               0               50                0.0120
2              100              0               100               0.0121
1              100              3               100               0.0022
1              200              3               100               0.0020
1              200              3               200               0.0018
1              300              3               200               0.0019
Figure 5.9

Figure 5.11

Looking at the figures, we conclude that the optimum model has the configuration:
No of LSTM Layers: 1, No of Nodes in LSTM: 200, No of Dense Layers: 3, Nodes in Dense Layers:
100
Models with a greater number of layers and nodes seem to be over-fitting and hence we
consider this the optimum model. The performance of the model on test data is given
below:

Test / Train R2 Score MSE


Train 0.020
Test 0.9241242817444608 0.001338420633915955
Figure 5.12

The RNN, being a complex model, is as expected capable of identifying the complex non-linear
relationship between the features and the predicted values.

Comparison of Models
Consolidating the data of the performance of different models, we can evaluate the optimum
model for the given problem.
Test Data Performance:

Model                                    R2 Score – Test        MSE – Test
Linear Regression                        0.21666095930040563    674.876222704071
K Nearest Neighbors Regressor            0.9044300211769578     82.33715283027406
Random Forest Regressor                  0.9567427701315365     37.267740252289755
Recurrent Neural Networks using LSTM     0.9241242817444608     0.001338 (Normalized)
Figure 6.1
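
The LSTM's MSE is computed on normalized targets, so it cannot be compared directly with the MSE
of the other models, whereas the R2 scores can. The sketch below shows how the two scales relate,
assuming min-max scaling to the range 0 to 1. The report does not state which normalization was
used, and the scores and helper names here are hypothetical.

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1]; also return the original minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical final scores and predictions, in runs.
final_scores = [120, 150, 180, 220]
predictions = [125, 145, 185, 210]

scaled_true, lo, hi = min_max_scale(final_scores)
# Predictions are scaled with the same minimum and maximum as the targets.
scaled_pred = [(p - lo) / (hi - lo) for p in predictions]

mse_runs = mean_squared_error(final_scores, predictions)
mse_scaled = mean_squared_error(scaled_true, scaled_pred)
# Under min-max scaling, MSE in runs equals the scaled MSE times (hi - lo) squared.
print(mse_runs, mse_scaled, mse_scaled * (hi - lo) ** 2)
```

Under this assumption, multiplying the normalized MSE by the squared range of the target brings
it back to the units used by the other three models.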

Evaluation on Separated Data:


We also evaluate the performance of the models on the evaluation data of the match that we
set aside, and compare it against the projected run rate of the match.

Figure 6.2

The figure shows the ball-by-ball prediction of the final score of the innings using the
optimum model of each type considered. The purple line represents the actual final score. We
can see that the RNN with LSTM (green) performed better than the other models and the
projected run rate. Right from the beginning, its predictions are close to the result and very
stable. The Random Forest Regressor, even with better accuracy on the test data, is stable yet
produces a sub-optimal prediction.

Conclusion
We have compared the performance of the models on test data and the evaluation data of a
match.
Looking at figures 6.1 and 6.2, we can see that the Random Forest Regressor and LSTM RNN
models perform better than the rest of the models. Both models have good accuracy on test
data. Random Forest Regressors (RFRs), utilizing ensemble methods, can produce good models
with low bias and variance. Neural networks, on the other hand, are very efficient at
identifying non-linear mappings between features and target values and demonstrate high
accuracy. Both models take a considerable amount of time to train, but the RFRs tend to take
more time while predicting. Looking at the performance of the models on the evaluation set of
the match, we can see that the prediction by the RNN has been stable and very close to the
actual value right from the beginning. The RFR, on the other hand, despite demonstrating high
accuracy on test data, does not seem to perform well in this instance.
Keeping these factors in mind, we propose the Recurrent Neural Network model using LSTM for
predicting the final score for a team in a match.

References
1. Indian Premier League Kaggle Dataset
2. Predicting Stock Prices Using LSTM
3. Live Cricket Score and Winning Prediction
4. Sklearn Data Processing
5. Sklearn Linear Regression
6. Sklearn K Nearest Neighbor Regressor
7. Sklearn Random Forest Regressor
8. Keras LSTM
9. Predicting Stock Price with LSTM
10. Metrics and scoring: quantifying the quality of predictions
11. Indian Premier League
