BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
Submitted by
P Neelima
Assistant Professor
[2021 – 2022]
SRKR ENGINEERING COLLEGE (A)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ChinnaAmiram, Bhimavaram, West Godavari Dist., A.P.
BONAFIDE CERTIFICATE
SELF DECLARATION
We hereby declare that the project work entitled “CRICKET MATCH SCORE
PREDICTION USING MACHINE LEARNING” is a genuine work carried out by
us in B.Tech., (Computer Science and Engineering) at SRKR Engineering
College(A), Bhimavaram and has not been submitted either in part or full for
the award of any other degree or diploma in any other institute or University.
1. J. Ganesh
2. Y. S. V. Santosh
3. C. C. Srinivas
4. V. V. Srunjani
TABLE OF CONTENTS
S NO DESCRIPTION PG NO
ABSTRACT iv
LIST OF TABLES iv
LIST OF FIGURES iv
ABBREVIATIONS iv
1 INTRODUCTION 1
2 LITERATURE SURVEY 3
3 PROBLEM STATEMENT 11
4 METHODOLOGY 12
5 IMPLEMENTATION 17
6 RESULT ANALYSIS 23
7 CONCLUSION 27
8 REFERENCES 28
9 APPENDIX 30
ABSTRACT
This project presents a model that forecasts the projected score of an IPL cricket match.
The output of the model depends on features such as current runs, current wickets fallen,
runs scored in the last 5 overs, and wickets fallen in the last 5 overs. The dataset contains
the history of all IPL matches played between 2008 and 2017. The project can forecast the
first innings score of an IPL match before the first innings is completed. The ridge regression
algorithm is used to forecast the score. The model mainly focuses on the data of the last
5 overs to forecast the final score of the match.
LIST OF TABLES
S NO DESCRIPTION PG NO
LIST OF FIGURES
S NO DESCRIPTION PG NO
OUTPUT SCREEN 36
ABBREVIATIONS
1 INTRODUCTION
Cricket, the most popular game at the time, was introduced to North America by the
English in the 17th century. Today the game is played in a large number of countries.
Since the BCCI launched the IPL in 2008, fantasy-sports apps such as Dream 11 have created
interest in forecasting the first innings score of a cricket match. As a result, there is a
high demand for algorithms that predict the first innings score of an important cricket match.
A practical way to predict the first innings score is to use machine learning algorithms.
Reinforcement, unsupervised, and supervised learning are the three types of machine learning
algorithms. The choice among them depends on the application as well as the output required
of the model.
Cricket is one of the most popular televised sports. It is extremely popular in
countries such as India, Australia, England, New Zealand, and South Africa. One
significant difficulty that has arisen recently is that the projected score for the first
innings often does not match the actual score. This creates the need for a model that
accurately estimates the first innings score of an IPL match. Such a model helps viewers
anticipate the final score of the current match.
A normal Twenty20 match lasts three to four hours, with each innings lasting 75–
90 minutes and a 10- to 20-minute break in the middle of the game. Each innings is played
over 20 overs, with 11 players on each team. This is far shorter than previous versions of the
game and more in line with other popular team sports. It was introduced to establish a new
version of the game that would appeal to both on-field spectators and television viewers.
Many strategies are used to predict the first innings score of a cricket match, and several
techniques and prediction methods are applied to IPL matches. The current run rate (CRR) method
is widely used to estimate the first innings score. In the CRR method, the average number of runs
scored per over so far is multiplied by the total number of overs in an innings. This approach
considers only the run rate and ignores all other parameters. The current technique can therefore
estimate the first innings score only from the current score. This project aims to improve the
accuracy of the present method by incorporating several features when forecasting the first
innings score of a cricket match. We focus on live cricket score prediction and evaluate IPL
matches for score prediction.
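The CRR projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function name `project_score_crr` is hypothetical, and `overs_completed` is assumed to be given as a decimal number of overs (not in the "overs.balls" scoreboard notation).

```python
def project_score_crr(current_runs: int, overs_completed: float, total_overs: int = 20) -> float:
    """Project the final innings score from the current run rate (CRR)."""
    if overs_completed <= 0:
        raise ValueError("overs_completed must be positive")
    current_run_rate = current_runs / overs_completed  # average runs per over so far
    return current_run_rate * total_overs


# Example: 80 runs after 10 overs projects to 160 in a 20-over innings.
print(project_score_crr(80, 10))
```

Because the projection uses nothing but the run rate, two innings at 80 runs after 10 overs receive the same forecast whether 1 or 7 wickets have fallen, which is the limitation the proposed model addresses.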
Linear, lasso, and ridge regression were employed to forecast the IPL first innings score.
Linear regression is a supervised technique: the model is trained on labelled data, for which the
correct output is already known. The linear regression model is used to predict continuous values
rather than to classify objects. Ridge
Supervised machine learning algorithms can solve two sorts of problems: regression and
classification. A classification model produces a category as its output, whereas a
regression model is used when a real value is the desired output.
Recommendation and time series forecasting are two more prominent sorts of
aims to model the structure or distribution of data in order to gain a better understanding of it.
2 LITERATURE SURVEY
A significant amount of work has been done over the decades on cricket score prediction
technique and machine learning for predicting the first innings final score. Several different
recognition and detection algorithms for score prediction have been evolving in this field.
Nikhil Dhonge, et al. [1] worked on the algorithms used to predict the IPL first innings
score. They sought a computationally efficient procedure that predicts the projected score
of the first innings. The classifiers they used are the SVC classifier, the decision tree
classifier, and the Random Forest classifier. They gathered data on all the IPL matches played
from 2008. The dataset comprises 76015 rows and 15 columns, over which they applied feature
selection and chose 8 features, of which 7 are input features and 1 is the target variable.
Regression analysis uses different algorithms for the computation and, based on that, predicts
a continuous value. A certain set of variables is used as input, and the continuous-valued
quantity is the target variable. The algorithm achieves an accuracy of 80.92% with linear
regression, about 80.84% with ridge regression, and about 80.45% with lasso regression. They
conclude that linear regression gives the highest accuracy for score prediction.
Apurva Lawate, et al. [2] proposed Cricket Match Prediction of Projected Score and Winner
Prediction using machine learning. They compared the accuracy of different algorithms:
linear regression, ridge regression, and a Multilayer Perceptron Neural Network. Data from
the past 10 years of IPL matches is used to build the dataset. The data in the dataset is
dated from 2009 to 2019. The dataset is split into two parts: the data from 2008 to 2016 is
used to train the models and the data from 2017 onwards is used to test them. They
implemented an algorithm that can predict an accurate projected score in the middle of an
ongoing match, and deployed it as a web application with the help of Flask. The model gives
an accuracy of 77.286% with linear regression and about 74.236% with ridge regression.
R. Kamble, et al. [3] examined Cricket Score Prediction Using Machine Learning. They
developed a model that can predict the score of a team after 20 overs from the current
situation. They implemented Naive Bayes, Random Forest, multiclass SVM, and decision tree
classifiers to build the prediction models for each of the problems. They found the
Random Forest classifier to be the most accurate for each problem. Data from the past 5
years of IPL matches is used to build the dataset. The dataset is divided into training
and testing parts in an 80:20 ratio: 80 percent of the data is used for training and 20
percent for testing. Linear regression is used in this model; it repeatedly performs the
same task to give a more accurate score. The model is used for predicting and computing
the final values. They implemented a system that is useful for making key decisions. The
database is updated on each prediction, and the system works efficiently with a huge
dataset containing a very large number of rows.
T. Suvarna Kumari, et al. [4] worked on Cricket Match Score Prediction using the k-Nearest
Neighbors algorithm. They proposed a method in which the final score of the first innings can be
predicted. They implemented the KNN algorithm to predict the match scores for the first innings
and second innings datasets, where the class attribute 'X' is the 'Score' and the input attributes
'ai' (i = 1, 2, 3, …) are the relative team strength, home/away, and venue average. The dataset
consists of completed matches, excluding all rain-interrupted and rain-abandoned games, played
between 2000 and 2018 among ODI-playing teams such as India, England, Australia, and so on.
They computed the error rate of both innings: the error rate of the first innings is 24% and
that of the second innings is 16%.
Kushooo, et al. [5] used data mining to forecast cricket scores and winners. They devised a
procedure for predicting and computing the first innings score in a cricket match. Most sports
predictions are framed as regression or classification problems, both of which are supervised
learning tasks. The output in regression is a continuous value, whereas classification deals with
discrete output. Linear Regression appeared to be quite effective for predicting continuous
values, and learning algorithms like Naive Bayes, Logistic Regression, and Neural Networks for
classification problems, such as predicting the outcome of matches or classifying players. From
the results, they concluded that Random Forest proved to be the most accurate classifier
for both datasets, with an accuracy of 89.74% for predicting runs scored by a batsman and
93.27% for predicting wickets taken by a bowler. SVM achieved an accuracy of only 52.35% for
predicting runs and 72.85% for predicting wickets.
Prasad Thorat, et al. [6] proposed efficient and robust algorithms for estimating the expected
score of a cricket match's first innings. The dynamic nature of the game is generally overlooked
when making predictions. They proposed CricFirst Predictor (CFP), a method that can take the game's
dynamic nature into account. The data comes from Kaggle datasets and is stored in CSV format. The
dataset is split into two parts: training data and test data for the model. The user inputs
include the name of the batting team, the name of the bowling team, the number of runs scored, the
number of overs bowled, the number of wickets taken, the number of runs scored in the past 5 overs,
and the number of wickets taken in the past 5 overs. The algorithm achieves an accuracy of 75.16%
with linear regression, about 67.36% with random forest regression, and about 74.10% with lasso
regression. They conclude that linear regression gives the highest accuracy for score prediction.
Prateek Gupta, et al. [7] analyzed Cricket Score Forecasting using Neural Networks. They
proposed a system that overcomes the major weakness of manual work, which is tedious and
requires labor to maintain the records and statistics of every player by hand. For
predicting continuous values, an LSTM-based neural network is used, where the data of the
past 18 deliveries is the context and the score after 18 deliveries is the target. They
used a dataset containing data from more than 10 years (2010-2021), filtered for T20
matches, amounting to over 2600 matches. The dataset includes data from the IPL (2010-2021),
BBL (2011/12-2020/21), PSL (2015/16-2020/21), CPL (2013-2020), Lanka Premier League (2019/20),
and T20 International matches (2010-2021). The most accurate model uses the "ELU" activation
function on the output layer and Mean Square Error as the loss function, and outperforms
the score predicted by run rate. They concluded that the model produces a Root Mean Square
Error of 7.0, compared with 18.11 for the run-rate-based prediction.
Sudhanshu Akarshe, et al. [8] worked on Cricket Score Prediction using Machine Learning
Algorithms, drawing on several machine learning algorithms. They proposed a model to predict
the match result and the performance of each player based on historical data. They gathered
data for all available matches from different websites such as ESPN, Kaggle, and so on.
Usually, a single algorithm is used in such a prediction system and its performance is
measured separately. Instead, they aim to use multiple machine learning algorithms and
compare their performance.
The Indian Premier League was the subject of Harshit Barot, et al. [9]. They presented a
system that uses machine learning techniques such as SVM, Logistic Regression,
Random Tree, Random Forest, and Naive Bayes to predict the outcomes of IPL games.
The outcome is determined by examining the records of the past five years, from 2015 to
2019. They used two different datasets. The first gives ball-by-ball data for every IPL
match ever played, including the batsman, bowler, runs, wicket, and more, for every
single ball of the game. The second dataset contains a summary of each match, including
the teams involved, the winner, the toss winner, and other data for each match played
in the IPL. They concluded that the lowest accuracy obtained is 82.6% (Naive Bayes),
with higher accuracy from the decision tree algorithm (86.9%) and logistic regression
(94.6%).
Arjun Singhvi and colleagues [10] studied the outcome of a Twenty20 cricket match. The T20
format of cricket is highly unpredictable, which is one reason for its current popularity.
They proposed a model that uses machine learning algorithms such as logistic regression,
support vector machines, Bayes network, decision tree, and random forest to predict the
outcomes of IPL games. Only data from international, domestic, and league T20 cricket
matches was included because it gave the most accurate picture of how a player would
perform in a T20 contest. ESPN provided a list of 5390 Twenty20 matches. They concluded that
AdaBoost has the highest accuracy, 62%, while Decision Trees have the lowest.
Rameshwari, et al. [11] used a linear regression classifier to predict the live cricket score and
the winner. They built a model with two methods: the first estimates the score of the first
innings based on the current run rate as well as the number of wickets lost, the match venue,
and the batting and bowling teams. The second method uses the same attributes as the first,
plus the batting team's target, to predict the outcome of the match in the second innings.
For the first and second innings, these two methods were developed using a Linear Regression
Classifier or Q-Learning-based decision tree approach and a Naive Bayes Classifier,
respectively. They obtained the data from the site http://cricsheet.org.
G. Sudhamathy et al. [12] used machine learning techniques in the R package to predict IPL data.
They presented a model that focused on validation and measured the difference between the
models to predict which side would be the more dominant in an IPL match. The algorithm first
analyzes the data to create a model that can be used to identify patterns and trends. The
model is refined by choosing parameters and iterating to build the mining model. The settings
are then applied to the dataset to identify meaningful patterns and detailed statistics. They
used the functions of the R package to find meaningful information about the IPL teams.
They mainly used Decision Tree, Naive Bayes, K-Nearest Neighbor,
Cricket Match Analytics Using the Big Data Approach was proposed by Mazhar Javed Awan,
et al. [13]. They proposed a linear regression model-based method to predict team scores
without using big data or the Spark ML big data framework. The winner has been
predicted using techniques such as logistic regression, KNN, Naive Bayes, and SVM,
among others. They gathered data from websites such as Kaggle, Cricsheet, and
others, and then applied ball-by-ball details as well as various constraints. The
prepared data was divided into two sets: training (80%) and testing (20%). They
concluded that after applying linear regression, the accuracy, root mean square
error (RMSE), mean square error (MSE), and mean absolute error (MAE) are 95%, 30.2,
1350.34, and 28.2, respectively.
Aman Sahua et al. [14] focused on a logistic regression based Predictive Analysis of Cricket. They
processed data and made recommendations using Google Colab, a data analysis tool. Random Forest
Classifier, Multinomial logistic regression, and AdaBoost are some of the algorithms they use.
Improved models can help decision makers during cricket matches by allowing them to compare a
team's abilities with those of the opposition as well as environmental factors. To derive the
rules and recommendations, they used a label encoder for preprocessing the dataset and a random
forest classifier as the ensemble technique. The data was subjected to logistic regression, which
showed 74.9 percent accuracy in predicting game outcomes and 81 percent accuracy in predicting the
winning team. From the results, they concluded that the Random Forest Classifier, Multinomial
Logistic Regression, and Adaptive Boosting algorithms achieved an accuracy of 98.14%, 29.62%, and
1.0 for the training model and an
Akhil Nimmagadda, et al. [15] used data mining and the Random Forest technique to
forecast the cricket score and winner. They devised an algorithm to predict the outcome of
One-Day International cricket matches, in which they assess the batting and bowling potential of
the 22 players taking part in the match using their career statistics and active participation in
recent games. They used player potential to express one team's relative dominance over the
other. They use supervised learning algorithms to predict the match winner using certain other
base features, such as run rate and the match venue, as well as relative team strength. The
model was built using Multiple Variable Linear Regression.
Wickramasinghe, et al. [16] focused on using a Naive Bayes classifier to predict the winner
of an ODI cricket match. They proposed an algorithm to predict the winner after the first
innings of an ODI cricket match. They produce this prediction using the Naive Bayes (NB)
method with data from 15 features, which include variables related to batting, bowling,
team composition, and other factors. After building an initial model, they aim to use feature
selection algorithms, such as univariate selection, recursive elimination, and principal
component analysis (PCA), to increase the accuracy of predicting the winner. They
(univariate, recursive, and PCA methods). Based on the results of the experiments, they
found that using a univariate feature selection method with an 85:15 training-to-testing
sample size ratio achieves the highest accuracy in predicting the winning team. The
highest prediction accuracy is 86.71 percent (i.e., the lowest error rate is 13.49
percent), which is excellent for an ODI cricket match.
Jalaz Kumar et al. [17] proposed using decision trees and MLP networks to predict the outcome
of ODI cricket matches. They use a decision tree classifier and a multilayer perceptron as
algorithms. They collected the statistics from all ODI matches from January 5, 1971,
through October 29, 2017. There were 3933 ODI match results that were dropped. They presented
an approach that is preferable to statistics because, unlike statistics, which uses
mathematical equations to define relationships between variables, these methods do not
need any prior assumptions about the data variables and their underlying relationships.
D. Jyothsna et al. [18] used Linear Regression, Decision Tree, K-means, and Logistic
Regression to analyze and predict the outcome of IPL cricket data. They gathered
data from the IPL over the most recent seven years, including player data, match
venue data, team data, and ball-by-ball data, and analyzed it to reach various
conclusions that help improve a player's performance. Various factors, such as
how the venue or toss decision affected the match's outcome over the past seven years,
are also computed. According to the findings, Random Forest is the most accurate
classifier, predicting the best player performance with an accuracy of 89.15 percent.
Rameshwari Lokhande, et al. [19] focused on Live Cricket Score and Winning Prediction
using Linear Regression and Naive Bayes Classifier. They proposed an approach for
building a model to predict the first innings final score and estimate the match
result in the second innings for a limited-overs cricket match. The toss, the teams' ODI
rankings, and the home team advantage are factors in the projections. Based on earlier
matches, two distinct models, one for the first innings and the other for the second
innings, have been proposed using the Linear Regression classifier and the Naive Bayes
classifier, respectively. Rather than linear regression, the reinforcement algorithm is used.
Predicting the Winner in One-Day International Cricket was a project undertaken by Ananda
Bandulasiri and colleagues [20]. They proposed a statistical analysis of the significance of
"home field" advantage in One Day International cricket. They also noted the curious result that
winning the coin toss is a disadvantage in daytime matches. They found, however, that
winning the coin toss offers an advantage in "Day and Night" matches. The best
current method for resetting targets in interrupted cricket matches is the Duckworth-Lewis
method, which appears to statistically favor the team most affected by the interruption.
They concluded that this is not a fair comparison because the DL approach
3 PROBLEM STATEMENT
The aim is to design a system that can predict the first innings score of a
cricket match. The system can analyze many parameters such as batting team,
bowling team, overs completed, runs scored up to a given over, wickets fallen
up to a given over, runs scored in the past 5 overs, and wickets fallen in the past 5 overs.
The goal is to predict the results of an IPL match using machine learning algorithms such as
Logistic Regression, Linear Regression, Ridge Regression, SVM, Lasso Regression, and
Random Forest. We have used 15 features, which are as follows: mid, date,
venue, batting team, bowling team, batsman, bowler, runs, wickets, overs, runs scored
in last 5 overs, wickets fallen in last 5 overs, striker, non-striker, total.
4 METHODOLOGY
4.1 Description of the dataset
IPL DATASET:
The dataset used in this project is ipl.csv, which is obtained from the Kaggle website. The
dataset contains information about every ball of all IPL matches from 2008 to
2017. The dataset contains 1140210 tuples. Each tuple contains 15 attributes. The
target feature is the 15th feature, a real-valued attribute whose range lies
between 0 and infinity. The features used in this dataset are described as follows:
We plan to build a model that can accurately predict the first innings score of a live
IPL match. We aim to build a model that takes into account the different
parameters that contribute to the score prediction.
DATA COLLECTION:
We take the dataset from the datasets available on Kaggle, in CSV format.
The data collected from the site is cleaned in the next stage.
DATA CLEANING:
In the data cleaning step, we remove unwanted columns such as the match id,
venue, name of the batsman, name of the bowler, score of the striker batsman, and
score of the non-striker batsman. These columns are not needed for prediction,
so we drop them. In the IPL dataset, a few teams are not consistent: they played
for only a few years. We therefore remove those teams from the dataset and consider
only the consistent teams. We consider only the data after 5 overs. The date column in
the dataset is stored as a string, but we need to apply some operations on it, so we
convert the string to a date-time object.
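The cleaning steps above can be sketched with the Python standard library as follows. This is an illustrative sketch, not the project's code: the column names (other than `mid`, `bat_team`, and `bowl_team`, which appear in this report), the date format, the team names, and the sample rows are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the assumed layout of ipl.csv (most column names are assumptions).
SAMPLE_CSV = """mid,date,venue,bat_team,bowl_team,batsman,bowler,runs,wickets,overs,runs_last_5,wickets_last_5,striker,non-striker,total
1,2008-04-18,Venue A,Team A,Team B,P1,P2,30,1,4.5,30,1,10,5,180
1,2008-04-18,Venue A,Team A,Team B,P1,P2,42,1,5.2,40,1,15,8,180
2,2017-04-05,Venue B,Team A,Team C,P3,P4,60,2,7.1,35,1,20,9,150
"""

UNWANTED = {"mid", "venue", "batsman", "bowler", "striker", "non-striker"}
CONSISTENT_TEAMS = {"Team A", "Team B"}  # placeholder names, not the real team list


def clean_rows(csv_text):
    """Drop unwanted columns, inconsistent teams, and rows before 5 overs; parse dates."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep only rows where both teams are in the consistent set.
        if row["bat_team"] not in CONSISTENT_TEAMS or row["bowl_team"] not in CONSISTENT_TEAMS:
            continue
        # Keep only the data from 5 overs onward.
        if float(row["overs"]) < 5.0:
            continue
        kept = {key: value for key, value in row.items() if key not in UNWANTED}
        # Convert the date string (assumed ISO format) to a datetime object.
        kept["date"] = datetime.strptime(kept["date"], "%Y-%m-%d")
        cleaned.append(kept)
    return cleaned


rows = clean_rows(SAMPLE_CSV)
print(len(rows), rows[0]["date"].year)
```

In the sample, the first row is dropped because it falls inside the first 5 overs and the third because one of its teams is not in the consistent set, so a single row survives.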
DATA PREPROCESSING:
After cleaning the data, we preprocess it. In the data preprocessing step, we perform
one-hot encoding, which is explained in detail in the implementation section. We also
rearrange the columns of our dataset in the data preprocessing step. The purpose of
rearranging the columns is that we want
DATA SPLITTING:
After data preprocessing, we split the data so that IPL matches played
up to 2016 are used for training the model and IPL matches
played after 2016 are used for testing the model.
MODEL GENERATION:
We use the Linear Regression, Random Forest Regression, and Lasso Regression
models to predict the result from the input. These models are built using the
training data from the input dataset and tested using the test data from the
input dataset. They are then used to predict the output on new data.
FINAL PREDICTION:
Finally, in order to predict the output, the inputs are taken from the user, and based
on them a range for the final score is displayed as output on the web page.
4.3 MACHINE LEARNING ALGORITHM:
RIDGE REGRESSION:
Ridge regression is used to build a parsimonious model when the number of predictor
variables in a set exceeds the number of observations, or when a dataset has
multicollinearity (correlations between predictor variables). Ridge regression uses
L2 regularization. L1 regularization, by contrast, sometimes eliminates certain
coefficients altogether, which can yield sparse models. L2 regularization adds an L2
penalty, which equals the square of the magnitude of the coefficients.
Linear regression models a linear relationship between the input variables and the
target variable. If the output depends on a single input variable, the model is a
line; if it depends on multiple input variables, the model is a hyperplane. The
coefficients of the model are found through an optimization process that seeks to
minimize the sum of squared errors between the predictions (yhat) and the expected
target values (y).
loss = sum i=0 to n (y_i – yhat_i)^2
In linear regression, the estimated coefficients of the model can sometimes become
large, making the model sensitive to inputs and possibly unstable. This is especially
true for problems with few observations, or with fewer samples (n) than input predictors (p).
One approach to improving the stability of regression models is to change the
loss function to include additional costs for a model that has large
coefficients. Linear regression models that use these modified loss
functions during training are referred to collectively as penalized linear regression.
An L2 penalty minimizes the size of all coefficients, although it prevents any coefficient
from being removed from the model, because it does not allow its value to become exactly zero.
The effect of this penalty is that the parameter estimates are only allowed to
become large if there is a proportional reduction in SSE. In effect,
this method shrinks the estimates towards 0 as the lambda penalty becomes
large (these techniques are sometimes called "shrinkage methods").
This penalty can be added to the cost function for linear regression and is referred to as
Tikhonov regularization (after the author), or Ridge Regression more generally.
A hyperparameter called "lambda" controls the weighting of the penalty in the
loss function. A default value of 1.0 fully weights the penalty; a value of 0 excludes
the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.
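The shrinkage effect of lambda can be illustrated with a one-feature example. The sketch below is a simplified illustration under stated assumptions, not the project's model: it assumes a single centred feature with no intercept, for which the penalized loss sum((y - b*x)^2) + lambda * b^2 has the closed-form minimiser b = sum(x*y) / (sum(x^2) + lambda). The function name `ridge_slope` and the data are hypothetical. In scikit-learn, the same hyperparameter is exposed as the `alpha` argument of `Ridge`.

```python
def ridge_slope(xs, ys, lam):
    """Slope b minimising sum((y - b*x)^2) + lam * b^2 for one centred feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical centred data with a true slope of about 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.0, 2.1, 3.9]

for lam in (0.0, 1.0, 10.0):
    # lam = 0 reproduces ordinary least squares; larger lam shrinks the slope towards 0.
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

The slope never reaches exactly zero for a finite lambda, which matches the statement that an L2 penalty shrinks coefficients without removing them from the model.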
5 IMPLEMENTATION
The system is deployed as a web application using the Flask API; the models are
trained in Jupyter Notebook using the scikit-learn (sklearn) library. Below are the
steps for implementing the machine learning model, along with its training.
Figure 5.1 shows a snapshot of the code used to collect the data from an
external source. Any model to be trained needs input data, which is
called the dataset. There are many ways to collect data from the web; here we
use the dataset called ipl.csv, collected from the Kaggle website. The
dataset consists of 1140210 tuples and each tuple contains 15 features.
5.2 DATA EXTRACTION:
Figure 5.2 shows a snapshot of the data needed for building the model.
The code removes the unwanted features from the dataset: mid, batsman,
bowler, striker, and non-striker.
Figure 5.3 shows a snapshot of the code used to keep only the consistent
teams that have played since the start of the IPL. The model cannot predict the score
from the first 5 overs of data, so the code also removes the first 5 overs from the dataset.
The date column is initially in string format and is converted into a datetime object.
Figure 5.4 shows a snapshot of the code used to convert the categorical features using the
one-hot encoding method. This type of encoding creates a new binary feature for each possible
category of the columns bat_team and bowl_team. The batting team entered by the user is
replaced by 1 and all the other teams are replaced by 0. Likewise, the bowling team entered by
the user is replaced by 1 and all other teams are replaced by 0.
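A minimal sketch of this encoding is shown below. It is illustrative only: the team list uses placeholder names, and the order of the eight positions is an assumption, since the actual column order is fixed by the training code in Figure 5.4.

```python
# Placeholder list of the 8 consistent teams; names and order are assumptions.
TEAMS = ["Team A", "Team B", "Team C", "Team D", "Team E", "Team F", "Team G", "Team H"]


def one_hot(team, teams=TEAMS):
    """Return a binary list with 1 at the position of `team` and 0 elsewhere."""
    if team not in teams:
        raise ValueError(f"unknown team: {team}")
    return [1 if candidate == team else 0 for candidate in teams]


# Batting team in position 0 and bowling team in position 4 give 16 binary features.
encoded = one_hot("Team A") + one_hot("Team E")
print(encoded)
```

The same team list must be used for bat_team and bowl_team at training time and at prediction time, otherwise the positions of the 1s no longer line up with the learned coefficients.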
5.4 DATA SPLITTING:
Figure 5.5 shows how the entire dataset is split into 2 parts. The original dataset contains
the IPL matches from 2008 to 2017. The code divides the dataset into training and
testing sets, where the training data is taken from 2008 to 2016 and the test data is taken
from the year 2017. The training data contains 821260 tuples and the test data contains 61116 tuples.
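The year-based split can be sketched as follows. This is a simplified illustration with hypothetical rows and a hypothetical helper name `split_by_year`; it assumes each row already carries a parsed `date` value.

```python
from datetime import date


def split_by_year(rows, last_training_year=2016):
    """Rows dated up to `last_training_year` go to training, later rows to testing."""
    train = [row for row in rows if row["date"].year <= last_training_year]
    test = [row for row in rows if row["date"].year > last_training_year]
    return train, test


# Hypothetical rows for illustration only.
rows = [
    {"date": date(2008, 4, 18), "total": 180},
    {"date": date(2016, 5, 29), "total": 208},
    {"date": date(2017, 4, 5), "total": 150},
]
train, test = split_by_year(rows)
print(len(train), len(test))
```

Splitting by season rather than at random keeps all balls of a match on the same side of the split, so the test set measures performance on matches the model has never seen.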
Figure 5.6 shows a snapshot of the code that builds the model for score prediction.
The main step of the entire project is building a model to process the collected data. Here we
have used Ridge regression because it gives output as accurate as linear regression while also
protecting the model from overfitting. Grid search is a hyperparameter tuning technique: it takes
the candidate parameter values from a dictionary, tries each combination, and assesses the model
for each one using cross-validation.
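A minimal sketch of Ridge regression tuned by grid search with scikit-learn is shown below. The data is synthetic, and the `alpha` values, the scoring metric, and the 5 folds are illustrative assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the encoded IPL features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Dictionary of candidate values; alpha is the regularization
# strength of Ridge. These particular values are hypothetical.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Each alpha is scored by 5-fold cross-validation on negative MSE,
# and the best one is refit on all the data.
search = GridSearchCV(Ridge(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

best_model = search.best_estimator_
print(search.best_params_)
```

A larger `alpha` shrinks the coefficients more strongly, which is how Ridge regression guards against overfitting.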
A user interface is created using Flask and web templates to take inputs from the
user and to display the range of scores predicted by the Ridge regression model.
When the user fills in all the entries on the webpage, the two team selections are encoded in the
backend as a list of 16 values. For example, if the batting team is Bengaluru and the bowling team
is Rajasthan, they are encoded as [0,0,0,0,0,0,1,0] + [0,0,0,0,0,1,0,0], following the team order
used in the Flask code of the appendix. These 2 lists are then combined and extended with the
remaining numeric inputs from the HTML form (overs, runs, wickets, runs in the last 5 overs, and
wickets in the last 5 overs). To get the maximum score, 10 is added to the forecasted final score,
and to get the minimum score, 10 is subtracted from it. This range is then displayed to the user
as the forecasted value. For example, if the model forecasts 180 as the final score for the first
innings, then 170 to 190 is displayed as the final output.
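The input encoding and the displayed range can be sketched in plain Python as below. The team order follows the if/elif chain in the Flask code of the appendix, which reads five numeric inputs from the form. The helper names `one_hot`, `build_features`, and `score_range` are hypothetical and do not appear in the project code.

```python
# Team order taken from the Flask code in the appendix.
TEAMS = [
    "Chennai_Super_Kings", "Delhi_Daredevils", "Kings_XI_Punjab",
    "Kolkata_Knight_Riders", "Mumbai_Indians", "Rajasthan_Royals",
    "Royal_Challengers_Bangalore", "Sunrisers_Hyderabad",
]


def one_hot(team):
    """Return an 8-element list with 1 at the team's position, 0 elsewhere."""
    return [1 if name == team else 0 for name in TEAMS]


def build_features(bat_team, bowl_team, overs, runs, wickets,
                   runs_in_prev_5, wickets_in_prev_5):
    """Combine both team encodings with the numeric form inputs."""
    return (one_hot(bat_team) + one_hot(bowl_team)
            + [overs, runs, wickets, runs_in_prev_5, wickets_in_prev_5])


def score_range(prediction, margin=10):
    """Return (minimum, maximum) as prediction -/+ margin."""
    centre = int(round(prediction))
    return centre - margin, centre + margin


features = build_features("Royal_Challengers_Bangalore", "Rajasthan_Royals",
                          10.2, 85, 2, 42, 1)
print(len(features))       # 21 (8 + 8 + 5)
print(score_range(180.0))  # (170, 190)
```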
6 RESULT ANALYSIS
Figure 6.1 shows the 3 error measures obtained with the linear regression technique. The mean
absolute error is 12.11861, the mean squared error is 251.00792, and the root mean squared
error is 15.84322.
Figure 6.2 shows the 3 error measures obtained with the ridge regression technique. The mean
absolute error is 12.11729, the mean squared error is 251.03172, and the root mean squared
error is 15.84398.
Figure 6.3 shows the 3 error measures obtained with the lasso regression technique. The mean
absolute error is 12.21405, the mean squared error is 262.37973, and the root mean squared
error is 16.19813.
From the above 3 figures it is clear that the mean squared error and the root mean squared error
are lowest when the model is implemented using linear regression, while the mean absolute error
is lowest when the model is implemented using ridge regression. Lasso regression has the highest
error on all 3 measures.
Overall, linear regression gives the most accurate result of the three techniques, although its
margin over ridge regression is very small.
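For reference, the 3 error measures compared above can be computed as in the following sketch. The function name `regression_errors` and the sample scores are made up for illustration. The last line checks that the reported linear regression RMSE is the square root of its reported MSE.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root mean squared error
    return mae, mse, rmse


# Invented final scores and predictions.
actual = [160, 175, 150, 190]
predicted = [150, 180, 165, 190]

mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, round(rmse, 3))  # 7.5 87.5 9.354

# RMSE is the square root of MSE, so the reported values can be
# cross-checked against each other.
print(round(math.sqrt(251.00792), 4))  # 15.8432
```

Because MSE squares each residual, it penalizes large misses more heavily than MAE, which is why the two measures can rank models differently.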
[Figure panels: Linear Regression, Ridge Regression, Lasso Regression]
OVERFITTING:
Overfitting occurs when a model works well on the training data but does not give
accurate output for the test data.
UNDERFITTING:
Underfitting occurs when a model is too simple to capture the pattern in the data, so it
does not work well on either the training data or the test data.
GOOD FIT:
A good fit works well on both the training dataset and the test dataset, giving accurate
outputs for both datasets, around 70 to 80%.
Fig 6.5 Using train data for testing
Figure 6.6 shows the error rates on the training data between 2010 and 2016. Here the error
rates are the same as for the test data, so the models are not overfitting our data. They give
accurate outputs on both the test and training data and are therefore a good fit for our dataset.
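The check described above, comparing the error on the training data with the error on the test data, can be sketched as follows. The data is synthetic and the `alpha` value is arbitrary, so the printed numbers are not the project's results.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic data with a linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=500)

# Hold out the last 100 rows as test data.
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

model = Ridge(alpha=1.0).fit(X_train, y_train)

# RMSE on data the model has seen versus data it has not seen.
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

A test error far above the training error would indicate overfitting, while similar errors on both sets suggest that the model generalizes.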
7 CONCLUSION
This project forecasts the first innings score of an IPL match using the machine learning
algorithms ridge regression, linear regression, and lasso regression. A dataset
named ipl.csv is used. This dataset consists of 15 different features, of which the main ones used
are batting team, bowling team, runs scored at the current over, wickets fallen at the current over,
overs completed, runs scored in the last 5 overs, and wickets fallen in the last 5 overs. The project
takes the above inputs and forecasts the score of an IPL match as output. It predicts the output using
3 different models, namely ridge regression, linear regression, and lasso regression, whose
root mean squared errors are 15.8439, 15.8432, and 16.1981 respectively. Linear regression
gives a lower root mean squared error than ridge regression and lasso regression, but ridge
regression protects the model from overfitting.
9 APPENDIX
SAMPLE CODES:
import pandas as pds

# Load the IPL data and remove the first 5 overs
dset = pds.read_csv('ipl.csv')
dset = dset[dset['overs'] >= 5.0]
Code for Ridge Regressor:
Flask Code:
import pickle

import numpy as npy
from flask import Flask, render_template, request

# Load the three trained models from their pickle files
fn = 'first-innings-score-lr-model.pkl'
regressor = pickle.load(open(fn, 'rb'))
filename2 = 'linear-first-innings-score-lr-model.pkl'
lr = pickle.load(open(filename2, 'rb'))
filename3 = 'Lasso-first-innings-score-lr-model.pkl'
lassor = pickle.load(open(filename3, 'rb'))

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    t_a = list()
    if request.method == 'POST':
        # One-hot encode the batting team
        bat_team = request.form['batting-team']
        if bat_team == 'Chennai_Super_Kings':
            t_a = t_a + [1,0,0,0,0,0,0,0]
        elif bat_team == 'Delhi_Daredevils':
            t_a = t_a + [0,1,0,0,0,0,0,0]
        elif bat_team == 'Kings_XI_Punjab':
            t_a = t_a + [0,0,1,0,0,0,0,0]
        elif bat_team == 'Kolkata_Knight_Riders':
            t_a = t_a + [0,0,0,1,0,0,0,0]
        elif bat_team == 'Mumbai_Indians':
            t_a = t_a + [0,0,0,0,1,0,0,0]
        elif bat_team == 'Rajasthan_Royals':
            t_a = t_a + [0,0,0,0,0,1,0,0]
        elif bat_team == 'Royal_Challengers_Bangalore':
            t_a = t_a + [0,0,0,0,0,0,1,0]
        elif bat_team == 'Sunrisers_Hyderabad':
            t_a = t_a + [0,0,0,0,0,0,0,1]
        # One-hot encode the bowling team
        bowl_team = request.form['bowling-team']
        if bowl_team == 'Chennai_Super_Kings':
            t_a = t_a + [1,0,0,0,0,0,0,0]
        elif bowl_team == 'Delhi_Daredevils':
            t_a = t_a + [0,1,0,0,0,0,0,0]
        elif bowl_team == 'Kings_XI_Punjab':
            t_a = t_a + [0,0,1,0,0,0,0,0]
        elif bowl_team == 'Kolkata_Knight_Riders':
            t_a = t_a + [0,0,0,1,0,0,0,0]
        elif bowl_team == 'Mumbai_Indians':
            t_a = t_a + [0,0,0,0,1,0,0,0]
        elif bowl_team == 'Rajasthan_Royals':
            t_a = t_a + [0,0,0,0,0,1,0,0]
        elif bowl_team == 'Royal_Challengers_Bangalore':
            t_a = t_a + [0,0,0,0,0,0,1,0]
        elif bowl_team == 'Sunrisers_Hyderabad':
            t_a = t_a + [0,0,0,0,0,0,0,1]
        # Read the numeric inputs from the form
        overs = float(request.form['overs'])
        runs = int(request.form['runs'])
        wickets = int(request.form['wickets'])
        runs_in_prev_5 = int(request.form['runs_in_prev_5'])
        wickets_in_prev_5 = int(request.form['wickets_in_prev_5'])
        # Append the numeric inputs to the encoded teams
        # (assumption: this is the column order used in training)
        t_a = t_a + [overs, runs, wickets, runs_in_prev_5, wickets_in_prev_5]
        data = npy.array([t_a])
        my_prediction = regressor.predict(data)[0]
        l_prediction = lr.predict(data)[0]
        lass_prediction = lassor.predict(data)[0]
        # Assumption: the result template is named 'result.html' and
        # receives the forecast range of prediction -10 to +10
        return render_template('result.html',
                               lower_limit=int(my_prediction) - 10,
                               upper_limit=int(my_prediction) + 10)

if __name__ == '__main__':
    app.run(debug=True)
IMPLEMENTING SCREEN: