A Project report on

CRICKET MATCH SCORE PREDICTION


USING MACHINE LEARNING
in partial fulfillment for the award of the degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING

Submitted by

ADDANKI SWARNA SRI 19B91A0504


DANGETI DHARMA SAI 19B91A0542
BHUPATHIRAJU DILEEP VARMA 19B91A0523
BASAVA JAYANTH 19B91A0520

Under the Guidance of

SRI CH. VINOD VARMA


Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SRKR ENGINEERING COLLEGE (A)
Chinna Amiram, Bhimavaram, West Godavari Dist., A.P.
[2022 – 2023]
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SRKR ENGINEERING COLLEGE (A)

Chinna Amiram, Bhimavaram, West Godavari Dist., A.P.

[2022 – 2023]

BONAFIDE CERTIFICATE

This is to certify that the project work entitled “CRICKET MATCH SCORE PREDICTION USING
MACHINE LEARNING” is the bona fide work of “Addanki Swarna Sri (19B91A0504), Dangeti
Dharma Sai (19B91A0542), Bhupathiraju Dileep Varma (19B91A0523), Basava Jayanth
(19B91A0520)” who carried out the project work under my supervision in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering.

SUPERVISOR HEAD OF THE DEPARTMENT

Sri.CH.Vinod Varma Dr. V Chandra Sekhar

Assistant Professor Professor


SELF DECLARATION

We hereby declare that the project work entitled “CRICKET MATCH SCORE PREDICTION USING
MACHINE LEARNING” is a genuine work carried out by us in B.Tech., (Computer Science and
Engineering) at SRKR Engineering College(A), Bhimavaram and has not been submitted either in part or
full for the award of any other degree or diploma in any other institute or University.

1. A.Swarna Sri 19B91A0504


2. D.Dharma Sai 19B91A0542
3. B.Dileep Varma 19B91A0523
4. B.Jayanth 19B91A0520
TABLE OF CONTENTS

S.NO DESCRIPTION PG.NO

1 ABSTRACT i

2 LIST OF FIGURES ii

3 INTRODUCTION 1

4 LITERATURE SURVEY 3

5 PROBLEM STATEMENT 7

6 EXISTING SYSTEM 8

7 PROPOSED SYSTEM 9

8 METHODOLOGY 10

9 SYSTEM DESIGN 15

10 IMPLEMENTATION 19

11 RESULT ANALYSIS 27

12 CONCLUSION 32

13 REFERENCES 33

14 APPENDIX 34
ABSTRACT

The aim of this project is to build a model that forecasts the first innings score of an IPL cricket match.
The model's inputs include the current runs, the current wickets, the runs scored in the previous five overs,
and the wickets lost in the previous five overs. The dataset contains a history of each IPL match played
between 2008 and 2017. The model forecasts the first innings score of an IPL match before the innings ends,
using the ridge regression approach. It primarily examines data from the previous five overs in order to
forecast the final score.

LIST OF FIGURES

S.NO DESCRIPTION PG.NO

1 MODEL ARCHITECTURE 12

2 USE CASE DIAGRAM 15

3 CLASS DIAGRAM 16

4 SEQUENCE DIAGRAM 17

5 ACTIVITY DIAGRAM 18

6 DATA COLLECTION 19

7 DATA CLEANING 20

8 DATA EXTRACTION 20

9 DATA PREPROCESSING 21

10 DATA SPLITTING 22

11 LINEAR REGRESSION 23

12 LASSO REGRESSION 23

13 RIDGE REGRESSION 24

14 SCREENSHOT FOR WEB PAGE 25

15 SCREENSHOT FOR FILLING VALUES IN WEB PAGES 26

16 ERROR RATES USING LINEAR REGRESSION 27

17 ERROR RATES USING RIDGE REGRESSION 27

18 ERROR RATES USING LASSO REGRESSION 28

19 COMPARISON BETWEEN DIFFERENT FITTING MODELS 29

20 GRAPHICAL REPRESENTATION FOR ACTUAL AND PREDICTED VALUES 30

21 ERROR RATES FOR TRAINING DATA 31

22 OUTPUT SCREEN 43

1. INTRODUCTION

English colonists brought cricket, then a very popular game, to North America in the 17th century. Many
nations now play the game. Since the IPL began in 2008, fantasy sports apps such as Dream11 have increased
interest in predicting the first innings score of a cricket match. Algorithms that predict the first innings
score of an important match are therefore in high demand. Machine learning offers a straightforward way to
make this prediction. Machine learning algorithms fall into three categories: supervised, unsupervised, and
reinforcement learning. The choice among them depends on the required output of the model and on the
application.

Cricket is one of the most watched sports on television. It is hugely popular in nations such as India,
Australia, England, New Zealand, and South Africa. A recurring issue is that the score projected for the
first innings often does not match the actual first innings score. This motivates a model that predicts the
first innings score of an IPL match more precisely, which in turn makes it easier for the audience to
anticipate the final result of the game in progress.

A typical Twenty20 game lasts three to four hours, with each innings lasting between 75 and 90 minutes
and a 10- to 20-minute intermission. There are 11 players on each team, and each innings is played over 20
overs. This version of the game is considerably shorter than earlier formats and closer in length to other
popular team sports. It was introduced to create a form of the game that would appeal to both spectators at
the ground and television viewers.

Several methodologies are used to predict the first innings score of a cricket match, and many
strategies and prediction systems exist for forecasting the score of IPL matches. The first innings score
is frequently estimated using the current run rate (CRR) method, which multiplies the average number of
runs scored per over so far by the total number of overs in an innings. This method considers only the runs
scored per over and ignores other factors such as wickets lost. The existing approach, which takes into
account only the current score and a few parameters, therefore gives only a rough estimate of the first
innings score. By combining a number of factors when predicting the first innings score, we aim to improve
on the accuracy of the existing system. We concentrate on live cricket score prediction and analyze IPL
games to predict scores.
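
The CRR projection described above can be sketched in a few lines. This is only an illustration: the function name `crr_projection` and the sample match situation are assumptions, and overs are treated as a plain decimal number rather than the over.ball notation used in scorecards.

```python
def crr_projection(current_runs: float, overs_bowled: float, total_overs: int = 20) -> float:
    """Project the innings total by extrapolating the current run rate."""
    if overs_bowled <= 0:
        raise ValueError("overs_bowled must be positive")
    current_run_rate = current_runs / overs_bowled  # average runs per over so far
    return current_run_rate * total_overs


# Hypothetical situation: 80 runs after 10 overs of a 20-over innings.
projected = crr_projection(80, 10)
print(projected)
```

Because the projection depends only on runs and overs, two innings with the same run rate but very different numbers of wickets lost receive the same estimate, which is the limitation the proposed model tries to address.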

To predict the IPL first innings score, the machine learning techniques of linear, lasso, and
ridge regression were used. Linear regression is a supervised method: the data presented to the model is
labeled, so the target value is already known for each training example. Rather than classifying objects,
a linear regression model forecasts continuous values. Ridge regression is useful when the data exhibits
multicollinearity.

Regression and classification problems are the two categories of problems that supervised
machine learning algorithms can handle. A problem is a classification problem when the desired output is a
category, and a regression problem when the desired output is a real value. Recommendation and time series
forecasting are two other popular types of problems built on classification and regression approaches.
Unsupervised learning seeks to model the distribution or structure of data in order to understand it
better.
2. LITERATURE SURVEY

A lot of work has been done on machine learning and score prediction techniques for estimating the final
score of the first innings in cricket. Numerous prediction algorithms have been developed in this area,
and the literature reviewed below covers a range of recent approaches.

Nikhil Dhonge, et al. [1] examined the algorithms used to predict the IPL first innings score.
They looked for a computationally sound method that can forecast the final score of the first two innings.
They make use of the SVC classifier, decision tree classifier, and random forest classifier, among other
classifiers. They gathered data on every IPL game played since 2008. The dataset has 76015 rows and
15 columns; using feature selection, they selected 8 of the columns, 7 of which are input features and 1
of which is the target variable. Regression analysis is used for the prediction because it forecasts a
continuous value: it takes a set of variables as input and produces a continuous output.

Apurva Lawate, et al. [2] proposed machine-learning-based projected score and winner prediction for a
cricket match. They compared the accuracy of several algorithms, including the Multilayer Perceptron Neural
Network, Ridge Regression, and Linear Regression. The dataset was built from IPL games played over more
than ten years, spanning 2009 to 2019. It is divided into two parts: data from 2008 to 2016 is used to
train the models, while data from 2017 onward is used to test them. They chose the algorithm that predicted
most accurately for matches in progress and deployed it as a web application with the aid of Flask. The
reported accuracy of this model is 77.286 percent.

R. Kamble, et al. [3] examined machine learning for predicting cricket scores. They developed a model
that can predict a team's score after 20 overs under the current conditions. To generate the prediction
models for each problem, classifiers such as Naive Bayes, Random Forest, multiclass SVM, and Decision
Trees are used. They found the Random Forest classifier to be the most accurate for each problem. The
dataset was built from IPL games played over the past five years and is split 80-20: 80% of the data is
used for training and 20% for testing. Linear regression is also used in this approach, with the same
procedure applied repeatedly.
T. Suvarna Kumari, et al. [4] worked with the k-Nearest Neighbors algorithm for predicting the cricket
match score. They proposed a method for predicting the final score of the first innings. On the first
innings and second innings datasets, they used the KNN algorithm to predict the match scores, where
the class attribute "X" is the score and the input attributes "Ai" (i = 1, 2, 3, ...) are the
relative team strength, home/away, and venue average. Relative team strength, home advantage, and venue
averages are thus all taken into account in the prediction. The dataset consists of complete matches,
excluding all those cancelled or postponed because of rain, played between 2000 and 2018 among
ODI-playing nations such as India, England, and Australia.

Kushooo, et al. [5] used data mining to predict cricket scores and winners. They devised a method for
predicting the first innings score in a cricket match. Most game predictions are framed as regression or
classification problems, both of which are supervised learning tasks. Regression produces a continuous
outcome, whereas classification deals with discrete outcomes. Linear Regression appeared to be very
effective for predicting continuous values, and learning algorithms such as Naive Bayes, Logistic
Regression, Neural Networks, and Random Forests had been used in many previous studies for classification
problems, such as predicting the outcome of matches or ranking players. They concluded from the results
that Random Forest was the most reliable classifier for both datasets.

Akhil Nimmagadda, et al. [6] used data mining and the Random Forest method to predict the cricket
score and the winner. They developed a method to predict the outcome of One-Day International cricket
matches, in which they evaluate the bowling and batting potential of the 22 players participating in the
match using their career statistics and their active participation in recent games. They used player
potential to quantify one team's relative strength over the other. They employ supervised learning
algorithms to predict the match winner, taking into account a few other basic factors such as run rate,
the match venue, and relative team strength. The model was built using Multiple Variable Linear
Regression.

Sudhanshu Akarshe, et al. [7] worked on predicting the cricket score using a variety of machine learning
algorithms. They proposed a model to predict the outcome of the game and the performance of each player
based on real data. They gathered data for every available match from a variety of websites, including
ESPN, Kaggle, and others. Such prediction systems often use a single algorithm and evaluate each run
separately; in contrast, they aim to compare performance across a variety of machine learning algorithms.
Rameshwari, et al. [8] used a linear regression classifier to predict the winner and the live cricket
score. They created a model using two approaches: the first estimates the score of the first innings from
the current run rate, the number of wickets lost, the match venue, and the batting and bowling teams. The
second uses the same features as the first, together with the batting team's target, to predict the
outcome of the game in the second innings. The two approaches were developed separately for the first and
second innings using a Linear Regression classifier (or a Q-learning-based decision tree approach) and a
Naive Bayes classifier.

Jalaz Kumar, et al. [9] proposed predicting the outcomes of ODI cricket matches using decision trees and
MLP networks. They use multilayer perceptron and decision tree classifiers. They compiled data from all
ODI games played between January 5, 1971, and October 29, 2017; the results of 3933 ODI matches were
gathered. They argue that these techniques are superior to statistical methods because, unlike statistics,
which defines relationships between variables using mathematical equations, they do not require any prior
assumptions about the input variables and their underlying relationships.

D. Jyothsna, et al. [10] analyzed IPL cricket data and predicted outcomes using Linear Regression,
Decision Tree, K-means, and Logistic Regression. They collected data from the IPL's seven most recent
seasons, including player, match, team, and ball-by-ball statistics, and analyzed it to produce several
recommendations for improving a player's performance. The impact of factors such as the venue or the toss
decision on match outcomes during the previous seven years is also determined. According to the findings,
Random Forest is the most accurate classifier, predicting the best player performance with an accuracy of
89.15 percent.

Rameshwari Lokhande, et al. [11] focused on live cricket score and winning prediction using the Naive
Bayes classifier and Linear Regression. For a limited-overs cricket match, they presented a way of building
a model that estimates the final score of the first innings and predicts the match outcome in the second
innings. The predictions take into account the toss, the teams' ODI rankings, and home team advantage.
Two separate models, trained on earlier matches, are proposed: a Linear Regression classifier for the
first innings and a Naive Bayes classifier for the second innings. A reinforcement algorithm is employed
rather than regression in its purest form.
Prasad Thorat, et al. [12] presented practical approaches for estimating the predicted score of the
first innings of a cricket match. When making predictions, the dynamic nature of a cricket match is
typically overlooked. They proposed CricFirst Predictor (CFP), a method that takes the game's key player
into account. The data comes from Kaggle datasets and is stored in CSV format. The dataset is split into
two parts: data used to train the model and data used to test it. The user inputs are the name of the
batting team, the name of the bowling team, total runs scored, total overs bowled, total wickets taken,
total runs scored in the last five overs, and total wickets taken in the previous five overs.

Prateek Gupta, et al. [13] analyzed and explained cricket score forecasting with neural networks. They
put forth a framework that removes the primary drawback of manual work, which is tedious and requires
effort to maintain the records and statistics of each player by hand. An LSTM-based neural network is used
to predict continuous values, with the score after 18 deliveries as the target and the data from the
previous 18 deliveries as the context. They used a dataset of more than 2600 T20 matches spanning more
than ten years (2010–2021). The dataset includes the IPL (2010–2021), BBL (2011–12 to 2020–21), PSL
(2015–16 to 2020–21), CPL (2013–2020), Lanka Premier League (2019–2020), and T20 international matches
(2010–2021). The most reliable model uses the "ELU" activation function.
3. PROBLEM STATEMENT

Cricket is a popular sport worldwide, and predicting the outcome of a cricket match has always been a
fascinating and challenging task for enthusiasts. Cricket score prediction is a complex task that involves
various factors such as weather conditions, pitch type, team composition, and past performance. Therefore,
there is a need for a machine learning-based model that can accurately predict the score of a cricket match.

4. EXISTING SYSTEM

❖ To forecast the final score of a cricket innings, many people use the current run rate (CRR) technique.
❖ In the CRR method, the average number of runs scored per over so far is multiplied by the total number
of overs in an innings.
❖ The prediction made using the CRR approach does not take the game's dynamic character into account.
5. PROPOSED SYSTEM

❖ The proposed solution will help cricket fans and enthusiasts to predict the final score of the match,
which will not only enhance their experience but also help them to make informed decisions while
placing bets.
❖ Additionally, it can be used by cricket teams to analyze their performance, identify the areas of
improvement, and develop strategies accordingly.
❖ We record the error measures of several models, namely Linear Regression, Ridge Regression, and Lasso
Regression, in order to attain high prediction accuracy.
❖ The best model is chosen according to these values.
❖ Accurate score prediction is our aim.
6. METHODOLOGY

6.1 Description of the dataset

IPL DATASET:

The project's dataset, ipl.csv, was downloaded from the Kaggle website. The dataset includes data on each and
every ball played in every IPL game from 2008 to 2017. There are 1140210 tuples in the collection. 15 attributes
are contained in each tuple. The 15th characteristic, a real valued attribute with a range of 0 to infinity, is the
target feature. In this dataset, each feature is specified as follows:

❖ mid: the match id.

❖ date: the day the match took place, in the format DD-MM-YYYY.

❖ venue: the location where the match takes place.

❖ bat_team: the name of the batting team.

❖ bowl_team: the name of the bowling team.

❖ batsman: the name of the batsman on strike.

❖ bowler: the bowler's name.

❖ runs: total runs scored up until that ball.

❖ wickets: how many wickets have been lost up until that ball.

❖ overs: which over and ball is being bowled.

❖ runs_last_5: runs scored in the previous five overs.

❖ wickets_last_5: wickets lost in the previous five overs.

❖ striker: the score of the batsman who is on strike.

❖ non-striker: the score of the batsman at the other end, who is not on strike.

❖ total: the total runs scored in that innings after 20 overs.
6.2 MODEL ARCHITECTURE:

We intend to develop a model that can accurately predict the first innings score of a live IPL
match. The model should take into account several parameters that improve the score forecast.

DATA COLLECTION:
The dataset is taken from those available on Kaggle and is collected in CSV format. The collected data is
cleaned in the following stage.

DATA CLEANING:
The match ID, venue, batsman and bowler names, the score of the striker batsman, and the score of the
non-striker batsman are all removed during the data cleaning stage, because these columns are not used for
prediction. A few teams in the IPL dataset are inconsistent because they have only recently started
playing; we consider only the consistent teams and remove the others from the dataset. We also keep only
the data recorded after the fifth over. The dataset's date column is stored as a string, so it is
converted to a datetime object.

DATA PREPROCESSING:
The data must be preprocessed after it has been cleaned. One-hot encoding is performed as part of the
preprocessing stage; it is explained in detail in the implementation section. The dataset's columns are
also rearranged during preprocessing so that they appear in a consistent order.

DATA SPLITTING:
After preprocessing the data, we split it so that IPL games up to and including 2016 are used for
training the model and IPL games from 2017 are used for testing it.
MODEL GENERATION:
To predict the score from the data, we use the Linear Regression, Ridge Regression, and Lasso Regression
models. The models are built from the training portion of the input dataset and evaluated on the test
portion. The outcome is then predicted using fresh data.

FINAL PREDICTION:

❖ Finally, the required inputs are obtained from the user in order to forecast the result. Based on
the input, a range of final scores is displayed as output on the website.

❖ The model design for the CFP framework is as follows.

Model Generation

(Ridge Regression)

Fig 6.1 Model Architecture

6.3 MACHINE LEARNING ALGORITHM:

Ridge regression is used to create a parsimonious model when the number of predictor variables exceeds
the number of observations, or when a data set exhibits multicollinearity (correlations between predictor
variables). Ridge regression uses L2 regularization, which adds an L2 penalty equal to the square of the
magnitude of the coefficients. In contrast, L1 regularization can shrink some coefficients exactly to
zero and so eliminate them from the model.

Linear regression assumes a linear relationship between the input variables and the target variable.
With a single input variable this relationship is a line; with several input variables it is a
hyperplane. The model's coefficients are found by an optimization process that seeks to minimize the sum
of squared errors between the predictions (yhat) and the expected target values (y).

loss = sum i=0 to n (y_i – yhat_i)^2

In linear regression, the estimated coefficients can sometimes become very large, making the model
sensitive to changes in the inputs and possibly unstable. This is especially true for problems with fewer
samples (n) than input predictors (p).

One strategy for improving the stability of regression models is to change the loss function so that it
includes additional costs for a model with large coefficients. Linear regression models that use these
modified loss functions during training are referred to as penalized linear regression.

One well-known penalty is to penalize a model based on the sum of the squared coefficient values (beta).
This is referred to as an L2 penalty.

l2_penalty = sum j=0 to p beta_j^2

An L2 penalty reduces the size of all coefficients, but it does not remove any coefficient from the
model, because it does not force a coefficient's value to become exactly zero.

The effect of this penalty is that the parameter estimates are allowed to become large only if there is a
proportional reduction in SSE. In effect, this method shrinks the estimates toward 0 as the lambda penalty
increases (these techniques are sometimes called "shrinkage methods").

This penalty can be added to the cost function for linear regression. The resulting method is referred to
as Tikhonov regularization (after its author) or, more commonly, Ridge Regression.

The "lambda" hyperparameter controls the weighting of the penalty in the loss function. A value of 1
weights the penalty fully; a value of 0 excludes it. Small values of lambda, such as 1e-3 or smaller, are
common.

ridge_loss = loss + (lambda * l2_penalty)
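
The three formulas above can be written out directly with NumPy. The sketch below is illustrative only: the toy data is made up, the intercept is omitted for brevity, and `ridge_fit` uses the standard closed-form ridge solution rather than whatever solver the project's library applies internally.

```python
import numpy as np


def ridge_loss(X, y, beta, lam):
    """Sum of squared errors plus lambda times the sum of squared coefficients."""
    y_hat = X @ beta
    loss = np.sum((y - y_hat) ** 2)      # loss = sum (y_i - yhat_i)^2
    l2_penalty = np.sum(beta ** 2)       # l2_penalty = sum beta_j^2
    return loss + lam * l2_penalty       # ridge_loss = loss + lambda * l2_penalty


def ridge_fit(X, y, lam):
    """Closed-form minimizer of ridge_loss (no intercept): (X'X + lambda*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Toy data (made up): two almost identical predictors, i.e. strong multicollinearity.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=50)])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

beta_small = ridge_fit(X, y, lam=1e-3)
beta_large = ridge_fit(X, y, lam=100.0)
print(np.linalg.norm(beta_small), np.linalg.norm(beta_large))
```

The printed coefficient norm for the larger lambda is smaller than for the smaller lambda, which is the shrinkage effect described above.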

7. SYSTEM DESIGN

UML DESIGN:

UML is a general-purpose modeling language. Its main objective is to provide a standard way of
visualizing the design of a system. It is similar to the blueprints used in other fields of engineering.

7.1 USE CASE DIAGRAM:


A use case diagram illustrates a system's dynamic behavior. To encapsulate the system's functionality, it
includes use cases, actors, and their interactions.

Fig 7.1 USE CASE DIAGRAM

7.2 CLASS DIAGRAM:

A class diagram is a form of static structure diagram used in software engineering that displays the
classes, attributes, operations (or methods), and relationships between the classes to illustrate the
structure of a system. It explains which class contains which information.

Fig 7.2 CLASS DIAGRAM
7.3 SEQUENCE DIAGRAM:

A sequence diagram, commonly referred to as a system sequence diagram (SSD), is a visual representation of
process interactions arranged sequentially in the field of software engineering.

Fig 7.3 SEQUENCE DIAGRAM

7.4 ACTIVITY DIAGRAM:

Organizational processes and the flow of control between class objects are both shown in activity diagrams.
These diagrams are built using specialized forms, and arrows are used to connect them.

Fig 7.4 ACTIVITY DIAGRAM

8. IMPLEMENTATION

8.1 DATA COLLECTION:

Fig 8.1 DATA COLLECTION

A screenshot of the code that is used to get data from an external source is shown in Figure 8.1. Any
model must have input data, also known as datasets, in order to be trained. There are other ways to
gather data from the web; in this instance, we're using the ipl.csv dataset. The Kaggle website is where
it was gathered. There are 1140210 tuples in the dataset, and each tuple has 15 features.

8.2 DATA CLEANING:

Fig 8.2 DATA CLEANING

Figure 8.2 shows the code used to keep only the data required for building the model. It removes the
dataset's unwanted columns, which in this case are mid, batsman, bowler, striker, and non-striker.
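
Since the code screenshot is not reproduced here, the cleaning step can be sketched with pandas as follows. The two sample rows are made up, and the lowercase column names are an assumption based on the feature list in Section 6.1; only the list of dropped columns comes from the text.

```python
import pandas as pd

# Two made-up rows using the 15 column names described for ipl.csv (assumed spelling).
df = pd.DataFrame({
    "mid": [1, 1],
    "date": ["18-04-2008", "18-04-2008"],
    "venue": ["Venue A", "Venue A"],
    "bat_team": ["Team A", "Team A"],
    "bowl_team": ["Team B", "Team B"],
    "batsman": ["Batsman 1", "Batsman 2"],
    "bowler": ["Bowler 1", "Bowler 1"],
    "runs": [61, 62],
    "wickets": [1, 1],
    "overs": [5.1, 5.2],
    "runs_last_5": [58, 59],
    "wickets_last_5": [1, 1],
    "striker": [30, 12],
    "non-striker": [12, 30],
    "total": [222, 222],
})

# Drop the columns that the report lists as unwanted for prediction.
columns_to_remove = ["mid", "batsman", "bowler", "striker", "non-striker"]
df = df.drop(columns=columns_to_remove)
print(df.columns.tolist())
```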

8.3 DATA EXTRACTION:

Fig 8.3 DATA EXTRACTION

Figure 8.3 shows the code used to keep only the consistent teams that have competed since the beginning
of the IPL. Because the first five overs of data cannot be used to predict the score, they are excluded
from the dataset in the same code. The date column is also converted from its string format into a
datetime object.
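
A minimal pandas sketch of these three operations is given below. The rows and the contents of `consistent_teams` are placeholders (the report does not list the teams it keeps), and the date format string assumes the DD-MM-YYYY layout mentioned in Section 6.1.

```python
import pandas as pd

# Placeholder list: the report does not name the consistent teams.
consistent_teams = ["Team A", "Team B"]

# Made-up rows for illustration.
df = pd.DataFrame({
    "date": ["18-04-2008", "18-04-2008", "05-04-2017", "06-04-2017"],
    "bat_team": ["Team A", "Team A", "Team B", "Team C"],
    "bowl_team": ["Team B", "Team B", "Team A", "Team B"],
    "overs": [4.6, 5.1, 12.3, 8.2],
    "total": [180, 180, 165, 150],
})

# Keep rows only when both the batting and the bowling team are consistent teams.
both_consistent = df["bat_team"].isin(consistent_teams) & df["bowl_team"].isin(consistent_teams)
df = df[both_consistent]

# Exclude the first five overs of every innings.
df = df[df["overs"] >= 5.0].copy()

# Convert the date column from a string to a datetime object.
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
print(df)
```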

8.4 DATA PREPROCESSING:

Fig 8.4 DATA PREPROCESSING

Figure 8.4 shows the code used to convert the categorical features using one-hot encoding. For the
columns bat_team and bowl_team, this encoding generates a new binary feature for each possible category.
The batting team entered by the user is replaced with 1 and all other teams with 0; likewise, the
bowling team entered by the user is replaced with 1 and all other teams with 0.
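
One common way to perform this encoding is `pandas.get_dummies`; whether the project used this exact call is an assumption, and the team names and values below are placeholders.

```python
import pandas as pd

# Made-up rows with placeholder team names.
df = pd.DataFrame({
    "bat_team": ["Team A", "Team B", "Team A"],
    "bowl_team": ["Team B", "Team A", "Team B"],
    "runs": [45, 60, 102],
    "total": [170, 155, 170],
})

# Each category of bat_team and bowl_team becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["bat_team", "bowl_team"], dtype=int)
print(encoded.columns.tolist())
print(encoded.iloc[0].to_dict())
```

In every row exactly one `bat_team_*` column and one `bowl_team_*` column is 1, and the rest are 0.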

8.5 DATA SPLITTING:

Fig 8.5 DATA SPLITTING

The complete dataset is divided into two parts, as shown in Figure 8.5. The original dataset includes
the IPL matches from 2008 to 2017. The code splits it into training and testing sets, using data from
2008 to 2016 for training and data from 2017 for testing. The test data has 61116 tuples, while the
training data has 821260 tuples.
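
A year-based split of this kind can be sketched as follows, assuming the date column has already been converted to datetime. The rows are made up, and the variable names (`X_train`, `y_test`, and so on) are illustrative rather than taken from the project code.

```python
import pandas as pd

# Made-up rows; "total" is the target column.
df = pd.DataFrame({
    "date": pd.to_datetime(["2008-04-18", "2016-05-29", "2017-04-05", "2017-05-21"]),
    "runs": [61, 88, 47, 120],
    "total": [222, 208, 207, 129],
})

# Matches up to and including 2016 form the training set, 2017 the test set.
train = df[df["date"].dt.year <= 2016]
test = df[df["date"].dt.year >= 2017]

# Separate the features from the target; the date is not used as a feature.
X_train, y_train = train.drop(columns=["total", "date"]), train["total"]
X_test, y_test = test.drop(columns=["total", "date"]), test["total"]
print(len(X_train), len(X_test))
```

Splitting by season rather than at random keeps every ball of a given match on the same side of the split.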

8.6 MODEL EVOLUTION:

Multiple Machine Learning Algorithms, including Linear Regression, Ridge Regression, and Lasso
Regression, are included in the proposed system.
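
As a rough sketch of how the three regressors can be fitted side by side, the example below uses scikit-learn on synthetic data. The data, the 160/40 split, and the `alpha` values (scikit-learn's name for the lambda penalty weight) are assumptions for illustration and are not the project's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic regression data (made up): 200 samples, 5 features.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 5))
true_coef = np.array([50.0, -20.0, 10.0, 0.0, 5.0])
y = 100 + X @ true_coef + rng.normal(scale=2.0, size=200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# The three regressors named in the report, with assumed penalty weights.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # learn coefficients from training data
    predictions[name] = model.predict(X_test)    # predict the held-out targets
    print(name, predictions[name][:3].round(1))
```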

8.6.1 LINEAR REGRESSION:

Fig 8.6.1 LINEAR REGRESSION

8.6.2 LASSO REGRESSION:

Fig 8.6.2 LASSO REGRESSION

8.6.3 RIDGE REGRESSION:

Fig 8.6.3 RIDGE REGRESSION

8.7 FINAL PREDICTION:

To receive input from the user and present the range of score predicted by the ridge regressor model, a user
interface is made using flask and web templates.

Fig 8.7.1 SCREENSHOT FOR WEB PAGE

When a user completes all of the fields on the homepage above, a list of 16 one-hot team indicators
(8 for the batting team and 8 for the bowling team) is generated. For example, if the batting team is
Bengaluru and the bowling team is Rajasthan, it is encoded in the backend as
[1,0,0,0,0,0,0,0] + [0,0,0,0,1,0,0,0]. The two lists are concatenated and combined with the remaining 6
variables entered in the HTML form. To determine the highest possible score, 10 is added to the predicted
final score; to determine the lowest possible score, 10 is subtracted from it. The user is then shown
this range as the predicted value. For instance, if the model predicts a first innings score of 180, the
final score is shown as 170 to 190.
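
The encoding and the score range described above can be sketched in plain Python. The order of `TEAMS` is hypothetical and was chosen only so that Bengaluru maps to position 0 and Rajasthan to position 4, as in the example; the six remaining inputs are not itemized in the text, so the numbers passed as `other_inputs` are placeholders.

```python
# Hypothetical team order, arranged to reproduce the example in the text.
TEAMS = ["Bengaluru", "Chennai", "Delhi", "Kolkata",
         "Rajasthan", "Mumbai", "Punjab", "Hyderabad"]


def one_hot(team, teams=TEAMS):
    """Return a list with 1 at the team's position and 0 elsewhere."""
    return [1 if name == team else 0 for name in teams]


def build_features(bat_team, bowl_team, other_inputs):
    """Concatenate both one-hot lists with the remaining numeric inputs."""
    return one_hot(bat_team) + one_hot(bowl_team) + list(other_inputs)


def score_range(predicted_score, margin=10):
    """Return the (lowest, highest) score shown to the user."""
    predicted = int(round(predicted_score))
    return predicted - margin, predicted + margin


# Six placeholder values standing in for the remaining form inputs.
features = build_features("Bengaluru", "Rajasthan", [10.2, 85, 2, 42, 1, 0])
print(features[:16])
print(score_range(180))
```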

Fig 8.7.2 SCREENSHOT FOR FILLING VALUES IN WEB PAGE

9. RESULT ANALYSIS

9.1 LINEAR REGRESSION:

Fig 9.1 ERROR RATE USING LINEAR REGRESSION

Figure 9.1 shows the three error measures obtained with the linear regression technique: a mean absolute
error of 10.2572, a mean squared error of 160.60052, and a root mean squared error of 12.67282.

9.2 RIDGE REGRESSION:

Fig 9.2 ERROR RATES USING RIDGE REGRESSION

Figure 9.2 shows the three error measures obtained with the ridge regression technique: a mean absolute
error of 12.43432, a mean squared error of 276.68004, and a root mean squared error of 16.63370.

9.3 LASSO REGRESSION:

Fig 9.3 ERROR RATES USING LASSO REGRESSION

Figure 9.3 shows the three error measures obtained with the lasso regression technique: a mean absolute
error of 12.21358, a mean squared error of 262.36538, and a root mean squared error of 16.19769.

9.4 COMPARISON BETWEEN 3 DIFFERENT REGRESSORS:

As the three figures above show, linear regression gives the lowest mean absolute error, mean squared error,
and root mean squared error of the three models.

Linear regression therefore produces the most accurate results when compared with Ridge regression and
Lasso regression.

                           LINEAR REGRESSION     RIDGE REGRESSION      LASSO REGRESSION
MEAN ABSOLUTE ERROR        10.25729245359391     12.434327274896821    11.775248362875073
MEAN SQUARED ERROR         160.6005271696334     276.6800450229181     236.32988043635393
ROOT MEAN SQUARED ERROR    12.672826329182982    16.633702084109782    15.373024440114376

COMPARISON BETWEEN DIFFERENT REGRESSORS

Fig 9.4 COMPARISON BETWEEN DIFFERENT FITTING MODELS

OVERFITTING:
When a model performs well on training data but poorly on test data, this is known as overfitting.

UNDERFITTING:
When a model performs poorly on both the training data and the test data, it is said to be underfit.

GOOD FIT:
A model is a good fit when it performs comparably well on both the training and test datasets. In this
project, a good fit corresponds to an accuracy between 70 and 80 percent on both.
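The three cases above can be told apart by comparing a model's score on the training data with its score on the test data. The sketch below is illustrative only: the helper names r_squared and describe_fit are hypothetical, and the thresholds (a floor of 0.70 and a maximum gap of 0.10) are assumptions chosen to mirror the 70 percent figure quoted above. The score used is R squared, which is what the score method of scikit-learn regressors returns and what the appendix code prints as accuracy.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def describe_fit(train_score, test_score, floor=0.70, max_gap=0.10):
    """Label a model from its train and test scores.

    floor and max_gap are assumed thresholds, not values from the report.
    """
    if train_score < floor:
        return "underfit"   # poor even on the data it was trained on
    if train_score - test_score > max_gap or test_score < floor:
        return "overfit"    # good on training data, clearly worse on test data
    return "good fit"       # comparable and acceptable on both


print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0 for perfect predictions
print(describe_fit(0.95, 0.60))         # overfit
print(describe_fit(0.50, 0.48))         # underfit
print(describe_fit(0.78, 0.75))         # good fit
```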

9.5 GRAPHICAL REPRESENTATION FOR ACTUAL AND PREDICTED VALUES:

9.5.1 LINEAR REGRESSION:

Fig 9.5.1 PREDICTION FOR LINEAR REGRESSION

9.5.2 RIDGE REGRESSION:

Fig 9.5.2 PREDICTION FOR RIDGE REGRESSION

9.5.3 LASSO REGRESSION:

Fig 9.5.3 PREDICTION FOR LASSO REGRESSION

9.6 ERROR RATES FOR TRAINING DATA:

Fig 9.6 ERROR RATES FOR TRAINING DATA

10. CONCLUSION

This project focuses on predicting the first innings score of an IPL match using three machine learning
algorithms: Linear Regression, Ridge Regression, and Lasso Regression. The ipl.csv dataset is used. It
contains 15 columns, of which the most frequently used are the batting team, bowling team, runs scored so
far, wickets fallen so far, number of completed overs, runs scored in the last 5 overs, and wickets taken in
the last 5 overs. Given these inputs, the project outputs the forecast first innings score of an IPL match.
The root mean squared errors of Linear Regression, Ridge Regression, and Lasso Regression are 12.67282,
16.63370, and 16.19769 respectively, so Linear Regression gives the lowest error of the three.

12. APPENDIX

SAMPLE CODES:

MACHINE LEARNING CODE:

import pandas as pd
import pickle
from datetime import datetime

# Load the dataset (raw string so the backslashes in the Windows path are kept literally)
df = pd.read_csv(r"D:\project\ipl.csv")

# Drop unwanted columns
df = df.drop(columns=['mid', 'bowler', 'striker', 'non-striker'])

# Keep only consistent teams
consistent_teams = ['Kolkata Knight Riders', 'Chennai Super Kings', 'Rajasthan Royals',
                    'Mumbai Indians', 'Kings XI Punjab', 'Royal Challengers Bangalore',
                    'Delhi Daredevils', 'Sunrisers Hyderabad']
df = df[df['bat_team'].isin(consistent_teams) & df['bowl_team'].isin(consistent_teams)]

# Remove the first 5 overs data in every match
df = df[df['overs'] >= 5.0]

# Convert date to datetime object
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

# One-hot encode categorical features
encoded_df = pd.get_dummies(df, columns=['bat_team', 'bowl_team', 'venue'])

# Reorder columns

encoded_df = encoded_df[[
    'date',
    'bat_team_Chennai Super Kings', 'bat_team_Delhi Daredevils', 'bat_team_Kings XI Punjab',
    'bat_team_Kolkata Knight Riders', 'bat_team_Mumbai Indians', 'bat_team_Rajasthan Royals',
    'bat_team_Royal Challengers Bangalore', 'bat_team_Sunrisers Hyderabad',
    'bowl_team_Chennai Super Kings', 'bowl_team_Delhi Daredevils', 'bowl_team_Kings XI Punjab',
    'bowl_team_Kolkata Knight Riders', 'bowl_team_Mumbai Indians', 'bowl_team_Rajasthan Royals',
    'bowl_team_Royal Challengers Bangalore', 'bowl_team_Sunrisers Hyderabad',
    'venue_Barabati Stadium', 'venue_Brabourne Stadium', 'venue_Buffalo Park',
    'venue_De Beers Diamond Oval', 'venue_Dr DY Patil Sports Academy',
    'venue_Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium',
    'venue_Dubai International Cricket Stadium',
    'venue_Eden Gardens', 'venue_Feroz Shah Kotla',
    'venue_Himachal Pradesh Cricket Association Stadium',
    'venue_Holkar Cricket Stadium', 'venue_JSCA International Stadium Complex',
    'venue_Kingsmead', 'venue_M Chinnaswamy Stadium',
    'venue_MA Chidambaram Stadium, Chepauk',
    'venue_Maharashtra Cricket Association Stadium',
    'venue_New Wanderers Stadium', 'venue_Newlands', 'venue_OUTsurance Oval',
    'venue_Punjab Cricket Association IS Bindra Stadium, Mohali',
    'venue_Punjab Cricket Association Stadium, Mohali',
    'venue_Rajiv Gandhi International Stadium, Uppal',
    'venue_Sardar Patel Stadium, Motera', 'venue_Sawai Mansingh Stadium',
    'venue_Shaheed Veer Narayan Singh International Stadium',
    'venue_Sharjah Cricket Stadium', 'venue_Sheikh Zayed Stadium',
    'venue_Subrata Roy Sahara Stadium',
    'venue_SuperSport Park', 'venue_Wankhede Stadium',
    'overs', 'runs', 'wickets', 'runs_last_5', 'wickets_last_5', 'total']]

# Split data into train and test sets (seasons before 2017 for training, 2017 onwards for testing)
X_train = encoded_df[encoded_df['date'].dt.year < 2017].drop(['date', 'total'], axis=1)
X_test = encoded_df[encoded_df['date'].dt.year >= 2017].drop(['date', 'total'], axis=1)
y_train = encoded_df[encoded_df['date'].dt.year < 2017]['total'].values
y_test = encoded_df[encoded_df['date'].dt.year >= 2017]['total'].values

# The 'date' column is already dropped from X_train and X_test above,
# so no separate removal step is needed (dropping it again would raise a KeyError)
# --- Model Building ---

# Linear Regression Model
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

regressor = LinearRegression()

# Fit on the training split only, so that the test split stays unseen
regressor.fit(X_train, y_train)

# score() returns the R^2 value, reported here as a percentage
train_lr = regressor.score(X_train, y_train)
print("training accuracy with lr ", train_lr*100)
test_lr = regressor.score(X_test, y_test)
print("testing accuracy with lr ", test_lr*100)

# Error rates on the test split
prediction1 = regressor.predict(X_test)
print('MAE:', metrics.mean_absolute_error(y_test, prediction1))
print('MSE:', metrics.mean_squared_error(y_test, prediction1))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction1)))

# Creating a pickle file for the regressor
filename = 'inning_one-lr-model.pkl'
pickle.dump(regressor, open(filename, 'wb'))
## Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import numpy as np

ridge = Ridge()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 15, 20, 25, 30]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=6)

# Tune alpha on the training split only
ridge_regressor.fit(X_train, y_train)
print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

# Save the tuned ridge model
filename = 'ridge-inning_one-score-lr-model.pkl'
pickle.dump(ridge_regressor, open(filename, 'wb'))

# Fit the default ridge model on the training split and score both splits
ridge.fit(X_train, y_train)
test_ridge = ridge.score(X_test, y_test)
print("testing accuracy with ridge ", test_ridge*100)
train_ridge1 = ridge.score(X_train, y_train)
print("training accuracy with ridge ", train_ridge1*100)

# Error rates on the test split
prediction2 = ridge.predict(X_test)
print("error rates using ridge regression")
print("\n")
print('MAE:', metrics.mean_absolute_error(y_test, prediction2))
print('MSE:', metrics.mean_squared_error(y_test, prediction2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction2)))

# Lasso Regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import numpy as np

lasso = Lasso()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)

# Tune alpha on the training split only
lasso_regressor.fit(X_train, y_train)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

# Fit the default lasso model on the training split and score both splits
lasso.fit(X_train, y_train)
test_lasso = lasso.score(X_test, y_test)
print("testing accuracy with lasso ", test_lasso*100)
train_lasso1 = lasso.score(X_train, y_train)
print("training accuracy with lasso ", train_lasso1*100)

# Error rates on the test split
prediction3 = lasso.predict(X_test)
print("error rates using lasso regression")
print("\n")
print('MAE:', metrics.mean_absolute_error(y_test, prediction3))
print('MSE:', metrics.mean_squared_error(y_test, prediction3))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction3)))

# Predictions on the training data
predictionlr = regressor.predict(X_train)
predictionr = ridge_regressor.predict(X_train)
predictionl = lasso_regressor.predict(X_train)

# Error Rates
print("linear regression")
print('MAE:', metrics.mean_absolute_error(y_train, predictionlr))
print('MSE:', metrics.mean_squared_error(y_train, predictionlr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, predictionlr)))
print("\n")

print("ridge regression")
print('MAE:', metrics.mean_absolute_error(y_train, predictionr))
print('MSE:', metrics.mean_squared_error(y_train, predictionr))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, predictionr)))
print("\n")

print("lasso regression")
print('MAE:', metrics.mean_absolute_error(y_train, predictionl))
print('MSE:', metrics.mean_squared_error(y_train, predictionl))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, predictionl)))
FLASK:

from flask import Flask, render_template, request
import pickle
import numpy as np

# Load the models
ridge_model = pickle.load(open('first-innings-score-lr-model.pkl', 'rb'))
linear_model = pickle.load(open('linear-first-innings-score-lr-model.pkl', 'rb'))
lasso_model = pickle.load(open('Lasso-first-innings-score-lr-model.pkl', 'rb'))

app = Flask(__name__)

# Define the routes
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Get the team names and encode them as one-hot vectors
    bat_team = request.form['batting-team']
    bowl_team = request.form['bowling-team']
    team_names = ['Chennai_Super_Kings', 'Delhi_Daredevils', 'Kings_XI_Punjab',
                  'Kolkata_Knight_Riders', 'Mumbai_Indians', 'Rajasthan_Royals',
                  'Royal_Challengers_Bangalore', 'Sunrisers_Hyderabad']
    team_encoding = ([int(bat_team == team) for team in team_names]
                     + [int(bowl_team == team) for team in team_names])

    # Get the other input values
    overs = float(request.form['overs'])
    runs = int(request.form['runs'])
    wickets = int(request.form['wickets'])
    runs_in_prev_5 = int(request.form['runs_in_prev_5'])
    wickets_in_prev_5 = int(request.form['wickets_in_prev_5'])

    # Concatenate the input values and the team encoding into a single input vector
    input_vector = np.array(team_encoding + [overs, runs, wickets, runs_in_prev_5,
                                             wickets_in_prev_5]).reshape(1, -1)

    # Make predictions using the models
    ridge_prediction = int(ridge_model.predict(input_vector)[0])
    linear_prediction = int(linear_model.predict(input_vector)[0])
    lasso_prediction = int(lasso_model.predict(input_vector)[0])

    # Define the result range for the Ridge model
    lower_limit = ridge_prediction - 10
    upper_limit = ridge_prediction + 10

    # Render the results page
    return render_template('result.html', Ridge_result=ridge_prediction,
                           Linear_result=linear_prediction, Lasso_result=lasso_prediction,
                           lower_limit=lower_limit, upper_limit=upper_limit)

if __name__ == '__main__':
    app.run(debug=True)

IMPLEMENTING SCREEN:

Fig 12.1 OUTPUT SCREEN
