Machine Learning For Football Matches and Tournaments
2022 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COM-IT-CON) | 978-1-6654-9602-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/COM-IT-CON54601.2022.9850673
Abstract: The excitement of predicting the result of a football game and of supporting a favourite club has brought sports enthusiasts together to develop platforms around it. Given the access to data, football analytics has emerged as a huge industry not limited to the players and teams. This research uses data available for different football leagues, together with classification models, to predict the result of a match being played in the English Premier League at halftime with 80% accuracy (using ensemble techniques), and to predict the winner of the same league at a match week level with up to 95% accuracy, using the historical and current performance of the clubs and players in the league. The research uses a dynamic dataset and proposes an application that requires minimum user input for prediction. Further, simple features and visualization of results make the research understandable and reusable.

Keywords: classification, league winner, match week, data preparation, winning probability.

I. INTRODUCTION

Football is a global sport, played and watched in every part of the world. There are different national and international leagues, divisions, and tournaments, each with its own competition structure and fixtures. Generally, there is more than one football tournament going on at a time, and fans and football enthusiasts are eager to watch the results unfold. Many virtual platforms with whole communities have also been developed around football, such as FIFA and Fantasy Premier League. Given that several organizations manage these tournaments offline and online, there is also a huge database for every action of every game. The data is present in the form of numbers, images, videos, etc., at different levels, i.e., player level, match level, league level, playing ground level, coach level, country level, etc.

This research work is a continuation of the work done by the authors in the paper 'Full-Time Result Prediction using ensemble techniques' [1]. We have taken advantage of the freely and easily accessible data for one of the biggest leagues in Europe, the English Premier League (or the Premier League), for our research. An understanding of the data, combined with an understanding of the game, has helped us formulate a way of predicting the result of a football match at halftime and the winner of the league/tournament based on historical and current performance.

We have divided the research into two parts: match winner prediction and league winner prediction. The first part of the research establishes how adding features to the data can help predict the results of a game better. We show the effect of the added features and compare it with previously computed results. For the second part, we shift our focus from the match level to the league level, i.e., we do not consider two teams in the prediction model but only one (each of the participating clubs in the league). We perform a set of operations to convert the data used in the first part for the latter part. We use our understanding of the game to get data at a match week level and then predict the league's winner on a team and match week level. We further develop an application to make our research accessible to everyone and use visual tools to display the modeling results.

The first section of the paper describes the terms used frequently in the later sections and details similar work done in the past. The next section discusses the first part of the research: the dataset, features, modeling, results, and comparison. The latter part of the research is described, along with the preprocessing
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:43:39 UTC from IEEE Xplore. Restrictions apply.
III. MATCH WINNER PREDICTION

This part of the research extends our previous work, 'Full-Time Result Prediction using ensemble techniques' [1]. Here, we try to predict, at halftime, the result of a football match played in the English Premier League in terms of the winning probability of the home team. The prediction is based on the goals scored till halftime, the current leaderboard position of the teams in the league, the previous year's leaderboard standing, and the performances in earlier matches. In comparison to the previous research, we have added independent features such as the form and form points, and considered goals scored and conceded by the teams in the current season.

A. Dataset

For this research, we use the English Premier League (or Premier League) data available freely on the Football-Data.co.uk website and many others. The data's granularity is at the match level; thus, the dataset for every year contains 380 matches. The dataset consists of 45 features: match date, home team, away team, referee, goals scored at halftime and full time, shots, shots on target, corners, fouls, offsides, cards, and many more. We wanted a prediction model based on historical and current season statistics. For this purpose, we used the Premier League data from 2005 to 2020 for training (6060 matches), the last 20 matches of the 2020-2021 league as validation, and the ongoing 2021-22 season as the test set.

The dataset for each year includes five categorical variables, one ID variable, and one date variable; the rest are numerical variables. For the sake of simplicity, and to make the end application easily usable by all kinds of users and not just football fanatics, we restricted the model to common features and built derived features on top of them, rather than using every feature available in the dataset; this also keeps the number of usable data points high. The chosen features are the names of the playing teams and the goals scored by them at halftime. We also retrieve the leaderboard position of the previous year from a static self-concatenated dataset, a sample of which is shown in Fig. 1. Our dependent variable is the full-time result.

Fig. 1 EPL Previous Year Standings

Here, for each team (40 in total), the final leaderboard position (between 1 and 20, both included) for each of the previous 18 years is specified. For a match being played in the current season, the leaderboard position of last year, i.e., 2021, would be used. Blank cells indicate that the given team did not participate in the league in that season; they are filled with 0 for modeling. In the next section, we describe how multiple features were derived from these 7 basic features, i.e., Date, Home Team, Away Team, halftime and full-time goals for both teams, and the full-time result.

B. Features Added

All the features added in this research are arranged by team and match week, i.e., for a match played in the third match week (the third match of the season for both teams), we consider the cumulative values of all features over the previous two matches, whether played against the same or different teams. The added features and their descriptions are listed in Table I.

TABLE I
FEATURES ADDED

Feature | Acronym | Description
Goals Scored | HTGS/ATGS | The total number of goals scored by the team in the season
Goals Conceded | HTGC/ATGC | The total number of goals conceded by the team in the season
Cumulative Points | HTP/ATP | The sum of points of each team in the season (3 for a win, 1 for a draw, and 0 for a loss)
Form | HTFormPtsStr/ATFormPtsStr | A character string consisting of W, D, or L based on the results of the last 5 matches played by each team in the season
Form Points | HTFormPts/ATFormPts | The sum of points scored by each team in the previous 5 matches
Leaderboard Position | HomeTeamLP/AwayTeamLP | The final leaderboard position of the two playing teams
Match Week | MW | A number between 1 and 38, both included, specifying the playing week for every match
3-win streak | HTWinStreak3/ATWinStreak3 | Specifies if the team has won the previous 3 matches played in the season
5-win streak | HTWinStreak5/ATWinStreak5 | Specifies if the team has won the previous 5 matches played in the season
3-loss streak | HTLossStreak3/ATLossStreak3 | Specifies if the team has lost the previous 3 matches played in the season
5-loss streak | HTLossStreak5/ATLossStreak5 | Specifies if the team has lost the previous 5 matches played in the season
Goal Difference | HTGD/ATGD | The difference between the goals scored and goals conceded by each playing team
Points Difference | DiffPts | The difference in the cumulative points of the two playing teams
Form Points Difference | DiffFormPts | The difference in the form points of the two playing teams
Leaderboard Position Difference | DiffLP | The difference between the previous year's standings of the two playing teams
a. Note: HT stands for Home Team, and AT stands for Away Team.

For Goals Scored and Goals Conceded, the value is 0 for the first match played by each team, as these features aggregate the goals up to that point in the season. For calculating the form points, we have used the data of the previous 5 matches; this window can be increased or decreased for accuracy. For the first five match weeks, we add 0 points for the matches that have not been played. For example, in the fifth match week only four earlier results exist, so the fifth-last result of a team is not available. The four streak features are categorical features added to give the model more to learn about a team's performance in the current season.

Some of our features, namely the goal difference, points difference, and form points difference, tend to grow in magnitude as the season progresses. To avoid bias in the model, we scale these features according to their match week. This concluded the data preparation of our training and validation sets; next, we followed some pre-modelling steps before fitting the model, as described below.

C. Modelling

As a part of pre-modelling, we filter the data to keep only those matches that were played from the third match week onwards, because the initial matches contribute little to a team's future performance. After separating the data into the feature set and the target variable, the numerical data were scaled to bring the features onto comparable ranges and keep the model free from bias. The next pre-processing step was creating dummy variables for all the categorical variables in the feature set, mainly Home Team and Away Team. For evaluation, we split the dataset into training and validation sets in sequential order rather than randomly, which is the default.

To observe the behaviour of different state-of-the-art classification algorithms on our dataset, we fitted and evaluated 9 algorithms: XGB Classifier, Logistic Regression, Random Forest Classifier, Decision Tree Classifier, K Neighbours Classifier, SVC, Gaussian NB, Ada Boost Classifier, and Gradient Boosting Classifier. We also considered their cross-validation scores to choose the top-performing algorithms. The results of the iterations are shown in Table II.

TABLE II
COMPARISON OF ACCURACY (VALIDATION SET)

Model Name | Accuracy Score | Cross Val Score
XGB Classifier | 0.75 | 0.637276
Logistic Regression | 0.75 | 0.628674
Random Forest Classifier | 0.70 | 0.624373
Decision Tree Classifier | 0.65 | 0.539964
K Neighbors Classifier | 0.60 | 0.564337
SVC | 0.65 | 0.635842
GaussianNB | 0.50 | 0.463620
Ada Boost Classifier | 0.70 | 0.627240
Gradient Boosting Classifier | 0.70 | 0.636380

We can see that XGB Classifier, Logistic Regression, Ada Boost Classifier, and Gradient Boosting Classifier give similar accuracies of 70% to 75% on the validation set. Using the cross-validation score, we can discard Ada Boost Classifier. The next steps included combining the top 3 algorithms
using a Voting Classifier to get the best results possible. For this purpose, we ensembled the three, fitted the data to our ensemble model, and tested it on the validation set for evaluation. Our ensemble model resulted in an accuracy of 80% on our validation set.

D. Comparison

In 'Full-Time Result Prediction using Ensemble Techniques', as in 'Heterogeneous ensembles of classifiers in predicting Bundesliga football results' [6] by Jan Kozak and Szymon Głowania, we tried different combinations of the top-performing algorithms. We evaluated them on the same dataset with fewer features (only goals scored at halftime, points of the playing teams before a particular match, and last year's leaderboard position). The ensemble of 3 algorithms, XGBoost, Gradient Boosting Classifier, and Logistic Regression, gave the best results, i.e., 75% accuracy, as shown in tables 3 and 4. This time, a similar combination could classify the results with 80% accuracy with the help of the added features listed in Table I. An interesting point to note here is that the combination of only two classifiers, Gradient Boosting Classifier and Logistic Regression, also resulted in a similar accuracy of 80% when fitted to the latter.

data remains the same from season 2005 to 2020. To predict the results at the match week level, we need to bring the data to a week level instead of the current match level. This is described in the sections below.

A. Pre-processing

For this application, we work with the following features: year of the league, match week, team name, cumulative goals scored, cumulative goals conceded, points, and form points at a match week level. Teams generally play half of their matches at their home ground and the other half as the away team. So, we have considered all the matches played by a team irrespective of the ground. We concatenate the home match and away match statistics, remove duplicates (if any), and then group the data at a Year and Match Week level for each playing team, resulting in a dataset of shape (11015, 46), including dummy variables for each team. As our dependent feature is the result of the English Premier League, we have to include it in our training set. So, a categorical variable 'Win' was added based on the league winners, as listed in Table III.

TABLE III
WINNERS OF EPL (2005-20)
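The league-level pre-processing described in this subsection (stacking home and away statistics, grouping by year, team, and match week, and attaching a 'Win' label) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input column names HomeTeam, AwayTeam, FTHG, FTAG, and FTR are assumed to follow the Football-Data.co.uk CSV layout; Year, Team, MW, CumGS, CumGC, CumPts, and the helpers to_team_week_level and add_win_label are hypothetical names; cumulative totals are assumed to include the current match week; and the teams and winner in the demo are placeholders.

```python
import pandas as pd


def to_team_week_level(matches: pd.DataFrame) -> pd.DataFrame:
    """Reshape match-level rows into one row per (Year, Team, match week)."""
    # Home view: goals scored are FTHG, goals conceded are FTAG.
    home = matches.rename(columns={"HomeTeam": "Team", "FTHG": "GS", "FTAG": "GC"})
    home["Pts"] = home["FTR"].map({"H": 3, "D": 1, "A": 0})
    # Away view: the same match seen from the visiting team's side.
    away = matches.rename(columns={"AwayTeam": "Team", "FTAG": "GS", "FTHG": "GC"})
    away["Pts"] = away["FTR"].map({"A": 3, "D": 1, "H": 0})

    cols = ["Year", "Date", "Team", "GS", "GC", "Pts"]
    stacked = pd.concat([home[cols], away[cols]], ignore_index=True)
    stacked = stacked.drop_duplicates().sort_values(["Year", "Team", "Date"])

    # Match week is taken here as the team's n-th match of the season (assumption).
    stacked["MW"] = stacked.groupby(["Year", "Team"]).cumcount() + 1
    # Running totals per team and season, including the current match week.
    for col in ["GS", "GC", "Pts"]:
        stacked["Cum" + col] = stacked.groupby(["Year", "Team"])[col].cumsum()
    return stacked.reset_index(drop=True)


def add_win_label(team_weeks: pd.DataFrame, winners: dict) -> pd.DataFrame:
    """Add the binary 'Win' target and one-hot encode the team names."""
    out = team_weeks.copy()
    out["Win"] = (out["Team"] == out["Year"].map(winners)).astype(int)
    return pd.get_dummies(out, columns=["Team"])


# Placeholder data: two matches between hypothetical teams "A" and "B".
matches = pd.DataFrame({
    "Year": [2020, 2020],
    "Date": pd.to_datetime(["2020-08-01", "2020-08-08"]),
    "HomeTeam": ["A", "B"],
    "AwayTeam": ["B", "A"],
    "FTHG": [2, 0],
    "FTAG": [1, 0],
    "FTR": ["H", "D"],
})
team_weeks = to_team_week_level(matches)
labelled = add_win_label(team_weeks, winners={2020: "A"})  # hypothetical winner
print(team_weeks[["Team", "MW", "CumGS", "CumGC", "CumPts"]])
```

Deriving the match week from a per-team match counter works for completed seasons; for a season with postponed fixtures, the text describes assigning match weeks from dates instead.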
B. Testing Data

As the league for which the prediction is to be made will always be in progress while the application is in use, there is a need for recent data that keeps updating. To keep the process simple, we load the dataset directly from the URL. The data has 105 features, but to match our independent variables with the training data, we need only the names of the playing teams, the goals scored by them at full time, and the full-time result. As the recent season was interrupted by Covid-19, the fixtures are different from the past years, i.e., all 10 matches of a match week are not played continuously. So, we calculate some of the features, such as match week, differently than for the training data.

For match week, we use the dates of matches and define game weeks according to the fixtures on the official website. We parse our Date variable as a DateTime object for this purpose. It was observed that some matches that were supposed to be played in match week 3 were postponed and later played after several scheduled match weeks. In such cases, we have hardcoded the dates and match weeks. We have also calculated the goals scored, goals conceded, the cumulative points, the form, and the form points (considering the previous five matches played by each team). We have also scaled the points scored by each team according to match week, as they tend to increase with match weeks. We have also dropped testing data related to the team 'Brentford', as it is a new team in the league with no historical data.

Our training data included 380 matches, i.e., 38 match weeks, for each season of the league; however, as the league considered for testing is ongoing, the number of match weeks is smaller. Thus, for all features dependent on match week, we have kept the matches not played till now as null, often denoted by X. The testing data is independent of the playing ground and is grouped by team and match week. The team names are one-hot encoded, and the data is then fed to our model.

C. Modelling

For modelling, we use the training data and testing data as prepared in the above sections. Our independent variables include all the features in the testing data, and our dependent variable is 'Win', a categorical variable created by us. For validation, we split our training data into train and validation sets with a validation size of 25%, i.e., our training set has 8,261 rows, and the validation set has 2,754. Similar to the first part of the research, we have tested and evaluated nine state-of-the-art classification algorithms, as mentioned in Fig 2. The accuracy and cross-validation scores of the models are also given for comparison in Table IV.

TABLE IV
MODEL EVALUATION RESULTS

Model Name | Accuracy Score | Cross Val Score
XGBClassifier | 0.946 | 0.978
Logistic Regression | 0.95 | 0.974
RandomForestClassifier | 0.952 | 0.976
DecisionTreeClassifier | 0.937 | 0.973
KNeighborsClassifier | 0.959 | 0.966
SVC | 0.949 | 0.97
GaussianNB | 0.692 | 0.505
AdaBoostClassifier | 0.948 | 0.979
GradientBoostingClassifier | 0.945 | 0.977

According to the accuracy and cross-validation scores, the AdaBoost classifier seemed to fit our dataset very well. In this situation, we did not see a need to use a fusion model as in 'Prediction of Football Match Results Based on Model Fusion' [8] by Quan Zhang, HongZhen Xu, Li Wei and LiangQi Zhou. Thus, we apply AdaBoost to our testing data and use its predict-probability function. This function gives the probability of a team winning or losing the league at the end of each match week. We represent the probability as a percentage with up to 2 decimal places.

D. Results

After modelling, we get an output that includes the team's name, the match week, and the probability of winning the league based on performance that week. We wished to have interactive user input and a visual representation of the result. For this purpose, we developed a simple application using Flask that takes two inputs from the user: the team whose winning chances are to be predicted, and the match week for which the winning chances of all teams need to be predicted, as shown in Fig. 2. The output is then manipulated
according to the user input, and the results are presented in the form of an interactive chart using JSON [9], Plotly [10], and Plotly Express.

Fig. 2 User Input Interface

work required only inputting the basic features of the playing teams, and every important metric was calculated automatically before feeding the data to the model. Adding features like form and form points, and the cumulative sums of previously present features, improved the accuracy by 5 percentage points, to 80% on the validation set. The comparison graph is shown in Fig. 5. The same dataset was used for the second part of the research.
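The automatic calculation of form and form points from basic match results, referred to above, can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the authors' code: the helper name form_features, the POINTS mapping, and the window parameter are hypothetical; results are assumed to be given in chronological order as W, D, or L; and matches not yet played contribute 0 points, as described for the first five match weeks.

```python
from collections import deque

# Points per result: 3 for a win, 1 for a draw, 0 for a loss.
POINTS = {"W": 3, "D": 1, "L": 0}


def form_features(results, window=5):
    """Return, for each match, the (form string, form points) built from
    the team's previous `window` results, excluding the match itself."""
    recent = deque(maxlen=window)  # keeps only the latest `window` results
    features = []
    for result in results:
        form_str = "".join(recent)
        form_pts = sum(POINTS[r] for r in recent)
        features.append((form_str, form_pts))
        recent.append(result)  # becomes available for the next match
    return features


# Hypothetical season start for one team.
season = ["W", "W", "D", "L", "W", "W"]
for week, (form_str, form_pts) in enumerate(form_features(season), start=1):
    print(week, form_str or "-", form_pts)
```

Because the current match is excluded from its own form, the features use only information available before kick-off, which matches how the cumulative features are described for the first match of a season.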
[7] J. Kozak, S. Głowania, "Heterogeneous ensembles of classifiers in predicting Bundesliga football results", Procedia Computer Science, Vol. 192, pp. 1573-1582, Sept. 2021
[8] Y. Bai, X. Zhang, "Prediction Model of Football World Cup Championship Based on Machine Learning and Mobile Algorithm", Mobile Information Systems, Vol. 2021, pp. 1-11, Sept. 2021
[9] Q. Zhang, H. Xu, L. Wei, L. Zhou, "Prediction of Football Match Results Based on Model Fusion", Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence - ICIAI 2019, pp. 198-202, March 2019
[10] I. Spivak, S. Krepych, M. Litvynchuk, and S. Spivak, "Validation and Data Processing in JSON Format", IEEE EUROCON 2021 - 19th International Conference on Smart Technologies, pp. 326-330, Sept. 2021
[11] A. A. Sutchenkov and A. I. Tikhonov, "Embedding Interactive Python Web Applications into Electronic Textbooks", V International Conference on Information Technologies in Engineering Education (Inforino), pp. 1-4, June 2020