Comparison of Football Results Using Machine Learning Algorithms
Comparison of Football Results Using Machine Learning Algorithms
E-mail : [email protected]
Abstract- Many of the predictions for a result of a focuses on predicting results of a premier league
football match have primarily taken only the number season using data from its previous seasons.
of goals scored into account while giving out their
predictions. One of the major drawbacks in this type Starting out mainly as only a sport, Football began
of prediction is the inconsistency while basing the
to commercialise when the betting industry came
results using only just goals. Thus, the main objective
of this project is to explore different Machine learning into picture. With betting, people wanted to analyse
techniques to predict the score and outcome of football players as well as the teams and place their bets for a
matches using previous game match events rather given match. Thus with the analysis and prediction
than the number of goals scored by each team. The coming into picture, the value of machine learning
project sees us develop a set of custom generated and data science rose rapidly.
features which helps us evaluate a team’s full time
result, instead of using the actual number of goals II.LITERATURE SURVEY
scored. These features are trained using the selected
algorithms and the results are then compared to the
models trained using basic in game data obtained after The Hucaljuk,J & Rakipović,A (2011) proposed to
conclusion of the match. build a system that would predict result at the full
time of all matches from the Champions League in
Keywords:goals scored, machine learning, custom Europe. Based on [1], they used several machine
features, algorithms. learning algorithms.The main ones were namely
Naive Bayes, k-nearest neighbors, Random forest
I.INTRODUCTION and Bayesian networks. However, there were still
various ways to improve their accuracy by using
Football is a game where eleven players from each better features and a larger dataset.
side strive to score goals and the team with the most
goals wins the game after 90 minutes. Started out Similarly, according to [4] O.Fazrin et al. in 2013
purely as a sport before 100 years, modern football used Bayesian networks to predict the results. This
has become a place for business. In recent times, time the predictions were based on better features,
data for various parts of the game have been which were divided into two categories – non-
collected and used in various fields of business to physiological and physiological factors. The
make enormous amounts of money. The major field prediction was based on La Liga – the Spanish
where football data plays a vital role is the betting football league, mainly involving FC Barcelona.
industry. While better features were used, the desired results
were not obtained as the dataset was based on just
The resurgence of betting field has made every one one team.
of them use football as a medium to mint money.
Thus the field of prediction to predict the future T. Cheng et al. (2003) tried to better the previous two
outcomes has become a trend in recent times. approaches by involving a better algorithm.
According to [5], they used neural networks to
Machine Learning allows users to predict outcomes predict the results and based their model on
of future events using data of the present. This paper difference between two teams. Using a better
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:34:51 UTC from IEEE Xplore. Restrictions apply.
algorithm and a good dataset (They used the Italian testing the model. The dataset was acquired from the
league – Serie A datasets) profited as they could website football-data.co.uk. It contains a total of 64
deliver with good accuracy. features such as betting odds, yellow cards, red
cards, free kicks, fouls, shots etc. from every game
(Abel Hijmans) proposed models based on data of the season.
mining to predict matches from the Eredivisie, the
football league of Netherlands . Based on [2], they
use three models namely Generalized Boosted
Models (GBM), K-nearest neighbor and Naive Bayes
classification. The GBM model produced good
results while the other two did not show good output.
Their prediction could have been improved with the
likes of improved player data, since they used only
Dutch team stats for their model.
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:34:51 UTC from IEEE Xplore. Restrictions apply.
and relevant features are extracted. the results of a set of football matches using in
game data that is already present in the dataset
This process of pre-processing the existing dataset fed initially.
and churning out the custom features that are
independent of the in game statistics, in order to 3.1 Cleaning and Scraping
generate predictions for a future event is done in the After collecting the database, the next step was to
scrapping and cleaning phase. clean and scrap the data in order to acquire our
custom features from the raw dataset. The initial step
The cleaned data is split into two categories, the included truncating the database to include only the
training data and testing data. The training data is following fields: Date, Home team, Away team,
used to train the models using the three algorithms Home team and Away team‟s full time score (FTHG
that have already been pre decided and following the and FTAG respectively) and full time result (FTR).
training, the test data is fed into the model in order to The FTR is the one which is to be predicted in this
make predictions. The F1 score and accuracy is project. It is a bi-variant classification, determining
recorded in order to analyze the output of the whether the home team or the away team wins a
prediction. football match.The next step sees the calculation of
goal difference for each side in a cumulative fashion
The steps of the entire process that takes place is to include what the respective team‟s goal difference
listed down as follows: is for the corresponding game week. This is
Acquiring football dataset from the user for calculated by tallying the amount of goals scored
about 15 seasons. 15 seasons has been chosen as and conceded until that game by that side. Following
a rough estimate in order to get better prediction the cumulative calculation of goals for and goals
results. against, points for each match is calculated based on
From the available raw data that comprises of in the full time result. This is also done in a cumulative
game features, initial pre processing is done to fashion in order to determine which team finishes
select only the relevant columns needed for where in the table for that season.Based on the
custom feature extraction. points accumulated by each side for the following
The chosen columns are then scrapped and season, standings are calculated. The side with most
cleaned so as to develop the custom features in points is placed first while the one with the least is at
order to train them using machine learning the bottom.In order to better analyze the probability
algorithms. of a side‟s chance of winning the game in a
After the custom features are developed, they corresponding game week, recent form is also taken
are then standardized and split into testing and into account. The recent form includes the side‟s
training sets. results in the past 3 games and 5 games.This also
The training sets are used to train the model takes into account whether the side is maintaining a
using the chosen algorithms. The three winning streak or not. Finally following all these
algorithms chosen to train the model are Logistic data collection, a difference between both the home
Regression, Support Vector Machines and team and away team‟s standings is also calculated in
Gradient Boosting. order to determine the strength of the sides facing
After training the models, predictions are made each other.After acquiring all the relevant data, it is
using the testing set and output is calculated. all saved into a new dataset and is maintained in an
The accuracy is then compared to the same ordered fashion for all the 15 seasons in order to
predictions made using the in game data already train the models based on this data.
available in the dataset.
The project, Football Results Prediction using 3.2 Three and Five game Winning Streak
The first of the features calculated in determining the
Machine Learning Techniques thus involves
full time result is the three and five game winning
prediction in two phases using the same process.
streak. It gives a good estimate as to how the team
The first prediction involves predicting the
has performed before the current match and how
results for a set of football matches using custom
they will perform in the current game. The general
generated features from the initially available
idea behind the feature selection is that team which
dataset.
has a good streak may win over the opponent who
The second set of prediction involves predicting
has not so good streak.
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:34:51 UTC from IEEE Xplore. Restrictions apply.
The steps to calculate the following three game and their goal scored in each gameweek as
winning streak feature is as follows: values.
a) Creation of a dictionary to store each team‟s b) The goals scored are then summed up to
corresponding streak. calculate the cumulative goal scored at the
b) Processing the last five game week of every particular game week.
team from the current game week to store their c) This was done for all the teams over the 15
form streak in the dictionary. seasons taken into account.
c) Repetition of the process for every dataset to get d) The same practice is done for goals conceded for
detailed streak of every team over the 15 every team.
seasons.
Pseudo Code:
Pesudo Code: Input : Home team league position (HTLP), Away
Input: Dataset of the following season D, game team league position (ATLP)
number N of the streak. Output: Difference in league position (DiffLP)
Output : Results of the last five game weeks for
every team before that week. Procedure to find head to head difference:
HomeTeamLP = []
Procedure to find out five game winning streak: AwayTeamLP = []
form = {} for i >0 && i<380:
form = get_form(D,N) – used to get form of each homeTeam = playing_stat[i].HomeTeam
team awayTeam = playing_stat[i].AwayTeam
function get_from(D,N){
form = get_matches(D) – finds result of each match HomeTeamLP.append(Standings[homeTeam][year]
for the team. )
form_final = form
form_final[i] = ' ' AwayTeamLP.append(Standings[awayTeam][year])
j=0 End loop
while j < num: playing_stat['HomeTeamLP'] = HomeTeamLP
form_final[i] += form[i-j] playing_stat['AwayTeamLP'] = AwayTeamLP
j += 1 playing_stat['DiffLP'] =
end loop playing_stat['HomeTeamLP'] -
playing_stat['AwayTeamLP']
Conclusions from the chosen feature:
Conclusion from the chosen feature:
1. Past form for the last 5 games can be a good 1. Every EPL season is different. Some teams
metric to predict the outcome of the next game. But might perform better in a particular season and
judging purely on the basis of „n‟ game streak may then be average for the next fewyears.
prove to provide incorrect results. 2. Most teams are generally inconsistent
2. This is because a team may have had an easy run throughout the season, i.e., they may perform
of fixtures during the calculated streak period. better in the first quarter and then slack off in the
next quarter. Because of this, there is no certain
3.3 Goal Difference way to predict the outcome purely based on goal
difference record. It isn‟t a metric that changes
The second feature taken into consideration was goal
in a single pattern but varies in a random fashion
difference between the two sides. This was
as per the team‟s form.
considered to be a fair estimate in determining the 3. Past goal difference can however be really
outcome of the match. This was chosen because the useful for predicting the performance of top
side‟s current positions in the table can tell the true teams against the mid table teams. Since the top
story of how their season has panned out and teams tend to be consistent and the mid table
whether they will win or lose the current game. teams inconsistent, past head to head records can
be a good metric.
The steps taken in calculating the goal difference 4. Similarly, past goal difference alone is
between the two sides are as follows: unreliable but if combines with another metric
a) Initialisation of a dictionary with team names like past form, it can give pretty accurate results.
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:34:51 UTC from IEEE Xplore. Restrictions apply.
3.4 Difference in Points and League Position 2. Though it can clearly point out one favorite from
The third feature that was considered is the the match, confusions do arise when the
difference in points and league position between the difference is very negligible. Thus other factors
two sides heading into the game. This saw need to be considered along with this for a better
calculation of three four values. Home team points, prediction.
away team points, difference in form points and
difference in league position. This almost gives 3.5 Prediction Engine
away the favourite heading into the game. TheThree models have been chosen in order to train
both the raw dataset and the scrapped dataset in
The steps taken to determine the difference in points order to introduce a comparison as to which data
were as follows: produced the better accuracy and results.The three
a) A dictionary with every team as key was prediction algorithms that were chosen in order to
created. train the model are Logistic Regression, Support
b) Wins were substituted with 3 points, draw with Vector Machines and Gradient Boosting.The raw
1 and loss with 0. dataset that consisted of in game data such as shots,
c) Following the conversion, the points for each fouls and set piece chances where first gathered and
team was calculated cumulatively and stored as trained with the three chosen algorithms. Following
the value in the dictionary. that, the three algorithms were put forward for
d) This was done for very season of the team. prediction.A similar strategy was followed using the
e) Points summed up gave the overall points for same three algorithms, but this time for the custom
both the sides and their difference was used. features that were generated following the scrapping
f) Difference in every year‟s league position was and cleaning procedure.From the analysis of both the
also calculated. datasets which were trained from the same three
algorithms, it was evident that the scrapped and
Pseudo Code: cleaned custom features produced better features
Input : Dataset of the following season (playing_stat) when compared to the raw data which was present in
Output : Difference in points (DiffPts) the dataset.
Procedure to find out difference in points:
matchres = get_matchres(playing_stat) - to calculate V CONCLUSION AND FUTURE WORK
results
cum_pts = get_cuml_points(matchres) – to calculate The project work was to prove that, using overall
cumulative points team features ahead of the game to predict their
HTP = [] outcome would yield in fairly positive prediction
ATP = [] results. Also it was to bring out a totally contrasting
j=0 comparison between the prediction made using in
while i<380 game features available to us after the match as
ht = playing_stat[i].HomeTeam ended and features known before hand.
at = playing_stat[i].AwayTeam
HTP.append(cum_pts[ht][j]) These were the results that were yielded while doing
ATP.append(cum_pts[at][j]) the prediction using xgBoost algorithm using the
if ((i + 1)% 10) == 0: custom generated features.
j=j+1 Table 1. Report for prediction done using xgBoost
playing_stat['HTP'] = HTP for custom features.
playing_stat['ATP'] = ATP F1-
Precision Recall support
end loop score
playing_stat['DiffPts'] = playing_stat['HTP'] - Home win 0.68 0.83 0.75 23
playing_stat['ATP'] Not Home
0.82 0.67 0.73 27
win
Conclusions from the chosen feature :
1. Comparing the two sides difference in points Accuracy 0.74 50
and League positions can give a fair estimate of Macro avg 0.75 0.75 0.74 50
Weighted
how their respective season‟s have panned out 0.75 0.74 0.74 50
avg
and what to expect from the game.
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:34:51 UTC from IEEE Xplore. Restrictions apply.
These were the results that were generated when the The values of precision and recall for the predictions
same xgBoost algorithm was used for predicting the made using custom features is 0.82 and 0.67.
match result using in game features.
Similarly, the values of precision and recall for the
Table 2. Report for prediction done using xgBoost predictions made using in game features is 0.71 and
for in-game features. 0.74.
F1-
Precision Recall support
score From the above tables we can infer that, the values
Home win 0.68 0.65 0.67 23 of f1 score and accuracy for the predictions made
Not Home using custom generated features is 0.74 and 0.74
0.71 0.74 0.73 27
win while the asme values for prediction done using in-
game features is 0.66 and 0.70.
Accuracy 0.70 50
Macro avg 0.70 0.70 0.70 50
Weighted
0.70 0.70 0.70 50
avg
Fig.1 Actual vs Predicted values done with xg Boost for in game features.
Fig. 2 Actual vs Predicted values done with xg boost for custom features.
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:34:51 UTC from IEEE Xplore. Restrictions apply.
Thus from the above figure, we can infer that the f1 could be taken into consideration to better
score as well as the accuracy of the prediction done understand team performances.
using custom data is greater than the prediction done
using the raw data (in-game data) already available Finally, analysing betting odds and capturing fan
in the dataset. reactions from the social media could be included as
their give out fair estimate as to how the team‟s
So from this, we can conclude that as f1 score mood is going into a match.
increases, the accuracy of the prediction also
increases. REFERENCES
[1] Hucaljuk, J & Rakipović, A. (2011). Predicting football scores
Though the custom generated features yielded better using machine learning techniques. 2011 Proceedings of the
accuracy in all the three algorithms, it wasn‟t 34th International Convention MIPRO,, 1623-1627.
significantly huge and could be extended with future [2] Abel Hijmans. Dutch football prediction using machine learning
classifiers(unpublished)
work. [3] Darwin, P & Dra, H (2016). Predicting Football Match Results
with Logistic Regression. 2016 International Conference On
Advanced Informatics: Concepts, Theory And
Improved data would have played a major role in Application(ICAICTA).
increasing the accuracy of our prediction. In depth [4] Farzin, O., Parinaz, E., &Faezeh, S. M. Football result
stats such as dribbling, take ons, expected goals per prediction with Bayesian network in Spanish league-Barcelona
team. International Journal of Computer Theory and
90, tackles may have proven a blessing in disguise. Engineering, 5(5), 2013, 812-815
[5] T. Cheng, D. Cui, Z. Fan, J. Zhou, and S. Lu, “A new model to
Improved data plays a vital role in determing forecast the results of matches based on hybrid neural networks
in the soccer rating system,” Proceedings Fifth International
whether the accuracy of the model comes out good Conference on Computational Intelligence and Multimedia
or not. Applications. ICCIMA 2003.
[6] J. Lasek. Euro 2016 Predictions Using Team Rating Systems.
ECML/PKDD, 2016.
Had more time been available to us, newer models [7] Lars Magnus Hvattuma, Halvard Arntzen. Using ELO ratings
could have been generated which were player for match result prediction in association football. International
Journal of Forecasting, 2010.
centric, comparing the player stats which were to
feature in the following game. Also team profiles
such as formation, attacking style and mentality
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:34:51 UTC from IEEE Xplore. Restrictions apply.