Sports Analytics for Football League Table and Player Performance Prediction CR

This paper aims to use machine learning algorithms and historical sports data to predict football league tables and player performances. It presents predictions for league standings and whether teams will perform better than the previous season. It also examines personal skills and statistics that differentiate excellent and average central defenders.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/337521686

Sports Analytics algorithms for performance prediction

Conference Paper · July 2019


DOI: 10.1109/IISA.2019.8900754

2 authors:

Konstantinos Apostolou Christos Tjortjis


International Hellenic University International Hellenic University

All content following this page was uploaded by Christos Tjortjis on 01 October 2020.



Sports Analytics for Football League Table and Player Performance Prediction

Victor Chazan-Pantzalis
The Data Mining and Analytics research group
School of Science and Technology
International Hellenic University
Thermi, Greece
[email protected]

Christos Tjortjis
The Data Mining and Analytics research group
School of Science and Technology
International Hellenic University
Thermi, Greece
[email protected]

Abstract–Common Machine Learning applications in sports analytics relate to player injury prediction and prevention, potential skill or market value evaluation, as well as team or player performance prediction. This paper focuses on football. Its scope is long-term team and player performance prediction. A reliable prediction of the final league table for certain leagues is presented, using past data and advanced statistics. Other team performance predictions refer to whether a team is going to have a better season than the last one. Furthermore, we approach the detection and recording of personal skills and statistical categories that separate an excellent from an average central defender. Experimental results range from encouraging to remarkable, especially given that predictions were based on data available at the beginning of the season.

Keywords–Sports Analytics, Performance Prediction, Machine Learning (ML), Data Mining, Classification, Regression.

I. INTRODUCTION

Sports analytics is the use of historical data and advanced statistics to measure performance, make decisions and predictions regarding performance and outcomes, in order to gain an advantage over competitors [1]. Performance prediction is the commonest task in sports analytics. Sport analysts process data regarding players and teams with an intended goal: the prediction of match results, tournament winners or team and individual player efficiency. Forecasts may be related to short-term or long-term events. For that reason, diverse methods and algorithms have been deployed.
Clubs use sophisticated devices and software (e.g. GPS tracking systems) to gather and analyze data generated by players during training sessions and matches. They process these data for short-term decision making and long-term organization development. Also, extensive analysis of all data available is a prerequisite for betting companies. Finally, fans are also very interested in advanced statistics and how they affect football.
For all the above reasons, the use of sports analytics has increased during the last few years. Football was selected for our research because of the abundance of statistical categories and historical data, its fame, as well as the simplicity of its rules and of national championship formats. On the other hand, there are special difficulties which make long-term prediction in football challenging. The abundance of online data regarding football is an asset, but the data require filtering before they can be used for team and player performance prediction. Unfortunately, this is not always easy. Additionally, team and player performance can be affected by incidents not depicted in the data collected; for example, a team is rated higher than it should be when its opponents underperform. A player might have a low performance rating when coming into action after a serious injury.
Finally, the nature of football makes statistical recording of match events, as well as player and team rating, an ambiguous process. For the same reasons, performance prediction is not easy; long-term performance prediction is even harder and has not been sufficiently studied so far.
Nevertheless, as shown in this paper, it is possible, up to a certain level, to make some long-term predictions, especially for team performance. Our prediction is relatively good, mainly with regard to the champion and the teams that win European qualification. What makes this research interesting is that the prediction is performed before the beginning of the season, with no official matches played, only with historical data and the information gathered during the summer break. Another novelty of this paper is that advanced statistics from previous seasons are used for prediction.
The remainder of this paper is structured as follows: Section II reviews the literature providing background information, Section III defines the problem and details our approach, Section IV provides experimental results, evaluated in Section V, and Section VI concludes with directions for further work.

II. BACKGROUND

C. Reep is believed to be the first British notational analyst. He published a statistical analysis of patterns of play in football, along with B. Benjamin in 1968, using 578 matches between 1953 and 1967. During the last 20 years, sophisticated techniques, algorithms, and tools for sports analysis were developed, while articles and papers related to sports analytics are constantly being published. Match outcome prediction is an interesting topic in Sports Analytics. Researchers approach the problem from different angles. A simple and solid, but now outdated, prediction strategy is to predict the number of goals scored by the two teams.
The first model capable of predicting the result of a match was created in 1997 by Dixon and Coles. The model is considered a classic and was able to extract probabilities for the goals scored in a match, assuming they follow a Poisson distribution [2].
In recent years, researchers have focused on directly predicting wins, draws and losses, instead of trying to predict goals scored or points won. Various Machine Learning (ML) algorithms were implemented in order to discover the most discriminating factors that separate the winning from the losing side; Lago-Penas et al. identified shots on goal, crosses, match location, ball possession and opponent team ability, based on a ranking system [3]. Harrop and Nevill argued that the best predictor is pass accuracy, followed by the number of shots, the number of passes and dribbles (the fewer the better) and the venue of the match [4]. Mao et al. claimed that the features with the most positive effects are shots on goal, shot accuracy, tackles and aerials won [5].
Tax and Joustra employed a set of factors from public data and used dimensionality reduction techniques, such as Principal Component Analysis (PCA), along with ML algorithms (Naive Bayes and Multilayer Perceptron) to predict the Dutch football championship.
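The goal-based strategy mentioned in the background can be illustrated with a minimal sketch. It assumes that the two teams' goal counts are independent Poisson variables with known scoring rates; the function names and the rates 1.6 and 1.1 are hypothetical. This is a simplification for illustration and omits the refinements of the Dixon and Coles model [2].

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals when goals follow a Poisson(rate) law."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def outcome_probabilities(home_rate: float, away_rate: float, max_goals: int = 10):
    """Return (P(home win), P(draw), P(away win)) for independent Poisson scores.

    home_rate and away_rate are hypothetical expected goals per match.
    Scorelines above max_goals are truncated; their probability mass is
    negligible for typical football scoring rates.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Illustrative rates only, not estimated from any dataset.
p_home, p_draw, p_away = outcome_probabilities(home_rate=1.6, away_rate=1.1)
print(f"home {p_home:.3f}, draw {p_draw:.3f}, away {p_away:.3f}")
```

Summing scoreline probabilities in this way turns a goal model into win, draw and loss probabilities, which is the quantity that the later classification approaches predict directly.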
Tax and Joustra achieved an accuracy of almost 55% in their predictions and showed that a hybrid model, combining public data and betting odds, could improve accuracy [6].
Neural Networks (NNs) have also been used for prediction in football. McCabe and Trevathan dealt with four different sports. Using data from 02–08 and a Multilayer Perceptron, trained with Back Propagation, equipped with conjugate-gradient algorithms, they tried to predict match results. The NNs had 20 input layer nodes, 10 hidden layer nodes and 1 output layer node. The same features were used for every sport. Football had the worst average prediction performance of 54.6% [7]. Then, Hucaljuk and Rakipovic concluded that NNs performed better than any other ML technique they used [8].
Goddard, in 2005, compared the two methods, i.e. modeling the goals scored vs modeling the win-draw-lose match result, and concluded that a hybrid model achieves the best prediction performance. He was also one of the first to use variables other than previous match results. He leveraged features like the importance of individual matches, geographical distance between the two opponents and more. For the win-draw-lose method he used an ordered probit regression model and exploited a database of English match results for the past 25 years. He also included in his work a comparison of his predictions with the betting odds of the matches and concluded that achieving a positive betting return over time is possible [9].
In addition, many remarkable advanced statistics have emerged during the last decade, such as Expected Goals (xG), Packing, Defensive Coverage, Sequences and more. xG is a statistical measure of the quality of chances created and conceded (Expected Goals Against). xG probabilistically assigns a score from 0 to 1 to each chance based on several variables. Shot quality evaluation is usually achieved by training NNs over large datasets of shots. xG is calculated for individual players, but also cumulatively for the whole team. The model eliminates some of the randomness of the actual goals scored and gives better insights into team performance. xG has not avoided criticism, but there have been certain cases where the method was implemented with great success.
In 2016, Eggels et al. used xG to build a model that maps each scoring opportunity to a scoring probability. They leveraged geospatial data and implemented various classification techniques. They also indicated that xG could be further used for evaluating players and seasons, but they warned that probability estimates of goal scoring opportunities may suffer from high standard deviation [10].
Apart from works on predictive analysis, there are various interesting studies on comparative analysis. The main components compared are wins and losses. It appears that a noteworthy attribute that most researchers point out is efficiency. Efficiency is defined as the number of goals divided by the number of shots. Shots on goal, pass accuracy, quality of the opponent team, venue of the match and ball possession also seem to be significant variables [11].
Bekris et al. used a different approach; they compared matches with at most one-goal difference (i.e. short range) to matches with at least three-goals difference (i.e. wide range). They found that wide range winners outplayed their opponents in ball possession percentage, number of passes, “one vs. one” duels won, number of shots, number of shots on target and shooting accuracy. By contrast, those findings do not hold for short range matches, which are more sensitive to luck [12].
Researchers have also used the concept of rating. Rating is a single number which is used to describe the strength of a team in comparison to other teams at the time. A famous rating system is the ELO Ratings, which was used by Hvattum and Arntzen [13]. They used ELO Rating differences between teams as covariates in ordered logit regression models. Constantinou and Fenton used the pi-ratings that they had earlier invented for model validation, trying to make long-term predictions of team performance [14].
Predicting the outcome of a match is important, but maybe not as important as the prediction of team performance for the season. It is obvious that it is very hard to predict the long-term performance of a team and it is much harder to predict its performance by comparison with the performance of other teams. Limited work has been done on this challenging task so far. One of the most intriguing but also almost unexplored topics is the prediction of a championship’s final table.
Van Haaren and Davis emphasized the difficulty of predicting the exact position of a team in the final table, because it depends on the final position of every other team [15]. Another obstacle for their method was the number of matches that ended in a draw. Ranking systems used for simulating match results have difficulties in predicting draws. This resulted in high variance in the predicted number of points for each team. However, they indicated two substantial metrics needed for evaluating the quality of the predicted final tables: the percentage of correctly predicted relative positions and the Mean Squared Error (MSE) regarding positions.
Oberstone developed a multiple regression model, ending up with 6 independent variables which he assessed to be sufficient for predicting the final league table of the EPL in terms of points, instead of accurate positions [16]. He also used the F distribution to compare means of multiple samples (i.e. one-way analysis of variance) to investigate which pitch actions differentiate the four best teams from all the others in the league. He managed to achieve outstanding results.
There have been some interesting works focused on the financial strand of football clubs, rather than pitch performances. Kringstad and Olsen used data from the Norwegian league and focused on the relationship between financial strength and sporting outcome [17]. They presented some mixed results: evidence suggested that budgeted revenue was a success indicator, but only for bottom-half teams, while static and dynamic regression models they implemented supported the notion of budgeted revenues being a driver of sporting outcome. They concluded that focus on athletics is still vital as money is a significant factor of success, but only to a certain extent.
Coates et al. used data from every team that participated in Major League Soccer (MLS) in the USA during 2005–13. They examined the relationship between salary level and dispersion with football success. They revealed that while the wage bill of a team has a positive effect on success, salary inequality has a negative effect on success. In that way they showed that cohesion is essential in football [18].
Cintia et al. used pass-based performance indicators and other efficient metrics, like the Pezzali score. The significance of this metric lies in the fact that it rewards teams effective on both sides of the pitch, i.e. in offensive skills and in defensive duties. It is formulated as follows:

\[
\mathrm{Pezzali\ score}(team) = \frac{|goals(team)|}{|attempts(team)|} \times \frac{|attempts(opponent)|}{|goals(opponent)|} \tag{1}
\]

They simulated matches from four major leagues and claimed that they achieved superb results, as they predicted match outcome with an accuracy of almost 60%. They also found that the final rankings in the simulated championships were very close to the true rankings. Nevertheless, some teams had a considerable ranking error, which was explained by a very high or very low Pezzali score. Finally, they noted the simplicity of their models and encouraged researchers to work with more complex models, as they reckoned that there is room for improvement in accuracy [19].
Constantinou and Fenton, studying predictive accuracy in long-term team performance, proposed a method which they called smart-data [14]. They exploited external factors which might influence the strength of a team (e.g. managerial changes, European qualification, newly promoted teams etc.). With those factors they built new ones, such as “true team strength”, “expected performance” and more. Their goal was to predict the final table in terms of points won by each team. They achieved great results and managed to single out certain external factors that boost or worsen a team's performance, focusing on the quality of their data, not on the quantity.
Football passes are important actions. Cakmak et al. introduced a metric, named “Pass Effectiveness” [20]. They based pass evaluation upon mathematical grounds. Pass effectiveness is extracted from
the combination of five other measurable pass metrics: gain of a pass, pass advantage, goal chance, decision time and pass effectiveness of the next pass.
Passing networks are also a very intriguing subject; players are represented as nodes of a network, while passes between two players are represented as edges between the players. The edges are weighted based on the number of passes exchanged between players. Cintia et al. leverage passing networks in several papers, in order to predict the outcome of football matches, but also final league tables. They concluded that networks are more efficient for long-term predictions for whole competitions [21]. Grund analyzed a dataset of 283,259 passes and applied mixed-effects modeling to 76 repeated observations of the interaction networks and performance of 23 soccer teams. He showed that the best performing teams were characterized by networks with high intensity and low centralization [22].
Spatiotemporal data are significant in sports analytics. The advances in image processing made the analysis of positional data a lot easier. Borrie et al. suggest that temporal pattern analysis will lead to a deeper understanding of sport performance. They detected temporal patterns to find similar pass sequences within matches [23].
Player performance prediction is also interesting. Nsolo et al. investigated the attributes which best predict the success of individual players, based on their position, and evaluated different ML algorithms regarding prediction performance. They focused on top players of the top five European leagues and evaluated players based on different attributes for each position. They concluded that forwards tend to have higher performance ratings than other players, so maybe more advanced metrics should be applied to defensive players [24].
Previously, we used past data for long-term performance prediction; we estimated how many goals a certain player will score in a season and the number of a player’s shots during each individual match. We also predicted the playing positions of a set of players according to their attributes [25]. We also predicted the best NBA defender as well as the MVP for 2 years [26].
Sîrb et al. presented a set of 54 performance criteria over different playing positions in order to evaluate the performance of players, considering each player’s natural position and the tactical formation that the team deployed in a match [27].
Finally, Pappalardo et al. analyzed player performance from 18 different competitions for several years and presented PlayeRank, a data-driven framework. The dataset contained 31 million events and 21 thousand players. PlayeRank was found to outperform competitive predictive algorithms. They also discussed what distinguishes top players from others and discovered patterns for excellent performances. One of the limitations was that PlayeRank does not consider off-ball actions, like pressing. The authors also emphasized that an improved version of the framework should be able to leverage data from other sources, like wearables, GPS and video tracking data [28].

III. PROBLEM DEFINITION, APPROACH FOLLOWED

A. Problem Definition
Long-term performance prediction for teams or individual players is a field requiring exploration. Not only coaches, but also sports agents and bookmakers are interested in how teams or players perform during a season compared to previous ones. This section discusses the context of the problem and sets the objectives of the research.
The unique components of football matches make long-term predictions very difficult; only few goals are scored per match. Also, possession changes instantaneously, so there is no clear changeover between offense and defense. Moreover, player positions and tactics are not fixed and finally, the game has a continuous flow, which complicates recording of game events [16].
Our research focuses on statistics from the previous season and historical data. Also, some financial data (i.e. transfer spending, team salaries) are exploited to contribute to the team evaluation process.
The novelty of this research is that advanced metrics, such as xG and Pezzali score, were used to predict next season performance before the season begins, not after matches have already been played and recorded. Additionally, attackers are usually graded higher than defenders, even if they are not always more influential in team strategy. So, regarding player evaluation, this research attempts to identify the skills and features that make good defenders.

B. Approach Followed
This section showcases the flow of events taking place before we can get any meaningful experimental results, as well as the way the data were acquired and preprocessed. The block diagram that summarizes the process is depicted in Fig. 1.

Figure 1: Block diagram of the process followed for the experiments.

At first, the appropriate data had to be found. There are a lot of web pages that contain information and statistics regarding football matches and events. The data refer to both teams and players. Some of the data were accessed and collected manually, especially when that was easy. However, some of them were scraped from the internet using various scraping tools. Finally, a free database from an expired online competition was downloaded and used for the experiments. The database contains data from thousands of players and is extracted from a famous manager simulation game. It provides player ratings for several football skills. Players are rated by domain experts.
After the process of data acquisition, there was a large database which needed to be organized. The database was split into different csv files, according to what data were essential for each experiment. Then, the csv files were uploaded to Jupyter, the software that was mainly used for data processing.
Naturally, the data first needed to be preprocessed. They were checked for null values, duplicates, noise etc. Python was used to clean the data and build the models. Then, data transformation and data reduction took place to keep only the appropriate features for each classification or regression.
Finally, results were evaluated in terms of accuracy, error rates and bias involved. They were also compared to results of other similar studies to estimate the value produced by them.

IV. EXPERIMENTS AND RESULTS

A. 1st experiment: Team Performance Prediction
The first experiment is divided into two parts. The first part can be described as follows: Having a dataset with every team from four important European football national leagues, with more than 40 features for every team for each of the last four years (2015-18), predict whether a certain team is going to have a better or worse season than the previous year in terms of points. Every previous season is used as training set and the final season (i.e. 2017–18) is
used as test set. It is handled as a binary classification problem and the evaluation is conducted by measuring AccuracyP as follows:

\[
AccuracyP = \frac{\text{Number of teams with correct performance prediction}}{\text{Number of total teams}} \tag{2}
\]

Then, for the second part of the experiment, another method is presented; using almost the same features as in the first part, a model was built to simulate every match of the 2018–19 season for the same championships (i.e. 380 matches per championship). Then, the virtual points collected by teams are accumulated in order to predict the final league standings. The predicted league table is compared to the actual league table and the evaluation is conducted by calculating the Root Mean Squared Error (RMSE) for the championship:

\[
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{3}
\]

where:
• $n$ is the number of teams participating in the championship.
• $\hat{y}_i$ is the predicted points for the i-th team.
• $y_i$ is the actual points for the i-th team.
Also, every model is evaluated for its ability to predict the outcome of matches played. The evaluation metric is AccuracyM, defined as follows:

\[
AccuracyM = \frac{\text{Number of games with correctly predicted outcome}}{\text{Number of total games}} \tag{4}
\]

The features used for the experiment are the ones that were considered most relevant to team performance. Those features can be divided into three categories:
1. Past data generated during the last five years. This mainly refers to performance indicators from previous seasons (e.g. team average points).
2. Team statistical features from the season that has just ended (e.g. wins, xG, shots, possession percentage, Pezzali score and more).
3. Data not measurable by team performance (e.g. financial). These attributes are generated during the summer break, so most of them are independent from the previous season, but very likely to have an impact on the new season’s performance.
Finally, in the dataset, there is the target attribute. It is binary and corresponds to whether the team is going to have a better or worse season than the previous one in terms of points won.
After data preprocessing, some attributes were removed from the original datasets, being irrelevant to the research or noisy, adding limited value to the outcome. Those were team statistics, like cards, interceptions, offsides, fouls etc.
The first problem to handle was that not every team of the previous championship takes part in the new one; there are teams that are relegated and teams that are promoted. It is meaningless to have historical data about newly promoted teams, because the data would refer to a different league than the one studied. So for the newly promoted teams, some adjustments had to be made. For example, when calculating the average team points of the last five seasons, if a team was playing in a lower division during that time, the points of the bottom league team were assigned to it.
Those adjustments caused certain problems; newly promoted teams do not necessarily have the same strength as teams that have just been relegated. Thus, the way they are described by the features and attributes assigned to them might not be representative of their actual status. Furthermore, the three newly promoted teams are all assigned the same values for the corresponding variables, which is not efficient. Therefore, the validity of this method is questionable.
The data were split into train and test set. Then, multiple classifiers were used to classify the test set teams into two classes (i.e. better season / worse season). Grid search was used for model tuning and 10-fold Cross Validation was used for testing the effectiveness of the model. A feature importance graph was also deployed to track the most valuable features; difference between goals and team xG in the previous championship turned out to be the most important features. On the other hand, managerial change was not deemed an important performance indicator. However, it must be noted that the attribute used in this research does not account for the circumstances that caused the managerial change in the club. Similar studies in the future may deal with this issue.
Regarding model building, Random Forest was the classifier that achieved the highest accuracy, with more than 70% AccuracyP and with standard deviation less than 10%.
For the second part of the experiment, more databases were used. That data pertained to the results of every match of the four championships presented in the first part of the experiment. Every unnecessary attribute was again removed and the datasets were merged with the datasets of the first part of the experiment. This process resulted in a new dataset, which contained every match from a football season with its full-time result and with statistical, financial, and historical data about the two teams involved in each match.
Naturally, a problem came up: some of the teams participating in championship matches lack any data, because they are newly promoted. Thus, there were some missing values in the dataset, which were handled by the following method. Newly promoted teams were considered to be the weakest ones in the league and were assigned the maximum, the minimum or the mean value of the corresponding attribute, according to the nature of each attribute.
The next step was to combine the attributes of home and away team by subtracting the corresponding pairs. Some attributes from the first part of the problem were excluded from the second one as the subtraction was not meaningful. Finally, every team was encoded using dummy variables and the first three seasons of each national championship were concatenated.
The dataset was split into a training set and a validation set. The last season was kept separate from the others, its target (i.e. the full-time result) was hidden, and it was used as a test set. The training/validation set consisted of 1140 rows (380 rows for the test set) and almost 40 attributes (team dummy variables not included).
Standardization of the data, parameter tuning, and Cross Validation techniques were used again, exactly as previously. Multiple classifiers were deployed to predict the match outcome and therefore the leagues’ final standings. Team market value percentage, expected points and non-penalty xG turned out to be the most important features, but not by a big margin over other attributes.
During this process two problems came up; the first problem was that almost every classifier used had the tendency to favor big teams over smaller ones. The other problem was that most of the models built faced difficulties in predicting draws.
Despite the drawbacks, the results achieved could be considered promising. They are comparable to results from similar studies, while the advantage of this research is that the experiments can be concluded at the beginning of the season, with no official matches played and recorded. The best AccuracyM for the outcome of the matches was 57% for the English Premier League and the smallest RMSE for team points was 9, achieved for the Spanish La Liga. The French Ligue 1 produced the worst results, both in terms of AccuracyM and RMSE. The results from each league are presented in the following Tables 1 to 4. The best result for each league is noted with green color and the worst one with red.

Table 1: Results from the English Premier League.
CLASSIFIER           AccuracyM (%)   RMSE (points)
Naïve Bayes          55              17
Decision Tree        45              12.9
Random Forest        56              14.3
KNN                  48              15.3
SVM (rbf kernel)     54              18.2
SVM (poly kernel)    57              11
XGBoost              52              17.3
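The evaluation of the second part (predict each match, accumulate the virtual points, then compute AccuracyM and RMSE as in Equations (3) and (4)) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names, the outcome codes "H"/"D"/"A", the standard 3/1/0 points scheme and the three-team mini league are all hypothetical.

```python
import math
from collections import defaultdict

# Assumed points scheme: (home points, away points) per outcome code.
POINTS = {"H": (3, 0), "D": (1, 1), "A": (0, 3)}


def league_points(fixtures, outcomes):
    """Accumulate points per team from (home, away) fixtures and one
    outcome code per fixture: "H" (home win), "D" (draw) or "A" (away win)."""
    table = defaultdict(int)
    for (home, away), result in zip(fixtures, outcomes):
        home_pts, away_pts = POINTS[result]
        table[home] += home_pts
        table[away] += away_pts
    return dict(table)


def accuracy_m(predicted, actual):
    """Equation (4): share of matches whose outcome was predicted correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def rmse_points(predicted_table, actual_table):
    """Equation (3): RMSE between predicted and actual points over all teams."""
    squared = [
        (predicted_table.get(team, 0) - points) ** 2
        for team, points in actual_table.items()
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical double round-robin between three teams, for illustration only.
fixtures = [
    ("Alpha", "Beta"), ("Beta", "Gamma"), ("Gamma", "Alpha"),
    ("Beta", "Alpha"), ("Gamma", "Beta"), ("Alpha", "Gamma"),
]
actual = ["H", "D", "A", "D", "H", "H"]
predicted = ["H", "H", "A", "D", "H", "D"]

actual_table = league_points(fixtures, actual)
predicted_table = league_points(fixtures, predicted)
print("AccuracyM:", round(accuracy_m(predicted, actual), 3))
print("RMSE:", round(rmse_points(predicted_table, actual_table), 3))
```

The sketch also shows why the two metrics can disagree: a model may miss individual match outcomes yet still land close to the actual points totals when its errors cancel out across a season.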
Table 2: Results from the Spanish La Liga.
CLASSIFIER           AccuracyM (%)   RMSE
Naïve Bayes          47              23.7
Decision Tree        39              14.9
Random Forest        48              17.7
KNN                  46              13.3
SVM (rbf kernel)     51              13.8
SVM (poly kernel)    47              9
XGBoost              45              17.4

Table 3: Results from the Italian Serie A.
CLASSIFIER           AccuracyM (%)   RMSE
Naïve Bayes          53              19.7
Decision Tree        41              11.3
Random Forest        40              14.4
KNN                  47              14.7
SVM (rbf kernel)     52              19
SVM (poly kernel)    50              12.2
XGBoost              42              14.5

Table 4: Results from the French Ligue 1.
CLASSIFIER           AccuracyM (%)   RMSE
Naïve Bayes          42              28.2
Decision Tree        39              17.6
Random Forest        45              24.8
KNN                  39              20.9
SVM (rbf kernel)     43              22.2
SVM (poly kernel)    43              17.3
XGBoost              44              21.8

SVM with polynomial kernel steadily achieves good results in every league studied, so it is regarded as the benchmark for this research from now on. Overall, the best result in terms of RMSE was achieved for the Spanish La Liga, where the classifier predicted the final league table with surprisingly high accuracy, given the few attributes used. In Fig. 2, the real vs predicted league tables are shown and compared.

Figure 2: Spanish La Liga 2018–19 actual vs predicted table.

Green fonts are used for teams that won European qualification through Champions League, blue for teams that won European qualification through Europa League and red for teams relegated after the end of the season. The classifier has done an outstanding job in predicting these teams’ performance. It correctly predicted not only the champion, but also the ranking of the first six teams in the league.
In this example, SVM with polynomial kernel managed not to overestimate the top teams (a problem often observed with most of the classifiers), but on the other hand it overestimated the bottom teams instead. Another problem was its inefficiency in predicting draws, as very few match outcomes were predicted as “draw”.
Despite their divergence and how small or big AccuracyM and RMSE were in every case, most of the classifiers correctly predicted the league champion. The equivalent accuracy was very good regarding teams that won European qualification, and mainly those amongst them that qualified for Champions’ League, as shown in Table 5. Results for the relegated teams were also acceptable. Europa League teams were the exception, as the prediction accuracy was poor.

Table 5: Accuracy in predicting champion, teams that won European qualification and teams relegated.
                          Premier League   La Liga   Serie A   Ligue 1   Overall
Championship Winner       71%              71%       57%       57%       64%
European Qualification    86%              76%       82%       46%       75%
Champions League          79%              86%       71%       57%       74%
Europa League             38%              29%       29%       0%        29%
League Relegation         52%              48%       57%       10%       42%

Finally, as far as AccuracyM is concerned, another aspect of the experiment is the following: instead of using the previous three seasons as training set and the last season as validation set, the first 10 match days of the 2018–19 season are used as training set and the remaining 28 match days as test set. In that case, AccuracyM of the Spanish La Liga rose from 51% to 70%, as seen in Fig. 3. Therefore, data from the present season can substantially boost the accuracy of the model.

Figure 3: Accuracy in predicting match results after 10 match days from the Spanish La Liga have been analyzed.

B. 2nd experiment: Player Performance Prediction
This experiment focuses on individual players, specifically central defenders. In rating systems, there is a bias toward forwards and attacking midfielders. Goals are considered the most important element of football, so defenders’ contribution to a team is usually underestimated. Consequently, there is very limited research on central defenders. Additionally, while it is easy to rate attacking players according to goals, key passes and assists, it is not straightforward what makes a good central defender.
The purpose of this research is to examine the characteristics and the statistics for central defenders in comparison with their season rating and decide which of them contribute more to distinguish a central defender as a top class player.
The data collected refer to player attributes, playing positions and some demographic features. The database was narrowed down to 59 players, as only central defenders playing in the English Premier League and having participated in at least 10 league matches for the 2016–17 season were selected.
The next step was to collect season statistics for those players. The main focus was on statistics regarding defensive player actions, but some team statistics were also collected; although the aim is to build a model based on player performance, it must be acknowledged that a footballer’s team has an impact both on his statistics and on his overall rating.
The initial approach to the problem was to normalize every numeric value of the dataset, so every attribute’s range was transformed to the range 0 to 1. Then a multiple regression model was built with every possible feature. Despite the simplicity of this approach, some useful early conclusions were drawn regarding which features contribute more to central defenders’ competency. It seems that for the examined dataset, interceptions are the most important characteristic, followed by team overall rating, as expected. Players’ best attributes turned out to be their jumping reach, versatility, acceleration and first touch on the ball.
Another approach that was followed was to split the dataset’s features into three categories:
1. Player characteristics and attributes.
2. Player statistics.
3. Team statistics.
Again, the target was to build three multiple linear regression models (one per category: player attributes, player statistics and team statistics), but with fewer independent variables than in the first approach. The method used for the implementation of this part of the experiment was backward elimination. For the first category of features, the final model was built with seven features, which seem to be the most influential for a central defender.
The five assumptions of linear regression were also checked for this model; there was an indication of linearity in the model. Also, the expectation (mean) of residuals was almost zero and it appeared that there was no (perfect) multicollinearity between features. Additionally, a Breusch–Pagan test gave no evidence of heteroscedasticity in the model. Nevertheless, the final assumption was not verified; the Durbin–Watson test gave a value much lower than 2, which implies that there was positive autocorrelation in the residuals. Also, the R–squared and the adjusted R–squared were relatively low (under 0.5). However, considering that dependent and independent variables emerged from two different sources, the results could be described as encouraging.
The features of the second category (all derived from the same source) were the independent variables, while player rating (also derived from the same source) was, again, the dependent variable. This time, the final model consisted of 12 features, after Backward Elimination, with very low P–values, while, as seen in Figure 4, R–squared was 0.867 and adjusted R–squared was 0.833, a vast improvement from the first model.
Additionally, all five assumptions of linear regression were met; linearity of the model was obvious, as seen in Fig. 5. The expectation (mean) of residuals was found to be almost zero and there was no (perfect) multicollinearity between the features. The Breusch–Pagan test gave a p–value of 0.44, so there was no evidence of heteroscedasticity, and the Durbin–Watson test gave a value of 1.91, so there was almost no autocorrelation in the residuals.

Figure 5: Linearity of the second model.

The third set of features (i.e. team statistics) did not help to build a satisfactory model. There was an indication that the only one of those variables worth noting is “TeamRating”. It was decided to incorporate “TeamRating” in the second model, in order to exploit that feature. Indeed, adding “TeamRating” as the 13th feature slightly improved the model; R–squared rose to 0.907 and adjusted R–squared rose to 0.88.
In conclusion, summarizing the results of all models deployed, the most critical attributes and game actions for predicting the performance of a central defender can be described in the following list. It must be highlighted that attacking skills are not absent from the list, following the way modern defenders are expected to play:
Interceptions,
Clearances,
Aerials Won,
Tackles,
Jumping reach,
Versatility,
Acceleration,
First touch on ball,
Age,
Passing,
Vision,
Determination,
Strength,
Professionalism and ability to perform well in important matches,
International Caps,
Minutes Played,
Fouls,
Inaccurate short passes,
Key passes,
Goals,
Team’s rating.
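
The R–squared, adjusted R–squared and Durbin–Watson figures quoted for the regression models in this experiment can be related with a short sketch. It assumes that each model was fitted with an intercept on all 59 players, which the text does not state explicitly; the function names and the residual values are illustrative only. Note that the Durbin–Watson statistic is computed on the residuals of the fitted model.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R-squared for a linear model with an intercept (assumed)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order
    autocorrelation, values well below 2 suggest positive autocorrelation."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# R-squared values quoted in the text: 12 features, then 13 with "TeamRating".
print(round(adjusted_r_squared(0.867, 59, 12), 3))
print(round(adjusted_r_squared(0.907, 59, 13), 3))

# Made-up residual sequences (not from the paper) to exercise the statistic.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step
trending = [1.0, 1.0, -1.0, -1.0]     # neighbouring residuals tend to agree
print(durbin_watson(alternating))
print(durbin_watson(trending))
```

Under these assumptions the formula gives roughly 0.832 and 0.880, in line with the reported adjusted R–squared values of 0.833 and 0.88 once rounding of R–squared is taken into account. The two toy residual sequences give 3.0 and 1.0 respectively, showing how the statistic moves above or below 2 depending on the sign of the autocorrelation.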

Figure 4: Model built with player statistics as independent variables.

V. DISCUSSION
This section reviews and discusses our approach and results. Problems that came up during the process and the solutions given are debated. Results of the experiments are evaluated and threats to validity are mentioned, too.
The first problem encountered was the abundance of data. It was practically impossible to use every free online data acquired, so data
selection was challenging. Fortunately, the models produced from the datasets were not computationally intensive, so the approach followed was to include as many attributes relevant to the research as possible, in order not to miss important information. Later, during the feature selection phase, some less important attributes were removed.
Conversely, the acquisition of data substantial for research on football analytics was very difficult. Data regarding player injuries and data from wearable devices are mostly defined as private personal details. Thus, there are no such free online data to be used for the experiments.
Another problem was the handling of newly promoted teams, as statistics from previous seasons were generated for a lower division, so they could not be used. Concerning those teams, values were automatically assigned to some variables. That was a necessity, but in certain circumstances the predicted team performance did not meet some teams’ real potential.
Additionally, most models were biased in favor of big clubs. An attribute that could be used as a penalization factor in cases of overestimation could balance out the aforementioned bias.
A similar problem was encountered because of the models’ difficulty in predicting draws. The solution to that problem usually lies in the proper use of cost sensitive classifiers or in tuning the weights of classes.
In both cases, the proposed solutions were tested. Even though they resolved the issues they were deployed for, they both failed to produce better results than the ones already achieved. Hence, they were not included in the models.
Every championship has its own particularities, so rules extracted from one league do not necessarily apply to others. Therefore, despite the feasibility of building a universal model, proven by experimental results, the exploration of the differences between leagues would probably provide better research opportunities. This issue is also reflected in the second experiment. The database used consisted of defenders based only in England, because it would be inefficient to include players from different leagues.
Domain expert opinion was used in rating player attributes. Scout reports, despite generally describing player ability and potential well enough, have often been misleading, while intentional tampering with ratings and attributes should not be ruled out. Additionally, financial–based data usually suffer from inaccuracies and cannot be fully trusted.
Unfortunately, football matches are not affected only by team ability and player skills. There are some external factors that cannot be predicted. Luck is an imponderable factor. Long term injuries of important players are also part of the game. “Strange” results in matches where one or both teams are not in real need of victory are often observed. Finally, betting odds inevitably have an influence on match outcome. All those drawbacks, which can be viewed as threats to validity, show that long–term sports prediction is very demanding and may not always provide meaningful results. Nevertheless, the results of the experiments conducted in our research can be described as good or even impressive on certain occasions.
AccuracyP level for the first part of the first experiment can be described as satisfactory, given the fact that it is a long–term prediction with no official match data and statistics available. A professional expert could exploit the experimental results, along with his own intuition, and make certain decisions.
The main achievement of the research is the second part of the experiment, where the models used predicted some famous championships’ final tables with great accuracy. Also, classifiers are able to predict almost 2 out of 3 match outcomes when the model is applied in the midst of the season. Consequently, this implementation can be widely used for betting purposes under certain circumstances. Probably, planning a profitable betting strategy based on experimental results and, apparently, on some human expertise, is possible.
Finally, the second experiment succeeds in locating a set of attributes and skills that a central defender must improve in order to be considered a top class player. Of course, every player is different and has his/her own playing style, but it would be very useful for coaches to have a specific target when training a player. Long–term player performance prediction could also be a huge contribution to fantasy sports games. The experiment resulted in a variety of features. Unsurprisingly, some of them were the main defending actions and attributes, but interestingly, some were also found to be attacking actions or attacking attributes.

VI. CONCLUSIONS AND FUTURE WORK
A. Conclusions
In this research, two fundamental cases of sports analytics were studied: team performance prediction and player performance prediction.
For the first experiment, the goal was to predict how each team of four important European leagues would perform during the 2018–19 season. The data available were only historical data (from 2015 onwards) and information about team actions (transfers, managerial changes etc.) during the summer of 2018, just before the beginning of the season. Two approaches were followed to address this issue.
For the first approach, the target was to classify teams into those that would perform better than last season and those that would perform worse than last season in terms of points collected. Results could be described as satisfactory, but not impressive, as AccuracyP of the classifiers deployed reached the level of 70%. In this approach, no distinction between the examined championships was made, so the model used could be described as universal.
Another approach for team performance prediction achieved remarkable results; the idea was to simulate every match of the season and classify their results as home win, draw or away win. At the end, each team’s points were accumulated, and a predicted final league table was extracted. The effectiveness of the model was measured with two metrics: AccuracyM of the predicted match outcomes and RMSE of predicted vs actual team points in the league table. The highest AccuracyM achieved was 57% for the English Premier League and the lowest RMSE was 9 for the Spanish La Liga. Additionally, the champion was correctly predicted 64% of the time and the teams that won European qualification were correctly predicted 75% of the time. Also, this time, the four championships were separately studied and differences between them were evident.
Our results are very satisfactory and comparable to results of similar studies. Regarding prediction of match outcome, Tax and Joustra achieved 56% accuracy [6], while McCabe and Trevathan achieved 54.6% accuracy [7]. Joseph et al. achieved their best result using Bayesian Networks, with 59.2% accuracy [29] and Eggels et al. achieved 54% accuracy [10]. Cintia et al. predicted match outcome with 60% accuracy and team points with 9.1 RMSE [19]. Our results have the advantage of being obtained without any current official match data available.
Additionally, applying prediction after using the first ten match days of the season as a training set was suggested as an alternative. In that case, AccuracyM of predicted match outcomes was impressively raised to 70%.
The second experiment was about defining which attributes and match actions mainly influence a central defender’s match rating. The dataset consisted of 59 central defenders having played at least 10 matches for the English Premier League 2016–17 season. The method used was Multiple Linear Regression with Backward Elimination and the evaluation metrics were R–squared and adjusted R–squared. Findings were noteworthy, as for a quite satisfying 0.907 R–squared and 0.88 adjusted R–squared, thirteen features proved to be statistically significant. Classic defensive actions like interceptions and clearances were amongst them, along with player attributes more suitable for defenders, such as jumping reach and strength. The interesting part was that some attacking skills, such as passing, and
some attacking match actions (i.e. key passes made, goals scored) were also found to have an impact on rating central defenders. This fact stresses the change in the playing approach of central defenders nowadays.

B. Future Work
The experiments have shown that it is possible to make long–term predictions about team and player performance, so it is reasonable that researchers will work in the same direction in the future, trying to resolve some issues or trying to improve the experimental results.
Data from cameras and wearables would be an invaluable asset to any sports analytics research. Future works on sports analytics should focus their attention on gathering and leveraging data from those devices.
Another idea would be the evaluation of newly promoted teams’ ability and the study of their performance to comprehend which factors lead them to be successful or not.
The problems of the models’ bias in favor of bigger clubs and of difficulties in predicting draws were not fully resolved. Cost sensitive classifiers and tuning of the classes’ weights did not improve the experimental results. Hence, researchers are encouraged to delve deeper into those methods or implement a different approach to solve the aforementioned problems.
What was generally observed in this research and must preoccupy researchers in the future is the major divergence displayed in results extracted from different leagues. Therefore, fundamental differences between leagues should be specified, otherwise models could only be applicable to individual leagues and not become universal.
Additionally, it would be very useful if future researchers took into consideration some aspects that were not examined in this research; player fatigue or starting lineup rotation due to consecutive matches and long–term injuries of important players are factors that can affect players or teams, but at the same time can make a model very complex. However, if the complexity is confronted, those data could be great assets for the research.

REFERENCES

[1] Holman, V. What is Sports Analytics? Agile Sports Analytics. [Online] November 15, 2018. https://www.agilesportsanalytics.com/what-is-sports-analytics/.
[2] Dixon, M.J. and Coles, S.G. Modelling Association Football Scores and Inefficiencies in the Football Betting Market. 1997.
[3] Lago-Peñas, C. - Lago-Ballesteros, J. - Rey, E. Differences in performance indicators between winning and losing teams in the UEFA Champions League, 2011, Journal of Human Kinetics, Vol. 27, pp. 135-146.
[4] Harrop, K. and Nevill, A. Performance indicators that predict success in an English professional League One soccer team, 2014, Int'l Journal of Performance Analysis in Sport, Vol. 14, pp. 907-920.
[5] Mao, L. - Peng, Z. - Liu, H. - Gómez, M.-A. Identifying keys to win in the Chinese professional soccer league. 2016, Vol. 16, pp. 935-947.
[6] Tax, N. and Joustra, Y. Predicting The Dutch Football Competition Using Public Data: A Machine Learning Approach. 10, 2015, Transactions of knowledge and data engineering, Vol. 10.
[7] McCabe, A. and Trevathan, J. Artificial Intelligence in Sports Prediction. Information Technology: New Generations, 2008. pp. 1194-1197.
[8] Hucaljuk, J. and Rakipovic, A. Predicting football scores using machine learning techniques. 2011. MIPRO, 2011 Proc. 34th Int'l Convention.
[9] Goddard, J. Regression models for forecasting goals and match results in association football, Elsevier B.V., 2005, Int'l Journal of Forecasting, Vol. 21, pp. 331-340.
[10] Eggels, H. - van Elk, R. - Pechenizkiy, M. Explaining soccer match outcomes with goal scoring opportunities predictive analytics, 2016.
[11] Lepschy, H. - Wasche, H. - Woll, A. How to be Successful in Football: A Systematic Review. 1, 2018, The Open Sports Sciences Journal, Vol. 11, pp. 3-23.
[12] Bekris, E. - Gioldasis, A. - Gissis, I. - Komsis, S. - Alipasali, F. Winners and losers in top level soccer. How do they differ? 2014, Journal of Physical Education and Sport, Vol. 14, pp. 398-405.
[13] Hvattum, L.M. and Arntzen, H. Using ELO ratings for match result prediction in association football. 2010, Int'l Journal of Forecasting, Vol. 26, pp. 460-470.
[14] Constantinou, A. and Fenton, N. Towards smart-data: Improving predictive accuracy in long-term football team performance. 2017, Knowledge-Based Systems, Vol. 124, pp. 93-104.
[15] Van Haaren, J. and Davis, J. Predicting the Final League Tables of Domestic Football Leagues. 2015. 5th int'l conf. mathematics in sport. pp. 202-207.
[16] Oberstone, J. Differentiating the Top English Premier League Football Clubs from the Rest of the Pack: Identifying the Keys to Success. 2009, Journal of Quantitative Analysis in Sports, Vol. 5.
[17] Kringstad, M. and Olsen, T.-E. Can sporting success in Norwegian football be predicted from budgeted revenues?
[18] Coates, D. - Frick, Bernd - Jewell, T. Superstar Salaries and Soccer Success: The Impact of Designated Players in Major League Soccer. 2014, Journal of Sports Economics, Vol. 17, pp. 716-735.
[19] Cintia, P. - Pappalardo, L. - Pedreschi, D. - Giannotti, F. - Malvaldi, M. The harsh rule of the goals: Data-driven performance indicators for football teams. 2015. IEEE Int'l Conf. Data Science and Advanced Analytics.
[20] Cakmak, A. - Uzun, A. - Delibas, E. "Computational Modeling of Pass Effectiveness in Soccer," Advances in Complex Systems, vol. 21, no. 3-4, 2018.
[21] Cintia, P. - Rinzivillo, S. - Pappalardo, L. A network-based approach to evaluate the performance of football teams. 2015. Machine Learning and Data Mining for Sports Analytics workshop (MLSA'15), ECML/PKDD conf. 2015.
[22] Grund, T.U. Network structure and team performance: The case of English Premier League soccer teams. 2012, Vol. 34, pp. 682-690.
[23] Borrie, A. - Jonsson, G.K. - Magnusson, M. Temporal pattern analysis and its applicability in sport: an explanation and exemplar data. 2002, Journal of Sports Sciences, Vol. 20, pp. 845-852.
[24] Nsolo, E. - Lambrix, Pa. - Niklas, C. Player Valuation in European Football. 2018. 5th Workshop on Machine Learning and Data Mining for Sports Analytics co-located with ECML PKDD 2018.
[25] Apostolou, K. and Tjortjis, C. Sports Analytics algorithms for performance prediction. IEEE 10th Int'l Conf. on Information, Intelligence, Systems and Applications (IISA 2019), pp. 469-472, 2019.
[26] Sarlis V. and Tjortjis C., Sports Analytics – Evaluation of Basketball Players and Team Performance, Information Systems, Vol. 93, November 2020, doi: 10.1016/j.is.2020.101562.
[27] Sîrb, L. - Molcuţ A. - Nastor, F. The Exercise of Prediction Process of Performance within Football Sports Management by Using Fuzzy Logic from the Perspective of Value Analysis on Tactical Compartments of Game of the Football Players. 2015, Journal of Knowledge Management, Economics and Information Technology, Vol. 5.
[28] Pappalardo, L. - Cintia, P. - Ferragina, P. - Massucco, E. - Pedreschi, D. - Giannotti, F. PlayeRank: data-driven performance evaluation and player ranking in soccer via a machine learning approach, ACM Transactions on Intelligent Systems and Technology, September 2019, Article No.: 59.
[29] Joseph, A. - Fenton, N. - Neil, M. Predicting football results using Bayesian nets and other machine learning techniques. 2016, Knowledge-Based Systems, vol. 19, pp. 544-553.
