Sports Analytics for Football League Table and Player Performance Prediction
All content following this page was uploaded by Christos Tjortjis on 01 October 2020.
Abstract–Common Machine Learning applications in sports analytics relate to player injury prediction and prevention, potential skill or market value evaluation, as well as team or player performance prediction. This paper focuses on football. Its scope is long–term team and player performance prediction. A reliable prediction of the final league table for certain leagues is presented, using past data and advanced statistics. Other team performance predictions address whether a team is going to have a better season than the last one. Furthermore, we attempt to detect and record the personal skills and statistical categories that separate an excellent central defender from an average one. Experimental results range from encouraging to remarkable, especially given that predictions were based on data available at the beginning of the season.

Keywords–Sports Analytics, Performance Prediction, Machine Learning (ML), Data Mining, Classification, Regression.

Following the same pattern, performance prediction is not easy, and long–term performance prediction is even tougher; it has also not been sufficiently studied until now.

Nevertheless, as shown in this paper, it is possible, up to a certain level, to make some long–term predictions, especially for team performance. Our prediction is relatively good, mainly with regard to the champion and the teams that win European qualification. What makes this research interesting is that the prediction is performed before the beginning of the season, with no official matches played, using only historical data and the information gathered during the summer break. Another novelty of this paper is that advanced statistics from previous seasons are used for prediction.

The remainder of this paper is structured as follows: Section II reviews the literature and provides background information, Section III defines the problem and details our approach, Section IV presents experimental results, which are evaluated in Section V, and Section VI concludes with directions for further work.
SVM with a polynomial kernel steadily achieves good results in every league studied, so it is regarded as the benchmark for this research from now on. Overall, the best result in terms of RMSE was achieved for the Spanish La Liga, where the classifier predicted the final league table with surprisingly high accuracy, given the few attributes used. In Fig. 2, the real and predicted league tables are shown and compared.

Finally, as far as AccuracyM is concerned, another aspect of the experiment is the following: instead of using the previous three seasons as the training set and the last season as the validation set, the first 10 match days of the 2018–19 season are used as the training set and the remaining 28 match days as the test set. In that case, AccuracyM for the Spanish La Liga rose from 51% to 70%, as seen in Fig. 3. This shows that data from the current season can substantially boost the accuracy of the model.
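The two evaluation metrics discussed above, and the match-day split, can be made concrete with a minimal sketch. It accumulates league points from per-match outcomes, then computes AccuracyM (share of correctly predicted match outcomes) and the RMSE between predicted and actual points. The function names, the toy fixtures and the 3/1/0 points scheme are illustrative assumptions and not the implementation used in the experiments.

```python
from collections import defaultdict
from math import sqrt

# Assumed points scheme: (home points, away points) per outcome label.
POINTS = {"H": (3, 0), "D": (1, 1), "A": (0, 3)}


def league_points(fixtures, outcomes):
    """Accumulate each team's points from a list of match outcomes.

    fixtures: list of (home_team, away_team) tuples
    outcomes: list of "H" (home win), "D" (draw) or "A" (away win)
    """
    table = defaultdict(int)
    for (home, away), result in zip(fixtures, outcomes):
        home_pts, away_pts = POINTS[result]
        table[home] += home_pts
        table[away] += away_pts
    return dict(table)


def accuracy_m(predicted, actual):
    """Share of matches whose outcome label was predicted correctly."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def rmse_points(predicted_table, actual_table):
    """Root mean squared error between predicted and actual team points."""
    squared = [
        (predicted_table.get(team, 0) - points) ** 2
        for team, points in actual_table.items()
    ]
    return sqrt(sum(squared) / len(squared))


def split_by_match_day(rows, cutoff=10):
    """Split rows (dicts with a hypothetical 'match_day' key) into
    a training part (match days up to cutoff) and a test part (the rest)."""
    train = [r for r in rows if r["match_day"] <= cutoff]
    test = [r for r in rows if r["match_day"] > cutoff]
    return train, test


# Toy double round-robin with three hypothetical teams.
fixtures = [("A", "B"), ("A", "C"), ("B", "A"), ("B", "C"), ("C", "A"), ("C", "B")]
actual = ["H", "D", "A", "H", "A", "D"]
predicted = ["H", "H", "A", "H", "A", "A"]

actual_table = league_points(fixtures, actual)
predicted_table = league_points(fixtures, predicted)
print(accuracy_m(predicted, actual))
print(rmse_points(predicted_table, actual_table))
```

In this toy case a classifier that misses both draws still ranks the three teams in the right order, which illustrates why the two metrics are reported separately: AccuracyM scores individual matches, while RMSE scores the aggregated table.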
V. DISCUSSION
This section reviews and discusses our approach and results. Problems
that arose during the process are presented along with the solutions
adopted. The results of the experiments are evaluated and threats to
validity are noted.
The first problem encountered was the abundance of data. It was practically impossible to use all the free online data acquired, so data selection was challenging. Fortunately, the models produced from the datasets were not computationally intensive, so the approach followed was to include as many attributes relevant to the research as possible, in order not to miss important information. Later, during the feature selection phase, some less important attributes were removed.

Conversely, acquiring some of the data that matter most for football analytics research was very difficult. Data regarding player injuries and data from wearable devices are mostly classified as private personal details. Thus, no such data are freely available online for use in the experiments.

Another problem was the handling of newly promoted teams, as their statistics from previous seasons were generated in a lower division and could not be used. For those teams, values were automatically assigned to some variables. That was a necessity, but in certain circumstances the predicted team performance did not reflect some teams’ real potential.

Additionally, most models were biased in favor of big clubs. An attribute acting as a penalization factor in cases of overestimation could balance out this bias.

A similar problem arose from the models’ difficulty in predicting draws. The solution to that problem usually lies in the proper use of cost sensitive classifiers or in tuning the class weights.

In both cases, the proposed solutions were tested. Even though they resolved the issues they were deployed for, both failed to produce better results than the ones already achieved. Hence, they were not included in the models.

Every championship has its own particularities, so rules extracted from one league do not necessarily apply to others. Therefore, despite the feasibility of building a universal model, demonstrated by the experimental results, exploring the differences between leagues would probably provide better research opportunities. This issue is also reflected in the second experiment. The database used consisted of defenders based only in England, because it would be inefficient to include players from different leagues.

Domain expert opinion was used in rating player attributes. Scout reports, despite generally describing player ability and potential well enough, have often been misleading, while intentional tampering with ratings and attributes should not be ruled out. Additionally, financial data usually suffer from inaccuracies and cannot be fully trusted.

Unfortunately, football matches are not affected only by team ability and player skills. There are some external factors that cannot be predicted. Luck is an imponderable factor. Long–term injuries of important players are also part of the game. “Strange” results in matches where one or both teams are not in real need of victory are often observed. Finally, betting odds inevitably have an influence on match outcome. All those factors, which can be viewed as threats to validity, show that long–term sports prediction is very demanding and may not always provide meaningful results. Nevertheless, the results of the experiments conducted in our research can be described as good or even impressive on certain occasions.

The AccuracyP level for the first part of the first experiment can be described as satisfactory, given that it is a long–term prediction with no official match data and statistics available. A professional expert could exploit the experimental results, along with his or her own intuition, to make certain decisions.

The main achievement of the research is the second part of that experiment, where the models used predicted the final tables of some famous championships with great accuracy. Also, classifiers are able to predict almost 2 out of 3 match outcomes when the model is applied in the midst of the season. Consequently, this implementation could be widely used for betting purposes under certain circumstances. Planning a profitable betting strategy based on the experimental results and, evidently, on some human expertise is probably possible.

Finally, the second experiment succeeds in locating a set of suitable attributes and skills that a central defender must improve in order to be considered a top class player. Of course, every player is different and has his/her own playing style, but it would be very useful for coaches to have specific targets when training a player. Long–term player performance prediction could also be a huge contribution to fantasy sports games. The experiment resulted in a variety of features. Unsurprisingly, some of them were the main defending actions and attributes, but, interestingly, some were also found to be attacking actions or attacking attributes.

Figure 4: Model built with player statistics as independent variables.

VI. CONCLUSIONS AND FUTURE WORK
A. Conclusions

In this research, two fundamental cases of sports analytics were studied: team performance prediction and player performance prediction.

For the first experiment, the goal was to predict how each team of four important European leagues would perform during the 2018–19 season. The data available were only historical data (from 2015 onwards) and information about team actions (transfers, managerial changes etc.) during the summer of 2018, just before the beginning of the season. Two approaches were followed to address this issue.

For the first approach, the target was to classify teams into those that would perform better than last season and those that would perform worse than last season in terms of points collected. Results could be described as satisfactory, but not impressive, as the AccuracyP of the classifiers deployed reached 70%. In this approach, no distinction between the examined championships was made, so the model used could be described as universal.

Another approach for team performance prediction achieved remarkable results; the idea was to simulate every match of the season and classify each result as home win, draw or away win. At the end, each team’s points were accumulated, and a predicted final league table was extracted. The effectiveness of the model was measured with two metrics: AccuracyM of the predicted match outcomes and RMSE of predicted vs actual team points in the league table. The highest AccuracyM achieved was 57% for the English Premier League and the lowest RMSE was 9 for the Spanish La Liga. Additionally, the champion was correctly predicted in 64% of the cases and the teams that won European qualification were correctly predicted in 75% of the cases. Also, this time, the four championships were studied separately and differences between them were evident.

Our results are very satisfactory and comparable to the results of similar studies. Regarding prediction of match outcome, Tax and Joustra achieved 56% accuracy [6], while McCabe and Trevathan achieved 54.6% accuracy [7]. Joseph et al. achieved their best result using Bayesian Networks, with 59.2% accuracy [29] and Eggels et al. achieved 54% accuracy [10]. Cintia et al. predicted match outcome with 60% accuracy and team points with 9.1 RMSE [19]. Our results have the advantage of being obtained without any current official match data available.

Additionally, applying prediction after using the first ten match days of the season as a training set was suggested as an alternative. In that case, AccuracyM of predicted match outcomes rose impressively to 70%.

The second experiment was about determining which attributes and match actions mainly influence a central defender’s match rating. The dataset consisted of 59 central defenders who had played at least 10 matches in the English Premier League 2016–17 season. The method used was Multiple Linear Regression with Backward Elimination and the evaluation metrics were R–squared and adjusted R–squared. Findings were noteworthy, as thirteen features proved to be statistically significant, with a quite satisfying R–squared of 0.907 and adjusted R–squared of 0.88. Classic defensive actions like interceptions and clearances were amongst them, along with player attributes more suitable for defenders, such as jumping reach and strength. The interesting part was that some attacking skills, such as passing, and
some attacking match actions (e.g. key passes made, goals scored) were also found to have an impact on rating central defenders. This fact stresses the change in the playing approach of central defenders nowadays.

B. Future Work

The experiments have shown that it is possible to make long–term predictions about team and player performance, so it is reasonable to expect that researchers will work in the same direction in the future, trying to resolve some issues or to improve the experimental results.

Data from cameras and wearables would be an invaluable asset to any sports analytics research. Future work on sports analytics should focus on gathering and leveraging data from those devices.

Another idea would be to evaluate newly promoted teams’ ability and study their performance, in order to understand which factors lead them to be successful or not.

The problems of model bias in favor of bigger clubs and of difficulty in predicting draws were not fully resolved. Cost sensitive classifiers and tuning of the class weights did not improve the experimental results. Hence, researchers are encouraged to delve deeper into those methods or to implement a different approach to solve the aforementioned problems.

What was generally observed in this research, and should concern researchers in the future, is the major divergence in the results extracted from different leagues. Therefore, fundamental differences between leagues should be specified; otherwise models will only be applicable to individual leagues and will not become universal.

Additionally, it would be very useful if future researchers took into consideration some aspects that were not examined in this research; player fatigue, starting lineup rotation due to consecutive matches and long–term injuries of important players are factors that can affect players or teams, but at the same time can make a model very complex. However, if the complexity is addressed, those data could be great assets for the research.

REFERENCES

[1] Holman, V. What is Sports Analytics? Agile Sports Analytics. [Online] November 15, 2018. https://www.agilesportsanalytics.com/what-is-sports-analytics/.
[2] Dixon, M.J. and Coles, S.G. Modelling Association Football Scores and Inefficiencies in the Football Betting Market. 1997.
[3] Lago-Peñas, C. - Lago-Ballesteros, J. - Rey, E. Differences in performance indicators between winning and losing teams in the UEFA Champions League. 2011, Journal of Human Kinetics, Vol. 27, pp. 135-146.
[4] Harrop, K. and Nevill, A. Performance indicators that predict success in an English professional League One soccer team. 2014, Int'l Journal of Performance Analysis in Sport, Vol. 14, pp. 907-920.
[5] Mao, L. - Peng, Z. - Liu, H. - Gómez, M.-A. Identifying keys to win in the Chinese professional soccer league. 2016, Vol. 16, pp. 935-947.
[6] Tax, N. and Joustra, Y. Predicting The Dutch Football Competition Using Public Data: A Machine Learning Approach. 10, 2015, Transactions of knowledge and data engineering, Vol. 10.
[7] McCabe, A. and Trevathan, J. Artificial Intelligence in Sports Prediction. Information Technology: New Generations, 2008. pp. 1194-1197.
[8] Hucaljuk, J. and Rakipovic, A. Predicting football scores using machine learning techniques. 2011. MIPRO, 2011 Proc. 34th Int'l Convention.
[9] Goddard, J. Regression models for forecasting goals and match results in association football. Elsevier B.V., 2005, Int'l Journal of Forecasting, Vol. 21, pp. 331-340.
[10] Eggels, H. - van Elk, R. - Pechenizkiy, M. Explaining soccer match outcomes with goal scoring opportunities predictive analytics. 2016.
[11] Lepschy, H. - Wasche, H. - Woll, A. How to be Successful in Football: A Systematic Review. 1, 2018, The Open Sports Sciences Journal, Vol. 11, pp. 3-23.
[12] Bekris, E. - Gioldasis, A. - Gissis, I. - Komsis, S. - Alipasali, F. Winners and losers in top level soccer. How do they differ? 2014, Journal of Physical Education and Sport, Vol. 14, pp. 398-405.
[13] Hvattum, L.M. and Arntzen, H. Using ELO ratings for match result prediction in association football. 2010, Int'l Journal of Forecasting, Vol. 26, pp. 460-470.
[14] Constantinou, A. and Fenton, N. Towards smart-data: Improving predictive accuracy in long-term football team performance. 2017, Knowledge-Based Systems, Vol. 124, pp. 93-104.
[15] Van Haaren, J. and Davis, J. Predicting the Final League Tables of Domestic Football Leagues. 2015. 5th Int'l Conf. Mathematics in Sport. pp. 202-207.
[16] Oberstone, J. Differentiating the Top English Premier League Football Clubs from the Rest of the Pack: Identifying the Keys to Success. 2009, Journal of Quantitative Analysis in Sports, Vol. 5.
[17] Kringstad, M. and Olsen, T.-E. Can sporting success in Norwegian football be predicted from budgeted revenues?
[18] Coates, D. - Frick, Bernd - Jewell, T. Superstar Salaries and Soccer Success: The Impact of Designated Players in Major League Soccer. 2014, Journal of Sports Economics, Vol. 17, pp. 716-735.
[19] Cintia, P. - Pappalardo, L. - Pedreschi, D. - Giannotti, F. - Malvaldi, M. The harsh rule of the goals: Data-driven performance indicators for football teams. 2015. IEEE Int'l Conf. Data Science and Advanced Analytics.
[20] Cakmak, A. - Uzun, A. - Delibas, E. Computational Modeling of Pass Effectiveness in Soccer. 2018, Advances in Complex Systems, Vol. 21, no. 3-4.
[21] Cintia, P. - Rinzivillo, S. - Pappalardo, L. A network-based approach to evaluate the performance of football teams. 2015. Machine Learning and Data Mining for Sports Analytics workshop (MLSA'15), ECML/PKDD conf. 2015.
[22] Grund, T.U. Network structure and team performance: The case of English Premier League soccer teams. 2012, Vol. 34, pp. 682-690.
[23] Borrie, A. - Jonsson, G.K. - Magnusson, M. Temporal pattern analysis and its applicability in sport: an explanation and exemplar data. 2002, Journal of Sports Sciences, Vol. 20, pp. 845-852.
[24] Nsolo, E. - Lambrix, Pa. - Niklas, C. Player Valuation in European Football. 2018. 5th Workshop on Machine Learning and Data Mining for Sports Analytics co-located with ECML PKDD 2018.
[25] Apostolou, K. and Tjortjis, C. Sports Analytics algorithms for performance prediction. 2019. IEEE 10th Int'l Conf. on Information, Intelligence, Systems and Applications (IISA 2019), pp. 469-472.
[26] Sarlis, V. and Tjortjis, C. Sports Analytics – Evaluation of Basketball Players and Team Performance. November 2020, Information Systems, Vol. 93, doi: 10.1016/j.is.2020.101562.
[27] Sîrb, L. - Molcuţ, A. - Nastor, F. The Exercise of Prediction Process of Performance within Football Sports Management by Using Fuzzy Logic from the Perspective of Value Analysis on Tactical Compartments of Game of the Football Players. 2015, Journal of Knowledge Management, Economics and Information Technology, Vol. 5.
[28] Pappalardo, L. - Cintia, P. - Ferragina, P. - Massucco, E. - Pedreschi, D. - Giannotti, F. PlayeRank: data-driven performance evaluation and player ranking in soccer via a machine learning approach. September 2019, ACM Transactions on Intelligent Systems and Technology, Article No. 59.
[29] Joseph, A. - Fenton, N. - Neil, M. Predicting football results using Bayesian nets and other machine learning techniques. 2006, Knowledge-Based Systems, Vol. 19, pp. 544-553.