Predicting Sports Results Using Latent Features: A Case Study
Stefan Dobravec
University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia
[email protected]
Abstract - Predicting sports results is a challenging task, even more so for a sport with a highly stochastic nature. In football, for example, numerous features are tracked and combined with expert knowledge, yielding various prediction algorithms. Our work, however, is based on a case where no expert knowledge is available and the only data comes from previous match results. We built a goal score prediction model that uses latent features obtained from a matrix factorization process. We also added a Naive Bayes classifier to predict the outcome of the match. The algorithm has been tested on the results of the FIFA World Cup 2014. We also built a match result predictor based on betting quotas. As these are derived from complex algorithms that also encompass expert knowledge, this predictor can be used to estimate the accuracy of an expert knowledge-based system. This case study shows that there is no significant difference between the two algorithms that we tested and that latent features may provide a valid substitute for real features when the latter are not available.

I. INTRODUCTION

Sport result prediction can be viewed as one of the objectives of sport analytics, which aims at helping decision makers (e.g. team managers, odds traders, etc.) to gain a competitive advantage. The difficulty of this task depends on many factors, such as the availability of data on past events, the ability to gather data on future events, and the knowledge needed to interpret the gathered data. In some cases, as in football, the task becomes even more challenging due to the highly stochastic nature of the game. At the same time, however, the strategies of a team can be approximated by crisp logic rules [1].

Various techniques for modelling a football match exist, and they yield different result prediction algorithms. According to [2], the modelling can be put under four generic categories: (i) empirical models, (ii) dynamic systems, (iii) statistical techniques, and (iv) artificial intelligence (including expert systems). The most basic approach in the category of statistical techniques is to use the Poisson distribution for goal-based data analysis [3]. While the Poisson distribution is used to predict the number of goals conceded and scored, the other statistical models restrict their prediction to the match outcome. However, it has been shown in [4] that both approaches yield similar performance. In the artificial intelligence category there are several approaches that focus on Bayesian network modelling [1] [5] [6]. In most cases these approaches are highly complex, based on numerous assumptions, and require a large amount of statistical data [7]. On the other hand, they enable the inclusion of expert knowledge, which in general makes them more accurate [5]. The remaining two categories have been poorly represented in recent years.

A comparison of the described approaches is hindered by the fact that they do not use a common dataset. In fact, neither a standard nor an agreed dataset exists. An analysis of football match parameters is performed in [8], pointing out features that influence the game; however, various approaches use various sets of parameters. This is at least partly due to the high dynamics of a football match, which makes it hard to gather statistical data systematically. Also, different competition formats (league, tournament) emphasize different parameters or their combinations.

Our research, in contrast, is based on latent features of a matrix factorization model instead of relying on real (observed) features. We assume that it is better to use latent features that reflect the variability of the observed facts as well as possible than to rely on a poor set of real features that cannot be properly assessed. The proposed method is thus based solely on the results of past matches (the number of goals scored by each team), which represent the fundamental data that is recorded in all cases. No other statistical data or expert knowledge is used.

The matrix factorization technique became very popular in the field of multimedia content recommender systems, where it showed good scalability and predictive accuracy [9]. The idea behind using latent features in our case is to be able to build a successful model even when expert knowledge or a large amount of statistical data is not available.

The proposed method has been tested on the case of the FIFA World Cup 2014 tournament. Since different approaches use different techniques to measure success, a direct comparison with our method would be misleading. Instead, we used cross-folding to statistically evaluate the prediction success of our method. We also compared the proposed method to a match result classifier that is based on betting quotas, which indirectly encompass the expert knowledge of odds traders.

In the following sections, the methodology for predicting match results is explained first, followed by an explanation of the validation procedure. Next, the results are presented, followed by the discussion and conclusions.
1267
Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:41:00 UTC from IEEE Xplore. Restrictions apply.
II. MATERIALS AND METHODS

A. FIFA World Cup 2014 Dataset

Using World Cup tournament results as the dataset allowed us to observe the short-term performance of competing teams. Thus we were able to exclude the time factor from our model, in contrast to a league-type competition, where teams compete over a long period, which increases the importance of the time factor. We assume that teams enter the tournament in peak condition, with a fixed set of players and with team strategies planned ahead. Therefore major changes in a team's performance are not expected.

A major drawback of the dataset is its small size. The dataset could be enlarged by including results of previous World Cup tournaments; however, the list of competing teams varies between tournaments. Thus mainly the number of teams would increase, and the number of matches per team would increase only for those teams that qualify regularly. Additionally, the time factor would play an important role, as the players, coach, and strategy may change significantly between two tournaments. Handling such a dataset would require a sophisticated strategy that calls for significant research of its own.

In the World Cup, teams compete in the tournament format. In the group stage 32 teams are put in groups of four (8 groups altogether). Each team plays three matches (rounds) against its group adversaries, which accounts for 48 matches overall in the group stage. The best two teams from each group proceed to the elimination stage, where 16 matches are played (8 in the round of 16, 4 in the quarter-finals, 2 in the semi-finals, and 2 in the final round, i.e. the final and the third-place match). The tournament therefore consists of 64 matches.

Additionally, 56 matches played in the preparation period of one month before the World Cup tournament were taken into account.

The data was obtained from the official FIFA World Cup page (http://www.fifa.com/worldcup/). In the elimination stage a match cannot end as a draw, therefore extra time (and optionally penalties) is required to determine the winner. As this rule has a huge impact on the final result of the match, we ignored the final score and considered only the score after the regular period, which is also standard practice with 3-way bets (home win, draw, away win).

Betting quotas for the World Cup matches were obtained on the day of the match from the online betting service bwin (https://www.bwin.com/). Betting quotas are in the form of 3-way bets and reflect the probability of the three possible outcomes: home team wins, draw, away team wins. A lower quota (approximately) corresponds to a higher probability of the given outcome.

B. Matrix Factorization Model

In the field of multimedia recommender systems the matrix factorization model showed robustness and accuracy when predicting user experience [9]. The interaction between users and items is modelled on the basis of known ratings that were given by the users to specific items and that correspond to the user's satisfaction. The model is then used to predict the user's satisfaction with unknown (unrated) content.

In our case the model is based on the observed number of goals scored by a given team ($t$) against its opponent ($o$) in previous matches, $g_{t,o}$. Every observed match yields two goal score observations:

$$\text{team}_A : \text{team}_B \;-\; \text{scored}_A : \text{scored}_B$$
$$A:\; g_{A,B} = \text{scored}_A$$
$$B:\; g_{B,A} = \text{scored}_B$$

The observations can be arranged in a table (see Table 1), displaying the number of goals scored by a team (T) against its opponent (O). Initially, such a table is highly sparse. New values are added as the tournament progresses; however, even in the final stages the table remains sparse.

TABLE 1. THE TABLE REPRESENTATION OF GOAL SCORE OBSERVATIONS

T\O    A            B            …
A      ■            $g_{A,B}$
B      $g_{B,A}$    ■
…                                ■

The model is then used to predict the number of goals scored in a match to come, $\hat{g}_{t,o}$. As with the observations of past matches, every match yields two goal score predictions, one for each of the two paired teams.

Additional parameters, besides the goals scored by both teams, may be included in the model in a similar way as [10] included the time component. However, this is not straightforward, as the parameters may not be directly correlated to goal score predictions, and expert knowledge is therefore required. For example, an injury to a key player should influence the match outcome, but an expert is needed to point out the key players. Since the time factor was excluded in our case due to the short-term dataset, and since we avoid using any expert knowledge, we use the matrix factorization model with biases:

$$\hat{g}_{t,o} = \mu + b_t + b_o + q_o^T p_t$$

In the model the team and its opponent are represented by latent feature vectors $p_t$ and $q_o$, and their interactions are modeled as inner products of the corresponding vectors in the latent feature space. With the biases $b_t$ and $b_o$ we model the team's inclination towards scoring and the opponent's inclination towards conceding goals, while $\mu$ represents the overall 'goals scored' average. The bias $b_t$ is calculated so that $\mu + b_t$ represents the average of goals scored by the team. Similarly, the bias $b_o$ is calculated so that $\mu + b_o$ represents the average of goals conceded by the opposing team.

In the learning phase the model is fitted to the known match results, or more precisely to the number of goals scored by one team against its opponent (note again that every match provides two observations). The stochastic gradient
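descent algorithm is used to loop through the training set ($\kappa$), where for each given training case $g_{t,o}$ the prediction $\hat{g}_{t,o}$ is calculated and the model parameters are modified to minimize the squared error function:

$$\min \sum_{(t,o) \in \kappa} \left( g_{t,o} - \hat{g}_{t,o} \right)^2 + \lambda \left( \|p_t\|^2 + \|q_o\|^2 + b_t^2 + b_o^2 \right)$$

[Figure residue: schematic confusion matrix with columns 1, 0, 2 and rows "1/ p11 p12 p13" and "0/ p21 p22 p23".]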
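The biased matrix factorization model and its stochastic gradient descent fit described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`train_mf`, `predict`, `sq_error`), the hyperparameters (number of latent factors, learning rate, regularization weight, number of epochs), the choice to learn the biases jointly with the latent vectors, and the toy observations are all hypothetical.

```python
import numpy as np

def train_mf(observations, n_teams, n_factors=2, lr=0.01, reg=0.05,
             n_epochs=200, seed=0):
    """Fit g_hat[t, o] = mu + b_t[t] + b_o[o] + q[o] . p[t] by SGD.

    observations: list of (team, opponent, goals) triples; every match
    contributes two triples, one per team. All hyperparameter defaults
    are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    mu = float(np.mean([g for _, _, g in observations]))  # overall average
    b_t = np.zeros(n_teams)   # scoring bias of each team
    b_o = np.zeros(n_teams)   # conceding bias of each opponent
    p = rng.normal(0.0, 0.1, (n_teams, n_factors))  # team latent vectors
    q = rng.normal(0.0, 0.1, (n_teams, n_factors))  # opponent latent vectors
    for _ in range(n_epochs):
        # Visit the training cases in a fresh random order each epoch.
        for idx in rng.permutation(len(observations)):
            t, o, g = observations[idx]
            err = g - (mu + b_t[t] + b_o[o] + q[o] @ p[t])
            # Gradient steps on the regularized squared error.
            b_t[t] += lr * (err - reg * b_t[t])
            b_o[o] += lr * (err - reg * b_o[o])
            p_old = p[t].copy()
            p[t] += lr * (err * q[o] - reg * p[t])
            q[o] += lr * (err * p_old - reg * q[o])
    return mu, b_t, b_o, p, q

def predict(model, t, o):
    """Predicted number of goals scored by team t against opponent o."""
    mu, b_t, b_o, p, q = model
    return mu + b_t[t] + b_o[o] + q[o] @ p[t]

def sq_error(model, observations):
    """Sum of squared prediction errors over the given observations."""
    return sum((g - predict(model, t, o)) ** 2 for t, o, g in observations)

# Toy data (hypothetical): three matches among teams 0, 1, 2,
# each match giving two (team, opponent, goals) observations.
toy = [(0, 1, 2), (1, 0, 0), (0, 2, 1), (2, 0, 1), (1, 2, 3), (2, 1, 1)]
model = train_mf(toy, n_teams=3)
home_goals = predict(model, 0, 1)  # goals team 0 is predicted to score
away_goals = predict(model, 1, 0)  # goals team 1 is predicted to score
```

Each match to come is scored by calling `predict` twice with the roles of team and opponent swapped, which mirrors the two goal score predictions per match described in the text.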
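approach can be used, where the weights $p(c_i)$ correspond to the prevalence of the reference class [15]:

[Figure residue: confusion matrices with columns 1, 0, 2 and precision P; surviving rows: "0/ 0.031 0.031 0.016 0.400" and "1/ 0.297 0.156 0.141 0.500".]

[Figure residue: flow diagram with the blocks DB, Matrix Factorization, Prediction $(\hat{g}_{H,A}, \hat{g}_{A,H})$, and Match Results, and a t++ step.]

and the betting quota based) gives a latent feature approach usability estimation.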
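The prevalence-weighted multi-class AUC mentioned above can be illustrated with a short sketch. It assumes the commonly used form in which the total AUC is the sum of the one-vs-rest AUC of each reference class multiplied by that class's prevalence. The function name `weighted_auc` and the per-class AUC values are hypothetical and are not results of the study; the prevalence values are the ones reported in the paper, and their assignment to the home win, draw, and away win classes is an assumption.

```python
def weighted_auc(class_auc, prevalence):
    """Combine per-class (one-vs-rest) AUC values into a single score,
    weighting each class by its prevalence in the data."""
    if set(class_auc) != set(prevalence):
        raise ValueError("class_auc and prevalence must cover the same classes")
    total = sum(prevalence.values())
    if abs(total - 1.0) > 1e-2:
        raise ValueError("class prevalences should sum to 1")
    return sum(class_auc[c] * prevalence[c] for c in prevalence)

# Class prevalences as reported in the paper (class assignment assumed).
prevalence = {"home": 0.375, "draw": 0.266, "away": 0.359}
# Hypothetical per-class AUC values, for illustration only.
class_auc = {"home": 0.8, "draw": 0.5, "away": 0.7}

auc_total = weighted_auc(class_auc, prevalence)
```

Weighting by prevalence means that a poorly separated but rare class lowers the total less than a poorly separated frequent class would.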
Figure 6 shows ROC graphs for the classifier that uses the matrix factorization model output. The curves represent the average over the eight folds used for cross-folding. Shaded areas mark the 95% confidence interval for the averaged curves. In our case the following class prevalence was observed: $p(\text{home win}) = 0.375$, $p(\text{draw}) = 0.266$, and $p(\text{away win}) = 0.359$. Using (7) we obtained:

$AUC_{lf} = 0.677$

To estimate the performance of an expert knowledge-based algorithm, we tested the Naive Bayes classifier using betting quotas. We achieved 50% overall accuracy (OSR = 0.5), which is the same as with the 'latent features' approach. However, inspecting the confusion matrix (Figure 5) reveals some differences: more candidates for the 'draw' outcome at the cost of lower precision, a much higher hit rate (relevance) for the 'home win' outcome, and a lower hit rate for the 'away win' outcome.

Furthermore, the ROC graphs in Figure 7 show superior performance in classifying the 'away win' class, but at the cost of extremely poor classification of the 'draw' class. In this case, we obtained the following AUC value:

$AUC_{bq} = 0.562$

The AUC that we obtained for the latent feature-based classifier is actually higher than the one obtained for the betting quotas-based one. However, the high measurement uncertainty (shaded areas in Figure 6 and Figure 7) cannot be overlooked. Thus we cannot conclude that the former classifier performs significantly better than the latter.

Figure 6. ROC graphs for Naive Bayes classifier using Matrix Factorization model
Figure 7. ROC graphs for Naive Bayes classifier using betting quotas

V. CONCLUSION

The prediction algorithm presented in this paper is based only on previous results and does not require any form of match statistics or expert knowledge. We propose to use the described approach in cases when statistics and expert knowledge are not available or when they cannot be imported, interpreted, or evaluated properly.

As football shows a highly stochastic nature, it is difficult to evaluate any prediction algorithm. The only correct way would be to test all relevant algorithms against each other on the same (reasonably) large dataset. However, it is practically impossible to perform such a comparison, as various algorithms use various statistical data and knowledge, and a standard dataset for testing purposes containing all the required data does not exist yet. The comparison between the two classifiers (the latent features-based and the betting quotas-based) performed in this study can be used only to estimate the possibility of substituting real features with latent ones. This case study showed no significant difference between the two approaches, which suggests that such a substitution is possible.

The major issue with our case study lies in the size of the dataset. We used a short-term dataset, where 64 matches were played in a period of two months, therefore drastic changes in teams' performance (e.g. due to player injuries) were not expected. It is known that the matrix factorization approach towards latent features extraction
performs significantly better on large datasets. However, a larger dataset would require long-term observation, bringing (at least) the time factor into the equation.

We believe that the prediction process can be further improved by including expert knowledge in the matrix factorization model training, exposing parameters that do not encompass the teams' interaction (e.g. motivation, fatigue, etc.). On the other hand, the parameter selection step would introduce another level of complexity into the matrix factorization process.

Another possibility for improving the prediction success is to append a regression model that would be used to refine the predictions of the latent feature model. This would allow us to include various 'real' match features in the regression model, where they are easier to handle. However, as already pointed out in the paper, expert knowledge would be required to point out the important features.

REFERENCES

[1] B. Min, J. Kim, C. Choe, H. Eom, R. McKay, “A compound framework for sports results prediction: A football case study,” Knowledge-Based Systems, vol. 21, n. 7, pages 551--562, 2008.
[2] M. Hughes, I. Franks, The essentials of performance analysis: an introduction, Routledge, 2007.
[3] D. Karlis, I. Ntzoufras, “Analysis of sports data by using bivariate poisson models,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 52, n. 3, pages 381--393, Wiley Online Library, 2003.
[4] J. Goddard, “Regression models for forecasting goals and match results in association football,” International Journal of Forecasting, vol. 21, n. 2, pages 331--340, Elsevier, 2005.
[5] A. Joseph, N. Fenton, M. Neil, “Predicting football results using Bayesian nets and other machine learning techniques,” Knowledge-Based Systems, vol. 19, n. 7, pages 544--553, Elsevier, 2006.
[6] A. Constantinou, N. Fenton, M. Neil, “pi-football: A Bayesian network model for forecasting Association Football match outcomes,” Knowledge-Based Systems, vol. 36, pages 322--339, 2012.
[7] G. Kumar, Machine Learning for Soccer Analytics, Cambridge University Press, MSc thesis, KU Leuven, 2013.
[8] D. Bunker, R. Thorpe, “A model for teaching games in the secondary school,” Bulletin of Physical Education, n. 10, pages 9--16, 1982.
[9] Y. Koren, R. Bell, C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, n. 8, pages 30--37, IEEE, 2009.
[10] Y. Koren, “Collaborative Filtering with Temporal Dynamics,” Communications of the ACM, vol. 53, n. 4, pages 89--97, 2010.
[11] I. Rish, “An empirical study of the naive Bayes classifier,” IJCAI 2001 workshop on empirical methods in artificial intelligence, vol. 3, n. 22, pages 41--46, 2001.
[12] V. Labatut, H. Cherifi, “Evaluation of Performance Measures for Classifiers Comparison,” Ubiquitous Computing and Communication Journal, vol. 6, pages 21--34, 2011.
[13] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, n. 8, pages 861--874, Elsevier, 2006.
[14] F.J. Provost, T. Fawcett, R. Kohavi, “The case against accuracy estimation for comparing induction algorithms,” Proceedings of ICML-98, vol. 98, pages 445--453, 1998.
[15] F.J. Provost, P. Domingos, “Well-trained PETs: Improving probability estimation trees,” Citeseer, 2000.