MIPRO 2015, 25-29 May 2015, Opatija, Croatia

Predicting sports results using latent features: a case study

Stefan Dobravec
University of Ljubljana, Faculty of electrical engineering, Ljubljana, Slovenia
[email protected]

Abstract - Predicting sports results is normally a challenging task, even more so in the case of a sport that shows a highly stochastic nature. In football, for example, numerous features are tracked and combined with expert knowledge, yielding various prediction algorithms. Our work, however, is based on a case where no expert knowledge is available and the only data comes from previous match results. We built a goal score prediction model that uses latent features obtained from a matrix factorization process. We also added a Naive Bayes classifier to be able to predict the outcome of the match. The algorithm has been tested on results of the FIFA World Cup 2014. We also built a match result predictor based on betting quotas. As these are derived from complex algorithms that also encompass expert knowledge, our algorithm can be used to estimate the accuracy of an expert knowledge-based system. This case study shows that there is no significant difference between the two algorithms that we tested and that the latent features may provide a valid substitute for real features when the latter are not available.

I. INTRODUCTION
Sport result prediction can be viewed as one of the objectives of sport analytics, which aims at helping decision makers (e.g. team managers, odds traders, etc.) to gain competitive advantage. The difficulty of this task depends on many factors, like the availability of data for past events, the ability to gather data for future events, the knowledge needed to interpret gathered data, and others. In some cases, like in football, the task becomes even more challenging due to the highly stochastic nature of the game. At the same time, however, strategies of a team can be approximated by crisp logic rules [1].

Various techniques for modelling a football match exist that yield different result prediction algorithms. According to [2] the modelling can be put under four generic categories: (i) empirical models, (ii) dynamic systems, (iii) statistical techniques, and (iv) artificial intelligence (including expert systems). The most basic approach in the category of statistical techniques is to use the Poisson distribution for goal-based data analysis [3]. While the Poisson distribution is used to predict the number of goals conceded and scored, the other statistical models restrict their prediction to the match outcome. However, it has been shown in [4] that both approaches yield similar performance. In the artificial intelligence category there are several approaches that focus on Bayesian network modelling [1] [5] [6]. In most cases these approaches are highly complex, based on numerous assumptions, and require large statistical data [7]. On the other hand, they enable inclusion of expert knowledge, thus making them in general more accurate [5]. The remaining two categories are poorly represented in recent years.

A comparison of the described approaches is hindered by the fact that they do not use a common dataset. In fact, neither a standard nor an agreed dataset exists. An analysis of football match parameters is performed in [8], pointing out features that influence the game; however, various approaches use various sets of parameters. This is at least partly due to the high dynamics of a football match, which makes it hard to systematically gather statistical data. Also, different competition formats (league, tournament) emphasize different parameters or their combinations.

Our research, in contrast, is based on latent features of a matrix factorization model instead of relying on real (observed) features. We assume that it is better to use latent features that reflect the variability of the observed facts in the best way possible than to rely on a poor set of real features that can't be properly assessed. The proposed method is thus based solely on results of past matches (the number of goals scored by each team), which represent the fundamental data that is recorded in all cases. No other statistical data or expert knowledge is used.

The matrix factorization technique became very popular in the field of multimedia content recommender systems, where it showed good scalability and predictive accuracy [9]. The idea behind using the latent features in our case is to be able to build a successful model even when expert knowledge is not available or when large statistical data is not available.

The proposed method has been tested on the case of the FIFA World Cup 2014 tournament. Since different approaches use different techniques to measure success, direct comparison with our method would be misleading. Instead, we used cross-folding to statistically evaluate the prediction success of our method. We also compared the proposed method to a match result classifier that is based on betting quotas, which indirectly encompass the expert knowledge of odds traders.

In the following sections, the methodology for predicting match results is explained first, followed by an explanation of the validation procedure. Next, the results are presented, followed by discussion and conclusions.

Authorized licensed use limited to: VIT University- Chennai Campus. Downloaded on October 11,2024 at 05:41:00 UTC from IEEE Xplore. Restrictions apply.
II. MATERIALS AND METHODS

A. FIFA World Cup 2014 Dataset
Using World Cup tournament results as the dataset allowed us to observe the short-term performance of competing teams. Thus we were able to exclude the time factor from our model, in contrast to a league-type competition where teams compete over a long period, which increases the importance of the time factor. We assume that teams enter the tournament in peak condition, with a fixed set of players, and team strategies planned ahead. Therefore major changes in a team's performance are not expected.

A major drawback of the dataset is its size, which is quite small. The dataset could be enlarged by including results of previous World Cup tournaments; however, the list of competing teams varies between tournaments. Thus mainly the number of teams would increase, while the number of matches per team would increase only for those teams that qualify regularly. Additionally, the time factor would play an important role, as the players, coach, and strategy may change significantly between two tournaments. A sophisticated strategy for handling such a dataset, requiring significant research of its own, would be required.

In the World Cup teams compete in the tournament format. In the group stage 32 teams are put in groups of four (altogether 8 groups). Each team plays three matches (rounds) against its group adversaries, which accounts for 48 matches overall in the group stage. The best two teams from each group proceed to the elimination stage, where 16 matches are played (8 in the round of 16, 4 in the quarter-finals, 2 in the semi-finals, and 2 in the final stage, i.e. the third-place match and the final). The tournament therefore consists of 64 matches.

Additionally, 56 matches played in the preparation period of one month before the World Cup tournament were taken into account.

The data was obtained from the official FIFA World Cup page (http://www.fifa.com/worldcup/). In the elimination stage a match cannot end as a draw, therefore extra time (and optionally penalties) is required to determine the winner. As this rule has a huge impact on the final result of the match, we ignored the final score and considered only the score after the regular period, which is also standard practice with 3-way bets (home win, draw, away win).

Betting quotas for the World Cup matches were obtained on the day of the match from the online betting service bwin (https://www.bwin.com/). Betting quotas are in the form of 3-way bets and reflect the probability of the three possible outcomes: home team wins, draw, away team wins. A lower quota (approximately) corresponds to a higher probability of the given outcome.

B. Matrix Factorization Model
In the field of multimedia recommender systems the matrix factorization model showed robustness and accuracy when predicting user experience [9]. The interaction between users and items is modelled on the basis of known ratings that were given by the users to specific items and that correspond to the user's satisfaction. The model is then used to predict the user's satisfaction with unknown (unrated) content.

In our case the model is based on the observation of the number of goals scored by a given team ($T$) against its opponent ($O$) in previous matches, $g_{T,O}$. Every observed match yields two goal score observations:

$$\text{team}_A : \text{team}_B \; - \; \text{scored}_A : \text{scored}_B$$
$$A: \; g_{A,B} = \text{scored}_A$$
$$B: \; g_{B,A} = \text{scored}_B$$

The observations can be arranged in a table (see Table 1), displaying the number of goals scored by a team (T) against its opponent (O). Initially, such a table is highly sparse. New values are added as the tournament progresses; however, even in the final stages the table remains sparse.

TABLE 1. THE TABLE REPRESENTATION OF GOAL SCORE OBSERVATIONS

T\O    A            B            …
A      ■            $g_{A,B}$
B      $g_{B,A}$    ■
…                                ■

The model is then used to predict the number of goals scored in a match to come, $\hat{g}_{T,O}$. As with the observations of past matches above, every match has two goal score predictions, one for each of the two paired teams.

Additional parameters, besides the goals scored by both teams, may be included in the model in a similar way as [10] included the time component. However, this is not straightforward, as the parameters may not be directly correlated to goal score predictions, and expert knowledge is therefore required. For example, if a key player is injured this should influence the match outcome, but an expert is needed to point out the key players. Since the time factor was excluded in our case due to the short-term data set, and we avoid using any expert knowledge, we use the matrix factorization model with biases:

$$\hat{g}_{T,O} = \mu + b_T + b_O + q_O^{\top} p_T$$

In the model the team and its opponent are represented using latent feature vectors $p_T$ and $q_O$, and their interactions are modeled as inner products of the corresponding vectors in the latent feature space. By implementing biases $b_T$ and $b_O$ we model the team's inclination towards scoring/conceding goals, while $\mu$ represents an overall 'goals scored' average. Bias $b_T$ is calculated so that $\mu + b_T$ represents the average of goals scored by the team. Similarly, the bias $b_O$ is calculated so that $\mu + b_O$ represents the average of goals conceded by the opposing team.

In the learning phase the model is fitted to the known match results, or more precisely to the number of goals scored by one team against its opponent (note again that every match provides two observations). The stochastic gradient
descent algorithm is used to loop through the training set ($\kappa_T$), where for each given training case $g_{T,O}$ the prediction $\hat{g}_{T,O}$ is calculated and the model parameters are modified to minimize the squared error function:

$$\min \sum_{g_{T,O} \in \kappa_T} \left( g_{T,O} - \hat{g}_{T,O} \right)^2 + \lambda \left( \lVert p_T \rVert^2 + \lVert q_O \rVert^2 + b_T^2 + b_O^2 \right)$$

where the second part of the sum serves for regularization and is controlled by the parameter $\lambda$.

C. Match Outcome Prediction
The described model predicts the number of goals scored by a team against a given opponent. For a match outcome prediction another step is required. First we need to pair the two predictions, one for the home team ($\hat{g}_{H,A}$) and one for the away team ($\hat{g}_{A,H}$), and compare them to predict the outcome. There are three possible outcomes, commonly noted as 1 (home team wins), 0 (draw), and 2 (away team wins):

$$f: (\hat{g}_{H,A}, \hat{g}_{A,H}) \rightarrow \{1, 0, 2\}$$

The easiest way to predict the outcome using the pair of goal-scoring predictions is to simply round the predictions to the nearest integer to get the predicted match score:

$$f := \begin{cases} 1; & \operatorname{round}(\hat{g}_{H,A}) > \operatorname{round}(\hat{g}_{A,H}) \\ 0; & \operatorname{round}(\hat{g}_{H,A}) = \operatorname{round}(\hat{g}_{A,H}) \\ 2; & \operatorname{round}(\hat{g}_{H,A}) < \operatorname{round}(\hat{g}_{A,H}) \end{cases} \qquad (4)$$

Such an approach, however, proved unfavorable for pairs of predictions that are close together but fall on different sides of the rounding threshold. For example, predictions of 1.45 and 1.55 round to a 1:2 score and thus to an away win, although they differ by only 0.1. Better results were expected when deploying a standard classifier, trained on a set of prediction pairs. We decided to use the Naive Bayes classifier, which has proved to be robust and performs well in most cases, despite the fact that it builds on the assumption that the input parameters are uncorrelated [11]. As the input parameters for the classifier, the $(\hat{g}_{H,A}, \hat{g}_{A,H})$ pairs were used, and the match outcome was used as a label.

While in the first case no training is needed and the prediction is possible ‘on the fly’, this is not the case with the Naive Bayes classifier. Due to the small size of the data set (64 matches altogether), the classification has been done ‘post festum’, using the cross-folding technique to train and to test the performance of the classification.

D. Evaluation
Overall classification success rate, which gives the percentage of correctly predicted (classified) samples, is insufficient, as in most cases we need more insight into the classification performance. Thus a confusion matrix is commonly used (see Figure 1). In our case there are three possible classes (match outcomes); predicted classes are marked with the ‘hat’ sign (^), actual (achieved) classes without it. Values $p_{ij}$ represent shares of samples originating from class $j$ and classified as $i$.

       1      0      2
1^     p11    p12    p13
0^     p21    p22    p23
2^     p31    p32    p33

Figure 1. Confusion Matrix for 3-class classifier

Several performance measures were developed on the basis of the confusion matrix that vary over different research areas. However, in most cases simple and easy-to-understand measures, like the overall success rate, precision and recall, should be used [12]. The overall success rate (OSR) is used to measure the percentage of correctly classified samples. In our case it is calculated as:

$$\mathrm{OSR} = p_{11} + p_{22} + p_{33}. \qquad (5)$$

To get more insight into how the classifier performs for a specific class, precision and recall are used. Both measure the percentage of correctly classified samples; however, while precision (P) is used for a given predicted class, recall (R) is used for a given actual class:

$$P_i = \frac{p_{ii}}{p_{i1} + p_{i2} + p_{i3}}, \qquad R_i = \frac{p_{ii}}{p_{1i} + p_{2i} + p_{3i}}. \qquad (6)$$

Fawcett [13] proposes the use of ROC (Receiver Operating Characteristics) graphs as a performance graphing method, as they are especially useful for domains with skewed class distribution and unequal classification error costs. ROC graphs show the relationship between two ratios: (i) the rate of correctly classified samples from the given actual class (same as recall), named true positive rate (tp rate), and (ii) the rate of incorrectly classified samples that do not belong to the given actual class, named false positive rate (fp rate). In case of perfect classification, the tp rate would be 1 and the fp rate would be zero. ROC graphs allow us to find the point on the curve (which corresponds to a certain classification threshold) closest to the ‘perfect case’. The diagonal line represents the ‘random classification’ where the two rates are equal. The same author also proposes the area under the ROC curve (AUC) as the measure of classifier performance. A higher AUC represents better performance. The performance of a random classifier in terms of AUC is 0.5.

In cases with a small dataset (like ours) it is a common approach to use the cross-folding technique. The correct way to calculate AUC in such cases is to obtain the ROC curve for each fold and represent the classifier with the average curve and the corresponding AUC [14].

In their basic form, ROC graphs are used for two-class problems. With more classes the situation becomes more complex. One method for handling such cases is to produce several different ROC graphs, one for each class, where the reference class is treated as positive and all others as negative [13]. For the multi-class AUC, the weighted sum
approach can be used, where the weights $p(c_i)$ correspond to the prevalence of the reference class [15]:

$$\mathrm{AUC}_{total} = \sum_{c_i \in C} \mathrm{AUC}(c_i) \cdot p(c_i) \qquad (7)$$

III. EXPERIMENT
Figure 2 presents the flow of the experiment. Known match results are stored in the local database. At the start only the results of friendly matches were known; the results of the tournament matches were added as the tournament progressed. After every tournament round, the known data was used to (re)train the matrix factorization model. Betting quotas were obtained on the day of the match and stored for later comparison.

[Flow diagram. Recoverable labels: www.fifa.com, www.bwin.com, DB, κT, Matrix Factorization, Model (qO, pT, μ, bT, bO), Match Fixtures, Betting Quotas, Prediction (ĝH,A, ĝA,H), Match Results, Evaluation, t++]

Figure 2. Flow of the experiment

Due to the small size of the data set, the performance evaluation of the latent feature predictor (classifier) was performed ‘post festum’. We used cross-folding of the data set, more precisely an 8-bin folding approach with stratified sampling. The 8-fold approach was used instead of the more commonly used 10-fold, as this allowed all bins to have the same size (64 samples divided across 8 bins).

For comparison, a classifier based on betting quotas was developed and evaluated using the same approach as the latent feature match predictor. In this case, the three quotas (one for each possible outcome) were used as the Naive Bayes classifier input, and the match outcome was used as the label. Betting quotas are (for obvious reasons) prepared using sophisticated proprietary algorithms, backed also by the expert knowledge of the odds traders and others. Such a comparison of the two classifiers (the latent feature based and the betting quota based) gives an estimate of the usability of the latent feature approach.

IV. RESULTS
The match prediction performance using the ‘rounding’ approach (4) yielded 44% overall accuracy (OSR = 0.438). The confusion matrix in Figure 3, as anticipated, shows very poor precision and recall values for the ‘draw’ class. Using the Naive Bayes classifier improved overall accuracy to 50% (OSR = 0.5). A closer look at the confusion matrix in Figure 4 shows that the precision of the ‘draw’ class improved, as did the recall of the ‘away win’ class. There were fewer candidates for the ‘draw’ outcome than in the first case, but those were picked better. Consequently, we managed to find more ‘away wins’ and also more ‘home wins’ than with the ‘rounding’ approach.

       1        0        2        P
1^     0.219    0.188    0.109    0.424
0^     0.078    0.047    0.078    0.230
2^     0.078    0.031    0.172    0.611
R      0.583    0.176    0.478

Figure 3. Confusion Matrix for ‘rounding’ approach

       1        0        2        P
1^     0.234    0.109    0.109    0.417
0^     0.031    0.031    0.016    0.400
2^     0.109    0.016    0.234    0.652
R      0.625    0.118    0.652

Figure 4. Confusion Matrix for ‘latent features’ approach

       1        0        2        P
1^     0.297    0.156    0.141    0.500
0^     0.063    0.047    0.063    0.273
2^     0.016    0.063    0.156    0.667
R      0.792    0.177    0.435

Figure 5. Confusion Matrix for ‘betting quotas’ approach
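The OSR, precision and recall values printed alongside the confusion matrices follow directly from (5) and (6). The sketch below recomputes them for the ‘rounding’ approach from the shares in Figure 3. The function and variable names are illustrative assumptions and are not taken from the paper, and because the printed shares are rounded to three decimals, the recomputed values can differ from the printed ones in the last digit.

```python
# Confusion-matrix shares copied from Figure 3 ('rounding' approach).
# Rows are predicted classes (1^, 0^, 2^); columns are actual classes (1, 0, 2).
CLASSES = ["1", "0", "2"]  # home win, draw, away win
FIGURE_3 = [
    [0.219, 0.188, 0.109],
    [0.078, 0.047, 0.078],
    [0.078, 0.031, 0.172],
]


def overall_success_rate(p):
    """Eq. (5): sum of the diagonal shares."""
    return sum(p[i][i] for i in range(len(p)))


def precision(p, i):
    """Eq. (6): diagonal share over the row sum (everything predicted as class i)."""
    return p[i][i] / sum(p[i])


def recall(p, i):
    """Eq. (6): diagonal share over the column sum (everything actually in class i)."""
    return p[i][i] / sum(row[i] for row in p)


def summarize(p):
    """Return OSR plus per-class precision and recall, keyed by class label."""
    return {
        "osr": overall_success_rate(p),
        "precision": {c: precision(p, i) for i, c in enumerate(CLASSES)},
        "recall": {c: recall(p, i) for i, c in enumerate(CLASSES)},
    }


if __name__ == "__main__":
    report = summarize(FIGURE_3)
    print(f"OSR = {report['osr']:.3f}")
    for c in CLASSES:
        print(f"class {c}: P = {report['precision'][c]:.3f}, "
              f"R = {report['recall'][c]:.3f}")
```

Passing the shares of Figure 4 or Figure 5 to `summarize` in the same row and column order gives the corresponding values for the other two classifiers.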

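The weighted multi-class AUC of (7) needs only the one-vs-rest AUC of each class and the class prevalences, and the prevalences are the column sums of any of the confusion matrices above. The sketch below illustrates this under stated assumptions: the per-class AUC values are hypothetical placeholders (the paper reports only the weighted totals), and the function names are illustrative.

```python
# Shares copied from Figure 3; rows are predicted classes, columns actual classes
# in the order 1 (home win), 0 (draw), 2 (away win).
FIGURE_3 = [
    [0.219, 0.188, 0.109],
    [0.078, 0.047, 0.078],
    [0.078, 0.031, 0.172],
]

# HYPOTHETICAL one-vs-rest AUC values, for illustration only.
# They are not results from the paper.
HYPOTHETICAL_AUC = [0.70, 0.55, 0.65]


def class_prevalence(p):
    """Column sums of a confusion matrix of shares: share of samples in each actual class."""
    n = len(p)
    return [sum(p[i][j] for i in range(n)) for j in range(n)]


def weighted_auc(per_class_auc, prevalence):
    """Eq. (7): prevalence-weighted sum of per-class (one-vs-rest) AUC values."""
    if len(per_class_auc) != len(prevalence):
        raise ValueError("need exactly one AUC value per class")
    return sum(auc * weight for auc, weight in zip(per_class_auc, prevalence))


if __name__ == "__main__":
    prevalence = class_prevalence(FIGURE_3)
    print("prevalence:", [round(w, 3) for w in prevalence])
    print("weighted AUC:", round(weighted_auc(HYPOTHETICAL_AUC, prevalence), 3))
```

Because the prevalences sum to one, a classifier with an AUC of 0.5 for every class also gets a weighted AUC of 0.5, which matches the random-classifier baseline.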
Figure 6 shows ROC graphs for the classifier that uses the matrix factorization model output. The curves represent the average over the eight folds used for cross-folding. Shaded areas mark the 95% confidence interval for the averaged curves. In our case the following class prevalence was observed: $p(1) = 0.375$, $p(0) = 0.266$, and $p(2) = 0.359$. Using (7) we obtained:

$$\mathrm{AUC}_{lf} = 0.677$$

To estimate the performance of an expert knowledge-based algorithm, we tested the Naive Bayes classifier using betting quotas. We achieved 50% overall accuracy (OSR = 0.5), which is the same as with the ‘latent features’ approach. However, by inspecting the confusion matrix (Figure 5) some differences may be observed, for example more candidates for the ‘draw’ outcome at the cost of lower precision, a much higher hit rate (recall) for the ‘home win’ outcome, and a lower hit rate for the ‘away win’ outcome.

Furthermore, ROC graphs in Figure 7 show superior performance in terms of classifying the ‘away win’ class but at the cost of extremely poor classification of the ‘draw’ class. In this case, we obtained the following AUC value:

$$\mathrm{AUC}_{bq} = 0.562$$

The AUC that we obtained for the latent feature-based classifier is actually higher than the one obtained for the betting quotas-based one. However, the high measurement uncertainty (shaded areas in Figure 6 and Figure 7) cannot be overlooked. Thus we cannot conclude that the former classifier performs significantly better than the latter.

Figure 6. ROC graphs for Naive Bayes classifier using Matrix Factorization model

Figure 7. ROC graphs for Naive Bayes classifier using betting quotas

V. CONCLUSION
The prediction algorithm presented in this paper is based only on previous results and does not require any form of match statistics or expert knowledge. We propose to use the described approach in cases when statistics and expert knowledge are not available or when they cannot be imported, interpreted, or evaluated properly.

As football shows a highly stochastic nature, it is difficult to evaluate any prediction algorithm. The only correct way would be to test all relevant algorithms against each other on the same (reasonably) large dataset. However, it is practically impossible to perform such a comparison, as various algorithms use various statistical data and knowledge, and a standard dataset for testing purposes containing all required data does not exist yet. The comparison between the two classifiers (the latent features-based and the betting quotas-based) performed in this study can be used only to estimate the possibility of substituting real features with latent ones. This case study showed no significant difference between the two approaches, showing that such substitution is possible.

The major issue with our case study lies in the size of the dataset. We used a short-term dataset, where 64 matches were played in a period of two months, therefore drastic changes in teams' performance (e.g. due to player injuries) were not expected. It is known that the matrix factorization approach towards latent features extraction
performs significantly better on large datasets. However, a larger dataset would require long-term observation, bringing (at least) the time factor into the equation.

We believe that the prediction process can be further improved by inclusion of expert knowledge into the matrix factorization model training, exposing parameters that do not encompass teams' interaction (e.g. motivation, fatigue, etc.). On the other hand, the parameter selection step would introduce another level of complexity into the matrix factorization process.

Another possibility to improve the prediction success is to append a regression model that would be used to improve the predictions of the latent feature model. This would allow us to include various ‘real’ match features in the regression model, where they are easier to handle. However, as already pointed out in the paper, expert knowledge would be required to point out the important features.

REFERENCES
[1] B. Min, J. Kim, C. Choe, H. Eom, R. McKay, “A compound framework for sports results prediction: A football case study,” Knowledge-Based Systems, vol. 21, n. 7, pages 551-562, 2008.
[2] M. Hughes, I. Franks, The essentials of performance analysis: an introduction, Routledge, 2007.
[3] D. Karlis, I. Ntzoufras, “Analysis of sports data by using bivariate poisson models,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 52, n. 3, pages 381-393, Wiley Online Library, 2003.
[4] J. Goddard, “Regression models for forecasting goals and match results in association football,” International Journal of Forecasting, vol. 21, n. 2, pages 331-340, Elsevier, 2005.
[5] A. Joseph, N. Fenton, M. Neil, “Predicting football results using Bayesian nets and other machine learning techniques,” Knowledge-Based Systems, vol. 19, n. 7, pages 544-553, Elsevier, 2006.
[6] A. Constantinou, N. Fenton, M. Neil, “pi-football: A Bayesian network model for forecasting Association Football match outcomes,” Knowledge-Based Systems, vol. 36, pages 322-339, 2012.
[7] G. Kumar, Machine Learning for Soccer Analytics, Cambridge University Press, MSc thesis, KU Leuven, 2013.
[8] D. Bunker, R. Thorpe, “A model for teaching games in the secondary school,” Bulletin of Physical Education, n. 10, pages 9-16, 1982.
[9] Y. Koren, R. Bell, C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, n. 8, pages 30-37, IEEE, 2009.
[10] Y. Koren, “Collaborative Filtering with Temporal Dynamics,” Communications of the ACM, vol. 53, n. 4, pages 89-97, 2010.
[11] I. Rish, “An empirical study of the naive Bayes classifier,” IJCAI 2001 workshop on empirical methods in artificial intelligence, vol. 3, n. 22, pages 41-46, 2001.
[12] V. Labatut, H. Cherifi, “Evaluation of Performance Measures for Classifiers Comparison,” Ubiquitous Computing and Communication Journal, vol. 6, pages 21-34, 2011.
[13] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, n. 8, pages 861-874, Elsevier, 2006.
[14] F.J. Provost, T. Fawcett, R. Kohavi, “The case against accuracy estimation for comparing induction algorithms,” Proceedings of ICML-98, vol. 98, pages 445-453, 1998.
[15] F.J. Provost, P. Domingos, “Well-trained PETs: Improving probability estimation trees,” Citeseer, 2000.
