Predicting Football Results With Statistical Modelling in Python
Football (or soccer to my American readers) is full of clichés: “It’s a game of two
halves”, “taking it one game at a time” and “Liverpool have failed to win the Premier
League”. You’re less likely to hear “Treating the number of goals scored by each team
as independent Poisson processes, statistical modelling suggests that the home team
have a 60% chance of winning today”. But this is actually a bit of a cliché too (it has been
discussed here, here, here, here and particularly well here). As we’ll discover, a simple
Poisson model is, well, overly simplistic. But it’s a good starting point and a nice
intuitive way to learn about statistical modelling. So, if you came here looking to make
money, I hear this guy makes £5000 per month without leaving the house.
Poisson Distribution
The model is founded on the number of goals scored/conceded by each team. Teams
that have been higher scorers in the past have a greater likelihood of scoring goals in
the future. We’ll import all match results from the recently concluded Premier League
(2016/17) season. There are various sources for this data out there (kaggle,
football-data.co.uk, github, API). I built an R wrapper for that API, but I’ll go the csv route this
time around.
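A minimal sketch of the csv route, using only the standard library. The column names HomeTeam, AwayTeam, FTHG and FTAG (full-time home/away goals) are an assumption based on the football-data.co.uk file layout, and `load_results` and `mean_goals` are hypothetical helpers rather than the code used for this post.

```python
import csv
import io


def load_results(csv_file):
    """Read match results from an open CSV file into a list of dicts.

    Assumes football-data.co.uk style columns: HomeTeam, AwayTeam,
    FTHG (full-time home goals) and FTAG (full-time away goals).
    """
    matches = []
    for row in csv.DictReader(csv_file):
        matches.append({
            "HomeTeam": row["HomeTeam"],
            "AwayTeam": row["AwayTeam"],
            "HomeGoals": int(row["FTHG"]),
            "AwayGoals": int(row["FTAG"]),
        })
    return matches


def mean_goals(matches):
    """Return (average home goals, average away goals) per match."""
    n = len(matches)
    home = sum(m["HomeGoals"] for m in matches) / n
    away = sum(m["AwayGoals"] for m in matches) / n
    return home, away


# Tiny in-memory stand-in for the real file (three results shown in this post).
sample = io.StringIO(
    "HomeTeam,AwayTeam,FTHG,FTAG\n"
    "Burnley,Swansea,0,1\n"
    "Everton,Tottenham,1,1\n"
    "Hull,Leicester,2,1\n"
)
matches = load_results(sample)
home_avg, away_avg = mean_goals(matches)
```

With the full season file you would pass `open(path, newline="")` instead of the in-memory sample.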
0 Burnley Swansea 0 1
2 Everton Tottenham 1 1
3 Hull Leicester 2 1
You’ll notice that, on average, the home team scores more goals than the away team.
This is the so called ‘home (field) advantage’ (discussed here) and isn’t specific to
soccer. This is a convenient time to introduce the Poisson distribution. It’s a discrete
probability distribution that describes the probability of the number of events within a
specific time period (e.g. 90 mins) with a known average rate of occurrence. A key
assumption is that the number of events is independent of time. In our context, this
means that goals don’t become more/less likely depending on the number of goals already scored
in the match. Instead, the number of goals is expressed purely as a function of an average
rate of goals. If that was unclear, maybe this mathematical formulation will make it
clearer:
$$P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad \lambda > 0$$
λ represents the average rate (e.g. average number of goals, average number of letters
you receive, etc.). So, we can treat the number of goals scored by the home and away
team as two independent Poisson distributions. The plot below shows the proportion
of goals scored compared to the number of goals estimated by the corresponding
Poisson distributions.
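The formula above can be evaluated directly. This is a small sketch with the standard library; the rate of 1.5 goals per match is an illustrative value I picked, not the league average from the data.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean rate lam > 0."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Illustrative rate only; not the league average from the data.
lam = 1.5
probs = [poisson_pmf(k, lam) for k in range(9)]
for goals, p in enumerate(probs):
    print(f"P({goals} goals) = {p:.3f}")
```

Comparing these probabilities with the observed proportion of matches at each goal count is what the plot does for the home and away sides.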
We can use this statistical model to estimate the probability of specific events.
The probability of a draw is simply the sum of the events where the two teams score
the same number of goals.
Note that we consider the number of goals scored by each team to be independent
events (i.e. P(A ∩ B) = P(A) P(B)). The difference of two independent Poisson-distributed
variables follows a Skellam distribution. So we can calculate the probability of a draw by inputting
the mean goal values into this distribution.
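As a rough sketch of the draw calculation, the sum over equal scorelines can be written out by hand. The rates 1.6 and 1.2 are hypothetical averages I chose for illustration, and `draw_probability` is my own helper name.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean rate lam > 0."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def draw_probability(home_rate, away_rate, max_goals=20):
    """Sum P(home = k) * P(away = k) over k, truncating at max_goals."""
    return sum(
        poisson_pmf(k, home_rate) * poisson_pmf(k, away_rate)
        for k in range(max_goals + 1)
    )


# Hypothetical average goals per match for the home and away sides.
p_draw = draw_probability(1.6, 1.2)
print(f"P(draw) = {p_draw:.3f}")
```

This is the Skellam probability of a goal difference of zero, so if SciPy is available, `scipy.stats.skellam.pmf(0, 1.6, 1.2)` should agree up to the truncation error.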
So, hopefully you can see how we can adapt this approach to model specific matches.
We just need to know the average number of goals scored by each team and feed this
data into a Poisson model. Let’s have a look at the distribution of goals scored by
Chelsea and Sunderland (teams who finished 1st and last, respectively).
Building A Model
You should now be convinced that the number of goals scored by each team can be
approximated by a Poisson distribution. Due to a relatively small sample size (each team
plays at most 19 home/away games), the accuracy of this approximation can vary
significantly (especially earlier in the season when teams have played fewer games).
Similar to before, we could now calculate the probability of various events in this
Chelsea Sunderland match. But rather than treat each match separately, we’ll build a
more general Poisson regression model (what is that?).
Dep. Variable: goals No. Observations: 740
No. Iterations: 8
team[T.Sunderland] -0.9619 0.222 -4.329 0.000 -1.397 -0.526
If you’re curious about the model-fitting part, you can find more information here (edit:
earlier versions of this post had erroneously employed a Generalised Estimating
Equation (GEE)- what’s the difference?). I’m more interested in the values presented in
the coefficient column in the model summary table, which are analogous to the slopes in
linear regression. Similar to logistic regression, we take the exponent of the parameter
values. A positive value implies more goals ($e^{x} > 1 \;\forall x > 0$), while values closer to
zero represent more neutral effects ($e^{0} = 1$). Towards the bottom of the table you
might notice that the home term has a coefficient of 0.2969. This captures the fact that home teams
generally score more goals than the away team (specifically, $e^{0.2969} = 1.35$ times as
many, on average). But not all teams are created equal. Chelsea has a team coefficient of 0.0789, while the
corresponding value for Sunderland is -0.9619 (sort of saying Chelsea (Sunderland) are
better (much worse!) scorers than average). Finally, the opponent coefficients
penalize/reward teams based on the quality of the opposition. This reflects the
defensive strength of each team (Chelsea: -0.3036; Sunderland: 0.3707). In other
words, you’re less likely to score against Chelsea. Hopefully, that all makes both
statistical and intuitive sense.
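To make the exponent step concrete, here is a sketch of how log-scale coefficients combine into an expected number of goals. The coefficient values are the ones quoted above. The dictionary names and the `expected_goals` helper are my own, and the intercept is left as an argument because its fitted value is not quoted in this section.

```python
import math

# Coefficients quoted in the text (log scale).
HOME_ADVANTAGE = 0.2969
ATTACK = {"Chelsea": 0.0789, "Sunderland": -0.9619}
OPPONENT = {"Chelsea": -0.3036, "Sunderland": 0.3707}


def expected_goals(team, opponent, at_home, intercept):
    """Expected goals under a log-linear Poisson model.

    intercept is a placeholder argument: its fitted value is not
    quoted in the text, so the caller must supply it.
    """
    log_rate = intercept + ATTACK[team] + OPPONENT[opponent]
    if at_home:
        log_rate += HOME_ADVANTAGE
    return math.exp(log_rate)


# The intercept cancels in ratios, so relative effects need no assumption.
ratio = (expected_goals("Chelsea", "Sunderland", True, 0.0)
         / expected_goals("Chelsea", "Sunderland", False, 0.0))
print(round(ratio, 2))  # 1.35, the home multiplier e^0.2969
```

Because the model is additive on the log scale, each coefficient acts as a multiplier on the goal rate once exponentiated.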
Let’s start making some predictions for the upcoming matches. We simply pass our
teams into the fitted model and it’ll return the expected average number of goals for that
team (we need to run it twice- we calculate the expected average number of goals for
each team separately). So let’s see how many goals we expect Chelsea and Sunderland
to score.
Just like before, we have two Poisson distributions. From this, we can calculate the
probability of various events. I’ll wrap this in a function.
This matrix simply shows the probability of Chelsea (rows of the matrix) and
Sunderland (matrix columns) scoring a specific number of goals. For example, along
the diagonal, both teams score the same number of goals (e.g. P(0-0)=0.031). So,
you can calculate the odds of a draw by summing all the diagonal entries. Everything
below the diagonal represents a Chelsea victory (e.g. P(3-0)=0.149). If you prefer
Over/Under markets, you can estimate P(Under 2.5 goals) by summing the entries
where the sum of the column number and row number (both starting at zero) is less
than 3 (i.e. the 6 values that form the upper left triangle). Luckily, we can use basic
matrix manipulation functions to perform these calculations.
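The matrix sums described above can be sketched in plain Python. The goal rates 3.06 (Chelsea) and 0.41 (Sunderland) are my assumption, chosen so that the matrix roughly reproduces the P(0-0) and P(3-0) values quoted above. They are not taken from the model output, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean rate lam > 0."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def score_matrix(home_rate, away_rate, max_goals=10):
    """matrix[i][j] = P(home scores i and away scores j), assuming independence."""
    return [
        [poisson_pmf(i, home_rate) * poisson_pmf(j, away_rate)
         for j in range(max_goals + 1)]
        for i in range(max_goals + 1)
    ]


def outcome_probabilities(matrix):
    """Sum the matrix regions for home win, draw, away win and under 2.5 goals."""
    result = {"home": 0.0, "draw": 0.0, "away": 0.0, "under_2.5": 0.0}
    for i, row in enumerate(matrix):
        for j, p in enumerate(row):
            if i > j:
                result["home"] += p   # below the diagonal
            elif i == j:
                result["draw"] += p   # on the diagonal
            else:
                result["away"] += p   # above the diagonal
            if i + j < 3:
                result["under_2.5"] += p  # the six upper-left entries
    return result


# Assumed rates (Chelsea at home, Sunderland away), not model output.
matrix = score_matrix(3.06, 0.41)
probs = outcome_probabilities(matrix)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

Truncating at 10 goals per team drops a small amount of probability mass, so the three match outcomes sum to slightly less than 1.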
Hmm, our model gives Sunderland a 2.7% chance of winning. But is that right? To
assess the accuracy of the predictions, we’ll compare the probabilities returned by our
model against the odds offered by the Betfair exchange.
Sports Betting/Trading
Unlike traditional bookmakers, on betting exchanges (and Betfair isn’t the only one- it’s
just the biggest), you bet against other people (with Betfair taking a commission on
winnings). It acts as a sort of stock market for sports events. And, like a stock market,
due to the efficient market hypothesis, the prices available at Betfair reflect the true
price/odds of those events happening (in theory anyway). Below, I’ve posted a
screenshot of the Betfair exchange on Sunday 21st May (a few hours before those
matches started).
The numbers inside the boxes represent the best available prices and the amount
available at those prices. The blue boxes signify back bets (i.e. betting that an event will
happen- going long using stock market terminology), while the pink boxes represent lay
bets (i.e. betting that something won’t happen- i.e. shorting). For example, if we were to
bet £100 on Chelsea to win, we would receive the original amount plus 100*(1.13-1) = £13
should they win (of course, we would lose our £100 if they didn’t win). Now, how can
we compare these prices to the probabilities returned by our model? Well, decimal
odds can be converted to probabilities quite easily: it’s simply the inverse of the
decimal odds. For example, the implied probability of Chelsea winning is 1/1.13
(=0.885- our model put the probability at 0.889). I’m focusing on decimal odds, but you
might also be familiar with Moneyline (American) Odds (e.g. +200) and fractional odds
(e.g. 2/1). The relationship between decimal odds, moneyline and probability is
illustrated in the table below. I’ll stick with decimal odds because the alternatives are
either unfamiliar to me (Moneyline) or just stupid (fractional odds).
Match Home Draw Away
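The conversions between odds formats described above can be sketched as a few one-line helpers. The function names are my own, and the moneyline rule follows the usual convention (positive values quote the profit on a 100 stake, negative values quote the stake needed to win 100).

```python
def implied_probability(decimal_odds):
    """Implied probability is the inverse of the decimal odds."""
    return 1.0 / decimal_odds


def moneyline_to_decimal(moneyline):
    """Convert American (moneyline) odds, e.g. +200 or -150, to decimal odds."""
    if moneyline > 0:
        return 1.0 + moneyline / 100.0
    return 1.0 + 100.0 / abs(moneyline)


def fractional_to_decimal(numerator, denominator):
    """Convert fractional odds, e.g. 2/1, to decimal odds."""
    return 1.0 + numerator / denominator


def back_bet_profit(stake, decimal_odds):
    """Profit (excluding the returned stake) if a back bet wins."""
    return stake * (decimal_odds - 1.0)


print(round(implied_probability(1.13), 3))   # 0.885
print(moneyline_to_decimal(200))             # 3.0
print(fractional_to_decimal(2, 1))           # 3.0
print(round(back_bet_profit(100, 1.13), 2))  # 13.0
```

Exchange commission on winnings is ignored here, so real returns would be slightly lower.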
So, we have our model probabilities and (if we trust the exchange) we know the true
probabilities of each event happening. Ideally, our model would identify situations where the
market has underestimated the chances of an event occurring (or not occurring in the
case of lay bets). For example, in a simple coin toss game, imagine if you were offered
$2 for every $1 wagered (plus your stake), if you guessed correctly. The implied
probability is 0.333, but any valid model would return a probability of 0.5. The odds
returned by our model and the Betfair exchange are compared in the table below.
Match Home Draw Away
Green cells illustrate opportunities to make profitable bets, according to our model (the
opacity of the cell is determined by the implied difference). I’ve highlighted the
difference between the model and Betfair in absolute terms (the relative difference may
be more relevant for any trading strategy). Transparent cells indicate situations where
the exchange and our model are in broad agreement. Strong colours imply that either
our model is wrong or the exchange is wrong. Given the simplicity of our model, I’d lean
towards the former.
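The comparison between model and exchange can be sketched as follows, using the coin toss and Chelsea figures from the text. The helper names are hypothetical, and the expected profit only holds if the model probability is actually the true one. Commission is ignored.

```python
def implied_probability(decimal_odds):
    """Implied probability is the inverse of the decimal odds."""
    return 1.0 / decimal_odds


def edge(model_probability, decimal_odds):
    """Absolute difference between model and market-implied probability."""
    return model_probability - implied_probability(decimal_odds)


def expected_profit(model_probability, decimal_odds, stake=1.0):
    """Expected profit of a back bet if the model probability is the true one."""
    win = model_probability * stake * (decimal_odds - 1.0)
    lose = (1.0 - model_probability) * stake
    return win - lose


# Coin toss from the text: $2 profit per $1 staked means decimal odds of 3.0.
coin_edge = edge(0.5, 3.0)
coin_profit = expected_profit(0.5, 3.0)

# Chelsea to win: model probability 0.889 against exchange odds of 1.13.
chelsea_edge = edge(0.889, 1.13)

print(f"coin toss edge: {coin_edge:.3f}, expected profit per $1: {coin_profit:.2f}")
print(f"Chelsea edge: {chelsea_edge:.3f}")
```

A positive edge corresponds to a green cell in the table; the Chelsea edge is close to zero, which is the "broad agreement" case.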
Something’s Poissony
So should we bet the house on Manchester United? Probably not (though they did
win!). There are some non-statistical reasons to resist backing them. Keen football fans
would notice that these matches represent the final gameweek of the season. Most
teams have very little to play for, meaning that the matches are less predictable
(especially when they involve unmotivated ‘bigger’ teams). Compounding that, Man
United were set to play Ajax in the Europa League final three days later. Man United manager,
Jose Mourinho, had even confirmed that he would rest the first team, saving them for
the much more important final. In a similar fashion, injuries/suspensions to key
players or managerial sackings would render our model inaccurate. Never underestimate
the importance of domain knowledge in statistical modelling/machine learning! We
could also think of improvements to the model that would incorporate time when
considering previous matches (i.e. more recent matches should be weighted more
strongly).
We have irrefutable evidence that violates a fundamental assumption of our model,
rendering this whole post as pointless as Sunderland!!! Or we can build on our crude
first attempt. Rather than a simple univariate Poisson model, we might have more
success with a bivariate Poisson distribution. The Weibull distribution has also been
proposed as a viable alternative. These might be topics for future blog posts.
Summary
We built a simple Poisson model to predict the results of English Premier League
matches. Despite its inherent flaws, it recreates several features that would be a
necessity for any predictive football model (home advantage, varying offensive
strengths and opposition quality). In conclusion, don’t wager the rent money, but it’s a
good starting point for more sophisticated, realistic models. Thanks for reading!
Categories: football python