100% found this document useful (1 vote)
110 views28 pages

Bayes-xG Player and Position Correction On Expected Goals (XG) Using Bayesian Hierarchical Approach

This document discusses using Bayesian hierarchical models to analyze approximately 10,000 shots from the English Premier League to determine if player or positional factors influence expected goals (xG) metrics. The models reveal that positional effects are seen in a basic model but diminish with additional predictors, while player-level effects persist, indicating certain players have notable positive or negative adjustments to xG. Comparable results were found analyzing data from Spain's La Liga and Germany's Bundesliga. The priors employed provided sound results but could be refined to enhance modeling complex data.

Uploaded by

amy.jieyu.wu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
110 views28 pages

Bayes-xG Player and Position Correction On Expected Goals (XG) Using Bayesian Hierarchical Approach

This document discusses using Bayesian hierarchical models to analyze approximately 10,000 shots from the English Premier League to determine if player or positional factors influence expected goals (xG) metrics. The models reveal that positional effects are seen in a basic model but diminish with additional predictors, while player-level effects persist, indicating certain players have notable positive or negative adjustments to xG. Comparable results were found analyzing data from Spain's La Liga and Germany's Bundesliga. The priors employed provided sound results but could be refined to enhance modeling complex data.

Uploaded by

amy.jieyu.wu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 28

Bayes-xG: Player and Position Correction on Expected

Goals (xG) using Bayesian Hierarchical Approach


Alexander Scholtes1 Oktay Karakuş1
1
Cardiff University, School of Computer Science and Informatics, Abacws,
Senghennydd Road, Cardiff, CF24 4AG, UK.
{scholtesa, karakuso}@cardiff.ac.uk
arXiv:2311.13707v1 [stat.AP] 22 Nov 2023

Abstract
This study employs Bayesian methodologies to explore the influence of player or posi-
tional factors in predicting the probability of a shot resulting in a goal, measured by the
expected goals (xG) metric. Utilising publicly available data from StatsBomb, Bayesian
hierarchical logistic regressions are constructed, analysing approximately 10,000 shots
from the English Premier League to ascertain whether positional or player-level effects
impact xG. The findings reveal positional effects in a basic model that includes only
distance to goal and shot angle as predictors, highlighting that strikers and attacking
midfielders exhibit a higher likelihood of scoring. However, these effects diminish when
more informative predictors are introduced. Nevertheless, even with additional predic-
tors, player-level effects persist, indicating that certain players possess notable positive
or negative xG adjustments, influencing their likelihood of scoring a given chance. The
study extends its analysis to data from Spain’s La Liga and Germany’s Bundesliga, yield-
ing comparable results. Additionally, the paper assesses the impact of prior distribution
choices on outcomes, concluding that the priors employed in the models provide sound
results but could be refined to enhance sampling efficiency for constructing more complex
and extensive models feasibly.
Keywords: Expected goals, Football, Bayesian hierarchical models, Player adjustment,
Position adjustment, Prior effects.

1 Introduction
One of the most common advanced football analytics metrics is the idea of expected goals (xG),
which estimates the probability of a given shot resulting in a goal based on several features
about the shot such as distance from the shooter to the goal or the body part used by the
shooter. However, none of the mainstream xG models take into account any player-specific
features when estimating these values. To illustrate this, imagine that you have two players
taking the same shot from the same position, with defenders in the same place and everything
else being the same. Still, one player is Lionel Messi and the other is a random player from
the National League (English 5th tier). Obviously, players who play in the National League
are good, but it is not unreasonable to assume that Lionel Messi would be more likely to score.
However, xG metrics would assign the same value for both of these changes.
The objective of this paper is to investigate if there are position or player effects on xG,
meaning that certain positions or players have higher or lower goal probabilities for a given
chance than others. This will be achieved using a Bayesian hierarchical model, where the hier-
archies will be the position of the player or the player. The results of this method will initially

1
be compared to a more traditional frequentist xG model to evaluate baseline results without any
group effects. Then the hierarchical models will be compared to the non-hierarchical Bayesian
models, to assess the impact of having hierarchies in the data on the results. If the rationale
described above of the two players taking the same shot is valid, then it is expected that the xG
predictions of the hierarchical models will differ significantly from the non-hierarchical models,
supporting the idea that there is a position and/or player effect on the xG of a given shot.
This paper will begin with a review of the relevant literature around this topic, looking at
the development of football analytics and xG, as well as any attempts to use Bayesian mod-
elling in football analytics. Thereafter, the methodology will be described by going through
the frequentist and Bayesian techniques used. Then, the data will be introduced and described
with any changes made before the choice of Bayesian prior distributions for predictors is dis-
cussed. Next, the results of the modelling will be presented before validating the results of
the Bayesian models on additional data. The aforementioned choice of prior distributions will
then be evaluated. Finally, a discussion section will deliberate on the significance of the results
before concluding the paper.

2 Related Works
The use of data in football is often not fully embraced, with many decision-makers arguing
the sport is too complex for data to be used effectively to improve results and performance
(Smith, 2022). However, with its successful use in other sports, there was sufficient interest for
some clubs, companies, and individuals to pursue using data to derive conclusions and make
suggestions in football. With a growing demand for data, companies that specialise in sports
data collection have grown too, along with their ability to track data. The result is that there
is now an enormous amount of football data to use for several purposes, such as player/club
performance, scouting, and player fitness and injury risk, to name a few (Tippett, 2019).
At the heart of the idea of using data in football was the potential to gain a competitive
advantage. As a result, clubs that use data tend to be secretive about their operations and
procedures (Tippett, 2019). Despite this, there is plenty of publicly available literature and
sources showing how data can be used in football. Moreover, sports broadcasters have long
used data when giving an overview of a match, such as possession statistics. Still, these have
only recently moved away from simple counts and percentages to more complex metrics. The
Bundesliga, for example, provides a goal probability value after each goal is scored, giving the
chance of that given opportunity resulting in a goal (Aberle et al., 2020).
This goal probability, also commonly called expected goals (xG), has been a central topic
in the development of more advanced statistics using football data (Smith, 2022). Crucially,
it moves away from the idea of things that did happen and focuses on things that could have
happened. With football being such a complex and chaotic sport, outcomes often do not reflect
expectations as matches are often decided by fine margins or decisions out of the players’ and
coaches’ control. Nevertheless, using expectations gives decision-makers an idea of the underly-
ing performance of their team and allows them to see if their team is over or underperforming
according to expectations (Brechot and Flepp, 2020).
There have been many versions of xG models created since the idea was founded, using
a variety of machine-learning techniques and data sources. Herold et al. (2019) provide a
summary of many applications of machine learning in football, including xG models. The
most common methods of estimating xG in their paper are logistic regressions, decision trees,
ensemble methods (e.g. random forest), and neural networks. Lucey et al. (2015) use player
and ball tracking data from the 10 seconds leading up to a shot to estimate goal probabilities
across an entire season and found that “defender proximity, interaction of surrounding players,
speed of play, coupled with shot location play an impact on determining the likelihood of a

2
team scoring a goal”. Madrero Pardo (2020) uses qualitative data from the popular video game
FIFA to account for player effects on xG using a logistic regression and an XGBoost model.
They found that an adjusted model can better predict goals over a season for individual players
and teams than an overall xG model. Fairchild et al. (2018) built an xG model again using
logistic regression and used it to estimate MLS teams’ offensive efficiency in scoring. They
also discuss evaluation metrics for expected goals models and suggest the use of the Brier score
to compare predicted probability to ground-truth binary outcomes. Cavus and Biecek (2022)
apply a variety of ensemble and boosting methods to calculate xG values and find that a random
forest model performs best, even compared to models from other papers using other techniques
and data.
The closest study to this paper to date is that of Hewitt and Karakuş (2023), which inves-
tigates position and player-adjusted xG models. They find evidence of positional adjustments
with forwards having a positive adjustment, midfielders having a slightly negative adjustment,
and defenders having a large negative adjustment. Moreover, they also find evidence of player
effects on xG by fitting their model with only data from Lionel Messi and find a large positive
adjustment in this case.
One of the features a lot of these models have in common is their frequentist approach, as
opposed to using Bayesian methods. Spearman (2018) uses a Bayesian approach to estimate the
maximum a posteriori effects of parameters in a model for predicting future scoring of teams
in games. Joseph et al. (2006) used Bayesian networks and Naı̈ve Bayes learners to predict
the results of matches played by Tottenham Hotspur and compared the results to K-nearest
neighbour and decision tree models. They reiterate one of the benefits of Bayesian modelling
which is comparably accurate predictions in the absence of a large amount of data. Zambom-
Ferraresi et al. (2018) use Bayesian methods to analyse team performance in Europe’s top
leagues to determine which features tend to be most significant in predicting team performance.
They find that the most important features include the number of assists, the number of shots
conceded, saves made by the goalkeeper, passing accuracy, and number of shots on target.
One area of Bayesian modelling which is also often not considered in football analytics is
using multi-level, or hierarchical, models. Tureen and Olthof (2022) construct a multi-level
model for player-adjusted expected goals but do not use a Bayesian approach to do so. Still,
they use their model to calculate estimated player impact values on xG. On the other hand,
Baio and Blangiardo (2010) construct a Bayesian hierarchical model but use it to predict match
results as opposed to xG directly and group their data by the team as opposed to by the player.
Still, the use of Bayesian hierarchical modelling in football is a relatively unexplored area.
By using Bayesian hierarchical modelling, group-level effects can be reliably estimated even
with small group sizes. Therefore, the effect of a player’s position or even the player themselves
on the chance of a given shot resulting in a goal can be reliably measured. The result is that
certain players could be identified as being more likely to score than others for given chances,
which is a result that can be used for player selection or scouting purposes. This idea is present
in the work of Hewitt and Karakuş (2023), where Messi is found to be an extremely efficient goal
scorer. This conclusion may be obvious to football fans, but the fact that the efficiency can be
reliably measured is extremely interesting for potentially comparing the goalscoring efficiency
of footballers. Tureen and Olthof (2022) construct a metric they refer to as “estimated player
impact”, which is another calculation of a player’s individual effect on the probability of scoring.
The estimation of a player’s impact on xG can potentially be another tool in evaluating player
performance for team selection or scouting purposes.

3
3 Methodology
After a general look at the literature on the xG metric and its evaluation throughout the years,
in the section, we explain the details of the methodology this paper is proposing to study
positional and player-related corrections to generic xG approaches. We propose the utilisation
of the Bayes formula by taking player and position information into a conditional probability
formulation which is then evaluated under Bayesian hierarchical modelling.
Before showing how Bayesian methods can be applied to xG modelling, we now show how
xG models are typically created. This involves using a frequentist approach and, in this case,
a logistic regression appears as the natural choice to obtain goal probabilities. This paper
will first attempt to build a generic xG model with comparable results to an xG model built
by StatsBomb, an industry leader in data collection and analysis. To do so, we gradually
increase the number of predictors (both given and engineered) used in the logistic regression
that generally follows the formulation given below

  N
pi X
logit(pi ) = log = β0 + βj · Xji (1)
1 − pi j=0

where pi is the probability of the shot i resulting in a goal, Xji is the value of predictor j
for the shot i. Traditional logistic regression can be seen as a method used to model the
relationship between a binary dependent variable (in this context Goal or No-Goal) and one
or more independent variables (given and engineered features). It uses the logistic function in
(??) to transform a linear combination of features into a probability of the dependent variable
being one of the two classes.

3.1 Bayesian Logistic Regression


Bayesian logistic regression extends logistic regression by introducing a Bayesian framework for
modelling. Instead of estimating fixed model parameters as in traditional logistic regression,
Bayesian logistic regression treats these parameters as random variables with associated proba-
bility distributions. This means that we not only get point estimates for the model parameters
but also full probability distributions, allowing us to quantify uncertainty.
In Bayesian logistic regression, we specify prior distributions for the model parameters,
representing our beliefs about their values before observing any data. The likelihood function
represents the probability of observing the data given the model parameters. Using Bayes’
theorem, we update the prior distribution with the likelihood distribution to obtain the posterior
distribution, which reflects our updated beliefs after considering the data. To compute the
posterior distribution in Bayesian logistic regression, various techniques can be used, including
Markov Chain Monte Carlo (MCMC) methods and variational inference. These methods sample
from the posterior distribution of the parameters to estimate their values and uncertainties.
Once we have the posterior distribution of the model parameters, one can perform vari-
ous tasks such as parameter estimation, uncertainty quantification, and prediction. Bayesian
logistic regression provides a natural way to make probabilistic predictions, as it generates a
distribution of predicted probabilities for each class, rather than just point estimates.
For this paper, the Bayesian methods used will involve fitting Bayesian logistic regressions as
both single-level models (baseline) without group-level effects, and multi-level, or hierarchical,
models with group-level effects. The predictions from both models will then be compared to
determine if there is evidence of group-level effects on xG. To formulate such Bayesian models,

4
this paper specifies
Yi = binary outcome of shot i (1 = goal, 0 = no goal), (2)
pi = probability of shot i resulting in a goal, (3)
Xn,i = the value of predictor n for shot i, (4)
where the likelihood distribution is
Yi ∼ Bernoulli(pi ) (5)
Hence, the Baseline and Hierarchical models are like
N
X
Baseline Model: logit(pi ) = β0 + βj · Xji , (6)
j=0
N
X
Hierarchical Model: logit(pij ) = β0 + βj · Xji + βN +1 · XN +1,i , (7)
j=0

where N + 1 is the index of the grouping variable in the data.

3.2 Data
The data used for this project is all freely available event data from StatsBomb, obtained
using their Python package StatsBombPy (Please see the Statsbomb GitHub page via https:
//github.com/statsbomb for details). From their database, only men’s competitions were used
because it could be that there is a difference in given goal probabilities in men’s and women’s
football, and we have more data from men’s competitions. Then, all open-play shots were
extracted with all relevant information for each shot. Set-pieces were excluded because again
goal probabilities could vary for set-pieces, and we are not interested in modelling this effect.
The resulting data has more than 60,000 shots from a variety of competitions and years,
with 42 columns of information for each shot. Tables 1 and 2 give some summary statistics for
the most relevant variables in the data.
As well as the information given in the columns already in the data, there are several features
which were not included by StatsBomb which could be useful for predicting goal probability.
Many sources cite distance to goal and shot angle to be two of the most important predictors
of goal probability.
The data has the location of the shot, and the StatsBomb data specification (Statsbomb,
2019) provides information about the coordinates of the goalposts. Distance to goal is therefore
calculated as the Euclidean distance from the shot to the centre of the goal.
For shot angle, the cosine rule is used to calculate the angle from the shooter to the two
goalposts. To calculate this reliably, any shots that are taken from the same x-coordinate as
the goal line are excluded since this would create a straight line instead of a triangle with no
shot angle.
Next, the freeze-frame feature in the data is utilised, which provides information about all
other players than the shooter when the shot is taken, including crucially their location on the
pitch. From this information, several features are added: the goalkeeper’s distance to the goal,
whether the goalkeeper is present in the shot triangle formed by the shot and two goalposts,
the number of players present in the shot triangle, and the number of opponents within a 1m
radius of the shooter. Opponents are used for the 1m radius as opposed to all players because
the only time a non-opponent in the radius of a shooter would impact goal probability is when
they are in the shot triangle, which is already being accounted for. Otherwise, only opponents
will try to put pressure or tackle the shooter outside of the shot triangle.

5
Table 1: Summary statistics for most relevant features for xG in the dataset. N(x):y means
that for y number of samples the variable has the value of x.
Variable Description Summary (2.d.p)

Distance between the shooter and the N: 63,309 Min: 0.63


middle of the goal line, normalised to Mean: 18.96 Max: 88.83
distance to goal SD: 8.58
StatsBomb pitch size of 120x80 due to
varying pitch sizes.
By creating a triangle between the shooter N: 63,309 Min: 0.66
shot angle and the two goalposts, the shot angle is Mean: 25.39 Max: 168.61
the angle that is by the shooter. SD: 15.60
N: 63,309 Min: 0.00
Same as distance to goal but for the goal- Mean: 3.56 Max: 118.00
gk distance to goal
keeper instead of the shooter. SD: 2.61
N(0): 1,786 N(1): 30,262
N(2): 19,481 N(3): 6,802
The number of players in the shot trian- N(4): 2,918 N(5): 1,217
players in shot triangle gle, created by the shooter and the two N(6): 513 N(7): 211
goalposts. N(8): 77 N(9): 29
N(10): 11 N(11): 2
N(0): 55,536 N(1): 7,030
The number of opposition players in a 1m N(2): 662 N(3): 71
opponents in radius
radius of the shooter. N(4): 10

The body part used by the shooter to hit N(Preferred N(Head):


shot body part Foot): 30,738 10,647
the ball.
N(Other N(Other): 191
Foot): 11,733
shot first time Whether the shot was a first-time shot, N(True): N(False):
meaning the shooter took no additional 20,946 42,363
touches of the ball before shooting.
gk in shot triangle Whether the goalkeeper was in the shot N(True): N(False):
triangle created by the shooter and the 60,570 2,739
two goalposts when the shot was taken.
shot one on one Whether the shooter was one-on-one with N(True): N(False):
the goalkeeper when shooting. 3,546 59,763
shot open goal Whether the shooter was shooting at an N(True): 736 N(False):
open goal. 62,573

Then, each player’s mode general position is added. These positions are grouped into:
strikers, attacking midfielders, non-attacking midfielders, and defenders, and we expect goal
probability to fall on average for each group respectively. Wingers are included as attacking
midfielders, while wide midfielders and central midfielders are non-attacking midfielders. Wing-
backs are classed as defenders, and so are the few attempts by goalkeepers. Otherwise, all
positions to classes are self-explanatory.
Finally, we modify the body part used by changing “left foot” and “right foot” to “preferred
foot” and “other foot” according to the player’s apparent preference. To assign these, we go
back through the open data and instead look at the passes for each player, we then assign
whichever foot the player made the most passes with as their preferred foot because they often

6
Table 2: (CON’T) Summary statistics for most relevant features for xG in the dataset. N(x):y
means that for y number of samples the variable has the value of x.
Variable Description Summary (2.d.p)
N(Normal): 47,854
N(Overhead Kick): 385
N(Half Volley): 9,371
shot technique The technique the shooter used. N(Diving Header): 284
N(Volley): 4,483
N(Backheel): 244
N(Lob): 688
under pressure Whether the shooter was under pressure N(True): 16,149 N(False): 47,160
when shooting.
goal Whether the shot resulted in a goal. N(True): 6,559 N(False): 56,750
N: 63,309 Min: 0.00
StatsBomb’s own estimated xG value for Mean: 0.10 Max: 1.00
shot statsbomb xg
each shot. SD: 0.13

The general position of the shooter N(ST): 17,073 N(M): 15,858


general position (striker, attacking midfielder, other mid- N(AM): 20,065 N(D): 10,313
fielder, or defender).
N(Messi): 1,907 N(T. Henry): 325
N(L. Suárez): 596 ...
player The name of the shooter. N(Iniesta): 374 N(V. Migas): 1
N(Neymar): 352 N(R. Tricella): 1

have more time to choose which foot to pass with and will then tend to go safely with their
preferred foot, which is less feasible when taking a shot as players tend to be under more
pressure and have less time.

3.3 Reference Models


As outlined in the preceding sections, the primary aims of this paper are to support the adoption
of the Bayesian hierarchical modelling approach in investigating the xG metric, specifically
concerning player-specific attributes and position groups of players. To fulfil these objectives,
we commence by providing definitions for the benchmark and baseline models in the subsequent
sections.
Statsbomb xG Model: This model, employed for comparison in this study, relies on
Statsbombpy’s open data set to provide corresponding xG calculations for each shot discussed
in the dataset. The specifics of Statsbomb’s xG model remain undisclosed, and it serves as the
benchmark model in this paper.
Baseline xG Model: The initial frequentist model introduced is a fundamental model
primarily utilising the distance between the shooter and the goal as a predictor, recognised
as a key factor in goal probability determination. Another crucial feature considered is the
angle of the shot, formed by the shooter and the two goalposts. Recognising the inherent link
between distance and shot angle, an interaction term is incorporated to capture the combined
influence of both features. This foundational parameterisation constitutes the formulation of
the Baseline xG model under a logistic regression model as

logit(pi ) = β0 + β1 · (distance to goal)i + β2 · (shot angle)i + β3 · (distancei · anglei ) (8)

7
It is logical to infer that, under reasonable assumptions, the likelihood of scoring decreases,
on average, as the distance to the goal increases. This implies that the coefficient β1 is likely
to be negative. Conversely, β2 is expected to be positive, reflecting the observation that as the
shot angle decreases, the shot’s position is likely to be farther away or from a wider position,
both scenarios leading to a lower goal probability. Figure 1 illustrates the associations between
shot angle, distance, and the proportion of goals scored to shots, providing a clearer insight
into these relationships.

Figure 1: Relationships between shot angle (binned in the 20s)/distance to goal (binned in 10s)
and the proportion of goals from shots.

Expanded Model: While the primary factors considered for calculating the goal probabil-
ity are typically assumed to be the distance and angle of the shot utilised in the Baseline Model,
it is important to acknowledge that various other elements can influence the shot’s outcome as
shown clearly in the literature. Section 3.2 has previously addressed each of these factors, and
we now introduce our frequentist Extended model, outlined below
logit(pi ) = β0 +β1 · distance to goali + β2 · shot anglei + β3 · (distancei · anglei )+
β4 · gk distance to goali + β5 · players in shot trianglei +
β6 · body parti + β7 · first time shoti + β8 · gk in shot trianglei + (9)
β9 · one on one shoti + β10 · open goali + β11 · techniquei +
β12 · under pressurei

3.4 Bayes-xG Models


Bayesian logistic regression hierarchical modelling is a powerful statistical approach used to
analyse and model complex relationships within data. In this context, an additional grouping
parameter is employed to incorporate the hierarchical structure of the data, acknowledging po-
tential dependencies or variations across groups. Unlike frequentist logistic regression, Bayesian
hierarchical modelling allows for the inclusion of prior information, facilitating a more robust
estimation of parameters and uncertainties. Below, we initiate by formulating our Bayes-xG
models to account for player and position grouping effects. Subsequently, we provide a compre-
hensive explanation of the rationale behind selecting specific prior distributions for the model
parameters.

8
3.4.1 Model Definitions
The foundational Bayesian logistic regression model extends the traditional logistic regression
equation by adding a grouping effect
N
X
logit(pik ) = β0k + βjk · Xji + βN +1,k · XN +1,i , (10)
| {z }
j=0
Grouping effect

where k = 1, . . . , K with K is the number of elements in the group. It is also clear that for this
formulation, we extended logistic regression model coefficients βj ∈ RN into a complex form
for grouping effect as βjk ∈ RK×(N +1) .
Specifically, β0k is a group-specific intercept for k th element of the group which accounts
for variations in the baseline success probability across different groups. On the other hand,
β{1,...,N }k refer to the group-specific slopes which capture variations in the effect of the covariate
across groups.
Following the technical details above, for this paper, we decided to define three versions of
Bayes-xG models which can be expressed as
• Bayes-xG 1 → uses the Baseline model with grouping parameter of position,
• Bayes-xG 2 → uses the Extended model with grouping parameter of position,
• Bayes-xG 3 → uses the Extended model with grouping parameter of player.

3.4.2 Choice of priors


When using Bayesian modelling methods, one important consideration is the choice of prior
distributions for the predictors in the model. Table 3 lists the predictors in each of the models
and the prior distributions used for them.
Variables lacking a clear rationale for a positive or negative value are assigned a prior
distribution modelled on the normal distribution. In contrast, variables, where the direction of
the effect can be reasonably predicted, are assigned a prior distribution skewed towards that
direction. This choice is made to favour values in the predicted direction while still allowing for
the possibility of values in the opposite direction, acknowledging the potential for an incorrect
prediction. The justifications for prior distribution choices are given below:
• distance to goal (µ = −1 and α = −1): Scoring becomes progressively more challenging
as the distance from the goal increases, primarily because the goalkeeper gains additional
time to react to the shot.
• shot angle (µ = 1 and α = 1): This is because an increase in the shot angle implies either
moving closer to the goal or positioning oneself more centrally, both of which result in a
higher average scoring probability.
• players in shot triangle (α = {5, . . . , −5}): The presence of an increasing number of
players within the shot triangle makes it increasingly challenging for a player to execute
a shot in a manner that avoids hitting any of the players while still managing to score a
goal.
• opponents in radius (α = {1, . . . , −2}): In a manner akin to the previously mentioned
feature, an increase in the number of opponents within the shooter’s proximity corre-
sponds to an increased challenge for the shooter. With more players in the vicinity, there
is an augmented effort from opponents to disrupt and block the shot, consequently making
it more difficult for the shooter to successfully score.

9
Table 3: Listing which predictors are used in each of the Bayesian models, and what prior
distribution is given to the coefficient of the given predictor.( N : Normal distribution, SN :
Skew-Normal distribution, HN : Half-Normal distribution.)
Bayes-xG
Predictor 1 2 3 Prior
Intercept ✓ ✓ ✓ N (µ = 0, σ = 5)
distance to goal ✓ ✓ ✓ SN (µ = −1, σ = 5, α = −1)
shot angle ✓ ✓ ✓ SN (µ = 1, σ = 5, α = 1)
distance angle interaction ✓ ✓ ✓ N (µ = 0, σ = 5)
gk distance to goal ✓ ✓ N (µ = 0, σ = 5)
players in shot triangle ✓ ✓ SN (µ = −1, σ = 5, α = {5, 4, . . . , −5})
where α is determined by the value of this
feature (0 players = 5, 1 player = 4, etc.)
opponents in radius ✓ ✓ SN (µ = −1, σ = 5, α = {1, 0, . . . , −2})
where α is determined by the value of this
feature (0 players = 1, 1 player = 0, etc.)
shot body part ✓ ✓ N (µ = 0, σ = 5)
shot first time ✓ ✓ N (µ = 0, σ = 5)
gk in shot triangle ✓ ✓ SN (µ = 0, σ = 5, α = −2)
shot one on one ✓ ✓ SN (µ = 0, σ = 5, α = 2)
shot open goal ✓ ✓ SN (µ = 0, σ = 5, α = 4)
shot technique ✓ ✓ N (µ = 0, σ = 5)
under pressure ✓ ✓ SN (µ = 0, σ = 5, α = −2)
general position ✓ ✓ SN (µ = 0, σ, α = {2, 1, 0, −2}) where scale
parameter σ ∼ HN (γ = 5) and α is {ST,
AM, M, D}, respectively.
player ✓ SN (µ = 0, σ, α = {2, 0}) where scale parame-
ter σ ∼ HN (γ = 5) and α is assigned depend-
ing on prior beliefs about a player ({2: good
finisher, 0: not good finisher}).

• gk in shot triangle (α = −2): If the goalkeeper is positioned within the shot triangle,
their chances of successfully saving a shot are higher compared to when they are located
outside of the shot triangle.

• shot one on one (α = 2): When a player finds themselves in a one-on-one situation with
the goalkeeper, their sole task is to outplay the goalkeeper with their shot, without the
need to navigate or consider other players. This circumstance makes scoring compara-
tively more straightforward.

• shot open goal (α = 4): Similar to the previous feature, in this case, there is no goalkeeper
present. Consequently, the shooter’s sole objective is to direct the shot accurately towards
the target, making the likelihood of scoring very high.

10
• under pressure (α = −2): When a player is under pressure, their ability to concentrate
on accurate and well-targeted shooting diminishes, resulting in a decreased likelihood of
scoring on average.

• general position (α = {ST : 2, AM : 1, M : 0, D : −2}): The α values signify the expec-


tation that strikers will exhibit the highest proficiency in finishing, followed by attacking
midfielders, other midfielders, and, finally, a decrease for defenders.

• player (α = {2, 0}): If a player is anticipated to excel in finishing skills based on their
name and reputation, they are given a value of 2 for α; otherwise, a value of 0 is assigned.

• A standard value of σ = 5 was chosen for the priors to strike a balance, ensuring sufficient
variability. This choice aims to prevent the priors from becoming overly narrow in case
the underlying prior knowledge is incorrect. Simultaneously, it avoids excessive largeness
that could prolong convergence and necessitate numerous rounds of sampling.

3.5 Model Development


Following the exposition of technical aspects related to the models employed in this paper, we
proceed to elucidate the practical details of their implementation. The entire computational
implementation was carried out using the Python programming language, specifically version
3.8 and above. In non-Bayesian modelling phases, logistic regression was executed using the
Python sklearn module along with its associated methods.
Bayesian modelling stages were implemented by using the bambi module which is an open-
source Python package and purposefully crafted to simplify the fitting of generalised linear
multilevel models (GLMMs) using a Bayesian framework. This encompasses a broad category
of techniques widely employed in various research domains, including linear regression, anal-
ysis of variance (ANOVA), logistic and Poisson regression, as well as multilevel and crossed-
group-specific effects models. Bambi facilitates the specification of intricate generalised linear
hierarchical models through a formula notation reminiscent of R, providing a balance between
user-friendly syntax for novices and direct access to internal objects for advanced users. This
design allows beginners to swiftly articulate complex models with default priors, akin to popular
R packages while providing seasoned users with the flexibility to directly manipulate internal
objects for a more advanced and nuanced modelling approach (Capretto et al., 2022).
In particular, a Markov chain Monte Carlo model created in bambi is used to produce
posterior distributions with 1500 draws where 250 draws of which are chosen to be the burning
period. In total, 4 Markov chain sampling is developed resulting in 6000 total samples for each
Bayes-xG model. We set a target acceptance ratio of 95% in the model whilst using the prior
distributions given in Table 3. Considering the target feature in the models is a binary variable
of goal status, we decided to utilise a Bernoulli likelihood for all Bayes-xG models.
Furthermore, in the context of this paper, Bayesian implementation of the models was exe-
cuted using data from a specific league. This decision was driven by the fact that a substantial
portion of the entire dataset consisted primarily of Barcelona matches, as this information was
publicly provided by StatsBomb. The concern was that including such a dominant dataset
might introduce bias and influence the results. By focusing on a single league, the analysis
can encompass a diverse range of players and team matchups, ensuring a more balanced and
comprehensive examination.

4 Experimental Analysis
The experimental analysis of this paper was studied under 5 cases:

11
Figure 2: Distributions of Predictions from Frequentist xG Models

1. Frequentist model comparison to benchmark model for the whole 60K+ shots data set,
2. Bayes-xG model-based “positional ” analysis and comparisons for English Premier League
data set (10K+ shots)
3. Bayes-xG model-based “player-specific” analysis and comparisons for English Premier
League data set (10K+ shots)
4. Extending the developed Bayes-xG model evaluations into different countries, e.g. Spain
(La Liga - 19K shots) and Germany (Bundesliga - 7.5K shots).
5. Investigating the choice of priors on Bayes-xG model outputs

4.1 Frequentist / non-Bayesian models


In the initial series of experiments, we assessed the modelling performance of frequentist models
based on logistic regression, namely the Baseline xG and Extended xG, utilising two distinct
sets of features. The analysis involved the complete data set comprising over 63,000 instances.
The objective was to observe and compare how these models deviate from the predictions of
the benchmark Statsbomb xG model.
The distributions of the predicted xG values for each model are shown below in Figure 2,
along with the StatsBomb xG values for comparison. As expected, the Extended model per-
forms better than the Baseline model (with shot angle, distance to goal, and the interaction
between the two) by also predicting much more extreme values and by decreasing the interquar-
tile and whisker ranges. Furthermore, the extended model has a distribution very close to that
of the StatsBomb model.
Regarding the assessment metrics, Table 4 illustrates the performance of each model in com-
parison to the Statsbomb xG model in terms of Brier score, R2 , mean absolute error (MAE) and
root mean square error (RMSE). As anticipated, the extended model demonstrated superior
performance over the Baseline model, given its enhanced information for predicting goal prob-
ability. Notably, the extended model exhibited a Brier score nearly identical to the StatsBomb

12
Figure 3: Model fitting performance when increasing the number of features.

model, indicating comparable performance to an industry-leading xG model according to this


metric.

Table 4: Outcomes from non-Bayesian models, including the Baseline xG model incorporating
distance to the goal, shot angle, and their interaction, and the Extended xG model introducing
additional features, in comparison to the StatsBomb xG model.
Baseline xG Extended xG Statsbomb xG
RMSE 0.095 0.055 -
MAE 0.058 0.029 -
R2 Score 0.428 0.826 -
Brier Score 0.086 0.076 0.075

In the final analysis of the initial set of experiments, we explored the impact of incorporating
engineered advanced features on the performance of the frequentist logistic regression model.
Figure 3 illustrates the trends in performance evaluation metrics—Brier score, R2, MAE, and
RMSE—relative to the number of features integrated into the model. We initiated the analysis
with a single-parameter model, utilizing only the distance to the goal, and systematically added
features one by one. Typically, there are 16 model parameters (as detailed in Table 3), but this
count increases to a maximum of 33 after one-hot encoding categorical features. Examining
Figure 3 reveals a substantial influence on model performance resulting from the introduction
of these created parameters. Notably, there are certain plateaus in the trends, particularly
associated with one-hot encoded parameters representing categories with a minimal number of
samples in the dataset (e.g., 10 players in the shot triangle).

13
Figure 4: Distributions of xG Adjustments by Position of Bayes-xG1 .

4.2 Positional analysis via Bayesian models


For the second experimental case in this paper, we investigate the effects of the general player
positions on the pitch to the xG values by performing a Bayesian hierarchical modelling ap-
proach. Baseline xG and Extended xG models are redeveloped by using the positional grouping
effect and we obtained Bayes-xG 1 and Bayes-xG 2 models, respectively.
We commence with a more straightforward model that examines only a few features within
the logistic regression framework. As mentioned in the above sections, the baseline xG model
incorporates distance to the goal, shot angle, and their interaction as predictors. To assess the
influence of the grouping variable, ”general position” on xG, we calculate the differences in xG
predictions between the single-level frequentist model and its hierarchical Bayesian counterpart,
referred to as the Bayes-xG 1 model, for each shot in the dataset.
Figure 4 illustrates the distributions of the xG adjustment, considering the general position
within the hierarchical model for each position category. The observed distributions align well
with the prior beliefs regarding the impact of the general position. Defenders exhibit a substan-
tial number of negative xG adjustments in comparison to the baseline model’s xG predictions,
with some adjustments reaching as low as 0.1. As anticipated, non-attacking midfielders display
smaller xG adjustments, encompassing both positive and negative values. Contrary to expec-
tations, attacking midfielders exhibit larger positive xG adjustments on average compared to
strikers, who also generally have positive adjustments. This unexpected finding may stem from
the fact that strikers, by shooting more frequently from high xG scoring positions, often pos-
sess sufficiently high xG values without requiring a significant positional adjustment. On the
other hand, attacking midfielders frequently take shots from areas around the goal, where xG
chances are lower, yet they excel in scoring due to their above-average attacking and scoring
capabilities.
Figure 5 illustrates the shot locations categorised by general playing positions, normalised
for each position. It highlights the notable concentration of chances for defenders positioned di-

14
Figure 5: Normalized Heatmap of Shot Locations by General Position.

Table 5: Mean xG adjustment for each general position from Bayes-xG1 versus theoretical
adjustment of Baseline xG prediction using Bayes’ Theorem.
Position Mean Model Adjustment Mean Theoretical Adjustment
ST 0.009 0.010
AM 0.019 0.020
M -0.006 -0.005
D -0.042 -0.044

rectly in front of the goal, providing a potential explanation for the observed reversals in Figure
4. Contrary to the theorised expectation from the analysis in Figure 4, attacking midfielders
are not inclined to take shots from distant or challenging angles. In fact, on average, strikers
exhibit a higher tendency for such shots. This discovery, coupled with the observation that
attacking midfielders, on average, have larger positive xG adjustments than strikers, suggests
that attacking midfielders may have a superior ability, on average, to convert high xG chances
situated right in front of the goal compared to their striking counterparts.
Before delving into the more intricate model analysis for Bayes-xG2 , we aim to showcase a
validation step to demonstrate the accuracy of the MCMC-based sampling technique employed
in developing Bayesian models in this paper. To achieve this, we replicated the Bayesian
analysis, this time utilising Bayes’ Formula
P (positioni |goal) · P (goal)
P (goal|positioni ) = . (11)
P (positioni )
This allowed us to conduct an analysis where the results of the baseline model could be adjusted
using Bayes’ Theorem, and these adjusted outcomes were then compared to the results of
the hierarchical model. The comparison aimed to assess the proximity between theoretical
adjustments and model adjustments. The outcomes of this process are detailed in Table 5.
Notably, the mean model adjustments closely align with the theoretical adjustments for each
position, affirming that the model has effectively estimated the positional impact.
We proceed with our analysis by delving into the upgraded iteration of the Baseline model,
referred to as the Extended model. Employing Bayes-xG2 , this advanced model involves a posi-
tional analysis akin to its predecessor, Bayes-xG1 . However, it incorporates numerous additional
predictors, including factors such as opponents in radius and gk distance to goal, aiming to en-

15
Figure 6: Distributions of xG adjustments for Bayes-xG2 , where adjustment is hierarchical
model prediction minus baseline model prediction - grouped by general position

hance the baseline xG predictions. Upon inspecting the distributions of xG adjustments by


position in Figure 6, it is evident that the values are notably smaller when compared to those
derived from Bayes-xG1 . Few adjustments now exceed 0.01 away from the baseline xG values.
This observation implies that the supplementary predictors effectively contribute to position
prediction. This indicates that xG advantages stem less from the player’s position and more
from the specific play situation during shooting. For instance, while attackers generally enjoy
better scoring chances on average, leading to positive xG adjustments in the basic model, the
Extended model, by accounting for various features defining these improved scoring chances
(e.g., one-on-ones), mitigates the impact of player position on xG adjustments.
It is intriguing to examine the xG adjustments across the predictor ranges of ”distance to goal ”
and ”shot angle.” Figure 7 illustrates these xG adjustments grouped by the position for both
Bayes-xG models. In Figure 7-(a) for Bayes-xG1 , the convergence of each position towards an
adjustment of 0 signifies that, at a certain distance, the xG value tends to be close to zero,
regardless of other shot-related factors. Notably, defenders exhibit a slight dip in xG adjust-
ments at the lowest distance before increasing. This suggests scenarios where defenders find
themselves in goal-scoring positions, perhaps following set pieces or during a team’s pursuit in a
match. The order of adjustments in Figure 4 aligns consistently across positions, with attacking
midfielders having the highest positive adjustments, followed by strikers, and other midfielders
showing minimal adjustment from the baseline model. Examining the ”shot angle” for Bayes-
xG1 in Figure 7-(b), smaller values correspond to more challenging scoring opportunities, either
due to being far from the goal or from tight angles. Defenders display gradually larger nega-
tive adjustments as the shot angle increases, reversing after a certain point for very high shot
angles, likely corresponding to very close distances to the goal. Similar patterns emerge for
other positions, with attacking midfielders having the largest positive xG adjustments, fol-
lowed by strikers and other midfielders. For Bayes-xG2 model outputs in Figure 7-(c) and (d),

16
Figure 7: Comparison of point estimates for xG adjustments against distance to goal and
shot angle between Bayes-xG1 and Bayes-xG2 , grouped by general position. Adjustments are
hierarchical model prediction minus baseline model prediction.

the compensation of positional effects through additional predictors is evident. Despite a clear
distinction in the effects of distance to the goal and shot angle for each position in Bayes-xG1 ,
no significant differences between positions are observed in Bayes-xG2 . The diminishing impact
of position on xG adjustment is attributed to the diverse player abilities and roles within each
position category. A distinct observation evident in both Figure 7-(c) and (d) is that defenders
exhibit consistent trends for all angles just below 0, whereas attacking midfielders mirror the
same pattern but above the 0 line. Adjustments made to strikers’ and midfielders’ xG values
are barely discernible, with values hovering around 0 across all shot angles.
To further explore the aforementioned phenomena, the next experimental case involves a
player-specific analysis with Bayes-xG3 , grouping data based on the player taking the shot
rather than their general position.

4.3 Player-specific analysis via Bayesian models


In the context of the third experimental case explored in this paper, we introduced Bayes-xG3 ,
derived from the Extended model, with players designated as the grouping variable in Bayesian
hierarchical modelling. The focus here is to assess whether the models necessitate player-

17
Table 6: Selected players for Bayes-xG3 and their goal-scoring statistics in the data set.
Player Shots Goals Conversion Rate
Robert Pirès 56 14 25.00%
Sergio Agüero 112 20 17.90%
Jamie Vardy 111 19 17.10%
Phillippe Coutinho 105 8 7.60%
Ross Barkley 82 6 7.30%
Jonjo Shelvey 51 0 0.00%

specific adjustments. Unlike the previous experimental case that centred on positional analysis,
conducting a player-specific analysis poses increased complexity due to the considerably larger
pool of candidates within the group, making the analysis more challenging and computationally
intensive. Instead of individually representing each player in our group, a selective approach
is employed, categorising the majority as ”other” and opting for a few players with expected
positive or negative xG adjustments. Player selection is based on the ”conversion rate,” i.e., the
percentage of shots scored. To ensure relevance, only players with a minimum of 50 shots are
considered, and the chosen players, along with their statistics, are detailed in Table 6. Players
like R. Pirès, S. Agüero, and J. Vardy, recognized for their prolific goal-scoring, are expected
to have positive xG adjustments. Pirès, notably, exhibits an exceptional conversion rate in the
data subset. Conversely, players like P. Coutinho and R. Barkley, with below-average conversion
rates, might have slight negative xG adjustments, while J. Shelvey, who failed to convert any
of his 51 shots in the data, is likely to have a more substantial negative xG adjustment.
As explained in the preceding methodology section, the impact of each player will be char-
acterised by a prior distribution, specifically a skewed normal distribution. The choice of
distribution parameters is dependent upon the prior beliefs regarding a player’s proficiency as
a goal scorer, with the parameter α taking values of either 2 or 0. This selection is guided by
qualitative beliefs about the players rather than direct utilisation of the data for informing the
priors. Players such as Pirès, Agüero, Vardy, and Coutinho, acknowledged as talented attack-
ing players, are attributed α = 2. In contrast, Barkley and Shelvey, who are not commonly
associated with being top-tier attackers but possess other defining qualities in their game, are
assigned α = 0.
Figure 8 illustrates the distributions of xG adjustments for individual players and the collec-
tive ”other” players group. A prominent observation is the substantial positive xG adjustments
for Robert Pirès, some reaching as high as 0.3 above the baseline xG. Notably, these adjust-
ments persist even after incorporating additional predictors in Bayes-xG2 that were intended to
eliminate group effects in the previous experimental analysis. Pirès also exhibits a wide spread
of adjustments, ranging close to 0, indicating a diverse array of shot types. Some were high
xG chances, requiring minimal adjustment, while others were more challenging but consistently
converted by Pirès, resulting in significant positive adjustments. Agüero displays consistently
positive xG adjustments, albeit smaller on average and with a narrower spread compared to
Pirès. Intriguingly, Vardy and Coutinho exhibit minimal positive xG adjustments, not signifi-
cantly greater than those of Barkley. Shelvey aligns with expectations, displaying substantial
negative xG adjustments based on his conversion rate in the data. Lastly, the ”other” group
centres around 0 for xG adjustment, as anticipated, given its diverse player composition with
no discernible group effect to capture.
Figure 9 displays the shot locations and outcomes for the selected players, offering insights

18
Figure 8: Distributions of xG adjustments for Bayes-xG3 , where adjustment is hierarchical
model prediction minus baseline model prediction - grouped by player.

into the findings presented in Figure 8. Beginning with Pirès, notable for his substantial positive
xG adjustments, the observation centres on his efficiency in goal scoring despite a relatively
low number of shots. His ability to score from challenging positions, such as both corners
of the box and outside the area, contributes to the positive xG adjustments, indicating his
prowess as a goal scorer even in demanding scenarios. Agüero and Vardy exhibit similar shot
patterns, but the model assigns significantly higher positive xG adjustments to Agüero. This
discrepancy may stem from the nature of Vardy’s shots being inherently high xG chances, like
one-on-one opportunities, whereas Agüero manages to convert more challenging shots, resulting
in larger adjustments. Comparing Vardy with Coutinho and Barkley, who exhibit similar
xG adjustments in Figure 8, suggests that their goal-scoring patterns align with baseline xG
values without substantial player adjustments. Lastly, Shelvey’s shot map lacks goals from
various positions. While difficult-to-score shots receive minor adjustments, centrally located
missed chances likely contribute to the notable negative adjustments, reflecting Shelvey’s poor
conversion rates in this dataset.
Figure 10 displays the cumulative data for goals scored, baseline expected goals (xG) from
the single-level model, and adjusted xG from the player-corrected model for the selected players
in this analysis. The visual representation illustrates that, in comparison to the single-level
model, the player-corrected model provides more accurate estimates of total goals scored by
each player. Notably, players like Pirès and Agüero, who outperformed their baseline xG by
scoring difficult chances, exhibit adjusted xG totals much closer to their actual goals scored.
Conversely, Shelvey’s adjusted xG total is more aligned with the zero goals he scored, although
it is crucial to emphasise that it is not precisely zero.

4.4 Extension into other leagues


To ascertain the generalizability of the conclusions drawn in this study beyond the Premier
League dataset examined earlier and their applicability to football on a broader scale, as our

19
Figure 9: Selected player shot locations and goals.

Figure 10: Comparison of Baseline, Bayes-xG3 hierarchical predictions, and actual goals scored
for selected players.

fourth experimental case, a parallel analysis was conducted using data from Spain’s La Liga
and Germany’s Bundesliga. The datasets for these leagues encompass approximately 19,000
and 7,500 shots, respectively. It is noteworthy that the La Liga dataset is notably influenced by
Barcelona, primarily due to the fact that StatsBomb predominantly released data from games

20
Figure 11: Distributions of xG adjustments for Bayes-xG1 and Bayes-xG2 for Spanish La-Liga
and the German Bundesliga.

in which Lionel Messi, a prominent Barcelona player, participated.


Figure 11 displays the distributions of xG adjustments for the Bundesliga and La Liga,
which serve as the data sets for the Bayes-xG1 and Bayes-xG2 models. The outcomes for both
leagues closely resemble those of the Premier League in the Bayes-xG1 model. The average
adjustments follow a consistent order across all leagues, with attacking midfielders exhibiting
the most positive adjustments, followed by strikers, other midfielders, and then a substantial
drop to defenders. Additionally, the magnitudes of these adjustments exhibit similar patterns.
The Bayes-xG2 model reveals a noteworthy reduction in the magnitude of xG adjustments, a
trend observed consistently across all leagues. Although the Bayes-xG2 model indicates slightly
larger adjustment magnitudes for the additional leagues, the variation is not significant enough
to alter the model results significantly when compared to those of the Premier League.
To conduct a player-specific examination of Bayes-xG3 , the inclusion of new players from
both leagues was necessary. The selection process mirrored that of the Premier League, where
players were listed based on their conversion rates in the data sets, encompassing both the best
and worst performers. Tables 7 and 8 display the chosen players and their respective statistics
for La Liga and the Bundesliga, providing a comprehensive overview of the selected players’
performance in each league.
Given these conversion rates, it is reasonable to infer that the top three performers in
each table would, on average, experience positive adjustments in expected goals (xG), while
the bottom three would, on average, encounter negative or negligible xG adjustments. The
visual representation in Figure 12 illustrates these adjustments for players in both the Spanish
La Liga (a) and the German Bundesliga (b). As anticipated, players like Bale, Messi, and
Eto’o from La Liga predominantly exhibit positive xG adjustments, aligning with expectations.
Conversely, other selected players either display minimal adjustments or predominantly negative
adjustments. In the Bundesliga context, Figure 12-(b) reaffirms many anticipated outcomes.
An intriguing finding is that Aubameyang, despite maintaining a high conversion rate, exhibits
predominantly negative xG adjustments. This suggests that the goals he scores tend to be from
chances with already high expected goals, contrary to some expectations.

21
Table 7: Selected players for Bayes-xG3 and their scoring statistics from La Liga data set.
Player Shots Goals Conversion Rate
Gareth Bale 89 20 22.50%
Lionel Messi 1862 375 20.10%
Samuel Eto’o 295 62 21%
Bebé 74 2 2.70%
Rafael Márquez 53 2 3.80%
Andrés Iniesta 362 25 6.90%

Table 8: Selected players for Bayes-xG3 and their scoring statistics from Bundesliga data set.
Player Shots Goals Conversion Rate
Javier Hernández 63 16 25.40%
Pierre-Emerick Aubameyang 107 22 20.60%
Robert Lewandowski 147 28 19%
Pascal Groß 56 1 1.80%
Hakan Çalhanoğlu 51 1 2%
Timo Werner 64 6 9.40%

Figure 12: Distributions of xG adjustments for Bayes-xG3 for Spanish La-Lida and the German
Bundesliga

Figure 13 illustrates the shot locations of Bundesliga players, validating that the majority
of Aubameyang’s shots originate from within the penalty area. This observation implies that
these shots likely possess additional characteristics, such as one-on-one opportunities, making
them high expected goals (xG) chances. In contrast, Timo Werner exhibits minimal goals
despite a comparable shot map, and the substantial negative xG adjustments suggest that he
should have scored more from these shot positions. On a different note, Çalhanoğlu records
relatively few goals from shots that present higher difficulty due to their distance from the goal,
resulting in slightly fewer negative xG adjustments.
In conclusion, the findings from both La Liga and the Bundesliga verify the results obtained

22
Figure 13: Selected player shot locations and goals for the German Bundesliga

in the Premier League sections above. Notably, there is an indication of a positional impact
on xG in a fundamental xG model (Bayes-xG1 ), but such effects markedly diminish with the
adoption of a more intricate model (Bayes-xG2 ). Nevertheless, even with the extended model,
there remains evidence of player-specific effects on xG, providing a quantitative measure of how
certain players excel or lag behind others in scoring.

4.5 Analysing the choice of priors


The impact of prior choices is particularly evident when assessing the efficiency and accuracy
of Bayesian models. Sensitivity to the choice of priors is crucial, and careful consideration is
needed to strike a balance between the informativeness of priors and their impact on computa-
tional efficiency. In Bayesian hierarchical modelling, this emphasizes the need for a thoughtful
approach to prior selection, ensuring that the chosen priors align with the characteristics of the
data and contribute to the model’s robustness and reliability.
The selection of prior distributions holds significant importance in Bayesian hierarchical
modelling for several reasons. Firstly, the choice of prior widths plays a pivotal role. Employing
overly wide priors, encompassing a large range of values, may necessitate an extensive number
of samples for the model to converge to the true values of the variables, leading to prolonged
computation times. Conversely, if the prior distributions are excessively narrow and the true
values are unlikely to be sampled, the model’s outcomes may be biased and inaccurate.
To evaluate the appropriateness of the chosen prior distributions, a reassessment will be
conducted by refitting the extended baseline model (single-level) with different priors, followed
by a comparison of predictions. The sets of priors under consideration include (1) the existing
priors given in Table 3, (2-3) a pair of wide-tight uniform priors, (4-5) a pair of wide-tight
normal priors, and (6) a deliberately ill-suited set of priors. The choices of priors for the cases
of (2-6) are presented in Table 9. For the uniform priors, each variable has been given a wide

23
Table 9: Choices of prior distributions for analysing the impact of prior choice on results.
2-Wide 3-Tight 4-Wide 5-Tight 6-Ill-suited
Predictor Uniform Uniform Normal Normal Prior
Intercept SN (0, 0.25, 2)
distance to goal SN (0, 0.25, 2)
shot angle N (0, 0.25)
distance*angle SN (0, 0.25, −2)
gk distance to goal N (0, 0.25)
players in shot triangle SN (0, 0.25, {−5, . . . , 5})
opponents in radius SN (0, 0.25, {1, . . . , −2})
)

, 1)

0)

)
shot body part N (0, 0.25)
100

.25
0, 1
−1

0, 0
00,

shot first time N (0, 0.25)


N(
U(

N(
−1

gk in shot triangle SN (0, 0.25, 2)


U(

shot one on one SN (0, 0.25, −2)


shot open goal SN (0, 0.25, −4)
shot technique N (0, 0.25)
under pressure SN (0, 0.25, 2)

and tight uniform distribution pair. Similarly, for normal priors, zero mean priors are chosen
with two different σ values to represent a wise and tight value support. On the other hand,
the ill-suited priors have been given very narrow distributions by using a small value for σ.
Moreover, some of the skews in the distributions have been flipped such that the prior belief
about the effect is the reverse of what was actually used.
Furthermore, the predictions generated by these models will be juxtaposed with those of the
extended non-Bayesian model on the same data, providing a baseline for assessing the efficacy
of the selected priors in yielding accurate predictions.
Figure 14 illustrates the distributions of predictions generated by each of the prior models.
The model utilising wide uniform prior distributions exhibits notably poor performance when
compared to both the non-Bayesian baseline model and the Statsbomb benchmark, displaying
a considerable spread of predictions. On the other hand, employing tight uniform priors leads
to more restricted predictions, although the average expected goals (xG) predictions tend to be
relatively smaller. The utilisation of a tight normal prior yields similar performance, primarily
underestimating xG values, particularly with the highest xG values concentrated around 0.8.
In contrast, adopting wide normal priors results in enhanced performance compared to the
tight setting, with predictions following a similar trend to the existing prior configuration. It
is noteworthy that the mean and interquartile ranges of both normal priors depicted in Figure
14 exhibit a favourable correspondence with the baseline and benchmark models.
On the other hand, as depicted in Figure 14, ill-suited priors result in a notably narrow
spread, with scarce xG predictions exceeding 0.5. Despite this, the ill-suited priors exhibit
improved performance compared to uniform priors in terms of aligning the average with the
baseline and benchmark predictions and maintaining a similar-sized interquartile range. This
improved performance is likely attributed to the greater number of samples utilised for param-
eter estimation. In the case of the model with uniform priors, the same number of samples,

24
Figure 14: Distributions of xG predictions for each of the extended single-level Bayesian model,
with different choices of prior distributions.

however, prevented the model from converging to optimal parameter values, resulting in poor
predictions. While, given more samples, this model could eventually yield accurate results, the
computational time required is uncertain and could be extensive. The model with the exist-
ing prior distributions outperforms the others significantly, closely resembling the distribution
of baseline and benchmark models’ predictions while also offering valuable insights based on
player positions.
In this section’s final analysis, we explore the mean signed deviation (MSD) values between
each prior case and the predictions of the Non-Bayesian extended model. The selection of MSD
aims to emphasize instances of over or underprediction based on the prior choice, using the
mean spread of MSD values as a performance metric. Figure 15 illustrates the distributions
of MSD values for each prior group through various box plots. Notably, Figure 15 highlights
the considerable performance of the Wide Normal prior choice, exhibiting results akin to the
current prior approaches. Its interquartile range closely aligns with the current prior case, albeit
with a few more instances of over-predicted outliers. It is crucial to observe that both uniform
priors’ MSD values are distributed around zero, despite with a broader spread. In contrast, the
Tight-Normal and Ill-suited prior cases exhibit notably poor performance, marked by a higher
frequency of overestimated xG values compared to the Non-Bayesian extended model.

5 Final Remarks & Conclusion


The objective of this study was to explore whether there exist distinct effects based on player
position on expected goal (xG) values, suggesting that different positions or players may exhibit
specific xG adjustments for given scoring opportunities. The hypothesis was that proficient
attacking players, such as strikers, attacking midfielders, or those recognised for their offensive
prowess, would demonstrate positive xG adjustments. In contrast, players less renowned for
their attacking contributions, such as defenders or those with defensive roles, were expected to
show negative xG adjustments.

25
Figure 15: Mean signed deviation distributions for each analysed prior choice.

To reach the objective mentioned above, this study has developed several Bayesian models to
evaluate the influence of a player’s position and individual player effects on xG predictions. Ini-
tially, a basic xG model (Baseline xG), incorporating only distance to the goal, shot angle, and
their interaction, indicated positional effects on xG (Bayes-xG1 ). Strikers and attacking mid-
fielders exhibited positive xG adjustments, midfielders displayed minimal adjustments, while
defenders had notably negative xG adjustments on average. However, the introduction of addi-
tional predictors in the models diminished the positional effects to the extent that they became
almost insignificant (Extended xG), suggesting that player position had minimal impact on xG
when considering more shot-related factors (Bayes-xG2 ). Subsequently, player effects were ex-
plored using the extended model employed for the second positional-effects model (Bayes-xG3 ),
grouping the data based on the player’s shooting rather than the shooter’s position. The model
was illustrated using six players from each dataset from three of the European Top 5 leagues,
revealing significant player effects on xG even when controlling for various shot-related factors.
These effects were diverse in direction, notably positive for R. Pirés (as well as for G. Bale and
J. Hernandez) and negative for J. Shelvey (as well as for A. Iniesta and T. Werner).
The indication that there exist player-specific effects in determining goal probability could
prove beneficial in football scouting and player selection. By computing adjusted xG values
for various players and comparing these adjusted values to their non-adjusted counterparts,
as demonstrated in this analysis, it becomes possible to distinguish players who excel at con-
verting challenging opportunities from those who consistently find themselves in advantageous
positions. Examining the results for the English Premier League dataset, particularly for J.
Vardy, reveals that his total adjusted xG is not significantly different from his baseline xG
(see Figure 10). This suggests that, given the quality of chances Vardy receives, he scores at a
relatively average rate. On the other hand, S. Agüero demonstrates a more consistent ability to
score from more challenging positions, evident in his larger adjusted xG. It is important to note
that this observation does not imply that Agüero is a superior player or attacker compared to
Vardy. Instead, it suggests that, on average, Agüero is more adept at converting chances with
lower xG values than Vardy, indicating proficiency in scoring from less favourable situations.
It is important to acknowledge that there might also be team-related influences at play

26
in this context. To further compare Vardy and Agüero, Vardy is part of a Leicester team
known for its high-tempo and direct attacking style. This approach likely leads to shooting
scenarios where the ball is played behind the defence, creating situations with fewer defenders
to obstruct or impede a shot. This dynamic often results in one-on-one opportunities with the
goalkeeper, contributing to Vardy consistently receiving numerous high xG chances. Conversely,
Agüero played for one of the top teams in the league, causing opponents to adopt a more
conservative approach. Teams facing Manchester City tend to minimise space, making chances
more challenging with multiple players in the shot triangle and other complicating factors.
The Bayesian modelling results were collocated with non-Bayesian, or frequentist, mod-
elling. Particularly in the case of player correction, Bayesian modelling offers a significant
advantage by capturing uncertainty through posterior distributions rather than relying solely
on point estimates. This becomes particularly advantageous when dealing with younger play-
ers with limited match experience, as Bayesian hierarchical modelling effectively addresses data
groups with few observations. Regardless, Bayesian modelling is rarely used in the literature
of football analytics. Beyond its applications in scouting and player selection, Bayesian hierar-
chical modelling holds promise for various metrics in football. For instance, the assessment of
injury risk among players, is a common practice in large football clubs, where certain players
may have a higher overall susceptibility to injuries. Hierarchical modelling, in this context,
has the potential to provide more accurate assessments of individual players’ injury risks by
considering their specific injury history.

References
Marcelo Aberle, Luuk Figdor, Lina Mongrand, and Mirko Janetzke. The tech behind the
bundesliga match facts xgoals: How machine learning is driving data-driven insights in
soccer, Jun 2020. URL https://aws.amazon.com/blogs/machine-learning/the-tech-
behind-the-bundesliga-match-facts-xgoals-how-machine-learning-is-driving-
data-driven-insights-in-soccer/.

Gianluca Baio and Marta Blangiardo. Bayesian hierarchical model for the prediction of football
results. Journal of Applied Statistics, 37(2):253–264, 2010.

Marc Brechot and Raphael Flepp. Dealing with randomness in match outcomes: how to
rethink performance evaluation in european club football using expected goals. Journal of
Sports Economics, 21(4):335–362, 2020.

Tomás Capretto, Camen Piho, Ravin Kumar, Jacob Westfall, Tal Yarkoni, and Osvaldo A
Martin. Bambi: A simple interface for fitting bayesian linear models in python. Journal
of Statistical Software, 103(15):1–29, 2022. doi:10.18637/jss.v103.i15. URL https://
www.jstatsoft.org/index.php/jss/article/view/v103i15.

Mustafa Cavus and Przemyslaw Biecek. Explainable expected goal models for performance
analysis in football analytics. In 2022 IEEE 9th International Conference on Data Science
and Advanced Analytics (DSAA), pages 1–9. IEEE, 2022.

Alexander Fairchild, Konstantinos Pelechrinis, and Marios Kokkodis. Spatial analysis of shots
in mls: a model for expected goals and fractal dimensionality. Journal of Sports Analytics, 4
(3):165–174, 2018.

Mat Herold, Floris Goes, Stephan Nopp, Pascal Bauer, Chris Thompson, and Tim Meyer.
Machine learning in men’s professional football: Current applications and future directions

27
for improving attacking play. International Journal of Sports Science & Coaching, 14(6):
798–817, 2019.

James H Hewitt and Oktay Karakuş. A machine learning approach for player and position
adjusted expected goals in football (soccer). Franklin Open, 4:100034, 2023.

Anito Joseph, Norman E Fenton, and Martin Neil. Predicting football results using bayesian
nets and other machine learning techniques. Knowledge-Based Systems, 19(7):544–553, 2006.

Patrick Lucey, Alina Bialkowski, Mathew Monfort, Peter Carr, and Iain Matthews. quality vs
quantity: Improved shot prediction in soccer using strategic features from spatiotemporal
data. 2015.

Pau Madrero Pardo. Creating a model for expected goals in football using qualitative player
information. Master’s thesis, Universitat Politècnica de Catalunya, 2020.

Rory Smith. Expected goals: the story of how data conquered football and changed the game
forever. (No Title), 2022.

William Spearman. Beyond expected goals. In Proceedings of the 12th MIT sloan sports
analytics conference, pages 1–17, 2018.

Statsbomb. Statsbomb open data specification, 2019. URL https://github.com/statsbomb/


open-data/blob/master/doc/StatsBomb%20Open%20Data%20Specification%20v1.1.pdf.

James Tippett. The Expected Goals Philosophy: A Game-Changing Way of Analysing Football.
publisher, 2019.

Tahmeed Tureen and SBH Olthof. “Estimated Player Impact”(EPI): Quantifying the effects
of individual players on football (soccer) actions using hierarchical statistical models. In
StatsBomb Conference Proceedings. StatsBomb, 2022.

Fabı́ola Zambom-Ferraresi, Vicente Rios, and Fernando Lera-López. Determinants of sport


performance in european football: What can we learn from the data? Decision Support
Systems, 114:18–28, 2018.

28

You might also like