Bayes-xG: Player and Position Correction on Expected Goals (xG) Using Bayesian Hierarchical Approach
Abstract
This study employs Bayesian methodologies to explore the influence of player or posi-
tional factors in predicting the probability of a shot resulting in a goal, measured by the
expected goals (xG) metric. Utilising publicly available data from StatsBomb, Bayesian
hierarchical logistic regressions are constructed, analysing approximately 10,000 shots
from the English Premier League to ascertain whether positional or player-level effects
impact xG. The findings reveal positional effects in a basic model that includes only
distance to goal and shot angle as predictors, highlighting that strikers and attacking
midfielders exhibit a higher likelihood of scoring. However, these effects diminish when
more informative predictors are introduced. Nevertheless, even with additional predic-
tors, player-level effects persist, indicating that certain players possess notable positive
or negative xG adjustments, influencing their likelihood of scoring a given chance. The
study extends its analysis to data from Spain’s La Liga and Germany’s Bundesliga, yielding
comparable results. Additionally, the paper assesses the impact of prior distribution
choices on outcomes, concluding that the priors employed in the models provide sound
results but could be refined to improve sampling efficiency, which would make more complex
and extensive models feasible.
Keywords: Expected goals, Football, Bayesian hierarchical models, Player adjustment,
Position adjustment, Prior effects.
1 Introduction
One of the most common advanced football analytics metrics is the idea of expected goals (xG),
which estimates the probability of a given shot resulting in a goal based on several features
about the shot such as distance from the shooter to the goal or the body part used by the
shooter. However, none of the mainstream xG models take into account any player-specific
features when estimating these values. To illustrate this, imagine that you have two players
taking the same shot from the same position, with defenders in the same place and everything
else being the same. Still, one player is Lionel Messi and the other is a random player from
the National League (English 5th tier). Obviously, players who play in the National League
are good, but it is not unreasonable to assume that Lionel Messi would be more likely to score.
However, xG metrics would assign the same value to both of these chances.
The objective of this paper is to investigate if there are position or player effects on xG,
meaning that certain positions or players have higher or lower goal probabilities for a given
chance than others. This will be achieved using a Bayesian hierarchical model, where the hier-
archies will be the position of the player or the player. The results of this method will initially
be compared to a more traditional frequentist xG model to evaluate baseline results without any
group effects. Then the hierarchical models will be compared to the non-hierarchical Bayesian
models, to assess the impact of having hierarchies in the data on the results. If the rationale
described above of the two players taking the same shot is valid, then it is expected that the xG
predictions of the hierarchical models will differ significantly from the non-hierarchical models,
supporting the idea that there is a position and/or player effect on the xG of a given shot.
This paper will begin with a review of the relevant literature around this topic, looking at
the development of football analytics and xG, as well as any attempts to use Bayesian mod-
elling in football analytics. Thereafter, the methodology will be described by going through
the frequentist and Bayesian techniques used. Then, the data will be introduced and described
with any changes made before the choice of Bayesian prior distributions for predictors is dis-
cussed. Next, the results of the modelling will be presented before validating the results of
the Bayesian models on additional data. The aforementioned choice of prior distributions will
then be evaluated. Finally, a discussion section will deliberate on the significance of the results
before concluding the paper.
2 Related Works
The use of data in football is often not fully embraced, with many decision-makers arguing
the sport is too complex for data to be used effectively to improve results and performance
(Smith, 2022). However, with its successful use in other sports, there was sufficient interest for
some clubs, companies, and individuals to pursue using data to derive conclusions and make
suggestions in football. With a growing demand for data, companies that specialise in sports
data collection have grown too, along with their ability to track data. The result is that there
is now an enormous amount of football data to use for several purposes, such as player/club
performance, scouting, and player fitness and injury risk, to name a few (Tippett, 2019).
At the heart of the idea of using data in football was the potential to gain a competitive
advantage. As a result, clubs that use data tend to be secretive about their operations and
procedures (Tippett, 2019). Despite this, there is plenty of publicly available literature and
sources showing how data can be used in football. Moreover, sports broadcasters have long
used data when giving an overview of a match, such as possession statistics. Still, these have
only recently moved away from simple counts and percentages to more complex metrics. The
Bundesliga, for example, provides a goal probability value after each goal is scored, giving the
chance of that given opportunity resulting in a goal (Aberle et al., 2020).
This goal probability, also commonly called expected goals (xG), has been a central topic
in the development of more advanced statistics using football data (Smith, 2022). Crucially,
it moves away from the idea of things that did happen and focuses on things that could have
happened. With football being such a complex and chaotic sport, outcomes often do not reflect
expectations as matches are often decided by fine margins or decisions out of the players’ and
coaches’ control. Nevertheless, using expectations gives decision-makers an idea of the underly-
ing performance of their team and allows them to see if their team is over or underperforming
according to expectations (Brechot and Flepp, 2020).
There have been many versions of xG models created since the idea was founded, using
a variety of machine-learning techniques and data sources. Herold et al. (2019) provide a
summary of many applications of machine learning in football, including xG models. The
most common methods of estimating xG in their paper are logistic regressions, decision trees,
ensemble methods (e.g. random forest), and neural networks. Lucey et al. (2015) use player
and ball tracking data from the 10 seconds leading up to a shot to estimate goal probabilities
across an entire season and found that “defender proximity, interaction of surrounding players,
speed of play, coupled with shot location play an impact on determining the likelihood of a
team scoring a goal”. Madrero Pardo (2020) uses qualitative data from the popular video game
FIFA to account for player effects on xG using a logistic regression and an XGBoost model.
They found that an adjusted model can better predict goals over a season for individual players
and teams than an overall xG model. Fairchild et al. (2018) built an xG model again using
logistic regression and used it to estimate MLS teams’ offensive efficiency in scoring. They
also discuss evaluation metrics for expected goals models and suggest the use of the Brier score
to compare predicted probability to ground-truth binary outcomes. Cavus and Biecek (2022)
apply a variety of ensemble and boosting methods to calculate xG values and find that a random
forest model performs best, even compared to models from other papers using other techniques
and data.
The closest study to this paper to date is that of Hewitt and Karakuş (2023), which inves-
tigates position and player-adjusted xG models. They find evidence of positional adjustments
with forwards having a positive adjustment, midfielders having a slightly negative adjustment,
and defenders having a large negative adjustment. Moreover, they also find evidence of player
effects on xG by fitting their model with only data from Lionel Messi and find a large positive
adjustment in this case.
One of the features a lot of these models have in common is their frequentist approach, as
opposed to using Bayesian methods. Spearman (2018) uses a Bayesian approach to estimate the
maximum a posteriori effects of parameters in a model for predicting future scoring of teams
in games. Joseph et al. (2006) used Bayesian networks and Naïve Bayes learners to predict
the results of matches played by Tottenham Hotspur and compared the results to K-nearest
neighbour and decision tree models. They reiterate one of the benefits of Bayesian modelling
which is comparably accurate predictions in the absence of a large amount of data. Zambom-
Ferraresi et al. (2018) use Bayesian methods to analyse team performance in Europe’s top
leagues to determine which features tend to be most significant in predicting team performance.
They find that the most important features include the number of assists, the number of shots
conceded, saves made by the goalkeeper, passing accuracy, and number of shots on target.
One area of Bayesian modelling which is also often not considered in football analytics is
using multi-level, or hierarchical, models. Tureen and Olthof (2022) construct a multi-level
model for player-adjusted expected goals but do not use a Bayesian approach to do so. Still,
they use their model to calculate estimated player impact values on xG. On the other hand,
Baio and Blangiardo (2010) construct a Bayesian hierarchical model but use it to predict match
results as opposed to xG directly and group their data by the team as opposed to by the player.
Still, the use of Bayesian hierarchical modelling in football is a relatively unexplored area.
By using Bayesian hierarchical modelling, group-level effects can be reliably estimated even
with small group sizes. Therefore, the effect of a player’s position or even the player themselves
on the chance of a given shot resulting in a goal can be reliably measured. The result is that
certain players could be identified as being more likely to score than others for given chances,
which is a result that can be used for player selection or scouting purposes. This idea is present
in the work of Hewitt and Karakuş (2023), where Messi is found to be an extremely efficient goal
scorer. This conclusion may be obvious to football fans, but the fact that the efficiency can be
reliably measured is extremely interesting for potentially comparing the goalscoring efficiency
of footballers. Tureen and Olthof (2022) construct a metric they refer to as “estimated player
impact”, which is another calculation of a player’s individual effect on the probability of scoring.
The estimation of a player’s impact on xG can potentially be another tool in evaluating player
performance for team selection or scouting purposes.
3 Methodology
After a general look at the literature on the xG metric and its evaluation throughout the years,
in this section we explain the details of the methodology this paper proposes to study
positional and player-related corrections to generic xG approaches. We propose the utilisation
of the Bayes formula by taking player and position information into a conditional probability
formulation which is then evaluated under Bayesian hierarchical modelling.
Before showing how Bayesian methods can be applied to xG modelling, we now show how
xG models are typically created. This involves using a frequentist approach and, in this case,
a logistic regression appears as the natural choice to obtain goal probabilities. This paper
will first attempt to build a generic xG model with comparable results to an xG model built
by StatsBomb, an industry leader in data collection and analysis. To do so, we gradually
increase the number of predictors (both given and engineered) used in the logistic regression
that generally follows the formulation given below
\[
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=0}^{N} \beta_j \cdot X_{ji} \qquad (1)
\]
where p_i is the probability of shot i resulting in a goal and X_{ji} is the value of predictor j
for shot i. Traditional logistic regression models the relationship between a binary dependent
variable (in this context Goal or No-Goal) and one or more independent variables (given and
engineered features). It uses the logistic transformation implied by (1) to turn a linear
combination of features into the probability of the dependent variable being one of the two
classes.
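To make the mapping in (1) concrete, the following minimal Python sketch evaluates the linear predictor and inverts the logit to recover a goal probability. The function names and the coefficient values are hypothetical and chosen purely for illustration; they are not estimates from this paper.

```python
import math
from typing import Sequence


def inverse_logit(eta: float) -> float:
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def goal_probability(intercept: float,
                     coefs: Sequence[float],
                     features: Sequence[float]) -> float:
    """Evaluate beta_0 + sum_j beta_j * X_ji and apply the inverse logit."""
    eta = intercept + sum(b * x for b, x in zip(coefs, features))
    return inverse_logit(eta)


# Hypothetical coefficients for (distance to goal, shot angle); illustrative only.
p = goal_probability(-1.0, [-0.1, 1.5], [12.0, 0.6])

# Applying the logit to the probability recovers the linear predictor.
log_odds = math.log(p / (1.0 - p))
```

Because the logit is monotone, a negative coefficient always lowers the predicted probability as its feature grows, which is why the sign of each coefficient can be interpreted directly.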
this paper specifies
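\[ Y_i = \text{binary outcome of shot } i \text{ (1 = goal, 0 = no goal)}, \qquad (2) \]
\[ p_i = \text{probability of shot } i \text{ resulting in a goal}, \qquad (3) \]
\[ X_{n,i} = \text{the value of predictor } n \text{ for shot } i, \qquad (4) \]
where the likelihood distribution is
\[ Y_i \sim \mathrm{Bernoulli}(p_i) \qquad (5) \]
Hence, the Baseline and Hierarchical models take the form
\[ \text{Baseline Model:} \quad \operatorname{logit}(p_i) = \beta_0 + \sum_{j=0}^{N} \beta_j \cdot X_{ji}, \qquad (6) \]
\[ \text{Hierarchical Model:} \quad \operatorname{logit}(p_{ij}) = \beta_0 + \sum_{j=0}^{N} \beta_j \cdot X_{ji} + \beta_{N+1} \cdot X_{N+1,i}, \qquad (7) \]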
3.2 Data
The data used for this project is all freely available event data from StatsBomb, obtained
using their Python package StatsBombPy (see the StatsBomb GitHub page at
https://github.com/statsbomb for details). From their database, only men’s competitions were
used, because goal probabilities may differ between men’s and women’s football and more data
is available for men’s competitions. Then, all open-play shots were
extracted with all relevant information for each shot. Set-pieces were excluded because again
goal probabilities could vary for set-pieces, and we are not interested in modelling this effect.
The resulting data has more than 60,000 shots from a variety of competitions and years,
with 42 columns of information for each shot. Tables 1 and 2 give some summary statistics for
the most relevant variables in the data.
Beyond the columns already in the data, several features that StatsBomb does not provide
could be useful for predicting goal probability. Many sources cite distance to goal and shot
angle as two of the most important predictors of goal probability.
The data has the location of the shot, and the StatsBomb data specification (Statsbomb,
2019) provides information about the coordinates of the goalposts. Distance to goal is therefore
calculated as the Euclidean distance from the shot to the centre of the goal.
For shot angle, the cosine rule is used to calculate the angle from the shooter to the two
goalposts. To calculate this reliably, any shots taken from the same x-coordinate as
the goal line are excluded, since the shooter and the two goalposts would then lie on a straight
line rather than form a triangle, leaving no shot angle.
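The two geometric features can be sketched as follows. The goalpost coordinates are an assumption based on a 120 x 80 pitch with the goal on the line x = 120 and should be checked against the StatsBomb data specification; the function names are illustrative.

```python
import math

# Assumed goalpost coordinates on a 120 x 80 pitch (verify against the data specification).
LEFT_POST = (120.0, 36.0)
RIGHT_POST = (120.0, 44.0)
GOAL_CENTRE = (120.0, 40.0)


def distance_to_goal(x: float, y: float) -> float:
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_CENTRE[0] - x, GOAL_CENTRE[1] - y)


def shot_angle(x: float, y: float):
    """Angle (radians) at the shooter between the two posts, via the cosine rule.

    Returns None for shots taken on the goal line, which are excluded
    because no triangle can be formed.
    """
    if x == GOAL_CENTRE[0]:
        return None
    a = math.hypot(LEFT_POST[0] - x, LEFT_POST[1] - y)    # shooter to left post
    b = math.hypot(RIGHT_POST[0] - x, RIGHT_POST[1] - y)  # shooter to right post
    c = math.hypot(LEFT_POST[0] - RIGHT_POST[0],
                   LEFT_POST[1] - RIGHT_POST[1])          # goal width
    cos_theta = (a * a + b * b - c * c) / (2.0 * a * b)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))
```

For a central shot at (108, 40), `distance_to_goal` returns 12 and `shot_angle` agrees with the closed form 2·atan(4/12); `math.degrees` converts the result if degrees are preferred for binning.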
Next, the freeze-frame feature in the data is utilised, which provides information about all
players other than the shooter at the moment the shot is taken, crucially including their
location on the pitch. From this information, several features are added: the goalkeeper’s
distance to the goal, whether the goalkeeper is present in the shot triangle formed by the shot
and the two goalposts, the number of players present in the shot triangle, and the number of
opponents within a 1m radius of the shooter. Opponents are used for the 1m radius as opposed to
all players because the only time a non-opponent in the radius of a shooter would impact goal
probability is when they are in the shot triangle, which is already accounted for. Otherwise,
only opponents will try to put pressure on or tackle the shooter outside of the shot triangle.
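A minimal sketch of how these freeze-frame features could be derived is given below. The simplified frame structure (a list of dictionaries with `location`, `teammate`, and `is_goalkeeper` keys), the goalpost coordinates, and the choice to count every player, goalkeeper included, in the shot triangle are assumptions made for illustration and may differ from the processing used in this paper.

```python
import math

# Assumed goalpost coordinates on a 120 x 80 pitch.
LEFT_POST = (120.0, 36.0)
RIGHT_POST = (120.0, 44.0)
GOAL_CENTRE = (120.0, 40.0)


def _cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_shot_triangle(point, shooter):
    """True if point lies inside (or on the edge of) the shooter-posts triangle."""
    d1 = _cross(shooter, LEFT_POST, point)
    d2 = _cross(LEFT_POST, RIGHT_POST, point)
    d3 = _cross(RIGHT_POST, shooter, point)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def freeze_frame_features(shooter, frame, radius=1.0):
    """Derive the four freeze-frame features for one shot."""
    features = {"gk_distance_to_goal": None, "gk_in_shot_triangle": False,
                "players_in_shot_triangle": 0, "opponents_in_radius": 0}
    for player in frame:
        loc = player["location"]
        inside = in_shot_triangle(loc, shooter)
        if inside:
            features["players_in_shot_triangle"] += 1
        if player["is_goalkeeper"] and not player["teammate"]:
            features["gk_distance_to_goal"] = math.dist(loc, GOAL_CENTRE)
            features["gk_in_shot_triangle"] = inside
        if not player["teammate"] and math.dist(loc, shooter) <= radius:
            features["opponents_in_radius"] += 1
    return features


# Hypothetical frame: opposing goalkeeper, one close defender, one teammate.
example = freeze_frame_features(
    (108.0, 40.0),
    [{"location": (118.0, 40.0), "teammate": False, "is_goalkeeper": True},
     {"location": (108.5, 40.5), "teammate": False, "is_goalkeeper": False},
     {"location": (114.0, 40.0), "teammate": True, "is_goalkeeper": False}],
)
```

The sign test on the three cross products avoids any trigonometry: a point is inside the triangle exactly when it lies on the same side of all three edges.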
Table 1: Summary statistics for the most relevant features for xG in the dataset. N(x): y means
that the variable takes the value x in y samples.
Variable Description Summary (2 d.p.)
Then, each player’s modal (most frequent) general position is added. These positions are
grouped into strikers, attacking midfielders, non-attacking midfielders, and defenders, and we
expect goal probability to fall on average from each group to the next, in that order. Wingers
are included as attacking midfielders, while wide midfielders and central midfielders are
non-attacking midfielders. Wing-backs are classed as defenders, and so are the few attempts by
goalkeepers. The remaining assignments of positions to groups are self-explanatory.
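The grouping step can be illustrated with a short sketch. The raw position labels in the mapping are hypothetical examples rather than the full list in the data, and taking the mode after grouping (rather than before) is an assumption about the order of operations.

```python
from collections import Counter

# Hypothetical mapping from raw position labels to the four general positions.
POSITION_GROUPS = {
    "Center Forward": "ST",
    "Left Wing": "AM",
    "Right Wing": "AM",
    "Center Attacking Midfield": "AM",
    "Left Midfield": "M",
    "Center Midfield": "M",
    "Left Wing Back": "D",
    "Center Back": "D",
    "Goalkeeper": "D",
}


def modal_general_position(raw_positions):
    """Return the most frequent general position across a player's shots.

    Ties are broken in favour of the group that appears first in the input.
    """
    groups = [POSITION_GROUPS[p] for p in raw_positions]
    return Counter(groups).most_common(1)[0][0]
```

For instance, a player recorded as "Left Wing", "Center Forward", and "Center Attacking Midfield" across three shots would be assigned to the attacking midfielder group.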
Table 2: (continued) Summary statistics for the most relevant features for xG in the dataset.
N(x): y means that the variable takes the value x in y samples.
Variable             Description                                             Summary (2 d.p.)
shot technique       The technique the shooter used.                         N(Normal): 47,854; N(Overhead Kick): 385; N(Half Volley): 9,371; N(Diving Header): 284; N(Volley): 4,483; N(Backheel): 244; N(Lob): 688
under pressure       Whether the shooter was under pressure when shooting.   N(True): 16,149; N(False): 47,160
goal                 Whether the shot resulted in a goal.                    N(True): 6,559; N(False): 56,750
shot statsbomb xg    StatsBomb’s own estimated xG value for each shot.       N: 63,309; Min: 0.00; Mean: 0.10; Max: 1.00; SD: 0.13
Finally, we modify the body part used by changing “left foot” and “right foot” to “preferred
foot” and “other foot” according to the player’s apparent preference. To assign these, we go
back through the open data and look at each player’s passes rather than their shots, and assign
whichever foot the player made the most passes with as their preferred foot. Players often
have more time to choose which foot to pass with and will then tend to go safely with their
preferred foot, which is less feasible when taking a shot, as players tend to be under more
pressure and have less time.
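The preferred-foot relabelling described above can be sketched as below. The body-part labels ("Left Foot", "Right Foot") and the function names are assumptions for illustration; non-foot body parts such as headers are passed through unchanged.

```python
from collections import Counter

FEET = ("Left Foot", "Right Foot")


def preferred_foot(pass_body_parts):
    """Return the foot used for most of a player's passes, or None if unknown."""
    counts = Counter(part for part in pass_body_parts if part in FEET)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def relabel_body_part(shot_body_part, preferred):
    """Replace left/right foot with preferred/other foot for one shot."""
    if shot_body_part not in FEET or preferred is None:
        return shot_body_part
    return "preferred foot" if shot_body_part == preferred else "other foot"


# Hypothetical passing record of a mostly left-footed player.
foot = preferred_foot(["Left Foot"] * 40 + ["Right Foot"] * 9 + ["Head"] * 3)
```

With this record, a right-footed shot by the same player would be relabelled as "other foot", while a header keeps its original label.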
It is logical to infer that, under reasonable assumptions, the likelihood of scoring decreases,
on average, as the distance to the goal increases. This implies that the coefficient β1 is likely
to be negative. Conversely, β2 is expected to be positive, reflecting the observation that as the
shot angle decreases, the shot’s position is likely to be farther away or from a wider position,
both scenarios leading to a lower goal probability. Figure 1 illustrates the associations between
shot angle, distance, and the proportion of goals scored to shots, providing a clearer insight
into these relationships.
Figure 1: Relationships between shot angle (binned in the 20s)/distance to goal (binned in 10s)
and the proportion of goals from shots.
Extended Model: While the primary factors considered for calculating the goal probability
are typically assumed to be the distance and angle of the shot, as used in the Baseline Model,
it is important to acknowledge that various other elements can influence the shot’s outcome, as
the literature shows. Section 3.2 has previously addressed each of these factors, and
we now introduce our frequentist Extended model, outlined below
\[
\begin{aligned}
\operatorname{logit}(p_i) = \beta_0 &+ \beta_1 \cdot \text{distance to goal}_i + \beta_2 \cdot \text{shot angle}_i + \beta_3 \cdot (\text{distance}_i \cdot \text{angle}_i) \\
&+ \beta_4 \cdot \text{gk distance to goal}_i + \beta_5 \cdot \text{players in shot triangle}_i \\
&+ \beta_6 \cdot \text{body part}_i + \beta_7 \cdot \text{first time shot}_i + \beta_8 \cdot \text{gk in shot triangle}_i \\
&+ \beta_9 \cdot \text{one on one shot}_i + \beta_{10} \cdot \text{open goal}_i + \beta_{11} \cdot \text{technique}_i \\
&+ \beta_{12} \cdot \text{under pressure}_i
\end{aligned}
\qquad (9)
\]
3.4.1 Model Definitions
The foundational Bayesian logistic regression model extends the traditional logistic regression
equation by adding a grouping effect
\[
\operatorname{logit}(p_{ik}) = \beta_{0k} + \sum_{j=0}^{N} \beta_{jk} \cdot X_{ji} + \underbrace{\beta_{N+1,k} \cdot X_{N+1,i}}_{\text{Grouping effect}}, \qquad (10)
\]
where k = 1, ..., K and K is the number of elements in the group. In this formulation, the
logistic regression coefficients β_j ∈ R^N are extended to group-specific coefficients
β_{jk} ∈ R^{K×(N+1)}.
Specifically, β_{0k} is a group-specific intercept for the k-th element of the group, which
accounts for variations in the baseline success probability across different groups. On the
other hand, β_{{1,...,N}k} refer to the group-specific slopes, which capture variations in the
effect of the covariates across groups.
Following the technical details above, for this paper, we decided to define three versions of
Bayes-xG models which can be expressed as
• Bayes-xG 1 → uses the Baseline model with grouping parameter of position,
• Bayes-xG 2 → uses the Extended model with grouping parameter of position,
• Bayes-xG 3 → uses the Extended model with grouping parameter of player.
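As an illustration, Bayes-xG 1 can be written out in full by combining the Baseline predictors with the priors listed in Table 3. The sketch below assumes that the grouping enters only as a position-specific additive effect on the log-odds scale; the symbol u_k for that effect is notation introduced here and does not appear elsewhere in the paper.

```latex
\begin{align*}
Y_i &\sim \mathrm{Bernoulli}(p_{ik}), \\
\operatorname{logit}(p_{ik}) &= \beta_0 + \beta_1 \cdot \text{distance to goal}_i
  + \beta_2 \cdot \text{shot angle}_i
  + \beta_3 \cdot (\text{distance}_i \cdot \text{angle}_i) + u_k, \\
\beta_0,\ \beta_3 &\sim \mathcal{N}(\mu = 0,\ \sigma = 5), \\
\beta_1 &\sim \mathcal{SN}(\mu = -1,\ \sigma = 5,\ \alpha = -1), \qquad
\beta_2 \sim \mathcal{SN}(\mu = 1,\ \sigma = 5,\ \alpha = 1), \\
u_k &\sim \mathcal{SN}(\mu = 0,\ \sigma,\ \alpha_k), \qquad
\alpha_k \in \{2, 1, 0, -2\} \text{ for } k \in \{\text{ST}, \text{AM}, \text{M}, \text{D}\}, \\
\sigma &\sim \mathcal{HN}(\gamma = 5).
\end{align*}
```

Sharing the scale parameter σ across positions is what makes the model hierarchical: groups with few shots borrow strength from the others through the common hyperprior.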
Table 3: Predictors used in each of the Bayesian models and the prior distribution given to
the coefficient of each predictor (N: Normal distribution, SN: Skew-Normal distribution,
HN: Half-Normal distribution).
Predictor                    Bayes-xG 1   Bayes-xG 2   Bayes-xG 3   Prior
Intercept                    ✓            ✓            ✓            N(µ = 0, σ = 5)
distance to goal             ✓            ✓            ✓            SN(µ = −1, σ = 5, α = −1)
shot angle                   ✓            ✓            ✓            SN(µ = 1, σ = 5, α = 1)
distance angle interaction   ✓            ✓            ✓            N(µ = 0, σ = 5)
gk distance to goal                       ✓            ✓            N(µ = 0, σ = 5)
players in shot triangle                  ✓            ✓            SN(µ = −1, σ = 5, α = {5, 4, ..., −5}), where α is determined by the value of this feature (0 players = 5, 1 player = 4, etc.)
opponents in radius                       ✓            ✓            SN(µ = −1, σ = 5, α = {1, 0, ..., −2}), where α is determined by the value of this feature (0 players = 1, 1 player = 0, etc.)
shot body part                            ✓            ✓            N(µ = 0, σ = 5)
shot first time                           ✓            ✓            N(µ = 0, σ = 5)
gk in shot triangle                       ✓            ✓            SN(µ = 0, σ = 5, α = −2)
shot one on one                           ✓            ✓            SN(µ = 0, σ = 5, α = 2)
shot open goal                            ✓            ✓            SN(µ = 0, σ = 5, α = 4)
shot technique                            ✓            ✓            N(µ = 0, σ = 5)
under pressure                            ✓            ✓            SN(µ = 0, σ = 5, α = −2)
general position             ✓            ✓                         SN(µ = 0, σ, α = {2, 1, 0, −2}), where the scale parameter σ ∼ HN(γ = 5) and α is assigned to {ST, AM, M, D}, respectively
player                                                 ✓            SN(µ = 0, σ, α = {2, 0}), where the scale parameter σ ∼ HN(γ = 5) and α is assigned depending on prior beliefs about a player ({2: good finisher, 0: not good finisher})
• gk in shot triangle (α = −2): If the goalkeeper is positioned within the shot triangle,
their chances of successfully saving a shot are higher compared to when they are located
outside of the shot triangle.
• shot one on one (α = 2): When a player finds themselves in a one-on-one situation with
the goalkeeper, their sole task is to outplay the goalkeeper with their shot, without the
need to navigate or consider other players. This circumstance makes scoring compara-
tively more straightforward.
• shot open goal (α = 4): Similar to the previous feature, in this case, there is no goalkeeper
present. Consequently, the shooter’s sole objective is to direct the shot accurately towards
the target, making the likelihood of scoring very high.
• under pressure (α = −2): When a player is under pressure, their ability to concentrate
on accurate and well-targeted shooting diminishes, resulting in a decreased likelihood of
scoring on average.
• player (α = {2, 0}): If a player is anticipated to excel in finishing skills based on their
name and reputation, they are given a value of 2 for α; otherwise, a value of 0 is assigned.
• A standard value of σ = 5 was chosen for the priors to strike a balance, ensuring sufficient
variability. This choice aims to prevent the priors from becoming overly narrow in case
the underlying prior knowledge is incorrect. At the same time, it avoids excessively wide
priors that could slow convergence and require many additional rounds of sampling.
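To show what the shape parameter α does in the priors above, the following sketch evaluates a skew-normal density and numerically integrates the prior mass lying above the location µ. It assumes the standard location-scale-shape parametrisation of the skew-normal, which may differ from the convention of the sampling library used in this paper; the function names are illustrative.

```python
import math


def skew_normal_pdf(x, mu=0.0, sigma=5.0, alpha=0.0):
    """Density 2/sigma * phi(z) * Phi(alpha * z) with z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 / sigma * phi * big_phi


def mass_above_location(mu, sigma, alpha, width=10.0, steps=20000):
    """Trapezoid-rule approximation of P(X > mu), integrating out to mu + width * sigma."""
    upper = mu + width * sigma
    h = (upper - mu) / steps
    total = 0.5 * (skew_normal_pdf(mu, mu, sigma, alpha)
                   + skew_normal_pdf(upper, mu, sigma, alpha))
    for i in range(1, steps):
        total += skew_normal_pdf(mu + i * h, mu, sigma, alpha)
    return total * h


# Prior mass above the location for three of the shape values used in Table 3.
mass_pos = mass_above_location(0.0, 5.0, 2.0)
mass_sym = mass_above_location(0.0, 5.0, 0.0)
mass_neg = mass_above_location(0.0, 5.0, -2.0)
```

Under this parametrisation the mass above µ equals 1/2 + arctan(α)/π, so α = 2 places roughly 85% of the prior mass on positive effects while still allowing negative ones, and α = 0 recovers a symmetric normal prior.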
4 Experimental Analysis
The experimental analysis of this paper covers five cases:
1. Frequentist model comparison to the benchmark model for the whole data set (60K+ shots).
2. Bayes-xG model-based “positional” analysis and comparisons for the English Premier League
data set (10K+ shots).
3. Bayes-xG model-based “player-specific” analysis and comparisons for the English Premier
League data set (10K+ shots).
4. Extension of the developed Bayes-xG model evaluations to different countries, e.g. Spain
(La Liga, 19K shots) and Germany (Bundesliga, 7.5K shots).
5. Investigation of the effect of the choice of priors on Bayes-xG model outputs.
Figure 2: Distributions of Predictions from Frequentist xG Models
Figure 3: Model fitting performance when increasing the number of features.
Table 4: Outcomes from non-Bayesian models, including the Baseline xG model incorporating
distance to the goal, shot angle, and their interaction, and the Extended xG model introducing
additional features, in comparison to the StatsBomb xG model.
Baseline xG Extended xG Statsbomb xG
RMSE 0.095 0.055 -
MAE 0.058 0.029 -
R2 Score 0.428 0.826 -
Brier Score 0.086 0.076 0.075
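The metrics in Table 4 can be computed as sketched below. The sketch assumes that the Brier score is measured against the binary shot outcomes, while RMSE, MAE, and R2 are measured against the StatsBomb xG values as the reference, which is what the empty StatsBomb entries in the table suggest; the input values are hypothetical.

```python
import math


def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)


def rmse(predictions, reference):
    """Root mean squared error against a reference set of xG values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predictions, reference))
                     / len(predictions))


def mae(predictions, reference):
    """Mean absolute error against a reference set of xG values."""
    return sum(abs(p - r) for p, r in zip(predictions, reference)) / len(predictions)


def r2_score(predictions, reference):
    """Coefficient of determination, treating the reference xG as ground truth."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for p, r in zip(predictions, reference))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for three shots.
model_xg = [0.1, 0.4, 0.8]
statsbomb_xg = [0.2, 0.3, 0.7]
goals = [0, 0, 1]

scores = {
    "brier": brier_score(model_xg, goals),
    "rmse": rmse(model_xg, statsbomb_xg),
    "mae": mae(model_xg, statsbomb_xg),
    "r2": r2_score(model_xg, statsbomb_xg),
}
```

Lower is better for the Brier score, RMSE, and MAE, whereas an R2 closer to 1 indicates closer agreement with the reference model.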
In the final analysis of the initial set of experiments, we explored the impact of incorporating
engineered advanced features on the performance of the frequentist logistic regression model.
Figure 3 illustrates the trends in performance evaluation metrics—Brier score, R2, MAE, and
RMSE—relative to the number of features integrated into the model. We initiated the analysis
with a single-parameter model, utilizing only the distance to the goal, and systematically added
features one by one. Typically, there are 16 model parameters (as detailed in Table 3), but this
count increases to a maximum of 33 after one-hot encoding categorical features. Examining
Figure 3 reveals a substantial influence on model performance resulting from the introduction
of these created parameters. Notably, there are certain plateaus in the trends, particularly
associated with one-hot encoded parameters representing categories with a minimal number of
samples in the dataset (e.g., 10 players in the shot triangle).
Figure 4: Distributions of xG Adjustments by Position of Bayes-xG1 .
Figure 5: Normalized Heatmap of Shot Locations by General Position.
Table 5: Mean xG adjustment for each general position from Bayes-xG1 versus theoretical
adjustment of Baseline xG prediction using Bayes’ Theorem.
Position Mean Model Adjustment Mean Theoretical Adjustment
ST 0.009 0.010
AM 0.019 0.020
M -0.006 -0.005
D -0.042 -0.044
rectly in front of the goal, providing a potential explanation for the observed reversals in Figure
4. Contrary to the theorised expectation from the analysis in Figure 4, attacking midfielders
are not inclined to take shots from distant or challenging angles. In fact, on average, strikers
exhibit a higher tendency for such shots. This discovery, coupled with the observation that
attacking midfielders, on average, have larger positive xG adjustments than strikers, suggests
that attacking midfielders may have a superior ability, on average, to convert high xG chances
situated right in front of the goal compared to their striking counterparts.
Before delving into the more intricate model analysis for Bayes-xG2 , we aim to showcase a
validation step to demonstrate the accuracy of the MCMC-based sampling technique employed
in developing Bayesian models in this paper. To achieve this, we replicated the Bayesian
analysis, this time utilising Bayes’ Formula
\[
P(\text{goal} \mid \text{position}_i) = \frac{P(\text{position}_i \mid \text{goal}) \cdot P(\text{goal})}{P(\text{position}_i)}. \qquad (11)
\]
This allowed us to conduct an analysis where the results of the baseline model could be adjusted
using Bayes’ Theorem, and these adjusted outcomes were then compared to the results of
the hierarchical model. The comparison aimed to assess the proximity between theoretical
adjustments and model adjustments. The outcomes of this process are detailed in Table 5.
Notably, the mean model adjustments closely align with the theoretical adjustments for each
position, affirming that the model has effectively estimated the positional impact.
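A worked example of (11) with empirical frequencies is sketched below. The shot and goal counts are hypothetical, and expressing the adjustment as the difference between the position-conditional and overall goal rates is one plausible reading of the theoretical adjustment, not necessarily the exact procedure behind Table 5.

```python
def goal_probability_given_position(shots, goals, position):
    """Apply Bayes' formula (11) using empirical frequencies.

    shots and goals map each general position to its shot and goal counts.
    """
    total_shots = sum(shots.values())
    total_goals = sum(goals.values())
    p_goal = total_goals / total_shots
    p_position = shots[position] / total_shots
    p_position_given_goal = goals[position] / total_goals
    return p_position_given_goal * p_goal / p_position


# Hypothetical counts per general position.
SHOTS = {"ST": 400, "AM": 300, "M": 200, "D": 100}
GOALS = {"ST": 52, "AM": 36, "M": 16, "D": 6}

overall_rate = sum(GOALS.values()) / sum(SHOTS.values())
adjustments = {
    pos: goal_probability_given_position(SHOTS, GOALS, pos) - overall_rate
    for pos in SHOTS
}
```

With empirical frequencies the formula reduces to the position's own conversion rate (for example 52/400 for strikers here), which provides a simple sanity check for the hierarchical model's adjustments.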
We proceed with our analysis by delving into the upgraded iteration of the Baseline model,
referred to as the Extended model. Employing Bayes-xG2 , this advanced model involves a posi-
tional analysis akin to its predecessor, Bayes-xG1 . However, it incorporates numerous additional
predictors, including factors such as opponents in radius and gk distance to goal, aiming to en-
Figure 6: Distributions of xG adjustments for Bayes-xG2 , where adjustment is hierarchical
model prediction minus baseline model prediction - grouped by general position
Figure 7: Comparison of point estimates for xG adjustments against distance to goal and
shot angle between Bayes-xG1 and Bayes-xG2 , grouped by general position. Adjustments are
hierarchical model prediction minus baseline model prediction.
the compensation of positional effects through additional predictors is evident. Despite a clear
distinction in the effects of distance to the goal and shot angle for each position in Bayes-xG1 ,
no significant differences between positions are observed in Bayes-xG2 . The diminishing impact
of position on xG adjustment is attributed to the diverse player abilities and roles within each
position category. A distinct observation evident in both Figure 7-(c) and (d) is that defenders
exhibit consistent trends for all angles just below 0, whereas attacking midfielders mirror the
same pattern but above the 0 line. Adjustments made to strikers’ and midfielders’ xG values
are barely discernible, with values hovering around 0 across all shot angles.
To further explore the aforementioned phenomena, the next experimental case involves a
player-specific analysis with Bayes-xG3 , grouping data based on the player taking the shot
rather than their general position.
Table 6: Selected players for Bayes-xG3 and their goal-scoring statistics in the data set.
Player Shots Goals Conversion Rate
Robert Pirès 56 14 25.00%
Sergio Agüero 112 20 17.90%
Jamie Vardy 111 19 17.10%
Philippe Coutinho 105 8 7.60%
Ross Barkley 82 6 7.30%
Jonjo Shelvey 51 0 0.00%
specific adjustments. Unlike the previous experimental case that centred on positional analysis,
conducting a player-specific analysis poses increased complexity due to the considerably larger
pool of candidates within the group, making the analysis more challenging and computationally
intensive. Instead of individually representing each player in our group, a selective approach
is employed, categorising the majority as “other” and opting for a few players with expected
positive or negative xG adjustments. Player selection is based on the “conversion rate”, i.e., the
percentage of shots scored. To ensure relevance, only players with a minimum of 50 shots are
considered, and the chosen players, along with their statistics, are detailed in Table 6. Players
like R. Pirès, S. Agüero, and J. Vardy, recognized for their prolific goal-scoring, are expected
to have positive xG adjustments. Pirès, notably, exhibits an exceptional conversion rate in the
data subset. Conversely, players like P. Coutinho and R. Barkley, with below-average conversion
rates, might have slight negative xG adjustments, while J. Shelvey, who failed to convert any
of his 51 shots in the data, is likely to have a more substantial negative xG adjustment.
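The selection rule can be sketched as follows, using the shot and goal counts from Table 6. The final entry is a hypothetical low-volume shooter added only to show the effect of the 50-shot threshold, and the function name is illustrative.

```python
# (shots, goals) per player, taken from Table 6.
PLAYERS = {
    "Robert Pirès": (56, 14),
    "Sergio Agüero": (112, 20),
    "Jamie Vardy": (111, 19),
    "Philippe Coutinho": (105, 8),
    "Ross Barkley": (82, 6),
    "Jonjo Shelvey": (51, 0),
    "Hypothetical Player": (12, 5),  # not in the data; illustrates the threshold
}


def conversion_rates(players, min_shots=50):
    """Goals per shot for every player with at least min_shots shots."""
    return {
        name: goals / shots
        for name, (shots, goals) in players.items()
        if shots >= min_shots
    }


rates = conversion_rates(PLAYERS)
# Players ordered from highest to lowest conversion rate.
ranked = sorted(rates, key=rates.get, reverse=True)
```

The threshold removes the hypothetical player despite a high raw conversion rate, since a rate based on only a dozen shots is too noisy to suggest a genuine finishing effect.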
As explained in the preceding methodology section, the impact of each player will be char-
acterised by a prior distribution, specifically a skewed normal distribution. The choice of
distribution parameters is dependent upon the prior beliefs regarding a player’s proficiency as
a goal scorer, with the parameter α taking values of either 2 or 0. This selection is guided by
qualitative beliefs about the players rather than direct utilisation of the data for informing the
priors. Players such as Pirès, Agüero, Vardy, and Coutinho, acknowledged as talented attack-
ing players, are attributed α = 2. In contrast, Barkley and Shelvey, who are not commonly
associated with being top-tier attackers but possess other defining qualities in their game, are
assigned α = 0.
Figure 8 illustrates the distributions of xG adjustments for individual players and the collec-
tive ”other” players group. A prominent observation is the substantial positive xG adjustments
for Robert Pirès, some reaching as high as 0.3 above the baseline xG. Notably, these adjust-
ments persist even after incorporating additional predictors in Bayes-xG2 that were intended to
eliminate group effects in the previous experimental analysis. Pirès also exhibits a wide spread
of adjustments, with some close to 0, indicating a diverse array of shot types. Some were high
xG chances, requiring minimal adjustment, while others were more challenging but consistently
converted by Pirès, resulting in significant positive adjustments. Agüero displays consistently
positive xG adjustments, albeit smaller on average and with a narrower spread compared to
Pirès. Intriguingly, Vardy and Coutinho exhibit minimal positive xG adjustments, not signifi-
cantly greater than those of Barkley. Shelvey aligns with expectations, displaying substantial
negative xG adjustments based on his conversion rate in the data. Lastly, the "other" group
centres around 0 for xG adjustment, as anticipated, given its diverse player composition with
no discernible group effect to capture.
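The spread of adjustments for a single player can be understood through the logistic link. The sketch below assumes, for illustration only, that the player effect acts as a constant offset on the logit scale and that the hierarchical prediction equals the baseline prediction shifted by that offset; the offset value is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def xg_adjustment(baseline_xg, player_offset):
    """Adjustment = player-adjusted prediction minus baseline prediction."""
    adjusted_xg = sigmoid(logit(baseline_xg) + player_offset)
    return adjusted_xg - baseline_xg


PLAYER_OFFSET = 1.0  # hypothetical positive player effect on the logit scale
adjustments = {p: xg_adjustment(p, PLAYER_OFFSET) for p in (0.05, 0.30, 0.95)}
for baseline, adj in adjustments.items():
    print(f"baseline xG {baseline:.2f} -> adjustment {adj:+.3f}")
```

Under this assumption the same player effect produces only a small adjustment for a near-certain chance and a much larger one for a mid-range chance, which is consistent with one player showing adjustments ranging from near 0 to large positive values.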
Figure 8: Distributions of xG adjustments for Bayes-xG3, where adjustment is hierarchical
model prediction minus baseline model prediction, grouped by player.
Figure 9 displays the shot locations and outcomes for the selected players, offering insights
into the findings presented in Figure 8. Beginning with Pirès, notable for his substantial positive
xG adjustments, the observation centres on his efficiency in goal scoring despite a relatively
low number of shots. His ability to score from challenging positions, such as both corners
of the box and outside the area, contributes to the positive xG adjustments, indicating his
prowess as a goal scorer even in demanding scenarios. Agüero and Vardy exhibit similar shot
patterns, but the model assigns significantly higher positive xG adjustments to Agüero. This
discrepancy may stem from the nature of Vardy’s shots being inherently high xG chances, like
one-on-one opportunities, whereas Agüero manages to convert more challenging shots, resulting
in larger adjustments. Comparing Vardy with Coutinho and Barkley, who exhibit similar
xG adjustments in Figure 8, suggests that their goal-scoring patterns align with baseline xG
values without substantial player adjustments. Lastly, Shelvey’s shot map lacks goals from
various positions. While difficult-to-score shots receive minor adjustments, centrally located
missed chances likely contribute to the notable negative adjustments, reflecting Shelvey’s poor
conversion rates in this dataset.
Figure 10 displays the cumulative data for goals scored, baseline expected goals (xG) from
the single-level model, and adjusted xG from the player-corrected model for the selected players
in this analysis. The visual representation illustrates that, in comparison to the single-level
model, the player-corrected model provides more accurate estimates of total goals scored by
each player. Notably, players like Pirès and Agüero, who outperformed their baseline xG by
scoring difficult chances, exhibit adjusted xG totals much closer to their actual goals scored.
Conversely, Shelvey’s adjusted xG total is more aligned with the zero goals he scored, although
it is crucial to emphasise that it is not precisely zero.
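The cumulative comparison amounts to summing per-shot values for each player. The following sketch uses hypothetical per-shot xG values (not model output) for a player with 51 shots and no goals, to show why an adjusted total can move towards zero without reaching it.

```python
def cumulative_totals(goals, baseline_xg, adjusted_xg):
    """Sum per-shot outcomes and predictions into season-level totals."""
    return {
        "goals": sum(goals),
        "baseline_xg": sum(baseline_xg),
        "adjusted_xg": sum(adjusted_xg),
    }


n_shots = 51
toy = cumulative_totals(
    goals=[0] * n_shots,
    baseline_xg=[0.08] * n_shots,  # hypothetical single-level predictions
    adjusted_xg=[0.03] * n_shots,  # hypothetical player-corrected predictions
)
print(toy)
```

Because every per-shot prediction is a probability strictly greater than zero, the adjusted total stays positive even when the player scored no goals.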
Figure 9: Selected player shot locations and goals.
Figure 10: Comparison of Baseline, Bayes-xG3 hierarchical predictions, and actual goals scored
for selected players.
fourth experimental case, a parallel analysis was conducted using data from Spain’s La Liga
and Germany’s Bundesliga. The datasets for these leagues encompass approximately 19,000
and 7,500 shots, respectively. The La Liga dataset is heavily influenced by
Barcelona, primarily because StatsBomb predominantly released data from games
Figure 11: Distributions of xG adjustments for Bayes-xG1 and Bayes-xG2 for the Spanish La Liga
and the German Bundesliga.
Table 7: Selected players for Bayes-xG3 and their scoring statistics from La Liga data set.
Player            Shots  Goals  Conversion Rate
Gareth Bale          89     20  22.50%
Lionel Messi       1862    375  20.10%
Samuel Eto’o        295     62  21%
Bebé                 74      2  2.70%
Rafael Márquez       53      2  3.80%
Andrés Iniesta      362     25  6.90%
Table 8: Selected players for Bayes-xG3 and their scoring statistics from Bundesliga data set.
Player                     Shots  Goals  Conversion Rate
Javier Hernández              63     16  25.40%
Pierre-Emerick Aubameyang    107     22  20.60%
Robert Lewandowski           147     28  19%
Pascal Groß                   56      1  1.80%
Hakan Çalhanoğlu              51      1  2%
Timo Werner                   64      6  9.40%
Figure 12: Distributions of xG adjustments for Bayes-xG3 for the Spanish La Liga and the German
Bundesliga
Figure 13 illustrates the shot locations of Bundesliga players, confirming that the majority
of Aubameyang’s shots originate from within the penalty area. This observation implies that
these shots likely possess additional characteristics, such as one-on-one opportunities, making
them high expected goals (xG) chances. In contrast, Timo Werner scores few goals
despite a comparable shot map, and the substantial negative xG adjustments suggest that he
should have scored more from these shot positions. Çalhanoğlu, on the other hand, records
relatively few goals from shots that are more difficult because of their distance from the goal,
resulting in slightly smaller negative xG adjustments.
Figure 13: Selected player shot locations and goals for the German Bundesliga
In conclusion, the findings from both La Liga and the Bundesliga verify the results obtained
in the Premier League sections above. Notably, there is an indication of a positional impact
on xG in a basic xG model (Bayes-xG1), but such effects markedly diminish with the
adoption of a more detailed model (Bayes-xG2). Nevertheless, even with the extended model,
there remains evidence of player-specific effects on xG, providing a quantitative measure of how
certain players excel or lag behind others in scoring.
Table 9: Choices of prior distributions for analysing the impact of prior choice on results.
Prior cases (columns): 2-Wide Uniform, 3-Tight Uniform, 4-Wide Normal, 5-Tight Normal, 6-Ill-suited Prior.
Predictor                 Per-predictor prior entry
Intercept                 SN(0, 0.25, 2)
distance to goal          SN(0, 0.25, 2)
shot angle                N(0, 0.25)
distance*angle            SN(0, 0.25, −2)
gk distance to goal       N(0, 0.25)
players in shot triangle  SN(0, 0.25, {−5, . . . , 5})
opponents in radius       SN(0, 0.25, {1, . . . , −2})
shot body part            N(0, 0.25)
[The entries shared across predictors in the remaining columns were typeset as rotated text and are not recoverable.]
and tight uniform distribution pair. Similarly, for normal priors, zero mean priors are chosen
with two different σ values to represent a wide and a tight value support. On the other hand,
the ill-suited priors have been given very narrow distributions by using a small value for σ.
Moreover, some of the skews in the distributions have been flipped such that the prior belief
about the effect is the reverse of what was actually used.
Furthermore, the predictions generated by these models will be juxtaposed with those of the
extended non-Bayesian model on the same data, providing a baseline for assessing the efficacy
of the selected priors in yielding accurate predictions.
Figure 14 illustrates the distributions of predictions generated by each of the prior models.
The model utilising wide uniform prior distributions exhibits notably poor performance when
compared to both the non-Bayesian baseline model and the StatsBomb benchmark, displaying
a considerable spread of predictions. On the other hand, employing tight uniform priors leads
to more restricted predictions, although the average expected goals (xG) predictions tend to be
relatively smaller. The utilisation of a tight normal prior yields similar performance, primarily
underestimating xG values, particularly with the highest xG values concentrated around 0.8.
In contrast, adopting wide normal priors results in enhanced performance compared to the
tight setting, with predictions following a similar trend to the existing prior configuration. It
is noteworthy that the mean and interquartile ranges of both normal priors depicted in Figure
14 exhibit a favourable correspondence with the baseline and benchmark models.
Figure 14: Distributions of xG predictions for each of the extended single-level Bayesian models,
with different choices of prior distributions.
On the other hand, as depicted in Figure 14, ill-suited priors result in a notably narrow
spread, with few xG predictions exceeding 0.5. Despite this, the ill-suited priors exhibit
improved performance compared to uniform priors in terms of aligning the average with the
baseline and benchmark predictions and maintaining a similar-sized interquartile range. This
improved performance is likely attributable to the greater number of samples utilised for parameter
estimation. In the case of the model with uniform priors, the same number of samples,
however, prevented the model from converging to optimal parameter values, resulting in poor
predictions. While, given more samples, this model could eventually yield accurate results, the
computational time required is uncertain and could be extensive. The model with the exist-
ing prior distributions outperforms the others significantly, closely resembling the distribution
of baseline and benchmark models’ predictions while also offering valuable insights based on
player positions.
In this section’s final analysis, we explore the mean signed deviation (MSD) values between
each prior case and the predictions of the Non-Bayesian extended model. The selection of MSD
aims to emphasise instances of over- or under-prediction based on the prior choice, using the
mean spread of MSD values as a performance metric. Figure 15 illustrates the distributions
of MSD values for each prior group through various box plots. Notably, Figure 15 highlights
the strong performance of the Wide Normal prior choice, exhibiting results akin to the
current prior approaches. Its interquartile range closely aligns with the current prior case, albeit
with a few more instances of over-predicted outliers. It is crucial to observe that both uniform
priors’ MSD values are distributed around zero, albeit with a broader spread. In contrast, the
Tight-Normal and Ill-suited prior cases exhibit notably poor performance, marked by a higher
frequency of overestimated xG values compared to the Non-Bayesian extended model.
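The motivation for a signed rather than an absolute deviation can be illustrated with a short sketch. The sign convention (prior-case prediction minus non-Bayesian reference prediction, so that positive values indicate over-prediction) is an assumption, and all prediction values below are hypothetical.

```python
def mean_signed_deviation(predicted, reference):
    """Average of (predicted - reference); positive means over-prediction."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(p - r for p, r in zip(predicted, reference)) / len(predicted)


def mean_absolute_deviation(predicted, reference):
    """Average of |predicted - reference|; the direction of the error is lost."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


reference = [0.10, 0.30, 0.50, 0.70]    # hypothetical non-Bayesian predictions
over = [0.15, 0.35, 0.55, 0.75]         # consistently 0.05 too high
balanced = [0.15, 0.25, 0.55, 0.65]     # errors of equal size in both directions

print(mean_signed_deviation(over, reference), mean_absolute_deviation(over, reference))
print(mean_signed_deviation(balanced, reference), mean_absolute_deviation(balanced, reference))
```

The two toy cases have the same absolute deviation, but only the signed measure separates systematic over-prediction from errors that cancel out.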
Figure 15: Mean signed deviation distributions for each analysed prior choice.
To reach the objective mentioned above, this study has developed several Bayesian models to
evaluate the influence of a player’s position and individual player effects on xG predictions.
Initially, a basic xG model (Baseline xG), incorporating only distance to the goal, shot angle, and
their interaction, indicated positional effects on xG (Bayes-xG1). Strikers and attacking
midfielders exhibited positive xG adjustments, midfielders displayed minimal adjustments, while
defenders had notably negative xG adjustments on average. However, the introduction of
additional predictors in the models diminished the positional effects to the extent that they became
almost insignificant (Extended xG), suggesting that player position had minimal impact on xG
when considering more shot-related factors (Bayes-xG2). Subsequently, player effects were
explored using the extended model employed for the second positional-effects model (Bayes-xG3),
grouping the data by the player taking the shot rather than by the shooter’s position. The model
was illustrated using six players from each of three datasets drawn from the European Top 5 leagues,
revealing significant player effects on xG even when controlling for various shot-related factors.
These effects were diverse in direction, notably positive for R. Pirès (as well as for G. Bale and
J. Hernández) and negative for J. Shelvey (as well as for A. Iniesta and T. Werner).
The indication that there exist player-specific effects in determining goal probability could
prove beneficial in football scouting and player selection. By computing adjusted xG values
for various players and comparing these adjusted values to their non-adjusted counterparts,
as demonstrated in this analysis, it becomes possible to distinguish players who excel at con-
verting challenging opportunities from those who consistently find themselves in advantageous
positions. Examining the results for the English Premier League dataset, particularly for J.
Vardy, reveals that his total adjusted xG is not significantly different from his baseline xG
(see Figure 10). This suggests that, given the quality of chances Vardy receives, he scores at a
relatively average rate. On the other hand, S. Agüero demonstrates a more consistent ability to
score from more challenging positions, evident in his larger adjusted xG. It is important to note
that this observation does not imply that Agüero is a superior player or attacker compared to
Vardy. Instead, it suggests that, on average, Agüero is more adept at converting chances with
lower xG values than Vardy, indicating proficiency in scoring from less favourable situations.
It is important to acknowledge that there might also be team-related influences at play
in this context. To further compare Vardy and Agüero, Vardy is part of a Leicester team
known for its high-tempo and direct attacking style. This approach likely leads to shooting
scenarios where the ball is played behind the defence, creating situations with fewer defenders
to obstruct or impede a shot. This dynamic often results in one-on-one opportunities with the
goalkeeper, contributing to Vardy consistently receiving numerous high xG chances. Conversely,
Agüero played for one of the top teams in the league, causing opponents to adopt a more
conservative approach. Teams facing Manchester City tend to minimise space, making chances
more challenging with multiple players in the shot triangle and other complicating factors.
The Bayesian modelling results were compared with non-Bayesian, or frequentist, modelling.
Particularly in the case of player correction, Bayesian modelling offers a significant
advantage by capturing uncertainty through posterior distributions rather than relying solely
on point estimates. This becomes particularly advantageous when dealing with younger play-
ers with limited match experience, as Bayesian hierarchical modelling effectively addresses data
groups with few observations. Nevertheless, Bayesian modelling is rarely used in the football
analytics literature. Beyond its applications in scouting and player selection, Bayesian hierarchical
modelling holds promise for various metrics in football. For instance, the assessment of
injury risk among players is a common practice in large football clubs, where certain players
may have a higher overall susceptibility to injuries. Hierarchical modelling, in this context,
has the potential to provide more accurate assessments of individual players’ injury risks by
considering their specific injury history.
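The benefit for groups with few observations can be illustrated with a deliberately simplified stand-in for the hierarchical logistic models of this study: a conjugate Beta-Binomial estimate of a conversion rate, in which a shared prior pulls small-sample players towards the group-level rate. The prior pseudo-counts and player figures below are hypothetical.

```python
def shrunk_rate(goals, shots, prior_goals=2.0, prior_misses=18.0):
    """Posterior mean conversion rate under a Beta(prior_goals, prior_misses) prior.

    The default prior has mean 0.10 and is an assumption chosen for illustration.
    """
    return (prior_goals + goals) / (prior_goals + prior_misses + shots)


PRIOR_MEAN = 2.0 / (2.0 + 18.0)

# Hypothetical players: a newcomer with 3 shots and an established player with 300.
newcomer_raw, newcomer_shrunk = 2 / 3, shrunk_rate(goals=2, shots=3)
veteran_raw, veteran_shrunk = 60 / 300, shrunk_rate(goals=60, shots=300)

print(f"newcomer: raw {newcomer_raw:.3f} -> shrunk {newcomer_shrunk:.3f}")
print(f"veteran:  raw {veteran_raw:.3f} -> shrunk {veteran_shrunk:.3f}")
```

The estimate for the player with three shots moves strongly towards the prior mean, while the estimate for the player with 300 shots barely changes, which is the partial-pooling behaviour that makes hierarchical models suitable for players with limited match experience.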
References
Marcelo Aberle, Luuk Figdor, Lina Mongrand, and Mirko Janetzke. The tech behind the
Bundesliga Match Facts xGoals: How machine learning is driving data-driven insights in
soccer, Jun 2020. URL https://aws.amazon.com/blogs/machine-learning/the-tech-behind-the-bundesliga-match-facts-xgoals-how-machine-learning-is-driving-data-driven-insights-in-soccer/.
Gianluca Baio and Marta Blangiardo. Bayesian hierarchical model for the prediction of football
results. Journal of Applied Statistics, 37(2):253–264, 2010.
Marc Brechot and Raphael Flepp. Dealing with randomness in match outcomes: how to
rethink performance evaluation in European club football using expected goals. Journal of
Sports Economics, 21(4):335–362, 2020.
Tomás Capretto, Camen Piho, Ravin Kumar, Jacob Westfall, Tal Yarkoni, and Osvaldo A
Martin. Bambi: A simple interface for fitting Bayesian linear models in Python. Journal
of Statistical Software, 103(15):1–29, 2022. doi:10.18637/jss.v103.i15. URL
https://www.jstatsoft.org/index.php/jss/article/view/v103i15.
Mustafa Cavus and Przemyslaw Biecek. Explainable expected goal models for performance
analysis in football analytics. In 2022 IEEE 9th International Conference on Data Science
and Advanced Analytics (DSAA), pages 1–9. IEEE, 2022.
Alexander Fairchild, Konstantinos Pelechrinis, and Marios Kokkodis. Spatial analysis of shots
in MLS: a model for expected goals and fractal dimensionality. Journal of Sports Analytics,
4(3):165–174, 2018.
Mat Herold, Floris Goes, Stephan Nopp, Pascal Bauer, Chris Thompson, and Tim Meyer.
Machine learning in men’s professional football: Current applications and future directions
for improving attacking play. International Journal of Sports Science & Coaching, 14(6):
798–817, 2019.
James H Hewitt and Oktay Karakuş. A machine learning approach for player and position
adjusted expected goals in football (soccer). Franklin Open, 4:100034, 2023.
Anito Joseph, Norman E Fenton, and Martin Neil. Predicting football results using Bayesian
nets and other machine learning techniques. Knowledge-Based Systems, 19(7):544–553, 2006.
Patrick Lucey, Alina Bialkowski, Mathew Monfort, Peter Carr, and Iain Matthews. Quality vs
quantity: Improved shot prediction in soccer using strategic features from spatiotemporal
data. 2015.
Pau Madrero Pardo. Creating a model for expected goals in football using qualitative player
information. Master’s thesis, Universitat Politècnica de Catalunya, 2020.
Rory Smith. Expected goals: the story of how data conquered football and changed the game
forever. 2022.
William Spearman. Beyond expected goals. In Proceedings of the 12th MIT Sloan Sports
Analytics Conference, pages 1–17, 2018.
James Tippett. The Expected Goals Philosophy: A Game-Changing Way of Analysing Football.
2019.
Tahmeed Tureen and SBH Olthof. “Estimated Player Impact”(EPI): Quantifying the effects
of individual players on football (soccer) actions using hierarchical statistical models. In
StatsBomb Conference Proceedings. StatsBomb, 2022.