The Annals of Applied Statistics

2022, Vol. 16, No. 1, 349–367


https://doi.org/10.1214/21-AOAS1514
© Institute of Mathematical Statistics, 2022

IN-GAME WIN PROBABILITIES FOR THE NATIONAL RUGBY LEAGUE

BY TIANYU GUAN1,a, ROBERT NGUYEN2,b, JIGUO CAO3,c AND TIM SWARTZ3,d


1 Department of Mathematics and Statistics, Brock University, a [email protected]
2 Department of Statistics, School of Mathematics and Statistics, University of New South Wales, b [email protected]
3 Department of Statistics and Actuarial Science, Simon Fraser University, c [email protected], d [email protected]

This paper develops new methods for providing instantaneous in-game
win probabilities for the National Rugby League. In addition to the score
differential, betting odds and real-time features extracted from the match event
data are used as inputs to inform the in-game win probabilities. Rugby
matches evolve continuously in time, and the circumstances change over the
duration of the match. Therefore, the match data are considered as functional
data, and the in-game win probability is a function of the time of the match.
We express the in-game win probability using a conditional probability for-
mulation, the components of which are evaluated from the perspective of
functional data analysis. Specifically, we model the score differential process
and functional feature extracted from the match event data as sums of mean
functions and noises. The mean functions are approximated by B-spline ba-
sis expansions with functional parameters. Since each match is conditional on
a unique kickoff win probability of the home team obtained from the betting
odds (i.e., the functional data are not independent and identically distributed),
we propose a weighted least squares method to estimate the functional pa-
rameters by borrowing the information from matches with similar kickoff
win probabilities. The variance and covariance elements are obtained by the
maximum likelihood estimation method. The proposed method is applicable
to other sports when suitable match event data are available.

1. Introduction. In recent years, analytics have made a profound impact on sports where
great investments have been made in the “big” professional sports of basketball (the National
Basketball Association), football (the National Football League), soccer (major European
leagues), hockey (the National Hockey League), and baseball (Major League Baseball).
Many teams now have their own analytics staff where decisions are scrutinized across many
areas of the sporting operation, including strategy, drafting, salaries, player evaluation, and
marketing. For a survey of some of the work that has been done in sports analytics, see Albert
et al. (2017).
Whereas the National Rugby League (NRL) may be considered a big sport (it has the great-
est television viewership of any sport in Australia), the NRL is underrepresented in the sports
analytics literature. For example, in a search of the archives of the Journal of Quantitative
Analysis in Sports (founded in 2005) the authors were unable to find a single article devoted
to the rugby league. Similarly, in a search of Australian & New Zealand Journal of Statistics
we were only able to find a single article devoted to the rugby league (Lee (1999)). However,
there have been many papers written on the rugby league from the sports science perspective,
and a small sample of these include Glassbrook et al. (2019), Booth and Orr (2017), Windt
et al. (2017), Seitz et al. (2014), King, Jenkins and Gabbett (2009), and Gabbett (2005).
In an attempt to grow the game, the NRL is adding an analytics focus to the sport (see
www.nrl.com/stats). In particular, to provide additional excitement to the television viewing
experience the NRL would like to include in-game win probabilities. The idea is that such a

Received August 2020; revised June 2021.


Key words and phrases. In-game win probability, event data, NRL, functional data analysis, model validation.

graphic may be presented in a small corner of the screen and be continually updated as the
game circumstances change. The graphic would be appealing to the NRL fan base and also
to punters. The continual update precludes highly computational techniques, and, of course,
the predictions of the in-game win probabilities ought to be accurate.
Prediction of the in-game win probabilities has been investigated in other major sports,
such as basketball, football, and hockey. For example, in basketball, Stern (1994) developed
a Brownian motion model to investigate the score differential process. Gable and Redner
(2012) and Clauset, Kogan and Redner (2015) built computational random walk models
for analyzing the scoring processes in basketball games. Štrumbelj and Vračar (2012) and
Vračar, Štrumbelj and Kononenko (2016) used possession-based Markov models to simu-
late basketball matches. Data snapshots approaches were developed by Kayhan and Watkins
(2018, 2019). Song, Gao and Shi (2020) obtained in-game predictions by fitting a gamma
process based model for the total points process of basketball. A multiresolution stochastic
process model was proposed by Cervone et al. (2016) to quantify the expected possession
value, a concept that is similar to the in-game win probability. In football, Lock and Nettle-
ton (2014) employed a random forest method to provide the in-game win probability of the
National Football League (NFL), whereas Robberechts, Van Haaren and Davis (2019) intro-
duced a Bayesian statistical model. For the National Hockey League (NHL), Buttrey, Wash-
burn and Price (2011) proposed to predict the scoring process by fitting a Poisson process,
and Pettigrew (2015) used the in-game win probabilities to assess the offensive productiv-
ity of the NHL players. However, the in-game win probability is a new concept in rugby. In
this article we focus on the NRL matches and propose methods to estimate the in-game win
probabilities from the perspective of functional data analysis (FDA), which uses the informa-
tion of the score differential, the features extracted from the in-game event data and a unique
kickoff win probability of the home team for each match.
The NRL has provided us with four seasons of detailed event data (2016–2019), which we
use to inform the in-game win probabilities. Our approach begins with a conditional proba-
bility formulation where our main interest concerns the evaluation of the in-game posterior
win probability. Specifically, the in-game win probability is expressed by a conditional prob-
ability formulation with components of a unique kickoff win probability of the home team
and conditional joint densities of the score differential and event feature that arises from the
in-game event data conditional on the event that the home team wins or loses the match and
the unique kickoff win probability of the home team. The challenge is the development of an
accurate model for which the posterior probability can be evaluated in real time. The accu-
racy provided by the model relies on the domain knowledge of the sport; hence, we search
for data and covariates that have high predictive capability.
A rugby league match is 80 minutes in duration, and circumstances change over the
duration of the match. Therefore, we consider the match data as functional data, and the
in-game win probability is a function of the time of the match. The distributions that are
specified in our model are determined via FDA. FDA is a relatively new branch of statis-
tics where regression methods are extended to the study of curves or functions. There is an
extensive literature on FDA. The most popular techniques in FDA include various smooth-
ing methods (e.g., Ramsay and Silverman ((2005), Chapter 3), de Boor (2001) and Wand
and Jones (1995)), functional principal component analysis (e.g., Besse and Ramsay (1986),
Bosq (2000), Cardot (2000), and Yao, Müller and Wang (2005a)), functional linear regres-
sion model (Hastie and Mallows (1993), Hall and Horowitz (2007), Cardot, Ferraty and Sarda
(2003), Yuan and Cai (2010), and Yao, Müller and Wang (2005b)), and clustering and clas-
sification of functional data (e.g., James and Sugar (2003), Jacques and Preda (2014), Leng
and Müller (2006), and Delaigle and Hall (2012)). For a broad theoretical, methodological,
and practical introduction to functional data analysis, interested readers are referred to the

monographs by Ramsay and Silverman (2005), Ferraty and Vieu (2006), Ramsay, Hooker
and Graves (2009), Horváth and Kokoszka (2012), Hsing and Eubank (2015), and Kokoszka
and Reimherr (2017), and the review papers by Morris (2015) and Wang, Chiou and Müller
(2016) and references therein. FDA also has many applications in other areas. For instance,
Ainsworth, Routledge and Cao (2011) applied FDA for ecosystem research in which they
studied the relationship between river flow and salmon abundance. Luo et al. (2013) esti-
mated the intensity of ward admissions and investigated its effect on emergency department
access in public hospitals by using FDA methods. However, little has been done on applying
FDA in sports, except for Chen and Fan (2018) who investigated the score differential process
in basketball by employing FDA.
In this paper we apply FDA to evaluate the components in the conditional probability for-
mulation that we use to express the in-game win probabilities. We model the functional fea-
ture extracted from the match event data and score differential processes as sums of smooth
mean functions which are approximated by nonparametric smoothing techniques and noises
from Brownian motions. The mean functions are approximated by B-spline basis expansions
with functional parameters. In FDA a typical application involves the analysis of a sample
of realizations from independent and identically distributed (iid) functions (e.g. Ramsay and
Silverman (2005), Chapter 3.2.4, Cai and Hall (2006) and Hall and Horowitz (2007)). A nov-
elty in our work is that the matches are not iid, because each match is conditional on a unique
kickoff win probability of the home team. Therefore, we propose a weighted least squares
method to estimate the functional parameters which borrows the information from matches
with similar kickoff win probabilities. The variance and covariance elements are obtained by
the maximum likelihood estimation method. A key feature of our work is that the general
approach for estimating in-game win probabilities may be used in any sport that has event
data. Event data consists of a chronological record of well-defined events that occur during
a match which are relevant to the match and are recorded with a time stamp. The necessary
modifications to alternative sports would involve the determination of the relevant event data
which is predictive and sport specific.
In Section 2 we begin with a discussion of the data that is at our disposal. We then outline
a model from which we obtain the in-game posterior win probability. The model consists
of distributions that are specified via FDA methods. The FDA methodology is explained in
detail. In Section 3 we consider the utilization of the event data to provide good predictions.
There are many potential insights from a game that are relevant. We use the domain knowl-
edge from the rugby league for the specification. We then demonstrate that our estimated
in-game win probabilities change during a match in expected ways. In Section 4 we demon-
strate that our estimated win probabilities are reliable. We conclude with a short discussion
in Section 5.

2. Model development.

2.1. Available data. The NRL consists of 16 teams, and each team plays 24 games during
the regular season. The NRL has graciously given us access to event data for the resultant 769
regular season matches that have taken place during the four seasons 2016–2019. Event data
are detailed match data that go well beyond box score data. With event data, every time an
event occurs during a match (e.g., field goal, try, tackle, etc.), characteristics of the event are
recorded (e.g., location on the pitch, players involved, time of the match, etc.). In the NRL,
2.1 events are recorded on average per second, and there are up to 410 characteristics that
can be recorded as an event. The events and characteristics are obtained through cameras
and optical recognition software that carry out the data collection process in real time. Our
dataset is a huge matrix with rows corresponding to events and columns corresponding to

characteristics. Our dataset has 8,144,905 events (rows) obtained over the four seasons. Note
that most events, such as runs and catches, are unlikely to have a significant influence on the
prediction of the in-game win probabilities.
An important component of our work, which is developed in Section 3, is the determination
of relevant event data to inform the in-game win probabilities. We propose several choices
of the event data that assist us to inform the in-game win probabilities. We focus on the
event features that are related to tackles and missed tackles. Specifically, we choose the event
feature, missed tackle differential, to illustrate the proposed method. Among the 8,144,905
rows, we use 278,813 of them which include the data of scoring, tackles, and missed tackles.
In our development and without loss of generality, the in-game win probabilities and data
will refer to the home team.
For the time being, for a particular match we will refer to X(t) as a random functional fea-
ture that arises from the event data relative to the home team, defined on a time interval [0, 80]
minutes. Note that X(t) may be multivariate. For example, it is obvious that the average field
position by the home team is a measure of dominance, and it may be a good predictor of the
home team’s chance of winning the match.
Another important predictor of the in-game win probability of the home team is the current
score differential. We will refer to D(t) as the number of points by which the home team is
defeating the road team at time t. Note that D(t) < 0 indicates that the road team is winning
by |D(t)| points at time t.
Finally, another important predictor of the in-game win probability of the home team is a
measure of its strength relative to the road team. This is not something immediately available
from the event data, and, therefore, we sourced an additional dataset. The website
http://www.aussportsbetting.com/data/historical-nrl-results-and-odds-data/ gives closing betting odds of
NRL matches immediately prior to kickoff. A nice feature of the betting odds is that they take
into account everything that is relevant to a match including home team advantage, injuries,
travel, etc. Betting odds are also known to be efficient; otherwise, sportsbooks would not
exist. Therefore, we can rely on the betting odds as providing reliable information concerning
the win-probability of the home team at the time of kickoff.
Betting odds arise in various formats, and we will refer to odds provided in the European
format. Odds \(o_h\) on the home team indicate that a winning bet of $1 on the home team will
result in a payout of \(o_h\) dollars. Clearly, \(o_h \ge 1\). Similarly, odds \(o_r\) on the road team
indicate that a winning bet of $1 on the road team will result in a payout of \(o_r\) dollars. We
ignore the rare event that a match can end in a draw as this does not affect the subsequent
calculations. Draws occur roughly 4.94% of the time in the NRL. Among the 769 regular season
matches in 2016–2019, only 38 matches ended in draws. We use the remaining 731 matches for our
analysis. Now, if the odds were fair, the expected payout \(p_h o_h\) of a $1 bet on the home team
would equal the stake, so that the probability of the home team winning is \(p_h = 1/o_h\) and,
likewise, the probability of the road team winning is \(p_r = 1/o_r\).
However, these calculations do not take into account the vigorish (i.e., the expected profit) by
the sportsbook, and, therefore, \(p_h + p_r > 1\). We, therefore, remove the vigorish and set the
kickoff probability that the home team wins the match as \(p_0 = p_h/(p_h + p_r)\).
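The vigorish removal can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name and the example odds are hypothetical.

```python
def kickoff_win_probability(home_odds: float, road_odds: float) -> float:
    """Vigorish-free kickoff win probability p0 for the home team.

    Odds are in European (decimal) format: a winning $1 bet pays out
    `home_odds` (or `road_odds`) dollars, so both must be at least 1.
    """
    if home_odds < 1.0 or road_odds < 1.0:
        raise ValueError("European odds must be at least 1")
    p_home = 1.0 / home_odds  # implied probability, still inflated by the vigorish
    p_road = 1.0 / road_odds
    # Renormalise so that the two outcomes sum to one.
    return p_home / (p_home + p_road)


# Hypothetical odds, chosen only for illustration.
p0 = kickoff_win_probability(1.60, 2.40)
print(round(p0, 3))
```

With these odds the implied probabilities are 0.625 and about 0.417, which sum to more than 1; the renormalised value is the input \(p_0\) used throughout.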
Therefore, to review, the inputs to our model, which we use to estimate in-game win prob-
abilities for the home team, are given by
\[
\begin{aligned}
X(t) &\equiv \text{functional feature extracted from the event data relative to the home team at time } t;\\
D(t) &\equiv \text{score differential in favour of the home team at time } t;\\
p_0 &\equiv \text{kickoff probability of the home team winning based on sportsbook odds.}
\end{aligned}
\tag{1}
\]

2.2. Model overview. In this subsection we present a model based on the inputs given by
(1). We let \(W\) denote the event that the home team wins the match and let \(\bar{W}\) denote the
event that the home team does not win the match. The posterior probability of \(W\) is
our quantity of interest. By Bayes' theorem, we obtain the expression
\[
\begin{aligned}
\operatorname{Prob}\bigl(W \mid X(t) = x(t), D(t) = d(t), p_0\bigr)
&= \frac{f(x(t), d(t) \mid W, p_0)\operatorname{Prob}(W \mid p_0)}
        {f(x(t), d(t) \mid W, p_0)\operatorname{Prob}(W \mid p_0) + f(x(t), d(t) \mid \bar{W}, p_0)\operatorname{Prob}(\bar{W} \mid p_0)}\\
&= \frac{f(x(t), d(t) \mid W, p_0)\,p_0}
        {f(x(t), d(t) \mid W, p_0)\,p_0 + f(x(t), d(t) \mid \bar{W}, p_0)(1 - p_0)},
\end{aligned}
\tag{2}
\]
where \(f(x(t), d(t) \mid W, p_0)\) represents the conditional joint density of \(X(t)\) and \(D(t)\), given
\(W\) and \(p_0\), and \(f(x(t), d(t) \mid \bar{W}, p_0)\) represents the conditional joint density of \(X(t)\) and
\(D(t)\), given \(\bar{W}\) and \(p_0\). We observe that (2) is a simple expression for the purposes of
calculation. However, for the application to television broadcasts, we emphasize that
the component distributions in (2) must be evaluated instantaneously.
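As a minimal sketch of how (2) combines its ingredients once the two conditional densities have been evaluated, consider the following Python function. The function name and the numerical values are hypothetical and serve only to illustrate the arithmetic.

```python
def posterior_win_probability(f_win: float, f_loss: float, p0: float) -> float:
    """Equation (2): posterior probability that the home team wins.

    f_win  -- joint density of (x(t), d(t)) given W and p0
    f_loss -- joint density of (x(t), d(t)) given "not W" and p0
    p0     -- kickoff win probability of the home team (the prior)
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    numerator = f_win * p0
    denominator = numerator + f_loss * (1.0 - p0)
    if denominator <= 0.0:
        raise ValueError("at least one density must be positive")
    return numerator / denominator


# Hypothetical density values: the observed data are three times as
# likely under a home win as under a home loss.
print(posterior_win_probability(f_win=0.03, f_loss=0.01, p0=0.5))
```

Only a handful of floating-point operations are needed per update, which is consistent with the requirement that the probability be refreshed in real time.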

2.3. Estimation of model components using FDA. This is the most technical portion of
the paper where an atypical FDA structure is introduced and estimation techniques are
developed to determine the conditional joint densities \(f(x(t), d(t) \mid W, p_0)\) and
\(f(x(t), d(t) \mid \bar{W}, p_0)\) in (2). We illustrate the methodology with univariate \(X(t)\), although
the methods can be extended to multivariate \(X(t)\). This subsection may be skimmed while still
retaining the overall intent of the paper.
We begin by focusing on the \(f(x(t), d(t) \mid W, p_0)\) term; \(f(x(t), d(t) \mid \bar{W}, p_0)\) is
handled in a similar fashion. Given \(W\) and \(p_0\), we assume that
\[
\begin{aligned}
X(t) &= \mu_X(t, W, p_0) + \varepsilon_X(t, W),\\
D(t) &= \mu_D(t, W, p_0) + \varepsilon_D(t, W),
\end{aligned}
\]
where \(\mu_X(t, W, p_0)\) is the expected value of the functional feature \(X(t)\), given the home team
winning and having a kickoff win probability of \(p_0\). Similarly, \(\mu_D(t, W, p_0)\) is the expected
value of the score differential \(D(t)\), given the home team winning and having a kickoff win
probability of \(p_0\). \(\{\varepsilon_X(t, W)\}_t\) and \(\{\varepsilon_D(t, W)\}_t\) are error processes.
For ease of notation, we use \(\varepsilon_X(t)\) and \(\varepsilon_D(t)\) to represent \(\varepsilon_X(t, W)\) and
\(\varepsilon_D(t, W)\), respectively. In Section 3 we consider various choices for \(X(t)\) that affect
\(\operatorname{Var}(\varepsilon_X(t))\) and the resultant estimation procedure. Suppose for now that \(\varepsilon_X(t)\) is a
random variable that consists of independent incremental contributions up to time \(t\). Therefore,
we assume that \(\varepsilon_X(t)\) has mean 0 and variance \(t\sigma_X^2\). However, we note that the following
theory may be modified to accommodate other variance assumptions, such as a constant variance.
For \(\varepsilon_D(t)\), we also assume that it is based on a white noise process where we recognize that
the score differential consists of incremental contributions during the match up to time \(t\).
Therefore, assuming that these contributions are independent and identically distributed, it is
appropriate that \(\varepsilon_D(t)\) have mean 0 and variance \(t\sigma_D^2\). These assumptions are equivalent to
assuming that \(\varepsilon_X(t)\) and \(\varepsilon_D(t)\) are Brownian motion processes. We justify the Brownian
motion assumptions in the Supplementary Material (Guan et al. (2022)). The correlation between
\(X(t)\) and \(D(t)\) is assumed not to depend on \(t\), and we let \(\rho = \operatorname{Corr}(X(t), D(t))\). Then, at
time \(t\), the noises are distributed as
\[
\begin{pmatrix} \varepsilon_X(t) \\ \varepsilon_D(t) \end{pmatrix}
\sim \operatorname{Normal}(\mathbf{0}, t\mathbf{K}),
\tag{3}
\]
where \(\mathbf{0} = (0, 0)^T\) and
\[
\mathbf{K} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_D \\ \rho\sigma_X\sigma_D & \sigma_D^2 \end{pmatrix}.
\]
For different time points \(t\) and \(t'\), \(\operatorname{Cov}(\varepsilon_X(t), \varepsilon_X(t')) = \min\{t, t'\}\sigma_X^2\),
\(\operatorname{Cov}(\varepsilon_D(t), \varepsilon_D(t')) = \min\{t, t'\}\sigma_D^2\), and
\(\operatorname{Cov}(\varepsilon_X(t), \varepsilon_D(t')) = \min\{t, t'\}\rho\sigma_X\sigma_D\).
We further assume that \(\mu_X(t, W, p_0)\) and \(\mu_D(t, W, p_0)\) are continuous smooth functions,
and we approximate these functions by linear combinations of basis functions as follows:
\[
\begin{aligned}
\mu_X(t, W, p_0) &= \sum_{k=1}^{K} a_k(W, p_0)\,b_k(t),\\
\mu_D(t, W, p_0) &= \sum_{k=1}^{K} c_k(W, p_0)\,b_k(t),
\end{aligned}
\tag{4}
\]
where the \(b_k\) are predetermined basis functions. Up until this point, except for the variance
assumptions associated with the noise terms, this is a standard setup in FDA applications (see,
e.g., Chapter 3 in Ramsay and Silverman (2005)).
With our initial concentration on the specification of \(f(x(t), d(t) \mid W, p_0)\), we restrict our
data to matches where the home team has won (i.e., \(W\) is observed). Assume that we have
functional data \(\{(X_i(t_{ij}), D_i(t_{ij})) : i = 1, \ldots, N;\ j = 1, \ldots, n_i\}\). We also have the kickoff
win probability \(p_{i0}\) associated with the \(i\)th match.
An aspect of our problem that makes it different from a typical FDA application is that
the functional data are not iid. Specifically, the functional distribution of the \(i\)th match is
conditional on \(p_{i0}\) (the kickoff win probability of the home team in the \(i\)th match). Suppose
that \(t_{i1} < t_{i2} < \cdots < t_{in_i}\), and let
\[
\boldsymbol{\Sigma}_{i0} =
\begin{bmatrix}
t_{i1} & t_{i1} & t_{i1} & \cdots & t_{i1}\\
t_{i1} & t_{i2} & t_{i2} & \cdots & t_{i2}\\
t_{i1} & t_{i2} & t_{i3} & \cdots & t_{i3}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
t_{i1} & t_{i2} & t_{i3} & \cdots & t_{in_i}
\end{bmatrix}.
\]
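Therefore, to address the estimation of the \(a\)'s and \(c\)'s in (4) we minimize the functions
\[
\begin{aligned}
H_a(\mathbf{a}) &= \sum_{i=1}^{N} \exp\left\{\frac{-(p_0 - p_{i0})^2}{\gamma}\right\}
  (\mathbf{X}_i - \mathbf{B}_i\mathbf{a})^T \mathbf{G}_i^{-1} (\mathbf{X}_i - \mathbf{B}_i\mathbf{a}),\\
H_c(\mathbf{c}) &= \sum_{i=1}^{N} \exp\left\{\frac{-(p_0 - p_{i0})^2}{\gamma}\right\}
  (\mathbf{D}_i - \mathbf{B}_i\mathbf{c})^T \mathbf{H}_i^{-1} (\mathbf{D}_i - \mathbf{B}_i\mathbf{c}),
\end{aligned}
\tag{5}
\]
where \(\mathbf{a} = (a_1(W, p_0), \ldots, a_K(W, p_0))^T\), \(\mathbf{c} = (c_1(W, p_0), \ldots, c_K(W, p_0))^T\), \(\mathbf{X}_i\) is an
\(n_i \times 1\) vector with the \(j\)th element \(X_i(t_{ij})\), \(\mathbf{D}_i\) is an \(n_i \times 1\) vector with the \(j\)th
element \(D_i(t_{ij})\), \(\mathbf{B}_i\) is the \(n_i \times K\) matrix with the \((j, k)\)th element \(b_k(t_{ij})\), and
\(\mathbf{G}_i = \mathbf{H}_i = \boldsymbol{\Sigma}_{i0}\). In (5), \(\gamma > 0\) is a tuning parameter. The term
\(\exp\{-(p_0 - p_{i0})^2/\gamma\}\) assigns more weight to matches that have
similar kickoff win probabilities to the generic value \(p_0\).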
The proposed estimation procedure is based on the minimization of the functions \(H_a\) and
\(H_c\). What makes the equations in (5) unusual is that \(E(X_i(t_{ij}) \mid W, p_{i0})\) and
\(E(D_i(t_{ij}) \mid W, p_{i0})\) do not equal the specified expressions \(\sum_{k=1}^{K} a_k(W, p_0)b_k(t_{ij})\) and
\(\sum_{k=1}^{K} c_k(W, p_0)b_k(t_{ij})\). Equality would only exist if the \(\mathbf{X}_i\) and \(\mathbf{D}_i\) were observed
under the generic value \(p_0\), where again, we emphasize that the functional data from different
matches do not have the same conditional distribution because each match is conditional on a
unique kickoff win probability \(p_{i0}\). This provides the motivation for the exponential terms; we
assign more weight to observations for which the generic \(p_0\) is closer to the observed \(p_{i0}\).
With a little bit of work, it can be shown that, for fixed \(\gamma\), the minimization of \(H_a\) and \(H_c\)
yields the analytic expressions
\[
\begin{aligned}
\hat{\mathbf{a}} &= \Bigl(\sum_{i=1}^{N} v_i \mathbf{B}_i^T \mathbf{G}_i^{-1} \mathbf{B}_i\Bigr)^{-1}
                   \Bigl(\sum_{i=1}^{N} v_i \mathbf{B}_i^T \mathbf{G}_i^{-1} \mathbf{X}_i\Bigr),\\
\hat{\mathbf{c}} &= \Bigl(\sum_{i=1}^{N} v_i \mathbf{B}_i^T \mathbf{H}_i^{-1} \mathbf{B}_i\Bigr)^{-1}
                   \Bigl(\sum_{i=1}^{N} v_i \mathbf{B}_i^T \mathbf{H}_i^{-1} \mathbf{D}_i\Bigr),
\end{aligned}
\tag{6}
\]
where \(v_i = v_i(p_0, \gamma) = \exp\{-(p_0 - p_{i0})^2/\gamma\}\). Note that, for any given \(0 < p_0 < 1\), we
smooth \(\mu_X(t, W, p_0)\) and \(\mu_D(t, W, p_0)\) on the dimension of time \(t\) and obtain \(\hat{\mathbf{a}}\) and
\(\hat{\mathbf{c}}\), which means the estimation of \(\mathbf{a}\) and \(\mathbf{c}\) is pointwise in \(p_0\). Another way to
formulate the problem is to smooth \(\mu_X(t, W, p_0)\) and \(\mu_D(t, W, p_0)\) by using bivariate smoothers
on the dimensions of both \(t\) and \(p_0\). For example, we can construct the bivariate basis functions
using a tensor product of B-splines. However, the potential problem is that the range for the
B-spline basis functions on the dimension of \(p_0\) is difficult to determine because the kickoff win
probability varies for each match, and we do not have much data for \(p_0\) close to 0 or 1.
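The closed form in (6) can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: it relies on NumPy, replaces the cubic B-splines with a piecewise-linear "hat" basis to stay short, and uses noise-free synthetic matches so that the estimate can be checked against the coefficients that generated them. All function names, knots, and data are hypothetical.

```python
import numpy as np


def hat_basis(t, knots):
    """Piecewise-linear basis (a stand-in for cubic B-splines).

    Returns a len(t) x len(knots) matrix whose (j, k) entry is b_k(t_j).
    """
    t = np.asarray(t, dtype=float)
    eye = np.eye(len(knots))
    return np.column_stack([np.interp(t, knots, eye[k]) for k in range(len(knots))])


def brownian_cov(times):
    """Sigma_i0: matrix with (j, k) entry min(t_ij, t_ik)."""
    times = np.asarray(times, dtype=float)
    return np.minimum.outer(times, times)


def weighted_ls(p0, gamma, matches, knots):
    """Equation (6): kernel-weighted generalised least squares.

    matches -- iterable of (p_i0, times_i, values_i), one entry per match;
               times_i must be strictly increasing and strictly positive.
    """
    n_basis = len(knots)
    lhs = np.zeros((n_basis, n_basis))
    rhs = np.zeros(n_basis)
    for p_i0, times, values in matches:
        v_i = np.exp(-(p0 - p_i0) ** 2 / gamma)  # weight from (5)
        B_i = hat_basis(times, knots)
        G_inv_B = np.linalg.solve(brownian_cov(times), B_i)  # G_i^{-1} B_i
        lhs += v_i * (B_i.T @ G_inv_B)
        rhs += v_i * (G_inv_B.T @ np.asarray(values, dtype=float))
    return np.linalg.solve(lhs, rhs)


# Synthetic, noise-free matches generated from known coefficients.
knots = np.array([0.0, 40.0, 80.0])
times = np.arange(1.0, 81.0)
a_true = np.array([0.0, 3.0, 8.0])
matches = [(p, times, hat_basis(times, knots) @ a_true) for p in (0.45, 0.55, 0.70)]
a_hat = weighted_ls(p0=0.6, gamma=0.01, matches=matches, knots=knots)
print(a_hat)
```

Because the synthetic curves contain no noise, the estimate reproduces `a_true` up to rounding; with real matches, the weights \(v_i\) determine how strongly each match informs the fit at the chosen \(p_0\).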
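With estimated \(\hat{a}\)'s and \(\hat{c}\)'s, we now turn to more traditional estimation procedures. Let
\(\hat{a}_{ik} = \hat{a}_k(W, p_{i0})\) and \(\hat{c}_{ik} = \hat{c}_k(W, p_{i0})\). Based on the data and modelling
assumptions, the resulting likelihood is given by
\[
L(\sigma_X, \sigma_D, \rho \mid W, p_0) = \prod_{i=1}^{N} (2\pi)^{-n_i} |\boldsymbol{\Sigma}_i|^{-\frac{1}{2}}
\exp\left\{-\frac{1}{2}(\mathbf{w}_i - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{w}_i - \boldsymbol{\mu}_i)\right\},
\]
where
\[
\mathbf{w}_i = \begin{pmatrix} \mathbf{X}_i \\ \mathbf{D}_i \end{pmatrix}, \qquad
\boldsymbol{\mu}_i = \begin{pmatrix} \boldsymbol{\mu}_{iX} \\ \boldsymbol{\mu}_{iD} \end{pmatrix}
\]
with
\[
\begin{aligned}
\boldsymbol{\mu}_{iX} &= \Bigl(\sum_{k=1}^{K} \hat{a}_{ik} b_k(t_{i1}), \ldots, \sum_{k=1}^{K} \hat{a}_{ik} b_k(t_{in_i})\Bigr)^T,\\
\boldsymbol{\mu}_{iD} &= \Bigl(\sum_{k=1}^{K} \hat{c}_{ik} b_k(t_{i1}), \ldots, \sum_{k=1}^{K} \hat{c}_{ik} b_k(t_{in_i})\Bigr)^T,
\end{aligned}
\]
and \(\boldsymbol{\Sigma}_i\) is the Kronecker product of \(\mathbf{K}\) and \(\boldsymbol{\Sigma}_{i0}\), denoted by
\(\mathbf{K} \otimes \boldsymbol{\Sigma}_{i0}\). The parameters \(\sigma_X\), \(\sigma_D\), and \(\rho\) appear in matrix \(\mathbf{K}\). The
likelihood can then be maximized to provide estimates
\[
\begin{aligned}
\hat{\sigma}_X^2 &= D_{xx}/v_0,\\
\hat{\sigma}_D^2 &= D_{dd}/v_0,\\
\hat{\rho} &= D_{xd}/\sqrt{D_{xx} D_{dd}},
\end{aligned}
\tag{7}
\]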
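where
\[
\begin{aligned}
v_0 &= \sum_{i=1}^{N} n_i,\\
D_{xx} &= \sum_{i=1}^{N} (\mathbf{X}_i - \mathbf{B}_i\hat{\mathbf{a}}_i)^T \mathbf{G}_i^{-1} (\mathbf{X}_i - \mathbf{B}_i\hat{\mathbf{a}}_i),\\
D_{xd} &= \sum_{i=1}^{N} (\mathbf{X}_i - \mathbf{B}_i\hat{\mathbf{a}}_i)^T \boldsymbol{\Sigma}_{i0}^{-1} (\mathbf{D}_i - \mathbf{B}_i\hat{\mathbf{c}}_i),\\
D_{dd} &= \sum_{i=1}^{N} (\mathbf{D}_i - \mathbf{B}_i\hat{\mathbf{c}}_i)^T \mathbf{H}_i^{-1} (\mathbf{D}_i - \mathbf{B}_i\hat{\mathbf{c}}_i),
\end{aligned}
\tag{8}
\]
with \(\hat{\mathbf{a}}_i = (\hat{a}_{i1}, \ldots, \hat{a}_{iK})^T\) and \(\hat{\mathbf{c}}_i = (\hat{c}_{i1}, \ldots, \hat{c}_{iK})^T\). Finally, the
parameter \(\gamma\) is tuned as described in Section 3.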
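The estimates in (7) and (8) can be sketched without forming any matrix inverse. The sketch below relies on a standard property of the Brownian-motion covariance that is not stated in the text: because increments over disjoint intervals are independent, a quadratic form in \(\boldsymbol{\Sigma}_{i0}^{-1}\) equals the sum of products of successive differences divided by the corresponding time gaps. The function names and the toy residuals are hypothetical.

```python
import math


def brownian_quadratic_form(times, r, s):
    """Return r^T Sigma0^{-1} s, where Sigma0[j][k] = min(times[j], times[k]).

    Sigma0^{-1} = A^T diag(1/dt) A with A the first-difference matrix, so the
    form reduces to a sum over increments. Requires 0 < times[0] < times[1] < ...
    """
    total, t_prev, r_prev, s_prev = 0.0, 0.0, 0.0, 0.0
    for t, r_j, s_j in zip(times, r, s):
        total += (r_j - r_prev) * (s_j - s_prev) / (t - t_prev)
        t_prev, r_prev, s_prev = t, r_j, s_j
    return total


def estimate_noise_parameters(matches):
    """Equations (7)-(8).

    matches -- iterable of (times_i, res_x_i, res_d_i), where the residuals
               are X_i - B_i a_hat_i and D_i - B_i c_hat_i.
    Returns (sigma_x_sq_hat, sigma_d_sq_hat, rho_hat).
    """
    v0 = d_xx = d_dd = d_xd = 0.0
    for times, res_x, res_d in matches:
        v0 += len(times)
        d_xx += brownian_quadratic_form(times, res_x, res_x)
        d_dd += brownian_quadratic_form(times, res_d, res_d)
        d_xd += brownian_quadratic_form(times, res_x, res_d)
    return d_xx / v0, d_dd / v0, d_xd / math.sqrt(d_xx * d_dd)


# Toy residuals for a single match; res_d is exactly twice res_x.
toy = [([1.0, 2.0, 4.0], [1.0, 0.0, 2.0], [2.0, 0.0, 4.0])]
print(estimate_noise_parameters(toy))
```

In the toy example the two residual series are proportional, so the estimated correlation is 1; real residuals would give a value strictly inside \((-1, 1)\).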
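Putting this all together, suppose that there is a new match \(l\) with a kickoff win probability
\(p_{l0}\), and we observe event data \(x_l(t)\) and score differential \(d_l(t)\) at time \(t\). Then,
\[
\begin{aligned}
\hat{f}\bigl(x_l(t), d_l(t) \mid W, p_{l0}\bigr)
= \frac{1}{2\pi t\,\hat{\sigma}_X\hat{\sigma}_D\sqrt{1 - \hat{\rho}^2}}
\exp\Biggl\{-\frac{1}{2(1 - \hat{\rho}^2)}\Biggl[
&\frac{\bigl(x_l(t) - \sum_{k=1}^{K}\hat{a}_k(W, p_{l0})b_k(t)\bigr)^2}{t\hat{\sigma}_X^2}\\
&- \frac{2\hat{\rho}\bigl(x_l(t) - \sum_{k=1}^{K}\hat{a}_k(W, p_{l0})b_k(t)\bigr)\bigl(d_l(t) - \sum_{k=1}^{K}\hat{c}_k(W, p_{l0})b_k(t)\bigr)}{t\hat{\sigma}_X\hat{\sigma}_D}\\
&+ \frac{\bigl(d_l(t) - \sum_{k=1}^{K}\hat{c}_k(W, p_{l0})b_k(t)\bigr)^2}{t\hat{\sigma}_D^2}
\Biggr]\Biggr\}.
\end{aligned}
\]
Similarly, we can obtain \(\hat{f}(x_l(t), d_l(t) \mid \bar{W}, p_{l0})\) by using the data of matches where the
home team has lost (i.e., \(\bar{W}\) is observed). Then, by (2) we can simply estimate the posterior
in-game win probability at time \(t\) for match \(l\).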
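The density evaluation and the plug-in step into (2) can be sketched as follows. The function name and every numerical value (fitted means, noise parameters, kickoff probability) are hypothetical; in practice the two sets of parameters would come from separate fits on matches the home team won and lost.

```python
import math


def joint_density(x, d, mean_x, mean_d, t, sigma_x, sigma_d, rho):
    """Bivariate normal density of (X(t), D(t)) with covariance t*K from (3).

    mean_x, mean_d -- fitted mean functions evaluated at time t
    Requires t > 0, positive sigmas and |rho| < 1.
    """
    zx = (x - mean_x) / (math.sqrt(t) * sigma_x)
    zd = (d - mean_d) / (math.sqrt(t) * sigma_d)
    one_minus_rho_sq = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zd + zd * zd) / one_minus_rho_sq
    norm = 2.0 * math.pi * t * sigma_x * sigma_d * math.sqrt(one_minus_rho_sq)
    return math.exp(-0.5 * quad) / norm


# Hypothetical state at minute 30: missed tackle differential 2, score +6.
f_win = joint_density(x=2.0, d=6.0, mean_x=1.5, mean_d=5.0, t=30.0,
                      sigma_x=0.6, sigma_d=1.4, rho=0.2)
f_loss = joint_density(x=2.0, d=6.0, mean_x=-1.0, mean_d=-4.0, t=30.0,
                       sigma_x=0.6, sigma_d=1.4, rho=0.2)
p0 = 0.55
posterior = f_win * p0 / (f_win * p0 + f_loss * (1.0 - p0))  # equation (2)
print(posterior)
```

Since the observed state lies much closer to the hypothetical "win" means than to the "loss" means, the posterior moves above the kickoff value of 0.55.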

2.4. In-game win probabilities for the second half of the match. The method proposed in
Section 2.3, which we call the split FDA method, is used to estimate the in-game win proba-
bilities in the first half of the match. By splitting the matches based on whether the home team
has won or lost the match and using additional information provided by the functional feature
X extracted from the event data, the split FDA method is expected to provide good predic-
tions of the in-game win probabilities in the first half of the match. However, the method may
not provide reasonable estimates toward the end of the match. The estimated in-game win
probability at time t = 80, using the split FDA method, may not be exactly 1 when the home
team has won (\(W\)) or 0 when the home team has lost (\(\bar{W}\)). Moreover, the score differential
D(t), given that the home team wins or loses, is not normally distributed when the match
approaches the end. For example, Figure 1 displays the histograms of the score differentials at
time t = 75 for the matches that the home teams won (left) and lost (right) in the 2016–2018
seasons. We can observe that the distributions for D(75), given either \(W\) or \(\bar{W}\), are skewed.
To overcome the above problems, we consider a method which does not split the matches
into matches that home teams win and matches that home teams lose. We call this method the
joint FDA method. The joint FDA method uses only the score differentials and kickoff win
probabilities as inputs since the score differential is the most dominant factor that impacts the

FIG. 1. Left panel: Histogram of the score differentials at t = 75 for the 311 matches in the training data
(2016–2018 seasons) that the home teams won. Right panel: Histogram of the score differentials at t = 75 for the
241 matches in the training data (2016–2018 seasons) that the home teams lost.

in-game win probabilities toward the end of a match. We assume that the score differential
process follows a Brownian motion model with independent increments which implies that
\(D(80) - D(t)\) is independent of \(D(t)\). Therefore, we have
\[
\begin{aligned}
\operatorname{Prob}\bigl(W \mid D(t) = d(t), p_0\bigr)
&= \operatorname{Prob}\bigl(D(80) > 0 \mid D(t) = d(t), p_0\bigr)\\
&= \operatorname{Prob}\bigl(D(80) - D(t) > -d(t) \mid p_0\bigr).
\end{aligned}
\tag{9}
\]
We assume that, given \(p_0\),
\[
D(t) = \mu_D(t, p_0) + \varepsilon_{D_{\mathrm{joint}}}(t),
\]
where \(\mu_D(t, p_0)\) is the expected value of the score differential \(D(t)\) with a kickoff win
probability \(p_0\) and \(\varepsilon_{D_{\mathrm{joint}}}(t)\) is the noise with mean 0 and variance
\(t\sigma_{D_{\mathrm{joint}}}^2\). We approximate \(\mu_D(t, p_0)\) by \(\sum_{k=1}^{K} \tilde{c}_k(p_0)b_k(t)\). We assume that
\(\varepsilon_{D_{\mathrm{joint}}}(t)\) can be modeled as a Brownian motion process. Since the Brownian motion model
assumes independent increments,
\(\operatorname{Cov}(\varepsilon_{D_{\mathrm{joint}}}(t), \varepsilon_{D_{\mathrm{joint}}}(t')) = \min\{t, t'\}\sigma_{D_{\mathrm{joint}}}^2\) for different
time points \(t\) and \(t'\). Therefore,
\[
D(80) - D(t) \sim \operatorname{Normal}\bigl(\mu_D(80, p_0) - \mu_D(t, p_0),\ (80 - t)\sigma_{D_{\mathrm{joint}}}^2\bigr),
\]
where \(\mu_D(80, p_0) - \mu_D(t, p_0) = \sum_{k=1}^{K} \tilde{c}_k(p_0)(b_k(80) - b_k(t))\). We estimate the
\(\tilde{c}_k\)'s and obtain \(\hat{\sigma}_{D_{\mathrm{joint}}}\) by the relevant parts of (6)–(8) using all matches instead of
restricting to the matches that the home teams win. Then, for a new match \(l\) with a kickoff win
probability \(p_{l0}\) and observed score differential \(d_l(t)\) at time \(t\), (9) yields
\[
\operatorname{Prob}\bigl(W \mid D_l(t) = d_l(t), p_{l0}\bigr)
= \Phi\left(\frac{d_l(t) + \hat{\mu}_D(80, p_{l0}) - \hat{\mu}_D(t, p_{l0})}{\sqrt{80 - t}\,\hat{\sigma}_{D_{\mathrm{joint}}}}\right),
\]
where \(\hat{\mu}_D(80, p_{l0}) - \hat{\mu}_D(t, p_{l0}) = \sum_{k=1}^{K} \hat{\tilde{c}}_k(p_{l0})(b_k(80) - b_k(t))\) and \(\Phi\)
represents the cumulative distribution function of the standard normal distribution. Compared to
the split FDA method, the joint FDA method is more sensitive to the scoring events (see Section 4
for more details).
Now, let \(p_{\mathrm{split}}(t)\) and \(p_{\mathrm{joint}}(t)\) denote the in-game win probabilities at time \(t\) obtained by
the split FDA method and joint FDA method, respectively. Let \(w(t) = (80 - t)/40\); then, we use
the weighted average \(p(t) = w(t)p_{\mathrm{split}}(t) + (1 - w(t))p_{\mathrm{joint}}(t)\) to estimate the in-game win
probability at time \(t\) in the second half of the match when \(40 < t \le 80\).
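The joint FDA probability and the second-half blend can be sketched with the Python standard library alone. The function names and the numerical inputs are hypothetical. The handling of t = 80, where the variance is zero, is an assumption of this sketch: the outcome is read directly from the sign of the score differential, and a tie is counted as "not a win".

```python
import math


def standard_normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def joint_fda_probability(d_t, expected_gain, t, sigma_joint):
    """Prob(W | D(t) = d_t, p0) following (9).

    expected_gain -- fitted mean of D(80) - D(t) for the match's p0
    sigma_joint   -- estimated noise scale of the joint model
    """
    if t >= 80.0:
        # Assumption of this sketch: no time left, so the score decides.
        return 1.0 if d_t > 0 else 0.0
    z = (d_t + expected_gain) / (math.sqrt(80.0 - t) * sigma_joint)
    return standard_normal_cdf(z)


def blended_probability(p_split, p_joint, t):
    """Second-half estimate with w(t) = (80 - t) / 40 on the split method."""
    if not 40.0 < t <= 80.0:
        raise ValueError("blending applies only for 40 < t <= 80")
    w = (80.0 - t) / 40.0
    return w * p_split + (1.0 - w) * p_joint


# Hypothetical state at minute 60: home team leads by 4 points.
p_joint = joint_fda_probability(d_t=4.0, expected_gain=1.0, t=60.0, sigma_joint=1.5)
print(blended_probability(p_split=0.70, p_joint=p_joint, t=60.0))
```

The weight on the split FDA estimate falls linearly from 1 just after half-time to 0 at full time, so the final estimate is driven entirely by the score differential at t = 80.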

3. Results. We begin by considering appropriate choices for the functional match event
feature X(t). When a game is being viewed, there are often indications that one of the teams is
gaining an upper hand in the match. The variable X(t) is chosen to quantitatively reflect this
TABLE 1
Potential choices of event data, where all variables are measured with respect to the
home team, and larger values denote increasing superiority

Event feature   Description
X1(t)           tackle differential up to time t
X2(t)           tackle differential during the most recent 10 minutes at time t
X3(t)           missed tackle differential up to time t
X4(t)           missed tackle differential during the most recent 10 minutes at time t

sort of dominance as a predictor of winning the match. In Table 1 we propose several choices
that are intended to reflect dominance by the home team. All of the variables presented in
Table 1 are recorded with respect to the home team.
For clarity, a missed tackle is one where a player on the team of interest may have been
tackled, but the tackle was unsuccessful. Therefore, the missed tackle differential with
respect to the home team is favourable to the home team if the variable is positive. We
are not suggesting that the variables proposed in Table 1 are the best choices. For example,
Parmar et al. (2017) investigated key performance indicators in professional rugby league.
However, the variables in Table 1 are easy to calculate based on live match data. We imagine
that experts with detailed domain knowledge of rugby league may be able to propose
improved variables from the point of view of prediction.
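To make the cumulative and most-recent-10-minutes features in Table 1 concrete, the following sketch computes both from event timestamps. The input format (sorted lists of minutes at which a missed tackle is credited to each team under the convention above) and the half-open window (t − 10, t] are assumptions made for illustration, not details taken from the NRL data feed.

```python
from bisect import bisect_right


def count_up_to(times, t):
    """Number of events with timestamp <= t; `times` must be sorted ascending."""
    return bisect_right(times, t)


def differential_up_to(home_times, road_times, t):
    """Cumulative differential at minute t (home count minus road count)."""
    return count_up_to(home_times, t) - count_up_to(road_times, t)


def differential_recent(home_times, road_times, t, window=10.0):
    """Differential over the window (t - window, t]."""
    start = t - window
    home = count_up_to(home_times, t) - count_up_to(home_times, start)
    road = count_up_to(road_times, t) - count_up_to(road_times, start)
    return home - road


# Hypothetical event minutes for a single match.
home_times = [2.0, 5.5, 14.0, 21.0]
road_times = [3.0, 19.5]

x3_at_20 = differential_up_to(home_times, road_times, 20.0)   # cumulative form
x4_at_16 = differential_recent(home_times, road_times, 16.0)  # recent-window form
```

The same two functions cover the tackle differentials in Table 1 if tackle timestamps are supplied instead of missed-tackle timestamps.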
We tried all four event features in Table 1, and they impact the estimation procedure to
varying degrees. To illustrate the proposed methods we will, hereafter, use the variable
X3(t), the missed tackle differential up to time t, as the event data of interest, mainly
for two reasons. First, X1 and X3 have simpler forms of variance and covariance matrices,
which makes it easier to implement the proposed method. Second, X3 provides better results
than X1. For ease of notation, we denote X3(t) as X(t). We also emphasize that the choice
of the event data
impacts the modelling distribution (3) and the estimation equations given by (5)–(8).
The basis functions bk (t) introduced in (4) are cubic B-splines. For details on B-spline
approximation, see de Boor (2001). Specifically, we choose nine equally spaced knots over
the interval [0, 80] minutes, and this results in K = 11 cubic B-spline basis functions, as
depicted in Figure 2. This selection of knots and splines leads to flexible shapes that can be
used to express μX (t, W, p0 ) and μD (t, W, p0 ) in (4) and μD (t, p0 ) in Section 2.4.
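The relationship between nine knots and K = 11 basis functions can be checked with a small pure-Python sketch of the Cox-de Boor recursion. It assumes the common clamped construction in which each boundary knot is repeated so that the knot vector has 9 + 2 × 3 = 15 entries; the text does not state how the boundary knots were handled, so that detail and all function names are assumptions.

```python
def cubic_bspline_basis(x, n_knots=9, lower=0.0, upper=80.0, degree=3):
    """Values of all B-spline basis functions at x (Cox-de Boor recursion)."""
    if not lower <= x <= upper:
        raise ValueError("x must lie in [lower, upper]")
    step = (upper - lower) / (n_knots - 1)
    breakpoints = [lower + i * step for i in range(n_knots)]
    # Clamped knot vector: boundary knots repeated `degree` extra times.
    t = [lower] * degree + breakpoints + [upper] * degree

    # Degree-0 functions are indicators of the half-open knot intervals.
    basis = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if x == t[-1]:
        # Assign the right endpoint to the last non-empty interval.
        for i in range(len(t) - 2, -1, -1):
            if t[i] < t[i + 1]:
                basis[i] = 1.0
                break

    # Raise the degree one step at a time, skipping zero-width denominators.
    for d in range(1, degree + 1):
        new = []
        for i in range(len(t) - 1 - d):
            left = right = 0.0
            if t[i + d] > t[i]:
                left = (x - t[i]) / (t[i + d] - t[i]) * basis[i]
            if t[i + d + 1] > t[i + 1]:
                right = (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * basis[i + 1]
            new.append(left + right)
        basis = new
    return basis  # length n_knots + degree - 1


def mean_function(x, coefs):
    """Spline expansion sum_k coefs[k] * b_k(x) for given coefficients."""
    return sum(c * b for c, b in zip(coefs, cubic_bspline_basis(x)))
```

With the defaults the returned list has 11 entries, and the entries are non-negative and sum to one at every point of [0, 80], which is why a mean function written as a coefficient-weighted sum of these functions is a local weighted average of its coefficients.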

FIG. 2. Cubic B-spline basis functions defined on nine equally spaced knots over the interval [0, 80] minutes.

TABLE 2
Descriptive statistics of the scores corresponding to all 731 matches from the four regular seasons (2016–2019)
of the NRL

Variable                             Min Value    Max Value    Average    Std Dev
Home Team Score                      0            64           21.1       10.8
Road Team Score                      0            62           19.2       10.1
Score Differential wrt Home Team     −62          58           1.8        16.9

Before proceeding to estimation, it is good to have a sense of the data. In Tables 2 and 3 we
provide descriptive statistics of data collected from 731 NRL regular-season matches from
2016–2019. We observe that there is indeed a home-field advantage, as the average score
differential in favour of the home team is 1.8 points. We also observe that the average missed
tackle differential is positive which is also evidence of the home team advantage. The score
differential curves and the missed tackle differential curves for the 731 matches are plotted in
Figures 3 and 4, respectively. On average, it seems that both the score differential and missed
tackle differential are linear with respect to the time of the match. This is consistent with a
process whereby the better team separates itself from the weaker team in a consistent manner
over the course of a match.
Having specified the basis functions, the procedure in Section 2.3 requires the estimation
of the parameters σX , σD , and ρ, as specified in the multivariate normal distribution (3).
We use the data from the 552 matches in the first three seasons 2016–2018 to estimate the
parameters and fit the model. The fitted model will be used to predict the match outcomes in
the 179 matches from the 2019 season. The model validation will be discussed in Section 4.
We first restrict estimation to matches where the home team has won (i.e., W); 311 of the
552 training matches from the 2016–2018 seasons fit this criterion.
Based on the specification of the tuning parameter γ = 0.01, the chosen basis functions
and the determination of the ak and ck terms, we obtain
σ̂X = 1.33,
σ̂D = 2.06,
ρ̂ = 0.21.
These estimates appear to be sensible in terms of the descriptive statistics provided in Tables 2
and 3. In particular, we note a positive correlation ρ̂ which suggests that X(t) and D(t) tend
to work in tandem.
Using the training data (2016–2018 seasons) where the home team has not won (i.e., W̄),
there are 241 matches, and we similarly obtain
σ̂X = 1.31,
σ̂D = 2.00,
ρ̂ = 0.22.
When we use all 552 matches (i.e., W or W̄) in the training data set (2016–2018
seasons), we obtain
σ̂Djoint = 1.63.

TABLE 3
Descriptive statistics of the missed tackles (at the end of the match) corresponding to all 731 matches from the
four regular seasons (2016–2019) of the NRL

Variable                                     Min Value    Max Value    Average    Std Dev
Home Team Missed Tackle                      8            48           23.5       6.7
Road Team Missed Tackle                      8            44           22.5       6.2
Missed Tackle Differential wrt Home Team     −31          34           1.0        9.6

FIG. 3. The score differential curves for all the 731 matches from the four regular seasons (2016–2019) of the
NRL.
FIG. 4. The missed tackle differential curves for all the 731 matches from the four regular seasons (2016–2019)
of the NRL.

Our estimation procedure involves a tuning parameter γ. We select the tuning parameter
by fivefold cross-validation. Specifically, we randomly split the matches in the 2016–2018
seasons into five groups. Each group in turn is taken as a holdout test data set. On the
remaining groups, for a particular γ, we fit the model parameters (a, c, σX, σD, ρ), using
the split FDA method, and (c̃, σDjoint), using the joint FDA method. We then apply the split
FDA method and joint FDA method to estimate the home team win probability at time t on
the holdout data set. For a given match at time t, the prediction is considered to be correct
if the estimated in-game win probability is larger than 0.5 and the home team eventually
wins the match, or if it is smaller than 0.5 and the home team eventually loses. We repeat
this procedure over all matches in the test
set and all times to give the overall correct prediction rate. For both the split FDA and joint
FDA methods, the choice γ = 0.01 yields the highest average overall correct prediction rate
over the five groups.
In Figure 5 we show the estimated mean functions of the X and D
processes with γ = 0.01 and various kickoff probabilities of the home team winning p0. The
top and middle panels present the estimated mean functions of X and D, using the split FDA
method. We observe that the plots exhibit the expected behaviors. For example, in matches
where the home team wins, mean differentials in both X and D increase as the game
progresses. When a curve is wiggly, we attribute this to lack of data. For example, in the top right
plot where p0 = 0.8, there are not many matches where the home team is heavily favored and
they lose. The bottom panel of Figure 5 shows that, for the joint FDA method, in general,
the mean score differentials increase when the kickoff win probability is larger than 0.5 (i.e.,
p0 = 0.6 and 0.8) and decrease when p0 = 0.2 and 0.4.
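The fivefold selection of γ can be sketched generically as follows. The callables `fit` and `score` are placeholders for the model-fitting step and the overall correct prediction rate; they, the function names, and any candidate grid of γ values are hypothetical, since the text reports only the selected value γ = 0.01.

```python
import random


def k_fold_indices(n_matches, n_folds=5, seed=0):
    """Randomly partition match indices 0..n_matches-1 into n_folds groups."""
    indices = list(range(n_matches))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def select_gamma(matches, gammas, fit, score, n_folds=5, seed=0):
    """Return the gamma with the highest mean holdout score, plus all means.

    fit(train_matches, gamma) -> fitted model       (placeholder)
    score(model, test_matches) -> prediction rate   (placeholder)
    """
    folds = k_fold_indices(len(matches), n_folds, seed)
    mean_rate = {}
    for gamma in gammas:
        rates = []
        for holdout in folds:
            held = set(holdout)
            train = [m for i, m in enumerate(matches) if i not in held]
            test = [matches[i] for i in holdout]
            rates.append(score(fit(train, gamma), test))
        mean_rate[gamma] = sum(rates) / len(rates)
    best = max(mean_rate, key=mean_rate.get)
    return best, mean_rate
```

Fixing the seed keeps the same five groups for every candidate γ, so differences in the mean rate reflect the tuning parameter rather than the random split.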

4. Model validation. Obviously, there is a random component to sport, and this is part
of its appeal. If matches were perfectly predictable, then there would be no point in holding
sporting competitions. Therefore, our investigation in this section involves an assessment of
whether our predictions are reasonable; they cannot and should not be perfect predictions.
We should not use the same data to both fit models and carry out the model assessment.
We, therefore, fit our model using the first three seasons 2016–2018 of the event data and use
the fitted model to predict the match outcomes in the 2019 season for which there are 179
matches. We then compare the actual 2019 match outcomes with the predicted outcomes.
In Figure 6 we investigate the predictive capability of the split FDA method, joint FDA
method, and the proposed weighted method. We consider the estimated probability that the
home team wins at times t = 1, . . . , 75 for the 2019 data. It is sensible to only consider
predictions up to the 75th minute as many sportsbooks terminate in-match betting toward
the end of matches. A reason for this is that possession of the ball near the end of a close
match is critical and becomes more important than both X and D in the determination of
fair betting odds. Punters could exploit this situation. If an estimated probability exceeds 0.5,
then this indicates a prediction in favour of the home team. At time t we compare the 2019
match predictions with the actual match results and obtain the correct prediction rate. As
one would expect, Figure 6 demonstrates that the correct prediction rates, obtained by all
methods, improve as matches progress in time. This figure shows that the split FDA method
provides higher correct prediction rates than the joint FDA method for the first 40 minutes of
the game, whereas the joint FDA method performs better in the second half, especially when
the game approaches the end. We observe that the methods yield good results, exceeding 80%
accuracy by the 55th minute.
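The correct prediction rate at each minute can be computed as in the sketch below. The data layout (one probability curve per match, indexable by minute) and the treatment of an estimate of exactly 0.5 as a prediction against the home team are assumptions made for illustration.

```python
def correct_prediction_rate(win_probs, home_won):
    """Share of matches whose thresholded estimate agrees with the outcome.

    win_probs -- estimated home win probabilities at one fixed minute
    home_won  -- booleans, True where the home team eventually won
    """
    correct = sum((p > 0.5) == won for p, won in zip(win_probs, home_won))
    return correct / len(home_won)


def rate_by_minute(prob_curves, home_won, minutes=range(1, 76)):
    """Correct prediction rate at each minute t = 1, ..., 75."""
    return {
        t: correct_prediction_rate([curve[t] for curve in prob_curves], home_won)
        for t in minutes
    }


# Two hypothetical matches: the first is a home win predicted throughout, the
# second a home loss that the estimate only picks up from minute 40 onward.
curves = [[0.6] * 76, [0.55] * 40 + [0.4] * 36]
outcomes = [True, False]
rates = rate_by_minute(curves, outcomes)
```

Plotting the resulting dictionary against t gives a curve of the kind summarized in Figure 6.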
To investigate whether our estimated in-game win probabilities are reliable, we randomly
select four matches from the 2019 season where the home teams won. In Figure 7 the solid
curves are the predicted in-game win probabilities by the proposed method annotated with
scoring events (dashed vertical lines). Recall that the proposed method applies the split FDA
method in the first half of the match, whereas, in the second half of the match, a weighted
average of the estimates obtained by the split FDA method and the joint FDA method is
used. In comparison, we also include the estimated in-game win probabilities by the split
FDA method and joint FDA method. We can see that the joint FDA method is more sensitive
to scoring events. For example, the top right plot shows a match that the home team won.
When the home team scored in the 7th minute, the in-game win probability obtained by the
joint FDA method increased dramatically, whereas the proposed method was less sensitive to
the scoring event. Sensibly, we observe that the predicted win probabilities are impacted by
scoring (discontinuous jumps).

FIG. 5. Top panel: Estimated μ̂X(t, W, p0) and μ̂X(t, W̄, p0) using the split FDA method. Middle panel:
Estimated μ̂D(t, W, p0) and μ̂D(t, W̄, p0) using the split FDA method. Bottom panel: Estimated μ̂D(t, p0) using
the joint FDA method.

We compare the proposed weighted method to the Brownian motion model that was
proposed by Stern (1994) to study the scoring process of basketball. Stern (1994) estimated the
drift and variance parameters of the Brownian motion model by treating the game outcome
as a binary response and maximizing the probit regression likelihood. We use “BMM” to
represent the Brownian motion model proposed by Stern (1994). One limitation of the BMM
method is that the probability that a home team wins the match, when it leads by d points at
time t, is assumed to be the same for any basketball game. In comparison, our proposed method
incorporates the betting odds to account for the home team strength relative to the road team
strength, and thus, even if two games have the same score differentials at time t, the in-game
win probabilities for the home teams may not be the same. Figure 8(a) compares the correct
prediction rates obtained by the proposed method and the BMM approach. It is observed that
our proposed method outperforms the BMM approach in the first half of the match. It seems
that the difference in the correct prediction rate comes mainly from the incorporation of the
betting odds rather than from the event data. However, the event data make a difference in
the prediction of the instantaneous in-game win probabilities. For example, in Figure 8(b) we
present the predicted in-game probabilities by the proposed method and the BMM approach.
The results indicate that the BMM approach is sensitive to the scoring events. In addition,
because the BMM method does not use the information of the betting odds, the predicted
in-game win probabilities at time 0 are the same for all matches in 2019.

FIG. 6. The correct prediction rates for the 2019 NRL season obtained by the split FDA method (dashed curve),
the joint FDA method (dotted curve), and the proposed weighted method (solid curve). Note that the proposed
weighted method has the same result as the split FDA method in the first 40 minutes of the match.

FIG. 7. Predicted instantaneous in-game win probabilities by the proposed method (solid curves), the split FDA
method (long-dashed curves), and the joint FDA method (short-dashed curves) for four randomly selected matches
from the 2019 season where the home team won. The dashed horizontal lines indicate the values of 0, 0.5, and 1.
The dashed vertical lines indicate the times when the score changed.

FIG. 8. (a) The correct prediction rates for the 2019 NRL season obtained by the proposed weighted method
(solid curve) and the BMM approach (dashed curve). (b) Predicted instantaneous in-game win probabilities by
the proposed weighted method (solid curves) and the BMM approach (dashed curves) for one randomly selected
match from the 2019 season. The dashed horizontal lines indicate the values of 0, 0.5, and 1. The dashed vertical
lines indicate the times when the score changed.

To see the benefits obtained from the FDA approach, we compare the proposed method to
a naive approach suggested by a referee. The naive approach considers a minute-by-minute
logistic regression model using the betting odds, score differential, and the missed tackle dif-
ferential as covariates. The minute-by-minute fashion does not have a “functional” nature.
The results for the naive approach are reported in the Supplementary Material (Guan et al.
(2022)). The results indicate that the proposed method has a higher prediction accuracy than
the naive approach in the early stage of the match. Moreover, the naive approach is sensitive
to the scoring events, and the predicted in-game win probability, obtained by the naive ap-
proach, may still fluctuate even when the score differential and missed tackle differential do
not change. In comparison, the FDA approach smooths the mean curves for X(t) and D(t),
and, therefore, the predicted probabilities are more stable.
To see how X(t) impacts the estimation procedure, we consider two scenarios. In Sce-
nario I the split FDA method predicts the in-game win probabilities using both the event data
X and the score differential D, whereas in Scenario II, the split FDA method predicts the in-
game win probabilities using only the score differential D. For both scenarios the joint FDA
method uses only the score differentials. We select a match played on 6 April 2019 between
the Melbourne Storm (home) and the Canterbury-Bankstown Bulldogs. The half time score
is 6 (Storm)–12 (Bulldogs) and the full time score is 18 (Storm)–16 (Bulldogs). More details
about the match can be found at
https://www.nrl.com/draw/nrl-premiership/2019/round-4/storm-v-bulldogs/.

FIG. 9. Predicted instantaneous in-game win probabilities for the match Storm vs. Bulldogs on 6 April 2019 by
Scenario I (bold solid curve) and Scenario II (dark green dotted curve). The dots indicate the score differentials
of the match. The bars indicate the missed tackle differentials of the match.

In Figure 9 we present the predicted instantaneous in-game win probabilities for the match
under Scenario I and Scenario II together with the score differentials and missed tackle differ-
entials. The solid curve in Figure 9 represents the predictions obtained using Scenario I, and
the dotted curve represents the predictions based on Scenario II. The kickoff win probability
p0 = 0.85 indicates that the Storm were heavily favored. We can see from Figure 9 that the
road team scored in the sixth minute of the match, and after that, the predicted in-game win
probabilities based on D only (Scenario II) decreased to below 0.8. In contrast, the missed
tackle differentials remained positive for most of the first half of the match. This
indicates that, even though the Storm were trailing, there was reason to be hopeful that they
would turn the match around. We observe that the predicted in-game win probabilities based
on Scenario I are greater than those based on Scenario II for the entire game, except for the
short time interval between the 24th and 32nd minute. The example demonstrates the
added value in the event data X(t) through the superiority of Scenario I over Scenario II.

5. Discussion. We have developed a model that provides instantaneous in-game win
probabilities for the National Rugby League. The model has distributional components that
are informed by FDA techniques.
There are various future research directions associated with our work. First, the approach
is general and is applicable to other sports whenever suitable event data are available. Second,
there are obvious gambling questions that may be explored with respect to our predictions.
Finally, the choice of the functional event feature X(t) impacts our estimation procedure,
and we have focused on the missed tackle differential. We believe that experts with detailed
domain knowledge of rugby league may be able to propose better predictive choices for
X(t). Although we illustrate the use of univariate X(t), our methods can be extended to
multivariate settings.

Acknowledgments. T. Guan is an Assistant Professor at the Department of Mathematics and
Statistics, Brock University, and J. Cao and T. Swartz are Professors, Department of Statistics
and Actuarial Science, Simon Fraser University. R. Nguyen is a Ph.D. candidate, Department
of Statistics, School of Mathematics and Statistics, University of New South Wales. This
work was initiated while Nguyen visited the Department of Statistics and Actuarial Science at
Simon Fraser University. The authors would like to thank the Editor, the Associate Editor and
two anonymous referees for their constructive comments, which were very helpful in improving
the quality of this paper. We are particularly grateful to the NRL for providing the data.

Funding. Cao and Swartz have been partially supported by the Natural Sciences and
Engineering Research Council of Canada.

SUPPLEMENTARY MATERIAL
Supplementary document for “In-game win probabilities for the National Rugby
League” (DOI: 10.1214/21-AOAS1514SUPP; .pdf). We provide an investigation of the
Brownian motion assumption used in our model as well as a comparison of our model with a
naive approach for estimating in-game match probabilities.

REFERENCES
AINSWORTH, L. M., ROUTLEDGE, R. and CAO, J. (2011). Functional data analysis in ecosystem research: The
decline of Oweekeno Lake sockeye salmon and Wannock River flow. J. Agric. Biol. Environ. Stat. 16 282–300.
MR2818550 https://doi.org/10.1007/s13253-010-0049-z
ALBERT, J., GLICKMAN, M. E., SWARTZ, T. B. and KONING, R. H., eds. (2017). Handbook of Statistical
Methods and Analyses in Sports. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC
Press, Boca Raton, FL. MR3838291 https://doi.org/10.1201/9781315166070
BESSE, P. and RAMSAY, J. O. (1986). Principal components analysis of sampled functions. Psychometrika 51
285–311. MR0848110 https://doi.org/10.1007/BF02293986
BOOTH, M. and ORR, R. (2017). Time-loss injuries in sub-elite and emerging rugby league players. J. Sports Sci.
Med. 16 295–301.
BOSQ, D. (2000). Linear Processes in Function Spaces: Theory and Applications. Lecture Notes in Statistics 149.
Springer, New York. MR1783138 https://doi.org/10.1007/978-1-4612-1154-9
BUTTREY, S. E., WASHBURN, A. R. and PRICE, W. L. (2011). Estimating NHL scoring rates. J. Quant. Anal.
Sports 7 1–18.
CAI, T. T. and HALL, P. (2006). Prediction in functional linear regression. Ann. Statist. 34 2159–2179.
MR2291496 https://doi.org/10.1214/009053606000000830
CARDOT, H. (2000). Nonparametric estimation of smoothed principal components analysis of sampled noisy
functions. J. Nonparametr. Stat. 12 503–538. MR1785396 https://doi.org/10.1080/10485250008832820
CARDOT, H., FERRATY, F. and SARDA, P. (2003). Spline estimators for the functional linear model. Statist.
Sinica 13 571–591. MR1997162
CERVONE, D., D’AMOUR, A., BORNN, L. and GOLDSBERRY, K. (2016). A multiresolution stochastic
process model for predicting basketball possession outcomes. J. Amer. Statist. Assoc. 111 585–599. MR3538688
https://doi.org/10.1080/01621459.2016.1141685
CHEN, T. and FAN, Q. (2018). A functional data approach to model score difference process in professional
basketball games. J. Appl. Stat. 45 112–127. MR3736861 https://doi.org/10.1080/02664763.2016.1268106
CLAUSET, A., KOGAN, M. and REDNER, S. (2015). Safe leads and lead changes in competitive team sports.
Phys. Rev. E (3) 91 062815, 11. MR3491426 https://doi.org/10.1103/PhysRevE.91.062815
DE BOOR, C. (2001). A Practical Guide to Splines, Revised ed. Applied Mathematical Sciences 27. Springer,
New York. MR1900298
DELAIGLE, A. and HALL, P. (2012). Achieving near perfect classification for functional data. J. R. Stat. Soc. Ser.
B. Stat. Methodol. 74 267–286. MR2899863 https://doi.org/10.1111/j.1467-9868.2011.01003.x
FERRATY, F. and VIEU, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer
Series in Statistics. Springer, New York. MR2229687
GABBETT, T. J. (2005). Science of rugby league football: A review. J. Sports Sci. 23 961–976.
GABLE, A. and REDNER, S. (2012). Random walk picture of basketball scoring. J. Quant. Anal. Sports 8 1–20.
GLASSBROOK, D. J., DOYLE, T. L. A., ALDERSON, J. A. and FULLER, J. T. (2019). The demands of
professional rugby league match-play: A meta-analysis. Sports Medicine—Open 5 Article number: 24.
GUAN, T., NGUYEN, R., CAO, J. and SWARTZ, T. (2022). Supplement to “In-game win probabilities for the
National Rugby League.” https://doi.org/10.1214/21-AOAS1514SUPP
HALL, P. and HOROWITZ, J. L. (2007). Methodology and convergence rates for functional linear regression.
Ann. Statist. 35 70–91. MR2332269 https://doi.org/10.1214/009053606000000957
HASTIE, T. and MALLOWS, C. (1993). A statistical view of some chemometrics regression tools. Technometrics
35 140–143.
HORVÁTH, L. and KOKOSZKA, P. (2012). Inference for Functional Data with Applications. Springer Series in
Statistics. Springer, New York. MR2920735 https://doi.org/10.1007/978-1-4614-3655-3
HSING, T. and EUBANK, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to
Linear Operators. Wiley Series in Probability and Statistics. Wiley, Chichester. MR3379106
https://doi.org/10.1002/9781118762547
JACQUES, J. and PREDA, C. (2014). Model-based clustering for multivariate functional data. Comput. Statist.
Data Anal. 71 92–106. MR3131956 https://doi.org/10.1016/j.csda.2012.12.004
JAMES, G. M. and SUGAR, C. A. (2003). Clustering for sparsely sampled functional data. J. Amer. Statist. Assoc.
98 397–408. MR1995716 https://doi.org/10.1198/016214503000189
KAYHAN, V. O. and WATKINS, A. (2018). A data snapshot approach for making real-time predictions in
basketball. Big Data 6 96–112. https://doi.org/10.1089/big.2017.0054
KAYHAN, V. O. and WATKINS, A. (2019). Predicting the point spread in professional basketball in real time: A
data snapshot approach. Journal of Business Analytics 2 63–73.
KING, T., JENKINS, D. and GABBETT, T. (2009). A time-motion analysis of professional rugby league
match-play. J. Sports Sci. 27 213–219.
KOKOSZKA, P. and REIMHERR, M. (2017). Introduction to Functional Data Analysis. Texts in Statistical Science
Series. CRC Press, Boca Raton, FL. MR3793167
LEE, A. (1999). Applications: Modelling rugby league data via bivariate negative binomial regression. Aust. N.
Z. J. Stat. 14 141–152.
LENG, X. and MÜLLER, H.-G. (2006). Classification using functional data analysis for temporal gene expression
data. Bioinformatics 22 68–76. https://doi.org/10.1093/bioinformatics/bti742
LOCK, D. and NETTLETON, D. (2014). Using random forests to estimate win probability before each play of an
NFL game. J. Quant. Anal. Sports 10 197–205.
LUO, W., CAO, J., GALLAGHER, M. and WILES, J. (2013). Estimating the intensity of ward admission and its
effect on emergency department access block. Stat. Med. 32 2681–2694. MR3067415
https://doi.org/10.1002/sim.5684
MORRIS, J. S. (2015). Functional regression. Annu. Rev. Stat. Appl. 2 321–359.
PARMAR, N., JAMES, N., HUGHES, M., JONES, H. and HEARNE, G. (2017). Team performance indicators that
predict match outcome and points difference in professional rugby league. International Journal of
Performance Analysis in Sport 17 1044–1056.
PETTIGREW, S. (2015). Assessing the offensive productivity of NHL players using in-game win probabilities. In
Proceedings of the 9th MIT Sloan Sports Analytics Conference.
RAMSAY, J. O., HOOKER, G. and GRAVES, S. (2009). Functional Data Analysis with R and Matlab. Springer,
New York.
RAMSAY, J. O. and SILVERMAN, B. W. (2005). Functional Data Analysis, 2nd ed. Springer Series in Statistics.
Springer, New York. MR2168993
ROBBERECHTS, P., VAN HAAREN, J. and DAVIS, J. (2019). Who will win it? An in-game win probability model
for football. In Proceedings of the 6th Workshop on Machine Learning and Data Mining for Sports Analytics,
20 September 2019 page 13, Würzburg, Germany.
SEITZ, L. B., RIVIÈRE, M., DE VILLARREAL, E. S. and HAFF, G. G. (2014). The athletic performance of elite
rugby league players is improved after an 8-week small-sided game training intervention. J. Strength Cond.
Res. 28 971–975.
SONG, K., GAO, Y. and SHI, J. (2020). Making real-time predictions for NBA basketball games by combining
the historical data and bookmaker’s betting line. Phys. A 547 124411.
STERN, H. S. (1994). A Brownian motion model for the progress of sports scores. J. Amer. Statist. Assoc. 89
1128–1134.
ŠTRUMBELJ, E. and VRAČAR, P. (2012). Simulating a basketball match with a homogeneous Markov model and
forecasting the outcome. Int. J. Forecast. 28 532–542.
VRAČAR, P., ŠTRUMBELJ, E. and KONONENKO, I. (2016). Modeling basketball play-by-play data. Expert Syst.
Appl. 44 58–66.
WAND, M. P. and JONES, M. C. (1995). Kernel Smoothing. Monographs on Statistics and Applied Probability
60. CRC Press, London. MR1319818 https://doi.org/10.1007/978-1-4899-4493-1
WANG, J.-L., CHIOU, J.-M. and MÜLLER, H.-G. (2016). Review of functional data analysis. Annu. Rev. Stat.
Appl. 3 257–295.
WINDT, J., GABBETT, T. J., FERRIS, D. and KHAN, K. M. (2017). Training load–injury paradox: Is greater
preseason participation associated with lower in-season injury risk in elite rugby league players? Br. J. Sports
Med. 51 645–650. https://doi.org/10.1136/bjsports-2016-095973
YAO, F., MÜLLER, H.-G. and WANG, J.-L. (2005a). Functional data analysis for sparse longitudinal data. J.
Amer. Statist. Assoc. 100 577–590. MR2160561 https://doi.org/10.1198/016214504000001745
YAO, F., MÜLLER, H.-G. and WANG, J.-L. (2005b). Functional linear regression analysis for longitudinal data.
Ann. Statist. 33 2873–2903. MR2253106 https://doi.org/10.1214/009053605000000660
YUAN, M. and CAI, T. T. (2010). A reproducing kernel Hilbert space approach to functional linear regression.
Ann. Statist. 38 3412–3444. MR2766857 https://doi.org/10.1214/09-AOS772
