Soccer: is scoring goals a predictable Poissonian process?

A. Heuer,1 C. Müller,1,2 and O. Rubner1

1 Westfälische Wilhelms Universität Münster, Institut für physikalische Chemie, Corrensstr. 30, 48149 Münster, Germany
2 Westfälische Wilhelms Universität Münster, Institut für organische Chemie, Corrensstr. 40, 48149 Münster, Germany

The non-scientific event of a soccer match is analysed on a strictly scientific level. The analysis is based on the recently introduced concept of a team fitness (Eur. Phys. J. B 67, 445, 2009) and requires the use of finite-size scaling. A uniquely defined function is derived which quantitatively predicts the expected average outcome of a soccer match in terms of the fitness of both teams. It is checked whether temporary fitness fluctuations of a team hamper the predictability of a soccer match. To a very good approximation scoring goals during a match can be characterized as independent Poissonian processes with pre-determined expectation values. Minor correlations give rise to an increase of the number of draws. The non-Poissonian overall goal distribution is just a consequence of the fitness distribution among different teams. The limits of predictability of soccer matches are quantified. Our model-free classification of the underlying ingredients determining the outcome of soccer matches can be generalized to different types of sports events.

PACS numbers: 89.20.-a, 02.50.-r

arXiv:1002.0797v2 [physics.data-an] 3 Mar 2010

In recent years different approaches, originating from the physics community, have shed new light on sports events, e.g. by studying the behavior of spectators [1], by elucidating the statistical vs. systematic features behind league tables [2–4], by studying the temporal sequence of ball movements [5] or using extreme value statistics [6, 7] known, e.g., from finance analysis [8]. For the specific case of soccer matches different models have been introduced on phenomenological grounds [9–14]. However, very basic questions related, e.g., to the relevance of systematic vs. statistical contributions or the temporal fitness evolution are still open. It is known that the distribution of soccer goals is broader than a Poissonian distribution [7, 15, 16]. This observation has been attributed to the presence of self-affirmative effects during a soccer match [15, 16], i.e. an increased probability to score a goal depending on the number of goals already scored by that team.

In this work we introduce a general model-free approach which allows us to elucidate the outcome of sports events. Combining strict mathematical reasoning, appropriate finite-size scaling and comparison with actual data, all ingredients of this framework can be quantified for the specific example of soccer. A unique relation can be derived to calculate the expected outcome of a soccer match, and three hierarchical levels of statistical influence can be identified. As one application we show that the skewness of the distribution of soccer goals [7, 15, 16] can be fully related to fitness variations among different teams and does not require the presence of self-affirmative effects.

As data basis we take all matches in the German Bundesliga (www.bundesliga-statistik.de) between seasons 1987/88 and 2007/08 except for the year 1991/92 (in that year the league contained 20 teams). Every team plays 34 matches per season. Earlier seasons are not taken into account because the underlying statistical properties (in particular the number of goals per match) are somewhat different.

Conceptually, our analysis relies on recent observations in describing soccer leagues [17]: (i) The home advantage is characterized by a team-independent but season-dependent increase of the home team goal difference $c_{\mathrm{home}} > 0$. (ii) An appropriate observable to characterize the fitness of a team $i$ in a given season is the average goal difference (normalized per match) $\Delta G_i(N)$, i.e. the difference of the goals scored and conceded during $N$ matches. In particular it contains more information about the team fitness than, e.g., the number of points.
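The two observables introduced in (i) and (ii) can be illustrated with a minimal Python sketch. The match list, the team names and the function names `home_advantage` and `average_goal_difference` are hypothetical and only serve as an illustration; they are not taken from the Bundesliga data set or from Ref. [17].

```python
from collections import defaultdict

# Hypothetical toy results: (home_team, away_team, home_goals, away_goals).
matches = [
    ("A", "B", 2, 0),
    ("B", "C", 1, 1),
    ("C", "A", 0, 3),
    ("B", "A", 1, 2),
    ("C", "B", 2, 1),
    ("A", "C", 3, 1),
]

def home_advantage(matches):
    """Average goal difference seen from the home team (estimate of c_home)."""
    return sum(hg - ag for _, _, hg, ag in matches) / len(matches)

def average_goal_difference(matches):
    """Delta G_i(N): goals scored minus goals conceded per match, per team."""
    diff = defaultdict(int)
    played = defaultdict(int)
    for home, away, hg, ag in matches:
        diff[home] += hg - ag
        diff[away] += ag - hg
        played[home] += 1
        played[away] += 1
    return {team: diff[team] / played[team] for team in diff}

c_home = home_advantage(matches)
delta_G = average_goal_difference(matches)
```

In this toy league every team plays each opponent once at home and once away, so the home advantage cancels in the per-team averages and the values of `delta_G` sum to zero.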
FIG. 1: The correlation function $h(t)$, plotted against the match-day separation $t$. The average value of $h(t)$ is included (excluding the value for $t = 17$), yielding approx. 0.22 [17].

Straightforward information about the team behavior during a season can be extracted from correlating its match results from different match days. Formally, this is expressed by the correlation function $h(t) = \langle \Delta g_{ij}(t_0)\, \Delta g_{ik}(t_0 + t) \rangle$. Here $\Delta g_{ij} := g_i - g_j$ denotes the goal difference of a match of team $i$ vs. team $j$ with the final result $g_i : g_j$; $j$ and $k$ are the opponents of team $i$ at match days $t_0$ and $t_0 + t$. The home-away asymmetry can be taken into account by the transformation $\Delta g_{ij} \to \Delta g_{ij} \mp c_{\mathrm{home}}$, where the sign depends on whether team $i$ plays at home or away. The resulting function $h(t)$ is shown in Fig. 1. Apart from the data point for $t = 17$ one observes a time-independent positive plateau value. The absolute value of this constant corresponds to the variance $\sigma^2_{\Delta G}$ of $\Delta G_i$ and is thus a measure for the fitness variation in a league [17]. Furthermore, the lack of any decay shows that the fitness of a team is constant during the whole season. This result is fully consistent with the finite-size scaling analysis in Ref. [17], where additionally the fitness change between two seasons was quantified. The exception for $t = 17$ just reflects the fact that team $i$ is playing against the same team at days $t_0$ and $t_0 + 17$, yielding additional correlations between the outcome of both matches (see also below).

As an immediate consequence, the limit of $\Delta G_i(N)$ for large $N$, corresponding to the true fitness $\Delta G_i$, is well-defined. A consistent estimator for $\Delta G_i$, based on the information from a finite number of matches, reads

$$\Delta G_i = a_N\, \Delta G_i(N) \tag{1}$$

with $a_N \approx 1/[1 + 3/(N \sigma^2_{\Delta G})]$ [17]. For large $N$ the factor $a_N$ approaches unity and the estimation becomes error-free, i.e. $\Delta G_i(N) \to \Delta G_i$. For $N = 33$ one has $a_N = 0.71$ and the variance of the estimation error is given by $\sigma^2_{e,N} = (N/3 + 1/\sigma^2_{\Delta G})^{-1} \approx 0.06$ [17]. This statistical framework is known as regression toward the mean [18]. Analogously, introducing $\Sigma G_i(N)$ as the average sum of goals scored and conceded by team $i$ in $N$ matches, its long-time limit is estimated via $\Sigma G_i - \lambda = b_N (\Sigma G_i(N) - \lambda)$, where $\lambda$ is the average number of goals per match in the respective season. Using $\sigma^2_{\Sigma G} \approx 0.035$ one correspondingly obtains $b_{N=33} = 0.28$ [17].

Our key goal is to find a sound characterization of the match result when team $i$ is playing vs. team $j$, i.e. $\Delta g_{ij}$ or even $g_i$ and $g_j$ individually. The final outcome $\Delta g_{ij}$ has three conceptually different and uncorrelated contributions

$$\Delta g_{ij} = q_{ij} + f_{ij} + r_{ij}. \tag{2}$$

Averaging over all matches one can define the respective variances $\sigma_q^2$, $\sigma_f^2$ and $\sigma_r^2$.

(1) $q_{ij}$ expresses the average outcome which can be expected based on knowledge of the team fitness values $\Delta G_i$ and $\Delta G_j$, respectively. Conceptually this can be determined by averaging over all matches when teams with these fitness values play against each other. The task is to determine the dependence of $q_{ij} \equiv q(\Delta G_i, \Delta G_j)$ on $\Delta G_i$ and $\Delta G_j$.

(2) For a specific match, however, the outcome can be systematically influenced by different factors beyond the general fitness values using the variable $f_{ij}$ with a mean of zero: (a) External effects such as several players which are injured or tired, weather conditions (helping one team more than the other), or red cards. As a consequence the effective fitness of a team relevant for this match may differ from the estimation $\Delta G_i$ (or $\Delta G_j$). (b) Intra-match effects depending on the actual course of a match. One example is the suggested presence of self-affirmative effects, i.e. an increased probability to score a goal (equivalently an increased fitness) depending on the number of goals already scored by that team [15, 16]. Naturally, $f_{ij}$ is much harder to predict, if possible at all. Here we restrict ourselves to the estimation of its relevance via determination of $\sigma_f^2$.

(3) Finally, one has to understand the emergence of the actual goal distribution based on expectation values as expressed by the random variable $r_{ij}$ with average zero. This problem is similar to the physical problem when a decay rate (here corresponding to $q_{ij} + f_{ij}$) has to be translated into the actual number of decay processes.

Determination of $q_{ij}$: $q_{ij}$ has to fulfill the two basic conditions (taking into account the home advantage): $q_{ij} - c_{\mathrm{home}} = -(q_{ji} - c_{\mathrm{home}})$ (symmetry condition) and $\langle q_{ij} \rangle_j - c_{\mathrm{home}} = \Delta G_i$ (consistency condition), where the average is over all teams $j \neq i$ (in the second condition a minor correction due to the finite number of teams in a league is neglected). The most general dependence on $\Delta G_{i,j}$ up to third order, which is compatible with both conditions, is given by

$$q_{ij} = c_{\mathrm{home}} + (\Delta G_i - \Delta G_j)\,[1 - c_3(\sigma^2_{\Delta G} + \Delta G_i \Delta G_j)]. \tag{3}$$
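The numbers quoted for the regression toward the mean, and the two conditions imposed on $q_{ij}$, can be checked with a short Python sketch. It assumes that $b_N$ has the same functional form as $a_N$ with $\sigma^2_{\Sigma G}$ in place of $\sigma^2_{\Delta G}$, which is suggested but not spelled out in the text. The function names, the small set of fitness values and the values chosen for $c_{\mathrm{home}}$ and $c_3$ are hypothetical.

```python
SIGMA2_DG = 0.22   # variance of Delta G_i quoted in the text
SIGMA2_SG = 0.035  # variance of Sigma G_i quoted in the text

def shrinkage(n_matches, variance):
    """Regression-toward-the-mean factor 1 / (1 + 3 / (N * variance))."""
    return 1.0 / (1.0 + 3.0 / (n_matches * variance))

def error_variance(n_matches, variance):
    """Variance of the estimation error, (N/3 + 1/variance)^(-1)."""
    return 1.0 / (n_matches / 3.0 + 1.0 / variance)

def q(dg_i, dg_j, c_home, c3, sigma2):
    """Expected goal difference of Eq. 3 for home team i against team j."""
    return c_home + (dg_i - dg_j) * (1.0 - c3 * (sigma2 + dg_i * dg_j))

a_33 = shrinkage(33, SIGMA2_DG)
b_33 = shrinkage(33, SIGMA2_SG)   # assumption: same form as a_N
err_33 = error_variance(33, SIGMA2_DG)

# Hypothetical league: fitness values with mean zero.
fitness = [-0.6, -0.2, 0.2, 0.6]
sigma2 = sum(g * g for g in fitness) / len(fitness)
c_home, c3 = 0.5, 0.3  # arbitrary illustrative values

# Symmetry condition: q_ij - c_home = -(q_ji - c_home).
symmetry_gap = max(
    abs((q(x, y, c_home, c3, sigma2) - c_home)
        + (q(y, x, c_home, c3, sigma2) - c_home))
    for x in fitness for y in fitness
)

# Consistency condition: <q_ij>_j - c_home = Delta G_i. The average runs over
# the whole fitness distribution, i.e. the finite-size correction from
# excluding j = i is neglected, as in the text.
consistency_gap = max(
    abs(sum(q(x, y, c_home, c3, sigma2) for y in fitness) / len(fitness)
        - c_home - x)
    for x in fitness
)
```

Both gaps vanish up to rounding errors for any value of `c3`, which is why the two conditions leave $c_3$ as the only adjustable parameter at third order.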

Qualitatively, the $c_3$-term takes into account the possible effect that in case of very different team strengths (e.g. $\Delta G_i \gg 0$ and $\Delta G_j \ll 0$) the expected goal difference is even more pronounced ($c_3 > 0$: too much respect of the weaker team) or reduced ($c_3 < 0$: tendency of presumption of the better team). On a phenomenological level this effect is already considered in the model of, e.g., Ref. [12]. The task is to determine the adjustable parameter $c_3$ from comparison with actual data. We first rewrite Eq. 3 as $q_{ij} - (\Delta G_i - \Delta G_j) - c_{\mathrm{home}} = -c_3 (\Delta G_i - \Delta G_j)(\sigma^2_{\Delta G} + \Delta G_i \Delta G_j)$. In case that $\Delta G_{i,j}$ is known this would correspond to a straightforward regression problem of $\Delta g_{ij} - (\Delta G_i - \Delta G_j) - c_{\mathrm{home}}$ vs. $-(\Delta G_i - \Delta G_j)(\sigma^2_{\Delta G} + \Delta G_i \Delta G_j)$. An optimum estimation of the fitness values for a specific match via Eq. 1 is based on $\Delta G_{i,j}(N)$, calculated from the remaining $N = 33$ matches of both teams in that season. Of course, the resulting value of $c_3(N = 33)$ is still hampered by finite-size effects, in analogy to the regression towards the mean. This problem can be solved by estimating $c_3(N)$ for different values of $N$ and subsequent extrapolation to infinite $N$ in a $1/N$-representation. Then our estimation of $c_3$ is not hampered by the uncertainty in the determination of $\Delta G_{i,j}$. For a fixed $N \le 30$ the regression analysis is based on 50 different choices of $\Delta G_{i,j}(N)$ by choosing different subsets of $N$ matches to improve the statistics. The result is shown in Fig. 2. The estimated error results from performing this analysis individually for each season. Due to the strong correlations for different $N$-values the final error is much larger than suggested by the fluctuations among different data points. The data are compatible with $c_3 = 0$. Thus, we have shown that the simple choice

$$q_{ij} = \Delta G_i - \Delta G_j + c_{\mathrm{home}} \tag{4}$$
is the uniquely defined relation (neglecting irrelevant terms of 5th order) to characterize the average outcome of a soccer match. In practice the right side can be estimated via Eq. 1. This result implies that $h(t) = \langle (\Delta G_i - \Delta G_j)(\Delta G_i - \Delta G_k) \rangle = \sigma^2_{\Delta G} + \langle \Delta G_j \Delta G_k \rangle$, i.e. $h(t \neq 17) = \sigma^2_{\Delta G}$ and $h(t = 17) = 2\sigma^2_{\Delta G}$. This agrees very well with the data. Furthermore, the variance of the $q_{ij}$ distribution, i.e. $\sigma_q^2$, is by definition given by $2\sigma^2_{\Delta G} \approx 0.44$.

FIG. 2: Determination of $c_3$ by finite-size scaling. The plot shows $c_3(N)$ vs. $1/N$; extrapolated limit: $-0.026 \pm 0.18$.

Determination of $\sigma_f^2$: The above analysis does not contain any information about the match-specific fitness relative to $\Delta G_i - \Delta G_j$. For example $f_{ij} > 0$ during a specific match implies that team $i$ plays better than expected from $q_{ij}$. The conceptual problem is to disentangle the possible influence of these fitness fluctuations from the random aspects of a soccer match. The key idea is based on the observation that, e.g., for $f_{ij} > 0$ team $i$ will play better than expected in both the first and the second half of the match. In contrast, the random features of a match do not show this correlation. For the identification of $\sigma_f^2$ one defines

$$A = \left\langle \left( \Delta g^{(1)}_{ij}/b_1 - c_{\mathrm{home}} \right) \left( \Delta g^{(2)}_{ij}/b_2 - c_{\mathrm{home}} \right) \right\rangle_{ij}$$

where $\Delta g^{(1),(2)}_{ij}$ is the goal difference in the first and second half in the specific match, respectively, and $b_{1,2}$ the fraction of goals scored during the first and the second half, respectively ($b_1 = 0.45$; $b_2 = 0.55$). Based on Eq. 4 one has $\sigma_f^2 = A - 2\sigma^2_{\Delta G}$. Actually, to improve the statistics we have additionally used different partitions of the match (e.g. first and third quarter vs. second and fourth quarter). Numerical evaluation yields $\sigma_f^2 = -0.04 \pm 0.06$, where the error bar is estimated from individual averaging over the different seasons. Thus one obtains in particular $\sigma_f^2 \ll \sigma_q^2$, which renders match-specific fitness fluctuations irrelevant. Actually, as shown in [17], one can observe a tendency that teams which have lost 4 times in a row tend to play worse in the near future than expected by their fitness. Strictly speaking these streaks indeed reflect minor temporary fitness variations. However, the number of streaks is very small (less than 10 per season) and, furthermore, mostly of statistical nature. The same holds for red cards, which naturally influence the fitness but fortunately are quite rare. Thus, these extreme events are interesting in their own right but are not relevant for the overall statistical description. The negative value of $\sigma_f^2$ points towards anti-correlations between both partitions of the match. A possible reason is the observed tendency towards a draw, as outlined below.

FIG. 3: (a) Distribution $p(g_{i,j})$ of goals per team and match (empirical data) and the Poisson prediction if the different fitness values are taken into account (solid line, full Poisson estimation). Furthermore a Poisson estimation is included where only the home-away asymmetry is included (broken line, simple Poisson estimation). The quality of the predicted distribution is highlighted in (b), where the ratio of the estimated and the actual probability is shown.
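The half-time construction used for $\sigma_f^2$ can be written down in a few lines of Python. This is only a sketch under stated assumptions: the goal differences of the two halves are given from the point of view of the home team, so that $c_{\mathrm{home}}$ is always subtracted, and the function names `half_time_correlation` and `sigma_f2` as well as the two-match example are hypothetical. The default values of `b1` and `b2` are the fractions quoted in the text.

```python
def half_time_correlation(first_half, second_half, c_home, b1=0.45, b2=0.55):
    """A = <(dg1/b1 - c_home) * (dg2/b2 - c_home)>, averaged over matches.

    first_half and second_half hold the goal difference of each match in the
    first and second half, seen from the home team (an assumption of this sketch).
    """
    if len(first_half) != len(second_half) or not first_half:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for dg1, dg2 in zip(first_half, second_half):
        total += (dg1 / b1 - c_home) * (dg2 / b2 - c_home)
    return total / len(first_half)

def sigma_f2(first_half, second_half, c_home, sigma2_dg, b1=0.45, b2=0.55):
    """Variance of match-specific fitness fluctuations, A - 2 * sigma2_dg."""
    a = half_time_correlation(first_half, second_half, c_home, b1, b2)
    return a - 2.0 * sigma2_dg

# Two hypothetical matches with equal weights for both halves.
A_toy = half_time_correlation([1, 0], [1, -1], c_home=0.0, b1=0.5, b2=0.5)
f2_toy = sigma_f2([1, 0], [1, -1], c_home=0.0, sigma2_dg=0.22, b1=0.5, b2=0.5)
```

Dividing by `b1` and `b2` rescales each half to a full-match expectation, so that purely random goals, which are uncorrelated between the halves, drop out of the average and only the fitness-related terms remain.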

Determination of $r_{ij}$: The actual number of goals $g_{i,j}$ per team and match is shown in Fig. 3. The error bars are estimated based on binomial statistics. As discussed before, the distribution is significantly broader than a Poisson distribution, even if separately taken for the home and away goals [7, 15, 16]. Here we show that this distribution can be generated by assuming that scoring goals are independent Poissonian processes. We proceed in three steps. First, we use Eq. 4 to estimate the average goal difference for a specific match with fitness values estimated from the remaining 33 matches of each team. Second, we supplement Eq. 4 by the corresponding estimator for the sum of the goals $g_i + g_j$ given by $\Sigma G_i + \Sigma G_j - \lambda$ (with $\Sigma G_i$ the estimated average sum of goals scored and conceded by team $i$ and $\lambda$ the average number of goals per match). Together with Eq. 4 this allows us to calculate the expected number of goals for both teams individually. Third, we generate for both teams a Poissonian distribution based on the corresponding expectation values. The resulting distribution is also shown in Fig. 3 and perfectly agrees with the actual data up to 8 (!) goals. In contrast, if the distribution of fitness values is not taken into account, significant deviations are present. Two conclusions can be drawn. First, scoring goals is a highly random process. Second, the good agreement again reflects the fact that $\sigma_f^2$ is small because otherwise an additional broadening of the actual data would be expected. Thus there is no indication of a possible influence of self-affirmative effects during a soccer match [15, 16]. Because of the underlying Poissonian process the value of $\sigma_r^2$ is just given by the average number of goals per match ($\approx 3$).
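Why a league-wide goal distribution can be broader than a Poissonian one although every single team scores according to a Poisson process can be seen in a small Python sketch. The expected differences and sums for the three matches are hypothetical illustrative numbers, not values from the data set, and the helper names are assumptions of this sketch.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson process with given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def expected_goals(goal_diff, goal_sum):
    """Split expected difference and sum into the two teams' expected goals."""
    return (goal_sum + goal_diff) / 2.0, (goal_sum - goal_diff) / 2.0

def mixture_pmf(k, means):
    """Goal distribution pooled over teams with different expectation values."""
    return sum(poisson_pmf(k, m) for m in means) / len(means)

def mean_and_variance(pmf, k_max=60):
    """First two moments of a distribution on 0..k_max-1 (tail neglected)."""
    mean = sum(k * pmf(k) for k in range(k_max))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(k_max))
    return mean, var

# Hypothetical (expected goal difference, expected goal sum) for three matches.
matches = [(1.5, 3.0), (0.3, 2.8), (-0.6, 3.2)]
means = [m for diff, total in matches for m in expected_goals(diff, total)]

mix_mean, mix_var = mean_and_variance(lambda k: mixture_pmf(k, means))
single_mean, single_var = mean_and_variance(lambda k: poisson_pmf(k, mix_mean))
```

For a single Poisson process the variance equals the mean, whereas the pooled distribution has the variance of the expectation values added on top. In this toy example `mix_var` therefore exceeds `single_var` although no self-affirmative effect is built in.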
FIG. 4: (a) The probability distribution $p(\Delta g_{ij})$ of the goal difference per match (empirical data) together with its estimation based on independent Poisson processes of both teams. In (b) it is shown for different scores $n{:}n$ how the ratio $p_{\mathrm{est}}(n{:}n)/p(n{:}n)$ of the estimated and the actual number of draws differs from unity.

As already discussed in literature the number of draws is somewhat larger than expected on the basis of independent Poisson distributions; see, e.g., Refs. [10, 12]. As an application of the present results we quantify this statement. In Fig. 4 we compare the calculated distribution of $\Delta g_{ij}$ with the actual values. The agreement is very good except for $\Delta g_{ij} = -1, 0, 1$. Thus, the simple picture of independent goals of the home and the away team is slightly invalidated. The larger number of draws is balanced by a reduction of the number of matches with exactly one goal difference. More specifically, we have calculated the relative increase of draws for the different results. The main effect is due to the strong increase of more than 20% of the 0:0 draws. Note that the present analysis has already taken into account the fitness distribution for the estimation of this number. Starting from 3:3 the simple picture of independent home and away goals holds again.

The three major contributions to the final soccer result display a clear hierarchy, i.e. $\sigma_r^2 : \sigma_q^2 : \sigma_f^2 \approx 10^2 : 10^1 : 10^0$. $\sigma_f^2$, albeit well defined and quantifiable, can be neglected for two reasons. First, it is small as compared to the fitness variation among different teams. Second, the uncertainty in the prediction of $q_{ij}$ is, even at the end of the season, significantly larger (variance of the uncertainty: $2\sigma^2_{e,N=33} = 0.12$, see above). Thus, the limit of predictability of a soccer match is, beyond the random effects, mainly related to the uncertainty in the fitness determination rather than to match-specific effects. Thus, the hypothesis of a strictly constant team fitness during a season, even on a single-match level, cannot be refuted even for a data set comprising more than 20 years. In disagreement with this observation soccer reports in media often stress that a team played particularly well or badly. Our results suggest that there exists a strong tendency to relate the assessment too much to the final result, thereby ignoring the large amount of random aspects of a match.

In summary, apart from the minor correlations with respect to the number of draws soccer is a surprisingly simple game in statistical terms. Neglecting the minor differences between a Poissonian and binomial distribution and the slight tendency towards a draw, a soccer match is equivalent to two teams throwing a die. The number 6 means goal, and the number of attempts of both teams is fixed already at the beginning of the match, reflecting their respective fitness in that season.

More generally speaking, our approach may serve as a general framework to classify different types of sports in a three-dimensional parameter space, expressed by $\sigma_r^2$, $\sigma_q^2$, $\sigma_f^2$. This set of numbers, e.g., determines the degree of competitiveness [3]. For example, for matches between just two persons (e.g. tennis) one would expect that fitness fluctuations ($\sigma_f^2$) play a much bigger role and that for sports events with many goals or points (e.g. basketball) the random effects ($\sigma_r^2$) are much less pronounced, i.e. it is more likely that the stronger team indeed wins. Hopefully, the present work stimulates activities to characterize different types of sports along these lines.

We gratefully acknowledge helpful discussions with B. Strauss, M. Trede, and M. Tolan about this topic.

[1] I. Farkas, D. Helbing, and T. Vicsek, Nature 419, 131 (2002).
[2] E. Ben-Naim, S. Redner, and F. Vazquez, Europhys. Lett. 77, 30005/1 (2007).
[3] E. Ben-Naim and N. W. Hengartner, Phys. Rev. E 76, 026106/1 (2007).
[4] J. Wesson, The Science of Soccer (Institute of Physics Publishing, 2002).
[5] R. Mendes, L. Malacarne, and C. Anteneodo, Eur. Phys. J. B 57, 357 (2007).
[6] D. Gembris, J. Taylor, and D. Suter, Nature 417, 506 (2002).
[7] J. Greenhough, P. Birch, S. Chapman, and G. Rowlands, Physica A 316, 615 (2002).
[8] R. Mantegna and H. Stanley, Nature 376, 46 (1995).
[9] A. Lee, Chance 10, 15 (1997).
[10] M. Dixon and S. Coles, Appl. Statist. 46, 265 (1997).
[11] M. Dixon and M. Robinson, The Statistician 47, 523 (1998).
[12] H. Rue and O. Salvesen, The Statistician 49, 399 (2000).
[13] R. Koning, The Statistician 49, 419 (2000).
[14] S. Dobson and J. Goddard, European Journal of Operational Research 148, 247 (2003).
[15] E. Bittner, A. Nussbaumer, W. Janke, and M. Weigel, Europhys. Lett. 78, 58002/1 (2007).
[16] E. Bittner, A. Nussbaumer, W. Janke, and M. Weigel, Eur. Phys. J. B 67, 459 (2009).
[17] A. Heuer and O. Rubner, Eur. Phys. J. B 67, 445 (2009).
[18] S. Stigler, Statistics on the Table. The History of Statistical Concepts and Methods (Harvard University Press, 2002).
