Time Series
Sui Kai Wong and Anis Younes
École Polytechnique Fédérale de Lausanne
Abstract. We evaluate the use of the Duckworth-Lewis (D/L) method in limited-overs cricket matches. We first consider the time series of the runs scored in cricket matches and test the validity of the results predicted by the D/L method by examining the errors of the D/L predicted target score against the actual score. We then enhance the D/L table with results from our time series analysis and design an alternative model for truncated cricket matches. The enhanced D/L table is based on predictions of the time series model as well as cubic spline smoothing.
Part I
Introduction
The game of cricket is a popular sport in many Commonwealth countries. It is a bat-and-ball game in which two teams of 11 take turns to bat, and the team with the higher score wins. In traditional Test matches, cricket is played over 5 days, in which the teams bowl and bat twice. In an attempt to garner greater interest, the game's governing body, the International Cricket Council (ICC), introduced shortened versions of the game. In limited-overs cricket matches, target scores have to be predicted in case the game is cut short due to external events (rain, bad lighting, etc.). Duckworth and Lewis (1998) [1] introduced a table which extrapolates the target score based on the amount of resources left at the team's disposal. The Duckworth-Lewis method considers the number of wickets on hand and the number of overs remaining as resources. As the game progresses, the batting team uses up its resources, and in case of interruption its resource pool is diminished. At the end of the game, scores are scaled by a factor inversely proportionate to the resources allocated. Since its adoption by the ICC for limited-overs matches, the Duckworth-Lewis method has been the target of criticism [2] by the press and cricket teams alike. In this project we examine the validity of the Duckworth-Lewis method by pitting its predicted target runs against actual scores of One Day Internationals (ODIs). In this section we next provide a brief introduction to limited-overs cricket and the Duckworth-Lewis method. At the end of Part I we explain how we extracted the data and what methods we use in the time series investigation of the data. In Part II we examine the time series properties of the data that we have extracted. We first examine the runs, the raw score that we extracted, and try to fit a time series model to the score in one inning. Next, we use the D/L resource table as a non-parametric proxy for the expected consumption level
of resources to investigate the shortfall or surplus, if any, of the resource allocation. We examine the time series of errors (errors being the difference between the predicted Duckworth-Lewis score and the actual score). Once we have an appropriate model, we first use the errors model to predict a more accurate target run score, and finally we apply cubic spline smoothing to fill in target scores where we have missing empirical data. The end result is an enhanced Duckworth-Lewis table. Part III presents our conclusions and some closing remarks.
The most primitive approach is a comparison of run rates: the ratios of the runs to the number of overs played are compared, and the team with the higher run rate is declared the winner. This is a straight-line extrapolation of the truncated score to the full 50 overs. However, this is hardly the case in reality: the run rate does not evolve in a straight-line fashion with the overs, as batsmen manage their risk of dismissal. Typically, teams with wickets at their disposal will score more runs towards the end of the inning. The D/L method provides an alternative to the run rate method. It considers the number of wickets on hand and the number of overs remaining as resources. As the game progresses, the batting team uses up its resources, and in case of interruption its resource pool is diminished. At the end of the game, scores are scaled by a factor inversely proportionate to the resources allocated. For commercial reasons, only parts of the D/L formulation are disclosed. Partial information suggests that the full formula is an extension of a two-factor exponential equation [1]:

    Z(u, w) = Z₀ F(w) [1 − exp(−bu / F(w))],
where Z(u, w) denotes the average expected number of runs with u overs remaining and w wickets taken, and Z₀ F(w) represents the expected number of runs with w wickets taken as u tends to infinity. The parameters b, Z₀ and F(w) are to be estimated from match data. By including the wicket count, the D/L method is already more comprehensive, as it considers the style of play given that one has fewer wickets on hand, ceteris paribus. One huge advantage of the D/L method is that it is a generic model that players can use without technical know-how. However, the method is by no means foolproof. The comparison of ODI data Z(u, w) vis-à-vis test cricket data Z₀ F(w) means that the D/L method does not incorporate the time urgency of ODIs. Furthermore, it is unable to capture the streaks of good or bad form that batsmen undergo in matches, as a time series analysis would. Despite these considerations, we can assume that the D/L method is robust in modeling the resource allocation. Consider the example where the team batting first finished their inning uninterrupted, scoring 300 runs. The second inning started with the team batting second having 100% of their resources¹. After 20 overs and 2 dismissals, rain interrupted the game. Play later resumed, but the umpire declared that only 10 more overs could be played due to time constraints. The team had 67.3% of their resources left when the game was first stopped and 30.8% when play resumed; in other words, they lost 36.5% of their resources to the rain. They will be compensated at the end of their inning by scaling their score by a factor of 100/(100 − 36.5). Similarly, one can scale the first team's score of 300 down to 63.5% of its value, setting a target of 190.5 runs for the team batting second to surpass. The D/L Standard Edition table is given below [3].
¹ This depends on whether the game started out as normal or whether the umpire had already decided that the match would eventually be truncated. Had the umpire already predicted rain and seen fit that only 30 overs could be played, the team batting second would be considered to start with 73.5% of the resources.
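To make the formulation concrete, the following R sketch implements the two-factor exponential form above together with the rescaling used in the example. The parameter values Z0, b and the F(w) profile are hypothetical placeholders chosen for illustration only; the actual fitted values are proprietary and not disclosed.

# Published D/L form: Z(u, w) = Z0 * F(w) * (1 - exp(-b * u / F(w))).
# All parameter values below are hypothetical placeholders, not the ICC's.
Z0 <- 280                        # asymptotic run total with no wickets lost
b  <- 0.03                       # exponential decay rate per over
Fw <- c(1.00, 0.85, 0.70, 0.55, 0.42, 0.30,
        0.20, 0.12, 0.05, 0.01)  # F(w) for w = 0, 1, ..., 9 wickets taken

Z <- function(u, w) Z0 * Fw[w + 1] * (1 - exp(-b * u / Fw[w + 1]))

# Resources remaining (%) relative to a full 50-over inning, no wickets lost:
resource <- function(u, w) 100 * Z(u, w) / Z(50, 0)

# The compensation in the example: 36.5% of resources were lost to rain,
# so the first team's 300 runs are scaled down to 63.5% of their value:
300 * (100 - 36.5) / 100         # = 190.5, the par score for the second team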
D/L Standard Edition resource table (resources remaining, %). Each row corresponds to a number of wickets in hand; the entries run from 50 overs remaining (left) down to 1 over remaining (right).

Overs remaining: 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

10 wickets: 100.0 99.1 98.1 97.1 96.1 95.0 93.9 92.8 91.7 90.5 89.3 88.0 86.7 85.4 84.1 82.7 81.3 79.8 78.3 76.7 75.1 73.5 71.8 70.1 68.3 66.5 64.6 62.7 60.7 58.7 56.6 54.4 52.2 49.9 47.6 45.2 42.7 40.2 37.6 34.9 32.1 29.3 26.4 23.4 20.3 17.2 13.9 10.6 7.2 3.6
9 wickets: 93.4 92.6 91.7 90.9 90.0 89.1 88.2 87.3 86.3 85.3 84.2 83.1 82.0 80.9 79.7 78.5 77.2 75.9 74.6 73.2 71.8 70.3 68.8 67.2 65.6 63.9 62.2 60.4 58.6 56.7 54.8 52.8 50.7 48.5 46.3 44.1 41.7 39.3 36.8 34.2 31.6 28.9 26.0 23.1 20.1 17.0 13.8 10.5 7.1 3.6
8 wickets: 85.1 84.5 83.8 83.2 82.5 81.8 81.0 80.3 79.5 78.7 77.8 76.9 76.0 75.0 74.1 73.0 72.0 70.9 69.7 68.6 67.3 66.1 64.8 63.4 62.0 60.5 59.0 57.4 55.8 54.1 52.4 50.5 48.6 46.7 44.7 42.6 40.4 38.1 35.8 33.4 30.8 28.2 25.5 22.7 19.8 16.8 13.7 10.4 7.1 3.6
7 wickets: 74.9 74.4 74.0 73.5 73.0 72.5 72.0 71.4 70.9 70.3 69.6 69.0 68.3 67.6 66.8 66.0 65.2 64.4 63.5 62.5 61.6 60.5 59.5 58.4 57.2 56.0 54.7 53.4 52.0 50.6 49.1 47.5 45.9 44.1 42.3 40.5 38.5 36.5 34.3 32.1 29.8 27.4 24.8 22.2 19.4 16.5 13.5 10.3 7.0 3.6
6 wickets: 62.7 62.5 62.2 61.9 61.6 61.3 61.0 60.7 60.3 59.9 59.5 59.1 58.7 58.2 57.7 57.2 56.6 56.0 55.4 54.8 54.1 53.4 52.6 51.8 50.9 50.0 49.0 48.0 47.0 45.8 44.6 43.4 42.0 40.6 39.1 37.6 35.9 34.2 32.3 30.4 28.3 26.1 23.8 21.4 18.8 16.1 13.2 10.2 7.0 3.6
5 wickets: 49.0 48.9 48.8 48.6 48.5 48.4 48.3 48.1 47.9 47.8 47.6 47.4 47.1 46.9 46.6 46.4 46.1 45.8 45.4 45.1 44.7 44.2 43.8 43.3 42.8 42.2 41.6 40.9 40.2 39.4 38.6 37.7 36.8 35.8 34.7 33.5 32.2 30.8 29.4 27.8 26.1 24.2 22.3 20.1 17.8 15.4 12.7 9.9 6.8 3.5
4 wickets: 34.9 34.9 34.9 34.9 34.8 34.8 34.8 34.7 34.7 34.6 34.6 34.5 34.5 34.4 34.3 34.2 34.1 34.0 33.9 33.7 33.6 33.4 33.2 33.0 32.8 32.6 32.3 32.0 31.6 31.2 30.8 30.3 29.8 29.2 28.5 27.8 27.0 26.1 25.1 24.0 22.8 21.4 19.9 18.2 16.4 14.3 12.0 9.5 6.6 3.5
3 wickets: 22.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0 22.0 21.9 21.9 21.9 21.9 21.9 21.9 21.9 21.9 21.8 21.8 21.8 21.7 21.7 21.6 21.6 21.5 21.4 21.3 21.2 21.1 20.9 20.7 20.5 20.2 19.9 19.5 19.0 18.5 17.9 17.1 16.2 15.2 13.9 12.5 10.7 8.7 6.2 3.4
2 wickets: 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.9 11.8 11.8 11.8 11.7 11.6 11.5 11.4 11.2 10.9 10.5 10.1 9.4 8.4 7.2 5.5 3.2
1 wicket: 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.6 4.6 4.5 4.2 3.7 2.5
In our project, we interpret the proportion of runs scored in each over, given the number of wickets at hand, as the proportion of the resources consumed in that situation. Thus, cumulatively over an entire inning, all 100% of the resources will be consumed.
Part II
The proportion of resources consumed in over t of inning i is taken to be

    ( r_{i,t} / Σ_{s=1}^{O_i} r_{i,s} ) × 100%,

where r_{i,t} is the number of runs in over t of inning i, and O_i ≤ 50 is the number of overs played in inning i.
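As a minimal illustration of this normalisation in R (with toy per-over run vectors standing in for real innings):

# Toy per-over run vectors r_{i,t} standing in for two innings:
innings <- list(c(4, 6, 2, 10, 3), c(1, 8, 3, 5))

# Proportion of resources consumed in each over, per inning:
resource_used <- lapply(innings, function(r) 100 * r / sum(r))

sapply(resource_used, sum)  # each inning's proportions sum to 100%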
Figure 2: From Top to Bottom: Time Series Plot of Empirical Data, Auto-Correlation, Partial-Autocorrelation, and Spectral Plots.
A firsthand examination of our data demonstrates clear seasonality. This was expected: since we have concatenated the innings back-to-back, we expect a seasonality factor of 50 overs. We therefore difference the data, taking the seasonality factor into account, and look at the time series plots.
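The differencing and the diagnostic plots can be produced as below; this is a sketch assuming the concatenated per-over series is stored in a vector `emp` (the name suggested by the axis labels of Figure 3):

demp <- diff(emp, lag = 50)  # remove the 50-over seasonality (D = 1, s = 50)

par(mfrow = c(4, 1))
plot(demp[1:1000], type = "l", main = "Season-Differenced Series")
acf(demp,  lag.max = 100)    # spikes only at multiples of 50
pacf(demp, lag.max = 100)
spectrum(demp)               # peaks at the seasonal frequencies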
Figure 3: From Top to Bottom: Time Series Plot of the Season-Differenced Empirical Data, Auto-Correlation, Partial-Autocorrelation, and Spectral Plots.
The time series plot shows that the differencing did produce a stationary process, and the ACF and PACF plots clearly cut off at multiples of s = 50. We examine SARIMA (p, 0, q) × (0, 1, 0)₅₀ models for this empirical data (an ARMA (p, q) model for the differenced data). Notice that the spectrum plot reveals the seasonal frequencies. We use the Akaike Information Criterion (AIC) to choose the appropriate model. Although it tends to choose models that are more complicated, it is the most stringent criterion and will therefore be our information criterion of choice. We do not examine cases with more than 6 parameters to estimate, as we believe they are too complex and computationally expensive for our purposes.
SARIMA (p, 0, q) × (0, 1, 0)₅₀

         p = 0       p = 1       p = 2       p = 3       p = 4       p = 5
q = 0    14415.815   14401.09    14347.96    14348.47    14349.98    14351.57
q = 1    14404.41    14371.95    14348.76    14350.20    14279.01    14352.38
q = 2    14350.89    14347.78    14349.74    14351.86    14280.04    -
q = 3    14348.20    14349.75    14351.77    14352.67    -           -
q = 4    14349.64    14351.64    14353.62    -           -           -
q = 5    14351.63    14353.64    -           -           -           -
Table 1: SARIMA (p, 0, q) × (0, 1, 0)₅₀ Model AIC Values for the Empirical Data
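The AIC grid in Table 1 can be reproduced along the following lines; a sketch, again assuming the empirical series is in `emp`. Models with more than 6 parameters are skipped, and fits that fail are recorded as NA:

aic <- matrix(NA, nrow = 6, ncol = 6,
              dimnames = list(paste0("q=", 0:5), paste0("p=", 0:5)))
for (q in 0:5) for (p in 0:5) {
  if (p + q > 6) next  # at most 6 parameters, as discussed above
  fit <- try(arima(emp, order = c(p, 0, q),
                   seasonal = list(order = c(0, 1, 0), period = 50)),
             silent = TRUE)
  if (!inherits(fit, "try-error")) aic[q + 1, p + 1] <- fit$aic
}
which(aic == min(aic, na.rm = TRUE), arr.ind = TRUE)  # lowest-AIC (q, p) cell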
It seems that a SARIMA (4, 0, 1) × (0, 1, 0)₅₀ model best represents the empirical data, with the lowest AIC value of all models, and we do not see any improvement at larger orders. Here the choice of model is obvious and there is no trade-off; the downside is that the model is rather complex and its parameters cumbersome to calculate. The SARIMA (4, 0, 1) × (0, 1, 0)₅₀ model's parameters are given in the table below.

φ₁ = 1.0363 (0.0166)   φ₂ = 0.0686 (0.0239)   φ₃ = −0.0978 (0.0239)   φ₄ = −0.0411 (0.0167)   θ₁ = −1.000 (0.001)
Table 2: SARIMA (4, 0, 1) × (0, 1, 0)₅₀ Model Parameters (with standard errors) for the Empirical Data
We now examine the suitability of the fit with the following graphs, testing the properties of the residuals.
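The diagnostics below follow the appendix code; a sketch, assuming `emp` again holds the empirical series:

library(MASS)  # for cpgram()

fit41 <- arima(emp, order = c(4, 0, 1),
               seasonal = list(order = c(0, 1, 0), period = 50))

tsdiag(fit41, gof.lag = 50)            # residuals, their ACF, Ljung-Box p-values
qqnorm(fit41$residuals); abline(0, 1)  # normal Q-Q plot of the residuals
cpgram(fit41$residuals)                # cumulative periodogram of the residuals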
Figure 4: From Top to Bottom: Time Series Plot of Standardized Residuals, Auto-Correlation of Residuals, and p-values of the Ljung-Box Test for autocorrelation of the Residuals.
Figure 5: Normal Q-Q Plot of the Residuals and Cumulative Periodogram of the Residuals.
Examining the plot of the time series of residuals, it resembles white noise, and the ACF plot shows no autocorrelation among the residuals, consistent with the white noise property. The plot of p-values of the Ljung-Box test, however, reveals that the model residuals are not white noise: we have dependence amongst residuals between lags 10 and 20. The Q-Q plot testing the normality of the residuals shows a very poor fit to the data: the residuals do not seem to be normally distributed, but rather exhibit heavy left skewness and leptokurtic symptoms. Finally, although the cumulative periodogram of the residuals stays within the bounds, we cannot claim the white noise property of the residuals, and we therefore conclude that the model is a bad fit.
Therefore, with our SARIMA (4, 0, 1) × (0, 1, 0)₅₀ model we cannot predict the target percentage of resources, and hence the target runs per over. As we have explained, we have to account for the number of wickets in our prediction mechanism: since the number of runs per over also depends on the number of wickets left, we must account for that dimension as well. Duckworth and Lewis (1998) suggested a resource table for predicting the target number of runs based on both the overs left to play and the wickets left. We now move on to the centerpiece of our project: examining the validity of the Duckworth-Lewis method in predicting target run scores.
In this section we focus on examining the Duckworth-Lewis method for predicting target runs based on wickets and overs left. In particular, we look at individual runs per over rather than the cumulative runs stated in the Duckworth-Lewis table, because errors in the number of runs would eventually sum up, resulting in an autoregressive trend. The errors x_t(w) that we examine here are the differences between the resources per over in the empirical data and the resources per over as projected by the Duckworth-Lewis resource table:

    x_t(w) = R_t(w) − R̂_t(w),

where R_t(w) is the percentage of available resources used in over t, given w wickets left, in the empirical data, and R̂_t(w) is the target percentage of resources in over t, given w wickets left, from the Duckworth-Lewis table. By referring to the D/L resource table, we assume the D/L table is robust in representing the mean resources used in the over given the wicket count. Thus, our approach here investigates both the time series properties of the errors and the overall goodness of fit of the D/L table. Another implication of taking the difference is that the errors are independent of the number of wickets on hand. A firsthand examination of our data demonstrates clear seasonality; this was expected, as we have concatenated the innings back-to-back, so we expect a seasonality factor of 50 overs.
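Constructing the error series is then a matter of subtracting two aligned per-over resource series; a sketch with hypothetical vector names (`R_emp` for the empirical resources and `R_dl` for the D/L projections, both aligned over the concatenated innings):

# x_t(w) = R_t(w) - Rhat_t(w), per over of the concatenated innings;
# R_emp and R_dl are hypothetical names for the two aligned series.
ind <- R_emp - R_dl   # `ind` is the error series used in the appendix code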
Figure 6: Time Series Plot of x_t(w) and a zoomed-in plot demonstrating the clear seasonality of 50 overs.
In order to account for this seasonality, we perform a first-order differencing at the seasonality factor; in terms of the SARIMA factors, D = 1 and s = 50. Next, we examine the plots of the seasonally differenced data and its Auto-Correlation and Partial Auto-Correlation plots.
Figure 7: From Top to Bottom: Time Series Plot of the Season-Differenced x_t(w), Auto-Correlation, Partial-Autocorrelation, and Spectral Plots.
The top plot reveals that the differencing produces a more stationary-looking process. Both the ACF and PACF plots clearly tail off at multiples of s = 50; the ACF demonstrates significant autocorrelations up to a lag of 3, while the PACF tails off at a lag of 2. We therefore examine SARIMA (p, 0, q) × (0, 1, 0)₅₀ models with p and q values around those mentioned; in other words, this is just an ARMA (p, q) model for the differenced data. One should take note of the spectrum plot, which demonstrates the seasonality at the seasonal frequencies.
We compute the AIC of the possible SARIMA (p, 0, q) × (0, 1, 0)₅₀ models; the choice of the AIC is for the same reasons mentioned in Section 1.
SARIMA (p, 0, q) × (0, 1, 0)₅₀

         p = 0       p = 1       p = 2       p = 3       p = 4       p = 5
q = 0    14518.74    14491.96    14395.93    14382.95    14380.08    14377.65
q = 1    14499.01    14381.39    14371.75    14360.93    14362.41    14364.25
q = 2    14414.56    14374.33    14368.56    14368.99    14364.56    -
q = 3    14400.49    14360.57    14362.28    14316       -           -
q = 4    14392.56    14362.29    14364.84    -           -           -
q = 5    14386.57    14364.25    -           -           -           -
Table 3: SARIMA (p, 0, q) × (0, 1, 0)₅₀ Model AIC Values for the Duckworth-Lewis Errors
Combining the results of the above table with what we deduced from the ACF and PACF plots, the SARIMA (3, 0, 3) × (0, 1, 0)₅₀ has the lowest AIC value. We nevertheless choose the SARIMA (1, 0, 3) × (0, 1, 0)₅₀ model, which has the second lowest AIC value of all models and is much simpler, to represent the D/L errors. The choice here is a trade-off between a more complex model, the SARIMA (3, 0, 3) × (0, 1, 0)₅₀, and a simpler one, the SARIMA (1, 0, 3) × (0, 1, 0)₅₀, which still represents the data well. Moreover, when fitted in R, the estimated variance of the residuals of the SARIMA (3, 0, 3) × (0, 1, 0)₅₀ is degenerate (NaN), which further supports choosing the simpler model. The SARIMA (1, 0, 3) × (0, 1, 0)₅₀ model's parameters are given in the table below.

φ₁ = 0.9142 (0.0229)   θ₁ = −0.8596 (0.0287)   θ₂ = 0.0990 (0.0221)   θ₃ = −0.0760 (0.0179)
Table 4: SARIMA (1, 0, 3) × (0, 1, 0)₅₀ Model Parameters (with standard errors) for the Duckworth-Lewis Errors
In other words, our model is the following:

    φ(B)(1 − B⁵⁰) x_t(w) = θ(B) ε_t,

where ε_t ~ N(0, σ²) i.i.d. and B is the backshift operator. We now examine the suitability of the fit with the following graphs, testing the properties of the residuals up to lag 50, which is the period of our seasonality (50 overs per inning).
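The chosen model can be fitted and used for prediction exactly as in the appendix code; a sketch, assuming the error series is in `ind`:

fit13 <- arima(ind, order = c(1, 0, 3),
               seasonal = list(order = c(0, 1, 0), period = 50))
fit13$coef                            # ar1, ma1, ma2, ma3 estimates

# One-inning-ahead forecast of the D/L errors:
pred <- predict(fit13, n.ahead = 50)
pred$pred                             # predicted errors for the next 50 overs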
Figure 8: Model Fit (From Top to Bottom): Time Series Plot of Standardized Residuals, Auto-Correlation of Residuals, and p-values of the Ljung-Box Test for autocorrelation of the Residuals.
Figure 9: Normal Q-Q Plot of the Residuals and Cumulative Periodogram of the Residuals.
Examining the plot of the time series of residuals, we can see that it looks like stationary white noise. The ACF plot shows no autocorrelation among the residuals, consistent with the white noise property. The plot of p-values of the Ljung-Box test reveals that the model residuals are indeed white noise (with a slight exception at lags 10 to 14). We also examine the Q-Q plot to test for normality of the residuals: its linearity shows an excellent fit to the data. The ordered values of the residuals and the theoretical quantiles seem almost perfectly aligned with a normal distribution with slightly fatter tails, but this slight leptokurtosis is insignificant. Finally, the cumulative periodogram of the residuals stays within the bounds, suggesting they are indeed white noise.
To go further and check whether the residuals follow a t-distribution, we examine the Q-Q plot of the residuals against the theoretical quantiles of a t-distribution with 3 degrees of freedom, which seems a slightly better fit to the residuals than a normal distribution, although outliers remain significant.
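A deterministic variant of the appendix's Q-Q comparison (which simulates t(3) quantiles with rt()) uses exact quantiles instead; a sketch, assuming `fit13` is the fitted model above:

res <- fit13$residuals
qqplot(qt(ppoints(length(res)), df = 3), res,   # exact t(3) quantiles
       main = "t(3) Q-Q Plot",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1)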
Figure 10: Student t-distribution with 3 Degrees of Freedom Q-Q Plot of Residuals.
Our SARIMA (1, 0, 3) × (0, 1, 0)₅₀ model therefore fits the per-over resource errors of the Duckworth-Lewis method extremely well, and we will use this model to predict the discrepancy of the D/L method and hence predict a more realistic score based on the empirical data.
Figure 11: Top, Left to Right: Theoretical D/L Cumulative Resources, Empirical D/L Cumulative Resources. Bottom: Heat map of the difference.
Notice the dark shades, especially in the off-diagonal resources, where the D/L method fails to replicate the empirical data.
Using the fitted model to fill in missing values and smoothing each wicket column with cubic splines, we obtain the Enhanced D/L table of resources. We also plot, for comparison, the Empirical Cumulative D/L table of resources and the Enhanced Cumulative D/L table of resources (see Figure 12); the heat map displays the differences between the two plots (darker shades translate to larger differences).
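The per-column cubic spline smoothing is as in the appendix code; a sketch, assuming `newDL` is the 50 × 10 filled-in empirical table (overs by wickets):

smoothDL <- matrix(nrow = 50, ncol = 10)
for (w in 1:10)
  smoothDL[, w] <- smooth.spline(newDL[, w])$y  # smooth each wicket column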
Figure 12: Top, Left to Right: Enhanced D/L Cumulative Resources, Empirical D/L Cumulative Resources. Bottom: Heat map of the difference.
Notice the dominance of very light shades in this heat map compared with the heat map of the difference with the Theoretical D/L table, where darker shades dominate. This shows that our Enhanced D/L table replicates the empirical data better than the theoretical D/L table; the improvement is most noticeable in the off-diagonal resources. Our Enhanced D/L table values are given below.
Enhanced D/L resource table (resources remaining, %). Each row corresponds to a number of wickets in hand; the entries run from 50 overs remaining (left) down to 1 over remaining (right).

Overs remaining: 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

10 wickets: 100.0 98.5 97.1 95.5 94.0 92.3 90.6 88.8 87.0 85.3 83.5 81.8 80.1 78.5 76.9 75.4 73.9 72.5 71.2 69.9 68.6 67.2 65.7 64.1 62.5 61.0 59.5 58.1 56.7 55.4 53.8 52.1 50.2 48.2 46.3 44.6 43.2 41.8 40.5 38.9 36.9 34.4 31.5 28.3 24.8 20.9 16.9 13.1 9.2 4.7
9 wickets: 93.1 92.0 90.9 89.7 88.5 87.2 85.8 84.4 82.9 81.4 79.8 78.2 76.6 74.9 73.3 71.6 70.0 68.4 66.9 65.4 64.0 62.6 61.1 59.6 58.1 56.5 54.9 53.2 51.5 49.7 48.0 46.1 44.3 42.4 40.6 38.7 36.9 35.1 33.1 30.8 28.2 25.3 22.2 19.0 16.0 13.2 10.5 7.7 5.1 2.8
8 wickets: 84.2 83.2 82.2 81.3 80.3 78.9 77.6 76.6 75.2 73.9 72.7 71.7 70.6 69.2 67.8 66.6 65.4 64.1 62.8 61.3 60.0 58.7 57.3 56.0 54.8 53.4 52.0 50.6 49.0 47.3 45.6 43.9 42.4 40.9 39.4 37.7 35.7 33.8 32.3 30.9 29.4 27.4 25.0 22.8 20.9 18.8 16.6 13.6 9.6 5.0
7 wickets: 77.5 76.7 76.1 75.3 74.5 73.7 72.9 71.9 71.1 70.1 68.6 67.9 66.9 65.5 64.7 63.6 62.5 61.3 60.3 58.8 57.3 55.8 54.4 53.2 51.9 50.3 48.6 47.0 45.4 43.7 42.2 40.1 38.7 37.0 35.5 33.7 32.3 30.6 28.9 27.1 25.3 23.5 21.4 18.8 16.7 14.6 11.6 9.0 6.5 3.2
6 wickets: 63.3 63.0 62.5 62.0 61.5 61.1 60.6 60.1 59.5 58.8 58.2 57.5 56.8 56.2 55.7 55.3 54.5 53.2 52.2 51.0 49.9 48.6 47.1 45.8 44.6 43.2 41.8 40.3 38.8 37.3 35.7 34.0 32.3 30.7 29.2 27.6 26.1 24.7 23.0 21.2 19.4 17.7 15.7 13.6 11.5 9.1 6.3 3.4 1.5 0.8
5 wickets: 60.2 60.1 59.9 59.6 59.4 59.3 59.1 58.8 58.5 58.3 58.0 57.6 57.2 56.8 56.4 56.0 55.6 55.2 54.8 53.8 52.3 50.8 49.5 48.2 46.9 45.4 43.9 42.6 41.2 39.7 38.3 36.9 35.4 33.7 32.0 30.3 28.8 27.5 26.2 24.7 23.3 21.8 20.3 18.7 16.8 14.4 11.8 9.0 5.8 2.5
4 wickets: 52.2 52.2 52.2 52.2 52.1 52.1 52.1 51.9 51.9 51.7 51.7 51.6 51.6 51.4 51.3 51.1 50.9 50.8 50.6 50.3 50.1 49.0 48.7 46.9 45.1 44.4 42.6 40.8 40.0 39.1 37.7 36.8 34.1 32.8 32.0 30.1 28.1 27.0 25.6 24.0 22.6 21.3 19.9 18.2 16.8 14.6 12.7 10.2 6.9 2.9
3 wickets: 36.5 36.5 36.5 36.5 36.5 36.5 36.5 36.5 36.5 36.5 36.5 36.5 36.4 36.4 36.4 36.4 36.4 36.4 36.4 36.3 36.2 36.2 36.2 36.1 36.0 35.9 35.9 35.7 35.6 35.3 34.5 33.5 32.7 31.4 29.7 28.1 26.7 25.3 23.7 21.8 20.3 19.2 18.1 16.6 14.8 12.8 10.8 8.8 6.4 3.7
2 wickets: 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.8 30.8 30.8 30.9 30.8 30.4 29.7 28.6 27.1 25.5 23.9 22.4 20.8 19.2 17.3 15.4 13.5 11.5 9.6 7.6 5.6 3.5
1 wicket: 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.7 8.8 8.8 8.8 8.6 8.3 7.6 6.7 5.5 4.1 2.6 1.0
Part III
Conclusion
In this project we looked to examine the Duckworth-Lewis method, which predicts a target run score in limited-overs cricket. We first examined the time series properties of normalized runs and found that a SARIMA (4, 0, 1) × (0, 1, 0)₅₀ was not sufficient to model the data, so a more complex model might be needed. This was expected, since the model accounted for runs using overs (time) as the only dimension, whereas we need to account for the number of wickets. The Duckworth-Lewis method provides a semi-parametric table that allows for predicting target run scores in the event of the truncation of a match. We therefore next examined the time series of errors, the differences between the runs per over in the empirical data and the runs per over as projected by the Duckworth-Lewis resource table. We find that a SARIMA (1, 0, 3) × (0, 1, 0)₅₀ model fits the errors extremely well. We then used this model to fill in missing values in our empirical D/L table. Finally, we smoothed the table per column of wicket resources using cubic splines to obtain an enhanced D/L table, which seems to mimic empirical data better than the theoretical D/L table.

We observed that there is indeed a time series effect on the gameplay. As such, we can use the fitted model to predict final scores. An alternative to using the resource table is to use the time series model directly to predict the final score; the time series model would incorporate the dynamics of that specific match, and at that point the run rate would also play a part in determining the result.

We can extend the time series approach to predict test cricket scores. As mentioned above, test cricket is played over 5 days. Each team plays 2 innings of an unlimited number of overs each (instead of one 50-over inning) over the 5 days, and the winner is determined by the final accumulated run counts. After 5 days of play, there are many instances where the game still ends in a draw due to lack of time. For example, consider that team A has batted twice and team B is batting its second inning nearing the end of the fifth day. Team B is trailing by 400 runs with only 1 wicket on hand. Considering that the last batsmen are usually the weakest, it is almost impossible for team B to win the match. However, if the last batsmen can hold out until the end of the day, they can still salvage a draw despite team A's commanding lead. Suppose we can model the fall of wickets as a stopping time or waiting time problem; then we can extend the ARMA model to predict the expected number of runs at the expected stopping time.

We can also expand the set of resources to incorporate powerplays. Powerplays are segments of the game when a restriction is imposed on the fielding team: fielding restrictions limit the number of players outside a certain radius on the field, which promotes aggressive batting and scoring. The first 10 overs of an inning are a mandatory period of powerplay, and both teams must call for another 5 overs of powerplay whenever they deem it advantageous. As there is a more aggressive style of play during these powerplays, these 20 overs can be considered another set of resources and used to further enhance the D/L table.

An interesting investigation would be to backtest the different resource tables. We can evaluate whether the match results would differ given truncations of various lengths at various stages of the game. Due to the multiple permutations of truncations we may have, there are bound to be some cases where a table fails to declare the ultimate winner.
We understand that both tables are not comprehensive, but their main advantage, ease of implementation, remains. Using the time series model to predict the final score for truncated matches may be less transparent and more computationally demanding.
References
[1] F. C. Duckworth and A. J. Lewis, "A fair method for resetting the target in interrupted one-day cricket matches," The Journal of the Operational Research Society, vol. 49, no. 3, pp. 220-227, 1998. [Online]. Available: http://www.jstor.org/stable/3010471

[2] R. Ramachandran, "For a fair formula," November 23 - December 6, 2002. [Online]. Available: http://www.hinduonnet.com/fline/fl1924/stories/20021206004410400.htm

[3] International Cricket Council, "Duckworth-Lewis Standard Edition table." [Online]. Available: http://icc-cricket.yahoo.net

[4] A. Davison, Time Series, Mathematics Institute (MATHAA), École Polytechnique Fédérale de Lausanne, 2011.
Appendix
Sample R code:
## Time Series plots of D/L Errors ##
par(mfrow=c(2,1))
plot(ind, type="l", main="D/L Error Time Series",
     xlab="Overs Played", ylab="D/L Error")
plot(ind[1:300], type="l", main="Zoomed D/L Error Time Series",
     xlab="Overs Played", ylab="D/L Error")
x11()
par(mfrow=c(4,1))
plot(diff(ind, lag=50)[1:1000], type="l",
     main="Season Differenced D/L Error Time Series",
     xlab="Overs Played", ylab="D/L Error")
acf(diff(ind, lag=50), lag.max=100, main="Season Differenced D/L Error ACF")
pacf(diff(ind, lag=50), lag.max=100, main="Season Differenced D/L Error PACF")
spectrum(diff(ind, lag=50), main="Season Differenced D/L Error Spectrum")

## Test of Fit: AIC values of the candidate SARIMA models ##
# MA-only fits, q = 0, ..., 5
tIND0=arima(ind, order=c(0,0,0), seasonal=list(order=c(0,1,0), period=50))
tIND1=arima(ind, order=c(0,0,1), seasonal=list(order=c(0,1,0), period=50))
tIND2=arima(ind, order=c(0,0,2), seasonal=list(order=c(0,1,0), period=50))
tIND3=arima(ind, order=c(0,0,3), seasonal=list(order=c(0,1,0), period=50))
tIND4=arima(ind, order=c(0,0,4), seasonal=list(order=c(0,1,0), period=50))
tIND5=arima(ind, order=c(0,0,5), seasonal=list(order=c(0,1,0), period=50))
tIND0$aic; tIND1$aic; tIND2$aic; tIND3$aic; tIND4$aic; tIND5$aic
# AR-only fits, p = 0, ..., 5
tINDA0=arima(ind, order=c(0,0,0), seasonal=list(order=c(0,1,0), period=50))
tINDA1=arima(ind, order=c(1,0,0), seasonal=list(order=c(0,1,0), period=50))
tINDA2=arima(ind, order=c(2,0,0), seasonal=list(order=c(0,1,0), period=50))
tINDA3=arima(ind, order=c(3,0,0), seasonal=list(order=c(0,1,0), period=50))
tINDA4=arima(ind, order=c(4,0,0), seasonal=list(order=c(0,1,0), period=50))
tINDA5=arima(ind, order=c(5,0,0), seasonal=list(order=c(0,1,0), period=50))
tINDA0$aic; tINDA1$aic; tINDA2$aic; tINDA3$aic; tINDA4$aic; tINDA5$aic
# p = 1, q = 0, ..., 5
tINDA10=arima(ind, order=c(1,0,0), seasonal=list(order=c(0,1,0), period=50))
tINDA11=arima(ind, order=c(1,0,1), seasonal=list(order=c(0,1,0), period=50))
tINDA12=arima(ind, order=c(1,0,2), seasonal=list(order=c(0,1,0), period=50))
tINDA13=arima(ind, order=c(1,0,3), seasonal=list(order=c(0,1,0), period=50))
tINDA14=arima(ind, order=c(1,0,4), seasonal=list(order=c(0,1,0), period=50))
tINDA15=arima(ind, order=c(1,0,5), seasonal=list(order=c(0,1,0), period=50))
tINDA10$aic; tINDA11$aic; tINDA12$aic; tINDA13$aic; tINDA14$aic; tINDA15$aic
# p = 2, q = 1, ..., 5
tINDA21=arima(ind, order=c(2,0,1), seasonal=list(order=c(0,1,0), period=50))
tINDA22=arima(ind, order=c(2,0,2), seasonal=list(order=c(0,1,0), period=50))
tINDA23=arima(ind, order=c(2,0,3), seasonal=list(order=c(0,1,0), period=50))
tINDA24=arima(ind, order=c(2,0,4), seasonal=list(order=c(0,1,0), period=50))
tINDA25=arima(ind, order=c(2,0,5), seasonal=list(order=c(0,1,0), period=50))
tINDA21$aic; tINDA22$aic; tINDA23$aic; tINDA24$aic; tINDA25$aic
# p = 3, 4, 5 (at most 6 parameters in total)
tINDA31=arima(ind, order=c(3,0,1), seasonal=list(order=c(0,1,0), period=50))
tINDA32=arima(ind, order=c(3,0,2), seasonal=list(order=c(0,1,0), period=50))
tINDA33=arima(ind, order=c(3,0,3), seasonal=list(order=c(0,1,0), period=50))
tINDA31$aic; tINDA32$aic; tINDA33$aic
tINDA41=arima(ind, order=c(4,0,1), seasonal=list(order=c(0,1,0), period=50))
tINDA42=arima(ind, order=c(4,0,2), seasonal=list(order=c(0,1,0), period=50))
tINDA41$aic; tINDA42$aic
tINDA51=arima(ind, order=c(5,0,1), seasonal=list(order=c(0,1,0), period=50))
tINDA51$aic

## Model Diagnostics ##
library(MASS)  # for cpgram()
x11()
tsdiag(tINDA13, gof.lag=50)
x11()
par(mfrow=c(2,1))
qqnorm(tINDA13$residuals, main="D/L Error Model Residual QQ")
abline(0,1)
cpgram(tINDA13$residuals, main="D/L Error Model Residual CPGRAM")
x11()
qqplot(rt(length(tINDA13$residuals), df=3), tINDA13$residuals,
       main="t(3) Q-Q Plot", ylab="Sample Quantiles")
abline(0,1)

## Smoothing of Errors ##
smoothDL <- matrix(nrow=50, ncol=10)
for (i in 1:10) {
  smoothDL[,i] <- smooth.spline(newDL[,i])$y
}

## HeatMap Codes ##
matEN <- data.matrix(heatEN)
heatmapEN <- heatmap(matEN[50:1,], Rowv=NA, Colv=NA, col=cm.colors(256),
                     scale="column", margins=c(5,10), ylab="Overs", xlab="Wickets")
heatOR <- read.csv("/Users/skwong/Desktop/Timeseries/dlEN.csv", header=T)
matOR <- data.matrix(heatOR)
heatmapOR <- heatmap(matOR[50:1,], Rowv=NA, Colv=NA, col=cm.colors(256),
                     scale="column", margins=c(5,10), ylab="Overs", xlab="Wickets")