Lec 05
Steps in “fitting”
Thus, all we have so far are some hunches that may or may not be
correct. In the third step, we return to the original data and
determine the "goodness-of-fit" of the hypothesis that the data are iid
according to the density f_{θML}:
– Chi-squared test
– Kolmogorov-Smirnov test
– Serial test
– Runs-up-and-down test
Input Distributions from Data 5–5
Chi-squared test
• But how large is "large"? The answer lies in the so-called p-value.
– Suppose that the data Xi result in an error E = ε.
– The p-value is defined to be
p-value = P(E ≥ ε | H0 true),
i.e. the p-value measures the likelihood of observing an error
ε or larger if the hypothesis H0 were true.
– A reasonably good approximation of the p-value can be
obtained by invoking a classical result: for large values of
n, the error E has a χ²-distribution with k − 1 degrees of
freedom.
Example of χ2-test
[Histogram: number per bin vs. bins over [0, 1]]
• Get d = F_{χ²_9}^{−1}(0.95) from the tables ... d = 16.919.
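A minimal sketch of this χ² computation, assuming H0: U(0, 1) with k = 10 equal-width bins (so k − 1 = 9 degrees of freedom); the helper name and the sample below are hypothetical, and 16.919 is the critical value quoted above:

```python
import random

def chi2_uniform_stat(data, k=10):
    """Chi-squared goodness-of-fit statistic for H0: data ~ U(0,1),
    using k equal-width bins: E = sum_i (O_i - n/k)^2 / (n/k)."""
    n = len(data)
    expected = n / k
    observed = [0] * k
    for x in data:
        observed[min(int(x * k), k - 1)] += 1  # clamp x == 1.0 into the last bin
    return sum((o - expected) ** 2 / expected for o in observed)

random.seed(0)
sample = [random.random() for _ in range(1000)]  # hypothetical data set
E = chi2_uniform_stat(sample)
d = 16.919  # F^{-1}_{chi^2_9}(0.95), from the tables
print(f"E = {E:.3f}, accept H0: {E <= d}")
```

With genuinely uniform data the statistic falls below d about 95% of the time, mirroring the decision rule above.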
Kolmogorov-Smirnov Test
• Hypothesis H0: X1, X2, ..., Xn ∼ F(x)
Let X(i) be the i-th smallest of the X's, i.e. X(1) is the smallest
and X(n) is the largest. Define
D_n^+ = max_{1≤i≤n} [ i/n − F̂(X(i)) ],    D_n^− = max_{1≤i≤n} [ F̂(X(i)) − (i−1)/n ].
Then
Dn = max{D_n^+, D_n^−}
| Case                  | Test Statistic                     | α = 0.1 | α = 0.05 | α = 0.01 |
|-----------------------|------------------------------------|---------|----------|----------|
| All pars known        | (√n + 0.12 + 0.11/√n) Dn           | 1.224   | 1.358    | 1.628    |
| Normal N(x̄, s²n)      | (√n − 0.01 + 0.85/√n) Dn           | 0.819   | 0.895    | 1.035    |
| Exponential exp(1/x̄)  | (√n + 0.26 + 0.5/√n)(Dn − 0.2/n)   | 0.990   | 1.094    | 1.308    |
| Weibull (α, β)        | √n Dn                              | 0.803   | 0.874    | 1.007    |

Table 1: Kolmogorov-Smirnov test
• Kolmogorov-Smirnov test:
– Choose probability threshold α = 0.01, 0.05, or 0.1.
– Calculate the value of the error Dn corresponding to the data.
– If the test statistic is greater than the critical value in
Table 1, reject the hypothesis.
Kolmogorov-Smirnov Test
Sorted : 0.0493 0.2618 0.3603 0.5485 0.5711 0.5973 0.7009 0.7400 0.7505 0.9623
[Plot: empirical CDF of the sorted sample vs. the U(0, 1) CDF F(x) = x on [0, 1]]
• From table:
(√n + 0.12 + 0.11/√n) Dn = 0.8243 < 1.358
• Accept or reject H0? Equivalently, Dn = 0.2485 < 1.358/(√n + 0.12 + 0.11/√n) = 0.4094, therefore accept.
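The Dn calculation above can be sketched as follows (the helper name is hypothetical; F̂(x) = x since H0 is U(0, 1), and the scaling factor is the "all pars known" row of Table 1):

```python
def ks_statistic(sorted_data, F):
    """Kolmogorov-Smirnov statistic Dn = max(Dn+, Dn-) for sorted data
    against a hypothesized CDF F."""
    n = len(sorted_data)
    d_plus = max((i + 1) / n - F(x) for i, x in enumerate(sorted_data))
    d_minus = max(F(x) - i / n for i, x in enumerate(sorted_data))
    return max(d_plus, d_minus)

data = [0.0493, 0.2618, 0.3603, 0.5485, 0.5711, 0.5973,
        0.7009, 0.7400, 0.7505, 0.9623]          # the sorted sample above
n = len(data)
dn = ks_statistic(data, lambda x: x)             # H0: U(0,1), so F(x) = x
stat = (n ** 0.5 + 0.12 + 0.11 / n ** 0.5) * dn  # "all pars known" scaling
print(f"Dn = {dn:.4f}, scaled statistic = {stat:.4f}")
# Dn = 0.2485, scaled statistic = 0.8243 < 1.358, so accept H0
```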
Comments on tests
Sequence 2 : 0.0493 0.2618 0.3603 0.5485 0.5711 0.5973 0.7009 0.7400 0.7505 0.9623
ρ(X, Y) = C(X, Y) / ( √V(X) · √V(Y) )
Properties of ρ:
– Cauchy-Schwarz inequality: −1 ≤ ρ ≤ 1
– Complete dependence: if Y = cX + d with c > 0 then ρ = +1,
and if Y = cX + d with c < 0 then ρ = −1
– Independence: X, Y independent ⇒ ρ = 0, but not vice versa
– Partial dependence: ρ ∉ {0, 1, −1}
Want to show that there are no correlations in the data ...
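A quick numerical illustration of these properties, with a hypothetical helper `corr` (the quadratic case shows that ρ ≈ 0 does not imply independence, since Y = X² is fully determined by X):

```python
import random
import statistics

def corr(xs, ys):
    """Sample correlation rho = C(X,Y) / (sqrt(V(X)) * sqrt(V(Y)))."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

random.seed(1)
x = [random.uniform(-1, 1) for _ in range(10_000)]    # symmetric about 0
print(corr(x, [2 * c + 3 for c in x]))  # complete dependence: rho = +1
print(corr(x, [c * c for c in x]))      # dependent but uncorrelated: rho near 0
```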
Serial test
Group the sequence into non-overlapping pairs:
X1 = (U1, U2), X2 = (U3, U4), ..., Xn = (U_{2n−1}, U_{2n})
Partition [0, 1)² into a k × k grid of cells B_ij; a pair X = (X1, X2) satisfies
X ∈ B_ij ⇔ (i − 1)/k ≤ X1 < i/k and (j − 1)/k ≤ X2 < j/k
Under H0 the expected count in each cell is
E_ij = E[O_ij] = n/k²
... we are now set for the χ² test ...
E_{k²,n} = Σ_{i,j=1}^{k} (O_ij − E_ij)² / E_ij
d = F_{χ²_{k²−1}}^{−1}(1 − α)
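A sketch of the serial test under these definitions (the helper name and the data are hypothetical; pairs are binned into a k × k grid and compared against E_ij = n/k²):

```python
import random

def serial_test_stat(u, k):
    """Serial test: pair successive numbers, bin the pairs into a k-by-k
    grid over [0,1)^2, and compute the chi-squared statistic against the
    expected count E_ij = n / k^2 per cell."""
    pairs = list(zip(u[0::2], u[1::2]))      # (U1,U2), (U3,U4), ...
    n = len(pairs)
    counts = [[0] * k for _ in range(k)]
    for a, b in pairs:
        counts[min(int(a * k), k - 1)][min(int(b * k), k - 1)] += 1
    e = n / k ** 2
    return sum((o - e) ** 2 / e for row in counts for o in row)

random.seed(2)
u = [random.random() for _ in range(2000)]   # hypothetical sequence: 1000 pairs
E = serial_test_stat(u, k=4)
# Compare against d = F^{-1}_{chi^2_{15}}(0.95) = 24.996 (standard tables)
print(f"E = {E:.3f}, accept H0: {E <= 24.996}")
```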
Problems:
Runs-up-and-down test
• Example :
0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
− + − + − − − + −
Length of runs up = 1, 1, 1
Length of runs down = 1, 1, 3, 1
• Let Rn be the total number of runs (up and down together). Then
μn = E[Rn] = (2n − 1)/3,    σn² = V[Rn] = (16n − 29)/90
μn, σn are not sample averages ... they are the expectations for a sequence of length n.
• Use the approximation (Rn − μn)/σn ≈ η, the standard normal
random variable.
• Define
l_{n,1−α} = z_{α/2} = N^{−1}(α/2),    u_{n,1−α} = z_{1−α/2} = N^{−1}(1 − α/2)
Notice: the z's really have nothing to do with the process Rn; all the
information is in μn and σn. Accept H0 if
μn + z_{α/2} σn ≤ Rn ≤ μn + z_{1−α/2} σn
Example cont.
0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
− + − + − − − + −
• n = 10 and Rn = 7
• Accept or reject H0? μ10 = 19/3 ≈ 6.33 and σ10 = √(131/90) ≈ 1.21,
so the 95% interval is [3.97, 8.70], which contains Rn = 7.
• Decision: accept
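The run counting and decision above can be sketched as follows (the helper name is hypothetical; z = 1.96 corresponds to α = 0.05):

```python
import math

def runs_up_down_test(u):
    """Runs-up-and-down test: count runs of consecutive increases and
    decreases, then check Rn against the normal approximation with
    mu_n = (2n-1)/3 and sigma_n^2 = (16n-29)/90."""
    signs = [1 if b > a else -1 for a, b in zip(u, u[1:])]
    rn = 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    n = len(u)
    mu = (2 * n - 1) / 3
    sigma = math.sqrt((16 * n - 29) / 90)
    z = 1.96  # z_{0.975}, i.e. alpha = 0.05
    return rn, mu - z * sigma <= rn <= mu + z * sigma

u = [0.9501, 0.2311, 0.6068, 0.4860, 0.8913,
     0.7621, 0.4565, 0.0185, 0.8214, 0.4447]  # the slide's example data
rn, accept = runs_up_down_test(u)
print(f"Rn = {rn}, accept H0: {accept}")  # Rn = 7, accept H0: True
```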
Run-above-below-mean test
• Example:
0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
+ − + − + + − − + −
(+ marks a value above the mean 0.5, − a value below)
• Same old answer ... choose l_{n,1−α} and u_{n,1−α} such that
μn = E[Rn] = (n + 1)/2 and σn² = V[Rn] = (n − 1)/4
(Rn − μn)/σn ≈ η (standard Normal)
Accept H0 if
μn + z_{α/2} σn ≤ Rn ≤ μn + z_{1−α/2} σn
Notice: the test is the same as before ... the same z's ... the only
things that differ are μn and σn.
Example cont.
0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
+ − + − + + − − + −
• Accept or reject H0? Here n = 10 and Rn = 8; μ10 = 5.5 and σ10 = 1.5,
so the 95% interval is [2.56, 8.44], which contains Rn = 8.
• Decision: accept
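A sketch of the same procedure with above/below-the-mean signs (the helper name is hypothetical; 0.5 is the U(0, 1) mean, and μn, σn are the formulas above):

```python
import math

def runs_above_below_test(u, mean=0.5):
    """Runs-above-and-below-the-mean test: count runs of consecutive
    values on the same side of the mean, then check Rn against the
    normal approximation with mu_n = (n+1)/2, sigma_n^2 = (n-1)/4."""
    signs = [1 if x > mean else -1 for x in u]
    rn = 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    n = len(u)
    mu = (n + 1) / 2
    sigma = math.sqrt((n - 1) / 4)
    z = 1.96  # z_{0.975}, i.e. alpha = 0.05
    return rn, mu - z * sigma <= rn <= mu + z * sigma

u = [0.9501, 0.2311, 0.6068, 0.4860, 0.8913,
     0.7621, 0.4565, 0.0185, 0.8214, 0.4447]  # the slide's example data
rn, accept = runs_above_below_test(u)
print(f"Rn = {rn}, accept H0: {accept}")  # Rn = 8, accept H0: True
```

Note that only the counting rule and the (μn, σn) pair change; the z's and the acceptance interval are exactly as in the runs-up-and-down test.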
Autocorrelation test
Cj = C(Xn+j, Xn)
• Approximation ....
H0: ρj = 0 for all 0 < j < J
H1: ρj ≠ 0 for some j
μn = E[ρ̂_{j,n}] = 0,    σn² = V(ρ̂_{j,n}) = (13n + 7)/(n + 1)²
(ρ̂_{j,n} − E[ρ̂_{j,n}]) / √V(ρ̂_{j,n}) ≈ η
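The slides do not spell out the estimator ρ̂_{j,n}; a standard sample-autocorrelation sketch (all names hypothetical) is:

```python
import random

def sample_autocorr(u, j):
    """Sample lag-j autocorrelation estimate: lag-j sample covariance
    normalized by the sample variance."""
    n = len(u)
    m = sum(u) / n
    var = sum((x - m) ** 2 for x in u) / n
    cov = sum((u[i] - m) * (u[i + j] - m) for i in range(n - j)) / (n - j)
    return cov / var

random.seed(3)
u = [random.random() for _ in range(5000)]  # hypothetical sequence
rho1 = sample_autocorr(u, 1)
# Under H0, rho_hat_{1,n} should sit within a couple of standard
# deviations of 0 under the normal approximation above.
print(f"rho_hat_1 = {rho1:.4f}")
```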
Numerical Example
[Figure: three histograms of number per bin — two with bins over [0, 18], one with bins over [0, 1]]
E_{k,n} = Σ_{i=1}^{k} (O_i − n/k)² / (n/k) = 21.68 ≤ d_{19,0.95} = 30.1435
Test passed
Dn = max{ max_{1≤j≤n} [ j/n − V_j ], max_{1≤j≤n} [ V_j − (j − 1)/n ] }
   = 0.0289 ≤ d_{1000,0.95} = 0.0428
Test passed
• Runs-up-and-down test with n = 1000 (so μn = 666.333, σn² = 177.46) and Rn = 666:
666.333 − 1.96 √177.46 ≤ 666 ≤ 666.333 + 1.96 √177.46
Test passed.
What we know