
Lecture 5

Input Distributions from Data

• Geometrical features : Histograms, Q-Q plots

• Maximum likelihood parameter estimation

• Goodness of fit tests for density and independence



Steps in “fitting”

Step 1: Hypothesizing families of distributions - decide what general shapes seem to “fit” the data ... how?

• prior knowledge: information about the system, the model, the range of the variates, etc. e.g. the memoryless property of the exponential makes it suitable for describing arrival processes; service times should not be represented by a Normal because service times can never be negative, while a Normal variate can be; etc.

• summary statistics: mean, variance, median, coefficient of variation, skewness, etc. e.g. if the sample mean and the sample standard deviation are roughly equal, then the exponential seems a very likely choice.

• plots: histograms, q-q plots, p-p plots, box plots, etc.
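For concreteness, here is a minimal sketch of these graphical checks in Python; the `data` array is simulated stand-in data, since no data set accompanies this slide.

```python
# A minimal sketch of the Step-1 graphical checks; `data` is simulated
# stand-in data, not a data set from the lecture.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with the rule-of-thumb bin count k = floor(log2(n) + 1)
k = int(np.log2(len(data)) + 1)
ax1.hist(data, bins=k)
ax1.set_xlabel("x")
ax1.set_ylabel("number per bin")

# Q-Q plot of the data against a hypothesized exponential family
stats.probplot(data, dist=stats.expon, plot=ax2)

plt.show()
```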



Step 2: Estimation of parameters of the distribution

There are many different estimators; we will focus on the Maximum Likelihood Estimator (MLE).

• Let θ be the parameters defining the density fθ(x), e.g. if the exponential is the “fitted” density then λ is the parameter, with fλ(x) = λe−λx.

• If the observed data X1, . . . , Xn were iid ∼ fθ then

L(θ) = fθ(X1, . . . , Xn) = fθ(X1) · fθ(X2) · · · fθ(Xn)

• The maximum likelihood estimate θml is the parameter value θ that maximizes L(θ)

Example: X1, . . . , Xn ∼ exp(λ), then

L(λ) = (λe^{−λX1}) · (λe^{−λX2}) · · · (λe^{−λXn}) = λ^n e^{−λ Σ_{i=1}^n Xi},

log L(λ) = n log(λ) − λ · (Σ_{i=1}^n Xi)

Thus, the MLE estimate is λml = 1/X̄, where X̄ = (1/n) Σ_{i=1}^n Xi ... something one should have expected in the first place.

Example: X1, . . . , Xn ∼ N(µ, σ²), then the MLE estimates are:

µml = (1/n) Σ_{i=1}^n Xi ,    σml = sqrt( (1/n) Σ_{i=1}^n (Xi − µml)² )

(What is the problem with the estimator for σ ?)
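As a sanity check on the algebra above, a short sketch (assuming iid samples in an array `x`) compares the closed-form exponential MLE with a direct numerical maximization of the log-likelihood:

```python
# A sketch of MLE for the exponential family: the closed form
# lambda_ml = 1/xbar is checked against numerically maximizing log L.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)   # stand-in data, true lambda = 0.5

lam_closed = 1.0 / x.mean()

# log L(lambda) = n*log(lambda) - lambda*sum(x); minimize its negative
neg_log_lik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")

print(lam_closed, res.x)   # the two estimates agree to optimizer tolerance
```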

Step 3: How good are the “fitted” distributions?

In step 1, based on some properties of the observed data, we guessed a probable family of distributions that is expected to explain the data.

In step 2, we used MLE estimation to choose the best distribution within the family determined in step 1.

Thus, all we have so far are some hunches that may or may not be correct. In the third step, we return to the original data and determine the “goodness-of-fit” of the hypothesis that the data are iid according to the density fθml.

We will break the problem up into two parts:

• Goodness of fit test for density:

– Chi-squared test
– Kolmogorov-Smirnov test

• Test for independence

– Serial test
– Runs-up-and-down test

Chi-squared test

• Easier to motivate in the discrete case ... suppose X1, . . . , Xn ∼ the pmf p = (p1, . . . , pk), i.e. we want to test the null hypothesis H0: X1, X2, . . . , Xn i.i.d. ∼ p

• Let nj = #{i : Xi = j} for 1 ≤ j ≤ k. If H0 is true, then nj ∼ Bin(n, pj), i.e. E[nj] = npj, and so (nj − npj)² indicates how plausible the claim pj = P(X = j) is.

• Thus, the error term

E = Σ_{j=1}^k (nj − npj)² / (npj)

measures the likelihood of the hypothesis H0. One rejects H0 if the error E is large ... else one accepts it.

• But how large is large? The answer lies in the so-called p-value.
– Suppose that the data Xi result in an error E = ε
– the p-value is defined to be

p-value = P(E ≥ ε | H0 true),

i.e. the p-value measures how likely one would be to observe an error ε or higher if the hypothesis H0 were true.
– A reasonably good approximation of the p-value can be obtained by invoking a classical result: for large values of n, the error E has a χ²-distribution with k − 1 degrees of freedom.

• The chi-squared test with all parameters known:

– Choose a critical p-value α below which the hypothesis is to be rejected. Typical values are α = 0.05, 0.01, 0.005.
– Look up the value of d = F⁻¹_{k−1}(1 − α), where F_{k−1} is the CDF of a χ²-r.v. with (k − 1) d.f.
– Accept if ε ≤ d, else reject.

• What happens if some parameters are unknown:

– Estimate the unknown parameters from data
– If there are m unknown parameters, then, as n becomes large, the CDF of E lies between the CDFs of χ²_{k−1} and χ²_{k−m−1}
– So, compare ε with F⁻¹_{k−1}(1 − α) and F⁻¹_{k−m−1}(1 − α)

• So, how does one deal with continuous random variables? Convert them into discrete r.v.'s by dividing the real line into intervals. Strategies for constructing the intervals (a sketch follows this list):

– Equiprobable approach: Divide the real line into k equiprobable intervals with k ≥ 3, such that the expected count in each interval satisfies npj = n/k ≥ 5.
– Uniform approach: If the Xi's ∼ F, then Ui = F(Xi) ∼ unif[0, 1]. Divide the interval [0, 1] into k = ⌊log₂(n) + 1⌋ bins and use the χ²-test.
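A sketch of the equiprobable approach for a fitted exponential follows; the sample `x`, the bin count k = 10, and α = 0.05 are illustrative choices, not values from the slides.

```python
# Chi-squared goodness-of-fit with equiprobable intervals for exp(lambda).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100)   # stand-in data
lam = 1.0 / x.mean()                       # m = 1 estimated parameter

k = 10                                     # gives n*p = n/k = 10 >= 5
# interval edges are the fitted quantiles F^{-1}(i/k), i = 0..k-1, plus +inf
edges = -np.log(1.0 - np.arange(k) / k) / lam
edges = np.append(edges, np.inf)

n_j, _ = np.histogram(x, bins=edges)
n, p = len(x), 1.0 / k
E = np.sum((n_j - n * p) ** 2 / (n * p))

# with m = 1, compare against chi2 quantiles for k-m-1 = k-2 and k-1 d.f.
print(E, stats.chi2.ppf(0.95, k - 2), stats.chi2.ppf(0.95, k - 1))
```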

Example of χ2-test

• Data: assumed to be uniform (only the first 10 samples are shown)

0.3603 0.5485 0.2618 0.5973 0.0493 0.5711 0.7009 0.9623 0.7505 0.7400

• n = 100, k = 10 and α = 0.05

• Split the n samples into k bins ...

[Figure: histogram of the 100 samples using 10 bins; x-axis: bins on [0, 1], y-axis: number per bin]

• Compute the error

E = Σ_{j=1}^k (nj − npj)² / (npj) = Σ_{j=1}^k (nj − 10)² / 10 = 6.8

• Get d = F⁻¹_{χ²(9)}(0.95) from the tables ... d = 16.919.

• Accept or reject H0? E = 6.8 < d = 16.919 ... accept
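The mechanics of this example can be reproduced with a few lines; since only 10 of the 100 data points are listed above, uniform samples are simulated here instead.

```python
# Chi-squared test of uniformity with n = 100, k = 10, alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
u = rng.uniform(size=100)                 # simulated stand-in for the data

n, k = len(u), 10
n_j, _ = np.histogram(u, bins=np.linspace(0.0, 1.0, k + 1))
E = np.sum((n_j - n / k) ** 2 / (n / k))

d = stats.chi2.ppf(0.95, k - 1)           # = 16.919, as in the slide
print(E, d, "accept" if E <= d else "reject")
```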



Kolmogorov-Smirnov Test
• Hypothesis H0: X1, X2 , . . ., Xn ∼ F (x)

• Define the empirical cumulative distribution function (CDF) F̂n as follows:

F̂n(x) = (1/n) (# of Xi's less than or equal to x)

• Expect F̂n ≈ F, the true CDF, or equivalently that the error

D = max_{x∈R} | F̂n(x) − F(x) |   is “small”

Let X(i) be the i-th smallest of the X's, i.e. X(1) is the smallest and X(n) is the largest. Define

D⁺n = max_{1≤i≤n} { i/n − F(X(i)) } ,    D⁻n = max_{1≤i≤n} { F(X(i)) − (i − 1)/n }.

Then

Dn = max{ D⁺n, D⁻n }

• Natural test : Reject hypothesis H0 if Dn is “large”

• Again the acceptance/rejection will be based on the p-value
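A minimal sketch of computing Dn from the order statistics (scipy.stats.kstest performs the same computation); it is checked against the 10-point uniform example that follows:

```python
# Compute the K-S statistic D_n for a hypothesized CDF F.
import numpy as np

def ks_statistic(x, F):
    """D_n = max(D_n^+, D_n^-) from the order statistics."""
    n = len(x)
    x_sorted = np.sort(x)
    i = np.arange(1, n + 1)
    d_plus = np.max(i / n - F(x_sorted))
    d_minus = np.max(F(x_sorted) - (i - 1) / n)
    return max(d_plus, d_minus)

# the 10-point uniform example below gives D_10 = 0.2485
u = np.array([0.3603, 0.5485, 0.2618, 0.5973, 0.0493,
              0.5711, 0.7009, 0.9623, 0.7505, 0.7400])
print(ks_statistic(u, lambda t: t))   # F(x) = x on [0, 1]
```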



Case                   Adjusted test statistic               α = 0.1   α = 0.05   α = 0.01
All parameters known   (√n + 0.12 + 0.11/√n) Dn              1.224     1.358      1.628
Normal N(x̄, s²n)       (√n − 0.01 + 0.85/√n) Dn              0.819     0.895      1.035
Exponential exp(1/x̄)   (√n + 0.26 + 0.5/√n)(Dn − 0.2/n)      0.990     1.094      1.308
Weibull (α, β)         √n · Dn                               0.803     0.874      1.007

Table 1: Kolmogorov-Smirnov test (reject if the adjusted statistic exceeds the critical value)

• Kolmogorov-Smirnov test:
– Choose the probability threshold α = 0.01, 0.05, or 0.1.
– Calculate the value of the error Dn corresponding to the data
– If the adjusted test statistic is greater than the critical value in Table 1, reject the hypothesis

Example of K-S test

• Data: assumed to be uniform


Unsorted : 0.3603 0.5485 0.2618 0.5973 0.0493 0.5711 0.7009 0.9623 0.7505 0.7400

Sorted : 0.0493 0.2618 0.3603 0.5485 0.5711 0.5973 0.7009 0.7400 0.7505 0.9623

• n = 10 and set α = 0.05

• Construct F̂n(x) = (1/n) (# of Ui's less than or equal to x)


[Figure: the uniform CDF F(x) = x and the empirical CDF F̂10(x) of the 10 samples on [0, 1]]

• Compute the error Dn = max_{x∈[0,1]} | F̂10(x) − F(x) |

Dn = 0.2485, achieved at x = 0.5485

• From the table:

(√n + 0.12 + 0.11/√n) · Dn = 3.3171 × 0.2485 = 0.8243 < 1.358

• Accept or reject H0? Equivalently, Dn = 0.2485 < 1.358/3.3171 = 0.4094, therefore accept

Comments on tests

Comparison of the K-S and χ² tests:

1. The K-S test is exact for any sample size n, whereas the χ² test implicitly assumes that n is large.
2. The K-S test looks at each sample Ui individually, whereas the χ² test does not differentiate between samples within a bin.
3. But ... the K-S test involves more computation than the χ² test.

Why are these tests not enough? Because they do not take the sequence into account ...

Consider the following two sequences :


Sequence 1 : 0.3603 0.5485 0.2618 0.5973 0.0493 0.5711 0.7009 0.9623 0.7505 0.7400

Sequence 2 : 0.0493 0.2618 0.3603 0.5485 0.5711 0.5973 0.7009 0.7400 0.7505 0.9623

• Both sequences will
– give the same value of Dn in the K-S test ... so if seq 1 passes, so does seq 2
– give the same value of Ek,n in the χ² test ... so if seq 1 passes, so does seq 2

• But ... it is highly improbable that the second (sorted) sequence is independently and identically U[0, 1] distributed

Tests do not capture independence!



Tests for independence

Define a new sequence Ui = F(Xi). If the Xi are iid ∼ F, then the sequence Ui is iid unif[0, 1]. We will focus on independence tests for iid uniform random numbers.

Three types of tests :


• Serial test
• Run test
• Autocorrelation test

Correlation coefficient ρ(X, Y) between random variables X, Y:

ρ(X, Y) = C(X, Y) / ( √V(X) · √V(Y) )

Properties of ρ:

Cauchy-Schwarz inequality: −1 ≤ ρ ≤ 1
Complete dependence: If Y = cX + d with c > 0, then ρ = +1; if Y = cX + d with c < 0, then ρ = −1
Independence: X, Y independent ⇒ ρ = 0, but not vice versa
Partial dependence: ρ ∉ {0, 1, −1}

We want to show that there are no correlations in the data ...

Serial test

Idea: Bunch up the data into vectors ... if independent, the vectors should be uniformly distributed in the unit cube ...

... but we know how to check uniformity ...

Consider the case d = 2 ...

• Bunch up the data in pairs:

Ū1 = (U1, U2), Ū2 = (U3, U4), . . . , Ūn = (U2n−1, U2n)

• Choose k and define bins Bij as follows: a vector X = (X1, X2) lies in Bij ⇔

(i − 1)/k ≤ X1 < i/k  and  (j − 1)/k ≤ X2 < j/k

• Dump the vectors Ūi into the bins ... let Oij be the number of vectors in bin Bij.

If the original sequence U1, ..., U2n is uniform, then

Eij = E[Oij] = n/k²

... we are now set for the χ² test ...

• Define the error term E_{k²,n} as before:

E_{k²,n} = Σ_{i,j=1}^{k} (Oij − Eij)² / Eij

• Define the critical value

d = F⁻¹_{χ²(k²−1)}(1 − α)

where F_{χ²(k²−1)} is the cdf of a χ² r.v. with k² − 1 d.f.

• Serial test: Reject H0 if E_{k²,n} ≥ d

• Can generalize this to higher dimensions ...
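A sketch of the d = 2 serial test; the sample size, k = 4, and α = 0.05 are illustrative choices.

```python
# Serial test in dimension d = 2 on supposedly iid uniforms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
u = rng.uniform(size=2000)           # 2n numbers -> n = 1000 pairs
k, alpha = 4, 0.05

pairs = u.reshape(-1, 2)             # bunch up: (U1,U2), (U3,U4), ...
n = len(pairs)

# O_ij = number of pairs landing in bin B_ij; E_ij = n/k^2 under H0
edges = np.linspace(0.0, 1.0, k + 1)
O, _, _ = np.histogram2d(pairs[:, 0], pairs[:, 1], bins=[edges, edges])
E_ij = n / k**2
error = np.sum((O - E_ij) ** 2 / E_ij)

d = stats.chi2.ppf(1 - alpha, k**2 - 1)
print(error, d, "accept" if error < d else "reject")
```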

Problems:

1. Starting point bias ... easily fixed: arrange the sequence Ui on a circle and randomly pick a starting point

2. To thoroughly check independence one needs large d ... but the number of bins goes up exponentially in d ... bad

Runs-up-and-down test

• Run: a maximal sequence of events that share a property
– runs up: a sequence of points that are all increasing
– runs down: a sequence of points that are all decreasing

• Example (the signs mark the steps between consecutive points):

0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
    ⊖      ⊕      ⊖      ⊕      ⊖      ⊖      ⊖      ⊕      ⊖

Number of runs up (represented by ⊕) = 3
Number of runs down (represented by ⊖) = 4
Total number of runs = 7

Length of runs up = 1, 1, 1
Length of runs down = 1, 1, 3, 1

• Two types of tests :


1. based on the number of runs : will cover in class
2. based on the length of runs : in Handout #9

Run-up-and-down test

• Idea: If the number of runs Rn is “too large” or “too small”, then H0 is improbable

• same old story ... what is “large” and what is “small”?

• answer ... choose ln,1−α and un,1−α such that

P(ln,1−α ≤ Rn ≤ un,1−α | H0) = 1 − α

we need two bounds because E[Rn] ≠ 0.

• test: Accept H0 if ln,1−α ≤ Rn ≤ un,1−α

But how does one get ln,1−α and un,1−α?

• simple probabilistic arguments show that

µn = E[Rn] = (2n − 1)/3 ,    σ²n = V[Rn] = (16n − 29)/90

µn, σn are not sample averages ... they are exact expectations for a sequence of length n

• For Rn the central limit theorem is valid, i.e.

lim_{n→∞} P( (Rn − µn)/σn ≤ x ) = N(x)

where N is the standard normal CDF. This result is true even though Rn is not a sum of independent events ...

• Use the approximation ... (Rn − µn)/σn ≈ η, a standard normal random variable

• Define the standard normal quantiles

z_{α/2} = N⁻¹(α/2) ,    z_{1−α/2} = N⁻¹(1 − α/2)

Notice: the z's really have nothing to do with the process Rn; all the process-specific information is in µn and σn

• approximate test: Accept H0 if

µn + z_{α/2} σn ≤ Rn ≤ µn + z_{1−α/2} σn

Example cont.

0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
    ⊖      ⊕      ⊖      ⊕      ⊖      ⊖      ⊖      ⊕      ⊖

• n = 10 and Rn = 7

• µn = 6.3333 and σn = 1.2065

• From the standard normal tables get z0.025 = −1.96 and z0.975 = 1.96

• Accept or reject H0?

µn + z0.025 σn = 3.9687 ≤ Rn = 7 ≤ µn + z0.975 σn = 8.698

• Decision : accept
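A sketch of the test on the example above; the run count is obtained by counting sign changes of the successive differences (this assumes no ties, which holds for continuous data):

```python
# Runs-up-and-down test via the normal approximation.
import numpy as np
from scipy import stats

def runs_up_down_test(x, alpha=0.05):
    signs = np.sign(np.diff(x))      # +1 for an up-step, -1 for a down-step
    R = 1 + np.count_nonzero(signs[1:] != signs[:-1])
    n = len(x)
    mu = (2 * n - 1) / 3.0
    sigma = np.sqrt((16 * n - 29) / 90.0)
    z = stats.norm.ppf(1 - alpha / 2)
    return R, mu - z * sigma, mu + z * sigma

x = np.array([0.9501, 0.2311, 0.6068, 0.4860, 0.8913,
              0.7621, 0.4565, 0.0185, 0.8214, 0.4447])
print(runs_up_down_test(x))          # R = 7 inside [3.97, 8.70] -> accept
```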

Runs above/below mean test

• same idea as before ... but compare with the mean µ = 0.5

• Example:

0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
  ⊕      ⊖      ⊕      ⊖      ⊕      ⊕      ⊖      ⊖      ⊕      ⊖

Number of runs above mean (represented by ⊕) = 4
Number of runs below mean (represented by ⊖) = 4
Total number of runs = 8

Length of runs above mean = 1, 1, 2, 1
Length of runs below mean = 1, 1, 2, 1

• Again two types of tests

1. based on the number of runs : will cover in class


2. based on the length of runs : in Handout #9

Run-above-below-mean test

• Idea: If the number of runs Rn is “too large” or “too small”, then H0 is improbable

• same old story ... what is “large” and what is “small”?

• same old answer ... choose ln,1−α and un,1−α such that

P(ln,1−α ≤ Rn ≤ un,1−α | H0) ≥ 1 − α

• same old approximation ...

µn = E[Rn] = (n + 1)/2  and  σ²n = V[Rn] = (n − 1)/4

the central limit theorem is valid ... therefore

(Rn − µn)/σn ≈ η (standard Normal)

• approximate test: Accept H0 if

µn + z_{α/2} σn ≤ Rn ≤ µn + z_{1−α/2} σn

Notice: the test is the same as before ... the same z's ... the only things different are µn and σn

Example cont.

0.9501 0.2311 0.6068 0.4860 0.8913 0.7621 0.4565 0.0185 0.8214 0.4447
  ⊕      ⊖      ⊕      ⊖      ⊕      ⊕      ⊖      ⊖      ⊕      ⊖

• n = 10 and the number of runs Rn = 8

• µn = (10 + 1)/2 = 5.5 and σn = √((10 − 1)/4) = 1.5

• From the standard normal tables get z0.025 = −1.96 and z0.975 = 1.96

• Accept or reject H0?

µn + z0.025 σn = 2.5601 ≤ Rn = 8 ≤ µn + z0.975 σn = 8.4399

• Decision : accept
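The same recipe applies with the new µn and σn; a sketch checked against this example:

```python
# Runs-above-and-below-mean test against the fixed mean 0.5.
import numpy as np
from scipy import stats

def runs_above_below_test(x, mean=0.5, alpha=0.05):
    above = x > mean
    R = 1 + np.count_nonzero(above[1:] != above[:-1])
    n = len(x)
    mu = (n + 1) / 2.0
    sigma = np.sqrt((n - 1) / 4.0)
    z = stats.norm.ppf(1 - alpha / 2)
    return R, mu - z * sigma, mu + z * sigma

x = np.array([0.9501, 0.2311, 0.6068, 0.4860, 0.8913,
              0.7621, 0.4565, 0.0185, 0.8214, 0.4447])
print(runs_above_below_test(x))      # R = 8 inside [2.56, 8.44] -> accept
```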

Primer on stationary stochastic processes

• Let X = {Xi : −∞ < i < ∞} be a discrete time stochastic process

• The stochastic process X is said to be stationary iff for all i1 < i2 < . . . < im, B ⊆ Rm, and n,

P([Xn+i1, . . . , Xn+im] ∈ B) = P([Xi1, . . . , Xim] ∈ B)

Special cases ...

1 dim: P(Xn ∈ B) = P(X0 ∈ B) ⇒ E[Xn] = E[X0]
2 dim: P([Xn+i, Xn] ∈ B) = P([Xi, X0] ∈ B) ⇒ E[Xn+i Xn] = E[Xi X0]

• Lag j autocovariance of X (independent of n): Cj = C(Xn+j, Xn)

• Lag j autocorrelation of X (independent of n):

ρj = ρ(Xn+j, Xn) = C(Xn+j, Xn) / ( √V(Xn+j) · √V(Xn) ) = C(Xn+j, Xn) / V(Xn)

(the last equality uses stationarity: V(Xn+j) = V(Xn))

Autocorrelation test

• Let U1, . . . , Un be a sequence of numbers supposedly i.i.d. uniform

• if true, then ρ0 = 1 and ρj = 0 for all j ≠ 0

• Form the hypothesis testing problem

H0: ρj = 0 for all j > 0
H1: ρj ≠ 0 for some j > 0

• Approximation ... restrict attention to finitely many lags

H0: ρj = 0 for all 0 < j < J
H1: ρj ≠ 0 for some 0 < j < J

• simplified expression for ρj (for uniforms, E[X0] = 1/2 and V(X0) = 1/12):

ρj = ( E[Xj X0] − (E[X0])² ) / V(X0) = 12 E[Xj X0] − 3

• From the data, estimate the value of ρj as follows:

ρ̂j,n = 12 · ( Σ_{i=1}^{n} Xi X_{(i+j) mod n} ) / n − 3

• Idea: Reject H0 if |ρ̂j,n| is “too large” ...

• It is easy to show from the definition that

µn = E[ρ̂j,n] = 0 ,    σ²n = V(ρ̂j,n) = (13n + 7) / (n + 1)²

• Again the central limit theorem holds for ρ̂j,n, so

( ρ̂j,n − E[ρ̂j,n] ) / √V(ρ̂j,n) ≈ η

• So we have the test ...

Accept H0 if z_{α/2} σn ≤ ρ̂j,n ≤ z_{1−α/2} σn
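A sketch of the lag-j test using the estimator and variance formula above; the circular index (i + j) mod n is implemented with np.roll:

```python
# Autocorrelation test for supposedly iid uniforms at lag j.
import numpy as np
from scipy import stats

def autocorrelation_test(u, j, alpha=0.05):
    n = len(u)
    rho_hat = 12.0 * np.mean(u * np.roll(u, -j)) - 3.0   # pairs U_i, U_{(i+j) mod n}
    sigma = np.sqrt((13.0 * n + 7) / (n + 1) ** 2)
    z = stats.norm.ppf(1 - alpha / 2)
    return rho_hat, -z * sigma, z * sigma                # accept if inside band

rng = np.random.default_rng(5)
u = rng.uniform(size=1000)
for j in (1, 2, 3):
    print(j, autocorrelation_test(u, j))
```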



Numerical Example

• Given N = 1000 data points from some input

• Step 1: Histogram ... try different numbers of bins

[Figure: histogram of the data with 20 bins; x-axis: bins on [0, 18], y-axis: number per bin]

[Figure: histogram of the same data with 100 bins; x-axis: bins on [0, 18], y-axis: number per bin]

The histograms seem to suggest an exponential distribution. Notice that the picture actually gets murkier when the number of bins is increased.

Rule of thumb: Use k = ⌊log₂(n) + 1⌋ bins

• Step 2 : Check summary statistics

The mean is X̄ = 2.1091 and the std. deviation is 2.0775 ... these are nearly equal, so most likely an exponential dist.

• Step 3: Parameter estimation

The MLE is λml = 1/X̄ = 0.4814

• Step 4: Hypothesis testing ...

Use the cdf F0.4814(x) = 1 − e^{−0.4814x} to convert the data into (supposedly) uniform numbers, i.e. Yi = 1 − e^{−0.4814 Xi}.
[Figure: histogram of Y with 20 bins; x-axis: bins on [0, 1], y-axis: number per bin]

(a) χ² test with k = 20 bins:

E_{k,n} = Σ_{j=1}^{k} (nj − n/k)² / (n/k) = 21.68 ≤ d_{19, 0.95} = 30.1435

Test passed

(b) Kolmogorov-Smirnov test

Let the Vj's be the Yi's sorted in increasing order; then

Dn = max{ max_{1≤j≤n} ( j/n − Vj ) , max_{1≤j≤n} ( Vj − (j − 1)/n ) } = 0.0289 ≤ d_{1000, 0.95} = 0.0428

Test passed

(c) Run-up-and-down test

The data has R = 666 up-and-down runs.

The mean is µ = 666.333, the variance is σ² = 177.46, and the threshold is z_{1−α/2} = 1.96; therefore the test is

666.333 − (1.96)·√177.46 ≤ 666 ≤ 666.333 + (1.96)·√177.46

Test passed.

Conclusion: the data are consistent with being independent, identically distributed exp(0.4814).
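The whole pipeline condenses into a short script; the original 1000-point data set is not reproduced in the notes, so stand-in data are simulated with the fitted rate:

```python
# End-to-end sketch: fit, transform, then test density and independence.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.exponential(scale=1 / 0.4814, size=1000)   # stand-in data

lam = 1.0 / x.mean()                  # Step 3: MLE of the rate
y = 1.0 - np.exp(-lam * x)            # Step 4: transform to supposed uniforms

# (a) chi-squared test with k = 20 equal-width (equiprobable) bins on [0, 1]
k = 20
n_j, _ = np.histogram(y, bins=np.linspace(0.0, 1.0, k + 1))
E = np.sum((n_j - len(y) / k) ** 2 / (len(y) / k))
print("chi2:", E, "<=", stats.chi2.ppf(0.95, k - 1))

# (b) K-S test; 1.358/sqrt(n) approximates the alpha = 0.05 critical value
v = np.sort(y)
i = np.arange(1, len(v) + 1)
Dn = max(np.max(i / len(v) - v), np.max(v - (i - 1) / len(v)))
print("K-S:", Dn, "<=", 1.358 / np.sqrt(len(v)))

# (c) runs-up-and-down test on the original (untransformed) sequence
s = np.sign(np.diff(x))
R = 1 + np.count_nonzero(s[1:] != s[:-1])
mu = (2 * len(x) - 1) / 3.0
sig = np.sqrt((16 * len(x) - 29) / 90.0)
print("runs:", mu - 1.96 * sig, "<=", R, "<=", mu + 1.96 * sig)
```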



Where are we in our story ?

What we know

• Modeling to some extent (Take IEOR E4106 to get more of it!)

• Methods for generating samples from given distributions

• Selecting input distributions based on data

Still to come ...

• Methods for building simulators (we know some particular cases, such as the Car Wash simulator)

• Methods for analyzing the output from the simulator

• Methods for improving the efficiency of simulators
