
IE5005 Data Analytics for Industrial Engineers

Lecture 04. Statistical Inference and Data Resampling

Dr. Wang Zhiguo


[email protected]

Semester 1 AY2024/25
Course Outline

01 Statistical inference
• Point estimation
• Sampling distribution
• Hypothesis testing
• Confusion matrix

02 Cross validation
• The validation set approach
• Leave-one-out cross validation (LOOCV)
• K-fold cross validation

03 Bootstrapping
• What is bootstrapping
• Sampling and bootstrapping distribution
01
Statistical Inference
• Point estimation
• Sampling distribution
• Hypothesis testing
• Confusion matrix
Statistical inference
When we use the sample data we have collected to make estimates or draw conclusions about one or more characteristics of a population, we are using the process of statistical inference.

Source: Camm, J. D., Cochran, J. J., Fry, M. J., & Ohlmann, J. W. (2023). Business Analytics. 5th Edition, Cengage Learning.
Point estimation

• The sample mean $\bar{x}$ is a point estimator of the population mean $\mu$:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{2{,}154{,}420}{30} = \$71{,}814$$

• The sample standard deviation $s$ is a point estimator of the population standard deviation $\sigma$:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{325{,}009{,}260}{29}} = \$3{,}348$$

• The sample proportion $\bar{p}$ is a point estimator of the population proportion $p$:
$$\bar{p} = \frac{x_{\mathrm{Yes}}}{n} = \frac{19}{30} = 0.63$$
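A minimal Python sketch of these three point estimators; the individual data values below are placeholders generated to match the example's scale, not the actual sample from the textbook.

```python
import numpy as np

# Placeholder stand-in for the n = 30 observed salaries.
rng = np.random.default_rng(0)
x = rng.normal(71_814, 3_348, size=30)
yes = np.array([1] * 19 + [0] * 11)      # 19 of 30 answered "Yes"

x_bar = x.mean()                         # point estimate of mu
s = x.std(ddof=1)                        # point estimate of sigma (n - 1 divisor)
p_bar = yes.mean()                       # point estimate of p

print(f"x_bar = {x_bar:,.0f}, s = {s:,.0f}, p_bar = {p_bar:.2f}")
```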
Sampling distribution

A random variable is a quantity whose value is not known with certainty.

The sample mean $\bar{x}$ is a random variable. As a random variable, it has:

i) Expected value of $\bar{x}$:
$$E(\bar{x}) = \mu \quad \text{(unbiased)}$$

ii) Standard deviation of $\bar{x}$:

Finite population: $\sigma_{\bar{x}} = \sqrt{\dfrac{N - n}{N - 1}} \cdot \dfrac{\sigma}{\sqrt{n}}$

Rule of thumb: the finite population correction factor is used when the sample size is more than 5% of the population (i.e., $n/N > 5\%$).

Infinite population: $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

iii) The probability distribution of $\bar{x}$ is called the sampling distribution of $\bar{x}$.
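A small sketch of the two standard-error formulas above; `std_error_of_mean` is a hypothetical helper written for this note, not a library function.

```python
import math

def std_error_of_mean(sigma, n, N=None):
    """Standard error of the sample mean; applies the finite population
    correction factor when the sample is more than 5% of the population."""
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(std_error_of_mean(sigma=4000, n=30))          # infinite population
print(std_error_of_mean(sigma=4000, n=30, N=300))   # finite, n/N = 10%
```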
Re-visit Lecture 01

Suppose you want to study the customers of McDonald's.

The population of McDonald's customers is "infinitely" large, because the pool of customers is being generated by an ongoing process over time. Therefore, it is not possible to obtain a complete list of the full population at the time of sampling. [We do not know the value of N (the population size).]
Sampling distribution
❑ If the population has a normal distribution, the sampling distribution of $\bar{x}$ is normally distributed for any sample size.

❑ If the population does not have a normal distribution, the sampling distribution of $\bar{x}$ is approximately normal for large sample sizes.

Why? The central limit theorem.

Source: Camm, J. D., Cochran, J. J., Fry, M. J., & Ohlmann, J. W. (2023). Business Analytics. 5th Edition, Cengage Learning.
How ‘large’ is large?

Central limit theorem: In selecting random samples of size $n$ from a population, the sampling distribution of the sample mean $\bar{x}$ can be approximated by a normal distribution as the sample size becomes large.

How 'large' is large? Statistical researchers have investigated this question by studying the sampling distribution of $\bar{x}$ for a variety of populations and sample sizes. In most cases, the sampling distribution of $\bar{x}$ can be approximated by a normal distribution when $n = 30$ or more.
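A minimal simulation of the central limit theorem: draw many samples from a clearly non-normal (exponential) population and inspect the distribution of the sample means. The population and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

for n in (2, 30):
    # 10,000 samples of size n from an exponential population (mean 1, SD 1).
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:2d}: mean of x_bar = {means.mean():.3f}, "
          f"SD of x_bar = {means.std():.3f} (theory: {1 / np.sqrt(n):.3f})")
```

For n = 30, a histogram of `means` already looks approximately normal even though the population is heavily skewed.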
Hypothesis test

So far, we have shown how a sample can be used to develop point estimates of population parameters.

Now we continue the discussion of statistical inference by showing how hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.

• In hypothesis testing, we begin by making a tentative conjecture about a population parameter. This tentative conjecture is called the null hypothesis and is denoted by $H_0$.

• We then define another hypothesis, called the alternative hypothesis, which is the opposite of what is stated in the null hypothesis. The alternative hypothesis is denoted by $H_a$.

The hypothesis testing procedure uses data from a sample to test the validity of the two competing statements about the population indicated by $H_0$ and $H_a$.
Hypothesis test
Scenario 1.
Suppose you bought a can of coffee beans, and the label states that it contains 3 pounds.

We know that it is almost impossible to place exactly 3 pounds in every can. However, as long as the mean filling weight is at least 3 pounds, the rights of consumers will be protected.

However, you suspect that some cans contain less than 3 pounds. How can you justify this suspicion?
Hypothesis test

We can start by formulating an alternative hypothesis:
$$H_a: \mu < 3$$
and make it the conclusion we wish to support.

Then, the null hypothesis is defined as the opposite:
$$H_0: \mu \ge 3$$

The hypothesis test is conducted by gathering statistical evidence to reject $H_0$ (i.e., to accept $H_a$).

The null hypothesis is often formulated as a conjecture to be challenged (in other words, the alternative hypothesis is the conclusion that researchers want to support).
Hypothesis test

Scenario 1

$$H_0: \mu \ge 3 \qquad H_a: \mu < 3$$

There are two possible outcomes:
1. Reject $H_0$.
2. Do not reject $H_0$.
Hypothesis test

Suppose a sample of 36 cans of coffee is selected. The sample mean $\bar{x}$ and sample variance $s^2$ are computed. If the value of $\bar{x}$ is less than 3 pounds, say $\bar{x} = 2.98$, can we conclude that the coffee is underfilled?

No. There can be sampling error. If you take another sample of 36 cans, $\bar{x}$ could be 3.05, because $\bar{x}$ is a random variable.
Hypothesis test

The statistical inference question to answer here is: is the difference between 2.98 and 3 significant?

Test statistic:
$$t = \frac{\bar{x} - 3}{s_{\bar{x}}} = \frac{2.98 - 3}{s / \sqrt{36}} = -2.824$$

We use the t-statistic to determine whether $\bar{x}$ deviates from the hypothesized value of $\mu = 3$ far enough to justify rejecting $H_0$.

In this lower-tail example, the smaller (more negative) the t-statistic, the stronger the evidence for rejecting $H_0$.

How small must the t-statistic be before we can reject $H_0$? We look at the p value.
P value

A p value is the probability, assuming that $H_0$ is true, of obtaining a random sample of size $n$ that results in a test statistic at least as extreme as the one observed in the current sample.

A small p value indicates stronger evidence against $H_0$:

• A small p value means that, if $H_0$ were true, a sample result at least as extreme as the one observed would be very unlikely.
[It suggests that the observed sample would be very rare if $H_0$ were true. Since this "rare" sample did occur, $H_0$ is unlikely to be true.]
So, we reject $H_0$.
Student's t-distribution

Test statistic:
$$t = \frac{\bar{x} - 3}{s_{\bar{x}}} = \frac{\bar{x} - 3}{s / \sqrt{36}} = -2.824$$

$$P(t \le -2.824) = 0.00389$$

[Figure: the t-distribution, showing the lower-tail probability 0.00389 below $t = -2.824$, and the rejection region for $\alpha = 0.05$ with critical value $t = -1.690$.]

Rule of thumb: reject $H_0$ if p value $\le \alpha$.

In practice, the person responsible for the hypothesis test specifies the level of significance. Common choices for $\alpha$ are 0.05 and 0.01.
Hypothesis test

Demo of hypothesis test with Excel.

Download the file 'coffee.xlsx': a sample of 36 coffee cans' weights.
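For those working in Python instead of Excel, below is a minimal sketch of the same lower-tail test with SciPy; the generated weights are a stand-in for the data in 'coffee.xlsx' (the alternative='less' option requires SciPy 1.6 or newer).

```python
import numpy as np
from scipy import stats

# Stand-in for the 36 can weights in 'coffee.xlsx'.
rng = np.random.default_rng(1)
weights = rng.normal(loc=2.98, scale=0.0425, size=36)

# One-sample t-test of H0: mu >= 3 against Ha: mu < 3 (lower tail).
t_stat, p_value = stats.ttest_1samp(weights, popmean=3, alternative='less')
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")   # reject H0 if p <= alpha
```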
Hypothesis test
Scenario 2.

Companies stay competitive by developing new products, new technologies, and new services. But before adopting something new, it is desirable to conduct research to determine whether there is statistical support for the conclusion that the new approach is indeed better.

Suppose a particular automobile currently attains a fuel efficiency of 25 miles per gallon. A research group has developed a new fuel injection system that it claims increases the miles-per-gallon efficiency.

$H_0: \mu \le 25$ [null hypothesis]
$H_a: \mu > 25$ [alternative hypothesis]
Hypothesis test

There is a third scenario, which is to test the equality of a certain measure.

Scenario 3.

For example, you want to test whether the average age of Nobel Prize winners is precisely 60.

$H_0: \mu = 60$ [null hypothesis]
$H_a: \mu \ne 60$ [alternative hypothesis]
3 Forms of Hypothesis test

Scenario 1.         Scenario 2.         Scenario 3.
$H_0: \mu \ge 3$    $H_0: \mu \le 25$   $H_0: \mu = 60$
$H_a: \mu < 3$      $H_a: \mu > 25$     $H_a: \mu \ne 60$

Lower-tail test     Upper-tail test     Two-tailed test
(one-tailed)        (one-tailed)
How to compute P value
1. Compute the test statistic:
$$t = \frac{\bar{x} - 3}{s_{\bar{x}}} = \frac{\bar{x} - 3}{s / \sqrt{36}} = -2.824$$

2. Compute the p value from the cumulative probability.

One-tailed test:
• Lower-tail test: p value $= P(t \le -2.824) = 0.00389$
• Upper-tail test: p value $= P(t \ge -2.824) = 1 - 0.00389$

Two-tailed test:
• p value $= 2 \cdot \min\{P(t \le -2.824),\ P(t \ge -2.824)\}$
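A short sketch of the three p-value computations using the t-distribution CDF in SciPy, with the degrees of freedom assumed to be $n - 1 = 35$ from the 36-can sample.

```python
from scipy import stats

t_stat, df = -2.824, 35                 # n = 36 cans, so df = n - 1

p_lower = stats.t.cdf(t_stat, df)       # lower-tail test: P(t <= -2.824)
p_upper = stats.t.sf(t_stat, df)        # upper-tail test: P(t >= -2.824)
p_two = 2 * min(p_lower, p_upper)       # two-tailed test

print(f"lower: {p_lower:.5f}, upper: {p_upper:.5f}, two-tailed: {p_two:.5f}")
```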
Confusion Matrix

                                        Actual (population condition)
                                        $H_0$ true       $H_a$ true
Predicted           Do not reject $H_0$  Correct          Type II error
(your conclusion)   Reject $H_0$         Type I error     Correct

$\alpha$ (level of significance) = probability of making a Type I error when the null hypothesis is true.
$\beta$ = probability of making a Type II error when the alternative hypothesis is true.
Example: heart disease

Patient   Chest pain   Good blood circulation   Blocked arteries   Heart disease
1         Yes          No                       Yes                Yes
2         No           Yes                      No                 No
3         No           Yes                      Yes                No
4         Yes          Yes                      No                 Yes
…         …            …                        …                  …

We can use different models to predict the classification of 'heart disease'. How do we determine which model works best with the data?
Confusion Matrix

                                                     Actual (population condition)
                                                     Does not have heart disease (−)   Has heart disease (+)   Total
Predicted           Does not have heart disease (−)  TN                                FN                      N* = TN + FN
(your conclusion)   Has heart disease (+)            FP                                TP                      P* = FP + TP
                    Total                            N = TN + FP                       P = FN + TP
Confusion Matrix

Notation   Name            Interpretation

FP/N       Type I error    The probability of a false positive (FP). [In the example, the probability of wrongly classifying a patient as 'has heart disease' when he/she actually does not have heart disease.]

FN/P       Type II error   The probability of a false negative (FN). [In the example, the probability of failing to diagnose a patient who actually has heart disease.]
Confusion Matrix

Notation                           Name          Interpretation

TN/N                               Specificity   = 1 − P(Type I error) (probability of a TN)

TP/P                               Sensitivity   = 1 − P(Type II error) (probability of a TP)

(TP + TN)/(TP + TN + FP + FN)      Accuracy      Classification accuracy
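A minimal sklearn sketch of these metrics; the ten labels below are made up for illustration. For 0/1 labels, sklearn.metrics.confusion_matrix returns the counts in the order [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = has heart disease, 0 = does not.
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                 # 1 - P(Type I error)
sensitivity = tp / (tp + fn)                 # 1 - P(Type II error)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"specificity={specificity:.2f}, sensitivity={sensitivity:.2f}, "
      f"accuracy={accuracy:.2f}")
```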
After-class Practice: example of bank loan applications

A dataset of bank loan applications from various customers is available. The Python code for computing the confusion matrix is in IE5005_L04_codes.ipynb.
• Default: 1
• Not default: 0

The confusion matrix is computed as below:

                           Actual
                           Negative (−)   Positive (+)
Predicted   Negative (−)   4803           113
            Positive (+)   22             62

Answer the following questions:
1. Compute the Type I error.
2. Compute the Type II error.
(Answer)

The confusion matrix, with cells labelled:

                              Actual
                              Not default (−)   Default (+)
Predicted   Not default (−)   4803 (TN)         113 (FN)
            Default (+)       22 (FP)           62 (TP)
            Total             4825 (N)          175 (P)

1. Type I error = FP/N = 22/4825 = 0.5%
Customers who do not default are wrongly classified as 'default', so their loan applications are wrongly rejected (loss of reputation).

2. Type II error = FN/P = 113/175 = 64.6%
Customers who will default are wrongly classified as 'not default', so their loan applications are wrongly approved (loss of profit).
Data Resampling
• Cross Validation
• Bootstrap
Data sampling Vs Data re-sampling

❑ Data sampling refers to statistical methods for actively gathering observations with the intent of estimating a population variable.

❑ Data resampling refers to methods for economically reusing a collected dataset to improve the estimate of a population parameter and to help quantify the uncertainty of the estimate.

❑ To obtain additional information about a fitted model (e.g. the variability of a fitted regression model), we can repeatedly draw samples from the existing sample.

❑ This allows us to obtain information that would not be available from fitting the model only once using the original sample.
02
Cross Validation
• The validation set approach
• Leave-one-out cross validation (LOOCV)
• K-fold cross validation
Training Vs Testing Data

Based on some historical observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we can train some statistical model $\hat{f}$. We can then use the trained model to compute $\hat{f}(x_1), \hat{f}(x_2), \ldots, \hat{f}(x_n)$ and see how well it predicts the true values $y_1, y_2, \ldots, y_n$.

The historical observations used for training a statistical model are called the training set.

Suppose we are interested in developing an algorithm to predict Singapore's weather. The training data could be the weather conditions in the past 6 months.
Training Vs Testing Data

But in general, we do not really care how well the model works on the training data. Rather, we are interested in the accuracy of the predictions when we apply the model to previously unseen* test data. Why?

[Cartoon: "I predict it rained yesterday in Singapore." — "It indeed rained yesterday. Perfect prediction!!!" — "Are you serious?"]

*['unseen' in the sense that these data were not used for training]
Training Vs Testing set

❑ Training set: the data set used for fitting/training the machine learning model.

❑ Testing set: the data set used for evaluating the performance of the fitted model on new data. It is also known as the 'validation set' or 'hold-out set'. (It is unseen by the model, i.e., disjoint from the training set.)

❑ The key difference among the various cross-validation approaches is how the original data set is partitioned into the training set and the validation set!
i. The validation set approach
Simply partition the data set into two parts, of sizes $K$ and $(n - K)$:

Original data set ($n$ observations): 1, 2, …, 10, …, 50, …, n

The observations are shuffled, then split into a training set ($K$ observations, e.g., 2, 50, …, 6) and a validation set ($n - K$ observations, e.g., 1, n, …, 10).
The validation set approach
Merits

• It is simple to understand and easy to implement.

• It is computationally cheap.

Drawbacks

• Since the data set is divided into two parts at random each time, the resulting training and validation sets differ from run to run. The computed validation error therefore varies with the particular split.

• A significant number of observations are held out for validation, which could otherwise have been used for training the model. Because the model is fitted on this reduced training set, the validation error tends to overestimate the test error that would be obtained if the model were fitted on the entire data set.
The validation set approach

sklearn.model_selection.train_test_split

IE5005_L04_codes.ipynb
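A minimal usage sketch of train_test_split on made-up data; the 70/30 split ratio and random_state value are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Hold out 30% of the data for validation; the split is random,
# so fixing random_state makes it reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_val.shape)   # (70, 3) (30, 3)
```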
ii. Leave-one-out cross-validation (LOOCV)

Original Data Set (n observations)

𝑥1 , 𝑦1 𝑥2 , 𝑦2 …… 𝑥i , 𝑦i …… 𝑥n , 𝑦n

(1) 𝑥1 , 𝑦1 𝑥2 , 𝑦2 …… 𝑥i , 𝑦i …… 𝑥n , 𝑦n

(2) 𝑥1 , 𝑦1 𝑥2 , 𝑦2 …… 𝑥i , 𝑦i …… 𝑥n , 𝑦n

……
(i) 𝑥1 , 𝑦1 𝑥2 , 𝑦2 …… 𝑥i , 𝑦i …… 𝑥n , 𝑦n

……
(n) 𝑥1 , 𝑦1 𝑥2 , 𝑦2 …… 𝑥i , 𝑦i …… 𝑥n , 𝑦n
40
Leave-one-out cross-validation (LOOCV)
Merits

• LOOCV tends not to overestimate the testing error as much as the validation set approach, since more observations are used to train the model in each split.

• Unlike the validation set approach, which yields a different testing MSE depending on how the data happen to be divided, LOOCV always returns the same result no matter who applies it or how many times it is repeated.

Drawbacks

• Since the model has to be trained and tested $n$ times, LOOCV can be computationally expensive. This has two aspects. First, if the fitted model is simple, like linear or polynomial regression, each fit is fast; but if the model has a complex form, it can be time-consuming to fit even once, let alone $n$ times. Second, even if each fit is fast, it must be repeated $n$ times; if $n$ is very large, the total cost can still be substantial.
Leave-one-out cross-validation (LOOCV)

sklearn.model_selection.LeaveOneOut

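A minimal LOOCV sketch using LeaveOneOut with cross_val_score; the linear-regression model and the made-up data are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Made-up data: 50 observations, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.3, size=50)

# One fit per observation: each split holds out a single data point.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')
print(f"LOOCV MSE estimate: {-scores.mean():.4f} ({len(scores)} fits)")
```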
iii. K-fold cross-validation
Original data set ($n$ observations): $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

The observations are shuffled and partitioned into $k$ folds of roughly equal size, e.g., fold 1: $(x_2, y_2), \ldots, (x_4, y_4)$; fold 2: $(x_1, y_1), \ldots, (x_n, y_n)$; …; fold k: $(x_3, y_3), \ldots, (x_7, y_7)$.

The model is then trained and validated $k$ times: in round $j$, fold $j$ is held out as the validation set and the remaining $k - 1$ folds form the training set.
K-fold cross-validation

Merits

• More computationally affordable than LOOCV.

• Less biased toward overestimating the actual testing error than the validation set approach (though more biased than LOOCV).

Drawbacks

• Relatively more computationally expensive than the validation set approach.

• [Bias-variance tradeoff] k-fold cross-validation has higher variance in the testing error than the validation set approach (though less variance than LOOCV). [The mean of many highly correlated quantities tends to have higher variance than the mean of many less correlated quantities.]
K-fold cross-validation

sklearn.model_selection.KFold

sklearn.model_selection.cross_val_score
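A minimal 5-fold cross-validation sketch combining KFold and cross_val_score; the model, the made-up data, and k = 5 are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data: 100 observations, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

# Each of the 5 folds is held out once while the other 4 train the model.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring='neg_mean_squared_error')
print(f"5-fold CV MSE estimate: {-scores.mean():.4f}")
```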
Quick Quiz

Given n data points in the sample, leave-one-out cross validation (LOOCV) can be
considered as a special case of k-fold cross validation with k = ___?

A. 1
B. n-1
C. n
D. n+1

03
Bootstrapping
• What is bootstrapping
• Sampling and Bootstrap distribution
Bootstrapping

[Figure: four panels illustrating the bootstrapping procedure.]
Source of image: https://www.youtube.com/watch?v=Xz0x-8-cgaQ
Bootstrapping

How do we confirm whether the drug is effective, and that the result is not due to chance?

One way is to repeat the experiment many times on many patients, recording the mean value for each round of the experiment.

We would then get a histogram (distribution) of the mean values:
• Mean values close to 0~0.5 have higher probability; they are more likely to occur.
• Mean values far away from 0~0.5 have lower probability; they rarely occur.

But this is expensive and time-consuming!!!
Bootstrapping

[Figure: a bootstrapped data set, created by randomly sampling the original observations with replacement.]
Bootstrapping

Repeat the process and construct the distribution of the recorded statistic.

The process of creating bootstrapped data sets and calculating and recording some desired statistic (in this case, the mean value) is called bootstrapping.
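A minimal bootstrap sketch in NumPy: resample the observed data with replacement many times and record the mean of each resample. The data, the number of resamples, and the seed are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.25, 1.0, size=40)    # stand-in for the observed sample

# Resample with replacement 2,000 times, recording each resample's mean.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2_000)
])

# The spread of boot_means approximates the sampling distribution of the mean.
print(f"bootstrap mean = {boot_means.mean():.3f}, "
      f"standard error = {boot_means.std(ddof=1):.3f}")
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))
```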
Bootstrapping

From the bootstrap distribution we can estimate:
• Mean
• SD
• 95% CI
• and test $H_0: \mu = 0$
Bootstrapping
Financial Portfolio Optimization

Suppose you wish to invest in two assets X and Y, whose returns are x and y. (The return of an asset means that every $1 invested yields $x or $y.)

In practice, x and y are random variables, which may change over time. Assume that a fraction $\theta$ ($0 \le \theta \le 1$) of your fund is invested in asset X and the remaining $(1 - \theta)$ is invested in Y. The total return of the portfolio consisting of assets X and Y is $\theta x + (1 - \theta) y$.

The risk of the portfolio is measured by the volatility $\mathrm{Var}(\theta x + (1 - \theta) y)$. One of the objectives in portfolio management is to minimise the volatility of the portfolio return. It can be shown that the optimal $\theta$, which minimises the variance, is

$$\theta = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$$

where $\sigma_X^2 = \mathrm{Var}(x)$, $\sigma_Y^2 = \mathrm{Var}(y)$, and $\sigma_{XY} = \mathrm{Cov}(x, y)$.
Finger Exercise
Write a Python program for the following tasks (a sketch is given after this list):

• Generate a random sample $(x, y)$ of size $n = 100$, where $x \in [0.2, 2.5]$ and $y \in [0.7, 3.5]$.

• Construct a function named 'compute_theta()' which returns the value of the optimal $\theta$.

• Generate 200 bootstrapped data sets.

• Draw a histogram of the $\theta$ values of the bootstrapped data sets.

• Compute the mean and standard error of the bootstrap estimate of $\theta$.

IE5005_L04_codes.ipynb
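A minimal sketch of the exercise under stated assumptions: the exercise only gives the ranges for x and y, so uniform distributions are assumed; the function name compute_theta() follows the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def compute_theta(x, y):
    """Optimal theta = (var(y) - cov(x, y)) / (var(x) + var(y) - 2 cov(x, y))."""
    c = np.cov(x, y)   # 2x2 matrix: variances on the diagonal, covariance off it
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# Original sample: uniform draws on the stated intervals (an assumption).
n = 100
x = rng.uniform(0.2, 2.5, size=n)
y = rng.uniform(0.7, 3.5, size=n)

# 200 bootstrapped data sets: resample (x, y) pairs with replacement.
thetas = np.array([
    compute_theta(x[idx], y[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(200))
])

plt.hist(thetas, bins=20)
plt.xlabel('theta'); plt.ylabel('frequency')
plt.show()

print(f"bootstrap mean of theta = {thetas.mean():.4f}")
print(f"bootstrap standard error = {thetas.std(ddof=1):.4f}")
```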
Feel free to share your feedback with me via this
link/QR code throughout the whole semester.

https://app.sli.do/event/hUgiGrg7Ln8KeEFVyCT9o3

Thank You!

