IE5005 Lecture 04
Semester 1 AY2024/25
Course Outline
01 Statistical inference
• Point estimation
• Sampling distribution
• Hypothesis testing
• Confusion matrix
02 Cross validation
• The validation set approach
• Leave-one-out cross validation (LOOCV)
• K-fold cross validation
03 Bootstrapping
• What is bootstrapping
• Sampling and bootstrap distribution
01
Statistical Inference
• Point estimation
• Sampling distribution
• Hypothesis testing
• Confusion matrix
Statistical inference
When we use the sample data we have collected to make estimates or draw conclusions
about one or more characteristics of a population, we are using the process of statistical
inference.
Sampling distribution
A random variable is a quantity whose values are not known with certainty.
i) Expected value of x̄: E(x̄) = μ (x̄ is an unbiased estimator of μ)
ii) Standard deviation of x̄: σ_x̄ = σ/√n

Infinite Population
The population of McDonald's customers is "infinitely" large, because the pool of
customers is generated by an ongoing process over time. Therefore, it is not possible to
obtain a complete list of the full population at the time of sampling. [We don't know the
value of N (population size).]
Sampling distribution
❑ If the population has a normal distribution, the sampling distribution of x̄ is normally
distributed for any sample size.
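The two properties above, E(x̄) = μ and σ_x̄ = σ/√n, can be checked with a short simulation. A minimal sketch, assuming a made-up population with μ = 3 and σ = 0.18 (these numbers are purely illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 3.0, 0.18, 36   # hypothetical population parameters

# Draw 10,000 samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

# The average of the sample means is close to mu (unbiasedness),
# and their standard deviation is close to sigma / sqrt(n) = 0.03
print(sample_means.mean(), sample_means.std())
```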
So far, we have shown how a sample can be used to develop point estimates of
population parameters. Hypothesis testing takes this one step further.
• We first make a tentative assumption about a population parameter. This assumption
is called the null hypothesis and is denoted by H₀.
• We then define another hypothesis, called the alternative hypothesis, which is the
opposite of what is stated in the null hypothesis. The alternative hypothesis is denoted
by Hₐ.
The hypothesis testing procedure uses data from a sample to test the validity of the two
competing statements about a population that are indicated by H₀ and Hₐ.
Hypothesis test
Scenario 1.
Suppose you bought a can of coffee beans whose label states that it contains 3 pounds.
We know that it is almost impossible to place exactly 3 pounds in each can.
However, as long as the mean filling weight is at least 3 pounds, the rights of consumers
will be protected.
Hypothesis test
The null hypothesis is often formulated as a conjecture to be challenged (in other words,
the alternative hypothesis is the conclusion that researchers want to support).
Hypothesis test
Scenario 1
H₀: μ ≥ 3
Hₐ: μ < 3
There are two possible conclusions:
1. Reject H₀.
2. Do not reject H₀.
Hypothesis test
Suppose a sample of 36 cans of coffee is selected, and the sample mean x̄ and sample
variance s² are computed. If the value of x̄ is less than 3 pounds, say x̄ = 2.98, can we
conclude that the coffee is underfilled?
No. There can be sampling error. If you take another sample of 36 cans, x̄ could be 3.05,
because x̄ is a random variable.
Hypothesis test
Test statistic
t = (x̄ − 3) / s_x̄ = (2.98 − 3) / (s/√36) = −2.824
We use the t-statistic to determine whether x̄ deviates from the hypothesized value μ = 3
enough to justify rejection of H₀.
In this example, the smaller (more negative) the t-statistic, the stronger the evidence for
rejecting H₀.
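The computation of the test statistic and its lower-tail probability can be sketched with scipy. The sample standard deviation s below is a made-up value, chosen only so that t reproduces the slide's −2.824 (the slides do not report s):

```python
import math
from scipy import stats

# Hypothetical summary statistics; s = 0.0425 is an assumed value that
# makes t match the slide's result of about -2.824
n, xbar, s, mu0 = 36, 2.98, 0.0425, 3.0

t = (xbar - mu0) / (s / math.sqrt(n))   # test statistic
p_lower = stats.t.cdf(t, df=n - 1)      # lower-tail p-value, df = n - 1
print(t, p_lower)
```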
Student's t-distribution
[Figure: the t-distribution with the test statistic t = −2.824 marked. At α = 0.05 the
lower-tail critical value is −1.690 (df = 35), and the lower-tail probability is
P(t ≤ −2.824) = 0.00389. Since t falls below the critical value, H₀ is rejected.]
Hypothesis test
Scenario 2.
Suppose a particular automobile currently attains a fuel efficiency of 25 miles per gallon.
A research group has developed a new fuel injection system that is claimed to increase
the miles-per-gallon efficiency.
H₀: μ ≤ 25 [null hypothesis]
Hₐ: μ > 25 [alternative hypothesis]
Hypothesis test
Scenario 3.
For example, you want to test whether the average age of Nobel Prize winners is precisely 60.
H₀: μ = 60 [null hypothesis]
Hₐ: μ ≠ 60 [alternative hypothesis]
3 Forms of Hypothesis test
Lower-tail test    Upper-tail test    Two-tailed test
H₀: μ ≥ 3          H₀: μ ≤ 25         H₀: μ = 60
Hₐ: μ < 3          Hₐ: μ > 25         Hₐ: μ ≠ 60
How to compute the P value
1. Compute the test statistic:
t = (x̄ − 3) / s_x̄ = (2.98 − 3) / (s/√36) = −2.824
2. Compute the cumulative probability corresponding to the test statistic.
• Lower-tail test: P(t ≤ −2.824) = 0.00389
• Upper-tail test: P(t ≥ −2.824) = 1 − 0.00389
• Two-tailed test: P(|t| ≥ 2.824) = 2 × 0.00389
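The three cumulative probabilities above can be computed with `scipy.stats.t`; a minimal sketch using the slide's t = −2.824 with df = 35 (from the sample of 36):

```python
from scipy import stats

t, df = -2.824, 35   # test statistic and degrees of freedom (n - 1)

p_lower = stats.t.cdf(t, df)            # lower-tail: P(T <= t)
p_upper = stats.t.sf(t, df)             # upper-tail: P(T >= t)
p_two = 2 * stats.t.sf(abs(t), df)      # two-tailed: P(|T| >= |t|)
print(p_lower, p_upper, p_two)
```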
Confusion Matrix
α (level of significance) = probability of making a Type I error, i.e. rejecting H₀ when the
null hypothesis is true.
β = probability of making a Type II error, i.e. failing to reject H₀ when the alternative
hypothesis is true.
Example: heart disease
Patient Chest pain Good blood circulation Blocked arteries Heart disease
1 Yes No Yes Yes
2 No Yes No No
3 No Yes Yes No
4 Yes Yes No Yes
… … … … …
How to determine which model works best with the data?
Confusion Matrix
Classification accuracy = (TP + TN) / (TP + TN + FP + FN)
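The accuracy formula can be evaluated directly from a confusion matrix with sklearn; the labels below are made-up illustrations (1 = heart disease, 0 = no heart disease), not the course data set:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

# sklearn lays out the 2x2 matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)
```

The manual ratio agrees with `accuracy_score(y_true, y_pred)`, which computes the same quantity.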
After-class Practice: example of bank loan application
IE5005_L04_codes.ipynb
02
Cross Validation
• The validation set approach
• Leave-one-out cross validation (LOOCV)
• K-fold cross validation

(Answer)
❑ Repeatedly drawn samples improve the estimate of the population parameter and help
to quantify the uncertainty of the estimate.
❑ In order to obtain additional information about the fitted model (e.g. the variability of a
fitted regression model), we can repeatedly draw samples from the existing sample.
❑ This allows us to obtain information that would not be available from fitting the model
only once.
Here, the historical observations used for training a statistical model are called the training set.
Training Vs Testing Data
But in general, we do not really care how well the model works on the training data.
Rather, we are interested in the accuracy of the predictions that we obtain when we apply
our method to previously unseen* test data. Why?
[Cartoon: a forecaster boasts "I predict it rained yesterday in Singapore. It indeed rained
yesterday. Perfect prediction!" and the listener replies "Are you serious?" Predicting data
you have already seen is not impressive.]
*['unseen' in the sense that these data were not used for training]
Training Vs Testing set
❑ Training set: the data set used for fitting/training the machine learning model.
❑ Testing set: the data set used for testing the performance of the fitted model on a new
data set. It is also known as the 'validation set' or 'hold-out set'. (Unseen to the model
during training.)
❑ The key difference among the various cross validation approaches is how they
partition the original data set into the training set and the validation set!
i. The validation set approach
Simply partition the n observations at random into two parts: a training set of size (n − K)
and a validation set of size K.
[Diagram: the observation indices 1, 2, …, n are randomly shuffled and then split into a
training part and a validation part.]
Drawbacks
• Since the data set is divided into two parts at random, each split yields a different
training and validation set. The validation error computed will therefore vary from
split to split.
• A significant number of observations are held out for validation, which could
otherwise have been used for training the model. A model trained on this reduced set
tends to perform worse, so the validation error tends to overestimate the test error
that would be obtained by fitting the model on the entire data set.
The validation set approach
sklearn.model_selection.train_test_split
IE5005_L04_codes.ipynb
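A minimal sketch of the validation set approach with `train_test_split`, using a toy data set (the 10 rows and the 30% hold-out fraction are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix with n = 10 rows
y = np.arange(10)

# Randomly hold out 30% of the observations as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_val.shape)
```

Changing `random_state` gives a different split, which is exactly the source of the variability discussed in the drawbacks above.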
ii. Leave-one-out cross-validation (LOOCV)
[Diagram: from the full sample (x₁, y₁), …, (xₙ, yₙ), fold (i) holds out the single
observation (xᵢ, yᵢ) for validation and trains on the remaining n − 1 observations;
this is repeated for i = 1, …, n.]
Leave-one-out cross-validation (LOOCV)
Merits
• LOOCV tends not to overestimate the testing error as much as the validation set approach,
since more observations are used by LOOCV in training the model.
• Unlike the validation set approach, which yields a different testing MSE depending on how
the data are divided, LOOCV always returns the same result no matter who applies it or how
many times it is repeated.
Drawbacks
• Since the model has to be trained and tested n times, LOOCV can be computationally
expensive. This has two aspects. First, if the fitted model is simple, like linear or polynomial
regression, each fit is fast; but if the model has a complex form, it can be time-consuming to
fit even once, let alone n times. Second, even if each fit is fast, it must be repeated n times;
if n is very large, LOOCV can still be computationally intensive.
Leave-one-out cross-validation (LOOCV)
sklearn.model_selection.LeaveOneOut
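A minimal LOOCV sketch with `LeaveOneOut` and `cross_val_score`; the linear data below (20 points, y ≈ 2x plus noise) is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))              # toy predictors
y = 2 * X.ravel() + rng.normal(0, 0.1, size=20)  # toy responses

# One fit per observation: each split holds out exactly one point
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')
loocv_mse = -scores.mean()   # average squared error over the n fits
print(loocv_mse)
```

Note that `scores` has one entry per observation, reflecting the n model fits discussed in the drawbacks above.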
iii. K-fold cross-validation
Original Data Set (n observations)
[Diagram: the n observations are randomly shuffled and divided into K folds of roughly
equal size; each fold in turn serves as the validation set while the remaining K − 1 folds
form the training set.]
K-fold cross-validation
Merits
• Less biased in overestimating the actual testing error than the validation set approach
(nevertheless, more biased than LOOCV).
Drawbacks
• [Bias-variance tradeoff] K-fold cross validation has higher variation in the testing error
than the validation set approach (nevertheless, the variation is less than LOOCV). [The
mean of many highly correlated quantities tends to have higher variance than the mean
of many less correlated quantities.]
K-fold cross-validation
sklearn.model_selection.KFold
sklearn.model_selection.cross_val_score
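A minimal K-fold sketch with `KFold` and `cross_val_score`, reusing the same kind of made-up linear data as above (50 points, K = 5 are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))              # toy predictors
y = 2 * X.ravel() + rng.normal(0, 0.1, size=50)  # toy responses

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # K = 5 folds
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring='neg_mean_squared_error')
kfold_mse = -scores.mean()   # average validation MSE over the 5 folds
print(kfold_mse)
```

Only 5 model fits are needed here, versus n = 50 for LOOCV, which illustrates the computational saving.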
Quick Quiz
Given n data points in the sample, leave-one-out cross validation (LOOCV) can be
considered as a special case of k-fold cross validation with k = ___?
A. 1
B. n-1
C. n
D. n+1
03
Bootstrapping
• What is bootstrapping
• Sampling and Bootstrap distribution
Bootstrapping
[Figure: a small original sample is resampled with replacement to create a bootstrapped
data set of the same size; some observations appear more than once and others not at
all, and the mean of each bootstrapped data set is recorded.]
Source of image: https://www.youtube.com/watch?v=Xz0x-8-cgaQ
Bootstrapping
[Figure: the resampling is repeated many times, and the recorded statistics are used to
construct the bootstrap distribution.]
The process of creating bootstrapped data sets, then calculating and recording some
desired statistic (in this case, the mean), is called bootstrapping.
Bootstrapping
From the bootstrap distribution we can compute:
• Mean
• SD
• 95% CI
• a test of H₀: μ = 0
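The bullet points above can be sketched in a few lines of numpy. The original sample below is made up (100 normal draws), and B = 2000 bootstrap replications is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.5, 1.0, size=100)   # hypothetical original sample

B = 2000
# Resample with replacement B times and record the mean each time
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(B)
])

boot_sd = boot_means.std(ddof=1)                           # bootstrap SE of the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])   # percentile 95% CI
print(boot_means.mean(), boot_sd, ci_low, ci_high)
```

If the 95% CI excludes 0, the bootstrap test would reject H₀: μ = 0 at the 5% level.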
Bootstrapping
Financial Portfolio Optimization
Suppose you wish to invest in two assets X and Y, with returns x and y respectively. (The
return of an asset means that every $1 invested yields $x or $y.)
In practice, x and y are random variables, which may change over time. It is assumed that a fraction
θ (0 ≤ θ ≤ 1) of your fund is invested in asset X and the remaining (1 − θ) is invested in Y. The total
return of the portfolio consisting of assets X and Y is θx + (1 − θ)y.
The risk of the portfolio is measured by the volatility Var(θx + (1 − θ)y). One of the objectives in
portfolio management is to minimise the volatility of the portfolio return. It can be shown that the
optimal θ, which minimises the variance, is computed as:
θ = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2σ_XY)
Finger Exercise
Write a Python program for the following tasks:
• Generate a random sample (x, y) with size n = 100, where x ∈ [0.2, 2.5] and y ∈ [0.7, 3.5].
• Construct a function named 'compute_theta()' which returns the value of the optimal θ.
IE5005_L04_codes.ipynb
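One possible sketch of the exercise (many valid answers exist). The slide gives only the ranges, so uniform draws over [0.2, 2.5] and [0.7, 3.5] are assumed here, and the σ terms in the θ formula are estimated with the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed: returns drawn uniformly over the stated ranges
x = rng.uniform(0.2, 2.5, size=100)
y = rng.uniform(0.7, 3.5, size=100)

def compute_theta(x, y):
    """Return the variance-minimising fraction theta invested in asset X."""
    cov = np.cov(x, y)   # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

theta = compute_theta(x, y)
print(theta)
```

Bootstrapping (x, y) and recomputing θ on each resample would then give a standard error for the estimated θ, tying this exercise back to the bootstrap section.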
Feel free to share your feedback with me via this
link/QR code throughout the whole semester.
https://app.sli.do/event/hUgiGrg7Ln8KeEFVyCT9o3
Thank You!