AST Day 2 Slides (uploaded to Scribd by Joel Lim)

Analytics Strategy and

Techniques (Day 2)

Neumann Chew C. H.
ITOM, Nanyang Business School.
[email protected]

Updated: 26 Aug 2023


Day 2 Schedule

2
Day 2 (Part 1a)

THE USEFULNESS OF A MODEL


3
Models
• Plan for the models that will be
tested.
• Some models have special
performance metrics in-built.
• Some models need extra code to
compute specific metrics.
• Models learnt in this course
– Linear Regression
– Logistic Regression
– CART
– Random Forest
4
Models* on the Explainability Scale

Linear Regression → Quantile Regression → Logistic Regression → CART → MARS → Neural Network → Deep Learning

Highest Explainability Power (White Box) … Lowest Explainability Power (Black Box)

*: Selected list of models on the Explainability Scale.

Source: Chew C.H. (2021) Artificial Intelligence, Analytics and Data Science Vol. 1 Core Concepts and Models, Chapter 2, Cengage.
The concept of a model is useful as it allows Xs.

X1, X2, X3, …, Xk → Model → Ŷ

Error = Y − Ŷ

Example: Predicting housing price. What is Y? What is Ŷ? What are the Xs?
6
Model Complexity

• The size of the model (e.g. the number of model parameters).
• The number of X variables.
• The greater the complexity, the lower the error on the training dataset.
• Should we be happy if we reach zero error on the dataset?
• Scenario: an investment company selecting which stock to buy using an ML model.
7
Train-Test Split

Source: Chew C.H. (2021) Artificial Intelligence, Analytics and Data Science Vol. 1 Core Concepts and Models, Chapter 2, Cengage.
Industry Standard Practice
• The Train-Test split is the industry standard for
ML/AI/Analytics practice in Predictive Modeling.
• There are two limitations:
– If Y is categorical, rare cases may appear in only one of the two
subsets.
– Data is sacrificed (from the model) to form a testset.

9
Train – Test Split
(Stratified version)

Source: Chew C.H. (2021) Artificial Intelligence, Analytics and Data Science Vol. 1 Core Concepts and Models, Chapter 2, Cengage.
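The course demonstrates the split in R (lecture videos 6.x); purely as a language-neutral sketch, here is a minimal stratified train-test split in stdlib Python. The function name and the 90/10 class mix are illustrative choices, not from the course materials:

```python
import random

def stratified_split(labels, test_frac=0.3, seed=42):
    """Split row indices so each class keeps roughly the same
    proportion in both the trainset and the testset."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for y, idx in by_class.items():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_frac))  # at least 1 per class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["yes"] * 90 + ["no"] * 10   # rare class: only 10% "no"
train, test = stratified_split(labels, test_frac=0.3)
```

Stratifying per class addresses the rare-category limitation noted on the previous slide: the rare "no" class is guaranteed to appear in both subsets.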
10-fold Cross Validation

Source: Chew C.H. (2021) Artificial Intelligence, Analytics and Data Science Vol. 1 Core Concepts and Models, Chapter 2, Cengage.
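The mechanics of 10-fold CV can be sketched in a few lines of stdlib Python (the helper name is illustrative): every observation lands in exactly one test fold, and the other nine folds form the trainset each time.

```python
import random

def kfold_indices(n, k=10, seed=1):
    """Return k (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        splits.append((train, test))
    return splits

splits = kfold_indices(100, k=10)
```

A model would be fitted k times, once per split, and the k test-fold errors averaged.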
Model Overfitting

Source: Chew C.H. (2021) Artificial Intelligence, Analytics and Data Science Vol. 1 Core Concepts and Models, Chapter 2, Cengage.
Common Model Performance Metrics

Predict a Continuous Target Variable Y:
• RMSE (Root Mean Square Error)
• MAPE (Mean Absolute Percentage Error)
• Mean Directional Accuracy (MDA)

Predict a Categorical Variable Y:
• Confusion Matrix
• False Positive Rate
• False Negative Rate
13
RMSE – A popular metric to compute model
prediction error on a continuous Y variable

1. Why is this a good metric?

2. Can we use RMSE for Categorical Y variable? Explain.

3. Netflix used RMSE in their US$1 mil prize. Right/Wrong? What’s the implication?
14
Compare Different Models’ Performance
• The lower the RMSE on a testset (or averaged over the 10 folds of CV), the better the model performance.

15
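RMSE is one line of Python. The two "models" below are just hard-coded prediction vectors for illustration; in practice they would come from two fitted models scored on the same testset.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error over a testset."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

y_test  = [10.0, 12.0, 15.0]
model_a = [11.0, 11.0, 16.0]   # errors of -1, +1, -1
model_b = [13.0,  9.0, 18.0]   # errors of -3, +3, -3
rmse_a, rmse_b = rmse(y_test, model_a), rmse(y_test, model_b)
# Lower RMSE on the same testset => better predictive performance (model A here).
```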
Watch Pre-class Lecture Videos 6.x or read main textbook Chapter 6.

Day 2 (Part 1b)

CONTINUOUS Y VARIABLE AND


LINEAR REGRESSION
16
Review of the Linear Regression Model

y = b0 + b1·x1 + b2·x2 + ⋯ + bm·xm + e,   where e ~ N(0, σ)

• The right-hand side (without e) is a straight-line equation in the Xs.
• ŷ (y-hat) is typically used as an estimate of y.
• Errors (aka residuals) follow a Normal distribution with mean 0 and a constant standard deviation.
17
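To make the coefficient/residual relationship concrete, here is a hand-rolled ordinary-least-squares fit for the simple one-predictor case (the data is invented for illustration; the course uses R's lm() for the general case):

```python
def ols_fit(x, y):
    """Ordinary least squares for y = b0 + b1*x (one predictor)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]          # roughly y = 2x with noise
b0, b1 = ols_fit(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
# OLS residuals sum to zero by construction (mean 0, as the slide states).
```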
Day 2 (Part 2a)

MULTICOLLINEARITY
18
Pre-Class Activity: Did Exercise 2.1 at Home

• Discuss Solution to Exercise 2.1.PDF


• The lm() function generates the linear regression model (see lecture
videos 6.x)
– Also shows how to do train-test split and compute RMSE.
• There is something strange in the linear regression model results in
Q3. Did you detect it?

19
Multi-Collinearity
• When an X can be “easily explained” using all the other Xs
in the model.
– Example: A linear combination of X1, X3, X4 can explain 91% of X2.
• Why do you still need that X in the model?
• Multicollinearity creates instability in the model coefficients, i.e. high variance in the coefficient of that X.

20
Example of Multicollinearity: Predict the Weight of a Growing Child

Ŷ = 4X1
Ŷ = 2X1 + 2X2
Ŷ = 10X1 − 6X2
Ŷ = 1000X2 − 996X1

• Many models can be used to predict Y when the X variables are collinear.
• Each model is equally accurate, yet the coefficients differ wildly.
21
Variance Inflation Factor (VIF) to Detect Multicollinearity

• Given a linear regression model with only continuous predictors, the variance inflation factor of Xj is

  VIF_j = 1 / (1 − R_j²)

• R_j² is the R² value of a linear regression of Xj ~ all other Xs in the model.
• Two popular cut-offs to conclude that Xj is collinear:
  – VIF > 5 (implies R² > 80%)
  – VIF > 10 (implies R² > 90%)
• VIF is for continuous-Xs-only models. Use the adjusted Generalised VIF if the model includes a categorical X.
  – Adj.GVIF is shown in the last column of the vif() output from the R package car.
• Two popular cut-offs using Adj.GVIF:
  – Adj.GVIF > sqrt(5) ≈ 2.24
  – Adj.GVIF > sqrt(10) ≈ 3.16
22
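The slide's vif() refers to the R package car. Purely to illustrate the formula VIF = 1/(1 − R²), here is a hand-rolled Python sketch for the special case of exactly two continuous predictors, where R_j² is just the R² of one predictor regressed on the other (the data is invented and deliberately collinear):

```python
def r_squared(x, y):
    """R^2 of the simple regression y ~ x (squared correlation)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def vif_two_predictors(x1, x2):
    """With exactly two predictors, VIF of each is 1 / (1 - R^2(x1, x2))."""
    return 1.0 / (1.0 - r_squared(x1, x2))

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]   # nearly 2 * x1: highly collinear
vif = vif_two_predictors(x1, x2)       # far above both cut-offs (5 and 10)
```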
Is multicollinearity to be avoided?
• No.
• Model Performance may still be very good.
• But do not interpret the model coefficients the standard
way if your model is multi-collinear.

23
Day 2 (Part 2b)

CATEGORICAL X VARIABLES AND


DUMMY VARIABLES
24
In-Class Activity: Do Exercise 2.2
Est: 20 mins.
• Clarify why we need dummy variables.
• Instructor shows how to do Q1 – Q3.
• Do (as much as you can) the remaining questions in Exercise
2.2.PDF

25
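In R, lm() creates dummy variables automatically from factor columns. To make the encoding concrete, here is an illustrative Python sketch (function name and data are mine): a categorical X with k levels becomes k − 1 dummy columns, and the baseline level is the all-zeros row.

```python
def make_dummies(values, baseline):
    """One 0/1 dummy column per non-baseline level; the baseline
    level is encoded as all zeros (k levels -> k-1 dummies)."""
    levels = sorted(set(values) - {baseline})
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

region = ["North", "South", "East", "North", "East"]
dummies = make_dummies(region, baseline="North")
```

Each dummy's coefficient is then interpreted relative to the baseline level.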
Watch Pre-class Lecture Videos 7.x or
read main textbook Chapter 7.

Day 2 (Part 3)

CATEGORICAL Y VARIABLE AND


LOGISTIC REGRESSION
26
Linear Regression Model for Continuous Y

X1, X2, X3, …, Xm → Model: b0 + b1·x1 + b2·x2 + ⋯ + bm·xm → Ŷ

• Xs are unrestricted.
• Model output Ŷ can be any value within a reasonable range.
• But what if Y is a categorical variable?
  – Has Disease X or not; Approve/Reject loan application; Pass/Fail;
  – Very Happy/Happy/Neutral/Sad/Very Sad; A/B/C/D/E/F; Red/Green/Blue, …
27
Logistic Regression Model for Categorical Y is a 2-step process.

X1, X2, X3, …, Xm → Linear equation: b0 + b1·x1 + b2·x2 + ⋯ + bm·xm → a function based on the Xs that can be used as a probability P(Y = cat1) → Ŷ = cat1 or Ŷ = cat0

• Find a function that takes the linear equation as input and outputs a probability.
• Compare the probability against a threshold to decide Ŷ = cat1 or Ŷ = cat0.
• For binary outcomes Y, a popular choice of threshold is 50%.
28
What is the Logistic Function?

• Linear function: f(x) = 2 + 3x
• Quadratic function: f(x) = 4x² + 2x − 2
• Logistic function: f(x) = 1 / (1 + e^(−x))
29
Logistic Function Output is Between 0 and 1

f(x) = 1 / (1 + e^(−x))

• Accepts any value for x.
• Output is between 0 and 1.
• Hence the logistic function f(x) can be interpreted as a probability.
30
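These two properties are easy to verify numerically (a minimal sketch; the probe values are arbitrary):

```python
import math

def logistic(x):
    """Logistic function: maps any real x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

probe = [logistic(x) for x in (-10, -1, 0, 1, 10)]
# Accepts any x; every output lies strictly between 0 and 1,
# and the function is monotonically increasing.
```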
Logistic Function with Multiple Xs

f(z) = 1 / (1 + e^(−z)), where z = b0 + b1·x1 + b2·x2 + ⋯ + bm·xm

• Model coefficients are optimised to fit the data.
• The Xs affect P(Y = cat1) via the model coefficients.
31
Logistic Regression Model for Categorical Y is a 2-step process.

X1, X2, X3, …, Xm → Linear equation: z = b0 + b1·x1 + b2·x2 + ⋯ + bm·xm → Logistic function: P(Y = cat1) = 1 / (1 + e^(−z)) → Ŷ = cat1 or Ŷ = cat0

• The logistic function takes the linear equation as input and outputs P(Y = cat1).
• P(Y = cat0) = 1 − P(Y = cat1).
32
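The full 2-step prediction fits in a few lines. The coefficients below are invented for illustration; in practice a fitted logistic regression model would supply them.

```python
import math

def predict_category(xs, coefs, intercept, threshold=0.5):
    """Step 1: linear equation z = b0 + sum(b_i * x_i).
    Step 2: logistic function gives P(Y = cat1); compare to threshold."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    p = 1.0 / (1.0 + math.exp(-z))
    return ("cat1" if p >= threshold else "cat0"), p

# Hypothetical fitted model: z = -1.0 + 1.2*x1 - 0.5*x2
label, p = predict_category([2.0, 1.0], coefs=[1.2, -0.5], intercept=-1.0)
```

Raising the threshold above P(Y = cat1) flips the same observation to cat0, which is how the error tradeoffs mentioned in the summary are tuned.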
Measuring Model Prediction Errors
If Y is continuous, the model prediction error can be
calculated by considering:
• For each obs: Error = Actual Y value – Model Predicted Y
• Over the entire dataset with n obs: RMSE.

If Y is binary (cat0 or cat1), then there are only two possible prediction errors for each obs:
• Model predicted Y = 1, but actually Y = 0. i.e. False Positive.
• Model predicted Y = 0, but actually Y = 1. i.e. False Negative.
• Over the entire dataset with n obs: Confusion Matrix.

33
Confusion Matrix

                     Predicted No   Predicted Yes
Actual No  (n = 60)       50              10
Actual Yes (n = 105)       5             100

• A confusion matrix compares the model's predicted values against the actual data values.
• The main-diagonal values (50 and 100) represent the number of correct model predictions.
• The off-diagonal values (10 and 5) represent the number of wrong model predictions.
34
Confusion Matrix

• True Positive rate = TP/Actual Yes = 100/105. Aka Sensitivity or Recall.


• False Positive rate = FP/Actual No = 10/60. Aka Type 1 error.
• True Negative rate = TN/Actual No = 50/60. Aka Specificity.
• False Negative rate = FN/Actual Yes = 5/105. Aka Type 2 error.
• Overall Accuracy = 150 / 165; Overall Error = 15 / 165.
35
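These rates can be checked directly from the slide's counts (TP = 100, FN = 5, FP = 10, TN = 50); a small Python sketch:

```python
def confusion_rates(tp, fn, fp, tn):
    """Standard rates derived from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # True Positive rate (Recall)
        "fpr":         fp / (fp + tn),   # False Positive rate (Type 1 error)
        "specificity": tn / (fp + tn),   # True Negative rate
        "fnr":         fn / (tp + fn),   # False Negative rate (Type 2 error)
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
    }

rates = confusion_rates(tp=100, fn=5, fp=10, tn=50)
# Matches the slide: sensitivity 100/105, FPR 10/60, accuracy 150/165, etc.
```

Note that sensitivity + FNR = 1 and specificity + FPR = 1, so the four rates carry only two independent numbers.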
Discuss Solution to Pre-Class Ex 2.3 Part A.

• Q1 reveals the main weakness of logistic regression.


• Q2 reveals a common misinterpretation of the model coefficient.

36
Odds of Event A is Defined in Terms of P(A)

Odds(A) ≡ P(A) / (1 − P(A))

• Typically expressed as two integers: a numerator and a denominator (e.g. "1 to 3").
37
Example: Probability & Odds of Heart Attack

• Event A: Heart Attack


• If P(A) = 0.25, what is the Odds(A)?
Odds (A) = 0.25/(1-0.25) = 1/3
Odds of A is 1 to 3.

• If P(A) = 0.75, what is the Odds(A)?


Odds(A) = 0.75/(1-0.75) = 3/1
Odds of A is 3 to 1.
38
Odds if P(Y = 1) is a Logistic Function

z = b0 + b1·X1 + b2·X2 + ⋯ + bm·Xm

P(Y = 1) = 1 / (1 + e^(−z))

Odds(Y = 1) ≡ P(Y = 1) / (1 − P(Y = 1)) = [1 / (1 + e^(−z))] ÷ [e^(−z) / (1 + e^(−z))] = e^z
39
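A quick numeric check of the slide examples and of the derivation above (the value z = 0.7 is arbitrary):

```python
import math

def odds_from_probability(p):
    """Odds(A) = P(A) / (1 - P(A))."""
    return p / (1.0 - p)

# Slide examples: P = 0.25 gives odds of 1 to 3; P = 0.75 gives odds of 3 to 1.
# Derivation check: if P(Y=1) is logistic in z, then Odds(Y=1) = e^z.
z = 0.7
p = 1.0 / (1.0 + math.exp(-z))
odds = odds_from_probability(p)
```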
Odds Ratio for Each Predictor

For each continuous X variable:

Odds Ratio(Y = 1) = Odds(Y = 1 if X increases by 1 unit) / Odds(Y = 1 if X is at the status quo) = e^coef

For each categorical X variable:

Odds Ratio(Y = 1) = Odds(Y = 1 if X is B) / Odds(Y = 1 if X is the baseline A) = e^coef

Request "Proof of Relationship between Logistic Reg Model Coef and Odds Ratio.PDF" from the instructor if you are interested in the proof.
40
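A numeric check that the odds ratio for a 1-unit increase in a continuous X equals e^coef, regardless of the starting value of X (the coefficients b0, b1 and the value x are invented for illustration):

```python
import math

def odds(z):
    """Odds(Y=1) when P(Y=1) is logistic in z; algebraically equals e^z."""
    p = 1.0 / (1.0 + math.exp(-z))
    return p / (1.0 - p)

b0, b1 = -2.0, 0.8   # hypothetical fitted intercept and coefficient of X
x = 1.5              # hypothetical status-quo value of X
odds_ratio = odds(b0 + b1 * (x + 1)) / odds(b0 + b1 * x)
# Equals e^b1 no matter what x is, because e^(z+b1) / e^z = e^b1.
```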
Identifying the Risk Factors for Y to be cat1
• Two equivalent “tests”
– Which X variable has p-value < 5%
– Which X variable has Odds Ratio 95% Confidence
Interval excluding 1.

41
Hours Studying is a risk factor in
Pass/Fail Exam (from p-value)

• Its p-value is less than 5% (actual p-value = 1.67%).


• Hours studying has a positive association with passing the exam.
– Coefficient is positive.
42
Hours Studying is a risk factor in Pass/Fail Exam
(from Odds Ratio 95% Confidence Interval)

• Odds Ratio 95% CI excludes 1.


• Statistical conclusion is the same as using p-value.

43
What’s special about Odds Ratio = 1?
𝑂𝑑𝑑𝑠 𝑜𝑓 𝑌 = 𝑐𝑎𝑡1 𝑖𝑓 𝐵 ℎ𝑎𝑝𝑝𝑒𝑛𝑠
=1
𝑂𝑑𝑑𝑠 𝑜𝑓 𝑌 = 𝑐𝑎𝑡1 𝑖𝑓 𝐴 ℎ𝑎𝑝𝑝𝑒𝑛𝑠

• Odds of getting Y = cat1 is the same regardless of A or B.


• A or B does not affect Y.

44
What if Odds Ratio > 1?
𝑂𝑑𝑑𝑠 𝑜𝑓 𝑌 = 𝑐𝑎𝑡1 𝑖𝑓 𝐵 ℎ𝑎𝑝𝑝𝑒𝑛𝑠
>1
𝑂𝑑𝑑𝑠 𝑜𝑓 𝑌 = 𝑐𝑎𝑡1 𝑖𝑓 𝐴 ℎ𝑎𝑝𝑝𝑒𝑛𝑠

• Odds of getting Y = cat1 is higher if B occurs compared to A.


• Odds is related to probability. The higher the probability, the
higher the Odds, and vice versa.

45
What if Odds Ratio < 1?
𝑂𝑑𝑑𝑠 𝑜𝑓 𝑌 = 𝑐𝑎𝑡1 𝑖𝑓 𝐵 ℎ𝑎𝑝𝑝𝑒𝑛𝑠
<1
𝑂𝑑𝑑𝑠 𝑜𝑓 𝑌 = 𝑐𝑎𝑡1 𝑖𝑓 𝐴 ℎ𝑎𝑝𝑝𝑒𝑛𝑠

• Odds of getting Y = cat1 is lower if B occurs compared to A.

46
The events A and B depend on the type of X
• If X is categorical, then dummy variables are
created, and A is always the baseline level.
• If X is continuous, then A is the status quo and B
is a 1 unit increase in X.

47
Odds Ratio for the predictor in passexam.csv

• If a student studies for 1 more hour, what do you expect to happen to his odds of passing the exam?
1. Equals 1
2. More than 1
3. Less than 1

48
Quantifying Risk Factor with Odds Ratio = ecoef

• For predictor Hours: Odds Ratio = e1.5046 ≈ 4.5


• If student studies for 1 more hour, odds of passing
the exam increases by a factor of 4.5.
49
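Checking the arithmetic (the coefficient 1.5046 is the fitted value for Hours from the slide's model output):

```python
import math

odds_ratio = math.exp(1.5046)   # e^coef for the Hours predictor; about 4.5
```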
Identifying the Risk Factors for Y to be cat1
• Two equivalent “tests”
– Which X variable has p-value < 5%
– Which X variable has Odds Ratio 95% Confidence Interval
excluding 1.

50
Discuss Solution to Pre-Class Ex 2.3 Part B.

• Shows that Odds Ratios alone are sufficient to identify high-risk factors.

51
Day 2 (Part 4)

MULTI-CATEGORICAL Y
52
What if Y has 3 or more categorical outcomes?

• A/B/C/D/E
• Pass/borderline Pass/Fail
• 0/1/2

• We will only need to study the 3-outcome scenario, as the structure is similar for more than 3 outcomes.
• First, define the baseline level for Y e.g. Y = 0.
• Then Y = 1 is compared to baseline,
• and Y = 2 is compared to baseline, …etc.

53
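The baseline-category structure described above can be sketched for a 3-level Y: one linear equation per non-baseline level, each compared against Y = 0. The coefficients and the x value below are invented for illustration:

```python
import math

def multinomial_probs(x, coefs):
    """Baseline-category logit for a 3-level Y (levels 0, 1, 2).
    Level 0 is the baseline; one linear equation per other level."""
    z1 = coefs[1][0] + coefs[1][1] * x   # log-odds of Y=1 vs baseline Y=0
    z2 = coefs[2][0] + coefs[2][1] * x   # log-odds of Y=2 vs baseline Y=0
    denom = 1.0 + math.exp(z1) + math.exp(z2)
    return [1.0 / denom, math.exp(z1) / denom, math.exp(z2) / denom]

# Hypothetical coefficients: (intercept, slope) per non-baseline level.
probs = multinomial_probs(2.0, {1: (-1.0, 0.5), 2: (0.2, -0.3)})
```

With these numbers, z1 = −1.0 + 0.5·2.0 = 0, so the odds of Y = 1 versus the baseline equal e^0 = 1, i.e. P(Y = 1) = P(Y = 0) at x = 2.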
Discuss Solution to Pre-Class Ex 2.3 Part C.

• For 3 categorical Y, 2 linear equations and hence 2 logistic functions


are produced.
• Use Odds Ratios to determine statistical significance if p-values are not shown.

54
Summary
• Categorical Y prediction can be achieved by
– using logistic function on a linear combination of Xs.
– Comparing the logistic function against a threshold.
• Good habit to check the levels of the Y variable to avoid
misinterpreting the software output.
• Confusion Matrix shows the performance (both correct
and wrong predictions) of the logistic regression model.
• Changing the threshold can change the error tradeoffs.
55
Quiz 2

56
Reflection on your Learning

Q1: What is the most important thing that you learned today?

Q2: What is still confusing or difficult for you?

57
The End of Day 2
ANY QUESTIONS?
REMEMBER TO COMPLETE PRE-CLASS ACTIVITIES
BEFORE DAY 3 CLASS (SEE CHECKLIST 3)
58
