This lecture introduces modeling concepts in data science, focusing on statistical models, particularly linear regression. It explains the modeling process, including defining models, estimating parameters, and evaluating model performance using metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE). The lecture also discusses the importance of loss functions and their impact on model accuracy and computational cost.


Lecture 4: Intro to Modeling and the Linear Regression Model

2025 SPRING · INTRO TO DATA SCIENCE
Where Are We?

What is a Model?

• A model is an idealized representation of a system.

• Example: we model the fall of an object on Earth as subject to a constant acceleration of 9.81 m/s² due to gravity.

• While this describes the behavior of our system, it is merely an approximation.

• It doesn't account for the effects of air resistance, local variations in gravity, etc.

• But in practice, it's accurate enough to be useful!

"Essentially, all models are wrong, but some are useful." (George Box, Statistician, 1919-2013)
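For instance, here is a quick worked instance of this constant-acceleration model (illustrative numbers, ignoring air resistance): an object dropped from rest has fallen

```latex
d(t) = \tfrac{1}{2} g t^{2}, \quad g \approx 9.81\ \mathrm{m/s^2}
\;\;\Rightarrow\;\;
d(2\,\mathrm{s}) \approx \tfrac{1}{2}(9.81)(4) \approx 19.6\ \mathrm{m}.
```

The model is simple enough to compute by hand, yet close enough to reality to be useful for many purposes.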

What is a Model?

• In data science, a model usually means a mathematical rule or function that describes the
relationships between variables.

• In this course, we will cover:


• Statistical Models

• Machine Learning Models

• Deep Learning Models

Example: A Statistical Model

[Figure: scatter plots of Sales vs. TV, Radio, and Newspaper advertising budgets]

Can we predict Sales using these three?

Perhaps we can do better using a model:

Sales ≈ f(TV, Radio, Newspaper)

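As a rough sketch of what estimating such an f might look like in code, assuming the advertising data lives in a file named Advertising.csv with columns TV, Radio, Newspaper, and Sales (the file name and the linear form of f are illustrative assumptions, not part of the lecture):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file/column names; any dataset with these columns works.
ads = pd.read_csv("Advertising.csv")

X = ads[["TV", "Radio", "Newspaper"]]  # features
y = ads["Sales"]                       # response we want to predict

# One simple choice of f: a linear function of the three budgets.
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # fitted parameters
print(model.predict(X[:5]))            # predicted Sales for five markets
```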
A statistical model

• Here Sales is a response or target that we wish to predict; we generically refer to it as $Y$.

• TV, Radio, and Newspaper are features (also called inputs or predictors); we collectively name them $X$.

• We write our model as

$Y = f(X) + \epsilon$

where $\epsilon$ captures measurement errors and other discrepancies (we will come back to this later).

• A set of approaches for estimating $f$ is sometimes called a statistical learning procedure.

What is f(X) good for?

• 1. We can make predictions of $Y$ at new points $X = x$.

• 2. We can understand which components of $X = (X_1, X_2, \ldots, X_p)$ are important in explaining $Y$, and which are irrelevant.

• 3. Depending on the complexity of $f$, we may be able to understand how each component $X_i$ of $X$ affects $Y$.

The Modeling Process

• How should we represent the world?

• How do we quantify prediction error?

• How do we choose the best parameters of our model given our data?

• How do we evaluate whether this process gave rise to a good model?

The Modeling Process

Simple Linear Regression: Our First Model

• Simple Linear Regression (SLR) model: $y = \theta_0 + \theta_1 x + \epsilon$

• SLR is a parametric model, meaning we choose the "best" parameters for slope and intercept based on data.

• We often express $\theta = (\theta_0, \theta_1)$ as a single parameter vector.

• The sample-based estimate of the parameter $\theta$ is written $\hat{\theta}$, which in turn provides the estimate $\hat{y}$.

• Usually, we pick the parameters that appear "best" according to some criterion we choose.

Which $\theta$ is best?

• We need some metric of how "good" or "bad" our predictions are.

• Example: for every chapter of the novel Little Women, estimate the number of characters based on the number of periods in that chapter.
The Modeling Process

Loss Functions

• A loss function $L(y, \hat{y})$ characterizes the cost, error, or fit resulting from a particular choice of model or model parameters.

• The choice of loss function affects the accuracy and computational cost of estimation.

• The choice of loss function should depend on the estimation task:

  • Are outputs quantitative or qualitative?
  • Do we care about outliers?
  • Are all errors equally costly? (e.g., a false negative on a cancer test)

L2 and L1 Loss

• Squared (L2) loss: $L(y, \hat{y}) = (y - \hat{y})^2$
• Absolute (L1) loss: $L(y, \hat{y}) = |y - \hat{y}|$
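A minimal sketch of these two pointwise losses in Python (NumPy used for convenience):

```python
import numpy as np

def squared_loss(y, y_hat):
    # L2 loss on a single observation.
    return (y - y_hat) ** 2

def abs_loss(y, y_hat):
    # L1 loss on a single observation.
    return np.abs(y - y_hat)

print(squared_loss(22, 25))  # 9
print(abs_loss(22, 25))      # 3
```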
Residuals as Loss Function?

• Why don't we directly use the residual, $y - \hat{y}$, as the loss function?

• This doesn't work: big negative residuals would cancel out big positive residuals.
• Our predictions can be very off, yet the residuals can still average to zero (e.g., errors of +10 and −10 average out).

Empirical Risk is Average Loss over Data

• We care about how bad our model's predictions are for our entire data set, not just for one point.

• A natural measure, then, is the average loss (aka empirical risk) across all points.

• Given data $(x_1, y_1), \ldots, (x_n, y_n)$, the average loss is

$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$

• The average loss on the sample tells us how well the model fits the data (not the population), but hopefully these are close.

Empirical Risk is Average Loss over Data

The colloquial term for average loss depends on which loss function we choose: average squared (L2) loss is called Mean Squared Error (MSE), and average absolute (L1) loss is called Mean Absolute Error (MAE).

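A minimal sketch of the two resulting empirical risks, using made-up values for $y$ and $\hat{y}$:

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: average L2 loss over the sample.
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: average L1 loss over the sample.
    return np.mean(np.abs(y - y_hat))

y     = np.array([3.0, 5.0, 7.0])   # observed values (illustrative)
y_hat = np.array([2.5, 6.0, 6.5])   # model predictions
print(mse(y, y_hat))  # (0.25 + 1.0 + 0.25) / 3 = 0.5
print(mae(y, y_hat))  # (0.5 + 1.0 + 0.5) / 3 ≈ 0.667
```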
The Modeling Process

• We want to find the parameters $\hat{\theta}$ that minimize this objective function.
Minimizing MSE for the SLR Model

• Recall: we wanted to pick the regression line $\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$.

• To minimize the (sample) Mean Squared Error:

$\hat{R}(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (\theta_0 + \theta_1 x_i) \big)^2$

• To find the best values, we set the derivatives equal to zero to obtain the optimality conditions.
Partial Derivative of MSE with Respect to $\theta_0$, $\theta_1$

$\frac{\partial \hat{R}}{\partial \theta_0} = -\frac{2}{n} \sum_{i=1}^{n} \big( y_i - \theta_0 - \theta_1 x_i \big)$

$\frac{\partial \hat{R}}{\partial \theta_1} = -\frac{2}{n} \sum_{i=1}^{n} \big( y_i - \theta_0 - \theta_1 x_i \big)\, x_i$
Estimating Equations

• To find the best values, we set the derivatives equal to zero to obtain the optimality conditions:

(1) $\sum_{i=1}^{n} \big( y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i \big) = 0$   (2) $\sum_{i=1}^{n} \big( y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i \big)\, x_i = 0$

• To find the best $\hat{\theta}_0, \hat{\theta}_1$, we need to solve these estimating equations.
From Estimating Equations to Estimators

• Goal: choose $\hat{\theta}_0, \hat{\theta}_1$ to solve the two estimating equations (1) and (2) above.

• Dividing (1) by $n$ and rearranging gives:

$\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$
From Estimating Equations to Estimators

Let's try (2) − (1) × $\bar{x}$, which gives

$\sum_{i=1}^{n} \big( y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i \big)(x_i - \bar{x}) = 0$

Substituting $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$ and solving:

$\hat{\theta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Estimating Equations

• Plug in the definitions of the correlation $r$ and the standard deviations $\sigma_x, \sigma_y$:

$\hat{\theta}_1 = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2} = r\, \frac{\sigma_y}{\sigma_x}$

• Solve for the parameters:

$\hat{\theta}_1 = r\, \frac{\sigma_y}{\sigma_x}, \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$
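A small numerical check of these estimators, using illustrative data; np.polyfit serves as an independent least squares reference:

```python
import numpy as np

# Hypothetical sample data; any paired x, y arrays work here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form SLR estimates from the estimating equations:
# theta1_hat = r * (sigma_y / sigma_x), theta0_hat = y_bar - theta1_hat * x_bar
r = np.corrcoef(x, y)[0, 1]
theta1_hat = r * y.std() / x.std()
theta0_hat = y.mean() - theta1_hat * x.mean()

# Cross-check against numpy's built-in least squares fit.
slope, intercept = np.polyfit(x, y, 1)
print(theta1_hat, slope)      # should match
print(theta0_hat, intercept)  # should match
```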
Minimizing MSE for the SLR Model (recap)

• Recall: we picked the regression line to minimize the (sample) Mean Squared Error, set the derivatives equal to zero to obtain the optimality conditions, and solved them to get $\hat{\theta}_1 = r\,\sigma_y/\sigma_x$ and $\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$.
Estimating Equations

• Estimating equations are the equations that the model fit has to solve. They help us:
  • Derive the estimates.
  • Understand what our model is paying attention to.

For SLR:

• The residuals should average to zero (otherwise we should fix the intercept!).
• The residuals should be orthogonal to the predictor variable (or we should fix the slope!).
The Modeling Process

Evaluating Models

What are some ways to determine if our model was a good fit to our data?

1. Performance metrics: Root Mean Square Error (RMSE)

$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

• A lower RMSE indicates more "accurate" predictions (lower "average loss" across the data).
• RMSE is in the same units as $y$.
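A one-function sketch of RMSE, reusing the made-up values from the earlier loss sketch:

```python
import numpy as np

def rmse(y, y_hat):
    # RMSE is the square root of the MSE, so it is in the same units as y.
    return np.sqrt(np.mean((y - y_hat) ** 2))

y     = np.array([3.0, 5.0, 7.0])  # illustrative observed values
y_hat = np.array([2.5, 6.0, 6.5])  # illustrative predictions
print(rmse(y, y_hat))  # sqrt(0.5) ≈ 0.707
```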
Four Mysterious Datasets (Anscombe’s quartet)

2. Visualization: look at a residual plot to visualize the difference between actual and predicted values.

• The four datasets each have the same mean of x, mean of y, SD of x, SD of y, and r value.

• Since our optimal least squares SLR model only depends on those quantities, they all have the same regression line and RMSE.

• However, only one of these four datasets makes sense to model using SLR.
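One quick way to verify the shared statistics, assuming seaborn's bundled copy of Anscombe's quartet is reachable (sns.load_dataset fetches it from seaborn's online data repository, so this needs an internet connection):

```python
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y

# All four datasets share (nearly) identical summary statistics,
# and the same correlation, hence the same SLR fit.
for name, group in df.groupby("dataset"):
    print(name,
          group["x"].mean(), group["y"].mean(),
          group["x"].std(), group["y"].std(),
          group["x"].corr(group["y"]))
```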
Four Mysterious Datasets (Anscombe’s quartet)

The residual plot of a good regression shows no pattern.

The Modeling Process: Using a Different Model

The Constant Model

• You work at a local boba shop and want to estimate the sales each day.

• Here's your data from 5 randomly selected previous days, arbitrarily sorted by number of drinks sold:

{20, 21, 22, 29, 33}

What single number would you predict for a new day's sales?

A. 0
B. 25
C. 22
D. 100
E. Something else

This is a constant model.
The Constant Model

• The constant model summarizes the data by always "predicting" the same number, i.e., predicting a constant.

• It ignores any relationships between variables:
  • For instance, boba tea sales likely depend on the time of year, the weather, how the customers feel, whether school is in session, etc.
  • Ignoring these factors is a simplifying assumption.
The Constant Model

• The constant model is also a parametric, statistical model:

$y = \theta_0 + \epsilon$

• Our parameter $\theta_0$ is 1-dimensional.

• We now have no input into our model; we predict $\hat{y} = \hat{\theta}_0$.
• Like before, we can still determine the best $\theta_0$ that minimizes average loss on our data.
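A minimal sketch of the constant model as code; how to choose $\hat{\theta}_0$ is derived on the following slides:

```python
import numpy as np

def constant_predict(theta0_hat, n):
    # A constant model ignores any inputs and always predicts theta0_hat.
    return np.full(n, theta0_hat)

print(constant_predict(25.0, 3))  # [25. 25. 25.]
```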
The Modeling Process: Using a Different Model

Fit the Model: Rewrite MSE for the Constant Model

• Recall that Mean Squared Error (MSE) is the average squared loss (L2 loss) over the data:

$\hat{R}(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

• Given the constant model, $\hat{y}_i = \theta_0$, so:

$\hat{R}(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0)^2$

• We fit the model by finding the optimal $\hat{\theta}_0$ that minimizes the MSE.
Fit the Model: $\hat{\theta}_0 = \bar{y}$

• We can show that average squared loss is minimized by $\hat{\theta}_0 = \bar{y}$.

• Derivation: set the derivative of the MSE to zero and solve.

$\frac{d}{d\theta_0} \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0)^2 = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \theta_0) = 0
\;\Rightarrow\;
\hat{\theta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}$

• This holds true regardless of what data sample you have.

• It provides some formal reasoning as to why the mean is such a common summary statistic.
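A brute-force sketch checking this on the boba data from earlier: evaluating the constant model's MSE over a grid of candidate $\theta_0$ values recovers the mean.

```python
import numpy as np

sales = np.array([20, 21, 22, 29, 33])

# Evaluate the constant model's MSE over a grid of candidate theta0 values.
thetas = np.linspace(0, 50, 5001)
mse = np.array([np.mean((sales - t) ** 2) for t in thetas])

best = thetas[np.argmin(mse)]
print(best, sales.mean())  # both ≈ 25.0: the mean minimizes MSE
```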
Revisit the Boba Shop Example

• You work at a local boba shop and want to estimate the sales each day.

• Here's your data from 5 randomly selected previous days, arbitrarily sorted by number of drinks sold:

{20, 21, 22, 29, 33}

A. 0
B. 25
C. 22
D. 100
E. Something else

Under MSE, we predict the mean of the previous five days' sales:

(20 + 21 + 22 + 29 + 33) / 5 = 25
[Loss] Comparing Two Different Models, Both Fit with MSE

[Fit] Comparing Two Different Models, Both Fit with MSE

The Modeling Process: Using a Different Loss Function

Fit the Model: Rewrite MAE for the Constant Model

• Recall that Mean Absolute Error (MAE) is the average absolute loss (L1 loss) over the data:

$\hat{R}(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

• Given the constant model, $\hat{y}_i = \theta_0$, so:

$\hat{R}(\theta_0) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \theta_0|$

• We fit the model by finding the optimal $\hat{\theta}_0$ that minimizes the MAE.

Exploring MAE: A Piecewise Function

• For the boba dataset {20, 21, 22, 29, 33}:

• Absolute (L1) loss on one observation: $|y_i - \theta_0|$

• MAE (Mean Absolute Error) across all data: $\frac{1}{5} \sum_{i=1}^{5} |y_i - \theta_0|$

[Figure: plotted as a function of $\theta_0$, the MAE is a piecewise linear function, minimized at $\theta_0 = 22$]
Fit the Model: Differentiation

• Away from the data points, the derivative of the MAE counts how many observations lie on each side of $\theta_0$:

$\frac{d}{d\theta_0} \frac{1}{n} \sum_{i=1}^{n} |y_i - \theta_0| = \frac{1}{n} \Big( \sum_{y_i < \theta_0} 1 \;-\; \sum_{y_i > \theta_0} 1 \Big)$
Fit the Model: Set Equal to 0

• Setting the derivative equal to zero, $\hat{\theta}_0$ needs to be such that there are an equal number of points to its left and right.

• This is the definition of the median!

• For example, in our boba tea dataset {20, 21, 22, 29, 33}, the middle point (22) is the median.
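The same brute-force sketch as before, but with MAE, recovers the median:

```python
import numpy as np

sales = np.array([20, 21, 22, 29, 33])

# Evaluate the constant model's MAE over a grid of candidate theta0 values.
thetas = np.linspace(0, 50, 5001)
mae = np.array([np.mean(np.abs(sales - t)) for t in thetas])

best = thetas[np.argmin(mae)]
print(best, np.median(sales))  # both ≈ 22.0: the median minimizes MAE
```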
MSE and MAE: Comparing Optimal Parameters

• For the constant model, MSE is minimized by the mean $\bar{y}$, while MAE is minimized by the median.

MSE and MAE: Comparing Loss Surfaces

• The MSE loss surface is smooth; the MAE loss surface is piecewise linear, with kinks at the data points.

MSE and MAE: Comparing Sensitivity to Outliers

• MSE is more sensitive to outliers than MAE, since squaring magnifies large errors.

MSE and MAE: Comparing Uniqueness of Solutions

• MSE has a unique minimizer; the MAE minimizer may not be unique (with an even number of points, any value between the two middle points minimizes the MAE).
Summary: Loss Optimization, Calculus, and… Critical Points?

• First, define the objective function as average loss.
  • Plug in L1 or L2 loss.
  • Plug in the model so that the resulting expression is a function of $\theta$.
• Then, find the minimum of the objective function:
  • 1. Differentiate with respect to $\theta$.
  • 2. Set equal to 0.
  • 3. Solve for $\hat{\theta}$.
• Recall critical points from calculus: a critical point could be a minimum, maximum, or saddle point!
  • We should technically also perform the second derivative test, i.e., show that the second derivative is positive at $\hat{\theta}$ (see the worked check below).
  • MSE has a property, convexity, that guarantees that $\hat{\theta}$ is a global minimum.
  • The proof of convexity for MAE is beyond this course.
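As a worked instance of that second derivative test, for the constant model under MSE:

```latex
R(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \theta_0)^2,\qquad
\frac{dR}{d\theta_0} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \theta_0),\qquad
\frac{d^2 R}{d\theta_0^2} = 2 > 0.
```

Since the second derivative is a positive constant, the critical point $\hat{\theta}_0 = \bar{y}$ is indeed a global minimum; this is the convexity property mentioned above.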

