CPSC 4830 2025summer Lecture 3

The document covers the fundamentals of regression analysis, focusing on linear and logistic regression, including their formulas and assumptions. It discusses the importance of various tests for model validation, such as linearity, normality, and homoscedasticity, and provides guidance on how to address failures in these tests. Additionally, it highlights key metrics like p-value and R-squared, and offers practical examples for applying regression techniques in data analytics.


CPSC 4830

Data Mining for Data Analytics


Lecture 3
Regression
2 types:
1. Linear Regression (predicts values)

Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk

2. Logistic Regression (predicts class labels)

Linear Regression
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Terminology:
True value (Y) = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Predicted value (Ŷ) = β0 + β1X1 + β2X2 + β3X3 + … + βkXk : Dependent Variable (DV)
Input variables (Xi) : Independent Variables (IV)
Note: Xi always refers to one IV, and X refers to ALL IVs
Assumptions:
1. Linearity: Each Xi has a linear relationship with the mean of Y
2. Normality: For any fixed value of X (all Xi fixed), Y is normally distributed
3. Homoscedasticity: The variance of the residual (error) is the same for any value of X
4. Independence: All Xi are independent of each other
Linear Regression: Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Ŷ = β0 + β1X1 + β2X2 + β3X3 + … + βkXk
Before building the model:
1. Linearity: Plot a scatter plot of Y against every Xi; transform or remove variables that are not linear
2. Independence: Compute the correlation matrix among X, or use the VIF (Variance Inflation Factor)
After building the model:
3. Normality: Check the residuals with a K-S test or Q-Q plot
4. Homoscedasticity: Q-Q plot the residuals (errors), or plot the residuals against ŷ, or against each Xi
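The VIF check in step 2 can be sketched with plain NumPy; the `vif` helper and the synthetic columns below are illustrative, not from the lecture:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (n samples x k features).

    VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing X_i on the
    remaining features (plus an intercept). VIF near 1 means the feature is
    nearly independent of the others; values above ~5-10 flag collinearity.
    """
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])    # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)               # independent of x1
x3 = x1 + 0.05 * rng.normal(size=200)   # nearly a copy of x1
X = np.column_stack([x1, x2, x3])
print([round(v, 1) for v in vif(X)])
```

The near-duplicate pair (x1, x3) shows up as huge VIFs, while the genuinely independent x2 stays near 1.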
What if the above tests fail?


5. Linearity: Consider higher-order terms or a transformation, then check again
6. Independence: Group the correlated features together, e.g., by PCA or averaging
7. Normality: Apply a data transformation or use robust regression
8. Homoscedasticity: Check the residual pattern and apply a matching remedy, e.g., a funnel shape -> transform Y; a non-linear pattern -> weighted least squares
Linear Regression: Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Ŷ = β0 + β1X1 + β2X2 + β3X3 + … + βkXk
The values we care about most:
1. p-value for each Xi: whether to keep it or drop it
2. R-squared: how much variance is explained
3. Slope and intercept of the model
Note: other values are useful too
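Where the slope, intercept, and R-squared come from can be sketched in closed form for a single IV (the toy x/y data below are made up; p-values additionally need the t-distribution, which libraries such as statsmodels report for you):

```python
import numpy as np

# Simple (one-IV) linear regression fitted in closed form, showing where
# the slope, intercept, and R-squared in a regression summary come from.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 5.9, 8.2, 9.9])

slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # beta_1
intercept = y.mean() - slope * x.mean()                  # beta_0
y_hat = intercept + slope * x                            # predicted values
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot   # share of variance explained

print(round(slope, 3), round(intercept, 3), round(r_squared, 4))
```

A high R-squared here just says the line explains almost all of the variance in this toy data; it says nothing about the assumption checks above.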
Linear Regression: Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Ŷ = β0 + β1X1 + β2X2 + β3X3 + … + βkXk
Datasets for playing:
Kaggle
Interview Query
Telus

1. Predict brain weight of mammals based on their body weight (x01.txt)
2. Predict blood fat content based on age and weight (x09.txt)
3. Predict death rate from cirrhosis based on a number of other factors (x20.txt)
4. Predict selling price of houses based on a number of factors (x27.txt)
And so on…
Linear Regression Theory
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Ŷ = β0 + β1X1 + β2X2 + β3X3 + … + βkXk
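The least-squares fit behind the theory can be sketched with NumPy on synthetic data whose true coefficients we know (the sample size, noise level, and beta values below are all made up for illustration):

```python
import numpy as np

# Multiple linear regression via least squares: find the beta vector that
# minimizes ||y - X_design @ beta||^2, and check it recovers the known
# coefficients used to generate the data.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))                     # three IVs: X1, X2, X3
true_beta = np.array([1.0, 2.0, -0.5, 3.0])     # [beta0, beta1, beta2, beta3]
y = true_beta[0] + X @ true_beta[1:] + 0.1 * rng.normal(size=n)  # + epsilon

X_design = np.column_stack([np.ones(n), X])     # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(beta_hat, 2))
```

With plenty of data and small noise, the estimated betas land very close to the true ones; the residual that remains is the ε term of the model.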

Besides MSE, what other measures did you learn?


MSE, MAE, MAPE, RMSE
So, when should we use each?
MSE and RMSE are sensitive to outliers -> good when the data are stable
MAE is relatively insensitive to outliers -> good for unstable data
MAPE favours models that underestimate -> useful for budgeting
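The four measures can be computed side by side to see the outlier sensitivity in action (the toy y values below are made up for illustration):

```python
import math

# MSE squares errors, so one large miss dominates; MAE treats every unit of
# error equally; MAPE divides by the true value, so a large overestimate can
# exceed 100% while an underestimate is bounded -- one reason it can favour
# models that under-predict (handy for budget-style forecasts).
def mse(y, y_hat):  return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
def rmse(y, y_hat): return math.sqrt(mse(y, y_hat))
def mae(y, y_hat):  return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)
def mape(y, y_hat): return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)

y_true = [100.0, 100.0, 100.0, 100.0]
y_stable = [98.0, 102.0, 99.0, 101.0]      # small errors everywhere
y_outlier = [100.0, 100.0, 100.0, 140.0]   # one large miss

print(mse(y_true, y_stable), mse(y_true, y_outlier))   # outlier dominates MSE
print(mae(y_true, y_stable), mae(y_true, y_outlier))   # MAE grows far less
```

The single 40-unit miss multiplies MSE by 160x but MAE by under 7x, which is the "sensitive to outliers" contrast on the slide.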
Polynomial Regression
Y = β0 + β1X + β2X^2 + β3X^3 + … + βkX^k + ε
Ŷ = β0 + β1X + β2X^2 + β3X^3 + … + βkX^k
Questions: When will we use linear regression? When will we consider 2nd order or higher?
When should we stop searching for a higher-order solution?
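One practical answer to "when to stop" is to watch the residual error as the order grows: once the order matches the underlying curve, extra terms stop helping much. A sketch with synthetic data whose true order is 2 (the data and degrees below are made up for illustration):

```python
import numpy as np

# Fit increasing polynomial orders and compare the mean squared residual.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=x.size)  # true order: 2

errors = {}
for degree in (1, 2, 3):
    coefs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    resid = y - np.polyval(coefs, x)
    errors[degree] = float(np.mean(resid ** 2))

print({d: round(e, 3) for d, e in errors.items()})
```

The jump from order 1 to order 2 removes most of the error; order 3 barely improves on order 2, which is the signal to stop.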
Model Capacity, Overfitting, Underfitting
Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
Ŷ = β0 + β1X1 + β2X2 + β3X3 + … + βkXk
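Underfitting and overfitting can be seen directly by fitting polynomials of rising degree on a held-out split (the sine data, degrees, and split below are made up for illustration):

```python
import numpy as np

# Train on half of the points, evaluate on the held-out half, and compare
# train/test error as model capacity (polynomial degree) grows.
rng = np.random.default_rng(7)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

train = np.arange(0, 40, 2)   # even indices for fitting
test = np.arange(1, 40, 2)    # odd indices held out

def mses(degree):
    """Train and test mean squared error for a polynomial of this degree."""
    coefs = np.polyfit(x[train], y[train], degree)
    tr = float(np.mean((y[train] - np.polyval(coefs, x[train])) ** 2))
    te = float(np.mean((y[test] - np.polyval(coefs, x[test])) ** 2))
    return tr, te

tr1, te1 = mses(1)   # low capacity: underfits, both errors high
tr5, te5 = mses(5)   # adequate capacity: errors near the noise level
tr9, te9 = mses(9)   # high capacity: train error falls further; watch the gap to test error
print((round(te1, 3), round(te5, 3), round(te9, 3)))
```

Raising the degree can only lower the training error (the smaller model is nested inside the larger one), but past the right capacity the held-out error stops improving: that widening train/test gap is the overfitting signal.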
Take home messages
• Regression: Linear regression and Logistic regression
• How to use them in Python
• What p-value, R-squared, intercept, and slope (beta or weight) are
• What the assumptions are
