BA501 Week5 Linear Regression
Copyright Policy
All content included on the Site or third-party platforms as part of the class, such as text,
graphics, logos, button icons, images, audio clips, video clips, live streams, digital
downloads, data compilations, and software, is the property of BitTiger or its content
suppliers and protected by copyright laws.
Any attempt to redistribute or resell BitTiger content will result in the appropriate legal
action being taken.
For details, see
https://www.bittiger.io/termsofuse https://www.bittiger.io/termsofservice
We thank you in advance for respecting our copyrighted content.
Y = β0 + β1 X + ε
● Y: dependent / response variable
● X: independent variable
● β0, β1: coefficients
● ε: error term
● Assumptions
○ 1. Linear relationship: Y = β0 + β1 X + ε
○ 2. ε is independent and identically distributed (i.i.d.)
○ 3. Homoscedasticity (constant variance): Var[ε | X = x] = σ² (no matter what x is)
○ 4. Gaussian noise: ε follows a normal distribution, ε ~ N(0, σ²)
Assumption Violation
● Linear Relationship
Assumption Violation
● Mathematical convenience
○ Closed form estimation
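To make "closed form estimation" concrete, here is a minimal sketch of the least-squares estimates for a single predictor, computed directly from sums with no iterative optimizer. The function name `fit_simple_ols` and the toy data are illustrative assumptions, not part of the course material.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Closed-form least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Toy data lying exactly on y = 1 + 2x (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

Under the Gaussian noise assumption these least-squares estimates coincide with the maximum likelihood estimates, which is one reason the assumption is convenient.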
Under Gaussian Assumption
Y = β0 + β1 X + ε
E(Y) = β0 + β1 X
freedom)?
How to predict for new data?
● New data point x, how to predict y?
● Predicted (fitted) value at x: ŷ = β̂0 + β̂1 x
● Is the prediction a single value?
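The sketch below computes a point prediction at a new x and, as one possible reading of the "single value" question, the standard errors that quantify its uncertainty. The data, the new point `x0`, and all variable names are made up for illustration; turning the standard errors into intervals would also need a t quantile, which is left out here.

```python
from math import sqrt
from statistics import mean

# Toy training data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Point prediction at a new x
x0 = 6.0
y_hat = b0 + b1 * x0

# Residual variance estimate with n - 2 degrees of freedom
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)

# Standard error of the estimated mean response at x0
se_mean = sqrt(s2 * (1 / n + (x0 - x_bar) ** 2 / sxx))
# Standard error for predicting a new observation at x0 (adds the noise variance)
se_new = sqrt(s2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))

print(y_hat, se_mean, se_new)
```

The standard error for a new observation is always larger than the one for the mean response, because a new observation carries its own error term.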
Bias and variance tradeoff
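● Prediction error: suppose we have Y = f(X) + ε. The expected squared prediction error at x then splits into squared bias, variance, and irreducible noise: E[(Y - f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²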
Bias and variance
● Accurate (small bias)
● Precise (small variance)
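A small simulation can illustrate the tradeoff. This is a sketch under made-up assumptions: the true function f(x) = 2x, noise standard deviation 1, and two hypothetical predictors of f at x0 = 4. One ignores x and predicts the sample mean of y (biased but stable); the other fits a least-squares line (unbiased here but more variable).

```python
import random
from statistics import mean, pvariance

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def true_f(v):
    return 2.0 * v  # assumed true regression function

rng = random.Random(0)  # fixed seed so the run is reproducible
x = [0.0, 1.0, 2.0, 3.0, 4.0]
x0 = 4.0

preds_mean, preds_line = [], []
for _ in range(2000):
    # Draw a fresh training set with Gaussian noise each round
    y = [true_f(xi) + rng.gauss(0.0, 1.0) for xi in x]
    preds_mean.append(mean(y))        # predictor 1: ignores x
    b0, b1 = fit_line(x, y)
    preds_line.append(b0 + b1 * x0)   # predictor 2: fitted line at x0

def bias_sq_and_var(preds, truth):
    """Squared bias and variance of a predictor across simulated training sets."""
    return (mean(preds) - truth) ** 2, pvariance(preds)

bias2_mean, var_mean = bias_sq_and_var(preds_mean, true_f(x0))
bias2_line, var_line = bias_sq_and_var(preds_line, true_f(x0))
print("mean-only:", bias2_mean, var_mean)
print("line fit :", bias2_line, var_line)
```

In this setup the mean-only predictor is precise but inaccurate, while the line fit is accurate but less precise.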
Multiple Linear Regression
Multiple linear regression
● More than one predictor, say p. For the ith data point: y_i = β0 + β1 x_i1 + ... + βp x_ip + ε_i
○ MSE: 1/n * Σ_i (y_i - ŷ_i)²
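Source              Degrees of freedom
Regression model    p
Residual            n-1-p
Total               n-1
The ratio of the regression mean square to the residual mean square follows an F distribution with (p, n-1-p) degrees of freedom when all slope coefficients are zero.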
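The degrees of freedom above can be traced in a short calculation. This sketch uses a single predictor (p = 1) only to keep the fit in closed form; the data and variable names are illustrative assumptions, and a p-value would additionally require the F distribution's tail probability.

```python
from statistics import mean

# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1]
n, p = len(x), 1

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares: total = regression + residual
ssr = sum((fi - y_bar) ** 2 for fi in fitted)             # regression, df = p
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # residual,   df = n-1-p
sst = sum((yi - y_bar) ** 2 for yi in y)                  # total,      df = n-1

ms_regression = ssr / p
ms_residual = sse / (n - 1 - p)
f_stat = ms_regression / ms_residual

print(ssr, sse, sst, f_stat)
```

A large F statistic suggests that at least one slope coefficient differs from zero.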
How to evaluate model performance?
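● R², and relation with correlation
○ R² = 1 - SSE/SST; in simple linear regression R² equals the squared correlation between x and y
● Property
○ R²: percentage of variance explained by the model
○ 0 <= R² <= 1
○ R² can be misleading
■ Adding features never decreases R², even when the features are irrelevant
■ Use adjusted R² to account for the number of features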
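The following sketch computes R², checks that it matches the squared correlation in the one-predictor case, and applies the usual adjustment 1 - (1 - R²)(n - 1)/(n - p - 1). The data and variable names are made up for illustration.

```python
from math import sqrt
from statistics import mean

# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1]
n, p = len(x), 1

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r2 = 1 - sse / syy                           # share of variance explained
corr = sxy / sqrt(sxx * syy)                 # Pearson correlation of x and y
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes the number of features

print(r2, corr ** 2, adj_r2)
```

Because the penalty factor (n - 1)/(n - p - 1) grows with p, adjusted R² can fall when an added feature contributes little.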
How do you know whether your model meets the assumptions?
Residual Diagnostics
Residual vs. Error Term
● Residual: e_i = y_i - ŷ_i
● Error (noise): ε_i = y_i - E(y_i | x_i)
○ e_i is an observed estimate of ε_i
● Properties of residuals (when the assumptions hold)
○ Mean 0
○ Constant variance, unchanging with x
○ Uncorrelated with each other, or only extremely weakly correlated
○ Normally distributed
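As a quick illustration of the difference between residuals and errors, the sketch below computes residuals from a least-squares fit. With an intercept in the model, the residuals sum to zero and are orthogonal to x by construction, which the unobservable errors need not be. The data and names are illustrative assumptions.

```python
from statistics import mean

# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residual: observed response minus fitted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

resid_sum = sum(residuals)                              # zero up to rounding
resid_dot_x = sum(e * xi for e, xi in zip(residuals, x))  # zero up to rounding
print(resid_sum, resid_dot_x)
```

These two constraints are why residuals are slightly correlated with each other even when the errors are independent.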
Residual vs. Fitted value
● To check for unbiasedness and homoscedasticity
• The residuals spread randomly around the 0 line, indicating that the relationship is linear.
• The residuals form an approximately horizontal band around the 0 line, indicating homogeneity of error variance.
• No single residual stands visibly apart from the random pattern of the residuals, indicating that there are no outliers.
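The sketch below shows what a violation looks like numerically. A straight line is fitted to data that are actually quadratic (an assumed toy example, y = x²), and the residuals are averaged over the low, middle, and high thirds of the fitted values. This three-bin summary is only a rough stand-in for inspecting the plot, not a formal test.

```python
from statistics import mean

# Data from a curved relationship, deliberately fitted with a straight line
x = [float(v) for v in range(9)]
y = [xi ** 2 for xi in x]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# (fitted value, residual) pairs, ordered by fitted value as on the plot's x-axis
pairs = sorted((b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
resid_sorted = [e for _, e in pairs]

k = len(resid_sorted) // 3
low_mean = mean(resid_sorted[:k])
mid_mean = mean(resid_sorted[k:2 * k])
high_mean = mean(resid_sorted[2 * k:])
print(low_mean, mid_mean, high_mean)
```

Positive averages at both ends with a negative average in the middle form the U-shaped pattern that signals a missed nonlinear relationship.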
Residual vs. Fitted value – Violation Examples
QQ Plot
● To check for Normality
• The points should roughly follow the diagonal straight line
• You can also check normality using the Shapiro-Wilk test or the Kolmogorov-Smirnov test
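The points of a QQ plot can be computed with the standard library alone. In this sketch the plotting positions (i + 0.5)/n, the simulated samples, and the helper names are assumptions chosen for illustration; the correlation between the two axes is used only as a rough numeric summary of how closely the points follow a straight line.

```python
import random
from statistics import NormalDist, mean

def qq_points(sample):
    """Theoretical normal quantiles paired with the sorted sample values."""
    n = len(sample)
    ordered = sorted(sample)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, ordered

def correlation(a, b):
    """Pearson correlation, written out to avoid version-specific helpers."""
    a_bar, b_bar = mean(a), mean(b)
    num = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    den = (sum((ai - a_bar) ** 2 for ai in a)
           * sum((bi - b_bar) ** 2 for bi in b)) ** 0.5
    return num / den

rng = random.Random(1)  # fixed seed for reproducibility
normal_resid = [rng.gauss(0.0, 1.0) for _ in range(200)]
skewed_resid = [rng.expovariate(1.0) - 1.0 for _ in range(200)]  # right-skewed

corr_normal = correlation(*qq_points(normal_resid))
corr_skewed = correlation(*qq_points(skewed_resid))
print(corr_normal, corr_skewed)
```

The normal sample should track the diagonal more closely than the skewed one, whose upper tail bends away from the line.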
QQ Plot – Violation Examples
Outlier, High Leverage, Influential
• An outlier is a data point whose response y does not follow the general trend of the rest of the
data.
• A data point has high leverage if it has "extreme" predictor x values. Including or excluding such a
point can strongly change the coefficient estimates
• A data point is influential if it unduly influences any part of a regression analysis, such as the
predicted responses, the estimated slope coefficients, or the hypothesis test results. Outliers and
high leverage data points have the potential to be influential, but we generally have to investigate
further to determine whether or not they are actually influential.
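One way to investigate whether a point is influential is to fit the model with and without it and compare the estimates. In this sketch the last point is made up to have an extreme x value and to sit off the trend of the others; the data and function name are illustrative assumptions.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

# The first five points lie on y = 2x; the last has extreme x and breaks the trend
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 10.0]

_, slope_with = fit_line(x, y)
_, slope_without = fit_line(x[:-1], y[:-1])
print(slope_with, slope_without)
```

Here a single point pulls the slope far below the value supported by the rest of the data, so it is both high leverage and influential.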
Standardized residuals vs. Leverage
● To check for high-leverage and influential data points
• Leverage: think of the fitted line as a lever that passes through the center point (the mean of x and
the mean of y). Points further from the center have larger leverage. h_ii = [H]_ii measures the distance of x_i from the center of the x values
3. If you delete any data after you've collected it, justify and describe it in your reports.
If you are not sure what to do about a data point, analyze the data twice — once with and once
without the data point — and report the results of both analyses.
Interview Questions
Basic
• What model will you use? Linear regression fits a continuous Y variable that has a linear relationship
with the predictors. It is not suited to a categorical Y variable
Advanced
h_ii: leverage of the ith data point
● Properties of h_ii
○ h_ii is a measure of the distance between x_i and the mean of x
○ h_ii is a number between 0 and 1
○ The sum of the h_ii equals p+1, the number of parameters (regression coefficients including the intercept)
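These properties can be checked numerically. The sketch below uses the one-predictor closed form h_ii = 1/n + (x_i - x̄)² / Σ_j (x_j - x̄)², so p + 1 = 2; the x values are made up, with one deliberately extreme point.

```python
from statistics import mean

# Toy predictor values; the last one is far from the others
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
n = len(x)

x_bar = mean(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage of each point: grows with the squared distance from the mean of x
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

print(leverage)
print(sum(leverage))  # p + 1 = 2 for one predictor plus an intercept
```

The extreme point takes up almost half of the total leverage, which is why its inclusion can dominate the fit.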
Standardizing
● Standardized residual
○ What: residuals rescaled to have a mean of 0 and a variance of 1
○ Why:
■ Data points far from the center have a larger impact on the estimated coefficients
(leverage), so the residual variance at these data points is smaller: Var(e_i) = σ²(1 - h_ii)
○ How to standardize: r_i = e_i / (s √(1 - h_ii)), where s² is the estimated error variance
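A minimal sketch of the standardization for one predictor follows, dividing each residual by s √(1 - h_ii) with s² = SSE / (n - 2). The data and variable names are illustrative assumptions; the last response is made up to sit above the trend.

```python
from math import sqrt
from statistics import mean

# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))       # residual standard error
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Divide each residual by its own estimated standard deviation
standardized = [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverage)]

for xi, e, h, r in zip(x, residuals, leverage, standardized):
    print(xi, round(e, 3), round(h, 3), round(r, 3))
```

Because 1 - h_ii is at most 1, the correction inflates residuals at high-leverage points the most, putting all points on a comparable scale.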
Standardizing in simple linear regression
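● With one predictor, the leverage is h_ii = 1/n + (x_i - x̄)² / Σ_j (x_j - x̄)², and the standardized residual is e_i / (s √(1 - h_ii)), with s² = SSE / (n - 2)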