1.1 Regression Analysis

The document provides an overview of regression analysis, focusing on its predictive capabilities and methods such as simple and multiple linear regression. It discusses estimating coefficients, assessing model fit, and testing relationships between variables, including the use of p-values and F-statistics. Additionally, it covers variable selection techniques and challenges associated with qualitative data and multicollinearity.

Regression Analysis

Dr. Chathura Rajapakse


Department of Industrial Management
Faculty of Science, University of Kelaniya
Regression Analysis – What it does

• Predicting quantitative (numeric) responses
• A supervised learning task
• Simple linear regression
• Multiple linear regression
• K-Nearest Neighbors
• Decision tree-based methods
• Artificial neural networks
Simple Linear Regression
• A straightforward approach for predicting a quantitative response Y on
the basis of a single predictor variable X

  Y ≈ β0 + β1X

• What do we intend to know?


How to estimate coefficients

• Measuring the closeness of the fitted line to the data

• How?
• Minimizing the Residual Sum of Squares (RSS) – the least squares
criterion

Residual

• The i-th residual: ei = yi − ŷi
• Can write the RSS as:

  RSS = e1² + e2² + … + en² = Σi (yi − ŷi)²

Estimating coefficients

• The least squares approach chooses β̂0 and β̂1 to minimize the RSS

• Can conclude that the least squares coefficient estimates are:

  β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
  β̂0 = ȳ − β̂1 x̄

  where x̄ and ȳ are the sample means
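The closed-form estimates above take only a few lines of NumPy. A minimal sketch on synthetic data (the generated data and variable names are illustrative, not from the slides):

```python
import numpy as np

# Toy data from a known line y = 2 + 3x plus noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates from the slide formulas
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residual sum of squares for the fitted line
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)
```

With this much data and little noise, the estimates land close to the true values 2 and 3.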
The true regression line
• What we collect is data from a sample

• That gives estimates for that sample


• Estimates from a sample differ from the parameters of the entire
population
• Analogy: how far the sample mean μ̂ deviates from the population mean μ

  Var(μ̂) = SE(μ̂)² = σ² / n

  where σ² is the variance (the square of the standard deviation) of each
  of the observations yi of Y
The true regression line

• The error terms are drawn from a normal distribution
Standard errors in coefficients

• The standard errors of the coefficients of the simple regression line can
be computed as:

  SE(β̂1)² = σ² / Σi (xi − x̄)²
  SE(β̂0)² = σ² [ 1/n + x̄² / Σi (xi − x̄)² ]

• σ² is unknown but can be estimated from the data by the Residual
Standard Error (RSE): RSE = √(RSS / (n − 2))
The confidence intervals

• A range of values such that, with 95% probability, the range will contain
the true unknown value of a parameter
• For β1, there is an approximately 95% chance that the range

  β̂1 ± 2·SE(β̂1)

  will contain its true value

• Similarly, for β0: β̂0 ± 2·SE(β̂0)
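The standard errors and the ±2·SE interval can be sketched as follows, again on synthetic data (all names and the data-generating line are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)
n = x.size

# Least squares fit
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar

# Estimate sigma by the residual standard error
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors from the slide formulas
se_beta1 = np.sqrt(rse**2 / sxx)
se_beta0 = np.sqrt(rse**2 * (1 / n + x_bar**2 / sxx))

# Approximate 95% confidence interval for beta1
ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)
```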
Are X and Y related?

• Null hypothesis H0: β1 = 0 – no relationship between X and Y
• Why? If β1 = 0, the model reduces to Y = β0 + ε and X is not associated
with Y

• The t-statistic:

  t = (β̂1 − 0) / SE(β̂1)

  the number of standard deviations β̂1 is away from zero
• Tests whether β̂1 is sufficiently far from 0 that the true value of β1 is
non-zero
Are X and Y related? The p-value

• A p-value computed from the t-distribution
• If there is no relationship between X and Y, the t-statistic follows a
t-distribution with n − 2 degrees of freedom (we estimate two parameters,
β0 and β1)
• For n > 30 the t-distribution is close to the normal distribution
• Compute the probability of observing a value as extreme as |t|,
assuming β1 = 0
• The p-value
• Provides the smallest level of significance at which the null hypothesis
would be rejected
• A smaller p-value indicates evidence of an association between X and Y
• Typical cutoff values of p are 5% and 1%
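The t-statistic and its two-sided p-value can be computed directly from the fit; a sketch on synthetic data (illustrative names, and the two-sided tail is computed with `scipy.stats.t.sf`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)
n = x.size

# Least squares fit and standard error of the slope
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

# How many standard errors beta1 is away from zero
t_stat = beta1 / se_beta1

# Two-sided p-value from a t-distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```

Here the true slope is far from zero, so the p-value is essentially zero and H0 is rejected.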
Is there a relationship between TV
advertising and sales?
🠶 What would you infer?
How well does the model fit the data?

• Two methods to investigate

• Residual Standard Error (RSE)
• An estimate of the standard deviation of the error term ε
• It would not be possible to accurately predict Y even if the true
regression line were known, because of ε
• Roughly the average amount that the response will deviate from the true
regression line
• RSE formula:

  RSE = √(RSS / (n − 2))

• What will happen to RSE if the deviation is high (lack of fit)?


How well does the model fit the data?
• R² statistic
• RSE is measured in the units of Y (so what?)
• Lack of standardization
• Can we build a measure between 0 and 1 (independent of Y's scale)?

• Total Sum of Squares (TSS): TSS = Σi (yi − ȳ)²

  R² = (TSS − RSS) / TSS = 1 − RSS / TSS

• R² measures the proportion of variability in Y that can be explained
using X

• What is the best R² value?
• Depends on the domain the data comes from
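Both fit measures fall out of RSS and TSS; a short sketch on synthetic data (illustrative), which also checks the upcoming claim that R² equals the squared correlation in simple regression:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 60)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=x.size)
n = x.size

# Least squares fit
x_bar = x.mean()
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

rse = np.sqrt(rss / (n - 2))  # in the units of Y
r2 = 1 - rss / tss            # unit-free, between 0 and 1

# In simple regression, R^2 equals the squared correlation Cor(X, Y)^2
r = np.corrcoef(x, y)[0, 1]
```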
Relationship with the correlation measure

• In simple regression, r = Cor(X, Y) is a measure of the association
between X and Y
• In this setting, R² = r²
Multiple Linear Regression

• What if there is more than one predictor?

• One simple linear regression model per predictor?
• Better: extend the simple linear regression model to incorporate multiple
predictors:

  Y = β0 + β1X1 + β2X2 + … + βpXp + ε
Estimating the regression coefficients

• The coefficients β0, β1, …, βp are unknown, so we use the estimates
β̂0, β̂1, …, β̂p for prediction:

  ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂pxp

• Using the same least squares approach

• The coefficients that minimize the RSS are the least squares estimates
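`np.linalg.lstsq` solves exactly this RSS minimization once an intercept column is added; a sketch with simulated predictors (the data and coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 3))                  # three predictors
true_beta = np.array([1.0, 2.0, -1.0, 0.5])  # intercept + three slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.5, size=n)

# Prepend a column of ones for the intercept, then find the
# coefficients that minimize the RSS
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
```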
Advertising example

• How do you interpret the coefficient values below?
• What does this slope value mean?
• How can the following be interpreted?
Important questions

🠶 The important questions when doing multiple regression analysis are:
• Is at least one of the predictors useful in predicting the response?
• Do all the predictors help to explain Y, or is only a subset useful?
• How well does the model fit the data?
• Given predictor values, what response should we predict, and how
accurate is that prediction?

The relationship between response and predictors

• Remember the simple linear regression context
• Similarly, the null and alternative hypotheses can be defined for
multiple linear regression:

  H0: β1 = β2 = … = βp = 0
  Ha: at least one βj is non-zero

• This hypothesis test can be performed with the F-statistic
The F-statistic

• Similar to the simple linear model:

  F = [(TSS − RSS) / p] / [RSS / (n − p − 1)]

• If the linear model assumptions are correct:

  E[RSS / (n − p − 1)] = σ², and, provided H0 is true,
  E[(TSS − RSS) / p] = σ² as well

• So, if H0 is true, what can we say about the F value?
• If Ha is true, what will happen to the F value?
Interpreting the F-statistic
• The F-statistic for the regression model of sales over TV, Radio, and
Newspaper advertisements is way above 1: at least one advertising medium
is related to sales

• What if n is large?
• An F value only slightly above 1 can be sufficient to reject the null
hypothesis
• What's the right F value to reject the null hypothesis?
• Use the p-value from the F-distribution (when the errors are normally
distributed)
• A smaller p-value suggests a relationship between the predictors and the
response
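The F-statistic and its p-value can be computed from RSS and TSS; a sketch with simulated data (illustrative coefficients, p-value from `scipy.stats.f`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 150, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([3.0, 1.5, 0.0]) + rng.normal(size=n)

# Fit the full model with an intercept
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for H0: beta1 = ... = betap = 0
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))

# Upper-tail p-value from the F(p, n - p - 1) distribution
p_value = stats.f.sf(f_stat, p, n - p - 1)
```

Two of the three predictors genuinely affect y, so F comes out far above 1 and the p-value is tiny.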
Can we test the null hypothesis for a subset of coefficients?
• The corresponding null hypothesis for a subset of q coefficients
(chosen, for convenience, from the end of the list):

  H0: βp−q+1 = βp−q+2 = … = βp = 0

• Fit a second model that omits those q variables; let RSS0 be the
residual sum of squares of that reduced model, and compute:

  F = [(RSS0 − RSS) / q] / [RSS / (n − p − 1)]
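This partial F-test means fitting the model twice, with and without the last q variables; a sketch with simulated data (illustrative setup where the last two predictors are pure noise):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p, q = 120, 4, 2
X = rng.normal(size=(n, p))
# Only the first two predictors matter; the last q = 2 are noise
y = 1.0 + X @ np.array([2.0, -1.0, 0.0, 0.0]) + rng.normal(size=n)

def rss_of(Xmat):
    # RSS of a least squares fit with an intercept
    X1 = np.column_stack([np.ones(Xmat.shape[0]), Xmat])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

rss_full = rss_of(X)               # all p predictors
rss_reduced = rss_of(X[:, :p - q]) # model without the last q predictors

# Partial F-statistic and its p-value
f_stat = ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))
p_value = stats.f.sf(f_stat, q, n - p - 1)
```

Because the reduced model is nested in the full one, RSS0 ≥ RSS always, so F is non-negative.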
Deciding on important variables

• How many combinations (subsets) can we make out of p predictors? 2^p

• Ex: 2 predictors
• {0,0}, {0,1}, {1,0}, {1,1} => 2² = 4

• Can test models with such combinations and choose the best model
• Methods used:
• Mallow’s Cp
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• Adjusted R2
• What would be the challenge here when combining variables?
Variable selection
• Automated and efficient methods to choose smaller yet effective
subsets of models (of subsets of variables)
• Common approaches:
• Forward selection
• Start with the null model (intercept only)
• Fit p simple linear regression models and add the variable giving the
lowest RSS to the null model
• Fit the resulting set of two-variable models and again add the variable
giving the lowest RSS
• Continue until a stopping criterion is met
• Backward selection
• Start with all variables
• Remove the variable with the highest p-value
• Re-fit the model, remove the variable with the highest p-value, and so on
Variable selection

• Common approaches
• Mixed selection
• Start with a no-variable model
• Keep adding one by one like in the forward selection
• In case the p-value of any variable is above a certain threshold, remove that
variable
• Continue this back-and-forth process until a stopping criterion is met

• Can you evaluate the three approaches?


• Can we use the forward selection always?
• Can we use backward selection always?
• Why do we need a mixed selection approach?
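The forward-selection loop described above can be sketched in a few lines; this is a toy illustration (synthetic data, RSS as the greedy criterion, and a fixed two-variable stopping rule for brevity):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 5
X = rng.normal(size=(n, p))
# Only predictors 0 and 2 actually drive the response
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=n)

def rss_of(cols):
    # RSS of an intercept-plus-selected-columns least squares fit
    X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

selected, remaining = [], list(range(p))
for _ in range(2):  # stopping criterion: keep two variables (illustrative)
    # Greedily add the variable that lowers the RSS the most
    best = min(remaining, key=lambda j: rss_of(selected + [j]))
    selected.append(best)
    remaining.remove(best)
```

On this data, the greedy steps pick the two truly relevant predictors.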
Model Fit

• Both the RSE and R² approaches can be used, as in simple linear
regression
• In multiple regression, R² equals Cor(Y, Ŷ)², the squared correlation
between the response and the fitted values
• What can happen to R² when a new variable is added to the model?
(It never decreases)
• If there is a significant increase?
• If the increment is insignificant?
• Can plotting be helpful to evaluate fit?
• Visual summaries can reveal problems that are not obvious from
numerical statistics
Making predictions

• The model, once fitted to training data, can be used for predictions
• Challenges
• Coefficient estimates differ from the actual population parameter values
• The least squares plane is only an estimate of the true population
regression plane
• Assuming a linear model is only an approximation of reality – a source
of bias, which is part of the reducible error
• The irreducible error limits accuracy even if we knew the true f(X)
What to do with qualitative data?

• Male / female?
• Large / medium / small?
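One standard answer is dummy (indicator) variables: one 0/1 column per level, dropping one level as the baseline. A sketch, where the three-level size variable is hypothetical:

```python
import numpy as np

# Hypothetical qualitative predictor with three levels
size = np.array(["small", "medium", "large", "medium", "small", "large"])

# Dummy encoding: one 0/1 indicator per level, dropping "small"
# as the baseline to avoid perfect collinearity with the intercept
is_medium = (size == "medium").astype(float)
is_large = (size == "large").astype(float)

# Design matrix: intercept column plus the two dummy columns
X = np.column_stack([np.ones(size.size), is_medium, is_large])
```

The coefficients on the dummy columns are then interpreted as the difference in the response relative to the baseline level.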
Issues with standard linear regression

• The standard linear regression model works well for many real-world
problems

• It makes two strict assumptions:
• Additive: the effect of changes in a predictor Xj on the response Y is
independent of the values of the other predictors – e.g. the contribution
of X1 does not change with the value of X2
• Linear: the change in the response Y due to a one-unit change in Xj is
constant, regardless of the value of Xj
Correcting for interaction effects

• Include an interaction term:

  Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

• How does this relax the additive assumption?
• The effective slope of X1 becomes β1 + β3X2, so the effect of X1 now
depends on the value of X2
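Fitting the interaction model just means adding a product column to the design matrix; a sketch on synthetic data with a genuine interaction (illustrative coefficient values):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Data where the effect of x1 truly depends on x2 (interaction 1.5)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(scale=0.5, size=n)

# Fit Y = b0 + b1*X1 + b2*X2 + b3*(X1*X2) + e via least squares
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[3] estimates the interaction coefficient
```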


Outliers
• An observation whose real value is far from the predicted value
• Does it have an impact on the model?
• Compare the fitted line before and after removal of the outlier
Multicollinearity

• Refers to the situation in which two or more predictor variables are
closely related to one another
• The presence of collinearity can pose problems in the regression context
• It is difficult to separate out the individual effects of collinear
variables on the response
• Possible remedies:
• Ridge and Lasso Regression
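A common diagnostic for collinearity (not named on the slide but standard practice) is the variance inflation factor: regress each predictor on the others and compute VIF = 1 / (1 − R²). A sketch on synthetic data, where x2 is deliberately almost a copy of x1:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # almost a copy of x1 (collinear)
x3 = rng.normal(size=n)                  # unrelated predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress predictor j on the remaining predictors; VIF_j = 1/(1 - R_j^2)
    target = X[:, j]
    others = np.column_stack([np.ones(X.shape[0]), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    rss = np.sum((target - others @ beta) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - rss / tss
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

Large VIF values (a common rule of thumb is above 5 or 10) flag the collinear pair, while the unrelated predictor stays near 1.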
