Chapter 11: Regression

The document discusses regression analysis, focusing on linear regression models, inference, and the significance of coefficients. It highlights the importance of understanding collinearity, feature selection, and the assumptions necessary for effective multiple regression. Additionally, it emphasizes model evaluation through statistical metrics and cross-validation techniques to ensure accuracy and reliability of predictions.


Chapter 11: INFERENCE FOR REGRESSION

LINEAR REGRESSION MODEL

REGRESSION INFERENCE AND INTUITION
For regression, the null hypothesis is so natural that it is rare to
see any other considered.
The natural null hypothesis is that the slope is zero and the
alternative is (almost) always two-sided.
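In the usual notation (a standard formulation, with beta_1 the population slope and b_1 its estimate from the sample):

$$ H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_A: \beta_1 \neq 0, \qquad t = \frac{b_1 - 0}{SE(b_1)}, \quad df = n - 2 $$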
DISTRIBUTION OF THE SLOPE

Less scatter around the regression model means the slope will be more consistent from sample to sample. The spread around the line is measured with the residual standard deviation, s_e.
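Written out, the standard simple-regression formulas (with e_i the residuals, s_x the standard deviation of the x-values, and n the number of cases) are:

$$ s_e = \sqrt{\frac{\sum e_i^2}{n-2}}, \qquad SE(b_1) = \frac{s_e}{s_x\,\sqrt{n-1}} $$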
CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

EXAMPLE: R
## Scatterplot of sp.log against vix.log
plot(vix.log, sp.log, main = 'SP500 vs VIX',
     xlab = 'VIX', ylab = 'SP500', pch = 1,
     col = 'blue')

## Fit the regression and add the fitted line to the scatterplot
res <- lm(sp.log ~ vix.log)
res$coefficients
abline(res, col = 'red')

## Standard diagnostic plots for the fitted model
plot(res)
CONFIDENCE INTERVALS FOR THE SLOPE

The vix.log coefficient is -0.1199 with standard error 0.001822.
With n = 3976, there are n - 2 = 3974 degrees of freedom and t*(0.025, 3974) = 1.960.
The confidence interval for the slope is:
(-0.1199 - 1.96*0.001822, -0.1199 + 1.96*0.001822) = (-0.1235, -0.1163)
Linear regression coefficients

            slope        se        lower        upper
vix.log   -0.1199   0.001822   -0.1234711   -0.1163289

t-statistic        P-value
-65.806806         2*P(T > 65.8) = 2.00E-16
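The slope, its standard error, and this interval can also be pulled straight from the fitted model in R; a minimal sketch, assuming the res object fitted above:

## Coefficient table: estimates, standard errors, t-statistics, P-values
coef(summary(res))

## 95% confidence interval for the intercept and the slope
confint(res, level = 0.95)

## The slope interval by hand: estimate +/- t* x SE
est   <- coef(summary(res))["vix.log", "Estimate"]
se    <- coef(summary(res))["vix.log", "Std. Error"]
tstar <- qt(0.975, df = df.residual(res))
c(est - tstar * se, est + tstar * se)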
[Figure: SP500 vs Volatility (VIX) line fit plot - observed and predicted SP500 plotted against VIX, with fitted line f(x) = -0.1199 x + 0.0003.]

[Figure: VIX residual plot - residuals plotted against VIX.]
INTERPRET REGRESSION MODEL

MULTIPLE REGRESSION INFERENCE
The standard error, t-statistic, and P-values
mean the same thing in the multiple
regression as they meant in a simple
regression.
The t-ratios and corresponding P-values in
each row of the table refer to their
corresponding coefficients. The
complication in multiple regression is that
all of these values are interrelated.
What Multiple Regression Coefficients Mean

Including any new predictor or changing any data value can change any or all of the other numbers in the table. And we can see from the increased R2 that the added complication of an additional predictor was worthwhile in improving the fit of the regression model.

For example, when we restrict our attention to men with waist sizes equal to 38 inches (points in blue), we can see a relationship between %body fat and height:

Pred %Body Fat = -3.10 + 1.77(Waist) - 0.60(Height)
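A minimal R sketch of a fit like this one; the data frame and column names (bodyfat, PctBF, Waist, Height) are placeholders, not the names used in the original data:

## Multiple regression of %body fat on waist size and height
fit <- lm(PctBF ~ Waist + Height, data = bodyfat)
summary(fit)   ## coefficients, standard errors, t-ratios, P-values, R2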
MULTIPLE REGRESSION CASES

Multiple regression can pick up subtle associations across slices of the population.
For example, in the previous case it picks up the association of body fat with waist size for different heights.

Most of the time, the challenge we encounter in multiple regression is collinearity.
COLLINEARITY
Consider data on roller coasters: the duration of the ride was found to depend on, among other things, the drop (that initial stomach-turning plunge down the high hill that powers the coaster through its run).
COLLINEARITY
Adding a second predictor should only improve the model, so let’s
add the maximum Speed of the coaster to the model:

What happened to the coefficient of Drop? Not only has it switched from positive
to negative, but it now has a small t-ratio and large P-value, so we can’t reject
the null hypothesis that the coefficient is actually zero after all.
What we have seen here is a problem known as collinearity. Specifically, Drop and Speed are highly correlated with each other. As a result, the effect of Drop after allowing for the effect of Speed is negligible. Whenever you have several predictors, you must think about how the predictors are related to each other.
Multicollinearity? You may find this problem referred to as “multicollinearity.” But there is no
such thing as “unicollinearity”—we need at least two predictors for there to be a linear
association between them—so there is no need for the extra two syllables.
When predictors are unrelated to each other, each provides new information to help account
for more of the variation in y. But when there are several predictors, the model will work best if
they vary in different ways so that the multiple regression has a stable base.
If you wanted to build a deck on the back of your house, you wouldn’t build it with supports
placed just along one diagonal. Instead, you’d want the supports spread out in different
directions as much as possible to make the deck stable. We’re in a similar situation with
multiple regression.
When predictors are highly correlated, they line up together, which makes the regression they
support balance precariously.
What should you do about a collinear regression model?
The simplest cure is to remove some of the predictors. That
simplifies the model and usually improves the t-statistics. And, if
several predictors provide pretty much the same information,
removing some of them won’t hurt the model.
Which predictors should you remove? Keep those that are most
reliably measured, those that are least expensive to find, or even
the ones that are politically important.
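One way to see the problem numerically is to look at the correlation between the predictors and at variance inflation factors; a sketch, assuming a data frame coasters with columns Duration, Drop and Speed (hypothetical names) and the car package installed:

## Correlation between the two predictors
cor(coasters$Drop, coasters$Speed)

## Variance inflation factors for the two-predictor model
library(car)
fit2 <- lm(Duration ~ Drop + Speed, data = coasters)
vif(fit2)   ## values well above 5 or 10 are a common warning sign of collinearity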
WHAT MULTIPLE REGRESSION COEFFICIENTS MEAN

- This relationship is conditional because we've restricted our set to only those roller coasters with a certain drop.
- For roller coasters with a given drop, an increase in Speed of 1 unit is associated with an increase of about 2.70 in Duration.
- If that relationship is consistent for each drop, then the multiple regression coefficient will estimate it.
ASSUMPTIONS AND CONDITIONS

Linearity Assumption:
- Straight Enough Condition: Check the scatterplot for each candidate predictor variable - the shape must not be obviously curved or we can't consider that predictor in our multiple regression model.

Independence Assumption:
- Randomization Condition: The data should arise from a random sample. Also, check the residuals plot - the residuals should appear to be randomly scattered.
ASSUMPTIONS AND CONDITIONS

Equal Variance Assumption:
- Does the Plot Thicken? Condition: Check the residuals plot - the spread of the residuals should be uniform.

Normality Assumption:
- Nearly Normal Condition: Check a histogram of the residuals - the distribution of the residuals should be unimodal and symmetric, and the Normal probability plot should be straight.

Summary of the checks of conditions, in order (see the R sketch after the list):
1. Check the Straight Enough Condition with scatterplots of the y-variable against each x-variable.
2. If the scatterplots are straight enough, fit a multiple regression model to the data.
3. Find the residuals and predicted values.
4. Make and check a scatterplot of the residuals against the predicted values. This plot should look patternless.
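These checks translate into a few standard R plots; a minimal sketch, assuming a fitted lm object called fit:

## Residuals vs. predicted values - should look patternless (Does the Plot Thicken?)
plot(fitted(fit), resid(fit), xlab = 'Predicted values', ylab = 'Residuals')
abline(h = 0, col = 'red')

## Nearly Normal Condition: histogram and Normal probability plot of the residuals
hist(resid(fit))
qqnorm(resid(fit)); qqline(resid(fit))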
FEATURE SELECTION
Adding more variables isn't always helpful: the model may overfit and become too complicated. An overfit model doesn't generalize to new data; it only works well on the data it was trained on.
All the variables/columns in the dataset may not be independent. This condition is called multicollinearity, where there is an association between predictor variables.
We have to select the appropriate variables to build the best model. This process of selecting variables is called feature selection.
THE ANOVA TABLE

MULTIPLE REGRESSION INFERENCE: I THOUGHT I SAW AN ANOVA TABLE...

- Now that we have more than one predictor, there's an overall test we should consider before we do more inference on the coefficients.
- We ask the global question "Is this multiple regression model any good at all?"
- We test the null hypothesis that all the slope coefficients are zero against the alternative that at least one of them is not.
- The F-statistic and associated P-value from the ANOVA table are used to answer our question (the formula is given below).
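In the usual notation (with k predictors, SSR the regression sum of squares and SSE the error sum of squares), the statistic reported in the ANOVA table is:

$$ F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)} $$

which is compared to an F distribution with k and n - k - 1 degrees of freedom.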
COMPARING MULTIPLE REGRESSION MODELS
How do we know that some other choice of predictors might not provide
a better model?
What exactly would make an alternative model better?
These questions are not easy—there’s no simple measure of the success
of a multiple regression model.
Regression models should make sense.
- Predictors that are easy to understand are usually better choices than obscure variables.
- Similarly, if there is a known mechanism by which a predictor has an effect on the response variable, that predictor is usually a good choice for the regression model.
The simple answer is that we can't know whether we have the best possible model.
COEFFICIENT OF MULTIPLE DETERMINATION

Reports the proportion of total variation in Y explained by all X variables taken together.
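In symbols (with SSR the regression sum of squares, SSE the error sum of squares and SST the total sum of squares):

$$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$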
MULTIPLE REGRESSION: ADJUSTED R2

- There is another statistic in the full regression table called the adjusted R2 (formula below).
- This statistic is a rough attempt to adjust for the simple fact that when we add another predictor to a multiple regression, the R2 can't go down and will most likely get larger.
- This fact makes it difficult to compare alternative regression models that have different numbers of predictors.
- Adjusted R2 shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used.
- It penalizes excessive use of independent variables.
- It is smaller than R2.
- It is useful for comparing among models.
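The usual formula (with n cases and k predictors) is:

$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} $$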
THE BEST MULTIPLE REGRESSION MODEL

The first and most important thing to realize is that often there is no such thing as the "best" regression model. (After all, all models are wrong.)
Multiple regressions are subtle. The choice of which predictors to use determines almost everything about the regression.
The best regression models have:
- Relatively few predictors.
- A relatively high R2.
- A relatively small s, the standard deviation of the residuals.
- Relatively small P-values for their F- and t-statistics.
- No cases with extraordinarily high leverage.
- No cases with extraordinarily large residuals.
- Predictors that are reliably measured and relatively unrelated to each other.
BUILDING REGRESSION MODELS SEQUENTIALLY
You can build a regression model by adding variables to a growing regression.
Each time you add a predictor, you hope to account for a little more of the
variation in the response. What’s left over is the residuals. At each step,
consider the predictors still available to you. Those that are most highly
correlated with the current residuals are the ones that are most likely to
improve the model. If you see a variable with a high correlation at this stage
and it is not among those that you thought were important, stop and think
about it. Is it correlated with another predictor or with several other
predictors?
At each step, make a plot of the residuals to check for outliers, and check the
leverages (say, with a histogram of the leverage values) to be sure there are
no high-leverage points. Influential cases can strongly affect which variables
appear to be good or poor predictors in the model. It’s also a good idea to
check that a predictor doesn’t appear to be unimportant in the model only
because it’s correlated with other predictors in the model.
MODEL SELECTION: CROSS-VALIDATION
The major challenge in designing a model is to make it work accurately on unseen data.

To know whether the designed model is working well, we have to test it against data points that were not present during the training of the model. These data points serve as unseen data for the model and make it easy to evaluate the model's accuracy.

One of the best techniques for checking the effectiveness of a model is cross-validation, which can easily be implemented using the R programming language. In this approach, a portion of the data set is reserved and not used in training the model.
Once the model is ready, that reserved data set is used for testing. Values of the dependent variable are predicted during the testing phase, and the model's accuracy is calculated from the prediction error, i.e., the difference between the actual and predicted values of the dependent variable.
There are several statistical metrics used for evaluating the accuracy of regression models.
STATISTICAL METRICS
Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the average squared difference between the actual and predicted values of the target variable. It gives the typical size of the prediction error made by the model, so a lower RMSE means a more accurate model.

Mean Absolute Error (MAE): This metric gives the average absolute difference between the actual values and the values predicted by the model for the target variable. Because it is less sensitive to outliers than RMSE, MAE is a good choice when outliers should not dominate the assessment of model accuracy. Lower values indicate better models.

R2: The R-squared metric gives an idea of how much of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0 to 100%. A better model should have a high value of R-squared.
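These metrics are easy to compute from a vector of predictions; a sketch, assuming numeric vectors actual and predicted of equal length:

## Root Mean Squared Error, Mean Absolute Error and R-squared
rmse <- sqrt(mean((actual - predicted)^2))
mae  <- mean(abs(actual - predicted))
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
c(RMSE = rmse, MAE = mae, R2 = r2)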
TYPES OF CROSS-VALIDATION

When partitioning the complete dataset into a training set and a validation set, there is a chance of losing some important data points from the training set. Since those data points are not included in training, the model never gets the chance to detect some patterns, and this can lead to overfitting or underfitting of the model.
To avoid this, there are different types of cross-validation techniques that randomly sample the training and validation sets and help maximize the accuracy of the model.
One of the most popular cross-validation techniques is the validation set approach.
VALIDATION SET APPROACH
In this method, the dataset is divided randomly into training and testing sets. The following steps are performed to implement this technique:
1. Randomly split the dataset into a training set and a testing set.
2. Train the model on the training set.
3. Apply the resulting model to the testing set.
4. Calculate the prediction error using model performance metrics.
EXAMPLE

- 200 observations of sales vs. marketing spend on youtube, facebook and newspaper.
- We want a model to predict sales from marketing spend and decide where to spend the money.
- The result of the multiple regression (sketched in R below) shows that newspaper is not significant, but we can explain 89% of sales with this model.
- Can we trust this model going forward?
- Let's do some cross-validation.
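A minimal R sketch of the fit described above; it assumes a 200-row data frame called marketing with columns youtube, facebook, newspaper and sales (for example, the marketing data set shipped with the datarium package):

## Multiple regression of sales on the three marketing channels
full_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(full_model)   ## the slides report newspaper not significant and R2 of about 0.89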
CROSS-VALIDATION

Take a set of 150 observations to construct the model.
Leave out 50 observations to predict and see the error.
If the model is correct, the "in-sample" error from the model will be roughly consistent with the "out-of-sample" error from the last 50 observations.
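A sketch of that check in R, continuing with the same (assumed) marketing data frame:

## Fit on the first 150 observations, hold out the last 50
train <- marketing[1:150, ]
test  <- marketing[151:200, ]
cv_model <- lm(sales ~ youtube + facebook + newspaper, data = train)

## Compare in-sample and out-of-sample prediction error (RMSE)
rmse_in  <- sqrt(mean(resid(cv_model)^2))
rmse_out <- sqrt(mean((test$sales - predict(cv_model, newdata = test))^2))
c(in_sample = rmse_in, out_of_sample = rmse_out)   ## should be roughly comparable if the model holds up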
- Regression where I just use the 150 observations.
- The results are consistent with using the full set of 200 observations.
- Use the model to predict the last 50 and check the error.
- The out-of-sample error is consistent with the in-sample error.
