Chapter 11: Regression
REGRESSION
LINEAR REGRESSION MODEL
REGRESSION INFERENCE AND INTUITION
For regression, the null hypothesis is so natural that it is rare to
see any other considered.
The natural null hypothesis is that the slope is zero and the
alternative is (almost) always two-sided.
DISTRIBUTION OF THE SLOPE
Less scatter around the regression line means the slope will be more consistent
from sample to sample. The spread around the line is measured with the residual
standard deviation, s_e.
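As a sketch of this idea (the data here are simulated, not from the text), the residual standard deviation is reported by `sigma()` in R, and a smaller value goes hand in hand with a smaller standard error for the slope:

```r
# Simulated example: x, y, and the sample size are made up for illustration
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.3)   # true slope 0.5, true residual sd 0.3

fit <- lm(y ~ x)
s_e <- sigma(fit)                                        # residual standard deviation
se_slope <- summary(fit)$coefficients["x", "Std. Error"] # standard error of the slope

s_e        # close to the true residual sd of 0.3
se_slope   # less scatter (smaller s_e) would shrink this further
```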
CONFIDENCE INTERVALS AND HYPOTHESIS TESTS
EXAMPLE: R
plot(vix.log, sp.log, main = 'SP500 vs VIX',
     xlab = 'VIX', ylab = 'SP500', pch = 1,
     col = 'blue')
T-test statistic: -65.80680
P-value: 2*P(T > 65.80680) < 2.00E-16
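The two-sided P-value in that table comes from the t distribution. A hedged sketch in R; the degrees of freedom (n − 2) are an assumption here, since n is not shown:

```r
# Two-sided P-value for the slope t-statistic (df = n - 2 is assumed;
# 250 is a placeholder for a hypothetical sample of 252 trading days)
t_stat <- -65.80680
df <- 250
p_value <- 2 * pt(-abs(t_stat), df)
p_value   # effectively 0, reported as < 2e-16
```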
[Figure: VIX line-fit plot of SP500 vs. volatility (VIX), showing actual and predicted SP500 returns with fitted line f(x) = -0.1199x + 0.0003; a second panel plots the residuals against VIX.]
INTERPRET REGRESSION MODEL
MULTIPLE REGRESSION INFERENCE
The standard errors, t-statistics, and P-values mean the same thing in multiple regression as they do in simple regression. The t-ratios and corresponding P-values in each row of the table refer to their corresponding coefficients. The complication in multiple regression is that all of these values are interrelated.
What Multiple Regression Coefficients Mean

Including any new predictor or changing any data value can change any or all of the other numbers in the table. And we can see from the increased R² that the added complication of an additional predictor was worthwhile in improving the fit of the regression model.

For example, when we restrict our attention to men with waist sizes equal to 38 inches (points in blue), we can see a relationship between %body fat and height:

Pred %Body Fat = -3.10 + 1.77(Waist) - 0.60(Height)
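A sketch of how such a model is fit in R. The data below are simulated to be roughly consistent with the fitted equation above; they are not the book's body-fat data:

```r
# Simulated stand-in for the body-fat data (made-up values)
set.seed(42)
n <- 250
Waist  <- rnorm(n, mean = 36, sd = 3)
Height <- rnorm(n, mean = 70, sd = 3)
PctBodyFat <- -3.10 + 1.77 * Waist - 0.60 * Height + rnorm(n, sd = 4)

fit <- lm(PctBodyFat ~ Waist + Height)
coef(fit)   # each slope is the effect of that predictor, holding the other fixed
```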
MULTIPLE REGRESSION CASES

Multiple regression can pick up subtle associations across slices of the population. For example, in the previous case it picks up the association of body fat with waist size for different heights.
What happened to the coefficient of Drop? Not only has it switched from positive to negative, but it now has a small t-ratio and a large P-value, so we can't reject the null hypothesis that the coefficient is actually zero after all.

What we have seen here is a problem known as collinearity. Specifically, Drop and Speed are highly correlated with each other. As a result, the effect of Drop after allowing for the effect of Speed is negligible. Whenever you have several predictors, you must think about how the predictors are related to one another.
Multicollinearity? You may find this problem referred to as “multicollinearity.” But there is no
such thing as “unicollinearity”—we need at least two predictors for there to be a linear
association between them—so there is no need for the extra two syllables.
When predictors are unrelated to each other, each provides new information to help account
for more of the variation in y. But when there are several predictors, the model will work best if
they vary in different ways so that the multiple regression has a stable base.
If you wanted to build a deck on the back of your house, you wouldn’t build it with supports
placed just along one diagonal. Instead, you’d want the supports spread out in different
directions as much as possible to make the deck stable. We’re in a similar situation with
multiple regression.
When predictors are highly correlated, they line up together, which makes the regression they
support balance precariously.
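A small simulation of this effect, with made-up variables named after the Drop/Speed example: when one predictor is nearly a copy of another, its standard error balloons and its t-ratio collapses once the other predictor is in the model.

```r
# Collinearity demo: Drop is almost a copy of Speed (simulated data)
set.seed(7)
n <- 100
Speed <- rnorm(n)
Drop  <- Speed + rnorm(n, sd = 0.05)      # highly correlated with Speed
y     <- 1 + 2 * Speed + rnorm(n)

alone    <- summary(lm(y ~ Drop))$coefficients
together <- summary(lm(y ~ Speed + Drop))$coefficients

alone["Drop", "t value"]      # large: Drop looks useful on its own
together["Drop", "t value"]   # small: after Speed, Drop adds almost nothing
```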
What should you do about a collinear regression model?
The simplest cure is to remove some of the predictors. That
simplifies the model and usually improves the t-statistics. And, if
several predictors provide pretty much the same information,
removing some of them won’t hurt the model.
Which predictors should you remove? Keep those that are most
reliably measured, those that are least expensive to find, or even
the ones that are politically important.
TESTING THE MULTIPLE REGRESSION MODEL
Now that we have more than one predictor, there’s an overall test
we should consider before we do more inference on the
coefficients.
We ask the global question “Is this multiple regression model any
good at all?”
We test the null hypothesis that all the slope coefficients are zero against the alternative that at least one of them is not.
The F-statistic and associated P-value from the ANOVA table are
used to answer our question.
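In R, the F-statistic and its degrees of freedom are part of the `lm` summary, and the overall P-value can be computed with `pf()`. A sketch with simulated data:

```r
# Overall F-test for a multiple regression (simulated data)
set.seed(3)
n <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
fs  <- summary(fit)$fstatistic   # F value, numerator df, denominator df
p_overall <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
p_overall   # small P-value: reject H0 that all slopes are zero
```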
COMPARING MULTIPLE REGRESSION MODELS
How do we know that some other choice of predictors might not provide
a better model?
What exactly would make an alternative model better?
These questions are not easy—there’s no simple measure of the success
of a multiple regression model.
Regression models should make sense.
Predictors that are easy to understand are usually better choices than obscure
variables.
Similarly, if there is a known mechanism by which a predictor has an effect on the
response variable, that predictor is usually a good choice for the regression model.
To know whether the model is working well, we have to test it against data points that were not present during the training of the model. These data points serve as unseen data for the model, and they make it easy to evaluate the model's accuracy.

One of the finest techniques for checking the effectiveness of a model is cross-validation, which can be easily implemented using the R programming language. In this approach, a portion of the data set is reserved and not used in training the model.

Once the model is ready, that reserved data set is used for testing. Values of the dependent variable are predicted during the testing phase, and the model accuracy is calculated from the prediction error, i.e., the difference between the actual and predicted values of the dependent variable.

There are several statistical metrics that are used for evaluating the accuracy of a regression model.
STATISTICAL METRICS
Root Mean Squared Error (RMSE): As the name suggests, this is the square root of the averaged squared difference between the actual and predicted values of the target variable. It gives the average prediction error made by the model, so a lower RMSE value means a more accurate model.
Mean Absolute Error (MAE): This metric gives the average absolute difference between the actual values and the values predicted by the model for the target variable. If outliers are not of particular concern for the model's accuracy, MAE can be used to evaluate its performance. Its value should be small for a better model.
R² (R-squared): The value of the R-squared metric gives an idea of what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0-100%. So a better model should have a high value of R-squared.
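All three metrics can be computed directly from actual and predicted values. A sketch in base R; the `actual` and `predicted` vectors below are hypothetical:

```r
# Hypothetical actual and predicted values of the dependent variable
actual    <- c(3.1, 4.0, 5.2, 6.1, 7.3)
predicted <- c(3.0, 4.3, 5.0, 6.4, 7.0)

rmse <- sqrt(mean((actual - predicted)^2))                          # penalizes large errors
mae  <- mean(abs(actual - predicted))                               # average absolute error
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

c(RMSE = rmse, MAE = mae, R2 = r2)
```

Because squaring weights large errors more heavily, RMSE is always at least as large as MAE on the same data.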
TYPES OF CROSS-VALIDATION
When partitioning the complete dataset into a training set and a validation set, there is a chance of losing important data points from the training side. Since those data are not included in the training set, the model never gets the chance to detect some patterns, which can lead to overfitting or underfitting of the model.

To avoid this, there are different types of cross-validation techniques that guarantee random sampling of the training and validation data sets and maximize the accuracy of the model. One of the most popular cross-validation techniques is the Validation Set Approach.
VALIDATION SET APPROACH
In this method, the dataset is divided randomly into training and testing sets. The following steps are performed to implement this technique:

1. Randomly sample the dataset
2. Train the model on the training data set
3. Apply the resultant model to the testing data set
4. Calculate the prediction error using model performance metrics
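The four steps above can be sketched in base R. The data and the 80/20 split below are assumptions for illustration; any split ratio works the same way:

```r
# Validation set approach on simulated data
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
d <- data.frame(x, y)

# 1. Random sampling of the dataset (80% train, 20% test is an assumption)
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# 2. Train the model on the training set
fit <- lm(y ~ x, data = train)

# 3. Apply the resultant model to the testing set
pred <- predict(fit, newdata = test)

# 4. Prediction error via a performance metric (RMSE here)
rmse <- sqrt(mean((test$y - pred)^2))
rmse
```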
EXAMPLE
200 observations of sales vs. marketing spend on youtube, facebook, and newspaper. Let's do some cross-validation.
CROSS-VALIDATION
Take a set of 150 observations to construct the model. Leave out 50 observations to predict and see the error. If the model is correct, the "in-sample" error from the model will be roughly consistent with the "out-of-sample" error from the last 50 observations.

Run the regression using just the 150 observations: the out-of-sample error is consistent with the in-sample error.
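A sketch of that 150/50 check. The marketing data are simulated here (the variable names mirror the example, but the coefficients and errors are made up), so the numbers are illustrative only:

```r
# Simulated stand-in for the 200-observation marketing data
set.seed(2023)
n <- 200
youtube   <- runif(n, 0, 300)
facebook  <- runif(n, 0, 60)
newspaper <- runif(n, 0, 100)
sales <- 8 + 0.05 * youtube + 0.2 * facebook + rnorm(n, sd = 2)
d <- data.frame(sales, youtube, facebook, newspaper)

train <- d[1:150, ]     # construct the model on 150 observations
test  <- d[151:200, ]   # hold out the last 50

fit <- lm(sales ~ youtube + facebook + newspaper, data = train)

in_sample  <- sqrt(mean(residuals(fit)^2))
out_sample <- sqrt(mean((test$sales - predict(fit, newdata = test))^2))

c(in_sample = in_sample, out_sample = out_sample)  # similar if the model is right
```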