ANOVA Explained
Y = β₀ + β₁X + ε
Ŷ = β₀ + β₁X
EXAMPLE: R

CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

plot(vix.log, sp.log, main='SP500 vs VIX',
     xlab='VIX', ylab='SP500', pch=1, col='blue')
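
A minimal sketch of fitting the corresponding simple regression in R, assuming sp.log and vix.log are the series plotted above:

# Fit the simple linear regression of SP500 log returns on VIX log changes
fit <- lm(sp.log ~ vix.log)

# Coefficient estimates, standard errors, t-statistics, and P-values
summary(fit)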
With n = 3976, there are n − 2 = 3974 degrees of freedom and t*(0.025, 3974) = 1.960.

95% confidence interval for the slope:
(−0.1199 − 1.96 × 0.001822, −0.1199 + 1.96 × 0.001822) ≈ (−0.1235, −0.1163)

Linear regression coefficients
t-statistic: −65.806806, P-value: 2·P(T > 65.8) ≈ 2.00e−16

[Figure: Predicted SP500 vs. VIX, with the fitted line "Linear (Predicted SP500)"]
[Figure: Residuals vs. VIX]
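
A sketch of the same interval in R, reusing the fit object created in the sketch above:

# 95% CI for the slope: estimate +/- t* x SE
est   <- -0.1199
se    <- 0.001822
tstar <- qt(0.975, df = 3974)   # = t*(0.025, 3974) = 1.960
c(est - tstar * se, est + tstar * se)

# Equivalently, directly from the fitted model
confint(fit, level = 0.95)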
Most of the time, the main challenge we encounter in multiple regression is collinearity.
COLLINEARITY
Adding a second predictor should only improve the model, so let's add the maximum Speed of the coaster to the model.

What we have seen here is a problem known as collinearity. Specifically, Drop and Speed are highly correlated with each other. As a result, the effect of Drop after allowing for the effect of Speed is negligible. Whenever you have several predictors, you must think about how the predictors are related to each other.

Multicollinearity? You may find this problem referred to as "multicollinearity." But there is no such thing as "unicollinearity" (we need at least two predictors for there to be a linear association between them), so there is no need for the extra two syllables.
When predictors are unrelated to each other, each provides new information to help account for more of the
variation in y. But when there are several predictors, the model will work best if they vary in different ways so
that the multiple regression has a stable base.
If you wanted to build a deck on the back of your house, you wouldn’t build it with supports placed just along
one diagonal. Instead, you’d want the supports spread out in different directions as much as possible to make
the deck stable. We’re in a similar situation with multiple regression.
When predictors are highly correlated, they line up together, which makes the regression they support balance
precariously.
What happened to the coefficient of Drop? Not only has it switched from positive to negative, but it
now has a small t-ratio and large P-value, so we can’t reject the null hypothesis that the coefficient is
actually zero after all.
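
A sketch of how this shows up in R; the coasters data frame and the variable names Duration, Drop, and Speed are assumptions for illustration:

# One predictor: Drop has a positive, significant coefficient
summary(lm(Duration ~ Drop, data = coasters))

# Adding the highly correlated Speed: Drop's coefficient can flip sign,
# its t-ratio shrinks, and its P-value grows, as described above
summary(lm(Duration ~ Drop + Speed, data = coasters))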
What should you do about a collinear regression model? The simplest cure is to remove some of the predictors. That simplifies the model and usually improves the t-statistics. And, if several predictors provide pretty much the same information, removing some of them won't hurt the model.

Which predictors should you remove? Keep those that are most reliably measured, those that are least expensive to find, or even the ones that are politically important.

This relationship is conditional because we've restricted our set to only those roller coasters with a certain drop. For roller coasters with a certain drop, an increase in Speed of 1 is associated with an increase of 2.70 in Duration. If that relationship is consistent for each drop, then the multiple regression coefficient will estimate it.
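
Before removing a predictor, it helps to confirm which ones carry the same information. A sketch, again assuming the hypothetical coasters data frame:

# Pairwise correlations among candidate predictors; a value near 1
# (as for Drop and Speed here) signals that they carry the same information
cor(coasters[, c("Drop", "Speed")])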
The F-statistic and associated P-value from the ANOVA table are used to answer
our question.
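
A sketch of obtaining that table in R for a fitted model fit (the object name is an assumption):

# Sequential ANOVA table: sums of squares, F-statistics,
# and P-values for each predictor in turn
anova(fit)

# The overall F-statistic and its P-value appear in the last
# line of the regression summary
summary(fit)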
The simple answer is that we can’t know whether we have the best possible
model.
The first and most important thing to realize is that often there is no such thing as the
“best” regression model. (After all, all models are wrong.)
1. Multiple regressions are subtle. The choice of which predictors to use determines
almost everything about the regression.
The best regression models have:
Relatively few predictors.
A relatively high R².
A relatively small s, the standard deviation of the residuals.
Relatively small P-values for their F- and t-statistics.
No cases with extraordinarily high leverage.
No cases with extraordinarily large residuals (see the R sketch after this list for both checks).
Predictors that are reliably measured and relatively unrelated to each other.
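
A sketch of the leverage and residual checks in R for a fitted model fit (the object name is an assumption):

# Built-in diagnostic plots: residuals vs. fitted values, normal Q-Q,
# scale-location, and residuals vs. leverage
plot(fit)

# Flag cases with extraordinarily high leverage
# (a common rule of thumb: more than twice the average leverage)
high_lev <- which(hatvalues(fit) > 2 * mean(hatvalues(fit)))

# Flag cases with extraordinarily large standardized residuals
big_resid <- which(abs(rstandard(fit)) > 3)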
Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the averaged squared difference between the actual value and the predicted value of the target variable. It gives the average prediction error made by the model, so a lower RMSE means a more accurate model.

Mean Absolute Error (MAE): This metric gives the absolute difference between the actual values and the values predicted by the model for the target variable. If outliers do not matter much for the accuracy of the model, MAE can be used to evaluate its performance. Lower values indicate better models.

R² Error: The value of the R-squared metric gives an idea of what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0 to 100%. So a better model should have a high value of R-squared.

During the process of partitioning the complete dataset into a training set and a validation set, there is a chance of losing some important data points for training. Since those data are not included in the training set, the model does not get the chance to detect some patterns, which can lead to overfitting or underfitting.

To avoid this, there are different types of cross-validation techniques that guarantee random sampling of the training and validation sets and maximize the accuracy of the model. One of the most popular cross-validation techniques is the Validation Set Approach.
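
A sketch of computing these three metrics on a held-out validation set in R, assuming vectors actual and predicted:

# actual: observed target values in the validation set
# predicted: the model's predictions for the same cases
rmse <- sqrt(mean((actual - predicted)^2))  # Root Mean Squared Error
mae  <- mean(abs(actual - predicted))       # Mean Absolute Error
r2   <- 1 - sum((actual - predicted)^2) /
            sum((actual - mean(actual))^2)  # R-squared
c(RMSE = rmse, MAE = mae, R2 = r2)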
CROSS-VALIDATION

The result of the multiple regression shows us that newspaper is not significant, but we can explain 89% of sales from this model. Can we trust this model going forward?

Take a set of 150 observations to construct the model. Leave out 50 observations to predict and see the error. If the model is correct, the "in-sample" error from the model will be roughly consistent with the "out-of-sample" error from the last 50 observations.
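
A sketch of this validation set approach in R; the advertising data frame (200 rows) and the column names sales, TV, radio, newspaper are assumptions matching the setup described above:

set.seed(1)                                  # reproducible split
train_idx <- sample(nrow(advertising), 150)  # 150 cases for fitting
train <- advertising[train_idx, ]
test  <- advertising[-train_idx, ]           # 50 held-out cases

fit <- lm(sales ~ TV + radio + newspaper, data = train)

# In-sample vs. out-of-sample prediction error; if the model is
# correct, the two should be roughly consistent
in_rmse  <- sqrt(mean(residuals(fit)^2))
out_rmse <- sqrt(mean((test$sales - predict(fit, newdata = test))^2))
c(in_sample = in_rmse, out_of_sample = out_rmse)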