
LINEAR REGRESSION MODEL


Response = b₀ + b₁ · (Explanatory Variable) + Error

Dependent Variable = b₀ + b₁ · (Independent Variable) + Error

Y = β₀ + β₁X + ε

Ŷ = b₀ + b₁X

INFERENCE FOR REGRESSION Chapter 11


REGRESSION INFERENCE AND INTUITION

For regression, the null hypothesis is so natural that it is rare to see any other considered. The natural null hypothesis is that the slope is zero, and the alternative is (almost) always two-sided.

DISTRIBUTION OF THE SLOPE

Less scatter around the regression model means the slope will be more consistent from sample to sample. The spread around the line is measured with the residual standard deviation, sₑ.


CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

EXAMPLE: R

# Scatterplot of S&P 500 log returns against VIX log changes
plot(vix.log, sp.log, main = 'SP500 vs VIX',
     xlab = 'VIX', ylab = 'SP500', pch = 1, col = 'blue')

# Fit the simple linear regression and add the fitted line to the scatterplot
res <- lm(sp.log ~ vix.log)
res$coefficients
abline(res, col = 'red')

# Diagnostic plots for the fitted model
plot(res)


CONFIDENCE INTERVALS FOR THE SLOPE

[Figure: "SP500 vs Volatility (VIX)" line fit plot with trendline y = -0.1199x + 0.0003 through the predicted values, and the corresponding VIX residual plot.]

The vix.log coefficient is -0.1199. With n = 3976 observations there are n - 2 = 3974 degrees of freedom, and t*(0.025, 3974) = 1.960.

The confidence interval for the slope is:

(-0.1199 - 1.96 × 0.001822, -0.1199 + 1.96 × 0.001822)

Linear regression coefficients:

slope       se         lower        upper
-0.1199     0.001822   -0.1234711   -0.1163289

t-test        P-value
-65.806806    2 × P(T ≤ -65.81) ≈ 2.00E-16
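This interval can be reproduced directly from the fitted model; a minimal R sketch, assuming the res object from the earlier example:

# 95% confidence interval for the slope of the fitted model 'res'
summary(res)$coefficients              # estimate, SE, t-statistic, P-value for each coefficient
confint(res, 'vix.log', level = 0.95)

# The same interval by hand: estimate +/- t* x SE, with n - 2 degrees of freedom
b1    <- coef(res)['vix.log']
se1   <- summary(res)$coefficients['vix.log', 'Std. Error']
tstar <- qt(0.975, df = df.residual(res))
c(lower = b1 - tstar * se1, upper = b1 + tstar * se1)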


INTERPRET REGRESSION MODEL

What Multiple Regression Coefficients Mean: For example, when we restrict our attention to men with waist sizes equal to 38 inches (points in blue), we can see a relationship between %body fat and height:

Pred %Body Fat = -3.10 + 1.77(Waist) - 0.60(Height)

MULTIPLE REGRESSION INFERENCE

The standard error, t-statistic, and P-values mean the same thing in multiple regression as they meant in simple regression. The t-ratios and corresponding P-values in each row of the table refer to their corresponding coefficients. The complication in multiple regression is that all of these values are interrelated: including any new predictor or changing any data value can change any or all of the other numbers in the table. And we can see from the increased R² that the added complication of an additional predictor was worthwhile in improving the fit of the regression model.
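A minimal R sketch of fitting such a model, assuming a hypothetical data frame bodyfat with columns pct_fat, waist, and height (the names are illustrative, not from the course data):

# Multiple regression of %body fat on waist size and height
fit <- lm(pct_fat ~ waist + height, data = bodyfat)
summary(fit)   # each coefficient row reports its estimate, SE, t-ratio, and P-value
coef(fit)      # compare with Pred %Body Fat = -3.10 + 1.77(Waist) - 0.60(Height)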


MULTIPLE REGRESSION CASES

Multiple regression can pick up subtle associations across slices of the population. For example, in the previous case it picks up the association of body fat with waist size for different heights.

COLLINEARITY

Consider data on roller coasters: the duration of the ride was found to depend on, among other things, the drop, that initial stomach-turning plunge down the high hill that powers the coaster through its run. Most of the time, the challenge we encounter in multiple regression is collinearity.


COLLINEARITY

Adding a second predictor should only improve the model, so let's add the maximum Speed of the coaster to the model.

What happened to the coefficient of Drop? Not only has it switched from positive to negative, but it now has a small t-ratio and large P-value, so we can't reject the null hypothesis that the coefficient is actually zero after all.

What we have seen here is a problem known as collinearity. Specifically, Drop and Speed are highly correlated with each other. As a result, the effect of Drop after allowing for the effect of Speed is negligible. Whenever you have several predictors, you must think about how the predictors are related to one another.

Multicollinearity? You may find this problem referred to as "multicollinearity." But there is no such thing as "unicollinearity" (we need at least two predictors for there to be a linear association between them), so there is no need for the extra two syllables.

When predictors are unrelated to each other, each provides new information to help account for more of the variation in y. But when there are several predictors, the model will work best if they vary in different ways so that the multiple regression has a stable base. If you wanted to build a deck on the back of your house, you wouldn't build it with supports placed just along one diagonal. Instead, you'd want the supports spread out in different directions as much as possible to make the deck stable. We're in a similar situation with multiple regression. When predictors are highly correlated, they line up together, which makes the regression they support balance precariously.
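A small R sketch of how this collinearity could be seen in practice, assuming a hypothetical data frame coasters with columns duration, drop, and speed:

# Highly correlated predictors
cor(coasters$drop, coasters$speed)

fit1 <- lm(duration ~ drop, data = coasters)
fit2 <- lm(duration ~ drop + speed, data = coasters)
summary(fit1)   # Drop alone: a clearly positive, significant coefficient
summary(fit2)   # after adding Speed, Drop's coefficient can change sign and lose significance

# Variance inflation factors quantify the collinearity (requires the car package)
# car::vif(fit2)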


What should you do about a collinear regression model? The simplest cure is to remove some of the predictors. That simplifies the model and usually improves the t-statistics. And, if several predictors provide pretty much the same information, removing some of them won't hurt the model. Which predictors should you remove? Keep those that are most reliably measured, those that are least expensive to find, or even the ones that are politically important.

WHAT MULTIPLE REGRESSION COEFFICIENTS MEAN

This relationship is conditional because we've restricted our set to only those roller coasters with a certain drop.
 For roller coasters with a certain drop, an increase in Speed of 1 is associated with an increase of 2.70 in Duration.
 If that relationship is consistent for each drop, then the multiple regression coefficient will estimate it.


ASSUMPTIONS AND CONDITIONS

Linearity Assumption:
 Straight Enough Condition: Check the scatterplot for each candidate predictor variable; the shape must not be obviously curved or we can't consider that predictor in our multiple regression model.

Independence Assumption:
 Randomization Condition: The data should arise from a random sample. Also, check the residuals plot; the residuals should appear to be randomly scattered.

Equal Variance Assumption:
 Does the Plot Thicken? Condition: Check the residuals plot; the spread of the residuals should be uniform.

Normality Assumption:
 Nearly Normal Condition: Check a histogram of the residuals; the distribution of the residuals should be unimodal and symmetric, and the Normal probability plot should be straight.

Summary of the checks of conditions, in order:
1. Check the Straight Enough Condition with scatterplots of the y-variable against each x-variable.
2. If the scatterplots are straight enough, fit a multiple regression model to the data.
3. Find the residuals and predicted values.
4. Make and check a scatterplot of the residuals against the predicted values. This plot should look patternless.
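These checks translate directly into a few diagnostic plots; a minimal R sketch, assuming a fitted model object fit:

# Residuals vs. predicted values: should look patternless, with uniform spread
plot(fitted(fit), resid(fit), xlab = 'Predicted values', ylab = 'Residuals')
abline(h = 0, lty = 2)

# Nearly Normal Condition: histogram should be unimodal and symmetric,
# and the Normal probability plot should be straight
hist(resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))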


FEATURE SELECTION

Adding more variables isn't always helpful: the model may overfit and become too complicated. An over-fitted model doesn't generalize to new data; it only works on the training data.
All the variables/columns in the dataset may not be independent. This condition is called multicollinearity, where there is an association between predictor variables.
We have to select the appropriate variables to build the best model. This process of selecting variables is called feature selection.

THE ANOVA TABLE


MULTIPLE REGRESSION INFERENCE: I THOUGHT I SAW AN ANOVA TABLE...

Now that we have more than one predictor, there's an overall test we should consider before we do more inference on the coefficients.
 We ask the global question "Is this multiple regression model any good at all?"
 We test H₀: β₁ = β₂ = ... = βₖ = 0 against the alternative that at least one of the slope coefficients is not zero.
 The F-statistic and associated P-value from the ANOVA table are used to answer our question.
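In R, the overall F-test is reported at the bottom of the regression summary; a brief sketch, reusing the hypothetical coasters data from above:

fit <- lm(duration ~ drop + speed, data = coasters)
summary(fit)   # the last line reports the F-statistic, its degrees of freedom, and its P-value
anova(fit)     # sequential ANOVA table for the fitted model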


COMPARING MULTIPLE REGRESSION MODELS

How do we know that some other choice of predictors might not provide a better model? What exactly would make an alternative model better? These questions are not easy; there's no simple measure of the success of a multiple regression model.
Regression models should make sense.
 Predictors that are easy to understand are usually better choices than obscure variables.
 Similarly, if there is a known mechanism by which a predictor has an effect on the response variable, that predictor is usually a good choice for the regression model.
 The simple answer is that we can't know whether we have the best possible model.

COEFFICIENT OF MULTIPLE DETERMINATION

Reports the proportion of total variation in Y explained by all X variables taken together.


MULTIPLE REGRESSION: ADJUSTED R²

There is another statistic in the full regression table called the adjusted R².
 This statistic is a rough attempt to adjust for the simple fact that when we add another predictor to a multiple regression, the R² can't go down and will most likely get larger.
 This fact makes it difficult to compare alternative regression models that have different numbers of predictors.
Adjusted R² shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used. It penalizes excessive use of independent variables.
 Smaller than R²
 Useful in comparing among models
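For reference, the usual adjustment (a standard formula, not shown on the slide) is Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. Both quantities can be read off a fitted model in R, assuming a hypothetical lm object fit:

summary(fit)$r.squared       # R-squared
summary(fit)$adj.r.squared   # adjusted R-squared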


THE BEST MULTIPLE REGRESSION MODEL

The first and most important thing to realize is that often there is no such thing as the
“best” regression model. (After all, all models are wrong.)
Multiple regressions are subtle. The choice of which predictors to use determines almost everything about the regression.
The best regression models have:
 Relatively few predictors.
 A relatively high R2.
 A relatively small s, the standard deviation of the residuals.
 Relatively small P-values for their F- and t-statistics.
 No cases with extraordinarily high leverage.
 No cases with extraordinarily large residuals.
 Predictors that are reliably measured and relatively unrelated to each other.


BUILDING REGRESSION MODELS SEQUENTIALLY

You can build a regression model by adding variables to a growing regression. Each time you add a predictor, you hope to account for a little more of the variation in the response. What's left over is the residuals. At each step, consider the predictors still available to you. Those that are most highly correlated with the current residuals are the ones that are most likely to improve the model (see the short R sketch after this section). If you see a variable with a high correlation at this stage and it is not among those that you thought were important, stop and think about it. Is it correlated with another predictor or with several other predictors?

At each step make a plot of the residuals to check for outliers, and check the leverages (say, with a histogram of the leverage values) to be sure there are no high-leverage points. Influential cases can strongly affect which variables appear to be good or poor predictors in the model. It's also a good idea to check that a predictor doesn't appear to be unimportant in the model only because it's correlated with other predictors in the model.

MODEL SELECTION: CROSS-VALIDATION

The major challenge in designing a model is to make it work accurately on unseen data. To know whether the designed model is working well, we have to test it against data points that were not present during the training of the model. These data points serve as unseen data for the model, and they make it easy to evaluate the model's accuracy.

One of the finest techniques for checking the effectiveness of a model is cross-validation, which can easily be implemented using the R programming language. In this approach, a portion of the data set is reserved and not used in training the model. Once the model is ready, that reserved data set is used for testing: values of the dependent variable are predicted during the testing phase, and the model's accuracy is calculated on the basis of the prediction error, i.e., the difference between the actual and predicted values of the dependent variable.

There are several statistical metrics that are used for evaluating the accuracy of a regression model.
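Picking up the sequential-building idea from the slide above, a minimal R sketch, assuming the hypothetical marketing data frame used later in these notes (sales vs. youtube, facebook, newspaper):

# Start with one predictor and look for the candidate most correlated with the residuals
fit <- lm(sales ~ youtube, data = marketing)
cor(resid(fit), marketing[, c('facebook', 'newspaper')])   # highest correlation = most promising addition

# Check leverages at each step, e.g. with a histogram of the leverage (hat) values
hist(hatvalues(fit))

# R's step() automates a related search using AIC rather than residual correlations
# step(lm(sales ~ 1, data = marketing),
#      scope = ~ youtube + facebook + newspaper, direction = 'forward')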


STATISTICAL METRICS

Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the averaged squared difference between the actual and predicted values of the target variable. It gives the average prediction error made by the model, so decreasing the RMSE increases the accuracy of the model.

Mean Absolute Error (MAE): This metric gives the absolute difference between the actual values and the values predicted by the model for the target variable. If outliers do not have much to do with the accuracy of the model, MAE can be used to evaluate its performance. Its value should be small in order to make better models.

R² Error: The R-squared metric gives an idea of what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0-100%. So a better model should have a high value of R-squared. (A short R sketch of RMSE and MAE follows the next slide.)

TYPES OF CROSS-VALIDATION

During the process of partitioning the complete dataset into a training set and a validation set, there is a chance of losing some important and crucial data points for training purposes. Since those data are not included in the training set, the model does not get the chance to detect some patterns. This situation can lead to overfitting or underfitting of the model.

To avoid this, there are different types of cross-validation techniques that guarantee random sampling of the training and validation data sets and maximize the accuracy of the model. One of the most popular cross-validation techniques is the Validation Set Approach.
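A minimal R sketch of RMSE and MAE, using purely illustrative numbers (not from the course data):

actual    <- c(10.2, 12.5, 9.8, 15.0, 11.1)   # hypothetical observed values
predicted <- c(10.8, 12.0, 10.1, 14.2, 11.5)  # hypothetical model predictions
rmse <- sqrt(mean((actual - predicted)^2))
mae  <- mean(abs(actual - predicted))
c(RMSE = rmse, MAE = mae)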


VALIDATION SET APPROACH

In this method, the dataset is divided randomly into training and testing sets. The following steps are performed to implement this technique:
1. Take a random sample of the dataset.
2. Train the model on the training data set.
3. Apply the resultant model to the testing data set.
4. Calculate the prediction error using model performance metrics.

EXAMPLE

200 observations of sales vs. marketing spend on YouTube, Facebook, and newspaper. We want a model to predict sales from marketing and to decide where to spend the money.
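A sketch of fitting the full model on all 200 observations, assuming a data frame named marketing with columns youtube, facebook, newspaper, and sales (for example, the marketing data set shipped with the datarium package):

fit_all <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(fit_all)   # per the slides, newspaper is not significant and R-squared is about 0.89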


The result of the multiple regression shows us that newspaper is not significant, but we can explain 89% of sales with this model. Can we trust this model going forward? Let's do some cross-validation.

CROSS-VALIDATION

Take a set of 150 observations to construct the model. Leave out 50 observations to predict and see the error. If the model is correct, the "in-sample" error from the model will be roughly consistent with the "out-of-sample" error from the last 50 observations.
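A minimal R sketch of this 150/50 validation split, continuing with the hypothetical marketing data frame:

set.seed(123)                                   # for a reproducible split
train_idx <- sample(nrow(marketing), 150)
train <- marketing[train_idx, ]                 # 150 observations to fit the model
test  <- marketing[-train_idx, ]                # 50 held-out observations

fit <- lm(sales ~ youtube + facebook + newspaper, data = train)

pred <- predict(fit, newdata = test)
rmse_in  <- sqrt(mean(resid(fit)^2))            # in-sample error
rmse_out <- sqrt(mean((test$sales - pred)^2))   # out-of-sample error; should be roughly comparable
c(in_sample = rmse_in, out_of_sample = rmse_out)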


 Regression where I just use the 150 observations.
 The results are consistent with using the full set of 200 observations.
 Use the model to predict the last 50 and check the error.
 Out-of-sample error is consistent with in-sample error.
