Chapter 11: Regression

The document discusses regression analysis, focusing on linear regression models, inference, and the significance of coefficients. It highlights the importance of understanding collinearity, feature selection, and the assumptions necessary for effective multiple regression. Additionally, it emphasizes model evaluation through statistical metrics and cross-validation techniques to ensure accuracy and reliability of predictions.


Chapter 11: INFERENCE FOR REGRESSION

LINEAR REGRESSION MODEL

REGRESSION INFERENCE AND INTUITION
For regression, the null hypothesis is so natural that it is rare to
see any other considered.
The natural null hypothesis is that the slope is zero and the
alternative is (almost) always two-sided.
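In the usual notation (a standard formulation, with beta_1 the population slope and b_1 its estimate from the sample):

$$ H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_A: \beta_1 \neq 0, \qquad t = \frac{b_1 - 0}{SE(b_1)}, \quad df = n - 2 $$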
DISTRIBUTION OF THE SLOPE

Less scatter around the regression model means the slope will be more consistent from sample to sample. The spread around the line is measured with the residual standard deviation, s_e.
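Written out, the standard simple-regression formulas (with e_i the residuals, s_x the standard deviation of the x-values, and n the number of cases) are:

$$ s_e = \sqrt{\frac{\sum e_i^2}{n-2}}, \qquad SE(b_1) = \frac{s_e}{s_x\,\sqrt{n-1}} $$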
CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

EXAMPLE: R
## Scatterplot of sp.log against vix.log
plot(vix.log, sp.log, main = 'SP500 vs VIX',
     xlab = 'VIX', ylab = 'SP500', pch = 1,
     col = 'blue')

## Fit the regression and add the fitted line to the scatterplot
res <- lm(sp.log ~ vix.log)
res$coefficients
abline(res, col = 'red')

## Standard diagnostic plots for the fitted model
plot(res)
CONFIDENCE INTERVALS FOR THE SLOPE

The vix.log coefficient is -0.1199 with standard error 0.001822.
With n = 3976, there are n - 2 = 3974 degrees of freedom and t*(0.025, 3974) = 1.960.
The confidence interval for the slope is:
(-0.1199 - 1.96*0.001822, -0.1199 + 1.96*0.001822) = (-0.1235, -0.1163)
Linear regression coefficients

            slope        se        lower        upper
vix.log   -0.1199   0.001822   -0.1234711   -0.1163289

t-statistic        P-value
-65.806806         2*P(T > 65.8) = 2.00E-16
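The slope, its standard error, and this interval can also be pulled straight from the fitted model in R; a minimal sketch, assuming the res object fitted above:

## Coefficient table: estimates, standard errors, t-statistics, P-values
coef(summary(res))

## 95% confidence interval for the intercept and the slope
confint(res, level = 0.95)

## The slope interval by hand: estimate +/- t* x SE
est   <- coef(summary(res))["vix.log", "Estimate"]
se    <- coef(summary(res))["vix.log", "Std. Error"]
tstar <- qt(0.975, df = df.residual(res))
c(est - tstar * se, est + tstar * se)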
[Figure: SP500 vs Volatility (VIX) line fit plot - observed and predicted SP500 plotted against VIX, with fitted line f(x) = -0.1199 x + 0.0003.]

[Figure: VIX residual plot - residuals plotted against VIX.]
INTERPRET REGRESSION MODEL

MULTIPLE REGRESSION INFERENCE
The standard error, t-statistic, and P-values
mean the same thing in the multiple
regression as they meant in a simple
regression.
The t-ratios and corresponding P-values in
each row of the table refer to their
corresponding coefficients. The
complication in multiple regression is that
all of these values are interrelated.
What Multiple Regression Coefficients Mean

Including any new predictor or changing any data value can change any or all of the other numbers in the table. And we can see from the increased R2 that the added complication of an additional predictor was worthwhile in improving the fit of the regression model.

For example, when we restrict our attention to men with waist sizes equal to 38 inches (points in blue), we can see a relationship between %body fat and height:

Pred %Body Fat = -3.10 + 1.77(Waist) - 0.60(Height)
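A minimal R sketch of a fit like this one; the data frame and column names (bodyfat, PctBF, Waist, Height) are placeholders, not the names used in the original data:

## Multiple regression of %body fat on waist size and height
fit <- lm(PctBF ~ Waist + Height, data = bodyfat)
summary(fit)   ## coefficients, standard errors, t-ratios, P-values, R2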
MULTIPLE REGRESSION CASES

Multiple regression can pick up subtle associations across slices of the population.
For example, in the previous case it picks up the association of body fat with waist size for different heights.

Most of the time, the challenge we encounter in multiple regression is collinearity.
COLLINEARITY
Consider data on roller coasters: the duration of the ride was found to depend on, among other things, the drop (that initial stomach-turning plunge down the high hill that powers the coaster through its run).
COLLINEARITY
Adding a second predictor should only improve the model, so let’s
add the maximum Speed of the coaster to the model:

What happened to the coefficient of Drop? Not only has it switched from positive
to negative, but it now has a small t-ratio and large P-value, so we can’t reject
the null hypothesis that the coefficient is actually zero after all.
What we have seen here is a problem known as collinearity. Specifically, Drop and Speed are highly correlated with each other. As a result, the effect of Drop after allowing for the effect of Speed is negligible. Whenever you have several predictors, you must think about how the predictors are related to each other.
Multicollinearity? You may find this problem referred to as “multicollinearity.” But there is no
such thing as “unicollinearity”—we need at least two predictors for there to be a linear
association between them—so there is no need for the extra two syllables.
When predictors are unrelated to each other, each provides new information to help account
for more of the variation in y. But when there are several predictors, the model will work best if
they vary in different ways so that the multiple regression has a stable base.
If you wanted to build a deck on the back of your house, you wouldn’t build it with supports
placed just along one diagonal. Instead, you’d want the supports spread out in different
directions as much as possible to make the deck stable. We’re in a similar situation with
multiple regression.
When predictors are highly correlated, they line up together, which makes the regression they
support balance precariously.
What should you do about a collinear regression model?
The simplest cure is to remove some of the predictors. That
simplifies the model and usually improves the t-statistics. And, if
several predictors provide pretty much the same information,
removing some of them won’t hurt the model.
Which predictors should you remove? Keep those that are most
reliably measured, those that are least expensive to find, or even
the ones that are politically important.
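One way to see the problem numerically is to look at the correlation between the predictors and at variance inflation factors; a sketch, assuming a data frame coasters with columns Duration, Drop and Speed (hypothetical names) and the car package installed:

## Correlation between the two predictors
cor(coasters$Drop, coasters$Speed)

## Variance inflation factors for the two-predictor model
library(car)
fit2 <- lm(Duration ~ Drop + Speed, data = coasters)
vif(fit2)   ## values well above 5 or 10 are a common warning sign of collinearity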
WHAT MULTIPLE REGRESSION COEFFICIENTS MEAN

- This relationship is conditional because we've restricted our set to only those roller coasters with a certain drop.
- For roller coasters with a given drop, an increase in Speed of 1 unit is associated with an increase of about 2.70 in Duration.
- If that relationship is consistent for each drop, then the multiple regression coefficient will estimate it.
ASSUMPTIONS AND CONDITIONS

Linearity Assumption:
- Straight Enough Condition: Check the scatterplot for each candidate predictor variable - the shape must not be obviously curved or we can't consider that predictor in our multiple regression model.

Independence Assumption:
- Randomization Condition: The data should arise from a random sample. Also, check the residuals plot - the residuals should appear to be randomly scattered.
ASSUMPTIONS AND CONDITIONS

Equal Variance Assumption:
- Does the Plot Thicken? Condition: Check the residuals plot - the spread of the residuals should be uniform.

Normality Assumption:
- Nearly Normal Condition: Check a histogram of the residuals - the distribution of the residuals should be unimodal and symmetric, and the Normal probability plot should be straight.

Summary of the checks of conditions, in order (see the R sketch after the list):
1. Check the Straight Enough Condition with scatterplots of the y-variable against each x-variable.
2. If the scatterplots are straight enough, fit a multiple regression model to the data.
3. Find the residuals and predicted values.
4. Make and check a scatterplot of the residuals against the predicted values. This plot should look patternless.
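These checks translate into a few standard R plots; a minimal sketch, assuming a fitted lm object called fit:

## Residuals vs. predicted values - should look patternless (Does the Plot Thicken?)
plot(fitted(fit), resid(fit), xlab = 'Predicted values', ylab = 'Residuals')
abline(h = 0, col = 'red')

## Nearly Normal Condition: histogram and Normal probability plot of the residuals
hist(resid(fit))
qqnorm(resid(fit)); qqline(resid(fit))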
FEATURE SELECTION
Adding more variables isn't always helpful: the model may overfit and become too complicated. An overfit model doesn't generalize to new data; it only works well on the data it was trained on.
All the variables/columns in the dataset may not be independent. This condition is called multicollinearity, where there is an association between predictor variables.
We have to select the appropriate variables to build the best model. This process of selecting variables is called feature selection.
THE ANOVA TABLE

MULTIPLE REGRESSION INFERENCE: I THOUGHT I SAW AN ANOVA TABLE...

- Now that we have more than one predictor, there's an overall test we should consider before we do more inference on the coefficients.
- We ask the global question "Is this multiple regression model any good at all?"
- We test the null hypothesis that all the slope coefficients are zero against the alternative that at least one of them is not.
- The F-statistic and associated P-value from the ANOVA table are used to answer our question (the formula is given below).
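In the usual notation (with k predictors, SSR the regression sum of squares and SSE the error sum of squares), the statistic reported in the ANOVA table is:

$$ F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)} $$

which is compared to an F distribution with k and n - k - 1 degrees of freedom.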
COMPARING MULTIPLE REGRESSION MODELS
How do we know that some other choice of predictors might not provide
a better model?
What exactly would make an alternative model better?
These questions are not easy—there’s no simple measure of the success
of a multiple regression model.
Regression models should make sense.
- Predictors that are easy to understand are usually better choices than obscure variables.
- Similarly, if there is a known mechanism by which a predictor has an effect on the response variable, that predictor is usually a good choice for the regression model.
The simple answer is that we can't know whether we have the best possible model.
COEFFICIENT OF MULTIPLE DETERMINATION

Reports the proportion of total variation in Y explained by all X variables taken together.
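In symbols (with SSR the regression sum of squares, SSE the error sum of squares and SST the total sum of squares):

$$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$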
MULTIPLE REGRESSION: ADJUSTED R2

- There is another statistic in the full regression table called the adjusted R2 (formula below).
- This statistic is a rough attempt to adjust for the simple fact that when we add another predictor to a multiple regression, the R2 can't go down and will most likely get larger.
- This fact makes it difficult to compare alternative regression models that have different numbers of predictors.
- Adjusted R2 shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used.
- It penalizes excessive use of independent variables.
- It is smaller than R2.
- It is useful for comparing among models.
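The usual formula (with n cases and k predictors) is:

$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} $$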
THE BEST MULTIPLE REGRESSION MODEL

The first and most important thing to realize is that often there is no such thing as the "best" regression model. (After all, all models are wrong.)
Multiple regressions are subtle. The choice of which predictors to use determines almost everything about the regression.
The best regression models have:
- Relatively few predictors.
- A relatively high R2.
- A relatively small s, the standard deviation of the residuals.
- Relatively small P-values for their F- and t-statistics.
- No cases with extraordinarily high leverage.
- No cases with extraordinarily large residuals.
- Predictors that are reliably measured and relatively unrelated to each other.
BUILDING REGRESSION MODELS SEQUENTIALLY
You can build a regression model by adding variables to a growing regression.
Each time you add a predictor, you hope to account for a little more of the
variation in the response. What’s left over is the residuals. At each step,
consider the predictors still available to you. Those that are most highly
correlated with the current residuals are the ones that are most likely to
improve the model. If you see a variable with a high correlation at this stage
and it is not among those that you thought were important, stop and think
about it. Is it correlated with another predictor or with several other
predictors?
At each step, make a plot of the residuals to check for outliers, and check the
leverages (say, with a histogram of the leverage values) to be sure there are
no high-leverage points. Influential cases can strongly affect which variables
appear to be good or poor predictors in the model. It’s also a good idea to
check that a predictor doesn’t appear to be unimportant in the model only
because it’s correlated with other predictors in the model.
MODEL SELECTION: CROSS-VALIDATION
The major challenge in designing a model is to make it work accurately on unseen data.

To know whether the designed model is working well, we have to test it against data points that were not present during the training of the model. These data points serve as unseen data for the model and make it easy to evaluate the model's accuracy.

One of the best techniques for checking the effectiveness of a model is cross-validation, which can easily be implemented using the R programming language. In this approach, a portion of the data set is reserved and not used in training the model.
Once the model is ready, that reserved data set is used for testing. Values of the dependent variable are predicted during the testing phase, and the model's accuracy is calculated from the prediction error, i.e., the difference between the actual and predicted values of the dependent variable.
There are several statistical metrics used for evaluating the accuracy of regression models.
STATISTICAL METRICS
Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the average squared difference between the actual and predicted values of the target variable. It gives the typical size of the prediction error made by the model, so a lower RMSE means a more accurate model.

Mean Absolute Error (MAE): This metric gives the average absolute difference between the actual values and the values predicted by the model for the target variable. Because it is less sensitive to outliers than RMSE, MAE is a good choice when outliers should not dominate the assessment of model accuracy. Lower values indicate better models.

R2: The R-squared metric gives an idea of how much of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0 to 100%. A better model should have a high value of R-squared.
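These metrics are easy to compute from a vector of predictions; a sketch, assuming numeric vectors actual and predicted of equal length:

## Root Mean Squared Error, Mean Absolute Error and R-squared
rmse <- sqrt(mean((actual - predicted)^2))
mae  <- mean(abs(actual - predicted))
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
c(RMSE = rmse, MAE = mae, R2 = r2)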
TYPES OF CROSS-VALIDATION

When partitioning the complete dataset into a training set and a validation set, there is a chance of losing some important data points from the training set. Since those data points are not included in training, the model never gets the chance to detect some patterns, and this can lead to overfitting or underfitting of the model.
To avoid this, there are different types of cross-validation techniques that randomly sample the training and validation sets and help maximize the accuracy of the model.
One of the most popular cross-validation techniques is the validation set approach.
VALIDATION SET APPROACH
In this method, the dataset is divided randomly into training and testing sets. The following steps are performed to implement this technique:
1. Randomly split the dataset into a training set and a testing set.
2. Train the model on the training set.
3. Apply the resulting model to the testing set.
4. Calculate the prediction error using model performance metrics.
EXAMPLE

- 200 observations of sales vs. marketing spend on youtube, facebook and newspaper.
- We want a model to predict sales from marketing spend and decide where to spend the money.
- The result of the multiple regression (sketched in R below) shows that newspaper is not significant, but we can explain 89% of sales with this model.
- Can we trust this model going forward?
- Let's do some cross-validation.
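A minimal R sketch of the fit described above; it assumes a 200-row data frame called marketing with columns youtube, facebook, newspaper and sales (for example, the marketing data set shipped with the datarium package):

## Multiple regression of sales on the three marketing channels
full_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(full_model)   ## the slides report newspaper not significant and R2 of about 0.89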
CROSS-VALIDATION

Take a set of 150 observations to construct the model.
Leave out 50 observations to predict and see the error.
If the model is correct, the "in-sample" error from the model will be roughly consistent with the "out-of-sample" error from the last 50 observations.
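A sketch of that check in R, continuing with the same (assumed) marketing data frame:

## Fit on the first 150 observations, hold out the last 50
train <- marketing[1:150, ]
test  <- marketing[151:200, ]
cv_model <- lm(sales ~ youtube + facebook + newspaper, data = train)

## Compare in-sample and out-of-sample prediction error (RMSE)
rmse_in  <- sqrt(mean(resid(cv_model)^2))
rmse_out <- sqrt(mean((test$sales - predict(cv_model, newdata = test))^2))
c(in_sample = rmse_in, out_of_sample = rmse_out)   ## should be roughly comparable if the model holds up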
- Regression where I just use the 150 observations.
- The results are consistent with using the full set of 200 observations.
- Use the model to predict the last 50 and check the error.
- The out-of-sample error is consistent with the in-sample error.
