
Data Science for Engineers

Prof. Shankar Narasimhan, Prof. Ragunathan Rengasamy


Department of Computer Science and Engineering
Indian Institute of Technology, Madras

Lecture – 40
Multiple Linear Regression Model Building and Selection

Welcome to the lecture on the implementation of multiple linear regression. Let us first summarize the previous lecture.

(Refer Slide Time: 00:23)

We looked at the steps in building a simple linear regression model, where we saw how to regress a dependent variable on an independent variable. As a part of this, we also looked at how to assess the model that we built: how to interpret the model summary and identify the significant variables, how to do residual analysis, and how to check whether the model needs refinement, after which we built a refined model.
(Refer Slide Time: 00:57)

In this lecture, we are going to extend all of this to multiple independent variables, which is called multiple linear regression. Here we are going to build a linear model with one dependent variable and multiple independent variables. We are also going to look at the model summary, identify the insignificant variables, discard them, and rebuild the model. Finally, we will look at how to identify the subset of variables to build the model with; this is called model selection.

(Refer Slide Time: 01:20)


So, let us start by loading the data. The dataset ‘nyc’ is given to you in a “csv” format, and to load the dataset we are going to use the function read.csv.

(Refer Slide Time: 01:30)

The inputs for the function read.csv are similar to what we saw in the previous lecture for read.delim. read.csv reads a file in table format and creates a data frame from it. The syntax is read.csv, and the inputs to the function are file and row.names. Here, file is the name of the file from which you want to read the data, and row.names is a vector giving the actual row names; it could also be a single number.
(Refer Slide Time: 02:00)

So, let us see how to load the data. Assuming ‘nyc.csv’ is in your current working directory, the command is read.csv followed by the name of the file in double quotes. Once this command is executed, it will create an object nyc, which is a data frame.
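As a minimal sketch of this step (assuming ‘nyc.csv’ is indeed in the current working directory):

    # read the csv file into a data frame named nyc
    nyc <- read.csv("nyc.csv")

Now, let us see how to view the data.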

(Refer Slide Time: 02:23)


Now, View(nyc) will display the data frame in a tabular format. There is a small snippet below which shows how the output looks: we have price, food, decor, service and east as the 5 variables. Suppose your data is really huge and you do not want to view the entire data; then we can use the head or tail function. head will give you the first 6 rows of a data frame and tail will give you the last 6 rows.
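A quick sketch of these viewing commands, assuming the data frame nyc loaded above:

    View(nyc)   # display the full data frame in a tabular viewer
    head(nyc)   # first 6 rows
    tail(nyc)   # last 6 rows

So, now, let us look at the description of the dataset. We have already loaded and viewed it, but we do not yet know what the description is.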

(Refer Slide Time: 02:57)

So, the data is about menu pricing in restaurants of New York City. y, which is my dependent variable, is the price of dinner, and there are 4 other independent variables. food is the customer rating of the food, decor is the customer rating of the decor, service is the customer rating of the service, and east indicates whether the restaurant is located on the east or west side of the city.

Now, our objective is to build a linear model with y, which is price, and all the other 4 independent variables. Before we go on to building a model, let us see if our data exhibits some interdependency between the variables. To do that, I am going to use a “pairwise scatter plot”, using the same function plot which we have used earlier.

Now, since I have multiple variables, I am going to give the data frame as my input, and I am giving “Pairwise scatter plot” as the heading.
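A minimal sketch of this call; when plot is given a data frame with several columns, R draws the scatter plot matrix:

    # scatter plot of every pair of variables in the data frame
    plot(nyc, main = "Pairwise scatter plot")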
(Refer Slide Time: 04:06)

On my right is the output you will get. We can see that all the variables are mentioned along the diagonal. As one moves from left to right, the variable on the left is on the y axis and the variable above or below is on the x axis. Let us take the first row for instance: I have price on the left, so price is on the y axis, and I have food below, so food becomes the x axis. This is the plot for price versus food; similarly, I have price versus decor, price versus service and price versus east. Moving to the next row, food is on the y axis. If you take food versus decor, the data is randomly scattered, so it does not show any correlation pattern, whereas for food versus service you see strong patterns being exhibited. So, let us see what the correlation is for all of these. Correlation is a function, and Professor Shankar has told you how it is computed.
cor is the function in R, and I need to give it the dataset with all the variables. round tells you to how many decimal places you want to round off the number: if I give round, pass my correlation function as its input, and say 3, it means round the number off to 3 decimal places.
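A sketch of the call described above, assuming the data frame nyc:

    # pairwise correlations of all variables, rounded to 3 decimals
    round(cor(nyc), 3)

So, let us see how to interpret the output.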
The correlation of price with price will always be 1. Now, let us look at food and decor: the correlation between food and decor is 0.5, which is pretty low, whereas for food and service it is almost equal to 0.8, which is quite high. So, we can see that food and service are correlated, and one of them can be dropped while building the final model. As we go along, let us see which of the two we have to drop.
(Refer Slide Time: 06:25)

Now, let us go on to model building.

(Refer Slide Time: 06:28)

So, like I said earlier, my dependent variable is only one here, which is denoted by y. I have several independent variables, which are denoted by xᵢ, where i ranges from 1 to p and p is the total number of independent variables. Now let us see how to write this equation with multiple independent variables. Again I have ŷ, which is the predicted value, and β̂₀, which is the intercept, followed by the slope terms:

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + … + β̂ₚxₚ

So, β̂₀ is the intercept and β̂₁, β̂₂, etc. are the slopes.

So, ε is the error. If you recall from your earlier lectures on OLS, the assumption is that error is present only in the measurement of the dependent variable and not in the independent variables. So, the independent variables are free of errors, whereas there is always some error present in the measurement of y. This ε is an unknown quantity which has 0 mean and some variance. For any i-th observation, the equation is written as

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₚxᵢₚ + εᵢ

(Refer Slide Time: 07:42)

So, now, let us go and build a model. The function to build a multiple linear model is the same as what we used in the univariate case: here also I am going to use lm. Again, the syntax is lm, and there are 2 input parameters, formula and data. The formula is slightly different compared to the univariate case: I write my dependent variable, then a tilde sign, and however many independent variables I have, I separate them with a + sign. Say, for instance, I have 2 independent variables in my data; since I am regressing the dependent variable on 2 independent variables, those 2 independent variables have to be separated by a + sign.

Now, let us see how to do it for our data nyc. Again I have lm, and I am regressing price on all the 4 input variables, which are food, decor, service and east, taking these variables from the data nyc. So, you can separate the independent variables by a + sign. If you want to regress price on all the 4 inputs, there is another way you can write the same command: I say regress price, then I give a tilde sign, and then I say a dot. This means regress price on all the input variables from the data nyc. So, if you are going to give all the input variables for regression, then you can go with this, but if you have a subset of variables that you want to build a model with, then you specify the variables separated by a + sign. Just to reiterate, this is the form of my equation. So, now, let us see how to interpret the summary; after having built this model, I am going to look at its summary.
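As a minimal sketch of both forms of the call (the object name nycmod_1 is an assumption for illustration; the lecture names its next model nycmod_2):

    # regress price on the four named predictors
    nycmod_1 <- lm(price ~ food + decor + service + east, data = nyc)

    # equivalent shorthand: regress price on all remaining variables
    nycmod_1 <- lm(price ~ ., data = nyc)

    # model summary with coefficients, p values, R squared and F statistic
    summary(nycmod_1)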

(Refer Slide Time: 09:30)

This snippet gives you a gist of the summary. If you recall, in the first lecture on simple linear regression we looked at what each of these lines means in depth. We have the formula in the first line, then the residuals and their 5 point summary, and then we look at the coefficients. Here we have the intercept, food, decor, service and east, and these are the coefficients for these variables.

For each of these coefficients, I am given an estimate, a standard error associated with it, a t value which is the ratio of the estimate to the standard error, and a probability value. If you look at the p value for the intercept, it is very low compared to our significance level, which is 0.05. This tells you that the intercept is one of the important terms that have to be included in the model. The same goes for the p values of food and decor: they are also less than 0.05, so we need to retain them, and the stars tell you how significantly different they are from 0.

Whereas, if you look at the p value of service, it is 0.9945, which is really high compared to our significance level. This tells you that this term, service, is not important, and if you look at the estimated value it is very close to 0. So, service is not an important term in explaining the price.

Now, if you look at the p value for east, though it is not as low as those of food and decor, it is still acceptable, and the single significance star tells you that if I had a significance level of, say, 0.025 or 0.01, then I could reject this term, but until then I can keep it.
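If you want this coefficient table programmatically rather than reading it off the printed summary, one possible sketch (assuming the fitted object nycmod_1 from above) is:

    # matrix with Estimate, Std. Error, t value and Pr(>|t|) per term
    coef(summary(nycmod_1))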

So, let us look at the R squared value. The R squared value is 0.628, the adjusted R squared is 0.619, and the F statistic value is really high at 68.76. This tells you that, compared to the reduced model with only the intercept, my full model is performing better and I should retain it. So, now that we know service is not significant, let us build a new model dropping service, as sketched below.
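A sketch of the refitted model (the name nycmod_2 is the one used in the lecture):

    # drop the insignificant service term and refit
    nycmod_2 <- lm(price ~ food + decor + east, data = nyc)
    summary(nycmod_2)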

(Refer Slide Time: 12:11)

So, I have dropped service and built a new model, which I am calling nycmod_2. Let us jump to the coefficients section. The estimates are not drastically different before and after removing the service variable, which again tells you that service is not very important. The p values tell you that these variables are very significant; now look at the R squared value further down.

The R squared value has not changed much before and after removing service, and this itself is an indicator that service is not helping us explain the variation in price. The adjusted R squared has changed a bit, and that is only because we have removed one variable and the degrees of freedom change. The F statistic is again really high, telling you that the full model with food, decor and east is performing better compared to the reduced model with only the intercept.
(Refer Slide Time: 13:15)

If you recall from the scatter plot, we saw that there was a high correlation between food and service. We have now built a model dropping service; let us instead retain service and build a model dropping food, as sketched below. So, I have dropped food from here.
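A sketch of this alternative fit; the object name nycmod_3 is assumed here for illustration:

    # keep service, drop food instead
    nycmod_3 <- lm(price ~ decor + service + east, data = nyc)
    summary(nycmod_3)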

(Refer Slide Time: 13:28)

If you take a look at this summary, though the p values tell you that all the variables are significant, the R squared value has dropped from 0.628 to 0.588, which is a large decrease, and even the adjusted R squared has decreased. This tells you that service is less important, and that food explains the price much better than service does.

So, the R squared values and the scatter plots tell us to go ahead with the linear model with food, decor and east, where we still need to verify the assumptions we made on the errors using residual analysis. This task we are going to leave to you as an exercise: you can do it and verify these assumptions, for instance along the lines sketched below.
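As a starting point for that exercise, a minimal sketch of the standard residual diagnostics in R, assuming the chosen model nycmod_2 from above:

    # 2 x 2 grid of diagnostics: residuals vs fitted, normal Q-Q,
    # scale-location, residuals vs leverage
    par(mfrow = c(2, 2))
    plot(nycmod_2)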

Thank you.
