Lecture – 40
Multiple Linear Regression Model Building and Selection
So, the inputs for the function read.csv are similar to what we saw in
the previous lecture for read.delim. read.csv reads a file in table
format and creates a data frame from it. The syntax is read.csv, and
the inputs to the function are file and row.names: file is the name of
the file from which you want to read the data, and row.names is a
vector giving the actual row names; it could also be a single number
giving the column to use for the row names.
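As a minimal sketch (the file name and contents below are made up for
illustration), the call looks like this:

```r
# Write a tiny CSV to a temporary file so the example is self-contained;
# in the lecture the file would be something like "nyc.csv".
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,price,food", "r1,42,20", "r2,35,18"), tmp)

# row.names = 1 tells read.csv to use the first column as the row names
df <- read.csv(tmp, row.names = 1)
print(df)   # a data frame with rows r1, r2 and columns price, food
```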
(Refer Slide Time: 02:00)
So, let us see how to load the data. Assuming ‘nyc.csv’ is in your
current working directory, the command is read.csv followed by the
name of the file in double quotes. Once this command is executed, it
creates an object nyc, which is a data frame. Now, let us see how to
view the data.
So, the data is about menu pricing in restaurants of New York City. y,
which is my dependent variable, is the price of the dinner, and there
are 4 independent variables. I have food, which is the customer rating
of the food; decor, which is the customer rating of the decor; service,
which is the customer rating of the service; and east.
So, east indicates whether the restaurant is located on the east or
west side of the city. Now, our objective is to build a linear model
with y, which is price, and all the other 4 independent variables.
Before we go on to building a model, let us see if our data exhibits
some interdependency between the variables. To do that, I am going to
use a pairwise scatter plot, with the same plot function we have used
earlier.
Now, since I have multiple variables, I am going to give the data frame
as my input, and I am giving the heading as “Pairwise scatter plot”.
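Since the nyc data is not reproduced here, the following sketch builds
a small synthetic data frame of the same shape (all names and numbers
invented) and passes it to plot; calling plot on a whole data frame
produces the pairwise scatter plot.

```r
# Synthetic stand-in for the nyc data (made up for illustration)
set.seed(1)
n <- 50
food    <- rnorm(n, mean = 20, sd = 2)
service <- food + rnorm(n, sd = 1)            # deliberately correlated with food
decor   <- rnorm(n, mean = 18, sd = 2)
price   <- 2 * food + 1.5 * decor + rnorm(n, sd = 3)
nyc_demo <- data.frame(price, food, decor, service)

# Passing a data frame to plot() draws the pairwise scatter plot;
# for the lecture's data the call would be plot(nyc, main = "Pairwise scatter plot")
plot(nyc_demo, main = "Pairwise scatter plot")
```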
(Refer Slide Time: 04:06)
On the right is the output you will get. We can see that all the
variables are listed along the diagonal. When one moves from left to
right, the variable on the left is on the y axis and the variable above
or below is on the x axis. Let us take the first row, for instance: I
have price on the left, so price is on the y axis, and I have food
below, so food becomes the x axis.
Now, this is the plot for price versus food; similarly, I have price
versus decor, price versus service and price versus east. Moving to the
next row, food is on the y axis. If you take food versus decor, the
data is randomly scattered, so it does not show any correlation
pattern; whereas if you look at food versus service, you see strong
patterns being exhibited. So, let us see what the correlation actually
is for all of these. Correlation is computed with a function, and
Professor Shankar has told you how it is computed.
So, cor is the function in R; I need to give it the dataset with all
the variables. round tells you to how many decimal places you want to
round off the number: if I give round, with my correlation function as
the input, and say 3, it means round off the numbers to 3 decimal
places. So, let us see how to interpret the output.
So, the correlation for price versus price will always be 1.
So, let us look at food and decor: the correlation between food and
decor is 0.5, which is fairly low; whereas if you look at food and
service, it is almost 0.8, which is quite high. So, we can see that
food and service are correlated, and one of them can be dropped while
building the final model. As we go along, let us see which of the two
we have to drop.
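The same round(cor(...), 3) pattern can be tried on any numeric data
frame; this small made-up example shows the shape of the output:

```r
# Two made-up, strongly related variables
x <- c(1, 2, 3, 4, 5)
y <- c(1.1, 1.9, 3.2, 3.8, 5.1)

# cor() on a data frame returns the full correlation matrix;
# round(..., 3) trims it to 3 decimal places
round(cor(data.frame(x, y)), 3)
# For the lecture's data the call is round(cor(nyc), 3)
```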
(Refer Slide Time: 06:25)
So, like I said earlier, my dependent variable is only one here, which
is denoted by y. I have several independent variables, which are
denoted by xᵢ, where i ranges from 1 to p and p is the total number of
independent variables. Now let us see how to write this equation with
multiple independent variables. Again I have ŷ, which is the predicted
value; then I have β̂₀, which is the intercept; then β̂₁x₁ + β̂₂x₂, and
so on up to β̂ₚxₚ. So, β̂₀ is the intercept, and β̂₁, β̂₂, etcetera are
the slopes.
So, ε is the error. If you recall from your earlier lectures on OLS,
the assumption is that error is present only in the measurement of the
dependent variable and not in the independent variables. So, the
independent variables are free of errors, whereas there is always some
error present in the measurement of y. This ε is an unknown quantity
which has zero mean and some variance. Now, for any i-th observation,
this is how my equation is written.
So, now, let us go and build a model. The function to build a multiple
linear model is the same as what we used in the univariate case: here
also I am going to use lm. Again, the syntax is lm, and there are 2
input parameters, formula and data. The formula is slightly different
compared to the univariate case: I have my dependent variable, then a
tilde sign, and however many independent variables I have, I separate
them with a + sign. Say, for instance, I have 2 independent variables
in my data; I am regressing the dependent variable on these 2
independent variables, so the 2 independent variables have to be
separated by a + sign.
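A minimal sketch with two synthetic predictors (all names and numbers
here are invented for illustration):

```r
# Fabricate a small dataset where y depends on x1 and x2
set.seed(2)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + rnorm(30, sd = 0.1)

# lm(formula, data): predictors on the right of '~', separated by '+'
fit <- lm(y ~ x1 + x2, data = d)
coef(fit)   # intercept and the two slopes, close to 1, 2 and -0.5
```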
So, now, let us see how to do it for our data nyc. Again I have lm: I
am regressing price on all 4 input variables, which are food, decor,
service and east, and I am taking these variables from the data nyc.
So, you can separate the independent variables by a + sign. Now, if you
want to regress price on all 4 inputs, there is another way to write
the same command: I say regress price, then give a tilde sign, and then
a dot. This means regress price on all the input variables from the
data nyc. So, if you are going to use all the input variables for
regression, you can go with this; but if you have a subset of variables
that you want to build a model with, then you specify those variables
separated by a + sign. Just to reiterate, this is the form of my
equation. Now, let us see how to interpret the summary. After having
built this model, I am going to look at its summary.
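The dot shorthand can be checked against the explicit formula on any
data frame; this synthetic example (names invented) shows that the two
give the same fit:

```r
# Fabricated data frame with three predictors and a response
set.seed(3)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
d$y <- 1 + d$x1 + 2 * d$x2 - d$x3 + rnorm(40, sd = 0.1)

m1 <- lm(y ~ x1 + x2 + x3, data = d)  # predictors spelled out
m2 <- lm(y ~ ., data = d)             # '.' = every other column in d
all.equal(coef(m1), coef(m2))         # identical coefficient estimates
summary(m2)                           # coefficients, R-squared, F-statistic
```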
So, this snippet gives you a gist of the summary. If you recall, in the
first lecture on simple linear regression we looked in depth at what
each line here means. We have the formula in the first line, then the
residuals and their 5-point summary, and then we look at the
coefficients: the intercept, food, decor, service and east, and the
coefficient estimates for these variables.
So, let us look at the R-squared value. The R-squared value is 0.628,
the adjusted R-squared is 0.619, and the F-statistic value is really
high at 68.76. This tells you that, compared to the reduced model with
only the intercept, my full model is performing better and I should
retain it. Now that we know service is not significant, let us build a
new model dropping service.
So, I have dropped service and built a new model, and I am calling it
nycmod_2. Let us jump to the coefficients section. The estimates are
not drastically different before and after removing the service
variable; this tells you that service is not very important. Again, the
p-values tell you that the remaining variables are very significant,
and then look at the R-squared value further down.

The R-squared value has not changed much before and after removing
service; this itself is an indicator that service is not helping us
explain the variation in price. The adjusted R-squared has changed a
bit, and that is only because we have removed one variable and the
degrees of freedom have changed. The F-statistic is again really high,
telling you that the full model with food, decor and east is performing
better compared to the reduced model with only the intercept.
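The drop-and-compare workflow above can be sketched as follows;
update() with “. ~ . - service” removes one predictor while keeping the
rest (data synthetic, names invented):

```r
# Fabricate data where service largely duplicates food
set.seed(4)
n <- 100
food    <- rnorm(n)
decor   <- rnorm(n)
service <- food + rnorm(n, sd = 0.3)   # highly correlated with food
price   <- 2 * food + decor + rnorm(n)
d <- data.frame(price, food, decor, service)

full   <- lm(price ~ food + decor + service, data = d)
noserv <- update(full, . ~ . - service)   # same model minus service

# R-squared barely moves when the redundant predictor is dropped
c(full = summary(full)$r.squared, reduced = summary(noserv)$r.squared)
```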
(Refer Slide Time: 13:15)
If you recall from the scatter plot, we saw there was a high
correlation between food and service. We have now built a model
dropping service; let us instead retain service and build a model
dropping food. So, I have dropped food from here. Let us take a look at
this summary.
Though the p-values tell you that all the variables are significant, if
you look at the R-squared value it has dropped from 0.628 to 0.588,
which is a big decrease, and even the adjusted R-squared has decreased.
This tells you that service is less important and that food explains
the price much better than service does.
So, the R-squared values and the scatter plots tell us to go ahead with
this linear model, though we still need to verify the assumptions we
made on the errors using residual analysis. We leave this task to you
as an exercise; you can do it and verify these assumptions.
Thank you.