Application of Linear Regression
Department/Institute:
1st year 1st semester, Masters Final Examination-2020
Registration No. #
Academic Session #
2019-2020
Instructions:
1. Don’t copy from others’ assignments. Copying from others will be punished severely.
2. The student must submit the assignment online (Google classroom/email/google form
etc.) as the course-teacher prescribes.
3. You must use your name# your EXAM ID only for naming your submitted file.
Application of Linear Regression
The data come from an analysis of annual deaths in road accidents across half of the US states.
The dependent variable is the number of deaths in road accidents. The independent variables are the number
of drivers dead, population density in people per mile, length of rural roads in 1000s of miles, average
maximum daily temperature in January, and fuel consumption in 10,000,000 US gallons per year.
A multiple regression analysis has been conducted using these data. Before a regression analysis is
conducted, the following assumptions need to be tested.
Assumption 1: Linear relationship
From the scatter diagram we can see that the relationship between the independent variable (IV) and the
dependent variable (DV) can be modeled by a straight line, which means there is a linear relation between
the variables.
Relation between road accidents and population density per mile
Here, from the scatter diagram, we can see that a straight line can be drawn through the data points,
though some of the points lie far from the line, which means they have large residuals. Most of the
points nevertheless show a linear relationship between the independent and dependent variable.
Relation between road accidents and length of rural roads
Here, from the scatter diagram, we can see that the data points are scattered almost randomly. Some of
them show a degree of linearity, so a single line can be drawn through the points, but few of them lie
near that line. So it can be said that the independent and dependent variables show little linearity
here.
Relation between road accidents and Average maximum daily temperature
Here, from the scatter diagram, we can see that the data points are distributed very randomly and the
relationship between the IV and the DV cannot be modeled by a straight line, suggesting that the relation
between the variables is not linear.
Relation between road accidents and Fuel consumption
Here, from the scatter diagram, we can see that most of the points follow a straight line, which
suggests that there is a linear relationship between the independent and dependent variable.
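The visual judgments above can be complemented with a numeric summary. The following is a minimal sketch, assuming Pearson's correlation coefficient as the measure of linear association. The function name `pearson_r` and the two short lists are hypothetical illustrations and are not values from the data set analyzed here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration only (not the road-accident data).
fuel = [10, 20, 30, 40, 50]
deaths = [120, 210, 330, 390, 510]
r = pearson_r(fuel, deaths)
print(round(r, 3))
```

A value of r close to +1 or -1 would agree with a scatter diagram whose points follow a straight line, while a value near 0 would agree with a randomly scattered diagram.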
Assumption 2: Multivariate normality
Here, from the P-P plot, we can see that the data points stay near the line but most of them do not lie
exactly on it. This indicates that the data are not fully normally distributed, but the data set shows
multivariate normality to some extent.
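To make the reading of a P-P plot concrete, the sketch below computes the two coordinates that such a plot compares: the observed cumulative probability of each sorted value and the cumulative probability expected under a fitted normal distribution. It is a minimal sketch, assuming the plotting position (i + 0.5) / n and a normal distribution fitted by the sample mean and standard deviation. The names `pp_points` and `max_pp_gap` and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def pp_points(values):
    """Return (observed, expected) cumulative probabilities for a normal P-P plot."""
    n = len(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("values have zero spread")
    normal = NormalDist(mean(values), sigma)
    ordered = sorted(values)
    # Observed: empirical cumulative probability of the i-th smallest value.
    observed = [(i + 0.5) / n for i in range(n)]
    # Expected: cumulative probability of the same value under the fitted normal.
    expected = [normal.cdf(v) for v in ordered]
    return observed, expected

def max_pp_gap(values):
    """Largest vertical distance between the plotted points and the diagonal."""
    observed, expected = pp_points(values)
    return max(abs(o - e) for o, e in zip(observed, expected))

# Hypothetical residuals, for illustration only.
residuals = [-2.1, -0.4, 0.3, 1.2, -1.0, 0.8, 0.1, 1.1]
print(round(max_pp_gap(residuals), 3))
```

Points lying on the diagonal correspond to a gap of 0. The larger the gap, the further the points sit from the line in the plot.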
Assumption 3: No multicollinearity
Here, the VIF and tolerance statistics from the coefficient table have been used to assess the assumption.
For the assumption to be met, the VIF score should preferably be below 10 and the tolerance score above
0.2. From the output we can see that all the VIF scores are below 10 and all the tolerance scores are
above 0.2, so it can be said that there is no multicollinearity in the data.
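The two statistics are linked: for each independent variable, tolerance is 1 - R^2 from regressing that variable on the other independent variables, and VIF is the reciprocal of tolerance. The following is a minimal sketch of that calculation in plain Python, assuming ordinary least squares solved through the normal equations. The helper names and the three short predictor columns are hypothetical and are not the values from the coefficient table.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """R^2 from regressing y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif_and_tolerance(columns):
    """Return a (VIF, tolerance) pair for each predictor column."""
    results = []
    for j, target in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        tolerance = 1.0 - r_squared(others, target)
        vif = 1.0 / tolerance if tolerance > 0 else float("inf")
        results.append((vif, tolerance))
    return results

# Hypothetical predictor columns, for illustration only.
drivers = [10.0, 12.0, 15.0, 18.0, 20.0, 23.0, 27.0, 30.0]
popden = [5.0, 3.0, 8.0, 2.0, 9.0, 4.0, 7.0, 6.0]
rural = [2.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 9.0]
for vif, tol in vif_and_tolerance([drivers, popden, rural]):
    print(round(vif, 2), round(tol, 2))
```

Because VIF is the reciprocal of tolerance, the two rules of thumb are not equivalent thresholds: a tolerance above 0.2 corresponds to a VIF below 5, which is stricter than a VIF below 10.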
Assumption 4: No auto-correlation
Auto-correlation occurs when the residuals are not independent of each other, or, in other words, when
the value of y(x+1) is not independent of the value of y(x). Here, auto-correlation has been tested with
the Durbin-Watson test. The Durbin-Watson value, D, can take values between 0 and 4. A value of D around
2 means there is no auto-correlation. A value well below 2 indicates positive auto-correlation and a
value well above 2 indicates negative auto-correlation.
Here, from the model summary, we can see that the Durbin-Watson value is 2.07, which is around 2, and
this indicates that there is no auto-correlation in the data set.
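The Durbin-Watson value is the sum of squared differences between successive residuals divided by the sum of squared residuals. The following minimal sketch shows that calculation. The function name `durbin_watson` and both residual sequences are hypothetical and do not reproduce the 2.07 reported in the model summary.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are needed")
    # Numerator: squared differences between each residual and the previous one.
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    # Denominator: sum of squared residuals.
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    return numerator / denominator

# Hypothetical residuals, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips at every step
drifting = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # long runs of the same sign
print(round(durbin_watson(alternating), 2))
print(round(durbin_watson(drifting), 2))
```

Residuals that flip sign at every step push the value above 2 (negative auto-correlation), while long runs of the same sign push it below 2 (positive auto-correlation).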
Assumption 5: Homoscedasticity
For linear regression it is important for the data to be homoscedastic, which means the data points
should show similar dispersion around the regression line. Basically, it is the assumption that the
variation in the residuals is similar at each point across the model. It can also be said that the spread
of the residuals should be fairly constant at each level of the independent variables. The graph plots
the standardized values predicted by the model against the standardized residuals obtained. As the
predicted values increase, the variation in the residuals should stay roughly the same. If the assumption
holds, the graph should look like a random array of dots, but if it looks like a funnel, it is likely
that the assumption is violated, that is, the data are heteroscedastic.
From the scatter plot it is a bit difficult to judge homoscedasticity, as the data set is small. But the
plot generally appears more random than funneled, so it can be said that the data set is not fully
heteroscedastic and shows homoscedasticity to some extent.
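When the plot is hard to read, a rough numeric check can support the visual impression. The sketch below is not the graphical check used in this analysis. It is a simplified variance-ratio comparison in the spirit of the Goldfeld-Quandt test, without the formal F test: the residuals are ordered by predicted value and the mean squared residual in the upper half is divided by that in the lower half. The name `spread_ratio` and all values are hypothetical.

```python
def spread_ratio(fitted, residuals):
    """Ratio of mean squared residuals: upper half vs lower half of fitted values."""
    if len(fitted) != len(residuals) or len(fitted) < 4:
        raise ValueError("need equal-length inputs with at least 4 observations")
    # Order the residuals by their fitted (predicted) value.
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]
    ms_low = sum(e * e for e in low) / len(low)
    ms_high = sum(e * e for e in high) / len(high)
    if ms_low == 0:
        raise ValueError("lower half has zero residual spread")
    return ms_high / ms_low

# Hypothetical values, for illustration only.
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
even_spread = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
funnel = [1.0, -1.0, 1.0, -1.0, 3.0, -3.0, 3.0, -3.0]
print(spread_ratio(fitted, even_spread))
print(spread_ratio(fitted, funnel))
```

A ratio near 1 is consistent with a random array of dots, and a ratio far from 1 is consistent with a funnel shape. With a small data set the ratio is noisy, so it should be read alongside the plot rather than in place of it.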
Regression analysis:
Before regression analysis is conducted, the data set should meet the above-mentioned assumptions. From
the assumption tests, it has been found that the data set is suitable for regression analysis, as it has
broadly met the assumptions.
Hypothesis development
Null hypothesis, H0: There is no relation between annual deaths in road accidents in the US and the number
of drivers dead, population density, length of rural roads, average maximum daily
temperature in January and fuel consumption.
Hypothesis test:
To test the hypothesis, a multiple regression analysis has been conducted.
The above table clearly shows that the regression model is statistically significant, as the p value is
lower than .05 (.0 < .05). Thus the null hypothesis is rejected, and it can be concluded that there is a
relation between annual deaths in road accidents in the US and the number of drivers dead, population
density, length of rural roads, average maximum daily temperature in January and fuel consumption.
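The overall significance of a regression model rests on an F statistic, which can be recovered from R squared. The following is a minimal sketch, assuming the standard relation F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a model with an intercept. The R squared and sample size used below are hypothetical, since they are not reported in the text, while k = 5 matches the five independent variables listed earlier. The p value itself requires the F distribution and is not computed here.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic of a regression with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    if df_resid <= 0:
        raise ValueError("not enough observations for this many predictors")
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    # Explained variation per model degree of freedom, divided by
    # unexplained variation per residual degree of freedom.
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)

# Hypothetical R squared and sample size, for illustration only.
f_value = f_statistic(r_squared=0.80, n_obs=30, n_predictors=5)
print(round(f_value, 2))
```

The larger the F value for the given degrees of freedom, the smaller the p value, and the null hypothesis is rejected when that p value falls below .05.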
The beta coefficient tells us by how many units the dependent variable changes for a one-unit increase in
an independent variable, with the other independent variables held constant. Here, the coefficient table
shows negative beta values for population density and fuel consumption, which means they are negatively
related to annual deaths in road accidents in the US. An increase in any of the other independent
variables is therefore associated with an increase in deaths in road accidents, while an increase in
population density or fuel consumption is associated with a decrease.
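The interpretation of a coefficient can be illustrated with a small calculation: raising one independent variable by one unit, with the others unchanged, shifts the predicted value by exactly that variable's coefficient. The sketch below assumes unstandardized coefficients. The function `predict`, the intercept, the coefficients and the baseline values are all hypothetical and are not taken from the coefficient table.

```python
def predict(intercept, coefficients, values):
    """Predicted value of a linear model: intercept + sum of coefficient * value."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)

# Hypothetical model, for illustration only.
intercept = 50.0
coefficients = {"rural_roads": 3.0, "fuel": -0.5}
baseline = {"rural_roads": 10.0, "fuel": 100.0}

# Raise one variable by one unit while holding the other constant.
more_roads = dict(baseline, rural_roads=baseline["rural_roads"] + 1)
more_fuel = dict(baseline, fuel=baseline["fuel"] + 1)

base_prediction = predict(intercept, coefficients, baseline)
change_roads = predict(intercept, coefficients, more_roads) - base_prediction
change_fuel = predict(intercept, coefficients, more_fuel) - base_prediction
print(change_roads, change_fuel)
```

A positive coefficient raises the prediction and a negative coefficient lowers it, which is the sense in which a negative beta value indicates a negative relation.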