Linear Regression For Real
Assumptions of linear regression:
3. No auto-correlation of errors
(A formal check for this assumption is sketched after this list.)
4. Normality
The residuals should be distributed normally. Check a Q-Q residuals plot (residuals should be close to the line) or a histogram.
5. Homoscedasticity
Homogeneity of variances. In the Residuals vs Fitted and Scale-Location plots, the residuals should be randomly scattered; if there is a pattern in the distribution, there is no homoscedasticity.
6. No outliers
The Residuals vs Leverage plot uses Cook's distance to determine whether any significant outliers could affect the analysis results.
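A minimal sketch for assumption 3, which the diagnostic plots used later do not cover (assuming the lmtest package is installed, and run after linearmodel has been fitted in step 4): the Durbin-Watson test checks for auto-correlation of the errors.
>library(lmtest) #assumed installed via install.packages('lmtest')
>dwtest(linearmodel) #a Durbin-Watson statistic near 2 suggests no auto-correlation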
3. Plot boxplot
Visualises the distribution of exam scores, allowing the identification of any outliers.
>boxplot(data$score,
    main= 'Boxplot: Distribution of Scores',
    ylab= 'Score (%)')
No outliers are present here.
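As a complementary check (a sketch, not one of the original steps), base R's boxplot.stats() returns the same points the boxplot would flag as outliers:
>boxplot.stats(data$score)$out #expect numeric(0) here, i.e. no outliers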
4. Perform Linear Regression
>linearmodel <- lm(data$score ~ data$hours) #fitting linear regression model
>summary(linearmodel) #for quantitative output
>plot(data$score ~ data$hours,
    main= "Hours studied vs. Exam score",
    xlab= 'Time (hours)',
    ylab= 'Score (%)') #plot the 2 variables
>abline(linearmodel, col= 'blue') #adding the linear regression line
The fitted line is y = mx + c, where m = slope = 0.6049 and c = intercept = 69.6326.
Reading the summary() output:
Residual standard error: The average distance between the data values and the regression line. The lower the value, the more closely the regression line fits the data values.
Multiple R-squared: In this case, only 68.42% of the variation in scores can be explained by hours studied, suggesting that this might not be the only variable affecting exam scores.
Adjusted R-squared: This value is lower than the multiple R-squared; here, it is 65.99%.
F-statistic & p-value: The p-value is less than 0.05 (p-value = 0.000142). The model is statistically significant, and hours studied is deemed a useful explanation for the variation in exam scores.
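The figures quoted above can also be pulled out of the fitted model directly. A minimal sketch (using the linearmodel object from step 4; the 40-hour prediction is a hypothetical example):
>coef(linearmodel) #c (intercept) and m (slope)
>summary(linearmodel)$sigma #residual standard error
>summary(linearmodel)$r.squared #multiple R-squared (0.6842)
>summary(linearmodel)$adj.r.squared #adjusted R-squared (0.6599)
>coef(linearmodel)[1] + coef(linearmodel)[2]*40 #y = m*40 + c, predicted score for 40 hours studied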
5. Creating Residual Plots and Running Diagnostic Plots
Two assumptions of linear regression:
The residuals are roughly normally distributed.
The residuals are homoscedastic.
To verify these assumptions, plot the 4 diagnostic plots:
>par(mfrow = c(2, 2))
>plot(linearmodel)
Q-Q plot: For normal distribution. If the data values follow roughly the dotted straight line at a 45-degree angle, the data is normally distributed. Here, normal distribution can be assumed.
Residuals vs. Fitted plot: For homoscedasticity. The x-axis displays the fitted values; the y-axis displays the residuals. The residuals should appear randomly and evenly around zero; otherwise, homoscedasticity is violated. Due to the parabola, homoscedasticity seems to be violated here.
Scale-Location plot: For homoscedasticity. You should see a horizontal line with equally spread points. Here, the line curves off, suggesting that the variance of the residuals differs across data points: a violation of homoscedasticity.
Residuals vs Leverage plot: For influential outliers. Values lying outside the dashed lines are influential to the regression, and the regression results would be altered if these values were excluded. Here, there are no influential outliers, so the regression line is not affected by them.
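Each visual check above has a formal counterpart. A minimal sketch (shapiro.test() and cooks.distance() are base R; bptest() assumes the lmtest package is installed):
>shapiro.test(residuals(linearmodel)) #normality of residuals, complements the Q-Q plot
>library(lmtest)
>bptest(linearmodel) #Breusch-Pagan test for homoscedasticity
>cooks.distance(linearmodel) #the influence values behind the Residuals vs Leverage plot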
Homoscedasticity appears to be violated in this data. As this is one of the main assumptions of linear regression, only a weak linear regression is implied, suggesting that another model may be more appropriate for this dataset.
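One candidate (our assumption, not a step from the original analysis): since the Residuals vs. Fitted plot showed a parabola, a quadratic term for hours could be tried and its diagnostics compared:
>quadmodel <- lm(score ~ hours + I(hours^2), data = data) #hypothetical alternative model
>summary(quadmodel) #compare R-squared with linearmodel
>par(mfrow = c(2, 2))
>plot(quadmodel) #re-check the diagnostic plots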