
LINEAR REGRESSION

Background

A linear regression test shows the relationship between 2 variables to determine whether an independent variable significantly affects a dependent variable.

Purpose: to evaluate the null hypothesis (i.e. there is no relationship between the 2 variables) against the alternative hypothesis (i.e. there is a significant linear relationship between the 2 variables).

P-values indicate the significance of the relationship. A low p-value (< 0.05) means the null hypothesis is rejected.
Assumptions

1. Linear relationship
The relationship between the independent and dependent variables must be linear (check with a scatter plot).

2. Little or no multicollinearity between explanatory variables

3. No auto-correlation of errors

4. Normality
Residuals should be normally distributed. Check with a Q-Q plot of the residuals (points should lie close to the line) or with a histogram.

5. Homoscedasticity
Homogeneity of variances. In the Residuals vs Fitted and Scale-Location plots, the residuals should be randomly scattered; a pattern in the distribution means there is no homoscedasticity.

6. No outliers
The Residuals vs Leverage plot uses Cook's distance to calculate whether any significant outliers could affect the analysis results.

Coding in R

1. Create/Import the data:
>data <- data.frame(hours = c(3, 7, 9, 8, 10, 12, 12, 14, 16, 18, 20, 28, 30, 38, 42),
                    score = c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 84, 82, 91, 93, 89))
>View(data)
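As an optional quick check before plotting (not part of the original steps), base R's str() and summary() confirm the data frame has the expected structure and ranges:
>str(data)      #15 observations of 2 numeric variables
>summary(data)  #min/max and quartiles for hours and score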
2. Visualise the data
This allows us to check 2 assumptions behind the linear regression model: the variables having a roughly linear relationship, and the absence of outliers.
>scatter.smooth(data$score ~ data$hours,
    main = "Scatter graph: Hours studied vs. Exam Score",
    xlab = "Time (hours)",
    ylab = "Score (%)")
Though the relationship is not entirely linear, it is still worth carrying out a linear regression.
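To put a number on the strength of that linear relationship (an optional addition to the sheet's workflow), base R's cor() computes the Pearson correlation coefficient:
>cor(data$hours, data$score)  #values near +1 or -1 indicate a strong linear relationship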
3. Plot a boxplot
Visualises the distribution of exam scores, allowing the identification of any outliers.
>boxplot(data$score,
    main = 'Boxplot: Distribution of Scores',
    ylab = 'Score (%)')
No outliers are present here.
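If you want that outlier check as output rather than a picture (again optional), base R's boxplot.stats() returns the points a boxplot would draw as outliers:
>boxplot.stats(data$score)$out  #an empty result (numeric(0)) means no outliers were flagged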
4. Perform the linear regression
>linearmodel <- lm(data$score ~ data$hours)  #fitting the linear regression model
>summary(linearmodel)  #for quantitative output
>plot(data$score ~ data$hours,
    main = "Hours studied vs. Exam score",
    xlab = 'Time (hours)',
    ylab = 'Score (%)')  #plot the 2 variables
>abline(linearmodel, col = 'blue')  #adding the linear regression line
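One optional refinement, not part of the original sheet: fitting with the formula interface and a data argument keeps the variable names clean ("hours" rather than "data$hours") and lets predict() score new observations. The name linearmodel2 and the value hours = 25 are arbitrary choices for this sketch:
>linearmodel2 <- lm(score ~ hours, data = data)  #same fit, cleaner variable names
>predict(linearmodel2, newdata = data.frame(hours = 25))  #predicted exam score after 25 hours of study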
5. Create residual plots and run diagnostic plots
2 further assumptions of linear regression:
The residuals are roughly normally distributed.
The residuals are homoscedastic.
To verify these assumptions, plot the 4 diagnostic plots:
>par(mfrow = c(2, 2))  #arrange the 4 plots in a 2x2 grid
>plot(linearmodel)
Interpretation

The fitted model follows y = mx + c:
m = slope = 0.6049
c = intercept = 69.6326
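Rather than reading these off the printed summary, the coefficients can be extracted directly (an optional addition; coef() and confint() are base R):
>coef(linearmodel)     #intercept and slope as a named vector
>confint(linearmodel)  #95% confidence intervals for both coefficients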
Residual standard error:
The average distance between the data values and the regression line. The lower the value, the more closely the regression line fits the data values.

Multiple R-squared:
In this case, only 68.42% of the variation in scores can be explained by hours studied, suggesting that this might not be the only variable affecting exam scores.

Adjusted R-squared:
The value is lower than the multiple R-squared. Here, it is 65.99%.

F-statistic & p-value:
The p-value is less than 0.05 (p-value = 0.000142). The model is statistically significant, and hours studied is deemed a useful explanation for the variation in exam scores.
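These figures can also be pulled out programmatically, an optional step; sigma(), r.squared and adj.r.squared are standard parts of base R's lm/summary.lm objects:
>s <- summary(linearmodel)
>sigma(linearmodel)  #residual standard error
>s$r.squared         #multiple R-squared (0.6842)
>s$adj.r.squared     #adjusted R-squared (0.6599)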
Q-Q plot: for normal distribution.
If the data values roughly follow the dotted straight line at a 45-degree angle, the residuals are normally distributed. Here, normal distribution can be assumed.

Residuals vs Fitted plot: for homoscedasticity.
The x-axis displays the fitted values; the y-axis displays the residuals. The residuals should appear randomly and evenly scattered around zero; otherwise, homoscedasticity is violated. Due to the parabola shape here, homoscedasticity seems to be violated.

Scale-Location plot: for homoscedasticity.
You should see a horizontal line with equally spread points. Here, the line curves off, suggesting that the variance of the residuals differs across data points: a violation of homoscedasticity.

Residuals vs Leverage plot: for influential outliers.
Values outside the dashed lines are influential to the regression; the regression results would be altered if these values were excluded. Here, there are no influential outliers, so the regression line is not affected by them.

Conclusion: the residuals show low homoscedasticity. As this is one of the main assumptions of linear regression, the fit is weak, suggesting that another model may be more appropriate for this dataset.
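As an optional cross-check on the visual diagnostics (not part of the original sheet), formal tests exist for both assumptions: shapiro.test() in base R tests the residuals for normality, and bptest() from the lmtest package (which must be installed separately) runs the Breusch-Pagan test for heteroscedasticity:
>shapiro.test(residuals(linearmodel))  #p > 0.05 supports normality of the residuals
>library(lmtest)  #assumes install.packages("lmtest") has been run
>bptest(linearmodel)  #p < 0.05 indicates heteroscedasticity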
