UNIT-5
What is Linear Regression?
It is a statistical method used for predictive analysis. Linear regression
models a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Because the
relationship is linear, the model describes how the value of the dependent
variable changes as the value of the independent variable changes. It is
mathematically denoted as y = ax + b, where a is the slope and b is the
intercept.
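The equation y = ax + b can be sketched directly in R; the slope and intercept values below are illustrative, not taken from the notes:

```r
# Illustrative line y = a*x + b with assumed slope a = 2 and intercept b = 5
a <- 2
b <- 5
x <- c(1, 2, 3, 4)  # independent variable
y <- a * x + b      # dependent variable predicted by the line
y                   # 7 9 11 13
```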
Linear Regression Line: A straight line showing the relationship between the
dependent and independent variables is called a regression line. A regression line
can show two types of relationship:
Positive Linear Relationship: If the dependent variable on the Y-axis increases
as the independent variable on the X-axis increases, the relationship is termed
a positive linear relationship.
Negative Linear Relationship: If the dependent variable on the Y-axis decreases
as the independent variable on the X-axis increases, the relationship is termed
a negative linear relationship.
There are two types of linear regression.
Simple Linear Regression
Multiple Linear Regression
Simple Linear Regression
Simple linear regression is used to estimate the relationship between two
quantitative variables. Simple linear regression uses only one independent
variable. You can use simple linear regression when you want to know:
How strong the relationship is between two variables (e.g., the relationship
between rainfall and soil erosion).
The value of the dependent variable at a certain value of the independent
variable (e.g., the amount of soil erosion at a certain level of rainfall).
Assumptions of simple linear regression
Simple linear regression is a parametric test, meaning that it makes certain
assumptions about the data. These assumptions are:
Homogeneity of variance (homoscedasticity): the size of the error in our
prediction doesn’t change significantly across the values of the independent
variable.
Independence of observations: the observations were collected using
statistically valid sampling methods, and there are no hidden relationships
among them.
Normality: the data follows a normal distribution.
Linearity: the line of best fit through the data points is a straight line,
rather than a curve.
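One quick way to eyeball the homoscedasticity assumption is to plot residuals against fitted values; the sketch below uses R's built-in cars dataset as an assumed example:

```r
# Visual check of homoscedasticity: the spread of residuals should stay
# roughly constant across the range of fitted values
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)  # residuals should scatter evenly around this line
```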
INNAHAI ANUGRAHAM
BCA V SEM R PROGRAMMING RAJADHANI DEGREE COLLEGE
Implementation in R
In R programming, the lm() function is used to create a linear regression model.
Syntax: lm(formula, data)
formula: a symbolic description of the model to be fitted, written in the
form response ~ predictor1 + predictor2 + … (the response is the dependent
variable and the predictors are independent variables; simple linear
regression has only one predictor, while multiple linear regression has
more than one).
data: the data frame containing the variables in the formula.
Example:
# Create the data frame
data <- data.frame(
  Years_Exp = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7),
  Salary = c(39343.00, 46205.00, 37731.00, 43525.00,
             39891.00, 56642.00, 60150.00, 54445.00, 64445.00,
             57189.00)
)

# Fit the model and print its summary
model <- lm(Salary ~ Years_Exp, data = data)
summary(model)
Output:
Call:
lm(formula = Salary ~ Years_Exp, data = data)
Residuals:
1 2 3 5 6 8 10
463.1 5879.1 -4041.0 -6942.0 4748.0 381.9 -489.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30927 4877 6.341 0.00144 **
Years_Exp 7230 1983 3.645 0.01482 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4944 on 5 degrees of freedom
Multiple R-squared: 0.7266, Adjusted R-squared: 0.6719
F-statistic: 13.29 on 1 and 5 DF, p-value: 0.01482
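The quantities in this summary can also be pulled out programmatically with standard accessor functions; a small sketch (note that the fit here uses all ten rows of the data frame, so the numbers may differ slightly from the printout above):

```r
# Refit the model on the full data frame
data <- data.frame(
  Years_Exp = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7),
  Salary = c(39343, 46205, 37731, 43525, 39891,
             56642, 60150, 54445, 64445, 57189)
)
model <- lm(Salary ~ Years_Exp, data = data)

coef(model)               # named vector: intercept and slope
summary(model)$r.squared  # proportion of variance explained
residuals(model)          # one residual per observation
```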
predict() function in R
predict() is a built-in R function used to obtain predicted values from a fitted
model; it works with linear models as well as many other model types used by
analysts.
Syntax:
predict(object, newdata, interval)
object: a fitted model object (for example, the result of lm())
newdata: a data frame of input values at which to predict
interval: the type of interval calculation
Example:
# Load the built-in cars dataset and view the first rows
df <- datasets::cars
head(df, 10)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# Create a linear model
my_linear_model <- lm(dist ~ speed, data = df)

# Predict stopping distances for new speed values
# (these speeds are reconstructed to match the output shown below)
new_df <- data.frame(speed = c(11, 11, 12, 12, 12, 12, 13, 13, 13, 13))
predict(my_linear_model, newdata = new_df)
1 2 3 4 5
25.67740 25.67740 29.60981 29.60981 29.60981
6 7 8 9 10
29.60981 33.54222 33.54222 33.54222 33.54222
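The interval argument can also be used to attach confidence bounds to each prediction; a minimal sketch on the same cars data:

```r
# Point predictions plus 95% confidence bounds for the mean response
my_linear_model <- lm(dist ~ speed, data = datasets::cars)
new_speeds <- data.frame(speed = c(11, 12, 13))
predict(my_linear_model, newdata = new_speeds, interval = "confidence")
# Returns a matrix with columns fit (prediction), lwr and upr (bounds)
```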
LINEAR MODEL SELECTION
It is often the case that some or many of the variables used in a multiple regression
model are in fact not associated with the response variable. Including such
irrelevant variables leads to unnecessary complexity in the resulting model.
Unfortunately, manually filtering through and comparing regression models can
be tedious. Luckily, several approaches exist for automatically performing feature
selection or variable selection — that is, for identifying those variables that result
in superior regression results. This leads to the concept of model selection.
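One common automatic approach in R is stepwise selection with the built-in step() function, which adds or drops predictors to minimise AIC; the sketch below uses the built-in mtcars dataset as an assumed example, since the notes do not name one:

```r
# Start from a model with several candidate predictors of fuel efficiency
full_model <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

# Let step() search for a lower-AIC subset of predictors
selected <- step(full_model, direction = "both", trace = 0)
formula(selected)  # the formula of the model step() settled on
```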
Diagnostic procedures in R check whether a fitted linear model satisfies the
regression assumptions. The key diagnostic checks are produced by calling
plot() on a fitted model:
# Create diagnostic plots (residuals vs. fitted values, normal Q-Q,
# scale-location, and residuals vs. leverage)
model <- lm(dist ~ speed, data = cars)  # example fit; any lm model works
par(mfrow = c(2, 2))
plot(model)
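Alongside the plots, normality of the residuals can be tested numerically; the sketch below refits the cars model so it is self-contained, and uses the base R shapiro.test() function:

```r
# Shapiro-Wilk test on the residuals; a small p-value suggests
# the residuals depart from normality
model <- lm(dist ~ speed, data = datasets::cars)
shapiro.test(residuals(model))
```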