Edureka’s Data Science Certification Training | www.edureka.co/data-science
Linear Regression
What Will You Learn Today?
1. Machine Learning
2. What is Regression?
3. Types of Regression
4. Linear Regression - Example
5. Linear Regression - Use Cases
6. Demo in R: Real Estate Use Case
Machine Learning
Introduction To Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without
being explicitly programmed.
[Figure: machine learning workflow with the stages Training Data, Learn Algorithm, Build Model, Perform and Feedback]
Machine Learning - Example
• Facebook's News Feed uses machine learning to personalize each member's feed.
• When you upload photos to Facebook, the service automatically highlights faces and suggests friends to tag.
• Facebook also uses AI (artificial intelligence) to personalize:
  • News feeds
  • Advertisements
  • Trending news
  • Friend recommendations
What Is Regression?
• Regression analysis is a predictive modelling technique.
• It estimates the relationship between a dependent (target) variable and an independent (predictor) variable.
[Figure: regression line plotted on X and Y axes; for an input value of 7.00 the predicted outcome is 123.9]
Types Of Regression
Linear Regression
• When there is a linear relationship between the independent and dependent variables.
Logistic Regression
• When the dependent variable is binary (0/1, True/False, Yes/No) in nature.
Polynomial Regression
• When the power of the independent variable is more than 1.
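As a rough illustration of the three types, the sketch below fits each one in R. The data are simulated purely for illustration and do not come from the slides; the variable names (y_lin, y_bin, y_poly) and the coefficients used to generate them are assumptions.

```r
# Simulated data for illustration only (not from the slides)
set.seed(1)
x <- seq(1, 10, length.out = 50)

# Linear regression: straight-line relationship between x and y
y_lin <- 2 + 3 * x + rnorm(50)
fit_lin <- lm(y_lin ~ x)

# Logistic regression: binary (0/1) dependent variable
y_bin <- rbinom(50, size = 1, prob = plogis(-3 + 0.6 * x))
fit_log <- glm(y_bin ~ x, family = binomial)

# Polynomial regression: the independent variable enters with power 2
y_poly <- 1 + 0.5 * x^2 + rnorm(50)
fit_poly <- lm(y_poly ~ x + I(x^2))

coef(fit_lin)
coef(fit_log)
coef(fit_poly)
```

lm() covers both the linear and the polynomial case because the polynomial model is still linear in its coefficients; only logistic regression needs glm().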
Linear Regression
Linear Regression - Introduction
• The linear regression model assumes a linear relationship between the input variables and the outcome variable.
• This relationship can be expressed as
  $y = \beta_0 + \beta_1 x + \varepsilon$
  where
  $y$ = outcome variable
  $x$ = input variable
  $\varepsilon$ = random error
  $\beta_1$ = slope of the line
  $\beta_0$ = intercept
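To make the roles of the intercept, slope and random error concrete, here is a small sketch that generates data from the model and then recovers the coefficients with lm(). The values chosen for beta0, beta1 and the error spread are assumptions made only for this illustration.

```r
set.seed(42)
beta0 <- 5    # intercept (assumed value)
beta1 <- 2    # slope (assumed value)

x   <- runif(200, min = 0, max = 10)     # input variable
eps <- rnorm(200, mean = 0, sd = 1)      # random error
y   <- beta0 + beta1 * x + eps           # outcome variable

# Estimate the intercept and slope from the simulated data
fit <- lm(y ~ x)
coef(fit)
```

The estimates land close to, but not exactly on, the assumed values because the random error blurs the underlying line.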
I have a dataset consisting of the height and weight of students. Let’s see how linear regression fits it.
Model Description
We have a dataset of 10 students. We will use it to draw a scatterplot of weight against height:
[Figure: scatterplot of weight (y-axis, 0 to 250) vs. height (x-axis, 62 to 76); the ten points are labelled with the weights 127, 121, 142, 157, 162, 156, 169, 165, 181, 208]
Now the natural question arises: "What is the best-fitting line?"
• The prediction error (or residual error) for unit i is
  $e_i = y_i - \hat{y}_i$
  where
  • $y_i$ is the observed value for unit i (i.e., student i).
  • $\hat{y}_i$ is the predicted response (or fitted value) for unit i.
• The goal is to minimize the sum of the squared prediction errors (least squared error, or LER):
  $Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
[Figure: scatterplot of weight vs. height with two candidate lines, W1 = -266.5 + 6.1h and W2 = -331.2 + 7.1h]
• Least squared error (LER) for W1 = -266.5 + 6.1h: 597.4
• Least squared error (LER) for W2 = -331.2 + 7.1h: 766.5
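The comparison of the two candidate lines can be sketched in R as below. The weights are the point labels from the scatterplot; the heights are not listed in the slides, so the height vector here is an assumption, and sum_sq_error is a hypothetical helper name.

```r
# Weights are the point labels from the slides; the heights are NOT given
# in the slides and are assumed here for illustration.
height <- c(63, 64, 66, 69, 69, 71, 71, 72, 73, 75)
weight <- c(127, 121, 142, 157, 162, 156, 169, 165, 181, 208)

# Sum of squared prediction errors Q for a candidate line w = b0 + b1 * h
sum_sq_error <- function(b0, b1) {
  predicted <- b0 + b1 * height
  sum((weight - predicted)^2)
}

q_w1 <- sum_sq_error(-266.53, 6.1376)  # W1, using the unrounded coefficients
q_w2 <- sum_sq_error(-331.2, 7.1)      # W2
round(c(W1 = q_w1, W2 = q_w2), 1)
```

Under these assumed heights, W1 gives the smaller Q, which is why it is preferred over W2.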
The solid line, w = -266.53 + 6.1376h, is the best-fitting line because its least squared error is the minimum.
[Figure: scatterplot of weight vs. height with the solid line W1 = -266.5 + 6.1h]
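Rather than comparing candidate lines by hand, lm() finds the least-squares line directly. This is a minimal sketch; as before, the heights are assumed values (the slides show only the weights), chosen for illustration.

```r
# Heights are assumed; weights are the point labels from the slides.
height <- c(63, 64, 66, 69, 69, 71, 71, 72, 73, 75)
weight <- c(127, 121, 142, 157, 162, 156, 169, 165, 181, 208)

# Ordinary least squares fit of weight on height
fit <- lm(weight ~ height)

coef(fit)                 # intercept and slope of the best-fitting line
sum(residuals(fit)^2)     # sum of squared prediction errors for that line
```

No other straight line through these points can have a smaller sum of squared errors than the one lm() returns.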
Now, let's understand linear regression further with the help of a simple example.
Linear Regression - Example
Let’s take an example.
A company is facing high churn this year, with salary hikes being one of the major reasons.
So let us consider the company’s data and find the relationship between these two variables.
Here, the dependent variable is Churn_out_rate and the independent variable is Salary_hike.
> plot(Salary_hike, Churn_out_rate)
x-axis = Salary_hike
y-axis = Churn_out_rate
Conclusion:
From the graph, we can see that as the Salary hike increases, the Churn out rate
decreases.
Linear Regression - Use Cases
Real Estate
To model residential home prices as a function of the home's living area, number of bathrooms, number of bedrooms and lot size.
Medicine
To analyze the effect of a proposed radiation treatment on reducing tumor sizes based on patient attributes such as age or weight.
Demand forecasting
To predict demand for goods and services. For example, restaurant chains can predict the quantity of food needed depending on the weather.
Marketing
To predict a company’s sales based on the previous month’s sales and the company's stock prices.
A real estate consultation firm has data on the prices of apartments in Boston. Based on this data, the company wants to decide the price of new apartments.
Demo
Demo steps:
1. Data acquisition
2. Divide dataset
3. Exploratory analysis
4. Implement model
5. Optimize model
6. Prediction
7. Model validation
Let’s use the built-in Boston housing data for linear regression analysis.
To load it, we can use the following code:
> library(MASS)
> data(Boston)
The Boston data looks like this:
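A quick way to inspect the data after loading it is sketched below, using only base R functions on the Boston data frame from the MASS package.

```r
library(MASS)    # provides the Boston data frame
data(Boston)

dim(Boston)      # number of rows and columns
names(Boston)    # attribute (column) names
head(Boston, 3)  # first three rows
```

The last column, medv, is the target used in the rest of the demo.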
For a description of the data, we can use
> ?Boston
It contains details about the data such as
• Number of rows and columns
• Description of the attributes
Let's move forward to see the description of the attributes.
Description
The Boston data frame has 506 rows and 14 columns.
We will divide our entire dataset into two subsets:
• Training dataset -> to train the model
• Testing dataset -> to validate the model and make predictions
Here we will divide the data in a 7:3 ratio, so that 70% forms the training set and the remaining 30% forms the testing set.
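The slides do not show the code for the split, so the following is only one possible sketch of a random 7:3 split. The seed value and the name train_idx are assumptions; training_data and testing_data are the names used later in the demo.

```r
library(MASS)
data(Boston)

set.seed(123)    # assumed seed, only so the split is reproducible
n <- nrow(Boston)

# Randomly pick 70% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

training_data <- Boston[train_idx, ]    # 70% of the rows
testing_data  <- Boston[-train_idx, ]   # remaining 30%

c(train = nrow(training_data), test = nrow(testing_data))
```

Sampling the indices at random, rather than taking the first 70% of rows, avoids any ordering in the data leaking into the split.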
Let’s explore the relationships among the variables with a scatterplot matrix.
> library(lattice)
> splom(~Boston[c(1:6, 14)], groups = NULL, data = Boston, axis.line.tck = 0, axis.text.alpha = 0)
Let’s check the plots.
• The plot shows a positive linear trend between rm (average number of rooms) and medv (median value of homes).
• There is no clear relationship between indus (proportion of non-retail business) and medv.
• The plot shows a negative linear trend between lstat (lower status of the population) and medv.
• There is no clear relationship between tax (property tax rate) and medv.
When we have many variables, correlation is an important tool for checking the dependencies among them.
Correlation analysis gives us insight into the mutual relationships among variables.
To get the correlations among the variables of a data set, use the following code:
> cr <- cor(Boston)
This gives us the correlation values.
To visualize them, we can use the corrplot() function:
> library(corrplot)
> corrplot(cr, type = "lower")
From the plot we can read off the relationships among the variables:
• Dark blue signifies a strong positive relationship
• Dark red signifies a strong negative relationship
• The color scale runs from red to blue, and the size of each circle varies with the correlation coefficient
Example:
• medv and lstat have a strong negative relationship
• medv and rm have a strong positive relationship
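The same relationships can be read numerically from the correlation matrix. The sketch below assumes the full Boston data and uses only cor() from base R; with_medv is a hypothetical name.

```r
library(MASS)
data(Boston)

cr <- cor(Boston)

# Correlation of medv with lstat and with rm
round(cr["medv", c("lstat", "rm")], 2)

# All predictors ranked by the strength of their correlation with medv
with_medv <- cr["medv", colnames(cr) != "medv"]
sort(abs(with_medv), decreasing = TRUE)
```

Ranking by absolute value puts strong negative and strong positive relationships side by side, which is what the circle sizes in the plot convey.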
• Multicollinearity exists when two or more predictors are highly correlated among themselves.
• When the correlation among the X’s is low, OLS has a lot of independent information with which to estimate each coefficient.
• When the correlation among the X’s is high, OLS has very little independent information with which to estimate each coefficient. This makes us relatively uncertain about our estimates.
[Figure: two diagrams relating X1, X2 and Y]
How can I detect multicollinearity?
You can use the VIF (variance inflation factor) for it.
Let’s see how.
Variance inflation factor (VIF) measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient increases due to multicollinearity.
• A VIF of 1 means that the predictor is uncorrelated with the other predictors.
• Here, rad and tax have the highest VIF values, indicating high multicollinearity.
• nox, indus and dis are moderately correlated.
Let’s check the correlation between rad and tax from the corrgram.
• rad and tax are highly correlated, at 0.91.
• We can remove one of the predictors (rad or tax) to reduce multicollinearity.
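The slides do not show how the VIF values were obtained. One way to compute them by hand, sketched here, uses the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors. The sketch assumes the full Boston data and all 13 predictors, so the exact values may differ from those on the slide; vif_manual is a hypothetical name.

```r
library(MASS)
data(Boston)

predictors <- setdiff(names(Boston), "medv")

# VIF_j = 1 / (1 - R_j^2): regress each predictor on all the others
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = Boston))$r.squared
  1 / (1 - r2)
})

round(sort(vif_manual, decreasing = TRUE), 2)

# Pairwise correlation between the two predictors with the highest VIF
rad_tax_cor <- cor(Boston$rad, Boston$tax)
round(rad_tax_cor, 2)
```

A high VIF says a predictor is largely explained by the others; the pairwise correlation then helps identify which other predictor is responsible.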
Let’s find the equation representing this best-fit line:
> summary(model)
As per the summary, the equation representing our regression line is
medv = -34.671 + 9.102 * rm
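The slides do not show how the model behind this equation was built. One reading, assumed here, is a simple regression of medv on rm alone over the full Boston data; model_rm is a hypothetical name used to keep it apart from the multi-variable model.

```r
library(MASS)
data(Boston)

# Assumption: simple linear regression of medv on rm, fitted on the full data
model_rm <- lm(medv ~ rm, data = Boston)

round(coef(model_rm), 3)   # intercept and slope of the fitted line
```

The slope is the estimated change in medv for each additional room, holding nothing else fixed because rm is the only predictor.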
Now let’s build a model on the training set using the code below. Here we use all variables except tax.
> model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
    ptratio + black + lstat, data = training_data)
A description of the model can be obtained with the summary() function:
> summary(model)
Some of the important values are:
1. R-squared value
2. P-value
Here, R-squared = 0.726.
The R-squared value is the proportion of the variance in the dependent variable that the model explains.
The closer the R-squared value is to 1.0, the better the linear model fits the data.
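To show where the R-squared value comes from, the sketch below computes it by hand as 1 - SSE/SST, along with the adjusted R-squared. It is fitted on the full Boston data for simplicity (an assumption), so the result will not match the 0.726 obtained on the slides' training set.

```r
library(MASS)
data(Boston)

# Assumption: same formula as the demo, but fitted on the full data
model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
              ptratio + black + lstat, data = Boston)

sse <- sum(residuals(model)^2)                    # unexplained variation
sst <- sum((Boston$medv - mean(Boston$medv))^2)   # total variation
r_squared <- 1 - sse / sst

# Adjusted R-squared penalizes the number of predictors p
n <- nrow(Boston)
p <- length(coef(model)) - 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

c(r_squared = r_squared, adj_r_squared = adj_r_squared)
```

Both values agree with what summary(model) reports; the adjusted version is the one to compare when predictors are added or removed.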
P values are used to determine statistical significance in a hypothesis test.
• High p-values: your data are likely under a true null hypothesis.
• Low p-values: your data are unlikely under a true null hypothesis.
• Here, indus and age have relatively high p-values, so they can be dropped from the model.
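The p-values can be pulled out of the model summary as a plain vector, which makes it easier to spot candidates for removal. This sketch assumes the demo's formula fitted on the full Boston data, so the individual values may differ from the slide; the 0.05 threshold is a common convention, not something stated in the slides.

```r
library(MASS)
data(Boston)

# Assumption: the demo's formula, fitted on the full data
model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
              ptratio + black + lstat, data = Boston)

# One p-value per coefficient, taken from the summary's coefficient table
p_values <- coef(summary(model))[, "Pr(>|t|)"]

# Predictors ordered from least to most significant (intercept dropped)
sort(p_values[-1], decreasing = TRUE)

# Predictors above the conventional 0.05 threshold
names(p_values)[p_values > 0.05]
```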
Now let’s build a model on the training set using the code below. Here we exclude indus and age.
> model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black +
    lstat, data = training_data)
Here, the adjusted R-squared value remained the same despite removing indus and age from the model.
Now we can use our model to predict the output for our testing dataset.
We can use the following code to predict the output:
> predic <- predict(model, testing_data)
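A self-contained sketch of the prediction step is given below, with a root mean squared error (RMSE) added as a single-number summary of how far the predictions fall from the actual prices. The split code, the seed and the RMSE check are assumptions; the slides show only the predict() call.

```r
library(MASS)
data(Boston)

# Assumed 7:3 random split (the slides do not show this code)
set.seed(123)
train_idx <- sample(nrow(Boston), size = floor(0.7 * nrow(Boston)))
training_data <- Boston[train_idx, ]
testing_data  <- Boston[-train_idx, ]

# Reduced model from the demo, trained on the training set
model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black +
              lstat, data = training_data)

# Predict medv for the unseen testing rows
predic <- predict(model, newdata = testing_data)

# RMSE: typical size of the prediction error, in the units of medv
rmse <- sqrt(mean((testing_data$medv - predic)^2))
rmse
```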
To compare these values, we can use a plot.
Here we plot a line graph in which the green line represents the actual prices and the blue line represents the model's predictions.
> plot(testing_data$medv, type = "l", lty = 1, col = "green")
> lines(predic, type = "l", col = "blue")
As we can see from the graph, most of the predicted values overlap the actual values.
I have this dataset. What will be the estimated cost of the apartment?
Here’s the line of code and the predicted value.
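The code and data from this slide were not preserved, so the following is only a sketch of how a single new apartment could be priced. The apartment is hypothetical: each of its attributes is set to the median of the corresponding Boston column, and the model is fitted on the full data for simplicity.

```r
library(MASS)
data(Boston)

# Assumption: reduced model from the demo, fitted on the full data
model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black +
              lstat, data = Boston)

# Hypothetical apartment: every attribute set to the column median
predictors <- c("crim", "zn", "chas", "nox", "rm", "dis",
                "ptratio", "black", "lstat")
new_apartment <- as.data.frame(lapply(Boston[predictors], median))

# Point estimate plus a 95% prediction interval for medv
estimate <- predict(model, newdata = new_apartment, interval = "prediction")
estimate
```

The prediction interval is a reminder that a single estimated price carries uncertainty; the fit column is the model's best guess.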
Course Details
Go to www.edureka.co/data-science
Get Edureka Certified in Data Science Today!
Linear Regression Algorithm | Linear Regression in R | Data Science Training | Edureka

  • 1. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Linear Regression
  • 2. www.edureka.co/data-scienceEdureka’s Data Science Certification Training What Will You Learn Today? What is Regression?Machine Learning Types Of Regression Linear Regression - Example Linear Regression – Use Cases Demo In R: Real Estate Use Case 1 2 3 4 65
  • 3. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Machine Learning
  • 4. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Introduction To Machine Learning Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Training Data Learn Algorithm Build Model Perform Feedback
  • 5. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Machine Learning - Example  Facebook's News Feed uses machine learning to personalize each member's feed.  When you upload photos to Facebook, the service automatically highlights faces and suggests friends to tag.  Facebook also uses AI(Artificial Intelligence) to personalize • Newsfeeds • Advertisements • Trending news • Friend recommendations
  • 6. www.edureka.co/data-scienceEdureka’s Data Science Certification Training What Is Regression?
  • 7. www.edureka.co/data-scienceEdureka’s Data Science Certification Training What Is Regression?  Regression analysis is a predictive modelling technique.  It estimates the relationship between a dependent (target) and an independent variable (predictor). X-axis Y-axis Input value = 7.00 Predicted outcome = 123.9
  • 8. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Types Of Regression
  • 9. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Types Of Regression Linear Regression • When there is a linear relationship between independent and dependent variables. • When the dependent variable is binary (0/ 1, True/ False, Yes/ No) in nature. Logistic Regression Polynomial Regression • When the power of independent variable is more than 1. X Y X Y
  • 10. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Linear Regression
  • 11. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Linear Regression - Introduction  The linear regression model assumes a linear relationship between the input variables and the outcome variable.  This relationship can be expressed as Where, y = outcome variable x = input variables = random error = slope of the line = intercept y = β0 + β1x + ε β1 β0 ε
  • 12. www.edureka.co/data-scienceEdureka’s Data Science Certification Training I have a dataset consisting of height and weight of students. Let’s see how would linear regression fit into it.
  • 13. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Model Description Scatterplot of y vs. x We have a dataset of 10 students. We will use it to draw scatterplot between height and weight: 127 121 142 157 162 156 169 165 181 208 0 50 100 150 200 250 62 64 66 68 70 72 74 76 WEIGHT HEIGHT
  • 14. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Model Description Scatterplot of height vs. weight Now, the natural question arises — "what is the best fitting line?"  The prediction error (or residual error) is: Where, • yi is the observed value of the unit i (i.e, students). • ŷ is the predicted response (or fitted value) for unit i  The goal is to minimize the sum of the squared prediction errors (Least squared error or LER) ei = yi - ŷ 𝑄 = 𝑖=1 𝑛 (𝑦i − ŷ )2 127 121 142 157 162 156 169 165 181 208 0 50 100 150 200 250 62 64 66 68 70 72 74 76 WEIGHT HEIGHT W2= -331.2 + 7.1h W1= -266.5 + 6.1h
  • 15. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Model Description  Least squared error (LER) w1 = 597.4  Least squared error (LER) w2 = 766.5 W2 = -331.2 + 7.1h W1 = -266.5 + 6.1h
  • 16. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Model Description The solid line represented by w = -266.53 +6.1376 will be the best fit line as least squared error is minimum for it. 127 121 142 157 162 156 169 165 181 208 0 50 100 150 200 250 62 64 66 68 70 72 74 76 WEIGHT HEIGHT W1= -266.5 + 6.1h
  • 17. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Now, lets understand linear regression further with the help of a simple example.
  • 18. www.edureka.co/data-scienceEdureka’s Data Science Certification Training Linear Regression - Example Here, Dependent variable is Churn_out_rate And Independent variable is Salary_hike Let’s take an example, A company is facing high churnout this year, salary hike being one of the major reason. So let us consider a company’s data where we will find out the relationship between these two variables.
  • 19. Linear Regression - Example: > plot(Salary_hike, Churn_out_rate) Here the x-axis is Salary_hike and the y-axis is Churn_out_rate. Conclusion: from the graph, we can see that as the salary hike increases, the churn-out rate decreases.
  • 20. Linear Regression - Use Cases: Real Estate: to model residential home prices as a function of the home's living area, number of bathrooms, number of bedrooms, and lot size. Medicine: to analyze the effect of a proposed radiation treatment on reducing tumor sizes based on patient attributes such as age or weight. Demand forecasting: to predict demand for goods and services. For example, restaurant chains can predict the quantity of food needed depending on the weather. Marketing: to predict a company's sales based on the previous month's sales and the company's stock prices.
  • 21. Real Estate: A consultation firm has data comprising the prices of apartments in Boston. Based on this data, the company wants to decide the price of new apartments.
  • 23. Demo: The demo proceeds through these steps: Data acquisition, Divide dataset, Exploratory Analysis, Implement Model, Optimize Model, Prediction, Model Validation. Let's use the built-in Boston housing data for linear regression analysis. To load it we can use the following code: > library(MASS) > data(Boston) The Boston data looks like this: [Figure: preview of the Boston data, not captured in this transcript.]
  • 24. Demo: For a description of the data we can use > ?Boston It contains details about the data such as • No. of rows and columns • Attribute descriptions. Let's move forward to see the description of the attributes.
  • 25. Demo: Description: The Boston data frame has 506 rows and 14 columns.
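A quick way to confirm the description above from the R console, as a minimal sketch. It assumes only that the MASS package, which provides the `Boston` data frame, is installed.

```r
library(MASS)    # provides the Boston data frame
data(Boston)

print(dim(Boston))      # number of rows and columns
str(Boston)             # column names and types
print(head(Boston, 3))  # first few rows
```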
  • 26. Demo: We will divide our entire dataset into two subsets: • Training dataset -> to train the model • Testing dataset -> to validate the model and make predictions. Here we will divide the data in a 7:3 ratio, so that 70% forms the training set and the remaining 30% forms the testing set.
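The slides do not show the code used for the split, so the following is only one possible sketch of a 7:3 split in base R. The seed value and the name `train_idx` are assumptions; `training_data` and `testing_data` match the names used in later slides.

```r
library(MASS)
data(Boston)

set.seed(123)  # ASSUMPTION: arbitrary seed so the split is reproducible
n <- nrow(Boston)

# Randomly pick 70% of the row indices for training.
train_idx <- sample(seq_len(n), size = round(0.7 * n))

training_data <- Boston[train_idx, ]
testing_data  <- Boston[-train_idx, ]  # the remaining 30%

print(c(train = nrow(training_data), test = nrow(testing_data)))
```

Sampling row indices at random, rather than taking the first 70% of rows, avoids inheriting any ordering present in the data.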
  • 27. Demo: Let's examine the relationships among the variables with a scatterplot matrix. > library(lattice) > splom(~Boston[c(1:6,14)], groups=NULL, data=Boston, axis.line.tck = 0, axis.text.alpha = 0) Let's check the plots.
  • 28. Demo: The plot shows a positive linear trend between rm (average no. of rooms) and medv (median value of homes). There is no clear relationship between indus (proportion of non-retail business) and medv.
  • 29. Demo: The plot shows a negative linear trend between lstat (lower status of the population) and medv. There is no clear relationship between tax (property tax rate) and medv.
  • 30. Demo: When we have many variables, correlation is an important way to check the dependencies among them. Correlation analysis gives us insight into the mutual relationships among variables. To get the correlations among the variables of a data set, use the following code: > cr <- cor(Boston) This gives us the correlation values. To visualize them we can use the corrplot() function: > library(corrplot) > corrplot(cr, type = "lower")
  • 31. Demo: From the plot we can read off the relationships among the variables: • Dark blue signifies a strong positive relationship • Dark red signifies a strong negative relationship • The color scale runs from red to blue, and the size of each circle varies with the magnitude of the correlation. Example: medv and lstat have a large negative correlation; medv and rm have a large positive correlation.
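The same relationships can be read numerically from the correlation matrix, without the corrplot package. This is a minimal sketch on the full Boston data; `medv_cor` is a hypothetical variable name.

```r
library(MASS)
data(Boston)

cr <- cor(Boston)

# Correlation of every other variable with medv, strongest first.
medv_cor <- cr[, "medv"]
medv_cor <- medv_cor[names(medv_cor) != "medv"]
print(round(medv_cor[order(abs(medv_cor), decreasing = TRUE)], 2))

# The two relationships called out on the slide.
print(round(cr["medv", c("lstat", "rm")], 2))
```

Sorting by absolute value puts strong negative and strong positive correlations side by side, which mirrors what the circle sizes in the plot convey.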
  • 32. Demo: Multicollinearity exists when two or more predictors are highly correlated among themselves. When the correlation among the X's is low, OLS has plenty of information with which to estimate each coefficient. When the correlation among the X's is high, OLS has very little information to separate their effects, which makes us relatively uncertain about our estimates. [Figure: two diagrams relating X1, X2 and Y.]
  • 33. Demo: How can I detect multicollinearity? You can use the VIF (variance inflation factor). Let's see how.
  • 34. Demo: The variance inflation factor (VIF) measures the increase in the variance (the square of the estimate's standard deviation) of an estimated regression coefficient due to multicollinearity. A VIF of 1 means that the predictor is not correlated with the other predictors. Here, rad and tax have the highest VIF values, indicating high multicollinearity. nox, indus and dis are moderately correlated with the other predictors.
  • 35. Demo: Let's check the correlation between rad and tax from the corrgram. rad and tax are highly correlated, at 0.91. We can drop one of the two predictors (rad or tax) to reduce the multicollinearity.
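The slides do not show how the VIF values were obtained. One way to compute them in base R, sketched here under the assumption that all 13 predictors are included and the full dataset is used, relies on the identity that the VIF of predictor j, 1 / (1 - R_j^2), equals the j-th diagonal element of the inverse of the predictors' correlation matrix. Values may differ slightly from the slide if it used only the training set.

```r
library(MASS)
data(Boston)

# All columns except the response medv.
predictors <- Boston[, setdiff(names(Boston), "medv")]

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal element of the inverse
# of the correlation matrix of the predictors.
vif_values <- diag(solve(cor(predictors)))
names(vif_values) <- names(predictors)

print(round(sort(vif_values, decreasing = TRUE), 2))

# Pairwise correlation between the two predictors with the largest VIF.
rad_tax_cor <- cor(Boston$rad, Boston$tax)
print(round(rad_tax_cor, 2))
```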
  • 36. Demo: Let's find the equation representing the best-fit line: > summary(model) As per the summary, the equation representing our regression line is medv = -34.671 + 9.102 * rm
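The call that produced this equation is not shown on the slide. As a sketch, assuming a simple regression of medv on rm fitted to the full Boston dataset (the name `simple_model` is hypothetical), the quoted coefficients can be reproduced like this:

```r
library(MASS)
data(Boston)

# ASSUMPTION: one predictor (rm), fitted on the full dataset.
simple_model <- lm(medv ~ rm, data = Boston)

simple_coef <- coef(simple_model)
print(round(simple_coef, 3))
```

The slope is read as the expected change in medv for one additional room, on average.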
  • 37. Demo: Now let's build a model on the training set using the code below. Here we use all variables except tax: > model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + ptratio + black + lstat, data = training_data)
  • 38. Demo: A description of the model can be obtained with the summary() function: > summary(model) Some of the important values are: 1. R-squared value 2. P-value
  • 39. Demo: Here R-squared = 0.726. The R-squared value indicates how well the model fits the data: it is the proportion of the variance in the response that the model explains. The closer the R-squared value is to 1.0, the better the linear model fits.
  • 40. Demo: P-values are used to determine statistical significance in a hypothesis test. High p-values: your data are likely with a true null. Low p-values: your data are unlikely with a true null. Here, indus and age have relatively high p-values, so they can be dropped from the model.
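Both quantities can be pulled out of the summary object programmatically. This is a minimal sketch; the split (seed and method) is an assumption because the slides do not show it, so the numbers will not match the slide's 0.726 exactly.

```r
library(MASS)
data(Boston)

set.seed(123)  # ASSUMPTION: arbitrary seed; the slides do not show the split
train_idx <- sample(seq_len(nrow(Boston)), size = round(0.7 * nrow(Boston)))
training_data <- Boston[train_idx, ]

# Model from the slides: all predictors except tax.
model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
              ptratio + black + lstat, data = training_data)

model_summary <- summary(model)

# Proportion of variance in medv explained by the model.
r_squared <- model_summary$r.squared
print(round(r_squared, 3))

# One p-value per coefficient, largest (least significant) first.
p_values <- coef(model_summary)[, "Pr(>|t|)"]
print(round(sort(p_values, decreasing = TRUE), 4))
```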
  • 41. Demo: Now let's rebuild the model on the training set using the code below. Here we exclude indus and age: > model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat, data = training_data)
  • 42. Demo: Here, the adjusted R-squared value remained the same despite removing indus and age from the model.
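The comparison between the two models can be sketched as follows. The split is again an assumption, and `full_model` and `reduced_model` are hypothetical names for the two models defined on the earlier slides; the exact values depend on the split.

```r
library(MASS)
data(Boston)

set.seed(123)  # ASSUMPTION: arbitrary seed; the slides do not show the split
train_idx <- sample(seq_len(nrow(Boston)), size = round(0.7 * nrow(Boston)))
training_data <- Boston[train_idx, ]

# All predictors except tax.
full_model <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
                   ptratio + black + lstat, data = training_data)

# Same predictor set as the second model on the slides (indus and age dropped).
reduced_model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio +
                      black + lstat, data = training_data)

adj_r2 <- c(full = summary(full_model)$adj.r.squared,
            reduced = summary(reduced_model)$adj.r.squared)
print(round(adj_r2, 4))

# F-test for whether the dropped terms improve the fit.
print(anova(reduced_model, full_model))
```

Plain R-squared can never increase when predictors are removed, which is why adjusted R-squared, with its penalty for the number of predictors, is the fairer basis for this comparison.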
  • 43. Demo: Now we can use our model to predict the output for our testing dataset. We can use the following code for the prediction: > predic <- predict(model, testing_data)
  • 44. Demo: To compare these values we can use plots. Here we plot a line graph in which the green line represents the actual prices and the blue line represents the values predicted by the model. > plot(testing_data$medv, type = "l", lty = 1, col = "green") > lines(predic, type = "l", col = "blue") As we can see from the graph, most of the predicted values overlap the actual values.
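A numeric complement to the visual comparison, as a hedged sketch: the slides only show the plot, so the root-mean-square error (RMSE) computed here is an addition, and the split is an assumption as before.

```r
library(MASS)
data(Boston)

set.seed(123)  # ASSUMPTION: arbitrary seed; the slides do not show the split
train_idx <- sample(seq_len(nrow(Boston)), size = round(0.7 * nrow(Boston)))
training_data <- Boston[train_idx, ]
testing_data  <- Boston[-train_idx, ]

model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat,
            data = training_data)

# Predicted medv for every row of the testing set.
predic <- predict(model, newdata = testing_data)

# Typical size of the prediction error, in the units of medv.
rmse <- sqrt(mean((testing_data$medv - predic)^2))
print(round(rmse, 2))

# Correlation between actual and predicted values.
print(round(cor(testing_data$medv, predic), 2))
```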
  • 45. Demo: I have this dataset. What will be the estimated cost of the apartment? Here's the code line and predicted value. [Figure: code and predicted value, not captured in this transcript.]
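The slide's own input and code are not captured in this transcript, so the sketch below is not a reconstruction of it. It only illustrates how a single new observation can be scored with `predict()`. The data frame `new_apartment` is hypothetical and is filled with the median of each predictor, purely as a placeholder input.

```r
library(MASS)
data(Boston)

set.seed(123)  # ASSUMPTION: arbitrary seed; the slides do not show the split
train_idx <- sample(seq_len(nrow(Boston)), size = round(0.7 * nrow(Boston)))
training_data <- Boston[train_idx, ]

model <- lm(medv ~ crim + zn + chas + nox + rm + dis + ptratio + black + lstat,
            data = training_data)

# HYPOTHETICAL input: one row holding the median of every predictor.
new_apartment <- as.data.frame(lapply(Boston[, names(Boston) != "medv"], median))

# Point estimate with a 95% prediction interval (columns fit, lwr, upr).
estimate <- predict(model, newdata = new_apartment, interval = "prediction")
print(round(estimate, 2))
```

The columns of `new_apartment` must carry the same names as the predictors in the model formula; extra columns are ignored.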
  • 46. Course Details: Go to www.edureka.co/data-science. Get Edureka Certified in Data Science Today! What our learners have to say about us! Shravan Reddy says: “I would like to recommend any one who wants to be a Data Scientist just one place: Edureka. Explanations are clean, clear, easy to understand. Their support team works very well.. I took the Data Science course and I'm going to take Machine Learning with Mahout and then Big Data and Hadoop”. Gnana Sekhar says: “Edureka Data science course provided me a very good mixture of theoretical and practical training. LMS pre recorded sessions and assignments were very good as there is a lot of information in them that will help me in my job. Edureka is my teaching GURU now...Thanks EDUREKA.” Balu Samaga says: “It was a great experience to undergo and get certified in the Data Science course from Edureka. Quality of the training materials, assignments, project, support and other infrastructures are a top notch.”
