Lecture-2 Unit 2
Management (YIASCM)
Supervised Learning
Objective
• At the end of this session, the learner will be able to understand:
⮚ Simple Linear Regression in Machine Learning
⮚ Simple Linear Regression Model
⮚ Mean Squared Error
⮚ Implementation of Simple Linear Regression Algorithm
using Python
⮚ Multiple Linear Regression
⮚ Implementation of Multiple Linear Regression model using
Python
• Summary
• Reference
Simple Linear Regression in Machine Learning
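y = a0 + a1x + ε
Where,
• y = The dependent (target) variable, and x = the independent (predictor) variable.
• a0 = The intercept of the regression line (the value of y obtained by putting x = 0).
• a1 = The slope of the regression line, which tells whether the line is increasing (a1 > 0) or decreasing (a1 < 0).
• ε = The error term, the part of y that the line does not explain. (For a good model it will be small.)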
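As a minimal sketch of how a0 and a1 can be estimated, the snippet below applies the standard least-squares formulas to a small made-up data set. The numbers are illustrative assumptions, not data from this lecture.

```python
import numpy as np

# Hypothetical data that lies exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares estimates of the slope (a1) and the intercept (a0)
x_mean, y_mean = x.mean(), y.mean()
a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a0 = y_mean - a1 * x_mean

print(f"intercept a0 = {a0:.2f}, slope a1 = {a1:.2f}")  # intercept a0 = 1.00, slope a1 = 2.00
```

Because the made-up points have no noise, the error term is zero and the estimates recover the line exactly.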
Mean Squared Error | What Is Mean Squared Error?
https://youtu.be/beIgcdf0YDE?si=YLwL9gw0X4GOfI1s
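Mean Squared Error is the average of the squared differences between the actual and the predicted values: MSE = (1/n) Σ (y_i − ŷ_i)². The sketch below computes it by hand and compares the result with scikit-learn's mean_squared_error. The two arrays are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

# Manual computation: mean of the squared errors
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same quantity from scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual)   # 0.375
print(mse_sklearn)  # 0.375
```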
Example
• In this example, we generate some random
data, split it into training and testing sets,
create a Linear Regression model, train it using
the training data, make predictions on the test
data, and then evaluate the model using Mean
Squared Error. Finally, we plot the training
data, test data, and the regression line.
Code
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate some example data: y = 4 + 3x plus Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Split the data into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
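The listing stops after the split. A minimal sketch of the remaining steps described in the example (create the model, train it, predict on the test data, and evaluate with Mean Squared Error) could look as follows. The names model, y_pred and mse are assumptions, and plotting is left out to keep the sketch short.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same synthetic data as in the example: y = 4 + 3x plus Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create the model and train it on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test data and evaluate with Mean Squared Error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# y is 2-D here, so intercept_ has shape (1,) and coef_ has shape (1, 1)
print("Intercept:", model.intercept_[0])
print("Slope:", model.coef_[0][0])
print("Mean Squared Error:", mse)
```

The estimated intercept and slope should land near the values 4 and 3 used to generate the data, with the gap caused by the added noise.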
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv('Salary_Data.csv')
The dataset has two variables: Salary and Experience.
• After that, we need to extract the dependent and independent variables from the given dataset.
• The independent variable is years of experience, and
the dependent variable is salary.
• Below is code for it:
x = data_set.iloc[:, :-1].values  # independent variable: every column except the last
y = data_set.iloc[:, 1].values    # dependent variable: the second column (Salary)
The X (independent) and Y (dependent) variables have now been extracted from the given dataset.
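To see what the two iloc expressions return, the sketch below runs them on a tiny stand-in for Salary_Data.csv. The column names and values are assumptions made for illustration, since the real file is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for Salary_Data.csv (names and values are assumptions)
data_set = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5],
    "Salary": [39000, 43500, 54400, 61100],
})

x = data_set.iloc[:, :-1].values  # every column except the last -> 2-D array
y = data_set.iloc[:, 1].values    # second column only -> 1-D array

print(x.shape)  # (4, 1)
print(y.shape)  # (4,)
```

The 2-D shape of x matters because scikit-learn estimators expect the features as a matrix with one row per observation.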
• Next, we will split both variables into the test set and
training set.
• We have 30 observations, so we will take 20
observations for the training set and 10 observations
for the test set.
• We are splitting our dataset so that we can train our
model using a training dataset and then test the
model using a test dataset.
• The code for this is given below:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)
By executing the above code, we get the x_train, x_test, y_train and y_test datasets.
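As a quick check of the 20/10 split, the sketch below applies the same call to 30 placeholder observations. The arrays are assumptions that only stand in for the salary data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 30-row salary data
x = np.arange(30).reshape(-1, 1)
y = 2 * np.arange(30)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0)

print(len(x_train), len(x_test))  # 20 10
```

Fixing random_state makes the shuffle reproducible, so the same rows end up in each set on every run.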
For simple linear regression, we will not use feature scaling. Ordinary least squares gives the same predictions whatever the scale of the feature (only the size of the coefficient changes), so we don't need to perform it here. Now our dataset is prepared, and we can start building a Simple Linear Regression model for the given problem.
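One way to see that skipping feature scaling is harmless here is to fit the model with and without scaling and compare the predictions. The sketch below does this on made-up experience and salary values (assumptions, not the lecture's dataset), using StandardScaler as one possible scaler.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: years of experience and noisy salaries
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(30, 1))
y = 25000 + 9000 * x[:, 0] + rng.normal(0, 3000, size=30)

# Fit on the raw feature
raw_pred = LinearRegression().fit(x, y).predict(x)

# Fit on the standardized feature
x_scaled = StandardScaler().fit_transform(x)
scaled_pred = LinearRegression().fit(x_scaled, y).predict(x_scaled)

print(np.allclose(raw_pred, scaled_pred))  # True
```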
Step-2: Fitting the Simple Linear Regression to
the Training Set:
• Now the second step is to fit our model to the training
dataset.
• To do so, we will import the LinearRegression class from the linear_model module of scikit-learn.
• After importing the class, we are going to create an object of the class named regressor.
• The code for this is given below:
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
• In the above code, we have used the fit() method to train the regressor object on x_train and y_train.
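After fit() has run, the estimated intercept and slope are available as attributes of the model. The sketch below shows this on a few made-up points that lie exactly on a line, so the recovered values are easy to check. The data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on salary = 25000 + 5000 * years
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([30000.0, 35000.0, 40000.0, 45000.0])

regressor = LinearRegression()
regressor.fit(x_train, y_train)

print(regressor.intercept_)  # estimate of a0, about 25000 for this data
print(regressor.coef_[0])    # estimate of a1, about 5000 for this data
```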
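mtp.scatter(x_train, y_train, color="green")                # actual training observations
mtp.plot(x_train, regressor.predict(x_train), color="red")  # fitted regression line
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()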
• Output:
• By executing the above lines of code, we will get a graph plot as an output.
• In the plot, we can see the real observations as green dots, and the predicted values lie on the red regression line. The regression line shows a correlation between the dependent and independent variables.
• How well the line fits can be judged from the differences between the actual values and the predicted values. In the plot, most of the observations are close to the regression line, hence our model is good for the training set.
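The differences between actual and predicted values can also be computed directly instead of being read off the plot. The sketch below does so on made-up training data (an assumption, not the lecture's dataset) and reports the R^2 score from LinearRegression.score as one common summary of fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: years of experience and salaries
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([31000.0, 34000.0, 41000.0, 44000.0, 51000.0])

regressor = LinearRegression().fit(x_train, y_train)

# Residual = actual value minus predicted value
residuals = y_train - regressor.predict(x_train)
r2 = regressor.score(x_train, y_train)

print("Residuals:", residuals)
print("Mean absolute residual:", np.mean(np.abs(residuals)))
print("R^2 on the training set:", r2)
```

Small residuals relative to the size of the salaries, and an R^2 close to 1, both indicate that the observations lie close to the line.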
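Step-5: Visualizing the Test set results:
mtp.scatter(x_test, y_test, color="blue")                   # actual test observations
mtp.plot(x_train, regressor.predict(x_train), color="red")  # regression line fitted on the training set
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()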
• Output:
In the plot, the observations are shown in blue and the predictions are given by the red regression line. Most of the observations are close to the regression line, so we can say our Simple Linear Regression model fits the test data well.
Multiple Linear Regression
• In the previous topic, we have learned about
Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model
the response variable (Y). But there may be various
cases in which the response variable is affected by
more than one predictor variable; for such cases,
the Multiple Linear Regression algorithm is used.
• “Multiple Linear Regression is one of the important
regression algorithms which models the linear
relationship between a single dependent
continuous variable and more than one
independent variable.”
Example:
• Prediction of CO2 emission based on engine size and number of cylinders in a car.
MLR equation:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
• Where,
• Y = Output/Response variable
• b0, b1, b2, b3, ..., bn = Coefficients of the model
• x1, x2, x3, ..., xn = The independent/feature variables
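For the CO2 example, a prediction is the intercept b0 plus a weighted sum of the feature values. The sketch below evaluates this for two cars. The coefficient values and the car data are made-up assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients for CO2 = b0 + b1 * engine_size + b2 * cylinders
b0 = 100.0
b = np.array([30.0, 10.0])  # b1 (per litre of engine size), b2 (per cylinder)

# Each row is one car: [engine size in litres, number of cylinders]
X = np.array([
    [2.0, 4],
    [3.5, 6],
])

# Intercept plus the weighted sum of the features, for every row at once
y_hat = b0 + X @ b

print(y_hat)  # [200. 265.]
```

For the first car this is 100 + 30 * 2.0 + 10 * 4 = 200.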
Implementation of Multiple Linear Regression model
using Python:
Problem Description:
• We have a dataset of 50 start-up companies. This dataset contains five main pieces of information: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can determine which company has the maximum profit, and which factor affects the profit of a company the most.
• Since we need to find the Profit, it is the dependent variable, and the other four variables are independent variables. Below are the main steps of building the MLR model:
• Data Pre-processing Steps
• Fitting the MLR model to the training set
• Predicting the result of the test set
Step-1: Data Pre-processing
• Importing libraries: Firstly, we will import the libraries which will help in building the model. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the dataset, the last independent-variable column (State) contains categorical values, which are not suitable for fitting the model directly. So we need to encode this variable.
Encoding Dummy Variables:
• As we have one categorical variable (State), which cannot be directly applied to the model, we will encode it.
• To encode the categorical variable into numbers, we will use the LabelEncoder class.
• But label encoding alone is not sufficient, because the integer codes imply an order among the states that does not exist, which may create a wrong model. To remove this problem, we will use OneHotEncoder, which will create the dummy variables.
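#Categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
lblencode = LabelEncoder()
x[:, 3] = lblencode.fit_transform(x[:, 3])
# OneHotEncoder(categorical_features=[3]) is no longer accepted by current scikit-learn versions;
# ColumnTransformer applies the encoder to column 3 and places the dummy columns first
ct = ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough", sparse_threshold=0)
x = ct.fit_transform(x).astype(float)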
After encoding, the State column has been converted into dummy variables (0 and 1). Each dummy variable column corresponds to one State, which we can check by comparing it with the original dataset. The first column corresponds to California, the second column corresponds to Florida, and the third column corresponds to New York.
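The column order follows from the encoder sorting the category names alphabetically. A minimal sketch on four made-up State values (an assumption standing in for the real column) shows the resulting dummy columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical State column with one value per company
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

encoder = OneHotEncoder()
dummies = encoder.fit_transform(states).toarray()  # sparse result converted to a dense array

print(encoder.categories_[0])  # ['California' 'Florida' 'New York']
print(dummies)
```

Each row of dummies contains a single 1, in the column of that row's State, and 0 elsewhere.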
• Now, we write a single line of code to avoid the dummy variable trap:
#avoiding the dummy variable trap:
x = x[:, 1:]
• If we do not remove the first dummy variable, then it may introduce
multicollinearity in the model.
• After this line runs, the first dummy column has been removed.
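The multicollinearity can be made visible with a rank check: the three dummy columns always sum to 1, which duplicates the intercept column of the model. The sketch below uses a small made-up dummy matrix (an assumption) and compares the rank before and after dropping the first dummy column.

```python
import numpy as np

# Hypothetical dummy columns for California, Florida, New York (one 1 per row)
dummies = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)

intercept = np.ones((4, 1))
full = np.hstack([intercept, dummies])            # intercept + all three dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # first dummy column dropped

print(np.linalg.matrix_rank(full))     # 3, although there are 4 columns
print(np.linalg.matrix_rank(reduced))  # 3, equal to the number of columns
```

With all three dummies, one column is an exact linear combination of the others, so the coefficients are not uniquely determined. Dropping one dummy removes that redundancy.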
• Now we will split the dataset into training and test set.
The code for this is given below:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
• The above code splits our dataset into a training set (80% of the 50 rows, i.e. 40 companies) and a test set (20%, i.e. 10 companies).
Step-2: Fitting our MLR model to the Training set:
• Now that our dataset is prepared, we will fit our regression model to the training set. This is done the same way as for the Simple Linear Regression model. The code for this will be:
#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
Output:
Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Now, we have successfully trained our model using the training dataset. In the
next step, we will test the performance of the model using the test dataset.
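As a minimal sketch of what the fitted MLR model contains, the snippet below trains LinearRegression on synthetic data with three features and prints one coefficient per feature plus the intercept. The data, the "true" coefficients and the noise level are assumptions standing in for the start-up dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 3 numeric features, known linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(50, 3))
y = 5.0 + x @ np.array([0.8, 0.1, 0.3]) + rng.normal(0, 1.0, size=50)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(x_train, y_train)

# One coefficient (b1, b2, b3) per feature, plus the intercept b0
print("Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)
print("R^2 on the training set:", regressor.score(x_train, y_train))
```

The feature with the largest coefficient has the strongest effect per unit of change, which is only comparable across features when they are measured on similar scales.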
Step-3: Prediction of Test set results: