
AI Free Basic Course | Lecture 17 | Linear Regression using Sklearn

The word regression has been repeated many times. Regression is a technique of supervised learning. There are two techniques in supervised learning: one is classification and the other is regression. In classification we predict a category / nominal value. A category or nominal value is something that has a label, for example smoker/non-smoker or cat/dog, which has to be converted into a numerical value by some effort.
In prediction we use the words positive and negative. For example, if a picture is given to the model, how will it predict whether it is a picture of a cat or not? The positive value here would be that it is a cat, as a cat is what we want to recognize. In a medical test, the result is either positive or negative, and the positive class is the value we are most interested in predicting. Remember, it does not depend on whether the covid report itself is positive or not; it depends on what we have set the algorithm to predict. Covid-negative could be our positive class if, for example, we have to distribute shirts to covid-negative persons.
Regression is the other technique of supervised learning. In regression we always predict a numerical value or score. We can place a machine that converts the numeric value into 0 or 1; machine here does not mean a physical machine like a drill machine, it is software. When we convert our values to 0 or 1, the problem becomes classification.
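As a minimal sketch (not from the lecture, and assuming a made-up threshold of 0.5), converting a numeric score into a 0/1 class could look like this:

# Hedged sketch: thresholding a numeric score into a 0/1 class label.
# The 0.5 threshold is a made-up illustration, not from the lecture.
def to_class(score, threshold=0.5):
    return 1 if score >= threshold else 0

print(to_class(0.83))  # 1 (positive class)
print(to_class(0.12))  # 0 (negative class)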
Now we are going to discuss another form of regression in which we predict a score. For example, we may have to predict the score of Babar Azam, the price of a house, or marks in an exam. There are many algorithms for regression. The first algorithm we are going to use is linear regression. As the name suggests, it is related to a line, with the help of which we will make predictions: in linear regression we use a line to predict the score.
While doing linear regression we draw a line between two axes. Presently we are discussing only two axes; linear regression with three or n axes will be discussed later.

For predicting a dependent variable from an independent variable we make use of linear regression. The data points shown in the diagram are coordinates showing corresponding values of x and y. In 2-D there are two numerical values for each point: one is called the x coordinate and the other the y coordinate.
After drawing the points, our next task is to draw a line that is close to these points. How to measure the closeness of the points to the line will be discussed in later classes. For now, our goal is to draw a line that is close to the data points. The line we draw represents the linear regression model; the linear regression model chooses a line. Then, by selecting any point on the x-axis and drawing a perpendicular up to the line, the corresponding value on the line is the prediction.
Let us explain this with the help of an example. Suppose we are going to predict the value of a plot from its size. The size of the plot is plotted along the x-axis and its corresponding value along the y-axis. If there is a plot of 3 marla whose value is Rs. 40,000/-, we fix a data point against Rs. 40,000/-. We pick another plot of 6 marla whose value is Rs. 60,000/- and fix a data point against Rs. 60,000/-, and so on. In this way we get many scattered data points, some of which follow the trend and others do not. We will draw multiple lines and observe which line has more data points around it. How to draw these multiple lines will be taught in coming classes. The line with more data points around it is our desired linear regression model.

If we have two items, an intercept and a slope, we can draw any line. If we have multiple intercepts and slopes, we have many lines. When we find an intercept and slope whose line is close to our data points, that is our desired linear regression model, which is used to predict.
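To make this concrete, here is a minimal sketch (the slope and intercept values are made up for illustration, not taken from the lecture) of how a line defined by a slope and an intercept produces a prediction:

# Hedged sketch: a line y = slope * x + intercept used for prediction.
# The slope and intercept values are made-up illustrations.
slope = 7000       # hypothetical: rupees per marla
intercept = 19000  # hypothetical: base value in rupees

def predict_value(plot_size_marla):
    # Read the prediction off the line
    return slope * plot_size_marla + intercept

print(predict_value(3))  # 40000
print(predict_value(6))  # 61000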
Now we introduce another concept of machine learning: loss. It is a very useful concept for training a model. The greater the distance of the line from the data points, the greater the loss, and vice versa. We move our search systematically from lines having more loss towards lines having less loss.
It is pertinent to mention that loss and error can be used interchangeably; they represent the same concept. The more accurately a given line predicts or represents the data points, the less the error or loss; the less accurately it does so, the greater the error or loss.
Now the question arises: why do we take the square? Suppose we have four points and we draw a line.
1. One point lies at +10 from the line on the graph.
2. The second point lies at -10.
3. The third point lies at +40.
4. The fourth point lies at -40.
What happens in this case? When we add all these figures, the result is zero. Despite no data point being close to the linear regression model, we have reduced the error to zero. Looks satisfactory!
So in order to avoid this, all four differences must be made positive. One way to convert negative to positive is to take the square. Mean squared error means we have squared all the errors and then taken their mean: the higher the mean squared error, the worse the line represents the data, and the lower it is, the better. Mean squared error is the loss function.
The other way to convert negatives to positives is to take the absolute value.
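As a minimal sketch of this point, using the four made-up residuals above: MSE = (1/n) * sum of (actual - predicted)^2, and the plain sum cancels to zero while the squared and absolute versions do not.

# Hedged sketch: why summing raw residuals is misleading.
# The four residuals are the made-up example values from the text.
residuals = [10, -10, 40, -40]

plain_sum = sum(residuals)                             # cancels to 0
mse = sum(r ** 2 for r in residuals) / len(residuals)  # mean squared error
mae = sum(abs(r) for r in residuals) / len(residuals)  # mean absolute error

print(plain_sum)  # 0
print(mse)        # 850.0
print(mae)        # 25.0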
Let us summarize. Today's topic is linear regression, in which we want to predict a continuous value. For example, we want to predict what score a batsman will make in a match: 0, 20, 50, 250, etc. To make the prediction we fit a linear regression line. The straight line drawn in linear regression has a slope, an intercept, and data points around it. Our purpose is to find the line that has the most data points around it; in other words, our purpose is to reach a consensus that everybody in the family agrees on.
We also discussed the concept of the loss function, which means that when the model performs badly we impose a penalty. With the passage of training, the value of the error diminishes and approaches zero; more data points lie around the line, and we call this line the optimal line. The error measure we use here is the mean squared error, which avoids the zero totals that arise when equal positive and negative differences cancel out.
Now let us move to Colab.

Linear Regression using Sklearn

0. Hope to Skills - Free AI Course

This notebook covers the following concepts:

1. Visualization
2. Seaborn

TRAINING DATA PRE-PROCESSING


The first step in the machine learning pipeline is to clean and transform the training data into a usable format for analysis and modeling.
As such, data pre-processing addresses:
 Assumptions about data shape
 Incorrect data types
 Outliers or errors
 Missing values
 Categorical variables
The notebook is being shared with you. Now we confirm that our desired file is available in uploads. We are using US housing data; the same dataset was given in Assignment 7, where the task was to predict the price of a house from various factors like the number of rooms, the area of the house, the locality it is in, and the average household income.
At this stage we are going to skip the routine EDA tasks like data cleaning. We perform our preliminary steps like:
Importing the library
Loading the Data Set
(a sketch of these two steps follows)
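The lecture does not reproduce these two cells, so the following is a minimal hedged sketch of what they typically look like; the file name 'USA_Housing.csv' is an assumption, not something confirmed by the lecture.

# Hedged sketch of the import/load cells (not shown in the lecture itself)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumption: the uploaded CSV is named 'USA_Housing.csv'
full_data = pd.read_csv('USA_Housing.csv')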
Data Shape
After loading the dataset, I examine its shape to get a better sense
of the data and the information it contains.
# Data shape
print('train data:', full_data.shape)
train data: (5000, 7)
# View first few rows
full_data.head(5)
In the dataset shown above, longitude, latitude, housing median age, total rooms, etc. are given. We can guess that the Address column in this dataset has nothing to do with our desired prediction, so we will drop it. In reality little is more important than the location and address of a house, but the purpose of dropping this column is to simplify things.

Now we generate a heatmap to check for missing values. We see that there are no missing values in this dataset.
Missing Data
A heatmap will help better visualize which features are missing the most information.
# Heatmap
sns.heatmap(full_data.isnull(), yticklabels=False, cbar=False, cmap='tab20c_r')
plt.title('Missing Data: Training Set')
plt.show()
Now we remove the address column as given below

# Remove Address feature
full_data.drop('Address', axis = 1, inplace = True)

The code you provided seems to be dropping the 'Address' column from the DataFrame 'full_data'. Let me explain each part of the code:

1. full_data: This is the DataFrame from which the column 'Address' is being dropped.
2. .drop('Address', axis=1): The drop() method is used to remove a
specified column or row from the DataFrame. In this case, the
column being dropped is 'Address'. The axis=1 argument indicates
that we are dropping a column (as opposed to a row, where axis=0
would be used).
3. inplace=True: This parameter ensures that the operation is
performed directly on the DataFrame 'full_data' itself, and the
changes are made in place. If inplace=False (which is the
default), the drop operation would return a new DataFrame with
the specified column removed, leaving the original DataFrame
unchanged.

So, after running this code, the 'Address' column will be removed from the 'full_data' DataFrame.

Now we remove the missing data

# Remove rows with missing data
full_data.dropna(inplace = True)

The code you provided is removing rows with missing data (NaN
values) from the DataFrame 'full_data'. Let's break down the
code:

1. full_data: This is the DataFrame from which the rows with missing data will be removed.
2. .dropna(): The dropna() method is used to remove rows with
missing values. By default, it removes any row that contains at
least one NaN value. If you want to remove rows with missing
values only in specific columns, you can pass the subset
parameter to specify those columns.
3. inplace=True: This parameter ensures that the operation is
performed directly on the DataFrame 'full_data' itself, and the
changes are made in place. If inplace=False, the drop operation
would return a new DataFrame with the specified rows removed,
leaving the original DataFrame unchanged.

After running this code, any rows in the 'full_data' DataFrame that contain at least one NaN value will be removed, and the DataFrame will be modified in place. The DataFrame will now contain only rows with complete data (no NaN values).
# Numeric summary
full_data.describe()

The code you provided, full_data.describe(), will generate a numeric summary of the DataFrame 'full_data'. The describe() method provides descriptive statistics for the numeric columns in the DataFrame. Let's break down what this function does:

1. full_data: This is the DataFrame for which the numeric summary will be generated.
2. .describe(): This method computes various descriptive statistics
for each numeric column in the DataFrame. The statistics
include measures like count, mean, standard deviation, minimum
value, 25th percentile (Q1), median (50th percentile or Q2),
75th percentile (Q3), and maximum value.

The output will be a new DataFrame containing the numeric summary statistics for each numeric column in 'full_data'. It is a great way to get a quick overview of the central tendency, spread, and distribution of the numerical data in the DataFrame.
Now we have a dataset in which all values are numeric; no string or character values exist that need to be converted into numbers. In these values we need to observe the number of digits before the decimal under each column, and also the range of these values.
Now look at the values given in the Price column. The 'e' shows the exponential (scientific) notation of the house price; it indicates how many digits the value has.
describe() gives the five-number summary. Look at the count row: it must be the same for all columns. If the counts are not the same for all columns, data cleaning needs to be performed to make them equal for all columns.
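As a quick hedged sketch (not from the lecture), one way to verify that the counts match is to compare per-column non-null counts directly:

# Hedged sketch: per-column counts should all be equal.
print(full_data.count())         # non-null entries per column
print(full_data.isnull().sum())  # missing entries per column (should be 0)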

GETTING MODEL READY


Now that we've explored the data, it is time to get these
features 'model ready'. Categorical features will need to be
converted into 'dummy variables', otherwise a machine learning
algorithm will not be able to take in those features as inputs.
Now we see the shape of the data.

# Shape of train data
full_data.shape
(5000, 6)
Now the train data is perfect for a machine learning algorithm:

 all the data is numeric
 everything is concatenated together

OBJECTIVE 2: MACHINE LEARNING


Next, I will feed these features into a regression algorithm to determine the best performance using a simple framework: Split, Fit, Predict, Score It.

Target Variable Splitting


We will split the full dataset into input and target variables. Input is also called the feature variables; output refers to the target variable. Our target is to determine the price of the house based on the different features.
As the Price column is our target column, we need to separate it from the dataset by removing it.
# Split data to be used in the models
# Create matrix of features
x = full_data.drop('Price', axis = 1)  # grabs everything else but 'Price'
All features have been stored in x as input except Price.
# Create target variable
y = full_data['Price']  # y is the column we're trying to predict
In y, as output, only the Price column will exist.
The following will be generated after executing the code. As was told earlier, x represents input and y represents output; this convention will be used throughout the course.
Until now we have segregated the x and y values. Our next step is to take chunks from this data to be used for training and testing, for which we use the train_test_split function as follows:

# Use x and y variables to split the training data into train and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .20, random_state = 101)

The code you provided is using the train_test_split function from a machine learning library, likely scikit-learn. This function is used to split a dataset into training and testing sets for building and evaluating machine learning models. Let's break down the code:

1. x: This is the input feature dataset, typically a DataFrame or a NumPy array, containing the independent variables (features) that will be used to make predictions.
2. y: This is the target dataset, typically a Series or a NumPy
array, containing the dependent variable (target) we are trying
to predict.
3. test_size = .20: This parameter determines the proportion of
the data that will be used for testing the model. In this case,
20% of the data will be used for testing, and the remaining 80%
will be used for training.
4. random_state = 101: This parameter is used to set the random
seed, ensuring reproducibility of the train-test split. By
setting random_state to a fixed value (101 in this case), the data
will be split in the same way every time you run the code, making
the results consistent.
5. x_train, x_test, y_train, y_test: These are the variables where
the data will be stored after the split. x_train and y_train will
contain the training data (input features and target), while
x_test and y_test will contain the testing data (input features
and target).

After running this code, you will have four datasets: x_train,
x_test, y_train, and y_test, which can be used to train and
evaluate machine learning models. The model will be trained on
x_train and y_train, and then its performance will be evaluated on
x_test and y_test.
x_train.shape
x_train
The first line of code, x_train.shape, will return the shape of
the x_train dataset, which is the training set of input features
for a machine learning model. It will give you the number of rows
and columns in the x_train dataset.

The second line of code, x_train, will display the contents of the x_train dataset itself, which will show you the actual values of the input features used for training the model.

The output of x_train.shape will look something like this (assuming it's a 2-dimensional array): (4000, 5).
This means that the x_train dataset has 4000 rows × 5 columns.

The output of x_train (if displayed directly) will be the actual content of the x_train dataset, showing the values of the input features. It will be a tabular representation of the data. Since it can be quite large, displaying the entire content might not be practical. The output will depend on the specific data you are working with.

y_train.shape
y_train

The first line of code, y_train.shape, will return the shape of the y_train dataset, which is the training set of target values for a machine learning model. It will give you the number of rows in the y_train dataset.

The second line of code, y_train, will display the contents of the y_train dataset itself, which will show you the actual values of the target variable used for training the model.

The output of y_train.shape will look something like this (assuming it's a 1-dimensional array or a pandas Series): (4000,).
This means that the y_train dataset has 4000 elements (rows).

The output of y_train (if displayed directly) will be the actual content of the y_train dataset, showing the values of the target variable. It will be a one-dimensional representation of the data. Since it can be quite large, displaying the entire content might not be practical. The output will depend on the specific data you are working with.

In summary, y_train.shape tells you the size of the target variable's training set, and y_train itself shows the actual values of the target variable used for training the machine learning model.

It means 4000 rows out of the 5000 have been selected for training. The index column on the left shows the row numbers, which were selected randomly for the split.

x_test.shape
x_test

The first line of code, x_test.shape, will return the shape of the
x_test dataset, which is the testing set of input features for a
machine learning model. It will give you the number of rows and
columns in the x_test dataset.

The second line of code, x_test, will display the contents of the
x_test dataset itself, which will show you the actual values of
the input features used for testing the model.

The output of x_test.shape will look something like this (assuming it's a 2-dimensional array): (1000, 5).
This means that the x_test dataset has 1000 rows × 5 columns.

The output of x_test (if displayed directly) will be the actual content of the x_test dataset, showing the values of the input features. It will be a tabular representation of the data. Since it can be quite large, displaying the entire content might not be practical. The output will depend on the specific data you are working with.

In summary, x_test.shape tells you the size of the input features' testing set, and x_test itself shows the actual values of the input features used for testing the machine learning model.

Now the data is in a shape that it can be passed to the model for training, for which we will import the linear regression function from the sklearn library.
LINEAR REGRESSION

Model Training
# Fit
# Import model
from sklearn.linear_model import LinearRegression

Explanation:

1. We import the LinearRegression class from the sklearn.linear_model module.
2. Next, we create an instance of the LinearRegression model and
store it in the variable model.
3. We use the fit method of the model object to train the
linear regression model. The fit method takes two arguments: the
training data x_train (input features) and the corresponding
target values y_train.

After fitting the model, it learns the coefficients (weights) and the intercept from the training data, and it will be ready to make predictions on new, unseen data.

Keep in mind that fitting a linear regression model assumes a linear relationship between the input features and the target variable. If your data has non-linear relationships, you may need to consider using other regression models or performing feature engineering to capture those patterns effectively.

# Create instance of model
lin_reg = LinearRegression()

Explanation:

1. You imported the LinearRegression class from the sklearn.linear_model module.
2. You created an instance of the LinearRegression model and
stored it in the variable lin_reg.

Now that you have lin_reg, you can proceed with the model fitting
using the fit method, as shown in the previous response.
Additionally, you can use lin_reg to access various attributes of
the fitted model, such as coefficients and intercept, or use it
to make predictions on new data.

LinearRegression
LinearRegression()

It seems you want more information about the LinearRegression class and how to use it. Here's a detailed explanation:

Linear Regression: Linear Regression is a simple and widely used statistical technique used for modeling the relationship between a dependent variable (target) and one or more independent variables (features). In simple linear regression there is only one independent variable, while in multiple linear regression there are multiple independent variables.

LinearRegression Class in scikit-learn: LinearRegression is a class provided by the scikit-learn (sklearn) library for performing linear regression. It is part of the sklearn.linear_model module. The LinearRegression class allows you to fit a linear regression model to your data, make predictions using the fitted model, and access the model's attributes such as coefficients and intercept.
Creating an Instance: To use the LinearRegression class, you need
to create an instance of it. This is typically done by calling
the class with no arguments, as you have done in your code:

Fitting the Model: After creating the instance lin_reg, you can
fit the linear regression model to your training data using the
fit method. The fit method takes the input features (x_train) and
the corresponding target values (y_train) as arguments. It learns
the coefficients and the intercept from the training data.

Accessing Model Attributes: Once the model is fitted, you can access its attributes. For example, you can get the coefficients (weights) of the linear regression model using the coef_ attribute and the intercept using the intercept_ attribute.
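Since the lecture describes these attributes but does not show the cell, here is a minimal hedged sketch (assuming lin_reg has already been fit):

# Hedged sketch: inspect the fitted parameters (assumes lin_reg.fit has run)
print(lin_reg.intercept_)  # the learned intercept
print(lin_reg.coef_)       # one learned coefficient (weight) per input feature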

Making Predictions: After the model is trained, you can use it to make predictions on new, unseen data using the predict method. Pass the input features of the new data as an argument to get the corresponding predicted target values.

Now, y_pred will contain the predicted target values for the
x_test data, based on the fitted linear regression model.

Remember, while linear regression is a powerful and widely used technique, it may not be suitable for all types of data or complex relationships. Always evaluate the model's performance and consider other algorithms if needed.

# Pass training data into model
lin_reg.fit(x_train, y_train)

Here's a brief explanation of the parameters used in the fit method:

 x_train: This parameter represents the input features or independent variables of your training data. It should be a 2D array-like object (e.g., NumPy array, Pandas DataFrame) with shape (n_samples, n_features), where n_samples is the number of data points (samples) in your training set, and n_features is the number of features (attributes) for each data point.
 y_train: This parameter represents the target values or
dependent variable of your training data. It should be a 1D
array-like object (e.g., NumPy array, Pandas Series) with shape
(n_samples,), where n_samples is the same as the number of data
points in x_train.

The "fit" method is used to train the model on the provided data.
After the fitting process, the model will have learned the
relationships between the input features and the target values,
allowing it to make predictions on new, unseen data.

Keep in mind that different machine learning libraries may use slightly different conventions for fitting models, but the
general idea is the same across most libraries. Make sure you
have imported the appropriate library (e.g., scikit-learn) and
that the lin_reg object is an instance of the linear regression
model from that library before calling the fit method.

Model Testing

Prediction
# Predict
y_pred = lin_reg.predict(x_test)
print(y_pred.shape)
print(y_pred)

After fitting the linear regression model to the training data, you can use it to make predictions on the test data (x_test). The
"predict" method is used for this purpose. It takes the test
input features as its parameter and returns the predicted target
values.

Here's a brief explanation of the code you provided:

In this code, "y_pred" will be an array-like object containing the predicted target values based on the input features in
"x_test." The shape of "y_pred" will be (n_samples,), where
n_samples is the number of data points in the test set.

The print statement will display the shape of the predicted values, and then the predicted values themselves will be printed.
The actual output will depend on the specific data and model
used, but it will be an array of predicted target values for the
corresponding data points in "x_test."
# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred))

In the code you provided, you are combining the actual target
values from the test set (y_test) with the predicted target
values (y_pred) side by side using NumPy's column_stack function.
This creates a new NumPy array where the two arrays are stacked
as columns.

Here's a brief explanation of the code:

After running this code, the variable results will be a NumPy array with shape (n_samples, 2), where n_samples is the number of
data points in the test set. The first column of the results
array will contain the actual target values (from y_test), and
the second column will contain the corresponding predicted target
values (from y_pred).

This combined array can be useful for comparing the actual and
predicted values, and you can use it to calculate various metrics
to evaluate the performance of your linear regression model on
the test data. For example, you can calculate the mean squared
error, R-squared score, or other relevant evaluation metrics to
assess how well your model is performing on unseen data.
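As a hedged sketch of that suggestion (the lecture mentions R-squared but does not compute it), the same predictions could be scored like this:

# Hedged sketch: R-squared for the same predictions (not computed in the lecture)
from sklearn.metrics import r2_score
print("R-squared:", r2_score(y_test, y_pred))  # 1.0 would be a perfect fit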

Now print the values predicted by the model and the actual values side by side. From the values above we have taken one value and highlighted it; we see that the difference between the actual value and the predicted value is large.

The difference between the values can be calculated by subtracting the actual value from the predicted value. We will find this difference between the actual and predicted values by the following method.

# Printing the results
print("Actual Values | Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"{actual:14.2f} | {predicted:12.2f}")

The code you provided is for printing the combined actual and
predicted values side by side in a tabular format. This can be
useful for visually inspecting the performance of your linear
regression model on the test data.

Here's a brief explanation of the code:

The code uses a for loop to iterate over each row of the results
array, where each row consists of an actual value and its
corresponding predicted value. The f-string formatting is used to
align the numbers properly in the table. Each actual value is
printed with a field width of 14 characters and two decimal
places, while each predicted value is printed with a field width
of 12 characters and two decimal places.

The output will be a two-column table of actual and predicted values.
This format makes it easy to compare the actual and predicted values for each data point in the test set and visually evaluate how well the model is performing. If the predicted values are close to the actual values, it indicates that the model is making accurate predictions. If there are significant differences between the actual and predicted values, it may suggest that the model needs further improvement.

Residual Analysis
Residual analysis in linear regression is a way to check how well
the model fits the data. It involves looking at the differences
(residuals) between the actual data points and the predictions
from the model.
In a good model, the residuals should be randomly scattered
around zero on a plot. If there are patterns or a fan-like shape,
it suggests the model may not be the best fit. Outliers, points
far from the others, can also affect the model.
Residual analysis helps ensure the model's accuracy and whether
it meets the assumptions of linear regression. If issues are
found, adjustments to the model may be needed to improve its
performance.

# Compute residuals: actual values minus predicted values
residual = y_test - y_pred.reshape(-1)
print(residual)
In the code you provided, you are calculating the residuals by
subtracting the predicted values (y_pred) from the actual values.
To perform this calculation, it seems like you are reshaping the
y_pred array to be a 1D array before the subtraction.

Here's a brief explanation of the code:

1. The line y_pred = results[:, 1] extracts the second column (index 1) from the 'results' array, which contains the predicted
values. Since 'results' is a NumPy array with two columns, the
first column contains the actual values (y_test) and the second
column contains the predicted values (y_pred).
2. The line y_pred.reshape(-1) reshapes the 'y_pred' array to be
a 1D array. The '-1' argument in the reshape method is a
placeholder that lets NumPy automatically determine the
appropriate size for the reshaped array based on the input shape.
In this case, it converts the 'y_pred' array from a 2D shape to a
1D shape.
3. The line residual = y_test - y_pred.reshape(-1) calculates the
residuals by subtracting the predicted values (y_pred) from the
actual values (y_test). Since both 'y_test' and 'y_pred' are 1D
arrays, the subtraction is element-wise, and it results in a new
1D array called 'residual', which contains the differences
between the actual and predicted values for each data point.

The 'residual' array represents the prediction error for each data point in the test set. Positive values indicate that the model underestimated the target value, while negative values indicate that it overestimated it. If the model is accurate, the residuals should be close to zero on average. You can further analyze the residuals to assess the performance of your linear regression model and check for patterns or biases in the predictions.
As the difference between the actual and predicted values is large, it shows that the model is not performing well. But the graph plotted later shows that many data points lie around the linear regression line.
As discussed earlier in this lecture, the Price column has values in high exponential notation, which means these are very high amounts. The encircled area shows the graph scale: the axis labels read 2 × 10⁶ along both the x-axis and the y-axis. This means the values scaled on the axes are very large, so the data points shown above are not actually very close; since such a big graph cannot be drawn at true scale, the apparent closeness is an optical illusion.

# Distribution plot for residual (difference between actual and predicted values)
# Note: distplot is deprecated in recent seaborn; sns.histplot(residual, kde=True) is the modern equivalent
sns.distplot(residual, kde=True)

It shows that our residual distribution is not skewed, as it is center-aligned, but note the values on the x and y axes: they are on the order of 10⁶, which means the difference between actual and predicted values is high.

Explanation:

1. The code imports the required libraries, seaborn for creating the plot and matplotlib.pyplot for adding labels and
titles to the plot.
2. The sns.distplot(residual, kde=True) function is used to create
the distribution plot. The residual array, which contains the
differences between the actual and predicted values, is passed as
the input variable to visualize its distribution. The kde=True
argument adds a kernel density estimate to the histogram,
providing a smooth line that represents the estimated probability
density function of the residuals.
3. After creating the plot, we add labels to the x-axis and y-
axis using plt.xlabel() and plt.ylabel(), respectively.
4. The plt.title() function is used to set the title of the plot
to "Distribution of Residuals."
5. Finally, plt.show() is called to display the plot.

The resulting plot will show the distribution of the residuals, which can provide insights into the performance of the linear
regression model. If the residuals are normally distributed
around zero with a symmetric shape, it indicates that the model's
predictions are unbiased. However, if the residuals have a skewed
or non-symmetric distribution, it suggests that the model might
have systematic errors in its predictions. It's essential to
examine the distribution plot to understand the behavior of the
residuals and identify any potential issues with the model's
predictions.

The plot shown above is bell-shaped. The bell curve should be centrally aligned, but at the same time its spread should be low; the spread reflects the difference between the predicted and actual values.

sns.scatterplot(x=y_test, y=y_pred, color='blue', label='Actual Data points')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', label='Ideal Line')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted (Linear Regression)')
plt.legend()
plt.show()

Explanation:

1. The code imports the required libraries, seaborn and matplotlib.pyplot.
2. The sns.scatterplot(x=y_test, y=y_pred, color='blue', label='Actual
Data points') function creates a scatter plot with the actual
values (y_test) on the x-axis and the predicted values (y_pred)
on the y-axis. The 'color' parameter is set to 'blue' to use blue
color for the data points, and the 'label' parameter is set to
'Actual Data points' to use this label in the legend.
3. The plt.plot([min(y_test), max(y_test)], [min(y_test),
max(y_test)], color='red', label='Ideal Line') function adds a red
line to the plot. The line represents the ideal situation where
the predicted values perfectly match the actual values. It starts
from the point (min(y_test), min(y_test)) and ends at
(max(y_test), max(y_test)).
4. The plt.xlabel('Actual Values') and plt.ylabel('Predicted Values')
functions set the labels for the x-axis and y-axis, respectively.
5. The plt.title('Actual vs. Predicted (Linear Regression)') function
sets the title of the plot to 'Actual vs. Predicted (Linear
Regression)'.
6. The plt.legend() function displays the legend on the plot,
including the labels 'Actual Data points' and 'Ideal Line'.
7. Finally, plt.show() is called to display the plot.

The resulting scatter plot will show how well the linear
regression model's predictions match the actual values. Data
points close to the red ideal line indicate accurate predictions,
while points scattered away from the line represent prediction
errors. By examining the scatter plot, you can visually assess
the performance of your linear regression model and identify any
patterns or trends in its predictions.
Model Evaluation
# Score It
from sklearn.metrics import mean_squared_error

print('Linear Regression Model')
# Results
print('--'*30)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print evaluation metrics
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)

Now we discuss the loss function that was mentioned in the theoretical part above. The mean squared error can be calculated as follows: import mean_squared_error from sklearn, then pass both the actual values and the values predicted by the model to the mean_squared_error function and store the result in the mse variable.

mse = mean_squared_error(y_test, y_pred)
When we execute the code above, we get the following output:
Mean Squared Error: 10100187858.864885
Root Mean Squared Error: 100499.69083964829
The mean squared error value is very high, whereas it should be close to zero. That is further evidence that the model will not perform well.
As we discussed earlier, we square the differences while calculating the mean squared error.
rmse = np.sqrt(mse)
In order to reverse the square taken already, we take the square root, just as squaring converts meters into square meters and the square root brings the units back to meters.
Root Mean Squared Error: 100499.69083964829
The resultant value gives a truer picture of the error.

Explanation:

1. The code imports the mean_squared_error function from scikit-learn's sklearn.metrics module.
2. The line mse = mean_squared_error(y_test, y_pred) calculates the
Mean Squared Error (MSE) by comparing the actual values (y_test)
with the predicted values (y_pred). The MSE is a common metric
used to quantify the average squared difference between predicted
and actual values. A lower MSE indicates better model
performance.
3. The line rmse = np.sqrt(mse) calculates the Root Mean Squared
Error (RMSE) by taking the square root of the MSE. RMSE is
another commonly used metric for regression tasks and represents
the square root of the average squared difference between
predicted and actual values. Like MSE, a lower RMSE indicates
better model performance.
4. The final two lines of code print the evaluation metrics MSE
and RMSE to the console.

After running this code, you will see the MSE and RMSE values for
the linear regression model's predictions on the test data. These
metrics provide insights into how well the model is performing,
with lower values indicating better accuracy. Comparing the MSE
and RMSE with other models or benchmarks can help you assess the
effectiveness of your linear regression model for the given task.

Interpretation
Accuracy
The MSE is very high. Here are some questions for you:
1. What are the possible reasons for higher MSE values?
2. What can we do to lower the value of MSE?
