Lecture-17-Linear Regression Using Sklearn
If we have two items, an intercept and a slope, we can draw any line. Different combinations of intercepts and slopes give us many possible lines. When we find the intercept and slope whose line passes close to our data points, that is our desired linear regression model, which we then use to predict.
Now we are going to introduce another machine learning concept: loss. It is a very useful concept for training a model. The farther the line is from the data points, the greater the loss, and vice versa. We move our search systematically from lines having more loss toward lines having less loss.
It is pertinent to mention that loss and error can be used interchangeably; they represent the same concept. The more accurately a given line predicts or represents the data points, the lower the error or loss; the less accurately it does so, the higher the error or loss.
Now the question arises: why do we take the square? Suppose we have four points and we draw a line, such that:
1. The first point lies at +10 from the line.
2. The second point lies at -10 from the line.
3. The third point lies at +40 from the line.
4. The fourth point lies at -40 from the line.
What happens in this case? When we add all these figures, the result is zero. Even though no data point is close to the linear regression line, we have apparently reduced the error to zero. Looks satisfactory!
So, in order to avoid this, all four values must be made positive. One way to convert a negative value to a positive one is to take its square. Mean squared error therefore means we square all the errors and then take their mean. The higher the mean squared error, the worse the line represents the data; the lower it is, the better the representation. Mean squared error is our loss function.
The other way to convert a negative value to a positive one is to take its absolute value.
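As a quick sanity check, here is a minimal sketch (not from the lecture notebook; the variable names are illustrative) that computes the plain sum, the mean of the squared errors, and the mean of the absolute errors for the four points above:

import numpy as np

# Errors of the four points relative to the line
errors = np.array([10, -10, 40, -40])

print('Sum of errors:', errors.sum())                 # 0 -- misleadingly "perfect"
print('Mean squared error:', (errors ** 2).mean())    # 850.0
print('Mean absolute error:', np.abs(errors).mean())  # 25.0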
Let us summarize today's topic, linear regression. In linear regression we want to predict a continuous value. For example, we want to predict what score a batsman will make in a match: will it be 0, 20, 50, 250, and so on. To make such predictions we fit a linear regression line. The straight line drawn in linear regression has a slope and an intercept, with the data points scattered around it. Our purpose is to find the line that has the most data points close to it. In other words, our purpose is to reach a consensus with which everybody in the family agrees.
We also discussed the concept of the loss function, which means that when the model performs badly we impose a penalty. As training proceeds, the value of the error diminishes and approaches zero; more data points lie close to the line, and we call this the optimal line. The error value at this stage is the mean squared error, which we use to avoid the misleading zero that arises when equal positive and negative errors cancel out.
Now let us move to Colab.
1. Visualization
2. Seaborn
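The setup cell itself is not reproduced in the notes; a typical sketch is shown below (the specific imports and aliases are assumptions):

# Typical Colab setup (assumed; the original import cell is not shown)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns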
Loading the Data Set
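The loading cell is not shown in the notes; a minimal sketch, assuming the data lives in a CSV file (the file name housing.csv is hypothetical):

# Load the dataset into a DataFrame (file name is hypothetical)
full_data = pd.read_csv('housing.csv')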
Data Shape
After loading the dataset, I examine its shape to get a better sense
of the data and the information it contains.
# Data shape
print('train data:', full_data.shape)
train data: (5000, 7)
# View first few rows
full_data.head(5)
In the dataset shown above we are given columns such as longitude, latitude, housing median age, total rooms, and so on. We can already guess that the address column has nothing to do with our desired prediction, so we have to drop it from the dataset. Admittedly, little matters more to a house than its location and address, but the purpose of dropping this column is to simplify things.
# Heatmap
sns.heatmap(full_data.isnull(), yticklabels=False, cbar=False, cmap='tab20c_r')
plt.title('Missing Data: Training Set')
plt.show()
Now we remove the address column, as given below. The same cleaning step also removes rows with missing data (NaN values) from the DataFrame full_data.
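The cleaning cell itself is not reproduced in the notes; a minimal sketch, assuming the column is literally named 'Address':

# Drop the address column (the column name 'Address' is an assumption)
full_data = full_data.drop('Address', axis=1)

# Drop rows that contain missing (NaN) values
full_data = full_data.dropna()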
The e in the printed house prices shows the value in exponential (scientific) notation; it indicates how many digits the value contains.
describe gives the summary statistics, including the five-number summary. Look at the count row: it must be the same for all columns. If the counts are not the same for all columns, data cleaning needs to be performed in order to make them equal.
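The call that produces this summary is presumably:

# Summary statistics (count, mean, std, min, quartiles, max) per column
full_data.describe()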
# Use x and y variables to split the training data into train and test
set
from sklearn.model_selection import train_test_split
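The split itself is not reproduced in the notes; here is a sketch consistent with the 4000/1000 shapes reported below, assuming the target column is named 'Price' (the random_state value is arbitrary):

# Features and target ('Price' as the target column name is an assumption)
x = full_data.drop('Price', axis=1)
y = full_data['Price']

# 80/20 split: 4000 rows for training, 1000 for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)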
After running this code, you will have four datasets: x_train,
x_test, y_train, and y_test, which can be used to train and
evaluate machine learning models. The model will be trained on
x_train and y_train, and then its performance will be evaluated on
x_test and y_test.
x_train.shape
x_train
The first line of code, x_train.shape, will return the shape of
the x_train dataset, which is the training set of input features
for a machine learning model. It will give you the number of rows
and columns in the x_train dataset.
This means that the x_train dataset has 4000 rows × 5 columns
y_train.shape
y_train
This means that the y_train dataset has 4000 elements (rows). In other words, 4000 rows out of the 5000 total have been selected for training. The column on the left shows the row numbers, which were picked randomly during the split.
x_test.shape
x_test
The first line of code, x_test.shape, will return the shape of the
x_test dataset, which is the testing set of input features for a
machine learning model. It will give you the number of rows and
columns in the x_test dataset.
The second line of code, x_test, will display the contents of the
x_test dataset itself, which will show you the actual values of
the input features used for testing the model.
This means that the x_test dataset has 1000 rows × 5 columns
Now the data is in a shape that can be passed to the model for training. For this we will import the LinearRegression model from the Sklearn library.
LINEAR REGRESSION
Model Training
# Fit
# Import model
from sklearn.linear_model import LinearRegression
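The instance-creation line is not reproduced in the notes; presumably it is simply:

# Create an (untrained) linear regression model instance
lin_reg = LinearRegression()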
Explanation:
Now that you have lin_reg, you can proceed with the model fitting using the fit method, as shown below. Additionally, you can use lin_reg to access various attributes of the fitted model, such as the coefficients and intercept, or use it to make predictions on new data.
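The fit call itself is also not shown; presumably it is the following, and the LinearRegression() line below is the notebook output of this cell:

# Learn the coefficients and intercept from the training data
lin_reg.fit(x_train, y_train)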
LinearRegression
LinearRegression()
Fitting the Model: After creating the instance lin_reg, you can
fit the linear regression model to your training data using the
fit method. The fit method takes the input features (x_train) and
the corresponding target values (y_train) as arguments. It learns
the coefficients and the intercept from the training data.
Once you call the predict method (shown under Model Testing below), y_pred will contain the predicted target values for the x_test data, based on the fitted linear regression model.
The "fit" method is used to train the model on the provided data.
After the fitting process, the model will have learned the
relationships between the input features and the target values,
allowing it to make predictions on new, unseen data.
Model Testing
Prediction
# Predict
y_pred = lin_reg.predict(x_test)
print(y_pred.shape)
print(y_pred)
In the next step, the actual target values from the test set (y_test) are combined with the predicted target values (y_pred) side by side using NumPy's column_stack function. This creates a new NumPy array in which the two arrays are stacked as columns.
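The stacking cell is not reproduced in the notes; a sketch, using the name results that the printing loop below refers to:

import numpy as np

# Stack actual (y_test) and predicted (y_pred) values as two columns
results = np.column_stack((y_test, y_pred))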
This combined array can be useful for comparing the actual and
predicted values, and you can use it to calculate various metrics
to evaluate the performance of your linear regression model on
the test data. For example, you can calculate the mean squared
error, R-squared score, or other relevant evaluation metrics to
assess how well your model is performing on unseen data.
Now print the values predicted by the model and the actual values side by side.
Now we take one of the values above and highlight it. We see that the difference between the actual value and the predicted value is large.
The code below prints the combined actual and predicted values side by side in a tabular format. This can be useful for visually inspecting the performance of your linear regression model on the test data.
The code uses a for loop to iterate over each row of the results array, where each row consists of an actual value and its corresponding predicted value. F-string formatting is used to align the numbers properly in the table. Each actual value is printed with a field width of 14 characters and two decimal places, while each predicted value is printed with a field width of 12 characters and two decimal places.
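The loop itself is not shown in the notes; a sketch consistent with that description (the header row is an optional addition):

# Print actual and predicted values in two aligned columns
print(f"{'Actual':>14} {'Predicted':>12}")
for actual, predicted in results:
    print(f"{actual:14.2f} {predicted:12.2f}")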
Residual Analysis
Residual analysis in linear regression is a way to check how well
the model fits the data. It involves looking at the differences
(residuals) between the actual data points and the predictions
from the model.
In a good model, the residuals should be randomly scattered
around zero on a plot. If there are patterns or a fan-like shape,
it suggests the model may not be the best fit. Outliers, points
far from the others, can also affect the model.
Residual analysis helps ensure the model's accuracy and whether
it meets the assumptions of linear regression. If issues are
found, adjustments to the model may be needed to improve its
performance.
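The plotting cell for the residual plot discussed below is not reproduced in the notes; a common way to draw it is a histogram of the residuals (a sketch, assuming seaborn is imported as sns):

# Residuals: actual minus predicted values
residuals = y_test - y_pred

# Histogram with a density curve; roughly bell-shaped for a good fit
sns.histplot(residuals, kde=True)
plt.title('Residuals')
plt.show()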
Explanation:
The plot shown above is bell shaped. The bell curve should be centered at zero, and its spread should be narrow at the same time: the distance of a residual from zero shows the difference between the predicted value and the actual value.
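The scatter-plot cell is likewise not shown; a sketch of an actual-vs-predicted plot with the red ideal line described in the next explanation:

# Actual vs. predicted values
plt.scatter(y_test, y_pred, alpha=0.5)

# Red dashed "ideal" line where predicted equals actual
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'r--')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.show()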
Explanation:
The resulting scatter plot will show how well the linear
regression model's predictions match the actual values. Data
points close to the red ideal line indicate accurate predictions,
while points scattered away from the line represent prediction
errors. By examining the scatter plot, you can visually assess
the performance of your linear regression model and identify any
patterns or trends in its predictions.
Model Evaluation
# Score It
import numpy as np
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print('Root Mean Squared Error:', rmse)
In order to reverse the square taken earlier, we take the square root. Squaring changes the units, just as squaring a distance in kilometers gives square kilometers; taking the square root cancels the effect of the square and returns the error to the original units.
Root Mean Squared Error: 100499.69083964829
The resulting value gives the true picture of the error.
Explanation:
After running this code, you will see the MSE and RMSE values for
the linear regression model's predictions on the test data. These
metrics provide insights into how well the model is performing,
with lower values indicating better accuracy. Comparing the MSE
and RMSE with other models or benchmarks can help you assess the
effectiveness of your linear regression model for the given task.
Interpretation
Accuracy
The MSE is very high. Here are some questions for you: