CSE 412 Lab Manual 3 Linear Regression
1 Objective
The objective of learning linear regression is to develop an understanding of a fundamental statistical and
machine learning technique used for predictive modeling and understanding the relationships between
variables. Linear regression is a simple yet powerful method used in various fields, including statistics,
economics, finance, and machine learning. Here are the primary objectives of learning linear regression:
• Understanding the Basics: Learn the fundamental concepts of linear regression, including the terminology (dependent and independent variables, coefficients, intercept, etc.) and the mathematical representation of linear regression models.
• Model Building: Learn how to build a linear regression model by selecting appropriate independent variables (features) and estimating coefficients that best fit the data.
• Interpretation: Develop the ability to interpret the coefficients of a linear regression model. Understand how changes in the independent variables affect the dependent variable.
h_a(x) = a_0 + a_1 x
The above equation is an equation for a straight line with a slope of a1 and an intercept of a0 . We say that
ha (x) is a linear function of x that is parameterized by a0 and a1 . The following are a few plots with
different values of a0 , and a1 .
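A few such lines can be drawn with a short Matplotlib script; the (a0, a1) pairs below are arbitrary examples chosen only to illustrate how the intercept and slope change the line:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# Example (a0, a1) pairs: intercept and slope of h_a(x) = a0 + a1 * x
for a0, a1 in [(0, 1), (2, 0.5), (1, -1)]:
    plt.plot(x, a0 + a1 * x, label=f'a0 = {a0}, a1 = {a1}')

plt.xlabel('x')
plt.ylabel('h_a(x)')
plt.legend()
plt.show()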
y = mx + b.
We use h instead of y to denote that this model is a hypothesis of the true trend. In a sense, we are
attempting to provide our “best guess” for finding a relationship in the data provided. The parameters a0
and a1 are coefficients of our linear equation (analogous to b and m in the standard line equation) and
will be considered parameters of the model.
The goal of linear regression is to make the sum of the squared errors between the predictions ha(xi) and the observed values yi as small as possible by finding the “best” parameters a0 and a1 for the hypothesis function ha(x).
J(a_0, a_1) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_a(x_i) - y_i\bigr)^2
            = \frac{1}{2m} \sum_{i=1}^{m} \bigl((a_0 + a_1 x_i) - y_i\bigr)^2
In other words, we are looking for the parameter values a0 and a1 for which the cost J(a0, a1) is at its minimum (the optimum). This is an optimization problem, and we will see how we can perform such a task.
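As a concrete reference, the cost function J(a0, a1) can be written directly in NumPy (a minimal sketch; the function and variable names are illustrative):

import numpy as np

def cost(a0, a1, x, y):
    # Cost J(a0, a1) for the hypothesis h_a(x) = a0 + a1 * x,
    # where x and y are NumPy arrays holding the m training examples.
    m = len(y)
    predictions = a0 + a1 * x          # h_a(x_i) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)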
To make it clear how the regression line relates to the cost function, the test data used earlier is plotted below together with a set of lines for a range of values of a1, keeping a0 = 0. Additionally, the cost values J(a0 = 0, a1) are plotted for the different values of a1.
You can see that the (a0 = 0 and a1 = 0.93) line is the closest to the minimum. Our task is to find the
best possible parameter a1 (for this particular problem).
h_a(x) = a_0 + a_1 x
So far, our model has only been capable of capturing one independent variable and mapping its re-
lationship to one dependent variable. We will now extend our linear model to allow for n different
independent variables (in our case, features) to predict one dependent variable (an outcome).
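Adopting the usual convention that x0 = 1 (so that a0 remains the intercept), the extended hypothesis can be written as:

h_a(x) = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n = \sum_{j=0}^{n} a_j x_j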
\min_{a} \; \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_a(x_i) - y_i\bigr)^2
Substituting in our definition of the model, ha (xi ), we update our cost function as:
J(a) = \frac{1}{2m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} a_j x_{i,j} - y_i \right)^2
Using gradient descent, the weights will continue to update until we have converged on the optimal
value. For linear regression, the cost function is always convex with only one optimum, so we can be
confident that our solution is the best possible one. This is not always the case, and we will need to be
wary of converging on local optima using other cost functions.
As the picture depicts, gradient descent is an iterative process that guides the parameter values toward
the global minimum of our cost function.
a_j := a_j - \eta \, \frac{\partial}{\partial a_j} J(a_0, a_1)
Gradient descent updates each parameter by subtracting from its current value the slope of the cost function at that point, multiplied by a learning rate that dictates how far to step. The learning rate, η, is always positive.
Notice in the picture above how the first attempted value, a1 = 0, has a negative slope on the cost
function. Thus, we would update the current value by subtracting a negative number (in other words,
addition) multiplied by the learning rate. In fact, every point left of the global minimum has a negative
slope and would thus result in increasing our parameter weight. On the flip side, every value to the right
of the global minimum has a positive slope and would thus result in decreasing our parameter weight
until convergence.
\frac{\partial}{\partial a_j} J(a) = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{k=0}^{n} a_k x_{i,k} - y_i \right) x_{i,j}
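Putting the update rule and this gradient together, a batch gradient descent loop for linear regression might look like the following sketch (illustrative NumPy code; the function and variable names are not prescribed by this manual):

import numpy as np

def gradient_descent(X, y, eta=0.01, iterations=1000):
    # X is an (m, n+1) matrix whose first column is all ones (the x_{i,0} = 1 term),
    # y is a length-m vector of targets, and eta is the learning rate.
    m, n_plus_1 = X.shape
    a = np.zeros(n_plus_1)                        # parameters a_0, ..., a_n
    history = []                                  # cost J(a) recorded each iteration

    for _ in range(iterations):
        predictions = X @ a                       # h_a(x_i) for every example
        history.append(np.sum((predictions - y) ** 2) / (2 * m))  # cost before this update
        gradient = (X.T @ (predictions - y)) / m  # partial derivative of J w.r.t. each a_j
        a = a - eta * gradient                    # simultaneous update of all parameters
    return a, history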
• It is important that your learning rate is reasonable. If it is too large it may overstep the global
optimum and never converge, and if it is too small it will take a very long time to converge.
• Plot J(a0, . . . , an) as a function of the number of iterations of gradient descent (see the sketch after this list). If gradient descent is working properly, the cost will continue to decrease steadily until leveling off, hopefully close to zero. You can also use this plot to visually determine when your cost function has converged and gradient descent may be stopped.
• If J(a0 , . . . , an ) begins to increase as the number of iterations increases, something has gone
wrong in your optimization. Typically, if this occurs it is a sign that your learning rate is too
large.
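Using the history list returned by the gradient descent sketch above, such a plot of the cost against the iteration count can be produced as follows:

import matplotlib.pyplot as plt

plt.plot(history)               # history holds J(a) per iteration, from the sketch above
plt.xlabel('Iteration')
plt.ylabel('Cost J(a)')
plt.show()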
Gradient descent is often the preferred method of parameter optimization for models that have a large number (think over 10,000) of features.
5 Implementation
In this section, we will see how the Python Scikit-Learn library for machine learning can be used to implement regression functions. We will start with simple linear regression involving two variables and then move on to linear regression involving multiple variables.
In this regression task, we will predict the percentage of marks that a student is expected to score
based on the number of hours they studied. This is a simple linear regression task as it involves just two
variables.
To import the necessary libraries for this task, execute the following import statements:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
The following command imports the CSV dataset using pandas and retrieves the first 5 records from
the dataset:
# https://drive.google.com/file/d/1oakZCv7g3mlmCSdv9J8kdSaqO5_6dIOw/view
# Please download the dataset from the above link

dataset = pd.read_csv('/content/drive/MyDrive/student_scores.csv')
dataset.head()
Output:
Hours Scores
0 2.5 21
1 5.1 47
2 3.2 27
3 8.5 75
4 3.5 30
Let's plot the data points on a 2-D graph to understand the characteristics of the dataset and see if we can manually find any relationship in the data. We can create the plot with the following script:
dataset.plot(x='Hours', y='Scores', style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()
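Next, we divide the data into attributes (inputs) and labels (outputs). One way to do this, consistent with the description below, is the following sketch using pandas' iloc indexing:

X = dataset.iloc[:, :-1].values   # all columns except the last one
y = dataset.iloc[:, 1].values     # the "Scores" column (column index 1)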
The attributes are stored in the X variable. We used ":-1" as the column range since we wanted our attribute set to contain all the columns except the last one, which is "Scores". Similarly, the y variable contains the labels. We specified 1 for the label column since the index of the "Scores" column is 1. Remember, column indexes start with 0, so 1 is the second column. In the next section, we will see a better way to specify columns for attributes and labels.
Now that we have our attributes and labels, the next step is to split this data into training and test
sets. We’ll do this by using Scikit-Learn’s built-in train_test_split() method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
The above script assigns 80% of the data to the training set and 20% to the test set. The test_size argument is where we specify the proportion of the test set.
Now is the time to train our algorithm. Execute the following command:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
With Scikit-Learn it is extremely straightforward to implement linear regression models: all you really need to do is import the LinearRegression class, instantiate it, and call the fit() method with the training data. This is about as simple as it gets when using a machine-learning library to train a model on your data.
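Once the model is fitted, the learned intercept (a0) and coefficient(s) (a1, ..., an) are available through its intercept_ and coef_ attributes: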
print(regressor.intercept_)
print(regressor.coef_)
Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use
our test data and see how accurately our algorithm predicts the percentage score. To make predictions
on the test data, execute the following script:
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head()
Actual Predicted
0 20 16.884145
1 27 33.732261
2 69 75.357018
3 30 26.794801
4 62 60.491033
The final step is to evaluate the performance of the algorithm. This step is particularly important for comparing how well different algorithms perform on a particular dataset. For regression algorithms, several evaluation metrics are commonly used; in this lab we focus on the following:
• Mean Squared Error (MSE) is the mean of the squared errors and is calculated as:
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\text{actual}_i - \text{predicted}_i\bigr)^2
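Spelled out in NumPy (only for illustration, using the y_test and y_pred arrays from above), this formula reads:

mse = np.mean((y_test - y_pred) ** 2)   # mean of the squared prediction errors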
We don’t have to perform these calculations manually. The Scikit-Learn library comes with pre-built
functions that can be used to find out these values for us.
Let’s find the values for these metrics using our test data. Execute the following code:
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
Output:
6 Lab Exercises
Please code yourself and write a report based on your findings:
1. Download the Diabetes dataset from the following link and predict if a person is diabetic
using a linear regression algorithm.
https://www.kaggle.com/saurabh00007/diabetescsv
2. Write a report on your experiments and experimental results and submit it to your instructor.