1. Simple Linear Regression
● Linear regression is the simplest machine learning algorithm you'll encounter
○ Especially simple linear regression
● It is a simple algorithm, initially developed in the field of statistics, that was studied as a model for understanding the relationship between input and output variables
● It is a linear model - assumes a linear relationship between input variables (X) and
the output variable (y)
● Used to predict continuous values (e.g., weight, price...)
Simple vs. Multiple linear regression
● Simple linear regression solves problems with only one input feature
● Multiple linear regression solves problems with multiple input features
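● To make the distinction concrete, the two model forms can be written as follows (the coefficient notation here is assumed for illustration and matches the B0/B1 naming used later):
Simple: y = B0 + B1 * x
Multiple: y = B0 + B1 * x1 + B2 * x2 + ... + Bn * xn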
Assumptions
1. Linear Assumption — model assumes the relationship between variables is linear
2. No Noise — model assumes that the input and output variables are not noisy — so
remove outliers if possible
3. No Collinearity — model will overfit when you have highly correlated input
variables
4. Normal Distribution — the model will make more reliable predictions if your input
and output variables are normally distributed. If that’s not the case, try using some
transforms on your variables to make them more normal-looking
5. Rescaled Inputs — rescale the input variables with standardization or normalization to get more reliable predictions (see the sketch after this list)
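A minimal sketch of how some of these assumptions might be checked or addressed - the dummy data and the specific choices below (correlation matrix, log transform, StandardScaler) are illustrative assumptions, not part of the original notebook:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Dummy two-feature input, assumed for illustration only
X_demo = np.random.normal(loc=50, scale=10, size=(100, 2))

# No Collinearity - inspect pairwise correlations between input features
print(np.corrcoef(X_demo, rowvar=False))

# Normal Distribution - a log transform can make skewed variables look more normal
X_demo_log = np.log1p(X_demo)

# Rescaled Inputs - standardize features to zero mean and unit variance
X_demo_scaled = StandardScaler().fit_transform(X_demo)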
Take-home point
● Training a simple linear regression model is as simple as solving a couple of
equations
Math behind
● In a nutshell, simple linear regression boils down to two coefficients that you need to find in order to solve the line equation:
Line equation:
● y = B0 + B1 * x
B1 coefficient:
● This coefficient has to be calculated first
● It tells you the slope of the line
● B1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
B0 coefficient:
● This coefficient relies on the slope
● It represents the Y-intercept - the location at which the line intercepts the Y-axis
● B0 = mean(y) - B1 * mean(x)
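Worked example:
● As a quick worked example (the numbers are made up for illustration): take x = [1, 2, 3] and y = [2, 4, 6], so mean(x) = 2 and mean(y) = 4
● B1 = ((1 - 2)(2 - 4) + (2 - 2)(4 - 4) + (3 - 2)(6 - 4)) / ((1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2) = (2 + 0 + 2) / 2 = 2
● B0 = 4 - 2 * 2 = 0, so the fitted line is y = 2x, which passes exactly through all three points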
● Let's implement simple linear regression with pure Numpy next
Implementation
● You'll need only Numpy to implement the logic
● Matplotlib is used for optional visualizations
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = (14, 7)
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
● The SimpleLinearRegression class is written to follow the familiar Scikit-Learn
syntax
● The coefficients are set to None at the start - in the __init__() method
● The fit() method calculates the coefficients
● The predict() method essentially implements the line equation
○ Before it does so, it makes sure the coefficients have been
calculated
In [2]:
class SimpleLinearRegression:
    '''
    A class which implements the simple linear regression model.
    '''
    def __init__(self):
        # Coefficients are unknown until fit() is called
        self.b0 = None
        self.b1 = None

    def fit(self, X, y):
        '''
        Calculates the slope and intercept coefficients.

        :param X: array, single feature
        :param y: array, true values
        :return: None
        '''
        # B1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
        numerator = np.sum((X - np.mean(X)) * (y - np.mean(y)))
        denominator = np.sum((X - np.mean(X)) ** 2)
        self.b1 = numerator / denominator
        # B0 = mean(y) - B1 * mean(x)
        self.b0 = np.mean(y) - self.b1 * np.mean(X)

    def predict(self, X):
        '''
        Makes predictions using the simple line equation.

        :param X: array, single feature
        :return: array, predicted values
        '''
        # Check against None explicitly, so legitimate zero-valued coefficients don't raise
        if self.b0 is None or self.b1 is None:
            raise Exception('Please call `SimpleLinearRegression.fit(X, y)` before making predictions.')
        return self.b0 + self.b1 * X
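● As a quick sanity check (the toy values below are made up and not part of the original notebook), the class should recover the exact coefficients on a perfectly linear dataset:

toy_X = np.array([1, 2, 3, 4])
toy_y = 2 * toy_X + 1  # exact line with B0 = 1, B1 = 2

toy_model = SimpleLinearRegression()
toy_model.fit(toy_X, toy_y)
toy_model.b0, toy_model.b1  # expected: (1.0, 2.0)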
Testing
● Let's create some dummy data
○ X contains a list of numbers between 1 and 300 (1, 2, 3, ..., 299,
300)
○ y contains normally distributed values centered around X with
standard deviation of 20
● The source data is then visualized:
In [13]:
X = np.arange(start=1, stop=301)
y = np.random.normal(loc=X, scale=20)
plt.scatter(X, y, s=200, c='#087E8B', alpha=0.65)
plt.title('Source dataset', size=20)
plt.xlabel('X', size=14)
plt.ylabel('Y', size=14)
plt.show()
● For validation's sake, we'll split the dataset into training and testing parts:
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
● You can now initialize and train the model, and afterwards make predictions:
In [5]:
model = SimpleLinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
● Here's how you can get the coefficients:
In [6]:
model.b0, model.b1
● These are the predictions:
In [7]:
preds
● And these are the original values
● The original and predicted values differ, but not by much
In [8]:
y_test
● You can now evaluate the model by calculating RMSE
○ Root Mean Squared Error
● On average, the model is wrong by roughly 21.35 units
● This makes sense, as the standard deviation of the dataset is 20
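● For reference, RMSE is the square root of the mean squared difference between the true and predicted values (standard definition, not spelled out in the original):
RMSE = sqrt((1 / n) * sum((y_i - y_pred_i)^2))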
In [9]:
from sklearn.metrics import mean_squared_error
rmse = lambda y, y_pred: np.sqrt(mean_squared_error(y, y_pred))
rmse(y_test, preds)
Visualize the Best-Fit line
● If you re-train the model on the entire dataset and then make predictions for the entire dataset, you'll get the best-fit line
● You can then visualize this line with Matplotlib:
In [14]:
model_all = SimpleLinearRegression()
model_all.fit(X, y)
preds_all = model_all.predict(X)
plt.scatter(X, y, s=200, c='#087E8B', alpha=0.65, label='Source data')
plt.plot(X, preds_all, color='#000000', lw=3, label=f'Best fit line > B0 = {model_all.b0:.2f}, B1 = {model_all.b1:.2f}')
plt.title('Best fit line', size=20)
plt.xlabel('X', size=14)
plt.ylabel('Y', size=14)
plt.legend()
plt.show()
Comparison with Scikit-Learn
● We want to know if our model is good, so let's compare it with the LinearRegression model from Scikit-Learn
● The input data must be reshaped beforehand:
In [11]:
from sklearn.linear_model import LinearRegression
sk_model = LinearRegression()
sk_model.fit(np.array(X_train).reshape(-1, 1), y_train)
sk_preds = sk_model.predict(np.array(X_test).reshape(-1, 1))
sk_model.intercept_, sk_model.coef_
● Our coefficients were (-1.357484948041531, 1.0026529556316826)
● Not identical, but within a margin of error
● Let's check the RMSE:
In [12]:
rmse(y_test, sk_preds)
21.351850699502783
● Ours was 21.351850699502787, so nearly identical.