ML Lecture 2 2023

• Machine Learning

• 2023-2024 Fall Term


• Dr. Selchuk Dzan
AI is a core, multidisciplinary science. Like other data-science areas it is a particular discipline, but it is interrelated with many other academic fields. AI is most profoundly related to machine learning, artificial neural networks, deep learning, data mining, KDD, pattern recognition, soft computing, natural language processing (NLP), statistics, computer vision, neurocomputing, bioinformatics, and visualisation.
Machine learning vs Traditional Programming

Traditional programming:  Rules + Inputs (data)  →  Answers

Machine learning:         Answers + Inputs (data)  →  Rules
if (ball.hitToWall() or ball.hitToPaddle() )
{bounceBall();}
else {decreaseLife();}

For example, in this simple bouncing ball game, whenever the user presses the right or left arrow key, the paddle moves a few pixels to the right or left, respectively. The user uses the paddle to catch the ball so that it does not hit the bottom line. Every time the ball hits the paddle, it bounces off the paddle. If the ball hits the bottom of the screen, the player loses a life.

In traditional programming, rules and data go in, and answers come out. The rules are expressed in code, and the data can come from a variety of sensors, past experiences, experiments, or historical records. Machine learning rearranges this process: data and answers go in, and rules come out.
https://www.chegg.com/homework-help/questions-and-answers/python3-assignment-start-solution-bouncing-ball-lab-implement-simple-pong-game-interface-s-q15926071
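To make the contrast concrete, here is a minimal illustrative sketch (not from the lecture code): the traditional version hard-codes the rule y = 2x + 1, while the machine learning version recovers that rule from example inputs and answers with a least-squares fit.

import numpy as np

# Traditional programming: the rule is written by hand
def answer_from_rule(x):
    return 2 * x + 1              # rule chosen by the programmer

# Machine learning: the rule (slope and intercept) is learned from data
inputs = np.array([1., 2., 3., 4., 5.])
answers = np.array([3., 5., 7., 9., 11.])
slope, intercept = np.polyfit(inputs, answers, 1)   # least-squares fit
print(answer_from_rule(6))        # 13, from the hand-written rule
print(slope * 6 + intercept)      # approximately 13, from the learned rule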
Introduction to Numerical Prediction

Machine Learning
• Supervised learning
  - Numerical prediction (Regression, or Function approximation)
  - Classifiers
• Unsupervised learning
  - Clustering
  - Association Rules (Market Basket Analysis)
  - Recommender Systems
• Reinforcement Learning
Example: Use matrix operations to solve the following system of linear equations.

4X1 - 2X2 + 6X3 = 8
2X1 + 8X2 + 2X3 = 4
6X1 + 10X2 + 3X3 = 0
Solve Using Right Division

In the equation XA = B, X and B are row vectors. The equation can be solved by multiplying both sides, on the right, by the inverse of A:

XA = B
X·A·A⁻¹ = B·A⁻¹
X·I = B·A⁻¹
X = B·A⁻¹

Note that here A is the transpose of the coefficient matrix above, because the unknowns multiply it from the left.

import numpy as np
A = np.array([[4, 2, 6],
              [-2, 8, 10],
              [6, 2, 3]])
B = np.array([[8, 4, 0]])
print('shape of A:', A.shape)
print('shape of B:', B.shape)
coefficients = np.dot(B, np.linalg.inv(A))
print(coefficients)

Output:
[[-1.80487805 0.29268293 2.63414634]]
Solve Using Left Division

In the equation AX = B, X and B are column vectors. The equation can be solved by multiplying both sides, on the left, by the inverse of A:

A⁻¹·A·X = A⁻¹·B
X = A⁻¹·B
Solution with Python
import numpy as np
A = np.array([[4,-2,6],
[2,8,2],
[6,10,3]])
B=np.array([[8],[4],[0]])
coefficients=np.linalg.inv(A).dot(B)
print(coefficients)

>>> print(coefficients)
[[-1.80487805]
[ 0.29268293]
[ 2.63414634]]
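As a side note (standard NumPy practice, not from the slide): np.linalg.solve solves AX = B directly and is usually preferred to computing the inverse explicitly.

import numpy as np

A = np.array([[4, -2, 6],
              [2, 8, 2],
              [6, 10, 3]])
B = np.array([[8], [4], [0]])
# Solves A X = B without explicitly forming inv(A)
X = np.linalg.solve(A, B)
print(X)   # same result as inv(A).dot(B), approximately [[-1.8049], [0.2927], [2.6341]]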
In summary, using matrix operations to solve a system of linear equations:

AX = B (left division):   X = inv(A)·B
XA = B (right division):  X = B·inv(A)
ARRAY (and Matrix) DIVISION

Identity matrix:
The identity matrix is a square matrix in which the main diagonal elements are 1s
and the rest of the elements are 0s.

an identity matrix can be generated in MATLAB with the eye command.

When the identity matrix multiplies another matrix (or vector), that matrix (or
vector) is unchanged.

If a matrix A is square, it can be multiplied by the identity matrix, I, from the left or from the right: AI = IA = A.
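In NumPy the equivalent of MATLAB's eye command is np.eye; a small sketch (the 3×3 size and the example matrix are arbitrary):

import numpy as np

I = np.eye(3)                    # 3x3 identity matrix
A = np.array([[2, 1, 4],
              [4, 1, 8],
              [2, -1, 3]])
print(np.allclose(I @ A, A))     # True: I A = A
print(np.allclose(A @ I, A))     # True: A I = A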
Inverse of a matrix:

The inverse of a matrix A is typically written as A⁻¹. The matrix A⁻¹ is the inverse of the matrix A if, when the two matrices are multiplied, the product is the identity matrix. Both matrices must be square, and the multiplication order can be A⁻¹A or AA⁻¹:

A⁻¹A = AA⁻¹ = I

In Python, the inverse of a matrix can be obtained with the NumPy library function np.linalg.inv(A).
Examples with Python

If A is

2   1   4
4   1   8
2  -1   3

then inv(A) is

 5.5000  -3.5000   2.0000
 2.0000  -1.0000   0
-3.0000   2.0000  -1.0000

import numpy as np
A = np.array([[2, 1, 4],
              [4, 1, 8],
              [2, -1, 3]])
B = np.linalg.inv(A)
print('Matrix A:\n', A)
print('Inverse A:\n', B)
I = np.dot(A, B)
print('Identity matrix of A:\n', I)

The identity matrix of A, I = A × inv(A), is

1  0  0
0  1  0
0  0  1
Simple Linear Regression
Linear regression models a straight-line relationship between a scalar response (the dependent variable, y) and one or more explanatory variables (the independent variables, x). Simple linear regression assumes a linear relationship between x and y:

y = mx + c

where y is the dependent variable, x is the independent variable, m is the slope of the line, and c is the y-intercept (the value of y when x = 0). m is also called the gradient, because it is the ratio of the change in y to the change in x.
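A tiny numeric sketch (the values m = 2 and c = 1 are arbitrary illustrative choices) shows how the slope and intercept determine the line:

m, c = 2, 1                # assumed example slope and intercept
for x in (0, 1, 2, 3):
    print(x, m * x + c)    # prints 0 1, 1 3, 2 5, 3 7: y rises by m = 2 per unit of x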

Figure: pivoting the line at the intercept point; translating (moving) the line along the y axis.
Source: Textbook - Grokking Machine Learning by Luis Serrano
Simple Linear Regression using Gradient Descent
Algorithm
The gradient descent (or least mean squares) algorithm updates the coefficients based on the derivative of the error function.

Because the linear model is continuous and differentiable, which is one of its biggest advantages, we can define an error function error(p) (also called a cost or loss function) that we minimize in order to update our coefficients. In the case of a linear relationship between the input and output variables, we can define the error function error(p) as the sum of squared errors (SSE):

error(p) = (1/n) Σ_i (t_i - p_i)²

In code:
errors = Y - predictions
cost = (errors**2).sum()

(The quadratic cost function depends on the prediction p; the prediction p depends on the input x; and the desired output y is constant here.)

In order to minimize the SSE error function, we will use gradient descent, a simple yet useful optimization algorithm that is often used in machine learning to find the local minimum of linear systems.

To simplify the illustration, let us consider a single coefficient with our convex cost function. As illustrated in the figure below, gradient descent can be pictured as "climbing down a hill" until a local or global minimum is reached. At each step, we move in the opposite direction of the gradient. The step size, and therefore the speed of convergence, is controlled by the learning rate.

Now, the error surface of a linear regression model is a multidimensional parabola. Because parabolas have only one minimum, a gradient descent algorithm (such as the LMS rule) must produce a solution at that minimum given enough iterations.
The amount the output changes in response to a small change in an input is the derivative. We usually talk about the derivative with respect to a single input, or about the gradient with respect to all the inputs. The gradient is simply the vector of the partial derivatives with respect to all the inputs. The gradient is, in fact, the direction of steepest increase of the function.
In gradient descent optimization, the coefficients are updated by taking a step in the opposite direction of the gradient, Δa = -α·∇E(a) and Δb = -α·∇E(b); thus we have to compute the partial derivative of the error function with respect to each coefficient in the coefficient vector:

Δa = -α ∂E/∂a        Δb = -α ∂E/∂b
If we take the partial derivatives of the squared error function with respect to the coefficients a and b at the k-th iteration, ∂e(k)²/∂a and ∂e(k)²/∂b, the chain rule gives

∂e²(k)/∂a = (∂e²(k)/∂e(k)) · (∂e(k)/∂a) = -2·e(k)·x
∂e²(k)/∂b = (∂e²(k)/∂e(k)) · (∂e(k)/∂b) = -2·e(k)

where

∂e²(k)/∂e(k) = 2·e(k)

∂e(k)/∂a = ∂[t(k) - p(k)]/∂a = ∂/∂a [y(k) - (a·x(k) + b)] = -x
∂e(k)/∂b = ∂[t(k) - p(k)]/∂b = ∂/∂b [y(k) - (a·x(k) + b)] = -1

Here t(k) is the target (actual, real) output, the y value from the dataset, and p(k) = a·x(k) + b is the predicted (calculated) output.
And we can update our coefficients (α, alpha, is the learning rate):

a_new = a_old + Δa,   Δa = -α ∂E/∂a = -α Σ_i 2·e·(-x) = 2α Σ_i (y - (ax + b))·x

b_new = b_old + Δb,   Δb = -α ∂E/∂b = -α Σ_i 2·e·(-1) = 2α Σ_i (y - (ax + b))

y is the target (actual, real) output, the value from the dataset; ax + b is the predicted (calculated) output.

Y_pred = a*X + b # The current predicted value of Y


D_a = (-2/n) * sum(X * (Y - Y_pred)) # Derivative wrt a
D_b = (-2/n) * sum(Y - Y_pred) # Derivative wrt b
a = a - L * D_a # Update a, L is learning rate
b = b - L * D_b # Update b
Simple Linear regression implementation in Python

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([3, 5, 7, 9, 11])

# Building the model y=ax+b


a = 0 # initial value of the coefficient a
b = 0 # initial value of the coefficient b

L = 0.01       # The learning rate
epochs = 1000  # The number of iterations to perform gradient descent
n = float(len(X))  # Number of elements in X

# Performing Gradient Descent
for i in range(epochs):
    Y_pred = a*X + b                      # The current predicted value of Y
    D_a = (-2/n) * sum(X * (Y - Y_pred))  # Derivative wrt a
    D_b = (-2/n) * sum(Y - Y_pred)        # Derivative wrt b
    a = a - L * D_a                       # Update a
    b = b - L * D_b                       # Update b

print('a=', np.round(a), 'b=', np.round(b))

Output:
a= 2.0 b= 1.0
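Once trained, the learned coefficients can be used directly for a new prediction; a minimal usage sketch (x = 11 is just an illustrative input):

# Predict for a new input with the learned coefficients
x_new = 11
y_new = a * x_new + b
print('prediction for x=11:', y_new)   # approximately 23 for a ≈ 2, b ≈ 1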


Simple Linear regression implementation in Python
import numpy as np

class myLinearRegression(object):
    def __init__(self, lrate=0.01, niter=10):
        self.lrate = lrate
        self.niter = niter

    def fit(self, X, y):
        # coefficients
        self.coefficient = np.zeros(1 + X.shape[1])

        # Errors
        self.errors = []

        # Cost function
        self.cost = []

        for i in range(self.niter):
            predicted = self.net_input(X)
            errors = y - predicted
            self.coefficient[1:] += self.lrate * X.T.dot(errors)
            self.coefficient[0] += self.lrate * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost.append(cost)
        return self

    def net_input(self, X):
        """Compute net input"""
        return np.dot(X, self.coefficient[1:]) + self.coefficient[0]

    def prediction(self, X):
        """Compute linear prediction"""
        return self.net_input(X)


import numpy as np
import pandas as pd

y = np.array([3, 5, 7, 11])
X = np.array([[1], [2], [3], [5]])

# learning lrate = 0.01
net = myLinearRegression(0.01, 10).fit(X, y)
print(net.coefficient[0])
print(net.coefficient[1])
print(net.net_input(4))

Output:
0.6375842816075681
2.0948700956551938
[9.01706466]
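The class's prediction method is equivalent to net_input and can be used the same way; a short usage sketch (the new input value 6 is illustrative):

# Predict for a new input with the fitted model
x_new = np.array([[6]])
print(net.prediction(x_new))   # expected to be close to 2*6 + 1 = 13 for this dataset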
Method                        Accuracy   Updating Time          Memory Usage   Online Learning
Stochastic Gradient Descent   Good       After each case        Low            Yes
Mini-Batch Gradient Descent   Good       After subset of data   Medium         Yes
Batch Gradient Descent        Good       After whole data       High           No
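The three variants differ only in how many samples contribute to each update. Below is a minimal sketch (not from the lecture code) of mini-batch gradient descent for the same y = a·x + b model used earlier; the batch size of 2 is an arbitrary illustrative choice.

import numpy as np

X = np.array([1., 2., 3., 4., 5.])
Y = np.array([3., 5., 7., 9., 11.])
a, b = 0.0, 0.0
L = 0.01          # learning rate
batch_size = 2    # arbitrary illustrative choice
for epoch in range(1000):
    idx = np.random.permutation(len(X))          # shuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]      # indices of the current mini-batch
        xb, yb = X[sel], Y[sel]
        y_pred = a * xb + b
        D_a = (-2 / len(xb)) * np.sum(xb * (yb - y_pred))
        D_b = (-2 / len(xb)) * np.sum(yb - y_pred)
        a -= L * D_a
        b -= L * D_b
print(a, b)   # expected to approach a ≈ 2, b ≈ 1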


Lab
• Using Google Colab (free)
• Using Visual Studio
Popular Python Libraries for Machine Learning
Scikit-learn is a free machine learning library for Python.

Theano is a popular Python library for efficiently evaluating mathematical expressions involving multi-dimensional arrays.

TensorFlow is a very popular open-source library developed by Google.

Keras is a very popular machine learning library for Python, capable of running on top of TensorFlow, CNTK, or Theano.

PyTorch is a popular open-source machine learning library for Python, based on Torch.

Caffe is a library for machine learning in vision applications.
Machine Learning Model Development Steps

Basic Development
• Loading data
• Explore the data
• Implement the ML model
• Train the model (fitting the model to the data)
• Evaluate the quality of the trained (fitted) model
• Make predictions using the trained model

Advanced Development
• Loading data
• Explore the data
• Divide the data into training and test datasets
• Implement the ML model
• Train the model using the training data (fitting the model to the training data)
• Evaluate the quality of the trained (fitted) model using the test dataset
• Make predictions using the trained model
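A minimal sketch of the "advanced" workflow using scikit-learn's train_test_split; the synthetic dataset and the 80/20 split are illustrative assumptions, not the lecture's data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load (here: generate) the data
x = np.arange(1, 21).reshape(-1, 1)
y = 2 * x.ravel() + 1

# Divide the data into training and test datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Implement and train the model on the training data
model = LinearRegression()
model.fit(x_train, y_train)

# Evaluate the quality of the trained model on the test dataset
print('R^2 on test data:', model.score(x_test, y_test))

# Make predictions using the trained model
print('prediction for x=25:', model.predict([[25]]))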
10 common regression algorithms available in scikit-learn and compatible libraries
# Linear Regression
from sklearn.linear_model import LinearRegression
# LGBM Regressor (Light Gradient Boosted Machine)
from lightgbm import LGBMRegressor
# XGBoost Regressor
from xgboost.sklearn import XGBRegressor
# CatBoost Regressor
from catboost import CatBoostRegressor
# Stochastic Gradient Descent Regression
from sklearn.linear_model import SGDRegressor
# Kernel Ridge Regression
from sklearn.kernel_ridge import KernelRidge
# Elastic Net Regression
from sklearn.linear_model import ElasticNet
# Bayesian Ridge Regression
from sklearn.linear_model import BayesianRidge
# Gradient Boosting Regression
from sklearn.ensemble import GradientBoostingRegressor
# Support Vector Machine
from sklearn.svm import SVR
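Because all of these regressors follow the same fit/predict interface, they can be swapped into the same training loop; a minimal sketch using two of the scikit-learn estimators listed above (the toy data is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

x = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

# Any estimator with the same interface can be dropped into this loop
for model in (LinearRegression(), SGDRegressor(max_iter=1000, tol=1e-3)):
    model.fit(x, y)
    print(type(model).__name__, model.predict([[11]]))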
Linear regression implementation in scikit-learn

Dataset:  x = 1, 2, 3, 4, 5  →  y = 3, 5, 7, 9, 11;  predict y for x = 11.

# Step 1 - Importing the required libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Step 2 - Loading the data and performing basic data exploration.
# load our input data as a numpy array object
x = np.array([1, 2, 3, 4, 5])
# Examine the data
print('type of x:', type(x))
print('x=\n', x)
print('shape of x:', x.shape)
# Modify the input data shape by "reshaping" it:
# scikit-learn expects the inputs as a 2D array of shape (n_samples, n_features)
x = x.reshape(-1, 1)
print('shape of x:', x.shape)  # (5, 1)
print('x=\n', x)

# Alternatively, load our data as a 2D numpy array object
# x = np.array([[1], [2], [3], [4], [5]])
# or load our data as a pandas DataFrame object
# x = pd.DataFrame({'inputs': [1, 2, 3, 4, 5]})
# print(x['inputs'])
# load our output data as a numpy array object
y = np.array([3, 5, 7, 9, 11])
print('type of y:', type(y))
print('y=\n', y)
print('shape of y:', y.shape)
# Train the model
model = LinearRegression()
model.fit(x, y)
# Evaluating the model quality
r_sq = model.score(x,y)
print('score of r square: ', r_sq)

# Making prediction using the trained (fitted ) model

y_pred = model.predict(x)
print('predicted response:', y_pred)

y_pred = model.predict([[11]])
print('predicted response:', y_pred)

y_pred = model.predict([[11],[15]])
print('predicted response:', y_pred)

# Making prediction
xnew = np.array([11])
xnew = xnew.reshape(-1, 1)
y_pred = model.predict(xnew)
print('predicted response:', y_pred)

xnew=np.array([11, 15])
xnew.shape
xnew = xnew.reshape(-1, 1)
xnew.shape
y_pred = model.predict(xnew)
print('predicted response:', y_pred)

# examining the fitted model coefficents

print('slope:', model.coef_)
print('intercept:', model.intercept_)

# You can notice that .intercept_ is a scalar, while .coef_ is an array.

# making prediction using the regression model


xnew = np.array([11])
y_pred = model.coef_ * xnew + model.intercept_
print('predicted response:', y_pred)
Multiple linear regression implementation in scikit-learn

Dataset:
x1  x2   y
3   1    8
5   2   12
3   2   10
3   5   16
2   3   11
4   5    ?

# Multiple Linear Regression With scikit-learn
import numpy as np
from sklearn.linear_model import LinearRegression

# load dataset as python list objects
x = [[3, 1], [5, 2], [3, 2], [3, 5], [2, 3]]  # list
y = [8, 12, 10, 16, 11]                       # list

print(type(x))
print(np.shape(x))
print(x)
print(y)

# optionally you can convert to numpy arrays
# x, y = np.array(x), np.array(y)
# print(type(x))
# print(np.shape(x))
# print(x)
# print(y)
# Implement the regression model and fit it (train it)
model = LinearRegression().fit(x, y)

# Evaluate the quality of the model


r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)

# Display the learned parameters of the model


print('intercept:', model.intercept_)
print('slope:', model.coef_)

# Use the trained (fitted) model for prediction


y_pred = model.predict([[4,5]])
print('predicted response:', y_pred)
Linear regression implementation in scikit-learn

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# load dataset as a pandas data frame object
df = pd.read_csv('fishData.csv')

Dataset (fishData.csv):
Width    Height
1.408    2.112
1.9992   3.528
2.432    3.824
2.6316   4.5924
…        …
1.4      ?

# Examine the data
print(type(df))
print(df)
print(df.shape)

x = df['Width'].values
print(type(x))
print(x)
print(x.shape)

# Modify the input data shape by "reshaping" the input data
x = x.reshape(-1, 1)
print(x.shape)  # (n, 1)
print(x)

y = df['Height'].values
print(type(y))
print(y)
print(y.shape)

# Implement the model


model = LinearRegression()
# Train (fit) the model
model.fit(x,y)
# Evaluate the model
r_sq = model.score(x,y)
print('score of r square: ', r_sq)
# Use the trained (fitted) model for prediction
y_pred = model.predict([[1.4]])
print('predicted response:', y_pred)
Multiple linear regression implementation in scikit-learn

Dataset (fishDataMLR.csv):
X1       X2       X3       X4      X5      Y
Length1  Length2  Length3  Height  Width   Weight
7.5      8.4      8.8      2.112   1.408   5.9
12.5     13.7     14.7     3.528   1.9992  32
13.8     15       16       3.824   2.432   40
15       16.2     17.2     4.5924  2.6316  51.5
…        …        …        …       …       …
41       44       46       12      7.55    ?
Multiple linear regression implementation in scikit-learn

# Multiple Linear Regression With scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# load dataset as a pandas data frame object
df = pd.read_csv('fishDataMLR.csv')

# Explore the data
df.head()
df.dtypes
len(df.columns)
df.describe()
# Display the number of null data observations
df.isnull().values.sum()

# Plot the data
df.plot()
df.hist()
plt.show()
plt.scatter(df["Height"], df["Width"])

# Examine the data
print(type(df))
print(df)
print(df.shape)

# Specify target and input features
target = df.iloc[:, 5].name
features = df.iloc[:, 0:5].columns.tolist()
features

# Correlations of features with target variable
correlations = df.corr()
correlations['Weight']

# Plot Pearson correlation matrix
fig_1 = plt.figure(figsize=(12, 10))
new_correlations = df.corr()
sns.heatmap(correlations, annot=True, cmap='Greens', annot_kws={'size': 8})
plt.title('Pearson Correlation Matrix')
plt.show()

# Determine the highest intercorrelations
highly_correlated_features = correlations[new_correlations > 0.98]
highly_correlated_features.fillna('-')

x = df[["Length1", "Length2", "Length3", "Height", "Width"]]
print(type(x))
print(x)
print(x.shape)

y = df['Weight'].values
print(type(y))
print(y)
print(y.shape)

# Implement the model
model = LinearRegression()
# Train the model
model.fit(x, y)
# Evaluate the quality of the model
r_sq = model.score(x, y)
print('score of r square: ', r_sq)
# Display the learned parameters of the model
coeff_df = pd.DataFrame(model.coef_, x.columns, columns=['Coefficient'])
coeff_df

# Use the trained (fitted) model for prediction
xnew = np.array([[41, 44, 46, 12, 7.55]])
y_pred = model.predict(xnew)
print('predicted response:', y_pred)
Dataset (Fish.csv); we select only the portion of the data set that belongs to the Perch species:
Species  Weight  Length1  Length2  Length3  Height   Width
Perch    5.9     7.5      8.4      8.8      2.112    1.408
Perch    32      12.5     13.7     14.7     3.528    1.9992
Perch    40      13.8     15       16       3.824    2.432
Perch    51.5    15       16.2     17.2     4.5924   2.6316
Perch    70      15.7     17.4     18.5     4.588    2.9415
Perch    100     16.2     18       19.2     5.2224   3.3216
Perch    78      16.8     18.7     19.4     5.1992   3.1234
Perch    80      17.2     19       20.2     5.6358   3.0502
…        …       …        …        …        …        …

# Example for Multiple Linear Regression With scikit-learn
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# load dataset as a pandas data frame object
df = pd.read_csv('Fish.csv')

# Select only the Perch rows
df.head()
df['Species'].unique().tolist()
df = df.loc[df['Species'] == 'Perch']
print('\nDataFrame Shape :', df.shape)
print('\nNumber of rows :', df.shape[0])
print('\nNumber of columns :', df.shape[1])

# Explore the data
df.head()
df.dtypes
len(df.columns)
df.describe()

# Specify target and input features

# Set the input variables
x = df[["Length1", "Length2", "Length3", "Height", "Width"]]
print(type(x))  # pandas.core.frame.DataFrame
print(x)
print(x.shape)

# Set the output variable
y = df['Weight'].values
print(type(y))  # numpy.ndarray
print(y)
print(y.shape)

# Implement the linear regression model
model = LinearRegression()
# Fit (train) the model
model.fit(x, y)
# Evaluate the quality of the trained model
r_sq = model.score(x, y)
print('score of r square: ', r_sq)
# Display the learned coefficients of the trained model
coeff_df = pd.DataFrame(model.coef_, x.columns, columns=['Coefficient'])
coeff_df
print('intercept:', model.intercept_)
# Make a prediction using the trained model
xnew = np.array([[41, 44, 46, 12, 7.55]])
y_pred = model.predict(xnew)
print('predicted response:', y_pred)
import pandas
df = pandas.read_csv('fishDataMLR.csv')

# Examine the data


print(type(df))
print(df)
print(df.shape)

Model diagram: inputs Length1, Length2, Length3, Height, Width → model → Weight
Mathematics of Gradient Descent – Artificial Intelligence & Deep Learning

𝑦 = 𝑚𝑥 + 𝑏
Simple regression

Simple linear regression uses traditional slope-intercept form, where m and b are
the variables our algorithm will try to “learn” to produce the most accurate
predictions. x represents our input data and y represents our prediction.

m: the coefficient for the independent variable. In machine learning we call coefficients weights.
x: the independent variable. In machine learning we call these variables features.
b: the bias or intercept, where our line intercepts the y-axis. In machine learning we call intercepts bias. Bias offsets all predictions that we make.

https://www.youtube.com/watch?v=jc2IthslyzM
The goal of any Machine Learning Algorithm is to minimize the Cost Function.

A Cost Function/Loss Function evaluates the performance of our Machine


Learning Algorithm.

Gradient descent is an algorithm used to find the minimum of a function; finding that minimum is called an optimization problem.
Define an objective (cost/loss function)
In order to find the ideal weight and bias we have to first define an objective for
the algorithm. The objective is a function that calculates the error rate.
Basically, how far off are we with a given set of weights. The goal is to minimize
the output of this function.

Error function: Loss(m, b) = (1/N) Σ_i (y_i - ŷ_i)²,   where the error for each sample is Error = y - ŷ
Cost function

We need a cost function to start optimizing our weights.

Let’s use MSE (L2) as our cost function. MSE measures the average squared
difference between an observation’s actual and predicted values. The output is a
single number representing the cost, or score, associated with our current set of
weights. Our goal is to minimize MSE to improve the accuracy of our model.

Given our simple linear equation y = mx + b, we can calculate MSE as:

Loss(m, b) = (1/N) Σ_i (y_i - (m·x_i + b))²,   Error = y - ŷ
To minimize MSE we use Gradient Descent to calculate the gradient of our cost
function.
There are two parameters (coefficients) in our cost function we can control: weight
m and bias b. Since we need to consider the impact each one has on the final
prediction, we use partial derivatives. To find the partial derivatives, we use the
Chain rule. We need the chain rule because (y - (mx + b))² is really two nested functions: the inner function y - (mx + b) and the outer function (error)².
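As a sanity check on the chain-rule result, the analytic partial derivatives can be compared against finite differences; a minimal sketch (the sample data and the point (m, b) = (1.5, 0.5) are arbitrary illustrative choices):

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([3., 5., 7., 9., 11.])

def mse(m, b):
    return np.mean((y - (m * x + b)) ** 2)

m, b = 1.5, 0.5
# Analytic gradients from the chain rule
dm = np.mean(-2 * x * (y - (m * x + b)))
db = np.mean(-2 * (y - (m * x + b)))
# Finite-difference approximations
eps = 1e-6
dm_num = (mse(m + eps, b) - mse(m - eps, b)) / (2 * eps)
db_num = (mse(m, b + eps) - mse(m, b - eps)) / (2 * eps)
print(dm, dm_num)  # should agree closely
print(db, db_num)  # should agree closely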
What is Gradient Descent?

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function f that minimize a loss (cost) function.

y = mx + b
L(m, b) = (1/N) Σ_i (y_i - ŷ_i)²,   error = y - ŷ

The loss has the form y = (error)², just like y = x².

If we look carefully, our cost function is of the form y = x². In a Cartesian coordinate system this is the equation of a parabola. To minimize this function, we need to find the value of x that produces the lowest value of y (the minimum of the parabola). For such cases we need an algorithm to locate the minimum, and that algorithm is called gradient descent.

Randomly select a starting point, for example the black dot in the figure.
Possible actions would be:
• You might go upward or downward
• If you decide on which way to go, you might take a
bigger step or a little step to reach your destination.
Essentially, there are two things that you should know to
reach the minima, i.e. which way to go and how big a step to
take.

Gradient Descent Algorithm helps us to make these


decisions efficiently and effectively with the use of
derivatives. A derivative is a term that comes from calculus
and is calculated as the slope of the graph at a particular
point. The slope is described by drawing a tangent line to the
graph at the point. So, if we are able to compute this tangent
line, we might be able to compute the desired direction to
reach the minima.
If y = f(u) and u = g(x), then the chain rule in this abbreviated form is written in Leibniz notation as dy/dx = (dy/du)·(du/dx).

L(m, b) = (1/N) Σ_i (y_i - ŷ_i)² = (1/N) Σ_i (error)²,   error = y - ŷ

Let u = error, so that L(m, b) = u² = error². Then

∂L/∂m = (∂L/∂u) · (∂u/∂m) = 2·error·(-x) = -2·x·error

where

∂L/∂u = ∂(u²)/∂u = 2·error

∂u/∂m = ∂(y - (mx + b))/∂m = ∂(-mx + y - b)/∂m = -x
Similarly for b:

∂L/∂b = (∂L/∂u) · (∂u/∂b) = 2·error·(-1) = -2·error

where

∂L/∂u = ∂(u²)/∂u = 2·error

∂u/∂b = ∂(y - (mx + b))/∂b = ∂(-mx + y - b)/∂b = -1
Updating the learnable parameters m and b

Gradient descent moves each parameter a small step against its gradient (α is the learning rate):

m = m - α ∂Loss/∂m
b = b - α ∂Loss/∂b

With ∂Loss/∂m = -2·x·error and ∂Loss/∂b = -2·error, this gives

Δm = -α·(-2·x·error) = 2·α·x·error,   m_new = m_old + Δm
Δb = -α·(-2·error) = 2·α·error,       b_new = b_old + Δb

Gradient Descent Rule for multiple weights and biases (vectorized notation)

We have derived the update rule for a single weight and bias. In reality a deep neural
network has a lot of weights and biases, which are represented as matrices (or tensors),
and so our update rule should also be modified to update all weights and biases of the
network simultaneously.
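A minimal sketch of what the vectorized update looks like for a whole weight vector w and a bias b at once, using matrix operations; the synthetic data, shapes, and hyper-parameters are illustrative assumptions, not the lecture's example.

import numpy as np

# Synthetic data: 100 samples, 3 features, true weights [1, 2, 3] and bias 0.5
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1., 2., 3.]) + 0.5

w = np.zeros(3)   # all weights are updated simultaneously
b = 0.0
lr = 0.1
for epoch in range(2000):
    y_hat = X @ w + b
    error = y - y_hat
    grad_w = -2 * X.T @ error / len(y)   # vector of partial derivatives, one per weight
    grad_b = -2 * error.mean()
    w -= lr * grad_w
    b -= lr * grad_b
print(w, b)   # expected to approach [1, 2, 3] and 0.5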
Linear Regression in Numpy

import numpy as np

# Data Generation
np.random.seed(42)
x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)

# Shuffles the indices
idx = np.arange(100)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:80]
# Uses the remaining indices for validation
val_idx = idx[80:]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

# Initializes parameters "a" and "b" randomly
np.random.seed(42)
a = np.random.randn(1)
b = np.random.randn(1)
print(a, b)

# Sets learning rate
lr = 1e-1
# Defines number of epochs
n_epochs = 1000

for epoch in range(n_epochs):
    # Computes our model's predicted output
    yhat = a + b * x_train

    # How wrong is our model? That's the error!
    error = (y_train - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Computes gradients for both "a" and "b" parameters
    a_grad = -2 * error.mean()
    b_grad = -2 * (x_train * error).mean()

    # Updates parameters using gradients and the learning rate
    a = a - lr * a_grad
    b = b - lr * b_grad

print(a, b)

Linear Regression in sklearn

import numpy as np
from sklearn.linear_model import LinearRegression

# Data Generation (same as above)
np.random.seed(42)
x = np.random.rand(100, 1)
y = 1 + 2 * x + .1 * np.random.randn(100, 1)

# Shuffles the indices
idx = np.arange(100)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:80]
# Uses the remaining indices for validation
val_idx = idx[80:]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

# sklearn
linr = LinearRegression()
linr.fit(x_train, y_train)
print(linr.intercept_, linr.coef_[0])
Linear Regression in Numpy

The NumPy implementation above illustrates the following points.

Make sure to always initialize your random seed to ensure reproducibility of your results.

For training a model, there are two initialization steps:
• Random initialization of parameters/weights (we have only two, a and b)
• Initialization of hyper-parameters (in our case, only the learning rate and the number of epochs)

For each epoch, there are four training steps:
• Compute the model's predictions (the forward pass)
• Compute the loss, using the predictions, the labels, and the appropriate loss function for the task at hand
• Compute the gradients for every parameter (the backward pass)
• Update the parameters

Those four training steps are looped over either each individual case (stochastic), n cases (mini-batch), or all cases (batch gradient descent), as in this example.
All arithmetic in NumPy operates elementwise by default: the multiply() function, or simply the * operator, performs element-wise matrix multiplication.

The dot() function computes the matrix product rather than element-wise multiplication.
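A minimal sketch of the difference on two small matrices (the values are arbitrary):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A * B)             # element-wise: [[ 5 12], [21 32]]
print(np.multiply(A, B)) # same as A * B
print(np.dot(A, B))      # matrix product: [[19 22], [43 50]]
print(A @ B)             # @ is equivalent to np.dot for 2D arrays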
