Module3_Ch1

Module 3 covers various machine learning models and their training algorithms, including linear regression, gradient descent, and logistic regression. It discusses concepts such as bias, variance, and regularization techniques like Ridge and Lasso regression. The module also explains the importance of training methods, validation techniques, and the computational complexity of different algorithms.


Module 3

Training Models: Linear regression, gradient descent, polynomial regression, learning curves, regularized linear models, logistic regression
Support Vector Machine: linear, nonlinear, SVM regression, and under the hood
Topics covered:
• Machine Learning models and their training algorithms are often treated mostly like black boxes.
• Bias is the inability of the model to capture the true relationship, because of which there is some difference, or error, between the model’s predicted value and the actual value.
• These differences between the actual (expected) values and the predicted values are known as bias error, or error due to bias.
• Bias is a systematic error that occurs due to wrong assumptions in
the machine learning process.
• Low Bias: Low bias value means fewer assumptions are taken to build
the target function. In this case, the model will closely match the
training dataset.
• High Bias: High bias value means more assumptions are taken to
build the target function. In this case, the model will not match the
training dataset closely.
Chapter objectives
• Linear Regression model, one of the simplest models there is.
• Training methods
• A direct “closed-form” equation that directly computes the model parameters
that best fit the model to the training set (i.e., the model parameters that
minimize the cost function over the training set).
• Using an iterative optimization approach, called Gradient Descent (GD), that
gradually tweaks the model parameters to minimize the cost function over
the training set, eventually converging to the same set of parameters as the
first method.
• Polynomial Regression
• Logistic Regression
• Softmax Regression
• A simple linear model is mathematically represented as y = a0 + a1x + e
• where a0 is the bias (intercept), a1 is the slope of the line, and e is the prediction error
• a0 and a1 are the regression coefficients
• Y is random, and the observations are mutually independent: every event is independent of any intersection of the other events.
• The differences between predicted and true values are called errors; the errors are also mutually independent.
• The unknown parameters are constants.
• Figure: the optimal line, the data points x1…xn, and the errors ei
• A regression line is the line of best fit, for which the sum of the squares of the residuals is minimum.
• In matrix form, linear regression is modelled as Y = Xa + e, with dimensions (n×1) = (n×2)(2×1) + (n×1), where n is the number of data points.
• Example: given weekly sales y, predict the 7th week’s sales (see the sketch below).
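A minimal NumPy sketch of this setup; the six sales figures are made-up numbers used only for illustration:

import numpy as np

weeks = np.arange(1, 7).reshape(-1, 1)                            # weeks 1..6, shape (n, 1)
sales = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1]).reshape(-1, 1)   # made-up weekly sales y

X = np.c_[np.ones((6, 1)), weeks]                   # (n x 2): column of 1s for a0, week index for a1
a = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(sales)   # closed-form estimate of [a0, a1]
week7_pred = np.array([[1, 7]]).dot(a)              # predicted sales for the 7th week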
Closed-form solution: â = (XᵀX)⁻¹ Xᵀ Y
Training Models: Linear regression
• Simple regression model of life satisfaction:
• life_satisfaction = θ0 + θ1 × GDP_per_capita
• θ0 , θ1 are model parameters
• input feature: GDP_per_capita
• Linear Regression model prediction: ŷ = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn
• Where ŷ (y hat) is the predicted value
• n is the number of features.
• xi is the ith feature value.
• θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).
Linear Regression model prediction (vectorized form): ŷ = hθ(x) = θ · x = θᵀx
Validating regression methods

• 1. Standard error (SE), based on the residuals y − ŷ
• 2. Mean Absolute Error (MAE) = (1/n) Σi |yi − ŷi|
• 3. Mean Squared Error (MSE) = (1/n) Σi (yi − ŷi)²
• 4. Root Mean Squared Error (RMSE) = √MSE
• 5. Relative MSE (RelMSE), i.e., MSE normalized relative to the variability of y
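A minimal NumPy sketch of these measures; y_true and y_pred are assumed arrays of true and predicted values, and the Relative MSE shown divides MSE by the variance of y (one common convention):

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 9.0])    # assumed true values
y_pred = np.array([2.8, 5.4, 7.1, 9.6])    # assumed predictions

errors = y_true - y_pred                   # residuals y - ŷ
mae = np.mean(np.abs(errors))              # Mean Absolute Error
mse = np.mean(errors ** 2)                 # Mean Squared Error
rmse = np.sqrt(mse)                        # Root Mean Squared Error
rel_mse = mse / np.var(y_true)             # Relative MSE (MSE normalized by the variance of y)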
MSE cost function for a Linear Regression model: MSE(θ) = (1/m) Σi=1..m (θᵀx(i) − y(i))²
The Normal Equation
• To find the value of θ that minimizes the cost function, there is a closed-form solution, a mathematical equation that gives the result directly: the Normal Equation, θ̂ = (XᵀX)⁻¹ Xᵀ y.
• Randomly generated linear dataset

import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
plt.plot(X, y, "b.")
Add x0 = 1 to each instance and compute θ̂ with the Normal Equation:
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

• Apply the inv() function from NumPy’s linear algebra module (np.linalg) to compute the inverse of a matrix, and the dot() method for matrix multiplication.
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
# add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
Plot
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
Performing linear regression using Scikit-
Learn is quite simple
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
lin_reg.predict(X_new)
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
np.linalg.pinv(X_b).dot(y)

• The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands for “least squares”).
• The pseudoinverse itself is computed using a standard matrix factorization technique called Singular Value Decomposition (SVD), which decomposes the training set matrix X into the matrix multiplication of three matrices U Σ Vᵀ.
Computational Complexity
• The Normal Equation computes the inverse of XᵀX, which is an (n + 1) × (n + 1) matrix.
• The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), depending on the implementation.
• In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
• The SVD approach used by Scikit-Learn’s LinearRegression class is about O(n^2).


Gradient Descent
• Gradient Descent is a very generic optimization algorithm capable of
finding optimal solutions to a wide range of problems

• Start by filling θ with random values (this is called random initialization), then improve it gradually, taking one small step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.
• An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter.
• If the learning rate is too small, then the algorithm will have to go
through many iterations to converge, which will take a long time
• If the learning rate is too high, you might jump across the valley and
end up on the other side, possibly even higher up than you were
before.
• This might make the algorithm diverge, with larger and larger values,
failing to find a good solution
• Finally, not all cost functions look like nice regular bowls. There may
be holes, ridges, plateaus, and all sorts of irregular terrains, making
convergence to the minimum very difficult.
• Two main challenges with Gradient
Descent: if the random initialization starts
the algorithm on the left, then it will
converge to a local minimum, which is not
as good as the global minimum.
• If it starts on the right, then it will take a
very long time to cross the plateau, and if
you stop too early you will never reach the
global minimum.
Global and local minima point
• The point at which a function takes the minimum value is called
global minima.
• However, when the goal is to minimize the function and solved using
optimization algorithms such as gradient descent, it may so happen
that function may appear to have a minimum value at different
points. Those several points which appear to be minima but are not
the point where the function actually takes the minimum value are
called local minima.
• Machine learning algorithms such as gradient descent may get stuck in local minima during training: gradient descent only follows the local slope of the cost function, so depending on where it starts it may settle in a local minimum instead of the global minimum.
• The MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. As a result, Gradient Descent is guaranteed to approach arbitrarily close to the global minimum.
Batch Gradient Descent

• To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θj.
• In other words, you compute the partial derivative of the cost function with regard to parameter θj, noted ∂MSE(θ)/∂θj.
Gradient vector of the cost function: ∇θ MSE(θ) = (2/m) Xᵀ(Xθ − y)
This formula involves calculations over the full training set X at each Gradient Descent step!
This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step (actually, Full Gradient Descent would probably be a better name).
• Once the gradient vector is obtained, which points uphill, just go in
the opposite direction to go downhill.
• This means subtracting ∇θ MSE(θ) from θ.
• This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step, θ(next step) = θ − η ∇θ MSE(θ).
eta = 0.1  # learning rate; eta (η) is the 7th letter of the Greek alphabet
n_iterations = 1000
m = 100
theta = np.random.randn(2, 1)  # random initialization
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
Stochastic Gradient Descent
• The main problem with Batch Gradient Descent is the fact that it uses
the whole training set to compute the gradients at every step, which
makes it very slow when the training set is large.
• At the opposite extreme, Stochastic Gradient Descent just picks a
random instance in the training set at every step and computes the
gradients based only on that single instance.
• Obviously this makes the algorithm much faster
• On the other hand, due to its stochastic (i.e., random) nature, this
algorithm is much less regular than Batch Gradient Descent: instead
of gently decreasing until it reaches the minimum, the cost function
will bounce up and down, decreasing only on average.
• Over time it will end up very close to the minimum, but once it gets
there it will continue to bounce around, never settling down
n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
sgd_reg.intercept_, sgd_reg.coef_
Mini-batch Gradient Descent
• At each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches (see the sketch below).
• The main advantage of Mini-batch GD over Stochastic GD is that you
can get a performance boost from hardware optimization of matrix
operations, especially when using GPUs
• Mini-batch GD will end up walking
around a bit closer to the minimum
than SGD
• All end up near the minimum, but
Batch GD’s path actually stops at
the minimum, while both Stochastic
GD and Mini-batch GD continue to
walk around.
• Batch GD takes a lot of time to take each step, but Stochastic GD and Mini-batch GD would also reach the minimum if a good learning schedule is used.
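A minimal Mini-batch GD sketch, reusing X_b, y, m, and learning_schedule from the earlier cells; the batch size of 20 is an arbitrary choice:

minibatch_size = 20
n_epochs = 50

theta = np.random.randn(2, 1)                 # random initialization
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)       # shuffle the training set each epoch
    X_b_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for i in range(0, m, minibatch_size):
        xi = X_b_shuffled[i:i + minibatch_size]
        yi = y_shuffled[i:i + minibatch_size]
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)  # gradient on the mini-batch only
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients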
Polynomial Regression
• Polynomial regression can handle nonlinear relationships among variables by using an nth-degree polynomial.
• It fits nonlinear curves such as quadratic and cubic curves.

Second-degree transformation (quadratic): y = a0 + a1x + a2x²
Third-degree transformation (cubic): y = a0 + a1x + a2x² + a3x³

Polynomial with degree 2
Simple quadratic equation (degree 2):
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

import matplotlib.pyplot as plt

plt.plot(X, y, "b.")
plt.axis([-3, 3, 0, 10])  # X ranges from -3 to 3, so widen the axes accordingly
plt.show()
• from sklearn.preprocessing import PolynomialFeatures
• poly = PolynomialFeatures(degree=2, include_bias=False)
• X_poly = poly.fit_transform(X)

• from sklearn.linear_model import LinearRegression
• lin_reg = LinearRegression()
• lin_reg.fit(X_poly, y)
• lin_reg.intercept_, lin_reg.coef_  # intercept and coefficients of ŷ = a0 + a1x + a2x²
• Estimated model: ŷ = 0.56 x1² + 0.93 x1 + 1.78
• Original function: y = 0.5 x1² + 1.0 x1 + 2.0 + Gaussian noise

• Polynomial Regression is capable of finding relationships between features.
• Training data:
• when there are just one or two instances in the training set, the
model can fit them perfectly, which is why the curve starts at
zero.
• But as new instances are added to the training set, it becomes
impossible for the model to fit the training data perfectly, both
because the data is noisy and because it is not linear at all.
• So the error on the training data goes up until it reaches a
plateau, at which point adding new instances to the training set
doesn’t make the average error much better or worse.
• validation data:
• When the model is trained on very few training instances, it is
incapable of generalizing properly, hence validation error is
initially quite big.
• Then as the model is shown more training examples, it learns
and thus the validation error slowly goes down.
• However, once again a straight line cannot do a good job
modeling the data, so the error ends up at a plateau, very
close to the other curve.
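A minimal sketch of how such learning curves can be plotted; the helper name plot_learning_curves and the 80/20 split are assumptions, not part of the slides:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    # Hold out a fixed validation set, then train on growing subsets of the training set
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="training set")
    plt.plot(np.sqrt(val_errors), "b-", label="validation set")
    plt.legend()
    plt.show()

# plot_learning_curves(LinearRegression(), X, y)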
Bias, Variance, Irreducible error
• Bias: This part of the generalization error is due to wrong assumptions, such as assuming
that the data is linear when it is actually quadratic. A high-bias model is most
likely to underfit the training data
• Variance
This part is due to the model’s excessive sensitivity to small variations in the training data.
A model with many degrees of freedom (such as a high-degree polynomial
model) is likely to have high variance, and thus to overfit the training data.
• Irreducible error
This part is due to the noisiness of the data itself. The only way to reduce this
part of the error is to clean up the data (e.g., fix the data sources, such as broken
sensors, or detect and remove outliers)
Regularized Linear Models
• Ridge Regression
• Lasso Regression
• Elastic Net
Ridge Regression
• Ridge Regression is also called Tikhonov regularization
• The regularization term used is α Σi=1..n θi².
• This term is added to the cost function.
• This forces the learning algorithm to not only fit the data but also keep the
model weights as small as possible.
• The regularization term should only be added to the cost function during
training.
• Evaluation of model’s performance is using the un-regularized performance
measure
• Cost function used: J(θ) = MSE(θ) + α Σi=1..n θi²
• θ0 is not regularized, hence the sum starts at i = 1.
Ridge function usage in scikit
• from sklearn.linear_model import Ridge
• ridge_reg = Ridge(alpha=1, solver="cholesky")
• ridge_reg.fit(X, y)
• ridge_reg.predict([[1.5]])
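Alternatively, a sketch using Stochastic Gradient Descent: setting penalty="l2" in SGDRegressor adds an ℓ2 regularization term to the cost function, which is essentially Ridge-style regularization.

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l2")  # ℓ2 penalty, i.e., Ridge-style regularization
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])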
Lasso Regression
(Least Absolute Shrinkage and Selection Operator
Regression)
• Cost function: J(θ) = MSE(θ) + α Σi=1..n |θi|
• Lasso tends to eliminate the weights of the least important features (i.e., set them to zero).
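A usage sketch mirroring the Ridge example above; alpha=0.1 is an arbitrary choice:

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])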


Logistic Regression
• Logistic Regression (also called Logit Regression)
• Commonly used to estimate the probability that an instance belongs
to a particular class
• Logistic Regression model estimated probability (vectorized form): p̂ = hθ(x) = σ(xᵀθ)
• where σ(·) is the sigmoid function, σ(t) = 1 / (1 + exp(−t))

Examples for logistic regression
• Model prediction: ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5.
• σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model predicts 1 if xᵀθ is positive, and 0 if it is negative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())
# ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']
X = iris["data"][:, 3:]  # petal width
y = (iris["target"] == 2).astype(np.int64)  # 1 if Iris-Virginica, else 0
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")
plt.legend()
plt.show()

log_reg.predict([[1.7], [1.5]])
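As a quick check (a sketch reusing X_new and y_proba from above), the petal width at which the estimated probability crosses 50% can be read off directly; it comes out at around 1.6 cm:

decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]  # first petal width with P(Iris-Virginica) >= 50%
print(decision_boundary)  # about 1.6 cm: wider petals are classified as Iris-Virginica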
Softmax Regression
• The Logistic Regression model can be generalized to support multiple
classes directly, without having to train and combine multiple binary
classifiers.
• Softmax score for class k: sk(x) = xᵀθ(k)
• Softmax function: p̂k = exp(sk(x)) / Σj=1..K exp(sj(x))
• Predictor: ŷ = argmaxk p̂k = argmaxk sk(x)
In scikit learn
• X = iris["data"][:, (2, 3)]  # petal length, petal width
• y = iris["target"]
• softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
• softmax_reg.fit(X, y)
• softmax_reg.predict([[5, 2]])
• softmax_reg.predict_proba([[5, 2]])
