
Regression

What is regression?

❖ In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' or 'target' in machine learning parlance) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables', 'features', or 'attributes').

❖ Regression analysis is primarily used for two conceptually distinct purposes: prediction and understanding causal relationships.

❖ Applications: economics, finance, biology, psychology, and machine learning, to make predictions, infer relationships, and understand the underlying patterns in data.

❖ The earliest form of regression was the method of least squares, which was published
by Legendre in 1805 [1], and by Gauss in 1809 [2].

❖ History: The term "regression" was coined by Francis Galton in the 19th century to
describe a biological phenomenon. The phenomenon was that the heights of
descendants of tall ancestors tend to regress down towards a normal average (a
phenomenon also known as regression toward the mean).
Types of regression

There are different types of regression models, including:

❖ Linear Regression: Assumes a linear relationship between the dependent variable and
one or more independent variables.

❖ Multiple Regression: Extends linear regression to include multiple independent variables.

❖ Polynomial Regression: Allows for non-linear relationships by including polynomial terms in the model.

❖ Logistic Regression: Used when the dependent variable is binary (two possible
outcomes) and models the probability of a particular outcome.

❖ Ridge Regression and Lasso Regression: Variants of linear regression that include
regularization terms to prevent overfitting.
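❖ Ridge and Lasso are not demonstrated later in these slides, so here is a minimal sketch (an illustrative addition; the toy data and the alpha values are arbitrary choices) of how they can be fit with Scikit-Learn:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

X = 2 * np.random.rand(100, 1)               # toy feature matrix (illustrative)
y = 4 + 3 * X[:, 0] + np.random.randn(100)   # toy target vector (illustrative)

ridge_reg = Ridge(alpha=1.0)   # alpha controls the strength of the L2 penalty (illustrative value)
ridge_reg.fit(X, y)

lasso_reg = Lasso(alpha=0.1)   # alpha controls the strength of the L1 penalty (illustrative value)
lasso_reg.fit(X, y)

In both cases the regularization term shrinks the weights, which is what helps prevent overfitting.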
Linear regression

❖ The most common form of regression analysis is linear regression, in which the relationship between the dependent and independent variables is approximated by a linear equation.
❖ The goal is to find the best-fitting line (or hyperplane in the case of multiple
independent variables) that minimizes the sum of squared differences between the
observed and predicted values.

❖ Linear regression model:

   ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

where ŷ is the predicted value, n is the number of features, xᵢ is the i-th feature value, θ₀ is the bias term, and θ₁, …, θₙ are the feature weights.

Linear regression (cont.)

❖ Linear regression model prediction in vectorized form:

   ŷ = hθ(x) = θ⊺x

where θ is the model's parameter vector (containing the bias term θ₀ and the feature weights θ₁ to θₙ) and x is the instance's feature vector with x₀ = 1.

Linear regression: method of least squares

❖ The model parameters for the hypothesis function may be determined by the principle of least squares: minimize the mean of the squared differences (MSE) between the observed dependent or target variable in the input dataset and the output of the (linear) hypothesis function of the independent or feature variables.

❖ MSE cost function for a linear regression model:

   MSE(X, hθ) = (1/m) Σᵢ (θ⊺x⁽ⁱ⁾ − y⁽ⁱ⁾)²

where the sum runs over the m training instances, x⁽ⁱ⁾ is the feature vector of the i-th instance (with x₀ = 1) and y⁽ⁱ⁾ is its target value.

❖ The normal equation gives the parameter vector that minimizes the cost function in closed form:

   θ̂ = (X⊺X)⁻¹ X⊺ y

where X is the matrix of training feature vectors and y is the vector of target values.

❖ The matrix X⊺X is referred to as the Gram matrix or normal matrix.


Coding linear regression model from scratch
import numpy as np

np.random.seed(42)  # to make this code example reproducible
m = 100  # number of instances
X = 2 * np.random.rand(m, 1)  # column vector
y = 4 + 3 * X + np.random.randn(m, 1)  # column vector

from sklearn.preprocessing import add_dummy_feature

X_b = add_dummy_feature(X)  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

>>> theta_best
array([[4.21509616],
       [2.77011339]])

Fig.2.1. A randomly generated linear dataset

>>> X_new = np.array([[0], [2]])
>>> X_new_b = add_dummy_feature(X_new)  # add x0 = 1 to each instance
>>> y_predict = X_new_b @ theta_best
>>> y_predict
array([[4.21509616],
       [9.75532293]])

import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-", label="Predictions")
plt.plot(X, y, "b.")

Fig.2.2. Linear regression model predictions
Performing linear regression using Scikit-Learn

>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([4.21509616]), array([[2.77011339]]))
>>> lin_reg.predict(X_new)
array([[4.21509616],
       [9.75532293]])

>>> theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
>>> theta_best_svd
array([[4.21509616],
       [2.77011339]])

>>> np.linalg.pinv(X_b) @ y
array([[4.21509616],
       [2.77011339]])

❖ The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands for "least squares"), which can be called directly.

❖ This function computes θ̂ = X⁺y, where X⁺ is the pseudoinverse of X (specifically, the Moore–Penrose inverse). You can use np.linalg.pinv() to compute the pseudoinverse directly.

❖ The pseudoinverse itself is computed using a standard matrix factorization technique called singular value decomposition (SVD), which decomposes the training set matrix X into the matrix multiplication of three matrices U Σ V⊺. The pseudoinverse is computed as X⁺ = VΣ⁺U⊺.

❖ Computational complexity: normal equation: O(n³); SVD-based pseudoinverse: O(n²).
Computational complexity of linear regression - Gradient Descent

❖ Computational cost of solving the normal equation: O(n³), where n is the number of features.

❖ Computational cost of the SVD-based approach of Scikit-Learn: O(n²).

❖ Both the Normal equation and the SVD approach get very slow when the number of
features grows large (e.g., 100,000).

❖ Gradient descent is a fundamental optimization technique and is widely employed in training various machine learning models, including neural networks, linear regression, and logistic regression, among others.

❖ The primary goal of gradient descent is to iteratively adjust the model's parameters in
the direction that reduces the error.
Gradient Descent: Overview
❖ High level overview of gradient descent:
• Initialize Parameters: Start with initial values for the parameters of the model.

• Calculate the Gradient: Compute the gradient of the cost function with respect to each
parameter. The gradient represents the direction and magnitude of the steepest
increase in the cost function.

• Update Parameters: Adjust the parameters in the opposite direction of the gradient to
decrease the cost. This is done by multiplying the gradient by a learning rate and
subtracting the result from the current parameter values.

• Repeat steps 2 and 3 until convergence or for a predefined number of iterations (a minimal code sketch of this loop follows the figure below).

❖ Different variants:
Batch Gradient Descent,
Stochastic Gradient Descent,
Mini-Batch Gradient Descent

Fig.2.3. Pictorial description of Gradient Descent algorithm
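❖ As a minimal illustration of these four steps (not from the original slides; the cost function, learning rate, and iteration count are arbitrary choices), plain gradient descent on the one-variable function f(w) = (w − 3)² looks like this:

def grad(w):
    return 2 * (w - 3)     # derivative of f(w) = (w - 3)**2

w = 0.0                    # step 1: initialize the parameter
eta = 0.1                  # learning rate
for _ in range(100):       # step 4: repeat for a fixed number of iterations
    g = grad(w)            # step 2: compute the gradient at the current value
    w = w - eta * g        # step 3: move against the gradient, scaled by the learning rate

print(w)                   # converges to w ≈ 3, the minimizer of f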


Gradient Descent: Learning Rate
❖ The learning rate is a crucial hyperparameter: a value that is too small may lead to slow convergence, while a value that is too large may cause overshooting and instability.

Fig.2.4. Learning rate too small
Fig.2.5. Learning rate too high

❖ Local and global minimum

❖ Plateau

❖ Fortunately, the MSE cost function for a linear regression model happens to be a convex function.

Fig.2.6. Gradient descent pitfalls


Gradient Descent: Feature scaling
❖ When using gradient descent, you should ensure that all features have a similar scale
(e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to
converge.

Fig.2.7. Gradient descent with (left) and without (right) feature scaling
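❖ A minimal sketch of this scaling step (an illustrative addition; it assumes a feature matrix X like the one generated earlier):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each feature now has zero mean and unit variance
# gradient descent would then be run on X_scaled instead of X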
Batch Gradient Descent
❖ To implement gradient descent, you need to compute the gradient of the cost function with regard to each model parameter θⱼ:

   ∂MSE(θ)/∂θⱼ = (2/m) Σᵢ (θ⊺x⁽ⁱ⁾ − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾   (5)

❖ Gradient vector of the cost function:

   ∇θMSE(θ) = (2/m) X⊺(Xθ − y)   (6)

❖ Gradient Descent step:

   θ(next step) = θ − η ∇θMSE(θ)   (7)

An implementation of batch gradient descent:

eta = 0.1  # learning rate
n_epochs = 1000
m = len(X_b)  # number of instances

np.random.seed(42)
theta = np.random.randn(2, 1)  # randomly initialized model parameters

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

>>> theta
array([[4.21509616],
       [2.77011339]])
Batch Gradient Descent: Learning rates
❖ Gradient descent worked perfectly. But what if you had used a different learning rate (eta)? Fig.2.8 shows the first 20 steps of gradient descent using three different learning rates.

Fig.2.8. Batch gradient descent with various learning rates

❖ To find a good learning rate, we can use grid search.
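❖ A minimal sketch of such a grid search (an illustrative addition; the candidate values and the use of SGDRegressor's eta0 hyperparameter as the learning rate are assumptions, and X and y are the data generated earlier):

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"eta0": [0.001, 0.01, 0.1, 0.5]}   # candidate learning rates
grid_search = GridSearchCV(
    SGDRegressor(max_iter=1000, random_state=42),
    param_grid, cv=5, scoring="neg_root_mean_squared_error")
grid_search.fit(X, y.ravel())        # y.ravel() because fit() expects 1D targets
print(grid_search.best_params_)      # learning rate with the best cross-validation score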


Stochastic Gradient Descent
❖ Instead of using the whole training set, stochastic gradient descent picks a random instance in the training set at every step and computes the gradients based only on that single instance.
❖ It is very fast and memory efficient.

❖ Due to its stochastic nature, the algorithm is much less regular than batch gradient descent.
❖ The randomness also helps the algorithm jump out of local minima.

Fig.2.9. Convergence of the Stochastic Gradient Descent algorithm

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-5, penalty=None, eta0=0.01,
                       n_iter_no_change=100, random_state=42)
sgd_reg.fit(X, y.ravel())  # y.ravel() because fit() expects 1D targets

>>> sgd_reg.intercept_, sgd_reg.coef_
(array([4.21278812]), array([2.77270267]))

❖ Learning schedule: simulated annealing
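❖ Since the slide only names the learning schedule, here is a minimal from-scratch sketch (an illustration, not the slides' code; the schedule constants t0 and t1 are arbitrary) of stochastic gradient descent with a decaying learning rate, reusing X_b, y, m, and the numpy import from earlier:

n_epochs = 50
t0, t1 = 5, 50                       # learning schedule hyperparameters (illustrative values)

def learning_schedule(t):
    return t0 / (t + t1)             # the learning rate decays as training progresses

np.random.seed(42)
theta = np.random.randn(2, 1)        # random initialization

for epoch in range(n_epochs):
    for iteration in range(m):
        random_index = np.random.randint(m)         # pick one training instance at random
        xi = X_b[random_index : random_index + 1]   # keep the 2D shape (1, n+1)
        yi = y[random_index : random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)    # gradient computed from a single instance
        eta = learning_schedule(epoch * m + iteration)
        theta = theta - eta * gradients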
Mini Batch Gradient Descent
❖ Mini-batch GD computes the gradients on small random sets of instances called
mini-batches.

Fig.2.10. Gradient descent paths in parameter space
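❖ The slides include no code for this variant, so here is a minimal from-scratch sketch (an illustrative addition; the batch size and fixed learning rate are arbitrary choices, and X_b, y, and m come from earlier):

n_epochs = 50
batch_size = 20
eta = 0.1                                            # fixed learning rate for simplicity
np.random.seed(42)
theta = np.random.randn(2, 1)                        # random initialization

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)              # reshuffle the training set each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start : start + batch_size]  # one mini-batch of instances
        yi = y_shuffled[start : start + batch_size]
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)   # gradient on the mini-batch
        theta = theta - eta * gradients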
Polynomial Regression
❖ The same ordinary least squares method used in linear regression can be used for polynomial regression as well.
❖ A simple way to do this is to add powers of each feature as new features, then train a linear model
on this extended set of features.
np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)

Fig.2.11. Generated nonlinear and noisy dataset

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654])
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([1.78134581]), array([[0.93366893, 0.56456263]]))
❖ Model estimate: ŷ = 0.56 x₁² + 0.93 x₁ + 1.78

Fig.2.12. Polynomial regression model predictions
Learning Curves
❖ The high-degree polynomial regression model is severely overfitting the training data, while the
linear model is underfitting it.

❖ In general, how can you decide how complex your model should be? How can you tell that your model is overfitting or underfitting the data?

❖ Cross-validation is the method to get an estimate of a model's generalization performance (see the sketch at the end of this slide).

❖ Learning curves: plots of the model's training error and validation error as a function of the training iteration.

Fig.2.13. High-degree polynomial regression

❖ Learning curves: there is a learning_curve() function in sklearn.model_selection to analyze how the performance of a LinearRegression model changes as the training dataset size increases.
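❖ A minimal sketch of the cross-validation idea (an illustrative addition, not part of the slides; it reuses the noisy quadratic X and y from Fig.2.11):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; scores are negative RMSE, so negate them to get errors
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_root_mean_squared_error")
print(-scores.mean())   # estimated generalization error (RMSE)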
Learning Curves (Cont.)

❖ Fig.2.14 shows the output of the learning curve for simple linear regression applied to the noisy quadratic dataset of Fig.2.11.

from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.01, 1.0, 40), cv=5,
    scoring="neg_root_mean_squared_error")

train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

Fig.2.14. Learning curves


❖ Interpretation of the results: the training and validation errors are close, but both are high, which indicates underfitting.
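❖ The code above computes the error arrays but does not show the plotting step; a minimal plotting sketch (an illustrative addition) that would produce a figure like Fig.2.14:

import matplotlib.pyplot as plt

plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
plt.xlabel("Training set size")
plt.ylabel("RMSE")
plt.legend()
plt.show()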
Learning Curves (Cont.)
❖ Now, consider the learning curves of a 10th-degree polynomial model on the same data (Fig.2.11):

from sklearn.pipeline import make_pipeline

polynomial_regression = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression())

train_sizes, train_scores, valid_scores = learning_curve(
    polynomial_regression, X, y,
    train_sizes=np.linspace(0.01, 1.0, 40), cv=5,
    scoring="neg_root_mean_squared_error")

Fig.2.15. Learning curves
❖ There are two important differences as compared to the previous case:
• The error on the training data is much lower than before.
• There is a gap between the curves. This means that the model performs significantly better on
the training data than on the validation data, which is the sign of overfitting.

❖ Bottom-line:
• If the training and validation scores are close and good, the model generalizes well.
• If the training score is much better than the validation score, the model may be overfitting.
• If both scores are poor, the model may be underfitting.
Logistic Regression

❖ Logistic regression is a statistical method used for binary classification problems, where the goal is to
predict the probability that an instance belongs to a particular class.

❖ Just like a linear regression model, a logistic regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the linear regression model does, it outputs the logistic of this result.

❖ Estimating probabilities:

   p̂ = hθ(x) = σ(θ⊺x)

❖ Logistic sigmoid function:

   σ(t) = 1 / (1 + exp(−t))

Fig.2.11. Logistic function

❖ Prediction: logistic regression model prediction using a 50% threshold probability:

   ŷ = 0 if p̂ < 0.5,  1 if p̂ ≥ 0.5   (10)

❖ With the default threshold of 50% probability, the model predicts 1 if θ⊺x > 0 and 0 if θ⊺x < 0.
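❖ As a minimal numerical illustration of these formulas (not from the slides; the parameter and feature values are arbitrary), the probability estimate and the 50% threshold rule can be written directly in NumPy:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))   # logistic sigmoid

theta = np.array([-3.0, 2.0])         # illustrative parameters: bias term and one feature weight
x = np.array([1.0, 1.8])              # instance with bias feature x0 = 1 and feature x1 = 1.8

p_hat = sigmoid(theta @ x)            # estimated probability of the positive class
y_hat = int(p_hat >= 0.5)             # predict 1 when the probability is at least 50%
print(p_hat, y_hat)                   # theta @ x = 0.6 > 0, so p_hat ≈ 0.65 and y_hat = 1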
Logistic Regression: Training and Cost Function

❖ The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).

❖ Cost function for a single training instance:

   c(θ) = −log(p̂) if y = 1,  −log(1 − p̂) if y = 0   (11)

❖ Log loss (the cost function over the whole training set):

   J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾) ]   (12)

❖ Minimization of the cost function: there is no closed-form solution, but the cost function is convex, so gradient descent can be used with the partial derivatives

   ∂J(θ)/∂θⱼ = (1/m) Σᵢ (σ(θ⊺x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾   (13)

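❖ To make equations (12) and (13) concrete, here is a minimal from-scratch sketch (an illustration on assumed toy data, not the slides' code) that computes the log loss and runs plain gradient descent:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# toy data: 4 instances, one feature, with a bias column x0 = 1 (illustrative values)
X_toy = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, 2.5]])
y_toy = np.array([[0.0], [0.0], [1.0], [1.0]])
m_toy = len(X_toy)

theta = np.zeros((2, 1))   # initialize the parameters
eta = 0.1                  # learning rate

for _ in range(1000):
    p_hat = sigmoid(X_toy @ theta)                                            # estimated probabilities
    log_loss = -(y_toy * np.log(p_hat) + (1 - y_toy) * np.log(1 - p_hat)).mean()  # eq. (12)
    gradients = (1 / m_toy) * X_toy.T @ (p_hat - y_toy)                       # eq. (13)
    theta = theta - eta * gradients                                           # gradient descent step

print(theta.ravel(), log_loss)   # learned [bias, weight] and the final log loss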
Logistic Regression: Decision Boundary

❖ We can use the iris dataset to illustrate logistic regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.
>>> from sklearn.datasets import load_iris
>>> iris = load_iris(as_frame=True)
>>> list(iris)
['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']
>>> iris.data.head(3)
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
>>> iris.target.head(3)  # note that the instances are not shuffled
0    0
1    0
2    0
Name: target, dtype: int64
>>> iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = iris.data[["petal width (cm)"]].values
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
Logistic Regression: Decision Boundary

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)  # reshape to get a column vector
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]

plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica proba")
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica proba")
plt.plot([decision_boundary, decision_boundary], [0, 1], "k:", linewidth=2,
         label="Decision boundary")
[...]  # beautify the figure: add grid, labels, axis, legend, arrows, and samples
plt.show()

>>> decision_boundary
1.6516516516516517
>>> log_reg.predict([[1.7], [1.5]])
array([ True, False])
References

1. A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes, Firmin Didot,
Paris, 1805. “Sur la Méthode des moindres quarrés” appears as an appendix.

2. Chapter 1 of: Angrist, J. D., & Pischke, J. S. (2008). Mostly Harmless Econometrics: An Empiricist's
Companion. Princeton University Press.
