Regression
What is regression?
❖ Regression analysis is primarily used for two conceptually distinct purposes: prediction,
and understanding causal relationships.
❖ The earliest form of regression was the method of least squares, which was published
by Legendre in 1805 [1], and by Gauss in 1809 [2].
❖ History: The term "regression" was coined by Francis Galton in the 19th century to
describe a biological phenomenon. The phenomenon was that the heights of
descendants of tall ancestors tend to regress down towards a normal average (a
phenomenon also known as regression toward the mean).
Types of regression
❖ Linear Regression: Assumes a linear relationship between the dependent variable and
one or more independent variables.
❖ Logistic Regression: Used when the dependent variable is binary (two possible
outcomes) and models the probability of a particular outcome.
❖ Ridge Regression and Lasso Regression: Variants of linear regression that include
regularization terms to prevent overfitting.
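As a quick illustration (not from the original slides), both regularized variants are available in scikit-learn alongside plain LinearRegression; the alpha values below are illustrative regularization strengths, not recommended settings:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Toy data: the target depends on the first two of three features
np.random.seed(42)
X = np.random.rand(50, 3)
y = 2 * X[:, 0] - X[:, 1] + 0.1 * np.random.randn(50)

# Fit the plain and regularized models and compare their coefficients
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.01)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))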
Linear regression
❖ The most common form of regression analysis is linear regression, in which the
relationship between the dependent variable and the independent variable(s) is
approximated by a linear equation.
❖ The goal is to find the best-fitting line (or hyperplane in the case of multiple
independent variables) that minimizes the sum of squared differences between the
observed and predicted values.
❖ The model parameters of the hypothesis function may be determined by the principle
of least squares:
minimize the mean of the squared differences (MSE) between the observed
dependent (target) variable in the input dataset and the output of the (linear)
hypothesis function of the independent (feature) variables.
❖ Computational complexity: the model parameters can be obtained either by solving the
Normal equation or via the SVD-based pseudoinverse of X_b:
>>> np.linalg.pinv(X_b) @ y
❖ Computational cost of solving the Normal equation: O(n³), where n is the number of
features.
❖ Both the Normal equation and the SVD approach get very slow when the number of
features grows large (e.g., 100,000).
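As a concrete illustration (not part of the original slides), the sketch below generates a small noisy linear dataset and recovers the parameters both by solving the Normal equation explicitly and with the pseudoinverse one-liner shown above; X_b denotes the feature matrix with a bias column of 1s prepended:

import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)             # one feature
y = 4 + 3 * X + np.random.randn(m, 1)    # y = 4 + 3x + Gaussian noise
X_b = np.c_[np.ones((m, 1)), X]          # prepend the bias column of 1s

# Normal equation: theta = (X_b^T X_b)^{-1} X_b^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# SVD-based pseudoinverse (the one-liner from the slide)
theta_pinv = np.linalg.pinv(X_b) @ y

print(theta_normal.ravel())   # roughly [4, 3]
print(theta_pinv.ravel())     # same values, computed more robustly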
❖ The primary goal of gradient descent is to iteratively adjust the model's parameters in
the direction that reduces the error.
Gradient Descent: Overview
❖ High-level overview of gradient descent (a small numeric sketch follows this list):
• Initialize Parameters: Start with initial values for the parameters of the model.
• Calculate the Gradient: Compute the gradient of the cost function with respect to each
parameter. The gradient represents the direction and magnitude of the steepest
increase in the cost function.
• Update Parameters: Adjust the parameters in the opposite direction of the gradient to
decrease the cost. This is done by multiplying the gradient by a learning rate and
subtracting the result from the current parameter values.
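To make the three steps concrete, here is a tiny numeric sketch (not taken from the slides) that minimizes the one-dimensional cost J(θ) = (θ − 3)², whose gradient is dJ/dθ = 2(θ − 3):

theta = 0.0              # 1. initialize the parameter
eta = 0.1                # learning rate

for step in range(100):
    gradient = 2 * (theta - 3)       # 2. gradient of J at the current theta
    theta = theta - eta * gradient   # 3. step against the gradient

print(theta)   # approaches 3, the minimizer of J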
Fig. 2.4. Learning rate too small
Fig. 2.5. Learning rate too high
❖ Plateau: if the cost function has a long, nearly flat region (a plateau), the gradients there
are very small and gradient descent may take a very long time to cross it.
Fig.2.7. Gradient descent with (left) and without (right) feature scaling
Batch Gradient Descent
❖ To implement gradient descent, you need to compute the gradient of the cost function
with regard to each model parameter. For the MSE cost of linear regression, the gradient
vector over the m training instances and the update step with learning rate η are:
∇θ MSE(θ) = (2/m) Xᵀ (Xθ − y)    (5)
θ(next step) = θ − η ∇θ MSE(θ)    (6)
An implementation of Gradient Descent starts from randomly initialized model parameters
(a fuller sketch follows below):
theta = np.random.randn(2, 1)  # randomly initialized model parameters
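A minimal batch gradient descent sketch (not from the original slides) that builds on the initialization above; it regenerates a small noisy linear dataset so the block is self-contained, and applies Equations (5) and (6) at every epoch:

import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)             # one feature
y = 4 + 3 * X + np.random.randn(m, 1)    # noisy linear target
X_b = np.c_[np.ones((m, 1)), X]          # add the bias column of 1s

eta = 0.1                                # learning rate
n_epochs = 1000
theta = np.random.randn(2, 1)            # randomly initialized model parameters

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # Eq. (5): gradient of the MSE
    theta = theta - eta * gradients                 # Eq. (6): update step

print(theta.ravel())   # converges towards roughly [4, 3]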
❖ Learning curves: the learning_curve() function from sklearn.model_selection can be used to
analyze how the performance of a LinearRegression model changes as the training set size increases.
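A hedged sketch (not from the slides) of how learning_curve might be called; the noisy quadratic data, train_sizes, cv and scoring values are illustrative choices, and the resulting train_scores and valid_scores are what the code on the next slide turns into error curves:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Illustrative noisy quadratic dataset (a stand-in for the data of Figure 2.11)
np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(m)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.01, 1.0, 40),
    cv=5,
    scoring="neg_root_mean_squared_error")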
Learning Curves (Cont.)
❖ Figure 2.14 shows the output of the learning curves for simple linear regression applied to the
noisy quadratic dataset of Figure 2.11.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# the scoring is a negative error metric, so negate the scores to get errors
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)

# degree-10 polynomial regression pipeline, used to compare learning curves
polynomial_regression = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression())
❖ Bottom-line:
• If the training and validation scores are close and good, the model generalizes well.
• If the training score is much better than the validation score, the model may be overfitting.
• If both scores are poor, the model may be underfitting.
Logistic Regression
❖ Logistic regression is a statistical method used for binary classification problems, where the goal is to
predict the probability that an instance belongs to a particular class.
❖ Just like a linear regression model, a logistic regression model computes a weighted sum of the input
features (plus a bias term), but instead of outputting the result directly like the linear regression
model does, it outputs the logistic of this result.
❖ Estimating Probabilities: the estimated probability is p̂ = h_θ(x) = σ(θᵀx), where
σ(t) = 1 / (1 + exp(−t)) is the logistic (sigmoid) function.
❖ Prediction: the logistic regression model predicts using a 50% threshold probability:
ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5    (10)
❖ The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
The cost for a single training instance is:
c(θ) = −log(p̂) if y = 1, and c(θ) = −log(1 − p̂) if y = 0    (11)
❖ Log loss: averaging the per-instance cost over the whole training set gives the cost function:
J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾) ]    (12)
❖ Its partial derivatives, which gradient descent uses to minimize the log loss, are:
∂J(θ)/∂θⱼ = (1/m) Σᵢ ( σ(θᵀx⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾    (13)
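An illustrative NumPy sketch (not part of the original slides) of the probability estimate, the log loss (12) and its gradient (13); X_b is assumed to carry a leading bias column of 1s and y to hold 0/1 labels:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss_and_gradient(theta, X_b, y):
    m = len(y)
    p_hat = sigmoid(X_b @ theta)          # estimated probabilities p̂ = σ(θᵀx)
    eps = 1e-12                           # guard against log(0)
    loss = -np.mean(y * np.log(p_hat + eps)
                    + (1 - y) * np.log(1 - p_hat + eps))   # Eq. (12)
    gradient = X_b.T @ (p_hat - y) / m    # Eq. (13)
    return loss, gradient

# One gradient descent step on toy data
X_b = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0])
theta = np.zeros(2)
loss, grad = log_loss_and_gradient(theta, X_b, y)
theta -= 0.1 * grad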
Logistic Regression: Decision Boundary
❖ We can use the iris dataset to illustrate logistic regression. This is a famous dataset that contains the
sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris
versicolor, and Iris virginica.
>>> from sklearn.datasets import load_iris
>>> iris = load_iris(as_frame=True)
>>> list(iris)
['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']
>>> iris.data.head(3)
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4                0.2
1                4.9               3.0                1.4                0.2
2                4.7               3.2                1.3                0.2
>>> iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = iris.data[["petal width (cm)"]].values
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
Logistic Regression: Decision Boundary
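The decision-boundary figure from the original slide is not reproduced here; the sketch below is an illustration that assumes the log_reg model fitted on the previous slide has been run, and estimates where the predicted probability of Iris virginica crosses 50%:

import numpy as np

# Evaluate the fitted model on a grid of petal widths
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)    # petal widths from 0 to 3 cm
y_proba = log_reg.predict_proba(X_new)            # columns: P(not virginica), P(virginica)

# The decision boundary is where P(virginica) first reaches 50%
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]
print(decision_boundary)           # roughly 1.6-1.7 cm for this split

log_reg.predict([[1.7], [1.5]])    # -> array([ True, False])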
References
1. A. M. Legendre, Nouvelles méthodes pour la détermination des orbites des comètes, Firmin Didot,
Paris, 1805. "Sur la Méthode des moindres quarrés" appears as an appendix.
2. J. D. Angrist and J.-S. Pischke, Mostly Harmless Econometrics: An Empiricist's Companion,
Princeton University Press, 2008, Chapter 1.