Week 6
1 Analysis of Maximum Likelihood Estimation in Linear
Regression
1.1 Introduction
Linear regression is a foundational algorithm in machine learning and statistics. It provides a means to model the relationship between input
features x ∈ Rd and a target variable y ∈ R. By incorporating a probabilistic perspective, we derived the Maximum Likelihood Estimator (MLE) for
the regression coefficients w, denoted ŵML . This estimator coincides with the optimal solution w∗ derived through minimizing the squared error.
In this chapter, we analyze the quality of ŵML as an estimator for the true w. We examine how noise and feature properties affect the deviation
of ŵML from w, providing insights for improving the estimator.
The expected squared deviation of ŵML from the true parameter is
\[ \mathbb{E}\big[\|\hat{w}_{\mathrm{ML}} - w\|^2\big] = \sigma^2\,\operatorname{tr}\big((X^\top X)^{-1}\big), \]
where:
• X ∈ ℝ^{n×d} is the design matrix formed by stacking the feature vectors,
and the expectation is taken over the randomness in y induced by the Gaussian noise. Two factors determine this quantity:
1. Noise Variance: The factor σ² scales the expected deviation directly; noisier targets make ŵML less reliable.
2. Feature Properties: The term tr((X⊤X)⁻¹) depends on the geometry of the features xᵢ. Poorly conditioned feature matrices (e.g., highly correlated features) increase this term, leading to worse estimates, as the sketch below illustrates.
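The effect of feature conditioning on σ² tr((X⊤X)⁻¹) can be checked numerically. The following is a minimal NumPy sketch; the two feature matrices are illustrative assumptions, not data from the text.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma2 = 200, 5, 0.25   # sample size, dimension, noise variance (illustrative)

    # Well-conditioned design: independent standard-normal feature columns.
    X_good = rng.standard_normal((n, d))

    # Poorly conditioned design: all columns are near-copies of a single direction.
    base = rng.standard_normal((n, 1))
    X_bad = base + 0.01 * rng.standard_normal((n, d))

    def expected_deviation(X, sigma2):
        # sigma^2 * tr((X^T X)^{-1}): the expected squared error of the MLE.
        return sigma2 * np.trace(np.linalg.inv(X.T @ X))

    print("well-conditioned features  :", expected_deviation(X_good, sigma2))
    print("poorly conditioned features:", expected_deviation(X_bad, sigma2))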
1.4 Implications and Extensions
Several feature-related issues inflate this deviation:
• Highly Correlated Features: Correlated features lead to a poorly conditioned X⊤X, increasing the trace of its inverse.
• Redundant Features: Adding redundant or irrelevant features increases the dimensionality d, contributing to higher deviation.
Two strategies help mitigate these issues:
1. Regularization: Adding a penalty term to the loss function, as in Ridge Regression, modifies the estimator to
\[ \hat{w}_{\mathrm{new}} = (X^\top X + \lambda I)^{-1} X^\top y, \]
where λ > 0 controls the strength of regularization. This improves the conditioning of X⊤X, reducing the trace of its inverse (a numerical sketch follows this list).
2. Feature Engineering: Selecting orthogonal or minimally correlated features can reduce redundancy and improve the quality of the estimator.
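A minimal sketch of the regularized estimator in NumPy, comparing it with the unregularized MLE on an assumed design matrix with one nearly redundant column; all sizes and values are illustrative, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, lam = 100, 4, 1.0
    X = rng.standard_normal((n, d))
    X[:, 3] = X[:, 0] + 1e-3 * rng.standard_normal(n)   # nearly redundant column
    w_true = np.array([1.0, -2.0, 0.5, 0.0])
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    # Unregularized MLE: (X^T X)^{-1} X^T y
    w_ml = np.linalg.solve(X.T @ X, X.T @ y)
    # Regularized estimator: (X^T X + lam * I)^{-1} X^T y
    w_new = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    print("tr((X^T X)^-1)        :", np.trace(np.linalg.inv(X.T @ X)))
    print("tr((X^T X + lam I)^-1):", np.trace(np.linalg.inv(X.T @ X + lam * np.eye(d))))
    print("error of w_ml         :", np.linalg.norm(w_ml - w_true))
    print("error of w_new        :", np.linalg.norm(w_new - w_true))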
1.5 Conclusion
The Maximum Likelihood Estimator ŵML for linear regression is influenced by both the noise variance σ 2 and the feature matrix X. While the
noise variance is inherent to the data-generating process, the feature matrix offers a means to control the quality of the estimator. Strategies like
regularization and careful feature selection can help reduce the expected deviation, providing a more reliable estimate of the true parameter w.
2 Improving Maximum Likelihood Estimation: Regularization
and Cross-Validation
2.1 Introduction
The Maximum Likelihood Estimator (MLE) for linear regression, ŵML , provides an optimal solution under the assumption of Gaussian noise with mean
zero and variance σ 2 . However, the performance of ŵML can degrade in the presence of poorly conditioned feature matrices. This chapter explores
methods to improve the estimator by introducing regularization, deriving a new estimator, and employing cross-validation to tune hyperparameters.
The quality of ŵML is measured by its expected deviation from the true parameter vector w. This expected deviation, also referred to as the mean squared error (MSE), evaluates to
\[ \mathbb{E}\big[\|\hat{w}_{\mathrm{ML}} - w\|^2\big] = \sigma^2\,\operatorname{tr}\big((X^\top X)^{-1}\big). \]
The trace of a matrix, denoted tr(·), is the sum of its diagonal elements. For a symmetric positive-definite matrix, the trace is also equal to the
sum of its eigenvalues:
\[ \operatorname{tr}(A) = \sum_{i=1}^{d} \lambda_i, \]
Applying this to (X⊤X)⁻¹, whose eigenvalues are 1/λᵢ where λᵢ are the eigenvalues of X⊤X, gives tr((X⊤X)⁻¹) = ∑ᵢ 1/λᵢ. The MSE is therefore proportional to the noise variance σ² and inversely related to the eigenvalues of X⊤X: small eigenvalues make large contributions to the MSE, highlighting the sensitivity of ŵML to poorly conditioned feature matrices. A quick numerical check of this identity follows.
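The identity tr((X⊤X)⁻¹) = ∑ᵢ 1/λᵢ is easy to verify numerically; the sketch below uses a random design matrix purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((50, 4))
    A = X.T @ X

    eigvals = np.linalg.eigvalsh(A)            # eigenvalues of the symmetric matrix X^T X
    print("tr(A^-1)        :", np.trace(np.linalg.inv(A)))
    print("sum of 1/lambda :", np.sum(1.0 / eigvals))   # the two values agree up to rounding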
The regularized estimator addresses this by adding λI inside the inverse,
\[ \hat{w}_{\mathrm{new}} = (X^\top X + \lambda I)^{-1} X^\top y, \]
where λ > 0 is a regularization parameter and I is the identity matrix. The addition of λI increases each eigenvalue of X⊤X by λ, thereby improving numerical stability. The trace of the regularized inverse becomes
\[ \operatorname{tr}\big((X^\top X + \lambda I)^{-1}\big) = \sum_{i=1}^{d} \frac{1}{\lambda_i + \lambda}. \]
Adding λ to each eigenvalue reduces the contribution of the small eigenvalues to this trace, and hence reduces the MSE. Intuitively, the regularization term penalizes large entries in ŵnew, improving its robustness.
An existence theorem guarantees that there is some λ > 0 for which the regularized estimator achieves a strictly lower MSE than the unregularized estimator. However, the optimal λ depends on the data and the noise properties, making its determination non-trivial.
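The existence claim can be probed empirically by Monte Carlo: for a fixed (ill-conditioned) design and several values of λ, estimate E‖ŵ(λ) − w‖² over fresh noise draws. The setup below is an illustrative assumption, not an experiment prescribed by the text.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, sigma = 50, 5, 1.0
    X = rng.standard_normal((n, d))
    X[:, 4] = X[:, 3] + 0.05 * rng.standard_normal(n)    # induce near-collinearity
    w_true = rng.standard_normal(d)

    def mse(lam, trials=2000):
        """Monte Carlo estimate of E||w_hat(lam) - w_true||^2 over fresh noise draws."""
        A = X.T @ X + lam * np.eye(d)
        err = 0.0
        for _ in range(trials):
            y = X @ w_true + sigma * rng.standard_normal(n)
            w_hat = np.linalg.solve(A, X.T @ y)
            err += np.sum((w_hat - w_true) ** 2)
        return err / trials

    for lam in [0.0, 0.1, 1.0, 10.0]:
        print(f"lambda={lam:5.1f}  estimated MSE = {mse(lam):.3f}")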
2.4 Hyperparameter Selection via Cross-Validation
A simple approach is to hold out a validation set: for each candidate λ, compute ŵnew on the training portion and evaluate its performance on the validation set using metrics such as MSE. A more robust alternative is K-fold cross-validation:
1. Split the training data into K folds of roughly equal size.
2. Train the model on K − 1 folds and validate on the remaining fold. Repeat this process K times, using a different fold for validation each time.
3. Compute the average validation error across the K folds for each λ, and select the λ with the lowest average error (a sketch follows this list).
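A minimal NumPy sketch of K-fold cross-validation for selecting λ, using the closed-form regularized estimator; the data, the grid of candidate values, and K = 5 are illustrative assumptions.

    import numpy as np

    def ridge_fit(X, y, lam):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def kfold_cv_error(X, y, lam, K=5, seed=0):
        """Average validation MSE of the regularized estimator across K folds."""
        n = X.shape[0]
        idx = np.random.default_rng(seed).permutation(n)
        folds = np.array_split(idx, K)
        errors = []
        for k in range(K):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            w = ridge_fit(X[train], y[train], lam)
            errors.append(np.mean((X[val] @ w - y[val]) ** 2))
        return np.mean(errors)

    # Illustrative data and a small grid of candidate lambdas.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((120, 8))
    y = X @ rng.standard_normal(8) + 0.5 * rng.standard_normal(120)
    grid = [0.01, 0.1, 1.0, 10.0]
    scores = {lam: kfold_cv_error(X, y, lam) for lam in grid}
    best = min(scores, key=scores.get)
    print(scores, "-> chosen lambda:", best)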
Adding the penalty directly to the objective gives
\[ \hat{w}_{\mathrm{new}} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|^2. \]
This formulation is known as Ridge Regression, where the ℓ2-norm penalty shrinks the coefficients w to mitigate overfitting.
2.6 Conclusion
Regularization introduces a principled way to improve the Maximum Likelihood Estimator for linear regression. By adding a penalty term or
employing a Bayesian framework, the regularized estimator ŵnew reduces the sensitivity to small eigenvalues of X> X, lowering the mean squared
error. Cross-validation provides a practical method to select the optimal regularization parameter λ, ensuring robust performance on unseen
data. These techniques form the foundation for modern regularized regression methods, paving the way for further advancements in predictive
modeling.
3 Bayesian Linear Regression and Maximum A Posteriori
Estimation
3.1 Introduction
Bayesian modeling provides a structured approach to probabilistic inference by combining prior knowledge with observed data to compute posterior
distributions. In this chapter, we explore a Bayesian approach to linear regression. Specifically, we introduce a prior distribution over the parameters
w, derive the posterior distribution, and obtain the Maximum A Posteriori (MAP) estimate. We then connect the MAP estimate to the concept of
regularization, demonstrating its equivalence to ridge regression.
3.4 Solution to the MAP Problem
The MAP estimate coincides with the ridge regression solution,
\[ \hat{w}_{\mathrm{MAP}} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|^2, \]
where λ = 1/γ². Ridge regression penalizes large coefficients to mitigate overfitting, and the MAP estimate provides a Bayesian justification for this regularization approach.
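A small numerical sketch of this equivalence, assuming the Gaussian prior w ∼ N(0, γ²I) and unit noise variance (the convention under which λ = 1/γ²); it computes the posterior mean in closed form and checks that it matches the ridge solution. The specific sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d, gamma2 = 80, 3, 4.0                 # gamma2 is the assumed prior variance of w
    X = rng.standard_normal((n, d))
    w_true = rng.normal(scale=np.sqrt(gamma2), size=d)
    y = X @ w_true + rng.standard_normal(n)   # unit noise variance, matching lambda = 1/gamma^2

    # Posterior of w under the prior N(0, gamma2 * I) and unit-variance Gaussian noise:
    #   covariance S = (X^T X + I/gamma2)^{-1},  mean m = S X^T y.
    S = np.linalg.inv(X.T @ X + np.eye(d) / gamma2)
    m = S @ X.T @ y

    # Ridge with lambda = 1/gamma^2 gives the same point estimate.
    lam = 1.0 / gamma2
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    print("posterior mean / MAP    :", np.round(m, 3))
    print("ridge with lam=1/gamma^2:", np.round(w_ridge, 3))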
3.6 Conclusion
The Bayesian framework for linear regression introduces a prior over parameters, yielding a posterior distribution that incorporates both the prior
belief and observed data. The MAP estimate minimizes a regularized objective function, connecting Bayesian inference to ridge regression. This
dual perspective highlights the power of Bayesian modeling in deriving principled solutions for regularization and offers insights into parameter
estimation under uncertainty.
4 Linear Regression, Ridge Regression, and Regularization
4.1 Introduction
Linear regression provides a foundational framework for modeling relationships between input features and target variables. The classical approach
minimizes the squared error between predicted and actual target values. However, this approach can face challenges in the presence of redundant
or collinear features. Ridge regression, a regularized variant of linear regression, introduces a penalty term to address such issues, leading to more
robust solutions.
The classical formulation solves
\[ \hat{w}_{\mathrm{ML}} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2, \]
where w is the weight vector, xᵢ are the feature vectors, and yᵢ are the target values. This optimization problem yields the maximum likelihood estimate (MLE) under the assumption of Gaussian noise. Ridge regression adds a penalty on the weights,
\[ \hat{w}_{\mathrm{ridge}} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|^2, \]
where ‖w‖² = w⊤w represents the squared norm of the weight vector, and λ > 0 is a hyperparameter controlling the strength of regularization.
The term λ‖w‖² is referred to as the regularizer, and its role is to penalize large weights, encouraging solutions with smaller norms. This formulation reflects a Bayesian viewpoint, where w is assumed to follow a Gaussian prior with zero mean and variance proportional to 1/λ.
The Maximum A Posteriori (MAP) estimate maximizes the posterior and corresponds to the ridge regression solution:
" n #
2
X 2 kwk
ŵridge = arg min yi − w> xi + 2 .
w
i=1
γ
Identifying λ = 1
γ2
, we observe that ridge regression imposes a prior preference for weight vectors with smaller norms.
Consider, for example, a dataset with three features:
• Feature f1: Height
• Feature f2: Weight
• Feature f3: Height + Weight (a redundant feature)
Suppose the label y is a noisy version of 3 × Height + 4 × Weight. Because f3 duplicates the information in f1 and f2, multiple weight vectors explain y equally well, for example
w = [3, 4, 0] or w = [2, 3, 1].
Ridge regression prefers solutions with smaller norms, effectively controlling the redundancy introduced by features like f3 (a short sketch follows).
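A short numerical sketch of this effect, assuming f3 = Height + Weight; because X⊤X is then singular, the unregularized normal equations have no unique solution, while ridge regression settles on a spread-out, small-norm representation. All values are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 500
    height = rng.standard_normal(n)
    weight = rng.standard_normal(n)
    f3 = height + weight                         # assumed redundant feature
    X = np.column_stack([height, weight, f3])    # X^T X is singular: f3 = f1 + f2 exactly
    y = 3 * height + 4 * weight + 0.1 * rng.standard_normal(n)

    lam = 1.0
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print("ridge solution        :", np.round(w_ridge, 2))   # small-norm representation
    print("norm of ridge solution:", round(float(np.linalg.norm(w_ridge)), 2))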
4.9 Conclusion
Ridge regression introduces a regularization term to linear regression, striking a balance between minimizing loss and controlling model complexity.
Its Bayesian interpretation links regularization to prior beliefs, offering a principled framework for managing redundancy and overfitting. By
understanding ridge regression geometrically, we pave the way for exploring alternative formulations and extensions.
5 Geometric Insights into Regularization in Linear Regression
5.1 Introduction
Linear regression is a fundamental supervised learning method that models the relationship between features and a target variable. Previously, we
introduced two formulations of linear regression:
\[ \hat{w}_{\mathrm{ML}} = \arg\min_{w} \sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2, \]
where w is the weight vector, xi are the feature vectors, and yi are the corresponding target values.
The second formulation is ridge regression,
\[ \hat{w}_{\mathrm{ridge}} = \arg\min_{w} \sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2 + \lambda \|w\|_2^2. \]
Here, ‖w‖₂² = ∑ⱼ wⱼ² is the squared L2-norm, and λ controls the trade-off between minimizing the loss and penalizing large weights.
Ridge regression reduces overfitting by discouraging large values in the weight vector w. However, it does not explicitly set weights to zero,
which limits its ability to perform feature selection. This chapter explores the geometric implications of ridge regression and examines the potential
to modify the regularization strategy to encourage sparsity.
This penalized objective is equivalent to a constrained formulation,
\[ \hat{w}_{\mathrm{ridge}} = \arg\min_{\|w\|_2^2 \le \theta} \sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2, \]
where θ depends on the regularization parameter λ. This equivalence indicates that ridge regression searches for the optimal w within a spherical region in parameter space defined by the constraint ‖w‖₂² ≤ θ.
In two dimensions (d = 2), the feasible region is the disk w₁² + w₂² ≤ θ.
The loss function contours around ŵML are ellipses, due to the quadratic nature of the loss:
\[ (w - \hat{w}_{\mathrm{ML}})^\top H \,(w - \hat{w}_{\mathrm{ML}}) = c, \]
where H = X⊤X is the Hessian of the loss function (up to a constant factor) and c ≥ 0 indexes the contour. For simplicity, if H is the identity matrix, the contours become circular.
Figure 5.1: Geometric interpretation of ridge regression. The unconstrained solution ŵML lies outside the feasible circular region. The ridge regression solution ŵRidge is the point where the smallest elliptical contour intersects the circle.
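Consistent with this geometric picture, increasing λ corresponds to a smaller feasible radius θ, so the norm of the ridge solution shrinks monotonically. A minimal sketch on synthetic two-dimensional data (an illustrative assumption, not an example from the text):

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 100, 2
    X = rng.standard_normal((n, d))
    y = X @ np.array([2.0, -3.0]) + 0.5 * rng.standard_normal(n)

    # Closed-form ridge solution for a range of regularization strengths.
    for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        print(f"lambda={lam:7.1f}  ||w||_2 = {np.linalg.norm(w):.3f}")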
5.3 Limitations of Ridge Regression and Motivation for Sparsity
Figure 5.2: Geometric interpretation of LASSO. The sharp vertices of the diamond-shaped feasible region increase the likelihood of sparsity by
encouraging intersections at axes-aligned points.
This method is known as LASSO (Least Absolute Shrinkage and Selection Operator), which combines shrinkage and feature selection in a single
framework.
5.5 Conclusion
Ridge regression and LASSO represent two distinct approaches to regularization in linear regression. While ridge regression reduces overfitting by
shrinking weights, LASSO goes further by promoting sparsity, making it especially useful for high-dimensional datasets with redundant features.
The geometric insights presented in this chapter provide a foundation for understanding the strengths and limitations of each approach. Future
chapters will delve into the theoretical guarantees and practical considerations of LASSO and its extensions.
6 LASSO: Sparsity in Linear Regression through L1
Regularization
6.1 Introduction
In the previous chapters, we explored linear regression and its regularized variant, ridge regression, which employs L2 regularization. Ridge
regression discourages large weights by penalizing the squared norm of the weight vector. However, while ridge regression reduces the magnitudes
of weights, it does not drive them to zero, making it less effective in explicitly eliminating redundant features.
In this chapter, we introduce an alternative approach: L1 regularization, which directly promotes sparsity in the weight vector by encouraging
many components of w to become exactly zero. This technique forms the foundation of the LASSO algorithm (Least Absolute Shrinkage and
Selection Operator).
A sparse solution is particularly valuable when:
• The number of features (d) is very large, potentially exceeding the number of samples (n).
By encouraging sparsity, we aim to simplify the model, improve interpretability, and reduce overfitting.
The LASSO objective is
\[ \hat{w}_{\mathrm{LASSO}} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|_1. \]
Here:
• The first term is the loss function, the sum of squared errors between predicted and actual target values.
• The second term λ‖w‖₁ is the regularization term, where λ > 0 controls the trade-off between minimizing the loss and promoting sparsity.
As with ridge regression, the penalized problem is equivalent to a constrained one,
\[ \hat{w}_{\mathrm{LASSO}} = \arg\min_{\|w\|_1 \le \theta} \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2, \]
where θ is a parameter related to λ. This formulation provides a geometric insight into the solution space.
• The L2 constraint w₁² + w₂² ≤ θ defines a circular region centered at the origin.
• The L1 constraint |w₁| + |w₂| ≤ θ defines a diamond-shaped region centered at the origin.
The sharp corners of the L1-norm constraint region increase the likelihood that the elliptical contours of the loss function first touch the feasible region at an axis-aligned point, that is, at a vertex of the diamond where one or more components of w are exactly zero. This makes sparse solutions far more likely than under the L2 constraint.
6.5 Advantages of L1 Regularization
L1 regularization offers several practical advantages:
• Sparsity: Encourages many components of w to become exactly zero, leading to simpler models.
• Feature Selection: Identifies and retains only the most relevant features.
• Interpretability: Sparse models are easier to interpret since they involve fewer features.
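These advantages are easy to observe on synthetic data. The sketch below assumes scikit-learn is available and compares the number of exactly-zero coefficients returned by ridge and LASSO; the data, penalty strengths, and sparsity pattern are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge   # assumes scikit-learn is installed

    rng = np.random.default_rng(5)
    n, d = 200, 20
    X = rng.standard_normal((n, d))
    w_true = np.zeros(d)
    w_true[:3] = [4.0, -3.0, 2.0]            # only the first three features matter
    y = X @ w_true + 0.5 * rng.standard_normal(n)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.5).fit(X, y)

    # Ridge shrinks weights but leaves them nonzero; LASSO zeroes out most of them.
    print("ridge: number of zero coefficients:", int(np.sum(np.isclose(ridge.coef_, 0.0))))
    print("lasso: number of zero coefficients:", int(np.sum(np.isclose(lasso.coef_, 0.0))))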
6.7 Conclusion
LASSO (Least Absolute Shrinkage and Selection Operator) provides a powerful framework for achieving sparsity in linear regression. By employing
L1 regularization, it effectively identifies and eliminates irrelevant features, leading to simpler and more interpretable models. While ridge regression
shrinks weights, LASSO goes further by setting many weights to zero, making it particularly suitable for high-dimensional problems with redundant
features.
In the next chapter, we will explore theoretical guarantees for LASSO and its extensions to broader machine learning problems.
7 Advanced Topics in Regularization: Ridge, LASSO, and
Beyond
7.1 Introduction
In this chapter, we delve deeper into the differences and trade-offs between ridge regression and LASSO, as well as the practical implications of
each. We also explore the computational aspects, including the lack of closed-form solutions for LASSO and how optimization techniques like
subgradient methods are employed to solve such problems. Finally, we provide a summary of regression concepts and discuss extensions to mixed
regularization techniques.
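As a concrete illustration of the subgradient approach, the following sketch minimizes an averaged LASSO objective with a diminishing step size. It is a minimal illustration, not a production solver, and the data are assumed; note that plain subgradient descent only approaches zero coefficients rather than producing exact zeros.

    import numpy as np

    def lasso_subgradient(X, y, lam, steps=5000, lr0=0.1):
        """Subgradient descent on (1/n) * ||y - Xw||^2 + lam * ||w||_1.

        np.sign(0) = 0 is a valid subgradient of |.| at zero.  Unlike coordinate
        or proximal methods, this does not set coefficients exactly to zero.
        """
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, steps + 1):
            subgrad = (2.0 / n) * X.T @ (X @ w - y) + lam * np.sign(w)
            w -= (lr0 / np.sqrt(t)) * subgrad       # diminishing step size
        return w

    rng = np.random.default_rng(6)
    X = rng.standard_normal((100, 5))
    w_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
    y = X @ w_true + 0.1 * rng.standard_normal(100)
    print(np.round(lasso_subgradient(X, y, lam=0.2), 3))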
7.4 Conclusion and Future Directions
To summarize the geometric picture:
• Ridge regression restricts the solution space to a spherical region defined by the L2-norm constraint: ‖w‖₂² ≤ θ.
• LASSO restricts the solution space to a diamond-shaped region defined by the L1-norm constraint: ‖w‖₁ ≤ θ.
The sharper vertices of the LASSO constraint region increase the likelihood of sparse solutions, where some weights are exactly zero.
• Elastic Net: Combines the L1 and L2 penalties; this mixed method benefits from the sparsity of LASSO and the stability of ridge regression (a brief sketch follows this list).
• Domain-Specific Regularization: Incorporates prior knowledge, such as group structure or sparsity patterns, into the penalty function.
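A brief sketch of the mixed (elastic net) penalty in practice, assuming scikit-learn is available; the data, the correlated feature pair, and the settings alpha and l1_ratio are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import ElasticNet   # assumes scikit-learn is installed

    rng = np.random.default_rng(7)
    n, d = 200, 10
    X = rng.standard_normal((n, d))
    X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)   # a pair of highly correlated features
    w_true = np.concatenate([[2.0, 2.0], np.zeros(d - 2)])
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    # l1_ratio blends the two penalties: 1.0 is pure LASSO, 0.0 is pure ridge.
    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(np.round(model.coef_, 2))   # irrelevant features near zero; correlated pair shared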