
Machine Learning Techniques - Week 6

December 16, 2024


Contents

1 Analysis of Maximum Likelihood Estimation in Linear Regression
  1.1 Introduction
  1.2 Probabilistic Assumptions and Maximum Likelihood Estimation
  1.3 Goodness of ŵML
    1.3.1 Expected Deviation of ŵML
    1.3.2 Interpretation of the Result
  1.4 Implications and Extensions
    1.4.1 Effect of Noise (σ²)
    1.4.2 Effect of Feature Design
    1.4.3 Improving the Estimator
  1.5 Conclusion

2 Improving Maximum Likelihood Estimation: Regularization and Cross-Validation
  2.1 Introduction
  2.2 Trace of the Covariance Matrix and Mean Squared Error
  2.3 A Regularized Estimator
    2.3.1 Eigenvalue Analysis
    2.3.2 Existence of Optimal λ
  2.4 Hyperparameter Selection via Cross-Validation
    2.4.1 Basic Cross-Validation Procedure
    2.4.2 K-Fold Cross-Validation
  2.5 Alternative Interpretation of the Regularized Estimator
    2.5.1 Bayesian Perspective
    2.5.2 Ridge Regression
  2.6 Conclusion

3 Bayesian Linear Regression and Maximum A Posteriori Estimation
  3.1 Introduction
  3.2 Bayesian Modeling for Linear Regression
    3.2.1 Likelihood Function
    3.2.2 Prior Distribution
  3.3 Posterior Distribution
    3.3.1 Maximum A Posteriori (MAP) Estimation
  3.4 Solution to the MAP Problem
  3.5 Connection to Ridge Regression
  3.6 Conclusion

4 Linear Regression, Ridge Regression, and Regularization
  4.1 Introduction
  4.2 Linear Regression Recap
  4.3 Ridge Regression
  4.4 Bayesian Perspective on Ridge Regression
  4.5 Understanding the Regularization Term
  4.6 The Role of λ
  4.7 Geometric Interpretation
  4.8 Extensions and Next Steps
  4.9 Conclusion

5 Geometric Insights into Regularization in Linear Regression
  5.1 Introduction
  5.2 Geometric Understanding of Ridge Regression
    5.2.1 Formulation as a Constrained Optimization Problem
    5.2.2 Parameter Space and Elliptical Contours
    5.2.3 Intersection of Elliptical Contours and Circular Constraint
    5.2.4 Effect of Regularization
  5.3 Limitations of Ridge Regression and Motivation for Sparsity
  5.4 Towards L1 Regularization
    5.4.1 Geometric Interpretation of L1 Regularization
    5.4.2 Advantages of L1 Regularization
  5.5 Conclusion

6 LASSO: Sparsity in Linear Regression through L1 Regularization
  6.1 Introduction
  6.2 Motivation for Sparsity
  6.3 L1 Regularization Formulation
    6.3.1 Equivalence to a Constrained Optimization Problem
  6.4 Geometric Insight into L1 Regularization
    6.4.1 Comparison of L1 and L2 Constraints
    6.4.2 Elliptical Contours and Sparse Solutions
  6.5 Advantages of L1 Regularization
  6.6 LASSO and Ridge Regression: A Comparison
  6.7 Conclusion

7 Advanced Topics in Regularization: Ridge, LASSO, and Beyond
  7.1 Introduction
  7.2 Why Not Always Use LASSO?
    7.2.1 Closed-Form Solution for Ridge Regression
    7.2.2 Subgradient Methods for LASSO
    7.2.3 Iterative Algorithms for LASSO
  7.3 Summary of Linear Regression and Regularization
    7.3.1 Key Insights from Ridge and LASSO
    7.3.2 Geometric Interpretation
    7.3.3 Extensions to Mixed Regularization
  7.4 Conclusion and Future Directions
1 Analysis of Maximum Likelihood Estimation in Linear Regression
1.1 Introduction
Linear regression is a foundational algorithm in machine learning and statistics. It provides a means to model the relationship between input
features x ∈ Rd and a target variable y ∈ R. By incorporating a probabilistic perspective, we derived the Maximum Likelihood Estimator (MLE) for
the regression coefficients w, denoted ŵML . This estimator coincides with the optimal solution w∗ derived through minimizing the squared error.
In this chapter, we analyze the quality of ŵML as an estimator for the true w. We examine how noise and feature properties affect the deviation
of ŵML from w, providing insights for improving the estimator.

1.2 Probabilistic Assumptions and Maximum Likelihood Estimation


We assume the following data-generating process:
y = wᵀx + ε,
where:
• ε ∼ N(0, σ²) is zero-mean Gaussian noise with variance σ²,

• x ∈ Rd represents the feature vector,

• w ∈ Rd is the unknown true parameter vector.


The likelihood function for the data {(xᵢ, yᵢ)}ᵢ₌₁ⁿ is given by:

P(y₁, . . . , yₙ | w) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(yᵢ − wᵀxᵢ)²/(2σ²)).


Maximizing the log-likelihood leads to the closed-form solution:

ŵML = (XᵀX)⁻¹Xᵀy,

where:
• X ∈ Rn×d is the design matrix formed by stacking the feature vectors,

• y ∈ Rn is the vector of target values.
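As a quick illustration of this closed form, the following sketch (not part of the original lecture; it assumes NumPy and a synthetic dataset) recovers ŵML and compares it to the true parameter vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5

# Synthetic data: y = X w_true + Gaussian noise
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)

# Closed-form maximum likelihood estimate: (X^T X)^{-1} X^T y.
# lstsq is used instead of an explicit inverse for numerical stability.
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true w :", w_true)
print("w_ML   :", np.round(w_ml, 3))
```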

1.3 Goodness of ŵML


The estimator ŵML is derived from a random dataset, meaning it is a random variable. Its performance as an estimator can be quantified by the
expected squared deviation from the true parameter w:
E[‖ŵML − w‖²],

where the expectation is taken over the randomness in y, induced by the Gaussian noise ε.

1.3.1 Expected Deviation of ŵML


The expected squared deviation can be derived (algebraically intensive) and is given by:

E[‖ŵML − w‖²] = σ² tr((XᵀX)⁻¹),

where tr(·) denotes the trace of a matrix.

1.3.2 Interpretation of the Result


The result highlights two key factors affecting the deviation:
1. Noise Variance (σ²): Larger noise variance increases the expected deviation. This aligns with intuition; noisier data leads to less reliable
estimates.

2. Feature Properties: The term tr((XᵀX)⁻¹) depends on the geometry of the features xᵢ. Poorly conditioned feature matrices (e.g., highly
correlated features) increase this term, leading to worse estimates; a numerical check follows this list.
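A minimal numerical check of the second point (an illustrative sketch, assuming NumPy and synthetic features): the factor tr((XᵀX)⁻¹) grows sharply when two features are nearly collinear.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

def trace_inv_gram(X):
    """Return tr((X^T X)^{-1}), the factor multiplying sigma^2 in the MSE."""
    return np.trace(np.linalg.inv(X.T @ X))

# Nearly orthogonal features
X_good = rng.normal(size=(n, 2))

# Highly correlated features: second column is almost a copy of the first
x1 = rng.normal(size=n)
X_bad = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])

print("well-conditioned :", trace_inv_gram(X_good))   # small
print("nearly collinear :", trace_inv_gram(X_bad))    # orders of magnitude larger
```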


1.4 Implications and Extensions


1.4.1 Effect of Noise (σ²)
The noise ε is intrinsic to the data-generating process and cannot be controlled. Therefore, σ² sets a lower bound on the achievable accuracy of
the estimator.

1.4.2 Effect of Feature Design


The feature matrix X plays a critical role in determining the quality of the estimator. For instance:

• Highly Correlated Features: Correlated features lead to a poorly conditioned XᵀX, increasing the trace of its inverse.

• Redundant Features: Adding redundant or irrelevant features increases the dimensionality d, contributing to higher deviation.

1.4.3 Improving the Estimator


To reduce the expected deviation:

1. Regularization: Adding a penalty term to the loss function, such as in Ridge Regression, modifies the estimator to:

ŵRidge = (XᵀX + λI)⁻¹Xᵀy,

where λ > 0 controls the strength of regularization. This improves the conditioning of XᵀX, reducing the trace of its inverse.

2. Feature Engineering: Selecting orthogonal or minimally correlated features can reduce redundancy and improve the quality of the estimator.

1.5 Conclusion
The Maximum Likelihood Estimator ŵML for linear regression is influenced by both the noise variance σ² and the feature matrix X. While the
noise variance is inherent to the data-generating process, the feature matrix offers a means to control the quality of the estimator. Strategies like
regularization and careful feature selection can help reduce the expected deviation, providing a more reliable estimate of the true parameter w.

2 Improving Maximum Likelihood Estimation: Regularization and Cross-Validation
2.1 Introduction
The Maximum Likelihood Estimator (MLE) for linear regression, ŵML , provides an optimal solution under the assumption of Gaussian noise with mean
zero and variance σ². However, the performance of ŵML can degrade in the presence of poorly conditioned feature matrices. This chapter explores
methods to improve the estimator by introducing regularization, deriving a new estimator, and employing cross-validation to tune hyperparameters.

2.2 Trace of the Covariance Matrix and Mean Squared Error


To analyze the quality of ŵML , we examine the expected squared deviation:
E[‖ŵML − w‖²],

where w is the true parameter vector. This expected deviation, also referred to as the mean squared error (MSE), is derived as:
E[‖ŵML − w‖²] = σ² tr((XᵀX)⁻¹).

The trace of a matrix, denoted tr(·), is the sum of its diagonal elements. For a symmetric positive-definite matrix, the trace is also equal to the
sum of its eigenvalues:
tr(A) = Σᵢ₌₁ᵈ λᵢ,

where λᵢ are the eigenvalues of A. Consequently, the trace of (XᵀX)⁻¹ is:

tr((XᵀX)⁻¹) = Σᵢ₌₁ᵈ 1/λᵢ,


where λᵢ are the eigenvalues of XᵀX. This formulation reveals that the MSE is proportional to the noise variance σ² and inversely proportional to
the eigenvalues of XᵀX. Small eigenvalues lead to large contributions to the MSE, highlighting the sensitivity of ŵML to poorly conditioned feature
matrices.

2.3 A Regularized Estimator


To address the sensitivity of ŵML to small eigenvalues, we propose a new estimator:

ŵnew = (XᵀX + λI)⁻¹Xᵀy,

where λ > 0 is a regularization parameter and I is the identity matrix. The addition of λI modifies the eigenvalues of XᵀX by increasing each
eigenvalue by λ, thereby improving numerical stability.

2.3.1 Eigenvalue Analysis


Let λ₁, λ₂, . . . , λ_d be the eigenvalues of XᵀX. The eigenvalues of XᵀX + λI are λᵢ + λ, and the trace of the inverse becomes:

tr((XᵀX + λI)⁻¹) = Σᵢ₌₁ᵈ 1/(λᵢ + λ).

Adding λ to the eigenvalues decreases the contribution of small eigenvalues to the trace, reducing the MSE. Intuitively, the regularization term
λ penalizes large deviations in ŵnew , improving its robustness.
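The effect of the eigenvalue shift is easy to verify numerically. The sketch below (illustrative, assuming NumPy and a made-up ill-conditioned design) computes tr((XᵀX + λI)⁻¹) via the eigenvalue formula above for several values of λ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Poorly conditioned design: second feature nearly duplicates the first
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n), rng.normal(size=n)])
eigvals = np.linalg.eigvalsh(X.T @ X)

for lam in [0.0, 0.1, 1.0, 10.0]:
    # tr((X^T X + lambda I)^{-1}) = sum_i 1 / (eigval_i + lambda)
    trace_inv = np.sum(1.0 / (eigvals + lam))
    print(f"lambda = {lam:5.1f}  ->  trace of inverse = {trace_inv:.4f}")
```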

2.3.2 Existence of Optimal λ


There exists a λ > 0 such that:
E[‖ŵnew − w‖²] < E[‖ŵML − w‖²].

This existence theorem guarantees that the regularized estimator can achieve lower MSE than the unregularized estimator. However, the optimal
λ depends on the data and noise properties, making its determination non-trivial.


2.4 Hyperparameter Selection via Cross-Validation


To select the optimal λ, we use a procedure called cross-validation. Cross-validation divides the dataset into training and validation sets to evaluate
the estimator’s performance on unseen data.

2.4.1 Basic Cross-Validation Procedure


1. Split the dataset into a training set (e.g., 80% of the data) and a validation set (e.g., 20% of the data).

2. Train ŵnew on the training set for different values of λ.

3. Evaluate the performance of ŵnew on the validation set using metrics such as MSE.

4. Select the λ that minimizes the validation error.

2.4.2 K-Fold Cross-Validation


To improve the reliability of the selected λ, we use K-fold cross-validation:
1. Divide the dataset into K equal-sized folds.

2. Train the model on K − 1 folds and validate on the remaining fold. Repeat this process K times, using a different fold for validation each
time.

3. Compute the average validation error across the K folds for each λ.

4. Select the λ that minimizes the average validation error.


In practice, K is chosen based on computational resources. For large datasets, K = 5 or K = 10 is common. In extreme cases, leave-one-out
cross-validation (LOOCV) can be used, where K = n, but this is computationally expensive.
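A minimal K-fold cross-validation loop for selecting λ could look like the following sketch (an assumption of this writeup, using NumPy only; the λ grid, fold count, and synthetic data are illustrative).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized estimator: (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    """Average validation MSE of the regularized estimator across K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Illustrative usage on synthetic data
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)
lambdas = [0.01, 0.1, 1.0, 10.0]
best = min(lambdas, key=lambda lam: kfold_cv_error(X, y, lam))
print("selected lambda:", best)
```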

2.5 Alternative Interpretation of the Regularized Estimator


The estimator ŵnew can be interpreted from a Bayesian perspective or as a result of adding a penalty term to the optimization objective. The
introduction of λ implicitly assumes a prior distribution over w, favoring smaller magnitudes of w.


2.5.1 Bayesian Perspective


From a Bayesian viewpoint, λ corresponds to the precision of a Gaussian prior on w. The regularized estimator maximizes the posterior distribution
of w, balancing the likelihood term and the prior.

2.5.2 Ridge Regression


The regularized estimator can also be derived by minimizing the penalized least squares objective:

ŵnew = arg min_w ‖y − Xw‖² + λ‖w‖².

This formulation is known as Ridge Regression, where the ℓ2-norm penalty shrinks the coefficients w to mitigate overfitting.

2.6 Conclusion
Regularization introduces a principled way to improve the Maximum Likelihood Estimator for linear regression. By adding a penalty term or
employing a Bayesian framework, the regularized estimator ŵnew reduces the sensitivity to small eigenvalues of XᵀX, lowering the mean squared
error. Cross-validation provides a practical method to select the optimal regularization parameter λ, ensuring robust performance on unseen
data. These techniques form the foundation for modern regularized regression methods, paving the way for further advancements in predictive
modeling.

3 Bayesian Linear Regression and Maximum A Posteriori Estimation
3.1 Introduction
Bayesian modeling provides a structured approach to probabilistic inference by combining prior knowledge with observed data to compute posterior
distributions. In this chapter, we explore a Bayesian approach to linear regression. Specifically, we introduce a prior distribution over the parameters
w, derive the posterior distribution, and obtain the Maximum A Posteriori (MAP) estimate. We then connect the MAP estimate to the concept of
regularization, demonstrating its equivalence to ridge regression.

3.2 Bayesian Modeling for Linear Regression


Bayesian inference begins with a prior belief about the parameters, encoded as a probability distribution. The posterior distribution is proportional
to the product of the prior and the likelihood:
P (w | D) ∝ P (D | w)P (w),
where D = {(xi , yi )}ni=1 represents the observed data, P (D | w) is the likelihood, and P (w) is the prior.

3.2.1 Likelihood Function


We assume the data y is generated as:
yᵢ = wᵀxᵢ + ε,
where ε ∼ N(0, σ²). For simplicity, let σ² = 1. The likelihood for the entire dataset is:
P(D | w) = ∏ᵢ₌₁ⁿ P(yᵢ | xᵢ, w),


and each term is given by:


P(yᵢ | xᵢ, w) = (1/√(2π)) exp(−(yᵢ − wᵀxᵢ)²/2).

3.2.2 Prior Distribution


To simplify posterior computation, we select a prior conjugate to the Gaussian likelihood. A natural choice is a Gaussian prior over w:
P(w) = (1/(2πγ²)^(d/2)) exp(−‖w‖²/(2γ²)),

where γ² is a variance parameter and ‖w‖² denotes the squared Euclidean norm of w.

3.3 Posterior Distribution


Combining the likelihood and prior, the posterior distribution is:
P (w | D) ∝ P (D | w)P (w),
or equivalently:

P(w | D) ∝ exp(−(1/2) Σᵢ₌₁ⁿ (yᵢ − wᵀxᵢ)² − ‖w‖²/(2γ²)).

3.3.1 Maximum A Posteriori (MAP) Estimation


The MAP estimate maximizes the posterior:
ŵMAP = arg max_w log P(w | D).

Taking the log of the posterior and ignoring constants, we obtain:


" n
#
2
1X kwk
ŵMAP = arg min (yi − w> xi )2 + .
w 2 i=1 2γ 2
kwk2
This optimization problem can be interpreted as minimizing a penalized sum of squared errors, where the penalty term 2γ 2
discourages large
parameter values.


3.4 Solution to the MAP Problem


To find ŵMAP , we differentiate the objective function:
f(w) = (1/2)‖y − Xw‖² + ‖w‖²/(2γ²),

where X is the n × d design matrix and y is the n-dimensional response vector. Taking the gradient and setting it to zero yields:

∇f(w) = −Xᵀ(y − Xw) + w/γ² = 0.

Rearranging gives:

(XᵀX + (1/γ²)I)ŵMAP = Xᵀy.

Thus, the solution is:

ŵMAP = (XᵀX + (1/γ²)I)⁻¹Xᵀy.

3.5 Connection to Ridge Regression


The MAP estimator is equivalent to the solution of ridge regression:

ŵridge = arg min_w [ ‖y − Xw‖² + λ‖w‖² ],

where λ = 1/γ². Ridge regression penalizes large coefficients to mitigate overfitting, and the MAP estimate provides a Bayesian justification for this
regularization approach.
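This equivalence is easy to check numerically. The sketch below (illustrative, assuming NumPy and synthetic data) verifies that the MAP estimate with prior variance γ² coincides with the ridge estimate computed with λ = 1/γ².

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(size=100)

gamma2 = 0.5                 # prior variance gamma^2
lam = 1.0 / gamma2           # corresponding ridge parameter

# MAP estimate: (X^T X + (1/gamma^2) I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + (1.0 / gamma2) * np.eye(4), X.T @ y)
# Ridge estimate: (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

print(np.allclose(w_map, w_ridge))   # True: the two estimators coincide
```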

3.6 Conclusion
The Bayesian framework for linear regression introduces a prior over parameters, yielding a posterior distribution that incorporates both the prior
belief and observed data. The MAP estimate minimizes a regularized objective function, connecting Bayesian inference to ridge regression. This
dual perspective highlights the power of Bayesian modeling in deriving principled solutions for regularization and offers insights into parameter
estimation under uncertainty.

4 Linear Regression, Ridge Regression, and Regularization
4.1 Introduction
Linear regression provides a foundational framework for modeling relationships between input features and target variables. The classical approach
minimizes the squared error between predicted and actual target values. However, this approach can face challenges in the presence of redundant
or collinear features. Ridge regression, a regularized variant of linear regression, introduces a penalty term to address such issues, leading to more
robust solutions.

4.2 Linear Regression Recap


The classical formulation of linear regression aims to minimize the sum of squared errors:
ŵML = arg min_w Σᵢ₌₁ⁿ (wᵀxᵢ − yᵢ)²,

where w is the weight vector, xi are the feature vectors, and yi are the target values. This optimization problem yields the maximum likelihood
estimate (MLE) under the assumption of Gaussian noise.

4.3 Ridge Regression


Ridge regression extends linear regression by adding a penalty term to the objective function:
ŵridge = arg min_w [ Σᵢ₌₁ⁿ (wᵀxᵢ − yᵢ)² + λ‖w‖² ],

where ‖w‖² = wᵀw represents the squared norm of the weight vector, and λ > 0 is a hyperparameter controlling the strength of regularization.


The term λ‖w‖² is referred to as the regularizer, and its role is to penalize large weights, encouraging solutions with smaller norms. This
formulation reflects a Bayesian viewpoint, where w is assumed to follow a Gaussian prior with zero mean and variance proportional to 1/λ.

4.4 Bayesian Perspective on Ridge Regression


The prior P (w) over w is modeled as:
P(w) ∝ exp(−‖w‖²/(2γ²)),

where γ² represents the variance of the Gaussian prior. Combining this prior with the likelihood from linear regression, the posterior distribution is:

P(w | D) ∝ exp(−(1/2) Σᵢ₌₁ⁿ (yᵢ − wᵀxᵢ)² − ‖w‖²/(2γ²)).

The Maximum A Posteriori (MAP) estimate maximizes the posterior and corresponds to the ridge regression solution:
" n #
2
X 2 kwk
ŵridge = arg min yi − w> xi + 2 .

w
i=1
γ

Identifying λ = 1
γ2
, we observe that ridge regression imposes a prior preference for weight vectors with smaller norms.

4.5 Understanding the Regularization Term


The regularization term λ‖w‖² biases the optimization towards solutions with smaller magnitudes for w. This bias can be interpreted as discouraging
complex models that overly rely on features with large weights, thereby reducing overfitting.
For example, consider a case with redundant features:

• Feature f1 : Height

• Feature f2 : Weight

• Feature f3 : 2 × Height + 3 × Weight


Suppose the label y is a noisy version of 3 × Height + 4 × Weight. Multiple combinations of weights can explain y, such as:
w = [1, 1, 1] or w = [3, 4, 0].
Ridge regression prefers solutions with smaller norms, effectively penalizing redundant features like f3 .
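The sketch below (illustrative, assuming NumPy; the feature values and noise level are made up) builds exactly this redundant-feature setup and compares the two hand-picked weight vectors above with the ridge solution: all three fit y comparably, but ridge picks the smallest norm.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
height = rng.normal(size=n)
weight = rng.normal(size=n)
f3 = 2 * height + 3 * weight                      # redundant feature
X = np.column_stack([height, weight, f3])
y = 3 * height + 4 * weight + 0.1 * rng.normal(size=n)

def fit_error(w):
    return np.mean((X @ w - y) ** 2)

# Two hand-picked weight vectors that explain y (up to noise) equally well
w_a = np.array([1.0, 1.0, 1.0])
w_b = np.array([3.0, 4.0, 0.0])

# Ridge solution with a small amount of regularization
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

for name, w in [("w = [1,1,1]", w_a), ("w = [3,4,0]", w_b), ("ridge", w_ridge)]:
    print(f"{name:12s}  mse = {fit_error(w):.4f}  norm = {np.linalg.norm(w):.3f}")
```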

4.6 The Role of λ


The hyperparameter λ determines the strength of regularization:
• Small λ: Solutions are closer to classical linear regression, with minimal regularization.
• Large λ: Solutions heavily penalize large weights, favoring simpler models.
From the Bayesian perspective, λ = 1/γ², where γ² is the variance of the Gaussian prior. Smaller variances (large λ) reflect a stronger belief that
most weights should be close to zero, indicating redundancy among features.

4.7 Geometric Interpretation


Ridge regression balances two objectives:
• Minimizing the loss Σᵢ₌₁ⁿ (yᵢ − wᵀxᵢ)².

• Penalizing large weights via λ‖w‖².


This balance can be viewed geometrically as constraining the solution within a hypersphere, with its radius determined by λ.

4.8 Extensions and Next Steps


Ridge regression addresses redundancy by preferring smaller weights, but it does not explicitly enforce sparsity. This raises natural questions:
• Can we design methods to directly enforce sparsity, setting many weights exactly to zero?
• How does ridge regression relate to other forms of regularization, such as the Lasso?
To explore these ideas, we will analyze ridge regression in a geometric context and develop modified formulations for linear regression in the next
chapter.


4.9 Conclusion
Ridge regression introduces a regularization term to linear regression, striking a balance between minimizing loss and controlling model complexity.
Its Bayesian interpretation links regularization to prior beliefs, offering a principled framework for managing redundancy and overfitting. By
understanding ridge regression geometrically, we pave the way for exploring alternative formulations and extensions.

5 Geometric Insights into Regularization in Linear Regression

5.1 Introduction
Linear regression is a fundamental supervised learning method that models the relationship between features and a target variable. Previously, we
introduced two formulations of linear regression:

• Standard Linear Regression: Minimizes the squared loss:

ŵML = arg min_w Σᵢ₌₁ⁿ (wᵀxᵢ − yᵢ)²,

where w is the weight vector, xi are the feature vectors, and yi are the corresponding target values.

• Ridge Regression: Adds an L2 -norm penalty to the loss:


" n
#
X 2
ŵRidge = arg min w> xi − yi + λkwk22 .
w
i=1

Here, ‖w‖₂² = Σⱼ₌₁ᵈ wⱼ² is the squared L2-norm, and λ controls the trade-off between minimizing the loss and penalizing large weights.

Ridge regression reduces overfitting by discouraging large values in the weight vector w. However, it does not explicitly set weights to zero,
which limits its ability to perform feature selection. This chapter explores the geometric implications of ridge regression and examines the potential
to modify the regularization strategy to encourage sparsity.


5.2 Geometric Understanding of Ridge Regression


5.2.1 Formulation as a Constrained Optimization Problem
The ridge regression objective can be reformulated as a constrained optimization problem:
ŵRidge = arg min_{w∈Rd} Σᵢ₌₁ⁿ (wᵀxᵢ − yᵢ)², subject to ‖w‖₂² ≤ θ,

where θ depends on the regularization parameter λ. This equivalence indicates that ridge regression is searching for the optimal w within a spherical
region in parameter space defined by the constraint ‖w‖₂² ≤ θ.

5.2.2 Parameter Space and Elliptical Contours


Consider a simplified two-dimensional parameter space, where w = [w₁, w₂]. The unconstrained solution to standard linear regression, ŵML, lies
outside the constrained region defined by ridge regression. The constraint ‖w‖₂² ≤ θ forms a circular feasible region centered at the origin:

w₁² + w₂² ≤ θ.

The loss function contours around ŵML take the form of ellipses due to the quadratic nature of the loss:

(w − ŵML)ᵀ H (w − ŵML) = c,

where H = XᵀX is the Hessian of the loss function. For simplicity, if H is the identity matrix, the contours become circular.

5.2.3 Intersection of Elliptical Contours and Circular Constraint


The ridge regression solution ŵRidge is the point where the smallest elliptical loss contour centered at ŵML touches the circular feasible region
‖w‖₂² ≤ θ. This point minimizes the loss within the constrained region, as illustrated in Figure 5.1.

Figure 5.1: Geometric interpretation of ridge regression. The unconstrained solution ŵML lies outside the feasible circular region. The ridge regres-
sion solution ŵRidge is the point where the smallest elliptical contour intersects the circle.


5.2.4 Effect of Regularization


Ridge regression pushes the weight vector w closer to the origin by reducing its L2 -norm. However, it does not force weights to become ex-
actly zero. This property limits its ability to identify and discard irrelevant features, which motivates the exploration of alternative regularization
techniques.

5.3 Limitations of Ridge Regression and Motivation for Sparsity


While ridge regression shrinks weights, it does not achieve sparsity—setting some components of w to exactly zero. Sparsity is desirable for:
• Feature Selection: Identifying and retaining only the most relevant features.
• Model Interpretability: Simplifying the model by eliminating redundant features.
• Computational Efficiency: Reducing the number of features in high-dimensional datasets.
To achieve sparsity, we seek an alternative regularization method that modifies the feasible region in parameter space.

5.4 Towards L1 Regularization


Instead of constraining the L2 -norm of w, we constrain its L1 -norm:
‖w‖₁ = Σⱼ₌₁ᵈ |wⱼ|.

The L1 -regularized regression problem is formulated as:


" n
#
X 2
ŵLASSO = arg min w> xi − yi + λkwk1 .
w
i=1

5.4.1 Geometric Interpretation of L1 Regularization


The constraint ‖w‖₁ ≤ θ defines a diamond-shaped feasible region in parameter space. This shape differs from the circular region of ridge
regression. Elliptical loss contours are more likely to intersect the sharp vertices of the diamond, where one or more components of w are exactly
zero.


Figure 5.2: Geometric interpretation of LASSO. The sharp vertices of the diamond-shaped feasible region increase the likelihood of sparsity by
encouraging intersections at axes-aligned points.

5.4.2 Advantages of L1 Regularization


• Sparsity: Encourages many components of w to become exactly zero.

• Feature Selection: Identifies and retains the most important features.

• Interpretability: Simplifies the model by selecting a subset of features.

This method is known as LASSO (Least Absolute Shrinkage and Selection Operator), which combines shrinkage and feature selection in a single
framework.

5.5 Conclusion
Ridge regression and LASSO represent two distinct approaches to regularization in linear regression. While ridge regression reduces overfitting by
shrinking weights, LASSO goes further by promoting sparsity, making it especially useful for high-dimensional datasets with redundant features.
The geometric insights presented in this chapter provide a foundation for understanding the strengths and limitations of each approach. Future
chapters will delve into the theoretical guarantees and practical considerations of LASSO and its extensions.

6 LASSO: Sparsity in Linear Regression through L1 Regularization

6.1 Introduction
In the previous chapters, we explored linear regression and its regularized variant, ridge regression, which employs L2 regularization. Ridge
regression discourages large weights by penalizing the squared norm of the weight vector. However, while ridge regression reduces the magnitudes
of weights, it does not drive them to zero, making it less effective in explicitly eliminating redundant features.
In this chapter, we introduce an alternative approach: L1 regularization, which directly promotes sparsity in the weight vector by encouraging
many components of w to become exactly zero. This technique forms the foundation of the LASSO algorithm (Least Absolute Shrinkage and
Selection Operator).

6.2 Motivation for Sparsity


The motivation for using L1 regularization arises from scenarios where:

• The number of features (d) is very large, potentially exceeding the number of samples (n).

• Many features are redundant or irrelevant to the prediction task.

By encouraging sparsity, we aim to simplify the model, improve interpretability, and reduce overfitting.


6.3 L1 Regularization Formulation


The L1 norm of a vector w is defined as:
‖w‖₁ = Σᵢ₌₁ᵈ |wᵢ|,

where wi represents the i-th component of w.


The L1 -regularized regression problem is formulated as:
" n
#
X 2
ŵLASSO = arg min w> xi − yi + λkwk1 .
w
i=1

Here:

• The first term represents the loss function, which is the sum of squared errors between predicted and actual target values.

• The second term λ‖w‖₁ is the regularization term, where λ > 0 controls the trade-off between minimizing the loss and promoting sparsity.

6.3.1 Equivalence to a Constrained Optimization Problem


Similar to ridge regression, the LASSO problem can also be expressed as a constrained optimization problem:
ŵLASSO = arg min_{w∈Rd} Σᵢ₌₁ⁿ (wᵀxᵢ − yᵢ)², subject to ‖w‖₁ ≤ θ,

where θ is a parameter related to λ. This formulation provides a geometric insight into the solution space.

6.4 Geometric Insight into L1 Regularization


In L2 regularization (ridge regression), the constraint ‖w‖₂² ≤ θ corresponds to a circular (or spherical) region in the parameter space. In
contrast, the L1 constraint ‖w‖₁ ≤ θ corresponds to a diamond-shaped region (or a hyperoctahedron in higher dimensions).


6.4.1 Comparison of L1 and L2 Constraints


Consider a two-dimensional parameter space (w = [w₁, w₂]):

• The L2 constraint w₁² + w₂² ≤ θ defines a circular region centered at the origin.

• The L1 constraint |w₁| + |w₂| ≤ θ defines a diamond-shaped region centered at the origin.

The sharp corners of the L1 -norm constraint region increase the likelihood that the elliptical contours of the loss function intersect the feasible
region at axes-aligned points, where one or more components of w are exactly zero.

6.4.2 Elliptical Contours and Sparse Solutions


The LASSO solution is determined by the point of intersection between:

• Elliptical contours of the loss function.

• The diamond-shaped feasible region defined by ‖w‖₁ ≤ θ.

This intersection is more likely to occur at a vertex of the diamond (e.g., points where one of the components of w is exactly zero), promoting
sparsity.
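The contrast in behaviour can be observed directly. The sketch below (an assumption of this writeup; it uses scikit-learn's Lasso and Ridge estimators on made-up synthetic data) fits both models on a problem with many irrelevant features and counts exact zeros in the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
n, d = 100, 20
X = rng.normal(size=(n, d))
# Only the first 3 of 20 features actually influence the target
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("LASSO zero coefficients:", np.sum(lasso.coef_ == 0), "of", d)
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0), "of", d)
```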

6.5 Advantages of L1 Regularization


Compared to ridge regression, L1 regularization offers several advantages:

• Sparsity: Encourages many components of w to become exactly zero, leading to simpler models.

• Feature Selection: Identifies and retains only the most relevant features.

• Interpretability: Sparse models are easier to interpret since they involve fewer features.


6.6 LASSO and Ridge Regression: A Comparison

Aspect                 Ridge Regression                                LASSO
Regularizer            ‖w‖₂²                                           ‖w‖₁
Effect on Weights      Shrinks weights but does not set them to zero   Promotes sparsity by setting many weights to exactly zero
Geometric Constraint   Circular/spherical region                       Diamond-shaped region
Feature Selection      No                                              Yes

Table 6.1: Comparison of Ridge Regression and LASSO.

6.7 Conclusion
LASSO (Least Absolute Shrinkage and Selection Operator) provides a powerful framework for achieving sparsity in linear regression. By employing
L1 regularization, it effectively identifies and eliminates irrelevant features, leading to simpler and more interpretable models. While ridge regression
shrinks weights, LASSO goes further by setting many weights to zero, making it particularly suitable for high-dimensional problems with redundant
features.
In the next chapter, we will explore theoretical guarantees for LASSO and its extensions to broader machine learning problems.

7 Advanced Topics in Regularization: Ridge, LASSO, and Beyond
7.1 Introduction
In this chapter, we delve deeper into the differences and trade-offs between ridge regression and LASSO, as well as the practical implications of
each. We also explore the computational aspects, including the lack of closed-form solutions for LASSO and how optimization techniques like
subgradient methods are employed to solve such problems. Finally, we provide a summary of regression concepts and discuss extensions to mixed
regularization techniques.

7.2 Why Not Always Use LASSO?


Given LASSO’s ability to induce sparsity by pushing weights to exactly zero, a natural question arises: Why not always prefer LASSO over ridge
regression? While LASSO has significant advantages in sparsity and feature selection, there are practical considerations where ridge regression
might be more suitable. Below, we discuss the key differences:

7.2.1 Closed-Form Solution for Ridge Regression


Ridge regression has a closed-form solution:
ŵRidge = (XᵀX + λI)⁻¹Xᵀy.
This closed form makes ridge regression computationally efficient for small datasets, allowing direct computation of ŵRidge without iterative meth-
ods.
In contrast, LASSO does not have a closed-form solution due to the non-differentiability of the L1 -norm penalty at zero. Consequently, solving
the LASSO optimization problem requires iterative methods.


7.2.2 Subgradient Methods for LASSO


Since the L1 -norm penalty is non-differentiable at zero, we use subgradient methods to solve the LASSO problem. Subgradients generalize
gradients to non-differentiable functions. A vector g ∈ Rd is a subgradient of a convex function f : Rd → R at x if:
f(z) ≥ f(x) + gᵀ(z − x), ∀z ∈ Rd.
For example, the absolute value function f (x) = |x| has the following subgradient at x = 0:
g ∈ [−1, 1].
At x ≠ 0, the subgradient is unique and equals the gradient: g = 1 if x > 0, and g = −1 if x < 0.

7.2.3 Iterative Algorithms for LASSO


To solve LASSO problems, iterative methods such as subgradient descent or specialized algorithms like Iterative Reweighted Least Squares (IRLS)
are employed:
• Subgradient Descent: Iteratively updates the weights by moving in the negative direction of a subgradient. For convex functions like the
LASSO objective, subgradient descent with an appropriately diminishing step size converges to the global minimum.
• Iterative Reweighted Least Squares (IRLS): Leverages the quadratic loss structure of LASSO to iteratively solve weighted least squares
problems, using the closed-form solution for linear regression as a subroutine.
While these methods are effective, they introduce additional computational complexity compared to the direct closed-form solution of ridge
regression.
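A bare-bones subgradient descent loop for the LASSO objective might look like the sketch below (an illustrative assumption, not the lecture's reference implementation; the step size and iteration count are made up). Note that plain subgradient descent shrinks irrelevant weights toward zero but rarely makes them exactly zero; in practice, proximal (soft-thresholding) or coordinate-descent updates are preferred for exact sparsity.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, lr=0.001, n_iters=5000):
    """Minimize ||y - Xw||^2 + lam * ||w||_1 by subgradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Gradient of the squared-error term
        grad_loss = -2 * X.T @ (y - X @ w)
        # A subgradient of lam * ||w||_1: sign(w_j) where w_j != 0, and 0 at w_j == 0
        subgrad_l1 = lam * np.sign(w)
        w = w - lr * (grad_loss + subgrad_l1)
    return w

# Illustrative usage on synthetic data with irrelevant features
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=100)
print(np.round(lasso_subgradient_descent(X, y, lam=5.0), 3))
```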

7.3 Summary of Linear Regression and Regularization


7.3.1 Key Insights from Ridge and LASSO
• Ridge Regression: Shrinks weights closer to zero but does not set them exactly to zero. Useful for multicollinearity and when interpretability
through sparsity is not critical.
• LASSO: Encourages sparsity by pushing weights to exactly zero. Suitable for feature selection in high-dimensional datasets.


7.3.2 Geometric Interpretation


• Ridge regression restricts the solution space to a hypersphere defined by the L2 -norm constraint:

‖w‖₂² ≤ θ.

• LASSO restricts the solution space to a diamond-shaped region defined by the L1 -norm constraint:

‖w‖₁ ≤ θ.

The sharper vertices of the LASSO constraint region increase the likelihood of sparse solutions, where some weights are exactly zero.

7.3.3 Extensions to Mixed Regularization


Regularization techniques can be customized for specific tasks. For example:

• Elastic Net: Combines L1 and L2 penalties:


" n
#
X
>
2
ŵElasticNet = arg min w xi − yi + λ1 kwk1 + λ2 kwk22 .
w
i=1

This method benefits from the sparsity of LASSO and the stability of ridge regression; a brief usage sketch follows this list.

• Domain-Specific Regularization: Incorporates prior knowledge, such as group structure or sparsity patterns, into the penalty function.
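As one brief illustration of mixed regularization (an assumption of this writeup, relying on scikit-learn's ElasticNet, where alpha sets the overall penalty strength and l1_ratio the L1/L2 mix):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [1.5, -2.0]                      # only two informative features
y = X @ w_true + 0.3 * rng.normal(size=100)

# alpha controls the overall penalty strength; l1_ratio balances L1 vs L2
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 3))               # sparse, mildly shrunk coefficients
```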

7.4 Conclusion and Future Directions


In this chapter, we explored the theoretical and computational aspects of ridge regression, LASSO, and their extensions. While ridge regression
provides computational simplicity and stability, LASSO excels in sparsity and feature selection. Both methods address overfitting but cater to
different practical needs.
We also highlighted the versatility of regularization techniques, which can be adapted to domain-specific requirements. This adaptability under-
scores the importance of understanding the underlying assumptions and geometry of regularization methods.
In the next chapter, we transition from regression to classification, exploring supervised learning in the context of categorical target variables.

