Homework 2: Lasso Regression
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g. LaTeX, LyX, or MathJax via IPython), though scanning handwritten work is fine as well. You may find the minted package convenient for including source code in your LaTeX document. If you are using LyX, then the listings package tends to work better.
1 Introduction
In this homework you will investigate regression with ℓ1 regularization, both implementation techniques and theoretical properties. On the methods side, you’ll work on coordinate descent (the “shooting algorithm”), homotopy methods, and [optionally] projected SGD. On the theory side you’ll derive the largest ℓ1 regularization parameter you’ll ever need to try, and optionally you’ll derive the explicit solution to the coordinate minimizers used in coordinate descent, investigate what happens with ridge and lasso regression when you have two copies of the same feature, and work out the details of the classic picture that “explains” why ℓ1 regularization leads to sparsity.
Figure 1: Training data and target function we will be considering in this assignment.
To get familiar with using the data, and perhaps to learn some techniques, it’s recommended that you work through the main() function of the included file ridge_regression.py. You’ll go through the following steps (on your own - no need to submit):
1. Load the problem from disk into memory with load_problem.
2. Use the featurize function to map from a one-dimensional input space to a d-dimensional
feature space.
3. Visualize the design matrix of the featurized data. (All entries are binary, so we will not do
any data normalization or standardization in this problem, though you may experiment with
that on your own.)
4. Take a look at the class RidgeRegression. Here we’ve implemented our own RidgeRegression using the general purpose optimizer provided by scipy.optimize. This is primarily to introduce you to the sklearn framework, if you are not already familiar with it. It can help with hyperparameter tuning, as we will see shortly.
5. Take a look at compare_our_ridge_with_sklearn. In this function, we want to get some
evidence that our implementation is correct, so we compare to sklearn’s ridge regression.
Comparing the outputs of two implementations is not always trivial – often the objective
functions are slightly different, so you may need to think a bit about how to compare the
results. In this case, sklearn has total square loss rather than average square loss, so we
needed to account for that. In this case, we get an almost exact match with sklearn. This is
because ridge regression is a rather easy objective function to optimize. You may not get as
exact a match for other objective functions, even if both methods are “correct.”
6. Next take a look at do_grid_search, in which we demonstrate how to take advantage of the fact that we’ve wrapped our ridge regression in an sklearn “Estimator” to do hyperparameter tuning. It’s a little tricky to get GridSearchCV to use the train/test split that you want, but an approach is demonstrated in this function, and a sketch of the general pattern appears after this list. In the line assigning the param_grid variable, you can see my attempts at doing hyperparameter search on a different problem. Below you will be modifying this (or using some other method, if you prefer) to find the optimal L2 regularization parameter for the data provided.
7. Next is some code to plot the results of the hyperparameter search.
8. Next we want to visualize some prediction functions. We plotted the target function, along with several prediction functions corresponding to different regularization parameters, as functions of the original input space R, along with the training data. Next we visualize the coefficients of each feature with bar charts. Take note of the scale of the y-axis, as it may vary substantially across charts by default.
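To make the estimator and grid-search steps above concrete, the following is a minimal sketch of the general pattern, not the assignment’s actual code: the closed-form ridge fit, the parameter name l2reg, and the data variables X_train, X_val, y_train, y_val are assumptions, and in practice you would use the RidgeRegression class and data loading from ridge_regression.py.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV, PredefinedSplit

class RidgeEstimator(BaseEstimator, RegressorMixin):
    """Sketch of an sklearn-compatible ridge estimator (hypothetical)."""
    def __init__(self, l2reg=1.0):
        self.l2reg = l2reg

    def fit(self, X, y):
        # Closed-form minimizer of (1/n)||Xw - y||^2 + l2reg*||w||^2;
        # the course code minimizes the same objective with scipy.optimize.
        n, d = X.shape
        self.w_ = np.linalg.solve(X.T @ X / n + self.l2reg * np.eye(d),
                                  X.T @ y / n)
        return self

    def predict(self, X):
        return X @ self.w_

# Stack train and validation data; test_fold = -1 marks rows used only for
# training, 0 marks rows belonging to the single validation "fold".
X = np.vstack([X_train, X_val])
y = np.concatenate([y_train, y_val])
test_fold = np.concatenate([-np.ones(len(y_train)), np.zeros(len(y_val))])

grid = GridSearchCV(RidgeEstimator(),
                    param_grid={"l2reg": np.logspace(-6, 1, 15)},
                    cv=PredefinedSplit(test_fold),
                    scoring="neg_mean_squared_error",
                    refit=False)
grid.fit(X, y)
print(grid.best_params_)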
2 Ridge Regression
In the problems below, you do not need to implement ridge regression. You may use any of the code
provided in the assignment, or you may use other packages. However, your results must correspond
to the ridge regression objective function that we use, namely
$$J(w; \lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \lambda \|w\|_2^2.$$
1. Run ridge regression on the provided training dataset. Choose the λ that minimizes the
empirical risk (i.e. the average square loss) on the validation set. Include a table of the
parameter values you tried and the validation performance for each. Also include a plot of
the results.
2. Now we want to visualize the prediction functions. On the same axes, plot the following: the
training data, the target function, an unregularized least squares fit (still using the featurized
data), and the prediction function chosen in the previous problem. Next, along the lines of the
bar charts produced by the code in compare_parameter_vectors, visualize the coefficients for
each of the prediction functions plotted, including the target function. Describe the patterns,
including the scale of the coefficients, as well as which coefficients have the most weight.
3. For the chosen λ, examine the model coefficients. For ridge regression, we don’t expect any parameters to be exactly 0. However, let’s investigate whether we can predict the sparsity pattern of the true parameters (i.e. which parameters are 0 and which are nonzero) by thresholding the parameter estimates we get from ridge regression. We’ll predict that wi = 0 if |ŵi| < ε and wi ≠ 0 otherwise. Give the confusion matrix for ε = 10⁻⁶, 10⁻³, 10⁻¹, and any other thresholds you would like to try. (A sketch of this computation appears below.)
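Here is a minimal sketch of computing such a confusion matrix; the variable names coefs_true (the target function’s coefficients) and coefs_ridge (the fitted ridge coefficients) are assumptions, to be replaced with whatever your code produces.

import numpy as np

def sparsity_confusion(coefs_true, coefs_ridge, eps):
    # Rows: truly zero / truly nonzero. Columns: predicted zero / predicted nonzero.
    actual_zero = (coefs_true == 0)
    predicted_zero = (np.abs(coefs_ridge) < eps)
    return np.array(
        [[np.sum(actual_zero & predicted_zero), np.sum(actual_zero & ~predicted_zero)],
         [np.sum(~actual_zero & predicted_zero), np.sum(~actual_zero & ~predicted_zero)]])

for eps in [1e-6, 1e-3, 1e-1]:
    print(eps)
    print(sparsity_confusion(coefs_true, coefs_ridge, eps))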
3 Coordinate Descent for Lasso (a.k.a. The Shooting Algorithm)
The Lasso optimization problem can be formulated as
$$\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{m} \left( h_w(x_i) - y_i \right)^2 + \lambda \|w\|_1,$$
where $h_w(x) = w^T x$ and $\|w\|_1 = \sum_{i=1}^{d} |w_i|$. Note that to align with Murphy’s formulation below, and for historical reasons, we are using the total square loss, rather than the average square loss, in the objective function.
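For reference, a one-line encoding of this objective (a sketch, assuming numpy arrays X, y and penalty lam) can be useful for the convergence checks later in this section:

import numpy as np

def lasso_objective(w, X, y, lam):
    # Total (not average) square loss plus the l1 penalty, as above.
    residual = X @ w - y
    return residual @ residual + lam * np.sum(np.abs(w))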
Since the ℓ1-regularization term in the objective function is non-differentiable, it’s not immediately clear how gradient descent or SGD could be used to solve this optimization problem directly. (In fact, as we’ll see in the next homework on SVMs, we can use “subgradient” methods when the objective function is not differentiable, in addition to the two methods discussed in this homework assignment.)
Another approach to solving optimization problems is coordinate descent, in which at each step
we optimize over one component of the unknown parameter vector, fixing all other components.
The descent path so obtained is a sequence of steps, each of which is parallel to a coordinate axis
in Rd , hence the name. It turns out that for the Lasso optimization problem, we can find a closed
form solution for optimization over a single component fixing all other components. This gives us
the following algorithm, known as the shooting algorithm:
(Source: Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.)
The “soft thresholding” function is defined as
$$\operatorname{soft}(a; \delta) := \operatorname{sign}(a)\,(|a| - \delta)_+ = \operatorname{sign}(a) \max(|a| - \delta, 0)$$
for any a, δ ∈ R.
NOTE: Algorithm 13.1 does not account for the case that aj = cj = 0, which occurs when the jth column of X is identically 0. One can either eliminate the column (as it cannot possibly help the solution), or you can set wj = 0 in that case, since it is, as you can easily verify, the coordinate minimizer. Note also that Murphy is suggesting to initialize the optimization with the ridge regression solution. Although theoretically this is not necessary (with exact computations and enough time, coordinate descent will converge for lasso from any starting point), in practice it’s helpful to start as close to the solution as we’re able.
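Below is a minimal sketch of one way to implement the shooting algorithm as described above; it is not reference code, and the convergence constants simply mirror the criteria given in the problems that follow. The starting point w_init can be zeros or a ridge solution, per the note above.

import numpy as np

def soft(a, delta):
    # Soft thresholding: sign(a) * max(|a| - delta, 0).
    return np.sign(a) * max(abs(a) - delta, 0.0)

def shooting(X, y, lam, w_init, max_passes=1000, tol=1e-8):
    w = np.array(w_init, dtype=float)
    d = X.shape[1]
    objective = lambda w: np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))
    old_obj = objective(w)
    for _ in range(max_passes):
        for j in range(d):  # cyclic order; permute j for randomized descent
            a_j = 2 * np.sum(X[:, j] ** 2)
            # c_j is computed with coordinate j's contribution removed.
            c_j = 2 * X[:, j] @ (y - X @ w + w[j] * X[:, j])
            # Coordinate minimizer; handles the a_j = c_j = 0 case with w_j = 0.
            w[j] = 0.0 if a_j == 0.0 else soft(c_j / a_j, lam / a_j)
        new_obj = objective(w)
        if old_obj - new_obj < tol:
            break
        old_obj = new_obj
    return w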
There are a few tricks that can make selecting the hyperparameter λ easier and faster. First, as we’ll see in a later problem, you can show that for any λ ≥ 2‖Xᵀ(y − ȳ)‖∞, the estimated weight vector ŵ is entirely zero, where ȳ is the mean of the values in the vector y, and ‖·‖∞ is the infinity norm (or supremum norm), which is the maximum over the absolute values of the components of a vector. Thus we need to search for an optimal λ in [0, λmax], where λmax = 2‖Xᵀ(y − ȳ)‖∞. (Note: This expression for λmax assumes we have an unregularized bias term in our model. That is, our decision functions are of the form h_{w,b}(x) = wᵀx + b. In our experiments, we do not have an unregularized bias term, so we should use λmax = 2‖Xᵀy‖∞.)
The second trick is to use the fact that when λ and λ′ are close, the corresponding solutions ŵ(λ) and ŵ(λ′) are also close. Start with λ = λmax, for which we know ŵ(λmax) = 0. You can run the optimization anyway, and initialize the optimization at w = 0. Next, λ is reduced (e.g. by a constant factor close to 1), and the optimization problem is solved using the previous optimal point as the starting point. This is called warm starting the optimization. The technique of computing a set of solutions for a chain of nearby λ’s is called a continuation or homotopy method. The resulting set of parameter values ŵ(λ), as λ ranges over [0, λmax], is known as a regularization path.
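A minimal sketch of this warm-starting loop, assuming training arrays X_train, y_train and the shooting function from the earlier sketch (substitute your own solver), might look like:

import numpy as np

# lam_max for our setting without an unregularized bias term.
lam_max = 2 * np.max(np.abs(X_train.T @ y_train))

w = np.zeros(X_train.shape[1])  # we know w_hat(lam_max) = 0
solutions = {}
for i in range(30):
    lam = lam_max * 0.8 ** i
    w = shooting(X_train, y_train, lam, w_init=w)  # warm start from previous w
    solutions[lam] = w.copy()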
2. Write a function that computes the Lasso solution for a given λ using the shooting algorithm described above. For convergence criteria, continue coordinate descent until a pass through the coordinates reduces the objective function by less than 10⁻⁸, or you have taken 1000 passes through the coordinates. Compare the performance of cyclic coordinate descent to randomized coordinate descent, where in each round we pass through the coordinates in a different random order (for your choices of λ). Compare also the solutions attained (following the convergence criteria above) for starting at 0 versus starting at the ridge regression solution suggested by Murphy (again, for your choices of λ). If you like, you may adjust the convergence criteria to try to attain better results (or the same results faster).
3. Run your best Lasso configuration on the training dataset provided, and select the λ that
minimizes the square error on the validation set. Include a table of the parameter values you
tried and the validation performance for each. Also include a plot of these results. Include
also a plot of the prediction functions, just as in the ridge regression section, but this time
add the best performing Lasso prediction function and remove the unregularized least
squares fit. Similarly, add the lasso coefficients to the bar charts of coefficients generated in
the ridge regression setting. Comment on the results, with particular attention to parameter
sparsity and how the ridge and lasso solutions compare. What’s the best model you found,
and what’s its validation performance?
4. Implement the homotopy method described above. Compute the Lasso solution for (at least) the regularization parameters in the set {λmax · 0.8ⁱ | i = 0, . . . , 29}. Plot the results (average validation loss vs. λ).
5. [Optional] Note that the data in Figure 1 is almost entirely nonnegative. Since we don’t have an unregularized bias term, we have to “pay for” this offset using our penalized parameters. Note also that λmax would decrease significantly if the y values were 0-centered (using the training data, of course), or if we included an unregularized bias term. Experiment with one or both of these approaches, for both lasso and ridge regression, and report your findings.
where we’ve written xij for the jth entry of the vector xi. This function is convex in wj. The only thing keeping f from being differentiable is the term with |wj|. So f is differentiable everywhere except wj = 0. We’ll break this problem into 3 cases: wj > 0, wj < 0, and wj = 0. In the first two cases, we can simply differentiate f w.r.t. wj to get optimality conditions. For the last case, we’ll use the fact that since f : R → R is convex, 0 is a minimizer of f iff both one-sided directional derivatives of f at 0 are nonnegative. Write your expressions in terms of the following definitions:
$$\operatorname{sign}(w_j) := \begin{cases} 1 & w_j > 0 \\ 0 & w_j = 0 \\ -1 & w_j < 0 \end{cases}$$
$$a_j := 2 \sum_{i=1}^{n} x_{ij}^2$$
$$c_j := 2 \sum_{i=1}^{n} x_{ij} \left( y_i - \sum_{k \neq j} w_k x_{ik} \right).$$
3. If wj > 0 and minimizes f, show that wj = (1/aj)(cj − λ). Similarly, if wj < 0 and minimizes f, show that wj = (1/aj)(cj + λ). Give conditions on cj that imply that a minimizer wj is positive, and conditions for which a minimizer wj is negative.
4. Derive expressions for the two one-sided derivatives of f at 0, and show that cj ∈ [−λ, λ] implies that wj = 0 is a minimizer.
5. Putting together the preceding results, we conclude the following:
$$w_j = \begin{cases} \frac{1}{a_j}(c_j - \lambda) & c_j > \lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ \frac{1}{a_j}(c_j + \lambda) & c_j < -\lambda \end{cases}$$
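As a quick sanity check, the closed form above can be compared numerically against a direct one-dimensional minimization of the coordinate objective. This sketch is purely illustrative (random data, with r_i standing for the fixed contribution of the other coordinates, so that y_i − Σ_{k≠j} w_k x_ik = −r_i):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x_j = rng.normal(size=50)   # the jth column of X
r = rng.normal(size=50)     # residual contribution of the other coordinates
lam = 3.0

a_j = 2 * np.sum(x_j ** 2)
c_j = 2 * x_j @ (-r)
closed_form = np.sign(c_j) * max(abs(c_j) - lam, 0.0) / a_j

f = lambda w_j: np.sum((w_j * x_j + r) ** 2) + lam * abs(w_j)
numeric = minimize_scalar(f).x
print(closed_form, numeric)  # should agree to several decimal places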
4 Lasso Properties
4.1 Deriving λmax
In this problem we will derive an expression for λmax. For the first three parts, use the Lasso objective function excluding the bias term, i.e.,
$$J(w) = \|Xw - y\|_2^2 + \lambda \|w\|_1.$$
We will show that for any λ ≥ 2‖Xᵀy‖∞, the estimated weight vector ŵ is entirely zero, where ‖·‖∞ is the infinity norm (or supremum norm), which is the maximum absolute value of any component of the vector.
1. The one-sided directional derivative of f(x) at x in the direction v is defined as
$$f'(x; v) = \lim_{h \downarrow 0} \frac{f(x + hv) - f(x)}{h}.$$
Compute J′(0; v). That is, compute the one-sided directional derivative of J(w) at w = 0 in the direction v. [Hint: the result should be in terms of X, y, λ, and v.]
2. Since the Lasso objective is convex, w∗ is a minimizer of J(w) if and only if the directional derivative J′(w∗; v) ≥ 0 for all v ≠ 0. Show that for any v ≠ 0, we have J′(0; v) ≥ 0 if and only if λ ≥ C, for some C that depends on X, y, and v. You should have an explicit expression for C.
3. In the previous problem, we get a different lower bound on λ for each choice of v. Show that the maximum of these lower bounds on λ is λmax = 2‖Xᵀy‖∞. Conclude that w = 0 is a minimizer of J(w) if and only if λ ≥ 2‖Xᵀy‖∞.
4. [Optional] Let J(w, b) = ‖Xw + b1 − y‖₂² + λ‖w‖₁, where 1 ∈ Rⁿ is a column vector of 1’s. Let ȳ be the mean of the values in the vector y. Show that (w∗, b∗) = (0, ȳ) is a minimizer of J(w, b) if and only if λ ≥ λmax = 2‖Xᵀ(y − ȳ)‖∞.
minimizes the ridge regression objective function. What is the relationship between a and b,
and why?
3. [Optional] What do you think would happen with Lasso and ridge when X·i and X·j are highly correlated, but not exactly the same? You may investigate this experimentally or theoretically.
(While Hastie et al. use β for the parameters, we’ll continue to use w.)
In this problem we’ll show that the level sets of the empirical risk are indeed ellipsoids centered
at the empirical risk minimizer ŵ.
Consider linear prediction functions of the form x ↦ wᵀx. Then the empirical risk for f(x) = wᵀx under the square loss is
$$\hat{R}_n(w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 = \frac{1}{n} (Xw - y)^T (Xw - y).$$
1. [Optional] Let ŵ = (XᵀX)⁻¹Xᵀy. Show that ŵ has empirical risk given by
$$\hat{R}_n(\hat{w}) = \frac{1}{n} \left( -y^T X \hat{w} + y^T y \right).$$
2. [Optional] Show that for any w we have
$$\hat{R}_n(w) = \frac{1}{n} (w - \hat{w})^T X^T X (w - \hat{w}) + \hat{R}_n(\hat{w}).$$
Note that the RHS (i.e. “right hand side”) has one term that’s quadratic in w and one term that’s independent of w. In particular, the RHS does not have any term that’s linear in w. On the LHS (i.e. “left hand side”), we have R̂n(w) = (1/n)(Xw − y)ᵀ(Xw − y). After expanding this out, you’ll have terms that are quadratic, linear, and constant in w. Completing the square is the tool for rearranging an expression to get rid of the linear terms. The following “completing the square” identity is easy to verify just by multiplying out the expressions on the RHS:
$$x^T M x - 2 b^T x = \left( x - M^{-1} b \right)^T M \left( x - M^{-1} b \right) - b^T M^{-1} b.$$
3. [Optional] Using the expression derived for R̂n(w) in 2, give a very short proof that ŵ = (XᵀX)⁻¹Xᵀy is the empirical risk minimizer; that is, show that R̂n(ŵ) ≤ R̂n(w) for all w ∈ Rᵈ.
4. [Optional] Give an expression for the set of w for which the empirical risk exceeds the minimum empirical risk R̂n(ŵ) by an amount c > 0. If X is full rank, then XᵀX is positive definite, and this set is an ellipsoid. What is its center?
$$(\hat{\theta}^+, \hat{\theta}^-) = \arg\min_{\theta^+, \theta^- \in \mathbb{R}^d} \sum_{i=1}^{m} \left( h_{\theta^+,\theta^-}(x_i) - y_i \right)^2 + \lambda \sum_{i=1}^{d} \theta_i^+ + \lambda \sum_{i=1}^{d} \theta_i^-$$
$$\text{such that } \theta^+ \ge 0 \text{ and } \theta^- \ge 0,$$
where h_{θ⁺,θ⁻}(x) = (θ⁺ − θ⁻)ᵀx. The original parameter θ can then be estimated as θ̂ = θ̂⁺ − θ̂⁻.
This is a convex optimization problem with a differentiable objective and linear inequality
constraints. We can approach this problem using projected stochastic gradient descent, as discussed
in lecture. Here, after taking our stochastic gradient step, we project the result back into the feasible
set by setting any negative components of θ+ and θ− to zero.
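A minimal sketch of this procedure might look like the following; the step size, epoch count, and the per-step share lam/n of the regularizer are all assumptions to be tuned.

import numpy as np

def projected_sgd(X, y, lam, step=1e-3, epochs=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    theta_pos = np.zeros(d)
    theta_neg = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Stochastic gradient of the split objective at example i; the
            # regularization term is spread evenly over the n per-example steps.
            resid = (theta_pos - theta_neg) @ X[i] - y[i]
            theta_pos -= step * (2 * resid * X[i] + lam / n)
            theta_neg -= step * (-2 * resid * X[i] + lam / n)
            # Project back into the feasible set: clip negatives to zero.
            np.maximum(theta_pos, 0.0, out=theta_pos)
            np.maximum(theta_neg, 0.0, out=theta_neg)
    return theta_pos - theta_neg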
1. [Optional] Implement projected SGD to solve the above optimization problem for the same
λ’s as used with the shooting algorithm. Since the two optimization algorithms should find
essentially the same solutions, you can check the algorithms against each other. Report the
differences in validation loss for each λ between the two optimization methods. (You can
make a table or plot the differences.)
2. [Optional] Choose the λ that gives the best performance on the validation set. Describe the
solution ŵ in term of its sparsity. How does the sparsity compare to the solution from the
shooting algorithm?