Homework 2: Lasso Regression
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g. LaTeX, LyX, or MathJax via IPython), though scanning handwritten work is fine as well. You may find the minted package convenient for including source code in your LaTeX document. If you are using LyX, then the listings package tends to work better.
1 Introduction
In this homework you will investigate regression with ℓ1 regularization, both implementation techniques and theoretical properties. On the methods side, you’ll work on coordinate descent (the “shooting algorithm”), homotopy methods, and [optionally] projected SGD. On the theory side you’ll derive the largest ℓ1 regularization parameter you’ll ever need to try, and optionally you’ll derive the explicit solution to the coordinate minimizers used in coordinate descent, investigate what happens with ridge and lasso regression when you have two copies of the same feature, and work out the details of the classic picture that “explains” why ℓ1 regularization leads to sparsity.
Figure 1: Training data and target function we will be considering in this assignment.
To get familiar with using the data, and perhaps to learn some techniques, it’s recommended that you work through the main() function of the included file ridge_regression.py. You’ll go through the following steps (on your own - no need to submit):
1. Load the problem from disk into memory with load_problem.
2. Use the featurize function to map from a one-dimensional input space to a d-dimensional
feature space.
3. Visualize the design matrix of the featurized data. (All entries are binary, so we will not do
any data normalization or standardization in this problem, though you may experiment with
that on your own.)
4. Take a look at the class RidgeRegression. Here we’ve implemented our own RidgeRegression using the general purpose optimizer provided by scipy.optimize. This is primarily to introduce you to the sklearn framework, if you are not already familiar with it. It can help with hyperparameter tuning, as we will see shortly.
5. Take a look at compare_our_ridge_with_sklearn. In this function, we want to get some
evidence that our implementation is correct, so we compare to sklearn’s ridge regression.
Comparing the outputs of two implementations is not always trivial – often the objective
functions are slightly different, so you may need to think a bit about how to compare the
results. In this case, sklearn has total square loss rather than average square loss, so we
needed to account for that. In this case, we get an almost exact match with sklearn. This is
because ridge regression is a rather easy objective function to optimize. You may not get as
exact a match for other objective functions, even if both methods are “correct.”
6. Next take a look at do_grid_search, in which we demonstrate how to take advantage of the fact that we’ve wrapped our ridge regression in an sklearn “Estimator” to do hyperparameter tuning. It’s a little tricky to get GridSearchCV to use the train/test split that you want, but an approach is demonstrated in this function, and a sketch of the general pattern appears after this list. In the line assigning the param_grid variable, you can see my attempts at doing hyperparameter search on a different problem. Below you will be modifying this (or using some other method, if you prefer) to find the optimal L2 regularization parameter for the data provided.
7. Next is some code to plot the results of the hyperparameter search.
8. Next we want to visualize some prediction functions. We plotted the target function, along with several prediction functions corresponding to different regularization parameters, as functions of the original input space R, along with the training data. Next we visualize the coefficients of each feature with bar charts. Take note of the scale of the y-axis, as it may vary substantially across charts by default.
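To make the estimator and grid-search steps above concrete, the following is a minimal sketch of the general pattern, not the assignment’s actual code: the closed-form ridge fit, the parameter name l2reg, and the data variables X_train, X_val, y_train, y_val are assumptions, and in practice you would use the RidgeRegression class and data loading from ridge_regression.py.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV, PredefinedSplit

class RidgeEstimator(BaseEstimator, RegressorMixin):
    """Sketch of an sklearn-compatible ridge estimator (hypothetical)."""
    def __init__(self, l2reg=1.0):
        self.l2reg = l2reg

    def fit(self, X, y):
        # Closed-form minimizer of (1/n)||Xw - y||^2 + l2reg*||w||^2;
        # the course code minimizes the same objective with scipy.optimize.
        n, d = X.shape
        self.w_ = np.linalg.solve(X.T @ X / n + self.l2reg * np.eye(d),
                                  X.T @ y / n)
        return self

    def predict(self, X):
        return X @ self.w_

# Stack train and validation data; test_fold = -1 marks rows used only for
# training, 0 marks rows belonging to the single validation "fold".
X = np.vstack([X_train, X_val])
y = np.concatenate([y_train, y_val])
test_fold = np.concatenate([-np.ones(len(y_train)), np.zeros(len(y_val))])

grid = GridSearchCV(RidgeEstimator(),
                    param_grid={"l2reg": np.logspace(-6, 1, 15)},
                    cv=PredefinedSplit(test_fold),
                    scoring="neg_mean_squared_error",
                    refit=False)
grid.fit(X, y)
print(grid.best_params_)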
2 Ridge Regression
In the problems below, you do not need to implement ridge regression. You may use any of the code
provided in the assignment, or you may use other packages. However, your results must correspond
to the ridge regression objective function that we use, namely
$$J(w; \lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \lambda \|w\|_2^2.$$
1. Run ridge regression on the provided training dataset. Choose the λ that minimizes the
empirical risk (i.e. the average square loss) on the validation set. Include a table of the
parameter values you tried and the validation performance for each. Also include a plot of
the results.
2. Now we want to visualize the prediction functions. On the same axes, plot the following: the
training data, the target function, an unregularized least squares fit (still using the featurized
data), and the prediction function chosen in the previous problem. Next, along the lines of the
bar charts produced by the code in compare_parameter_vectors, visualize the coefficients for
each of the prediction functions plotted, including the target function. Describe the patterns,
including the scale of the coefficients, as well as which coefficients have the most weight.
3. For the chosen λ, examine the model coefficients. For ridge regression, we don’t expect any parameters to be exactly 0. However, let’s investigate whether we can predict the sparsity pattern of the true parameters (i.e. which parameters are 0 and which are nonzero) by thresholding the parameter estimates we get from ridge regression. We’ll predict that wi = 0 if |ŵi| < ε and wi ≠ 0 otherwise. Give the confusion matrix for ε = 10⁻⁶, 10⁻³, 10⁻¹, and any other thresholds you would like to try. (A sketch of this computation appears below.)
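Here is a minimal sketch of computing such a confusion matrix; the variable names coefs_true (the target function’s coefficients) and coefs_ridge (the fitted ridge coefficients) are assumptions, to be replaced with whatever your code produces.

import numpy as np

def sparsity_confusion(coefs_true, coefs_ridge, eps):
    # Rows: truly zero / truly nonzero. Columns: predicted zero / predicted nonzero.
    actual_zero = (coefs_true == 0)
    predicted_zero = (np.abs(coefs_ridge) < eps)
    return np.array(
        [[np.sum(actual_zero & predicted_zero), np.sum(actual_zero & ~predicted_zero)],
         [np.sum(~actual_zero & predicted_zero), np.sum(~actual_zero & ~predicted_zero)]])

for eps in [1e-6, 1e-3, 1e-1]:
    print(eps)
    print(sparsity_confusion(coefs_true, coefs_ridge, eps))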
3 Coordinate Descent for Lasso (a.k.a. The Shooting Algorithm)
The Lasso optimization problem can be formulated as
$$\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{m} \left( h_w(x_i) - y_i \right)^2 + \lambda \|w\|_1,$$
where $h_w(x) = w^T x$ and $\|w\|_1 = \sum_{i=1}^{d} |w_i|$. Note that to align with Murphy’s formulation below, and for historical reasons, we are using the total square loss, rather than the average square loss, in the objective function.
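For reference, a one-line encoding of this objective (a sketch, assuming numpy arrays X, y and penalty lam) can be useful for the convergence checks later in this section:

import numpy as np

def lasso_objective(w, X, y, lam):
    # Total (not average) square loss plus the l1 penalty, as above.
    residual = X @ w - y
    return residual @ residual + lam * np.sum(np.abs(w))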
Since the ℓ1-regularization term in the objective function is non-differentiable, it’s not immediately clear how gradient descent or SGD could be used to solve this optimization problem directly. (In fact, as we’ll see in the next homework on SVMs, we can use “subgradient” methods when the objective function is not differentiable, in addition to the two methods discussed in this homework assignment.)
Another approach to solving optimization problems is coordinate descent, in which at each step
we optimize over one component of the unknown parameter vector, fixing all other components.
The descent path so obtained is a sequence of steps, each of which is parallel to a coordinate axis
in Rd , hence the name. It turns out that for the Lasso optimization problem, we can find a closed
form solution for optimization over a single component fixing all other components. This gives us
the following algorithm, known as the shooting algorithm:
(Source: Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.)
The “soft thresholding” function is defined as
$$\operatorname{soft}(a; \delta) := \operatorname{sign}(a)\,(|a| - \delta)_+ = \operatorname{sign}(a) \max(|a| - \delta, 0)$$
for any a, δ ∈ R.
NOTE: Algorithm 13.1 does not account for the case that aj = cj = 0, which occurs when the jth column of X is identically 0. One can either eliminate the column (as it cannot possibly help the solution), or you can set wj = 0 in that case, since it is, as you can easily verify, the coordinate minimizer. Note also that Murphy is suggesting to initialize the optimization with the ridge regression solution. Although theoretically this is not necessary (with exact computations and enough time, coordinate descent will converge for lasso from any starting point), in practice it’s helpful to start as close to the solution as we’re able.
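Below is a minimal sketch of one way to implement the shooting algorithm as described above; it is not reference code, and the convergence constants simply mirror the criteria given in the problems that follow. The starting point w_init can be zeros or a ridge solution, per the note above.

import numpy as np

def soft(a, delta):
    # Soft thresholding: sign(a) * max(|a| - delta, 0).
    return np.sign(a) * max(abs(a) - delta, 0.0)

def shooting(X, y, lam, w_init, max_passes=1000, tol=1e-8):
    w = np.array(w_init, dtype=float)
    d = X.shape[1]
    objective = lambda w: np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))
    old_obj = objective(w)
    for _ in range(max_passes):
        for j in range(d):  # cyclic order; permute j for randomized descent
            a_j = 2 * np.sum(X[:, j] ** 2)
            # c_j is computed with coordinate j's contribution removed.
            c_j = 2 * X[:, j] @ (y - X @ w + w[j] * X[:, j])
            # Coordinate minimizer; handles the a_j = c_j = 0 case with w_j = 0.
            w[j] = 0.0 if a_j == 0.0 else soft(c_j / a_j, lam / a_j)
        new_obj = objective(w)
        if old_obj - new_obj < tol:
            break
        old_obj = new_obj
    return w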
There are a few tricks that can make selecting the hyperparameter λ easier and faster. First, as we’ll see in a later problem, you can show that for any λ ≥ 2‖Xᵀ(y − ȳ)‖∞, the estimated weight vector ŵ is entirely zero, where ȳ is the mean of the values in the vector y, and ‖·‖∞ is the infinity norm (or supremum norm), which is the maximum over the absolute values of the components of a vector. Thus we need to search for an optimal λ in [0, λmax], where λmax = 2‖Xᵀ(y − ȳ)‖∞. (Note: This expression for λmax assumes we have an unregularized bias term in our model. That is, our decision functions are of the form h_{w,b}(x) = wᵀx + b. In our experiments, we do not have an unregularized bias term, so we should use λmax = 2‖Xᵀy‖∞.)
The second trick is to use the fact that when λ and λ′ are close, the corresponding solutions ŵ(λ) and ŵ(λ′) are also close. Start with λ = λmax, for which we know ŵ(λmax) = 0. You can run the optimization anyway, and initialize the optimization at w = 0. Next, λ is reduced (e.g. by a constant factor close to 1), and the optimization problem is solved using the previous optimal point as the starting point. This is called warm starting the optimization. The technique of computing a set of solutions for a chain of nearby λ’s is called a continuation or homotopy method. The resulting set of parameter values ŵ(λ), as λ ranges over [0, λmax], is known as a regularization path.
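A minimal sketch of this warm-starting loop, assuming training arrays X_train, y_train and the shooting function from the earlier sketch (substitute your own solver), might look like:

import numpy as np

# lam_max for our setting without an unregularized bias term.
lam_max = 2 * np.max(np.abs(X_train.T @ y_train))

w = np.zeros(X_train.shape[1])  # we know w_hat(lam_max) = 0
solutions = {}
for i in range(30):
    lam = lam_max * 0.8 ** i
    w = shooting(X_train, y_train, lam, w_init=w)  # warm start from previous w
    solutions[lam] = w.copy()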
2. Write a function that computes the Lasso solution for a given λ using the shooting algorithm described above. For convergence criteria, continue coordinate descent until a pass through the coordinates reduces the objective function by less than 10⁻⁸, or you have taken 1000 passes through the coordinates. Compare the performance of cyclic coordinate descent to randomized coordinate descent, where in each round we pass through the coordinates in a different random order (for your choices of λ). Compare also the solutions attained (following the convergence criteria above) for starting at 0 versus starting at the ridge regression solution suggested by Murphy (again, for your choices of λ). If you like, you may adjust the convergence criteria to try to attain better results (or the same results faster).
3. Run your best Lasso configuration on the training dataset provided, and select the λ that
minimizes the square error on the validation set. Include a table of the parameter values you
tried and the validation performance for each. Also include a plot of these results. Include
also a plot of the prediction functions, just as in the ridge regression section, but this time
add the best performing Lasso prediction function and remove the unregularized least
squares fit. Similarly, add the lasso coefficients to the bar charts of coefficients generated in
the ridge regression setting. Comment on the results, with particular attention to parameter
sparsity and how the ridge and lasso solutions compare. What’s the best model you found,
and what’s its validation performance?
4. Implement the homotopy method described above. Compute the Lasso solution for (at least) the regularization parameters in the set {λmax · 0.8ⁱ | i = 0, . . . , 29}. Plot the results (average validation loss vs. λ).
5. [Optional] Note that the data in Figure 1 is almost entirely nonnegative. Since we don’t have an unregularized bias term, we have to “pay for” this offset using our penalized parameters. Note also that λmax would decrease significantly if the y values were 0-centered (using the training data, of course), or if we included an unregularized bias term. Experiment with one or both of these approaches, for both lasso and ridge regression, and report your findings.
where we’ve written xij for the jth entry of the vector xi. This function is convex in wj. The only thing keeping f from being differentiable is the term with |wj|. So f is differentiable everywhere except wj = 0. We’ll break this problem into 3 cases: wj > 0, wj < 0, and wj = 0. In the first two cases, we can simply differentiate f w.r.t. wj to get optimality conditions. For the last case, we’ll use the fact that since f : R → R is convex, 0 is a minimizer of f iff both one-sided directional derivatives of f at 0 are nonnegative. Write your expressions in terms of the following definitions:
$$\operatorname{sign}(w_j) := \begin{cases} 1 & w_j > 0 \\ 0 & w_j = 0 \\ -1 & w_j < 0 \end{cases}$$
$$a_j := 2 \sum_{i=1}^{n} x_{ij}^2$$
$$c_j := 2 \sum_{i=1}^{n} x_{ij} \left( y_i - \sum_{k \neq j} w_k x_{ik} \right).$$
3. If wj > 0 and minimizes f, show that wj = (1/aj)(cj − λ). Similarly, if wj < 0 and minimizes f, show that wj = (1/aj)(cj + λ). Give conditions on cj that imply that a minimizer wj is positive, and conditions for which a minimizer wj is negative.
4. Derive expressions for the two one-sided derivatives of f at 0, and show that cj ∈ [−λ, λ] implies that wj = 0 is a minimizer.
5. Putting together the preceding results, we conclude the following:
$$w_j = \begin{cases} \frac{1}{a_j}(c_j - \lambda) & c_j > \lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ \frac{1}{a_j}(c_j + \lambda) & c_j < -\lambda \end{cases}$$
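As a quick sanity check, the closed form above can be compared numerically against a direct one-dimensional minimization of the coordinate objective. This sketch is purely illustrative (random data, with r_i standing for the fixed contribution of the other coordinates, so that y_i − Σ_{k≠j} w_k x_ik = −r_i):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x_j = rng.normal(size=50)   # the jth column of X
r = rng.normal(size=50)     # residual contribution of the other coordinates
lam = 3.0

a_j = 2 * np.sum(x_j ** 2)
c_j = 2 * x_j @ (-r)
closed_form = np.sign(c_j) * max(abs(c_j) - lam, 0.0) / a_j

f = lambda w_j: np.sum((w_j * x_j + r) ** 2) + lam * abs(w_j)
numeric = minimize_scalar(f).x
print(closed_form, numeric)  # should agree to several decimal places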
4 Lasso Properties
4.1 Deriving λmax
In this problem we will derive an expression for λmax. For the first three parts, use the Lasso objective function excluding the bias term, i.e.,
$$J(w) = \|Xw - y\|_2^2 + \lambda \|w\|_1.$$
We will show that for any λ ≥ 2‖Xᵀy‖∞, the estimated weight vector ŵ is entirely zero, where ‖·‖∞ is the infinity norm (or supremum norm), which is the maximum absolute value of any component of the vector.
1. The one-sided directional derivative of f(x) at x in the direction v is defined as
$$f'(x; v) = \lim_{h \downarrow 0} \frac{f(x + hv) - f(x)}{h}.$$
Compute J′(0; v). That is, compute the one-sided directional derivative of J(w) at w = 0 in the direction v. [Hint: the result should be in terms of X, y, λ, and v.]
2. Since the Lasso objective is convex, w∗ is a minimizer of J(w) if and only if the directional derivative J′(w∗; v) ≥ 0 for all v ≠ 0. Show that for any v ≠ 0, we have J′(0; v) ≥ 0 if and only if λ ≥ C, for some C that depends on X, y, and v. You should have an explicit expression for C.
3. In the previous problem, we get a different lower bound on λ for each choice of v. Show that the maximum of these lower bounds on λ is λmax = 2‖Xᵀy‖∞. Conclude that w = 0 is a minimizer of J(w) if and only if λ ≥ 2‖Xᵀy‖∞.
4. [Optional] Let J(w, b) = ‖Xw + b1 − y‖₂² + λ‖w‖₁, where 1 ∈ Rⁿ is a column vector of 1’s. Let ȳ be the mean of the values in the vector y. Show that (w∗, b∗) = (0, ȳ) is a minimizer of J(w, b) if and only if λ ≥ λmax = 2‖Xᵀ(y − ȳ)‖∞.
minimizes the ridge regression objective function. What is the relationship between a and b,
and why?
3. [Optional] What do you think would happen with Lasso and ridge when X·i and X·j are highly correlated, but not exactly the same? You may investigate this experimentally or theoretically.
(While Hastie et al. use β for the parameters, we’ll continue to use w.)
In this problem we’ll show that the level sets of the empirical risk are indeed ellipsoids centered
at the empirical risk minimizer ŵ.
Consider linear prediction functions of the form x ↦ wᵀx. Then the empirical risk for f(x) = wᵀx under the square loss is
$$\hat{R}_n(w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 = \frac{1}{n} (Xw - y)^T (Xw - y).$$
1. [Optional] Let ŵ = (XᵀX)⁻¹Xᵀy. Show that ŵ has empirical risk given by
$$\hat{R}_n(\hat{w}) = \frac{1}{n} \left( -y^T X \hat{w} + y^T y \right).$$
2. [Optional] Show that for any w we have
$$\hat{R}_n(w) = \frac{1}{n} (w - \hat{w})^T X^T X (w - \hat{w}) + \hat{R}_n(\hat{w}).$$
Note that the RHS (i.e. “right hand side”) has one term that’s quadratic in w and one term that’s independent of w. In particular, the RHS does not have any term that’s linear in w. On the LHS (i.e. “left hand side”), we have R̂n(w) = (1/n)(Xw − y)ᵀ(Xw − y). After expanding this out, you’ll have terms that are quadratic, linear, and constant in w. Completing the square is the tool for rearranging an expression to get rid of the linear terms. The following “completing the square” identity is easy to verify just by multiplying out the expressions on the RHS:
$$x^T M x - 2 b^T x = \left( x - M^{-1} b \right)^T M \left( x - M^{-1} b \right) - b^T M^{-1} b.$$
3. [Optional] Using the expression derived for R̂n(w) in 2, give a very short proof that ŵ = (XᵀX)⁻¹Xᵀy is the empirical risk minimizer; that is, show that R̂n(ŵ) ≤ R̂n(w) for all w ∈ Rᵈ.
4. [Optional] Give an expression for the set of w for which the empirical risk exceeds the minimum empirical risk R̂n(ŵ) by an amount c > 0. If X is full rank, then XᵀX is positive definite, and this set is an ellipsoid. What is its center?
$$(\hat{\theta}^+, \hat{\theta}^-) = \arg\min_{\theta^+, \theta^- \in \mathbb{R}^d} \sum_{i=1}^{m} \left( h_{\theta^+,\theta^-}(x_i) - y_i \right)^2 + \lambda \sum_{i=1}^{d} \theta_i^+ + \lambda \sum_{i=1}^{d} \theta_i^-$$
$$\text{such that } \theta^+ \ge 0 \text{ and } \theta^- \ge 0,$$
where h_{θ⁺,θ⁻}(x) = (θ⁺ − θ⁻)ᵀx. The original parameter θ can then be estimated as θ̂ = θ̂⁺ − θ̂⁻.
This is a convex optimization problem with a differentiable objective and linear inequality
constraints. We can approach this problem using projected stochastic gradient descent, as discussed
in lecture. Here, after taking our stochastic gradient step, we project the result back into the feasible
set by setting any negative components of θ+ and θ− to zero.
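A minimal sketch of this procedure might look like the following; the step size, epoch count, and the per-step share lam/n of the regularizer are all assumptions to be tuned.

import numpy as np

def projected_sgd(X, y, lam, step=1e-3, epochs=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    theta_pos = np.zeros(d)
    theta_neg = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Stochastic gradient of the split objective at example i; the
            # regularization term is spread evenly over the n per-example steps.
            resid = (theta_pos - theta_neg) @ X[i] - y[i]
            theta_pos -= step * (2 * resid * X[i] + lam / n)
            theta_neg -= step * (-2 * resid * X[i] + lam / n)
            # Project back into the feasible set: clip negatives to zero.
            np.maximum(theta_pos, 0.0, out=theta_pos)
            np.maximum(theta_neg, 0.0, out=theta_neg)
    return theta_pos - theta_neg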
1. [Optional] Implement projected SGD to solve the above optimization problem for the same
λ’s as used with the shooting algorithm. Since the two optimization algorithms should find
essentially the same solutions, you can check the algorithms against each other. Report the
differences in validation loss for each λ between the two optimization methods. (You can
make a table or plot the differences.)
2. [Optional] Choose the λ that gives the best performance on the validation set. Describe the
solution ŵ in term of its sparsity. How does the sparsity compare to the solution from the
shooting algorithm?