
CS771: Practice Set 2

Problem 1
(A More Fancy Version of?) Consider a classification model where we are given training data {xn, yn}_{n=1}^N from K classes. Each input xn ∈ R^D, and each class c is defined by two parameters, wc ∈ R^D and a D × D positive definite (PD) matrix Mc, for c = 1, 2, . . . , K. Assume Nc denotes the number of training examples from class c. Suppose we estimate wc and Mc by solving the following optimization problem

(ŵc, M̂c) = arg min_{wc, Mc}  Σ_{xn : yn = c} (1/Nc) (xn − wc)⊤ Mc (xn − wc) − log |Mc|

(note that, in the above objective, the log |Mc| term helps ensure positive definiteness of Mc, since the determinant of a PD matrix is always strictly positive)
For the given objective/loss function, find the optimal values of wc and Mc using first-order optimality (you may use standard results on derivatives of functions w.r.t. vectors and matrices from the Matrix Cookbook, https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf).
Also, what will this model reduce to as a special case when Mc is an identity matrix?
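
The following is a minimal NumPy sketch, not part of the original problem, that simply evaluates the objective above for one class on synthetic data, using the class mean as a convenient candidate for wc and the identity as a PD candidate for Mc. The data, seed, and candidate parameters are illustrative assumptions; deriving the actual minimizers is the exercise.

```python
# Illustrative only: evaluate the Problem 1 objective for one class c on
# synthetic data, for arbitrary candidate values of w_c and M_c.
import numpy as np

rng = np.random.default_rng(0)
D, N_c = 3, 50
X_c = rng.normal(size=(N_c, D))      # the x_n with y_n = c (synthetic)

w_c = X_c.mean(axis=0)               # a convenient candidate for w_c
M_c = np.eye(D)                      # an arbitrary PD candidate for M_c

diffs = X_c - w_c                    # rows are (x_n - w_c)
quad = np.einsum('ni,ij,nj->n', diffs, M_c, diffs)   # (x_n - w_c)^T M_c (x_n - w_c)
objective = quad.sum() / N_c - np.log(np.linalg.det(M_c))
print(objective)
```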

Problem 2
(Corrective Updates) Consider the weight vector update equation of the Perceptron algorithm for binary classification: w(t+1) = w(t) + yn xn. Assume yn ∈ {−1, +1}.
Prove that these updates are “corrective” in nature, i.e., if the current weight vector w(t) mispredicts (xn, yn) (i.e., if yn w(t)⊤ xn < 0), then after this update the new weight vector w(t+1) will mispredict this example to a “lesser extent” (i.e., yn w(t+1)⊤ xn will be less negative than yn w(t)⊤ xn).
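
Below is a small numerical illustration (not a proof, and not part of the original problem): it constructs a synthetic example that the current weight vector mispredicts, applies the stated update once, and checks that yn w⊤ xn becomes less negative.

```python
# Numerical illustration of Problem 2: one Perceptron update on a mispredicted
# example should make y_n * w^T x_n less negative. Synthetic data only.
import numpy as np

rng = np.random.default_rng(1)
x_n = rng.normal(size=5)
y_n = 1.0
w_t = -x_n + rng.normal(scale=0.1, size=5)   # chosen so that y_n * w_t^T x_n < 0

score_before = y_n * (w_t @ x_n)
assert score_before < 0                      # the example is currently mispredicted

w_next = w_t + y_n * x_n                     # the update from the problem statement
score_after = y_n * (w_next @ x_n)

print(score_before, score_after)
assert score_after > score_before            # the misprediction is less severe
```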

Problem 3
(Arbitrary Choice?) Formally, show that changing the condition yn (w⊤ xn + b) ≥ 1 in SVM to a different condition yn (w⊤ xn + b) ≥ m does not change the effective separating hyperplane that is learned by the SVM.
Assume the hard-margin SVM for simplicity.
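
As a numerical companion (assuming the cvxpy package is available; the data, seed, and the two margin values are illustrative), the sketch below solves the hard-margin problem with the right-hand side of the constraint set to m = 1 and to m = 5 and prints the normalized hyperplane parameters, which should coincide.

```python
# Numerical companion to Problem 3 (illustrative; assumes cvxpy is installed):
# solve the hard-margin SVM with the constraint right-hand side set to m = 1
# and m = 5, then compare the normalized hyperplanes.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[3, 3], size=(20, 2)),
               rng.normal(loc=[-3, -3], size=(20, 2))])   # linearly separable data
y = np.hstack([np.ones(20), -np.ones(20)])

def hard_margin_svm(m):
    w, b = cp.Variable(2), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= m]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

for m in (1.0, 5.0):
    w, b = hard_margin_svm(m)
    # The direction w/||w|| and offset b/||w|| define the separating hyperplane.
    print(m, w / np.linalg.norm(w), b / np.linalg.norm(w))
```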

Problem 4
(Recover the Bias) Assuming hard-margin SVM, show that, given the solution for the dual variables αn ’s, the
bias term b ∈ R can be computed as b = ys − ts where s can denote the index of any of the support vectors,
and ts is a term that requires computing a summation defined over all the support vectors. (Hint: Use KKT
conditions)
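
For readers who want to sanity-check their derivation afterwards, here is a sketch (assuming scikit-learn is available; the data and seed are synthetic) that fits a linear SVM with a very large C to approximate the hard-margin solution and recomputes b from one support vector using the standard dual-variable summation. Deriving that summation from the KKT conditions is the actual exercise.

```python
# Sanity check for Problem 4 (not a derivation). Fits a nearly hard-margin linear
# SVM (large C) with scikit-learn, then recomputes the bias from one support
# vector using the standard dual summation; compare against clf.intercept_.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=[3, 3], size=(20, 2)),
               rng.normal(loc=[-3, -3], size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates hard margin
sv = clf.support_                             # indices of the support vectors
s = sv[0]                                     # any support vector will do
# clf.dual_coef_ holds alpha_i * y_i for each support vector i.
t_s = (clf.dual_coef_ @ (X[sv] @ X[s])).item()
b = y[s] - t_s
print(b, clf.intercept_[0])                   # these should (approximately) agree
```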

Problem 5
(Look Ma, No Subgradients!) Show that we can rewrite regression with the absolute loss function |yn − w⊤ xn| as a reweighted least squares objective where the squared loss term for each example (xn, yn) is multiplied by an importance weight sn > 0. Write down the expression for sn, and briefly explain why this expression for sn makes intuitive sense. Given N examples {(xn, yn)}_{n=1}^N, briefly outline the steps of an optimization algorithm that estimates the unknowns (w and the importance weights {sn}_{n=1}^N) for this reweighted least squares problem.
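
As a scaffold (an assumption-laden sketch, not the solution), the skeleton below shows the general shape such an alternating procedure could take: fix the importance weights, solve the resulting weighted least-squares problem for w in closed form, then recompute the weights. The function importance_weight is a deliberately unimplemented placeholder for the expression sn that the problem asks you to derive.

```python
# A scaffold (not the solution) for Problem 5: alternate between fixing the
# importance weights s_n and solving the resulting weighted least-squares
# problem for w in closed form. importance_weight is a placeholder for the
# expression the problem asks you to derive.
import numpy as np

def importance_weight(residual):
    """Placeholder: return s_n as a function of the residual y_n - w^T x_n."""
    raise NotImplementedError

def reweighted_least_squares(X, y, n_iters=20):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        r = y - X @ w
        s = np.array([importance_weight(r_n) for r_n in r])   # all s_n > 0
        S = np.diag(s)
        # closed-form minimizer of sum_n s_n (y_n - w^T x_n)^2
        w = np.linalg.solve(X.T @ S @ X, X.T @ S @ y)
    return w
```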

Problem 6
(Linear Regression viewed as Nearest Neighbors) Show that, for the unregularized linear regression model, where the solution is ŵ = (X⊤X)⁻¹ X⊤y, the prediction at a test input x∗ can be written as a weighted sum of all the training responses, i.e.,

f(x∗) = Σ_{n=1}^{N} wn yn

Give the expression for the weights wn ’s in this case and briefly discuss (<50 words) in what way these weights
are different from the weights in a weighted version of K nearest neighbors where each wn typically is the
inverse distance of x∗ from the training input xn . Note: You do not need to give a very detailed expression for
wn (if it makes algebra messy) but you must give a precise meaning as to what wn depends on and how it is
different from the weights in the weighted K nearest neighbors.
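
The snippet below (synthetic data) numerically confirms that the prediction at x∗ really can be written as such a weighted sum of the training responses; the candidate weight vector h it uses is stated without derivation, and deriving and interpreting those weights is the exercise.

```python
# Numerical check for Problem 6 (synthetic data): the unregularized linear
# regression prediction at x_* equals a weighted sum of the training responses.
# The weight vector h below is stated, not derived.
import numpy as np

rng = np.random.default_rng(4)
N, D = 30, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
x_star = rng.normal(size=D)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
pred_direct = x_star @ w_hat

h = X @ np.linalg.solve(X.T @ X, x_star)     # one candidate weight per training example
pred_weighted = h @ y                        # sum_n w_n * y_n
print(pred_direct, pred_weighted)            # should agree up to numerical precision
```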

Problem 7
(Feature Masking as Regularization) Consider a linear regression model trained by minimizing the squared loss function Σ_{n=1}^{N} (yn − w⊤ xn)². Suppose we decide to mask out or “drop” each feature xnd of each input xn ∈ R^D,
independently, with probability 1 − p (equivalently, retaining the feature with probability p). Masking or drop-
ping out basically means that we will set the feature xnd to 0 with probability 1 − p. Essentially, it would be
equivalent to replacing each input xn by x̃n = xn ◦ mn , where ◦ denotes elementwise product and mn denotes
the D × 1 binary mask vector with mnd ∼ Bernoulli(p) (mnd = 1 means the feature xnd was retained; mnd = 0
means the feature xnd was masked/zeroed).
Let us now define a new loss function using these masked inputs as follows: Σ_{n=1}^{N} (yn − w⊤ x̃n)². Show that
minimizing the expected value of this new loss function (where the expectation is used since the mask vectors
mn are random) is equivalent to minimizing a regularized loss function. Clearly write down the expression of
this regularized loss function. Note that showing this would require some standard results related to expectation
of random variables, such as linearity of expectation, and expectation and variance of a Bernoulli random
variable. Note that, so far in the course, we haven’t talked much about probability ideas but, with this much
information, you should be able to attempt this problem.
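
To make the setup concrete, here is a small Monte Carlo sketch (synthetic data and a fixed, arbitrary w; purely illustrative, not the requested derivation) that estimates the expected value of the new masked loss by averaging it over many sampled Bernoulli(p) mask vectors.

```python
# Monte Carlo illustration of the Problem 7 setup (synthetic data, fixed w):
# estimate the expected masked loss by averaging over sampled Bernoulli(p) masks.
import numpy as np

rng = np.random.default_rng(5)
N, D, p = 50, 10, 0.8
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

n_samples = 5000
losses = np.empty(n_samples)
for t in range(n_samples):
    M = rng.binomial(1, p, size=(N, D))   # one mask vector m_n per input x_n
    X_tilde = X * M                       # x~_n = x_n o m_n (elementwise product)
    losses[t] = np.sum((y - X_tilde @ w) ** 2)

print(losses.mean())                      # estimate of the expected new loss at this w
```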
