553.740 Project 2: Optimization
Fall 2020
Due on Wednesday, October 21.
• The solution must be written up with sufficient explanations, and your results must be commented on, as if you were writing a report or a paper. Returning a series of numerical results and figures is not enough. A solution in the form of raw program output, or of a Jupyter notebook output, is not acceptable either.
• You must provide your program sources. They must be added as an appendix, separate from your answers to the questions. They will not be graded (so no direct credit for them), but they will be useful in order to understand why results are incorrect (and to decide whether partial credit can be given) and to ensure that your work is original. You may use any programming language, although Python is strongly recommended.
• Data files. The “.csv” files associated with this project are available on Blackboard. The columns are formatted as (Index, X1, . . . , Xd, Y) and each row corresponds to a sample. Here, X1, . . . , Xd are the d coordinates of the input variable (in this homework, d = 1 or 2), and Y is the output variable (Index is just the line number).
Question 1.
Let X be a real-valued random variable modeled with a normal distribution N(m, σ²). Fixing a positive number ρ, we propose to estimate m and σ², based on a sample (x1, . . . , xN), by maximizing the penalized log-likelihood
$$\sum_{k=1}^{N} \log \varphi_{m,\sigma^2}(x_k) \;-\; N\rho\,\frac{|m|}{\sigma}$$
where $\varphi_{m,\sigma^2}$ is the p.d.f. of N(m, σ²).
We will assume in the following that the observed samples are not all equal to the same constant value.
(1.1) Let α = m/σ and β = 1/σ. Let also µ1 and µ2 denote the first- and second-order empirical moments:
$$\mu_1 = \frac{1}{N}(x_1 + \cdots + x_N), \qquad \mu_2 = \frac{1}{N}(x_1^2 + \cdots + x_N^2).$$
Prove that this maximization problem is equivalent to minimizing (introducing an auxiliary variable ξ)
$$F(\alpha, \beta, \xi) = \frac{1}{2}\alpha^2 - \mu_1\,\alpha\beta + \frac{1}{2}\mu_2\,\beta^2 - \log\beta + \rho\,\xi$$
subject to the constraints ξ − α ≥ 0 and ξ + α ≥ 0.
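A standard rewriting that may help with (1.1) (this step is not part of the problem statement): the Gaussian log-density can be expressed in the variables α and β as
$$\log \varphi_{m,\sigma^2}(x) = \log\beta - \tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(\beta x - \alpha)^2,$$
so that the sample average of the log-likelihood only involves µ1, µ2, α and β; the auxiliary variable ξ then encodes |α| through the two linear constraints.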
(1.2) Prove that F is a convex function over R × (0, +∞) × R and that any feasible point for this constrained minimization problem is regular.
(1.3) Prove that the KKT conditions for this problem are:
$$\begin{aligned}
\alpha - \mu_1\beta &= \lambda_2 - \lambda_1\\
-\mu_1\alpha + \mu_2\beta - \frac{1}{\beta} &= 0\\
\rho - \lambda_1 - \lambda_2 &= 0\\
\lambda_1,\ \lambda_2 &\ge 0\\
\lambda_1(\alpha - \xi) &= 0\\
\lambda_2(\alpha + \xi) &= 0
\end{aligned}$$
(1.4) Find a necessary and sufficient condition on µ1 , µ2 and ρ for a solution to this system to satisfy α = 0,
and provide the optimal β in that case.
(1.5) Assume that the previous condition is not satisfied. (a) Prove that in that case, α has the same sign
as µ1 . (b) Completely describe the solution of the system, separating the cases µ1 > 0 and µ1 < 0.
(1.6) Summarize this discussion and prove one of the two following statements (with extra credit if you prove both, i.e., prove that they are equivalent). (a) The optimal parameters are
$$\hat\sigma = \min\left(\sqrt{\mu_2},\; \frac{2s^2}{\sqrt{\rho^2\mu_1^2 + 4s^2} - \rho|\mu_1|}\right), \qquad \hat m = \operatorname{sign}(\mu_1)\,\max\bigl(0,\; |\mu_1| - \rho\hat\sigma\bigr),$$
with $s^2 = \mu_2 - \mu_1^2$.
(b) The optimal parameters can be computed as follows.
• If $\mu_1^2 \le \rho^2\mu_2$: $\hat m = 0$, $\hat\sigma = \sqrt{\mu_2}$.
• If $\mu_1^2 \ge \rho^2\mu_2$:
$$\hat\sigma = \frac{2s^2}{\sqrt{\rho^2\mu_1^2 + 4s^2} - \rho|\mu_1|}, \qquad \hat m = \mu_1 - \operatorname{sign}(\mu_1)\,\rho\hat\sigma.$$
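For later reference in the programming questions, statement (b) translates directly into code. The following is a minimal Python sketch (the function name is ours, not prescribed by the assignment):

```python
import numpy as np

def penalized_gaussian_mle(x, rho):
    """Closed-form solution of (1.6)(b): penalized MLE of (m, sigma)
    for a 1-D sample x (assumed not constant) and penalty rho > 0."""
    x = np.asarray(x, dtype=float)
    mu1 = x.mean()
    mu2 = np.mean(x ** 2)
    s2 = mu2 - mu1 ** 2                      # s^2 = mu2 - mu1^2
    if mu1 ** 2 <= rho ** 2 * mu2:
        # Penalty strong enough: the mean is shrunk exactly to zero.
        return 0.0, np.sqrt(mu2)
    sigma = 2 * s2 / (np.sqrt(rho ** 2 * mu1 ** 2 + 4 * s2) - rho * abs(mu1))
    m = mu1 - np.sign(mu1) * rho * sigma
    return m, sigma
```

A useful sanity check is to compare this output with a generic solver (for instance, scipy.optimize.minimize applied to F under the constraints of (1.1)).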
Question 2.
We now consider a second random variable, Y ∈ {0, 1}, and assume that, conditionally on Y = y, X ∼ N(m_y, σ²) for some m0, m1 and σ². Based on samples (x1, . . . , xN) and (y1, . . . , yN), we want to maximize the penalized log-likelihood
$$\sum_{k=1}^{N} \log \varphi_{m_{y_k},\sigma^2}(x_k) \;-\; N\rho\,\frac{|m_1 - m_0|}{\sigma}.$$
(2.1) Let N0 and N1 denote the numbers of samples with yk = 0 and yk = 1, and let µ1,0 and µ1,1 denote the corresponding class empirical means. Let m = (N0 m0 + N1 m1)/N, α = (m1 − m0)/σ and β = 1/σ. Express the problem in terms of these new parameters and prove that, denoting the optimal solution by m̂, α̂, β̂:
1. m̂ = µ1 ;
2. α̂ and β̂ minimize, with respect to α and β,
$$\frac{s^2\beta^2}{2} \;-\; \frac{N_0 N_1}{N^2}(\mu_{1,1} - \mu_{1,0})\,\alpha\beta \;+\; \frac{N_0 N_1}{2N^2}\,\alpha^2 \;-\; \log\beta \;+\; \rho\,|\alpha|,$$
where $s^2 = \mu_2 - \mu_1^2$ as in question 1.
(2.2) Adapt the solution found in question 1 to show one of the following statements (with extra credit if you show both).
(a) The optimal solution of the penalized likelihood problem is
$$\hat\sigma = \min\left(s,\; \frac{2\bigl(s^2 - N_0 N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)}{\sqrt{\rho^2(\mu_{1,1} - \mu_{1,0})^2 + 4\bigl(s^2 - N_0 N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)} - \rho|\mu_{1,1} - \mu_{1,0}|}\right)$$
$$\hat m_0 = \mu_1 - \frac{N_1}{N}\,\operatorname{sign}(\mu_{1,1} - \mu_{1,0})\,\max\left(0,\; |\mu_{1,1} - \mu_{1,0}| - \frac{N^2}{N_0 N_1}\,\rho\hat\sigma\right)$$
$$\hat m_1 = \mu_1 + \frac{N_0}{N}\,\operatorname{sign}(\mu_{1,1} - \mu_{1,0})\,\max\left(0,\; |\mu_{1,1} - \mu_{1,0}| - \frac{N^2}{N_0 N_1}\,\rho\hat\sigma\right)$$
(b) The optimal solution of the penalized likelihood problem is computed as follows.
• If $N_0 N_1 |\mu_{1,1} - \mu_{1,0}| \le N^2 \rho s$: $\hat m_0 = \hat m_1 = \mu_1$, $\hat\sigma = s$.
• If $N_0 N_1 |\mu_{1,1} - \mu_{1,0}| > N^2 \rho s$:
$$\hat\sigma = \frac{2\bigl(s^2 - N_0 N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)}{\sqrt{\rho^2(\mu_{1,1} - \mu_{1,0})^2 + 4\bigl(s^2 - N_0 N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)} - \rho|\mu_{1,1} - \mu_{1,0}|}$$
$$\hat m_0 = \mu_{1,0} + \operatorname{sign}(\mu_{1,1} - \mu_{1,0})\,\rho\hat\sigma\,\frac{N}{N_0}, \qquad \hat m_1 = \mu_{1,1} - \operatorname{sign}(\mu_{1,1} - \mu_{1,0})\,\rho\hat\sigma\,\frac{N}{N_1}.$$
(Note that the plus sign in the formula for m̂0 is what makes the weighted mean constraint m̂ = µ1 of question (2.1) hold.)
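As with question 1, the closed form in (b) is straightforward to implement. The sketch below is a minimal version assuming the sign conventions written above (the helper name is ours; check it against your own derivation):

```python
import numpy as np

def two_class_penalized_mle(x, y, rho):
    """Closed-form solution of (2.2)(b) for 1-D data x with binary
    labels y. Returns (m0_hat, m1_hat, sigma_hat)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    N = len(x)
    N0, N1 = np.sum(y == 0), np.sum(y == 1)
    mu10, mu11 = x[y == 0].mean(), x[y == 1].mean()
    mu1 = x.mean()
    s = x.std()                              # s^2 = mu2 - mu1^2
    delta = mu11 - mu10
    if N0 * N1 * abs(delta) <= N ** 2 * rho * s:
        # The penalty collapses the two class means onto mu1.
        return mu1, mu1, s
    w2 = s ** 2 - N0 * N1 * delta ** 2 / N ** 2   # within-class variance
    sigma = (2 * w2
             / (np.sqrt(rho ** 2 * delta ** 2 + 4 * w2) - rho * abs(delta)))
    m0 = mu10 + np.sign(delta) * rho * sigma * N / N0
    m1 = mu11 - np.sign(delta) * rho * sigma * N / N1
    return m0, m1, sigma
```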
Question 3.
We now assume that the random variable X is multi-dimensional and that, conditionally on Y, it follows a normal distribution N(m_y, Σ) with m0, m1 ∈ R^d and Σ a diagonal matrix. We will denote by m_y(i) the ith coefficient of m_y (y = 0, 1) and by σ²(i) the ith diagonal entry of Σ.
(3.1) Use the previous question to describe the maximizers of the penalized likelihood
$$\sum_{k=1}^{N} \log \varphi_{m_{y_k},\Sigma}(x_k) \;-\; N\rho \sum_{i=1}^{d} \frac{|m_0(i) - m_1(i)|}{\sigma(i)}.$$
(3.2) Write a program for training that takes as input an N by d array X, an N-dimensional binary vector Y and a value of ρ, and that returns m0, m1 and σ as vectors; a possible implementation is sketched after this question.
(a) Test this program on the dataset “OPT FALL2020 train1.csv” with ρ = 0.1 and return the number of coordinates such that m0(i) ≠ m1(i).
(b) Same question, with ρ = 0.15.
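One possible shape for this program applies the one-dimensional estimator two_class_penalized_mle sketched above coordinate by coordinate. The CSV-reading details below (underscores in the file name, one header row) are assumptions to adapt to the actual file:

```python
import numpy as np

def train(X, Y, rho):
    """Penalized MLE training. X: N-by-d array, Y: binary N-vector.
    Returns (m0, m1, sigma), each a d-vector, obtained by applying
    the closed form of question (2.2) to each coordinate."""
    d = X.shape[1]
    m0, m1, sigma = np.empty(d), np.empty(d), np.empty(d)
    for i in range(d):
        m0[i], m1[i], sigma[i] = two_class_penalized_mle(X[:, i], Y, rho)
    return m0, m1, sigma

# Example usage, assuming the column layout (Index, X1, ..., Xd, Y)
# and a one-line header in the csv file:
data = np.loadtxt("OPT_FALL2020_train1.csv", delimiter=",", skiprows=1)
X, Y = data[:, 1:-1], data[:, -1].astype(int)
m0, m1, sigma = train(X, Y, rho=0.1)
print("coordinates with m0(i) != m1(i):", int(np.sum(m0 != m1)))
```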
(3.3) Write a program for prediction that takes as input vectors m0, m1 and σ (such as those returned by the previous training program) and an M by d array X of test data, and that returns an M-dimensional vector Y such that, for k = 1, . . . , M: Y(k) = 0 if
$$\sum_{j=1}^{d} \frac{(X(k,j) - m_0(j))^2}{\sigma^2(j)} \;<\; \sum_{j=1}^{d} \frac{(X(k,j) - m_1(j))^2}{\sigma^2(j)}$$
and Y(k) = 1 otherwise. Let E(ρ) denote the resulting misclassification rate on the test data, i.e., the fraction of indices k such that Y(k) ≠ Y′(k), where Y contains the predicted classes and Y′ the true ones.
Provide, in your answer,
• The numerical values of E(0.1) and E(0.15).
• The values of ρ for which E is minimal.
• Two plots: of ν(ρ) vs. ρ and of E(ρ) vs. ρ, where ν(ρ) denotes the number of coordinates such that m̂0(i) ≠ m̂1(i) (as in question (3.2)).
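The prediction step and the error computation can be sketched as follows (a minimal version; the names predict and error_rate are ours):

```python
import numpy as np

def predict(m0, m1, sigma, X):
    """Decision rule of question (3.3): class 0 iff the sigma-weighted
    squared distance to m0 is smaller than the distance to m1."""
    d0 = np.sum((X - m0) ** 2 / sigma ** 2, axis=1)
    d1 = np.sum((X - m1) ** 2 / sigma ** 2, axis=1)
    return np.where(d0 < d1, 0, 1)

def error_rate(Y_pred, Y_true):
    """E(rho): fraction of test samples with Y(k) != Y'(k)."""
    return float(np.mean(Y_pred != Y_true))
```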
(3.4) Instead of using a closed form for the solution of the problem considered in this question, we consider here an iterative approach using proximal gradient descent. Let α0 = (m0(i)/σ(i), i = 1, . . . , d), α1 = (m1(i)/σ(i), i = 1, . . . , d) and β = (1/σ(i), i = 1, . . . , d) be d-dimensional vectors.
(a) Rewrite the penalized likelihood maximization problem considered in this question as an equivalent minimization problem for a function taking the form
$$f(\alpha_0, \alpha_1, \beta) \;+\; \rho \sum_{i=1}^{d} |\alpha_1(i) - \alpha_0(i)|,$$
where f is smooth and convex.
(b) Give the expression of the proximal operator associated with the nonsmooth part of this function, for a step size γ > 0.
(c) Give the expression of ∇f (α0 , α1 , β) for the function f obtained in (a).
(d) Write a new version of the training program that takes the same input X, Y and ρ and returns the optimal m0, m1, σ obtained after stabilization of a proximal gradient algorithm with step size γ, where stabilization is defined as the maximal change in the optimized variables during one iteration being less than τ = 10−6.
Run this program on the dataset “OPT FALL2020 train1.csv” using ρ = 0.1 and γ = 0.1. Provide in your answer the maximum absolute difference, over all estimated variables, between the result obtained after convergence and the exact solution obtained in question (3.1).
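A compact sketch of one possible proximal gradient implementation is given below. It assumes the minimization problem of (a) is written with the sample-averaged negative log-likelihood as the smooth part f, and ρ Σᵢ |α1(i) − α0(i)| as the nonsmooth part; the prox step then soft-thresholds the difference α1 − α0 while preserving the sum α0 + α1, which should be checked against your own answer to (b):

```python
import numpy as np

def train_prox(X, Y, rho, gamma=0.1, tau=1e-6):
    """Proximal gradient training for question (3.4)(d).
    Smooth part, per coordinate i and averaged over samples:
        f = sum_i [ (1/2N) sum_k (beta(i) X(k,i) - alpha_{y_k}(i))^2
                    - log beta(i) ]
    Nonsmooth part: rho * sum_i |alpha1(i) - alpha0(i)|."""
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    X0, X1 = X[Y == 0], X[Y == 1]
    # Initialize at the unpenalized (rho = 0) estimates.
    beta = 1.0 / X.std(axis=0)
    a0, a1 = beta * X0.mean(axis=0), beta * X1.mean(axis=0)
    while True:
        # Gradient of the smooth part f.
        r0, r1 = beta * X0 - a0, beta * X1 - a1      # residuals per class
        g0, g1 = -r0.sum(axis=0) / N, -r1.sum(axis=0) / N
        gb = ((X0 * r0).sum(axis=0) + (X1 * r1).sum(axis=0)) / N - 1.0 / beta
        # Gradient step (beta kept positive), then prox on (a0, a1):
        # keep the mean of (a0, a1), soft-threshold their half-difference.
        nb = np.maximum(beta - gamma * gb, 1e-12)
        z0, z1 = a0 - gamma * g0, a1 - gamma * g1
        mid, half = 0.5 * (z0 + z1), 0.5 * (z1 - z0)
        half = np.sign(half) * np.maximum(np.abs(half) - gamma * rho, 0.0)
        n0, n1 = mid - half, mid + half
        change = max(np.abs(n0 - a0).max(), np.abs(n1 - a1).max(),
                     np.abs(nb - beta).max())
        a0, a1, beta = n0, n1, nb
        if change < tau:
            break
    sigma = 1.0 / beta
    return a0 * sigma, a1 * sigma, sigma
```

The estimated parameters returned by train_prox can then be compared entry by entry with the closed-form output of the training program of (3.2) to produce the requested maximum absolute difference.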