553.740 Project 2; Optimization.

Fall 2020
Due on Wednesday October 21.

Please read the directions below carefully.


• You must type your answers (using, e.g., LaTeX or MS Word). A 5% penalty will be applied
otherwise.

• The solution must be written with sufficient explanations and your results must be commented,
as if you were writing a report or a paper. Returning a series of numerical results and figures is not
enough. A solution in the form of raw program output, or of a Jupyter notebook output, is not
acceptable either.
• You must provide your program sources. They must be added as an appendix, separate from your
answers to the questions. They will not be graded (so no direct credit for them), but they will be useful in
order to understand why results are not correct (and decide whether partial credit can be given) and to
ensure that your work is original. You may use any programming language, although Python is strongly
recommended.
• Data files. The “.csv” files associated with this project are available on Blackboard. The columns are
formatted as (Index, X1, . . . , Xd, Y) and each row corresponds to a sample. Here, X1, . . . , Xd are the
d coordinates of the input variable (in this homework, d = 1 or 2), and Y is the output variable (Index
is just the line number).

Question 1.
Let X be a real-valued random variable modeled with a normal distribution N(m, σ²). Fixing a positive
number ρ, we propose to estimate m and σ², based on a sample (x1, . . . , xN), by maximizing the penalized
log-likelihood
\[
\sum_{k=1}^{N} \log \varphi_{m,\sigma^2}(x_k) \;-\; N\rho\,\frac{|m|}{\sigma}
\]
where ϕm,σ² is the p.d.f. of N(m, σ²).
We will assume in the following that the observed samples are not all equal to the same constant value.

(1.1) Let α = m/σ and β = 1/σ. Let also µ1 and µ2 denote the first- and second-order empirical moments:
\[
\mu_1 = \frac{1}{N}(x_1 + \cdots + x_N), \qquad
\mu_2 = \frac{1}{N}(x_1^2 + \cdots + x_N^2).
\]
Prove that this maximization problem is equivalent to minimizing (introducing an auxiliary variable ξ)
\[
F(\alpha, \beta, \xi) = \frac{1}{2}\alpha^2 - \mu_1\alpha\beta + \frac{1}{2}\mu_2\beta^2 - \log\beta + \rho\xi
\]
subject to the constraints ξ − α ≥ 0 and ξ + α ≥ 0.
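For concreteness, here is a minimal Python sketch (names illustrative) that computes the empirical moments and evaluates F; it only illustrates the objective, and the constraints ξ − α ≥ 0 and ξ + α ≥ 0 are not enforced:

    import numpy as np

    def objective_F(alpha, beta, xi, x, rho):
        # Evaluate F(alpha, beta, xi) from (1.1) for a 1-D sample x.
        # Assumes beta > 0; the constraints on xi are not checked here.
        x = np.asarray(x, dtype=float)
        mu1 = x.mean()
        mu2 = np.mean(x ** 2)
        return (0.5 * alpha ** 2 - mu1 * alpha * beta
                + 0.5 * mu2 * beta ** 2 - np.log(beta) + rho * xi)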

(1.2) Prove that F is a convex function over R × (0, +∞) × R and that any feasible point for this
constrained minimization problem is regular.

(1.3) Prove that the KKT conditions for this problem are:
\[
\begin{cases}
\alpha - \mu_1\beta = \lambda_2 - \lambda_1 \\[2pt]
-\mu_1\alpha + \mu_2\beta - \dfrac{1}{\beta} = 0 \\[4pt]
\rho - \lambda_1 - \lambda_2 = 0 \\[2pt]
\lambda_1, \lambda_2 \geq 0 \\[2pt]
\lambda_1(\alpha - \xi) = 0 \\[2pt]
\lambda_2(\alpha + \xi) = 0
\end{cases}
\]
where λ1, λ2 are Lagrange multipliers.

(1.4) Find a necessary and sufficient condition on µ1 , µ2 and ρ for a solution to this system to satisfy α = 0,
and provide the optimal β in that case.

(1.5) Assume that the previous condition is not satisfied. (a) Prove that in that case, α has the same sign
as µ1 . (b) Completely describe the solution of the system, separating the cases µ1 > 0 and µ1 < 0.

(1.6) Summarize this discussion and prove one of the two following statements (with extra credit if you
prove both, i.e., prove that they are equivalent).
(a) The optimal parameters are
\[
\hat m = \mathrm{sign}(\mu_1)\,\max\bigl(0,\ |\mu_1| - \rho\hat\sigma\bigr),
\qquad
\hat\sigma = \min\!\left(\sqrt{\mu_2},\ \frac{2s^2}{\sqrt{\rho^2\mu_1^2 + 4s^2} - \rho|\mu_1|}\right)
\]
with s² = µ2 − µ1².
(b) The optimal parameters can be computed as follows.

• If µ1² ≤ ρ²µ2: m̂ = 0, σ̂ = √µ2.
• If µ1² ≥ ρ²µ2:
\[
\hat m = \mu_1 - \mathrm{sign}(\mu_1)\,\rho\hat\sigma,
\qquad
\hat\sigma = \frac{2s^2}{\sqrt{\rho^2\mu_1^2 + 4s^2} - \rho|\mu_1|}.
\]
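For reference, a minimal Python sketch of the closed form stated in (b); the name penalized_normal_mle is illustrative, and the code assumes the samples are not all equal (so s² > 0):

    import numpy as np

    def penalized_normal_mle(x, rho):
        # Closed-form estimates (m_hat, sigma_hat) following statement (b).
        x = np.asarray(x, dtype=float)
        mu1 = x.mean()
        mu2 = np.mean(x ** 2)
        s2 = mu2 - mu1 ** 2
        if mu1 ** 2 <= rho ** 2 * mu2:
            # The penalty is strong enough to shrink the mean to zero.
            return 0.0, np.sqrt(mu2)
        sigma_hat = 2 * s2 / (np.sqrt(rho ** 2 * mu1 ** 2 + 4 * s2) - rho * abs(mu1))
        m_hat = mu1 - np.sign(mu1) * rho * sigma_hat
        return m_hat, sigma_hat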

(1.7) What happens if ρ ≥ 1?

Question 2.
We now consider a second random variable, Y ∈ {0, 1}, and assume that, conditionally on Y = y,
X ∼ N(my, σ²) for some m0, m1 and σ². Based on samples (x1, . . . , xN) and (y1, . . . , yN), we want to
maximize the penalized log-likelihood
\[
\sum_{k=1}^{N} \log \varphi_{m_{y_k},\sigma^2}(x_k) \;-\; N\rho\,\frac{|m_1 - m_0|}{\sigma}
\]
with respect to m0, m1 and σ².


Let N0 and N1 denote the numbers of observations with Y = 0 and Y = 1, respectively, and define
\[
\mu_{1,0} = \frac{1}{N_0}\sum_{k=1}^{N} x_k \mathbf{1}_{y_k=0}, \qquad
\mu_{1,1} = \frac{1}{N_1}\sum_{k=1}^{N} x_k \mathbf{1}_{y_k=1},
\]
\[
\mu_1 = \frac{1}{N}\bigl(N_0\,\mu_{1,0} + N_1\,\mu_{1,1}\bigr), \qquad
\mu_2 = \frac{1}{N}\sum_{k=1}^{N} x_k^2, \qquad
s^2 = \mu_2 - \mu_1^2.
\]

(2.1) Let m = (N0m0 + N1m1)/N, α = (m1 − m0)/σ and β = 1/σ. Express the problem in terms of these
new parameters and prove, denoting the optimal solution by m̂, α̂, β̂, that
1. m̂ = µ1;
2. α̂ and β̂ minimize
\[
\frac{s^2\beta^2}{2} - \frac{N_0N_1}{N^2}(\mu_{1,1} - \mu_{1,0})\alpha\beta + \frac{N_0N_1}{2N^2}\alpha^2 - \log\beta + \rho|\alpha|
\]
with respect to α and β.

(2.2) Adapt the solution found in question 1 to show one of the following statements (with extra credit if
you show both).
(a) The optimal solution of the penalized likelihood problem is
\[
\hat\sigma = \min\!\left(s,\
\frac{2\bigl(s^2 - N_0N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)}
{\sqrt{\rho^2(\mu_{1,1} - \mu_{1,0})^2 + 4\bigl(s^2 - N_0N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)} - \rho|\mu_{1,1} - \mu_{1,0}|}\right)
\]
\[
\hat m_0 = \mu_1 - \frac{N_1}{N}\,\mathrm{sign}(\mu_{1,1} - \mu_{1,0})\,
\max\!\left(0,\ |\mu_{1,1} - \mu_{1,0}| - \frac{N^2}{N_0N_1}\,\rho\hat\sigma\right)
\]
\[
\hat m_1 = \mu_1 + \frac{N_0}{N}\,\mathrm{sign}(\mu_{1,1} - \mu_{1,0})\,
\max\!\left(0,\ |\mu_{1,1} - \mu_{1,0}| - \frac{N^2}{N_0N_1}\,\rho\hat\sigma\right)
\]

(b) The optimal solution of the penalized likelihood problem is computed as follows.

• If N0N1|µ1,1 − µ1,0| ≤ N²ρs: m̂0 = m̂1 = µ1, σ̂ = s.
• If N0N1|µ1,1 − µ1,0| ≥ N²ρs:
\[
\hat\sigma = \frac{2\bigl(s^2 - N_0N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)}
{\sqrt{\rho^2(\mu_{1,1} - \mu_{1,0})^2 + 4\bigl(s^2 - N_0N_1(\mu_{1,1} - \mu_{1,0})^2/N^2\bigr)} - \rho|\mu_{1,1} - \mu_{1,0}|}
\]
\[
\hat m_0 = \mu_{1,0} + \frac{N}{N_0}\,\mathrm{sign}(\mu_{1,1} - \mu_{1,0})\,\rho\hat\sigma,
\qquad
\hat m_1 = \mu_{1,1} - \frac{N}{N_1}\,\mathrm{sign}(\mu_{1,1} - \mu_{1,0})\,\rho\hat\sigma.
\]
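A corresponding Python sketch of the closed form in statement (b), with illustrative names; it assumes both classes are present in the sample:

    import numpy as np

    def penalized_two_class_mle(x, y, rho):
        # Closed-form estimates (m0_hat, m1_hat, sigma_hat) following statement (b),
        # for a 1-D sample x with binary labels y in {0, 1}.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y)
        N = len(x)
        N0, N1 = np.sum(y == 0), np.sum(y == 1)
        mu10, mu11 = x[y == 0].mean(), x[y == 1].mean()
        mu1 = (N0 * mu10 + N1 * mu11) / N
        mu2 = np.mean(x ** 2)
        s2 = mu2 - mu1 ** 2
        s = np.sqrt(s2)
        delta = mu11 - mu10
        if N0 * N1 * abs(delta) <= N ** 2 * rho * s:
            # The penalty is strong enough to merge the two class means.
            return mu1, mu1, s
        w2 = s2 - N0 * N1 * delta ** 2 / N ** 2   # within-class variance
        sigma_hat = 2 * w2 / (np.sqrt(rho ** 2 * delta ** 2 + 4 * w2) - rho * abs(delta))
        m0_hat = mu10 + (N / N0) * np.sign(delta) * rho * sigma_hat
        m1_hat = mu11 - (N / N1) * np.sign(delta) * rho * sigma_hat
        return m0_hat, m1_hat, sigma_hat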

Question 3.
We now assume that the random variable X is multi-dimensional and that, conditionally on Y, it follows
a normal distribution N(my, Σ) with m0, m1 ∈ Rd and Σ a diagonal matrix. We will denote by my(i) the
ith coefficient of my (y = 0, 1) and by σ²(i) the ith diagonal entry of Σ.

(3.1) Use the previous question to describe the maximizers of the penalized likelihood
\[
\sum_{k=1}^{N} \log \Phi_{m_{y_k},\Sigma}(x_k) \;-\; N\rho \sum_{i=1}^{d} \frac{|m_0(i) - m_1(i)|}{\sigma(i)}
\]
where Φm,Σ is the p.d.f. of the multivariate Gaussian N(m, Σ).

(3.2) Write a program for training that takes as input an N by d array X, an N -dimensional binary vector
Y and a value of ρ, and that returns m0 , m1 and σ as vectors.
(a) Test this program on the dataset “OPT FALL2020 train1.csv” with ρ = 0.1 and return the number of
coordinates such that m0(i) ≠ m1(i).
(b) Same question, with ρ = 0.15.
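Since Σ is diagonal, the penalized likelihood of (3.1) separates across coordinates, so one possible training routine simply applies the one-dimensional closed form of question (2.2) to each column. The sketch below reuses the penalized_two_class_mle helper from above; the name train_model is illustrative:

    import numpy as np

    def train_model(X, Y, rho):
        # Apply the one-dimensional closed form of (2.2) to each coordinate.
        # Returns (m0, m1, sigma) as d-dimensional vectors.
        X = np.asarray(X, dtype=float)
        d = X.shape[1]
        m0, m1, sigma = np.zeros(d), np.zeros(d), np.zeros(d)
        for i in range(d):
            m0[i], m1[i], sigma[i] = penalized_two_class_mle(X[:, i], Y, rho)
        return m0, m1, sigma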

(3.3) Write a program for prediction that takes as input vectors m0 , m1 and σ (such as those returned by
the previous training program) and an M by d array X of test data and that returns an M -dimensional
vector Y such that, for k = 1, . . . , M : Y(k) = 0 if
\[
\sum_{j=1}^{d} \frac{(X(k,j) - m_0(j))^2}{\sigma^2(j)} \;<\; \sum_{j=1}^{d} \frac{(X(k,j) - m_1(j))^2}{\sigma^2(j)}
\]

and Y(k) = 1 otherwise.
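A direct implementation of this decision rule could look as follows (the name predict is illustrative):

    import numpy as np

    def predict(m0, m1, sigma, X):
        # Assign class 0 when the variance-weighted squared distance to m0
        # is smaller than the distance to m1, and class 1 otherwise.
        X = np.asarray(X, dtype=float)
        d0 = np.sum((X - m0) ** 2 / sigma ** 2, axis=1)
        d1 = np.sum((X - m1) ** 2 / sigma ** 2, axis=1)
        return np.where(d0 < d1, 0, 1)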


For ρ = 0.01, 0.02, 0.03, . . . , 0.34, 0.35, execute the following operations:
(1) Run the training program of question (3.2) to estimate m0, m1, σ. Compute ν(ρ), the number of
coordinates such that m0(i) ≠ m1(i).
(2) Using the values of m0, m1 and σ obtained in step (1), apply the classification program to the
dataset provided in the file “OPT FALL2020 test1.csv.”
Compute the balanced error rate E(ρ), defined as the average between the fraction of wrong answers
among test data with Y = 0 and that among test data with Y = 1, formally defined by
\[
E = \frac{1}{2}\sum_{y\in\{0,1\}} \frac{\#\{k :\ \mathbf{Y}(k) = y,\ \mathbf{Y}'(k) \neq y\}}{\#\{k :\ \mathbf{Y}(k) = y\}}
\]
where Y′ contains the predicted classes and Y the true ones.
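The balanced error rate can be computed directly from this definition, for example as in the following sketch:

    import numpy as np

    def balanced_error_rate(y_true, y_pred):
        # Average of the two per-class error fractions.
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        return 0.5 * sum(np.mean(y_pred[y_true == y] != y) for y in (0, 1))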

Provide, in your answer,
• The numerical values of E(0.1) and E(0.15).
• The values of ρ for which E is minimal.
• Two plots: of ν(ρ) vs. ρ and of E(ρ) vs. ρ.
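Putting the pieces together, the sweep over ρ described in steps (1)–(2) might be organized as in the sketch below. It reuses the train_model, predict and balanced_error_rate helpers defined above, and assumes the CSV files have a header row and the (Index, X1, . . . , Xd, Y) column layout described in the directions; the exact file paths may need to be adjusted:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Load training and test data (column layout: Index, X1, ..., Xd, Y).
    train_df = pd.read_csv("OPT FALL2020 train1.csv")
    test_df = pd.read_csv("OPT FALL2020 test1.csv")
    Xtr, Ytr = train_df.iloc[:, 1:-1].to_numpy(), train_df.iloc[:, -1].to_numpy()
    Xte, Yte = test_df.iloc[:, 1:-1].to_numpy(), test_df.iloc[:, -1].to_numpy()

    rhos = np.round(np.arange(0.01, 0.36, 0.01), 2)
    nus, errs = [], []
    for rho in rhos:
        m0, m1, sigma = train_model(Xtr, Ytr, rho)
        nus.append(int(np.sum(m0 != m1)))
        errs.append(balanced_error_rate(Yte, predict(m0, m1, sigma, Xte)))

    plt.figure(); plt.plot(rhos, nus); plt.xlabel("rho"); plt.ylabel("nu(rho)")
    plt.figure(); plt.plot(rhos, errs); plt.xlabel("rho"); plt.ylabel("E(rho)")
    plt.show()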

(3.4) Instead of using a closed form for the solution of the problem considered in this question, we consider
here an iterative approach using proximal gradient descent. Let α0 = (m0(i)/σ(i), i = 1, . . . , d),
α1 = (m1(i)/σ(i), i = 1, . . . , d) and β = (1/σ(i), i = 1, . . . , d) be d-dimensional vectors.
(a) Rewrite the penalized likelihood maximization problem considered in this question as an equivalent
minimization problem for a function taking the form
\[
F(\alpha_0, \alpha_1, \beta) = f(\alpha_0, \alpha_1, \beta) + \rho\, g(\alpha_0, \alpha_1, \beta)
\qquad\text{with}\qquad
g(\alpha_0, \alpha_1, \beta) = \sum_{i=1}^{d} |\alpha_1(i) - \alpha_0(i)|.
\]
(b) Provide an explicit expression of the proximal operator
\[
\mathrm{prox}_{\gamma\rho g}(\alpha_0, \alpha_1, \beta) =
\operatorname*{argmin}_{\alpha_0', \alpha_1', \beta'}
\left\{ \rho\, g(\alpha_0', \alpha_1', \beta') +
\frac{1}{2\gamma}\Bigl(|\alpha_0 - \alpha_0'|^2 + |\alpha_1 - \alpha_1'|^2 + |\beta - \beta'|^2\Bigr) \right\}
\]
(c) Give the expression of ∇f (α0 , α1 , β) for the function f obtained in (a).
(d) Write a new version of the training program that takes the same input X, Y and ρ and returns the
optimal m0 , m1 , σ obtained after stabilization of a proximal gradient algorithm

\[
(\alpha_{0,n+1},\ \alpha_{1,n+1},\ \beta_{n+1}) =
\mathrm{prox}_{\gamma\rho g}\bigl((\alpha_{0,n}, \alpha_{1,n}, \beta_n) - \gamma\,\nabla f(\alpha_{0,n}, \alpha_{1,n}, \beta_n)\bigr),
\]

where stabilization is defined as the maximal change in the optimized variables during one iteration
being less than τ = 10⁻⁶.
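A generic sketch of such a proximal gradient loop is given below; grad_f and prox_g stand for the answers to parts (c) and (b) and are passed in as placeholder functions, whose exact signatures are an assumption of this sketch:

    import numpy as np

    def proximal_gradient(alpha0, alpha1, beta, grad_f, prox_g, gamma, rho, tau=1e-6):
        # Iterate the proximal gradient update until the largest change in any
        # optimized variable during one iteration falls below tau.
        while True:
            g0, g1, gb = grad_f(alpha0, alpha1, beta)
            new0, new1, newb = prox_g(alpha0 - gamma * g0,
                                      alpha1 - gamma * g1,
                                      beta - gamma * gb,
                                      gamma * rho)
            change = max(np.max(np.abs(new0 - alpha0)),
                         np.max(np.abs(new1 - alpha1)),
                         np.max(np.abs(newb - beta)))
            alpha0, alpha1, beta = new0, new1, newb
            if change < tau:
                return alpha0, alpha1, beta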
Run this program on the dataset “OPT FALL2020 train1.csv” using ρ = 0.1 and γ = 0.1. Provide in
your answer the maximum value (over all estimated variables) of the difference between the result obtained
after convergence and the exact formula obtained in question (3.1).
