Profs. Martin Jaggi and Nicolas Flammarion


Optimization for Machine Learning – CS-439 - IC
03.07.2023 from 15h15 to 18h15
Duration : 180 minutes
Student One
SCIPER : 111111

Wait for the start of the exam before turning to the next page. This document is printed
double-sided, 16 pages. Do not unstaple.

• This is a closed book exam. No electronic devices of any kind.

• Place on your desk: your student ID, writing utensils, one double-sided A4 page cheat sheet if you
have one; place all other personal items below your desk or on the side.

• You each have a different exam.

• For technical reasons, do use black or blue pens for the MCQ part, no pencils! Use white
correction fluid if necessary.


First part, multiple choice


There is exactly one correct answer per question.

Convexity
Question 1 Let f : R^n → R, h : R → R, g : R^n → R be three functions such that f = h ◦ g, i.e.,
f(x) = h(g(x)) for all x ∈ R^n.
Which of the statements below are true?

A) f is concave provided that h is concave and non-decreasing, and g is concave.

B) f is concave provided that h is concave and non-decreasing, and g is convex.

C) f is concave provided that h is concave and non-increasing, and g is concave.

D) f is concave provided that h is concave and non-increasing, and g is convex.

F) f is concave provided that h is convex and non-decreasing, and g is concave.

B, C, and F
A and D
A, B, and D
A and B
A and C

Smoothness and gradient descent


Question 2 Let f : R → R be a convex, differentiable, and L-smooth function. Let G_f be the update
function when running gradient descent with learning rate γ, so that x_{t+1} = G_f(x_t). Which statement is
true for any such function f?

G_f is injective if 1/(2L) < γ ≤ 1/L.

Regardless of how γ is chosen, G_f is always injective.

G_f is injective if γ ≤ 1/(2L).

Regardless of how γ is chosen, there always exists a function f for which G_f is not injective.

Question 3 Let f : R → R be a function. Assume the sequence x_0, x_1, . . . exists such that x_{t−1} :=
x_t − γ∇f(x_t), i.e. it is the reverse of a sequence generated by running gradient descent over f from some
starting point with learning rate γ. Assuming x_t → ∞ as t → ∞, and that x_0 is a global minimum of f (so
the gradient descent converges), which statement is true for any such function f?

For each of the other statements there is at least one function f for which the statement does not hold.
f is either globally smooth for some value L or satisfies the Polyak-Łojasiewicz inequality.
f satisfies the Polyak-Łojasiewicz inequality.
f is (1/γ)-smooth.

Question 4 Which of the following statements is true?

f(x) := x^4 is 2-smooth.
f(x) := x^4 is 4-smooth.
f(x) := x^2 is 10-smooth.
f(x) := x^2 is 1-smooth.


Question 5 Let f : R → R and g : R → R be L-smooth functions. Which of the following is true about
h(x) := f(g(x))?

h may not be smooth.

h is never L′-smooth for any L′ < L.
h is always L^2-smooth.
h is always L-smooth.

Question 6 Let f_1, . . . , f_n be functions that are smooth with constants L_1, . . . , L_n, respectively. Which of the
following statements on the functions g = ∑_{i=1}^n λ_i f_i and h = max_i λ_i f_i, for λ_1, . . . , λ_n ∈ R_+, hold in general?

A) g is (∑_{i=1}^n λ_i L_i)-smooth.

B) h is (∑_{i=1}^n λ_i L_i)-smooth.

C) g is (max_i λ_i L_i)-smooth.

D) h is (max_i λ_i L_i)-smooth.

B and D
A and D
B and C
A, B, and D
A
A and B

Proximal Gradient Descent


Question 7 Assume that the proximal operator is readily available for a convex function f : R^d → R
and any γ > 0. Recall that the proximal operator is defined by

prox_{f,γ}(v) = argmin_{x ∈ R^d} { f(x) + (1/(2γ)) ∥x − v∥^2 }.

For which of the following transformations of f do we have the proximal operator readily available (with a
single call to the proximal oracle of f and any γ)?

A) g(x) := a f(x) + b, for a, b ∈ R.

B) g(x) := f(x) + λ∥x∥^2, for λ > 0.

C) g(x) := f(x) + f(2x).

D) g(x) := f(x ⊙ x), where x ⊙ x denotes the coordinate-wise multiplication.

B, C, D
B, C
C, D
A, B
A, B, D
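For intuition, the operator above has a simple closed form for some choices of f. For instance, when f(x) = ∥x∥_1 the minimization decouples per coordinate and prox_{f,γ} is soft-thresholding at level γ. A minimal sketch of this special case (assuming NumPy; the name prox_l1 is only illustrative):

import numpy as np

def prox_l1(v, gamma):
    # prox_{f,gamma}(v) for f(x) = ||x||_1, i.e. the minimizer of
    # ||x||_1 + (1 / (2 * gamma)) * ||x - v||^2, computed coordinate-wise
    # by soft-thresholding v at level gamma.
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

v = np.array([3.0, -0.5, 1.2])
print(prox_l1(v, gamma=1.0))  # entries with |v_i| <= gamma are shrunk to 0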


Subgradient Descent
Question 8 Which class of functions from R to R always has at least one subgradient (i.e., there exists
a point x with ∂f(x) ≠ ∅)?

Lipschitz continuous
Convex
Bounded
Bounded and Lipschitz continuous

Frank-Wolfe
Question 9 For any convex and bounded region X ⊂ R^d and any vector g ∈ R^d, define the LMO oracle

LMO_X(g) = argmin_{z ∈ X} g⊤z.

Which of the following statements are true?

A) LMO_X(g) is unique for any vector g.

B) When X = {x ∈ R^d : ∥x∥_1 ≤ 1}, the computational complexity of evaluating the LMO is less than that of
the projection onto the set X.

C) For any y ∈ X and λ > 0, the linear combination (1 − λ)y + λ LMO_X(g) stays in the feasible region X.

A and C
B
C
A and B
B and C
A
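As an illustration of the oracle in statement B, the LMO over the ℓ1 ball has a closed form: the minimum of a linear function over the ball is attained at a vertex, i.e. at −sign(g_i) e_i for a coordinate i maximizing |g_i|. A minimal sketch (assuming NumPy; the name lmo_l1_ball is only illustrative):

import numpy as np

def lmo_l1_ball(g):
    # LMO for X = {x : ||x||_1 <= 1}: returns a minimizer of g^T z over X,
    # namely a signed standard basis vector at a coordinate of largest |g_i|.
    i = np.argmax(np.abs(g))
    z = np.zeros_like(g, dtype=float)
    z[i] = -np.sign(g[i])
    return z

g = np.array([0.3, -2.0, 1.1])
print(lmo_l1_ball(g))  # [0. 1. 0.], found in a single pass over g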

Newton’s Method
Question 10 Consider the following strictly convex function:

f(x) = √(1 + x^2).

Definition: we say that a sequence (x_n) converges to its limit l at a γ-superlinear rate (for a γ > 1) if γ
is the largest constant such that there exists a constant C such that for all n we have |x_{n+1} − l| ≤ C|x_n − l|^γ.
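For example, the sequence defined by x_{n+1} := x_n^2 with x_0 = 1/2 converges to l = 0 at a 2-superlinear (quadratic) rate, since |x_{n+1} − 0| = |x_n − 0|^2, so the definition holds with C = 1 and γ = 2.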

Which of the following statements is not true?

A) If |x0 | < 1, Newton’s method converges with a quadratic superlinear rate (γ = 2).

B) If |x0 | < 1, Newton’s method converges with a cubic superlinear rate (γ = 3).

C) If |x0 | = 1, Newton’s method oscillates between −1 and 1.

D) If |x0 | > 1, Newton’s method diverges.

A
B
C
D


Second part, true/false questions


Question 11 (Subgradients) The number of subgradients |∂f (x)| for any point x ∈ dom(f ) and function f
is either 0, 1 or ∞.

TRUE FALSE

Question 12 (Subgradients) Consider the function f_α : R → R defined by

f_α(x) = x^α if x > 0, and f_α(x) = 0 if x ≤ 0.

Then for any α > 1, f_α is differentiable everywhere.

TRUE FALSE

Question 13 (Subgradients) Again consider the same function f_α : R → R:

f_α(x) = x^α if x > 0, and f_α(x) = 0 if x ≤ 0.

Then for any real number α > 0, f_α has subgradients everywhere.


TRUE FALSE

Question 14 (Smoothness) Gradient descent with stepsize γ = 1/L for a smooth function (L being the
smoothness constant) converges to a critical point of the function.

TRUE FALSE

Question 15 (Convex functions) Consider a function f with a domain that is not necessarily convex.
If f has subgradients everywhere then it is necessarily convex.

TRUE FALSE

Question 16 (Semi-norms) A semi-norm is a non-negative function f that satisfies: i) f(αx) = |α| f(x) for any α ∈ R,
and ii) the triangle inequality f(x + y) ≤ f(x) + f(y).
All semi-norms are convex.

TRUE FALSE

Question 17 (Convex functions) If f1 , . . . , fm are all convex functions Rd → R, then f defined as


f (x) := min(f1 (x), . . . , fm (x)) is also convex.

TRUE FALSE


Question 18 (Convex minimum) A convex function always has a minimum.

TRUE FALSE

Question 19 (Strong convexity) A strongly convex function always has exactly one minimizer.

TRUE FALSE

Question 20 (Convex sets) The empty set ∅ is convex.

TRUE FALSE

Question 21 (Newton) Newton’s method always converges faster than gradient descent.

TRUE FALSE

Question 22 (Stochastic Gradient Descent) If the starting point x_0 is chosen too far from the convergence
point x⋆, i.e. ∥x_0 − x⋆∥ is larger than a certain threshold R_0, it is not possible to choose a learning rate
such that SGD converges in expectation, even if the function is convex and differentiable and the expected
norm of the stochastic gradient is bounded, i.e. E[∥g_t∥^2] ≤ B^2.
TRUE FALSE

Question 23 (Projected Gradient Descent) Computing the projection of any vector z ∈ R^d onto the
Euclidean ball {x ∈ R^d : ∥x∥_2 ≤ 1} is an O(1) operation.

TRUE FALSE

Question 24 (Proximal Gradient Descent) For any two convex functions f and g, the proximal operator
is additive, i.e., prox_{f+g,γ} = prox_{f,γ} + prox_{g,γ}.

TRUE FALSE


Third part, open questions


Answer in the space provided! Your answer must be justified with all steps. Do not cross any checkboxes,
they are reserved for correction.

Convex smooth functions


Until the end of this section, we assume that the function f : R^d → R is convex and L-smooth. We let
x⋆ ∈ argmin_{x∈R^d} f(x).

Question 25: 2 points. For x, y ∈ R^d, define the function g_x(y) := f(y) − ∇f(x)⊤y. Show that g_x is
convex and has a global minimum at x.

0 1 2


Question 26: 2 points. Prove that for all x, y ∈ R^d:

f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (1/(2L)) ∥∇f(x) − ∇f(y)∥^2.

0 1 2


Question 27: 2 points. Show that for all x, y ∈ R^d, we have

(1/L) ∥∇f(x) − ∇f(y)∥^2 ≤ (∇f(x) − ∇f(y))⊤(x − y).

This property is usually referred to as co-coercivity.

0 1 2

Gradient descent

Now consider the gradient descent algorithm on the function f:

x_{t+1} := x_t − γ∇f(x_t),   for all t ≥ 0.
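For reference, a minimal sketch of this iteration on a simple quadratic (assuming NumPy; the objective and the helper name gradient_descent are illustrative choices):

import numpy as np

def gradient_descent(grad_f, x0, gamma, num_steps):
    # Iterate x_{t+1} = x_t - gamma * grad_f(x_t) and return all iterates.
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        xs.append(xs[-1] - gamma * grad_f(xs[-1]))
    return xs

# f(x) = 0.5 * ||x||^2 is 1-smooth, has minimizer x* = 0, and gradient x.
iterates = gradient_descent(grad_f=lambda x: x, x0=[4.0, -2.0], gamma=1.0, num_steps=5)
print([float(np.linalg.norm(x)) for x in iterates])  # distances to x* are non-increasing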
Question 28: 2 points. Show that for γ ≤ 2/L, the sequence (∥x_t − x⋆∥^2)_{t≥0} is non-increasing.

0 1 2


Define ∆_t := f(x_t) − f(x⋆).

Question 29: 3 points. Show that there exists a constant α > 0 such that

∆_t ≤ ∆_{t−1} − α ∆_{t−1}^2.

Further show that the above inequality implies

1/∆_{t−1} ≤ 1/∆_t − α.

Hint: Use convexity and Cauchy-Schwarz.

0 1 2 3


Question 30: 1 point. Deduce that

∆_t ≤ 2L∥x_0 − x⋆∥^2 / (t + 4).

0 1


Non-convex optimization
A differentiable function f with a convex dom(f) is called invex with respect to η, a function defined over
dom(η) = dom(f) × dom(f), if for all x, y ∈ dom(f),

f(x) ≥ f(y) + η(x, y)⊤∇f(y).   (IVX)

If there exists an η such that f is invex with respect to η, then f is called invex. In the following, until the end of
this section, assume that the function f is differentiable and dom(f) is open and convex.
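For instance, any differentiable convex function with convex domain is invex: taking η(x, y) = x − y, condition (IVX) becomes f(x) ≥ f(y) + (x − y)⊤∇f(y), which is the first-order characterization of convexity.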

Basics of Invexity
Question 31: 1 point. Prove that f is invex if it does not have any critical points. Hint: construct an η.

0 1

Question 32: 2 points. Prove that f is invex if and only if all of its critical points are global minima.
Hint: use the previous question for non-critical points.

0 1 2


Rate of convergence
In this subsection, assume that f is invex with respect to the function η,

η(x, y) = c_0 (x − y) + η_0(x, y),

defined through a constant c_0 ≥ 0 and a function η_0 that is defined over dom(f) × dom(f). In addition,
assume that η_0 satisfies the following condition for a fixed constant N ≥ 0:

∥η_0(x, y)∥ ≤ N (∥∇f(x)∥ + ∥∇f(y)∥), for all x, y ∈ dom(f).

Recall that when c_0 = 1 and η_0 = 0, the invexity condition given in (IVX) is equivalent to convexity. And,
in the convex case, the vanilla analysis of gradient descent with constant step size γ over T steps yields the
following bound,

∑_{t=0}^{T−1} (f(x_t) − f(x⋆)) ≤ (γ/2) ∑_{t=0}^{T−1} ∥∇f(x_t)∥^2 + (1/(2γ)) ∥x_0 − x⋆∥^2,   (VB)

where x_0 is the initial point and x_1, . . . , x_{T−1} are iterates obtained by running gradient descent. Our aim in
this subsection is to extend this analysis to a more general class of invex functions.
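As a reminder, (VB) follows from expanding the squared distance after one gradient step x_{t+1} = x_t − γ∇f(x_t),

γ ∇f(x_t)⊤(x_t − x⋆) = (γ^2/2) ∥∇f(x_t)∥^2 + (1/2) ∥x_t − x⋆∥^2 − (1/2) ∥x_{t+1} − x⋆∥^2,

then dividing by γ, bounding f(x_t) − f(x⋆) ≤ ∇f(x_t)⊤(x_t − x⋆) by convexity, summing over t = 0, . . . , T − 1 so that the distance terms telescope, and dropping the final negative term.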

Question 33: 2 points. Derive the following bound that is reminiscent of (VB) for f,

∑_{t=0}^{T−1} (f(x_t) − f(x⋆)) ≤ (c_0 γ/2) ∑_{t=0}^{T−1} ∥∇f(x_t)∥^2 + (c_0/(2γ)) ∥x_0 − x⋆∥^2 − ∑_{t=0}^{T−1} η_0(x⋆, x_t)⊤∇f(x_t).

Hint: treat the two terms in η separately.

0 1 2


Question 34: 1 point. Assume that f is B-Lipschitz and R = ∥x_0 − x⋆∥ is finite. Give a bound on

E_T = ∑_{i=0}^{T−1} (f(x_i) − f(x⋆))

that depends on B, R, N, T, c_0 and γ.

0 1

Question 35: 1 point. Does the E_T found in the previous question converge to zero as T → ∞ with a
proper choice of γ?

0 1


Question 36: 3 points. Assume now that f is L-smooth, ∥x_0 − x⋆∥ ≤ R_1 and f(x_0) − f(x⋆) ≤ R_2 for
positive constants L, R_1 and R_2. Also, let γ = 1/L be the fixed stepsize. Give a bound on

E_T = f(x_T) − f(x⋆)

that depends on B, R, N, T, c_0 and γ.

0 1 2 3



Question 37: 3 points. Repeat the previous question for an η_0 defined over dom(f) × dom(f) that satisfies the
following uniform bound:

∥η_0(x, y)∥ ≤ N, for all x, y ∈ dom(f).

Hint: Cauchy-Schwarz is your friend.

0 1 2 3


Question 38: 1 point. Derive the order of the number of steps required to achieve a small error ε, i.e., E_T ≤ ε, for the
two previous questions.
Hint: Focus only on the highest-order term for the latter.

0 1
