Exam 2023
Wait for the start of the exam before turning to the next page. This document is printed
double-sided (16 pages). Do not unstaple.
• This is a closed-book exam. No electronic devices of any kind.
• Place on your desk: your student ID, writing utensils, one double-sided A4 page cheat sheet if you
have one; place all other personal items below your desk or on the side.
• For technical reasons, use only black or blue pens for the MCQ part, no pencils! Use correction
fluid if necessary.
Convexity
Question 1 Let f : R^n → R, h : R → R, g : R^n → R be three functions such that f = h ◦ g, i.e.,
f(x) = h(g(x)) for all x ∈ R^n.
What are all the true statements?
B, C, and F
A and D
A, B, and D
A and B
A and C
Gf is injective if 1/(2L) < γ ≤ 1/L.
Regardless of how γ is chosen, there always exists a function f for which Gf is not injective.
Question 3 Let f : R → R be a function. Assume the sequence x0, x1, . . . exists such that xt−1 :=
xt − γ∇f(xt), i.e., it is the reverse of a sequence generated by running gradient descent over f from some
starting point with learning rate γ. Assuming xt → ∞ as t → ∞, and that x0 is a global minimum of f (so
gradient descent converges), which statement is true for any such function f ?
For each of the other statements there is at least one function f for which the statement does not hold.
f is either globally smooth for some value L or satisfies the Polyak-Lojasiewicz inequality.
f satisfies the Polyak-Lojasiewicz inequality.
f is 1/γ-smooth.
f(x) := x^4 is 2-smooth.
f(x) := x^4 is 4-smooth.
f(x) := x^2 is 10-smooth.
f(x) := x^2 is 1-smooth.
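These smoothness claims can be sanity-checked numerically. The sketch below (Python, assuming the usual convention that L-smoothness of a twice-differentiable f : R → R means |f''(x)| ≤ L for all x) estimates the largest curvature of x^2 and x^4 on a finite grid; a finite grid only gives a lower bound on any valid global constant L.

import numpy as np

def curvature_lower_bound(f_second, lo=-10.0, hi=10.0, n=10001):
    # Largest |f''| observed on [lo, hi]; any global smoothness constant L must be at least this.
    xs = np.linspace(lo, hi, n)
    return np.max(np.abs(f_second(xs)))

# f(x) = x^2 has f''(x) = 2 everywhere.
print(curvature_lower_bound(lambda x: 2.0 * np.ones_like(x)))  # 2.0
# f(x) = x^4 has f''(x) = 12 x^2, which keeps growing as the interval widens.
print(curvature_lower_bound(lambda x: 12.0 * x**2))            # 1200.0 on [-10, 10]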
Question 5 Let f : R → R and g : R → R be L-smooth functions. Which of the following is true about
h(x) := f (g(x))?
Question 6 Let f1, . . . , fn be functions that are smooth with constants L1, . . . , Ln, respectively. Which of the
following statements on the functions g = Σ_{i=1}^{n} λi fi and h = max_i λi fi for λ1, . . . , λn ∈ R+ hold generally?
A) g is (Σ_{i=1}^{n} λi Li)-smooth.
B) h is (Σ_{i=1}^{n} λi Li)-smooth.
C) g is (max_i λi Li)-smooth.
D) h is (max_i λi Li)-smooth.
B and D
A and D
B and C
A, B, and D
A
A and B
Question 7 For which of the following transformations of f do we have the proximal operator readily available (with a
single call to the proximal oracle of f and any γ)?
B, C, D
B, C
C, D
A, B
A, B, D
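As background on what it means for a proximal operator to be readily available in closed form, a standard example (independent of the particular options above) is the ℓ1-norm, whose prox is coordinate-wise soft-thresholding; a minimal Python sketch:

import numpy as np

def prox_l1(v, gamma):
    # prox_{gamma * ||.||_1}(v) = argmin_x gamma * ||x||_1 + 0.5 * ||x - v||^2,
    # solved coordinate-wise by soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

print(prox_l1(np.array([3.0, -0.5, 1.2]), gamma=1.0))  # approximately [2.0, -0.0, 0.2]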
Subgradient Descent
Question 8 Which class of functions from R to R always has at least one subgradient (i.e., there exists
a point x with ∂f(x) ≠ ∅)?
Lipschitz continuous
Convex
Bounded
Bounded and Lipschitz continuous
Frank-Wolfe
Question 9 For any convex and bounded region X ⊂ R^d and any vector g ∈ R^d, define the LMO oracle,
B) When X = {x ∈ R^d : ∥x∥1 ≤ 1}, the computational complexity of evaluating the LMO is lower than that of
projecting onto the set X.
C) For any y ∈ X and λ > 0, the linear combination (1 − λ)y + λ LMO_X(g) stays in the feasible region X.
A and C
B
C
A and B
B and C
A
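To make the LMO concrete on the ℓ1-ball from statement B, and assuming the standard Frank-Wolfe definition LMO_X(g) := argmin_{z∈X} g^⊤z, the oracle returns a signed coordinate vector at the largest-magnitude entry of g. A minimal Python sketch:

import numpy as np

def lmo_l1_ball(g):
    # LMO for X = {x : ||x||_1 <= 1}, assuming LMO_X(g) = argmin_{z in X} g^T z.
    # A linear function over the l1-ball is minimized at a signed vertex:
    # -sign(g_i) * e_i for the coordinate i with the largest |g_i|.
    i = np.argmax(np.abs(g))
    z = np.zeros_like(g)
    z[i] = -np.sign(g[i])
    return z

print(lmo_l1_ball(np.array([0.3, -2.0, 1.1])))  # [0. 1. 0.]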
Newton’s Method
Question 10 Consider the following strictly convex function
f(x) = √(1 + x^2)
Definition: we say that a sequence (xn) converges to its limit l at a γ-superlinear rate (for a γ > 1) if γ
is the largest constant such that there exists a constant C such that for all n we have ∥xn+1 − l∥ ≤ C∥xn − l∥^γ.
A) If |x0 | < 1, Newton’s method converges with a quadratic superlinear rate (γ = 2).
B) If |x0 | < 1, Newton’s method converges with a cubic superlinear rate (γ = 3).
A
B
C
D
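The behaviour in Question 10 can also be explored empirically. Below is a minimal Python sketch of the standard Newton iteration x_{t+1} = x_t − f'(x_t)/f''(x_t) applied to f(x) = √(1 + x^2); printing |x_{t+1}|/|x_t|^γ for candidate values of γ lets one judge which exponent keeps the ratio bounded, in the sense of the definition above.

import numpy as np

def newton_iterates(x0, steps=6):
    # Newton's method on f(x) = sqrt(1 + x^2), with f'(x) = x / sqrt(1 + x^2)
    # and f''(x) = (1 + x^2) ** (-1.5).
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - (x / np.sqrt(1 + x**2)) / (1 + x**2) ** (-1.5))
    return xs

xs = newton_iterates(0.5)  # the limit l is the minimizer 0
for gamma in (2.0, 3.0):
    print(gamma, [abs(b) / abs(a) ** gamma for a, b in zip(xs, xs[1:])])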
TRUE FALSE
TRUE FALSE
fα (x) =
0 if x ≤ 0.
TRUE FALSE
Question 15 (Convex functions) Consider a function f with a domain that is not necessarily convex.
If f has subgradients everywhere then it is necessarily convex.
TRUE FALSE
TRUE FALSE
TRUE FALSE
TRUE FALSE
Question 19 (Strong convexity) A strongly convex function always has exactly one minimizer.
TRUE FALSE
TRUE FALSE
Question 21 (Newton) Newton’s method always converges faster than gradient descent.
TRUE FALSE
Question 22 (Stochastic Gradient Descent) If the starting point x0 is chosen too far from the convergence
point x⋆, i.e. ∥x0 − x⋆∥ is larger than a certain threshold R0, it is not possible to choose a learning rate
such that SGD converges in expectation, even if the function is convex and differentiable and the expected
squared norm of the stochastic gradient is bounded, i.e. E[∥gt∥^2] ≤ B^2.
TRUE FALSE
TRUE FALSE
Question 24 (Proximal Gradient Descent) For any two convex functions f and g, the proximal operator
is additive, i.e., prox_{f+g,γ} = prox_{f,γ} + prox_{g,γ}.
TRUE FALSE
Question 25: 2 points. For x, y ∈ R^d, define the function g_x(y) := f(y) − ∇f(x)^⊤ y. Show that g_x is
convex and has a global minimum at x.
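A quick numerical sanity check of the claim in Python, using the hypothetical choice f(y) = ∥y∥^2 (a convex, differentiable function picked only so the statement can be probed on sample points):

import numpy as np

# Sample choice for illustration only: f(y) = ||y||^2, so grad f(x) = 2x.
f = lambda y: np.dot(y, y)
grad_f = lambda x: 2.0 * x

x = np.array([1.0, -2.0])
g_x = lambda y: f(y) - np.dot(grad_f(x), y)

# g_x should attain its minimum at y = x: compare against random perturbations of x.
rng = np.random.default_rng(0)
print(all(g_x(x) <= g_x(x + d) for d in rng.normal(size=(1000, 2))))  # True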
Gradient descent
Non-convex optimization
A differentiable function f with a convex dom(f) is called invex with respect to η that is defined over
dom(η) = dom(f) × dom(f), if for all x, y ∈ dom(f),
f(x) ≥ f(y) + η(x, y)^⊤ ∇f(y).   (IVX)
If there exists an η such that f is invex with respect to η, f is called invex. In the following, until the end of
this section, assume that the function f is differentiable and dom(f) is open and convex.
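As a sanity check of the definition, a differentiable convex function f is invex with the particular choice η(x, y) = x − y, since (IVX) then reduces to the first-order characterization of convexity:

f(x) ≥ f(y) + (x − y)^⊤ ∇f(y) for all x, y ∈ dom(f).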
Basics of Invexity
Question 31: 1 point. Prove that f is invex if it does not have any critical points. Hint: construct an η.
Question 32: 2 points. Prove that f is invex if and only if all of its critical points are global minima.
Rate of convergence
In this subsection, assume that f is invex with respect to the function η(x, y) := c0 (x − y) + η0 (x, y),
defined through a constant c0 ≥ 0 and a function η0 that is defined over dom(f) × dom(f). In addition,
assume that η0 satisfies the following condition for a fixed constant N ≥ 0:
∥η0(x, y)∥ ≤ N (∥∇f(x)∥ + ∥∇f(y)∥) for all x, y ∈ dom(f).
Recall that when c0 = 1 and η0 = 0, the invexity condition given in (IVX) is equivalent to convexity. Moreover,
in the convex case, the vanilla analysis of gradient descent with constant step size γ over T steps yields the
following bound:
Σ_{t=0}^{T−1} (f(xt) − f(x⋆)) ≤ (γ/2) Σ_{t=0}^{T−1} ∥∇f(xt)∥^2 + (1/(2γ)) ∥x0 − x⋆∥^2,   (VB)
where x0 is the initial point and x1, . . . , x_{T−1} are the iterates obtained by running gradient descent. Our aim in
this subsection is to extend this analysis to a more general class of invex functions.
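The bound (VB) can be checked numerically on a toy instance. A minimal Python sketch, using the hypothetical convex quadratic f(x) = 0.5∥x∥^2 with x⋆ = 0 (chosen only to illustrate the inequality, not the general analysis):

import numpy as np

# Gradient descent on the sample function f(x) = 0.5 * ||x||^2 (grad f(x) = x, x_star = 0),
# followed by a check that the vanilla-analysis bound (VB) holds for these iterates.
f = lambda x: 0.5 * np.dot(x, x)
grad = lambda x: x

gamma, T = 0.1, 50
x = np.array([3.0, -4.0])
x0, x_star = x.copy(), np.zeros(2)

lhs, grad_sq_sum = 0.0, 0.0
for _ in range(T):
    lhs += f(x) - f(x_star)
    grad_sq_sum += np.dot(grad(x), grad(x))
    x = x - gamma * grad(x)

rhs = gamma / 2 * grad_sq_sum + np.dot(x0 - x_star, x0 - x_star) / (2 * gamma)
print(lhs <= rhs)  # True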
Question 33: 2 points. Derive the following bound, reminiscent of (VB), for f:
Σ_{t=0}^{T−1} (f(xt) − f(x⋆)) ≤ (c0 γ/2) Σ_{t=0}^{T−1} ∥∇f(xt)∥^2 + (c0/(2γ)) ∥x0 − x⋆∥^2 − Σ_{t=0}^{T−1} η0(x⋆, xt)^⊤ ∇f(xt).
Hint: treat the two terms in η separately.
Question 34: 1 point. Assume that f is B-Lipschitz and that R = ∥x0 − x⋆∥ is finite. Give a bound on
ET = Σ_{i=0}^{T−1} (f(xi) − f(x⋆)).
Question 35: 1 point. Does the ET found in the previous question converge to zero as T → ∞ with a
proper choice of γ?
Question 36: 3 points. Assume now that f is L-smooth, ∥x0 − x⋆∥ ≤ R1, and f(x0) − f(x⋆) ≤ R2 for
positive constants L, R1, and R2. Also, let γ = 1/L be the fixed step size. Give a bound on
ET = f(xT) − f(x⋆).
Question 37: 3 points. Repeat the previous question for η0 defined over dom(f) × dom(f) that satisfies the
following uniform bound:
∥η0(x, y)∥ ≤ N for all x, y ∈ dom(f).
Hint: Cauchy-Schwarz is your friend.
Question 38: 1 point. Derive the order of the number of steps required to achieve a small error ε, i.e., ET ≤ ε, for the
two previous questions.
Hint: focus only on the highest-order term for the latter.