
CS390FF: Special Topics in Data Sciences: Big Data Optimization

KAUST, Fall 2017

1.4 Convergence Analysis of the Basic Method


Covariance Matrix and Total Variance of a Random Vector


Definition 35 (Covariance matrix)
If x \in R^n is a random vector, then the matrix

    Var(x) := E[(x - E[x])(x - E[x])^\top]

is called the covariance matrix of x.

Definition 36 (Total Variance)
If x \in R^n is a random vector, then the value

    TVar(x) := E[(x - E[x])^\top (x - E[x])] = E[\|x - E[x]\|^2]

is called the total variance of x.


Exercise 8
Let x \in R^n be a random vector. Show that:
(i) The total variance is the trace of the covariance matrix: TVar(x) = Tr(Var(x)).
(ii) TVar(U^\top B^{1/2} x) = E[\|x - E[x]\|_B^2].
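Both parts of Exercise 8 are easy to verify numerically. Below is a minimal NumPy sketch; the distribution of x, the positive definite matrix B and the orthogonal matrix U are arbitrary illustrative choices, not objects fixed by the course material:

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 3, 200_000

    # Samples of a random vector x in R^3 with a non-trivial covariance.
    X = rng.normal(size=(N, n)) @ np.array([[2.0, 0.3, 0.0],
                                            [0.0, 1.0, 0.5],
                                            [0.0, 0.0, 0.7]])
    mu = X.mean(axis=0)
    C = (X - mu).T @ (X - mu) / N                   # empirical Var(x)
    tvar = ((X - mu) ** 2).sum(axis=1).mean()       # empirical TVar(x)
    print(np.isclose(tvar, np.trace(C)))            # (i): TVar(x) = Tr(Var(x)) -> True

    # (ii): TVar(U^T B^{1/2} x) = E ||x - E[x]||_B^2 for orthogonal U and s.p.d. B.
    M = rng.normal(size=(n, n))
    B = M @ M.T + n * np.eye(n)                     # an assumed positive definite B
    w, V = np.linalg.eigh(B)
    B_half = V @ np.diag(np.sqrt(w)) @ V.T          # symmetric square root B^{1/2}
    U, _ = np.linalg.qr(rng.normal(size=(n, n)))    # an arbitrary orthogonal U
    Y = X @ B_half @ U                              # rows are U^T B^{1/2} x
    tvar_Y = ((Y - Y.mean(axis=0)) ** 2).sum(axis=1).mean()
    Bnorm_sq = np.einsum('ij,jk,ik->i', X - mu, B, X - mu).mean()
    print(np.isclose(tvar_Y, Bnorm_sq))             # -> True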
Strong vs Weak Convergence
Definition 37 (Strong and Weak Convergence)
We say that a sequence of random vectors {x_k} converges to x_*
- weakly if \|E[x_k - x_*]\|_B^2 \to 0 as k \to \infty,
- strongly if E[\|x_k - x_*\|_B^2] \to 0 as k \to \infty (aka L_2 convergence).

The following lemma explains why strong convergence is a stronger convergence concept than weak convergence.

Lemma 38
For any random vector x_k \in R^n and any x_* \in R^n we have the identity

    E[\|x_k - x_*\|_B^2] = \|E[x_k - x_*]\|_B^2 + E[\|x_k - E[x_k]\|_B^2],

where the second term is equal to TVar(U^\top B^{1/2} x_k). As a consequence, strong convergence implies
- weak convergence,
- convergence of TVar(U^\top B^{1/2} x_k) to zero.

Proof of Lemma 38

Let \mu = E[x_k]. Then

    E[\|x_k - x_*\|_B^2] = E[\|x_k - \mu + \mu - x_*\|_B^2]
                         = E[\|x_k - \mu\|_B^2 + \|\mu - x_*\|_B^2 + 2\langle x_k - \mu, \mu - x_*\rangle_B]
                         = E[\|x_k - \mu\|_B^2] + \|\mu - x_*\|_B^2 + 2\langle E[x_k - \mu], \mu - x_*\rangle_B
                         = E[\|x_k - \mu\|_B^2] + \|\mu - x_*\|_B^2,

since E[x_k - \mu] = 0. In the first step we have expanded the square, and in the second step we have used linearity of expectation.

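The decomposition in Lemma 38 can be confirmed numerically. A small sketch; the distribution of x_k, the point x_* and the matrix B below are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    n, N = 4, 200_000

    M = rng.normal(size=(n, n))
    B = M @ M.T + n * np.eye(n)                          # an assumed positive definite B
    x_star = rng.normal(size=n)                          # an arbitrary deterministic point
    Xk = x_star + 0.3 + 0.5 * rng.normal(size=(N, n))    # samples of the random vector x_k

    def b_norm_sq(V):                                    # row-wise ||v||_B^2
        return np.einsum('ij,jk,ik->i', V, B, V)

    lhs = b_norm_sq(Xk - x_star).mean()                  # E ||x_k - x_*||_B^2
    mu = Xk.mean(axis=0)                                 # E [x_k]
    rhs = (mu - x_star) @ B @ (mu - x_star) + b_norm_sq(Xk - mu).mean()
    print(np.isclose(lhs, rhs))                          # -> True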
Weak Convergence


Weak Convergence
Theorem 39 (Weak Convergence 1)
Choose any x_0 \in R^n and let {x_k} be the random iterates produced by Algorithm 2. Let x_* \in L be chosen arbitrarily. Then

    E[x_{k+1} - x_*] = (I - \omega B^{-1} E[Z]) E[x_k - x_*].                                       (35)

Moreover, by transforming the error via the linear mapping h \mapsto U^\top B^{1/2} h, this can be written in the form

    E[U^\top B^{1/2}(x_k - x_*)] = (I - \omega\Lambda)^k U^\top B^{1/2}(x_0 - x_*),                 (36)

which is separable in the coordinates of the transformed error:

    E[u_i^\top B^{1/2}(x_k - x_*)] = (1 - \omega\lambda_i)^k u_i^\top B^{1/2}(x_0 - x_*),   i = 1, 2, ..., n.   (37)

Finally,

    \|E[x_k - x_*]\|_B^2 = \sum_{i=1}^n (1 - \omega\lambda_i)^{2k} (u_i^\top B^{1/2}(x_0 - x_*))^2.   (38)
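Theorem 39 can be illustrated numerically. The sketch below assumes Algorithm 2 is the sketch-and-project iteration of the earlier slides specialized to randomized Kaczmarz (B = I, unit-coordinate sketches S_k = e_i chosen uniformly), in which case E[Z] = \tfrac{1}{m}\sum_i a_i a_i^\top / \|a_i\|^2; the data are synthetic:

    import numpy as np

    rng = np.random.default_rng(2)
    m, n, omega, K, runs = 6, 4, 1.0, 15, 20_000

    A = rng.normal(size=(m, n))
    x_star = rng.normal(size=n)
    b = A @ x_star                               # consistent system A x = b
    x0 = rng.normal(size=n)

    # E[Z] for uniform row sampling: average of the projectors a_i a_i^T / ||a_i||^2.
    EZ = sum(np.outer(a, a) / (a @ a) for a in A) / m

    # Monte Carlo estimate of E[x_K - x_*].
    mean_err = np.zeros(n)
    for _ in range(runs):
        x = x0.copy()
        for _ in range(K):
            i = rng.integers(m)
            a = A[i]
            x -= omega * (a @ x - b[i]) / (a @ a) * a
        mean_err += (x - x_star) / runs

    # Prediction from (35), unrolled: E[x_K - x_*] = (I - omega E[Z])^K (x_0 - x_*).
    pred = np.linalg.matrix_power(np.eye(n) - omega * EZ, K) @ (x0 - x_star)
    print(mean_err)
    print(pred)                                  # agrees with the estimate up to Monte Carlo error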
Weak Convergence

Theorem 40 (Convergence 2)
Let x_* = \Pi^B_L(x_0). Then for all i = 1, 2, ..., n,

    E[u_i^\top B^{1/2}(x_k - x_*)] = 0                                                      if \lambda_i = 0,
    E[u_i^\top B^{1/2}(x_k - x_*)] = (1 - \omega\lambda_i)^k u_i^\top B^{1/2}(x_0 - x_*)    if \lambda_i > 0.    (39)

Moreover,

    \|E[x_k - x_*]\|_B^2 \leq \rho^k(\omega) \|x_0 - x_*\|_B^2,                              (40)

where the rate is given by

    \rho(\omega) := \max_{i : \lambda_i > 0} (1 - \omega\lambda_i)^2.                        (41)


Necessary and Sufficient Conditions for Convergence

Corollary 41 (Necessary and sufficient conditions)
Let Assumption 3 (exactness) hold. Choose any x_0 \in R^n and let x_* = \Pi^B_L(x_0).

If {x_k} are the random iterates produced by Algorithm 2, then the following statements are equivalent:
(i) |1 - \omega\lambda_i| < 1 for all i for which \lambda_i > 0,
(ii) 0 < \omega < 2/\lambda_{\max},
(iii) E[u_i^\top B^{1/2}(x_k - x_*)] \to 0 for all i,
(iv) \|E[x_k - x_*]\|_B^2 \to 0.

Proof of Theorems 39 and 40 - I
We first state a lemma.

Lemma 42
Let Assumption 3 (exactness) hold. Consider an arbitrary x \in R^n and let x_* = \Pi^B_L(x). If \lambda_i = 0, then u_i^\top B^{1/2}(x - x_*) = 0.

Proof.
From (19) we see that x - x_* = B^{-1} A^\top w for some w \in R^m. Therefore, u_i^\top B^{1/2}(x - x_*) = u_i^\top B^{-1/2} A^\top w. By Theorem 29, we have Range(u_i : \lambda_i = 0) = Null(A B^{-1/2}), from which it follows that u_i^\top B^{-1/2} A^\top = 0.

Proof of Theorem 39: Algorithm 2 can be written in the form

    e_{k+1} = (I - \omega B^{-1} Z_k) e_k,                                                   (42)

where e_k = x_k - x_*. Multiplying both sides of this equation by B^{1/2} from the left, and taking expectation conditional on e_k, we obtain

    E[B^{1/2} e_{k+1} | e_k] = (I - \omega B^{-1/2} E[Z] B^{-1/2}) B^{1/2} e_k.


Proof of Theorems 39 and 40 - II

Taking expectations on both sides and using the tower property, we get

    E[B^{1/2} e_{k+1}] = E[E[B^{1/2} e_{k+1} | e_k]] = (I - \omega B^{-1/2} E[Z] B^{-1/2}) E[B^{1/2} e_k].

We now replace B^{-1/2} E[Z] B^{-1/2} by its eigenvalue decomposition U\Lambda U^\top (see (33)), multiply both sides of the last identity by U^\top from the left, and use linearity of expectation to obtain

    E[U^\top B^{1/2} e_{k+1}] = (I - \omega\Lambda) E[U^\top B^{1/2} e_k].

Unrolling the recurrence, we get (36). When this is written coordinate-by-coordinate, (37) follows. Identity (38) follows immediately by equating the standard Euclidean norms of both sides of (36).

Proof of Theorem 40: If x_* = \Pi^B_L(x_0), then from Lemma 42 we see that \lambda_i = 0 implies u_i^\top B^{1/2}(x_0 - x_*) = 0. Using this in (37) gives (39).

Proof of Theorems 39 and 40 - III
Finally, inequality (40) follows from

    \|E[x_k - x_*]\|_B^2
        \overset{(38)}{=} \sum_{i=1}^n (1 - \omega\lambda_i)^{2k} (u_i^\top B^{1/2}(x_0 - x_*))^2
        = \sum_{i : \lambda_i > 0} (1 - \omega\lambda_i)^{2k} (u_i^\top B^{1/2}(x_0 - x_*))^2
        \overset{(41)}{\leq} \rho^k(\omega) \sum_{i : \lambda_i > 0} (u_i^\top B^{1/2}(x_0 - x_*))^2
        = \rho^k(\omega) \sum_{i : \lambda_i > 0} (u_i^\top B^{1/2}(x_0 - x_*))^2 + \rho^k(\omega) \sum_{i : \lambda_i = 0} (u_i^\top B^{1/2}(x_0 - x_*))^2
        = \rho^k(\omega) \sum_i (u_i^\top B^{1/2}(x_0 - x_*))^2
        = \rho^k(\omega) \sum_i (x_0 - x_*)^\top B^{1/2} u_i u_i^\top B^{1/2} (x_0 - x_*)
        = \rho^k(\omega) (x_0 - x_*)^\top B^{1/2} \Big( \sum_i u_i u_i^\top \Big) B^{1/2} (x_0 - x_*)
        = \rho^k(\omega) \|x_0 - x_*\|_B^2.

The second equality holds because, by Lemma 42, the terms with \lambda_i = 0 vanish (the same fact justifies adding them back two lines later). The last identity follows from the fact that \sum_i u_i u_i^\top = U U^\top = I.


Optimal Stepsize Choice for Weak Convergence

Convergence Rate as a Function of \omega

We now consider the problem of choosing the stepsize (relaxation) parameter \omega.

In view of (40) and (41), the optimal relaxation parameter is the one solving the following optimization problem:

    \min_{\omega \in R} \rho(\omega) = \max_{i : \lambda_i > 0} (1 - \omega\lambda_i)^2.      (43)

We solve the above problem in the next result (Theorem 43).


Optimal Stepsize
Theorem 43 (Stepsize Choice)
Let \omega^* := 2/(\lambda_{\min}^+ + \lambda_{\max}). Then the objective of (43) is given by

    \rho(\omega) = (1 - \omega\lambda_{\max})^2      if \omega \leq 0,
                 = (1 - \omega\lambda_{\min}^+)^2    if 0 \leq \omega \leq \omega^*,          (44)
                 = (1 - \omega\lambda_{\max})^2      if \omega \geq \omega^*.

Moreover, \rho is decreasing on (-\infty, \omega^*] and increasing on [\omega^*, +\infty), and hence the optimal solution of (43) is \omega^*. Further, we have:
(i) If we choose \omega = 1 (no over-relaxation), then

    \rho(1) = (1 - \lambda_{\min}^+)^2.                                                       (45)

(ii) If we choose \omega = 1/\lambda_{\max} (over-relaxation), then

    \rho(1/\lambda_{\max}) = (1 - \lambda_{\min}^+/\lambda_{\max})^2 \overset{(34)}{=} (1 - 1/\zeta)^2.       (46)

(iii) If we choose \omega = \omega^* (optimal over-relaxation), the optimal rate is

    \rho(\omega^*) = \Big(1 - \frac{2\lambda_{\min}^+}{\lambda_{\min}^+ + \lambda_{\max}}\Big)^2 \overset{(34)}{=} \Big(1 - \frac{2}{\zeta + 1}\Big)^2.    (47)
Proof of Theorem 43

Recall that \lambda_{\max} \leq 1. Letting

    \rho_i(\omega) = (1 - \omega\lambda_i)^2,

it can be shown that

    \rho(\omega) = \max\{\rho_j(\omega), \rho_n(\omega)\},

where j is such that \lambda_j = \lambda_{\min}^+ (and \lambda_n = \lambda_{\max}). Note that \rho_j(\omega) = \rho_n(\omega) for \omega \in \{0, \omega^*\}. From this we deduce that \rho_j \leq \rho_n on (-\infty, 0], \rho_j \geq \rho_n on [0, \omega^*], and \rho_j \leq \rho_n on [\omega^*, +\infty), obtaining (44). We see that \rho is decreasing on (-\infty, \omega^*] and increasing on [\omega^*, +\infty).

The remaining results follow directly by plugging specific values of \omega into (44).

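A quick numerical check of the piecewise formula (44) and of the optimality of \omega^*. The eigenvalues below are arbitrary assumed values in (0, 1]; the check itself only uses definition (41):

    import numpy as np

    # Assumed positive eigenvalues of B^{-1/2} E[Z] B^{-1/2} (all in (0, 1]).
    lams = np.array([0.05, 0.2, 0.45, 0.9])
    lam_min_plus, lam_max = lams.min(), lams.max()
    omega_star = 2.0 / (lam_min_plus + lam_max)
    zeta = lam_max / lam_min_plus

    def rho(omega):
        # rho(omega) = max_{i : lambda_i > 0} (1 - omega * lambda_i)^2, as in (41)
        return np.max((1.0 - omega * lams) ** 2)

    # Piecewise formula (44), checked on a grid of omega values.
    for omega in np.linspace(-0.5, 3.0, 71):
        if 0 <= omega <= omega_star:
            expected = (1.0 - omega * lam_min_plus) ** 2
        else:
            expected = (1.0 - omega * lam_max) ** 2
        assert np.isclose(rho(omega), expected)

    # The minimizer over a fine grid is (approximately) omega_star, and rho(omega_star)
    # matches the optimal rate (47): (1 - 2/(zeta + 1))^2.
    grid = np.linspace(0.0, 2.0 / lam_max, 100_001)
    print(grid[np.argmin([rho(w) for w in grid])], omega_star)
    print(rho(omega_star), (1.0 - 2.0 / (zeta + 1.0)) ** 2)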

Strong Convergence

Decrease of Distance is Proportional to f_S

Lemma 44 (Decrease of Distance)
Choose x_0 \in R^n and let {x_k}_{k=0}^\infty be the random iterates produced by Algorithm 2, with an arbitrary relaxation parameter \omega \in R. Let x_* \in L.

Then we have the identities \|x_{k+1} - x_k\|_B^2 = 2\omega^2 f_{S_k}(x_k), and

    \|x_{k+1} - x_*\|_B^2 = \|x_k - x_*\|_B^2 - 2\omega(2 - \omega) f_{S_k}(x_k).             (48)

Moreover, E[\|x_{k+1} - x_k\|_B^2] = 2\omega^2 E[f(x_k)], and

    E[\|x_{k+1} - x_*\|_B^2] = E[\|x_k - x_*\|_B^2] - 2\omega(2 - \omega) E[f(x_k)].           (49)

Remarks: Equation (48) says that for any x_* \in L, in the k-th iteration of Algorithm 2 the distance of the current iterate from x_* decreases by the amount 2\omega(2 - \omega) f_{S_k}(x_k).

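Identity (48) can be verified on a single step. The sketch below assumes the randomized Kaczmarz specialization of Algorithm 2 (B = I, unit-coordinate sketch S_k = e_i), in which case the update is x_{k+1} = x_k - \omega (a_i^\top x_k - b_i) a_i / \|a_i\|^2 and f_{S_k}(x) = (a_i^\top x - b_i)^2 / (2\|a_i\|^2):

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, omega = 5, 3, 1.4

    A = rng.normal(size=(m, n))
    x_star = rng.normal(size=n)
    b = A @ x_star                      # consistent system
    x = rng.normal(size=n)              # current iterate x_k

    i = rng.integers(m)                 # sketch S_k = e_i (one row of A)
    a, r = A[i], A[i] @ x - b[i]
    f_Sk = r ** 2 / (2 * (a @ a))       # f_{S_k}(x_k) for this special case
    x_new = x - omega * r / (a @ a) * a

    # The two identities of Lemma 44 (here B = I, so ||.||_B is the Euclidean norm).
    print(np.isclose(np.sum((x_new - x) ** 2), 2 * omega ** 2 * f_Sk))                  # -> True
    print(np.isclose(np.sum((x_new - x_star) ** 2),
                     np.sum((x - x_star) ** 2) - 2 * omega * (2 - omega) * f_Sk))       # -> True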

Lower Bound on a Quadratic


Lemma 45
Let Assumption 3 be satisfied. Then the inequality

    x^\top B^{-1/2} E[Z] B^{-1/2} x \geq \lambda_{\min}^+(B^{-1/2} E[Z] B^{-1/2}) \, x^\top x       (50)

holds for all x \in Range(B^{-1/2} A^\top).

Proof.
It is known that for any matrix M \in R^{m \times n}, the inequality

    x^\top M^\top M x \geq \lambda_{\min}^+(M^\top M) \, x^\top x

holds for all x \in Range(M^\top). Applying this with M = (E[Z])^{1/2} B^{-1/2}, we see that (50) holds for all x \in Range(B^{-1/2} (E[Z])^{1/2}). However,

    Range(B^{-1/2} (E[Z])^{1/2}) = Range(B^{-1/2} (E[Z])^{1/2} (B^{-1/2} (E[Z])^{1/2})^\top)
                                 = Range(B^{-1/2} E[Z] B^{-1/2}) = Range(B^{-1/2} A^\top),

where the last identity follows by combining Assumption 3 and Theorem 29.
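The matrix fact invoked at the start of the proof, x^\top M^\top M x \geq \lambda_{\min}^+(M^\top M) x^\top x for x \in Range(M^\top), is also easy to check numerically. A sketch with an arbitrary assumed rank-deficient M:

    import numpy as np

    rng = np.random.default_rng(4)

    # An assumed rank-2 matrix M, so that M^T M (4 x 4) has zero eigenvalues.
    M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))
    G = M.T @ M

    eigvals = np.linalg.eigvalsh(G)
    lam_min_plus = min(v for v in eigvals if v > 1e-10)   # smallest *nonzero* eigenvalue

    # Random points x = M^T w lie in Range(M^T); the bound holds for all of them.
    for _ in range(1000):
        x = M.T @ rng.normal(size=5)
        assert x @ G @ x >= lam_min_plus * (x @ x) - 1e-9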
Proof of Lemma 44 - I
Recall that Algorithm 2 performs the update

    x_{k+1} = x_k - \omega B^{-1} Z_k (x_k - x_*).

From this we get

    \|x_{k+1} - x_k\|_B^2 = \omega^2 \|B^{-1} Z_k (x_k - x_*)\|_B^2
                          \overset{(21)}{=} \omega^2 (x_k - x_*)^\top Z_k (x_k - x_*)
                          \overset{(22)}{=} 2\omega^2 f_{S_k}(x_k).                            (51)

In a similar vein,

    \|x_{k+1} - x_*\|_B^2 = \|(I - \omega B^{-1} Z_k)(x_k - x_*)\|_B^2
                          = (x_k - x_*)^\top (I - \omega Z_k B^{-1}) B (I - \omega B^{-1} Z_k)(x_k - x_*)
                          \overset{(21)}{=} (x_k - x_*)^\top (B - \omega(2 - \omega) Z_k)(x_k - x_*)
                          \overset{(22)}{=} \|x_k - x_*\|_B^2 - 2\omega(2 - \omega) f_{S_k}(x_k),      (52)

Proof of Lemma 44 - II
establishing (48).

Taking expectation in (51) and using the tower property, we get

    E[\|x_{k+1} - x_k\|_B^2] = E[E[\|x_{k+1} - x_k\|_B^2 | x_k]]
                             \overset{(51)}{=} 2\omega^2 E[E[f_{S_k}(x_k) | x_k]]
                             = 2\omega^2 E[f(x_k)],

where in the last step we have used the definition of f.

Taking expectation in (48), we get

    E[\|x_{k+1} - x_*\|_B^2] = E[E[\|x_{k+1} - x_*\|_B^2 | x_k]]
                             \overset{(52)}{=} E[\|x_k - x_*\|_B^2 - 2\omega(2 - \omega) f(x_k)]
                             = E[\|x_k - x_*\|_B^2] - 2\omega(2 - \omega) E[f(x_k)].

Quadratic Bounds

Lemma 46 (Quadratic bounds)
For all x \in R^n and x_* \in L we have

    \lambda_{\min}^+ \cdot f(x) \leq \tfrac{1}{2} \|\nabla f(x)\|_B^2 \leq \lambda_{\max} \cdot f(x),     (53)

and

    f(x) \leq \tfrac{\lambda_{\max}}{2} \|x - x_*\|_B^2.                                                  (54)

Moreover, if Assumption 3 holds, then for all x \in R^n and x_* = \Pi^B_L(x) we have

    \tfrac{\lambda_{\min}^+}{2} \|x - x_*\|_B^2 \leq f(x).                                                (55)

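A numerical illustration of (53)-(55), again in the assumed randomized Kaczmarz setting (B = I, uniform unit-coordinate sketches), where f(x) = \tfrac{1}{2}(x - x_*)^\top E[Z](x - x_*) and \nabla f(x) = E[Z](x - x_*):

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 8, 3

    A = rng.normal(size=(m, n))                  # full column rank with probability 1
    x_star = rng.normal(size=n)

    EZ = sum(np.outer(a, a) / (a @ a) for a in A) / m
    lams = np.linalg.eigvalsh(EZ)
    lam_min_plus, lam_max = min(v for v in lams if v > 1e-10), lams.max()

    for _ in range(1000):
        x = rng.normal(size=n)
        d = x - x_star                           # here Pi_L^B(x) = x_*, the unique solution
        f = 0.5 * d @ EZ @ d
        g = EZ @ d                               # gradient of f at x
        assert lam_min_plus * f - 1e-12 <= 0.5 * (g @ g) <= lam_max * f + 1e-12          # (53)
        assert lam_min_plus / 2 * (d @ d) - 1e-12 <= f <= lam_max / 2 * (d @ d) + 1e-12  # (54), (55)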

Proof of Lemma 46 - I
In view of (17) and (33), we obtain a spectral characterization of f:

    f(x) = \tfrac{1}{2} \sum_{i=1}^n \lambda_i (u_i^\top B^{1/2}(x - x_*))^2,                  (56)

where x_* is any point in L. On the other hand, in view of (28) and (33), we have

    \|\nabla f(x)\|_B^2 = \|B^{-1} E[Z](x - x_*)\|_B^2                                         (57)
                        = (x - x_*)^\top E[Z] B^{-1} E[Z](x - x_*)
                        = (x - x_*)^\top B^{1/2} (B^{-1/2} E[Z] B^{-1/2})(B^{-1/2} E[Z] B^{-1/2}) B^{1/2}(x - x_*)
                        = (x - x_*)^\top B^{1/2} U (U^\top B^{-1/2} E[Z] B^{-1/2} U)^2 U^\top B^{1/2}(x - x_*)
                        \overset{(33)}{=} (x - x_*)^\top B^{1/2} U \Lambda^2 U^\top B^{1/2}(x - x_*)
                        = \sum_{i=1}^n \lambda_i^2 (u_i^\top B^{1/2}(x - x_*))^2.              (58)

Inequality (53) follows by comparing (56) and (57), using the bounds

    \lambda_{\min}^+ \lambda_i \leq \lambda_i^2 \leq \lambda_{\max} \lambda_i,

which hold for those i for which \lambda_i > 0.


Proof of Lemma 46 - II

We now move to the bounds involving norms. First, note that for any x_* \in L we have

    f(x) \overset{(17)}{=} \tfrac{1}{2} (x - x_*)^\top E[Z](x - x_*)                           (59)
         = \tfrac{1}{2} (B^{1/2}(x - x_*))^\top (B^{-1/2} E[Z] B^{-1/2}) B^{1/2}(x - x_*).

The upper bound (54) follows by applying the inequality B^{-1/2} E[Z] B^{-1/2} \preceq \lambda_{\max} I.

If x_* = \Pi^B_L(x), then in view of (19), we have

    B^{1/2}(x - x_*) \in Range(B^{-1/2} A^\top).

Applying Lemma 45 to (59), we get the lower bound (55).


Strong Convergence
Theorem 47 (Strong convergence)
Let Assumption 3 (exactness) hold and set x_* = \Pi^B_L(x_0). Let {x_k} be the random iterates produced by Algorithm 2, where the relaxation parameter satisfies 0 < \omega < 2, and let r_k := E[\|x_k - x_*\|_B^2]. Then for all k \geq 0 we have

    (1 - \omega(2 - \omega)\lambda_{\max})^k r_0 \leq r_k \leq (1 - \omega(2 - \omega)\lambda_{\min}^+)^k r_0.     (60)

The best rate is achieved when \omega = 1.

Proof.
Let \phi_k = E[f(x_k)]. We have

    r_{k+1} \overset{(49)}{=} r_k - 2\omega(2 - \omega)\phi_k \overset{(55)}{\leq} r_k - \omega(2 - \omega)\lambda_{\min}^+ r_k,

and

    r_{k+1} \overset{(49)}{=} r_k - 2\omega(2 - \omega)\phi_k \overset{(54)}{\geq} r_k - \omega(2 - \omega)\lambda_{\max} r_k.

Inequalities (60) follow from this by unrolling the recurrences.

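A Monte Carlo sanity check of the two-sided bound (60), once more assuming the randomized Kaczmarz specialization of Algorithm 2 (B = I, uniform row sketches); \lambda_{\min}^+ and \lambda_{\max} below are the extreme nonzero eigenvalues of E[Z]:

    import numpy as np

    rng = np.random.default_rng(6)
    m, n, omega, K, runs = 10, 4, 1.0, 30, 5_000

    A = rng.normal(size=(m, n))
    x_star = rng.normal(size=n)
    b = A @ x_star
    x0 = rng.normal(size=n)

    EZ = sum(np.outer(a, a) / (a @ a) for a in A) / m
    lams = np.linalg.eigvalsh(EZ)
    lam_min_plus, lam_max = min(v for v in lams if v > 1e-10), lams.max()

    r = np.zeros(K + 1)                          # Monte Carlo estimate of r_k = E ||x_k - x_*||^2
    for _ in range(runs):
        x = x0.copy()
        r[0] += np.sum((x - x_star) ** 2) / runs
        for k in range(1, K + 1):
            i = rng.integers(m)
            a = A[i]
            x -= omega * (a @ x - b[i]) / (a @ a) * a
            r[k] += np.sum((x - x_star) ** 2) / runs

    k = np.arange(K + 1)
    lower = (1 - omega * (2 - omega) * lam_max) ** k * r[0]
    upper = (1 - omega * (2 - omega) * lam_min_plus) ** k * r[0]
    # Both checks should print True (up to Monte Carlo error, hence the 5% slack).
    print(np.all(lower <= 1.05 * r), np.all(r <= 1.05 * upper))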
Convergence of f(x_k)


Convergence of f(x_k)

Theorem 48 (Convergence of f)
Choose x_0 \in R^n, and let {x_k}_{k=0}^\infty be the random iterates produced by Algorithm 2, where the relaxation parameter satisfies 0 < \omega < 2.

(i) Let x_* \in L. The average iterate \hat{x}_k := \tfrac{1}{k} \sum_{t=0}^{k-1} x_t satisfies, for all k \geq 1,

    E[f(\hat{x}_k)] \leq \frac{\|x_0 - x_*\|_B^2}{2\omega(2 - \omega) k}.                       (61)

(ii) Now let Assumption 3 hold. For x_* = \Pi^B_L(x_0) and k \geq 0 we have

    E[f(x_k)] \leq (1 - \omega(2 - \omega)\lambda_{\min}^+)^k \, \frac{\lambda_{\max} \|x_0 - x_*\|_B^2}{2}.    (62)

The best rate is achieved when \omega = 1.

Proof of Theorem 48
(i) Let \phi_t = E[f(x_t)] and r_k = E[\|x_k - x_*\|_B^2]. By summing up the identities from (49), we get

    2\omega(2 - \omega) \sum_{t=0}^{k-1} \phi_t = r_0 - r_k.

Therefore, using Jensen's inequality, we get

    E[f(\hat{x}_k)] \leq E\Big[\tfrac{1}{k} \sum_{t=0}^{k-1} f(x_t)\Big] = \tfrac{1}{k} \sum_{t=0}^{k-1} \phi_t = \frac{r_0 - r_k}{2\omega(2 - \omega) k} \leq \frac{r_0}{2\omega(2 - \omega) k}.

(ii) Combining inequality (54) with Theorem 47, we get

    E[f(x_k)] \leq \frac{\lambda_{\max}}{2} E[\|x_k - x_*\|_B^2] \overset{(60)}{\leq} (1 - \omega(2 - \omega)\lambda_{\min}^+)^k \, \frac{\lambda_{\max} \|x_0 - x_*\|_B^2}{2}.

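Finally, a quick Monte Carlo check of bound (61) for the average iterate, in the same assumed Kaczmarz setting (B = I):

    import numpy as np

    rng = np.random.default_rng(7)
    m, n, omega, K, runs = 10, 4, 1.5, 40, 2_000

    A = rng.normal(size=(m, n))
    x_star = rng.normal(size=n)
    b = A @ x_star
    x0 = rng.normal(size=n)
    EZ = sum(np.outer(a, a) / (a @ a) for a in A) / m

    def f(x):                                    # f(x) = (1/2)(x - x_*)^T E[Z] (x - x_*)
        return 0.5 * (x - x_star) @ EZ @ (x - x_star)

    Ef_avg = 0.0                                 # Monte Carlo estimate of E[f(x_hat_K)]
    for _ in range(runs):
        x, s = x0.copy(), np.zeros(n)
        for _ in range(K):
            s += x                               # accumulate x_0, ..., x_{K-1}
            i = rng.integers(m)
            a = A[i]
            x -= omega * (a @ x - b[i]) / (a @ a) * a
        Ef_avg += f(s / K) / runs

    bound = np.sum((x0 - x_star) ** 2) / (2 * omega * (2 - omega) * K)   # RHS of (61)
    print(Ef_avg <= bound)                       # -> True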

