Applied Mathematics in Medical Physics: D. Fuentes
The University of Texas M.D. Anderson Cancer Center,
Department of Imaging Physics, Houston TX 77030, USA
Lecture Notes
References
[Aggarwal et al., 2001] Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior
of distance metrics in high dimensional space. Springer.
[CAMPEP, 2014] CAMPEP (2014). Standards for Accreditation of Graduate Educational Programs in
Medical Physics. Commission on Accreditation of Medical Physics Educational Programs.
[Cover and Thomas, 2012] Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John
Wiley & Sons.
[Goldstein and Osher, 2009] Goldstein, T. and Osher, S. (2009). The split bregman method for l1 regularized
problems. SIAM Journal on Imaging Sciences, 2(2):323–343.
[Golub and Van Loan, 1996] Golub, G. H. and Van Loan, C. F. (1996). Matrix computations. JHU Press, 3
edition.
[Greenberg, 1978] Greenberg, M. (1978). Foundations of applied mathematics. Prentice-Hall.
[Heath, 1998] Heath, M. (1998). Scientific computing: An introductory survey.
[Kreyszig, 1989] Kreyszig, E. (1989). Introductory functional analysis with applications, volume 21. Wiley.
[Nocedal and Wright, 1999] Nocedal, J. and Wright, S. (1999). Numerical optimization. Springer Verlag.
[Oden and Demkowicz, 1996] Oden, J. and Demkowicz, L. (1996). Applied functional analysis. CRC Press.
[Yin et al., 2008] Yin, W., Osher, S., Goldfarb, D., and Darbon, J. (2008). Bregman iterative algorithms for
l1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143–
168.
Contents
1 Preliminaries
1.1 Operation Counts ([Golub and Van Loan, 1996], Chapter 1.2.4)
6 Unconstrained Optimization
6.1 Characterizations of Solutions, [Nocedal and Wright, 1999] Ch 2
6.2 Search Directions, [Nocedal and Wright, 1999] Ch 2
6.3 Applications: Nonlinear Least Squares, [Heath, 1998] Ch. 6
6.4 Line Search and Trust Region Strategies and Convergence
6.5 Line Search, [Nocedal and Wright, 1999] Ch 3
6.6 Applications: Ill conditioned matrices and Convergence
6.7 Quasi Newton Hessian Approximations, [Nocedal and Wright, 1999] Ch 8
6.8 Trust Region, [Nocedal and Wright, 1999] Ch 4
6.9 Newton-Krylov Trust Region Methods, [Nocedal and Wright, 1999] Ch 6
A Homework I
B Homework II
∗∗ Advanced topics may be skipped on first read.
CAMPEP [CAMPEP, 2014] has standardized essential educational and experience requirements needed
to engage in medical physics research and development, and to enter a residency program in preparation for
clinical practice of one of the first three fields. The standardizations committee has outlined a core graduate
curriculum needed to meet these requirements (‘Core Graduate Curriculum’ [CAMPEP, 2014]). Within this
context, this lecture series is intended to provide an introduction to the mathematics of image formation
and image processing, (‘8.3 Fundamentals of Medical Imaging’ [CAMPEP, 2014]). Specific topics addressed
include:
We will focus on the mathematical structure of an optimization problem of the form
Here X is a vector space of feasible solutions in which we will look for a solution. d(·, ·) defines a distance
measure relevant to the application. A is an operator that embodies the physics of the therapy planning or
image reconstruction. You should become aware of the following logical thought process:
• At the very basic level, we will define precisely the mathematical spaces and functions we are working
with.
• We will build on the definitions to develop increasingly complex statements, i.e., ‘true’ statements will
be used to derive more complex ‘true’ statements.
1 Preliminaries
1.1 Operation Counts ([Golub and Van Loan, 1996], Chapter 1.2.4)
It is important to be aware of the floating point and memory operations required by the typical algorithms
encountered in research. The operation count largely determines how long you will wait for your program
to finish. We typically characterize the algorithmic complexity of an algorithm by the number of floating
point or memory operations it performs.
(Definition) O(n^p) We say an algorithm of complexity f(n) is O(n^p) if there is a constant c such that
|f(n)| is no larger than c n^p for all sufficiently large n.
for i = 1:m
for j = 1:n
y(i) = A(i, j) * x(j) + y(i)
end
end
Here we have 2 FLOP within a nested loop. For each iteration of row i, n iterations of 2 FLOP are
performed. Each row iteration is repeated m times. Thus the FLOP count is 2mn, which is O(n^2) when
m = n. For the memory operations, an n × 1 vector x and an m × n matrix A must be read from memory.
An m × 1 vector y must be read and written back.
for i = 1:m
for j = 1:n
y(i) = A(i, j) * x(j) + y(i) % write y(i); read A(i, j), x(j), y(i)
end
end
Here we have mn + 2m + n total memory operations, which is O(n^2) when m = n.
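The counts above can be checked numerically. The following is a minimal sketch, not part of the original example scripts: the sizes m and n and the counter names flops and memops are illustrative assumptions. FLOPs are tallied inside the loop, while the memory count follows the accounting used in the text (each array moved once, with y both read and written).

```matlab
% Sketch: verify the FLOP and memory operation counts for y = A*x + y.
% m, n, flops, and memops are illustrative names, not from the notes.
m = 4; n = 3;
A = rand(m, n); x = rand(n, 1); y = zeros(m, 1);
flops = 0;
for i = 1:m
    for j = 1:n
        y(i) = A(i, j) * x(j) + y(i);
        flops = flops + 2;            % one multiply and one add
    end
end
% Data movement: A and x are read once, y is read and written back.
memops = numel(A) + numel(x) + 2 * numel(y);
fprintf('FLOP count   = %d (2mn = %d)\n', flops, 2*m*n);
fprintf('memory count = %d (mn + 2m + n = %d)\n', memops, m*n + 2*m + n);
```

Increasing m and n while keeping m = n shows both counts growing quadratically, which is the O(n^2) behavior stated above.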
Example 2 (Linear system solve operation count). Gaussian elimination to solve a linear system requires
(2/3) n^3 floating point operations for the LU factorization and 2 n^2 floating point operations for the forward
and backward substitution. In this case f(n) is O(n^3).
Data movement for this algorithm requires n^2 storage plus additional vectors to store the right hand side and
the solution vector. Memory operations are O(n^2).
Operation counts for common algorithms you will encounter in research are listed in Table 1. An interface to
these algorithms is commonly found in various vendor library implementations such as the BLAS, LAPACK,
and/or MKL.
end
end
end
toc
Elapsed time is 12.639929 seconds.
Now consider the BLAS 1 implementation. The underlying kernel is a dot product or inner product that we
will discuss in Section 4.
(a, b)
tic
for iii = 1:Num
for jjj = 1:Num
C(iii,jjj) = A(iii,:) * B(:,jjj);
end
end
toc
Elapsed time is 13.851716 seconds.
Now consider the BLAS 2 implementation. The underlying kernel is a matrix-vector multiply or the action
of a linear operator on a vector, Section 3.
y = Ax
tic
for jjj = 1:Num
C(:,jjj) = A * B(:,jjj);
end
toc
Elapsed time is 0.181035 seconds.
Finally consider the BLAS 3 implementation. The underlying kernel is a matrix-matrix multiply or the
composition of two linear operators, Section 3.
A◦B
tic
A*B;
toc
Elapsed time is 0.043407 seconds.
As a rule of thumb, cast your algorithm in terms of the higher level BLAS when possible to achieve maximum
efficiency.
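The four variants can be placed side by side in one script. This is a hedged sketch rather than the original example file: the scalar triple loop is an assumed naive form (its listing is truncated in these notes), Num is deliberately small so the loop versions finish quickly, and the variable names C0 to C3 and maxRelDiff are illustrative. Timings depend on the machine and the linked BLAS, so none are quoted here.

```matlab
% Sketch: the same matrix product written with scalar, BLAS 1, BLAS 2,
% and BLAS 3 style kernels. Names and sizes are illustrative assumptions.
Num = 60;
A = rand(Num); B = rand(Num);

C0 = zeros(Num);                       % scalar triple loop (assumed form)
tic
for iii = 1:Num
    for jjj = 1:Num
        for kkk = 1:Num
            C0(iii,jjj) = C0(iii,jjj) + A(iii,kkk) * B(kkk,jjj);
        end
    end
end
tScalar = toc;

C1 = zeros(Num);                       % BLAS 1: dot products
tic
for iii = 1:Num
    for jjj = 1:Num
        C1(iii,jjj) = A(iii,:) * B(:,jjj);
    end
end
tBlas1 = toc;

C2 = zeros(Num);                       % BLAS 2: matrix-vector products
tic
for jjj = 1:Num
    C2(:,jjj) = A * B(:,jjj);
end
tBlas2 = toc;

tic                                    % BLAS 3: matrix-matrix product
C3 = A * B;
tBlas3 = toc;

fprintf('scalar %g s, BLAS1 %g s, BLAS2 %g s, BLAS3 %g s\n', ...
        tScalar, tBlas1, tBlas2, tBlas3);
% All variants compute the same product up to rounding.
maxRelDiff = max([norm(C0 - C3, 'fro'), norm(C1 - C3, 'fro'), ...
                  norm(C2 - C3, 'fro')]) / norm(C3, 'fro');
```

The final check confirms that the restructuring changes only the performance, not the result.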
It is useful to estimate the theoretical run time of the typical compute bound or memory bound algorithms
encountered in research. A typical computing architecture consists of a hierarchy of memory each with its
own bandwidth and latency characteristics, Figure 2. For simplicity, we will consider an ‘effective’/average
of the memory bandwidth and number of CPUs in the calculations.
Example 4 (Matrix Multiplication). BLAS 3 matrix multiplication is an example of a compute bound
algorithm in which the memory transfer overhead is hidden by the computations, i.e., the vendors have
invested significant resources into the code design so that computations are being performed simultaneously
with the data transfer to achieve near peak performance. Consider an Intel(R) Xeon(R) CPU with a clock
speed of 2.90 GHz and 6-12 physical cores (6 cores per socket). The Xeon is capable of 4 floating point
operations per cycle.
maeda$ cat /proc/cpuinfo | grep "model name\|cpu cores" | head -n 2
model name : Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz
cpu cores : 6
Courtesy https://www.tacc.utexas.edu/user-services/training/course-materials
Figure 2: Computing Architecture. The typical computing architecture consists of a hierarchy of memory.
The closer the memory is to the processor, the faster the access and bandwidth will be. The memory clock
cycle is typically slower than the CPU clock cycle. Algorithms need to hide the cost of the data transfer with
overlapping computations to achieve good performance.
Here, as an upper and lower bound, we consider the range of cores that the library may use. An experimental
verification of the performance is given below; > 90% of peak performance is achieved.
\exampledir/ExMatMatMult.m
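close all
clear all
% Peak-performance bounds. These definitions and the loop header are missing
% from the listing and are reconstructed (an assumption) from the hardware
% figures quoted above and from ExMatVecMult.m: 2.9 GHz, 4 FLOP per cycle,
% 6 to 12 cores.
CPUSpeed = 2.9; % GHz
instructionPerCycle = 4;
LBMaxGFLOPs = CPUSpeed * instructionPerCycle * 6;
UBMaxGFLOPs = CPUSpeed * instructionPerCycle * 12;
NSizeList = [1000:1000:10000];
GFLOPsPerformance = zeros(size(NSizeList));
for iii = 1:length(NSizeList)
Num = NSizeList(iii);
A = rand(Num);
B = rand(Num);
tstart = tic;
A*B;
telapsed = toc(tstart);
GFLOPsPerformance(iii) = 2*Num^3 / telapsed /1.e9 ;
PercentagePeakAchieved = GFLOPsPerformance(iii)/UBMaxGFLOPs
end
handle = figure(1);
set(gca,'FontSize',16)
plot(NSizeList ,GFLOPsPerformance )
hold on
plot(xlim,[.9* UBMaxGFLOPs .9* UBMaxGFLOPs ],'r--')
plot(xlim,[LBMaxGFLOPs LBMaxGFLOPs ],'r-.')
xlabel('N')
ylabel('GFLOPs')
legend('measure', '.9 peak UB', 'peak LB','Location','East')
saveas(handle,'PeakFLOP','png')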
Example 5 (Matrix Vector Multiplication). The matrix vector multiply is an example of a memory bound
algorithm. The bottleneck is typically the time to transfer the data from the RAM to the processors.
Compared to the matrix matrix multiply, significantly fewer FLOPs are achieved. Consider a range of the quoted
bandwidth for the architecture. The quoted peak bandwidth is 51.2 GB/s. We are measuring 21.5 GB/s.
http://ark.intel.com/products/64589/Intel-Xeon-Processor-E5-2667-15M-Cache-2_90-GHz-8_00-GTs-Intel-QPI
ans =
21476
Figure 4: Performance of matrix vector multiply. Floating point efficiency is drastically reduced as compared
to the matrix matrix multiply in Figure 3. The memory clock cycle 1.6GHz is slower than the CPU clock cycle
2.9 GHz and the algorithm is not able to hide the cost of the data transfer with overlapping computations.
\exampledir/ExMatVecMult.m
close all
clear all
%http://ark.intel.com/products/64589/Intel-Xeon-Processor-E5-2667-15M-Cache-2_90-GHz-8_00-GTs-Intel-QPI
CPUSpeed = 2.9; % GHz
numberCore = 12;
instructionPerCycle = 4;
MaxGFLOPs = CPUSpeed * instructionPerCycle * numberCore
UBBandwidth = 51.2 ; % GB/s
LBBandwidth = 21.5 ; % GB/s
NSizeList = [1000:1000:20000];
GFLOPsPerformance = zeros(size(NSizeList));
MEMPerformance = zeros(size(NSizeList));
handle = figure(1);
set(gca,'FontSize',16)
plot(NSizeList ,GFLOPsPerformance, 'k' )
hold
plot(xlim,[.05* MaxGFLOPs, .05*MaxGFLOPs ],'r--')
xlabel('N')
ylabel('GFLOPs')
legend('measure', '5% peak ','Location','East')
saveas(handle,'PeakFLOPMatVec','png')
handle = figure(2);
set(gca,'FontSize',16)
plot(NSizeList ,MEMPerformance )
hold
plot(xlim,[UBBandwidth UBBandwidth ],'r--')
plot(xlim,[LBBandwidth LBBandwidth ],'r-.')
xlabel('N')
ylabel('MB/s')
legend('measure', 'UB BW' ,'LB BW','Location','NorthEast')
saveas(handle,'PeakBWMatVec','png')
xi ⊕ xj = xj ⊕ xi (Commutative)
xi ⊕ (xj ⊕ xk) = (xi ⊕ xj) ⊕ xk (Associative)
∀ xi, xj, xk ∈ X
(ii) The space X contains a ‘zero vector’, 0, and for each vector x there exists a ‘−x’
(iii) Another operation, ∗, is defined such that the vector space X is closed under multiplication of vectors by
scalars, α ∗ x ∈ X. Scalar multiplication satisfies the following properties
α ∗ (β ∗ xi) = (α × β) ∗ xi (Associative)
α ∗ (xi ⊕ xj) = α ∗ xi ⊕ α ∗ xj (Distributive)
(α + β) ∗ xi = α ∗ xi ⊕ β ∗ xi (Distributive)
1 ∗ xi = xi
0 ∗ xi = 0 (on the left, 0 ∈ R is the zero scalar; on the right, 0 ∈ X is the zero vector)
∀ xi, xj, xk ∈ X, α, β ∈ R
Notice that the + and ⊕ mean two very different concepts in the distributive law of scalar multiplication.
(α + β) ∗ xi = α ∗ xi ⊕ β ∗ xi
The + is the usual scalar addition and the ⊕ is the vector addition for the space we are working in. However,
we typically abuse notation and denote vector addition as “+” without confusion.
Notice that we intentionally do not specify the nature of the underlying elements of the spaces we are
studying. Rather, we assume that the framework is general enough to encompass any application in our
research that we may be interested in.
Example 6 (Vector algebra in Rn ). Defined in the usual component-wise way
Example 7 (Space of continuous functions). Consider the space of continuous functions of independent
variable t over the domain [a, b].
Notice that in this space, each ‘point’ represents a function. Vector algebra defined as you might expect
Example 8 (Floating Point Arithmetic). It is important to realize that the usual associative laws typically
expected for vector spaces are not satisfied for floating point arithmetic.
>> eps(2^54)
ans =
4
This is due to the finite precision of floating point arithmetic and, in general, occurs when adding a large
number to a small number.
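The loss of associativity can be seen directly. This is a minimal sketch assuming IEEE double precision (the MATLAB default); the variable names big, small, left, and right are illustrative. At magnitude 2^53 the spacing between adjacent doubles is 2, so adding 1 to 2^53 is rounded away.

```matlab
% Sketch: floating point addition is not associative (IEEE double precision).
big   = 2^53;                    % eps(big) is 2, so big + 1 is not representable
small = 1;
left  = (small + big) - big;     % small is absorbed by big, then big cancels
right = small + (big - big);     % exact cancellation first, then small is added
fprintf('(small + big) - big = %g\n', left);
fprintf('small + (big - big) = %g\n', right);
fprintf('spacing at 2^53     = %g\n', eps(big));
```

The two groupings differ by the full size of the small term, which is why summation order matters when accumulating values of very different magnitude.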
d(4, 9) = |4 − 9| = 5
Figure 5: d(x, y) = |x − y|. Analogous to the absolute value on the real line, R, we are interested in defining
a distance on the abstract vector and function spaces that may typically arise in research.
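(Definition) Metric The function d : X × X → R+ is known as a metric if, for all x, y, z ∈ X, it satisfies
(M1) d(x, y) = 0 if and only if x = y (zero distance)
(M2) d(x, y) = d(y, x) (symmetry)
(M3) d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
(M3) agrees with intuition and may be used to show that the shortest path between two points is a straight
line.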
Example 9 (Distance in R^n). The canonical example is 3-dimensional real space, R^3
d(x, y) = \sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2 + (\xi_3 - \eta_3)^2}
It's not difficult to see that this satisfies the properties of the metric.
(M1) Given that the distance is 0, properties of positive numbers on the real line
may be used to show that the two vectors are in fact the same.
\sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2 + (\xi_3 - \eta_3)^2} = 0 \Rightarrow (\xi_i - \eta_i)^2 = 0 \Rightarrow \xi_i = \eta_i
Conversely, if the two vectors are identical the metric is zero by direct evaluation.
\xi_i = \eta_i \Rightarrow \sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2 + (\xi_3 - \eta_3)^2} = 0
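(M3) Showing the triangle inequality is a bit more tricky and requires Minkowski’s inequality.
\left( \sum_{i=1}^n |a_i + b_i|^p \right)^{1/p} \le \left( \sum_{i=1}^n |a_i|^p \right)^{1/p} + \left( \sum_{i=1}^n |b_i|^p \right)^{1/p} \qquad \forall a, b \in R^n, \quad 1 \le p < \infty
\left( \sum_{i=1}^n (\xi_i + 0 - \eta_i)^2 \right)^{1/2} = \left( \sum_{i=1}^n (\xi_i \pm \gamma_i - \eta_i)^2 \right)^{1/2} \le \left( \sum_{i=1}^n (\xi_i - \gamma_i)^2 \right)^{1/2} + \left( \sum_{i=1}^n (\gamma_i - \eta_i)^2 \right)^{1/2}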
An example calculation of the distance defined by the 2-norm is provided below. \exampledir/VectorTwoNormCalc.m
x=[9;-10;7]
y=[-4;-5;3]
sqrt( (x(1) - y(1))^2 + (x(2) - y(2))^2 + (x(3) - y(3))^2 )
>> echo on
>> VectorTwoNormCalc
x=[9;-10;7]
x =
9
-10
7
y=[-4;-5;3]
y =
-4
-5
3
ans =
14.4914
Example 10 (Distance in C). In MR applications the conventional workspace is the complex plane, C. For
two complex numbers x = ξ1 + ξ2 i and y = η1 + η2 i
d(x, y) = \sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2}
Example 11 (Space of continuous functions). Consider the space of continuous functions of independent
variable t over the domain [a, b].
C[a, b] ≡ {f : [a, b] → R : f is continuous}
Notice that in this space, each ‘point’ represents a function. The max difference between the functions over
the domain [a, b] defines a metric on this space.
d(x, y) = \max_{t \in [a,b]} |x(t) - y(t)|
(M1) Properties of the absolute value may be used to show that two functions with zero distance are the same
and, conversely, that identical functions have zero distance.
\max_{t \in [a,b]} |x(t) - y(t)| = 0 \Leftrightarrow x(t) = y(t) \quad \forall t
(M3) The triangle inequality for the absolute value may be used to show that the max satisfies the triangle
inequality for the metric.
|x(t) - y(t)| = |x(t) - z(t) + z(t) - y(t)| \le |x(t) - z(t)| + |z(t) - y(t)| \le \max_{t \in [a,b]} |x(t) - z(t)| + \max_{t \in [a,b]} |z(t) - y(t)| \quad \forall t
Because this holds for all t, it holds in particular at the t that attains the max, and we have the result
\max_{t \in [a,b]} |x(t) - y(t)| \le \max_{t \in [a,b]} |x(t) - z(t)| + \max_{t \in [a,b]} |z(t) - y(t)|
>> echo on
>> ExFunctionDistanceCheb
t = [0:.1:10];
x = t.^2;
y = 20*ones(size(t));
handle = figure
handle =
plot(t,x)
hold
Current plot held
plot(t,y)
% save matlab plot
% saveas(handle,’FunctionDistance’,’png’)
max(abs(x-y))
ans =
80
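Example 12 (Distance Between Images). Suppose we want to measure the distance between a transformation
of two images, I : [0, 1] × [0, 1] ⊂ R^2 → R and J : [0, 1] × [0, 1] ⊂ R^2 → R.
d(I, J) = \sqrt{ \int_0^1 \int_0^1 (I(x, y) - J(x, y))^2 \, dx \, dy } \approx \sqrt{ \sum_{i,j} (I(i \Delta x, j \Delta y) - J(i \Delta x, j \Delta y))^2 \, \Delta x \, \Delta y }
For I(x, y) = x + sin(π y) and J(x, y) = sin(π y),
d(I, J) = \sqrt{ \int_0^1 \int_0^1 (x + \sin(\pi y) - \sin(\pi y))^2 \, dx \, dy } = \sqrt{ \int_0^1 dy \int_0^1 x^2 \, dx } = \sqrt{ \left[ \frac{x^3}{3} \right]_0^1 } = \sqrt{\frac{1}{3}}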
\exampledir/ExLTwoImageDistance.m
close all
clear all
% Figure panels: I(x, y) = x + sin(pi y) and J(x, y) = sin(pi y)
delta = 5.e-4;
[X,Y] = meshgrid([0:delta:1],[0:delta:1]);
I = X + sin(pi*Y);
J = sin(pi*Y);
handle = figure; imagesc(I)
%saveas(handle, ’ImageDistanceOne’, ’png’)
handle = figure; imagesc(J)
%saveas(handle, ’ImageDistanceTwo’, ’png’)
norm(I(:)-J(:),2)*sqrt(delta*delta)
sqrt(1/3)
>> echo on
>> ExLTwoImageDistance
close all
clear all
delta = 5.e-4;
[X,Y] = meshgrid([0:delta:1],[0:delta:1]);
I = X + sin(pi*Y);
J = sin(pi*Y);
handle = figure; imagesc(I)
%saveas(handle, ’ImageDistanceOne’, ’png’)
handle = figure; imagesc(J)
%saveas(handle, ’ImageDistanceTwo’, ’png’)
norm(I(:)-J(:),2)*sqrt(delta*delta)
ans =
0.5777
sqrt(1/3)
ans =
0.5774
Example 13 (Dice Similarity Measure). A common measure for the agreement between segmented/labeled
images is the Dice Similarity Coefficient (DSC). The DSC of two sets A and B is proportional to the
area/volume of the overlap A ∩ B normalized by the combined area/volume of the two sets. The proportionality
constant 2 is chosen so that the DSC has a max value of 1.
DSC(A, B) \equiv 2 \frac{|A \cap B|}{|A| + |B|}, \qquad 0 \le DSC(A, B) \le 1
The DSC itself is not a metric because identical sets do not have zero distance:
DSC(A, A) = 1 \ne 0
How about
d(A, B) = 1 − DSC(A, B)
Zero distance, (1)(M1), is satisfied through the definition of set intersection.
Symmetry, (1)(M2), is satisfied by the commutative property of set intersection and addition.
d(A, B) = d(B, A)
Triangle inequality, (1)(M3), is not satisfied. Consider A, B, and C = A ∪ B such that A ∩ B = ∅ and |A| = |B|
d(A, B) = 1 - DSC(A, B) = 1 - 2 \frac{|A \cap B|}{|A| + |B|} = 1 \qquad \text{since } |A \cap B| = 0
d(A, C) = 1 - 2 \frac{|A \cap (A \cup B)|}{|A| + |A \cup B|} = 1 - 2 \frac{|A|}{|A| + |A| + |B|} = \frac{1}{3}
d(B, C) = 1 - 2 \frac{|B \cap (A \cup B)|}{|B| + |A \cup B|} = 1 - 2 \frac{|B|}{|B| + |B| + |A|} = \frac{1}{3}
d(A, B) = 1 > d(A, C) + d(B, C) = \frac{2}{3}
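The counterexample can be reproduced with small binary masks. This is an illustrative sketch, not one of the original example scripts: the masks A, B, C and the helper names dsc and dist are assumptions chosen to match the construction above (disjoint sets of equal size and their union).

```matlab
% Sketch: 1 - DSC violates the triangle inequality for disjoint A, B
% and C = A union B. Mask sizes and helper names are illustrative.
dsc  = @(A, B) 2 * nnz(A & B) / (nnz(A) + nnz(B));
dist = @(A, B) 1 - dsc(A, B);

A = false(1, 8); A(1:4) = true;    % |A| = 4
B = false(1, 8); B(5:8) = true;    % |B| = 4, disjoint from A
C = A | B;                         % C = A union B

dAB = dist(A, B);
dAC = dist(A, C);
dBC = dist(B, C);
fprintf('d(A,B) = %g, d(A,C) + d(B,C) = %g\n', dAB, dAC + dBC);
```

Going through C is 'shorter' than going directly from A to B, which is exactly what (M3) forbids for a metric.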
Another measure of ‘distance’ commonly used is the cross correlation (CC) or the normalized cross
correlation (NCC).
(Definition) Normalized Cross Correlation (NCC) Given two images A : Ω → R and B : Ω → R,
Ω ⊂ R^d, the normalized cross correlation is computed as
NCC(A, B) = - \frac{ \left( \vec{a} - \hat{a}, \vec{b} - \hat{b} \right)^2 }{ \| \vec{a} - \hat{a} \|^2 \, \| \vec{b} - \hat{b} \|^2 } \qquad (2)
Here (·, ·) denotes the inner product, Section 4, and ‖·‖ denotes the norm, Section 2.4. \vec{a} and \vec{b} denote the
vectors of intensities.
The NCC is defined in terms of the norm and inner product (Sections 2.4, 4). We will see that the norm
and inner product define a distance
d(x, y) = \| x - y \| = \sqrt{(x - y, x - y)}
Hence the NCC is bounded below by -1 and (using the definition of distance) above by 0.
-1 = - \frac{ \| \vec{a} - \hat{a} \|^2 \, \| \vec{b} - \hat{b} \|^2 }{ \| \vec{a} - \hat{a} \|^2 \, \| \vec{b} - \hat{b} \|^2 } \le - \frac{ \left( \vec{a} - \hat{a}, \vec{b} - \hat{b} \right)^2 }{ \| \vec{a} - \hat{a} \|^2 \, \| \vec{b} - \hat{b} \|^2 }
-1 \le NCC \le 0
The NCC is symmetric
NCC(A, B) = NCC(B, A)
However, (M1) is not satisfied
NCC(A, A) = - \frac{ ( \vec{a} - \hat{a}, \vec{a} - \hat{a} )^2 }{ \| \vec{a} - \hat{a} \|^2 \, \| \vec{a} - \hat{a} \|^2 } = - \frac{ \| \vec{a} - \hat{a} \|^4 }{ \| \vec{a} - \hat{a} \|^2 \, \| \vec{a} - \hat{a} \|^2 } = -1
Triangle inequality (M3) for the NCC is left as a homework exercise.
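The properties above can be checked on small intensity vectors. This is a sketch under one stated assumption: the hat quantity in equation (2) is taken to be the mean intensity of the image, which the notes do not state explicitly. The helper name ncc and the sample vectors are illustrative.

```matlab
% Sketch: NCC of equation (2), ASSUMING a-hat and b-hat are the mean
% intensities. The vectors a and b are arbitrary illustrative data.
ncc = @(a, b) -dot(a - mean(a), b - mean(b))^2 / ...
              (norm(a - mean(a))^2 * norm(b - mean(b))^2);

a = [1; 4; 2; 8; 5];
b = [2; 3; 7; 1; 6];

nccAA = ncc(a, a);      % identical images
nccAB = ncc(a, b);
nccBA = ncc(b, a);
fprintf('NCC(A,A) = %g, NCC(A,B) = %g, NCC(B,A) = %g\n', nccAA, nccAB, nccBA);
```

Identical images give the minimum value rather than zero, so the NCC behaves as a similarity score to be minimized rather than as a metric.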
(Definition) Entropy ([Cover and Thomas, 2012] Chapter 2) The concept of entropy appears in image
processing as a quantitative measure of the information content. Given a discrete probability function
p : \Omega \to R^+, \qquad \Omega \equiv \{ c_i = [a + i \, dx, \; a + (i + 1) \, dx) \subset [a, b], \quad i = 0, ..., N = (b - a)/dx - 1 \}, \qquad \sum_{i=1}^N p(c_i) = 1, \quad p_i \equiv p(c_i)
the entropy is defined as
H(p) = - \sum_{i=1}^N p_i \log p_i
Lemma 2.1 (Log sum inequality). Given a_1, a_2, ..., a_n and b_1, b_2, ..., b_n non-negative
\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \ge \left( \sum_{i=1}^n a_i \right) \log \frac{ \sum_{i=1}^n a_i }{ \sum_{i=1}^n b_i }
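Properties of Entropy
• L’Hospital’s rule justifies the limiting case p → 0 and the notation 0 log 0 = 0
\lim_{p \to 0} p \log p = \lim_{p \to 0} \frac{\log p}{1/p} = \lim_{p \to 0} \frac{1/p}{-1/p^2} = \lim_{p \to 0} -p = 0
• For a general distribution, the log sum inequality with a_i = p_i and b_i = 1 implies
\sum_{i=1}^N p_i \log \frac{p_i}{1} \ge \left( \sum_{i=1}^N p_i \right) \log \frac{1}{N}, \qquad \sum_{i=1}^N p_i = 1
that is, -H(p) ≥ -log N, so H(p) ≤ log N, the entropy of the uniform distribution.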
Figure 8: Entropy. Similar to thermodynamics, the entropy is a measure of the spread or uncertainty in
a probability distribution. A uniform probability distribution has the largest entropy. This agrees with
our intuition. If we have relatively little information about a model parameter/variable, then it can be
anywhere uniformly within the interval. The more information we have, the more ‘peaked’ the probability
distribution. For example, suppose an object is located with uniform probability, x ∼ U[a, b]. The object
is located between [a, b] with equal probability and we are uncertain where it may actually be. Compared to
this, for x ∼ N[(a + b)/2, 1] we are more certain that the object is likely to be located near the mid-point, so we have
relatively more information about the location of the object.
\exampledir/ExEntropy.m
clear all
close all
UniformEntropy = -sum(uniformpdf.*log(uniformpdf))
handle1=figure(1);
bar(x, uniformpdf,'k')
h = findobj(gca,'Type','Patch');
set(h,'FaceColor',[1 1 1], 'EdgeColor','black');
hold
for iii = 1:length(sigmaList)
y1 = pdf('normal', x, mu1, sigmaList(iii) );
normalentropy(iii) = -sum(y1.*log(y1));
plot(x, y1)
end
set(gca,'FontSize',16)
xlabel('x')
ylabel('p(x)')
saveas(handle1,'EntropyBins','png')
handle2=figure(2);
plot(sigmaList,normalentropy)
hold
plot(xlim,[UniformEntropy UniformEntropy],'r--')
set(gca,'FontSize',16)
xlabel('sigma')
ylabel('Entropy')
legend('normal', 'uniform' ,'Location','SouthEast')
saveas(handle2,'EntropyValue','png')
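Because the listing above omits the definitions of x, uniformpdf, mu1, and sigmaList, the following self-contained sketch illustrates the same comparison under assumed values. The grid, the list of standard deviations, and the helper entropyOf are assumptions; the normal distribution is discretized on the grid and renormalized to sum to 1, which avoids the Statistics Toolbox pdf call.

```matlab
% Sketch: entropy of a uniform histogram versus discretized normals.
% N, x, sigmaList, and entropyOf are illustrative assumptions.
N = 50;
x = linspace(-5, 5, N);
uniformpdf = ones(1, N) / N;
sigmaList = [0.5 1 2];

entropyOf = @(p) -sum(p(p > 0) .* log(p(p > 0)));   % uses 0 log 0 = 0

UniformEntropy = entropyOf(uniformpdf);
normalentropy = zeros(size(sigmaList));
for iii = 1:numel(sigmaList)
    w = exp(-x.^2 / (2 * sigmaList(iii)^2));   % unnormalized normal, mean 0
    normalentropy(iii) = entropyOf(w / sum(w)); % normalize to sum to 1
end
fprintf('uniform entropy = %g (log N = %g)\n', UniformEntropy, log(N));
fprintf('normal entropy, sigma = %g: %g\n', [sigmaList; normalentropy]);
```

The uniform histogram attains log N, and the more peaked the normal distribution, the lower its entropy.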
Algebraically manipulating the entropy yields an interpretation of the entropy as a ‘distance’ or divergence
from the entropy of a uniform distribution, p_0 = 1/N.
H(p) = - \sum_{i=1}^N p_i \log p_i + \sum_i p_i \log (p_0 N) \qquad \text{(the added term is 0 since } p_0 N = 1 \text{)}
     = \log N - \sum_{i=1}^N p_i \log \frac{p_i}{p_0}
     = H(p_0) - \sum_{i=1}^N p_i \log \frac{p_i}{p_0}
This difference motivates the definition of the Kullback Leibler distance or relative entropy between two
probability distributions.
D(p||p_0) = \sum_{i=1}^N p_i \log \frac{p_i}{p_0} = H(p_0) - H(p)
(Definition) Kullback Leibler Distance The Kullback Leibler distance between two probability
distributions, p and q, is defined as the relative entropy
D(p||q) = \sum_{i=1}^N p_i \log \frac{p_i}{q_i}
For two general probability distributions, the difference in the entropy may be interpreted as the difference
in the relative entropy with the uniform probability distribution.
H(p) − H(q) = D(q||p0) − D(p||p0)
However, the Kullback Leibler Distance is not a ‘distance’ in the sense of a metric. Both the symmetry and
triangle inequality properties are missing.
Example 14. The example below numerically evaluates the symmetry and triangle inequality within the
context of the Kullback Leibler ‘Distance’
D(p||q) \ne D(q||p)
D(p||q) > D(q||r) + D(p||r)
\exampledir/ExEntropyCounterExample.m
clear all
close all
% D(pdf1 || pdf2)
RelEntropy12 = sum(pdf1.*log(pdf1.* pdf2.^(-1) ))
RelEntropy21 = sum(pdf2.*log(pdf2.* pdf1.^(-1) ))
Symmetry = RelEntropy12 - RelEntropy21
% D(pdf1 || pdf3)
RelEntropy13 = sum(pdf1.*log(pdf1.* pdf3.^(-1) ))
% D(pdf2 || pdf3)
RelEntropy23 = sum(pdf2.*log(pdf2.* pdf3.^(-1) ))
RelEntropy12 =
0.0628
RelEntropy21 =
0.0548
Symmetry =
0.0080
RelEntropy13 =
0.0085
RelEntropy23 =
0.0331
TriangleInequality =
i.e., within this space, the shortest distance between two points may not necessarily be a straight line.
Nevertheless, the Kullback Leibler Distance is heavily used within mutual information-based image registration.
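Since the definitions of pdf1, pdf2, and pdf3 are not reproduced in the listing above, a self-contained sketch with three small hand-picked distributions shows the same two failures. The distributions p, q, r and the helper kl are illustrative assumptions, not the data behind the quoted output.

```matlab
% Sketch: the Kullback Leibler 'distance' is neither symmetric nor
% satisfies the triangle inequality. p, q, r are illustrative choices.
kl = @(p, q) sum(p .* log(p ./ q));

p = [0.5 0.5];
q = [0.9 0.1];
r = [0.7 0.3];

Dpq = kl(p, q);
Dqp = kl(q, p);
Dqr = kl(q, r);
Dpr = kl(p, r);

Symmetry = Dpq - Dqp;                    % nonzero: not symmetric
TriangleInequality = Dpq - (Dqr + Dpr);  % positive: D(p||q) > D(q||r) + D(p||r)
fprintf('D(p||q) = %g, D(q||p) = %g\n', Dpq, Dqp);
fprintf('D(q||r) + D(p||r) = %g\n', Dqr + Dpr);
```

A positive TriangleInequality value reproduces the inequality stated in Example 14 for this particular choice of distributions.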
(Definition) Mutual Information Given two images A : Ω → R and B : Ω → R, Ω ⊂ Rd with probability
intensities p(a) and p(b), respectively, the mutual information between the two images, I(A, B), is defined
as
I(A, B) = D (p(a, b)||p(a)p(b)) = H(A) + H(B) − H(A, B)
For image registration, we typically want to maximize the mutual information. This corresponds to
minimizing the joint entropy/uncertainty. Interpreting mutual information within the context of the Kullback
Leibler ‘distance’, maximizing the mutual information maximizes the distance between the joint distribution
and the product of the marginal distributions, which is the joint distribution of independent images, i.e., two
registered images will be highly correlated.
The joint entropy is defined in an analogous manner.
H(p(a, b)) = \sum_i \sum_j p_{ij} \log (1/p_{ij})
Here the probability intensities are defined through the image histograms normalized to the number of pixels.
Example code for a single image is below. The joint histogram is defined in an analogous manner. ie count
% reshape to 1D array
IntensityImage = IntensityImage.img(:);
histogram = histogram /length(IntensityImage); % normalize to 1
entropy = -sum(histogram .*log(histogram ))
handle2 = figure(2);
set(gca,'FontSize',16)
bar( histogram ,'k')
saveas(handle2,'ImageHistogram','png')
>> ImageEntropy
entropy =
2.3567
• (M3). Kullback Leibler does not satisfy the triangle inequality, as we have seen in the previous example.
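The mutual information can be sketched from a joint histogram using I(A, B) = H(A) + H(B) − H(A, B). This is an illustrative example on tiny synthetic label images, not the registration code of these notes: the images a and b, the bin count nbins, and the helpers jointHist, H, and mi are assumptions, and the intensities are assumed to already be integer bin indices.

```matlab
% Sketch: mutual information from a normalized joint histogram.
% Images hold integer bin indices 1..nbins; all names are illustrative.
nbins = 4;
a = repmat(1:nbins, nbins, 1);       % intensity varies along columns only
b = repmat((1:nbins)', 1, nbins);    % intensity varies along rows only

jointHist = @(u, v) accumarray([u(:), v(:)], 1, [nbins, nbins]) / numel(u);
H  = @(p) -sum(p(p > 0) .* log(p(p > 0)));             % entropy, 0 log 0 = 0
mi = @(u, v) H(sum(jointHist(u, v), 2)) ...            % H(A) from row marginal
           + H(sum(jointHist(u, v), 1)) ...            % H(B) from column marginal
           - H(reshape(jointHist(u, v), [], 1));       % H(A, B)

miSame  = mi(a, a);   % an image against itself
miIndep = mi(a, b);   % joint histogram is uniform, i.e. independent
fprintf('I(A,A) = %g (H(A) = %g), I(A,B) = %g\n', miSame, log(nbins), miIndep);
```

An image compared with itself recovers its own entropy, while the two independent patterns share no information, consistent with maximizing mutual information to align images.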
Example 15 (Registration Distance Measures). The below example compares three ‘distance’ measures
commonly used for image registration.
Figure 10: Consider the 1D rigid registration of a brain image parametrized by distance d.
Figure 11: Comparison of (a) MI, (b) MSQ, (c) NCC Distance Measures.
\exampledir/segmentation/distancemeasure.m
clear all
close all
c3dexe = ’/opt/apps/itksnap/c3d-1.0.0-Linux-x86_64/bin/c3d’;
T1Image = ’ICBM_TemplateZSlab.nii.gz’;
T1Image = ’brain_T1ZSlab.nii.gz’;
T1CImage = ’brain_T1CZSlab.nii.gz’;
T2Image = ’brain_T2ZSlab.nii.gz’;
FLImage = ’brain_FlairZSlab.nii.gz’;
ImageList = {T1CImage; T2Image ;T1Image ;FLImage }
transformedimage = ’slicetranslate’;
OriginalImage = load_nii(T1CImage );
nbins = 32;
% fix noise
originalnoisepower = (max(OriginalImage.img(:)) - min(OriginalImage.img(:))) *.2;
originalmean = mean(OriginalImage.img(:));
NoisyImage = OriginalImage.img+floor(originalnoisepower*rand(size(OriginalImage.img))+originalmean);
handle = figure(1);
% look at rigid registration distance of image list
for jjj = 1:length(ImageList)
currentImage = ImageList{jjj}
for iii = translationlist
disp(’###################’)
% create transformed images
theta = (iii-1)*pi/180;
system(sprintf(’sed "s/1 0 0 0 1 0 0 0 1 0 0 0/%f %f 0 %f %f 0 0 0 1 %f %f 0/" identity.tfm > tmp.tfm;’
transformimagefilename = sprintf(’%s%04d.nii.gz’,transformedimage ,iii);
transformcmd = sprintf(’%s %s %s -reslice-itk tmp.tfm -o %s’,c3dexe,T1CImage,currentImage,transformimag
disp(transformcmd );
system(transformcmd );
imagesc(NoisyImage(:,:,1) +TransformImage.img(:,:,1));
colormap(gray);
pause(.1)
end
end
%saveas(handle,’RegistrationExample’,’png’)
set(gca,’FontSize’,16)
% plot MI
handle2 = figure(2);
hold
xlabel(’distance’)
ylabel(’MI’)
for jjj = 1:length(ImageList)
plot (translationlist,metricdata(:,jjj,1),strcat(typelegend{1},colorlegend{jjj}))
plot (translationlist,metricdata(:,jjj,4),strcat(typelegend{2},colorlegend{jjj}))
end
legend(’T1C/T1C’, ’T1C (noise)/T1C’, ’T1C/T2’, ’T1C (noise)/T2’, ’T1C/T1’, ’T1C (noise)/T1’, ’T1C/FL’, ’T1C
% plot MSQ
handle3 = figure(3);
hold
xlabel(’distance’)
ylabel(’MSQ’)
for jjj = 1:length(ImageList)
plot (translationlist,metricdata(:,jjj,2),strcat(typelegend{1},colorlegend{jjj}))
plot (translationlist,metricdata(:,jjj,5),strcat(typelegend{2},colorlegend{jjj}))
end
legend(’T1C/T1C’, ’T1C (noise)/T1C’, ’T1C/T2’, ’T1C (noise)/T2’, ’T1C/T1’, ’T1C (noise)/T1’, ’T1C/FL’, ’T1C
% plot NCOR
handle4 = figure(4);
hold
xlabel(’distance’)
ylabel(’NCOR’)
for jjj = 1:length(ImageList)
plot (translationlist,metricdata(:,jjj,3),strcat(typelegend{1},colorlegend{jjj}))
plot (translationlist,metricdata(:,jjj,6),strcat(typelegend{2},colorlegend{jjj}))
end
legend(’T1C/T1C’, ’T1C (noise)/T1C’, ’T1C/T2’, ’T1C (noise)/T2’, ’T1C/T1’, ’T1C (noise)/T1’, ’T1C/FL’, ’T1C
saveas(handle2,’DistanceMI’,’png’)
saveas(handle3,’DistanceMSQ’,’png’)
saveas(handle4,’DistanceNCOR’,’png’)
The Hellinger distance provides an alternative notion of distance between probability distributions that
satisfies the properties of a metric.
Example 16 (Hellinger Distance). In statistics, we are typically interested in comparing probability
distributions. Here the space of probability functions consists of functions that are positive, continuous, and
normalized to 1.
X \equiv \left\{ f : \Omega \to R^+ \;|\; f \text{ continuous}, \; \int_\Omega f \, dx = 1 \right\}
Various tests (Z-test, F-test, T-test) have been developed to compare the distributions. In some applications, the
probability distributions are known and a more direct measure of distance between two probability distributions
f and g may be given by the Hellinger Distance.
d(f, g) = \frac{1}{\sqrt{2}} \left( \int_\Omega \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 dx \right)^{1/2} = \frac{1}{\sqrt{2}} \left( \int_\Omega f(x) \, dx + \int_\Omega g(x) \, dx - 2 \int_\Omega \sqrt{f(x)} \sqrt{g(x)} \, dx \right)^{1/2}, \qquad \int_\Omega f \, dx = \int_\Omega g \, dx = 1
= \left( 1 - \int_\Omega \sqrt{f(x)} \sqrt{g(x)} \, dx \right)^{1/2}, \qquad 0 \le d(f, g) \le 1
Notice the intuition for the Hellinger distance is obtained from the simplified form above. In a sense, the
Hellinger distance measures the area of overlap between two probability distribution functions. In the extreme
case with no overlap, i.e., f is non-zero when g is zero and vice-versa, the Hellinger distance attains its max
value of 1. As an explicit example, for two normal distributions P ∼ N(µ1, σ1), Q ∼ N(µ2, σ2) the Hellinger
metric reduces to
d(P, Q) = \sqrt{ 1 - \sqrt{ \frac{2 \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2} } \exp\left( - \frac{1}{4} \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \right) }
>> ExHellingerDistance
\exampledir/ExHellingerDistance.m
mu1 = 0;
sigma1= 1;
mu = .5:.5:10;
sigma = 1.5:.5:10;
sigma = [1,5,10]
plotcolors = ['b','r','k']
maxsigma = max(sigma);
close all
handle = figure
hold
for jjj = 1:size(sigma,2)
for iii = 1:size(mu,2)
mu2 = mu(iii);
sigma2= sigma(jjj);
% outer sqrt added so the value matches the closed form d(P, Q) above
hellinger(iii,jjj) = sqrt( 1 - sqrt( ( 2 * sigma1 * sigma2 ) / ...
(sigma1^2 + sigma2^2) ) * exp(-(mu1-mu2)^2/(sigma1^2 + sigma2^2)/4) );
end
plot(mu,hellinger(:,jjj),plotcolors(jjj))
end
legend('sigma=1', 'sigma=5', 'sigma=10')
saveas(handle,'HellingerDistance','png')
handle = figure
hold
x = [-4*maxsigma:1e-3:4*maxsigma];
y1 = pdf('normal', x, mu1, sigma1);
y2 = pdf('normal', x, mu(6), sigma(2));
y3 = pdf('normal', x, mu(10), sigma(3));
plot(x, y1)
plot(x, y2, 'r')
plot(x, y3, 'k')
title('Density functions')
legend('mu=0 sigma=1', 'mu=3 sigma=5', 'mu=5 sigma=10')
saveas(handle,'NormalPDFCompare','png')
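[Figure panels: µ vs d(N(0, 1), ·); x vs prob]
may be used to show that zero distance is equivalent to the same function
\frac{1}{2} \int_\Omega \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 dx = 0 \Rightarrow f(x) = g(x) \quad \forall x
for p = 2, a = \sqrt{f} - \sqrt{z}, b = \sqrt{z} - \sqrt{g}
\left( \int_\Omega \left( \sqrt{f(x)} \pm \sqrt{z(x)} - \sqrt{g(x)} \right)^2 dx \right)^{1/2} \le \left( \int_\Omega \left( \sqrt{f(x)} - \sqrt{z(x)} \right)^2 dx \right)^{1/2} + \left( \int_\Omega \left( \sqrt{z(x)} - \sqrt{g(x)} \right)^2 dx \right)^{1/2}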
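The simplified integral form and the closed form for two normal distributions can be compared numerically. This is a sketch under stated assumptions: the parameter values, the integration grid, and the helper normalpdf (written out explicitly to avoid the Statistics Toolbox) are illustrative, and the integral is approximated with the trapezoidal rule.

```matlab
% Sketch: Hellinger distance between two normals, numerical integral
% versus closed form. Parameters and grid are illustrative assumptions.
normalpdf = @(x, mu, sigma) exp(-(x - mu).^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi));

mu1 = 0; sigma1 = 1;
mu2 = 3; sigma2 = 5;

x = -60:1e-3:60;                       % wide enough that both tails are negligible
f = normalpdf(x, mu1, sigma1);
g = normalpdf(x, mu2, sigma2);

% d(f, g) = sqrt( 1 - integral of sqrt(f) * sqrt(g) )
dNumeric = sqrt(1 - trapz(x, sqrt(f .* g)));

% Closed form for normal distributions
dClosed = sqrt(1 - sqrt(2 * sigma1 * sigma2 / (sigma1^2 + sigma2^2)) ...
               * exp(-(mu1 - mu2)^2 / (4 * (sigma1^2 + sigma2^2))));
fprintf('numerical: %.6f   closed form: %.6f\n', dNumeric, dClosed);
```

Agreement between the two values is a quick consistency check on the outer square root in the closed form.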
The entropy facilitates a quantitative measure of the information gain for an image segmentation.
Example 17 (Informational Entropy). For image segmentation the information gain during the segmentation
is a measure of the ‘reduction in the uncertainty’.
The entropy should be reduced when information is added to the system and the information gain should be
> 0.
Assume that we are given disjoint segmentation data sets, S_j, that classify the pixel type to be used in
‘training’ our algorithm. A label l ∈ N is associated with each pixel in the image, v ∈ R^n,
S = \cup_j S_j, \qquad S_j \cap S_i = \emptyset, \quad i \ne j
Define the entropy of these data sets in terms of the probability distributions/histograms of the class labels.
H(S) = \sum_j \frac{|S_j|}{|S|} H(S_j), \qquad |A| \equiv \text{# of training points in set } A
H(A) = - \sum_i p_i \log(p_i) = \sum_i p_i \log(1/p_i), \qquad p_i = \frac{\text{# of class label } i}{\text{total # of labels in set } A}
Consider the thresholding operation on a T1 image with class labels l1 = White Matter, l2 = Grey Matter,
l3 = CSF. The entropy of the entire training set ‘before’ a thresholding operation is applied is computed as
H_{before}(S) = \frac{|S|}{|S|} H(S)
Below are the initial label statistics to compute the label histograms.
$ /opt/apps/itksnap/c3d-1.0.0-Linux-x86_64/bin/c3d ICBM_Template.nii.gz ICBM_grey_white_csf.nii.gz -lstat
LabelID Mean StdD Max Min Count Vol(mm^3) Extent(Vox)
0 432.23113 677.59359 4095.00000 0.00000 5610187 5610187.000 181 217 181
1 1669.76621 281.12593 2493.00000 177.00000 1032234 1032234.000 145 180 138
2 2214.56145 158.47540 2715.00000 734.00000 435476 435476.000 134 171 120
3 861.27990 308.32083 2146.00000 459.00000 31240 31240.000 77 93 87
Thresholding the dataset at Intensity = 2010 yields a left ‘L’ and right ‘R’ dataset. The entropy of the
entire training set ‘after’ the thresholding operation is applied is computed as
H_{after}(S) = \sum_{j \in \{L, R\}} \frac{|S_j|}{|S|} H(S_j)
\exampledir/ExEntropySegmentation.m
close all
clear all
% verify split
verify = sum(GREYCSF )+ sum(WHITE)- sum(Initial )
[Figure panels (a)-(d) omitted; recoverable plot labels: SAGT1, N4CORR, cor = 0.95.]
Figure 14: Segmentation Histogram. Entropy provides a quantitative measure of a segmentation threshold
result. The ‘distance’ defined by the information gain provides a repeatable and reproducible measure.
>> ExEntropySegmentation
verify =
EntropyBefore =
0.6967
EntropyAfter =
0.3852
InformationGain =
0.3115
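The entropy bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration, not the ExEntropySegmentation.m script: it assumes the natural logarithm and that the background label 0 is excluded from the histogram, which reproduces the reported EntropyBefore from the c3d label counts. The left/right split counts are hypothetical, chosen only so that they add up to the initial counts.

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of a label histogram given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def weighted_entropy(subsets):
    """H(S) = sum_j |S_j|/|S| * H(S_j) for disjoint label histograms S_j."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

# Voxel counts for labels 1, 2, 3 from the c3d -lstat table (label 0 excluded).
initial = [1032234, 435476, 31240]
h_before = weighted_entropy([initial])

# HYPOTHETICAL threshold split (made-up counts that sum to `initial`).
left = [1000000, 35476, 31240]
right = [32234, 400000, 0]
h_after = weighted_entropy([left, right])

information_gain = h_before - h_after
print(round(h_before, 4), round(h_after, 4), round(information_gain, 4))
```

Because the subsets partition the training set, the weighted entropy after the split cannot exceed the entropy before it, so the information gain is non-negative.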
Example 18 (Distances in High Dimensional Space). Non-Euclidean distances have been suggested to be
appropriate for high dimensional clustering applications [Aggarwal et al., 2001]. Let's look at high-dimensional
‘distances’ for identifying noisy images. Consider the image basis derived from the image histograms shown in
Figure 15, i.e. intensity groups within a histogram bin form a basis vector. As the number of bins increases, the
dimension of the basis vector effectively increases. Consider the ‘contrast’ measure of a uniformly distributed
vector in this high dimensional space, $x \sim U^d(0, 1)$,
\[ \text{contrast} = E\left( \|x\|_p^{\max} - \|x\|_p^{\min} \right) \]
\exampledir/ExHighDimensional.m
.
distancesubset = [2:10:1024];
datamatrix = 1000*rand(Npixel*Npixel,Nsample );
.
.
for iii =1:Nsample
for jjj =1:length(distancesubset)
[hresample edges imagebins] = histcounts(originalimage,distancesubset(jjj));
noisyimage = zeros(size(originalimage));
for idnoise = 1:distancesubset(jjj)
noisyimage = noisyimage + datamatrix(idnoise ,iii) * (imagebins==idnoise);
end
.
distanceone( iii,jjj) = norm(datamatrix(1:distancesubset(jjj),iii),1);
distancetwo( iii,jjj) = norm(datamatrix(1:distancesubset(jjj),iii),2);
.
end
end
.
for jjj =1:length(distancesubset)
distanceoneSeparation( jjj) = max(distanceone( :,jjj)) - min(distanceone( :,jjj));
distancetwoSeparation( jjj) = max(distancetwo( :,jjj)) - min(distancetwo( :,jjj));
end
Figure 15: Image Basis of Figure 9 derived from histogram. The max distance for a uniformly distributed
vector with respect to this basis, x ∼ U d (0, 1), is shown in Figure 16.
In summary, as you progress through your research and are asked to quantitatively evaluate the
distance between two measurements or computer simulations, your advisor may, in general, be skeptical if the
distance function you are using does not satisfy the properties of a metric (1).
Figure 16: For the basis shown in Figure 15, notice that the ‘contrast’ (1) increases with dimension for
$E(\|x\|_1^{\max} - \|x\|_1^{\min})$, (2) asymptotes with dimension for $E(\|x\|_2^{\max} - \|x\|_2^{\min})$, (3) decreases with dimension for
$E(\|x\|_3^{\max} - \|x\|_3^{\min})$. Comparison of this behavior to p = 0.5 and MI is left as a homework exercise.
α1 x1 + α2 x2 + ... + αm xm
The set of all linear combinations of a set M is called the span of the set M.
(Definition) Span of a set of vectors The span of a set of vectors $M = \{x_1, x_2, ..., x_m\} \subset X$ denotes the
set of all linear combinations of the vectors
\[ \text{span}\, M \equiv \left\{ y : y = \sum_i \alpha_i x_i \text{ for some } (\alpha_1, \alpha_2, ..., \alpha_m) \in R^m \right\} \]
[Figure: the vectors $\vec{a} = (7, 0)^T$, $\vec{b} = (5, 2)^T$, and $\vec{c} = (-5, -2)^T$ drawn in the plane.]
Figure 17: A linear combination is a sum of vectors.
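\[ \text{span}\{\vec{a}, \vec{b}\} = R^2 \]
Why? How would you represent $\vec{y} = (20, 7)^T$?
\[ \alpha_1 \vec{a} + \alpha_2 \vec{b} = \vec{y} \;\Leftrightarrow\; \alpha_1 \begin{bmatrix} 7 \\ 0 \end{bmatrix} + \alpha_2 \begin{bmatrix} 5 \\ 2 \end{bmatrix} = \begin{bmatrix} 20 \\ 7 \end{bmatrix} \;\Leftrightarrow\; \begin{bmatrix} 7 & 5 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} 20 \\ 7 \end{bmatrix} \]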
alpha =
0.3571
3.5000
ans =
20
7
\[ \text{span}\{\vec{c}, \vec{b}\} = \left\{ y : y = \alpha \begin{bmatrix} 5 \\ 2 \end{bmatrix} \text{ for some } \alpha \in R \right\} \]
Is this $R^2$?
\[ \vec{c} = -1 \cdot \vec{b} \]
(Definition) Linear Independence A set of vectors x1 , ..., xm is said to be linearly independent if the
linear combination of the vectors equals zero iff the scalar coefficients all equal zero.
α1 x1 + α2 x2 + ... + αm xm = 0 ⇔ α1 = α2 = ... = αm = 0
Example 20 (Linear Dependence in $R^n$). Two vectors that are collinear in $R^n$ are linearly dependent, i.e.
\[ x = \alpha y \;\Rightarrow\; x + (-\alpha) y = 0 \qquad \alpha \neq 0 \]
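Example 21 (Linear independence of functions). Show $\cos(x), \sin(x) \in C[a, b]$ are linearly independent.
Obviously
\[ 0 \cos(x) + 0 \sin(x) = 0 \quad \forall x \]
Conversely, we need to show that if the linear combination of the functions equals zero, then the coefficients equal zero.
\[ \alpha_1 \cos(x) + \alpha_2 \sin(x) = \vec{0} \quad \forall x \]
Here $\vec{0}$ is the zero vector of the function space, i.e. the function that equals 0 for all x, so the equality holds for all x. In particular let x = 0
\[ \alpha_1 \cos(0) + \alpha_2 \sin(0) = \alpha_1 = 0 \]
With $\alpha_1 = 0$, evaluating $\alpha_2 \sin(x) = 0$ at any point where $\sin(x) \neq 0$ gives $\alpha_2 = 0$.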
e1 = (1, 0, 0, ..., 0)
e2 = (0, 1, 0, ..., 0)
. . .
en = (0, 0, 0, ..., 1)
\[ e_i(x) = \begin{cases} 1 & x \in \Omega_i \\ 0 & x \notin \Omega_i \end{cases} \]
Example 22 (Discrete Image). In imaging applications, we typically assume the image of the object, g, that
we are taking a picture of is square integrable
\[ g \in L^2(\Omega) \equiv \left\{ f : \int_\Omega f^2 \, dx < \infty \right\} \]
Here, our imaging domain, Ω, is a subset of $R^2$, $\Omega \subset R^2$. Unfortunately, it takes an infinite number of
basis functions to represent an arbitrary image in our square integrable space, $L^2(\Omega)$. In order to represent
the image, g, on a finite dimensional space for the computer to understand, we typically discretize the domain
into a 256×256 pixel image. We can then define the i-th basis function such that the function equals one
on the i-th pixel and zero everywhere else. An image that we are interested in can now easily be
represented on a computer as a linear combination of the basis functions
\[ g(x) = \sum_i \alpha_i e_i(x) \qquad (\alpha_1, \alpha_2, ..., \alpha_n) \in R^{256 \times 256} \]
and the constants of the linear combination have the interpretation of piecewise constant intensity values.
(Definition) Dimension of a vector space A vector space X is said to be finite dimensional if there is
a positive integer n such that X contains a linearly independent set of n vectors whereas any set of n + 1 or
more vectors of X is linearly dependent. n is called the dimension of X, written n = dim X.
(N1) $\|x\| \geq 0$
(N2) $\|x\| = 0 \Leftrightarrow x = 0$
(N3) $\|\alpha x\| = |\alpha| \|x\|$
(N4) $\|x + y\| \leq \|x\| + \|y\|$
Figure 19: Illustration of Triangle Inequality (N4)
• (N4) is the triangle inequality and means that the length of one side of a triangle cannot exceed the
sum of the length of the other two sides
Example 23 (Metric induced by the norm). A norm on X defines a metric d on X which is given by
\[ d(x, y) = \|x - y\| \]
and is called the metric induced by the norm. Showing that d satisfies the properties of a metric is left as a
homework exercise. The metric induced by the norm may also be shown to be translationally invariant
\[ d(x + a, y + a) = \|(x + a) - (y + a)\| = \|x - y\| = d(x, y) \]
• The metric is unchanged with respect to a translation by an arbitrary a
Lemma 2.2. The triangle inequality may be used to show that the norm satisfies the reverse triangle inequality
\[ \big| \|y\| - \|x\| \big| \leq \|y - x\| \qquad (3) \]
Proof.
\[ \|a\| = \|a + b - b\| \leq \|a - b\| + \|b\| \;\Rightarrow\; \|a\| - \|b\| \leq \|a - b\| \]
Similarly
\[ \|b\| = \|b + a - a\| \leq \|b - a\| + \|a\| \;\Rightarrow\; \|b\| - \|a\| \leq \|a - b\| \]
Together the two bounds give $\big| \|a\| - \|b\| \big| \leq \|a - b\|$.
Pay attention to this process. If a holds, then b must be true. If b is true, then c must be true, etc.
Example 24 (Distance in $R^n$). The p-norm defines a norm in $R^n$
\[ \|x\|_p \equiv \left( \sum_i |x_i|^p \right)^{1/p} \quad p \geq 1 \qquad \|x\|_\infty \equiv \max_i \{|x_i|\} \]
Proof that this satisfies the properties of a norm is similar to the examples defining a metric and will be left
as a homework exercise. Example calculations of the 1-norm and ∞-norm are provided below.
>> x=[9;-4;7;-10]
x =
9
-4
7
-10
ans =
30
>> max(abs(x))
ans =
10
Example 25 (Space of continuous functions). Consider the space of continuous functions of independent
variable t over the domain [a, b].
Notice that in this space, each ‘point’ represents a function. The max difference between the function over
the domain [a, b] defines a norm on this space.
Proof that this satisfies the properties of a norm is similar to the examples defining a metric and will be left
as a homework exercise.
Example 26 ($L^p$ distance between functions). The $L^p$ norm
\[ \|u\|_{L^p} \equiv \left( \int_\Omega |u(x)|^p \, dx \right)^{1/p} \]
defines a norm on a superset of the space of continuous functions that we have seen so far. Proof that
this satisfies the properties of a norm is more technical (it involves equivalence classes of Lebesgue measurable
functions) and the result will be assumed. If f, g ∈ X are two images and we want to know how “close” they
are, we can quantitatively evaluate their $L^p$ distance.
\[ d(f, g) = \|f - g\|_p = \left( \int_\Omega |f - g|^p \, dx \right)^{1/p} \]
For p = 2, this may be interpreted as the usual RMS difference. A canonical example is image registration
where we want the “distance” between two “registered” images to be as small as possible.
We have seen the $L^2$ distance between images. Numerical values for other norms may be computed in a
similar fashion for $I : [0, 1] \times [0, 1] \subset R^2 \to R$ and $J : [0, 1] \times [0, 1] \subset R^2 \to R$.
\[ \|I - J\|_1 = \int_0^1 \int_0^1 |I(x, y) - J(x, y)| \, dx \, dy \approx \sum_{i,j} |I(i \cdot \Delta x, j \cdot \Delta y) - J(i \cdot \Delta x, j \cdot \Delta y)| \, \Delta x \, \Delta y \]
\[ \|I - J\|_5 = \left( \int_0^1 \int_0^1 |I(x, y) - J(x, y)|^5 \, dx \, dy \right)^{1/5} \approx \left( \sum_{i,j} |I(i \cdot \Delta x, j \cdot \Delta y) - J(i \cdot \Delta x, j \cdot \Delta y)|^5 \, \Delta x \, \Delta y \right)^{1/5} \]
>> echo on
>> ExLOneFiveImageDistance
close all
clear all
I(x, y) = x + sin(π y) J(x, y) = sin(π y)
delta = 5.e-4;
[X,Y] = meshgrid([0:delta:1],[0:delta:1]);
I = X + sin(pi*Y);
J = sin(pi*Y);
handle = figure; imagesc(I)
%saveas(handle, ’ImageDistanceOne’, ’png’)
handle = figure; imagesc(J)
%saveas(handle, ’ImageDistanceTwo’, ’png’)
norm(I(:)-J(:),1)*delta*delta
ans =
0.5005
norm(I(:)-J(:),5)*(delta*delta)^(1/5)
ans =
0.6991
Figure 21: Two distance measures may have the same quantitative value, but the difference can be quite
different.
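The same Riemann-sum approximation can be written without any toolbox. The sketch below is a minimal Python version that assumes a midpoint rule on a coarser 200 × 200 grid (the function names and the grid size are illustrative choices, not taken from ExLOneFiveImageDistance). Since I − J = x for these two images, the exact values are 1/2 for the 1-norm and (1/6)^(1/5) for the 5-norm, close to the values printed by the MATLAB listing.

```python
import math

def image_i(x, y):
    return x + math.sin(math.pi * y)

def image_j(x, y):
    return math.sin(math.pi * y)

def lp_distance(f, g, p, n=200):
    """Midpoint-rule approximation of ||f - g||_p on the unit square."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            y = (j + 0.5) * h
            total += abs(f(x, y) - g(x, y)) ** p * h * h
    return total ** (1.0 / p)

d1 = lp_distance(image_i, image_j, 1)
d5 = lp_distance(image_i, image_j, 5)
print(d1, d5)
```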
∗∗
2.5 Continuity and Convergence ([Kreyszig, 1989], Section 1.4)
The concept of the distance measure provided by our metric and norm allows us to precisely define, at a
very basic level, the concept of continuity of a function and convergence of a sequence of vectors.
Convergence in normed spaces is motivated by the metric induced by the norm, $d(x, y) = \|x - y\|$.
(Definition) Convergence of a sequence, limit A sequence $(x_n)$ in a metric space X = (X, d) is said
to converge or to be convergent if there is an x ∈ X such that
\[ \lim_{n \to \infty} d(x_n, x) = 0 \qquad \text{(in a normed space: } \lim_{n \to \infty} \|x_n - x\| = 0 \text{)} \]
Figure 22: sN (x) converges pointwise to 0 at a particular point x. However, this sequence does not converge
to the 0 function in the mean square or L2 sense.
Example 27 (Pointwise convergence). In general we are accustomed to pointwise convergence. That is, for a
given t
\[ x(t) - \sum_i^n \alpha_i e_i(t) \to 0 \qquad n \to \infty \quad \forall t \]
This is different from mean square convergence, or convergence with respect to the norm
\[ \left\| x(t) - \sum_i^n \alpha_i e_i(t) \right\| \to 0 \qquad n \to \infty \]
Example 28 (Optimization). In optimization we typically search for a solution such that the gradient
of the objective function converges to zero.
\[ \nabla f_k \to 0 \]
Familiar properties from calculus carry over into our vector space with a metric and norm defined.
Example 29 (Convergence of Sequences). The metric of two converging sequences converges to the metric
of the limit.
xn → x and yn → y ⇒ d(xn , yn ) → d(x, y)
Proof of this is left as a homework exercise.
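Example 30 (Convergence of Sequences In Normed Space). The norm of the sum of two converging sequences
converges to the norm of the sum of the limits.
\[ x_n \to x \text{ and } y_n \to y \;\Rightarrow\; \|x_n + y_n\| \to \|x + y\| \]
By the triangle inequality
\[ \|x_n + y_n\| \leq \|x_n - x\| + \|x + y\| + \|y_n - y\| \]
Rearranging
\[ \|x_n + y_n\| - \|x + y\| \leq \|x_n - x\| + \|y - y_n\| \]
Interchanging $x_n$ with x and $y_n$ with y
\[ \|x + y\| \leq \|x - x_n\| + \|x_n + y\| \leq \|x_n - x\| + \|x_n + y_n\| + \|y - y_n\| \;\Rightarrow\; \|x + y\| - \|x_n + y_n\| \leq \|x_n - x\| + \|y - y_n\| \]
Hence, the difference in the norms is bounded by a sequence of numbers converging to zero
as n → ∞
\[ \big| \|x_n + y_n\| - \|x + y\| \big| \leq \|x_n - x\| + \|y_n - y\| \to 0 \]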
∗∗
2.5.1 Continuity
Figure 23: Continuity. A mapping T : X → Y is said to be continuous at a point $x_0$ if for each $\epsilon > 0$ (no
matter how small) there corresponds a $\delta(\epsilon, x_0)$ such that $\tilde{d}(Tx, Tx_0) < \epsilon$ whenever $d(x, x_0) < \delta$.
(Definition) Continuity Let X = (X, d) and Y = (Y, $\tilde{d}$) be metric spaces. A mapping T : X → Y is said
to be continuous at a point $x_0$ if for each $\epsilon > 0$ (no matter how small) there corresponds a $\delta(\epsilon, x_0)$ such that
$\tilde{d}(Tx, Tx_0) < \epsilon$ whenever $d(x, x_0) < \delta$, Figure 23.
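Example 31 (Continuity of f(x) = 1/x). Consider f(x) = 1/x over [.1, 1], i.e. not including 0. Here
X = Y = R and $\tilde{d}(x, y) = d(x, y) = |x - y|$
• For the point in question, $x_0$, draw an arbitrarily small $\epsilon$ band about its mapping, $f(x_0) = 1/x_0$ (this
is the “for any $\epsilon > 0$” part), and where it intersects the graph (A and B) drop verticals to the x-axis.
\[ A = \frac{1}{x_0} + \epsilon = \frac{1}{x_0 - a} \;\Rightarrow\; x_0 - a = \frac{1}{\frac{1}{x_0} + \epsilon} = \frac{x_0}{1 + \epsilon x_0} \;\Rightarrow\; a = x_0 - \frac{x_0}{1 + \epsilon x_0} = \frac{\epsilon x_0^2}{1 + \epsilon x_0} \]
\[ B = \frac{1}{x_0} - \epsilon = \frac{1}{x_0 + b} \;\Rightarrow\; x_0 + b = \frac{1}{\frac{1}{x_0} - \epsilon} = \frac{x_0}{1 - \epsilon x_0} \;\Rightarrow\; b = \frac{x_0}{1 - \epsilon x_0} - x_0 = \frac{\epsilon x_0^2}{1 - \epsilon x_0} \]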
Figure 24: By inspection, the function is continuous at a point $f(x_0) = 1/x_0$ but to show this, we must
identify a δ as a function of the distance from the mapping value, $\epsilon$, and the continuity point in question $x_0$.
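One candidate for δ is the smaller of the two horizontal distances a and b from Example 31, which is a. The Python sketch below is only a numerical spot check under that assumption (the helper names are illustrative, and x0 > 0 is assumed): it samples points strictly inside (x0 − δ, x0 + δ) and confirms that |1/x − 1/x0| stays below ε.

```python
def delta_for(eps, x0):
    """delta = a = eps*x0^2 / (1 + eps*x0), the smaller of the distances a and b."""
    return eps * x0 * x0 / (1.0 + eps * x0)

def stays_within_band(eps, x0, samples=1000):
    """Check |1/x - 1/x0| < eps on sample points with |x - x0| < delta."""
    d = delta_for(eps, x0)
    for k in range(1, samples):
        x = x0 - d + 2.0 * d * k / samples  # strictly inside the interval
        if abs(1.0 / x - 1.0 / x0) >= eps:
            return False
    return True

print(delta_for(0.1, 0.5), stays_within_band(0.1, 0.5))
```

Note that δ depends on both ε and x0, as the definition of continuity at a point allows.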
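Theorem 2.3 (Equivalent norms). On a finite dimensional vector space X, any norm $\|\cdot\|$ is equivalent to
any other norm $\|\cdot\|_0$
Proof. See [Kreyszig, 1989], Theorem 2.4-5
Notice that this implies that convergence or divergence of a sequence does not depend on the choice of the
norm. For example, a sequence $s_k$ converging to zero in the 2-norm also converges to zero in the 1-norm.
\[ \|s_k - \vec{0}\|_1 \leq \sqrt{n} \, \|s_k - \vec{0}\|_2 \to 0 \]
Remark It may be shown (as a homework exercise) that the $\|\cdot\|_1$ and $\|\cdot\|_2$ satisfy
\[ \frac{1}{\sqrt{n}} \|x\|_1 \leq \|x\|_2 \leq \|x\|_1 \quad \forall x \qquad (4) \]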
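As a quick numerical sanity check of the bounds (4), and not a proof, the Python sketch below evaluates both sides for the vector x = (9, −4, 7, −10) used in Example 24. The helper names are illustrative.

```python
import math

def norm1(x):
    return sum(abs(v) for v in x)

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

x = [9.0, -4.0, 7.0, -10.0]
n = len(x)

lower = norm1(x) / math.sqrt(n)   # (1/sqrt(n)) * ||x||_1
middle = norm2(x)                 # ||x||_2
upper = norm1(x)                  # ||x||_1
print(lower, middle, upper)
```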
Within the context of optimization, many compressed sensing applications involve the reconstruction
of some image, $x \in R^n$, from some measurements $y \in R^m$, and the measurements are assumed to be a linear
transformation of the image, y = Ax. There are generally infinitely many images, $N(A) \neq \{0\}$, that satisfy
the measurement data and we wish to optimize with respect to a particular norm $\|\cdot\|$. In optimization
theory, we often choose to minimize the 1-norm rather than the 2-norm.
Figure 25: A 2D $\|\cdot\|_1$ vs $\|\cdot\|_2$ optimization is shown with respect to the constraint, y = m · x + b.
Notice that iso-distance lines of the $\|\cdot\|_2$ are ‘circles’ while iso-distance lines of the $\|\cdot\|_1$ are ‘diamonds’.
These are distinct functions, $f(x, y) = \|\cdot\|_1 \neq \hat{f}(x, y) = \|\cdot\|_2$, and minimizing these norms
with respect to the constraint, y = m · x + b, leads to different solutions to the optimization problem.
http://www.cse.illinois.edu/iem/linear equations/pnorms
>> echo on
>> ExL1vsL2min
\exampledir/ExL1vsL2min.m
clear all
close all
delta = 5.e-2;
bound = 6
xcoord = [-bound:delta:bound];
[X,Y] = meshgrid( xcoord , xcoord );
V = [ 1 2 3 4 5 ];
handle = figure;
set(gcf,’renderer’,’zbuffer’);
set(gca,’FontSize’,16)
contour(X,Y,g,V,’--’)
hold
contour(X,Y,h,V,’k-’ )
slope = 0.5;
intercept = 4.1;
ycoord = slope * xcoord + intercept ;
plot(xcoord ,ycoord ,’r’);
xloc = 2;
text(xloc,slope*xloc + intercept,’ y = m \cdot x + b’,...
’HorizontalAlignment’,’left’,’FontSize’,14)
text(4/sqrt(2),4/sqrt(2),’ ||x||_2 = c = 4 ’,...
’HorizontalAlignment’,’left’,’FontSize’,14)
text(2,-2,’ ||x||_1 = c = 4 ’,...
’HorizontalAlignment’,’left’,’FontSize’,14)
plot(onemin(1),onemin(2),’x’)
text(onemin(1),onemin(2),’min ||x||_1’,...
’HorizontalAlignment’,’left’,’FontSize’,14)
plot(twomin(1),twomin(2),’o’)
text(twomin(1),twomin(2),’min ||x||_2’,...
’HorizontalAlignment’,’left’,’FontSize’,14)
xlabel(’x’)
ylabel(’y’)
saveas(handle,’L1vsL2Min’,’png’)
Numerically, we will see that minimization with respect to the $\|\cdot\|_1$ has advantages over the $\|\cdot\|_2$.
Intuitively, from Figure 25, in 2D we see that the $\|\cdot\|_2$ solution occurs where the smallest circle intersects the
line constraint. Similarly, the $\|\cdot\|_1$ solution occurs where the smallest diamond intersects the line constraint.
In higher dimensions these same properties hold (i.e. intersect ’diamond’ with ’line’
in 100D), and minimization with respect to the 1-norm promotes a sparse representation of the solution. Here a
‘sparse’ solution means that the number of non-zero entries is much less than the dimension of the image
space $R^n$, # non-zero << n. The solution of the $\|\cdot\|_1$ problem typically lies on a corner of a $\|\cdot\|_1$-ball. Because
the corners lie on the coordinate axes, most components of the solution are zero
(which is a little hard to imagine for a 6 million dimensional $\|\cdot\|_1$-ball). The solution of the $\|\cdot\|_2$
problem lies on the $\|\cdot\|_2$-ball, which has no corners, and thus is not restricted to a sparse solution. Sparsity has
applications in efficiently storing the solution, analogous to jpeg compression, reducing image acquisition
time, and improving computational efficiency.
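In general the $\|\cdot\|_1$ solution $\hat{x}$ and the $\|\cdot\|_2$ solution $x^*$ will not be the same, $\hat{x} \neq x^*$. However the equivalence of norms
(4) may be used to derive the expected relationships between the two solutions.
\[ \|\hat{x}\|_1 = \min_x \{ \|x\|_1 : Ax = b \} \;\Rightarrow\; \underbrace{\|\hat{x}\|_1 \leq \|x^*\|_1}_{\text{defn of min}} \underbrace{\leq \sqrt{n} \, \|x^*\|_2}_{\text{equiv of norm}} \]
Since $x^*$ minimizes the 2-norm, (4) also gives $\|x^*\|_2 \leq \|\hat{x}\|_2 \leq \|\hat{x}\|_1$.
Hence the 2-norm solution is less than the 1-norm solution and the 1-norm solution is less than a constant
times the 2-norm solution
\[ \|x^*\|_2 \leq \|\hat{x}\|_1 \leq \sqrt{n} \, \|x^*\|_2 \qquad x^* \neq \hat{x} \]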
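For the 2D line constraint of Figure 25 both minimizers can be written down directly, which makes the bounds easy to check. The Python sketch below is an illustration under stated assumptions: it uses the slope 0.5 and intercept 4.1 from the ExL1vsL2min listing, takes the 2-norm minimizer as the orthogonal projection of the origin onto the line, and finds the 1-norm minimizer by comparing the two breakpoints of the piecewise linear function |x| + |m x + b| (valid for m ≠ 0). The function names are hypothetical.

```python
import math

slope, intercept = 0.5, 4.1  # values used in the ExL1vsL2min listing

def l2_minimizer(m, b):
    """Closest point to the origin on the line y = m*x + b."""
    x = -m * b / (1.0 + m * m)
    return (x, m * x + b)

def l1_minimizer(m, b):
    """|x| + |m*x + b| is convex piecewise linear: check its two breakpoints."""
    candidates = [(0.0, b), (-b / m, 0.0)]
    return min(candidates, key=lambda p: abs(p[0]) + abs(p[1]))

x_star = l2_minimizer(slope, intercept)
x_hat = l1_minimizer(slope, intercept)

two_norm_star = math.hypot(x_star[0], x_star[1])
one_norm_hat = abs(x_hat[0]) + abs(x_hat[1])
print(x_star, x_hat, two_norm_star, one_norm_hat)
```

In this example the 1-norm minimizer has a zero component (it sits on a coordinate axis) while the 2-norm minimizer does not, which is the 2D version of the sparsity argument above.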
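[Figure 26 panel: a 3 × 3 pixel grid with attenuation coefficients $\mu_1, ..., \mu_9$ and pixel size ∆x = ∆y, crossed by the rays $p_1, p_2, p_3$ (rows), $p_4, p_5, p_6$ (columns), and $p_7, p_8, p_9$ (diagonals).]
\[ \frac{I_d}{I_0} = \exp\left( -\int \mu(s) \, ds \right) \qquad p = \ln \frac{I_0}{I_d} = \int \mu(s) \, ds \]
\[ p_1 = \mu_1 \Delta x + \mu_2 \Delta x + \mu_3 \Delta x \]
\[ p_2 = \mu_4 \Delta x + \mu_5 \Delta x + \mu_6 \Delta x \]
\[ p_3 = \mu_7 \Delta x + \mu_8 \Delta x + \mu_9 \Delta x \]
\[ p_4 = \mu_1 \Delta x + \mu_4 \Delta x + \mu_7 \Delta x \]
\[ p_5 = \mu_2 \Delta x + \mu_5 \Delta x + \mu_8 \Delta x \]
\[ p_6 = \mu_3 \Delta x + \mu_6 \Delta x + \mu_9 \Delta x \]
\[ p_7 = \mu_1 (2 - \sqrt{2}) \Delta x + \mu_2 \, 2(\sqrt{2} - 1) \Delta x + \mu_4 \, 2(\sqrt{2} - 1) \Delta x \]
\[ p_8 = \mu_3 \sqrt{2} \Delta x + \mu_5 \sqrt{2} \Delta x + \mu_7 \sqrt{2} \Delta x \]
\[ p_9 = \mu_6 \, 2(\sqrt{2} - 1) \Delta x + \mu_8 \, 2(\sqrt{2} - 1) \Delta x + \mu_9 (2 - \sqrt{2}) \Delta x \]
Figure 26: Image Reconstruction.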
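\[
\underbrace{\begin{bmatrix}
\Delta x & \Delta x & \Delta x & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \Delta x & \Delta x & \Delta x & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \Delta x & \Delta x & \Delta x \\
\Delta x & 0 & 0 & \Delta x & 0 & 0 & \Delta x & 0 & 0 \\
0 & \Delta x & 0 & 0 & \Delta x & 0 & 0 & \Delta x & 0 \\
0 & 0 & \Delta x & 0 & 0 & \Delta x & 0 & 0 & \Delta x \\
(2 - \sqrt{2}) \Delta x & 2(\sqrt{2} - 1) \Delta x & 0 & 2(\sqrt{2} - 1) \Delta x & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \sqrt{2} \Delta x & 0 & \sqrt{2} \Delta x & 0 & \sqrt{2} \Delta x & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 2(\sqrt{2} - 1) \Delta x & 0 & 2(\sqrt{2} - 1) \Delta x & (2 - \sqrt{2}) \Delta x
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \\ \mu_7 \\ \mu_8 \\ \mu_9 \end{bmatrix}}_{\vec{x}}
=
\underbrace{\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6 \\ p_7 \\ p_8 \\ p_9 \end{bmatrix}}_{\vec{b}}
\]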
• Is it unique?
• How does uncertainty due to precision and accuracy limitations in the measurement, ∆b affect the
solution we obtain from a canned MATLAB program ?
Using the language of norms and vector spaces that we have been studying, we can begin to answer
these questions and understand the effect of the conditioning on the solution in quite some detail.
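For the nine ray sums $p_1, ..., p_9$ listed with Figure 26, the uniqueness question can already be probed numerically. The Python sketch below builds the 9 × 9 system and checks that one particular non-zero vector z is mapped to zero. The vector z was worked out by hand for this illustration and is not taken from the notes; the function names and ∆x = 1 are assumptions. If Az = 0 then any solution µ can be replaced by µ + z without changing the measurements.

```python
import math

def build_system(dx=1.0):
    """Rows are the rays p1..p9, columns the pixels mu1..mu9 of the 3x3 grid."""
    s = (2.0 - math.sqrt(2.0)) * dx        # corner pixel crossed by p7 / p9
    t = 2.0 * (math.sqrt(2.0) - 1.0) * dx  # edge pixels crossed by p7 / p9
    r = math.sqrt(2.0) * dx                # full diagonal crossing for p8
    return [
        [dx, dx, dx, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, dx, dx, dx, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, dx, dx, dx],
        [dx, 0, 0, dx, 0, 0, dx, 0, 0],
        [0, dx, 0, 0, dx, 0, 0, dx, 0],
        [0, 0, dx, 0, 0, dx, 0, 0, dx],
        [s, t, 0, t, 0, 0, 0, 0, 0],
        [0, 0, r, 0, r, 0, r, 0, 0],
        [0, 0, 0, 0, 0, t, 0, t, s],
    ]

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

A = build_system()
# Hand-derived candidate null vector (an assumption of this sketch).
z = [0, 1, -1, -1, 0, 1, 1, -1, 0]
residual = matvec(A, z)
print(max(abs(v) for v in residual))
```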
3.1 Linear Operator; Null space; Range Space ([Kreyszig, 1989], Section 2.6,2.9)
Linear operator theory is fundamental to all practical applications. The linear operator, T, is understood as
a mapping from the domain, D(T) ⊂ X, to the range, R(T) ⊂ Y.
The operator may be defined on only a subset of the space X, but in most applications the domain
and the range are the full spaces, X and Y, respectively, and we write
\[ T : X \to Y \]
(Definition) Linear Operator A linear operator T is an operator such that
• The domain D(T) of T is a vector space and the range R(T) lies in a vector space over the same field.
• A linear operator satisfies the following property for all x, y ∈ D(T) and scalars α, β
\[ T(\alpha x + \beta y) = \alpha T x + \beta T y \]
By definition the null space, N(T), denotes the set of all x ∈ D(T) such that Tx = 0
\[ N(T) = \{x : Tx = 0\} \]
Notice that letting the scalar α = 0 implies that the zero vector is in the null space
\[ T0 = 0 \]
There are many examples of linear operators in addition to the matrix operators that we are accustomed
to from linear algebra.
Example 32 (Identity Operator). The identity operator IX : X → X is defined by IX x = x ∀x ∈ X
We typically write
Ix = x I(αx + βy) = αx + βy = αIx + βIy
Example 33 (Zero Operator). The zero operator 0 : X → Y is defined by 0x = 0 x∈X
Example 34 (Differentiation). Let X = P[a, b] be the vector space of all polynomials on [a, b]. We may define
a linear operator T on X by setting
\[ T x(t) = x'(t) \quad \forall x \in X \]
By linearity of differentiation
\[ T(\alpha x + \beta y) = (\alpha x(t) + \beta y(t))' = \alpha x'(t) + \beta y'(t) = \alpha T x + \beta T y \]
Here the prime denotes classical differentiation and the operator T maps X into itself.
Example 35 (Integration). A linear operator T from C[a, b] into itself can be defined by
\[ y(t) = T x(t) = \int_a^t x(\tau) \, d\tau \]
By linearity of integration
\[ T(\alpha x + \beta y) = \int_a^t \left( \alpha x(\tau) + \beta y(\tau) \right) d\tau = \alpha \int_a^t x(\tau) \, d\tau + \beta \int_a^t y(\tau) \, d\tau = \alpha T x + \beta T y \]
Example 36 (Multiplication). Another linear operator from C[a, b] into itself is defined by
T x(t) = tx(t)
T1 = x × a ∀x ∈ X
T2 = x · a ∀x ∈ X
(Definition) Point Spread Function Consider an imaging system, L, that transforms an exact object
$I : R^n \to R$ to an imaged object $\hat{I} : R^n \to R$.
\[ \hat{I} = L I \]
The point spread function h is defined as the action of the operator on the delta functional.
\[ h(x, y) \equiv L \delta(x - y) \]
Applying this decomposition within the operator leads to a representation of the transformed object.
\[ L I(x) = L \int I(y) \delta(x - y) \, dy = L \lim_{\Delta y \to 0} \sum_i I(y_i) \delta(x - y_i) \Delta y \]
\[ = \lim_{\Delta y \to 0} \sum_i I(y_i) L \delta(x - y_i) \Delta y = \int I(y) L \delta(x - y) \, dy \qquad \text{(linearity of } L \text{)} \]
\[ = \underbrace{\int I(y) h(x, y) \, dy}_{\text{defn of } h} = \underbrace{\int I(y) h(x - y) \, dy}_{\text{shift invariant}} \]
Hence any image from the system may be understood as the convolution with the point spread function.
\[ \hat{I} = I * h \]
Example 39 (Point Spread Function Applied to 1D image). Consider a 1D image I : [a, b] → R. Given a
shift invariant point spread function h, the imaged object may be represented as
\[ \hat{I}(x_i) = \int I(y) h(x_i - y) \, dy \approx \sum_{j=1}^{N} I(y_j) h(x_i - y_j) \, \delta y \qquad i = 0, ..., N - 1 \]
\[ h(x) = \Delta k \, \frac{\sin(\pi N \Delta k \, x)}{\sin(\pi \Delta k \, x)} \]
Figure 27: Point Spread Function for Image Reconstruction. $\hat{I} = I * h$
Notice that the convolution operator is commutative. Given a phantom of known geometry I and the
reconstructed image $\hat{I}$, the PSF that characterizes the system may be found from the solution of the linear
system of equations (Homework exercise).
\[
\hat{I} = I * h = h * I = A \vec{h} \qquad
A \equiv \begin{bmatrix}
I[0] & I[-1] & I[-2] & I[-3] & I[-4] & \dots \\
I[1] & I[0] & I[-1] & I[-2] & I[-3] & \dots \\
I[2] & I[1] & I[0] & I[-1] & I[-2] & \dots \\
I[3] & I[2] & I[1] & I[0] & I[-1] & \dots \\
I[4] & I[3] & I[2] & I[1] & I[i - j] & \dots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix} \qquad
\vec{h} \equiv \begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[N - 1] \end{bmatrix}
\]
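The matrix form of the convolution can be checked on a small example. The Python sketch below is a minimal illustration, assuming causal indexing with zero padding (entries I[i − j] outside the sampled signal are taken as zero, which is one possible convention and not necessarily the one used in ExPSFFourier.m). The signal and PSF values are made up for the example.

```python
def convolve(a, b):
    """Full discrete convolution: out[k] = sum over i + j = k of a[i] * b[j]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def toeplitz_from_signal(signal, n):
    """n x n matrix with A[i][j] = signal[i - j], zero outside the signal."""
    return [[signal[i - j] if 0 <= i - j < len(signal) else 0.0
             for j in range(n)] for i in range(n)]

image = [0.0, 1.0, 1.0, 0.0, 0.0]   # hypothetical 1D phantom
psf = [0.5, 0.3, 0.2, 0.0, 0.0]     # hypothetical point spread function

A = toeplitz_from_signal(image, len(psf))
via_matrix = [sum(a * h for a, h in zip(row, psf)) for row in A]
via_convolution = convolve(image, psf)[:len(psf)]
print(via_matrix)
print(via_convolution)
```

Swapping the arguments of `convolve` gives the same result, which is the commutativity used in the text.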
\exampledir/ExPSFFourier.m
clear all
close all
FOV = 20;
deltax = FOV/400
epsilon = 1.e-6
xx = [epsilon:deltax:FOV/2];
Image = heaviside(xx - 5) - heaviside(xx - 8);
deltak = 1/FOV
N = 64;
ImpulseResponse = deltak * sin(pi * N * deltak * xx).* (sin(pi * deltak * xx)).^(-1);
A = toeplitz(ImpulseResponse );
ImageHat = A * Image’;
handle1 = figure(1)
plot(xx ,Image,’r’)
hold
plot(xx ,ImageHat,’k--’)
plot(reflectxx ,ReflectImpulseResponse )
set(gca,’FontSize’,16)
xlabel(’x’)
legend(’Exact’, ’Image’, ’PSF’ ,’Location’,’NorthWest’)
saveas(handle1,’PSFMR1D’,’png’)
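Example 40 (Null Space of a matrix). Given an operator T : X → Y, it is important to realize that the
null space of an operator is also a vector space that is generally a subset of the domain of the operator,
N(T) ⊂ X. As an example let's compute the null space of the matrix operator, L ≡ A
\[ A = \begin{bmatrix} 2 & 1 \\ -4 & -2 \end{bmatrix} \]
Recall the definition. The null space is the set of all vectors, z, that map to zero.
\[ Az = \begin{bmatrix} 2 & 1 \\ -4 & -2 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = 0 \;\Leftrightarrow\; \begin{cases} 2 \cdot (2 z_1 + z_2 = 0) \\ + (-4 z_1 - 2 z_2 = 0) \end{cases} \;\Leftrightarrow\; \begin{cases} 2 z_1 + z_2 = 0 \\ 0 z_1 + 0 z_2 = 0 \end{cases} \;\Leftrightarrow\; z_1 = -\frac{1}{2} z_2 \]
Hence, the null space of this operator is all vectors whose first component is negative one half of the second
component.
\[ N(A) = \left\{ \begin{bmatrix} -\frac{1}{2} \beta \\ \beta \end{bmatrix} : \beta \in R \right\} \;\Leftrightarrow\; \begin{bmatrix} 2 & 1 \\ -4 & -2 \end{bmatrix} \begin{bmatrix} -\frac{1}{2} \beta \\ \beta \end{bmatrix} = \begin{bmatrix} -\frac{1}{2} \cdot 2 \cdot \beta + \beta \\ -\frac{1}{2} \cdot (-4) \cdot \beta - 2 \beta \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
The null space is important when studying the inverse of an operator and the uniqueness of a solution.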
(Definition) Inverse Operator The mapping T : X → Y is said to be injective or one-to-one if different
points in the domain have different images.
\[ T x_1 = T x_2 \Rightarrow x_1 = x_2 \quad \Leftrightarrow \quad x_1 \neq x_2 \Rightarrow T x_1 \neq T x_2 \quad \forall x_1, x_2 \qquad (5) \]
In this case the inverse operator
\[ T^{-1} : Y \to X \]
exists and associates a given $y_0 \in Y$ to the unique $x_0 \in X$ that the operator maps to $y_0$.
\[ T x_0 = y_0 \qquad T^{-1} y_0 = x_0 \]
\[ T^{-1} T x = x \quad \forall x \qquad T T^{-1} y = y \quad \forall y \]
For our study of linear operators, it is important to note that the inverse of a linear operator exists if and
only if the null space of the operator consists of the zero vector only.
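Theorem 3.1 (Inverse Operator). Given two vector spaces, X and Y, and T : X → Y a linear operator.
(i) The inverse $T^{-1} : Y \to X$ exists if and only if the null space is zero.
\[ (T x_1 = T x_2 \Rightarrow x_1 = x_2 \quad \forall x_1, x_2) \;\Leftrightarrow\; (T z = 0 \Rightarrow z = 0 \;\Leftrightarrow\; N(T) = \{0\}) \]
(⇐) Assume that the null space is zero and that two points have the same image.
\[ T x_1 = T x_2 \Rightarrow T x_1 - T x_2 = 0 = T(x_1 - x_2) \qquad \text{(linearity of T)} \]
Hence by the assumption, T z = 0 ⇒ z = 0, the difference maps to $\vec{0}$, so the difference is zero and the
two points are the same.
(⇒) Conversely, we want to show that the null space of the operator is zero given that $T^{-1}$ exists.
\[ (T x_1 = T x_2 \Rightarrow x_1 = x_2 \quad \forall x_1, x_2) \Rightarrow (T z = 0 \Rightarrow z = 0) \]
Since this holds for all $x_1, x_2$, let $x_2 = 0$ and $x_1 = z$ arbitrary. By the properties of the linear operator,
$T x_2 = T 0 = 0$.
\[ T z = 0 \Rightarrow \underbrace{T z = T x_1 = 0 = T x_2 = T 0 \Rightarrow z = x_1 = x_2 = 0}_{\text{by assumption, } T x_1 = T x_2 \Rightarrow x_1 = x_2} \]
(ii) We assume that $T^{-1}$ exists and show that it is linear. Consider any $x_1, x_2 \in X$ and their images
\[ y_1 = T x_1 \quad T^{-1} y_1 = x_1 \qquad y_2 = T x_2 \quad T^{-1} y_2 = x_2 \]
Since T is linear
\[ \alpha y_1 + \beta y_2 = \alpha T x_1 + \beta T x_2 = T(\alpha x_1 + \beta x_2) \]
Since the inverse is defined we can associate $\alpha y_1 + \beta y_2$ with its image in X.
\[ T^{-1}(\alpha y_1 + \beta y_2) = \alpha x_1 + \beta x_2 \]
Finally since $x_j = T^{-1} y_j$
\[ T^{-1}(\alpha y_1 + \beta y_2) = \alpha T^{-1} y_1 + \beta T^{-1} y_2 \]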
Example 41 (Ill Conditioned System). Returning to everyday life, why is this important? What does this
null space have to do with anything ?
>> ExMatrixLinearOperator
clear all
close all
echo on
A =
2.4420 2.7390 0.8340
zone =
6.8069
15.4675
20.4206
ztwo =
6.8069
15.4675
20.4206
xone =
1.0e+11 *
-1.3894
0.8301
1.3422
xtwo =
1.0e+12 *
-2.9178
1.7432
2.8187
%the solution has changed by an order of magnitude ?!?!?
%what’s going on ?!?!?
%x is a solution to A x = b
x
x =
0.9572
0.4854
0.8003
A * x -b
ans =
1.0e-03 *
0.0144
0.0627
0.1432
ans =
1.0e-03 *
0.0568
0.1594
0.2704
ans =
1.0e-03 *
0.0144
0.0627
0.1432
\[ N(A) \neq \{\vec{0}\} \]
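For clarity let's consider a 2x2 system, or a 2-pixel scanner if you will. Find $(x_1, x_2)$
\[ Ax = \begin{bmatrix} 0.9130 & 0.6590 \\ 0.4570 & 0.3300 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.254 \\ 0.127 \end{bmatrix} = b \qquad T = A : R^2 \to R^2 \qquad X = Y = R^2 \]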
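The sensitivity of this 2 × 2 system can be seen with a direct solve. The Python sketch below is a minimal illustration using Cramer's rule (fine for a 2 × 2 system, not a recommendation for larger ones). The perturbed right hand side is a hypothetical measurement error of 0.001 in the first component, chosen only for this example.

```python
def solve_2x2(A, b):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [x1, x2]

A = [[0.9130, 0.6590],
     [0.4570, 0.3300]]
b = [0.254, 0.127]

x = solve_2x2(A, b)

# HYPOTHETICAL perturbation of the measurement (first entry changed by 0.001).
b_perturbed = [0.255, 0.127]
x_perturbed = solve_2x2(A, b_perturbed)

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
print(det, x, x_perturbed)
```

The determinant is small relative to the matrix entries, so the two columns are nearly parallel and a small change in b produces a change of order one in the solution.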
We are getting closer to being able to answer fundamental questions that will appear time and time again in
our research.
Given a linear operator T : X → Y and a right hand side vector y ∈ Y ,
(i) Does an element x ∈ X exist such that
Tx = y
(ii) Is the element unique ?
(iii) How does uncertainty due to precision and accuracy limitations in the measurement, ∆b affect the
solution we obtain from a canned MATLAB program ?
Obvious observations are:
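1. For a given y ∈ Y, a solution exists ⇔ y ∈ R(T)
2. A solution is unique ⇔ N(T) = {0}, i.e. the null space contains only the zero vector, or equivalently the operator is one-to-one
(injective). Otherwise any element z of the null space may be added to a solution x.
\[ Ax = y \text{ and } Az = 0 \;\Rightarrow\; A(x + z) = y \]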
We can gain further intuition by analyzing the linear operator in an explicit finite dimensional setting. We
have a very powerful theorem at our disposal in finite dimensions, which we state without proof.
Theorem 3.2 (Rank and Nullity). Let X be a finite-dimensional vector space and let T : X → Y denote a
linear transformation from X into Y. Then
\[ \dim X = \dim N(T) + \dim R(T) \]
i.e. the sum of the rank and nullity of the linear transformation T equals the dimension of the space X.
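For a matrix operator the theorem can be checked by counting pivots. The Python sketch below computes the rank by Gaussian elimination in exact rational arithmetic and applies it to the matrix of Example 40. The `rank` helper is written for this illustration, not taken from a library.

```python
from fractions import Fraction

def rank(matrix):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    pivots = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(pivots, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[pivots], rows[pivot] = rows[pivot], rows[pivots]
        for r in range(pivots + 1, len(rows)):
            factor = rows[r][col] / rows[pivots][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivots])]
        pivots += 1
    return pivots

A = [[2, 1], [-4, -2]]           # matrix from Example 40
dim_domain = len(A[0])
dim_range = rank(A)
dim_null = dim_domain - dim_range  # nullity implied by Theorem 3.2

# Null space element from Example 40 with beta = 1: z = (-1/2, 1)
z = [Fraction(-1, 2), Fraction(1)]
Az = [sum(a * v for a, v in zip(row, z)) for row in A]
print(dim_range, dim_null, Az)
```

Here the rank is 1, so the theorem predicts a one dimensional null space, consistent with the single free parameter β found in Example 40.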
To be explicit, let $\{e_1, e_2, ..., e_n\}$ and $\{f_1, f_2, ..., f_m\}$ denote a basis for $X = R^n$ and $Y = R^m$, respectively.
Then for a given x ∈ X we can represent it in terms of its basis and apply the operator, $T : R^n \to R^m$
\[ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \;\Leftrightarrow\; x = \sum_{j=1}^{n} x_j e_j \qquad T x = \sum_{j=1}^{n} x_j T e_j \]
Now each of the vectors $T e_j \in Y$ and therefore has its own representation with respect to the basis $f_i$.
Denoting the components of $T e_j$ with respect to the basis $f_i$ with $a_{ij}$
\[ T e_j = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{bmatrix} = a_{1j} f_1 + a_{2j} f_2 + ... + a_{mj} f_m = \sum_{i=1}^{m} a_{ij} f_i \]
we have the usual matrix-vector multiply with m rows and n columns.
\[ T x = \sum_{j=1}^{n} x_j T e_j = \sum_{j=1}^{n} x_j \sum_{i=1}^{m} a_{ij} f_i = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} a_{ij} x_j \right) f_i = \sum_{i=1}^{m} y_i f_i \]
\[ \Leftrightarrow\; \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ a_{21} & \dots & a_{2n} \\ \vdots & & \vdots \\ a_{m1} & \dots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \;\Leftrightarrow\; y = Ax \]
Again, the matrix entries $a_{ij}$ have the interpretation as the components of the mapping of the domain
basis $T e_j$ with respect to the range basis $f_i$.
(Definition) Singular Matrix We say that a matrix, A, is singular iff the determinant of the matrix is
equal to zero.
\[ A \text{ is singular} \;\Leftrightarrow\; \det A = 0 \;\Leftrightarrow\; \text{the rows (equivalently, the columns) of the matrix are linearly dependent} \]
Algebraic properties of the determinant may be used to show that the rows and the columns of a singular matrix
are linearly dependent.
Here we are saying that the set $\{T e_1, T e_2, ..., T e_m\}$ is linearly independent and hence forms a basis for
Y. Since Y is of the same dimension as rank A, any y ∈ Y may be decomposed in this basis, so
a solution exists and is given by the coefficients $\{x_i\}$ of this decomposition.
\[ y = \sum_i x_i T e_i \]
Finally by the Rank and Nullity Theorem 3.2, since dim X = rank A, the dimension of the null space
must be zero
\[ N(T) = \{0\} \]
and the solution is unique.
(ii) dim X = dim Y > rank A (i.e. A singular), infinitely many solutions: $N(T) \neq \{0\}$ and y ∈ R(T). y existing
within the range of the operator is equivalent to saying that the rank of the augmented matrix is the
same as the rank of the original matrix, i.e. y is linearly dependent on the columns of the matrix.
\[ \text{rank} \begin{bmatrix} a_{11} & \dots & a_{1n} & y_1 \\ a_{21} & \dots & a_{2n} & y_2 \\ \vdots & & \vdots & \vdots \\ a_{m1} & \dots & a_{mn} & y_m \end{bmatrix} = \text{rank} \begin{bmatrix} a_{11} & \dots & a_{1n} \\ a_{21} & \dots & a_{2n} \\ \vdots & & \vdots \\ a_{m1} & \dots & a_{mn} \end{bmatrix} \]
Any multiple of a null space element z, with T z = 0, may then be added to a solution.
\[ T x = y \;\rightarrow\; y = T x = T x + 0 = T x + \alpha T z = T(x + \alpha z) \quad \forall \alpha \]
(iii) dim X = dim Y > rank A (i.e. A singular), no solutions: $y \notin R(T)$. In other words the span$\{T e_1, T e_2, ..., T e_m\}$
does not cover all of Y AND the y in question does not exist in that span.
(iv) n = dim X > dim Y = m. The number of equations is smaller than the number of unknowns. From
the fundamental identity n = dim N(T) + dim R(T), with dim R(T) ≤ m = dim Y, the dimension of the null
space is always greater than zero, dim N(T) > 0, and if a solution exists it is never unique. Similar to
before, the rank of the augmented matrix may be used to show if a solution exists.
(v) n = dim X < dim Y = m. The number of equations is bigger than the number of unknowns. Again
from the fundamental identity m > n = dim N(T) + dim R(T), so the range space must be a proper subspace
of the full space, m > dim R(T), and the rank of the augmented matrix may be used to show if a
solution exists. The dimension of the null space may or may not be 0.
Bounded linear operators form the basis of a rich theory in functional analysis and will facilitate much of the
discussion of matrix analysis for solving systems of linear equations that appears in our research. In fact,
the space of bounded linear operators may be considered a normed space with an operator norm defined as
the supremum over all bounding constants. There are rigorous methods of defining the supremum of a set
in terms of partial orderings. For our purposes, R will suffice.
(Definition) Supremum The supremum of a set, A, is denoted sup(A) and denotes the least upper bound
of the set.
\[ \sup(A) \leq c \quad \forall c \in \{b : b \geq a \;\; \forall a \in A\} \qquad \underbrace{\sup(A) \in \{b : b \geq a \;\; \forall a \in A\}}_{\text{is an element of the set of upper bnds}} \]
Example 42 (Supremum vs Maximum). The motivation for defining the supremum is that it allows us to define
a least upper bound of a set when the maximum does not exist. Consider, for example, the open set
\[ A = (2, 4) \]
The max does not exist. Indeed, you can always find a number in the set that is epsilon bigger, i.e. 3.999...99 <
3.999...991. However, the set [4, ∞) consists of all upper bounds for A and the least upper bound is the
supremum.
\[ \sup A = 4 \]
There are several equivalent definitions of the “matrix norm” in terms of the supremum of the bounding
constants. Proof of the equivalence is outside the scope of this course. As before, the matrix norm provides
a method to quantify the size and magnitude of a matrix and other linear operators.
\[ \|T\| \equiv \sup_{x \in D(T), \, x \neq 0} \frac{\|T x\|}{\|x\|} = \sup_{x \in D(T), \, \|x\| = 1} \|T x\| \]
Note that letting the constant equal the norm, c = ‖T‖, we arrive at a frequently used formula that bounds
the application of the operator to the domain. By definition, the matrix norm, ‖T‖, is an element of the set
of upper bounds
\[ \left\{ c : \frac{\|T x\|}{\|x\|} \leq c \quad x \neq 0 \right\} \]
in fact it is the least upper bound.
\[ \|T\| = \sup_{x \in D(T), \, x \neq 0} \frac{\|T x\|}{\|x\|} \;\Rightarrow\; \underbrace{\frac{\|T x\|}{\|x\|} \leq \|T\| \quad \forall x \neq 0}_{\text{(by defn, sup is an upp. bnd.)}} \;\Rightarrow\; \|T x\| \leq \|T\| \|x\| \quad \forall x \qquad (7) \]
Figure 30: Area under the curve is bounded by the rectangle at the max value, $\int_{\Delta x} f(x) \, dx \leq f_{\max} \Delta x$.
Here k is a given function, which is called the kernel of T and is assumed to be continuous on the closed
square G = J × J in the tτ-plane, where J = [0, 1]. This operator is linear and bounded. To prove this we
first note that the continuity of k on the closed square implies that k is bounded, i.e. there exists a $k_0$ such that
\[ |k(t, \tau)| \leq k_0 \in R \quad \forall (t, \tau) \in J \times J \]
Hence,
\[ \|y\| = \|T x\| = \max_{t \in J} \left| \int_0^1 k(t, \tau) x(\tau) \, d\tau \right| \leq \max_{t \in J} \int_0^1 |k(t, \tau)| |x(\tau)| \, d\tau \leq k_0 \|x\| \]
So we have
\[ \|T x\| \leq k_0 \|x\| \]
and T is bounded.
We state without proof some important properties of bounded operators in finite dimensions, as well as the continuity of bounded operators.
Theorem 3.4 (Finite Dimension). If a normed space $X$ is finite dimensional, then every linear operator on $X$ is bounded.
Proof. See Theorem 2.7-8 in [Kreyszig, 1989].
Theorem 3.5 (Continuity and Boundedness). Let $T : D(T) \to Y$ be a linear operator, where $D(T) \subset X$ and $X$, $Y$ are normed spaces. Then:
\[ T \text{ is continuous} \iff T \text{ is bounded} \]
where continuity means
\[ x_n \to x \ \Rightarrow\ Tx_n \to Tx, \qquad x_n, x \in D(T) \]
An important example that will be used in studying the accuracy of the solution to a linear system of equations is the norm of a matrix.
Example 46 (Matrix Operator). Consider the matrix operator $T = A$, $A : (\mathbb{R}^n, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_1)$. In determining the matrix norm, notice that it is important to specify the norm used on the domain and on the range space. We will denote by $\|A\|_1$ the matrix norm subordinate to the vector norms $\|Ax\|_1$ and $\|x\|_1$. Our strategy for establishing the analytical expression of the matrix norm will be to
• Find a constant, $K$, such that $\|Ax\|_1 \le K\|x\|_1 \ \ \forall x$, so that $\|A\|_1 \le K$.
• Find a particular $x^*$ with unit norm, $\|x^*\|_1 = 1$, for which equality is obtained, $\|Ax^*\|_1 = K$, so that $\|A\|_1 \ge K$.
• Then the matrix norm is bounded above and below by our constant and the result, $\|A\|_1 = K$, is obtained.
For the final step, pick $x^*$ such that all entries are zero except for a one in the $j$-th position, where $j$ is the column with the maximum absolute column sum of the matrix.
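The strategy above can be spot-checked numerically. The following is a minimal Python sketch (plain standard library; the helper names `vec_norm1`, `matvec`, and `column_sums` are hypothetical, and the matrix is borrowed from the conditioning example later in these notes). It only tests the upper bound on one trial vector, so it illustrates the argument rather than proving it.

```python
def vec_norm1(x):
    """1-norm of a vector: sum of absolute values."""
    return sum(abs(v) for v in x)


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def column_sums(A):
    """Absolute column sums of A."""
    return [sum(abs(row[j]) for row in A) for j in range(len(A[0]))]


A = [[0.913, 0.659],
     [0.457, 0.330]]

sums = column_sums(A)
K = max(sums)            # candidate constant: the max absolute column sum
j_star = sums.index(K)   # column that attains it

# Step 1 (spot check on one trial vector): ||Ax||_1 / ||x||_1 <= K
trial = [0.3, -0.7]
ratio = vec_norm1(matvec(A, trial)) / vec_norm1(trial)

# Step 2: the unit vector e_j with j = j_star attains equality
x_star = [1.0 if j == j_star else 0.0 for j in range(len(sums))]
attained = vec_norm1(matvec(A, x_star))

print(K, ratio, attained)
```

Because `x_star` simply selects the column with the largest absolute sum, `attained` equals `K`, which is the lower-bound half of the argument.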
Example 47 (Differentiation Operator). Let $X$ be the normed space of all polynomials on $J = [0, 1]$ with norm given by $\|x\| = \max_{t \in J} |x(t)|$. A differentiation operator $T$ is defined on $X$ by
\[ Tx(t) = x'(t) \]
where the prime denotes differentiation with respect to $t$. This operator is linear but NOT bounded. Consider $x_n(t) = t^n$, where $n \in \mathbb{N} = \{1, 2, 3, ...\}$. Then $\|x_n\| = 1$ and
\[ Tx_n(t) = x_n'(t) = n t^{n-1}, \qquad \|Tx_n\| = \max_{t \in [0,1]} |n t^{n-1}| = n \cdot 1 = n\|x_n\| \ \Rightarrow\ \frac{\|Tx_n\|}{\|x_n\|} = n \quad \forall n \]
Since $n \in \mathbb{N}$ is arbitrary, there is no fixed number $c$ such that $\frac{\|Tx_n\|}{\|x_n\|} \le c$. Thus $T$ is not bounded.
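A short numerical illustration of this unboundedness, written as a Python sketch. The max norm is approximated by sampling on a uniform grid that includes the endpoint $t = 1$; the grid size and the helper names `max_norm` and `growth_ratio` are assumptions made for illustration.

```python
def max_norm(f, n_samples=2001):
    """Approximate max |f(t)| on [0, 1] by sampling a uniform grid."""
    return max(abs(f(i / (n_samples - 1))) for i in range(n_samples))


def growth_ratio(n):
    """Return ||T x_n|| / ||x_n|| for x_n(t) = t**n and T = d/dt."""
    x_n = lambda t: t ** n
    dx_n = lambda t: n * t ** (n - 1)   # exact derivative of t**n
    return max_norm(dx_n) / max_norm(x_n)


ratios = [growth_ratio(n) for n in (1, 2, 5, 10, 50)]
print(ratios)
```

The ratio grows linearly with the polynomial degree, so no single constant $c$ can bound it for all $n$.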
3.3 Applications: Conditioning & Residual [Heath, 1998] Chapter 2
The condition number $\kappa$ of a matrix is defined with respect to a particular matrix norm $\|\cdot\|$ as
\[ \kappa(A) \equiv \|A\|\,\|A^{-1}\| \]
In finite dimensions, the value of the matrix norm depends on the norms chosen on the domain, $D(T)$, and range, $R(T)$, of the operator.
\[ T : (\mathbb{R}^n, \|\cdot\|_\infty) \to (\mathbb{R}^n, \|\cdot\|_\infty) \ \Rightarrow\ \|T\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |T_{ij}| \]
\[ T : (\mathbb{R}^n, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_1) \ \Rightarrow\ \|T\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |T_{ij}| \]
From a practical standpoint, Theorem 2.3 suggests that on this finite dimensional space of matrix operators it may be more convenient to compute the matrix norm with respect to the 1-norm rather than the 2-norm, for example. However, if the matrix is approaching singularity in a given norm then it approaches singularity in all norms.
The condition number may be verified in MATLAB
>> A = [0.913, 0.659;0.457,0.330]
A =
0.9130 0.6590
0.4570 0.3300
>> cond(A,1)
ans =
1.6958e+04
>> cond(A,2)
ans =
1.2485e+04
>> cond(A,inf)
ans =
1.6958e+04
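The 1-norm and infinity-norm values reported above can be reproduced by hand from the row and column sum formulas. Below is a minimal Python sketch (standard library only; `norm1`, `norm_inf`, and `inverse_2x2` are hypothetical helper names) that uses the closed-form inverse of a 2 × 2 matrix. It is an illustration, not a substitute for MATLAB's `cond`.

```python
def norm1(M):
    """Matrix 1-norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))


def norm_inf(M):
    """Matrix infinity-norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def inverse_2x2(M):
    """Closed-form inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]


A = [[0.913, 0.659],
     [0.457, 0.330]]
A_inv = inverse_2x2(A)

cond_1 = norm1(A) * norm1(A_inv)
cond_inf = norm_inf(A) * norm_inf(A_inv)
print(cond_1, cond_inf)   # both close to 1.6958e+04
```

For this particular matrix the two condition numbers coincide because both reduce to $1.37 \cdot 1.572 / |\det A|$.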
The residual of an approximate solution, x̂, to the linear system Ax = b is the difference
r = b − Ax̂
Let's look at the residual norms of two potential solutions. In general we would expect the residual norm to decrease as we obtain a better solution.
>> b = [ 0.254; 0.127];
>> xexact = [1;-1]
xexact =
1
-1
>> norm(A*xexact-b,1)
ans =
>> xone = [-0.0827; 0.5]
xone =
-0.0827
0.5000
>> xtwo = [0.999; -1.001]
xtwo =
0.9990
-1.0010
>> norm(A*xone-b,1)
ans =
2.1120e-04
>> norm(xexact-xone,1)
ans =
2.5827
>> norm(A*xtwo-b,1)
ans =
0.0024
>> norm(xexact-xtwo,1)
ans =
0.0020
Upon initial inspection, one may think that the $\hat{x}_1$ solution is the better approximation because of the smaller residual. However, the exact solution may be verified to be $x = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$. As seen in equation (8), this is an excellent example of where a small residual does not imply a small error in the solution because of the ill-conditioning of the linear system matrix, $A$. Assuming the matrix is nonsingular, a relative error bound may be related to the residual
\[ \left. \begin{aligned} Ax &= b \\ A\hat{x} &= b - r \end{aligned} \right\} \ \Rightarrow\ A(x - \hat{x}) = r \ \Rightarrow\ (x - \hat{x}) = A^{-1} r \ \Rightarrow\ \|\Delta x\| = \|\hat{x} - x\| = \|A^{-1} r\| \le \|A^{-1}\|\,\|r\| \tag{8} \]
Manipulating the inequality,
\[ \|\Delta x\| \le \|A^{-1}\|\,\|r\| = \frac{\|A\|\,\|\hat{x}\|}{\|A\|\,\|\hat{x}\|}\,\|A^{-1}\|\,\|r\| \ \Rightarrow\ \frac{\|\Delta x\|}{\|\hat{x}\|} \le \mathrm{cond}(A)\,\frac{\|r\|}{\|A\|\,\|\hat{x}\|} \]
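The bound $\|\Delta x\| \le \|A^{-1}\|\,\|r\|$ can be checked against the two candidate solutions from the MATLAB session. The sketch below is plain Python with hypothetical helper names, and it uses the 1-norm throughout to match `norm(...,1)` above.

```python
def norm1_vec(x):
    """Vector 1-norm."""
    return sum(abs(v) for v in x)


def norm1_mat(M):
    """Matrix 1-norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[0.913, 0.659], [0.457, 0.330]]
b = [0.254, 0.127]
x_exact = [1.0, -1.0]
candidates = {"x_one": [-0.0827, 0.5], "x_two": [0.999, -1.001]}

inv_norm = norm1_mat(inverse_2x2(A))
results = {}
for name, x_hat in candidates.items():
    r = [bi - yi for bi, yi in zip(b, matvec(A, x_hat))]   # residual b - A x_hat
    residual = norm1_vec(r)
    error = norm1_vec([xh - xe for xh, xe in zip(x_hat, x_exact)])
    bound = inv_norm * residual                            # right-hand side of (8)
    results[name] = (residual, error, bound)
    print(name, residual, error, bound)
```

Here `x_one` has the smaller residual but the larger error, and in both cases the error stays below the bound, which is large because $\|A^{-1}\|_1$ is large.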
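Revisiting our example matrix,
\[ Ax = \begin{bmatrix} 0.913 & 0.659 \\ 0.457 & 0.330 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.254 \\ 0.127 \end{bmatrix} = b \]
Consider the residual from two approximate solutions
\[ \hat{x}_1 = \begin{bmatrix} -0.0827 \\ 0.5 \end{bmatrix} \quad \text{and} \quad \hat{x}_2 = \begin{bmatrix} 0.999 \\ -1.001 \end{bmatrix} \]
The first solution may be obtained from four-digit arithmetic Gaussian elimination by multiplying both sides of the linear system by an elimination matrix
\[ \begin{bmatrix} 1 & 0 \\ -\frac{0.457}{0.913} & 1 \end{bmatrix} \begin{bmatrix} 0.913 & 0.659 \\ 0.457 & 0.330 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -\frac{0.457}{0.913} & 1 \end{bmatrix} \begin{bmatrix} 0.254 \\ 0.127 \end{bmatrix} \]
\[ \Rightarrow\ \begin{bmatrix} 0.9130 & 0.6590 \\ 0.0 & 0.0002 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.2540 \\ 0.0001 \end{bmatrix} \]
Back substitution gives the solution
\[ \hat{x}_1 = \begin{bmatrix} -0.0827 \\ 0.5 \end{bmatrix} \]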
Figure 31: The effect of matrix conditioning on error bounds of the solution. (a) The solution set of each of the two equations in the linear system is drawn as a straight line in the plane. The width of the lines reflects the uncertainty in the data within the specified precision, arising for example from limitations in measurement precision and accuracy. The resulting uncertainty in the intersection (i.e., the solution) depends on the condition number of the matrix. http://www.cse.illinois.edu/iem/linear_equations/conditioning (b) The region of uncertainty in the right-hand-side vector for a given relative error is shown in the right graph by a shaded circular disk whose size can be altered by dragging its perimeter, and the resulting numerical value for the relative error in the rhs is shown below. The lightly shaded circular disk in the left graph shows the corresponding region of uncertainty in the solution vector $x$ given by the condition number of the matrix, and the corresponding bound on the relative error in $x$ is shown below. In this case the poorly conditioned matrix is seen to significantly amplify the error in the solution. Working with this example you can also see the instability in the solution: small changes in the measurement data produce large changes in the output solution. http://www.cse.illinois.edu/iem/linear_equations/error_bound
Notice that large perturbations of the solution may occur when large condition numbers overwhelm machine epsilon, $\|\Delta b\| \approx \epsilon_{mach}$, $\|\Delta b\| / \|b\| \ll \kappa(A)$.
When solving a system of equations in floating point arithmetic, numerical inaccuracies may exist in both the matrix and the right hand side.
\[ (A + E)\hat{x} = b + \Delta b \]
As a homework exercise, a similar derivation can show that the error in the perturbed solution may be bounded by these numerical inaccuracies
\[ \frac{\|\Delta x\|}{\|x\|} \le \mathrm{cond}(A) \left( \frac{\|\Delta b\|}{\|b\|} + \frac{\|E\|}{\|A\|} \right) \]
Here we see that the condition number of the system plays an important role in the computer solution of the system of equations. If we assume that the numerical perturbations are on the order of the machine precision, $\epsilon_{mach}$, then the bound on the relative error is directly proportional to the condition number.
\[ \frac{\|\hat{x} - x\|}{\|x\|} \le \mathrm{cond}(A)\, O(\epsilon_{mach}) \]
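Compute the condition number of this matrix.
\[ A^{-1} = \frac{1}{a(\alpha a + \epsilon) - a \alpha a} \begin{bmatrix} \alpha a + \epsilon & -a \\ -\alpha a & a \end{bmatrix} = \frac{1}{a \epsilon} \begin{bmatrix} \alpha a + \epsilon & -a \\ -\alpha a & a \end{bmatrix} \ \Rightarrow\ \|A^{-1}\|_1 = \max_j \sum_i |a_{ij}| = \frac{2 \alpha a + \epsilon}{a \epsilon} \]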
\[ f : D(f) \to K, \qquad K = \mathbb{R} \text{ or } \mathbb{C} \]
(Definition) Bounded Linear Functional A bounded linear functional $f$ is a bounded linear operator with range in the scalar field, $\mathbb{R}$ or $\mathbb{C}$. Thus there exists $c \in \mathbb{R}$ such that
\[ |f(x)| \le c\|x\| \]
and
\[ |f(x)| \le \|f\|\,\|x\| \]
\[ f(x) = x \cdot a = x_1 a_1 + x_2 a_2 + x_3 a_3, \qquad a \in \mathbb{R}^3 \text{ fixed} \]
This functional has an important place in Hilbert space theory. $f$ is in fact linear and bounded. The proof is left as a homework exercise.
\[ \vec{I} = \begin{bmatrix} I_1(d) \\ I_2(d) \\ I_3(d) \end{bmatrix} \in (C[a, b])^3 \]
such that the delivered dose $D(\vec{I}) = D_{ideal}$. For a given exposure time, $\Delta t$, let's assume that the delivered dose may be written as a linear combination of the individual beam doses.
\[ D(x, y) = \sum_{j=1}^{N_{beam} = 3} \Delta t\, I_j(d(x, y)) \exp\left( -\int_{l(x,y)} \mu(s)\, ds \right) \]
Here $d(x, y)$ and $l(x, y)$ represent the distance along the beam intensity profile and the path to the source, respectively. To make this infinite dimensional problem tractable, consider the dose at a finite set of pixels $\{\vec{x}_1, \vec{x}_2, ..., \vec{x}_m\}$ and the corresponding intensities for each beam path at these pixels
\[ D(x, y) = A \underbrace{\begin{bmatrix} I_1(d(\vec{x}_1)) \\ I_2(d(\vec{x}_1)) \\ I_3(d(\vec{x}_1)) \\ I_1(d(\vec{x}_2)) \\ I_2(d(\vec{x}_2)) \\ I_3(d(\vec{x}_2)) \\ \vdots \\ I_1(d(\vec{x}_m)) \\ I_2(d(\vec{x}_m)) \\ I_3(d(\vec{x}_m)) \end{bmatrix}}_{x} = \underbrace{\begin{bmatrix} D_{ideal}(\vec{x}_1) \\ D_{ideal}(\vec{x}_2) \\ \vdots \\ D_{ideal}(\vec{x}_m) \end{bmatrix}}_{b} \]
Here we have more unknowns than equations. What do we know about the solution to this problem?
\[ Ax = b \]
From the rank and nullity theorem (3.2), $3m = \dim X > \dim Y = m$: the number of equations is smaller than the number of unknowns. From the fundamental identity $n = \dim N(T) + \dim R(T)$, with $n = 3m$ and $\dim R(T) \le m = \dim Y$, the dimension of the null space is always greater than zero, $\dim N(T) \ge 2m > 0$, and if a solution exists it is never unique.
\[ \sqrt{(b - p, b - p)} = \inf_{v \in M} \sqrt{(b - v, b - v)}, \qquad M \equiv R(A), \quad b \in \mathbb{R}^m \]
Figure 33: We cannot tell the patient that the solution is not inside the range space or that the null space is non-zero. A particular unique and well-defined solution may be provided in terms of the inner product. This approach is commonly referred to as the least squares solution and has the interpretation of the minimal distance to the subspace defined by the range of the operator, $M \equiv R(A)$, i.e. orthogonal projection. Notice that the distance measure is defined in terms of the inner product, $d(x, y) = \|x - y\| = \sqrt{(x - y, x - y)}$.
K defined on X, ie R or C. Specifically, we say that (·, ·) : X × X → K is an inner product if the following
properties hold.
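\[ (x, \alpha y + \beta z) = \overline{(\alpha y + \beta z, x)} = \overline{\alpha (y, x) + \beta (z, x)} = \bar{\alpha}\, \overline{(y, x)} + \bar{\beta}\, \overline{(z, x)} = \bar{\alpha} (x, y) + \bar{\beta} (x, z) \]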
For most practical applications we will assume that we have an inner product defined. This will provide
us a notion of the ‘angle’ between two vectors and notice that an inner product on X defines a norm on X
\[ \|x\| = \sqrt{(x, x)} \tag{9} \]
(x, y) = 0
(a, b) = 0 ∀a ∈ A b∈B
(x, y) = x1 y1 + x2 y2 + ...xn yn
And the norm and metric induced by this inner product is the familiar $l_2$ distance measure.
\[ d(x, y) = \|x - y\| = \sqrt{(x - y, x - y)} = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + ... + |x_n - y_n|^2} \]
The conjugate on $y$ is needed to satisfy the symmetry property (I2), $(x, y) = \overline{(y, x)}$, and to ensure that the length of a vector is non-negative and real-valued in the case of complex numbers.
For R3 , this gives the usual dot product from vector calculus
(x, y) = x · y = x1 y1 + x2 y2 + x3 y3
(x, y) = x · y = 0
You will show in a homework exercise that the norm induced by an inner product, $\|x\| = \sqrt{(x, x)}$, satisfies the parallelogram equality
\[ \|x + y\|^2 + \|x - y\|^2 = 2\left( \|x\|^2 + \|y\|^2 \right) \tag{11} \]
It is worth noting that there do exist norms that are not generated by an inner product, i.e. do not satisfy (11); hence not all normed spaces are inner product spaces.
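The parallelogram equality gives a quick numerical test for whether a norm can come from an inner product. The Python sketch below (standard library only; the function names are hypothetical) evaluates the gap between the two sides of the equality for the 2-norm and the 1-norm on one pair of vectors in $\mathbb{R}^2$. A single failing pair is enough to rule out an inner product; a passing pair proves nothing in general.

```python
import math


def norm2(x):
    """Euclidean norm, induced by the usual dot product."""
    return math.sqrt(sum(v * v for v in x))


def norm1(x):
    """1-norm: sum of absolute values."""
    return sum(abs(v) for v in x)


def parallelogram_gap(norm, x, y):
    """||x+y||^2 + ||x-y||^2 - 2(||x||^2 + ||y||^2); zero if the equality holds."""
    s = [a + b for a, b in zip(x, y)]
    d = [a - b for a, b in zip(x, y)]
    return norm(s) ** 2 + norm(d) ** 2 - 2.0 * (norm(x) ** 2 + norm(y) ** 2)


x = [1.0, 0.0]
y = [0.0, 1.0]
gap_2 = parallelogram_gap(norm2, x, y)   # approximately 0
gap_1 = parallelogram_gap(norm1, x, y)   # 4 + 4 - 2*(1 + 1) = 4
print(gap_2, gap_1)
```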
Figure 34: Parallelogram Equality. As the name suggests, even in our abstract inner product spaces, the parallelogram equality, Eqn (11), from elementary geometry still holds. I.e., the sum of the squares of the four sides equals the sum of the squares of the two diagonals.
Example 51 (1-norm). Not all norms are induced by an inner product. For example, there does not exist an inner product that can induce the 1-norm on a vector space.
\[ \|x\|_1 = \sum_k |x_k| \ne \sqrt{(x, x)} \]
Similarly, the max norm, $\|x\| = \max_{t \in [a, b]} |x(t)|$, on the space of continuous functions does not satisfy the parallelogram equality (11) and is thus not induced by an inner product. To see this consider
\[ x(t) = 1, \quad \|x\| = 1, \qquad y(t) = \frac{t - a}{b - a}, \quad \|y\| = 1 \]
\[ x(t) + y(t) = 1 + \frac{t - a}{b - a}, \quad \|x + y\| = 2, \qquad x(t) - y(t) = 1 - \frac{t - a}{b - a}, \quad \|x - y\| = 1 \]
The parallelogram equality (11) is not satisfied, since $\|x + y\|^2 + \|x - y\|^2 = 5$ while $2(\|x\|^2 + \|y\|^2) = 4$.
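The motivating example for a 'closed' inner product space is the space of continuous functions $C[-1, 1]$ with the inner product
\[ (x, y) \equiv \int_{-1}^{1} x(t)\, y(t)\, dt \]
This is an example of an inner product space but not a Hilbert space. The sequence of continuous functions
\[ x_n(t) = \begin{cases} 0, & -1 \le t \le 0 \\ nt, & 0 \le t \le 1/n \\ 1, & 1/n \le t \le 1 \end{cases} \]
is Cauchy in the induced norm, but its limit is a discontinuous step function that is not an element of $C[-1, 1]$ (Figure 35). Closing the space under such limits defines a very important Hilbert space (a closed inner product space) of square integrable functions
\[ L_2(a, b) \equiv \left\{ f : \int_a^b |f(t)|^2\, dt < \infty \right\}, \qquad f(t)\, \overline{f(t)} = |f(t)|^2 \]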
Figure 35: Incomplete Space.
Notice that this space includes many more functions than $C[a, b]$, including discontinuous functions and functions whose tail decays fast enough. For example,
\[ f(x) = 1/x \in L_2(1, \infty): \qquad \int_1^\infty f^2(x)\, dx = \int_1^\infty \frac{1}{x^2}\, dx = \left[ -\frac{1}{x} \right]_1^\infty = 1 - \frac{1}{\infty} = 1 \]
There are also functions that blow up to infinity but for which the integral is defined. For example, in spherical coordinates
\[ f(x, y, z) = f(r) = 1/r \in L_2(\Omega), \qquad \Omega = \{ x \in \mathbb{R}^3 : \|x\| \le 1 \} \]
\[ \int_\Omega f^2(r)\, r^2 \sin\phi\, dr\, d\phi\, d\theta = \int_0^{2\pi} \int_0^{\pi} \int_0^1 \frac{r^2}{r^2} \sin\phi\, dr\, d\phi\, d\theta = \int_0^{2\pi} d\theta \int_0^{\pi} \sin\phi\, d\phi \int_0^1 dr = 4\pi \]
(Definition) Cauchy Schwarz Inequality The Cauchy Schwarz inequality is a common inequality that bounds the inner product by the norm induced by the inner product
\[ |(x, y)| \le \sqrt{(x, x)}\, \sqrt{(y, y)} = \|x\|\,\|y\| \]
Example 54. It is instructive to verify the Cauchy Schwarz inequality on vector and function spaces. Consider $x, y \in \mathbb{R}^3$
>> x = [-3;8;11];
>> y = [ 7;-4;1];
>> norm(x,2)
ans =
13.9284
>> norm(y,2)
ans =
8.1240
>> norm(x,2)*norm(y,2)
ans =
113.1548
>> abs(dot(x,y))
ans =
42
\[ x = \begin{bmatrix} -3 \\ 8 \\ 11 \end{bmatrix}, \quad \|x\| = 13.9284, \qquad y = \begin{bmatrix} 7 \\ -4 \\ 1 \end{bmatrix}, \quad \|y\| = 8.1240 \]
\[ |(x, y)| = |-3 \cdot 7 - 8 \cdot 4 + 11 \cdot 1| = 42 \le \|x\|\,\|y\| = 113.1548 \]
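Consider functions $f, g \in L_2(0, 1)$
\[ f(x) = x, \quad \|f\| = \sqrt{\int_0^1 (x)^2\, dx} = \sqrt{\left[ \frac{x^3}{3} \right]_0^1} = \sqrt{\frac{1}{3}}, \qquad g(x) = x^2, \quad \|g\| = \sqrt{\int_0^1 (x^2)^2\, dx} = \sqrt{\left[ \frac{x^5}{5} \right]_0^1} = \sqrt{\frac{1}{5}} \]
\[ |(f, g)| = \left| \int_0^1 x \cdot x^2\, dx \right| = \left[ \frac{x^4}{4} \right]_0^1 = \frac{1}{4} = 0.25 \le \|f\|\,\|g\| = \sqrt{\frac{1}{15}} = 0.2582 \]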
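The function-space case can also be checked numerically. The following Python sketch approximates the $L_2(0, 1)$ inner product with a midpoint rule; the quadrature rule, the number of subintervals, and the helper name `inner` are assumptions made for illustration. For $f(x) = x$ and $g(x) = x^2$ the inner product evaluates to $\int_0^1 x^3\, dx = 1/4$.

```python
import math


def inner(f, g, a=0.0, b=1.0, n=10000):
    """Midpoint-rule approximation of the L2(a, b) inner product of real f and g."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) * g(a + (i + 0.5) * h) for i in range(n))


f = lambda x: x
g = lambda x: x ** 2

fg = abs(inner(f, g))            # |(f, g)|, close to 1/4
norm_f = math.sqrt(inner(f, f))  # close to sqrt(1/3)
norm_g = math.sqrt(inner(g, g))  # close to sqrt(1/5)
print(fg, norm_f * norm_g)
```

The two printed numbers are close here because $x$ and $x^2$ are nearly parallel on $(0, 1)$ in the $L_2$ sense.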
(ei , ej ) = δij
The expansion coefficients in an orthonormal basis are advantageous and may be easily determined for a
given vector x.
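\[ \alpha_j = \frac{(x, e_j)}{(e_j, e_j)}, \qquad x = \sum_j \alpha_j e_j = \sum_j (x, e_j)\, e_j \]
Example 55 (Inner product in $\mathbb{R}^3$). You are familiar with this concept from vector calculus.
\[ x = \begin{bmatrix} 5 \\ 3 \\ 1 \end{bmatrix}, \qquad (x, e_1) = x \cdot e_1 = 5 \]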
Example 56 (Finite Fourier Basis). An example you will see repeatedly within the context of MR is the projection onto the space spanned by a finite set of orthogonal Fourier basis functions.
When we get to the eigenvalue theory, we will see that this basis is indeed orthogonal such that
\[ (f, \sin(k\pi x/l)) = (f^h, \sin(k\pi x/l)) \ \Rightarrow\ \int_{-l}^{l} f(x) \sin(k\pi x/l)\, dx = b_k\, (\sin(k\pi x/l), \sin(k\pi x/l)) \]
\[ (\sin(k\pi x/l), \sin(k\pi x/l)) = \int_{-l}^{l} \sin^2 \frac{k\pi x}{l}\, dx = l \]
so that
\[ b_k = \frac{1}{l} \int_{-l}^{l} f(x) \sin(k\pi x/l)\, dx \]
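As a worked instance of the coefficient formula, the Python sketch below computes $b_k$ for the test function $f(x) = x$ on $[-l, l]$ with a midpoint rule and compares against the closed form $b_k = 2l(-1)^{k+1}/(k\pi)$ obtained by integrating by parts. The test function, quadrature rule, and the name `sine_coefficient` are choices made here for illustration and are not part of the notes.

```python
import math


def sine_coefficient(f, k, l, n=20000):
    """b_k = (1/l) * integral_{-l}^{l} f(x) sin(k pi x / l) dx, by the midpoint rule."""
    h = 2.0 * l / n
    total = 0.0
    for i in range(n):
        x = -l + (i + 0.5) * h
        total += f(x) * math.sin(k * math.pi * x / l)
    return h * total / l


l = 1.0
ks = (1, 2, 3)
numeric = [sine_coefficient(lambda x: x, k, l) for k in ks]
exact = [2.0 * l * (-1) ** (k + 1) / (k * math.pi) for k in ks]
print(numeric)
print(exact)
```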
Figure 36: Resampling is used for multi-resolution registration. Contours outline the exhale image over the steps of a multi-resolution registration to match the inhale image. (a) Initial inhale-exhale pair. (b) Initial affine transformation of the exhale image to the inhale image. (c) Resampling blurs the image to low resolutions to find bulk changes. Computations run fast at the low resolutions seen in (e)-(g). (d) The resolution is iteratively increased until the registration problem is solved at full resolution (h), 32x → 16x → 8x → 4x → 2x → 1x
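Suppose that we have a 512 × 512 image that we want to downsample to a 128 × 128 image. In this case
As before, the $k$-th coefficient in the lower resolution basis is simply the projection of the higher dimensional representation onto the $k$-th basis function, $\hat{e}_k$
\[ (\hat{f}, \hat{e}_k) = (f, \hat{e}_k) \ \Rightarrow\ \left( \sum_{j=1}^{128 \times 128} \hat{\alpha}_j \hat{e}_j, \hat{e}_k \right) = \hat{\alpha}_k (\hat{e}_k, \hat{e}_k) = \left( \sum_{i=1}^{512 \times 512} \alpha_i e_i, \hat{e}_k \right) \]
In the typical case that the higher resolution voxels are completely contained in the lower resolution voxels, the expansion coefficients in the lower resolution are the volume weighted sum, as you would expect.
\[ \hat{\alpha}_k = \frac{(f, \hat{e}_k)}{(\hat{e}_k, \hat{e}_k)} = \frac{\int_{\hat{\Omega}_k} \sum_{i=1}^{512 \times 512} \alpha_i e_i \hat{e}_k\, dx}{\int_{\hat{\Omega}_k} dx} = \sum_{j : \Omega_j \subset \hat{\Omega}_k} \alpha_j \frac{\mathrm{Vol}(\Omega_j)}{\mathrm{Vol}(\hat{\Omega}_k)} \]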
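For equally sized pixels, the volume weighted sum above reduces to a block average. The Python sketch below applies it to a small 4 × 4 array downsampled by a factor of 2; the array values, the function name `downsample`, and the assumption that the image dimensions are divisible by the factor are all illustrative.

```python
def downsample(image, factor):
    """Project a piecewise-constant image onto a coarser pixel basis.

    Each coarse coefficient is the volume-weighted sum of the fine
    coefficients it contains; with equal pixel sizes every weight is
    1 / factor**2, i.e. a block average. Assumes the image dimensions
    are divisible by `factor`.
    """
    rows, cols = len(image), len(image[0])
    weight = 1.0 / (factor * factor)   # Vol(fine pixel) / Vol(coarse pixel)
    coarse = []
    for i0 in range(0, rows, factor):
        coarse_row = []
        for j0 in range(0, cols, factor):
            block_sum = sum(image[i][j]
                            for i in range(i0, i0 + factor)
                            for j in range(j0, j0 + factor))
            coarse_row.append(weight * block_sum)
        coarse.append(coarse_row)
    return coarse


fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
coarse = downsample(fine, 2)
print(coarse)
```

The same function with `factor=4` would map a 512 × 512 image to 128 × 128.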
\[ e_i(x) = \begin{cases} 1, & x \in \Omega_i \\ 0, & x \notin \Omega_i \end{cases} \qquad \Omega_i \subset \hat{\Omega}_i \]
So why did we go through all this formality to arrive at this intuitive result? Notice that the final representation of the basis coefficients is in terms of the inner product and basis functions only.
\[ \hat{\alpha}_k = \frac{(f, \hat{e}_k)}{(\hat{e}_k, \hat{e}_k)} \]
The inner product space formality can easily be extended to any other basis and inner product, e.g. the b-spline bases and weighted inner products that are prevalent in image registration.
4.3 Minimizing Vector ([Kreyszig, 1989], Section 3.3)
Suppose we are given a continuous finite energy signal $f \in L_2(\Omega) = \left\{ f : \int_\Omega f^2\, dx < \infty \right\} \equiv X$ that we need to represent in a finite dimensional subspace, $M$, for a computer to understand. In this situation, we will assume that we are given a known orthonormal basis $\{\phi_j, j = 1, .., n\}$ for our finite dimensional subspace $M \subset X$, $\dim M < \infty$. We will formulate this as a projection problem in the inner product space, $[L_2(\Omega), (\cdot, \cdot)]$. We want to find the element of the subspace, $f^h \in M \subset X$, that minimizes the distance to an element $f \in X$.
\[ \delta = \inf_{\hat{y} \in M} \|x - \hat{y}\| = \|x - y\| \]
Proof. The proof follows from the parallelogram equality (11), i.e. the minimization is with respect to the norm induced by the inner product; see [Kreyszig, 1989] Theorem 3.3-1.
Returning to our example, we wish to find $f^h \in M \subset L_2(\Omega)$ that provides the best approximation to our original function $f$ with respect to the norm induced by our inner product.
The projection of $f$ onto the basis may be shown to be the minimum. The squared sum difference between the projection coefficients and an arbitrary set of basis coefficients $\beta_j$ may be written as
\[ \sum_j^n \left( (f, \phi_j) - \beta_j \right)^2 = \sum_j^n (f, \phi_j)^2 - 2 \sum_j^n \beta_j (f, \phi_j) + \sum_j^n \beta_j^2 \]
Explicitly writing out the norm using the properties of the inner product and orthonormality, and substituting the difference between the coefficients, yields the result below.
\[ \begin{aligned} \|f - f^h\|^2 &= \left( f - \sum_j^n \beta_j \phi_j,\ f - \sum_j^n \beta_j \phi_j \right) = (f, f) - \left( f, \sum_j^n \beta_j \phi_j \right) - \left( \sum_j^n \beta_j \phi_j, f \right) + \left( \sum_j^n \beta_j \phi_j, \sum_j^n \beta_j \phi_j \right) \\ &= (f, f) - 2\, \Re e \left( f, \sum_j^n \beta_j \phi_j \right) + \sum_j^n \beta_j \bar{\beta}_j (\phi_j, \phi_j) \\ &= (f, f) - 2 \sum_j^n \beta_j (f, \phi_j) + \sum_j^n \beta_j^2 \qquad \text{(assuming real num.)} \\ &= \|f\|^2 - \sum_j^n (f, \phi_j)^2 + \sum_j^n \left( (f, \phi_j) - \beta_j \right)^2 \end{aligned} \]
This difference is minimized when the expansion coefficients equal the inner product of the original function with the basis
\[ (f, \phi_j) = \beta_j = (f^h, \phi_j) \quad \forall j \tag{12} \]
Here, the parameters $(f, \phi_j) \equiv f_j$ that provide the "best fit" of the data are understood to provide the minimum distance with respect to the norm induced by the inner product.
Typically we have an overdetermined system such that the number of measurements is greater than the number of coefficients of the expansion, $m > n$. A linear system of equations may be obtained with each row of the linear system representing one measurement point
\[ Ax = \begin{bmatrix} \phi_1(t_1) & \phi_2(t_1) & \dots & \phi_n(t_1) \\ \phi_1(t_2) & \phi_2(t_2) & \dots & \phi_n(t_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(t_m) & \phi_2(t_m) & \dots & \phi_n(t_m) \end{bmatrix} x = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = b, \qquad x = \begin{bmatrix} (f, \phi_1) \\ (f, \phi_2) \\ \vdots \\ (f, \phi_n) \end{bmatrix} \equiv \begin{bmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_n \end{bmatrix} \]
The range space must be a proper subspace of the full space, $m > \dim R(T)$, and the rank of the augmented matrix may be used to show whether a solution exists. The dimension of the null space may or may not be 0.
Since the solution may or may not exist, we need to redefine our problem setup to impose meaning on this problem and guarantee that a solution exists. Recall our minimizing vector Theorem 4.1. We apply the inner product setup with $A : \mathbb{R}^n \to \mathbb{R}^m$, the usual inner product in $\mathbb{R}^m$, and the subspace given by the range of the operator $A$.
\[ M \equiv R(A) = \{ Ax : x \in \mathbb{R}^n \} = \mathrm{span} \left\{ \begin{bmatrix} \phi_1(t_1) \\ \phi_1(t_2) \\ \vdots \\ \phi_1(t_m) \end{bmatrix}, \begin{bmatrix} \phi_2(t_1) \\ \phi_2(t_2) \\ \vdots \\ \phi_2(t_m) \end{bmatrix}, \dots, \begin{bmatrix} \phi_n(t_1) \\ \phi_n(t_2) \\ \vdots \\ \phi_n(t_m) \end{bmatrix} \right\} \subset \mathbb{R}^m \]
The trick is to project the right hand side of measurements into the range space of the operator. Let $p$ be the orthogonal projection of $b$ into $R(A)$.
From the minimizing vector Theorem 4.1 we know that $p$ exists and is unique. Then by the definition of the range space there exists an $x^*$ that maps to this $p$
or
\[ \|b - p\|^2 = \|b - Ax^*\|^2 = \inf_{x \in \mathbb{R}^n} \|b - Ax\|^2 \]
Since $p$ is unique and $Ax^* = p$ the least squares problem has a unique solution if
Notice that the 2-norm minimization reduces to the expected minimization of the difference in the residual
\[ \min_{x \in \mathbb{R}^n} \|b - Ax\|_2^2 = \min_{x \in \mathbb{R}^n} (b - Ax, b - Ax) = \min_{x \in \mathbb{R}^n} \sum_i \left( y_i - \sum_j^n f_j \phi_j(t_i) \right)^2 = \min_{x \in \mathbb{R}^n} r^\top r, \qquad r = b - Ax \]
Notice that while we generated our problem description from the 2-norm from the usual (·, ·)2 inner
product, we can easily redefine our problem in terms of a different inner product and all arguments hold in
terms of distances of inner product spaces. In particular, a weighted inner product may be used.
Normal Equations There are several methods for solving this minimization/optimization problem. The normal equations have a simple, intuitive derivation; however, we should be wary of them. The normal equations are obtained by expanding the residual
\[ r^\top r = (b - Ax)^\top (b - Ax) = b^\top b - 2 x^\top A^\top b + x^\top A^\top A x \]
and taking the derivative with respect to $x$ and setting it to zero, similar to undergrad calculus.
\[ \frac{d}{dx}\, r^\top r = 2 A^\top A x - 2 A^\top b = \vec{0} \]
which reduces to an $n \times n$ square linear system, $A^\top A x = A^\top b$, with an amplified condition number.
Thus, if the condition number of the original system was large, the condition number of the normal equations will be that number squared.
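To make the derivation concrete, here is a minimal Python sketch that forms and solves the normal equations for a straight-line fit, where $A$ has the two columns $1$ and $t$ so that $A^\top A$ is only 2 × 2. The data values and the function name are made up for illustration; for ill-conditioned bases the squared condition number noted above makes this approach unreliable.

```python
def normal_equations_line_fit(t, b):
    """Least squares fit b ~ c0 + c1*t by solving A^T A x = A^T b.

    A has columns [1, t], so A^T A = [[m, sum t], [sum t, sum t^2]]
    and A^T b = [sum b, sum t*b]; the 2 x 2 system is solved by Cramer's rule.
    """
    m = len(t)
    s_t = sum(t)
    s_tt = sum(ti * ti for ti in t)
    s_b = sum(b)
    s_tb = sum(ti * bi for ti, bi in zip(t, b))
    det = m * s_tt - s_t * s_t
    c0 = (s_tt * s_b - s_t * s_tb) / det
    c1 = (m * s_tb - s_t * s_b) / det
    return c0, c1


t = [0.0, 1.0, 2.0, 3.0]
b = [1.0, 2.9, 5.1, 7.0]          # illustrative measurements
c0, c1 = normal_equations_line_fit(t, b)

# The gradient condition A^T (b - A x) = 0 should hold at the solution
r = [bi - (c0 + c1 * ti) for ti, bi in zip(t, b)]
grad = (sum(r), sum(ti * ri for ti, ri in zip(t, r)))
print(c0, c1, grad)
```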
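Orthogonal Transformations Orthogonal transformations based on the QR factorization are common in MATLAB. These factorizations are based on the idea of orthogonal matrices.
(Definition) Orthogonal Matrix
\[ Q^\top Q = Q Q^\top = I \]
Orthogonal matrices preserve the norm, and hence the distance we are trying to minimize, of any vector $x$
\[ \|Qx\|_2^2 = (Qx)^\top (Qx) = x^\top Q^\top Q x = \|x\|_2^2 \]
The QR factorization writes the $m \times n$ matrix $A$ as
\[ A = Q \begin{bmatrix} R \\ O \end{bmatrix} \]
where $R$ is an $n \times n$ upper triangular matrix and $O$ is an $(m - n) \times n$ matrix of zeros. Using the properties of the orthogonal matrix, this leads to a transformation of the least squares equations to an equivalent, but more numerically stable form.
\[ \|b - Ax\|_2^2 = \left\| b - Q \begin{bmatrix} R \\ O \end{bmatrix} x \right\|_2^2 = \left\| \underbrace{Q Q^\top}_{= I} b - Q \begin{bmatrix} R \\ O \end{bmatrix} x \right\|_2^2 = \left\| Q \left( Q^\top b - \begin{bmatrix} R \\ O \end{bmatrix} x \right) \right\|_2^2 = \left\| Q^\top b - \begin{bmatrix} R \\ O \end{bmatrix} x \right\|_2^2 = \|\hat{b}_1 - Rx\|_2^2 + \|\hat{b}_2\|_2^2 \]
Here $\hat{b}_1$ is the $n \times 1$ sub-vector of the transformed vector $Q^\top b$ and $\hat{b}_2$ is the $(m - n) \times 1$ remaining sub-vector. Since the optimization has no control over the $\|\hat{b}_2\|$ term, the minimum occurs when the residual is equal to this term.
\[ Rx = \hat{b}_1, \qquad \|r\|_2^2 = \|\hat{b}_2\|_2^2 \]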
We will not go into details but several methods are possible for computing this QR factorization including
• Householder transformations
• Givens transformations
• Gram-Schmidt orthogonalization
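As an illustration of the last option in the list, the Python sketch below builds a thin QR factorization with modified Gram-Schmidt and then solves $Rx = \hat{b}_1$ by back substitution. This is a teaching sketch under stated assumptions (full column rank, small dense matrices stored as lists of rows, hypothetical function names); it is not how MATLAB's `qr` is implemented, and Householder transformations are generally preferred for numerical robustness.

```python
import math


def thin_qr(A):
    """Thin QR by modified Gram-Schmidt.

    Returns (Q, R) where Q is a list of n orthonormal columns (each of
    length m) and R is n x n upper triangular. Assumes full column rank.
    """
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    Q = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            # Project out the component along the k-th orthonormal column
            R[k][j] = sum(q * x for q, x in zip(Q[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q[k])]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        Q.append([x / R[j][j] for x in v])
    return Q, R


def least_squares_qr(A, b):
    """Solve min ||b - A x||_2 via R x = Q1^T b and back substitution."""
    Q, R = thin_qr(A)
    n = len(R)
    rhs = [sum(q * bi for q, bi in zip(Q[k], b)) for k in range(n)]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rhs[i] - s) / R[i][i]
    return x


A = [[1.0, t] for t in (0.0, 1.0, 2.0, 3.0)]   # columns 1 and t
b = [1.0, 2.9, 5.1, 7.0]                        # illustrative measurements
x = least_squares_qr(A, b)
print(x)
```

For this well-conditioned straight-line fit the result agrees with the normal-equation solution; the difference between the two approaches only becomes visible for ill-conditioned bases.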
As an example of the effect of ill conditioning on a least squares interpolation using the monomial basis, for a normal equation approach and an orthogonal transformation approach, consider the interpolation with a 10-th degree polynomial on [0,1], $P_{10}[0, 1]$.
\[ \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix} f = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \]
Figure 39: http://www.cse.illinois.edu/iem/least_squares/data_fitting An example least squares fit is shown. Here $\phi_1(x) = 1$, $\phi_2(x) = x$, and $\phi_3(x) = x^2$. Notice that the monomial basis functions become more indistinguishable with increasing polynomial order. This leads to nearly linearly dependent columns and ill-conditioning in the matrix.
>> x=[0:.02:1]';
>> b=exp(x);
>> A = [] ; for iii = 0:10; A = [A,x.^iii]; end
>> cond(A)
ans =
2.0371e+07
>> cond(A'*A) % conditioning of matrix used in the normal equations
ans =
4.1451e+14
>> [Q R] = qr(A);
>> cond(R) % conditioning of matrix used in QR factorization
ans =
2.0371e+07
>> xone = (A'*A)\(A'*b) % normal equation solution
xone =
1.000000000804584
0.999999840420694
0.500004795998859
0.166609456930047
0.042019700547931
0.007065676553890
0.004185623326857
-0.003646097590225
0.003234439475462
-0.001486452960945
0.000294845427892
xtwo =
1.000000000000017
0.999999999994582
0.500000000208064
0.166666663555648
0.041666690728768
0.008333224089361
0.001389198998930
0.000197847650168
0.000025459516102
0.000002286833573
0.000000456883813
ans =
2.882617002030374e-09
kT ∗ k = kT k
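Example 58 (Adjoint of a Matrix Operator). The adjoint of an $n \times n$ matrix $A$ may be determined from properties of the inner product
\[ (Ax, y) = (Ax)^\top \bar{y} = x^\top A^\top \bar{y} = x^\top \overline{\bar{A}^\top}\, \bar{y} = x^\top \overline{\bar{A}^\top y} = (x, \bar{A}^\top y) \]
so the adjoint is the conjugate of the transpose, $A^* = \bar{A}^\top$. A matrix is self adjoint with respect to the usual inner product if it equals the conjugate of its transpose.
\[ \begin{bmatrix} 1 & 3 \\ i & 2 - i \end{bmatrix}^* = \begin{bmatrix} 1 & -i \\ 3 & 2 + i \end{bmatrix}, \qquad \underbrace{\begin{bmatrix} 3 & 1 + 2i \\ 1 - 2i & -1 \end{bmatrix}^* = \begin{bmatrix} 3 & 1 + 2i \\ 1 - 2i & -1 \end{bmatrix}}_{\text{Hermitian Symmetry}}, \qquad \begin{bmatrix} 2 & 3 \\ 3 & 1 \end{bmatrix}^* = \begin{bmatrix} 2 & 3 \\ 3 & 1 \end{bmatrix} \]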
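The defining identity $(Ax, y) = (x, A^* y)$ can be checked directly with complex arithmetic. The Python sketch below uses the convention that the inner product conjugates its second argument; the test vectors and helper names are arbitrary choices made for illustration.

```python
def inner(x, y):
    """Usual inner product on C^n, conjugating the second argument."""
    return sum(a * b.conjugate() for a, b in zip(x, y))


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def conj_transpose(A):
    """Adjoint of a matrix: conjugate of the transpose."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]


A = [[1 + 0j, 3 + 0j],
     [1j, 2 - 1j]]
x = [1 + 2j, -1j]      # arbitrary test vectors
y = [2 - 1j, 3 + 1j]

lhs = inner(matvec(A, x), y)                    # (Ax, y)
rhs = inner(x, matvec(conj_transpose(A), y))    # (x, A* y)

H = [[3 + 0j, 1 + 2j],
     [1 - 2j, -1 + 0j]]
is_self_adjoint = conj_transpose(H) == H        # Hermitian symmetry
print(lhs, rhs, is_self_adjoint)
```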
The following lemmas are useful in studying the properties of Hilbert adjoint operators.
Lemma 4.3 (Equality). If the inner products of two vectors $v_1, v_2 \in X$ with $w$ are equal for all $w \in X$, then the two vectors are the same.
\[ (v_1, w) = (v_2, w) \quad \forall w \in X \ \Rightarrow\ v_1 = v_2 \]
In particular,
\[ (v_1, w) = 0 \quad \forall w \in X \ \Rightarrow\ v_1 = 0 \]
Proof. By assumption,
\[ (v_1 - v_2, w) = (v_1, w) - (v_2, w) = 0 \quad \forall w \in X \]
For $w = v_1 - v_2$ this gives $\|v_1 - v_2\| = 0$. Hence $v_1 - v_2 = 0$, so that $v_1 = v_2$. In particular,
Lemma 4.4 (Zero Operator). Let X and Y be inner product spaces and Q : X → Y a bounded linear
operator. Then:
Q = 0 ⇔ (Qx, y) = 0 ∀x ∈ X y ∈ Y
Proof. (⇒)
(⇐) Conversely,
The following properties are used frequently in applying adjoint operators, and the derivations are useful in understanding manipulations of adjoint operators.
Theorem 4.5 (Properties of Hilbert-Adjoint Operators). Let $H_1$, $H_2$ be Hilbert spaces, $S : H_1 \to H_2$ and $T : H_1 \to H_2$ bounded linear operators and $\alpha$ any scalar.
(a) $(T^* y, x) = (y, Tx)$, $\quad x \in H_1, y \in H_2$
(b) $(S + T)^* = S^* + T^*$
(c) $(\alpha T)^* = \bar{\alpha} T^*$
(d) $(T^*)^* = T$
(e) $\|T^* T\| = \|T T^*\| = \|T\|^2$
(f) $T^* T = 0 \iff T = 0$
(g) $(ST)^* = T^* S^*$, assuming $H_1 = H_2$
Proof. • (a) The adjoint may be written with respect to the other argument of the inner product. By definition 4.5 we have
\[ (T^* y, x) = \overline{(x, T^* y)} = \overline{(Tx, y)} = (y, Tx) \]
• (c) This formula is not to be confused with the action of the linear adjoint on the vector $\alpha x$, i.e. $T^*(\alpha x) = \alpha T^* x$. Use lemma 4.4 with $Q = (\alpha T)^* - \bar{\alpha} T^*$.
• (d) The adjoint operator applied twice equals the original operator.
• (f) From (e),
\[ T^* T = 0 \iff \|T^* T\| = \|T\|^2 = 0 \iff T = 0 \]
• (g) Repeated application of the definition of the adjoint.
The adjoint may also be defined with respect to an operator on a continuous space.
Example 59 (Adjoint of a Differential Operator). Consider a differential operator with a specified zero boundary condition, defined on the space of differentiable functions with the usual $L_2$ inner product
\[ X \equiv C^1[0, 1], \qquad (x, y) = \int_0^1 x(t)\, y(t)\, dt, \qquad Lx = \frac{d}{dt} x + x, \quad x(0) = 0 \]
To find the adjoint we start with the definition of the inner product and integrate the derivative term by parts
\[ \int_a^b \frac{d}{dx} \left( u(x) v(x) \right) dx = u(x) v(x) \Big|_a^b = \int_a^b u'(x) v(x) + u(x) v'(x)\, dx \]
to manipulate the result into the form of the operator acting on the second variable, $(x, L^* y)$
\[ \begin{aligned} (Lx, y) &= \int_0^1 x'(t) y(t) + x(t) y(t)\, dt \\ &= x(1) y(1) - x(0) y(0) + \int_0^1 -y'(t) x(t) + x(t) y(t)\, dt \\ &= x(1) y(1) + \int_0^1 -y'(t) x(t) + x(t) y(t)\, dt \\ &= (x, L^* y) \end{aligned} \]
Here the condition $x(0) = 0$ takes care of one of the terms on the right hand side. Therefore we must define the boundary condition of the adjoint operator to be zero on the other part of the domain, $y(1) = 0$
\[ L^* y = -\frac{d}{dt} y + y, \qquad y(1) = 0 \]
Hence this operator fails to be self adjoint for two reasons: (1) $L \ne L^*$ and (2) the boundary conditions are not the same.
It is important to realize that the adjoint operator is defined with respect to the inner product defined on the space.
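Example 60 (Adjoint of Sturm Liouville Operator). The Sturm Liouville Operator is a differential operator that has an important role in many applications.
\[ L \equiv \frac{1}{w(x)} \left[ \frac{d}{dx} \left( p(x) \frac{d}{dx} \right) + r(x) \right], \qquad p(x), w(x) > 0 \tag{14} \]
Boundary conditions are typically assumed of the form:
\[ x(a) = 0, \qquad x(b) = 0 \]
The Sturm Liouville operator may be shown to be self adjoint with respect to the weighted inner product
\[ (f, g) = \int_a^b f(x)\, g(x)\, w(x)\, dx, \qquad w(x) > 0, \quad a \le x \le b \]
Here positivity of the weighting function $w(x)$ is imposed to ensure the inner product is positive. Using integration by parts twice
\[ \int_a^b \frac{d}{dt} \left( p(t) \frac{d}{dt} x(t)\; y(t) \right) dt = \int_a^b \frac{d}{dt} \left( p(t) \frac{d}{dt} x(t) \right) y(t)\, dt + \int_a^b p(t) \frac{d}{dt} x(t) \frac{d}{dt} y(t)\, dt \]
\[ \int_a^b \frac{d}{dt} \left( \frac{d}{dt} y(t)\; p(t)\, x(t) \right) dt = \int_a^b \frac{d}{dt} \left( \frac{d}{dt} y(t)\; p(t) \right) x(t)\, dt + \int_a^b \frac{d}{dt} y(t)\; p(t) \frac{d}{dt} x(t)\, dt \]
The adjoint may be found to be
\[ \begin{aligned} (Lx, y) &= \int_a^b \frac{1}{w(t)} \left[ \frac{d}{dt} \left( p(t) \frac{d}{dt} x(t) \right) + r(t) x(t) \right] y(t)\, w(t)\, dt \\ &= \int_a^b \left[ \frac{d}{dt} \left( p(t) \frac{d}{dt} x(t) \right) + r(t) x(t) \right] y(t)\, dt \\ &= \left[ p(t) x'(t) y(t) \right]_a^b + \int_a^b -\frac{d}{dt} y(t)\; p(t) \frac{d}{dt} x(t) + r(t) x(t) y(t)\, dt \\ &= \left[ p(t) x'(t) y(t) \right]_a^b - \left[ p(t) x(t) y'(t) \right]_a^b + \int_a^b \frac{d}{dt} \left( p(t) \frac{d}{dt} y(t) \right) x(t) + r(t) x(t) y(t)\, dt \\ &= \left[ p(t) x'(t) y(t) \right]_a^b - \left[ p(t) x(t) y'(t) \right]_a^b + \int_a^b \frac{1}{w(t)} \left[ \frac{d}{dt} \left( p(t) \frac{d}{dt} y(t) \right) + r(t) y(t) \right] w(t)\, x(t)\, dt \\ &= (x, L^* y) \end{aligned} \]
We see that we require similar boundary conditions on $y$ for the boundary terms to vanish. Hence the operator is self adjoint with respect to the weighted inner product.
\[ L^* \equiv \frac{1}{w(x)} \left[ \frac{d}{dx} \left( p(x) \frac{d}{dx} \right) + r(x) \right], \qquad y(a) = 0, \quad y(b) = 0 \]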
The following property appears when looking at the eigenvalues of a self adjoint operator.
Theorem 4.6 (Self-adjointness). [Kreyszig, 1989] Theorem 3.10-3 Let T : H → H be a bounded linear
operator on a Hilbert space H.
T self adjoint ⇒ (T x, x) ∈ R ∀x ∈ H
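Example 61 (Eigenvalues of a Matrix). Consider the linear operator $A : \mathbb{R}^2 \to \mathbb{R}^2$
\[ A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \]
Rewriting $Ax = \lambda x$ as $(A - \lambda I)x = 0$ we recall that this linear system of equations has a non-trivial solution if and only if
\[ \det(A - \lambda I) = 0 \ \Rightarrow\ \begin{vmatrix} 2 - \lambda & 1 \\ 1 & 2 - \lambda \end{vmatrix} = (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = 0 \]
This is known as the characteristic equation of $A$ and the roots $\lambda_1, \lambda_2 = 1, 3$ are the eigenvalues. To find the corresponding eigenvectors, consider $\lambda_1$ with $Ax = \lambda x$ explicitly written out.
\[ L \alpha x = \lambda \alpha x \quad \forall \alpha \]
Repeating for $\lambda_2 = 3$
\[ 2\xi_1 + \xi_2 = 3\xi_1, \qquad \xi_1 + 2\xi_2 = 3\xi_2 \ \Rightarrow\ \xi_2 = \xi_1 \]
we have
\[ e_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \]
If we prefer, the eigenvectors may easily be normalized.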
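The eigenvalues and normalized eigenvectors of this example can be reproduced with the closed-form solution of the characteristic equation for a symmetric 2 × 2 matrix. The Python sketch below assumes a nonzero off-diagonal entry, and the function name `symmetric_2x2_eigen` is hypothetical.

```python
import math


def symmetric_2x2_eigen(a, b, d):
    """Eigenpairs of [[a, b], [b, d]] from the characteristic equation.

    Assumes b != 0. The roots of lambda^2 - (a + d) lambda + (a d - b^2) = 0
    are mean -/+ radius; (b, lambda - a) solves the first row of
    (A - lambda I) v = 0 and is then normalized.
    """
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)

    def eigvec(lam):
        v = (b, lam - a)
        nrm = math.hypot(v[0], v[1])
        return (v[0] / nrm, v[1] / nrm)

    lam1, lam2 = mean - radius, mean + radius
    return (lam1, eigvec(lam1)), (lam2, eigvec(lam2))


(lam1, e1), (lam2, e2) = symmetric_2x2_eigen(2.0, 1.0, 2.0)
orthogonality = e1[0] * e2[0] + e1[1] * e2[1]
print(lam1, e1)   # eigenvalue 1, direction (1, -1) / sqrt(2)
print(lam2, e2)   # eigenvalue 3, direction (1, 1) / sqrt(2)
print(orthogonality)
```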
The fact that the eigenvectors are orthogonal is not a coincidence.
Theorem 5.1 (Spectrum of Self-Adjoint Operator). If $L$ is self adjoint, then
(i) the eigenvalues are real
Example 62 (Multiplicity of Eigenvalues). Consider
\[ A = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{bmatrix} \]
The eigenvalues are thus $\lambda_1, \lambda_2 = 0, 2$. Note carefully that $x = 0$ is never acceptable as an eigenvector. By definition an eigenvector is to be nontrivial, $x \ne 0$. However, as seen, the eigenvalue may be zero. In this case the eigenvalue $\lambda_2 = 2$ is said to be of "multiplicity 2". Proceeding as before
\[ \lambda_1 = 0: \quad e_1 = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}, \qquad \lambda_2 = 2: \quad e_2 = \begin{bmatrix} \alpha \\ \beta \\ \beta \end{bmatrix} \]
Here $\alpha$ and $\beta$ cannot both be zero. As before, $A$ is self adjoint so the eigenvalues are real, and $(e_1, e_2) = 0$ $\forall \alpha, \beta$. Moreover the eigenspace of the second eigenvalue actually contains two orthogonal vectors as well. For instance,
\[ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \]
So in fact we have three mutually orthogonal eigenvectors, and the three eigenvectors constitute an orthogonal basis for the space. We will see that this basis is particularly helpful in solving the inhomogeneous problem $Lx = c$.
In fact it can be shown that
Theorem 5.2. For any self adjoint operator $L$ on a finite dimensional domain, $k$ mutually orthogonal eigenvectors can be found for each eigenvalue of multiplicity $k$.
Proof.
Together with the fact that eigenvectors corresponding to distinct eigenvalues are orthogonal we have
Theorem 5.3. The eigenvectors of any self-adjoint operator L on a finite dimensional space constitute a
basis for the space.
Proof.
These results lead to a diagonalization of self adjoint matrices.
(Definition) Modal Matrix The modal matrix $Q$ is defined as the matrix whose columns are the normalized eigenvectors.
As we will see, this leads to a concise form of symmetric matrices in which we may study convergence of algorithms in the framework of optimization theory
\[ Q^\top A Q = \begin{bmatrix} e_1^\top \\ e_2^\top \\ e_3^\top \end{bmatrix} \begin{bmatrix} Ae_1 & Ae_2 & Ae_3 \end{bmatrix} = \begin{bmatrix} e_1^\top \\ e_2^\top \\ e_3^\top \end{bmatrix} \begin{bmatrix} \lambda_1 e_1 & \lambda_2 e_2 & \lambda_3 e_3 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \]
Our study of eigenvalues is not restricted to the finite dimensional matrix operators.
Example 63 (Eigenvalues of a Differential operator). Consider the differential operator $L \equiv d^2/dx^2$ with zero boundary conditions. The eigenvalue problem is
\[ y'' + \lambda y = 0, \qquad y(0) = y(l) = 0 \]
From differential equations, the general solution is of the form
\[ y = A \sin \sqrt{\lambda} x + B \cos \sqrt{\lambda} x \]
where the arbitrary constants $A$ and $B$ are determined by the boundary conditions.
\[ y(0) = 0 \ \Rightarrow\ B = 0, \qquad y(l) = 0 \ \Rightarrow\ A \sin \sqrt{\lambda}\, l = 0 \]
Since $A \ne 0$ is required to have non trivial solutions, we arrive at the analogous characteristic equation as before, where $\sqrt{\lambda}\, l$ must coincide with a zero of the sine function.
\[ \sin \sqrt{\lambda}\, l = 0 \ \Rightarrow\ \lambda_n = \frac{n^2 \pi^2}{l^2} \ \Rightarrow\ y_n = \sin \frac{n \pi x}{l}, \qquad n = 1, 2, 3, ... \]
As seen in example 14, the operator under consideration is self-adjoint, hence the eigenvalues are real and the eigenfunctions are mutually orthogonal.
\[ (y_m, y_n) = \int_0^l \sin \frac{m \pi x}{l} \sin \frac{n \pi x}{l}\, dx = 0, \qquad m \ne n \]
Unfortunately, Theorem 5.3 was for finite dimensions and we cannot say that this constitutes a basis for
our space at this point. However, the Sturm-Liouville theory provides a rigorous framework in which we can
identify the eigenfunctions as a basis.
Theorem 5.4 (Basis of Sturm-Liouville System). If both p(x) and w(x) are analytic and positive (p, w > 0)
over a ≤ x ≤ b, where a and b are finite, then the eigenfunctions of the Sturm-Liouville system (14) form a
basis for $L^2[a, b]$.
This is an important result that may be used to justify that the set of trigonometric Fourier functions
indeed constitutes a basis, so that any given function in $L^2$ may be represented, or decomposed, in this Fourier
basis.
Example 64 (Fourier Basis from Sturm Liouville Theory). The set of Fourier basis functions may be shown
to satisfy a slightly different Sturm-Liouville system with periodic boundary conditions.
\[ y'' + \lambda y = 0, \qquad y(-l) = y(l), \qquad y'(-l) = y'(l) \]
Subjecting the general solution $y = A \sin\sqrt{\lambda}\,x + B \cos\sqrt{\lambda}\,x$ to the periodic boundary conditions gives
\[ A \sin\sqrt{\lambda}\,l = 0, \qquad \sqrt{\lambda}\,B \sin\sqrt{\lambda}\,l = 0 \;\Rightarrow\; \lambda_n = \frac{n^2\pi^2}{l^2}, \qquad n = 0, 1, 2, \ldots \]
The eigenvalues with n ≠ 0 are of multiplicity two and the constants A and B are arbitrary. For the given
eigenfunction
\[ y_n = A \sin\frac{n\pi x}{l} + B \cos\frac{n\pi x}{l} \]
we can obtain two orthogonal functions by setting A = 1, B = 0 and A = 0, B = 1. From Sturm-Liouville
theory, the eigenfunctions form a basis for $L^2[-l, l]$ and we may write a general function f in the form
\[ f(x) = a_0 + \sum_{n=1}^{\infty} \left( a_n \cos\frac{n\pi x}{l} + b_n \sin\frac{n\pi x}{l} \right) \]
where the coefficients coincide with the usual trigonometric Fourier series.
\[ a_0 = \frac{(f, 1)}{(1, 1)} = \frac{1}{2l} \int_{-l}^{l} f(x)\,dx \]
\[ a_n = \frac{(f, \cos n\pi x/l)}{(\cos n\pi x/l, \cos n\pi x/l)} = \frac{1}{l} \int_{-l}^{l} f(x) \cos\frac{n\pi x}{l}\,dx \]
\[ b_n = \frac{(f, \sin n\pi x/l)}{(\sin n\pi x/l, \sin n\pi x/l)} = \frac{1}{l} \int_{-l}^{l} f(x) \sin\frac{n\pi x}{l}\,dx \]
5.2 Applications: Spectral Method for the Inhomogeneous Problem
For a symmetric matrix A we know that the eigenvectors form a basis for the space, hence we can
expand the solution of a system of equations in terms of this basis. The solution may then be represented in
terms of the spectrum of the operator A.
\[ Ax = \Lambda x + c, \qquad c = \sum_{j=1}^{n} c_j e_j, \qquad x = \sum_{j=1}^{n} \alpha_j e_j, \qquad \Lambda \in \mathbb{R} \]
Here $c_j = (c, e_j)$ is assumed known and the $\alpha_j$'s are to be found. By direct substitution
\[ \sum_{j=1}^{n} \alpha_j (\lambda_j - \Lambda) e_j = \sum_{j=1}^{n} c_j e_j \;\Rightarrow\; \alpha_j (\lambda_j - \Lambda) = c_j \]
If Λ does not coincide with any of the eigenvalues, then $\alpha_j = c_j / (\lambda_j - \Lambda)$ and the unique solution is
\[ x = \sum_{j=1}^{n} \frac{c_j}{\lambda_j - \Lambda}\, e_j \]
On the other hand if $\Lambda = \lambda_k$ then there are two possibilities. (a) If $c_k \neq 0$, there is no solution. (b) If $c_k = 0$,
then we have a nonunique solution with β arbitrary
\[ x = \sum_{j \neq k} \frac{c_j}{\lambda_j - \Lambda}\, e_j + \beta e_k, \qquad \beta \in \mathbb{R} \]
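The expansion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix A = [[2, -1], [-1, 2]] (eigenvalues 1 and 3 with orthonormal eigenvectors (1, 1)/√2 and (-1, 1)/√2), the shift Λ = 2, and the right hand side c = (1, 0) are chosen for the example, and the helper name `spectral_solve` is hypothetical.

```python
import math

def spectral_solve(eigvals, eigvecs, shift, c):
    """Solve A x = shift * x + c by expanding in orthonormal eigenvectors of A."""
    x = [0.0] * len(c)
    for lam, e in zip(eigvals, eigvecs):
        if abs(lam - shift) < 1e-12:
            raise ValueError("shift coincides with an eigenvalue")
        c_j = sum(ci * ei for ci, ei in zip(c, e))      # c_j = (c, e_j)
        alpha_j = c_j / (lam - shift)                   # alpha_j = c_j / (lambda_j - Lambda)
        x = [xi + alpha_j * ei for xi, ei in zip(x, e)]
    return x

s = 1.0 / math.sqrt(2.0)
A = [[2.0, -1.0], [-1.0, 2.0]]        # assumed example matrix
eigvals = [1.0, 3.0]
eigvecs = [[s, s], [-s, s]]           # orthonormal eigenvectors of A
shift, c = 2.0, [1.0, 0.0]            # assumed Lambda and right hand side

x = spectral_solve(eigvals, eigvecs, shift, c)
# Residual of A x - Lambda x - c, which should vanish up to rounding.
residual = [sum(A[i][j] * x[j] for j in range(2)) - shift * x[i] - c[i]
            for i in range(2)]
print(x, residual)
```

For these values the direct solve of (A - 2I)x = c gives x = (0, -1), which the eigenvector expansion reproduces.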
6 Unconstrained Optimization
As a motivating example consider the image denoising problem.
Figure 40: Given a noisy image $b \in L^2(\Omega)$ we wish to remove noise. How can we express this
mathematically? One way is to say that we wish to find a denoised image, f, at minimum distance from the
original image such that f is smooth.
\[ \min_f \; \underbrace{\tfrac{1}{2} \| f - b \|_2^2}_{\text{minimum distance}} + \underbrace{\| \nabla f \|_p^p}_{\text{minimum variation, i.e. smooth}} \]
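Let's look at this problem in vector notation, $f_{i,j,k} \equiv f(i\Delta x, j\Delta y, k\Delta z)$ and $b_{i,j,k} \equiv b(i\Delta x, j\Delta y, k\Delta z)$ for a
256x256x100 image
\[ \vec{f} = \begin{bmatrix} f_{1,1,1} & f_{2,1,1} & \cdots & f_{256,1,1} & f_{1,2,1} & f_{2,2,1} & \cdots & f_{256,2,1} & \cdots & f_{256,256,100} \end{bmatrix}^\top \]
\[ \vec{b} = \begin{bmatrix} b_{1,1,1} & b_{2,1,1} & \cdots & b_{256,1,1} & b_{1,2,1} & b_{2,2,1} & \cdots & b_{256,2,1} & \cdots & b_{256,256,100} \end{bmatrix}^\top \]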
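Explicitly discretizing the integrals
\[ \| f - b \|_2^2 = \int_\Omega (f - b)^2 \, dx\, dy\, dz \approx \sum_{i,j,k} (f_{i,j,k} - b_{i,j,k})^2 \, \Delta x \Delta y \Delta z = \left( \vec{f} - \vec{b}, \vec{f} - \vec{b} \right) \Delta x \Delta y \Delta z \]
For p = 1, the smoothing term reduces to the so-called "total variation" regularizer
\[ \left\| \frac{\partial f}{\partial x} \right\|_1 = \int_\Omega \left| \frac{\partial f}{\partial x} \right| dx\, dy\, dz \approx \sum_{i,j,k} \left| \frac{f_{i+1,j,k} - f_{i,j,k}}{\Delta x} \right| \Delta x \Delta y \Delta z \]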
Figure 41: Total Variation, $TV(f) = \sum_i |f(x_{i+1}) - f(x_i)|$. Intuitively the function on the left will have less
total variation.
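This may equivalently be written in linear operator notation
\[ \underbrace{\left\| \frac{\partial f}{\partial x} \right\|_1}_{\text{function space norm}} \approx \frac{\Delta x \Delta y \Delta z}{\Delta x} \left\| L_x \vec{f} \right\|_1, \qquad
L_x \equiv \begin{bmatrix}
-1 & 1 & 0 & 0 & 0 & \cdots \\
0 & -1 & 1 & 0 & 0 & \cdots \\
0 & 0 & -1 & 1 & 0 & \cdots \\
0 & 0 & 0 & -1 & 1 & \cdots \\
 & & & & \ddots & \ddots
\end{bmatrix} \]
Operators for the y and z directions, $L_y$ and $L_z$, may be similarly derived. Hence our total variation denoising
problem is of the form of minimizing a function with a differentiable and a non-differentiable term.
\[ \min_{\vec{f} \in \mathbb{R}^{256 \times 256 \times 100}} \; \underbrace{\left( \vec{f} - \vec{b}, \vec{f} - \vec{b} \right)}_{\text{differentiable}} + \underbrace{\| L_x \vec{f} \|_1 + \| L_y \vec{f} \|_1 + \| L_z \vec{f} \|_1}_{\text{non-differentiable}} \]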
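To make the two terms of the denoising objective concrete, the following is a minimal 1D sketch. It assumes unit grid spacing and the one-half factor on the data term; the `weight` parameter, the function names, and the five-point signals are hypothetical choices for illustration only.

```python
def total_variation(f):
    """Discrete 1D total variation: sum of |f[i+1] - f[i]|."""
    return sum(abs(f[i + 1] - f[i]) for i in range(len(f) - 1))

def tv_objective(f, b, weight=1.0):
    """0.5 * ||f - b||_2^2 + weight * TV(f), assuming unit grid spacing."""
    fidelity = 0.5 * sum((fi - bi) ** 2 for fi, bi in zip(f, b))
    return fidelity + weight * total_variation(f)

smooth = [0.0, 0.0, 1.0, 1.0, 1.0]   # a single clean jump
noisy = [0.0, 0.2, 0.9, 1.1, 1.0]    # the same step with small oscillations

print(total_variation(smooth))       # one jump of height 1
print(total_variation(noisy))        # oscillations add to the variation
print(tv_objective(smooth, noisy))   # data misfit plus variation of the candidate
```

The oscillating signal has the larger total variation, which is why penalizing this term favors the clean step as a denoised estimate.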
Figure 42: Comparison of various optimization techniques in 2D and 100D. (a) Nelder Mead (b) Steepest
Descent (c) Quasi-Newton (d) Newton.
\[ f(x) = \sum_{i=1}^{N-1} 100 \left( x_i^2 - x_{i+1} \right)^2 + (x_i - 1)^2, \qquad x \in \mathbb{R}^N \]
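The gradient-based solvers compared here need both the function value and its gradient. As a hedged sketch (not the contents of the MATLAB file referenced in these notes), the test function above and its analytic gradient can be written as follows, with a central-difference check; the names `rosenbrock`, `rosenbrock_grad`, and `fd_grad` are hypothetical.

```python
def rosenbrock(x):
    """f(x) = sum_i 100*(x_i^2 - x_{i+1})^2 + (x_i - 1)^2."""
    return sum(100.0 * (x[i] ** 2 - x[i + 1]) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))

def rosenbrock_grad(x):
    """Analytic gradient, accumulated term by term."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        d = x[i] ** 2 - x[i + 1]
        g[i] += 400.0 * x[i] * d + 2.0 * (x[i] - 1.0)
        g[i + 1] += -200.0 * d
    return g

def fd_grad(f, x, h=1e-6):
    """Central finite differences, two function evaluations per variable."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

x0 = [-2.0, -2.0]                     # same starting point as x0 = -2*ones(n,1)
print(rosenbrock(x0), rosenbrock_grad(x0), fd_grad(rosenbrock, x0))
```

Checking an analytic gradient against finite differences before handing it to a solver is a cheap way to catch sign and index errors.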
\exampledir/multidimrosenboth.m
\exampledir/ExOptimizationComparison.m
clear all
close all
maxIter = 10e9
maxFunEval = 5e3
nDimension = 2;
x0 = -2*ones(nDimension,1);
% direct search
[x,fval] = fminsearch(@rosenboth, x0, optimset('TolX',1e-8,'Display','Simplex','PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptNelderMead2D','png')
pause
% steepest descent
[x,fval] = fminunc(@rosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,'TolX',1e-8,...
'Display','Iter','GradObj','on','Hessian','off','LargeScale','off','HessUpdate','steepdesc',...
'PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptSteepestDescent2D','png')
pause
% trust region hessian approx
[x,fval] = fminunc(@rosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,...
'TolX',1e-8,'Display','Iter','GradObj','on','Hessian','off',...
'TolPCG',1e-3,'PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptQuasiNewton2D','png')
pause
% trust region exact hessian
[x,fval] = fminunc(@rosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,...
'TolX',1e-8,'Display','Iter','GradObj','on','Hessian','on',...
'TolPCG',1e-3,'PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptNewton2D','png')
pause
fminsearch, Nelder Mead Method Direct search methods may be appropriate for optimizing functions
with the following properties
• Function evaluation is very expensive or time consuming.
• Gradient information is not practical, i.e. discontinuous functions or very complex physics models.
• Numerical derivatives are impractical and/or slow.
Similar to a bisection or golden section method in 1D, the Nelder Mead method uses geometrical arguments
to reach a minimum. A simplex in 2D is a triangle and in 3D is a tetrahedron.
\[ \min_x f(x) \]
Here the objective function, $f : \mathbb{R}^n \to \mathbb{R}$, is in general a nonlinear function that is typically too expensive to
evaluate extensively enough to obtain a global perspective. We generally only have function evaluation and
derivative information at a finite set of points x0 , x1 , x2 , ... and must design algorithms with this in mind.
Ideally we would like to find a global minimizer to the optimization problem.
(Definition) Global Minimizer A point x∗ is a global minimizer if
\[ f(x^*) \le f(x) \qquad \forall x \]
For a general nonlinear and possibly discontinuous function, since we typically only evaluate the objective
function at a finite set of points, we can never be certain that we have not neglected to sample a region in
which the function takes a sharp dip towards a minimum. This global optimization problem is outside the
scope of this class.
We will restrict our discussion to the optimization of functions that are sufficiently smooth that all
necessary 1st and 2nd derivatives exist and are well defined. Using derivative information, it is feasible to
search for solutions that are local minimizers within this framework.
(Definition) Local Minimizer A point x∗ is a local minimizer if there exists a neighborhood N of x∗ such that
\[ f(x^*) \le f(x) \qquad \forall x \in N \]
Many optimization packages developed at national labs, as well as the optimization toolbox in MATLAB, have
adopted this framework of searching for local minimizers and define the solutions returned to the user as local
optimizers.
Taylor's theorem is the primary mathematical tool used for studying local minimizers.
Figure 44: Global vs local Minimum
and substituting
\[ f(x + p) = f(x) + \nabla f(x)^\top p + \tfrac{1}{2}\, p^\top \nabla^2 f(x + tp)\, p, \qquad p \in \mathbb{R}^n, \quad \text{for some } t \in (0, 1) \]
Necessary conditions of an optimal solution assume that x∗ is a local minimizer and then show
requirements on the gradient and Hessian.
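Theorem 6.2 (First Order Necessary Conditions). Given that f is continuously differentiable in a neighborhood of x∗,
\[ x^* \text{ is a local minimizer} \;\Rightarrow\; \nabla f(x^*) = 0 \]
Proof. Suppose for contradiction that $\nabla f(x^*) \neq 0$ and choose $p = -\nabla f(x^*)$, so that $p^\top \nabla f(x^*) = -\|\nabla f(x^*)\|^2 < 0$.
Because ∇f is continuous near x∗ there is a scalar T > 0 such that
\[ p^\top \nabla f(x^* + tp) < 0 \qquad \forall t \in [0, T] \]
By Taylor's theorem, for any $\hat{t} \in (0, T]$
\[ f(x^* + \hat{t} p) = f(x^*) + \hat{t}\, p^\top \nabla f(x^* + tp) \qquad \text{for some } t \in (0, \hat{t}) \]
Therefore, $f(x^* + \hat{t} p) < f(x^*)$ for all $\hat{t} \in (0, T]$, using the fact that
\[ a = b + c, \quad c < 0 \;\Rightarrow\; a < b \]
We have found a direction leading away from x∗ along which f decreases, so x∗ is not a local minimizer.
Contradiction.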
According to Theorem 6.2, any local minimizer must be a stationary point.
(Definition) Stationary Point A stationary point x∗ of a function is a point where the gradient vanishes.
\[ \nabla f(x^*) = 0 \]
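(Definition) Positive Definite, Negative Definite, etc. A symmetric matrix A is
\[ \begin{aligned}
x^\top A x &> 0 \quad \forall x \neq 0 && \text{Positive Def.} \\
x^\top A x &< 0 \quad \forall x \neq 0 && \text{Negative Def.} \\
x^\top A x &\ge 0 \quad \forall x && \text{Positive Semi-Def.} \\
x^\top A x &\le 0 \quad \forall x && \text{Negative Semi-Def.}
\end{aligned} \]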
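For a symmetric matrix these definitions can be checked through the signs of the eigenvalues. The sketch below handles only the 2×2 case, where the eigenvalues have a closed form; the function name and the tolerance `tol` are assumptions made for illustration.

```python
import math

def classify_symmetric_2x2(a, b, d, tol=1e-12):
    """Classify the symmetric matrix [[a, b], [b, d]] from its two real eigenvalues."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    lo, hi = mean - radius, mean + radius        # smallest and largest eigenvalue
    if lo > tol:
        return "positive definite"
    if hi < -tol:
        return "negative definite"
    if lo >= -tol:
        return "positive semidefinite"
    if hi <= tol:
        return "negative semidefinite"
    return "indefinite"

print(classify_symmetric_2x2(2.0, -1.0, 2.0))   # eigenvalues 1 and 3
print(classify_symmetric_2x2(2.0, 0.0, -2.0))   # eigenvalues -2 and 2
print(classify_symmetric_2x2(2.0, 0.0, 0.0))    # eigenvalues 0 and 2
```

A matrix with eigenvalues of both signs falls into none of the four definitions and is called indefinite.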
Theorem 6.3 (Second Order Necessary Conditions). If ∇2 f is continuous in a neighborhood of x∗,
\[ x^* \text{ is a local minimizer} \;\Rightarrow\; \nabla f(x^*) = 0 \text{ and } \nabla^2 f(x^*) \text{ is positive semidefinite} \]
Proof. That x∗ is a stationary point follows from Theorem 6.2. We argue by the contrapositive,
\[ (p \Rightarrow q) \;\Leftrightarrow\; (\sim q \Rightarrow \sim p) \]
and suppose that $\nabla^2 f(x^*)$ is not positive semidefinite.
Then we can choose a vector p such that $p^\top \nabla^2 f(x^*)\, p < 0$. By continuity ∃T > 0
such that $p^\top \nabla^2 f(x^* + tp)\, p < 0$ for all t ∈ [0, T ].
Using a Taylor expansion about x∗, we have for any $\hat{t} \in (0, T]$ and some $t \in (0, \hat{t})$
\[ f(x^* + \hat{t} p) = f(x^*) + \underbrace{\hat{t}\, p^\top \nabla f(x^*)}_{=0,\ \text{by assumption}} + \tfrac{1}{2} \hat{t}^2\, p^\top \nabla^2 f(x^* + tp)\, p < f(x^*) \]
Thus x∗ is not a local minimizer, as we have found a direction along which f is decreasing.
Theorem 6.4 (Second Order Sufficient Conditions). If ∇2 f is continuous in a neighborhood of x∗ and
$\nabla^2 f(x^*)$ is positive definite, then
\[ \nabla f(x^*) = 0 \;\Rightarrow\; x^* \text{ is a strict local minimizer} \]
Proof. Because the Hessian is continuous and positive definite at x∗ , we can choose a radius r > 0 so that
$\nabla^2 f(x)$ remains positive definite for all x in a neighborhood $D = \{z : \|z - x^*\| < r\}$. For any nonzero p with
$\|p\| < r$, $x^* + p \in D$ and, by Taylor's theorem,
\[ f(x^* + p) = f(x^*) + \underbrace{p^\top \nabla f(x^*)}_{=0,\ \text{by assumption}} + \tfrac{1}{2}\, p^\top \nabla^2 f(z)\, p = f(x^*) + \tfrac{1}{2}\, p^\top \nabla^2 f(z)\, p \]
where $z = x^* + tp$ for some $t \in (0, 1)$. Since $z \in D$, $p^\top \nabla^2 f(z)\, p > 0$ and hence $f(x^* + p) > f(x^*)$.
Note that the second order sufficient conditions are not necessary: A point x∗ may be a strict local
minimizer, and yet may fail to satisfy the sufficient conditions.
Example 65 (Sufficient not Necessary). Consider $f(x) = x^4$ for which x∗ = 0 is a strict local minimizer at
which the Hessian vanishes and is therefore not positive definite.
6.2 Search Directions, [Nocedal and Wright, 1999] Ch 2
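It is important to recognize the search directions that produce a descent in the objective function.
• The steepest descent direction $-\nabla f_k$ is an obvious search direction to decrease our function value.
Among all directions we could move from the current iterate $x_k$, it is the one along which f decreases most
rapidly. To see this, recall our Taylor expansion for a search direction p, $\|p\| = 1$, and step length α
\[ \begin{aligned}
f(x + \alpha p) &= f(x) + \alpha \nabla f(x)^\top p + \tfrac{1}{2} \alpha^2\, p^\top \nabla^2 f(x + tp)\, p, \qquad p \in \mathbb{R}^n, \quad \text{for some } t \in (0, \alpha) \\
&= f(x) + \alpha \nabla f(x)^\top p + O(\alpha^2)
\end{aligned} \]
The rate of change of the function is the coefficient of α, i.e. $p^\top \nabla f_k$. Hence the direction of the most rapid
decrease is the solution to the problem
\[ \min_p \; p^\top \nabla f_k \qquad \text{subject to } \|p\| = 1 \]
Since $\|p\| = 1$, using properties of our inner product from vector calculus, $p^\top \nabla f_k = \|p\| \|\nabla f_k\| \cos\theta = \|\nabla f_k\| \cos\theta$,
which is minimized at $\cos\theta = -1$, i.e. $p = -\nabla f_k / \|\nabla f_k\|$.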
• In general, any search direction that makes an angle of strictly less than π/2 radians with $-\nabla f_k$ is a
descent direction, provided the step length is sufficiently small. Again we use Taylor's theorem to see
this.
\[ f(x_k + \epsilon p_k) = f(x_k) + \epsilon\, p_k^\top \nabla f_k + O(\epsilon^2) \approx f(x_k) + \epsilon\, p_k^\top \nabla f_k \]
When the angle $\theta_k$ between $p_k$ and $\nabla f_k$ is such that $\cos\theta_k < 0$
\[ \theta_k \in (\pi/2, 3\pi/2) \;\Rightarrow\; (p_k, \nabla f_k) = \|p_k\| \|\nabla f_k\| \cos\theta_k < 0 \;\Rightarrow\; f(x_k + \epsilon p_k) < f(x_k), \quad \epsilon \text{ sufficiently small} \]
• One of the most important directions is the Newton direction. Consider a second order Taylor series
approximation
\[ f(x_k + p) \approx f_k + p^\top \nabla f_k + \tfrac{1}{2}\, p^\top \nabla^2 f_k\, p \equiv m_k(p) \]
where we interpret $m_k(p)$ as a quadratic model approximation to our function at $x_k$. Assuming $\nabla^2 f_k$
is positive definite, we find the Newton direction by minimizing our quadratic model, $m_k(p)$.
\[ \frac{\partial m_k(p)}{\partial p_i} = 0 \;\Rightarrow\; p_k^N = -\left( \nabla^2 f_k \right)^{-1} \nabla f_k \]
It is important to realize that the Newton direction is not defined when the inverse $(\nabla^2 f_k)^{-1}$ does not
exist. This is no different than before: a unique solution will not exist when the dimension of the null space
of the operator is non-zero, $N(\nabla^2 f_k) \neq \{0\}$. Further, when the Hessian is not positive definite the
Newton direction may actually increase the objective function value and is not suitable.
\[ (\nabla^2 f_k\, x, x) \not> 0 \;\; \forall x \;\Rightarrow\; (\nabla f_k, p_k^N) \not< 0 \text{ is possible} \]
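As a small numerical illustration of the descent test $p^\top \nabla f_k < 0$, the sketch below computes the Newton direction for a 2×2 Hessian by Cramer's rule and checks its slope. The matrices and gradients are assumed example values and `newton_direction` is a hypothetical helper name.

```python
def newton_direction(H, g):
    """Solve the 2x2 system H p = -g by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if abs(det) < 1e-14:
        raise ValueError("Hessian is singular; Newton direction undefined")
    p0 = (-g[0] * H[1][1] + g[1] * H[0][1]) / det
    p1 = (-g[1] * H[0][0] + g[0] * H[1][0]) / det
    return [p0, p1]

def slope(p, g):
    """Directional derivative p^T grad f; negative means descent."""
    return p[0] * g[0] + p[1] * g[1]

# Positive definite Hessian (assumed example): Newton step is a descent direction.
H_pd, g_pd = [[2.0, -1.0], [-1.0, 2.0]], [1.0, 0.0]
p_pd = newton_direction(H_pd, g_pd)
slope_pd = slope(p_pd, g_pd)

# Indefinite Hessian (assumed example): the Newton step points uphill here.
H_in, g_in = [[2.0, 0.0], [0.0, -2.0]], [0.0, 1.0]
p_in = newton_direction(H_in, g_in)
slope_in = slope(p_in, g_in)

# The steepest descent direction -g always has slope -||g||^2 < 0.
slope_sd = slope([-g_in[0], -g_in[1]], g_in)
print(p_pd, slope_pd, p_in, slope_in, slope_sd)
```

In the indefinite case the steepest descent direction still decreases f locally, which is one reason practical codes fall back to it or modify the Hessian.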
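As before, we are interested in minimizing the least squares residual
\[ \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}) = \min_{\vec{x} \in \mathbb{R}^n} \tfrac{1}{2}\, \vec{r}^\top \vec{r} = \min_{\vec{x} \in \mathbb{R}^n} \tfrac{1}{2} \sum_{l=1}^{m} \left( y_l - \phi(t_l, \vec{x}) \right)^2, \qquad \vec{r} = \vec{y} - \vec{\phi} \]
The gradient of the objective function $f(\vec{x})$ is given as the matrix vector product of the Jacobian transpose
times the residual.
\[ (\nabla f)_i = \frac{\partial}{\partial x_i} \left( \tfrac{1}{2} \sum_{l=1}^{m} r_l r_l \right) = \sum_{l=1}^{m} r_l \frac{\partial r_l}{\partial x_i}, \qquad \nabla f = J^\top \vec{r}, \qquad J_{ij} = \frac{\partial r_i}{\partial x_j} = -\frac{\partial \phi(t_i, \vec{x})}{\partial x_j} \]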
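Similarly, without working through the algebra, the matrix of second derivatives may be obtained as a
function of the Jacobian $J(\vec{x})$, the residual $\vec{r}$, and the Hessians of the residual components, $H_i(\vec{x})$
\[ \left( \nabla^2 f \right)_{ij} = \frac{\partial^2}{\partial x_i \partial x_j} \left( \tfrac{1}{2} \sum_{l=1}^{m} r_l r_l \right), \qquad \nabla^2 f = J^\top(\vec{x}) J(\vec{x}) + \sum_{i=1}^{m} r_i(\vec{x}) H_i(\vec{x}), \qquad \left( H_l(\vec{x}) \right)_{ij} = \frac{\partial^2 r_l}{\partial x_i \partial x_j} \]
Notice that all components of the gradient and Hessian depend on the current solution $\vec{x}$, i.e. $J(\vec{x})$, $H_i(\vec{x})$.
Thus if $\vec{x}_k$ is the current solution, then the Newton step $\vec{s}_k$ is given by the linear system
\[ \left[ J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \right] \vec{s}_k = -J^\top(\vec{x}_k)\, \vec{r}(\vec{x}_k) \]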
Gauss-Newton Method. The m Hessian residual matrices $H_i(\vec{x})$ are typically very inconvenient and
expensive to compute. Further, since they are multiplied by the residual, $r_i$, which should be small near a
solution, we are motivated to drop the second order terms and solve an approximation at each step.
\[ J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \approx J^\top(\vec{x}_k) J(\vec{x}_k) \;\Rightarrow\; J^\top(\vec{x}_k) J(\vec{x}_k)\, \vec{s}_k = -J^\top(\vec{x}_k)\, \vec{r}(\vec{x}_k) \qquad (16) \]
Notice that this is an approximation to the Hessian and is thus a Quasi-Newton method. You should recognize
this as the normal equations that we visited when we looked at the linear least squares problem, Eqn (13).
Thus the nonlinear least squares problem reduces to a linear least squares problem at each iteration, which
may be solved by some orthogonalization method.
Update solution
~xk+1 = ~xk + ~sk
end while
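To be explicit, suppose that we have a time series of imaging data. We draw an ROI on the image and
need to fit the measurements within the ROI to a model of the time series,
\[ \phi(t_i, \vec{x}) = x_1 e^{x_2 t_i}, \qquad r_i = y_i - \phi(t_i, \vec{x}) \]
The Jacobian of this model is given by
\[ \left( J(\vec{x}_k) \right)_{i,1} = \frac{\partial r_i}{\partial x_1} = -e^{x_2 t_i}, \qquad \left( J(\vec{x}_k) \right)_{i,2} = \frac{\partial r_i}{\partial x_2} = -x_1 t_i e^{x_2 t_i} \]
Figure 46: Measurements within an ROI. Gauss-Newton nonlinear least squares estimates.
Given an initial guess $x_0 = [1, 0]^\top$ the initial least squares problem to solve is
\[ J(\vec{x}_0)\, \vec{s}_0 \cong -\vec{r}(\vec{x}_0) \]
The solution to this least squares problem and the next iterate are
\[ s_0 = \begin{bmatrix} 0.69 \\ -0.61 \end{bmatrix}, \qquad x_1 = x_0 + \begin{bmatrix} 0.69 \\ -0.61 \end{bmatrix} = \begin{bmatrix} 1.69 \\ -0.61 \end{bmatrix} \]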
As a MATLAB example, consider the following residual function to be used with lsqnonlin in MATLAB.
function [residual, jacobian] = pharmacokinetic(x)
time = [0.0; 1.0; 2.0; 3.0];
y = [2.0; 0.7; 0.3; 0.1];
residual = y - x(1)*exp(x(2)*time);
jacobian = [-exp(x(2)*time), -x(1)*time.*exp(x(2)*time)];
disp(sprintf('%f %f %f', x(1), x(2), residual'*residual))
Below is example usage and output. Without the Jacobian specified, MATLAB will compute a finite difference
approximation of the Jacobian. Notice that this requires one additional function evaluation per optimization
variable. This can be prohibitive when $x \in \mathbb{R}^n$ with n on the order of $10^4$ or larger. When the Jacobian is
specified, far fewer function evaluations are required but you have to explicitly provide the analytic derivatives.
>> lsqnonlin(@pharmacokinetic,[1;0],-inf,inf,optimset('jacobian','off'))
1.000000 0.000000 2.390000
1.000000 0.000000 2.390000
1.000000 0.000000 2.390000
1.690000 -0.610000 0.212590
1.690000 -0.610000 0.212590
1.690000 -0.610000 0.212590
1.975070 -0.930547 0.007335
1.975070 -0.930547 0.007335
1.975070 -0.930547 0.007335
1.994066 -1.003607 0.002024
1.994066 -1.003607 0.002024
1.994066 -1.003607 0.002024
1.994955 -1.009347 0.001996
1.994955 -1.009347 0.001996
1.994955 -1.009347 0.001996
1.995002 -1.009520 0.001996
1.995002 -1.009520 0.001996
1.995002 -1.009520 0.001996
ans =
1.9950
-1.0095
>> lsqnonlin(@pharmacokinetic,[1;0],-inf,inf,optimset('jacobian','on'))
1.000000 0.000000 2.390000
1.690000 -0.610000 0.212590
1.975070 -0.930547 0.007335
1.994066 -1.003607 0.002024
1.994955 -1.009347 0.001996
1.995002 -1.009520 0.001996
ans =
1.9950
-1.0095
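The iterates printed above can be reproduced with a plain Gauss-Newton loop. The following is a sketch, not the algorithm that lsqnonlin actually runs: it assumes the same data and model as the pharmacokinetic function, solves the 2×2 normal equations by Cramer's rule, and uses no step control. The function names are hypothetical.

```python
import math

T = [0.0, 1.0, 2.0, 3.0]          # measurement times
Y = [2.0, 0.7, 0.3, 0.1]          # measurements

def residual_and_jacobian(x):
    """r_i = y_i - x1*exp(x2*t_i) and its Jacobian with respect to (x1, x2)."""
    r = [y - x[0] * math.exp(x[1] * t) for t, y in zip(T, Y)]
    J = [[-math.exp(x[1] * t), -x[0] * t * math.exp(x[1] * t)] for t in T]
    return r, J

def gauss_newton_step(x):
    """Solve the normal equations J^T J s = -J^T r for the two-parameter model."""
    r, J = residual_and_jacobian(x)
    a11 = sum(row[0] * row[0] for row in J)
    a12 = sum(row[0] * row[1] for row in J)
    a22 = sum(row[1] * row[1] for row in J)
    g1 = sum(row[0] * ri for row, ri in zip(J, r))
    g2 = sum(row[1] * ri for row, ri in zip(J, r))
    det = a11 * a22 - a12 * a12
    return [(-g1 * a22 + g2 * a12) / det, (-g2 * a11 + g1 * a12) / det]

x = [1.0, 0.0]                    # initial guess x0 = [1, 0]^T
s0 = gauss_newton_step(x)         # first step
for k in range(20):
    s = gauss_newton_step(x)
    x = [x[0] + s[0], x[1] + s[1]]
    r, _ = residual_and_jacobian(x)
    print("%f %f %f" % (x[0], x[1], sum(ri * ri for ri in r)))
    if math.hypot(s[0], s[1]) < 1e-10:
        break
```

Forming $J^\top J$ explicitly is acceptable for two parameters; for larger or ill-conditioned problems an orthogonalization method applied to $J \vec{s} \cong -\vec{r}$ is the safer choice.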
• trust region: (1) fix distance to move (2) choose search direction
(Definition) Line Search In a line search strategy the algorithm chooses a descent direction $p_k$ and
searches along this direction for the next iterate with a lower function value $f(x_{k+1}) < f(x_k)$. The distance
to move along $p_k$ is formulated as a 1D optimization problem for α
\[ \min_{\alpha > 0} f(x_k + \alpha p_k) \qquad (17) \]
Exact solution of the line search subproblem (17) is expensive and unnecessary. Typically an approximate
solution to (17) is used to define $x_{k+1}$ and the subproblem is repeated.
Figure 47: Trust Region Intuition
(Definition) Trust Region In the trust region strategy, information about the objective function f is used
to construct a surrogate model function $m_k$ whose behavior near the current iterate $x_k$ is expected to be
near the actual function f . We search for a solution within the region, $\|p_k\| \le \Delta_k$, in which we trust this
model function to be a good approximation to the actual function. The model is typically chosen to be a quadratic
function and the trust region approach proceeds by solving a sequence of subproblems.
\[ \min_p m_k(p), \qquad \|p\| \le \Delta_k, \qquad m_k(p) \equiv f_k + p^\top \nabla f_k + \tfrac{1}{2}\, p^\top B_k\, p \qquad (18) \]
Bk is typically chosen as the Hessian matrix or some approximation to it. The Gauss-Newton approximation
to the Hessian, seen in (16), is an excellent example of a Quasi-Newton approximation of the Hessian for the
nonlinear least squares problem. If the trust region subproblem does not achieve adequate decrease in the
objective function we conclude that the model function is not a good approximation and decrease the trust
region radius which, by Taylor series, should produce a closer approximation to the actual function.
The performance of an optimization algorithm may be characterized in terms of its convergence to a
solution, x∗ .
(Definition) Convergence Rate We say that the convergence rate of an algorithm to a solution x∗ is p,
(p ≥ 1), if there exists a positive constant M (with M < 1 when p = 1) such that
\[ \frac{\| x_{k+1} - x^* \|}{\| x_k - x^* \|^p} \le M, \qquad k \text{ sufficiently large} \]
Steepest descent methods converge linearly, p = 1, and Newton methods converge quadratically, p = 2.
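The rate p can be estimated from a sequence of errors $e_k = \|x_k - x^*\|$ via $p \approx \log(e_{k+1}/e_k) / \log(e_k/e_{k-1})$. The sketch below applies this to Newton's method in 1-D for $f(x) = \tfrac{1}{2}x^2 - \tfrac{1}{3}x^3$, the function used in the Newton example of these notes, converging to the stationary point x∗ = 0; the starting point -0.25 and the helper name are assumptions for illustration.

```python
import math

def newton_iterates(x0, n):
    """Newton iteration x <- x - f'(x)/f''(x) with f'(x) = x - x^2, f''(x) = 1 - 2x."""
    xs = [x0]
    for _ in range(n):
        x = xs[-1]
        xs.append(x - (x - x * x) / (1.0 - 2.0 * x))
    return xs

xs = newton_iterates(-0.25, 4)
errors = [abs(x - 0.0) for x in xs]              # distance to x* = 0

# Estimated order p from three consecutive errors.
orders = [math.log(errors[k + 1] / errors[k]) / math.log(errors[k] / errors[k - 1])
          for k in range(1, len(errors) - 1)]
# Ratios e_{k+1} / e_k^2, which play the role of the constant M for p = 2.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(len(errors) - 1)]
print(orders)
print(ratios)
```

The estimated orders approach 2 and the ratios stay bounded, consistent with quadratic convergence near the solution.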
Figure 48: Wolfe Conditions. (Sufficient Decrease) The reduction in f should be proportional to both the
step length α and the directional derivative $\nabla f_k^\top p_k$. For example, a sequence of iterates whose reductions
infinitesimally converge to zero, 1/k, k = 1, 2, ..., is NOT acceptable. (Curvature Condition) Rules out unacceptably
short steps. The curvature condition ensures that the slope at the next iterate, $\phi'(\alpha_k)$, is greater than $c_2$ times the
initial slope $\phi'(0)$. That is, if the slope is strongly negative we have an indication that we can continue moving along
this direction. Otherwise, if the slope is only slightly negative or perhaps positive, we really cannot expect
much more of a decrease in this direction.
• Practical strategies perform an inexact line search that achieves an adequate reduction of the
objective function f at minimal cost.
A popular inexact line search approach imposes two requirements on the step length α:
• Sufficient decrease in the function is seen. Sufficient decrease is measured by the following inequality.
\[ f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f_k^\top p_k, \qquad c_1 \in (0, 1) \]
In practice, $c_1 \approx 10^{-4}$ is typically used.
• Step lengths are reasonably far from the current iterate. To rule out unacceptably small steps, a
curvature condition is imposed that requires $\alpha_k$ to satisfy
\[ \nabla f(x_k + \alpha_k p_k)^\top p_k \ge c_2 \nabla f_k^\top p_k, \qquad c_2 \in (c_1, 1), \qquad 0 < c_1 < c_2 < 1 \]
In practice, $c_2 \approx 0.9$ is typically used for Newton methods.
Collectively, the sufficient decrease and curvature conditions are known as the Wolfe conditions.
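The two Wolfe conditions are easy to test for a candidate step length. The following sketch writes them in terms of the 1D function $\phi(\alpha) = f(x_k + \alpha p_k)$ and its slope $\phi'(\alpha) = \nabla f(x_k + \alpha p_k)^\top p_k$; the example $f(x) = x^2$ at $x_k = 1$ with $p_k = -2$ and the function names are assumptions for illustration.

```python
def wolfe_conditions(phi, dphi, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for the step length alpha."""
    sufficient_decrease = phi(alpha) <= phi(0.0) + c1 * alpha * dphi(0.0)
    curvature = dphi(alpha) >= c2 * dphi(0.0)
    return sufficient_decrease, curvature

# Assumed example: f(x) = x^2 at x_k = 1 along p_k = -f'(x_k) = -2.
def phi(alpha):
    return (1.0 - 2.0 * alpha) ** 2

def dphi(alpha):
    return -4.0 * (1.0 - 2.0 * alpha)

too_short = wolfe_conditions(phi, dphi, 0.001)   # decreases, but slope still steep
good = wolfe_conditions(phi, dphi, 0.5)          # exact minimizer along the line
too_long = wolfe_conditions(phi, dphi, 1.0)      # overshoots, no decrease
print(too_short, good, too_long)
```

The very short step passes sufficient decrease but fails the curvature test, while the overshooting step fails sufficient decrease, which is exactly the pair of failure modes the two conditions are meant to exclude.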
Algorithm 2 Steepest Descent
Require: tolerance $\epsilon > 0$, and initial guess x0
k=0
while $\|\nabla f_k\| > \epsilon$ do
Compute $f_k$ and $\nabla f_k$
Compute step length along gradient $\alpha_k$
$x_{k+1} = x_k - \alpha_k \nabla f_k$
k =k+1
end while
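Hence, without loss of generality, we can study problems where the matrix is diagonal.
\[ f(x) = \tfrac{1}{2}\, x^\top A x - b^\top x, \qquad \nabla f(x) = Ax - b, \qquad A = \mathrm{diag}(\lambda_1, \lambda_2, \ldots) \]
Hence for A SPD, the solution x∗ is the solution to the linear system Ax∗ = b.
\[ A = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}, \qquad b = 0, \qquad Q^\top A Q = \Lambda = \begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix}, \qquad Q = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, \qquad Q^{-1} = Q^\top \]
Left: Graph of the function. Contour lines, with the red lines indicating the eigenvector directions of A.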
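\[ A_1 = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix}, \qquad A_2 = \begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix}, \qquad A = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix}, \qquad b_1 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \qquad b_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \]
Indefinite matrices lead to saddle points. For a semi-definite Hessian matrix, the choice of b influences the
existence of a solution. In the singular direction, the function is dominated by the linear term in b. The
function based on A and $b_1$ is unbounded from below and thus no solution exists. However, the function based on
A and $b_2$ is independent of $x_2$ and bounded from below, so a solution exists but it is not unique.
An exact solution to the line search may be obtained in this situation by differentiating the line search
function with respect to α, with $g_k \equiv \nabla f_k$,
\[ \frac{d}{d\alpha} f(x_k - \alpha g_k) = \frac{d}{d\alpha} \left[ \tfrac{1}{2} (x_k - \alpha g_k)^\top A (x_k - \alpha g_k) - b^\top (x_k - \alpha g_k) \right] = 0 \;\Rightarrow\; \alpha = \frac{\nabla f_k^\top \nabla f_k}{\nabla f_k^\top A \nabla f_k} \]
Steepest descent with exact line search in this case yields
\[ x_{k+1} = x_k - \frac{\nabla f_k^\top \nabla f_k}{\nabla f_k^\top A \nabla f_k}\, \nabla f_k \]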
The error in the solution may be bounded in terms of the largest and smallest eigenvalues.
Theorem 6.5 (Convergence of Steepest Descent [Nocedal and Wright, 1999]). Given that $f : \mathbb{R}^n \to \mathbb{R}$ is
twice differentiable, and that the iterates generated by the steepest descent method with exact line searches converge
to a point x∗ where the Hessian matrix $\nabla^2 f(x^*)$ is positive definite. Then the convergence of the algorithm
is bounded by the eigenvalues of the Hessian, $\lambda_1 \le \ldots \le \lambda_n$
\[ f(x_{k+1}) - f(x^*) \le \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2 \left[ f(x_k) - f(x^*) \right] \]
Examples are provided in Figure 49.
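The bound can be observed numerically on the quadratic example with A = [[2, -1], [-1, 2]] and b = 0, whose eigenvalues are 1 and 3, so the contraction factor is $((3-1)/(3+1))^2 = 0.25$. The sketch below runs steepest descent with the exact step length; the starting point (2, 1) and the helper names are assumptions for illustration.

```python
A = [[2.0, -1.0], [-1.0, 2.0]]
b = [0.0, 0.0]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def f(x):
    """Quadratic objective 0.5 x^T A x - b^T x, with minimizer x* = 0 and f(x*) = 0."""
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)

x = [2.0, 1.0]                                   # assumed starting point
values = [f(x)]
for k in range(10):
    g = [gi - bi for gi, bi in zip(matvec(A, x), b)]   # gradient A x - b
    alpha = dot(g, g) / dot(g, matvec(A, g))            # exact line search step
    x = [xi - alpha * gi for xi, gi in zip(x, g)]
    values.append(f(x))

bound = ((3.0 - 1.0) / (3.0 + 1.0)) ** 2         # ((l_n - l_1)/(l_n + l_1))^2
print(values)
print([values[k + 1] / values[k] for k in range(len(values) - 1)], bound)
```

For this starting point the observed reduction per iteration equals the bound, illustrating that the estimate cannot be improved in general.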
It is worth mentioning that the condition number is related to the largest and smallest eigenvalues, for
the matrix norm induced by the 2-norm
\[ \kappa(A) = \mathrm{cond} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} = \frac{\lambda_{\max}}{\lambda_{\min}}, \qquad A : (\mathbb{R}^n, \|\cdot\|_2) \to (\mathbb{R}^n, \|\cdot\|_2) \]
Consider
\[ f(x_1, x_2) = \tfrac{1}{2} \left( c_1 x_1^2 + c_2 x_2^2 \right), \qquad A = \mathrm{diag}(c_1, c_2) \qquad (19) \]
This function is convex and has a global minimum at x = 0 when the eigenvalues $c_1$ and $c_2$ are positive.
The steepest descent method takes many iterations to converge to a solution when the eigenvalues are far
apart; however, the Newton method converges very quickly.
http://www.cse.illinois.edu/iem/optimization/SteepestDescent
http://www.cse.illinois.edu/iem/optimization/Newton Opt2D
Theorem 6.6 (Convergence of Newton Method [Nocedal and Wright, 1999]). Given that f : Rn → R is
sufficiently differentiable in a neighborhood N such that the sufficient conditions are satisfied.
Newton Direction It is helpful to look at Newton's method in 1-D. We expect to see quadratic convergence,
i.e. the number of correct digits doubles at each iteration. In 1-D the Newton method becomes
\[ H \left[ x_{k+1} - x_k \right] = -\nabla f_k \;\Rightarrow\; f_k'' \left[ x_{k+1} - x_k \right] = -f_k' \;\Rightarrow\; x_{k+1} = x_k - \frac{f_k'}{f_k''} \]
function newtonExample(x0)
x = x0;
Tol = 0.000001;
count = 0;
fprime = x - x^2;
f = 1/2*x^2 - 1/3*x^3; % compute the initial value of f(x)
dx = 1; % this is a fake value so that the while loop will execute
fprintf('step x dx f(x)\n')
fprintf('---- ---------------------- -------------------- ---------------------\n')
fprintf('%3i %23.16e %23.16e %23.16e\n', count, x, dx, f)
xVec = x; fVec = f;
while (abs(fprime) > Tol) % loop until stationary point reached
    count = count + 1;
    fprime = x - x^2;
    fdoubleprime = 1 - 2*x;
    xnew = x - (fprime/fdoubleprime); % compute the new value of x
    dx = abs(x - xnew); % compute how much x has changed since last step
    x = xnew;
    f = 1/2*x^2 - 1/3*x^3; % compute the new value of f(x)
    fprintf('%3i %23.16e %23.16e %23.16e\n', count, x, dx, f)
end
\[ f(x) = \tfrac{1}{2} x^2 - \tfrac{1}{3} x^3 \]
http://www.cse.illinois.edu/iem/nonlinear eqns/Newton
>> newtonExample(20)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 2.0000000000000000e+01 1.0000000000000000e+00 -2.4666666666666665e+03
1 1.0256410256410257e+01 9.7435897435897427e+00 -3.0704046483139189e+02
2 5.3910172175612390e+00 4.8653930388490183e+00 -3.7694964230510664e+01
3 2.9710656645912565e+00 2.4199515529699824e+00 -4.3284789023937087e+00
4 1.7861182949934300e+00 1.1849473695978265e+00 -3.0425996536838751e-01
5 1.2402508292312779e+00 5.4586746576215206e-01 1.3318397332485366e-01
6 1.0389870964455772e+00 2.0126373278570076e-01 1.6588691644185172e-01
7 1.0014100464549900e+00 3.7577049990587197e-02 1.6666567161666468e-01
8 1.0000019826397770e+00 1.4080638152129676e-03 1.6666666666470126e-01
9 1.0000000000039309e+00 1.9826358461649818e-06 1.6666666666666669e-01
10 1.0000000000000000e+00 3.9308556409878292e-12 1.6666666666666669e-01
>> newtonExample(-20)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 -2.0000000000000000e+01 1.0000000000000000e+00 2.8666666666666665e+03
1 -9.7560975609756095e+00 1.0243902439024390e+01 3.5712385678288661e+02
2 -4.6402366520692553e+00 5.1158609089063543e+00 4.4070108044524261e+01
3 -2.0944362725536232e+00 2.5458003795156321e+00 5.2558605600796291e+00
4 -8.4539815955291497e-01 1.2490381130007082e+00 5.5875049560892487e-01
5 -2.6560837886568645e-01 5.7978978068722853e-01 4.1519935359147608e-02
6 -4.6073039997407417e-02 2.1953533886827903e-01 1.0939626388017812e-03
7 -1.9436273713611465e-03 4.4129412626046270e-02 1.8912911515357276e-06
8 -3.7630593882501204e-06 1.9398643119728964e-03 7.0803257421616288e-12
9 -1.4160509385405506e-11 3.7630452277407350e-06 1.0026001302802519e-22
10 -2.0052182829735183e-22 1.4160509385204984e-11 2.0104501811856324e-44
>> newtonExample(.49)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 4.8999999999999999e-01 1.0000000000000000e+00 8.0833666666666665e-02
1 -1.2004999999999990e+01 1.2494999999999990e+01 6.4878031254166513e+02
2 -5.7624960015993549e+00 6.2425039984006352e+00 8.0387019317008566e+01
3 -2.6512080933838607e+00 3.1112879082154943e+00 9.7261482145687737e+00
4 -1.1152713730936505e+00 1.5359367202902101e+00 1.0843178694211222e+00
5 -3.8502206389628169e-01 7.3024930919736886e-01 9.3146473785263459e-02
6 -8.3750448567531555e-02 3.0127161532875013e-01 3.7028812087205941e-03
7 -6.0078220517643943e-03 7.7742626515767160e-02 1.8119244863963280e-05
8 -3.5665383253675152e-05 5.9721566685107192e-03 6.3602490367083519e-10
9 -1.2719288349747086e-09 3.5664111324840178e-05 8.0890148130596978e-19
10 -1.6178029439585086e-18 1.2719288333569056e-09 1.3086431827404087e-36
>> newtonExample(.51)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 5.1000000000000001e-01 1.0000000000000000e+00 8.5832999999999993e-02
1 1.3004999999999990e+01 1.2494999999999990e+01 -6.4861364587499838e+02
2 6.7624960015993558e+00 6.2425039984006343e+00 -8.0220352650341923e+01
3 3.6512080933838615e+00 3.1112879082154943e+00 -9.5594815479021165e+00
4 2.1152713730936510e+00 1.5359367202902106e+00 -9.1765120275445655e-01
5 1.3850220638962822e+00 7.3024930919736875e-01 7.3520192881403101e-02
6 1.0837504485675318e+00 3.0127161532875046e-01 1.6296378545794610e-01
7 1.0060078220517643e+00 7.7742626515767466e-02 1.6664854742180274e-01
8 1.0000356653832536e+00 5.9721566685106975e-03 1.6666666603064179e-01
9 1.0000000012719288e+00 3.5664111324829051e-05 1.6666666666666669e-01
10 1.0000000000000000e+00 1.2719287845186500e-09 1.6666666666666669e-01
tive information and saves implementation time.
\[ \left( H_l(\vec{x}) \right)_{ij} = \frac{\partial^2 r_l}{\partial x_i \partial x_j}, \qquad \nabla^2 f_k = J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \approx J^\top(\vec{x}_k) J(\vec{x}_k) \equiv B_k \]
Levenberg-Marquardt methods build upon the Gauss-Newton method by shifting the eigenvalues
of the Hessian approximation to make it positive definite.
(Definition) Levenberg-Marquardt The Levenberg-Marquardt approximation to the Hessian of the nonlinear
least squares problem replaces the second derivative terms by a scalar multiple of the identity matrix.
\[ \nabla^2 f_k = J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \approx J^\top(\vec{x}_k) J(\vec{x}_k) + \mu_k I \equiv B_k, \qquad \mu_k > 0 \]
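Notice that the resulting step blends the Gauss-Newton direction (recovered as $\mu_k \to 0$) and the steepest
descent direction (approached as $\mu_k \to \infty$), and that it may again be written as a linear least squares problem.
\[ \left[ J^\top(\vec{x}_k) J(\vec{x}_k) + \mu_k I \right] \vec{s}_k = -J^\top(\vec{x}_k)\, \vec{r}(\vec{x}_k) \]
\[ A \equiv \begin{bmatrix} J(\vec{x}_k) \\ \sqrt{\mu_k}\, I \end{bmatrix}, \quad y \equiv \begin{bmatrix} -\vec{r}(\vec{x}_k) \\ 0 \end{bmatrix}: \qquad
A^\top A = \begin{bmatrix} J^\top(\vec{x}_k) & \sqrt{\mu_k}\, I \end{bmatrix} \begin{bmatrix} J(\vec{x}_k) \\ \sqrt{\mu_k}\, I \end{bmatrix} = J^\top(\vec{x}_k) J(\vec{x}_k) + \mu_k I, \qquad
A^\top y = \begin{bmatrix} J^\top(\vec{x}_k) & \sqrt{\mu_k}\, I \end{bmatrix} \begin{bmatrix} -\vec{r}(\vec{x}_k) \\ 0 \end{bmatrix} = -J^\top(\vec{x}_k)\, \vec{r}(\vec{x}_k) \]
\[ A^\top A\, \vec{s}_k = A^\top y \;\Rightarrow\; \begin{bmatrix} J(\vec{x}_k) \\ \sqrt{\mu_k}\, I \end{bmatrix} \vec{s}_k \cong \begin{bmatrix} -\vec{r}(\vec{x}_k) \\ 0 \end{bmatrix} \]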
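The effect of the shift $\mu_k$ can be seen on the exponential fit at $x_0 = [1, 0]^\top$, where the Jacobian rows are $[-1, -t_i]$ and the residual is $y - 1$. The sketch below is illustrative only: `lm_step` is a hypothetical helper that solves the shifted 2×2 normal equations directly, and the values of μ are arbitrary.

```python
import math

def lm_step(J, r, mu):
    """Solve (J^T J + mu I) s = -J^T r for a two-parameter problem."""
    a11 = sum(row[0] * row[0] for row in J) + mu
    a12 = sum(row[0] * row[1] for row in J)
    a22 = sum(row[1] * row[1] for row in J) + mu
    g1 = sum(row[0] * ri for row, ri in zip(J, r))
    g2 = sum(row[1] * ri for row, ri in zip(J, r))
    det = a11 * a22 - a12 * a12
    return [(-g1 * a22 + g2 * a12) / det, (-g2 * a11 + g1 * a12) / det]

# Jacobian and residual of r_i = y_i - x1*exp(x2*t_i) at x = [1, 0], t = 0..3.
J = [[-1.0, 0.0], [-1.0, -1.0], [-1.0, -2.0], [-1.0, -3.0]]
r = [1.0, -0.3, -0.7, -0.9]

s_gn = lm_step(J, r, 0.0)          # mu = 0 recovers the Gauss-Newton step
s_mid = lm_step(J, r, 1.0)         # intermediate shift, shorter step
s_big = lm_step(J, r, 1.0e6)       # large mu: s is close to -(1/mu) J^T r
norm = lambda s: math.hypot(s[0], s[1])
print(s_gn, s_mid, [1.0e6 * v for v in s_big])
```

Scaling the large-μ step by μ gives approximately $-J^\top \vec{r} = (-0.9, -4.4)$, the steepest descent direction, while the step length shrinks as μ grows.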
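The Broyden, Fletcher, Goldfarb, Shanno (BFGS) update is a widely used approximation to the Hessian available
in many packages. Details of why the BFGS approximation is in general a good approximation to the Hessian
may be found in [Nocedal and Wright, 1999] Chapter 8. Similar to the Gauss-Newton and Levenberg-Marquardt
approaches, only first derivative information is needed.
(Definition) BFGS The BFGS method performs low-rank updates of the Hessian approximation using information
at previous iterations. In the form below, $H_k$ approximates the inverse of the Hessian.
\[ s_k \equiv x_{k+1} - x_k, \qquad y_k \equiv \nabla f_{k+1} - \nabla f_k \]
\[ H_{k+1} = \left( I - \rho_k s_k y_k^\top \right) H_k \left( I - \rho_k s_k y_k^\top \right)^\top + \rho_k s_k s_k^\top, \qquad \rho_k = \frac{1}{y_k^\top s_k} \]
Here the matrix updates are built from rank-1 outer products of the form
\[ a b^\top = \begin{bmatrix}
a_1 b_1 & a_1 b_2 & \cdots & a_1 b_n \\
a_2 b_1 & a_2 b_2 & \cdots & a_2 b_n \\
\vdots & \vdots & \ddots & \vdots \\
a_n b_1 & a_n b_2 & \cdots & a_n b_n
\end{bmatrix} \]
so that the overall change $H_{k+1} - H_k$ has rank at most two.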
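A useful sanity check on the BFGS formula is the secant equation $H_{k+1} y_k = s_k$, which the update satisfies by construction. The sketch below applies one update in plain Python, assuming $H_0 = I$ and a step on the quadratic with A = [[2, -1], [-1, 2]], for which $y = A s$; the helper names and values are hypothetical.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def bfgs_inverse_update(H, s, y):
    """H_new = (I - rho s y^T) H (I - rho y s^T) + rho s s^T with rho = 1 / (y^T s)."""
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
    left = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(n)]
            for i in range(n)]
    right = [[left[j][i] for j in range(n)] for i in range(n)]   # transpose of left
    core = matmul(matmul(left, H), right)
    return [[core[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]

H0 = [[1.0, 0.0], [0.0, 1.0]]     # assumed initial inverse Hessian approximation
s = [1.0, 0.0]                    # assumed step x_{k+1} - x_k
y = [2.0, -1.0]                   # gradient change A s for A = [[2, -1], [-1, 2]]

H1 = bfgs_inverse_update(H0, s, y)
H1y = [sum(H1[i][j] * y[j] for j in range(2)) for i in range(2)]
print(H1, H1y)
```

The update requires $y_k^\top s_k > 0$, which the curvature condition of the Wolfe line search guarantees.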
Algorithm 3 Trust Region.
The trust region algorithm is fairly intuitive and allows the trust region radius to grow and shrink
depending on how well the surrogate model approximates the actual model at a given iteration.
pk ≈ min mk (p)
p
Evaluate how well the surrogate model is approximating the actual model
∆k+1 = 1/4kpk k
else
Surrogate model is a reasonable approximation and we are not hitting the boundary.
Theorem 6.7 (Characterization of Exact Solution to Trust Region Subproblem ). The vector p∗ is a global
solution of the trust-region problem
\[ \min_{p \in \mathbb{R}^n} m(p) = f + g^\top p + \tfrac{1}{2}\, p^\top B p, \qquad \|p\| \le \Delta \]
if and only if p∗ is feasible and there is a scalar λ ≥ 0 such that
\[ (B + \lambda I)\, p^* = -g, \qquad \lambda \left( \Delta - \|p^*\| \right) = 0, \qquad (B + \lambda I) \text{ is positive semidefinite} \]
Notice that the sum B + λI is positive semi-definite, but B itself need not be positive semi-definite.
The characterization of the solution provides a complementarity condition: at least one of the nonnegative
quantities λ or $(\Delta - \|p^*\|)$ must be zero. Hence when the solution lies strictly within the trust region, λ = 0 and
the solution is the Newton step, $B p^* = -g$, up to the quality of the Hessian approximation.
When the solution lies on the boundary of the trust region, $\lambda p^* = -(B p^* + g) = -\nabla m(p^*)$, so p∗ is collinear
with the negative gradient of the model function m and normal to its contours.
As with line search methods, we do not attempt to solve the trust region subproblem (18) exactly. The
general approach is to develop algorithms that begin with the so-called Cauchy point and try to improve
upon this estimate, taking the full Newton step $p_k = -(\nabla^2 f_k)^{-1} \nabla f_k$ when the Hessian is positive definite
and the step lies within the trust radius, $\|p_k\| \le \Delta_k$.
(Definition) Cauchy Point Find the vector $p_k^s$ that solves a linear version of (18).
\[ p_k^s = \arg\min_p \; f_k + \nabla f_k^\top p \qquad \text{subject to } \|p\| \le \Delta_k \]
Calculate the scalar $\tau_k > 0$ that minimizes $m_k(\tau p_k^s)$ subject to satisfying the trust-region bound, that is,
\[ \tau_k = \arg\min_{\tau > 0} \; m_k(\tau p_k^s) \qquad \text{subject to } \|\tau p_k^s\| \le \Delta_k \]
Set $p_k^C = \tau_k p_k^s$.
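Minimizing the model along this direction, the Cauchy point is found to be a step of
length $\alpha = \tau_k \Delta_k / \|\nabla f_k\|$ along the steepest descent direction, subject to the trust region radius.
\[ p_k^C = -\tau_k \frac{\Delta_k}{\|\nabla f_k\|} \nabla f_k, \qquad
\tau_k = \begin{cases}
1 & \nabla f_k^\top B_k \nabla f_k \le 0 \\
\min\left( \dfrac{\|\nabla f_k\|^3}{\Delta_k \nabla f_k^\top B_k \nabla f_k},\; 1 \right) & \text{otherwise}
\end{cases} \]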
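The closed form for the Cauchy point translates directly into code. The following is a sketch for the 2D case; the function name `cauchy_point` and the gradient, Hessian approximation, and radius values are assumptions chosen to exercise the three cases (step clipped at the boundary, interior minimizer along the steepest descent direction, and nonpositive curvature).

```python
import math

def cauchy_point(g, B, delta):
    """Return (p_C, tau) for gradient g, symmetric 2x2 matrix B and radius delta."""
    gnorm = math.hypot(g[0], g[1])
    Bg = [B[0][0] * g[0] + B[0][1] * g[1], B[1][0] * g[0] + B[1][1] * g[1]]
    gBg = g[0] * Bg[0] + g[1] * Bg[1]
    if gBg <= 0.0:
        tau = 1.0                                   # model decreases all the way to the boundary
    else:
        tau = min(gnorm ** 3 / (delta * gBg), 1.0)  # interior minimizer or boundary
    scale = -tau * delta / gnorm
    return [scale * g[0], scale * g[1]], tau

B_pd = [[2.0, -1.0], [-1.0, 2.0]]
p_small, tau_small = cauchy_point([3.0, 0.0], B_pd, 1.0)   # clipped at the radius
p_large, tau_large = cauchy_point([3.0, 0.0], B_pd, 3.0)   # interior minimizer

B_in = [[2.0, 0.0], [0.0, -2.0]]
p_neg, tau_neg = cauchy_point([0.0, 1.0], B_in, 2.0)       # negative curvature along g
print(p_small, tau_small, p_large, tau_large, p_neg, tau_neg)
```

With the larger radius the Cauchy point coincides with the exact line search step along the negative gradient, $\alpha = \nabla f^\top \nabla f / \nabla f^\top B \nabla f = 0.5$.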
∇2 fk sk = −∇fk (21)
• For large number of optimization variables, the Gaussian elimination solve grows in computational
complexity as O(n3 ).
• Storage for the Hessian matrix can be quite expensive. Consider a 256 × 256 × 10 image. This image has 655360 voxels. The corresponding matrix for an optimization problem with one variable per voxel requires 655360 × 655360 entries × 8 bytes/entry ≈ 3.4 TB.
• The Hessian inverse, and hence a well defined solution, may not exist. Also, the Hessian may not be positive definite, in which case the Newton direction may not be a descent direction.
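The storage estimate above is simple arithmetic and can be checked as follows. This is a sketch only; the variable names are illustrative, and the "three working vectors" figure for a matrix-free solver is an assumed round number for comparison.

```python
# Dense Hessian storage for one unknown per voxel of a 256 x 256 x 10 image.
voxels = 256 * 256 * 10
bytes_per_entry = 8                         # double precision
hessian_bytes = voxels * voxels * bytes_per_entry
hessian_terabytes = hessian_bytes / 1e12

# For comparison: a matrix-free method only stores a few vectors of length n.
# Three working vectors is an assumption made for illustration.
vector_megabytes = 3 * voxels * bytes_per_entry / 1e6
```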
Alternatively, a CG solution for the Newton step, $p_k$, has several attractive properties, Algorithm 4.
• Krylov space methods work under the assumption that the Hessian is a linear operator $H : \mathbb{R}^n \to \mathbb{R}^n$ and require only the action of the linear operator on a vector, i.e., the matrix-vector product. Hence, the full matrix does not need to be stored. Further, the accuracy of the linear system solve can, in theory, be controlled by the number of matrix-vector products. This provides a mechanism to control the work performed within the inner loop of the Newton solve. Most rules for terminating the iterative solver for (21) are based on the residual.
A single iteration gives the least accurate solution. The accuracy improves with each iteration until the number of iterations reaches the number of degrees of freedom, at which point, in exact arithmetic, the solve is exact (same as Gaussian elimination). We can control the amount of work within the inner Newton solve by the tolerance we set on the residual. Since the residual is not invariant to scaling of the objective function, the size of the residual is measured relative to the right hand side of (21), $\|\nabla f(x_k)\|$, and the solver is terminated when the residual norm is less than the gradient norm scaled by a forcing sequence $\eta_k$, that is, $\|r_k\| \le \eta_k \|\nabla f(x_k)\|$.
Intuitively, this means that we solve the Newton system more and more accurately as we approach a solution, $\|\nabla f_k\| = 0$. When we are initially starting out, and the matrix may not be positive definite and the Newton direction may not be a descent direction, we are essentially taking a step that is closer to a steepest descent direction. In fact, the choice $\eta_k = \min(0.5, \|\nabla f_k\|)$ may be shown to recover quadratic convergence of the Newton method, Theorem 6.2 [Nocedal and Wright, 1999]. The choice of convergence criteria, $\eta_k = \min(0.5, \|\nabla f_k\|)$, is known as the Eisenstat-Walker convergence criteria.
Algorithm 4 Newton-CG Approximation to Trust Region subproblem, (18), with Eisenstat-Walker convergence criteria
Require: $\Delta > 0$, $p_0 = 0$, $r_0 = g$, $d_0 = -r_0$, where $r_j = B p_j + g$ is the residual, $p_j$ the solution, and $d_j$ the search direction
for j = 0, 1, 2, ... do
    if $d_j^\top B d_j \le 0$ then
        Negative curvature is detected. Return the optimal solution, as a function of $\tau$, along the current search direction: $p = p_j + \tau d_j$ such that $\|p\| = \Delta$.
        $$\min_p m(p) \qquad m(p) = f + g^\top p + \tfrac{1}{2} p^\top B p \qquad \|p\| = \Delta$$
        • Notice that if j = 0, the search direction is the negative gradient, so the initial solution is along the steepest descent direction.
        $$p_1 = \alpha_0 d_0 = \frac{r_0^\top r_0}{d_0^\top B d_0} d_0 = -\frac{g^\top g}{g^\top B g} g$$
        The initialization of $p_0$ to zero is a crucial feature of the algorithm. Hence when the first iteration encounters the boundary, this is exactly the Cauchy point.
        return p;
    end if
    Subsequent CG iterates serve to improve the model value. Notice that the computational work is in the mat-vec multiply, $B d_j$.
    $$\alpha_j = \frac{r_j^\top r_j}{d_j^\top B d_j} \qquad p_{j+1} = p_j + \alpha_j d_j$$
    CG iterates monotonically increase the length of the step, moving from the steepest descent direction toward the Newton step, Theorem 4.2 [Nocedal and Wright, 1999].
    $$0 = \|p_0\| < \dots < \|p_j\| < \|p_{j+1}\| < \dots < \|p_k\| \le \Delta$$
    if $\|p_{j+1}\| \ge \Delta$ then
        The trust region boundary has been reached. Return the optimal solution, as a function of $\tau$, along the current search direction: find $\tau \ge 0$ such that $p = p_j + \tau d_j$ satisfies $\|p\| = \Delta$.
        return p;
    end if
    Update Residual
    $$r_{j+1} = r_j + \alpha_j B d_j$$
end for
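Algorithm 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the residual test against the forcing term and the direction update $d_{j+1} = -r_{j+1} + \beta_{j+1} d_j$ with $\beta_{j+1} = r_{j+1}^\top r_{j+1} / r_j^\top r_j$ are the standard CG-Steihaug steps and are assumed here because they are not shown in the listing above. The helper names (`dot`, `matvec`, `to_boundary`, `cg_steihaug`) and the test matrices are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(B, v):
    # only matrix-vector products with B are ever needed
    return [dot(row, v) for row in B]

def to_boundary(p, d, delta):
    """Return p + tau*d with tau >= 0 chosen so that ||p + tau*d|| = delta."""
    a, b, c = dot(d, d), 2.0 * dot(p, d), dot(p, p) - delta ** 2
    tau = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return [pi + tau * di for pi, di in zip(p, d)]

def cg_steihaug(B, g, delta, max_iter=50):
    gnorm = math.sqrt(dot(g, g))
    p = [0.0] * len(g)
    if gnorm == 0.0:
        return p
    tol = min(0.5, gnorm) * gnorm            # forcing term eta_k * ||g||
    r = list(g)
    d = [-ri for ri in r]
    for _ in range(max_iter):
        Bd = matvec(B, d)
        dBd = dot(d, Bd)
        if dBd <= 0.0:                       # negative curvature detected
            return to_boundary(p, d, delta)
        alpha = dot(r, r) / dBd
        p_next = [pi + alpha * di for pi, di in zip(p, d)]
        if math.sqrt(dot(p_next, p_next)) >= delta:   # boundary reached
            return to_boundary(p, d, delta)
        r_next = [ri + alpha * bi for ri, bi in zip(r, Bd)]
        if math.sqrt(dot(r_next, r_next)) <= tol:     # inexact solve is good enough
            return p_next
        beta = dot(r_next, r_next) / dot(r, r)        # assumed standard CG update
        d = [-ri + beta * di for ri, di in zip(r_next, d)]
        p, r = p_next, r_next
    return p

def model(B, g, p):
    return dot(g, p) + 0.5 * dot(p, matvec(B, p))

B_pd = [[2.0, 0.0], [0.0, 1.0]]
g = [1.0, 1.0]
p_large = cg_steihaug(B_pd, g, delta=10.0)   # stops on the residual test
p_small = cg_steihaug(B_pd, g, delta=0.5)    # first step hits the boundary
p_neg = cg_steihaug([[-1.0, 0.0], [0.0, -1.0]], [1.0, 0.0], delta=2.0)
```

With the small radius, the returned step coincides with the Cauchy point, as noted in the algorithm.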
• Adjust the trust region for subsequent Newton iterates based on the current solve. This essentially controls the step length of the model. The trust region, or step length along the descent direction, is allowed to change according to how well the quadratic model, $m_k(p_k)$, approximates the actual function. When the model agrees well with the function and the step has reached the boundary of the trust region, the trust region size is increased.
$$\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} \qquad \begin{cases} \rho_k < 1/4 & \Rightarrow \text{decrease trust region} \\ \rho_k > 3/4 & \Rightarrow \text{increase trust region if } \|p_k\| = \Delta_k \end{cases}$$
Near a well behaved solution, the trust region can be shown to become inactive as quadratic convergence is obtained, Theorem 6.4 [Nocedal and Wright, 1999].
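The radius adjustment can be sketched as a small function. The shrink rule $\Delta_{k+1} = \tfrac{1}{4}\|p_k\|$ follows the notes; the growth factor of 2, the cap `delta_max`, and the tolerance used to decide that the step is on the boundary are assumptions made for this illustration.

```python
def update_radius(actual_reduction, predicted_reduction, step_norm, delta,
                  delta_max=10.0):
    """Return the next trust-region radius from the agreement ratio rho."""
    rho = actual_reduction / predicted_reduction
    if rho < 0.25:
        # model is a poor approximation: shrink the region
        return 0.25 * step_norm
    on_boundary = abs(step_norm - delta) <= 1e-12 * delta
    if rho > 0.75 and on_boundary:
        # model is good and the step was limited by the region: grow it
        return min(2.0 * delta, delta_max)
    # model is reasonable and we are not hitting the boundary: keep the radius
    return delta

shrunk = update_radius(0.1, 1.0, step_norm=1.0, delta=1.0)
grown = update_radius(0.9, 1.0, step_norm=1.0, delta=1.0)
kept = update_radius(0.5, 1.0, step_norm=0.5, delta=1.0)
```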
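• Each successive step of the conjugate gradient algorithm can be shown to decrease the model function $m$, provided that negative curvature is not encountered. Suppose that $p_j$ is the current CG solution; the CG update implies $p_{j+1} = p_j + \alpha_j d_j$. Since $p_j$ is a linear combination of the previous search directions, conjugacy gives $d_j^\top B p_j = 0$ and
$$m(p_{j+1}) = m(p_j + \alpha_j d_j) = m(p_j) + \alpha_j d_j^\top g + \tfrac{1}{2} \alpha_j^2 d_j^\top B d_j = m(p_j) + \alpha_j \left( d_j^\top g + \frac{1}{2} \frac{r_j^\top r_j}{d_j^\top B d_j} d_j^\top B d_j \right)$$
Using conjugacy of the search directions, the step length, and the expansion of the exact solution in the search directions,
$$d_i^\top B d_j = 0 \quad i \ne j \qquad \alpha_j = \frac{r_j^\top r_j}{d_j^\top B d_j} \qquad -g = B x^* = B \sum_i \alpha_i d_i \;\Rightarrow\; -d_j^\top g = d_j^\top B \sum_i \alpha_i d_i = \alpha_j d_j^\top B d_j = r_j^\top r_j$$
substituting
$$m(p_{j+1}) = m(p_j) - \frac{1}{2} \frac{\left( r_j^\top r_j \right)^2}{d_j^\top B d_j}$$
reveals that the next CG iterate decreases the model provided that the positive curvature condition holds, $d_j^\top B d_j > 0$.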
7 Constrained Optimization
7.1 Theory of Constrained Optimization, [Nocedal and Wright, 1999] Ch 12
Under the constrained optimization formalism, we would like to minimize a function $f(x)$ such that the solution satisfies physically meaningful constraints, $c_i$, that restrict the parameter space.
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{such that} \quad \begin{cases} c_i(x) = 0, & i \in E \\ c_i(x) \ge 0, & i \in I \end{cases} \qquad (22)$$
Here we will assume that all functions are sufficiently smooth such that all needed derivatives are well defined in the classical sense. $E$ and $I$ are two finite sets of indices representing the equality and inequality constraints respectively.
(Definition) Equality Constraints $E$ denotes the equality constraints, i.e., the finite set of indices such that equality holds.
$$c_i(x) = 0 \qquad i \in E$$
(Definition) Inequality Constraints $I$ denotes the inequality constraints, i.e., the finite set of indices such that inequality holds.
$$c_i(x) \ge 0 \qquad i \in I$$
(Definition) Feasible Set The feasible set, $\Omega$, represents the set of points $x$ that satisfy the constraints
$$\Omega = \{x : c_i(x) = 0 \;\; i \in E; \;\; c_i(x) \ge 0 \;\; i \in I\}$$
Figure 53: Constraint and Function Gradients at Various Feasible Points
Using the feasible set notation, the constrained optimization formalism (22) may be represented concisely.
$$\min_{x \in \Omega} f(x)$$
(Definition) Active/Inactive constraint At a feasible point $x$, the inequality constraint $i \in I$ is said to be active if $c_i(x) = 0$ and inactive if the strict inequality $c_i(x) > 0$ is satisfied.
(Definition) Active Set The active set $A(x)$ at any feasible $x$ is the union of the set $E$ with the indices of the active inequality constraints.
$$A(x) \equiv E \cup \{i \in I : c_i(x) = 0\}$$
As before, first order necessary conditions will characterize a solution to the constrained optimization problem.
(Definition) Lagrangian A solution to the constrained optimization problem may be characterized through the so-called Lagrangian of the problem.
$$L(x, \lambda) \equiv f(x) - \sum_{i \in E \cup I} \lambda_i c_i(x) = f(x) - (\vec{\lambda}, \vec{c}) \qquad (23)$$
We will state the first order necessary conditions defining a solution to the constrained optimization problem
without proof and look at several simple examples to build intuition.
Theorem 7.1 (First-Order Necessary Conditions). Suppose that $x^*$ is a local solution of (22) and that the linear independence constraint qualification (LICQ) holds at $x^*$. Then there exists a Lagrange multiplier vector $\lambda^*$ with components $\lambda_i^*$, $i \in E \cup I$, such that
$$\nabla_x L(x^*, \lambda^*) = 0$$
$$c_i(x^*) = 0, \quad \forall i \in E$$
$$c_i(x^*) \ge 0, \quad \forall i \in I$$
$$\lambda_i^* \ge 0, \quad \forall i \in I$$
$$\lambda_i^* c_i(x^*) = 0, \quad \forall i \in I \cup E$$
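The conditions of Theorem 7.1 are easy to check numerically at a candidate point. The sketch below assumes a single inequality constraint and the illustrative problem $\min x_1 + x_2$ subject to $c_1(x) = 2 - x_1^2 - x_2^2 \ge 0$ (the circle problem used in the examples of [Nocedal and Wright, 1999] Ch 12); the function names are hypothetical.

```python
def grad_f(x):
    return [1.0, 1.0]                      # f(x) = x1 + x2

def c1(x):
    return 2.0 - x[0] ** 2 - x[1] ** 2     # feasible inside the circle

def grad_c1(x):
    return [-2.0 * x[0], -2.0 * x[1]]

def satisfies_kkt(x, lam, tol=1e-10):
    """Check the first-order conditions for one inequality constraint."""
    grad_L = [gf - lam * gc for gf, gc in zip(grad_f(x), grad_c1(x))]
    stationary = all(abs(v) <= tol for v in grad_L)
    feasible = c1(x) >= -tol
    nonnegative = lam >= 0.0
    complementary = abs(lam * c1(x)) <= tol
    return stationary and feasible and nonnegative and complementary

at_solution = satisfies_kkt([-1.0, -1.0], 0.5)   # the minimizer
at_maximizer = satisfies_kkt([1.0, 1.0], -0.5)   # stationary, but lam < 0
at_interior = satisfies_kkt([0.0, 0.0], 0.0)     # grad f != 0
```

The point $(1, 1)$ makes the Lagrangian stationary but only with a negative multiplier, which is why the sign condition on $\lambda_i$ is part of the theorem.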
Hence the direction $d$ retains feasibility with respect to the constraint $c_1$ when it satisfies
$$\nabla c_1^\top(x) d = (\nabla c_1, d) = 0 \qquad (24)$$
Similarly, a direction of improvement must produce a decrease in $f$, so that
$$0 > f(x + d) - f(x) \approx \nabla f^\top(x) d$$
or, to first order, as before
$$\nabla f^\top(x) d = (\nabla f, d) < 0 \qquad (25)$$
It follows that a necessary condition is that there exists no direction $d$ that satisfies both (24) and (25). By inspection, the only way such a direction cannot exist is if
$$\nabla f(x) = \lambda_1 \nabla c_1(x) \quad \Rightarrow \quad \underbrace{\begin{bmatrix} 1 \\ 1 \end{bmatrix}}_{\nabla f(x)} = \underbrace{\frac{1}{2}}_{\lambda_1} \underbrace{\begin{bmatrix} -2x_1 \\ -2x_2 \end{bmatrix}}_{\nabla c_1(x)} = \frac{1}{2} \begin{bmatrix} 2 \\ 2 \end{bmatrix}$$
Example 67 (Single Inequality Constraint). Now consider the inequality constrained problem.
$$0 \le c_1(x + d) \approx c_1(x) + \nabla c_1^\top(x) d$$
$$0 \le c_1(x) + \nabla c_1^\top(x) d$$
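In determining if such a direction exists we will consider two cases: a point strictly inside the feasible set and a point on the boundary of the feasible set.
Case I: Consider the case in which $x$ lies strictly within the circle, $c_1(x) > 0$. Whenever $\nabla f \ne 0$ we can obtain a direction $d$ that decreases the objective function and retains feasibility by choosing
$$d = -\frac{c_1(x)}{\|\nabla c_1(x)\| \|\nabla f(x)\|} \nabla f(x) \quad \Rightarrow \quad c_1(x) + \left( \nabla c_1, \frac{-c_1(x)}{\|\nabla c_1(x)\| \|\nabla f(x)\|} \nabla f(x) \right) = c_1(x) \left[ 1 - \left( \frac{\nabla c_1(x)}{\|\nabla c_1(x)\|}, \frac{\nabla f(x)}{\|\nabla f(x)\|} \right) \right] \ge 0$$
where the final inequality holds because the inner product of two unit vectors is at most one.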
Case II: Consider the case in which $x$ lies on the boundary of the circle, $c_1(x) = 0$. The conditions for improvement become
$$\nabla f^\top(x) d < 0 \qquad \nabla c_1^\top(x) d \ge 0$$
The first condition is the open half space characterized by the objective function gradient. The direction of the inequality sign places the search direction on the side of the negative gradient. The second condition is the closed half space characterized by the constraint gradient. The direction of the inequality places the search direction on the side of the positive constraint gradient. Hence these two regions fail to intersect when the objective gradient and the constraint gradient point in the same direction. If instead the gradients point in opposite directions, $\nabla f = -\lambda_1 \nabla c_1$ with $\lambda_1 > 0$, every descent direction is feasible:
$$0 > (\nabla f, d) = (-\lambda_1 \nabla c_1, d) \Rightarrow -\lambda_1 (\nabla c_1, d) < 0 \Rightarrow (\nabla c_1, d) > 0 \quad \text{(constraint satisfied)}$$
The optimality conditions for both cases may be summarized as
$$\nabla_x L(x, \lambda_1) = 0 \quad \lambda_1 \ge 0$$
$$\lambda_1 c_1(x) = 0$$
Notice that we are using the Lagrangian to concisely and conveniently represent the requirements on our
solutions for multiple cases.
For Case I, c1 (x) > 0 so this requires that λ∗1 = 0 and the gradient of the Lagrangian reduces to the
gradient of the objective function. For Case II, λ1 is allowed to take a non negative value.
The examples suggest that several conditions are important in characterizing a solution to our constrained optimization problem (22). These include (1) the relation $\nabla_x L(x, \lambda) = 0$, (2) the non-negativity of the multipliers $\lambda_i \ge 0$, $i = 1, 2, \dots$, and (3) the complementarity condition $\lambda_i c_i(x) = 0$.
Example 69 (Two Inequality Constraints). Now consider the inequality constrained problem.
Repeating the arguments for the previous examples, we conclude that a direction $d$ is a feasible descent direction, to first order, if it satisfies the following conditions:
Here λ is the vector of multipliers. Extending the derivative of the Lagrangian in this case we have
∇x L(x∗ , λ∗ ) = 0, λ∗ ≥ 0
Figure 55: Gradient at a solution and non optimal point. Here the feasible region consists of the upper interior of the circle of radius $\sqrt{2}$. (a) Here the solution is $(-\sqrt{2}, 0)^\top$, a point at which both constraints are active. (b) For the point $(\sqrt{2}, 0)^\top$, we have both constraints active. However, the objective gradient $\nabla f(x)$ no longer lies in the quadrant defined by the conditions $\nabla c_i(x)^\top d \ge 0$, $i = 1, 2$. One first order feasible descent direction from this point, a vector $d$ that satisfies (26), is simply $(-1, 0)^\top$; there are many others. For this value of $x$ it is easy to verify that the condition $\nabla_x L(x, \lambda) = 0$ is satisfied when $\lambda = \left( \frac{-1}{2\sqrt{2}}, 1 \right)$. The first component $\lambda_1$ is negative but we require nonnegative multipliers.
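At the solution $x^* = (-\sqrt{2}, 0)^\top$, we have
$$\nabla f(x^*) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad \nabla c_1(x^*) = \begin{bmatrix} 2\sqrt{2} \\ 0 \end{bmatrix} \qquad \nabla c_2(x^*) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$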
In fact, we need worry only about satisfying the second and third conditions, since we can always satisfy the first condition by multiplying $d$ by a sufficiently small positive quantity. Noting that
$$\nabla f(x) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad \nabla c_2(x) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
it is easy to verify that the vector $d = (-1/2, 1/4)$ satisfies (27) and is therefore a descent direction. To show that the gradient of the Lagrangian is non-zero, $\nabla_x L \ne 0$, and the complementarity conditions fail, we first note that since $c_1(x) > 0$ we must have $\lambda_1 = 0$. Therefore in trying to satisfy $\nabla_x L = 0$ we are left to search for a value $\lambda_2$ such that $\nabla f(x) - \lambda_2 \nabla c_2(x) = 0$. Since no such $\lambda_2$ exists, this point fails to satisfy the optimality conditions.
(Definition) LICQ Given the point $x^*$ and the active set $A(x^*)$ we say that the linear independence constraint qualification (LICQ) holds if the set of active constraint gradients
$$\{\nabla c_i(x^*), \; i \in A(x^*)\}$$
is linearly independent.
It is possible for the gradient of the constraint to vanish as a result of the algebraic representation of the constraint. Restrictions are typically applied to the constraints to avoid degenerate behavior. For example, if we replaced our circle constraint by the equivalent
$$c_1(x) = \left( x_1^2 + x_2^2 - 2 \right)^2 = 0$$
we would have $\nabla c_1(x) = 0$ for all feasible points and the condition $\nabla f = \lambda \nabla c$ no longer holds at the optimal point $(-1, -1)$. To avoid this we typically require that the constraint gradients be linearly independent at the solution. By definition of linear independence the active constraint gradients cannot be zero.
Sensitivity At this point the Lagrange multipliers are more of a mathematical convenience. However, the value of the Lagrange multiplier, $\lambda_i$, can provide information as to the sensitivity of the optimal value $f(x^*)$ to the presence of the constraint, $c_i$.
• For an inactive constraint $i \notin A(x^*)$ such that $c_i(x^*) > 0$, the solution $x^*$ and function value $f(x^*)$ are indifferent to the constraint $c_i$. Hence, $\lambda_i = 0$.
• Suppose instead that the constraint $i$ is active. Then perturbing the constraint $c_i$ by a small amount $\epsilon$,
$$c_i(x) \ge -\epsilon \|\nabla c_i(x)\| \quad \text{instead of} \quad c_i(x) \ge 0$$
yields a change in the optimal value proportional to the multiplier $\lambda_i$
$$\frac{d f(x^*(\epsilon))}{d \epsilon} = -\lambda_i^* \|\nabla c_i(x^*)\|$$
Hence, if $\lambda_i^* \|\nabla c_i(x^*)\|$ is large, then the optimal value is very sensitive to the i-th constraint and the function value at the solution depends heavily on the constraint.
here the penalty parameter is positive, µ > 0, and by driving the penalty parameter to zero we penalize
the constraint more severely. Intuitively, we would like to consider a sequence of penalty parameters that
increasingly penalize the constraints
µk → 0
The general framework for this solution technique is given in Algorithm 5.
Theorem 7.2 (Convergence of Quadratic Penalty Function). If the tolerances in Algorithm 5 satisfy
$$\tau_k \to 0$$
and the penalty parameters $\mu_k \to 0$, then for all limit points $x^*$ of the sequence $x_k$ at which the constraint gradients $\nabla c_i(x^*)$ are linearly independent, we have that $x^*$ is a KKT point for the problem
$$\min_x f(x) \quad \text{subject to} \quad c_i(x) = 0 \quad i \in E$$
For such points, we have for the infinite subsequence $K$ such that $\lim_{k \in K} x_k = x^*$
$$\lim_{k \in K} \frac{-c_i(x_k)}{\mu_k} = \lambda_i^* \quad \forall i \in E \qquad (28)$$
where $\lambda^*$ is a multiplier vector that satisfies the KKT conditions of Theorem 7.1.
Algorithm 5 Quadratic Penalty
Require: $\mu_0 > 0$, tolerance $\tau_0 > 0$, and initial guess $x_0^s$
for k = 0, 1, 2, ... do
    Find an approximate minimizer $x_k$ of $Q(\cdot; \mu_k)$ starting at $x_k^s$, terminate when $\|\nabla Q(x; \mu_k)\| \le \tau_k$
    if final convergence test satisfied then
        STOP with approximate solution $x_k$
    else
        Choose new penalty parameter $\mu_{k+1} \in (0, \mu_k)$
        Choose new subproblem convergence $\tau_{k+1} \in (0, \tau_k)$
        Choose new starting point $x_{k+1}^s$
    end if
end for
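The behavior of Algorithm 5 can be seen on a one-dimensional toy problem chosen for this sketch (it is not taken from the notes): minimize $f(x) = x^2$ subject to $c(x) = x - 1 = 0$, with the standard quadratic penalty $Q(x; \mu) = f(x) + \frac{1}{2\mu} c^2(x)$ assumed. The subproblem minimizer is available in closed form, $x = 1/(1 + 2\mu)$, so no inner solver is needed, and the exact multiplier is $\lambda^* = 2$.

```python
def penalty_minimizer(mu):
    """Exact minimizer of Q(x; mu) = x**2 + (x - 1)**2 / (2*mu)."""
    return 1.0 / (1.0 + 2.0 * mu)

mu = 1.0
history = []
for k in range(8):
    x = penalty_minimizer(mu)
    c = x - 1.0
    multiplier_estimate = -c / mu          # estimate suggested by (28)
    hessian = 2.0 + 1.0 / mu               # second derivative of Q, grows as mu -> 0
    history.append((mu, x, multiplier_estimate, hessian))
    mu *= 0.1                              # mu_{k+1} in (0, mu_k)

final_mu, final_x, final_lambda, final_hessian = history[-1]
```

The iterates approach the feasible point $x^* = 1$ only as $\mu_k \to 0$, the ratio $-c(x_k)/\mu_k$ approaches $\lambda^* = 2$, and the curvature of $Q$ grows like $1/\mu_k$.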
Here we see that Algorithm 5 is in fact attracted to a KKT point, i.e., a point that satisfies the necessary conditions. Further, the quantities $-c_i(x_k)/\mu_k$ may be used as estimates of the Lagrange multipliers $\lambda_i^*$ under certain conditions. Notice that this implies that as $\mu_k \to 0$, the constraint $c_i(x_k)$ becomes more active.
Unfortunately, the Hessian of the quadratic penalty approach becomes increasingly ill conditioned as $\mu_k \to 0$. The Hessian is given by
$$\nabla^2_{xx} Q(x; \mu_k) = \nabla^2 f(x) + \sum_{i \in E} \frac{c_i(x)}{\mu_k} \nabla^2 c_i(x) + \frac{1}{\mu_k} A^\top(x) A(x) \qquad A^\top(x) \equiv [\nabla c_i(x)]_{i \in E}$$
Near a solution the matrix is approximately the sum of (1) the Lagrangian term $\nabla^2_{xx} L$ and (2) a matrix of rank $|E|$ whose nonzero eigenvalues are of order $1/\mu_k$
$$\nabla^2_{xx} Q(x; \mu_k) \approx \nabla^2_{xx} L(x, \lambda^*) + \frac{1}{\mu_k} A^\top(x) A(x)$$
Hence the overall matrix has some of its eigenvalues approaching a constant while others are of order $1/\mu_k$. Since $\mu_k \to 0$, the increasing ill conditioning of the Hessian of $Q(x; \mu_k)$ is apparent.
As an alternative to letting the penalty parameter tend to zero, $\mu_k \to 0$, we can ask if we could more accurately solve the constraints, $c_i(x) = 0$, and avoid any potential ill-conditioning problems for small values of the penalty parameter.
The Augmented Lagrangian formulation achieves this by including an explicit estimate of the Lagrange multipliers $\lambda$ based on the formula (28). By definition
$$L_a(x, \lambda, \mu) \equiv f(x) - \sum_i \lambda_i c_i(x) + \frac{1}{2\mu} \sum_i c_i^2(x)$$
the augmented Lagrangian $L_a$ differs from the standard Lagrangian (23) by the presence of the squared terms. And this approach differs from the quadratic penalty method by the presence of the summation terms involving the multipliers, $\lambda$. As before, the minimum over the feasible set $\Omega \subset \mathbb{R}^N$ occurs at the stationary point, $x^*$, of the Lagrangian
$$\min_{x \in \Omega} f(x) \qquad \nabla_x L(x^*, \lambda^*) = \nabla f(x^*) - \sum_{i \in E} \lambda_i^* \nabla c_i(x^*) = 0$$
$$\min_{x \in \Omega} Q(x; \mu_k)$$
Applying the standard Lagrangian (23) to this problem, with $\lambda^k$ being a particular Lagrange multiplier for the $\mu_k$ subproblem, the augmented Lagrangian appears and is stationary at a possibly different point $x_k \ne x^*$
$$0 = \nabla_x L_a(x_k, \lambda^k, \mu_k) = \nabla Q(x_k; \mu_k) - \sum_{i \in E} \lambda_i^k \nabla c_i(x_k) = \nabla f(x_k) + \sum_{i \in E} \frac{c_i(x_k)}{\mu_k} \nabla c_i(x_k) - \sum_{i \in E} \lambda_i^k \nabla c_i(x_k)$$
$$= \nabla f(x_k) - \sum_{i \in E} \underbrace{\left( \lambda_i^k - \frac{c_i(x_k)}{\mu_k} \right)}_{\approx \lambda_i^*} \nabla c_i(x_k)$$
Rearranging,
$$c_i(x_k) \approx -\mu_k (\lambda_i^* - \lambda_i^k) \xrightarrow{(\lambda_i^* - \lambda_i^k) \to 0} 0 \qquad \forall i \in E$$
we see that when the approximation to the Lagrange multiplier is close to the actual multiplier $\lambda^*$, the infeasibility in $x_k$ will be much smaller than $\mu_k$.
Further, the Lagrange multipliers, $\lambda^k$, particular to the augmented function or subproblem, $Q(x, \mu_k)$, provide an explicit estimate for the multipliers of the original problem, $\lambda^*$. Equation (28) suggests an update algorithm for the multipliers based on the current information
$$\lambda_i^{k+1} = \lambda_i^k - c_i(x_k)/\mu_k \qquad i \in E$$
ηk+1 ∈ (0, ηk )
µk+1 ∈ (0, µk )
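The multiplier update can be illustrated on a one-dimensional toy problem assumed for this sketch (not taken from the notes): minimize $f(x) = x^2$ subject to $c(x) = x - 1 = 0$, whose solution is $x^* = 1$ with $\lambda^* = 2$. The augmented Lagrangian subproblem has the closed-form minimizer $x = (\mu \lambda + 1)/(2\mu + 1)$, so the outer iteration can be written without an inner solver. The penalty parameter is deliberately held fixed at $\mu = 1$ to show that feasibility is reached without driving $\mu \to 0$.

```python
def auglag_minimizer(lam, mu):
    """Exact minimizer over x of x**2 - lam*(x - 1) + (x - 1)**2 / (2*mu)."""
    return (mu * lam + 1.0) / (2.0 * mu + 1.0)

mu = 1.0      # held fixed: no ill-conditioning from a shrinking penalty parameter
lam = 0.0     # initial multiplier estimate
for k in range(60):
    x = auglag_minimizer(lam, mu)
    lam = lam - (x - 1.0) / mu     # multiplier update from the constraint violation
```

For this problem the multiplier error contracts by a factor of 2/3 per outer iteration, so both $x$ and $\lambda$ converge while the subproblem curvature stays at $2 + 1/\mu = 3$.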
7.5 Applications: L1 minimization
The compressed sensing literature has generated significant interest in solving L1 problems of the form
Here $\Phi x$ is a transformation into a space where the solution is sparse. $H : \mathbb{R}^n \to \mathbb{R}$ is a general nonlinear constraint that is assumed twice differentiable and bounded below. The split Bregman formulation has received much attention in these L1 type formulations. For compressed sensing with linear constraints, the split Bregman formulation [Yin et al., 2008] may be shown to be equivalent to the Augmented Lagrangian Framework in Section 7.4.
Example 70 (Typical CS Example ).
Initial approaches generally proposed to view this problem as an unconstrained optimization problem with a penalty term accounting for the sparsity constraint. This is generally of the form:
Initial approaches to this problem attempt to regularize (smooth) the L1 penalty term and approximate it with a differentiable function. Unfortunately, as the smooth approximation approaches the non-differentiable problem, the problem becomes ill-conditioned because the derivative is not defined in the limit. The resulting ill-conditioned Hessian matrix, as we have seen, can significantly affect the convergence properties of algorithms used to solve the optimization problem.
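Example 71 (Condition Number of Penalty Method). The $\|\cdot\|_1$ norm may be approximately represented as
$$\sum_i \sqrt{x_i^2 + \epsilon} \xrightarrow{\epsilon \to 0} \|x\|_1$$
Letting $b(x) \equiv \sum_i \sqrt{x_i^2 + \epsilon}$ we see that the Hessian is a diagonal matrix with diagonal terms given by
$$\frac{\partial}{\partial x_j} \left( \sqrt{x_1^2 + \epsilon} + \sqrt{x_2^2 + \epsilon} + \dots \right) = \frac{1}{2} \left( x_j^2 + \epsilon \right)^{-1/2} (2 x_j)$$
$$\frac{\partial^2 b}{\partial x_j^2} = \frac{\partial}{\partial x_j} \left[ x_j \left( x_j^2 + \epsilon \right)^{-1/2} \right] = \left( x_j^2 + \epsilon \right)^{-1/2} - \frac{x_j}{2} \left( x_j^2 + \epsilon \right)^{-3/2} (2 x_j) = \frac{\epsilon}{\left( x_j^2 + \epsilon \right)^{3/2}}$$
We expect our sparse solution to contain several sufficiently large non-zero $x_i$ elements (i.e., $|x_i| \gg \sqrt{\epsilon}$). We can approximate the corresponding diagonal elements as a positive number that becomes very small as $|x_i|$ increases.
$$\nabla^2_{ii} b(x) \approx \frac{\epsilon}{|x_i|^3}$$
On the other hand, for the sparse solution we are looking for we expect a significant portion of the $x_i$ to equal zero. The diagonal entries are relatively large in this case.
$$x_i = 0 \Rightarrow \nabla^2_{ii} b(x) = \sqrt{\epsilon}^{\,-1}$$
The condition number of this diagonal matrix may thus be approximated as the maximum non-zero entry cubed times the large number $\epsilon^{-3/2}$.
$$\|\nabla^2 b\|_1 = \max_i \left( \frac{\epsilon}{|x_i|^3}, \sqrt{\epsilon}^{\,-1} \right) \qquad \|(\nabla^2 b)^{-1}\|_1 = \max_i \left( \frac{|x_i|^3}{\epsilon}, \sqrt{\epsilon} \right)$$
$$\kappa(\nabla^2 b) = \frac{\max_i |x_i|^3}{\epsilon^{3/2}}$$
Thus, ill-conditioning is expected for our sparse solutions and will lead to slow convergence of unconstrained gradient and Newton algorithms as we have seen.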
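The estimate of the condition number can be checked numerically. The following sketch evaluates the exact diagonal Hessian entries $\epsilon / (x_i^2 + \epsilon)^{3/2}$ for an illustrative sparse vector and compares the resulting condition number with $\max_i |x_i|^3 / \epsilon^{3/2}$; the vector `x` and the value of `eps` are assumptions chosen for the example.

```python
def smoothed_l1_hessian_diag(x, eps):
    """Diagonal of the Hessian of b(x) = sum_i sqrt(x_i**2 + eps)."""
    return [eps / (xi * xi + eps) ** 1.5 for xi in x]

x = [2.0, 0.0, 0.0, -1.0]      # sparse vector: two zero entries
eps = 1e-6

diag = smoothed_l1_hessian_diag(x, eps)
kappa = max(diag) / min(diag)                       # exact condition number
kappa_estimate = max(abs(xi) for xi in x) ** 3 / eps ** 1.5
```

Halving $\epsilon$ to sharpen the approximation of $\|x\|_1$ increases the condition number by a factor of $2^{3/2}$, which is the trade-off described in the example.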
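Alternatively, we may approximate the CS problem
within the scope of Algorithm 6. Given $\lambda^k$ and $\mu_k$, the subproblem is still formidable and contains L1 and L2 terms
$$\min_{x,w} L_a(\cdot, \lambda^k; \mu_k) = \min_{x,w} \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 - (\lambda^k, \Phi x - w) + \frac{1}{2\mu} \|\Phi x - w\|_2^2$$
The Lagrangian may be written in an equivalent form, using the linearity of the inner product
$$\frac{1}{2c} (a - cb, a - cb) = \frac{1}{2c} \left[ (a - cb, a) - (a - cb, cb) \right] = \frac{1}{2c} \|a\|^2 - \frac{1}{2c} c (b, a) - \frac{1}{2c} c (a, b) + \frac{c^2}{2c} \|b\|^2 = \frac{1}{2c} \|a\|^2 - (a, b) + \frac{c}{2} \|b\|^2$$
$$\Rightarrow \frac{1}{2c} \|a - cb\|^2 - \frac{c}{2} \|b\|^2 = \frac{1}{2c} \|a\|^2 - (a, b)$$
and when minimizing with respect to $(x, w)$, the term $\frac{\mu}{2} \|\lambda^k\|_2^2$ may be considered a constant
$$\min_{x,w} L_a(\cdot, \lambda^k; \mu_k) = \min_{x,w} \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x - w - \mu \lambda^k\|_2^2 - \frac{\mu}{2} \|\lambda^k\|_2^2$$
$$\Leftrightarrow \min_{x,w} \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x - w - \mu \lambda^k\|_2^2$$
The approach becomes tractable if we break the subproblem up into an L1 subproblem and an L2 subproblem using an alternating direction or coordinate descent technique.
$$\text{L2 subproblem } (w^k, \lambda^k \text{ fixed}): \qquad \min_x \frac{1}{2} \|Ax - b\|_2^2 + \frac{1}{2\mu} \|\Phi x - w^k - \mu \lambda^k\|_2^2$$
$$\text{L1 subproblem } (x^k, \lambda^k \text{ fixed}): \qquad \min_w \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x^k - w - \mu \lambda^k\|_2^2$$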
A summary of the Augmented Lagrangian Algorithm in this context is presented in Algorithm 7. Notice
that we have converted the constrained problem into two unconstrained problems.
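Algorithm 7 Augmented Lagrangian Approach For CS
$$\min_{x,w} \frac{1}{2} \|H(x)\|_2^2 + \tau \|w\|_1 \quad \text{such that} \quad \Phi x = w \qquad H(x) = Ax - b$$
Solve L1 subproblem.
$$w^{k+1} = \arg\min_w \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x^{k+1} - w - \mu \lambda^k\|_2^2$$
$$w_i^{k+1} = \text{soft}_{\tau\mu} \left( (\Phi x^{k+1})_i - \mu \lambda_i^k \right) \qquad \text{soft}_a(b) \equiv \frac{b}{|b|} \max(0, |b| - a)$$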
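The alternation between the L2 subproblem, the L1 subproblem, and the multiplier update can be sketched on a toy problem. The sketch below assumes $A = I$ and $\Phi = I$, so the problem reduces to $\min_{x,w} \frac{1}{2}\|x - b\|_2^2 + \tau \|w\|_1$ such that $x = w$, whose solution is known to be the soft threshold of $b$. Under that assumption the L2 subproblem has the closed-form solution $x = (\mu b + w + \mu \lambda)/(\mu + 1)$. The function name `split_denoise`, the fixed iteration count, and the test data are illustrative choices.

```python
def soft(b, a):
    """Soft threshold: (b/|b|) * max(0, |b| - a), with soft(0) = 0."""
    return 0.0 if b == 0.0 else b / abs(b) * max(0.0, abs(b) - a)

def split_denoise(b, tau, mu=1.0, iterations=200):
    n = len(b)
    w = [0.0] * n
    lam = [0.0] * n
    for _ in range(iterations):
        # L2 subproblem (w, lam fixed): min_x 0.5*(x-b)^2 + (x - w - mu*lam)^2/(2*mu)
        x = [(mu * b[i] + w[i] + mu * lam[i]) / (mu + 1.0) for i in range(n)]
        # L1 subproblem (x, lam fixed): component-wise soft threshold
        w = [soft(x[i] - mu * lam[i], tau * mu) for i in range(n)]
        # multiplier update with constraint c = x - w
        lam = [lam[i] - (x[i] - w[i]) / mu for i in range(n)]
    return x, w

x_final, w_final = split_denoise([3.0, 0.2, -1.5], tau=0.5)
expected = [soft(bi, 0.5) for bi in [3.0, 0.2, -1.5]]
```

Both unconstrained subproblems are trivial here; for a general $A$ and $\Phi$ the L2 step would instead require a linear or nonlinear least squares solve.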
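Derivatives The L2 subproblem may be solved with a linesearch or trust region approach using finite difference or analytical derivatives. For the general case
$$\min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}) = \min_{\vec{x} \in \mathbb{R}^n} \frac{1}{2} \vec{h}^\top(\vec{x}) \vec{h}(\vec{x}) \qquad \vec{h}(\vec{x}) = \begin{bmatrix} A \vec{x} - \vec{b} \\ \frac{1}{\sqrt{\mu}} \left( \Phi \vec{x} - \vec{w} - \mu \vec{\lambda} \right) \end{bmatrix}$$
The gradient of the objective function $f(\vec{x})$ is given as the matrix vector product of the Jacobian transpose times the residual.
$$(\nabla f)_i = \frac{\partial}{\partial x_i} \left( \frac{1}{2} \sum_l^m h_l h_l \right) = \sum_l^m h_l \frac{\partial h_l}{\partial x_i} \qquad \nabla f = J^\top \vec{h} \qquad J_{ij} = \frac{\partial h_i}{\partial x_j} \qquad J = \begin{bmatrix} A \\ \frac{1}{\sqrt{\mu}} \Phi \end{bmatrix}$$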
Soft Thresholding Operator The solution of the L1 subproblem is (perhaps surprisingly) given by the soft thresholding operator
$$w_i = \text{soft}_{\tau\mu} \left( (\Phi x^k)_i - \mu \lambda_i \right) \qquad \text{soft}_a(b) \equiv \frac{b}{|b|} \max(0, |b| - a)$$
To see this, we need to generalize our definition of the derivative for $\|\cdot\|_1$. Consider the subdifferential of the L1 problem
$$0 \in \tau \frac{w_i}{|w_i|} - \frac{1}{\mu} \left( (\Phi x^k)_i - w_i - \mu \lambda_i^k \right)$$
(Definition) Subdifferential We say a vector $g \in \mathbb{R}^n$ is a subgradient of $f : \mathbb{R}^n \to \mathbb{R}$ at $x \in \mathbb{R}^n$ if
$$f(z) \ge f(x) + g^\top (z - x) \quad \forall z$$
Figure 56: Subdifferential. (a) If $f$ is differentiable, then the gradient is the subgradient. However, the subgradient may exist when the gradient does not exist. In fact there may be several subgradients at this point. (b) Absolute value. Consider $f(z) = |z|$. For $x < 0$ the subgradient is unique: $\partial f(x) = \{-1\}$. Similarly, for $x > 0$ we have $\partial f(x) = \{1\}$. At $x = 0$ the subdifferential is defined by the inequality $|z| \ge gz$ for all $z$, which is satisfied if and only if $g \in [-1, 1]$. Therefore we have $\partial f(0) = [-1, 1]$.
The set of subgradients of $f$ at the point $x$ is called the subdifferential at $x$ and is denoted $\partial f(x)$.
$$\partial f(x) \equiv \left\{ g : f(z) \ge f(x) + g^\top (z - x) \;\; \forall z \right\}$$
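To illustrate the solution to this equation consider
$$0 \in a \, \text{sign}(x) + x - b \Leftrightarrow b \in a \, \text{sign}(x) + x \qquad \text{sign}(x) \equiv \begin{cases} -1 & x < 0 \\ [-1, 1] & x = 0 \\ 1 & x > 0 \end{cases}$$
Case I
$$b > a \Rightarrow b = x + a \Rightarrow x = b - a$$
Case II
$$|b| \le a \Rightarrow x = 0$$
Case III
$$b < -a \Rightarrow b = x - a \Rightarrow x = b + a$$
These three cases can be conveniently expressed by the so-called soft threshold operator. One may directly verify that:
$$x = \text{soft}_a(b) = \frac{b}{|b|} \max(0, |b| - a) = \begin{cases} b + a & b < -a \\ 0 & |b| \le a \\ b - a & b > a \end{cases}$$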
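The equivalence of the three cases with the closed-form operator, and the fact that the result minimizes $a|x| + \frac{1}{2}(x - b)^2$, can be verified numerically. This is a small sketch; the function names, the threshold `a = 0.5`, and the sample values of `b` are illustrative.

```python
def soft(b, a):
    """Closed form (b/|b|) * max(0, |b| - a), with soft(0) = 0."""
    return 0.0 if b == 0.0 else b / abs(b) * max(0.0, abs(b) - a)

def soft_by_cases(b, a):
    """The three cases derived from the subdifferential condition."""
    if b > a:
        return b - a
    if b < -a:
        return b + a
    return 0.0

def objective(x, b, a):
    # scalar problem whose optimality condition is 0 in a*sign(x) + x - b
    return a * abs(x) + 0.5 * (x - b) ** 2

a = 0.5
samples = [-2.0, -0.3, 0.0, 0.3, 2.0]
closed_form = [soft(b, a) for b in samples]
by_cases = [soft_by_cases(b, a) for b in samples]

# brute-force check on a grid of candidate minimizers in [-3, 3]
grid = [-3.0 + 0.001 * i for i in range(6001)]
is_minimizer = all(
    objective(soft(b, a), b, a) <= min(objective(x, b, a) for x in grid) + 1e-12
    for b in samples
)
```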
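Total Variation Denoising Alternative formulations may have computational savings at different steps of the splitting. For example,
$$\min_{x,w} \frac{1}{2} \|F(x)\|_2^2 + \tau \left( \|\Phi_1 w\|_1 + \|\Phi_2 w\|_1 + \|\Phi_3 w\|_1 \right) \quad \text{such that} \quad x = w$$
The L2 sub-subproblem is equivalent to a least squares problem with the identity operator. The normal equations are used to identify the equivalent form.
$$\text{L2 subproblem } (v_i^k, \bar{\lambda}_i^k \text{ fixed}): \qquad \min_w \frac{1}{2\mu} \|w - \bar{b}^k\|_2^2 + \frac{1}{2\mu} \left( \|\Phi_1 w - v_1^k - \mu \bar{\lambda}_1^k\|_2^2 + \|\Phi_2 w - v_2^k - \mu \bar{\lambda}_2^k\|_2^2 + \|\Phi_3 w - v_3^k - \mu \bar{\lambda}_3^k\|_2^2 \right)$$
$$\text{L1 subproblem } (w^k, \bar{\lambda}_i^k \text{ fixed}): \qquad \min_v \tau \left( \|v_1\|_1 + \|v_2\|_1 + \|v_3\|_1 \right) + \frac{1}{2\mu} \left( \|\Phi_1 w^k - v_1 - \mu \bar{\lambda}_1^k\|_2^2 + \|\Phi_2 w^k - v_2 - \mu \bar{\lambda}_2^k\|_2^2 + \|\Phi_3 w^k - v_3 - \mu \bar{\lambda}_3^k\|_2^2 \right)$$
The L1 sub-subproblem(s) now have the solution given by the component-wise soft thresholding operator, since the objective is separable in the $v_i$ and each term may be minimized independently:
$$a^* < a \;\; \forall a \qquad b^* < b \;\; \forall b \qquad c^* < c \;\; \forall c \quad \Rightarrow \quad a^* + b^* + c^* < a + b + c \;\; \forall a, b, c$$
$$\text{L1 subproblem } (w^k, \bar{\lambda}_i^k \text{ fixed}): \qquad \min_{v_i} \tau \|v_i\|_1 + \frac{1}{2\mu} \|\Phi_i w^k - v_i - \mu \bar{\lambda}_i^k\|_2^2$$
Notice that this allows multiple soft thresholding solves in between each L2 solve. In fact, in the limit of one threshold per L2 solve we arrive at the previous Algorithm 7.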
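Algorithm 8 Augmented Lagrangian Approach With Denoising Subproblem
$$\min_{x,w} \frac{1}{2} \|H(x)\|_2^2 + \tau \|\Phi w\|_1 \quad \text{such that} \quad x = w \qquad \Phi = \begin{bmatrix} \Phi_1 \\ \Phi_2 \\ \Phi_3 \end{bmatrix} \qquad \Phi^\top = \begin{bmatrix} \Phi_1^\top & \Phi_2^\top & \Phi_3^\top \end{bmatrix}$$
Require: $\mu > 0$, tolerance $\epsilon > 0$, and initial guess $w^0 = 0$, and $\lambda^0 = 0$
while not converged, i.e., $\|x^k - w^k\| > \epsilon$ and $\|H(x^k)\|_2^2 > \epsilon$ do
    Solve L2 subproblem.
    $$x^{k+1} = \arg\min_x \frac{1}{2} \|H(x)\|_2^2 + \frac{1}{2\mu} \|x - w^k - \mu \lambda^k\|_2^2$$
    while not converged, i.e., $\|\Phi w^k - v^k\| > \epsilon$ and $\|w^k - \bar{b}^k\|_2^2 > \epsilon$, given initial guess $v_i^0 = 0$, and $\bar{\lambda}_i^0 = 0$ do
        Solve L1-L2 sub-subproblem.
        $$w^{k+1} = \arg\min_w \left( \frac{1}{2\mu} \|w - \bar{b}^k\|_2^2 + \frac{1}{2\mu} \sum_i \|\Phi_i w - v_i^k - \mu \bar{\lambda}_i^k\|_2^2 \right) \underbrace{\Leftrightarrow}_{\text{Normal Eqn}} \left( \sum_i \Phi_i^\top \Phi_i + I \right) w^{k+1} = \sum_i \Phi_i^\top \left( v_i^k + \mu \bar{\lambda}_i^k \right) + \bar{b}^k$$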
7.6 Applications: Adjoint Method for Nonlinear Least Squares
Suppose we want to reconstruct an image from measurement data $u^0$
$$\min_u f(u) = \min_u \frac{1}{2} \int_\Omega (u - u^0)^2 \, dx$$
This may also be interpreted as an image reconstruction for a concentration subject to a conservation based constraint. This infinite dimensional minimization problem is implicitly a function of the diffusion parameters, $\vec{\eta}$.
$$\min_{u(\vec{\eta})} \frac{1}{2} \int_\Omega (u(k) - u^0)^2 \, dx \qquad k(\vec{\eta}) \Delta u = f$$
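The first thing to realize is that this is a nonlinear least squares problem in disguise. Indeed, consider a finite dimensional basis $\{\phi_i, \; i = 1, \dots, N\}$ in which we expand our solution and measurements.
$$u(x) = \sum_i u_i \phi_i(x) \qquad u^0(x) = \sum_i u_i^0 \phi_i(x)$$
$$r \equiv \begin{bmatrix} u_1 - u_1^0 \\ u_2 - u_2^0 \\ \vdots \\ u_n - u_n^0 \end{bmatrix} \qquad M \equiv \begin{bmatrix} \ddots & & \\ & \int_\Omega \phi_i(x) \phi_j(x) \, dx & \\ & & \ddots \end{bmatrix}$$
Using a finite difference or finite element formulation the smoothness constraint may be written as a linear system where the matrix depends on the diffusion coefficient, $k$
$$\Delta u(x_i) \approx \frac{u(x_i + h) - 2u(x_i) + u(x_i - h)}{h^2} \quad \Rightarrow \quad Au = b \qquad u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \qquad b = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix}$$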
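The derivative of the objective function with respect to $k$ does not have an analytical form this time.
$$\frac{\partial}{\partial \eta_i} \frac{1}{2} \int_\Omega (u(k) - u^0)^2 \, dx = \int_\Omega (u(k) - u^0) \frac{\partial u(\vec{\eta})}{\partial \eta_i} \, dx = \left( r, \begin{bmatrix} \frac{\partial}{\partial \eta_i} u_1 \\ \frac{\partial}{\partial \eta_i} u_2 \\ \vdots \\ \frac{\partial}{\partial \eta_i} u_n \end{bmatrix} \right)_M$$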
Solving for the adjoint variable, $p$, with the right hand side given by the residual:
$$A^\top p = r$$
Table 2: Comparing the computational expense of available methods for computing the gradient for M parameters $\eta_1, \eta_2, \dots, \eta_M$
finite difference (approx)      | sensitivities (exact)                                                                        | adjoint (exact)
M+1 linear $Au = b$ solves      | 1 linear $Au = b$ solve                                                                      | 1 linear $Au = b$ solve
-                               | M linear $A \frac{\partial u}{\partial \eta_i} = -\frac{\partial A}{\partial \eta_i} u$ solves | -
-                               | -                                                                                            | 1 linear $A^\top p = r$ solve
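The adjoint gradient can be checked against finite differences on a small problem. The sketch below makes several assumptions that are not in the notes: a single parameter $\eta$, a 2 × 2 system $A(\eta) = \eta K$ with a fixed matrix `K`, the identity in place of the mass matrix $M$, and illustrative data `b` and `u0`. Differentiating $Au = b$ gives $A \frac{\partial u}{\partial \eta} = -\frac{\partial A}{\partial \eta} u$, so with $A^\top p = r$ the gradient is $\frac{\partial f}{\partial \eta} = (r, \frac{\partial u}{\partial \eta}) = -(p, \frac{\partial A}{\partial \eta} u)$.

```python
def solve2(A, b):
    """Solve a 2 x 2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

K = [[2.0, -1.0], [-1.0, 2.0]]   # dA/d(eta), since A(eta) = eta * K
b = [1.0, 0.0]                   # right hand side of the constraint
u0 = [0.5, 0.2]                  # measurement data

def forward(eta):
    """Return the objective 0.5*||u - u0||^2, the state u, and A(eta)."""
    A = [[eta * K[i][j] for j in range(2)] for i in range(2)]
    u = solve2(A, b)
    f = 0.5 * sum((u[i] - u0[i]) ** 2 for i in range(2))
    return f, u, A

def adjoint_gradient(eta):
    _, u, A = forward(eta)                             # one forward solve
    r = [u[i] - u0[i] for i in range(2)]
    A_t = [[A[j][i] for j in range(2)] for i in range(2)]
    p = solve2(A_t, r)                                 # one adjoint solve A^T p = r
    dA_u = [sum(K[i][j] * u[j] for j in range(2)) for i in range(2)]
    return -sum(p[i] * dA_u[i] for i in range(2))

eta, h = 1.5, 1e-6
finite_difference = (forward(eta + h)[0] - forward(eta - h)[0]) / (2.0 * h)
adjoint = adjoint_gradient(eta)
```

With M parameters the adjoint approach would still need only the one forward and one adjoint solve, followed by M inexpensive inner products, which is the saving summarized in Table 2.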
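Second derivatives of the objective function and the constraint are needed to form the Hessian matrix of second derivatives.
$$\frac{\partial^2}{\partial \eta_i \partial \eta_j} \frac{1}{2} \int_\Omega (u(k) - u^0)^2 \, dx = \int_\Omega \left[ \frac{\partial u(\vec{\eta})}{\partial \eta_i} \frac{\partial u(\vec{\eta})}{\partial \eta_j} + (u(k) - u^0) \frac{\partial^2 u(\vec{\eta})}{\partial \eta_i \partial \eta_j} \right] dx$$
$$= \left( \begin{bmatrix} \frac{\partial}{\partial \eta_i} u_1 \\ \frac{\partial}{\partial \eta_i} u_2 \\ \vdots \\ \frac{\partial}{\partial \eta_i} u_n \end{bmatrix}, \begin{bmatrix} \frac{\partial}{\partial \eta_j} u_1 \\ \frac{\partial}{\partial \eta_j} u_2 \\ \vdots \\ \frac{\partial}{\partial \eta_j} u_n \end{bmatrix} \right)_M + \left( r, \begin{bmatrix} \frac{\partial^2}{\partial \eta_i \partial \eta_j} u_1 \\ \frac{\partial^2}{\partial \eta_i \partial \eta_j} u_2 \\ \vdots \\ \frac{\partial^2}{\partial \eta_i \partial \eta_j} u_n \end{bmatrix} \right)_M$$
Differentiating the constraint twice,
$$\frac{\partial^2}{\partial \eta_i \partial \eta_j} (Au = b): \qquad \frac{\partial^2 A}{\partial \eta_i \partial \eta_j} u + \frac{\partial A}{\partial \eta_i} \begin{bmatrix} \frac{\partial}{\partial \eta_j} u_1 \\ \frac{\partial}{\partial \eta_j} u_2 \\ \vdots \\ \frac{\partial}{\partial \eta_j} u_n \end{bmatrix} + \frac{\partial A}{\partial \eta_j} \begin{bmatrix} \frac{\partial}{\partial \eta_i} u_1 \\ \frac{\partial}{\partial \eta_i} u_2 \\ \vdots \\ \frac{\partial}{\partial \eta_i} u_n \end{bmatrix} + A \begin{bmatrix} \frac{\partial^2}{\partial \eta_i \partial \eta_j} u_1 \\ \frac{\partial^2}{\partial \eta_i \partial \eta_j} u_2 \\ \vdots \\ \frac{\partial^2}{\partial \eta_i \partial \eta_j} u_n \end{bmatrix} = 0$$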
It is worth mentioning that this matrix may be infeasible to compute in a realistic time; however, the Hessian-vector product, or the action of a linear operator, may be computed without explicitly storing the matrix. The Hessian-vector product may be computed for an arbitrary linear combination of sensitivities $\sum_j q_j \frac{\partial u}{\partial \eta_j}$ using a second adjoint variable, $\tilde{p}$. To avoid needing to compute each sensitivity independently, we consider a matrix-vector product, using linearity of the inner product
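Writing $\frac{\partial \vec{u}}{\partial \eta_i} = \left[ \frac{\partial}{\partial \eta_i} u_1, \dots, \frac{\partial}{\partial \eta_i} u_n \right]^\top$ for the vector of sensitivities, substituting the twice differentiated constraint and the adjoint relation $A^\top p = r$ into the term of the Hessian that contains the second derivatives of $u$ gives
$$\sum_j^M q_j \left( r, \frac{\partial^2 \vec{u}}{\partial \eta_i \partial \eta_j} \right)_M = -\left( \sum_j^M q_j \frac{\partial^2 A}{\partial \eta_i \partial \eta_j} u + \frac{\partial A}{\partial \eta_i} \sum_j^M q_j \frac{\partial \vec{u}}{\partial \eta_j} + \sum_j^M q_j \frac{\partial A}{\partial \eta_j} \frac{\partial \vec{u}}{\partial \eta_i}, \; p \right)_M$$
A second adjoint problem is needed to avoid needing to compute each sensitivity independently.
$$A^\top \tilde{p} = \sum_j^M q_j \frac{\partial A^\top}{\partial \eta_j} p$$
Moving the last term onto the adjoint variable,
$$\sum_j^M q_j \left( r, \frac{\partial^2 \vec{u}}{\partial \eta_i \partial \eta_j} \right)_M = -\left( \sum_j^M q_j \frac{\partial^2 A}{\partial \eta_i \partial \eta_j} u + \frac{\partial A}{\partial \eta_i} \sum_j^M q_j \frac{\partial \vec{u}}{\partial \eta_j}, \; p \right)_M - \left( \frac{\partial \vec{u}}{\partial \eta_i}, \; \sum_j^M q_j \frac{\partial A^\top}{\partial \eta_j} p \right)_M$$
$$= -\left( \sum_j^M q_j \frac{\partial^2 A}{\partial \eta_i \partial \eta_j} u + \frac{\partial A}{\partial \eta_i} \sum_j^M q_j \frac{\partial \vec{u}}{\partial \eta_j}, \; p \right)_M - \left( \frac{\partial \vec{u}}{\partial \eta_i}, \; A^\top \tilde{p} \right)_M$$
$$= -\left( \sum_j^M q_j \frac{\partial^2 A}{\partial \eta_i \partial \eta_j} u + \frac{\partial A}{\partial \eta_i} \sum_j^M q_j \frac{\partial \vec{u}}{\partial \eta_j}, \; p \right)_M + \left( \frac{\partial A}{\partial \eta_i} u, \; \tilde{p} \right)_M$$
where the last step uses the differentiated constraint $A \frac{\partial \vec{u}}{\partial \eta_i} = -\frac{\partial A}{\partial \eta_i} u$.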
Similar ideas of only computing matrix-vector products without storing the matrix (typically the matrices would use a prohibitive amount of memory) appear frequently in image reconstruction. Matrix-vector products are ideal for a class of iterative solution techniques for linear systems of equations and are the motivation for the Newton-Krylov or Newton-CG optimization techniques discussed in Section 6.9.
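Table 3: Comparing the computational expense of available methods for computing the Hessian for M parameters $\eta_1, \eta_2, \dots, \eta_M$
finite difference (approx)        | sensitivities (exact)                                                                                                              | adjoint (exact)
$2M^2+1$ linear $Au = b$ solves   | 1 linear $Au = b$ solve                                                                                                            | 1 linear $Au = b$ solve
-                                 | M linear $A \frac{\partial u}{\partial \eta_i} = -\frac{\partial A}{\partial \eta_i} u$ solves                                     | 1 linear $A \sum_i^M q_i \frac{\partial u}{\partial \eta_i} = -\sum_i^M q_i \frac{\partial A}{\partial \eta_i} u$ solve
-                                 | $M^2$ linear $A \frac{\partial^2 u}{\partial \eta_i \partial \eta_j} = -\frac{\partial^2 A}{\partial \eta_i \partial \eta_j} u$ solves | 2 linear $A^\top p = r$ adjoint solves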
1. The notion of ‘distance’ does not necessarily behave in an intuitive way in higher dimensions. The ‘best’ distance measure is application dependent and may not be the usual Euclidean distance that we are familiar with. Similar to Example 18, analyze the behavior of the common image distances for noisy images of increasing dimension ∈ [2, 1024]. Compare $E(\|x\|_{.5}^{max} - \|x\|_{.5}^{min})$ and Mutual Information (MI) $E(MI^{max} - MI^{min})$ to $E(\|x\|_1^{max} - \|x\|_1^{min})$, $E(\|x\|_2^{max} - \|x\|_2^{min})$, and $E(\|x\|_3^{max} - \|x\|_3^{min})$. Discuss which image ‘contrast(s)’ would be most appropriate for detecting noisy images in the high dimensional spaces.
2. A phantom of known geometry was imaged on a new scanner. Download the data of the exact, $I$, and measured, $\hat{I}$, phantom data, KnownPhantom.mat MeasuredPhantom.mat,
from http://172.30.205.52/fuentes/AppliedMath/
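for k = 1:n-1
    % scale the subdiagonal of column k by the pivot to form the multipliers
    A(k+1:n, k) = A(k+1:n, k) / A(k, k);
    for i = k+1:n
        for j = k+1:n
            % eliminate: subtract the multiplier times the pivot row entry
            A(i, j) = A(i, j) - A(i, k) * A(k, j);
        end
    end
end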
5. Download the image data, ICBM_grey_white_csf.nii.gz ICBM_Template.nii.gz,
from http://172.30.205.52/fuentes/AppliedMath/
Compute the intensity threshold value that maximizes the information gain. Hint: perform an exhaus-
tive search. Your search should resemble Figure 14 (b).
6. Download the image data, brain_T1C.mha ICBM_Template.nii.gz,
from http://172.30.205.52/fuentes/AppliedMath/
Calculate MI, MSQ, NCC image distance between these images using matlab.
7. Does the NCC (2) satisfy the triangle inequality? Prove or give a counter example.
8. What are the properties of a metric? What are the properties of a norm? What are the properties of
an inner product?
• Verify that the metric induced by the norm $\|\cdot\|$, defined as
$$d(x, y) = \|x - y\|$$
satisfies the properties of a metric.
• Verify that the norm (9) and metric (10) induced by an inner product indeed satisfy the properties of the norm and metric.
$$\|x\| = \sqrt{(x, x)}$$
$$d(x, y) = \|x - y\| = \sqrt{(x - y, x - y)}$$
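10. Define what it means for two norms to be equivalent. Show that the norms $\|\cdot\|_1$ and $\|\cdot\|_2$ satisfy
$$\frac{1}{\sqrt{n}} \|x\|_1 \le \|x\|_2 \le \|x\|_1 \qquad \forall x \in \mathbb{R}^n$$
Hint: Use the Hölder inequality
$$\sum_{j=1}^{n} |\xi_j \eta_j| \le \left( \sum_{k=1}^{n} |\xi_k|^p \right)^{1/p} \left( \sum_{m=1}^{n} |\eta_m|^q \right)^{1/q}, \qquad p > 1, \quad \frac{1}{p} + \frac{1}{q} = 1$$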
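Before attempting the proof it can help to test the two-sided bound numerically. The sketch below, in Python with the standard library only, checks the inequality on random vectors and on the two vectors where each bound is attained. The helper names and the tolerance are assumptions for illustration, and a numerical check is of course not a proof.

```python
import math
import random

def norm1(x):
    return sum(abs(v) for v in x)

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def satisfies_bounds(x, tol=1e-12):
    """True if ||x||_1 / sqrt(n) <= ||x||_2 <= ||x||_1 up to rounding error."""
    n = len(x)
    slack = tol * max(1.0, norm1(x))
    return (norm1(x) / math.sqrt(n) <= norm2(x) + slack
            and norm2(x) <= norm1(x) + slack)

rng = random.Random(0)
all_ok = all(
    satisfies_bounds([rng.uniform(-10, 10) for _ in range(n)])
    for n in (1, 2, 5, 50)
    for _ in range(200)
)

ones = [1.0, 1.0, 1.0, 1.0]   # lower bound attained: ||x||_1 / sqrt(n) = ||x||_2
e1 = [1.0, 0.0, 0.0, 0.0]     # upper bound attained: ||x||_2 = ||x||_1
```

The two special vectors indicate that neither constant in the inequality can be improved.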
11. Show that the following sets of vectors are linearly independent.
• $(1, 0, 0), (0, 1, 0), (1, 0, 1) \in \mathbb{R}^3$
• $x, \sin(x), e^x \in C[0, 1]$
12. Define a linear operator. Define a bounded linear operator. Show that:
• The operator $T : C[a, b] \to C[a, b]$ is linear as defined by (Example 36)
$$T x(t) = \int_a^t x(\tau)\, d\tau \qquad \|x\| = \max_{t \in [a,b]} |x(t)|$$
$$T_1 : \mathbb{R}^3 \to \mathbb{R}^3 \qquad T_1 x = x \times a \quad \forall x \in \mathbb{R}^3$$
Here $a \in \mathbb{R}^3$ is fixed, Example 37. Is this operator bounded under the $\|\cdot\|_2$ norm?
13. Define a functional. Show that the functional defined as the dot product with a fixed vector in $\mathbb{R}^3$,
$$f(x) = x \cdot a = x_1 a_1 + x_2 a_2 + x_3 a_3 \qquad a \in \mathbb{R}^3 \text{ fixed},$$
is linear and bounded.
14. Given the matrices,
$$A_1 = \begin{bmatrix} 1 & 2 \\ 20 & 25 \end{bmatrix} \qquad A_2 = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \end{bmatrix}$$
Define the null space. Find all vectors that are in the null space of these matrices.
$$N(A_1) = \{z : ?\} \qquad N(A_2) = \{z : ?\}$$
What is the dimension of each of these null spaces? Use the rank and nullity Theorem (3.2). What is
the dimension of the range space? Discuss existence and uniqueness of a solution, $x_1$ and $x_2$, to each
of the linear systems
$$A_1 x_1 = b_1 \qquad A_2 x_2 = b_2$$
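15. Consider an exact image reconstruction of an object $x$ from the measurements $b$,
$$Ax = \begin{bmatrix} .8 & .3 \\ .2 & .5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 4.5 \\ 7.8 \end{bmatrix} = b$$
• For arbitrary $n \in \mathbb{N}$, show that the matrix norm induced by
$$A : (\mathbb{R}^n, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_\infty)$$
is
$$\|A\|_{1,\infty} = \max_i \max_j |a_{ij}|$$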
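The claimed formula for the induced norm can be sanity-checked numerically before proving it. The sketch below, in Python with the standard library only, compares the largest absolute entry with the ratio $\|Ax\|_\infty / \|x\|_1$ over random vectors, using the 2 × 2 matrix from this problem. The helper names and the sampling range are assumptions made for illustration.

```python
import random

def max_abs_entry(A):
    """max_i max_j |a_ij|, the claimed value of the induced norm."""
    return max(abs(a) for row in A for a in row)

def ratio(A, x):
    """||A x||_inf / ||x||_1 for a nonzero vector x."""
    Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
    return max(abs(v) for v in Ax) / sum(abs(v) for v in x)

A = [[0.8, 0.3],
     [0.2, 0.5]]

rng = random.Random(0)
samples = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(1000)]
largest_sampled = max(ratio(A, x) for x in samples)

# The unit vector selecting the column that holds the largest entry
# attains the bound exactly.
attained = ratio(A, [1.0, 0.0])
```

No sampled ratio should exceed `max_abs_entry(A)`, and the unit vector attains it, which mirrors the two halves of the proof: an upper bound for every x, and one x at which equality holds.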
16. Consider the below images defined on the unit square Ω = [0, 1] × [0, 1]
B Homework II
1. Download/Obtain a 512 × 512 pixel image and write a MATLAB program to resample the image to a
128 × 256 image. Do not use the ‘resample’ command. Express your solution in terms of projections
in an inner product space. Compute the L1 and L2 norm of the difference between the original image
and the resampled image. What is the resulting weighted norm on $\mathbb{R}^n$?
2. Consider a differential operator with specified zero derivative boundary conditions defined on the space
of differentiable functions and the usual L2 inner product
$$X \equiv C^1[a, b] \qquad (x, y) = \int_a^b x(t)\, y(t)\, dt \qquad Lx = \frac{d^2}{dt^2} x + x, \quad x'(a) = 0 \quad x'(b) = 0 \quad a < b$$
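$$f(x) = \frac{x^\top A x}{x^\top x} \qquad (Ax, x) > 0 \quad \forall x \neq 0 \qquad A = A^\top$$
(a) Show that the $n$ stationary points, $\{x_i\}_{i=1}^n$, of $f(x)$ are eigenvectors of $A$ and that the values $f(x_i)$ are the corresponding eigenvalues.
(b) Consider the constraint $c(x) = x^\top x - 1$
i. Show the constraint is satisfied at the optimum, i.e. $c(x) = 0 \Rightarrow \nabla f(x) = 0$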
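4. Consider the quadratic function
$$\min_x f(x) = \frac{1}{2} x^\top A x - b^\top x \qquad A \in \mathbb{R}^{2 \times 2} \text{ symmetric} \quad b \in \mathbb{R}^2 \qquad A = \begin{bmatrix} 8 & 2 \\ 2 & 900 \end{bmatrix} \quad b = \begin{bmatrix} 6 \\ 1 \end{bmatrix}$$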
• What are the eigenvalues and eigenvectors of A? (Compute by hand, i.e. do not use "eigs" in
MATLAB.)
• Code the Steepest Descent Method, Algorithm 2, in MATLAB with your favorite line search and
apply it to find a solution.
• Explain the convergence behavior observed.
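To see the kind of behavior this problem is after, the iteration can be prototyped in a few lines. The sketch below is in Python (standard library only) rather than MATLAB, and it is not Algorithm 2 from the notes: it assumes an exact line search, which for a quadratic has the closed form $\alpha = r^\top r / r^\top A r$ with $r = b - Ax = -\nabla f(x)$. The function names, tolerance, and starting point are illustrative choices.

```python
import math

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steepest_descent(A, b, x0, tol=1e-10, max_iter=100000):
    """Minimize 0.5 x^T A x - b^T x for a symmetric positive definite A.

    Returns the final iterate and the number of iterations taken.
    """
    x = list(x0)
    for it in range(max_iter):
        r = [bi - ci for bi, ci in zip(b, matvec(A, x))]  # r = -grad f(x)
        if math.sqrt(dot(r, r)) < tol:
            return x, it
        alpha = dot(r, r) / dot(r, matvec(A, r))          # exact line search
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x, max_iter

A = [[8.0, 2.0],
     [2.0, 900.0]]
b = [6.0, 1.0]
x, iters = steepest_descent(A, b, [0.0, 0.0])
print(x, iters)
```

Comparing `iters` for this matrix against the count for a well-conditioned one, such as the identity, where a single step suffices, is one way to connect the observed behavior to the eigenvalues computed by hand.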
5. Suppose that we have a time series of imaging data, we draw an ROI on the image, and we need to
fit the measurements within the ROI, $\{t_i, y_i\}_{i=1}^5$, to our model function $\phi(t, \vec{x})$.
$$\phi(t_i, \vec{x}) = x_1 e^{x_2 t_i + x_3 t_i^2}$$
$$r_i = y_i - \phi(t_i, \vec{x})$$
• Formulate the solution as a nonlinear least squares problem. Define what metric you are using
for your objective function. What is the gradient? What is the Hessian ?
• Explicitly write out the first iteration of the Gauss-Newton approach to this nonlinear least squares
problem.
– What is your initial guess?
– What is the initial residual?
– What is the resulting least squares problem?
– Solve this least squares problem using normal equations in MATLAB.
– Solve this least squares problem using QR factorization in MATLAB.
– Compare QR factorization vs normal equations. What are potential advantages and
disadvantages of both approaches?
• Using either a steepest descent, quasi-Newton, or Newton approach, code your own iterative
algorithm to find a solution, i.e. do not use MATLAB intrinsic functions. What algorithm did
you use? What properties led to your algorithm selection?
• Compare your answer to MATLAB's lsqnonlin function. What is the convergence rate of your
algorithm? What is the convergence rate of MATLAB's lsqnonlin function?
6. Discuss the properties of a Newton-CG algorithm for the Trust Region subproblem. In particular,
which descent directions does the algorithm favor ? Is the algorithm robust ? What type of convergence
properties are expected ? In terms of RAM/disk usage and floating point operations, how does the
algorithm allow you to reduce the computational expense of the Newton solve ? Give an example when
the Newton-CG trust region algorithm would increase the trust region radius. Give an example when
the Newton-CG trust region algorithm would decrease the trust region radius.
7. Explore the ability of an L1 algorithm to recover a sparse solution. Consider an exact solution as the
sum of three sinusoidal frequencies
Code an L1 minimization solver to recover the amplitude of the dominant frequency components from
an assumed model g(t).
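$$g(t) = \sum_{j=1}^{100} x_j \sin(j\, t) \quad \Leftrightarrow \quad \begin{bmatrix} & \vdots & \\ \cdots & \sin(j\, t_i) & \cdots \\ & \vdots & \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{100} \end{bmatrix} = \begin{bmatrix} f(t_1) \\ f(t_2) \\ \vdots \\ f(t_{N_{\text{sample}}}) \end{bmatrix} \quad \Leftrightarrow \quad Ax = y$$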
Specifically, beginning with the initial guess $x = \vec{0}$, solve the L1 problem using Algorithm 7.
At your solution, what is the condition number of the Hessian that results from the smooth
approximation to the L1-norm?
$$\|x\|_1 \approx b(x) \equiv \sum_i \sqrt{x_i^2 + .001} \qquad \nabla^2 b(x) = ?$$
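The conditioning question can be explored numerically once the second derivative of each term is known. Differentiating $\sqrt{x_i^2 + \epsilon}$ twice gives $\epsilon / (x_i^2 + \epsilon)^{3/2}$, and because each term depends on a single component the Hessian is diagonal. The Python sketch below (standard library only) evaluates this for a hypothetical sparse vector; the function names and the test vector are assumptions for illustration, not output of Algorithm 7.

```python
def smoothed_l1(x, eps=1e-3):
    """b(x) = sum_i sqrt(x_i^2 + eps), a smooth surrogate for ||x||_1."""
    return sum((v * v + eps) ** 0.5 for v in x)

def hessian_diagonal(x, eps=1e-3):
    """Diagonal of the Hessian of b; entry i is eps / (x_i^2 + eps)^(3/2)."""
    return [eps / (v * v + eps) ** 1.5 for v in x]

def condition_number(x, eps=1e-3):
    """Ratio of the largest to the smallest diagonal Hessian entry."""
    d = hessian_diagonal(x, eps)
    return max(d) / min(d)

x_sparse = [0.0, 0.0, 2.0]   # hypothetical sparse solution
print(hessian_diagonal(x_sparse))
print(condition_number(x_sparse))
```

Entries at zero components have curvature $\epsilon^{-1/2}$ while large components have nearly zero curvature, so the condition number grows as the solution becomes sparser or as $\epsilon$ shrinks.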