Applied Mathematics in Medical Physics: D. Fuentes

This document provides an introduction to applied mathematics concepts for medical physics. It outlines topics that are part of the core graduate curriculum standards for medical physics education, including mathematical models of image formation, reconstruction mathematics, image quality and reconstruction. The lecture notes cover areas of vector spaces, linear operators, inner product spaces, optimization, and other applied mathematics topics that are prevalent in medical physics research literature. The goal is to introduce graduate-level mathematical concepts and notation relevant to medical imaging and physics.


Applied Mathematics in Medical Physics

D. Fuentes
The University of Texas M.D. Anderson Cancer Center,
Department of Imaging Physics, Houston TX 77030, USA

Lecture Notes

References
[Aggarwal et al., 2001] Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. Springer.
[CAMPEP, 2014] CAMPEP (2014). Standards for Accreditation of Graduate Educational Programs in Medical Physics. Commission on Accreditation of Medical Physics Educational Programs.
[Cover and Thomas, 2012] Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.
[Goldstein and Osher, 2009] Goldstein, T. and Osher, S. (2009). The split Bregman method for l1 regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343.
[Golub and Van Loan, 1996] Golub, G. H. and Van Loan, C. F. (1996). Matrix computations. JHU Press, 3rd edition.
[Greenberg, 1978] Greenberg, M. (1978). Foundations of applied mathematics. Prentice-Hall.
[Heath, 1998] Heath, M. (1998). Scientific computing: An introductory survey.
[Kreyszig, 1989] Kreyszig, E. (1989). Introductory functional analysis with applications, volume 21. Wiley.
[Nocedal and Wright, 1999] Nocedal, J. and Wright, S. (1999). Numerical optimization. Springer Verlag.
[Oden and Demkowicz, 1996] Oden, J. and Demkowicz, L. (1996). Applied functional analysis. CRC Press.
[Yin et al., 2008] Yin, W., Osher, S., Goldfarb, D., and Darbon, J. (2008). Bregman iterative algorithms for l1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143–168.

Contents
1 Preliminaries
1.1 Operation Counts ([Golub and Van Loan, 1996], Chapter 1.2.4)

2 Introduction to vector and function spaces
2.1 Vector Spaces ([Greenberg, 1978], Ch 17)
2.2 Metric Spaces ([Kreyszig, 1989], Section 1.1)
2.3 Linear Independence and Dimension of a Vector Spaces ([Kreyszig, 1989], Section 2.1)
2.4 Normed Spaces ([Kreyszig, 1989], Section 2.2, 2.3)
2.5 ∗∗ Continuity and Convergence ([Kreyszig, 1989], Section 1.4)
2.5.1 ∗∗ Continuity
2.6 Finite Dimensional Spaces ([Kreyszig, 1989], Section 2.4)

3 Linear operators and Solvability of a Linear System of Equations
3.1 Linear Operator; Null space; Range Space ([Kreyszig, 1989], Section 2.6, 2.9)
3.2 Bounded Linear Operators ([Kreyszig, 1989], Section 2.7)
3.3 Applications: Conditioning & Residual [Heath, 1998] Chapter 2
3.4 Applications: Accuracy & Numerical Stability [Heath, 1998] Chapter 2
3.5 Applications: Condition number of nearly singular matrix, [Heath, 1998] Chapter 2
3.6 Linear Functionals ([Kreyszig, 1989], Section 2.8)

4 Inner Product Spaces
4.1 Inner Product, ([Kreyszig, 1989], Section 3.2)
4.2 Orthonormal Sets ([Kreyszig, 1989], Section 3.4)
4.3 ∗∗ Minimizing Vector ([Kreyszig, 1989], Section 3.3)
4.4 Applications: Interpolation and Least Squares, [Heath, 1998] Ch. 3
4.5 Adjoint Operator [Greenberg, 1978] Ch 18.4

5 Eigen-formulation for Bounded Self-Adjoint Linear Operator
5.1 Spectrum of Bounded Self-Adjoint Linear Operator, ([Greenberg, 1978], Ch 20)
5.2 Applications: Spectral Method for the Inhomogeneous Problem

6 Unconstrained Optimization
6.1 Characterizations of Solutions, [Nocedal and Wright, 1999] Ch 2
6.2 Search Directions, [Nocedal and Wright, 1999] Ch 2
6.3 Applications: Nonlinear Least Squares, [Heath, 1998] Ch. 6
6.4 Line Search and Trust Region Strategies and Convergence
6.5 Line Search, [Nocedal and Wright, 1999] Ch 3
6.6 Applications: Ill conditioned matrices and Convergence
6.7 Quasi Newton Hessian Approximations, [Nocedal and Wright, 1999] Ch 8
6.8 Trust Region, [Nocedal and Wright, 1999] Ch 4
6.9 Newton-Krylov Trust Region Methods, [Nocedal and Wright, 1999] Ch 6

7 Constrained Optimization
7.1 Theory of Constrained Optimization, [Nocedal and Wright, 1999] Ch 12
7.2 Gradient Project Method, [Nocedal and Wright, 1999] Ch 16.6
7.3 Quadratic Penalty Method, [Nocedal and Wright, 1999] Ch 17.1
7.4 Augmented Lagrangian Formulation, [Nocedal and Wright, 1999] Ch 17
7.5 Applications: L1 minimization
7.6 ∗∗ Applications: Adjoint Method for Nonlinear Least Squares

A Homework I

B Homework II

∗∗ Advanced topics may be skipped on first read

CAMPEP [CAMPEP, 2014] has standardized essential educational and experience requirements needed
to engage in medical physics research and development, and to enter a residency program in preparation for
clinical practice of one of the first three fields. The standardizations committee has outlined a core graduate
curriculum needed to meet these requirements (‘Core Graduate Curriculum’ [CAMPEP, 2014]). Within this
context, this lecture series is intended to provide an introduction to the mathematics of image formation
and image processing, (‘8.3 Fundamentals of Medical Imaging’ [CAMPEP, 2014]). Specific topics addressed
include:

• 8.3.2 - Mathematical Models
• 8.3.3 - Reconstruction mathematics
• 8.3.6.2 - Basic reconstruction
• 8.3.7.5 - Image quality and reconstruction
• 8.3.9.11 - Image fusion, registration, segmentation, quantitation
The methods of applied mathematics and scientific computing discussed are prevalent in an engineering
curriculum and are increasingly encountered in the medical physics literature and in various areas of medical
physics research. The lecture series is intended to provide an overview of the precision and organization of
thought inherent to graduate-level mathematics. Abstract mathematical notation will be used; however, the
material attempts to avoid mathematics that is frivolous with respect to the medical physics curriculum as
much as possible. The notation will take time to become familiar with, but it is needed to provide mathematical
rigor. Homework assignments will be application oriented, aimed at understanding the underlying principles
and algorithms available in Matlab and other scientific computing packages routinely used in graduate
student projects.
Lecture     Topic                                               CAMPEP
1 (8/31)    Preliminaries - Algorithm Complexity, BLAS          8.3.6.2 Basic reconstruction
2 (9/2)     Vector and function spaces                          8.3.3 Reconstruction mathematics
3 (9/4)     Metric Spaces                                       8.3.7.5 Image quality and reconstruction
4 (9/9)     Entropy, Mutual Information                         8.3.2, 8.3.9.11 Models, registration, segmentation
5 (9/11)    Linear Independence, Equivalence of Norms           8.3.3 Reconstruction mathematics
6 (9/14)    Linear Operators, Null space, Range Space           8.3.2, 8.3.6.2 Math Models, Reconstruction
7 (9/16)    Point Spread Function, Operator Inverse             8.3.7.5 Image quality and reconstruction
8 (9/18)    Rank and Nullity                                    8.3.3 Reconstruction mathematics
9 (9/21)    Bounded Operator and Stability of Linear Systems    8.3.3 Reconstruction mathematics
10 (9/23)   Inner Product, Orthogonality, Least Squares         8.3.7.5 Image quality and reconstruction
11 (9/25)   Adjoint Operators, Eigen-formulation                8.3.3 Reconstruction mathematics
12 (9/28)   Optimization, Characterization of Solution          8.3.6.2, 8.3.9.11 Reconstruction, Registration
13 (9/30)   Line Search, Newton-CG Trust-Region methods         8.3.6.2, 8.3.9.11 Reconstruction, Registration
14 (10/2)   Applications: L1 minimization                       8.3.2 Mathematical Models
15 (10/5)   Exam on optimization

We will focus on the mathematical structure of an optimization problem of the form

\[ \min_{x \in X} d(Ax, b), \qquad A : X \to Y, \qquad d : Y \times Y \to \mathbb{R} \]

Here X is a vector space of feasible solutions in which we will look for a solution. d(·, ·) defines a distance
measure relevant to the application. A is an operator that embodies the physics of the therapy planning or
image reconstruction. You should become aware of the following logical thought process:
• At the very basic level, we will define precisely the mathematical spaces and functions we are working
with.
• We will build on the definitions to develop increasingly complex statements, i.e., "true" statements will
be used to derive more complex "true" statements.

1 Preliminaries
1.1 Operation Counts ([Golub and Van Loan, 1996], Chapter 1.2.4)
It is important to be aware of the floating point and memory operations required by the typical algorithms
encountered in research. The algorithmic complexity is directly proportional to the amount of time you will
wait for your program to finish. We typically refer to the algorithmic complexity of an algorithm by the
number of floating point or memory operations.
(Definition) $O(n^p)$ We say an algorithm whose operation count is $f(n)$ is $O(n^p)$ if there is a constant $c$
such that $|f(n)|$ is no larger than $c\,n^p$ for all sufficiently large $n$.

\[ f(n) \text{ is } O(n^p) \iff \exists c : |f(n)| \le c\, n^p \ \text{ for all sufficiently large } n \]

Example 1 (General A × x plus y (GAXPY)). Given a matrix $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, and $y \in \mathbb{R}^m$, the
algorithm below overwrites y with Ax + y.
for i = 1:m
  for j = 1:n
    % 2 FLOP per iteration: one multiply, A(i,j)*x(j), and one add, + y(i)
    y(i) = A(i, j) * x(j) + y(i);
  end
end
Here we have 2 FLOP within a nested loop. For each row i, n iterations of 2 FLOP are
performed. Each row iteration is repeated m times. Thus the FLOP count is 2mn, which is $O(n^2)$ when
m = n. For the memory operations, an n × 1 vector x and an m × n matrix A must be read from memory.
An m × 1 vector y must be read and written back.
for i = 1:m
  for j = 1:n
    % memory operations: write y(i); read A(i,j), x(j), and y(i)
    y(i) = A(i, j) * x(j) + y(i);
  end
end
Here we have mn + 2m + n total memory operations, which is $O(n^2)$ when m = n.
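The counts in Example 1 can be checked empirically. The following is a minimal sketch and not part of the original listing: the counters `flopCount` and `memCount` are illustrative names, and the memory count assumes that x is read once and reused and that y(i) is held in a temporary for the duration of each row, which is how the total mn + 2m + n arises.

```matlab
% Sketch: instrumented GAXPY (counter names are illustrative assumptions).
m = 4; n = 3;
A = rand(m, n); x = rand(n, 1); y = rand(m, 1);
yRef = A * x + y;            % reference result from the built-in product

flopCount = 0;
memCount  = n;               % x is read once and reused for every row
for i = 1:m
  acc = y(i);                % read y(i) once per row
  memCount = memCount + 1;
  for j = 1:n
    acc = A(i, j) * x(j) + acc;   % one multiply and one add
    flopCount = flopCount + 2;
    memCount  = memCount + 1;     % read A(i,j)
  end
  y(i) = acc;                % write y(i) back once per row
  memCount = memCount + 1;
end

fprintf('FLOP: %d (expected %d)\n', flopCount, 2*m*n);
fprintf('memory operations: %d (expected %d)\n', memCount, m*n + 2*m + n);
```

Changing m and n should leave both comparisons in agreement, since the counters follow the loop structure rather than the data.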
Example 2 (Linear system solve operation count). Gaussian elimination to solve a linear system requires
$\tfrac{2}{3} n^3$ floating point operations for the LU factorization and $2n^2$ floating point operations for the forward
and backward substitution. In this case f(n) is $O(n^3)$, since for n > 2

\[ \tfrac{2}{3} n^3 + 2 n^2 < 1 \cdot n^3 + 1 \cdot n^3 = c\, n^3, \qquad c = 2 \]

Data movement for this algorithm requires $n^2$ storage plus additional vectors to store the right hand side and
the solution vector. Memory operations are $O(n^2)$, since for n > 2

\[ n^2 + 2n < c\, n^2, \qquad c = 2 \]
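The two bounds in Example 2 are easy to confirm numerically. The sketch below evaluates both operation counts over a range of n chosen only for illustration and checks them against the constant c = 2; the variable names are not from the original notes.

```matlab
% Sketch: check the bounds of Example 2 numerically for n = 3, ..., 1000.
n = 3:1000;
flops  = (2/3) * n.^3 + 2 * n.^2;   % LU factorization plus substitutions
memory = n.^2 + 2 * n;              % matrix plus right hand side and solution

flopBoundHolds   = all(flops  < 2 * n.^3);
memoryBoundHolds = all(memory < 2 * n.^2);

% The cubic term dominates: the ratio approaches 2/3 as n grows.
ratioAtLargestN = flops(end) / n(end)^3;
fprintf('flop bound: %d, memory bound: %d, ratio: %.4f\n', ...
        flopBoundHolds, memoryBoundHolds, ratioAtLargestN);
```

The ratio illustrates that c = 2 is a convenient bound rather than a tight one; any constant above 2/3 works for large enough n.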

Operation counts for common algorithms you will encounter in research are listed in Figure 1. An interface to
these algorithms is commonly found in various vendor library implementations such as the BLAS, LAPACK,
and/or MKL.

Algorithm                                  Floating Point Operations    Memory Operations
linear system solve (Section 3)            O(N^3)                       O(N^2)
matrix multiplication (Section 3)          O(N^3)                       O(N^2)
matrix-vector multiply (Section 3)         O(N^2)                       O(N^2)
vector operations (Section 2.1, 2.4, 4)    O(N)                         O(N)
FFT                                        O(N log N)                   O(N)
Quick sort                                 O(N log N)                   O(N)

Figure 1: Common Operation Counts


(Definition) BLAS The Basic Linear Algebra Subprograms (BLAS) routines, http://www.netlib.org/blas/,
are an excellent example of a successful standardization in which vendors have agreed on a uniform interface
for common linear algebra routines. Vendors have invested significantly in optimizing the BLAS 1, 2, 3
routines for the memory hierarchy and cache structure of their computing architectures. You benefit directly
by casting your research algorithms in terms of these routines.
BLAS 1 Vector-vector operations. $O(n)$ data and $O(n)$ work.
Vector Addition (Section 2.1), Norm (Section 2.4), Inner product (Section 4)
BLAS 2 Matrix-vector operations. $O(n^2)$ data and $O(n^2)$ work.
Linear Operator on a Vector (Section 3)
BLAS 3 Matrix-matrix operations. $O(n^2)$ data and $O(n^3)$ work.
Composition of Linear Operators and Solution of a linear system of equations (Section 3)
(Definition) Compute Bound A compute bound algorithm refers to the case when the time to complete
an algorithm is dominated by the clock speed of the processors. In this situation the time to complete the
algorithm is determined by the number of floating point operations to be performed divided by the number
of floating point operations that the processor(s) can compute per second.

\[ \text{time} = \frac{\text{floating point operations (FLOP)}}{\text{FLOP per second (FLOPs)}} \]

(Definition) Memory Bound A memory bound algorithm refers to the case when the time to complete
an algorithm is dominated by the memory movement. In this situation the time to complete the algorithm is
determined by the amount of data to be moved divided by the bandwidth of the memory system.

\[ \text{time} = \frac{\text{data movement (MB)}}{\text{bandwidth (MB/s)}} \]
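As a rough illustration of the two definitions, the sketch below applies both time estimates to an N × N matrix-vector product. The peak rate `peakGFLOPs` and the bandwidth `bandwidthGBs` are assumed placeholder values chosen for illustration, not measurements; whichever estimate is larger indicates the likely bottleneck.

```matlab
% Sketch: compare compute-bound and memory-bound time estimates for y = A*x.
% peakGFLOPs and bandwidthGBs are assumed placeholder values for illustration.
N            = 10000;
peakGFLOPs   = 100;                  % assumed peak rate, GFLOP/s
bandwidthGBs = 20;                   % assumed memory bandwidth, GB/s

flop  = 2 * N^2;                     % one multiply and one add per matrix entry
bytes = (N^2 + 2*N) * 8;             % A, x and y in double precision

timeCompute = flop  / (peakGFLOPs   * 1e9);   % seconds if compute bound
timeMemory  = bytes / (bandwidthGBs * 1e9);   % seconds if memory bound

fprintf('compute-bound estimate: %.4f s\n', timeCompute);
fprintf('memory-bound estimate:  %.4f s\n', timeMemory);
if timeMemory > timeCompute
  disp('data movement dominates: the operation is memory bound')
else
  disp('arithmetic dominates: the operation is compute bound')
end
```

With these assumed numbers the data movement estimate is the larger of the two, which is the typical situation for BLAS 2 operations.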
Example 3 (BLAS).
\exampledir/ExBLAS.m
Num = 1000;
A = rand(Num);
B = rand(Num);
C = zeros(Num);
disp('running loop')
tic
for iii = 1:Num
  for jjj = 1:Num
    for kkk = 1:Num
      C(iii,jjj) = C(iii,jjj) + A(iii,kkk) * B(kkk,jjj);
    end
  end
end
toc
Elapsed time is 12.639929 seconds.
Now consider the BLAS 1 implementation. The underlying kernel is a dot product or inner product that we
will discuss in Section 4.
(a, b)
tic
for iii = 1:Num
for jjj = 1:Num
C(iii,jjj) = A(iii,:) * B(:,jjj);
end
end
toc
Elapsed time is 13.851716 seconds.
Now consider the BLAS 2 implementation. The underlying kernel is a matrix-vector multiply or the action
of a linear operator on a vector, Section 3.
y = Ax
tic
for jjj = 1:Num
C(:,jjj) = A * B(:,jjj);
end
toc
Elapsed time is 0.181035 seconds.
Finally consider the BLAS 3 implementation. The underlying kernel is a matrix-matrix multiply or the
composition of two linear operators, Section 3.

A◦B

tic
A*B;
toc
Elapsed time is 0.043407 seconds.
As a rule of thumb, cast your algorithm in terms of the highest level BLAS possible to achieve maximum
efficiency.
It is useful to estimate the theoretical run time of the typical compute bound or memory bound algorithms
encountered in research. A typical computing architecture consists of a hierarchy of memory, each level with
its own bandwidth and latency characteristics, Figure 2. For simplicity, we will consider an 'effective' (average)
memory bandwidth and number of CPU cores in the calculations.
Example 4 (Matrix Multiplication). BLAS 3 matrix multiplication is an example of a compute bound
algorithm in which the memory transfer overhead is hidden by the computations, i.e., the vendors have
invested significant resources into the code design so that computations are performed simultaneously
with the data transfer to achieve near peak performance. Consider an Intel Xeon(R) CPU with a clock speed
of 2.90GHz and 6-12 physical cores (6 cores per socket). The Xeon is capable of 4 floating point operations
per cycle.
maeda$ cat /proc/cpuinfo | grep "model name\|cpu cores" | head -n 2
model name : Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz
cpu cores : 6

The theoretical peak FLOP rate is computed as

\[ \text{FLOPs}_{\text{peak}} = \underbrace{2.9}_{\text{CPU speed in GHz}} \times \underbrace{(6\text{--}12)}_{\text{number of CPU cores}} \times \underbrace{4}_{\text{CPU instructions per cycle}} = 69.6\text{--}139.2 \ \text{GFLOPs} \]

Courtesy https://www.tacc.utexas.edu/user-services/training/course-materials

Figure 2: Computing Architecture. The typical computing architecture consists of a hierarchy of memory.
The closer the memory is to the processor, the faster the access and bandwidth will be. The memory clock
cycle is typically slower than the CPU clock cycle. Algorithms need to hide the cost of the data transfer with
overlapping computations to achieve good performance.

Figure 3: Performance of matrix matrix multiply.

Here, as an upper and lower bound, we are considering the range of cores that the library may use. An
experimental verification of the performance is given below; more than 90% of peak performance is achieved.
\exampledir/ExMatMatMult.m
close all
clear all

CPUSpeed = 2.9; % GHz


numberCoreLB = 6;
numberCoreUB = 12;
instructionPerCycle = 4;
UBMaxGFLOPs = CPUSpeed * instructionPerCycle * numberCoreUB
LBMaxGFLOPs = CPUSpeed * instructionPerCycle * numberCoreLB

NSizeList = [1000:1000:10000];
GFLOPsPerformance = zeros(size(NSizeList));

for iii = 1:length(NSizeList)

Num = NSizeList (iii);
A = rand(Num);
B = rand(Num);
tstart = tic;
A*B;
telapsed = toc(tstart );
GFLOPsPerformance (iii) = 2*Num^3 / telapsed /1.e9 ;
PercentagePeakAchieved = GFLOPsPerformance (iii)/UBMaxGFLOPs
end

handle = figure(1);
set(gca,'FontSize',16)
plot(NSizeList, GFLOPsPerformance)
hold on
plot(xlim,[.9*UBMaxGFLOPs .9*UBMaxGFLOPs],'r--')
plot(xlim,[LBMaxGFLOPs LBMaxGFLOPs],'r-.')
xlabel('N')
ylabel('GFLOPs')
legend('measure', '.9 peak UB', 'peak LB','Location','East')
saveas(handle,'PeakFLOP','png')

Example 5 (Matrix Vector Multiplication). The matrix vector multiply is an example of a memory bound
algorithm. The bottleneck is typically the time to transfer the data from the RAM to the processors.
Compared to the matrix-matrix multiply, significantly fewer FLOPs are achieved. Consider a range for the
bandwidth of the architecture: the quoted peak bandwidth is 51.2 GB/s, and we are measuring 21.5 GB/s.
http://ark.intel.com/products/64589/Intel-Xeon-Processor-E5-2667-15M-Cache-2_90-GHz-8_00-GTs-Intel-QPI

maeda$ lshw -short -C memory


H/W path Device Class Description
=============================================================
/0/0 memory 64KiB BIOS
/0/4/5 memory 384KiB L1 cache
/0/4/6 memory 1536KiB L2 cache
/0/4/7 memory 15MiB L3 cache
/0/1/9 memory 384KiB L1 cache
/0/1/a memory 1536KiB L2 cache
/0/1/b memory 15MiB L3 cache
/0/32 memory 64GiB System Memory
/0/32/0 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
/0/32/1 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
/0/32/2 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
/0/32/3 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
/0/32/4 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
/0/32/5 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
/0/32/6 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
/0/32/7 memory 8GiB DIMM DDR3 1600 MHz (0.6 ns)
maeda$ mbw 1024 | grep AVG
AVG Method: MEMCPY Elapsed: 0.16950 MiB: 1024.00000 Copy: 6041.166 MiB/s
AVG Method: DUMB Elapsed: 0.27155 MiB: 1024.00000 Copy: 3770.990 MiB/s
AVG Method: MCBLOCK Elapsed: 0.05022 MiB: 1024.00000 Copy: 20481.907 MiB/s

>> 20481 * 1.04858 ; % 1 MiB = 1.04858 MB

ans =

21476

Figure 4: Performance of matrix vector multiply. Floating point efficiency is drastically reduced as compared
to the matrix matrix multiply in Figure 3. The memory clock cycle 1.6GHz is slower than the CPU clock cycle
2.9 GHz and the algorithm is not able to hide the cost of the data transfer with overlapping computations.

\exampledir/ExMatVecMult.m
close all
clear all
% http://ark.intel.com/products/64589/Intel-Xeon-Processor-E5-2667-15M-Cache-2_90-GHz-8_00-GTs-Intel-QPI
CPUSpeed = 2.9; % GHz
numberCore = 12;
instructionPerCycle = 4;
MaxGFLOPs = CPUSpeed * instructionPerCycle * numberCore
UBBandwidth = 51.2 ; % GB/s
LBBandwidth = 21.5 ; % GB/s

NSizeList = [1000:1000:20000];
GFLOPsPerformance = zeros(size(NSizeList));
MEMPerformance = zeros(size(NSizeList));

for iii = 1:length(NSizeList)


Num = NSizeList (iii);
A = rand(Num);
b = rand(Num,1);
y = rand(Num,1);
tstart = tic;
y= A*b;
telapsed = toc(tstart );
BytesTransfered = (Num^2 + 2*Num) * 8; % 8 bytes per double
GFLOPsPerformance (iii) = 2*Num^2 / telapsed /1.e9 ;
MEMPerformance (iii) = BytesTransfered / telapsed /1.e9 ;
PercentagePeakAchieved = GFLOPsPerformance (iii)/MaxGFLOPs
end

handle = figure(1);
set(gca,'FontSize',16)
plot(NSizeList, GFLOPsPerformance, 'k')
hold on
plot(xlim,[.05*MaxGFLOPs, .05*MaxGFLOPs],'r--')
xlabel('N')
ylabel('GFLOPs')
legend('measure', '5% peak','Location','East')
saveas(handle,'PeakFLOPMatVec','png')

handle = figure(2);
set(gca,'FontSize',16)
plot(NSizeList, MEMPerformance)
hold on
plot(xlim,[UBBandwidth UBBandwidth],'r--')
plot(xlim,[LBBandwidth LBBandwidth],'r-.')
xlabel('N')
ylabel('GB/s')   % MEMPerformance and both bandwidth bounds are in GB/s
legend('measure', 'UB BW','LB BW','Location','NorthEast')
saveas(handle,'PeakBWMatVec','png')

2 Introduction to vector and function spaces


These fundamental mathematical spaces provide a rigorous framework for communication and problem
formulation. The notation will take time and practice to become familiar with, but will appear throughout
various aspects of your PhD level research. Subsequent problems posed in this first part of the course will
all be presented within this mathematical framework. These lectures will provide a formalism for defining
and understanding Fourier transforms as linear operators on closed inner product spaces, as well as for
defining distances in probability spaces, in the remaining parts of the course taught by other faculty.

2.1 Vector Spaces ([Greenberg, 1978], Ch 17)


For all practical applications, we need to define the basic algebraic operations for vectors in our spaces. Here
"vectors" may be interpreted in a general sense as elements of $\mathbb{R}^n$ as well as functions and even sequences.
(Definition) Vector Space We will call a set of vectors $x_1, x_2, \ldots$

\[ X = \{x_1, x_2, x_3, \ldots\} = \{x_i : i = 1, 2, \ldots\} \]

a vector space X if the following requirements are satisfied.

(i) An operation between any two vectors $x_i, x_j \in X$, called vector addition and denoted by ⊕, is defined
such that the vector space X is closed under vector addition, $x_i \oplus x_j \in X$. The precise definition
of vector addition will depend on the vector space, i.e., $\mathbb{R}^n$, C[a, b], operators, etc.

\[ \begin{aligned}
x_i \oplus x_j &= x_j \oplus x_i && \text{(Commutative)} \\
x_i \oplus (x_j \oplus x_k) &= (x_i \oplus x_j) \oplus x_k && \text{(Associative)}
\end{aligned} \qquad \forall x_i, x_j, x_k \in X \]

(ii) The space X contains a ‘zero vector’, 0, and for each vector $x_i$ there exists a ‘$-x_i$’

\[ x_i \oplus 0 = x_i, \qquad x_i \oplus (-x_i) = 0 \qquad \text{(Zero Vector)} \qquad \forall x_i \in X \]

(iii) Another operation, scalar multiplication ∗, is defined such that the vector space X is closed under
multiplication of vectors by scalars, $\alpha * x \in X$. Scalar multiplication satisfies the following properties

\[ \begin{aligned}
\alpha * (\beta * x_i) &= (\alpha \times \beta) * x_i && \text{(Associative)} \\
\alpha * (x_i \oplus x_j) &= \alpha * x_i \oplus \alpha * x_j && \text{(Distributive)} \\
(\alpha + \beta) * x_i &= \alpha * x_i \oplus \beta * x_i && \text{(Distributive)} \\
1 * x_i &= x_i \\
0 * x_i &= 0 && \text{left: } 0 \in \mathbb{R} \text{ (zero scalar)}, \ \text{right: } 0 \in X \text{ (zero vector)}
\end{aligned} \qquad \forall x_i, x_j, x_k \in X, \quad \alpha, \beta \in \mathbb{R} \]

Notice that the + and ⊕ mean two very different concepts in the distributive law of scalar multiplication.

(α + β) ∗ xi = α ∗ xi ⊕ β ∗ xi

The + is the usual scalar addition and the ⊕ is the vector addition for the space we are working in. However,
we typically abuse notation and denote vector addition as “+” without confusion.
Notice that we intentionally do not specify the nature of the underlying elements of the spaces we are
studying. Rather, we assume that the framework is general enough to encompass any application in our
research that we may be interested in.
Example 6 (Vector algebra in $\mathbb{R}^n$). Defined in the usual component-wise way

\[ (x + y) \equiv (x_1 + y_1, x_2 + y_2, \ldots), \qquad \alpha x \equiv (\alpha x_1, \alpha x_2, \ldots) \]

Example 7 (Space of continuous functions). Consider the space of continuous functions of independent
variable t over the domain [a, b].

\[ C[a, b] \equiv \{ f : [a, b] \to \mathbb{R} : f \text{ is continuous} \} \]

Notice that in this space, each ‘point’ represents a function. Vector algebra is defined as you might expect

\[ (x + y)(t) \equiv x(t) + y(t), \qquad (\alpha * x)(t) \equiv \alpha\, x(t) \]
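To make the idea that each 'point' is a function more concrete, the sketch below represents elements of C[a, b] as MATLAB function handles. The helpers `addFun` and `scaleFun` are illustrative names introduced here, and the axioms are only spot-checked at a few sample points, which is a sanity check rather than a proof.

```matlab
% Sketch: vector algebra in C[a,b] using function handles.
% addFun and scaleFun are illustrative helper names.
addFun   = @(x, y) @(t) x(t) + y(t);         % (x + y)(t) = x(t) + y(t)
scaleFun = @(alpha, x) @(t) alpha * x(t);    % (alpha * x)(t) = alpha x(t)

x = @(t) t.^2;
y = @(t) sin(t);
t = linspace(0, 1, 11);                      % sample points in [a,b] = [0,1]

s1 = addFun(x, y);
s2 = addFun(y, x);
commutes = max(abs(s1(t) - s2(t))) == 0;     % x + y equals y + x pointwise

lhs = scaleFun(3, addFun(x, y));             % alpha * (x + y)
rhs = addFun(scaleFun(3, x), scaleFun(3, y));   % alpha * x + alpha * y
distributes = max(abs(lhs(t) - rhs(t))) < 1e-12;
```

Note that the sum `s1` is itself a function handle, mirroring the closure of the space under vector addition.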

Example 8 (Floating Point Arithmetic). It is important to realize that the usual associative laws typically
expected for vector spaces are not satisfied by floating point arithmetic.

>> eps(2^54)
ans =

     4

>> 2^54 - (2^54 -1)
ans =

     0

>> (2^54 - 2^54) +1
ans =

     1
This is due to the finite precision of floating point arithmetic and, in general, occurs when adding a large
number to a small number.
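The same effect can be reproduced with a single threshold value. This sketch uses a = 2^53, the first power of two at which the spacing between adjacent double precision numbers exceeds 1; the variable names are illustrative.

```matlab
% Sketch: floating point addition is not associative.
a = 2^53;                      % spacing between doubles at 2^53 is eps(a) = 2
leftGrouping  = (a + 1) - a;   % a + 1 rounds back to a, so this gives 0
rightGrouping = (a - a) + 1;   % a - a is exactly 0, so this gives 1
fprintf('eps(a) = %g, (a+1)-a = %g, (a-a)+1 = %g\n', ...
        eps(a), leftGrouping, rightGrouping);
```

The two groupings differ only in the order of operations, yet one loses the small term entirely.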

Figure 5: d(x, y) = |x − y|, for example d(4, 9) = |4 − 9| = 5. Analogous to the absolute value on the real
line, R, we are interested in defining a distance on the abstract vector and function spaces that may
typically arise in research.

2.2 Metric Spaces ([Kreyszig, 1989], Section 1.1)


Introducing a metric on the vector space provides a notion of distance for the vectors. Mathematicians have
generalized the notion of distance on the real line R by identifying key properties that a distance measure
must satisfy on an arbitrary set of elements, X.
(Definition) Metric The function $d : X \times X \to \mathbb{R}^+$ is known as a metric if it satisfies

\[ \begin{aligned}
\text{(M1)} \quad & d(x, y) = 0 \iff x = y \\
\text{(M2)} \quad & d(x, y) = d(y, x) && \text{symmetry} \\
\text{(M3)} \quad & d(x, y) \le d(x, z) + d(z, y) && \text{triangle inequality}
\end{aligned} \qquad \forall x, y, z \in X \tag{1} \]

(M3) agrees with intuition and may be used to show that the shortest path between two points is a straight
line.
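A candidate distance function can be screened against (M1) to (M3) numerically before attempting a proof. The sketch below does this for the Euclidean distance on random vectors; it is a sanity check under the assumption that random samples are representative, and a passing result does not prove that the axioms hold.

```matlab
% Sketch: spot-check the metric axioms (M1)-(M3) on random vectors in R^3.
% This is a numerical sanity check, not a proof.
d = @(x, y) sqrt(sum((x - y).^2));   % Euclidean distance
tol = 1e-12;
axiomsHold = true;
for trial = 1:1000
  x = randn(3, 1); y = randn(3, 1); z = randn(3, 1);
  m1 = (d(x, x) == 0) && (d(x, y) > 0);        % zero only for equal vectors
  m2 = abs(d(x, y) - d(y, x)) <= tol;           % symmetry
  m3 = d(x, y) <= d(x, z) + d(z, y) + tol;      % triangle inequality
  axiomsHold = axiomsHold && m1 && m2 && m3;
end
disp(axiomsHold)
```

Replacing the handle `d` with another candidate is a quick way to find counterexamples for functions that are not metrics.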
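Example 9 (Distance in $\mathbb{R}^n$). The canonical example is 3-dimensional real space, $\mathbb{R}^3$, with
$x = (\xi_1, \xi_2, \xi_3)$ and $y = (\eta_1, \eta_2, \eta_3)$

\[ d(x, y) = \sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2 + (\xi_3 - \eta_3)^2} \]

It's not difficult to see that this satisfies the properties of the metric.
(M1) Given that the distance is 0, properties of nonnegative numbers on the real line

\[ x, y \ge 0, \quad x + y = 0 \ \Rightarrow \ x = -y \le 0 \ \Rightarrow \ 0 \le x \le 0 \ \Rightarrow \ x = 0 \]

may be used to show that the two vectors are in fact the same.

\[ \sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2 + (\xi_3 - \eta_3)^2} = 0 \ \Rightarrow \ (\xi_i - \eta_i)^2 = 0 \ \Rightarrow \ \xi_i = \eta_i \]

Conversely, if the two vectors are identical the metric is zero by direct evaluation.

\[ \xi_i = \eta_i \ \Rightarrow \ \sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2 + (\xi_3 - \eta_3)^2} = 0 \]

(M2) Symmetry of the squared difference implies symmetry of the metric

\[ (\xi_i - \eta_i)^2 = (\eta_i - \xi_i)^2 \ \Rightarrow \ d(x, y) = d(y, x) \]

(M3) Showing the triangle inequality is a bit more tricky and requires Minkowski’s inequality.

\[ \left( \sum_{i=1}^{n} |a_i + b_i|^p \right)^{1/p} \le \left( \sum_{i=1}^{n} |a_i|^p \right)^{1/p} + \left( \sum_{i=1}^{n} |b_i|^p \right)^{1/p} \qquad \forall a, b \in \mathbb{R}^n, \quad 1 \le p < \infty \]

In our case, letting p = 2, a = x − z, and b = z − y, with $z = (\gamma_1, \ldots, \gamma_n)$, we have our result.

\[ \left( \sum_{i=1}^{n} (\xi_i + 0 - \eta_i)^2 \right)^{1/2} = \left( \sum_{i=1}^{n} (\xi_i - \gamma_i + \gamma_i - \eta_i)^2 \right)^{1/2} \le \left( \sum_{i=1}^{n} (\xi_i - \gamma_i)^2 \right)^{1/2} + \left( \sum_{i=1}^{n} (\gamma_i - \eta_i)^2 \right)^{1/2} \]

An example calculation of the distance defined by the 2-norm is provided below.
\exampledir/VectorTwoNormCalc.m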
x=[9;-10;7]
y=[-4;-5;3]
sqrt( (x(1) - y(1))^2 + (x(2) - y(2))^2 + (x(3) - y(3))^2 )

>> echo on
>> VectorTwoNormCalc
x=[9;-10;7]

x =

9
-10
7

y=[-4;-5;3]

y =

-4
-5
3

sqrt( (x(1) - y(1))^2 + (x(2) - y(2))^2 + (x(3) - y(3))^2 )

ans =

14.4914
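The component-by-component expression above does not scale to longer vectors. As a sketch of an equivalent and more general form, the same distance can be obtained from a vectorized sum or from the built-in `norm` applied to the difference x − y.

```matlab
% Sketch: the same distance using vectorized operations and the built-in 2-norm.
x = [9; -10; 7];
y = [-4; -5; 3];
dSum  = sqrt(sum((x - y).^2));   % explicit sum of squared differences
dNorm = norm(x - y, 2);          % built-in Euclidean norm of the difference
fprintf('%.4f %.4f\n', dSum, dNorm);
```

Both forms work unchanged for vectors of any length, which is what the BLAS 1 norm routine mentioned in Section 1 provides.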

Example 10 (Distance in C). In MR applications the conventional workspace is the complex plane, C. For
two complex numbers $x = \xi_1 + \xi_2 i$ and $y = \eta_1 + \eta_2 i$

\[ d(x, y) = \sqrt{(\xi_1 - \eta_1)^2 + (\xi_2 - \eta_2)^2} \]
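A short sketch of this distance in MATLAB follows. The two complex values are arbitrary illustrative choices; the point is that the formula in terms of real and imaginary parts agrees with the modulus of the difference, `abs(x - y)`.

```matlab
% Sketch: distance between two complex numbers (values are illustrative).
x = 3 + 4i;
y = 1 + 1i;
dParts = sqrt((real(x) - real(y))^2 + (imag(x) - imag(y))^2);
dAbs   = abs(x - y);    % modulus of the difference gives the same value
fprintf('%.4f %.4f\n', dParts, dAbs);
```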
Example 11 (Space of continuous functions). Consider the space of continuous functions of independent
variable t over the domain [a, b].
C[a, b] ≡ {f : [a, b] → R : f is continuous}
Notice that in this space, each ‘point’ represents a function. The max difference between the functions over
the domain [a, b] defines a metric on this space.

\[ d(x, y) = \max_{t \in [a,b]} |x(t) - y(t)| \]

(M1) Properties of the absolute value may be used to show that identical functions have zero distance
and that zero distance between the functions implies the functions are the same.

\[ \max_{t \in [a,b]} |x(t) - y(t)| = 0 \iff x(t) = y(t) \quad \forall t \]

If the functions are identical, the max is zero by direct evaluation. Conversely, if the max is zero, then

\[ 0 \le |x(t) - y(t)| \le \max_{s \in [a,b]} |x(s) - y(s)| = 0 \quad \forall t \ \Rightarrow \ x(t) = y(t) \quad \forall t \]

(M2) Symmetry is obvious

\[ \max_{t \in [a,b]} |x(t) - y(t)| = \max_{t \in [a,b]} |y(t) - x(t)| \]

(M3) The triangle inequality for the absolute value may be used to show that the max satisfies the triangle
inequality for the metric.

\[ |x(t) - y(t)| = |x(t) - z(t) + z(t) - y(t)| \le |x(t) - z(t)| + |z(t) - y(t)| \le \max_{t \in [a,b]} |x(t) - z(t)| + \max_{t \in [a,b]} |z(t) - y(t)| \quad \forall t \]

Because this holds for all t, it holds in particular where the max on the left is attained, and we have the result

\[ \max_{t \in [a,b]} |x(t) - y(t)| \le \max_{t \in [a,b]} |x(t) - z(t)| + \max_{t \in [a,b]} |z(t) - y(t)| \]

An example distance between two functions is provided below.
\exampledir/ExFunctionDistanceCheb.m


t = [0:.1:10];
x = t.^2;
y = 20*ones(size(t));
handle = figure
plot(t,x)
hold
plot(t,y)
% save matlab plot
% saveas(handle,’FunctionDistance’,’png’)
max(abs(x-y))

>> echo on
>> ExFunctionDistanceCheb
t = [0:.1:10];
x = t.^2;
y = 20*ones(size(t));
handle = figure

handle =

plot(t,x)
hold
Current plot held
plot(t,y)
% save matlab plot
% saveas(handle,’FunctionDistance’,’png’)
max(abs(x-y))

ans =

80

Figure 6: Function Distance. $d(x, y) = \max_{t \in [a,b]} |x(t) - y(t)|$

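Example 12 (Distance Between Images). Suppose we want to measure the distance between a transformation
of two images, $I : [0, 1] \times [0, 1] \subset \mathbb{R}^2 \to \mathbb{R}$ and $J : [0, 1] \times [0, 1] \subset \mathbb{R}^2 \to \mathbb{R}$.

\[ d(I, J) = \sqrt{ \int_0^1 \int_0^1 \left( I(x, y) - J(x, y) \right)^2 dx\, dy } \approx \sqrt{ \sum_{i,j} \left( I(i \cdot \Delta x, j \cdot \Delta y) - J(i \cdot \Delta x, j \cdot \Delta y) \right)^2 \Delta x\, \Delta y } \]

For the images $I(x, y) = x + \sin(\pi y)$ and $J(x, y) = \sin(\pi y)$ shown in Figure 7,

\[ d(I, J) = \sqrt{ \int_0^1 \int_0^1 \left( x + \sin(\pi y) - \sin(\pi y) \right)^2 dx\, dy } = \sqrt{ \int_0^1 dy \int_0^1 x^2\, dx } = \sqrt{ \left[ \frac{x^3}{3} \right]_0^1 } = \sqrt{\frac{1}{3}} \]

Figure 7: Distance Between Images. $I(x, y) = x + \sin(\pi y)$, $J(x, y) = \sin(\pi y)$.

\exampledir/ExLTwoImageDistance.m
close all
clear all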

delta = 5.e-4;
[X,Y] = meshgrid([0:delta:1],[0:delta:1]);
I = X + sin(pi*Y);
J = sin(pi*Y);
handle = figure; imagesc(I)
%saveas(handle, ’ImageDistanceOne’, ’png’)
handle = figure; imagesc(J)
%saveas(handle, ’ImageDistanceTwo’, ’png’)
norm(I(:)-J(:),2)*sqrt(delta*delta)
sqrt(1/3)

>> echo on
>> ExLTwoImageDistance
close all
clear all
delta = 5.e-4;
[X,Y] = meshgrid([0:delta:1],[0:delta:1]);
I = X + sin(pi*Y);
J = sin(pi*Y);
handle = figure; imagesc(I)
%saveas(handle, ’ImageDistanceOne’, ’png’)
handle = figure; imagesc(J)
%saveas(handle, ’ImageDistanceTwo’, ’png’)
norm(I(:)-J(:),2)*sqrt(delta*delta)

ans =

0.5777

sqrt(1/3)

ans =

0.5774
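The small gap between the discrete value and $\sqrt{1/3}$ is plausibly due to the plain sum counting the boundary samples at full weight. As a hedged sketch of one alternative, the trapezoidal rule (`trapz`) weights the boundary samples by one half; the coarser grid spacing used here is an arbitrary choice for illustration.

```matlab
% Sketch: approximate the image distance with the trapezoidal rule.
delta = 1.e-2;                       % coarser grid than in the listing above
v = 0:delta:1;
[X, Y] = meshgrid(v, v);
I = X + sin(pi * Y);
J = sin(pi * Y);
F = (I - J).^2;                      % integrand sampled on the grid

inner  = trapz(v, F, 2);             % integrate over x (along each row)
dTrapz = sqrt(trapz(v(:), inner));   % integrate over y, then take the root
dExact = sqrt(1/3);
fprintf('trapezoid: %.5f  exact: %.5f\n', dTrapz, dExact);
```

Even on this coarser grid the trapezoidal estimate should land close to the analytic value, since the remaining error scales with the square of the grid spacing.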

Example 13 (Dice Similarity Measure). A common measure for the agreement between segmented/labeled
images is the Dice Similarity Coefficient (DSC). The DSC of two sets $A$ and $B$ is proportional to the
area/volume of the overlap $A \cap B$ normalized by the combined area/volume of the two sets. The proportionality
constant 2 is chosen so that the DSC has a max value of 1.
\[
DSC(A, B) \equiv 2\, \frac{|A \cap B|}{|A| + |B|} \qquad 0 \le DSC(A, B) \le 1
\]

The DSC itself is not a metric; property (1)(M1) is not satisfied.
\[
DSC(A, A) = 1 \neq 0
\]
How about
\[
d(A, B) = 1 - DSC(A, B)
\]
Zero distance, (1)(M1), is satisfied through the definition of set intersection.
\[
d(A, B) = 1 - DSC(A, B) = 0 \quad \Leftrightarrow \quad A = B
\]
\[
DSC(A, B) = 1 \quad \Leftrightarrow \quad A = B
\]
Symmetry, (1)(M2), is satisfied by the commutative property of set intersection and addition.
\[
d(A, B) = d(B, A)
\]
The triangle inequality, (1)(M3), is not satisfied. Consider $A$, $B$, and $C = A \cup B$ such that $A \cap B = \emptyset$ and $|A| = |B|$.
\[
d(A, B) = 1 - DSC(A, B) = 1 - 2\, \underbrace{\frac{|A \cap B|}{|A| + |B|}}_{=0} = 1
\]
\[
d(A, C) = 1 - 2\, \frac{|A \cap (A \cup B)|}{|A| + |A \cup B|} = 1 - 2\, \frac{|A|}{|A| + |A| + |B|} = \frac{1}{3}
\]
\[
d(B, C) = 1 - 2\, \frac{|B \cap (A \cup B)|}{|B| + |A \cup B|} = 1 - 2\, \frac{|B|}{|B| + |B| + |A|} = \frac{1}{3}
\]
\[
d(A, B) = 1 \nleq d(A, C) + d(B, C) = \frac{2}{3}
\]
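
The counterexample can be checked numerically. The sketch below represents the sets as logical masks over four pixels; the masks and the helper names dsc and dist are illustrative assumptions chosen to satisfy $A \cap B = \emptyset$ and $|A| = |B|$.

```matlab
% Sets as logical pixel masks: A and B are disjoint and of equal size
A = logical([1 1 0 0]);
B = logical([0 0 1 1]);
C = A | B;                                        % C = A union B
dsc  = @(P,Q) 2*nnz(P & Q)/(nnz(P) + nnz(Q));     % Dice similarity coefficient
dist = @(P,Q) 1 - dsc(P,Q);                       % candidate 'distance'
dAB = dist(A,B);
dAC = dist(A,C);
dBC = dist(B,C);
fprintf('d(A,B) = %.4f   d(A,C) + d(B,C) = %.4f\n', dAB, dAC + dBC);
```

The direct path from A to B is longer than the detour through C, which is exactly the failure of (M3).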
   

Another measure of ‘distance’ commonly used is the cross correlation (CC) or the normalized cross correlation (NCC).

(Definition) Normalized Cross Correlation (NCC) Given two images $A : \Omega \to \mathbb{R}$ and $B : \Omega \to \mathbb{R}$,
$\Omega \subset \mathbb{R}^d$, the normalized cross correlation is computed as
\[
NCC(A, B) = - \frac{\left( \vec{a} - \hat{a},\, \vec{b} - \hat{b} \right)^2}{\| \vec{a} - \hat{a} \|^2\, \| \vec{b} - \hat{b} \|^2} \qquad (2)
\]
Here $(\cdot, \cdot)$ denotes the inner product, Section 4, and $\| \cdot \|$ denotes the norm, Section 2.4. $\vec{a}$ and $\vec{b}$ denote the
vectors of intensities.
The NCC is defined in terms of the norm and inner product (Sections 2.4, 4). We will see that the norm
and inner product define a distance
\[
d(x, y) = \| x - y \| = \sqrt{(x - y, x - y)}
\]
The inner product satisfies
\[
|(x, y)| \le \sqrt{(x, x)}\, \sqrt{(y, y)} = \|x\| \|y\|
\]
thus
\[
- \left| \left( \vec{a} - \hat{a},\, \vec{b} - \hat{b} \right) \right| \ge - \| \vec{a} - \hat{a} \|\, \| \vec{b} - \hat{b} \|
\]
Hence the NCC is bounded below by -1 and, since the squared inner product and the norms are non-negative, above by 0.
\[
-1 = - \frac{\| \vec{a} - \hat{a} \|^2\, \| \vec{b} - \hat{b} \|^2}{\| \vec{a} - \hat{a} \|^2\, \| \vec{b} - \hat{b} \|^2}
\le - \frac{\left( \vec{a} - \hat{a},\, \vec{b} - \hat{b} \right)^2}{\| \vec{a} - \hat{a} \|^2\, \| \vec{b} - \hat{b} \|^2}
\]
\[
-1 \le NCC \le 0
\]
The NCC is symmetric
\[
NCC(A, B) = NCC(B, A)
\]
However, (M1) is not satisfied
\[
NCC(A, A) = - \frac{(\vec{a} - \hat{a}, \vec{a} - \hat{a})^2}{\| \vec{a} - \hat{a} \|^2\, \| \vec{a} - \hat{a} \|^2}
= - \frac{\| \vec{a} - \hat{a} \|^4}{\| \vec{a} - \hat{a} \|^2\, \| \vec{a} - \hat{a} \|^2} = -1
\]
The triangle inequality (M3) for the NCC is left as a homework exercise.
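
A small numerical sketch of definition (2) is given below. It assumes that $\hat{a}$ and $\hat{b}$ denote the mean intensities of the two images (the text does not state this explicitly), and the intensity vectors a, b and the helper ncc are made up for illustration.

```matlab
% Illustrative intensity vectors (not taken from any image in these notes)
a = [1; 4; 2; 8; 5];
b = [2; 3; 1; 9; 4];
% NCC following equation (2); the hats are ASSUMED to be the mean intensities
ncc = @(u,v) -dot(u - mean(u), v - mean(v))^2 / ...
             (norm(u - mean(u))^2 * norm(v - mean(v))^2);
nccAB = ncc(a,b);
nccBA = ncc(b,a);
nccAA = ncc(a,a);
fprintf('NCC(A,B) = %.4f  NCC(B,A) = %.4f  NCC(A,A) = %.4f\n', nccAB, nccBA, nccAA);
```

The three printed values illustrate the bounds, the symmetry, and the failure of (M1) discussed above.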

(Definition) Entropy ([Cover and Thomas, 2012] Chapter 2) The concept of entropy appears in image
processing as a quantitative measure of the information content. Given a discrete probability function
\[
p : \Omega \to \mathbb{R}^+ \qquad
\Omega \equiv \left\{ \begin{array}{c} c_i = [a + i\, dx,\ a + (i+1)\, dx) \subset [a, b] \\ i = 0, ..., N = (b - a)/dx - 1 \end{array} \right\} \qquad
\sum_{i=1}^{N} \underbrace{p(c_i)}_{\equiv p_i} = 1
\]
The entropy is defined as:
\[
H(p) = - \sum_i p_i \log(p_i) = \sum_i p_i \log(1/p_i)
\]

Lemma 2.1 (Log sum inequality). Given $a_1, a_2, ..., a_n$ and $b_1, b_2, ..., b_n$ non-negative
\[
\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}
\]

Properties of Entropy

• Entropy is positive, $H(p) \ge 0$
\[
0 \le p(x) \le 1 \quad \Rightarrow \quad \frac{1}{p(x)} \ge 1 \quad \Rightarrow \quad \log \frac{1}{p(x)} \ge 0 \quad \text{(property of log)}
\]
L’Hospital’s rule justifies the limiting case $p \to 0$ and the notation $0 \log 0 = 0$
\[
\lim_{p \to 0} p \log p = \lim_{p \to 0} \frac{\log p}{1/p} = \lim_{p \to 0} \frac{1/p}{-1/p^2} = \lim_{p \to 0} -p = 0
\]
• For a uniform distribution, $N p_0 = 1$
\[
p_i = p_0 \ \forall i \quad \Rightarrow \quad H(p) = \sum_i^N p_i \log(1/p_i)
= \log(1/p_0) \underbrace{\sum_i^N p_i}_{=1}
= \log(1/p_0) = \log N
\]
• For a general distribution, the log sum inequality with $a_i = p_i$ and $b_i = 1$ implies
\[
\sum_{i=1}^{N} p_i \log \frac{p_i}{1} \ge \underbrace{\left( \sum_{i=1}^{N} p_i \right)}_{=1} \log \frac{1}{N}
\]
or the entropy is bounded by the corresponding uniform distribution.
\[
H(p) = - \sum_{i=1}^{N} p_i \log \frac{p_i}{1} \le - \log \frac{1}{N} = \underbrace{- \log p_0}_{p_0 N = 1} = + \log \frac{1}{p_0}
\]

Figure 8: Entropy. Similar to thermodynamics, the entropy is a measure of the spread or uncertainty in
a probability distribution. A uniform probability distribution has the largest entropy. This agrees with
our intuition. If we have relatively little information about a model parameter/variable, then it can be
anywhere uniformly within the interval. The more information we have, the more ‘peaked’ the probability
distribution. For example, suppose an object is located with uniform probability, x ∼ U[a, b]. The object
is located anywhere within [a, b] with equal probability and we are uncertain where it may actually be. Compared to
this, for x ∼ N [(a + b)/2, 1] we are more certain that the object is likely to be located near the mid-point, so we have
relatively more information about the location of the object.
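
The bound $0 \le H(p) \le \log N$ and the convention $0 \log 0 = 0$ can be checked on a small discrete distribution. This is a minimal sketch; the probability vector p is an arbitrary illustrative choice.

```matlab
% Discrete distribution with one empty bin (illustrative values, sums to 1)
p  = [0.5 0.25 0.125 0.125 0];
N  = numel(p);
nz = p > 0;                              % convention 0*log(0) = 0: skip empty bins
H  = -sum(p(nz).*log(p(nz)));            % entropy of p
u  = ones(1,N)/N;                        % uniform distribution, p0 = 1/N
Hu = -sum(u.*log(u));                    % equals log(N)
fprintf('H(p) = %.4f   H(uniform) = %.4f   log(N) = %.4f\n', H, Hu, log(N));
```

Without the mask nz, the empty bin would evaluate 0*log(0) as NaN in floating point, which is why the limiting argument above matters in practice.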

\exampledir/ExEntropy.m
clear all
close all

% initialize data structures
sigmaList = [2:1:10];
normalentropy = zeros(size(sigmaList));
maxsigma = max(sigmaList);
mu1 = 1;
dx = 1.2;
x = [-4*maxsigma :dx :4*maxsigma ];
% note: pdf() returns density values; they are used below without the
% bin width dx, so the sampled values need not sum to 1
uniformpdf = pdf('unif', x,-4*maxsigma ,4*maxsigma );

UniformEntropy = -sum(uniformpdf.*log(uniformpdf))

handle1=figure(1);

bar(x, uniformpdf,'k')
h = findobj(gca,'Type','Patch');
set(h,'FaceColor',[1 1 1], 'EdgeColor','black');
hold
for iii = 1:length(sigmaList)
y1 = pdf('normal', x, mu1, sigmaList(iii) );
normalentropy(iii) = -sum(y1.*log(y1));
plot(x, y1)
end
set(gca,'FontSize',16)
xlabel('x')
ylabel('p(x)')
saveas(handle1,'EntropyBins','png')

handle2=figure(2);
plot(sigmaList,normalentropy)
hold
plot(xlim,[UniformEntropy UniformEntropy],'r--')
set(gca,'FontSize',16)
xlabel('sigma')
ylabel('Entropy')
legend('normal', 'uniform' ,'Location','SouthEast')
saveas(handle2,'EntropyValue','png')
Algebraically manipulating the entropy yields an interpretation of the entropy as a ‘distance’ or divergence
from the entropy of a uniform distribution.
\[
\begin{aligned}
H(p) &= - \sum_{i=1}^{N} p_i \log p_i + \underbrace{\sum_i p_i \log ( \underbrace{p_0 N}_{=1} )}_{=0} \\
&= \log N - \sum_{i=1}^{N} p_i \log \frac{p_i}{p_0} \\
&= H(p_0) - \sum_{i=1}^{N} p_i \log \frac{p_i}{p_0}
\end{aligned}
\]
This difference motivates the definition of the Kullback Leibler distance or relative entropy between two
probability distributions.
\[
D(p \| p_0) = \sum_{i=1}^{N} p_i \log \frac{p_i}{p_0} = H(p_0) - H(p)
\]
(Definition) Kullback Leibler Distance The Kullback Leibler distance between two probability distributions,
$p$ and $q$, is defined as the relative entropy
\[
D(p \| q) = \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i}
\]
For two general probability distributions, the difference in the entropy may be interpreted as the difference
in the relative entropy with the uniform probability distribution.
\[
H(p) - H(q) = D(q \| p_0) - D(p \| p_0)
\]
However, the Kullback Leibler Distance is not a ‘distance’ in the sense of a metric. Both the symmetry and
triangle inequality properties are missing.
Example 14. The example below numerically evaluates the symmetry and triangle inequality within the
context of the Kullback Leibler ‘Distance’
\[
D(p \| q) \neq D(q \| p)
\]
\[
D(p \| q) > D(q \| r) + D(p \| r)
\]
\exampledir/ExEntropyCounterExample.m
clear all
close all

pdf1 = [0.3 0.3 0.4];


pdf2 = [0.16 0.33 0.51];
pdf3 = [0.25 0.35 0.4];

% D(pdf1 || pdf2)
RelEntropy12 = sum(pdf1.*log(pdf1.* pdf2.^(-1) ))
RelEntropy21 = sum(pdf2.*log(pdf2.* pdf1.^(-1) ))
Symmetry = RelEntropy12 - RelEntropy21
% D(pdf1 || pdf3)
RelEntropy13 = sum(pdf1.*log(pdf1.* pdf3.^(-1) ))

% D(pdf2 || pdf3)
RelEntropy23 = sum(pdf2.*log(pdf2.* pdf3.^(-1) ))

% test the triangle inequality: D(pdf1 || pdf2) <= D(pdf1 || pdf3) + D(pdf2 || pdf3)


TriangleInequality = (RelEntropy12 <= (RelEntropy13 + RelEntropy23 ))
>> ExEntropyCounterExample

RelEntropy12 =

0.0628

RelEntropy21 =

0.0548

Symmetry =

0.0080

RelEntropy13 =

0.0085

RelEntropy23 =

0.0331

TriangleInequality =

0

i.e., within this space, the shortest distance between two points may not necessarily be a straight line.
Nevertheless, the Kullback Leibler Distance is heavily used within mutual information-based image registration.

(Definition) Mutual Information Given two images $A : \Omega \to \mathbb{R}$ and $B : \Omega \to \mathbb{R}$, $\Omega \subset \mathbb{R}^d$, with probability
intensities $p(a)$ and $p(b)$, respectively, the mutual information between the two images, $I(A, B)$, is defined
as
\[
I(A, B) = D\left( p(a, b) \| p(a) p(b) \right) = H(A) + H(B) - H(A, B)
\]
For image registration, we typically want to maximize the mutual information. This corresponds to minimizing
the joint entropy/uncertainty. Interpreting mutual information within the context of the Kullback
Leibler ‘distance’, maximizing the mutual information maximizes the distance between the joint distribution $p(a, b)$
and the joint distribution for independent images, $p(a) p(b)$; i.e., two registered images will be highly correlated.
The joint entropy is defined in an analogous manner.
\[
H(p(a, b)) = \sum_i \sum_j p_{ij} \log (1/p_{ij})
\]
Here the probability intensities are defined through the image histograms normalized to the number of pixels.
Example code for a single image is below. The joint histogram is defined in an analogous manner, i.e., count
the number of intensities within a 2D bin $(I_1^{lb}, I_1^{ub}) \times (I_2^{lb}, I_2^{ub})$.

Figure 9: An image and corresponding histogram.


\exampledir/segmentation/ImageEntropy.m
clear all
close all
T1Image = 'ICBM_Template.nii.gz';
% read image (load_nii is not a base MATLAB function; an external NIfTI reader is assumed)
IntensityImage = load_nii(T1Image );
handle1 = figure(1);
set(gca,'FontSize',16)
imagesc(IntensityImage.img(:,:,100));
colormap(gray);
saveas(handle1,'ImageExample','png');

% reshape to 1D array
IntensityImage = IntensityImage.img(:);

nbins=32; % set number of histogram bins

min1 = min(IntensityImage); max1 = max(IntensityImage);
IntensityRange = (max1 - min1);
BinSize = (max1 - min1)/nbins; % Grayscale bin width of the image

histogram = zeros(nbins+1,1); % initialize histogram; the extra bin receives the maximum intensity

for iii = 1:length(IntensityImage );
idx = floor( (IntensityImage(iii) - min1)/BinSize) + 1; % bin location
histogram (idx ) = histogram(idx ) + 1; % increment histogram
end
histogram = histogram /length(IntensityImage); % normalize to 1
nonzero = histogram > 0; % convention 0 log 0 = 0: empty bins would otherwise give NaN
entropy = -sum(histogram(nonzero).*log(histogram(nonzero)))

handle2 = figure(2);
set(gca,'FontSize',16)
bar( histogram ,'k')
saveas(handle2,'ImageHistogram','png')

>> ImageEntropy

entropy =

2.3567

Notice that the mutual information is not a metric.

• (M1). $I(A, A) = H(A)$, which is not necessarily equal to 0.
• (M2). Symmetry is satisfied.
\[
D\left( p(a, b) \| p(a) p(b) \right) = H(A) + H(B) - H(A, B) = I(A, B) = I(B, A) = H(B) + H(A) - H(B, A) = D\left( p(b, a) \| p(b) p(a) \right)
\]
• (M3). The Kullback Leibler distance does not satisfy the triangle inequality, as we have seen in the previous example.
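
The joint histogram and the property $I(A, A) = H(A)$ can be sketched on a toy example. The 4 × 4 integer ‘images’, the bin count, and the helper binidx below are illustrative assumptions; the binning is a simplified variant of the single-image code above that folds the maximum intensity into the last bin.

```matlab
% Toy 'images' with integer intensities; B is identical to A
A = [0 0 1 1; 0 1 1 2; 2 2 3 3; 3 3 0 1];
B = A;
nbins = 4;
% bin index (1..nbins) of every pixel; the maximum is clipped into the last bin
binidx = @(V) min(floor((V(:) - min(V(:)))/(max(V(:)) - min(V(:)))*nbins) + 1, nbins);
ia = binidx(A);
ib = binidx(B);
% joint histogram: count pixel pairs falling into each 2D bin, normalized to 1
joint = accumarray([ia ib], 1, [nbins nbins])/numel(A);
pa = sum(joint, 2);                      % marginal of A (column vector)
pb = sum(joint, 1);                      % marginal of B (row vector)
indep = pa*pb;                           % joint distribution of independent images
nz = joint > 0;                          % convention 0 log 0 = 0
MI = sum(joint(nz).*log(joint(nz)./indep(nz)));
HA = -sum(pa(pa > 0).*log(pa(pa > 0)));
fprintf('I(A,A) = %.4f   H(A) = %.4f\n', MI, HA);
```

For identical images the joint histogram is diagonal, so the mutual information reduces to the entropy of the single image.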
Example 15 (Registration Distance Measures). The example below compares three ‘distance’ measures
commonly used for image registration.

Figure 10: Consider the 1D rigid registration of a brain image parametrized by distance d.

Figure 11: Comparison of (a) MI, (b) MSQ, (c) NCC Distance Measures.

\exampledir/segmentation/distancemeasure.m
clear all
close all
c3dexe = '/opt/apps/itksnap/c3d-1.0.0-Linux-x86_64/bin/c3d';
T1Image = 'ICBM_TemplateZSlab.nii.gz';
T1Image = 'brain_T1ZSlab.nii.gz';
T1CImage = 'brain_T1CZSlab.nii.gz';
T2Image = 'brain_T2ZSlab.nii.gz';
FLImage = 'brain_FlairZSlab.nii.gz';
ImageList = {T1CImage; T2Image ;T1Image ;FLImage }
transformedimage = 'slicetranslate';

OriginalImage = load_nii(T1CImage );
nbins = 32;

% fix noise
originalnoisepower = (max(OriginalImage.img(:)) - min(OriginalImage.img(:))) *.2;
originalmean = mean(OriginalImage.img(:));
NoisyImage = OriginalImage.img+floor(originalnoisepower*rand(size(OriginalImage.img))+originalmean);

% initialize Data structures


translationlist = [1:40];
stepsize = 4.0;
metricdata = zeros(size(translationlist ,2),length(ImageList),10);

handle = figure(1);
% look at rigid registration distance of image list
for jjj = 1:length(ImageList)
currentImage = ImageList{jjj}
for iii = translationlist
disp('###################')
% create transformed images
theta = (iii-1)*pi/180;
system(sprintf('sed "s/1 0 0 0 1 0 0 0 1 0 0 0/%f %f 0 %f %f 0 0 0 1 %f %f 0/" identity.tfm > tmp.tfm;'
transformimagefilename = sprintf('%s%04d.nii.gz',transformedimage ,iii);
transformcmd = sprintf('%s %s %s -reslice-itk tmp.tfm -o %s',c3dexe,T1CImage,currentImage,transformimag
disp(transformcmd );
system(transformcmd );

% compute image metric from ITK as a reference
itkmetrics = sprintf('./ImageMutualInformation1 %s %s %d',T1CImage ,transformimagefilename, nbins );
disp(itkmetrics);
system(itkmetrics);

% compute local metrics


TransformImage = load_nii(transformimagefilename);
metricdata(iii,jjj,1) = CalculateMI( OriginalImage.img,TransformImage.img,nbins);
metricdata(iii,jjj,2) = CalculateMSQ( OriginalImage.img,TransformImage.img);
metricdata(iii,jjj,3) = CalculateNormalizedCorrelation(OriginalImage.img,TransformImage.img);

% repeat with noisy image


metricdata(iii,jjj,4) = CalculateMI( NoisyImage ,TransformImage.img,nbins);
metricdata(iii,jjj,5) = CalculateMSQ( NoisyImage ,TransformImage.img);
metricdata(iii,jjj,6) = CalculateNormalizedCorrelation(NoisyImage ,TransformImage.img);
disp(sprintf('MI %f MSQ %f NCOR %f (noise) MI %f MSQ %f NCOR %f \n',metricdata(iii,jjj,1:6)));

imagesc(NoisyImage(:,:,1) +TransformImage.img(:,:,1));
colormap(gray);
pause(.1)
end
end
%saveas(handle,'RegistrationExample','png')

typelegend = {'-'; '--';':';'-.'};
colorlegend = {'k'; 'b'; 'g';'r' };

set(gca,’FontSize’,16)
% plot MI
handle2 = figure(2);
hold
xlabel('distance')
ylabel('MI')
for jjj = 1:length(ImageList)
plot (translationlist,metricdata(:,jjj,1),strcat(typelegend{1},colorlegend{jjj}))
plot (translationlist,metricdata(:,jjj,4),strcat(typelegend{2},colorlegend{jjj}))
end
legend('T1C/T1C', 'T1C (noise)/T1C', 'T1C/T2', 'T1C (noise)/T2', 'T1C/T1', 'T1C (noise)/T1', 'T1C/FL', 'T1C
% plot MSQ
handle3 = figure(3);
hold
xlabel('distance')
ylabel('MSQ')
for jjj = 1:length(ImageList)
plot (translationlist,metricdata(:,jjj,2),strcat(typelegend{1},colorlegend{jjj}))
plot (translationlist,metricdata(:,jjj,5),strcat(typelegend{2},colorlegend{jjj}))
end
legend('T1C/T1C', 'T1C (noise)/T1C', 'T1C/T2', 'T1C (noise)/T2', 'T1C/T1', 'T1C (noise)/T1', 'T1C/FL', 'T1C
% plot NCOR
handle4 = figure(4);
hold
xlabel('distance')
ylabel('NCOR')
for jjj = 1:length(ImageList)
plot (translationlist,metricdata(:,jjj,3),strcat(typelegend{1},colorlegend{jjj}))
plot (translationlist,metricdata(:,jjj,6),strcat(typelegend{2},colorlegend{jjj}))
end
legend('T1C/T1C', 'T1C (noise)/T1C', 'T1C/T2', 'T1C (noise)/T2', 'T1C/T1', 'T1C (noise)/T1', 'T1C/FL', 'T1C

saveas(handle2,'DistanceMI','png')
saveas(handle3,'DistanceMSQ','png')
saveas(handle4,'DistanceNCOR','png')

The Hellinger distance provides an alternative notion of distance between probability distributions that
satisfies the properties of a metric.
Example 16 (Hellinger Distance). In statistics, we are typically interested in comparing probability distributions.
Here the probability functions in our space are positive, continuous, and normalized to 1.
\[
X \equiv \left\{ f : \Omega \to \mathbb{R}^+ \ \middle| \ f \text{ continuous}, \ \int f\, dx = 1 \right\}
\]
Various tests (Z-test, F-test, T-test) have been developed to compare distributions. In some applications, the
probability distributions are known and a more direct measure of distance between two probability distributions
$f$ and $g$ may be given by the Hellinger Distance.
\[
\begin{aligned}
d(f, g) &= \frac{1}{\sqrt{2}} \left( \int_\Omega \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 dx \right)^{1/2}
= \frac{1}{\sqrt{2}} \left( \underbrace{\int_\Omega f(x)\, dx}_{=1} + \underbrace{\int_\Omega g(x)\, dx}_{=1} - 2 \int_\Omega \sqrt{f(x)} \sqrt{g(x)}\, dx \right)^{1/2} \\
&= \left( 1 - \int_\Omega \sqrt{f(x)} \sqrt{g(x)}\, dx \right)^{1/2} \qquad 0 \le d(f, g) \le 1
\end{aligned}
\]
Notice the intuition for the Hellinger distance is obtained from the simplified form above. In a sense, the
Hellinger distance measures the area of overlap between two probability distribution functions. In the extreme
case with no overlap, i.e., f is non-zero when g is zero and vice-versa, the Hellinger distance attains its max
value of 1. As an explicit example, for two normal distributions $P \sim N(\mu_1, \sigma_1)$, $Q \sim N(\mu_2, \sigma_2)$ the Hellinger
metric reduces to
\[
d(P, Q) = \sqrt{ 1 - \sqrt{\frac{2 \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2}} \exp \left( - \frac{1}{4} \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \right) }
\]
>> ExHellingerDistance

\exampledir/ExHellingerDistance.m
mu1 = 0;
sigma1= 1;
mu = .5:.5:10;
sigma = [1,5,10]
plotcolors = ['b','r','k']
maxsigma = max(sigma);
close all
handle = figure
hold
for jjj = 1:size(sigma,2)
for iii = 1:size(mu,2)
mu2 = mu(iii);
sigma2= sigma(jjj);
% closed-form Hellinger distance between N(mu1,sigma1) and N(mu2,sigma2);
% the outer sqrt matches the closed-form expression for d(P,Q)
hellinger(iii,jjj) = sqrt( 1 - sqrt( ( 2 * sigma1 * sigma2 ) / ...
(sigma1^2 + sigma2^2) ) * exp(-(mu1-mu2)^2/(sigma1^2 + sigma2^2)/4) );
end
plot(mu,hellinger(:,jjj),plotcolors(jjj))
end
legend('sigma=1', 'sigma=5', 'sigma=10')
saveas(handle,'HellingerDistance','png')

handle = figure
hold
x = [-4*maxsigma:1e-3:4*maxsigma];
y1 = pdf('normal', x, mu1, sigma1);
y2 = pdf('normal', x, mu(6), sigma(2));
y3 = pdf('normal', x, mu(10), sigma(3));
plot(x, y1)
plot(x, y2, 'r')
plot(x, y3, 'k')
title('Density functions')
legend('mu=0 sigma=1', 'mu=3 sigma=5', 'mu=5 sigma=10')
saveas(handle,'NormalPDFCompare','png')

Figure 12: Hellinger Distance. Panels: µ vs. d(N (0, 1), ·) and x vs. probability density.

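(M1) Properties of the integral of positive functions
\[
\int f^2\, dx = 0 \quad \Rightarrow \quad f = 0
\]
may be used to show that zero distance is equivalent to the same function
\[
\frac{1}{2} \int_\Omega \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 dx = 0 \quad \Rightarrow \quad f(x) = g(x) \ \forall x
\]
(M2) Symmetry is obvious
\[
\frac{1}{2} \int_\Omega \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 dx = \frac{1}{2} \int_\Omega \left( \sqrt{g(x)} - \sqrt{f(x)} \right)^2 dx
\]
(M3) Minkowski inequality is again used to show the triangle inequality.
\[
\left( \int |a(x) + b(x)|^p\, dx \right)^{1/p} \le \left( \int |a(x)|^p\, dx \right)^{1/p} + \left( \int |b(x)|^p\, dx \right)^{1/p} \qquad p \ge 1
\]
for $p = 2$, $a = \sqrt{f} - \sqrt{z}$, $b = \sqrt{z} - \sqrt{g}$
\[
\left( \int_\Omega \left( \sqrt{f(x)} - \sqrt{z(x)} + \sqrt{z(x)} - \sqrt{g(x)} \right)^2 dx \right)^{1/2}
\le \left( \int_\Omega \left( \sqrt{f(x)} - \sqrt{z(x)} \right)^2 dx \right)^{1/2}
+ \left( \int_\Omega \left( \sqrt{z(x)} - \sqrt{g(x)} \right)^2 dx \right)^{1/2}
\]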
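
As a consistency check, the integral form of the Hellinger distance can be compared against the closed-form expression for two normal distributions. The sketch below uses only base functions: the Gaussian density is written out by hand in the helper gauss, and the integration interval and grid spacing are illustrative assumptions.

```matlab
% Two normal distributions (values match one of the plotted cases)
mu1 = 0; sigma1 = 1;
mu2 = 3; sigma2 = 5;
% Gaussian density written explicitly, so no toolbox is required
gauss = @(x,mu,s) exp(-(x - mu).^2/(2*s^2))/(s*sqrt(2*pi));
x = linspace(-60, 60, 120001);           % integration grid (assumed wide enough)
% overlap integral of sqrt(f)*sqrt(g) by the trapezoidal rule
overlap  = trapz(x, sqrt(gauss(x,mu1,sigma1).*gauss(x,mu2,sigma2)));
dNumeric = sqrt(1 - overlap);
% closed-form expression for two normal distributions
dClosed  = sqrt(1 - sqrt(2*sigma1*sigma2/(sigma1^2 + sigma2^2)) * ...
                exp(-(mu1 - mu2)^2/(4*(sigma1^2 + sigma2^2))));
fprintf('numeric %.6f   closed form %.6f\n', dNumeric, dClosed);
```

The two values agree to within the quadrature error, which supports reading the closed form as a distance (with the outer square root) rather than a squared distance.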

The entropy facilitates a quantitative measure of the information gain for an image segmentation.
Example 17 (Informational Entropy). For image segmentation, the information gain during the segmentation
is a measure of the ‘reduction in the uncertainty’.
\[
\text{Information Gain} = \text{Entropy}_{\text{before}} - \text{Entropy}_{\text{after}}
\]
The entropy should be reduced when information is added to the system and the information gain should be
> 0.
Assume that we are given disjoint segmentation data sets, $S_j$, that classify the pixel type to be used in
‘training’ our algorithm. A label $l \in \mathbb{N}$ is associated with each pixel in the image, $v \in \mathbb{R}^n$,
\[
S_j \equiv \left\{ (v_i, l_i) \in \mathbb{R}^n \times \mathbb{N}, \ i = 1, ..., N_{pixel} \right\} \qquad j = 1, ..., N_{data}
\]
\[
S = \cup_j S_j \qquad S_j \cap S_i = \emptyset \quad i \neq j
\]
Define the entropy of these data sets in terms of the probability distributions/histograms of the class labels.
\[
H(S) = \sum_j \frac{|S_j|}{|S|} H(S_j) \qquad |A| \equiv \# \text{ of training points in set } A
\]
\[
H(A) = - \sum_i p_i \log(p_i) = \sum_i p_i \log(1/p_i) \qquad p_i = \frac{\# \text{ of class label } i}{\text{total } \# \text{ of labels in set } A}
\]
Consider the thresholding operation on a T1 image with class labels l1 = White Matter, l2 = Grey Matter,
l3 = CSF. The entropy of the entire training set ‘before’ a thresholding operation is applied is computed as
\[
H_{before}(S) = \frac{|S|}{|S|} H(S)
\]

Below are the initial label statistics to compute the label histograms.
$ /opt/apps/itksnap/c3d-1.0.0-Linux-x86_64/bin/c3d ICBM_Template.nii.gz ICBM_grey_white_csf.nii.gz -lstat
LabelID Mean StdD Max Min Count Vol(mm^3) Extent(Vox)
0 432.23113 677.59359 4095.00000 0.00000 5610187 5610187.000 181 217 181
1 1669.76621 281.12593 2493.00000 177.00000 1032234 1032234.000 145 180 138
2 2214.56145 158.47540 2715.00000 734.00000 435476 435476.000 134 171 120
3 861.27990 308.32083 2146.00000 459.00000 31240 31240.000 77 93 87

Thresholding the dataset at Intensity = 2010 yields a left ‘L’ and right ‘R’ dataset. The entropy of the
entire training set ‘after’ the thresholding operation is applied is computed as
\[
H_{after}(S) = \sum_{j \in \{L, R\}} \frac{|S_j|}{|S|} H(S_j)
\]

Label statistics for each group are computed below.


$ /opt/apps/itksnap/c3d-1.0.0-Linux-x86_64/bin/c3d ICBM_Template.nii.gz -as template -threshold -inf 2010 1 0
ICBM_grey_white_csf.nii.gz -multiply -as greycsf -push template -push greycsf -lstat
LabelID Mean StdD Max Min Count Vol(mm^3) Extent(Vox)
0 581.19920 815.97455 4095.00000 0.00000 6120990 6120990.000 181 217 181
1 1614.01683 245.60763 2010.00000 177.00000 917014 917014.000 145 180 138
2 1887.26700 143.01551 2010.00000 734.00000 39914 39914.000 133 169 117
3 860.46231 306.80601 2008.00000 459.00000 31219 31219.000 77 93 87
$ /opt/apps/itksnap/c3d-1.0.0-Linux-x86_64/bin/c3d ICBM_Template.nii.gz -as template -threshold 2011 inf 1 0
ICBM_grey_white_csf.nii.gz -multiply -as white -push template -push white -lstat
LabelID Mean StdD Max Min Count Vol(mm^3) Extent(Vox)
0 607.29946 759.24294 4095.00000 0.00000 6598334 6598334.000 181 217 181
1 2113.46491 79.53650 2493.00000 2011.00000 115220 115220.000 135 174 121
2 2247.58695 116.98312 2715.00000 2011.00000 395562 395562.000 134 171 120
3 2076.71429 45.65320 2146.00000 2012.00000 21 21.000 38 83 29

\exampledir/ExEntropySegmentation.m
close all
clear all

% grey white csf
disp('c3d ICBM_Template.nii.gz ICBM_grey_white_csf.nii.gz -lstat');
disp('c3d ICBM_Template.nii.gz -as template -threshold -inf 2010 1 0 ICBM_grey_white_csf.nii.gz -multiply -
disp('c3d ICBM_Template.nii.gz -as template -threshold 2011 inf 1 0 ICBM_grey_white_csf.nii.gz -multiply -
Initial = [1032234,435476, 31240];
GREYCSF = [ 917014, 39914, 31219];
WHITE = [ 115220,395562, 21];

% verify split
verify = sum(GREYCSF )+ sum(WHITE)- sum(Initial )

% compute discrete probabilities
pdfInitial = 1/sum(Initial)* Initial;
pdfGREYCSF = 1/sum(GREYCSF)* GREYCSF;
pdfWHITE = 1/sum(WHITE )* WHITE ;

% compute the initial entropy
EntropyBefore = -sum(pdfInitial .*log(pdfInitial ))

% compute the entropy after the split
EntropyGREYCSF = -sum(pdfGREYCSF .*log(pdfGREYCSF ));
EntropyWHITE = -sum(pdfWHITE .*log(pdfWHITE ));

% each split is normalized by the number of entries
EntropyAfter = sum(GREYCSF )/ sum(Initial ) * EntropyGREYCSF + ...
sum(WHITE )/ sum(Initial ) * EntropyWHITE

% information gain is the change in entropy
InformationGain = EntropyBefore - EntropyAfter

Figure 13: Segmentation Histogram. Panels (a), (b), (c), (d).

28
2000 4000 6000 8000 10000

2500
SAGT1

2000
1500
cor= 0.95

1000
500
10000

N4CORR
8000
6000
4000
2000

500 1000 1500 2000 2500

(a) (b)

Figure 14: Segmentation Histogram. Entropy provides a quantitative measure of the a segmentation thresh-
old result. The ‘distance’ defined by the information gain provides a repeatable and reproducible measure.

>> ExEntropySegmentation

verify =

0

EntropyBefore =

0.6967

EntropyAfter =

0.3852

InformationGain =

0.3115

Example 18 (Distances in High Dimensional Space). Non-Euclidean distances have been suggested to be appropriate
for high dimensional clustering applications [Aggarwal et al., 2001]. Let's look at high-dimensional
‘distances’ for identifying noisy images. Consider the image basis derived from the image histograms shown in
Figure 15, i.e., intensity groups within a histogram bin form a basis vector. As the number of bins increases, the
dimension of the basis vector effectively increases. Consider the ‘contrast’ measure of a uniformly distributed
vector in this high dimensional space, $x \sim U^d(0, 1)$, $\text{contrast} = E\left( \|x\|_p^{\max} - \|x\|_p^{\min} \right)$.
\exampledir/ExHighDimensional.m

.
distancesubset = [2:10:1024];
datamatrix = 1000*rand(Npixel*Npixel,Nsample );
.

.
for iii =1:Nsample
for jjj =1:length(distancesubset)
[hresample edges imagebins] = histcounts(originalimage,distancesubset(jjj));
noisyimage = zeros(size(originalimage));
for idnoise = 1:distancesubset(jjj)
noisyimage = noisyimage + datamatrix(idnoise ,iii) * (imagebins==idnoise);
end
.
distanceone( iii,jjj) = norm(datamatrix(1:distancesubset(jjj),iii),1);
distancetwo( iii,jjj) = norm(datamatrix(1:distancesubset(jjj),iii),2);
.
end
end
.
for jjj =1:length(distancesubset)
distanceoneSeparation( jjj) = max(distanceone( :,jjj)) - min(distanceone( :,jjj));
distancetwoSeparation( jjj) = max(distancetwo( :,jjj)) - min(distancetwo( :,jjj));
end

Figure 15: Image Basis of Figure 9 derived from histogram. The max distance for a uniformly distributed
vector with respect to this basis, $x \sim U^d(0, 1)$, is shown in Figure 16.

In summary, as you progress through your research and are asked to quantitatively evaluate the
distance between two measurements or computer simulations, in general, your advisor may be skeptical if the
distance function you are using does not satisfy the properties of a metric (1).
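
The ‘contrast’ of Example 18 can also be explored without any image data. The sketch below is a simplified stand-in for the histogram-basis experiment, not a reproduction of it: it draws random vectors from $U^d(0,1)$ directly and reports a single-draw estimate of $\|x\|_p^{\max} - \|x\|_p^{\min}$. The sample count, the list of dimensions, and all variable names are illustrative assumptions.

```matlab
Nsample = 200;                 % random vectors per dimension (assumed)
dims    = [2 10 100 1000];     % dimensions d to compare (assumed)
pList   = [1 2 3];             % p-norms to compare
contrast = zeros(numel(pList), numel(dims));
for jd = 1:numel(dims)
  X = rand(dims(jd), Nsample);                 % columns are samples of U^d(0,1)
  for ip = 1:numel(pList)
    p = pList(ip);
    nrm = sum(abs(X).^p, 1).^(1/p);            % p-norm of every column
    contrast(ip,jd) = max(nrm) - min(nrm);     % one-draw estimate of the contrast
  end
end
disp(contrast)    % rows: p = 1,2,3; columns: increasing dimension
```

Averaging the result over repeated draws would give a closer estimate of the expectation in the definition.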

2.3 Linear Independence and Dimension of a Vector Space ([Kreyszig, 1989], Section 2.1)

Algebraic properties allow us to express concepts of linear combinations of vectors from our vector space
$x_1, ..., x_m \in X$.

Figure 16: Panels (a), (b), (c). For the basis shown in Figure 15, notice that the ‘contrast’ (1) increases with dimension for
$E(\|x\|_1^{\max} - \|x\|_1^{\min})$, (2) asymptotes with dimension for $E(\|x\|_2^{\max} - \|x\|_2^{\min})$, (3) decreases with dimension for
$E(\|x\|_3^{\max} - \|x\|_3^{\min})$. Comparison of this behavior to p=0.5 and MI is left as a homework exercise.

(Definition) Linear Combination A linear combination of a set of vectors $M = \{x_1, x_2, ..., x_m\} \subset X$ is
an expression of the form
\[
\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_m x_m
\]
The set of all linear combinations of a set $M$ is called the span of the set $M$.

(Definition) Span of a set of vectors The span of a set of vectors $M = \{x_1, x_2, ..., x_m\} \subset X$ denotes the
set of all linear combinations of the vectors
\[
\text{span}\, M \equiv \left\{ y : y = \sum_i \alpha_i x_i \ \text{ for some } (\alpha_1, \alpha_2, ..., \alpha_m) \in \mathbb{R}^m \right\}
\]

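Example 19 (Span of Vectors in $\mathbb{R}^2$). Consider the vectors
\[
\vec{a} = \begin{pmatrix} 7 \\ 0 \end{pmatrix} \qquad
\vec{b} = \begin{pmatrix} 5 \\ 2 \end{pmatrix} \qquad
\vec{c} = \begin{pmatrix} -5 \\ -2 \end{pmatrix}
\]
What is the span of $\{\vec{a}, \vec{b}\}$? The span of $\{\vec{b}, \vec{c}\}$?

Figure 17: A linear combination is a sum of vectors.

\[
\text{span}\, \{\vec{a}, \vec{b}\} = \mathbb{R}^2
\]
Why? How would you represent $\vec{y} = \begin{pmatrix} 20 \\ 7 \end{pmatrix}$?
\[
\alpha_1 \vec{a} + \alpha_2 \vec{b} = \vec{y}
\quad \Leftrightarrow \quad
\alpha_1 \begin{pmatrix} 7 \\ 0 \end{pmatrix} + \alpha_2 \begin{pmatrix} 5 \\ 2 \end{pmatrix} = \begin{pmatrix} 20 \\ 7 \end{pmatrix}
\quad \Leftrightarrow \quad
\begin{pmatrix} 7 & 5 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 20 \\ 7 \end{pmatrix}
\]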

>> alpha = [7,5;0,2] \ [20;7]

alpha =

0.3571

3.5000

>> alpha(1) * [7;0] + alpha(2) * [5;2]

ans =

20
7
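\[
\text{span}\, \{\vec{c}, \vec{b}\} = \left\{ y : y = \alpha \begin{pmatrix} 5 \\ 2 \end{pmatrix} \ \text{ for some } \alpha \in \mathbb{R} \right\}
\]
Is this $\mathbb{R}^2$? Note that
\[
\vec{c} = -1 \cdot \vec{b}
\]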
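
One way to check both spans numerically is through the rank of the matrix whose columns are the vectors: rank 2 means the columns span $\mathbb{R}^2$, rank 1 means they span only a line. A minimal sketch using the vectors of this example:

```matlab
a = [7; 0];
b = [5; 2];
c = [-5; -2];
rankAB = rank([a b]);    % columns a and b
rankBC = rank([b c]);    % columns b and c, where c = -b
fprintf('rank([a b]) = %d   rank([b c]) = %d\n', rankAB, rankBC);
```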
(Definition) Linear Independence A set of vectors $x_1, ..., x_m$ is said to be linearly independent if the
linear combination of the vectors equals zero iff the scalar coefficients all equal zero.
\[
\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_m x_m = 0 \quad \Leftrightarrow \quad \alpha_1 = \alpha_2 = ... = \alpha_m = 0
\]
Example 20 (Linear Dependence in $\mathbb{R}^n$). Two vectors that are collinear in $\mathbb{R}^n$ are linearly dependent, i.e.,
\[
x = \alpha y \quad \Rightarrow \quad x + (-\alpha) y = 0 \qquad \alpha \neq 0
\]

Example 21 (Linear independence of functions). Show $\cos(x), \sin(x) \in C[a, b]$ are linearly independent.
Obviously
\[
0 \cos(x) + 0 \sin(x) = 0 \quad \forall x
\]
Conversely, we need to show that the sum of the functions equal to zero implies that the coefficients equal zero.
\[
\alpha_1 \cos(x) + \alpha_2 \sin(x) = \vec{0} \quad \forall x
\]
Here $\vec{0}$ is the zero vector of the function space, i.e., the function that equals 0 $\forall x$.
Since this is the zero vector, this holds for all $x$. In particular, let $x = 0$
\[
\alpha_1 \cos(0) + \alpha_2 \sin(0) = \alpha_1 = 0
\]
Similarly, let $x = \pi/2$
\[
\alpha_1 \cos(\pi/2) + \alpha_2 \sin(\pi/2) = \alpha_2 = 0
\]
Thus the two functions are linearly independent.
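
The same argument can be phrased as a small linear system: evaluating the two functions at the sample points $x = 0$ and $x = \pi/2$ gives a 2 × 2 matrix, and a full-rank matrix forces both coefficients to vanish. A minimal sketch (the variable names are illustrative):

```matlab
xs = [0; pi/2];                 % the two sample points used in the argument
M  = [cos(xs) sin(xs)];         % row k holds [cos(x_k) sin(x_k)]
r  = rank(M);                   % full rank: only alpha = 0 solves M*alpha = 0
alpha = M \ [0; 0];             % coefficients forced by the two conditions
fprintf('rank = %d   alpha = [%g %g]\n', r, alpha(1), alpha(2));
```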
In practical applications on $\mathbb{R}^n$ and on discretized function spaces, there exists a finite set of linearly
independent vectors that form a basis for the space. Any vector in the vector space $x \in X$ may be represented
in terms of these basis vectors.

(Definition) Vector Space Basis A set of linearly independent vectors $\{e_1, e_2, ..., e_n\}$ is said to be a basis
for $X$ if every $x \in X$ has a unique representation as a linear combination of the basis vectors.
\[
\exists (\alpha_1, \alpha_2, ..., \alpha_n) \in \mathbb{R}^n : \quad x = \alpha_1 e_1 + \alpha_2 e_2 + ... + \alpha_n e_n \qquad \forall x \in X
\]
The canonical basis for $\mathbb{R}^n$ is a common example.
\[
\begin{aligned}
e_1 &= (1, 0, 0, ..., 0) \\
e_2 &= (0, 1, 0, ..., 0) \\
&\ \ \vdots \\
e_n &= (0, 0, 0, ..., 1)
\end{aligned}
\]

Figure 18: Discretization of an Image. On pixel $\Omega_i$: $e_i(x) = 1$ for $x \in \Omega_i$ and $e_i(x) = 0$ for $x \notin \Omega_i$.

Example 22 (Discrete Image). In imaging applications, we typically assume the image of the object, $g$, that
we are taking a picture of is square integrable
\[
g \in L^2(\Omega) \equiv \left\{ f : \int f^2\, dx < \infty \right\}
\]
Here, our imaging domain, $\Omega$, is a subset of $\mathbb{R}^2$, $\Omega \subset \mathbb{R}^2$. Unfortunately, it will take an infinite number of
basis functions to represent an arbitrary image in our square integrable space, $L^2(\Omega)$. In order to represent
the image $g$ on a finite dimensional space for the computer to understand, we typically discretize the domain
into a 256×256 pixel image. We can then define the i-th basis function such that the function equals one
on the i-th pixel and zero everywhere else. An image that we would be interested in can now easily be
represented as the linear combination of the basis functions on a computer.
\[
g(x) = \sum_i \alpha_i e_i(x) \qquad (\alpha_1, \alpha_2, ..., \alpha_n) \in \mathbb{R}^{256 \times 256}
\]
and the constants of the linear combination have the interpretation of the piecewise intensity values.
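
The pixel basis can be made concrete on a tiny image. In the sketch below a 4 × 4 array stands in for the 256 × 256 image (the array and the variable names are illustrative); each basis ‘function’ is an indicator image of one pixel, and the coefficients are simply the pixel intensities.

```matlab
g = magic(4);                  % small stand-in for a 256x256 image
[nr, nc] = size(g);
alpha = g(:);                  % coefficients = piecewise intensity values
recon = zeros(nr, nc);
for i = 1:numel(alpha)
  e_i = zeros(nr, nc);         % i-th basis function: one on pixel i, zero elsewhere
  e_i(i) = 1;
  recon = recon + alpha(i)*e_i;
end
disp(isequal(recon, g))        % the linear combination reproduces the image
```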
(Definition) Dimension of a vector space A vector space $X$ is said to be finite dimensional if there is
a positive integer $n$ such that $X$ contains a linearly independent set of $n$ vectors whereas any set of $n + 1$ or
more vectors of $X$ is linearly dependent. $n$ is called the dimension of $X$, written $n = \dim X$.
\[
\dim X \equiv \{\# \text{ of linearly independent vectors}\}
\]
By definition $X = \{0\}$ is finite dimensional and $\dim X = 0$. If $X$ is not finite dimensional, it is said to be
infinite dimensional. In other words, if $n$ linearly independent vectors can be found in our vector space $X$
for an arbitrarily large $n$ then the vector space is infinite dimensional.

2.4 Normed Spaces ([Kreyszig, 1989], Section 2.2,2.3)


A norm provides a relationship between the algebraic structure and the metric.

(Definition) Norm

(N1) $\|x\| \ge 0$
(N2) $\|x\| = 0 \Leftrightarrow x = 0$
(N3) $\|\alpha x\| = |\alpha| \|x\|$
(N4) $\|x + y\| \le \|x\| + \|y\|$

The normed space is denoted by $(X, \| \cdot \|)$ or simply $X$.

The defining properties of a norm are suggested and motivated by the length of a vector in elementary vector
algebra.
• (N1) and (N2) state that all vectors have positive length except the zero vector which has zero length
• (N3) means that when a vector is multiplied by a scalar, its length is multiplied by the absolute value
of the scalar
• (N4) is the triangle inequality and means that the length of one side of a triangle cannot exceed the
sum of the length of the other two sides

Figure 19: Illustration of Triangle Inequality (N4)
Example 23 (Metric induced by the norm). A norm on $X$ defines a metric $d$ on $X$ which is given by
\[
d(x, y) = \|x - y\|
\]
and is called the metric induced by the norm. Showing that the metric induced by the norm satisfies the properties
of a metric (1) will be left as a homework exercise. The metric induced by the norm may also be shown to be
translationally invariant and to scale with the absolute value of a scalar factor.
• The metric is unchanged with respect to a translation by an arbitrary $a$
\[
d(x + a, y + a) = \|x + a - (y + a)\| = \|x - y\| = d(x, y)
\]
• The metric is scaled by $|\alpha|$ with respect to a scaling by an arbitrary constant $\alpha$
\[
d(\alpha x, \alpha y) = \|\alpha x - \alpha y\| = |\alpha| \|x - y\| = |\alpha| d(x, y)
\]

Lemma 2.2. The triangle inequality may be used to show that the norm satisfies the reverse triangle inequality
\[
\left| \|y\| - \|x\| \right| \le \|y - x\| \qquad (3)
\]
Proof.
\[
\|a\| = \|a + b - b\| \le \|a - b\| + \|b\| \quad \Rightarrow \quad \|a\| - \|b\| \le \|a - b\|
\]
Similarly
\[
\|b\| = \|b + a - a\| \le \|b - a\| + \|a\| \quad \Rightarrow \quad \|b\| - \|a\| \le \|a - b\|
\]
Combining the two inequalities gives (3).

Pay attention to this process. If a holds, then b must be true. If b is true, then c must be true. etc.
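Example 24 (Distance in $\mathbb{R}^n$). The p-norm defines a norm in $\mathbb{R}^n$
\[
\|x\|_p \equiv \left( \sum_i |x_i|^p \right)^{1/p} \quad p \ge 1 \qquad \|x\|_\infty \equiv \max_i \{|x_i|\}
\]
For $p = 2$, the metric induced by the norm is the usual distance in $\mathbb{R}^n$ between $x = (\xi_i)$ and $y = (\eta_i)$
\[
\|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (\xi_i - \eta_i)^2}
\]
Proof that this satisfies the properties of a norm is similar to the examples defining a metric and will be left
as a homework exercise. Example calculations of the 1-norm and ∞-norm are provided below.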

>> x=[9;-4;7;-10]

x =

9
-4
7

-10

>> abs(x(1))+abs(x(2)) + abs(x(3)) +abs(x(4))

ans =

30

>> max(abs(x))

ans =

10
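
The same vector can be pushed through the general definition of the p-norm and compared with the built-in norm function. The helper pnorm below is written out from the definition purely for illustration.

```matlab
x = [9; -4; 7; -10];
pnorm = @(v,p) sum(abs(v).^p)^(1/p);     % p-norm written out from the definition
n1   = pnorm(x,1);                       % 1-norm
n2   = pnorm(x,2);                       % Euclidean norm
n5   = pnorm(x,5);
ninf = max(abs(x));                      % infinity-norm
fprintf('p=1: %g  p=2: %.4f  p=5: %.4f  p=inf: %g\n', n1, n2, n5, ninf);
fprintf('built-in: %g  %.4f  %.4f  %g\n', norm(x,1), norm(x,2), norm(x,5), norm(x,Inf));
```

For a fixed vector the p-norm decreases as p grows and approaches the ∞-norm.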

Example 25 (Space of continuous functions). Consider the space of continuous functions of independent
variable $t$ over the domain $[a, b]$.
\[
C[a, b] \equiv \{f : [a, b] \to \mathbb{R} : f \text{ is continuous}\}
\]
Notice that in this space, each ‘point’ represents a function. The maximum absolute value of the function over
the domain $[a, b]$ defines a norm on this space.
\[
\|x\| = \max_{t \in [a,b]} |x(t)|
\]
Proof that this satisfies the properties of a norm is similar to the examples defining a metric and will be left
as a homework exercise.
Example 26 ($L^p$ distance between functions). The $L^p$ norm
\[
\|u\|_{L^p} \equiv \left( \int |u(x)|^p\, dx \right)^{\frac{1}{p}}
\]
defines a norm on a superset of the space of continuous functions that we have seen so far. Proof that
this satisfies the properties of a norm is more technical (it involves equivalence classes of Lebesgue measurable
functions) and the result will be assumed. If $f, g \in X$ are two images and we want to know how “close” they
are, we can quantitatively evaluate their $L^p$ distance.
\[
d(f, g) = \|f - g\|_p = \left( \int |f - g|^p\, dx \right)^{\frac{1}{p}}
\]
For $p = 2$, this may be interpreted as the usual RMS difference. A canonical example is image registration
where we want the “distance” between two “registered” images to be as small as possible.
We have seen the $L^2$ distance between images. Numerical values for other norms may be computed in a
similar fashion, $I : [0, 1] \times [0, 1] \subset \mathbb{R}^2 \to \mathbb{R}$ and $J : [0, 1] \times [0, 1] \subset \mathbb{R}^2 \to \mathbb{R}$.
\[
\|I - J\|_1 = \int_0^1 \int_0^1 |I(x, y) - J(x, y)|\, dx\, dy \approx \sum_{i,j} |I(i \cdot \Delta x, j \cdot \Delta y) - J(i \cdot \Delta x, j \cdot \Delta y)|\, \Delta x\, \Delta y
\]
\[
\|I - J\|_5 = \left( \int_0^1 \int_0^1 |I(x, y) - J(x, y)|^5\, dx\, dy \right)^{\frac{1}{5}} \approx \left( \sum_{i,j} |I(i \cdot \Delta x, j \cdot \Delta y) - J(i \cdot \Delta x, j \cdot \Delta y)|^5\, \Delta x\, \Delta y \right)^{\frac{1}{5}}
\]

>> echo on
>> ExLOneFiveImageDistance
close all
clear all
delta = 5.e-4;
[X,Y] = meshgrid([0:delta:1],[0:delta:1]);
I = X + sin(pi*Y);
J = sin(pi*Y);
handle = figure; imagesc(I)
%saveas(handle, 'ImageDistanceOne', 'png')
handle = figure; imagesc(J)
%saveas(handle, 'ImageDistanceTwo', 'png')
norm(I(:)-J(:),1)*delta*delta

ans =

0.5005

norm(I(:)-J(:),5)*(delta*delta)^(1/5)

ans =

0.6991

Figure 20: Distance Between Images. I(x, y) = x + sin(π y), J(x, y) = sin(π y).

Figure 21: Two distance measures may have the same quantitative value, but the difference can be quite
different.

∗∗
2.5 Continuity and Convergence ([Kreyszig, 1989], Section 1.4)
The concept of the distance measure provided by our metric and norm allows us to precisely define, at a very basic level, the concept of continuity of a function and convergence of a sequence of vectors. Convergence in normed spaces is motivated by the metric induced by the norm, $d(x, y) = \|x - y\|$.
(Definition) Convergence of a sequence, limit A sequence $(x_n)$ in a metric space X = (X, d) is said to converge or to be convergent if there is an x ∈ X such that
\[ \lim_{n \to \infty} d(x_n, x) = 0 \qquad \lim_{n \to \infty} \|x_n - x\| = 0 \]

x is called the limit of $(x_n)$ and we write

\[ \lim_{n \to \infty} x_n = x \]
or simply
\[ x_n \to x \]
We say $(x_n)$ converges to x.
Notice that the metric induced by the norm is used here to relate convergence in an abstract vector space X back to convergence on the real line $\mathbb{R}$, which we are very familiar with. The metric yields a sequence $(a_n)$ of real numbers, where each $a_n = d(x_n, x)$. Now we can understand convergence in an arbitrary metric space in terms of the usual ε-N notation.
\[ x_n \to x \;\equiv\; \text{Given } \epsilon > 0 \;\; \exists N : \; x_n \in \underbrace{\{ y : d(y, x) = \|y - x\| < \epsilon \}}_{N(x,\epsilon) \text{ neighborhood about } x} \quad \forall n > N \]
Figure 22: sN (x) converges pointwise to 0 at a particular point x. However, this sequence does not converge
to the 0 function in the mean square or L2 sense.

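Example 27 (Pointwise convergence). In general we are accustomed to pointwise convergence. That is, for a given t
\[ x(t) - \sum_{i=1}^{n} \alpha_i e_i(t) \to 0 \qquad n \to \infty \quad \forall t \]
This is different from mean square convergence, or convergence with respect to the norm
\[ \left\| x(t) - \sum_{i=1}^{n} \alpha_i e_i(t) \right\| \to 0 \qquad n \to \infty \]

Consider the sequence $s_N(x)$, Figure 22.

\[ s_N(x) = \begin{cases} 0 & x = 0 \\ \sqrt{N} & 0 < x < \frac{1}{N} \\ 0 & \frac{1}{N} \le x \le 1 \end{cases} \]

\[ \int_0^1 (s_N(x) - 0)^2 \, dx = \int_0^{1/N} N \, dx = 1 \ne \underbrace{0}_{N \to \infty} \]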
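
The difference between the two modes of convergence can be seen numerically. The following is a small sketch, assuming a midpoint-rule grid on [0, 1] and an arbitrarily chosen observation point x0 = 0.05 (both are illustrative choices, not part of the example itself).

```matlab
% s_N(x) = sqrt(N) on (0, 1/N) and 0 elsewhere on [0, 1]
sN = @(x, N) sqrt(N) .* (x > 0 & x < 1/N);

dx = 1e-5;
x  = dx/2 : dx : 1;          % midpoints of a uniform grid on [0, 1]
x0 = 0.05;                   % fixed point at which s_N(x0) is observed

Nvals   = [10 100 1000];
atPoint = zeros(size(Nvals));
l2sq    = zeros(size(Nvals));
for k = 1:numel(Nvals)
    N = Nvals(k);
    atPoint(k) = sN(x0, N);              % pointwise value at x0
    l2sq(k)    = sum(sN(x, N).^2) * dx;  % midpoint rule for the squared L2 norm
end
disp([Nvals; atPoint; l2sq])
```

The second row of the displayed table drops to zero once 1/N < x0, while the third row stays near 1 for every N.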

Example 28 (Optimization). In optimization we typically search for a solution such that the gradient of the objective function converges to zero.
\[ \nabla f_k \to 0 \]
Familiar properties from calculus carry over into our vector space with a metric and norm defined.
Example 29 (Convergence of Sequences). The metric of two converging sequences converges to the metric
of the limit.
xn → x and yn → y ⇒ d(xn , yn ) → d(x, y)
Proof of this is left as a homework exercise.
Example 30 (Convergence of Sequences In Normed Space). The norm of the sum of two converging sequences converges to the norm of the sum of the limits.

\[ x_n \to x \text{ and } y_n \to y \;\Rightarrow\; \|x_n + y_n\| \to \|x + y\| \]

Using the triangle inequality

\[ \|x_n + y_n\| = \|x_n \pm x + y_n\| \le \|x_n - x\| + \|x \pm y + y_n\| \le \|x_n - x\| + \|x + y\| + \|y_n - y\| \]

Rearranging
\[ \|x_n + y_n\| - \|x + y\| \le \|x_n - x\| + \|y - y_n\| \]
Interchanging $x_n$ with x and $y_n$ with y

\[ \|x + y\| \le \|x - x_n\| + \|x_n + y\| \le \|x_n - x\| + \|x_n + y_n\| + \|y - y_n\| \;\Rightarrow\; \|x + y\| - \|x_n + y_n\| \le \|x_n - x\| + \|y - y_n\| \]

Hence, the difference in the norms is bounded by a sequence of numbers converging to zero

\[ \big| \|x + y\| - \|x_n + y_n\| \big| \le \|x_n - x\| + \|y - y_n\| \to 0 \]

as n → ∞
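
The final bound can be illustrated with two concrete sequences in $\mathbb{R}^2$. This is only a sketch: the limits x, y and the perturbations added to them are hypothetical values picked for the demonstration.

```matlab
% two sequences converging to x and y in R^2 (illustrative choices)
x = [1; 2];
y = [-3; 0.5];
for n = [1 10 100 1000]
    xn = x + [1; -1]/n;      % x_n -> x
    yn = y + [2;  1]/n^2;    % y_n -> y
    lhs = abs(norm(x + y) - norm(xn + yn));   % | ||x+y|| - ||x_n+y_n|| |
    rhs = norm(xn - x) + norm(y - yn);        % bounding sequence
    fprintf('n = %4d  lhs = %.3e  rhs = %.3e\n', n, lhs, rhs);
end
```

In each printed row the left-hand side stays below the right-hand side, and both shrink as n grows.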

∗∗
2.5.1 Continuity

Figure 23: Continuity. A mapping T : X → Y is said to be continuous at a point $x_0$ if for each ε > 0 (no matter how small) there corresponds a $\delta(\epsilon, x_0)$ such that $\tilde{d}(Tx, Tx_0) < \epsilon$ whenever $d(x, x_0) < \delta$.
(Definition) Continuity Let X = (X, d) and Y = (Y, $\tilde{d}$) be metric spaces. A mapping T : X → Y is said to be continuous at a point $x_0$ if for each ε > 0 (no matter how small) there corresponds a $\delta(\epsilon, x_0)$ such that $\tilde{d}(Tx, Tx_0) < \epsilon$ whenever $d(x, x_0) < \delta$, Figure 23.

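Example 31 (Continuity of f(x) = 1/x). Consider f(x) = 1/x over [.1, 1], ie not including 0. Here X = Y = $\mathbb{R}$ and $\tilde{d}(x, y) = d(x, y) = |x - y|$.

• For the point in question, $x_0$, draw an arbitrarily small ε band about its mapping, $f(x_0) = 1/x_0$ (this is the “for any ε > 0” part), and where it intersects the graph (A and B) drop verticals to the x-axis.

\[ A = \frac{1}{x_0} + \epsilon = \frac{1}{x_0 - a} \;\Rightarrow\; x_0 - a = \frac{1}{\frac{1}{x_0} + \epsilon} = \frac{x_0}{1 + x_0 \epsilon} \;\Rightarrow\; a = x_0 - \frac{x_0}{1 + x_0 \epsilon} = \frac{x_0^2 \epsilon}{1 + x_0 \epsilon} \]
\[ B = \frac{1}{x_0} - \epsilon = \frac{1}{x_0 + b} \;\Rightarrow\; x_0 + b = \frac{1}{\frac{1}{x_0} - \epsilon} = \frac{x_0}{1 - x_0 \epsilon} \;\Rightarrow\; b = \frac{x_0}{1 - x_0 \epsilon} - x_0 = \frac{x_0^2 \epsilon}{1 - x_0 \epsilon} \]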

Figure 24: By inspection, the function is continuous at a point f (x0 ) = 1/x0 but to show this, we must
identify a δ as a function of the distance from the mapping value, , and the continuity point in question x0 .

• Observing that a < b, choose the smallest bound.

\[ \delta = a = \frac{x_0^2 \epsilon}{1 + x_0 \epsilon} \]
$|x - x_0| < \delta$ is the centered interval denoted by the small parentheses. We can see from the graph that if x is closer to $x_0$ than δ, ie $|x - x_0| < \delta$, then f(x) will be within ε of $1/x_0$ as desired.
• In summary, you give me an arbitrarily small ε > 0, then I can give you back δ as a function of ε and $x_0$
\[ \delta(\epsilon, x_0) = x_0 - \frac{x_0}{x_0 \epsilon + 1} = \frac{x_0^2 \epsilon}{1 + x_0 \epsilon} \]
such that for any point bounded by δ the mapping is within the original tolerance ε.

\[ |x - x_0| < \delta \;\Rightarrow\; \text{(work out the algebra)} \;\Rightarrow\; |1/x - 1/x_0| < \epsilon \]
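
The choice of δ can be sanity-checked numerically before working out the algebra. The sketch below assumes the sample values x0 = 0.5 and ε = 0.1, which are illustrative only, and tests the implication on a grid of points strictly inside the δ-interval.

```matlab
x0 = 0.5;                                % continuity point (illustrative value)
epsilon = 0.1;                           % requested tolerance (illustrative value)
delta = x0^2*epsilon/(1 + x0*epsilon);   % delta(epsilon, x0) from the formula above

x = linspace(x0 - delta, x0 + delta, 1001);
x = x(2:end-1);                          % keep only points with |x - x0| < delta
worst = max(abs(1./x - 1/x0));           % largest deviation of f over the interval
fprintf('delta = %g, worst deviation = %g, epsilon = %g\n', delta, worst, epsilon);
```

The worst deviation approaches ε at the left end of the interval, which is where the bound a was derived.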

2.6 Finite Dimensional Spaces ([Kreyszig, 1989], Section 2.4)


(Definition) Equivalent norms A norm $\|\cdot\|$ on a vector space X is said to be equivalent to a norm $\|\cdot\|_0$ on X if there are positive numbers a and b such that

\[ a \|x\|_0 \le \|x\| \le b \|x\|_0 \quad \forall x \in X \]

Theorem 2.3 (Equivalent norms). On a finite dimensional vector space X, any norm $\|\cdot\|$ is equivalent to any other norm $\|\cdot\|_0$
Proof. See [Kreyszig, 1989], Theorem 2.4-5
Notice that this implies that convergence or divergence of a sequence does not depend on the choice of the norm. For example, a sequence $s_k$ that converges to $\vec{0}$ in the 2-norm also converges in the 1-norm.
\[ \|s_k - \vec{0}\|_1 \le \sqrt{n} \, \|s_k - \vec{0}\|_2 \to 0 \]

Remark It may be shown (as a homework exercise) that $\|\cdot\|_1$ and $\|\cdot\|_2$ satisfy
\[ \frac{1}{\sqrt{n}} \|x\|_1 \le \|x\|_2 \le \|x\|_1 \quad \forall x \qquad (4) \]
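
Inequality (4) is easy to spot-check numerically. The sketch below reuses the vector from the earlier norm example and then tries a batch of random vectors; the random trials and the small tolerance 1e-12 are assumptions made for the illustration, and a numerical check is of course not a proof.

```matlab
x = [9; -4; 7; -10];
n = numel(x);
lower  = norm(x, 1)/sqrt(n);   % left-hand side of (4)
middle = norm(x, 2);
upper  = norm(x, 1);           % right-hand side of (4)
fprintf('%g <= %g <= %g\n', lower, middle, upper);

% repeat for random vectors of the same dimension
ok = true;
for trial = 1:1000
    v = randn(n, 1);
    ok = ok && (norm(v,1)/sqrt(n) <= norm(v,2) + 1e-12) ...
            && (norm(v,2) <= norm(v,1) + 1e-12);
end
```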
Within the context of optimization, many compressed sensing applications involve the reconstruction of some image, $x \in \mathbb{R}^n$, from some measurements $b \in \mathbb{R}^m$, and the measurements are assumed to be a linear transformation of the image, Ax = b. There are generally infinitely many images that satisfy the measurement data, since $N(A) \ne \{0\}$, and we wish to optimize with respect to a particular norm $\|\cdot\|$. In optimization theory, we may typically choose to optimize the 1-norm over the 2-norm.

\[ \min_x \|x\|_1 \text{ such that } \underbrace{Ax = b}_{N(A) \ne \{0\}} \qquad \text{vs} \qquad \min_x \|x\|_2 \text{ such that } \underbrace{Ax = b}_{N(A) \ne \{0\}} \]

Figure 25: A 2D $\|\cdot\|_1$ vs $\|\cdot\|_2$ optimization is shown with respect to the constraint, y = m · x + b. Notice that iso-distance lines of the $\|\cdot\|_2$ are ‘circles’ while iso-distance lines of the $\|\cdot\|_1$ are ‘diamonds’. These are distinct functions, $f(x, y) = \|\cdot\|_1 \ne \hat{f}(x, y) = \|\cdot\|_2$, and minimizing these norms with respect to the constraint, y = m · x + b, leads to different solutions to the optimization problem.
http://www.cse.illinois.edu/iem/linear equations/pnorms

>> echo on
>> ExL1vsL2min

% ExL1vsL2min.m
clear all
close all
delta = 5.e-2;
bound = 6
xcoord = [-bound:delta:bound];
[X,Y] = meshgrid( xcoord , xcoord );

% 2-norm and 1-norm evaluated at every grid point
g = sqrt( X.^2 + Y.^2) ;
h = abs(X) + abs(Y) ;

% contour levels for the iso-distance lines
V = [ 1 2 3 4 5 ];
handle = figure;
set(gcf,'renderer','zbuffer');
set(gca,'FontSize',16)
contour(X,Y,g,V,'--')
hold on
contour(X,Y,h,V,'k-' )

% line constraint y = m*x + b
slope = 0.5;
intercept = 4.1;
ycoord = slope * xcoord + intercept ;
plot(xcoord ,ycoord ,'r');

% point on the line closest to the origin in the 2-norm
twomin = [ - slope * intercept / (1.+slope^2);...
           - slope^2 * intercept / (1.+slope^2) + intercept];

% point on the line closest to the origin in the 1-norm
% (one of the two axis intercepts)
if (intercept < intercept/slope)
    onemin = [ 0; intercept];
else
    onemin = [ -intercept/slope; 0];
end

xloc = 2;
text(xloc,slope*xloc + intercept,' y = m \cdot x + b',...
     'HorizontalAlignment','left','FontSize',14)
text(4/sqrt(2),4/sqrt(2),' ||x||_2 = c = 4 ',...
     'HorizontalAlignment','left','FontSize',14)
text(2,-2,' ||x||_1 = c = 4 ',...
     'HorizontalAlignment','left','FontSize',14)

plot(onemin(1),onemin(2),'x')
text(onemin(1),onemin(2),'min ||x||_1',...
     'HorizontalAlignment','left','FontSize',14)

plot(twomin(1),twomin(2),'o')
text(twomin(1),twomin(2),'min ||x||_2',...
     'HorizontalAlignment','left','FontSize',14)

xlabel('x')
ylabel('y')

saveas(handle,'L1vsL2Min','png')
Numerically, we will see that minimization with respect to the $\|\cdot\|_1$ has advantages over the $\|\cdot\|_2$. Intuitively, from Figure 25, in 2D we see that the $\|\cdot\|_2$ solution occurs where the smallest circle intersects the line constraint. Similarly, the $\|\cdot\|_1$ solution occurs where the smallest diamond intersects the line constraint. In higher dimensions, we will see that these same properties hold (ie intersect a ‘diamond’ with a ‘line’ in 100D) and minimization with respect to the 1-norm promotes a sparse representation of the solution. Here a ‘sparse’ solution means that the number of nonzero entries is much less than the dimension of the image space $\mathbb{R}^n$, # nonzero << n. The solution of the $\|\cdot\|_1$ problem typically lies on a corner of a $\|\cdot\|_1$-ball. Because the corners lie on the coordinate axes, most of the coordinates of the solution are zero (which is a little hard to imagine for a 6 million dimensional $\|\cdot\|_1$-ball). The solution of the $\|\cdot\|_2$ problem lies on the $\|\cdot\|_2$-ball, which has no corners, and thus is not driven toward a sparse solution. Sparsity has applications in efficiently storing the solution, analogous to jpeg compression, reducing image acquisition time, and improving computational efficiency.
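In general the 1-norm solution $\hat{x}$ and the 2-norm solution $x^*$ will not be the same, $\hat{x} \ne x^*$. However the equivalence of norms (4) may be used to derive the expected relationships between the two solutions.

\[ \|x^*\|_2 = \min_x \|x\|_2 : Ax = b \;\Rightarrow\; \|x^*\|_2 \underbrace{\le}_{\text{defn of min}} \|\hat{x}\|_2 \underbrace{\le}_{\text{equiv of norm}} \|\hat{x}\|_1 \]

\[ \|\hat{x}\|_1 = \min_x \|x\|_1 : Ax = b \;\Rightarrow\; \|\hat{x}\|_1 \underbrace{\le}_{\text{defn of min}} \|x^*\|_1 \underbrace{\le}_{\text{equiv of norm}} \sqrt{n} \, \|x^*\|_2 \]

Hence the 2-norm of the 2-norm solution is no greater than the 1-norm of the 1-norm solution, and the 1-norm of the 1-norm solution is no greater than a constant times the 2-norm of the 2-norm solution
\[ \|x^*\|_2 \le \|\hat{x}\|_1 \le \sqrt{n} \, \|x^*\|_2 \qquad x^* \ne \hat{x} \]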
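
These bounds can be checked on the 2D line constraint of Figure 25. The sketch below is an illustration under stated assumptions: the line y = m·x + b is rewritten as the single equation A*[x; y] = b with A = [-m 1], the 2-norm minimizer is computed from the normal equations, and the 1-norm minimizer is taken as the better of the two axis intercepts (which is where a convex piecewise linear function restricted to a line attains its minimum in this 2D case).

```matlab
slope = 0.5; intercept = 4.1;
A = [-slope 1];                  % constraint y = slope*x + intercept as A*[x;y] = b
b = intercept;
n = 2;

xstar = A' * ((A*A') \ b);       % minimum 2-norm solution of A*x = b

% candidate 1-norm minimizers: the two axis intercepts of the line
cand = [0, -intercept/slope; intercept, 0];   % one candidate per column
[~, idx] = min(sum(abs(cand), 1));
xhat = cand(:, idx);             % minimum 1-norm solution

fprintf('||x*||_2 = %g <= ||xhat||_1 = %g <= sqrt(n)||x*||_2 = %g\n', ...
        norm(xstar, 2), norm(xhat, 1), sqrt(n)*norm(xstar, 2));
```

Note that xhat has a zero entry (it is sparse) while xstar does not.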

3 Linear operators and Solvability of a Linear System of Equations


As a motivating example for our study of linear systems, consider the linear system of equations that arises from our image reconstruction algorithm on our state of the art nine-pixel CT scanner. The signal loss, p, across the image is measured through an integral of the attenuation µ(x). Discretizing our continuous space by pixels, the integral reduces to a Riemann sum for each measurement $p_j$.
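Figure 26: Image Reconstruction. A 3 × 3 grid of pixels with attenuation values $\mu_1, \mu_2, \mu_3$ in the first row, $\mu_4, \mu_5, \mu_6$ in the second row and $\mu_7, \mu_8, \mu_9$ in the third row, with pixel size ∆x = ∆y. Measurements $p_1, p_2, p_3$ are taken along the rows, $p_4, p_5, p_6$ along the columns, and $p_7, p_8, p_9$ along diagonal paths.

\[ \frac{I_d}{I_0} = \exp\left( -\int \mu(s) \, ds \right) \qquad p = \ln\left( \frac{I_0}{I_d} \right) = \int \mu(s) \, ds \]

\[ \begin{aligned}
p_1 &= \mu_1 \Delta x + \mu_2 \Delta x + \mu_3 \Delta x \\
p_2 &= \mu_4 \Delta x + \mu_5 \Delta x + \mu_6 \Delta x \\
p_3 &= \mu_7 \Delta x + \mu_8 \Delta x + \mu_9 \Delta x \\
p_4 &= \mu_1 \Delta x + \mu_4 \Delta x + \mu_7 \Delta x \\
p_5 &= \mu_2 \Delta x + \mu_5 \Delta x + \mu_8 \Delta x \\
p_6 &= \mu_3 \Delta x + \mu_6 \Delta x + \mu_9 \Delta x \\
p_7 &= \mu_1 (2 - \sqrt{2}) \Delta x + \mu_2 \, 2(\sqrt{2} - 1) \Delta x + \mu_4 \, 2(\sqrt{2} - 1) \Delta x \\
p_8 &= \mu_3 \sqrt{2} \Delta x + \mu_5 \sqrt{2} \Delta x + \mu_7 \sqrt{2} \Delta x \\
p_9 &= \mu_6 \, 2(\sqrt{2} - 1) \Delta x + \mu_8 \, 2(\sqrt{2} - 1) \Delta x + \mu_9 (2 - \sqrt{2}) \Delta x
\end{aligned} \]

The resulting system of equations is of the form

\[ \underbrace{\begin{pmatrix}
\Delta x & \Delta x & \Delta x & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \Delta x & \Delta x & \Delta x & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \Delta x & \Delta x & \Delta x \\
\Delta x & 0 & 0 & \Delta x & 0 & 0 & \Delta x & 0 & 0 \\
0 & \Delta x & 0 & 0 & \Delta x & 0 & 0 & \Delta x & 0 \\
0 & 0 & \Delta x & 0 & 0 & \Delta x & 0 & 0 & \Delta x \\
(2 - \sqrt{2}) \Delta x & 2(\sqrt{2} - 1) \Delta x & 0 & 2(\sqrt{2} - 1) \Delta x & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \sqrt{2} \Delta x & 0 & \sqrt{2} \Delta x & 0 & \sqrt{2} \Delta x & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 2(\sqrt{2} - 1) \Delta x & 0 & 2(\sqrt{2} - 1) \Delta x & (2 - \sqrt{2}) \Delta x
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \\ \mu_7 \\ \mu_8 \\ \mu_9 \end{pmatrix}}_{\vec{x}}
=
\underbrace{\begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6 \\ p_7 \\ p_8 \\ p_9 \end{pmatrix}}_{\vec{b}} \]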

How would you typically approach questions such as


• Does a solution exist ?

• Is it unique?
• How does uncertainty due to precision and accuracy limitations in the measurement, ∆b affect the
solution we obtain from a canned MATLAB program ?
Using the language of norms and vector spaces that we have been studying, we can begin to answer these questions and understand the effect of the conditioning on the solution in quite some detail.
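
As a first look at these questions, the nine-pixel system matrix can be assembled and inspected directly. The sketch below assumes ∆x = 1 (an arbitrary choice for illustration) and uses the coefficients of the measurement equations $p_1, \ldots, p_9$ given above. Since the three row measurements and the three column measurements both add up to the sum of all nine pixels, the rows of A cannot be linearly independent.

```matlab
dx = 1;                      % pixel size, assumed equal to 1 for illustration
s2 = sqrt(2);
c1 = (2 - s2)*dx;            % corner pixel path length
c2 = 2*(s2 - 1)*dx;          % off-corner pixel path length
c3 = s2*dx;                  % full diagonal path length through a pixel

A = [ dx dx dx  0  0  0  0  0  0;
       0  0  0 dx dx dx  0  0  0;
       0  0  0  0  0  0 dx dx dx;
      dx  0  0 dx  0  0 dx  0  0;
       0 dx  0  0 dx  0  0 dx  0;
       0  0 dx  0  0 dx  0  0 dx;
      c1 c2  0 c2  0  0  0  0  0;
       0  0 c3  0 c3  0 c3  0  0;
       0  0  0  0  0 c2  0 c2 c1];

% rows 1-3 and rows 4-6 both sum to a row of dx, so the rows are dependent
rowcheck = norm(sum(A(1:3,:), 1) - sum(A(4:6,:), 1));
r = rank(A);
fprintf('rank(A) = %d for a %d x %d system\n', r, size(A,1), size(A,2));
```

A rank smaller than 9 already signals that the existence and uniqueness questions above do not have trivial answers for this system.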

3.1 Linear Operator; Null space; Range Space ([Kreyszig, 1989], Section 2.6,2.9)
Linear operator theory is fundamental to all practical applications. The linear operator, T, is understood as a mapping from the domain, D(T) ⊂ X, to the range, R(T) ⊂ Y.

\[ T : D(T) \to R(T) \qquad R(T) \equiv \{ y : y = Tx \text{ for some } x \in D(T) \} \]

The operator may be defined on a restricted subset of the space X, but in most applications the domain and the range are the full spaces, X and Y, respectively, and we write

\[ T : X \to Y \]

(Definition) Linear Operator A linear operator T is an operator such that

• The domain D(T ) of T is a vector space and the range R(T ) lies in a vector space over the same field.
• A linear operator satisfies the following property

T (αx + βy) = αT x + βT y ∀x, y ∈ D(T ) ∀α, β ∈ R

By definition the null space, N (T ), denotes the set of all x ∈ D(T ) such that T x = 0

N (T ) = {x : T x = 0}

Notice that letting the scalars α = β = 0 implies that the zero vector is in the null space

\[ T0 = 0 \]

There are many examples of linear operators in addition to the matrix operators that we are accustomed to from linear algebra.
Example 32 (Identity Operator). The identity operator IX : X → X is defined by IX x = x ∀x ∈ X
We typically write
Ix = x I(αx + βy) = αx + βy = αIx + βIy
Example 33 (Zero Operator). The zero operator 0 : X → Y is defined by 0x = 0 x∈X
Example 34 (Differentiation). Let X = P[a, b] be the vector space of all polynomials on [a, b]. We may define a linear operator T on X by setting
\[ Tx(t) = x'(t) \quad \forall x \in X \]
By linearity of differentiation
\[ T(\alpha x + \beta y) = (\alpha x(t) + \beta y(t))' = \alpha x'(t) + \beta y'(t) = \alpha Tx + \beta Ty \]

Here the prime denotes classical differentiation and the operator T maps X into itself.
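
The linearity of the differentiation operator can be illustrated on polynomial coefficient vectors using MATLAB's polyder. This is a small sketch; the two polynomials and the scalars are arbitrary illustrative choices.

```matlab
% polynomials stored as coefficient vectors, highest power first
x = [1 2 3];       % x(t) = t^2 + 2t + 3
y = [4 0 -1];      % y(t) = 4t^2 - 1
alpha = 3.3;
beta  = 2.7;

lhs = polyder(alpha*x + beta*y);              % T(alpha x + beta y)
rhs = alpha*polyder(x) + beta*polyder(y);     % alpha T x + beta T y
difference = norm(lhs - rhs)
```

Up to rounding, the two coefficient vectors agree, as the linearity property requires.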
Example 35 (Integration). A linear operator T from C[a, b] into itself can be defined by
\[ y(t) = Tx(t) = \int_a^t x(\tau) \, d\tau \]

By linearity of integration
\[ T(\alpha x + \beta y) = \int_a^t \left( \alpha x(\tau) + \beta y(\tau) \right) d\tau = \alpha \int_a^t x(\tau) \, d\tau + \beta \int_a^t y(\tau) \, d\tau = \alpha Tx + \beta Ty \]

Example 36 (Multiplication). Another linear operator from C[a, b] into itself is defined by

T x(t) = tx(t)

Proof is left as homework exercise


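Example 37 (Elementary vector algebra). Cross product and dot product with one argument fixed define linear operators $T_1 : \mathbb{R}^3 \to \mathbb{R}^3$, $T_2 : \mathbb{R}^3 \to \mathbb{R}$

\[ T_1 x = x \times a \quad \forall x \in \mathbb{R}^3 \]
\[ T_2 x = x \cdot a \quad \forall x \in \mathbb{R}^3 \]

Here $a \in \mathbb{R}^3$ is fixed. Proof is left as a homework exercise.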


Example 38 (Fourier Transform). The Fourier Transform that will be discussed in the later parts of this
course is an example of a linear operator.
(Definition) Point Spread Function Consider an imaging system, L, that transforms an exact object $I : \mathbb{R}^n \to \mathbb{R}$ to an imaged object $\hat{I} : \mathbb{R}^n \to \mathbb{R}$.
\[ \hat{I} = L I \]
The point spread function h is defined as the action of the operator on the delta functional.

\[ h(x, y) \equiv L \, \delta(x - y) \]

If we further assume that the linear operator is shift invariant

\[ \text{Shift Invariant} \;\Rightarrow\; h(x - y) = L \, \delta(x - y) \]

A given image I may be decomposed as the convolution with a delta functional.

\[ I(x) = \int I(y) \delta(x - y) \, dy \]

Applying this decomposition within the operator leads to a representation of the transformed object.
\[ \begin{aligned}
L I(x) &= L \int I(y) \delta(x - y) \, dy = L \left( \lim_{\Delta y \to 0} \sum_i I(y_i) \delta(x - y_i) \Delta y \right) \\
&= \underbrace{\lim_{\Delta y \to 0} \sum_i I(y_i) L \delta(x - y_i) \Delta y}_{\text{linearity of } L} = \int I(y) L \delta(x - y) \, dy \\
&= \underbrace{\int I(y) h(x, y) \, dy}_{\text{defn of } h} = \underbrace{\int I(y) h(x - y) \, dy}_{\text{shift invariant}}
\end{aligned} \]

Hence any image from the system may be understood as the convolution with the point spread function.

\[ \hat{I} = I * h \]

Example 39 (Point Spread Function Applied to 1D image). Consider a 1D image I : [a, b] → $\mathbb{R}$. Given a shift invariant point spread function h, the imaged object may be represented as
\[ \hat{I}(x_i) = \int I(y) h(x_i - y) \, dy \approx \sum_{j=0}^{N-1} I(y_j) h(x_i - y_j) \, \delta y \qquad i = 0, ..., N - 1 \]

This may be written as a system of equations

\[ \hat{I}[i] = \hat{I}(a + i \, dy) \qquad I[j] = I(a + j \, dy) \]
\[ \begin{pmatrix} \hat{I}[0] \\ \hat{I}[1] \\ \vdots \\ \hat{I}[N-1] \end{pmatrix}
= \underbrace{\begin{pmatrix}
h[0] & h[-1] & h[-2] & h[-3] & h[-4] & \dots \\
h[1] & h[0] & h[-1] & h[-2] & h[-3] & \dots \\
h[2] & h[1] & h[0] & h[-1] & h[-2] & \dots \\
h[3] & h[2] & h[1] & h[0] & h[-1] & \dots \\
h[4] & h[3] & h[2] & h[1] & h[0] & \dots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}}_{\equiv L}
\begin{pmatrix} I[0] \\ I[1] \\ \vdots \\ I[N-1] \end{pmatrix} \]

where the entry in row i and column j is h[i − j]. The PSF for Fourier based MR recon is

\[ h(x) = \Delta k \, \frac{\sin(\pi N \Delta k \, x)}{\sin(\pi \Delta k \, x)} \]

Given an ‘exact’ object

\[ I = \begin{cases} 1 & a < x < b \\ 0 & \text{otherwise} \end{cases} \]
The image obtained during the reconstruction is the convolution.

Figure 27: Point Spread Function for Image Reconstruction. Iˆ = I ∗ h

Notice that the convolution operator is commutative. Given a phantom of known geometry I and the reconstructed image $\hat{I}$, the PSF that characterizes the system may be found from the solution of the linear system of equations (Homework exercise).
\[ \hat{I} = I * h = h * I =
\underbrace{\begin{pmatrix}
I[0] & I[-1] & I[-2] & I[-3] & I[-4] & \dots \\
I[1] & I[0] & I[-1] & I[-2] & I[-3] & \dots \\
I[2] & I[1] & I[0] & I[-1] & I[-2] & \dots \\
I[3] & I[2] & I[1] & I[0] & I[-1] & \dots \\
I[4] & I[3] & I[2] & I[1] & I[0] & \dots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}}_{\equiv A}
\underbrace{\begin{pmatrix} h[0] \\ h[1] \\ \vdots \\ h[N-1] \end{pmatrix}}_{\equiv \vec{h}}
= A \vec{h} \]

where the entry of A in row i and column j is I[i − j].

% ExPSFFourier.m
clear all
close all

FOV = 20;
deltax = FOV/400
epsilon = 1.e-6
% the grid is offset by epsilon to avoid 0/0 in the PSF at x = 0
xx = [epsilon:deltax:FOV/2];
% 'exact' object: 1 for 5 < x < 8 and 0 otherwise
% (heaviside is part of the Symbolic Math Toolbox)
Image = heaviside(xx - 5) - heaviside(xx - 8);

deltak = 1/FOV
N = 64;
ImpulseResponse = deltak * sin(pi * N * deltak * xx).* (sin(pi * deltak * xx)).^(-1);

% symmetric Toeplitz matrix built from the sampled PSF
A = toeplitz(ImpulseResponse );
ImageHat = A * Image';

handle1 = figure(1)
plot(xx ,Image,'r')
hold on
plot(xx ,ImageHat,'k--')

% reflect ImpulseResponse for plotting
reflectxx = [sort(-xx) xx];
ReflectImpulseResponse = deltak * sin(pi * N * deltak * reflectxx).* (sin(pi * deltak * reflectxx)).^(-1);
plot(reflectxx ,ReflectImpulseResponse )

set(gca,'FontSize',16)
xlabel('x')
legend('Exact', 'Image', 'PSF' ,'Location','NorthWest')
saveas(handle1,'PSFMR1D','png')
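
The connection between the Toeplitz matrix and convolution can be seen on a tiny example. The sketch below is not the MR point spread function: it assumes a hypothetical three-point symmetric kernel and a short step-like object, and compares the matrix-vector product with MATLAB's conv using the 'same' option (which zero-pads at the boundaries).

```matlab
N = 8;
I = [0 0 1 1 1 0 0 0]';          % 'exact' 1D object (hypothetical values)
h = [0.25 0.5 0.25];             % symmetric kernel: h[-1], h[0], h[1]

c = zeros(N, 1);
c(1) = h(2);                     % h[0] on the diagonal
c(2) = h(3);                     % h[1] on the first off-diagonals
L = toeplitz(c);                 % L(i,j) = h[i-j], symmetric since h[-k] = h[k]

IhatMatrix = L * I;              % matrix form of the convolution
IhatConv   = conv(I, h(:), 'same');
err = max(abs(IhatMatrix - IhatConv))
```

For a non-symmetric kernel, toeplitz would need both a first column and a first row, toeplitz(c, r).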

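Example 40 (Null Space of a matrix). Given an operator T : X → Y, it is important to realize that the null space of an operator is also a vector space that is generally a subset of the domain of the operator, N(T) ⊂ X. As an example let us compute the null space of the matrix operator, L ≡ A
\[ A = \begin{pmatrix} 2 & 1 \\ -4 & -2 \end{pmatrix} \]

Recall the definition. The null space is the set of all vectors, z, that map to zero.

\[ Az = \begin{pmatrix} 2 & 1 \\ -4 & -2 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = 0
\;\Leftrightarrow\; \begin{array}{r} 2 \cdot (2 z_1 + z_2 = 0) \\ + (-4 z_1 - 2 z_2 = 0) \end{array}
\;\Leftrightarrow\; \begin{array}{l} 2 z_1 + z_2 = 0 \\ 0 z_1 + 0 z_2 = 0 \end{array}
\;\Leftrightarrow\; z_1 = -\frac{1}{2} z_2 \]

Hence, the null space of this operator is all vectors whose first component is −1/2 times the second component.

\[ N(A) = \left\{ \begin{pmatrix} -\frac{1}{2} \beta \\ \beta \end{pmatrix} : \beta \in \mathbb{R} \right\}
\;\Leftrightarrow\;
\begin{pmatrix} 2 & 1 \\ -4 & -2 \end{pmatrix} \begin{pmatrix} -\frac{1}{2} \beta \\ \beta \end{pmatrix}
= \begin{pmatrix} -\frac{1}{2} \cdot 2 \cdot \beta + \beta \\ -\frac{1}{2} \cdot (-4) \cdot \beta - 2 \beta \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix} \]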
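
The hand computation can be compared with MATLAB's null function, which returns an orthonormal basis of the null space. A minimal sketch follows; the value beta = 3 is an arbitrary choice used to test one member of the family.

```matlab
A = [2 1; -4 -2];
z = null(A);               % orthonormal basis of N(A), a single column here
ratio = z(1)/z(2);         % first component over second component
residual = norm(A*z);      % should vanish up to rounding

beta = 3;                  % any real number works (illustrative value)
v = [-beta/2; beta];       % element of N(A) from the formula above
Av = A*v
```

The basis vector returned by null is normalized to unit length and may differ in sign, so only the ratio of its components is compared with −1/2.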

Figure 28: Inverse Operator.

The null space is important when studying the inverse of an operator and the uniqueness of a solution.
(Definition) Inverse Operator The mapping T : X → Y is said to be injective or one-to-one if different points in the domain have different images.

\[ (T x_1 = T x_2 \Rightarrow x_1 = x_2) \;\Leftrightarrow\; (x_1 \ne x_2 \Rightarrow T x_1 \ne T x_2) \quad \forall x_1, x_2 \qquad (5) \]

For this case an inverse mapping is well defined

\[ T^{-1} : Y \to X \]

and associates a given $y_0 \in Y$ to the unique $x_0 \in X$ that the operator maps to $y_0$.

\[ T x_0 = y_0 \qquad T^{-1} y_0 = x_0 \]

Clearly the inverse operator satisfies the following properties.

\[ T^{-1} T x = x \quad \forall x \qquad T T^{-1} y = y \quad \forall y \]

For our study of linear operators, it is important to note that the inverse of a linear operator exists if and only if the null space of the operator consists of the zero vector only.
Theorem 3.1 (Inverse Operator). Let X and Y be two vector spaces and T : X → Y a linear operator.

(i) The inverse $T^{-1} : Y \to X$ exists if and only if the null space is zero.

(T x1 = T x2 ⇒ x1 = x2 ∀x1 , x2 ) ⇔ (T z = 0 ⇒ z=0 ⇔ N (T ) = 0)

(ii) If the inverse T −1 : Y → X exists, it is linear


Proof. (i) (⇐) To claim the inverse exists, we want to show that our definition of an injective or one-to-one
operator (5)
(T x1 = T x2 ⇒ x1 = x2 ∀x1 , x2 ) ⇐ (T z = 0 ⇒ z = 0)
is satisfied, given that the null space is zero. Assuming two points map to the same point in the range
implies the difference is zero.

T x1 = T x2 ⇒ T x1 − T x2 = 0 = T (x1 − x2 ) (linearity of T)

Hence by the assumption, T z = 0 ⇒ z = 0, the difference maps to ~0, so the difference is zero and the
two points are the same.

T x1 = T x2 ⇒ T z = T (x1 −x2 ) = 0 ⇒ z = 0 = x1 −x2 ⇒ x1 = x2 (z = x1 −x2 )

(⇒) Conversely, We want to show that the Null space of the operator is zero given that T −1 exists.

(T x1 = T x2 ⇒ x1 = x2 ∀x1 , x2 ) ⇒ (T z = 0 ⇒ z = 0)

Since this holds for all x1 , x2 , let x2 = 0 and x1 = z arbitrary. By the properties of the linear operator,
T x2 = T 0 = 0.

Tz = 0 ⇒ T z = T x1 = 0 = T x2 = T 0 ⇒ z = x1 = x2 = 0
| {z }
by assumption, T x1 =T x2 ⇒x1 =x2

(ii) We assume that $T^{-1}$ exists and show that it is linear. Consider any $x_1, x_2 \in X$ and their images

y1 = T x1 T −1 y1 = x1 y2 = T x2 T −1 y2 = x2

Since T is linear
αy1 + βy2 = αT x1 + βT x2 = T (αx1 + βx2 )
Since the inverse is defined we can associate αy1 + βy2 with its image in X.

T −1 (αy1 + βy2 ) = T −1 T (αx1 + βx2 ) = αx1 + βx2

Finally since xj = T −1 yj

T −1 (αy1 + βy2 ) = αx1 + βx2 = αT −1 y1 + βT −1 y2

Example 41 (Ill Conditioned System). Returning to everyday life, why is this important? What does this
null space have to do with anything ?
>> ExMatrixLinearOperator

clear all
close all
echo on

A = [ 0.814 0.913 0.278; 1.811 1.264 1.093; 2.442 2.739 0.834]

A =

0.8140 0.9130 0.2780


1.8110 1.2640 1.0930

2.4420 2.7390 0.8340

x = [ 0.9572; 0.4854; 0.8003];


y = [ 0.1419; 0.4218; 0.9157];

%Matrix A is a linear operator


alpha = 3.3;
beta = 2.7;
zone = A * (alpha *x + beta * y)

zone =

6.8069
15.4675
20.4206

ztwo = alpha *A * x + beta * A * y

ztwo =

6.8069
15.4675
20.4206

% What is the solution A x = b ?


b = [ 1.4448; 3.2217; 4.3343];
xone = A\b
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 1.846985e-17.
> In ExMatrixLinearOperator at 20

xone =

1.0e+11 *

-1.3894
0.8301
1.3422

% notice the effect of a measurement error \delta b


% on the solution
deltab = [.001;.002;.001];
xtwo = A\(b + deltab)
Warning: Matrix is close to singular or badly scaled.
Results may be inaccurate. RCOND = 1.846985e-17.
> In ExMatrixLinearOperator at 25

xtwo =

1.0e+12 *

-2.9178
1.7432
2.8187

%the solution has changed by an order of magnitude ?!?!?
%what’s going on ?!?!?
%x is a solution to A x = b
x

x =

0.9572
0.4854
0.8003

A * x -b

ans =

1.0e-03 *

0.0144
0.0627
0.1432

% x is not a unique solution


% xhat is also a solution to A xhat = b
xhat = [-5.6508; 4.4332;7.1838];
A * xhat -b

ans =

1.0e-03 *

0.0568
0.1594
0.2704

% in fact any linear combination of x with an


% element of the null space of A is a solution
% A * ( x + gamma * null(A) ) = b gamma \in \mathbb{R}
gamma = rand(1,1);
A * ( x + gamma * null(A) ) - b

ans =

1.0e-03 *

0.0144
0.0627
0.1432

In summary, the null space is not just the zero vector

\[ N(A) \ne \{ \vec{0} \} \]

Hence, if a solution exists, there are infinitely many solutions!
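
The source of the trouble in this example can be made explicit. The sketch below, which uses only the matrix A from the session above, checks that the third row is (up to rounding) three times the first row, so that the matrix has rank 2 and a one-dimensional null space. The variable names are illustrative.

```matlab
A = [0.814 0.913 0.278; 1.811 1.264 1.093; 2.442 2.739 0.834];

rowGap = norm(A(3,:) - 3*A(1,:));   % third row versus three times the first row
r = rank(A);                        % numerical rank
z = null(A);                        % basis of the null space
nullResidual = norm(A*z);           % A maps the null space to (numerically) zero
fprintf('rank = %d, dim N(A) = %d\n', r, size(z, 2));
```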

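For clarity let us consider a 2x2 system, or a 2-pixel scanner if you will. Find $(x_1, x_2)$

\[ Ax = \begin{pmatrix} 0.9130 & 0.6590 \\ 0.4570 & 0.3300 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0.254 \\ 0.127 \end{pmatrix} = b \qquad T = A : \mathbb{R}^2 \to \mathbb{R}^2 \qquad X = Y = \mathbb{R}^2 \]
We are getting closer to being able to answer fundamental questions that will appear time and time again in our research.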
Given a linear operator T : X → Y and a right hand side vector y ∈ Y ,
(i) Does an element x ∈ X exist such that
Tx = y
(ii) Is the element unique ?
(iii) How does uncertainty due to precision and accuracy limitations in the measurement, ∆b affect the
solution we obtain from a canned MATLAB program ?
Obvious observations are:
1. For a given y ∈ Y, a solution exists ⇔ y ∈ R(T)
2. A solution is unique ⇔ N(T) = {0}, ie the null space is the zero vector or the operator is one-to-one (injective). Otherwise, adding any null space element z to a solution x gives another solution.
\[ Ax = y, \quad Az = 0 \;\Rightarrow\; A(x + z) = y \]
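
Question (iii) can be explored on the 2x2 system above. The following sketch solves the system, then perturbs the right hand side by a hypothetical measurement error deltab (an illustrative value) and compares the relative change in the data with the relative change in the solution. The built-in cond is printed only for reference.

```matlab
A = [0.9130 0.6590; 0.4570 0.3300];
b = [0.254; 0.127];
x = A\b;                               % solution of the unperturbed system

deltab = [0.001; 0];                   % hypothetical measurement error
xPerturbed = A\(b + deltab);

relChangeB = norm(deltab)/norm(b);             % relative change in the data
relChangeX = norm(xPerturbed - x)/norm(x);     % relative change in the solution
kappa = cond(A);                               % 2-norm condition number
fprintf('rel. change in b: %.2e, rel. change in x: %.2e, cond(A): %.2e\n', ...
        relChangeB, relChangeX, kappa);
```

A small relative change in b produces a much larger relative change in x, even though this matrix is invertible.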

We can gain further intuition by analyzing the linear operator in an explicit finite dimensional setting. We
have a very powerful theorem at our disposal on finite dimensions. We will state without proof.
Theorem 3.2 (Rank and Nullity). Let X be a finite-dimensional vector space and T : X → Y denotes a
linear transformation from X into Y . Then
dim X = dim N (T ) + dim R(T )
i.e. the sum of the rank and nullity of the linear transformation T equals the dimension of the space X.
To be explicit, let $\{e_1, e_2, ..., e_n\}$ and $\{f_1, f_2, ..., f_m\}$ denote a basis for $X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$, respectively. Then for a given x ∈ X we can represent it in terms of its basis and apply the operator, $T : \mathbb{R}^n \to \mathbb{R}^m$
\[ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \;\Leftrightarrow\; x = \sum_{j=1}^n x_j e_j \qquad Tx = \sum_{j=1}^n x_j T e_j \]
Now each of the vectors $T e_j \in Y$ and therefore has its own representation with respect to the basis $f_i$. Denoting the components of $T e_j$ with respect to the basis $f_i$ with $a_{ij}$
\[ T e_j = \begin{pmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{pmatrix} = a_{1j} f_1 + a_{2j} f_2 + ... + a_{mj} f_m = \sum_{i=1}^m a_{ij} f_i \]
we have the usual matrix-vector multiply with m rows and n columns.
\[ Tx = \sum_{j=1}^n x_j T e_j = \sum_{j=1}^n x_j \begin{pmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{pmatrix} = \sum_{j=1}^n x_j \sum_{i=1}^m a_{ij} f_i = \sum_{i=1}^m \left( \sum_{j=1}^n a_{ij} x_j \right) f_i = \sum_{i=1}^m y_i f_i \]
\[ \Leftrightarrow\; \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} a_{11} & \dots & a_{1n} \\ a_{21} & \dots & a_{2n} \\ \vdots & & \vdots \\ a_{m1} & \dots & a_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \;\Leftrightarrow\; y = Ax \]
Again, the matrix entries $a_{ij}$ have the interpretation as the components of the mapping of the domain basis, $T e_j$, with respect to the range basis $f_i$.
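
The construction of the matrix from the images of the basis vectors can be carried out literally. In the sketch below the map T is a hypothetical linear map from $\mathbb{R}^3$ to $\mathbb{R}^2$, written as a function handle purely for illustration; the standard bases are assumed for both spaces.

```matlab
% hypothetical linear map T : R^3 -> R^2, used only for illustration
T = @(x) [x(1) + 2*x(2); 3*x(3) - x(1)];
n = 3; m = 2;

E = eye(n);                  % columns are the standard basis vectors e_j
A = zeros(m, n);
for j = 1:n
    A(:, j) = T(E(:, j));    % column j holds the components a_ij of T e_j
end

x = [1; -2; 0.5];            % arbitrary test vector
mismatch = norm(A*x - T(x))  % matrix-vector product versus direct application
```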
(Definition) Singular Matrix We say that a square matrix, A, is singular iff the determinant of the matrix is equal to zero.

\[ A \text{ is singular} \;\Leftrightarrow\; \det A = 0 \;\Leftrightarrow\; \text{the rows (equivalently, the columns) of the matrix are linearly dependent} \]

Algebraic properties of the determinant may be used to show that the rows and the columns of a singular matrix are linearly dependent.

The possibilities are summarized as follows:


(i) dim X = dim Y = rankA, then there exists a unique solution x = T −1 y for an arbitrary vector y.
Recall from linear algebra

rankA = # of linearly independent columns of matrix A

Here we are saying that the set $\{T e_1, T e_2, ..., T e_n\}$ is linearly independent and hence forms a basis for Y. Since Y is of the same dimension as rank A, any y ∈ Y may be decomposed in this basis, so a solution exists and is given by the coefficients $\{x_i\}$ of this decomposition.
\[ y = \sum_i x_i T e_i \]

Finally by the Rank and Nullity Theorem 3.2, since dim X = rankA, the dimension of the Null space
must be zero
N (T ) = {0}
and the solution is unique.
(ii) dim X = dim Y > rank A (ie A singular). Infinitely many solutions: N(T) ≠ {0} and y ∈ R(T). y lying within the range of the operator is equivalent to saying that the rank of the augmented matrix is the same as the rank of the original matrix, ie y is linearly dependent on the columns of the matrix.
\[ \operatorname{rank} \begin{pmatrix} a_{11} & \dots & a_{1n} & y_1 \\ a_{21} & \dots & a_{2n} & y_2 \\ \vdots & & \vdots & \vdots \\ a_{m1} & \dots & a_{mn} & y_m \end{pmatrix} = \operatorname{rank} \begin{pmatrix} a_{11} & \dots & a_{1n} \\ a_{21} & \dots & a_{2n} \\ \vdots & & \vdots \\ a_{m1} & \dots & a_{mn} \end{pmatrix} \]

For this situation there exists z ≠ 0, z ∈ N(T) and

\[ Tx = y \;\Rightarrow\; y = Tx = Tx + 0 = Tx + \alpha T z = T(x + \alpha z) \quad \forall \alpha \]

Hence if a solution x exists, then infinitely many solutions exist, x + αz, ∀α ∈ $\mathbb{R}$.

(iii) dim X = dim Y > rank A (ie A singular). No solutions: y ∉ R(T). In other words, span$\{T e_1, T e_2, ..., T e_n\}$ does not cover all of Y and the y in question does not lie in that span.
(iv) n = dim X > dim Y = m. The number of equations is smaller than the number of unknowns. From the fundamental identity n = dim N(T) + dim R(T), with dim R(T) ≤ m = dim Y, the dimension of the null space is always greater than zero, dim N(T) > 0, and if a solution exists it is never unique. As before, the rank of the augmented matrix may be used to show whether a solution exists.
(v) n = dim X < dim Y = m. The number of equations is bigger than the number of unknowns. Again from the fundamental identity, m > n = dim N(T) + dim R(T), so the range space must be a proper subspace of the full space, m > dim R(T), and the rank of the augmented matrix may be used to show whether a solution exists. The dimension of the null space may or may not be 0.
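
The augmented-matrix test for existence is straightforward to apply numerically. The sketch below uses the singular 2x2 matrix from the null space example and two hypothetical right hand sides, one inside and one outside the range of A.

```matlab
A  = [2 1; -4 -2];     % singular: rank(A) = 1
b1 = [1; -2];          % a multiple of the columns of A, so b1 lies in R(A)
b2 = [1;  0];          % not a multiple of the columns of A

hasSolution1 = rank([A b1]) == rank(A);   % true:  case (ii), infinitely many solutions
hasSolution2 = rank([A b2]) == rank(A);   % false: case (iii), no solution
fprintf('b1 solvable: %d, b2 solvable: %d\n', hasSolution1, hasSolution2);
```

With measured data the comparison is sensitive to rounding, since rank relies on a numerical tolerance.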

3.2 Bounded Linear Operators ([Kreyszig, 1989], Section 2.7)


(Definition) Bounded Linear Operator Let X and Y be normed spaces and T : D(T) → Y a linear operator, where D(T) ⊂ X. The operator T is said to be bounded if there is a real number c such that

\[ \|Tx\| \le c \|x\| \quad \forall x \in D(T) \qquad (6) \]

Bounded linear operators form the basis of a rich theory in functional analysis and will facilitate much of the discussion of matrix analysis for solving systems of linear equations that appears in our research. In fact, the space of bounded linear operators may be considered a normed space with an operator norm defined as the smallest of all bounding constants, which is expressed through a supremum. There are rigorous methods of defining the supremum of a set in terms of partial orderings. For our purposes, subsets of $\mathbb{R}$ will suffice.
(Definition) Supremum The supremum of a set, A, is denoted sup(A) and denotes the least upper bound of the set.

\[ \sup(A) \le c \quad \forall c \in \{ b : b \ge a \;\; \forall a \in A \} \qquad \underbrace{\sup(A) \in \{ b : b \ge a \;\; \forall a \in A \}}_{\text{is an element of the set of upper bnds}} \]

Example 42 (Supremum vs Maximum). Motivation for defining the supremum is that it allows us to define a least upper bound of a set when the maximum does not exist. Consider, for example, the open interval

A = (2, 4) = {x ∈ R : 2 < x < 4}

The max does not exist. Indeed, you can always find a number that is epsilon bigger, ie 3.999...99 <
3.999...991. However, the set [4, ∞) consists of all upper bounds for A and the least upper bound is the
supremum.

sup A = 4


Figure 29: Supremum vs Max.

There are several equivalent definitions of the “matrix norm” in terms of the supremum of the bounding constants. Proof of the equivalence is outside the scope of this course. As before, the matrix norm provides a method to quantify the size and magnitude of a matrix and other linear operators.
\[ \|T\| \equiv \sup_{x \in D(T), \, x \ne 0} \frac{\|Tx\|}{\|x\|} = \sup_{x \in D(T), \, \|x\| = 1} \|Tx\| \]

Note that letting the constant equal the norm, c = $\|T\|$, we arrive at a frequently used formula that bounds the application of the operator to the domain. By definition, the matrix norm $\|T\|$ is an element of the set of upper bounds
\[ \left\{ c : \frac{\|Tx\|}{\|x\|} \le c \quad x \ne 0 \right\} \]
in fact it is the least upper bound.
\[ \|T\| = \sup_{x \in D(T), \, x \ne 0} \frac{\|Tx\|}{\|x\|} \;\Rightarrow\; \underbrace{\frac{\|Tx\|}{\|x\|} \le \|T\| \quad \forall x \ne 0}_{\text{(by defn, sup is an upp. bnd.)}} \;\Rightarrow\; \|Tx\| \le \|T\| \|x\| \quad \forall x \qquad (7) \]
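
The supremum over unit vectors can be approximated by brute force in two dimensions. The sketch below assumes the vector 2-norm on both domain and range and a hypothetical 2x2 matrix; it samples unit vectors around the circle and compares the largest stretch with MATLAB's norm(A, 2).

```matlab
A = [1 2; 3 4];                          % illustrative matrix
theta = linspace(0, 2*pi, 3601);
U = [cos(theta); sin(theta)];            % unit vectors, one per column

stretch = sqrt(sum((A*U).^2, 1));        % ||A x||_2 for every sampled unit vector
estimate  = max(stretch);                % sampled approximation of the supremum
reference = norm(A, 2);                  % matrix norm subordinate to the 2-norm
fprintf('sampled sup = %.6f, norm(A,2) = %.6f\n', estimate, reference);
```

Every sampled value stays below the matrix norm, consistent with (7), and the largest sample approaches it as the sampling is refined.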

Example 43 (Identity Operator). The identity operator $I_X : X \to X$, Ix = x, on a normed space X ≠ {0} is bounded and has norm $\|I\| = 1$
Example 44 (Zero Operator). The zero operator 0 : X → Y, 0x = 0, on a normed space X is bounded and has norm $\|0\| = 0$
Lemma 3.3. An integral may be bounded by the supremum of its function value over the domain of inte-
gration.
Example 45 (Integral Operator). We can define an integral operator T : C[0, 1] → C[0, 1] by
\[ y = Tx \qquad y(t) = \int_0^1 k(t, \tau) x(\tau) \, d\tau \]

Figure 30: Area under the curve is bounded by the rectangle at the max value, $\int_{\Delta x} f(x) \, dx \le f_{max} \Delta x$.

Here k is a given function, which is called the kernel of T and is assumed to be continuous on the closed
square G = J × J in the tτ -plane, where J = [0, 1]. This operator is linear and bounded. To prove this we
first note that the continuity of k on the closed square implies that k is bounded, ie there exist a k0 such that

|k(t, τ )| ≤ k0 ∈ R ∀(t, τ ) ∈ J × J

Furthermore, using our previous norm defined on this space

\[ |x(t)| \le \max_{t \in J} |x(t)| = \|x\| \]

Hence,
\[ \|y\| = \|Tx\| = \max_{t \in J} \left| \int_0^1 k(t, \tau) x(\tau) \, d\tau \right| \le \max_{t \in J} \int_0^1 |k(t, \tau)| \, |x(\tau)| \, d\tau \le k_0 \|x\| \]

So we have
\[ \|Tx\| \le k_0 \|x\| \]
and T is bounded.
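
The bound can be observed numerically for a concrete kernel. The sketch below is an illustration under stated assumptions: the Gaussian-type kernel k, the function x(τ) = sin(2πτ) and the grid sizes are all hypothetical choices, and the integral is approximated with the trapezoidal rule.

```matlab
k  = @(t, tau) exp(-(t - tau).^2);   % hypothetical continuous kernel with |k| <= 1
k0 = 1;                              % bound on |k| over the unit square
xfun = @(tau) sin(2*pi*tau);         % hypothetical continuous function on [0, 1]

tau = linspace(0, 1, 2001);          % integration grid
t   = linspace(0, 1, 201);           % evaluation grid
y = zeros(size(t));
for i = 1:numel(t)
    y(i) = trapz(tau, k(t(i), tau) .* xfun(tau));   % y(t_i) = (T x)(t_i)
end

normTx = max(abs(y));                % max norm of T x on the grid
normx  = max(abs(xfun(tau)));        % max norm of x on the grid
fprintf('||Tx|| = %g <= k0*||x|| = %g\n', normTx, k0*normx);
```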
We state without proof some important properties of bounded operators in finite dimensions as well as continuity of bounded operators.
Theorem 3.4 (Finite Dimension). If a normed space X is finite dimensional, then every linear operator on X is bounded
Proof. See Theorem 2.7-8 in [Kreyszig, 1989]
Theorem 3.5 (Continuity and boundedness). Let T : D(T) → Y be a linear operator, where D(T) ⊂ X and X, Y are normed spaces. Then:

T is continuous ⇔ T is bounded

Proof. See Theorem. 2.7-9 in [Kreyszig, 1989]


Corollary 3.6 (Continuity). Let T : D → Y be a bounded linear operator, where D(T ) ⊂ X and X, Y are
normed spaces. Then:
xn → x ⇒ T xn → T x xn , x ∈ D(T )

Proof. See Theorem. 2.7-10 in [Kreyszig, 1989]


Hence any linear operator we define on a finite dimensional space is bounded, continuous, and application
of the linear operator to a sequence converges to the operator applied to the limit point of the sequence.

xn → x ⇒ T xn → T x xn , x ∈ D(T )

An important example, which will be used in studying the accuracy of the solution to a linear system of
equations, is the norm of a matrix.

Example 46 (Matrix Operator). Consider the matrix operator $T = A$, $A : (\mathbb{R}^n, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_1)$. In
determining the matrix norm, notice that it is important to specify the analytical form of the norm used on the
domain and range space. We will denote by $\|A\|_1$ the matrix norm subordinate to the vector norms $\|Ax\|_1$
and $\|x\|_1$. Our strategy for establishing the analytical expression of the matrix norm will be to
• Find a constant, $K$, such that
\[ \|Ax\|_1 \le K \|x\|_1 \quad \forall x \]
• Use the definition of the supremum (the least upper bound) to conclude that
\[ \|Ax\|_1 \le \|A\|_1 \|x\|_1 \le K \|x\|_1 \;\; \forall x, \qquad \|A\|_1 \le K \]
• Find a particular $x^*$ with unit norm, $\|x^*\|_1 = 1$, for which equality is obtained:
\[ \|A\|_1 \|x^*\|_1 \ge \|Ax^*\|_1 = K \|x^*\|_1 \;\Rightarrow\; K \le \|A\|_1 \]
Then our constant is bounded above and below by the matrix norm and our result is obtained
\[ \|A\|_1 \le K \le \|A\|_1 \;\Leftrightarrow\; \|A\|_1 = K \]
To obtain the constant, consider the matrix $A$ as column-wise entries
\[ A = \begin{bmatrix} \vec{a}_1 & \vec{a}_2 & \dots & \vec{a}_j & \dots & \vec{a}_n \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_i \\ \vdots \\ x_n \end{bmatrix} \]
The triangle inequality yields
\[ \|Ax\|_1 = \Big\| \sum_{j=1}^n \vec{a}_j x_j \Big\|_1 \le \sum_{j=1}^n |x_j|\, \|\vec{a}_j\|_1 \le \Big( \max_j \|\vec{a}_j\|_1 \Big) \sum_{i=1}^n |x_i| = \max_j \|\vec{a}_j\|_1 \, \|x\|_1 \]
Thus we have found a constant, $K$, such that
\[ K = \max_j \|\vec{a}_j\|_1, \qquad \|Ax\|_1 \le K \|x\|_1 \quad \forall x \]
For the final step, pick $x^*$ such that all entries are zero except in the $j$-th position, which corresponds to the
max column sum of the matrix
\[ x^* = (0, 0, \dots, 0, 1, 0, \dots, 0), \qquad x^*_j = 1, \qquad \|\vec{a}_j\|_1 = \max_k \|\vec{a}_k\|_1 \]
Then $x^*$ extracts the column with the maximal sum
\[ \|Ax^*\|_1 = \|\vec{a}_j\|_1 \;\Rightarrow\; \|A\|_1 = \max_k \|\vec{a}_k\|_1 \]
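
The two-sided argument of Example 46 can be checked numerically. The following is a minimal sketch in plain Python (the helper names `one_norm`, `matvec` and `column_sums` are illustrative, not part of the notes); it uses the 2 × 2 matrix that appears later in Section 3.3, computes $K$ as the maximal column sum, and confirms that the unit vector $x^*$ attains it.

```python
def one_norm(v):
    """Vector 1-norm: sum of absolute values."""
    return sum(abs(c) for c in v)

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def column_sums(A):
    """1-norms of the columns of A."""
    return [sum(abs(row[j]) for row in A) for j in range(len(A[0]))]

A = [[0.913, 0.659],
     [0.457, 0.330]]

sums = column_sums(A)
K = max(sums)              # candidate constant K = max_j ||a_j||_1
j_star = sums.index(K)     # index of the column attaining the maximum
x_star = [1.0 if j == j_star else 0.0 for j in range(len(A[0]))]

attained = one_norm(matvec(A, x_star))   # ||A x*||_1, should equal K

# The bound ||A x||_1 <= K ||x||_1 should also hold for other vectors.
trial_vectors = [[1.0, -1.0], [0.3, 0.7], [-2.0, 5.0]]
bound_holds = all(one_norm(matvec(A, x)) <= K * one_norm(x) + 1e-12
                  for x in trial_vectors)
print(K, attained, bound_holds)
```

A handful of trial vectors does not prove the bound; the proof is the triangle-inequality argument above. The check only illustrates that $K$ is both an upper bound and attained.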

Example 47 (Differentiation Operator). Let $X$ be the normed space of all polynomials on $J = [0,1]$ with
norm given by $\|x\| = \max_{t \in J} |x(t)|$. A differentiation operator $T$ is defined on $X$ by
\[ Tx(t) = x'(t) \]
where the prime denotes differentiation with respect to $t$. This operator is linear but NOT bounded. Consider
$x_n(t) = t^n$, where $n \in \mathbb{N} = \{1, 2, 3, \dots\}$. Then $\|x_n\| = 1$ and
\[ Tx_n(t) = x_n'(t) = n t^{n-1}, \qquad \|Tx_n\| = \max_{t \in [0,1]} |n t^{n-1}| = n \cdot 1 = n \|x_n\| \;\Rightarrow\; \frac{\|Tx_n\|}{\|x_n\|} = n \quad \forall n \]
Since $n \in \mathbb{N}$ is arbitrary, there is no fixed number $c$ such that $\frac{\|Tx_n\|}{\|x_n\|} \le c$ for all $n$. Thus $T$ is not bounded.
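
The growth of the ratio $\|Tx_n\| / \|x_n\|$ can be illustrated with a short sketch. It assumes the max norm is approximated by sampling on a uniform grid that includes the endpoint $t = 1$, where both maxima are attained; `max_norm` and `ratio` are illustrative names.

```python
def max_norm(f, samples=1001):
    """Approximate max |f(t)| on [0, 1] by sampling a uniform grid."""
    return max(abs(f(i / (samples - 1))) for i in range(samples))

def ratio(n):
    """||T x_n|| / ||x_n|| for x_n(t) = t^n and T = d/dt."""
    x_n = lambda t: t ** n
    dx_n = lambda t: n * t ** (n - 1)
    return max_norm(dx_n) / max_norm(x_n)

ratios = [ratio(n) for n in (1, 2, 5, 10, 50)]
print(ratios)  # the ratio equals n, so no single constant c bounds it
```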

3.3 Applications: Conditioning & Residual [Heath, 1998] Chapter 2
The condition number $\kappa$ of a matrix is defined with respect to a particular matrix norm $\|\cdot\|$ as
\[ \kappa(A) \equiv \|A\|\, \|A^{-1}\| \]
In finite dimensions, the value of the matrix norm depends on the norms chosen on the domain, $D(T)$, and range,
$R(T)$, of the operator.
\[ T : (\mathbb{R}^n, \|\cdot\|_\infty) \to (\mathbb{R}^n, \|\cdot\|_\infty) \;\Rightarrow\; \|T\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^n |T_{ij}| \]
\[ T : (\mathbb{R}^n, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_1) \;\Rightarrow\; \|T\|_1 = \max_{1 \le j \le n} \sum_{i=1}^n |T_{ij}| \]
Hence, the quantitative value of the condition number of a matrix such as
\[ A = \begin{bmatrix} 0.913 & 0.659 \\ 0.457 & 0.330 \end{bmatrix} \]
depends on the matrix norm induced by the vector norm.
\[ \mathrm{cond}_1(A) = \|A\|_1 \|A^{-1}\|_1, \qquad \mathrm{cond}_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty \]
From a practical standpoint, Theorem 2.3 suggests that on this finite dimensional space of matrix operators
it may be more convenient to compute the matrix norm with respect to 1-norms rather than 2-norms,
for example. However, if the matrix is approaching singularity in a given norm then it approaches
singularity in all norms.
The condition number may be verified in MATLAB
>> A = [0.913, 0.659;0.457,0.330]

A =

0.9130 0.6590
0.4570 0.3300
>> cond(A,1)

ans =

1.6958e+04

>> cond(A,2)

ans =

1.2485e+04

>> cond(A,inf)

ans =

1.6958e+04
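
As a cross-check of the 1-norm and ∞-norm values above, the following sketch evaluates $\|A\|\,\|A^{-1}\|$ directly from the max column sum and max row sum formulas. It is plain Python and relies on the closed-form 2 × 2 inverse, so it is not how MATLAB's `cond` is implemented; the helper names are illustrative.

```python
def inverse_2x2(M):
    """Closed-form inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_1(M):
    """Induced 1-norm: maximal absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))

def norm_inf(M):
    """Induced infinity-norm: maximal absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

A = [[0.913, 0.659],
     [0.457, 0.330]]
A_inv = inverse_2x2(A)

cond_1 = norm_1(A) * norm_1(A_inv)
cond_inf = norm_inf(A) * norm_inf(A_inv)
print(cond_1, cond_inf)  # both close to the 1.6958e+04 reported above
```

For a 2 × 2 matrix the two values coincide, because the inverse swaps the roles of the row and column sums; this is not true for general matrices.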

The residual of an approximate solution, x̂, to the linear system Ax = b is the difference

r = b − Ax̂

Let's look at the residual norms of two potential solutions. In general we would expect the residual norm to
decrease as we obtain a better solution.
>> b = [ 0.254; 0.127];
>> xexact = [1;-1]

xexact =

1
-1

>> norm(A*xexact-b,1)

ans =

>> xone = [-0.0827;.5]

xone =

-0.0827
0.5000

>> xtwo = [0.999;-1.001]

xtwo =

0.9990
-1.0010

>> norm(A*xone-b,1)

ans =

2.1120e-04

>> norm(xexact-xone,1)

ans =

2.5827

>> norm(A*xtwo-b,1)

ans =

0.0024

>> norm(xexact-xtwo,1)

ans =

0.0020
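Upon initial inspection, one may think that the $\hat{x}_1$ solution is the better approximation because of the
smaller residual. However, the exact solution may be verified to be $x = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$. As seen in equation (8), this
is an excellent example of where a small residual does not imply a small error in the solution because of the
ill-conditioning of the linear system matrix, $A$. Assuming the matrix is nonsingular, a relative error bound
may be related to the residual
\[ \left. \begin{aligned} Ax &= b \\ A\hat{x} &= b - r \end{aligned} \right\} \;\Rightarrow\; A(x - \hat{x}) = r \;\Rightarrow\; (x - \hat{x}) = A^{-1} r \;\Rightarrow\; \|\Delta x\| = \|\hat{x} - x\| = \|A^{-1} r\| \le \|A^{-1}\|\, \|r\| \tag{8} \]
Manipulating the inequality,
\[ \|\Delta x\| \le \|A^{-1}\|\, \|r\| = \frac{\|A\|\, \|\hat{x}\|}{\|A\|\, \|\hat{x}\|} \|A^{-1}\|\, \|r\| \;\Rightarrow\; \frac{\|\Delta x\|}{\|\hat{x}\|} \le \mathrm{cond}(A) \frac{\|r\|}{\|A\|\, \|\hat{x}\|} \]
Revisiting our example matrix,
\[ Ax = \begin{bmatrix} 0.913 & 0.659 \\ 0.457 & 0.330 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.254 \\ 0.127 \end{bmatrix} = b \]
Consider the residual from two approximate solutions
\[ \hat{x}_1 = \begin{bmatrix} -0.0827 \\ 0.5 \end{bmatrix} \quad \text{and} \quad \hat{x}_2 = \begin{bmatrix} 0.999 \\ -1.001 \end{bmatrix} \]
The first solution may be obtained from four-digit arithmetic Gaussian elimination, multiplying both sides of the linear
system by an elimination matrix
\[ \begin{bmatrix} 1 & 0 \\ -\frac{0.457}{0.913} & 1 \end{bmatrix} \begin{bmatrix} 0.913 & 0.659 \\ 0.457 & 0.330 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -\frac{0.457}{0.913} & 1 \end{bmatrix} \begin{bmatrix} 0.254 \\ 0.127 \end{bmatrix} \]
\[ \Rightarrow \begin{bmatrix} 0.9130 & 0.6590 \\ 0.0 & 0.0002 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.2540 \\ 0.0001 \end{bmatrix} \]
Back substitution gives the solution
\[ \hat{x}_1 = \begin{bmatrix} -0.0827 \\ 0.5 \end{bmatrix} \]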
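
The comparison between residual and error for the two candidate solutions can be reproduced with a few lines of plain Python. This is a sketch that mirrors the MATLAB session above using 1-norms; the dictionary keys `x1` and `x2` are illustrative labels for $\hat{x}_1$ and $\hat{x}_2$.

```python
def one_norm(v):
    """Vector 1-norm: sum of absolute values."""
    return sum(abs(c) for c in v)

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[0.913, 0.659],
     [0.457, 0.330]]
b = [0.254, 0.127]
x_exact = [1.0, -1.0]
candidates = {"x1": [-0.0827, 0.5], "x2": [0.999, -1.001]}

res = {}  # residual norms ||b - A x_hat||_1
err = {}  # error norms ||x - x_hat||_1
for name, x_hat in candidates.items():
    r = [b_i - ax_i for b_i, ax_i in zip(b, matvec(A, x_hat))]
    res[name] = one_norm(r)
    err[name] = one_norm([xe - xh for xe, xh in zip(x_exact, x_hat)])

for name in candidates:
    print(name, "residual:", res[name], "error:", err[name])
```

The candidate with the smaller residual has by far the larger error, which is what the bound $\|\Delta x\| \le \|A^{-1}\|\,\|r\|$ permits when $\|A^{-1}\|$ is large.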

3.4 Applications: Accuracy & Numerical Stability [Heath, 1998] Chapter 2


The quality of a solution to a linear system solve may be related to the condition number of the matrix.
Theorem 3.7 (Error bound of linear system solve). Let $x$ be the solution to the nonsingular linear system
\[ Ax = b \]
and let $\hat{x}$ be the solution to the linear system with a perturbed right hand side
\[ A\hat{x} = b + \Delta b \]
The relative change in the solution, $\Delta x \equiv \hat{x} - x$, is bounded in terms of both the relative perturbation $\Delta b$ and the condition number
$\kappa(A)$.
\[ \frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|} \]
Proof. By definition and linearity of the operator
\[ A\hat{x} = A(x + \Delta x) = Ax + A\Delta x = b + \Delta b \;\Rightarrow\; A\Delta x = \Delta b \]
Using the assumption that the inverse $A^{-1}$ is well defined (nonsingular),
\[ \Delta x = A^{-1} \Delta b \]
Using (7)
\[ \|b\| = \|Ax\| \le \|A\|\, \|x\| \;\Rightarrow\; \frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|} \]
\[ \|\Delta x\| = \|A^{-1} \Delta b\| \le \|A^{-1}\|\, \|\Delta b\| \]
Combining these inequalities (for positive quantities, $a \le b$ and $c \le d$ imply $ac \le bc \le bd$) we obtain the result.
\[ \frac{\|\Delta x\|}{\|x\|} \le \|A^{-1}\|\, \|A\| \frac{\|\Delta b\|}{\|b\|} \]
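
Theorem 3.7 can be illustrated on the example matrix of Section 3.3. The sketch below is plain Python using the closed-form 2 × 2 inverse; the perturbation `delta_b` is an arbitrary value chosen for illustration and does not come from the notes.

```python
def one_norm(v):
    """Vector 1-norm: sum of absolute values."""
    return sum(abs(c) for c in v)

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def inverse_2x2(M):
    """Closed-form inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_1(M):
    """Induced 1-norm: maximal absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))

A = [[0.913, 0.659],
     [0.457, 0.330]]
b = [0.254, 0.127]
delta_b = [1.0e-4, -1.0e-4]      # illustrative perturbation of the data

A_inv = inverse_2x2(A)
x = matvec(A_inv, b)             # solution of A x = b
delta_x = matvec(A_inv, delta_b) # change in the solution, A dx = db

rel_dx = one_norm(delta_x) / one_norm(x)
rel_db = one_norm(delta_b) / one_norm(b)
kappa = norm_1(A) * norm_1(A_inv)
bound = kappa * rel_db

print("relative change in x:", rel_dx)
print("relative change in b:", rel_db)
print("bound kappa * rel_db:", bound)
```

Here a relative perturbation of about $5 \times 10^{-4}$ in $b$ changes the solution by order one, yet the change remains below the bound $\kappa(A)\,\|\Delta b\| / \|b\|$.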

Figure 31: The effect of matrix conditioning on error bounds of the solution. (a) The solution set of each
of the two equations in the linear system is drawn as a straight line in the plane. The width of the lines
reflects the uncertainty in the data within the specified precision, arising from limitations in measurement precision
and accuracy, for example. The resulting uncertainty in the intersection (i.e., the solution) depends on
the condition number of the matrix. http://www.cse.illinois.edu/iem/linear equations/conditioning (b) The
region of uncertainty in the right-hand-side vector for a given relative error is shown in the right graph by a
shaded circular disk whose size can be altered by dragging its perimeter, and the resulting numerical value
for the relative error in the rhs is shown below. The lightly shaded circular disk in the left graph shows the
corresponding region of uncertainty in the solution vector x given by the condition number of the matrix,
and the corresponding bound on the relative error in x is shown below. In this case the poorly conditioned
matrix is seen to significantly amplify the error in the solution. Working with this example you can also see
the instability in the solution: small changes in the measurement data produce large changes in the output
solution. http://www.cse.illinois.edu/iem/linear equations/error bound

Notice that large perturbations of the solution may occur when large condition numbers overwhelm
machine epsilon, $\|\Delta b\| \approx \epsilon_{mach}$, $\|\Delta b\| / \|b\| \ll \kappa(A)$.
When solving a system of equations in floating point arithmetic, numerical inaccuracies may exist in both the
matrix and the right hand side.
\[ (A + E)\hat{x} = b + \Delta b \]
As a homework exercise, a similar derivation can show that the error in the perturbed solution may be
bounded by these numerical inaccuracies
\[ \frac{\|\Delta x\|}{\|x\|} \le \mathrm{cond}(A) \left( \frac{\|\Delta b\|}{\|b\|} + \frac{\|E\|}{\|A\|} \right) \]
Here we see that the condition number of the system plays an important role in the computer solution of
the system of equations. If we assume that the numerical perturbations are on the order of the machine
precision, $\epsilon_{mach}$, then the bound on the relative error is directly proportional to the condition number.
\[ \frac{\|\hat{x} - x\|}{\|x\|} \le \mathrm{cond}(A)\, O(\epsilon_{mach}) \]

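3.5 Applications: Condition number of nearly singular matrix, [Heath, 1998] Chapter 2

Given a matrix with nearly linearly dependent rows, and recalling the inverse of a general 2 × 2 matrix,
\[ A = \begin{bmatrix} a & a \\ \alpha a & \alpha a + \epsilon \end{bmatrix}, \qquad \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{\det A} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]
compute the condition number of this matrix (taking $a, \epsilon > 0$ and $\alpha \ge 1$, so that the first column of $A^{-1}$ has the larger absolute sum).
\[ A^{-1} = \frac{1}{a(\alpha a + \epsilon) - a \alpha a} \begin{bmatrix} \alpha a + \epsilon & -a \\ -\alpha a & a \end{bmatrix} = \frac{1}{a \epsilon} \begin{bmatrix} \alpha a + \epsilon & -a \\ -\alpha a & a \end{bmatrix} \;\Rightarrow\; \|A^{-1}\|_1 = \max_j \sum_i |a_{ij}| = \frac{2 \alpha a + \epsilon}{a \epsilon} \]
\[ \|A\|_1 = \max_j \sum_i |a_{ij}| = (1 + \alpha) a + \epsilon \;\Rightarrow\; \kappa(A) = \frac{(2 \alpha a + \epsilon)\,((1 + \alpha) a + \epsilon)}{a \epsilon} \to \infty \quad \text{for } \epsilon \to 0 \]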
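
The blow-up of the condition number as $\epsilon \to 0$ can be tabulated with a short sketch. It assumes the illustrative values $a = 1$, $\alpha = 2$ (not taken from the notes) and compares the directly computed 1-norm condition number with the closed-form expression derived above.

```python
def inverse_2x2(M):
    """Closed-form inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_1(M):
    """Induced 1-norm: maximal absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))

a, alpha = 1.0, 2.0   # illustrative values with alpha >= 1

def cond_numeric(eps):
    A = [[a, a], [alpha * a, alpha * a + eps]]
    return norm_1(A) * norm_1(inverse_2x2(A))

def cond_formula(eps):
    return (2 * alpha * a + eps) * ((1 + alpha) * a + eps) / (a * eps)

eps_values = [1e-2, 1e-4, 1e-6]
numeric = [cond_numeric(e) for e in eps_values]
formula = [cond_formula(e) for e in eps_values]
for e, n, f in zip(eps_values, numeric, formula):
    print(e, n, f)   # condition number grows like 1/eps
```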

3.6 Linear Functionals ([Kreyszig, 1989], Section 2.8)


A functional (NOT a typo) is an operator whose range lies on the real line $\mathbb{R}$ or in the complex plane $\mathbb{C}$.
Functionals appear very frequently in research. We will denote functionals by lowercase letters $f, g, h, \dots$, the
domain of $f$ by $D(f)$, the range by $R(f)$ and the value of $f$ at an $x \in D(f)$ by $f(x)$, with parentheses.
(Definition) Linear Functional A linear functional $f$ is a linear operator with domain in a vector space
$X$ and range in the scalar field of $X$
\[ f : D(f) \to K, \qquad K = \mathbb{R} \text{ or } \mathbb{C} \]
(Definition) Bounded Linear Functional A bounded linear functional $f$ is a bounded linear operator
with range in the scalar field, $\mathbb{R}$ or $\mathbb{C}$. Thus there exists $c \in \mathbb{R}$ such that
\[ |f(x)| \le c \|x\| \]
and
\[ |f(x)| \le \|f\|\, \|x\| \]
The following is a special case of Theorem 3.5.
Theorem 3.8. A linear functional $f$ with domain $D(f)$ in a normed space is continuous if and only if $f$ is
bounded.
Example 48 (Norm). The norm $\|\cdot\| : X \to \mathbb{R}$ on a normed space $(X, \|\cdot\|)$ is a functional on $X$ which is
NOT linear.
Example 49 (Dot Product). The familiar dot product with one factor kept fixed defines an important functional
$f : \mathbb{R}^3 \to \mathbb{R}$, by means of
\[ f(x) = x \cdot a = x_1 a_1 + x_2 a_2 + x_3 a_3, \qquad a \in \mathbb{R}^3 \text{ fixed} \]
This functional has an important place in Hilbert space theory. $f$ is in fact linear and bounded. The proof is left as
a homework exercise.

4 Inner Product Spaces


Inner product spaces are a natural generalization of Euclidean space. Metrics have provided us with a notion
of distance and norms have provided us with a relation between the metric and elementary vector algebra
operations. Inner product spaces provide the concept of orthogonality and the ‘angle’ between two vectors,
which is indispensable in many applications and provides a well defined solution to the least squares problem.
Suppose that we wish to deliver an ‘ideal’ dose distribution, $D_{ideal}$, to a patient.
\[ D_{ideal} = \begin{cases} D_{max} & \vec{x} \in \Omega_{prostate} \subset \mathbb{R}^2 \\ 0 & \vec{x} \notin \Omega_{prostate} \subset \mathbb{R}^2 \end{cases} \]
We need to choose the intensity distributions $\vec{I}$ $\left[ \frac{J}{s\, kg} \right]$
\[ \vec{I} = \begin{bmatrix} I_1(d) \\ I_2(d) \\ I_3(d) \end{bmatrix} \in (C[a,b])^3 \]
such that the delivered dose $D(\vec{I}) = D_{ideal}$. For a given exposure time, $\Delta t$, let's assume that the delivered
dose may be written as a linear combination of the individual beam doses.
\[ D(x,y) = \sum_{j=1}^{N_{beam} = 3} \Delta t\, I_j(d(x,y)) \exp\left( - \int_{l(x,y)} \mu(s)\, ds \right) \]

Here $d(x,y)$ and $l(x,y)$ represent the distance along the beam intensity profile and the path to the source,
respectively. To make this infinite dimensional problem tractable, consider the dose at a finite set of pixels
$\{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_m\}$ and the corresponding intensities for each beam path at these pixels
\[ D(x,y) = A \underbrace{\begin{bmatrix} I_1(d(\vec{x}_1)) \\ I_2(d(\vec{x}_1)) \\ I_3(d(\vec{x}_1)) \\ I_1(d(\vec{x}_2)) \\ I_2(d(\vec{x}_2)) \\ I_3(d(\vec{x}_2)) \\ \vdots \\ I_1(d(\vec{x}_m)) \\ I_2(d(\vec{x}_m)) \\ I_3(d(\vec{x}_m)) \end{bmatrix}}_{x} = \underbrace{\begin{bmatrix} D_{ideal}(\vec{x}_1) \\ D_{ideal}(\vec{x}_2) \\ \vdots \\ D_{ideal}(\vec{x}_m) \end{bmatrix}}_{b} \]

Figure 32: Discrete IMRT Optimization.

The discrete operator $A : \mathbb{R}^{3m} \to \mathbb{R}^m$ has a block diagonal structure, with three nonzero entries (one per beam) in each row. Writing $w_j(\vec{x}_i) \equiv \exp\left( - \int_{l_j(\vec{x}_i)} \mu(s)\, ds \right)$ as shorthand for the attenuation of beam $j$ at pixel $\vec{x}_i$,
\[ A = \Delta t \begin{bmatrix} w_1(\vec{x}_1) & w_2(\vec{x}_1) & w_3(\vec{x}_1) & 0 & 0 & 0 & \dots & 0 \\ 0 & 0 & 0 & w_1(\vec{x}_2) & w_2(\vec{x}_2) & w_3(\vec{x}_2) & \dots & 0 \\ \vdots & & & & & & \ddots & \vdots \end{bmatrix} \]
Here we have more unknowns than equations. What do we know about the solution to this problem?
\[ Ax = b \]
From the rank and nullity theorem (3.2), $3m = \dim X > \dim Y = m$: the number of equations is smaller
than the number of unknowns. From the fundamental identity $n = \dim N(T) + \dim R(T)$, with $n = 3m$ and $\dim R(T) \le
m = \dim Y$, the dimension of the null space is always greater than zero, $\dim N(T) \ge 2m > 0$, and if a solution exists it
is never unique.

\[ \sqrt{(b - p, b - p)} = \inf_{v \in M} \sqrt{(b - v, b - v)}, \qquad M \equiv R(A), \quad b \in \mathbb{R}^m \]

Figure 33: We cannot tell the patient that the solution is not inside the range space or that the null space is
non-zero. A particular unique and well-defined solution may be provided in terms of the inner product.
This approach is commonly referred to as the least squares solution and has the interpretation as the
minimal distance to a subspace defined by the range of the operator, $M \equiv R(A)$, i.e. orthogonal projection.
Notice that the distance measure is defined in terms of the inner product, $d(x,y) = \|x - y\| = \sqrt{(x - y, x - y)}$

4.1 Inner Product, ([Kreyszig, 1989], Section 3.2)


(Definition) Inner Product Space An inner product space is a vector space $X$ with an inner product
defined. We write $(X, (\cdot,\cdot))$ or simply $X$. An inner product on $X$ is a mapping of $X \times X$ into the scalar field
$K$ defined on $X$, i.e. $\mathbb{R}$ or $\mathbb{C}$. Specifically, we say that $(\cdot,\cdot) : X \times X \to K$ is an inner product if the following
properties hold for all $x, y, z \in X$ and all scalars $\alpha, \beta$.

(I1) linearity: $(\alpha x + \beta y, z) = \alpha (x, z) + \beta (y, z)$

(I2) conjugate symmetry: $(x, y) = \overline{(y, x)}$

(I3) positive definite: $(x, x) \ge 0$ and $(x, x) = 0 \Leftrightarrow x = 0$

Other common notations for the inner product that are encountered in mathematics, physics, and engineering
are
\[ (x, y) = \langle x, y \rangle = \langle x | y \rangle \quad \forall x, y \]
We will use $(\cdot,\cdot)$ to denote the inner product.
The linearity with respect to the first argument (I1) and conjugate symmetry (I2) imply that the inner
product is semilinear with respect to the second argument.
\[ (x, \alpha y + \beta z) = \overline{(\alpha y + \beta z, x)} = \overline{\alpha (y, x) + \beta (z, x)} = \bar{\alpha}\, \overline{(y, x)} + \bar{\beta}\, \overline{(z, x)} = \bar{\alpha} (x, y) + \bar{\beta} (x, z) \]
For most practical applications we will assume that we have an inner product defined. This will provide
us a notion of the ‘angle’ between two vectors. Notice that an inner product on $X$ defines a norm on $X$
\[ \|x\| = \sqrt{(x, x)} \tag{9} \]
and a metric on $X$ given by
\[ d(x, y) = \|x - y\| = \sqrt{(x - y, x - y)} \tag{10} \]
(Definition) Orthogonality Two vectors $x, y \in X$ are said to be orthogonal if
\[ (x, y) = 0 \]
Similarly, we say two sets $A, B \subset X$ are orthogonal if
\[ (a, b) = 0 \quad \forall a \in A, \; b \in B \]

Example 50 (Euclidean space $\mathbb{C}^n$). We typically define an inner product on $\mathbb{C}^n$ by
\[ (x, y) = x_1 \bar{y}_1 + x_2 \bar{y}_2 + \dots + x_n \bar{y}_n \]
The norm and metric induced by this inner product give the familiar $l_2$ distance measure.
\[ d(x, y) = \|x - y\| = \sqrt{(x - y, x - y)} = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \dots + |x_n - y_n|^2} \]
The conjugate on $y$ is needed to satisfy the symmetry property (I2), $(x, y) = \overline{(y, x)}$, and to ensure that the
length of a vector is non-negative and real-valued in the case of complex numbers. With the conjugate,
\[ (-3 + 5i, -3 + 5i) = (-3 + 5i)(-3 - 5i) = 34 \]
whereas without it we would obtain
\[ (-3 + 5i)(-3 + 5i) = -16 - 30i \]
For $\mathbb{R}^3$, this gives the usual dot product from vector calculus
\[ (x, y) = x \cdot y = x_1 y_1 + x_2 y_2 + x_3 y_3 \]
and orthogonality agrees with the geometric concept of perpendicularity
\[ (x, y) = x \cdot y = 0 \]
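
The role of the conjugate can be checked with Python's built-in complex numbers. This is a small sketch; `inner` is an illustrative helper implementing the $\mathbb{C}^n$ inner product with the conjugate on the second argument, and the test vectors `x`, `y` are arbitrary.

```python
def inner(x, y):
    """Inner product on C^n with the conjugate on the second argument."""
    return sum(x_i * y_i.conjugate() for x_i, y_i in zip(x, y))

z = [-3 + 5j]
with_conjugate = inner(z, z)                            # real and positive
without_conjugate = sum(a * b for a, b in zip(z, z))    # complex, not a length

x = [1 + 2j, 3j]
y = [2 - 1j, 1 + 1j]
symmetry_holds = inner(x, y) == inner(y, x).conjugate()  # property (I2)

print(with_conjugate, without_conjugate, symmetry_holds)
```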

You will show in a homework exercise that the norm induced by an inner product, $\|x\| = \sqrt{(x, x)}$, satisfies the parallelogram
equality
\[ \|x + y\|^2 + \|x - y\|^2 = 2 \left( \|x\|^2 + \|y\|^2 \right) \tag{11} \]
It is worth noting that there do exist norms that are not generated by an inner product, i.e. do not satisfy
(11), hence not all normed spaces are inner product spaces.

Figure 34: Parallelogram Equality. As the name suggests, even in our abstract inner product spaces, the
parallelogram equality, Eqn (11), from elementary geometry still holds, i.e. the sum of the squares of the four sides
equals the sum of the squares of the two diagonals.

Example 51 (1-norm). Not all norms are induced by an inner product. For example, there does not exist an
inner product that can induce the 1-norm on a vector space.
\[ \|x\|_1 = \sum_k |x_k| \ne \sqrt{(x, x)} \]
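
A quick way to see that the 1-norm cannot come from an inner product is to test the parallelogram equality (11) on a single pair of vectors. The sketch below uses the illustrative choice $x = (1, 0)$, $y = (0, 1)$ in $\mathbb{R}^2$; `p_norm` and `parallelogram_gap` are hypothetical helper names.

```python
def p_norm(v, p):
    """Vector p-norm for p >= 1."""
    return sum(abs(c) ** p for c in v) ** (1.0 / p)

def parallelogram_gap(x, y, p):
    """Left side minus right side of the parallelogram equality (11)."""
    s = [a + b for a, b in zip(x, y)]
    d = [a - b for a, b in zip(x, y)]
    lhs = p_norm(s, p) ** 2 + p_norm(d, p) ** 2
    rhs = 2.0 * (p_norm(x, p) ** 2 + p_norm(y, p) ** 2)
    return lhs - rhs

x, y = [1.0, 0.0], [0.0, 1.0]
gap_2 = parallelogram_gap(x, y, 2)   # approximately zero for the 2-norm
gap_1 = parallelogram_gap(x, y, 1)   # nonzero: 8 - 4 for the 1-norm
print(gap_2, gap_1)
```

One counterexample is enough to rule out an inner product, whereas the vanishing gap for the 2-norm is only consistent with (11) and not a proof of it.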

Example 52 (Space $C[a, b]$). The norm defined as the max
\[ \|x(t)\| = \max_{t \in [a,b]} |x(t)| \]
on the space of continuous functions does not satisfy the parallelogram equality (11), so this normed space is not an inner
product space. To see this consider
\[ x(t) = 1, \quad \|x(t)\| = 1, \qquad y(t) = \frac{t - a}{b - a}, \quad \|y(t)\| = 1 \]
\[ x(t) + y(t) = 1 + \frac{t - a}{b - a}, \quad \|x(t) + y(t)\| = 2, \qquad x(t) - y(t) = 1 - \frac{t - a}{b - a}, \quad \|x(t) - y(t)\| = 1 \]
The parallelogram equality (11) is not satisfied
\[ 5 = \|x(t) + y(t)\|^2 + \|x(t) - y(t)\|^2 \ne 2 \left( \|x(t)\|^2 + \|y(t)\|^2 \right) = 4 \]

(Definition) Hilbert Space Without going into the mathematical technicalities of convergent Cauchy
sequences, we will say that a Hilbert space is a closed inner product space, meaning that converging
sequences converge to a point in the space.

The motivating example for a ‘closed’ inner product space is the space of continuous functions $C[-1, 1]$ with
the inner product
\[ (x, y) \equiv \int_{-1}^{1} x(t)\, \overline{y(t)}\, dt \]
This is an example of an inner product space that is not a Hilbert space. The functions
\[ x_n(t) = \begin{cases} 0, & -1 \le t \le 0 \\ nt, & 0 \le t \le 1/n \\ 1, & 1/n \le t \le 1 \end{cases} \]
form a sequence of continuous functions converging to a discontinuous function, Figure 35. This unfortunate
occurrence causes technical difficulties in analysis and motivates the study of $L_2(\Omega)$ Hilbert spaces.
Example 53 (Space $L_2(a, b)$). The inner product defined by
\[ (x, y) \equiv \int_a^b x(t)\, \overline{y(t)}\, dt \]
defines a very important Hilbert space (a closed inner product space) of square integrable functions
\[ L_2(a, b) \equiv \left\{ f : \int_a^b |f(t)|^2\, dt < \infty \right\}, \qquad f(t)\, \overline{f(t)} = |f(t)|^2 \]

Figure 35: Incomplete Space.

Notice that this space includes many more functions than $C[a, b]$, including discontinuous functions and
functions whose tail decays fast enough. For example,
\[ f(x) = 1/x \in L_2(1, \infty): \qquad \int_1^\infty f^2(x)\, dx = \int_1^\infty \frac{1}{x^2}\, dx = \left[ -\frac{1}{x} \right]_1^\infty = 1 - \lim_{x \to \infty} \frac{1}{x} = 1 \]
There are also functions that blow up to infinity but for which the integral is defined. For example, in spherical coordinates,
\[ f(x, y, z) = f(r) = 1/r \in L_2(\Omega), \qquad \Omega = \{ x \in \mathbb{R}^3 : \|x\| \le 1 \} \]
\[ \int_\Omega f^2(r)\, r^2 \sin\phi\, dr\, d\phi\, d\theta = \int_0^{2\pi} \int_0^{\pi} \int_0^1 \frac{r^2}{r^2} \sin\phi\, dr\, d\phi\, d\theta = \int_0^{2\pi} d\theta \int_0^{\pi} \sin\phi\, d\phi \int_0^1 dr = 4\pi \]
(Definition) Cauchy Schwarz Inequality The Cauchy Schwarz inequality is a common inequality that
bounds the inner product by the norm induced by the inner product
\[ |(x, y)| \le \sqrt{(x, x)} \sqrt{(y, y)} = \|x\|\, \|y\| \]

Example 54. It is instructive to verify the Cauchy Schwarz inequality on vector and function spaces. Consider
$x, y \in \mathbb{R}^3$
>> x = [-3;8;11];
>> y = [ 7;-4;1];
>> norm(x,2)

ans =

13.9284

>> norm(y,2)

ans =

8.1240

>> norm(x,2)* norm(y,2)

ans =

113.1548

>> abs(dot(x,y))

ans =

42

   
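\[ x = \begin{bmatrix} -3 \\ 8 \\ 11 \end{bmatrix}, \quad \|x\| = 13.9284, \qquad y = \begin{bmatrix} 7 \\ -4 \\ 1 \end{bmatrix}, \quad \|y\| = 8.1240 \]
\[ |(x, y)| = |-3 \cdot 7 - 8 \cdot 4 + 11 \cdot 1| = 42 \le \|x\|\, \|y\| = 113.1548 \]
Consider functions $f, g \in L_2(0, 1)$
\[ f(x) = x, \quad \|f\| = \sqrt{\int_0^1 x^2\, dx} = \sqrt{\left[ \frac{x^3}{3} \right]_0^1} = \sqrt{\frac{1}{3}}, \qquad g(x) = x^2, \quad \|g\| = \sqrt{\int_0^1 (x^2)^2\, dx} = \sqrt{\left[ \frac{x^5}{5} \right]_0^1} = \sqrt{\frac{1}{5}} \]
\[ |(f, g)| = \int_0^1 x \cdot x^2\, dx = \left[ \frac{x^4}{4} \right]_0^1 = \frac{1}{4} = 0.25 \le \|f\|\, \|g\| = \sqrt{\frac{1}{15}} = 0.2582 \]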
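
The function-space case can be verified numerically. The sketch below approximates the $L_2(0,1)$ inner product with a midpoint rule (an assumption made for illustration; any quadrature would do) and compares $|(f, g)|$ with $\|f\|\,\|g\|$ for $f(x) = x$ and $g(x) = x^2$.

```python
import math

def inner(f, g, n=1000):
    """Midpoint-rule approximation of the L2(0, 1) inner product of real f, g."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) * g((i + 0.5) * h) for i in range(n))

f = lambda x: x
g = lambda x: x ** 2

lhs = abs(inner(f, g))                                   # about 1/4
rhs = math.sqrt(inner(f, f)) * math.sqrt(inner(g, g))    # about sqrt(1/15)
print(lhs, rhs, lhs <= rhs)
```

The two sides are close (0.25 against roughly 0.2582) because $x$ and $x^2$ are nearly parallel in $L_2(0,1)$; equality would require one function to be a scalar multiple of the other.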

4.2 Orthonormal Sets ([Kreyszig, 1989], Section 3.4)


For a general $n$-dimensional basis $\{e_1, e_2, \dots, e_n\}$ on an inner product space $X$, the expansion coefficients, $\alpha_i$, of a vector $x$
\[ x = \sum_i \alpha_i e_i \]
may be determined by solving a linear system of equations
\[ \begin{aligned} \alpha_1 (e_1, e_1) + \alpha_2 (e_2, e_1) + \dots + \alpha_n (e_n, e_1) &= (x, e_1) \\ \alpha_1 (e_1, e_2) + \alpha_2 (e_2, e_2) + \dots + \alpha_n (e_n, e_2) &= (x, e_2) \\ &\;\;\vdots \\ \alpha_1 (e_1, e_n) + \alpha_2 (e_2, e_n) + \dots + \alpha_n (e_n, e_n) &= (x, e_n) \end{aligned} \]
(Definition) Orthonormal Basis A basis $\{e_1, e_2, \dots, e_n\}$ of the space $X$ is orthonormal if
\[ (e_i, e_j) = \delta_{ij} \]
An orthonormal basis is advantageous because the linear system decouples and the expansion coefficients may be easily determined for a
given vector $x$.
\[ \alpha_j = \frac{(x, e_j)}{(e_j, e_j)}, \qquad x = \sum_j \alpha_j e_j = \sum_j (x, e_j)\, e_j \]
Example 55 (Inner product in $\mathbb{R}^3$). You are familiar with this concept from vector calculus.
\[ x = \begin{bmatrix} 5 \\ 3 \\ 1 \end{bmatrix}, \qquad (x, e_1) = x \cdot e_1 = 5 \]

Example 56 (Finite Fourier Basis). An example you will see repeatedly within the context of MR is the
projection onto the space spanned by a finite set of orthogonal Fourier basis functions
\[ M = \mathrm{span} \{ 1, \cos(n \pi x / l), \sin(n \pi x / l), \; n = 1, \dots, N \} \subset L_2(-l, l) \]
Consider the function
\[ f = \begin{cases} 0, & x < 0.5 \\ 1, & x > 0.5 \end{cases} \]
We want to expand this function in terms of the finite basis in order to represent it on the computer
\[ f^h = \sum_{i=0}^{2N} \alpha_i e_i = \frac{(f, 1)}{(1, 1)} + \sum_{n=1}^{N} \left( a_n \cos(n \pi x / l) + b_n \sin(n \pi x / l) \right) \]
When we get to the eigenvalue theory, we will see that this basis is indeed orthogonal, such that
\[ (f, \sin(k \pi x / l)) = (f^h, \sin(k \pi x / l)) \;\Rightarrow\; \int_{-l}^{l} f(x) \sin(k \pi x / l)\, dx = b_k\, (\sin(k \pi x / l), \sin(k \pi x / l)) \]
\[ (\sin(k \pi x / l), \sin(k \pi x / l)) = \int_{-l}^{l} \sin^2 \frac{k \pi x}{l}\, dx = l \]
so that
\[ b_k = \frac{1}{l} \int_{-l}^{l} f(x) \sin(k \pi x / l)\, dx \]
and similarly with $\cos(k \pi x / l)$
\[ a_k = \frac{1}{l} \int_{-l}^{l} f(x) \cos(k \pi x / l)\, dx \]
which is the Fourier series that we will see in the second part of the class.
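
The sine coefficients $b_k$ of the step function can be computed numerically and compared with a closed form. This sketch assumes $l = 1$ and a midpoint rule whose cell edges align with the jump at $x = 0.5$; the closed form $b_k = (\cos(k\pi/2) - \cos(k\pi)) / (k\pi)$ follows from integrating $\sin(k \pi x)$ over $[0.5, 1]$ and is valid only for this choice of $l$.

```python
import math

def f(x):
    """Step function of Example 56 (value at the jump itself is immaterial)."""
    return 1.0 if x > 0.5 else 0.0

def fourier_b(k, l=1.0, n=4000):
    """b_k = (1/l) * integral_{-l}^{l} f(x) sin(k pi x / l) dx, midpoint rule."""
    h = 2.0 * l / n
    total = 0.0
    for i in range(n):
        x = -l + (i + 0.5) * h
        total += f(x) * math.sin(k * math.pi * x / l)
    return total * h / l

def exact_b(k):
    """Closed form for l = 1: integral of sin(k pi x) over [0.5, 1]."""
    return (math.cos(k * math.pi / 2) - math.cos(k * math.pi)) / (k * math.pi)

numeric = [fourier_b(k) for k in (1, 2, 3)]
exact = [exact_b(k) for k in (1, 2, 3)]
print(numeric)
print(exact)
```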
Example 57 (Resampling). Another common example is in resampling.

Figure 36: Resampling is used for multi-resolution registration. Contours outline the exhale image over the
steps of a multi-resolution registration to match the inhale image. (a) Initial Inhale-Exhale Pair. (b) Initial
affine transformation of the exhale image to the inhale image. (c) Resampling blurs the image to low resolutions to
find bulk changes. Computations run fast at the low resolutions seen in (e)-(g). (d) The resolution is iteratively
increased until the registration problem is solved at full resolution (h), 32x → 16x → 8x → 4x → 2x → 1x

Suppose that we have a 512 × 512 image that we want to downsample to a 128 × 128 image. In this case
\[ X = \mathrm{span} \{ e_i, \; i = 1, \dots, 512 \times 512 \} \]
and the subspace is
\[ M = \mathrm{span} \{ \hat{e}_i, \; i = 1, \dots, 128 \times 128 \} \]
The full space image is
\[ f = \sum_{i=1}^{512 \times 512} \alpha_i e_i \]
and we want to project this to
\[ \hat{f} = \sum_{i=1}^{128 \times 128} \hat{\alpha}_i \hat{e}_i \]
As before, the $k$-th coefficient in the lower resolution basis is simply the projection of the higher dimensional
representation onto the $k$-th basis function, $\hat{e}_k$
\[ (\hat{f}, \hat{e}_k) = \hat{\alpha}_k (\hat{e}_k, \hat{e}_k) = (f, \hat{e}_k) \;\Rightarrow\; \left( \sum_{j=1}^{128 \times 128} \hat{\alpha}_j \hat{e}_j, \hat{e}_k \right) = \hat{\alpha}_k (\hat{e}_k, \hat{e}_k) = \left( \sum_{i=1}^{512 \times 512} \alpha_i e_i, \hat{e}_k \right) \]
In the typical case that the higher resolution voxels are completely contained in the lower resolution voxels,
the expansion coefficients in the lower resolution are the volume weighted sum, as you would expect.
\[ \hat{\alpha}_k = \frac{(f, \hat{e}_k)}{(\hat{e}_k, \hat{e}_k)} = \frac{\int_{\hat{\Omega}_k} \sum_{i=1}^{512 \times 512} \alpha_i e_i \hat{e}_k\, dx}{\int_{\hat{\Omega}_k} dx} = \sum_{j : \Omega_j \subset \hat{\Omega}_k} \alpha_j \frac{\mathrm{Vol}(\Omega_j)}{\mathrm{Vol}(\hat{\Omega}_k)} \]

Figure 37: Discretization of an Image. Each basis function is the indicator of a voxel, $e_i(x) = 1$ for $x \in \Omega_i$ and $e_i(x) = 0$ for $x \notin \Omega_i$, with $\Omega_i \subset \hat{\Omega}_i$.

So why did we go through all this formality to arrive at this intuitive result? Notice that the final
representation of the basis coefficients is in terms of the inner product and basis functions
only.
\[ \hat{\alpha}_k = \frac{(f, \hat{e}_k)}{(\hat{e}_k, \hat{e}_k)} \]
The inner product space formality can easily be extended to any other basis and inner product defined, e.g.
b-spline bases and weighted inner products that are prevalent in image registration.
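
The volume weighted sum can be written as a few lines of code. The following sketch downsamples a small square image by an integer factor, assuming piecewise-constant voxel basis functions of equal size so that every fine voxel carries the weight $\mathrm{Vol}(\Omega_j) / \mathrm{Vol}(\hat{\Omega}_k) = 1 / \text{factor}^2$; the 4 × 4 test image is illustrative.

```python
def downsample(img, factor):
    """Project a square image onto a coarser voxel basis by volume-weighted sums."""
    n = len(img)
    m = n // factor
    weight = 1.0 / (factor * factor)   # Vol(fine voxel) / Vol(coarse voxel)
    coarse = [[0.0] * m for _ in range(m)]
    for i in range(n):
        for j in range(n):
            # Fine voxel (i, j) lies inside coarse voxel (i // factor, j // factor).
            coarse[i // factor][j // factor] += weight * img[i][j]
    return coarse

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
coarse = downsample(img, 2)
print(coarse)
```

With equal voxel volumes the projection reduces to block averaging; a different basis or a weighted inner product would change the weights but not the formula $\hat{\alpha}_k = (f, \hat{e}_k) / (\hat{e}_k, \hat{e}_k)$.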

∗∗
4.3 Minimizing Vector ([Kreyszig, 1989], Section 3.3)
Suppose we are given a continuous finite energy signal $f \in L_2(\Omega) = \left\{ f : \int_\Omega f^2\, dx < \infty \right\} \equiv X$ that we
need to represent in a finite dimensional subspace, $M$, for a computer to understand. In this situation, we
will assume that we are given a known orthonormal basis $\{\phi_j, \; j = 1, \dots, n\}$ for our finite dimensional
subspace $M \subset X$, $\dim M < \infty$. We will formulate this as a projection problem in the inner product space,
$[L_2(\Omega), (\cdot,\cdot)]$. We want to find the element of the subspace, $f^h \in M \subset X$, that minimizes the distance to an
element $f \in X$.

Figure 38: Convex Set

(Definition) Convex Set A set $M$ is said to be convex if for every $x, y \in M$ the segment joining $x$ and $y$
is contained in $M$
\[ \{ z : z = \alpha x + (1 - \alpha) y, \; 0 \le \alpha \le 1 \} \subset M \quad \forall x, y \in M \]
In the case that the subspace is convex, we have a powerful result at our disposal that guarantees the existence of such a projection. We will not
discuss the technical details, but $L_2(\Omega)$ is complete; this just means that converging sequences converge to
an element in the space.
Theorem 4.1 (Minimizing vector). Let $X$ be an inner product space and $M \ne \emptyset$ a convex subset, $M \subset X$,
which is complete in the metric induced by the inner product (the subspace is closed and converging sequences converge
to an element of the subspace). Then for every given $x \in X$ there exists a unique $y \in M$ such that
\[ \delta = \inf_{\hat{y} \in M} \|x - \hat{y}\| = \|x - y\| \]
Proof. The proof follows from the parallelogram equality (11), i.e. minimization is with respect to the norm induced
by the inner product; see [Kreyszig, 1989] Theorem 3.3-1
Returning to our example, we wish to find $f^h \in M \subset L_2(\Omega)$ that provides the best approximation to our
original function $f$ with respect to the norm induced by our inner product
\[ \|f - f^h\| = \inf_{\bar{f} \in M} \|f - \bar{f}\| \]
The projection of $f$ onto the basis may be shown to be the minimum. The sum of squared differences between the projection coefficients $(f, \phi_j)$ and
an arbitrary set of basis coefficients $\beta_j$ may be written as
\[ \sum_j^n \left( (f, \phi_j) - \beta_j \right)^2 = \sum_j^n (f, \phi_j)^2 - 2 \sum_j^n \beta_j (f, \phi_j) + \sum_j^n \beta_j^2 \]
Explicitly writing out the norm using the properties of the inner product and orthonormality, and substituting
the difference between the coefficients, yields the following.
\[ \begin{aligned} \|f - f^h\|^2 &= \Big( f - \sum_j^n \beta_j \phi_j, \; f - \sum_j^n \beta_j \phi_j \Big) = (f, f) - \Big( f, \sum_j^n \beta_j \phi_j \Big) - \Big( \sum_j^n \beta_j \phi_j, f \Big) + \Big( \sum_j^n \beta_j \phi_j, \sum_j^n \beta_j \phi_j \Big) \\ &= (f, f) - 2 \Re e \Big( f, \sum_j^n \beta_j \phi_j \Big) + \sum_j^n \beta_j \bar{\beta}_j (\phi_j, \phi_j) \\ &= (f, f) - 2 \sum_j^n \beta_j (f, \phi_j) + \sum_j^n \beta_j^2 \qquad \text{(assuming real numbers)} \\ &= \|f\|^2 - \sum_j^n (f, \phi_j)^2 + \sum_j^n \left( (f, \phi_j) - \beta_j \right)^2 \end{aligned} \]
This difference is minimized when the expansion coefficients equal the inner product of the original function
with the basis
\[ (f, \phi_j) = \beta_j = (f^h, \phi_j) \quad \forall j \tag{12} \]
Here, the parameters, $(f, \phi_j) \equiv f_j$, that provide the "best fit" of the data are understood to provide the
minimum distance with respect to the norm induced by the inner product.
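
The identity $\|f - f^h\|^2 = \|f\|^2 - \sum_j (f, \phi_j)^2 + \sum_j ((f, \phi_j) - \beta_j)^2$ can be illustrated in $\mathbb{R}^3$. The sketch below uses an illustrative vector `f` and a two-element orthonormal set `phi` (neither is from the notes) and shows that perturbing the coefficients away from $(f, \phi_j)$ increases the squared distance by exactly the squared perturbation.

```python
import math

def dot(u, v):
    """Usual inner product on R^n."""
    return sum(a * b for a, b in zip(u, v))

s = 1.0 / math.sqrt(2.0)
phi = [[s, s, 0.0],      # orthonormal basis of a 2-dimensional subspace M
       [0.0, 0.0, 1.0]]
f = [5.0, 3.0, 1.0]

def dist2(beta):
    """Squared distance between f and the element of M with coefficients beta."""
    approx = [sum(b * p[i] for b, p in zip(beta, phi)) for i in range(len(f))]
    diff = [f_i - a_i for f_i, a_i in zip(f, approx)]
    return dot(diff, diff)

best = [dot(f, p) for p in phi]               # beta_j = (f, phi_j), Eqn (12)
perturbed = [best[0] + 0.1, best[1] - 0.2]    # any other choice of coefficients

print(dist2(best), dist2(perturbed))  # 2.0 versus 2.0 + 0.1**2 + 0.2**2
```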

4.4 Applications: Interpolation and Least Squares, [Heath, 1998] Ch. 3


The most common application of least squares is curve fitting. Given a set of data points $\{(t_i, b_i), \; i = 1, 2, \dots, m\}$,
we wish to find a function $f : \mathbb{R} \to \mathbb{R}$ that best approximates the measured values in some sense. In this
situation, we will consider a finite dimensional subspace $M \subset L_2(\Omega)$ arising from a known orthonormal basis
$\{\phi_j, \; j = 1, \dots, n\}$ such that the function at the measurement points approximates the measured values, $b_i$.
\[ b_i \approx f(t_i) \approx \hat{f}(t_i) = \sum_j^n \hat{f}_j \phi_j(t_i), \qquad i = 1, 2, \dots, m, \qquad f \in L_2(\Omega) \]
Typically we have an overdetermined system such that the number of measurements is greater than the
number of coefficients of the expansion, $m > n$. A linear system of equations may be obtained with each
row of the linear system representing one measurement point
\[ Ax = \begin{bmatrix} \phi_1(t_1) & \phi_2(t_1) & \dots & \phi_n(t_1) \\ \phi_1(t_2) & \phi_2(t_2) & \dots & \phi_n(t_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(t_m) & \phi_2(t_m) & \dots & \phi_n(t_m) \end{bmatrix} x = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = b, \qquad x = \begin{bmatrix} (f, \phi_1) \\ (f, \phi_2) \\ \vdots \\ (f, \phi_n) \end{bmatrix} \equiv \begin{bmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \vdots \\ \hat{f}_n \end{bmatrix} \]
From the Rank and Nullity Theorem 3.2
\[ m > n = \dim N(T) + \dim R(T) \]
The range space must be a proper subspace of the full space, $m > \dim R(T)$, and the rank of the augmented matrix
may be used to show whether a solution exists. The dimension of the null space may or may not be 0.

Since the solution may or may not exist, we need to redefine our problem setup to give meaning to
this problem and guarantee that a solution exists. Recall our minimizing vector Theorem 4.1. We apply the
inner product setup with $A : \mathbb{R}^n \to \mathbb{R}^m$ and the usual inner product in $\mathbb{R}^m$; the subspace is the range of the
operator $A$.
\[ M \equiv R(A) = \{ Ax : x \in \mathbb{R}^n \} = \mathrm{span} \left\{ \begin{bmatrix} \phi_1(t_1) \\ \phi_1(t_2) \\ \vdots \\ \phi_1(t_m) \end{bmatrix}, \begin{bmatrix} \phi_2(t_1) \\ \phi_2(t_2) \\ \vdots \\ \phi_2(t_m) \end{bmatrix}, \dots, \begin{bmatrix} \phi_n(t_1) \\ \phi_n(t_2) \\ \vdots \\ \phi_n(t_m) \end{bmatrix} \right\} \subset \mathbb{R}^m \]
The trick is to project the right hand side of measurements into the range space of the operator. Let $p$ be
the orthogonal projection of $b$ into $R(A)$
\[ \|b - p\|_2 = \inf_{v \in M} \|b - v\|_2, \qquad \sqrt{(b - p, b - p)} = \inf_{v \in M} \sqrt{(b - v, b - v)}, \qquad M \equiv R(A), \quad b \in \mathbb{R}^m \]
From the minimizing vector Theorem 4.1 we know that $p$ exists and is unique. Then by the definition of the
range space there exists an $x^*$ that maps to this $p$
\[ p \in R(A) \;\Rightarrow\; \exists x^* \in \mathbb{R}^n : Ax^* = p. \]
Thus we have a well defined solution $x^*$
\[ \|b - p\|_2 = \|b - Ax^*\|_2 = \inf_{v \in R(A)} \|b - v\|_2 \]
or
\[ \|b - p\|_2 = \|b - Ax^*\|_2 = \inf_{x \in \mathbb{R}^n} \|b - Ax\|_2 \]
Since $p$ is unique and $Ax^* = p$, the least squares problem has a unique solution if the null space is trivial
\[ \dim N(A) = 0 \;\Rightarrow\; m > n = 0 + \dim R(A) \;\Rightarrow\; \dim R(A) = n \]
Notice that the 2-norm minimization reduces to the expected minimization of the sum of squared residuals
\[ \min_{x \in \mathbb{R}^n} \|b - Ax\|_2^2 = \min_{x \in \mathbb{R}^n} (b - Ax, b - Ax) = \min_{x \in \mathbb{R}^n} \sum_i \Big( b_i - \sum_j^n \hat{f}_j \phi_j(t_i) \Big)^2 = \min_{x \in \mathbb{R}^n} r^\top r, \qquad r = b - Ax \]
Notice that while we generated our problem description from the 2-norm induced by the usual $(\cdot,\cdot)_2$ inner
product, we can easily redefine our problem in terms of a different inner product, and all arguments hold in
terms of distances in inner product spaces. In particular, a weighted inner product may be used.

Normal Equations There are several methods for solving this minimization/optimization problem. The
normal equations have a simple intuitive derivation; however, we should be wary of normal equations. Normal
equations are obtained by expanding the residual
\[ r^\top r = (b - Ax)^\top (b - Ax) = b^\top b - 2 x^\top A^\top b + x^\top A^\top A x \]
and taking the derivative with respect to $x$ and setting it to zero, similar to undergrad calculus.
\[ \frac{d}{dx} r^\top r = 2 A^\top A x - 2 A^\top b = \vec{0} \]
which reduces to an $n \times n$ square linear system with an amplified condition number (in the 2-norm).
\[ A^\top A x = A^\top b, \qquad \mathrm{cond}(A^\top A) = [\mathrm{cond}(A)]^2 \tag{13} \]
Thus, if the condition number of the original system was large, the condition number of the normal equations
will be that number squared.
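
As a small worked instance of the normal equations (13), the sketch below fits a straight line $c_0 + c_1 t$ to three illustrative data points (not from the notes), forms $A^\top A$ and $A^\top b$ explicitly, and solves the resulting 2 × 2 system with Cramer's rule. It also checks that the residual is orthogonal to the columns of $A$, which is what (13) states.

```python
t = [0.0, 1.0, 2.0]          # illustrative measurement points
b = [1.0, 2.0, 2.0]          # illustrative measured values
A = [[1.0, t_i] for t_i in t]   # monomial basis: phi_1 = 1, phi_2 = t
m, n = len(A), 2

# Form the normal equations A^T A x = A^T b.
AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
       for i in range(n)]
Atb = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]

# Solve the 2 x 2 system with Cramer's rule.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
c0 = (Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det
c1 = (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det

# The residual r = b - A x is orthogonal to the range of A: A^T r = 0.
residual = [b[k] - (c0 + c1 * t[k]) for k in range(m)]
At_r = [sum(A[k][i] * residual[k] for k in range(m)) for i in range(n)]
print(c0, c1, At_r)
```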

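Orthogonal Transformations Orthogonal transformations based on the QR factorization are common
in MATLAB. These factorizations are based on the idea of orthogonal matrices.
(Definition) Orthogonal Matrix A square matrix $Q$ is orthogonal if
\[ Q^\top Q = Q Q^\top = I \]
Orthogonal matrices preserve the norm, and hence the distance we are trying to minimize, of any vector $x$
\[ \|Qx\|_2^2 = (Qx, Qx) = x^\top Q^\top Q x = (x, x) = \|x\|_2^2 \quad \forall x \]
Given an $m \times n$ matrix $A$, $m \ge n$, we seek an $m \times m$ orthogonal matrix $Q$ such that
\[ A = Q \begin{bmatrix} R \\ O \end{bmatrix} \]
where $R$ is an $n \times n$ upper triangular matrix and $O$ is an $(m - n) \times n$ matrix of zeros. Using the properties
of the orthogonal matrix, this leads to a transformation of the least squares equations to an equivalent, but
more numerically stable form.
\[ \|b - Ax\|_2^2 = \Big\| b - Q \begin{bmatrix} R \\ O \end{bmatrix} x \Big\|_2^2 = \Big\| \underbrace{Q Q^\top}_{= I} b - Q \begin{bmatrix} R \\ O \end{bmatrix} x \Big\|_2^2 = \Big\| Q \Big( Q^\top b - \begin{bmatrix} R \\ O \end{bmatrix} x \Big) \Big\|_2^2 = \Big\| Q^\top b - \begin{bmatrix} R \\ O \end{bmatrix} x \Big\|_2^2 \]
Now the minimization problem becomes
\[ \min_x \Big\| Q^\top b - \begin{bmatrix} R \\ O \end{bmatrix} x \Big\|_2^2 = \min_x \|\hat{b}_1 - Rx\|_2^2 + \|\hat{b}_2\|_2^2, \qquad \begin{bmatrix} R \\ O \end{bmatrix} x \approx Q^\top b \equiv \begin{bmatrix} \hat{b}_1 \\ \hat{b}_2 \end{bmatrix} \]
Here $\hat{b}_1$ is the $n \times 1$ sub-vector of the transformed vector $Q^\top b$ and $\hat{b}_2$ is the $(m - n) \times 1$ remaining sub-vector.
Since the optimization has no control over the $\|\hat{b}_2\|$ term, the minimum occurs when the first term vanishes and the residual norm is equal to
this term.
\[ Rx = \hat{b}_1, \qquad \|r\|_2^2 = \|\hat{b}_2\|_2^2 \]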
We will not go into details but several methods are possible for computing this QR factorization including
• Householder transformations
• Givens transformations
• Gram-Schmidt orthogonalization
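
To make the last option in the list concrete, here is a minimal sketch of a least squares solve through classical Gram-Schmidt orthogonalization. It builds the reduced ("thin") factorization $A = QR$ with $Q$ of size $m \times n$, rather than the full $m \times m$ factor described above, and is applied to three illustrative data points for a straight-line fit. Classical Gram-Schmidt is chosen for brevity only; Householder or Givens transformations are generally preferred for numerical stability.

```python
import math

def gram_schmidt_qr(A):
    """Reduced QR factorization by classical Gram-Schmidt (A has full column rank)."""
    m, n = len(A), len(A[0])
    Q = [[0.0] * n for _ in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [A[i][j] for i in range(m)]
        for k in range(j):
            # Remove the component of column j along the k-th orthonormal vector.
            R[k][j] = sum(Q[i][k] * A[i][j] for i in range(m))
            v = [v[i] - R[k][j] * Q[i][k] for i in range(m)]
        R[j][j] = math.sqrt(sum(c * c for c in v))
        for i in range(m):
            Q[i][j] = v[i] / R[j][j]
    return Q, R

def back_substitute(R, y):
    """Solve the upper triangular system R x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / R[i][i]
    return x

t = [0.0, 1.0, 2.0]             # illustrative measurement points
b = [1.0, 2.0, 2.0]             # illustrative measured values
A = [[1.0, t_i] for t_i in t]   # monomial basis: phi_1 = 1, phi_2 = t

Q, R = gram_schmidt_qr(A)
b_hat_1 = [sum(Q[i][j] * b[i] for i in range(len(b))) for j in range(len(R))]
coeffs = back_substitute(R, b_hat_1)   # solves R x = Q^T b
print(coeffs)
```

For these data the result agrees with the solution of the normal equations, $c_0 = 7/6$, $c_1 = 1/2$; the difference between the two approaches only shows up for ill-conditioned $A$.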
As an example of the effect of ill-conditioning on a least squares fit using the monomial
basis, for both a normal equation approach and an orthogonal transformation approach, consider the approximation
with a 10-th degree polynomial on [0,1], $P_{10}[0, 1]$.

>> x=[0:.02:1]';
>> b=exp(x);
>> A = [] ; for iii = 0:10; A = [A,x.^iii]; end
>> cond(A)

ans =

2.0371e+07

>> cond(A'*A) % conditioning of matrix used to solve normal equations

ans =

4.1451e+14

Figure 39: http://www.cse.illinois.edu/iem/least_squares/data_fitting An example least squares fit is shown.
For a quadratic fit the matrix and the data vector $f$ are
\[ \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix}, \qquad f = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \]
Here $\phi_1(x) = 1$, $\phi_2(x) = x$, and $\phi_3(x) = x^2$. Notice that the monomial basis functions become more
indistinguishable with increasing polynomial order. This leads to nearly linearly dependent columns and ill-conditioning
in the matrix.

>> [Q R] = qr(A);
>> cond(R) % conditioning of matrix used in QR factorization

ans =

2.0371e+07
>> xone = (A'*A)\(A'*b) % normal equation solution

xone =

1.000000000804584
0.999999840420694
0.500004795998859
0.166609456930047
0.042019700547931
0.007065676553890
0.004185623326857
-0.003646097590225
0.003234439475462
-0.001486452960945
0.000294845427892

>> xtwo = R\(Q'*b) % QR factorization solution

xtwo =

1.000000000000017
0.999999999994582
0.500000000208064
0.166666663555648
0.041666690728768
0.008333224089361
0.001389198998930
0.000197847650168
0.000025459516102
0.000002286833573
0.000000456883813

>> norm(A* xone - b,2) % normal equations residual

ans =
2.882617002030374e-09

>> norm(A* xtwo - b,2) % QR factorization residual


ans =
9.880311277736581e-14
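
The session above forms the full 51 x 51 matrix Q. As a minimal sketch, assuming the economy-size factorization `qr(A,0)` is acceptable for this problem, the same fit can be computed with only the first n columns of Q. The variable names `Q1`, `R1`, `xqr` and `res` are illustrative and do not appear in the original session.

```matlab
% Sketch: least squares fit via economy-size QR (an assumption, not the session above)
x = (0:.02:1)';                 % same 51 sample points as the session above
b = exp(x);                     % data to fit
A = zeros(numel(x), 11);        % monomial basis, degrees 0..10
for k = 0:10
    A(:,k+1) = x.^k;
end
[Q1, R1] = qr(A, 0);            % Q1 is 51x11, R1 is 11x11 upper triangular
xqr = R1 \ (Q1'*b);             % solve the triangular system R1*x = Q1'*b
res = norm(A*xqr - b, 2);       % least squares residual norm
```

Here `Q1'*b` plays the role of the sub-vector b̂1 in the derivation, so the triangular solve is the system Rx = b̂1.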

4.5 Adjoint Operator [Greenberg, 1978] Ch 18.4


Adjoint operators play a key role in a variety of applications.

(Definition) Adjoint Operator Let $T : H_1 \to H_2$ be a bounded linear operator on Hilbert spaces $H_1$ and
$H_2$. The adjoint operator $T^* : H_2 \to H_1$ is defined to be the operator satisfying

\[ (Tx, y)_{H_2} = (x, T^* y)_{H_1} \quad \forall x \in H_1, \ \forall y \in H_2 \]

It may be shown that this operator exists and is unique.

Theorem 4.2 (Existence of Adjoint Operator). The Hilbert-adjoint operator $T^*$ of $T$ in definition 4.5 exists,
is unique and is a bounded linear operator with norm equal to the original operator norm.

\[ \|T^*\| = \|T\| \]

Proof. [Kreyszig, 1989] Theorem 3.9-2

(Definition) Self Adjoint We say that an operator on an inner product space is self adjoint if the operator
equals its adjoint
\[ T = T^* \]

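Example 58 (Adjoint of a Matrix Operator). The adjoint of an $n \times n$ matrix $A$ may be determined from
properties of the inner product $(x, y) = x^\top \bar{y}$

\[ (Ax, y) = (Ax)^\top \bar{y} = x^\top A^\top \bar{y} = x^\top \overline{\bar{A}^\top y} = (x, \bar{A}^\top y) \]

Hence $A^* = \bar{A}^\top$, and a matrix is self adjoint with respect to the usual inner product if it equals the conjugate of its
transpose.
\[ \begin{bmatrix} 1 & 3 \\ i & 2-i \end{bmatrix}^* = \begin{bmatrix} 1 & -i \\ 3 & 2+i \end{bmatrix}, \qquad \underbrace{\begin{bmatrix} 3 & 1+2i \\ 1-2i & -1 \end{bmatrix}^* = \begin{bmatrix} 3 & 1+2i \\ 1-2i & -1 \end{bmatrix}}_{\text{Hermitian Symmetry}}, \qquad \begin{bmatrix} 2 & 3 \\ 3 & 1 \end{bmatrix}^* = \begin{bmatrix} 2 & 3 \\ 3 & 1 \end{bmatrix} \]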
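
The defining identity of the matrix adjoint can be checked numerically. The following is a minimal sketch, assuming the inner product (u,v) = uᵀ conj(v) and using the fact that the MATLAB operator `'` is the conjugate transpose. The test vectors `x`, `y` and the helper `ip` are arbitrary illustrative choices.

```matlab
% Sketch: numerical check of (A*x, y) = (x, A^* y) with A^* = conj(A).'
A = [1 3; 1i 2-1i];                 % first matrix of the example above
x = [1+2i; -1i];                    % arbitrary test vectors (assumption)
y = [2; 3-1i];
ip = @(u,v) sum(u .* conj(v));      % inner product (u,v) = u^T conj(v)
lhs = ip(A*x, y);                   % (A x, y)
rhs = ip(x, A'*y);                  % (x, A^* y), A' is the conjugate transpose
herm = [3 1+2i; 1-2i -1];           % second matrix of the example above
isSelfAdjoint = isequal(herm, herm');
```

The two inner products agree up to rounding, and `isSelfAdjoint` is true only for the Hermitian matrix.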

The following lemmas are useful in studying the properties of Hilbert-adjoint operators.
Lemma 4.3 (Equality). If the inner products of two vectors $v_1, v_2 \in X$ with $w$ are equal for all $w \in X$, then the two
vectors are the same.
\[ (v_1, w) = (v_2, w) \ \ \forall w \in X \quad \Rightarrow \quad v_1 = v_2 \]
In particular,
\[ (v_1, w) = 0 \ \ \forall w \in X \quad \Rightarrow \quad v_1 = 0 \]
Proof. By assumption,
\[ (v_1 - v_2, w) = (v_1, w) - (v_2, w) = 0 \quad \forall w \in X \]
For $w = v_1 - v_2$ this gives $\|v_1 - v_2\| = 0$. Hence $v_1 - v_2 = 0$, so that $v_1 = v_2$. In particular,

\[ (v_1, w) = 0 \ \text{with} \ w = v_1 \quad \Rightarrow \quad \|v_1\| = 0 \quad \Rightarrow \quad v_1 = 0 \quad \text{(defn of norm)} \]

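Lemma 4.4 (Zero Operator). Let $X$ and $Y$ be inner product spaces and $Q : X \to Y$ a bounded linear
operator. Then:
\[ Q = 0 \quad \Leftrightarrow \quad (Qx, y) = 0 \ \ \forall x \in X, \ y \in Y \]

Proof. ($\Rightarrow$)

\[ Q = 0 \ \Rightarrow \ Qx = 0 \ \ \forall x \ \Rightarrow \ (Qx, y) = (0, y) = 0\,(w, y) = 0 \quad \forall w, y \]

($\Leftarrow$) Conversely,

\[ (Qx, y) = 0 \ \ \forall x, y \ \Rightarrow \ Qx = 0 \ \ \forall x \ \text{(Lemma 4.3)} \ \Rightarrow \ Q = 0 \ \text{(by definition)} \]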

The following properties are used frequently in applying adjoint operators, and the derivations are useful
in understanding manipulations of adjoint operators.
Theorem 4.5 (Properties of Hilbert-Adjoint Operators). Let $H_1$, $H_2$ be Hilbert spaces, $S : H_1 \to H_2$ and
$T : H_1 \to H_2$ bounded linear operators and $\alpha$ a scalar.

(a) $(T^* y, x) = (y, Tx)$, $x \in H_1$, $y \in H_2$
(b) $(S + T)^* = S^* + T^*$
(c) $(\alpha T)^* = \bar{\alpha} T^*$
(d) $(T^*)^* = T$
(e) $\|T^* T\| = \|T T^*\| = \|T\|^2$
(f) $T^* T = 0 \Leftrightarrow T = 0$
(g) $(ST)^* = T^* S^*$ assuming $H_1 = H_2$

Proof. • (a) The adjoint may be written with respect to the other argument of the inner product. By
definition 4.5 we have
\[ (T^* y, x) = \overline{(x, T^* y)} = \overline{(Tx, y)} = (y, Tx) \]

• (b) The adjoint operation is distributive over addition. By definition 4.5, for all $x$ and $y$,

\begin{align*}
(x, (S + T)^* y) &= ((S + T)x, y) \\
&= (Sx, y) + (Tx, y) \\
&= (x, S^* y) + (x, T^* y) \\
&= (x, (S^* + T^*) y)
\end{align*}

Hence, $(S + T)^* y = (S^* + T^*) y$ for all $y$, and the property holds by lemma 4.3.

• (c) This formula should not be confused with the action of the linear adjoint on the vector $\alpha x$, ie $T^*(\alpha x) = \alpha T^* x$.
Using lemma 4.4 with $Q = (\alpha T)^* - \bar{\alpha} T^*$

\begin{align*}
((\alpha T)^* y, x) &= (y, (\alpha T) x) && \text{from (a)} \\
&= (y, \alpha (Tx)) && \text{by defn.} \\
&= \bar{\alpha} (y, Tx) && \text{conj. lin. of IP} \\
&= \bar{\alpha} (T^* y, x) && \text{from (a)} \\
&= (\bar{\alpha} T^* y, x) && \text{lin. of IP}
\end{align*}

• (d) The adjoint operator applied twice equals the original operator

\begin{align*}
((T^*)^* x, y) &= (x, T^* y) && \text{from (a)} \\
&= (Tx, y) && \text{defn of adjoint}
\end{align*}

• (e) [Kreyszig, 1989] Theorem 3.9-4

• (f) From properties of the norm and from (e)

\[ T^* T = 0 \ \Leftrightarrow \ \|T^* T\| = \|T\|^2 = 0 \ \Leftrightarrow \ T = 0 \]

• (g) Repeated application of the definition of the adjoint

\[ (x, (ST)^* y) = ((ST)x, y) = (Tx, S^* y) = (x, T^* S^* y) \]

The adjoint may also be defined for an operator on a space of continuous functions.
Example 59 (Adjoint of a Differential Operator). Consider a differential operator with a specified zero boundary
condition, defined on the space of continuously differentiable functions with the usual $L^2$ inner product
\[ X \equiv C^1[0, 1], \qquad (x, y) = \int_0^1 x(t) y(t)\, dt, \qquad Lx = \frac{d}{dt} x + x, \quad x(0) = 0 \]
To find the adjoint we start with the definition of the inner product and integrate the derivative term by parts
\[ \int_a^b \frac{d}{dx} \big( u(x) v(x) \big)\, dx = u(x) v(x) \Big|_a^b = \int_a^b \big( u'(x) v(x) + u(x) v'(x) \big)\, dx \]

to manipulate the result into the form of the operator acting on the second variable, $(x, L^* y)$
\begin{align*}
(Lx, y) &= \int_0^1 \big( x'(t) y(t) + x(t) y(t) \big)\, dt \\
&= x(1) y(1) - x(0) y(0) + \int_0^1 \big( -y'(t) x(t) + x(t) y(t) \big)\, dt \\
&= x(1) y(1) + \int_0^1 \big( -y'(t) x(t) + x(t) y(t) \big)\, dt \\
&= (x, L^* y)
\end{align*}

Here the condition $x(0) = 0$ takes care of one of the boundary terms on the right hand side. For the remaining term to vanish we must define the boundary
condition of the adjoint operator to be zero at the other end of the domain, $y(1) = 0$
\[ L^* y = -\frac{d}{dt} y + y, \quad y(1) = 0 \]
Hence this operator fails to be self adjoint for two reasons: (1) $L \ne L^*$ and (2) the boundary conditions are
not the same.
It is important to realize that the adjoint operator is defined with respect to the inner product defined
on the space.
Example 60 (Adjoint of Sturm Liouville Operator). The Sturm Liouville operator is a differential operator
that has an important role in many applications.
\[ L \equiv \frac{1}{w(x)} \left[ \frac{d}{dx} \left( p(x) \frac{d}{dx} \right) + r(x) \right], \qquad p(x), w(x) > 0 \tag{14} \]
Boundary conditions are typically assumed of the form:

\[ \alpha x(a) + \beta x'(a) = 0, \qquad \gamma x(b) + \delta x'(b) = 0 \]

To be explicit, in this example we will assume zero boundary conditions.

\[ x(a) = 0, \qquad x(b) = 0 \]

The Sturm Liouville operator may be shown to be self adjoint with respect to the weighted inner product
\[ (f, g) = \int_a^b f(x) g(x) w(x)\, dx, \qquad w(x) > 0, \quad a \le x \le b \]

Here positivity of the weighting function $w(x)$ is imposed to ensure the inner product is positive. We use
integration by parts twice, based on the product rules
\[ \int_a^b \frac{d}{dt} \left[ \left( p \frac{dx}{dt} \right) y \right] dt = \int_a^b \frac{d}{dt} \left( p \frac{dx}{dt} \right) y \, dt + \int_a^b p \frac{dx}{dt} \frac{dy}{dt} \, dt \]
\[ \int_a^b \frac{d}{dt} \left[ \left( \frac{dy}{dt} p \right) x \right] dt = \int_a^b \frac{d}{dt} \left( \frac{dy}{dt} p \right) x \, dt + \int_a^b \frac{dy}{dt} p \frac{dx}{dt} \, dt \]
The adjoint may be found to be
\begin{align*}
(Lx, y) &= \int_a^b \frac{1}{w} \left[ \frac{d}{dt} \left( p x' \right) + r x \right] y \, w \, dt \\
&= \int_a^b \left[ \frac{d}{dt} \left( p x' \right) + r x \right] y \, dt \\
&= [p x' y]_a^b + \int_a^b \left( -y' p x' + r x y \right) dt \\
&= [p x' y]_a^b - [p x y']_a^b + \int_a^b \left( \frac{d}{dt} \left( p y' \right) x + r x y \right) dt \\
&= [p x' y]_a^b - [p x y']_a^b + \int_a^b \frac{1}{w} \left[ \frac{d}{dt} \left( p y' \right) + r y \right] w \, x \, dt \\
&= (x, L^* y)
\end{align*}

where all functions are evaluated at $t$. Expanding the boundary terms gives

\[ p(b) x'(b) y(b) - p(a) x'(a) y(a) - p(b) x(b) y'(b) + p(a) x(a) y'(a) \]

The last two terms vanish because $x(a) = x(b) = 0$, and we see that we require the same boundary conditions on $y$ for the remaining boundary terms to vanish. Hence the operator
is self adjoint with respect to the weighted inner product.
\[ L^* \equiv \frac{1}{w(x)} \left[ \frac{d}{dx} \left( p(x) \frac{d}{dx} \right) + r(x) \right], \qquad y(a) = 0, \quad y(b) = 0 \]

The following property appears when looking at the eigenvalues of a self adjoint operator.
Theorem 4.6 (Self-adjointness). [Kreyszig, 1989] Theorem 3.10-3 Let $T : H \to H$ be a bounded linear
operator on a Hilbert space $H$.

\[ T \ \text{self adjoint} \quad \Rightarrow \quad (Tx, x) \in \mathbb{R} \quad \forall x \in H \]

Proof. If $T$ is self adjoint then

\[ \overline{(Tx, x)} = (x, Tx) = (Tx, x) \quad \forall x \]
Hence $(Tx, x)$ is equal to its complex conjugate, so it is real.

5 Eigen-formulation for Bounded Self-Adjoint Linear Operator


We will focus our study of eigenvalues on bounded linear operators $L$ that are defined on a Hilbert space $H$ and map
$H$ into itself.
\[ Lx = \lambda x \]
Eigenvalues and eigenvectors have an important role in determining the convergence of iterative schemes
for the solution of linear equations as well as for optimization. We will focus our study on the special class of
self-adjoint operators. We will see that in this case the eigenvectors of the self-adjoint operator
constitute a basis for the space under consideration, $H$.

5.1 Spectrum of Bounded Self-Adjoint Linear Operator, ([Greenberg, 1978], Ch 20)
First of all, it is important to note that not all bounded self-adjoint operators have eigenvalues. However,
when the eigenvalues exist we have a very mature set of tools that allow us to characterize the spectrum of
the operator.
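Example 61 (Eigenvalues of a Matrix). Consider the linear operator $A : \mathbb{R}^2 \to \mathbb{R}^2$
\[ A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \]

Rewriting $Ax = \lambda x$ as $(A - \lambda I)x = 0$, we recall that this linear system of equations has a non-trivial solution
if and only if

\[ \det(A - \lambda I) = 0 \quad \Rightarrow \quad \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = 0 \]

This is known as the characteristic equation of $A$, and the roots $\lambda_1, \lambda_2 = 1, 3$ are the eigenvalues. To find the
corresponding eigenvectors, consider $\lambda_1 = 1$ with $Ax = \lambda x$ explicitly written out for $x = (\xi_1, \xi_2)^\top$.

\[ 2\xi_1 + \xi_2 = \xi_1, \qquad \xi_1 + 2\xi_2 = \xi_2 \quad \Rightarrow \quad \xi_2 = -\xi_1 \]

Denoting eigenvectors as $e_j$, the eigenvector

\[ e_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \]

may be written up to an arbitrary constant, since

\[ L(\alpha x) = \lambda (\alpha x) \quad \forall \alpha \]

Repeating for $\lambda_2 = 3$
\[ 2\xi_1 + \xi_2 = 3\xi_1, \qquad \xi_1 + 2\xi_2 = 3\xi_2 \quad \Rightarrow \quad \xi_2 = \xi_1 \]
we have
\[ e_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \]
If we prefer, the eigenvectors may easily be normalized.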
The fact that the eigenvectors are orthogonal is not a coincidence.
Theorem 5.1 (Spectrum of Self-Adjoint Operator). If $L$ is self adjoint, then
(i) the eigenvalues are real

(ii) eigenvectors corresponding to distinct eigenvalues are mutually orthogonal

Proof. (i) Suppose we have an eigenpair $\lambda_j$ and $e_j \ne 0$, $L e_j = \lambda_j e_j$

\[ (e_j, L e_j) = (e_j, \lambda_j e_j) = \bar{\lambda}_j (e_j, e_j) \]

\[ (L e_j, e_j) = (\lambda_j e_j, e_j) = \lambda_j (e_j, e_j) \]

Since $L$ is self adjoint

\[ (L e_j, e_j) = (e_j, L e_j) \ \Rightarrow \ 0 = (\lambda_j - \bar{\lambda}_j)(e_j, e_j) \ \Rightarrow \ \lambda_j = \bar{\lambda}_j \in \mathbb{R} \]

(ii) Suppose that we have eigenvectors $e_i$, $e_j$ corresponding to distinct eigenvalues $\lambda_i \ne \lambda_j$

\[ (L e_i, e_j) = (\lambda_i e_i, e_j) = \lambda_i (e_i, e_j) \]

\[ (e_i, L e_j) = (e_i, \lambda_j e_j) = \bar{\lambda}_j (e_i, e_j) = \lambda_j (e_i, e_j) \]

\begin{align*}
\text{Self Adjoint} \ &\Rightarrow \ (L e_i, e_j) = (e_i, L e_j) \\
&\Rightarrow \ (L e_i, e_j) - (e_i, L e_j) = (\lambda_i - \lambda_j)(e_i, e_j) = 0 \\
&\Rightarrow \ (e_i, e_j) = 0
\end{align*}
Example 62 (Multiplicity of Eigenvalues). Consider
\[ A = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{bmatrix} \]

The characteristic equation is

\[ \det(A - \lambda I) = 0 \quad \Rightarrow \quad \begin{vmatrix} 2-\lambda & 0 & 0 \\ 0 & 1-\lambda & 1 \\ 0 & 1 & 1-\lambda \end{vmatrix} = -\lambda (\lambda - 2)^2 = 0 \]

The eigenvalues are thus $\lambda_1, \lambda_2 = 0, 2$. Note carefully that $x = 0$ is never acceptable as an eigenvector. By
definition an eigenvector is to be nontrivial, $x \ne 0$. However, as seen, the eigenvalue may be zero. In this
case the eigenvalue $\lambda_2 = 2$ is said to be of "multiplicity 2". Proceeding as before
\[ \lambda_1 = 0: \ e_1 = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}, \qquad \lambda_2 = 2: \ e_2 = \begin{bmatrix} \alpha \\ \beta \\ \beta \end{bmatrix} \]

Here $\alpha$ and $\beta$ cannot both be zero. As before, $A$ is self adjoint so the eigenvalues are real and $(e_1, e_2) = 0$
$\forall \alpha, \beta$. Moreover, the eigenspace of the second eigenvalue actually contains two orthogonal vectors as well. For instance,
\[ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \]

So in fact we have three mutually orthogonal eigenvectors, and the three eigenvectors constitute an orthogonal
basis for the space. We will see that this basis is particularly helpful in solving the inhomogeneous problem
$Lx = c$.
In fact it can be shown that
Theorem 5.2. For any self adjoint operator $L$ on a finite dimensional domain, $k$ mutually orthogonal
eigenvectors can be found for each eigenvalue of multiplicity $k$.

Proof.
Together with the fact that eigenvectors corresponding to distinct eigenvalues are orthogonal we have
Theorem 5.3. The eigenvectors of any self-adjoint operator $L$ on a finite dimensional space constitute a
basis for the space.

Proof.
These results lead to a diagonalization of self adjoint matrices.
(Definition) Modal Matrix The modal matrix is defined as the matrix whose columns are the normalized eigenvectors
\[ Q = \begin{bmatrix} e_1 & e_2 & e_3 \end{bmatrix} \tag{15} \]

As we will see, this leads to a concise form of symmetric matrices in which we may study convergence of
algorithms in the framework of optimization theory
\[ Q^\top A Q = \begin{bmatrix} e_1^\top \\ e_2^\top \\ e_3^\top \end{bmatrix} \begin{bmatrix} A e_1 & A e_2 & A e_3 \end{bmatrix} = \begin{bmatrix} e_1^\top \\ e_2^\top \\ e_3^\top \end{bmatrix} \begin{bmatrix} \lambda_1 e_1 & \lambda_2 e_2 & \lambda_3 e_3 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \]
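
The diagonalization can be checked numerically for the symmetric matrix of Example 62. This is a minimal sketch, assuming the built-in `eig` is used to obtain the modal matrix; for a real symmetric input it returns orthonormal eigenvectors. The names `orthoErr`, `diagErr` and `lambda` are illustrative.

```matlab
% Sketch: modal matrix and diagonalization for the matrix of Example 62
A = [2 0 0; 0 1 1; 0 1 1];
[Q, D] = eig(A);                 % columns of Q: normalized eigenvectors, D: diag of eigenvalues
orthoErr = norm(Q'*Q - eye(3));  % close to zero if the columns are orthonormal
diagErr  = norm(Q'*A*Q - D);     % close to zero if Q'*A*Q is diagonal
lambda   = sort(diag(D));        % eigenvalues in ascending order
```

The eigenvalue 2 appears twice in `lambda`, and `eig` still returns two orthogonal eigenvectors for it, consistent with Theorem 5.2.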

Our study of eigenvalues is not restricted to finite dimensional matrix operators.
Example 63 (Eigenvalues of a Differential operator). Consider the differential operator $L \equiv -d^2/dx^2$ with
zero boundary conditions. The eigenvalue problem $Ly = \lambda y$ is
\[ y'' + \lambda y = 0, \qquad y(0) = y(l) = 0 \]
From differential equations, the general solution is of the form
\[ y = A \sin \sqrt{\lambda}\, x + B \cos \sqrt{\lambda}\, x \]
where the arbitrary constants $A$ and $B$ are determined by the boundary conditions.
\[ y(0) = 0 \ \Rightarrow \ B = 0, \qquad y(l) = 0 \ \Rightarrow \ A \sin \sqrt{\lambda}\, l = 0 \]
Since $A \ne 0$ is required to have nontrivial solutions, we arrive at the analog of the characteristic equation
from before, where $\sqrt{\lambda}\, l$ must coincide with a zero of the sine function.
\[ \sin \sqrt{\lambda}\, l = 0 \quad \Rightarrow \quad \lambda_n = \frac{n^2 \pi^2}{l^2} \quad \Rightarrow \quad y_n = \sin \frac{n \pi x}{l}, \qquad n = 1, 2, 3, \ldots \]
As seen in example 14, the operator under consideration is self-adjoint, hence the eigenvalues are real and
the eigenfunctions are mutually orthogonal.
\[ (y_m, y_n) = \int_0^l \sin \frac{m \pi x}{l} \sin \frac{n \pi x}{l}\, dx = 0, \qquad m \ne n \]
Unfortunately, Theorem 5.3 was for finite dimensions and we cannot say at this point that these eigenfunctions constitute a basis for
our space. However, the Sturm-Liouville theory provides a rigorous framework in which we can
identify the eigenfunctions as a basis.
Theorem 5.4 (Basis of Sturm-Liouville System). If both $p(x)$ and $w(x)$ are analytic and positive ($p, w > 0$)
over $a \le x \le b$ where $a$ and $b$ are finite, then the eigenfunctions of the Sturm Liouville system (14) form a
basis over $L^2[a, b]$.
This is an important result that may be used to justify that the set of trigonometric Fourier functions
indeed constitutes a basis, so that any given function in this space may be represented or decomposed into this Fourier
basis.
Example 64 (Fourier Basis from Sturm Liouville Theory). The set of Fourier basis functions may be shown
to satisfy a slightly different Sturm-Liouville system with periodic boundary conditions.
\[ y'' + \lambda y = 0, \qquad y(-l) - y(l) = 0, \qquad y'(-l) - y'(l) = 0 \]
Subjecting the general solution $y = A \sin \sqrt{\lambda}\, x + B \cos \sqrt{\lambda}\, x$ to the periodic boundary conditions gives
\[ A \sin \sqrt{\lambda}\, l = 0, \qquad \sqrt{\lambda}\, B \sin \sqrt{\lambda}\, l = 0 \quad \Rightarrow \quad \lambda_n = \frac{n^2 \pi^2}{l^2}, \qquad n = 0, 1, 2, \ldots \]
The eigenvalues with $n \ne 0$ are of multiplicity two and the constants $A$ and $B$ are arbitrary. For the given
eigenfunction
\[ y_n = A \sin \frac{n \pi x}{l} + B \cos \frac{n \pi x}{l} \]
we can obtain two orthogonal functions by setting $A = 1, B = 0$ and $A = 0, B = 1$. From Sturm Liouville
theory, the eigenfunctions form a basis for $L^2[-l, l]$ and we may write a general function $f$ in the form
\[ f(x) = a_0 + \sum_{n=1}^{\infty} \left( a_n \cos \frac{n \pi x}{l} + b_n \sin \frac{n \pi x}{l} \right) \]

where the coefficients coincide with the usual trigonometric Fourier series.
\[ a_0 = \frac{(f, 1)}{(1, 1)} = \frac{1}{2l} \int_{-l}^{l} f(x)\, dx \]
\[ a_n = \frac{(f, \cos n \pi x / l)}{(\cos n \pi x / l, \cos n \pi x / l)} = \frac{1}{l} \int_{-l}^{l} f(x) \cos \frac{n \pi x}{l}\, dx \]
\[ b_n = \frac{(f, \sin n \pi x / l)}{(\sin n \pi x / l, \sin n \pi x / l)} = \frac{1}{l} \int_{-l}^{l} f(x) \sin \frac{n \pi x}{l}\, dx \]
5.2 Applications: Spectral Method for the Inhomogeneous Problem
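For a symmetric matrix $A$ we know that the eigenvectors form a basis for the space, hence we can
expand the solution of a system of equations in terms of this basis. The solution may then be represented in
terms of the spectrum of the operator $A$.
\[ Ax = \Lambda x + c, \qquad c = \sum_{j=1}^{n} c_j e_j, \qquad x = \sum_{j=1}^{n} \alpha_j e_j, \qquad \Lambda \in \mathbb{R} \]

Here $c_j = (c, e_j)$, with normalized eigenvectors $e_j$, is assumed known and the $\alpha_j$'s are to be found. By direct substitution
\[ \sum_{j=1}^{n} \alpha_j (\lambda_j - \Lambda) e_j = \sum_{j=1}^{n} c_j e_j \quad \Rightarrow \quad \alpha_j (\lambda_j - \Lambda) = c_j \]

If $\Lambda$ does not coincide with any of the eigenvalues, then $\alpha_j = c_j / (\lambda_j - \Lambda)$ and the unique solution is
\[ x = \sum_{j=1}^{n} \frac{c_j}{\lambda_j - \Lambda} e_j \]

On the other hand, if $\Lambda = \lambda_k$ then there are two possibilities. (a) If $c_k \ne 0$, there is no solution. (b) If $c_k = 0$,
then we have a nonunique solution with $\beta$ arbitrary
\[ x = \sum_{j \ne k} \frac{c_j}{\lambda_j - \Lambda} e_j + \beta e_k, \qquad \beta \in \mathbb{R} \]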
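
The spectral expansion can be sketched in a few lines for the 2 x 2 matrix of Example 61. The right hand side `c` and the shift `Lambda` are arbitrary illustrative values (assumptions), chosen so that `Lambda` is not an eigenvalue; the result is compared against a direct solve of (A - ΛI)x = c.

```matlab
% Sketch: spectral solution of A*x = Lambda*x + c (illustrative c and Lambda)
A = [2 1; 1 2];                       % symmetric matrix of Example 61
c = [1; 2];                           % right hand side (assumption)
Lambda = 0.5;                         % shift, not an eigenvalue of A (assumption)
[E, D] = eig(A);                      % orthonormal eigenvectors e_j, eigenvalues lambda_j
lam = diag(D);
cj = E'*c;                            % expansion coefficients c_j = (c, e_j)
alpha = cj ./ (lam - Lambda);         % alpha_j = c_j / (lambda_j - Lambda)
x = E*alpha;                          % x = sum_j alpha_j e_j
xdirect = (A - Lambda*eye(2)) \ c;    % reference: direct solve of the shifted system
```

If `Lambda` were set to 1 or 3, the division would break down for the corresponding index, which is the no-solution or nonunique case described above.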

6 Unconstrained Optimization
As a motivating example consider the image denoising problem.

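Figure 40: Given a noisy image $b \in L^2(\Omega)$ we wish to remove noise. How can we express this mathematically?
One way is to say that we wish to find a denoised image, $f$, at minimum distance from the
original image such that $f$ is smooth.

\[ \min_f \ \underbrace{\frac{1}{2} \|f - b\|_2^2}_{\text{minimum distance}} + \underbrace{\|\nabla f\|_p^p}_{\text{minimum variation, ie smooth}} \]

Here, the norm of a gradient is defined as

\[ \|\nabla f\|_p = \left( \left\| \frac{\partial f}{\partial x} \right\|_p^p + \left\| \frac{\partial f}{\partial y} \right\|_p^p + \left\| \frac{\partial f}{\partial z} \right\|_p^p \right)^{1/p} \]

We will represent the partial derivatives with finite differences.

\[ \frac{\partial f}{\partial x} \approx \frac{f_{i+1,j,k} - f_{i,j,k}}{\Delta x}, \qquad \frac{\partial f}{\partial y} \approx \frac{f_{i,j+1,k} - f_{i,j,k}}{\Delta y}, \qquad \frac{\partial f}{\partial z} \approx \frac{f_{i,j,k+1} - f_{i,j,k}}{\Delta z} \]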
Let us look at this problem in vector notation, with $f_{i,j,k} \equiv f(i \Delta x, j \Delta y, k \Delta z)$ and $b_{i,j,k} \equiv b(i \Delta x, j \Delta y, k \Delta z)$ for a
256x256x100 image
\[ \vec{f} = \begin{bmatrix} f_{1,1,1} \\ f_{2,1,1} \\ \vdots \\ f_{256,1,1} \\ f_{1,2,1} \\ f_{2,2,1} \\ \vdots \\ f_{256,2,1} \\ \vdots \\ f_{256,256,100} \end{bmatrix}, \qquad \vec{b} = \begin{bmatrix} b_{1,1,1} \\ b_{2,1,1} \\ \vdots \\ b_{256,1,1} \\ b_{1,2,1} \\ b_{2,2,1} \\ \vdots \\ b_{256,2,1} \\ \vdots \\ b_{256,256,100} \end{bmatrix} \]
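Explicitly discretizing the integrals
\[ \|f - b\|_2^2 = \int_\Omega (f - b)^2 \, dx\, dy\, dz \approx \sum_{i,j,k} (f_{i,j,k} - b_{i,j,k})^2 \, \Delta x \Delta y \Delta z = \left( \vec{f} - \vec{b}, \vec{f} - \vec{b} \right) \Delta x \Delta y \Delta z \]

For p=1, the smoothing term reduces to the so-called "total variation" regularizer
\[ \left\| \frac{\partial f}{\partial x} \right\|_1 = \int_\Omega \left| \frac{\partial f}{\partial x} \right| dx\, dy\, dz \approx \sum_{i,j,k} \left| \frac{f_{i+1,j,k} - f_{i,j,k}}{\Delta x} \right| \Delta x \Delta y \Delta z \]

Figure 41: Total Variation, $TV(f) = \sum_i |f(x_{i+1}) - f(x_i)|$. Intuitively the function on the left will have less
total variation.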
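This may equivalently be written in linear operator notation
\[ \underbrace{\left\| \frac{\partial f}{\partial x} \right\|_1}_{\text{function space norm}} \approx \Bigg\| \underbrace{\frac{1}{\Delta x} \begin{bmatrix} -1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 & 1 & 0 & 0 \\ & & & & \ddots & & \end{bmatrix}}_{\equiv L_x} \vec{f} \, \Bigg\|_1 \Delta x \Delta y \Delta z \]

Operators for the $y$ and $z$ directions, $L_y$ and $L_z$, may be similarly derived. Hence our total variation denoising problem is
of the form of minimizing a function with a differentiable and a non-differentiable term.
\[ \min_{\vec{f} \in \mathbb{R}^{256 \times 256 \times 100}} \ \underbrace{\left( \vec{f} - \vec{b}, \vec{f} - \vec{b} \right)}_{\text{differentiable}} + \underbrace{\|L_x \vec{f}\|_1 + \|L_y \vec{f}\|_1 + \|L_z \vec{f}\|_1}_{\text{non-differentiable}} \]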
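
The forward-difference operator is naturally stored as a sparse matrix. The following is a minimal 1D sketch, assuming a unit grid spacing and a small hypothetical signal `f`; the construction with `spdiags` mirrors the sparse storage used for the Hessian in multidimrosenboth.m.

```matlab
% Sketch: 1D forward-difference operator and total variation (illustrative signal)
n  = 5;                                   % number of samples (assumption)
dx = 1;                                   % grid spacing (assumption)
e  = ones(n,1);
D  = spdiags([-e e], [0 1], n, n);        % -1 on the diagonal, +1 on the superdiagonal
Lx = D(1:n-1,:) / dx;                     % drop the last row, which has no right neighbor
f  = [0; 0; 1; 1; 0];                     % step up then step down
tv = sum(abs(Lx*f)) * dx;                 % discrete total variation, here |1| + |-1| = 2
```

For the 256x256x100 image the same banded structure applies along each axis, and a dense matrix of that size would not fit in memory, which is why a sparse representation is assumed here.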

Figure 42: Comparison of various optimization techniques in 2D and 100D. (a) Nelder Mead (b) Steepest
Descent (c) Quasi-Newton (d) Newton.

The test function is the multi-dimensional Rosenbrock function
\[ f(x) = \sum_{i=1}^{N-1} \left[ 100 (x_i^2 - x_{i+1})^2 + (x_i - 1)^2 \right], \qquad x \in \mathbb{R}^N \]
\exampledir/multidimrosenboth.m

function [y, grad, hess] = multidimrosenboth(x)
%
% Multi-dimensional Rosenbrock function
% The number of variables is taken from the length of the column vector x.
% Returns the value y and, on request, the gradient and a sparse Hessian.
%
a = 1.0;
fsum = 0;   % running objective value (renamed to avoid shadowing the builtin sum)
nsize = size(x,1);
grad = zeros(nsize,1);
% do not allocate dense matrix !!!
%hess = zeros(nsize,nsize);
hessdiagonal = zeros(nsize,1);
hessoffdiagonal = zeros(nsize,1);
for j = 1:nsize-1
    fsum = fsum+100*(x(j)^2-x(j+1))^2+(x(j)-a)^2;
    if nargout > 1
        grad(j) = grad(j) +400*(x(j)^2-x(j+1))*x(j)+2*(x(j)-a);
        grad(j+1) = grad(j+1)-200*(x(j)^2-x(j+1));
        if nargout > 2
            %% hess(j ,j ) = hess(j ,j ) + 1200*x(j)^2 - 400*x(j+1) + 2;
            %% hess(j ,j+1) = hess(j ,j+1) - 400*x(j);
            %% hess(j+1,j ) = hess(j+1,j ) - 400*x(j);
            %% hess(j+1,j+1) = hess(j+1,j+1) + 200;
            hessdiagonal(j) = hessdiagonal(j) + 1200*x(j)^2 - 400*x(j+1) + 2;
            hessdiagonal(j+1) = hessdiagonal(j+1) + 200;
            % the last entry is ignored on the lower diagonal, the first on the upper diagonal
            hessoffdiagonal(j) = - 400*x(j);
        end
    end
end
if nargout > 2
    % assemble the sparse tridiagonal Hessian once, after the loop
    hess = spdiags([hessoffdiagonal hessdiagonal circshift(hessoffdiagonal,1) ], -1:1, nsize, nsize);
end
y = fsum;

\exampledir/ExOptimizationComparison.m

clear all
close all

maxIter = 10e9
maxFunEval = 5e3

nDimension = 2;
x0 = -2*ones(nDimension,1);
% direct search
[x,fval] = fminsearch(@rosenboth, x0, optimset('TolX',1e-8,'Display','Simplex','PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptNelderMead2D','png')
pause
% steepest descent
[x,fval] = fminunc(@rosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,'TolX',1e-8,...
    'Display','Iter','GradObj','on','Hessian','off','LargeScale','off','HessUpdate','steepdesc',...
    'PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptSteepestDescent2D','png')
pause
% trust region hessian approx
[x,fval] = fminunc(@rosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,...
    'TolX',1e-8,'Display','Iter','GradObj','on','Hessian','off',...
    'TolPCG',1e-3,'PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptQuasiNewton2D','png')
pause
% trust region exact hessian
[x,fval] = fminunc(@rosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,...
    'TolX',1e-8,'Display','Iter','GradObj','on','Hessian','on',...
    'TolPCG',1e-3,'PlotFcns',@dfplotx));
handle = findobj(get(gca,'Children'),'Tag','optimplotx');
saveas(handle,'OptNewton2D','png')
pause

%%%%%%%%%%%%%%%%%%%%%%%%% increase dimension %%%%%%%%%%%%%%%%%%%%%%%%

nDimension = 100;
x0 = -2*ones(nDimension,1);
[x,fval] = fminsearch(@multidimrosenboth, x0, optimset('TolX',1e-8,'MaxIter',maxIter,...
    'MaxFunEvals',maxFunEval ,'Display','Iter','PlotFcns',@optimplotfval));
handle = findobj(get(gca,'Children'),'Tag','optimplotfval');
saveas(handle,'OptNelderMead100D','png')
pause

[x,fval] = fminunc(@multidimrosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,...
    'TolX',1e-8,'Display','Iter','GradObj','on','Hessian','off',...
    'LargeScale','off','HessUpdate','steepdesc','PlotFcns',@optimplotfval));
handle = findobj(get(gca,'Children'),'Tag','optimplotfval');
saveas(handle,'OptSteepestDescent100D','png')
pause

[x,fval] = fminunc(@multidimrosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,...
    'TolX',1e-8,'Display','Iter','GradObj','on','Hessian','off',...
    'TolPCG',1e-3,'PlotFcns',@optimplotfval));
handle = findobj(get(gca,'Children'),'Tag','optimplotfval');
saveas(handle,'OptQuasiNewton100D','png')
pause

[x,fval] = fminunc(@multidimrosenboth, x0, optimset('MaxIter',maxIter,'MaxFunEvals',maxFunEval ,...
    'TolX',1e-8,'Display','Iter','GradObj','on','Hessian','on' ,...
    'TolPCG',1e-3,'PlotFcns',@optimplotfval));
handle = findobj(get(gca,'Children'),'Tag','optimplotfval');
saveas(handle,'OptNewton100D','png')

%%%%%%%%%%%%%%%%%%%%%%%%% compare on very high dimension space %%%%%%%%%%%%%%%%%%%%%%%%

nDimension = 1000;
x0 = -2*ones(nDimension,1);

% sparse hessian matrix
tic;
[x,fval] = fminunc(@multidimrosenboth, x0, optimset(optimset('fminunc'),...
    'MaxIter',maxIter, 'MaxFunEvals',maxFunEval , 'TolX',1e-8,'Display','off',...
    'GradObj','on', 'Hessian','on' , 'TolPCG',1e-6,'PlotFcns',@optimplotfval));
toc
% Elapsed time is 1135.575783 seconds.

% dense quasi-newton matrix
tic;
[x,fval] = fminunc(@multidimrosenboth, x0, optimset(optimset('fminunc'),...
    'MaxIter',maxIter,'MaxFunEvals',maxFunEval , 'TolX',1e-8,'Display','off',...
    'GradObj','on','Hessian','off', 'TolPCG',1e-6,'PlotFcns',@optimplotfval));
toc
% Elapsed time is 6892.467525 seconds.

%% % TODO sparse matrix vector multiply not working...
%% Hinfo = speye(nDimension);
%% tic; [x,fval] = fminunc(@multidimrosenboth, x0, optimset(optimset('fminunc') ,'MaxIter',maxIter,'MaxFunEvals',maxFunEval , 'TolX',1e-8,'Display','off

fminsearch, Nelder Mead Method. Direct search methods may be appropriate for function optimization
with the following properties
• Function evaluation is very expensive or time consuming.
• Gradient information is not practical, ie discontinuous functions or very complex physics models.
• Numerical derivatives are impractical and/or slow.
Similar to a bisection or golden section method in 1D, the Nelder Mead method uses geometrical arguments
to reach a minimum. A simplex in 2D is a triangle and in 3D is a tetrahedron.
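
A derivative-free call can be sketched without any external files. The anonymous function `rosen2d`, the starting point and the tolerances below are illustrative assumptions (the script above uses a separate file `rosenboth` and a different starting point); only function values are passed to `fminsearch`.

```matlab
% Sketch: Nelder-Mead on the 2D Rosenbrock function using function values only
rosen2d = @(x) 100*(x(1)^2 - x(2))^2 + (x(1) - 1)^2;   % N = 2 case of the test function
x0 = [-1.2; 1];                                         % starting point (assumption)
opts = optimset('TolX',1e-8,'TolFun',1e-8,'Display','off', ...
                'MaxIter',2000,'MaxFunEvals',2000);     % tolerances and limits (assumption)
[xmin, fval] = fminsearch(rosen2d, x0, opts);           % simplex search, no gradients
```

The minimizer of this function is x = (1, 1) with value zero, so `xmin` should land close to that point.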

(a) (b) (c)

Figure 43: Possible Outcomes of Simplex Method

Table 1: Brief Summary of Optimization Solvers.


easy to implement fast convergence
function only x
gradient only x
quasi-Newton x
Newton x

6.1 Characterizations of Solutions, [Nocedal and Wright, 1999] Ch 2


In unconstrained optimization we are interested in finding $x^* \in \mathbb{R}^n$ that solves

\[ \min_x f(x) \]
Here the objective function, $f : \mathbb{R}^n \to \mathbb{R}$, is in general a nonlinear function that is typically too expensive to
evaluate extensively to obtain a global perspective. We generally only have function evaluation and derivative
information at a finite set of points $x_0, x_1, x_2, \ldots$ and must design algorithms with this in mind.
Ideally we would like to find a global minimizer to the optimization problem.
(Definition) Global Minimizer A point $x^*$ is a global minimizer if

\[ f(x^*) \le f(x) \quad \forall x \]

For a general nonlinear and discontinuous function, since we typically only evaluate the objective function at a finite set of points,
we can never be certain that we have not neglected to sample a region in
which the function takes a sharp dip towards a minimum. This global optimization problem is outside the
scope of this class.
We will restrict our discussion to optimization of functions that are sufficiently smooth such that all
necessary 1st and 2nd derivatives exist and are well defined. Using derivative information, it is feasible to
search for solutions that are local minimizers within this framework.
(Definition) Local Minimizer A point $x^*$ is a local minimizer if $\exists$ a neighborhood $N$ of $x^*$ such that

\[ f(x^*) \le f(x) \quad \forall x \in N \]

Many optimization packages developed at national labs and the Optimization Toolbox in MATLAB have adopted
this framework of searching for local minimizers and define the solutions returned to the user as local
minimizers.
Taylor's theorem is the primary mathematical tool used for studying local minimizers.
Figure 44: Global vs local Minimum

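Theorem 6.1 (Taylor's Theorem). Given that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable,

\[ f(x + p) = f(x) + \nabla f(x + tp)^\top p, \qquad p \in \mathbb{R}^n, \quad \text{for some } t \in (0, 1) \]

Moreover, if $f$ is twice continuously differentiable,

\[ \nabla f(x + p) = \nabla f(x) + \int_0^1 \nabla^2 f(x + tp)^\top p \, dt, \qquad p \in \mathbb{R}^n \]

and substituting
\[ f(x + p) = f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(x + tp)\, p, \qquad p \in \mathbb{R}^n, \quad \text{for some } t \in (0, 1) \]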
Necessary conditions for an optimal solution assume that $x^*$ is a local minimizer and then derive requirements
on the gradient and Hessian.
Theorem 6.2 (First Order Necessary Conditions). Given that $f$ is continuously differentiable

\[ x^* \ \text{is a local minimizer} \quad \Rightarrow \quad \nabla f(x^*) = 0 \]

Proof. Suppose for contradiction that $x^*$ is a local minimizer but $\nabla f(x^*) \ne 0$.

\[ \underbrace{x^* \ \text{is a local minimizer} \ \Rightarrow \ \nabla f(x^*) = 0}_{(p \Rightarrow q)} \quad \Leftrightarrow \quad \underbrace{\nabla f(x^*) \ne 0 \ \Rightarrow \ x^* \ \text{is not a local minimizer}}_{(\sim q \Rightarrow \sim p)} \]

We need to show that $x^*$ is not a minimizer. Our strategy is as follows.

• Consider the Taylor expansion about $x^*$

• Now, using continuity, construct a direction and magnitude that decrease $f$ below $f(x^*)$

Let the vector $p = -\nabla f(x^*)$, so that

\[ p^\top \nabla f(x^*) = -\|\nabla f(x^*)\|^2 < 0 \]

Because $\nabla f$ is continuous near $x^*$ there is a scalar $T > 0$ such that
\[ p^\top \nabla f(x^* + tp) < 0 \quad \forall t \in [0, T] \]
For any $\hat{t} \in (0, T]$, we have by Taylor's theorem that

\[ f(x^* + \hat{t} p) = f(x^*) + \hat{t}\, p^\top \nabla f(x^* + tp) \quad \text{for some } t \in (0, \hat{t}) \]

Figure 45: N-dimensional analog of the mean value theorem.

Therefore, $f(x^* + \hat{t} p) < f(x^*)$ for all $\hat{t} \in (0, T]$, using the fact that
\[ a = b + c, \quad c < 0 \quad \Rightarrow \quad a < b \]
We have found a direction leading away from $x^*$ along which $f$ decreases, so $x^*$ is not a local minimizer,
which is a contradiction.
According to Theorem 6.2, any local minimizer must be a stationary point.
(Definition) Stationary Point A stationary point $x^*$ of a function is a point where the gradient vanishes.
\[ \nabla f(x^*) = 0 \]
(Definition) Positive Definite, Negative Definite, etc. A matrix $A$ is called
\[ \begin{array}{ll}
x^\top A x > 0 \ \ \forall x \ne 0 & \text{Positive Def.} \\
x^\top A x < 0 \ \ \forall x \ne 0 & \text{Negative Def.} \\
x^\top A x \ge 0 \ \ \forall x & \text{Positive Semi-Def.} \\
x^\top A x \le 0 \ \ \forall x & \text{Negative Semi-Def.}
\end{array} \]
Theorem 6.3 (Second Order Necessary Conditions). If $\nabla^2 f$ is continuous
\[ x^* \ \text{is a local minimizer} \quad \Rightarrow \quad \nabla f(x^*) = 0 \ \text{and} \ \nabla^2 f(x^*) \ \text{is positive semidefinite} \]
Proof. That $x^*$ is a stationary point follows from Theorem 6.2. For the contrapositive, suppose that $\nabla^2 f(x^*)$ is
not positive semidefinite.
\[ (p \Rightarrow q) \Leftrightarrow (\sim q \Rightarrow \sim p) \]
Then we can choose a vector $p$ such that $p^\top \nabla^2 f(x^*)\, p < 0$. By continuity $\exists T > 0$
such that $p^\top \nabla^2 f(x^* + tp)\, p < 0$ for all $t \in [0, T]$.
Using a Taylor expansion about $x^*$, we have for any $\hat{t} \in (0, T]$ and some $t \in (0, \hat{t})$
\[ f(x^* + \hat{t} p) = f(x^*) + \underbrace{\hat{t}\, p^\top \nabla f(x^*)}_{=0, \text{ by assumption}} + \frac{1}{2} \hat{t}^2 p^\top \nabla^2 f(x^* + tp)\, p < f(x^*) \]

Thus $x^*$ is not a local minimizer, as we have found a direction along which $f$ is decreasing.
Theorem 6.4 (Second Order Sufficient Conditions). If $\nabla^2 f$ is continuous in a neighborhood of $x^*$ and $\nabla^2 f(x^*)$ is positive definite
\[ \nabla f(x^*) = 0 \quad \Rightarrow \quad x^* \ \text{is a strict local minimizer} \]
Proof. Because the Hessian is continuous and positive definite at $x^*$, we can choose a radius $r > 0$ so that
$\nabla^2 f(x)$ remains positive definite for all $x$ in a neighborhood $D = \{z : \|z - x^*\| < r\}$. For any nonzero $p$ with
$\|p\| < r$, we have $x^* + p \in D$ and
\[ f(x^* + p) = f(x^*) + \underbrace{p^\top \nabla f(x^*)}_{=0, \text{ by assumption}} + \frac{1}{2} p^\top \nabla^2 f(z)\, p = f(x^*) + \frac{1}{2} p^\top \nabla^2 f(z)\, p \]

Here $z = x^* + tp$ for some $t \in (0, 1)$.

\[ z \in D \quad \Rightarrow \quad p^\top \nabla^2 f(z)\, p > 0 \]
Therefore
\[ f(x^* + p) > f(x^*) \]

Note that the second order sufficient conditions are not necessary: a point $x^*$ may be a strict local
minimizer, and yet may fail to satisfy the sufficient conditions.
Example 65 (Sufficient not Necessary). Consider $f(x) = x^4$, for which $x^* = 0$ is a strict local minimizer at
which the Hessian vanishes and is therefore not positive definite.
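
The first and second order conditions can be checked numerically at a candidate point. The sketch below is an illustration, assuming the N = 2 case of the Rosenbrock function used earlier in this section and the candidate point x* = (1, 1); the gradient and Hessian expressions follow multidimrosenboth.m.

```matlab
% Sketch: checking optimality conditions at xs = (1,1) for the 2D Rosenbrock function
xs = [1; 1];                                        % candidate point (assumption)
g = [400*(xs(1)^2 - xs(2))*xs(1) + 2*(xs(1) - 1);   % gradient of 100*(x1^2-x2)^2 + (x1-1)^2
     -200*(xs(1)^2 - xs(2))];
H = [1200*xs(1)^2 - 400*xs(2) + 2, -400*xs(1);      % Hessian of the same function
     -400*xs(1),                   200];
isStationary = norm(g) == 0;                        % first order necessary condition
isPosDef = all(eig(H) > 0);                         % second order sufficient condition
```

Both flags are true here, so Theorem 6.4 identifies the point as a strict local minimizer. For f(x) = x^4 at x = 0 the same positive definiteness check would fail even though the point is a strict minimizer.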

6.2 Search Directions, [Nocedal and Wright, 1999] Ch 2
It is important to recognize the search directions that produce a descent in the objective function.
• The steepest descent direction $-\nabla f_k$ is an obvious search direction to decrease our function value.
Among all directions we could move from the current iterate $x_k$, it is the one along which $f$ decreases most
rapidly. To see this, recall our Taylor expansion for a search direction $p$, $\|p\| = 1$, and step length $\alpha$
\begin{align*}
f(x + \alpha p) &= f(x) + \alpha \nabla f(x)^\top p + \frac{1}{2} \alpha^2 p^\top \nabla^2 f(x + tp)\, p, \qquad p \in \mathbb{R}^n, \quad \text{for some } t \in (0, \alpha) \\
&= f(x) + \alpha \nabla f(x)^\top p + O(\alpha^2)
\end{align*}

The rate of change of the function is the coefficient of $\alpha$, ie $p^\top \nabla f_k$. Hence the direction of the most rapid
decrease is the solution to the problem

\[ \min_p \ p^\top \nabla f_k, \qquad \|p\| = 1 \]

Since $\|p\| = 1$, using properties of our inner product from vector calculus,

\[ (p, \nabla f_k) = \|p\| \|\nabla f_k\| \cos\theta = \|\nabla f_k\| \cos\theta \]

This is minimized when $\theta = \pi$, or $\cos\theta = -1$. In other words

\[ p = -\frac{\nabla f_k}{\|\nabla f_k\|} \]

• In general, any search direction that makes an angle of strictly less than $\pi/2$ radians with $-\nabla f_k$ is a
descent direction, provided the step length $\epsilon$ is sufficiently small. Again we use Taylor's theorem to see
this.
\[ f(x_k + \epsilon p_k) = f(x_k) + \epsilon\, p_k^\top \nabla f_k + O(\epsilon^2) \approx f(x_k) + \epsilon\, p_k^\top \nabla f_k \]

When the angle $\theta_k$ between $p_k$ and $\nabla f_k$ is such that $\cos\theta_k < 0$

\[ \theta_k \in (\pi/2, 3\pi/2) \ \Rightarrow \ (p_k, \nabla f_k) = \|p_k\| \|\nabla f_k\| \cos\theta_k < 0 \ \Rightarrow \ f(x_k + \epsilon p_k) < f(x_k), \quad \epsilon \ \text{sufficiently small} \]

• One of the most important directions is the Newton direction. Consider a second order Taylor series
approximation
\[ f(x_k + p) \approx f_k + p^\top \nabla f_k + \frac{1}{2} p^\top \nabla^2 f_k\, p \equiv m_k(p) \]
where we interpret $m_k(p)$ as a quadratic model approximation to our function at $x_k$. Assuming $\nabla^2 f_k$
is positive definite, we find the Newton direction by minimizing our quadratic model, $m_k(p)$.
\[ \frac{\partial m_k(p)}{\partial p_i} = 0 \quad \Rightarrow \quad p_k^N = -\left( \nabla^2 f_k \right)^{-1} \nabla f_k \]
It is important to realize that the Newton direction may not be defined when the inverse $(\nabla^2 f_k)^{-1}$ does not exist.
This is no different than before: a unique solution will not exist when the dimension of the null space
of the operator is non-zero, $N(\nabla^2 f_k) \ne \{0\}$. Further, when the Hessian is not positive definite the
Newton direction may actually increase the objective function value and is not suitable.

\[ (\nabla^2 f_k\, x, x) \not> 0 \ \ \forall x \quad \Rightarrow \quad (\nabla f_k, p_k^N) \not< 0 \ \text{is possible} \]

6.3 Applications: Nonlinear Least Squares, [Heath, 1998] Ch. 6


A nonlinear least squares problem arises when we are trying to fit a function to measured values $y_i$. In general, the function we are trying to fit, $\phi(t_i, \vec{x})$, depends nonlinearly on the parameters $\vec{x}$.
$$y_i \approx \phi(t_i, \vec{x}) \;\; \forall i \qquad \vec{\phi} = \begin{bmatrix} \phi(t_1) \\ \phi(t_2) \\ \vdots \\ \phi(t_m) \end{bmatrix}$$

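As before, we are interested in minimizing the least squares residual
$$\min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}) = \min_{\vec{x} \in \mathbb{R}^n} \frac{1}{2} \vec{r}^\top \vec{r} = \min_{\vec{x} \in \mathbb{R}^n} \frac{1}{2} \sum_{l=1}^{m} \left( y_l - \phi(t_l, \vec{x}) \right)^2 \qquad \vec{r} = \vec{y} - \vec{\phi}$$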

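The gradient of the objective function $f(\vec{x})$ is given as the matrix-vector product of the Jacobian transpose times the residual.
$$(\nabla f)_i = \frac{\partial}{\partial x_i} \left( \frac{1}{2} \sum_{l=1}^{m} r_l r_l \right) = \sum_{l=1}^{m} r_l \frac{\partial r_l}{\partial x_i} \qquad \nabla f = J^\top \vec{r} \qquad J_{ij} = \frac{\partial r_i}{\partial x_j} = -\frac{\partial \phi(t_i, \vec{x})}{\partial x_j}$$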

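Similarly, without working through the algebra, the matrix of second derivatives may be obtained as a function of the Jacobian $J(\vec{x})$, the residual $\vec{r}$, and the Hessians of the residual components, $H_i(\vec{x})$.
$$\left(\nabla^2 f\right)_{ij} = \frac{\partial^2}{\partial x_i \partial x_j} \left( \frac{1}{2} \sum_{l=1}^{m} r_l r_l \right) \qquad \nabla^2 f = J^\top(\vec{x}) J(\vec{x}) + \sum_{i=1}^{m} r_i(\vec{x}) H_i(\vec{x}) \qquad (H_l(\vec{x}))_{ij} = \frac{\partial^2 r_l}{\partial x_i \partial x_j}$$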

Notice that all components of the gradient and Hessian depend on the current solution $\vec{x}$, i.e. $J(\vec{x})$, $H_i(\vec{x})$. Thus if $\vec{x}_k$ is the current solution, then the Newton step $\vec{s}_k$ is given by the linear system
$$\left[ J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \right] \vec{s}_k = -J^\top(\vec{x}_k) \vec{r}(\vec{x}_k)$$

Gauss-Newton Method. The $m$ residual Hessian matrices $H_i(\vec{x})$ are typically very inconvenient and expensive to compute. Further, since they are multiplied by the residual components $r_i$, which should be small near a solution, we are motivated to drop the second order terms and solve an approximation at each step.
$$J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \approx J^\top(\vec{x}_k) J(\vec{x}_k) \;\Rightarrow\; J^\top(\vec{x}_k) J(\vec{x}_k) \, \vec{s}_k = -J^\top(\vec{x}_k) \vec{r}(\vec{x}_k) \qquad (16)$$

Notice that this is an approximation to the Hessian and is thus a Quasi-Newton method. You should recognize (16) as the normal equations that we visited when we looked at the linear least squares problem, Eqn (13), here for the overdetermined system
$$J(\vec{x}_k) \vec{s}_k \approx -\vec{r}(\vec{x}_k)$$
Thus the nonlinear least squares problem reduces to a linear least squares problem at each iteration, which may be solved by some orthogonalization method.

Algorithm 1 Gauss-Newton Method of Nonlinear Least Squares

Require: tolerance $\epsilon \geq 0$ and initial guess $\vec{x}_0$
while $\frac{1}{2} \vec{r}^\top \vec{r} \geq \epsilon$ do
    Compute residual and Jacobian $\vec{r}(\vec{x}_k)$, $J(\vec{x}_k)$
    Solve least squares problem at current iterate
        $$J(\vec{x}_k) \vec{s}_k \approx -\vec{r}(\vec{x}_k)$$
    Update solution
        $$\vec{x}_{k+1} = \vec{x}_k + \vec{s}_k$$
end while

To be explicit, suppose that we have a time series of imaging data. We draw an ROI on the image and need to fit a model to the time series of measurements within the ROI. Here the model and residual are
$$\phi(t_i, \vec{x}) = x_1 e^{x_2 t_i} \qquad r_i = y_i - \phi(t_i, \vec{x})$$
The Jacobian of the residual is given by
$$(J(\vec{x}_k))_{i,1} = \frac{\partial r_i}{\partial x_1} = -e^{x_2 t_i} \qquad (J(\vec{x}_k))_{i,2} = \frac{\partial r_i}{\partial x_2} = -x_1 t_i e^{x_2 t_i}$$

Measurements within the ROI:

    t    0.0    1.0    2.0    3.0
    y    2.0    0.7    0.3    0.1

Gauss-Newton iterates:

    (x_k)_1    (x_k)_2    ||r(x_k)||_2^2
    1.000       0.000     2.390
    1.690      -0.610     0.212
    1.975      -0.930     0.007
    1.994      -1.004     0.002
    1.995      -1.009     0.002
    1.995      -1.010     0.002

Figure 46: Measurements within an ROI and Gauss-Newton Nonlinear Least Squares Estimates

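Given an initial guess $\vec{x}_0 = [1, 0]^\top$, the initial least squares problem to solve is
$$J(\vec{x}_0) \vec{s}_0 \approx -\vec{r}(\vec{x}_0)$$
$$\begin{bmatrix} \frac{\partial r_1}{\partial x_1} & \frac{\partial r_1}{\partial x_2} \\ \frac{\partial r_2}{\partial x_1} & \frac{\partial r_2}{\partial x_2} \\ \frac{\partial r_3}{\partial x_1} & \frac{\partial r_3}{\partial x_2} \\ \frac{\partial r_4}{\partial x_1} & \frac{\partial r_4}{\partial x_2} \end{bmatrix} \vec{s}_0 = \begin{bmatrix} -e^{0 \cdot 0.0} & -1 \cdot 0.0 \, e^{0 \cdot 0.0} \\ -e^{0 \cdot 1.0} & -1 \cdot 1.0 \, e^{0 \cdot 1.0} \\ -e^{0 \cdot 2.0} & -1 \cdot 2.0 \, e^{0 \cdot 2.0} \\ -e^{0 \cdot 3.0} & -1 \cdot 3.0 \, e^{0 \cdot 3.0} \end{bmatrix} \vec{s}_0 = \begin{bmatrix} -1 & 0 \\ -1 & -1 \\ -1 & -2 \\ -1 & -3 \end{bmatrix} \vec{s}_0 \approx -\begin{bmatrix} 2.0 - 1 e^{0.0 \cdot 0.0} \\ 0.7 - 1 e^{0.0 \cdot 1.0} \\ 0.3 - 1 e^{0.0 \cdot 2.0} \\ 0.1 - 1 e^{0.0 \cdot 3.0} \end{bmatrix} = \begin{bmatrix} -1 \\ 0.3 \\ 0.7 \\ 0.9 \end{bmatrix}$$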

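The solution to this least squares problem and the next iterate are
$$\vec{s}_0 = \begin{bmatrix} 0.69 \\ -0.61 \end{bmatrix} \qquad \vec{x}_1 = \vec{x}_0 + \begin{bmatrix} 0.69 \\ -0.61 \end{bmatrix} = \begin{bmatrix} 1.69 \\ -0.61 \end{bmatrix}$$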
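The Gauss-Newton iteration for this ROI example can be sketched in a few lines of Python. This is a minimal illustration, not the solver used in these notes: it assumes the two-parameter model $x_1 e^{x_2 t}$ and the four measurements above, and it solves the 2x2 normal equations (16) directly by Cramer's rule rather than by an orthogonalization method. The function names are hypothetical.

```python
import math

T = [0.0, 1.0, 2.0, 3.0]   # measurement times from the ROI example
Y = [2.0, 0.7, 0.3, 0.1]   # measured values

def residual_and_jacobian(x):
    """Residual r_i = y_i - x1*exp(x2*t_i) and the rows of its Jacobian."""
    r = [y - x[0] * math.exp(x[1] * t) for t, y in zip(T, Y)]
    J = [(-math.exp(x[1] * t), -x[0] * t * math.exp(x[1] * t)) for t in T]
    return r, J

def gauss_newton_step(x):
    """Solve the 2x2 normal equations J^T J s = -J^T r and return x + s."""
    r, J = residual_and_jacobian(x)
    a = sum(j1 * j1 for j1, _ in J)          # entries of J^T J
    b = sum(j1 * j2 for j1, j2 in J)
    d = sum(j2 * j2 for _, j2 in J)
    g1 = sum(j1 * ri for (j1, _), ri in zip(J, r))   # entries of J^T r
    g2 = sum(j2 * ri for (_, j2), ri in zip(J, r))
    det = a * d - b * b
    s1 = (-g1 * d + g2 * b) / det
    s2 = (-g2 * a + g1 * b) / det
    return [x[0] + s1, x[1] + s2]

def gauss_newton(x0, iterations):
    """Return the list of iterates x_0, x_1, ..., x_iterations."""
    xs = [list(x0)]
    for _ in range(iterations):
        xs.append(gauss_newton_step(xs[-1]))
    return xs

for x in gauss_newton([1.0, 0.0], 5):
    r, _ = residual_and_jacobian(x)
    print("%.3f %.3f %.3f" % (x[0], x[1], sum(ri * ri for ri in r)))
```

The printed columns correspond to $(\vec{x}_k)_1$, $(\vec{x}_k)_2$ and $\|\vec{r}(\vec{x}_k)\|_2^2$ in Figure 46. Forming $J^\top J$ explicitly is acceptable for two parameters but squares the condition number, which is why an orthogonalization method is preferred in general.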

As a MATLAB example, consider the following residual function to be used with 'lsqnonlin' in MATLAB.
function [residual, jacobian] = pharmacokinetic(x)
time = [0.0; 1.0; 2.0; 3.0];
y = [2.0; 0.7; 0.3; 0.1];
residual = y - x(1)*exp(x(2)*time);
jacobian = [-exp(x(2)*time), -x(1)*time.*exp(x(2)*time)];
disp(sprintf('%f %f %f', x(1), x(2), residual'*residual))
Below is example usage and output. Without the Jacobian specified, MATLAB will approximate the Jacobian by finite differences. Notice that this requires one additional function evaluation per optimization variable. This can be prohibitive when $x \in \mathbb{R}^n$ with $n$ on the order of $10^4$ or larger. When the Jacobian is specified, far fewer function evaluations are required, but you have to explicitly provide the analytic derivatives.

>> lsqnonlin(@pharmacokinetic,[1;0],-inf,inf,optimset('jacobian','off'))
1.000000 0.000000 2.390000
1.000000 0.000000 2.390000
1.000000 0.000000 2.390000
1.690000 -0.610000 0.212590
1.690000 -0.610000 0.212590
1.690000 -0.610000 0.212590
1.975070 -0.930547 0.007335
1.975070 -0.930547 0.007335
1.975070 -0.930547 0.007335
1.994066 -1.003607 0.002024
1.994066 -1.003607 0.002024
1.994066 -1.003607 0.002024
1.994955 -1.009347 0.001996
1.994955 -1.009347 0.001996
1.994955 -1.009347 0.001996
1.995002 -1.009520 0.001996

1.995002 -1.009520 0.001996
1.995002 -1.009520 0.001996

Local minimum possible.

lsqnonlin stopped because the final change in the sum of squares relative to
its initial value is less than the default value of the function tolerance.

ans =

1.9950
-1.0095

>> lsqnonlin(@pharmacokinetic,[1;0],-inf,inf,optimset('jacobian','on'))
1.000000 0.000000 2.390000
1.690000 -0.610000 0.212590
1.975070 -0.930547 0.007335
1.994066 -1.003607 0.002024
1.994955 -1.009347 0.001996
1.995002 -1.009520 0.001996

Local minimum possible.

lsqnonlin stopped because the final change in the sum of squares relative to
its initial value is less than the default value of the function tolerance.

ans =

1.9950
-1.0095

6.4 Line Search and Trust Region Strategies and Convergence



Beginning with an initial guess, $x_0$, optimization algorithms generate a sequence of iterates $\{x_k\}_{k=0}^{\infty}$ that terminates when a solution has been approximated to a predetermined accuracy. Function and derivative information at $x_k$ is used to generate the next iterate $x_{k+1}$. Trust region and line search algorithms are the most prevalent approaches. Heuristically, a line search algorithm starts by fixing the search direction and then identifies the distance to move along this direction, $\alpha_k$. In the trust region approach we fix the maximum distance to move from the current iterate, the trust region radius $\Delta_k$, then seek a direction to step subject to the distance constraint.
• line search: (1) fix search direction (2) choose distance

• trust region: (1) fix distance to move (2) choose search direction
(Definition) Line Search In a line search strategy the algorithm chooses a descent direction $p_k$ and searches along this direction for the next iterate with a lower function value, $f(x_{k+1}) < f(x_k)$. The distance to move along $p_k$ is formulated as a 1D optimization problem for $\alpha$
$$\min_{\alpha} \phi(\alpha) \qquad \phi(\alpha) \equiv f(x_k + \alpha p_k) \qquad (17)$$
Exact solution of the line search subproblem (17) is expensive and unnecessary. Typically an approximate solution to (17) is used to define $x_{k+1}$, and the subproblem is repeated at the new iterate.

Figure 47: Trust Region Intuition

(Definition) Trust Region In the trust region strategy, information about the objective function $f$ is used to construct a surrogate model function $m_k$ whose behavior near the current iterate $x_k$ is expected to be close to that of the actual function $f$. We search for a solution within the region, $\|p_k\| \leq \Delta_k$, in which we trust this model function to be a good approximation to the actual function. The model is typically chosen to be a quadratic function and the trust region approach proceeds by solving a sequence of subproblems.
$$\min_{p} m_k(p) \qquad \|p\| \leq \Delta_k \qquad m_k(p) \equiv f_k + p^\top \nabla f_k + \frac{1}{2} p^\top B_k p \qquad (18)$$
$B_k$ is typically chosen as the Hessian matrix or some approximation to it. The Gauss-Newton approximation to the Hessian, seen in (16), is an excellent example of a Quasi-Newton approximation of the Hessian for the nonlinear least squares problem. If the solution of the trust region subproblem does not achieve adequate decrease in the objective function, we conclude that the model function is not a good approximation and decrease the trust region radius which, by Taylor series, should produce a closer approximation to the actual function.
The performance of an optimization algorithm may be characterized in terms of its convergence to a solution, $x^*$.
(Definition) Convergence Rate We say that the convergence rate of an algorithm to a solution $x^*$ is $p$, $p \geq 1$, if there exists a positive constant $M$ (with $M < 1$ in the case $p = 1$) such that
$$\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^p} \leq M \qquad \text{for all } k \text{ sufficiently large}$$
Steepest descent methods converge linearly, $p = 1$, and Newton methods converge quadratically, $p = 2$.

6.5 Line Search, [Nocedal and Wright, 1999] Ch 3


Recall the line search subproblem (17).
$$\min_{\alpha} \phi(\alpha) \qquad \phi(\alpha) \equiv f(x_k + \alpha p_k)$$
where the search direction $p_k$ is typically assumed to be a descent direction, $p_k^\top \nabla f_k = (p_k, \nabla f_k) < 0$, obtained from
$$p_k = -B_k^{-1} \nabla f_k \qquad B_k = \begin{cases} I & \text{Steepest Descent} \\ \approx \nabla^2 f_k & \text{Quasi-Newton} \\ \nabla^2 f_k & \text{Newton} \end{cases}$$
In general,
• Exact solution to (17) would require a prohibitive number of objective function and gradient evaluations.
• Practical strategies perform an inexact line search that can achieve an adequate reduction of the objective function $f$ at minimal cost.

Figure 48: Wolfe Conditions. (Sufficient Decrease) The reduction in $f$ should be proportional to both the step length $\alpha$ and the directional derivative $\nabla f_k^\top p_k$. For example, a sequence of iterates whose function values decrease only infinitesimally, like $1/k$, $k = 1, 2, \ldots$, is NOT acceptable. (Curvature Condition) Rules out unacceptably short steps. The curvature condition ensures that the slope at the next iterate, $\phi'(\alpha_k)$, is greater than $c_2$ times the initial slope $\phi'(0)$. I.e., if the slope $\phi'(\alpha)$ is strongly negative, we have an indication that we can continue moving along in this direction. Otherwise, if the slope is only slightly negative or perhaps positive, we really cannot expect much more of a decrease in this direction.
A popular inexact line search approach imposes requirements on the step length $\alpha$ such that
• Sufficient decrease in the function is seen. Sufficient decrease is measured by the following inequality.
$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f_k^\top p_k \qquad c_1 \in (0, 1)$$
In practice, $c_1 \approx 10^{-4}$ is typically used.
• Step lengths are reasonably far from the current iterate. To rule out unacceptably small steps, a curvature condition is imposed that requires $\alpha_k$ to satisfy
$$\nabla f(x_k + \alpha_k p_k)^\top p_k \geq c_2 \nabla f_k^\top p_k \qquad c_2 \in (c_1, 1) \qquad 0 < c_1 < c_2 < 1$$
In practice, $c_2 \approx 0.9$ is typically used for Newton methods.
Collectively, the sufficient decrease and curvature conditions are known as the Wolfe conditions.
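The two Wolfe conditions are simple to test for a candidate step length. The following Python sketch is illustrative only: the function name `wolfe_conditions` and the one-variable test function $f(x) = x^2$ are assumptions made for the example, and the defaults for `c1` and `c2` are the typical values quoted above.

```python
def wolfe_conditions(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for step length alpha along p."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    slope0 = dot(grad(x), p)                       # phi'(0) = grad f_k^T p_k
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    sufficient = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = dot(grad(x_new), p) >= c2 * slope0
    return sufficient, curvature

# Example: f(x) = x^2 in one variable, starting at x = 1 along p = -grad f = -2.
f = lambda x: x[0] ** 2
grad = lambda x: [2.0 * x[0]]
for alpha in (1e-6, 0.5, 1.0):
    print(alpha, wolfe_conditions(f, grad, [1.0], [-2.0], alpha))
```

In this example the tiny step passes the sufficient decrease test but fails the curvature test (it is unacceptably short), the step $\alpha = 0.5$ lands on the minimizer and passes both, and the step $\alpha = 1$ overshoots to $x = -1$ and fails sufficient decrease.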

6.6 Applications: Ill conditioned matrices and Convergence


Steepest Descent Direction Analysis of the Wolfe conditions can be complex. Much can be learned from analysis of a straightforward steepest descent method, Algorithm 2, with exact line search. Suppose that
$$f(x) = \frac{1}{2} x^\top A x - b^\top x$$
Using the modal matrix (15) to obtain the eigendecomposition of $A$, we can simplify the objective function using a change of variables, $Q\hat{x} = x$
$$Q^\top A Q = \Lambda \qquad Q\hat{x} = x \qquad Q\hat{b} = b \;\Rightarrow\; f(x) = f(Q\hat{x}) = \frac{1}{2} \hat{x}^\top Q^\top A Q \hat{x} - b^\top Q \hat{x} = \frac{1}{2} \hat{x}^\top \Lambda \hat{x} - \hat{b}^\top \hat{x}$$

Algorithm 2 Steepest Descent
Require: tolerance $\epsilon > 0$ and initial guess $x_0$
k = 0
while $\|\nabla f_k\| > \epsilon$ do
    Compute $f_k$ and $\nabla f_k$
    Compute step length along gradient $\alpha_k$
    $x_{k+1} = x_k - \alpha_k \nabla f_k$
    k = k + 1
end while

Hence, without loss of generality, we can study problems where the matrix is diagonal.
$$f(x) = \frac{1}{2} x^\top A x - b^\top x \qquad \nabla f(x) = Ax - b \qquad A = \mathrm{diag}(\lambda_1, \lambda_2, \ldots)$$
Hence for $A$ SPD, the minimizer $x^*$ is the solution to the linear system $Ax^* = b$.

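As a concrete example, consider the quadratic function $f : \mathbb{R}^2 \to \mathbb{R}$
$$f(x) = \frac{1}{2} x^\top A x - b^\top x \qquad A \in \mathbb{R}^{2 \times 2} \text{ symmetric} \qquad b \in \mathbb{R}^2$$
$$A = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix} \qquad b = 0 \qquad Q^\top A Q = \Lambda = \begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix} \qquad Q = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \qquad Q^{-1} = Q^\top$$
Left: Graph of the function. Contour lines with the red lines indicate the eigenvector directions of $A$.
$$A_1 = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix} \qquad A_2 = \begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} \qquad A = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix} \qquad b_1 = \begin{bmatrix} -1 \\ -1 \end{bmatrix} \qquad b_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$
Indefinite matrices lead to saddle points. For a semi-definite Hessian matrix, the choice of $b$ influences the existence of a solution. In the singular direction, the function is dominated by the linear term $b$. The function based on $A$ and $b_1$ is unbounded from below and thus no solution exists. However, the function based on $A$ and $b_2$ is independent of $x_2$ and bounded from below, so a solution exists but it is not unique.

Figure 49: Quadratic Contours.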

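An exact solution to the line search may be obtained in this situation by differentiating the line search function with respect to $\alpha$, where $g_k = \nabla f_k = Ax_k - b$
$$\frac{d}{d\alpha} f(x_k - \alpha g_k) = \frac{d}{d\alpha} \left[ \frac{1}{2} (x_k - \alpha g_k)^\top A (x_k - \alpha g_k) - b^\top (x_k - \alpha g_k) \right] = 0 \;\Rightarrow\; \alpha = \frac{\nabla f_k^\top \nabla f_k}{\nabla f_k^\top A \nabla f_k}$$
Steepest descent with exact line search in this case yields
$$x_{k+1} = x_k - \frac{\nabla f_k^\top \nabla f_k}{\nabla f_k^\top A \nabla f_k} \nabla f_k$$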
The error in the solution may be bounded in terms of the ratio of the largest to smallest eigenvalue.
Theorem 6.5 (Convergence of Steepest Descent [Nocedal and Wright, 1999]). Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable, and that the iterates generated by the steepest descent method with exact line searches converge to a point $x^*$ where the Hessian matrix $\nabla^2 f(x^*)$ is positive definite. Then the convergence of the algorithm is bounded by the eigenvalues of the Hessian, $\lambda_1 \leq \ldots \leq \lambda_n$
$$f(x_{k+1}) - f(x^*) \leq \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2 \left[ f(x_k) - f(x^*) \right]$$
Examples are provided in Figure 49.
It is worth mentioning that the condition number is related to the largest and smallest eigenvalue, for the matrix norm induced by the 2-norm
$$\kappa(A) = \mathrm{cond} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} = \frac{\lambda_{\max}}{\lambda_{\min}} \qquad A : (\mathbb{R}^n, \|\cdot\|_2) \to (\mathbb{R}^n, \|\cdot\|_2)$$
Consider
$$f(x_1, x_2) = \frac{1}{2} (c_1 x_1^2 + c_2 x_2^2) \qquad A = \mathrm{diag}(c_1, c_2) \qquad (19)$$
This function is convex and has a global minimum at $x = 0$ when the eigenvalues $c_1$ and $c_2$ are positive. The steepest descent method takes many iterations to converge to a solution when the eigenvalues are far apart; however, the Newton method converges very quickly.
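The effect of the eigenvalue spread on (19) can be checked numerically. The Python sketch below is a minimal illustration under stated assumptions: it hard-codes the diagonal quadratic (19), uses the exact step length $\alpha = \nabla f_k^\top \nabla f_k / \nabla f_k^\top A \nabla f_k$, and picks the starting point $(c_2, c_1)$, which is a worst case for steepest descent on this problem. The function names and tolerances are hypothetical.

```python
def steepest_descent_diag(c, x0, tol=1e-8, max_iter=10000):
    """Steepest descent with exact line search on f(x) = 0.5 * sum(c_i * x_i^2)."""
    x = list(x0)
    f_history = []
    for k in range(max_iter):
        g = [ci * xi for ci, xi in zip(c, x)]            # gradient A x, A = diag(c)
        f_history.append(0.5 * sum(ci * xi * xi for ci, xi in zip(c, x)))
        g_dot_g = sum(gi * gi for gi in g)
        if g_dot_g ** 0.5 <= tol:
            return x, k, f_history
        alpha = g_dot_g / sum(ci * gi * gi for ci, gi in zip(c, g))  # exact step
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter, f_history

def newton_step_diag(c, x):
    """One Newton step x - A^{-1} (A x) for A = diag(c)."""
    return [xi - (ci * xi) / ci for ci, xi in zip(c, x)]

_, iters_well, _ = steepest_descent_diag([1.0, 1.0], [1.0, 1.0])
_, iters_ill, f_hist = steepest_descent_diag([1.0, 100.0], [100.0, 1.0])
print("iterations, c = (1, 1):  ", iters_well)
print("iterations, c = (1, 100):", iters_ill)
print("Newton step from (100, 1):", newton_step_diag([1.0, 100.0], [100.0, 1.0]))
```

With equal eigenvalues a single steepest descent step reaches the minimizer, while for $\kappa(A) = 100$ each step only reduces $f$ by the factor $((\lambda_n - \lambda_1)/(\lambda_n + \lambda_1))^2 \approx 0.96$ of Theorem 6.5. The Newton step reaches $x = 0$ in one step regardless of the conditioning, because the model is exact for a quadratic.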

http://www.cse.illinois.edu/iem/optimization/SteepestDescent
http://www.cse.illinois.edu/iem/optimization/Newton Opt2D

Figure 50: Steepest Descent Optimization

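Theorem 6.6 (Convergence of Newton Method [Nocedal and Wright, 1999]). Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is sufficiently differentiable in a neighborhood $N$ of a point $x^*$ at which the sufficient conditions are satisfied.
$$\nabla f(x^*) = 0 \qquad x^\top \nabla^2 f(x^*) \, x > 0 \;\; \forall x \neq 0$$
Then, for a starting point $x_0$ sufficiently close to $x^*$, the sequence generated by
$$x_{k+1} = x_k - \left( \nabla^2 f_k \right)^{-1} \nabla f_k$$
converges quadratically
$$\|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2 \qquad \|\nabla f_{k+1}\| \leq \tilde{C} \|\nabla f_k\|^2$$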

Newton Direction It is helpful to look at Newton's method in 1-D. We expect to see quadratic convergence, i.e. the number of correct digits roughly doubles at each iteration. In 1-D, Newton's method becomes
$$H \left[ x_{k+1} - x_k \right] = -\nabla f_k \;\Rightarrow\; f''_k \left[ x_{k+1} - x_k \right] = -f'_k \;\Rightarrow\; x_{k+1} = x_k - \frac{f'_k}{f''_k}$$

function newtonExample(x0)
x = x0;
Tol = 0.000001;
count = 0;
fprime = x - x^2;
f = 1/2*x^2 - 1/3*x^3; % compute the new value of f(x)
dx = 1; % this is a fake value so that the while loop will execute
fprintf('step x dx f(x)\n')
fprintf('---- ---------------------- -------------------- ---------------------\n')
fprintf('%3i %23.16e %23.16e %23.16e\n', count, x, dx, f)
xVec = x; fVec = f;
while (abs(fprime) > Tol) % loop until stationary point reached
    count = count + 1;
    fprime = x - x^2;
    fdoubleprime = 1 - 2*x;
    xnew = x - (fprime/fdoubleprime); % compute the new value of x
    dx = abs(x - xnew); % compute how much x has changed since last step
    x = xnew;
    f = 1/2*x^2 - 1/3*x^3; % compute the new value of f(x)
    fprintf('%3i %23.16e %23.16e %23.16e\n', count, x, dx, f)
end

$f(x) = \frac{1}{2} x^2 - \frac{1}{3} x^3$

http://www.cse.illinois.edu/iem/nonlinear eqns/Newton

Figure 51: Quadratic Convergence of Newton Method

>> newtonExample(20)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 2.0000000000000000e+01 1.0000000000000000e+00 -2.4666666666666665e+03
1 1.0256410256410257e+01 9.7435897435897427e+00 -3.0704046483139189e+02
2 5.3910172175612390e+00 4.8653930388490183e+00 -3.7694964230510664e+01
3 2.9710656645912565e+00 2.4199515529699824e+00 -4.3284789023937087e+00
4 1.7861182949934300e+00 1.1849473695978265e+00 -3.0425996536838751e-01
5 1.2402508292312779e+00 5.4586746576215206e-01 1.3318397332485366e-01

6 1.0389870964455772e+00 2.0126373278570076e-01 1.6588691644185172e-01
7 1.0014100464549900e+00 3.7577049990587197e-02 1.6666567161666468e-01
8 1.0000019826397770e+00 1.4080638152129676e-03 1.6666666666470126e-01
9 1.0000000000039309e+00 1.9826358461649818e-06 1.6666666666666669e-01
10 1.0000000000000000e+00 3.9308556409878292e-12 1.6666666666666669e-01
>> newtonExample(-20)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 -2.0000000000000000e+01 1.0000000000000000e+00 2.8666666666666665e+03
1 -9.7560975609756095e+00 1.0243902439024390e+01 3.5712385678288661e+02
2 -4.6402366520692553e+00 5.1158609089063543e+00 4.4070108044524261e+01
3 -2.0944362725536232e+00 2.5458003795156321e+00 5.2558605600796291e+00
4 -8.4539815955291497e-01 1.2490381130007082e+00 5.5875049560892487e-01
5 -2.6560837886568645e-01 5.7978978068722853e-01 4.1519935359147608e-02
6 -4.6073039997407417e-02 2.1953533886827903e-01 1.0939626388017812e-03
7 -1.9436273713611465e-03 4.4129412626046270e-02 1.8912911515357276e-06
8 -3.7630593882501204e-06 1.9398643119728964e-03 7.0803257421616288e-12
9 -1.4160509385405506e-11 3.7630452277407350e-06 1.0026001302802519e-22
10 -2.0052182829735183e-22 1.4160509385204984e-11 2.0104501811856324e-44
>> newtonExample(.49)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 4.8999999999999999e-01 1.0000000000000000e+00 8.0833666666666665e-02
1 -1.2004999999999990e+01 1.2494999999999990e+01 6.4878031254166513e+02
2 -5.7624960015993549e+00 6.2425039984006352e+00 8.0387019317008566e+01
3 -2.6512080933838607e+00 3.1112879082154943e+00 9.7261482145687737e+00
4 -1.1152713730936505e+00 1.5359367202902101e+00 1.0843178694211222e+00
5 -3.8502206389628169e-01 7.3024930919736886e-01 9.3146473785263459e-02
6 -8.3750448567531555e-02 3.0127161532875013e-01 3.7028812087205941e-03
7 -6.0078220517643943e-03 7.7742626515767160e-02 1.8119244863963280e-05
8 -3.5665383253675152e-05 5.9721566685107192e-03 6.3602490367083519e-10
9 -1.2719288349747086e-09 3.5664111324840178e-05 8.0890148130596978e-19
10 -1.6178029439585086e-18 1.2719288333569056e-09 1.3086431827404087e-36
>> newtonExample(.51)
step x dx f(x)
---- ---------------------- -------------------- ---------------------
0 5.1000000000000001e-01 1.0000000000000000e+00 8.5832999999999993e-02
1 1.3004999999999990e+01 1.2494999999999990e+01 -6.4861364587499838e+02
2 6.7624960015993558e+00 6.2425039984006343e+00 -8.0220352650341923e+01
3 3.6512080933838615e+00 3.1112879082154943e+00 -9.5594815479021165e+00
4 2.1152713730936510e+00 1.5359367202902106e+00 -9.1765120275445655e-01
5 1.3850220638962822e+00 7.3024930919736875e-01 7.3520192881403101e-02
6 1.0837504485675318e+00 3.0127161532875046e-01 1.6296378545794610e-01
7 1.0060078220517643e+00 7.7742626515767466e-02 1.6664854742180274e-01
8 1.0000356653832536e+00 5.9721566685106975e-03 1.6666666603064179e-01
9 1.0000000012719288e+00 3.5664111324829051e-05 1.6666666666666669e-01
10 1.0000000000000000e+00 1.2719287845186500e-09 1.6666666666666669e-01
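The doubling of correct digits in the output above can be quantified by the ratio $|x_{k+1} - x^*| / |x_k - x^*|^2$, which should approach a constant for quadratic convergence. The Python sketch below is an illustration for the same function $f(x) = \frac{1}{2}x^2 - \frac{1}{3}x^3$; the function names, the stopping tolerance and the `floor` cutoff (used to skip errors too small to resolve in double precision) are assumptions of this example, not part of the MATLAB listing. Note that Newton's method only seeks stationary points: since $f''(1) = -1$, the point $x = 1$ reached from $x_0 = 20$ is a local maximum of $f$, while $x = 0$ is a local minimum.

```python
def newton_iterates(x0, tol=1e-12, max_iter=50):
    """Newton iterates for the stationary points of f(x) = x^2/2 - x^3/3."""
    xs = [x0]
    x = x0
    for _ in range(max_iter):
        fprime = x - x * x
        if abs(fprime) <= tol:
            break
        x = x - fprime / (1.0 - 2.0 * x)     # x_{k+1} = x_k - f'/f''
        xs.append(x)
    return xs

def quadratic_ratios(xs, x_star, floor=1e-7):
    """Ratios e_{k+1} / e_k^2 for errors e_k = |x_k - x_star| larger than floor."""
    errors = [abs(x - x_star) for x in xs]
    return [e1 / (e0 * e0) for e0, e1 in zip(errors, errors[1:]) if e0 > floor]

for x0, x_star in ((20.0, 1.0), (-20.0, 0.0)):
    xs = newton_iterates(x0)
    print(x0, "->", xs[-1], [round(r, 4) for r in quadratic_ratios(xs, x_star)])
```

For this particular function the update simplifies to $x_{k+1} = x_k^2 / (2x_k - 1)$, so the error ratio equals $1/|2x_k - 1|$ and tends to 1 at both stationary points.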

6.7 Quasi Newton Hessian Approximations, [Nocedal and Wright, 1999] Ch 8


Common examples of Hessian approximations are the Gauss-Newton approximation to the Hessian that we saw in our nonlinear least squares approach, the Levenberg-Marquardt method that is seen in the MATLAB lsqnonlin function, and BFGS methods.
In the nonlinear least squares problem, Section 6.3, the true Hessian is typically very involved to compute because of the need for second derivative terms.
(Definition) Gauss-Newton The Gauss-Newton Method, (16), approximates the Hessian with first derivative information and saves implementation time.
$$(H_l(\vec{x}))_{ij} = \frac{\partial^2 r_l}{\partial x_i \partial x_j} \qquad \nabla^2 f_k = J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \approx J^\top(\vec{x}_k) J(\vec{x}_k) \equiv B_k$$

Levenberg-Marquardt methods build upon the Gauss-Newton method by shifting the eigenvalues of the Hessian approximation to make it positive definite.
(Definition) Levenberg-Marquardt The Levenberg-Marquardt method approximates the Hessian of the nonlinear least squares problem by replacing the second derivative terms with a scalar multiple of the identity matrix.
$$\nabla^2 f_k = J^\top(\vec{x}_k) J(\vec{x}_k) + \sum_{i=1}^{m} r_i(\vec{x}_k) H_i(\vec{x}_k) \approx J^\top(\vec{x}_k) J(\vec{x}_k) + \mu_k I \equiv B_k \qquad \mu_k > 0$$

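Notice that the resulting step interpolates between the Gauss-Newton direction (as $\mu_k \to 0$) and the steepest descent direction (as $\mu_k \to \infty$), and that it is the solution of an augmented linear least squares problem.
$$\left[ J^\top(\vec{x}_k) J(\vec{x}_k) + \mu_k I \right] \vec{s}_k = -J^\top(\vec{x}_k) \vec{r}(\vec{x}_k)$$
$$A = \begin{bmatrix} J(\vec{x}_k) \\ \sqrt{\mu_k} I \end{bmatrix} \qquad y = \begin{bmatrix} -\vec{r}(\vec{x}_k) \\ 0 \end{bmatrix}$$
$$A^\top A = \begin{bmatrix} J^\top(\vec{x}_k) & \sqrt{\mu_k} I \end{bmatrix} \begin{bmatrix} J(\vec{x}_k) \\ \sqrt{\mu_k} I \end{bmatrix} = J^\top(\vec{x}_k) J(\vec{x}_k) + \mu_k I \qquad A^\top y = \begin{bmatrix} J^\top(\vec{x}_k) & \sqrt{\mu_k} I \end{bmatrix} \begin{bmatrix} -\vec{r}(\vec{x}_k) \\ 0 \end{bmatrix} = -J^\top(\vec{x}_k) \vec{r}(\vec{x}_k)$$
$$A^\top A \, \vec{s}_k = A^\top y \;\Rightarrow\; \begin{bmatrix} J(\vec{x}_k) \\ \sqrt{\mu_k} I \end{bmatrix} \vec{s}_k \approx \begin{bmatrix} -\vec{r}(\vec{x}_k) \\ 0 \end{bmatrix}$$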
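The limiting behavior of the Levenberg-Marquardt step can be seen on the exponential fit of Section 6.3. The Python sketch below is a hedged illustration: the function name `lm_step` is hypothetical, and the values of $J^\top J$ and $J^\top \vec{r}$ are computed here from the Jacobian and residual at $\vec{x}_0 = [1, 0]^\top$ given in that section.

```python
def lm_step(JtJ, Jtr, mu):
    """Solve the 2x2 system (J^T J + mu*I) s = -J^T r by Cramer's rule."""
    a, b, d = JtJ[0][0] + mu, JtJ[0][1], JtJ[1][1] + mu
    det = a * d - b * b
    s1 = (-Jtr[0] * d + Jtr[1] * b) / det
    s2 = (-Jtr[1] * a + Jtr[0] * b) / det
    return [s1, s2]

# J^T J and J^T r at x0 = [1, 0] for the exponential fit of Section 6.3
JtJ = [[4.0, 6.0], [6.0, 14.0]]
Jtr = [0.9, 4.4]
for mu in (0.0, 1.0, 1e8):
    print(mu, lm_step(JtJ, Jtr, mu))
```

With $\mu_k = 0$ the Gauss-Newton step $[0.69, -0.61]^\top$ is recovered, while for very large $\mu_k$ the step approaches $-J^\top \vec{r} / \mu_k$, a short step along the steepest descent direction.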
The Broyden, Fletcher, Goldfarb, Shanno (BFGS) update is a widely used approximation to the Hessian available in many packages. Details of why the BFGS approximation is in general a good approximation to the Hessian may be found in [Nocedal and Wright, 1999] Chapter 8. Similar to the Gauss-Newton and Levenberg-Marquardt approaches, only first derivative information is needed.
(Definition) BFGS The BFGS method updates the matrix $H_k$, which in the form below approximates the inverse of the Hessian, using information at previous iterations
$$s_k \equiv x_{k+1} - x_k \qquad y_k \equiv \nabla f_{k+1} - \nabla f_k$$
$$H_{k+1} = \left( I - \rho_k s_k y_k^\top \right) H_k \left( I - \rho_k s_k y_k^\top \right)^\top + \rho_k s_k s_k^\top \qquad \rho_k = \frac{1}{y_k^\top s_k}$$
Here the update is built from outer products of vectors, each of which is a matrix of rank 1.
$$ab^\top = \begin{bmatrix} a_1 b_1 & a_1 b_2 & \cdots & a_1 b_n \\ a_2 b_1 & a_2 b_2 & \cdots & a_2 b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_n b_1 & a_n b_2 & \cdots & a_n b_n \end{bmatrix}$$
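A small Python sketch of the update formula may help. It is an illustration only: the helper names are hypothetical, the vectors `s` and `y` are made-up values chosen so that $y^\top s > 0$, and dense list-of-lists matrices are used for clarity rather than efficiency. The check at the end is the secant condition $H_{k+1} y_k = s_k$, which the update satisfies by construction.

```python
def outer(a, b):
    """Rank-one outer product a b^T."""
    return [[ai * bj for bj in b] for ai in a]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def bfgs_update(H, s, y):
    """H_new = (I - rho s y^T) H (I - rho s y^T)^T + rho s s^T, rho = 1 / (y^T s)."""
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
    sy = outer(s, y)
    V = [[(1.0 if i == j else 0.0) - rho * sy[i][j] for j in range(n)] for i in range(n)]
    ss = outer(s, s)
    VHVt = matmul(matmul(V, H), transpose(V))
    return [[VHVt[i][j] + rho * ss[i][j] for j in range(n)] for i in range(n)]

H0 = [[1.0, 0.0], [0.0, 1.0]]
s, y = [1.0, 0.5], [2.0, 0.3]       # illustrative values with y^T s > 0
H1 = bfgs_update(H0, s, y)
H1y = [sum(H1[i][j] * y[j] for j in range(2)) for i in range(2)]
print(H1y)   # secant condition: H1 y should reproduce s
```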

6.8 Trust Region, [Nocedal and Wright, 1999] Ch 4


Recall the trust region model subproblem (18).
$$\min_{p} m_k(p) \qquad \|p\| \leq \Delta_k \qquad m_k(p) \equiv f_k + p^\top \nabla f_k + \frac{1}{2} p^\top B_k p$$
Again, the quadratic term may be taken as the true Hessian, $B_k = \nabla^2 f_k$, some Hessian approximation seen in Section 6.7, or may even be zero.
$$B_k = \begin{cases} 0 & \text{Steepest Descent} \\ \approx \nabla^2 f_k & \text{Quasi-Newton} \\ \nabla^2 f_k & \text{Newton} \end{cases}$$
It is important to note that the Hessian approximation may even be indefinite.

The trust region algorithm requires efficient and accurate approximate solutions of the subproblem (18). It is helpful to characterize an exact solution to the trust region subproblem when developing such approximations.

Algorithm 3 Trust Region.
The trust region algorithm is fairly intuitive and allows the trust region radius to grow and shrink depending on how well the surrogate model approximates the actual function at a given iteration.

Require: tolerance $\epsilon > 0$, maximum trust radius $\bar{\Delta} > 0$, and initial trust radius $\Delta_0 \in (0, \bar{\Delta})$
k = 0
while $\|\nabla f_k\| > \epsilon$ do
    Obtain a search direction by approximately solving the trust region subproblem (18).
        $$p_k \approx \arg\min_{\|p\| \leq \Delta_k} m_k(p)$$
    Evaluate how well the surrogate model is approximating the actual function
        $$\rho_k = \frac{\text{actual reduction}}{\text{predicted reduction}} = \frac{f_k - f(x_k + p_k)}{m_k(0) - m_k(p_k)}$$
    if $\rho_k < 1/4$ then
        Surrogate model is not a good model, shrink the trust radius
            $$\Delta_{k+1} = \frac{1}{4} \|p_k\|$$
    else if $\rho_k > 3/4$ and $\|p_k\| = \Delta_k$ then
        Surrogate model is a good model and the step is on the boundary of the trust region. Grow the trust region.
            $$\Delta_{k+1} = \min(2\Delta_k, \bar{\Delta})$$
    else
        Surrogate model is a reasonable approximation or we are not hitting the boundary.
            $$1/4 \leq \rho_k \leq 3/4 \quad \text{or} \quad \|p_k\| < \Delta_k$$
        Continue with current trust region.
            $$\Delta_{k+1} = \Delta_k$$
        We only increase the trust region if we hit the boundary. Since we have not hit the boundary, we infer that the trust region is not interfering with the algorithm.
    end if
    Accept the step, $x_{k+1} = x_k + p_k$, if $\rho_k$ exceeds a small threshold $\eta \in [0, 1/4)$; otherwise keep $x_{k+1} = x_k$ [Nocedal and Wright, 1999].
    k = k + 1
end while
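The radius update logic of Algorithm 3 fits in a small function. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and `boundary_tol` is an added floating-point slack for deciding whether the step lies on the boundary, since an exact equality test $\|p_k\| = \Delta_k$ is fragile in practice.

```python
def update_trust_radius(rho, step_norm, delta, delta_max, boundary_tol=1e-12):
    """Radius update of Algorithm 3; boundary_tol is an assumed floating-point slack."""
    if rho < 0.25:
        return 0.25 * step_norm                   # poor model: shrink
    if rho > 0.75 and abs(step_norm - delta) <= boundary_tol * delta:
        return min(2.0 * delta, delta_max)        # good model on the boundary: grow
    return delta                                  # otherwise keep the radius

print(update_trust_radius(0.1, 0.8, 1.0, 4.0))   # poor model: shrink
print(update_trust_radius(0.9, 1.0, 1.0, 4.0))   # good model, step on boundary: grow
print(update_trust_radius(0.9, 0.3, 1.0, 4.0))   # good model, interior step: keep
print(update_trust_radius(0.9, 3.0, 3.0, 4.0))   # growth capped at delta_max
```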

Theorem 6.7 (Characterization of Exact Solution to Trust Region Subproblem). The vector $p^*$ is a global solution of the trust-region problem
$$\min_{p \in \mathbb{R}^n} m(p) = f + g^\top p + \frac{1}{2} p^\top B p \qquad \|p\| \leq \Delta$$
if and only if $p^*$ is feasible and there is a scalar $\lambda \geq 0$ such that
$$(B + \lambda I) p^* = -g$$
$$\lambda (\Delta - \|p^*\|) = 0$$
$$(B + \lambda I) \text{ is positive semi-definite}$$

Notice that the sum $B + \lambda I$ is positive semi-definite, but $B$ itself may not be positive semi-definite. The characterization of the solution provides a complementarity condition: at least one of the nonnegative quantities $\lambda$ or $(\Delta - \|p^*\|)$ must be zero. Hence when the solution lies strictly within the trust region, the solution is the Newton or quasi-Newton step, depending on the Hessian approximation $B$.
$$\Delta > \|p^*\| \;\Rightarrow\; \lambda = 0 \qquad Bp^* = -g \qquad B \text{ positive semi-definite}$$
When the solution lies on the boundary of the trust region with $\lambda > 0$, it is collinear with the negative gradient of the model function $m$ and normal to its contours
$$\Delta = \|p^*\| \;\Rightarrow\; \lambda p^* = -Bp^* - g = -\nabla m(p^*)$$

Figure 52: Cauchy Point.

As with line search methods, we do not attempt to solve the trust region subproblem (18) exactly. The general approach is to develop algorithms that begin with the so-called Cauchy point and try to improve upon this estimate, ideally taking the full Newton step $p_k = -(\nabla^2 f_k)^{-1} \nabla f_k$ provided the Hessian is positive definite and the step lies within the trust radius, $\|p_k\| \leq \Delta_k$.
(Definition) Cauchy Point Find the vector $p_k^s$ that solves a linear version of (18).
$$p_k^s = \arg\min_{p \in \mathbb{R}^n} f_k + \nabla f_k^\top p \qquad \|p\| \leq \Delta_k \qquad (20)$$
Calculate the scalar $\tau_k > 0$ that minimizes $m_k(\tau p_k^s)$ subject to satisfying the trust-region bound, that is,
$$\tau_k = \arg\min_{\tau > 0} m_k(\tau p_k^s) \qquad \|\tau p_k^s\| \leq \Delta_k$$
Set $p_k^C = \tau_k p_k^s$.

Using our characterization of the trust region solution, the solution to (20) is
$$B = 0, \;\; ((B + \lambda I)x, x) \geq 0 \;\Rightarrow\; p_k^s = -\frac{1}{\lambda} \nabla f_k \qquad \Delta_k = \|p_k^s\| \;\Rightarrow\; \lambda = \frac{\|\nabla f_k\|}{\Delta_k} \;\Rightarrow\; p_k^s = -\frac{\Delta_k}{\|\nabla f_k\|} \nabla f_k$$

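Minimizing the model along this direction, the Cauchy point is found at some step length, $\alpha = \tau_k \frac{\Delta_k}{\|\nabla f_k\|}$, along the steepest descent direction, subject to the trust region radius.
$$p_k^C = -\tau_k \frac{\Delta_k}{\|\nabla f_k\|} \nabla f_k \qquad \tau_k = \begin{cases} 1 & \nabla f_k^\top B_k \nabla f_k \leq 0 \\ \min\left( \dfrac{\|\nabla f_k\|^3}{\Delta_k \nabla f_k^\top B_k \nabla f_k}, \, 1 \right) & \text{otherwise} \end{cases}$$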
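The closed-form Cauchy point is straightforward to evaluate. The following Python sketch is illustrative only: the function name `cauchy_point` and the example matrix and gradient are assumptions, and a nonzero gradient is assumed.

```python
import math

def cauchy_point(g, B, delta):
    """Cauchy point p = -tau * (delta / ||g||) * g for m(p) = f + g^T p + 0.5 p^T B p."""
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    Bg = [sum(B[i][j] * g[j] for j in range(len(g))) for i in range(len(g))]
    gBg = sum(gi * Bgi for gi, Bgi in zip(g, Bg))
    if gBg <= 0.0:
        tau = 1.0                                   # model decreases all the way to the boundary
    else:
        tau = min(g_norm ** 3 / (delta * gBg), 1.0)
    return [-tau * delta / g_norm * gi for gi in g]

B = [[2.0, 0.0], [0.0, 2.0]]
print(cauchy_point([2.0, 0.0], B, 10.0))   # interior minimizer along -g
print(cauchy_point([2.0, 0.0], B, 0.5))    # step truncated at the trust region boundary
```

In the first call the trust region is large, so the Cauchy point is the unconstrained minimizer of the model along $-\nabla f_k$; in the second call that minimizer lies outside the region and the step is cut back to length $\Delta_k$.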

6.9 Newton-Krylov Trust Region Methods, [Nocedal and Wright, 1999] Ch 6


Despite its slow (linear) convergence, the steepest descent direction is robust: combined with a suitable line search or trust region strategy it is globally convergent, i.e. it makes progress toward a solution from any starting point. On the other hand, the Newton method is not robust but converges fast when close to a solution. Is there a way to combine the best of both worlds to devise an optimization algorithm that has both properties?
• Robust as steepest descent and will always make progress to a local minimum
• Converges as fast as the Newton method when close to a solution
Trust region Newton-Krylov optimization methods are an elegant solution technique, when Hessian information is available, that satisfies these properties. A classical example is a Trust Region Newton Conjugate Gradient (CG) algorithm. Despite the quadratic convergence of the Newton method, a direct solution of the Newton system at every optimization iteration could be potentially very expensive.

$$\nabla^2 f_k \, s_k = -\nabla f_k \qquad (21)$$

• For a large number of optimization variables, the Gaussian elimination solve grows in computational complexity as $O(n^3)$.
• Storage for the Hessian matrix can be quite expensive. Consider a 256 × 256 × 10 image. This image has 655360 voxels. The corresponding dense matrix for an optimization problem with one variable per voxel requires 655360 × 655360 entries × 8 bytes/entry ≈ 3.4 TB.
• The Hessian inverse, and hence a well defined solution, may not exist. Also, the Hessian may not be positive definite, in which case the Newton direction need not be a descent direction.
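The storage estimate in the second bullet is simple arithmetic and can be reproduced as follows. This is a quick check under the stated assumptions of a dense matrix with 8-byte (double precision) entries and one unknown per voxel; the variable names are illustrative.

```python
n_voxels = 256 * 256 * 10            # one optimization variable per voxel
bytes_per_entry = 8                  # double precision
hessian_bytes = n_voxels ** 2 * bytes_per_entry
print(n_voxels)                      # number of unknowns
print(hessian_bytes / 1e12)          # dense Hessian storage in terabytes (10^12 bytes)
```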
Alternatively, a CG solution for the Newton step $p_k$ has several attractive properties, Algorithm 4.
• Krylov space methods work under the assumption that the Hessian is a linear operator $H : \mathbb{R}^n \to \mathbb{R}^n$ and require only the action of the linear operator on a vector, i.e. the matrix-vector product. Hence, the full matrix does not need to be stored. Further, in theory, the accuracy of the linear system solve can be controlled by the number of matrix-vector products. This provides a mechanism to control the work performed within the inner loop of the Newton solve. Most rules for terminating the iterative solver for (21) are based on the residual.

$$r_k = \nabla^2 f(x_k) p_k + \nabla f(x_k)$$

A single iteration gives the least accurate solution. The accuracy improves (decreasing residual) with further iterations until the number of iterations reaches the full number of degrees of freedom, at which point machine precision can in theory be achieved (same as Gaussian elimination). We can control the amount of work within the inner Newton solve by the tolerance we set on the residual. Since the residual is not invariant to scaling of the objective function, the size of the residual relative to the right hand side of (21), $\|\nabla f(x_k)\|$, is used, and the solver is terminated when the residual is less than the gradient norm scaled by a forcing sequence $\eta_k$

$$\|r_k\| \leq \eta_k \|\nabla f(x_k)\| \qquad 0 < \eta_k < 1$$

Intuitively, notice that this means that we solve the Newton system more and more accurately as we approach a solution, $\|\nabla f_k\| = 0$. When we are initially starting out, the matrix may not be positive definite and the Newton direction may not be a descent direction, so we are essentially taking a step that is closer to a steepest descent direction. In fact, the choice of $\eta_k = \min(0.5, \|\nabla f_k\|)$ may be shown to recover quadratic convergence of the Newton method, Theorem 6.2 [Nocedal and Wright, 1999]. The choice of convergence criterion, $\eta_k = \min(0.5, \|\nabla f_k\|)$, is known as the Eisenstat-Walker convergence criterion.

Algorithm 4 Newton-CG Approximation to Trust Region subproblem, (18), with Eisenstat-Walker convergence criteria
Require: $\Delta > 0$, $p_0 = 0$, $r_0 = g$, $d_0 = -r_0$, where $r_j = Bp_j + g$ is the residual, $p_j$ the solution, and $d_j$ the search direction
for j = 0, 1, 2, ... do
    if $d_j^\top B d_j \leq 0$ then
        Negative curvature is detected. Return the optimal solution, as a function of $\tau$, along the current search direction, $p = p_j + \tau d_j$, that satisfies $\|p\| = \Delta$.
            $$\min_{\tau} m(p) \qquad m(p) = f + g^\top p + \frac{1}{2} p^\top B p \qquad p = p_j + \tau d_j \qquad \|p\| = \Delta$$
        return p;
    end if
    Subsequent CG iterates serve to improve the model value. Notice that the computational work is in the mat-vec multiply, $Bd_j$
        $$\alpha_j = \frac{r_j^\top r_j}{d_j^\top B d_j} \qquad p_{j+1} = p_j + \alpha_j d_j$$
    • Notice that if j = 0, the search direction is the negative gradient and the first iterate is a step along the steepest descent direction.
        $$p_1 = \alpha_0 d_0 = \frac{r_0^\top r_0}{d_0^\top B d_0} d_0 = -\frac{g^\top g}{g^\top B g} g$$
      The initialization of $p_0$ to zero is a crucial feature of the algorithm. Hence when the first iteration encounters the boundary, the returned step is exactly the Cauchy point.
    CG iterates monotonically increase the length of the step, Theorem 4.2 [Nocedal and Wright, 1999], moving from the steepest descent direction toward the Newton step.
        $$0 = \|p_0\| < \ldots < \|p_j\| < \|p_{j+1}\| < \ldots \leq \Delta$$
    if $\|p_{j+1}\| \geq \Delta$ then
        The trust region boundary has been reached. Find $\tau \geq 0$ such that $p = p_j + \tau d_j$ satisfies
            $$\|p\| = \Delta$$
        return p;
    end if
    Update residual
        $$r_{j+1} = r_j + \alpha_j B d_j$$
    if $\|r_{j+1}\| \leq \min(0.5, \|\nabla f_k\|) \|\nabla f_k\|$ then
        The CG solution transforms from the gradient direction initially to the Newton direction when fully converged. Notice that when we are far from the solution we do not think our Newton direction is particularly favorable, so we require less work and only approximately solve for the Newton direction. However, as we get closer, we solve the system better and better to recover quadratic convergence.
        return p = $p_{j+1}$;
    end if
    Update search directions
        $$\beta_{j+1} = \frac{r_{j+1}^\top r_{j+1}}{r_j^\top r_j} \qquad d_{j+1} = -r_{j+1} + \beta_{j+1} d_j$$
end for
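The steps of Algorithm 4 can be sketched compactly in Python. This is a minimal dense-matrix illustration rather than a production solver: the helper names are hypothetical, `B` is stored explicitly although only products $Bd_j$ are needed, and in the negative curvature branch the positive root $\tau$ is taken, a common simplification of choosing the boundary point with the lower model value.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(B, v):
    return [dot(row, v) for row in B]

def to_boundary(p, d, delta):
    """Return p + tau*d with tau >= 0 chosen so that ||p + tau*d|| = delta."""
    a, b, c = dot(d, d), 2.0 * dot(p, d), dot(p, p) - delta ** 2
    tau = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return [pi + tau * di for pi, di in zip(p, d)]

def cg_steihaug(B, g, delta, max_iter=100):
    """Approximately minimize g^T p + 0.5 p^T B p subject to ||p|| <= delta."""
    g_norm = math.sqrt(dot(g, g))
    tol = min(0.5, g_norm) * g_norm              # forcing-sequence tolerance
    p = [0.0] * len(g)
    r = list(g)
    d = [-ri for ri in r]
    if g_norm == 0.0:
        return p
    for _ in range(max_iter):
        Bd = matvec(B, d)                        # the only use of B: a mat-vec product
        dBd = dot(d, Bd)
        if dBd <= 0.0:                           # negative curvature: go to the boundary
            return to_boundary(p, d, delta)
        alpha = dot(r, r) / dBd
        p_next = [pi + alpha * di for pi, di in zip(p, d)]
        if math.sqrt(dot(p_next, p_next)) >= delta:
            return to_boundary(p, d, delta)      # step leaves the trust region
        r_next = [ri + alpha * Bdi for ri, Bdi in zip(r, Bd)]
        if math.sqrt(dot(r_next, r_next)) <= tol:
            return p_next
        beta = dot(r_next, r_next) / dot(r, r)
        d = [-ri + beta * di for ri, di in zip(r_next, d)]
        p, r = p_next, r_next
    return p

B = [[2.0, -1.0], [-1.0, 2.0]]
print(cg_steihaug(B, [1e-3, 0.0], 1.0))   # small gradient, tight tolerance: Newton step
print(cg_steihaug(B, [1.0, 0.0], 0.1))    # small radius: first step stops at the boundary
```

In the first call the gradient is small, so the forcing tolerance is tight and two CG iterations recover the Newton step $-B^{-1}g$. In the second call the first CG step already leaves the trust region, and the returned point coincides with the Cauchy point.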

• Adjust the trust region for subsequent Newton iterates based on the current solve. This essentially controls the step length of the model. The trust region or step length along the descent direction is allowed to change according to how well the quadratic model, $m_k(p_k)$, approximates the actual function. When the step has reached the boundary of the trust region and the model is a good approximation, the trust region size is increased.
$$\frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} \begin{cases} < 1/4 & \Rightarrow \text{decrease trust region} \\ > 3/4 & \Rightarrow \text{increase trust region if } \|p_k\| = \Delta_k \end{cases}$$

Near a well behaved solution, the trust region can be shown to become inactive as quadratic convergence is obtained, Theorem 6.4 [Nocedal and Wright, 1999].
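• Each successive step of the conjugate gradient algorithm can be shown to be a descent direction for
function $m$, provided that the positive curvature condition is met. Suppose that $p_j$ is the current CG
solution, the CG update implies $p_{j+1} = p_j + \alpha_j d_j$
\[ m(p_{j+1}) = m(p_j + \alpha_j d_j) = m(p_j) + \alpha_j d_j^\top g + \frac{1}{2} \alpha_j^2 d_j^\top B d_j = m(p_j) + \alpha_j d_j^\top g + \frac{1}{2} \frac{(r_j^\top r_j)^2}{d_j^\top B d_j} \]
Here the cross term $\alpha_j d_j^\top B p_j$ vanishes because $p_j$ is a combination of the earlier directions
$d_0, \dots, d_{j-1}$. By definition of CG variables and conjugacy of search directions, with $p^*$ the
unconstrained minimizer of the model,
\[ d_i^\top B d_j = 0 \;\; (i \ne j) \qquad \alpha_j = \frac{r_j^\top r_j}{d_j^\top B d_j} \qquad -g = B p^* = B \sum_i \alpha_i d_i \;\Rightarrow\; -d_j^\top g = d_j^\top B \sum_i \alpha_i d_i = \alpha_j d_j^\top B d_j = r_j^\top r_j \]

substituting
\[ m(p_{j+1}) = m(p_j) - \frac{1}{2} \frac{(r_j^\top r_j)^2}{d_j^\top B d_j} \]

reveals that the next CG iterate is a descent direction provided that the positive curvature condition
is obtained, $d_j^\top B d_j > 0$.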
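The radius adjustment rule in the first bullet above can be sketched as a small function. The thresholds 1/4 and 3/4 come from the text; the shrink factor 0.25, the growth factor 2, the cap `delta_max`, and the boundary tolerance are common choices that are assumed here rather than stated in these notes.

```python
def update_trust_radius(rho, step_norm, delta, delta_max=10.0):
    """Return the next trust region radius given rho = actual / predicted reduction."""
    if rho < 0.25:
        return 0.25 * delta                       # poor agreement: shrink (assumed factor)
    on_boundary = abs(step_norm - delta) <= 1e-12 * delta
    if rho > 0.75 and on_boundary:
        return min(2.0 * delta, delta_max)        # good agreement on the boundary: grow
    return delta                                  # otherwise keep the radius

shrunk = update_trust_radius(rho=0.1, step_norm=1.0, delta=1.0)
grown = update_trust_radius(rho=0.9, step_norm=1.0, delta=1.0)
kept = update_trust_radius(rho=0.9, step_norm=0.5, delta=1.0)
```

The third call shows that a good model fit alone does not enlarge the region: the step must also have reached the boundary.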

7 Constrained Optimization
7.1 Theory of Constrained Optimization, [Nocedal and Wright, 1999] Ch 12
Under the constrained optimization formalism, we would like to minimize a function $f(x)$ such that the
solution satisfies physically meaningful constraints, $c_i$, that restrict the parameter space.
\[ \min_{x \in \mathbb{R}^n} f(x) \quad \text{such that} \quad \begin{cases} c_i(x) = 0, & i \in E \\ c_i(x) \ge 0, & i \in I \end{cases} \tag{22} \]

Here we will assume that all functions are sufficiently smooth such that all needed derivatives are well defined in
the classical sense. $E$ and $I$ are two finite sets of indices representing the equality and inequality constraints
respectively.
(Definition) Equality Constraints $E$ denotes the finite set of indices of the equality constraints, for
which equality holds.
\[ c_i(x) = 0 \quad i \in E \]
(Definition) Inequality Constraints $I$ denotes the finite set of indices of the inequality constraints, for
which the inequality holds.
\[ c_i(x) \ge 0 \quad i \in I \]
(Definition) Feasible Set The feasible set, $\Omega$, represents the set of points $x$ that satisfy the constraints

\[ \Omega = \{x : c_i(x) = 0,\; i \in E;\; c_i(x) \ge 0,\; i \in I\} \]

Figure 53: Constraint and Function Gradients at Various Feasible Points

Using the feasible set notation, the constrained optimization formalism (22) may be represented concisely.

\[ \min_{x \in \Omega} f(x) \]
(Definition) Active/Inactive constraint At a feasible point $x$, the inequality constraint $i \in I$ is said to
be active if $c_i(x) = 0$ and inactive if the strict inequality $c_i(x) > 0$ is satisfied.
(Definition) Active Set The active set $A(x)$ at any feasible $x$ is the union of the set $E$ with the indices
of the active inequality constraints.

\[ A(x) \equiv E \cup \{i \in I : c_i(x) = 0\} \]

As before, first order necessary conditions will characterize a solution to the constrained optimization
problem.
(Definition) Lagrangian A solution to the constrained optimization problem may be characterized through
the so-called Lagrangian of the problem.
\[ L(x, \lambda) \equiv f(x) - \sum_{i \in E \cup I} \lambda_i c_i(x) = f(x) - (\vec{\lambda}, \vec{c}) \tag{23} \]

We will state the first order necessary conditions defining a solution to the constrained optimization problem
without proof and look at several simple examples to build intuition.
Theorem 7.1 (First-Order Necessary Conditions). Suppose that $x^*$ is a local solution of (22) and that the
linear independence constraint qualification (LICQ) holds at $x^*$,

\[ \{\nabla c_i(x^*),\; i \in A(x^*)\} \text{ are linearly independent} \]

Then there exists a Lagrange multiplier vector $\lambda^*$ with components $\lambda_i^*$, $i \in E \cup I$ such that

\[ \begin{aligned}
\nabla_x L(x^*, \lambda^*) &= 0 \\
c_i(x^*) &= 0, \quad \forall i \in E \\
c_i(x^*) &\ge 0, \quad \forall i \in I \\
\lambda_i^* &\ge 0, \quad \forall i \in I \\
\lambda_i^* c_i(x^*) &= 0, \quad \forall i \in I \cup E
\end{aligned} \]

Proof. See [Nocedal and Wright, 1999] Section 12.3


(Definition) Complementary Condition The complementary condition implies that the Lagrange
multipliers can be strictly positive only when the corresponding constraints are active.
\[ \lambda^\top c(x) = 0 \]
The complementarity condition implies that the Lagrange multipliers corresponding to inactive inequality
constraints are zero. Thus we can omit the terms for indices $i \notin A(x^*)$ and rewrite the stationarity
condition as
\[ 0 = \nabla_x L(x^*, \lambda^*) = \nabla f(x^*) - \sum_{i \in A(x^*)} \lambda_i^* \nabla c_i(x^*) \]
(Definition) KKT Conditions The first order necessary conditions of constrained optimization,
Theorem 7.1, are commonly known as the Karush-Kuhn-Tucker conditions, or KKT conditions.
Let's begin our study of constrained optimization with a single equality constraint problem.
Example 66 (Single Equality Constraint). Consider the single equality constraint
\[ \min\; x_1 + x_2 \quad \text{such that} \quad 2 - x_1^2 - x_2^2 = 0 \]
• In the language of (22), we have $f(x) = x_1 + x_2$, $I = \emptyset$, $E = \{1\}$, and $c_1(x) = 2 - x_1^2 - x_2^2$

• By inspection, the feasible set is the circle of radius $\sqrt{2}$ centered at the origin. The solution is $(-1, -1)$
• From any other point on the circle, it is easy to find a direction that stays feasible while decreasing the
objective function $f$. For instance, a clockwise direction from the point $(\sqrt{2}, 0)$ has the desired effect.
• Notice that at the solution, the constraint normal $\nabla c_1$ is parallel to the objective function gradient with
proportionality constant $\lambda_1^*$
\[ \nabla f(x^*) = \lambda_1^* \nabla c_1(x^*) \]
In fact, the proportionality of the objective function gradient and constraint gradient may be obtained
from a Taylor series approximation. To retain feasibility with respect to the constraint $c_1(x) = 0$ we require
that $c_1(x + d) = 0$
\[ 0 = c_1(x + d) \approx \underbrace{c_1(x)}_{c_1(x) = 0} + \nabla c_1^\top(x) d = \nabla c_1^\top(x) d \]

Hence the direction $d$ retains feasibility with respect to the constraint $c_1$ when it satisfies
\[ \nabla c_1^\top(x) d = (\nabla c_1, d) = 0 \tag{24} \]
Similarly, a direction of improvement must produce a decrease in $f$, so that
\[ 0 > f(x + d) - f(x) \approx \nabla f^\top(x) d \]
or, to first order, as before
\[ \nabla f^\top(x) d = (\nabla f, d) < 0 \tag{25} \]
It follows that a necessary condition is that there exists no direction $d$ that satisfies both (24) and
(25). By inspection, the only way such a direction cannot exist is if the two gradients are parallel, which
at the solution $x = (-1, -1)$ reads
\[ \nabla f(x) = \lambda_1 \nabla c_1(x) \;\Rightarrow\; \underbrace{\begin{pmatrix} 1 \\ 1 \end{pmatrix}}_{\nabla f(x)} = \underbrace{\frac{1}{2}}_{\lambda_1} \begin{pmatrix} -2x_1 \\ -2x_2 \end{pmatrix} = \frac{1}{2} \underbrace{\begin{pmatrix} 2 \\ 2 \end{pmatrix}}_{\nabla c_1(x)} \]

When the gradients are parallel, any candidate direction fails one of the two requirements:
\[ 0 > (\nabla f, d) = \lambda_1 (\nabla c_1, d) \;\Rightarrow\; (\nabla c_1, d) \ne 0 \quad \text{constraint not satisfied} \]

\[ 0 = \lambda_1 (\nabla c_1, d) = (\nabla f, d) \;\Rightarrow\; (\nabla f, d) = 0 \quad \text{not a descent direction} \]
By introducing the so-called Lagrangian function
\[ L(x, \lambda_1) = f(x) - \lambda_1 c_1(x) \]
and noticing that $L_x(x, \lambda_1) = \nabla f(x) - \lambda_1 \nabla c_1(x)$, we observe that our necessary condition may be expressed
concisely as a stationary point of the Lagrangian function.
\[ L_x(x, \lambda_1) = 0 \]
Note however, that this is a necessary and not sufficient condition. In the example above, this condition is
also satisfied at $(1, 1)$ with $\lambda_1 = -1/2$, but this is in fact a maximum.
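The stationarity claim for Example 66 is easy to check numerically. The following sketch evaluates the gradient of the Lagrangian at both candidate points and then scans the circle to see which one is the minimizer; the helper names and the 3600-point scan are illustrative choices, not part of the notes.

```python
import math

def f(x):
    return x[0] + x[1]

def c1(x):
    return 2.0 - x[0] ** 2 - x[1] ** 2

def grad_lagrangian(x, lam):
    """Gradient of L = f - lam * c1 with respect to x."""
    return (1.0 + 2.0 * lam * x[0], 1.0 + 2.0 * lam * x[1])

# Both points are feasible and make the Lagrangian stationary.
grad_at_min = grad_lagrangian((-1.0, -1.0), 0.5)
grad_at_max = grad_lagrangian((1.0, 1.0), -0.5)

# Scan the feasible circle of radius sqrt(2) to tell the two points apart.
radius = math.sqrt(2.0)
values = [f((radius * math.cos(t), radius * math.sin(t)))
          for t in (2.0 * math.pi * k / 3600 for k in range(3600))]
f_min, f_max = min(values), max(values)
print(grad_at_min, grad_at_max, f_min, f_max)
```

Both gradients vanish, yet the scan shows that $f$ attains its smallest value near $f(-1,-1) = -2$ and its largest near $f(1,1) = 2$, which is why stationarity alone cannot certify a minimum.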

Example 67 (Single Inequality Constraint). Now consider the inequality constrained problem.

\[ \min\; x_1 + x_2 \quad \text{such that} \quad 2 - x_1^2 - x_2^2 \ge 0 \]

Here the feasible region consists of the circle of radius $\sqrt{2}$ and its interior. Again the solution is $(-1, -1)$,
and $\nabla f(-1, -1) = \lambda_1 \nabla c_1(-1, -1)$ for $\lambda_1 = 1/2$.
As before, a feasible point is not optimal if we can find a direction $d$ that decreases the objective function
and retains feasibility. The requirement on the objective function is the same: we require $\nabla f^\top(x) d < 0$ to improve
the objective function. For the constraint, the direction retains feasibility if

\[ 0 \le c_1(x + d) \approx c_1(x) + \nabla c_1^\top(x) d \]

so to first order, feasibility is retained if

\[ 0 \le c_1(x) + \nabla c_1^\top(x) d \]

Figure 54: Improvement Directions at Active and Inactive Constraints

In determining if such a direction exists we will consider the case where we are at a point inside the
feasible set and on the boundary of the feasible set.

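Case I: Consider the case in which $x$ lies strictly within the circle, $c_1(x) > 0$. Whenever $\nabla f(x) \ne 0$ we can
obtain a direction $d$ that decreases the objective function and retains feasibility by choosing
\[ d = -\frac{c_1(x)}{\|\nabla c_1(x)\| \|\nabla f(x)\|} \nabla f(x) \]
for which the first order feasibility condition becomes
\[ 0 \le c_1(x) + \left(\nabla c_1(x), \frac{-c_1(x)}{\|\nabla c_1(x)\| \|\nabla f(x)\|} \nabla f(x)\right) = c_1(x) \left(1 - \left(\frac{\nabla c_1(x)}{\|\nabla c_1(x)\|}, \frac{\nabla f(x)}{\|\nabla f(x)\|}\right)\right) \]

Using the Cauchy-Schwarz inequality the constraint is satisfied

\[ \left(\frac{\nabla c_1(x)}{\|\nabla c_1(x)\|}, \frac{\nabla f(x)}{\|\nabla f(x)\|}\right) \le \left\|\frac{\nabla c_1(x)}{\|\nabla c_1(x)\|}\right\| \left\|\frac{\nabla f(x)}{\|\nabla f(x)\|}\right\| = 1 \]

and the only situation when such a direction fails to exist is when

\[ \nabla f(x) = 0 = \lambda_1 \nabla c_1(x) \;\Rightarrow\; \lambda_1 = 0 \]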

Case II: Consider the case in which $x$ lies on the boundary of the circle, $c_1(x) = 0$. The conditions for
improvement become
\[ \nabla f^\top(x) d < 0 \qquad \nabla c_1^\top(x) d \ge 0 \]

The first condition is the open half space characterized by the direction of the objective function gradient.
The direction of the inequality sign places the search direction on the side of the negative gradient. The second
condition is the closed half space characterized by the constraint gradient. The direction of the inequality
places the search direction on the side of the positive constraint gradient. Hence these two regions fail to
intersect when the gradient and constraint gradient point in the same direction.

\[ \nabla f(x) = \lambda_1 \nabla c_1(x) \qquad \lambda_1 \ge 0 \]

\[ 0 > (\nabla f, d) = \lambda_1 (\nabla c_1, d) \;\Rightarrow\; \lambda_1 (\nabla c_1, d) < 0 \quad \text{constraint not satisfied} \]

\[ 0 \le \lambda_1 (\nabla c_1, d) = (\nabla f, d) \;\Rightarrow\; (\nabla f, d) \ge 0 \quad \text{not a descent direction} \]
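Example 68 (Single Inequality Constraint). Recall the inequality constrained problem.

\[ \min\; x_1 + x_2 \quad \text{such that} \quad 2 - x_1^2 - x_2^2 \ge 0 \]

The solution is $(-1, -1)$.
\[ \nabla f = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \qquad \nabla c_1 = \begin{pmatrix} -2x_1 \\ -2x_2 \end{pmatrix} \]
\[ \nabla f(-1, -1) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 2 \\ 2 \end{pmatrix} = \lambda_1 \nabla c_1(-1, -1) \]
Note the sign of the multiplier is significant here. If instead $\nabla f = -\lambda_1 \nabla c_1$ with $\lambda_1 > 0$, the
feasible descent directions would make up an entire half plane, ie

\[ 0 > (\nabla f, d) = (-\lambda_1 \nabla c_1, d) \;\Rightarrow\; -\lambda_1 (\nabla c_1, d) < 0 \;\Rightarrow\; (\nabla c_1, d) > 0 \quad \text{constraint satisfied} \]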

Both cases can be concisely expressed with reference to the Lagrangian

∇x L(x, λ1 ) = 0 λ1 ≥ 0

where we also require the so called complementary condition

λ1 c1 (x) = 0

Notice that we are using the Lagrangian to concisely and conveniently represent the requirements on our
solutions for multiple cases.
For Case I, c1 (x) > 0 so this requires that λ∗1 = 0 and the gradient of the Lagrangian reduces to the
gradient of the objective function. For Case II, λ1 is allowed to take a non negative value.
The examples suggest that several conditions are important to characterizing a solution to our constrained
optimization problem (22). These include (1) the stationarity relation $\nabla_x L(x, \lambda) = 0$, (2) the non-negativity of the
multipliers $\lambda_i \ge 0$ for the inequality constraints $i \in I$, and (3) the complementary condition $\lambda_i c_i(x) = 0$.
Example 69 (Two Inequality Constraints). Now consider the inequality constrained problem.

\[ \min\; x_1 + x_2 \quad \text{such that} \quad x_1^2 + x_2^2 \le 2 \qquad x_2 \ge 0 \]

so that $c_1(x) = 2 - x_1^2 - x_2^2$ and $c_2(x) = x_2$.
Repeating the arguments for the previous examples, we conclude that a direction $d$ is a feasible descent
direction, to first order, if it satisfies the following conditions:

\[ \nabla c_i(x)^\top d \ge 0 \quad i \in I = \{1, 2\} \qquad \nabla f(x)^\top d < 0 \tag{26} \]

It is clear from Figure 55, that no such direction can exist when $x = (-\sqrt{2}, 0)^\top$. The conditions $\nabla c_i(x)^\top d \ge 0$,
$i = 1, 2$ are both satisfied only if $d$ lies in the quadrant defined by $\nabla c_1(x)$ and $\nabla c_2(x)$, but it is clear that
all vectors $d$ in this quadrant satisfy $\nabla f(x)^\top d \ge 0$.
Defining the Lagrangian for this problem we have

\[ L(x, \lambda) = f(x) - \lambda_1 c_1(x) - \lambda_2 c_2(x) \qquad \lambda \equiv (\lambda_1, \lambda_2)^\top \]

Here $\lambda$ is the vector of multipliers. Extending the derivative of the Lagrangian in this case we have

\[ \nabla_x L(x^*, \lambda^*) = 0, \qquad \lambda^* \ge 0 \]

$\lambda^* \ge 0$ means all components are non-negative. Complementary conditions imply

\[ \lambda_1^* c_1(x^*) = 0 \qquad \lambda_2^* c_2(x^*) = 0 \]

Figure 55: Gradient at a solution and non optimal point. Here the feasible region consists of the upper
interior of the circle of radius $\sqrt{2}$. (a) Here the solution is $(-\sqrt{2}, 0)^\top$, a point at which both constraints are
active. (b) For the point $(\sqrt{2}, 0)^\top$, we have both constraints active. However, the objective gradient $\nabla f(x)$
no longer lies in the quadrant defined by the conditions $\nabla c_i(x)^\top d \ge 0$, $i = 1, 2$. One first order feasible
descent direction from this point, a vector $d$ that satisfies (26), is simply $(-1, 0)^\top$; there are many others.
For this value of $x$ it is easy to verify that the condition $\nabla_x L(x, \lambda) = 0$ is satisfied when $\lambda = (-\frac{1}{2\sqrt{2}}, 1)$. The
first component $\lambda_1$ is negative but we require non-negative multipliers.


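At the solution $x^* = (-\sqrt{2}, 0)^\top$, we have
\[ \nabla f(x^*) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \qquad \nabla c_1(x^*) = \begin{pmatrix} 2\sqrt{2} \\ 0 \end{pmatrix} \qquad \nabla c_2(x^*) = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \]

We can verify that $\nabla_x L(x^*, \lambda^*) = 0$ when we select $\lambda^*$ as

\[ \lambda^* = \begin{pmatrix} \frac{1}{2\sqrt{2}} \\ 1 \end{pmatrix} \]

Note both components are positive.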


Finally consider the point $x = (1, 0)^\top$. Here only the second constraint, $c_2$, is active. At this point,
linearization of $f$ and $c$ as before gives the following conditions for $d$ to be a feasible descent direction.

\[ 1 + \nabla c_1(x)^\top d \ge 0 \qquad \nabla c_2(x)^\top d \ge 0 \qquad \nabla f(x)^\top d < 0 \tag{27} \]

In fact, we need worry only about satisfying the second and third conditions, since we can always satisfy the
first condition by multiplying $d$ by a sufficiently small positive quantity. Noting that
\[ \nabla f(x) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \qquad \nabla c_2(x) = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \]

it is easy to verify that the vector $d = (-1/2, 1/4)^\top$ satisfies (27) and is therefore a descent direction. To show
that the optimality conditions fail, ie that $\nabla_x L \ne 0$ for every admissible multiplier, we first note
that since $c_1(x) > 0$ the complementary condition requires $\lambda_1 = 0$. Therefore in trying to satisfy $\nabla_x L = 0$ we are left to search for a
value $\lambda_2$ such that $\nabla f(x) - \lambda_2 \nabla c_2(x) = 0$. Since no such $\lambda_2$ exists, this point fails to satisfy the optimality
conditions.
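The multiplier calculations of Example 69 can be reproduced with a short script. This is only a sketch: the function name `multipliers_both_active` is made up for the illustration, and it solves the 2x2 system $\nabla f = \lambda_1 \nabla c_1 + \lambda_2 \nabla c_2$ by Cramer's rule, which applies at the two points where both constraints are active.

```python
import math

def multipliers_both_active(x):
    """Solve grad f = lam1 * grad c1 + lam2 * grad c2 where both constraints are active."""
    grad_f = (1.0, 1.0)
    grad_c1 = (-2.0 * x[0], -2.0 * x[1])   # c1(x) = 2 - x1^2 - x2^2
    grad_c2 = (0.0, 1.0)                   # c2(x) = x2
    det = grad_c1[0] * grad_c2[1] - grad_c2[0] * grad_c1[1]
    lam1 = (grad_f[0] * grad_c2[1] - grad_c2[0] * grad_f[1]) / det
    lam2 = (grad_c1[0] * grad_f[1] - grad_f[0] * grad_c1[1]) / det
    return lam1, lam2

lam_solution = multipliers_both_active((-math.sqrt(2.0), 0.0))   # both multipliers positive
lam_other = multipliers_both_active((math.sqrt(2.0), 0.0))       # first multiplier negative
print(lam_solution, lam_other)

# At x = (1, 0) only c2 is active, so lam1 = 0 and the first component of
# grad f - lam2 * grad c2 equals 1 for every lam2: no valid multiplier exists.
```

Only the point $(-\sqrt{2}, 0)^\top$ produces non-negative multipliers, in agreement with the discussion above.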

(Definition) LICQ Given the point $x^*$ and the active set $A(x^*)$ we say that the linear independence
constraint qualification (LICQ) holds if the set of active constraint gradients

\[ \{\nabla c_i(x^*),\; i \in A(x^*)\} \]

is linearly independent.

It is possible for the gradient of the constraint to vanish as a result of the algebraic representation of the
constraint. Restrictions are typically applied to the constraints to avoid degenerate behavior. For example,
if we replaced our circle constraint by the equivalent
\[ c_1(x) = \left(x_1^2 + x_2^2 - 2\right)^2 = 0 \]
we would have $\nabla c_1(x) = 0$ for all feasible points and the condition $\nabla f = \lambda \nabla c_1$ no longer holds at
the optimal point $(-1, -1)$. To avoid this we typically require that the constraint gradients be linearly
independent at the solution. By definition of linear independence the active constraint gradients
cannot be zero.

Sensitivity At this point the Lagrange multipliers are more of a mathematical convenience. However, the
value of the Lagrange multiplier, $\lambda_i$, can provide information as to the sensitivity of the optimal value
$f(x^*)$ to the presence of the constraint, $c_i$.
• For an inactive constraint $i \notin A(x^*)$ such that $c_i(x^*) > 0$, the solution $x^*$ and function value $f(x^*)$
are indifferent to the constraint $c_i$. Hence, $\lambda_i = 0$.
• Suppose instead that the constraint $i$ is active and is perturbed by a small $\epsilon$, so that we require
\[ c_i(x) \ge -\epsilon \|\nabla c_i(x^*)\| \quad \text{instead of} \quad c_i(x) \ge 0 \]
Then the perturbed solution $x^*(\epsilon)$ yields a change in the optimal value proportional to the multiplier $\lambda_i^*$
\[ \frac{d f(x^*(\epsilon))}{d\epsilon} = -\lambda_i^* \|\nabla c_i(x^*)\| \]
Hence, if $\lambda_i^* \|\nabla c_i(x^*)\|$ is large, the optimal value is very sensitive to the i-th
constraint and the function value at the solution depends heavily on the constraint.

7.2 Gradient Projection Method, [Nocedal and Wright, 1999] Ch 16.6

\[ P(x, l, u)_i = \begin{cases} l_i & x_i < l_i \\ x_i & x_i \in [l_i, u_i] \\ u_i & x_i > u_i \end{cases} \qquad x(\alpha) = P(x_0 - \alpha d, l, u) \]
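The projection operator above clips each component to its bounds. A minimal sketch, with the function names `project` and `projected_step` chosen for this illustration and plain Python lists standing in for vectors:

```python
def project(x, lower, upper):
    """Componentwise projection P(x, l, u) onto the box [l, u]."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, lower, upper)]

def projected_step(x0, d, alpha, lower, upper):
    """Projected point P(x0 - alpha * d, l, u) along the direction d."""
    trial = [xi - alpha * di for xi, di in zip(x0, d)]
    return project(trial, lower, upper)

# The unprojected trial point (-0.5, 1.5) leaves the unit box and is clipped back.
x_new = projected_step([0.5, 0.5], [1.0, -1.0], 1.0, [0.0, 0.0], [1.0, 1.0])
print(x_new)
```

As the step length grows, components that reach a bound stay there, so the path traced by the projected point is piecewise linear.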

7.3 Quadratic Penalty Method, [Nocedal and Wright, 1999] Ch 17.1


A fundamental approach to constrained optimization is to replace the original problem by a penalty function.
The simplest penalty of this type is the quadratic penalty function. Here we penalize by the square of the
constraint violations. Consider the equality constrained problem
\[ \min_x f(x) \quad \text{subject to} \quad c_i(x) = 0 \quad i \in E \]

The quadratic penalty function, $Q(x; \mu)$, for this formulation is

\[ Q(x; \mu) \equiv f(x) + \frac{1}{2\mu} \sum_{i \in E} c_i^2(x) \]

Here the penalty parameter is positive, $\mu > 0$, and by driving the penalty parameter to zero we penalize
the constraint violations more severely. Intuitively, we would like to consider a sequence of penalty parameters that
increasingly penalize the constraints
\[ \mu_k \to 0 \]
The general framework for this solution technique is given in Algorithm 5.
Theorem 7.2 (Convergence of Quadratic Penalty Function). If the tolerances in Algorithm 5 satisfy
\[ \tau_k \to 0 \]
and the penalty parameters $\mu_k \to 0$, then for all limit points $x^*$ of the sequence $x_k$ at which the constraint
gradients $\nabla c_i(x^*)$ are linearly independent, we have that $x^*$ is a KKT point for the problem
\[ \min_x f(x) \quad \text{subject to} \quad c_i(x) = 0 \quad i \in E \]

For such points, we have for the infinite subsequence $K$ such that $\lim_{k \in K} x_k = x^*$
\[ \lim_{k \in K} \frac{-c_i(x_k)}{\mu_k} = \lambda_i^* \quad \forall i \in E \tag{28} \]
where $\lambda^*$ is a multiplier vector that satisfies the KKT conditions of Theorem 7.1.

Algorithm 5 Quadratic Penalty
Require: $\mu_0 > 0$, tolerance $\tau_0 > 0$, and initial guess $x_0^s$
for k = 0, 1, 2, ... do
    Find an approximate minimizer $x_k$ of $Q(\cdot; \mu_k)$ starting at $x_k^s$, terminate when $\|\nabla Q(x; \mu_k)\| \le \tau_k$
    if final convergence test satisfied then
        STOP with approximate solution $x_k$
    else
        Choose new penalty parameter $\mu_{k+1} \in (0, \mu_k)$
        Choose new subproblem convergence tolerance $\tau_{k+1} \in (0, \tau_k)$
        Choose new starting point $x_{k+1}^s$
    end if
end for

Here we see that Algorithm 5 is in fact attracted to a KKT point, ie satisfies necessary conditions.
Further, the quantities $-c_i(x_k)/\mu_k$ may be used as estimates of the Lagrange multipliers $\lambda_i^*$ under certain
conditions. Notice that this implies that as $\mu_k \to 0$, the constraint violation $c_i(x_k)$ is driven to zero.
Unfortunately, the Hessian of the quadratic penalty approach becomes increasingly ill conditioned as
$\mu_k \to 0$. The Hessian is given by
\[ \nabla^2_{xx} Q(x; \mu_k) = \nabla^2 f(x) + \sum_{i \in E} \frac{c_i(x)}{\mu_k} \nabla^2 c_i(x) + \frac{1}{\mu_k} A^\top(x) A(x) \qquad A^\top(x) \equiv [\nabla c_i(x)]_{i \in E} \]

Near a solution the matrix is approximately the sum of (1) the Lagrangian term $\nabla^2_{xx} L$ and (2) a matrix of
rank $|E|$ whose nonzero eigenvalues are of order $1/\mu_k$
\[ \nabla^2_{xx} Q(x; \mu_k) \approx \nabla^2_{xx} L(x, \lambda^*) + \frac{1}{\mu_k} A^\top(x) A(x) \]
Hence the overall matrix has some of its eigenvalues approaching a constant while others are of order $1/\mu_k$.
Since $\mu_k \to 0$, the increasing ill conditioning of $\nabla^2_{xx} Q(x; \mu_k)$ is apparent.
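The behaviour described above can be observed on the circle problem of Example 66. The sketch below follows the structure of Algorithm 5 under several assumptions that are not in the notes: the inner solver is a damped Newton method with a steepest descent fallback, both $\mu_k$ and $\tau_k$ are reduced by a factor of ten per outer iteration, and five outer iterations are run from the starting point $(-1.5, -1)$.

```python
import numpy as np

def c(x):                                   # constraint of Example 66
    return 2.0 - x @ x

def Q(x, mu):                               # quadratic penalty function
    return x.sum() + c(x) ** 2 / (2.0 * mu)

def grad_Q(x, mu):
    return np.ones(2) + (c(x) / mu) * (-2.0 * x)

def hess_Q(x, mu):
    a = -2.0 * x                            # gradient of c
    return (c(x) / mu) * (-2.0 * np.eye(2)) + np.outer(a, a) / mu

def minimize_newton(x, mu, tol, max_iter=200):
    """Illustrative inner solver: Newton step when the Hessian is positive definite."""
    for _ in range(max_iter):
        g = grad_Q(x, mu)
        if np.linalg.norm(g) <= tol:
            break
        H = hess_Q(x, mu)
        try:
            np.linalg.cholesky(H)           # raises if H is not positive definite
            step = -np.linalg.solve(H, g)
        except np.linalg.LinAlgError:
            step = -g                       # fall back to steepest descent
        t = 1.0                             # Armijo backtracking line search
        while t > 1e-12 and Q(x + t * step, mu) > Q(x, mu) + 1e-4 * t * (g @ step):
            t *= 0.5
        if t <= 1e-12:
            break                           # no further progress in floating point
        x = x + t * step
    return x

x, mu, tau = np.array([-1.5, -1.0]), 1.0, 1e-1
history = []
for k in range(5):
    x = minimize_newton(x, mu, tau)         # warm start from the previous minimizer
    history.append((mu, -c(x) / mu, np.linalg.cond(hess_Q(x, mu))))
    mu, tau = 0.1 * mu, 0.1 * tau
for row in history:
    print(row)
```

The printed rows show the multiplier estimate $-c(x_k)/\mu_k$ approaching $1/2$ while the condition number of the Hessian grows roughly like $1/\mu_k$.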

7.4 Augmented Lagrangian Formulation, [Nocedal and Wright, 1999] Ch 17


In the quadratic penalty methods we saw that the approximate minimizers do not quite satisfy the feasibility
conditions $c_i(x) = 0$, $i \in E$, and an approximate relationship between the penalty parameter, Lagrange
multiplier, and constraints may be obtained

\[ c_i(x_k) \approx -\mu_k \lambda_i^* \quad i \in E \]

As an alternative to letting the penalty parameter tend to zero, $\mu_k \to 0$, we can ask if we could more accurately
solve the constraints, $c_i(x) = 0$, and avoid any potential ill-conditioning problems for small values of the
penalty parameter.
The Augmented Lagrangian formulation achieves this by including an explicit estimate of the Lagrange
multipliers $\lambda$ based on the formula (28). By definition
\[ L_a(x, \lambda, \mu) \equiv f(x) - \sum_i \lambda_i c_i(x) + \frac{1}{2\mu} \sum_i c_i^2(x) \]

the augmented Lagrangian $L_a$ differs from the standard Lagrangian (23) by the presence of the squared
terms. And this approach differs from the quadratic penalty method by the presence of the summation
terms involving the multipliers, $\lambda$. As before, the min over the feasible set $\Omega \subset \mathbb{R}^N$ occurs at the stationary
point, $x^*$, of the Lagrangian
\[ \min_{x \in \Omega} f(x) \qquad \nabla_x L(x^*, \lambda^*) = \nabla f(x^*) - \sum_{i \in E} \lambda_i^* \nabla c_i(x^*) = 0 \]

In some sense we are essentially redefining our problem

\[ \min_{x \in \Omega} Q(x; \mu_k) \]
Applying the standard Lagrangian (23) to this problem, with $\lambda^k$ being a particular Lagrange multiplier
for the $\mu_k$ subproblem, the augmented Lagrangian appears and is stationary at a possibly different point
$x_k \ne x^*$
\[ \begin{aligned}
0 = \nabla_x L_a(x_k, \lambda^k, \mu_k) &= \nabla Q(x_k; \mu_k) - \sum_{i \in E} \lambda_i^k \nabla c_i(x_k) = \nabla f(x_k) + \sum_{i \in E} \frac{c_i(x_k)}{\mu_k} \nabla c_i(x_k) - \sum_{i \in E} \lambda_i^k \nabla c_i(x_k) \\
&= \nabla f(x_k) - \sum_{i \in E} \underbrace{\left(\lambda_i^k - \frac{c_i(x_k)}{\mu_k}\right)}_{\approx \lambda_i^*} \nabla c_i(x_k)
\end{aligned} \]

Reinterpreting Theorem 7.2 with $\hat{f}_k(x) \equiv f(x) - \sum_i \lambda_i^k c_i(x)$

\[ \min_x \hat{f}_k(x) \quad \text{subject to} \quad c_i(x) = 0 \quad i \in E \]

we can infer that at an approximate solution $x_k$

\[ \lambda_i^* \approx \lambda_i^k - c_i(x_k)/\mu_k \]

Rearranging,
\[ c_i(x_k) \approx -\mu_k (\lambda_i^* - \lambda_i^k) \to 0 \quad \text{as} \quad (\lambda_i^* - \lambda_i^k) \to 0 \quad \forall i \in E \]
we see that when the approximation to the Lagrange multiplier is close to the actual multiplier $\lambda^*$, the infeasibility
in $x_k$ will be much smaller than $\mu_k$.
Further, the Lagrange multipliers, $\lambda^k$, particular to the augmented function or subproblem,
$Q(x, \mu_k)$, provide an explicit estimate for the multipliers of the original problem, $\lambda^*$. Equation
(28) suggests an update algorithm for the multipliers based on the current information

\[ \lambda_i^{k+1} = \lambda_i^k - c_i(x_k)/\mu_k \quad i \in E \]

Algorithm 6 Augmented Lagrangian Approach
Require: $\mu_0 > 0$, tolerances $\tau_0 > 0$ and $\eta_0 > 0$, and initial guess $x_0^s$ and $\lambda^0$
for k = 0, 1, 2, ... do
    SubProblem: Find an approximate minimizer $x_k$ of $L_a(\cdot, \lambda^k; \mu_k)$ starting at $x_k^s$
    if test convergence satisfied: $\|\nabla L_a(x_k, \lambda^k; \mu_k)\| \le \tau^*$ and $\|c(x_k)\| \le \eta^*$ then
        STOP with approximate solution $x_k$
    else if the constraints are satisfied within tolerance $\|c(x_k)\| \le \eta_k$ then
        Decrease the tolerance $\eta_k$ to try to better solve the constraint
        \[ \eta_{k+1} \in (0, \eta_k) \]
        Update Lagrange multipliers
        \[ \lambda_i^{k+1} = \lambda_i^k - c_i(x_k)/\mu_k \]
    else
        Decrease the penalty parameter and weight the constraint more heavily in the optimization solve
        \[ \mu_{k+1} \in (0, \mu_k) \]
        Leave Lagrange multipliers the same
        \[ \lambda_i^{k+1} = \lambda_i^k \]
    end if
    Choose new subproblem convergence tolerance $\tau_{k+1} \in (0, \tau_k)$
    Set starting point for next iteration $x_{k+1}^s = x_k$
end for
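A simplified version of Algorithm 6 can be tried on the circle problem of Example 66. The sketch below makes several assumptions that are not in the notes: the penalty parameter is held fixed at $\mu = 0.1$, the multiplier is updated after every subproblem solve (the $\eta_k$ test is skipped), and the inner solver is a damped Newton method with a steepest descent fallback.

```python
import numpy as np

def c(x):                                   # constraint of Example 66
    return 2.0 - x @ x

def La(x, lam, mu):                         # augmented Lagrangian
    return x.sum() - lam * c(x) + c(x) ** 2 / (2.0 * mu)

def grad_La(x, lam, mu):
    return np.ones(2) - (lam - c(x) / mu) * (-2.0 * x)

def hess_La(x, lam, mu):
    a = -2.0 * x                            # gradient of c
    return -(lam - c(x) / mu) * (-2.0 * np.eye(2)) + np.outer(a, a) / mu

def minimize_newton(x, lam, mu, tol=1e-8, max_iter=200):
    """Illustrative inner solver for the subproblem in x."""
    for _ in range(max_iter):
        g = grad_La(x, lam, mu)
        if np.linalg.norm(g) <= tol:
            break
        H = hess_La(x, lam, mu)
        try:
            np.linalg.cholesky(H)           # raises if H is not positive definite
            step = -np.linalg.solve(H, g)
        except np.linalg.LinAlgError:
            step = -g                       # fall back to steepest descent
        t = 1.0                             # Armijo backtracking line search
        while t > 1e-12 and La(x + t * step, lam, mu) > La(x, lam, mu) + 1e-4 * t * (g @ step):
            t *= 0.5
        if t <= 1e-12:
            break                           # no further progress in floating point
        x = x + t * step
    return x

x, lam, mu = np.array([-1.5, -1.0]), 0.0, 0.1
for k in range(8):
    x = minimize_newton(x, lam, mu)         # subproblem, warm started
    lam = lam - c(x) / mu                   # multiplier update
print(x, lam, c(x))
```

Unlike the quadratic penalty iteration, the constraint violation goes to zero here even though $\mu$ is never reduced, so the subproblem Hessian stays moderately conditioned.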

7.5 Applications: L1 minimization
Compressed sensing literature has generated significant interest in solving L1 problems of the form

\[ \min_x \|\Phi x\|_1 \quad \text{such that} \quad H(x) = 0 \]

Here $\Phi x$ is a transformation into a space where the solution is sparse. $H : \mathbb{R}^n \to \mathbb{R}$ is a general nonlinear
constraint that is assumed twice differentiable and bounded below. The split Bregman formulation has received
much attention in these types of L1 formulations. For compressed sensing with linear constraints, the
split Bregman formulation [Yin et al., 2008] may be shown to be equivalent to the Augmented Lagrangian
Framework in Section 7.4.
Example 70 (Typical CS Example).

\[ \min_x \|x\|_1 \quad \text{such that} \quad Ax = b \qquad \Phi \equiv I \]

Initial approaches generally proposed to view this problem as an unconstrained optimization problem
with a penalty term accounting for the sparsity constraint. This is generally of the form:

\[ \min_x f(x) = H(x) + \lambda \|\Phi x\|_1 \]

These approaches attempt to regularize (smooth) the L1 penalty term by approximating it
with a differentiable function. Unfortunately, as the smooth approximation approaches the non-differentiable
problem the problem becomes ill-conditioned because the derivative is not defined in the limit. The resulting
ill conditioned Hessian matrix, as we have seen, can significantly affect the convergence properties of algorithms
used to solve the optimization problem.
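Example 71 (Condition Number of Penalty Method). The $\|\cdot\|_1$ norm may be approximately represented as
\[ \sum_i \sqrt{x_i^2 + \epsilon} \;\xrightarrow{\;\epsilon \to 0\;}\; \|x\|_1 \]
Letting $b(x) \equiv \sum_i \sqrt{x_i^2 + \epsilon}$ we see that the Hessian is a diagonal matrix with diagonal terms scaled by $\epsilon$
\[ \frac{\partial}{\partial x_j} \left(\sqrt{x_1^2 + \epsilon} + \sqrt{x_2^2 + \epsilon} + \dots\right) = \frac{1}{2} \left(x_j^2 + \epsilon\right)^{-1/2} (2 x_j) \]
\[ \frac{\partial^2 b}{\partial x_j^2} = \frac{\partial}{\partial x_j} \left[ x_j \left(x_j^2 + \epsilon\right)^{-1/2} \right] = \left(x_j^2 + \epsilon\right)^{-1/2} - \frac{x_j}{2} \left(x_j^2 + \epsilon\right)^{-3/2} (2 x_j) = \frac{\epsilon}{\left(x_j^2 + \epsilon\right)^{3/2}} \]

We expect our sparse solution to contain several sufficiently large non-zero $x_i$ elements (ie $|x_i| \gg \sqrt{\epsilon}$)

\[ x = (0, 0, ..., 0, x_\#, 0, 0, 0, x_\&, 0, 0, 0, ...) \]

We can approximate the corresponding diagonal elements as a positive number that becomes very small as $|x_i|$ increases.

\[ \nabla^2_{ii} b(x) \approx \frac{\epsilon}{|x_i|^3} \]
On the other hand, for the sparse solution we are looking for we expect a significant portion of the $x_i$ to equal
zero. The diagonal entries are relatively large in this case.

\[ x_i = 0 \;\Rightarrow\; \nabla^2_{ii} b(x) = \sqrt{\epsilon}^{\,-1} \]

The condition number of this diagonal matrix may thus be approximated as the maximum non-zero entry
cubed times a large number, $\epsilon^{-3/2}$.
\[ \|\nabla^2 b\|_1 = \max_i \left(\frac{\epsilon}{|x_i|^3}, \sqrt{\epsilon}^{\,-1}\right) \qquad \|(\nabla^2 b)^{-1}\|_1 = \max_i \left(\frac{|x_i|^3}{\epsilon}, \sqrt{\epsilon}\right) \]
\[ \kappa(\nabla^2 b) = \frac{\max_i |x_i|^3}{\epsilon^{3/2}} \]
Thus, ill-conditioning is expected for our sparse solutions and will lead to slow convergence of unconstrained
gradient and Newton algorithms as we have seen.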
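The condition number estimate of Example 71 can be checked directly from the diagonal Hessian. In this sketch the sparse test vector and the values of $\epsilon$ are arbitrary choices made for illustration.

```python
def smoothed_l1_hessian_diag(x, eps):
    """Diagonal of the Hessian of b(x) = sum_i sqrt(x_i^2 + eps)."""
    return [eps / (xi * xi + eps) ** 1.5 for xi in x]

x = [0.0, 0.0, 2.0, 0.0, -3.0]              # sparse vector with two non-zero entries
for eps in (1e-2, 1e-4, 1e-6):
    diag = smoothed_l1_hessian_diag(x, eps)
    cond = max(diag) / min(diag)            # condition number of a positive diagonal matrix
    estimate = max(abs(xi) for xi in x) ** 3 / eps ** 1.5
    print(eps, cond, estimate)
```

The computed condition number tracks the estimate $\max_i |x_i|^3 / \epsilon^{3/2}$ closely and grows without bound as the smoothing parameter is reduced.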

Alternatively, we may approximate the CS problem

\[ \min_x \|\Phi x\|_1 \quad \text{such that} \quad H(x) = Ax - b = 0 \]

as a sequence of unconstrained Quadratic Penalty problems with a non-smooth objective function

\[ \min_x \|\Phi x\|_1 + \frac{1}{2\mu} \|Ax - b\|_2^2 \]
For large $\mu$, the penalty function does not accurately enforce the constraint. As we have seen, typical
approaches increasingly impose the constraint by letting $\mu \to 0$. An alternative is to apply the Augmented
Lagrangian approach or the so-called Bregman iteration [Goldstein and Osher, 2009] wherein the penalty
constant is fixed and we more accurately solve the constraint at each iteration. These approaches
introduce an auxiliary variable $w \in \mathbb{R}^n$ and enforce the transformation to this variable.
\[ \min_{x, w} \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 \quad \text{such that} \quad \Phi x = w \]

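This may be interpreted in our previous notation as

\[ f(\hat{x}) = f(x, w) = \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 \qquad \begin{pmatrix} c_1 \\ c_2 \\ \vdots \end{pmatrix} = \Phi x - w \]

and the Augmented Lagrangian may be written as

\[ L_a(\underbrace{x, w}_{\hat{x}}, \lambda; \mu) = \underbrace{\frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1}_{f(\hat{x})} - \underbrace{(\lambda, \Phi x - w)}_{\sum_i \lambda_i c_i(\hat{x})} + \underbrace{\frac{1}{2\mu} \|\Phi x - w\|_2^2}_{\frac{1}{2\mu} \sum_i c_i^2(\hat{x})} \]

Within the scope of Algorithm 6, given $\lambda^k$ and $\mu_k$, the subproblem is still formidable and contains L1 and
L2 terms
\[ \min_{x, w} L_a(\cdot, \lambda^k; \mu_k) = \min_{x, w} \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 - (\lambda^k, \Phi x - w) + \frac{1}{2\mu} \|\Phi x - w\|_2^2 \]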
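The Lagrangian may be written in an equivalent form, using the linearity of the inner product

\[ \frac{1}{2c} (a - cb, a - cb) = \frac{1}{2c} \left[(a - cb, a) - (a - cb, cb)\right] = \frac{1}{2c} \|a\|^2 - \frac{1}{2c} c (b, a) - \frac{1}{2c} c (a, b) + \frac{c^2}{2c} \|b\|^2 = \frac{1}{2c} \|a\|^2 - (a, b) + \frac{c}{2} \|b\|^2 \]
\[ \Rightarrow \quad \frac{1}{2c} \|a - cb\|^2 - \frac{c}{2} \|b\|^2 = \frac{1}{2c} \|a\|^2 - (a, b) \]
With $a = \Phi x - w$, $b = \lambda^k$, and $c = \mu$, and minimizing with respect to $(x, w)$, the term $\frac{\mu}{2} \|\lambda^k\|_2^2$ may be considered a constant

\[ \min_{x, w} L_a(\cdot, \lambda^k; \mu_k) = \min_{x, w} \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x - w - \mu \lambda^k\|_2^2 - \frac{\mu}{2} \|\lambda^k\|_2^2 \]
\[ \Leftrightarrow \quad \min_{x, w} \frac{1}{2} \|Ax - b\|_2^2 + \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x - w - \mu \lambda^k\|_2^2 \]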
The approach becomes tractable if we break the subproblem up into an L1 subproblem and L2 subproblem
using an alternating direction or coordinate descent technique.
\[ \text{L2 subproblem } (w^k, \lambda^k \text{ fixed}) \qquad \min_x \frac{1}{2} \|Ax - b\|_2^2 + \frac{1}{2\mu} \|\Phi x - w^k - \mu \lambda^k\|_2^2 \]
\[ \text{L1 subproblem } (x^k, \lambda^k \text{ fixed}) \qquad \min_w \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x^k - w - \mu \lambda^k\|_2^2 \]

A summary of the Augmented Lagrangian Algorithm in this context is presented in Algorithm 7. Notice
that we have converted the constrained problem into two unconstrained problems.

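Algorithm 7 Augmented Lagrangian Approach For CS
\[ \min_{x, w} \frac{1}{2} \|H(x)\|_2^2 + \tau \|w\|_1 \quad \text{such that} \quad \Phi x = w \qquad H(x) = Ax - b \]

Require: $\mu > 0$, $\tau > 0$, tolerance $\epsilon > 0$, and initial guess $w^0 = 0$, and $\lambda^0 = 0$

while not converged, ie $\|\Phi x^k - w^k\| > \epsilon$ and $\|H(x^k)\|_2^2 > \epsilon$ do
    Solve L2 subproblem.
    \[ x^{k+1} = \arg\min_x \frac{1}{2} \|H(x)\|_2^2 + \frac{1}{2\mu} \|\Phi x - w^k - \mu \lambda^k\|_2^2 \]
    The derivative of this objective function for the subproblem $f_{L2sub}$ is given by
    \[ \nabla f_{L2sub} = A^\top (Ax - b) + \frac{1}{\mu} \Phi^\top \left(\Phi x - w^k - \mu \lambda^k\right) \]

    Solve L1 subproblem.
    \[ w^{k+1} = \arg\min_w \tau \|w\|_1 + \frac{1}{2\mu} \|\Phi x^{k+1} - w - \mu \lambda^k\|_2^2 \]
    \[ w_i^{k+1} = \mathrm{soft}_{\tau\mu}\left((\Phi x^{k+1})_i - \mu \lambda_i^k\right) \qquad \mathrm{soft}_a(b) \equiv \frac{b}{|b|} \max(0, |b| - a) \]

    Update Lagrange Multiplier.
    \[ \lambda^{k+1} = \lambda^k - \frac{1}{\mu} \left(\Phi x^{k+1} - w^{k+1}\right) \]
end while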
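The steps of Algorithm 7 translate almost line for line into NumPy when $H(x) = Ax - b$, because the L2 subproblem is then a linear solve. The sketch below is an illustration under assumptions: the function name `augmented_lagrangian_cs` is hypothetical, a fixed iteration count replaces the convergence test, and the small test problem uses $A = I$ and $\Phi = I$ so that the exact answer is known to be the soft threshold of $b$.

```python
import numpy as np

def soft(b, a):
    """Componentwise soft threshold: sign(b) * max(0, |b| - a)."""
    return np.sign(b) * np.maximum(0.0, np.abs(b) - a)

def augmented_lagrangian_cs(A, b, Phi, tau, mu, n_iter=200):
    """Sketch of Algorithm 7 with H(x) = A x - b and a fixed number of iterations."""
    w = np.zeros(Phi.shape[0])
    lam = np.zeros(Phi.shape[0])
    lhs = A.T @ A + Phi.T @ Phi / mu              # matrix of the L2 subproblem
    for _ in range(n_iter):
        rhs = A.T @ b + Phi.T @ (w + mu * lam) / mu
        x = np.linalg.solve(lhs, rhs)             # L2 subproblem: gradient set to zero
        w = soft(Phi @ x - mu * lam, tau * mu)    # L1 subproblem: soft thresholding
        lam = lam - (Phi @ x - w) / mu            # multiplier update
    return x, w, lam

A = np.eye(3)
b = np.array([3.0, -0.5, -2.0])
x, w, lam = augmented_lagrangian_cs(A, b, np.eye(3), tau=1.0, mu=1.0)
print(x, w)
```

For this test problem the iteration converges to $x = w = \mathrm{soft}_\tau(b)$, so the small entry of $b$ is set to zero and the large entries are shrunk by $\tau$.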

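Derivatives The L2 subproblem may be solved with a linesearch or trust region approach using finite
difference or analytical derivatives. For the general case

\[ \min_{\vec{x} \in \mathbb{R}^n} f(\vec{x}) = \min_{\vec{x} \in \mathbb{R}^n} \frac{1}{2} \vec{h}^\top(\vec{x}) \vec{h}(\vec{x}) \qquad \vec{h}(\vec{x}) = \begin{bmatrix} A\vec{x} - \vec{b} \\ \frac{1}{\sqrt{\mu}} \left(\Phi \vec{x} - \vec{w} - \mu \vec{\lambda}\right) \end{bmatrix} \]

The gradient of the objective function $f(\vec{x})$ is given as the matrix vector product of the Jacobian transpose
times the residual.
\[ (\nabla f)_i = \frac{\partial}{\partial x_i} \left(\frac{1}{2} \sum_l^m h_l h_l\right) = \sum_l^m h_l \frac{\partial h_l}{\partial x_i} \qquad \nabla f = J^\top \vec{h} \qquad J_{ij} = \frac{\partial h_i}{\partial x_j} \qquad J = \begin{bmatrix} A \\ \frac{1}{\sqrt{\mu}} \Phi \end{bmatrix} \]

Applying this formula to each term in our L2 subproblem yields

\[ \nabla \left(\frac{1}{2} \|H(x)\|_2^2 + \frac{1}{2\mu} \|\Phi x - w^k - \mu \lambda^k\|_2^2\right) = \begin{bmatrix} A \\ \frac{1}{\sqrt{\mu}} \Phi \end{bmatrix}^\top \begin{bmatrix} A\vec{x} - \vec{b} \\ \frac{1}{\sqrt{\mu}} \left(\Phi \vec{x} - \vec{w} - \mu \vec{\lambda}\right) \end{bmatrix} = A^\top (Ax - b) + \frac{1}{\mu} \Phi^\top \left(\Phi x - w^k - \mu \lambda^k\right) \]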

Soft Thresholding Operator The solution to the L1 subproblem is (perhaps surprisingly) now given by the
soft thresholding operator
\[ w_i = \mathrm{soft}_{\tau\mu}\left((\Phi x^k)_i - \mu \lambda_i^k\right) \qquad \mathrm{soft}_a(b) \equiv \frac{b}{|b|} \max(0, |b| - a) \]
To see this, we need to generalize our definition of the derivative for $\|\cdot\|_1$. Consider the subdifferential of
the L1 problem
\[ 0 \in \tau \frac{w_i}{|w_i|} - \frac{1}{\mu} \left((\Phi x^k)_i - w_i - \mu \lambda_i^k\right) \]
(Definition) Subdifferential We say a vector $g \in \mathbb{R}^n$ is a subgradient of $f : \mathbb{R}^n \to \mathbb{R}$ at $x \in \mathbb{R}^n$ if

\[ f(z) \ge f(x) + g^\top (z - x) \quad \forall z \]

Figure 56: Subdifferential. (a) If f is differentiable, then the gradient is the subgradient. However, the
subgradient may exist when the gradient does not exist. In fact there may be several subgradients at this
point. (b) Absolute value. Consider f (z) = |z|. For x < 0 the subgradient is unique: ∂f (x) = {−1}.
Similarly, for x > 0 we have ∂f (x) = {1}. At x = 0 the subdifferential is defined by the inequality |z| ≥ gz
for all z, which is satisfied if and only if g ∈ [−1, 1]. Therefore we have ∂f (0) = [−1, 1].

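The set of subgradients of $f$ at the point $x$ is called the subdifferential at $x$ and is denoted $\partial f(x)$.
\[ \partial f(x) \equiv \{ g : f(z) \ge f(x) + g^\top (z - x) \quad \forall z \} \]
To illustrate the solution to this equation consider
\[ 0 \in a\, \mathrm{sign}(x) + x - b \;\Leftrightarrow\; b \in a\, \mathrm{sign}(x) + x \qquad \mathrm{sign}(x) \equiv \begin{cases} -1 & x < 0 \\ [-1, 1] & x = 0 \\ 1 & x > 0 \end{cases} \]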

Figure 57: Solution to L1 subproblem. (Provided by W. Stefan)

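As seen in Figure 57, when does $b = x + a\, \mathrm{sign}(x)$?

Case I
\[ b > a \;\Rightarrow\; b = x + a \;\Rightarrow\; x = b - a \]

Case II
\[ |b| \le a \;\Rightarrow\; x = 0 \]

Case III
\[ b < -a \;\Rightarrow\; b = x - a \;\Rightarrow\; x = b + a \]

These three cases can be conveniently expressed by the so-called soft threshold operator. One may
directly verify that:
\[ x = \mathrm{soft}_a(b) = \frac{b}{|b|} \max(0, |b| - a) = \begin{cases} b + a & b < -a \\ 0 & |b| \le a \\ b - a & b > a \end{cases} \]

The solution to the L1 problem is given by a change of variables, $a = \tau\mu$ and $b = (\Phi x^k)_i - \mu \lambda_i^k$.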
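The three cases can be checked against a brute force minimization of the scalar problem $a|x| + \frac{1}{2}(x - b)^2$, whose optimality condition is $0 \in a\,\mathrm{sign}(x) + x - b$. The grid spacing and the test values of $b$ below are arbitrary choices for this sketch.

```python
def soft(b, a):
    """Soft threshold: the x solving 0 in a*sign(x) + x - b, for a >= 0."""
    if b > a:
        return b - a          # Case I
    if b < -a:
        return b + a          # Case III
    return 0.0                # Case II

def objective(x, b, a):
    """Scalar problem a*|x| + 0.5*(x - b)^2 whose minimizer is soft(b, a)."""
    return a * abs(x) + 0.5 * (x - b) ** 2

a = 1.0
grid = [k / 1000.0 for k in range(-5000, 5001)]
results = []
for b in (-3.0, -0.5, 0.0, 0.5, 3.0):
    brute_force = min(grid, key=lambda x: objective(x, b, a))
    results.append((b, soft(b, a), brute_force))
    print(b, soft(b, a), brute_force)
```

In each row the closed form and the grid search agree, which confirms that values with $|b| \le a$ are set to zero and larger values are shrunk toward zero by $a$.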

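Total Variation Denoising Alternative formulations may have computational savings at different steps
of the splitting. For example,
\[ \min_{x, w} \frac{1}{2} \|F(x)\|_2^2 + \tau \left(\|\Phi_1 w\|_1 + \|\Phi_2 w\|_1 + \|\Phi_3 w\|_1\right) \quad \text{such that} \quad x = w \]

This may be interpreted in our previous notation as

\[ f(\hat{x}) = f(x, w) = \frac{1}{2} \|F(x)\|_2^2 + \tau \left(\|\Phi_1 w\|_1 + \|\Phi_2 w\|_1 + \|\Phi_3 w\|_1\right) \qquad \begin{pmatrix} c_1 \\ c_2 \\ \vdots \end{pmatrix} = x - w \]
and the Augmented Lagrangian may be written as
\[ L_a(\underbrace{x, w}_{\hat{x}}, \lambda; \mu) = \underbrace{\frac{1}{2} \|F(x)\|_2^2 + \tau \left(\|\Phi_1 w\|_1 + \|\Phi_2 w\|_1 + \|\Phi_3 w\|_1\right)}_{f(\hat{x})} - \underbrace{(\lambda, x - w)}_{\sum_i \lambda_i c_i(\hat{x})} + \underbrace{\frac{1}{2\mu} \|x - w\|_2^2}_{\frac{1}{2\mu} \sum_i c_i^2(\hat{x})} \]

The resulting L1 and L2 subproblems are of the form

\[ \text{L2 subproblem } (w^k, \lambda^k \text{ fixed}) \qquad \min_x \frac{1}{2} \|F(x)\|_2^2 + \frac{1}{2\mu} \|x - w^k - \mu \lambda^k\|_2^2 \]
\[ \text{L1 subproblem } (x^k, \lambda^k \text{ fixed}) \qquad \min_w \tau \left(\|\Phi_1 w\|_1 + \|\Phi_2 w\|_1 + \|\Phi_3 w\|_1\right) + \frac{1}{2\mu} \|x^k - w - \mu \lambda^k\|_2^2 \]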

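A second splitting or splitting hierarchy is used to solve the L1 subproblem

\[ f(\hat{x}) = f(w, v) = \frac{1}{2\mu} \|w - \bar{b}^k\|_2^2 + \tau \left(\|v_1\|_1 + \|v_2\|_1 + \|v_3\|_1\right) \qquad \bar{b}^k = x^k - \mu \lambda^k \qquad \begin{pmatrix} c_1 \\ c_2 \\ \vdots \end{pmatrix} = \begin{pmatrix} \Phi_1 \\ \Phi_2 \\ \Phi_3 \end{pmatrix} w - \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} \]
The Augmented Lagrangian for this subproblem may be written using a second multiplier, $\bar{\lambda} = (\bar{\lambda}_1, \bar{\lambda}_2, \bar{\lambda}_3)^\top$.
\[ L_a(\underbrace{w, v}_{\hat{x}}, \bar{\lambda}; \mu) = \underbrace{\frac{1}{2\mu} \|w - \bar{b}^k\|_2^2 + \tau \left(\|v_1\|_1 + \|v_2\|_1 + \|v_3\|_1\right)}_{f(\hat{x})} - \underbrace{\left( \begin{pmatrix} \bar{\lambda}_1 \\ \bar{\lambda}_2 \\ \bar{\lambda}_3 \end{pmatrix}, \begin{pmatrix} \Phi_1 \\ \Phi_2 \\ \Phi_3 \end{pmatrix} w - \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} \right)}_{\sum_i \bar{\lambda}_i c_i(\hat{x})} + \underbrace{\frac{1}{2\mu} \left\| \begin{pmatrix} \Phi_1 \\ \Phi_2 \\ \Phi_3 \end{pmatrix} w - \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} \right\|_2^2}_{\frac{1}{2\mu} \sum_i c_i^2(\hat{x})} \]

The L2 sub-subproblem is equivalent to a least squares problem with identity operator. The normal equations are used
to identify the equivalent form.
\[ \text{L2 subproblem } (v_i^k, \bar{\lambda}_i^k \text{ fixed}) \qquad \min_w \frac{1}{2\mu} \|w - \bar{b}^k\|_2^2 + \frac{1}{2\mu} \sum_{i=1}^{3} \|\Phi_i w - v_i^k - \mu \bar{\lambda}_i^k\|_2^2 \]
\[ \text{L1 subproblem } (w^k, \bar{\lambda}_i^k \text{ fixed}) \qquad \min_v \tau \left(\|v_1\|_1 + \|v_2\|_1 + \|v_3\|_1\right) + \frac{1}{2\mu} \sum_{i=1}^{3} \|\Phi_i w^k - v_i - \mu \bar{\lambda}_i^k\|_2^2 \]
The L1 sub-subproblem(s) now have the solution given by the component-wise soft thresholding operator, since
the objective separates into independent terms and minimizing each term minimizes the sum.
\[ a^* \le a \;\; \forall a \qquad b^* \le b \;\; \forall b \qquad c^* \le c \;\; \forall c \quad \Rightarrow \quad a^* + b^* + c^* \le a + b + c \;\; \forall a, b, c \]
\[ \text{L1 subproblem } (w^k, \bar{\lambda}_i^k \text{ fixed}) \qquad \min_{v_i} \tau \|v_i\|_1 + \frac{1}{2\mu} \|\Phi_i w^k - v_i - \mu \bar{\lambda}_i^k\|_2^2 \]

Notice that this allows multiple soft thresholding steps in between the L2 solves. In fact, in the
limit of one threshold per L2 solve we arrive at the previous Algorithm 7.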

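Algorithm 8 Augmented Lagrangian Approach With Denoising Subproblem

\[
\min_{x,w} \; \frac{1}{2}\|H(x)\|_2^2 + \tau\|\Phi w\|_1 \quad \text{such that } x = w
\qquad
\Phi = \begin{bmatrix}\Phi_1\\ \Phi_2\\ \Phi_3\end{bmatrix}
\qquad
\Phi^\top = \begin{bmatrix}\Phi_1^\top & \Phi_2^\top & \Phi_3^\top\end{bmatrix}
\]

Require: $\mu > 0$, $\tau > 0$, tolerance $\epsilon > 0$, and initial guess $w^0 = 0$, and $\lambda^0 = 0$
while not converged, ie $\|x^k - w^k\| > \epsilon$ and $\|H(x^k)\|_2^2 > \epsilon$ do
    Solve L2 subproblem.
    \[
    x^{k+1} = \arg\min_x \; \frac{1}{2}\|H(x)\|_2^2 + \frac{1}{2\mu}\|x - w^k - \mu\lambda^k\|_2^2
    \]
    Solve L1 denoising subproblem.
    \[
    w^{k+1} = \arg\min_w \; \tau\|\Phi w\|_1 + \frac{1}{2\mu}\|w - \bar b^k\|_2^2 \qquad \bar b^k = x^k - \mu\lambda^k
    \]
    while not converged, ie $\|\Phi w^k - v^k\| > \epsilon$ and $\|w^k - \bar b^k\|_2^2 > \epsilon$, given initial guess $v_i^0 = 0$, and $\bar\lambda_i^0 = 0$ do
        Solve L1-L2 sub-subproblem.
        \[
        w^{k+1} = \arg\min_w \; \frac{1}{2\mu}\|w - \bar b^k\|_2^2 + \frac{1}{2\mu}\sum_i \|\Phi_i w - v_i^k - \mu\bar\lambda_i^k\|_2^2
        \quad \underbrace{\Leftrightarrow}_{\text{Normal Eqn}} \quad
        \Big( \sum_i \Phi_i^\top\Phi_i + I \Big) w^{k+1} = \sum_i \Phi_i^\top\left( v_i^k + \mu\bar\lambda_i^k \right) + \bar b^k
        \]
        Solve L1-L1 sub-subproblem.
        \[
        v_i^{k+1} = \arg\min_{v_i} \; \tau\|v_i\|_1 + \frac{1}{2\mu}\|\Phi_i w^{k+1} - v_i - \mu\bar\lambda_i^k\|_2^2
        = \text{soft}_{\tau\mu}\left( \Phi_i w^{k+1} - \mu\bar\lambda_i^k \right) \qquad i = 1, 2, 3
        \]
        \[
        \text{soft}_a(b) \equiv \frac{b}{|b|}\max(0, |b| - a)
        \]
        Update Lagrange Multiplier.
        \[
        \bar\lambda_i^{k+1} = \bar\lambda_i^k - \frac{1}{\mu}\left( \Phi_i w^{k+1} - v_i^{k+1} \right) \qquad i = 1, 2, 3
        \]
    end while
    Update Lagrange Multiplier.
    \[
    \lambda^{k+1} = \lambda^k - \frac{1}{\mu}\left( x^{k+1} - w^{k+1} \right)
    \]
end while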
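The structure of the outer loop can be sketched on a scalar toy problem. The sketch below rests on simplifying assumptions that are not part of the algorithm statement: $H(x) = x - d$ for a hypothetical datum $d$, $\Phi = 1$ so that the inner loop collapses to a single soft threshold, and the $w$-update uses the freshly computed $x$. Under these assumptions the iterates should approach $\text{soft}_\tau(d)$.

```python
def soft(b, a):
    """Scalar soft thresholding, sign(b) * max(0, |b| - a)."""
    if abs(b) <= a:
        return 0.0
    return b - a if b > 0 else b + a


def scalar_augmented_lagrangian(d, tau, mu, iterations=60):
    """Minimize 0.5*(x - d)**2 + tau*|w| subject to x = w.

    Assumes H(x) = x - d and Phi = 1 (illustrative choices only).
    """
    x, w, lam = 0.0, 0.0, 0.0
    for _ in range(iterations):
        # L2 subproblem: closed-form minimizer of
        # 0.5*(x - d)^2 + (x - w - mu*lam)^2 / (2*mu)
        x = (mu * d + w + mu * lam) / (mu + 1.0)
        # L1 subproblem with Phi = 1: a single soft threshold of x - mu*lam
        w = soft(x - mu * lam, tau * mu)
        # Multiplier update with the sign convention of Algorithm 8
        lam = lam - (x - w) / mu
    return x, w, lam


x_big, w_big, lam_big = scalar_augmented_lagrangian(d=3.0, tau=1.0, mu=1.0)
x_small, w_small, lam_small = scalar_augmented_lagrangian(d=0.5, tau=1.0, mu=1.0)
print(x_big, w_big, lam_big)
print(x_small, w_small, lam_small)
```

At convergence the constraint $x = w$ holds and the multiplier equals $x - d$, which is the optimality condition of the L2 subproblem for this toy choice of $H$.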

∗∗
7.6 Applications: Adjoint Method for Nonlinear Least Squares
Suppose we want to reconstruct an image from measurement data $u^0$
\[
\min_u f(u) = \min_u \frac{1}{2}\int_\Omega (u - u^0)^2 \, dx
\]
subject to some smoothness criteria
\[
k\Delta u = f \qquad k = k(\eta_1, \eta_2, ..., \eta_M)
\]
This may also be interpreted as a function image reconstruction for a concentration subject to a conservation based constraint. This infinite dimensional minimization problem is implicitly a function of the diffusion parameters, $\vec\eta$.
\[
\min_{u(\vec\eta)} \frac{1}{2}\int_\Omega (u(k) - u^0)^2 \, dx \qquad k(\vec\eta)\Delta u = f
\]

The first thing to realize is that this is a nonlinear least squares problem in disguise. Indeed, consider a finite dimensional basis $\{\phi_i, \; i = 1, ..., n\}$ in which we expand our solution and measurements.
\[
u(x) = \sum_i u_i\phi_i(x) \qquad u^0(x) = \sum_i u_i^0\phi_i(x)
\]
Directly substituting, the minimization function can be rewritten as a weighted norm.
\[
\int_\Omega \Big( \sum_i u_i(k)\phi_i(x) - \sum_i u_i^0\phi_i(x) \Big)^2 dx
= \int_\Omega \Big( \sum_i u_i(k)\phi_i(x) - \sum_i u_i^0\phi_i(x) \Big)\Big( \sum_j u_j(k)\phi_j(x) - \sum_j u_j^0\phi_j(x) \Big) dx
= r^\top M r = (r, r)_M = \|r\|_M^2
\]
Where the weighting terms and residual are defined as:
\[
r \equiv \begin{bmatrix} u_1 - u_1^0\\ u_2 - u_2^0\\ \vdots\\ u_n - u_n^0 \end{bmatrix}
\qquad
M \equiv \begin{bmatrix} \ddots & \vdots & \\ \cdots & \int_\Omega \phi_i(x)\phi_j(x)\,dx & \cdots \\ & \vdots & \ddots \end{bmatrix}
\]
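As a small check of the weighted norm, the sketch below assembles $M$ for one specific basis and evaluates $r^\top M r$. The choice of basis is an assumption made for illustration: piecewise-linear hat functions on a uniform grid of $[0, 1]$. For nodal residuals sampled from a linear function the result should match the exact integral of the squared residual.

```python
def p1_mass_matrix(n):
    """M_ij = integral of phi_i * phi_j for n hat functions on a uniform grid of [0, 1]."""
    h = 1.0 / (n - 1)
    M = [[0.0] * n for _ in range(n)]
    for e in range(n - 1):            # loop over the elements [x_e, x_{e+1}]
        M[e][e] += h / 3.0
        M[e + 1][e + 1] += h / 3.0
        M[e][e + 1] += h / 6.0
        M[e + 1][e] += h / 6.0
    return M


def weighted_norm_squared(r, M):
    """Return r^T M r."""
    n = len(r)
    return sum(r[i] * M[i][j] * r[j] for i in range(n) for j in range(n))


n_nodes = 5
M = p1_mass_matrix(n_nodes)
r_const = [1.0] * n_nodes                               # u - u0 identically one
r_linear = [i / (n_nodes - 1) for i in range(n_nodes)]  # u - u0 equal to x
print(weighted_norm_squared(r_const, M))    # approximates the integral of 1 over [0, 1]
print(weighted_norm_squared(r_linear, M))   # approximates the integral of x^2 over [0, 1]
```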

Using a finite difference or finite element formulation the smoothness constraint may be written as a linear system where the matrix depends on the diffusion coefficient, $k$
\[
\Delta u(x_i) \approx \frac{u(x_i + h) - 2u(x_i) + u(x_i - h)}{h^2}
\quad \Rightarrow \quad Au = b
\qquad
u = \begin{bmatrix} u_1\\ u_2\\ \vdots\\ u_n \end{bmatrix}
\qquad
b = \begin{bmatrix} f(x_1)\\ f(x_2)\\ \vdots\\ f(x_n) \end{bmatrix}
\]
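The finite difference system can be sketched for a one-dimensional model problem. The setting is assumed here for illustration and is not taken from the notes: $k u'' = f$ on $(0, 1)$ with $u(0) = u(1) = 0$, a constant coefficient $k$, and a tridiagonal solve without pivoting. For constant $f$ the exact solution is a quadratic, which the three-point stencil reproduces.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_diffusion(k, f, n):
    """Solve k u'' = f at n interior grid points with zero boundary values."""
    h = 1.0 / (n + 1)
    off = k / h ** 2                  # off-diagonal entry of A
    lower = [off] * n
    diag = [-2.0 * off] * n
    upper = [off] * n
    rhs = [f((i + 1) * h) for i in range(n)]
    return solve_tridiagonal(lower, diag, upper, rhs)


u_demo = solve_diffusion(k=2.0, f=lambda x: 1.0, n=3)
print(u_demo)   # compare with the exact solution x * (x - 1) / (2 * k)
```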

The derivative of the objective function with respect to $k$ does not have an analytical form this time.
\[
\frac{\partial}{\partial\eta_i}\left( \frac{1}{2}\int_\Omega (u(k) - u^0)^2 \, dx \right)
= \int_\Omega (u(k) - u^0)\frac{\partial u(\vec\eta)}{\partial\eta_i} \, dx
= \left( r, \frac{\partial u}{\partial\eta_i} \right)_M
\qquad
\frac{\partial u}{\partial\eta_i} \equiv \begin{bmatrix} \frac{\partial}{\partial\eta_i}u_1\\ \frac{\partial}{\partial\eta_i}u_2\\ \vdots\\ \frac{\partial}{\partial\eta_i}u_n \end{bmatrix}
\]
The gradient may be computed from the derivative of the constraint
\[
\frac{\partial}{\partial\eta_i}(Au = b)
\quad \Rightarrow \quad
\frac{\partial A}{\partial\eta_i}u + A\frac{\partial u}{\partial\eta_i} = 0
\quad \Rightarrow \quad
A\sum_i^M q_i\frac{\partial u}{\partial\eta_i} = -\sum_i^M q_i\frac{\partial A}{\partial\eta_i}u
\qquad
q = \begin{bmatrix} q_1\\ q_2\\ \vdots\\ q_M \end{bmatrix} \in \mathbb{R}^M
\]
Solving for the adjoint variable, $p$, with the right hand side given by the residual:
\[
A^\top p = r
\]
The derivatives may be expressed in terms of the adjoint.
\[
\frac{\partial}{\partial\eta_i}f(u)
= \left( \frac{\partial u}{\partial\eta_i}, r \right)_M
= \left( \frac{\partial u}{\partial\eta_i}, A^\top p \right)_M
= \left( A\frac{\partial u}{\partial\eta_i}, p \right)_M
= -\left( \frac{\partial A}{\partial\eta_i}u, p \right)_M
\]
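The adjoint gradient can be checked against finite differences on a toy problem. Everything specific in this sketch is an assumption made for illustration: a $2 \times 2$ system in which each parameter $\eta_i$ enters one diagonal entry of $A$, made-up data $b$ and $u^0$, and $M = I$ so that the weighted inner product reduces to the dot product.

```python
def solve2(A, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def system_matrix(eta):
    # Hypothetical parameter dependence: eta_i is added to the i-th diagonal entry
    return [[2.0 + eta[0], -1.0], [-1.0, 2.0 + eta[1]]]


B = [1.0, 2.0]     # right hand side b (made-up values)
U0 = [0.3, 0.4]    # measurements u0 (made-up values)


def objective(eta):
    u = solve2(system_matrix(eta), B)
    return 0.5 * sum((ui - u0i) ** 2 for ui, u0i in zip(u, U0))


def adjoint_gradient(eta):
    A = system_matrix(eta)
    u = solve2(A, B)                                   # one forward solve A u = b
    r = [ui - u0i for ui, u0i in zip(u, U0)]           # residual
    At = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]
    p = solve2(At, r)                                  # one adjoint solve A^T p = r
    # dA/d eta_i has a single 1 at entry (i, i), so ((dA/d eta_i) u, p) = u_i * p_i
    return [-u[i] * p[i] for i in range(2)]


eta_demo = [0.5, 1.0]
grad = adjoint_gradient(eta_demo)
print(grad)
```

The whole gradient costs one forward and one adjoint solve regardless of the number of parameters, whereas a finite difference approximation needs an additional forward solve per parameter.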

Table 2: Comparing the computational expense of available methods for computing the gradient for M parameters $\eta_1, \eta_2, ..., \eta_M$

finite difference (approx) | sensitivities (exact) | adjoint (exact)
M+1 linear $Au = b$ solves | 1 linear $Au = b$ solve | 1 linear $Au = b$ solve
- | M linear $A\frac{\partial u}{\partial\eta_i} = -\frac{\partial A}{\partial\eta_i}u$ solves | -
- | - | 1 linear $A^\top p = r$ solve

Second derivatives of the objective function and the constraint give the Hessian matrix of second derivatives. Here $\frac{\partial u}{\partial\eta_i}$ and $\frac{\partial^2 u}{\partial\eta_i\partial\eta_j}$ denote the vectors of nodal derivatives, with entries $\frac{\partial}{\partial\eta_i}u_m$ and $\frac{\partial^2}{\partial\eta_i\partial\eta_j}u_m$ for $m = 1, ..., n$.
\[
\frac{\partial^2}{\partial\eta_i\partial\eta_j}\left( \frac{1}{2}\int_\Omega (u(k) - u^0)^2 \, dx \right)
= \int_\Omega \left( \frac{\partial u(\vec\eta)}{\partial\eta_i}\frac{\partial u(\vec\eta)}{\partial\eta_j} + (u(k) - u^0)\frac{\partial^2 u(\vec\eta)}{\partial\eta_i\partial\eta_j} \right) dx
= \left( \frac{\partial u}{\partial\eta_i}, \frac{\partial u}{\partial\eta_j} \right)_M + \left( r, \frac{\partial^2 u}{\partial\eta_i\partial\eta_j} \right)_M
\]
\[
\frac{\partial^2}{\partial\eta_i\partial\eta_j}(Au = b)
\quad \Rightarrow \quad
\frac{\partial^2 A}{\partial\eta_i\partial\eta_j}u + \frac{\partial A}{\partial\eta_i}\frac{\partial u}{\partial\eta_j} + \frac{\partial A}{\partial\eta_j}\frac{\partial u}{\partial\eta_i} + A\frac{\partial^2 u}{\partial\eta_i\partial\eta_j} = 0
\]
The term containing the second derivatives of $u$ may again be expressed in terms of the adjoint.
\[
\left( \frac{\partial^2 u}{\partial\eta_i\partial\eta_j}, r \right)_M
= \left( \frac{\partial^2 u}{\partial\eta_i\partial\eta_j}, A^\top p \right)_M
= \left( A\frac{\partial^2 u}{\partial\eta_i\partial\eta_j}, p \right)_M
= -\left( \frac{\partial^2 A}{\partial\eta_i\partial\eta_j}u + \frac{\partial A}{\partial\eta_i}\frac{\partial u}{\partial\eta_j} + \frac{\partial A}{\partial\eta_j}\frac{\partial u}{\partial\eta_i}, \; p \right)_M
\]

It is worth mentioning that this matrix may be infeasible to compute in a realistic time. However, the Hessian-vector product, or the action of a linear operator, may be computed without explicitly storing the matrix. The Hessian-vector product may be computed for an arbitrary linear combination of sensitivities $\sum_j q_j\frac{\partial u}{\partial\eta_j}$ using a second adjoint variable, $\tilde p$. To avoid needing to compute each sensitivity independently, we consider a matrix-vector product, using linearity of the inner product. For the part of the Hessian that contains the second derivatives of $u$:
\[
\sum_j^M q_j\left( r, \frac{\partial^2 u}{\partial\eta_i\partial\eta_j} \right)_M
= -\left( \sum_j^M q_j\frac{\partial^2 A}{\partial\eta_i\partial\eta_j}u + \frac{\partial A}{\partial\eta_i}\sum_j^M q_j\frac{\partial u}{\partial\eta_j} + \sum_j^M q_j\frac{\partial A}{\partial\eta_j}\frac{\partial u}{\partial\eta_i}, \; p \right)_M
\]
A second adjoint problem is needed to avoid needing to compute each sensitivity independently.
\[
A^\top\tilde p = \sum_j^M q_j\frac{\partial A^\top}{\partial\eta_j}p
\]
\[
\begin{aligned}
\sum_j^M q_j\left( r, \frac{\partial^2 u}{\partial\eta_i\partial\eta_j} \right)_M
&= -\left( \sum_j^M q_j\frac{\partial^2 A}{\partial\eta_i\partial\eta_j}u + \frac{\partial A}{\partial\eta_i}\sum_j^M q_j\frac{\partial u}{\partial\eta_j}, \; p \right)_M
- \left( \frac{\partial u}{\partial\eta_i}, \; \sum_j^M q_j\frac{\partial A^\top}{\partial\eta_j}p \right)_M \\
&= -\left( \sum_j^M q_j\frac{\partial^2 A}{\partial\eta_i\partial\eta_j}u + \frac{\partial A}{\partial\eta_i}\sum_j^M q_j\frac{\partial u}{\partial\eta_j}, \; p \right)_M
- \left( \frac{\partial u}{\partial\eta_i}, \; A^\top\tilde p \right)_M \\
&= -\left( \sum_j^M q_j\frac{\partial^2 A}{\partial\eta_i\partial\eta_j}u + \frac{\partial A}{\partial\eta_i}\sum_j^M q_j\frac{\partial u}{\partial\eta_j}, \; p \right)_M
+ \left( \frac{\partial A}{\partial\eta_i}u, \; \tilde p \right)_M
\end{aligned}
\]
The last step uses $A\frac{\partial u}{\partial\eta_i} = -\frac{\partial A}{\partial\eta_i}u$ from the derivative of the constraint.

Similar ideas of only computing matrix-vector products without storing the matrix (typically because the matrices would use a prohibitive amount of memory) appear frequently in image reconstruction. Matrix-vector products are ideal for a class of iterative solution techniques for linear systems of equations and are the motivation for the Newton-Krylov or Newton-CG optimization techniques discussed in Section 6.9.
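A matrix-free iterative solve can be sketched with a conjugate gradient loop that only calls a function returning $Av$. The operator used here is a stand-in chosen for illustration (the 1D stencil $(-1, 2, -1)$), not the Hessian of this section, and the tolerance and iteration limit are arbitrary.

```python
def apply_operator(v):
    """Action of the 1D stencil (-1, 2, -1) on v without forming the matrix."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_A, b, tol=1e-12, max_iter=200):
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the initial guess x = 0
    p = list(r)                 # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = apply_A(p)         # the only place the operator is used
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


x_cg = conjugate_gradient(apply_operator, [1.0] * 5)
print(x_cg)
```

In a Newton-CG setting the function passed as `apply_A` would return a Hessian-vector product, so the Hessian itself is never stored.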

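Table 3: Comparing the computational expense of available methods for computing the Hessian for M parameters $\eta_1, \eta_2, ..., \eta_M$

finite difference (approx) | sensitivities (exact) | adjoint (exact)
$2M^2+1$ linear $Au = b$ solves | 1 linear $Au = b$ solve | 1 linear $Au = b$ solve
- | M linear $A\frac{\partial u}{\partial\eta_i} = -\frac{\partial A}{\partial\eta_i}u$ solves | 1 linear $A\sum_i^M q_i\frac{\partial u}{\partial\eta_i} = -\sum_i^M q_i\frac{\partial A}{\partial\eta_i}u$ solve
- | $M^2$ linear $A\frac{\partial^2 u}{\partial\eta_i\partial\eta_j} = -\frac{\partial^2 A}{\partial\eta_i\partial\eta_j}u$ solves | 2 linear $A^\top p = r$ adjoint solves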

A Homework I
1. The notion of ‘distance’ does not necessarily behave in an intuitive way in higher dimensions. The ‘best’ distance measure is application dependent and may not be the usual Euclidean distance that we are familiar with. Similar to Example 18, analyze the behavior of the common image distances for noisy images of increasing dimension $\in [2, 1024]$. Compare $E(\|x\|_{.5}^{max} - \|x\|_{.5}^{min})$ and Mutual Information (MI) $E(MI^{max} - MI^{min})$ to $E(\|x\|_1^{max} - \|x\|_1^{min})$, $E(\|x\|_2^{max} - \|x\|_2^{min})$, and $E(\|x\|_3^{max} - \|x\|_3^{min})$. Discuss which image ‘contrast(s)’ would be most appropriate for detecting noisy images in the high dimensional spaces.
2. A phantom of known geometry was imaged on a new scanner. Download the data of the exact, $I$, and measured, $\hat I$, phantom data, KnownPhantom.mat MeasuredPhantom.mat,

from http://172.30.205.52/fuentes/AppliedMath/

Figure 58: Known and Measured Phantom

Formulate the convolution as a linear system
\[
\hat I = h * I
\]
and compute the point spread function that characterizes the imaging system.
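3. Given $A \in \mathbb{R}^{n\times n}$, derive the FLOP count of the factorization below.

for k = 1:n-1
    % scale the subdiagonal entries of column k by the pivot
    A(k+1:n, k) = A(k+1:n, k) / A(k, k);
    for i = k+1:n
        for j = k+1:n
            % update the trailing submatrix
            A(i, j) = A(i, j) - A(i, k) * A(k, j);
        end
    end
end

You may use the following identities
\[
\sum_{i=m}^{n} 1 = n + 1 - m
\qquad
\sum_{i=m}^{n} i = \frac{(n + 1 - m)(n + m)}{2}
\qquad
\sum_{i=0}^{n} i^2 = \sum_{i=1}^{n} i^2 = \frac{n(n + 1)(2n + 1)}{6}
\]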

4. Given random variables X, Y:


X = {◦, , ♦, ×} pX = {1/2, 1/4, 1/8, 1/8}
Y = {◦, , ♦, ×} pY = {1/4, 1/4, 1/4, 1/4}
Compute the entropy of X and Y .

5. Download the image data, ICBM_grey_white_csf.nii.gz ICBM_Template.nii.gz,

from http://172.30.205.52/fuentes/AppliedMath/

Compute the intensity threshold value that maximizes the information gain. Hint: perform an exhaustive search. Your search should resemble Figure 14 (b).
6. Download the image data, brain_T1C.mha ICBM_Template.nii.gz,

from http://172.30.205.52/fuentes/AppliedMath/

Calculate the MI, MSQ, and NCC image distances between these images using MATLAB.
7. Does the NCC (2) satisfy the triangle inequality? Prove or give a counterexample.
8. What are the properties of a metric? What are the properties of a norm? What are the properties of
an inner product?
• Verify that the norm $\|\cdot\|$ satisfies the properties of a metric defined as
\[
d(x, y) = \|x - y\|
\]
• Verify that the norm (9) and metric (10) induced by an inner product indeed satisfy the properties of the norm and metric.
\[
\|x\| = \sqrt{(x, x)}
\qquad
d(x, y) = \|x - y\| = \sqrt{(x - y, x - y)}
\]
• Show that the norm (9) induced by an inner product
\[
\|x\| = \sqrt{(x, x)}
\]
satisfies the Parallelogram Equality (11).
\[
\|x + y\|^2 + \|x - y\|^2 = 2\left( \|x\|^2 + \|y\|^2 \right)
\]

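9. Define the p-norm in $\mathbb{R}^n$.

• Compute the 1-norm, 2-norm, and ∞-norm of
\[
x_1 = \begin{bmatrix} 5\\ 8 \end{bmatrix}
\qquad
x_2 = \begin{bmatrix} 5\\ 8\\ 1\\ 7 \end{bmatrix}
\qquad
x_3 = \begin{bmatrix} 5\\ 8\\ 7 \end{bmatrix}
\]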

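10. Define what it means for two norms to be equivalent. Show that the norms $\|\cdot\|_1$ and $\|\cdot\|_2$ satisfy
\[
\frac{1}{\sqrt{n}}\|x\|_1 \le \|x\|_2 \le \|x\|_1 \qquad \forall x \in \mathbb{R}^n
\]
Hint: Use the Hölder Inequality
\[
\sum_{j=1}^{n} |\xi_j\eta_j| \le \left( \sum_{k=1}^{n} |\xi_k|^p \right)^{1/p}\left( \sum_{m=1}^{n} |\eta_m|^q \right)^{1/q}
\qquad p > 1 \qquad \frac{1}{p} + \frac{1}{q} = 1
\]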

11. Show that the following sets of vectors are linearly independent.
• $(1, 0, 0), (0, 1, 0), (1, 0, 1) \in \mathbb{R}^3$
• $x, \sin(x), e^x \in C[0, 1]$
12. Define a linear operator. Define a bounded linear operator. Show that:
• The operator $T : C[a, b] \to C[a, b]$ is linear as defined by (Example 36)
\[
Tx(t) = \int_a^t x(\tau)\,d\tau \qquad \|x\| = \max_{t\in[a,b]} |x(t)|
\]
Is this operator bounded?
• Cross product with one argument fixed defines a linear operator
\[
T_1 : \mathbb{R}^3 \to \mathbb{R}^3 \qquad T_1 x = x \times a \quad \forall x \in X
\]
Here $a \in \mathbb{R}^3$ is fixed, Example 37. Is this operator bounded under the $\|\cdot\|_2$ norm?
13. Define a functional. Show that the functional defined as the dot product with a fixed vector in $\mathbb{R}^3$
\[
f(x) = x \cdot a = x_1 a_1 + x_2 a_2 + x_3 a_3 \qquad a \in \mathbb{R}^3 \text{ fixed}
\]
is linear and bounded.
14. Given the matrices,
\[
A_1 = \begin{bmatrix} 1 & 2\\ 20 & 25 \end{bmatrix}
\qquad
A_2 = \begin{bmatrix} 1 & 2 & 3\\ 2 & 4 & 5 \end{bmatrix}
\]
Define the null space. Find all vectors that are in the null space of these matrices.
\[
N(A_1) = \{z : ?\} \qquad N(A_2) = \{z : ?\}
\]
What is the dimension of each of these null spaces? Use the rank and nullity Theorem (3.2). What is the dimension of the range space? Discuss existence and uniqueness of a solution, $x_1$ and $x_2$, to each of the linear systems
\[
A_1 x_1 = b_1 \qquad A_2 x_2 = b_2
\]
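15. Consider an exact image reconstruction of an object $x$ from the measurements $b$,
\[
Ax = \begin{bmatrix} .8 & .3\\ .2 & .5 \end{bmatrix}\begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} 4.5\\ 7.8 \end{bmatrix} = b
\]
• For arbitrary $n \in \mathbb{N}$, show that the matrix norm induced by
\[
A : (\mathbb{R}^n, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_\infty)
\]
is
\[
\|A\|_{1,\infty} = \max_i \max_j |a_{ij}|
\]
• Compute the condition number of this matrix
\[
\kappa(A) = \|A\|_{1,\infty}\|A^{-1}\|_{1,\infty}
\]
• Under the given measurement uncertainty
\[
\Delta b = \begin{bmatrix} 0.09\\ 0.07 \end{bmatrix} \qquad A\hat x = b + \Delta b
\]
compute a bound for the error in the solution,
\[
\frac{\|\Delta x\|}{\|x\|} \le \kappa(A)\frac{\|\Delta b\|}{\|b\|} = ?
\]
• Let $x$ be the solution to a nonsingular linear system and $\hat x$ be the solution to a perturbation to the linear operator, $\Delta A$.
\[
Ax = b \qquad (A + \Delta A)\hat x = b
\]
Show that the change in the solution, $\Delta x \equiv \hat x - x$, is bounded by the perturbation and the condition number, $\kappa(A)$.
\[
\frac{\|\Delta x\|}{\|x\|} \le \kappa(A)\frac{\|\Delta A\|}{\|A\|}
\]
You may assume second order perturbations are negligible.
\[
\Delta A \Delta x \approx 0
\]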

16. Consider the below images defined on the unit square $\Omega = [0, 1] \times [0, 1]$
\[
f(x, y) = x + y \qquad g(x, y) = y \qquad a(x, y) = \sin x + \cos xy
\]
\[
b(x, y) = \sin x - \cos xy \qquad u(x, y) = \sin \pi x \qquad v(x, y) = \sin 2\pi x
\]
Define the $L_2[\Omega]$ inner product $(\cdot, \cdot)_{L_2[\Omega]}$.

• Compute the distance as defined by this inner product
\[
\|f - g\| = \sqrt{(f - g, f - g)} = ?
\]
• Compute the inner products
\[
(a, b) = ? \qquad (u, v) = ?
\]
17. Let X = (X,d): If xn → x and yn → y , define what we mean by convergence and show that
d(xn , yn ) → d(x, y)
2
• For the two sequences of functions (xn ) =(exp nx ) and (yn ) =( xn ) in L2 [a, b], what does their
distance converge to ?
d(xn , yn ) = kxn − yn k → ?
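For the norm exercises (Problems 9 and 10), a small numerical check can be useful before attempting the proofs. The sketch below is only an illustration: the test vector is made up and is not one of the homework vectors, and the function name is hypothetical.

```python
def p_norm(x, p):
    """p-norm of a vector; p = float('inf') gives the max norm."""
    if p == float("inf"):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x_demo = [3.0, -4.0]                       # made-up vector
norm_1 = p_norm(x_demo, 1)
norm_2 = p_norm(x_demo, 2)
norm_inf = p_norm(x_demo, float("inf"))
print(norm_1, norm_2, norm_inf)

# Numerical check of the equivalence bounds of Problem 10 for this vector
n = len(x_demo)
print(norm_1 / n ** 0.5 <= norm_2 <= norm_1)
```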

B Homework II
1. Download/Obtain a 512 × 512 pixel image and write a MATLAB program to resample the image to a 128 × 256 image. Do not use the ‘resample’ command. Express your solution in terms of projections in an inner product space. Compute the L1 and L2 norm of the difference between the original image and the resampled image. What is the resulting weighted norm on $\mathbb{R}^n$?
\[
\|I^{original} - I^{resampled}\|_1 = ? \qquad \|I^{original} - I^{resampled}\|_2 = ?
\]

2. Consider a differential operator with specified zero derivative boundary conditions defined on the space of differentiable functions and the usual L2 inner product
\[
X \equiv C^1[a, b]
\qquad
(x, y) = \int_a^b x(t)y(t)\,dt
\qquad
Lx = \frac{d^2}{dt^2}x + x, \quad x'(a) = 0 \quad x'(b) = 0 \quad a < b
\]
(a) Compute the adjoint. Is the operator self adjoint?
(b) What are the eigen-functions?
(c) Do the eigen-functions form a basis for $L_2$?
3. For a given symmetric positive definite (SPD) matrix, $A$, define:
\[
f(x) = \frac{x^\top A x}{x^\top x}
\qquad
(Ax, x) > 0 \;\; \forall x
\qquad
A = A^\top
\]
(a) Show that the n stationary points, $\{x_i\}_{i=1}^n$, of $f(x)$ are eigenvectors of $A$ and that the values $f(x_i)$ are eigenvalues.
(b) Consider the constraint $c(x) = x^\top x - 1$
i. Show the constraint is satisfied at the optimum, ie $c(x) = 0 \Rightarrow \nabla f(x) = 0$
4. Consider the quadratic function
\[
\min_x f(x) = \frac{1}{2}x^\top A x - b^\top x
\qquad
A \in \mathbb{R}^{2\times 2} \text{ symmetric} \quad b \in \mathbb{R}^2
\qquad
A = \begin{bmatrix} 8 & 2\\ 2 & 900 \end{bmatrix}
\qquad
b = \begin{bmatrix} 6\\ 1 \end{bmatrix}
\]
• What are the eigen-values and eigen-vectors of A? (Compute by hand, ie do not use ”eigs” in MATLAB)
• Code the Steepest Descent Method, Algorithm 2, in MATLAB with your favorite line search and apply it to find a solution.
• Explain the convergence behavior observed.
5. Suppose that we have a time series of imaging data, we draw an ROI on the image, and we need to fit the measurements within the ROI, $\{t_i, y_i\}_{i=1}^5$, to our model function $\phi(t, \vec x)$.
\[
\phi(t_i, \vec x) = x_1 e^{x_2 t_i + x_3 t_i^2}
\qquad
r_i = y_i - \phi(t_i, \vec x)
\]

t 0.0 1.0 2.0 3.0 4.0
y 3.0 2.7 1.3 0.7 0.1

Figure 59: Measurements within an ROI.

• Formulate the solution as a nonlinear least squares problem. Define what metric you are using
for your objective function. What is the gradient? What is the Hessian ?
• Explicitly write out the first iteration of the Gauss-Newton approach to this nonlinear least squares
problem.

– What is your initial guess?
– What is the initial residual?
– What is the resulting least squares problem?
– Solve this least squares problem using normal equations in MATLAB.
– Solve this least squares problem using QR factorization in MATLAB.
– Compare QR factorization vs normal equations. What are potential advantages and disadvantages of both approaches?
• Using either a steepest descent, quasi-Newton, or Newton approach, code your own iterative
algorithm to find a solution, ie do not use MATLAB intrinsic functions. What algorithm did
you use ? What properties led to your algorithm selection ?
• Compare your answer to MATLAB lsqnonlin function. What is the convergence rate of your
algorithm? What is the convergence rate of MATLAB lsqnonlin function?
6. Discuss the properties of a Newton-CG algorithm for the Trust Region subproblem. In particular,
which descent directions does the algorithm favor ? Is the algorithm robust ? What type of convergence
properties are expected ? In terms of RAM/disk usage and floating point operations, how does the
algorithm allow you to reduce the computational expense of the Newton solve ? Give an example when
the Newton-CG trust region algorithm would increase the trust region radius. Give an example when
the Newton-CG trust region algorithm would decrease the trust region radius.
7. Explore the ability of an L1 algorithm to recover a sparse solution. Consider an exact solution as the sum of three sinusoidal frequencies
\[
f(t) = 7\sin(70t) + 5\sin(1t) + 9\sin(20t)
\]
If we randomly sample this function as the “data”,

Nsample = 100;
% random sample times in [0, 100]
t = 100*rand(Nsample, 1);
% exact signal evaluated at the sample times
y = 7*sin(70*t) + 5*sin(1*t) + 9*sin(20*t);

Code an L1 minimization solver to recover the amplitude of the dominant frequency components from an assumed model $g(t)$.
\[
g(t) = \sum_{j=1}^{100} x_j\sin(j t)
\qquad
\begin{bmatrix} & \vdots & \\ \cdots & \sin(j t_i) & \cdots \\ & \vdots & \end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_{100} \end{bmatrix}
= \begin{bmatrix} f(t_1)\\ f(t_2)\\ \vdots\\ f(t_{Nsample}) \end{bmatrix}
\quad \Leftrightarrow \quad Ax = y
\]
Specifically, beginning with initial guess, $x = \vec 0$, solve the L1 problem using Algorithm 7.
\[
\min \|x\|_1 \quad \text{such that } Ax = y
\]
At your solution, what is the condition number of the Hessian that results from the smooth approximation to the L1-norm?
\[
\|x\|_1 \approx b(x) \equiv \sum_i \sqrt{x_i^2 + .001} \qquad \nabla^2 b(x) = ?
\]
How does your algorithm behave for Nsample = 3, 100, 1000?
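For the steepest descent exercise (Problem 4), the effect of conditioning on the iteration count can be previewed with a sketch like the one below. It is an illustration under stated assumptions, not a solution: it is written in Python rather than MATLAB, uses an exact line search, and runs on a made-up diagonal matrix instead of the homework matrix.

```python
def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Minimize 0.5*x^T A x - b^T x for SPD A with an exact line search."""
    x = list(x0)
    for it in range(max_iter):
        g = [axi - bi for axi, bi in zip(matvec(A, x), b)]   # gradient A x - b
        gg = dot(g, g)
        if gg ** 0.5 < tol:
            return x, it
        alpha = gg / dot(g, matvec(A, g))    # exact minimizer along -g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter


# Made-up SPD test problems with minimizer (1, 1)
x_ill, iters_ill = steepest_descent([[2.0, 0.0], [0.0, 50.0]], [2.0, 50.0], [0.0, 0.0])
x_well, iters_well = steepest_descent([[2.0, 0.0], [0.0, 2.0]], [2.0, 2.0], [0.0, 0.0])
print(x_ill, iters_ill)
print(x_well, iters_well)
```

Comparing the two iteration counts gives a concrete starting point for explaining the convergence behavior in terms of the eigenvalues of $A$.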
