Optimization Methods
Instructor: C. V. Jawahar
IIIT Hyderabad
Contents

1 Background
1.1 Background on Matrices
1.2 Background on Graphs
1.3 Hard Problems and Approximate Algorithms
1.4 Optimization Problems in Machine Learning and Signal Processing
1.5 Notes
2 Introduction to Linear Programming
2.2.2 Related Terms
2.3 Numerical Example
2.4 Graphical Method of Solving LP
2.5 Some cases of special interest
2.5.1 Exercise
2.6 LP/IP Formulations
2.7 Pattern Classification Problem
3 LP and IP Formulations
3.1 Formulations
3.2 Problem: Line Fitting
3.2.1 Variations
3.3 Minimizing Norms
3.3.1 Example 1
3.3.2 Example 2
3.4 Example Problem: Cutting Paper Rolls
3.5 Example Problem: MaxFlow
3.6 Reading
5 LP Relaxation
5.1 LP Relaxation
5.2 Bipartite Matching
5.3 Minimum Vertex Cover
5.4 Facility Location Problem
5.5 Maximum Independent Set
5.6 Reading
6 More on IP Formulations
6.1 BIP and MIP Formulations
6.1.1 Example Formulations
6.2 Function of K Discrete Variables
6.3 Either-Or Constraints
6.4 K out of N Constraints
6.5 Modelling Compound Alternatives
6.5.1 Problems with a fixed cost and variable cost
6.6 Modelling Piecewise Linear Cost
6.7 Solving BIP using Balas' Algorithm
6.7.1 Example Problem
7 More on LP Relaxation
7.1 Scheduling for Unrelated Parallel Machines
7.2 Minimum makespan scheduling on unrelated machines
7.2.1 LP-relaxation algorithm for 2 machines
7.2.2 LP-relaxation algorithm for minimum makespan scheduling on unrelated machines
8 Solving Ax = b
8.1 Introduction
8.1.1 A is an Identity Matrix
8.1.2 A is a Permutation Matrix
8.1.3 A is a Diagonal Matrix
8.2 A is a Triangular Matrix
8.2.1 Forward Substitution
8.3 Cholesky Decomposition of a Positive Definite (PD) Matrix
8.3.1 PD Matrix
8.3.2 Cholesky Decomposition
8.4 Algorithm for Cholesky Factorization
8.5 Solving linear equations by Cholesky factorization
8.6 Finding the Inverse using Cholesky factorization
8.7 Example Problems
8.7.1 Example 1
8.7.2 Example 2
8.7.3 Example 3
8.7.4 Exercise
9.1 Review and Summary
9.2 LU Factorization
9.2.1 Definition
9.2.2 LU Factorization
9.3 Computing the LU factorization
9.3.1 Computational procedure
9.3.2 Example 1
9.3.3 Example 2
9.4 Solving linear equations by LU Factorization
9.5 Computing the Inverse using LU
9.6 Solution of Ax = b with a direct inverse
9.6.1 Example 3
9.6.2 Example Problem
9.7 QR Factorization
9.8 QR Factorization: The Method
9.8.1 Algorithm: QR Factorization
9.8.2 Example
9.9 Applications of QR
9.10 Factorization using SVD
9.11 Computing the Inverse using SVD
9.12 Additional Examples
9.12.1 Example
9.12.2 Example
9.12.3 Exercise
9.12.4 Exercise
10.7.3 Exercise
14 More on Simplex
14.1 Introduction
14.2 Basics
14.3 How the simplex method works
14.3.1 Simplex: Steps
14.3.2 Computing B̄^{-1}
14.3.3 Simplex Algorithm
14.4 On Computational Complexity
14.5 Example Problems
17.2.3 Strong Duality
17.2.4 Duality Result for LP
17.2.5 Duality Result for IP
17.2.6 Duality Gap
17.3 Proof of Strong Duality
17.3.1 Geometric Proof
17.3.2 Proof based on Farkas' Lemma
17.4 Examples
17.5 Additional Problems
19.6 Example: MST
19.6.1 MST: Ver 1
19.7 Set Cover
19.7.1 Set Cover: Ver 2
19.8 Example: Weighted Vertex Cover
19.8.1 Ver 1
19.8.2 Iterative Algorithm
19.8.3 Summary
19.8.4 Minimum-weighted Vertex Cover: Ver 2
19.8.5 Weighted Vertex Cover via Primal-Dual method
19.8.6 Primal-Dual algorithm for WVC
19.8.7 WVC: Ver 3
19.9 Example: Minimum Steiner Forest via Primal-Dual method
19.9.1 Approximation Algorithm for Minimum Steiner Forest via Primal-Dual method
19.10 Weighted Set Cover
20.6 Additional problems
20.7 Convex Optimization
20.8 Appendix: Gradient, Hessian and Jacobian
20.8.1 Gradient
20.8.2 Hessian
20.8.3 Jacobian
24 Nonlinear Optimization: Solving f(x) = 0
24.1 Introduction
24.1.1 Iterative Algorithms
24.2 Bisection Method
24.2.1 Algorithm
24.2.2 Convergence Analysis
24.2.3 Discussion
24.3 Fixed Point Iteration
24.3.1 Algorithm
24.3.2 Convergence Analysis
24.4 Newton's Method
24.4.1 Algorithm
24.4.2 Discussion
24.5 Secant Method
24.5.1 Algorithm
24.6 Additional Problems
24.6.1 Additional Problem
26.4 Example Problems
26.5 Additional Problems
Chapter 0

Introduction to Optimization Methods
0.1 A Course on Optimization Methods
We see a variety of optimization problems in our day-to-day life, and we solve many of them quite comfortably (though not necessarily “optimally”!). Sometimes we also struggle with such problems; even stating a problem formally can be very challenging in many situations.
We studied many optimization problems in our high school days, and we have seen optimization in different areas of computer science (and, of course, in the wider engineering disciplines). Many important aspects of optimization demand a formal study of this class of problems and the associated solution schemes. This course focuses on a set of fundamental aspects of optimization methods. (We will not be able to cover all important aspects in a first-level course like this; please read classical textbooks, monographs or research papers to advance your knowledge.) This is purely an introductory course.
There are many important aspects of optimization methods that we are interested in. They include:
• What is the nature of a specific optimization problem? How do we characterize the problem?
• How do we formally state a problem as an optimization problem (say, as a Linear Programming (LP) problem)? (Do not confuse this use of the word “programming” with your favorite Python programming; the two can look very different on the surface.)
• How do we solve an optimization problem (say, a hard problem)? What are some of the popular algorithms?
• Do we have an efficient solution to the problem of interest (say, a polynomial time algorithm)?
• If we do not have an efficient solution, can we come up with an efficient but approximate solution?
• When we solve an optimization problem, will we always obtain the “best” solution? (Do we get the global optimum or a local optimum?)
• What are the applications of optimization methods in other related disciplines (such as machine learning)?
• What are some of the numerical linear algebra tools that are needed for solving optimization problems? (e.g., how do we solve a system of equations efficiently?)
This course assumes background in (i) mathematics (especially linear algebra) and (ii) algorithms (graph algorithms, computational complexity), both at an introductory level.
Though the major focus of this course is on optimization methods, we demonstrate the methods on two specific domains: (i) approximate algorithms and (ii) machine learning. A minimal understanding of the terminology in these areas will be useful, but it is not critical. Some of it is summarized in Chapter 1.
Many computational problems are optimization problems. When you want to design a minimum spanning tree (MST) for a graph, you are finding the “best” tree out of many possible ones. How do we find the best one? Indeed, someone has already told you an algorithm to compute this. However, when you have a new problem in hand, how do you design an algorithm? You might have seen many specific tools in a course on design and analysis of algorithms. If you had to pick only one tool, LP/IP would be the most widely applicable one.
Modern machine learning is dominated by optimization. The entire “training” process of a machine learning algorithm is often an optimization with some additional tricks. Deep learning methods may be optimizing an objective function of millions or billions of variables. Even after optimizing for days or weeks, the solution that we obtain is not guaranteed to be the best (or optimal)! Even if the models/solutions obtained through the optimization process are useful and “very good”, as an optimization person you should appreciate that this leaves tremendous scope for designing far superior solutions to many practically important problems, if we can ever design a better optimization scheme.
This course will give emphasis to: (i) how to formulate the problem at hand as an optimization problem (from a plain-English task to a mathematical definition of the problem); (ii) how to identify the nature of the problem (is it linear? is it nonlinear? is it non-convex?); (iii) the popular algorithms (maybe a small set of the popular ones) to solve the optimization problem exactly or sometimes approximately; (iv) how optimization techniques get used in machine learning, the design of algorithms, etc.
Implementations of many of these optimization methods are numerically tricky. Though we will see some numerical and computational procedures, the objective of this course is NOT to train you in implementing the solution to an optimization problem on a computer. Wherever programming is required, students will be encouraged and guided to use some of the standard libraries/packages. Minimal familiarity with computer programming is expected in this regard.
There are many interesting dichotomies in this space. Some of them are worth noting:
1. Linear vs Nonlinear
2. Convex vs Nonconvex
3. Discrete vs Continuous
4. Constrained vs Unconstrained
The names make these distinctions reasonably obvious; keep these words in mind as we progress. However, the nature of the problem, the solution schemes, the guarantees, etc. can be very different across the categories. This is why it is important for us to know where our problem lies in practice.
Question: Contrast and elaborate on these classification schemes in detail at the end of this course. Give examples of problems, and also pick classical or practical problems in each of these categories.
By the end of this course, you should aim to develop the skills that help you place a problem in the appropriate class.
Each chapter in this note series corresponds to approximately one lecture of 1.5 hours. The assumption is that you should learn content worth 25 to 30 lectures by the end of this course. These lecture notes and Internet resources will help you explore the material further. It is important that students put in the effort themselves and go somewhat beyond the lectures. Content worth 20 to 25 lectures will be discussed in the lectures. You are not expected to come prepared for the lectures (at this stage) by reading these notes in advance, but you are expected to read them after the lecture. (The notes may not be descriptive or complete enough for you to read and follow smoothly at this stage.)
0.5.2 Evaluation/Grading
I wish a course like this could be evaluated purely based on class work, homeworks and personalized evaluations. Maybe we are not fully ready to get rid of examinations at this stage; we will continue to have mid and final examinations. There is no plan to make these exams tricky or super challenging.
Regular Homeworks The main evaluation is based on regular homeworks. You are expected to solve 50 homework problems “at your convenience”, i.e., whenever you think you have time, you can “ask for” a question and you will be given one immediately. (This implies that you do not have to take special permission for family functions or festivals.) To keep the load (and, more importantly, the learning) uniform, you can solve at most one question a day. However, you will have to submit your answer within the next 24 hours. We assume that the average time required for solving a question is 1 to 2 hours for a typical student, provided the student is familiar with the content. Indeed, there will be situations where you are unable to solve/submit due to personal reasons or limitations, and students may make mistakes too. Therefore, you will have access to 60 questions and the best 50 will be used for the grading. We typically expect you to solve 2-3 questions between successive lectures. If you move faster than that, you will get questions from future lectures and it becomes your responsibility to learn the material yourself.
We wish you to collaborate openly. When you collaborate, do acknowledge it; your points could be divided accordingly, and the person who helps (including TAs) should be given due credit.
The questions that you solve may be different from those your friends solve. It is a serious offense if (i) you take a question out of the system (questions may be watermarked with visible and/or invisible watermarks), including by photography/screenshots; (ii) you discuss the solutions electronically (electronic groups, mailing lists); or (iii) you take help from people without acknowledging it.
Grading
• Homeworks: 40%
There could be a maximum of ±5% change from this plan. It is expected that the class participation score will be zero for most students; however, it can be positive or negative for a small fraction of the students.
There are many popular textbooks and references. A list is available on the course page.
3. S. Boyd and L. Vandenberghe, “Convex Optimization”, Cambridge University Press (online copy available at: http://www.stanford.edu/~boyd/cvxbook/)
4. L. Vandenberghe, Lecture Notes for Applied Numerical Computing (available online at: http://www.ee.ucla.edu/~vandenbe/103/reader.pdf)
These lecture notes are based on the notes from previous offerings. They are still evolving; please see this section to know where we are and how the notes have changed over versions.
• Ver 0.1 [Jan 1, 2019] Some structural and cosmetic changes to suit the plans for Spring 2019. Available to the enrolled students to roughly appreciate the course content.
• Ver 0.2 Coming soon (EDA Jan 15, 2019).
Chapter 1

Background

This is the time to revise what we have learned in the past. Much of the terminology may already be familiar to you, but it is worth reading and refreshing. If you are too far from these terms, you may find it hard to go through the rest of the material.
1.1 Background on Matrices
This course assumes some amount of familiarity with mathematics, especially linear algebra. A basic understanding of geometry (points, lines, planes) is expected, and the basics of calculus (differentiation) are required.
Here are some definitions/terms that you should have heard of already:
• Vectors
• Matrices
• Determinant of a matrix
• Rank of a matrix
• Lines, Planes, Hyperplanes and Half Spaces: ax1 + bx2 = c is a line in 2D; ax1 + bx2 ≤ c describes the line plus one side of it (a half space), and similarly for ax1 + bx2 ≥ c. In 3D the equivalent is a plane a1x1 + a2x2 + a3x3 = b, and in d dimensions it is the hyperplane aT x = b.
1.2 Background on Graphs
This course assumes some amount of familiarity with the basics of graphs (e.g., what you learn from a first course on Discrete Mathematics or Data Structures).
You are expected to be familiar with:
1.4 Optimization Problems in Machine Learning and Signal Processing
• Least Square Problem
• Regression
• Binary Classification
• Multi Class Classification
• Support Vector Machines and Deep Learning
1.5 Notes
You will find excellent references for the above in many places, including the courses that you have taken and the textbooks that you used. The best option is to use the textbooks from your mathematics, data structures and algorithms courses.
Chapter 2

Introduction to Linear Programming
2.1 Introduction to Linear Programming (LP)
Linear Function A function f(x1, x2, . . . , xn) of x1, . . . , xn is a linear function if and only if, for some set of constants c1, c2, . . . , cn,
f(x1, x2, . . . , xn) = c1x1 + c2x2 + . . . + cnxn
Linear Inequalities For any linear function f(x1, x2, . . . , xn) and any number b, inequalities such as
f(x1, x2, . . . , xn) ≤ b
f(x1, x2, . . . , xn) ≥ b
are linear inequalities.
A linear programming problem may be defined as the problem of maximizing or minimizing a linear function
subject to linear constraints. The constraints may be equalities or inequalities.
To discuss a bit more in detail, a Linear Programming Problem is an optimization problem for which
1. We attempt to maximize (or minimize) a linear function (the objective function) of the decision variables. The objective function (z) is the criterion for selecting the “best” values of the decision variables. The decision variables (xi) are the variables of interest here.
2. We are often interested in the decision variables at the optimum (x∗i) and the optimal value of the objective (z∗); not always both!
3. The values of the decision variables must satisfy a set of constraints. Each constraint can be an equality or an inequality. Constraints are like limitations on resource availability; there can be more than one constraint.
4. There may be a sign restriction on each variable, i.e., xi ≥ 0, xi ≤ 0, or xi unrestricted.
In Linear Programming, the decision variables can take any real value. In Integer Programming, the decision variables can take only integer values. This may seem a simple change from LP, but it makes solving a general IP very hard.
maximize cT x
subject to
Ax ≤ b
x ≥ 0; x ∈ Rn
where x represents the vector of n variables (the decision variables of our interest), c and b are vectors of
(known) coefficients, A is a (known) m × n matrix of coefficients, and (.)T is the matrix transpose. Note
that the coefficients from the m constraints are now arranged into a matrix form in A.
The expression to be maximized or minimized is called the objective function. The inequalities Ax ≤ b
and x ≥ 0 are the constraints which specify a convex polytope over which the objective function is to be
optimised.
For a Linear Program (LP), the xi are real numbers, i.e., x ∈ Rn.
For an Integer Program (IP), the xi are integers, i.e., x ∈ Zn.
We may also have mixed integer programs, where some variables are real and some are integers.
Though LP and IP may seem very similar, they behave very differently in practice. An LP can be solved “easily”, while solving an IP can be “hard”. We shall see this in more detail as we move forward.
A given problem may not arrive in the above form. Given a problem, we often rewrite it into a standard form.
Formulations of LP can be transformed from one form to another easily, e.g., maximization of the objective function to minimization. Some simple rules that guide the transformations are:
3. Equality: xi = 3 ⟺ xi ≤ 3 and xi ≥ 3
2.2.2 Related Terms
3. Feasible Region: the set of all feasible solutions. Another way to define it is as the region enclosed by all the inequalities involved. The feasible region is also referred to as the search space, solution space or feasible set. An optimization method searches for the best solution within the feasible region.
4. Optimal Solution: the feasible solution with the largest objective function value for a maximization problem, or the feasible solution with the smallest objective function value for a minimization problem.
2.3 Numerical Example
Consider the following LP:
maximize 3x1 + 2x2
subject to
2x1 + x2 ≤ 6
7x1 + 8x2 ≤ 28
x1, x2 ≥ 0

Figure 2.1: Constraints and the Feasible Region. Note that the axes should be x1 and x2.

Objective Function The function to be maximized (or minimized) is called the objective function. Here, the objective function is 3x1 + 2x2 and the objective is to maximize it.
Question: For the above problem what is m? what is n? 2 and 2. Can you write A, b and c?
Constraints A constraint involves an inequality or equality in some linear function of the variables. The
two constraints, x1 ≥ 0 and x2 ≥ 0, are special. These are called non-negativity constraints and are
often found in linear programming problems. The other constraints are then called the main constraints.
Here, the main constraints are (i) 2x1 + x2 ≤ 6 and (ii) 7x1 + 8x2 ≤ 28
Feasible Region A feasible region or solution space is the set of all possible points of an optimization
problem that satisfy the problem’s constraints. The feasible region for the example problem mentioned above
is given in the figure.
Feasible Solution A feasible solution to a linear program is a solution that satisfies all the constraints, i.e., one that lies within the feasible region. The point (1, 1) is a feasible solution in our example.
Optimal Solution An optimal solution to a linear program is a feasible solution with the largest objective
function value (for a maximization problem). A linear program may have multiple optimal solutions, but
only one optimal solution value.
Note:
1. Since every inequality forms a half plane (which is a convex set), the set of feasible solutions to an LP
(feasible region) forms a (possibly unbounded) convex set.
2. An optimal solution lies on a face/edge of the convex set; in fact, every linear program (with a finite optimum) has an extreme point (corner point) that is an optimal solution.
Optimization Method Do we have to search at all the vertices? Do we have to search at all the points
inside the shaded region also?
2.4 Graphical Method of Solving LP
For the example problem, the objective is to maximize 3x1 + 2x2. From the figure, the corner points are (0, 3.5), (0, 0), (3, 0) and (20/9, 14/9); their corresponding objective values are 7, 0, 9 and 88/9 ≈ 9.78 respectively. Since the maximum value is 88/9, the optimal point is (20/9, 14/9).
Let us look at a procedure to find this maximum. Consider a line 3x1 + 2x2 = k, and draw this line for various values of k. When k = 0, it passes through the origin; as we increase k, the line moves away from the origin (in a certain direction). (If k is very large, the line does not intersect our polygon of interest.) Let us increase k slowly from zero: at a certain point, we obtain the last line that intersects/touches the polygon. This is the line of our interest; in this case it occurs when k = 88/9 and the line passes through (20/9, 14/9).
Let us now summarise our graphical way of solving LPs.
Given an LP problem, i.e., an objective function and a set of constraints, we proceed as follows:
1. Plot each constraint line on the graph.
2. Determine the valid side of each constraint line. This can be done by substituting the origin into the inequality: if the origin satisfies the inequality, then all the points on the origin's side of the line are feasible (valid) and all the points on the other side are infeasible (invalid); the reverse holds if the origin does not satisfy the inequality.
3. Identify the feasible region: the area of the graph that is valid for all the constraints. Choosing any point in this region gives a feasible solution.
4. Plot the objective function line and move it in the direction of improvement: the direction of greater value if the objective is to be maximized, and of lower value if it is to be minimized.
5. An optimal solution always occurs at a corner. So algebraically calculate the coordinates of the corners of the feasible region and use the objective function line to find the best corner.
6. Now use the coordinates of the optimal corner (the optimal solution) to get the objective function value.
2.5 Some cases of special interest
An LP:
1. is infeasible,
2. is unbounded, or
3. has an optimal solution.
Question: Does (3) mean that every LP has only one optimum?
Infeasible Solution A linear program is infeasible if it has no feasible solutions, i.e., the feasible region is empty. Here is an example of an infeasible LP:
Problem
maximize z = x1 + x2
subject to
x1 − 2x2 ≥ 6
Figure 2.2: Infeasible Solution and Unbounded Solution (TODO fix)
2x1 + x2 ≤ 4
x1 , x 2 ≥ 0
Unbounded Feasible Region A linear program is unbounded if the objective value can be made arbitrarily large (for a maximization problem) over its feasible region. Here is an example with an unbounded feasible region:
Problem:
maximize z = x1 + x2
subject to
x1 − x2 ≤ 5
−2x1 + x2 ≤ 4
x1 , x 2 ≥ 0
2.5.1 Exercise
Subject to:
6x1 + x2 ≥ 6
4x1 + 3x2 ≥ 12
x1 + 2x2 ≥ 4
xi ≥ 0 and real
1. Draw the lines corresponding to each constraint in a graph and shade the half planes according to the inequalities.
2. If there is no intersection of the shaded regions, then there is no solution.
3. If there is a bounded intersection region, find the coordinates of the corners by solving the systems of intersecting equations.
4. Find which corner gives the optimum value by applying the objective function to each of them.
Note: Is this how we plan to solve all problems as we move forward? No, never. However, it is important to understand the geometric viewpoint of LP, the constraints and the objectives.
Question: Can an IP also be solved like this? No; that is a longer story for the next few lectures.
Question: Solve the LPs that we have seen with a numerical solver on your computer (say, a Simplex implementation) in your favourite library, and compare your answers. A sketch with one such library follows.
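As a concrete illustration, here is a minimal sketch assuming Python with SciPy's linprog as the solver (any LP library would do), for the numerical example of this chapter. Note that linprog minimizes, so the maximization objective is negated.

```python
from scipy.optimize import linprog

# Maximize 3*x1 + 2*x2 subject to 2*x1 + x2 <= 6, 7*x1 + 8*x2 <= 28,
# x1, x2 >= 0.  linprog minimizes, so we negate the objective.
c = [-3, -2]
A_ub = [[2, 1],
        [7, 8]]
b_ub = [6, 28]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)     # approx [2.222, 1.556], i.e., (20/9, 14/9)
print(-res.fun)  # approx 9.78, i.e., 88/9
```

This matches the graphical solution: the optimum sits at the corner (20/9, 14/9).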
2.6 LP/IP Formulations
One of our objectives in the initial part of this course is to learn how to formulate a problem as an LP or IP. We will take many examples in the next few lectures. This is an important skill to have, and it is expected that you practice formulating problems on your own.
2.7 Pattern Classification Problem
Let us now consider an example problem: pattern classification. This is a fundamental problem in machine learning, often referred to as the supervised classification problem. If you are given examples of “apples” and “not-apples”, a machine learning algorithm is asked to learn a classifier that separates the apples from the others. Let us assume that every item (sample/object) of interest is characterized by two variables xi and yi (say, height and weight).
Problem Statement We are given two disjoint patterns. One is a positive pattern consisting of the coordinates {(x_i^+, y_i^+)}, and the other is a negative pattern consisting of the coordinates {(x_i^-, y_i^-)}. Assume there are n+ positive examples and n− negative examples. We have to find a line that classifies the patterns (the 2D points) in the best possible way, i.e., the line must lie between the two classes and must be as far as possible from every point of both classes (positives and negatives).
Note that in practice this classifier need not be a line; it can be of an arbitrary shape. However, many practical problems still use linear classifiers.
There are actually many lines that are possible solutions to this problem. A good classifier wants the points to be as far away as possible from the separating line/plane. Can we maximize this separation? Let us formulate this problem as an LP.
LP Formulation We would like to maximize the separation δ of each sample (to be precise, of the nearest sample) from the line.
We take a variable δ and maximize it while making sure that the distance of the nearest point in each of the two classes (positive and negative) from the line is at least δ; equivalently, the distance of the line to each and every point in the data set is at least δ, i.e.,
• The distance from any one of the positive pattern points to the line is at least δ.
• The distance from the line to any of the negative pattern points is at least δ.
• The line is defined as y = ax + b, with the positive samples on one side and the negative samples on the other.
Figure 2.3: Line which separates the patterns with a distance of at least δ

Maximize δ
Subject to:
y_i^+ ≥ a x_i^+ + b + δ,  i = 1, . . . , n+
y_i^- ≤ a x_i^- + b − δ,  i = 1, . . . , n−
If we solve this problem, we obtain a line such that every point is separated from it by a (vertical) distance of at least δ.
Comments:
1. We have formulated the LP for the simple distance measured along y, not the orthogonal distance from the line. The problem becomes more difficult if the orthogonal distance is to be considered. (Note that the figure does not show the problem setting exactly.)
2. We have assumed that the positive pattern is above the line and the negative pattern is below the line. It could also be the reverse, i.e., negative above and positive below. Question: Does it really matter to us?
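Here is a minimal sketch of this formulation, assuming Python with SciPy's linprog; the toy data below is made up for illustration (positives lie above the negatives).

```python
import numpy as np
from scipy.optimize import linprog

# Made-up toy data: positive points (xp, yp), negative points (xn, yn).
xp, yp = np.array([0.0, 1.0, 2.0]), np.array([3.0, 4.0, 5.0])
xn, yn = np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 1.5])

# Variables v = (a, b, delta); maximizing delta == minimizing -delta.
c = [0, 0, -1]
# y_i+ >= a x_i+ + b + delta  ->   a*x + b + delta <=  y   (row [ x,  1, 1])
# y_i- <= a x_i- + b - delta  ->  -a*x - b + delta <= -y   (row [-x, -1, 1])
ones = np.ones_like(xp)
A_ub = np.vstack([np.column_stack([xp, ones, ones]),
                  np.column_stack([-xn, -np.ones_like(xn), np.ones_like(xn)])])
b_ub = np.concatenate([yp, -yn])

# a and b are free variables; delta >= 0.
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (None, None), (0, None)])
a, b, delta = res.x
print(a, b, delta)   # a separating line y = a*x + b with vertical margin delta
```

Note that linprog's default bounds are x ≥ 0, so the free bounds on a and b must be given explicitly.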
Chapter 3
LP and IP Formulations
• We had seen the basic definitions of the LP and IP in the last lecture.
• We also know how to solve an LP graphically (not all LPs!!) on paper.
One of our objectives in the initial part of this course is to learn how to formulate a problem as an LP or IP.
3.1 Formulations
In this lecture, we see more examples of how to formulate LPs and IPs. Though each example may have its own form, do appreciate the utility and the spectrum of problems that get mapped to LP and IP, and try to pick up the skill of formulating.
3.2 Problem: Line Fitting
Given a set of points (xi, yi), i = 1, 2, 3, . . . , n, find the “best” line to fit these points; i.e., we want to find a straight line of the form y = ax + b that best describes the points or data set. This is a classical problem of interest in many areas.
Needless to say, in most cases we may not be able to find a line that passes through all the points. Our objective is then to find the best line. Best in what sense? That choice defines the problem (and changes its nature).
Assume our objective is to find a line that is “close” to all the points, i.e., the sum of the distances of all the points from the line is minimum. The objective function is then:

minimize Σ_{i=1}^{n} |yi − (axi + b)|

(Note: here we have used the L1 norm because it measures the deviation or error of the line from each point directly, while the L2 norm magnifies the error, i.e., (yi − (axi + b))^2.)

Let the error be ei = |yi − (axi + b)|. Then we write the problem as:

minimize Σ_{i=1}^{n} ei
subject to:
|yi − (axi + b)| ≤ ei,  i = 1, 2, . . . , n

If we eliminate the norm and write this as our familiar LP, we obtain:

minimize Σ_{i=1}^{n} ei
subject to:
(yi − (axi + b)) ≤ ei
−(yi − (axi + b)) ≤ ei,  i = 1, 2, . . . , n

We have a total of n + 2 variables (e1, e2, . . . , en, a, b) and 2n constraints.
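A small sketch of this LP, assuming Python with SciPy's linprog, with the variables ordered as (a, b, e1, . . . , en):

```python
import numpy as np
from scipy.optimize import linprog

def l1_line_fit(x, y):
    # Fit y ~ a*x + b minimizing the sum of absolute errors, as an LP.
    # Variables: v = (a, b, e_1, ..., e_n); minimize the sum of the e_i.
    n = len(x)
    c = np.concatenate([[0, 0], np.ones(n)])
    I = np.eye(n)
    #   y_i - (a x_i + b) <= e_i   ->  -x_i*a - b - e_i <= -y_i
    # -(y_i - (a x_i + b)) <= e_i  ->   x_i*a + b - e_i <=  y_i
    A_ub = np.vstack([np.column_stack([-x, -np.ones(n), -I]),
                      np.column_stack([x, np.ones(n), -I])])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None), (None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[0], res.x[1]  # a, b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 30.0])  # last point is an outlier
print(l1_line_fit(x, y))                  # close to a = 2, b = 1
```

Because the L1 objective grows only linearly with the residual, the fit is much less sensitive to the outlier than a least-squares fit would be.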
3.2.1 Variations
There are many interesting variations of this problem, where the error is defined using the L0, L1 or L2 norm. (We may see some of them later in this course.) The objectives in these cases become

E0 = Σ_i ∥yi − (axi + b)∥0
E1 = Σ_i ∥yi − (axi + b)∥1
E2 = Σ_i ∥yi − (axi + b)∥2

3.3 Minimizing Norms
We have already seen how the interpretation of the problem changes with a change in the norm. Let us now consider a couple of examples.
3.3.1 Example 1
Problem: minimize ∥Ax − b∥1 subject to ∥x∥∞ ≤ 1.
(With our notation, it should be obvious that A is a matrix and that b as well as x are vectors. We are given A and b; we optimize over x. The non-negativity and real-vector constraints are not explicitly written here.)
The above problem can be formulated as an LP. Let us assume that A is a matrix of dimension m × n, x is a vector of dimension n × 1 and b is a vector of dimension m × 1. We know that the ∥x∥∞ norm measures the maximum of the absolute values of the elements of a vector. Since the constraint is ∥x∥∞ ≤ 1, the maximum element of the vector is at most 1 in magnitude, which means that every element satisfies

−1 ≤ xj ≤ 1,  j = 1, . . . , n

We know that the ∥·∥1 norm measures the sum of the absolute values of the elements of a vector:

∥Ax − b∥1 = |A11x1 + A12x2 + . . . + A1nxn − b1| + |A21x1 + A22x2 + . . . + A2nxn − b2| + . . . + |Am1x1 + Am2x2 + . . . + Amnxn − bm|

As we have to minimize the sum of these terms, we take a vector y of dimension m × 1 and require each element of the vector Ax − b to lie between −yi and yi. We then minimize the sum of the elements of y. Thus the final formulation can be written as follows:

Minimize Σ_{i=1}^{m} yi
Subject to:
−yi ≤ Σ_{j=1}^{n} aij xj − bi ≤ yi,  i = 1, . . . , m
−1 ≤ xj ≤ 1,  j = 1, . . . , n
Question: How many constraints are there here in our standard form? Do write them as Ax ≤ b.
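A sketch of this LP, assuming Python with SciPy's linprog, with the variables stacked as (x, y):

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_residual(A, b):
    # Minimize ||A x - b||_1 subject to ||x||_inf <= 1, as an LP.
    # Variables: v = (x_1..x_n, y_1..y_m); minimize the sum of the y_i.
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])
    I = np.eye(m)
    #   A x - b <= y   ->   A x - y <=  b
    # -(A x - b) <= y  ->  -A x - y <= -b
    A_ub = np.vstack([np.hstack([A, -I]),
                      np.hstack([-A, -I])])
    b_ub = np.concatenate([b, -b])
    bounds = [(-1, 1)] * n + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.fun  # optimal x and the optimal l1 residual

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 10.0])
print(min_l1_residual(A, b))
```

Example 2 below can be handled the same way, with the box constraints and the residual bounds exchanging roles.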
3.3.2 Example 2
Problem: minimize ∥x∥1 subject to ∥Ax − b∥∞ ≤ 1.
Since the constraint is ∥Ax − b∥∞ ≤ 1, the maximum element of the residual vector is at most 1 in magnitude, which means that every element of the vector satisfies

−1 ≤ Σ_{j=1}^{n} aij xj − bi ≤ 1,  i = 1, . . . , m

We know that the ∥x∥1 norm measures the sum of the absolute values of the elements of a vector. Introducing a vector y with −yj ≤ xj ≤ yj, the LP is:

Minimize Σ_{j=1}^{n} yj
Subject to:
−1 ≤ Σ_{j=1}^{n} aij xj − bi ≤ 1,  i = 1, . . . , m
−yj ≤ xj ≤ yj,  j = 1, . . . , n
3.4 Example Problem: Cutting Paper Rolls
Let us now consider an engineering problem. Paper rolls are manufactured in 3 m (300 cm) width. A roll is really long, and its length does not matter, since the customers are also looking only for rolls. There is an order to serve:
(i) 97 rolls of width 135 cm, (ii) 610 rolls of width 108 cm, (iii) 395 rolls of width 93 cm, (iv) 211 rolls of width 42 cm.
The possible ways of cutting a 300 cm roll into the required widths are listed below.

Figure 3.1: First and second possible ways of cutting the roll
• P1: 2×135
• P2: 135 + 108 + 42
• P3: 135 + 93 + 42
• P4: 135 + 3×42
• P5: 2×108 + 2×42
• P6: 108 + 2×93
• P7: 108 + 93 + 2×42
• P8: 108 + 4×42
• P9: 3×93
• P10: 2×93 + 2×42
• P11: 93 + 4×42
• P12: 7×42
Formulation Let xj be the number of times the j-th possibility is used in cutting a 3 m roll. Then the problem can be formulated as follows:

Minimize Σ_j xj
Subject to:
2x1 + x2 + x3 + x4 ≥ 97
x2 + 2x5 + x6 + x7 + x8 ≥ 610
x3 + 2x6 + x7 + 3x9 + 2x10 + x11 ≥ 395
x2 + x3 + 3x4 + 2x5 + 2x7 + 4x8 + 2x10 + 4x11 + 7x12 ≥ 211
xj ≥ 0 and integer, j = 1, . . . , 12

(The coefficient of a pattern in each constraint is the number of rolls of that width the pattern yields; e.g., P5 = 2×108 + 2×42 contributes 2x5 to the 108 cm constraint.)
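A sketch of solving this IP, assuming Python with SciPy ≥ 1.9 for scipy.optimize.milp; each cutting pattern becomes a column of the demand matrix.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# patterns[k] = rolls of widths (135, 108, 93, 42) that pattern P(k+1) yields
patterns = np.array([
    [2, 0, 0, 0], [1, 1, 0, 1], [1, 0, 1, 1], [1, 0, 0, 3],
    [0, 2, 0, 2], [0, 1, 2, 0], [0, 1, 1, 2], [0, 1, 0, 4],
    [0, 0, 3, 0], [0, 0, 2, 2], [0, 0, 1, 4], [0, 0, 0, 7],
]).T                                  # 4 x 12: widths by patterns
demand = np.array([97, 610, 395, 211])

c = np.ones(12)                       # minimize total number of 3 m rolls cut
res = milp(c,
           constraints=LinearConstraint(patterns, lb=demand),
           integrality=np.ones(12),   # every x_j must be an integer
           bounds=Bounds(lb=0))
print(res.fun, res.x)
```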
3.5 Example Problem: MaxFlow
Finding the maximum flow in a graph is a classical problem of theoretical and practical interest. The max flow passing from the source node to the destination node in a network equals the minimum capacity which, when removed from the network, results in a situation where there is no flow from source to destination. (In any network, the value of the max flow equals that of the min cut: if there exists a max flow f, there exists a cut whose capacity equals the value of f.)
For the graph in the figure (with source s, sink t and intermediate nodes u and v), let the variables fsu, fsv, fuv, fut and fvt be the flows along the respective edges. A flow cannot be more than the edge capacity, and since there is no storage at the nodes, the incoming and outgoing flows at every intermediate node should match.
The LP problem would be as follows:
Maximize fsu + fsv
subject to the flow conservation constraints
fsu = fuv + fut (at u),  fsv + fuv = fvt (at v)
and the capacity constraints
0 ≤ fsv ≤ 5
0 ≤ fuv ≤ 15
0 ≤ fut ≤ 5
0 ≤ fvt ≤ 10
with non-negativity and real constraints.
Question: Write the c, A and b for the graph in Figure (b).
Discussions We noted above that in any network the value of the max flow equals that of the min cut. Can we also formulate the min-cut problem as an LP/IP? Which one? How?
Are there some relationships between these problems? Very often problems come in “pairs”; we call them dual problems. More about duality in one of the later lectures.
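A sketch of the max-flow LP, assuming Python with SciPy's linprog. The figure did not survive in these notes, so the capacity of the edge (s, u) is an assumption (taken as 10 here); the other capacities are the ones listed above.

```python
import numpy as np
from scipy.optimize import linprog

# Variables: f = (fsu, fsv, fuv, fut, fvt); maximize fsu + fsv.
c = [-1, -1, 0, 0, 0]                 # linprog minimizes, so negate
# Conservation: fsu - fuv - fut = 0 at u;  fsv + fuv - fvt = 0 at v.
A_eq = [[1, 0, -1, -1, 0],
        [0, 1, 1, 0, -1]]
b_eq = [0, 0]
# Capacities: fsu <= 10 (assumed), fsv <= 5, fuv <= 15, fut <= 5, fvt <= 10.
bounds = [(0, 10), (0, 5), (0, 15), (0, 5), (0, 10)]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(-res.fun, res.x)                # max-flow value and the edge flows
```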
3.6 Reading
Chapter 4
• We had seen many LP and IP formulations in the past. We will see some more as we move
forward.
• We had seen how LP can be solved using graphical method.
• Here we will see how an IP can be solved.
4.1 How to solve an IP?
The advantage of IPs is that they are a very expressive language for formulating optimization problems: they can capture, in a natural and direct way, a large number of combinatorial optimization problems. The disadvantage of IPs is that finding optimal solutions is NP-hard in general. (There are, in fact, lucky IP problems that can be solved easily; they are topics of interest for a future lecture.)
Let us now see how to solve some of the IP problems. We are interested in two methods at this stage.
LP Relaxation
If we relax the integer constraints of the IP and assume that the variables are real, we get an LP, which we can then solve. However, how are the solutions related? Note that the optimum of the IP is inferior (or equal) to the optimum of the LP: for a maximization problem IP∗ ≤ LP∗, and for a minimization problem IP∗ ≥ LP∗. LP relaxation is the technique of relaxing the IP to an LP and solving that efficiently. We will come back to this later.
We “assume” that we know how to solve an LP; we use the LP solver as a black box/subroutine here. Note that we know how to solve an LP using the graphical method for simpler problems, and we can use simplex solvers for large LPs on a computer.
Branch and Bound
This is a classical paradigm that can be used for solving many other hard problems (refer to your textbooks on algorithms for details).
The basic idea is the following. We divide a large problem into multiple smaller ones; this is the “branch” part. The “bound” part is done by estimating how good a solution we can get for each smaller problem (to do this, we may have to divide the problem further). The optimal value from a subproblem tells us whether there is a need to divide it further or not; i.e., we do not have to keep dividing until we reach trivial/small problems.
We use the linear programming relaxation to estimate a bound on the optimal solution of the integer program.
1. Initialize the list of problems (constraint sets) as L = {Ax ≤ b}. Since the objective does not change, each subproblem to be solved is represented here by its constraints.
2. Initialize x̄ = ∅, l = −∞. Here l is the best objective value seen so far in the algorithm and x̄ is the corresponding best solution seen so far.
3. While L ≠ ∅:
(a) Pick a subproblem “maximize cT x subject to A′x ≤ b′” from L, delete it from L, and solve the LP.
(b) Let x∗ be the optimal solution of this LP.
(c) If x∗ ∈ Zn and cT x∗ > l, then set x̄ = x∗ and l = cT x∗.
(d) If x∗ ∉ Zn and cT x∗ > l, then pick a fractional coordinate x∗j ∉ Z and add two subproblems to L:
• maximize cT x such that A′x ≤ b′ and xj ≤ ⌊x∗j⌋;
• maximize cT x such that A′x ≤ b′ and xj ≥ ⌈x∗j⌉.
Question: Referring to steps 3(c) and 3(d): can't cT x∗ < l? What should we do in such cases?
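A compact sketch of the branch-and-bound loop above, assuming Python with SciPy's linprog for the LP relaxations; here each subproblem is represented by per-variable bounds rather than by extra rows in A.

```python
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A, b, tol=1e-6):
    # Maximize c @ x subject to A @ x <= b, x >= 0 and x integer.
    best_val, best_x = -np.inf, None
    stack = [[(0, None)] * len(c)]    # the list L: one bounds-set per subproblem
    while stack:
        bounds = stack.pop()
        res = linprog(-np.asarray(c), A_ub=A, b_ub=b, bounds=bounds)
        if not res.success:           # infeasible subproblem: drop it
            continue
        val = -res.fun
        if val <= best_val:           # bound step: cannot beat the incumbent
            continue
        frac = [i for i, xi in enumerate(res.x)
                if abs(xi - round(xi)) > tol]
        if not frac:                  # integral solution: new incumbent
            best_val, best_x = val, np.round(res.x)
            continue
        i = frac[0]                   # branch on a fractional coordinate
        lo, up = bounds[i]
        left, right = list(bounds), list(bounds)
        left[i] = (lo, np.floor(res.x[i]))   # add x_i <= floor(x_i*)
        right[i] = (np.ceil(res.x[i]), up)   # add x_i >= ceil(x_i*)
        stack += [left, right]
    return best_val, best_x

# Example 1 below: maximize x1 + x2 s.t. x2 - x1 <= 2, 8x1 + 2x2 <= 19
print(branch_and_bound([1, 1], [[-1, 1], [8, 2]], [2, 19]))  # (4.0, [1., 3.])
```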
4.3.1 Example 1
Problem: maximize x1 + x2 subject to x2 − x1 ≤ 2, 8x1 + 2x2 ≤ 19, x1, x2 ≥ 0, x1, x2 ∈ Z.

Figure 4.1: Feasible region of the main problem

Solution:
• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x2 − x1 = 2 and 8x1 + 2x2 = 19 with the x1 and x2 axes give the points (−2, 0), (0, 2), (0, 9.5) and (2.375, 0).
• The intersection of the lines x2 − x1 = 2 and 8x1 + 2x2 = 19 with each other gives (1.5, 3.5).
• The corner points of the feasible region are (0, 0), (0, 2), (2.375, 0) and (1.5, 3.5).
• Among these, the point which gives the highest value of the objective function x1 + x2 is (1.5, 3.5), with value 5.
• However, (1.5, 3.5) is not an integer solution. We branch on x1 and divide the problem into two subproblems: one with the added constraint x1 ≤ ⌊1.5⌋ = 1 and the other with the added constraint x1 ≥ ⌈1.5⌉ = 2.
Subproblem 1:
Maximize x1 + x2
subject to:
x2 − x1 ≤ 2
8x1 + 2x2 ≤ 19
x1 ≤ 1
x1, x2 ≥ 0, x1 ∈ Z, x2 ∈ Z
Solution:

Figure 4.2: (a) Feasible region of the subproblem with the added constraint x1 ≤ 1. (b) Feasible region of the subproblem with the added constraint x1 ≥ 2

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x2 − x1 = 2 and x1 = 1 with the axes give the points (−2, 0), (0, 2) and (1, 0); the two lines intersect each other at (1, 3).
• Among the corner points of the feasible region, the one which gives the highest value of the objective function x1 + x2 is (1, 3), with value 4. This is an integer solution.
Subproblem 2:
Maximize x1 + x2
subject to:
x2 − x1 ≤ 2
8x1 + 2x2 ≤ 19
x1 ≥ 2
x1, x2 ≥ 0, x1 ∈ Z, x2 ∈ Z
Solution:
• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x1 = 2 and 8x1 + 2x2 = 19 with the axes give the points (2, 0), (0, 9.5) and (2.375, 0).
• The intersection of the lines x1 = 2 and 8x1 + 2x2 = 19 with each other gives (2, 1.5).
• Among these, the point which gives the highest value of the objective function x1 + x2 is (2, 1.5), with value 3.5, which is less than the integer value 4 obtained above. Hence there is no need to branch further, as we cannot get a solution better than 3.5; we stop here.
From the above two subproblems, we see that the best integer solution obtained is (1, 3) and the optimum z∗ is 4.
4.3.2 Example 2
Problem: maximize 2x1 + x2 subject to x1 + x2 ≤ 8, 2x1 ≤ 9, x1, x2 ≥ 0, x1, x2 ∈ Z.
Solution:
• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines 2x1 = 9 and x1 + x2 = 8 with the x1 and x2 axes give the points (4.5, 0), (0, 8) and (8, 0).
• The intersection of the lines 2x1 = 9 and x1 + x2 = 8 with each other gives (4.5, 3.5).
• Among these, the point which gives the highest value of the objective function 2x1 + x2 is (4.5, 3.5), with value 12.5.
• However, (4.5, 3.5) is not an integer solution. We branch on x1 and divide the problem into two subproblems: one with the added constraint x1 ≤ ⌊4.5⌋ = 4 and the other with the added constraint x1 ≥ ⌈4.5⌉ = 5.
Figure 4.3: (a) Feasible region of the main problem. (b) Feasible region of the subproblem with the added constraint x1 ≤ 4
Subproblem 1:
Maximize 2x1 + x2
subject to:
x1 + x2 ≤ 8
2x1 ≤ 9
x1 ≥ 5
x1, x2 ≥ 0, x1 ∈ Z, x2 ∈ Z
Solution:
There is no feasible region for the above system of constraints (2x1 ≤ 9 forces x1 ≤ 4.5, contradicting x1 ≥ 5); hence there is no solution and no need to branch further.
Subproblem 2:
Maximize 2x1 + x2
subject to:
x1 + x2 ≤ 8
2x1 ≤ 9
x1 ≤ 4
x1, x2 ≥ 0, x1 ∈ Z, x2 ∈ Z
Solution:
• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines 2x1 = 9, x1 = 4 and x1 + x2 = 8 with the axes give the points (4.5, 0), (0, 8), (8, 0) and (4, 0).
• The corner points of the feasible region are (0, 0), (4, 0), (0, 8) and (4, 4).
• Among these, the point which gives the highest value of the objective function 2x1 + x2 is (4, 4), with value 12.
• As we have obtained an integer solution (4, 4) with value 12, there is no need to branch further. So the answer is (4, 4).
4.3.3 Example 3
Problem: maximize 5x1 + 8x2 subject to x1 + x2 ≤ 6, 5x1 + 9x2 ≤ 45, x1, x2 ≥ 0, x1, x2 ∈ Z.
Solution:
• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x1 + x2 = 6 and 5x1 + 9x2 = 45 with the x1 and x2 axes give the points (0, 6), (6, 0), (0, 5) and (9, 0).
• The intersection of the lines x1 + x2 = 6 and 5x1 + 9x2 = 45 with each other gives (2.25, 3.75).
• Among these, the point which gives the highest value of the objective function 5x1 + 8x2 is (2.25, 3.75), with value 41.25; the other corner values are 30 at (6, 0) and 40 at (0, 5).
• But (2.25, 3.75) is not an integer solution. We branch on x1 and divide the problem into two subproblems: one with the added constraint x1 ≤ ⌊2.25⌋ = 2 and the other with the added constraint x1 ≥ ⌈2.25⌉ = 3.
Subproblem 1 (with the added constraint x1 ≤ 2):
Solution:

Figure 4.5: (a) Feasible region of the subproblem with the added constraint x1 ≤ 2. (b) Feasible region of the subproblem with the added constraint x1 ≥ 3

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x1 = 2 and 5x1 + 9x2 = 45 with the axes give the points (0, 5), (2, 0) and (9, 0).
• The intersection of the lines x1 = 2 and 5x1 + 9x2 = 45 with each other gives (2, 3.89).
• Among these, the point which gives the highest value of the objective function 5x1 + 8x2 is (2, 3.89), with value 41.1; the other values are 10 at (2, 0) and 40 at (0, 5).
Subproblem 2 (with the added constraint x1 ≥ 3):
Solution:
• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x1 = 3 and x1 + x2 = 6 with the axes give the points (6, 0), (3, 0) and (0, 6).
• The intersection of the lines x1 = 3 and x1 + x2 = 6 with each other gives (3, 3).
• The corner points of the feasible region are (6, 0), (3, 0) and (3, 3).
• Among these, the point which gives the highest value of the objective function 5x1 + 8x2 is (3, 3), with value 39.
• As we already have an integer solution of value 40 at (0, 5), we do not need to branch this subproblem further.
We now branch further into subproblem 1. We branch on x2 and divide subproblem 1 into two subproblems: one with the added constraint x2 ≤ ⌊3.89⌋ = 3 and the other with the added constraint x2 ≥ ⌈3.89⌉ = 4.
Subproblem 1(a) (with the added constraints x1 ≤ 2 and x2 ≤ 3):
Solution:

Figure 4.6: (a) Feasible region of the subproblem with the added constraint x2 ≤ 3. (b) Feasible region of the subproblem with the added constraint x2 ≥ 4

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The relevant lines are x1 = 2, x2 = 3 and 5x1 + 9x2 = 45; the lines x1 = 2 and x2 = 3 intersect each other at (2, 3).
• The corner points of the feasible region are (0, 0), (0, 3), (2, 0) and (2, 3).
• Among these, the point which gives the highest value of the objective function 5x1 + 8x2 is (2, 3), with value 34; the other values are 10 at (2, 0) and 24 at (0, 3).
• As this is less than the best feasible solution seen so far, there is no need to branch this subproblem further.
Subproblem 1(b) (with the added constraints x1 ≤ 2 and x2 ≥ 4):
Solution:
• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The relevant lines are x1 = 2, x2 = 4 and 5x1 + 9x2 = 45, giving the points (0, 4), (2, 0), (9, 0) and (0, 5).
• The intersection of the lines x2 = 4 and 5x1 + 9x2 = 45 gives (1.8, 4).
• Among these, the point which gives the highest value of the objective function 5x1 + 8x2 is (1.8, 4), with value 41.
• But (1.8, 4) is not an integer solution. We branch on x1 and divide the problem into two subproblems: one with the added constraint x1 ≤ ⌊1.8⌋ = 1 and the other with the added constraint x1 ≥ ⌈1.8⌉ = 2.
Subproblem 1(b)(i) (with the added constraint x1 ≥ 2):
Solution:
• Together with x1 ≤ 2, this forces x1 = 2, where 5x1 + 9x2 ≤ 45 allows at most the single point (2, 3.89); this violates x2 ≥ 4 and is not an integer solution in any case. There is no need to branch further.
Subproblem 1(b)(ii) (with the added constraint x1 ≤ 1):
Figure 4.7: (a) Feasible region of the subproblem with the added constraint x1 ≥ 2; only one point is in the region. (b) Feasible region of the subproblem with the added constraint x1 ≤ 1.

Maximize 5x1 + 8x2
subject to:
5x1 + 9x2 ≤ 45
x1 ≤ 2
x1 ≤ 1
x2 ≥ 4
x1, x2 ≥ 0, x1 ∈ Z, x2 ∈ Z
Solution:
• The feasible corner points in this region are (1, 4), (0, 4), (0, 5) and (1, 4.44), of which the best is (1, 4.44) with value 40.56. Since the objective has integer coefficients, no integer point here can have a value exceeding ⌊40.56⌋ = 40, and we already have a solution of value 40, so we can stop branching here.
From the above subproblems, we see that the best integer solution obtained is (0, 5), with value 40.
Chapter 5
LP Relaxation
• We have seen some LP and IP formulations in the past. We will see some more as we move forward.
5.1 LP Relaxation
If we are interested in designing a polynomial-time algorithm (exact or approximate) for a combinatorial optimization problem, formulating the combinatorial optimization problem as an IP is useful as a first step in the following methodology (the discussion assumes that we are working with a minimization problem):
• If the optimal LP solution has integer values, then it is a solution for the IP of cost opt(LP) = opt(IP). We have then found an optimal solution for the IP, and hence an optimal solution for our combinatorial optimization problem.
• If the optimal LP solution x∗ has fractional values, but we have a rounding procedure that transforms x∗ into an integral solution x′ such that cost(x′) ≤ c · cost(x∗) for some constant c, then we are able to find a solution to the IP of cost ≤ c · opt(LP) ≤ c · opt(IP), and so we have a c-approximate algorithm for our combinatorial optimization problem.
5.2 Bipartite Matching

Definition: When there are an equal number of nodes on each side of a bipartite graph, a perfect matching is an assignment of nodes on the left to nodes on the right, in such a way that every node is matched to exactly one node on the other side.
Formulation: Let G = (V, E) be a graph where V represents the set of vertices and E represents the set of edges. As G is bipartite, let V be divided into two disjoint sets X and Y such that |X| = |Y| and edges go from X to Y only. A perfect matching M ⊆ E is such that each vertex in X as well as Y appears exactly once in M.
Our problem of interest is to find the maximum weight matching, i.e., the matching M for which the sum of weights on the edges in M, Σ_{e∈M} we, is maximum.
Note: This and many similar problems can be seen as "selection" of a set of objects (like edges of a graph). It is common to introduce a binary variable xe ∈ {0, 1} that defines whether the object is selected or not.
The IP formulation of the problem is as follows. Let xe ∈ {0, 1} say whether edge e is in M or not.
Figure 5.1: Bipartite Matching
Maximize Σ_{e∈E} we · xe
subject to
Σ_{e∈E: v∈e} xe = 1 for all v ∈ V
xe ∈ {0, 1} ∀e ∈ E
This is an IP. Let us relax xe ∈ {0, 1} to xe ∈ [0, 1] to obtain an LP.
Maximize Σ_{e∈E} we · xe
subject to
Σ_{e∈E: v∈e} xe = 1 for all v ∈ V
xe ∈ [0, 1] ∀e ∈ E
Two things are easy to see: 1. Every feasible solution of the IP is also feasible for the LP. 2. The LP provides an upper bound on the IP (note: this is a maximization problem).
Unfortunately, this LP need not give an integer solution. However, the situation is not bad.
Rounding: Suppose the LP optimum x∗ has some fractional values, and consider a cycle (chain) of edges e1, e2, ..., et carrying fractional values. Define

ye = x∗e − ϵ for e ∈ {e1, e3, ..., et−1}
ye = x∗e + ϵ for e ∈ {e2, e4, ..., et}
ye = x∗e otherwise

It is easy to see that Σ_{e∈E: v∈e} ye = 1 still holds at every vertex, and y is a valid/feasible solution for small ϵ.
Let us now look at the cost of matching.
W(y) = Σ_e we ye = W(x∗) + ϵ Σ_i (−1)^i w_{ei} = W(x∗) + ϵΔ

Since x∗ is optimal, Δ = 0 (otherwise we would have found an ϵ such that W(y) > W(x∗); if Δ were negative, we would have taken a negative ϵ).
What does it mean? This change in xe is not going to change the matching cost. This helps in finding a rounding scheme that does not change the cost.
1. Choose ϵ as large as possible while keeping every ye in [0, 1]; at least one ye then becomes 0 or 1.
2. Then y will have fewer non-integer values. Repeat for many chains until you get an integer solution.
Result: If the LPR has a feasible solution, then it has an integer optimal solution. We can obtain it by an appropriate rounding.
5.3 Minimum Vertex Cover

Let us now consider another problem: that of finding a minimum vertex set that covers all the edges.
IP formulation: Here we need to select vertices. Let us define a binary variable xv to denote whether a vertex is selected or not.

Minimize Σ_{v∈V} xv
Such that xu + xv ≥ 1 ∀(u, v) ∈ E (5.1)
xv ∈ {0, 1}
Indeed, we can now relax the integral constraints and create an LP.
LP Relaxation:
Minimize Σ_{v∈V} xv
Such that xu + xv ≥ 1 ∀(u, v) ∈ E (5.2)
0 ≤ xv ≤ 1
Solution using LPR: Solve the LPR using an appropriate LP solver to obtain the optimal solution. Let it be x∗. This need not be an integral solution. Now we create the following vertex cover SLP from x∗:

SLP = {v ∈ V | x∗v ≥ 1/2} (5.3)

Analyzing the above equation carefully, we will see that it is a vertex cover. (Why? For every edge (u, v) we have x∗u + x∗v ≥ 1, so at least one endpoint has value at least 1/2 and is picked into SLP.)
Now we want to check how good/bad this cover is compared to the minimum.
Let y be the optimal solution of the IP and let SOPT be the corresponding vertex cover. We have |SOPT| = Σ_{v∈V} yv. Since every IP-feasible solution is LP-feasible, the IP optimum is no better than the LP optimum, so we can conclude from the LP-IP relationship:

Σ_{v∈V} x∗v ≤ Σ_{v∈V} yv (5.5)

Now we have:

|SLP| = Σ_{v∈SLP} 1 ≤ Σ_{v∈SLP} 2x∗v ≤ Σ_{v∈V} 2x∗v ≤ 2 Σ_{v∈V} yv ≤ 2|SOPT| (5.6)

That is, the rounded cover is at most twice the size of the optimal cover: a 2-approximation.
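The whole pipeline, solving the relaxation (5.2) and rounding with (5.3), is a few lines with an LP solver. This is a sketch assuming SciPy, on a small made-up graph; by the argument above, the rounded cover is at most twice the optimum.

import numpy as np
from scipy.optimize import linprog

# A small hypothetical graph on 5 vertices, given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (3, 4)]
n = 5
c = np.ones(n)                           # minimize sum_v x_v
# x_u + x_v >= 1 for every edge, i.e., -x_u - x_v <= -1.
A_ub = np.zeros((len(edges), n))
for k, (u, v) in enumerate(edges):
    A_ub[k, u] = A_ub[k, v] = -1.0
b_ub = -np.ones(len(edges))
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)
S_LP = [v for v in range(n) if res.x[v] >= 0.5]   # rounding rule (5.3)
print("LP optimum:", res.fun, " rounded cover:", S_LP)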
5.4 Facility Location Problem

A company wants to set up factories at some of the i locations so that it can supply materials to all its customers in j places. The cost of setting up the factory at location i is fi. Let xi ∈ {0, 1} indicate whether the factory at location i is set up or not. Therefore the cost of setting up the factories is Cf = Σ_i fi xi.
Now the goods need to be transferred from the factories (that are set up) to the customers. Let c(i, j) be the cost of transportation from factory i to customer j. Let yij ∈ {0, 1} indicate whether customer j is assigned to factory i. The total transportation cost is Cr = Σ_{ij} yij c(i, j). The final problem is now:
minimize Σ_i fi xi + Σ_{ij} c(i, j) yij
subject to
Σ_i yij ≥ 1 ∀j
xi ≥ yij ∀i, j
The first constraint says that each customer should be assigned to at least one facility. The second says that if a customer is assigned to a facility, then that facility must be open. This is an IP. We relax the integrality constraints as xi ∈ [0, 1] and yij ∈ [0, 1]. This leads to an LP.
Question: Suggest an appropriate rounding scheme for the above problem. Can you also come up with a bound on the approximation error?
Hint:
http://pages.cs.wisc.edu/~shuchi/courses/787-F09/scribe-notes/lec10.pdf
5.5 Maximum Independent Set

Definition: Independent set: Let G = (V, E) be a graph. A set s ⊆ V such that no two vertices in s are connected by an edge in G is called an independent set. Our goal is to maximize the number of elements in s.
IP formulation:
Maximize Σ_{v∈V} xv
Such that xu + xv ≤ 1 where (u, v) ∈ E (5.8)
xv ∈ {0, 1}
LP Relaxation:
Maximize Σ_{v∈V} xv
Such that xu + xv ≤ 1 where (u, v) ∈ E (5.9)
0 ≤ xu ≤ 1
Analysis: We can see that if we set every xv = 1/2, then all the above constraints are satisfied, and the objective is then |V|/2. Therefore the optimal value is |V|/2 or larger. So the LP has a feasible solution, and it can be clearly seen that

LP∗ ≥ |V|/2 (5.10)
Consider a fully connected graph. The maximum independent set is a single node; however, LP∗ is n/2. This does not allow us to come up with a guarantee on the solution.
As we can see, unlike the minimum vertex cover problem, there are no good bounds here. So LP relaxation tells us nothing useful about Maximum Independent Set.
5.6 Reading
Chapter 6
More on IP Formulations
• We had seen how many of the classical problems (e.g., graph algorithms) get formulated as an IP.
• We had seen how to solve an IP by (i) branch and bound and (ii) LP relaxation.
• We had also seen how approximate algorithms can be designed using LP relaxation.
• Here, we will see how many problems become natural IP problems.
6.1 BIP and MIP Formulations
We had seen how many problems get formulated as LP, and some others as IP. We now see two special classes of problems.
• BIP: Binary integer programming problems. In this class the decision variables are binary i.e., xi ∈
{0, 1}
• MIP: Mixed integer programming problems. In this class, some of the decision variables are real and
some others integer.
In binary problems, each variable can only take on the value of 0 or 1. This may represent the selection or
rejection of an option, the turning on or off of switches, a yes/no answer, or many other situations.
A mixed integer programming (MIP) problem results when some of the variables in your model are real
valued (can take on fractional values) and some of the variables are integer valued. The model is therefore
“mixed”. When the objective function and constraints are all linear in form, then it is a mixed integer linear
program (MILP). In common parlance, MIP is often taken to mean MILP, though mixed integer nonlinear
programs (MINLP) also occur, and are much harder to solve.
In the next sections, we will see a variety of problems modelled as IPs, mostly BIPs.
6.2 Function of K Discrete Variables

Consider a problem where x1 + x2 + x3 will have to be 5 or 10 or 20, depending on something else (say the mode of transportation or the cost of raw material). We encode this as

x1 + x2 + x3 = 5y1 + 10y2 + 20y3

with yi as binary variables and only one of the yi allowed to be 1 at a time. This last constraint is added as

y1 + y2 + y3 = 1
6.3 Either-OR Constraints

This class of constraints arises when we need to enforce only one of the constraints. Assume we had a constraint of the form |x1| ≥ 3. This results in two disconnected sets (a non-convex region/set). It only means that either x1 ≥ 3 or x1 ≤ −3. If we convert both to ≤ form,

x1 ≤ −3
−x1 ≤ −3
Both cannot be true at the same time. How do we handle this? The idea is simple: when a constraint is getting violated, we make it true "trivially". Let L be a large quantity. (For the sake of argument, it can be the maximum value your integer/real variable can store; in practice, it is a value that is reasonable to manage numerically.) We can write these constraints as either
x1 ≤ −3
−x1 ≤ −3 + L
or
x1 ≤ −3 + L
−x1 ≤ −3
Note that the addition of the large quantity makes the constraint trivially true. In the first set, only the first constraint is active, while in the second, only the second constraint is active. However, we do not know in advance which constraint needs to be made true. For this purpose, we add an additional binary variable y, i.e.,
x1 ≤ −3 + y · L
−x1 ≤ −3 + (1 − y) · L
y ∈ {0, 1}
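A minimal sketch of this trick with a solver, assuming SciPy >= 1.9 (which provides scipy.optimize.milp); L = 100 is an arbitrary big constant for this toy instance. We minimize x1 over x1 in [0, 10] subject to "x1 ≤ −3 or x1 ≥ 3", so the solver is forced into the x1 ≥ 3 branch and returns x1 = 3.

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

L = 100.0                        # the "large quantity"
# Variables v = [x1, y]; the binary y selects the active constraint.
c = np.array([1.0, 0.0])         # minimize x1
#  x1 - L*y <= -3        (forces x1 <= -3 when y = 0)
# -x1 + L*y <= -3 + L    (forces x1 >=  3 when y = 1)
A = np.array([[1.0, -L],
              [-1.0, L]])
cons = LinearConstraint(A, -np.inf, np.array([-3.0, L - 3.0]))
res = milp(c, constraints=cons, integrality=np.array([0, 1]),
           bounds=Bounds([0.0, 0.0], [10.0, 1.0]))
print(res.x)                     # expect [3., 1.]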
6.4 K out of N Constraints

Suppose only K out of N constraints need to hold. Using the same idea:

f1(x) ≤ b1 + y1 L
f2(x) ≤ b2 + y2 L
.........
fN(x) ≤ bN + yN L
Σ_i yi = N − K

This final constraint can be interpreted as follows. Since we want K constraints to hold out of N, there must be N − K constraints that need not hold. So this constraint ensures that N − K of the binary variables take the value 1, so that the associated L terms are turned on, thereby eliminating those constraints.
6.5 Modelling Compound Alternatives
Consider the problem of allowing three disjoint regions in the constraints, as shown in the figure. This can be done using the ideas discussed above.
Region 1:
f1 (x1 , x2 ) − Ly1 ≤ b1
f2 (x1 , x2 ) − Ly1 ≤ b2
Region 2:
f3 (x1 , x2 ) − Ly2 ≤ b3
f4 (x1 , x2 ) − Ly2 ≤ b4
Region 3:
f5 (x1 , x2 ) − Ly3 ≤ b5
f6 (x1 , x2 ) − Ly3 ≤ b6
f7 (x1 , x2 ) − Ly3 ≤ b7
with additional constraints such as
y1 + y2 + y3 ≤ 2
x1 ≥ 0, x2 ≥ 0
yi ∈ {0, 1}
Most of the problems that we have seen have a variable cost (a linear function as the objective): if xi increases, our objective ci xi also increases.
Consider a class of problems where there is, in addition, a fixed cost for a variable. For example, when you sign an agreement, you pay an initial amount, and then you pay for the regular use of xi (say the bandwidth). The initial cost may be associated with setting up a specific hardware unit or cabling so that internet connectivity can be provided.
Therefore the objective is 0 if x1 = 0, and K + c1x1 if x1 > 0. Here K is the fixed charge.
This can be modelled as

Minimize Ky + c1x1 + (rest of the terms)
subject to:
x1 − Ly ≤ 0
(other constraints)
y ∈ {0, 1}
Consider the cost of production of a certain material. This can be c1x until the production reaches 5 units. Then the per-unit cost could be different, say c2, until the production reaches 12, and then c3 until 20, etc. Note that when the production is 7, the cost is 5c1 + 2c2. Here the objective is piecewise linear.
Let d1, d2 and d3 be the amounts of production in the three cost ranges. When d2 is nonzero, d1 must be 5. When d3 is nonzero, d1 must be 5 and d2 must be 7.

x = d1 + d2 + d3
The cost is c1d1 + c2d2 + c3d3. By adding two additional binary variables, the constraints are:

5w1 ≤ d1 ≤ 5
7w2 ≤ d2 ≤ 7w1
0 ≤ d3 ≤ 8w2

When w1 = 1, d1 is at its upper bound, i.e., 5. When w2 = 1, d2 is at its upper bound, i.e., 7.
We had seen how branch and bound can be used to solve IPs. Let us see another algorithm specially designed to solve BIPs: Balas' additive algorithm. It requires the problem to be in a standard form:

• The objective function has the form Minimize Z = Σ_i ci xi with xi as binary variables.
• The m constraints are of the form Σ_j aij xj ≥ bi for i = 1, ..., m.
• The ci are non-negative. Also 0 ≤ c1 ≤ c2 ≤ ... ≤ cn.
This may seem restrictive at first look, but it is not that bad. For example, negative coefficients are handled by changing xi to 1 − xi.
The objective function is a minimization, the ci are non-negative, and xi ∈ {0, 1}. Therefore, if x = 0 (all xi = 0) is feasible, it is the minimum.
If there are N variables, there are 2^N possible configurations of x to evaluate. Balas' algorithm uses a depth-first search approach.
Balas' algorithm starts expanding the tree with x1 and then moves to x2, etc. At each step, it evaluates the objective (or comes up with a bound) and checks whether any of the constraints are violated. When we find a feasible point, we stop expanding that branch further. Also, the bound that we obtain for certain nodes tells us whether to expand a node further or not.
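A rough sketch of this search in code, a plain depth-first branch and bound on the standard form above rather than an exact reproduction of Balas' bookkeeping: variables are fixed in the order x1, x2, ..., a partial assignment is abandoned when its cost cannot beat the incumbent or when some constraint can no longer be satisfied, and a branch is closed as soon as the fixed part alone satisfies all constraints (setting the free variables to 0 is then optimal, since all ci >= 0).

def balas_bb(c, A, b):
    """Minimize sum c[j]*x[j] over binary x subject to A x >= b,
    assuming 0 <= c[0] <= c[1] <= ... (the standard form above)."""
    n = len(c)
    best = [float("inf"), None]          # [best cost, best x]

    def dfs(j, x, cost):
        if cost >= best[0]:
            return                       # bound: remaining c[j] are >= 0
        for ai, bi in zip(A, b):         # infeasibility pruning
            attained = sum(ai[k] * x[k] for k in range(j))
            attainable = attained + sum(max(ai[k], 0) for k in range(j, n))
            if attainable < bi:
                return
        if all(sum(ai[k] * x[k] for k in range(j)) >= bi
               for ai, bi in zip(A, b)):
            best[0], best[1] = cost, x + [0] * (n - j)
            return                       # feasible: stop expanding this branch
        if j == n:
            return
        dfs(j + 1, x + [0], cost)        # try the cheaper option first
        dfs(j + 1, x + [1], cost + c[j])

    dfs(0, [], 0.0)
    return best

# Made-up instance: min x1 + 2*x2 + 3*x3
# s.t. x1 + x2 + x3 >= 2 and 2*x1 - x2 + x3 >= 1
print(balas_bb([1, 2, 3], [[1, 1, 1], [2, -1, 1]], [2, 1]))  # [3.0, [1, 1, 0]]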
Chapter 7
More on LP Relaxation
• We had seen how many of the classical problems (e.g., graph algorithms) get formulated as an IP.
• We had seen how to solve an IP by (i) branch and bound, (ii) LP relaxation and (iii) Balas' algorithm.
• We had also seen how approximate algorithms can be designed using LP relaxation. Here we have one more example.
7.1 Scheduling for Unrelated Parallel Machines
Scheduling is the allocation of shared resources over time to competing activities. It has been the subject of a significant amount of literature in the operations research field. The emphasis has been on investigating machine scheduling problems, where jobs represent activities and machines represent resources. Each machine can process at most one job at a time.
In this lecture we consider the Makespan Scheduling problem. We are interested in minimizing the time required to complete all the jobs; that is, the time at which the last machine completes its last job should be minimized. There are n jobs, indexed by the set J = {1, ..., n}, and m machines for scheduling, indexed by the set M = {1, ..., m}. Also given, dij is the time job j takes to run on machine i. Then ti = Σ_{j∈J} xij dij is the completion time of machine i, where xij is 1 when job j is assigned to machine i and 0 otherwise. The task is to assign jobs to machines so that t = maxi(ti), i.e., the maximum completion time, also called the makespan of the schedule, is minimized. Note that the order in which the jobs are processed on a particular machine does not matter.
There are many variants of the problem. Some of them are:
1. Minimum makespan scheduling on identical machines: dij = dj ∀i.
2. Minimum makespan scheduling on unrelated machines: each machine can take a different time to do a job. This is the problem of interest to us.
3. Non-splittable jobs: xij ∈ {0, 1}, and splittable jobs: xij ∈ [0, 1].
Minimum makespan scheduling for splittable jobs can be solved exactly in polynomial time (the problem is in P and can be formulated as an LP). Here we discuss only minimum makespan scheduling on unrelated machines for non-splittable jobs (this problem is NP-hard). For identical machines, you can use the references.
For the IP formulation given later in this section, with xij ∈ {0, 1}:
• Number of variables: nm + 1 (the xij plus t)
• Number of constraints: n + m
Some Observations on LP: Let us make some observations on LP; we can come back to these in detail at a later stage.
• Consider a system of equations Ax = b where A is m × n. If m ≤ n, there could be multiple solutions. If x1 is a solution and x2 is a solution, then αx1 + (1 − α)x2 is also a valid solution.
• Let B be an m × m submatrix of A formed by selecting m columns. Let the x vector restricted to the selected columns be xB. Then BxB = b. Assume we solve this; xB may be a vector of non-zero elements. Consider the original vector with elements from xB and all other elements equal to zero: xi = 0 if i ∉ B.
• This vector is a basic feasible solution (a vertex of the convex polygon of our interest). At least one of the basic feasible solutions is optimal.
• In x, at least n − m variables/elements are zero.
Note: We know the importance of solving Ax = b. This will be the focus of the next few lectures.
Using LP Relaxation
• There are two machines, m1 and m2 .
• There are 2n + 1 variables and n + 2 constraints.
• How many zeros are there in {xij |1 ≤ j ≤ n, i = 1 or 2}?
Answer: At a vertex, at least (2n + 1) − (n + 2) = n − 1 of them are zero, i.e., at most n + 2 are non-zero.
We know t > 0; therefore the number of non-zeros in {xij | 1 ≤ j ≤ n, i = 1 or 2} is at most n + 1. This means at most one job needs to be split across the 2 machines, with x1j + x2j = 1. Let s be the job that gets split. We assign s to m1 if x1s > x2s; otherwise, we assign it to m2 (Figure 7.2 shows an example). If T∗ is the IP optimal makespan and Tapprox the makespan after rounding the LP relaxation, then Tapprox ≤ 2T∗.
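A sketch of this procedure in code (SciPy assumed; the 2 × n matrix d of processing times is made up for illustration): solve the splittable LP, then give the one split job to the machine holding its larger fraction.

import numpy as np
from scipy.optimize import linprog

d = np.array([[4.0, 2.0, 3.0, 1.0],     # d[i, j]: time of job j on machine i
              [2.0, 5.0, 2.0, 4.0]])
m, n = d.shape                          # m = 2 machines, n jobs
# Variables: x[i*n + j] for each machine/job pair, then t; minimize t.
c = np.zeros(m * n + 1); c[-1] = 1.0
A_eq = np.zeros((n, m * n + 1))         # each job fully assigned
for j in range(n):
    A_eq[j, j] = A_eq[j, n + j] = 1.0
b_eq = np.ones(n)
A_ub = np.zeros((m, m * n + 1))         # machine load <= t
for i in range(m):
    A_ub[i, i * n:(i + 1) * n] = d[i]
    A_ub[i, -1] = -1.0
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * (m * n) + [(0, None)])
x = res.x[:m * n].reshape(m, n)
assign = np.argmax(x, axis=0)           # the split job goes to its larger fraction
makespan = max(d[i][assign == i].sum() for i in range(m))
print("LP bound:", res.fun, " rounded makespan:", makespan)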
Theorem 7.2.1. The LP relaxation used here gives a 2-approximation for makespan scheduling on 2 machines.
Proof. Let T∗ be the IP optimal makespan, Tapprox the makespan after rounding, Ts the processing time of s, and T′ the makespan before assigning s. Then

Tapprox ≤ T′ + Ts ≤ T∗ + T∗ = 2T∗

since T′ ≤ T∗ and Ts ≤ T∗.
Here we consider makespan scheduling on unrelated machines for non-splittable jobs, which means that job j takes time dij if scheduled on machine i. First, we define an IP to solve the problem. The algorithm is based on a suitable LP formulation and a procedure for rounding the LP. The formulation for minimum makespan scheduling on unrelated machines can be written as
minimize t
subject to
Σ_{i=1}^{m} xij = 1 ∀j
Σ_{j=1}^{n} xij dij ≤ t ∀i
xij ∈ {0, 1}
If we relax the constraints xij ∈ {0, 1}, it turns out that this formulation has an unbounded integrality gap. The main cause of the problem is an unfair advantage of the LP relaxation.
Example Suppose we have only one job, which has a processing time of m on each of the m machines.
Clearly, the minimum makespan is m. However, the optimal solution to the linear relaxation is to schedule
the job to the extent of 1/m on each machine, thereby leading to an objective function value of 1, and giving
an integrality gap of m.
∑n
If dij > t for an arbitrary t, then we must have xij = 0 in any feasible integer solution due to j=1 xij dij ≤ t ∀i
constraint. But we might have fractional xij > 0 in feasible fractional solutions for LP relaxed problem. However,
we can not formulate the statement “if dij > t then xij = 0” in terms of linear constraints. The question arises here
is therefore, how to correctly choose t?
Parametric Pruning
We will make use of a technique called parametric pruning to overcome this difficulty. We "guess" a parameter t which is a lower bound for the actual makespan T∗. One way to obtain a suitable value is to do binary search on t. Note that since t is now known, we can check whether dij > t, and we are therefore able to enforce the constraints xij = 0 for all machine-job pairs (i, j) for which dij > t. We now define a family lp(t) of linear programs, one for each value of the parameter t. lp(t) uses only those variables xij for which (i, j) ∈ St, where St = {(i, j) : dij ≤ t}, and asks if there is a feasible solution. Remember that we have relaxed the constraints on the variables to xij ≥ 0. With t fixed, lp(t) is:

Σ_{i:(i,j)∈St} xij = 1 ∀j
Σ_{j:(i,j)∈St} xij dij ≤ t ∀i
xij ≥ 0 ∀(i, j) ∈ St
Let T be the minimum value (the LP optimum) for which lp(t) has a feasible solution, obtained using binary search, and let T∗ be the IP optimal makespan. Then certainly T∗ ≥ T; that is, the actual makespan is bounded below by T. But we still do not have an IP feasible solution! Our solution is obtained by rounding an extreme point solution of lp(T). We will see shortly that the makespan thus obtained from rounding is at most 2T∗.
LP-rounding
Clearly, an extreme point solution to lp(t) has at most n + m non-zero variables. Also, it is easy to prove that any extreme point solution to lp(t) must set at least n − m jobs integrally, i.e., with xij ∈ {0, 1}.
The LP-rounding algorithm is based on several interesting properties of extreme point solutions of lp(T). For an extreme point solution x of lp(T), define a bipartite graph G = (M ∪ J, E) such that (i, j) ∈ E if and only if xij > 0. Let F ⊆ J be the set of fractionally set jobs in x. Let H be the subgraph of G induced by the vertex set M ∪ F; clearly, (i, j) ∈ E(H) iff 0 < xij < 1. A matching in H is called a perfect matching if it matches every job j ∈ F.
Each job that is integrally set in x has degree 1, with exactly one edge incident on it in G (Figure 1(a)). Remove these jobs together with their incident edges from G; the resulting graph is H (Figure 1(b)). In H, each job has degree at least two, so all leaves in H must be machines. Keep matching a leaf with the job it is incident on, and remove them both from the graph (Figure 1(c)); at each stage all leaves must be machines. In the end we will be left with even cycles. Match alternate edges of each cycle. This gives a perfect matching P (Figure 1(d)). Refer to Figure 2 for an example.
Figure 7.1: Steps in LP rounding for minimum makespan scheduling on unrelated machines. M is the set of machines, J is the set of jobs, and edge (Ji, Mk) means job Ji has been scheduled on machine Mk. All nodes in J with degree 1 have been integrally set, and all nodes with degree at least two have been fractionally set. (a) G = (M ∪ J, E). Jobs J1 and Jn are integrally set. (b) H = (M ∪ F, E′) contains only fractionally set jobs. (c) Assign a machine to the job that has an edge to a leaf machine node and remove them from the graph (assign Mn to J4, remove them and all edges incident on J4). (d) Match alternate edges of each cycle (M2 to J2 and M3 to J3).
Figure 7.2: An example showing the steps of LP rounding for minimum makespan scheduling on unrelated machines. M is the set of machines, J is the set of jobs, and edge (Ji, Mk) means job Ji has been scheduled on machine Mk. (a) G = (M ∪ J, E). J1 is integrally set. (b) H = (M ∪ F, E′) contains only fractionally set jobs, obtained by removing J1 and edge x11 from G. (c) Assign a machine to the job that has an edge to a leaf machine node and remove them from the graph (assign M1 to J4, remove them and all edges incident on J4). (d) Match alternate edges of each cycle (M2 to J2 and M3 to J3).
Theorem 7.2.2. The LP relaxation used here gives a 2-approximation for minimum makespan scheduling on unrelated machines.
Proof. Let T be the LP optimal value and T∗ the IP optimal value. Then clearly T ≤ T∗, since we chose T such that lp(T) has a solution. The extreme point solution x of lp(T) has a fractional makespan of at most T; therefore the integrally set jobs contribute a makespan of at most T on each machine. Each edge (i, j) of the graph H satisfies dij ≤ T, and the perfect matching found in H schedules at most one extra job on each machine. Hence the total makespan is at most 2T ≤ 2T∗.
Chapter 8
Solving Ax = b
8.1 Introduction
In this lecture, let us consider solving a system of linear equations of the form
Ax = b (8.1)
We assume here that A is square and non-singular, so the solution is x = A⁻¹b; and since b ≠ 0, the solution is non-trivial (x ≠ 0). Computing A⁻¹ explicitly is a costly operation, and it also has numerical accuracy and stability issues in many situations. If the structure of A is known, and it has some special properties, we can use this knowledge to solve Ax = b much more efficiently. This lecture looks at solving the problem when A is an identity matrix, a permutation matrix, a triangular matrix and a positive definite matrix. As the constraints on the matrix A ease, the complexity of solving the system increases.
In order to compute the complexity of these operations, we are not interested in the big-O complexity. Instead, we
try to calculate the number of flops or floating point operations needed for each method. We define a flop as one
addition, subtraction, multiplication or division of two floating-point numbers. To evaluate the complexity of an
algorithm, we count the total number of flops, express it as a function (usually a polynomial) of the dimensions of
the matrices and vectors involved, and simplify the expression by ignoring all terms except the leading (i.e., highest
order or dominant) terms.
If A is an identity matrix, then x = A⁻¹b = b. This does not require any computation, so the flop count is 0. A trivial problem to solve.
A Permutation matrix P is a square binary matrix that has exactly one entry of 1 in each row and each column
and 0s elsewhere. Each such matrix represents a specific permutation of n elements and, when used to multiply
another matrix, can produce that permutation in the rows or columns of the other matrix.
Example:

P = [1 0 0]
    [0 0 1]
    [0 1 0]

is a permutation matrix. Multiplying the vector (2, 3, 4) by this matrix gives (2, 4, 3).
Result: The inverse of a permutation matrix is its transpose, i.e., P⁻¹ = Pᵀ. Hence the value of x is nothing but a permutation of the vector b. In this case also, the flop count is 0.
A diagonal matrix is a matrix in which the entries outside the main diagonal are all zero; the diagonal entries themselves may or may not be zero. Thus, the matrix A = (ai,j) with n rows and n columns is diagonal if ai,j = 0 whenever i ≠ j. Here xi = bi/aii, so the flop count is n (one division per element).
Suppose A is a lower triangular matrix of order n with nonzero diagonal elements. Consider a system of equations
Ax = b:
[a11  0   ...  0 ] [x1]   [b1]
[a21 a22  ...  0 ] [x2]   [b2]
[ ..  ..  ...  ..] [..] = [..]    (8.3)
[an1 an2  ... ann] [xn]   [bn]
We solve for x as:

x1 = b1 / a11
x2 = (b2 − a21 x1) / a22
x3 = (b3 − a31 x1 − a32 x2) / a33
...
xn = (bn − an1 x1 − an2 x2 − ... − a_{n,n−1} x_{n−1}) / ann

Flop count = 1 + 3 + 5 + ... + (2n − 1) = n²
Recursive Formulation: If A is lower triangular, it can be represented as

[a11   0 ]
[A21  A22]

where
• a11 is 1 × 1
• 0 is 1 × (n − 1)
• A21 is (n − 1) × 1
• A22 is a lower triangular matrix of size (n − 1) × (n − 1)
Now, the forward substitution algorithm can be written recursively using this representation.
[a11   0 ] [x1]   [b1]
[A21  A22] [X2] = [B2]    (8.4)

1. x1 = b1 / a11
2. Solve A22 X2 = B2 − A21 x1 by forward substitution.
Similarly, when A is upper triangular (backward substitution):
1. xn = bn / ann
2. Solve A11 X1 = B1 − A12 xn by backward substitution.
Flop count = n²
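Both substitutions are a few lines of code. A sketch in NumPy, following the iterative formulas above:

import numpy as np

def forward_substitution(A, b):
    """Solve Ax = b for lower triangular A with nonzero diagonal."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # x_i = (b_i - a_i1 x_1 - ... - a_{i,i-1} x_{i-1}) / a_ii
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

def backward_substitution(A, b):
    """Solve Ax = b for upper triangular A with nonzero diagonal."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 0.0], [3.0, 4.0]])
print(forward_substitution(A, np.array([4.0, 10.0])))   # [2., 1.]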
8.3.1 PD Matrix
A symmetric matrix A is positive definite (PD) if zᵀAz > 0 for every non-zero vector z.

Example 1. The identity matrix I = [1 0; 0 1] is positive definite because, for every real vector z = [a, b]ᵀ, zᵀIz = zᵀz = a² + b², which is positive.
A = LLT (8.6)
where L is lower triangular with positive diagonal elements. This is called the Cholesky factorization of A. If n = 1, i.e., A is a scalar, then the Cholesky factor of A is just the square root of A.
Example 3. An example of a 3 × 3 Cholesky factorization is

[1  2  3]   [1 0 0] [1 2 3]
[2 20 26] = [2 4 0] [0 4 5]
[3 26 70]   [3 5 6] [0 0 6]

The Cholesky factorization takes (1/3)n³ flops.
Partitioning A = LLᵀ as

[a11 A21ᵀ]   [l11   0 ] [l11 L21ᵀ]
[A21  A22] = [L21  L22] [ 0  L22ᵀ]

gives a11 = l11², A21 = l11 L21, and A22 = L21 L21ᵀ + L22 L22ᵀ ⟹ (A22 − L21 L21ᵀ) = L22 L22ᵀ.

Algorithm: Cholesky
1. Calculate l11 = √a11
2. Calculate L21 = A21 / l11
3. Compute the Cholesky factorization of A22 − L21 L21ᵀ (of size (n − 1) × (n − 1)) to get L22.
Let us redo Example 3 with this algorithm.

• l11 = √a11 = 1
• L21 = A21 / l11 = [2, 3]ᵀ. Now

L = [1   0   0 ]
    [2  l22  0 ]
    [3  l32 l33]

• We have to do a Cholesky factorization of

A22 − L21 L21ᵀ = [20 26; 26 70] − [4 6; 6 9] = [16 20; 20 61] = [l22 0; l32 l33] [l22 l32; 0 l33]

• l22 = √16 = 4
• l32 = 20/4 = 5
• The matrix is now

[1 0  0 ]
[2 4  0 ]
[3 5 l33]

• We have to factorize 61 − 5 · 5 = 36. This gives l33 = 6.
• The final answer is

[1  2  3]   [1 0 0] [1 2 3]
[2 20 26] = [2 4 0] [0 4 5]
[3 26 70]   [3 5 6] [0 0 6]
To compute the inverse, solve

AX = I, i.e., Axi = ei for i = 1, ..., n (8.8)

using the method for solving equations with multiple right-hand sides described in the previous section. Only one Cholesky factorization of A is required, plus n forward and backward substitutions. The cost of computing the inverse using this method is (1/3)n³ + 2n³ = (7/3)n³.
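The recursive algorithm above translates almost line by line into code. A sketch in NumPy, checked against the 3 × 3 example:

import numpy as np

def cholesky(A):
    """Return lower triangular L with A = L L^T (A assumed PD)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    L[0, 0] = np.sqrt(A[0, 0])            # l11 = sqrt(a11)
    if n == 1:
        return L
    L[1:, 0] = A[1:, 0] / L[0, 0]         # L21 = A21 / l11
    # Recurse on A22 - L21 L21^T.
    L[1:, 1:] = cholesky(A[1:, 1:] - np.outer(L[1:, 0], L[1:, 0]))
    return L

A = np.array([[1.0, 2.0, 3.0], [2.0, 20.0, 26.0], [3.0, 26.0, 70.0]])
L = cholesky(A)
print(L)                                  # [[1,0,0],[2,4,0],[3,5,6]]
print(np.allclose(L @ L.T, A))            # True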
8.7.1 Example 1
Compute the Cholesky factorization of

A = [ 4   6   2   −6]
    [ 6  34   3   −9]
    [ 2   3   2   −1]
    [−6  −9  −1   38]
• Determine l11 and L21 = [l21, l31, l41]ᵀ:

l11 = √a11 = √4 = 2
L21 = (1/l11) [6, 2, −6]ᵀ = [3, 1, −3]ᵀ

• Next, factorize A22 − L21 L21ᵀ:

[34   3  −9]   [ 3]              [25 0  0]   [l22   0    0 ] [l22 l32 l42]
[ 3   2  −1] − [ 1] [3 1 −3]  =  [ 0 1  2] = [l32  l33   0 ] [ 0  l33 l43]
[−9  −1  38]   [−3]              [ 0 2 29]   [l42  l43  l44] [ 0   0  l44]

• l22² = 25 ⟹ l22 = 5; l32 = 0/l22 = 0; l42 = 0/l22 = 0
• l33² = 1 ⟹ l33 = 1; l43 = 2/l33 = 2
• To solve for l44 we have 29 − 2 · 2 = l44² ⟹ l44 = 5

[ 4   6   2  −6]   [ 2 0 0 0] [2 3 1 −3]
[ 6  34   3  −9] = [ 3 5 0 0] [0 5 0  0]
[ 2   3   2  −1]   [ 1 0 1 0] [0 0 1  2]
[−6  −9  −1  38]   [−3 0 2 5] [0 0 0  5]
8.7.2 Example 2
1. What is the Cholesky factor of the (n+1) × (n+1) matrix

B = [A   u]
    [uᵀ  1]

where B is positive semidefinite?

• [A u; uᵀ 1] = [L11 0; L21 l_{n+1,n+1}] [L11ᵀ L21ᵀ; 0 l_{n+1,n+1}]
• A = L11 L11ᵀ, therefore L11 = L (the Cholesky factor of A).
• u = L11 L21ᵀ, therefore L21ᵀ = L⁻¹u.
• For the last entry, l²_{n+1,n+1} = 1 − L21 L21ᵀ. Note that

1 − uᵀA⁻¹u = 1 − (L L21ᵀ)ᵀ (LLᵀ)⁻¹ (L L21ᵀ)
           = 1 − L21 Lᵀ L⁻ᵀ L⁻¹ L L21ᵀ
           = 1 − L21 L21ᵀ
           = l²_{n+1,n+1}

which is a positive scalar when 1 − uᵀA⁻¹u > 0.
• Hence, in that case, B is positive definite.
8.7.3 Example 3
Solve LX + XLᵀ = B efficiently, given L (a lower triangular matrix) and B, with lii + ljj ≠ 0 ∀i, j. Analyze the complexity.
• Let us write the equation using the recursive form:

[l11   0 ] [x11 X12]   [x11 X12] [l11 L21ᵀ]   [b11 B12]
[L21  L22] [X21 X22] + [X21 X22] [ 0  L22ᵀ] = [B21 B22]
• Simplifying the L.H.S. and equating to the R.H.S., we get the following four equations:

2 l11 x11 = b11 (8.8)
l11 X12 + x11 L21ᵀ + X12 L22ᵀ = B12 (8.9)
L21 x11 + L22 X21 + X21 l11 = B21 (8.10)
L21 X12 + X21 L21ᵀ + L22 X22 + X22 L22ᵀ = B22 (8.11)
• Equation (8.8) can be solved in 1 flop.
• For equation (8.9), the unknown is X12 (a 1 × (n − 1) row vector). Simplifying the L.H.S. (multiplications and additions of matrices) takes 2(n − 1)² + 3(n − 1) flops.
• After equating the L.H.S. to the R.H.S., we get a system of linear equations of the form X12 (l11 I + L22ᵀ) = B12 − x11 L21ᵀ. This upper triangular system of equations can be solved using backward substitution in (n − 1)² flops.
• The total number of flops for equation (8.9) is 3(n − 1)² + 3(n − 1) ≈ 3n².
• Similarly, equation (8.10) can be solved in ≈ 3n² flops.
• For equation (8.11), the unknown is X22. The equation can be rewritten as L22 X22 + X22 L22ᵀ = B22 − L21 X12 − X21 L21ᵀ.
• The R.H.S. can be computed in 5(n − 1)² flops. The equation is again of the form LX + XLᵀ = B, which can be solved recursively.
• Each recursive step at size n takes about 1 + 3n² + 3n² + 5n² ≈ 11n² flops on its own. The total number of flops is

Σ_{i=1}^{n} 11i² = 11 · n(n+1)(2n+1)/6 = (11/6)(2n³ + 3n² + n)

Considering only the leading term for the flop calculation, the total number of flops is (11/3)n³.
8.7.4 Exercise
Given that A (n × n) is a P.D. matrix and

B = [ A  −A]
    [−A  βA]

is a P.D. matrix, find the range of values of β.
Chapter 9
9.1 Review and Summary
Matrix decomposition or matrix factorization is the method of transforming a given matrix into a product of canonical matrices. Matrix decompositions are usually carried out to make the problem computationally convenient to solve and simple to analyze. For example, matrix inversion, solving linear systems and least squares fitting can be infeasible to solve optimally in an explicit manner; converting them into a set of easier tasks, such as solving a diagonal or triangular system, helps speed up the process. It also helps in identifying the underlying structure of the matrices involved. In the previous class, we took a look at the Cholesky factorization method.
Recall the basic problem:
We want to solve a system of linear equations in which each equation is of the form aᵀx = c, where a = [a1, a2, ..., an]ᵀ is an n-vector (a vector of size n) with ai ∈ R ∀i = 1, ..., n, and x = [x1, x2, ..., xn]ᵀ is a variable n-vector. The resultant equation is a1x1 + a2x2 + ... + anxn = c.
Indeed, many of our discussions on real matrices are directly applicable to matrices of complex numbers, as such or with minimal changes. However, that is not attempted here.
Flops = total number of floating point operations required to perform a numerical algorithm.
Let us assume that m = n (A is a square matrix) and also non-singular (the inverse A⁻¹ exists). The following table shows the flops required to solve the linear equations Ax = b for the different types of matrices:

Identity: 0
Permutation: 0
Diagonal: n
Triangular: n²
Positive definite (via Cholesky): (1/3)n³
General non-singular (via LU): (2/3)n³
9.2 LU Factorization
9.2.1 Definition
Factorize or decompose a square non-singular matrix into a product of two matrices L and U, with the help of a permutation matrix if necessary:

A = PLU

where A is any non-singular matrix, P is a permutation matrix, L is a lower triangular matrix, and U is an upper triangular matrix.
Most of the time, P is an identity matrix. For a matrix A, multiple LU-decompositions are possible; to get a unique decomposition, the diagonal elements of either L or U must be one. From now onwards in our discussion we assume that P = I and Lii = 1 ∀i. The standard algorithm for computing the LU-decomposition is called Gaussian elimination.
9.2.2 LU Factorization
An example of an LU factorization is
[0 5 5]   [0 0 1] [ 1    0    0] [6  8    8   ]
[2 9 0] = [0 1 0] [1/3   1    0] [0 19/3 −8/3 ]    (9.1)
[6 8 8]   [1 0 0] [ 0  15/19  1] [0  0  135/19]

The algorithm for computing the LU factorization is called Gaussian elimination, and it takes (2/3)n³ flops.
The Gaussian elimination method in its pure form is at times unstable, and the permutation matrix P is used to control this instability. It does so by permuting the order of the rows of the matrix being operated upon; such operations are called pivoting. However, it is possible to factor several non-singular matrices as LU without pivoting. In the following subsection, we first describe the method and then provide an algorithm for it.
A = LU (9.2)

Partition this as

[a11 A12]   [ 1    0 ] [u11 U12]
[A21 A22] = [L21  L22] [ 0  U22]

(l11 = 1 as L is a unit lower triangular matrix). Equating both sides and performing the appropriate substitutions gives us:

u11 = a11, U12 = A12, L21 = (1/a11) A21 (9.5)
L22 U22 = A22 − (1/a11) A21 A12 (9.6)

L22 and U22 can be calculated recursively by performing an LU factorization of dimension (n − 1) on the matrix on the right-hand side of equation (9.6). This process continues till we arrive at a 1 × 1 matrix. The algorithm can be summarized as follows:
9.3.1 Computational procedure
9.3.2 Example 1
A = [0  1]
    [1 −1]

Attempting

A = [0  1]   [ 1   0 ] [0 U12]
    [1 −1] = [L21 L22] [0 U22]

is not possible, since u11 = a11 = 0 would require L21 = (1/0) · 1. Suppose

P = [0 1]
    [1 0]

P A = [0 1] [0  1]   [1 −1]
      [1 0] [1 −1] = [0  1]

Therefore,

P A = [1   0 ] [1  −1]
      [0  L22] [0 U22]

and L22 U22 = 1, giving L22 = 1 and U22 = 1. Therefore,

P A = [1 0] [1 −1]
      [0 1] [0  1]

Finally,

A = [0 1] [1 0] [1 −1]
    [1 0] [0 1] [0  1]
9.3.3 Example 2
A = [6 3 1]
    [2 4 3]
    [9 5 2]

Factoring,

[6 3 1]   [ 1   0   0] [u11 u12 u13]
[2 4 3] = [l21  1   0] [ 0  u22 u23]
[9 5 2]   [l31 l32  1] [ 0   0  u33]

[6 3 1]   [ 1   0   0] [6  3   1 ]
[2 4 3] = [1/3  1   0] [0 u22 u23]
[9 5 2]   [3/2 l32  1] [0  0  u33]

L22 U22 = [ 1   0] [u22 u23]   [4 3]   [2]
          [l32  1] [ 0  u33] = [5 2] − [9] (1/6) [3 1]

        = [4 3]   [ 1   1/3]   [ 3   8/3]
          [5 2] − [9/2  3/2] = [1/2  1/2]

giving u22 = 3, u23 = 8/3, l32 = (1/2)/3 = 1/6, and u33 = 1/2 − (1/6)(8/3) = 1/18. Hence

A = [6 3 1]   [ 1   0   0] [6 3  1  ]
    [2 4 3] = [1/3  1   0] [0 3 8/3 ]
    [9 5 2]   [3/2 1/6  1] [0 0 1/18]
To compute the inverse of A, solve AX = I. That is to say, we can solve the system of n equations Axi = ei, where xi is the ith column of A⁻¹ and ei is the ith unit vector. The cost of this computation is (2/3)n³ + n(2n²) = (8/3)n³ flops (one LU factorization plus n forward and n backward substitutions).
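A sketch of the recursion of equations (9.5)-(9.6) in NumPy. No pivoting is done, so it assumes the leading entries never vanish; Example 2's matrix is used as a check.

import numpy as np

def lu(A):
    """Return (L, U) with A = L U and L unit lower triangular (no pivoting)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    U[0, 0] = A[0, 0]                     # u11 = a11
    U[0, 1:] = A[0, 1:]                   # U12 = A12
    if n == 1:
        return L, U
    L[1:, 0] = A[1:, 0] / A[0, 0]         # L21 = A21 / a11
    L[1:, 1:], U[1:, 1:] = lu(A[1:, 1:] - np.outer(L[1:, 0], U[0, 1:]))
    return L, U

A = np.array([[6.0, 3.0, 1.0], [2.0, 4.0, 3.0], [9.0, 5.0, 2.0]])
L, U = lu(A)
print(L)                                  # [[1,0,0],[1/3,1,0],[3/2,1/6,1]]
print(U)                                  # [[6,3,1],[0,3,8/3],[0,0,1/18]]
print(np.allclose(L @ U, A))              # True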
Solving Ax = b via the explicit inverse, x = A⁻¹b, therefore requires a total of (8/3)n³ + n² + n(n − 1) flops (computing A⁻¹ and then the matrix-vector product).
9.6.1 Example 3
Solve

[6 3 1] [x1]   [18]
[2 4 3] [x2] = [17]
[9 5 2] [x3]   [29]

We shall use the factorization performed in Example 2. As P = I, the permutation step does not affect the outcome. We first solve

[ 1   0  0] [z1]   [18]
[1/3  1  0] [z2] = [17]
[3/2 1/6 1] [z3]   [29]

which gives the solution z = (18, 11, 1/6) by forward substitution. Finally, we solve

[6 3  1  ] [x1]   [18 ]
[0 3 8/3 ] [x2] = [11 ]
[0 0 1/18] [x3]   [1/6]

which gives x = (2, 1, 3) by backward substitution.
Exercise: Consider the n × n matrix

A = [a1      1  0  ...  0]
    [a2      0  1  ...  0]
    [ :                  ]
    [a_{n−1} 0  0  ...  1]
    [a_n     0  0  ...  0]

(so that the ith equation of Ax = b reads ai x1 + x_{i+1} = bi for i < n, and an x1 = bn).
1. When is A non-singular?
2. Assume A is non-singular: how many floating point operations do you need to solve Ax = b?
3. Assume A is non-singular: what is A⁻¹ (in other words, express the elements of A⁻¹ in terms of a1, a2, ..., an)?
Solution: Consider Ax = 0:

x2 = −a1 x1
x3 = −a2 x1
...
xn = −a_{n−1} x1
an x1 = 0

If an ≠ 0, then from the last equation x1 = 0, and because of that the remaining elements of x also become zero: x2 = x3 = ... = xn = 0, i.e., x = 0. So A is non-singular.
If an = 0, set x1 = 1; then x2 = −a1, x3 = −a2, etc., and we obtain a non-zero x with Ax = 0. So if an = 0, the matrix is singular.
To solve Ax = b, the equations are

an x1 = bn
a1 x1 + x2 = b1
a2 x1 + x3 = b2
...
a_{n−1} x1 + xn = b_{n−1}

so x1 = bn / an, and then

x2 = b1 − a1 x1
x3 = b2 − a2 x1
...
xn = b_{n−1} − a_{n−1} x1

This takes only about 2n flops (one division and n − 1 multiply-subtract pairs).
9.7 QR Factorization
Factorizations have a wide range of applications. However, some factorizations like Cholesky are not always appropriate or efficient enough, due to certain inherent restrictions. For example, the least squares problem can be solved faster by the QR factorization, which we cover in this section (more in the next lecture). We also need methods to factorize non-square matrices.
A=QR
where Q is an m × n orthogonal matrix and R is an n × n upper triangular matrix with positive diagonal elements.
This is called the QR factorization of A.
Just to recap, an m × n orthogonal matrix has the property that:
1. QT Q = I, when m > n
2. QQT = I, when m < n
3. QT Q = QQT = I, when m = n
An example of a QR factorization is
[3 −6  26]   [3/5   0   4/5]
[4 −8  −7] = [4/5   0  −3/5] [5 −10 10]
[0  4   4]   [ 0   4/5   0 ] [0   5  5]    (9.7)
[0 −3  −3]   [ 0  −3/5   0 ] [0   0 25]
The first step is to partition the A, Q and R matrices in the following manner:

A = [a1 A2],  Q = [q1 Q2],  R = [r11 R12]
                                [ 0  R22]

We also know that r11 > 0 and that R22 is an upper triangular matrix with positive diagonals. Comparing the left and the right matrices we can conclude that a1 = q1 r11. But since q1 is unit norm, r11 = ∥a1∥ and q1 = a1/r11. And A2 = q1 R12 + Q2 R22.
To simplify this, notice that we can premultiply both sides by q1ᵀ, thereby obtaining R12 = q1ᵀ A2 (using q1ᵀq1 = 1 and q1ᵀQ2 = 0).
9.8.1 Algorithm: QR Factorization
The algorithm to compute the QR factors of a matrix can be concisely written as follows:
1. Compute the first row/column values: r11 = ∥a1∥, q1 = a1/r11 and R12 = q1ᵀ A2.
2. Compute the QR factors of A2 − q1 R12 as Q2 R22.
9.8.2 Example
Find the QR factorization of

A = [2  8  13 ]
    [4  7  −7 ]
    [4 −2 −13]

Show the steps.
Solution:
Recursive algorithm:
1. r11 = ∥a1∥
2. q1 = a1/∥a1∥
3. R12 = q1ᵀ A2
4. Compute Q2 R22 by QR factorization of A2 − q1 R12

Step 1:

r11 = ∥a1∥ = √(4 + 16 + 16) = 6;  q1 = a1/6 = [1/3, 2/3, 2/3]ᵀ;
R12 = q1ᵀ A2 = [1/3 2/3 2/3] [8 13; 7 −7; −2 −13] = [6 −9];

A2 − q1 R12 = [ 8  13]   [2 −3]   [ 6  16]
              [ 7  −7] − [4 −6] = [ 3  −1]
              [−2 −13]   [4 −6]   [−6  −7]

So q1 is the first column of Q, the first row of R is [6 6 −9], and the remaining problem is the QR factorization of [6 16; 3 −1; −6 −7].

Step 2:

r = √(36 + 9 + 36) = 9;  q2 = [2/3, 1/3, −2/3]ᵀ;  R12 = q2ᵀ [16, −1, −7]ᵀ = 15;

[16]   [ 10]   [ 6]
[−1] − [  5] = [−6]
[−7]   [−10]   [ 3]

So the second row of R is [0 9 15], and it remains to factorize the single column [6, −6, 3]ᵀ.

Step 3:

r = √(36 + 36 + 9) = 9;  q3 = [2/3, −2/3, 1/3]ᵀ;  A2 − q1 R12 is empty, so nothing remains. The last row of R is [0 0 9].

Finally,

A = [1/3  2/3  2/3] [6 6 −9]
    [2/3  1/3 −2/3] [0 9 15]
    [2/3 −2/3  1/3] [0 0  9]
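The same recursion in code: a sketch of the classical Gram-Schmidt procedure derived above, checked on the example just computed. (For numerical work one would prefer a Householder-based routine such as numpy.linalg.qr; this version is for illustration.)

import numpy as np

def qr(A):
    """Return (Q, R) with A = Q R and the columns of Q orthonormal."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    R[0, 0] = np.linalg.norm(A[:, 0])     # r11 = ||a1||
    Q[:, 0] = A[:, 0] / R[0, 0]           # q1 = a1 / r11
    if n == 1:
        return Q, R
    R[0, 1:] = Q[:, 0] @ A[:, 1:]         # R12 = q1^T A2
    Q[:, 1:], R[1:, 1:] = qr(A[:, 1:] - np.outer(Q[:, 0], R[0, 1:]))
    return Q, R

A = np.array([[2.0, 8.0, 13.0], [4.0, 7.0, -7.0], [4.0, -2.0, -13.0]])
Q, R = qr(A)
print(R)                                  # [[6,6,-9],[0,9,15],[0,0,9]]
print(np.allclose(Q @ R, A))              # True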
9.9 Applications of QR
Let us consider two typical applications (when A is square).
Solution to Ax = b:

Ax = b ⟹ QRx = b ⟹ Rx = Qᵀb

This requires a matrix-vector product followed by solving a triangular system of equations.
Formally, the singular value decomposition of an m × n real or complex matrix M is a factorization of the form M = UDVᵀ, where U is an m × m real or complex unitary matrix, D is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and Vᵀ (the conjugate transpose of V, or simply the transpose of V if V is real) is an n × n real or complex unitary matrix. The diagonal entries Dii of D are known as the singular values of M. The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of M, respectively.
The singular value decomposition and the eigendecomposition are closely related: the left-singular vectors of M are eigenvectors of MMᵀ, the right-singular vectors of M are eigenvectors of MᵀM, and the non-zero singular values of M are the square roots of the non-zero eigenvalues of both MMᵀ and MᵀM.
Applications that employ the SVD include computing the pseudoinverse, least squares fitting of data, matrix approximation, and determining the rank, range and null space of a matrix.
M = Σ_i Dii ui viᵀ
If A is square and non-singular, with SVD A = UDVᵀ, then

A⁻¹ = V D⁻¹ Uᵀ

D⁻¹ is easy to calculate since D is diagonal: n flops.
If A is singular (or nearly so), one can find an approximation by discarding the (near-)zero singular values: take Di⁻¹ = 1/Di if Di > t, and zero otherwise.
Consider a set of homogeneous equations Ax = 0. Any vector x in the null space of A is a solution. Hence any column of V whose corresponding singular value is zero is a solution.
9.12.1 Example
Demonstrate that the Cholesky factorization can be done in (1/3)n³ operations.
Recursive algorithm: calculate the first column of L: l11 = √a11 and L21 = A21/l11.
Explanation:

[1 + 2 + ... + n] + 2[1 + 4 + ... + (n − 1)²]

(the sum of the first n natural numbers, plus twice the sum of the squares of the first n − 1 natural numbers)

= n(n+1)/2 + 2 · (n−1)n(2n−1)/6 ≈ (1/3)n³
9.12.2 Example
Suggest an efficient algorithm to compute Z = (I + A⁻¹ + A⁻² + A⁻³)b. Analyze the complexity/flops.
Solution:

Z = (I + A⁻¹ + A⁻² + A⁻³)b

Consider w = A⁻¹b ⟹ Aw = b ⟹ LUw = b
Consider x = A⁻¹w ⟹ Ax = LUx = w
Consider y = A⁻¹x ⟹ Ay = LUy = x

Then Z = b + w + x + y: one LU factorization ((2/3)n³ flops) followed by three pairs of triangular solves (≈ 2n² flops each) and the vector additions.
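In code (SciPy assumed), this is one factorization followed by three triangular solves; a sketch:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def compute_Z(A, b):
    """Z = (I + A^-1 + A^-2 + A^-3) b with a single LU factorization."""
    lu_piv = lu_factor(A)        # (2/3) n^3 flops, done once
    w = lu_solve(lu_piv, b)      # w = A^-1 b   (~2 n^2 flops)
    x = lu_solve(lu_piv, w)      # x = A^-2 b
    y = lu_solve(lu_piv, x)      # y = A^-3 b
    return b + w + x + y

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(compute_Z(A, b))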
9.12.3 Exercise
Consider solving

(D + uvᵀ)x = b

where u, v and b are given n-vectors, and D is a given diagonal matrix. The diagonal elements of D are nonzero and uᵀD⁻¹v ≠ −1.
1. What is the cost of solving these equations using the following method?
(i) First calculate A = D + uvᵀ.
(ii) Then solve Ax = b using the standard LU method.
2. Compute the inverse of the above matrix A using a suitably efficient algorithm and determine the cost.
9.12.4 Excercise
−3 2 0 3
6 −6 0 −12
Calculate the LU factorization without pivoting of the matrix A = . Provide all the steps
−3 6 −1 16
12 −14 −2 −15
during calculation.
Chapter 10
10.1 Introduction
In the last two lectures, we had seen the problem of solving Ax = b when A is non-singular and square. What if m ≠ n? There are two cases that we discuss today: (i) m > n and (ii) m < n.
• Consider the case when m > n, i.e., the number of equations is more than the number of variables/unknowns. Linear equations with m > n are called over-determined equations. They may not have a solution that satisfies all the equations. However, we explore the "most approximate" (or optimal) solution, leading to an optimization problem. This is a problem of much practical interest, since the lack of a consistent solution to the entire set of equations may be due to practical issues like 'noise'. They are often formulated as a least square error (LSE) minimization problem, popularly known as the least squares problem. Least squares (LS) problems are optimization problems in which the objective (error) function is expressed as a sum of squares. They have a natural relationship to model fitting problems, and the solutions may be computed analytically using the tools of linear algebra.
• When m < n the situation is very different. There may be too many solutions that satisfy the too few equations that we have. In such situations our interest is in finding a solution that has some special character, for example, the one with minimum norm.
We first derive the closed form expressions for these two (for the L2 norm) and then discuss how computationally efficient and accurate solutions can be designed.
Note that ri(x) is the ith component of Ax − b. Often ri is called the residual or error and will have some physical significance. Let us come back to our familiar matrix form of the equations:

Ax = b

Our problem is now to minimize ∥r∥ such that Ax + r = b, or: minimize ∥Ax − b∥.
(Note: If you do not know how to differentiate functions of matrices/vectors, read: Tom Minka’s technical report on
“Old and New Matrix Algebra Useful for Statistics”.)
2AᵀAx − 2Aᵀb = 0, or

AᵀAx = Aᵀb (10.1)

Multiplying both sides by (AᵀA)⁻¹,

x = (AᵀA)⁻¹ Aᵀ b
Note that AᵀA is PD, and we can use Cholesky to solve this. Our approach is to form an equation Cx = d with C = AᵀA and d = Aᵀb. The steps can be summarized as:

1. Compute C = AᵀA and d = Aᵀb.
2. Compute the Cholesky factorization C = LLᵀ.
3. Solve Lw = d by forward substitution.
4. Solve Lᵀx = w by backward substitution.

Complexity of step 1: computing C needs mn² flops (note that C is symmetric, so only about half of the elements need to be computed), and computing d needs 2mn flops.
Complexity of step 2: (1/3)n³.
Complexity of step 3: n².
Complexity of step 4: n².
Therefore, the total cost is mn² + 2mn + (1/3)n³ + n² + n², which is taken as (1/3)n³ + mn².
Solution using QR:
1. Factorize A as QR.
2. Compute w = Qᵀb.
3. Solve Rx = w by backward substitution.
Therefore, the total cost is 2mn² + 2mn + n², which is approximated as 2mn².
Solution using SVD: substituting A = UDVᵀ in the normal equations,

(AᵀA)x = Aᵀb
[(UDVᵀ)ᵀ(UDVᵀ)]x = (UDVᵀ)ᵀb
[VDUᵀUDVᵀ]x = VDUᵀb
[VD²Vᵀ]x = VDUᵀb
DVᵀx = Uᵀb

We solve this as:

p = Uᵀb
Dw = p
Vᵀx = w ⟹ x = Vw
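A sketch of the QR route on an over-determined system (random data purely for illustration), checked against the normal equations route:

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))        # m = 20 equations, n = 3 unknowns
b = rng.standard_normal(20)
# QR route: A = QR, then solve Rx = Q^T b by backward substitution.
Q, R = np.linalg.qr(A)                  # "reduced" QR: Q is 20x3, R is 3x3
x_qr = solve_triangular(R, Q.T @ b)
# Normal equations route: (A^T A) x = A^T b.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_qr, x_ne))          # True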
Now consider the under-determined case. The solution of

Minimize ∥x∥
subject to Ax = b

is unique and is given by x̂ = Aᵀ(AAᵀ)⁻¹b.
Verify: First we verify that this is the solution; below we also show how it can be derived using Lagrangians. Note that x̂ is feasible: Ax̂ = AAᵀ(AAᵀ)⁻¹b = b. For any other feasible x, A(x − x̂) = 0, and so x̂ᵀ(x − x̂) = bᵀ(AAᵀ)⁻¹A(x − x̂) = 0. Thus we have

∥x∥² = ∥x̂ + (x − x̂)∥² = ∥x̂∥² + ∥x − x̂∥² ≥ ∥x̂∥²
Derivation:

Minimize xᵀx
subject to Ax = b

Combining the constraints, the Lagrangian is

L(x, λ) = xᵀx + λᵀ(Ax − b)

Setting the derivatives to zero:

2x + Aᵀλ = 0
Ax − b = 0

or

x = −Aᵀλ / 2

Substituting,

A(−Aᵀλ)/2 − b = 0
λ = −2(AAᵀ)⁻¹b

Substituting again,

x = Aᵀ(AAᵀ)⁻¹b
Solution using Cholesky:
1. Compute C = AAᵀ.
2. Decompose C as C = LLᵀ.
3. Solve Lw = b by forward substitution.
4. Solve Lᵀz = w by backward substitution.
5. Find x = Aᵀz.
Solution using QR:

Aᵀ = QR (10.4)

where QᵀQ = I and R is an upper triangular matrix with positive diagonal. Then

x̂ = Aᵀ(AAᵀ)⁻¹b
  = QR(RᵀQᵀQR)⁻¹b
  = QR(R⁻¹R⁻ᵀ)b
  = QR⁻ᵀb

1. Compute the QR factorization Aᵀ = QR.
2. Solve Rᵀz = b by forward substitution.
3. Compute x = Qz.
Solution using SVD:

x̂ = Aᵀ(AAᵀ)⁻¹b

We can factorize Aᵀ using the SVD, Aᵀ = UDVᵀ, and substitute in the above equation. This leads to

DUᵀx = Vᵀb

We solve this as:

p = Vᵀb
Dw = p
Uᵀx = w ⟹ x = Uw

1. Factorize Aᵀ = UDVᵀ.
2. Compute p = Vᵀb.
3. Solve Dw = p.
4. Find x = Uw.
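A sketch of the QR route for the minimum norm solution on an under-determined system (random data for illustration); the result matches the closed form Aᵀ(AAᵀ)⁻¹b:

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8))             # m = 3 equations, n = 8 unknowns
b = rng.standard_normal(3)
Q, R = np.linalg.qr(A.T)                    # A^T = QR
z = solve_triangular(R.T, b, lower=True)    # R^T z = b (forward substitution)
x = Q @ z                                   # the minimum norm solution
print(np.allclose(A @ x, b))                                # feasible
print(np.allclose(x, A.T @ np.linalg.solve(A @ A.T, b)))    # matches closed form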
Another solution of special interest in the m < n case is the sparsest one: the x with the fewest non-zero elements. Assuming each |xi| is bounded by some large constant R, this can be posed as an IP:

Minimize 1ᵀz
Subject to: Ax = b; |xi| ≤ Rzi; and zi ∈ {0, 1}

and relaxed to an LP:

Minimize 1ᵀz
Subject to: Ax = b; |xi| ≤ Rzi; and 0 ≤ zi ≤ 1

Observing that zi = |xi|/R at the optimum, the problem is equivalent to:

Minimize ∥x∥1 / R
Subject to: Ax = b
This is similar to replacing the L0 norm by the L1 norm. Assume we use the L1 norm instead of the L0 norm:

Minimize ∥x∥1 subject to Ax = b

Question: Can the L1 norm based solution be the same as that of L0? Or how bad can it be? Let us wait for some lectures to know the answer.
Equivalent LP:

Minimize u1 + u2 + ... + un
Subject to: Ax = b
−u ≤ x ≤ u
x, u ∈ Rⁿ and u ≥ 0
10.7.1 Example 1
Consider the regularized least squares problem: minimize ∥Ax − b∥² + ∥x∥². Setting the gradient to zero:

2AᵀAx + 2x − 2Aᵀb = 0
2Aᵀb = 2AᵀAx + 2x
Aᵀb = (AᵀA + I)x

Solution using Cholesky:
1. Compute B = AᵀA + I.
2. Compute the Cholesky factorization B = LLᵀ.
3. We need to solve LLᵀx = Aᵀb:
4. Solve Lw = Aᵀb by forward substitution.
5. Solve Lᵀx = w by backward substitution.

Solution using QR:
1. Compute B = AᵀA + I.
2. We need to solve Bx = Aᵀb.
3. Factorize Bᵀ = QR.
4. Solve Rᵀz = Aᵀb by forward substitution.
5. Compute x = Qz.
10.7.2 Example 2
Alternatively, substitute A = QR directly in the normal equations of the same problem; the dominant cost is then the QR factorization, about 2mn² flops.
10.7.3 Exercise

min ∥x − x0∥² (10.6)
such that
Ax = b (m < n) (10.7)
TODO Notes (i) Fix flops for SVD (ii) double check costs.
Chapter 11
11.1 Introduction
Constrained Optimization vs Unconstrained Optimization
11.4 Problems
Chapter 12
12.1 Eigen Values and Eigen Vectors
In linear algebra, an eigenvector or characteristic vector of a square matrix is a vector that does not change its direction under the associated linear transformation. That is, if

Ax = λx

then x is an eigen vector of A and λ is the corresponding eigen value. Note that an n × n matrix A may have at most n distinct eigen values. Eigen values can also be zero.
Geometrically, an eigenvector corresponding to a real, nonzero eigenvalue points in a direction that is stretched by the transformation, and the eigenvalue is the factor by which it is stretched.
Another intuitive explanation of eigen vectors is as follows. We know that when a vector gets multiplied by a matrix (a linear transformation), the vector in general changes its direction. However, certain vectors do not change their direction; they are the eigen vectors, i.e., Ax = λx: there is only a scale change, characterized by λ, on multiplication. Such vectors x are the eigen vectors and the corresponding λ, the scale factor, is the eigen value. If A is an identity matrix, every vector is an eigen vector, and all of them have eigen value λ = 1.
To compute the eigen values and eigen vectors, we start with Ax = λx, or (A − λI)x = 0. If this system is to have a non-trivial solution, the determinant of A − λI should be zero:

|A − λI| = 0

Solving the above for λ gives the eigen values. Substituting each λ back in Ax = λx yields the corresponding x.
Example: Compute the eigen values and eigen vectors of the following matrix.

[1 2 3]
[4 5 6]
[7 8 9]
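For the example above, the answer can be checked numerically. Note that one eigen value is (numerically) zero: this matrix is singular.

import numpy as np

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
vals, vecs = np.linalg.eig(A)
print(vals)                   # approx [16.12, -1.12, 0]
for k in range(3):            # verify A x = lambda x for each pair
    print(np.allclose(A @ vecs[:, k], vals[k] * vecs[:, k]))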
12.2 Applications in Optimization
A number of problems lead to a formulation of the form: Minimize or Maximize xᵀAx with an additional constraint that ∥x∥2 = 1. Often we need the constraint to avoid the trivial cases.
Let us assume that we are given N points in 2D, i.e., (xi, yi). We are interested in finding the equation of a line ax + by + c = 0 that minimizes the sum of orthogonal distances. Such a best-fit line passes through the mean of the points; let us assume that the points are mean centered, i.e., the mean is zero, so the line passes through the origin, and its equation can be written as ax + by = 0. With no loss of generality we can assume that the vector u = [a, b]ᵀ is normalized such that its norm is unity.
Assume that the data is arranged in an N × 2 data matrix M. We are interested in

Minimize ∥Mu∥² such that ∥u∥ = 1

Let us take an SVD of M as UDVᵀ. Then

∥Mu∥² = [Mu]ᵀ[Mu] = uᵀMᵀMu = uᵀVDUᵀUDVᵀu = uᵀVD²Vᵀu

Since V is an orthogonal matrix, multiplication by Vᵀ does not change the length of u. Let v = Vᵀu. The problem is now

Minimize vᵀD²v such that ∥v∥ = 1

The minimum is the square of the smallest singular value, attained when u is the corresponding column of V.
Discussed along with the SVD: when A is not a square matrix, the eigen vectors of AᵀA and AAᵀ are related to the V and U in the SVD. However, which eigen vector should we pick: the one corresponding to the smallest or the largest eigen value? That depends on whether we are minimizing or maximizing the quadratic form.
12.5 Optimization: Application in PCA
Problem: Given a set of samples x1, x2, ..., xN, each with M features (x_i¹, ..., x_i^M), find a new feature representation (where each new feature is a linear combination of the original features) as

Yi = AXi

where Yi is a k × 1 vector, A is a k × M matrix and Xi is an M × 1 vector (usually k < M).
Let the original vector xj be projected onto a new dimension ui (a basis vector) as vij = xj · ui.
It is easy to observe that the mean after the projection is the same as the projection of the original mean:

x̄ · ui = (Σ_j xj / N) · ui = (1/N) Σ_j (xj · ui) = (1/N) Σ_j vij = v̄i

or else

v̄ = x̄ · [u1, ..., uk] = U x̄
Objective
Let the objective be to maximize the variance after projection (so that only minimal information is lost). Let us first find the "best dimension" u in this regard, i.e.,

Max_u var(v) = Max_u Σ_i ∥vi − v̄∥²

Max_u Σ_i ∥vi − v̄∥² = Max_u Σ_i ∥(xi − x̄) · u∥²

Max_u uᵀ [Σ_i (xi − x̄)(xi − x̄)ᵀ] u = Max_u uᵀΣu

An unconstrained optimization of this could give an arbitrarily large u. Therefore, we introduce the constraint uᵀu = 1, and the problem becomes

Maximize uᵀΣu − λ(uᵀu − 1)

Differentiating with respect to u and equating to zero gives

Σu = λu

It can now easily be seen that the basis vectors which preserve maximum variance are the eigen vectors with the largest eigen values, i.e., the eigen vectors sorted in decreasing order of eigen value.
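A compact sketch of this recipe on made-up 2-D data: mean-center, form Σ, and sort the eigen vectors by decreasing eigen value.

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)             # mean-center the samples
Sigma = Xc.T @ Xc / len(Xc)         # covariance matrix
vals, vecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
order = np.argsort(vals)[::-1]      # decreasing eigen value order
A = vecs[:, order].T                # rows are the principal directions
Y = Xc @ A.T                        # the new features Y_i = A X_i
print(vals[order])                  # variance preserved by each direction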
12.6 Optimization: Graph Cuts and Clustering
Let us now look at another interesting application of eigen vectors. Consider a graph G = (V, E).
We are interested in partitioning the graph into two subsets of vertices A and B; that is, we want to find a cut, measured as the sum of the weights of the edges that we need to cut.
This has many applications in clustering. Consider that we are given N points {x1, ..., xN} and we are interested in clustering these points into two clusters. Assume A is an affinity (or similarity) matrix where Aij is the similarity of xi and xj. This is also, in a way, related to the weight matrix W of the graph. If one needs more intuition, one can define the weight wij as e^{−d(xi,xj)/σ}.
Let wij be the weight of the edges. We need a cut that partitions the vertices into two; it cuts a set of edges:

Cut(A, B) = Σ_{i∈A, j∈B} wij

However, this is not very useful in many cases. A more useful measure is

NCut(A, B) = Cut(A, B) (1/Vol(A) + 1/Vol(B))

with

Vol(A) = Σ_{i∈A, j∈V} wij = Σ_{i∈A} di
This normalization helps avoid the degenerate solution in which a single outlier point is cut off as the preferred partition.
Let D(i, i) = Σ_j wij (a diagonal matrix of vertex degrees; D − W is the graph Laplacian). The normalized cut problem leads to the eigen system

D^{−1/2}(D − W)D^{−1/2} z = λz

where z = D^{1/2}y. Using the Rayleigh quotient, the second smallest eigen vector turns out to be the real-valued solution to the normalized cut problem.
Ax = λBx
When B = I, this problem reduces to the standard eigen value problem Ax = λx seen earlier.
If we multiply both sides by B⁻¹ (or A⁻¹), we can get back the simple eigen value problem; however, matrices like B⁻¹A need not be symmetric.
A standard trick is to find a square root of B (either as B^{1/2} or using Cholesky, B = LLᵀ), i.e., Ax = λLLᵀx, or L⁻¹Ax = λLᵀx, or (L⁻¹AL⁻ᵀ)(Lᵀx) = λ(Lᵀx), which is a standard symmetric eigen value problem in z = Lᵀx.
Chapter 13
13.1 Introduction
In the previous lectures, we have studied several formulations of problems as LPs and some of the applications of LP. We had also seen how some of the toy LPs get solved on paper. We now describe an efficient way to solve linear programs. This method is popular and is widely used in many real life problems: the well known simplex method. The graphical method of solving linear programs becomes tedious and computationally intensive for large linear programs (when the number of variables increases). As the number of variables increases, every new constraint adds an exponential number of vertices to be evaluated (see below). The graphical method basically requires enumerating and evaluating the objective on all the possible vertices or extreme points of the feasible region described by the constraints of the linear program. This makes the algorithm unsuitable for practical use in large linear programs. The simplex method describes a procedure to solve these linear programs efficiently. In this lecture, we start with a brief conceptual introduction to the simplex method, and then proceed to describe more formal details of the algorithm. In the next couple of lectures, we will see how the simplex method can be understood/derived, and how a tableau based procedure can be used to solve it on paper.
We had argued in the past that the optimum is an extreme point of the convex polygon (or feasible region) formed by the constraints. Based on this, let us state a naive version (Ver 0) of the simplex algorithm. This is a simplified and intuitive description of the simplex algorithm.
1. Form the feasible region, and start with any one of the extreme points.
2. Repeat:
• Move to one of the neighbouring extreme points which has a better objective value.
• If no such point exists, terminate the algorithm with the current extreme point as the optimum.
There are many important questions to be answered at this stage, both theoretical and practical. For example, will this algorithm converge? Will it lead to a global optimum? How do we move from one extreme point to another?
The procedure starts with any one of the extreme points. It then moves in one specific direction that improves the objective. Those who are familiar with gradient descent style optimization may argue that we should pick the direction of greatest change in the cost. That is fine, but there are more important aspects to look into. However, if the objective function is convex (which is so in our case), it does not really matter: eventually we will reach the same extreme point (local as well as global optimum). Given that the linear programming problem is a convex optimization problem, the solution found will be optimal. The above algorithm keeps finding a better solution till it reaches a local minimum (i.e., none of its neighbours have a better solution). Therefore, by the property of convex optimization problems, this local optimum is also the global optimum that we are looking for.
The above algorithm aims at finding the best solution by exploiting the advantages of convex optimization. A linear program is a convex optimization problem: it optimizes a convex function over a convex region/set.
A set S (e.g., S ⊆ Rⁿ) is convex if, for any x1, x2 ∈ S and for θ ∈ [0, 1],

x3 = θx1 + (1 − θ)x2

is also in S. This simply implies that the line segment joining x1 and x2 is also within the set.
Since our method searches over a set of extreme points, let us see how many extreme points are present in an LP. In fact, there are too many.
The feasible region of a linear programming problem, given by {x | Ax ≤ b}, where A is an m × n matrix, c ∈ Rⁿ and b ∈ Rᵐ, is a polyhedron. A polyhedron has a finite number of extreme points or vertices, bounded by

mCn = m! / (n!(m − n)!)

which is the number of ways of choosing n of the m constraint equations (when m > n). As m and n increase, the value of mCn increases rapidly. Hence, for a general L.P. problem, the number of vertices can be very large. Let us not think of finding all of them and evaluating the objective on all of them.
13.1.2 Remark on Computational Complexity
1. The simplex is a generalization of the triangle to arbitrary dimensions. You may see many related aspects of the simplex on https://en.wikipedia.org/wiki/Simplex
Minimize: cT x
such that Ax ≤ b; x ≥ 0
where A is an m × n matrix, c ∈ Rⁿ denotes the coefficients of the objective function, and b ∈ Rᵐ is a vector of constants. We are given A, b and c; we need to find the optimal x.
In order to use the simplex algorithm, we require the linear program to be cast in a specific format known as the standard slack form. Therefore, conversion of the above inequalities into equalities is required. This is done by adding a variable to each of the constraints. Hence, the LP problem becomes
Minimize: cT x
such that Ax = b
x≥0
The slack form consists of a linear program's constraints expressed as Ax = b, that is, without the use of any inequality expressions. The slack form of the linear program is expected in the following format:
Minimize: cT x
Subject To: Ax = b
x≥0
The conversion of the linear program to the slack form is done by introducing an additional slack variable for each inequality constraint in the original standard form expression. For example, if the original linear program has m constraints on n variables, the converted slack form's constraints become:
a11 x1 + a12 x2 + . . . + a1n xn + xn+1 = b1
a21 x1 + a22 x2 + . . . + a2n xn + xn+2 = b2
...
am1 x1 + am2 x2 + . . . + amn xn + xn+m = bm
113
Therefore the new slack-form constraint matrix A in Ax = b has dimensions m × (m + n), due to the additional m
variables introduced. These new variables are known as slack variables. Their values are set, based on the other
variables, to make up the slack in each inequality, thereby allowing us to express the LP as an equality system.
This slack form will be very important for our formulation of the simplex method.
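As a quick illustration of this conversion, here is a minimal NumPy sketch; the helper name to_slack_form is ours, not from the notes. It appends one slack variable per inequality, exactly as in the equations above.

import numpy as np

def to_slack_form(A, b, c):
    # Convert min c^T x s.t. Ax <= b, x >= 0 into
    # min c'^T x' s.t. [A | I] x' = b, x' >= 0 by appending slack variables.
    m, n = A.shape
    A_slack = np.hstack([A, np.eye(m)])          # shape m x (n + m)
    c_slack = np.concatenate([c, np.zeros(m)])   # slack variables cost nothing
    return A_slack, b, c_slack

# Demo on the small LP used later in this chapter:
A = np.array([[-1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 3.0, 2.0])
c = np.array([-1.0, -1.0])
A_s, b_s, c_s = to_slack_form(A, b, c)
print(A_s)    # the 3 x 5 matrix [A | I]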
We know that the algorithm of enumerating all vertices to find the minimum is neither efficient nor attractive. Instead, the
alternative algorithm below can be used (we also saw the same in the last section).
We now assume that we have the linear program expressed in slack form as Ax = b, where A has m constraints and
n variables. Here, in the slack form, due to the introduced variables, we have m ≤ n, and therefore Ax = b has multiple
valid solutions. In such a situation, if we have two possible solutions x1 and x2, then all x of the form
x = θx1 + (1 − θ)x2, θ ∈ [0, 1]
are solutions to the LP. This is an important property that the simplex method uses to jump from one
solution to another (till we reach the optimum). Now, for the LP problem, the extreme points or vertices are
obtained when the column vectors Ai corresponding to the non-zero basic variables are linearly independent.
Such a valid solution x to the LP is known as a Basic Feasible Solution or BFS. A linear program can have
multiple BFS, but if we are able to obtain at least one BFS (the initial BFS), then there must exist
an optimal BFS. With these definitions we now look into the simplex method.
Let B be the set of m basic variables, i.e., B ⊆ {1 . . . n} is a subset of the n variables. Let us set xi = 0 ∀ i ∉ B.
We represent B as a set of indices of the variables that are selected. Also, we use the notation B(i) to represent the
i-th element of the set. Then,
$$\bar{B} x_B = b \implies x_B = \bar{B}^{-1} b, \qquad x_i = 0,\; i \notin B \tag{13.2}$$
The elements of the vector x can now be taken from xB, with the rest zero.
Now, the feasible region defined by the LP problem, Ax = b, xi ≥ 0, is a convex polyhedron. So x is a vertex if
and only if the column vectors Ai of the matrix A corresponding to the non-zero
entries of x (xi ≠ 0) are linearly independent. Such an x is called a Basic Feasible Solution (BFS). So, in this
case, B ⊆ {1 . . . n} indexes the basic variables, and B̄ is a non-singular matrix. For a given linear programming problem,
there can be many basic feasible solutions. If an optimal solution exists for an L.P., then one of the BFS is
optimal.
Simplex - Ver1
Its practical implementation is done by finding the matrices B̄1 → B̄2 → B̄3 → B̄4 → B̄5 . . ., where a new B̄
is found by removing one column and adding a new column. Therefore, the problem of simplex soon reduces
to finding the departing column (or variable) and the entering column (or variable). Often, the first vertex is
found easily as B̄ = I, where I is the identity matrix.
Often the initial BFS is chosen with B̄ = I, that is, by setting all the slack variables to be basic variables (non-zero),
while the remaining original variables are classified as non-basic and are by default set to 0. (This is
equivalent to the origin of the convex polygon defined by the inequalities.) This usually (not always) provides a
simple-to-obtain BFS with xB(i) = bi, as the rest of the variables in each constraint are non-basic and zero.
We design the simplex algorithm such that in each iteration exactly one of the non-basic variables
becomes a basic variable, and one basic variable leaves to become non-basic.
Now we formalize the process by which we switch between two Basic Feasible Solutions, making sure that
the new solution still satisfies the original LP's constraints.
Let d be a difference vector, and θ ∈ [0, 1] be a scalar. The vector d takes us from one solution to another as
xnew = x + θd.
$$\bar{B} d_B + A_j = 0 \implies d_B = -\bar{B}^{-1} A_j$$
Let us calculate the cost difference in the objective for a given choice of j.
$$\bar{C}_j = C_j - C_B^T \bar{B}^{-1} A_j$$
Here C̄j is the difference in cost due to the j-th variable becoming basic. We compute it for all the variables.
If C̄ ≥ 0 (i.e., all the elements are non-negative, so the objective cannot be further minimized), then our current
BFS is the optimum. Otherwise, we choose one j from C̄ such that C̄j < 0. We can have some nice heuristics
for picking the best j out of many possible ones. Such heuristics may work in many situations, but not always. In
any case, we know that the problem is convex!
An important question not yet answered is the choice of θ. How do we fix θ? If θ is small, we move
only by a small amount. If we take too large a θ, we move out of the feasible region. The trick is finding the largest θ
such that we still remain in the feasible region. We show below how to find the θ that takes us as far as possible
without violating the constraints. This identifies a dimension/variable l that exits the basic variable set. In
short, our objective is to find a j that enters and an l that exits in every iteration. The exiting variable l is the one
driven to zero first:
$$x_l + \theta d_l = 0$$
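To make the entering/exiting choice concrete, here is a short NumPy sketch of the two computations just described; the helper names reduced_costs and ratio_test are our own, and B_idx is assumed to hold the indices of the current basic variables.

import numpy as np

def reduced_costs(c, A, B_idx):
    # C_bar_j = c_j - c_B^T Binv A_j, computed for every column j at once.
    Binv = np.linalg.inv(A[:, B_idx])
    return c - c[B_idx] @ Binv @ A, Binv

def ratio_test(xB, dB):
    # theta = min over i with dB_i < 0 of -xB_i / dB_i.
    # Returns (theta, l) where l is the position of the leaving variable,
    # or (None, None) if no component of dB is negative (unbounded LP).
    mask = dB < 0
    if not mask.any():
        return None, None
    ratios = -xB[mask] / dB[mask]
    k = ratios.argmin()
    return ratios[k], int(np.where(mask)[0][k])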
13.5 Examples
Example 5. Consider an example to obtain an optimal solution by checking all possible basic feasible solutions.
Minimize: −x1 − x2
such that −x1 + x2 ≤1
x1 ≤3
x2 ≤2
x1 , x 2 ≥0
The inequalities are converted to equalities by adding variables to each of the constraints. This results in
Minimize: −x1 − x2
such that
−x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2
x1 , x2 , x3 , x4 , x5 ≥ 0
Figure 13.1: The constraints on a 2D plane. One can see the extreme points of the original problem. Also
notice the relationship to the Basic Feasible Solutions.
Here,
$$A = \begin{bmatrix} -1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$$
and the system is
$$Ax = b \tag{13.3}$$
Initially, the basic feasible solution is x = [0, 0, 1, 3, 2]. As the 3rd , 4th and the 5th columns of the matrix A form an
identity matrix (I), B = {3, 4, 5}. The objective function, z = 0.
Rearranging the equations so as to write them in terms of x3 , x4 and x5 ,
x3 = 1 + x1 − x2
x4 = 3 − x1
x5 = 2 − x2
z = −x1 − x2
=⇒ z =0
x2 can increase up to 1 while keeping x1 = 0. Thus, the equations transform as,
x2 = 1 + x1 − x3
x4 = 3 − x1
x5 = 1 − x1 + x3
z = −x1 − x2
= −x1 − (1 + x1 − x3 )
= −1 − 2x1 + x3
=⇒ z = −1 − 2(0) + 0
= −1
In this case, 2 enters and 3 exits. Therefore, x = [0, 1, 0, 3, 1], B = {2, 4, 5} and z = −1.
Figure 13.2: Navigation of the vertices of the feasible region.
Figure 13.2 shows the navigation through the basic feasible solutions to reach an optimal solution. It can be seen
that z can be further minimized by increasing x1. So, the equations transform as,
x1 = 1 + x3 − x5
x2 = 2 − x5
x4 = 2 − x3 + x5
z = −1 − 2x1 + x3
= −1 − 2(1 + x3 − x5 ) + x3
= −3 − x3 + 2x5
=⇒ z = −3 − 0 + 2(0)
= −3
In this case, 1 enters and 5 exits. Therefore, x = [1, 2, 0, 2, 0], B = {1, 2, 4} and z = −3.
It can be seen that z can be minimized by increasing x3 . So, the equations transform as,
x1 = 3 − x4
x2 = 2 − x5
x3 = 2 − x4 + x5
z = −x1 − x2
= −(3 − x4 ) − 2 − x5
= −5 + x4 + x5
=⇒ z = −5 + 0 + 0
= −5
In this case, 3 enters and 4 exits. Therefore, x = [3, 2, 2, 0, 0], B = {1, 2, 3} and z = −5. z cannot be minimized
further. Hence, the final minimized value of z is −5.
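For readers who want to check this example numerically, the optimum can be reproduced with SciPy's LP solver (an external tool, not part of these notes):

from scipy.optimize import linprog

# minimize -x1 - x2  s.t.  -x1 + x2 <= 1, x1 <= 3, x2 <= 2, x >= 0
res = linprog(c=[-1, -1],
              A_ub=[[-1, 1], [1, 0], [0, 1]],
              b_ub=[1, 3, 2],
              bounds=[(0, None), (0, None)],
              method="highs")
print(res.x, res.fun)   # expected: x = [3, 2], objective = -5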
Example 6. Minimize: z = x1 + x2
such that x1 + 5x2 ≤5
2x1 + x2 ≤4
x1 , x2 ≥0
Figure 13.3: Graphical representation.
Minimize: z = x1 + x2
such that
x1 + 5x2 + x3 =5
2x1 + x2 + x4 =4
x1, x2, x3, x4 ≥ 0
So,
$$A = \begin{bmatrix} 1 & 5 & 1 & 0 \\ 2 & 1 & 0 & 1 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}, \quad b = \begin{bmatrix} 5 \\ 4 \end{bmatrix}, \qquad Ax = b.$$
Rearranging in terms of the basic variables x3 and x4:
x3 = 5 − x1 − 5x2
x4 = 4 − 2x1 − x2
z = x1 + x2
=⇒ z =0
Since the coefficients of x1 and x2 are both positive, increasing them will increase the value of z. So the optimal
solution is already reached, with z = 0.
13.6 Additional Problems
1. Find a sequence of BFS to arrive at the optimal point for the following Linear program:
Minimize : z = −x1 − x2
Subject to :
x1 + 10x2 ≤ 5
4x1 + x2 ≤ 4
x1 , x 2 ≥ 0
2. Find a sequence of BFS to arrive at the optimal point for the following Linear program:
Chapter 14
More on Simplex
14.1 Introduction
In the last lecture, we saw a basic version of Simplex. That method involved moving from one extreme point of
the feasible polyhedron to another, and finally stopping at the extreme point corresponding to the optimal solution, when
no adjacent movement leads to cost reduction. We saw this in an example: we know how to apply the
method without getting into much of the corresponding theory. In this lecture we develop the formal mathematical
details that guide the Simplex method, so that we are able to apply it with more confidence.
The simplex method is based on the fact that if an optimal solution exists, then there is a basic feasible solution (BFS)
that is optimal. It searches for an optimal solution by moving from one basic feasible solution to another, along
the edges of the feasible set, making sure that we always move in a cost-reducing direction. Eventually, a special
basic feasible solution is reached at which none of the available edges leads to a cost reduction. Such a basic feasible
solution is optimal, and the algorithm terminates.
To start the simplex method we need to know at least one basic feasible solution. Simplex often starts with B̄ = I.
14.2 Basics
Minimize: cT x
such that Ax = b
We assume that the dimensions of the matrix A are m × n where m < n. Here, Ai is the ith column of the matrix
A. Let P be the corresponding feasible set. Note that the problem is stated in the standard slack form.
The simplex method searches for an optimal solution by navigating from one vertex to another, along the edges of the
feasible region such that the cost is reduced. When we reach a basic feasible solution at which there is no neighbour
that has a lesser cost, we have reached the minima, and the algorithm terminates.
A polyhedron is a set that can be described in the form {x ∈ Rn | Ax ≤ b}, where A is an m × n matrix and b is a
vector in Rm.
We need to find the vector x′ s.t. x′ ∈ P and cT x′ is minimized. Instead of directly working with inequalities, we
introduce slack variables and convert the given problem into the standard form. A vector x∗ ∈ Rn is a Basic
Feasible Solution (BFS) if all the constraints are satisfied (i.e., it is a feasible solution) and n − m of its elements are
zero (basic solution).
For the rest of the lecture we assume our problem has already been converted into the standard form, with A being an
m × n matrix representing m equalities in n variables, all m rows being linearly independent.
• i ∈ B =⇒ xi is a basic variable.
• i ∉ B =⇒ xi = 0 and xi is a non-basic variable.
As discussed in the last lecture, the simplex algorithm needs to find two variables in each iteration: one that enters the
basic variable set (i.e., j) and one that exits it (i.e., l). Here is a summary of the process.
(We will go into the details and recall the notation in the next section.)
14.3 How does the simplex method work?
Suppose that we are at a point x ∈ P and that we want to move away from x, in the direction of a vector d ∈ Rn . We
should move in a direction that does not take us outside the feasible region. d ∈ Rn is said to be a feasible direction
at x, if there exists a positive scalar θ for which
x + θd ∈ P. (14.1)
Let x be a basic feasible solution to our problem, let B(1), …, B(m) be the indices of the basic variables, and let
$$\bar{B} = \begin{bmatrix} A_{B(1)} & \cdots & A_{B(m)} \end{bmatrix} \tag{14.2}$$
Simplex Algorithm:
Note:
1. Often the first vertex is found easily when B̄ is the identity matrix.
2. In each iteration, one new variable enters the basis and one exits.
Consider a new point y = x + θd. Select a non-basic variable xj (j ∉ B) and change its value to a positive
value θ. Let the other non-basic variables xi (i ∉ B, i ≠ j) be zero. That is, dj = 1 and di = 0.
We know that Ax = b and Ay = b. Hence
$$A(x + \theta d) = b \tag{14.5}$$
$$\therefore Ad = 0$$
When j enters B, dj = 1. For the basic variables,
$$\bar{B} x_B = b \implies x_B = \bar{B}^{-1} b \tag{14.6}$$
The l-th column goes out of the basis and the j-th column comes in:
$$x_B^{new} = \bar{B}_{new}^{-1} b \tag{14.7}$$
Recall now that dj = 1, and that di = 0 for all other non-basic indices i. Then,
$$Ad = 0 \implies \sum_{j=1}^{n} A_j d_j = 0.$$
Splitting the summation into parts for the basic and non-basic variables, we get
$$\sum_{i \in B} d_i A_i + d_j A_j + \sum_{i \notin B,\; i \neq j} d_i A_i = 0.$$
We know that dj = 1 and, for i ∉ B, di = 0. Therefore
$$\sum_{i \in B} d_i A_i + A_j = 0$$
$$\sum_{i=1}^{m} d_{B(i)} A_{B(i)} + A_j = 0$$
$$\bar{B} d_B + A_j = 0 \implies A_j = -\bar{B} d_B \implies d_B = -\bar{B}^{-1} A_j \tag{14.8}$$
Next, we need to find θ, i.e., how much to move in the feasible direction. Given that y = x + θd,
$$x_B^{new} = x_B + \theta d_B \tag{14.9}$$
If di < 0, then θ is limited by the requirement xi ≥ 0:
$$\therefore \theta = \min_{i \in B,\; d_i < 0} \left\{ \frac{-x_i}{d_i} \right\} \tag{14.10}$$
The cost is given by cT x; changing the BFS x to y = x + θd changes the cost by θ cT d.
Let C̄j denote the (per-unit) cost change due to j entering B, and let C̄ = [C̄1, ..., C̄n] denote the vector of cost
changes corresponding to each j.
Lemma 14.3.1. If C̄j ≥ 0 ∀ C̄j ∈ C̄, then the current BFS x is optimal.
If the above lemma holds, stop the algorithm and report the solution. Else ∃j s.t. C̄j < 0.
$$\bar{C}_j = C^T d = C_B^T d_B + C_j \quad \text{(as } j \text{ is entering } B,\; d_j = 1 \text{ and } d_i = 0\ \forall i \notin B,\, i \neq j\text{)}$$
$$= C_j - C_B^T \bar{B}^{-1} A_j$$
Simplex method:
1. Start with x as a BFS.
2. Compute C̄.
3. If C̄j ≥ 0 ∀ C̄j ∈ C̄, then stop and report x as the solution.
Else ∃ j s.t. C̄j < 0; compute dB = −B̄−1 Aj.
4. Find l and θ such that xnew = x + θd, where j enters B and l exits B.
5. Update x, B and B̄, and go to step 2.
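The five steps above translate almost line by line into code. Below is a minimal sketch of this naive simplex in Python/NumPy (our own rendering of the steps, not the notes' implementation): it recomputes B̄−1 from scratch each iteration instead of updating it, and it assumes a starting BFS is supplied.

import numpy as np

def simplex(A, b, c, B_idx):
    # Minimize c^T x s.t. Ax = b, x >= 0, given a starting basis B_idx
    # (m column indices whose basic solution is feasible).
    m, n = A.shape
    B_idx = list(B_idx)
    while True:
        Binv = np.linalg.inv(A[:, B_idx])
        xB = Binv @ b
        cbar = c - c[B_idx] @ Binv @ A            # step 2: reduced costs
        neg = [j for j in range(n) if cbar[j] < -1e-9]
        if not neg:                               # step 3: optimal, stop
            x = np.zeros(n)
            x[B_idx] = xB
            return x, B_idx
        j = neg[0]                                # entering variable
        dB = -Binv @ A[:, j]                      # basic direction (14.8)
        if (dB >= -1e-9).all():
            raise ValueError("LP is unbounded")
        mask = dB < -1e-9
        ratios = -xB[mask] / dB[mask]             # step 4: ratio test (14.10)
        l = int(np.where(mask)[0][ratios.argmin()])
        B_idx[l] = j                              # step 5: swap l out, j in

# Example 5 (Section 13.5) in slack form; starting basis = slack variables.
A = np.array([[-1., 1, 1, 0, 0], [1, 0, 0, 1, 0], [0, 1, 0, 0, 1]])
b = np.array([1., 3, 2])
c = np.array([-1., -1, 0, 0, 0])
x, B = simplex(A, b, c, [2, 3, 4])
print(x, c @ x)    # expected: [3, 2, 2, 0, 0], objective -5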
A practical concern is updating the basis inverse. Write
$$\bar{B}^{-1} \bar{B} = I = [e_1\; e_2\; \cdots\; e_m], \qquad \bar{B}^{-1} \bar{B}_{new} = [e_1\; \cdots\; e_{l-1}\; u\; e_{l+1}\; \cdots\; e_m]$$
where u = B̄−1 Aj; only the l-th column on the RHS differs from the identity. We now perform elementary
row transformations so that the RHS becomes I; the same operations turn B̄−1 into B̄new−1.
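The following small helper (a sketch of ours, not the notes' code) performs exactly these elementary row operations: given the current B̄−1, the entering column Aj and the leaving position l, it returns the new inverse.

import numpy as np

def update_Binv(Binv, Aj, l):
    # Product-form update: row-reduce [e1 .. u .. em] to I and apply the
    # same operations to Binv, where u = Binv @ Aj sits in column l.
    u = Binv @ Aj
    new = Binv.astype(float).copy()
    new[l] /= u[l]                       # scale pivot row so u_l becomes 1
    for i in range(len(u)):
        if i != l:
            new[i] -= u[i] * new[l]      # eliminate u_i in the other rows
    return new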
14.5 Example Problems
Example 7. Consider the linear programming problem
minimize c1 x1 + c2 x2 + c3 x3 + c4 x4
subject to x1 + x2 + x3 + x4 = 2
2x1 + 3x3 + 4x4 = 2
x1 , x2 , x3 , x4 ≥ 0.
The first two columns of the matrix A are A1 = (1, 2) and A2 = (1, 0). Since they are linearly independent, we can
choose x1 and x2 as our basic variables. The corresponding basis matrix is
$$B = \begin{bmatrix} 1 & 1 \\ 2 & 0 \end{bmatrix}$$
We set x3 = x4 = 0, and solve for x1 , x2 , to obtain x1 = 1 and x2 = 1. Therefore, we have obtained a non-degenerate
basic feasible solution. A basic direction corresponding to an increase in the non-basic variable x3 , is constructed as
follows. We have d3 = 1 and d4 = 0. The direction of change of the basic variables is obtained using Eq. (14.8) as
follows:
$$d_B = \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} d_{B(1)} \\ d_{B(2)} \end{bmatrix} = -B^{-1} A_3 = -\begin{bmatrix} 0 & 1/2 \\ 1 & -1/2 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \end{bmatrix} = \begin{bmatrix} -3/2 \\ 1/2 \end{bmatrix}$$
The cost of moving along this basic direction is cT d = −3c1/2 + c2/2 + c3. This is the same as the reduced cost
of the variable x3. Suppose that c = (2, 0, 0, 0); then c̄3 = −3. Since c̄3 is negative, we form the
corresponding basic direction, which is d = (−3/2, 1/2, 1, 0), and consider vectors of the form x + θd, with θ ≥ 0.
As θ increases, the only component of x that decreases is the first one (because d1 < 0). The largest possible value
of θ is given by θ∗ = −(x1 /d1 ) = 2/3.
This takes us to the point y = x + 2d/3 = (0, 4/3, 2/3, 0).
Note that the columns A2 and A3 corresponding to the non-zero variables at the new vector y are (1, 0) and (1, 3),
respectively, and are linearly independent.
Therefore, they form a basis and the vector y is a new basic feasible solution. In particular, the variable x3 has
entered the basis and the variable x1 has exited the basis.
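The arithmetic of this example is easy to double-check; the NumPy snippet below (our own verification) reproduces dB, θ∗ and the new point y.

import numpy as np

B = np.array([[1.0, 1.0], [2.0, 0.0]])    # basis columns A1, A2
A3 = np.array([1.0, 3.0])
dB = -np.linalg.inv(B) @ A3
print(dB)                                  # expected: [-1.5, 0.5]

x = np.array([1.0, 1.0, 0.0, 0.0])
d = np.array([dB[0], dB[1], 1.0, 0.0])     # basic direction for x3
theta = -x[0] / d[0]                       # only x1 decreases
print(theta, x + theta * d)                # 2/3 and [0, 4/3, 2/3, 0]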
Example 8. Let x be an element of the standard-form polyhedron P = {x ∈ Rn | Ax = b, x ≥ 0}. Prove that a vector
d ∈ Rn is a feasible direction at x iff Ad = 0 and di ≥ 0 ∀ i s.t. xi = 0.
Let x, y ∈ Rn s.t. y = x + θd, where d indicates the direction of movement in each component (forward or
backward) and θ is the scaling factor, i.e., how far we move.
For the forward direction of the proof, we assume d to be a feasible direction vector and show that the stated conditions
hold:
Ay = b
=⇒ A(x + θd) = b
=⇒ Ax + θAd = b
=⇒ θAd = 0
=⇒ Ad = 0 (as θ is a scaling factor and θ ≠ 0)
Also, ∀ i s.t. xi = 0, xi cannot decrease further, so it can only increase or stay at zero. In formal terms, di ≥ 0. This
proves the forward direction.
For the converse, let d ∈ Rn s.t. Ad = 0 and di ≥ 0 ∀ i with xi = 0; we show that d is a feasible direction
vector:
Ad = 0
=⇒ θAd = 0 (multiplying by a scalar on both sides)
=⇒ Ax + Aθd = Ax (adding the same-dimension vector on both sides)
=⇒ A(x + θd) = b
Let y = x + θd; since Ay = b, we get y ∈ P (for sufficiently small θ the nonnegativity constraints also hold).
So d is a feasible direction vector. This completes the proof.
Example 9. Let P = {x ∈ R3 | x1 + x2 + x3 = 1, x ≥ 0} be a polyhedron. Consider the vertex x = [0, 0, 1]T. Find the
set of feasible directions at x.
Using the result of the previous proof, we know that d is a feasible direction vector if Ad = 0 and di ≥ 0 ∀ xi = 0.
We get,
Ad = 0 =⇒ d1 + d2 + d3 = 0
Here x = [0, 0, 1]T , so x1 = x2 = 0 =⇒ d1 , d2 ≥ 0.
So the set of all feasible directions at the given x is the polyhedron
D = {d ∈ R3 | d1 + d2 + d3 = 0, d1 , d2 ≥ 0}
Example 10. Find a sequence of BFS to arrive at the optimal point.
minimize x1 + x2
s.t
x1 + x2 ≤2
2x1 ≤2
x1 , x2 ≥ 0
minimize x1 + x2
s.t.
x1 + x2 + x3 = 2
2x1 + x4 = 2
x1, x2, x3, x4 ≥ 0
x3 = 2 − x1 − x2
x4 = 2 − 2x1
z = x1 + x2
X = [0, 0, 2, 2],
B = {3, 4}, z = 0
Now both x1 and x2 have positive coefficients, so we cannot decrease z any further: x1 and x2 are at zero and
cannot become negative.
Chapter 15
15.1 Simplex
This is an incomplete version. There are inconsistencies in the notation and some errors remain; use
the textbook for careful study.
Finding the new B−1 given the current B−1: start from B−1 B = I, i.e., B−1 B = [e1, e2, . . . , em].
minimize cT x
subject to Ax ≤ b
x≥0
To summarize the intuition: we choose a corner by choosing n planes (assuming the intersection is feasible). If
we remove one plane, the intersection of the remaining n−1 planes gives an edge (to stay in the feasible set we can move in only
one direction along it). If we add another plane to these n−1 planes such that the new intersection of n planes is in the
feasible set, then we move to a neighbouring corner. Just moving to a neighbouring corner is not the aim in itself;
this idea is used to move from one corner to another along an edge in a way that reduces the cost.
The main idea of simplex is to move from one corner to another along an edge of the feasible set such that the cost
decreases. A corner is the meeting of n different planes (each plane is given by an equation). Each corner of the
feasible set comes from turning n of the n+m inequalities Ax ≤ b and x ≥ 0 into equations and finding the intersection
of these n planes.
To understand how simplex works, we introduce nonnegative slack variables w = Ax − b, i.e., Ax − w = b, which gives us
$$\begin{bmatrix} A & -I \end{bmatrix} \begin{bmatrix} x \\ w \end{bmatrix} = b,$$
renaming [A −I] as A and (x, w) as x.
So we have, minimize cT x, subject to Ax = b , x ≥ 0. The intersection of n planes gives a corner. Suppose one of n
intersecting planes is removed. The points that satisfy the remaining n-1 equations form an edge that comes out of
the corner. This edge is the intersection of n-1 planes. To stay in the feasible region, only one direction is allowed
along each edge which is given by Phase II of the simplex which will be discussed later in this lecture.
In this new setting, we can observe the same thing. A corner is now a point where n components of the new x are
zero. These n components of x are free variables of Ax = b set to zero. The remaining m components are basic
variables or pivot variables. The basic solution is a genuine corner if its m nonzero components are positive: we
get a corner when x has n zero components. Now suppose we free one of the zero components and set one of the
m nonzero components to zero. The previously-zero component can now take a value, so the system has one degree
of freedom, and its value is adjusted so as to minimize the cost cT x. So which corner do we land on? We really wanted to move from one corner to another along
an edge. Since the two corners are neighbours, m−1 basic variables remain basic, while one variable
moves up from zero to become basic. The values of the other m−1 basic components change but remain positive.
The choice of edge decides which variable leaves the basis and which enters (and vice versa). The basic variables are
computed by solving Ax = b; the free components are set to zero.
A corner is degenerate if more than the usual n components of x are zero. The basic set might then change without
actually moving from the corner.
A way to do the above is the tableau, discussed in the next subsection.
TODO: the last row in the above should be kept as the first row to be consistent with the tableaux below.
Consider again: minimize −x1 − x2 subject to
−x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2
$$C = \begin{bmatrix} -1 & -1 & 0 & 0 & 0 \end{bmatrix}, \quad A = \begin{bmatrix} -1 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$$
x1 x2 x3 x4 x5
0 -1 -1 0 0 0
x3 1 -1 1 1 0 0
x4 3 1 0 0 1 0
x5 2 0 1 0 0 1
[CT − CTB B−1 A] < 0 for both x1 and x2; we can select either as the entering variable. Let x2 be selected.
The ratio bi/ui is least (with ui positive) in the x3 row, so x3 leaves. After performing row operations we get
x1 x2 x3 x4 x5
1 -2 0 1 0 0
x2 1 -1 1 1 0 0
x4 3 1 0 0 1 0
x5 1 1 0 -1 0 1
[CT − CTB B−1 A] < 0 for x1, so it is the entering variable.
The ratio bi/ui is least (with ui positive) in the x5 row, so x5 leaves. After performing row operations we get
x1 x2 x3 x4 x5
3 0 0 -1 0 2
x2 2 0 1 0 0 1
x4 2 0 0 1 1 -1
x1 1 1 0 -1 0 1
[CT − CTB B−1 A] < 0 for x3, so it enters; the ratio test removes x4. After performing row operations we get
x1 x2 x3 x4 x5
5 0 0 0 1 1
x2 2 0 1 0 0 1
x3 2 0 0 1 1 -1
x1 3 1 0 0 1 0
[CT − CTB B−1 A] ≥ 0 for all the variables, so we can end the process.
(x1, x2, x3, x4, x5) = (3, 2, 2, 0, 0)
Objective = -5
Solution:
x1 + 2x2 + 2x3 + x4 = 20
2x1 + x2 + 2x3 + x5 = 20
2x1 + 2x2 + x3 + x6 = 20
$$C = \begin{bmatrix} -10 & -12 & -12 & 0 & 0 & 0 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & 2 & 2 & 1 & 0 & 0 \\ 2 & 1 & 2 & 0 & 1 & 0 \\ 2 & 2 & 1 & 0 & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 20 \\ 20 \\ 20 \end{bmatrix}$$
x1 x2 x3 x4 x5 x6
0 -10 -12 -12 0 0 0
x4 20 1 2 2 1 0 0
x5 20 2 1 2 0 1 0
x6 20 2 2 1 0 0 1
[CT − CTB B−1 A] < 0 with the smallest magnitude for x1, so it is chosen as the entering variable.
The ratio bi/ui is least (with ui positive) in the x5 row, so x5 leaves. After performing row operations we get
x1 x2 x3 x4 x5 x6
100 0 -7 -2 0 5 0
x4 10 0 1.5 1 1 -0.5 0
x1 10 1 0.5 1 0 0.5 0
x6 0 0 1 -1 0 -1 1
[CT − CTB B−1 A] < 0 with the smallest magnitude for x3, so it is chosen as the entering variable.
The ratio bi/ui is least (with ui positive) in the x4 row, so x4 leaves. After performing row operations we get
x1 x2 x3 x4 x5 x6
120 0 -4 0 2 4 0
x3 10 0 1.5 1 1 -0.5 0
x1 0 1 -1 0 -1 1 0
x6 10 0 2.5 0 1 -0.5 1
[CT − CTB B−1 A] < 0 only for x2, so it is the entering variable.
The ratio bi/ui is least (with ui positive) in the x6 row, so x6 leaves. After performing row operations we get
x1 x2 x3 x4 x5 x6
136 0 0 0 3.6 1.6 1.6
x3 4 0 0 1 0.4 0.4 -0.6
x1 4 1 0 0 -0.6 0.4 0.4
x2 4 0 1 0 0.1 -0.6 0.4
[CT − CTB B−1 A] ≥ 0 for all the variables, so we can end the process.
(x1, x2, x3, x4, x5, x6) = (4, 4, 4, 0, 0, 0)
Objective = -136
Solution:
x1 x2 x3 x4
0 -5 -7 0 0
x3 13 2 3 1 0
x4 12 3 2 0 1
[CT − CTB B−1 A] < 0 with the greatest magnitude for x2, so it is the entering variable.
The ratio bi/ui is least (with ui positive) in the x3 row, so x3 leaves. After performing row operations we get
x1 x2 x3 x4
91/3 -1/3 0 7/3 0
x2 13/3 2/3 1 1/3 0
x4 10/3 5/3 0 -2/3 1
[CT − CTB B−1 A] < 0 for x1, so it is the entering variable.
The ratio bi/ui is least (with ui positive) in the x4 row, so x4 leaves. After performing row operations we get
x1 x2 x3 x4
31 0 0 11/5 1/5
x2 3 0 1 3/5 -2/5
x1 2 1 0 -2/5 3/5
[CT − CTB B−1 A] ≥ 0 for all the variables, so we can end the process.
(x1, x2, x3, x4) = (2, 3, 0, 0)
Objective = 31
Example 14. Explain why the row transformation correctly yields the results in the 0th row.
Suppose that at the beginning of a typical iteration the 0th row is of the form [0 | c′] − g′[b | A], where g′ = cB B−1. Hence the 0th
row equals [0 | c′] plus a linear combination of the rows of [b | A]. Let column j be the pivot column and row l the pivot
row. Note that the pivot row is of the form h′[b | A], where h′ is the l-th row of B−1. Hence, after a multiple of the pivot row is
added to the 0th row, that row is again equal to [0 | c′] plus a (different) linear combination of the rows of [b | A], i.e., of the form
[0 | c′] − p′[b | A] for some vector p. Recall that the update rule is such that the pivot-column entry of the 0th row becomes zero:
cj − p′Aj = 0. Now consider the column of a basic variable B(i) with i ≠ l. The 0th-row entry of that column is 0
before the change of basis, since it is the reduced cost of a basic variable. Since B−1 AB(i) is the i-th unit vector and i ≠ l,
the entry in the pivot row for that column is also 0. Hence adding a multiple of the pivot row to the 0th row does not
affect the 0th-row entry of that column, which stays at zero.
Solution:
x1 − x2 + x3 = 2
x1 + x2 + x4 = 6
$$C = \begin{bmatrix} -2 & -1 & 0 & 0 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & -1 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ 6 \end{bmatrix}$$
x1 x2 x3 x4
0 -2 -1 0 0
x3 2 1 -1 1 0
x4 6 1 1 0 1
x2 enters; the ratio test (only the x4 row has a positive entry in the x2 column) removes x4:
x1 x2 x3 x4
6 -1 0 0 1
x3 8 2 0 1 1
x2 6 1 1 0 1
x1 enters; the ratio test removes x3:
x1 x2 x3 x4
10 0 0 0.5 1.5
x1 4 1 0 0.5 0.5
x2 2 0 1 -0.5 0.5
$$A = \begin{bmatrix} -6 & -1 & 1 & 0 & 0 \\ -4 & -3 & 0 & 1 & 0 \\ -1 & -2 & 0 & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} -6 \\ -12 \\ -4 \end{bmatrix}$$
Take the basis {x1, x3, x4}, i.e.,
$$\bar{B} = \begin{bmatrix} -6 & 1 & 0 \\ -4 & 0 & 1 \\ -1 & 0 & 0 \end{bmatrix}$$
Rearranging the constraints in terms of the basic variables:
x4 = 4 − 5x2 + 4x5
x3 = −6 + 6x1 + x2
x3 = 18 − 11x2 + 6x5
The reduced-cost row is C̄ = [0 8 0 0 −5], giving the tableau
x1 x2 x3 x4 x5
-20 0 8 0 0 -5
x1 4 1 2 0 0 -1
x3 18 0 11 1 0 -6
x4 4 0 5 0 1 -4
We see that the reduced cost of x5 is negative (−5), so x5 should enter; we need min(xB(i)/ui) over rows with ui > 0.
But all ui in the x5 column are negative, so θ can grow without bound: the problem is unbounded.
The optimal solution has x5 → +∞, the minimum cost is −Z = −∞, and hence max Z = ∞.
1 2 -1 0 6
2 1 0 -1 6
1 1 0 0 0
We start with a feasible point, say P, which is the intersection of x = 0 and 2x + y = 6. To relate to the theory and be
organised, we exchange columns 1 and 3 to put the basic variables first.
-1 2 1 0 6
0 1 2 -1 6
0 1 1 0 0
Then elimination multiplies the first row by −1 to give a unit pivot, and uses the second row to produce zeros in the
second column. The fully reduced form at P is shown below. The row r = [-1 1] at the bottom has a negative entry in
1 0 3 -2 6
0 1 2 -1 6
0 0 -1 1 -6
column 3, so the third variable will enter the basis. The current corner P, with cost −6, is not optimal. The column above
the negative entry is B−1 u = (3, 2); its ratios with the last column are 6/3 and 6/2. Since the first ratio is smaller, the first
basic variable is pushed out of the basis. We move from corner P to Q. The new tableau exchanges
columns 1 and 3. Pivoting by elimination gives the new tableau at Q, where r = [1/3 1/3] is positive and thus final. The
corner x = 2, y = 2 and z = +4 is optimal.
This can also be done without exchanging rows as shown in next example.
3 0 1 -2 6
2 1 0 -1 6
-1 0 0 1 -6

1 0 1/3 -2/3 2
0 1 -2/3 1/3 2
0 0 1/3 1/3 -4
minimize − x1 − x2
s.t. −x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2
x1 , x2 , x3 , x4 , x5 ≥ 0
We choose x1 to enter, as the corresponding r1 is one of the negative entries of r; x4 leaves, as it is the basic variable
with the smallest ratio (the pivot element is marked *).
b x1 x2 x3 x4 x5
x3 1 -1 1 1 0 0
x4 3 1* 0 0 1 0
x5 2 0 1 0 0 1
z 0 -1 -1 0 0 0
b x1 x2 x3 x4 x5
x3 4 0 1 1 1 0
x1 3 1 0 0 1 0
x5 2 0 1* 0 0 1
z 3 0 -1 0 1 0
b x1 x2 x3 x4 x5
x3 2 0 0 1 1 -1
x1 3 1 0 0 1 0
x2 2 0 1 0 0 1
z 5 0 0 0 1 1
minimize −x1 − x2
such that −x1 + x2 ≤ 1
x1 ≤ 3
x2 ≤ 2
x1 , x2 ≥ 0
Adding slack variables gives
min −x1 − x2
−x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2
x1, x2, x3, x4, x5 ≥ 0
1. Initial stage
x3 = 1 + x1 − x2
x4 = 3 − x1
x5 = 2 − x2
z = 0 − x1 − x2
BF S = [0, 0, 1, 3, 2], B = {3, 4, 5}, z = 0
2. x1, x2 both have negative coefficients; either can be selected as the entering variable. Let's choose x2.
x2 can be increased up to 1 while keeping x3 ≥ 0
x2 enters x3 exits
x2 = 1 + x1 − x3
x4 = 3 − x1
x5 = 1 − x1 + x3
z = −1 − 2x1 + x3
x = [0, 1, 0, 3, 1], B = {2, 4, 5}, z = −1
3. x1 has a negative coefficient, so x1 enters.
x1 can be increased up to 1 while keeping x5 ≥ 0
x1 enters x5 exits
x1 = 1 + x3 − x5
x2 = 2 − x5
x4 = 2 − x3 + x5
z = −3 − x3 + 2x5
x = [1, 2, 0, 2, 0], B = {1, 2, 4}, z = −3
4. x3 has a negative coefficient, so x3 enters.
x3 can be increased up to 2 while keeping x4 ≥ 0
x3 enters x4 exits
x3 = 2 − x4 + x5
x1 = 3 − x4
x2 = 2 − x5
z = −5 + x4 + x5
x = [3, 2, 2, 0, 0], B = {1, 2, 3}, z = −5
We cannot reduce z any further because all the coefficients in the z row are positive; z = −5 is the optimal value.
1. Initial stage
x1 x2 x3 x4 x5
0 -1 -1 0 0 0 ...a1
x3 = 1 -1 1∗ 1 0 0 ...a2
x4 = 3 1 0 0 1 0 ...a3
x5 = 2 0 1 0 0 1 ...a4
2. x2 enters, x3 exits
x1 x2 x3 x4 x5
a1 + a2 1 -2 0 1 0 0 ...b1
a2 x2 = 1 -1 1 1 0 0 ...b2
a3 x4 = 3 1 0 0 1 0 ...b3
a4 − a2 x5 = 1 1∗ 0 -1 0 1 ...b4
3. x1 enters, x5 exits
x1 x2 x3 x4 x5
b1 + 2b4 3 0 0 -1 0 2 ...c1
b2 + b4 x2 = 2 0 1 0 0 1 ...c2
b3 − b4 x4 = 2 0 0 1∗ 1 -1 ...c3
b4 x1 = 1 1 0 -1 0 1 ...c4
4. x3 enters, x4 exits
x1 x2 x3 x4 x5
c1 + c3 5 0 0 0 1 1 ...d1
c2 x2 = 2 0 1 0 0 1 ...d2
c3 x3 = 2 0 0 1 1 -1 ...d3
c4 + c3 x1 = 3 1 0 0 1 0 ...d4
We cannot reduce z any further because all the coefficients in the z row are positive; z = −5 is the optimal value.
Figure: We started with point (0,0) as the initial feasible solution in step 1; after step 2 we reach point (0,1), which
has a better objective; after step 3 we reach (1,2), and finally (3,2) after step 4, which is the optimal solution and we
cannot go any further. The dotted path is another possible path that might have resulted depending upon the selection
of the entering variable.
Here, we add artificial variables to these equalities, and to the objective multiplied by some M (a very
large value), so that these artificial variables do not affect the objective. We solve the problem using the
simplex algorithm as follows.
x1 x2 x3 x4 x5 A1 A2 A3
0 -24 396 -8 -28 -10 M M M
A1 12 12 4 1 -19 7 1 0 0
A2 6 6 -7 18 -1 -13 0 1 0
A3 1 1 17 3 18 -2 0 0 1
x1 x2 x3 x4 x5 A1 A2 A3
24 0 804 64 404 -58 M M M+24
A1 0 0 -200 -35 -235 31 1 0 -12
A2 0 0 -109 0 18 -1 0 1 -6
x1 1 1 17 3 18 -2 0 0 1
2x1 + x2 +x3 ≤ 2
x1 ,x2 ,x3 ≥ 0
-2x1 + 4x2 ≤ 12
x1 ,x2 ,x3 ≥ 0
Chapter 16
Dual Problems
16.1 Introduction
Duality means that optimization problems may be viewed from two perspectives, the primal problem or the dual
problem. This leads to problems coming in pairs: primal and dual.
In many cases, both the primal and dual problems have physical interpretations. Often what is more interesting are
the theoretical results (and of course, the practical implications) of the primal-dual problems and their optima.
In this lecture, we start by introducing a set of examples and show how dual problems can be constructed and
interpreted.
Consider the LP: minimize 6x1 + 4x2 + 2x3 subject to 4x1 + 2x2 + x3 ≥ 5, x1 + x2 ≥ 3, x2 + x3 ≥ 4, and x1, x2, x3 ≥ 0.
Suppose we want to obtain a bound on the optimal value of this LP. We can do it by arguing in the following manner:
• Since all variables are non-negative, 6x1 + 4x2 + 2x3 ≥ 4x1 + 2x2 + x3 . Therefore, the value of the LP must be
at least 5.
• Also 6x1 + 4x2 + 2x3 ≥ (4x1 + 2x2 + x3 ) + (x1 + x2 ) ≥ 5 + 3 = 8.
• Similarly, 6x1 + 4x2 + 2x3 ≥ (4x1 + 2x2 + x3 ) + 2(x1 + x2 ) ≥ 5 + 2 · 3 = 11.
• And, 6x1 + 4x2 + 2x3 ≥ (4x1 + 2x2 + x3 ) + (x1 + x2 ) + (x2 + x3 ) ≥ 5 + 3 + 4 = 12.
We are now finding bounds that are closer and closer to the optimum. Is 12 the optimum? How do we know? We can
actually determine the best bound we can achieve by setting up a different LP! Note that in all the above, we have
been taking linear combinations of the constraints to obtain better and better bounds.
Let y1 be the number of times we take the first constraint, y2 the number of times we take the second constraint and
y3 the number of times we take the third constraint. Then the lower bound we get is 5y1 + 3y2 + 4y3, and we need
to ensure that this is indeed a lower bound, i.e., that
y1(4x1 + 2x2 + x3) + y2(x1 + x2) + y3(x2 + x3) ≤ 6x1 + 4x2 + 2x3.
Since the xi's and yi's are non-negative, we need additional constraints to guarantee the above inequality. We can do this by
ensuring that 4y1 + y2 + 0y3 ≤ 6 (since we have 6x1 in the objective value, and 4x1 in the first constraint, 1x1 in the
second constraint and 0x1 in the third constraint). Similarly 2y1 + y2 + y3 ≤ 4 and y1 + y3 ≤ 2. Also, we need
y1, y2, y3 ≥ 0 (otherwise the inequalities in the constraints change direction, and we would not get a lower bound).
We thus obtain the following LP for getting the best bound:
Maximize 5y1 + 3y2 + 4y3
subject to 4y1 + y2 ≤ 6
2y1 + y2 + y3 ≤ 4
y1 + y3 ≤ 2
y1, y2, y3 ≥ 0
Now we have two linear programming problems, and they are related. On close observation, we see that b and c have
swapped their roles, A has been transposed, and the goal is reversed (min to max). Such problem pairs are called primal and
dual problem pairs. In fact, one can reconstruct the primal problem from the dual. You may also see the relationships
among A, b and c across these two problems.
16.3 Primal Dual Problem Pairs
An optimization problem, called the primal, may be converted to its dual, and the solution to the dual provides a
bound to the solution of the primal problem.
M ax cT x
s.t. Ax ≤ b
x≥0
M in bT y
s.t. AT y ≥ c
y≥0
Note: We consider the maximization problem as primal and minimization problem as dual.
Maximize 2x1 + 3x2
subject to:
4x1 + 8x2 ≤ 12
2x1 + x2 ≤ 3
3x1 + 2x2 ≤ 4
x1 , x 2 ≥ 0
Its dual is
Minimize 12y1 + 3y2 + 4y3
Subject to:
4y1 + 2y2 + 3y3 ≥ 2
8y1 + y2 + 2y3 ≥ 3
y1 , y2 , y3 ≥ 0
If we solve these two problems (using simplex) (try it out!!), we see the optimum of the primal is (1/2, 5/4) and of the
dual is (5/16, 0, 1/4).
The objectives of both primal and dual are the same, i.e., 4.75.
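The "try it out" takes a few lines with SciPy (our own verification script; linprog minimizes, so the primal objective is negated):

from scipy.optimize import linprog

# Primal: max 2x1 + 3x2
p = linprog(c=[-2, -3],
            A_ub=[[4, 8], [2, 1], [3, 2]], b_ub=[12, 3, 4],
            bounds=[(0, None)] * 2, method="highs")

# Dual: min 12y1 + 3y2 + 4y3 s.t. A^T y >= c (negate rows for <= form)
d = linprog(c=[12, 3, 4],
            A_ub=[[-4, -2, -3], [-8, -1, -2]], b_ub=[-2, -3],
            bounds=[(0, None)] * 3, method="highs")

print(p.x, -p.fun)   # expected: [0.5, 1.25] and 4.75
print(d.x, d.fun)    # expected: [0.3125, 0, 0.25] and 4.75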
Is it always so? How are the primal and dual problems generated and related? These are some of the questions that
we need to understand.
To keep life simple, we can state that for LP the primal and dual optima coincide whenever a feasible, bounded
optimum exists.
PRIMAL                      DUAL
x1, x2, ..., xn             y1, y2, ..., ym
A                           AT
b                           c
c                           b
Max cT x                    Min bT y
ith constraint ≤            yi ≥ 0
ith constraint ≥            yi ≤ 0
ith constraint =            yi ∈ R
xj ≥ 0                      jth constraint ≥
xj ≤ 0                      jth constraint ≤
xj ∈ R                      jth constraint =
Problem: Derive the dual problem for a primal problem containing an unconstrained variable.
Z = min x1 − x2
subject to :
2x1 + 3x2 − x3 + x4 ≤ 0
−x1 − x2 + 2x3 + x4 = 6
x1 ≤ 0, x2 , x3 ≥ 0, x4 ∈ R
subject to :
2y1 + 3y2 − y3 ≥ 1
3y1 + y2 − y3 ≤ −1
y1 − 2y2 + y3 = 0
y1 ≤ 0, y2 ≥ 0, y3 ∈ R
The dual can be calculated by referring to the table given above. The dual is found to be:
Min 24y1 + 60y2
s.t. (1/2)y1 + y2 ≥ 6
2y1 + 2y2 ≥ 14
y1 + 4y2 ≥ 13
y1, y2 ≥ 0
The primal and dual are inter-convertible, i.e., the primal can be obtained by calculating the dual of the dual.
minimize cT x
subject to
Ax ≥ b
x≥0
Now assume that some other person (the seller) has a way of supplying the nutrients directly, not through food. (For
example, the nutrients may be vitamins, and the seller may sell vitamin pills.) The seller wants to charge as much
as he can for the nutrients, but still have the buyer come to him to buy nutrients. A plausible constraint in this case
is that the prices of the nutrients are such that it is never cheaper to buy a food in order to get the nutrients in it than
to buy the nutrients directly. If y is the vector of nutrient prices, this gives the constraints AT y ≤ c. In addition,
we have the nonnegativity constraint y ≥ 0. Under these constraints, the seller wants to set the prices of the nutrients
in a way that maximizes the seller's profit (assuming that the buyer does indeed buy all his nutrients from the
seller). This gives the following dual LP:
maximize bT y
subject to
AT y ≤ c
y≥0
Problem: Suggest another real world problem where the primal and dual has physical interpretations.
16.7.1 Maxflow
Let N = (V, E) be a network with s, t ∈ V being the source and the sink of N respectively.
The capacity of an edge is a mapping c : E → R+ , denoted by cuv or c(u, v). It represents the maximum amount
of flow that can pass through an edge.
A flow is a mapping f : E → R+ , denoted by fuv or f (u, v), subject to the following two constraints:
1. fuv ≤ cuv, for each (u, v) ∈ E (capacity constraint: the flow on an edge cannot exceed its capacity)
2. Σ_{u:(u,v)∈E} fuv = Σ_{u:(v,u)∈E} fvu, for each v ∈ V \ {s, t} (conservation of flows: the sum of the
flows entering a node must equal the sum of the flows exiting it, except for the source and the sink nodes)
The value of the flow is defined by |f| = Σ_{v:(s,v)∈E} fsv, where s is the source of N. It represents the amount of flow
passing from the source to the sink.
The maximum flow problem is to maximize |f |, that is, to route as much flow as possible from s to t.
16.7.2 Mincut
The minimum cut of a graph is a partition of the vertices of a graph into two disjoint subsets that are joined by at
least one edge whose cut set has the smallest number of edges (unweighted case) or smallest sum of weights possible.
Example: Now let us consider a simple graph, as in the figure, and formulate the maxflow problem as a linear
program. (We then want to argue that its dual is mincut.)
M ax xsu + xsv
s.t xsu + 0 + 0 + 0 + 0 ≤ 10
0 + xsv + 0 + 0 + 0 ≤ 5
0 + 0 + xuv + 0 + 0 ≤ 15
0 + 0 + 0 + xut + 0 ≤ 5
0 + 0 + 0 + 0 + xvt ≤ 10
xsu − xuv − xut = 0
xsv + xuv − xvt = 0
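For concreteness, this LP can be handed directly to a solver. The sketch below (ours, using SciPy) encodes the five capacity bounds as variable bounds and the two conservation equations as equality constraints:

from scipy.optimize import linprog

# x = (x_su, x_sv, x_uv, x_ut, x_vt); maximize x_su + x_sv
cap = [10, 5, 15, 5, 10]
A_eq = [[1, 0, -1, -1, 0],    # conservation at u
        [0, 1, 1, 0, -1]]     # conservation at v
res = linprog(c=[-1, -1, 0, 0, 0],         # negated for maximization
              A_eq=A_eq, b_eq=[0, 0],
              bounds=[(0, ub) for ub in cap], method="highs")
print(res.x, -res.fun)   # max-flow value 15 for these capacities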
$$A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & -1 & -1 & 0 \\ 0 & 1 & 1 & 0 & -1 \end{bmatrix}, \quad b = \begin{bmatrix} 10 \\ 5 \\ 15 \\ 5 \\ 10 \\ 0 \\ 0 \end{bmatrix}, \quad c = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \end{bmatrix}^T$$
The dual variables associated with the rows, in order, are (ysu, ysv, yuv, yut, yvt, uu, uv).
The dual of this problem represents the min-cut. Let us assume uu is a variable that is 1 if u is on the source side S of
the cut and 0 otherwise. Similarly, uv is the variable for vertex v.
• The first constraint states that if u is not in S, then edge su should be added to the cut.
• The third constraint implies that if u is in S and v is not (i.e., −uu + uv = −1), then edge uv should be in the
cut.
Though the dual LP does not insist that the yi or uu be in {0, 1}, we loosely argue that the nature of the problem forces
these variables to take only integer values. (See more details in one of the next lectures.)
Problem: There are many other classical examples of primal-dual pairs in algorithms. List 3 different pairs.
Chapter 17
More on Duality
17.1 Introduction
In the previous lecture, we saw a set of examples of primal and dual problems. The beauty of the structure of
problems in this space is worth noticing and appreciating. Duality theory allows a lot more than this. For example, it
can help you verify whether a solution is optimal. It can also help you design approximation algorithms.
We first see some important results in this space in this lecture. Later on, we also see how the primal and dual
problem structure can be used for designing approximation algorithms (in one of the next lectures).
Farkas’s lemma is a result in mathematics stating that a vector is either in a given convex cone or that there exists
a hyperplane separating the vector from the cone. There are no other possibilities.
Theorem (Weak duality). If x is feasible for the primal (maximization) problem and y is feasible for the dual
(minimization) problem, then
cT x ≤ bT y
Proof.
cT x = xT c ≤ xT (AT y) = (Ax)T y ≤ bT y
• It always holds, for convex and nonconvex problems. See the derivation of the general case using
Lagrangians.
• It can be used to find nontrivial lower bounds for difficult problems.
• Note that this relationship holds for all feasible points, not only the optimal points.
Theorem 17.2.2 (Strong duality). If a linear programming problem has an optimal solution, so does its dual, and
the respective optimal costs are equal,
i.e., cT x∗ = bT y∗, where x∗ and y∗ are the optimal solutions of the primal and dual problems.
• It (usually) holds for convex problems. (Q: Isn’t it always?)
• The conditions that guarantee strong duality in convex problems are called constraint qualifications.
Result 1:
From weak duality, we can see that if P is unbounded, then D must be infeasible. Similarly, if the dual is unbounded,
the primal must be infeasible.
Result 2:
Let x and y be feasible solutions to the primal and the dual, respectively, and suppose that bT y = cT x. Then x
and y are optimal solutions to the primal and the dual, respectively. This allows us to verify optimality.
If P and D are the primal-dual pair of an LP problem, then exactly one of four cases occurs:
in this classification, cases (2) and (3) come from Result 1, case (4) comes from strong duality, and case (1) can be
realized easily with A = 0, b < 0 and c > 0.
Let x̄ and ȳ be the optima of the primal and dual integer programs, and let x∗ and y∗ be the optima of the relaxed
primal and dual linear programs. Then (for a maximizing primal and minimizing dual):
cT x̄ ≤ cT x∗ = bT y∗ ≤ bT ȳ
or simply
cT x̄ ≤ bT ȳ
The duality gap is the difference between the optimal primal and dual objective values. If cT x∗ is the optimal primal
value and bT y∗ is the optimal dual value, then
Duality gap = bT y∗ − cT x∗
Weak duality gives Duality Gap ≥ 0; under strong duality, Duality Gap = 0.
Figure 17.1: Visualization of the duality gap
Their difference is called the duality gap. For convex optimization problems, the duality gap is zero under a constraint
qualification condition. Thus, a solution to the dual problem provides a bound on the value of the solution to the
primal problem; when the problem is convex and satisfies a constraint qualification, then the value of an optimal
solution of the primal problem is given by the dual problem.
17.4 Examples
Example 23. Let A be a symmetric square matrix. Consider the LP
min cT x
s.t. Ax ≥ c
x ≥ 0
Its dual (using AT = A) is
max cT y
s.t. Ay ≤ c
y ≥ 0
Now, suppose we are given x∗ ≥ 0 with Ax∗ = c. Then x∗ satisfies both the primal and the dual constraints, and gives
cT x∗ as the objective value for the primal problem, and
cT x∗ as the objective value for the dual problem.
As the given problem is an LP, a convex problem, strong duality must hold: the dual and primal objectives meet at
the optimal solution. Hence x∗ is the optimal solution.
min cT x
Ax ≥ b
x≥0
Form the dual problem and convert it into an equivalent minimization problem. Derive a set of conditions on the
matrix A and the vectors b, c, under which the dual is identical to the primal, and construct an example in which
these conditions are satisfied.
M in 2x1 + x2
s.t. x1 + x2 ≤ 6
x1 + 3x2 ≤ 3
x1 , x 2 ≥ 0
(a) Plot the feasible region and solve the problem graphically.
(b) Find the dual and solve it graphically.
(c) Verify that the primal and dual optimal solutions satisfy the Strong Duality Theorem.
Problem. Rock, Paper, and Scissors is a game in which two players simultaneously reveal no fingers (rock), one
finger (paper), or two fingers (scissors). The payoff to player 1 for the game is governed by the following table:
Note that the game is void if both players select the same alternative; otherwise, rock breaks scissors and wins,
scissors cut paper and wins, and paper covers rock and wins. Use the linear-programming formulation of this game
and linear-programming duality theory to show that both players' optimal strategy is to choose each alternative with
probability 1/3. [?]
Problem. Why is it that, if the primal has a unique optimal solution x∗, there is a sufficiently small amount by which
c can be altered without changing the optimal solution? [?]
Chapter 18
More on Duality
18.1 Review of Important Results
P:  Max cT x,  Ax ≤ b,  x ≥ 0
D:  Min bT y,  AT y ≥ c,  y ≥ 0
Examples
1. Numerical Problem:
max 2x1 + 3x2
such that 4x1 + 8x2 ≤ 12
2x1 + x2 ≤ 3
3x1 + 2x2 ≤ 4
x1 , x2 ≥ 0
• Duality Theorem for LP: If P and D are a primal-dual pair for an LP, then exactly one of four cases occurs:
1. Both are infeasible
2. P is unbounded and D is infeasible
3. D is unbounded and P is infeasible
4. Both are feasible and there exist solutions x and y to P and D such that cT x = bT y
• IP duals are weak duals:
P = max(cT x | Ax ≤ b, x ≥ 0, x ∈ Zn)
D = min(bT y | AT y ≥ c, y ≥ 0, y ∈ Zm)
For any feasible x̄ and ȳ,
cT x̄ ≤ bT ȳ
Example: Discuss the four cases of the Duality Theorem for LP, i.e.,
1. Both primal and dual are infeasible
2. Primal is unbounded and dual is infeasible
3. Dual is unbounded and primal is infeasible
4. Both primal and dual are feasible
Example of case 1:
maximize 2x1 − x2
such that
x1 − x2 ≤ 1 (18.1)
− x1 + x2 ≤ −3 (18.2)
x1 , x 2 ≥ 0 (18.3)
The dual of the above primal is
minimize y1 − 3y2
such that
y1 − y2 ≥ 2 (18.4)
− y1 + y2 ≥ −1 (18.5)
y1 , y 2 ≥ 0 (18.6)
Both primal and dual are infeasible
Example of case 2:
maximize 2x2 + x3
x1 − x2 ≤ 5 (18.7)
− 2x1 + x2 ≤ 3 (18.8)
x2 − 2x3 ≤ 5 (18.9)
minimize 5y1 + 3y2 + 5y3
y1 − 2y2 ≥ 0 (18.10)
− y1 + y2 + y3 ≥ 2 (18.11)
− 2y3 ≥ 1 (18.12)
Here primal is unbounded and dual is infeasible
Example of case 3:
minimize 5x1 + 3x2 + 5x3
x1 − 2x2 ≥ 0 (18.13)
− x1 + x2 + x3 ≥ 2 (18.14)
− 2x3 ≥ 1 (18.15)
x1 , x2 , x3 ≥ 0
maximize 2y2 + y3
y1 − y2 ≤ 5 (18.16)
− 2y1 + y2 ≤ 3 (18.17)
y2 − 2y3 ≤ 5 (18.18)
y1 , y 2 , y 3 ≥ 0
Here dual is unbounded and primal is infeasible
Example of Case 4:
maximize 2x1 + x2
x1 + x2 ≤ 6 (18.19)
x1 + 2x2 ≤ 8 (18.20)
minimize 6y1 + 8y2
y1 + y2 ≥ 2 (18.21)
y1 + 2y2 ≥ 1 (18.22)
Problem: For the following problem construct the dual problem and verify the strong duality (i.e, primal optimum
=dual optimum)
Maximize 5x + 4y + 3z
subject to
x + y + z ≤ 30
2x + y + 3z ≤ 60
3x + 2y + 4z ≤ 84
18.2 Duality from a Lagrangian Perspective
The Lagrange multiplier method is often used in calculus to minimize a function subject to (equality) constraints. For
example, in order to solve the problem
$$\min\; x^2 + y^2 \quad \text{subject to } x + y = 1, \tag{18.23}$$
we form the Lagrangian
$$L(x, y, \lambda) = x^2 + y^2 + \lambda(1 - x - y) \tag{18.24}$$
and set
$$\frac{\partial L}{\partial x} = 0, \qquad \frac{\partial L}{\partial y} = 0 \tag{18.25}$$
The optimal solution to this unconstrained problem is x = y = λ/2 and depends on λ. The constraint x + y = 1
(or ∂L/∂λ = 0) gives an additional relation, i.e., λ = 1, and the optimal solution is x = y = 1/2.
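This computation can be reproduced symbolically; a minimal SymPy sketch (ours):

import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = x**2 + y**2 + lam * (1 - x - y)    # the Lagrangian (18.24)

# Stationarity in x and y, plus the constraint (dL/dlam = 0):
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam])
print(sol)    # {x: 1/2, y: 1/2, lam: 1}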
Basic Idea
• We relax the hard constraint in our original constrained problem (18.23) and associate a Lagrange multiplier,
or price, λ with the amount (1 − x − y) by which it is violated. This gives us an unconstrained problem.
• When the price λ is properly chosen, the optimal solution of the constrained problem is also an optimal solution
of the unconstrained problem.
• Under this specific value of λ, the optimal cost is unaltered by the presence or absence of the hard constraint.
The variables λi are the Lagrange multipliers. We observe that, for every feasible x and λ ≥ 0, f0(x) is bounded
below by L(x, λ), i.e., f0(x) ≥ L(x, λ), and in fact f0(x) = maxλ≥0 L(x, λ).
Here we have used the fact that maxλ≥0 λT f is 0 if f ≤ 0 and +∞ otherwise. (Note that λT f is shorthand for
Σi λi fi(x).)
Now we define a dual function (to be precise, the Lagrangian dual function) g(λ) = minx L(x, λ). Since
f0(x) ≥ L(x, λ) ≥ minx L(x, λ), we have the relationship
f0(x) ≥ g(λ)
and the dual optimum is
$$d^* = \max_{\lambda \geq 0} \min_{x} L(x, \lambda)$$
Minmax inequality. Another important relation to keep in mind at this stage is the minmax inequality. It says that
for any function ϕ of two variables x ∈ X and y ∈ Y,
$$\max_{y \in Y} \min_{x \in X} \phi(x, y) \leq \min_{x \in X} \max_{y \in Y} \phi(x, y)$$
To show this, note that for any x and y,
$$\min_{x' \in X} \phi(x', y) \leq \phi(x, y) \leq \max_{y' \in Y} \phi(x, y')$$
Start from this inequality and take minx on the RHS and maxy on the LHS. This leads to the general inequality above.
Using it will also lead to weak duality in the general case.
Consider the LP
$$p^* = \max_{x}\; c^T x : Ax \leq b$$
The Lagrangian function is
$$L(x, \lambda) = c^T x + \lambda^T (b - Ax) = b^T \lambda + (c^T - \lambda^T A)x$$
We can see that this function is ≥ cT x when λ ≥ 0 and Ax ≤ b. Let us define the dual function g(λ) = maxx L(x, λ); then
$$d^* = \min_{\lambda \geq 0} g(\lambda) = \min_{\lambda \geq 0} \max_{x}\; \big[ b^T \lambda + (c^T - \lambda^T A)x \big] = \min_{\lambda \geq 0} \Big( b^T \lambda + \max_{x} (c^T - \lambda^T A)x \Big)$$
If cT − λT A has nonzero entries, then the maximum over all x is ∞ and gives a useless upper bound. We should
consider only the case λT A = cT. This leads to the dual problem
$$d^* = \min b^T \lambda \quad \text{s.t.} \quad A^T \lambda = c,\; \lambda \geq 0$$
Problem: If A is TU, will AT also be TU?
Theorem 18.3.1. If A is TU and b is integral, then the LP has an integral optimal solution.
Proof.
max{cT x | Ax ≤ b, b ∈ Zm, x ≥ 0}
To solve this, we saw the process of adding slack variables and creating the equality A′x = b. A basic solution satisfies
$$x_B = (A'_B)^{-1} b = \frac{1}{\det(A'_B)} \operatorname{adj}(A'_B)\, b$$
Since A′B has determinant −1 or 1 (and not zero), and the cofactors are integers, xB takes only integer values.
We will study this pair when the graph is a BPG and in the general case.
Cases:
Maximum matching: in a graph G, find the maximum number of edges without any common vertices.
Minimum vertex cover: in a graph G, select the minimum number of vertices so that at least one endpoint of every
edge of the graph is among the selected vertices.
Maximum Matching formulation:
$$\text{Maximize} \sum_{j=1}^{m} X_j \tag{18.26}$$
such that
$$\sum_{k:(j,k)\in E} X_k \leq 1 \quad \forall j \in V \tag{18.27}$$
$$X_i \in \{0, 1\} \tag{18.28}$$
Minimum Vertex Cover formulation:
$$\text{Minimize} \sum_{v \in V} x_v \tag{18.29}$$
such that
$$x_u + x_v \geq 1 \quad \text{for every } \{u, v\} \in E \tag{18.30}$$
$$x_u \in \{0, 1\} \tag{18.31}$$
Minimum vertex cover and maximum matching have a primal-dual relationship.
Consider a graph with n vertices v1, v2, ..., vn and m edges e1, e2, ..., em. Let us define an incidence matrix A
such that:
$$a_{ij} = \begin{cases} 1 & \text{if } v_i \in e_j \\ 0 & \text{else} \end{cases}$$
The maximum matching IP is: max 1T x s.t.
Ax ≤ 1, x ≥ 0, x ∈ Zm
and its dual, the minimum vertex cover IP: min 1T y s.t.
AT y ≥ 1, y ≥ 0, y ∈ Zn
For a BPG, one can show that A is always TU, and this yields integral solutions.
Figure 18.1: BPG
$$A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}$$
See that this matrix is TU.
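For small matrices, total unimodularity can be verified by brute force over all square submatrices. The sketch below (our own helper, exponential-time, toy instances only) confirms that this matrix is TU and that the odd-cycle matrix appearing at the end of this section is not:

import numpy as np
from itertools import combinations

def is_totally_unimodular(A):
    # Check that every square submatrix has determinant in {-1, 0, 1}.
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                d = round(np.linalg.det(A[np.ix_(rows, cols)]))
                if d not in (-1, 0, 1):
                    return False
    return True

print(is_totally_unimodular(np.array([[1, 1], [1, 0], [0, 1]])))           # True
print(is_totally_unimodular(np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]])))  # False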
For the primal, max 1T x such that Ax ≤ 1, the optimal x is
$$x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad LP^* = IP^* = 1$$
[objective: cT x = bT y = 1 (same)]
P rimal : max x1 + x2 + x3
x1 + x2 ≤ 1
x2 + x3 ≤ 1
x3 + x1 ≤ 1
Dual : min y1 + y2 + y3
y1 + y2 ≥ 1
y2 + y3 ≥ 1
y3 + y1 ≥ 1
On solving the primal: LP∗ = 3/2, while the integer optimum is IP∗ = 1.
On solving the dual: LP∗ = 3/2, while the integer optimum is IP∗ = 2.
This implies that the LP did not have an integral optimum. (Why? Because A is not TU.)
$$A = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \qquad |A| = 2 \;\Rightarrow\; A \text{ is not TU}$$
Chapter 19
19.1 Review and Summary
Duality Theorem
This theorem states that the problems Primal (P) and Dual (D) are intimately related. We can summarize their
relationship in the following table:
19.2 Complementary Slackness
Complementary slackness is a relationship between the primal and the dual. We know that the number
of variables in the dual equals the number of constraints in the primal, and the number of constraints in the
dual equals the number of variables in the primal. It means that variables in one problem are complementary to
constraints in the other. A constraint having slack means it is not binding; for an inequality constraint, the
constraint has slack if the slack variable is positive. Complementary slackness relates slackness in
the primal constraints to slackness in the dual constraints. The following are the primal and dual complementary
slackness conditions. These conditions guarantee that the values of the primal and dual are the same.
We know about primal and dual methods. Consider the following primal-dual pair with the conditions
YT [AX − b] = 0 (19.6)
XT [AT Y − c] = 0 (19.7)
From eqn (19.6), either Y = 0 or AX = b. Similarly, from eqn (19.7), either X = 0 or AT Y = c.
We conclude that, componentwise, either the ith variable is 0 or the ith constraint is tight.
19.2.3 Complementary Slackness: Ver2
Complementary slackness tells us that when the primal LP sets some variable to be non-zero, then it is for some
"good reason":
For 1 ≤ j ≤ n: either xj = 0 or Σ_{i=1}^{m} aij yi = cj, i.e., XT [AT Y − c] = 0
For 1 ≤ i ≤ m: either yi = 0 or Σ_{j=1}^{n} aij xj = bi, i.e., YT [AX − b] = 0
Note: while solving a problem by the P-D method, ensure PCS strictly and keep DCS relaxed.
If the conditions are obeyed for feasible solutions x, y, then the solutions are optimal. For solving any problem, start with
a variable X and update it, until a feasible value satisfying the slackness constraints is found:
Xi → Xi+1, i.e., Xi+1 ← Xi + Y
19.3 Introduction to Primal Dual Method
The primal-dual method is a standard tool for designing algorithms for combinatorial optimization problems. In this
lecture, we focus on showing how to modify the primal-dual method to provide good approximation algorithms for a
wide variety of NP-hard problems.
The primal-dual method was originally proposed by Dantzig, Ford and Fulkerson as another means of solving linear
programs. However, it is now more widely used for devising algorithms for problems in combinatorial optimization.
The main feature of the primal-dual method is that it allows a weighted optimization problem to be reduced to a purely
combinatorial unweighted problem. It also leads to efficient polynomial-time algorithms for approximating NP-hard problems.
The general framework of the primal-dual method is summarized below:
The primal (P) is
$$\min \sum_{j=1}^{n} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{n} a_{ij} x_j \geq b_i, \quad x_j \geq 0, \qquad i = 1, \ldots, m, \; j = 1, \ldots, n$$
and its dual (D) is
$$\max \sum_{i=1}^{m} b_i y_i \quad \text{s.t.} \quad \sum_{i=1}^{m} a_{ij} y_i \leq c_j, \quad y_i \geq 0, \qquad i = 1, \ldots, m, \; j = 1, \ldots, n$$
Given an (NP-hard) optimization problem, we formulate it as an IP and relax it to obtain an LP.
One option is then to round the optimal solution x∗ of the LP to obtain an integral solution. In the primal-dual method, we
instead find a feasible integral solution to the LP (and thus to the IP) from scratch (instead of solving the LP), using the
dual D as our guide.
Specifically, we do either of the following:
If we use the first way, that is, ensure the primal conditions and relax the dual conditions, we have:
Lemma 19.3.1. If x and y are feasible solutions of P and D respectively, satisfying conditions in the first way i.e.
primal conditions are ensured and dual conditions are relaxed, then:
$$\sum_{j=1}^{n} c_j x_j \leq \beta \sum_{i=1}^{m} b_i y_i$$
Proof.
$$\sum_{j=1}^{n} c_j x_j = \sum_{j=1}^{n} \Big(\sum_{i=1}^{m} a_{ij} y_i\Big) x_j = \sum_{i=1}^{m} \Big(\sum_{j=1}^{n} a_{ij} x_j\Big) y_i \leq \beta \sum_{i=1}^{m} b_i y_i$$
More specifically, let α = 1 if the primal conditions are ensured and β = 1 if the dual conditions are ensured, then
we have:
Lemma 19.3.2. If x and y are feasible solutions of P and D respectively, satisfying the complementary slackness
conditions stated above, then both x and y are αβ-approximate solutions:
$$\sum_{j=1}^{n} c_j x_j \leq \alpha\beta \sum_{i=1}^{m} b_i y_i$$
Proof.
$$\sum_{j=1}^{n} c_j x_j \leq \alpha \sum_{j=1}^{n} \Big(\sum_{i=1}^{m} a_{ij} y_i\Big) x_j = \alpha \sum_{i=1}^{m} \Big(\sum_{j=1}^{n} a_{ij} x_j\Big) y_i \leq \alpha\beta \sum_{i=1}^{m} b_i y_i$$
19.3.2 Introduction
Today's tool is an adaptation of a fundamental tool in the design of algorithms for linear programming and
combinatorial optimization: the primal-dual method.
We know that, at the optima,
cT X∗ = bT Y∗ (1)
Also, since X∗ and Y∗ are feasible,
cT X∗ ≤ (AT Y∗)T X∗ = Y∗T A X∗ ≤ Y∗T b
Substituting in (1), all the inequalities above must hold with equality; this gives Fact 1 (the complementary
slackness conditions).
Fact 3. Let x and y be feasible solutions to (1) and (2) satisfying the following conditions :
a) Increase the value of yi in some fashion until a dual constraint goes tight, i.e., Σ_{i=1}^{m} aij yi = α cj for some j,
while always maintaining feasibility of y.
b) Select some subset of tight dual constraints and increase the value of the primal variables corresponding to
these constraints by an integral amount.
Step 4. The cost of the dual solution is used as a lower bound for the primal optimization problem. Note that the
approximation guarantee of the algorithm is αβ.
Example. In this section we briefly describe how to use PCS to solve the dual problem. Take the following example.
We know the solution of the primal: z1 = 42; x1 = 0; x2 = 10.4; x3 = 0; x4 = 0.4. Now we use this information to solve
its dual using complementary slackness.
In this example x2 and x4 are positive, so their corresponding dual constraints are tight by PCS. Similarly,
primal constraints 1 and 3 are tight but not the second constraint, so the corresponding variable y2 equals 0. Then
y1 + y3 = 4
4y1 − y3 = 1
Solving these equations we get y1 = 1, y3 = 3 and z2 = 42. So for this primal-dual pair the optima are the same.
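The two tight equations can be solved mechanically; a short NumPy check (ours):

import numpy as np

# PCS: x2, x4 > 0 make their dual constraints tight; the slack primal
# constraint forces y2 = 0, leaving two equations in y1, y3.
M = np.array([[1.0, 1.0],     # y1 + y3 = 4
              [4.0, -1.0]])   # 4y1 - y3 = 1
print(np.linalg.solve(M, np.array([4.0, 1.0])))   # expected: [1, 3]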
max x1 − x2
subject to −2x1 + x2 ≤ 2
x1 − 2x2 ≤ 2
x1 + x2 ≤ 5
x≥0
Suppose I claimed that (1, 4) solves the primal. How could you check this using complementary slackness?
19.5 Shortest Path
Data:
1. a directed graph G = (N, A);
2. a starting node s ∈ N;
3. a cost function c : A → R.
Problem Statement:
Find the minimum-cost (i.e., shortest) paths from s to all nodes of N. The problem is also called the Shortest Path
Tree/Arborescence Problem, because of a property of its solution: the set of all shortest paths forms a spanning
arborescence rooted at s.
Figure: A shortest-path arborescence (s = 1). Costs are black. Flows are red. Distances are blue.
Primal-Dual pair
$$\text{P)} \;\; \min \sum_{(i,j)\in A} c_{ij} x_{ij} \;\; \text{s.t.} \;\; \sum_{(j,i)\in \delta_i^-} x_{ji} - \sum_{(i,j)\in \delta_i^+} x_{ij} = 1 \;\; \forall i \in N\setminus\{s\}, \quad \sum_{(j,s)\in \delta_s^-} x_{js} - \sum_{(s,j)\in \delta_s^+} x_{sj} = 1 - n, \quad x_{ij} \geq 0$$
$$\text{D)} \;\; \max \sum_{i \in N\setminus\{s\}} y_i \;\; \text{s.t.} \;\; y_j - y_i \leq c_{ij} \;\; \forall (i,j) \in A, \quad y_i \in \mathbb{R} \;\; \forall i \in N$$
Observation 1. If we add a constant α to each y variable, nothing changes. Hence we are allowed to fix one
variable: ys = 0.
Observation 2. We have m inequality constraints, n − 1 original y variables and m slack variables. The LP tableau
of the dual problem has m rows and n − 1 + m columns. Hence in each basic solution of D there are m basic
variables and n − 1 non-basic (null) variables. By the complementary slackness theorem, there should be n − 1
basic (positive) variables in the primal problem.
Observation 3. We have n equality constraints that are not linearly independent: summing up all the rows we
obtain 0 = 0. Hence we are allowed to delete one constraint: we delete the flow conservation constraint for s.
Observation 4. We now have n − 1 equality constraints and m variables. The LP tableau of P has n − 1 rows and
m columns. Hence in each base solution of P there are n − 1 basic variables and m − (n − 1) non-basic variables.
$y_i \in \mathbb{R} \quad \forall i \in N \setminus \{s\}$
Only arcs (i, j) for which $y_i + c_{ij} = y_j$ can carry flow $x_{ij}$.
By applying the above constraints together with a graph algorithm such as Dijkstra's or Ford-Fulkerson, the problem is solved.
For example, the algorithm can be organized around suitable data structures, with an initialization step (Step FF1) and an iteration step (Step FF2) that repeats while an improving update exists, else it terminates.
Feasibility
After initialization (Step FF1) we have neither primal feasibility nor dual feasibility.
Primal viewpoint: We have $\pi_i = nil$ for all $i \in N$; hence no flow enters any node.
Dual viewpoint: We have $y_i = \infty$ for all $i \in N \setminus \{s\}$; hence all constraints $y_i - y_s \le c_{si}$ are violated.
The FF algorithm maintains the CSCs and iteratively enforces primal and dual feasibility.
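The dual variables $y_i$ are exactly the shortest-path distances, so Dijkstra's algorithm computes them directly, and complementary slackness then identifies the arcs that may carry flow. A minimal sketch (the small graph here is a made-up example, not one from the notes):

import heapq

# adj[u] = list of (v, cost); y[u] will hold the shortest distance from s,
# i.e. the dual variable of node u (with y[s] = 0 as fixed in Observation 1).
adj = {1: [(2, 4), (3, 1)], 2: [(4, 1)], 3: [(2, 2), (4, 5)], 4: []}
s = 1
y = {u: float('inf') for u in adj}
y[s] = 0
pq = [(0, s)]
while pq:
    d, u = heapq.heappop(pq)
    if d > y[u]:
        continue                      # stale queue entry
    for v, c in adj[u]:
        if d + c < y[v]:
            y[v] = d + c
            heapq.heappush(pq, (d + c, v))

# complementary slackness: flow may use only arcs with y[u] + c == y[v]
tight = [(u, v) for u in adj for v, c in adj[u] if y[u] + c == y[v]]
print(y)       # {1: 0, 2: 3, 3: 1, 4: 4}
print(tight)   # [(1, 3), (3, 2), (2, 4)]: a shortest-path arborescence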
19.6 Example: MST
19.6.1 MST: Ver1
$x_e \ge 0 \quad \forall e \in E$
$y_\pi \ge 0 \quad \forall \pi$
In the algorithm, first note that we need to maintain the $y_\pi$'s. Initially they are all zeroes. At any stage we need to improve the cost function. In general, we can do that by increasing some of the y's and decreasing some. To keep things as simple as possible, we will try and do this by only increasing one of them at a time.
Let us assume that all $c_e$'s are strictly positive, so initially none of the constraints are tight. We wish to increase the cost function. To do so we have to increase some $y_\pi$. Which one? It seems that we gain the most by increasing the one where each vertex is in a separate partition. So, suppose we start to increase this. How much can we increase it by? We see that we can increase it up to the weight of the minimum-weight edge. At this point the inequalities corresponding to all edges of minimum weight will be tight.
Let us consider the generic step. So suppose that at some stage we have some y. As per our recipe we need to consider only the inequalities which are equalities. So, let F denote the set of edges which are currently tight (the corresponding inequalities are tight). We need to increase some $y_\pi$ such that for each of these edges the sum of the increases in the y's that the edge crosses is at most zero. This means we can increase a $y_\pi$ such that none of the edges of F cross π. How do we find such a π? The most natural way is to find the connected components and put each component in one part.
1. Initialization: We think of all $y_\pi$'s as zero. Note that we cannot explicitly set them.
2. Iterative Step: Let E′ denote the set of edges which are tight. Find the connected components of the graph G′ = (V, E′). Increase $y_\pi$, where the parts of π are the connected components of G′, till some edge becomes tight.
Proof of Optimality
We can prove that the above algorithm gives an optimum solution by exhibiting a primal and dual solution of the
same cost.
This is exactly the cost of the dual. So we have shown that the cost of the primal equals the cost of the dual, proving optimality.
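Since edges go tight in increasing order of weight, the primal solution this scheme constructs is exactly the one built by Kruskal's algorithm. A minimal union-find sketch of that primal side (the edge list is illustrative):

# Edges as (weight, u, v); keep an edge exactly when it joins two different
# components, i.e. when it crosses the partition pi currently being raised.
def mst(n, edges):
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(edges):           # edges go tight in weight order
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                 # merge the two components
            tree.append((u, v, w))
    return tree

print(mst(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (1, 2, 3)]))
# [(0, 1, 1), (2, 3, 1), (1, 2, 2)]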
MST: Ver2
$$\text{Min} \sum_{e \in E} c_e x_e$$
where $x_e = 1$ if $e$ is included and $0$ otherwise.
$$\sum_{e \in E} x_e \ge |V| - 1 \qquad (19.8)$$
$$\sum_{e=(u,v):\, u,v \in S} x_e \le |S| - 1 \qquad \forall\, S \subseteq V \qquad (19.9)$$
$$x_e \ge 0 \qquad \forall\, e \in E \qquad (19.10)$$
Equation (19.8) ensures that we have at least |V| − 1 edges in the solution; any MST has exactly |V| − 1 edges. Constraint (19.9) ensures that no cycle forms within any subset S of vertices: there can be at most |S| − 1 solution edges between the vertices of S.
Dual of the above:
$$\text{Max}\; (|V| - 1)\alpha + \sum_{S \subseteq V} (1 - |S|)\beta_S \qquad (19.11)$$
$$\text{s.t.}\quad \alpha - \sum_{S:\, u,v \in S} \beta_S \le c_e \qquad \forall\, e = (u, v) \in E \qquad (19.12)$$
$$\beta_S \ge 0 \quad \forall\, S \subseteq V, \qquad \alpha \ge 0 \qquad (19.13)$$
Algorithm
1. Initialization: We think of all $(\alpha - \sum \beta_S)$ as 0. Note that we cannot explicitly set them.
2. Iterative Steps: Let $E_p$ denote the set of edges which are tight. Find the connected components of the graph $G_p = (V, E_p)$. Increase $(\alpha - \sum \beta_S)$ till some edge becomes tight, where the parts of the partition of V are the connected components of $G_p$.
3. The previous step terminates when we get one connected component.
$$x_s \ge 0 \quad \forall\, s \in S \qquad (19.16)$$
where $x_s = 1$ if set $s$ is taken and $0$ otherwise, for each set $s \in S$.
Dual:
$$\text{Max} \sum_u y_u \qquad (19.17)$$
$$\text{s.t.}\quad \sum_{u \in s} y_u \le c_s \qquad \forall\, s \in S \qquad (19.18)$$
$$y_u \ge 0 \qquad \forall\, u \in V \qquad (19.19)$$
Approximate Algorithm:
$w_i$ = weight of set $s_i$
2. J ← ∅ // nothing yet in the cover
3. I ← [1, ..., m] // elements yet to be covered
4. while I ≠ ∅ do
5.   pick i ∈ I // and now try to increase y_i as much as possible
6.   j_i ← argmin { w_j | j ∈ [n], i ∈ S_j }
7.   y_i ← w_{j_i} // this is the most y_i could be increased to
8.   // the dual constraint corresponding to j_i becomes "binding" (an equality)
9.   for each j where i ∈ S_j do
10.    w_j ← w_j − w_{j_i}
11.  end for
12.  J ← J ∪ {j_i}
13.  I ← I − S_{j_i} // those in S_{j_i} are already covered
14. end while
15. Return J
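A direct transcription of steps 2-15 into Python follows; it is a sketch under the assumption that the sets are given as Python sets over elements 1..m with weights w (all names here are illustrative).

def pd_set_cover(m, sets, w):
    w = list(w)                      # residual weights; shrink as duals grow
    J = set()                        # indices of chosen sets (step 2)
    I = set(range(1, m + 1))         # elements yet to be covered (step 3)
    y = {}                           # dual variable per element
    while I:                         # step 4
        i = next(iter(I))            # step 5: pick an uncovered element
        cand = [j for j in range(len(sets)) if i in sets[j]]
        ji = min(cand, key=lambda j: w[j])   # step 6: cheapest set containing i
        delta = w[ji]
        y[i] = delta                 # step 7: raise y_i until S_ji goes tight
        for j in cand:               # steps 9-11: reduce slack of such sets
            w[j] -= delta
        J.add(ji)                    # step 12
        I -= sets[ji]                # step 13: S_ji's elements are covered
    return J, y

print(pd_set_cover(4, [{1, 2}, {2, 3}, {3, 4}], [1.0, 3.0, 2.0]))
# e.g. ({0, 2}, {1: 1.0, 3: 2.0})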
Solution.
The general idea is to work with an LP-relaxation of an NP-hard problem and its dual. Then the algorithm iteratively
changes a primal and a dual solution until the relaxed primal-dual complementary slackness conditions are satisfied.
Primal-Dual Schema
xj ≥ 0 j = 1, ..., n.
yi ≥ 0 i = 1, ..., m.
We will move forward using this schema, ensuring one set of conditions and suitably relaxing the other. We capture both situations by relaxing both conditions: if primal conditions are to be ensured, we set α = 1 below, and if dual conditions are to be ensured, we set β = 1.
Lemma. If x and y are primal and dual feasible solutions respectively satisfying the complementary slackness
conditions stated above, then val(x) ≤ αβval(y).
$$\text{val}(x) = \sum_{j=1}^{n} c_j x_j \le \alpha \sum_{i=1}^{m} \Big( \sum_{j=1}^{n} a_{ij} x_j \Big) y_i \le \alpha \beta \sum_{i=1}^{m} b_i y_i = \alpha \beta\, \text{val}(y)$$
which was claimed.
The algorithm starts with a primal infeasible solution and a dual feasible solution; usually these are x = 0 and y = 0 initially. It iteratively improves the feasibility of the primal solution and the optimality of the dual solution, ensuring that in the end a primal feasible solution is obtained and all conditions stated above, with a suitable choice of α and β, are satisfied. The primal solution is always extended integrally, thus ensuring that the final solution is integral. The improvements to the primal and the dual go hand-in-hand: the current primal solution is used to determine the improvement to the dual, and vice versa. Finally, the cost of the dual solution is used as a lower bound on the optimum value, and by the above Lemma the approximation guarantee is αβ.
19.8.1 Ver1
Primal: $\text{Min} \sum_{v \in V} w_v x_v$
Constraints: $x_u + x_v \ge 1 \;\; \forall (u, v) \in E$, $\quad x_v \ge 0$
Dual: $\text{Max} \sum_{e \in E} y_e$
Constraints: $\sum_{e:\, v \in e} y_e \le w_v \;\; \forall v \in V$, $\quad y_e \ge 0$.
Iterative Steps
1. Start with X = [0 . . . 0] and Y = [0 . . . 0]. This is feasible for D (the dual) but not for P (the primal). Keep modifying X until it is feasible for P.
2. Increase y for edge "a" to 4, so that the constraint at vertex 4 becomes tight: $y_a = 4 \Rightarrow x_4 = 1$; freeze a, b and c. (Freeze means we have to set the values of the remaining variables so that the constraint stays satisfied; in the above example we have to make $y_b = 0$ and $y_c = 0$ for $y_a + y_b + y_c = 4$.)
3. Increase y for "e" to 1: $y_e = 1 \Rightarrow x_5 = 1$; freeze e and d.
4. $y_g = 2 \Rightarrow x_2 = 1$; freeze g, h and f.
19.8.3 Summary:
1. PD is an exact algorithm for combinatorial optimization problems.
2. PD for approximation algorithms: relax the dual conditions so that for each i, either $y_i = 0$ or $b_i \le \sum_{j=1}^{n} a_{ij} x_j \le \beta b_i$, where β ≥ 1.
PD Method:
1. Start with an initial guess.
2. Move to a better guess, guided by some constraints.
3. When nothing better can be found, stop.
Statement - Given weights $w_i$ on $i \in V$, find a min-cost subset S of V such that at least one endpoint of every edge is in S.
Solution:
Ground Set: V
Costs: $w_i$, $i \in V$
DCS - $y_e > 0 \Rightarrow x_u + x_v = 1$.
These conditions will guide us to design an approximation algorithm. Let us see how to interpret these conditions. A primal-dual algorithm constructs primal and dual feasible solutions simultaneously. To ensure that these solutions are optimal, the primal condition says that a vertex should be picked only if this vertex is saturated by the matching, and the dual condition says that an edge should be picked only if exactly one of its vertices is chosen. The dual condition is difficult because we may pick one vertex for each edge in the matching and still cover all the edges in the graph. So let us relax the dual condition by a factor of 2, i.e. set β = 2:
ye > 0 ⇒ xu + xv ≤ 2.
This relaxation makes the problem much easier because the dual condition is now satisfied automatically, as $x_u$ and $x_v$ are each at most 1. Now we just need to construct primal and dual solutions satisfying the primal complementary slackness condition only. This can be achieved by the following simple algorithm:
1. Initialization: x=0,y=0.
Clearly this algorithm will produce a feasible solution for the vertex cover problem, and also satisfy the primal
complementary slackness condition.
Steps -
Here, 1 ≤ xu + xv ≤ β, where β = 2.
Weighted Vertex Cover (WVC): Given an undirected graph G = (V, E), where |V| = n and |E| = m, and a cost function on vertices $c : V \to \mathbb{Q}^+$, find a subset C ⊆ V such that every edge e ∈ E has at least one endpoint in C and C has minimum cost.
Formulate vertex cover as the following IP. For each vertex $i \in V$ ($V = \{1, 2, 3, ..., n\}$), let $x_i \in \{0, 1\}$ be a variable such that $x_i = 1$ if $i \in C$ and $x_i = 0$ otherwise. We have:
$$\min \sum_{i=1}^{n} c_i x_i$$
$$\text{s.t.}\quad x_i + x_j \ge 1 \quad \forall (i, j) \in E$$
$$x_i \in \{0, 1\} \quad \forall i \in V$$
The LP relaxation replaces the integrality constraint by nonnegativity:
$$\text{s.t.}\quad x_i + x_j \ge 1 \quad \forall (i, j) \in E$$
$$x_i \ge 0 \quad \forall i \in V$$
Assign a dual variable $y_{ij}$ to the constraint $x_i + x_j \ge 1$. We have the corresponding dual D:
$$\max \sum_{(i,j)\in E} y_{ij}$$
$$\text{s.t.}\quad \sum_{j:(i,j)\in E} y_{ij} \le c_i \quad \forall i \in V$$
$$y_{ij} \ge 0 \quad \forall (i, j) \in E$$
Let us choose α = 1 and β > 1, to ensure the primal conditions and suitably relax the dual conditions:
For each vertex $i \in V$: either $x_i = 0$ or $\sum_{j:(i,j)\in E} y_{ij} = c_i$.
For each edge $(i, j) \in E$: either $y_{ij} = 0$ or $1 \le x_i + x_j \le \beta$, where β > 1.
Therefore, when $x_i \ne 0$, we have $\sum_{j:(i,j)\in E} y_{ij} = c_i$. When $\sum_{j:(i,j)\in E} y_{ij} = c_i$ for some i, we say that this constraint goes tight.
Step 1. Start with x = [0 0 ... 0] and y = [0 0 ... 0].
Step 2. Increase y for edge a, i.e. $x_4 = 1$. This implies that $x_4$ can be in the vertex cover. Freeze a, b, c.
Step 3. Increase y for edge e, i.e. $x_5 = 1$. Freeze d, e.
Step 4. Increase y for edge h, i.e. $x_2 = 1$. Freeze f, g, h.
Lemma 19.8.1. Let x and y be the solutions obtained from the above algorithm, then x is primal feasible and y is
dual feasible.
Proof. Note that each edge (i, j) removed from E is incident on some vertex i s.t. $x_i = 1$. Additionally, the loop terminates when all edges have been removed. Therefore $x_i + x_j \ge 1$ for all $(i, j) \in E$, i.e. x is feasible for P.
Likewise, once the constraint goes tight for some i, i.e. $\sum_{j:(i,j)\in E} y_{ij} = c_i$, the algorithm removes these edges. Therefore none of the values of $y_{ij}$ exceeds $c_i$. Hence y is feasible for D.
Theorem 19.8.2. The above algorithm produces a vertex cover C with an approximation ratio 2.
The last inequality follows from the fact that each edge (i, j) has exactly 2 endpoints, so each $y_{ij}$ is counted at most twice. Therefore we conclude that:
$$\text{cost}(C) \le 2 \sum_{(i,j)\in E} y_{ij} \le 2\,\text{OPT}(D) \le 2\,\text{OPT}$$
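The whole argument can be exercised in a few lines. The sketch below (illustrative names, tiny made-up graph) raises $y_{ij}$ on an uncovered edge until one endpoint goes tight, adds that vertex, and removes its edges; by the analysis above, the result costs at most twice the dual value, hence at most 2·OPT.

def pd_vertex_cover(c, edges):
    slack = dict(c)                      # remaining slack c_i minus y's at i
    cover, y = set(), {}
    remaining = list(edges)
    while remaining:
        i, j = remaining[0]              # an uncovered edge
        inc = min(slack[i], slack[j])    # largest feasible increase of y_ij
        y[(i, j)] = inc
        slack[i] -= inc
        slack[j] -= inc
        tight = i if slack[i] == 0 else j
        cover.add(tight)                 # pick the vertex that went tight
        remaining = [e for e in remaining if tight not in e]
    return cover, y

print(pd_vertex_cover({1: 2.0, 2: 3.0, 3: 1.0}, [(1, 2), (2, 3), (1, 3)]))
# ({1, 2}, {(1, 2): 2.0, (2, 3): 1.0}): cost 5 <= 2 * (dual value 3)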
For any X ⊆ V, we set f(X) = 1 iff there exist $u \in X$ and $v \in \bar{X}$ such that u and v belong to some set $S_i$; otherwise f(X) = 0.
Let δ(X) be the set of edges with exactly one endpoint in X. For each edge $e \in E$, let a binary variable $x_e \in \{0, 1\}$ indicate whether the edge is chosen in the subgraph.
$$\text{s.t.}\quad \sum_{e \in \delta(X)} x_e \ge f(X) \quad \forall X \subseteq V$$
$$x_e \ge 0$$
$$y_X \ge 0$$
Let $c_j$ be the weight of subset $S_j$. Let $x_j$ be a binary variable such that $x_j = 1$ if $S_j \in C$, otherwise $x_j = 0$. We have the following IP:
$$\min \sum_{j=1}^{n} c_j x_j$$
$$\text{s.t.}\quad \sum_{j:\, i \in S_j} x_j \ge 1 \quad \forall i \in \{1, ..., m\}$$
$$x_j \in \{0, 1\} \quad \forall j \in \{1, ..., n\}$$
The LP relaxation is:
$$\min \sum_{j=1}^{n} c_j x_j$$
$$\text{s.t.}\quad \sum_{j:\, i \in S_j} x_j \ge 1 \quad \forall i \in \{1, ..., m\}$$
$$x_j \ge 0 \quad \forall j \in \{1, ..., n\}$$
Let $y_i$ be the dual variable corresponding to the constraint $\sum_{j:\, i \in S_j} x_j \ge 1$. The corresponding dual D is:
$$\max \sum_{i=1}^{m} y_i$$
$$\text{s.t.}\quad \sum_{i \in S_j} y_i \le c_j \quad \forall j \in \{1, ..., n\}$$
$$y_i \ge 0 \quad \forall i \in \{1, ..., m\}$$
For some j, if the dual constraint $\sum_{i \in S_j} y_i \le c_j$ holds with equality, we say that this constraint goes tight and the corresponding $S_j$ is tight. We have the following algorithm:
Step 1. Initialize x = 0, y = 0.
Step 2. while U ≠ ∅ do:
a) Choose an uncovered element, say i, and raise $y_i$ until some set in S goes tight, say $S_j$.
b) Choose all these tight sets $S_j$ and set $x_j = 1$.
c) Remove all the elements in these sets $S_j$ from U.
d) Remove all these sets $S_j$ from the collection S.
Step 3. end while
Step 4. return C = {S_j | x_j = 1}
QUES 2: Derive a Primal-Dual based exact algorithm for the minimum spanning tree problem.
In the algorithm, first note that we need to maintain the yπ . Initially they are all zeroes. At any stage we need
to improve the cost function. In general we can do that by increasing some of the y and decreasing some. To keep
things as simple as possible, we will try and do this by only increasing one of them at a time.
Let us assume that all ce are strictly positive, so initially, none of the constraints are tight. We wish to increase the
cost function. To do so we have to increase some yπ . Which one? It seems that we gain the most by increasing the
one where each vertex is in a separate partition. So, suppose we start to increase this. How much can we increase
it by? We see that we can increase this up to the weight of the minimum weight edge. At this point the inequalities
corresponding to all edges of minimum weight will be tight. Let us consider the generic step. So suppose that at
some stage we have some y. Let F denote the set of edges which are currently tight (the corresponding inequalities
are tight.) We need to increase some yπ such that for each of these edges the sum of the increases in the y that the
edge crosses is at most zero. This means we can increase a yπ such that none of the edges of F cross π. How do we
find such a π? The most natural is to find the connected components and put each component in one part.
Here then is the algorithm for MST:
1. INITIALIZATION: we think of all $y_\pi$ as 0. Note that we cannot explicitly set them.
2. ITERATIVE STEP: let E′ denote the set of edges which are tight. Find the connected components of the graph G′ = (V, E′). Increase $y_\pi$ till some edge becomes tight, where the parts of π are the connected components of G′.
3. The previous step terminates when we get one connected component.
Problems
Minimum-cost branching: Given a directed graph G = (V, A) and a root $r \in V$, find a min-cost subgraph such that there is a directed path from r to every other vertex.
Maximum Independent Set: Given a graph G = (V, A), find an independent set such that adding any other vertex to the set forces the set to contain an edge.
QUES 2: Derive a Primal-Dual based exact algorithm for the travelling salesman problem.
Chapter 20
20.1 Introduction
We have been mostly looking into the linear optimization (linear programming) and the associated concepts from
linear algebra. We had also seen how many of these concepts (eg. duality) are more general than linear programming.
We also argued that linear programming is convex, while integer programming is not convex. What does it mean?
Why it should matter to us? What are the general class of convex optimization schemes? How does one optimize
when problem is non-convex or/and non-linear? We will discuss some of these in the next few lectures. Note that
this is a huge area of literature. We will be discussing only a limited set of topics.
Points of the form
$$y = \theta x_1 + (1 - \theta) x_2,$$
where θ ∈ R, form the line passing through $x_1$ and $x_2$. The parameter value θ = 0 corresponds to $y = x_2$, and the parameter value θ = 1 corresponds to $y = x_1$. Values of the parameter θ between 0 and 1 correspond to the (closed) line segment between $x_1$ and $x_2$.
y = x2 + θ(x1 − x2 ) (20.2)
gives another interpretation: y is the sum of the base point x2 (corresponding to θ = 0) and the direction x1 − x2
(which points from x2 to x1 ) scaled by the parameter θ. Thus, θ gives the fraction of the way from x2 to x1 where y
lies. As θ increases from 0 to 1, the point y moves from x2 to x1 .
Figure 20.1: The line passing through x1 and x2 is described parametrically by θx1 + (1 − θ)x2 , where θ
varies over R.
The reason for us to familiarize ourselves with the concept of line segment is going to be clear in the next section,
when we dive into convex sets, and convex functions.
In other words, a set is convex if every point in the set can be seen by every other point in the set: the segment between them stays inside the set.
Figure 20.2: Examples of convex and non-convex sets.
Examples: Figure 20.3 shows an example of a convex set and a non-convex set. Figure 20.2 shows an additional example at the right extreme: the square contains some boundary points but not all, and is not convex.
Figure 20.3: Left: the hexagon is a convex set, while Right: the kidney-shaped set is clearly not a convex set
$$\{x \mid a^T x = b\}$$
$$P = \{x \mid a_i^T x \le b_i,\; c_j^T x = d_j\}$$
Figure 20.4: Example of a polyhedron
Proof. Let C = A ∩ B, take any $x_1, x_2 \in C$ and θ ∈ [0, 1]. Since $x_1$ and $x_2$ are in A and A is convex, it follows that $\theta x_1 + (1 - \theta) x_2$ is in A. Similarly $x_1$ and $x_2$ are in B, and B being convex, $\theta x_1 + (1 - \theta) x_2$ is in B. Thus $\theta x_1 + (1 - \theta) x_2$ is in A and B, i.e. $\theta x_1 + (1 - \theta) x_2$ is in C.
2. Union of two convex sets need not be convex.
Proof. We provide a counterexample. Consider two sets on the real line R, A = [α, β] and B = [γ, λ] with β < γ, so that A ∩ B = ∅, i.e. they are mutually exclusive. A convex combination of β ∈ A and γ ∈ B can lie strictly between the two intervals, hence outside A ∪ B; for example, with A = [0, 1] and B = [2, 3], the midpoint of 1 and 2 is 1.5 ∉ A ∪ B. So A ∪ B is not convex.
3. If C = {x} is a convex set, then αC = {αx} is also convex.
4. If C = {x} is a convex set, then (C + t) = {x + t} is also convex.
5. If C = {x} is a convex set, then (aC + b) = {ax + b} is also convex.
6. Set Sum: If $C_1$ and $C_2$ are convex, $C_1 + C_2 = \{x + y \mid x \in C_1, y \in C_2\}$ is also convex.
Before appreciating the convex hull, we look at a convex combination. We call a point of the form $\theta_1 x_1 + \cdots + \theta_k x_k = \sum_i \theta_i x_i$, where $\theta_1 + \cdots + \theta_k = 1$ and $\theta_i \ge 0$, $i = 1, \ldots, k$, a convex combination of the points $x_1, \ldots, x_k$. A convex combination of points can be thought of as a mixture or weighted average of the points, with $\theta_i$ the fraction of $x_i$ in the mixture.
Definition. Convex hull of a set C is the set of all convex combinations of points in C and is represented by Conv(C).
Note that Conv(C) is convex and C ⊆ Conv(C). From the definition of the convex set, one can also see that the
convex hull of a convex set is itself.
The convex hull of C is always convex, since any convex combination of points of C again lies in Conv(C). In fact, it is the smallest convex set that contains C: if B is any convex set that contains C, then Conv(C) ⊆ B. Figure 20.6 shows the convex hull of a set of points. As evident from the figure, the upper part of the hull (colored in blue) is called the upper hull, while the lower part (colored in red) is called the lower hull.
• We know that R is convex. Note that the set of integers Z is not. (Why? A convex combination of integers can result in a point that is not an integer.)
Figure 20.5: The convex hulls of two sets in R2 . Left. The convex hull of a set of thirteen points (shown as
dots) is the pentagon (shown shaded). Right. The convex hull of the kidney shaped set is the shaded set.
• You may remember that linear programming is convex while integer programming is not. You can connect this to the fact that the real space (over which LP optimizes) is convex, while the set of integers is not.
• What happens with LP relaxation? We find the convex hull! (A computational sketch follows below.)
Figure 20.6: Convex hull of a set of points, showing the upper and the lower hull
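For finite point sets the hull can be computed directly; a minimal sketch assuming scipy is available:

import numpy as np
from scipy.spatial import ConvexHull

# Conv(C) of a finite set: the hull vertices are the extreme points; every
# other point of C is a convex combination of them.
pts = np.array([[0, 0], [2, 0], [1, 1], [0, 2], [2, 2], [1, 0.5]])
hull = ConvexHull(pts)
print(pts[hull.vertices])   # the four corner points of the square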
Assume we are given two convex sets C and D such that C ∩ D = ∅. Then we can always find a separating hyperplane that separates C and D. To be precise, there exist a ≠ 0 and b such that $a^T x \le b$ for every x in C and $a^T x \ge b$ for every x in D. See Figure 20.7 for a visualization.
If C and D are the convex sets and C ∩ D = ∅, then there exists a non-zero a such that
∀x ∈ C aT x ≤ b
and
∀x ∈ D aT x ≥ b
such a hyperplane is characterized by a and is called the separating hyperplane.
Figure 20.7: Separating hyperplane for two convex sets
Figure 20.8: Example of a convex function. The chord joining two points lies above the function
A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if dom(f) is a convex set and if for all x, y ∈ dom(f) and θ with 0 ≤ θ ≤ 1, we have
$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$$
This is called Jensen's inequality. In other words, it is a function defined on a convex domain such that for any two points in the domain, the segment between the two points lies above the function. Figure 20.8 shows an example of a convex function: for any two points x and y in the domain of f, the line joining (x, f(x)) and (y, f(y)) lies above the function.
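The definition can also be probed numerically. A hedged sketch: random trials cannot prove convexity, but a single violated trial disproves it.

import random

def seems_convex(f, lo, hi, trials=10000):
    # test f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) at random points
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return False
    return True

print(seems_convex(lambda x: x * x, -5, 5))    # True
print(seems_convex(lambda x: x ** 3, -5, 5))   # False: the cubic is not convex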
20.3.2 Epigraph
The link between convex sets and convex functions is via epigraph: A function is convex if and only if its epigraph
is a convex set.
In other words, if a function is convex, then its epigraph is a convex set; conversely, if the epigraph of a function is a convex set, then the function is convex. Figure 20.9 (Left) shows the epigraph of a function f. As evident from the figure, the function f can be imagined as a bucket and the epigraph as the whole space inside the bucket.
Since the epigraph of a function is the part above the function, if the epigraph is a convex set then the segment joining any two points of the epigraph lies entirely inside the epigraph.
Figure 20.9 (Right) shows the epigraph of a non-convex function f. As evident from the figure, the line segment connecting two points x and y does not lie entirely inside the epigraph. So there exists at least one point among the points p = θx + (1 − θ)y, 0 ≤ θ ≤ 1, which does not lie above the function, thus violating the inequality. It has to be noted that the epigraph of the function is not a convex set, hence the function is non-convex.
Figure 20.9: Left: Epigraph of a convex function. Line joining any two points inside the green region lies
inside the epigraph. Right: For a non-convex function, line joining two points x and y does not lie completely
inside the epigraph of the function
20.3.3 Hessians
A function is convex if its domain is convex and its Hessian is a PSD (Positive Semi Definite) matrix.
$$H = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right]_{n \times n} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
If you are not familiar with the terminology of Hessian, please read the appendix. More insight into this is seen in
the following two subsections.
Assume that the function f is differentiable at each point in its domain, that is, the gradient ∇f exists. Then f is convex if and only if dom f is convex and
$$f(y) \ge f(x) + \nabla f(x)^T (y - x)$$
holds for all x, y ∈ dom f. This inequality shows that from local information about a convex function we can derive its global properties. This is one of the most important properties of a convex function, and is crucial in convex optimization. In particular, if ∇f(x) = 0, then for all y ∈ dom f we have f(y) ≥ f(x); that is, x is a global minimizer of the function f, and f(x) is the minimum value.
This is familiar: for any differentiable function we locate candidate minimizers by setting the derivative to 0, and such a point is a local minimum when the second derivative is nonnegative. For convex functions, the condition ∇f(x) = 0 alone is already sufficient for a global minimum.
Assume that f is twice differentiable, that is, its Hessian $\nabla^2 f$ exists at each point in dom f, which is open. Then f is convex if and only if dom f is convex and its Hessian is positive semidefinite for all x ∈ dom(f):
$$\nabla^2 f(x) \succeq 0 \qquad (20.8)$$
In other words, if the gradient of the function is non-decreasing, then the function is convex: the graph of the function has nonnegative curvature at every point x. In addition, dom f should be a convex set.
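Checking positive semidefiniteness of a Hessian at a point amounts to checking that its eigenvalues are nonnegative; a minimal numpy sketch for the (constant) Hessian of f(x1, x2) = x1² + x2²:

import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, 2.0]])                  # Hessian of x1^2 + x2^2
print(np.all(np.linalg.eigvalsh(H) >= 0))   # True: H is PSD, f is convex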
The above definitions of the convex functions are not for strictly convex functions. If we use < instead of ≤ in
equation 20.10, we obtain the strictly convex functions.
In other words, the α-sublevel set of a function f is the set of all values of x in the domain of f for which f(x) is less than or equal to α. Figure 20.10 illustrates the α-sublevel set of a function f.
Lemma 20.4.1. Sublevel sets of a convex function are convex, for any value of α.
Proof. The proof is immediate from the definition of convexity. Let us consider two elements x and y from the set $C_\alpha$. Then f(x) ≤ α and f(y) ≤ α, and so f(θx + (1 − θ)y) ≤ α for 0 ≤ θ ≤ 1; hence θx + (1 − θ)y ∈ $C_\alpha$. So any convex combination of x and y is present in the set. Since we considered a general value of α, the α-sublevel sets of a convex function are convex for every value of α.
But note that the converse is not true: a function can have all its sublevel sets convex without being a convex function. An example is f(x) = −eˣ, which is not convex on R (in fact it is strictly concave), although all its sublevel sets are convex.
Figure 20.10: α-sublevel sets of a function f(x). $C_1$ is the $\alpha_1$-sublevel set of f, $C_2$ is the $\alpha_2$-sublevel set of f; $C_1, C_2 \subseteq \text{dom}\, f$
A function $f : \mathbb{R}^n \to \mathbb{R}$ is concave if dom(f) is a convex set and if for all x, y ∈ dom(f) and θ with 0 ≤ θ ≤ 1, we have
$$f(\theta x + (1 - \theta) y) \ge \theta f(x) + (1 - \theta) f(y)$$
Examples:
1. f(x) = −x²
2. g(x) = √x
3. the sin function is concave on the interval [0, π]
Question: If a function f is convex, will −f be concave?
A function $f : \mathbb{R}^n \to \mathbb{R}$ is called quasiconvex (or unimodal) if its domain and all its sublevel sets
$$S_\alpha = \{x \in \text{dom}\, f \mid f(x) \le \alpha\},$$
for α ∈ R, are convex. Figure 20.11 shows a quasiconvex function: all its α-sublevel sets are convex, yet the function is not convex. On the real line, this says each α-sublevel set is an interval (which might be unbounded). In the figure, $S_\alpha$ is the interval [a, b], while $S_\beta$ is the interval (−∞, c].
Figure 20.11: Quasiconvex function. The α-sublevel and β-sublevel sets of the function are convex, i.e. intervals.
{x ∈ Rn | aT x ≤ β} and {x ∈ Rn | aT x ≥ α}
Figure 20.12: An example quasi-convex function.
x + S2 can be expressed as {y | y = x + z, z ∈ S2}.
Consider the set of all x with this property. For any two points $x_1$ and $x_2$ in this set, by definition we have
$$x_1 + z \in S_1 \;\;\forall z \in S_2, \qquad x_2 + z \in S_1 \;\;\forall z \in S_2$$
Consider any λ ∈ [0, 1]. Then for any z ∈ S2,
$$\lambda x_1 + (1 - \lambda) x_2 + z = \lambda (x_1 + z) + (1 - \lambda)(x_2 + z) \in S_1,$$
because $(x_1 + z)$ and $(x_2 + z)$ are in $S_1$ and $S_1$ is a convex set. Therefore the set containing such x is also convex.
(c) $\{x \mid \|x - x_0\|_2 \le \|x - y\|_2 \text{ for all } y \in S\}$, $S \subseteq \mathbb{R}^n$
Solution: For any fixed $y \in S$, the set $\{x \mid \|x - x_0\|_2 \le \|x - y\|_2\}$ is a halfspace. To see this, square both sides and expand: x is closer to $x_0$ than to a point $x_i$ if and only if
$$2 (x_i - x_0)^T x \le x_i^T x_i - x_0^T x_0.$$
Collecting the conditions for points $x_1, \ldots, x_K$, the set can be written as $Ax \preceq b$ with
$$A = 2 \begin{bmatrix} (x_1 - x_0)^T \\ (x_2 - x_0)^T \\ \vdots \\ (x_K - x_0)^T \end{bmatrix}, \qquad b = \begin{bmatrix} x_1^T x_1 - x_0^T x_0 \\ x_2^T x_2 - x_0^T x_0 \\ \vdots \\ x_K^T x_K - x_0^T x_0 \end{bmatrix}$$
Once we know each such set is a halfspace, we can conclude the given set is convex: it can be expressed as
$$\bigcap_{y \in S} \{x \mid \|x - x_0\|_2 \le \|x - y\|_2\}, \qquad (20.12)$$
i.e. an intersection of halfspaces, and an intersection of halfspaces is always convex. Hence the given set is convex.
2. Check whether the following functions are convex, concave or quasi-convex
(a) f(x) = eˣ − 1 on R
The exponential function is convex, so the given function is (strictly) convex, and every convex function is quasiconvex too.
(b) $f(x_1, x_2) = x_1 x_2$ on $\mathbb{R}^2_{++}$
The Hessian of f is
$$\nabla^2 f(x) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},$$
which is neither positive nor negative semidefinite. Therefore f is neither convex nor concave. It is quasiconcave, since its superlevel sets
$$\{(x_1, x_2) \in \mathbb{R}^2_{++} \mid x_1 x_2 \ge \alpha\}$$
are convex.
(c) $f(x_1, x_2) = \dfrac{1}{x_1 x_2}$ on $\mathbb{R}^2_{++}$
The Hessian of f is
$$\nabla^2 f(x) = \frac{1}{x_1 x_2} \begin{bmatrix} 2/x_1^2 & 1/(x_1 x_2) \\ 1/(x_1 x_2) & 2/x_2^2 \end{bmatrix} \succeq 0$$
Therefore f is convex and quasiconvex. It is not concave.
(d) Inverse of an increasing convex function: Suppose f : R → R is increasing and convex on its domain (a, b). Let g denote its inverse, i.e. the function with domain (f(a), f(b)) and g(f(x)) = x for a < x < b. What can you say about the convexity or concavity of g?
(e) Check if the following set is convex or not: the set of points whose distance to a does not exceed a fixed fraction θ of the distance to b, i.e.
$$\{x \mid \|x - a\|_2 \le \theta \|x - b\|_2\}$$
where a ≠ b, 0 ≤ θ ≤ 1, and $a, b, x \in \mathbb{R}^n$.
20.7 Convex Optimization
Convex optimization studies the problem of minimizing convex functions over convex sets.
Definition. Convex minimization is the minimization of a convex function over a convex set.
The convexity property can make optimization in some sense easier than the general case: for example, any local minimum must be a global minimum.
Standard form is the usual and most intuitive form of describing a convex minimization problem:
minimize $f_0(x)$
subject to $a_i^T x = b_i$
where the functions $h_i$ are affine. In practice, the terms "linear" and "affine" are often used interchangeably. Such constraints can be expressed in the form $h_i(x) = a_i^T x + b_i$, where $a_i$ is a column vector and $b_i$ a real number.
20.8.1 Gradient
The gradient of a function g(x) of n variables, at x̂, is the vector of first partial derivatives evaluated at x̂, denoted ∇g(x̂):
$$\nabla g(\hat{x}) = \begin{bmatrix} \frac{\partial g(\hat{x})}{\partial x_1} \\ \frac{\partial g(\hat{x})}{\partial x_2} \\ \vdots \\ \frac{\partial g(\hat{x})}{\partial x_n} \end{bmatrix}$$
20.8.2 Hessian
The Hessian of a function g(x) of n variables, at x̂, is the matrix of second partial derivatives evaluated at x̂, denoted ∇²g(x̂):
$$H(g) = \begin{bmatrix} \frac{\partial^2 g(\hat{x})}{\partial x_1^2} & \frac{\partial^2 g(\hat{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 g(\hat{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 g(\hat{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 g(\hat{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 g(\hat{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 g(\hat{x})}{\partial x_n \partial x_1} & \frac{\partial^2 g(\hat{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 g(\hat{x})}{\partial x_n^2} \end{bmatrix}$$
This is a symmetric matrix, because $\frac{\partial^2 g(\hat{x})}{\partial x_i \partial x_j} = \frac{\partial^2 g(\hat{x})}{\partial x_j \partial x_i}$.
20.8.3 Jacobian
The Jacobian of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is the m × n matrix whose entries are the partial derivatives of the components of f. Specifically:
$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
J is the derivative matrix (or Jacobian matrix) evaluated at $x_0$.
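When gradients are computed by hand, a central-difference check is a handy sanity test. A minimal sketch (the sample function g is an arbitrary illustration):

import numpy as np

def num_grad(g, x, h=1e-6):
    # central differences: (g(x + h e_k) - g(x - h e_k)) / (2h) for each k
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        grad[k] = (g(x + e) - g(x - e)) / (2 * h)
    return grad

g = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
print(num_grad(g, [1.0, 2.0]))   # approx [8, 3] = [2*1 + 3*2, 3*1]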
Chapter 21
Convex Optimization
21.1 Introduction
Mathematical optimization problems can be very difficult to solve in general, in terms of computational complexity. However, some classes of problems can be solved easily; these include least-squares problems, linear programming problems, and convex optimization problems. We have seen least squares and linear programming in the past. This lecture focuses primarily on an overview of convex optimization formulations and different variations thereof. Let us start with a brief comparison of convex optimization with the other well-known and, arguably, easier-to-solve classes of optimization, viz. least squares and linear programs, as shown in Table 21.1.
Table 21.1: Convex optimization versus least squares and linear programs
21.1.1 Review
• A set C is convex, if for any x1 , x2 ∈ C, any x3 = θx1 + (1 − θ)x2 is also in C, where 0 ≤ θ ≤ 1
• Example: R is convex while Z is not convex. Integer set can be convex when it is a singleton set.
• A function f is convex iff
f (θx1 + (1 − θ)x2 ) ≤ θf (x1 ) + (1 − θ)f (x2 )
See the figure 21.1 for a quick understanding.
Figure 21.1: Example of a convex function. Note that the function is always below the line joining x1 and
x2 for all the points between x1 and x2 .
minimize f0 (x)
(21.1)
subject to fi (x) ≤ bi , i = 1, . . . , m
where x = (x1, . . . , xn) are the optimization variables, $f_0 : \mathbb{R}^n \to \mathbb{R}$ is the objective function, and $f_i : \mathbb{R}^n \to \mathbb{R}$, i = 1, . . . , m are the constraint functions. The optimal solution is the one which has the smallest value of the objective function while satisfying all the constraints.
Optimization problems with additional constraints such as in equation 21.1 are often called constrained optimization
problems. There are also problems which are unconstrained optimization problems.
The problem is to find any point x∗ in χ for which the number f (x) is smallest, i.e. a point x∗ such that f (x∗ ) ≤ f (x)
for all x ∈ χ
Convex minimization has applications in a wide range of disciplines, such as automatic control systems, estimation and signal processing, communications and networks, electronic circuit design, data analysis and modeling, statistics (optimal design), and finance. With recent improvements in computing power and new theoretical breakthroughs, convex minimization is nearly as straightforward as linear programming in terms of ease of solution.
Convex optimization is the formulation of the general optimization problem over convex sets and convex functions. More specifically, convex optimization refers to the problem of optimizing a convex function over a convex set defined by constraints which are themselves either convex or linear. For a convex optimization problem, any locally optimal solution is also the global optimum.
Why is it important to know convex optimization? Solving optimization problems is generally difficult, except for special classes, such as the ones above, that can be solved efficiently and reliably. Is it that most problems are convex? Not really. Many practical problems are non-convex. We will discuss some of these in the next lectures.
21.2.2 Formulation
minimize f0 (x)
subject to fi (x) ≤ 0, i = 1, . . . , m (21.2)
hi (x) = 0, i = 1, . . . , p
Here the vector x = (x1, ..., xn) is the optimization variable of the problem, the function $f_0 : \mathbb{R}^n \to \mathbb{R}$ is the convex objective function, and the functions $f_i : \mathbb{R}^n \to \mathbb{R}$, i = 1, ..., m, are the (inequality) constraint functions. Sometimes $f_i(x) \le 0$ is written as $f_i(x) \le b_i$.
This refers to the problem of finding the value of x that minimises the function f0 while simultaneously satisfying the
conditions fi (x) ≤ 0 and hi (x) = 0 for all i = 1, . . . , m and i = 1, . . . , p respectively. The variable, or rather vector,
x ∈ Rn is the optimization variable, and the function f0 (x) : Rn → R is the cost function or objective function. The
constraints fi (x) ≤ 0, i = 1, . . . , m are the inequality constraints, and the functions fi : Rn → R are usually convex
in nature. If there are no constraints, then we have an unconstrained problem.
Where the functions hi are affine. In practice, the terms “linear” and “affine” are often used interchangeably. Such
constraints can be expressed in the form hi (x) = aTi x + bi , where ai is a column-vector and bi a real number.
In convex optimization problems, the inequality constraints and the equality constraints are the explicit constraints.
However, we can convert all the explicit constraints to implicit ones by redefining their domains. The extreme case here would be converting a standard convex optimization problem to an unconstrained one: minimize F(x), where the function F is the same as f0 but with a redefined domain restricted to the feasible set, dom F = {x ∈ dom f0 | fi(x) ≤ 0, i = 1, . . . , m, hi(x) = 0, i = 1, . . . , p}. Alternatively, if fi(x) ≤ 0 and hi(x) = 0 are the explicit constraints, then the implicit constraints are given by x ∈ dom fi, x ∈ dom hi, and D = dom f0 ∩ . . . ∩ dom fm ∩ dom h1 ∩ . . . ∩ dom hp, where D is the domain of the objective function.
In any general convex optimization problem, if the objective function is zero, then the optimal value is either zero if the feasible set is nonempty, or ∞ if the feasible set is empty. Basically, with f0(x) = 0, the convex optimization problem reduces to
minimize 0
subject to fi (x) ≤ 0, i = 1, . . . , m (21.3)
hi (x) = 0, i = 1, . . . , p
find x
subject to fi(x) ≤ 0, i = 1, . . . , m (21.4)
hi(x) = 0, i = 1, . . . , p
where the goal is to determine whether the constraints are consistent, and find a suitable solution satisfying them.
As already mentioned, the optimal value will be 0 if the constraints are feasible, and any value of x satisfying the
constraints will be optimal. In case the constraints are infeasible, we obtain an optimal value of ∞.
For any general convex optimization problem, suppose the objective function f0 is differentiable with respect to x. Then for any x, y in the feasible set, we have the following relation:
f0 (y) ≥ f0 (x) + ∇f0 (x)T (y − x) (21.5)
We can thus conclude that a point x in the feasible set is optimal if and only if
$$\nabla f_0(x)^T (y - x) \ge 0 \quad \text{for all feasible } y.$$
Geometrically, this optimality criterion tells us that if ∇f0(x) ≠ 0, then it defines a supporting hyperplane to the feasible set at x, as shown in Figure 21.2.
Two optimization problems are considered to be equivalent if the solution of one can be obtained from the solution
of the other, and vice-versa. In practice, a large number of optimization problems can be converted to a convex
optimization problem, and solved. These conversions are usually done using a few common transformations that
preserve convexity, viz.:
• Change of variables
• Transformation of objective function
• Transformation of constraint functions
• Eliminating equality constraints
Figure 21.2: Optimality condition shown geometrically. The feasible region is the shaded convex hull X.
Possible level curves of f0 are shown as dashed lines. At the optimal point x, −∇f0 (x) defines a supporting
hyperplane.
All convex functions are also quasiconvex, but not all quasiconvex functions are convex, so quasiconvexity is a
generalization of convexity. Quasiconvexity and quasiconcavity extend to functions with multiple arguments the
notion of unimodality of functions with a single real argument.
Definition. A function f : S → R defined on a convex subset S of a real vector space is quasiconvex if for all x, y
∈ S and λ ∈ [0, 1] we have:
$$f(\lambda x + (1 - \lambda) y) \le \max\{f(x), f(y)\}.$$
In words, if f is such that a point directly between two other points never gives a higher value of the function than both of the other points do, then f is quasiconvex. Note that the points x and y, and the point directly between them, can be points on a line or, more generally, points in n-dimensional space.
A quasilinear function is both quasiconvex and quasiconcave as shown in Figure 21.3.
Figure 21.3 shows the graph of a function that is both concave and quasiconvex on the nonnegative real numbers. An alternative way of defining a quasiconvex function f(x) is to require that each sublevel set $S_\alpha(f) = \{x \mid f(x) \le \alpha\}$ is a convex set.
If furthermore $f(\lambda x + (1 - \lambda) y) < \max\{f(x), f(y)\}$ for all f(x) ≠ f(y) and λ ∈ (0, 1), then f is strictly quasiconvex. That is, strict quasiconvexity requires that a point directly between two other points must give a lower value of the function than one of the other points does.
A quasiconcave function is a function whose negative is quasiconvex, and a strictly quasiconcave function is a function
whose negative is strictly quasiconvex.
Equivalently, a function f is quasiconcave if $f(\lambda x + (1 - \lambda) y) \ge \min\{f(x), f(y)\}$, and strictly quasiconcave if $f(\lambda x + (1 - \lambda) y) > \min\{f(x), f(y)\}$.
A (strictly) quasiconvex function has (strictly) convex lower contour sets, while a (strictly) quasiconcave function
has (strictly) convex upper contour sets. A function that is both quasiconvex and quasiconcave is quasilinear. A
particular case of quasi-concavity is unimodality, in which there is a locally maximal value.
21.3.2 Optimization:
A quasiconvex optimization problem is one where we seek to minimize (or, alternatively, maximize) a quasiconvex function over a convex set defined by constraints that are either convex or linear. Mathematically, a quasiconvex optimization problem is defined as:
minimize f0 (x)
subject to fi (x) ≤ 0, i = 1, . . . , m (21.7)
Ax = b
where the function $f_0 : \mathbb{R}^n \to \mathbb{R}$ is quasiconvex instead of convex, while the functions $f_i : \mathbb{R}^n \to \mathbb{R}$ are convex and the equality constraints are linear. It is to be noted that, owing to the nature of quasiconvex functions, this sort of problem can have locally optimal points that are not globally optimal.
A standard approach to solving quasiconvex optimization problems is to represent the sublevel sets of the quasiconvex function with a family of convex inequalities: if $f_0$ is the quasiconvex objective, we take a family of convex functions $\phi_t$ such that the t-sublevel set of $f_0$ is the 0-sublevel set of $\phi_t$, i.e. $f_0(x) \le t \iff \phi_t(x) \le 0$. We can then solve the quasiconvex optimization problem by solving the associated feasibility problem, viz.
find x
subject to ϕt (x) ≤ 0
(21.9)
and fi (x) ≤ 0, i = 1, . . . , m
and Ax = b
which is a convex feasibility problem in x for fixed t. If it is feasible, then t ≥ p*; otherwise t < p*. We can solve the quasiconvex problem using a variant of the bisection method, as shown below.
minimize $f_0(x)$
subject to $f_i(x) \le 0$, i = 1, 2, . . . , m
Ax = b
Example
$$f_0(x) = \frac{p(x)}{q(x)}$$
with p convex, q concave, and p(x) ≥ 0, q(x) > 0 on dom f0. We can take $\phi_t(x) = p(x) - t\, q(x)$:
1. for t ≥ 0, $\phi_t$ is convex in x
2. p(x)/q(x) ≤ t if and only if $\phi_t(x) \le 0$
Bisection method for quasiconvex optimization:
given l ≤ p*, u ≥ p*, tolerance ϵ > 0
repeat
    t ← (l + u)/2
    solve the convex feasibility problem (21.9)
    if feasible, u ← t
    else l ← t
    end if
until u − l ≤ ϵ
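A runnable sketch of this bisection scheme on a hypothetical linear-fractional instance; a crude grid stands in for the convex feasibility solver, so the example is purely illustrative:

p = lambda x: (x - 3) ** 2 + 1          # convex, nonnegative
q = lambda x: x + 1                     # concave (affine), positive on [0, 5]
xs = [i / 1000 for i in range(5001)]    # discretized feasible set [0, 5]

l, u = 0.0, max(p(x) / q(x) for x in xs)
while u - l > 1e-6:
    t = (l + u) / 2
    feasible = any(p(x) - t * q(x) <= 0 for x in xs)   # phi_t(x) <= 0?
    if feasible:
        u = t        # t >= p*: shrink from above
    else:
        l = t        # t < p*: raise the lower bound
print((l + u) / 2)   # approx 0.246, the minimum of p/q on [0, 5]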
minimize cT x + d
subject to Gx ≤ h (21.10)
Ax = b
Figure 21.5: A linear program shown geometrically. The polyhedron P is the feasible region. The level
curves of the linear objective function cT x are orthogonal to c, and the point x∗ is optimal.
A linear-fractional programming problem is another type of optimization problem, where the objective is to minimize a ratio of affine functions over a polyhedron formed by the intersection of the constraints. The standard form of a linear-fractional program is:
$$\text{minimize} \quad f_0(x) = \frac{c^T x + d}{e^T x + f} \qquad (21.11)$$
$$\text{subject to} \quad Gx \le h, \quad Ax = b$$
where dom f0 = {x | eᵀx + f > 0}. Linear-fractional programs are quasiconvex problems, the objective function being quasiconvex (in fact quasilinear).
A convex optimization problem with an objective function that is quadratic and constraints that are affine in nature
is known as a quadratic optimization problem. In general, a quadratic optimization problem can be expressed as:
$$\text{minimize} \quad \frac{1}{2} x^T P x + q^T x + r$$
$$\text{subject to} \quad Gx \le h \qquad (21.12)$$
$$Ax = b$$
where $P \in S^n_+$, $G \in \mathbb{R}^{m \times n}$, and $A \in \mathbb{R}^{p \times n}$; here $S^n_+$ denotes the set of symmetric positive semidefinite n × n matrices. The feasible region is usually a polyhedron in the case of quadratic optimization, as shown in Figure 21.6.
Figure 21.6: A quadratic optimization shown geometrically. The polyhedron P is the feasible region. The
level curves of the quadratic objective function are shown as dashed curves. x∗ is the optimal point.
Example
Least Squares
minimize ∥ Ax − b ∥22
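Least squares is the one case with a closed-form answer; a short numpy sketch (the data is illustrative):

import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])
x, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes ||Ax - b||_2^2
print(x)                                    # approx [0.667, 0.5]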
Related Topics
This is a variant of quadratic optimization with constraints that are quadratic in nature, as opposed to the affine constraints of plain-vanilla quadratic optimization.
$$\text{minimize} \quad \frac{1}{2} x^T P_0 x + q_0^T x + r_0$$
$$\text{subject to} \quad \frac{1}{2} x^T P_i x + q_i^T x + r_i \le 0, \quad i = 1, \ldots, m \qquad (21.13)$$
$$Ax = b$$
Here $P_i \in S^n_+$, i = 0, . . . , m; when $P_i \succ 0$ the feasible region is an intersection of ellipsoids. Note that linear programs are a special case of quadratic programs (take P = 0); further, quadratic programs, and by extension linear programs, are special cases of quadratically constrained quadratic programs (where additionally $P_i = 0$, i = 1, . . . , m).
Second-order cone programming (SOCP) is a form of convex optimization where the inequalities are second-order cone constraints. It is closely related to other forms of convex optimization: a second-order cone program can be reduced to a QCQP when $c_i = 0$, i = 1, . . . , m, by squaring each of the constraints, and if furthermore $A_i = 0$, i = 1, . . . , m, it reduces to a linear program.
21.4.5 SDP
Semidefinite programming is the subfield of convex optimization that deals with the optimization of a linear objective function over the intersection of the cone of positive semidefinite matrices with an affine space. The inequality constraints are called linear matrix inequalities. Like SOCP, it is closely associated with other forms of convex optimization: linear programs and second-order cone programs can be converted to semidefinite programs, which are more general in nature.
21.4.6 Hard Variations
Slight modifications of the standard convex optimization problem can yield problems that are quite hard to solve
computationally. A few examples of the same include the following:
21.5 Exercise
Exercise. Optimize f(x, y) = 5x − 3y such that x² + y² = 136.
$$F = 5x - 3y - \lambda x^2 - \lambda y^2 + 136\lambda$$
Setting $\partial F/\partial x = \partial F/\partial y = 0$ gives $x = 5/(2\lambda)$ and $y = -3/(2\lambda)$; substituting in the constraint gives $\lambda = \pm 1/4$.
λ = 1/4: then x = 10 and y = −6, so f(10, −6) = 68 (maximization problem).
λ = −1/4: then x = −10 and y = 6, so f(−10, 6) = −68 (minimization problem).
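The stationarity conditions can be checked symbolically; a minimal sketch assuming sympy is available:

import sympy as sp

x, y, lam = sp.symbols('x y lam')
F = 5 * x - 3 * y - lam * (x ** 2 + y ** 2 - 136)
sols = sp.solve([sp.diff(F, v) for v in (x, y, lam)], [x, y, lam], dict=True)
for s in sols:
    print(s, 'f =', 5 * s[x] - 3 * s[y])   # (10, -6) -> 68 and (-10, 6) -> -68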
• f0 (x1 , x2 ) = x1 + x2
• f0 (x1 , x2 ) = max{x1 , x2 }
• f0 (x1 , x2 ) = x21 + 9x22
The feasible set is given by the intersection of the halfspaces defined by the constraints. More specifically, the two inequality constraints intersect where
$$2x_1 + x_2 = 1$$
$$x_1 + 3x_2 = 1$$
Multiplying the second equation by 2 and subtracting the first gives $5x_2 = 1$, so
$$(x_1, x_2) = (2/5, 1/5)$$
Between each of the inequality constraints and the non-negativity constraints, we have the following points of inter-
section:
(0, ∞), (0, 1), (1, 0), (∞, 0)
So, the feasible set or region is given by the convex region with vertices (0, ∞), (0, 1), (2/5, 1/5), (1, 0), (∞, 0), after taking into account the appropriate directions according to the inequalities. This is shown in Figure 21.7.
For f0 (x) = x1 + x2 , the optimal set is given by x∗ = (2/5, 1/5), and the optimal value is 3/5.
For f0 (x) = max{x1 , x2 }, the optimal set is given by x∗ = (1/3, 1/3), and the optimal value is 1/3.
For f0 (x) = x21 + 9x22 , the optimal set is given by x∗ = (1/2, 1/6), and the optimal value is 1/2.
Exercise. Prove that x* = [1, 1/2, −1]ᵀ is optimal for the problem:
$$\text{minimize} \quad \frac{1}{2} x^T P x + q^T x + r$$
$$\text{subject to} \quad -1 \le x_i \le 1 \quad \forall i = 1, 2, 3$$
$$P = \begin{bmatrix} 13 & 12 & -2 \\ 12 & 17 & 6 \\ -2 & 6 & 12 \end{bmatrix}, \quad q = \begin{bmatrix} -22.0 \\ -14.5 \\ 13.0 \end{bmatrix}, \quad r = 1$$
In order to minimize $\frac{1}{2} x^T P x + q^T x + r$, the first thing we need to do is find its gradient. Differentiating the objective function with respect to x gives:
$$\nabla f_0(x)^T = x^T P + q^T$$
In order for the given x* to be the optimal solution, it needs to satisfy the optimality condition specified earlier:
$$\nabla f_0(x)^T (y - x) \ge 0 \quad \forall \text{ feasible } y$$
Now, for x* = [1, 1/2, −1]ᵀ, the gradient of the objective function attains the value
$$\nabla f_0(x^*) = P x^* + q = (-1, 0, 2)^T,$$
which is obtained by plugging in the values of P and q. Therefore the optimality condition, for any y = [y1, y2, y3]ᵀ, reduces to:
$$-(y_1 - 1) + 2(y_3 + 1) \ge 0,$$
which holds for every feasible y: with −1 ≤ yi ≤ 1 the left-hand side is smallest at y1 = 1, y3 = −1, where it equals 0. Since the optimality condition is satisfied along with the feasibility constraints −1 ≤ yi ≤ 1, we conclude that x* = [1, 1/2, −1]ᵀ is indeed an optimal solution for the quadratic optimization problem of minimizing $\frac{1}{2} x^T P x + q^T x + r$.
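A numeric confirmation of the computation above (numpy assumed):

import numpy as np

P = np.array([[13.0, 12.0, -2.0], [12.0, 17.0, 6.0], [-2.0, 6.0, 12.0]])
q = np.array([-22.0, -14.5, 13.0])
xs = np.array([1.0, 0.5, -1.0])
g = P @ xs + q
print(g)   # [-1.  0.  2.]
# minimize g . (y - xs) over the box [-1, 1]^3 by choosing y_k against g_k
worst = sum(gk * ((-1.0 if gk > 0 else 1.0) - xk) for gk, xk in zip(g, xs))
print(worst)   # 0.0: the optimality condition holds with equality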
Exercise. Consider the following optimization problem:
minimize x1 + x2
subject to − x1 ≤ 0
and − x2 ≤ 0
and 1 − x1 x2 ≤ 0
Prove that the feasible set is a half-hyperboloid, with optimal value 2 at optimal point x∗ = (1, 1).
Chapter 22
22.1 Introduction
max Margin and SVM problem
22.4 Problems
Chapter 23
23.1 Introduction
Chapter 24
24.1 Introduction
In this lecture we discuss various methods to solve m nonlinear equations in n variables:
$$f_1(x_1, x_2, \ldots, x_n) = 0$$
$$f_2(x_1, x_2, \ldots, x_n) = 0$$
$$\vdots$$
$$f_m(x_1, x_2, \ldots, x_n) = 0$$
1. f is continuous, (i.e. limy→x f (y) = f (x) for all x) and sometimes differentiable.
2. f is available in explicit form (e.g. $x_1^2 + x_2^2 + x_3^2 = 10$) or as a black box (input: x, output: f(x)).
1. Solve f(x) = 0. (You may recollect that we solved Ax − b = 0 in the first part of the course.) Indeed this problem is more complex than the linear systems that we had discussed. Also, the optimality criterion (i.e., derivative equal to zero) leads to this problem for many optimization tasks.
2. Minimize ||f (x)||. (You may also recollect that we solved the linear least squares problem in the past i.e.,
minimize ||Ax − b||.).
Nonlinear equations usually have to be solved by iterative algorithms, that generate a sequence of points x(0) , x(1) , x(2) , . . .
with f (x(k) ) → 0, as k → ∞. The vector x(0) is called the starting point of the algorithm, and x(k) is called the
kth iterate. Moving from x(k) to x(k+1) is called an iteration of the algorithm. The algorithm is terminated when
||f (x(k) )|| ⩽ ϵ, where ϵ > 0 is some specified tolerance, or when it is determined that the sequence is not converging.
Some questions that we are interested in answering in this regard are as follows.
At each step the method divides the interval in two by computing the midpoint c = (a + b)/2 of the interval and
the value of the function f (c) at that point. Unless c is itself a root (which is very unlikely, but possible) there are
now two possibilities: either f (a) and f (c) have opposite signs and bracket a root, or f (c) and f (b) have opposite
signs and bracket a root. The method selects the subinterval that is a bracket as a new interval to be used in the
next step. In this way the interval that contains a zero of f is reduced in width by 50% at each step. The process is
continued until the interval is sufficiently small.
The bisection method is basically a numerical method for estimating the roots of an equation f(x) = 0 that has a zero in the interval [a, b] with f(a) · f(b) < 0. The method computes the zero, say p, by repeatedly halving the interval [a, b], starting with p = (a + b)/2 (this step is like computing x0). The next step is to compute the next iterate: the interval [a, b] is replaced by [p, b] if f(p) · f(b) < 0, or by [a, p] if f(a) · f(p) < 0. This process is continued until the zero is obtained, i.e. f(x_{k+1}) → 0 or |a − b| < ϵ.
24.2.1 Algorithm
Consider an example in Figure 24.2. We start with a1 , b1 such that f (a1 ).f (b1 ) < 0
Given: A function f(x) continuous on an interval [a,b] and f (a) ∗ f (b) < 0
while ( |a − b| > ϵ )
{
p = (a+b)/2
if( f (a) ∗ f (p) < 0 )
b=p
else
a=p
}
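A runnable Python version of the pseudocode above, applied to the polynomial of Example 25 below:

def bisection(f, a, b, eps=1e-6):
    assert f(a) * f(b) < 0, "the root must be bracketed"
    while abs(a - b) > eps:
        p = (a + b) / 2
        if f(a) * f(p) < 0:
            b = p
        else:
            a = p
    return (a + b) / 2

print(bisection(lambda x: x ** 3 - x - 2, 1, 2))   # approx 1.5214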
Thus, with the seventh iteration, we note that the final interval, [1.7266, 1.7344], has a width less than 0.01 and
|f (1.7344)| < 0.01, and therefore we chose b = 1.7344 to be our approximation of the root.
Example 25. Suppose we apply bisection method to find a root of the polynomial
f (x) = x3 − x − 2.
For a1 = 1 and b1 = 2 we have f (a1 ) = −2 and f (b1 ) = +4. Since the function is continuous, the root must lie within
the interval [1, 2].
Figure 24.2: An Example
The midpoint is
$$c_1 = \frac{2 + 1}{2} = 1.5.$$
The function value at the midpoint is f(c1) = −0.125. Because it is negative, we set a2 = c1 = 1.5 and b2 = b1 = 2, to ensure the endpoints have opposite signs at the next iteration.
Table 24.2 shows how the method converges gradually to the solution.
After 13 iterations, it becomes apparent that there is a convergence to about 1.521: a root for the polynomial.
Bisection method is based on the Intermediate Value Theorem (IVT). It is guaranteed to converge to a root of f if
f is a continuous function on the interval [a, b], and f (a) and f (b) have opposite signs. The absolute error is halved
at each step so the method converges linearly, which is comparatively slow.
Specifically, if c1 = (a + b)/2 is the midpoint of the initial interval and cn is the midpoint of the interval in the nth step, then the difference between cn and a solution c is bounded by
$$|c_n - c| \le \frac{|b - a|}{2^n}$$
This formula can be used to determine in advance the number of iterations that the bisection method needs to converge to a root within a given tolerance. The number of iterations n needed to achieve a given error (or tolerance) ϵ is given by
$$n = \log_2\left(\frac{\epsilon_0}{\epsilon}\right) = \frac{\log \epsilon_0 - \log \epsilon}{\log 2}$$
Iteration an bn cn f (cn )
1 1 2 1.5 -0.125
2 1.5 2 1.75 1.6093750
3 1.5 1.75 1.625 0.6660156
4 1.5 1.625 1.5625 0.2521973
5 1.5 1.5625 1.5312500 0.0591125
6 1.5 1.5312500 1.5156250 -0.0340538
7 1.5156250 1.5312500 1.5234375 0.0122504
8 1.5156250 1.5234375 1.5195313 -0.0109712
9 1.5195313 1.5234375 1.5214844 0.0006222
10 1.5195313 1.5214844 1.5205078 -0.0051789
11 1.5205078 1.5214844 1.5209961 -0.0022794
12 1.5209961 1.5214844 1.5212402 -0.0008289
13 1.5212402 1.5214844 1.5213623 -0.0001034
14 1.5213623 1.5214844 1.5214233 0.0002594
15 1.5213623 1.5214233 1.5213928 0.0000780
where $\epsilon_0 = b - a$ (initial bracket size). Therefore the linear convergence is expressed by $\epsilon_{n+1} = \text{constant} \times \epsilon_n^m$, with m = 1.
24.2.3 Discussion
1. The method is guaranteed to converge.
2. The error bound decreases by half with each iteration.
3. The bisection method is robust and simple, but converges very slowly.
4. It is often used to obtain a rough approximation to a solution which is then used as a starting point for more
rapidly converging methods.
5. The bisection method cannot detect multiple roots.
Advantages
• The procedure is simple
• We don’t need the explicit form of the function, i.e. a blackbox representation is sufficient
• Guaranteed to converge
Disadvantages
• Requires two initial points that bracket a root
• Very slow
To solve f(x) = 0 we create a function g(x) such that the solution to f(x) = 0 is a fixed point of g(x).
Iterative Fixed Point Algorithm
x⁰ ← random()
while |g(xᵏ) − xᵏ| > ϵ
    xᵏ⁺¹ ← g(xᵏ)
Fixed-point iteration is a method of computing fixed points of iterated functions. A fixed-point (also known as
invariant point) of a function is an element of the function’s domain that is mapped to itself by the function. For
example, if g is defined on the real numbers by g(x) = x2 − 3x + 4, then 2 is a fixed point, since g(2) = 2.
More specifically, given a function g defined on the real numbers with real values and given a point x0 in the domain
of g, the fixed-point iteration is
xn+1 = g(xn ), n = 0, 1, 2, . . .
which gives rise to a sequence x0, x1, x2, . . . which is hoped to converge to a point x. If g is continuous, then one can prove that the obtained x is a fixed point of g, i.e. g(x) = x.
Given a root-finding problem f(p) = 0, there are many g with a fixed point at p. If g has a fixed point at p, then f(x) = x − g(x) has a zero at p.
More generally, the function g can be defined on any metric space with values in that same space.
24.3.1 Algorithm
That is, for g2 the iterative process converges to 1.85558 with any initial guess.
Consider $g_3(x) = \sqrt{x + 10}\,/\,x$ and the fixed-point iterative scheme.
Figure 24.4: Using g1 the iterative process does not converge for any initial approximation.
Figure 24.5: Using g2, the iterative process converges very quickly to the root which is the intersection point
of y = x and y = g2(x) as shown in the figure.
Figure 24.6: Using g3, the iterative process converges but very slowly.
2. $x^2 - x - 2 = 0 \implies x^2 = x + 2 \implies x = \sqrt{x + 2} \implies g(x) = \sqrt{x + 2}$
3. $x^2 - x - 2 = 0 \implies x(x - 1) = 2 \implies x - 1 = \frac{2}{x} \implies x = 1 + \frac{2}{x} \implies g(x) = 1 + \frac{2}{x}$
4. $x^2 - x - 2 = 0 \implies 2x^2 - x = x^2 + 2 \implies x(2x - 1) = x^2 + 2 \implies x = \frac{x^2 + 2}{2x - 1} \implies g(x) = \frac{x^2 + 2}{2x - 1}$
Suppose we choose the candidate g(x) = 1 + 2/x. Then the steps in the solution are as follows:
$$g(x) = 1 + \frac{2}{x}, \qquad x^0 = 1$$
$$x^1 = 1 + \frac{2}{1} = 3$$
$$x^2 = 1 + \frac{2}{3} = \frac{5}{3}$$
$$x^3 = 1 + \frac{2}{5/3} = \frac{11}{5}$$
$$x^4 = 1 + \frac{2}{11/5} = \frac{21}{11}$$
Now suppose we choose the candidate g(x) = x² − 2. Then the steps in the solution are as follows:
$$g(x) = x^2 - 2, \qquad x^0 = 2.5$$
$$x^1 = 2.5^2 - 2 = 4.25$$
$$x^2 = 4.25^2 - 2 = 16.0625$$
$$x^3 = 16.0625^2 - 2 = 256.004$$
Note
• Not all g(x) are good, as they may not converge to a solution
• We have to analyze and see what makes a particular g(x) good or bad for this purpose
• If |g′(x∗)| < 1 then the fixed point iteration is locally convergent
• E.g.
For g(x) = 1 + 2/x, we have g′(x) = −2/x^2, so at the root x∗ = 2, |g′(x∗)| = 2/4 = 1/2 < 1
For g(x) = x^2 − 2, we have g′(x) = 2x, so at x∗ = 2, |g′(x∗)| = 4 > 1
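The criterion can be checked empirically. A short Python sketch (illustrative) runs both candidate g's for x^2 − x − 2 = 0 from the same starting point:

def iterate(g, x0, steps=8):
    xs = [x0]
    for _ in range(steps):
        xs.append(g(xs[-1]))
    return xs

g_good = lambda x: 1 + 2 / x    # |g'(2)| = 1/2 < 1 -> converges
g_bad  = lambda x: x * x - 2    # |g'(2)| = 4  > 1 -> diverges

print(iterate(g_good, 2.5))     # oscillates towards the root 2
print(iterate(g_bad, 2.5))      # blows up rapidly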
Example 28. Find √a with the help of fixed point iteration.
Solve x^2 − a = 0. We need to put this in the form x = g(x) for using fixed point iteration:
x^2 − a = 0 =⇒ 2x^2 = a + x^2 =⇒ x = (1/2)(a/x + x) =⇒ g(x) = (1/2)(a/x + x)
Using the above formulation, let us obtain √5, i.e. a = 5, with a starting value of x0 = 1.
Consider next f(x) = cos x − x. There are many ways to convert f(x) to the fixed-point form x = g(x); suppose we choose g(x) = cos x. The iterations are tabulated below:
Iteration xn g(xn ) xn+1
1 0.3 0.92106099400289 0.92106099400289
2 0.92106099400289 0.60497568726594 0.60497568726594
3 0.60497568726594 0.82251592555039 0.82251592555039
4 0.82251592555039 0.68037954156567 0.68037954156567
5 0.68037954156567 0.77733400966246 0.77733400966246
6 0.77733400966246 0.71278594551835 0.71278594551835
7 0.71278594551835 0.75654296195845 0.75654296195845
We prove the convergence of fixed-point iteration using the following three theorems.
Theorem 24.3.1. If f ∈ C [a, b] and f (x) ∈ [a, b] , ∀x ∈ [a, b], then f has a fixed point p ∈ [a, b]. (Brouwer fixed
point theorem)
Theorem 24.3.2. If, in addition, the derivative f ′ (x) exists on (a, b) and |f ′ (x)| ⩽ k < 1, ∀x ∈ (a, b), then the
fixed-point is unique.
Theorem 24.3.3. Under the hypotheses of Theorems 24.3.1 and 24.3.2, for any number p0 ∈ [a, b], the sequence defined by
pn = f(pn−1), n = 1, 2, . . .
converges to the unique fixed point p∗ ∈ [a, b].
24.4 Newton’s Method
The idea behind the algorithm is a simple one. We begin with an initial guess for the root we wish to find; this can often be determined from the graph of the function. We then calculate the point on the graph of our function whose x-value is the initial guess. The equation of the tangent line at this point is computed, and the point at which the tangent line intercepts the x-axis is noted; this usually serves as a better estimate of the zero we seek. Given a function f defined over the reals and its derivative f′, we begin with a first guess x0 for a root of the function f. Provided the function satisfies all the assumptions made in the derivation of the formula, a better approximation x1 is
x1 = x0 − f(x0)/f′(x0)
Geometrically, (x1, 0) is the intersection with the x-axis of the tangent to the graph of f at (x0, f(x0)). The process is repeated as
xn+1 = xn − f(xn)/f′(xn)
until a sufficiently accurate value is reached.
We can arrive at this from the Taylor series. If f ∈ C^2[a, b] and x∗ ∈ [a, b] is a root, then we can formally Taylor expand around a point x close to the root:
0 = f(x∗) = f(x) + (x∗ − x) f′(x) + ((x∗ − x)^2/2) f′′(ξ(x)), ξ(x) ∈ [x, x∗]
If we are close to the root, then |x − x∗| is small, which means that |x − x∗|^2 ≪ |x − x∗|; hence we make the approximation
0 ≈ f(x) + (x∗ − x) f′(x) ⟺ x∗ ≈ x − f(x)/f′(x)
Geometrically, Newton's method can be interpreted as follows. Suppose that we want to approximate the solution to f(x) = 0 and that we have somehow found an initial approximation to this solution, say x0. This initial approximation is probably not all that good, so we would like to find a better approximation. This is easy enough to do. First we take the tangent line to f(x) at x0:
y = f(x0) + f′(x0)(x − x0)
From the graph, this tangent line crosses the x-axis much closer to the actual solution than x0 does. This point is x1, where x1 = x0 − f(x0)/f′(x0). Similarly x2 = x1 − f(x1)/f′(x1). Generalizing this, we get a sequence of numbers that get very close to the actual solution. The sequence is given by Newton's method, i.e.
xn = xn−1 − f(xn−1)/f′(xn−1)
Figure 24.9: Newton's Method
Example 30. Find √a using Newton's method.
We have to solve x^2 − a = 0, so f(x) = x^2 − a =⇒ f′(x) = 2x
=⇒ xk+1 = xk − (xk^2 − a)/(2xk) = (1/2)(xk + a/xk)
=⇒ g(x) = (1/2)(a/x + x)
We see that this is the same as what we got earlier using the fixed point method.
24.4.1 Algorithm
1. Declare Variables
2. Set Maximum number of iterations to perform.
3. Set tolerance to small value
4. Set an initial guess x0 .
5. Set the counter of number of iterations to zero.
6. Begin Loop:
(a) Find the next guess x1 = x0 − f(x0)/f′(x0).
(b) If |f(x0)| < tolerance, then exit loop.
(c) Increment the count of the number of iterations.
(d) If the number of iterations > max allowed, then exit.
7. If root was not found in the max number of iterations, then print warning message.
8. Print the value of root and number of iterations performed.
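A Python sketch of the steps above; the helper name and the x^2 − 5 demo are illustrative, not from the notes:

def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    # Newton's method: x_{k+1} = x_k - f(x_k)/f'(x_k).
    x = x0
    for k in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:           # step (b): tolerance test
            return x, k
        x = x - fx / fprime(x)      # step (a): next guess
    raise RuntimeError("root not found in max iterations")  # steps 7-8

# Demo: sqrt(5) as the positive root of f(x) = x^2 - 5.
root, iters = newton(lambda x: x * x - 5, lambda x: 2 * x, x0=1.0)
print(root, iters)                  # ~2.2360679..., after a few iterations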
Example 31. Let us now consider an example in which we want to find the roots of the polynomial f(x) = x^3 − x + 1 = 0.
The sketch of the graph will tell us that this polynomial has exactly one real root. We first need a good guess as to
where it is. This generally requires some creativity, but in this case, notice that f (−2) = −5 and f (−1) = 1 . This
tells us that the root is between -2 and -1. We might then choose x0 = −1 for our initial guess.
To perform the iteration, we need to know the derivative f′(x) = 3x^2 − 1, so that
xn+1 = xn − (xn^3 − xn + 1)/(3xn^2 − 1)
With our initial guess of x0 = −1 , we can produce the following values:
x0 -1
x1 -1.500000
x2 -1.347826
x3 -1.325200
x4 -1.324718
x5 -1.324717
x6 -1.324717
Notice how the values for xn become closer and closer to the same value. This means that we have found the
approximate solution to six decimal places. In fact, this was obtained after only five relatively painless steps.
Example 32. As another example, let's try to find a root of the equation f(x) = e^x − 2x = 0. Notice that f′(x) = e^x − 2, so that
xn+1 = xn − (e^{xn} − 2xn)/(e^{xn} − 2)
If we try an initial value x0 = 1, we find that
x1 = 0, x2 = 1, x3 = 0, x4 = 1, ...
In other words, Newton's Method fails to produce a solution. Why is this? Because there is no solution to be found: rewriting the equation as e^x = 2x, one can check that e^x − 2x has minimum value 2 − 2 ln 2 > 0, so the two curves never meet.
Example 33. Let's solve the previous root-finding problem using Newton's method. We want to find √a for a ∈ Z.
Suppose a = 612, i.e. we want to solve x^2 = 612. Then f(x) = x^2 − 612 and the derivative is f′(x) = 2x. With starting point x0 = 10 we obtain the following sequence. Table 24.5 shows the convergence using Newton's method of iteration. Thus we obtain √612 = 24.7386.
Iteration  xn       xn+1 = xn − f(xn)/f′(xn)
1          10       35.6
2          35.6     26.3955
3          26.3955  24.7906
4          24.7906  24.7387
5          24.7387  24.7386
Table 24.5: Newton's method for √612
24.4.2 Discussion
1. Fast Convergence: Newton’s method converges fastest of the methods discussed (Quadratic convergence).
2. Expensive: We have to compute the derivative in every iteration, which is quite expensive.
3. Starting point: We argued that when |x − x∗ | is small, then |x − x∗ |2 ≪ |x − x∗ |, and we can neglect the
second order term in the Taylor expansion. In order for Newton’s method to converge we need a good starting
point.
Theorem 24.4.1. Let f ∈ C^2[a, b]. If x∗ ∈ [a, b] is such that f(x∗) = 0 and f′(x∗) ≠ 0, then there exists a δ > 0 such that Newton's method generates a sequence {xn}, n = 1, 2, . . ., converging to x∗ for any initial approximation x0 ∈ [x∗ − δ, x∗ + δ].
4. Newton's method as a fixed-point iteration: xn = g(xn−1) with g(x) = x − f(x)/f′(x). Then (by the fixed point theorem) we must find an interval [x∗ − δ, x∗ + δ] that g maps into itself, and for which |g′(x)| ⩽ k < 1. Differentiating,
g′(x) = 1 − [f′(x)f′(x) − f(x)f′′(x)]/[f′(x)]^2 = f(x)f′′(x)/[f′(x)]^2
By assumption f(x∗) = 0 and f′(x∗) ≠ 0, so g′(x∗) = 0. By continuity |g′(x)| ⩽ k < 1 on some neighborhood of x∗. Hence the fixed-point iteration will converge.
Other issues:
1. Choose a wise starting point: We need to choose a good starting point such that |xi+1 − xi | is very small.
2. Faster Convergence : It converges fastest of the methods discussed so far.
Convergence is quadratic: as the method converges on the root, the difference between the root and the
approximation is squared (the number of accurate digits roughly doubles) at each step.
3. Expensive: We have to calculate the values of two functions at each iteration, one of f(x) and the other of f′(x).
4. Difficulties with this method
(a) Difficulty in calculating derivative of a function.
(b) The method may overshoot, and diverge from that root.
(c) A large error in the initial estimate can contribute to non-convergence of the algorithm.
(d) If the root being sought has multiplicity greater than one, the convergence rate is merely linear.
24.5 Secant Method
Newton’s method was based on using the line tangent to the curve of y = f(x), with the point of tangency (x0 , f (x0 )).
When x0 → α, the graph of the tangent line is approximately the same as the graph of y = f(x) around x = α. We
then used the root of the tangent line to approximate α.
Consider using an approximating line based on interpolation. We assume we have two estimates of the root α, say x0 and x1. Then we produce a linear function q(x) = a0 + a1x with q(x0) = f(x0) and q(x1) = f(x1). This line is sometimes called a secant line. Its equation is given by
q(x) = [(x1 − x)f(x0) + (x − x0)f(x1)]/(x1 − x0)
We now solve the equation q(x) = 0, denoting the root by x2. Repeating this yields
xn+1 = xn − f(xn)(xn − xn−1)/(f(xn) − f(xn−1)), n = 1, 2, 3, . . .
This is called the secant method for solving f(x) = 0.
We get the secant method when we substitute the derivative in Newton's method with a finite difference term. As a result the secant method will clearly be a bit slower than Newton's method. However it is still faster than the bisection method and is useful in situations where we have a blackbox representation of the function but not of its derivative. Substituting f′(xk) = (f(xk) − f(xk−1))/(xk − xk−1), we get
xk+1 = xk − f(xk)(xk − xk−1)/(f(xk) − f(xk−1))
The main weakness of Newton’s method is the need to compute the derivative, f ′ (), in each step. Many times f ′ ()
is far more difficult to compute and needs more arithmetic operations to calculate than f (x).
By definition
f′(xn−1) = lim_{x→xn−1} (f(x) − f(xn−1))/(x − xn−1)
Let x = xn−2, and approximate
f′(xn−1) ≈ (f(xn−2) − f(xn−1))/(xn−2 − xn−1)
Using this approximation for the derivative in Newton's method gives us the secant method
xn = xn−1 − f(xn−1)/[(f(xn−2) − f(xn−1))/(xn−2 − xn−1)]
24.5.1 Algorithm
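A possible Python rendering of the recurrence; names are illustrative, and the demo reproduces the 3x + sin x − e^x example tabulated below:

import math

def secant(f, x0, x1, tol=1e-8, max_iter=100):
    # Newton's method with f'(x_k) replaced by the finite difference
    # (f(x_k) - f(x_{k-1})) / (x_k - x_{k-1}).
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if abs(f1) < tol:
            return x1
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
    raise RuntimeError("secant method did not converge")

print(secant(lambda x: 3 * x + math.sin(x) - math.exp(x), 0.0, 1.0))
# -> about 0.3604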
2. C2: by testing whether |xi+1 − xi| (where i is the iteration number) is less than some tolerance limit, say ϵ, fixed a priori.
Applying the secant method to 3x + sin(x) − exp(x) = 0 with initial guesses 0 and 1 gives:
i   0  1  2      3      4      5     6
xi  0  1  0.471  0.308  0.363  0.36  0.36
So the iterative process converges to 0.36 in six iterations.
1. x^3 − 2x − 5 = 0
Soln. f(x) = x^3 − 2x − 5 =⇒ f′(x) = 3x^2 − 2
∴ xk+1 = xk − (xk^3 − 2xk − 5)/(3xk^2 − 2) = (2xk^3 + 5)/(3xk^2 − 2)
Let x0 = 2
x1 = (2 · 2^3 + 5)/(3 · 2^2 − 2) = 2.1
x2 = (2 · (2.1)^3 + 5)/(3 · (2.1)^2 − 2) = 2.094
x3 = (2 · (2.094)^3 + 5)/(3 · (2.094)^2 − 2) = 2.09455
2. e^{−x} = x
Soln. f(x) = x − e^{−x} =⇒ f′(x) = 1 + e^{−x}
∴ xk+1 = xk − (xk − e^{−xk})/(1 + e^{−xk}) = e^{−xk}(xk + 1)/(1 + e^{−xk})
Let x0 = 0
x1 = e^0(0 + 1)/(1 + e^0) = 0.5
x2 = e^{−0.5}(0.5 + 1)/(1 + e^{−0.5}) = 0.566
x3 = e^{−0.566}(0.566 + 1)/(1 + e^{−0.566}) = 0.567
3. x sin x = 1
Soln. f(x) = x sin x − 1 =⇒ f′(x) = sin x + x cos x
∴ xk+1 = xk − (xk sin xk − 1)/(sin xk + xk cos xk) = (xk^2 cos xk + 1)/(sin xk + xk cos xk)
Let x0 = π/2
x1 = ((π/2)^2 cos(π/2) + 1)/(sin(π/2) + (π/2) cos(π/2)) = 1
x2 = (1^2 · cos(1) + 1)/(sin(1) + 1 · cos(1)) = 1.115
x3 = (1.115^2 · cos(1.115) + 1)/(sin(1.115) + 1.115 · cos(1.115)) = 1.114
Exercise. A calculator is defective: it can only add, subtract, and multiply. Use the equation x = 1/1.37, the Newton Method, and the defective calculator to find 1/1.37 correct to 8 decimal places.
Soln. For convenience we write a instead of 1.37. Then 1/a is the root of the equation f(x) = 0 where
f(x) = a − 1/x
We have f′(x) = 1/x^2, and therefore the Newton Method yields the iteration
xn+1 = xn − (a − 1/xn)/(1/xn^2) = xn − xn^2(a − 1/xn) = xn(2 − a·xn)
Note that the expression xn(2 − a·xn) can be evaluated on our defective calculator, since it only involves multiplication and subtraction.
Pick x0 reasonably close to 1/1.37. The choice x0 = 1 would work out fine, but we will start out a little closer, noting that 1.37 is about 4/3, so its reciprocal is about 3/4. Choose x0 = 0.75.
We get x1 = x0(2 − 1.37x0) = 0.729375. Similarly x2 = 0.729926589, and x3 = 0.729927007. It turns out that x4 = x3 to 9 decimal places. So we can be reasonably confident that 1/1.37 is equal to 0.72992701 to 8 decimal places.
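The iteration is easy to reproduce; a tiny Python sketch with the same a = 1.37 and x0 = 0.75 (only multiplication and subtraction appear in the update):

a, x = 1.37, 0.75
for _ in range(4):
    x = x * (2 - a * x)   # defective-calculator-friendly update
    print(x)
# 0.729375, 0.729926589..., 0.729927007..., converging to 1/1.37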
Chapter 25
25.1 Introduction
Last class, we introduced the problem of finding a solution of m nonlinear equations in n variables (though our discussion was only for the case n = 1):
f1(x1, x2, . . . , xn) = 0
f2(x1, x2, . . . , xn) = 0
...
fm(x1, x2, . . . , xn) = 0
Bisection Method
We start with an interval [a, b] that satisfies f(a) · f(b) < 0 (the function values at the end points of the interval have opposite signs). Since f is continuous, this guarantees that the interval contains at least one solution of f(x) = 0. In each iteration we evaluate f at the midpoint p = (a + b)/2 of the interval and, depending on the sign of f(p), replace a or b with p. If f(p) has the same sign as f(a), we replace a with p; otherwise we replace b. Thus we obtain a new interval that still satisfies f(a) · f(b) < 0. The method is called bisection because the interval is replaced by either its left or right half at each iteration.
Example 35. Let us find a root of f(x) = 3x + sin(x) − exp(x) = 0. The graph of this equation is given in Figure 25.1. It is clear from the graph that there are two roots, one between 0 and 0.5 and the other between 1.5 and 2.0. Consider the function f(x) in the interval [0, 0.5], since f(0) · f(0.5) < 0. The bisection iterations are given in Table ??.
Figure 25.1: Graph of the equation f(x)
Newtons Method
In numerical analysis, Newton's method (also known as the Newton–Raphson method), named after Isaac Newton and Joseph Raphson, is a method for finding successively better approximations to the roots (or zeroes) of a real-valued function:
x : f(x) = 0.
The Newton–Raphson method in one variable is implemented as follows. Given a function f defined over the reals and its derivative f′, we begin with a first guess x0 for a root of the function f. Provided the function satisfies all the assumptions made in the derivation of the formula, a better approximation x1 is
x1 = x0 − f(x0)/f′(x0).
Geometrically, (x1, 0) is the intersection with the x-axis of the tangent to the graph of f at (x0, f(x0)).
Example 36. Let us now consider an example in which we want to find the roots of the polynomial f(x) = x^3 − x + 1 = 0.
The sketch of the graph will tell us that this polynomial has exactly one real root. We first need a good guess as to
where it is. This generally requires some creativity, but in this case, notice that f (−2) = −5 and f (−1) = 1 . This
tells us that the root is between -2 and -1. We might then choose x0 = −1 for our initial guess.
To perform the iteration, we need to know the derivative f′(x) = 3x^2 − 1 so that
xn+1 = xn − (xn^3 − xn + 1)/(3xn^2 − 1)
Figure 25.2: Demonstration of Newton's Method
x0 -1
x1 -1.500000
x2 -1.347826
x3 -1.325200
x4 -1.324718
x5 -1.324717
x6 -1.324717
With our initial guess of x0 = −1 , we can produce the following values as in Table ??.
Notice how the values for xn become closer and closer to the same value. This means that we have found the
approximate solution to six decimal places. In fact, this was obtained after only five relatively painless steps.
Secant Method
In numerical analysis, the secant method is a root-finding algorithm that uses a succession of roots of secant lines to
better approximate a root of a function f . The secant method can be thought of as a finite difference approximation
of Newton’s method. However, the method was developed independently of Newton’s method, and predated the
latter by over 3,000 years.
As can be seen from the recurrence relation, the secant method requires two initial values, x 0 and x 1 , which should
ideally be chosen to lie close to the root.
Example 37. Let's try to find the root of 3x + sin(x) − exp(x) = 0 using the secant method. Let the initial guesses be 0.0 and 1.0. Following the secant method's recurrence relation, we get Table ??. As can be seen, the secant method converges to 0.36 in six iterations.
x0 0
x1 1
x2 0.471
x3 0.308
x4 0.363
x5 0.36
x6 0.36
Gradient: The gradient of a function f(x) of n variables, at x∗, is the vector of first partial derivatives evaluated at x∗, and is denoted as ∇f(x∗):
∇f(x∗) = ( ∂f(x∗)/∂x1, ∂f(x∗)/∂x2, . . . , ∂f(x∗)/∂xn )ᵀ    (25.2)
Hessian: The Hessian of a function f(x) of n variables, at x∗, is the square matrix of second partial derivatives evaluated at x∗, and is denoted as ∇²f(x∗); its (i, j) entry is ∂²f(x∗)/∂xi∂xj:
∇²f(x∗) = [ ∂²f(x∗)/∂xi∂xj ], i, j = 1, . . . , n    (25.3)
The Hessian describes the local curvature of a function of many variables. It is a symmetric matrix, because ∂²f(x∗)/∂xi∂xj = ∂²f(x∗)/∂xj∂xi.
Jacobian: Given a set of m equations yi = fi(x) in n variables x1, . . . , xn, the Jacobian is the m × n matrix of first partial derivatives whose (i, j) entry is ∂fi/∂xj:
J = [ ∂fi/∂xj ], i = 1, . . . , m, j = 1, . . . , n    (25.4)
Looking at the above definitions one can observe a simple relation between ∇f(x), ∇²f(x) and J: the Jacobian of the gradient is the Hessian.
25.4 Approximating Functions and Taylor’s Series
Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the
function’s derivatives at a single point
It is common practice to approximate a function by using a finite number of terms of its Taylor series. Taylor’s
theorem gives quantitative estimates on the error in this approximation. Any finite number of initial terms of the
Taylor series of a function is called a Taylor polynomial.
The Taylor series of a function is the limit of that function’s Taylor polynomials, provided that the limit exists. A
function may not be equal to its Taylor series, even if its Taylor series converges at every point. A function that is
equal to its Taylor series in an open interval (or a disc in the complex plane) is known as an analytic function.
Example 38. Let’s compute the Taylor series for f (x) = ex with center x0 = 0 . All derivatives are of the form ex
, so at x0 = 0 they evaluate to 1. Thus the Taylor series has the form:
e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! + · · ·
Orders of approximation refer to formal or informal terms for how precise an approximation is, and to indicate
progressively more refined approximations: in increasing order of precision, a zeroth-order approximation, a first-
order approximation, a second-order approximation, and so forth.
Formally, an nth-order approximation is one where the order of magnitude of the error is at most x^{n+1}, or in terms of big O notation, the error is O(x^{n+1}). In suitable circumstances, approximating a function by a Taylor polynomial of degree n yields an nth-order approximation, by Taylor's theorem: a first-order approximation is a linear approximation, and so forth.
Example 39. Figure 25.4 shows an accurate approximation of sin(x) around the point x = 0. The pink curve is a polynomial of degree seven:
sin(x) ≈ x − x^3/3! + x^5/5! − x^7/7!
The error in this approximation is no more than |x|^9/9!. In particular, for −1 < x < 1, the error is less than 0.000003.
To view the usefulness of Taylor series, Figures 25.5, 25.6, and 25.7 show the zeroth, first, and second order Taylor
series approximations of the exponential function f (x) = ex at x = 0.
The first-order approximation takes the first two terms in the series and approximates the function as
f(y) = f(x) + (f′(x)/1!)(y − x)
Figure 25.4: The sine function (blue) is closely approximated by its Taylor polynomial of degree 7 (pink) for a full period centered at the origin.
Setting f(y) = 0 gives y = x − f(x)/f′(x), i.e. y = x0 − f(x0)/f′(x0) at the expansion point x0.
The first-order approximation is the equation of a line with slope f′(x0). So the first two terms of the Taylor series give us the equation of the line tangent to our function at the point (x0, y0).
We can now develop an iterative algorithm by replacing y with xk+1 and x with xk. At each iteration we will get a better solution than the previous iteration.
Figure 25.6: The first-order Taylor series approximation of ex around x = 0.
Algorithm
x0 ← initial guess
for k = 0, 1, 2, · · · do
    solve ∇f(xk)ᵀ s = −f(xk) for s
    xk+1 = xk + s
end for
Figure 25.7: The second-order Taylor series approximation of ex around x = 0.
Convergence is guaranteed if the spectral radius of J (its maximum absolute eigenvalue) is less than 1.
The second-order approximation uses the first three terms of the Taylor series:
f(y) = f(x) + f′(x)(y − x) + (f′′(x)/2!)(y − x)^2
Let f(y) = 0 at the true solution; then
0 = f(x) + f′(x)(y − x) + (f′′(x)/2!)(y − x)^2
With the second-order approximation a better solution can be obtained than with the first-order one, but it is harder to compute.
Analysis: While the zeroth-order approximation in Figure 25.5 becomes poor very quickly, it is quite apparent that the linear, or first-order, approximation in Figure 25.6 is already quite reasonable in a small interval around x = 0. The quadratic, or second-order, approximation in Figure 25.7 is even better. However, as the degree of approximation increases, computation also increases. So there is a tradeoff!
25.5 Optimality Conditions
Let f : Rⁿ → R be a scalar valued function of n variables x = (x1, x2, . . . , xn). We say x∗ = (x∗1, x∗2, . . . , x∗n) minimizes f if f(x∗) ≤ f(x) for all x ∈ Rⁿ. We use the notation min f(x) to denote the problem of finding an x∗ that minimizes f. A vector x∗ is a local minimum if there exists a neighbourhood around x∗ in which f(x∗) ≤ f(x). By minimum we refer to a global minimum. We can have a case with finite min f(x) but no x∗ with f(x∗) = min f(x); in such a case the optimal value is not attained. It is also possible that f(x) is unbounded below, in which case we define the optimal value as min f(x) = −∞.
Global optimality: A function g is convex if ∇²g(x) is positive semidefinite everywhere. If g is convex, then x is a minimum if and only if
∇g(x) = 0
This means there are no other local minima, i.e. every local minimum is global.
Local optimality: It is much harder to characterize optimality if g is not convex (i.e., if there are points where the Hessian is not positive semidefinite). It is not sufficient to set the gradient equal to zero, because such a point might correspond to a local minimum, a local maximum, or a saddle point. However, we can state some simple conditions for local optimality.
1. Necessary condition: If x is locally optimal, then ∇g(x) = 0 and ∇²g(x) is positive semidefinite.
2. Sufficient condition: If ∇g(x) = 0 and ∇²g(x) is positive definite, then x is locally optimal.
Example 40. Let's try to find the local extrema of f(x1, x2) = x1^3 + x2^3 − 3x1x2.
This function is everywhere differentiable, so extrema can only occur at points x∗ where the gradient vanishes, ∇f(x∗) = 0:
∇f(x) = ( 3x1^2 − 3x2, 3x2^2 − 3x1 )
Setting both components to zero gives x2 = x1^2 and x1 = x2^2, so x1^4 = x1 and the critical points are (0, 0) and (1, 1). The Hessian is
H(x1, x2) = [ 6x1  −3 ; −3  6x2 ]
So
H(0, 0) = [ 0  −3 ; −3  0 ]
Let H1 denote the first principal minor of H(0, 0) and let H2 denote its second principal minor. Then det(H1) = 0 and det(H2) = −9. Therefore H(0, 0) is neither positive nor negative definite: (0, 0) is a saddle point.
H(1, 1) = [ 6  −3 ; −3  6 ]
Its first principal minor has det(H1) = 6 > 0 and its second principal minor has det(H2) = 36 − 9 = 27 > 0. Therefore H(1, 1) is positive definite, which implies that (1, 1) is a local minimum.
Calculation of Gradient
Consider g(x) = √(x1^2 + x2^2 + · · · + xn^2 + 1). Then
∇g(x) = ( ∂g(x)/∂x1, ∂g(x)/∂x2, . . . , ∂g(x)/∂xn )ᵀ    (25.10)
= ( x1/√(x1^2 + · · · + xn^2 + 1), x2/√(x1^2 + · · · + xn^2 + 1), . . . , xn/√(x1^2 + · · · + xn^2 + 1) )ᵀ    (25.11)
= (1/g(x)) ( x1, x2, . . . , xn )ᵀ    (25.12)
Calculation of Hessian
The entries of the Hessian are
∂²g/∂xi² = (x1^2 + · · · + xn^2 + 1 − xi^2)/g^3(x) on the diagonal, and ∂²g/∂xi∂xj = −xi·xj/g^3(x) for i ≠ j,
so that, in matrix form,
∇²g(x) = (1/g^3(x)) (g^2(x) I − x xᵀ) = (1/g(x)) ( I − (x/g(x))(x/g(x))ᵀ )    (25.13)
• We can see that the Hessian is symmetric.
• Writing u = x/g(x), we have ∇²g(x) = (1/g(x))(I − uuᵀ); since u is an n-vector with norm less than 1, all eigenvalues of I − uuᵀ are positive.
• Since all the eigenvalues are positive and the matrix is real symmetric, the matrix is positive definite.
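These claims are easy to verify numerically. A small NumPy sketch (illustrative) builds ∇g and ∇²g from the closed forms above and checks symmetry and positive definiteness:

import numpy as np

def g(x):
    return np.sqrt(x @ x + 1.0)

def grad_g(x):
    return x / g(x)                        # Eq. (25.12)

def hess_g(x):
    u = x / g(x)                           # note ||u|| < 1
    return (np.eye(len(x)) - np.outer(u, u)) / g(x)

x = np.array([0.5, -1.0, 2.0])
H = hess_g(x)
print(np.allclose(H, H.T))                 # True: symmetric
print(np.all(np.linalg.eigvalsh(H) > 0))   # True: positive definite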
Exercise. Consider the system of equations
x1 − 1 = 0
x1 x2 − 1 = 0
• Show that the eigenvalue–eigenvector problem can be cast as a root-finding problem, i.e.,
(A − λI)v = 0, vᵀv = 1.
Derive the Newton iteration to solve the above problem.
Chapter 26
In the last lecture, we saw how Newton's method comes out of the Taylor series expansion. We also briefly saw the situation where we solve f(x) = 0 with f : Rᵐ → Rⁿ.
We now complete the motivation for studying this problem for optimization by connecting it to the situation of minimizing g(x) where x is a vector. The conditions for optimality in the previous lecture pointed to the need to solve ∇g = 0, which is a set of equations to be solved simultaneously.
Here J is the matrix whose entries are the various partial derivatives of the components of f:
J = [ ∂fi/∂xj ], i = 1, . . . , m, j = 1, . . . , n
J is the derivative matrix (or Jacobian matrix, explained elsewhere) evaluated at the current iterate x. Linearizing f around x, the first-order Taylor approximation gives
0 = f(x) + Jx(y − x)
Let y − x = s:
0 = f(x) + Jx s
⇒ Jx s = −f(x)    (26.2)
Convergence is guaranteed if the spectral radius of J (its maximum absolute eigenvalue) is less than 1.
Example 41. As an example, let's take a problem with two equations and two variables.
Example 42. Suppose we need to solve the following system of nonlinear equations:
F(x) = [ x1 + x2 − 3 ; x1^2 + x2^2 − 9 ]
The Jacobian is J(x) = [ 1  1 ; 2x1  2x2 ].
1. Let the initial guess be x◦ = (1, 5).
2. Iteration 1. Solving J(x◦) s◦ = −F(x◦):
[ 1  1 ; 2  10 ] s◦ = −[ 3 ; 17 ]
s◦ = (−13/8, −11/8), so x1 = x◦ + s◦ = (−0.625, 3.625)
3. Iteration 2. Solving J(x1) s1 = −F(x1):
[ 1  1 ; −5/4  29/4 ] s1 = −[ 0 ; 145/32 ]
s1 = (145/272, −145/272), so x2 = x1 + s1 = (−0.092, 3.092)
The actual solution to the above problem is (0, 3). In just two iterations the algorithm has come quite close.
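A short NumPy sketch of the iteration J(x)s = −F(x) for this system (names are illustrative):

import numpy as np

def F(x):
    return np.array([x[0] + x[1] - 3.0,
                     x[0] ** 2 + x[1] ** 2 - 9.0])

def J(x):                               # Jacobian of F
    return np.array([[1.0, 1.0],
                     [2.0 * x[0], 2.0 * x[1]]])

x = np.array([1.0, 5.0])                # initial guess from the example
for k in range(6):
    s = np.linalg.solve(J(x), -F(x))    # solve J(x) s = -F(x)
    x = x + s
    print(k + 1, x)
# iteration 1: [-0.625, 3.625]; iteration 2: [-0.092, 3.092]; -> (0, 3)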
If the objective function g is convex we can find its minimum by solving ∇g(x) = 0. This is a set of n nonlinear equations in n variables that we can solve using any method for nonlinear equations. If we linearize the optimality condition ∇g(x) = 0 near x◦ we obtain
∇g(x) ≈ ∇g(x◦) + ∇²g(x◦)(x − x◦) = 0,
which gives the Newton step v = −∇²g(x◦)⁻¹∇g(x◦).
When n = 1 this interpretation is particularly simple. The solution of the linearized optimality condition is the zero-crossing of the derivative g′(x), which is monotonically increasing since g′′(x) > 0. Given our current approximation x(k) of the solution, we form a first-order Taylor approximation of g′(x) at x(k). The zero crossing of this approximation is then x(k) + v(k). This interpretation is illustrated in the figure below.
The solid curve is the derivative g′(x) of the function g. f_aff(x) = g′(x(k)) + g′′(x(k))(x − x(k)) is the affine approximation of g′(x) at x(k). The Newton step v(k) is the difference between the root of f_aff and the point x(k).
∇f(x) = f′(x) = 7 − 1/x (this is the derivative of f(x) = 7x − log x)
From Eqn 5, the Newton direction is −H(x)⁻¹∇f(x) = −f′(x)/f′′(x) = x − 7x², and is defined so long as x > 0 (the domain of f(x) is x > 0).
Below are some examples of the sequences generated by this algorithm for different starting points.
Note that the iterate in the first column leaves the domain of the objective function, so the algorithm has to terminate.
As we can see, the algorithm converges only when started near the solution, as expected based on the general properties of the method.
k    xk (x0 = 1)   xk (x0 = 0.1)   xk (x0 = 0.01)
0    1             0.1             0.01
1    -5            0.13            0.0193
2                  0.1417          0.03599257
3                  0.14284777      0.062916884
4                  0.142857142     0.098124028
5                  0.142857143     0.128849782
6                                  0.1414837
7                                  0.142843938
8                                  0.142857142
9                                  0.142857143
10                                 0.142857143
It is quite evident that Newton's method will converge only when started near the solution. We need a modification that makes it globally convergent. A closer look at the method shows that the problem does not lie with the direction of the step; it is the step size that creates the problem. A step size too large (we might overshoot the optimum) or too small (the method could be too slow) is a problem; we need to select a suitable step size for a given problem.
Newton's method with backtracking is the solution to the above problem. The idea is to use the direction of the Newton step, but to control its length. The following algorithm is used for the same.
The purpose of backtracking is to avoid a situation in which the function values increase from iteration to iteration and the method never converges. Analysis shows that there is actually nothing wrong with the direction of the Newton step, since it always points in a direction of decreasing g. The problem is that we may step too far in that direction, so the remedy is quite obvious.
At each iteration, we first attempt the full Newton step (xk+1 = xk + vk) and evaluate g at that point. If the function value g(xk + vk) is higher than g(xk), we reject the update and try xk+1 = xk + (1/2)vk instead. If the function value is still higher than g(xk), we try xk+1 = xk + (1/4)vk, and so on, until a value of t is found with g(xk + tvk) < g(xk). We then take xk+1 = xk + tvk.
In practice, the backtracking idea is often implemented as shown in the following algorithm:
Algorithm
given initial x, tolerance ϵ > 0, parameter α ∈ (0, 1/2).
repeat
1. Compute the Newton step v.
2. If the stopping criterion is satisfied, return x.
3. t := 1
4. While g(x + tv) ≥ g(x), set t := t/2.
5. x := x + tv
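A Python sketch of the backtracking scheme applied to the example above, taking f(x) = 7x − log x (whose derivative is the 7 − 1/x used earlier); the halving loop and guard values are illustrative:

import math

def g(x):                   # objective; +inf outside the domain x > 0
    return 7 * x - math.log(x) if x > 0 else float("inf")

def g1(x):                  # g'(x) = 7 - 1/x
    return 7 - 1 / x

def g2(x):                  # g''(x) = 1/x^2
    return 1 / x ** 2

def newton_backtracking(x, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        if abs(g1(x)) < tol:
            return x
        v = -g1(x) / g2(x)              # Newton step
        t = 1.0
        while g(x + t * v) >= g(x):     # halve until the value decreases
            t /= 2
            if t < 1e-12:               # safety guard for this sketch
                return x
        x = x + t * v
    raise RuntimeError("no convergence")

print(newton_backtracking(4.0))         # -> 1/7 = 0.142857...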
Figure 26.2 shows the iterations in Newton’s method with backtracking, applied to the previous example, starting
from x(0) = 4. As expected the convergence problem has been resolved. From the plot of the step sizes we note
that the method accepts the full Newton step (t = 1) after a few iterations. This means that near the solution the
algorithm works like the pure Newton method, which ensures fast (quadratic) convergence.
Figure 26.1: The solid line in the left figure is g(x) = log(e^x + e^{−x}). The circles indicate the function values at the successive iterates in Newton's method, starting at x(0) = 1.15. The solid line in the right figure is the derivative g′(x). The dashed lines in the right-hand figure illustrate the first interpretation of Newton's method.
Even quite commonplace money calculations involve equations that cannot be solved by 'exact' formulas. For a loan of amount A repaid in n monthly payments of M at monthly interest rate r, the annuity equation is
Ar = M(1 − 1/(1 + r)^n)
A car loan of $10000 was repaid in 60 monthly payments of $250. Use the Newton Method to find the monthly interest rate correct to 4 significant figures. Let r be the interest rate. Then
10000r = 250(1 − 1/(1 + r)^{60})
Dividing by 250 gives f(r) = 40r − 1 + (1 + r)^{−60} = 0, with
f′(r) = 40 − 60/(1 + r)^{61}
so the Newton iteration is
rn+1 = rn − (40rn − 1 + (1 + rn)^{−60}) / (40 − 60/(1 + rn)^{61})
If the interest rate were 2.5% a month, the monthly interest on $10,000 would be $250, and so with monthly payments of $250 we would never pay off the loan. So the monthly interest rate must have been substantially under 2.5%. A bit of trying suggests taking r0 = 0.015. We then find that r1 = 0.014411839, r2 = 0.014394797, and r3 = 0.01439477. This suggests that to four significant figures the monthly interest rate is 1.439%.
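The same computation takes only a few lines of Python (illustrative):

r = 0.015                       # initial guess from the discussion above
for _ in range(5):
    f  = 40 * r - 1 + (1 + r) ** -60
    fp = 40 - 60 * (1 + r) ** -61
    r  = r - f / fp
    print(r)
# 0.014411839..., 0.014394797..., 0.01439477..., i.e. about 1.439% monthly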
Exercise. Derive the Newton equation for the unconstrained minimization problem
min (1/2) xᵀx + log Σ_{i=1}^{n} exp(aᵢᵀx + bᵢ)
Give an efficient method for solving the Newton system, assuming the matrix A ∈ ℝ^{m×n} (with rows aᵢᵀ) is dense with m ≪ n. Give an approximate FLOP count of your method.
Sol. 2:
f(x) = (1/2) xᵀx + log Σ_{i=1}^{n} exp(aᵢᵀx + bᵢ)
Writing zᵢ = exp(aᵢᵀx + bᵢ),
∇f(x) = x + ( Σ_{i=1}^{n} zᵢ aᵢ ) / ( Σ_{i=1}^{n} zᵢ )
∇²f(x) = I + [ (Σᵢ zᵢ)(Σᵢ zᵢ aᵢaᵢᵀ) − (Σᵢ zᵢ aᵢ)(Σᵢ zᵢ aᵢ)ᵀ ] / (Σᵢ zᵢ)²
The QR method is slower than Cholesky (by a factor of about two when n ≫ m), but it is more accurate. It is the preferred method if n and m are not too large. For very large sparse problems, the Cholesky factorization is useful.
Exercise. Solve the unconstrained minimization problem
min Σ_{i=1}^{m} log( exp(aᵢᵀx − bᵢ) + exp(−aᵢᵀx + bᵢ) )
where A is an m × n matrix (with rows aᵢᵀ) and b is an m-dimensional vector.
Solution. Let g(x) = Σ_{i=1}^{m} log( exp(aᵢᵀx − bᵢ) + exp(−aᵢᵀx + bᵢ) ).
We express g as g(x) = f(Ax − b), i.e. f(y) = Σ_{i=1}^{m} log( exp(yᵢ) + exp(−yᵢ) ).
The gradient of f is (∇f(y))ᵢ = (exp(yᵢ) − exp(−yᵢ)) / (exp(yᵢ) + exp(−yᵢ)) = tanh(yᵢ), and the Hessian is diagonal:
(∇²f(y))ᵢᵢ = 4 / (exp(yᵢ) + exp(−yᵢ))², and (∇²f(y))ᵢⱼ = 0 for i ≠ j.
Once we have the gradient and Hessian of f, the implementation of Newton's method is straightforward. Using the Hessian and gradient of f, compute the Hessian and gradient of g as shown below (with y = Ax − b):
∇g(x) = Aᵀ∇f(y)
∇²g(x) = Aᵀ∇²f(y)A
We start with x0 = (1, 1, . . . , 1), set α = 0.01, and terminate if ‖∇g(x)‖ ≤ 10⁻⁵.
Note: The number of iterations in an iterative algorithm depends on the problem parameters and on the starting point; hence its efficiency isn't expressed by giving a flop count, but rather by giving upper bounds on the number of iterations needed to reach a given accuracy.
(b) there exist scalars β > 0 and L > 0 for which ∥H(x) − H(x∗)∥ ≤ L∥x − x∗∥
3. Show that the eigenvalue–eigenvector problem can be cast as a root-finding problem, i.e.,
(A − λI)v = 0, vᵀv = 1.
Derive the Newton iteration to solve the above problem.
4. Use Newton’s method to find all the solution of the two equations
x1^2 + x2^2 = 16
(x1 − 2)^2 + (x2 − 3)^2 = 25
5. Derive the gradient and Hessian of the function g(x) = √(‖Cx‖² + 1), where C is a left-invertible m × n matrix.
Hint: Use the result of Homework Problem 1 and the expression ∇²g(x) = Cᵀ∇²h(Cx + d)C for the Hessian of the function g(x) = h(Cx + d).
Chapter 27
27.1 Introduction
Chapter 28
Chapter 29
Chapter 30
30.1 Introduction
Chapter 31
31.1 Representation and Coding
For many problems in engineering, building a numerical representation has been the first step. Often this representation is a vector of n elements; it could be some measurements of a physical phenomenon. Such representations are understood as elements of a vector space. Often there is also a set of basis functions, such as the Fourier basis, for this representation.
In this lecture, we are interested in an over-complete basis. This leads to representations that can be “sparse”. There are many aspects to look at when choosing the right representation. It could be the length of the representation (the number of elements in the vector). It could also be the compressibility of the representation, or the number of bits required to store it. Beyond these computational requirements, one is also (obviously) interested in building a representation that is useful for solving the problem. One useful representation scheme is based on sparse linear combinations of some basis functions. Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This focuses on learning the basis set, also called the dictionary, to adapt it to specific data, an approach that has recently proven to be very effective for signal reconstruction and classification in the audio and image processing domains.
Sparseness is one of the reasons for the extensive use of popular signal transforms such as the Discrete Fourier Transform (DFT), the wavelet transform (WT) and the Singular Value Decomposition (SVD). These transforms often reveal certain structures of a signal and are used to represent these structures in a compact and sparse representation. Sparse representations have therefore increasingly become recognized as providing extremely high performance for applications as diverse as noise reduction, compression, feature extraction, pattern classification and blind source separation. Sparse approximation techniques also build the foundations of wavelet denoising and of methods in pattern classification, such as the Support Vector Machine (SVM).
1. Sparse modeling calls for constructing efficient representations of data as a (often linear) combination of a few
typical patterns (atoms) learned from the data itself. Significant contributions to the theory and practice of
learning such collections of atoms (usually called dictionaries or codebooks), and of representing the actual
data in terms of them, leading to state-of-the-art results in many signal and image processing and analysis
tasks. The first critical component of this topic is how to sparsely encode a signal given the dictionary.
2. The actual dictionary plays a critical role, and it has been shown again and again that learned dictionaries significantly outperform off-the-shelf ones such as wavelets.
3. There are numerous applications where the dictionary is not only adapted to reconstruct the data, but also learned for a specific task, such as classification, edge detection and compressed sensing.
There are many formulations and algorithms for sparse coding in the literature. Let y ∈ Rⁿ be a data point in a data set Y. A sample x is modelled as x = Da, where:
• x is an m × 1 column vector
• D is an m × p matrix
• a is a p × 1 coefficient vector with only k nonzero entries
• k ≪ m
We discussed how we can build a model based on a large number of observations. Using sparse coding we try to represent a new sample in terms of a sparse set of input samples. The large set of samples available to us is called the dictionary; each sample, i.e. each column of the matrix, is known as an atom. The dictionary is over-complete, and the rows and columns are not required to be linearly independent. There are a lot of advantages to working with sparse vectors. For example, calculations involving multiplying a vector by a matrix generally take less time if the vector is sparse. Also, sparse vectors require less space when stored on a computer, as only the positions and values of the nonzero entries need to be recorded.
Problem 1 - Dictionary Learning: Given y, find (learn) D.
This is generally a harder problem compared to the one given below, and the major focus in these notes will be on the second problem, which is discussed hereafter.
Problem 2 - Sparse Coding: Given the dictionary, model a data vector as a sparse linear combination of basis elements:
x = Σ_{i=1}^{k} aᵢ φᵢ
The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and
patterns inherent in the input data.
Sparse Representation of a Signal The problem of finding the sparse representation of a signal in a given
overcomplete dictionary can be formulated as follows. Given a N × M matrix A containing the elements of an
overcomplete dictionary in its columns, with M > N and a signal y ∈ RN , the problem of sparse representation is
to find an M × 1 coefficient vector x, such that y = Ax and ‖x‖₀ is minimized, i.e.
x = arg min_{x′} ‖x′‖₀ s.t. y = Ax′
In general the above problem is computationally intractable. Therefore we go for sparse approximation.
Sparse Coding
Traditionally a sample was transformed into a sparse vector with respect to a particular basis; representing a sample in a particular basis involves finding the unique set of expansion coefficients of that sample in that basis. As noted above, sparse vectors are cheaper to compute with and to store.
The main disadvantage of using an orthogonal basis to represent a new sample is that a specific basis works for a specific type of sample and may not work for other types. For example, smooth continuous signals are sparsely represented in a Fourier basis, while impulses are not. On the other hand, a smooth signal with isolated discontinuities is sparsely represented in a wavelet basis, while a wavelet basis is not efficient at representing a signal whose Fourier transform has narrow high frequency support.
Real world observations often contain features that prohibit sparse representation in any single basis. To ensure that we are able to represent every vector in the space, the dictionary of vectors we choose from must span the space. However, because the set is not limited to a single basis, the dictionary is not linearly independent.
Because the vectors in the dictionary are not a linearly independent set, the representation of a sample in the dictionary may not be unique. However, by creating a redundant (over-complete) dictionary, we can expand our sample in a set of vectors that is a union of several bases. You are free to create a dictionary consisting of the union of several
bases. For example, to represent an arbitrary document our dictionary may be defined as a union of various different types of document bases, such as sports documents, musical documents, culinary documents, etc.
In matrix form, the sample is the product of the dictionary and a coefficient vector:
y = D x, with D = [ d11 · · · d1n ; . . . ; dm1 · · · dmn ] (m × n) and x = (x1, x2, . . . , xn)ᵀ
1. Sparse Coding: Given a dictionary D and a new observation y, find a sparse representation x of y: minimize ‖x‖₀ such that y = Dx, or minimize ‖x‖₀ such that ‖y − Dx‖ ≤ ϵ.
2. Dictionary learning: Given a lot of samples y1, . . . , yn, find the dictionary D.
Sparse coding requires minimizing the L0 norm, which is a hard problem, so there are two practical ways to solve the problem, discussed below.
1. Given x and D, the task is to find a such that it best approximates the signal or data using a highly sparse vector.
2. Consider a linear system of equations x = Da, where D is an underdetermined m × p (m ≪ p) matrix and x ∈ Rᵐ, a ∈ Rᵖ; D, called the dictionary or the design matrix, is given. The problem is estimation of the signal a, subject to it being sparse. Sparse decomposition helps in such a way that even though the observed values are in a high-dimensional space (Rᵐ), the actual signal is organized in some lower-dimensional subspace (k ≪ m).
3. It is evident that only a few components of a are non-zero and the rest are zero, as it is a sparse vector. This implies that x can be decomposed as a linear combination of only a few m × 1 vectors in D; these vectors are known as atoms, and D itself is over-complete (m ≪ p). Such atoms form the basis of x. Here, though, unlike other dimensionality-reducing decomposition techniques such as Principal Component Analysis, the basis vectors are not required to be mutually orthogonal.
The sparse decomposition problem is
min_a ‖a‖₀ s.t. x = Da
where ‖a‖₀ is the l₀ pseudo-norm, which counts the number of non-zero components of a. A convex relaxation of the problem can instead be obtained by taking the l₁ norm instead of the l₀ norm, where ‖a‖₁ = Σ_{i=1}^{p} |aᵢ|. The l₁ norm induces sparsity under certain conditions.
The need to learn the dictionary D
The linear decomposition of a signal using a few atoms of a learned dictionary instead of a predefined one has led to better results for numerous tasks: for example, low-level image processing tasks such as denoising, as well as higher-level tasks such as classification, showing that sparse learned models are well adapted to natural data. The dictionary you are trying to learn should be specific to the subject.
As noted earlier, sparseness underlies popular transforms such as the Discrete Fourier Transform, the wavelet transform and the Singular Value Decomposition, and sparse representation ideas also build the foundations of wavelet denoising and of methods in pattern classification such as the Support Vector Machine and the Relevance Vector Machine, where sparsity can be directly related to the learnability of an estimator.
Sparse signal representations allow the salient information within a signal to be conveyed with only a few elementary components, called atoms. The goal of sparse coding is to represent input vectors approximately as a weighted linear combination of a small number of (unknown) “basis vectors”. These basis vectors capture high-level patterns in the input data. Sparse coding has proven to be very effective for signal reconstruction and classification in the audio and image processing domains.
31.2.3 Applications
1. When a sparse coding algorithm is applied to natural images, the learned bases resemble the receptive fields
of neurons in the visual cortex.
2. Sparse coding produces localized bases when applied to other natural stimuli such as speech and video.
3. There are many applications of sparse coding in Seismic Imaging linear regression and Transform Coding.
31.3 BP, MP and OMP
The sparse decomposition problem is represented as
min_x ‖x‖₀ s.t. y = Dx
where ‖x‖₀ is the l₀ pseudo-norm, which counts the number of non-zero components of x. A convex relaxation of the problem can instead be obtained by taking the l₁ norm instead of the l₀ norm, where ‖x‖₁ = Σ_{i=1}^{p} |xᵢ|. The l₁ norm induces sparsity under certain conditions.
31.3.1 Overview
Basis Pursuit The idea of Basis Pursuit is to replace the difficult sparse problem with an easier optimization
problem. The difficulty with the above problem is the L0 norm. Basis Pursuit replaces the L0 norm with the L1 to
make the problem easier to work with. Basis Pursuit: min ||x||1 subject to Ax = b
Matching pursuit (MP): Matching pursuit is a greedy iterative algorithm for approximately solving the original l₀ pseudo-norm problem. It works by finding the basis vector in D that maximizes the correlation with the residual (initialized to the signal itself), and then recomputing the residual and coefficients by projecting the residual on all atoms in the dictionary using the existing coefficients. It can be seen as an expectation maximization[?] problem.
Orthogonal matching pursuit(OMP) It is similar to Matching Pursuit[?], except that an atom once picked,
cannot be picked again. The algorithm maintains an active set of atoms already picked, and adds a new atom at each
iteration. The residual is projected on to a linear combination of all atoms in the active set, so that an orthogonal
updated residual is obtained. Both Matching Pursuit and Orthogonal Matching Pursuit use the l2 norm.
Matching pursuit is a greedy algorithm that computes the best nonlinear approximation to a sample in a complete,
redundant dictionary. Matching pursuit builds a sequence of sparse approximations to the signal step-wise. Let
ϕ = φk denote a dictionary of unit-norm atoms. Let f be your signal.
1. Start by defining R⁰f = f.
2. Begin the matching pursuit by selecting the atom from the dictionary that maximizes the absolute value of the inner product with R⁰f = f. Denote that atom by φp.
3. Form the residual R¹f by subtracting the orthogonal projection of R⁰f onto the space spanned by φp.
4. Iterate by repeating steps 2 and 3 on the current residual Rᵏf.
5. Stop the algorithm when you reach some specified stopping criterion.
In nonorthogonal (or basic) matching pursuit, the dictionary atoms are not mutually orthogonal vectors. Therefore, subtracting subsequent residuals from the previous one can reintroduce components that are not orthogonal to the span of the previously included atoms. The next algorithm, Orthogonal Matching Pursuit, handles this problem.
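A compact Python sketch of basic matching pursuit following these steps; the random overcomplete dictionary in the demo is illustrative only:

import numpy as np

def matching_pursuit(D, y, n_iter=20):
    # D must have unit-norm columns (atoms).
    coef = np.zeros(D.shape[1])
    r = y.copy()                           # R^0 f = f
    for _ in range(n_iter):
        corr = D.T @ r                     # inner products with all atoms
        k = int(np.argmax(np.abs(corr)))   # best-matching atom
        coef[k] += corr[k]
        r = r - corr[k] * D[:, k]          # subtract the projection
    return coef, r

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 20))
D /= np.linalg.norm(D, axis=0)             # normalize the atoms
y = 2.0 * D[:, 3] - 1.5 * D[:, 11]         # a sparse combination
coef, r = matching_pursuit(D, y)
print(np.linalg.norm(r))                   # the residual shrinks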
Drawbacks
Matching pursuit has the drawback that an atom can be selected multiple times; orthogonal matching pursuit[?] gets rid of this drawback in its implementation.
Objective Function:
min_a ‖x − Da‖₂² s.t. ‖a‖₀ < N    (31.6)
Steps:
1. r₀ = x, t = 0, V₀ = ∅
2. Let i be the index of the atom vᵢ most correlated with the residual rₜ; pick vᵢ and set Vₜ₊₁ = Vₜ ∪ {vᵢ}
3. Solve min_α ‖x − Σ_{vᵢ ∈ Vₜ₊₁} αᵢ vᵢ‖ and update the residual
4. t ← t + 1; go to step 2
In orthogonal matching pursuit (OMP), the residual is always orthogonal to the atoms already selected. This means that the same atom can never be selected twice, and results in convergence for an n-dimensional vector after at most n steps.
Orthogonal matching pursuit ensures that the previously selected atoms are not chosen again in subsequent steps.
The other approach is to solve an approximated problem in an exact manner. Taking the example of the sparse coding problem, we have seen that it is a hard problem, but it can be approximated by an easier problem that can be solved exactly: in the easier version we minimize the L1 norm instead of the L0 norm. This algorithm is known as Basis Pursuit.
Objective function
min_x ‖y − Dx‖₂² s.t. ‖x‖₀ < L    (31.9)
Steps:
1. Λ = ∅
2. for iter = 1, 2, . . . , L do
3. Select the atom which most reduces the objective, i.e. the vector vᵢ most correlated with the residual rₜ; set Vₜ = Vₜ₋₁ ∪ {vᵢ}
4. Update the active set
5. Update the residual
6. Update the coefficients
It is similar to Matching Pursuit, except that an atom once picked cannot be picked again. The algorithm maintains an active set of atoms already picked, and adds a new atom at each iteration. The residual is projected onto a linear combination of all atoms in the active set, so that an orthogonally updated residual is obtained. Both Matching Pursuit and Orthogonal Matching Pursuit use the l₂ norm. Contrary to MP, an atom can only be selected one time with OMP; it is, however, more difficult to implement efficiently.
Sparse Approximations: This problem is NP-hard in general. Therefore various relaxed sparsity measures have been presented to make the problem tractable. Two commonly used methods for solving sparse approximation problems are:
1. Basis Pursuit: The idea of Basis Pursuit is to replace the difficult sparse problem with an easier optimization problem, replacing the L0 norm with the L1 norm:
min ‖x‖₁ subject to Ax = b
2. Orthogonal Matching Pursuit: Orthogonal matching pursuit (OMP) constructs an approximation through an iterative process. At each iteration the locally optimum solution is calculated; this is done by finding the column vector in A which most closely resembles a residual vector r. OMP is based on a variation of an earlier algorithm called Matching Pursuit (MP); MP simply removes the selected column vector from the residual vector at each iteration, the selected atom being the column vector in A which most closely resembles rₜ₋₁. OMP instead uses a least-squares step at each iteration to update the residual vector, in order to improve the approximation.
Algorithm:
1. Set t = 1, r₀ = b, and start with an empty active set.
2. Find the column of A most correlated with rₜ₋₁ and add it to the active set.
3. Compute the least-squares coefficients c of b over the active set.
4. Calculate the new residual using c:
rₜ = b − Σ_{j=1}^{t} c(vⱼ) a_{vⱼ}
5. Set t ← t + 1.
6. Check the stopping criterion; if the criterion has not been satisfied then return to step 2.
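A Python sketch of OMP following this algorithm; the least-squares refit over the active set is what keeps the residual orthogonal to all selected atoms (names and the demo are illustrative):

import numpy as np

def omp(A, y, n_nonzero):
    active, r = [], y.copy()
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(A.T @ r)))    # most correlated atom
        if k not in active:
            active.append(k)
        c, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        r = y - A[:, active] @ c               # orthogonal residual
    x = np.zeros(A.shape[1])
    x[active] = c
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 30))
A /= np.linalg.norm(A, axis=0)
y = 1.0 * A[:, 5] + 0.5 * A[:, 17]
print(np.nonzero(omp(A, y, 2))[0])             # typically recovers {5, 17}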
31.3.4 Discussions
Uniqueness of sparse approximation: A sparse representation need not be unique. For uniqueness it has to satisfy a certain condition: a sparse representation x of b is unique if
‖x‖₀ < spark(A)/2
where spark(A) is defined as the size of the smallest set of linearly dependent columns of A.
Sparse coding using a fixed dictionary: Let y ∈ R^d and x ∈ R^N (where d < N) be the input and the coefficient vectors, and let the matrix D ∈ R^{d×N} be the dictionary. We solve the unconstrained optimization problem
min_x ‖y − Dx‖₂² + λ‖x‖₀
where ‖x‖₀ is the sparsity measure (which counts the number of non-zero coefficients) and λ is a constant multiplier.
Applying the basis pursuit relaxation to ‖x‖₀, the problem becomes an L1-regularized linear least-squares problem. A number of recent methods for solving this type of problem are based on coordinate descent with soft thresholding. When the columns of the dictionary have low correlation, these simple methods have proven to be very efficient.
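A minimal sketch of one such method, iterative soft-thresholding (ISTA), for the l1-relaxed problem; the step size rule and the demo are illustrative assumptions:

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=200):
    # min_x 0.5 * ||y - Dx||^2 + lam * ||x||_1
    L = np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(2)
D = rng.standard_normal((15, 40))
D /= np.linalg.norm(D, axis=0)
y = 3.0 * D[:, 7]
print(np.round(ista(D, y, lam=0.1), 2))   # mostly zeros, large entry at 7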
Dictionary Learning: However, the columns of learned dictionaries are in general highly correlated, and thus the simple methods above are less directly applicable; dedicated learning algorithms are used instead.
Let D = [d1, . . . , dp] ∈ R^{m×p} be a set of normalized basis vectors; we call it the dictionary. Let Y be a set of N input signals. D is adapted to a signal y if it can represent it with a few basis vectors, that is, there exists a sparse vector α in Rᵖ, called the sparse code, with
y ≈ (d1 | d2 | · · · | dp) (α[1], α[2], . . . , α[p])ᵀ
Learning a reconstructive dictionary with K items for sparse representation of Y can be accomplished by solving the following problem:
⟨D, X⟩ = arg min_{D,X} ‖Y − DX‖₂²  s.t. ∀i, ‖xᵢ‖₀ ≤ T
where D = [d1, . . . , dK] ∈ R^{n×K} (K > n, making the dictionary over-complete) is the learned dictionary, X = [x1, . . . , xN] ∈ R^{K×N} are the sparse codes of the input signals Y, and T is a sparsity constraint factor (each signal has fewer than T items in its decomposition).
The construction of D is achieved by minimizing the reconstruction error while satisfying the sparsity constraints. The K-SVD algorithm is an iterative approach to minimize this energy; it learns a reconstructive dictionary for sparse representations of signals. It is highly efficient and works well in applications such as image restoration and compression. The term ‖Y − DX‖₂² denotes the reconstruction error.
In Dictionary Learning, we consider the problem of finding a few representatives for a dataset, i.e., a subset of data
points that efficiently describes the entire dataset. The problem is computationally costly.
Given a set Y = (y_j)_{j=1}^{m} of m signals y_j ∈ Rⁿ, dictionary learning aims at finding the best dictionary D ∈ R^{n×m} by solving
min ‖X‖₀  s.t. ‖Y − DX‖₂ < ε
For example: for a sound signal there are some salient features that can be used to reconstruct the signal using only a few elementary components, called 'atoms'. For a signal to be encoded, first its vector quantization is done, and then a dictionary can be built using its salient features, which are domain specific. Successful application of a sparse decomposition depends on the dictionary used, and whether it matches the signal features.
In some case it is possible that original signal cannot be reconstructed using Dictionary, this error can be ignored to
some threshold value.
In dictionary learning, one often starts with some initial dictionary and finds sparse approximations of the set of training signals while keeping the dictionary fixed. This is followed by a second step in which the sparse coefficients are kept fixed and the dictionary is optimized. The algorithm alternates between these two optimizations for a fixed number of iterations or until a target approximation error is reached. Most of these algorithms have been derived for dictionary learning in a noisy sparse approximation setting.
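A minimal sketch of this alternating scheme, reusing lasso_coordinate_descent from the earlier sketch as the sparse coder (any other sparse coder would do); the least-squares dictionary step shown here is a MOD-style update, one of several possible choices:

import numpy as np

def dictionary_learning(Y, p, lam, n_alt=20, seed=0):
    # Y: (m, N) matrix whose columns are training signals; p: number of atoms.
    rng = np.random.default_rng(seed)
    m, N = Y.shape
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)                  # normalized initial atoms
    for _ in range(n_alt):
        # Step 1: sparse-code every signal with the dictionary fixed
        X = np.column_stack([lasso_coordinate_descent(D, Y[:, i], lam)
                             for i in range(N)])
        # Step 2: update the dictionary with the codes fixed (least squares)
        D = Y @ np.linalg.pinv(X)
        D /= np.linalg.norm(D, axis=0) + 1e-12      # re-normalize atoms
    return D, X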
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
In the vector-quantization view, each code x_i is a k-vector and C = [C₁ · · · C_k] is the given dictionary (codebook). Each signal is represented by exactly one codeword:
y_i = C x_i,  s.t.  x_i = e_j for some j,
i.e., x_i = (0, . . . , 0, 1, 0, . . . , 0)ᵀ is an indicator vector with a single 1 selecting the chosen codeword.
∥Y − CX∥ = ∥Y − Σ_{j=1}^{k} C_j x_T^j∥
Algorithm:
1. Find C_i, i = 1, . . . , k, as the mean of the members of subset i.
2. Repartition, so that every signal is assigned to the nearest subset.
The total error is
E = Σ_{i=1}^{N} e_i² = ∥Y − CX∥²
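A NumPy sketch of these two alternating steps (the initialization from random data columns is our own choice):

import numpy as np

def kmeans_vq(Y, k, n_iters=50, seed=0):
    # Plain k-means as vector quantization: columns of Y are signals.
    rng = np.random.default_rng(seed)
    m, N = Y.shape
    C = Y[:, rng.choice(N, size=k, replace=False)].copy()  # initial codebook
    for _ in range(n_iters):
        # Repartition: assign every signal to its nearest codeword
        d2 = ((Y[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)  # (k, N)
        labels = d2.argmin(axis=0)
        # Update: every codeword becomes the mean of its members
        for i in range(k):
            members = Y[:, labels == i]
            if members.size:
                C[:, i] = members.mean(axis=1)
    return C, labels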
31.5 K-SVD
K-SVD (Aharon, Elad, and Bruckstein, 2006; see References) is a dictionary learning algorithm for creating a dictionary for sparse representations via a singular value decomposition approach. K-SVD is a generalization of the k-means clustering method: it works by iteratively alternating between sparse coding the input data based on the current dictionary and updating the atoms in the dictionary to better fit the data, much as in expectation maximization. K-SVD is widely used in applications such as image processing, audio processing, biology, and document analysis.
Problem Description
Given an overcomplete dictionary matrix D ∈ R^{n×K} whose K columns are signal-atoms, a signal y ∈ Rⁿ can be represented as a linear combination of these atoms. To represent y, the sparse representation x should satisfy the exact condition y = Dx, the approximate condition y ≈ Dx, or ∥y − Dx∥_p ≤ ϵ. The vector x ∈ R^K contains the representation coefficients of the signal y. Typically, the norm p is selected as 1, 2 or ∞.
If n < K and D is a full-rank matrix, an infinite number of solutions are available for the representation problem; hence, constraints must be set on the solution. Also, to ensure sparsity, the solution with the fewest nonzero coefficients is preferred. Thus, the sparse representation is the solution of either
(P₀)   min_x ∥x∥₀  subject to  y = Dx
or
(P₀,ϵ)   min_x ∥x∥₀  subject to  ∥y − Dx∥₂ ≤ ϵ   (31.11)
where the L0 norm counts the nonzero entries of a vector.
Algorithm: K-SVD
K-SVD generalizes K-means; over the training set Y, the K-means objective can equivalently be written as
min_{D,X} ∥Y − DX∥²_F  subject to  ∀i, ∥x_i∥₀ = 1   (31.13)
The sparse representation term x_i = e_k (where e_k is a column of the K × K identity matrix) forces the K-means algorithm to use only one atom (column) of the dictionary D. To relax this constraint, the target of the K-SVD algorithm is to represent each signal as a linear combination of atoms in D. The K-SVD algorithm follows the construction flow of the K-means algorithm; however, in contrast to K-means, the sparsity term of the constraint is relaxed so that the number of nonzero entries in each column x_i can be more than 1, as long as it is less than a number T₀.
or, in another objective form,
min_{D,X} Σ_i ∥x_i∥₀  subject to  ∥Y − DX∥²_F ≤ ϵ   (31.15)
In the K-SVD algorithm, the dictionary D is first held fixed and the best coefficient matrix X is calculated. As finding the truly optimal X is intractable, an approximation pursuit method is used. Any such algorithm, for example OMP, the orthogonal matching pursuit, can be used for the calculation of the coefficients, as long as it can supply a solution with a fixed and predetermined number of nonzero entries T₀, as in the sketch below.
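A textbook-style NumPy sketch of OMP (assuming unit-norm atoms; the names are ours, not from the notes):

import numpy as np

def omp(D, y, T0):
    # Greedy sparse coding with at most T0 atoms: pick the atom most
    # correlated with the residual, then re-fit all selected coefficients
    # by least squares (the orthogonal projection step).
    n, K = D.shape
    support, x = [], np.zeros(K)
    r = y.copy()
    for _ in range(T0):
        j = int(np.argmax(np.abs(D.T @ r)))   # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef                     # orthogonal re-projection
        r = y - D @ x                         # updated residual
    return x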
After the sparse coding task, the next job is to search for a better dictionary D. However, finding the whole dictionary at once is intractable, so the process updates only one column of the dictionary D at a time while we fix X. The update of the k-th column is done by rewriting the penalty term as
∥Y − DX∥²_F = ∥Y − Σ_{j=1}^{K} d_j x_T^j∥²_F = ∥(Y − Σ_{j≠k} d_j x_T^j) − d_k x_T^k∥²_F = ∥E_k − d_k x_T^k∥²_F   (31.16)
where x_T^k denotes the k-th row of X (the subscript T marks it as a row vector, so that d_k x_T^k is a rank-one matrix).
By decomposing the product DX into a sum of K rank-1 matrices, we can assume the other K − 1 terms are fixed while the k-th column remains unknown. After this step, we can solve the minimization problem by approximating the E_k term with a rank-1 matrix using the singular value decomposition, and then updating d_k with it. However, the new solution for the row x_T^k is very likely to be dense, because the sparsity constraint is not enforced; the remedy is described in the note in Section 31.5.1 below.
31.5.1 The K-SVD Dictionary Update
Algorithm:
Finding the whole dictionary at once is intractable, so the process updates only one column of the dictionary D at a time while X is kept fixed. The update of the k-th column is done by rewriting the penalty term as
∥Y − DX∥²_F = ∥Y − Σ_{j=1}^{K} d_j x_T^j∥²_F = ∥(Y − Σ_{j≠k} d_j x_T^j) − d_k x_T^k∥²_F = ∥E_k − d_k x_T^k∥²_F
Compute the SVD of the residual, E_k = UΔVᵀ (we write Δ for the diagonal SVD factor to avoid a clash with the dictionary D). Approximating E_k by its best rank-1 term, i.e., keeping only the largest diagonal entry δ₁₁ of Δ, gives the update
d_k ← u₁,   x_T^k ← δ₁₁ v₁ᵀ
Convergence of this algorithm is guaranteed in the sense that both steps are descent steps: neither the sparse coding stage nor the dictionary update increases the objective.
Note
The k-th atom of D is used by only a few signals, since X is highly sparse. So instead of solving the whole, larger problem, we can solve a smaller one that uses only the columns (of Y and E_k) corresponding to the non-zero entries in the k-th row of X; this also keeps the updated row x_T^k sparse. A sketch follows.
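A NumPy sketch of this restricted update for a single atom (the function name and in-place convention are our own):

import numpy as np

def ksvd_update_atom(Y, D, X, k):
    # One K-SVD dictionary-update step for atom k, restricted to the
    # signals that actually use the atom so the sparsity of X is preserved.
    omega = np.nonzero(X[k, :])[0]        # signals using atom k
    if omega.size == 0:
        return D, X                       # unused atom: nothing to update
    X[k, omega] = 0.0                     # remove atom k's contribution
    E_k = Y[:, omega] - D @ X[:, omega]   # restricted residual E_k
    U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
    D[:, k] = U[:, 0]                     # d_k <- u_1
    X[k, omega] = s[0] * Vt[0, :]         # x_T^k <- delta_11 * v_1^T
    return D, X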
31.6.1 Homework
Design a method to fit a line to a set of points in 2D such that the orthogonal distance to the line is minimized.
The residual of the best-fit line for a set of n points, using the unsquared perpendicular distances d_i of the points (x_i, y_i), is given by
R⊥ = Σ_{i=1}^{n} d_i   (31.1)
   = Σ_{i=1}^{n} |y_i − (a + b x_i)| / √(1 + b²)   (31.3)
Because the absolute value function does not have continuous derivatives, minimizing R⊥ is not amenable to an analytic solution. However, if the square of the perpendicular distances,
R⊥² = Σ_{i=1}^{n} [y_i − (a + b x_i)]² / (1 + b²),   (31.4)
is minimized instead, the problem can be solved in closed form. R⊥² is a minimum when
∂R⊥²/∂a = [2/(1 + b²)] Σ_{i=1}^{n} [y_i − (a + b x_i)](−1) = 0   (31.5)
and
∂R⊥²/∂b = [2/(1 + b²)] Σ_{i=1}^{n} [y_i − (a + b x_i)](−x_i) + Σ_{i=1}^{n} [y_i − (a + b x_i)]² · [(−1)(2b)/(1 + b²)²] = 0.   (31.6)
Multiplying the second equation through by −(1 + b²)²/2 gives
(1 + b²) Σ_{i=1}^{n} [y_i − (a + b x_i)] x_i + b Σ_{i=1}^{n} [y_i − (a + b x_i)]² = 0   (31.9)
But
[y − (a + bx)]² = y² − 2(a + bx)y + (a + bx)²   (31.10)
= y² − 2ay − 2bxy + a² + 2abx + b²x²,   (31.11)
so (31.9) becomes
(1 + b²)(Σ_{i=1}^{n} x_i y_i − a Σ_{i=1}^{n} x_i − b Σ_{i=1}^{n} x_i²) + b(Σ_{i=1}^{n} y_i² − 2a Σ_{i=1}^{n} y_i − 2b Σ_{i=1}^{n} x_i y_i + a² n + 2ab Σ_{i=1}^{n} x_i + b² Σ_{i=1}^{n} x_i²) = 0   (31.12)
[(1 + b²)(−b) + b·b²] Σ_{i=1}^{n} x_i² + [(1 + b²) − 2b²] Σ_{i=1}^{n} x_i y_i + b Σ_{i=1}^{n} y_i² + [−a(1 + b²) + 2ab²] Σ_{i=1}^{n} x_i − 2ab Σ_{i=1}^{n} y_i + a² b n = 0   (31.13)
−b Σ_{i=1}^{n} x_i² + (1 − b²) Σ_{i=1}^{n} x_i y_i + b Σ_{i=1}^{n} y_i² + a(b² − 1) Σ_{i=1}^{n} x_i − 2ab Σ_{i=1}^{n} y_i + a² b n = 0   (31.14)
Solving (31.5) for a gives a = ȳ − b x̄ (31.15); substituting this into (31.14) and simplifying yields the quadratic
b² + { Σ_{i=1}^{n} y_i² − Σ_{i=1}^{n} x_i² + (1/n)[(Σ_{i=1}^{n} x_i)² − (Σ_{i=1}^{n} y_i)²] } / { (1/n) Σ_{i=1}^{n} x_i Σ_{i=1}^{n} y_i − Σ_{i=1}^{n} x_i y_i } · b − 1 = 0   (31.16)
So define
B = (1/2) · { [Σ_{i=1}^{n} y_i² − (1/n)(Σ_{i=1}^{n} y_i)²] − [Σ_{i=1}^{n} x_i² − (1/n)(Σ_{i=1}^{n} x_i)²] } / { (1/n) Σ_{i=1}^{n} x_i Σ_{i=1}^{n} y_i − Σ_{i=1}^{n} x_i y_i }   (31.17)
  = [ (Σ_{i=1}^{n} y_i² − n ȳ²) − (Σ_{i=1}^{n} x_i² − n x̄²) ] / [ 2( n x̄ ȳ − Σ_{i=1}^{n} x_i y_i ) ],   (31.18)
so that
b = −B ± √(B² + 1),   (31.19)
with a then recovered from a = ȳ − b x̄.
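A NumPy sketch of this closed form (it assumes the denominator n·x̄·ȳ − Σ xᵢyᵢ is non-zero; the correct root is picked by evaluating R⊥² at both candidates):

import numpy as np

def fit_line_perpendicular(x, y):
    # Fit y = a + b*x minimizing the sum of squared perpendicular distances.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    B = (((y**2).sum() - n*ybar**2) - ((x**2).sum() - n*xbar**2)) \
        / (2.0 * (n*xbar*ybar - (x*y).sum()))     # eq. (31.18)
    def r2(b):                                    # R_perp^2 from eq. (31.4)
        a = ybar - b*xbar
        return ((y - (a + b*x))**2).sum() / (1 + b**2)
    b = min((-B + np.sqrt(B**2 + 1), -B - np.sqrt(B**2 + 1)), key=r2)
    return ybar - b*xbar, b                       # (a, b)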
Ans: Let A consist of the columns of two orthonormal bases with mutual coherence µ. If a representation x satisfies
∥x∥₀ < (1/2)(µ⁻¹ + 1),
where ∥·∥₀ counts the number of non-zero coefficients, then x is the unique sparsest solution.
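A small NumPy check of this bound (the two-basis example below is made up):

import numpy as np

def mutual_coherence(A):
    # Largest absolute inner product between distinct normalized columns
    A = A / np.linalg.norm(A, axis=0)
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a random orthonormal basis
A = np.hstack([np.eye(4), Q])                     # two orthonormal bases
mu = mutual_coherence(A)
print(mu, 0.5 * (1.0/mu + 1.0))   # uniqueness bound on ||x||_0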
Model Fitting
Once the data representation part is done, we next try to build a model with our data. Here we try to model the target variable as a function of the features.
Let us represent the height of a child as y, the age of the child as x₁, the weight of the child as x₂, and the heights of the parents as x₃ and x₄. We may create a linear model like
y = a₀ + a₁x₁ + a₂x₂ + a₃x₃ + a₄x₄   (31.20)
or a polynomial model like
y = a₀ + a₁x₁ + a₂x₁² + a₃x₂ + a₄x₂² + a₅x₃x₄   (31.21)
or an even more complex one like
y = a₀ + a₁ log x₁ + a₂ e^{2x₁} + a₃ x₃ x₄⁵   (31.22)
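For instance, the linear model (31.20) can be fit by ordinary least squares; the data below are synthetic, for illustration only:

import numpy as np

rng = np.random.default_rng(1)
# Made-up samples: columns are age, weight, parents' heights (x1..x4)
X = rng.uniform([2, 10, 150, 150], [12, 45, 190, 190], size=(50, 4))
y = 80 + 5*X[:, 0] + 0.3*X[:, 1] + 0.1*X[:, 2] + 0.1*X[:, 3] \
    + rng.normal(0, 2, 50)                       # synthetic heights (cm)

A = np.hstack([np.ones((50, 1)), X])             # design matrix [1, x1..x4]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # a0..a4 by least squares
print(coeffs)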
Problem Solver
In this step we may either predict new values based on our model (prediction) or classify a new observation into some class (classification).
Taking the example of modelling the height of a child, let us say our model is very simple: the height of the child depends only linearly on the age of the child. Mathematically, with height y and age x₁,
y = a₀ + a₁x₁   (31.23)
or, more generally, y = Ax.
As can be seen, this is the equation of a line, and we have many samples of child height and age. The problem is to find the equation of the line, using all these samples, that minimizes the error. The error can be defined in many forms (see the sketch after this list), e.g.
1. least-squares error ∥y − Ax∥₂: fit the line so that the sum of squared vertical distances of the points from the line is minimized;
2. L0 norm ∥y − Ax∥₀: fit the line so that the number of points not on the line is minimized;
3. orthogonal least-squares error: fit the line so that the sum of squared orthogonal distances of the points from the line is minimized (as in the homework above).
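The sketch below contrasts error forms 1 and 3 on made-up samples, reusing fit_line_perpendicular from the homework sketch above:

import numpy as np

x1 = np.array([3.0, 5.0, 7.0, 9.0, 11.0])          # ages (years), made up
y  = np.array([95.0, 108.0, 121.0, 132.0, 146.0])  # heights (cm), made up

# Error form 1: ordinary least squares (vertical distances)
A = np.column_stack([np.ones_like(x1), x1])
(a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)

# Error form 3: perpendicular (orthogonal) distances
a_perp, b_perp = fit_line_perpendicular(x1, y)
print((a0, a1), (a_perp, b_perp))                  # the two fits differ slightly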
References
1. Francis Bach, Julien Mairal, Jean Ponce and Guillermo Sapiro, Sparse Coding and Dictionary Learning for Image Analysis, ICCV'09 tutorial, Kyoto, 28 September 2009.
2. Yuchen Xie, On A Nonlinear Generalization of Sparse Coding and Dictionary Learning, Qualcomm Technologies, Inc., San Diego, CA 92121, USA.
3. Philip Breen, Algorithms for Sparse Approximation, School of Mathematics, University of Edinburgh, 2009.
4. Shaobing Chen and David Donoho, Basis Pursuit, Statistics Department, Stanford University, Stanford, CA 94305.
5. Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybiral, A Survey of Compressed Sensing.
6. The EM Algorithm, November 20, 2005.
7. T. Tony Cai and Lie Wang, Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise, IEEE Transactions on Information Theory, vol. 57, no. 7, July 2011.
8. M. Aharon, M. Elad, and A. M. Bruckstein, K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation, IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.
9. K-SVD, https://en.wikipedia.org/wiki/K-SVD
10. Francis Bach, Julien Mairal, Jean Ponce and Guillermo Sapiro, Sparse Coding and Dictionary Learning for Image Analysis (Part III: Optimization for Sparse Coding and Dictionary Learning), CVPR'10 tutorial, San Francisco, 14 June 2010.