
Lecture Notes on

Optimization Methods

Compiled and Curated for

CSE 481: Optimization Methods, Spring 2019

Instructor: C. V. Jawahar
IIIT Hyderabad

Ver 0.1 (Updated on Jan. 1, 2019)


Not for Public Circulation; Strictly for the use of students enrolled for the course.

Contents

0 Introduction to Optimization Methods 15


0.1 A Course on Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.2 Why OM? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.3 What should I look for? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.4 Classes of Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.5 Spring 2019: Course on “Optimization Methods” . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.5.1 Course Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.5.2 Evaluation/Grading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
0.6 Text Books and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
0.6.1 Books (B) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
0.6.2 NOTES/BLOGS (A) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
0.6.3 Papers (P) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
0.7 Notes and Disclaimers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
0.8 Version History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1 Background 21
1.1 Background on Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2 Background on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3 Hard Problems and Approximate Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4 Optimization Problems in Machine Learning and Signal Processing . . . . . . . . . . . . . . . 23
1.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2 Linear and Integer Programming 25


2.1 Introduction to Linear Programming (LP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.1 Linear Programming (LP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.2 Integer Programming (IP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Formal Introduction to LP and IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Related Tricks to Remember . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.2 Related Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Graphical Method of Solving LP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Some cases of special interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 LP/IP Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7 Pattern Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 LP and IP Formulations 33
3.1 Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Problem: Line Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Minimizing Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Example Problem: Cutting Paper Rolls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Example Problem: MaxFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4 Solving IP using Branch and Bound 41


4.1 How to solve an IP? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Branch and Bound for IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 General Branch and Bound Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Numerical Examples of Branch and Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.3 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5 LP Relaxation 53
5.1 LP Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Bipartite Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Minimum Vertex Cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4 Facility Location problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5 Maximum Independent Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.6 Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

6 More on IP Formulations 61

6.1 BIP and MIP Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1.1 Example Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Function of K Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3 Either-OR constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.4 K out of N constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.5 Modelling Compound Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.5.1 Problems with a fixed cost and variable cost . . . . . . . . . . . . . . . . . . . . . . . . 64
6.6 Modelling Piecewise Linear Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.7 Solving BIP using Balas’ Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.7.1 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

7 More on LP Relaxation 67
7.1 Scheduling for Unrelated Parallel Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 Minimum makespan scheduling on unrelated machines . . . . . . . . . . . . . . . . . . . . . . 68
7.2.1 LP-relaxation algorithm for 2 machines . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.2.2 LP-relaxation algorithm for minimum makespan scheduling on unrelated machines . . 69

8 Solving Ax = b 73
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.1 A is an Identity matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.2 A is a Permutation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.1.3 A is a Diagonal Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2 A is a Triangular Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.2.1 Forward Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3 Cholesky Decomposition of Positive Definite (PD) Matrix . . . . . . . . . . . . . . . . . . . . 75
8.3.1 PD Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3.2 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
8.4 Algorithm for Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
8.5 Solving linear equations by Cholesky factorization . . . . . . . . . . . . . . . . . . . . . . . . 77
8.6 Finding Inverse using Cholesky factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.7 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.7.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.7.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.7.3 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.7.4 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

9 Matrix Decompositions: LU, QR and SVD 81

9.1 Review and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.2 LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2.2 LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.3 Computing the LU factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.3.1 Computational procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.3.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.3.3 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.4 Solving linear equations by LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.5 Computing the Inverse using LU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.6 Solution of Ax = b with a direct inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.6.1 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.6.2 Example Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.7 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
9.8 QR Factorization: The Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.8.1 Algorithm: QR FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.8.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.9 Applications of QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.10 Factorization using SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.11 Computing Inverse using SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.12 Additional Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.12.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.12.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.12.3 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.12.4 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

10 Optimization Problems: Least Square and Least Norm 95


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.2 Least Square Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.3 Efficient Computation of LS solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.4 Least Norms Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
10.5 Efficient Solutions to Least Norm Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.6 Basis Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
10.7 Additional Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10.7.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10.7.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

10.7.3 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

11 Constrained Optimization: Lagrange Multipliers and KKT Conditions 103


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.2 Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.3 KKT Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

12 Eigen Value Problems in Optimization 105


12.1 Eigen Values and Eigen Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
12.1.1 Basics and Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
12.1.2 Numerical Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
12.1.3 Numerical Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
12.2 Applications in Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.3 Optimization: Application in Line Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.3.1 Relationship to SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.4 Application in solving Ax = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
12.5 Optimization: Application in PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
12.6 Optimization: Graph Cuts and Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
12.7 Optimization: Generalized Eigen Value Problem . . . . . . . . . . . . . . . . . . . . . . . . . 109
12.8 Optimization: Spectral Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

13 Introduction to simplex method 111


13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
13.1.1 Remark on LP as a Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 112
13.1.2 Remark on Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
13.1.3 Historical Notes on Algorithms that solve LP . . . . . . . . . . . . . . . . . . . . . . . 113
13.2 Standard Slack Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
13.3 Simplex as Search over Basic Feasible Solutions (BFS) . . . . . . . . . . . . . . . . . . . . . . 114
13.3.1 Basic Feasible Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
13.4 Simplex Algorithm - Ver 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
13.4.1 An Intuitive Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
13.4.2 Moving across BFSs - Cost difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
13.4.3 Simplex Ver 1: A more formal version . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
13.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
13.6 Additional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

14 More on Simplex 121
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
14.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
14.3 How simplex method works? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
14.3.1 Simplex: Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
14.3.2 Computing B̄−1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
14.3.3 Simplex Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
14.4 On Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
14.5 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

15 Simplex Method Tableaux 129


15.1 Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
15.1.1 Simplex: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
15.2 Simplex Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
15.2.1 Intuition and Design of Simplex Method . . . . . . . . . . . . . . . . . . . . . . . . . . 130
15.3 The Tableau . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
15.4 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
15.5 Additional problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

16 Dual Problems 141


16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
16.2 A Simple Primal Dual Pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
16.3 Primal Dual Problem Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
16.3.1 Another LP Primal and Dual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
16.4 Primal ↔ Dual Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
16.5 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
16.6 More Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
16.7 Primal-Dual Pair: Mincut and Maxflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
16.7.1 Maxflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
16.7.2 Mincut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

17 More on Duality 149


17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
17.1.1 More examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
17.2 Key Results on Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
17.2.1 Farkas’ Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
17.2.2 Weak Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

17.2.3 Strong Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
17.2.4 Duality Result for LP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
17.2.5 Duality Result for IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
17.2.6 Duality Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
17.3 Proof of Strong Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
17.3.1 Geometric Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
17.3.2 Proof based on Farkas’s Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
17.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
17.5 Additional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

18 More on Duality 155


18.1 Review of Important Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
18.1.1 Some Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
18.2 Duality from Lagrangian Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
18.2.1 Lagrangian and Lagrange Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
18.2.2 Duality and Lagrange Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
18.2.3 Special Case of Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
18.3 When does LP yield an integral solution? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
18.4 Closer Look at Matching and Cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
18.4.1 For BPG, incidence matrix A is TU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
18.5 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
18.6 Additional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

19 Primal Dual Methods 165


19.1 Review and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
19.2 Complementary Slackness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
19.2.1 Basic Derivation/Proof of Complementary Slackness . . . . . . . . . . . . . . . . . . . . 167
19.2.2 Using Complementary Slackness to Solve Duals/Primal . . . . . . . . . . . . . . . . . 167
19.2.3 Complementary Slackness:Ver2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
19.3 Introduction to Primal Dual Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
19.3.1 Overview of the primal-dual method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
19.3.2 Introduction : . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
19.3.3 Primal-Dual based approximation algorithm . . . . . . . . . . . . . . . . . . . . . . . . 171
19.4 Example: Numerical Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
19.5 Shortest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
19.5.1 Shortest Path Problem:Ver1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

19.6 Example: MST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
19.6.1 MST:Ver1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
19.7 Set Cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
19.7.1 Set cover:Ver2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
19.8 Example:Weighted Vertex Cover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
19.8.1 Ver1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
19.8.2 Iterative Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
19.8.3 Summary: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
19.8.4 Minimum-weighted Vertex Cover:Ver2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
19.8.5 Weighted Vertex Cover via Primal-Dual method . . . . . . . . . . . . . . . . . . . . . 180
19.8.6 Primal-Dual algorithm for WVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
19.8.7 WVC :Ver3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
19.9 Example: Minimum Steiner Forest via Primal-Dual method . . . . . . . . . . . . . . . . . . . 182
19.9.1 Approximation Algorithm for Minimum Steiner Forest via Primal-Dual method . . . . 183
19.10 WEIGHTED SET COVER: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

20 Convex Sets and Convex Functions 187


20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
20.1.1 Line joining the points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
20.2 Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
20.2.1 Properties of Convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
20.2.2 Convex Hull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
20.2.3 Separating Hyperplane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
20.3 Definition of Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
20.3.1 Jensen’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
20.3.2 Epigraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
20.3.3 Hessians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
20.3.4 First-order conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
20.3.5 Second-order conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
20.4 More on Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
20.4.1 Strictly convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
20.4.2 Sublevel Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
20.4.3 Convexity Preserving operations over functions . . . . . . . . . . . . . . . . . . . . . . 194
20.4.4 Concave Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
20.4.5 Quasiconvex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
20.5 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

20.6 Additional problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
20.7 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
20.8 Appendix: Gradient, Hessian and Jacobian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
20.8.1 Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
20.8.2 Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
20.8.3 Jacobian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

21 Convex Optimization 201


21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
21.1.1 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
21.2 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
21.2.1 Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
21.2.2 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
21.2.3 Implicit and Explicit Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
21.2.4 Feasibility Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
21.2.5 Optimality Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
21.2.6 Equivalent convex problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
21.3 Quasi Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
21.3.1 Quasiconvex Standard Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
21.3.2 Optimization: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
21.4 Variations and Other Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
21.4.1 Linear Fractional Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
21.4.2 Quadratic Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
21.4.3 Quadratically Constrained Quadratic Programs . . . . . . . . . . . . . . . . . . . . . . 209
21.4.4 Second Order Cone Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
21.4.5 SDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
21.4.6 Hard Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
21.5 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

22 Optimization Problem: Support Vector Machines 213


22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
22.2 Primal problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
22.3 Dual problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
22.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

23 SDP and MaxCut 215


23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

24 Nonlinear Optimization: Solving f (x) = 0 217
24.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
24.1.1 Iterative Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
24.2 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
24.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
24.2.2 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
24.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
24.3 Fixed point iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
24.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
24.3.2 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
24.4 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
24.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
24.4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
24.5 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
24.5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
24.6 Additional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
24.6.1 Additional Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

25 More about Non Linear Optimization 233


25.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
25.2 Review of Iterative Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
25.3 Gradient, Hessian and Jacobian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
25.3.1 Basic Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
25.4 Approximating Functions and Taylor’s Series . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
25.4.1 Orders of Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
25.4.2 First order approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
25.4.3 Second order approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
25.5 Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
25.5.1 Optimality conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
25.6 Additional Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
25.6.1 Additional Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

26 More about Non Linear Optimization 245


26.1 Newton’s Method For Sets of Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . 246
26.2 Newton’s method for minimizing a convex function . . . . . . . . . . . . . . . . . . . . . . . . 247
26.3 Newton’s method with backtracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

26.4 Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
26.5 Additional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252

27 Closer Look at Gradient Descent Optimization 253


27.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

28 Optimization in Deep Neural Networks 255

29 Optimization in Support Vector Machines 257

30 Regression and Regularization 259


30.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

31 Sparse Coding and Dictionary Learning 261


31.1 Representation and Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
31.2 Problem of Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
31.2.1 The Sparse Coding Problem Formulated . . . . . . . . . . . . . . . . . . . . . . . . . . 264
31.2.2 What are sparse representations/approximations good for? . . . . . . . . . . . . . . . 265
31.2.3 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
31.2.4 Important aspects related to Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . 265
31.3 BP, MP and OMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
31.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
31.3.2 Basic Matching Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
31.3.3 Orthogonal Matching Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
31.3.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
31.4 Dictionary Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
31.4.1 Why to learn Dictionary? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
31.4.2 Dictionary Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
31.4.3 Dictionary Learning Methods: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
31.4.4 k-means:[Randomly Partition Y to k subsets] . . . . . . . . . . . . . . . . . . . . . . . 271
31.4.5 Dictionary Learning via k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
31.5 K-SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
31.5.1 K- SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
31.6 Homework Problems: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
31.6.1 Homework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

Chapter 0

Introduction to Optimization Methods

What is the scope of this course?


Examples of optimization problems.
How will this course be run?
How will the performance be evaluated?
What should I aim to learn by the end of this course?
This chapter gives you answers to some of these questions.

0.1 A Course on Optimization Methods

We see a variety of optimization problems in our day to day life. We solve many of them very comfortably
(though not necessarily all of them ’optimally’ !). Sometimes, we also struggle with such problems. Even
formally stating the problem can be very challenging in many situations.
We studied many optimization problems in our high school days. We have also seen optimization in
different areas of computer science (and of course in the wider engineering disciplines). There are many
important aspects of optimization that demand a formal study of this class of problems and the associated
solution schemes. This course focuses on a set of fundamental aspects related to optimization methods. (We
will not be able to cover all important aspects in a first-level course like this. Please read classical text books,
monographs or research papers to advance your knowledge.) This is purely an introductory course.
There are many important aspects of optimization methods that we are interested in. They include:

• What is the nature of a specific optimization problem? How to characterize the problem?
• How do we formally state a problem as an optimization problem (say as a Linear Programming (LP)
problem)? (Do not confuse the word “programming” here with your favorite Python programming;
the two may look very different on the surface.)
• How do we solve an optimization problem (say for a hard problem)? What are some of the popular
algorithms?
• Do we have an efficient solution to the problem of interest (say a polynomial time algorithm)?
• If we do not have an efficient solution, can we come up with an efficient but approximate solution?
• When we solve an optimization problem, will we always obtain the “best” solution? (Do we get a
global optimum or only a local optimum?)
• What are the applications of optimization methods in other related disciplines (such as machine learn-
ing)?
• What are some of the numerical linear algebra related tools that are needed for solving optimization
problems? (eg. how to solve a system of equations efficiently?)

This course assumes background in (i) mathematics (especially linear algebra) and (ii) algorithms (graph
algorithms, computational complexity), both at an introductory level.
Though the major focus of this course is on optimization methods, we demonstrate the methods on two
specific domains: (i) approximation algorithms and (ii) machine learning. A minimal understanding of the
terminologies in these areas will be useful, but not critical. Some of these are summarized in Chapter 1.

0.2 Why OM?

Many computational problems are optimization problems. When you want to design a minimal spanning
tree (MST) for a graph, you are finding the “best” tree out of many possible ones. How do we find the best
one? Indeed, someone has already told you the algorithm to compute this. However, when you have a new
problem in hand, how do you design an algorithm? You might have seen many specific tools in a course on
design and analysis of algorithms. If you want to pick only one tool, it could be LP or IP, which are very
widely applicable.
Modern machine learning is dominated by optimization. The entire “training” process of a machine learning
algorithm is often an optimization with some additional tricks. Deep learning methods may be optimizing an
objective function of millions or billions of variables. Even after optimizing for days or weeks, the solution that
we obtain is not guaranteed to be the best (or optimal)! Even if the models/solutions obtained through
the optimization process are useful and “very good”, as an optimization person you should appreciate that
this leaves tremendous scope for designing far superior solutions to many practically important problems, if
we can ever design a better optimization scheme.

0.3 What should I look for?

This course will give emphasis on (i) how to formulate the problem at hand as an optimization problem
(from a plain-English task description to a mathematical definition of the problem); (ii) how to identify the
nature of the problem (is it a linear one? is it a nonlinear one? is it a non-convex one?); (iii) what the
popular algorithms are (maybe a smaller set of the popular ones) to solve the optimization problem exactly
or sometimes approximately; (iv) how optimization techniques get used in machine learning, the design of
algorithms, etc.
Implementations of many of these optimization problems are numerically tricky. Though we will see some
numerical and computational procedures, the objective of this course is NOT to train you in implementing
the solution to the optimization problem on a computer. Wherever programming is required, students will be
encouraged and guided to use some of the standard libraries/packages. Minimal familiarity with computer
programming is expected in this regard.

0.4 Classes of Optimization Problems

There are many interesting dichotomies in this space. Some of them worth noticing:

1. Linear vs Nonlinear

2. Convex vs Nonconvex

3. Discrete vs Continuous

4. Constrained vs Unconstrained

The names may make these distinctions reasonably obvious. Keep these words in mind as we progress.
However, the nature of the problem, the solution scheme, the guarantees, etc. can be very different across
the categories. This is why it becomes important for us to know where our problem lies in practice.
Question Contrast and elaborate on these classification schemes in detail at the end of this course. Give
examples of problems, and also pick classical or practical problems in each of these categories.
By the end of this course, you should aim to develop skills that help you place a problem in an appropriate
class.

0.5 Spring 2019: Course on “Optimization Methods”

0.5.1 Course Plan

Each chapter in this note series corresponds to approximately one lecture of 1.5 hours. The assumption is that
you should learn content worth 25 to 30 lectures by the end of this course. These lecture notes and Internet
resources will help you to explore the topics further. It is important that the student puts in effort and
goes somewhat beyond the lectures. Content worth 20 to 25 lectures will be discussed in the lectures. It is
not expected that you come prepared for the lectures (at this stage) by reading these notes in advance. But
it is expected that you read them after the lecture. (The notes may not be descriptive or complete enough
for you to read and follow smoothly at this stage.)

0.5.2 Evaluation/Grading

I wish a course like this could be evaluated purely based on class work, homeworks and personalized evaluations.
Maybe we are not fully ready to get rid of the examinations at this stage. We will continue to have mid and
final examinations. There is no plan to make these exams tricky or super challenging.

Regular Homeworks The main evaluation is based on regular homeworks. You are expected to solve
50 homework problems “at your convenience”, i.e., whenever you think you have time, you can “ask for” a
question and you will be given one immediately. (This implies that you do not have to take special
permission for family functions or festivals.) To make the load (and, more importantly, the learning) uniform,
you can solve a maximum of one question a day. However, you will have to submit your answer within the
next 24 hours. We assume that the average time required for solving a question is 1 to 2 hours for a typical
student, provided that the student is familiar with the content. Indeed, there will be situations where you are
unable to solve/submit due to personal reasons or limitations. Students may make mistakes too. Therefore,
you will have access to 60 questions and the best 50 will be used for the grading. We typically expect you to
solve 2-3 questions between successive lectures. If you move faster than that, you will get questions from
future lectures and it becomes your responsibility to learn the material yourself.
We encourage you to collaborate openly. Once you collaborate, do acknowledge it. Your points could also be
divided accordingly. The person who helps (including TAs) should be given due credit.
The questions you solve may be different from those your friends solve. It is a serious offense if (i) you take
a question out of the system (questions could be watermarked with visible and/or invisible watermarks),
including via photography/screenshots; (ii) you discuss the solutions electronically (electronic groups, mailing
lists); or (iii) you take help from people without acknowledging it.

Grading

• Two mid semester exams: 25%

• One Final exam: 20%

• Homeworks: 40%

• Assignments (Max 3) 15%

• Course participation (discussions in class+course portal): 5%

There could be a maximum of ±5% change from this plan. It is expected that the class participation score
will be zero for most students. However, it can be positive or negative for a smaller fraction of the students.

0.6 Text Books and References

There are many popular text books and references. A list is available on the course page.

0.6.1 Books (B)


1. M T Heath, “Scientific Computing”, TMH (Most of first six chapters)

2. C H Papadimitriou and K Steiglitz, “Combinatorial Optimization: Algorithms and Complexity”, Dover
(Most of first seven chapters)

3. S. Boyd and L Vandenberghe, “Convex Optimization”, Cambridge University Press (Online copy
available at: http://www.stanford.edu/~boyd/cvxbook/ )

4. L Vandenberghe, Lecture Notes for Applied Numerical Computing (Online copy available at:
http://www.ee.ucla.edu/~vandenbe/103/reader.pdf )

5. D Bertsimas and J N Tsitsiklis, “Introduction to Linear Optimization”, Athena Scientific http://personal.vu.nl/l

6. J Matousek and B. Gartner, “Understanding and Using Linear Programming”, Springer, 2007
( http://blogs.epfl.ch/extrema/documents/Maison

0.6.2 NOTES/BLOGS (A)


1. TOADD

0.6.3 Papers (P)


1. TOADD

0.7 Notes and Disclaimers


• Notations There is no guarantee that the notations are fully consistent, though consistency has been
attempted in many places.
• Disclaimer This is from the draft lecture notes for CSE481 at IIIT Hyderabad in the past. Do not
circulate these notes beyond the registered students. Please report bugs/errors/omissions on the
course web page. This is only a draft version. One objective of posting these notes is to initiate
discussions within the class, but outside the classroom. Also do suggest better explanations of the
concepts or better examples. This will help to improve the notes further. These notes are built over the
lectures/notes/scribes of the previous offerings.
• Enhancements The notes are expected to be enhanced and made more complete by the end of the
semester. Whenever there is a new version, you will see it on the course page.
• Academic Honesty This is very important. If you take help from fellow students or TAs, please
acknowledge them well. Attempts to use social media, electronic forums or other means for unethical
collaboration may lead to academic actions.

0.8 Version History

These lecture notes are based on the notes from the previous offerings and are still evolving. Please see this
section to know where we are, and how the notes changed over versions.

• Ver 0.1 [Jan 1 2019] Some structural and cosmetic changes to suit the plans for Spring 2019.
Available to the students who have enrolled, to roughly appreciate the course content.
• Ver 0.2 Coming soon .. (ETA Jan 15, 2019)

Chapter 1

Background

This is the time to revise what we have learned in the past. Many of the terminologies may be
familiar to you, but they are worth reading and refreshing. If you are too far from these terminologies,
you may find it hard to go through the rest of the material.

1.1 Background on Matrices

This course assumes some amount of familiarity with mathematics, especially linear algebra. A basic
understanding of geometry (points, lines, planes) is expected. Basics of calculus (differentiation) are also
required. Here are some definitions/terms that you should have heard already:

• Vectors

• Matrices

• Types of Matrices (i) square (ii) triangular

• Determinant of a matrix

• Rank of a matrix

• Norms L0, L1, L2, Lp, L∞ norms

• System of Equations A system of linear equations represented as Ax = b, where A is a matrix and
b is a vector.

• Lines, Planes, Hyperplanes and Half Spaces ax1 + bx2 = c is a line in 2D. ax1 + bx2 ≤ c is the
line plus one side of it (a half space) in 2D; similarly for ax1 + bx2 ≥ c. In 3D the analogue is a plane
a1 x1 + a2 x2 + a3 x3 = b, and in d dimensions it is the hyperplane aT x = b.

• What it means to solve Ax = b, and to find feasible points for Ax ≤ b (a minimal sketch follows this list).
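
As a minimal sketch of the last two items (Python with NumPy is assumed here as an illustrative tool; it is
not part of the original notes, and the matrix values are made up for the example):

import numpy as np

A = np.array([[2.0, 1.0],
              [7.0, 8.0]])
b = np.array([6.0, 28.0])

# Solving Ax = b: the unique solution when A is square and invertible.
x = np.linalg.solve(A, b)            # preferred over forming inv(A) explicitly
print(x)                             # approx [2.2222 1.5556]

# A few of the norms listed above, evaluated on x.
print(np.linalg.norm(x, 1))          # L1 norm
print(np.linalg.norm(x, 2))          # L2 norm
print(np.linalg.norm(x, np.inf))     # L-infinity norm

# A feasible point for Ax <= b is any x0 satisfying every inequality.
x0 = np.array([1.0, 1.0])
print(bool(np.all(A @ x0 <= b)))     # True: (1, 1) satisfies both inequalities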

1.2 Background on Graphs

This course assumes some amount of familiarity with the basics of graphs (e.g. what you learn from a first
course on Discrete Mathematics or Data Structures).
You are expected to be familiar with:

• Graphs and Trees

• Nodes/Vertices and Edges

• Sample Problem: Shortest Path Problem

• Sample Problem: Minimal Spanning Tree

1.3 Hard Problems and Approximate Algorithms


• Big O Computational complexity. Big Oh Notation.

• Complexity Why is it very important.

• P and NP Class of problems.

• Sample Problem: Knapsack problem

• Sample Problem: TSP Problem

1.4 Optimization Problems in Machine Learning and Signal Processing
• Least Square Problem
• Regression

• Binary Classification
• Multi Class Classification
• Support Vector Machines and Deep Learning

1.5 Notes

You will find excellent references for the above in many places, including the courses that you have done
and the textbooks that you used. The best option is to use the textbooks from your mathematics,
data structures and algorithms courses.

Chapter 2

Linear and Integer Programming

Let us start with a simple but popular class of problems.

2.1 Introduction to Linear Programming (LP)

Linear Function A function f (x1 , x2 , . . . , xn ) of x1 , . . . , xn is a linear function if and only if, for some set
of constants c1 , c2 , . . . , cn ,
f (x1 , x2 , . . . , xn ) = c1 x1 + c2 x2 + . . . + cn xn

Linear Inequalities For any linear function f (x1 , x2 , . . . , xn ) and any number b, the inequalities such as

f (x1 , x2 , . . . , xn ) ≤ b

f (x1 , x2 , . . . , xn ) ≥ b
are linear inequalities.
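For example, 3x1 − x2 ≤ 4 is a linear inequality, whereas x1 x2 ≤ 4 is not, since the product term x1 x2 is
not a linear function of x1 and x2 .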

2.1.1 Linear Programming (LP)

A linear programming problem may be defined as the problem of maximizing or minimizing a linear function
subject to linear constraints. The constraints may be equalities or inequalities.
To discuss a bit more in detail, a Linear Programming Problem is an optimization problem for which

1. We attempt to maximize (or minimize) a linear function (or objective function) of the decision variables.
Objective function (z) is the criterion for selecting the “best” values of the decision variables. Decision
variables (xi ) are the variables of interest here.
2. We are often interested in the decision variables at the optima (x∗i ) and the value of the objective (z ∗ ).
Not always both!!.
3. The values of the decision variables must satisfy a set of constraints. Each of these constraints can be
equalities or inequalities. Constraints are like limitations on the resource availability. There can be
more than one constraints.
4. Sign restrictions could apply to each variable, i.e., xi ≥ 0, xi ≤ 0, or xi unrestricted.

In Linear programming, the decision variables can take any real value.

2.1.2 Integer Programming (IP)

In Integer programming, the decision variables can take only integer values. This may seem to be a simple
change from LP, but it makes solving a general IP very hard.

2.2 Formal Introduction to LP and IP

Linear programs are problems that can be expressed in a canonical form:

maximize cT x
subject to
Ax ≤ b
x ≥ 0; x ∈ Rn
where x represents the vector of n variables (the decision variables of our interest), c and b are vectors of
(known) coefficients, A is a (known) m × n matrix of coefficients, and (.)T is the matrix transpose. Note
that the coefficients from the m constraints are now arranged into a matrix form in A.

The expression to be maximized or minimized is called the objective function. The inequalities Ax ≤ b
and x ≥ 0 are the constraints which specify a convex polytope over which the objective function is to be
optimised.
For a Linear Program (LP), the xi are real numbers, i.e., x ∈ Rn .
For an Integer Program (IP), the xi are integers, i.e., x ∈ Zn .
We may also have mixed integer programs where some variables are real and some are integers.
Though LP and IP may seem to be very similar, they behave very differently in practice. An LP can be
solved “easily”, while solving an IP can be “hard”. We shall see this in more detail as we move forward.
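As a toy illustration of the gap (an illustrative example, not from the original notes): maximize x subject
to 2x ≤ 3 and x ≥ 0. As an LP the optimum is x = 1.5; as an IP the optimum is x = 1. In general, simply
rounding an LP solution need not even give a feasible IP solution, let alone an optimal one.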

2.2.1 Related Tricks to Remember

You may not always get a problem that can be written in the above form directly.
Given a problem, we often rewrite it into a standard form.
LP formulations can be transformed from one form to another easily, e.g., maximization of the objective
function to minimization. Some simple rules that guide the transformations are listed below, followed by a
worked instance:
1. max ⇐⇒ min is same as cT x ⇐⇒ −cT x

2. Absolute Values |xi | ≤ 3 ⇐⇒ xi ≤ 3 AND xi ≥ −3

3. Equality xi = 3 ⇐⇒ xi ≤ 3 and xi ≥ 3
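For instance (a worked illustration of the rules above): minimizing −3x1 − 2x2 subject to |x1 | ≤ 3 can be
rewritten as maximizing 3x1 + 2x2 subject to x1 ≤ 3 and x1 ≥ −3; similarly, an equality constraint x2 = 3
can be replaced by the pair x2 ≤ 3 and x2 ≥ 3.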

2.2.2 Related Terms


1. Solution: A solution of an LP is a setting or state of the decision variables.

2. Feasible Solution: A solution that satisfies all the constraints.

3. Feasible Region: The set of all feasible solutions. A different way of defining it is as the region enclosed
by all the inequalities involved. The feasible region is also referred to as the search space, solution space
or feasible set. An optimization method searches for the best solution within the feasible region.

4. Optimal Solution: Optimal solution is the feasible solution with the largest objective function value
for a maximization problem or a feasible solution with the smallest objective function value for the
minimization problem.

2.3 Numerical Example

Let us now start with a simple numerical problem.

maximize 3x1 + 2x2


subject to
2x1 + x2 ≤ 6
7x1 + 8x2 ≤ 28
x1 ≥ 0 ; x2 ≥ 0 and both real
In this problem there are two unknowns, and four constraints. There are several important terms to under-
stand before solving any LP problem. They are given below.

Figure 2.1: Constraints and the Feasible Region. Note that the axes should be x1 and x2

Objective Function The function to be maximized (or minimized) is called the objective function. Here,
the objective function is 3x1 + 2x2 and the objective needs to be maximized.
Question: For the above problem what is m? what is n? 2 and 2. Can you write A, b and c?

Constraints A constraint involves an inequality or equality in some linear function of the variables. The
two constraints, x1 ≥ 0 and x2 ≥ 0, are special. These are called non-negativity constraints and are
often found in linear programming problems. The other constraints are then called the main constraints.
Here, the main constraints are (i) 2x1 + x2 ≤ 6 and (ii) 7x1 + 8x2 ≤ 28

Feasible Region A feasible region or solution space is the set of all possible points of an optimization
problem that satisfy the problem’s constraints. The feasible region for the example problem mentioned above
is given in the figure.

Feasible Solution A feasible solution to a linear program is a solution that satisfies all the constraints
which effectively lies within the feasible region. The point (1, 1) is a feasible solution in our example.

Optimal Solution An optimal solution to a linear program is a feasible solution with the largest objective
function value (for a maximization problem). A linear program may have multiple optimal solutions, but
only one optimal solution value.

Note:

1. Since every inequality forms a half plane (which is a convex set), the set of feasible solutions to an LP
(feasible region) forms a (possibly unbounded) convex set.

2. An optimal solution lies on a face/edge of the convex set. In particular, if a linear program has an
optimal solution, then it has an extreme point (corner point) that is optimal.

Optimization Method Do we have to search at all the vertices? Do we have to search at all the points
inside the shaded region also?

2.4 Graphical Method of Solving LP

For the example problem, the objective is to maximize 3x1 + 2x2 . From the figure, the corner points are
(0,3.5), (0,0), (3,0) and (20/9,14/9). Their corresponding objective values are 7, 0, 9 and 88/9 ≈ 9.78
respectively. Since the maximum value is 88/9, the optimal point is (20/9,14/9).
Let us look at a procedure to find this maximum. Consider a line 3x1 + 2x2 = k, drawn for
various values of k. When k = 0, this line passes through the origin; as we increase k, the line
moves away from the origin (in a certain direction). (If k is very large, the line does not intersect our polygon
of interest.) Let us increase k slowly from zero. At a certain point we obtain the last line that intersects/touches
the polygon; this is the one of our interest. In this case, this happens when k = 88/9 ≈ 9.78 and the line passes
through (20/9, 14/9).
Let us now summarise our graphical way of solving LPs.
Given a LP problem i.e., an objective function and a set of constraints, we proceed as follows:

1. Construct a graph and plot the constraint inequalities.

2. Determine the valid side of each constraint line. This can be done by substituting the origin into the
inequality. If the origin satisfies the inequality, then all the points on the origin side of the line are feasible
(valid), and all the points on the other side of the line are infeasible (invalid). The reverse is true if
the origin doesn’t satisfy the inequality.

3. Identify the feasible solution region. This is the area of the graph that is valid for all the constraints.
Choosing any point in the region results in a valid solution.

4. Plot the objective function line and move it in the direction of improvement: towards greater values
if the objective is to be maximized, and towards lower values if it is to be minimized.

5. An optimal solution always occurs at a corner. So algebraically calculate the coordinates of all corners
of the feasible region and use the objective function line to find the optimal corner.

6. Now use the coordinates of the optimal corner (optimal solution) to get the objective function value.

2.5 Some cases of special interest

Every linear program either

1. is infeasible,

2. is unbounded,

3. has a unique optimal solution value

Question: Does (3) mean that every LP has only one optima?

Infeasible Solution A linear program is infeasible if it has no feasible solutions, i.e. the feasible region is
empty. Here is an example for Infeasible Solution

Problem
maximize z = x1 + x2
subject to
x1 − 2x2 ≥ 6

Figure 2.2: Infeasible Solution and Unbounded Solution (TODO fix)

2x1 + x2 ≤ 4
x1 , x 2 ≥ 0

Unbounded Feasible Region A linear program is unbounded if the objective function can be made
arbitrarily good (arbitrarily large, for a maximization problem) over the feasible region. Here is an example
with an unbounded region.

Problem:
maximize z = x1 + x2
subject to
x1 − x2 ≤ 5
−2x1 + x2 ≤ 4
x1 , x 2 ≥ 0

2.5.1 Exercise

Solve the following problem graphically.

Minimize z = 5x1 + 2x2

Subject to:
6x1 + x2 ≥ 6
4x1 + 3x2 ≥ 12
x1 + 2x2 ≥ 4
xi ≥ 0 and real

Graph Method: Notes

1. Draw the lines corresponding to each constraint in a graph and shade the half planes according to the
inequalities.
2. If there is no intersection of the shaded regions then there is no solution.
3. If there is a bounded intersection region, find the coordinates of the corners by solving the systems of
intersecting equations.
4. Find which coordinate gives the optimum value by applying them over the optimization function.

Note: Is this how we plan to solve all the problems as we move forward? No, never. However, it is important
to understand the geometric viewpoint of LP, the constraints and the objectives.
Question: Can an IP also be solved like this? No; that is a larger story for the next few lectures.
Question: Solve the LPs that we have seen with a numerical solver on your computer (say, a simplex
implementation) in your favourite library. Do compare your answers with those obtained here.
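As a quick check, here is a minimal sketch (assuming SciPy is available) that solves the numerical example of Section 2.3 with scipy.optimize.linprog. Note that linprog minimizes, so we negate the objective to maximize.

from scipy.optimize import linprog

# Sketch: maximize 3x1 + 2x2 s.t. 2x1 + x2 <= 6, 7x1 + 8x2 <= 28, x >= 0.
# linprog minimizes, so pass the negated cost vector.
c = [-3, -2]
A = [[2, 1],    # 2x1 +  x2 <= 6
     [7, 8]]    # 7x1 + 8x2 <= 28
b = [6, 28]

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x)       # approx (20/9, 14/9) = (2.22, 1.56)
print(-res.fun)    # approx 88/9 = 9.78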

2.6 LP/IP Formulations

One of our objectives in the initial part of this course is to learn how to formulate a problem as an LP or IP.
We will take many examples in the next few lectures. This is an important skill to have. It is expected that
you practice formulating problems on your own.

2.7 Pattern Classification Problem

Let us now consider an example problem, i.e., pattern classification. This is a fundamental problem in machine
learning, often referred to as the supervised classification problem. If you are given examples of “apples” and
“not-apples”, a machine learning algorithm is asked to learn a classifier that separates the apples from the others.
Let us assume that every item (sample/object) of interest is characterized by two variables xi and yi (say
height and weight).

Problem Statement We are given two disjoint patterns. One is a positive pattern consisting of the
coordinates {(x_i^+, y_i^+)}; the other is a negative pattern consisting of the coordinates {(x_i^-, y_i^-)}.
Assume there are n+ positive examples and n− negative examples. We have to find a line such
that it classifies the patterns (the 2D points) in the best possible way, i.e., the line must lie in between and
must be as far as possible from every point in both pattern classes (i.e., positives and negatives).
Note that in practice this classifier need not be a line; it can be of an arbitrary shape. However, many
practical problems still use linear classifiers.
There are in fact many lines that are possible solutions to this problem. A good classifier also wants the points
to be as far away as possible from the separating line/plane. Can we maximize this separation? Let us formulate
this problem as an LP.

LP Formulation We would like to maximize the separation δ of each sample (to be precise, the nearest
sample) from the line.
We take a variable δ and maximize it while making sure that the distance of the nearest point in each of
the classes (i.e., positive class and negative class) is at least δ; equivalently, the distance of the line to each and every
point in the data set is at least δ, i.e.,

• The distance from any of the positive pattern points to the line is at least δ.

• The distance from the line to any of the negative pattern points is at least δ.

• The line is defined as yi = axi + b, and positive samples are on one side and negative samples on the
other side.

Formulation of the problem is as follows:

Maximize δ

Figure 2.3: Line which separates the patterns with distance of at least δ

Subject to:
y_i^+ ≥ a x_i^+ + b + δ ,  i = 1, . . . , n+

y_i^- ≤ a x_i^- + b − δ ,  i = 1, . . . , n−

If we solve this problem, we obtain a line such that every point is separated from it by a (vertical) distance of at least δ.

Comments:

1. We have formulated the LP for the simple distance measured along y (vertical distance), not the orthogonal
distance from the line. The problem becomes difficult if the orthogonal distance is to be considered. (Note
that the figure does not show the problem setting correctly.)

2. We have assumed that the positive pattern is above the line and the negative pattern is below the line. It
could also be the reverse, i.e., negatives above and positives below. Question: Does it matter to us
really?
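To make the formulation concrete, here is a minimal sketch (with assumed toy data) of the max-separation LP using scipy.optimize.linprog. The decision variables are (a, b, δ); since linprog minimizes, we maximize δ by minimizing −δ.

import numpy as np
from scipy.optimize import linprog

pos = np.array([(1.0, 5.0), (2.0, 6.5)])   # assumed positive points (x, y)
neg = np.array([(1.0, 1.0), (3.0, 2.0)])   # assumed negative points (x, y)

c = [0.0, 0.0, -1.0]                       # minimize -delta
# Positives: a*x + b + delta <= y   ->  row [x, 1, 1], rhs y
A_pos = np.column_stack([pos[:, 0], np.ones(len(pos)), np.ones(len(pos))])
# Negatives: y <= a*x + b - delta   ->  row [-x, -1, 1], rhs -y
A_neg = np.column_stack([-neg[:, 0], -np.ones(len(neg)), np.ones(len(neg))])

res = linprog(c, A_ub=np.vstack([A_pos, A_neg]),
              b_ub=np.concatenate([pos[:, 1], -neg[:, 1]]),
              bounds=[(None, None), (None, None), (0, None)])
a, b, delta = res.x
print(a, b, delta)                         # separating line and margin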

Chapter 3

LP and IP Formulations

• We had seen the basic definitions of the LP and IP in the last lecture.
• We also know how to solve an LP graphically (not all LPs!!) on paper.

One of our objectives in the initial part of this course is to learn how to formulate a problem as LP
or IP.

3.1 Formulations

In this lecture, we see more examples of how to formulate LPs and IPs. Though each example may be one of
its own form, do appreciate the utility and also the spectrum of problems that gets mapped to LP and IP.
Try to pick up the skill of formulating.

3.2 Problem: Line Fitting

Given a set of points (xi , yi ), where i = 1, 2, 3, . . . , n, find the “best” line to fit these points, i.e., we want
to find a straight line of the form y = ax + b that best describes the points or data set. This is a classical
problem of interest in many areas.
Needless to say, we may not be able to find a line that passes through all the points in most cases. Our
objective is then to find the best line. Best in what sense? That defines the problem (and changes its
nature).
Assume our objective is to find a line such that it is “close” to all the points, i.e., sum of the distances of all
the points to the line is minimum. Objective function is then:


minimize Σ_{i=1}^{n} |y_i − (a x_i + b)|

(Note: here we have used the L1 norm because it gives the absolute deviation or error of the line from each
point, while the L2 norm magnifies the error, i.e., (y_i − (a x_i + b))^2 .)
Let error ei = |yi − (axi + b)|. Then we write the problem as:


minimize Σ_{i=1}^{n} e_i

subject to:
|y_i − (a x_i + b)| ≤ e_i ,  i = 1, 2, . . . , n

If we eliminate the norm and write it as our familiar LP, we obtain:

minimize Σ_{i=1}^{n} e_i

subject to:
y_i − (a x_i + b) ≤ e_i
−(y_i − (a x_i + b)) ≤ e_i ,  i = 1, 2, . . . , n
We have a total of n + 2 variables (e1 , e2 , . . . , en , a, b) and 2n constraints.
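A minimal sketch of this least-absolute-deviation fit with scipy.optimize.linprog (toy data assumed). The variable vector is (a, b, e1 , . . . , en ).

import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 2.0, 3.0])   # assumed data
y = np.array([0.1, 1.2, 1.9, 3.2])
n = len(x)

c = np.concatenate([[0.0, 0.0], np.ones(n)])     # cost only on the e_i

#  y_i - (a x_i + b) <= e_i  ->  -a x_i - b - e_i <= -y_i
# -(y_i - (a x_i + b)) <= e_i ->   a x_i + b - e_i <=  y_i
I = np.eye(n)
A_ub = np.vstack([np.column_stack([-x, -np.ones(n), -I]),
                  np.column_stack([ x,  np.ones(n), -I])])
b_ub = np.concatenate([-y, y])

bounds = [(None, None), (None, None)] + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:2])    # fitted (a, b)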

3.2.1 Variations

There are many interesting variations of this problem, where the error is defined as L0, L1 or L2 norm. (We
may see some of them later on in this course.) The objectives in these cases become


E0 = Σ_i ∥ y_i − (a x_i + b) ∥_0

E1 = Σ_i ∥ y_i − (a x_i + b) ∥_1

E2 = Σ_i ∥ y_i − (a x_i + b) ∥_2

They have different interpretations:

Minimization of E0 leads to a line that passes through as many points as possible.
Minimization of E1 leads to the least absolute deviation fit we saw above.
Minimization of E2 is ordinary least squares.
Question What about L∞ norm?

3.3 Minimizing Norms

We have already seen how the interpretation of the problem changes with a change in the norm. Let us
now consider a couple of examples.

3.3.1 Example 1

Consider the problem:

Minimize ∥Ax − b∥1
subject to
∥x∥∞ ≤ 1

(With our notation, it should be obvious that A is a matrix and b as well as x are vectors. We are given
A and b; we optimize over x. Note that non-negativity and real-vector constraints are not
explicitly written here.)
The above problem can be formulated as an LP. Let us assume that A is a matrix of dimension m × n, x
is a vector of dimension n × 1 and b is a vector of dimension m × 1. We know that the ∥x∥∞ norm measures the
maximum of the absolute values of the elements in a vector.

∥x∥∞ = max(|x1 |, |x2 |, |x3 |, . . . , |xn |)

Since the constraint is ∥x∥∞ ≤ 1, the maximum absolute value among the elements of x is at most 1, which
means that every element lies between −1 and 1:

−1 ≤ xj ≤ 1 , j = 1, . . . , n

We know that the ∥x∥1 norm measures the sum of the absolute values of the elements in a vector.

∥x∥1 = |x1 | + |x2 | + . . . + |xn |.


Thus

∥Ax − b∥1 = |A11 x1 + A12 x2 + . . . + A1n xn − b1 | + |A21 x1 + A22 x2 + . . . + A2n xn − b2 | + . . . + |Am1 x1 + Am2 x2 + . . . + Amn xn − bm |

To minimize this sum, we introduce a vector y of dimension m × 1 and require each element of the vector
Ax − b to lie between −yi and yi . We then minimize the sum of the elements of y. The final objective
function can be written as follows.

Minimize Σ_{i=1}^{m} y_i

Subject to:

−y_i ≤ Σ_{j=1}^{n} a_ij x_j − b_i ≤ y_i ,  i = 1, . . . , m

−1 ≤ x_j ≤ 1 ,  j = 1, . . . , n

Question: How many constraints are there here in our standard form? Do write them as Ax ≤ b.
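A minimal sketch of this LP in code (random A and b assumed for illustration), with the variable vector (x1 , . . . , xn , y1 , . . . , ym ):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))      # assumed data
b = rng.standard_normal(5)
m, n = A.shape

c = np.concatenate([np.zeros(n), np.ones(m)])    # minimize sum of y_i

#  A x - b <= y   ->   A x - y <= b
# -(A x - b) <= y ->  -A x - y <= -b
I = np.eye(m)
A_ub = np.vstack([np.hstack([ A, -I]),
                  np.hstack([-A, -I])])
b_ub = np.concatenate([b, -b])

bounds = [(-1, 1)] * n + [(0, None)] * m          # ||x||_inf <= 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:n], res.fun)                         # x and ||Ax - b||_1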

3.3.2 Example 2

Let us now consider another problem:


Minimize ∥x∥1 subject to the constraint ∥Ax − b∥∞ ≤ 1.
This problem can also be formulated as an LP.
We know that the ∥x∥∞ norm measures the maximum of the absolute values of the elements in a vector.

∥x∥∞ = max(|x1 |, . . . , |xn |)

Since the constraint is ∥Ax − b∥∞ ≤ 1, the maximum absolute value among the entries of Ax − b is at most 1,
which means that every entry lies between −1 and 1:

−1 ≤ Σ_{j=1}^{n} a_ij x_j − b_i ≤ 1 ,  i = 1, . . . , m

We know that the ∥x∥1 norm measures the sum of the absolute values of the elements in a vector:

∥x∥1 = |x1 | + |x2 | + . . . + |xn |.

To minimize this sum, we introduce a vector y of dimension n × 1 and require each element of x to lie
between −yj and yj .
The LP problem can be formulated as follows


Minimize Σ_{j=1}^{n} y_j

Subject to:

−1 ≤ Σ_{j=1}^{n} a_ij x_j − b_i ≤ 1 ,  i = 1, . . . , m

−y_j ≤ x_j ≤ y_j ,  j = 1, . . . , n

3.4 Example Problem: Cutting Paper Rolls

Let us now consider an engineering problem. Paper rolls are manufactured in 3 m (300 cm) width.
A roll is really long; the length of the roll does not matter, since the customers are also looking only for rolls.
There is an order to serve:
(i) 97 rolls of width 135 cm (ii) 610 rolls of width 108 cm (iii) 395 rolls of width 93 cm (iv) 211 rolls of width
42 cm

Figure 3.1: First and second possible ways of cutting the roll

What is the smallest number of 3 m rolls to be cut?


The trivial way of meeting the order is to get 97 + 610 + 395 + 211 = 1313 rolls. If we cut all these rolls and
take rolls of the appropriate widths, we may be wasting a lot. Our interest is reducing the number of rolls
to be cut.
We can cut these paper rolls in 12 different ways:

• P1 : 2×135

• P2 : 135+108+42

• P3 : 135+93+42

• P4 : 135+3×42

• P5 : 2×108+2×42

• P6 : 108+2×93

• P7 : 108+93+2×42

• P8 : 108+4×42

• P9 : 3×93

• P10 : 2×93+2×42

• P11 : 93+4×42

• P12 : 7×42

Note that all these add up to at most 300 cm.

Formulation Let xi be the number of times the ith possibility is used in cutting a 3 m roll. Then the problem
can be formulated as follows:


Minimize Σ_j x_j

Subject to:

2x1 + x2 + x3 + x4 ≥ 97

x2 + 2x5 + x6 + x7 + x8 ≥ 610

x3 + 2x6 + x7 + 3x9 + 2x10 + x11 ≥ 395

x2 + x3 + 3x4 + 2x5 + 2x7 + 4x8 + 2x10 + 4x11 + 7x12 ≥ 211

with non-negativity constraints.


Upon solving the above problem as an LP, we get x1 = 48.5, x5 = 206.25, x6 = 197.5.
Therefore, the LP optimum is x1 + x5 + x6 = 452.25, which is a non-integer solution. To convert the above solution
into an integer solution we round up (why rounding up?) the values of x1 , x5 , x6 .
Rounding up each of the above values we get x1 = 49, x5 = 207, x6 = 198 and x1 + x5 + x6 = 454. Therefore,
the number of rolls to be cut is 454. However, is this minimum?
The optimal solution to the problem is 453, which is obtained by x1 = 49, x5 = 207, x6 = 196, x9 = 1 and
x1 + x5 + x6 + x9 = 453.
Thus rounding the result of an LP solution to an integer solution may or may not give the optimal solution of
the IP problem, but it will be near it.
What went wrong here? Nothing. This was not an LP. This was an IP. Solving by forgetting the integer
constraint does not guarantee the optimal solution.
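A minimal sketch of both the LP relaxation and the exact IP for this instance (this assumes SciPy ≥ 1.9, whose linprog accepts an integrality flag through the HiGHS backend):

import numpy as np
from scipy.optimize import linprog

# Rows: demands for widths 135, 108, 93, 42; columns: patterns P1..P12.
A = -np.array([
    [2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 2, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 2, 1, 0, 3, 2, 1, 0],
    [0, 1, 1, 3, 2, 0, 2, 4, 0, 2, 4, 7],
])                                   # negated: ">= demand" becomes "<="
b = -np.array([97, 610, 395, 211])
c = np.ones(12)

lp = linprog(c, A_ub=A, b_ub=b)                            # ~452.25
ip = linprog(c, A_ub=A, b_ub=b, integrality=np.ones(12))   # 453
print(lp.fun, ip.fun)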

3.5 Example Problem: MaxFlow

Finding the maximum flow in a graph is a classical problem of theoretical and practical interest. The max
flow passing from the source node to the destination node in a network equals the minimum capacity which,
when removed from the network, results in no flow from source to destination. (In any
network, the value of the max flow equals that of the min-cut. That is, if there exists a max flow f, there exists
a cut whose capacity equals the value of f.)

Figure 3.2: A flow network.

For the above graph, let the variables fsu , fsv , fuv , fut and fvt be the flows along the respective
edges. We also know that the flow cannot be more than the edge capacity. Also, since there is no storage at
the nodes, the incoming and outgoing flows at every node should match.
The LP problem is as follows:

Maximize fsu + fsv


Subject to
fsu = fuv + fut
fsv + fuv = fvt
0 ≤ fsu ≤ 10

0 ≤ fsv ≤ 5
0 ≤ fuv ≤ 15
0 ≤ fut ≤ 5
0 ≤ fvt ≤ 10
with non-negativity and real constraints.
Question: Write the c, A and b for the graph in Figure (b).
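A minimal sketch of this max-flow LP for the capacities above. Variables are ordered (f_su, f_sv, f_uv, f_ut, f_vt); linprog minimizes, so the objective is negated.

from scipy.optimize import linprog

c = [-1, -1, 0, 0, 0]              # maximize f_su + f_sv
A_eq = [[1, 0, -1, -1, 0],         # f_su = f_uv + f_ut (conservation at u)
        [0, 1,  1, 0, -1]]         # f_sv + f_uv = f_vt (conservation at v)
b_eq = [0, 0]
bounds = [(0, 10), (0, 5), (0, 15), (0, 5), (0, 10)]   # edge capacities

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(-res.fun, res.x)             # max-flow value and the edge flows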

Discussions In any network, the value of the max flow equals that of the min-cut: if there exists a max flow
f, there exists a cut whose capacity equals the value of f. Can we also formulate the min-cut problem as an
LP/IP? Which one? How?
Are there some relationships between these problems? Very often problems come in “pairs”; we call them
dual problems. More about duality in one of the later lectures.

Exercise Consider the LP


Minimize c1 x1 + c2 x2 + c3 x3
subject to
x1 + x2 ≥ 1
x1 + 2x2 ≤ 3
x1 ≥ 0; x2 ≥ 0; x3 ≥ 0
Give the optimal value and the optimal set for the following values of c (i) c = (−1, 0, 1) (ii) c = (0, 1, 0) (iii)
c = (0, 0, −1)

Exercise Formulate the following problem as an LP:

Minimize ∥Ax − b∥1 + ∥x∥∞

3.6 Reading

Read chapter 2 of [B6].

Chapter 4

Solving IP using Branch and Bound

• We had seen many LP and IP formulations in the past. We will see some more as we move
forward.
• We had seen how LP can be solved using graphical method.
• Here we will see how an IP can be solved.

4.1 How to solve an IP?

The advantage of IPs is that they are a very expressive language for formulating optimization problems, and
they can capture in a natural and direct way a large number of combinatorial optimization problems. The
disadvantage of IPs is that finding optimal solutions is NP-hard in general. (There are, in fact, lucky
IP problems that can be solved easily. They are a topic of interest for a future lecture.)
Let us now see how to solve some of the IP problems. We are interested in two methods at this stage.

• Branch and Bound


• LP Relaxation

If we relax the integer constraints of the IP and assume that the variables are real, we get an LP. We can
then solve the LP. However, how are the solutions related? Note that the optimum of the IP is inferior (or equal) to
the optimum of the LP: for a maximization problem IP∗ ≤ LP∗, and for a minimization problem IP∗ ≥ LP∗.
LP relaxation is the technique of relaxing the IP as an LP and solving it efficiently. We will come back to
this later.
We “assume” that we know how to solve LPs. We use the LP solver as a black box/sub-routine here. Note
that we know how to solve LPs using the graphical method for simpler problems, and we can use simplex solvers
for large LPs on a computer.

4.2 Branch and Bound for IP

This is a classical paradigm that can be used for solving many other hard problems. (refer to your text
books on Algorithms for details.)
The basic idea is the following. We divide a large problem into multiple smaller ones; this is the “branch”
part. The “bound” part is done by estimating how good a solution we can get for each smaller problem
(to do this, we may have to divide the problem further). The optimal value from a subproblem tells us
whether there is a need to divide it further or not, i.e., we do not have to keep dividing
until we get trivial/small problems.
We use the linear programming relaxation to estimate the bound on the optimal solution of the integer
programming.

4.2.1 General Branch and Bound Algorithm

Consider the problem statement:


M aximize cT x subject to Ax ≤ b

1. Initialize the list of problems (or constraints) as L = {Ax ≤ b}. The set of problems to be solved is
represented by the constraints here, since the objective does not change.
2. Initialize x− = ∅ , l = −∞. Here l is the best objective value seen so far in the algorithm and x− is
the corresponding best solution seen so far.
3. while L ̸= ∅

(a) Pick subproblem maximize cT x , A′ x ≤ b′ and solve the LP. Also delete the subproblem con-
straints from L.
(b) Let x∗ be the optimum solution when you solve the above LP problem.
(c) If x∗ ∈ Z n and cT x∗ > l. Then set x− = x∗ and l = cT x∗ .

Figure 4.1: Feasible region of the main problem

(d) If x∗ ∉ Z n and cT x∗ > l, then add two subproblems to the list L for some j with x∗j ∉ Z:
• Max cT x such that A′ x ≤ b′ and xj ≤ ⌊x∗j ⌋, and add to L.
• Max cT x such that A′ x ≤ b′ and xj ≥ ⌈x∗j ⌉, and add to L.

Question: Refer to steps 3(c) and 3(d): can’t cT x∗ ≤ l? What should we do in such cases?
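A minimal sketch of this algorithm in code, using linprog as the LP black box (maximization is done by minimizing −cT x; non-negativity is linprog’s default bound). The example call refers to Example 1 below.

import math
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A, b, tol=1e-6):
    best_x, best_val = None, -math.inf
    stack = [(np.array(A, float), np.array(b, float))]   # the list L
    while stack:
        A_cur, b_cur = stack.pop()
        res = linprog(-np.asarray(c, float), A_ub=A_cur, b_ub=b_cur)
        if not res.success or -res.fun <= best_val:
            continue                          # infeasible or bounded out
        x = res.x
        frac = [j for j in range(len(x)) if abs(x[j] - round(x[j])) > tol]
        if not frac:                          # integral: update incumbent
            best_x, best_val = np.round(x), -res.fun
            continue
        j = frac[0]                           # branch on a fractional x_j
        row = np.zeros(len(x)); row[j] = 1.0
        # subproblem with x_j <= floor(x_j*)
        stack.append((np.vstack([A_cur, row]),
                      np.append(b_cur, math.floor(x[j]))))
        # subproblem with x_j >= ceil(x_j*), i.e. -x_j <= -ceil(x_j*)
        stack.append((np.vstack([A_cur, -row]),
                      np.append(b_cur, -math.ceil(x[j]))))
    return best_x, best_val

# Example 1 below: max x1 + x2 s.t. x2 - x1 <= 2, 8x1 + 2x2 <= 19.
print(branch_and_bound([1, 1], [[-1, 1], [8, 2]], [2, 19]))  # (1, 3), 4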

4.3 Numerical Examples of Branch and Bound

Let us now solve some IPs numerically.

4.3.1 Example 1

Given the following problem, obtain an integer solution.


Maximize x1 + x2
subject to :
x2 − x1 ≤ 2
8x1 + 2x2 ≤ 19
x1 , x2 ≥ 0 x1 ∈ Z x2 ∈ Z

Solution:

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x2 − x1 = 2 and 8x1 + 2x2 = 19 with the x1 and x2 axes give the points
(−2,0), (0,2), (0,9.5) and (2.375,0).
• The intersection of the lines x2 − x1 = 2 and 8x1 + 2x2 = 19 with each other gives (1.5,3.5).
• The corners of the feasible region are (0,0), (0,2), (2.375,0) and (1.5,3.5).
• Out of these points, the one that gives the highest value of the objective function x1 + x2 is (1.5,3.5),
and the value is 5.

• However, (1.5,3.5) is not an integer solution. We branch on x1 and divide the problem into two
subproblems: one subproblem gets the added constraint x1 ≤ ⌊1.5⌋ = 1 and the other gets the
added constraint x1 ≥ ⌈1.5⌉ = 2.

Subproblem 1 :

Maximize x1 + x2
subject to:
x2 - x1 ≤ 2
8x1 + 2x2 ≤ 19
x1 ≤ 1
x1 , x2 ≥ 0 x1 ∈ Z x2 ∈ Z

Solution:

Figure 4.2: (a) Feasible region of the sub problem with the added constraint x1 ≤1. (b) Feasible region of
the sub problem with the added constraint x1 ≥2

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).

• The intersections of the lines x2 − x1 = 2 and x1 = 1 with the axes give the points (−2,0), (0,2)
and (1,0).

• The intersection of the lines x2 − x1 = 2 and x1 = 1 with each other gives (1,3).

• The corners of the feasible region are (0,0), (0,2), (1,0) and (1,3).

• Out of these points, the one that gives the highest value of the objective function x1 + x2 is (1,3),
and the value is 4. This is an integer solution.

• Since we have an integer solution, there is no need to branch this subproblem further.

We still need to explore the other subproblem.

Subproblem 2 :

Maximize x1 + x2
subject to:
x2 - x1 ≤ 2
8x1 + 2x2 ≤ 19
x1 ≥ 2
x1 , x2 ≥ 0 x1 ∈ Z;x2 ∈ Z

Solution:

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).

• The intersections of the lines x1 = 2 and 8x1 + 2x2 = 19 with the axes give the points (2,0),
(0,9.5) and (2.375,0).

• The intersection of the lines x1 = 2 and 8x1 + 2x2 = 19 with each other gives (2,1.5).

• The vertices of the feasible region are (2,0), (2.375,0) and (2,1.5).

• Out of these points, the one that gives the highest value of the objective function x1 + x2 is (2,1.5),
and the value is 3.5, which is less than the integer value 4 obtained above. Hence there is no need to branch
further, as we cannot get a solution better than 3.5. We stop here.

From the above two subproblems we see that the best integer solution obtained is (1,3) and the optimum z ∗
is 4.

4.3.2 Example 2

Obtain an Integer solution to the following problem:


Maximize 2x1 + x2
such that
x1 + x2 ≤ 8
2x1 ≤ 9
x1 , x2 ≥ 0 x1 ∈ Z; x2 ∈ Z

Solution:

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).

• Intersection of the lines 2x1 = 9 and x1 + x2 =8 with the x1 and x2 axis give the points (4.5,0),(0,8)
,(8,0)

• Intersection of the lines 2x1 = 9 and x1 + x2 =8 with each other gives (4.5,3.5).

• The vertices of the feasible region are (0,0), (4.5,0), (0,8) and (4.5,3.5).

• Out of these points, the one that gives the highest value of the objective function 2x1 + x2 is (4.5,3.5),
and the value is 12.5.

• However, (4.5,3.5) is not an integer solution. We branch on x1 and divide the problem into two
subproblems: one subproblem gets the added constraint x1 ≤ ⌊4.5⌋ = 4 and the other gets the
added constraint x1 ≥ ⌈4.5⌉ = 5.

Figure 4.3: (a) Feasible region of the main problem (b) Feasible region of the sub problem with the added
constraint x1 ≤4

Subproblem 1 :

Maximize 2x1 + x2
subject to
x1 + x2 ≤ 8
2x1 ≤ 9
x1 ≥ 5
x1 , x2 ≥ 0 x1 ∈Z x2 ∈Z

Solution:
There is no feasible region for the above system of constraints (x1 ≥ 5 contradicts 2x1 ≤ 9); hence there is
no solution and no need to branch further.

Subproblem 2 :

Maximize 2x1 + x2
subject to
x1 + x2 ≤ 8
2x1 ≤ 9
x1 ≤ 4
x1 , x 2 ≥ 0 x1 ∈ Z x 2 ∈ Z

Solution:

• We plot the graph by taking two axis x1 (x2 =0) and x2 (x1 =0) .

• Intersection of the lines 2x1 = 9 and x1 ≤ 4 and x1 + x2 =8 with the x1 and x2 axis give the points
(4.5,0),(0,8) ,(8,0),(4,0)

• Intersection of the lines 2x1 = 9 and x1 + x2 =8 with each other gives (4.5,3.5).

• Intersection of the lines x1 = 4 and x1 + x2 =8 with each other gives (4,4).

• The corners of the feasible region are (0,0), (4,0), (0,8) and (4,4).

• Out of these points, the one that gives the highest value of the objective function 2x1 + x2 is (4,4),
and the value is 12.

• As we have got an integer solution (4,4), which gives a value of 12, there is no need to branch further.
So the answer is (4,4) with z∗ = 12.

4.3.3 Example 3

Let us use branch and bound algorithm to :

Maximize 5x1 + 8x2


Such that
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1 , x2 ≥ 0 x1 ∈Z x2 ∈Z

Solution:

Figure 4.4: Feasible region of the main problem

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).

• The intersections of the lines x1 + x2 = 6 and 5x1 + 9x2 = 45 with the axes give the points (0,6),
(0,5), (6,0) and (9,0).

• Intersection of the lines x1 + x2 = 6 and 5x1 + 9x2 = 45 with each other gives (2.25,3.75).

• The corners of the feasible region are (0,0), (0,5), (6,0) and (2.25,3.75).

• Out of these points, the one that gives the highest value of the objective function 5x1 + 8x2 is
(2.25,3.75), and the value is 41.25; the other values are 30 at (6,0) and 40 at (0,5).

• But (2.25,3.75) is not an integer solution. We branch on x1 and divide the problem into two sub-
problems: one subproblem gets the added constraint x1 ≤ ⌊2.25⌋ = 2 and the other gets the
added constraint x1 ≥ ⌈2.25⌉ = 3.

Subproblem 1:

Maximize 5x1 + 8x2


Subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1 ≤ 2
x1 , x 2 ≥ 0 x1 ∈ Z x 2 ∈ Z

Solution:

Figure 4.5: Feasible region of the sub problem with the added constraint x1 ≤2 (b) Feasible region of the sub
problem with the added constraint x1 ≥3

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).

• The intersections of the lines x1 = 2 and 5x1 + 9x2 = 45 with the axes give the points (0,5), (2,0) and (9,0).

• The intersection of the lines x1 = 2 and 5x1 + 9x2 = 45 with each other gives (2, 35/9) ≈ (2, 3.89).

• The corners of the feasible region are (0,0), (0,5), (2,0) and (2, 3.89).

• Out of these points, the one that gives the highest value of the objective function 5x1 + 8x2 is
(2, 3.89), and the value is ≈ 41.1; the other values are 10 at (2,0) and 40 at (0,5).

• Let us see subproblem 2 now.

Subproblem 2:

Max 5x1 + 8x2


such that
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1 ≥ 3
x1 , x2 ≥ 0 x1 ∈Z x2 ∈Z

Solution:

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The intersections of the lines x1 = 3 and x1 + x2 = 6 with the axes give the points (6,0), (3,0) and (0,6).
• The intersection of the lines x1 = 3 and x1 + x2 = 6 with each other gives (3,3).
• The corners of the feasible region are (6,0), (3,0) and (3,3).
• Out of these points, the one that gives the highest value of the objective function 5x1 + 8x2 is (3,3),
and the value is 39.
• As we already have an integer solution with value 40 (at (0,5)), we don’t need to branch this subproblem
further.

We will now branch further into subproblem 1. We branch on x2 and divide subproblem 1 into
two subproblems: one gets the added constraint x2 ≤ ⌊35/9⌋ = 3 and the other
gets the added constraint x2 ≥ ⌈35/9⌉ = 4.

Subproblem 1(a) :

Maximize 5x1 + 8x2


subject to:
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1 ≤ 2
x2 ≤ 3
x1 , x2 ≥ 0 x1 ∈Z x2 ∈Z

Solution:

Figure 4.6: Feasible region of the sub problem with the added constraint x2 ≤3 (b) Feasible region of the sub
problem with the added constraint x2 ≥4

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).
• The lines x1 = 2, x2 = 3 and 5x1 + 9x2 = 45 intersect the axes at the points (0,5), (2,0) and (9,0).
• The intersection of the lines x1 = 2 and x2 = 3 gives (2,3).
• The corners of the feasible region are (0,0), (0,3), (2,0) and (2,3).
• Out of these points, the one that gives the highest value of the objective function 5x1 + 8x2 is (2,3),
and the value is 34; the other values are 10 at (2,0) and 24 at (0,3).
• As this is less than the best feasible value seen so far (40), there is no need to branch this subproblem further.

Subproblem 1(b) :

Maximize 5x1 + 8x2


Subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1 ≤ 2
x2 ≥ 4
x1 , x2 ≥ 0 x1 ∈Z x2 ∈Z

Solution:

• We plot the graph by taking the two axes x1 (x2 = 0) and x2 (x1 = 0).

• The lines x1 = 2, x2 = 4 and 5x1 + 9x2 = 45 intersect the axes at the points (0,4), (2,0), (9,0) and (0,5).

• The intersection of the lines x2 = 4 and 5x1 + 9x2 = 45 gives (1.8,4).

• The corners of the feasible region are (0,4), (0,5) and (1.8,4).

• Out of these points, the one that gives the highest value of the objective function 5x1 + 8x2 is (1.8,4),
and the value is 41.

• But (1.8,4) is not an integer solution. We branch on x1 and divide the problem into two subproblems:
one gets the added constraint x1 ≤ ⌊1.8⌋ = 1 and the other gets the added
constraint x1 ≥ ⌈1.8⌉ = 2.

Subproblem 1(b)(i) :

Maximize 5x1 + 8x2


Subject to
x1 + x2 ≤ 6
5x1 + 9x2 ≤ 45
x1 ≤ 2
x1 ≥ 2
x2 ≥ 4
x1 , x2 ≥ 0 x1 ∈ Z x2 ∈ Z

Solution:

• The constraints x1 ≤ 2 and x1 ≥ 2 force x1 = 2, but then 5x1 + 9x2 ≤ 45 gives x2 ≤ 35/9 < 4,
contradicting x2 ≥ 4. The feasible region is empty, so this subproblem is infeasible and there is no need to
branch further.

Subproblem 1(b)(ii) :

Max 5x1 + 8x2


such that
x1 + x2 ≤ 6

Figure 4.7: (a) Feasible region of the sub problem with the added constraint x1 ≥2.Only one point is in the
region (b) Feasible region of the sub problem with the added constraint x1 ≤1.

5x1 + 9x2 ≤ 45
x1 ≤ 2
x1 ≤ 1
x2 ≥ 4
x1 , x2 ≥ 0 x1 ∈Z x2 ∈Z

Solution:

• The corners of this region are (1,4), (0,4), (0,5) and (1, 40/9) ≈ (1, 4.44); the highest objective value is
≈ 40.56 at (1, 4.44). Since the objective 5x1 + 8x2 takes integer values at integer points, no integer
solution in this region can do better than 40, and we already have a solution of value 40. We can
therefore stop branching here.

From the above subproblems we see that the best integer solution obtained is (0,5), with
value 40.

Chapter 5

LP Relaxation

• We had seen some LP and IP formulations in the past. We will see some more as we move
forward.

• We know how to solve toy problems on pen and paper.


• LP relaxation is a powerful way to solve IPs.
• Beyond obtaining a solution, this line of thinking leads to many insights about the problem,
properties of the problem, guarantees of an approximate solution etc.

5.1 LP Relaxation

If we are interested in designing a polynomial-time algorithm (exact or approximate) for a combinatorial
optimization problem, formulating the combinatorial optimization problem as an IP is useful as a first step
in the following methodology (the discussion assumes that we are working with a minimization problem):

1. Formulate the combinatorial optimization problem as an IP.

2. Derive an LP from the IP by removing the constraint that the variables have to take integer values. The
resulting LP is called a “relaxation” of the original problem. Note that in the LP we are minimizing
the same objective function over a larger set of solutions, so opt(LP) ≤ opt(IP). In other words, z∗IP is
inferior to z∗LP.
3. Solve the LP optimally using an efficient algorithm for linear programming.

• If the optimal LP solution has integer values, then it is a solution for the IP of cost opt(LP) = opt(IP).
We have now found an optimal solution for the IP, and hence an optimal solution for our combinatorial
optimization problem.

• If the optimal LP solution x∗ has fractional values, but we have a rounding procedure that transforms
x∗ into an integral solution x′ such that cost(x′) ≤ c · cost(x∗) for some constant c, then we are able to
find a solution to the IP of cost ≤ c · opt(LP) ≤ c · opt(IP), and so we have a c-approximate algorithm
for our combinatorial optimization problem.

5.2 Bipartite Matching

Definition

When there are an equal number of nodes on each side of a bipartite graph, a perfect matching is an
assignment of nodes on the left to nodes on the right, in such a way that

1. each node is connected by an edge to the node it is assigned to, and


2. no two nodes on the left are assigned to the same node on the right

Formulation

Let G = (V, E) be a graph where V represents the set of vertices and E represents the set of edges. As G is
bipartite, let V be divided into two disjoint sets X and Y such that |X| = |Y| and edges are from X to Y
only. A perfect matching M ⊂ E is such that each vertex in X as well as Y appears exactly once in M.
Our problem of interest is to find the maximum weight matching, i.e., a perfect matching M for which the
sum of weights on the edges is maximum: Max Σ_{e∈M} w_e.
Note: This and many similar problems can be seen as the “selection” of a set of objects (like edges of a graph). It is
common to introduce a binary variable ∈ {0, 1} that defines whether an object is selected or not.
The IP formulation of the problem is as follows:
Let x_e ∈ {0, 1} say whether edge e is in M or not.

Figure 5.1: Bipartite Matching


Maximize Σ_{e∈E} w_e · x_e

subject to:

Σ_{e∈E: v∈e} x_e = 1 for all v ∈ V

x_e ∈ {0, 1} ∀e ∈ E

This is an IP. Let us relax x_e ∈ {0, 1} to x_e ∈ [0, 1] to obtain the LP relaxation:

Maximize Σ_{e∈E} w_e · x_e

subject to:

Σ_{e∈E: v∈e} x_e = 1 for all v ∈ V

x_e ∈ [0, 1] ∀e ∈ E

While solving LP, we conclude/observe:

1. Infeasibility in LP implies infeasible IP.

2. The LP provides an upper bound on the IP (note: maximization problem).

3. If LP itself gives an integer solution, a rare good problem.

Unfortunately, this LP need not give an integer solution. However, the situation is not bad.

Rounding

We make an argument below that helps us in designing a “loss-less” rounding scheme.


Assume one of the edges (a1 , b1 ) has 0 < x_e < 1. Then there exists another edge (b1 , a2 ) which also has
0 < x_e < 1. We can then get a chain a1 b1 a2 b2 . . . a1 of fractional edges, all with 0 < x_e < 1.
Note that the number of edges in this cycle, e1 , . . . , et , is even.
Consider now modifying the xe in this chain (cycle) such that

 ∗
 xe − ϵ for e ∈ {e1 , e3 , . . . , et−1 }
ye = x∗ + ϵ for e ∈ {e2 , e4 , . . . , et }
 e∗
xe otherwise


It is easy to see that Σ_{e: v∈e} y_e = 1 still holds at every vertex, so y is a valid/feasible solution for small ϵ.
Let us now look at the cost of the matching:

W(y) = Σ_e w_e y_e = W(x∗) + ϵ Σ_i (−1)^i w_{e_i} = W(x∗) + ϵΔ

Since x∗ is optimal, Δ = 0 (otherwise we would have found an ϵ such that W(y) > W(x∗); if Δ were
negative, we would have taken a negative ϵ).
What does this mean? This change in x_e does not change the matching cost. This helps in finding a
rounding scheme that does not change the cost.

1. Find the largest ϵ such that y is still feasible.

2. Then y will have fewer non-integer values. Repeat over the remaining chains/cycles until you get an integer solution.

3. We can repeat this until we get all binary values.

Result If the LP relaxation has a feasible solution, then it has an integer optimal solution, which we can obtain by an
appropriate rounding.
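A minimal sketch of the matching LP on a small assumed instance; on bipartite graphs the LP optimum here already comes out integral, consistent with the result above.

import numpy as np
from scipy.optimize import linprog

# Assumed edges (left u, right v, weight) on a 2x2 bipartite graph.
edges = [(0, 0, 4.0), (0, 1, 1.0), (1, 0, 2.0), (1, 1, 3.0)]

# One equality per vertex: the incident x_e must sum to 1.
A_eq = np.zeros((4, len(edges)))
for k, (u, v, _) in enumerate(edges):
    A_eq[u, k] = 1.0          # left vertex u
    A_eq[2 + v, k] = 1.0      # right vertex v
b_eq = np.ones(4)

c = [-w for (_, _, w) in edges]    # maximize total weight
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(edges))
print(res.x, -res.fun)             # integral x; best matching weight 7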

5.3 Minimum Vertex Cover

Let us now consider another problem that of finding a minimal vertex set that cover all the edges.

Definition Formally, a vertex cover of an undirected graph G = (V, E) is a subset V′ of V such that if
(u, v) is an edge of G then either u ∈ V′ or v ∈ V′ (or both). The set V′ is said to cover the edges of G.
A minimum vertex cover is a vertex cover of smallest possible size. Finding it is NP-hard.

IP formulation Here we need to select vertices. Let us define a new binary variable x_v to denote whether
a vertex is selected or not.

Minimize Σ_{v∈V} x_v
Such that x_u + x_v ≥ 1 ∀(u, v) ∈ E (5.1)
x_v ∈ {0, 1}

Indeed, we can now relax the integral constraints and create an LP.

LP Relaxation
Minimize Σ_{v∈V} x_v
Such that x_u + x_v ≥ 1 ∀(u, v) ∈ E (5.2)
0 ≤ x_v ≤ 1

Solution using LPR Solve the LP relaxation using an appropriate LP solver to obtain the optimal solution x∗.
This need not be an integral solution.
Now we create the following vertex cover S_LP from x∗:

S_LP = {v ∈ V | x∗_v ≥ 1/2} (5.3)

Analyzing the above equation carefully, we see that S_LP is a vertex cover. (Why? For every edge (u, v),
the constraint x∗_u + x∗_v ≥ 1 forces at least one of x∗_u , x∗_v to be ≥ 1/2.)
Now we want to check how good/bad this cover is compared with the minimum.
Let y be the optimal solution of the IP and let S_OPT be the corresponding vertex cover. We have

|S_LP | ≥ |S_OPT | (5.4)

Why? S_OPT is a minimum vertex cover, so no cover can be smaller. Remember also that the IP optimum is
inferior to the LP optimum; from the LP-IP relationship we conclude

Σ_{v∈V} x∗_v ≤ Σ_{v∈V} y_v (5.5)

Now we have:

|S_LP | = Σ_{v∈S_LP} 1 ≤ Σ_{v∈S_LP} 2 x∗_v ≤ Σ_{v∈V} 2 x∗_v ≤ 2 Σ_{v∈V} y_v = 2|S_OPT | (5.6)

So, we can conclude that :

|SOP T | ≤ |SLP | ≤ 2|SOP T | (5.7)
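A minimal sketch of this LP relaxation plus threshold rounding on an assumed small graph:

import numpy as np
from scipy.optimize import linprog

n = 5
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]   # assumed graph

# x_u + x_v >= 1 for each edge  ->  -x_u - x_v <= -1
A_ub = np.zeros((len(edges), n))
for k, (u, v) in enumerate(edges):
    A_ub[k, u] = A_ub[k, v] = -1.0
b_ub = -np.ones(len(edges))

res = linprog(np.ones(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)
cover = [v for v in range(n) if res.x[v] >= 0.5]   # round at 1/2
print(res.x, cover)    # |cover| is at most twice the minimum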

5.4 Facility Location Problem

A company wants to set up factories at some of a set of candidate locations, indexed by i, so that it can supply
material to all its customers, indexed by j. The cost of setting up the factory at location i is f_i . Let
x_i ∈ {0, 1} indicate whether a factory is set up at i or not. Therefore the cost of setting up the factories is
C_f = Σ_i f_i x_i .

Now the goods need to be transferred from the factories (that are set up) to the customers. Let c(i, j) be the
cost of transportation from factory i to customer j. Let y_ij ∈ {0, 1} indicate whether customer j is assigned
to factory i. The total transportation cost is C_r = Σ_ij y_ij c(i, j). The final problem is now:

minimize Σ_i f_i x_i + Σ_ij c(i, j) y_ij

subject to

Σ_i y_ij ≥ 1 ∀j

x_i ≥ y_ij ∀i, j

x_i ∈ {0, 1} , y_ij ∈ {0, 1}

The first constraint says that each customer should be assigned to at least one facility. The second says
that if a customer is assigned to a facility, then that facility must be open. This is an IP. We relax the last
constraints to x_i ∈ [0, 1] and y_ij ∈ [0, 1]. This leads to an LP.

Question: Suggest an appropriate rounding scheme for the above problem. Also can you come up with a
bound on the approximation errors?
Hint:

http://pages.cs.wisc.edu/~shuchi/courses/787-F09/scribe-notes/lec10.pdf

5.5 Maximum Independent Set

Let us look at another vertex selection problem.

Definition Independent set: Let G = (V, E) be a graph. A set S ⊆ V such that no two vertices in S are
connected by an edge of G is called an independent set. Our goal is to maximize the number of elements in
S.

IP formulation
Maximize Σ_{v∈V} x_v
Such that x_u + x_v ≤ 1 where (u, v) ∈ E (5.8)
x_v ∈ {0, 1}

LP Relaxation
Maximize Σ_{v∈V} x_v
Such that x_u + x_v ≤ 1 where (u, v) ∈ E (5.9)
0 ≤ x_v ≤ 1

Analysis We can see that if we set every x_v = 1/2, then all the above constraints are satisfied, and the
objective is |V|/2. Therefore the optimal LP value is |V|/2 or larger.

So the LP has a feasible solution, and it can be clearly seen that

LP∗ ≥ |V|/2 (5.10)

Consider a fully connected graph. The maximum independent set is a single node. However, LP∗ is n/2.
This does not allow us to come up with a guarantee on the solution.
As we can see, unlike the minimum vertex cover problem, there are no good bounds here. So LP
relaxation tells us nothing about Maximum Independent Set.

5.6 Reading

We strongly urge you to read chapter 3 of [B6].

Chapter 6

More on IP Formulations

• We have seen how many of the classical problems (e.g., graph algorithms) get formulated as an IP.
• We have seen how to solve an IP by (i) branch and bound (ii) LP relaxation.
• We have also seen how approximate algorithms can be designed using LP relaxation.

• Here, we will see how many problems will become a natural IP problem.

6.1 BIP and MIP Formulations

We had seen how many problems get formulated as LP, and some others as IP. We now see two special
classes of problems.

• BIP: Binary integer programming problems. In this class the decision variables are binary i.e., xi ∈
{0, 1}

• MIP: Mixed integer programming problems. In this class, some of the decision variables are real and
some others integer.

In binary problems, each variable can only take on the value of 0 or 1. This may represent the selection or
rejection of an option, the turning on or off of switches, a yes/no answer, or many other situations.
A mixed integer programming (MIP) problem results when some of the variables in your model are real
valued (can take on fractional values) and some of the variables are integer valued. The model is therefore
“mixed”. When the objective function and constraints are all linear in form, then it is a mixed integer linear
program (MILP). In common parlance, MIP is often taken to mean MILP, though mixed integer nonlinear
programs (MINLP) also occur, and are much harder to solve.

6.1.1 Example Formulations

In the next section, we will see a variety of problems getting modelled as IP. Mostly BIP.

6.2 Function of K Discrete Variables

Consider a problem where x1 + x2 + x3 will have to be 5 or 10 or 20 depending on something else (say what
mode of transportation or what is the cost of raw material). We encode this as

x1 + x2 + x3 = 5y1 + 10y2 + 20y3

with yi as binary variables and only one of the yi can be 1 at a time. This last constraint is added as.

y1 + y2 + y3 = 1

6.3 Either-OR constraints

This class of constraints arises when we need to enforce only one of several constraints. Assume we had a
constraint of the form |x1 | ≥ 3. This results in two disconnected sets (a non-convex region/set). It only
means that either x1 ≥ 3 or x1 ≤ −3. If we convert both to ≤ form:

x1 ≤ −3

−x1 ≤ −3
Both cannot be true at the same time. How do we do this? The idea is simple: when a constraint is
getting violated, we make it true “trivially”. Let L be a large quantity. (For the sake of argument, it can be
the maximum value your integer/real variable can store, but in practice it is a value chosen to be manageable
numerically.) We can write these constraints as either

x1 ≤ −3

−x1 ≤ −3 + L

or
x1 ≤ −3 + L

−x1 ≤ −3

Note that the addition of the large quantity makes a constraint trivially true. In the first set, only the first
constraint is active, while in the second, only the second constraint is active. However, we do not know in
advance which constraint needs to be made true. For this purpose, we add an additional binary variable y, i.e.,

x1 ≤ −3 + y · L

−x1 ≤ −3 + (1 − y) · L

y ∈ {0, 1}
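A minimal sketch of this trick on a toy instance: minimize x1 over x1 ∈ [0, 10] subject to |x1| ≥ 3 (this assumes SciPy ≥ 1.9 for the integrality flag of linprog):

from scipy.optimize import linprog

L = 1000.0                         # the "large" constant
c = [1.0, 0.0]                     # variables (x1, y); minimize x1
#  x1 <= -3 + L*y        ->   x1 - L*y <= -3
# -x1 <= -3 + L*(1 - y)  ->  -x1 + L*y <= -3 + L
A_ub = [[ 1.0, -L],
        [-1.0,  L]]
b_ub = [-3.0, -3.0 + L]

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 10), (0, 1)], integrality=[0, 1])
print(res.x)    # x1 = 3, y = 1: the solver picks the x1 >= 3 branch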

6.4 K out of N constraints

An extension of the same idea to requiring K out of N constraints to be true is as follows:

f1 (x) ≤ b1 + y1 L

f2 (x) ≤ b2 + y2 L

. . .

fN (x) ≤ bN + yN L

Σ_i y_i = N − K

This final constraint can be interpreted as follows. Since we want K of the N constraints to hold, there
must be N − K constraints that need not hold. This constraint ensures that N − K of the binary variables
take the value 1, so that the associated L terms are turned on, thereby eliminating those constraints.

6.5 Modelling Compound Alternatives

Consider the problem of allowing three disjoint regions in the constraints as shown in the figure.
This can be done using the ideas discussed above.
Region 1:
f1 (x1 , x2 ) − Ly1 ≤ b1
f2 (x1 , x2 ) − Ly1 ≤ b2
Region 2:
f3 (x1 , x2 ) − Ly2 ≤ b3
f4 (x1 , x2 ) − Ly2 ≤ b4
Region 3:
f5 (x1 , x2 ) − Ly3 ≤ b5
f6 (x1 , x2 ) − Ly3 ≤ b6
f7 (x1 , x2 ) − Ly3 ≤ b7
with additional constraints such as
y1 + y2 + y3 ≤ 2
x1 ≥ 0, x2 ≥ 0
yi ∈ {0, 1}

6.5.1 Problems with a fixed cost and variable cost

Most of the problems that we have seen involve only a variable cost (the objective is a linear function): if xi
increases, our objective ci xi also increases.
Consider a class of problems where there is a fixed cost for a variable. For example, when you sign
an agreement, you pay an initial amount and then a usage-based cost for x1 (say, the bandwidth). The
initial cost may be associated with setting up a specific hardware unit or cabling so that internet
connectivity can be provided.
Therefore the objective is 0 if x1 = 0, and K + c1 x1 if x1 > 0. Here K is the fixed charge.
This can be modelled as
Minimize Ky + c1 x1 + (rest of the terms)
subject to:
x1 − Ly ≤ 0
(other constraints)
y ∈ {0, 1}

6.6 Modelling Piecewise Linear Cost

Consider the cost of production of a certain material. It can be c1 per unit until the production reaches 5 units;
then the cost could be different, say c2 per unit, until the production reaches 12; and then c3 per unit until 20,
etc. Note that when the production is 7, the cost is 5c1 + 2c2 . Here the objective is piecewise linear.
Let d1 , d2 and d3 be the amounts of production in the three cost ranges. When d2 is nonzero, d1 is 5. When
d3 is nonzero, d1 is 5 and d2 is 7.
x = d1 + d2 + d3

Cost is c1 d1 + c2 d2 + c3 d3 . By adding two additional binary variables, the constraints are:

5w1 ≤ d1 ≤ 5

7w2 ≤ d2 ≤ 7w1
0 ≤ d3 ≤ 8w2

When w1 = 1, d1 is at the upper bound i.e., 5. When w2 = 1, d2 is at the upper bound i.e., 7.

6.7 Solving BIP using Balas’ Algorithm

We had seen how branch and bound can be used to solve IPs. Let us see another algorithm specially designed
to solve BIPs: Balas’ additive algorithm. It requires the problem to be in a standard form.


• The objective function has the form: Minimize Z = Σ_i c_i x_i with the x_i binary variables.

• The m constraints are of the form Σ_j a_ij x_j ≥ b_i for i = 1, . . . , m.
• The c_i are non-negative and sorted: 0 ≤ c1 ≤ c2 ≤ . . . ≤ cn .

This may seem restrictive at first look, but it is not that bad. For example, negative coefficients are
handled by changing xi to 1 − xi .
The objective is a minimization, the ci are non-negative, and xi ∈ {0, 1}. Therefore, if x = 0 (all xi = 0) is
feasible, it is the minimum.
If there are N variables, there are 2^N possible configurations of x. Balas’ algorithm uses a
depth-first search approach.
It starts expanding the tree with x1 , then moves to x2 , etc. At each step, it evaluates the
objective (or computes a bound) and checks whether any of the constraints are violated. When we
find a feasible point, we stop expanding that subtree. Also, the bound that we obtain for certain nodes
tells us whether to expand a node further or not.

6.7.1 Example Problem

Minimize z = 3x1 + 5x2 + 6x3 + 9x4 + 10x5 + 10x6


subject to:
−2x1 + 6x2 − 3x3 + 4x4 + x5 − 2x6 ≥ 2
−5x1 − 3x2 + x3 + 3x4 − 2x5 + x6 ≥ −2
5x1 − x2 + 4x3 − 2x4 + 2x5 − x6 ≥ 3
xi ∈ {0, 1}

Chapter 7

More on LP Relaxation

• We have seen how many of the classical problems (e.g., graph algorithms) get formulated as an IP.
• We have seen how to solve an IP by (i) branch and bound, (ii) LP relaxation, (iii) Balas’ algorithm.
• We have also seen how approximate algorithms can be designed using LP relaxation. We have here one
more example.

7.1 Scheduling for Unrelated Parallel Machines

Scheduling is the allocation of shared resources over time to competing activities. It has been the subject of a
significant amount of literature in the operations research field. Emphasis has been on investigating machine
scheduling problems, where jobs represent activities and machines represent resources. Each machine can
process at most one job at a time.
In this lecture we consider the Makespan Scheduling problem. We are interested in minimizing the time
required to complete all the jobs. That is, the time for the last machine to complete the last job should be
minimized. There are n jobs, indexed by the set J = {1, . . . , n}, and m machines for scheduling, indexed by
the set M = {1, . . . , m}. Also given, d_ij is the time job j takes to run on machine i. Then t_i = Σ_{j∈J} x_ij d_ij
is the completion time of machine i, where x_ij is 1 when job j is assigned to machine i, and 0 otherwise. The
task is to assign jobs to machines so that t = max_i (t_i ), i.e., the maximum machine completion time, also
called the makespan of the schedule, is minimized. Note that the order in which the jobs are processed on
a particular machine does not matter.
There are many variants of the problem. Some of them are:
1. Minimum makespan scheduling on identical machines: d_ij = d_j ∀i.

2. Minimum makespan scheduling on unrelated machines: each machine can take a different time to do a job.
This is the problem of interest to us.

3. Non-splittable jobs: x_ij ∈ {0, 1}; and splittable jobs: x_ij ∈ [0, 1].

4. Jobs have precedence constraints.

Minimum makespan scheduling for splittable jobs can be solved exactly in polynomial time (the problem is in P
and can be formulated as an LP). Here we discuss only minimum makespan scheduling on unrelated machines
for non-splittable jobs (this problem is NP-hard). For identical machines, you can use the references.

7.2 Minimum makespan scheduling on unrelated machines

Formulating the problem as an Integer Program:

minimize t
subject to

Σ_{i=1}^{m} x_ij = 1 ∀j ∈ J

Σ_{j=1}^{n} x_ij d_ij ≤ t ∀i ∈ M

x_ij ∈ {0, 1}

• Number of variables: nm + 1

• Number of constraints: n + m

Some Observations on LP Let us make some observations on LPs. We can come back to these in detail
at a later stage.

• A problem of the form Maximize cT x subject to {Ax ≤ b; x ≥ 0} can be converted to Maximize
cT x subject to {Ax = b; x ≥ 0} by adding slack variables: a non-negative slack variable is added to each
inequality to convert it to an equality.

• Consider a system of equations Ax = b where A is m × n. If m ≤ n, there could be multiple solutions.
If x1 is a solution and x2 is a solution, then αx1 + (1 − α)x2 is also a valid solution.
• Let B be an m × m submatrix of A formed by selecting m columns. Let the x vector restricted to the
selected columns be xB . Then BxB = b. Assume we solve this; xB may be a vector with all non-zero
elements. Consider the original vector with elements from xB and all other elements equal to zero:
xi = 0 if i ∉ B.
• This vector is a basic feasible solution (a vertex of the convex polytope of our interest). At least one of
the basic feasible solutions is optimal.
• In such an x, at least n − m variables/elements are zero.

Note: We know the importance of solving Ax = b. This will be the focus of the next few lectures.

7.2.1 LP-relaxation algorithm for 2 machines

Using LP Relaxation:
• There are two machines, m1 and m2 .
• There are 2n + 1 variables and n + 2 constraints.
• How many zeros are there in {x_ij | 1 ≤ j ≤ n, i = 1 or 2}?
Answer: at an extreme point, at least (2n + 1) − (n + 2) = n − 1 zeros, and hence at most n + 2 non-zeros.

We know t > 0; therefore the number of non-zeros in {x_ij | 1 ≤ j ≤ n, i = 1 or 2} is at most n + 1. This means
at most one job needs to be split across the 2 machines, with x_1j + x_2j = 1. Let s be the job that gets split.
We assign s to m1 if x_1s > x_2s ; otherwise, we assign it to m2 (Figure 7.2 shows an example). If T∗ is the IP
optimal makespan and T_approx the makespan after this rounding, then T_approx ≤ 2T∗ .
Theorem 7.2.1. The LP relaxation used here gives a 2-approximation for makespan scheduling on 2 machines.
Proof. Let T∗ be the IP optimal makespan, T_approx the makespan after rounding, T_s the processing time
of the split job s, and T′ the makespan before assigning s. Then

T_approx ≤ T′ + T_s ≤ T∗ + T∗ = 2T∗,

since T′ ≤ T∗ and T_s ≤ T∗ .
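A minimal sketch of the 2-machine LP relaxation plus the rounding above (processing times d are assumed toy data):

import numpy as np
from scipy.optimize import linprog

d = np.array([[4.0, 2.0, 3.0, 6.0],     # assumed d[i][j]: job j on machine i
              [3.0, 5.0, 4.0, 2.0]])
m, n = d.shape                          # m = 2

# Variables: (x_11..x_1n, x_21..x_2n, t); minimize t.
c = np.zeros(2 * n + 1); c[-1] = 1.0

# Assignment: x_1j + x_2j = 1 for each job j.
A_eq = np.hstack([np.eye(n), np.eye(n), np.zeros((n, 1))])
b_eq = np.ones(n)

# Load: sum_j x_ij d_ij - t <= 0 for each machine i.
A_ub = np.zeros((m, 2 * n + 1))
A_ub[0, :n] = d[0]; A_ub[1, n:2 * n] = d[1]; A_ub[:, -1] = -1.0
b_ub = np.zeros(m)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * (2 * n) + [(0, None)])
x = res.x[:2 * n].reshape(2, n)
assign = np.argmax(x, axis=0)   # send the split job to its larger share
print(res.fun, assign)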

7.2.2 LP-relaxation algorithm for minimum makespan scheduling on unrelated


machines

Here we consider makespan scheduling on unrelated machines for non-splittable jobs, which means that job
j takes time d_ij if scheduled on machine i. First, we define an IP to solve the problem. The algorithm is
based on a suitable LP formulation and a procedure for rounding the LP. The LP formulation for minimum
makespan scheduling on unrelated machines can be written as

minimize t
subject to

Σ_{i=1}^{m} x_ij = 1 ∀j

Σ_{j=1}^{n} x_ij d_ij ≤ t ∀i

x_ij ∈ {0, 1}

If we relax the constraints x_ij ∈ {0, 1}, it turns out that this formulation has an unbounded integrality gap.
The main cause of the problem is an unfair advantage of the LP relaxation.

Example Suppose we have only one job, which has a processing time of m on each of the m machines.
Clearly, the minimum makespan is m. However, the optimal solution to the linear relaxation is to schedule
the job to the extent of 1/m on each machine, thereby leading to an objective function value of 1, and giving
an integrality gap of m.

If d_ij > t for a given t, then we must have x_ij = 0 in any integer solution with makespan at most t, due to
the constraint Σ_{j=1}^{n} x_ij d_ij ≤ t ∀i. But we might have fractional x_ij > 0 in feasible fractional solutions
of the LP-relaxed problem. However, we cannot formulate the statement “if d_ij > t then x_ij = 0” in terms of
linear constraints. The question that arises here is, therefore, how to correctly choose t?

Parametric Pruning

We will make use of a technique called parametric pruning to overcome this difficulty. We “guess” a parameter t
which is a lower bound for the actual makespan T∗. One way to obtain a suitable value is to do binary search on t.
Note that once t is fixed, we can enforce the constraints x_ij = 0 for all machine-job pairs (i, j) for which d_ij > t,
simply by dropping those variables. We now define a family lp(t) of linear programs, one for each value of the
parameter t. lp(t) uses only those variables x_ij for which (i, j) ∈ S_t , where S_t = {(i, j) : d_ij < t}, and asks
if there is a feasible solution. Remember that we have relaxed the constraints on the variables x_ij to x_ij ≥ 0.
With t fixed, we define lp(t) as

minimize 0 (LP(t))

subject to

Σ_{i: (i,j)∈S_t} x_ij = 1 ∀j

Σ_{j: (i,j)∈S_t} x_ij d_ij ≤ t ∀i

x_ij ≥ 0 ∀(i, j) ∈ S_t

Let T be the minimum value of t (found by binary search) for which LP(t) has a feasible solution. Let
T∗ be the IP optimal makespan. Then certainly T∗ ≥ T; that is, the actual makespan is bounded below by T. But we
still don’t have an IP-feasible solution! Our solution is obtained by rounding an extreme point solution of LP(T). We
will later see that the makespan thus obtained from rounding is actually at most 2T∗ .

LP-rounding

Clearly, an extreme point solution to LP(t) has at most n + m non-zero variables. Also, it is easy to prove that
any extreme point solution to LP(t) must set at least n − m jobs integrally, i.e., with x_ij ∈ {0, 1}.

The LP-rounding algorithm is based on several interesting properties of extreme point solutions of LP(T). For any
extreme point solution x of LP(T), define a bipartite graph G = (M ∪ J, E) such that (i, j) ∈ E if and only if x_ij > 0.
Let F ⊂ J be the set of fractionally set jobs in x. Let H be the subgraph of G induced by the vertex set M ∪ F.
Clearly, (i, j) ∈ E(H) iff 0 < x_ij < 1. A matching in H is called a perfect matching if it matches every job j ∈ F.

Each job that is integrally set in x has degree 1 and exactly one edge incident on it in G (Figure 1(a)). Remove these
jobs together with their incident edges from G. The resulting graph is clearly H (Figure 1(b)). In H, each job has a
degree of at least two, so all leaves in H must be machines. Keep matching a leaf with the job it is incident to and
remove them both from the graph (Figure 1(c)). At each stage all leaves must be machines. In the end we will be
left with even cycles; match alternate edges of each cycle. This gives a perfect matching P (Figure 1(d)). Refer to
Figure 2 for an example.

Figure 7.1: Steps in LP relaxation for minimum makespan scheduling on unrelated machines. M is the set
of machines, J is the set of jobs and edge (Ji , Mk ) means job Ji has been scheduled on machine Mk . All nodes
in J with degree 1 have been integrally set and all nodes with degree at least two have been fractionally set.
(a) G = (M ∪ J, E). Jobs J1 and Jn are integrally set. (b) H = (M ∪ F, E ′ ) contains only fractionally set
jobs. (c) Assign a machine to the job that has an edge to a leaf machine node and remove them from the graph
(assign Mn to J4 , remove them and all edges incident on J4 ). (d) Match alternate edges of each cycle (M2 to
J2 and M3 to J3 ).

Figure 7.2: An example showing the steps of LP rounding for minimum makespan scheduling on unrelated machines. M is the set of machines, J is the set of jobs, and edge (Ji, Mk) means job Ji has been scheduled on machine Mk. (a) G = (M ∪ J, E). J1 is integrally set. (b) H = (M ∪ F, E′) contains only fractionally set jobs, obtained by removing J1 and edge x11 from G. (c) Assign a machine to the job that has an edge to a leaf machine node and remove them from the graph (assign M1 to J4, removing J4 and all edges incident on it). (d) Match alternating edges of each cycle (M2 to J2 and M3 to J3).

Theorem 7.2.2. The LP rounding used here is a 2-approximation for minimum makespan scheduling on unrelated machines.

Proof. Let T be the LP optimal value and T∗ the IP optimal value. Then clearly T ≤ T∗, since we chose T as the smallest t for which LP(t) has a feasible solution. The extreme point solution x of LP(T) has a fractional makespan of at most T; therefore, the integrally set jobs contribute a makespan of at most T on each machine. Each edge (i, j) of the graph H satisfies dij ≤ T, and the perfect matching found in H schedules at most one extra job on each machine. Hence, the total makespan is at most 2T ≤ 2T∗.

References: V. Vazirani, Approximation Algorithms, Chapters 10 and 17.

Chapter 8

Solving Ax = b

8.1 Introduction
In this lecture, let us consider solving a system of linear equations of the form

    Ax = b        (8.1)

Let us make the following assumptions on the structure of these matrices:

• A is a square, non-singular matrix


• b is not a zero vector.

Since A is non-singular, the solution is x = A⁻¹b, and since b ̸= 0, the trivial solution x = 0 is excluded. Computing A⁻¹ explicitly is a costly operation, and it also raises numerical accuracy and stability issues in many situations. If the structure of A is known and it has some special properties, we can use this knowledge to solve Ax = b much more efficiently. This lecture looks at solving the problem when A is an identity matrix, a permutation matrix, a triangular matrix, and a positive definite matrix. As the constraints on the matrix A ease, the complexity of solving the system increases.

In order to compute the complexity of these operations, we are not interested in the big-O complexity. Instead, we
try to calculate the number of flops or floating point operations needed for each method. We define a flop as one
addition, subtraction, multiplication or division of two floating-point numbers. To evaluate the complexity of an
algorithm, we count the total number of flops, express it as a function (usually a polynomial) of the dimensions of
the matrices and vectors involved, and simplify the expression by ignoring all terms except the leading (i.e., highest
order or dominant) terms.

Let us consider different structures of A so that Ax = b can be solved efficiently.

8.1.1 A is an Identity matrix

If A is Identity, then A−1 = A = I. Therefore,

x = A−1 b =⇒ x = b

This does not require any computation. Therefore, Flop Count = 0. A trivial problem to solve.

8.1.2 A is a Permutation Matrix

A Permutation matrix P is a square binary matrix that has exactly one entry of 1 in each row and each column
and 0s elsewhere. Each such matrix represents a specific permutation of n elements and, when used to multiply
another matrix, can produce that permutation in the rows or columns of the other matrix.

Example:

    [ 1 0 0 ]
    [ 0 0 1 ]
    [ 0 1 0 ]

is a permutation matrix. On multiplying this matrix with (2, 3, 4)ᵀ, we get (2, 4, 3)ᵀ.

Result: The inverse of a permutation matrix is its transpose, i.e., P⁻¹ = Pᵀ. Hence, x is nothing but a permutation of the entries of b. In this case also, Flop Count = 0.

8.1.3 A is a Diagonal Matrix

A diagonal matrix is a matrix in which the entries outside the main diagonal are all zero. The diagonal entries
themselves may or may not be zero. Thus, the matrix A = (ai,j ) with n rows and n columns is diagonal if

aij = 0 if i ̸= j ∀i, j ∈ {1, 2, ..., n}

For example, the following matrix is diagonal:

    [ 1 0 0 ]
    [ 0 4 0 ]
    [ 0 0 2 ]
Here, A being non-singular implies that none of the diagonal elements is zero. Therefore, the equation Ax = b can be written as a set of equations
a11 x1 = b1 ; a22 x2 = b2 ; . . . ; ann xn = bn
Hence, we can directly compute x1, x2, . . . , xn as xi = bi/aii. There are n divisions, and therefore Flop Count = n.
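As a quick illustration, here is a minimal NumPy sketch of the diagonal solve (the numbers are illustrative):

```python
import numpy as np

A = np.diag([1.0, 4.0, 2.0])   # the diagonal matrix from the example above
b = np.array([2.0, 8.0, 6.0])  # an arbitrary right-hand side

x = b / np.diag(A)             # n divisions: x_i = b_i / a_ii
print(x)                       # [2. 2. 3.]
```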

8.2 A is a Triangular Matrix


• A matrix A is lower triangular if the elements above the diagonal are zero, i.e., aij = 0 for i < j. The matrix

        [ −2  0 0 0 ]
    P = [  3 −6 0 0 ]        (8.2)
        [  1  3 5 0 ]
        [  4  4 2 7 ]

  is an example of a 4 × 4 lower triangular matrix.
• A matrix A is upper triangular if the elements below the diagonal are zero: aij = 0 for i > j.
• A matrix is diagonal if the off-diagonal elements are zero: aij = 0 for i ̸= j. A diagonal matrix is both upper
triangular and lower triangular.
• Nonsingularity of the triangular matrices implies nonzero diagonal elements.

8.2.1 Forward Substitution

Suppose A is a lower triangular matrix of order n with nonzero diagonal elements. Consider a system of equations
Ax = b:
    
    [ a11  0   . . .  0   ] [ x1 ]   [ b1 ]
    [ a21  a22 . . .  0   ] [ x2 ]   [ b2 ]
    [ ...                 ] [ .. ] = [ .. ]        (8.3)
    [ an1  an2 . . .  ann ] [ xn ]   [ bn ]

We solve for x as:

    x1 = b1/a11
    x2 = (b2 − a21x1)/a22
    x3 = (b3 − a31x1 − a32x2)/a33
    ...
    xn = (bn − an1x1 − an2x2 − . . . − a_{n,n−1}x_{n−1})/ann

Flop Count = 1 + 3 + 5 + . . . + (2n − 1) = n²

Recursive Formulation: If A is lower triangular, it can be represented in block form as

    [ a11  0   ]
    [ A21  A22 ]

where

• a11 is 1 × 1
• 0 is 1 × (n − 1)
• A21 is (n − 1) × 1
• A22 is a lower triangular matrix of size (n − 1) × (n − 1)

Now, the forward substitution algorithm can be written recursively using this representation:

    [ a11  0   ] [ x1 ]   [ b1 ]
    [ A21  A22 ] [ X2 ] = [ B2 ]        (8.4)

Algorithm: Forward Substitution

1. x1 = b1/a11
2. Solve A22X2 = B2 − A21x1 by Forward Substitution

Similarly, an upper triangular system can be represented as

    [ A11  A12 ] [ X1 ]   [ B1 ]
    [ 0    ann ] [ xn ] = [ bn ]        (8.5)

and solved recursively with a backward substitution algorithm as:

Algorithm: Backward Substitution

1. xn = bn/ann
2. Solve A11X1 = B1 − A12xn by Backward Substitution

Flop Count = n2
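As a sketch, both substitution routines can be transcribed directly into NumPy (function names are illustrative; each loop iteration performs the 2i + 1 flops counted above):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lx = b for lower triangular L with nonzero diagonal."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def backward_substitution(U, b):
    """Solve Ux = b for upper triangular U with nonzero diagonal."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x
```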

8.3 Cholesky Decomposition of Positive Definite (PD) Matrix

8.3.1 PD Matrix

A matrix A is said to be positive definite if it satisfies:

1. The matrix is symmetric (a symmetric matrix is a square matrix that is equal to its transpose), and
2. xᵀAx > 0 for all x ̸= 0.

A matrix is positive semi-definite (PSD) if the second condition is relaxed to xᵀAx ≥ 0 ∀x.

Example 1. The identity matrix I = [ 1 0; 0 1 ] is positive definite because, for every real vector z = (a, b)ᵀ ̸= 0, zᵀIz = zᵀz = a² + b², which is positive.

8.3.2 Cholesky Decomposition

Every positive definite matrix A can be factored as

A = LLT (8.6)

where L is lower triangular with positive diagonal elements. This is called the Cholesky factorization of A. If n = 1, i.e., A is a scalar, then the Cholesky factor of A is just the square root of A.

Example 2. An example of a 3 × 3 Cholesky factorization is

    [ 25 15 −5 ]   [  5 0 0 ]   [ 5 3 −1 ]
    [ 15 18  0 ] = [  3 3 0 ] × [ 0 3  1 ]        (8.7)
    [ −5  0 11 ]   [ −1 1 3 ]   [ 0 0  3 ]

Example 3. Another example of a 3 × 3 Cholesky factorization is

    [ 1  2  3 ]   [ 1 0 0 ] [ 1 2 3 ]
    [ 2 20 26 ] = [ 2 4 0 ] [ 0 4 5 ]
    [ 3 26 70 ]   [ 3 5 6 ] [ 0 0 6 ]

The Cholesky factorization takes (1/3)n³ flops.

8.4 Algorithm for Cholesky Factorization


Now, let

    A = [ a11  A21ᵀ ],   L = [ l11  0   ]
        [ A21  A22  ]        [ L21  L22 ]

and A = LLᵀ.

Question: What are the dimensions of these matrices?

We can thus write

    [ a11  A21ᵀ ]   [ l11  0   ] [ l11  L21ᵀ ]
    [ A21  A22  ] = [ L21  L22 ] [ 0ᵀ   L22ᵀ ]

Now, we can form the following set of equations from the above:

• a11 = l11 · l11 ⟹ l11 = √a11
• A21 = L21 · l11 ⟹ L21 = A21/l11
• A22 = L21L21ᵀ + L22L22ᵀ ⟹ A22 − L21L21ᵀ = L22L22ᵀ

Algorithm: Cholesky

1. Calculate l11 = √a11
2. Calculate L21 = A21/l11
3. Use Cholesky recursively to compute L22 from A22 − L21L21ᵀ = L22L22ᵀ

The cost of this algorithm is (1/3)n3 flops.

Question: Verify this.
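A direct NumPy transcription of this recursion might look as follows (a sketch, not a production routine; it assumes A is symmetric positive definite):

```python
import numpy as np

def cholesky(A):
    """Recursive Cholesky factorization A = L L^T, L lower triangular."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    L[0, 0] = np.sqrt(A[0, 0])           # l11 = sqrt(a11)
    if n == 1:
        return L
    L[1:, 0] = A[1:, 0] / L[0, 0]        # L21 = A21 / l11
    # Recurse on the Schur complement A22 - L21 L21^T
    L[1:, 1:] = cholesky(A[1:, 1:] - np.outer(L[1:, 0], L[1:, 0]))
    return L

A = np.array([[1, 2, 3], [2, 20, 26], [3, 26, 70]], dtype=float)
L = cholesky(A)
print(L)                         # matches Example 3: [[1,0,0],[2,4,0],[3,5,6]]
print(np.allclose(L @ L.T, A))   # True
```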

Let us look at an example for computing Cholesky factorization.


    
Example 4. Consider

    A = [ 1  2  3 ]
        [ 2 20 26 ]
        [ 3 26 70 ]

This can be factorized as

    [ l11  0    0   ] [ l11  l21  l31 ]
    [ l21  l22  0   ] [ 0    l22  l32 ]
    [ l31  l32  l33 ] [ 0    0    l33 ]

• l11 = √a11 = 1
• L21 = A21/l11 = (2, 3)ᵀ. Now,

    L = [ 1  0    0   ]
        [ 2  l22  0   ]
        [ 3  l32  l33 ]

• We have to do a Cholesky factorization of

    A22 − L21L21ᵀ = [ 16 20 ] = [ l22  0   ] [ l22  l32 ]
                    [ 20 61 ]   [ l32  l33 ] [ 0    l33 ]

• l22 = √16 = 4
• l32 = 20/4 = 5
• The matrix is now

    [ 1 0 0   ]
    [ 2 4 0   ]
    [ 3 5 l33 ]

• We have to factorize 61 − 5 · 5 = 36, which gives l33 = 6.
• The final answer is

    [ 1  2  3 ]   [ 1 0 0 ] [ 1 2 3 ]
    [ 2 20 26 ] = [ 2 4 0 ] [ 0 4 5 ]
    [ 3 26 70 ]   [ 3 5 6 ] [ 0 0 6 ]

8.5 Solving linear equations by Cholesky factorization


Solve Ax = b, where A is a PD matrix of order n. With A = LLᵀ, the system is LLᵀx = b.

1. Cholesky factorization: factor A as LLT ((1/3)n3 flops).


2. Forward substitution: Solve Lw = b (n2 flops).
3. Back substitution: Solve LT x = w (n2 flops).

Total cost is (1/3)n3 + 2n2 or roughly (1/3)n3 .
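In SciPy, these three steps are packaged as cho_factor/cho_solve; a small sketch (the right-hand side b is illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[25., 15., -5.],
              [15., 18.,  0.],
              [-5.,  0., 11.]])   # the PD matrix from Example 2
b = np.array([1., 2., 3.])

c, low = cho_factor(A)            # one (1/3)n^3 factorization
x = cho_solve((c, low), b)        # forward + backward substitution, 2n^2
print(np.allclose(A @ x, b))      # True

# The factorization can be reused, e.g., to form the inverse column by column:
A_inv = cho_solve((c, low), np.eye(3))
```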

8.6 Finding Inverse using Cholesky factorization


The inverse of a positive definite matrix can be computed by solving

    AX = I,  i.e.,  A[x1 x2 . . . xn] = [e1 e2 . . . en]        (8.8)

using the method for solving equations with multiple right-hand sides described in the previous section. Only one Cholesky factorization of A is required, together with n forward and n backward substitutions. The cost of computing the inverse using this method is (1/3)n³ + 2n³ = (7/3)n³.

8.7 Example Problems

8.7.1 Example 1

Compute the Cholesky factorization of

    A = [  4   6  2  −6 ]
        [  6  34  3  −9 ]
        [  2   3  2  −1 ]
        [ −6  −9 −1  38 ]

Sol: Cholesky factorization A = LLᵀ:

    [  4   6  2  −6 ]   [ l11  0    0    0   ] [ l11  l21  l31  l41 ]
    [  6  34  3  −9 ]   [ l21  l22  0    0   ] [ 0    l22  l32  l42 ]
    [  2   3  2  −1 ] = [ l31  l32  l33  0   ] [ 0    0    l33  l43 ]
    [ −6  −9 −1  38 ]   [ l41  l42  l43  l44 ] [ 0    0    0    l44 ]

• Determine l11 and L21, where L21 = (l21, l31, l41)ᵀ:

    l11 = √a11 = 2
    L21 = (1/l11)(6, 2, −6)ᵀ = (3, 1, −3)ᵀ

• To compute L22, use A22 − L21L21ᵀ = L22L22ᵀ:

    [ 34  3 −9 ]   [  3 ]               [ 25 0  0 ]
    [  3  2 −1 ] − [  1 ] [ 3 1 −3 ]  = [ 0  1  2 ]
    [ −9 −1 38 ]   [ −3 ]               [ 0  2 29 ]

    l22 = √25 = 5,   l32 = 0/5 = 0,   l42 = 0/5 = 0

• We are left with

    [ 1  2  ]   [ l33  0   ] [ l33  l43 ]
    [ 2  29 ] = [ l43  l44 ] [ 0    l44 ]

    l33² = 1 ⟹ l33 = 1
    l43 = 2/l33 = 2
    To solve for l44 we have 29 − 2 · 2 = l44², so l44 = 5.

Putting all the values into the first equation:

    [  4   6  2  −6 ]   [  2  0  0  0 ] [ 2  3  1 −3 ]
    [  6  34  3  −9 ]   [  3  5  0  0 ] [ 0  5  0  0 ]
    [  2   3  2  −1 ] = [  1  0  1  0 ] [ 0  0  1  2 ]
    [ −6  −9 −1  38 ]   [ −3  0  2  5 ] [ 0  0  0  5 ]

8.7.2 Example 2

You are given a Cholesky factorization of A = LLT of positive definite A of order n.

1. What is the Cholesky factor of the (n+1) × (n+1) matrix

       B = [ A   u ]
           [ uᵀ  1 ]

   where B is positive semidefinite?

   • Write

       [ A   u ]   [ L11  0            ] [ L11ᵀ  L21ᵀ        ]
       [ uᵀ  1 ] = [ L21  l_{n+1,n+1}  ] [ 0     l_{n+1,n+1} ]

   • A = L11L11ᵀ, therefore L11 = L
   • u = L11L21ᵀ → L21ᵀ = L11⁻¹u
   • 1 = L21L21ᵀ + l_{n+1,n+1}² → l_{n+1,n+1}² = 1 − L21L21ᵀ = 1 − uᵀL⁻ᵀL⁻¹u = 1 − uᵀA⁻¹u
   • Thus, the factors of B are

       [ L      0              ] [ Lᵀ  L⁻¹u           ]
       [ uᵀL⁻ᵀ  √(1 − uᵀA⁻¹u)  ] [ 0   √(1 − uᵀA⁻¹u)  ]

2. What is the cost of computing the Cholesky factor of B if the factor of A is given?

   • Roughly n² flops.
   • We are given L, so we only need to compute L21 and l_{n+1,n+1}.
   • We can compute L⁻¹u by solving the triangular system Lx = u, which takes n² flops because L is lower triangular.
   • Computing √(1 − uᵀA⁻¹u) = √(1 − xᵀx) takes roughly 2n flops.
3. Suppose ∥L⁻¹u∥ < 1. Show that B is positive definite.

   • Given a matrix M = [ A B; Bᵀ C ], M is positive definite if A and the Schur complement C − BᵀA⁻¹B are positive definite.
   • So in our case, A and 1 − uᵀA⁻¹u should be positive. We already know that A is positive definite.
   • Using A = LLᵀ,

       1 − uᵀA⁻¹u = 1 − uᵀL⁻ᵀL⁻¹u = 1 − ∥L⁻¹u∥² = l_{n+1,n+1}²

     which is a positive scalar whenever ∥L⁻¹u∥ < 1.
   • Hence B is positive definite.

8.7.3 Example 3

Solve LX + XLᵀ = B efficiently, given L (a lower triangular matrix) and B, where lii + ljj ̸= 0 ∀i, j. Analyze the complexity.

• Let us write the equation using the recursive block form:

    [ l11  0   ] [ x11  X12 ]   [ x11  X12 ] [ l11  L21ᵀ ]   [ b11  B12 ]
    [ L21  L22 ] [ X21  X22 ] + [ X21  X22 ] [ 0ᵀ   L22ᵀ ] = [ B21  B22 ]

• Simplifying the L.H.S and equating to R.H.S, we get the following four equations:

2l11 x11 = b11 (8.8)

l11 X 12 + x11 L21 T + X 12 L22 T = B 12 (8.9)


L21 x11 + L22 X 21 + X 21 l11 = B 21 (8.10)
L21 X 12 + L22 X 22 + X 21 L21 T + X 22 L22 T = B 22 (8.11)

• Equation (8.8) can be solved in 1 flop
• For equation (8.9), the unknown is X12 . To simplify the L.H.S (multiplications and additions of matrices),
number of flops = 2(n − 1)2 + 3(n − 1)
• After equating the L.H.S to the R.H.S, we get a system of linear equations of the form

f (x12 , x13 , . . . , x1n ) = b11 (8.12)

f (x13 , x14 , . . . , x1n ) = b12 (8.13)


..
.

f (x1n ) = b1n (8.14)

This upper triangular system of equations can be solved using backward substitution in (n − 1)2 flops.
• Total number of flops for equation (8.9) = 3(n − 1)2 + 3(n − 1) ≈ 3n2
• Similarly, equation (8.10) can be solved in 3n2 flops.
• For equation (8.11), the unknown is X22 . The equation can be rewritten as

L22 X 22 + X 22 L22 T = B 22 − L21 X 12 − X 21 L21 T (8.15)

• The R.H.S can be computed in 5(n − 1)2 . The equation is of the form LX + XLT = B which can be computed
recursively.

• Each recursive step on its own takes 1 + 3n² + 3n² + 5n² ≈ 11n² flops. The total number of flops is

    ∑_{i=1}^{n} 11i² = 11 · n(n+1)(2n+1)/6 = (11/6)(2n³ + 3n² + n)

  Considering only the leading term for the flop calculation, the total number of flops = (11/3)n³.

8.7.4 Exercise
Given that A (n × n) is a P.D. matrix and

    B = [  A  −A ]
        [ −A  βA ]

is a P.D. matrix, find the range of values of β.

Chapter 9

Matrix Decompositions: LU, QR and SVD

9.1 Review and Summary

Matrix decomposition or matrix factorization is the process of transforming a given matrix into a product of canonical matrices. Matrix decompositions are usually carried out to make a problem computationally convenient to solve and simple to analyze. For example, matrix inversion, solving linear systems, and the least squares fitting problem can be infeasible to solve optimally in an explicit manner. Converting them into a set of easier tasks, such as solving a diagonal or triangular system, helps speed up the process. It also helps in identifying the underlying structure of the matrices involved. In the previous class, we took a look at the Cholesky factorization method. More formally, the problem can be stated as follows:

Solve Ax = b where A is a non-singular n × n matrix.

Obviously, x = A−1 b is not computationally preferable.


If A were a permutation matrix, it would take 0 flops to solve this problem. Similarly, if it were the identity it would require 0 flops; if it were diagonal, n flops; and if it were positive definite, (1/3)n³ + 2n² flops.

Cholesky factorization involves decomposing A as A = LLᵀ, and the factorization itself takes only (1/3)n³ flops. In today's lecture, we take a look at three more factorizations, namely LU, QR and SVD.

To solve a system of linear equations, each equation is of the form aᵀx = c, where a = (a1, a2, . . . , an)ᵀ is an n-vector (a vector of size n) with ai ∈ R ∀i = 1 . . . n, and x = (x1, x2, . . . , xn)ᵀ is a variable n-vector. The resulting equation is a1x1 + a2x2 + . . . + anxn = c.

The set of linear equations is represented in the form Ax = b:

    [ a11  a12 . . .  a1n ] [ x1 ]   [ b1 ]
    [ a21  a22 . . .  a2n ] [ x2 ]   [ b2 ]
    [ ...                 ] [ .. ] = [ .. ]
    [ am1  am2 . . .  amn ] [ xn ]   [ bm ]

where A ∈ R^{m×n} (a matrix of order m × n with real entries), x is an n-vector, and b is an m-vector.

Indeed, many of our discussions on real matrices are directly applicable to matrices of complex numbers, as such or with minimal changes. However, that is not attempted here.

Flops for various matrices

Flops= Total number of floating point operations or flops required to perform numerical algorithm.

Let us assume that m = n (A is a square matrix) and also non-singular (Inverse of A, A−1 exists).

The following table shows the flops required to solve the linear equations Ax = b for different types of matrices:

    Matrix type                   Flops
    Identity matrix               0
    Permutation matrix            0
    Diagonal matrix               n
    Upper triangular matrix       n²
    Lower triangular matrix       n²
    Positive definite matrix      (1/3)n³ + 2n²

9.2 LU Factorization

9.2.1 Definition

Factorize or decompose a square non-singular matrix into the product of two matrices L and U, with the help of a permutation matrix if necessary:

    A = PLU

where A is any non-singular matrix, P is a permutation matrix, L is a lower triangular matrix, and U is an upper triangular matrix.

Most of the time, P is an identity matrix. For a matrix A, multiple LU decompositions are possible; to get a unique decomposition, the principal diagonal entries of either L or U are fixed to one. From now on in our discussion, we assume that P = I and lii = 1 ∀i. The standard algorithm for computing the LU decomposition is called Gaussian elimination.

Cost: The cost is (2/3)n³ flops.

9.2.2 LU Factorization

An example of an LU factorization is

    [ 0 5 5 ]   [ 0 0 1 ] [ 1    0     0 ] [ 6  8     8      ]
    [ 2 9 0 ] = [ 0 1 0 ] [ 1/3  1     0 ] [ 0  19/3  −8/3   ]        (9.1)
    [ 6 8 8 ]   [ 1 0 0 ] [ 0    15/19 1 ] [ 0  0     135/19 ]

The algorithm for computing the LU factorization is called Gaussian elimination, and it takes (2/3)n³ flops.

The Gaussian elimination method in its pure form is at times unstable, and the permutation matrix P is used to control this instability. It does so by permuting the order of the rows of the matrix being operated upon; such operations are called pivoting. However, many non-singular matrices can be factored as LU without pivoting. In the following subsection, we first describe the method and then provide an algorithm for it.
In the following subsection, we first describe the method and then provide an algorithm for the same.

9.3 Computing the LU factorization


Let us consider the simple case where P = I such that

A = LU (9.2)

The above equation can be partitioned and rewritten as


    [ a11  A12 ]   [ l11  0   ] [ u11  U12 ]
    [ A21  A22 ] = [ L21  L22 ] [ 0    U22 ]        (9.3)

                 = [ l11u11   l11U12           ]
                   [ u11L21   L21U12 + L22U22  ]        (9.4)

However, l11 = 1 as L is a unit lower triangular matrix. Therefore, equating both sides and performing the appropriate substitutions gives us:

    u11 = a11,   U12 = A12,   L21 = (1/a11)A21        (9.5)

    L22U22 = A22 − (1/a11)A21A12        (9.6)
L22 and U22 can be calculated recursively by performing an LU factorization of dimension (n − 1) on the matrix on the right-hand side of equation (9.6). This process continues till we arrive at a 1 × 1 matrix. The algorithm can be summarized as follows:

83
9.3.1 Computational procedure

Given a n × n nonsingular matrix A

1. Calculate the first row of U: u11 = a11 and U12 = A12
2. Calculate the first column of L: l11 = 1; L21 = (1/a11)A21
3. Recursively calculate the LU factorization of the (n − 1) × (n − 1) system L22U22 = A22 − (1/a11)A21A12
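A direct NumPy transcription of this recursion might look as follows (a sketch without pivoting, so it assumes every leading element a11 encountered is nonzero):

```python
import numpy as np

def lu_no_pivot(A):
    """Recursive LU factorization A = LU with unit lower triangular L."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    U[0, :] = A[0, :]                      # first row of U
    if n == 1:
        return L, U
    L[1:, 0] = A[1:, 0] / A[0, 0]          # first column of L
    # Recurse on A22 - (1/a11) A21 A12
    L[1:, 1:], U[1:, 1:] = lu_no_pivot(A[1:, 1:] - np.outer(L[1:, 0], U[0, 1:]))
    return L, U

A = np.array([[6, 3, 1], [2, 4, 3], [9, 5, 2]], dtype=float)
L, U = lu_no_pivot(A)
print(np.allclose(L @ U, A))   # True; compare with Example 2 below
```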

9.3.2 Example 1
    A = [ 0  1 ]
        [ 1 −1 ]

Attempting

    [ 0  1 ]   [ 1    0   ] [ 0  U12 ]
    [ 1 −1 ] = [ L21  L22 ] [ 0  U22 ]

is not possible, since it would require L21 = (1/0)(1). Suppose P = [ 0 1; 1 0 ]. Then

    PA = [ 0 1 ] [ 0  1 ]   [ 1 −1 ]
         [ 1 0 ] [ 1 −1 ] = [ 0  1 ]

Now, L21 = (1/1)(0) = 0 and U12 = −1. Therefore,

    PA = [ 1  0   ] [ 1  −1  ]
         [ 0  L22 ] [ 0  U22 ]

and L22U22 = A22 − 0 · (−1) = 1, which implies L22 = U22 = 1. Therefore,

    PA = [ 1 0 ] [ 1 −1 ]
         [ 0 1 ] [ 0  1 ]

Finally,

    A = [ 0 1 ] [ 1 0 ] [ 1 −1 ]
        [ 1 0 ] [ 0 1 ] [ 0  1 ]

9.3.3 Example 2

 
6 3 1
A = 2 4 3
9 5 2

Factoring,

    A = [ 6 3 1 ]   [ 1    0    0 ] [ u11  u12  u13 ]
        [ 2 4 3 ] = [ l21  1    0 ] [ 0    u22  u23 ]
        [ 9 5 2 ]   [ l31  l32  1 ] [ 0    0    u33 ]

The first row of U is (6, 3, 1), and the first column of L is (1, 1/3, 3/2)ᵀ:

    [ 6 3 1 ]   [ 1    0    0 ] [ 6  3    1   ]
    [ 2 4 3 ] = [ 1/3  1    0 ] [ 0  u22  u23 ]
    [ 9 5 2 ]   [ 3/2  l32  1 ] [ 0  0    u33 ]

Recursing on the 2 × 2 block:

    L22U22 = [ 1    0 ] [ u22  u23 ] = [ 4 3 ] − (1/6) [ 2 ] [ 3 1 ] = [ 3    8/3 ]
             [ l32  1 ] [ 0    u33 ]   [ 5 2 ]         [ 9 ]           [ 1/2  1/2 ]

so u22 = 3, u23 = 8/3, l32 = (1/2)/3 = 1/6, and

    u33 = 1/2 − (1/6)(8/3) = 1/18

Therefore,

    A = [ 6 3 1 ]   [ 1    0    0 ] [ 6  3  1    ]
        [ 2 4 3 ] = [ 1/3  1    0 ] [ 0  3  8/3  ]
        [ 9 5 2 ]   [ 3/2  1/6  1 ] [ 0  0  1/18 ]

9.4 Solving linear equations by LU Factorization


The use of LU factorization is a standard way of solving linear equations with a general nonsingular coefficient matrix
A. Given a set of linear equations Ax = b, the following algorithm is used to find the solution :

1. LU factorization: factor A as A = PLU — (2/3)n³ flops
2. Permutation: calculate w = Pᵀb by reordering the rows of b — 0 flops
3. Forward substitution: solve Lz = w for z — n² flops
4. Backward substitution: solve Ux = z — n² flops

The total cost is (2/3)n³ + 2n², or simply (2/3)n³ flops.
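In SciPy, these steps are bundled into lu_factor/lu_solve; a quick sketch using the matrix from Example 2:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[6., 3., 1.], [2., 4., 3.], [9., 5., 2.]])
b = np.array([18., 17., 29.])

lu, piv = lu_factor(A)        # one (2/3)n^3 factorization (with pivoting)
x = lu_solve((lu, piv), b)    # permutation + forward + backward substitution
print(x)                      # [2. 1. 3.], as in Example 3 below
```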

9.5 Computing the Inverse using LU


In order to compute the inverse A−1 , we can solve the equation

AX = I

That is to say, we can solve the system of n equations Axi = ei, where xi is the ith column of A⁻¹ and ei is the ith unit vector. The cost of this computation is (2/3)n³ + n(2n²) = (8/3)n³ flops (one LU factorization, plus n forward and n backward substitutions).

9.6 Solution of Ax = b with a direct inverse


Ax = b  ⟹  x = A⁻¹b

Step 1: Find the inverse A⁻¹. Total cost: (8/3)n³ flops.

Step 2: Matrix–vector multiplication A⁻¹b. This requires n² + n(n − 1) flops.

Therefore, a total of (8/3)n³ + n² + n(n − 1) flops is required.

Note: A solution based on the explicit inverse is not attractive.

9.6.1 Example 3

Solve
    
6 3 1 x1 18
2 4 3 x2  = 17
9 5 2 x3 29

We shall use the factorization performed in Example 2. As P = I, the permutation step does not affect the outcome. We first solve

    [ 1    0    0 ] [ z1 ]   [ 18 ]
    [ 1/3  1    0 ] [ z2 ] = [ 17 ]
    [ 3/2  1/6  1 ] [ z3 ]   [ 29 ]

which gives the solution (18, 11, 1/6) by forward substitution. Finally, we solve

    [ 6  3  1    ] [ x1 ]   [ 18  ]
    [ 0  3  8/3  ] [ x2 ] = [ 11  ]
    [ 0  0  1/18 ] [ x3 ]   [ 1/6 ]

which gives the solution x = (2, 1, 3) by backward substitution.

9.6.2 Example Problem

Solve the following problem with 3 parts.

1. For what values of a1 , a2 , . . ., an is the n × n matrix


 
a1 1 0 ... 0 0
 a2 0 1 ... 0 0 
 
 .. .. .. .. .. .. 
 . 
A= . . . . . 
 an−2 0 
 0 ... 1 0 
 an−1 0 0 ... 0 1 
an 0 0 ... 0 0

non-singular?
2. Assume A is non-singular, how many floating point operations do you need to solve Ax = b
3. Assume A is non-singular. What is the inverse A⁻¹? (In other words, express the elements of A⁻¹ in terms of a1, a2, . . ., an.)

Solution: 1. A is non-singular if and only if Ax = 0 implies x = 0


Ax = 0 means

    x2 = −a1x1
    x3 = −a2x1
    ...
    xn = −an−1x1
    anx1 = 0

If an ̸= 0, then from the last equation x1 = 0, and consequently the remaining elements of x also become zero: x2 = x3 = . . . = xn = 0, i.e., x = 0. So A is non-singular.

If an = 0, take x1 = 1; then x2 = −a1, x3 = −a2, etc., and we obtain a nonzero x with Ax = 0. So if an = 0, the matrix is singular.

Hence A is non-singular exactly when an ̸= 0; a1, a2, . . ., an−1 can take any value.

2. If we put the last equation first we obtain

an x1 = bn

a1 x1 + x2 = b1

a2 x1 + x3 = b2
..
.

an−1 x1 + xn = bn−1

We can solve these equations by forward substitution.


    x1 = bn/an           (1 flop)
    x2 = b1 − a1x1       (from now onwards, 2 flops per equation)
    x3 = b2 − a2x1
    ...
    xn = bn−1 − an−1x1

Thus it takes 2n − 1 flops.

3. We can find A⁻¹ by solving AX = I column by column:

    A [ x1 x2 x3 . . . xn ] = [ e1 e2 . . . en ]

which gives

          [ 0 0 0 . . . 0   1/an     ]
          [ 1 0 0 . . . 0  −a1/an    ]
    A⁻¹ = [ 0 1 0 . . . 0  −a2/an    ]
          [ 0 0 1 . . . 0  −a3/an    ]
          [ ...                      ]
          [ 0 0 0 . . . 1  −an−1/an  ]

9.7 QR Factorization
Factorizations have a wide number of applications. However, some factorizations like Cholesky are not always
appropriate or efficient enough due to certain inherent restrictions. For example, the Least Squares problem can be
solved faster by QR factorization which we cover in this section. (more in the next lecture). Also we need methods
to factorize non square matrices.

A left-invertible m × n matrix can be factored as

    A = QR

where Q is an m × n orthogonal matrix and R is an n × n upper triangular matrix with positive diagonal elements. This is called the QR factorization of A.
Just to recap, an m × n orthogonal matrix has the property that:

1. QT Q = I, when m > n
2. QQT = I, when m < n
3. QT Q = QQT = I, when m = n

An example of a QR factorization is

    [ 3 −6  26 ]   [ 3/5   0    4/5 ]
    [ 4 −8  −7 ]   [ 4/5   0   −3/5 ] [ 5  −10  10 ]
    [ 0  4   4 ] = [ 0    4/5   0   ] [ 0   5    5 ]        (9.7)
    [ 0 −3  −3 ]   [ 0   −3/5   0   ] [ 0   0   25 ]

9.8 QR Factorization: The Method


We now describe the method and the algorithm to compute the QR factorization and its cost is 2mn2 . As an exercise
for further understanding, one can attempt to factorize the matrix on the left hand side of the above equation and
arrive at the matrices on the right hand side.

The first step is to partition the A, Q and R matrices in the following manner:

    A = [ a1  A2 ],   Q = [ q1  Q2 ],   R = [ r11  R12 ]
                                            [ 0    R22 ]

where each of the blocks follows the standard block notation.

Since Q is orthogonal, we know

    QᵀQ = [ q1ᵀ ] [ q1  Q2 ] = [ q1ᵀq1   q1ᵀQ2 ] = [ 1  0 ]
          [ Q2ᵀ ]              [ Q2ᵀq1   Q2ᵀQ2 ]   [ 0  I ]

Therefore,

    q1ᵀq1 = 1,   q1ᵀQ2 = 0,   Q2ᵀQ2 = I

We also know, that r11 > 0 and R22 is an upper triangular matrix with positive diagonals.

Combining the above, we get

    [ a1  A2 ] = [ q1  Q2 ] [ r11  R12 ] = [ q1r11   q1R12 + Q2R22 ]        (9.8)
                            [ 0    R22 ]

Comparing the left and right sides, we can conclude that a1 = q1r11. Since q1 has unit norm, r11 = ∥a1∥ and q1 = a1/r11. Also, A2 = q1R12 + Q2R22.
To simplify this, notice that we can premultiply both sides by q1T thereby obtaining,

q1T A2 = q1T q1 R12 + q1T Q2 R22

Since, q1T q1 = 1 and q1T Q2 = 0,


    R12 = q1ᵀA2        (9.9)

    ∴ A2 − q1R12 = Q2R22
9.8.1 Algorithm: QR Factorization

The algorithm to compute the QR factors of the matrix can be concisely written as follows.

Given an m × n matrix A, it can be factored as A = QR:

1. Compute the first row/column values: r11 = ∥a1∥, q1 = a1/r11 and R12 = q1ᵀA2
2. Compute the QR factorization of A2 − q1R12 as Q2R22

Computational cost: 2mn² − (2/3)n³ flops, or (4/3)n³ for square matrices.
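A NumPy sketch of this recursion (block Gram–Schmidt; it assumes A has full column rank):

```python
import numpy as np

def qr_recursive(A):
    """Recursive QR: A = QR with orthonormal columns in Q, upper triangular R."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    R[0, 0] = np.linalg.norm(A[:, 0])      # r11 = ||a1||
    Q[:, 0] = A[:, 0] / R[0, 0]            # q1 = a1 / r11
    if n == 1:
        return Q, R
    R[0, 1:] = Q[:, 0] @ A[:, 1:]          # R12 = q1^T A2
    # Recurse on A2 - q1 R12
    Q[:, 1:], R[1:, 1:] = qr_recursive(A[:, 1:] - np.outer(Q[:, 0], R[0, 1:]))
    return Q, R

A = np.array([[2., 8., 13.], [4., 7., -7.], [4., -2., -13.]])
Q, R = qr_recursive(A)
print(np.allclose(Q @ R, A))   # True; compare with the worked example below
```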

9.8.2 Example
 
Find the QR factorization of

    A = [ 2  8  13  ]
        [ 4  7  −7  ]
        [ 4 −2  −13 ]

Show the steps.

Solution: Recursive algorithm:

1. r11 = ∥a1∥
2. q1 = a1/∥a1∥
3. R12 = q1ᵀA2
4. Compute Q2R22 by QR factorization of A2 − q1R12

Step 1:

    r11 = ∥a1∥ = √(4 + 16 + 16) = 6;   q1 = (1/6)(2, 4, 4)ᵀ = (1/3, 2/3, 2/3)ᵀ;
    R12 = q1ᵀA2 = [ 6  −9 ]

    A2 − q1R12 = [  8  13 ]   [ 2 −3 ]   [  6  16 ]
                 [  7  −7 ] − [ 4 −6 ] = [  3  −1 ]
                 [ −2 −13 ]   [ 4 −6 ]   [ −6  −7 ]

At this point the first column of Q is q1 = (1/3, 2/3, 2/3)ᵀ and the first row of R is (6, 6, −9).

Step 2: Repeat on the remaining 3 × 2 block:

    r11 = √(36 + 9 + 36) = 9;   q1 = (1/9)(6, 3, −6)ᵀ = (2/3, 1/3, −2/3)ᵀ;
    R12 = q1ᵀA2 = (2/3)(16) + (1/3)(−1) + (−2/3)(−7) = 15

    A2 − q1R12 = (16, −1, −7)ᵀ − (10, 5, −10)ᵀ = (6, −6, 3)ᵀ

Now the second column of Q is (2/3, 1/3, −2/3)ᵀ and the second row of R is (0, 9, 15).

Step 3: Repeat on the single remaining column (6, −6, 3)ᵀ:

    r11 = √(36 + 36 + 9) = 9;   q1 = (1/9)(6, −6, 3)ᵀ = (2/3, −2/3, 1/3)ᵀ

There is no A2 left, and the recursion stops. Finally,

    A = QR,   Q = [ 1/3   2/3   2/3 ]   R = [ 6  6  −9 ]
                  [ 2/3   1/3  −2/3 ],      [ 0  9  15 ]
                  [ 2/3  −2/3   1/3 ]       [ 0  0   9 ]

9.9 Applications of QR
Let us consider the two typical applications (when A is square)

Solution to Ax = b:

    Ax = b  ⟹  QRx = b  ⟹  Rx = Qᵀb
This requires a matrix vector product followed by solving a triangular system of equations.
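A sketch with NumPy/SciPy (solve_triangular performs the n² back substitution; the right-hand side is illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2., 8., 13.], [4., 7., -7.], [4., -2., -13.]])
b = np.array([1., 2., 3.])

Q, R = np.linalg.qr(A)             # 2mn^2 flops (here m = n)
x = solve_triangular(R, Q.T @ b)   # Rx = Q^T b by back substitution
print(np.allclose(A @ x, b))       # True
```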

Inverse of A We solve the system of equations AX = I or we solve Axi = ei .


Question: Find the computational complexity in both the cases.

9.10 Factorization using SVD


Singular value decomposition is a very popular factorization scheme with many applications. The singular value
decomposition (SVD) is a factorization of a real or complex matrix. It has many useful applications in signal
processing, statistics and optimization.

Formally, the singular value decomposition of an m × n real or complex matrix M is a factorization of the form M = UDVᵀ, where U is an m × m real or complex unitary matrix, D is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and Vᵀ (the conjugate transpose of V, or simply the transpose of V if V is real) is an n × n real or complex unitary matrix. The diagonal entries Dii of D are known as the singular values of M. The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of M, respectively.

Note that UᵀU = I and VᵀV = VVᵀ = I.

Computational cost of SVD is 2mn2 + 11n3 .

The singular value decomposition and the eigendecomposition are closely related. Namely:

• The left-singular vectors of M are eigenvectors of M M T .


• The right-singular vectors of M are eigenvectors of M T M .
• The non-zero singular values of M (found on the diagonal entries of D) are the square roots of the non-zero
eigenvalues of both M T M and M M T .

Applications that employ the SVD include computing the pseudoinverse, least squares fitting of data, matrix approx-
imation, and determining the rank, range and null space of a matrix.


    M = ∑i Dii ui viᵀ

9.11 Computing Inverse using SVD


Let us assume A = U DV T

A−1 = V D−1 U T

D⁻¹ is easy to calculate since D is diagonal: n flops.

If A is singular (or nearly so), one can find an approximate inverse by discarding the zero (or tiny) singular values: set (D⁻¹)ii = 1/Dii if Dii > t, and zero otherwise.

One can solve Ax = b with the help of this inverse.

Consider a set of homogeneous equations Ax = 0. Any vector x in the null space of A is a solution. Hence, any column of V whose corresponding singular value is zero is a solution.
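A sketch of both uses with NumPy (the tolerance t below is an assumed, user-chosen cutoff):

```python
import numpy as np

A = np.array([[1., 2.], [2., 4.]])   # a singular (rank 1) matrix
U, d, Vt = np.linalg.svd(A)          # d holds the singular values

t = 1e-10
d_inv = np.array([1.0 / s if s > t else 0.0 for s in d])
A_pinv = Vt.T @ np.diag(d_inv) @ U.T   # (pseudo) inverse via SVD

# Homogeneous system Ax = 0: columns of V with zero singular value span null(A)
x = Vt.T[:, d < t]
print(np.allclose(A @ x, 0))           # True
```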

9.12 Additional Examples

9.12.1 Example
Demonstrate that Cholesky factorization can be done in (1/3)n³ operations.

Sol: Let A be an n × n positive definite matrix.

Then the generalized Cholesky factorization proceeds as follows:

    [ a11  A21ᵀ ]   [ l11  0   ] [ l11  L21ᵀ ]
    [ A21  A22  ] = [ L21  L22 ] [ 0ᵀ   L22ᵀ ]

    a11 = l11² ⟹ l11 = √a11
    A21 = L21 l11 ⟹ L21 = A21/l11
    A22 = L21L21ᵀ + L22L22ᵀ ⟹ A22 − L21L21ᵀ = L22L22ᵀ

Recursive algorithm:

• Calculate the first column of L: l11 = √a11 and L21 = A21/l11
• Compute the Cholesky factorization A22 − L21L21ᵀ = L22L22ᵀ

The number of operations (flops) required for order n satisfies the recurrence

    T(n) = 1 + (n − 1) + 2(n − 1)² + T(n − 1)

i.e., T(n) = T(n − 1) + O(n²), which unrolls to O(n³) flops.

Explanation:

1. The calculation of l11 takes 1 flop in each recursion.
2. The calculation of L21 takes (n − 1) flops if the current matrix order is n. So, up to this point, each recursion takes n flops.
3. The calculation of A22 − L21L21ᵀ (which yields L22L22ᵀ) can be done in 2(n − 1)² flops if the current matrix order is n. So each recursion takes approximately n + 2(n − 1)² flops.

So the total flops required for the LLᵀ (Cholesky) factorization of an n × n matrix is

    [n + (n − 1) + . . . + 1] + 2[(n − 1)² + (n − 2)² + . . . + 1]
    = n(n + 1)/2 + 2 · (n − 1)n(2n − 1)/6 ≊ (1/3)n³

(the sum of the first n natural numbers, plus twice the sum of the squares of the first n − 1 natural numbers).

9.12.2 Example

Suggest an efficient algorithm to compute Z = (I + A⁻¹ + A⁻² + A⁻³)b. Analyze the complexity/flops.

Solution:

    Z = (I + A⁻¹ + A⁻² + A⁻³)b = b + A⁻¹b + A⁻²b + A⁻³b

Let w = A⁻¹b, x = A⁻²b = A⁻¹w and y = A⁻³b = A⁻¹x, so that Z = b + w + x + y.

Consider w = A⁻¹b, i.e., Aw = b:

• Let A = LU (the LU decomposition requires (2/3)n³ flops) — (1)
• Then Aw = LUw = b. Let Uw = v and substitute into the previous equation.
• Lv = b (solve for v by forward substitution; n² flops)
• Uw = v (solve for w by backward substitution; n² flops)

In total, (2/3)n³ + 2n² flops are required to solve the linear equation Aw = b.

Consider x = A⁻¹w. We already have the LU decomposition of A from (1):

• Ax = LUx = w. Let Ux = p and substitute into the previous equation.
• Lp = w (solve for p by forward substitution; n² flops)
• Ux = p (solve for x by backward substitution; n² flops)

In total, 2n² flops are required to solve the linear equation Ax = w.

Consider y = A⁻¹x. Again reusing the LU decomposition from (1):

• Ay = LUy = x. Let Uy = q and substitute into the previous equation.
• Lq = x (solve for q by forward substitution; n² flops)
• Uy = q (solve for y by backward substitution; n² flops)

In total, 2n² flops are required to solve the linear equation Ay = x.

The addition of the four n × 1 vectors takes 3n flops.

Therefore, a total of (2/3)n³ + 6n² + 3n flops is required with this efficient approach.
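A sketch of this scheme with SciPy, reusing one factorization for all three solves (the matrix and vector are illustrative):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])

lu, piv = lu_factor(A)        # factor once: (2/3)n^3 flops
w = lu_solve((lu, piv), b)    # w = A^{-1} b
x = lu_solve((lu, piv), w)    # x = A^{-2} b
y = lu_solve((lu, piv), x)    # y = A^{-3} b
Z = b + w + x + y             # 3n additions
```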

9.12.3 Exercise

Consider the set of linear equations

(D + uv T )x = b

where u, v, and b are given n-vectors, and D is a given diagonal matrix. The diagonal elements of D are nonzero and
uT D−1 v ̸= −1.

1. What is the cost of solving these equations using the following method?
(i) First calculate A = (D + uv T )
(ii) Then solve Ax = b using the standard LU method

2. Compute the inverse of the above matrix A using a suitably efficient algorithm and determine the cost.

9.12.4 Exercise

Calculate the LU factorization without pivoting of the matrix

    A = [ −3   2   0    3  ]
        [  6  −6   0  −12  ]
        [ −3   6  −1   16  ]
        [ 12 −14  −2  −15  ]

Provide all the steps during the calculation.

Chapter 10

Optimization Problems: Least Square and Least Norm

10.1 Introduction
In the last two lectures, we saw the problem of solving Ax = b when A is square and non-singular. What if m ̸= n? There are two cases that we discuss today: (i) m > n and (ii) m < n.

• Consider the case when m > n, i.e., the number of equations is more than the number of variables/unknowns. Linear equations with m > n are called over-determined equations. They may not have a solution that satisfies all the equations. However, we explore the "most approximate" (or optimal) solution, leading to an optimization problem. This problem is of much practical interest, since the lack of a consistent solution to the entire set of equations may be due to practical issues like 'noise'. Such problems are often formulated as least square error (LSE) minimization, popularly known as the least squares problem. Least squares (LS) problems are optimization problems in which the objective (error) function is expressed as a sum of squares. They have a natural relationship to model fitting problems, and the solutions may be computed analytically using the tools of linear algebra.
• When m < n, the situation is very different. There may be too many solutions that satisfy the too few equations that we have. In such situations, our interest is in finding a solution that has some special character, for example, the one with minimum norm.

We first derive the closed form expressions for these two problems (for the L2 norm) and then discuss how computationally efficient and accurate solutions can be designed.

10.2 Least Square Problem


The least squares problem is defined as the problem of finding a vector x that minimizes ∥Ax − b∥²₂. The problem gets its name from the following equation, which defines the error (or residual) for the ith equation:

    ri(x) = ai1x1 + ai2x2 + . . . + ainxn − bi,   ∀i = 1, 2, . . . , m.

Note that ri(x) is the ith component of Ax − b. Often ri is called the residual or error, and it may have some physical significance. Let us come back to our familiar matrix form, Ax = b. Our problem is to minimize ∥r∥ such that Ax + r = b, or equivalently to minimize ∥Ax − b∥.

Let us first understand the objective function:

    ∥Ax − b∥² = (Ax − b)ᵀ(Ax − b) = xᵀAᵀAx + bᵀb − 2xᵀAᵀb

This has a quadratic term in x, a linear term in x, and a constant (like the familiar ax² + bx + c).

To minimize this objective (a function of x), let us partially differentiate it with respect to the entries of x and equate to zero. (Note: if you do not know how to differentiate functions of matrices/vectors, read Tom Minka's technical report "Old and New Matrix Algebra Useful for Statistics".) This leads to:

    2AᵀAx − 2Aᵀb = 0
    or AᵀAx = Aᵀb        (10.1)

By multiplying both sides by (AᵀA)⁻¹,

    x = (AᵀA)⁻¹Aᵀb        (10.2)

10.3 Efficient Computation of LS solution


Solution using Cholesky. Rearranging, the solution satisfies the normal equations:

    AᵀAx = Aᵀb        (10.3)

Note that AT A is PD and we can use cholesky to solve this.

Question: Prove that AT A is PD.

Our solution is to form an equation Cx = d with C = AᵀA and d = Aᵀb. The steps can be summarized as:

1. Compute C = AᵀA and d = Aᵀb
2. Factor C = LLᵀ using Cholesky factorization
3. Solve Lw = d by forward substitution
4. Solve Lᵀx = w by backward substitution

Complexity of step 1: computing C needs mn² flops (note that C is symmetric, so only about half its elements need to be computed), and computing d needs 2mn flops.
Complexity of step 2: (1/3)n³.
Complexity of steps 3 and 4: n² each.

Therefore, the total cost is mn² + 2mn + (1/3)n³ + 2n², which is taken as (1/3)n³ + mn².

Solution using QR decomposition. Writing A = QR, we have:


AT Ax = AT b
(QR)T (QR)x = (QR)T b
RT QT QRx = RT QT b
RT Rx = RT QT b
Rx = QT b

Steps for solving least squares problems using QR:

1. Factorize A as QR
2. Compute w = Qᵀb
3. Solve Rx = w by backward substitution

Step 1 costs 2mn² flops, step 2 costs 2mn, and step 3 costs n². Therefore, the total cost is 2mn² + 2mn + n², which is approximated as 2mn².
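A sketch with NumPy (np.linalg.qr returns the reduced m × n Q and n × n R used here; the data is illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Over-determined system: m = 4 equations, n = 2 unknowns
A = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.]])
b = np.array([6., 5., 7., 10.])

Q, R = np.linalg.qr(A)                # reduced QR: 2mn^2 flops
x = solve_triangular(R, Q.T @ b)      # Rx = Q^T b by back substitution
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```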

Solution using SVD. Let us recollect the SVD. The matrix A is written in the form A = UDVᵀ, where U is an m × n orthogonal matrix (UᵀU = I), D is an n × n diagonal matrix, and V is an n × n orthogonal matrix (VᵀV = VVᵀ = I).

    AᵀAx = Aᵀb
    (UDVᵀ)ᵀ(UDVᵀ)x = (UDVᵀ)ᵀb
    VDUᵀUDVᵀx = VDUᵀb
    VD²Vᵀx = VDUᵀb
    DVᵀx = Uᵀb

We solve this as follows: let p = Uᵀb and w = Vᵀx. Then solve Dw = p, and recover x = Vw.

Steps for solving least squares problems using SVD:

1. Factorize A = UDVᵀ using SVD
2. Compute p = Uᵀb
3. Solve Dw = p
4. Find x = Vw

10.4 Least Norm Problems


Linear equations with m < n are called under-determined. Such a system has infinitely many solutions; we seek the solution x̂ whose norm is minimal:

    Minimize ∥x∥
    subject to Ax = b

The solution is unique and is given by x̂ = Aᵀ(AAᵀ)⁻¹b.

Verify: First we verify that this is indeed the solution; we then show below how it can be derived using Lagrange multipliers.

1. First we check that x̂ satisfies Ax̂ = b:

    Ax̂ = AAᵀ(AAᵀ)⁻¹b = b

This means that x̂ is one of the possible solutions.
2. Now we show that any other solution of the equation has a norm greater than ∥x̂∥. Suppose x satisfies Ax = b. Then

    ∥x∥² = ∥x̂ + (x − x̂)∥² = ∥x̂∥² + ∥x − x̂∥² + 2x̂ᵀ(x − x̂)

The third term is zero, since

    x̂ᵀ(x − x̂) = (Aᵀ(AAᵀ)⁻¹b)ᵀ(x − x̂) = bᵀ(AAᵀ)⁻¹A(x − x̂) = 0

because Ax = Ax̂ = b, so A(x − x̂) = 0. Thus we have

    ∥x∥² = ∥x − x̂∥² + ∥x̂∥² = ∥x̂∥² + a nonnegative term

which implies that ∥x̂∥ is the minimum.

Derivation. The problem is:

    Minimize xᵀx
    subject to Ax = b

Combining the constraints, the Lagrangian is

    L(x, λ) = xᵀx + λᵀ(Ax − b)

Differentiating with respect to x and λ and equating to zero:

    2x + Aᵀλ = 0
    Ax − b = 0

so x = −Aᵀλ/2. Substituting into the constraint,

    A(−Aᵀλ)/2 − b = 0  ⟹  λ = −2(AAᵀ)⁻¹b

Substituting back,

    x = Aᵀ(AAᵀ)⁻¹b

10.5 Efficient Solutions to Least Norm Problems


Solution using Cholesky. If A is m × n and right-invertible:

1. Compute C = AAᵀ
2. Decompose C as C = LLᵀ
3. Solve Lw = b by forward substitution
4. Solve Lᵀz = w by backward substitution
5. Find x = Aᵀz

Step 1 takes nm² flops, step 2 takes (1/3)m³, steps 3 and 4 take m² each, and step 5 takes 2mn. So the total complexity is (1/3)m³ + nm².

Solution using QR:

    Aᵀ = QR        (10.4)

where QᵀQ = I and R is an upper triangular matrix with positive diagonal. Then

    x̂ = Aᵀ(AAᵀ)⁻¹b        (10.5)

Substituting Aᵀ = QR in the above equation:

    x̂ = QR(RᵀQᵀQR)⁻¹b = QR(R⁻¹R⁻ᵀ)b = QR⁻ᵀb

First we find x̂ by computing R⁻ᵀb, i.e., by solving Rᵀz = b, and then multiplying by Q.

Steps to solve least norm problems using QR:

Steps to solve least norm problems using QR

1. Compute the QR factorization Aᵀ = QR — 2nm² flops
2. Solve Rᵀz = b by forward substitution — m² flops
3. Compute x = Qz — 2mn flops

The total complexity is 2nm².
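A sketch with NumPy/SciPy (the under-determined system is illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Under-determined system: m = 2 equations, n = 3 unknowns
A = np.array([[1., 1., 1.], [1., 2., 4.]])
b = np.array([3., 7.])

Q, R = np.linalg.qr(A.T)                   # factor A^T = QR
z = solve_triangular(R.T, b, lower=True)   # R^T z = b, forward substitution
x = Q @ z                                  # minimum-norm solution
print(np.allclose(A @ x, b))               # True
```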

Solution using SVD. Factorize Aᵀ = UDVᵀ, where U is n × m with orthonormal columns (UᵀU = I), D is an m × m diagonal matrix, and V is m × m orthogonal (VᵀV = VVᵀ = I). Substituting this into x̂ = Aᵀ(AAᵀ)⁻¹b leads to

    DUᵀx = Vᵀb

Steps for solving least norm problems using SVD:

1. Factorize Aᵀ = UDVᵀ
2. Compute p = Vᵀb
3. Solve Dw = p
4. Find x = Uw

The cost may be higher than QR, but the numerical stability is superior.

10.6 Basis Pursuit


Consider now another problem:

    Minimize ∥x∥₀
    Subject to: Ax = b and ∥x∥∞ ≤ R

Note that ∥x∥₀ = #{i | xi ̸= 0}. If ∥x∥₀ ≪ m, x is sparse. This problem is NP-hard.

Let us first rewrite the problem as MIP.

Minimize 1T z
Subject to: Ax = b; |xi | ≤ Rzi ; and zi ∈ {0, 1}

Assume we relax zi from {0, 1} to [0, 1].

Minimize 1T z
Subject to: Ax = b; |xi | ≤ Rzi ; and 0 ≤ zi ≤ 1

Observing that zi = |xi|/R at the optimum, the problem is equivalent to:

    Minimize ∥x∥₁/R
    Subject to: Ax = b

This is similar to replacing L0 norm by L1 norm. Assume we use L1 norm instead of L0 norm.

Minimize ∥x∥1

Subject to: Ax = b and x ∈ Rn

Question: Can the L1-norm-based solution be the same as that of L0? Or how bad can it be? Let us wait a few lectures to know the answer.

Equivalent LP

Minimize u1 + u2 + . . . un
Subject to: Ax = b
−u ≤ x ≤ u
x, u ∈ Rn and u ≥ 0
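As a sketch, this LP can be handed to scipy.optimize.linprog by stacking the variables as z = (x, u); the matrix A and vector b below are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1., 1., 1.], [1., 2., 4.]])   # under-determined: m = 2, n = 3
b = np.array([3., 7.])
m, n = A.shape

# Variables z = (x, u); objective 1^T u; constraints Ax = b and -u <= x <= u
c = np.concatenate([np.zeros(n), np.ones(n)])
A_eq = np.hstack([A, np.zeros((m, n))])
A_ub = np.vstack([np.hstack([np.eye(n), -np.eye(n)]),     #  x - u <= 0
                  np.hstack([-np.eye(n), -np.eye(n)])])   # -x - u <= 0
b_ub = np.zeros(2 * n)
bounds = [(None, None)] * n + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b, bounds=bounds)
x = res.x[:n]        # an L1-minimizing solution of Ax = b
print(x)
```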

10.7 Additional Examples

10.7.1 Example 1

Minimize ∥Ax − b∥² + ∥x∥².

Solution:

    y = ∥Ax − b∥² + ∥x∥²
      = (Ax − b)ᵀ(Ax − b) + xᵀx
      = xᵀAᵀAx − xᵀAᵀb − bᵀAx + bᵀb + xᵀx
      = xᵀAᵀAx + xᵀx − 2xᵀAᵀb + bᵀb

Differentiating with respect to x on both sides and equating to zero:

    2AᵀAx + 2x − 2Aᵀb = 0
    (AᵀA + I)x = Aᵀb

Solving the above problem using Cholesky:

1. Compute B = AᵀA + I
2. Factor B = LLᵀ (Cholesky)
3. The system is LLᵀx = Aᵀb
4. Solve Lw = Aᵀb by forward substitution
5. Solve Lᵀx = w by backward substitution

Solving the above problem using QR:

1. Compute B = AᵀA + I
2. The system to solve is Bx = Aᵀb
3. Factor Bᵀ = QR (so B = RᵀQᵀ)
4. Solve Rᵀz = Aᵀb by forward substitution
5. Compute x = Qz

10.7.2 Example 2

Suggest an efficient algorithm to minimize ∥Ax − b1∥² + ∥Ax − b2∥², given an m × n matrix A and two m-vectors b1 and b2.

Solution:

    y = (Ax − b1)ᵀ(Ax − b1) + (Ax − b2)ᵀ(Ax − b2)
      = xᵀAᵀAx − 2xᵀAᵀb1 + b1ᵀb1 + xᵀAᵀAx − 2xᵀAᵀb2 + b2ᵀb2

Differentiating with respect to x on both sides and equating to zero:

    4AᵀAx − 2Aᵀb1 − 2Aᵀb2 = 0
    2AᵀAx = Aᵀb1 + Aᵀb2

Substituting A = QR (2mn² flops):

    2RᵀQᵀQRx = RᵀQᵀ(b1 + b2)
    Rx = Qᵀ(b1 + b2)/2

Finishing the solution:

1. Let w = Qᵀ(b1 + b2)/2 — 2mn + m flops
2. Solve Rx = w by backward substitution — n² flops

So the total complexity is 2mn² + 2mn + m + n².

10.7.3 Exercise

min(∥x − x0 ∥2 ) (10.6)
Such that
Ax = b; (m < n) (10.7)

TODO Notes (i) Fix flops for SVD (ii) double check costs.

Chapter 11

Constrained Optimization: Lagrange Multipliers and KKT Conditions

11.1 Introduction
Constrained Optimization vs Unconstrained Optimization

11.2 Lagrange Multipliers

11.3 KKT Conditions

11.4 Problems

Chapter 12

Eigen Value Problems in Optimization

12.1 Eigen Values and Eigen Vectors
In linear algebra, an eigenvector or characteristic vector of a square matrix is a vector that does not change its direction under the associated linear transformation. That is, if

    Ax = λx

then x is an eigen vector of A and λ is the corresponding eigen value. Note that an n × n matrix A has at most n linearly independent eigen vectors and at most n distinct eigen values. Eigen values can also be zero.

Geometrically, an eigenvector corresponding to a real, nonzero eigenvalue points in a direction that is stretched by
the transformation and the eigenvalue is the factor by which it is stretched.

Another intuitive explanation of eigen vectors is as follows. When a vector gets multiplied by a matrix (a linear transformation), the vector generally changes its direction. However, certain vectors do not change their direction; these are the eigen vectors, i.e., Ax = λx. There is only a scale change, characterized by λ, on multiplication. Such vectors x are the eigen vectors, and the corresponding scale factor λ is the eigen value. If A is an identity matrix, every vector is an eigen vector, and all of them have eigen value λ = 1.

12.1.1 Basics and Properties


1. When A is squared, the eigenvectors stay the same and eigenvalues get squared.
Prove this.
2. The product of eigen values is determinant.
Prove this.
3. Sum of eigen values is Trace
Prove this.
4. Symmetric matrices have real eigen values
Prove/Verify this.

12.1.2 Numerical Computation

To compute the eigen values and eigen vectors, we start with Ax = λx, or (A − λI)x = 0. If this system is to have a non-trivial solution, the determinant of A − λI should be zero:

|A − λI| = 0

Solving the above for λ gives the eigen values. Substituting this in

Ax = λx

yield x.

Example: Compute the eigen values and eigen vectors of the following matrix.
 
1 2 3
 4 5 6 
7 8 9
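For a quick numerical check (this particular matrix is singular, so one eigen value is 0):

```python
import numpy as np

A = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
vals, vecs = np.linalg.eig(A)
print(vals)                                        # approx [16.12, -1.12, 0.0]
print(np.isclose(vals.sum(), np.trace(A)))         # sum of eigen values = trace
print(np.isclose(vals.prod(), np.linalg.det(A)))   # product = determinant
```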

12.1.3 Numerical Algorithms

Not done for S-2016.

12.2 Applications in Optimization
A number of problems lead to a formulation of the form: minimize or maximize xᵀAx, with the additional constraint that ∥x∥₂ = 1. We need the constraint to avoid trivial solutions.

We create an unconstrained optimization with the help of a Lagrangian:

    Minimize xᵀAx − λ(xᵀx − 1)

Differentiating with respect to x and equating to zero leads to

    Ax = λx

i.e., the optimum is an eigen vector of A.

If we substitute this in the objective function,


xT Ax = λxT x = λ
If the problem is maximization problem, then we will pick the eigen vector corresponding to the largest eigen value.
If the problem is minimization, we will pick the eigen vector corresponding to the smallest eigen value.

Another class of problems leads to a formulation of the form

    Ax = λBx

which we take up in Section 12.7.

12.3 Optimization: Application in Line Fitting


In the previous lectures, we had seen the problem of line fitting (as LP, as LSE). However, we always wondered: why don't we minimize the orthogonal distance?

Let us assume that we are given N points in 2D, i.e., (xi, yi). We are interested in finding the equation of a line ax + by + c = 0 that minimizes the sum of orthogonal distances. It can be shown that this line passes through the centroid of the points, so let us assume that the points are mean centered, i.e., the mean is zero. Now the line equation can be written as ax + by = 0. With no loss of generality, we can assume that the vector u = (a, b)ᵀ is normalized to unit norm.

Assume that the data is arranged in an N × 2 data matrix M. We are interested in:

    Minimize ∥Mu∥² such that ∥u∥ = 1

Let us take an SVD of M as UDVᵀ:

    ∥Mu∥² = (Mu)ᵀ(Mu) = uᵀMᵀMu = uᵀVDUᵀUDVᵀu = uᵀVD²Vᵀu

Since V is an orthogonal matrix, multiplication by Vᵀ does not change the length of u. Let v = Vᵀu. The problem is now:

    Minimize vᵀD²v such that ∥v∥ = 1

The minimum value is the square of the smallest singular value, and the optimal u is the column of V corresponding to the smallest singular value.

12.3.1 Relationship to SVD

Discussed along with the SVD. When A is not a square matrix, eigen vectors of AT A and AAT are related to the U
and V in the SVD.

12.4 Application in solving Ax = 0


Consider the problem of minimizing ∥Ax∥, where A is an m × n matrix with m > n. This immediately leads to a minimization problem of the form xᵀAᵀAx with the unit norm constraint on x, whose solutions are the eigen vectors of AᵀA.

However which eigen vector to pick? one corresponding to the smallest or largest?

12.5 Optimization: Application in PCA
Problem: Given a set of samples x1, x2, . . . , xN, each with M features (x_i^1, . . . , x_i^M), find a new feature representation (such that each new feature is a linear combination of the original features):

    y_i^1 = α_1^1 x_i^1 + α_2^1 x_i^2 + . . . + α_M^1 x_i^M
    y_i^2 = α_1^2 x_i^1 + α_2^2 x_i^2 + . . . + α_M^2 x_i^M
    . . .
    y_i^k = α_1^k x_i^1 + α_2^k x_i^2 + . . . + α_M^k x_i^M

i.e., Yi = AXi, where Yi is a k × 1 vector, A is a k × M matrix and Xi is an M × 1 vector (usually k < M).

Let the original vector xj be projected onto a new dimension (basis vector) ui as vj = xj · ui. It is easy to observe that the mean after the projection is the same as the projection of the original mean:

    v̄ = (1/N) ∑j vj = ((1/N) ∑j xj) · ui = x̄ · ui

or, collecting all the projection directions, v̄ = x̄ · [u1, . . . , uk] = Ux̄.

Objective

Let the objective be to maximize the variance after projection (so that only minimal information is lost). Let us first find the "best dimension" u in this regard, i.e.,

    Max_u var(v) = Max_u ∑i ∥vi − v̄∥²

Now,

    ∑i ∥vi − v̄∥² = ∑i ∥(xi − x̄) · u∥² = uᵀ [∑i (xi − x̄)(xi − x̄)ᵀ] u = uᵀΣu

so the problem is Max_u uᵀΣu.

An unconstrained maximization of this could give an arbitrarily large u. Therefore, we introduce the constraint uᵀu = 1, and the problem becomes:

    Maximize uᵀΣu − λ(uᵀu − 1)

Differentiating with respect to u and equating to zero:

    Σu = λu

i.e., u has to be an eigen vector of Σ. Substituting back into the objective, uᵀΣu = λuᵀu = λ, so the largest λ maximizes the objective. It can now easily be seen that the basis vectors which preserve maximum variance are the eigen vectors sorted in decreasing order of eigen value.
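A compact PCA sketch on synthetic data (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # N = 200 samples, M = 5 features
Xc = X - X.mean(axis=0)              # mean-center the data

Sigma = (Xc.T @ Xc) / len(Xc)        # M x M covariance matrix
vals, vecs = np.linalg.eigh(Sigma)   # symmetric eigendecomposition

order = np.argsort(vals)[::-1]       # eigen values in decreasing order
A = vecs[:, order[:2]].T             # top k = 2 eigen vectors as rows of A
Y = Xc @ A.T                         # projected features Y_i = A X_i
```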

12.6 Optimization: Graph Cuts and Clustering
Let us now look at another interesting application of eigen vectors. Consider a graph G = (V, E).

We are interested in partitioning the graph into two subsets of vertices A and B. Or, we want to find a cut, which is
the sum of edges that we need to cut.

This has many applications in clustering. Consider that we are given N points {x1, . . . , xN} and we are interested in clustering these points into two clusters. Assume A is an affinity (similarity) matrix, where Aij is the similarity of xi and xj. This is also, in a way, related to the weight matrix W of the graph. For intuition, one can define the weight wij as e^(−d(xi, xj)/σ).

Let wij be the weight of edge (i, j). We need a cut that partitions the vertices into two sets; it cuts a set of edges:

    Cut(A, B) = ∑_{i∈A, j∈B} wij

However, this is not very useful in many cases. A more useful measure is

    NCut(A, B) = Cut(A, B) · (1/Vol(A) + 1/Vol(B))

with

    Vol(A) = ∑_{i∈A, j∈V} wij = ∑_{i∈A} di

This normalization helps to avoid the degenerate cut that simply isolates an outlier point.

Let D be the diagonal matrix with D(i, i) = di = ∑j wij, and consider an N-dimensional vector x such that xi = 1 if the ith vertex is in A and xi = −1 if it is in B. Then

    NCut(A, B) = (∑_{xi>0, xj<0} −wij xi xj) / (∑_{xi>0} di)  +  (∑_{xi<0, xj>0} −wij xi xj) / (∑_{xi<0} di)

Rewriting in terms of W and D, this is

    NCut(x) = (1 + x)ᵀ(D − W)(1 + x) / (k · 1ᵀD1)  +  (1 − x)ᵀ(D − W)(1 − x) / ((1 − k) · 1ᵀD1)

Let y = (1/2)((1 + x) − b(1 − x)). Then

    min_x NCut(x) = min_y yᵀ(D − W)y / (yᵀDy)

with the condition that yᵀD1 = 0.

Minimizing this is done using the standard tricks:

    D^(−1/2)(D − W)D^(−1/2) z = λz,   where z = D^(1/2)y

Using the Rayleigh quotient, the eigen vector corresponding to the second smallest eigen value turns out to be the real-valued solution to the normalized cut problem.

12.7 Optimization: Generalized Eigen Value Problem


A very related problem is that of generalized eigen value problem involving two matrices.

Ax = λBx

When B = I, this problem reduces to the standard eigen value problem we saw earlier.

If we multiply both sides by B⁻¹ (or A⁻¹), we can get back the simple eigen value problem. However, the problem is that a matrix like B⁻¹A need not be symmetric.

A standard trick is to take a square root of B (either as B^(1/2) or via the Cholesky factorization B = LLᵀ): Ax = λLLᵀx, so L⁻¹Ax = λLᵀx, or

    L⁻¹AL⁻ᵀ (Lᵀx) = λ (Lᵀx)

i.e., A′y = λy, where A′ = L⁻¹AL⁻ᵀ and y = Lᵀx.
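SciPy solves the symmetric-definite generalized problem directly, and the Cholesky reduction above is essentially what happens internally; a sketch with illustrative matrices:

```python
import numpy as np
from scipy.linalg import eigh, cholesky

A = np.array([[2., 1.], [1., 3.]])
B = np.array([[4., 0.], [0., 1.]])   # symmetric positive definite

vals, vecs = eigh(A, B)              # solves Ax = lambda Bx directly

# The same via the Cholesky reduction: A' y = lambda y, A' = L^{-1} A L^{-T}
L = cholesky(B, lower=True)
Ap = np.linalg.solve(L, np.linalg.solve(L, A).T).T
vals2 = np.linalg.eigvalsh(Ap)
print(np.allclose(np.sort(vals), np.sort(vals2)))   # True
```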

12.8 Optimization: Spectral Graph Theory


Not done for S-2016.

Chapter 13

Introduction to simplex method

13.1 Introduction
In the previous lectures, we have studied several formulations of problems as LPs and some applications of LP. We have also seen how some toy LPs get solved on paper. We now describe an efficient way to solve linear programs. This method is popular and widely used in many real-life problems: the well-known simplex method. The graphical method to solve linear programs becomes tedious and computationally intensive for large linear programs (as the number of variables increases). As the number of variables increases, every new constraint adds an exponential number of vertices to be evaluated (see below). The graphical method basically requires enumerating and evaluating the objective on all the possible vertices or extreme points of the feasible region described by the constraints of the linear program. This makes the algorithm unsuitable for practical use in large linear programs. The simplex method describes a procedure to solve these linear programs efficiently. In this lecture, we start with a brief conceptual introduction to the simplex method, and then proceed to describe more formal details of the algorithm. In the next couple of lectures, we will see how the simplex method can be understood/derived, and how a tableau-based procedure can be used to solve it on paper.

We had argued in the past that the optimum is an extreme point of the convex polygon (feasible region) formed by the constraints. Based on this, let us state a naive version (Ver 0) of the simplex algorithm. This is a simplified and intuitive description of the simplex algorithm.

1. Form the feasible region, and start with any one of the extreme points.
2. Repeat:

• Move to one of the neighbouring extreme points which has a better objective value.
• If no such point exists terminate algorithm with the current extreme point as the optima

There are many important questions to be answered at this stage, both theoretical and practical. For example, will
this algorithm converge? Will this lead to a global optima? How to move from one extreme point to another?

13.1.1 Remark on LP as a Convex Optimization

The procedure start with any one of the extreme points. Then it move in one specific direction that improves the
ojective. Those who are familiar with the gradient descent style optimization may argue that we should pick the
direction of greatest change in the cost. That is fine. But there are more important aspects to look into. However, if
the objective function is convex (which is so in our case), it does not really matter. Eventually we will reach the same
extreme point (local as well as global optima). Given that linear programming problem is a convex optimization
problem, the solution found would be optimal. The above algorithm keeps finding a better solution till it reaches
a local minima (i.e none of its neighbours have a better solution). Therefor by the property of convex optimization
problems, this local optima is also the global optima that we are looking for.

The above algorithm aims at finding the best solution by exploiting the advantages of the convex optimization. A
linear program is a convex optimization problem. A convex optimization problem optimizes a convex function over
a convex region/set.

A set S (e.g., S ⊂ Rⁿ) is convex if, for any x1, x2 ∈ S, and for θ ∈ [0, 1],

x3 = θx1 + (1 − θ)x2

is also in S. This simply implies that the line joining x1 and x2 is also within the set.

Since our method searches over the set of extreme points, let us see how many extreme points an LP can have. In fact, there can be too many.

The feasible region for a linear programming problem, given by {x | Ax ≤ b}, where A is an m × n matrix and b ∈ Rᵐ, is a polyhedron. A polyhedron has a finite number of extreme points or vertices, bounded by

    mCn = m! / (n!(m − n)!)

which corresponds to choosing n of the m constraints to be tight (for m > n). As m and n increase, the value of mCn grows rapidly. Hence, for a general LP problem, the number of vertices can be very large. Let us not think of finding all of them and evaluating the objective on all of them.

13.1.2 Remark on Computational Complexity
1. Simplex is a generalization of triangles. You may see many related aspects of the simplex on
https://en.wikipedia.org/wiki/Simplex

2. TODO: More details on the vertices, faces and complexity.

13.1.3 Historical Notes on Algorithms that solve LP


• Simplex: invented by George Dantzig in 1947 (Stanford University).
• L. G. Khachiyan introduced an ellipsoid method (1979) that seemed to overcome some of the simplex method's
limitations. Complexity: O(n^6). Disadvantage: it runs with essentially the same complexity on all problems.
• Narendra K. Karmarkar of AT&T Bell Laboratories proposed in 1984 a very efficient interior-point algorithm.
Complexity: O(n^3.5). In empirical tests it performs competitively with the simplex method.

13.2 Standard Slack Form


The standard Linear Programming (LP) problem consists of a linear objective function, which is subject to a certain
number of linear constraints. In general, it is represented as:

Minimize: cT x
such that Ax ≤ b; x ≥ 0
where A is an m × n matrix, c ∈ Rn denotes the coefficients of the objective function, and b ∈ Rm is the vector of
constants. We are given A, b and c; we need to find the optimal x.

In order to use the simplex algorithm, we require the linear program to be cast in a specific format known as the
standard slack form. Therefore, conversion of the above inequality into an equality is now required. This is done by
adding a variable to each of the constraints. Hence, the LP problem becomes

Minimize: cT x
such that Ax = b
x≥0

For example, an inequality like


2x1 + 3x2 ≤ 4
is converted to an equality by adding the variable x3 . Thus, the inequality changes to
2x1 + 3x2 + x3 = 4
with x3 ≥ 0. The new variable x3 is also called the slack variable.

The slack form consists of a linear program’s constraints expressed as Ax = b. That is without the use of any
inequality expressions. The slack form of the linear program is expected in the following format :
Minimize: cT x
Subject To: Ax = b
x≥0

The conversion of the linear program to the slack form is done by introducing additional slack variables, for each
inequality constraint in the original standard form expression. For example, if the original linear program has m
constraints with n variables. The converted slack form’s constraints become:
a11 x1 + a12 x2 + . . . + a1n xn + xn+1 = b1
a21 x1 + a22 x2 + . . . + a2n xn + xn+2 = b2
...
am1 x1 + am2 x2 + . . . + amn xn + xn+m = bm

Therefore the new slack-form constraint matrix in Ax = b has dimensions m × (n + m), due to the m additional
variables introduced. These new variables are known as slack variables. Their values are set, based on the other
variables, to make up for the slack in an inequality, thereby allowing us to express the LP as an equality system.
This slack form will be very important for our formulation of the simplex method.
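As a small illustration, the conversion to slack form is purely mechanical. The sketch below is our own addition
(assuming numpy is available; all names are ours, not from the notes):

import numpy as np

def to_slack_form(A, b, c):
    # Convert: min c^T x s.t. Ax <= b, x >= 0
    # into:    min c'^T x' s.t. [A | I] x' = b, x' >= 0
    m, n = A.shape
    A_slack = np.hstack([A, np.eye(m)])          # append one slack column per constraint
    c_slack = np.concatenate([c, np.zeros(m)])   # slack variables carry zero cost
    return A_slack, b, c_slack

# the inequality 2x1 + 3x2 <= 4 from the text (with a dummy cost vector)
A_s, b_s, c_s = to_slack_form(np.array([[2.0, 3.0]]), np.array([4.0]), np.array([0.0, 0.0]))
print(A_s)   # [[2. 3. 1.]], i.e., 2x1 + 3x2 + x3 = 4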

13.3 Simplex as Search over Basic Feasible Solutions (BFS)


We had seen the significance of the extreme points (vertices of the convex polygon defined by Ax ≤ b) in understanding
the solution to the LP. Let us see what is the equivalent when the constraints are in the form of Ax = b.

• A basic solution to a system of m linear equations in n unknowns (n ≥ m) is obtained by setting n − m
variables to 0 and solving the resulting system to get the values of the other m variables.
• The variables set to 0 are called nonbasic variables; the variables obtained by solving the system are called
basic variables.
• A basic solution is called feasible if all its (basic) variables are nonnegative.

We know that the algorithm of enumerating all vertices to find the minimum is neither efficient nor attractive.
Instead, an alternative algorithm as below can be used (we also saw the same in the last section).

Simplex Ver 0 The intuitive steps of the algorithm are:

1. Start from any vertex as a possible solution.
2. If a neighbouring vertex is better, the solution is updated.
3. Repeat step 2 until there is no better vertex.

13.3.1 Basic Feasible Solution

We now assume that we have the linear program expressed in slack form as Ax = b, where A encodes m constraints on
n variables. In the slack form, due to the introduced variables, we have m ≤ n and therefore Ax = b has multiple
valid solutions. In such a situation, if we have two feasible solutions x1 and x2, then all x of the form

x = αx1 + (1 − α)x2, where 0 ≤ α ≤ 1

are also feasible. This will be an important property that we use in the simplex method to jump from one solution
to another (till we reach the optimum). Now for the LP problem, the extreme points or vertices are obtained when
the column vectors Ai corresponding to the non-zero basic variables are linearly independent. Such a feasible x is
known as a Basic Feasible Solution or BFS. A linear program can have multiple BFSs, and if the problem has an
optimal solution, then one of the BFSs is optimal. With these definitions we now look into the simplex method.

Let B be the set of m basic variables, i.e., B ⊆ {1, …, n} is a subset of the n variable indices. Let us set xi = 0
for all i ∉ B. We represent B as a set of indices of the selected variables, and we use the notation B(i) for the
i-th element of the set.

Now, assume A to be a matrix of rank m. We select m columns from A; then

[AB(1), AB(2), · · · , AB(m)] xB = b    (13.1)

where [AB(1), AB(2), · · · , AB(m)] is an m × m matrix, xB is an m × 1 vector and b is an m × 1 vector. If

B̄ = [AB(1), AB(2), · · · , AB(m)]

then

B̄ xB = b  =⇒  xB = B̄−1 b,  with xi = 0 for i ∉ B.    (13.2)

The elements of the vector x can now be taken from xB, with the rest set to zero.

Now, the feasible region defined by the LP problem Ax = b, x ≥ 0 is a convex polyhedron. So, x is a vertex if
and only if the column vectors Ai of the matrix A corresponding to the non-zero entries of x (xi ≠ 0) are linearly
independent. Such an x is called a Basic Feasible Solution (BFS). So, in this case, B ⊆ {1, …, n} indexes the basic
variables, and B̄ is a non-singular matrix. For a given linear programming problem, there can be many basic feasible
solutions. If an optimal solution exists for an LP, then one of the BFSs is optimal.
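To make this concrete, the following small sketch (our own addition, assuming numpy) computes the basic solution
for a chosen index set B and checks feasibility; the system is the one used in Example 5 below:

import numpy as np

def basic_solution(A, b, basis):
    # Solve B_bar x_B = b for the chosen columns; all other entries are 0.
    m, n = A.shape
    x = np.zeros(n)
    x[basis] = np.linalg.solve(A[:, basis], b)   # requires a non-singular basis matrix
    return x

A = np.array([[-1., 1., 1., 0., 0.],
              [ 1., 0., 0., 1., 0.],
              [ 0., 1., 0., 0., 1.]])
b = np.array([1., 3., 2.])
x = basic_solution(A, b, [2, 3, 4])    # the slack basis
print(x, "feasible:", bool(np.all(x >= 0)))   # [0. 0. 1. 3. 2.] feasible: True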

13.4 Simplex Algorithm - Ver 1


Now we can create various B̄ matrices and find the respective x. (There are nCm such matrices.) However, this is not
attractive. Our strategy is the following.

Simplex - Ver1

1. Start with a BFS.


2. Repeat
• If a better solution exists, move to that

Its practical implementation proceeds by finding the matrices B̄1 → B̄2 → B̄3 → B̄4 → B̄5 . . ., where a new B̄
is obtained by removing one column and adding a new column. Therefore, the simplex problem soon reduces
to finding the departing column (or variable) and the entering column (or variable). Often, the first vertex is
found easily as B̄ = I, where I is the identity matrix.

13.4.1 An Intuitive Version


1. Start with an Initial BFS.
2. Repeat:
• If we have a better BFS , move to that
• else stop with the present BFS and report it is as the optimum.

Often the initial BFS is chosen with B̄ = I, that is, by setting all the slack variables to be basic (non-zero),
while the remaining original variables are classified as non-basic and set to 0. (This is equivalent to the origin of
the convex polygon defined by the inequalities.) This usually (though not always) provides an easy-to-obtain
BFS with xB(i) = bi, as the rest of the variables in each constraint are non-basic and zero.

We design the simplex algorithm such that in each iteration exactly one non-basic variable becomes basic and one
basic variable leaves to become non-basic.

Now we proceed to formalize the process by which we switch between two Basic Feasible Solutions, making sure that
the new one still satisfies the original LP.

13.4.2 Moving across BFSs - Cost difference

Let d be a difference (direction) vector, and θ ≥ 0 be a scalar. The vector d takes us from one solution to another
as xnew = x + θd.

Let dj = 1 for some j ∉ B, and di = 0 for all i ∉ B with i ≠ j; i.e., only one variable is moving into B (and one
will move out).

Axnew = Ax = b  =⇒  A(x + θd) = Ax = b  =⇒  Ad = 0

∴ ∑_{i=1}^{m} dB(i) AB(i) + Aj = 0
=⇒ B̄ dB + Aj = 0
=⇒ dB = −B̄−1 Aj

Let us calculate the difference in the objective for a given choice of j:

C̄j = Cj − CBT B̄−1 Aj

Here C̄j is the change in cost due to the j-th variable becoming basic. We compute it for all the variables:

C̄ = [C̄1 C̄2 . . . C̄n]

If C̄ ≥ 0 (i.e., all the elements are nonnegative, so the objective cannot be reduced further), then our current
BFS is the optimum. Otherwise we choose a j such that C̄j < 0. We can have some nice heuristics for picking the
best j out of many possible ones. Such heuristics may work well in many situations, but not always. In any case,
we know that the problem is convex!

An important question not yet answered is the choice of θ. How do we fix θ? If θ is small, we move only by a small
amount. If we take too large a θ, we move out of the feasible region. The trick is finding the largest θ such that
we still remain in the feasible region. We show below how to find the θ that takes us as far as possible without
violating the constraints. This determines a variable l that exits the basic variable set. In short, our objective
is to find the j that enters and the l that exits in every iteration.

13.4.3 Simplex Ver 1: A more formal version


1. Start with an initial BFS (set B, etc.).
2. Repeat:

• Calculate C̄; if C̄ ≥ 0, then stop repeating.
• Else, select a j from C̄ such that C̄j < 0.
• Set dB = −B̄−1 Aj.
• Find l and θ such that θ is increased continuously until, for the first l, the basic variable xl becomes zero, i.e.,

xl + dl θ = 0

• l exits B, and j enters B:

B = B − {l} + {j}

3. The current BFS is the optimum.
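Putting the pieces together, here is a deliberately naive numpy sketch of the loop above. It is our own
illustration (not code from the notes); it assumes a starting basis is supplied and ignores degeneracy and
numerical-robustness issues that a real implementation must handle:

import numpy as np

def simplex(A, b, c, basis, max_iter=100):
    # Minimize c^T x subject to Ax = b, x >= 0, starting from the given basis (a list).
    m, n = A.shape
    for _ in range(max_iter):
        B_inv = np.linalg.inv(A[:, basis])
        x_B = B_inv @ b
        c_bar = c - A.T @ (B_inv.T @ c[basis])   # reduced costs c_j - c_B^T B^-1 A_j
        if np.all(c_bar >= -1e-9):               # no improving direction: optimal
            x = np.zeros(n); x[basis] = x_B
            return x, c @ x
        j = int(np.argmin(c_bar))                # entering variable
        d_B = -B_inv @ A[:, j]                   # movement of the basic variables
        if np.all(d_B >= -1e-9):
            raise ValueError("problem is unbounded")
        ratios = np.where(d_B < -1e-9, -x_B / d_B, np.inf)
        l = int(np.argmin(ratios))               # leaving position; theta = ratios[l]
        basis[l] = j
    raise RuntimeError("iteration limit reached")

For the slack-form system of Example 5 below, simplex(A, b, c, basis=[2, 3, 4]) starts from the all-slack BFS.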

13.5 Examples
Example 5. Consider an example to obtain an optimal solution by checking all possible basic feasible solutions.

Minimize: −x1 − x2
such that −x1 + x2 ≤1
x1 ≤3
x2 ≤2
x1 , x 2 ≥0

The inequalities are converted to equalities by adding a slack variable to each of the constraints. This results in

Minimize: −x1 − x2
such that
−x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2
x1, x2, x3, x4, x5 ≥ 0

Figure 13.1: The constraints on a 2D plane. One can see the extreme points of the original problem. Also
notice the relationship to the Basic Feasible Solutions.

Here,

A = [−1 1 1 0 0
      1 0 0 1 0
      0 1 0 0 1],  x = [x1, x2, x3, x4, x5]T,  b = [1, 3, 2]T

Since

Ax = b    (13.3)

we have

[−1 1 1 0 0
  1 0 0 1 0
  0 1 0 0 1] [x1, x2, x3, x4, x5]T = [1, 3, 2]T

Initially, the basic feasible solution is x = [0, 0, 1, 3, 2]. As the 3rd , 4th and the 5th columns of the matrix A form an
identity matrix (I), B = {3, 4, 5}. The objective function, z = 0.
Rearranging the equations so as to write them in terms of x3 , x4 and x5 ,

x3 = 1 + x1 − x2
x4 = 3 − x1
x5 = 2 − x2
z = −x1 − x2
=⇒ z =0

x2 can increase up to 1 while keeping x1 = 0. Thus, the equations transform as,

x2 = 1 + x1 − x3
x4 = 3 − x1
x5 = 1 − x1 + x3
z = −x1 − x2
= −x1 − (1 + x1 − x3 )
= −1 − 2x1 + x3
=⇒ z = −1 − 2(0) + 0
= −1

In this case, 2 enters and 3 exits. Therefore, x = [0, 1, 0, 3, 1], B = {2, 4, 5} and z = −1.

Figure 13.2: Navigation of the vertices of the feasible region.

Figure 13.1 shows the navigation through the basic feasible solutions to reach an optimal solution.

Now, x1 can be increased to 1. So, the equations transform as,

x1 = 1 + x3 − x5
x2 = 2 − x5
x4 = 2 − x3 + x5
z = −1 − 2x1 + x3
= −1 − 2(1 + x3 − x5 ) + x3
= −3 − x3 + 2x5
=⇒ z = −3 − 0 + 2(0)
= −3

In this case, 1 enters and 5 exits. Therefore, x = [1, 2, 0, 2, 0], B = {1, 2, 4} and z = −3.

It can be seen that z can be reduced further by increasing x3. So, the equations transform as,

x1 = 3 − x4
x2 = 2 − x5
x3 = 2 − x4 + x5
z = −x1 − x2
  = −(3 − x4) − (2 − x5)
  = −5 + x4 + x5
=⇒ z = −5 + 0 + 0
     = −5

In this case, 3 enters and 4 exits. Therefore, x = [3, 2, 2, 0, 0], B = {1, 2, 3} and z = −5. z cannot be minimized
further. Hence, the final minimized value of z is −5.
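As a sanity check (our own addition, assuming scipy is available), the same answer comes out of an off-the-shelf
solver:

from scipy.optimize import linprog

res = linprog(c=[-1, -1],
              A_ub=[[-1, 1], [1, 0], [0, 1]],
              b_ub=[1, 3, 2],
              bounds=[(0, None), (0, None)])
print(res.x, res.fun)   # [3. 2.] -5.0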

Example 6. Find a sequence of BFS and arrive at an optimal point.

Minimize: z = x1 + x2
such that x1 + 5x2 ≤5
2x1 + x2 ≤4
x1 , x2 ≥0

The graphical representation of the given problem is as follows:

Variables are introduced to convert the inequalities to equalities.

Figure 13.3: Graphical representation.

Minimize: z = x1 + x2
such that
x1 + 5x2 + x3 =5
2x1 + x2 + x4 =4
x1 , x2 ≥ 0

So,

A = [1 5 1 0
     2 1 0 1],  x = [x1, x2, x3, x4]T,  b = [5, 4]T

and Ax = b gives

[1 5 1 0
 2 1 0 1] [x1, x2, x3, x4]T = [5, 4]T

The initial BFS is x = [0, 0, 5, 4]; with B̄ = I, B = {3, 4}. After rearrangement of the equations,

x3 = 5 − x1 − 5x2
x4 = 4 − 2x1 − x2
z = x1 + x2
=⇒ z =0

Since the coefficients of x1 and x2 in z are both positive, increasing either of them will increase the value of z.
So, the optimal solution is already reached, with z = 0.

13.6 Additional Problems
1. Find a sequence of BFS to arrive at the optimal point for the following Linear program:

Minimize : z = −x1 − x2
Subject to :
x1 + 10x2 ≤ 5
4x1 + x2 ≤ 4
x1 , x 2 ≥ 0

2. Find a sequence of BFS to arrive at the optimal point for the following Linear program:

Minimize: z = 2x1 + 3x2


such that 5x1 + 25x2 ≤ 40
x1 + 3x2 ≥ 20
x1 + x2 = 10
x1 , x2 ≥0

Chapter 14

More on Simplex

14.1 Introduction

In the last lecture, we saw a basic version of simplex. That method involved moving from one extreme point of
the feasible polyhedron to another, finally stopping at the extreme point corresponding to the optimal solution,
when no such adjacent movement reduces the cost. We saw this in an example, and we know how to apply the method
without getting into much of the underlying theory. In this lecture we develop the formal mathematical details
that guide the simplex method, so that we can apply it with more confidence.

The simplex method is based on the fact that if an optimal solution exists then there is a basic feasible solution
(BFS) that is optimal. It searches for an optimal solution by moving from one basic feasible solution to another,
along the edges of the feasible set, always moving in a cost-reducing direction. Eventually, a special basic
feasible solution is reached at which none of the available edges leads to a cost reduction. Such a basic feasible
solution is optimal and the algorithm terminates.

To start the simplex problem we need to know at least one basic feasible solution. Simplex often starts with B̄ = I.

14.2 Basics

Let us consider the following problem

Minimize: cT x
such that Ax = b

We assume that the dimensions of the matrix A are m × n where m < n. Here, Ai is the ith column of the matrix
A. Let P be the corresponding feasible set. Note that the problem is stated in the standard slack form.

The simplex method searches for an optimal solution by navigating from one vertex to another, along the edges of the
feasible region such that the cost is reduced. When we reach a basic feasible solution at which there is no neighbour
that has a lesser cost, we have reached the minima, and the algorithm terminates.

A polyhedron is a set that can be described in the form {x ∈ Rn | Ax ≤ b}, where A is an m × n matrix and b is a
vector in Rm.

We need to find the vector x′ such that x′ ∈ P and cT x′ is minimized. Instead of directly working with inequalities
we introduce slack variables and convert the given problem into the standard slack form. A vector x∗ ∈ Rn is a
Basic Feasible Solution (BFS) if all the constraints are satisfied (i.e., it is a feasible solution), and n − m of
its elements are zero (it is a basic solution).

For the rest of lecture we assume our problem has been already converted into the standard form, with A being
m × n matrix representing m equalities in n variables. All the m rows being linearly independent.

We define a set of basic variables as B ⊂ {1, 2, ..., n}, such that

• i ∈ B =⇒ xi is a basic variable.
• i ∉ B =⇒ xi = 0 and xi is a nonbasic variable.

As discussed in the last lecture, the simplex algorithm needs to find two variables in each iteration: one that
enters the basic variable set (i.e., j) and one that exits it (i.e., l). Here is a summary of the process.
(We will go into the details and recall the notation in the next section.)

Finding l and θ once we know j:

1. Compute d using dB = −B̄−1 Aj, dj = 1 and di = 0 for all i ≠ j with i ∉ B.
2. As we have xnew = x + θd, if some di < 0, then θ is limited by xi ≥ 0:
   θ = min{−xi/di : i ∈ B, di < 0}
3. l is the index where this minimum is attained.

14.3 How simplex method works?
Suppose that we are at a point x ∈ P and that we want to move away from x, in the direction of a vector d ∈ Rn . We
should move in a direction that does not take us outside the feasible region. d ∈ Rn is said to be a feasible direction
at x, if there exists a positive scalar θ for which

x + θd ∈ P. (14.1)

Let x be a basic feasible solution to our problem, let B(1), …, B(m) be the indices of the basic variables, and let

B̄ = [AB(1) · · · AB(m)]    (14.2)

be the corresponding basis matrix, so that xi = 0 for the non-basic variables (i ∉ B), while the basic variables
are given by

xB = B̄−1 b    (14.3)

Simplex Algorithm:

1. Start with a BFS.


2. Find θ and d such that
xi+1 ←− xi + θd (14.4)
if the cost is reduced.

Note:

1. Often the first vertex is found easily when B̄ is the identity matrix.
2. In each iteration, one new variable enters the basis and one exits.

Consider a new point y = x + θd. Now, select a non-basic variable xj (j ∉ B) and change its value to a positive
value θ. Let the other non-basic variables xi (i ∉ B, i ≠ j) remain zero. That is, dj = 1 and di = 0.

We know that,

Ax = b and Ay = b
A(x + θd) = b (14.5)
∴ Ad = 0

When j enters B:

dj = 1,  B̄ xB = b,  xB = B̄−1 b    (14.6)

The l-th column goes out of the basis and the j-th column comes in. After the exchange, the new basic values are

xB = B̄new−1 b    (14.7)

Recall now that dj = 1, and that di = 0 for all other non-basic indices i. Then, since the dj are scalars,

Ad = 0  =⇒  ∑_{j=1}^{n} dj Aj = 0.

Splitting the summation into parts for the basic and non-basic variables, we get

∑_{i∈B} di Ai + dj Aj + ∑_{i∉B, i≠j} di Ai = 0.

We know that dj = 1 and, for i ∉ B with i ≠ j, di = 0. Therefore

∑_{i∈B} di Ai + 1 · Aj = 0
∑_{i=1}^{m} dB(i) AB(i) + Aj = 0
B̄ dB + Aj = 0
Aj = −B̄ dB
dB = −B̄−1 Aj    (14.8)

Next, we need to find θ, i.e., how far to move in the feasible direction. Given that y = x + θd, each component
changes as

xi_new = xi + θ di    (14.9)

If di < 0, then θ is limited by xi ≥ 0. Therefore

θ = min_{i∈B, di<0} {−xi/di}    (14.10)

Let l be the minimizing index. Then

xl_new = 0,  and the new basis matrix is B̄ = [AB(1) · · · Aj · · · AB(m)]  (with Aj in position l).

Change (reduction) in cost due to dj = 1: let cj be the cost coefficient of the entering variable. The rate of
cost change along direction d is given by

cT d = cBT dB + cj    (14.11)

Using Eq. (14.8), we get

cT d = cj − cBT B̄−1 Aj    (14.12)

cj is the cost per unit increase in the variable xj, and the term −cBT B̄−1 Aj is the cost of the compensating
change in the basic variables brought about by the constraint Ax = b. Thus

c̄j = cj − cBT B̄−1 Aj    (14.13)

is the reduction in cost due to j entering B, and

c̄ = [c̄1 · · · c̄n]    (14.14)

If c̄ ≥ 0, the solution is optimal, so we stop. Else, we pick a j such that c̄j < 0.

The cost is cT x; changing the BFS x to y = x + θd changes the cost by θ cT d.

Let C̄j denote the cost change due to j entering B, and let C̄ = [C̄1, ..., C̄n] denote the vector of cost changes
corresponding to each j.

Lemma 14.3.1. If C̄j ≥ 0 for every C̄j ∈ C̄, then the current BFS x is optimal.

If the above lemma holds, stop the algorithm and report the solution. Else there exists j such that C̄j < 0, where

C̄j = CT d = CBT dB + Cj   (as j is entering B, dj = 1 and di = 0 for i ∉ B, i ≠ j)
    = Cj − CBT B̄−1 Aj

14.3.1 Simplex: Steps

Simplex method:
1. Start with x as a BFS.
2. Compute c̄.
3. If c̄j ≥ 0 for every j, then stop and report x as the solution.
   Else pick j such that c̄j < 0 and compute dB = −B̄−1 Aj.
4. Find l and θ such that xnew = x + θd, where j enters B and l leaves B.
5. Update x, B, B̄ and go to step 2.

14.3.2 Computing B̄−1

B and B̄ differ only in one column:

B = [AB(1) · · · AB(l) · · · AB(m)]  and  B̄ = [AB(1) · · · AB(l−1) Aj AB(l+1) · · · AB(m)]

Given B−1, how can we efficiently compute B̄−1? We know

B−1 B = I = [e1 e2 · · · em],  while  B−1 B̄ = [e1 · · · el−1 u el+1 · · · em],  where u = B−1 Aj.

Only the l-th column of the RHS differs from the identity. Now we apply elementary row transformations that turn
the RHS back into I; applying the same row operations to B−1 turns it into B̄−1.

Given B−1, can we find B̄−1 efficiently?

1. We start with B−1 B = I = [e1, ..., em].
2. Replacing B with B̄, we get B−1 B̄ = [e1, ..., u, ..., em] (as only one column of the second matrix has changed,
and each column of the product matrix is determined by the corresponding column of the second matrix).
3. Apply row operations to the product matrix, changing the column u to el, and apply the same row operations to
the first matrix. After at most m row operations this is converted to the form B̄−1 B̄ = I = [e1, ..., el, ..., em],
which can be used in the next iteration directly.
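The update above can be collected into a single elementary matrix. The following numpy sketch is our own
illustration of the idea (not code from the notes); it returns B̄−1 from B−1 in O(m²) instead of re-inverting:

import numpy as np

def update_inverse(B_inv, A_j, l):
    # B_bar replaces column l of B by A_j; return B_bar^{-1}.
    u = B_inv @ A_j            # u = B^{-1} A_j (pivot column)
    E = np.eye(len(u))
    E[:, l] = -u / u[l]        # the row operations that send u to e_l ...
    E[l, l] = 1.0 / u[l]       # ... collected into one elementary matrix
    return E @ B_inv           # same row operations applied to B^{-1}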

14.3.3 Simplex Algorithm


1. Let x be a basic feasible solution to the standard form problem, let B(1), …, B(m) be the indices of the basic
variables, and let B̄ = [AB(1) … AB(m)] be the corresponding basis matrix. In particular, we have xi = 0 for
every nonbasic variable, while the vector xB = (xB(1), …, xB(m)) of basic variables is given by xB = B̄−1 b.
2. Compute the reduced costs c̄j = cj − cBT B̄−1 Aj for all nonbasic indices j. If they are all nonnegative, the
current basic feasible solution is optimal, and the algorithm terminates; else, choose some j for which c̄j < 0.
3. Compute u = B̄−1 Aj. If no component of u is positive, we have θ∗ = ∞, the optimal cost is −∞, and the
algorithm terminates.
4. If some component of u is positive, let θ∗ = min{xB(i)/ui : i = 1, ..., m, ui > 0}.
5. Let l be the index attaining this minimum. Form a new basis by replacing AB(l) by Aj. If y is the new basic
solution, then yj = θ∗ and yB(i) = xB(i) − θ∗ ui for i ≠ l.
6. Repeat from step 1 with y in place of x until the algorithm terminates.

14.4 On Computational Complexity


TODO

14.5 Example Problems
Example 7. Consider the linear programming problem
minimize c1 x1 + c2 x2 + c3 x3 + c4 x4
subject to x1 + x2 + x3 + x4 = 2
2x1 + 3x3 + 4x4 = 2

x1 , x2 , x3 , x4 ≥ 0.

The first two columns of the matrix A are A1 = (1, 2) and A2 = (1, 0). Since they are linearly independent, we can
choose x1 and x2 as our basic variables. The corresponding basis matrix is
B = [1 1
     2 0]
We set x3 = x4 = 0, and solve for x1, x2, to obtain x1 = 1 and x2 = 1. Therefore, we have obtained a non-degenerate
basic feasible solution. A basic direction corresponding to an increase in the non-basic variable x3 is constructed
as follows. We have d3 = 1 and d4 = 0. The direction of change of the basic variables is obtained using Eq. (14.8)
as follows:
dB = [d1; d2] = [dB(1); dB(2)] = −B−1 A3 = −[0 1/2; 1 −1/2][1; 3] = [−3/2; 1/2]
The cost of moving along this basic direction is cT d = −3c1 /2 + c2 /2 + c3 . This is the same as the reduced cost
of the variable x3 . Suppose that c = (2, 0, 0, 0), in which case, we have c¯3 = −3. Since c¯3 is negative, we form the
corresponding basic direction, which is d = (−3/2, 1/2, 1, 0), and consider vectors of the form x + θd, with θ ≥ 0.
As θ increases, the only component of x that decreases is the first one (because d1 < 0). The largest possible value
of θ is given by θ∗ = −(x1 /d1 ) = 2/3.
This takes us to the point y = x + 2d/3 = (0, 4/3, 2/3, 0).
Note that the columns A2 and A3 corresponding to the non-zero variables at the new vector y are (1, 0) and (1, 3),
respectively, and are linearly independent.
Therefore, they form a basis and the vector y is a new basic feasible solution. In particular, the variable x3 has
entered the basis and the variable x1 has exited the basis.
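The arithmetic above is easy to verify numerically (a small check we added, assuming numpy):

import numpy as np

B = np.array([[1., 1.], [2., 0.]])
A3 = np.array([1., 3.])
d_B = -np.linalg.solve(B, A3)
print(d_B)                        # [-1.5  0.5]
theta = -1.0 / d_B[0]             # x1 = 1 limits the step: theta* = 2/3
x = np.array([1., 1., 0., 0.])
d = np.array([d_B[0], d_B[1], 1., 0.])
print(x + theta * d)              # [0. 1.3333... 0.6666... 0.]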
Example 8. Let x be an element of standard form polyhedra P = {x ∈ Rn | Ax = b, x ≥ 0}. Prove that a vector
d ∈ Rn is a feasible direction at x iff Ad = 0 and di ≥ 0 ∀ i s.t. xi = 0.

Let x, y ∈ Rn with y = x + θd, where d indicates the direction in which each component moves (forward or backward)
and θ is the scaling factor (the speed at which we move).
For the forward direction of the proof, we assume d to be a feasible direction vector and show that the stated
conditions hold:
Ay = b
=⇒ A(x + θd) = b
=⇒ Ax + θAd = b
=⇒ θAd = 0
=⇒ Ad = 0   (as θ is a scaling factor and θ ≠ 0)
Also, for every i with xi = 0, xi cannot decrease further, so it can only increase or stay at zero. In formal
terms, di ≥ 0. This proves the forward direction.

For the converse, let d ∈ Rn be such that Ad = 0 and di ≥ 0 for all i with xi = 0; we show that d is a feasible
direction vector.
Ad = 0
=⇒ θAd = 0   (multiplying both sides by the scalar θ)
=⇒ Ax + θAd = Ax = b   (adding Ax to both sides)
=⇒ A(x + θd) = b
Let y = x + θd; since Ay = b and, for small enough θ > 0, y ≥ 0, we have y ∈ P. So d is a feasible direction
vector. This completes the proof.

Example 9. Let P = {x ∈ R3 | x1 + x2 + x3 = 1, x ≥ 0} be a polyhedron. Consider the vertex x = [0, 0, 1]T. Find the
set of feasible directions at x.

Using the result of the previous proof, we know that d is a feasible direction vector if Ad = 0 and di ≥ 0 ∀ xi = 0.
We get,
Ad = 0 =⇒ d1 + d2 + d3 = 0
Here x = [0, 0, 1]T , so x1 = x2 = 0 =⇒ d1 , d2 ≥ 0.
So all the feasible direction vector at given x, is given by the polyhedron

D = {d ∈ R3 | d1 + d2 + d3 = 0, d1 , d2 ≥ 0}

Example 10. Find a sequence of BFS and arrive at the optimal point.

minimize x1 + x2
s.t.
x1 + x2 ≤ 2
2x1 ≤ 2
x1, x2 ≥ 0

The above problem is converted from inequality to equality form:

minimize x1 + x2
s.t.
x1 + x2 + x3 = 2
2x1 + x4 = 2
x1, x2, x3, x4 ≥ 0

Step 1: This is the initial stage, at which x1 = 0 and x2 = 0:

x3 = 2 − x1 − x2
x4 = 2 − 2x1
z = x1 + x2

x = [0, 0, 2, 2],  B = {3, 4},  z = 0

Now both x1 and x2 have positive coefficients in z, so we cannot decrease z further, since all xi are constrained
to be nonnegative. The initial BFS is already optimal, with z = 0.

Chapter 15

Simplex Method Tableaux

15.1 Simplex
This is an incomplete version. There are inconsistencies in the notation and some errors remain; use
the textbook for careful study.

15.1.1 Simplex: Summary


1. We start with a basis B̄ = [AB(1), . . . , AB(m)] and associated solution x.
2. Compute the reduced cost c̄j = cj − cBT B̄−1 Aj for each non-basic variable j. If they are all nonnegative, the
current solution is optimal, so exit; else choose j such that c̄j < 0.
3. Compute u = B̄−1 Aj. If no component of u is positive, we have θ∗ = ∞ and the optimal cost is −∞. Exit.
4. If some component ui is positive, then θ∗ = min{xB(i)/ui : ui > 0}.
5. If l is the index attaining the minimum, then l exits and j enters. Form a new basis by replacing AB(l) by Aj.

The new solution y satisfies yj = θ∗ and yB(i) = xB(i) − θ∗ ui for i ≠ l.

To find B̄−1 given B−1, use B−1 B̄ = [e1, e2, ..., u, ..., em] and row-reduce, as described in the previous chapter.

The tableau has the following format:

−cBT B̄−1 b | cT − cBT B̄−1 A
B̄−1 b      | B̄−1 A

15.2 Simplex Summary

15.2.1 Intuition and Design of Simplex Method

minimize cT x
subject to Ax ≥ b
x ≥ 0

To summarize the intuition: we choose a corner by choosing n planes (assuming the intersection is feasible). If
we remove one plane, the intersection of the remaining n − 1 planes gives an edge (to stay in the feasible set we
can move in only one direction along it). If we add another plane to the other n − 1 planes such that the new
intersection of n planes is in the feasible set, then we move to a neighbouring corner. Just moving to a
neighbouring corner is not our aim by itself; this idea is used to move from one corner to another along an edge
which reduces the cost.

The main idea of simplex is to move from one corner to another along an edge of the feasible set such that the cost
decreases. A corner is the meeting of n different planes (each plane given by an equation). Each corner of the
feasible set comes from turning n of the n + m inequalities Ax ≥ b and x ≥ 0 into equations and finding the
intersection of these n planes.

To understand how simplex works, we introduce nonnegative slack variables w = Ax − b, i.e., Ax − w = b, which gives
us [A −I][x; w] = b; renaming [A −I] as A and [x; w] as x, we have: minimize cT x, subject to Ax = b, x ≥ 0. The
intersection of n planes gives a corner. Suppose one of the n intersecting planes is removed. The points that
satisfy the remaining n − 1 equations form an edge that comes out of the corner. This edge is the intersection of
n − 1 planes. To stay in the feasible region, only one direction is allowed along each edge, which is determined
by Phase II of the simplex, discussed later in this lecture.
In this new setting, we can observe the same thing. A corner is now a point where n components of the new x are
zero. These n components are the free variables of Ax = b, set to zero. The remaining m components are the basic
(pivot) variables. The basic solution is a genuine corner if its m nonzero components are positive. We get a
corner when x has n zero components. Now suppose we let one of the zero components become positive and set one of
the m nonzero components to zero. The freed component can take a value, and the system has one degree of freedom;
its value is adjusted so as to reduce the cost cT x. Which corner do we land at? We wanted to move from one corner
to a neighbouring one along an edge. Since the two corners are neighbours, m − 1 basic variables remain basic,
while one variable moves up from zero to become basic. The values of the other m − 1 basic components change but
remain positive. The choice of edge decides which variable enters the basis and which leaves (and vice versa).
The basic variables are computed by solving Ax = b; the free components are set to zero.

A corner is degenerate if more than the usual n components of x are zero. The basic set might then change without
actually moving from the corner.

A way to organize all this is the tableau, discussed in the next subsection.

15.3 The Tableau


A move from one corner to another along an edge is a step of the simplex algorithm. Each simplex step involves a
decision followed by row operations: the entering and leaving variables have to be chosen, and they have to be
made to come and go. A way to organize this is to fit A, b, c into a large matrix called the tableau.

Tableau (m + 1 by m + n + 1):   T = [A b; c 0]

At the start, basic variables may be mixed with free variables. Renumbering if necessary, say x1, x2, ..., xm are
the basic variables. The first m columns of A form a square matrix B (the basis matrix for that corner). The last
n columns give an m by n matrix N. The cost vector c splits into [cB cN], and the unknowns x into (xB, xN).

At the corner, the free variables are xN = 0, so Ax = b becomes BxB = b.

Tableau at corner:   T = [B N b; cB cN 0],   xN = 0,  xB = B−1 b,  cost = cB B−1 b
The basic variables stand alone when elimination multiplies by B−1.

Reduced tableau:   T′ = [I B−1N B−1b; cB cN 0]

To reach the fully reduced row echelon form R = rref(T), subtract cB times the top block row from the bottom row.

Fully reduced tableau:   R = [I B−1N B−1b; 0 cN − cB B−1N −cB B−1 b]

Constraints: xB + B−1 N xN = B−1 b.   Corner: xB = B−1 b, xN = 0.
The cost cB xB + cN xN converts to:
Cost: cT x = (cN − cB B−1 N) xN + cB B−1 b.   Cost at this corner = cB B−1 b.
All the information is in R. We can decide whether the corner is optimal by looking at r = cN − cB B−1 N in the
middle of the bottom row. If any entry in r is negative, the cost can still be reduced: we can make r xN negative
by increasing a component of xN. That is our next step. If r ≥ 0, the best corner has been found; this is the
optimality condition. Negative components of r correspond to edges along which the cost goes down.

Suppose ri is one of the negative entries of r (possibly the most negative). Then the i-th component of xN is the
entering variable, which increases from zero to a positive value α at the next corner. As xi increases, the other
components of x may decrease to maintain Ax = b. The xk that reaches zero first becomes the leaving variable (it
changes from basic to free). We have reached the new corner when a component of xB drops to zero. The new corner
is feasible because we have x ≥ 0. It is basic because we have n zero components: the i-th component of xN went
from zero to α, and the k-th component of xB dropped to zero.

At the new corner:  xi = α = min_j (B−1b)_j / (B−1u)_j = (B−1b)_k / (B−1u)_k,

where the minimum is over the positive components of B−1u, and B−1u is the column of B−1N in the reduced tableau R
above the selected negative entry ri in the bottom row. If B−1u ≤ 0, the next corner is infinitely far away and
the minimal cost is −∞.

In practice, we can do the tableau method without grouping the basic and non-basic columns, simply performing row
operations as in the examples below.

TODO: the last row in the above should be kept as the first row to be consistent with the tableaux below.
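Since each tableau step is just a pivot (scale the pivot row, clear the pivot column) followed by the
entering/leaving decision, the whole method fits in a few lines. The sketch below is our own illustration
(numpy assumed; T must be a float array, with the cost row stored last):

import numpy as np

def pivot(T, row, col):
    # Scale the pivot row to make T[row, col] = 1, then zero out the rest of the column.
    T[row] /= T[row, col]
    for r in range(T.shape[0]):
        if r != row:
            T[r] -= T[r, col] * T[row]

def tableau_simplex(T):
    # T holds the [A | b] rows followed by the cost row [c | value].
    while True:
        cost = T[-1, :-1]
        if np.all(cost >= -1e-9):
            return T                                  # optimal tableau
        col = int(np.argmin(cost))                    # entering column
        column, rhs = T[:-1, col], T[:-1, -1]
        ratios = np.where(column > 1e-9, rhs / column, np.inf)
        if np.all(np.isinf(ratios)):
            raise ValueError("problem is unbounded")
        pivot(T, int(np.argmin(ratios)), col)         # leaving row by the ratio test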

15.4 Example Problems


Example 11. We start with a problem that we have solved in the past.

Min −x1 − x2

−x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2

C = [−1 −1 0 0 0]

A = [−1 1 1 0 0
      1 0 0 1 0
      0 1 0 0 1],  b = [1, 3, 2]T

x1 x2 x3 x4 x5
0 -1 -1 0 0 0
x3 1 -1 1 1 0 0
x4 3 1 0 0 1 0
x5 2 0 1 0 0 1

[CT -CTB B−1 A] < 0 for both x1 and x2, so we can select either as the entering variable; let x2 be selected.
(xB(i)/ui) is least with ui positive for the x3 row, so x3 leaves. After performing row operations we get

x1 x2 x3 x4 x5
1 -2 0 1 0 0
x2 1 -1 1 1 0 0
x4 3 1 0 0 1 0
x5 1 1 0 -1 0 1
[CT -CTB B−1 A] < 0 for x1, so it is the entering variable.
(xB(i)/ui) is least with ui positive for the x5 row, so x5 leaves. After performing row operations we get

x1 x2 x3 x4 x5
3 0 0 -1 0 2
x2 2 0 1 0 0 1
x4 2 0 0 1 1 -1
x1 1 1 0 -1 0 1

[CT -CTB B−1 A] < 0 for x3, so it is the entering variable.
(xB(i)/ui) is least with ui positive for the x4 row, so x4 leaves. After performing row operations we get

x1 x2 x3 x4 x5
5 0 0 0 1 1
x2 2 0 1 0 0 1
x3 2 0 0 1 1 -1
x1 3 1 0 0 1 0

[CT -CTB B−1 A] ≥ 0 for all the variables, so we can end the process.
(x1, x2, x3, x4, x5) = (3, 2, 2, 0, 0)
Objective = −5

Example 12. Min -10x1 -12x2 -12x3


x1 +2x2 +2x3 ≤ 20
2x1 +x2 +2x3 ≤ 20
2x1 +2x2 +x3 ≤ 20
x1 ,x2 ,x3 ≥ 0

Solution:

x1 +2x2 +2x3 +x4 = 20
2x1 +x2 +2x3 +x5 = 20
2x1 +2x2 +x3 +x6 = 20

C = [−10 −12 −12 0 0 0]

A = [1 2 2 1 0 0
     2 1 2 0 1 0
     2 2 1 0 0 1],  b = [20, 20, 20]T

x1 x2 x3 x4 x5 x6
0 -10 -12 -12 0 0 0
x4 20 1 2 2 1 0 0
x5 20 2 1 2 0 1 0
x6 20 2 2 1 0 0 1

[CT -CTB B−1 A] < 0 with the smallest magnitude for x1, so it is chosen as the entering variable.
(xB(i)/ui) is least with ui positive for the x5 row, so x5 leaves. After performing row operations we get

x1 x2 x3 x4 x5 x6
100 0 -7 -2 0 5 0
x4 10 0 1.5 1 1 -0.5 0
x1 10 1 0.5 1 0 0.5 0
x6 0 0 1 -1 0 -1 1

[CT -CTB B−1 A] < 0 with the smallest magnitude for x3, so it is the entering variable.
(xB(i)/ui) is least with ui positive for the x4 row, so x4 leaves. After performing row operations we get

x1 x2 x3 x4 x5 x6
120 0 -4 0 2 4 0
x3 10 0 1.5 1 1 -0.5 0
x1 0 1 -1 0 -1 1 0
x6 10 0 2.5 0 1 -0.5 1

[CT -CTB B−1 A] < 0 for x2, so it is the entering variable.
(xB(i)/ui) is least with ui positive for the x6 row, so x6 leaves. After performing row operations we get

x1 x2 x3 x4 x5 x6
136 0 0 0 3.6 1.6 1.6
x3 4 0 0 1 0.4 0.4 -0.6
x1 4 1 0 0 -0.6 0.4 0.4
x2 4 0 1 0 0.1 -0.6 0.4

[CT -CTB B−1 A] ≥ 0 for all the variables, so we can end the process.
(x1, x2, x3, x4, x5, x6) = (4, 4, 4, 0, 0, 0)
Objective = −136

Example 13. Max Z = 5x1 +7x2


such that
2x1 +3x2 ≤ 13
3x1 +2x2 ≤ 12
x1 ,x2 ≥ 0

Solution: Since this is a maximization problem, we minimize −Z = −5x1 − 7x2 after introducing slacks:

2x1 + 3x2 + x3 = 13
3x1 + 2x2 + x4 = 12

x1 x2 x3 x4
0 -5 -7 0 0
x3 13 2 3 1 0
x4 12 3 2 0 1

[CT -CTB B−1 A] < 0 and has greater magnitude for x2 , so it is the entering variable
(ABi /ui ) is least and ui is positive for x3 , so it leaves. After performing row operations we get

x1 x2 x3 x4
91/3 -1/3 0 7/3 0
x2 13/3 2/3 1 1/3 0
x4 10/3 5/3 0 -2/3 1

[CT -CTB B−1 A] < 0 and has greater magnitude for x1 , so it is the entering variable
(ABi /ui ) is least and ui is positive for x4 , so it leaves. After performing row operations we get

x1 x2 x3 x4
31 0 0 11/5 1/5
x2 3 0 1 3/5 -2/5
x1 2 1 0 -2/5 3/5

[CT -CTB B−1 A] ≥ 0 for all the variables, so we can end the process.
(x1, x2, x3, x4) = (2, 3, 0, 0)
Objective = 31

Example 14. Explain how the row transformations correctly yield the results in the zeroth row.

Suppose at the beginning of a typical iteration the 0th row is of the form [0 | cT] − gT[b | A], where
gT = cBT B−1. Hence the 0th row equals [0 | cT] plus a linear combination of the rows of [b | A]. Let column j be
the pivot column and row l be the pivot row. Note that the pivot row is of the form hT[b | A], where hT is the
l-th row of B−1. Hence, after a multiple of the pivot row is added to the 0th row, that row is again equal to
[0 | cT] plus a (different) linear combination of the rows of [b | A], i.e., of the form [0 | cT] − pT[b | A] for
some vector p. Recall that the update rule is such that the pivot-column entry of the 0th row becomes zero:
cj − pT Aj = 0. Consider the column of a basic variable B(i) with i ≠ l. The 0th-row entry of that column is 0
before the change of basis, since it is the reduced cost of a basic variable. Since B−1 AB(i) is the i-th unit
vector and i ≠ l, the entry of the pivot row in that column is also 0. Hence adding a multiple of the pivot row
to the 0th row does not affect the 0th-row entry of that column, which stays at zero.

Example 15. Min -2x1 -x2


x1 -x2 ≤ 2
x1 +x2 ≤ 6
x1 ,x2 ≥ 0

Solution:

x1 − x2 + x3 = 2
x1 + x2 + x4 = 6

C = [−2 −1 0 0]

A = [1 −1 1 0
     1 1 0 1]

b = [2, 6]T

x1 x2 x3 x4
0 -2 -1 0 0
x3 2 1 -1 1 0
x4 6 1 1 0 1

x4 leaves and x2 enters

x1 x2 x3 x4
6 -1 0 0 1
x3 8 2 0 1 1
x2 6 1 1 0 1

x3 leaves and x1 enters

x1 x2 x3 x4
10 0 0 0.5 1.5
x1 4 1 0 0.5 0.5
x2 2 0 1 -0.5 0.5

Further reduction not possible


So Z = -10

Example 16. Max 5x1 + 2x2

6x1 + x2 ≥ 6
4x1 + 3x2 ≥ 12
x1 + 2x2 ≥ 4
x1, x2 ≥ 0

Converting into a minimization problem:

Min −Z = −5x1 − 2x2


−6x1 − x2 + x3 = −6
−4x1 − 3x2 + x4 = −12
−x1 − 2x2 + x5 = −4

A = [−6 −1 1 0 0
     −4 −3 0 1 0
     −1 −2 0 0 1],  b = [−6, −12, −4]T

If B̄ = I then x1 = 0, x2 = 0, which does not satisfy the inequalities. So take {x1, x3, x4} as the basic
variables. From the third constraint, x1 = 4 − 2x2 + x5, and then

−Z = −5(4 − 2x2 + x5) − 2x2 = −20 + 8x2 − 5x5

x4 = −12 + 4(4 − 2x2 + x5) + 3x2 = 4 − 5x2 + 4x5
x3 = −6 + 6x1 + x2 = 18 − 11x2 + 6x5

  

B̄ = [−6 1 0
     −4 0 1
     −1 0 0],  b = [−6, −12, −4]T

and the top (reduced-cost) row of the tableau is [0 8 0 0 −5]:

-20 0 8 0 0 -5
x1 4 1 2 0 0 -1
x3 18 0 11 1 0 -6
x4 4 0 5 0 1 -4

We see that the reduced cost for x5 is negative (−5), so x5 enters; we need min(xB(i)/ui) over ui > 0,
but all the ui are negative.
So θ∗ = +∞ and the minimum of −Z is −∞.
Hence max Z = ∞: the problem is unbounded.

Example 17. Tableau demonstration:


minimize x + y
x + 2y ≥ 6
2x + y ≥ 6
x, y ≥ 0

Introducing slacks, we get

A = [1 2 −1 0
     2 1 0 −1],  b = [6, 6]T,  c = [1 1 0 0]

and the starting tableau [A b; c 0] is:

1 2 -1 0 6
2 1 0 -1 6
1 1 0 0 0

We start with a feasible point, say P = (0, 6), which is the intersection of x = 0 and 2x + y = 6. To relate to
the theory and stay organised, we exchange columns 1 and 3 to bring the basic variables to the front.

-1 2 1 0 6
0 1 2 -1 6
0 1 1 0 0

Then elimination multiplies the first row by −1 to give a unit pivot, and uses the second row to produce zeros in
the second column. The fully reduced form at P is shown below. Look first at r = [−1 1] in the bottom row, which
has a negative entry in

1 0 3 -2 6
0 1 2 -1 6
0 0 -1 1 -6

column 3, so the third variable will enter the basis. The current corner P (with cost 6; the tableau entry is
−6 = −cB B−1 b) is not optimal. The column above the negative entry is B−1u = (3, 2); its ratios with the last
column are 6/3 and 6/2. Since the first ratio is smaller, the first basic variable reaches zero first, so the
first column of the basis is pushed out. We move from corner P to Q. The new tableau exchanges columns 1 and 3,
and pivoting by elimination gives the tableaux below. The new tableau at Q has r = [1/3 1/3], which is positive
and thus final. The corner x = 2, y = 2 with z = 4 is optimal.
This can also be done without exchanging columns, as shown in the next example.

3 0 1 -2 6
2 1 0 -1 6
-1 0 0 1 -6

1 0 1/3 -2/3 2
0 1 -2/3 1/3 2
0 0 1/3 1/3 -4

Example 18. minimize − x1 − x2


−x1 + x2 ≤ 1
x1 ≤ 3
x2 ≤ 2
x1 , x2 ≥ 0

minimize − x1 − x2
s.t. −x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2
x1 , x2 , x3 , x4 , x5 ≥ 0
We choose x1 to enter, as the corresponding r1 is one of the negative entries of r; x4 leaves, as it is the variable

b x1 x2 x3 x4 x5
x3 1 -1 1 1 0 0
x4 3 1* 0 0 1 0
x5 2 0 1 0 0 1
z 0 -1 -1 0 0 0

whose ratio attains the minimum α with a positive pivot entry.

b x1 x2 x3 x4 x5
x3 4 0 1 1 1 0
x1 3 1 0 0 1 0
x5 2 0 1* 0 0 1
z 3 0 -1 0 1 0

By a similar argument, x2 enters and x5 leaves, which gives the final tableau below.

Answer: z = −5, x1 = 3, x2 = 2.

b x1 x2 x3 x4 x5
x3 2 0 0 1 1 -1
x1 3 1 0 0 1 0
x2 2 0 1 0 0 1
z 5 0 0 0 1 1

Example 19. Simplex Example solved in two different forms.

minimize −x1 − x2
such that −x1 + x2 ≤ 1
x1 ≤ 3
x2 ≤ 2
x1 , x2 ≥ 0

min −x1 − x2 + 0x3 + 0x4 + 0x5

−x1 + x2 + x3 + 0x4 + 0x5 = 1
x1 + 0x2 + 0x3 + x4 + 0x5 = 3
0x1 + x2 + 0x3 + 0x4 + x5 = 2
x1, x2, x3, x4, x5 ≥ 0

or, more compactly,

min −x1 − x2
−x1 + x2 + x3 = 1
x1 + x4 = 3
x2 + x5 = 2
x1, x2, x3, x4, x5 ≥ 0

Simplex version 1 (Solve above problem using simplex version 1)

1. Initial stage
x3 = 1 + x1 − x2
x4 = 3 − x1
x5 = 2 − x2
z = 0 − x1 − x2
BF S = [0, 0, 1, 3, 2], B = {3, 4, 5}, z = 0
2. x1 and x2 both have negative coefficients, so either can be selected as the entering variable; let us choose x2.
x2 can be increased up to 1 while keeping x3 ≥ 0.
x2 enters, x3 exits.
x2 = 1 + x1 − x3
x4 = 3 − x1
x5 = 1 − x1 + x3
z = −1 − 2x1 + x3
x = [0, 1, 0, 3, 1], B = {2, 4, 5}, z = −1
3. x1 has a negative coefficient, so x1 enters.
x1 can be increased up to 1 while keeping x5 ≥ 0.
x1 enters, x5 exits.
x1 = 1 + x3 − x5
x2 = 2 − x5
x4 = 2 − x3 + x5
z = −3 − x3 + 2x5
x = [1, 2, 0, 2, 0], B = {1, 2, 4}, z = −3
4. x3 has a negative coefficient, so x3 enters.
x3 can be increased up to 2 while keeping x4 ≥ 0.
x3 enters, x4 exits.
x3 = 2 − x4 + x5
x1 = 3 − x4
x2 = 2 − x5
z = −5 + x4 + x5
x = [3, 2, 2, 0, 0], B = {1, 2, 3}, z = −5

We cannot reduce z any further because all the coefficients in the expression for z are positive; z = −5 is the optimal value.

Simplex version 2 (Solve above problem using simplex version 2)

1. Initial stage
x1 x2 x3 x4 x5
0 -1 -1 0 0 0 ...a1
x3 = 1 -1 1∗ 1 0 0 ...a2
x4 = 3 1 0 0 1 0 ...a3
x5 = 2 0 1 0 0 1 ...a4
2. x2 enters, x3 exits
x1 x2 x3 x4 x5
a1 + a2 1 -2 0 1 0 0 ...b1
a2 x2 = 1 -1 1 1 0 0 ...b2
a3 x4 = 3 1 0 0 1 0 ...b3
a4 − a2 x5 = 1 1∗ 0 -1 0 1 ...b4
3. x1 enters, x5 exits

x1 x2 x3 x4 x5
b1 + 2b4 3 0 0 -1 0 2 ...c1
b2 + b4 x2 = 2 0 1 0 0 1 ...c2
b3 − b4 x4 = 2 0 0 1∗ 1 -1 ...c3
b4 x1 = 1 1 0 -1 0 1 ...c4
4. x3 enters, x4 exits

x1 x2 x3 x4 x5
c1 + c3 5 0 0 0 1 1 ...d1
c2 x2 = 2 0 1 0 0 1 ...d2
c3 x3 = 2 0 0 1 1 -1 ...d3
c4 + c3 x1 = 3 1 0 0 1 0 ...d4

We cannot reduce z any further because all the coefficients in the expression for z are positive; z = −5 is the optimal value.

Figure 15.1: Diagrammatic representation of the above procedure.

Figure 15.1: We started with the point (0,0) as the initial feasible solution in step 1; after step 2 we reach the
point (0,1), which has a better objective; after step 3 we reach (1,2); and finally (3,2) after step 4, which is
the optimal solution, beyond which we cannot improve. The dotted path is another possible path that might have
resulted, depending on the selection of the entering variable.

Example 20. Using Simplex method(tableau) to solve:


minimize −24x1 + 396x2 − 8x3 − 28x4 − 10x5
s.t.
[12 4 1 −19 7
  6 −7 18 −1 −13
  1 17 3 18 −2] [x1, x2, x3, x4, x5]T = [12, 6, 1]T

Here, we add artificial variables to these equalities, and add them to the objective multiplied by some very
large value M (the big-M method) so that the artificial variables are driven to zero and do not affect the
objective. We solve the problem using the simplex algorithm as follows.

x1 x2 x3 x4 x5 A1 A2 A3
0 -24 396 -8 -28 -10 M M M
A1 12 12 4 1 -19 7 1 0 0
A2 6 6 -7 18 -1 -13 0 1 0
A3 1 1 17 3 18 -2 0 0 1

x1 x2 x3 x4 x5 A1 A2 A3
24 0 804 64 404 -58 M M M+24
A1 0 0 -200 -35 -235 31 1 0 -12
A2 0 0 -109 0 18 -1 0 1 -6
x1 1 1 17 3 18 -2 0 0 1

The final solution is (x1, x2, x3, x4, x5) = (1, 0, 0, 0, 0) and the objective is −24.

15.5 Additional problems

Problem. Max Z = −3x1 + 2x2 + 2x3

2x1 + x2 + x3 ≤ 2
3x1 + 4x2 + 2x3 ≥ 8
x1, x2, x3 ≥ 0

Problem. Min Z = x1 − 3x2 + 2x3

2x1 − 2x2 + 3x3 ≤ 7
−2x1 + 4x2 ≤ 12
−4x1 + 3x2 + 8x3 ≤ 10
x1, x2, x3 ≥ 0

Problem. min −2x1 − 2x2


x1 − x2 ≤ 2
x1 + x2 ≤ 6
x1 , x2 ≥ 0

Problem. Explain how the row transformation correctly yields the results in the zeroth row.

Problem. Solve by the Tableau method: max z = 5x1 + 2x2


s.t.
6x1 + x2 ≥ 6
4x1 + 3x2 ≥ 12
x1 + 2x2 ≥ 4

Chapter 16

Dual Problems

16.1 Introduction
Duality means that optimization problems may be viewed from two perspectives: the primal problem or the dual
problem. This leads to problems coming in pairs, primal and dual.

In many cases, both the primal and dual problems have physical interpretations. Often what is more interesting are
the theoretical results (and, of course, the practical implications) relating the primal and dual problems and
their optima.

In this lecture, we start by introducing a set of examples and show how dual problems can be constructed and
interpreted.

16.2 A Simple Primal Dual Pair


Consider a linear programming problem:
Z = min 6x1 + 4x2 + 2x3
subject to :
4x1 + 2x2 + x3 ≥ 5
x1 + x2 ≥ 3
x2 + x3 ≥ 4
xi ≥ 0, ∀i = 1, 2, 3.

Suppose we want to obtain a bound on the optimal value of this LP; we can do so by arguing in the following manner:

• Since all variables are non-negative, 6x1 + 4x2 + 2x3 ≥ 4x1 + 2x2 + x3 . Therefore, the value of the LP must be
at least 5.
• Also 6x1 + 4x2 + 2x3 ≥ (4x1 + 2x2 + x3 ) + (x1 + x2 ) ≥ 5 + 3 = 8.
• Similarly, 6x1 + 4x2 + 2x3 ≥ (4x1 + 2x2 + x3 ) + 2(x1 + x2 ) ≥ 5 + 2 · 3 = 11.
• And, 6x1 + 4x2 + 2x3 ≥ (4x1 + 2x2 + x3 ) + (x1 + x2 ) + (x2 + x3 ) ≥ 5 + 3 + 4 = 12.

We are now finding a bound that is closer and closer to the optima. Is 12 the optima? How do we know? We can
actually determine the best bound we can achieve by setting up a different LP! Note that in all the above, we have
been taking linear combination of the constraints to obtain better and better bounds.

Let y1 be the number of times we take the first constraint, y2 the number of times we take the second constraint and
y3 the number of times we take the third constraint. Then the lower bound we get is 5y1 + 3y2 + 4y3 , and we need
to ensure that this is a lower bound, i.e.

6x1 + 4x2 + 2x3 ≥ y1 (4x1 + 2x2 + x3 ) + y2 (x1 + x2 ) + y3 (x2 + x3 )

Since the xi and yi are non-negative, we need additional constraints to guarantee the above inequality. We can do
this by ensuring that 4y1 + y2 + 0y3 ≤ 6 (since we have 6x1 in the objective, and 4x1 in the first constraint,
1x1 in the second constraint and 0x1 in the third constraint). Similarly, 2y1 + y2 + y3 ≤ 4 and y1 + y3 ≤ 2.
Also, we need y1, y2, y3 ≥ 0 (otherwise the inequalities in the constraints change direction, and we would not
get a lower bound). We thus obtain the following LP for getting the best bound:

maximize 5y1 + 3y2 + 4y3


subject to :
4y1 + y2 ≤ 6
2y1 + y2 + y3 ≤ 4
y1 + y3 ≤ 2
yi ≥ 0, ∀i = 1, 2, 3.

Now we have two linear programming problems, and they are related: on close observation, we see that b and c have
swapped their roles, A has been transposed, and the goal is reversed (min to max). Such problem pairs are called
primal and dual problem pairs. In fact, one can reconstruct the primal problem from the dual. You may also see
some relationships in the A, b and c across these two problems.

16.3 Primal Dual Problem Pairs
An optimization problem, called the primal, may be converted to its dual, and the solution to the dual provides a
bound to the solution of the primal problem.

In our case, an optimization problem of the general form

M ax cT x
s.t. Ax ≤ b
x≥0

is the primal; its dual can be written as:

M in bT y
s.t. AT y ≥ c
y≥0

Note: We consider the maximization problem as primal and minimization problem as dual.

16.3.1 Another LP Primal and Dual

Consider another example (this is a maximization problem):

Maximize 2x1 + 3x2

subject to:
4x1 + 8x2 ≤ 12

2x1 + x2 ≤ 3

3x1 + 2x2 ≤ 4

x1 , x 2 ≥ 0
Its dual is
Minimize 12y1 + 3y2 + 4y3
Subject to:
4y1 + 2y2 + 3y3 ≥ 2

8y1 + y2 + 2y3 ≥ 3

y1 , y2 , y3 ≥ 0

If we solve these two problems (using simplex) (try it out!!), we see the optimum of the primal is (1/2, 5/4) and
that of the dual is (5/16, 0, 1/4).

But the objective values of both primal and dual are the same, i.e., 4.75.

Is it always so? How are the primal and dual problems generated and related? There are some of the questions that
we need to understand.

To keep life simple, we can state that for LP the primal and dual optima coincide (whenever a feasible, bounded
optimum exists).
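This pair is easy to check numerically. The snippet below is our own sanity check (assuming scipy; linprog
minimizes, so the max primal is solved with a negated cost):

from scipy.optimize import linprog

A = [[4, 8], [2, 1], [3, 2]]
b = [12, 3, 4]
c = [2, 3]

primal = linprog(c=[-ci for ci in c], A_ub=A, b_ub=b)      # max c^T x, Ax <= b, x >= 0
dual = linprog(c=b,                                        # min b^T y ...
               A_ub=[[-4, -2, -3], [-8, -1, -2]],          # ... s.t. A^T y >= c, y >= 0
               b_ub=[-ci for ci in c])
print(-primal.fun, dual.fun)   # 4.75 4.75, the optima coincide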

16.4 Primal ↔ Dual Conversion


Given a problem, one can convert it into its dual using the following table.

PRIMAL DUAL
x1 , x2 , ..., xn y1 , y2 , ..., ym
A AT
b c
c b
M ax cT X M in bT Y
≤ yi ≥ 0
≥ yi ≤ 0
= yi ∈ R
xj ≥ 0 j th constraint ≥
xj ≤ 0 j th constraint ≤
xj ∈ R j th constraint =

Problem: Derive the dual problem for a primal problem containing an unconstrained variable.

16.5 Numerical Examples


Example 21. Find Dual of the given optimization problem

Z = min x1 − x2

subject to :
2x1 + 3x2 − x3 + x4 ≤ 0

3x1 + x2 + 4x3 − 2x4 ≥ 3

−x1 − x2 + 2x3 + x4 = 6

x1 ≤ 0, x2 , x3 ≥ 0, x4 ∈ R

Using the conversion table, the dual problem can be written as :

Z = max 3y2 + 6y3

subject to :

2y1 + 3y2 − y3 ≥ 1

3y1 + y2 − y3 ≤ −1

−y1 + 4y2 + 2y3 ≤ 0

y1 − 2y2 + y3 = 0

y1 ≤ 0, y2 ≥ 0, y3 ∈ R

Example 22. Given the primal :

Max 6x1 + 14x2 + 13x3

s.t. (1/2)x1 + 2x2 − x3 ≤ 24
x1 + 2x2 + 4x3 ≤ 60
x1, x2, x3 ≥ 0

The dual can be calculated by referring to the table given above. The dual is found to be:

Min 24y1 + 60y2
s.t. (1/2)y1 + y2 ≥ 6
2y1 + 2y2 ≥ 14
−y1 + 4y2 ≥ 13
y1, y2 ≥ 0

The primal and dual are inter-convertible, i.e., the primal can be obtained by calculating the dual of the dual.

16.6 More Examples


Diet Problem: There are n foods and m nutrients, and a person (the buyer) is required to consume at least bi units
of nutrient i (for 1 ≤ i ≤ m). Let aij denote the amount of nutrient i present in one unit of food j. Let cj denote
the cost of one unit of food item j. One needs to design a diet of minimal cost that supplies at least the required
amount of each nutrient. This gives the following linear program:

minimize cT x

subject to

Ax ≥ b

x≥0

Now assume that some other person (the seller) has a way of supplying the nutrients directly, not through food. (For
example, the nutrients may be vitamins, and the seller may sell vitamin pills.) The seller wants to charge as much
as he can for the nutrients, but still have the buyer come to him to buy them. A plausible constraint in this case
is that the prices of the nutrients are such that it is never cheaper to buy a food in order to get the nutrients
in it than to buy the nutrients directly. If y is the vector of nutrient prices, this gives the constraints
AT y ≤ c. In addition, we have the nonnegativity constraint y ≥ 0. Under these constraints the seller wants to set
the prices of the nutrients in a way that maximizes the seller's profit (assuming that the buyer does indeed buy
all his nutrients from the seller). This gives the following dual LP:

maximize bT y

subject to

AT y ≤ c

y≥0
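A tiny synthetic instance (the numbers below are ours, purely for illustration) shows the buyer's and seller's
problems meeting at the same value, as strong duality promises:

from scipy.optimize import linprog

A = [[2, 1, 0], [0, 1, 3]]   # nutrient i per unit of food j (m = 2, n = 3)
b = [8, 6]                   # required units of each nutrient
c = [3, 2, 4]                # cost per unit of each food

buyer = linprog(c=c,                                   # min c^T x, Ax >= b, x >= 0
                A_ub=[[-a for a in row] for row in A],
                b_ub=[-bi for bi in b])
seller = linprog(c=[-bi for bi in b],                  # max b^T y ...
                 A_ub=[[2, 0], [1, 1], [0, 3]],        # ... s.t. A^T y <= c, y >= 0
                 b_ub=c)
print(buyer.fun, -seller.fun)   # equal optimal values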

Problem: Suggest another real world problem where the primal and dual has physical interpretations.

16.7 Primal-Dual Pair: Mincut and Maxflow


Now let us take a classical primal-dual pair. The maxflow and mincut problems are a prime example of mathematical
duality.

16.7.1 Maxflow

Let N = (V, E) be a network with s, t ∈ V being the source and the sink of N respectively.

The capacity of an edge is a mapping c : E → R+ , denoted by cuv or c(u, v). It represents the maximum amount
of flow that can pass through an edge.
A flow is a mapping f : E → R+ , denoted by fuv or f (u, v), subject to the following two constraints:

1. fuv ≤ cuv, for each (u, v) ∈ E (capacity constraint: the flow along an edge cannot exceed its capacity)
2. ∑u:(u,v)∈E fuv = ∑u:(v,u)∈E fvu, for each v ∈ V \ {s, t} (conservation of flows: the sum of the
flows entering a node must equal the sum of the flows exiting it, except for the source and the sink nodes)

The value of a flow is defined by |f| = ∑v:(s,v)∈E fsv, where s is the source of N. It represents the amount of flow
passing from the source to the sink.

The maximum flow problem is to maximize |f |, that is, to route as much flow as possible from s to t.

16.7.2 Mincut

The minimum cut of a graph is a partition of the vertices of a graph into two disjoint subsets that are joined by at
least one edge whose cut set has the smallest number of edges (unweighted case) or smallest sum of weights possible.

Example: Now let us consider the simple graph in the figure and formulate the maxflow problem as a linear
program. (We then want to argue that its dual is the mincut.)

The primal maxflow problem is formulated as :

Max xsu + xsv
s.t xsu + 0 + 0 + 0 + 0 ≤ 10
0 + xsv + 0 + 0 + 0 ≤ 5
0 + 0 + xuv + 0 + 0 ≤ 15
0 + 0 + 0 + xut + 0 ≤ 5
0 + 0 + 0 + 0 + xvt ≤ 10
xsu − xuv − xut = 0
xsv + xuv − xvt = 0

xsu xsv xuv xut xvt

   
A = [1 0 0 0 0
     0 1 0 0 0
     0 0 1 0 0
     0 0 0 1 0
     0 0 0 0 1
     1 0 −1 −1 0
     0 1 1 0 −1],  b = [10, 5, 15, 5, 10, 0, 0]T,  c = [1 1 0 0 0]T

The dual variables ysu, ysv, yuv, yut, yvt correspond to the five capacity constraints, and uu, uv to the two
conservation constraints.

Dual problem The corresponding dual, is:

Min 10ysu + 5ysv + 15yuv + 5yut + 10yvt


s.t. ysu + uu ≥ 1
ysv + uv ≥ 1
yuv − uu + uv ≥ 0
yut − uu ≥ 0
yvt − uv ≥ 0
yi ≥ 0
ui ∈ R

This problem represents the min-cut. Let us interpret uu as a variable that is 1 when u is on the source side S of
the cut and 0 otherwise; similarly, uv is the variable for vertex v.

Also, y = 1 marks the edges to be cut, and y = 0 those that are not cut.

• The first constraint states that if u is not in S, then the edge su should be added to the cut.
• The third constraint implies that if u is in S and v is not (i.e., −uu + uv = −1), then the edge uv should be in
the cut.

Though the dual LP does not insist that the yi or uu be in {0, 1}, we can loosely argue that the structure of the
problem forces these variables to take only integral values at an optimal basic solution (see more details in one
of the next lectures).
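Solving the flow LP numerically confirms the picture (our own check, assuming scipy; variables ordered
[xsu, xsv, xuv, xut, xvt], with the capacities encoded as bounds):

from scipy.optimize import linprog

c = [-1, -1, 0, 0, 0]                 # maximize x_su + x_sv
A_eq = [[1, 0, -1, -1, 0],            # flow conservation at u
        [0, 1, 1, 0, -1]]             # flow conservation at v
b_eq = [0, 0]
bounds = [(0, 10), (0, 5), (0, 15), (0, 5), (0, 10)]   # edge capacities
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(-res.fun)   # 15.0, the capacity of the min cut {ut, vt} = 5 + 10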

Problems: There are many other classical examples of primal-dual pairs in algorithms. List three different pairs.

147
148
Chapter 17

More on Duality

17.1 Introduction
In the previous lecture, we saw a set of examples of primal and dual problems. The beauty of the structure of such
problem pairs is worth noticing and appreciating. Duality theory allows a lot more than this. For example, it can
help you verify whether a solution is optimal. It can also help you design approximation algorithms.

We first see some important results in this space in this lecture. Later on, we also see how the primal and dual
problem structure can be used for designing approximate algorithms (in one of the next lectures).

17.1.1 More examples

TODO: Add more pairs.

17.2 Key Results on Duality


In this section, we discuss three important results/theorems from duality theory.

17.2.1 Farkas’ Lemma

Farkas’s lemma is a result in mathematics stating that a vector is either in a given convex cone or that there exists
a hyperplane separating the vector from the cone. There are no other possibilities.

17.2.2 Weak Duality

Theorem 17.2.1. Let


P = max{cT x|Ax ≤ b, x ≥ 0, x ∈ Rn }
D = min{bT y|AT y ≥ c, y ≥ 0, y ∈ Rm }
If x is a feasible solution of P and y is a feasible solution of D then

cT x ≤ b T y

Proof.
cT x = xT c ≤ xT (AT y) = (Ax)T y ≤ bT y

• Weak duality always holds, for convex and nonconvex problems. (The derivation above is for the LP case; the
general case uses Lagrangians.)
• It can be used to find nontrivial lower bounds for difficult problems.
• Note that this relationship holds for all feasible points, not only the optimal ones.

17.2.3 Strong Duality

Strong duality holds if

cT x∗ = bT y ∗

Theorem 17.2.2. (Strong duality) If a linear programming problem has an optimal solution, so does its dual, and
the respective optimal costs are equal.

That is, the optimal objective values coincide, where x∗ and y∗ denote the optimal solutions of the primal and
dual problems.

• It does not hold in general

• It (usually) holds for convex problems. (Q: Isn’t it always?)
• The conditions that guarantee strong duality in convex problems are called constraint qualifications.

Result 1:

From weak duality, we can see that if P is unbounded, then D must be infeasible. Similarly, if the dual is
unbounded, the primal must be infeasible.

Result 2:

Let x and y be feasible solutions to the primal and the dual, respectively, and suppose that bT y = cT x. Then x
and y are optimal solutions to the primal and the dual, respectively. This allows us to verify optimality.

17.2.4 Duality Result for LP

If P and D are Primal-Dual pairs of a LP problem then one of the four cases occur:

1. Both are infeasible.


2. P is unbounded and D is infeasible.
3. D is unbounded and P is infeasible.
4. Both are feasible. There exists an optimal solution x∗ and y ∗ for P and D such that cT x∗ = bT y ∗ . As LP is a
convex optimization problem, strong duality hold and hence when the problem is feasible, the optimal values
of primal and dual are same.

Here, (2) and (3) come from Result 1, and (4) comes from strong duality. (1) can be realized easily with A = 0, b < 0 and c > 0: the primal constraints read 0 ≤ b < 0 and the dual constraints read 0 = AT y ≥ c > 0, so both problems are infeasible.

17.2.5 Duality Result for IP

IP allows only weak duality. This can be proved easily as below.

Let x̄ and ȳ be the optima of the primal and dual integer programs, and let x∗ and y ∗ be the optima of the relaxed primal and dual linear programs. Then:

cT x̄ = max{cT x | Ax ≤ b, x ≥ 0, x ∈ Zn }      (17.1)
     ≤ max{cT x | Ax ≤ b, x ≥ 0, x ∈ Rn }      (17.2)
     = cT x∗                                     (17.3)
     = bT y ∗                                    (17.4)
     = min{bT y | AT y ≥ c, y ≥ 0, y ∈ Rm }     (17.5)
     ≤ min{bT y | AT y ≥ c, y ≥ 0, y ∈ Zm }     (17.6)
     = bT ȳ                                      (17.7)

Steps (17.2) and (17.6) hold because relaxing Z to R enlarges the feasible set, and (17.4) is strong duality for the LP relaxation. In short,

cT x̄ ≤ bT ȳ

17.2.6 Duality Gap

The duality gap is the difference between the optimal dual and primal objective values. For a max primal with min dual, if p∗ is the optimal primal value and d∗ is the optimal dual value, then

Duality gap = d∗ − p∗

Under weak duality, Duality Gap ≥ 0; under strong duality, Duality Gap = 0.

Figure 17.1: Visualization of the duality gap

For convex optimization problems, the duality gap is zero under a constraint qualification condition. Thus, a solution to the dual problem provides a bound on the value of the solution to the primal problem; when the problem is convex and satisfies a constraint qualification, the optimal value of the primal problem equals the value given by the dual problem.

17.3 Proof of Strong Duality


Now let us formally argue that cT x∗ = bT y ∗ , i.e., strong duality.

The proof is left as an exercise.

17.3.1 Geometric Proof

17.3.2 Proof based on Farkas’s Lemma

17.4 Examples
Example 23. Let A be a symmetric square matrix. Consider LP

min cT x

Ax ≥ c
x≥0

Prove that if x∗ satisfies Ax∗ = c and x∗ ≥ 0 then x∗ is an optimal solution.

Forming the dual of the given problem :


max cT y
AT y ≤ c
y≥0
Given A to be a symmetric matrix, AT = A.
Hence Dual problem becomes :
max cT y

152
Ay ≤ c
y≥0
Now, given Ax∗ = c and x∗ ≥ 0, x∗ satisfies both the primal and the dual constraints, and it gives
cT x∗ as the objective value for the primal problem, and
cT x∗ as the objective value for the dual problem.
Since the given problem is an LP (a convex problem), strong duality holds, and by Result 2 equal primal and dual objective values at feasible points certify optimality. Hence x∗ is an optimal solution.
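A numeric sanity check of this example (a sketch, assuming scipy is available; the matrix A and point x∗ are hypothetical choices of ours):

import numpy as np
from scipy.optimize import linprog

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # symmetric A
x_star = np.array([1.0, 1.0])
c = A @ x_star                           # c = [3, 3], so A x* = c holds

# linprog minimizes c @ x subject to A_ub @ x <= b_ub; rewrite Ax >= c.
res = linprog(c, A_ub=-A, b_ub=-c, bounds=[(0, None)] * 2)
print(res.x, res.fun)   # ~[1, 1] and 6.0
print(c @ x_star)       # 6.0, matching: x* is optimal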

17.5 Additional Problems


Problem. Consider the primal problem

min cT x
Ax ≥ b
x≥0

Form the dual problem and convert it into an equivalent minimization problem. Derive a set of conditions on the
matrix A and the vectors b, c, under which the dual is identical to the primal, and construct an example in which
these conditions are satisfied.

Problem. Consider the LP

M in 2x1 + x2
s.t. x1 + x2 ≤ 6
x1 + 3x2 ≤ 3
x1 , x 2 ≥ 0

(a) plot the feasible region and solve the problem graphically.
(b) Find the dual and solve it graphically.
(c) Verify that primal and dual optimal solutions satisfy the Strong Duality Theorem

Problem. Rock, Paper, and Scissors is a game in which two players simultaneously reveal no fingers (rock), one
finger (paper), or two fingers (scissors). The payoff to player 1 for the game is governed by the following table:

Player 1 \ Player 2 | Rock | Paper | Scissors
Rock                |  0   |  −1   |  1
Paper               |  1   |   0   | −1
Scissors            | −1   |   1   |  0

Table 17.1: Payoff to player 1 (or minus the payoff to player 2)

Note that the game is void if both players select the same alternative: otherwise, rock breaks scissors and wins,
scissors cut paper and wins, and paper covers rock and wins. Use the linear-programming formulation of this game
and linear-programming duality theory, to show that both players’ optimal strategy is to choose each alternative with
probability 1/3 . [?]
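A numerical way to see the claimed answer (a sketch, assuming scipy is available): player 1 maximizes the game value v subject to earning at least v against each pure strategy of player 2, with variables z = (p_rock, p_paper, p_scissors, v).

import numpy as np
from scipy.optimize import linprog

A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])            # payoff to player 1

c = np.array([0.0, 0.0, 0.0, -1.0])   # maximize v <=> minimize -v
A_ub = np.hstack([-A.T, np.ones((3, 1))])  # v - sum_i p_i*A[i, j] <= 0, per column j
b_ub = np.zeros(3)
A_eq = np.array([[1.0, 1.0, 1.0, 0.0]])    # probabilities sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * 3 + [(None, None)]  # p >= 0, v free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x)  # ~[1/3, 1/3, 1/3, 0]: uniform strategy, game value 0

By symmetry, the dual LP gives player 2 the same uniform strategy, which is the duality argument the problem asks for.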

Problem. Why is it that, if the primal has unique optimal solution x* , there is a sufficiently small amount by which
c can be altered without changing the optimal solution? [?]

Chapter 18

More on Duality

18.1 Review of Important Results
P : max cT x  s.t.  Ax ≤ b,  x ≥ 0
D : min bT y  s.t.  AT y ≥ c,  y ≥ 0

Examples

1. Numerical Problem:
max 2x1 + 3x2
such that 4x1 + 8x2 ≤ 12
2x1 + x2 ≤ 3
3x1 + 2x2 ≤ 4
x1 , x2 ≥ 0

The dual of above is


min 12y1 + 3y2 + 4y3
such that 4y1 + 2y2 + 3y3 ≥ 2
8y1 + y2 + 2y3 ≥ 3
y1 , y2 , y3 ≥ 0
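We can verify this pair numerically (a sketch, assuming scipy is available):

import numpy as np
from scipy.optimize import linprog

A = np.array([[4.0, 8.0], [2.0, 1.0], [3.0, 2.0]])
b = np.array([12.0, 3.0, 4.0])
c = np.array([2.0, 3.0])

# Primal: max c@x s.t. Ax <= b, x >= 0 (negate c, since linprog minimizes)
primal = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2)
# Dual: min b@y s.t. A^T y >= c, y >= 0
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 3)

print(-primal.fun, dual.fun)  # both 4.75: the optimal values coincide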
2. Graph Problem: Maxflow ↔ Mincut
In this problem we had seen that maxflow is an LP, while mincut is an IP (a selection of edges). How can they be duals?

18.1.1 Some Results


• Weak Duality Theorem (WDT)
xT c ≤ bT y
• When both P and D are feasible, cT x∗ = bT y ∗ where x∗ , y ∗ represents the optimal solution of primal and dual
respectively

• Duality Theorem for LP: If P and D are primal and dual pair for LP, then one of the four cases occur:
1. Both are infeasible
2. P is unbounded and D is infeasible
3. D is unbounded and P is infeasible
4. Both are feasible and there exists solution x and y to P and D such that cT x = bT y
• IP duals are weak duals
P = max{cT x | Ax ≤ b, x ≥ 0, x ∈ Zn }
D = min{bT y | AT y ≥ c, y ≥ 0, y ∈ Zm }
For any feasible x̄ and ȳ

cT x̄ ≤ bT ȳ

Example: Discuss the four cases of the duality theorem for LP. Starting from the duality results of LP:
1. Both primal and dual are infeasible
2. Primal is unbounded and dual is infeasible
3. Dual is unbounded and primal is infeasible
4. Both primal and dual are feasible
Example of case 1:
maximize 2x1 − x2
such that
x1 − x2 ≤ 1 (18.1)

− x1 + x2 ≤ −3 (18.2)
x1 , x 2 ≥ 0 (18.3)
The dual of for the above primal is
minimize y1 − 3y2
such that
y1 − y2 ≥ 2 (18.4)
− y1 + y2 ≥ −1 (18.5)
y1 , y 2 ≥ 0 (18.6)
Both primal and dual are infeasible
Example of case 2:
maximize 2x2 + x3
x1 − x2 ≤ 5 (18.7)
− 2x1 + x2 ≤ 3 (18.8)
x2 − 2x3 ≤ 5 (18.9)
minimize 5y1 + 3y2 + 5y3
y1 − 2y2 ≥ 0 (18.10)
− y1 + y2 + y3 ≥ 2 (18.11)
− 2y3 ≥ 1 (18.12)
Here primal is unbounded and dual is infeasible
Example of case 3:
minimize 5x1 + 3x2 + 5x3
x1 − 2x2 ≥ 0 (18.13)
− x1 + x2 + x3 ≥ 2 (18.14)
− 2x3 ≥ 1 (18.15)
x1 , x2 , x3 ≥ 0
maximize 2y2 + y3
y1 − y2 ≤ 5 (18.16)
− 2y1 + y2 ≤ 3 (18.17)
y2 − 2y3 ≤ 5 (18.18)
y1 , y 2 , y 3 ≥ 0
Here dual is unbounded and primal is infeasible
Example of Case 4:
maximize 2x1 + x2

x1 + x2 ≤ 6 (18.19)
x1 + 2x2 ≤ 8 (18.20)
minimize 6y1 + 8y2

y1 + y2 ≥ 2 (18.21)
y1 + 2y2 ≥ 1 (18.22)

Problem: For the following problem, construct the dual problem and verify strong duality (i.e., primal optimum = dual optimum).
Maximize 5x + 4y + 3z
subject to
x + y + z ≤ 30
2x + y + 3z ≤ 60
3x + 2y + 4z ≤ 84

18.2 Duality from the Lagrangian Perspective

18.2.1 Lagrangian and Lagrange Multiplier

Lagrange multiplier is often used in calculus to minimize a function subject to (equality) constraints. For example,
in order to solve the below problem,

min x2 + y 2
(18.23)
subject to x + y = 1,

we introduce a Lagrange multiplier λ and form the Lagrangian L(x, y, λ) defined by

L(x, y, λ) = x2 + y 2 + λ(1 − x − y)      (18.24)

Keeping λ fixed, we minimize the Lagrangian over all x and y subject to no constraints:

∂L/∂x = 0,   ∂L/∂y = 0      (18.25)

The optimal solution to this unconstrained problem is x = y = λ/2, and it depends on λ. The constraint x + y = 1 (equivalently, ∂L/∂λ = 0) gives an additional relation, namely λ = 1, and the optimal solution is x = y = 1/2.
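The same small example can be solved symbolically (a sketch, assuming sympy is available):

import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)
L = x**2 + y**2 + lam * (1 - x - y)  # the Lagrangian (18.24)

# Stationarity in x, y plus the constraint (dL/d lambda = 0)
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam])
print(sol)  # {x: 1/2, y: 1/2, lambda: 1}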

Basic Idea

• We relax the hard constraints in the original constrained problem (18.23) and associate a Lagrange multiplier or price λ with the amount (1 − x − y) by which a constraint is violated. This gives us an unconstrained problem.
• When the price λ is properly chosen, the optimal solution to the constrained problem is also an optimal solution for the unconstrained problem.
• For that specific value of λ, the optimal cost is unaltered by the presence or absence of the hard constraints.

18.2.2 Duality and Lagrange Multiplier

Consider an optimization problem (it could be nonconvex):

p∗ = min_x f0 (x)   subject to   fi (x) ≤ 0  ∀i

We define a Lagrangian by combining the constraints with the objective as:

L(x, λ) = f0 (x) + Σ_i λi fi (x)

The variables λi are the Lagrange multipliers. We observe that for every feasible x and λ ≥ 0, f0 (x) is bounded below by L(x, λ), i.e., f0 (x) ≥ L(x, λ). Moreover, for feasible x, f0 (x) = max_{λ≥0} L(x, λ).

Let us now write the primal problem as

p∗ = min_x max_{λ≥0} L(x, λ)

Here we have used the fact that max_{λ≥0} λT f is 0 if f ≤ 0 and +∞ otherwise. (Note that λT f is shorthand for Σ_i λi fi (x).)

Now we define a dual function (to be precise, the Lagrangian dual function) g(λ) = min_x L(x, λ). Since f0 (x) ≥ L(x, λ) for feasible x, we have the relationship

f0 (x) ≥ g(λ)

and the dual optimum is

d∗ = max_{λ≥0} min_x L(x, λ)

Minmax inequality. Another important relation to keep in mind at this stage is the minmax inequality. It says that for any two variables x ∈ X and y ∈ Y,

max_{y∈Y} min_{x∈X} ϕ(x, y) ≤ min_{x∈X} max_{y∈Y} ϕ(x, y)

To show this, note that for any x ∈ X and y ∈ Y,

min_{x′∈X} ϕ(x′ , y) ≤ ϕ(x, y) ≤ max_{y′∈Y} ϕ(x, y ′ )

The left side depends only on y and the right side only on x, so taking max over y on the left and min over x on the right leads to the general inequality above. Applying it to the Lagrangian also yields weak duality in the general case.

18.2.3 Special Case of Linear Programming

Consider the LP in standard inequality form

p∗ = max_x cT x : Ax ≤ b

where A ∈ Rm×n and b ∈ Rm .

The Lagrangian function is

L(x, λ) = cT x + λT (b − Ax) = bT λ + (cT − λT A)x

We can see that this function is ≥ cT x for every feasible x when λ ≥ 0. Let us define the dual function g(λ) as

g(λ) = max_x L(x, λ) ≥ p∗

We are interested in the best (tightest) upper bound:

d∗ = min_{λ≥0} g(λ)
   = min_{λ≥0} max_x [cT x + λT (b − Ax)]
   = min_{λ≥0} max_x [bT λ + (cT − λT A)x]
   = min_{λ≥0} (bT λ + max_x (cT − λT A)x)

If cT − λT A has any nonzero entry, then the maximum over all x is ∞ and the bound is useless. We should therefore consider only the case AT λ = c. This leads to the dual problem

d∗ = min bT λ   s.t.   AT λ = c,  λ ≥ 0

18.3 When does LP yield an integral solution?


We had seen in the past that many problems that get formulated as IPs have efficient solutions. It is important to ask why certain problems yield integer solutions with ease. To answer this, we first need to know a class of matrices.
Definition 1. A matrix is totally unimodular (TU) if every square submatrix of it (obtained by removing certain rows and columns) has determinant −1, 0 or 1.

This immediately implies that:

• All elements ∈ {−1, 0, 1}.
• If A is totally unimodular, then [A I] is also totally unimodular.

Problem: If A is TU, will AT also be TU?
Theorem 18.3.1. If A is TU and b is integral, then the LP has an integral optimal (basic) solution.

Proof. Consider

max{cT x | Ax ≤ b, b ∈ Zm , x ≥ 0}

To solve this, we had seen the process of adding slack variables and creating the equality A′ x = b (note that [A I] is also TU). A basic solution satisfies

xB = AB^{−1} b = (1/det(AB )) adj(AB ) b

Since the basis AB is nonsingular and TU, det(AB ) is −1 or 1, and the cofactors (the entries of the adjugate adj(AB )) are integers; hence xB takes only integer values.

18.4 Closer Look at Matching and Cover


In this section, we will have a closer look at another primal dual pair.

Ex: Maximum Matching ↔ Min Vertex Cover

We will study this pair when the graph is BPG and for a general case.

Cases:

1. For BPG, G = (X ∪ Y, E) (Bipartite graph), A is TU.


2. For a general graph, A is not TU.

Maximum matching: in a graph G, find the maximum number of edges no two of which share a vertex. Minimum vertex cover: in a graph G, select the minimum number of vertices so that at least one endpoint of every edge is among the selected vertices.

Maximum Matching Formulation:

Let G(V, E) be the graph, with |V | = n and |E| = m.

Maximize Σ_{j=1}^{m} xj      (18.26)

such that

Σ_{k:(j,k)∈E} xk ≤ 1   ∀j ∈ V      (18.27)

xi ∈ {0, 1}      (18.28)

Minimum Vertex Cover formulation:

Minimize Σ_{v∈V} xv      (18.29)

such that

xu + xv ≥ 1  for every {u, v} ∈ E      (18.30)

xu ∈ {0, 1}      (18.31)
Minimum vertex cover and maximum matching have a primal-dual relationship.

Consider a graph with n vertices v1 , . . . , vn and m edges e1 , . . . , em . Let us define an incidence matrix A such that

aij = 1 if vi ∈ ej , and aij = 0 otherwise.

A is an n × m matrix. Now we can define these two problems as:

IP1 (Maximum Matching)

max Σ_{j=1}^{m} xj
s.t. Ax ≤ 1,  x ≥ 0,  x ∈ Zm

IP2 (Minimum Vertex Cover)

min Σ_{i=1}^{n} yi
s.t. AT y ≥ 1,  y ≥ 0,  y ∈ Zn

Our objective is to show that A is TU for BPG.

18.4.1 For BPG, incidence matrix A is TU

Proof by induction. Let Q be an ℓ × ℓ submatrix of A.

1. If Q is 1 × 1, then |Q| ∈ {0, 1, −1}, since every entry of an incidence matrix is 0 or 1.
2. Assume every (ℓ − 1) × (ℓ − 1) submatrix has determinant in {0, 1, −1}; we show the same for the ℓ × ℓ submatrix Q. Consider the cases where Q has a column with (i) no 1s, (ii) exactly one 1, and (iii) two 1s. (Note that a column cannot have more than two 1s, since each edge has two endpoints.)

(a) If some column is all zeros, |Qℓ×ℓ | = 0.
(b) If some column has exactly one 1, expanding along that column gives |Qℓ×ℓ | = ±1 · |Q(ℓ−1)×(ℓ−1) | ∈ {−1, 0, 1}.
(c) If every column has exactly two 1s, rearrange the rows so that all X-side rows come first and all Y-side rows come second. Each column then has one 1 in each half, so the sum of the first-half rows equals the sum of the second-half rows; the rows are linearly dependent and the determinant is 0.

This argues that for a BPG we can always show that A is TU, and hence the LP yields integral solutions.

We now see two numerical examples

18.5 Numerical Example


Example 1: Bipartite Graph Consider the BPG as shown below:
For the primal and dual, the A matrix is:

Figure 18.1: BPG

 
1 1
A = 1 0
0 1
See that this matrix is TU.

For the primal, max 1T x such that Ax ≤ 1 leads to the optimal x = [1 0]T , with

LP ∗ = IP ∗ = 1

Similarly for the dual, min 1T y such that AT y ≥ 1 yields the optimal y = [1 0 0]T , with LP ∗ = IP ∗ = 1.

[objective: cT x = bT y = 1 (same)]

Case 2: Non-Bipartite Graph Now consider a non-BPG with three vertices.

Figure 18.2: Ex Non-BPG

The primal and dual problems are:

P rimal : max x1 + x2 + x3

x1 + x2 ≤ 1
x2 + x3 ≤ 1
x3 + x1 ≤ 1
Dual : min y1 + y2 + y3
y1 + y2 ≥ 1
y2 + y3 ≥ 1
y3 + y1 ≥ 1

On solving the primal:
LP ∗ = 3/2
while the integer optimum is:
IP ∗ = 1
On solving the dual:
LP ∗ = 3/2
while the integer optimum is:
IP ∗ = 2

This implies that the LP did not have an integral optimum. (Why? Because A was not TU.)

Let us write A properly:

 
1 0 1
Now, A = 1 1 0
0 1 1

|A| = 2

⇒ A is not TU
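For small matrices, total unimodularity can be checked by brute force (a sketch, assuming numpy; the function name is ours). It confirms the two matrices above:

import numpy as np
from itertools import combinations

def is_totally_unimodular(A, tol=1e-9):
    # Enumerate every square submatrix and test its determinant.
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                d = np.linalg.det(A[np.ix_(rows, cols)])
                if min(abs(d), abs(d - 1), abs(d + 1)) > tol:
                    return False
    return True

A_bpg = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)           # bipartite example
A_tri = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]], dtype=float)  # triangle example

print(is_totally_unimodular(A_bpg))  # True
print(is_totally_unimodular(A_tri))  # False: the full determinant is 2

The enumeration is exponential, so this is only for small illustrative instances.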

18.6 Additional Problems


Problem: Prove that the matrix A we had for the mincut is TU.

Chapter 19

Primal Dual Methods

19.1 Review and Summary
Duality Theorem

This theorem states that the primal (P) and dual (D) problems are intimately related. We can summarize their relationship in the following table:

P \ D         | unbounded | has solution | not feasible
unbounded     | no        | no           | possible
has solution  | no        | same values  | no
not feasible  | possible  | no           | possible

19.2 Complementary Slackness
There is a relationship between the primal and dual known as complementary slackness. We know that the number of variables in the dual equals the number of constraints in the primal, and the number of constraints in the dual equals the number of variables in the primal. It means that variables in one problem are complementary to constraints in the other. A constraint having slack means it is not binding; for an inequality constraint, the constraint has slack if its slack variable is positive. Complementary slackness is a relationship between slackness in primal constraints and the corresponding dual variables (and vice versa). The following are the primal and dual complementary slackness conditions.

Primal Complementary Slackness (PCS)

Conditions for PCS: for 1 ≤ j ≤ n, either xj = 0 or Σ_{i=1}^{m} aij yi = cj , i.e., X T (AT Y − c) = 0

Dual Complementary Slackness (DCS)

Conditions for DCS: for 1 ≤ i ≤ m, either yi = 0 or Σ_{j=1}^{n} aij xj = bi , i.e., Y T (AX − b) = 0

The above complementary slackness conditions guarantee that the values of the primal and dual are the same.

19.2.1 Basic Derivation/Proof of Complementary Slackness

Consider the following primal-dual pair:

P = max{cT X | AX ≤ b, X ≥ 0}      (19.1)
D = min{bT Y | AT Y ≥ c, Y ≥ 0}      (19.2)

Suppose X∗ is an optimal point for the primal problem and Y ∗ is optimal for the dual. From (19.1) and (19.2),

cT X∗ ≤ (AT Y ∗)T X∗ = Y ∗T AX∗ ≤ Y ∗T b      (19.3)

From strong duality:

cT X∗ = bT Y ∗      (19.4)

From equations (19.3) and (19.4),

cT X∗ = (AT Y ∗)T X∗  and  Y ∗T AX∗ = Y ∗T b      (19.5)

We can rewrite the above as:

Y ∗T [AX∗ − b] = 0      (19.6)
X∗T [AT Y ∗ − c] = 0      (19.7)

Since each term in these products has a fixed sign, (19.6) gives, for every i, either Yi∗ = 0 or (AX∗ )i = bi ; similarly (19.7) gives, for every j, either Xj∗ = 0 or (AT Y ∗ )j = cj . We conclude that either the ith variable is 0 or the ith constraint is tight.

19.2.2 Using Complementary Slackness to Solve Duals/Primal

Ensure PCS strictly but relax DCS: for 1 ≤ i ≤ m, either yi = 0 or bi /β ≤ Σ_{j=1}^{n} aij xj ≤ bi , where β ≥ 1. That is, PCS is strictly obeyed while DCS is relaxed (or vice versa). This helps in finding a solution which is approximate.

An iterative method for the above:

Xi → Xi+1 ,  with  Xi+1 = Xi + Y

We have to find the step Y to compute Xi+1 .

19.2.3 Complementary Slackness:Ver2

Complementary slackness tells us that when primal LP sets some variable to non-zero, then it’s for some “good
reason”.

Primal Complementary Slackness (PCS) condition

For 1 ≤ j ≤ n: either xj = 0 or Σ_{i=1}^{m} aij yi = cj , i.e., X T [AT Y − c] = 0

Dual Complementary Slackness (DCS) condition

For 1 ≤ i ≤ m: either yi = 0 or Σ_{j=1}^{n} aij xj = bi , i.e., Y T [AX − b] = 0

Note : While solving a problem by P-D method, ensure PCS strictly and DCS relaxed.

Relaxed DCS can be written as:

For 1 ≤ i ≤ m: either yi = 0 or bi ≤ Σ_{j=1}^{n} aij xj ≤ βbi

Here β is kept for relaxing the relation; β = 1 when the condition is tight.

How to solve problems

If the conditions are obeyed for feasible solutions x, y, then the solutions are optimal. To solve a problem, start with some X and iteratively update it, maintaining the slackness conditions, until a feasible value is found:

X i → X i+1 ⇒ X i+1 ← X i + Y

19.3 Introduction to Primal Dual Method
The primal-dual method is a standard tool for designing algorithms for combinatorial optimization problems. In this
lecture, we focus on showing how to modify the primal-dual method to provide good approximation algorithms for a
wide variety of NP-hard problems.
The primal-dual method was originally proposed by Dantzig, Ford and Fulkerson as another means of solving linear
programs. However, now it is more widely used for devising algorithms for problems in combinatorial optimization.
The main feature of primal-dual method is that it allows weighted optimization problem to be reduced to a purely
combinatorial unweighted problem. It also leads to efficient polytime algorithms for solving NP-hard problems.
The following figure shows the general framework of the primal-dual method:

Figure 19.1: General framework of primal-dual method

19.3.1 Overview of the primal-dual method

Consider the following primal program, called P :

min Σ_{j=1}^{n} cj xj
s.t. Σ_{j=1}^{n} aij xj ≥ bi ,   i = 1, . . . , m
xj ≥ 0,   j = 1, . . . , n

Then the dual program D is:

max Σ_{i=1}^{m} bi yi
s.t. Σ_{i=1}^{m} aij yi ≤ cj ,   j = 1, . . . , n
yi ≥ 0,   i = 1, . . . , m

Recall the complementary slackness conditions:

1) Primal complementary slackness conditions:
For each 1 ≤ j ≤ n: either xj = 0 or Σ_{i=1}^{m} aij yi = cj

2) Dual complementary slackness conditions:
For each 1 ≤ i ≤ m: either yi = 0 or Σ_{j=1}^{n} aij xj = bi

Given an (NP-hard) optimization problem, we formulate it as an IP and relax it to obtain an LP. One option is to round the optimal solution x∗ of the LP to obtain an integral solution. In the primal-dual method, instead of solving the LP, we build a feasible integral solution to the LP (and thus to the IP) from scratch, using the dual D as our guide.
Specifically, we do either of the following:

1) Ensure the primal conditions and suitably relax the dual conditions:
For each 1 ≤ i ≤ m: either yi = 0 or bi ≤ Σ_{j=1}^{n} aij xj ≤ βbi , where β > 1

2) Ensure the dual conditions and suitably relax the primal conditions:
For each 1 ≤ j ≤ n: either xj = 0 or cj /α ≤ Σ_{i=1}^{m} aij yi ≤ cj , where α > 1

If we use the first way, that is, ensure the primal conditions and relax the dual conditions, we have:

Lemma 19.3.1. If x and y are feasible solutions of P and D respectively, satisfying the conditions in the first way (primal conditions ensured, dual conditions relaxed), then:

Σ_{j=1}^{n} cj xj ≤ β Σ_{i=1}^{m} bi yi

Proof.

Σ_{j=1}^{n} cj xj = Σ_{j=1}^{n} (Σ_{i=1}^{m} aij yi ) xj      (for xj > 0 the dual constraint is tight)
               = Σ_{i=1}^{m} (Σ_{j=1}^{n} aij xj ) yi       (exchange the order of summation)
               ≤ β Σ_{i=1}^{m} bi yi                        (for yi > 0, Σ_j aij xj ≤ βbi )

More specifically, let α = 1 if the primal conditions are ensured and β = 1 if the dual conditions are ensured. Then we have:

1) Primal complementary slackness conditions:
Let α ≥ 1. For each 1 ≤ j ≤ n: either xj = 0 or cj /α ≤ Σ_{i=1}^{m} aij yi ≤ cj
2) Dual complementary slackness conditions:
Let β ≥ 1. For each 1 ≤ i ≤ m: either yi = 0 or bi ≤ Σ_{j=1}^{n} aij xj ≤ βbi

Lemma 19.3.2. If x and y are feasible solutions of P and D respectively, satisfying the complementary slackness conditions stated above, then both x and y are αβ-approximate solutions:

Σ_{j=1}^{n} cj xj ≤ αβ Σ_{i=1}^{m} bi yi

Proof.

Σ_{j=1}^{n} cj xj ≤ α Σ_{j=1}^{n} (Σ_{i=1}^{m} aij yi ) xj
               = α Σ_{i=1}^{m} (Σ_{j=1}^{n} aij xj ) yi
               ≤ αβ Σ_{i=1}^{m} bi yi

19.3.2 Introduction :

Today's tool is an adaptation of a fundamental tool in the design of algorithms for linear programming and combinatorial optimization: the primal-dual method.

We know,

Primal P : max{cT X|AX ≤ b, X ≥ 0}

Dual D : min{bT Y |AT Y ≥ c, Y ≥ 0}

from Strong duality theorem, we know that, at optimum,

cT X ∗ = bT Y ∗ - (1)

Substituting c with AT Y (from D) in LHS of (1), we get

cT X ∗ ≤ (AT Y ∗ )T X ∗ ⇒LHS of (1)

Solving RHS of above more, we get,

(AT Y ∗ )T X ∗ = Y ∗T AX ∗

Substituting AX to b brings ≤ inequality to condition from P, which gives,

Y ∗T AX ∗ ≤ Y ∗T b

Finally, substituting the above relations into (1), we get Fact 1:

Fact 1. cT X ∗ = (AT Y ∗ )T X ∗ and Y ∗T AX ∗ = Y ∗T b

Fact 2. (1) X ∗T (AT Y ∗ − c) = 0; that is, for all 1 ≤ j ≤ n, either Xj∗ = 0 or Aj · y ∗ = cj (where Aj is the jth column of A).

(2) Y ∗T (AX ∗ − b) = 0; that is, for all 1 ≤ i ≤ m, either Yi∗ = 0 or ai · X ∗ = bi (where ai is the ith row of A).

Proof. follows from derivation of Fact 1. Since there’s equality between b.y ∗ and c.x∗ ,all the inequalities are equality,
which translates to the fact above.

Fact 3. Let x and y be feasible solutions to P and D satisfying the following conditions:

(1) For all 1 ≤ j ≤ n, either xj = 0 or Aj · y ≥ cj /α, and
(2) For all 1 ≤ i ≤ m, either yi = 0 or ai · x ≤ βbi .

Then the objective values cT x and bT y are within a factor αβ of each other, i.e., both are αβ-approximate.

Proof. This is derived from the results of Fact 1 and Fact 2, following the same chain of inequalities with the relaxation factors α and β carried along.

19.3.3 Primal-Dual based approximation algorithm


step 1. Formulate the given problem as an IP. Relax the variable constraints to obtain the primal LP P , then find the
dual D.
step 2. Start with a primal infeasible solution x and dual feasible solution y. Usually x = 0 and y = 0.
step 3. until x is feasible do:

a) Increase the value of yi in some fashion until a dual constraint goes tight, i.e., Σ_{i=1}^{m} aij yi = αcj for some j, while always maintaining feasibility of y.
b) Select some subset of tight dual constraints and increase the value of the primal variables corresponding to these constraints by an integral amount.

step 4. The cost of the dual solution is used as a lower bound for the primal optimization problem. Note that the approximation guarantee of the algorithm is αβ.

Example. In this section we briefly describe how to use complementary slackness to solve the dual problem. Take the following example.

z1 = max 2x1 + 4x2 + 3x3 + x4


Constraints: 3x1 + x2 + x3 + 4x4 ≤ 12
x1 − 3x2 + 2x3 + 3x4 ≤ 7
2x1 + x2 + 3x3 − x4 ≤ 10
xi ≥ 0

We know the solution of the above: z1 = 42; x1 = 0; x2 = 10.4; x3 = 0; x4 = 0.4. Now we use this information to solve its dual using complementary slackness. The following is its dual (one constraint per primal variable):

z2 = min 12y1 + 7y2 + 10y3

Constraints: 3y1 + y2 + 2y3 ≥ 2
y1 − 3y2 + y3 ≥ 4
y1 + 2y2 + 3y3 ≥ 3
4y1 + 3y2 − y3 ≥ 1
yi ≥ 0

In the above example x2 and x4 are positive, so their corresponding dual constraints (the second and fourth) are tight by PCS. Similarly, primal constraints 1 and 3 are tight but the second is not, so the corresponding variable y2 equals 0 by DCS. Substituting y2 = 0 into the two tight dual constraints:

y1 + y3 = 4
4y1 − y3 = 1

Solving these equations gives y1 = 1, y3 = 3, and z2 = 12(1) + 10(3) = 42. So the primal and dual optima are the same.
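The computation can be cross-checked with a solver (a sketch, assuming numpy/scipy are available):

import numpy as np
from scipy.optimize import linprog

A = np.array([[3.0, 1.0, 1.0, 4.0],
              [1.0, -3.0, 2.0, 3.0],
              [2.0, 1.0, 3.0, -1.0]])
b = np.array([12.0, 7.0, 10.0])
c = np.array([2.0, 4.0, 3.0, 1.0])

res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 4)
print(res.x, -res.fun)  # ~[0, 10.4, 0, 0.4], 42.0

y = np.array([1.0, 0.0, 3.0])  # the dual solution found via slackness
print(b @ y)                   # 42.0, matching the primal optimum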

Problem Take the following LP:

max x1 − x2
subject to −2x1 + x2 ≤ 2
x1 − 2x2 ≤ 2
x1 + x2 ≤ 5
x≥0

The dual is:

min 2y1 + 2y2 + 5y3


subject to −2y1 + y2 + y3 ≥ 1
y1 − 2y2 + y3 ≥ −1
y≥0

Suppose I claimed that (1, 4) solves the primal. How could you check this using complementary slackness?

19.4 Example: Numerical Problem

19.5 Shortest Path

19.5.1 Shortest Path Problem:Ver1

Data

1. a digraph D = (N, A) with |N | = n nodes and |A| = m arcs;

2. a source node sϵN ;

3. a cost function c : A → R.

Problem Statement -

Find all minimum cost (i.e. shortest) paths from s to all nodes N . The problem is also called Shortest Path
Tree/Arborescence Problem, because of a property of its solution: the set of all shortest paths forms a spanning
arborescence rooted in s.

Figure: A shortest-paths arborescence (s = 1). Costs are black, flows are red, distances are blue.

Primal-Dual pair

P) minimize Σ_{(i,j)ϵA} cij xij

s.t. Σ_{(j,i)ϵδi−} xji − Σ_{(i,j)ϵδi+} xij = 1   ∀iϵN \{s}

     Σ_{(j,s)ϵδs−} xjs − Σ_{(s,j)ϵδs+} xsj = 1 − n

     xij ≥ 0   ∀(i, j)ϵA

D) maximize Σ_{iϵN \{s}} yi + (1 − n)ys

s.t. yj − yi ≤ cij   ∀(i, j)ϵA

     yi ϵR   ∀iϵN

Observation 1. If we add a constant α to each y variable, nothing changes. Hence we are allowed to fix one
variable:ys = 0

Observation 2. We have m inequality constraints, n − 1 original y variables and m slack variables. The LP tableau of the dual problem has m rows and n − 1 + m columns. Hence in each base solution of D there should be m basic variables and n − 1 non-basic (null) variables. By the complementary slackness theorem, there should be n − 1 basic (positive) variables in the primal problem.

Observation 3. We have n equality constraints that are not linearly independent: summing up all the rows we
obtain 0 = 0. Hence we are allowed to delete one constraint: we delete the flow conservation constraint for s.

Observation 4. We have now n − 1 equality constraints and m variables. The LP tableau of P has n − 1 rows and m columns. Hence in each base solution of P there are n − 1 basic variables and m − (n − 1) non-basic variables.

Complementary slackness conditions (CSC)

P′) minimize z = Σ_{(i,j)ϵA} cij xij

s.t. Σ_{(j,i)ϵδi−} xji − Σ_{(i,j)ϵδi+} xij = 1   ∀iϵN \{s}

     xij ≥ 0   ∀(i, j)ϵA

D′) maximize w = Σ_{iϵN \{s}} yi

s.t. yj − yi ≤ cij   ∀(i, j)ϵA

     yi ϵR   ∀iϵN \{s}

Primal CSCs: xij (yi + cij − yj ) = 0

Basic variables in P′ correspond to active constraints in D′.

Only arcs (i, j) for which yi + cij = yj can carry flow xij .

By applying the above conditions within a graph algorithm (Dijkstra, the Ford-Fulkerson style labeling below, etc.), the problem is solved.

For example,

The Ford-Fulkerson algorithm (1962)

Data structures:

i) a predecessor label, πi for each node iϵN ;

ii) a cost label, yi for each node iϵN.

Algorithm

Step FF1 (initialization):

Set ys = 0 and yi = ∞ ∀iϵN \{s}.

Set πi = nil ∀iϵN.

Step FF2(Iteration):

Select an arc (i, j)ϵA such that yj − yi > cij .

If such an arc exists

then Set yj = yi + cij , πj = i and repeat;

else terminate;

Feasiblity

After initialization (Step FF1) we have neither primal feasibility nor dual feasibility.

Primal viewpoint: We have πi = nil for all iϵN ; hence no flow enters any node.

Dual viewpoint: We have yi = ∞ for all iϵN \{s}; hence all constraints yi − ys ≤ csi are violated.

The FF algorithm maintains the CSCs and iteratively enforces primal and dual feasibility.
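A direct implementation of this labeling scheme (a sketch; the helper name and the small arc set are ours) simply keeps fixing any arc that violates the dual constraint y_j − y_i ≤ c_ij:

import math

def shortest_paths(nodes, arcs, s):
    # arcs: dict mapping (i, j) -> cost c_ij; assumes no negative cycles
    y = {i: math.inf for i in nodes}   # cost labels (Step FF1)
    pred = {i: None for i in nodes}    # predecessor labels
    y[s] = 0.0
    changed = True
    while changed:                     # Step FF2: hunt for a violated arc
        changed = False
        for (i, j), cost in arcs.items():
            if y[j] - y[i] > cost:     # dual constraint violated
                y[j] = y[i] + cost
                pred[j] = i
                changed = True
    return y, pred

arcs = {("s", "u"): 2, ("s", "v"): 5, ("u", "v"): 1, ("u", "t"): 6, ("v", "t"): 2}
print(shortest_paths({"s", "u", "v", "t"}, arcs, "s"))
# y: s=0, u=2, v=3, t=5; the tight arcs (y_j = y_i + c_ij) form the shortest-path tree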

19.6 Example: MST

19.6.1 MST:Ver1

Solution. Starting with the LP formulation of MST, we get:

min Σ_e ce xe
s.t. Σ_{e crosses π} xe ≥ |π| − 1   ∀ partitions π
     xe ≥ 0   ∀eϵE

Here is the dual:

max Σ_π yπ (|π| − 1)
s.t. Σ_{π: e crosses π} yπ ≤ ce   ∀eϵE
     yπ ≥ 0   ∀π

In the algorithm, first note that we need to maintain the yπ . Initially they are all zero. At any stage we need to improve the cost function. In general, we can do that by increasing some of the ys and decreasing some. To keep things as simple as possible, we will try to do this by only increasing one of them at a time.

Let us assume that all ce are strictly positive, so initially none of the constraints are tight. We wish to increase the cost function. To do so we have to increase some yπ . Which one? It seems that we gain the most by increasing the one where each vertex is in a separate partition. So, suppose we start to increase this. How much can we increase it by? We see that we can increase it up to the weight of the minimum-weight edge. At this point the inequalities corresponding to all edges of minimum weight will be tight.

Let us consider the generic step. Suppose that at some stage we have some y. As per our recipe we need to consider only the inequalities which are equalities. So, let F denote the set of edges which are currently tight (the corresponding inequalities are tight). We need to increase some yπ such that for each of these edges the sum of the increases in the yπ that the edge crosses is at most zero. This means we can increase a yπ such that none of the edges of F cross π. How do we find such a π? The most natural way is to find the connected components and put each component in one part.
Here then is the algorithm for MSTs :

1. Initialization: We think of all yπ as zero. Note that we cannot explicitly set them (there are exponentially many partitions).

2. Iterative Step: Let E ′ denote the set of edges which are tight. Find the connected components of the graph G′ = (V, E ′ ). Increase yπ , where the parts of π are the connected components of G′ , till some edge becomes tight.

3. The previous step terminates when we get one connected component.

Proof of Optimality

We can prove that the above algorithm gives an optimum solution by exhibiting a primal and dual solution of the
same cost.

Cost of primal is given by:



Σ_e ce , where e ranges over the edges chosen by the algorithm
= Σ_e Σ_{π: e crosses π} yπ          (each chosen edge is tight)
= Σ_π Σ_{e: e crosses π} yπ
= Σ_π yπ Σ_{e crosses π} 1
= Σ_π yπ (number of chosen edges that cross π)
= Σ_π yπ (|π| − 1)

This is exactly the cost of the dual. So we have shown the cost of the primal is the same as the cost of the dual, proving optimality.
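Since the uniform raise always tightens a minimum-weight edge between two current components, the scheme can be realized combinatorially with a union-find structure; the result coincides with Kruskal's algorithm. A sketch:

def mst(n, edges):
    # edges: list of (weight, u, v) with vertices 0..n-1
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(edges):          # edges go tight in order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                        # the edge crosses two components
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(mst(4, edges))  # [(0, 1, 1), (1, 2, 2), (2, 3, 4)]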

MST: Ver2

Here is an LP formulation of MST:

min Σ_{e∈E} ce xe ,   where xe = 1 if e is included and 0 otherwise

Σ_{e∈E} xe ≥ |V | − 1      (19.8)

Σ_{e=(u,v): u,v∈S} xe ≤ |S| − 1   ∀ S ⊆ V      (19.9)

xe ≥ 0   ∀ e ∈ E      (19.10)

Equation (19.8) ensures that we have at least |V | − 1 edges in the solution; any spanning tree has exactly |V | − 1 edges. Constraint (19.9) ensures that there is no cycle within any subset S of vertices: there can be at most |S| − 1 edges between vertices of S in the solution.

Dual of the above:

max (|V | − 1)α + Σ_{S⊆V} (1 − |S|)βS      (19.11)

s.t. α − Σ_{S: u,v∈S} βS ≤ ce   ∀e = (u, v)      (19.12)

βS ≥ 0 ∀ S ⊆ V,   α ≥ 0      (19.13)

Algorithm

1. Initialization: We think of all the dual variables (α and the βS ) as 0. Note that we cannot explicitly set them.
2. Iterative Steps: Let Ep denote the set of edges which are tight. Find the connected components of the graph Gp = (V, Ep ). Increase the dual objective (raising α against the βS ) till some edge becomes tight, where the parts of the partition are the connected components of Gp .
3. The previous step terminates when we get one connected component.

19.7 Set Cover


Set Cover: Ver 1

Here is the LP/IP formulation of set cover:

min Σ_{s∈S} cs xs      (19.14)

s.t. Σ_{s: u∈s} xs ≥ 1   ∀u ∈ V      (19.15)

xs ≥ 0   ∀ s ∈ S      (19.16)

where xs = 1 if set s is taken and 0 otherwise, for each set s ∈ S.

Dual:

max Σ_u yu      (19.17)

s.t. Σ_{u∈s} yu ≤ cs   ∀s ∈ S      (19.18)

yu ≥ 0   ∀u ∈ V      (19.19)
Approximate Algorithm:
wj = the (residual) weight of set sj

1. y ← 0 // start with a feasible solution

2. J ← ϕ // Nothing yet in the cover
3. I ← [1, ...., m] // Elements yet to be covered
4. while I ̸= ϕ do
5. P ick i ∈ I // and now try to increase yi as much as possible
[ ]
6. ji ← argmin wj | j ∈ [n], i ∈ Sj
7. yi ← wji // this is the most yi could be increased to
8. // the dual constraint corresponding to ji shall become ”binding” (becomes equality)
9. for each j ← where i ∈ Sj do
10. wj ← wj − wji
11. end for
12. J ← J ∪ {ji }
13. I ← I − Sji // those in Sji are already covered
14. end while
15. Return J
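A runnable version of this routine (a sketch; the function name and the small instance are ours):

def primal_dual_set_cover(universe, sets, weights):
    # sets: list of Python sets of elements; weights: parallel list of costs
    w = list(weights)              # residual weights of the sets
    y = {}                         # dual variable per element
    cover = []
    uncovered = set(universe)
    while uncovered:
        i = next(iter(uncovered))  # pick any uncovered element
        # raise y_i until the cheapest (residual) set containing i goes tight
        ji = min((j for j in range(len(sets)) if i in sets[j]), key=lambda j: w[j])
        y[i] = w[ji]
        for j in range(len(sets)):
            if i in sets[j]:
                w[j] -= w[ji]      # the dual constraint of set j tightens
        cover.append(ji)
        uncovered -= sets[ji]
    return cover, y

sets = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]
print(primal_dual_set_cover({1, 2, 3, 4}, sets, [1.0, 2.0, 1.0, 3.0]))
# e.g. cover [0, 2] of weight 2, with the dual values summing to 2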

19.7.1 Set cover:Ver2

Solution.

The general idea is to work with an LP-relaxation of an NP-hard problem and its dual. Then the algorithm iteratively
changes a primal and a dual solution until the relaxed primal-dual complementary slackness conditions are satisfied.

Primal-Dual Schema

Consider the following primal program:

minimize val(x) = Σ_{j=1}^{n} cj xj
subject to Σ_{j=1}^{n} aij xj ≥ bi ,   i = 1, . . . , m
xj ≥ 0,   j = 1, . . . , n

The dual program is:

maximize val(y) = Σ_{i=1}^{m} bi yi
subject to Σ_{i=1}^{m} aij yi ≤ cj ,   j = 1, . . . , n
yi ≥ 0,   i = 1, . . . , m

We will move forward using this schema, ensuring one set of conditions and suitably relaxing the other. We capture both situations by relaxing both conditions: if the primal conditions are to be ensured, we set α = 1 below, and if the dual conditions are to be ensured, we set β = 1.

Primal Complementary Slackness Conditions. Let α ≥ 1. For each 1 ≤ j ≤ n:

either xj = 0 or cj /α ≤ Σ_{i=1}^{m} aij yi ≤ cj .

Dual Complementary Slackness Conditions. Let β ≥ 1. For each 1 ≤ i ≤ m:

either yi = 0 or bi ≤ Σ_{j=1}^{n} aij xj ≤ βbi .

Lemma. If x and y are primal and dual feasible solutions respectively satisfying the complementary slackness conditions stated above, then val(x) ≤ αβ val(y).

Proof. We calculate directly using the slackness conditions and obtain

val(x) = Σ_{j=1}^{n} cj xj ≤ α Σ_{j=1}^{n} (Σ_{i=1}^{m} aij yi ) xj
       = α Σ_{i=1}^{m} (Σ_{j=1}^{n} aij xj ) yi ≤ αβ Σ_{i=1}^{m} bi yi = αβ val(y)

which was claimed.

Procedure to solve the set cover problem by this algorithm:

The algorithm starts with a primal infeasible solution and a dual feasible solution; usually these are x = 0 and y = 0 initially. It iteratively improves the feasibility of the primal solution and the optimality of the dual solution, ensuring that in the end a primal feasible solution is obtained and all conditions stated above, with a suitable choice for α and β, are satisfied. The primal solution is always extended integrally, thus ensuring that the final solution is integral. The improvements to the primal and the dual go hand-in-hand: the current primal solution is used to determine the improvement to the dual, and vice versa. Finally, the cost of the dual solution is used as a lower bound on the optimum value and, by the above lemma, the approximation guarantee is αβ.

19.8 Example:Weighted Vertex Cover

19.8.1 Ver1

Primal: min Σ_{v∈V} wv xv
Constraints: xu + xv ≥ 1 ∀(u, v) ∈ E,   xv ≥ 0

Dual: max Σ_{e∈E} ye
Constraints: Σ_{e: v∈e} ye ≤ wv ∀v ∈ V,   ye ≥ 0

Figure 19.2: Weighted Vertex Cover

Iterative Steps
1. Start with X = [0 . . . 0] and Y = [0 . . . 0]. This is feasible for D (dual) but not for P (primal). Keep modifying X so that it becomes feasible for P.
2. Increase y for edge a to 4, so that the constraint at vertex 4 becomes tight: ya = 4 =⇒ x4 = 1, and freeze a, b and c. (Freezing means the remaining variables must keep the tight constraint satisfied; here b and c must stay at 0 since ya + yb + yc ≤ 4 is already tight.)
3. Increase y for edge e to 1: ye = 1 =⇒ x5 = 1, and freeze e and d.
4. yg = 2 =⇒ x2 = 1, and freeze g, h and f.

Now the solution satisfies both P and D. In the above case β = 2.

19.8.2 Iterative Algorithm


1. Start with X0 such that D is feasible.
2. Find Z such that D remains feasible and P gets closer to feasible.
3. Xi+1 = Xi + Z.

19.8.3 Summary:
1. PD gives exact algorithms for combinatorial optimization problems.
2. PD for approximate algorithms: for 1 ≤ i ≤ m, either yi = 0 or bi /β ≤ Σ_{j=1}^{n} aij xj ≤ bi , where β ≥ 1.

PD Method:
1. Start with an initial guess.
2. Move to a better guess, guided by the slackness conditions.
3. When nothing better can be found, stop.

19.8.4 Minimum-weighted Vertex Cover:Ver2

Statement - Given weights wi on iϵV , find a min-cost subset S of V, such that at least one endpoint of every edge
is in S.

Solution :

Ground Set : V

Costs : wi , iϵV

Sets: {u, v} , where (u, v) ϵE

Primal and Dual equation for above problem -



Primal: min Σ_{vϵV} wv xv such that xu + xv ≥ 1 for all (u, v)ϵE, and xv ≥ 0

Dual: max Σ_{eϵE} ye such that Σ_{e: vϵe} ye ≤ wv for all vϵV , and ye ≥ 0

The Complementary Slackness conditions are as follows :



PCS - xv > 0 ⇒ Σ_{e: vϵe} ye = wv ,

DCS - ye > 0 ⇒ xu + xv = 1.

These conditions will guide us in designing an approximate algorithm. Let us see how to interpret them. A primal-dual algorithm constructs primal and dual feasible solutions simultaneously. To ensure optimality, the primal condition says that a vertex should be picked only if its dual constraint is tight (the edges incident on it pay for it fully), and the dual condition says that an edge's dual variable should be positive only if exactly one of its endpoints is chosen. The dual condition is difficult to maintain, since a cover may need both endpoints of some edges. So let us relax the dual condition by a factor of 2, i.e., set β = 2:

ye > 0 ⇒ xu + xv ≤ 2.

This relaxation makes the problem much easier, because the dual condition is now satisfied automatically: xu and xv are at most 1. Now we just need to construct primal and dual solutions satisfying the primal complementary slackness condition only. This can be achieved by the following simple algorithm:

1. Initialization: x=0,y=0.

2. When there is an uncovered edge:

(a)Pick an uncovered edge, and increase ye until some vertices go tight.

(b)Add all tight vertices to the vertex cover.

3. Output the vertex cover.

Clearly this algorithm will produce a feasible solution for the vertex cover problem, and also satisfy the primal
complementary slackness condition.

Steps:

1) Increase y for a ⇒ x4 = 1 ⇒ freeze a, b and c.

2) Increase y for e to 1 ⇒ x5 = 1 ⇒ freeze d and e.

3) Increase yg to 2 ⇒ x2 = 1 ⇒ freeze f, g and h.

Here, 1 ≤ xu + xv ≤ β, where β = 2.

The above algorithm is a PD-based approximate algorithm.

In summary, to solve a problem by P-D approach,

i) start with X 0 , such that dual is feasible, then

ii) find Z, such that dual remains feasible

iii) find X i+1 ←X i + Z

19.8.5 Weighted Vertex Cover via Primal-Dual method

Weighted Vertex Cover (WVC): Given an undirected graph G = (V, E), where |V | = n and |E| = m, and a cost function on vertices c : V → Q+ , find a subset C ⊆ V such that every edge e ∈ E has at least one endpoint in C and C has minimum cost.
Formulate vertex cover as following IP :
For each vertex i ∈ V (V = {1, 2, 3, . . . , n}),
let xi ∈ {0, 1} be variables such that xi = 1 if i ∈ C, otherwise xi = 0.
We have:

min Σ_{i=1}^{n} ci xi

s.t. xi + xj ≥ 1   ∀(i, j) ∈ E
xi ∈ {0, 1}   ∀i ∈ V

The corresponding LP P of above IP:



min Σ_{i=1}^{n} ci xi

s.t. xi + xj ≥ 1   ∀(i, j) ∈ E
xi ≥ 0   ∀i ∈ V

Assign a dual variable yij to the constraint xi + xj ≥ 1. We have the corresponding dual D:

max Σ_{(i,j)∈E} yij

s.t. Σ_{j:(i,j)∈E} yij ≤ ci   ∀i ∈ V

yij ≥ 0   ∀(i, j) ∈ E

Let us choose α = 1 and β > 1, ensuring the primal conditions and suitably relaxing the dual conditions:
For each vertex i ∈ V : either xi = 0 or Σ_{j:(i,j)∈E} yij = ci
For each edge (i, j) ∈ E: either yij = 0 or 1 ≤ xi + xj ≤ β, where β > 1
Therefore, when xi ̸= 0, then Σ_{j:(i,j)∈E} yij = ci . When Σ_{j:(i,j)∈E} yij = ci for some i, we say that this constraint goes tight.

19.8.6 Primal-Dual algorithm for WVC


step 1. Initialise x = 0 and y = 0
step 2. While E ̸= ϕ do

a) Select a non-empty set E ′ ⊆ E.
b) Raise yij for each edge (i, j) ∈ E ′ until some dual constraint goes tight, i.e., Σ_{j:(i,j)∈E} yij = ci for some i.
c) Let S be the set of vertices i corresponding to the dual constraints that just went tight.
d) For each i ∈ S set xi = 1 and delete all edges (i, j) from E, i.e., delete all edges incident to vertices in S.

step 3. End while

step 4. Return the set C = {i | xi = 1}.
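A runnable sketch of this algorithm, processing one uncovered edge at a time (the function name and the instance are ours):

def primal_dual_wvc(edges, cost):
    # edges: list of (u, v); cost: dict vertex -> weight
    residual = dict(cost)   # c_i minus the dual already charged to vertex i
    y = {}                  # dual variable per edge
    cover = set()
    for (u, v) in edges:
        if u in cover or v in cover:
            continue                    # edge already covered
        raise_by = min(residual[u], residual[v])
        y[(u, v)] = raise_by            # raise y_uv until a constraint goes tight
        residual[u] -= raise_by
        residual[v] -= raise_by
        if residual[u] == 0:
            cover.add(u)
        if residual[v] == 0:
            cover.add(v)
    return cover, y

edges = [(1, 2), (2, 3), (3, 4)]
cost = {1: 2, 2: 3, 3: 1, 4: 4}
print(primal_dual_wvc(edges, cost))  # a cover of cost at most 2 * OPT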

19.8.7 WVC :Ver3

Figure 19.3: An example graph for WVC algorithm

Step 1. Start with x = [00...0] and y = [00...0]
Step 2. Increase y for a i.e. x4 = 1. This implies that x4 can be in the vertex cover. Freeze a, b, c.
Step 3. Increase y for e i.e. x5 = 1.Freeze d, e.
Step 4. Increase y for h i.e. x2 = 1.Freeze f, g, h.

Now, we have got a solution which satisfies P and D both.

Lemma 19.8.1. Let x and y be the solutions obtained from the above algorithm, then x is primal feasible and y is
dual feasible.

Proof. Note that each edge (i, j) removed from E is incident on some vertex i s.t. xi = 1. Additionally, the loop terminates only when all edges have been removed. Therefore, xi + xj ≥ 1 ∀(i, j) ∈ E, i.e., x is feasible for P .
Likewise, once the constraint goes tight for some i, i.e., Σ_{j:(i,j)∈E} yij = ci , the algorithm removes the incident edges. Therefore, none of the sums Σ_{j:(i,j)∈E} yij exceeds ci . Hence, y is feasible for D.

Theorem 19.8.2. The above algorithm produces a vertex cover C with an approximation ratio 2.

Proof. Let OP T be the cost of an optimal vertex cover. We have:

cost(C) = Σ_{i∈C} ci = Σ_{i∈C} (Σ_{j:(i,j)∈E} yij )

The last equality follows from the fact that we set xi = 1 only for vertices corresponding to tight dual constraints, i.e., for i ∈ C we have Σ_{j:(i,j)∈E} yij = ci .
Also note that:

Σ_{i∈C} (Σ_{j:(i,j)∈E} yij ) = Σ_{(i,j)∈E} |C ∩ {i, j}| yij ≤ 2 Σ_{(i,j)∈E} yij

The last inequality follows from the fact that |{i, j}| = 2 for every edge (i, j), i.e., each edge has 2 endpoints, so yij is counted at most twice.
Therefore, we conclude that:

cost(C) ≤ 2 Σ_{(i,j)∈E} yij ≤ 2 OP T (D) ≤ 2 OP T

19.9 Example: Minimum Steiner Forest via Primal-Dual method


Minimum Steiner Forest (MSF): Given an undirected graph G = (V, E), a cost function c : E → Q+ and
a collection of disjoint subsets of V :S1 , ..., Sn , find a minimum cost subgraph of G in which each pair of vertices
belonging to the same Si is connected, such a subgraph is cycle free and is called a steiner forest.

For any X ⊆ V , we set f (X) = 1 iff there exists a u ∈ X and v ∈ X̄ such that u and v belong to some set Si ,
otherwise f (X) = 0.
Let δ(X) be the set of edges with exactly one endpoint in X. Let a binary variable xe indicate whether the edge is
chosen in the subgraph for each edge e ∈ E.

The problem can be formulated as an IP:

min Σ_{e∈E} ce xe
s.t. Σ_{e∈δ(X)} xe ≥ f (X)   ∀X ⊆ V
xe ∈ {0, 1}

The corresponding LP P is:

min Σ_{e∈E} ce xe
s.t. Σ_{e∈δ(X)} xe ≥ f (X)   ∀X ⊆ V
xe ≥ 0

The dual D will be:

max Σ_{X⊆V} f (X) yX
s.t. Σ_{X: e∈δ(X)} yX ≤ ce   ∀e ∈ E
yX ≥ 0

19.9.1 Approximation Algorithm for Minimum Steiner Forest via Primal-Dual


method
Step 1. Initialize F ← ϕ, y ← 0, j ← 0
Step 2. While F is infeasible do:

a) Let χ be the collection of all active sets X.

b) Simultaneously increase yX at the same rate for each active set X ∈ χ until some edge e goes tight.
c) Refer to e as ej .
d) F ← F ∪ {ej }
e) j + +

Step 3. end while


Step 4. Reverse Delete Step: for j = |F | down to 1 do:
if F − ej is primal feasible, then F ← F − ej
Step 5. refer this set as F ′
Step 6. return F ′
Step 7. end

19.10 WEIGHTED SET COVER:


Given a universe U = {1, . . . , m}, a collection of subsets of U , S = {S1 , . . . , Sn }, and a cost function c : S → Q+ , find a minimum-cost subcollection C = {Sj | 1 ≤ j ≤ n} such that C covers all the elements of U .

Let cj be the weight of subset Sj . Let xj be a binary variable such that xj = 1 if Sj ∈ C, otherwise xj = 0.
We have the following IP:

min Σ_{j=1}^{n} cj xj
s.t. Σ_{j: i∈Sj} xj ≥ 1   ∀i ∈ {1, . . . , m}
xj ∈ {0, 1}   ∀j ∈ {1, . . . , n}

The corresponding LP primal P :

min Σ_{j=1}^{n} cj xj
s.t. Σ_{j: i∈Sj} xj ≥ 1   ∀i ∈ {1, . . . , m}
xj ≥ 0   ∀j ∈ {1, . . . , n}

Let yi be the dual variable corresponding to the constraint Σ_{j: i∈Sj} xj ≥ 1. The corresponding dual D is:

max Σ_{i=1}^{m} yi
s.t. Σ_{i∈Sj} yi ≤ cj   ∀j ∈ {1, . . . , n}
yi ≥ 0   ∀i ∈ {1, . . . , m}


For some j, if the dual constraint Σ_{i∈Sj} yi = cj holds, we say that this constraint goes tight and that the corresponding Sj is tight. We have the following algorithm:

Step 1. Initialize x = 0, y = 0
Step 2. While U ̸= ϕ do:
a) Choose an uncovered element, say i, and raise yi until some set in S goes tight, say Sj .
b) Choose all these tight sets Sj and set xj = 1
c) Remove all the elements in these sets Sj from U
d) Remove all these sets Sj from the collection S
Step 3. End while
Step 4. Return C = {Sj | xj = 1}

QUES 2:Derive a Primal-Dual based exact algorithm for minimum spanning tree
problem.

The primal LP formulation for MST is denoted as P:



min Σ_e ce xe
s.t. Σ_{e crosses π} xe ≥ |π| − 1   ∀π
xe ≥ 0   ∀e ∈ E

The corresponding dual is:

max Σ_π yπ (|π| − 1)
s.t. Σ_{π: e crosses π} yπ ≤ ce   ∀e ∈ E
yπ ≥ 0   ∀π

In the algorithm, first note that we need to maintain the yπ . Initially they are all zeroes. At any stage we need
to improve the cost function. In general we can do that by increasing some of the y and decreasing some. To keep
things as simple as possible, we will try and do this by only increasing one of them at a time.
Let us assume that all ce are strictly positive, so initially, none of the constraints are tight. We wish to increase the
cost function. To do so we have to increase some yπ . Which one? It seems that we gain the most by increasing the
one where each vertex is in a separate partition. So, suppose we start to increase this. How much can we increase
it by? We see that we can increase this upto the weight of the minimum weight edge. At this point the inequalities
corresponding to all edges of minimum weight will be tight. Let us consider the generic step. So suppose that at
some stage we have some y. Let F denote the set of edges which are currently tight (the corresponding inequalities
are tight.) We need to increase some yπ such that for each of these edges the sum of the increases in the y that the
edge crosses is at most zero. This means we can increase a yπ such that none of the edges of F cross π. How do we
find such a π? The most natural is to find the connected components and put each component in one part.
Here then is the algorithm for MST:

1. INITIALIZATION: we think of all yπ as 0. Note that we cannot explicitly set them.
2. ITERATIVE STEP: let E ′ denote the set of edges which are tight. Find the connected components of the graph G′ = (V, E ′ ). Increase yπ till some edge becomes tight, where the parts of π are the connected components of G′ .
3. previous step terminates when we get one connected component.

Problems

Minimum-cost branching Given a directed graph G = (V, A), root rϵV , find a min-cost subgraph such that
there is a directed path from r to every other vertex.

Maximal Independent Set Given a graph G = (V, E), find an independent set such that adding any other vertex to the set forces the set to contain an edge.

QUES 1: Derive a Primal-Dual based approximation algorithm for sudoku problem.

QUES 2: Derive a Primal-Dual based exact algorithm for travelling salesman problem.

Chapter 20

Convex Sets and Convex Functions

20.1 Introduction
We have been mostly looking into the linear optimization (linear programming) and the associated concepts from
linear algebra. We had also seen how many of these concepts (eg. duality) are more general than linear programming.

We also argued that linear programming is convex, while integer programming is not. What does that mean? Why should it matter to us? What are the general classes of convex optimization schemes? How does one optimize when the problem is non-convex and/or non-linear? We will discuss some of these in the next few lectures. Note that this is a huge area of literature; we will be discussing only a limited set of topics.

20.1.1 Line joining the points

Suppose x1 ̸= x2 are two points in Rn . Points of the form

y = θx1 + (1 − θ)x2 (20.1)

where θ ∈ R, form the line passing through x1 and x2 . The parameter value θ = 0 corresponds to y = x2 , and the
parameter value θ = 1 corresponds to y = x1 . Values of the parameter θ between 0 and 1 correspond to the (closed)
line segment between x1 and x2 .

Expressing y in the form

y = x2 + θ(x1 − x2 ) (20.2)

gives another interpretation: y is the sum of the base point x2 (corresponding to θ = 0) and the direction x1 − x2
(which points from x2 to x1 ) scaled by the parameter θ. Thus, θ gives the fraction of the way from x2 to x1 where y
lies. As θ increases from 0 to 1, the point y moves from x2 to x1 .

Figure 20.1: The line passing through x1 and x2 is described parametrically by θx1 + (1 − θ)x2 , where θ
varies over R.

The reason for us to familiarize ourselves with the concept of line segment is going to be clear in the next section,
when we dive into convex sets, and convex functions.

20.2 Convex Sets


Definition. A set C is convex if the line segment between any two points in C lies in C, i.e., if for any x1 , x2 ∈ C
and any θ with 0 ≤ θ ≤ 1, we have

y = θx1 + (1 − θ)x2 ∈ C (20.3)


.

In other words, a set is convex if every point in the set can be seen by every other point in the set: the segment between them stays inside.

Figure 20.2: Examples of convex and non-convex sets.

Examples: Figure 20.3 shows an example of a convex set and of a non-convex set. Figure 20.2 shows an additional example at the right extreme: the square contains some of its boundary points but not all, and is therefore not convex.

Figure 20.3: Left: Hexagon is a convex set, while Right The kidney shaped set is clearly not a convex set

Some examples of convex sets are given below:

1. The Real line


2. A line segment α ≤ x ≤ β (x ∈ R)
3. Empty set, Singleton set {x0 },
4. Real space Rn
5. Hyper planes and Half spaces. Hyper planes are represented by

{x | xT a = b}

and half spaces are represented by


{x | xT a ≤ b}
6. Euclidean Balls B(x0 , ε) = {x | ∥x − x0 ∥2 ≤ ε }
7. The solution set of an underdetermined system of equations: Ax = b with A having more columns than (linearly independent) rows.
8. Polyhedron P = {x |Ax ≤ b Cx = d}
A polyhedron is defined as the solution of a finite number of linear equations & inequalities

P = {x | aTi x ≤ bi CjT x = dj }

20.2.1 Properties of Convex sets


1. Intersection of two convex sets is convex
Proof. Suppose A and B are convex, and let C = A ∩ B
To show C is convex, we have to show that if x1 and x2 are elements of C and θ is any scalar between 0 and 1,
then θx1 + (1 − θ)x2 is also an element of C. Since x1 and x2 are in C, x1 and x2 are in A and B. Since A is convex, it follows that θx1 + (1 − θ)x2 is in A. Similarly, x1 and x2 are in B, and B being convex, θx1 + (1 − θ)x2 is in B.

Figure 20.4: Example of a polyhedron
Thus θx1 + (1 − θ)x2 is in A and B, i.e., θx1 + (1 − θ)x2 is in C.
2. Union of two convex sets need not be convex
Proof. We can provide a counter example to disprove that union of two convex sets is convex.
Let us consider two sets on the real line R, A = [α, β] and B = [γ, λ], such that A ∩ B = ∅, i.e., they are mutually exclusive (say β < γ). Any strict convex combination θβ + (1 − θ)γ with 0 < θ < 1 lies strictly between β and γ, hence belongs to neither A nor B; so A ∪ B is not convex.
3. If C = {x} is a convex set, the αC = {αx} is also convex
4. If C = {x} is a convex set, the (C + t) = {x + t} is also convex
5. If C = {x} is a convex set, the (aC + b) = {ax + b} is also convex
6. Set Sum: If C1 and C2 are convex, C1 + C2 = {x + y | x ∈ S1 , y ∈ S2 } is also convex.

20.2.2 Convex Hull

Before appreciating the convex hull, we look at convex combinations. We call a point of the form θ1 x1 + · · · + θk xk = Σ_i θi xi , where θ1 + · · · + θk = 1 and θi ≥ 0, i = 1, . . . , k, a convex combination of the points x1 , . . . , xk . A convex combination of points can be thought of as a mixture or weighted average of the points, with θi the fraction of xi in the mixture.

Definition. The convex hull of a set C is the set of all convex combinations of points in C, and is denoted Conv(C):

Conv(C) = {θ1 x1 + θ2 x2 + · · · + θk xk | xi ∈ C, θi ≥ 0, Σ_i θi = 1}
i

Note that Conv(C) is convex and C ⊆ Conv(C). From the definition of the convex set, one can also see that the
convex hull of a convex set is itself.

Conv(C) is always convex, since a convex combination of points of Conv(C) is again a convex combination of points of C. In fact, it is the smallest convex set that contains C: if B is any convex set that contains C, then Conv(C) ⊆ B. Figure 20.6 shows the convex hull of a set of points. As evident from the figure, the upper part of the hull (colored in blue) is called the upper hull, while the lower part (colored in red) is called the lower hull.

• We know that R is convex. Note that the set of integers Z is not. (Why? A convex combination of integers can result in a point that is not an integer.)

Figure 20.5: The convex hulls of two sets in R2 . Left: The convex hull of a set of thirteen points (shown as dots) is the pentagon (shown shaded). Right: The convex hull of the kidney-shaped set is the shaded set.

• You may remember that linear programming is convex while integer programming is not. You can connect this to the fact that the real space (over which LP optimizes) is convex, while the set of integers is not.
• What happens with LP relaxation? We move to the convex hull of the integer feasible set!

Figure 20.6: Convex hull of a set of points, showing the upper and the lower hull
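Convex hulls of finite point sets can be computed directly (a sketch, assuming scipy is available):

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.random((13, 2))  # thirteen random points in the unit square
hull = ConvexHull(points)

print(hull.vertices)          # indices of the points on the hull boundary
print(points[hull.vertices])  # hull corners, in counter-clockwise order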

20.2.3 Separating Hyperplane

Assume we are given two convex sets C and D such that C ∩ D = ∅. Then we can always find a separating hyperplane between C and D: there exist a nonzero a and a scalar b such that aT x ≤ b for every x in C and aT x ≥ b for every x in D. Such a hyperplane always exists for disjoint convex sets in n-dimensional space. See Figure 20.7 for a visualization.

If C and D are convex sets and C ∩ D = ∅, then there exists a nonzero a such that

∀x ∈ C : aT x ≤ b

and

∀x ∈ D : aT x ≥ b

Such a hyperplane, characterized by (a, b), is called a separating hyperplane.

20.3 Definition of Convex Functions


A Convex function can be defined and understood in multiple ways. We now look at some of these definitions in
detail below.

Figure 20.7: Separatimg plane for two convex sets

Figure 20.8: Example of a convex function. Chord joining the two points lie above the function

20.3.1 Jensen’s inequality

A function f : Rn → R is convex if dom(f ) is a convex set and if for all x, y ∈ dom(f ), and θ with 0 ≤ θ ≤ 1, we
have

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y). (20.4)

This is called Jensen's inequality. In other words, a convex function is a function defined on a convex domain such that for any two points in the domain, the chord between the two points lies above the function. Figure 20.8 shows an example of a convex function: for any two points x and y in the domain of f , the line segment joining (x, f (x)) and (y, f (y)) lies above the graph of the function.
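The inequality is easy to spot-check numerically, e.g., for f (x) = x2 (a minimal sketch, assuming numpy):

import numpy as np

f = lambda t: t ** 2
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y, theta = rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(0, 1)
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    assert lhs <= rhs + 1e-9  # Jensen's inequality (with a float tolerance)
print("Jensen's inequality held on all samples")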

20.3.2 Epigraph

Epigraph of a function f is denoted by epi(f ) and is given by

epi(f ) = {(x, t)|x ∈ dom(f ), t ≥ f (x)} (20.5)

The link between convex sets and convex functions is via epigraph: A function is convex if and only if its epigraph
is a convex set.

In other words, if a function is convex, then its epigraph is a convex set, and conversely, if the epigraph of a function is a convex set, then the function is convex. Figure 20.9 (Left) shows the epigraph of a function f . As evident from the figure, the graph of f can be imagined as a bucket, and the epigraph is the whole space inside (and above) the bucket.

Since the epigraph of a function is the region above the function, convexity of the epigraph means that any two points of the epigraph are joined by a segment lying entirely inside the epigraph. That essentially means that the chord between any two points on the graph never dips below the graph.

Figure 20.9 (Right) shows the epigraph of a non-convex function f . As evident from the figure, the line segment connecting the points above x and y does not lie entirely inside the epigraph. So there exists at least one point among the points p = θx + (1 − θ)y, 0 ≤ θ ≤ 1, at which the chord lies below the function, violating the inequality

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y).      (20.6)

It is to be noted that the epigraph of this function is not a convex set, which shows that the function is non-convex.

Figure 20.9: Left: Epigraph of a convex function. Line joining any two points inside the green region lies
inside the epigraph. Right: For a non-convex function, line joining two points x and y does not lie completely
inside the epigraph of the function

20.3.3 Hessians

A function is convex if its domain is convex and its Hessian is a PSD (positive semidefinite) matrix at every point of the domain. The Hessian is the n × n matrix of second partial derivatives:

H = [ ∂2f / ∂xi ∂xj ]_{n×n}

that is, the (i, j) entry of H is ∂2f /∂xi ∂xj , with the diagonal holding ∂2f /∂xi2 .

If you are not familiar with the terminology of the Hessian, please read the appendix. More insight into this is given in the following two subsections.

20.3.4 First-order conditions

Assume that the function f is differentiable at each point in its domain, i.e., the gradient ∇f exists. Then f is convex if and only if dom f is convex and

f (y) ≥ f (x) + ∇f (x)T (y − x)      (20.7)

holds for all x, y ∈ dom f . This inequality shows that from local information about a convex function we can derive global properties. This is one of the most important properties of convex functions, and it is crucial in the field of convex optimization. In particular, if ∇f (x) = 0, then for all y ∈ dom f we have f (y) ≥ f (x); that is, x is a global minimizer of the function f , and f (x) is the minimum value.

This is consistent with what we already know: for a differentiable function we find candidate minimizers by setting the derivative to 0; for a convex function, any such stationary point is automatically a global minimum.

20.3.5 Second-order conditions

Assume that f is twice differentiable, that is, its Hessian ∇²f exists at each point in dom f, which is open. Then f is convex if and only if dom f is convex and its Hessian is positive semidefinite for all x ∈ dom f:

∇²f (x) ⪰ 0 (20.8)

In other words (in one dimension), if the derivative of the function is non-decreasing, then the function is convex; the graph of the function has non-negative curvature at every point x. In addition, dom f must be a convex set.
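The second-order condition suggests a simple numerical test for convexity: evaluate the Hessian and check that its eigenvalues are non-negative. The sketch below (ours, not from the notes; the quadratic f is a chosen illustration) does this for f (x1, x2) = x1² + x1x2 + x2², whose Hessian is constant:

import numpy as np

# Hessian of f(x1, x2) = x1^2 + x1*x2 + x2^2
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# A symmetric matrix is PSD iff all its eigenvalues are >= 0
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)                      # [1. 3.]
print(np.all(eigenvalues >= 0))         # True, so f is convex on R^2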

20.4 More on Convex Functions

20.4.1 Strictly convex functions

The above definitions do not distinguish strictly convex functions. If we use < instead of ≤ in equation 20.4 (for x ≠ y and 0 < θ < 1), we obtain the strictly convex functions.

20.4.2 Sublevel Sets

The α − sublevel set of a function f : Rn → R is defined as

Cα = {x ∈ dom(f )|f (x) ≤ α}. (20.9)

In other words, the α-sublevel set of a function f is the set of all values of x in the domain of f for which the value of f (x) is less than or equal to α. Figure 20.10 illustrates the concept of the α-sublevel set of a function f.

Lemma 20.4.1. Sublevel sets of a convex function are convex, for any value of α.

Proof. The proof is immediate from the definition of convexity. Let us consider two elements x and y from the set Cα. Then f (x) ≤ α and f (y) ≤ α, so by convexity f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y) ≤ θα + (1 − θ)α = α for 0 ≤ θ ≤ 1, and hence θx + (1 − θ)y ∈ Cα. So any convex combination of x and y is in the set. Since we considered a general value of α, the α-sublevel sets of a convex function are convex for every α.

But it is to be noted that the converse is not true: a function can have all its sublevel sets convex without being a convex function. For example, f (x) = −eˣ is not convex on R (in fact, it is strictly concave), although all its sublevel sets are convex.

20.4.3 Convexity Preserving operations over functions


1. If f is convex, then αf is convex for α ≥ 0
2. If f and g are two convex functions, then
(a) f + g is convex
(b) max(f, g) is convex
Proof:

Figure 20.10: α-sublevel sets of a function f (x). C1 is the α1-sublevel set of f and C2 is the α2-sublevel set. C1, C2 ⊆ dom f

(a) Given f and g convex, let h(x) = f (x) + g(x).

Since f and g are convex, consider x1 and x2 in dom(f ) ∩ dom(g) = dom(h):

f (θx1 + (1 − θ)x2) ≤ θf (x1) + (1 − θ)f (x2)

Similarly, g(θx1 + (1 − θ)x2) ≤ θg(x1) + (1 − θ)g(x2).

h(θx1 + (1 − θ)x2) = f (θx1 + (1 − θ)x2) + g(θx1 + (1 − θ)x2)
⟹ h(θx1 + (1 − θ)x2) ≤ θ(f (x1) + g(x1)) + (1 − θ)(f (x2) + g(x2))
= θh(x1) + (1 − θ)h(x2)
∴ h(θx1 + (1 − θ)x2) ≤ θh(x1) + (1 − θ)h(x2)
⟹ h is convex
(b) Let F (x) = max(f (x), g(x)). Then

F (θx1 + (1 − θ)x2) = max{f (θx1 + (1 − θ)x2), g(θx1 + (1 − θ)x2)}
≤ max{θf (x1) + (1 − θ)f (x2), θg(x1) + (1 − θ)g(x2)}
≤ θ max{f (x1), g(x1)} + (1 − θ) max{f (x2), g(x2)}
= θF (x1) + (1 − θ)F (x2)
⟹ F is convex

Hence proved.
Note that the sum property generalises to any non-negative linear combination of convex functions, and the max property to the pointwise maximum of any finite collection of convex functions, not just two.

20.4.4 Concave Functions

A function f : Rn → R is concave if dom(f ) is a convex set and if for all x, y ∈ dom(f ), and θ with 0 ≤ θ ≤ 1, we
have

f (θx + (1 − θ)y) ≥ θf (x) + (1 − θ)f (y). (20.10)

Examples of concave functions:

1. f (x) = −x²
2. g(x) = x (a linear function is both convex and concave)
3. the sin function is concave on the interval [0, π]

Question: If a function f is convex, will −f be concave?

Question: Must every function be either convex or concave?

20.4.5 Quasiconvex functions

A function f : Rn → R is called quasiconvex or unimodal if its domain and its sublevel sets

Sα = {x ∈ domf |f (x) ≤ α} (20.11)

for α ∈ R, are convex. Figure 20.11 shows a quasiconvex function. All its α-sublevel sets are convex sets, yet the function is not a convex function. In other words, each α-sublevel set should be an interval (which may be unbounded). In the figure, we see that Sα is the interval [a, b], while Sβ is the interval (−∞, c].

Figure 20.11: Quasiconvex function. The α-sublevel and β-sublevel sets of the function are convex, i.e. an interval.

Examples of quasiconvex functions include:



1. |x| is quasiconvex on R.
2. f (x1, x2) = x1x2 is quasiconcave on R²₊₊.
3. The distance ratio
   f (x) = ‖x − a‖₂ / ‖x − b‖₂,  dom(f ) = {x | ‖x − a‖₂ ≤ ‖x − b‖₂}
   is quasiconvex.

20.5 Example Problems


1. Are the following convex sets? Prove.
(a) slab {x ∈ Rn | α ≤ aT x ≤ β}

A slab is convex, since it is the intersection of two half-spaces

{x ∈ Rn | aT x ≤ β} and {x ∈ Rn | aT x ≥ α}

and each half-space is a convex set.

Figure 20.12: An example quasi-convex function.

(b) The set {x |x + S2 ⊆ S1 } where S1 and S2 ⊆ Rn with S1 convex.

x + S2 can be expressed as {y | y = x + z, z ∈ S2 }

Consider the set of all x satisfying the condition. For any two points x1 and x2 in this set, by definition we have

x1 + z ∈ S1, ∀z ∈ S2
x2 + z ∈ S1, ∀z ∈ S2

Consider any λ ∈ [0, 1]:

[λx1 + (1 − λ)x2] + z = λ(x1 + z) + (1 − λ)(x2 + z) ∈ S1, ∀z ∈ S2

because for any z ∈ S2 we have (x1 + z), (x2 + z) ∈ S1 and S1 is a convex set. Therefore the set of all such x is also convex.
(c) {x|||x − x0 ||2 ≤ ||x − y||2 } for all y ϵ S, S ⊆ Rn
Solution: For any fixed value of y ∈ S, consider the set {x | ‖x − x0‖2 ≤ ‖x − y‖2}. It can be shown that it is a halfspace. To show that, it is sufficient to express the condition in the form aᵀx ≤ b. Now x is closer to x0 than to xi if and only if

||x − x0 ||2 ≤ ||x − xi ||2


⇒ (x − x0 )T (x − x0 ) ≤ (x − xi )T (x − xi )
⇒ xT x − 2xT0 x + xT0 x0 ≤ xT x − 2xTi x + xTi xi
⇒ 2(xi − x0 )T x ≤ xTi xi − xT0 x0 ,

which clearly defines a halfspace of the form {x | Ax ⪯ b}, with

$$A = 2\begin{bmatrix} (x_1 - x_0)^T \\ (x_2 - x_0)^T \\ \vdots \\ (x_K - x_0)^T \end{bmatrix}, \quad b = \begin{bmatrix} x_1^T x_1 - x_0^T x_0 \\ x_2^T x_2 - x_0^T x_0 \\ \vdots \\ x_K^T x_K - x_0^T x_0 \end{bmatrix}$$

Once we know that each such set is a halfspace, we can conclude that the given set is convex. The given set can be expressed as

∩_{y ∈ S} {x | ‖x − x0‖2 ≤ ‖x − y‖2}, (20.12)

i.e. an intersection of halfspaces. We know that an intersection of halfspaces is always convex. Hence the given set is convex.
2. Check whether the following functions are convex, concave or quasi-convex
(a) f (x) = eˣ − 1 on R

Every exponential function is convex. Therefore the given function is strictly convex, and every convex function is quasiconvex too.

(b) f (x1, x2) = x1x2 on R²₊₊

The Hessian of f is

$$\nabla^2 f(x) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},$$

which is neither positive nor negative semidefinite. Therefore f is neither convex nor concave. It is quasiconcave, since its superlevel sets

{(x1, x2) ∈ R²₊₊ | x1x2 ≥ α}

are convex.
(c) f (x1, x2) = x1²/x2 on R²₊₊

The Hessian of f is

$$\nabla^2 f(x) = \frac{2}{x_2^3}\begin{bmatrix} x_2^2 & -x_1 x_2 \\ -x_1 x_2 & x_1^2 \end{bmatrix} \succeq 0$$

Therefore, f is convex and quasiconvex. It is not concave.
(d) Inverse of an increasing convex function: Suppose f : R → R is increasing and convex on its domain (a, b). Let g denote its inverse, i.e., the function with domain (f (a), f (b)) such that g(f (x)) = x for a < x < b. What can you say about the convexity or concavity of g?
(e) Check if the following set is convex or not: the set of points whose distance to a does not exceed a fixed fraction θ of the distance to b, i.e.,
{x | ‖x − a‖2 ≤ θ‖x − b‖2}
where a ≠ b, 0 ≤ θ ≤ 1, and a, b, x ∈ Rⁿ.
3. Check whether the following functions are convex, concave, or quasiconvex.
(a) f (x) = eˣ − 1
Solution: The Hessian of f is ∇²f (x) = eˣ. We know that eˣ > 0 for all x. So f is strictly convex, and therefore it is also quasiconvex.
(b) f (x1, x2) = x1x2
Solution: The Hessian of f is

$$\nabla^2 f(x) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},$$

which is neither positive semidefinite nor negative semidefinite. Therefore, f is neither convex nor concave. The superlevel sets of f, given by

{(x1, x2) ∈ R²₊₊ | x1x2 ≥ α} (20.13)

are convex, so the function f is quasiconcave on R²₊₊.
(c) f (x1, x2) = x1/x2
Solution: The Hessian of f is

$$\nabla^2 f(x) = \begin{bmatrix} 0 & -1/x_2^2 \\ -1/x_2^2 & 2x_1/x_2^3 \end{bmatrix},$$

which is neither positive semidefinite nor negative semidefinite. Therefore f is neither convex nor concave.

20.6 Additional problems


1. Show that the following function f : Rⁿ → R is convex:

f (x) = ‖Ax − b‖, where A ∈ R^{m×n}, b ∈ Rᵐ,

and ‖·‖ is a norm on Rᵐ.

2. Which of the following sets are convex


(a) A wedge, i.e. {x ∈ Rⁿ | a₁ᵀx ≤ β₁, a₂ᵀx ≤ β₂}, where a₁, a₂ ∈ Rⁿ and β₁, β₂ ∈ R.
(b) The set of points whose distance to a does not exceed a fixed fraction θ ∈ [0, 1] of the distance to b, i.e. the set {x ∈ Rⁿ | ‖x − a‖₂ ≤ θ‖x − b‖₂}, where a ≠ b are points in Rⁿ.

20.7 Convex Optimization
Convex optimization studies the problem of minimizing convex functions over convex sets. The convexity property can make optimization in some sense easier than the general case: for example, any local minimum must be a global minimum.

Definition. Convex minimization is the minimization of a convex function over a convex set.

Standard form is the usual and most intuitive form of describing a convex minimization problem:

minimize f (x)
subject to fi(x) ≤ 0, where each fi is convex
           hi(x) = 0, where each hi is linear, i.e. aiᵀx = bi

Here the functions hi are affine. In practice, the terms "linear" and "affine" are often used interchangeably. Such constraints can be expressed in the form hi(x) = aiᵀx + bi, where ai is a column vector and bi a real number.

20.8 Appendix: Gradient, Hessian and Jacobian

20.8.1 Gradient

The gradient of a function g(x) of n variables, at x̂, is the vector of first partial derivatives evaluated at x̂, and is denoted by ∇g(x̂):

$$\nabla g(\hat{x}) = \begin{bmatrix} \partial g(\hat{x})/\partial x_1 \\ \partial g(\hat{x})/\partial x_2 \\ \vdots \\ \partial g(\hat{x})/\partial x_n \end{bmatrix}$$

20.8.2 Hessian

The Hessian of a function g(x) of n variables, at x̂, is the matrix of second partial derivatives evaluated at x̂, and is denoted by ∇²g(x̂):

$$\nabla^2 g(\hat{x}) = \begin{bmatrix} \frac{\partial^2 g(\hat{x})}{\partial x_1^2} & \frac{\partial^2 g(\hat{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 g(\hat{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 g(\hat{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 g(\hat{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 g(\hat{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 g(\hat{x})}{\partial x_n \partial x_1} & \frac{\partial^2 g(\hat{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 g(\hat{x})}{\partial x_n^2} \end{bmatrix}$$

This is a symmetric matrix, because ∂²g(x̂)/∂xi∂xj = ∂²g(x̂)/∂xj∂xi.

20.8.3 Jacobian

The Jacobian of a function f : Rⁿ → Rᵐ is the m × n matrix whose entries are the partial derivatives of the components of f. Specifically:

$$J = \begin{bmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial f_m/\partial x_1 & \cdots & \partial f_m/\partial x_n \end{bmatrix}$$

J is the derivative matrix (or Jacobian matrix) evaluated at x0.

Chapter 21

Convex Optimization

21.1 Introduction
Mathematical optimization can be very difficult in general, in terms of computational complexity. However, some classes of problems can be solved easily. These include least-squares problems, linear programming problems, and convex optimization problems. We have seen least squares and linear programming in the past. This lecture focuses primarily on an overview of convex optimization formulations and different variations thereof. Let us start with a brief comparison of convex optimization with the other well-known and, arguably, easier-to-solve classes of optimization, viz. least squares and linear programs, as shown in Table 21.1.

Property                   Least Squares            Linear Programs                  Convex Optimization
Formulation                minimize ‖Ax − b‖₂²      minimize cᵀx                     minimize f0(x)
                                                    s.t. aiᵀx ≤ bi, i = 1, ..., m    s.t. fi(x) ≤ bi, i = 1, ..., m
Analytical solution        x∗ = (AᵀA)⁻¹Aᵀb          No analytical formula            No analytical formula
Algorithms                 Reliable and efficient   Reliable and efficient           Reliable and efficient
Computational complexity   n²k                      n²m                              max{n³, n²m, F}

Table 21.1: Convex optimization versus least squares and linear programs

21.1.1 Review
• A set C is convex, if for any x1 , x2 ∈ C, any x3 = θx1 + (1 − θ)x2 is also in C, where 0 ≤ θ ≤ 1
• Example: R is convex while Z is not. A set of integers can be convex only when it is a singleton.
• A function f is convex iff
f (θx1 + (1 − θ)x2 ) ≤ θf (x1 ) + (1 − θ)f (x2 )
See Figure 21.1 for a quick understanding.

Figure 21.1: Example of a convex function. Note that the function is always below the line joining x1 and
x2 for all the points between x1 and x2 .

• Convex optimization is the optimization/minimization of a convex function over a convex set.


• LP is convex. Note that half spaces are convex and an LP is an optimization over an intersection of half spaces. Also, linear objective functions are convex functions.

In general, any convex optimization problem can be represented as:

minimize f0 (x)
(21.1)
subject to fi (x) ≤ bi , i = 1, . . . , m

where x = (x1, . . . , xn) is the vector of optimization variables, f0 : Rⁿ → R is the objective function, and fi : Rⁿ → R, i = 1, . . . , m are the constraint functions. The optimal solution is the one which has the smallest value of the objective function while satisfying all the constraints.

Optimization problems with additional constraints such as in equation 21.1 are often called constrained optimization
problems. There are also problems which are unconstrained optimization problems.

21.2 Convex Optimization

21.2.1 Optimization Problem

Given a real vector-space X together with a convex, real-valued function:

f : χ → R defined on a convex subset χ of X

The problem is to find a point x∗ in χ at which f attains its smallest value, i.e. a point x∗ such that f (x∗) ≤ f (x) for all x ∈ χ.

Convex minimization has applications in a wide range of disciplines, such as automatic control systems, estimation and signal processing, communications and networks, electronic circuit design, data analysis and modeling, statistics (optimal design), and finance. With recent improvements in computing power and new theoretical breakthroughs, convex minimization has become nearly as straightforward to solve as linear programming.

Convex optimization is the formulation of the general optimization problem over convex sets and convex functions. More specifically, convex optimization refers to the problem of optimizing a convex function over a convex set defined by constraints which are themselves either convex or linear. For a convex optimization problem, any locally optimal solution is also globally optimal.

Why is it important to know convex optimization? Solving optimization problems is generally difficult, except for some classes of problems that can be solved efficiently and reliably. Examples include:

1. Least square problems


2. Linear programming problems
3. Convex optimization problems.

Is it that most problems are convex? Not really. Many practical problems are non-convex. We will discuss some of
these in the next lectures.

21.2.2 Formulation

We use the following notation to define convex optimization problems:

minimize f0 (x)
subject to fi (x) ≤ 0, i = 1, . . . , m (21.2)
hi (x) = 0, i = 1, . . . , p

Here the vector x = (x1, . . . , xn) is the optimization variable of the problem, the function f0 : Rⁿ → R is the convex objective function, and the functions fi : Rⁿ → R, i = 1, . . . , m, are the (inequality) constraint functions. Sometimes fi(x) ≤ 0 is written as fi(x) ≤ bi.

This refers to the problem of finding the value of x that minimises the function f0 while simultaneously satisfying the
conditions fi (x) ≤ 0 and hi (x) = 0 for all i = 1, . . . , m and i = 1, . . . , p respectively. The variable, or rather vector,
x ∈ Rn is the optimization variable, and the function f0 (x) : Rn → R is the cost function or objective function. The
constraints fi (x) ≤ 0, i = 1, . . . , m are the inequality constraints, and the functions fi : Rn → R are usually convex
in nature. If there are no constraints, then we have an unconstrained problem.

Where the functions hi are affine. In practice, the terms “linear” and “affine” are often used interchangeably. Such
constraints can be expressed in the form hi (x) = aTi x + bi , where ai is a column-vector and bi a real number.

21.2.3 Implicit and Explicit Constraints

In convex optimization problems, the inequality constraints and the equality constraints are the explicit constraints.
However, we can convert all the explicit constraints to implicit ones by redefining their domains. The extreme case here would be converting a standard convex optimization problem to an unconstrained one, minimize F (x), where the function F is the same as f0 but with its domain restricted to the feasible set: dom F = {x ∈ dom f0 | fi(x) ≤
0, i = 1, . . . , m, hi (x) = 0, i = 1, . . . , p}. Alternatively, if fi (x) ≤ 0 and hi (x) = 0 are the explicit constraints, then
the implicit constraints are given by x ∈ dom fi , x ∈ dom hi and D = dom f0 ∩ . . . ∩ dom fm ∩ dom h1 ∩ . . . ∩ dom hp
where D is the domain of the objective function.

21.2.4 Feasibility Problem

In any general convex optimization problem, if the objective function is identically zero, then the optimal value is either zero if the feasible set is non-empty, or ∞ if the feasible set is empty. Basically, with f0(x) = 0, the convex optimization problem reduces to

minimize 0
subject to fi (x) ≤ 0, i = 1, . . . , m (21.3)
hi (x) = 0, i = 1, . . . , p

The feasibility problem can then be formulated as follows:

find x
subject to fi(x) ≤ 0, i = 1, . . . , m (21.4)
           hi(x) = 0, i = 1, . . . , p

where the goal is to determine whether the constraints are consistent and, if so, find a point satisfying them. As already mentioned, the optimal value will be 0 if the constraints are feasible, and any x satisfying the constraints will be optimal. In case the constraints are infeasible, the optimal value is ∞.

21.2.5 Optimality Criterion

For any general convex optimization problem, suppose that the objective function f0 is differentiable with respect to x. Then for any x, y from the feasible set we have the following relation:

f0(y) ≥ f0(x) + ∇f0(x)ᵀ(y − x) (21.5)

We can thus conclude that a point x in the feasible set is optimal if and only if the following condition holds:

∇f0(x)ᵀ(y − x) ≥ 0 for all feasible y (21.6)

Geometrically, this optimality criterion tells us that if ∇f0(x) ≠ 0, then it defines a supporting hyperplane to the feasible set at x, as shown in Figure 21.2.

21.2.6 Equivalent convex problems

Two optimization problems are considered to be equivalent if the solution of one can be obtained from the solution
of the other, and vice-versa. In practice, a large number of optimization problems can be converted to a convex
optimization problem, and solved. These conversions are usually done using a few common transformations that
preserve convexity, viz.:

• Change of variables
• Transformation of objective function
• Transformation of constraint functions
• Eliminating equality constraints

Figure 21.2: Optimality condition shown geometrically. The feasible region is the shaded convex hull X.
Possible level curves of f0 are shown as dashed lines. At the optimal point x, −∇f0 (x) defines a supporting
hyperplane.

• Introducing equality constraints


• Introducing slack variables
• Epigraph form

21.3 Quasi Convex Optimization


Note:¹

In mathematics, a quasiconvex function is a real-valued function defined on an interval or on a convex subset of a


real vector space such that the inverse image of any set of the form (−∞, a) is a convex set. Informally, along any
stretch of the curve the highest point is one of the endpoints. The negative of a quasiconvex function is said to be
quasiconcave.

All convex functions are also quasiconvex, but not all quasiconvex functions are convex, so quasiconvexity is a
generalization of convexity. Quasiconvexity and quasiconcavity extend to functions with multiple arguments the
notion of unimodality of functions with a single real argument.

Figure 21.3: A quasilinear function is both quasiconvex and quasiconcave

Figure 21.3 shows a quasiconvex function.

Definition. A function f : S → R defined on a convex subset S of a real vector space is quasiconvex if for all x, y
∈ S and λ ∈ [0, 1] we have:
f (λx + (1 − λ)y) ≤ max(f (x), f (y)).

In words, f is quasiconvex if a point directly between two other points never gives a higher value of the function than both of the other points do. Note that the points x and y, and the point directly between them, can be points on a line or, more generally, points in n-dimensional space.
¹ Note discussed for S2016

A quasilinear function is both quasiconvex and quasiconcave as shown in Figure 21.3.

An alternative way of defining a quasiconvex function f (x) is to require that each sublevel set Sα(f ) = {x | f (x) ≤ α} is a convex set.
If, furthermore, f (λx + (1 − λ)y) < max(f (x), f (y)) for all f (x) ≠ f (y) and λ ∈ (0, 1), then f is strictly quasiconvex.

That is, strict quasiconvexity requires that a point directly between two other points must give a lower value of the function than one of the other points does.

A quasiconcave function is a function whose negative is quasiconvex, and a strictly quasiconcave function is a function whose negative is strictly quasiconvex.

Equivalently, a function f is quasiconcave if f (λx + (1 − λ)y) ≥ min(f (x), f (y)),

and strictly quasiconcave if f (λx + (1 − λ)y) > min(f (x), f (y)).

A (strictly) quasiconvex function has (strictly) convex lower contour sets, while a (strictly) quasiconcave function
has (strictly) convex upper contour sets. A function that is both quasiconvex and quasiconcave is quasilinear. A
particular case of quasi-concavity is unimodality, in which there is a locally maximal value.

21.3.1 Quasiconvex Standard Form

Let ϕt be a family of convex functions that satisfy f0(x) ≤ t ⇔ ϕt(x) ≤ 0. For a fixed t, solve the feasibility problem:

find x such that ϕt(x) ≤ 0
fi(x) ≤ 0, i = 1, . . . , m
Ax = b

If this is feasible, then p∗ ≤ t; else, if it is infeasible, then p∗ > t.

21.3.2 Optimization:

A quasiconvex optimization problem is an optimization problem where we seek to minimize (or, alternatively, maximize) a quasiconvex function over a convex set defined by constraints that are either convex or linear. Mathematically, a quasiconvex optimization problem is defined as:

minimize f0 (x)
subject to fi (x) ≤ 0, i = 1, . . . , m (21.7)
Ax = b

where the function f0 : Rⁿ → R is quasiconvex instead of convex, while the constraint functions fi : Rⁿ → R are convex and the equality constraints are linear. It is to be noted that, owing to the nature of quasiconvex functions, this sort of problem can have locally optimal points that are not globally optimal.

A standard approach to solving quasiconvex optimization problems is to represent the sublevel sets of the quasiconvex function with a family of convex inequalities. If f0 is the quasiconvex function, then we have a family of functions ϕt such that the t-sublevel set of f0 is the 0-sublevel set of ϕt, i.e.

f0 (x) ≤ t ⇐⇒ ϕt (x) ≤ 0 (21.8)

We can then solve the quasiconvex optimization problem by solving the associated feasibility problem, viz.

find x
subject to ϕt (x) ≤ 0
(21.9)
and fi (x) ≤ 0, i = 1, . . . , m
and Ax = b

which is a convex feasibility problem in x for fixed t. If it is feasible, then we have t ≥ p∗; otherwise t < p∗. We can solve the quasiconvex problem using a variant of the bisection method, as shown below.

minimize f0(x)
subject to fi(x) ⩽ 0, i = 1, 2, . . . , m
           Ax = b

with f0 : Rⁿ → R quasiconvex and f1, . . . , fm convex; such a problem can have locally optimal points that are not (globally) optimal.

Figure 21.4: QuasiConvex Function

Representation of sublevel sets of f0:

if f0 is quasiconvex, there exists a family of functions Φt such that:

1. Φt is convex in x for fixed t
2. the t-sublevel set of f0 is the 0-sublevel set of Φt, i.e., f0(x) ≤ t ⇔ Φt(x) ≤ 0

Example:

f0(x) = p(x)/q(x)

with p convex, q concave, and p(x) ≥ 0, q(x) > 0 on dom f0. We can take Φt(x) = p(x) − t q(x):
1. for t ≥ 0, Φt is convex in x
2. p(x)/q(x) ≤ t if and only if Φt(x) ≤ 0

Quasiconvex optimization via convex feasibility problems:

Φt(x) ≤ 0, fi(x) ≤ 0, i = 1, 2, . . . , m, Ax = b (1)

1. For fixed t, this is a convex feasibility problem in x.
2. If it is feasible, we can conclude that t ≥ p⋆; if infeasible, t < p⋆.

Bisection method for quasiconvex optimization:

given l ≤ p⋆, u ≥ p⋆, tolerance ϵ > 0

repeat
1. t := (l + u)/2
2. Solve the convex feasibility problem (1)
3. if (1) is feasible, u := t; else l := t
until u − l ≤ ϵ

This requires exactly ⌈log₂((u − l)/ϵ)⌉ iterations (where u, l are the initial values).

Algorithm: Bisection method for quasiconvex optimization

Require: l ≤ p∗, u ≥ p∗, tolerance ϵ > 0
repeat
    t ← (l + u)/2
    Solve the convex feasibility problem in x for this t
    if feasible then
        u ← t
    else
        l ← t
    end if
until u − l ≤ ϵ
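For intuition, here is a minimal one-dimensional Python sketch of this bisection scheme (our own illustration, not part of the notes; the choice of p, q and the use of scipy.optimize.minimize_scalar as the convex feasibility oracle are assumptions we make here):

import numpy as np
from scipy.optimize import minimize_scalar

# Minimize f0(x) = p(x)/q(x) with p(x) = x^2 + 1 (convex, >= 0) and
# q(x) = x + 2 (concave, > 0) on the domain x > -2.
# Phi_t(x) = p(x) - t*q(x) is convex for t >= 0, and
# f0(x) <= t  <=>  Phi_t(x) <= 0.
p = lambda x: x ** 2 + 1
q = lambda x: x + 2

def feasible(t):
    # Convex feasibility subproblem: does some x give Phi_t(x) <= 0 ?
    res = minimize_scalar(lambda x: p(x) - t * q(x),
                          bounds=(-1.9, 100.0), method="bounded")
    return res.fun <= 0

l, u, eps = 0.0, 10.0, 1e-6      # initial bracket with l <= p* <= u
while u - l > eps:
    t = (l + u) / 2
    if feasible(t):
        u = t                    # feasible: t >= p*
    else:
        l = t                    # infeasible: t < p*
print(u)                         # ~0.4721, i.e. 2*sqrt(5) - 4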

21.4 Variations and Other Formulations


Linear optimization, or rather linear programming problems are simply convex optimization problems where the
objective and constraint functions are all affine. A general linear problem has the standard form:

minimize cT x + d
subject to Gx ≤ h (21.10)
Ax = b

where G ∈ Rm×n and A ∈ Rp×n .


The geometric interpretation of a linear program is shown in Figure 21.5. The feasible region of a linear program is, as already covered in previous chapters, a polyhedron formed by the intersection of the constraints.

Figure 21.5: A linear program shown geometrically. The polyhedron P is the feasible region. The level
curves of the linear objective function cT x are orthogonal to c, and the point x∗ is optimal.

21.4.1 Linear Fractional Programming

A linear-fractional programming problem is another type of optimization problem, where the objective is to minimize a ratio of affine functions over a polyhedron formed by the intersection of constraints. The standard form of a linear-fractional program is:

minimize f0(x) = (cᵀx + d)/(eᵀx + f)
subject to Gx ≤ h (21.11)
           Ax = b

where dom f0 = {x | eᵀx + f > 0}. Linear-fractional programs are quasiconvex, since the objective function is quasiconvex.

21.4.2 Quadratic Optimisation

A convex optimization problem with an objective function that is quadratic and constraints that are affine in nature
is known as a quadratic optimization problem. In general, a quadratic optimization problem can be expressed as:
minimize ½ xᵀPx + qᵀx + r
subject to Gx ≤ h (21.12)
           Ax = b

where P ∈ Sⁿ₊, G ∈ R^{m×n}, and A ∈ R^{p×n}. Here, Sⁿ₊ denotes the set of symmetric positive semidefinite n × n matrices. The feasible region is a polyhedron in the case of quadratic optimization, as shown in Figure 21.6.

Figure 21.6: A quadratic optimization shown geometrically. The polyhedron P is the feasible region. The
level curves of the quadratic objective function are shown as dashed curves. x∗ is the optimal point.

Example: Least squares

minimize ‖Ax − b‖₂²
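Since Table 21.1 lists the analytical solution x∗ = (AᵀA)⁻¹Aᵀb for this problem, it is easy to check numerically; the sketch below (ours, on random data) compares the normal-equations solution with numpy's built-in least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # x* = (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # SVD-based solver

print(np.allclose(x_normal, x_lstsq))            # True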

Related Topics

1. Quadratically constrained quadratic program (QCQP)
2. Second-order cone programming

21.4.3 Quadratically Constrained Quadratic Programs

This is a variant of quadratic optimization with constraints that are quadratic in nature as opposed to convex or
affine constraints as in plain vanilla quadratic optimization.

minimize ½ xᵀP0x + q0ᵀx + r0
subject to ½ xᵀPix + qiᵀx + ri ≤ 0, i = 1, . . . , m (21.13)
           Ax = b

Here Pi ∈ Sⁿ₊, i = 0, . . . , m; when Pi ≻ 0, the feasible region is an intersection of ellipsoids. Note that a quadratic program is the special case of a QCQP with Pi = 0 for i = 1, . . . , m, and a linear program (by extension) is the special case with Pi = 0 for i = 0, . . . , m.

21.4.4 Second Order Cone Programming

Second order cone programming is a form of convex optimization where the inequalities are second order cone constraints. It is closely related to other forms of convex optimization. A second order cone program can be reduced to a QCQP when ci = 0, i = 1, . . . , m, by squaring each of the constraints. Furthermore, if Ai = 0, i = 1, . . . , m, then it reduces further to a linear program.

21.4.5 SDP

Semidefinite programming is the subfield of convex optimization that deals with the optimization of a linear objective function over the intersection of the cone of positive semidefinite matrices with an affine space. The inequality constraints are called linear matrix inequalities. Like SOCP, it is closely related to other forms of convex optimization. Linear programs and second order cone programs can be converted to semidefinite programs, which are more general in nature.

21.4.6 Hard Variations

Slight modifications of the standard convex optimization problem can yield problems that are quite hard to solve
computationally. A few examples of the same include the following:

• Convex Maximization/Concave Minimization

maximize ||x|| subject to Ax ≤ b (21.14)


• Nonlinear Equality Constraints
minimize cT x subject to xT Pi x + qiT x + ri = 0, i = 1, . . . , k (21.15)

• Non-convex Sets, e.g. Boolean Sets

find x subject to Ax ≤ b, xi ∈ {0, 1} (21.16)

21.5 Exercise
Exercise. Optimize f (x, y) = 5x − 3y such that x² + y² = 136.

Form the Lagrangian F = 5x − 3y − λ(x² + y² − 136).

Fx is obtained by differentiating F with respect to x:

Fx = 5 − 2λx = 0, therefore x = 5/(2λ)

Fy is obtained by differentiating F with respect to y:

Fy = −3 − 2λy = 0, therefore y = −3/(2λ)

Fλ is obtained by differentiating F with respect to λ:

Fλ = −x² − y² + 136 = 0
⟹ 25/(4λ²) + 9/(4λ²) = 136
⟹ 34/(4λ²) = 136
⟹ λ² = 1/16

So λ can take the following two values:

λ = 1/4: then x = 10 and y = −6, so f (10, −6) = 68 (maximization problem)

λ = −1/4: then x = −10 and y = 6, so f (−10, 6) = −68 (minimization problem)

Exercise. Consider the optimization problem:


minimize f0 (x1 , x2 )
subject to 2x1 + x2 ≥ 1
x1 + 3x2 ≥ 1
x1 , x2 ≥ 0
Make a sketch of the feasible region/set. For each objective function below, give the optimal set and value.

• f0 (x1 , x2 ) = x1 + x2
• f0 (x1 , x2 ) = max{x1 , x2 }
• f0 (x1 , x2 ) = x21 + 9x22

The feasible set is given by the intersection of the halfspaces defined by the constraints. More specifically, the two inequality constraints intersect where

2x1 + x2 = 1 and x1 + 3x2 = 1.

Doubling the second equation gives 2x1 + 6x2 = 2; subtracting the first yields 5x2 = 1, so

(x1, x2) = (2/5, 1/5)

Between each of the inequality constraints and the non-negativity constraints, we have the following points of inter-
section:
(0, ∞), (0, 1), (1, 0), (∞, 0)
So, the feasible region is the (unbounded) convex polygon with vertices (0, ∞), (0, 1), (2/5, 1/5), (1, 0), (∞, 0), after taking into account the appropriate directions according to the inequalities. This is shown in Figure 21.7.

Figure 21.7: Feasible region.

For f0 (x) = x1 + x2 , the optimal set is given by x∗ = (2/5, 1/5), and the optimal value is 3/5.

For f0 (x) = max{x1 , x2 }, the optimal set is given by x∗ = (1/3, 1/3), and the optimal value is 1/3.

For f0 (x) = x21 + 9x22 , the optimal set is given by x∗ = (1/2, 1/6), and the optimal value is 1/2.

Exercise. Prove that x∗ = [1, 1/2, −1]T is optimal for the problem:

minimize ½ xᵀPx + qᵀx + r
subject to −1 ≤ xi ≤ 1, ∀i = 1, 2, 3

$$P = \begin{bmatrix} 13 & 12 & -2 \\ 12 & 17 & 6 \\ -2 & 6 & 12 \end{bmatrix}, \quad q = \begin{bmatrix} -22.0 \\ -14.5 \\ 13.0 \end{bmatrix}, \quad r = 1$$

In order to minimize ½ xᵀPx + qᵀx + r, the first thing we need to do is find its gradient. Differentiating the objective function with respect to x gives:

∇f0(x)ᵀ = xᵀP + qᵀ

For the given x∗ to be the optimal solution, it needs to satisfy the optimality condition specified earlier, which is given again as follows:

∇f0(x)ᵀ(y − x) ≥ 0 for all feasible y
Now, for x∗ = [1, 1/2, −1]ᵀ, the gradient of the objective function attains the value

∇f0(x∗)ᵀ = {x∗}ᵀP + qᵀ = [−1, 0, 2],

which is obtained by plugging in the values of P and q. Therefore the optimality condition, for any y = [y1, y2, y3]ᵀ, reduces to:

∇f0(x∗)ᵀ(y − x∗) = −1(y1 − 1) + 0(y2 − 1/2) + 2(y3 + 1)
= −y1 + 1 + 2y3 + 2
= 3 + 2y3 − y1
≥ 0, ∀y such that −1 ≤ yi ≤ 1

Since the optimality condition is satisfied along with the feasibility constraints −1 ≤ yi ≤ 1, we can conclude that x∗ = [1, 1/2, −1]ᵀ is indeed an optimal solution for the quadratic optimization problem of minimizing ½ xᵀPx + qᵀx + r.

Exercise. Consider the following optimization problem:

minimize x1 + x2
subject to − x1 ≤ 0
and − x2 ≤ 0
and 1 − x1 x2 ≤ 0

Prove that the feasible set is a half-hyperboloid, with optimal value 2 at optimal point x∗ = (1, 1).

Chapter 22

Optimization Problem: Support Vector Machines

22.1 Introduction
The max-margin formulation and the SVM problem.

22.2 Primal problem

22.3 Dual problem

22.4 Problems

Chapter 23

SDP and MaxCut

23.1 Introduction

Chapter 24

Nonlinear Optimization: Solving f (x) = 0

24.1 Introduction
In this lecture we discuss various methods to solve m nonlinear equations in n variables:

f1(x1, x2, . . . , xn) = 0
f2(x1, x2, . . . , xn) = 0
f3(x1, x2, . . . , xn) = 0
⋮
fm(x1, x2, . . . , xn) = 0

We can also write this system as f (x) = 0, where

$$f(x) = \begin{bmatrix} f_1(x_1, x_2, \ldots, x_n) \\ f_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ f_m(x_1, x_2, \ldots, x_n) \end{bmatrix} \quad \text{and} \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
Often we make two assumptions:

1. f is continuous (i.e. lim_{y→x} f (y) = f (x) for all x) and sometimes differentiable.
2. f is available in explicit form (e.g. x1² + x2² + x3² = 10) or as a black box (input: x, output: f (x)).

We are interested in two problems (in the unconstrained setting):

1. Solve f (x) = 0. (You may recollect that we solved Ax − b = 0 in the first part of the course). Indeed this
problem is more complex than the linear systems that we had discussed. Also the optimality criteria (i.e.,
derivative is zero) results in this problem for many optimization tasks.
2. Minimize ||f (x)||. (You may also recollect that we solved the linear least squares problem in the past i.e.,
minimize ||Ax − b||.).

24.1.1 Iterative Algorithms

Nonlinear equations usually have to be solved by iterative algorithms, that generate a sequence of points x(0) , x(1) , x(2) , . . .
with f (x(k) ) → 0, as k → ∞. The vector x(0) is called the starting point of the algorithm, and x(k) is called the
kth iterate. Moving from x(k) to x(k+1) is called an iteration of the algorithm. The algorithm is terminated when
||f (x(k) )|| ⩽ ϵ, where ϵ > 0 is some specified tolerance, or when it is determined that the sequence is not converging.

The general structure of this class of algorithms is:

1. Choose a starting point x0.
2. Compute the next iterate xk+1 using xk and information about the function at xk.
3. Repeat until a solution point with sufficient accuracy is found or no further progress can be made.

Some questions that we are interested in answering in this regard are:

• How many iterations are required?
• How fast does it converge?
• Is it converging to a local or a global solution?

24.2 Bisection Method


This method is applicable when we wish to solve the equation f (x) = 0 for the real variable x, where f is a continuous function defined on an interval [a, b] and f (a) and f (b) have opposite signs. In this case a and b are said to bracket a root since, by the intermediate value theorem, f must have at least one root in the interval (a, b).

At each step the method divides the interval in two by computing the midpoint c = (a + b)/2 of the interval and
the value of the function f (c) at that point. Unless c is itself a root (which is very unlikely, but possible) there are
now two possibilities: either f (a) and f (c) have opposite signs and bracket a root, or f (c) and f (b) have opposite
signs and bracket a root. The method selects the subinterval that is a bracket as a new interval to be used in the
next step. In this way the interval that contains a zero of f is reduced in width by 50% at each step. The process is
continued until the interval is sufficiently small.

Figure 24.1: Bisection Method.

Figure 24.1 shows how the bisection method converges.


Source : https://www.math.ust.hk/ mamu/courses/231/Slides/ch02_1.pdf

The bisection method is basically a numerical method for estimating the roots of an equation f (x) = 0. Consider an equation f (x) = 0 which has a zero in the interval [a, b] with f (a) · f (b) < 0. The method computes the zero, say p, by repeatedly halving the interval [a, b], starting with p = (a + b)/2. This step is like computing x0. The next step is to compute the next iterate: the interval [a, b] is replaced by [p, b] if f (p) · f (b) < 0, or by [a, p] if f (a) · f (p) < 0. This process is continued until the zero is obtained, i.e. f (xk+1) → 0, or |a − b| < ϵ.

24.2.1 Algorithm

Consider the example in Figure 24.2. We start with a1, b1 such that f (a1) · f (b1) < 0.

The number of iterations taken by this procedure to converge is log₂(|a1 − b1|/ϵ).

Given: a function f (x) continuous on an interval [a, b] with f (a) ∗ f (b) < 0

while ( |a − b| > ϵ )
{
    p = (a + b)/2
    if ( f (a) ∗ f (p) < 0 )
        b = p
    else
        a = p
}
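The pseudocode translates directly into Python; the following is a minimal runnable sketch (ours), applied to the function of Example 24 below:

def bisection(f, a, b, eps=1e-2):
    # Find a root of f in [a, b]; f(a) and f(b) must have opposite signs.
    assert f(a) * f(b) < 0
    while abs(a - b) > eps:
        p = (a + b) / 2
        if f(a) * f(p) < 0:
            b = p            # root lies in [a, p]
        else:
            a = p            # root lies in [p, b]
    return (a + b) / 2

# Root of f(x) = x^2 - 3 in [1, 2] (Example 24 below)
print(bisection(lambda x: x ** 2 - 3, 1.0, 2.0))   # ~1.73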

Example 24. Question: Find the root of f (x) = x2 − 3, given ϵ = 0.01.


Answer: We start with the interval [1,2]. Iterations are shown in Table 24.1.

Thus, at the seventh iteration, we note that the final interval, [1.7266, 1.7344], has width less than 0.01 and |f (1.7344)| < 0.01, and therefore we choose b = 1.7344 as our approximation of the root.

Example 25. Suppose we apply bisection method to find a root of the polynomial

f (x) = x3 − x − 2.
For a1 = 1 and b1 = 2 we have f (a1 ) = −2 and f (b1 ) = +4. Since the function is continuous, the root must lie within
the interval [1, 2].

Figure 24.2: An Example

a b f(a) f(b) p=(a+b)/2 f(p) Update new (b-a)


1.0 2.0 -2.0 1.0 1.5 -0.75 a=p 0.5
1.5 2.0 -0.75 1.0 1.75 0.062 b=p 0.25
1.5 1.75 -0.75 0.0625 1.625 -0.359 a=p 0.125
1.625 1.75 -0.3594 0.0625 1.6875 -0.1523 a=p 0.0625
1.6875 1.75 -0.1523 0.0625 1.7188 -0.0457 a=p 0.0313
1.7188 1.75 -0.0457 0.0625 1.7344 0.0081 b=p 0.0156
1.71988 1.7344 -0.0457 0.0081 1.7266 -0.0189 a=p 0.0078

Table 24.1: Example for Bisection

The midpoint is

c1 = (1 + 2)/2 = 1.5.

The function value at the midpoint is f (c1) = −0.125. Because it is negative, we set a2 = c1 = 1.5 and b2 = b1 = 2, to ensure the endpoints have opposite signs at the next iteration.

Table 24.2 shows how the method converges gradually to the solution.

After 13 iterations, it becomes apparent that there is a convergence to about 1.521: a root for the polynomial.

24.2.2 Convergence Analysis

Bisection method is based on the Intermediate Value Theorem (IVT). It is guaranteed to converge to a root of f if
f is a continuous function on the interval [a, b], and f (a) and f (b) have opposite signs. The absolute error is halved
at each step so the method converges linearly, which is comparatively slow.

Specifically, if c1 = (a + b)/2 is the midpoint of the initial interval, and cn is the midpoint of the interval in the nth
step, then the difference between cn and a solution c is bounded by

|cₙ − c| ⩽ |b − a| / 2ⁿ

This formula can be used to determine in advance the number of iterations that the bisection method would need to converge to a root within a certain tolerance. The number of iterations needed, n, to achieve a given error (or tolerance) ϵ is

n = log₂(ϵ₀/ϵ) = (log ϵ₀ − log ϵ)/log 2

where ϵ₀ = b − a is the initial bracket size. For example, with ϵ₀ = 1 and ϵ = 0.01 this gives n = log₂(100) ≈ 6.64, i.e. 7 iterations, consistent with Example 24.

Iteration an bn cn f (cn )
1 1 2 1.5 -0.125
2 1.5 2 1.75 1.6093750
3 1.5 1.75 1.625 0.6660156
4 1.5 1.625 1.5625 0.2521973
5 1.5 1.5625 1.5312500 0.0591125
6 1.5 1.5312500 1.5156250 -0.0340538
7 1.5156250 1.5312500 1.5234375 0.0122504
8 1.5156250 1.5234375 1.5195313 -0.0109712
9 1.5195313 1.5234375 1.5214844 0.0006222
10 1.5195313 1.5214844 1.5205078 -0.0051789
11 1.5205078 1.5214844 1.5209961 -0.0022794
12 1.5209961 1.5214844 1.5212402 -0.0008289
13 1.5212402 1.5214844 1.5213623 -0.0001034
14 1.5213623 1.5214844 1.5214233 0.0002594
15 1.5213623 1.5214233 1.5213928 0.0000780

Table 24.2: Bisection Method Iterations

The linear convergence is thus expressed by ϵₙ₊₁ = C · ϵₙᵐ with m = 1.

24.2.3 Discussion
1. The method is guaranteed to converge.
2. The error bound decreases by half with each iteration.
3. The bisection method is robust and simple, but converges very slowly.
4. It is often used to obtain a rough approximation to a solution which is then used as a starting point for more
rapidly converging methods.
5. The bisection method cannot detect multiple roots.

Advantages
• The procedure is simple
• We don’t need the explicit form of the function, i.e. a blackbox representation is sufficient
• Guaranteed to converge

Disadvantages
• Requires two initial points that bracket a root
• Very slow

24.3 Fixed point iteration


A point where x = g(x) is called a fixed point of the function g(x). We can make use of this concept to solve problems of the form f (x) = 0 as follows: to solve f (x) = 0, we create a function g(x) such that a solution of f (x) = 0 is a fixed point of g(x).

Algorithm (iterative fixed point):
x0 ← random()
while |g(xk) − xk| > ϵ:
    xk+1 ← g(xk)

Fixed-point iteration is a method of computing fixed points of iterated functions. A fixed-point (also known as
invariant point) of a function is an element of the function’s domain that is mapped to itself by the function. For
example, if g is defined on the real numbers by g(x) = x2 − 3x + 4, then 2 is a fixed point, since g(2) = 2.

Figure 24.3: A function with three fixed points.

Figure 24.3 shows a function with three fixed points.


Source : http://en.wikipedia.org/wiki/Fixed_point_(mathematics)

More specifically, given a function g defined on the real numbers with real values and given a point x0 in the domain
of g, the fixed-point iteration is
xn+1 = g(xn ), n = 0, 1, 2, . . .
which gives rise to a sequence x0, x1, x2, . . . which, we hope, converges to a point x. If g is continuous, then one can prove that the obtained x is a fixed point of g, i.e. g(x) = x.

Given a root-finding problem f (p) = 0, there are many g with fixed-points at p. If g has fixed-points at p, then
f (x) = x − g(x) has zero at p.

More generally, the function g can be defined on any metric space with values in that same space.

24.3.1 Algorithm

Given: an equation f (x) = 0

Convert f (x) = 0 into the form x = g(x)
Let the initial guess be x0
Do
    xi+1 = g(xi)
while (none of the convergence criteria C1 or C2 is met)

1. C1. Fixing a priori the total number of iterations N.
2. C2. Testing whether |xi+1 − xi| (where i is the iteration number) is less than some tolerance limit ϵ, fixed a priori.
Example 26. Question: Find a root of x⁴ − x − 10 = 0.
Answer: Consider g1(x) = 10/(x³ − 1) and the fixed point iterative scheme
xi+1 = 10/(xi³ − 1), i = 0, 1, 2, . . .
Let the initial guess x0 be 2.0:

i:  0  1      2      3      4        5          6    7         8
xi: 2  1.429  5.214  0.071  −10.004  −9.978E−3  −10  −9.99E−3  −10

So the iterative process with g1 goes into an infinite loop without converging.
Consider another function g2(x) = (x + 10)^{1/4} and the fixed point iterative scheme xi+1 = (xi + 10)^{1/4}, i = 0, 1, 2, . . .
Let the initial guess x0 be 1.0, 2.0 and 4.0:

i:  0    1        2        3        4        5        6
xi: 1.0  1.82116  1.85424  1.85553  1.85558  1.85558
xi: 2.0  1.861    1.8558   1.85559  1.85558  1.85558
xi: 4.0  1.93434  1.85866  1.8557   1.85559  1.85558  1.85558

That is, for g2 the iterative process converges to 1.85558 for any of these initial guesses.
Consider g3(x) = (x + 10)^{1/2}/x and the fixed point iterative scheme xi+1 = (xi + 10)^{1/2}/xi, i = 0, 1, 2, . . . Let the initial guess x0 be 1.8:

i:  0    1       2        3        4        5        6        ...  98
xi: 1.8  1.9084  1.80825  1.90035  1.81529  1.89355  1.82129  ...  1.8555

That is, for g3 the iterative process converges for any initial guess, but very slowly, toward the same root 1.85558.
The geometric interpretation of convergence with g1, g2 and g3 is shown in the figures below.

Figure 24.4: Using g1 the iterative process does not converge for any initial approximation.

Figure 24.5: Using g2, the iterative process converges very quickly to the root which is the intersection point
of y = x and y = g2(x) as shown in the figure.

Figure 24.6: Using g3, the iterative process converges but very slowly.

Example 27. Consider f (x) = x² − x − 2. Suppose that we want to solve f (x) = 0.

To solve this by the iterative fixed point algorithm, we can think of 4 possible candidates for g(x), derived as follows:

1. x² − x − 2 = 0 ⟹ x = x² − 2 ⟹ g(x) = x² − 2 (for x = g(x))
2. x² − x − 2 = 0 ⟹ x² = x + 2 ⟹ x = √(x + 2) ⟹ g(x) = √(x + 2)
3. x² − x − 2 = 0 ⟹ x(x − 1) = 2 ⟹ x − 1 = 2/x ⟹ x = 1 + 2/x ⟹ g(x) = 1 + 2/x
4. x² − x − 2 = 0 ⟹ 2x² − x = x² + 2 ⟹ x(2x − 1) = x² + 2 ⟹ x = (x² + 2)/(2x − 1) ⟹ g(x) = (x² + 2)/(2x − 1)

Suppose we choose the candidate g(x) = 1 + 2/x. Then the steps in the solution are as follows:

g(x) = 1 + 2/x
Let x0 = 1
x1 = 1 + 2/1 = 3
x2 = 1 + 2/3 = 5/3
x3 = 1 + 2/(5/3) = 11/5
x4 = 1 + 2/(11/5) = 21/11
5

Now suppose we choose the candidate g(x) = x2 − 2. Then the steps in the solution are as follows

g(x) = x2 − 2
Let x0 = 2.5
x1 = 2.52 − 2 = 4.25
x2 = 4.252 − 2 = 16.0625
x3 = 16.06252 − 2 = 256.004

Clearly, the series is not converging

Note
• Not all g(x) are good, as they may not converge to a solution
• We have to analyze what makes a particular g(x) good or bad for this purpose
• If |g′(x∗)| < 1 then the fixed point iteration is locally convergent
• E.g.
  For g(x) = 1 + 2/x, we have g′(x) = −2/x², so for x = 2, |g′(x)| = 2/4 = 1/2 < 1
  For g(x) = x² − 2, we have g′(x) = 2x, so for x = 2, g′(x) = 2 × 2 = 4 > 1

Example 28. Find √a with the help of fixed point iteration, i.e. solve x² − a = 0.

We need to put this in the form x = g(x) for using fixed point iteration:

x² − a = 0 ⟹ 2x² = a + x² ⟹ x = ½(a/x + x) ⟹ g(x) = ½(a/x + x)

Using the above formulation, let us obtain √5, i.e. a = 5. With a starting value of x0 = 1 we get:

Iteration xn g(xn ) xn+1


1 1 3 3
2 3 2.333 2.333
3 2.333 2.238 2.238
4 2.238 2.236 2.236
5 2.236 2.236 2.236

Table 24.3: Fixed-point iterations for √5

Table 24.3 shows the fixed-point iteration convergence. Thus we obtain √5 ≈ 2.236.

Example 29. Suppose we want to solve the equation

f (x) = cos x − x

Iteration xn g(xn ) xn+1
1 0.3 0.92106099400289 0.92106099400289
2 0.92106099400289 0.60497568726594 0.60497568726594
3 0.60497568726594 0.82251592555039 0.82251592555039
4 0.82251592555039 0.68037954156567 0.68037954156567
5 0.68037954156567 0.77733400966246 0.77733400966246
6 0.77733400966246 0.71278594551835 0.71278594551835
7 0.71278594551835 0.75654296195845 0.75654296195845

Table 24.4: Fixed-Point Iterations for cos x

There are many ways to convert f (x) to the fixed-point form x = g(x). Suppose we choose

cos x = x such that g(x) = cos x

as our fixed-point equation. With a starting value of x0 = 0.3 we get:

Table 24.4 shows the fixed-point iteration convergence.

Figure 24.7: Convergence of the cosine function.

Figure 24.7 shows the iterations of the cos x function.


Source : http://www-rohan.sdsu.edu/ jmahaffy/courses/s10/math541/lectures/pdf/week03/lecture-static-04.pdf

24.3.2 Convergence Analysis

We prove the convergence of fixed-point iteration using the following three theorems.

Theorem 24.3.1. If f ∈ C [a, b] and f (x) ∈ [a, b] , ∀x ∈ [a, b], then f has a fixed point p ∈ [a, b]. (Brouwer fixed
point theorem)

Proof. If f (a) = a, or f (b) = b, then we are done.


Otherwise, f (a) > a and f (b) < b.
We define a new function h(x) = f (x) − x.
Since both f (x) and x are continuous, we have h(x) ∈ C [a, b]. Further h(a) > 0, and h(b) < 0 by construction.
Now, the intermediate value theorem guarantees ∃p∗ ∈ (a, b) : h(p∗ ) = 0.
We have 0 = h(p∗ ) = f (p∗ ) − p∗ or p∗ = f (p∗ ).

Theorem 24.3.2. If, in addition, the derivative f ′ (x) exists on (a, b) and |f ′ (x)| ⩽ k < 1, ∀x ∈ (a, b), then the
fixed-point is unique.

Proof. |f ′ (x)| ⩽ k < 1. Suppose we have two fixed points p∗ ̸= q ∗ .


Without loss of generality we may assume p∗ < q ∗ .
The mean value theorem tells us ∃r ∈ (p∗, q∗) such that

f′(r) = (f (p∗) − f (q∗))/(p∗ − q∗)

Now,

|p∗ − q∗| = |f (p∗) − f (q∗)| = |f′(r)| · |p∗ − q∗| ⩽ k |p∗ − q∗| < |p∗ − q∗|
The contradiction |p∗ − q ∗ | < |p∗ − q ∗ | shows that the supposition p∗ ̸= q ∗ is false. Hence, the fixed point is
unique.

Theorem 24.3.3. Then, for any number p0 ∈ [a, b], the sequence defined by
pn = f (pn−1 ), n = 1, 2, . . . , ∞
converges to the unique fixed point p∗ ∈ [a, b].

Proof. The proof is straightforward:

|pn − p∗| = |f (pn−1) − f (p∗)| = |f′(r)| · |pn−1 − p∗| (by the MVT)
⩽ k |pn−1 − p∗| (from Theorem 24.3.2)

Since k < 1, the distance to the fixed point shrinks in every iteration. In fact,

|pn − p∗| ⩽ kⁿ |p0 − p∗| ⩽ kⁿ max{p0 − a, b − p0}.

24.4 Newton’s Method


Newton’s method (also known as the Newton–Raphson method), named after Isaac Newton and Joseph Raphson, is
a method for finding successively better approximations to the roots (or zeroes) of a real-valued function.

The idea behind the algorithm is a simple one. We begin with an initial guess for the root we wish to find. This
often can be determined from the graph of the function. We then calculate the coordinates of the point on the graph
of our function that has for its x-value the initial guess. The equation of the tangent line at this point is computed,
then the point at which the tangent line intercepts the x-axis is noted. This usually serves as a better estimate of
the zero we seek. Given a function f defined over the reals, and its derivative f′, we begin with a first guess x0 for a root of the function f. Provided the function satisfies all the assumptions made in the derivation of the formula, a better approximation x1 is

x1 = x0 − f (x0)/f′(x0)
Geometrically, (x1, 0) is the intersection with the x-axis of the tangent to the graph of f at (x0, f (x0)). The process is repeated as

xn+1 = xn − f (xn)/f′(xn)

until a sufficiently accurate value is reached.

We can arrive at this from the Taylor series. If f ∈ C²[a, b], and the root x∗ ∈ [a, b], then we can formally Taylor expand around a point x close to the root:

0 = f (x∗) = f (x) + (x∗ − x) f′(x) + ((x∗ − x)²/2) f″(ξ(x)), ξ(x) ∈ [x, x∗]

If we are close to the root, then |x − x∗| is small, which means that |x − x∗|² ≪ |x − x∗|, hence we make the approximation:

0 ≈ f (x) + (x∗ − x) f′(x) ⟺ x∗ ≈ x − f (x)/f′(x)

Newton's method for root finding is based on the approximation

x∗ ≈ x − f (x)/f′(x),

which is valid when x is close to x∗. Given an approximation xn−1, we get an improved approximation xn by computing

xn = xn−1 − f (xn−1)/f′(xn−1)

Geometrically Newton’s method can be interpreted as follows. Let’s suppose that we want to approximate the solution
to f (x) = 0 and let’s also suppose that we have somehow found an initial approximation to this solution say, x0 .
This initial approximation is probably not all that good and so we would like to find a better approximation. This
is easy enough to do. First we will get the tangent line to f (x) at x0 .
y = f (x0 ) + f ′ (x0 )(x − x0 )

Figure 24.8: Newton’s method.

Figure 24.8 shows the geometric interpretation of Newton’s method.


Source : http://tutorial.math.lamar.edu/Classes/CalcI/NewtonsMethod.aspx

Now, from the graph shown, the blue line is the tangent line at x0. We can see that this line crosses the x-axis much closer to the actual solution of the equation than x0 does. This point is x1, where x1 = x0 − f (x0)/f′(x0). Similarly, x2 = x1 − f (x1)/f′(x1). Generalizing this, we get a sequence of numbers that come ever closer to the actual solution. The sequence is given by Newton's method, i.e.

xn = xn−1 − f (xn−1)/f′(xn−1)

Newton's method is a special case of fixed point iteration, given by

xk+1 = xk − f (xk)/f′(xk)

Here g(x) = x − f (x)/f′(x), so g′(x) = f (x)f″(x)/[f′(x)]², and at a simple root x∗ we have f (x∗) = 0, hence g′(x∗) = 0. In particular,

|g′(x∗)| < 1

This means that Newton's method is locally convergent. It is also faster than the bisection method.
Figure 24.9: Newton's method


Example 30. Find √a using Newton's method.
We have to solve x² − a = 0, so f (x) = x² − a ⟹ f′(x) = 2x

⟹ xk+1 = xk − (xk² − a)/(2xk) = ½(xk + a/xk)

⟹ g(x) = ½(a/x + x)

We see that this is the same as what we got earlier using the fixed point method.

24.4.1 Algorithm
1. Declare variables.
2. Set the maximum number of iterations to perform.
3. Set the tolerance to a small value.
4. Set an initial guess x0.
5. Set the counter of the number of iterations to zero.
6. Begin loop:
(a) Find the next guess x1 = x0 − f (x0)/f′(x0).
(b) If |f (x0)| < tolerance, then exit the loop.
(c) Increment the count of the number of iterations.
(d) If the number of iterations > max allowed, then exit.
7. If the root was not found in the max number of iterations, then print a warning message.
8. Print the value of the root and the number of iterations performed.
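A minimal runnable Python sketch of the algorithm above (ours), applied to the polynomial of Example 31 below:

def newton(f, df, x0, eps=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < eps:        # |f(x)| below tolerance: done
            return x
        x = x - fx / df(x)       # Newton update
    raise RuntimeError("Newton's method did not converge")

# Root of f(x) = x^3 - x + 1 with f'(x) = 3x^2 - 1, starting at x0 = -1
print(newton(lambda x: x ** 3 - x + 1,
             lambda x: 3 * x ** 2 - 1,
             -1.0))              # ~ -1.324717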

Example 31. Let us now consider an example in which we want to find the roots of the polynomial equation f (x) = x³ − x + 1 = 0.
The sketch of the graph will tell us that this polynomial has exactly one real root. We first need a good guess as to
where it is. This generally requires some creativity, but in this case, notice that f (−2) = −5 and f (−1) = 1 . This
tells us that the root is between -2 and -1. We might then choose x0 = −1 for our initial guess.
To perform the iteration, we need to know the derivative f ′ (x) = 3x2 − 1 so that

xn+1 = xn − (xn³ − xn + 1)/(3xn² − 1)
With our initial guess of x0 = −1 , we can produce the following values:
x0 -1
x1 -1.500000
x2 -1.347826
x3 -1.325200
x4 -1.324718
x5 -1.324717
x6 -1.324717
Notice how the values for xn become closer and closer to the same value. This means that we have found the
approximate solution to six decimal places. In fact, this was obtained after only five relatively painless steps.

Example 32. As another example, let's try to find a root of the equation f (x) = eˣ − 2x = 0. Notice that f′(x) = eˣ − 2, so that

xn+1 = xn − (e^{xn} − 2xn)/(e^{xn} − 2)
If we try an initial value x0 = 1 , we find that
x1 = 0, x2 = 1, x3 = 0, x4 = 1, ...
In other words, Newton's method fails to produce a solution. Why is this? Because there is no solution to be found! A root would have to satisfy eˣ = 2x, and eˣ > 2x for all real x.


Example 33. Let's solve the previous root-finding problem using Newton's method. We want to find √a for a ∈ Z.
Suppose a = 612, i.e. we want to solve x² = 612.
Then f (x) = x² − 612 with derivative f′(x) = 2x. With starting point x0 = 10 we obtain the following sequence:
Table 24.5 shows the convergence using Newton's method. Thus we obtain √612 ≈ 24.7386.

Iteration  xn       xn+1 = xn − f (xn)/f′(xn)
1          10       35.6
2          35.6     26.3955
3          26.3955  24.7906
4          24.7906  24.7387
5          24.7387  24.7386

Table 24.5: Newton's method for √612

24.4.2 Discussion
1. Fast Convergence: Newton’s method converges fastest of the methods discussed (Quadratic convergence).
2. Expensive: We have to compute the derivative in every iteration, which is quite expensive.
3. Starting point: We argued that when |x − x∗ | is small, then |x − x∗ |2 ≪ |x − x∗ |, and we can neglect the
second order term in the Taylor expansion. In order for Newton’s method to converge we need a good starting
point.
Theorem 24.4.1. Let $f \in C^2[a,b]$. If $x^* \in [a,b]$ is such that $f(x^*) = 0$ and $f'(x^*) \neq 0$, then there exists a $\delta > 0$ such that Newton's method generates a sequence $\{x_n\}_{n=1}^{\infty}$ converging to $x^*$ for any initial approximation $x_0 \in [x^* - \delta, x^* + \delta]$.
4. Newton's method as a fixed-point iteration: $x_n = g(x_{n-1}) = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})}$. For the fixed point theorem to apply, we must find an interval $[x^* - \delta, x^* + \delta]$ that $g$ maps into itself, and on which $|g'(x)| \leq k < 1$.
$$g'(x) = 1 - \frac{f'(x)f'(x) - f(x)f''(x)}{[f'(x)]^2} = \frac{f(x)f''(x)}{[f'(x)]^2}$$
By assumption $f(x^*) = 0$ and $f'(x^*) \neq 0$, so $g'(x^*) = 0$. By continuity, $|g'(x)| \leq k < 1$ in some neighborhood of $x^*$. Hence the fixed-point iteration converges.

Other issues:

1. Choose a wise starting point: we need a starting point close enough to the root, so that the steps $|x_{i+1} - x_i|$ stay small and the iteration converges.
2. Faster Convergence : It converges fastest of the methods discussed so far.
Convergence is quadratic: as the method converges on the root, the difference between the root and the
approximation is squared (the number of accurate digits roughly doubles) at each step.
3. Expensive: we have to evaluate two functions in each iteration, one for $f(x)$ and the other for $f'(x)$.
4. Difficulties with this method
(a) Difficulty in calculating derivative of a function.
(b) The method may overshoot, and diverge from that root.
(c) A large error in the initial estimate can contribute to non-convergence of the algorithm.
(d) If the root being sought has multiplicity greater than one, the convergence rate is merely linear.

24.5 Secant Method
Newton’s method was based on using the line tangent to the curve of y = f(x), with the point of tangency (x0 , f (x0 )).
When x0 → α, the graph of the tangent line is approximately the same as the graph of y = f(x) around x = α. We
then used the root of the tangent line to approximate α.
Consider using an approximating line based on interpolation. We assume we have two estimates of the root α, say x0 and x1. Then we produce a linear function q(x) = a0 + a1x with q(x0) = f(x0) and q(x1) = f(x1).
This line is sometimes called a secant line. Its equation is given by
$$q(x) = \frac{(x_1 - x)f(x_0) + (x - x_0)f(x_1)}{x_1 - x_0}$$
We now solve the equation q(x) = 0, denoting the root by x2. Iterating, this yields
$$x_{n+1} = x_n - f(x_n)\Big/\frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}, \quad n = 1, 2, 3, \ldots$$
This is called the secant method for solving f(x)=0.

Figure 24.10: Secant Method

We get the secant method when we substitute the derivative term in Newton's method with a finite difference term. As a result, the secant method will clearly be a bit slower than Newton's method. However, it is still faster than the bisection method, and it is useful in situations where we have a black-box representation of the function but not of its derivative. Substituting $f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}$, we get
$$x_{k+1} = x_k - f(x_k)\,\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}$$

The main weakness of Newton's method is the need to compute the derivative $f'(\cdot)$ in each step. Many times $f'(x)$ is far more difficult to compute and needs more arithmetic operations than $f(x)$.
By definition,
$$f'(x_{n-1}) = \lim_{x \to x_{n-1}} \frac{f(x) - f(x_{n-1})}{x - x_{n-1}}$$
Let $x = x_{n-2}$, and approximate
$$f'(x_{n-1}) \approx \frac{f(x_{n-2}) - f(x_{n-1})}{x_{n-2} - x_{n-1}}$$
Using this approximation for the derivative in Newton's method gives us the secant method:
$$x_n = x_{n-1} - \frac{f(x_{n-1})}{\frac{f(x_{n-2}) - f(x_{n-1})}{x_{n-2} - x_{n-1}}} = x_{n-1} - \frac{f(x_{n-1})\,(x_{n-2} - x_{n-1})}{f(x_{n-2}) - f(x_{n-1})}$$

24.5.1 Algorithm

Given an equation f(x) = 0:

Let the initial guesses be x0 and x1.
Do
$$x_{n+1} = x_n - f(x_n)\Big/\frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}, \quad n = 1, 2, 3, \ldots$$
while (none of the convergence criteria C1 or C2 is met)

1. C1: Fixing a priori the total number of iterations N.
2. C2: Testing whether $|x_{i+1} - x_i|$ (where i is the iteration number) is less than some tolerance limit ε, fixed a priori.
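A small Python sketch of this loop (our own illustration; it solves the same equation as the example that follows):

def secant(f, x0, x1, tol=1e-8, max_iter=50):
    """Find a root of f using the secant method with two initial guesses."""
    for n in range(max_iter):
        f0, f1 = f(x0), f(x1)
        # secant update: Newton's method with f'(x) replaced by a finite difference
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:     # convergence criterion C2
            return x2, n + 1
        x0, x1 = x1, x2
    return x1, max_iter            # criterion C1: iteration budget exhausted

import math
root, iters = secant(lambda x: 3*x + math.sin(x) - math.exp(x), 0.0, 1.0)
print(round(root, 2), iters)       # converges to about 0.36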

Example 34. Find the root of 3x + sin(x) − exp(x) = 0.

Solution: Let the initial guesses be 0.0 and 1.0, with f(x) = 3x + sin(x) − exp(x).

i 0 1 2 3 4 5 6
xi 0 1 0.471 0.308 0.363 0.36 0.36
So the iterative process converges to 0.36 in six iterations.

24.6 Additional Problems


Exercise. Write Newton’s iterations for solving each of the following non-linear equations

1. $x^3 - 2x - 5 = 0$
Soln. $f(x) = x^3 - 2x - 5 \implies f'(x) = 3x^2 - 2$
$$\therefore x_{k+1} = x_k - \frac{x_k^3 - 2x_k - 5}{3x_k^2 - 2} = \frac{2x_k^3 + 5}{3x_k^2 - 2}$$
Let $x_0 = 2$:
$$x_1 = \frac{2\cdot 2^3 + 5}{3\cdot 2^2 - 2} = 2.1, \quad x_2 = \frac{2(2.1)^3 + 5}{3(2.1)^2 - 2} = 2.094, \quad x_3 = \frac{2(2.094)^3 + 5}{3(2.094)^2 - 2} = 2.09455$$

2. $e^{-x} = x$
Soln. $f(x) = x - e^{-x} \implies f'(x) = 1 + e^{-x}$
$$\therefore x_{k+1} = x_k - \frac{x_k - e^{-x_k}}{1 + e^{-x_k}} = e^{-x_k}\,\frac{x_k + 1}{1 + e^{-x_k}}$$
Let $x_0 = 0$:
$$x_1 = e^{0}\,\frac{0 + 1}{1 + e^{0}} = 0.5, \quad x_2 = e^{-0.5}\,\frac{0.5 + 1}{1 + e^{-0.5}} = 0.566, \quad x_3 = e^{-0.566}\,\frac{0.566 + 1}{1 + e^{-0.566}} = 0.567$$

3. $x\sin x = 1$
Soln. $f(x) = x\sin x - 1 \implies f'(x) = \sin x + x\cos x$
$$\therefore x_{k+1} = x_k - \frac{x_k \sin x_k - 1}{\sin x_k + x_k \cos x_k} = \frac{x_k^2 \cos x_k + 1}{\sin x_k + x_k \cos x_k}$$
Let $x_0 = \frac{\pi}{2}$:
$$x_1 = \frac{(\frac{\pi}{2})^2 \cos\frac{\pi}{2} + 1}{\sin\frac{\pi}{2} + \frac{\pi}{2}\cos\frac{\pi}{2}} = 1, \quad x_2 = \frac{1^2\cos(1) + 1}{\sin(1) + 1\cdot\cos(1)} = 1.115, \quad x_3 = \frac{(1.115)^2\cos(1.115) + 1}{\sin(1.115) + 1.115\cos(1.115)} = 1.114$$

Exercise. A calculator is defective: it can only add, subtract, and multiply. Use the equation $\frac{1}{x} = 1.37$, the Newton method, and the defective calculator to find $\frac{1}{1.37}$ correct to 8 decimal places.

Soln. For convenience we write $a$ instead of $1.37$. Then $\frac{1}{a}$ is the root of the equation $f(x) = 0$ where
$$f(x) = a - \frac{1}{x}$$
We have $f'(x) = \frac{1}{x^2}$, and therefore the Newton method yields the iteration
$$x_{n+1} = x_n - \frac{a - \frac{1}{x_n}}{\frac{1}{x_n^2}} = x_n - x_n^2\left(a - \frac{1}{x_n}\right) = x_n(2 - a x_n)$$
Note that the expression $x_n(2 - a x_n)$ can be evaluated on our defective calculator, since it only involves multiplication and subtraction.
Pick $x_0$ reasonably close to $\frac{1}{1.37}$. The choice $x_0 = 1$ would work out fine, but we will start out a little closer, by noting that $1.37$ is about $\frac{4}{3}$, so its reciprocal is about $\frac{3}{4}$. Choose $x_0 = 0.75$.
We get $x_1 = x_0(2 - 1.37 x_0) = 0.729375$. Similarly $x_2 = 0.729926589$, and $x_3 = 0.729927007$. It turns out that $x_4 = x_3$ to 9 decimal places. So we can be reasonably confident that $\frac{1}{1.37}$ is equal to $0.72992701$ to 8 decimal places.
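The division-free iteration is easy to check numerically; here is a tiny sketch (the function name and step count are ours):

def reciprocal(a, x0, steps=4):
    """Approximate 1/a using only multiplication and subtraction."""
    x = x0
    for _ in range(steps):
        x = x * (2 - a * x)   # Newton update for f(x) = a - 1/x
    return x

print(reciprocal(1.37, 0.75))   # -> 0.7299270072992701, i.e. 1/1.37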

24.6.1 Additional Problem


1. Find a root of x − sin(x) − (1/2) = 0 using Secant Method.
2. Consider the function f (x) = x3 − 3x − 3. Find all extrema and points of inflection. Is this function odd, even
or neither? Sketch a graph of this function. Use Newton’s Method to approximate the value of the x-intercept.
Start with x0 = 2 and perform four iterations.
3. Show that Newton's method fails and the iterations diverge to infinity for every $f(x) = |x|^{\alpha}$ where $0 < \alpha < \frac{1}{2}$.

Chapter 25

More about Non Linear Optimization

25.1 Introduction
Last class, we introduced the problems of finding a solution of m nonlinear equations in n variables (though our
discussions were only the case when n = 1)
f1 (x1 , x2 , ....xn ) = 0
f2 (x1 , x2 , ....xn ) = 0
..
.
fm (x1 , x2 , ....xn ) = 0

To simplify, we expressed the above set of equations as $f(x) = 0$ where
$$f(x) = \begin{bmatrix} f_1(x_1, x_2, \ldots, x_n) \\ f_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ f_m(x_1, x_2, \ldots, x_n) \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

We are interested in solving two categories of problems:

• Problem 1: Solve f (x) = 0


• Problem 2: Min ||f (x)||

Here the nature of f can be,

• $f : \mathbb{R} \to \mathbb{R}$ : non-linear. E.g., $f(x) = x^2 + 2x + 1 = 0$
• $f : \mathbb{R}^n \to \mathbb{R}$ : $f(x_1, x_2, \ldots, x_n)$
• $f : \mathbb{R}^n \to \mathbb{R}^m$ : $f(x) = \big(f_1(x_1, \ldots, x_n),\ f_2(x_1, \ldots, x_n),\ \ldots,\ f_m(x_1, \ldots, x_n)\big)^T$

• f : convex → Local Minima = Global Minima


• f : non-convex → Local Minima ̸= Global Minima

25.2 Review of Iterative Schemes


In the last class, various iterative methods were discussed for solving nonlinear equations where f : Rn → R is differentiable. Did we make any assumption of convexity? (We did not really talk about the case of f : Rn → Rm.)

The methods discussed include:

Bisection Method

We start with an interval [a, b] that satisfies f (a) · f (b) < 0 (the function values at the end points of the interval have
opposite signs). Since f is continuous, this guarantees that the interval contains at least one solution of f (x) = 0.

In each iteration we evaluate f at the midpoint p = (a + b)/2 of the interval, and depending on the sign of f (p),
replace a or b with p. If f (p) has the same sign as f (a), we replace a with p. Otherwise we replace b. Thus we
obtain a new interval that still satisfies f (a)f (b) <0. The method is called bisection because the interval is replaced
by either its left or right half at each iteration.
Example 35. Let us find a root of f(x) = 3x + sin(x) − exp(x) = 0. The graph of this equation is given in Figure 25.1. It is clear from the graph that there are two roots, one between 0 and 0.5 and the other between 1.5 and 2.0. Consider the function f(x) in the interval [0, 0.5], since f(0) · f(0.5) < 0. The bisection iterations are given in Table 25.1.

Figure 25.1: Graph of the equation f(x)

Iteration No    a       b       c       f(a) · f(c)
1               0       0.5     0.25    0.287 (+ve)
2               0.25    0.50    0.393   -0.015 (-ve)
3               0.25    0.393   0.340   9.69E-3 (+ve)
4               0.34    0.393   0.367   -7.81E-4 (-ve)
5               0.34    0.367   0.354   8.9E-4 (+ve)
6               0.35    0.367   0.360   -3.1E-6 (-ve)

Table 25.1: Bisection Method Iterations

Newton's Method

In numerical analysis, Newton's method (also known as the Newton-Raphson method), named after Isaac Newton and Joseph Raphson, is a method for finding successively better approximations to the roots (or zeroes) of a real-valued function, i.e., x such that f(x) = 0.

The Newton-Raphson method in one variable is implemented as follows. Given a function f defined over the reals and its derivative f', we begin with a first guess x0 for a root of the function f. Provided the function satisfies all the assumptions made in the derivation of the formula, a better approximation x1 is
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}$$

Geometrically, (x1, 0) is the intersection with the x-axis of the tangent to the graph of f at (x0, f(x0)).

The process is repeated as


f (xn )
xn+1 = xn −
f ′ (xn )
until a sufficiently accurate value is reached. This method can also be extended to complex functions and to systems
of equations.

Example 36. Let us now consider an example in which we want to find the roots of the polynomial f (x) = x3 −x+1 =
0.

The sketch of the graph will tell us that this polynomial has exactly one real root. We first need a good guess as to
where it is. This generally requires some creativity, but in this case, notice that f (−2) = −5 and f (−1) = 1 . This
tells us that the root is between -2 and -1. We might then choose x0 = −1 for our initial guess.

To perform the iteration, we need to know the derivative f ′ (x) = 3x2 − 1 so that

$$x_{n+1} = x_n - \frac{x_n^3 - x_n + 1}{3x_n^2 - 1}$$

Figure 25.2: Demonstration of Newtons Method

n    xn
0    -1
1    -1.500000
2    -1.347826
3    -1.325200
4    -1.324718
5    -1.324717
6    -1.324717

Table 25.2: Newton Method Iterations

Figure 25.3: Demonstration of Secant Method

With our initial guess of x0 = −1, we can produce the values in Table 25.2.

Notice how the values for xn become closer and closer to the same value. This means that we have found the
approximate solution to six decimal places. In fact, this was obtained after only five relatively painless steps.

Secant Method

In numerical analysis, the secant method is a root-finding algorithm that uses a succession of roots of secant lines to
better approximate a root of a function f . The secant method can be thought of as a finite difference approximation
of Newton’s method. However, the method was developed independently of Newton’s method, and predated the
latter by over 3,000 years.

The secant method is defined by the recurrence relation


$$x_n = x_{n-1} - f(x_{n-1})\,\frac{x_{n-1} - x_{n-2}}{f(x_{n-1}) - f(x_{n-2})} = \frac{x_{n-2}\,f(x_{n-1}) - x_{n-1}\,f(x_{n-2})}{f(x_{n-1}) - f(x_{n-2})} \tag{25.1}$$

As can be seen from the recurrence relation, the secant method requires two initial values, x0 and x1, which should ideally be chosen to lie close to the root.
Example 37. Let's try to find the root of 3x + sin(x) − exp(x) = 0 using the secant method. Let the initial guesses be 0.0 and 1.0. Following the secant method's recurrence relation, we get Table 25.3. As can be seen, the secant method converges to 0.36 in six iterations.

n    xn
0    0
1    1
2    0.471
3    0.308
4    0.363
5    0.36
6    0.36

Table 25.3: Secant Method Iterations

25.3 Gradient, Hessian and Jacobian


We had mostly focussed on real-valued functions of a real variable in the last class. We also know that real-valued functions of multiple variables and vector-valued functions are of equal interest to us. We will often use terms such as gradient (∇f(x)), Hessian (∇²f(x)) and Jacobian (J) while dealing with nonlinear functions and their optimization. The basic definitions of each of them follow.

25.3.1 Basic Terminology

Gradient: The gradient of a function f(x) of n variables, at x*, is the vector of first partial derivatives evaluated at x*, and is denoted ∇f(x*):
$$\nabla f(x^*) = \begin{bmatrix} \frac{\partial f(x^*)}{\partial x_1} \\ \frac{\partial f(x^*)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x^*)}{\partial x_n} \end{bmatrix} \tag{25.2}$$

Hessian: The Hessian of a function f(x) of n variables, at x*, is the matrix of second partial derivatives evaluated at x*, and is denoted ∇²f(x*):
$$\nabla^2 f(x^*) = \begin{bmatrix} \frac{\partial^2 f(x^*)}{\partial x_1^2} & \frac{\partial^2 f(x^*)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x^*)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x^*)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x^*)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x^*)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x^*)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x^*)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x^*)}{\partial x_n^2} \end{bmatrix} \tag{25.3}$$
The Hessian is a square matrix of second-order partial derivatives that describes the local curvature of a function of many variables. It is a symmetric matrix, because $\frac{\partial^2 f(x^*)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x^*)}{\partial x_j \partial x_i}$.
Jacobian: Given a set of m equations $y_i = f_i(x)$ in n variables $x_1, \ldots, x_n$, the Jacobian is defined as:
$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \tag{25.4}$$
Looking at the above definitions one can observe a simple relation between ∇f(x), ∇²f(x) and J: the Jacobian of the gradient is the Hessian,
$$J(\nabla f(x)) = \nabla^2 f(x) \tag{25.5}$$
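Relation (25.5) is easy to verify numerically with finite differences; the sketch below (our own illustration, not from the notes) checks it for a small example function:

import numpy as np

def grad_fd(f, x, h=1e-5):
    """Numerical gradient of scalar f at x via central differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def jacobian_fd(F, x, h=1e-5):
    """Numerical Jacobian of vector-valued F at x."""
    Fx = F(x)
    J = np.zeros((len(Fx), len(x)))
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        J[:, i] = (F(x + e) - F(x - e)) / (2 * h)
    return J

f = lambda x: x[0]**2 + 3*x[0]*x[1] + x[1]**2    # example scalar function
x = np.array([1.0, 2.0])
H = jacobian_fd(lambda z: grad_fd(f, z), x)       # Jacobian of the gradient
print(H)   # approx [[2, 3], [3, 2]], which is exactly the Hessian of f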

25.4 Approximating Functions and Taylor’s Series
Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the
function’s derivatives at a single point

$$f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f^{(3)}(a)}{3!}(x-a)^3 + \cdots$$
Let $x \to y$ and $a \to x$:
$$f(y) = f(x) + \frac{f'(x)}{1!}(y-x) + \frac{f''(x)}{2!}(y-x)^2 + \frac{f^{(3)}(x)}{3!}(y-x)^3 + \cdots$$

It is common practice to approximate a function by using a finite number of terms of its Taylor series. Taylor’s
theorem gives quantitative estimates on the error in this approximation. Any finite number of initial terms of the
Taylor series of a function is called a Taylor polynomial.

The Taylor series of a function is the limit of that function’s Taylor polynomials, provided that the limit exists. A
function may not be equal to its Taylor series, even if its Taylor series converges at every point. A function that is
equal to its Taylor series in an open interval (or a disc in the complex plane) is known as an analytic function.

Example 38. Let’s compute the Taylor series for f (x) = ex with center x0 = 0 . All derivatives are of the form ex
, so at x0 = 0 they evaluate to 1. Thus the Taylor series has the form:

x2 x3 x4
ex = 1 + x + + + + ···
2! 3! 4!

Commonly used Taylor series include those of $\frac{1}{1-x}$, $\cos x$, $\sin x$, $e^x$, $\ln(1+x)$, etc.

25.4.1 Orders of Approximation

Orders of approximation refer to formal or informal terms for how precise an approximation is, and to indicate
progressively more refined approximations: in increasing order of precision, a zeroth-order approximation, a first-
order approximation, a second-order approximation, and so forth.

Formally, an nth-order approximation is one where the order of magnitude of the error is at most xn+1 , or in
terms of big O notation, the error is O(xn+1 ). In suitable circumstances, approximating a function by a Taylor
polynomial of degree n yields an nth-order approximation, by Taylor’s theorem: a first-order approximation is a
linear approximation, and so forth.

Example 39. Figure 25.4 shows an accurate approximation of sin(x) around the point x = 0. The pink curve is a polynomial of degree seven:
$$\sin(x) \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!}$$
The error in this approximation is no more than $\frac{|x|^9}{9!}$. In particular, for $-1 < x < 1$, the error is less than 0.000003.

To view the usefulness of Taylor series, Figures 25.5, 25.6, and 25.7 show the zeroth, first, and second order Taylor
series approximations of the exponential function f (x) = ex at x = 0.
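A quick numerical comparison (our own sketch) makes the same point as those figures:

import math

def taylor_exp(x, order):
    """Taylor polynomial of e^x about 0, truncated at the given order."""
    return sum(x**k / math.factorial(k) for k in range(order + 1))

x = 0.5
for order in (0, 1, 2):
    approx = taylor_exp(x, order)
    print(order, approx, abs(math.exp(x) - approx))
# the error shrinks from about 0.65 to 0.15 to 0.02 as the order increases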

Zero Order Approximation

The approximation f(x + h) ≈ f(x) is called a zeroth-order Taylor-series approximation.

25.4.2 First order approximation

The first order approximation takes the first two terms in the series and approximates the function:
$$f(y) \approx f(x) + \frac{f'(x)}{1!}(y - x)$$

Figure 25.4: The sine function (blue) is closely approximated by its Taylor polynomial of degree 7 (pink) for
a full period centered at the origin.

Figure 25.5: The zeroth-order Taylor series approximation of ex around x = 0.

Let f(y) = 0 at the true solution; then
$$y - x = \frac{-f(x)}{f'(x)} \;\Rightarrow\; y = x - \frac{f(x)}{f'(x)} \;\Rightarrow\; y = x_0 - \frac{f(x_0)}{f'(x_0)}$$
The first order approximation is the equation of a line with a slope of f'(x0). So, the first two terms of the Taylor series give us the equation of the line tangent to our function at the point (x0, y0).
We can now develop an iterative algorithm by replacing y with x_{k+1} and x with x_k. At each iteration we will get a better solution than the previous iteration:
$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

Function of the type f : Rn → R

The first order approximation can be given as

f (y) = f (x) + ∇f (x)T (y − x) (25.6)

Figure 25.6: The first-order Taylor series approximation of ex around x = 0.

where ∇f(x) is given by
$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \;\Rightarrow\; f(y) = f(x) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(y_i - x_i)$$

Putting f(y) = 0 in Equation 25.6 we get
$$0 = f(x) + \nabla f(x)^T(y - x)$$
Let $y - x = s$:
$$0 = f(x) + \nabla f(x)^T s \;\Rightarrow\; \nabla f(x)^T s = -f(x) \tag{25.7}$$
Iterative algorithm to solve f(x) = 0:

x0 ← initial guess
for k = 0, 1, 2, · · · do
    solve ∇f(xk)ᵀ s = −f(xk) for s
    xk+1 = xk + s
end for

Unfortunately this method does not work. Why? (For n > 1, the single equation ∇f(x)ᵀs = −f(x) is one equation in n unknowns, so it does not determine the step s uniquely.)

Function of the type f : Rn → Rm

The first order approximation can be given as
$$f(y) = f(x) + J_x(y - x) \tag{25.8}$$
where the Jacobian J is an m × n matrix given by
$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Figure 25.7: The second-order Taylor series approximation of ex around x = 0.

Putting f(y) = 0 in Equation 25.8 we get
$$0 = f(x) + J_x(y - x)$$
Let $y - x = s$:
$$0 = f(x) + J_x s \;\Rightarrow\; J_x s = -f(x) \tag{25.9}$$
Iterative algorithm to solve f(x) = 0:

x0 ← initial guess
for k = 0, 1, 2, · · · do
    solve J_{xk} s = −f(xk) for s
    xk+1 = xk + s
end for

Convergence is guaranteed when the iteration is a local contraction, i.e., when the spectral radius (the maximum absolute eigenvalue) of the iteration's Jacobian is less than 1.

25.4.3 Second order approximation

The second order approximation uses the first three terms of the Taylor series:
$$f(y) = f(x) + \frac{f'(x)}{1!}(y - x) + \frac{f''(x)}{2!}(y - x)^2$$
Let f(y) = 0 at the true solution; then
$$0 = f(x) + \frac{f'(x)}{1!}(y - x) + \frac{f''(x)}{2!}(y - x)^2$$
With the second order approximation a better solution can be obtained than with the first order approximation, but it is harder to compute.

Analysis: While the zeroth-order approximation in Figure 25.5 becomes poor very quickly, it is quite apparent that the linear, or 1st-order, approximation in Figure 25.6 is already quite reasonable in a small interval around x = 0. The quadratic, or 2nd-order, approximation in Figure 25.7 is even better. However, as the degree of approximation increases, computation also increases. So there is a tradeoff!

25.5 Optimality Conditions
Let f : Rⁿ → R be a scalar-valued function of n variables x = (x1, x2, ..., xn). We say x* = (x1*, x2*, ..., xn*) minimizes f if f(x*) ≤ f(x) for all x ∈ Rⁿ. We use the notation min f(x) to denote the problem of finding an x* that minimizes f. A vector x* is a local minimum if there exists a neighbourhood around x* in which f(x*) ≤ f(x). By minimum we refer to a global minimum. We can have a case with finite min f(x) but no x* with f(x*) = min f(x); in such a case the optimal value is not attained. It is also possible that f(x) is unbounded below, in which case we define the optimal value as min f(x) = −∞.

25.5.1 Optimality conditions

Global optimality: A function g is convex if ∇²g(x) is positive semidefinite everywhere. If g is convex, then x is a minimum if and only if
$$\nabla g(x) = 0$$
This means there are no other local minima, i.e., every local minimum is global.

Local optimality: It is much harder to characterize optimality if g is not convex (i.e., if there are points where the Hessian is not positive semidefinite). It is not sufficient to set the gradient equal to zero, because such a point might correspond to a local minimum, a local maximum, or a saddle point. However, we can state some simple conditions for local optimality.

1. Necessary condition: if x is locally optimal, then ∇g(x) = 0 and ∇²g(x) is positive semidefinite.
2. Sufficient condition: if ∇g(x) = 0 and ∇²g(x) is positive definite, then x is locally optimal.

Example 40. Let's try to find the local extrema of $f(x_1, x_2) = x_1^3 + x_2^3 - 3x_1x_2$.

This function is everywhere differentiable, so extrema can only occur at points x* where its gradient vanishes (∇f(x*) = 0):
$$\nabla f(x) = \begin{bmatrix} 3x_1^2 - 3x_2 \\ 3x_2^2 - 3x_1 \end{bmatrix}$$
This equals 0 iff $(x_1, x_2) = (0, 0)$ or $(1, 1)$. The Hessian is
$$H(x) = \begin{bmatrix} 6x_1 & -3 \\ -3 & 6x_2 \end{bmatrix}$$
So,
$$H(0, 0) = \begin{bmatrix} 0 & -3 \\ -3 & 0 \end{bmatrix}$$
Let H1 denote the first principal minor of H(0, 0) and let H2 denote its second principal minor. Then det(H1) = 0 and det(H2) = −9.
Therefore, H(0, 0) is neither positive nor negative definite, and (0, 0) is a saddle point.
$$H(1, 1) = \begin{bmatrix} 6 & -3 \\ -3 & 6 \end{bmatrix}$$
Its first principal minor has det(H1) = 6 > 0 and its second principal minor has det(H2) = 36 − 9 = 27 > 0. Therefore, H(1, 1) is positive definite, which implies that (1, 1) is a local minimum.
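One can double-check this classification numerically by examining the Hessian's eigenvalues (a small sketch of ours using numpy):

import numpy as np

def hessian(x1, x2):
    # Hessian of f(x1, x2) = x1^3 + x2^3 - 3*x1*x2
    return np.array([[6*x1, -3.0],
                     [-3.0, 6*x2]])

for point in [(0.0, 0.0), (1.0, 1.0)]:
    eig = np.linalg.eigvalsh(hessian(*point))
    print(point, eig)
# (0,0) -> eigenvalues [-3, 3]: indefinite, a saddle point
# (1,1) -> eigenvalues [3, 9]: positive definite, a local minimum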

25.6 Additional Examples



Exercise. Derive the gradient and Hessian of the function $g(x) = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2 + 1}$. Also, show that ∇²g(x) is positive definite.

Calculation of Gradient
$$\nabla g(x) = \begin{bmatrix} \frac{\partial g(x)}{\partial x_1} \\ \frac{\partial g(x)}{\partial x_2} \\ \vdots \\ \frac{\partial g(x)}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{x_1}{\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2 + 1}} \\ \frac{x_2}{\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2 + 1}} \\ \vdots \\ \frac{x_n}{\sqrt{x_1^2 + x_2^2 + \cdots + x_n^2 + 1}} \end{bmatrix} = \frac{1}{g(x)} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{25.10}$$

Calculation of Hessian

Differentiating the gradient, the (i, j) entry of the Hessian is $\frac{\partial}{\partial x_j}\left(\frac{x_i}{g(x)}\right)$. The diagonal entries come out as, e.g., $\frac{x_2^2 + \cdots + x_n^2 + 1}{g^3(x)}$ for $i = 1$, and the off-diagonal entries as $\frac{-x_i x_j}{g^3(x)}$. Collecting these,
$$\nabla^2 g(x) = \frac{1}{g^3(x)}\left(g^2(x)\, I - x x^T\right) \tag{25.13}$$
The above Hessian matrix can be expressed as
$$\nabla^2 g(x) = \frac{1}{g(x)}\left(I - u u^T\right) \tag{25.14}$$
by considering $u = \frac{1}{g(x)} \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^T$.
Now we will prove that $\nabla^2 g(x) = \frac{1}{g(x)}(I - uu^T)$ is a positive definite matrix.

• We can see that it is symmetric.
• Since $\|u\|^2 = \frac{x_1^2 + \cdots + x_n^2}{g^2(x)} < 1$, the eigenvalues of $I - uu^T$ are 1 (with multiplicity n − 1) and $1 - \|u\|^2 > 0$; all of them are positive.
• Since all the eigenvalues are positive and the matrix is real symmetric, the matrix is positive definite.
Exercise. Consider the system of equations
$$x_1 - 1 = 0$$
$$x_1 x_2 - 1 = 0$$
Suggest a starting point where Newton's method may fail. Why?

Soln. For a system of equations, Newton's method is written as
$$\vec{x}_{k+1} = \vec{x}_k - J_F^{-1}\,\vec{F}(\vec{x}_k)$$
where $J_F$ is the Jacobian and $\vec{F}$ is the function vector. In this case, we have
$$\vec{F} = \begin{pmatrix} x_1 - 1 \\ x_1 x_2 - 1 \end{pmatrix}, \qquad J_F = \begin{pmatrix} 1 & 0 \\ x_2 & x_1 \end{pmatrix}$$
Clearly, for $x_1 = 0$ and any value of $x_2$, $|J_F| = 0 \implies J_F^{-1}$ is not defined.

∴ If we start with $x_1 = 0$ and any value of $x_2$, Newton's method fails.
Exercise. Use Newton's method to find all solutions of the system
$$(x_1 - 1)^2 + 2(x_2 - 1) = 1$$
$$3(x_1 + x_2 - 2)^2 + (x_1 - x_2 - 1)^2 = 2$$
in the two variables x1 and x2.

Soln. For the above system of equations, we have
$$\vec{F} = \begin{pmatrix} (x_1-1)^2 + 2(x_2-1) - 1 \\ 3(x_1+x_2-2)^2 + (x_1-x_2-1)^2 - 2 \end{pmatrix}, \qquad J_F = \begin{pmatrix} 2x_1 - 2 & 2 \\ 8x_1 + 4x_2 - 14 & 4x_1 + 8x_2 - 10 \end{pmatrix}$$
We start with one seed value $x_1 = 2, x_2 = 1$. In this case we get $J_F = \begin{pmatrix} 2 & 2 \\ 6 & 6 \end{pmatrix}$. Clearly this does not have an inverse, so the method fails.
We try another seed $x_1 = 1, x_2 = 1$. We get
$$J_F = \begin{pmatrix} 0 & 2 \\ -2 & 2 \end{pmatrix}, \qquad J_F^{-1} = \begin{pmatrix} \frac{1}{2} & -\frac{1}{2} \\ \frac{1}{2} & 0 \end{pmatrix}, \qquad \vec{F} = \begin{pmatrix} -1 \\ -1 \end{pmatrix}$$
$$\therefore \vec{x}_1 = \vec{x}_0 - J_F^{-1}\vec{F} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} - \begin{pmatrix} 0 \\ -\frac{1}{2} \end{pmatrix} = \begin{pmatrix} 1 \\ \frac{3}{2} \end{pmatrix}$$
Calculating $J_F$ at this point, we get $J_F = \begin{pmatrix} 0 & 2 \\ 0 & 6 \end{pmatrix}$. Clearly this does not have an inverse, so the method fails again. This leads us to believe that the system of equations has no real solutions.

25.6.1 Additional Problems

• Derive the gradient and Hessian of the function $g(x) = \sqrt{\|Cx\|^2 + 1}$, where C is a left-invertible m × n matrix.
• Let f(x) be a strictly convex, twice-continuously differentiable function and x* a point for which ∇f(x*) = 0. Suppose that H(x) satisfies the following conditions:
  a) there exists a scalar h > 0 for which $\|H(x^*)^{-1}\| \leq \frac{1}{h}$
  b) there exist scalars β > 0 and L > 0 for which $\|H(x) - H(x^*)\| \leq L\|x - x^*\|$
  Let x satisfy $\|x - x^*\| < \gamma := \min\{\beta, \frac{2h}{3L}\}$ and $x_N := x - H(x)^{-1}\nabla f(x)$; then prove the following:
  (i) $\|x_N - x^*\| \leq \|x - x^*\|^2 \cdot \frac{L}{2(h - L\|x - x^*\|)}$
  (ii) $\|x_N - x^*\| < \|x - x^*\| < \gamma$
  (iii) $\|x_N - x^*\| \leq \|x - x^*\|^2 \cdot \frac{3L}{2h}$
  Hint: Ref. Quadratic Convergence Theorem of Newton's Method.
• Show that the eigenvalue-eigenvector problem can be cast as a root-finding problem, i.e.,
  $$(A - \lambda I)v = 0, \quad v^T v = 1$$
  Derive the Newton iteration to solve the above problem.

Chapter 26

More about Non Linear Optimization

In the last lecture, we saw how Newton's method comes out of the Taylor series expansion. We also briefly saw the situation where we solve f(x) = 0 with f : Rⁿ → Rᵐ.

We now complete the motivation for studying this problem for optimization by connecting it to the minimization of g(x) where x is a vector. The conditions for optimality in the previous lecture pointed to the need of solving ∇g = 0, that is, a set of equations to be solved simultaneously.

26.1 Newton’s Method For Sets of Nonlinear Equations


In the single variable case, Newton's method was derived by considering the linear approximation of the function f at the initial guess x0. From calculus, the following is the linear approximation of f at x0, for vectors and vector-valued functions:
$$f(x) \approx f(x_0) + J(x - x_0)$$
Here J is an m × n matrix whose entries are the various partial derivatives of the components of f, i.e.:
$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
J is the derivative matrix (or Jacobian matrix, explained elsewhere) evaluated at x0.

That is,

f (y) = f (x) + Jx (y − x) (26.1)

where the Jacobian J is the m × n matrix defined above.

Putting f(y) = 0 in Equation 26.1 we get
$$0 = f(x) + J_x(y - x)$$
Let $y - x = s$:
$$0 = f(x) + J_x s \;\Rightarrow\; J_x s = -f(x) \tag{26.2}$$
Iterative algorithm to solve f(x) = 0:

x0 ← initial guess
for k = 0, 1, 2, · · · do
    solve J_{xk} s = −f(xk) for s
    xk+1 = xk + s
end for

Convergence is guaranteed when the iteration is a local contraction, i.e., when the spectral radius (the maximum absolute eigenvalue) of the iteration's Jacobian is less than 1.

Example 41. As an example, let's take a problem with two equations and two variables:
$$f_1(x_1, x_2) = \log(x_1^2 + 2x_2^2 + 1) - 0.5, \qquad f_2(x_1, x_2) = -x_1^2 + x_2 + 0.2 \tag{26.3}$$
The derivative matrix is
$$J = \begin{bmatrix} \dfrac{2x_1}{x_1^2 + 2x_2^2 + 1} & \dfrac{4x_2}{x_1^2 + 2x_2^2 + 1} \\ -2x_1 & 1 \end{bmatrix}$$
After finding the derivative matrix, we can proceed with Newton's method discussed in the last class.

There are two solutions, (0.70, 0.29) and (−0.70, 0.29).
Example 42. Suppose we need to solve the following system of nonlinear equations:
$$F(x) = \begin{bmatrix} x_1 + x_2 - 3 \\ x_1^2 + x_2^2 - 9 \end{bmatrix}$$

1. Let the initial guess be $x_0 = \begin{bmatrix} 1 & 5 \end{bmatrix}^T$.

2. Iteration 1: solving $J(x_0)\,s_0 = -F(x_0)$:
$$\begin{bmatrix} 1 & 1 \\ 2 & 10 \end{bmatrix} s_0 = -\begin{bmatrix} 3 \\ 17 \end{bmatrix} \implies s_0 = \begin{bmatrix} -\frac{13}{8} \\ -\frac{11}{8} \end{bmatrix}, \qquad x_1 = x_0 + s_0 = \begin{bmatrix} -0.625 & 3.625 \end{bmatrix}^T$$

3. Iteration 2: solving $J(x_1)\,s_1 = -F(x_1)$:
$$\begin{bmatrix} 1 & 1 \\ -\frac{5}{4} & \frac{29}{4} \end{bmatrix} s_1 = -\begin{bmatrix} 0 \\ \frac{145}{32} \end{bmatrix} \implies s_1 = \begin{bmatrix} \frac{145}{272} \\ -\frac{145}{272} \end{bmatrix}$$
$$x_2 = x_1 + s_1 = \begin{bmatrix} -0.092 & 3.092 \end{bmatrix}^T$$
The actual solution to the above problem is $\begin{bmatrix} 0 & 3 \end{bmatrix}^T$. In just two iterations the algorithm has come quite close.
26.2 Newton’s method for minimizing a convex function

If the objective function g is convex, we can find its minimum by solving ∇g(x) = 0. This is a set of n nonlinear equations in n variables that we can solve using any of the methods for nonlinear equations. If we linearize the optimality condition ∇g(x) = 0 near x0 we obtain
$$\nabla g(x) \approx \nabla g(x_0) + \nabla^2 g(x_0)(x - x_0) = 0$$
This is a linear equation in x, with solution
$$x = x_0 - \nabla^2 g(x_0)^{-1} \nabla g(x_0)$$
The above step is called the Newton step.

When n = 1 this interpretation is particularly simple. The solution of the linearized optimality condition is the zero-crossing of the derivative g'(x), which is monotonically increasing since g''(x) > 0. Given our current approximation x^(k) of the solution, we form a first-order Taylor approximation of g'(x) at x^(k). The zero crossing of this approximation is then x^(k) + v^(k). This interpretation is illustrated in the figure below.

The solid curve is the derivative g'(x) of the function g; $f_{\text{aff}}(x) = g'(x^{(k)}) + g''(x^{(k)})(x - x^{(k)})$ is the affine approximation of g'(x) at x^(k). The Newton step v^(k) is the difference between the root of $f_{\text{aff}}$ and the point x^(k).

ALGORITHM: Newton's method for unconstrained optimization

given initial x, tolerance ε > 0
repeat
1. Evaluate ∇g(x) and ∇²g(x).
2. If ‖∇g(x)‖ ≤ ε, return x.
3. Solve ∇²g(x)v = −∇g(x).
4. x := x + v.
until a limit on the number of iterations is exceeded
Example 43. Let's try to find the minimum of the function f(x) = 7x − ln x.
$$\nabla f(x) = f'(x) = 7 - \frac{1}{x}, \qquad H(x) = f''(x) = \frac{1}{x^2}$$
It is not hard to check that $x^* = \frac{1}{7} = 0.142857143$ is the unique global minimizer.

The Newton direction is $-H(x)^{-1}\nabla f(x) = x - 7x^2$, and is defined so long as x > 0 (the domain of f(x) is x > 0).

So, Newton's method will generate the sequence of iterates
$$x_{k+1} = x_k + (x_k - 7x_k^2) = 2x_k - 7x_k^2$$
Table 26.1 shows the sequences generated by this algorithm for different starting points.

Note that the second iterate in the first column (x0 = 1) is not in the domain of the objective function, so the algorithm has to terminate.

As we can see, the above algorithm converges only when started near the solution, as expected based on the general properties of the method.
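A tiny sketch (ours) reproduces the behaviour summarized in Table 26.1:

def newton_min_7x_minus_lnx(x0, steps=10):
    """Pure Newton iteration x <- 2x - 7x^2 for f(x) = 7x - ln(x)."""
    x = x0
    for k in range(steps):
        x = 2*x - 7*x**2        # Newton update with H = 1/x^2
        if x <= 0:              # left the domain of f: the method breaks down
            print(f"x0={x0}: left the domain at step {k+1} (x={x})")
            return
    print(f"x0={x0}: converged near {x}")

newton_min_7x_minus_lnx(1.0)    # diverges out of the domain
newton_min_7x_minus_lnx(0.1)    # converges to 1/7 = 0.142857...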

26.3 Newton’s method with backtracking


Newton’s method always tries to move in the correct direction. The only factor by which it may fail to converge is if
the step size is too big. In this case the function value increases from iteration to iteration.

k     x0 = 1    x0 = 0.1       x0 = 0.01
0     1         0.1            0.01
1     -5        0.13           0.0193
2               0.1417         0.03599257
3               0.14284777     0.062916884
4               0.142857142    0.098124028
5               0.142857143    0.128849782
6                              0.1414837
7                              0.142843938
8                              0.142857142
9                              0.142857143
10                             0.142857143

Table 26.1: Newton Method Iterations for different Starting Points

It is quite evident that Newton's method will converge only when started near the solution. We need a modification that makes it globally convergent. A closer look at the method shows that the problem does not lie with the direction of the step but with the step size. A step size too large (we might overshoot the optimum) or too small (the method could be too slow) is a problem; we need to select the right step size for a given problem.

To overcome this problem, at each iteration we first take a full Newton step and evaluate the function at that point. If the function value g(xk + vk) is higher than g(xk), the step is rejected and xk + ½vk is tried. If the function value is still higher than g(xk), then xk + ¼vk is tried, and so on, until a value of t is found with g(xk + t vk) < g(xk).

Newton's method with backtracking is the solution to the above problem. The idea is to use the direction of the chosen step, but to control its length. The following algorithm is used for the same.

The purpose of backtracking is to avoid a situation in which the function values increase from iteration to iteration and never converge. Analysis shows that there is actually nothing wrong with the direction of the Newton step, since it always points in a direction of decreasing g. The problem is that we step too far in that direction. So the remedy is quite obvious.

At each iteration, we first attempt the full step of the previous algorithm (xk+1 = xk + vk), and evaluate g at that point. If the function value g(xk + vk) is higher than g(xk), we reject the update and try xk+1 = xk + (1/2)vk instead. If the function value is still higher than g(xk), we try xk+1 = xk + (1/4)vk, and so on, until a value of t is found with g(xk + t vk) < g(xk). We then take xk+1 = xk + t vk.

In practice, the backtracking idea is often implemented as shown in the following algorithm:

Algorithm: given initial x, tolerance ϵ > 0, parameter α ∈ (0, 1/2).

repeat

1. Evaluate ∇g(x) and ∇²g(x).
2. If ‖∇g(x)‖ < ϵ, then return x.
3. Solve ∇²g(x)v = −∇g(x).
4. t := 1; while g(x + tv) > g(x) + αt∇g(x)ᵀv, set t := t/2.
5. x := x + tv.

until a limit on the number of iterations is exceeded

The parameter α is usually chosen quite small (e.g., α = 0.01).

Figure 26.2 shows the iterations in Newton’s method with backtracking, applied to the previous example, starting
from x(0) = 4. As expected the convergence problem has been resolved. From the plot of the step sizes we note
that the method accepts the full Newton step (t = 1) after a few iterations. This means that near the solution the
algorithm works like the pure Newton method, which ensures fast (quadratic) convergence.
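The following Python sketch (our own, with illustrative parameter choices) implements this damped Newton iteration for the example f(x) = 7x − ln x; points outside the domain are treated as having value +∞ by the line search:

import math

def newton_backtracking(g, dg, d2g, x, eps=1e-8, alpha=0.01, max_iter=100):
    """Newton's method with backtracking line search (halving the step)."""
    for _ in range(max_iter):
        grad = dg(x)
        if abs(grad) < eps:
            return x
        v = -grad / d2g(x)          # Newton direction
        t = 1.0
        # backtrack until we are inside the domain and have sufficient decrease
        while x + t*v <= 0 or g(x + t*v) > g(x) + alpha * t * grad * v:
            t /= 2
        x = x + t*v
    return x

g   = lambda x: 7*x - math.log(x)
dg  = lambda x: 7 - 1/x
d2g = lambda x: 1/x**2
print(newton_backtracking(g, dg, d2g, x=4.0))   # -> 0.142857..., even from x0 = 4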

Figure 26.1: The solid line in the left figure is g(x) = log(e^x + e^{-x}). The circles indicate the function values at the successive iterates in Newton's method, starting at x^(0) = 1.15. The solid line in the right figure is the derivative g'(x). The dashed lines in the right-hand figure illustrate the first interpretation of Newton's method.

Figure 26.2: Newton's method with backtracking

26.4 Example Problems


Exercise. A loan of A dollars is repaid by making n equal monthly payments of M dollars, starting a month after the loan is made. It can be shown that if the monthly interest rate is r, then
$$Ar = M\left(1 - \frac{1}{(1+r)^n}\right)$$
A car loan of $10000 was repaid in 60 monthly payments of $250. Use the Newton method to find the monthly interest rate correct to 4 significant figures.

Even quite commonplace money calculations involve equations that cannot be solved by 'exact' formulas. Let r be the interest rate. Then
$$10000\,r = 250\left(1 - \frac{1}{(1+r)^{60}}\right)$$
Dividing through by 250,
$$f(r) = 40r + \frac{1}{(1+r)^{60}} - 1 \implies f'(r) = 40 - \frac{60}{(1+r)^{61}}$$
Using Newton's method:
$$r_{n+1} = r_n - \frac{40r_n + \frac{1}{(1+r_n)^{60}} - 1}{40 - \frac{60}{(1+r_n)^{61}}}$$
If the interest rate were 2.5% a month, the monthly interest on $10,000 would be $250, and so with monthly payments of $250 we would never pay off the loan. So the monthly interest rate must have been substantially under 2.5%. A bit of trying suggests taking r0 = 0.015. We then find that r1 = 0.014411839, r2 = 0.014394797 and r3 = 0.01439477. This suggests that to four significant figures the monthly interest rate is 1.439%.

Exercise. Derive the Newton equation for the unconstrained minimization problem
$$\min_x \left( \tfrac{1}{2} x^T x + \log \sum_{i=1}^{m} \exp(a_i^T x + b_i) \right)$$
Give an efficient method for solving the Newton system, assuming the matrix $A \in \mathbb{R}^{m \times n}$ (with rows $a_i^T$) is dense, with $m \ll n$. Give an approximate FLOP count of your method.

Soln.
$$f(x) = \tfrac{1}{2} x^T x + \log \sum_{i=1}^{m} \exp(a_i^T x + b_i)$$
$$\nabla f(x) = x + \frac{\sum_{i=1}^{m} a_i \exp(a_i^T x + b_i)}{\sum_{i=1}^{m} \exp(a_i^T x + b_i)}$$
$$\nabla^2 f(x) = I + \frac{\left(\sum_{i} e^{a_i^T x + b_i}\right)\left(\sum_{i} a_i a_i^T\, e^{a_i^T x + b_i}\right) - \left(\sum_{i} a_i\, e^{a_i^T x + b_i}\right)\left(\sum_{i} a_i\, e^{a_i^T x + b_i}\right)^T}{\left(\sum_{i} e^{a_i^T x + b_i}\right)^2}$$
We now solve for $v = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)$.

Since $\nabla^2 f(x_k)$ is positive definite, we can use the Cholesky or QR factorization.

Cholesky: $nm^2$ (forming the system) $+ \frac{1}{3}m^3$ (Cholesky) $+ 2m^2$ (backward/forward substitution) $+ 2mn$ (matrix-vector products) $\approx nm^2 + \frac{1}{3}m^3$ FLOPs.

QR decomposition: $2nm^2$ (QR) $+ m^2$ (forward substitution) $+ 2mn$ (matrix-vector products) $\approx 2nm^2$ FLOPs.

The QR method is slower than Cholesky (by a factor of about two if $n \gg m$), but it is more accurate. It is the preferred method if n and m are not too large. For very large sparse problems, the Cholesky factorization is useful.
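A sketch of the Cholesky route in numpy/scipy (our own illustration; for clarity it forms the n × n Hessian directly rather than exploiting the m ≪ n structure discussed above, and the log-sum-exp is evaluated in a numerically stable way):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(A, b, x):
    """One Newton step for f(x) = 0.5 x^T x + log sum_i exp(a_i^T x + b_i)."""
    z = A @ x + b
    z = z - z.max()                      # shift for numerical stability
    w = np.exp(z); w /= w.sum()          # softmax weights
    grad = x + A.T @ w
    # Hessian: I + A^T (diag(w) - w w^T) A, which is positive definite
    H = np.eye(len(x)) + A.T @ (np.diag(w) - np.outer(w, w)) @ A
    c, low = cho_factor(H)               # Cholesky factorization of H
    return x - cho_solve((c, low), grad)

rng = np.random.default_rng(1)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
x = np.zeros(3)
for _ in range(8):
    x = newton_step(A, b, x)
print(x)   # converges quadratically to the minimizer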

Exercise. Solve the unconstrained minimization problem
$$\min_x \sum_{i=1}^{m} \log\left( \exp(a_i^T x - b_i) + \exp(-a_i^T x + b_i) \right)$$
where A is an m × n matrix (with rows $a_i^T$) and b is an m-dimensional vector.

Solution. Let $g(x) = \sum_{i=1}^{m} \log\left(\exp(a_i^T x - b_i) + \exp(-a_i^T x + b_i)\right)$.

We express g as $g(x) = f(Ax - b)$, i.e., $f(y) = \sum_{i=1}^{m} \log\left(\exp(y_i) + \exp(-y_i)\right)$.
$$\left(\nabla f(y)\right)_i = \frac{\exp(y_i) - \exp(-y_i)}{\exp(y_i) + \exp(-y_i)}$$
$$\left(\nabla^2 f(y)\right)_{ij} = \begin{cases} \dfrac{4}{(\exp(y_i) + \exp(-y_i))^2} & i = j \\ 0 & i \neq j \end{cases}$$
Once we have the gradient and Hessian of f, the implementation of Newton's method is straightforward. Using them, compute the gradient and Hessian of g as shown below:
$$\nabla g(x) = A^T \nabla f(y), \qquad \nabla^2 g(x) = A^T \nabla^2 f(y)\, A$$
We start with $x_0 = (1, 1, \ldots, 1)$, set α = 0.01, and terminate when $\|\nabla g(x)\| \leq 10^{-5}$.

Note: The number of iterations of an iterative algorithm depends on the problem parameters and on the starting point; hence its efficiency is not expressed by a FLOP count, but rather by upper bounds on the number of iterations needed to reach a given accuracy.

26.5 Additional Problems


1. Let f(x) be a strictly convex, twice-continuously differentiable function and x* a point for which ∇f(x*) = 0. Suppose that H(x) satisfies the following conditions:
   (a) there exists a scalar h > 0 for which $\|H(x^*)^{-1}\| \leq \frac{1}{h}$
   (b) there exist scalars β > 0 and L > 0 for which $\|H(x) - H(x^*)\| \leq L\|x - x^*\|$
2. Let x satisfy $\|x - x^*\| < \gamma := \min\{\beta, \frac{2h}{3L}\}$ and $x_N := x - H(x)^{-1}\nabla f(x)$; then prove the following:
   (i) $\|x_N - x^*\| \leq \|x - x^*\|^2 \cdot \frac{L}{2(h - L\|x - x^*\|)}$
   (ii) $\|x_N - x^*\| < \|x - x^*\| < \gamma$
   (iii) $\|x_N - x^*\| \leq \|x - x^*\|^2 \cdot \frac{3L}{2h}$
   Hint: Ref. Quadratic Convergence Theorem of Newton's Method.
3. Show that the eigenvalue-eigenvector problem can be cast as a root-finding problem, i.e.,
   $$(A - \lambda I)v = 0, \quad v^T v = 1$$
   Derive the Newton iteration to solve the above problem.
4. Use Newton's method to find all the solutions of the two equations
   $$x_1^2 + x_2^2 = 16$$
   $$(x_1 - 2)^2 + (x_2 - 3)^2 = 25$$
5. Derive the gradient and Hessian of the function $g(x) = \sqrt{\|Cx\|^2 + 1}$, where C is a left-invertible m × n matrix.
   Hint: Use the result of Homework Problem 1 and the expression ∇²g(x) = Cᵀ∇²h(Cx + d)C for the Hessian of the function g(x) = h(Cx + d).

Chapter 27

Closer Look at Gradient Descent Optimization

27.1 Introduction

Chapter 28

Optimization in Deep Neural Networks

Chapter 29

Optimization in Support Vector Machines

Chapter 30

Regression and Regularization

30.1 Introduction

Chapter 31

Sparse Coding and Dictionary Learning

31.1 Representation and Coding
For many problems in engineering, building a numerical representation has been the first step. Often this representation is a vector of n elements; it could be some measurements of the physical phenomenon. Such representations are understood as elements of a vector space. Often there is also a set of basis functions, such as the Fourier basis, for this representation.

In this lecture, we are interested in an over-complete basis. This leads to representations that can be “sparse”.

There are many aspects to look at in choosing the right representation. It could be the length of the representation (the number of elements in the vector). It could also be the compressibility of the representation, or the number of bits required to store it. Beyond these computational requirements, one is also (obviously) interested in building a representation that is useful for solving the problem. One useful representation scheme is based on sparse linear combinations of some basis functions. Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This focuses on learning the basis set, also called the dictionary, to adapt it to specific data, an approach that has recently proven to be very effective for signal reconstruction and classification in the audio and image processing domains.

Sparseness is one of the reasons for the extensive use of popular signal transforms such as the Discrete Fourier
Transform(DFT), the wavelet transform(WT) and the Singular Value Decomposition(SVD). These transforms often
reveal certain structures of a signal and are used to represent these structures in a compact and sparse representation.
Sparse representations have therefore increasingly become recognized as providing extremely high performance for
applications as diverse as: noise reduction, compression, feature extraction, pattern classification and blind source
separation. Sparse approximation techniques also build the foundations of wavelet denoising and methods in pattern
classification, such as in the Support Vector Machine(SVM).

1. Sparse modeling calls for constructing efficient representations of data as a (often linear) combination of a few
typical patterns (atoms) learned from the data itself. Significant contributions to the theory and practice of
learning such collections of atoms (usually called dictionaries or codebooks), and of representing the actual
data in terms of them, leading to state-of-the-art results in many signal and image processing and analysis
tasks. The first critical component of this topic is how to sparsely encode a signal given the dictionary.
2. The actual dictionary plays a critical role, and it has been shown once and again that learned dictionaries
significantly outperforms off-the-shelf ones such as wavelets.
3. There are numerous applications where the dictionary is not only adapted to reconstruct the data, but also
learned for a specific task, such as classification, edge detection and compressed sensing .

There are many formulations and algorithms for sparse coding in the literature. Let $y \in \mathbb{R}^m$ be a data point in a data set Y.

Assume that y can be expressed as
$$y = Dx \tag{31.1}$$
where:

• y is an m × 1 column vector
• D is an m × p matrix (the dictionary)
• x is a p × 1 coefficient vector with at most k non-zero entries
• k ≪ m

We discussed how we can build a model based on a large number of observations. Using sparse coding we try to represent a new sample in terms of a sparse set of input samples. The large set of samples available to us is called the dictionary. Each sample, i.e., each column of the matrix, is known as an atom. The dictionary is over-complete, and the rows and columns are not required to be linearly independent. There are a lot of advantages to working with sparse vectors. For example, calculations involving multiplying a vector by a matrix generally take less time if the vector is sparse. Also, sparse vectors require less space when stored on a computer, as only the positions and values of the non-zero entries need to be recorded.

Problem 1 - Dictionary Learning

Given samples y (typically many samples y1, ..., yn), find (learn) D.

This is generally a harder problem compared to the one given below, and the major focus in these notes will be on the second problem, discussed hereafter.

Problem 2 - Sparse Coding Sparse coding that is, modelling data vectors as sparse linear combinations of
basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This focuses on
learning the basis set, also called dictionary, to adapt it to specific data, an approach that has recently proven to be
very effective for signal reconstruction and classification in the audio and image processing domains.

31.2 Problem of Sparse Coding


Sparse Coding: Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. Formally, the aim of sparse coding is to find a set of basis vectors $\phi_i$ such that we can represent an input vector x as a linear combination of these basis vectors:
$$x = \sum_{i=1}^{k} a_i \phi_i$$

The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and
patterns inherent in the input data.

Sparse Representation of a Signal: The problem of finding the sparse representation of a signal in a given overcomplete dictionary can be formulated as follows. Given an N × M matrix A containing the elements of an overcomplete dictionary in its columns, with M > N, and a signal $y \in \mathbb{R}^N$, the problem of sparse representation is to find an M × 1 coefficient vector x such that y = Ax and $\|x\|_0$ is minimized, i.e.,
$$x = \arg\min_{x'} \|x'\|_0 \quad \text{s.t. } y = Ax'$$

In general the above problem is computationally intractable. Therefore we go for sparse approximation.

Sparse Coding

Traditionally, a sample was transformed into a sparse vector based on a particular basis. Representing a sample in a particular basis involves finding a unique set of expansion coefficients for that sample in that basis. There are a lot of advantages to working with sparse vectors. For example, calculations involving multiplying a vector by a matrix generally take less time if the vector is sparse. Also, sparse vectors require less space when stored on a computer, as only the positions and values of the entries need to be recorded.

The main disadvantage of using an orthogonal basis to represent a new sample is that a specific basis works for specific types of samples and may not work with others. For example, smooth continuous signals are sparsely represented in a Fourier basis, while impulses are not. On the other hand, a smooth signal with isolated discontinuities is sparsely represented in a wavelet basis, while a wavelet basis is not efficient at representing a signal whose Fourier transform has narrow high frequency support.

Real world observations often contain features that prohibit sparse representation in any single basis. To ensure that
we are able represent every vector in the space, the dictionary of vectors we need to choose from, must span the
space. How ever, because the set is not limited to a single basis, the dictionary is not linearly independent.

Because the vectors in the dictionary are not a linearly independent set, the sample representation in the dictionary may not be unique. However, by creating a redundant (over-defined) dictionary, we can expand our sample in a set of vectors that is a union of several
a set of vectors that is a union of several bases. You are free to create a dictionary consisting of the union of several

bases. For example to represent an arbitrary document our dictionary may be defined as a union of various different
type of document bases like sports document, musical document, culinary document etc

 
$$\begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1,n-1} & d_{1n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{m,n-1} & d_{mn} \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix}$$

Note on Solving Hard Problems

1. Sparse coding: given a dictionary D and a new observation y, find a sparse representation x of y: minimize $\|x\|_0$ such that y = Dx, or minimize $\|x\|_0$ such that $\|y - Dx\| \leq \epsilon$.
2. Dictionary learning: given a lot of samples $y_1, \ldots, y_n$, find the dictionary D.

Sparse coding requires minimizing the L0 norm, which is a hard problem, so there are two practical ways to solve the problem, as shown in the image.

Figure 31.1: Practical approaches to solve a hard problem

The eccentricities of sparse coding can be enumerated as follows:

1. Given x and D, the task is to find a such that it best approximates the signal or data using a highly sparse vector.
2. Consider a linear system of equations x = Da, where the dictionary (or design matrix) D is an underdetermined m × p (m ≪ p) matrix, $x \in \mathbb{R}^m$ and $a \in \mathbb{R}^p$. The problem is estimation of the signal a, subject to it being sparse. Sparse decomposition helps in such a way that even though the observed values are in a high-dimensional space ($\mathbb{R}^m$), the actual signal is organized in some lower-dimensional subspace ($\mathbb{R}^k$, k ≪ m).
3. It is evident that only a few components of a are non-zero and the rest are zero, as it is a sparse vector. This implies that x can be decomposed as a linear combination of only a few m × 1 vectors in D; these vectors are known as atoms. D itself is over-complete (m ≪ p). Such vectors (atoms) are called the basis of x. Here, though, unlike other dimensionality-reducing decomposition techniques such as Principal Component Analysis, the basis vectors are not required to be mutually orthogonal.

31.2.1 The Sparse Coding Problem Formulated

The sparse decomposition problem is represented as:

min ||a||0 | x = Da (31.2)


α∈Rp

where $\|a\|_0$ is the $l_0$ pseudo-norm, which counts the number of non-zero components of a. A convex relaxation of the problem can instead be obtained by taking the $l_1$ norm instead of the $l_0$ norm, where $\|a\|_1 = \sum_{i=1}^{p} |a_i|$. The $l_1$ norm induces sparsity under certain conditions.

The need to learn the dictionary (D)

The linear decomposition of a signal using a few atoms of a learned dictionary, instead of a predefined one, has led to better results for numerous tasks. For example, low-level image processing tasks such as denoising, as well as higher-level tasks such as classification, show that sparse learned models are well adapted to natural data. The dictionary you are trying to learn should be specific to the subject.

31.2.2 What are sparse representations/approximations good for?

Sparseness is one of the reasons for the extensive use of popular transforms such as the Discrete Fourier Transform,
the wavelet transform and the Singular Value Decomposition. The aim of these transforms is often to reveal certain
structures of a signal and to represent these structures in a compact and sparse representation. Sparse representations
have therefore increasingly become recognized as providing extremely high performance for applications as diverse
as: noise reduction, compression, feature extraction, pattern classification and blind source separation. Sparse
representation ideas also build the foundations of wavelet denoising and methods in pattern classification, such as in
the Support Vector Machine and the Relevance Vector Machine, where sparsity can be directly related to learnability
of an estimator.

Sparse signal representations allow the salient information within a signal to be conveyed with only a few elementary
components, called atoms. The goal of sparse coding is to represent input vectors approximately as a weighted linear
combination of a small number of (unknown) “basis vectors”. These basis vectors thus capture high-level patterns in
the input data. Sparse coding is, modelling data vectors as sparse linear combinations of basis elements, it is widely
used in machine learning, neuroscience, signal processing, and statistics. It is proven to be very effective for signal
reconstruction and classification in the audio and image processing domains.

31.2.3 Application
1. When a sparse coding algorithm is applied to natural images, the learned bases resemble the receptive fields
of neurons in the visual cortex.
2. Sparse coding produces localized bases when applied to other natural stimuli such as speech and video.
3. There are many applications of sparse coding in Seismic Imaging linear regression and Transform Coding.

31.2.4 Important aspects related to Sparse Coding


1. Signal and image processing: Restoration, reconstruction
(a) Image Denoising
(b) Inpainting
(c) Demosaicking
(d) Video Processing
(e) Other Applications
2. Sparse Linear Models and Dictionary Learning:
(a) The machine learning point of view
(b) Why does the l1 -norm induce sparsity?
(c) Dictionary Learning and Matrix Factorization
(d) Group Sparsity
(e) Structure Sparsity
3. Computer Vision Applications :
(a) Learning codebooks for image classification
(b) Modelling the local appearance of image patches
(c) Background subtraction with structured sparsity
4. Optimization for sparse methods
(a) Greedy algorithms
(b) l1 approximations
(c) Online dictionary learning .

31.3 BP, MP and OMP
The sparse decomposition problem is represented as:

min ||x||0 | y = Dx (31.3)


α∈Rp

where $\|x\|_0$ is the $l_0$ pseudo-norm, which counts the number of non-zero components of x. A convex relaxation of the problem can instead be obtained by taking the $l_1$ norm instead of the $l_0$ norm, where $\|x\|_1 = \sum_{i=1}^{p} |x_i|$. The $l_1$ norm induces sparsity under certain conditions.

31.3.1 Overview

Basis Pursuit The idea of Basis Pursuit is to replace the difficult sparse problem with an easier optimization
problem. The difficulty with the above problem is the L0 norm. Basis Pursuit replaces the L0 norm with the L1 to
make the problem easier to work with. Basis Pursuit: min ||x||1 subject to Ax = b

Matching pursuit (MP): A greedy iterative algorithm for approximately solving the original $l_0$ pseudo-norm problem. Matching pursuit works by finding the basis vector in D that maximizes the correlation with the residual (initialized to y), and then recomputing the residual and coefficients by projecting the residual on all atoms in the dictionary using the existing coefficients.

Orthogonal matching pursuit (OMP): It is similar to Matching Pursuit, except that an atom, once picked, cannot be picked again. The algorithm maintains an active set of atoms already picked, and adds a new atom at each iteration. The residual is projected onto a linear combination of all atoms in the active set, so that an orthogonal updated residual is obtained. Both Matching Pursuit and Orthogonal Matching Pursuit use the $l_2$ norm.

31.3.2 Basic Matching Pursuit

Matching pursuit is a greedy algorithm that computes the best nonlinear approximation to a sample in a complete, redundant dictionary. Matching pursuit builds a sequence of sparse approximations to the signal step-wise. Let $\Phi = \{\varphi_k\}$ denote a dictionary of unit-norm atoms, and let f be your signal.

1. Start by defining $R_0 f = f$.
2. Begin the matching pursuit by selecting the atom from the dictionary that maximizes the absolute value of the inner product with $R_0 f = f$. Denote that atom by $\varphi_p$.
3. Form the residual $R_1 f$ by subtracting the orthogonal projection of $R_0 f$ onto the space spanned by $\varphi_p$:
$$R_1 f = R_0 f - \langle R_0 f, \varphi_p \rangle \varphi_p \tag{31.4}$$
4. Iterate by repeating steps 2 and 3 on the residual:
$$R_{m+1} f = R_m f - \langle R_m f, \varphi_k \rangle \varphi_k \tag{31.5}$$
5. Stop the algorithm when you reach some specified stopping criterion.

In nonorthogonal (or basic) matching pursuit, the dictionary atoms are not mutually orthogonal vectors. Therefore,
subtracting subsequent residuals from the previous one can reintroduce components that are not orthogonal to the
span of the previously included atoms. The next algorithm Orthogonal Matching Pursuit handles this problem.

Drawbacks

Matching pursuit has the drawback that an atom can be selected multiple times; orthogonal matching pursuit removes this drawback in its implementation.

Algorithm: Matching Pursuit

Objective Function:
min ||b − Da||_2^2   subject to   ||a||_0 ≤ N                (31.6)

Steps:

1. r_0 = b, t = 0, V_0 = ∅
2. Let v_t = i, where d_i is the atom closest to (most correlated with) the residual r_t; set V_{t+1} = V_t ∪ {v_t}
3. Update the coefficient of the selected atom, α_{v_t} ← α_{v_t} + ⟨r_t, d_{v_t}⟩, and the residual,
   r_{t+1} = r_t − ⟨r_t, d_{v_t}⟩ d_{v_t}
4. t ← t + 1; go to step 2 until the stopping criterion is met.
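As a concrete illustration, the following is a minimal numpy sketch of the loop above (our own, not part of the original notes); the function name matching_pursuit and the fixed iteration count are assumptions, and the dictionary D is assumed to have unit-norm columns.

    import numpy as np

    def matching_pursuit(D, b, n_iter=10):
        # D: (m, p) dictionary with unit-norm columns; b: (m,) signal
        r = b.copy()                          # residual r_0 = b
        a = np.zeros(D.shape[1])              # coefficient vector
        for _ in range(n_iter):
            c = D.T @ r                       # correlations with all atoms
            k = int(np.argmax(np.abs(c)))     # most correlated atom
            a[k] += c[k]                      # accumulate its coefficient
            r -= c[k] * D[:, k]               # subtract projection onto that atom
        return a, r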

31.3.3 Orthogonal Matching Pursuit

In orthogonal matching pursuit (OMP), the residual is always orthogonal to the atoms already selected. This means
that the same atom can never be selected twice and results in convergence for a n-dimensional vector after at most
n steps.

The OMP algorithm is:

1. Denote your signal by f. Initialize the residual R^0 f = f.

2. Select the atom that maximizes the absolute value of the inner product with the current residual. Denote that
atom by φ_p.
3. Form a matrix, Φ, with the previously selected atoms as the columns. Define the orthogonal projection operator
onto the span of the columns of Φ:

P = Φ(Φ*Φ)^{−1}Φ*                (31.7)

4. Apply the orthogonal projection operator to the residual.
5. Update the residual:

R^{m+1} f = (I − P) R^m f                (31.8)

where I is the identity matrix.

Orthogonal matching pursuit ensures that the previously selected atoms are not chosen again in subsequent steps.

The other approach is to solve an approximated problem in an exact manner. Taking the example of the sparse coding
problem, we have seen that it is a hard problem, but it can be approximated by an easier problem that can be solved
exactly. In the easier version we minimize the l_1 norm instead of the l_0 norm. This algorithm is known as Basis
Pursuit.

Algorithm: Orthogonal Matching Pursuit

Objective function:
min ||y − Dx||_2^2   subject to   ||x||_0 ≤ L                (31.9)

Steps:

1. Λ_0 = ∅, r_0 = y
2. for t = 1, 2, ..., L do
3.    Select the atom which most reduces the objective: v_t = arg max_i |⟨r_{t−1}, d_i⟩|
4.    Update the active set: Λ_t = Λ_{t−1} ∪ {v_t}
5.    Update the coefficients by solving a least-squares problem over the atoms in Λ_t.
6.    Update the residual: r_t = y − D_{Λ_t} x_{Λ_t}

It is similar to Matching Pursuit, except that an atom, once picked, cannot be picked again. The algorithm maintains
an active set of atoms already picked, and adds a new atom at each iteration. The residual is projected onto a
linear combination of all atoms in the active set, so that an orthogonally updated residual is obtained. Both Matching
Pursuit and Orthogonal Matching Pursuit use the l_2 norm. Contrary to MP, an atom can only be selected one time with
OMP; it is, however, more difficult to implement efficiently when the number of signals is large.

Sparse Approximations: This problem is NP-hard in general. Therefore various relaxed sparsity measures have
been presented to make the problem tractable. Two commonly used methods for solving sparse approximation
problems are:

1. Basis Pursuit

2. Orthogonal Matching Pursuit

Basis Pursuit: The idea of Basis Pursuit is to replace the difficult sparse problem with an easier optimization
problem. The formal definition of the sparse problem is:

Sparse problem: min ||x||_0 subject to Ax = b,

where ||·||_0 counts the number of non-zero coefficients. The difficulty with the above problem is the l_0 norm.
Basis Pursuit replaces the l_0 norm with the l_1 norm to make the problem easier to work with:

Basis Pursuit: min ||x||_1 subject to Ax = b,

where ||·||_1 is the sum of the absolute values of the coefficients.

BP can be solved using linear programming, projected gradient, or interior-point methods.
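To make the linear-programming route concrete, here is a sketch (ours, not from the notes) that solves min ||x||_1 s.t. Ax = b via the standard variable split x = u − v with u, v ≥ 0; the helper name basis_pursuit is hypothetical, and scipy.optimize.linprog is assumed to be available.

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit(A, b):
        # min 1^T(u + v)  s.t.  A(u - v) = b,  u, v >= 0
        # (equivalent to min ||x||_1 s.t. Ax = b)
        m, n = A.shape
        c = np.ones(2 * n)                     # objective: sum(u) + sum(v) = ||x||_1
        A_eq = np.hstack([A, -A])              # equality constraint A(u - v) = b
        res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
        u, v = res.x[:n], res.x[n:]
        return u - v                           # recover x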

Orthogonal Matching Pursuit: Orthogonal matching pursuit (OMP) constructs an approximation through an
iterative process. At each iteration the locally optimal solution is calculated. This is done by finding the column
vector in A which most closely resembles a residual vector r. OMP is based on a variation of an earlier algorithm
called Matching Pursuit (MP). MP simply removes the component along the selected column vector from the residual
vector at each iteration:

r_t = r_{t−1} − ⟨a_{OP}, r_{t−1}⟩ a_{OP}

where a_{OP} is the column vector in A which most closely resembles r_{t−1}. OMP instead uses a least-squares step at
each iteration to update the residual vector in order to improve the approximation.

Algorithm for OMP:

Input: signal b, matrix A and a stopping criterion

Output: approximation (coefficient) vector c

Algorithm:

1. Start by setting the residual r_0 = b, the time t = 0 and the index set V_0 = ∅.

2. Let v_t = i, where a_i gives the solution of max_k |⟨r_{t−1}, a_k⟩|, the a_k being the column vectors of A.
3. Update the set V_t with v_t: V_t = V_{t−1} ∪ {v_t}
4. Solve the least-squares problem

min_{c} ||b − Σ_{j=1}^{t} c(v_j) a_{v_j}||_2

5. Calculate the new residual using c:

r_t = b − Σ_{j=1}^{t} c(v_j) a_{v_j}

6. Set t ← t + 1.
7. Check the stopping criterion; if the criterion has not been satisfied, return to step 2.
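A compact numpy sketch of the OMP loop above (our illustration; the function name omp is hypothetical); A is assumed to have unit-norm columns and L is the target sparsity.

    import numpy as np

    def omp(A, b, L):
        r = b.copy()                                   # residual r_0 = b
        support = []                                   # index set V_t
        x = np.zeros(A.shape[1])
        for _ in range(L):
            k = int(np.argmax(np.abs(A.T @ r)))        # atom most correlated with residual
            if k not in support:
                support.append(k)
            # least-squares coefficients on the active set
            c, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
            r = b - A[:, support] @ c                  # residual orthogonal to chosen atoms
        x[support] = c
        return x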

31.3.4 Discussions

Uniqueness of sparse approximation A sparse representation need not be unique. For uniqueness
it has to satisfy a certain condition. A sparse representation x of b is unique if

||x||_0 < spark(A)/2

where spark(A) is defined as the size of the smallest set of linearly dependent columns of A.
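Since spark(A) drives this uniqueness condition, a brute-force check can help on toy problems. The sketch below is ours; computing the spark is itself NP-hard, so this exponential-time search is only sensible for very small matrices.

    import numpy as np
    from itertools import combinations

    def spark(A):
        # smallest number of linearly dependent columns of A
        n = A.shape[1]
        for size in range(1, n + 1):
            for cols in combinations(range(n), size):
                if np.linalg.matrix_rank(A[:, list(cols)]) < size:
                    return size            # found a dependent subset of this size
        return np.inf                      # all columns independent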

Sparse coding using a fixed dictionary Consider the unconstrained optimization problem. Let y ∈ R^d and
x ∈ R^N (where d < N) be the input and the coefficient vectors, and let the matrix D ∈ R^{d×N} be the dictionary:

min_x φ(x),   φ(x) = ||y − Dx||_2^2 + λ||x||_0

where ||x||_0 is the sparsity measure (which counts the number of non-zero coefficients) and λ is a constant
multiplier.

Replacing ||x||_0 by ||x||_1, as in basis pursuit, the problem becomes an l_1-regularized linear least-squares problem. A number
of recent methods for solving this type of problem are based on coordinate descent with soft thresholding (sketched below). When
the columns of the dictionary have low correlation, these simple methods have proven to be very efficient.
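A minimal sketch of coordinate descent with soft thresholding for min_x ½||y − Dx||_2^2 + λ||x||_1 (our illustration, assuming unit-norm columns of D; the names soft and cd_lasso are hypothetical):

    import numpy as np

    def soft(z, t):
        # soft-thresholding operator S_t(z)
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def cd_lasso(D, y, lam, n_sweeps=50):
        x = np.zeros(D.shape[1])
        r = y - D @ x                          # current residual
        for _ in range(n_sweeps):
            for j in range(D.shape[1]):
                old = x[j]
                rho = D[:, j] @ r + old        # correlation with coordinate j taken out
                x[j] = soft(rho, lam)          # closed-form coordinate minimizer
                r += D[:, j] * (old - x[j])    # incremental residual update
        return x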

31.4 Dictionary Learning

31.4.1 Why Learn a Dictionary?

The linear decomposition of a signal using a few atoms of a learned dictionary, instead of a predefined one, has led
to better results for numerous tasks: low-level image processing tasks such as denoising, as well as higher-level
tasks such as classification, show that sparse learned models are well adapted to natural data. The dictionary you
are trying to learn should be specific to the domain.

Dictionary Learning Note, however, that the columns of learned dictionaries are in general highly correlated, which
makes the sparse coding step harder than with low-correlation predefined dictionaries.

Let D = [d_1, ..., d_p] ∈ R^{m×p} be a set of normalized basis vectors; we call it the dictionary. Let Y be a set of
N m-dimensional input signals. D is adapted to a signal y if it can represent it with a few basis vectors, that is,
there exists a sparse vector α in R^p such that

y ≈ (d_1 | d_2 | ... | d_p) (α[1], α[2], ..., α[p])^T = Dα

We call α the sparse code.

Learning a reconstructive dictionary with K items for sparse representation of Y can be accomplished by solving the
following problem:

⟨D, X⟩ = arg min_{D,X} ||Y − DX||_2^2   s.t.   ∀i, ||x_i||_0 ≤ T

where D = [d_1 ... d_K] ∈ R^{n×K} (K > n, making the dictionary over-complete) is the learned dictionary, X =
[x_1, ..., x_N] ∈ R^{K×N} are the sparse codes of the input signals Y, and T is a sparsity constraint factor (each signal
has fewer than T items in its decomposition).

The construction of D is achieved by minimizing the reconstruction error while satisfying the sparsity constraints.
The K-SVD algorithm is an iterative approach that minimizes this energy and learns a reconstructive dictionary for
sparse representations of signals. It is highly efficient and works well in applications such as image restoration and
compression. The term ||Y − DX||_2^2 denotes the reconstruction error.

31.4.2 Dictionary Learning

In Dictionary Learning, we consider the problem of finding a few representatives for a dataset, i.e., a subset of data
points that efficiently describes the entire dataset. The problem is computationally costly.

min Σ_i ||x_i||_0   s.t.   ||Y − DX||_2 < ε

Dictionary learning: given y_1, ..., y_N, find D. There is one sparse code vector for each signal.

Given a set Y = (y_j)_{j=1}^{N} of N signals y_j ∈ R^m, dictionary learning aims at finding the best dictionary
D = (d_i)_{i=1}^{p} of p atoms d_i ∈ R^m to sparse code all the data.

If the dictionary can be adapted to the domain, then coding with it can be much faster.

For example, consider a dictionary for sound signals. A sound signal has salient features that can be used to
reconstruct the signal using only a few elementary components, called 'atoms'. To encode a signal, first its vector
quantization is done, and then a dictionary can be built using its salient features, which are domain specific.
Successful application of a sparse decomposition depends on the dictionary used, and whether it matches the signal
features.

In some cases the original signal cannot be reconstructed exactly from the dictionary; this error can be tolerated up
to some threshold value.

31.4.3 Dictionary Learning Methods:

In dictionary learning, one often starts with some initial dictionary and finds sparse approximations of the set of training
signals while keeping the dictionary fixed. This is followed by a second step in which the sparse coefficients are kept
fixed and the dictionary is optimized. This alternation runs for a specific number of iterations or until
a specific approximation error is reached. Most of these algorithms have been derived for dictionary learning in a
noisy sparse approximation setting. A sketch of this alternating scheme is given below.
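The following numpy sketch of the alternating scheme is ours; the dictionary update shown is the simple least-squares (MOD-style) step rather than K-SVD, the function name dictionary_learning is hypothetical, and it reuses the omp() sketch given earlier.

    import numpy as np

    def dictionary_learning(Y, p, L, n_iter=20):
        m, n = Y.shape
        D = np.random.randn(m, p)
        D /= np.linalg.norm(D, axis=0) + 1e-12           # unit-norm atoms
        X = np.zeros((p, n))
        for _ in range(n_iter):
            # sparse coding step: dictionary fixed
            X = np.column_stack([omp(D, Y[:, i], L) for i in range(n)])
            # dictionary update step: codes fixed, least-squares fit of D
            D = Y @ np.linalg.pinv(X)
            D /= np.linalg.norm(D, axis=0) + 1e-12       # renormalize columns
        return D, X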

31.4.4 k-means: [Randomly Partition Y to k subsets]

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster
analysis in data mining.

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster
with the nearest mean, which serves as a prototype of the cluster.

min_{X,D} ||Y − DX||^2   s.t.   ||x_i||_0 ≤ T_0 ∀i;   with T_0 = 1 this reduces to k-means

Here each x_i is a k-dimensional coefficient (indicator) vector and D is a full-rank matrix.

31.4.5 Dictionary Learning via k-means

Data points Y = {y_1, ..., y_N} will be mapped to clusters C_1, ..., C_k.

C is the given dictionary, C = [C_1 ... C_k].

Each y_i is approximated as y_i = C x_i, where x_i = e_j is a standard basis (indicator) vector, chosen such that

||y_i − C e_j|| ≤ ||y_i − C e_k||   ∀ k ≠ j

min ||Y − CX||   s.t.   x_i = e_k for some k

||Y − DX|| = ||Y − Σ_{j=1}^{k} d_j x_T^j||

Algorithm:
1. Find C_i, i = 1, ..., k, as the mean of the members of subset i.
2. Repartition, such that every signal is assigned to the nearest subset.

E = Σ_{i=1}^{N} e_i^2 = ||Y − CX||^2
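A small numpy sketch of this two-step loop (ours; kmeans_dictionary is a hypothetical name), with signals as the columns of Y:

    import numpy as np

    def kmeans_dictionary(Y, k, n_iter=20):
        m, n = Y.shape
        C = Y[:, np.random.choice(n, k, replace=False)].copy()   # init atoms from data
        for _ in range(n_iter):
            # (n, k) matrix of distances from every signal to every atom
            d = np.linalg.norm(Y[:, :, None] - C[:, None, :], axis=0)
            assign = d.argmin(axis=1)                  # nearest atom for every signal
            for j in range(k):
                members = Y[:, assign == j]
                if members.shape[1] > 0:
                    C[:, j] = members.mean(axis=1)     # atom = mean of its cluster
        return C, assign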

31.5 K-SVD
K-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value
decomposition approach. K-SVD is a generalization of the k-means clustering method, and it works by iteratively
alternating between sparse coding the input data based on the current dictionary, and updating the atoms in the
dictionary to better fit the data, much as in expectation maximization. K-SVD is widely used in
applications such as image processing, audio processing, biology, and document analysis.

Problem Description

Given an overcomplete dictionary matrix D ∈ R^{n×K} that contains K signal-atoms as its columns, a signal x ∈ R^n
can be represented as a linear combination of these atoms. To represent x, the sparse representation a should
satisfy the exact condition x = Da, the approximate condition x ≈ Da, or ||x − Da||_p ≤ ϵ. The vector a ∈ R^K
contains the representation coefficients of the signal x. Typically, the norm p is selected as 1, 2 or ∞.

If n < K and D is a full-rank matrix, an infinite number of solutions are available for the representation problem;
hence, constraints should be set on the solution. Also, to ensure sparsity, the solution with the fewest nonzero
coefficients is preferred. Thus, the sparse representation is the solution of either

(P_0)   min_a ||a||_0   subject to   x = Da                (31.10)

or

(P_{0,ϵ})   min_a ||a||_0   subject to   ||x − Da||_2 ≤ ϵ                (31.11)

where the l_0 norm counts the nonzero entries of a vector.

Algorithm: K-SVD

K-SVD is a kind of generalization of K-means, as follows.

The k-means clustering can also be regarded as a method of sparse representation, that is, finding the best possible
codebook to represent the data samples {x_i}_{i=1}^{M} by nearest neighbor, by solving

min_{D,A} ||X − DA||_F^2   subject to   ∀i, a_i = e_k for some k                (31.12)

or equivalently

min_{D,A} ||X − DA||_F^2   subject to   ∀i, ||a_i||_0 = 1                (31.13)

The sparse representation term a_i = e_k (where e_k is a column of the identity matrix) forces the K-means algorithm
to use only one atom (column) of the dictionary D. To relax this constraint, the target of the K-SVD algorithm is to
represent each signal as a linear combination of atoms in D. The K-SVD algorithm follows the construction flow of the
K-means algorithm. However, in contrast to K-means, in order to achieve a linear combination of atoms in D, the
sparsity term of the constraint is relaxed so that the number of nonzero entries of each column a_i can be more than
1, but less than a number T_0.

So the objective function becomes

min_{D,A} ||X − DA||_F^2   subject to   ∀i, ||a_i||_0 ≤ T_0                (31.14)

or, in another objective form,

min_{D,A} Σ_i ||a_i||_0   subject to   ||X − DA||_F^2 ≤ ϵ                (31.15)

In the K-SVD algorithm, the dictionary D is first set to be fixed and the best coefficient matrix A is calculated. As
finding the truly optimal A is impossible, we use an approximation pursuit method. Any such algorithm, e.g.
orthogonal matching pursuit (OMP), can be used for the calculation of the coefficients, as long as it can supply a
solution with a fixed and predetermined number of nonzero entries T_0.

After the sparse coding task, the next job is to search for a better dictionary D. However, finding the whole
dictionary all at one time is impossible, so the process updates only one column of the dictionary D at a time
while A is kept fixed. The update of the k-th column is done by rewriting the penalty term as

||X − DA||_F^2 = ||X − Σ_{j=1}^{K} d_j a_T^j||_F^2 = ||(X − Σ_{j≠k} d_j a_T^j) − d_k a_T^k||_F^2 = ||E_k − d_k a_T^k||_F^2                (31.16)

where a_T^k denotes the k-th row of A (the subscript T emphasizes that it is a row vector).

By decomposing the multiplication DA into a sum of K rank-1 matrices, we can assume the other K − 1 terms are
fixed while the k-th column remains unknown. After this step, we can solve the minimization problem by
approximating the E_k term with a rank-1 matrix using the singular value decomposition, and then update d_k with it.
However, the new solution for the vector a_T^k is very likely to be dense, because the sparsity constraint is not made
compulsory; the standard remedy, sketched below, is to restrict the update to the signals whose codes use atom k.
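The following numpy sketch of one K-SVD dictionary-update step is our own illustration (ksvd_atom_update is a hypothetical name); here Y denotes the data matrix and X the coefficient matrix (X and A, respectively, in the text above). Restricting the rank-1 SVD update to the signals with nonzero entries in row k preserves the sparsity pattern.

    import numpy as np

    def ksvd_atom_update(Y, D, X, k):
        # signals whose sparse codes use atom k (nonzeros in row k of X)
        omega = np.nonzero(X[k, :])[0]
        if omega.size == 0:
            return D, X
        # error matrix without atom k, restricted to those signals
        E_k = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
        U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
        D[:, k] = U[:, 0]                      # new atom: first left singular vector
        X[k, omega] = s[0] * Vt[0, :]          # matching row of coefficients
        return D, X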


31.5.1 K-SVD: Dictionary Update


Algorithm:

Finding the whole dictionary all at one time is impossible, so the process updates only one column of the dictionary
D at a time while X is kept fixed. The update of the k-th column is done by rewriting the penalty term as

||Y − DX|| = ||Y − Σ_{j=1}^{K} d_j x_T^j||

||(Y − Σ_{j≠k} d_j x_T^j) − d_k x_T^k|| = ||E_k − d_k x_T^k||

where x_T^k denotes the k-th row of X.

Overcomplete dictionary update via SVD:

SVD(E_k) = UΔV^T

If E_k were rank 1, Δ would be diagonal with only one non-zero principal diagonal element. In general, UΔV^T can be
expanded as the sum of rank-1 terms Δ_{11} u_1 v_1^T + Δ_{22} u_2 v_2^T + ..., and keeping only the first term gives
the closest rank-1 approximation. So we set

d_k ← u_1,   x_T^k ← (Δ_{11} v_1)^T

Convergence of this algorithm is guaranteed in the sense that both steps decrease (or at least do not increase) the
objective.

Note

The k-th atom of D is used by very few signals, since X is highly sparse. So instead of solving the whole (bigger)
problem, we can solve a smaller problem in which only the columns corresponding to signals with non-zero entries in
the k-th row of X are used.

31.6 Homework Problems:

31.6.1 Homework

Design a method to fit a line to a set of points in 2D such that orthogonal distance to the line is minimized.

The residuals of the best-fit line for a set of n points, using unsquared perpendicular distances d_i of the points
(x_i, y_i), are given by

R_⊥ = Σ_{i=1}^{n} d_i                (31.1)

Since the perpendicular distance from a line y = a + bx to point i is given by

d_i = |y_i − (a + bx_i)| / √(1 + b²),                (31.2)

the function to be minimized is

R_⊥ = Σ_{i=1}^{n} |y_i − (a + bx_i)| / √(1 + b²)                (31.3)

Since the absolute value function does not have continuous derivatives, minimizing R_⊥ is not amenable to an
analytic solution. However, if the square of the perpendicular distances

R_⊥² = Σ_{i=1}^{n} [y_i − (a + bx_i)]² / (1 + b²)                (31.4)

is minimized instead, the problem can be solved in closed form. R_⊥² is a minimum when

∂R_⊥²/∂a = (2/(1 + b²)) Σ_{i=1}^{n} [y_i − (a + bx_i)](−1) = 0                (31.5)

and

∂R_⊥²/∂b = (2/(1 + b²)) Σ_{i=1}^{n} [y_i − (a + bx_i)](−x_i) + (−2b/(1 + b²)²) Σ_{i=1}^{n} [y_i − (a + bx_i)]² = 0                (31.6)

The former gives

a = (Σ_{i=1}^{n} y_i − b Σ_{i=1}^{n} x_i) / n                (31.7)
  = ȳ − b x̄,                (31.8)

and the latter

(1 + b²) Σ_{i=1}^{n} [y_i − (a + bx_i)] x_i + b Σ_{i=1}^{n} [y_i − (a + bx_i)]² = 0                (31.9)

But

[y − (a + bx)]² = y² − 2(a + bx)y + (a + bx)²                (31.10)
              = y² − 2ay − 2bxy + a² + 2abx + b²x²,                (31.11)
so (31.9) becomes

(1 + b²)(Σ_{i=1}^{n} x_i y_i − a Σ_{i=1}^{n} x_i − b Σ_{i=1}^{n} x_i²) + b(Σ_{i=1}^{n} y_i² − 2a Σ_{i=1}^{n} y_i − 2b Σ_{i=1}^{n} x_i y_i + a² n + 2ab Σ_{i=1}^{n} x_i + b² Σ_{i=1}^{n} x_i²) = 0                (31.12)

[(1 + b²)(−b) + b·b²] Σ_{i=1}^{n} x_i² + [(1 + b²) − 2b²] Σ_{i=1}^{n} x_i y_i + b Σ_{i=1}^{n} y_i² + [−a(1 + b²) + 2ab²] Σ_{i=1}^{n} x_i − 2ab Σ_{i=1}^{n} y_i + a² b n = 0                (31.13)

−b Σ_{i=1}^{n} x_i² + (1 − b²) Σ_{i=1}^{n} x_i y_i + b Σ_{i=1}^{n} y_i² + a(b² − 1) Σ_{i=1}^{n} x_i − 2ab Σ_{i=1}^{n} y_i + a² b n = 0                (31.14)

Plugging the value of a into (31.14) then gives

−b Σ_{i=1}^{n} x_i² + (1 − b²) Σ_{i=1}^{n} x_i y_i + b Σ_{i=1}^{n} y_i² + (1/n)(b² − 1)[Σ_{i=1}^{n} y_i − b Σ_{i=1}^{n} x_i] Σ_{i=1}^{n} x_i − (2b/n)[Σ_{i=1}^{n} y_i − b Σ_{i=1}^{n} x_i] Σ_{i=1}^{n} y_i + (b/n)[Σ_{i=1}^{n} y_i − b Σ_{i=1}^{n} x_i]² = 0                (31.15)

On simplifying the equations we get

b² + [Σ_{i=1}^{n} y_i² − Σ_{i=1}^{n} x_i² + (1/n)((Σ_{i=1}^{n} x_i)² − (Σ_{i=1}^{n} y_i)²)] / [(1/n) Σ_{i=1}^{n} x_i Σ_{i=1}^{n} y_i − Σ_{i=1}^{n} x_i y_i] · b − 1 = 0                (31.16)

So define

B = (1/2) [(Σ_{i=1}^{n} y_i² − (1/n)(Σ_{i=1}^{n} y_i)²) − (Σ_{i=1}^{n} x_i² − (1/n)(Σ_{i=1}^{n} x_i)²)] / [(1/n) Σ_{i=1}^{n} x_i Σ_{i=1}^{n} y_i − Σ_{i=1}^{n} x_i y_i]                (31.17)
  = (1/2) [(Σ_{i=1}^{n} y_i² − n ȳ²) − (Σ_{i=1}^{n} x_i² − n x̄²)] / (n x̄ ȳ − Σ_{i=1}^{n} x_i y_i),                (31.18)

and the quadratic formula gives

b = −B ± √(B² + 1).                (31.19)
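As a quick numerical check of this closed-form solution, here is a short numpy sketch (ours; fit_line_perpendicular is a hypothetical name) that evaluates B, both roots b = −B ± √(B² + 1), and keeps the root with the smaller perpendicular residual:

    import numpy as np

    def fit_line_perpendicular(x, y):
        n = len(x)
        xbar, ybar = x.mean(), y.mean()
        # B from (31.18)
        B = 0.5 * ((np.sum(y**2) - n * ybar**2) - (np.sum(x**2) - n * xbar**2)) \
            / (n * xbar * ybar - np.sum(x * y))
        roots = np.array([-B + np.sqrt(B**2 + 1), -B - np.sqrt(B**2 + 1)])
        # perpendicular residual (31.4) for each candidate slope, with a = ybar - b*xbar
        costs = [np.sum((y - (ybar - b * xbar) - b * x) ** 2) / (1 + b**2) for b in roots]
        b = roots[int(np.argmin(costs))]
        a = ybar - b * xbar
        return a, b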

1. Under what condition does MP/OMP give the optimal answer?

Ans: Orthogonal Matching Pursuit will recover the optimal representation if

max_{a_i} ||B⁺ a_i||_1 < 1

where the a_i are the column vectors of A that are not in B, and B⁺ is the pseudo-inverse of B.

2. Under what conditions does BP yield the solution to the original problem?

Ans: Let A consist of the columns of two orthonormal bases with coherence µ. Then if x is a representation with

||x||_0 < (1/2)(µ⁻¹ + 1),

where ||·||_0 counts the number of non-zero coefficients, x is the unique sparsest solution and Basis Pursuit
recovers it.

Model Fitting

Once the data representation part is done, next we try to build a model with our data. Here we try to model the
target variable as a function of the features.

Let us represent the height of a child as y, the age of the child as x_1, the weight of the child as x_2, and the
heights of the parents as x_3 and x_4. We may create a linear model like

y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4                (31.20)

or a polynomial model like

y = a_0 + a_1 x_1 + a_2 x_1² + a_3 x_2 + a_4 x_2² + a_5 x_3 x_4                (31.21)

or an even more complex one like

y = a_0 + a_1 log x_1 + a_2 e^{2x_1} + a_3 x_3 x_4^5                (31.22)

Problem Solver

In this step we may try to either predict new values based on our model (prediction) or classify a new
observation into some class (classification).

Example - Linear Model (Line Fitting).

Taking the example of modelling the height of a child, let us say our model is very simple: it says that the height of
the child depends only linearly on the age of the child. Mathematically, with height y and age x_1,

y = a_0 + a_1 x_1                (31.23)

or, more generally, y = Ax, where x is an n-dimensional vector corresponding to a constant term and (n − 1) features.

As can be seen, this is the equation of a line, and we have many samples of the height and age of children. The
problem is to find the equation of the line, using all these samples, that minimizes the error. The error can be
defined in many forms, such as:

1. least-squares error ||y − Ax||_2^2: fit the line such that the sum of squared vertical distances of the points from
the line is minimized
2. l_0 norm ||y − Ax||_0: fit the line such that the number of points not on the line is minimized
3. orthogonal least-squares error: fit the line such that the sum of squared orthogonal distances of the points from
the line is minimized (cf. the homework above).
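For the ordinary least-squares variant (error 1 above), a minimal numpy example with made-up age/height numbers, purely for illustration:

    import numpy as np

    # hypothetical samples: ages (years) and heights (cm) of children
    x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    y = np.array([85.0, 100.0, 115.0, 128.0, 138.0])

    A = np.column_stack([np.ones_like(x1), x1])       # design matrix [1, x1]
    (a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes ||y - A a||_2^2
    print(a0, a1)                                     # fitted intercept and slope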

References

1. Francis Bach, Julien Mairal, Jean Ponce and Guillermo Sapiro. Sparse Coding and Dictionary Learning for
Image Analysis. ICCV'09 tutorial, Kyoto, 28 September 2009.
2. Yuchen Xie. On A Nonlinear Generalization of Sparse Coding and Dictionary Learning. Qualcomm Technologies,
Inc., San Diego, CA 92121, USA.
3. Philip Breen. Algorithms for Sparse Approximation. School of Mathematics, University of Edinburgh, 2009.
4. Shaobing Chen and David Donoho. Basis Pursuit. Statistics Department, Stanford University, Stanford, CA 94305.
5. Holger Boche, Robert Calderbank, Gitta Kutyniok, and Jan Vybiral. A Survey of Compressed Sensing.
6. The EM Algorithm. November 20, 2005.
7. T. Tony Cai and Lie Wang. Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise. IEEE
Transactions on Information Theory, 57(7), July 2011.
8. M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for
Sparse Representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.
9. K-SVD, https://en.wikipedia.org/wiki/K-SVD
10. Francis Bach, Julien Mairal, Jean Ponce and Guillermo Sapiro. Sparse Coding and Dictionary Learning for
Image Analysis (Part III: Optimization for Sparse Coding and Dictionary Learning). CVPR'10 tutorial, San
Francisco, 14 June 2010.

