Optimization and Mathematical Modeling in Computer Architecture
SYNTHESIS LECTURES ON
COMPUTER ARCHITECTURE
Morgan & Claypool Publishers
Series Editor: Mark D. Hill, University of Wisconsin
Shared-Memory Synchronization
Michael L. Scott
2013
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
Automatic Parallelization: An Overview of Fundamental Compiler Techniques
Samuel P. Midkiff
2012
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00531ED1V01Y201308CAC026
Lecture #26
Series Editor: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Synthesis Lectures on Computer Architecture
Print 1935-3235 Electronic 1935-3243
Optimization and Mathematical Modeling in Computer Architecture
Tony Nowatzki
University of Wisconsin-Madison
Michael Ferris
University of Wisconsin-Madison
Karthikeyan Sankaralingam
University of Wisconsin-Madison
Cristian Estan
Broadcom Corporation
Nilay Vaish
University of Wisconsin-Madison
David Wood
University of Wisconsin-Madison
Morgan & Claypool Publishers
ABSTRACT
In the last few decades computer systems and the underlying hardware have steadily become
larger and more complex. The need to increase their efficiency through architectural innovation
has not abated, but quantitatively evaluating the effect of various choices has become more dif-
ficult. Performance and resource consumption are determined by complex interactions between
many modules, each with many possible alternative implementations. We need powerful com-
puter programs to explore large design spaces, but the traditional approach of developing sim-
ulators, building prototypes, or writing heuristic-based algorithms in traditional programming
languages is often tedious and slow.
Fortunately mathematical optimization has made great advances in theory, and many fast
commercial and academic solvers are now available. In this book we motivate and describe the
use of mathematical modeling, specifically optimization based on mixed integer linear programming (MILP), as a way to design and evaluate computer systems. The major advantage is that the
architect or system software writer only needs to describe what the problem is, not how to find a
good solution. This greatly speeds up their work and, as our case studies show, it can often lead
to better solutions than the traditional approach.
In this book we give an overview of modeling techniques used to describe computer systems
to mathematical optimization tools. We give a brief introduction to various classes of mathemat-
ical optimization frameworks with special focus on mixed integer linear programming which
provides a good balance between solver time and expressiveness. We present four detailed case
studies—instruction set customization, data center resource management, spatial architecture
scheduling, and resource allocation in tiled architectures—showing how MILP can be used and
quantifying by how much it outperforms traditional design exploration techniques. This book
should help a skilled systems designer to learn techniques for using MILP in their problems, and
the skilled optimization expert to understand the types of computer systems problems that MILP
can be applied to.
Fully operational source code for the examples used in this book is provided through the
NEOS System at http://www.neos-guide.org/content/computer-architecture
KEYWORDS
Integer Linear Programming, ILP, Mixed Integer Linear Programming, MILP,
Mathematical Modeling, General Algebraic Modeling System, GAMS, Optimiza-
tion, Spatial Architectures, Tiled Architectures, Scheduling, Resource Allocation,
Instruction Set Customization
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Why this Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Evolution of Mathematical Theories and Algorithms . . . . . . . . . . . . . . . . 1
1.1.2 Maturity of Solvers and Modeling Systems . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Complexity of Computer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Who is this Book For? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 What is this Book About? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Mathematical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Optimization as a Modeling Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 The Essential Primitives of MILP . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Illustrative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.5 Benefits of Modeling and MILP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 What this Book is not About . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Book Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Code Provided with this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 An Overview of Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Overview of Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Models for Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Convex Programming F¹ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 Network Flow Problems F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Mixed Integer Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.5 Mixed Integer Nonlinear Programs F . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Modeling Problems as MILP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.1 Logic and Binary Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.2 Constraint Logic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.3 Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
¹F indicates an optional section.
2.3.4 Piecewise-linear Models F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.5 Modeling Mixed Integer Nonlinear Programs F . . . . . . . . . . . . . . . . . . 39
2.4 Solution Methods F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.1 Branch-and-bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.2 Extensions to Basic Branch-and-bound . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.3 Column Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.4 Bender’s Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.5 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.6 Modeling Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.1 Properties of a MILP-friendly Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Understanding the Limitations of MILP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.1 Properties of Optimization Problems Unsuitable to MILP . . . . . . . . . . 116
7.2.2 Example Problems Poorly Suited to MILP . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Implementing Your Optimization Problems in MILP . . . . . . . . . . . . . . . . . . . 119
7.3.1 First Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.3.2 Dealing with MILP Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.3.3 Optimizing and Tuning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Acknowledgments
We would like to acknowledge and thank many of you who contributed to this synthesis lecture. Thanks to Mark Hill for helping refine the scope of the book, his constant encouragement,
and reviewing drafts of this lecture. Thanks to Benjamin Lee, Lieven Eeckhout, Paul Feautrier,
and other anonymous reviewers for several comments that helped improve this lecture. Thanks
to Newsha Ardalani, Michael Bussieck, Preston Briggs, Daniel Luchuap, Zach Marzec, and
Lorenzo De Carli for reading drafts in detail and providing feedback. Thanks to Somesh Jha
for his help on early work in applying SMT techniques to spatial architecture scheduling. Thanks
from Nilay to Srikrishna Sridhar and Taedong Kim for the numerous discussions on mathematical optimization.
Thanks to Michael Sartin-Tarm for creating the online case studies for this synthesis lecture.
Much of the content of this lecture is built on research that has been supported by various
grants from the Air Force Office of Scientific Research, the Department of Energy and the Na-
tional Science Foundation. We are grateful for their vision to facilitate both focused disciplinary
research and the development of new areas, applications, and impacts arising from interdisci-
plinary interactions.
Tony Nowatzki, Michael Ferris, Karthikeyan Sankaralingam, Cristian Estan, Nilay Vaish, and
David Wood
September 2013
CHAPTER 1
Introduction
1.1 WHY THIS BOOK?
The past half century has seen an important culmination of trends. First, the mathematical theory behind optimization has evolved to a state where it can be considered mature, and this has
spurred classes of entirely new and significantly more powerful methods for solving optimization
problems. Second, the robustness, capabilities, and performance of academic and commercial
optimization solvers, along with new ways of expressing real-world systems using mathematical
constructs and languages, have greatly matured and show no sign of slowing progress. Third, architectures
and systems have become more complex, necessitating sophisticated analysis for modeling
and advanced algorithms for design. The power of mathematical tools enables reasoned analysis
of these complex systems. Because of these trends, the use of optimization for computer systems
in general, and computer architecture specifically, becomes ever more practical and helpful, and
therefore continues to see increased use. These broad trends are outlined below.
VLSI Perhaps the most prominent applications of optimization techniques have been in the
field of VLSI. Indeed, almost all levels of VLSI physical design automation have been solved
with Mixed Integer Linear Programming. The floorplanning problem, where hardware modules
are placed on a 2D plane to minimize the total area, is a natural fit for MILP. It saw much
research throughout the 1990s and into the 2000s; prominent works include Sutanthavibul
et al., Sen et al., and Dorneich et al. [51, 55, 154, 160]. Global routing, where the approximate
connections between blocks are determined, is also a good fit for MILP. Since these problems are
usually large in scale, relaxations and approximations of the MILP are used to solve them
[20, 165, 180]. MILP has been applied not only to problems in VLSI design automation, but also in the related field of scheduling tests for verification [36, 37].
Integer linear programming sees continued use and research in VLSI technologies, includ-
ing floorplanning with module selection in mixed granularity FPGAs [157], and floorplanning
in 3D manufacturing to reduce the number of 3D vias subject to area and power constraints [97].
Also, MILP has been recently applied to routing for flip-chip interconnects [62]. An excellent
reference on optimization in this field is “Combinatorial Optimization in VLSI Design” by Held
et al. [91].
Computer Architecture To provide a contrast between VLSI and computer architecture problems, we characterize them as follows. VLSI problems have well-defined objective functions,
work on circuits already represented as a graph, and their primitive element is a gate or transistor,
or something even lower-level. There are millions of nodes/nets or other variables. Architecture problems tend to
involve coarser-grained blocks, like functional units or processing elements, interacting
in an ad-hoc, or more unrestricted, fashion. Problems are not already formulated with a graph
representation, and the objective function can be unclear. The coarse-grained nature leads to many
fewer decisions, generally in the range of 10s to 100s. We discuss some examples next.
In the accelerator domain, optimization has seen a variety of uses. Lee et al. use integer
linear programming for scheduling the execution of hardware datapaths, subject to timing and
resource constraints [115]. Similarly, Azhar et al. use MILP for specializing the datapath for a
Viterbi decoding accelerator [13]. An important field of research focuses on the interconnect for
SOC designs, where the solution is to employ a network-on-chip. The interconnect topology and
bandwidth can be specialized to an application to avoid over-provisioning. Srinivasan et al. solve
this problem using a two-stage MILP formulation, minimizing the power consumption subject to
performance constraints [159]. The performance of systems can be dramatically affected by which pieces
are implemented in hardware versus software. The benefit of placing computation in hardware is
largely improved performance, but its costs involve area and power. One important work in this
field is by Niemann et al. [140], who take a VHDL specification of an application and use integer
linear programming to separate the specification into software and hardware components.
In this book, we study four concrete use cases for MILP in the field of computer architec-
ture, chosen to demonstrate both MILP’s range of applicability and expressive power in terms of
modeling interesting system features.
Systems perspective for the skilled optimization expert This book provides a systems perspective
on the nature of problems encountered in computer systems by examining in detail four case
studies in which mathematical modeling is applied to the design and evaluation of four very
different systems. Thus it gives the optimization expert background and perspective on systems
problems, helping them approach and solve such problems using their insight and expertise.
Modeling techniques for the systems expert This book provides a general overview of mathematical modeling and optimization specifically for the systems expert interested in learning about
such techniques. Specifically we use Mixed Integer Linear Programming, covering its basic the-
ory and practical implementation and uses. We discuss how MILP modeling is used to design
and evaluate four diverse types of systems. The specific system aspects considered in this book are:
instruction set customization for a processor, data center job scheduling, and compiler/microar-
chitecture design of spatial computing architectures.
1. All system features that have freedom to be changed by the designer are expressed as vari-
ables.
2. All system features that are fixed, we refer to as parameters; they essentially become constants
in mathematical formulas.
3. The behavior of the system is expressed using a collection of functions of the variables and
parameters.
4. All system features that restrict its behavior are expressed as constraints on these functions.
5. e system property that the designer wants to optimize is selected from these functions
and is termed the objective function.
6. The objective function and the collection of constraints together form the model of the
system.
7. For MILP, some variables can take only integer values and constraints are linear relation-
ships of variables.
Based on the range and types of values variables can take, the relationship between variables
that make up constraints, and the nature of the final objective(s), there are various “families” of
optimization models, including linear programming, convex quadratic programming, etc. These
relationships are developed in detail in Chapter 2. This book and its examples focus on MILP
because it provides a nice balance between expressive power, solution time, and the ability to provide
optimality guarantees.
Example 1: Special Function Units A processor architect must choose some number of spe-
cial functional units (SFUs) based upon the cost and projected performance benefits. SFUs are
specialized pieces of hardware, like a sin/cos unit or a video decoder, which are good at specific
computations or tasks. Specifically, there are N special function units which improve the performance of a specific task, each one requiring a certain chip area, $A_n$. The maximum area budget
is $MAX_{area}$. Also, there are M applications, and it is known how much each special function
unit can speed up each application in M, called $S_{mn}$. The values $A_n$, $MAX_{area}$, and $S_{mn}$ are the
input parameters of the system. The architect must choose which special function units to operate
or implement, represented by $O_n$. Here, $\{O_n \mid n = 1, \dots, N\}$ is the set of binary variables which
describe the choices in our system. The performance improvement on an application $m \in M$ is
$\sum_{n=1}^{N} S_{mn} O_n$, where each improvement $S_{mn}$ is only counted when unit n is operational, i.e.,
$O_n = 1$. If the goal is to improve all applications’ performance by some minimum amount, the
following objective function and constraints form the model of interest:
\[
\max_{O} \;\; \min_{m=1,\dots,M} \; \sum_{n=1}^{N} S_{mn} O_n
\]
\[
\text{s.t.} \quad \sum_{n=1}^{N} A_n O_n \le MAX_{area}
\]
\[
O_n \in \{0,1\} \qquad n = 1,\dots,N
\]
In the notation above, “$\max_O$” indicates that the members of O are the variables of the problem.
The expression to the right is what we are maximizing, which is the minimum performance improvement over all applications. The first constraint above, which appears after the “s.t.”, restricts
the total area to the maximum, and the second constraint ensures that all variables in O are binary.
As we outline in Section 2.2.2 (or by elementary observations), this model can be recast as a
MILP, using an additional variable PERF to model the lower-bound performance improvement
of all applications:
\[
\max_{O,\, PERF} \; PERF
\]
\[
\text{s.t.} \quad \sum_{n=1}^{N} A_n O_n \le MAX_{area}
\]
\[
PERF \le \sum_{n=1}^{N} S_{mn} O_n \qquad \text{for all } m \in M
\]
\[
O_n \in \{0,1\} \quad n = 1,\dots,N; \qquad PERF \in \mathbb{R}
\]
In this case, the objective function is simply to maximize $PERF$. The first constraint above remains the same, and the second constraint calculates the performance gain based on which SFUs
are selected, limiting $PERF$ to the minimum performance gain across applications.
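To make the recast concrete, the sketch below solves a tiny, invented instance of the SFU model by brute force over the binary vector O (all data values here are hypothetical, chosen only for illustration); a real workflow would hand the same model to a MILP solver instead.

```python
from itertools import product

# Hypothetical data: 3 SFUs, 2 applications (invented for illustration).
A = [3.0, 2.0, 4.0]                 # area A_n of each SFU
S = [[0.5, 0.1, 0.3],               # speedup S_mn of SFU n on application m
     [0.2, 0.4, 0.1]]
MAX_AREA = 6.0

best_perf, best_O = None, None
# Enumerate all binary choices O_n in {0,1}; feasible if total area fits.
for O in product([0, 1], repeat=len(A)):
    if sum(a * o for a, o in zip(A, O)) > MAX_AREA:
        continue
    # PERF is the minimum, over applications m, of sum_n S_mn * O_n.
    perf = min(sum(s * o for s, o in zip(row, O)) for row in S)
    if best_perf is None or perf > best_perf:
        best_perf, best_O = perf, O

print(best_O, best_perf)   # selecting SFUs 1 and 2 fits the budget
```

Here the best choice is the first two units (area 5 ≤ 6), giving both applications a 0.6 improvement; exhaustive enumeration is exponential in N, which is exactly why the MILP recast with the auxiliary $PERF$ variable matters at scale.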
Example 2: Instruction Scheduling Instruction scheduling is a compiler optimization which
orders instructions to achieve the maximum amount of instruction-level parallelism. Here, we consider the problem of scheduling a basic block for a multi-issue in-order processor (with issue width
r). The number of instructions in the basic block is N. The set $D_{ij}$ lists pairs of data-dependent
instructions, i.e., pairs such that instruction j depends on the output of instruction i. The
expected latencies between dependent instructions are captured by $L_{ij}$. The values of $D_{ij}$ and $L_{ij}$
are the system’s input parameters. The instructions are mapped into particular clock cycles; the
maximum number we consider is $C_{max}$. The schedule is represented by the set of binary variables
$cycle_{ic}$, which indicate whether instruction i is mapped into cycle c. The objective, in this example, is
to minimize the total cycles necessary (the latency), described by the variable $LAT$. Good modeling practice also introduces an additional variable $number_i$ that describes the cycle number that
instruction i executes on. Adding variables that have physical meaning often can lead to tighter
* Define params randomly for this example.
S(m,n) = uniform(0,1/%N%);
A(n) = max(normal(1,.5),0);
MAXarea = %N%/5;

* Declare and solve MILP model.
Model sample1 /limPerf,limArea/;
solve sample1 using MIP maximizing PERF;

GAMS provides built-in math functions like max, and also uniform/normal for generating random numbers; these calculations occur before model generation/evaluation. Finally, we create the model and solve using MIP (Mixed Integer Linear Programming), maximizing the objective variable PERF.
models. It is important for clarity and efficiency to introduce them if they appear multiple times
within the model. The following describes the above model:
\[
\min_{cycle,\, number,\, LAT} \; LAT
\]
\[
\text{s.t.} \quad \sum_{c=1}^{C_{max}} cycle_{ic} = 1 \qquad \text{for all } i = 1,\dots,N
\]
\[
\sum_{i=1}^{N} cycle_{ic} \le r \qquad \text{for all } c = 1,\dots,C_{max}
\]
\[
number_i = \sum_{c=1}^{C_{max}} c \cdot cycle_{ic} \qquad \text{for all } i = 1,\dots,N
\]
\[
number_j \ge number_i + L_{ij} \qquad \text{for all } (i,j) \in D
\]
\[
LAT \ge number_i \qquad \text{for all } i = 1,\dots,N
\]
The first equation enforces that each instruction is mapped to exactly one clock cycle. The
second constraint enforces the issue width of the processor. The expression $\sum_{c=1}^{C_{max}} c \cdot cycle_{ic}$
describes the cycle number that instruction i executes on. The fourth equation is the most complex, but simply ensures that dependent instructions are scheduled at least the required number
of cycles apart. The final constraint computes the objective, the latency of the basic block, by constraining $LAT$ to be larger than any of the individual instructions’ cycle numbers. Minimizing $LAT$ is
the objective function.
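As a sanity check on this formulation, the brute-force sketch below (with an invented three-instruction basic block) enumerates cycle assignments directly, enforcing the same issue-width and dependence constraints and minimizing LAT; a MILP solver explores the same space far more cleverly.

```python
from itertools import product

# Hypothetical basic block: 3 instructions, issue width r = 2.
# Instruction 2 depends on 0 and 1, each with latency 1 cycle.
N, r, C_MAX = 3, 2, 4
L = {(0, 2): 1, (1, 2): 1}   # (i, j) -> L_ij

best_lat = None
# number[i] plays the role of the MILP variable number_i.
for number in product(range(1, C_MAX + 1), repeat=N):
    # Issue width: at most r instructions may share a cycle.
    if any(number.count(c) > r for c in range(1, C_MAX + 1)):
        continue
    # Dependences: number_j >= number_i + L_ij for all (i, j) in D.
    if any(number[j] < number[i] + lat for (i, j), lat in L.items()):
        continue
    lat = max(number)          # LAT >= number_i for all i
    if best_lat is None or lat < best_lat:
        best_lat = lat

print(best_lat)   # instructions 0 and 1 dual-issue in cycle 1, 2 in cycle 2
```

The optimal latency here is 2 cycles; note how each Python filter corresponds one-for-one to a constraint family in the model above.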
• First, compared to simulation and building systems, modeling is much faster. In computer
architecture for example, modeling is several orders of magnitude faster than studying sys-
tems with simulation. Essentially MILP solvers can perform fast enumeration and pruning
of possibilities.
• Modeling expresses the design problem in a declarative fashion, which contrasts with im-
plementing/simulating a system, performed imperatively or heuristically. For the problems
we study in this book, heuristic-based techniques are typically written using many lines of
imperative code like C/C++ which implement a heuristic to solve the problem. In contrast,
a MILP program can declaratively specify the original problem to be solved in a very concise
fashion. For reference, Table 1.1 shows the lines of GAMS code required for each of the
models in the case studies in this book, ranging from 20 to 200 lines. This is significantly less
than the 1,000s to 10,000s of lines of C++ or Java code required for some heuristic-based
versions.
• Modeling can often provide deep insight on the problem and “holes” in the design. Sometimes these are just omissions to the model formulation, but often they reveal new mechanisms that the modeler was unaware of. In many cases, an optimal solution reveals an
Set n /n1*n%N%/;
set c /1*%N%/;
alias(n,i,j);

* Dependences and Latency from i to j
Set D(i,j);
Parameter L(i,j);

Model sample2b /all/;
solve sample2b using MIP minimizing LAT;

The alias command creates new names for the same set. D(i,j) is a multi-dimensional set over the aliased set n; equations which involve D will use the aliased i and j to distinguish between the first and second index. Finally, we define the problem using all equations and solve using MIP while minimizing LAT.
Modeling practitioners should weigh the benefits of MILP along with the trade-offs and
limitations. For instance, certain optimization problems may not be suitable for MILP because
they are too large, non-linear, or inherently dynamic. In Chapter 7, we discuss the characteristics
Table 1.1: Lines of code for case study models (does not include inputs or solver flags).
of “MILP-friendly” problems, along with practical strategies for deciding when to use MILP and
how to speed up MILP formulations.
CHAPTER 2
An Overview of Optimization
This chapter serves two broad purposes. First, it provides a general overview of optimization techniques. Second, it provides a detailed treatment of how to model design problems as MILP, including the selection and formulation of variables, constraints, and objective functions, and some advanced techniques to reformulate nonlinearity into linearity. For those who want to quickly learn
the basic material required to understand the book’s case studies and quickly learn the techniques
of MILP, we recommend reading Sections 2.1, 2.2.1, 2.2.4, and 2.3; less essential sections
are marked with a star F. Those who want a broader understanding of the field of optimization
should read the entire chapter.
The chapter is organized as follows. We provide a general overview of optimization in
the next section, followed by three sections that contain an introduction to optimization from
a mathematical perspective. First, we give an overview of selected mathematical optimization
models. e subsequent section describes how to model interesting and useful phenomena in
Mixed Integer Linear Programming (MILP), and we conclude by giving some insights into the
mechanisms of MILP solvers.
Definitions To aid the discussion of different types of optimization models, we briefly describe
some terms:
Integer variables Variables which can take on only discrete, integer values. Binary variables are
a specialization that can only take the values 0 or 1.
Linear constraints Equations or inequalities which allow only linear relationships between vari-
ables. In practice, this means variables can only be multiplied by constants and summed.
Nonlinear constraints Equations or inequalities which allow arbitrary (but typically differen-
tiable) relationships between variables.
Linear objective A function defining the optimization goal, which is expressed in terms of linear
relationships between model variables.
Convex domain The set of allowable values for a model, such that a line segment between any
two points in the set is completely contained inside the set.
Convex objective An optimization function whose epigraph (the set of points on or above the
function’s graph) is a convex set. All local extrema of a convex function over a convex domain
are global extrema.
Network representation A special representation of graphs, based on node-arc incidence, lever-
aged to improve the efficiency of network modeling.
Depending on the types of variables, domains, and objective functions which are allowed,
different optimization models are created. One basic model we consider is Linear Programming
(LP), where variables must be continuous, and all the relationships between variables and the ob-
jective function must be linear. The most complex model we mention is Mixed Integer Nonlinear
Programming (MINLP), where some variables can be forced to be integral, and the constraints
and objective are allowed to be nonlinear. Figure 2.1 shows a number of important models dis-
cussed in this book, and how they are related, where MINLP is the least restricted model, and
LP is the most restricted. Edges in the figure represent additional restrictions on some aspect of
the model. For instance, Linear Programming is a subset of Mixed Integer Linear Programming,
in that all variables must be continuous. All Linear Programs are, in fact, Mixed Integer Linear
Programs.
As computational environments have become more and more powerful, the types of models
that can be efficiently processed have also grown to include complex models like MINLP, which
have nonlinear and sometimes nonsmooth functions, coupled with discrete as well as continu-
ous variables. Theoretical enhancements have extended the discipline beyond the convex domain
to look at global solution of non-convex models. Even with tremendous strides, the field is still
rapidly expanding and evolving to tackle difficult problems. e challenges of size (spatial/tem-
poral/decision hierarchical) remain an active area of research: traditional approaches have proven
inadequate, even with the largest supercomputers, due to the range of scales and prohibitively
large number of variables. Furthermore, the nature of the model data remains problematic: how
to deal with environments that are data sparse, data rich, or for which the underlying data is un-
certain due to measurement errors, the lack of understanding of the model structure or random
processes. Thus, rather than considering all possible optimization problems using the above features, we will specialize to a collection of problems that are useful to capture many of the aspects
Figure 2.1: Optimization Models. Edges indicate additional restrictions to the model.
of a design optimization problem within the field of computer architecture. Table 2.1 outlines
the optimization models we will cover in this chapter, defines their fundamental properties, and
gives some example problems.
\[
Ax \le b, \qquad Ex = g.
\]
[Figure 2.2: The Dairy Farmer Problem as a linear program. The axes are x (ice cream) and y (butter); dotted objective contours z = 10, 20, 30 are shown, and the optimal solution lies at (4, 5).]
Notice that X has extreme (corner or vertex) points: it can be shown that if X is non-empty, then
either the problem is unbounded below (meaning the objective function can be driven to $-\infty$) or
a solution exists that is an extreme point.
[Figure: schematic of a linear program minimizing $c^{T} x$ subject to $Ax \le b$, showing the data $c$, $A$, $b$ and the variable $x$.]
To make the above description concrete, consider the Dairy Farmer Problem. A dairy farm
makes ice cream and butter, and must decide how much of each to produce. Their cow produces
22 pints of milk per day, and it takes 3 pints to produce 1 pint of ice cream, and 2 pints of milk to
produce 1 pint of butter. A pint of ice cream takes 15 min to make, while churning up a pint of
butter takes an hour; there are 6 total hours of labor available. Finally the ice cream must be kept
overnight in the freezer, and the total capacity is 6 pints. They can sell their product each morning
for 4 dollars/pint for butter and 5 dollars/pint for ice cream. The objective is to maximize their total
profit. This problem can be modeled aptly by a linear program, which is depicted in Figure 2.2.
Here we have formulated resource constraints for total labor, milk production, and freezer space,
limiting the total resources required in producing butter and ice cream to the maximum available.
For instance, the “milk constraint” limits the production of butter and ice cream by enforcing that
the total milk required (3 times the amount of ice cream, and 2 times the amount of butter) is
at most the total milk available (22 pints). Also, dotted lines show the objective function, where each line represents choices that correspond to some overall profit. These are drawn according to the objective function (4 times the amount of butter, plus 5 times the amount of ice cream). The solution is the feasible point that gives the highest profit.
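The Dairy Farmer Problem is small enough to solve directly. Below is a minimal sketch using SciPy's `linprog` (an assumed tool choice, not one the book prescribes); since `linprog` minimizes, we negate the profit coefficients, and the 15 minutes per pint of ice cream becomes 0.25 hours in the labor constraint.

```python
from scipy.optimize import linprog

# Variables: x = pints of ice cream, y = pints of butter.
# Maximize 5x + 4y  <=>  minimize -5x - 4y.
c = [-5, -4]
A_ub = [
    [3, 2],     # milk:    3x + 2y <= 22 pints
    [0.25, 1],  # labor:   0.25x + y <= 6 hours
    [1, 0],     # freezer: x <= 6 pints
]
b_ub = [22, 6, 6]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)  # plan (4, 5): 4 pints of ice cream, 5 of butter, profit 40
```

The optimum sits where the milk and labor constraints intersect, matching the corner point visible in Figure 2.2.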
20 2. AN OVERVIEW OF OPTIMIZATION
Associated with any linear program is a dual linear program. The dual of (2.1) is

max_u  bᵀu  s.t.  Aᵀu = c, u ≥ 0.    (2.2)

Theorem 2.1 Any linear program (2.1) is either infeasible, or its objective function can be driven to −∞ (it is unbounded), or it has an optimal solution. When an optimal solution exists, then the dual problem (2.2) also has a solution, and their optimal objective function values are equal. Furthermore, any feasible point of the dual provides a lower bound on the optimal value of (2.1), and any feasible point of (2.1) provides an upper bound on the optimal value of (2.2).
The theory of duality is an extremely powerful tool that helps to define algorithms for the solution of (2.1), and also provides mathematical guarantees for the existence of solutions. The dual variables u are important within economics, where they are known as shadow prices due to
the fact that the rate of change of the objective function of (2.1) with respect to changes in bi can
be shown to be ui . Duality also indicates which constraints are important at solution points, and
can be used as a practical (rigorous) mechanism to determine when to terminate an algorithm
with a solution that is sufficiently close to being optimal.
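Strong duality can be checked numerically on a small instance. The sketch below (the data and the use of SciPy are my own illustration) assumes the standard pairing of primal min cᵀx s.t. Ax ≥ b, x ≥ 0 with dual max bᵀu s.t. Aᵀu ≤ c, u ≥ 0, and confirms that the two optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

# Primal: min c^T x  s.t.  A x >= b, x >= 0.
c = np.array([2.0, 3.0])
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
b = np.array([4.0, 5.0])

# linprog expects <= rows, so negate both sides of A x >= b.
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)

# Dual: max b^T u  s.t.  A^T u <= c, u >= 0  (minimize -b^T u).
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)  # both 7.0: the optimal values are equal
```

Any feasible dual point evaluated at bᵀu would likewise give a lower bound on the primal optimum, as Theorem 2.1 states.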
We briefly comment on linear programming solvers here. There are two popular algorithm types for linear programs, namely the simplex method and interior point (or barrier) methods. The former searches for a solution by moving from one extreme point to the next, while the latter
remains in the (relative) interior of the feasible region (by virtue of a barrier function that goes
to infinity on the boundary of the feasible region) and traces a parametric path of solutions to
an optimal feasible point. While the simplex method has exponential worst-case complexity, linear programming is in complexity class P, since variants of the interior point algorithm have polynomial complexity. In practice both methods are useful, since they have very different computational features that are relevant in varying situations.
x, y ∈ C  ⟹  {(1 − λ)x + λy | 0 ≤ λ ≤ 1} ⊆ C.
Any polyhedral set (an intersection of half-spaces or linear constraints) is convex, and hence the
feasible region of a linear program is convex. Convex functions can be defined by their epigraphs,
2.2. MODELS FOR OPTIMIZATION 21
that is,

epi f = {(x, μ) ∈ ℝⁿ⁺¹ | f(x) ≤ μ};

f is a convex function if its epigraph is a convex subset of ℝⁿ⁺¹. The convex programming problem is:
min f(x)  s.t.  x ∈ C,    (2.3)

where f is a convex function and C is a convex set. Therefore, linear programs are a special case of convex programs. Other cases include quadratic programming, where
f(x) = ½ xᵀQx + cᵀx
with Q being a positive semidefinite matrix (xᵀQx ≥ 0 for all x), and C being defined by linear (and convex quadratic) constraints. The factor ½ is customary in the optimization literature, since the matrix of second derivatives of f is used often and in this case will be equal to Q. This is purely a convenience.
Convex programming is a special case of a general nonlinear program, defined as:

min f(x)  s.t.  hᵢ(x) ≤ 0, i = 1, …, m,    (2.4)

where f and the hᵢ are general nonlinear functions. Specifically, when f and the hᵢ are convex functions, then (2.4) is a convex program (2.3), with C = {x | hᵢ(x) ≤ 0, i = 1, …, m} being a convex set by virtue of each hᵢ being a convex function. Calculus plays an important role in the theory of
nonlinear optimization. We note that this gives rise to optimality conditions (mechanisms to
prove that a solution is optimal), but these are typically based on a local analysis. The point x is a local solution of (2.4) if there is some δ > 0 such that

f(x) ≤ f(y)  for every feasible point y with ‖y − x‖ ≤ δ.

The key feature that makes convex programming attractive is the fact that any local solution of a convex program is in fact a global solution (where δ is infinite).
Much of the theory and algorithmic development of linear programming can be extended
to the convex programming setting. In particular, the theory of duality extends in a natural way
to this setting, albeit with some additional assumptions often termed constraint qualifications.
Minimax problems
We consider the solution of a modification of linear programming in which the linear objective
function is replaced by a convex piecewise-linear function. Such a function can be represented as the pointwise maximum of a set of linear functions, which allows us to reduce the problem to a linear program. Figure 2.4 shows an example convex piecewise-linear function in blue, taken as the pointwise maximum of the linear functions shown in black. Writing the piecewise-linear objective as f(x) = max_{i=1,…,m} {(cⁱ)ᵀx + dᵢ}, the problem is

min_x  f(x)  s.t.  Ax ≥ b.    (2.6)
By introducing an artificial variable θ, we can reformulate (2.6) as the following linear program:

min_{(x,θ)}  θ
s.t.  (cⁱ)ᵀx + dᵢ ≤ θ,  ∀i = 1, …, m;
      Ax ≥ b.
Note that the constraints themselves do not guarantee that θ equals f(x); they ensure only that θ is greater than or equal to f(x). However, the fact that we are minimizing θ ensures that θ takes on the smallest value consistent with these constraints, so at the optimum it is indeed equal to f(x). The examples in Section 1.3.4 use this construction.
Any (continuous) convex function can be approximated to any degree of accuracy by a piecewise-linear convex function [96]. Thus, a (separable) convex program can effectively be approximated to any degree of accuracy by a sequence of linear programs using the technique outlined above. Note also that problems involving both ‖x‖₁ (the sum of the absolute values of the elements of x) and ‖x‖∞ (the maximum element of x) can be expressed as linear programs using the formulation above.
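To make the artificial-variable construction concrete, here is a small sketch (the particular function is my own toy example) that minimizes the pointwise maximum f(x) = max(−x, x − 2) by introducing θ and solving the resulting LP with SciPy:

```python
from scipy.optimize import linprog

# Minimize f(x) = max(-x, x - 2) via:  min theta  s.t.  -x <= theta, x - 2 <= theta.
# Decision vector: [x, theta]; both variables are free.
c = [0, 1]
A_ub = [[-1, -1],   # -x - theta <= 0
        [1, -1]]    #  x - theta <= 2
b_ub = [0, 2]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None), (None, None)])
print(res.x)  # x = 1, theta = -1: the two pieces cross where both constraints bind
```

At the optimum both constraints are tight, so θ equals f(x), exactly as the text argues.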
[Figure 2.5: An example minimum-cost network flow problem with one supply node (b₁ = +8) and two demand nodes (including b₅ = −4), with uᵢⱼ = 3 for all (i, j) ∈ A. In the left panel the edge labels indicate the costs cᵢⱼ; in the right panel they indicate the flows xᵢⱼ.]
Associated with each node i is a divergence bi , which represents the amount of product
produced or consumed at node i . When bi > 0, node i is a supply node, while if bi < 0, it is a
demand node. The variables xᵢⱼ in the problem represent the amount of commodity moved along the arc (i, j). Associated with each arc (i, j) are a lower bound lᵢⱼ and an upper bound uᵢⱼ on the amount of the commodity that can be moved along that arc. The cost of moving one unit of flow along arc (i, j) is cᵢⱼ. Typically, all the data objects cᵢⱼ, lᵢⱼ, uᵢⱼ, and bᵢ are assumed to be integral. The problem is to minimize the total cost of moving the commodity from the supply
nodes to the demand nodes. Figure 2.5 shows an example problem with one supply node and two
demand nodes.
Using the above notation, we can formulate the minimum-cost network flow problem as follows:

min_x  z = Σ_{(i,j)∈A} cᵢⱼ xᵢⱼ
s.t.  Σ_{j:(i,j)∈A} xᵢⱼ − Σ_{k:(k,i)∈A} xₖᵢ = bᵢ   for all nodes i ∈ N,
      0 ≤ lᵢⱼ ≤ xᵢⱼ ≤ uᵢⱼ                          for all arcs (i, j) ∈ A.
The first constraint states that the net flow through each node should match its divergence. The first summation represents the total flow out of node i, summed over all the arcs that have node i as their origin. The second summation represents the total flow into node i, summed over all the arcs having node i as their destination. The difference between outflow and inflow is constrained to equal the divergence bᵢ.
The problem can be written in matrix form as the following linear program:

min  cᵀx  s.t.  I x = b,  0 ≤ l ≤ x ≤ u.    (2.7)
Here, the node-arc incidence matrix I is an |N| × |A| matrix, the rows being indexed by nodes and the columns being indexed by arcs. Every column of I corresponds to an arc (i, j) ∈ A and contains two nonzero entries: a +1 in row i and a −1 in row j. For the example network of
Figure 2.5, the matrix I is given by

        ⎡  1   1   1   0   0   0   0   0   0 ⎤
        ⎢ -1   0   0  -1   0   0   0   0   0 ⎥
    I = ⎢  0   0   0   1   1  -1   0   0   0 ⎥ ,
        ⎢  0  -1   0   0   0   1   1  -1   0 ⎥
        ⎢  0   0   0   0  -1   0  -1   0  -1 ⎥
        ⎣  0   0  -1   0   0   0   0   1   1 ⎦

where we have taken the arcs in the order

{(1, 2), (1, 4), (1, 6), (3, 2), (3, 5), (4, 3), (4, 5), (6, 4), (6, 5)}.
Network flow problems of this nature are prevalent in the study of communication networks and of on-chip and off-chip networks; multi-commodity versions of this problem are used, for example, for routing of messages. Other applications include vehicle fleet planning, building evacuation planning, karyotyping of chromosomes, and network interdiction. In many examples, the problem does not quite fit the formulation described above; in several such cases there are network transformations, described for example in [7], that massage the problem into the standard formulation.
For efficient implementations, it is crucial not to store or factor the complete matrix I, but rather to use schemes that exploit the special structure of this matrix. A node-arc incidence matrix is an example of a totally unimodular matrix, that is, a matrix for which the determinant of every square submatrix is equal to 0, +1, or −1. When A is totally unimodular and b̃, b, d̃, and d are integer vectors, it can be shown that if the set {x ∈ ℝⁿ | b̃ ≤ Ax ≤ b, d̃ ≤ x ≤ d} is not empty, then all its extreme points are integer vectors (see [139] for further details). Since the
simplex method moves from one extreme point to another, this result guarantees that the network
simplex method will produce solutions x that are integer vectors whenever the data of the problem
is integral.
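Total unimodularity can be verified by brute force on small matrices. The sketch below (pure Python, with an invented 3-node example graph) checks that every square submatrix of a node-arc incidence matrix has determinant 0, +1, or −1:

```python
from itertools import combinations

def det(m):
    """Integer determinant by Laplace expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

# Node-arc incidence matrix of a 3-node graph with arcs (1,2), (2,3), (1,3):
# each column has +1 in its tail's row and -1 in its head's row.
I = [[ 1,  0,  1],
     [-1,  1,  0],
     [ 0, -1, -1]]

for k in range(1, 4):
    for rows in combinations(range(3), k):
        for cols in combinations(range(3), k):
            sub = [[I[r][c] for c in cols] for r in rows]
            assert det(sub) in (-1, 0, 1)
print("every square submatrix has determinant -1, 0, or +1")
```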
We use this observation next to solve a number of problem types that occur frequently in
large numbers of applications. While there are often specialized algorithms that can solve each
of the following problems more efficiently than the way we outline, the formulations below are
critical as component building blocks for the more general mixed integer linear programs (MILP)
that we consider later. That is, if we use formulations of the type outlined in the remainder of this subsection within a more general MILP, then the relaxations that typical commercial-grade solvers use are likely to perform significantly better than if we use a different (non-network) formulation.
The lower bounds lᵢⱼ on the flows should be set to zero, while the upper bounds uᵢⱼ should be ∞.
If we wish to know the shortest path from s to all the other nodes i ∈ N, we define the network flow problem in the same way except that the divergences are different: bₛ = |N| − 1, and bᵢ = −1 for every other node i. Figure 2.6 shows an example shortest-path problem and solution formulated as above. Having obtained the solution, we can recognize the shortest path from node s to a given node i by starting at i and backtracking along edges with positive flow.
Shortest-path problems arise in many applications in telecommunications, on-chip networks, and the transportation industry, whenever a message or a vehicle needs to be moved between
two locations as quickly as possible, and as subproblems in other applications such as project management and DNA sequencing. Specialized algorithms for the solution of this problem exist; the seminal reference is [50].
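A shortest path between two given nodes can be read off a min-cost flow LP directly. This sketch (the three-node graph is my own toy example) pushes one unit from a source to a sink with arc costs as lengths; the arcs carrying positive flow form the shortest path:

```python
from scipy.optimize import linprog

# Arcs: (0,1), (1,2), (0,2) with costs 1, 1, 3.
cost = [1, 1, 3]
# Node-arc incidence: +1 where an arc leaves a node, -1 where it enters.
A_eq = [[ 1,  0,  1],   # node 0 (source)
        [-1,  1,  0],   # node 1
        [ 0, -1, -1]]   # node 2 (sink)
b_eq = [1, 0, -1]       # divergences: push one unit from node 0 to node 2

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
print(res.x, res.fun)  # flow [1, 1, 0] with length 2: path 0 -> 1 -> 2
```

Note the solution is integral even though we solved a plain LP, illustrating the total-unimodularity result above.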
Max-Flow Problem
Given a network and two special nodes s and t , the max-flow problem is to determine the maxi-
mum amount of flow that can be sent from s to t . Of course, for the problem to be meaningful,
some of the arcs in the network must have finite capacities uij .
This problem can be formulated as a minimum-cost network flow problem by adding an arc (t, s) to the network with infinite capacity, zero lower bound, and a cost of −1. The divergences bᵢ at all the nodes are set to zero; the costs cᵢⱼ and lower bounds lᵢⱼ on all the original arcs are also set to zero. The added arc ensures that all the flow that is pushed from s to t (generally along multiple routes) is returned to s again and generates a profit (negative cost) corresponding to the flow on this arc. See Figure 2.7 for an example.
We can also define max-flow problems with multiple sources, in which we wish to find the
maximum amount of flow originating at any of the given sources that can be sent to the specified
destination. To formulate this as a minimum-cost network flow problem, we add a "super-source" as shown in Figure 2.8 and define arcs from the super-source to the original sources with infinite capacities and zero costs. We also add the arc from the sink to the super-source in the manner described above. Max-flow problems with multiple sinks (or multiple sources and multiple sinks) can be formulated similarly; see Figure 2.8.
Max-flow problems also occur in many application settings, including political redistricting, scheduling of jobs on parallel machines, assigning different modules of a program to minimize collective costs of computation and communication, and tanker scheduling.
Assignment Problem
In the assignment problem we have two sets of nodes N1 and N2 of equal size. Given a specified
cost cij for pairing a node i 2 N1 with a node j 2 N2 , we wish to pair off each node in N1 with
a partner in N2 (making a one-one correspondence between the two sets) so as to minimize the
total cost of pairing. We can formulate this problem as a minimum-cost flow problem by defining the divergences as follows: bᵢ = +1 for each node i ∈ N₁, and bⱼ = −1 for each node j ∈ N₂.
Intuitively, the positive divergence for nodes in N₁ forces flow along arcs to nodes in N₂, which have negative divergence. The unit capacities on arcs make the assignment a one-to-one mapping. Figure 2.9 shows an example assignment problem and solution.
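For the assignment problem specifically, SciPy ships a dedicated solver, `linear_sum_assignment`, which exploits this structure directly (the cost matrix below is an arbitrary illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = cost of pairing node i in N1 with node j in N2
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

rows, cols = linear_sum_assignment(cost)
# Optimal pairing: (0,1), (1,0), (2,2) with total cost 1 + 2 + 2 = 5
print(list(zip(rows, cols)), cost[rows, cols].sum())
```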
Assignment problems arise in a variety of problem contexts. Examples include personnel
assignment, medical resident scheduling, locating objects in space, scheduling on parallel ma-
chines, and spatial architecture scheduling. A detailed treatment of assignment problems can be
found in [35]. Network flow problems are used extensively in Internet traffic routing, and to
model compiler flow graphs. Graph partitioning can also be approximated very effectively using
network flow models, or when additional constraints are present, using a reformulation as a mixed
integer program. We reiterate the point that MILP models built using underlying network structure are typically much more amenable to solution using the branch-and-bound or branch-and-cut procedures that are present in most commercial solvers.
Notice that the variables are now separated into two sets, namely those that can take on continuous values (x), and those that can take only discrete (integer) values (y). The MILP acronym comes from the fact that there is a mix of continuous and discrete variables, and that all functional relationships are linear. Simple bound constraints l ≤ (x, y) ≤ u can also be placed on the variables. In practice, it is important to specify good bounds (as tight as possible) on the integer variables, since computational techniques rely heavily on such bounds.
Restricting some or all of the variables in a problem to take on only integer values can make
a general linear program significantly harder to solve (more precisely, while linear programs are
known to be polynomially solvable, mixed integer programs are NP-hard), so in a certain sense the
result about the integrality of solutions to network linear programs is rather remarkable. Further
information on this and other results can be found in the texts by Ahuja et al. [7], Nemhauser
et al. [139], and Schrijver [152]. Unfortunately, the theory of duality also breaks down in this
setting. While there are some duality results couched in the theory of submodular functions, this
has not had anywhere near the impact that convex duality has on the field of convex optimization.
To give some intuition on what makes a MILP problem difficult to solve, we describe two fundamental properties of a MILP formulation: its convex hull and its linear relaxation. The convex hull is defined as the intersection of all convex sets containing the feasible set of the MILP (and hence is the smallest convex set containing the feasible set of the MILP). The linear relaxation of the MILP (2.8) simply relaxes the constraints y ∈ ℤˡ to y ∈ ℝˡ, and hence is the
linear program:

min_{x,y}  cᵀx + dᵀy  s.t.  Ax + Hy ≥ b,  x ∈ ℝⁿ, y ∈ ℝˡ.
As an example, consider a small two-variable problem: minimize x₁ + 3x₂ over integer points (x₁, x₂) ∈ ℤ² subject to a handful of linear constraints (including x₁, x₂ ≥ 0).

[Figure 2.10: The convex hull of the feasible set of a MILP, showing the feasible region, the relaxation solution (obj = 4.5), and the integer optimal solution (obj = 3).]
Figure 2.10 shows both the convex hull and the feasible region for this problem. If the relaxed linear problem is solved, the optimal value obtained will not be integral. This solution is shown in the figure as the "relaxation solution." Often, the relaxation solution is rounded to the nearest integer. In some cases this may be optimal; in other cases it may not be optimal, or even feasible for the original problem. In the given example, the rounded relaxation solution is (1, 1), while the optimal solution is (2, 1). If the convex hull of the feasible region of the MILP is identical to the feasible region of the linear relaxation, the relaxation is termed tight. In this case, solving the relaxed problem as a linear program would be sufficient, and computationally much less complex. Intuitively speaking, the closer or "tighter" the relaxed feasible region is to the convex hull, the easier it is to use the relaxed linear program as a way of helping to find the optimal integer solution, and the easier it is in general to solve the MILP. Solution methods are discussed further in Section 2.4. Though we do not define the term formally, "tightness" is often used to refer to how close the relaxation's feasible region is to the convex hull.
As mentioned in the introduction, the last two decades have seen enormous improvement
in the size and complexity of models that can be solved in realistic time frames in the MILP
setting, and there are commercial and open source solvers that can be easily deployed in application
domains. We will describe a number of problem formats that can be naturally expressed as MILPs
in Section 2.3.
Due to the commonly used solution techniques, MILP is well suited to models for which the linear programming relaxation is tight, or for which the number of discrete choices is relatively small. MILP tends not to be as effective for ordering problems, or for complex, very large-scale models. There are many techniques that can be employed to extend the applicability of MILP to a stochastic setting, or to model both non-convex and nonsmooth phenomena. We mention some of these techniques in Section 2.3.
F(x, p) = 0,

where p represents input values to the model, and x represents the state of the system; "scenarios"
or “simulations” are performed to determine which of these input values leads to acceptable or
good values for the state variables. One simple example of a nonlinear function would be modeling
power as being proportional to the square of the voltage.
A concrete example of a nonlinear optimization problem is that of parameter fitting. In this setting we have a number of observations yᵢ of x, and we wish to find the values of p that best explain these observations. The optimization problem is the following instance of (2.4):
min_{x,p}  Σᵢ₌₁ᵐ ‖yᵢ − x‖₂²  s.t.  F(x, p) = 0.
There is a huge literature in statistics and optimization dealing with such regression or inverse problems (using observations of states to infer parameter values). In the data-poor setting (m small) the problem is underdetermined (i.e., many x's achieve the minimum), and so modelers aim to impose additional structure in the formulation to compensate for the lack of data samples. Typically, this is an effort to allow for prediction, and it trades accuracy against a simple model structure that can have more predictive power (Occam's razor). Compressed
sensing and sparse optimization are burgeoning fields of exploration that have many examples of
models of the form:
min_z  E(z) + αS(z)  s.t.  z ∈ X,

where X is the constraint set, E measures "error," and S penalizes bad structure. In the above example, we could set z = (x, p), X = {(x, p) | F(x, p) = 0}, E(z) = Σᵢ₌₁ᵐ ‖yᵢ − x‖₂², and
2.3. MODELING PROBLEMS AS MILP 31
S(z) = ‖p‖₁, resulting in the sparse optimization version

min_{x,p}  Σᵢ₌₁ᵐ ‖yᵢ − x‖₂² + α‖p‖₁  s.t.  F(x, p) = 0.
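A one-dimensional sketch of the E(z) + αS(z) trade-off (all numbers here are illustrative, and the F constraint is dropped for simplicity): fit a single value x to observations yᵢ, with an absolute-value penalty that shrinks x toward zero as α grows.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.0, 2.5, 3.0])  # observations

def fit(alpha):
    # minimize E(x) + alpha * S(x), with E the sum of squared errors and S = |x|
    obj = lambda x: np.sum((y - x) ** 2) + alpha * abs(x)
    return minimize_scalar(obj, bounds=(-10, 10), method="bounded").x

print(fit(0.0))    # ~2.5: the plain least-squares fit (the mean of y)
print(fit(100.0))  # ~0.0: a heavy penalty shrinks the fit toward zero
```

The penalty weight α is exactly the knob that trades fidelity to the data against structural simplicity.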
It should be noted that across disciplines, there are distinct terminology issues. For example, active
learning is another term for optimal experimental design, and reinforcement learning is some-
times called approximate dynamic programming. However, it is widely acknowledged that incorporating domain knowledge into models (i.e., specifying F well) is critically important. Many opportunities remain to exploit theory and structure to generate more effective algorithms, improve generalizability, and better understand learning behavior.
There is a large amount of recent work on the generalization of nonlinear programming problems (2.4) to include integer variables (a subset of the x variables are constrained to take integer values), so-called mixed integer nonlinear programs (MINLP). At present, these codes are less robust and unable to process models of the same size as MILP solvers can. A key difficulty is that the "relaxed" problems remain difficult to solve to global optimality. The most widely used MINLP solvers currently appear to be BARON [162, 163] and LindoGlobal [118, 151]; BONMIN [29] and DICOPT [52] are also widely used, but assume convexity of the underlying nonlinear program in order to guarantee global solutions.
δ = 1 → x > 0

requires an additional fix: we choose an ε > 0 and replace x > 0 with x ≥ ε for suitably chosen (small) ε. In this case, provided x ≥ 0, the implication δ = 1 → x ≥ ε is equivalent to

x ≥ εδ.
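This encoding can be checked by brute force (the value of ε below is arbitrary): for δ ∈ {0, 1} and a grid of x ≥ 0, the constraint x ≥ εδ admits exactly the pairs where the implication δ = 1 → x ≥ ε holds.

```python
eps = 0.01

def encoding_allows(x, delta):
    return x >= eps * delta            # the MILP constraint x >= eps * delta

def implication_holds(x, delta):
    return (delta == 0) or (x >= eps)  # the logic: delta = 1  ->  x >= eps

for delta in (0, 1):
    for x in [0.0, 0.001, 0.005, 0.01, 0.02, 1.0]:
        assert encoding_allows(x, delta) == implication_holds(x, delta)
print("x >= eps*delta matches 'delta = 1 -> x >= eps' on all tested points")
```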
The following theorem captures concisely some possible generalizations of this idea. Each of the statements first has an upper or lower bound expression that must be satisfied at any feasible solution (i.e., m or M must be chosen so that the given expression is automatically satisfied). The
remainder of the statement then gives the logical expression on the left, followed by the constraint
that must be added to any MILP (when g is linear) to model that logical expression. Note that
the case (1b) is a generalization of the fixed-charge case outlined above (g(x) = x).
The parts of this theorem can be combined to model implications of the form f(x) > 0 → g(x) ≤ 0 using

(f(x) > 0 → δ = 1)  and  (δ = 1 → g(x) ≤ 0).

Similarly, g(x) = 0 ↔ δ = 1 can be modeled, when g is linear, within a MILP using:

g(x) ≤ (M + ε)δ₁ − ε
g(x) ≥ (m − ε)δ₂ + ε
δ₁ + δ₂ ≤ 1 + δ
As we will see in the case studies, constraints of this form are extremely useful in expressing natural
phenomena and system requirements.
Solvers often introduce additional variable types to capture some of these notions; for exam-
ple, semicontinuous variables are variables that are either 0 or lie between given positive bounds.
Another type of variable that is common among many solvers is an SOS1 variable (specially ordered set of type 1). This is a collection of variables defined over an (ordered) set K, at most one of which can take a strictly positive value, all others remaining 0. Branching strategies (see Section 2.4.1) can exploit the ordering in K, for example when it indicates a choice between small, medium, large, and super-sized items. This can, in some cases, lead to improved solution times if such semantics are required.
In this case, δᵢ is often termed an indicator variable, since it indicates whether the proposition is true or false. Indicator variables are also used in statistics to indicate whether a variable takes on a particular value (or set of values) or not.
We will use standard notation from Boolean algebra to denote connectives between propositions. Thus,

∨ means "or"
∧ means "and"
¬ means "not"
→ means "implies"
↔ means "if and only if"
⊻ means "exclusive or"
Other connectives such as “nor” or “nand” are also used in the literature.
The proposition Pᵢ could stand for "we will use register i," and Q could represent "perform compiler level 3 optimization," so that Q → Pᵢ encodes the logical constraint that if we use a level-3 optimizing compiler then we must use register i. A key part of modeling that we demonstrate in examples later in this book is determining which variables capture the underlying logic of our problem, and finding the connectives they must satisfy in order for the design to capture the underlying required properties.
Table 2.3 details standard ways to equivalently express propositional logic in terms of constraints on the corresponding indicator variables in a MILP (see [134] for example). The examples shown in the table are useful in building models, since they typically construct a tight approximation of the logic, even when the solution algorithm used to solve the MILP relaxes some of the variables to be continuous (i.e., to lie in [0, 1] instead of in {0, 1}).
Other operators like "before," "last," "notEqual," or "allDifferent" are often allowed in constraint logic programming (CLP) languages; there is a growing literature on how to reformulate some of these within a MILP code, and many specialized codes treat these constraints explicitly. Merging the two techniques (MILP and CLP) is an active area of research. The techniques used in CLP are essentially clever ways to do complete enumeration very efficiently.
2.3.3 ORDERING
Ordering, in this context, refers to the arrangement of events in a certain domain, subject to some
constraints. An example from compilers would be the ordering of instructions inside a software
pipelined loop for maximum throughput.
The constraint logic programming constructs are particularly useful in ordering problems, since they can easily encode the notion of an ordering. We outline here two of the main ideas applicable when a pure MILP approach is used. These are only practical on medium-sized problems at this time, although much research is currently underway to increase the size of problems that are practically tractable.
One formulation of ordering uses binary variables rank with the definition:
Statement            | Constraint
¬P₁                  | δ₁ = 0
P₁ ∨ P₂              | δ₁ + δ₂ ≥ 1
P₁ ⊻ P₂              | δ₁ + δ₂ = 1
P₁ ∧ P₂              | δ₁ = 1, δ₂ = 1
¬(P₁ ∨ P₂)           | δ₁ = 0, δ₂ = 0
P₁ → P₂              | δ₁ ≤ δ₂  [equivalent to: (¬P₁) ∨ P₂]
P₁ → (¬P₂)           | δ₁ + δ₂ ≤ 1  [equivalent to: ¬(P₁ ∧ P₂)]
P₁ ↔ P₂              | δ₁ = δ₂
P₁ → (P₂ ∧ P₃)       | δ₁ ≤ δ₂, δ₁ ≤ δ₃
P₁ → (P₂ ∨ P₃)       | δ₁ ≤ δ₂ + δ₃
(P₁ ∧ P₂) → P₃       | δ₁ + δ₂ ≤ 1 + δ₃
(P₁ ∨ P₂) → P₃       | δ₁ ≤ δ₃, δ₂ ≤ δ₃
P₁ ∧ (P₂ ∨ P₃)       | δ₁ = 1, δ₂ + δ₃ ≥ 1
P₁ ∨ (P₂ ∧ P₃)       | δ₁ + δ₂ ≥ 1, δ₁ + δ₃ ≥ 1
More general forms of some of the above are also stated below:

P₁ ∨ P₂ ∨ ⋯ ∨ Pₙ                      |  Σᵢ₌₁ⁿ δᵢ ≥ 1
(P₁ ∧ ⋯ ∧ Pₖ) → (Pₖ₊₁ ∨ ⋯ ∨ Pₙ)       |  Σᵢ₌₁ᵏ (1 − δᵢ) + Σᵢ₌ₖ₊₁ⁿ δᵢ ≥ 1
at least k out of n are true          |  Σᵢ₌₁ⁿ δᵢ ≥ k
exactly k out of n are true           |  Σᵢ₌₁ⁿ δᵢ = k
at most k out of n are true           |  Σᵢ₌₁ⁿ δᵢ ≤ k
Pₙ ↔ (P₁ ∨ ⋯ ∨ Pₖ)                    |  Σᵢ₌₁ᵏ δᵢ ≥ δₙ;  δₙ ≥ δⱼ, j = 1, …, k
Pₙ ↔ (P₁ ∧ ⋯ ∧ Pₖ)                    |  δₙ + k − 1 ≥ Σᵢ₌₁ᵏ δᵢ;  δⱼ ≥ δₙ, j = 1, …, k
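The correspondence between propositions and indicator constraints can be verified exhaustively. The sketch below checks a few rows of the table over all 0/1 assignments of three indicators:

```python
from itertools import product

# Each entry pairs a proposition over booleans with its indicator constraint.
checks = [
    (lambda p1, p2, p3: (not p1) or p2,            # P1 -> P2
     lambda d1, d2, d3: d1 <= d2),
    (lambda p1, p2, p3: (not p1) or (p2 or p3),    # P1 -> (P2 v P3)
     lambda d1, d2, d3: d1 <= d2 + d3),
    (lambda p1, p2, p3: (not (p1 and p2)) or p3,   # (P1 ^ P2) -> P3
     lambda d1, d2, d3: d1 + d2 <= 1 + d3),
    (lambda p1, p2, p3: p1 != p2,                  # P1 xor P2
     lambda d1, d2, d3: d1 + d2 == 1),
]

for prop, constraint in checks:
    for d in product((0, 1), repeat=3):
        truth = prop(*(bool(v) for v in d))
        assert truth == constraint(*d)
print("the indicator constraints match their truth tables")
```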
It is then fairly straightforward to generate expressions for entities such as the start time of the
item in position k , for example, and thereby generate expressions for waiting time and other
objectives.
A second type of formulation involves ordering directly. The model implements the condition that either item j finishes before item i starts, or the converse. Such either-or constraints are termed disjunctions. To represent the start and stop times of a particular task, we use the variables startᵢ and endᵢ. We introduce additional "violation" variables that measure how much the pair (i, j) violates the condition that j finishes before i starts (and the converse). We then add a condition that only one of the two variables violationᵢⱼ and violationⱼᵢ can be positive; that is, we force each such pair of variables to be in an SOS1 set of size 2 of the form {violationᵢⱼ, violationⱼᵢ}. Some solvers (e.g., CPLEX) implement the notion of indicator constraints, which provide an alternative way to formulate disjunctions.
Network flow problems can also be used for ordering problems, often with great success. In
many cases, the graph formulation allows the ordering constraints to be modeled in a more com-
putationally effective manner. Examples tend to be very domain specific; they often enumerate
many different “ordered paths” and use a flow formulation to select among those paths.
Models of this nature appear throughout the literature, and there are a number of excellent solvers that are effective on even large-scale instances of them. Most of these solvers find local solutions, however, and the extension to non-convex models is known to be NP-hard even when all the functions are univariate (depending on only a single variable) [104].
Piecewise-linear functions are extensively used to approximate the original functions in the non-convex setting. There are a large number of applications of this idea, along with specialized algorithms and reformulations as MILP. We refer the reader to [172] for more extensive references.
We explain this approximation first in the case of a function of a single variable x . We
restrict attention to the non-convex case since the convex case can be solved using the approach
outlined in Section 2.2.2 for minimax problems.
The piecewise-linear function is described by a collection of segments S. In the case where the domain of the function is an unbounded set, or the function is not continuous, the segment approach has proven effective. Each segment i has a coordinate point (xᵢ, fᵢ), a (potentially infinite) length lᵢ, and a slope gᵢ, the rate of increase or decrease of the function from (xᵢ, fᵢ). The sign of lᵢ determines whether the segment extends to the left (negative length) or to the right (positive length) of the (xᵢ, fᵢ) point. These segment definitions allow more than pure piecewise-linear functions. Segments can overlap, meaning we can have multi-valued functions, and there can be holes in the x coordinate space. There is also no ordering requirement on the segment xᵢ coordinates.

[Figure 2.11: An example piecewise-linear, multi-valued function, together with a table of the parameters (xᵢ, fᵢ, lᵢ, gᵢ) defining each of its four segments; segment 1 has infinite length extending left, and segment 4 has infinite length extending right.]
Each segment has two variables associated with it. The first is a binary variable bᵢ that chooses the segment to be used. So that we have a single value for the function at x, only one segment can be active, which is modeled using:

Σᵢ∈S bᵢ = 1.
The other segment variable is a nonnegative variable λᵢ whose upper bound is the absolute value of the length of the segment: λᵢ ≤ |lᵢ|. This variable measures how far we move into segment i from the starting point (xᵢ, fᵢ). A particular choice of the vectors b and λ formed from these components determines a point of evaluation x ∈ ℝ and the value of the approximation f at x by the following formulae (sgn(lᵢ) denotes the "sign" of the parameter lᵢ):

x = Σᵢ∈S (xᵢbᵢ + sgn(lᵢ)λᵢ),    f = Σᵢ∈S (fᵢbᵢ + sgn(lᵢ)gᵢλᵢ).
For each segment that has finite length |lᵢ| < ∞, we enforce the constraint that λᵢ can only be positive if bᵢ = 1, using the big-M constraint:

λᵢ ≤ |lᵢ|bᵢ.
If the piecewise-linear function contains segments of infinite length, this constraint does not work. Instead, for these segments, we form an SOS1 set containing the variables λᵢ and 1 − bᵢ; that is, at most one of these two variables is positive. This has the same effect as the big-M constraint, but is independent of the length of the segment and hence also works with infinite lengths.
Figure 2.11 gives an example piecewise-linear, multi-valued function, and shows the associated parameters which define the lines. Note that the function values between x = 2 and x = 3 are not determined (the function is multi-valued there): it can take on either the value 1 or the value 5 − x. The optimization procedure will choose the preferable value.
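The (x, f) recovery formulae can be exercised directly. A minimal sketch (the segment data are invented, not Figure 2.11's): pick one active segment via b and a distance λ into it, then apply the two summations.

```python
# Segments: (x_i, f_i, l_i, g_i). A negative length would extend left of (x_i, f_i).
segments = [(2.0, 1.0, 4.0, 0.0),    # flat segment from (2, 1), length 4, rightward
            (6.0, 1.0, 2.0, 2.0)]    # slope-2 segment from (6, 1), length 2, rightward

def sgn(l):
    return -1.0 if l < 0 else 1.0

def evaluate(b, lam):
    # b: one-hot segment choice (sum(b) == 1); lam[i] in [0, |l_i|], positive only
    # when b[i] == 1 (the big-M / SOS1 conditions described in the text).
    x = f = 0.0
    for (xi, fi, li, gi), bi, di in zip(segments, b, lam):
        x += xi * bi + sgn(li) * di
        f += fi * bi + sgn(li) * gi * di
    return x, f

print(evaluate([1, 0], [3.0, 0.0]))  # (5.0, 1.0): 3 units into the flat segment
print(evaluate([0, 1], [0.0, 1.5]))  # (7.5, 4.0): 1.5 units into the slope-2 segment
```

In a full MILP model b and λ would be decision variables; here we simply fix them to show how the summations recover a point on the chosen segment.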
In the non-convex setting there are other popular MILP formulations for piecewise-linear functions, namely incremental cost [103, 125] and convex combination [44]. Historically, [18]
suggested a formulation for piecewise-linear functions similar to convex combination, except that
no binary variables are included in the model and the nonlinearities are enforced algorithmically,
directly in the branch-and-bound algorithm, by branching on sets of variables, which they called
special ordered sets of type 2 (SOS2). It is also possible to formulate piecewise-linear functions
similar to incremental cost but without binary variables and enforcing the nonlinearities directly
in the branch-and-bound algorithm. Two advantages of eliminating binary variables are the sub-
stantial reduction in the size of the model and the use of the problem structure [63]. Since these
variables may reduce the preprocessing opportunities, it may be that the reduction in size of the
problem does not lead to overall solution speedup.
We now outline some of the ways the single variable case can be extended to a more general
setting.
Separable programming
A function $f$ is called separable if it can be expressed as the sum of functions $f_j$ of a single variable $x_j \in \mathbb{R}$:

$$f(x) = \sum_j f_j(x_j).$$

The nonlinear optimization problem (2.4) is separable if $f$ and the $h_i$ are all separable functions.
If the problem is not separable, there are a number of tricks that can be used to substitute out non-separable terms and convert the model into a separable one (see [178]). For example, we can deal with terms like $x_i x_j$ by using the fact that

$$x_i x_j = \tfrac{1}{4}(x_i + x_j)^2 - \tfrac{1}{4}(x_i - x_j)^2,$$

or terms like

$$\prod_{i=1}^{m} x_i$$

(with $x_i > 0$) can be replaced with $y$ where

$$\ln y = \sum_{i=1}^{m} \ln x_i.$$

Note that linear functions are separable, so functions like $f\bigl(\sum_j a_j x_j\bigr)$ can be reformulated in a separable manner using $f(y)$ where $y = \sum_j a_j x_j$.
For a specific example, consider the problem:

$$\min_{x_1,x_2} \; x_1 x_2 - \log(x_1 + 2x_2) \quad \text{s.t.} \quad x_1^2 + x_2^2 \le 1.$$

Using the product-form reformulation and introducing additional constraints and variables, the problem can be made separable as follows:

$$\min_{x_1,x_2,w,y,z} \; \tfrac{1}{4}y^2 - \tfrac{1}{4}z^2 - \log(w) \quad \text{s.t.} \quad x_1^2 + x_2^2 \le 1,\;\; y = x_1 + x_2,\;\; z = x_1 - x_2,\;\; w = x_1 + 2x_2.$$
However, in general this may lead to a large growth in the number of variables and constraints in the resulting model. Once a problem has been converted into separable form, the separable programming technique basically replaces all separable functions, in objectives and constraints, by piecewise-linear functions.
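The product-form substitution above can be sanity-checked numerically; this small script (ours, not the book's) confirms that the separable objective agrees with the original on random feasible points:

```python
import math
import random

def original(x1, x2):
    return x1 * x2 - math.log(x1 + 2 * x2)

def separable(x1, x2):
    # substitution variables from the reformulation
    y, z, w = x1 + x2, x1 - x2, x1 + 2 * x2
    return 0.25 * y ** 2 - 0.25 * z ** 2 - math.log(w)

random.seed(0)
samples = [(random.uniform(0.1, 0.7), random.uniform(0.1, 0.7))
           for _ in range(1000)]
assert all(abs(original(a, b) - separable(a, b)) < 1e-9 for a, b in samples)
```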
The epigraph of a piecewise-linear function is easily seen to be the union of polyhedra. It is possible to approximate a non-separable function by a general function of this form. The paper [172] gives an excellent treatment of how mixed integer models can approximate the general non-separable case, with pointers to which formulation is best in what setting.
Crucially, both the continuous and the discrete variables are bounded. A very effective reformulation of this problem carries out a binary expansion of the variable $y_j$ as $y_j = \sum_{i=1}^{k} 2^{i-1} z_i$ with $z_i$ binary (and suitably chosen $k$), and then replaces the product terms $x_l z_j$ by a new variable $w_{lj}$, which is also expanded using additional variables $v$. It can be shown that $(x_l, y_j, w_{lj}, z_j, v_{lj}) \in B_{lj}$, that $B_{lj}$ is a polyhedral set with some integer restrictions (on $y_j$ and $z_j$), and that solving

$$\min_{x,y,w,z,v} \; \sum_l \sum_j (Q_0)_{lj} w_{lj} + c_0^T x + d_0^T y$$
$$\text{s.t.} \quad Ax + Hy \le b_0$$
$$\sum_l \sum_j (Q_t)_{lj} w_{lj} + c_t^T x + d_t^T y \le b_t, \quad t = 1, \ldots, m$$
$$(x_l, y_j, w_{lj}, z_j, v_{lj}) \in B_{lj}$$

gives a solution to the original problem. This problem is a MILP. An excellent reference for this material and its extensions is [86]. In particular, the paper shows the above reformulation to be very effective under modern MILP solvers, and that new cuts for $B$ can be derived to make the solution procedure even more efficient. There is no restriction on $Q$ being positive semidefinite in this setting.
An implementation of many different reformulations for problems of this form is provided
by the GloMIQO solver [131] that is available within the GAMS modeling system.
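To make the two ingredients concrete, the following sketch (our own, with a generic bound `U` as an assumed parameter) demonstrates the binary expansion $y = \sum_i 2^{i-1} z_i$ and the standard linear inequalities that force $w = xz$ for binary $z$ and $0 \le x \le U$:

```python
def binary_expand(y, k):
    """Bits z_1..z_k with y == sum_i 2**(i-1) * z_i."""
    return [(y >> (i - 1)) & 1 for i in range(1, k + 1)]

def product_feasible(x, z, w, U):
    """True iff w satisfies the linear constraints that force w == x*z
    when z is binary and 0 <= x <= U."""
    return w <= U * z and w <= x and w >= x - U * (1 - z) and w >= 0

z = binary_expand(11, 4)                     # 11 = 1 + 2 + 8 -> [1, 1, 0, 1]
assert sum(2 ** (i - 1) * zi for i, zi in enumerate(z, start=1)) == 11

U = 10.0
for x in (0.0, 3.5, 10.0):
    for zb in (0, 1):
        assert product_feasible(x, zb, x * zb, U)            # exact product OK
        assert not product_feasible(x, zb, x * zb + 1.0, U)  # anything larger cut off
```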
2.4.1 BRANCH-AND-BOUND
e standard method for solving MILP is branch-and-bound, which uses the linear programming
methodology outlined above to solve many subproblems. In this setting, (bounded) discrete vari-
ables are replaced by continuous variables, and the resulting (so called root-relaxation) problem
is solved using a linear programming code. e optimal value of this relaxation provides a lower
bound on the optimal value of the MILP. If at a solution, all the continuous replacement variables
take on valid discrete values, then the original problem is solved. Otherwise, one of the integer
variables xi has the value xN i which is not integral and is branched on: that is, two (MILP) sub-
problems are generated, one of which adds the constraint that xi floor.xN i / and the other adds
the constraint xi ceil.xN i /. e variable xi is called the branching variable. If we can compute
the optimal solutions for both of these subproblems then the better of them will be the solution
of the original MILP. In this way a search tree of nodes is constructed, each node consisting of an
MILP that is the original MILP augmented with a collection of bound restrictions on variables.
We can repeat this process: at any stage of the algorithm, we have a search tree, each of whose
leaves are MILP subproblems (augmented with additional linear constraints) that are not solved.
If we can solve or discard all of these leaves, then the original MILP is solved by one of the leaf
solutions.
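The loop just described can be sketched for a small 0-1 knapsack (here in maximization form; the chapter's example minimizes the negated objective). The node relaxation is the fractional knapsack, solved greedily, and nodes are fathomed exactly as in the text. This is our own minimal illustration, not the book's code:

```python
import math

def knapsack_bb(values, weights, cap):
    """Branch-and-bound for the 0/1 knapsack (maximization).
    The node relaxation is the fractional (LP) knapsack, solved greedily."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)
    best = [0.0]                                # incumbent objective value

    def relax(fixed):
        # Greedy fractional bound respecting fixed 0/1 decisions.
        # Returns (bound, index of the fractional item or None).
        cap_left = cap - sum(weights[i] for i in fixed if fixed[i] == 1)
        if cap_left < 0:
            return (-math.inf, None)            # infeasible node: fathom
        val = sum(values[i] for i in fixed if fixed[i] == 1)
        for i in order:
            if i in fixed:
                continue
            if weights[i] <= cap_left:
                cap_left -= weights[i]; val += values[i]
            else:
                return (val + values[i] * cap_left / weights[i], i)
        return (val, None)

    def branch(fixed):
        bound, frac = relax(fixed)
        if bound <= best[0]:
            return                              # fathomed: cannot beat incumbent
        if frac is None:
            best[0] = bound                     # integral relaxation: new incumbent
            return
        for v in (1, 0):                        # branch on the fractional variable
            branch({**fixed, frac: v})

    branch({})
    return best[0]
```

For example, `knapsack_bb([10, 13, 9, 7], [5, 8, 4, 3], 10)` explores the tree, fathoming infeasible and dominated nodes, and returns the optimum of 19 (items with values 10 and 9).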
Consider the following example problem, a variant of the 0-1 knapsack problem.
The branch-and-bound procedure for this problem can be seen in Figure 2.12. We start with no incumbent solution, which is defined as the current best solution that meets the integrality constraints. Therefore, the upper bound (best solution) is $\infty$. In Step 1, we solve the root relaxed linear program, and achieve a fractional solution ($x_2 = 1/4$). The objective for this problem, $-13.5$, serves as a lower bound for any attainable solution. To continue, we create two new problems by forcing $x_2$ to either 0 or 1. Both of these problems again have fractional solutions.
When multiple subproblems are available for branching, a node selection strategy is followed; one method is to choose the subproblem with the smaller node-relaxation objective. Therefore, in Step 2, we create two new problems from the subproblem with objective $-13.25$, by fixing the value of $x_3$.
We now describe the process of fathoming (discarding nodes). Suppose an unsolved leaf MILP subproblem is selected for solution. We relax the integrality constraints on the subproblem and solve the node relaxation. If the node relaxation solution is feasible for the original MILP problem, then we update the incumbent solution to be the better of this solution and the existing incumbent, and we can fathom this node. In Step 2, with $x_3 = 0$ we achieve an integer solution with objective value $-12$. This solution becomes our incumbent.
The second option is that the node relaxation is solved (it cannot be unbounded, since otherwise its ancestor node would not have been solvable), and we can fathom this node if its optimal value is larger than the incumbent value (since branching on this node will not beat the incumbent). For example, in Step 2, the subproblem with $x_3 = 1$ has objective value $-12.2$. However, since all of the variables and parameters in the objective are integral, a sophisticated solver will realize that this node can also be fathomed. This is because the best objective value that this subproblem can give is $-12$ when rounded up to the nearest integer, and our incumbent integer solution already has objective $-12$.
The third option is that the node relaxation is infeasible (we are, after all, adding constraints to it), so again that node is fathomed. (This occurs in our example in Step 3, when we consider the node generated by setting the variable $x_1 = 1$.) Otherwise, we branch on this node to create two new nodes.
Even though an integer solution has been found, we have not proven optimality until no subproblem has an objective value lower than the current incumbent. So in Step 3, we return to the previous leaf (where $x_2 = 1$), and continue by fixing the value of $x_1$. When $x_1$ is fixed to 0, the objective is $-12.5$, and the node can therefore be fathomed by the same rounding argument as before. When $x_1$ is fixed to 1, the problem becomes infeasible, as mentioned above. Since all subproblems are now either integral, fathomed, or infeasible, the procedure is complete, and we have found the optimal value.
In general, this process may take a very long time to complete. The remaining issue is to explain how we stop the process in practice: this is accomplished by determining tight upper and lower bounds on the optimal solution value of the MILP. Note that the value of the incumbent solution is an upper bound on the optimal objective of the MILP (we are minimizing, after all). A lower bound can also be found: it is the minimum of the objective values of the node relaxations over all the leaf nodes of the search tree. This value can be updated as we proceed in the algorithm. The difference between the upper and lower bound is typically called the gap, and we often curtail the algorithm when this gap gets sufficiently small. Thus, in our example, if we were only looking for a solution within 15% of optimal, we could have stopped when we found the first integer solution. The ability to control the degree of optimality of a solution is one of the primary benefits of MILP, and of mathematical optimization in general.
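Using one common definition of the relative gap (solver conventions vary, and the example's bound values are our reading of the figure), the first integer solution already falls within a 15% tolerance:

```python
def relative_gap(upper, lower):
    # one common convention: |upper - lower| / |lower|
    return abs(upper - lower) / abs(lower)

# incumbent (upper bound) -12, best leaf relaxation (lower bound) -13.5
g = relative_gap(-12.0, -13.5)
assert g < 0.15          # roughly 11%, so a 15% tolerance stops here
```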
[Figure: feasible region in $(x_1, x_2)$ with cutting planes Cut 1 and Cut 2 (the constraints $x_1 \le 2$ and $x_2 \le 1$); labeled objective values obj = 3 (original feasible region) and obj = 4.5 (optimal solution).]
$$z = \min_x \; c^T x \quad \text{s.t.} \quad Ax \le b, \;\; x \in Y$$

where the set $Y$ involves a collection of integrality constraints along with linear restrictions. The constraints $Ax \le b$ are often referred to as complicating (bad) constraints: if we relax them, the problem

$$\min_x \; c^T x \quad \text{s.t.} \quad x \in Y$$

is assumed to be relatively easy to solve ($Y$ encodes the good constraints), even though it is a MILP. We exploit this fact by constructing the Lagrangian dual:

$$L(u) := \min_x \; c^T x + u^T(Ax - b) \quad \text{s.t.} \quad x \in Y$$

(so that for fixed $u$ we can evaluate $L(u)$ easily), and note that for any $u \ge 0$, $L(u)$ forms a lower bound on $z$ (the optimal value of the MILP), since for any feasible $x$, $u^T(Ax - b) \le 0$. Lagrangian relaxation methods aim to find strong lower bounds by solving

$$\max_{u \ge 0} \; L(u).$$

Similar relaxation schemes can be used for convex optimization, including those that are based on semidefinite programming.
A specific example is the generalized assignment problem:

$$\min_x \; \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{n} a_{ij} x_{ij} \le b_i, \;\; i = 1, \ldots, m, \quad x \in Y$$

where

$$Y = \Bigl\{ x = (x_{ij}) \;\Big|\; \sum_{i=1}^{m} x_{ij} = 1, \; \forall j; \;\; x_{ij} \in \{0, 1\}, \; \forall i, j \Bigr\}.$$

In this case, the Lagrangian relaxation problem (for fixed $u \ge 0$) is

$$\min_{x \in Y} \; \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij} + \sum_{i=1}^{m} u_i \Bigl( \sum_{j=1}^{n} a_{ij} x_{ij} - b_i \Bigr)
\;=\; \min_{x \in Y} \; \sum_{i=1}^{m} \sum_{j=1}^{n} (c_{ij} + u_i a_{ij}) x_{ij} - \sum_{i=1}^{m} u_i b_i.$$
These subproblems are solved in time proportional to $nm$ by determining $\min_i (c_{ij} + u_i a_{ij})$ for each $j$, setting the associated $x_{ij} = 1$, and setting all the remaining $x_{ij}$ to zero. (Subgradient optimization approaches can then be used to update the vector $u$.)
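The subproblem solve just described decomposes by job $j$; this sketch (ours, not the book's) returns $L(u)$ and the minimizing assignment:

```python
def gap_lagrangian(c, a, b, u):
    """Solve the Lagrangian subproblem of the generalized assignment
    problem for fixed multipliers u >= 0: assign each job j to the
    machine i minimizing c[i][j] + u[i]*a[i][j]."""
    m, n = len(c), len(c[0])
    x = [[0] * n for _ in range(m)]
    val = -sum(u[i] * b[i] for i in range(m))   # constant term -sum_i u_i b_i
    for j in range(n):
        i_best = min(range(m), key=lambda i: c[i][j] + u[i] * a[i][j])
        x[i_best][j] = 1
        val += c[i_best][j] + u[i_best] * a[i_best][j]
    return val, x        # val = L(u), a lower bound on the GAP optimum
```

The loop touches each of the $nm$ entries once, matching the stated running time.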
Here $X$ represents standard linear constraints on the continuous variables, whereas $Y$ incorporates all the integrality constraints on $y$ as well. Implicitly, for this approach to be successful, it is important that problems with constraint sets $X$ or $Y$ can be solved efficiently, and the constraints $Ax + Hy \ge b$ are complicating restrictions that limit our ability to optimize easily over $x$ and $y$. The above problem is equivalent to

$$\min_{y \in Y} \; \Bigl[ d^T y + \min_{x \in X} \{ c^T x \;\; \text{s.t.} \;\; Ax \ge b - Hy \} \Bigr]. \tag{2.9}$$
For fixed $y$, the inner minimization is a linear program, and hence by linear programming (weak) duality,

$$\min_{x \in X} \{ c^T x \;\; \text{s.t.} \;\; Ax \ge b - Hy \} \;\ge\; (b - Hy)^T u$$

for any feasible point $u$ of the corresponding dual program.
Benders decomposition solves (2.9) by iteratively generating lower bounds (cuts) formed from feasible points of the dual. More precisely, each iteration consists of solving (2.10) for a fixed $y = \bar{y} \in Y$, or determining an unbounded ray. Then we solve

$$\min_{z \in \mathbb{R},\, y \in Y} \; z$$
$$\text{s.t.} \quad z \ge d^T y + (b - Hy)^T \bar{u}^k, \quad k = 1, \ldots, K; \tag{2.11}$$
$$0 \ge (b - Hy)^T \bar{u}^l, \quad l = 1, \ldots, L,$$

where $k$ indexes previously found optimal solutions of (2.10) and $l$ indexes previously found unbounded rays of (2.10). The solution $y = \bar{y}$ of this problem is then fed back into (2.10). Note that (2.10) is a linear program, and this method works well when the MILP (2.11) is easier to solve than the original MILP (specifically, note that it does not involve the matrix $A$).
2.5 CONCLUSION
Optimization can facilitate prediction, improve operation, and help with strategic behavior and design. Models can be combined; their utility stems from engaging groups in a decision, actually making complex decisions, and operating or controlling a system of interacting parts. Putting together some or all of the constructs that we outlined above requires skill, understanding, and many iterations. Modeling and optimization are best done as part of an iterative process that engages the designer, facilitates the collection of appropriate data, and allows domain-specific design tools to be incorporated into a general (optimization) framework. Determining the appropriate model takes some effort: is it linear or nonlinear, deterministic or probabilistic, discrete or continuous, best modeled using smooth or nonsmooth functions, static or dynamic? Modeling systems allow one to move between these formulations quickly, provide some data and model verification, and provide constructs to broaden the modeling classes and tricks that are practically usable.
The following four chapters provide case studies in using mathematical optimization techniques to model the design or operation of various systems. These case studies demonstrate the usefulness of the modeling techniques described in this chapter, the broad range of applicable problems, and the practical usefulness of MILP formulations and solvers. The avid reader may wish to skip to the final chapter, the conclusions, where we provide insight and experience into deciding which types of problems are suitable for MILP, as well as tuning strategies for improving performance and tips on how to formulate good models.
CHAPTER 3
3.2 OVERVIEW
Our high-level strategy is to apply integer linear programming to find opportunistic regions of code to accelerate, but to target only single basic blocks at a time. This model would be used as part of a design flow, as depicted in Figure 3.1. First, a compiler "frontend" outputs the program's basic blocks to the template generator. The template generator uses the MILP model we develop to analyze these basic blocks and discover templates which are good candidates for instruction extensions. During this phase, a separate graph-isomorphism pass eliminates redundant templates. These templates are synthesized into hardware to determine their area and latency. The template selection phase uses this information, and performs an analysis considering the area, coverage, and speedup trade-offs to determine the set of chosen instructions. Finally, the compiler "backend" generates meta-data regarding the instruction formats used to re-compile the code for the newly extended instruction set.
Figure 3.1: Design flow for custom instruction generation and compilation.
System Architecture The baseline architecture we target, depicted in Figure 3.2, is that of an in-order processor with some number of custom instruction hardware components for performing specialized instructions. The interface to the hardware is the bus, which can deliver or receive a certain number of inputs per cycle. As our target does not support memory inputs, we only provide an interface with the register file. It should further be noted that all of the inputs for a custom instruction must arrive before the computation begins, meaning that no intermediate output values may be used in the processor before being returned to the custom instruction hardware.
Example Template To elucidate the fundamental issues of the problem, we show some example templates in Figure 3.3. Here, circles represent instructions to be computed, and edges represent data dependences. Subfigures (a) and (b) show two different possible instruction templates in shaded colors, each containing different operations and differing numbers of I/Os. These choices affect the potential speedup, because certain operations can be accelerated more profitably in hardware, and the subgraph chosen affects the amount of time required to transfer inputs and outputs from the register file. Note that both are "convex¹" in the sense that no edge leaves and returns to the subgraph. This property is required for correctness, as the generated template must be serializable with the original instruction stream to be usable.
Problem Statement
Chapter Organization We describe the modeling of the integer linear program by first describing the system abstractly, then writing logical constraints, and, where required, linearizing these constraints to match integer linear programming theory. The goal of the next three sections is to describe, from a modeler's perspective, how to formulate the model in terms of the fundamental decision variables, how to formulate the constraints by linearizing logical constraints, and finally how to reason about and write the objective. We then describe the limitations of our model, the related work, and an evaluation with a modified compiler and real workloads.
Figure 3.3: Example instruction templates. (a) and (b) show two ways to partition the basic block’s
instructions, each including different numbers of I/Os.
Decision Variables We must first represent the DAG of instructions in our basic block. For this, we define the set $V$, which represents the constituent instructions, along with instructions outside the basic block which produce values used inside. The connections, or data dependences, between instructions in $V$ can be described with $A_{ij}$, the appropriate adjacency matrix. Now the choice of how to represent the subgraph becomes obvious: we associate with each vertex $i$ a variable $x_i$, which describes whether the vertex is contained in the template. This is our decision variable, as it is the output of the model we are trying to determine. In fact, all other quantities for this model can be easily computed offline based on these variables.
System Parameters To model our system, we need to present, as input, the parameters which describe the system, and use these in formulating the model. First, we need to capture the difference between the software execution time of an instruction and the hardware execution time, so we approximate these with $s_i$ and $h_i$ respectively. In addition to the execution latencies, we need to calculate the data transfer time for live inputs and outputs. This is based on the bandwidth and latency to the register file, and is used to calculate the number of data transfer cycles $C$, which should be minimized if possible. The bandwidth is captured by $PORT_{in}$ and $PORT_{out}$, the number of register file read and write ports, and the latency by $RFC_{in}$ and $RFC_{out}$, the number of cycles to access the register file.
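The text does not give the exact expression for $C$; one plausible form (our assumption, using the named parameters) charges one register-file access per batch of `PORT` transfers:

```python
import math

def transfer_cycles(n_in, n_out, port_in, port_out, rfc_in, rfc_out):
    # Hypothetical formula, ours for illustration: ceil(#values / ports)
    # batches in each direction, each batch costing RFC cycles.
    return (math.ceil(n_in / port_in) * rfc_in
            + math.ceil(n_out / port_out) * rfc_out)
```

With the settings used later in the evaluation (two read ports, one write port, one-cycle access), a 4-input, 1-output template would cost 3 transfer cycles under this assumption.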
Table 3.1 summarizes the parameters and variables of the model, in terms of input parameters of the architecture and the input basic block graph, output decision variables, and intermediate variables that describe specific aspects of the problem. Any variables introduced for the linearization of constraints are also included in this table.
Inputs and Outputs We can determine the number of inputs and outputs using the following observation. An input to the graph is a vertex $i$ such that, for some edge $(i,j) \in A$, $i$ is not inside the template, but $j$ is. The reverse case is true for outputs. The logical formulation of the number of inputs and outputs is therefore as follows:

$$INST_{in} = \sum_{i \in V} \Bigl[ \neg x_i \wedge \Bigl( \bigvee_{j : (i,j) \in A} x_j \Bigr) \Bigr]$$

$$INST_{out} = \sum_{i \in V} \Bigl[ x_i \wedge \Bigl( \bigvee_{j : (i,j) \in A} \neg x_j \Bigr) \Bigr]$$

Since MILP does not operate on logical predicates directly, we must linearize these equations using the techniques described in Chapter 2, Section 2.3.1. By distributing the logical and over the disjunction in the equations above, and by introducing auxiliary variables $input_i$ and $output_i$ for each equation, we can enforce the implications $\neg x_i \wedge x_j \implies input_i$ and $x_i \wedge \neg x_j \implies output_i$ with the corresponding linear inequalities.
Now, we can simply use the following two equations to add up the total number of inputs and outputs:

$$INST_{in} = \sum_{i \in V} input_i \tag{3.3}$$

$$INST_{out} = \sum_{i \in V} output_i \tag{3.4}$$

If required, at this point we could constrain the number of inputs or outputs to certain values by introducing appropriate equations. For example, we may only want to find 2-input, 1-output instructions, if this is what is encodable in the instruction format. However, for the rest of our formulation and analysis, we will assume the hardware can transfer additional values from the register file.
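Since, as noted earlier, these quantities can all be computed offline from $x$, the counting logic of equations (3.3)-(3.4) can be checked directly (our own sketch):

```python
def count_ios(edges, x):
    """Count template inputs and outputs per the logical definitions above.

    edges: list of (i, j) data-dependence pairs; x: dict vertex -> 0/1.
    """
    inputs = {i for (i, j) in edges if not x[i] and x[j]}
    outputs = {i for (i, j) in edges if x[i] and not x[j]}
    return len(inputs), len(outputs)

# Two external producers (0, 1) feed template vertex 2, whose result
# leaves the template toward vertex 3: 2 inputs, 1 output.
```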
$$a_j = \bigvee_{i : (i,j) \in A} (x_i \vee a_i) \quad \text{for all } j \in V$$

$$d_i = \bigvee_{j : (i,j) \in A} (x_j \vee d_j) \quad \text{for all } i \in V$$

These equations can be linearized by simply using the implication identities discussed in the previous chapter.
Finally, to enforce the convexity constraint, we just need to assert that no non-template node should have both an ancestor and a descendant in the template. We can do this by taking the logical conjunction of the three associated variables.
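The same convexity condition can be checked directly on a candidate template, which is useful for validating solver output (our own sketch; vertices are assumed to be numbered in topological order):

```python
def is_convex(edges, x, n):
    """A template is convex iff no non-template node has both a template
    ancestor and a template descendant (no edge leaves and re-enters)."""
    preds = [[] for _ in range(n)]
    succs = [[] for _ in range(n)]
    for i, j in edges:
        preds[j].append(i)
        succs[i].append(j)
    anc = [False] * n      # anc[v]: some ancestor of v is in the template
    desc = [False] * n     # desc[v]: some descendant of v is in the template
    for v in range(n):                       # forward pass (topological order)
        anc[v] = any(x[i] or anc[i] for i in preds[v])
    for v in reversed(range(n)):             # backward pass
        desc[v] = any(x[j] or desc[j] for j in succs[v])
    return not any(not x[v] and anc[v] and desc[v] for v in range(n))
```

For the chain 0 -> 1 -> 2 (plus edge 0 -> 2), the template {0, 2} is non-convex, since the value must leave through node 1 and return.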
Next, we must consider the custom instruction latency if it were realized in hardware. This is estimated by calculating the critical path through the hardware nodes. For each edge in the template, we simply add the source node's accumulated time and the estimated hardware latency of the current operation; the result is accumulated in the auxiliary continuous variable $t_j$. This naturally yields a linear equation. A subsequent equation simply takes the maximum such time as the hardware latency.
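The critical-path estimate just described can be computed directly for a fixed template (our own sketch; edges are assumed to be listed in topological order):

```python
def hw_latency(edges, x, h):
    """Critical path through template nodes: t[j] >= t[i] + h[j] for each
    template edge (i, j); the hardware latency is the maximum t."""
    t = [h[v] if x[v] else 0.0 for v in range(len(h))]   # sources start at h[v]
    for i, j in edges:
        if x[i] and x[j]:
            t[j] = max(t[j], t[i] + h[j])
    return max((t[v] for v in range(len(h)) if x[v]), default=0.0)
```

For a 3-node chain with per-operation latencies 1, 2, and 3, the full template has latency 6, while excluding the middle node leaves a longest in-template path of 3.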
The total transfer time $C$ takes into account the number of data transfers, as well as the cost of each transfer.
Finally, the total reduction in cycles can be written simply as the difference between the total software latency and the sum of the hardware and data transfer latencies:

$$Z = S - (H + C) \tag{3.14}$$
Hardware Latency Here we considered the hardware latency to be the critical path through estimates of each hardware unit in the template. This is not necessarily accurate, because the actual synthesis will have a different latency. Synthesizing every combination in the optimizer is impractical, and since the equations involved are non-linear, it would not fit into the linear programming paradigm.
Software Latency The software latency was calculated as the sum of the software latencies of the individual instructions. This ignores pipeline effects like data-dependence hazards or dynamic resource conflicts. However, to first order, it should be accurate enough for our baseline processor to be useful.
Basic Block Scope We examined only non-control-flow instructions for inclusion in custom instructions, and only examined one basic block at a time. This made the scope of the problem more tractable. That said, others have managed to use MILP for performing custom instruction generation across larger program regions [73].
Iterative Solutions Each time we find a solution, we fix the nodes in the template to not be in the template, in order to find additional potential subgraphs inside the basic block. This way of iterating over the space means that we will not find every possible template, because there could be useful overlapping templates that our strategy does not consider.
3.7 EVALUATION
In this section, we answer two important questions:
1. Can this model be applied in the context of a real system, and be implementable?
2. Is the problem formulated well enough to be solved in a reasonable amount of time?
3.7.1 METHODOLOGY
We wrote the model described in this chapter in GAMS, using roughly fifty lines of code. In order to evaluate the model on real workloads, we integrated the GAMS model into an LLVM module which analyzes data dependence inside basic blocks. The two components communicate through graphs written in GAMS syntax, which include all of the input parameters from Table 3.1.
For this particular problem, we consider arithmetic and logical operations as candidates for specialization; all other instructions are prevented from inclusion in the templates by fixing their template decision variables to zero. We set the register file access cost ($RFC_{in}$ and $RFC_{out}$) to a single cycle, and provide two register read ports ($PORT_{in}$) and one write port ($PORT_{out}$).
3.7.2 RESULTS
Implementability We apply our scheme to the EEMBC benchmark suite, and consider basic blocks which have more than one computation instruction feasible to include in a custom instruction template. A summary of our results appears in Table 3.2. We characterize the workloads with the number of basic blocks and the average basic block size, appearing in the second and third columns of Table 3.2. Overall, we see a significant number of basic blocks as potential candidates for specialization.
Result-1: Our model for template generation is practical and implementable.
Solution Time The fourth and fifth columns of Table 3.2 show the average number of equations and variables for each problem instance, which are very small due to the small average basic block size. This leads to very fast solution times, shown in the final column. We show the expected latency
Benchmark      Num BBs     Average  Model  Model  Per Temp.     Per Temp.
               Considered  BB Size  Eqs    Vars   Lat. Reduct.  Solve Time (ms)
a2time01          272       9.98     248    126      3.94            89
aifftr01          232      11.54     338    167      5.33            95
aifirf01          238      11.68     321    162      4.85           113
aiifft01          226      11.50     338    167      5.33           215
autcor00          200      10.99     307    153      5.02           204
basefp01          208      11.68     319    158      5.16           234
bezier01          198      11.20     326    161      5.35           149
bitmnp01          349       8.77     304    152      4.80            79
cacheb01          208      11.56     338    168      5.42            75
canrdr01          234      10.88     305    153      4.85            90
cjpeg            1080      10.62     186    104      2.36            77
conven00          209      10.99     315    156      5.14            90
dither01          200      11.19     317    159      5.12           104
djpeg             857      11.67     233    129      2.47           103
fbital00          204      10.82     295    148      4.85            80
fft00             215      11.26     306    153      4.90            70
idctrn01          227      15.97     468    230     10.44            73
iirflt01          251      11.89     351    176      5.72            92
matrix01          302      10.53     311    156      5.05            95
ospf              208      11.07     325    161      5.28            76
pktflow           238      10.76     312    157      4.84           105
pntrch01          221      10.90     320    160      5.16            96
puwmod01          236      10.73     282    143      4.40           194
rotate01          295       9.59     263    135      3.74            96
routelookup       219      10.99     318    158      5.11            76
rspeed01          212      11.16     324    162      5.08            93
tblook01          224      11.60     367    181      5.35            77
text01            215      10.75     323    160      5.28            99
ttsprk01          303      11.80     344    174      4.41           105
viterb00          217      11.65     348    175      5.03           112

Table 3.2: Average number of equations and solve times
reduction (in cycles) of each template in the sixth column, as an average over the entire benchmark. These tend to be in the range of 5-10 cycles on average, but can be as large as 50 or more cycles for certain templates. It is more likely that the large-latency-reduction templates would be picked by the instruction selection phase, as they show more benefit. Overall, we see that the model can be solved in a reasonable amount of time, showing its feasibility, and can find templates with large expected latency reductions, showing its effectiveness.
Result-2: The formulation is solvable on a reasonable time scale.
3.8 RELATED WORK
Galuzzi et al. provide a comprehensive survey of techniques, practices, and research in the field of automatic instruction-set extension [72]. We highlight the differences between the MILP approach explored in this chapter and a few selected approaches below.
Many practical techniques have been explored to solve the problem of automatic instruction set extension. These algorithms come in a variety of flavors, and must trade off the execution time of the algorithm, the scope of the search inside the program structure, and the type of graphs searched for. Hence, these solutions sacrifice exploring some aspect of the design space. Some solutions only look for connected subgraphs, which ignores potential increased parallelism [15, 183]. Other techniques only find instruction templates with certain simplified graph structures, like those that only find single-output subgraphs [8].
The MILP approach we explore here attempts to generalize the type of graphs we look for, including those with multiple outputs, useful for computationally intense code, and even those which have disjoint (yet still convex) subgraphs, as they provide additional parallelism. Our approach guarantees generality in subgraph form, and also optimality, given that the latency model can be configured to match the hardware.
3.9 CONCLUSIONS
In this chapter, we identified a problem suitable for integer linear programming, custom instruction set generation, and described how to create the formal model. We described the modeling process, from high-level description to integer linear constraints. Integrating our solution into a compiler, we were able to generate instruction templates with optimal projected performance improvement. Our implementation proved to be efficient, taking on the order of 100 ms per solution.
61
CHAPTER 4
Figure 4.1: Example resource management system with heterogeneous hardware and software.
capture the constraints or requirements of different situations. Our treatment of this problem is most closely related to that of Speitkamp and Bichler, who formulate the static problem as a MILP and evaluate their model on real workload data [114]. By implementing our models and measuring the run-times and optimization bounds, we will explore how they can be applied in real-world scenarios, and how simple models can be extended to capture complex requirements and phenomena.
4.2 OVERVIEW
Static Server Allocation (SSAP) Essentially, the problem is to allocate services with fixed resource requirements onto machines with fixed resource limitations. The problem is static in nature, in that we do not take advantage of the ability to migrate jobs across machines, and we assume that allocations are final. For this problem, we also assume the most general case, where machines are allowed to provide heterogeneous resources, and all instances of services have been profiled individually. Computationally, this problem is equivalent to multidimensional vector bin-packing, where each tuple of resources corresponds to a vector.
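The bin-packing view of SSAP can be made concrete with a small feasibility check (our own sketch, with hypothetical service and machine names): each machine's summed demand vector must stay within its capacity vector, componentwise.

```python
def feasible(req, cap, assign):
    """Check a static allocation against SSAP's vector bin-packing view.

    req:    service -> resource-demand vector (e.g. [cpu, memory])
    cap:    machine -> capacity vector
    assign: service -> machine hosting it
    """
    used = {m: [0.0] * len(v) for m, v in cap.items()}
    for s, m in assign.items():
        used[m] = [u + r for u, r in zip(used[m], req[s])]
    # every machine must satisfy used <= cap in every resource dimension
    return all(all(u <= c for u, c in zip(used[m], cap[m])) for m in cap)
```

Placing two services demanding `[2, 4]` each on a machine with capacity `[4, 8]` exactly fills it; any larger demand in either dimension makes the allocation infeasible.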
Time-Varying Server Allocation (TSAP) In certain situations, we may have a priori knowledge of the resource usage pattern of a workload over the course of a day. For instance, certain services may experience more load during the night, or during the day. Therefore, since we are still considering static allocation, we may want to co-locate services which have complementary resource patterns, ensuring that we do not exceed the machine's resource limitations. Figure 4.2 shows two example services with opportunistically co-locatable CPU resource utilization patterns. This is termed the Time-varying Server Allocation Problem (TSAP).
Interference-Sensitive Server Allocation (ISAP) Although we have modeled machine resource usage, the performance of co-located jobs can also be affected by interference in the memory system. This is especially true for latency-sensitive tasks, where cache interference and contention for memory bandwidth can have a profound effect on the overall Quality of Service (QoS). Moreover, these types of resource constraints do not easily lend themselves to the "bin-packing" approach we have thus far proposed for managing system resources, as there is a complex relationship between memory system contention and performance degradation, and this relationship is specific to the applications involved.
To overcome these problems, Mars et al. show how we can capture this relationship with the concepts of memory pressure and memory sensitivity [126]. Here, our abstraction for memory pressure is the application's lowest-level working set size, which determines the amount of on-chip cache used for storing the data of the given application. The memory sensitivity of an application describes the amount of on-chip cache required by the application to achieve a given quality of service. Figure 4.3 gives a concrete example of the relationship between memory pressure (from all other applications on the machine) and the projected application QoS. In the given example, to maintain a quality of service higher than 90%, service 1 and service 2 require a system memory
pressure of less than 10 MB and 25 MB respectively. Because they each exert less than the other's memory sensitivity limit (15 MB < 25 MB and 5 MB < 10 MB), they can be co-located effectively.
64 4. CASE STUDY: DATA CENTER RESOURCE MANAGEMENT
[Figure 4.3: Projected application QoS versus memory pressure, for Service 1 (exerted memory pressure 15 MB, 90% QoS threshold at 10 MB) and Service 2 (exerted memory pressure 5 MB, 90% QoS threshold at 25 MB).]
The problem which incorporates the constraint that QoS should not fall below a certain threshold is called the Interference-sensitive Server Allocation Problem (ISAP).
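The co-location rule from this example can be checked mechanically: a set of services is safe if, for each service, the total pressure exerted by the others stays below that service's sensitivity limit. A sketch using the numbers from the running example:

```python
def isap_colocatable(services):
    """Each service's QoS holds only if the total memory pressure exerted
    by ALL other services on the machine stays below its sensitivity."""
    for i, (_, sensitivity_i) in enumerate(services):
        others = sum(p for j, (p, _) in enumerate(services) if j != i)
        if others >= sensitivity_i:
            return False
    return True

# From the running example: service 1 exerts 15 MB and tolerates up to
# 10 MB of external pressure; service 2 exerts 5 MB and tolerates 25 MB.
service1 = (15, 10)   # (exerted pressure in MB, sensitivity limit in MB)
service2 = (5, 25)
print(isap_colocatable([service1, service2]))  # True: 5 < 10 and 15 < 25
```

This pairwise-sum check is what the ISAP constraints express declaratively inside the MILP.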
Problem Statement We have described several problem classes above for data center resource management. Though the requirements and objectives of each model described above are slightly different, their overall nature is the same. We can state the overall problem as:
Statically determine the best co-locations of services on servers such that the re-
source requirements and service level agreements can be satisfied.
Chapter Organization Similar to the previous chapter, the next three sections describe the mod-
eling of the integer linear program by first describing the decision variables and input parameters,
then formulating the mixed integer linear program through linearizations of logical constraints
and the objective function. We then discuss modeling limitations and related work, and conclude
with an evaluation.
Static Server Allocation (SSAP) The most basic constraint that we need to enforce is for there to exist a mapping from services to machines, such that each service is mapped to some machine. We enforce this simply with the following equation, which forces exactly one M_sc to be set for any s.
∑_{c∈C} M_sc = 1   for all s ∈ S   (4.1)
Recall that for SSAP, each service and machine is unique, that we need to enforce the resource limitations, and that we need to determine which machines are "on". The first constraint below, a linear constraint, enforces proper utilization. For each resource and machine, we add up the contribution of resource usage from each mapped job, and limit that against the maximum for that machine. The second equation below, a logical constraint, equates the fact that no service is mapped with the fact that the machine is off.
∑_{s∈S} (M_sc · R_sk) ≤ L_ck   for all c ∈ C, k ∈ K

⋀_{s∈S} (M_sc = 0) = ¬O_c   for all c ∈ C
4.4. FORMULATION: CONSTRAINTS 67
We can actually linearize the above two constraints as the following single constraint, by simply multiplying the right-hand side of the first constraint above by O_c. The reason this works is that if any M_sc is on, then O_c must be on to enforce the limit. Since we will later see that our objective is to minimize the sum of O_c, we do not need to enforce the implication in the other direction.
∑_{s∈S} (M_sc · R_sk) ≤ O_c · L_ck   for all c ∈ C, k ∈ K   (4.2)
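To make constraints (4.1) and (4.2) concrete, a toy SSAP instance can be solved by brute-force enumeration rather than by MILP; the feasibility check below is exactly constraint (4.2), and all numbers are hypothetical:

```python
from itertools import product

# Tiny SSAP instance: R[s][k] is service s's demand for resource k,
# and L[c][k] is machine c's capacity for resource k.
R = [(0.6, 0.3), (0.5, 0.5), (0.2, 0.4)]   # 3 services, 2 resources
L = [(1.0, 1.0), (1.0, 1.0)]               # 2 identical machines

def feasible(assign):
    """Constraint (4.2): per-machine, per-resource demand within capacity."""
    for c, cap in enumerate(L):
        for k in range(len(cap)):
            used = sum(R[s][k] for s, m in enumerate(assign) if m == c)
            if used > cap[k]:
                return False
    return True

# Enumerate every mapping of services to machines (constraint 4.1 holds
# by construction) and minimize the number of machines switched on.
best = min(
    (a for a in product(range(len(L)), repeat=len(R)) if feasible(a)),
    key=lambda a: len(set(a)),
)
print(len(set(best)))  # 2: the three services cannot all fit on one machine
```

Enumeration explodes combinatorially, of course; the MILP formulation expresses the same constraints declaratively and lets the solver prune the search.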
Warehouse Server Allocation (WSAP) The warehouse server allocation problem is similar, except that we take advantage of the limited number of workload types. The set S now represents these types, and M_sc now represents the number of services of type s mapped to machine c. We add one component to the model, num, which indicates how many instances of a certain type we have. The formulation must be modified so that we can allow multiple instances of a certain service type to be mapped to the same machine. Constraint 4.3 below accomplishes this, and we borrow the same Equation 4.2 for this formulation.
∑_{c∈C} M_sc = num_s   for all s ∈ S   (4.3)
For WSAP, we will also assume that we have homogeneous hardware, because this is usually the case (at least to some extent) in the warehouse computing domain. This opens an opportunity for reducing the search space of the problem. Since each machine is the same, it is wasteful to explore allocating services on all available combinations. For instance, if we have 100 machines total, and the optimal solution uses 50 machines, then there are C(100, 50) equivalent optimal solutions. We can help reduce the symmetry of the problem by considering the early machines first. We accomplish this by specifying that each machine can only be on if the previous machine is on, which we refer to as canonical order:

O_c ≤ O_{c−1}   for all c ∈ C, c > 1   (4.4)
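The size of the symmetry, and the effect of canonical ordering, can be illustrated directly (a sketch, not part of the model itself):

```python
import math
from itertools import combinations

# With 100 identical machines and an optimum using 50 of them, any choice
# of WHICH 50 machines are "on" yields an equivalent solution:
print(math.comb(100, 50))   # roughly 1e29 equivalent optima

def canonical(on):
    """Symmetry-breaking (canonical order): machine c may be on only if
    machine c-1 is on, so the on/off vector looks like 1...10...0."""
    return all(on[c] <= on[c - 1] for c in range(1, len(on)))

# Of the C(4,2) = 6 ways to turn on 2 of 4 machines, only one is canonical.
vectors = [[1 if c in idx else 0 for c in range(4)]
           for idx in combinations(range(4), 2)]
print(sum(canonical(v) for v in vectors))   # 1
```

Every equivalence class of on/off vectors contains exactly one canonical member, which is why the constraint prunes the search space without excluding any genuinely distinct solution.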
∑_{s∈S} (M_sc · R_skt) ≤ O_c · L_ck   for all c ∈ C, k ∈ K, t ∈ T   (4.5)
Though the above equations are correct, they contain arbitrarily large constants which can reduce the effectiveness of the model and increase the solution time. We should instead limit the bigMs to their smallest legal values. The smallest legal value for bigM1 is the largest number of services that can fit on a machine, which we can compute offline. The smallest legal value for bigM2 is the most memory pressure that a machine could ever take; here, we use the biggest memory sensitivity given in the problem (X_S).
Finally, we can again borrow constraints 4.3 and 4.4 to complete the TSAP model.
TOT_on = ∑_{c∈C} O_c   (4.8)
Communication Modeling In our model, we have not considered, to any rigorous degree, the communication requirements of the various services, nor do we model the underlying network bandwidth capabilities or latencies between servers. For certain types of services, where communication is the bottleneck, this model would not be appropriate. Though we do not explore communication modeling in the data center setting, the next chapter describes an assignment problem formulation which does communication scheduling, but in a very different domain.
4.7 EVALUATION
In this section, we evaluate the four different models using synthetic workload inputs. The key questions are:
1. Can these models be implemented in GAMS and be solved to some degree of optimality?
2. How well do these models scale, and to what problem domains could they be applied?
3. How important is the formulation of the model?
4.7.1 METHODOLOGY
For the model inputs, we must have knowledge of the services in question, including resource
usage over time, memory pressure, and memory sensitivity. Since we are not trying to validate our models against real data-center workloads and hardware, or prove that they can achieve a better degree of co-location than another algorithm, we simply use synthetic inputs. This is reasonable, because our goal is to show what is possible using MILP. Though it is true that the exact resource distribution will affect the solving time of the MILP, we posit that these are second-order effects, and that synthetic inputs are sufficient provided they are reasonable.
We generate synthetic inputs as follows. We use two resources for set K , modeling CPU
and memory resources. For the set of machines, we normalize the hardware capabilities to unit
dimensions, and when modeling heterogeneous hardware, add in a uniform distribution from
0 to 1 units. For services, we use a uniform distribution from 0 to 1, and use 8 time periods
when modeling time varying resource requirements. Memory pressure and sensitivity are also
given uniform distributions, but sensitivity is given a multiplicative factor (average 3), so that
co-location is possible.
For each problem size, we report the average of five runs with different random seeds. Also, for all experiments, we stop the solution procedure when the incumbent solution is within 10% of optimality, or in other words, when the number of additional machines used is less than 10% of the projected optimal value. Note that this does not necessarily mean that we have not attained the optimal value, just that we have not proven the incumbent solution to be optimal.
For comparison, we implement a very simple first-fit algorithm which attempts to place
services onto machines iteratively, in much the same way that a dynamic service allocator would
function.
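The first-fit heuristic admits a short sketch; the greedy capacity bookkeeping below and the (cpu, mem) numbers are hypothetical:

```python
def first_fit(services, machines):
    """Place each service on the first machine with enough remaining
    capacity in every resource dimension; return the machine index per
    service (or None for services that fit nowhere)."""
    remaining = [list(m) for m in machines]
    mapping = []
    for demand in services:
        placed = None
        for c, cap in enumerate(remaining):
            if all(d <= r for d, r in zip(demand, cap)):
                for k, d in enumerate(demand):
                    cap[k] -= d          # consume the machine's capacity
                placed = c
                break
        mapping.append(placed)
    return mapping

# Hypothetical (cpu, mem) demands against two unit-capacity machines.
services = [(0.7, 0.2), (0.4, 0.5), (0.2, 0.6)]
machines = [(1.0, 1.0), (1.0, 1.0)]
print(first_fit(services, machines))  # [0, 1, 0]
```

Unlike the MILP, first-fit never reconsiders an earlier placement, which is why its solutions can be noticeably worse, as the results below show.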
4.7.2 RESULTS
Implementability We begin the results section by giving a summary of the results for all four models with 50 machines and 50 services to be scheduled, as shown in Table 4.2. The first two columns show the number of equations and variables for the problem. SSAP requires many more variables because in this problem we are mapping each service individually instead of mapping "kinds" of services as in the other problems. TSAP and ISAP require more equations, as we are additionally enforcing resource restrictions for time periods and memory sensitivity respectively. The next two columns show the Heuristic and MILP solution optimality (the percentage difference between the given solution and the lower bound which the solver reports). Here, the MILP solution considerably outperforms the Heuristic, especially for the SSAP problem, where the number of choices is high. The final column of Table 4.2 shows the average solve time for each benchmark; all are reasonably fast.
Result-1: Our formulation is practical and implementable.
Table 4.2: Average number of equations and solve times, for 50 services and 50 machines
Scalability We explore the scalability of the approach by changing the number of machines and services in our problem. Figure 4.4 shows the scalability of all four problems. As expected, the problems scale to a similar degree, except that SSAP takes significantly longer than the other three, as it models each individual service. In fact, for SSAP problems larger than 1000 services/machines, GAMS always takes longer than the given timeout of 20 min to find an initial solution, because the problem is simply too large. Figure 4.5 shows the benefits we get with the MILP approach for different problem sizes. The MILP solutions are generally about 50% better than first fit for the SSAP problem, while the others perform around 10% better, independent of the problem size itself.
Result-2: Our model for data-center allocation scales up to thousands of machines and services.
Effects of Model Formulation We now examine the benefits of two optimizations explained in the formulation. The first regards symmetry as it relates to which machines are on (O_c) in the WSAP problem. We introduced constraint 4.4 so that we would only consider turning machines on in canonical order, reducing the total number of equivalent solutions by a combinatorial factor. Figure 4.6 shows the benefits of including this constraint by comparing solution times, where "WSAP (SYMM)" is the WSAP problem without the symmetry-breaking constraint. Solution times are generally worse, and much more "erratic," as certain instances force the solver to explore many similar solutions.
Another common problem with modeling logical operations in MILP is choosing the value of so-called "bigM" variables. Those that are too large can create models with very loose linear relaxations, greatly increasing solve time. In Figure 4.7, we show the benefit of using appropriate bigM values by comparing the optimized ISAP with the non-optimized "ISAP-BIGM." This non-optimized formulation uses a nominal value of 1000 for both bigM variables. We see that, with poor bigM values, the problem becomes much more difficult to solve, increasing the solve times even for smaller input sizes.
Result-3: The choice of constraints and parameters in the formulation can significantly affect the solve time.
of optimality obtained. The value of optimality over shorter runtime depends on the particular resource allocation problem.
Even inside the domain of using Integer Linear Programming for resource allocation, there is still a wide variety of relevant research. As mentioned before, we draw heavily upon the work of Speitkamp and Bichler for their modeling of the SSAP and TSAP problems [158]. Their work also provides extensive studies of real workload data, where we simply use synthetic workloads. Bose and Sundarrajan use a similar formulation to solve the same problem, but use more nuanced constraints for enforcing service level agreements [30]. Berral et al. and Lubin et al. create models which focus on minimizing energy consumption, which leads to somewhat different constraints [23, 121]. An interesting application of Lubin's MILP allocation model is in the work by Guevara et al. [84]. Here, they use a market-based mechanism to resolve application demands for heterogeneous resources, and a periodically run MILP model clears the market, maximizing the overall welfare of the system. Lastly, Zhu et al. use MILP for making allocation decisions in the data center, but make extensions for modeling the network organization and bandwidth constraints [185].
4.9 CONCLUSIONS
In this chapter, we described how to formulate an important set of related data center problems, specifically the consolidation and allocation of services onto servers while preserving service level agreements when co-locating. We have shown two important phenomena in this chapter. First, related problems can be modeled easily through simple extensions to a generic MILP model: only a few equations were required to model the problems of handling workload and machine characteristics found in warehouse computing (WSAP), time-varying service resource requirements (TSAP), and managing memory interference between services (ISAP). Second, large MILP problems can be modeled, but sufficient abstractions and efficient formulations must be applied; here, we showed how tight bigM bounds and symmetry reduction were extremely important.
75
CHAPTER 5
Case Study: Spatial Architecture Scheduling
Our third case study is more complex and detailed than the previous two, and pursues a new
domain with much broader goals. In this chapter, we will show how MILP can be applied to
solve the scheduling problem for a certain class of computer architecture. Similar to the previous
case study, we show how a set of constraints can be built upon to model related problems, and we
begin this chapter by describing the targeted domain.
Hardware specialization has emerged as an important way to sustain performance improvements of microprocessors, addressing transistor energy-efficiency challenges and the inefficiencies of general purpose processing [14, 58, 88]. The fundamental insight of many specialization techniques is to "map" large regions of computation to the hardware, breaking away from instruction-by-instruction pipelined execution and instead adopting a spatial architecture paradigm. Specifically, spatial architectures expose hardware resources, like functional units, the interconnection network, or storage, to the compiler. Pioneering examples include RAW [175], Wavescalar [161], and TRIPS [34], motivated primarily by performance; recent energy-focused proposals include Tartan [133], CCA [38], PLUG [47, 113], FlexCore [166], SoftHV [48], MESCAL [94], SPL [177], C-Cores [171], DySER [80, 81], BERET [85], and NPU [59].
A fundamental problem in all spatial architectures is the scheduling of some notion of computation to the ISA-exposed hardware resources. Typically this problem has been solved with heuristic-based approaches, like the TRIPS and RAW schedulers [40, 116]. Our approach, which leverages mathematical optimization, belongs to the same class as seminal job scheduling and VLIW scheduling work [64, 174]. A mathematical approach can provide many benefits besides solution quality; we seek to create a solution which allows high developer productivity, provides provable properties on results, and enables true architectural generality.
The approach we outline in this chapter captures the scheduling problem with five intuitive abstractions: i) placement of computation on the hardware substrate; ii) routing of data on the substrate to reflect and carry out the computation semantics, including interconnection network assignment, network contention, and network path assignment; iii) managing the timing of events in the hardware; iv) managing resource utilization to orchestrate concurrent usage of hardware; and v) forming the optimization objectives to meet the architectural performance goals.
76 5. CASE STUDY: SPATIAL ARCHITECTURE SCHEDULING
Target Architectures We apply our approach to three architectures picked to stress our MILP
scheduler in various ways. To test the performance deliverable by our general MILP approach, we
consider TRIPS because it is a mature architecture with sophisticated specialized schedulers re-
sulting from multi-year efforts [3, 34, 40, 138]. To represent the emerging class of energy-centric
specialized spatial architectures, we consider DySER [81]. Finally, to demonstrate the generality
of our technique, we consider PLUG [47, 113], which uses a radically different organization and
execution model. We will show that a total of 20 constraints specify the general problem, and that only 3 TRIPS, 1 DySER, and 10 PLUG constraints are required to handle the architecture-specific details.
We now briefly describe the three spatial architectures we consider; a detailed diagram of all three is shown in Figure 5.8 (page 87).
The TRIPS architecture is organized into 16 tiles, with each tile containing 64 slots, and these slots grouped into sets of eight. The slots from one group are available for mapping one block of code, with different groups used for concurrently executing blocks. The tiles are interconnected using a 2-D mesh network, which implements dimension-ordered routing and provides support for flow control and network contention. The scheduler must perform computation mapping: it takes a block of instructions (which can be no more than 128 instructions long) and assigns each instruction to one of the 16 tiles and, within them, to one of the 8 slots.
The DySER architecture consists of functional units (FUs) and switches, and is integrated into the execution stage of a conventional processor. Each FU is connected to four neighboring switches, from which it gets input values and into which it injects outputs. The switches allow datapaths to be dynamically specialized. Using a compiler, applications are profiled to extract the most commonly executed regions, called path-trees, which are then mapped to the DySER array. The role of the scheduler is to map nodes in the path-trees to tiles in the DySER array and to determine switch configurations that assign a path to each data-flow graph edge. There is no hardware support for contention, and some mappings may result in unroutable paths. Hence, the scheduler must ensure that mappings are correct, have low latency, and have high throughput.
The PLUG architecture is designed to work as an accelerator for data-structure lookups in network processing. Each PLUG tile consists of a set of SRAM banks, a set of no-buffering routers, and an array of statically scheduled in-order cores. The only memory access allowed by a core is to its local SRAM, which makes all delays statically determinable. Applications are expressed as dataflow graphs with code-snippets (the PLUG literature refers to them as code-blocks) and memory associated with each node of the graph. Execution of programs is data-flow driven by messages sent from tile to tile; the ISA provides a send instruction. The scheduler must perform computation mapping and network mapping (dataflow edges → networks). It must ensure there is no contention for any network link, which it can do by scheduling when send instructions execute in a code-snippet or by adjusting the mapping of graph nodes to tiles. It must also handle flow control.
In all three architectures, multiple instances of a block, region, or dataflow graph are exe-
cuting concurrently on the same hardware, resulting in additional contention and flow-control.
5.2 OVERVIEW
We present below the main insights of our approach to using MILP for specifying the scheduling problem for spatial architectures. Subsequently, we distill the formulation into five responsibilities, or subproblems, each corresponding to one architectural primitive of the hardware.
The scheduler for a spatial architecture works at the granularity of "blocks" of code, which could be basic blocks, hyper-blocks, or other code regions. These blocks, which we represent as directed acyclic graphs (DAGs), consist of computation instructions, control-flow instructions, and memory access instructions, and are referred to as G. The hardware, which is composed of functional units and routers, is referred to as H. For ease of explanation, we describe G as comprised of vertices and edges, while H is comprised of nodes, routers, and links (formal definitions and details follow in Section 5.3). An example scheduling problem is shown in Figure 5.1 on page 80.
Problem Statement
Spatially map a typed computation DAG G to a hardware graph H under the ar-
chitectural constraints.
Though the above problem statement captures the overall problem, its complexity is reduced through further subdivision. To design and implement a general scheduler applicable to many spatial architectures, we observe that five fundamental architectural primitives, each with a corresponding scheduler responsibility, capture the problem as outlined in Table 5.1 (columns 2 and 3). Below we describe the insight connecting the primitives and responsibilities and highlight the mathematical approach. Table 5.1 summarizes this correspondence and describes these primitives for three different architectures.
Concurrent hardware resource usage → Managing Utilization Central to the difficulties of the scheduling problem is the concurrent usage of one node/link in H by multiple vertices/edges in G. We formalize this concurrent usage with a notion of utilization, which represents the amount of work a single hardware resource performs. Such concurrent usage (and hence > 1 utilization) can occur within a DAG and across concurrently executing DAGs. Overall, the scheduler must be aware of resource limits in H and of which resources can be shared, as shown in Table 5.1, row 4. For example, in TRIPS, within a single DAG, 8 instruction-slots share a single ALU (a node in H), and across concurrent DAGs, 64 slots share a single ALU. In both cases, this node-sharing leads to contention on the links as well.
5.3. FORMULATION: PARAMETERS AND DECISION VARIABLES 79
[Figure 5.1: An example scheduling problem: the DAG G for z = (x+y)², the graph H for the hardware of a spatial architecture, and a mapping of G to H. Circles represent computation nodes and vertices, while distinct markers represent input/output nodes and vertices. Squares represent elements of R, which are routers composing the communication network. Elements of E are shown as unidirectional arrows in the computation DAG, and elements of L as bidirectional arrows in H, each representing two unidirectional links.]
e scheduler’s job is to use the description of the typed computation DAG and hardware
graph to find a mapping from computation vertices to computation resource nodes and determine
the hardware paths along which individual edges flow. Figure 5.1 also shows a correct mapping of
the computation graph to the hardware graph. is mapping is defined by a series of constraints
and variables described in the remainder of this section, and these variables and scheduler inputs
are summarized in Table 5.3.
[Figure: The DAG G for z = (x+y)² and the graph H for the hardware of a spatial architecture, with an example mapping of V to N: Mvn(v1,n1)=1, Mvn(v2,n1)=0, Mvn(v3,n4)=1, Mvn(v3,n5)=0, ….]
The equation above cannot fully capture dynamic events like cache misses. Rather than consider all possibilities, the scheduler simply assumes best-case values for unknown latencies (alternatively, these could be obtained through profiling or similar means). Note that this is an issue for specialized schedulers as well.
With the constraints thus far, it is possible for the scheduler to overestimate edge latency, because the link mapping allows fictitious cycles. As shown by the cycle in the bottom-left quadrant of Figure 5.4, the links in this cycle falsely contribute to the time between input "x" and vertex "+". This does not violate constraint 5.5, because each router involved contains the correct number of incoming/outgoing links.
[Figure 5.4: Fictitious cycles.]
In many architectures, routing constraints (see eq. 5.7) make such loops impossible, but when this is not the case we eliminate cycles through a new constraint. We add a new set of variables O(L), indicating the partial order in which links are activated. If an edge is mapped to two connected links, this constraint enforces that the second link must be later in the order.
As shown in the running DySER example below in Figure 5.6, we limit the utilization of each link U(l) to MAX_L = 1. This ensures that only a single message per block traverses the link, allowing DySER's arbitration-free routers to operate correctly.
∀v ∈ V_in, T(v) = 0   (5.17)
∀v ∈ V_out, T(v) ≤ LAT   (5.18)
To model the throughput aspects, we utilize the concept of the service interval SVC, which is defined as the minimum number of cycles between successive invocations when no data dependencies between invocations exist. We compute SVC by finding the maximum utilization on any resource.
∀n ∈ N, U(n) ≤ SVC   (5.19)
∀l ∈ L, U(l) ≤ SVC   (5.20)
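Constraints (5.19) and (5.20) amount to taking a maximum over resource utilizations; a minimal sketch with hypothetical utilization values:

```python
def service_interval(node_util, link_util):
    """SVC: the minimum number of cycles between successive invocations
    is bounded below by the busiest node or link (cf. 5.19 and 5.20)."""
    return max(max(node_util.values()), max(link_util.values()))

# Hypothetical utilizations (units of work per invocation).
nodes = {"n1": 2, "n2": 1, "n3": 3}
links = {"l1": 1, "l2": 4}
print(service_interval(nodes, links))  # 4: link l2 is the bottleneck
```

In the MILP, SVC appears as a variable bounded by every utilization, so minimizing it drives the schedule toward balanced resource usage.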
Figure 5.8: Three candidate architectures and corresponding H graphs, considering four tiles for each architecture.
Network organization → Routing data Since messages are dedicated and point-to-point (as opposed to multicast), we use equations modeling each edge as consuming a resource and contributing to the total utilization. The TRIPS routers implement dimension-order routing, i.e., messages first travel along the X axis, then along the Y axis. TRIPS uses the I(L, L) parameter, which disallows the mapping of certain link pairs, to invalidate any paths which are not compatible with dimension-order.
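Dimension-ordered routing itself is easy to state procedurally; a sketch that enumerates the X-then-Y path between two mesh tiles (coordinates hypothetical):

```python
def xy_route(src, dst):
    """Dimension-ordered (X-then-Y) route on a 2-D mesh: the sequence of
    tiles visited when travelling first along X, then along Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # resolve the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

In the MILP, the same restriction is imposed declaratively: I(L, L) marks as incompatible any pair of consecutive links in which a Y-direction hop precedes an X-direction hop.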
HW timing → Managing timing of events We can calculate network timing without any additions to the general formulation.
Extensions: For TRIPS, the scheduler must also account for control flow when computing the utilization and, ultimately, the service interval for throughput. Simple extensions, as explained below, can in general handle control flow for any architecture, and could belong in the general MILP formulation as well. Let P be the set of control-flow paths that the computation can take through G. Note that p ∈ P is not actually a path through G, but the subset of its vertices and edges activated in a given execution. Let Av(P, V) and Ae(P, E) be the activation matrices defining, for each vertex and edge of the computation, whether they are activated when a given path is taken. For each path we define the maximum utilization on this path, Wp(P). These equations are similar to the original utilization constraints (5.10, 5.14), but also take the control-flow activation matrices into account.
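The path-activation idea can be sketched as follows: utilization on a node is counted per control-flow path, using a hypothetical activation matrix Av. Vertices that never fire together then stop inflating the utilization bound:

```python
def path_utilization(mapping, Av, path, node):
    """Work on `node` along `path`: mapped vertices the path activates."""
    return sum(1 for v, n in mapping.items() if n == node and Av[path][v])

# Hypothetical block: v1 and v2 are mapped to the same node, but only one
# of them fires on each control-flow path.
mapping = {"v1": "n1", "v2": "n1"}
Av = {"p1": {"v1": True, "v2": False},
      "p2": {"v1": False, "v2": True}}
worst = max(path_utilization(mapping, Av, p, "n1") for p in Av)
print(worst)  # 1, versus 2 if control flow were ignored
```

The per-path maximum Wp plays the role of `worst` here, tightening the service-interval bound for blocks with divergent control flow.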
Empirically, we found that external limitations on the throughput of inputs are greater than those of computation. For this reason, the DySER scheduler first optimizes for latency, adds the latency of the solution as a constraint, then optimizes for throughput by minimizing the latency mismatch MIS, as below:
5.6. ARCHITECTURE-SPECIFIC MODELING 91
Because the insertion of no-ops can only change timing in specific ways, we use two constraints to further link the sending delay δ(E) and the no-op padding variables to the edge timing variables. The first ensures that the scheduler never attempts to pad a negative number of NOPs. The second ensures that the sending delay δ(E) is the same for all multicast edges carrying the same message.
To implement these constraints we use the following four sets over distinct edges e, e′: SI(e, e′) is the set of pairs of edges arriving at the same vertex such that e arrives before e′; LIFO(e, e′) contains, for each vertex with both input and output edges, the last input edge e and the first output edge e′; SO(e, e′) contains the pairs of output edges with the same source vertex such that e is sent before e′; and EQO(e, e′) contains the pairs of output edges leaving the same node concurrently.
We then define the utilization based on the number of vertex bundles mapped to a node. We also instantiate edge bundles b_e for each set of edges coming from the same vertex bundle and going to the same destination. Since all the edges in such a bundle are logically a single message source, the schedule must equalize the receiving times of the message they send. Let B_mutex ⊆ B_e be the set of edge-bundles described above. Then we add the following timing constraint:
Additionally, architectural constraints require the total length in instructions of the vertex bundles mapped to the same node to be at most 32. This requires defining, for each bundle, the maximum bundle length as a function of the last send message of the vertex; this length can then be constrained to be at most 32. To achieve this, we first define the set LAST(Bv, Be), which pairs each vertex bundle with its last edge bundle, corresponding to the last send message of the vertex. This enables us to define the maximum bundle length of b_v as:
We finally define Q(Bv, N) as the required number of instructions on node n from vertex bundle b_v, and limit it to at most 32 (the code-snippet length).
Objective Formulation For PLUG, the smallest service interval is achieved and enforced for
any legal schedule, and we optimize solely for latency LAT .
DySER "throughput" microbenchmark: This performs the calculation y = x − x^(2i) in the code-region. Paths diverge at the input node x, into one long path, which computes x^(2i) with a series of i multiplies, and a short path, which routes x to the subtraction. This pattern tends to cause latency mismatch, because one of these converging paths naturally takes fewer resources.
Independence of Latency and Utilization One possible modeling of the spatial scheduling problem is to create binary decision variables both for "where" a computation should go and "when" it should be activated, which we refer to as "space-time" scheduling. We have taken a slightly different approach in this formulation by assuming that the latency and utilization concerns are mostly independent, and we create decision variables only for "where" a computation goes. We rely on the latency being calculable from the mapping of computation and communication. This is not necessarily true: with TRIPS, two computations which could both fire on the same tile at the same time will need to be arbitrated. For the purpose of the timing responsibility, the model optimistically assumes that both computations fire at the same time. In general, our formulation does not take into account the fine-grained interaction of latency and utilization. That said, our approach uses many fewer decision variables than a "space-time" approach, and we can more naturally model utilization constraints.
5.8 EVALUATION
In this section, we describe our implementation of the constraints and evaluate its performance compared to the native specialized schedulers for the three architectures. Our key questions are:
5.8.1 METHODOLOGY
We use the GAMS modeling language to specify our constraints as mixed integer linear programs, and we use the commercial CPLEX solver to obtain the schedules. Our implementation strategy for prioritizing multiple variables follows a standard approach: we define an allowable optimality gap (between 2% and 10%, depending on the architecture) and optimize for each variable in prioritized succession, terminating the solver when the gap is within the specified bounds. After finding the optimal value for each variable, we add a constraint which restricts that variable to be no worse in subsequent iterations.
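The prioritized-objective loop can be sketched over an enumerated set of candidate solutions; here the solver is replaced by a simple minimum over explicit candidates, and the “no worse within the gap” constraint becomes a filter (all names and numbers are illustrative):

```python
# Candidate solutions scored on two prioritized objectives (lower is better).
candidates = [{"lat": 10, "util": 3}, {"lat": 8, "util": 7},
              {"lat": 8, "util": 4}, {"lat": 9, "util": 2}]
gap = 0.10  # allowable optimality gap for each objective

feasible = candidates
for objective in ["lat", "util"]:          # optimize in prioritized succession
    best = min(c[objective] for c in feasible)
    # the added constraint: keep only solutions no worse than best within gap
    feasible = [c for c in feasible if c[objective] <= best * (1 + gap)]

chosen = feasible[0]
```

The first pass fixes latency to within 10% of its best value; the second pass then minimizes utilization among the remaining candidates.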
Figure 5.9 shows our implementation and how we integrated with the compiler/simulator toolchains [3, 47, 80]. For all three architectures, we use their intermediate output, converted into our standard directed acyclic graph (DAG) form for G and fed to our GAMS MILP program.
We specified H for each architecture. To evaluate our approach, we compare the performance of the final binaries on the architectures, varying only the scheduler. Table 5.4 summarizes the methodology and infrastructure used.
Figure 5.9: Implementation of our MILP scheduler. Dotted boxes indicate the new components added.
5.8.2 RESULTS
Practicality Table 5.5 (page 95) summarizes the mathematical characteristics of the workloads and the corresponding scheduling behavior. The three right-hand columns respectively show the number of software nodes to schedule, the number of individual MILP equations created, and the solver time.¹ There is a rough correlation between workload “size” and scheduling time, but the latter is still highly variable.
In comparison, the solve time of the specialized schedulers is typically on the order of seconds. Although some blocks may take the MILP solver minutes, these times are still tractable, demonstrating the practicality of MILP as a scheduling technique.
Result-1: Our general MILP scheduler runs in tractable time.
Solution Quality vs. Heuristic Schedulers Figure 5.10 (page 96) shows the performance of our MILP scheduler. It shows the cycle-count reduction of the executed programs as a normalized percentage of the program produced by the specialized compiler (higher is better; negative numbers mean execution time increased). We discuss these results in terms of each architecture.
Compared to the TRIPS SPS specialized scheduler (a cumulative multi-year effort spanning several publications [34, 40, 138]), our MILP scheduler performs competitively, as summarized below.
¹For TRIPS, the per-benchmark number of DAGs ranges from 50 to 5000, and the metrics provided are averages per DAG. For DySER, the number of DAGs is 1 to 4 per benchmark; for PLUG it is always 1.
Table 5.5: Workload size (# of DAG nodes), number of MILP equations, and solver time.

TRIPS bench    # nodes  # eqns  Solve (sec)
ammp_1         17       3744    76
ammp_2         8        1593    11
art_1          22       4547    74
art_2          27       5506    76
art_3          33       7042    20
bzip_1         13       2655    10
equake_1       24       4455    3
gzip_1         23       4480    1
gzip_2         22       4506    111
matrix_1       19       3797    18
parser_1       33       7248    174
transp_GMTI    20       4159    115
vadd           30       7313    315

TRIPS EEMBC    # nodes  # eqns  Solve (sec)
a2time01       11       1914    5
aifftr01       12       2173    25
aifirf01       11       2025    1
basefp01       10       1863    6
bitmnp01       9        1535    3
cacheb01       27       2745    76
candr01        10       1871    8
idctrn01       11       1947    3
iirflt01       11       2080    2
matrix01       11       1426    2
pntrch01       10       1819    8
puwmod01       10       1779    3
rspeed01       10       1816    7
tblook01       10       1818    4
ttsprk1        11       1993    8
cjpeg          12       2280    3
djpeg          12       2277    1
ospf           10       1778    3
pktflow        10       1774    3
routelookup    10       1747    3
bezier01       10       1788    2
dither01       10       3579    4
rotate01       10       1910    5
text01         10       1781    3
autocor00      10       1746    2
conven0        10       1758    4
fbital00       9        1699    3
viterb00       10       1870    5

TRIPS Avg.     14       2832    31

DySER Apps.    # nodes  # eqns  Solve (sec)
fft            20       120250  365
mm             32       159231  77
mri-q          19       98615   66
spmv           32       155068  72
stencil        30       153428  74
tpacf          40       211584  368
nnw            25       169197  102
kmeans         40       232399  218
needle         32       181686  183
throughput     9        45138   62
DySER Avg.     28       152660  159

PLUG Apps.     # nodes  # eqns  Solve (sec)
Ethernet       18       35603   57
Ethane         11       13905   14
IPv4           12       38741   384
Seattle        16       14531   26
PLUG Avg.      14       23195   120
Compared to SPS:
(a) Better on 22 of 43 benchmarks, by up to 21% (GM +2.9%)
(b) Worse on 18 of 43 benchmarks, within 4.9% and typically 2% (GM -1.9%)
(c) 5.4%, 6.04%, and 13.2% worse on only 3 benchmarks
Compared to GRST:
Consistently better, by up to 59% (GM +30%)
Groups (a) and (b) show that the MILP scheduler captures the architecture/scheduler interactions well. The small slowdowns/speedups compared to SPS are due to dynamic events which disrupt the scheduler’s view of event timing, making its node/link assignments sub-optimal, typically by only 2%. After detailed analysis, we discovered that the reason for the performance gap of group (c) is a lack of information that could be easily integrated into our model. First, the SPS scheduler took advantage of information regarding the specific cache banks of loads and stores, which is not available in the modular scheduling interface exposed by the TRIPS compiler. This knowledge would improve the MILP scheduler’s performance and would only require changes to the compatibility matrix C(V, N). Second, knowledge of limited resources was available to SPS, allowing it to defer to and interact with code generation to map movement-related instructions. Overall, these results show that our first-principles-based approach captures the architecture behavior in a general, and arguably cleaner, fashion than SPS’s indirect heuristics. Our MILP scheduler also consistently and appreciably outperforms GRST, a previous-generation TRIPS scheduler that did not model contention [138], as shown by the hatched bars in the figure.
Figure 5.10: Normalized percentage improvement in execution cycles of MILP scheduler compared to special-
ized scheduler.
On DySER, the MILP scheduler outperforms the specialized scheduler on all benchmarks, as shown in Figure 5.10, for a 64-unit DySER. Across the benchmarks, the MILP scheduler reduces individual block latencies by 38% on average. When the latency of DySER execution is the bottleneck, especially when there are dependencies between instances of the computation (as in the needle benchmark), this leads to significant speedups of up to 15%. We also implemented an extra DySER benchmark, described in Table 5.4, which elucidates the importance of latency mismatch. The specialized scheduler tries to minimize the extra path length at each step, exacerbating the latency mismatch of the short and long paths in the program. The MILP scheduler, on the other hand, pads the length of the shorter path to reduce latency mismatch, increasing the potential throughput and achieving a 4.2x improvement over the specialized scheduler. Finally, we also compared against manually scheduled code on a 16-unit DySER (since hand-scheduling for a 64-unit DySER is exceedingly tedious). The MILP scheduler always matched or outperformed it, by a small (<2%) margin.
The MILP scheduler matches or outperforms the PLUG hand-mapped schedules. It is able both to find schedules that force S_VC = 1 and to provide latency improvements of a few percent. Solver time is of particular note because PLUG’s DFGs are more complex; in fact, each DFG represents an entire application. The most complex benchmark, IPv4, contains 74 edges (24 more than any other) split between 30 mutually exclusive or multicast groups. Despite these difficulties, it completes in tractable time.
Result-2: Our MILP scheduler outperforms or matches the performance of specialized schedulers.
Finally, while our approach is general, in that we have demonstrated implementations across three disparate architectures and shown extensions to others, an open question remains regarding “universality”: what spatial architecture organization could render our framework ineffective? Overall, our general scheduler can form an important component of future spatial architectures.
CHAPTER 6
Case Study: Resource Allocation in Tiled Architectures
6.2 OVERVIEW
The problems that we explore are based on the work of Abts et al. [5] and Mishra et al. [132]. As in these works, we assume a simplified tiled many-core architecture where cores are laid out on a 2D plane and connected via an on-chip 2D-switched network that uses a deterministic, dimension-order routing algorithm, ensuring that all messages from node A to node B travel the same path. Our modeled architecture approximates systems such as the Tilera TILE-Gx8072 processor [2]. Figure 6.1 shows a high-level diagram of a tiled architecture.
Memory Controller Placement While current tiled architectures typically place memory controllers along the periphery, Abts et al. argue that future designs should distribute them amongst the tiles to reduce average latency and eliminate network hot-spots. Such a design co-locates memory controllers with their assigned cores, so the placement problem simply involves determining which cores are assigned memory controllers, as illustrated in Figure 6.1. However, for n cores and m memory controllers, there are (n choose m) possible ways of placing the memory controllers. Thus a 64-core design with 16 memory controllers has approximately 4.9 × 10^14 possible placements, which is well beyond what can be evaluated with typical simulation models. Our approach is to apply integer linear programming to determine memory controller placements that minimize network hot-spots.
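The size of this search space follows directly from the binomial coefficient; a quick check with Python’s standard library (the 4.9 × 10^14 figure is C(64, 16)):

```python
import math

# Number of ways to choose 16 of 64 tiles to host memory controllers.
placements = math.comb(64, 16)
print(f"{placements:.2e}")  # roughly 4.9e14, far beyond simulation reach
```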
We acknowledge that, as part of physical design, there is a second problem: placing tiles with multiple sizes and shapes (because of the added memory controller) could make floorplanning more difficult. We do not address this concern here; our approach is similar to how others have abstracted this problem [5].
Heterogeneous On-Chip Network Allocation Current tiled architectures use homogeneous interconnection networks, where each router and network link is provisioned identically, regardless of network traffic. Mishra et al. propose allocating resources to links and routers according to the load they will observe. Specifically, their heterogeneous mesh network is composed of two types of links, wide and narrow, and two types of routers, big and small, as illustrated in Figure 6.1.
An 8 × 8 mesh network requires 64 routers. Assuming 16 big and 48 small routers, there are (64 choose 48), or approximately 4.89 × 10^14, possible ways in which these routers can be placed in the network. If the assumption that routers can only be big or small is dropped, the solution space explodes further. Our approach is to apply integer linear programming to place big routers and wide links in a way that minimizes network contention.
Combined Problem Since memory controller placement affects network traffic patterns, and heterogeneous network allocation affects how much traffic a given router or link can handle before becoming a bottleneck, solving the two problems together should result in a much better overall solution. However, the interaction between the two subproblems results in a non-linear constraint, which requires finding a solution to a mixed integer non-linear program. While such a problem is incomputable in general [109] (no algorithm exists that can compute a solution), we show that our nonlinear model produces solutions that are within 13% of the optimal solution, and that a linear reformulation of the program can prove this solution’s optimality.
Problem Statement Our study focuses on improving throughput in a tiled architecture. Thus we focus on minimizing the network contention that can limit throughput and increase latency; specifically, we seek to minimize the worst-case load on any given link in the interconnect:
Determine the placement of memory controllers, router buffers, and link widths which minimizes the worst-case ratio, over all links, of per-link traffic to link resources.
We describe the modeling of the integer linear program by first describing the system abstractly, then writing logical constraints, and, where required, linearizing these constraints to match integer linear programming theory. The goal of the next three sections is to describe, from a modeler’s perspective, how to formulate the model in terms of the fundamental decision variables, how to formulate the constraints by linearizing logical constraints, and finally how to reason about and write the objective.
System and Workload Parameters To model our system, we must present, as input, the parameters which describe the system. For this model, the key parameters describe how messages are routed between two nodes in the system and how much traffic a memory request generates. To model network routing, we introduce the set Path(x_s, y_s, x_d, y_d), which contains all the links l that are used when routing messages from a source tile (x_s, y_s) to a destination tile (x_d, y_d). Note that this formulation is sufficient for any deterministic routing policy; different policies will simply result in different sets Path(x_s, y_s, x_d, y_d).
We further introduce two parameters, L_req and L_resp, which indicate the average number of network flits required by a memory request and response, respectively. These values in turn depend upon the cache block size, details of the coherence protocol, and the ratio of cache fills to writebacks, but we simplify the formulation to these two parameters.
Auxiliary Variables A key auxiliary variable, termed LoadOnLink(l), represents the traffic that traverses link l. This variable combines the system and workload parameters with the additional assumption that memory requests are uniformly distributed across the L2 cache banks and memory controllers.
LoadOnLink(l) = Σ_{(x,y,x',y') : l ∈ Path(x',y',x,y)} MC_xy · L_req
              + Σ_{(x,y,x',y') : l ∈ Path(x,y,x',y')} MC_xy · L_resp     (6.1)
for all l ∈ L
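Equation (6.1) can be checked concretely on a toy mesh. The sketch below assumes X-then-Y dimension-order routing and treats every tile as a request source, per the uniform-traffic assumption; the helper names are ours, not from the model:

```python
from collections import defaultdict

def xy_path(src, dst):
    """Directed links used by dimension-order (X-then-Y) routing."""
    (x, y), (xd, yd) = src, dst
    links = []
    while x != xd:
        nx = x + (1 if xd > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != yd:
        ny = y + (1 if yd > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def load_on_links(n, mcs, l_req, l_resp):
    """Per-link load on an n-by-n mesh: every tile sends requests to every
    memory controller, and each controller sends a response back, as in (6.1)."""
    load = defaultdict(float)
    tiles = [(x, y) for x in range(n) for y in range(n)]
    for mc in mcs:
        for t in tiles:
            for l in xy_path(t, mc):   # request traffic toward the controller
                load[l] += l_req
            for l in xy_path(mc, t):   # response traffic back to the tile
                load[l] += l_resp
    return load
```

On a 2 × 2 mesh with one controller at (0, 0), the link out of the controller carries all response flits and becomes the hot-spot, which is exactly what the placement objective tries to avoid.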
¹The big/small formulation incorrectly implies that there are exactly two router configurations. In fact, the router’s crossbar is different for each combination of wide/narrow input and output links; thus there are more possible crossbar configurations than routers in the systems we study.
We describe the model’s constraints in two parts. First, we show how to model the basic resource constraints that a solution must satisfy to be feasible. Second, we discuss some additional quality constraints that seek to improve the overall quality of the selected design.
Resource Constraints Any architectural design problem comes with a set of basic resource con-
straints. Our model focuses on placing a fixed number of memory controllers among the tiles.
Σ_{(x,y)} MC_xy = N_MC     (6.2)
This constraint simply states that the number of tiles that are allocated a memory controller is equal to the budget N_MC. Note that this constraint must be an equality, since the number of memory controllers is visible at the processor’s external interface.
Similar constraints can be stated to enforce the number of wide links, virtual channels, and
router buffer budgets.
Σ_{l∈L} W_l ≤ N_W     (6.3)

Σ_{l∈L} VC_l ≤ N_VC     (6.4)

Σ_{l∈L} B_l ≤ N_B     (6.5)
Note that these constraints can be inequalities, since the number actually allocated is not exposed at an external interface.
Quality Constraints Architects must take into account many factors when evaluating the goodness of a design. In our example, placing too many memory controllers too close together may result in thermal density (i.e., hot-spot) problems or complicate the global routing (e.g., from memory controller to pins). To address these concerns, it may be desirable to add constraints that spread out the memory controllers; for example, additional constraints can ensure that adjacent tiles, in rows and columns respectively, are not both assigned memory controllers.
Other quality constraints, which we omit for brevity, include minimum and maximum resource allocations. For example, each link requires at least one virtual channel per virtual network. Similarly, timing constraints may limit the maximum number of buffers that a single link can have without increasing the cycle time.
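The row/column spreading condition is easy to state as a predicate; a small hypothetical checker (not part of the MILP itself, but it encodes the same condition the constraints enforce):

```python
def spread_ok(mcs):
    """True if no two tiles assigned memory controllers are adjacent
    in a row or a column (the spreading quality constraint above)."""
    s = set(mcs)
    return not any((x + 1, y) in s or (x, y + 1) in s for (x, y) in s)
```

For example, spread_ok([(0, 0), (0, 2), (2, 0)]) holds, while spread_ok([(0, 0), (0, 1)]) does not, since those two tiles share a column edge.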
Initial Formulation Our overall objective is to improve system throughput by decreasing network contention. We adopt Abts et al.’s maximum channel load [5] as the figure of merit; thus our goal is to minimize the worst-case utilization of any link, simultaneously with respect to each of the resources allocated to that link. We define the utilization of a resource for a given link as the ratio of the load on the link to the amount of the resource allocated. For example, the utilization of virtual channels for link l is given by LoadOnLink(l) / VC_l. We can express our utilization goal through auxiliary constraints (6.8)–(6.10), which bound the per-link utilization of each resource from above. Here T, S, and W represent the maximum utilization of the various per-link resources, and we set our objective function as: minimize W + S + T.
The thing to note is that these auxiliary constraints are not linear, since they involve ratios of decision variables (the auxiliary variable LoadOnLink(l) is a function of the memory controller placement variables MC_xy). Thus our complete formulation is an example of a mixed integer nonlinear program (MINLP).
We can simplify the solution of the problem significantly if we are only interested in the memory controller placement problem. In this case, the network allocation variables W_l, VC_l, and B_l become input parameters rather than decision variables, and thus the auxiliary constraints become linear. We then have an integer linear program, which is much easier to solve. Similarly, if we are only interested in the heterogeneous network allocation problem [132], we can convert the memory controller placement variables MC_xy from decision variables to input parameters. This also makes the auxiliary constraints linear, again resulting in an integer linear program.
Linearized Formulation In our formulation, the only nonlinear constraints are of the form f(x, y, w) = w − xy ≤ 0. Thus each nonlinear term involves the product of two variables. Such terms are called bilinear terms, and the formulation is referred to as a Bilinear Program (BLP). Moreover, the bilinear terms in our formulation are products of one continuous and one integer variable. The technique described by Gupte et al., mentioned in Chapter 2 Section 2.3.5, converts such bilinear terms into linear terms [86]. Doing this allows one to use MILP techniques for solving BLPs.
Consider the constraint w ≤ xy. Assume that x is a continuous variable and y is an integer variable. Further assume that both are non-negative and bounded from above, i.e., 0 ≤ x ≤ a and 0 ≤ y ≤ b. Note that constraints (6.8), (6.9), and (6.10) in our formulation are of this form. The set of points satisfying this constraint can be represented as (where ℝ₊ and ℤ₊ are the non-negative reals and integers, respectively):

P = { (x, y, w) ∈ ℝ₊ × ℤ₊ × ℝ : w ≤ xy, x ≤ a, y ≤ b }     (6.11)
Using y’s binary expansion, we get y = Σ_{i=1}^{k} 2^{i−1} z_i, where the z_i are 0–1 integer variables and k = ⌊log₂ b⌋ + 1. By forcing v_i to equal x·z_i, we obtain the following set of linear constraints:

B = { (x, y, w, z, v) ∈ ℝ × ℤ × ℝ × {0,1}^k × ℝ^k :
      y = Σ_{i=1}^{k} 2^{i−1} z_i,   y ≤ b,   w ≤ Σ_{i=1}^{k} 2^{i−1} v_i,
      v_i ≥ 0,   v_i ≤ a·z_i,   v_i ≤ x,   v_i ≥ x + a·z_i − a,
      for all i ∈ {1, …, k} }     (6.12)

It can be shown that P = Proj_{x,y,w}(B), where Proj_{x,y,w} represents the projection operator that maps (x, y, w, z, v) ∈ B to (x, y, w). Note that B does not have any nonlinear term in its representation, and hence it is an exact linearization of P.
We used the approach described above to linearize our formulation: we replaced each constraint of the form w ≤ xy, specifically (6.8), (6.9), and (6.10), with the constraints used in defining B.
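The exactness of this linearization can be sanity-checked numerically: for any feasible (x, y), setting z to y’s binary expansion and v_i = x·z_i satisfies every constraint of B and recovers x·y as the bound on w. A small check (indexing from 0, so the weights appear as 2^i rather than 2^(i−1)):

```python
import math

def check_linearization(x, y, a, b):
    """Build (z, v) for a point with 0 <= x <= a (continuous) and
    0 <= y <= b (integer), verify the linear constraints of B, and
    return the linear upper bound sum(2^i * v_i) on the product x*y."""
    k = math.floor(math.log2(b)) + 1
    z = [(y >> i) & 1 for i in range(k)]      # binary expansion of y
    v = [x * zi for zi in z]                  # v_i = x * z_i
    assert y == sum(2**i * zi for i, zi in enumerate(z)) and y <= b
    for zi, vi in zip(z, v):                  # the linear constraints on v_i
        assert 0 <= vi <= a * zi and vi <= x and vi >= x + a * zi - a
    return sum(2**i * vi for i, vi in enumerate(v))
```

Here check_linearization(1.5, 5, 2.0, 7) returns 7.5, which equals 1.5 × 5, so the linear bound is tight at this point.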
6.7 EVALUATION
Our evaluation focuses on three important questions:
1. Can the intuitive but non-linear MINLP model be solved to optimality?
Table 6.2: Baseline system parameters.

Parameter                  Value
Processors                 64
Memory Controllers         16
Router Latency             2 cycles
Inter-router wire latency  1 cycle
Packet Size                1 flit for request, 5 flits for reply
Virtual Networks           1 for requests, 1 for responses
Virtual Channels           Varies with design and router
6.7.1 METHODOLOGY
We wrote the combined model in GAMS. This model describes an abstract representation of a tiled architecture, allowing us to solve for a mathematically optimal solution to the formulation above. To evaluate whether this solution actually improves performance for real applications, we simulate the execution of workloads drawn from the SPEC CPU2006 benchmark suite [1] using the detailed architectural simulator gem5 [25]. Table 6.2 describes the baseline parameters for the tiled architecture and network that are used in both the mathematical and simulation models.
6.7.2 RESULTS
MINLP Formulation We explored the solution space for the combined problem using Baron [164], an NLP solver, since no efficient methods are known for solving non-convex MINLPs [32]. The solver requires an initial solution to begin with, which is critical since it affects both the time required for finding a locally optimal solution and that solution’s quality with respect to the globally optimal solution. To get as close as possible to the optimal design, we seeded the solver with multiple different initial designs and experimented with the bounds on different variables. We selected the initial seeds by first solving the simpler memory controller placement and network allocation problems independently using the Gurobi MILP solver [87], and then using those assignments as the seeds. Baron was then able to find designs with improved objective function values.
Figure 6.2 illustrates the best design, referred to as opt, that we obtained from this solution process.

Figure 6.2: (a) Distribution of memory controllers and link widths; (b) buffer/virtual channel distribution.

Figure 6.2(a) illustrates the placement of the memory controllers and of the wide and narrow links across the mesh network; the filled boxes represent the positions of the memory controllers, and the bold lines represent the wide links. Figure 6.2(b) presents a heat map that illustrates the distribution of the virtual channels amongst the routers, where white squares indicate the maximum allocation and dark squares indicate the minimum allocation. It is important to note that neither the memory controller placement nor the allocation of links, virtual channels, and router buffers is the same in the combined problem as in the solutions to the respective individual problems.
Our opt design is quite different from the best design evaluated by Mishra et al. [132], which is illustrated in Figure 6.3. This diagonal design places the memory controllers on the diagonal nodes, leveraging the best solution found by Abts et al. [5] to the memory controller placement problem. It then places the big routers on the diagonal nodes (6 virtual channels/port) and connects them to their neighbors using wide links. Routers on non-diagonal nodes are small (2 virtual channels/port) and communicate using narrow links, except to neighboring big routers. Note that both the diagonal and the opt designs use the same number of virtual channels, wide and narrow links, and buffers.
Although our NLP solver was not able to guarantee that the opt solution is in fact optimal, it is able to provide a lower bound on the objective value of the optimal solution. This allows us to show that our opt design is within 13% of this lower bound. In comparison, Mishra et al.’s diagonal design is only within 55% of this lower bound on the optimal value.
Result-1: The MINLP model can be solved to within some degree of optimality.
Figure 6.3: Distribution of memory controllers, buffers, and link widths for the diagonal design.
Solution Performance To determine whether the opt design actually improves performance, we used the architectural simulator gem5 [25] to simulate a detailed architectural model, with multiprogrammed workloads drawn from the SPEC CPU2006 benchmark suite [1]. For each simulation, we randomly chose eight applications from the suite; since there are 64 cores in the simulated system, eight copies of each application were simulated, and applications were randomly mapped to the cores. We simulated 44 different combinations and mappings. Each simulation was allowed to run until every processor had executed at least 10,000,000 instructions. We then calculated aggregate IPC (the total number of instructions executed by all the processors divided by the total number of clock cycles). Figure 6.4 presents the aggregate IPC of the opt design normalized to that of the diagonal design (i.e., higher is better). We observed an average performance gain of about 10%, and no workload performed worse.
Result-3: The optimal solution enables real-world performance gains.
Figure 6.4: Relative IPC for SPEC applications. Each bar is the ratio of aggregate IPC for the opt design to that for the diagonal design.

6.8 RELATED WORK
On-chip Placement Prior works [12, 176] have focused on finding the best mapping of applications and data onto cores and memory controllers at execution time, while we have presented a design-time approach. Much work on on-chip placement has been done in the SoC domain; these works [95, 184] propose using genetic algorithms for generating solutions.
Xu et al. [182] also tackled the problem of placing memory controllers for chip multiprocessors. They solved the problem for a 4 × 4 CMP through exhaustive search. To find the best placement for the 8 × 8 problem, they exhaustively searched through solutions obtained by stitching together solutions for the 4 × 4 problem. This reduces the solution space that needs to be searched, but the idea is not generic: it assumes that the chip can be divided into smaller regions and that solutions for the smaller regions can be composed to get optimal solutions for larger regions, which may not hold true in general. Our approach of using mathematical optimization does not rely on any such assumption.
Theory The network design problem has been widely studied in theoretical computer science [77, 124], particularly with respect to designing distribution, transportation, telecommunication, and other types of networks. These works mainly focus on designing approximation algorithms for the different variants of the network design problem and on analyzing their theoretical complexity. We focus on on-chip network design and on-chip placement.
6.9 CONCLUSIONS
In this chapter, we looked at the problems of placing memory controllers in a tiled architecture and allocating resources to a heterogeneous on-chip network. We showed how to formulate these problems, and showed that the individual problems result in integer linear programs but that the combined problem results in a mixed integer non-linear program, for which finding an optimal solution cannot be guaranteed. We described how we explored the design space for the combined problem using an NLP solver, starting with optimal solutions to the individual problems. We described how the NLP solver can generate a lower bound on the optimal solution, and in our case showed that our best solution was within 13% of this bound. We then described a sophisticated linearization of the formulation, which allowed us to show that our best solution was in fact optimal. Finally, we used detailed architectural simulation to show that our optimal solution improves the throughput of multiprogrammed SPEC CPU2006 workloads by 10% on average.
CHAPTER 7
Conclusions
This book has presented an overview of the modeling technique called mathematical optimization and has shown how to apply it to solve complex computer architecture problems. We first described optimization in terms of its primitives and gave a broad overview of related optimization models. Though we did not cover all types of mathematical models in the field of optimization, our broad overview provides the background necessary for interested readers to learn more.
Focusing on one such expressive and efficiently solvable model type, Mixed Integer Linear Programming, we described fundamental modeling techniques relevant to a variety of systems and architecture problems. We then chose four architecture problems to study further, which gave a flavor of the types of problems encountered in the domain. We presented detailed case studies for each, showed how to formulate their problems with MILP, and demonstrated real-world practicality and applicability.
We conclude by summarizing the properties that make problems amenable to MILP, showing some of MILP’s limitations, describing how to decide whether MILP is appropriate for a given problem, and finally describing methods used to improve solve time.
Expressible The underlying problem must be expressible as a MILP, meaning that the inputs, constraints, and objectives must be mathematically describable within the format of a MILP. An example of inexpressibility is, for a scheduling problem, to maximize both the throughput and the latency: the best throughput and the best latency might not occur in the same schedule, so it is meaningless to “optimize for both.” A closely related but expressible model would be to optimize for the best latency under a constraint that the actual throughput is within 10% of the best throughput. (The best throughput would be calculated by one run of the model, followed by a second run with a different objective and an additional constraint.)
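The two-run pattern described above can be sketched over an enumerated set of hypothetical (throughput, latency) schedules: the first run finds the best throughput, and the second minimizes latency subject to the added throughput constraint:

```python
# Hypothetical candidate schedules as (throughput, latency) pairs.
schedules = [(100, 9), (98, 5), (92, 3), (80, 2)]

best_tp = max(tp for tp, _ in schedules)                  # run 1: best throughput
within = [s for s in schedules if s[0] >= 0.9 * best_tp]  # added constraint
chosen = min(within, key=lambda s: s[1])                  # run 2: best latency
```

Here chosen is (92, 3): it trades 8% of the best throughput for a much better latency, a trade-off the ill-posed “optimize for both” objective could not express.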
Tractable Finding the optimal value of a MILP formulation is NP-hard; unless P=NP, this means that solution algorithms scale worse than polynomially with problem size. While the size of the underlying linear program does affect solution times somewhat, it is often the number of integer variables that makes the problem hard or even intractable. Therefore, it is wise to limit these to the range of hundreds to tens of thousands at this time. Good formulations of the problem are also important, since the enumerative nature of the solution techniques can be improved by generating tight bounds or eliminating many possible alternatives quickly.
Linearity Simple linear relationships are naturally expressible in MILP, and many other conceptual conditions, like logical operations and piecewise-linear functions, can be expressed with auxiliary variables and constraints inside a MILP. Though general nonlinearity can be approximated through certain iterative techniques, a truly linear system will be more efficiently and exactly solvable. We again note that there are other optimization techniques which can directly operate on nonlinear relationships, but these involve complex trade-offs, like increased execution time, and often lack optimality guarantees. Exploring these techniques is out of the scope of this lecture.
Design Nature Optimization can be and has been used extensively in both design and operational situations. Design problems generally have a “static” nature, in that the problems can be solved off-line and all input parameters are known at the time of optimization. Resource allocation and compilation are good examples of work which is performed statically but whose output is used dynamically. This is usually a good fit for MILP, as the solution process can be lengthy but can provide optimality guarantees for problems where the solution is used many times. Operational problems typically require guaranteed bounds on computational time (typically expressed as being in P) and often involve changing data. In this setting, LP and convex programming are more applicable, as are gradient-based techniques that simply improve the current situation.
Poorly Formulated Problems Certain problems, even though easily formulated as a MILP, and which are also small enough and have static inputs, can nonetheless still be difficult to solve. This has to do with the nature of the problem itself: if there are many feasible points of similar quality to be evaluated, exploring them simply takes time. The intuition behind the mathematics is that the relaxation of certain formulations can be either close to or far from the integer hull of the problem. (Refer to Chapter 2, Section 2.4 for details on solution methods.) The closer the relaxation, the less work the solver must do in exploring the solution space before finding an optimal integral solution.
Perhaps the most critical component of the formulation is the strength of the root node
relaxation. Problems for which the objective value at the root node is close to the actual solution
value are much easier to solve. Adding problem-specific cuts (linear inequalities) to the formula-
tion to improve this bound can be very worthwhile.
Problems which have identical solutions due to permutations of a subset of the variables create difficulties for the branch-and-bound procedure. Identical machines are one such example. Reformulating the problem to determine how many machines to use (rather than assigning particular machines to specific tasks) is often effective. Imposing an ordering on identical machines also reduces symmetry.
Solution hints: Can extra constraints (cuts) be added to the model? Can the problem be formulated
differently, with less symmetry?
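The effect of such an ordering constraint can be seen in a small enumeration (an invented instance, not the book's datacenter study): any on/off pattern of n identical machines is equivalent to every permutation of it, and requiring a non-increasing order keeps exactly one representative per machine count.

```python
# Symmetry breaking over n identical machines: feasibility here means
# "at least `demand` machines are on" (an invented toy constraint).
from itertools import product

n = 4
demand = 2  # at least this many machines must be on

all_feasible = [y for y in product((0, 1), repeat=n) if sum(y) >= demand]
ordered = [y for y in all_feasible
           if all(y[j] >= y[j + 1] for j in range(n - 1))]

# Every distinct "number of machines on" is still represented...
assert {sum(y) for y in ordered} == {sum(y) for y in all_feasible}
# ...but the space the solver must search is much smaller.
assert len(ordered) < len(all_feasible)
print(len(all_feasible), "feasible patterns reduced to", len(ordered))
# prints: 11 feasible patterns reduced to 3
```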
Large Problem Size Every optimization problem has some appropriate time bound within which attaining a solution is meaningful and/or useful. A VLSI problem, for instance, may be solved in hours or days, as the answer does not need to be returned immediately. For problems in compilers, a range of up to seconds may be appropriate. The more quickly a solution is required, the smaller the problem must be. Genuinely large problems, with millions of equations and/or integer variables, cannot be tractably solved directly with MILP.
Solution hints: Can a problem be effectively solved using a divide-and-conquer technique via
multiple MILPs? Can a small instance of the problem be representative of the larger problem?
Dynamic Nature A dynamic optimization problem is one where not all pieces of the problem, generally the input parameters, are known at the time of optimization. This poses a challenge to the ability to formulate the optimization problem at all: how can one write a constraint about an unknown relationship?
Solution hints: Can stochastic programming or robust optimization techniques be applied to ac-
count for uncertain future dynamic events?
At first glance, this looks to be a very natural optimization problem which is expressible as a MILP and of tractable size. However, note that we are trying to find the required number of each service pattern which satisfies the total required services of each type. Calculating the total services of each type requires multiplying the number of each pattern by the number of services included in that pattern. This is a quadratic relationship, and serves as an example of how nonlinear constraints can arise in an otherwise linear model.
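When one factor of such a product is binary and the other is bounded, the standard linearization (often attributed to Glover; the bound U and harness here are invented) recovers linearity exactly:

```python
# Linearize w = x * y where x is binary and 0 <= y <= U, via four
# linear constraints on a new variable w:
#     w <= U*x,   w <= y,   w >= y - U*(1 - x),   w >= 0
U = 5

def linearized_feasible(x, y, w):
    return w <= U * x and w <= y and w >= y - U * (1 - x) and w >= 0

for x in (0, 1):
    for y in range(U + 1):
        feasible_w = [w for w in range(U + 1) if linearized_feasible(x, y, w)]
        # The constraints force w to equal the product exactly.
        assert feasible_w == [x * y]
```

A product of two general integer variables, as in the service-pattern example, has no such exact trick, which is why it breaks the linear model.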
first-order logic, then applying well-known techniques to map between logic and integer linear constraints, as demonstrated in Chapter 2. After determining the initial model, more complex and efficient formulations can be explored easily through successive modifications.
The other reason that the problem might not be expressible is that it is difficult to write an objective function which captures many disparate goals. If there is a clear prioritization of goals, then one strategy is to solve the same model multiple times, once for each objective, adding constraints which enforce previously found objective optima after each solve. This procedure is referred to as "pre-emptive" goal programming. If, however, a direct comparison of the objective values is required, "non-pre-emptive" techniques should be used. Here, the difficulty is in weighting the contribution of multiple objectives to the overall solution. If such a weighting exists, this is likely the preferable solution, as it does not introduce the need for repeated model solves, and the weights often capture the modeler's underlying objective more naturally.
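The pre-emptive procedure can be sketched over a finite design space (the design points and objective names below are invented for illustration): optimize the top-priority objective, fix its optimum as a constraint, then optimize the next objective among the survivors.

```python
# Toy "pre-emptive" goal programming: maximize performance first,
# then minimize power among the performance-optimal designs.
designs = [
    {"perf": 10, "power": 8},
    {"perf": 10, "power": 5},
    {"perf": 7,  "power": 2},
]

candidates = designs
# Priority 1: maximize performance.
best_perf = max(d["perf"] for d in candidates)
candidates = [d for d in candidates if d["perf"] == best_perf]
# Priority 2: among performance-optimal designs, minimize power.
best = min(candidates, key=lambda d: d["power"])

assert best == {"perf": 10, "power": 5}
```

In a real MILP, "filter the candidates" becomes "add the constraint perf = best_perf and re-solve".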
Large Problem Size Problems which are fundamentally large are very difficult to make amenable to MILP solutions. If MILP is to be applied in such a case, the problem needs to be broken into independent or hierarchical pieces, fixing some decisions at each step. While this can no longer guarantee global optimality, there may still be value in achieving optimality at each step.
Nonlinear Relationships Some nonlinear relationships, like the logical and piecewise-linear constraints covered in Chapter 2, Section 2.3.4, are naturally expressible in MILP through straightforward transformations. More general nonlinear relationships, however, are much more difficult to model directly. There are essentially two possibilities for using optimization on such a problem. The first is to approximate these functions with piecewise-linear functions, which are easily modeled (though more expensive) in MILP. The second approach is to apply a higher-order optimization technique like MICQP (mixed integer quadratically constrained programming) or even
more generally MINLP (mixed integer nonlinear programming). The trade-offs in using such techniques are many and complex, and are not covered in this book.
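The first option above can be sketched numerically (the function f(x) = x² on [0, 1] and the uniform breakpoints are invented for illustration): a piecewise-linear model interpolates between breakpoints, and its error shrinks as segments are added, at the cost of more variables in the MILP.

```python
# Piecewise-linear approximation of a nonlinear function by linear
# interpolation between uniform breakpoints.
def pwl_approx(f, x, n_segments, lo=0.0, hi=1.0):
    """Evaluate the piecewise-linear interpolant of f at x."""
    width = (hi - lo) / n_segments
    i = min(int((x - lo) / width), n_segments - 1)  # segment index
    x0, x1 = lo + i * width, lo + (i + 1) * width
    t = (x - x0) / (x1 - x0)
    return (1 - t) * f(x0) + t * f(x1)

f = lambda x: x * x
xs = [i / 1000 for i in range(1001)]
err = lambda n: max(abs(pwl_approx(f, x, n) - f(x)) for x in xs)

# More segments give a strictly better approximation (at more modeling cost).
assert err(8) < err(4) < err(2)
```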
Dynamic Problems Dynamic problems, where some pieces of the problem are unknown at formulation time, can be addressed if some properties of the uncertain variables are known. For instance, if we know that a memory latency is either 1, 10, or 100 cycles, with certain probabilities, we can still use linear programming by employing "stochastic programming" techniques. One strategy for solving this problem would be to formulate the constraints for each possible scenario of memory latency, and combine these models into one large MILP. The objective function must quantify the trade-offs between the various scenarios, either by incorporating probabilities and computing the average objective value, or by optimizing for the minimum or maximum objective value across scenarios. These techniques can multiply the size of the model several times over, so care must be taken in selecting representative scenarios.
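The probability-weighted variant above can be sketched with invented numbers (the cost model and probabilities are illustrative only): each latency scenario contributes its cost, weighted by its probability, and we pick the decision minimizing the expected total.

```python
# Scenario-weighted objective over memory-latency scenarios of 1, 10,
# and 100 cycles; the decision is a prefetch depth (all numbers invented).
scenarios = [(1, 0.6), (10, 0.3), (100, 0.1)]  # (latency, probability)

def cost(depth, latency):
    # Invented model: deeper prefetch hides latency but costs bandwidth.
    return max(latency - 10 * depth, 0) + 2 * depth

def expected_cost(depth):
    return sum(p * cost(depth, lat) for lat, p in scenarios)

best_depth = min(range(11), key=expected_cost)
assert best_depth == 1
```

In a full stochastic MILP the per-scenario constraints would be replicated rather than evaluated, which is exactly the model-size blowup the text warns about.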
Poorly Formulated Problems Formulations of appropriate size that nonetheless solve much too slowly may simply be poorly formulated. Sometimes, reformulation can improve solution efficiency by orders of magnitude. Though we can only scratch the surface of possible reformulation techniques, we offer some advice below.
When large degrees of symmetry are inherent in the problem, consider adding constraints which disallow certain otherwise feasible possibilities with known equivalents. Recall the datacenter allocation problem from Chapter 4, where we added a constraint enforcing that we only "turn on" identical machines in a pre-specified order. This symmetry-breaking constraint does not actually prevent attaining optimal solutions, because there is always an equivalent valid solution for any solution which we disallowed.
We also recommend learning "textbook" reformulation techniques, one of which we describe here as an example. Consider enforcing the logical constraint that any a_i implies b (∀i, a_i ⇒ b), where the a_i and b are binary variables. The "aggregated" form uses a single constraint,

    Σ_i a_i ≤ M b,

where M is the number of variables a_i, while the "disaggregated" form uses one constraint per variable:

    a_i ≤ b   for all i.
The aggregated constraint is worse in the sense that its relaxation is much further from the integer hull than that of the disaggregated constraint. In practice, even though the disaggregated version is composed of many constraints, the quality of the formulation invariably leads to faster (and sometimes much faster) solution times.
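A tiny numeric check (an invented instance) makes the tightness difference concrete: fix a_1 = 1 and ask for the smallest b each relaxation allows, with 0 ≤ b ≤ 1 fractional.

```python
# With a_1 forced on, the aggregated relaxation (sum_i a_i <= M*b) is
# satisfied by the fractional point b = 1/M, while the disaggregated
# constraints (a_i <= b for all i) already force the integer answer b = 1.
n = 4
M = n
a = [1, 0, 0, 0]  # a_1 is forced on; the rest relax to 0 at optimum

# Smallest b satisfying the aggregated constraint sum(a) <= M*b:
b_aggregated = sum(a) / M
# Smallest b satisfying a_i <= b for every i:
b_disaggregated = max(a)

assert b_aggregated == 0.25      # weak relaxation bound
assert b_disaggregated == 1      # matches the integer optimum
```

If the objective penalizes b, the aggregated root relaxation reports an objective far below the true optimum, and the solver must branch to close that gap.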
In some cases, the number of constraints needed to describe the convex hull is exponential in the problem data. A good example is the traveling salesman problem (Hamiltonian circuits), where "subtour elimination" constraints provide cuts that strengthen the node
relaxation problems. In practice, we solve the relaxation and then add a number of violated constraints of this form, trading off quality of formulation against size of formulation. These trade-offs should be made empirically with working models.
It is often better to introduce extra variables (that have physical significance) into the model and use them throughout the description of the constraints (as we did in our second example of the introduction, instruction scheduling: page 8). The solver then only has to treat the expression once; the modeler can provide additional information in terms of bounds on these variables, and sometimes branching strategies exploit these "higher level" variables (this can be formalized with priority branching strategies in some solvers).
Adding realistic bounds on as many of the problem variables as possible typically improves the performance of the solver. Even if some of these bounds are overly strict, they can help in determining an initial solution to the MILP and thereby checking the logic of the formulation. After solution, some bounds that are artificial, but active, can be relaxed and the problem re-solved.
Relaxing Integrality Enforcing integrality constraints requires branching in common MILP solving algorithms, so integral variables are generally much more costly than continuous variables. There are at least two circumstances where the integrality conditions on
variables can be trivially relaxed. The first case is when the integral variables depend only on other integral variables, through relationships with only integral coefficients. The integrality of these variables is essentially "free." Explicitly making these variables continuous in the formulation enables solvers to skip them as branch targets, potentially improving the solution time significantly.
The other scenario where integral variable constraints can be relaxed is when the variable is itself very large (≫ 1). Depending on the constraints involved, an integral variable in the thousands varying by ±1 may have very little effect on the optimal objective value. Relaxing the integrality of these variables can, on the other hand, significantly improve the performance of the solver for the same reason as before: these variables do not need to be considered as branch targets, reducing the search space.
Simple is often better. For example, how many segments are really necessary in a piecewise-linear approximation? Do we need to model each instruction, or can some be aggregated together as a group? Is the model a hierarchical model in which fixing some variables then makes the remaining variable choices much easier? Are all the variables measured in the same currency (scaling)? Often simplifications of this form can lead to faster solution times.
Tuning Solvers As general advice, we recommend always starting with default solver tuning parameters, especially when switching solvers or solver versions. Commercial solvers are fairly intelligent and adaptive. Moreover, adjusting parameters and witnessing speedups on a single problem can be misleading, or even meaningless: an apparent speedup on an individual problem may be mere chance rather than a true improvement from the tweaked solution strategy. Applying the tuning to a variety of similar problems can help, and some solvers support automatic tuning.
Adjusting Optimality Requirements In general, MILP solvers allow the user to specify bounds on required optimality. This means that if the solver can prove a certain degree of optimality, it will return the current incumbent solution. This can lead to huge improvements in performance for the obvious reason that the algorithm can stop early, omitting the exploration of much of the solution space. Making this trade-off is simple, and should be done on a case-by-case basis, depending on the requirements of the problem.
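The stopping rule behind such an optimality tolerance can be sketched as follows (function names are invented; real solvers expose this as a "MIP gap" parameter):

```python
# Stop once the relative gap between the incumbent objective and the
# best proven bound drops below a tolerance (minimization convention).
def relative_gap(incumbent, best_bound):
    """Relative optimality gap, guarded against division by zero."""
    return abs(incumbent - best_bound) / max(abs(incumbent), 1e-10)

def should_stop(incumbent, best_bound, tol=0.05):
    return relative_gap(incumbent, best_bound) <= tol

# Incumbent 103 with proven lower bound 100: gap is about 2.9%, so a 5%
# tolerance lets the solver return early with a certified near-optimum.
assert should_stop(103, 100, tol=0.05)
assert not should_stop(103, 100, tol=0.01)
```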
Heuristics for Initial Solutions Depending on the problem, a MILP solver may spend a significant portion of the solve time without a feasible solution (thereby limiting opportunities for pruning the branch-and-bound search tree). The solver may apply its own internal heuristics to try to find an integer feasible solution, but these are not always effective. Adding domain knowledge to the heuristic can be much more effective, and in these cases it may be a good idea to supply an initial starting solution to the solver. If heuristic procedures for the optimization problem already exist, it is usually very simple to convert their solutions to the modeling language format, and many solvers support this feature. This can significantly reduce the solution time, because the solver can eliminate branches which cannot possibly improve the objective value past the incumbent solution.
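The pruning effect of a supplied incumbent can be sketched with a tiny branch-and-bound for a knapsack instance (the instance is invented, and we suppose a problem-specific heuristic already found the value-220 packing): branches whose optimistic bound cannot beat the incumbent are discarded, so fewer nodes are explored.

```python
# Branch and bound on 0/1 knapsack, counting explored nodes with and
# without a heuristic incumbent supplied up front.
values = [60, 100, 120]
weights = [10, 20, 30]
capacity = 50

def upper_bound(i, value, room):
    # Optimistic bound: take remaining items fractionally (LP relaxation).
    for v, w in zip(values[i:], weights[i:]):
        take = min(1.0, room / w)
        value += v * take
        room -= w * take
        if room <= 0:
            break
    return value

def solve(i, value, room, incumbent, stats):
    stats["nodes"] += 1
    if i == len(values) or room == 0:
        return max(incumbent, value)
    if upper_bound(i, value, room) <= incumbent:
        return incumbent  # prune: this branch cannot beat the incumbent
    if weights[i] <= room:
        incumbent = solve(i + 1, value + values[i], room - weights[i],
                          incumbent, stats)
    return solve(i + 1, value, room, incumbent, stats)

cold, warm = {"nodes": 0}, {"nodes": 0}
best_cold = solve(0, 0, capacity, 0, cold)      # no starting solution
best_warm = solve(0, 0, capacity, 220, warm)    # heuristic incumbent

assert best_cold == best_warm == 220
assert warm["nodes"] < cold["nodes"]
```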
7.4 LESSONS LEARNED
In working on these problems and in writing this book, we have learned many things. We share a few of them, and our opinions, below, since they may be beneficial to others.
Expanding scope of problems targetable by MILP Many interesting design optimization problems in computer architecture benefit from MILP. As the theory behind solving techniques improves, we believe more problems will become practically solvable with MILP and other theories. As a specific example, after completing this work we came across the interesting problem of interconnect design space exploration in the photonic interconnect work of Koka et al. [106]. In that work, they use an "ad-hoc" methodology to study various networks and, since the optimal network is not known, define an abstract "perfect network" with certain properties to approximate more general designs. On close examination, we realized that this network design problem (and the optimal network problem) can be cast as a MILP problem driven by constraints of decibel loss and bandwidth loss, given the network communication patterns.
Spend time upfront to see if MILP seems feasible Tempering our previous point, we also feel
it is important to spend time upfront to understand the fundamental relationships between the
variables/system features to determine if the underlying problem is MILP-friendly. For example,
we initially felt memory controller scheduling could be cast as a MILP problem. But the nature of
the problem, in the sense that it operates on dynamic data, makes such a formulation less useful.
Invest time in tuning the problem and learning the theory MILP solvers and MILP theory abound with techniques for reformulating problems to make them "friendlier" for a MILP solver. We encourage readers to invest time in tuning the problem formulation if it seems intractable or too long-running as a MILP. For example, in the WSAP problem in the second case study, our initial formulation took excessively long in certain cases (more than the 20 min timeout). We then implemented the symmetry-breaking constraints, which reduced solver time by three orders of magnitude. We were able to do this transformation because of an understanding of the theory behind solvers. Though this book does not provide deep intuition about solvers, our goal is to have encouraged readers, and taught them enough about MILP and its utility and expressiveness, that they will invest time in learning the theory and become expert users of MILP.
If your formulation is looking too complicated, something is wrong In general, if you feel the MILP formulation is too "complicated" and appears non-intuitive, it is very likely you are modeling it the wrong way. Revisiting our spatial scheduling case study, the "complicated" formulation would schedule each operation onto a particular cycle to figure out the cycle-by-cycle resource contention, but the elegant formulation models utilization directly. Unsurprisingly, the elegant and intuitive formulation is also solved faster. Look for beauty in your formulation.
We leave the reader with a quote from Paul Dirac on equations, which is apt in the context of modeling and fitting model results to experimental data: "It seems that if one is working from
the point of view of getting beauty in one’s equations, and if one has really a sound insight, one is on a
sure line of progress.”
Bibliography
[1] Standard Performance Evaluation Corporation. http://www.spec.org/results. 110, 112
[5] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti. Achieving Predictable
Performance through Better Memory Controller Placement in Many-Core CMPs. In ISCA
’09. DOI: 10.1145/1555815.1555810. 102, 103, 107, 111
[6] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and
Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,
2006. 98
[7] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and
Applications. Prentice-Hall, Englewood Cliffs, NJ, 1993. 1, 23, 24, 28
[8] C. Alippi, W. Fornaciari, L. Pozzi, and M. Sami. A dag-based design approach for recon-
figurable vliw processors. In Proceedings of the conference on Design, automation and test in
Europe, DATE ’99, New York, NY, USA, 1999. ACM. DOI: 10.1145/307418.307504. 60
[9] F. Alizadeh and D. Goldfarb. Second-Order Cone Programming. Mathematical Program-
ming, 95:3–51, 2003. DOI: 10.1007/s10107-002-0339-5. 14
[10] S. Amarasinghe, D. R. Karger, W. Lee, and V. S. Mirrokni. A theoretical and practical
approach to instruction scheduling on spatial architectures. Technical report, MIT, 2002.
97, 98
[11] C. Ancourt and F. Irigoin. Scanning polyhedra with do loops. In Proceedings of the third
ACM SIGPLAN symposium on Principles and practice of parallel programming, PPOPP ’91,
pages 39–50, 1991. DOI: 10.1145/109625.109631. 98
[37] K. Chakrabarty. Test scheduling for core-based systems using mixed-integer linear pro-
gramming. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on,
19(10):1163 –1174, oct 2000. DOI: 10.1109/43.875306. 3
[41] R. Cottle, E. Johnson, and R. Wets. George B. Dantzig (1914–2005). Notices of the AMS,
54(3):344–362, 2007. 1
[43] J. Czyzyk, T. Wisniewski, and S. J. Wright. Optimization Case Studies in the NEOS
Guide. SIAM Review, 41:148–163, 1999. DOI: 10.1137/S0036144598334874. 14
[44] G. Dantzig. On the significance of solving linear programming problems with some in-
teger variables. Econometrica, Journal of the Econometric Society, 28(1):30–44, 1960. DOI:
10.2307/1905292. 38
[45] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton,
New Jersey, 1963. 23
[47] L. De Carli, Y. Pan, A. Kumar, C. Estan, and K. Sankaralingam. Plug: flexible lookup
modules for rapid deployment of new protocols in high-speed routers. In Proceedings of the
ACM SIGCOMM 2009 conference on Data communication, SIGCOMM ’09, pages 207–218,
2009. DOI: 10.1145/1592568.1592593. 75, 76, 92, 93
[48] A. Deb, J. M. Codina, and A. González. Softhv: a hw/sw co-designed processor with
horizontal and vertical fusion. In Proceedings of the 8th ACM International Conference on
Computing Frontiers, CF ’11, pages 1:1–1:10, 2011. DOI: 10.1145/2016604.2016606. 75
[50] E. W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathe-
matik, 1:269–271, 1959. DOI: 10.1007/BF01386390. 26
[54] A. E. Eichenberger and E. S. Davidson. Efficient formulation for optimal modulo sched-
ulers. In Proceedings of the ACM SIGPLAN 1997 conference on Programming language design
and implementation, PLDI ’97, pages 194–205, 1997. DOI: 10.1145/258915.258933. 98
[55] M. Ekpanyapong, J. Minz, T. Watewai, H.-H. Lee, and S. K. Lim. Profile-guided microar-
chitectural floor planning for deep submicron processor design. Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on, 25(7):1289 –1300, july 2006. DOI:
10.1109/TCAD.2005.855971. 3
[56] J. R. Ellis. Bulldog: a compiler for vliw architectures. PhD thesis, 1985. 97
[57] D. W. Engels, J. Feldman, D. R. Karger, and M. Ruhl. Parallel processor scheduling with
delay constraints. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete
algorithms, SODA ’01, pages 577–585, 2001. 98
[58] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon
and the end of multicore scaling. SIGARCH Comput. Archit. News, 39(3):365–376, June
2011. DOI: 10.1145/2024723.2000108. 75
[59] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural acceleration for general-
purpose approximate programs. In Proceedings of the 2012 45th Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, MICRO ’12, pages 449–460, Washington, DC,
USA, 2012. IEEE Computer Society. DOI: 10.1109/MICRO.2012.48. 75, 99
[60] F. Facchinei and J. S. Pang. Finite-Dimensional Variational Inequalities and Complementarity
Problems. Springer-Verlag, New York, New York, 2003. 14
[61] K. Fan, H. h. Park, M. Kudlur, and S. o. Mahlke. Modulo scheduling for highly customized
datapaths to increase hardware reusability. In Proceedings of the 6th annual IEEE/ACM
international symposium on Code generation and optimization, CGO ’08, pages 124–133, New
York, NY, USA, 2008. ACM. DOI: 10.1145/1356058.1356075. 98
[62] J.-W. Fang, C.-H. Hsu, and Y.-W. Chang. An integer linear programming based rout-
ing algorithm for flip-chip design. In Design Automation Conference, 2007. DAC ’07. 44th
ACM/IEEE, pages 606 –611, june 2007. DOI: 10.1109/TCAD.2008.2009151. 3
[64] P. Feautrier. Some efficient solutions to the affine scheduling problem. International Journal
of Parallel Programming, 21:313–347, 1992. DOI: 10.1007/BF01379404. 75, 97, 98
[66] M. L. Fisher. The Lagrangian Relaxation Method for Solving Integer Programming Prob-
lems. Management Science, 27:1–18, Dec. 1981. DOI: 10.1287/mnsc.27.1.1. 44
[67] L. Ford and D. R. Fulkerson. A suggested computation for maximal multicommodity net-
work flows. Management Science, 5(1):97–101, 1958. DOI: 10.1287/mnsc.5.1.97. 46
[68] L. R. Ford and D. R. Fulkerson. Flows in Networks. Princeton University Press, 1962. 23
[71] R. Fourer, D. M. Gay, and B. W. Kernighan. AMPL: A Modeling Language for Mathematical
Programming. Duxbury Press, Pacific Grove, California, 1993. 47
[72] C. Galuzzi and K. Bertels. The instruction-set extension problem: A survey. ACM Trans.
Reconfigurable Technol. Syst., 4(2):18:1–18:28, May 2011. DOI: 10.1145/1968502.1968509.
60
[73] C. Galuzzi, E. Panainte, Y. Yankova, K. Bertels, and S. Vassiliadis. Automatic selection of
application-specific instruction-set extensions. In Hardware/Software Codesign and System
Synthesis, 2006. CODES+ISSS ’06. Proceedings of the 4th International Conference, pages 160
–165, oct. 2006. DOI: 10.1145/1176254.1176293. 58
[75] P. Gilmore and R. Gomory. A Linear Programming Approach to the Cutting-Stock Prob-
lem. Operations Research, 9(6):849–859, 1961. DOI: 10.1287/opre.9.6.849. 46
[76] F. Glover. Improved linear integer programming formulations of nonlinear integer prob-
lems. Management Science, 22(4):455–460, 1975. DOI: 10.1287/mnsc.22.4.455. 39
[79] R. E. Gonzalez. Xtensa: A configurable and extensible processor. IEEE Micro, 20(2):60–70,
Mar. 2000. DOI: 10.1109/40.848473. 49
[82] M. Grant and S. Boyd. cvx Users’ Guide. Por Clasifcar, 2(build 711):1–72, 2011. 47, 48
[84] M. Guevara, B. Lubin, and B. C. Lee. Navigating heterogeneous processors with market
mechanisms. In High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th
International Symposium on, pages 95–106, 2013. DOI: 10.1109/HPCA.2013.6522310. 73
[85] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August. Bundled execution of recurring
traces for energy-efficient general purpose processing. In Proceedings of the 44th Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO-44 ’11, pages 12–23,
2011. DOI: 10.1145/2155620.2155623. 75
[86] A. Gupte, S. Ahmed, M. S. Cheon, and S. S. Dey. Solving mixed integer bilinear problems
using milp formulations. Technical report, 2012. DOI: 10.1137/110836183. 31, 40, 108
[87] Gurobi Optimization, Inc. Gurobi optimizer reference manual. http://www.gurobi.com,
2012. 110
[88] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers.
IEEE Micro, 31(4):6–15, 2011. DOI: 10.1109/MM.2011.77. 75
[89] W. E. Hart, C. Laird, J.-P. Watson, and D. L. Woodruff. Pyomo - Optimization Modeling
in Python. 2012. DOI: 10.1007/978-1-4614-3226-5. 47
[90] M. Hayenga, N. E. Jerger, and M. Lipasti. Scarab: a single cycle adaptive routing and
bufferless network. In MICRO 42, pages 244–254, 2009. DOI: 10.1145/1669112.1669144.
113
[91] S. Held, B. Korte, D. Rautenbach, and J. Vygen. Combinatorial optimization in vlsi design.
Combinatorial optimization methods and applications, 31:33–96, 2011. 3
[92] F. L. Hitchcock. The Distribution of a Product from Several Sources to Numerous Facilities.
Journal of Mathematical Physics, 20:224–230, 1941. 23
[93] A. Hoffmann, H. Meyr, and R. Leupers. Architecture Exploration for Embedded Processors
with Lisa. Kluwer Academic Publishers, Norwell, MA, USA, 2002. DOI: 10.1007/978-1-
4757-4538-2. 49
[94] Z. Huang, S. Malik, N. Moreano, and G. Araujo. The design of dynamically reconfigurable
datapath coprocessors. ACM Trans. Embed. Comput. Syst., 3(2):361–384, May 2004. DOI:
10.1145/993396.993403. 75
[95] W.-L. Hung, Y. Xie, N. Vijaykrishnan, C. Addo-Quaye, T. Theocharides, and M. Irwin.
Thermal-aware floorplanning using genetic algorithms. In International Symposium on Qual-
ity of Electronic Design, 2005., pages 634 – 639. DOI: 10.1109/ISQED.2005.122. 113
[96] R. Jeroslow. Representability in mixed integer programming, I: characterization results.
Discrete Applied Mathematics, 17:223–243, 1987. DOI: 10.1016/0166-218X(87)90026-6.
22
[97] I.-R. Jiang. Generic integer linear programming formulation for 3d ic partitioning. In
SOC Conference, 2009. SOCC 2009. IEEE International, pages 321 –324, sept. 2009. DOI:
10.1109/SOCCON.2009.5398032. 3
[98] R. Joshi, G. Nelson, and K. Randall. Denali: a goal-directed superoptimizer. In Proceedings
of the ACM SIGPLAN 2002 Conference on Programming language design and implementation,
PLDI ’02, pages 304–314, 2002. DOI: 10.1145/512529.512566. 98
[99] K. Kailas, A. Agrawala, and K. Ebcioglu. Cars: A new code generation framework for clus-
tered ilp processors. In Proceedings of the 7th International Symposium on High-Performance
Computer Architecture, HPCA ’01, pages 133–, 2001. DOI: 10.1109/HPCA.2001.903258.
97
[100] J. Kallrath, editor. Algebraic Modeling Systems: Modeling and Solving Real World Optimiza-
tion Problems. Springer Verlag, 2012. DOI: 10.1007/978-3-642-23592-4. 2
[102] N. Karmarkar. A New Polynomial Time Algorithm for Linear Programming. Combina-
torica, 4:373–395, 1984. DOI: 10.1007/BF02579150. 1
[103] A. B. Keha, I. R. de Farias, and G. L. Nemhauser. Models for representing piecewise linear
cost functions. Operations Research Letters, 32(1):44–48, Jan. 2004. DOI: 10.1016/S0167-
6377(03)00059-2. 38
[106] P. Koka, M. McCracken, H. Schwetman, C.-H. Chen, X. Zheng, R. Ho, K. Raj, and
A. Krishnamoorthy. A micro-architectural analysis of switched photonic multi-chip inter-
connects. In Computer Architecture (ISCA), 2012 39th Annual International Symposium on,
pages 153–164, 2012. DOI: 10.1109/ISCA.2012.6237014. 124
[107] T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by Direct Search: New Per-
spectives on Some Classical and Modern Methods. SIAM Review, 45(3):385–482, 2003.
DOI: 10.1137/S003614450242889. 3
[112] C. Kuip. Algebraic languages for mathematical programming. European Journal of Oper-
ational Research, 67:25–51, 1993. DOI: 10.1016/0377-2217(93)90320-M. 2
[114] F. Larumbe and B. Sansò. Optimal location of data centers and software components in
cloud computing network design. In Cluster, Cloud and Grid Computing (CCGrid), 2012
12th IEEE/ACM International Symposium on, pages 841–844, May. DOI: 10.1109/CC-
Grid.2012.124. 62
[115] J.-H. Lee, Y.-C. Hsu, and Y.-L. Lin. A new integer linear programming formulation for
the scheduling problem in data path synthesis. In Computer-Aided Design, 1989. ICCAD-
89. Digest of Technical Papers., 1989 IEEE International Conference on, pages 20 –23, nov
1989. DOI: 10.1109/ICCAD.1989.76896. 3
[117] C. E. Lemke. The dual method of solving the linear programming problem. Naval Research
Logistics Quarterly, 1(1):36–47, 1954. DOI: 10.1002/nav.3800010107. 1
[118] Y. Lin and L. Schrage. The global solver in the LINDO API. Optimization Methods and
Software, (4-5):657–668, 2009. DOI: 10.1080/10556780902753221. 31
[121] B. Lubin, J. O. Kephart, R. Das, and D. C. Parkes. Expressive power-based resource al-
location for data centers. In Proceedings of the 21st international jont conference on Artifical
intelligence, IJCAI’09, pages 1451–1456, San Francisco, CA, USA, 2009. Morgan Kauf-
mann Publishers Inc. 73
[122] S. Ma, N. Enright Jerger, and Z. Wang. Dbar: an efficient routing algorithm to support
multiple concurrent applications in networks-on-chip. In ISCA-38, pages 413–424, 2011.
DOI: 10.1145/2024723.2000113. 113
[123] S. Ma, N. D. E. Jerger, and Z. Wang. Whole packet forwarding: Efficient design of fully
adaptive routing algorithms for networks-on-chip. In HPCA, pages 467–478, 2012. DOI:
10.1109/HPCA.2012.6169049. 113
[124] T. L. Magnanti and R. T. Wong. Network Design and Transportation Planning: Models
and Algorithms. Transportation Science, 18:1–56, 1984. DOI: 10.1287/trsc.18.1.1. 114
[125] H. Markowitz and A. Manne. On the solution of discrete programming problems. Econo-
metrica: Journal of the Econometric, 25(1):84–110, 1957. DOI: 10.2307/1907744. 38
[126] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-up: increasing utilization
in modern warehouse scale computers via sensible co-locations. In Proceedings of the 44th
Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 ’11, pages
248–259, 2011. DOI: 10.1145/2155620.2155650. 63
[127] R. K. Martin. Large scale linear and integer optimization. Kluwer Academic Publishers,
Boston, MA, 1999. DOI: 10.1007/978-1-4615-4975-8. 39
[134] G. Mitra and C. Lucas. Tools for reformulating logical forms into zero-one mixed
integer programs. European Journal of Operational Research, 72:262–276, 1994. DOI:
10.1016/0377-2217(94)90308-5. 34
[135] T. Moscibroda and O. Mutlu. A case for bufferless routing in on-chip networks. In
Proceedings of the 36th annual international symposium on Computer architecture, ISCA ’09,
pages 196–207, 2009. DOI: 10.1145/1555815.1555781. 113
[136] B. A. Murtagh and M. A. Saunders. MINOS 5.0 User’s Guide. Technical Report SOL
83.20, Systems Optimization Laboratory, Department of Operations Research, Stanford
University, Stanford, California, 1983. 2
[137] M. N. Bennani and D. A. Menasce. Resource allocation for autonomic data centers using
analytic performance models. In Proceedings of the Second International Conference on Auto-
matic Computing, ICAC ’05, pages 229–240, Washington, DC, USA, 2005. IEEE Com-
puter Society. DOI: 10.1109/ICAC.2005.50. 72
[139] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. John Wiley
& Sons, New York, NY, 1988. 2, 24, 28, 40
[141] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 1999. DOI:
10.1007/b98874. 14
[142] E. Özer, S. Banerjia, and T. M. Conte. Unified assign and schedule: a new approach
to scheduling for clustered register file microarchitectures. In Proceedings of the 31st an-
nual ACM/IEEE international symposium on Microarchitecture, MICRO 31, pages 308–315,
1998. DOI: 10.1109/MICRO.1998.742792. 97
[143] C. Ozturan, G. Dundar, and K. Atasu. An integer linear programming approach for
identifying instruction-set extensions. In Hardware/Software Codesign and System Synthesis,
2005. CODES+ISSS '05. Third IEEE/ACM/IFIP International Conference on, pages 172 –
177, sept. 2005. 50
[144] J. Palsberg and M. Naik. ILP-based resource-aware compilation, 2004. 98
[145] P. M. Pardalos and M. G. C. Resende, editors. Handbook of Applied Optimization. Oxford
University Press, New York, NY, 2002. DOI: 10.1007/978-1-4757-5362-2. 3
[146] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H.-s. Kim. Edge-centric mod-
ulo scheduling for coarse-grained reconfigurable architectures. In Proceedings of the 17th
international conference on Parallel architectures and compilation techniques, PACT ’08, pages
166–176, 2008. DOI: 10.1145/1454115.1454140. 97
[147] W. Pugh. The Omega test: a fast and practical integer programming algorithm for
dependence analysis. In Supercomputing ’91. DOI: 10.1145/125826.125848. 98
[148] L. M. Rios and N. V. Sahinidis. Derivative-free optimization: a review of algorithms and
comparison of software implementations. Journal of Global Optimization, pages 1–47, July
2012. DOI: 10.1007/s10898-012-9951-y. 47
[149] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey,
1970. 2, 40
[150] N. Satish, K. Ravindran, and K. Keutzer. A decomposition-based constraint optimization
approach for statically scheduling task graphs with communication delays to multiproces-
sors. In DATE ’07. DOI: 10.1145/1266366.1266381. 97, 98
[151] L. Schrage. Optimization Modeling with LINGO. 1999. 31
[152] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1986. 2, 28,
40
[153] A. Schrijver. On the history of the transportation and maximum flow problems. Mathe-
matical Programming, 91:437–445, 2002. DOI: 10.1007/s101070100259. 23
[154] A. Sen, H. Deng, and S. Guha. On a graph partitioning problem with applications to VLSI
layout. In Circuits and Systems, 1991, IEEE International Symposium on, pages 2846–2849
vol. 5, June 1991. DOI: 10.1109/ISCAS.1991.176137. 3
[155] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming.
SIAM, 2009. DOI: 10.1137/1.9780898718751. 2, 14
[156] D. M. Shepard, M. C. Ferris, G. Olivera, and T. R. Mackie. Optimizing the
Delivery of Radiation to Cancer Patients. SIAM Review, 41:721–744, 1999. DOI:
10.1137/S0036144598342032. 3
[157] A. Smith, G. Constantinides, and P. Cheung. Integrated floorplanning, module-
selection, and architecture generation for reconfigurable devices. Very Large Scale
Integration (VLSI) Systems, IEEE Transactions on, 16(6):733–744, June 2008. DOI:
10.1109/TVLSI.2008.2000259. 3
[158] B. Speitkamp and M. Bichler. A mathematical programming approach for server con-
solidation problems in virtualized data centers. IEEE Transactions on Services Computing,
3(4):266–278, 2010. DOI: 10.1109/TSC.2010.25. 73
[159] K. Srinivasan, K. Chatha, and G. Konjevod. Linear-programming-based techniques for
synthesis of network-on-chip architectures. Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, 14(4):407–420, April 2006. DOI: 10.1109/TVLSI.2006.871762. 4
[160] S. Sutanthavibul, E. Shragowitz, and J. Rosen. An analytical approach to floorplan design
and optimization. In Design Automation Conference, 1990. Proceedings., 27th ACM/IEEE,
pages 187–192, June 1990. DOI: 10.1109/DAC.1990.114852. 3
[161] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In Proceedings of the
36th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, pages
291–, 2003. DOI: 10.1109/MICRO.2003.1253203. 75
[162] M. Tawarmalani and N. V. Sahinidis. Convexification and Global Optimization in Con-
tinuous and Mixed-Integer Nonlinear Programming: eory, Algorithms, Software and Appli-
cations, volume 69 of Nonconvex Optimization and its Applications. Kluwer Academic Pub-
lishers, Dordrecht, 2002. 31
[163] M. Tawarmalani and N. V. Sahinidis. A polyhedral branch-and-cut approach to global
optimization. Mathematical Programming, 103(2):225–249, 2005. DOI: 10.1007/s10107-
005-0581-8. 31
[164] M. Tawarmalani and N. V. Sahinidis. A polyhedral branch-and-cut approach to global
optimization. Math. Program., 103(2):225–249, June 2005. DOI: 10.1007/s10107-005-
0581-8. 110
[165] R. Thaik, N. Lek, and S.-M. Kang. A new global router using zero-one integer linear
programming techniques for sea-of-gates and custom logic arrays. Computer-Aided Design
of Integrated Circuits and Systems, IEEE Transactions on, 11(12):1479–1494, Dec. 1992. DOI:
10.1109/43.180262. 3
[166] M. Thuresson, M. Själander, M. Björk, L. Svensson, P. Larsson-Edefors, and P. Stenström.
FlexCore: Utilizing exposed datapath control for efficient computing. In IC-SAMOS 2007.
DOI: 10.1109/ICSAMOS.2007.4285729. 75
[168] P. Van Hentenryck. The OPL Optimization Programming Language. MIT Press, 1999. 47
[173] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior, volume 2
of Princeton Classic Editions. Princeton University Press, 1944. 1
[178] H. P. Williams. Model Building in Mathematical Programming. John Wiley & Sons, third
edition, 1990. 38
[179] L. A. Wolsey. Integer Programming. John Wiley & Sons, New York, NY, 1998. 2
[180] T.-H. Wu, A. Davoodi, and J. Linderoth. GRIP: Scalable 3D global routing using integer
programming. In Design Automation Conference, 2009. DAC ’09. 46th ACM/IEEE, pages
320–325, July 2009. DOI: 10.1145/1629911.1629999. 3
[181] J. Xu, M. Zhao, J. Fortes, R. Carpenter, and M. Yousif. Autonomic resource manage-
ment in virtualized data centers using fuzzy logic-based approaches. Cluster Computing,
11(3):213–227, Sept. 2008. DOI: 10.1002/9780470382776. 72
[182] T. C. Xu, P. Liljeberg, and H. Tenhunen. Optimal memory controller placement for chip
multiprocessor. In CODES+ISSS, pages 217–226, 2011. DOI: 10.1145/2039370.2039405.
113
[183] P. Yu and T. Mitra. Scalable custom instructions identification for instruction-set
extensible processors. In Proceedings of the 2004 international conference on Compilers,
architecture, and synthesis for embedded systems, CASES ’04, pages 69–78, 2004. DOI:
10.1145/1023833.1023844. 60
[184] W. Zhou, Y. Zhang, and Z. Mao. Pareto-based multi-objective mapping of IP cores onto
NoC architectures. In IEEE Asia Pacific Conference on Circuits and Systems, 2006, pages
331–334. DOI: 10.1109/APCCAS.2006.342418. 113
[185] X. Zhu, C. Santos, D. Beyer, J. Ward, and S. Singhal. Automated application compo-
nent placement in data centers using mathematical programming. International Journal of
Network Management, 18(6):467–483, 2008. DOI: 10.1002/nem.707. 73
Authors’ Biographies
TONY NOWATZKI
Tony Nowatzki is a graduate student at the University of Wisconsin-Madison, working as a
research assistant in the Vertical Research Group. His research centers on computational
accelerators from a design-exploration and comparison perspective. His broader interests include
architecture and compiler co-design. He is a student member of IEEE. He holds a Bachelor's
degree in Computer Science and Computer Engineering from the University of Minnesota and a
Master's degree in Computer Science from UW-Madison.
MICHAEL FERRIS
Michael Ferris is a Professor at the University of Wisconsin-Madison in the department of com-
puter sciences. His research is concerned with algorithmic and interface development for large
scale problems in mathematical programming, including links to the GAMS and AMPL model-
ing languages, and general purpose software such as PATH, NLPEC, and EMP. He has worked
on several applications of both optimization and complementarity, including cancer treatment
plan development, radiation therapy, video-on-demand data delivery, economic and traffic
equilibria, and structural and mechanical engineering. Ferris is a SIAM Fellow and an INFORMS
Fellow, received the Beale-Orchard-Hays Prize from the Mathematical Programming Society,
and is a past recipient of an NSF Presidential Young Investigator Award and a Guggenheim
Fellowship.
He serves on the editorial boards of Mathematical Programming, SIAM Journal on Optimiza-
tion, Transactions of Mathematical Software, and Optimization Methods and Software.
KARTHIKEYAN SANKARALINGAM
Karthikeyan Sankaralingam is an Associate Professor at the University of Wisconsin-Madison
in the department of computer sciences. He leads the Vertical Research group at UW-Madison,
exploring a vertically integrated approach to microprocessor design. His research has developed
widely cited results on Dark Silicon, hardware specialization in the DySER architecture, and
novel generalizations of GPUs. He is a recipient of the NSF Career Award in 2009 and the
IEEE TCCA Young Computer Architect Award in 2011. He is an IEEE Senior Member. He
received his Ph.D. and M.S. from the University of Texas at Austin, and his Bachelor's degree from the
Indian Institute of Technology, Madras.
CRISTIAN ESTAN
Cristian Estan is an architect at Broadcom Corporation, where he works on coprocessors
performing critical tasks for networking infrastructure such as packet classification, forwarding lookups,
and deep packet inspection. He has achieved major reductions in power consumption and cost
per bit and increases in functionality through algorithmic and architectural innovation. He has
received the Broadcom CEO achievement recognition award (2013), PLDI distinguished paper
award (2013), NSF CAREER award (2006), ACSAC best paper award (2006) and UCSD CSE
PhD dissertation award (2004). Earlier he worked at NetLogic Microsystems, taught at the CS
Department of University of Wisconsin-Madison and had shorter stints at various startups. He
has published 30 research papers at selective peer-reviewed venues in the fields of computer
networking, security, systems, programming languages, and databases, and is an inventor on 14
patents and patent applications.
NILAY VAISH
Nilay Vaish is a graduate student at the University of Wisconsin-Madison, working as a research
assistant in the Multifacet Group. He has a Bachelor's degree from the Indian Institute of
Technology, Delhi, and a Master's degree from UW-Madison.
DAVID WOOD
David Wood is a Professor and Romnes Fellow in the Computer Sciences Department at the
University of Wisconsin, Madison. He also holds a courtesy appointment in the Department of
Electrical and Computer Engineering. He received a B.S. in Electrical Engineering and Com-
puter Science (1981) and a Ph.D. in Computer Science (1990), both at the University of Cali-
fornia, Berkeley. He joined the faculty at the University of Wisconsin in 1990. Dr. Wood was
named an ACM Fellow (2005) and IEEE Fellow (2004), received the University of Wiscon-
sin’s H.I. Romnes Faculty Fellowship (1999), and received the National Science Foundation’s
Presidential Young Investigator award (1991). Dr. Wood is Area Editor (Computer Systems) of
ACM Transactions on Modeling and Computer Simulation, is Associate Editor of ACM
Transactions on Architecture and Code Optimization, served as Program Committee Chairman
of ASPLOS-X (2002), and has served on numerous program committees. Dr. Wood is a member
of the IEEE Computer Society, has published over 70 technical papers, and is an inventor on
thirteen U.S. and international patents.