
Practical Aggregation of Semantical Program Properties

for Machine Learning Based Optimization


Mircea Namolaru, Albert Cohen, Grigori Fursin, Ayal Zaks, Ari Freund

To cite this version:


Mircea Namolaru, Albert Cohen, Grigori Fursin, Ayal Zaks, Ari Freund. Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization. International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES'10), Oct 2010, Scottsdale, United States. inria-00551512

HAL Id: inria-00551512


https://hal.inria.fr/inria-00551512
Submitted on 4 Jan 2011

Practical Aggregation of Semantical Program Properties
for Machine Learning Based Optimization

Mircea Namolaru
IBM Haifa Research Lab
[email protected]

Albert Cohen
INRIA Saclay and LRI, Paris-Sud 11 University
[email protected]

Grigori Fursin
INRIA Saclay and University of Versailles
[email protected]

Ayal Zaks
IBM Haifa Research Lab
[email protected]

Ari Freund
IBM Haifa Research Lab
[email protected]

ABSTRACT
Iterative search combined with machine learning is a promising approach to design optimizing compilers harnessing the complexity of modern computing systems. While traversing a program optimization space, we collect characteristic feature vectors of the program, and use them to discover correlations across programs, target architectures, data sets, and performance. Predictive models can be derived from such correlations, effectively hiding the time-consuming feedback-directed optimization process from the application programmer.

One key task of this approach, naturally assigned to compiler experts, is to design relevant features and implement scalable feature extractors, including statistical models that filter the most relevant information from millions of lines of code. This new task turns out to be a very challenging and tedious one from a compiler construction perspective. So far, only a limited set of ad-hoc, largely syntactical features have been devised. Yet machine learning is only able to discover correlations from information it is fed with: it is critical to select topical program features for a given optimization problem in order for this approach to succeed.

We propose a general method for systematically generating numerical features from a program. This method puts no restrictions on how to logically and algebraically aggregate semantical properties into numerical features. We illustrate our method on the difficult problem of selecting the best possible combination of 88 available optimizations in GCC. We achieve 74% of the potential speedup obtained through iterative compilation on a wide range of benchmarks and four different general-purpose and embedded architectures. Our work is particularly relevant to embedded system designers willing to quickly adapt the optimization heuristics of a mainstream compiler to their custom ISA, microarchitecture, benchmark suite and workload. Our method has been integrated with the publicly released MILEPOST GCC [14].

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Compilers

General Terms
Performance, Languages, Algorithms

1. INTRODUCTION AND RELATED WORK
Sophisticated search techniques to optimize programs or improve default compiler heuristics have been proposed to cope with the complexity of modern computing systems [34, 29, 24, 8, 4, 31, 7, 20, 27, 16, 17, 12]. These techniques are already used in industry [10, 1, 18, 9], require little knowledge of the underlying hardware, and can adapt to new environments. However, they are still very restrictive in practice due to an excessively large number of evaluations (recompilations and runs). Machine learning (ML) was introduced to make such search techniques practical and reduce optimization time by enabling optimization knowledge reuse [25, 30, 2, 6]. These studies rely on a quantitative characterization of a program to build associations between similar programs and similar optimization spaces. Such a characterization is presented by a vector of floating point numbers, called numerical features (for example, the average basic block size may be one such numerical feature). These vectors provide the base for defining different optimization heuristics, cost models, and more. It is of critical importance for ML techniques to capture program similarities that effectively correspond to similarities in program optimizations.

Compiler experts have been responsible for identifying the quantitative program characteristics relevant for the problem being addressed. For some extensively investigated optimizations, including unrolling, inlining, scheduling and register allocation, several static heuristics based on numerical features were designed. These heuristics involve analytical cost models to provide quantitative estimates of the effects of an optimization [26]. Beyond analytical models, empirical and feedback-directed approaches have also been proposed to guide optimization experts and to help compiler designers [28].

One of the first statistical ML techniques used successfully for solving several compiler optimization problems is presented in [23]. The information required by the ML component is a vector of numerical features. We note that the optimizations addressed (unrolling, inlining, register allocation, scheduling, etc.) all have well-known static heuristics from which these numerical features were drawn by a compiler expert [26].
Some optimization interferences are almost impossible to predict by a compiler expert. Optimizing performance by tuning optimization flags may be somewhat accessible to an expert for well-understood application characteristics [35], but it quickly becomes intractable when dealing with the fine-tuning of more obscure optimization passes. Besides, this task is entirely dependent on the availability of quantitative features of the program, and on their relevance to the optimization problem. To address this challenge, we experimented with extensive sets of numerical features. This led us to consider feature extraction as a general translation problem from a given program representation into numerical feature spaces. Unlike ordinary program properties maintained in compiler internals, numerical features must be comparable across different programs and target architectures. Cross-program and cross-target comparability is necessary for the correlations to be statistically representative, hence for ML predictions to be robust.

One important comparability requirement is that the size of the numerical feature vector be constant. As the number of variables, instructions, loops, basic blocks etc. in a program varies, the information about their properties needs to be aggregated. This implies that we might provide inaccurate information to the machine learning component in some cases. To address this problem, we consider more sophisticated, semantically rich properties. For instance, such a property for a given loop may be whether the loop is countable, consists of a single basic block, and contains no store instructions. The flexibility required for supporting such complex properties is achieved by an underlying generative mechanism that allows the derivation of complex properties from simpler ones.

A given representation of the program is translated into numerical features in two stages. First we translate the program representation into an intermediate form that contains the basic properties of the program. Then, the second stage performs the derivation of more complex properties, as well as the aggregation needed in order to finally extract the previously established number of numerical features.

The basic properties of a program appear in the compiler's internal representation at compilation time. These properties, extracted from the program in the first stage, determine the possible features that can be derived in the second stage. We therefore designed the first stage to extract an exhaustive coverage of the compiler's global data structures representing the program being compiled.

In our approach, the compiler expert is responsible for choosing the basic properties to be extracted from the program, and in this way defines the space of possible features that can be derived from them. We will demonstrate how to use header files of the compiler to extract basic properties of the program. This approach can be automated, facilitating the complete automation of the feature extraction process.

In a machine learning compiler, numerical features are the quantitative links between the properties of the program and the predictive models that complement (or substitute for) human-crafted heuristics. Identifying the factors that affect the performance of a given optimization is a time-consuming task. In addition, a human can consider only a simplified model of the program, where many characteristics are ignored. In contrast, machine learning techniques are able to process huge amounts of data, and may work with a much more detailed and accurate model of a program.

We believe that compiler expertise is still required, but at a different level. For instance, the structure of the control-flow graph (CFG) may affect the output of a given optimization. But instead of requiring the compiler expert to point to the characteristics of the CFG that play a role in the decision, compiler expertise is employed to generate a space of candidate features that can be derived from the CFG; the machine learning component is responsible for deriving the most relevant characteristics. Without the machine learning component, the compiler expert would have to resort to a simpler predictive model, with a high probability of missing important correlations.

In our view, the problem of automatically generating numerical features consists of automatically inferring properties of the program and automatically aggregating these properties into features. These two sub-problems are reminiscent of automated theorem proving: starting from a set of basic properties, inference rules can be designed to infer all possible properties. This is precisely the approach we follow. We currently rely on a semi-automatic, logic programming approach to drive the inference towards a bounded set of features; a more futuristic direction would be to further automate the process, synthesizing new features on demand.

In our approach the program is viewed as a labeled graph, and Datalog [32], a first-order logic notation, is used for representing this graph. This provides an alternative view of the program as a deductive database (an extension of a relational database). The features are provided by evaluating Datalog (or Prolog) queries over this database.

To the best of our knowledge, the only work taking a similar view and generating program features automatically from an intermediate representation was introduced recently in [22]: the program is represented by an XML database, and features are provided by evaluating XQuery expressions over this database. We note that only a single major compiler data structure, the IR (intermediate representation), is processed there. The IR used is basically a three-address representation; as a graph, this is a tree with a fixed hierarchical structure. Our work addresses several major compiler data structures (besides the IR), represented as graphs with more complex structures. In addition, we provide techniques for translating the program information into a Datalog representation, further used to generate the features.

It was already shown [33] that a Datalog representation is suitable even for complex compiler analyses. Inferring new program properties (further to be aggregated into features) in fact requires performing compiler analysis, and the XML representation seems less appropriate for this. Furthermore, by viewing the program as a labeled graph represented in Datalog notation, we can take advantage of the related body of work on graph (and multi-relational) data mining and ILP (inductive logic programming). We define the space of possible features; this space is huge and an exhaustive exploration is not possible. Similarly to [21], we show how this space can be structured and its structure used for effective exploration.

Based on the techniques presented in this paper, we implemented a feature extractor for the GCC compiler, and applied supervised ML techniques for learning optimal settings of the flags. We evaluate our approach on several platforms using combinations of all available compiler optimizations, making it a practical and realistic approach.

Typical machine learning compilers [25, 30, 2, 6, 11] are composed of two main phases, as shown in Figure 1: a training phase and a prediction phase. In the training phase, optimization tools gather information about the structure of a training set for different programs, architectures, data sets, etc. The tools extract program features, apply different combinations of optimizations to each program, profile and execute the resulting variants, and record the speedups. A predictive model is then built by correlating program features, optimizations and speedups. In the prediction phase, features of a new program are extracted and fed into the predictive model, which suggests a "good" combination of optimizations, with the goal of reducing execution time or meeting other optimization objectives such as code size and power consumption.
[Figure 1: Typical machine learning scenario to predict "good" optimizations for programs. During a training phase (from left to right) a predictive model is built to correlate complex dependencies between program structure and candidate optimizations. In the prediction phase (from right to left), features of a new program are passed to the learned model and used to predict combinations of optimizations.]

Such techniques show great potential but require a large number of compilations and executions as training examples. Moreover, although program features are one of the key components of any machine learning approach, little attention has been devoted so far to ways of extracting them from program semantics.

2. FEATURE EXTRACTION
We may consider a program as being characterized by a number of entities. Some of these entities are a direct mapping of similar entities defined by the specific programming language, while others are generated during compilation. These entities include:

• functions;
• instructions and operands;
• variables;
• types;
• constants;
• basic blocks;
• loops;
• compiler-generated temporaries.

2.1 Relational View of a Program
A relation over one or more sets of entities is a subset of their Cartesian product. Relations can be used to express statements about tuples of entities, i.e., they define predicates. For example, we can define a relation opcode = {(i_k, op_l) | instruction i_k has opcode op_l} ⊆ I × OPS, where I is the set of program instructions and OPS is the set of all opcodes. Then, the statement opcode(i_k, op_l) is the claim that instruction i_k has opcode op_l. This statement is true or false depending on the set of pairs constituting the relation opcode. As another example, the relation in ⊆ I × B, in = {(i_k, b_l) | instruction i_k is in basic block b_l}, where B is the set of basic blocks, expresses the membership of instructions in basic blocks.

During compilation, more complex relations among entities are computed, providing supplementary information about the program being compiled. Some of these relations, common to almost every optimizing compiler, are:

• call graph;
• control flow graph;
• loop hierarchy;
• control dependence graph;
• dominator tree;
• data dependence graph;
• liveness information;
• availability information;
• anticipatability information;
• alias information.

For example, the control flow graph can be viewed as a relation over pairs of basic blocks. New entities and relations relevant to specific optimizations of interest should be considered. For instance, if information concerning register pressure is important, new entities and relations such as live ranges and the interference graph, respectively, need to be considered.

Furthermore, the language in which the application is written also gives rise to entities and relations worth considering. As an example, a class entity and a class hierarchy graph (CHG) relation are relevant for programs written in object-oriented languages such as C++.

We prefer to focus on generic compilation entities and relations (such as the ones enumerated above) over entities and relations that are specific to certain compilers. The features we consider are thus defined in generic compilation terms, ensuring that our work is portable across different optimizing compilers.

We restrict our attention and extract only binary relations from the program. This is not restrictive, as every k-ary relation can be expressed by a set of k + 1 binary relations. This assumption implies a graphical representation of the relations, a labeled graph (i.e., a semantic network). The labels of the vertices are provided by the entities and the labels of the edges are provided by the relations. For a relation r ⊆ E1 × E2, a fact r(a, b) is represented by two nodes with labels E1 and E2 respectively, connected by an edge with label r.

In this program graph, important subgraphs correspond to major compiler data structures such as the CFG, def-use chains, the IR (intermediate representation), etc. In order to take advantage of their specific properties, we may consider each of these subgraphs separately.

In conclusion, a program may be represented as a collection of (binary) relations over sets of entities, i.e., as a relational database. Our first step is therefore to provide such a representation from the compiler's data structures. We use the Datalog language [32] for this task, as we describe next.
2.2 Datalog
We use the Datalog logic-based notation to describe relations. Datalog is a Prolog-like language, but with more restricted semantics, suitable for expressing relations and operations between them [3], [32]. Datalog allows us to provide rules for defining and computing new relations from existing ones.

The elements of Datalog are atoms of the form p(X1, ..., Xn), where p is a predicate and X1, ..., Xn are variables or constants. By convention, names beginning with lower case letters are used for constants and predicates, while names beginning with upper case letters are used for variables. A ground atom is a predicate with only constants as arguments.

A Datalog database consists of a list of rules. Each Datalog rule has the form H :- B1, B2, ..., Bn, where H, B1, ..., Bn are atoms. H is called the head of the rule, and B1, B2, ..., Bn form the body of the rule. The body of the rule is optional (i.e., n >= 0). Bodyless rules are called facts, and can be used to define relations by explicit enumeration. For example, the two facts x(1, 2) and x(3, 5) define x as the relation {(1, 2), (3, 5)}. Rules with bodies serve to infer the head relation from the body relations, meaning that whenever we substitute constants for the variables in the atoms and this substitution makes all the body predicates true, then the head predicate must also be true.

A Datalog query has the form :- B1, B2, ..., Bn, where B1, ..., Bn are atoms. An answer to a given query is a set of constants that, substituted for the variables in the atoms, makes all the predicates appearing in the query true. A query may result in many answering substitutions.

To obtain a Datalog representation of the program, we enumerate the elements of every entity of interest: variables V = v1, v2, ..., types T = t1, t2, ..., instructions I = i1, i2, i3, ..., basic blocks B = b1, b2, b3, ..., etc. We then extract from the compiler's data structures relations over these entities. For example, we specify the relation in ⊆ I × B, in = {(i_k, b_l) | instruction i_k is in basic block b_l} by a sequence of Datalog ground atoms of the form in(i_k, b_l).

Datalog is able to work with relations and perform operations on them whose results are in turn relations as well. All standard relational algebra operations [32] are expressible, the most useful (for our purposes) being the conjunction (join) of two relations. For instance, starting with the relations store and in, Datalog can compute the relation st_in ⊆ I × B formed from all pairs (i, b) such that instruction i is a store instruction in basic block b. In Datalog this computation is triggered by the rule

st_in(I, B) :- store(I), in(I, B).
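As a minimal, self-contained sketch (ours, not taken from the MILEPOST sources; the instruction and block constants i1, i2 and b1 are hypothetical), the following program can be loaded into a Datalog or Prolog engine:

store(i2).
in(i1, b1).
in(i2, b1).

st_in(I, B) :- store(I), in(I, B).

The query :- st_in(I, B). then yields the single answer substitution I = i2, B = b1, i.e., the relation st_in = {(i2, b1)}.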

2.3 Automatic Inference of New Relations
Given a set of basic relations (such as those listed in Section 2.1), further useful relations can be inferred, including very complex ones. For example, Whaley and Lam [33] were able to perform interprocedural context-sensitive alias analysis using Datalog inference. Although, as a general rule, it is impractical to infer very complex relations automatically, it is still useful to infer new relations easily with Datalog, albeit of limited complexity.

The main operation we use for relation inference is the joining of two relations: given two relations r ⊆ E1 × ··· × Ek and p ⊆ F1 × ··· × Fl such that some of the Es are identical to some of the Fs, we select a nonempty subset I of pairs of identical entities and essentially concatenate the two relations, with the common entities (in I) appearing only once. The simplest way to explain this is through a Datalog example. Suppose the two relations are r ⊆ E1 × E2 × E3 and p ⊆ F1 × F2 × F3 such that E2 = F1 and E3 = F2. Then we can join the two relations in the following three ways.

rel1(E1, E2, E3, F2, F3) :-
    r(E1, E2, E3), p(E2, F2, F3).
rel2(E1, E2, E3, F1, F3) :-
    r(E1, E2, E3), p(F1, E3, F3).
rel3(E1, E2, E3, F3) :-
    r(E1, E2, E3), p(E2, E3, F3).

By repeated joining, starting from a set of basic relations, we can obtain new relations of increasing complexity. As the example shows, this is straightforward to automate. In a practical setting, though, the number of relations and their complexity must be kept within limits. For example, we may limit the number of joinings that lead to a relation, the number of times any relation may appear in such a sequence, the arity of the resulting relation, and more.
2.4 Extracting Relations from Programs
During compilation, a compiler maintains an internal representation of the program being compiled using several data structures. We use the definitions of these data structures to extract and identify basic entities and relations. The data types express the entities: in C such data types are typically of type struct T, having a number of fields. (We focus on C because our work is implemented in the context of GCC, which is written in C.) Each such field may define a relation between the entity represented by the parent struct and the entity represented by the type of the field. For example, the data structure for an edge of a control-flow graph can be a struct edge containing two fields src and trg (among others) that are pointers to struct basic_block, as in the case of GCC. The data types struct edge and struct basic_block introduce two entities E and B, and the fields src and trg introduce two relations over E × B: edge_src and edge_trg.

The above mechanical method provides compiler-specific entities and relations, which we then map to generic entities and relations. This mapping may be straightforward as in the example above, or may require some additional processing and semantic understanding. For example, in GCC struct tree is used to represent different generic entities such as variables and types, with a selector field in the struct identifying the intended semantics. Other fields of this data structure are overloaded, and their meaning depends on the entity the tree represents. For example, one of the fields in a struct tree that represents a variable contains a pointer to another struct tree that represents the variable's type. Knowing this allows us to deduce a relation on (variable, variable type) pairs.
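As an illustration (our own sketch with hypothetical edge and block constants, not actual GCC output), the facts pretty-printed for a two-block CFG with a single edge e1 from b1 to b2 would be:

edge_src(e1, b1).
edge_trg(e1, b2).

from which a generic relation over pairs of basic blocks can be derived by a single join:

bb_edge(B1, B2) :- edge_src(E, B1), edge_trg(E, B2).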
2.5 Extracting Features from Relations
A machine learning tool requires a quantitative measurement of the program, provided by a vector of numerical features. In this section we present several techniques for deriving numerical features from a relational representation of the program.

We consider first the case of entities having numerical values. These values may need to be aggregated into their sum, average, variance, max, min, etc., and in this way produce numerical features for the relation. For example, given the relation

count = {(b, n) | b is a basic block whose estimated number of executions is n},

we may want to compute numerical features such as the maximal number of estimated executions of a basic block, or the average number of estimated executions of a basic block.

We focus now on the case of entities having categorical values (i.e., symbols). Most of the entities important for the compilation process belong to this class. Typically, numerical features describing relations over such entities provide information on basic structural aspects of the relation, such as the number of tuples in the relation, the maximum out-degree of nodes in a tree relation, etc. We show how to extract several typical types of numerical features by applying the standard selection and projection operations, together with the num operator, defined as returning the number of tuples in a relation.

First we note that applying num to a relation already provides a numerical feature which is often of interest. This is particularly so in the case of unary relations (e.g., the number of basic blocks) but may also be the case for higher-arity relations (e.g., the number of edges in the control flow graph). Also, applying num to the projection of relation r on dimension i (yielding the unary relation r_i = {e | ∃t ∈ r such that t has e at position i}) often provides an interesting numerical feature. For example, consider the relation

st_in_block = {(i, b) | i is a store instruction in basic block b}.

Then num(st_in_block_1) is the number of stores in all basic blocks, while num(st_in_block_2) is the number of basic blocks containing store instructions.

We consider now the case of a binary relation r ⊆ E1 × E2. For every element e ∈ E_i, 1 ≤ i ≤ 2, we consider the selection induced by this element, i.e., the relation r_i(e) defined as the set of pairs in r that contain e at position i. By associating with e the value of num(r_i(e)) we define a new relation in E_i × N. For this relation, numerical features can be derived by aggregating the numerical values in the second position.

For example, consider again the relation st_in_block. For a given basic block b, the value num(st_in_block_2(b)) is the number of store instructions in basic block b. Thus the relation consisting of all pairs (b, num(st_in_block_2(b))) associates each block with the number of store instructions it contains. By aggregating these counts we may obtain numerical features such as the average number of stores in a basic block.

For the general case of a k-ary relation r where k ≥ 2, we may derive a number of binary relations by considering the projection of r on any two dimensions i, j, i ≠ j. For each such binary relation we derive new features by the above technique. Furthermore, for a relation r ⊆ E1 × ... × Ek we can also consider any two disjoint subsets I and J of the index set {1, ..., k}. The projection of r on the dimensions in I and J may be seen as a binary relation over the sets S1 = E_i1 × ··· × E_ip and S2 = E_j1 × ··· × E_jq, where I = {i1, ..., ip} and J = {j1, ..., jq}. Again, for this binary relation new numerical features may be derived.

The techniques described above for deriving numerical features from relations can be automated. We implemented the extraction of numerical features from the Datalog-derived representation of the program in Prolog, as the required aggregation operations are not supported in Datalog.
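As a concrete (and hedged) illustration of this last point, the following self-contained Prolog sketch, ours rather than the MILEPOST code, computes two such aggregations over hypothetical st_in_block facts:

st_in_block(i1, b1).
st_in_block(i2, b1).
st_in_block(i3, b2).

% num(st_in_block): total number of (instruction, block) tuples
num_stores(N) :-
    findall(I-B, st_in_block(I, B), Tuples),
    length(Tuples, N).

% average number of stores per basic block containing at least one store
avg_stores_per_block(Avg) :-
    setof(B, I^st_in_block(I, B), Blocks),
    length(Blocks, NumBlocks),
    num_stores(NumStores),
    Avg is NumStores / NumBlocks.

Here num_stores corresponds to num(st_in_block), and avg_stores_per_block evaluates to 1.5 for these facts.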

2.6 Structural Code Patterns
In the previous section we examined some basic structural properties of a graph, such as the number of edges, the average number of neighbors of a vertex, etc. These properties represent the graph structure poorly for labeled graphs with a small number of labels for vertices and edges (e.g., the CFG, DDG, dominator tree, etc.). We try to characterize such graphs by a number of (subgraph) patterns; the numerical features are provided by the number of occurrences of such patterns in the graph.

For instance, the control flow graph (CFG) may be considered as a relation over B × B, where B is the set of basic blocks. New relations over B × B may be induced from this relation by taking into account the way in which two basic blocks are connected. For example, we may consider blocks connected via an if-then or an if-then-else pattern in the CFG. The following Datalog rules provide possible definitions for these two relations. (In this example the relation bb_edge specifies whether two basic blocks are connected by an edge in the CFG.)

bb_ifthen(B1,B3) :-
    bb_edge(B1,B3), bb_edge(B1,B2), bb_edge(B2,B3).

bb_ifthen_else(B1,B4) :-
    bb_edge(B1,B2), bb_edge(B1,B3),
    bb_edge(B2,B4), bb_edge(B3,B4).

These new relations may in turn induce new relations over basic blocks connected via nested if-then or if-then-else patterns. The following Datalog rule provides a possible definition for a relation having as elements pairs of basic blocks connected via a direct edge and a nested if-then pattern (an if-then pattern in which the then alternative is itself an if-then pattern).

bb_ifthen_n(B1,B4) :-
    bb_edge(B1,B4), bb_edge(B1,B2),
    bb_ifthen(B2,B3), bb_edge(B3,B4).

In a similar way we may derive relations describing patterns in any graph structure computed during the compilation. These patterns can be described easily by Datalog rules. The semantics of the graph structure being analyzed provide guidance in selecting the patterns to consider. Additional knowledge about the code may help further trim the pattern space. For instance, knowing that for C programs without switch statements every node has at most two successors in the CFG could limit the number of possible patterns we look for.

Other patterns in graphs, such as cycles, may be considered as well. For the CFG, the loop structure may be extracted either from the relevant data structures of the compiler if available, or by computing simple patterns directly from the CFG, such as single-basic-block loops or innermost loops with a simple structure (e.g., containing a single if-then pattern inside the loop body).

Finally we note that every binary relation r ⊆ E × F can be viewed as a bipartite graph in which the partite sets correspond to E and F. For example, the def-use relation over operand pairs induces a bipartite graph in which one of the partite sets consists of the defs and the other consists of the uses. This allows us to apply the techniques presented in this section to any binary relation. For instance, let r denote the def-use relation. Then the web relation below defines a web pattern in the bipartite graph corresponding to the def-use relation.

web(E1,E2,F1,F2) :-
    r(E1,F1), r(E2,F1), r(E1,F2), r(E2,F2).

As can be seen, a large number of structural patterns can be easily expressed and tested using our feature extraction framework.
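Counting the answer substitutions of such a pattern turns it into a numerical feature. A one-rule Prolog sketch of this step (ours, not the MILEPOST code), assuming the bb_ifthen rule above and a database of bb_edge facts have been loaded:

% number of occurrences of the if-then pattern, used as a feature
num_ifthen(N) :-
    findall(B1-B3, bb_ifthen(B1, B3), Matches),
    length(Matches, N).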
Techniques for exploring the space of structural patterns are further discussed in the next subsection.
2.7 Exploring the Structural Pattern Space
In our framework, Datalog queries are used to represent subgraph patterns. For a query q, its frequency(q, r) is defined as the number of substitutions for which the query is true with respect to a Datalog database r. The frequency provides a metric that maps a pattern to a feature. Given a set of patterns, the feature vector is provided by their frequencies. Thus, the feature space is determined by the space of Datalog patterns (i.e., queries).

We use a pattern growth approach, in which more complex patterns are successively derived from a set of initial patterns. We refine the scheme for inferring new relations presented in a previous subsection by imposing constraints (chosen by a compiler expert) on the variables. In this way only the potentially important patterns are generated, significantly reducing the space of patterns to be considered.

We exemplify our extension techniques for the case of the CFG, represented by the relation bb_edge ⊆ B × B. The possible queries are sequences of bb_edge predicates of arbitrary length:

:- bb_edge(X1, X2), bb_edge(X1, X3), bb_edge(X3, X4), ...

For each variable X1, X2, ... in the sequence, some constraints control the sharing of variables between the bb_edge predicates. The constraints are of the form (m, n), where m is the maximal number of occurrences of the variable as the first argument, and n is the maximal number of occurrences of the variable as the second argument. Intuitively, these constraints limit the number of predecessors and successors of the vertices substituted for the variable, and they are chosen on the basis of domain expert knowledge; for the CFG the constraints chosen are (1, 2), (2, 1), (1, 1), (2, 2).

A query is extended by adding a bb_edge predicate at each step. If a new variable is introduced, each of the four possible constraints mentioned above may be attached to it; in fact there are four new resulting relations. If no new variable is introduced, the addition of the new predicate (which uses two existing variables) must conform with the constraints imposed on the variables. As an example we consider the query below, the constraints associated with its variables, and a possible legal extension.

constraint(B1) = (2,2)
constraint(B2) = (1,1)
constraint(B3) = (2,1)

Before extension:
:- bb_edge(B1,B2), bb_edge(B1,B3).

After extension:
:- bb_edge(B1,B2), bb_edge(B1,B3), bb_edge(B2,B3).

We note that after the extension the variable B2 cannot be further shared with any newly added predicate, as this would violate its constraints. Similarly, B1 and B3 cannot appear as the first argument, respectively the second argument, of a newly added predicate.

We note that our techniques can be extended to any labeled graph. As mentioned before, the compiler expert should define the constraints to be used based on the specific properties of the graph.

The pattern growth approach previously described introduces a partial order ≺ over the set of Datalog queries, where q1 ≺ q2 means that the query q2 is an extension of the query q1. The pattern space is the lattice spanned by the partial order ≺; the inference of the patterns may be seen as a search problem in this space.

For a collection S of Datalog databases, we define the support of a query q with respect to S as

support(q, S) = |{r ∈ S | frequency(q, r) > 0}|,

intuitively the number of databases r in which the pattern q occurs. A pattern q is called frequent [21] if support(q, S) is at least equal to a threshold specified by the user.

The specialization operator is anti-monotonic w.r.t. the support relation for a set S of Datalog databases, i.e., if q1 ≺ q2 then support(q1, S) ≥ support(q2, S). The anti-monotonicity property allows us to effectively prune the extension of a query: if a query is not frequent, then none of its extensions is frequent.
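A minimal Prolog sketch of these two measures (our illustration; each per-program database is represented here as a hypothetical list of edge(B1,B2) terms rather than as a set of loaded facts):

% frequency of the if-then query within one database of edges
frequency(Edges, F) :-
    findall(m, ( member(edge(B1, B2), Edges),
                 member(edge(B2, B3), Edges),
                 member(edge(B1, B3), Edges) ), Matches),
    length(Matches, F).

% support of the query over a collection of databases
support([], 0).
support([DB|Rest], S) :-
    support(Rest, S0),
    frequency(DB, F),
    ( F > 0 -> S is S0 + 1 ; S = S0 ).

A practical explorer would additionally exploit anti-monotonicity: once the support of a query falls below the user threshold, none of its extensions needs to be evaluated.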
3. METHODOLOGY
In our work we attempt to overcome two major methodological flaws that limit the dissemination of current and past research on iterative compilation and machine learning compilation, namely:

• the use of proprietary, unreleased or outdated transformation, compilation and feature extraction tools;

• the very limited set of optimizations and features, making it difficult or even impossible to replicate and improve upon previous results.

In an attempt to curb these tendencies, we decided to implement our feature extractor inside the popular, free software, production-quality GCC [13] compiler. Recent versions of GCC achieve performance levels competitive with the best commercial and research compilers. GCC also supports a large number of platforms, a large and fast-growing number of optimizations, and modern intermediate representations facilitating the extraction of semantically rich properties and features. It is a unique tool available for research purposes in the compilation of real-world applications.

Based on the techniques described in the previous section, we implemented our feature extractor as additional passes in GCC versions 4.2-4.4 [14]. It is invoked on demand after the compiler generates the data needed for producing features. The feature extractor works in two stages:

• extracting a relational representation of the program;

• computing a feature vector based on this representation.

4. EVALUATION
The technique presented in this paper shows how to automate and generalize feature extraction for use in predicting good optimizations. To make its benefits more concrete, we propose a complete, realistic scenario of how a compiler expert may incrementally enhance a machine learning compiler. We assume the compiler expert is working in an embedded system design group, targeting an ARC 725D 700MHz embedded processor (ARC). As is common in such a context, the design group is very small and does not have the resources to tune the heuristics, optimization pass selection and compilation flags for this particular platform.

We use the popular, freely-available MiBench [15] benchmark suite that comes with a variety of embedded and general-purpose desktop applications.

1. The expert first constructs a search space where significant speedups can be obtained using traditional iterative compilation.

2. She uses this space to build a machine learning model.

3. She trains this model over multiple (desktop and server) platforms: AMD – Athlon 64 3700+, IA32 – Intel Xeon 2.8GHz, IA64 – Itanium2 1.3GHz.
[Figure 2: Speedups obtained using iterative search on 3 platforms (500 random combinations of optimizations with 50% probability to select each optimization). Bar chart of speedup per MiBench program; series: AMD, IA32, IA64.]

[Figure 3: Speedups when predicting best optimizations based on program features, in comparison with the achievable speedups after iterative compilation based on 500 runs per benchmark (ARC processor). Bar chart of speedup per MiBench program plus the average; series: iterative compilation vs. optimization passes predicted using the static feature extractor and a nearest-neighbour classifier.]

4. The expert aims to use this knowledge base to predict how to select the best optimizations when running the same benchmarks, but on the embedded ARC target.

5. In the process, her first experiments are disappointing: the predictions achieved by the model only reach a fraction of the performance of the best combination of optimizations available in the search space.

6. The expert identifies the source of the problem using standard statistical metrics [19]. It may come from a model overfit due to a limited number of features, or from a lack of effective correlations between these features and the semantical properties that actually impact performance on the ARC platform.

7. The expert designs and implements new program feature extractors, leveraging her understanding of the optimization process and of the performance anomalies involved.

8. She incrementally adds these features into the training set, until the predictive model shows relevant results.

9. To finalize the tuning, and to improve compilation and training time, she performs principal component analysis (PCA) to narrow down the set of features that really make sense on her platform of interest.

As outlined in the use case scenario, the training of the machine learning model has been performed on all benchmarks and all the platforms except ARC, which we used as a test platform for optimization predictions.

To illustrate this scenario in practice, we applied 500 random combinations of 88 compiler optimizations that are known to influence performance, with a 50% probability of each being selected, and ran each program variant 5 times. To make the adaptive optimization fully transparent, we directly invoke optimization passes inside a modified GCC pass manager. Figure 2 shows speedups over the best GCC optimization level -O3 for all programs and all architectures. It confirms the previous findings about iterative compilation [10, 1, 27, 17]: it is possible to considerably improve performance over default compiler settings, which are tuned to perform well on average across all programs and platforms.
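The sampling of a random flag combination can be pictured with a short Prolog sketch (our illustration, not the actual experimental harness; the list passed in would hold the 88 selected GCC options):

:- use_module(library(random)).

% keep each optimization flag independently with probability 0.5
random_combination([], []).
random_combination([F|Fs], Out) :-
    random(P),
    ( P < 0.5 -> Out = [F|Rest] ; Out = Rest ),
    random_combination(Fs, Rest).

For example, random_combination(['-funroll-loops', '-ftree-vectorize'], C) keeps each of these two (real, but here arbitrarily chosen) flags with probability 0.5.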
Feature # Description:
ft1 Number of basic blocks in the method
ft2 Number of basic blocks with a single successor
ft3 Number of basic blocks with two successors
ft4 Number of basic blocks with more than two successors
ft5 Number of basic blocks with a single predecessor
ft6 Number of basic blocks with two predecessors
ft7 Number of basic blocks with more than two predecessors
ft8 Number of basic blocks with a single predecessor and a single successor
ft9 Number of basic blocks with a single predecessor and two successors
ft10 Number of basic blocks with two predecessors and one successor
ft11 Number of basic blocks with two successors and two predecessors
ft12 Number of basic blocks with more than two successors and more than two predecessors
ft13 Number of basic blocks with number of instructions less than 15
ft14 Number of basic blocks with number of instructions in the interval [15, 500]
ft15 Number of basic blocks with number of instructions greater than 500
ft16 Number of edges in the control flow graph
ft17 Number of critical edges in the control flow graph
ft18 Number of abnormal edges in the control flow graph
ft19 Number of direct calls in the method
ft20 Number of conditional branches in the method
ft21 Number of assignment instructions in the method
ft22 Number of unconditional branches in the method
ft23 Number of binary integer operations in the method
ft24 Number of binary floating point operations in the method
ft25 Number of instructions in the method
ft26 Average number of instructions in basic blocks
ft27 Average number of phi-nodes at the beginning of a basic block
ft28 Average number of arguments of a phi-node
ft29 Number of basic blocks with no phi nodes
ft30 Number of basic blocks with phi nodes in the interval [0, 3]
ft31 Number of basic blocks with more than 3 phi nodes
ft32 Number of basic blocks where the total number of arguments of all phi-nodes is greater than 5
ft33 Number of basic blocks where the total number of arguments of all phi-nodes is in the interval [1, 5]
ft34 Number of switch instructions in the method
ft35 Number of unary operations in the method
ft36 Number of instructions that perform pointer arithmetic in the method
ft37 Number of indirect references via pointers ("*" in C)
ft38 Number of times the address of a variable is taken ("&" in C)
ft39 Number of times the address of a function is taken ("&" in C)
ft40 Number of indirect calls (i.e. done via pointers) in the method
ft41 Number of assignment instructions with the left operand an integer constant in the method
ft42 Number of binary operations with one of the operands an integer constant in the method
ft43 Number of calls with pointers as arguments
ft44 Number of calls with more than 4 arguments
ft45 Number of calls that return a pointer
ft46 Number of calls that return an integer
ft47 Number of occurrences of integer constant zero
ft48 Number of occurrences of 32-bit integer constants
ft49 Number of occurrences of integer constant one
ft50 Number of occurrences of 64-bit integer constants
ft51 Number of references of local variables in the method
ft52 Number of references (def/use) of static/extern variables in the method
ft53 Number of local variables referred in the method
ft54 Number of static/extern variables referred in the method
ft55 Number of local variables that are pointers in the method
ft56 Number of static/extern variables that are pointers in the method

Table 1: List of program features produced using our technique and used to predict good optimizations
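To connect Table 1 back to the relational machinery of Section 2, here is a sketch (ours, not the shipped extractor) of how a feature such as ft2 could be evaluated in Prolog over hypothetical bb/1 and bb_edge/2 facts:

% ft2: number of basic blocks with exactly one successor
ft2(N) :-
    findall(B, ( bb(B), findall(S, bb_edge(B, S), [_]) ), Blocks),
    length(Blocks, N).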

In order to help end-users and researchers reproduce results and optimize their programs, we made the experimental data publicly available in the Collective Optimization Database at [9]. Note that the same combination of optimizations found for one benchmark, for example susan_corners on AMD, does not improve the execution time of the bitcount benchmark, and even degrades the execution time of jpeg_c by 10% on the same architecture. This is of course a clear signal that program features are key to the success of any machine learning compiler, which does not diminish the importance of architecture features and data-set features.

Though obtaining strong speedups, the iterative compilation process is very time-consuming and impractical in production. We use predictive modeling techniques similar to [25, 30, 2, 6] to be able to characterize similarities between programs and optimizations, and to predict good optimizations for a yet unseen program based on this knowledge. To validate our results, we decided to use a state-of-the-art predictive model described in [2]. This model predicts optimizations for a given program based on a nearest-neighbor static feature classifier, suggesting optimizations from the similarity of programs. We use a different training set on the embedded system platform ARC, and the traditional leave-one-out validation where the evaluated benchmark is removed from the training set, to avoid strong biasing towards the same optimizations from the same benchmark.
When a new program is compiled, its features are first extracted using our tool; they are then compared with the features of all other programs using a nearest-neighbor classifier, as described in [5]. The program is then recompiled with the combination of optimizations of the most similar program encountered so far.
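A hedged sketch of this prediction step (ours, not the classifier of [5]; features/2 and best_flags/2 stand for hypothetical facts recorded during training):

% 1-NN prediction: find the training program whose feature vector is
% closest (squared Euclidean distance) and reuse its best flags
sq_distance([], [], 0).
sq_distance([X|Xs], [Y|Ys], D) :-
    sq_distance(Xs, Ys, D0),
    D is D0 + (X - Y) * (X - Y).

predict(NewVec, Flags) :-
    findall(D-P, ( features(P, Vec), sq_distance(NewVec, Vec, D) ), Pairs),
    keysort(Pairs, [_-Nearest|_]),
    best_flags(Nearest, Flags).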
As outlined in the use case scenario, we iterated on this baseline method while gradually adding more and more features. We eventually reached an average performance improvement of 11% across all benchmarks, out of the 15% achievable when picking the optimal points in the search space; see Figure 3. Adding more features did not bring us more performance on average across the benchmarks. The list of the 56 important features identified in this iterative process, able to capture complex dependencies between program structure and combinations of multiple optimizations, is presented in Table 1. Though we did not reach the best performance achieved with iterative compilation, we showed that our technique for automatic feature extraction can already be used effectively for machine learning, to enable optimization knowledge reuse and automatically improve program execution time. The simplicity and expressiveness of the feature extractor is one key contribution of our approach: a few lines of Prolog code for each new feature, building on a finite set of pretty-printers from GCC's internal data structures into Datalog entities. Our results pave the way for a more systematic study of the quality and importance of individual program features, a necessary step towards automatic feature selection and the construction of robust predictive models for compiler optimizations.

Our main contribution is to construct program features by aggregation and filtering of a large amount of semantical properties. But comparison with other predictive techniques is a relevant question in itself, related to the selection of the features and of the machine learning classifier or predictor. Our work is intended to ease such comparisons, replicating the work of others within a single machine learning optimization platform.
5. CONCLUSION
Though the combination of iterative compilation and machine learning has been studied for more than a decade and has shown great potential for program optimization, there are surprisingly few research results on the problem of selecting good quality program features. This problem is relevant for effective optimization knowledge reuse, to speed up the search for good optimizations, to build predictive models for compilation heuristics, to select optimization passes and their ordering, to build and tune analytical performance models, and more.

Up to now, compiler experts had to manually construct and implement feature extractors that best suit their purpose. Without a systematic way to construct features and evaluate their merits, this task remains a tedious trial-and-error process relying on what the experts believe they understand about the impact of optimization passes. In a modern compiler like GCC, more than 200 passes compete in a dreadful interplay of tradeoffs and assumptions about the program and the target architecture (itself very complex and rather unpredictable). The global impact of these heuristics can be very far from optimal, even on a major target of the compiler such as the x86 ISA and its most popular microarchitectural instances. But what about embedded targets, which attract less attention from expert developers and cannot afford large in-house compiler groups? What about design-space exploration of the ISA, microarchitecture and compiler?

So far, a limited set of largely syntactical features has been devised to prove that optimization knowledge can be reused and derived automatically from feedback-directed optimization. However, machine learning is only able to recover correlations (hence optimization knowledge) from the information it is fed with: it is critical to select topical program features for a given optimization problem. To our knowledge, this is the first attempt to propose a practical and general method for systematically generating numerical features from a program, and to implement it in a production compiler. This method does not put any restriction on how to logically and algebraically aggregate semantical properties into numerical features, offering a virtually exhaustive coverage of the statistically relevant information that can be derived from a program.

This method has been implemented in GCC and applied to a number of general-purpose and embedded benchmarks. We illustrate our method on the difficult problem of selecting the optimal setting of compiler optimizations for improving the performance of an application, and demonstrate its practicality by achieving 74% of the available speedup obtained through iterative compilation on a wide range of benchmarks and 4 different general-purpose and embedded architectures. We believe this work is an important step towards generalizing machine learning techniques to tackle the complexity of present and future computing systems. The feature extractor presented in this paper is now available for download within MILEPOST GCC at [14], while the experimental data is available at [9] to help researchers reproduce and extend this work.

6. ACKNOWLEDGMENTS
This work was partly supported by the European Commission through the FP6 project MILEPOST id. 035307 and by the HiPEAC Network of Excellence.

7. REFERENCES
[1] ACOVEA: Using Natural Selection to Investigate Software Complexities. http://www.coyotegulch.com/products/acovea.
[2] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M.F.P. O'Boyle, J. Thomson, M. Toussaint, and C.K.I. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2006.
[3] A.V. Aho, M.S. Lam, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 2nd edition, 2007.
[4] F. Bodin, T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, and E. Rohou. Iterative compilation in a non-linear optimisation space. In Proceedings of the Workshop on Profile and Feedback Directed Compilation, 1998.
[5] Edwin V. Bonilla, Christopher K. I. Williams, Felix V. Agakov, John Cavazos, John Thomson, and Michael F. P. O'Boyle. Predictive search distributions. In William W. Cohen and Andrew Moore, editors, Proceedings of the 23rd International Conference on Machine Learning, pages 121–128, New York, NY, USA, 2006. ACM.
[6] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. O'Boyle, and O. Temam. Rapidly selecting good compiler optimizations using performance counters. In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), March 2007.
[7] K. Cooper, A. Grosul, T. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman. ACME: adaptive compilation made efficient. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.
[8] K.D. Cooper, P.J. Schielke, and D. Subramanian. Optimizing for reduced code space using genetic algorithms. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 1–9, 1999.
[9] Collective Tuning Infrastructure: automating and accelerating development and optimization of computing systems. http://cTuning.org.
[10] ESTO: Expert System for Tuning Optimizations. http://www.haifa.ibm.com/projects/systems/cot/esto.
[11] Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Phil Barnard, Elton Ashton, Eric Courtois, Francois Bodin, Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, and Michael O'Boyle. MILEPOST GCC: machine learning based research compiler. In Proceedings of the GCC Developers' Summit, June 2008.
[12] Grigori Fursin and Olivier Temam. Collective optimization. In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), January 2009.
[13] GCC: GNU Compiler Collection. http://gcc.gnu.org.
[14] MILEPOST GCC: Collaborative development website. http://cTuning.org/milepost-gcc.
[15] Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.
[16] K. Heydemann and F. Bodin. Iterative compilation for two antagonistic criteria: Application to code size and performance. In Proceedings of the 4th Workshop on Optimizations for DSP and Embedded Systems, colocated with CGO, 2006.
[17] K. Hoste and L. Eeckhout. COLE: Compiler optimization level exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2008.
[18] Shih-Hao Hung, Chia-Heng Tu, Huang-Sen Lin, and Chi-Meng Chen. An automatic compiler optimizations selection framework for embedded applications. In International Conference on Embedded Software and Systems (ICESS'09), pages 381–387, 2009.
[19] Raj Jain. The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991.
[20] P. Kulkarni, W. Zhao, H. Moon, K. Cho, D. Whalley, J. Davidson, M. Bailey, Y. Paek, and K. Gallivan. Finding effective optimization phase sequences. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 12–23, 2003.
[21] L. Dehaspe and H. Toivonen. Discovery of frequent Datalog patterns. Data Mining and Knowledge Discovery, pages 7–36, 1999.
[22] H. Leather, E. Yom-Tov, M. Namolaru, and A. Freund. Automatic feature generation for setting compilers heuristics. In 2nd Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART'08), colocated with the HiPEAC'08 conference, 2008.
[23] S. MacLane. Categories for the Working Mathematician, volume 5 of Graduate Texts in Mathematics. Springer Verlag, Berlin, 1971.
[24] F. Matteo and S. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1381–1384, Seattle, WA, May 1998.
[25] A. Monsifrot, F. Bodin, and R. Quiniou. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, Applications, LNCS 2443, pages 41–50, 2002.
[26] S.S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[27] Z. Pan and R. Eigenmann. Fast and effective orchestration of compiler optimizations for automatic performance tuning. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 319–332, 2006.
[28] David Parello, Olivier Temam, Albert Cohen, and Jean-Marie Verdun. Towards a systematic, pragmatic and architecture-aware program optimization process for complex processors. In ACM/IEEE Conference on Supercomputing (SC'04), page 15, Washington, DC, 2004.
[29] B. Singer and M. Veloso. Learning to predict performance from formula modeling and training data. In Proceedings of the Conference on Machine Learning, 2000.
[30] M. Stephenson and S. Amarasinghe. Predicting unroll factors using supervised classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 123–134, 2005.
[31] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D.I. August. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 204–215, 2003.
[32] J. D. Ullman. Principles of Database and Knowledge-Base Systems, volume 1. Computer Science Press, 1988.
[33] J. Whaley and M.S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2004.
[34] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the Conference on High Performance Networking and Computing, 1998.
[35] D. Whitfield and M. L. Soffa. An approach to ordering optimizing transformations. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'90), pages 137–146, Seattle, Washington, United States, 1990.
