
Exploiting Superword Level Parallelism

with Multimedia Instruction Sets


Samuel Larsen and Saman Amarasinghe
MIT Laboratory for Computer Science
Cambridge, MA 02139
{slarsen,saman}@lcs.mit.edu

Abstract

Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block.

In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia and scientific applications. We believe SLP is fundamentally different from the loop level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.

1 Introduction

The recent shift toward computation-intensive multimedia workloads has resulted in a variety of new multimedia extensions to current microprocessors [6, 10, 16, 18, 20]. Many new designs are targeted specifically at the multimedia domain [3, 7, 11]. This trend is likely to continue as it has been projected that multimedia processing will soon become the main focus of microprocessor design [8].

While different processors vary in the type and number of multimedia instructions offered, at the core of each is a set of short SIMD or superword operations. These instructions operate concurrently on data that are packed in a single register or memory location. In the past, such systems could accommodate only small data types of 8 or 16 bits, making them suitable for a limited set of applications. With the emergence of 128-bit superwords, new architectures are capable of performing four 32-bit operations with a single instruction. By adding floating point support as well, these extensions can now be used to perform more general purpose computation.

It is not surprising that SIMD execution units have appeared in desktop microprocessors. Their simple control, replicated functional units, and absence of heavily-ported register files make them inherently simple and extremely amenable to scaling. As the number of available transistors increases with advances in semiconductor technology, datapaths are likely to grow even larger.

Today, use of multimedia extensions is difficult since application writers are largely restricted to using in-line assembly routines or specialized library calls. The problem is exacerbated by inconsistencies among different instruction sets. One solution to this inconvenience is to employ vectorization techniques that have been used to parallelize scientific code for vector machines [5, 14, 15]. Since a number of multimedia applications are vectorizable, this approach promises good results. However, many important multimedia applications are difficult to vectorize. Complicated loop transformation techniques such as loop fission and scalar expansion are required to parallelize loops that are only partially vectorizable [2, 4, 17]. Consequently, no commercial compiler currently implements this functionality. This paper presents a method for extracting SIMD parallelism beyond vectorizable loops.

We believe that short SIMD operations are well suited to exploit a fundamentally different type of parallelism than the vector parallelism associated with traditional vector and SIMD supercomputers. We denote this parallelism Superword Level Parallelism (SLP) since it comes in the form of superwords containing packed data. Vector supercomputers require large amounts of parallelism in order to achieve speedups, whereas SLP can be profitable when parallelism is scarce. From this perspective, we have developed a general algorithm for detecting SLP that targets basic blocks rather than loop nests.

In some respects, superword level parallelism is a restricted form of ILP. ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.

The remainder of this paper is organized as follows: Section 2 defines superword level parallelism and compares it to other forms of parallelism. Section 3 describes the compiler algorithm used to extract superword level parallelism. A variation of this algorithm targeting vector parallelism is discussed in Section 4. Section 5 presents results on multimedia and scientific benchmarks. Section 6 discusses architectural features that complement SLP compilation. Section 7 outlines reasons why we believe SLP algorithms will be successful, and Section 8 concludes.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI 2000, Vancouver, British Columbia, Canada.
Copyright 2000 ACM 1-58113-199-2/00/0006...$5.00.
2 Superword Level Parallelism

This section begins by elaborating on the notion of SLP and the means by which it is detected. Terminology is introduced that facilitates the discussion of our algorithms in Sections 3 and 4. We then contrast SLP to other forms of parallelism and discuss their interactions. This helps motivate the need for a new compilation technique.

2.1 Description of Superword Level Parallelism

Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location. Detection is done through a short, simple analysis in which independent isomorphic statements are identified within a basic block. Isomorphic statements are those that contain the same operations in the same order. Such statements can be executed in parallel by a technique we call statement packing, an example of which is shown in Figure 1. Here, source operands in corresponding positions have been packed into registers and the addition and multiplication operators have been replaced by their SIMD counterparts. Since the result of the computation is also packed, unpacking may be required depending on how the data are used in later computations. The performance benefit of statement packing is determined by the speedup gained from parallelization minus the cost of packing and unpacking.

    a = b + c * z[i+0]
    d = e + f * z[i+1]
    r = s + t * z[i+2]
    w = x + y * z[i+3]

    [a]   [b]         [c]         [z[i+0]]
    [d] = [e]  +SIMD  [f]  *SIMD  [z[i+1]]
    [r]   [s]         [t]         [z[i+2]]
    [w]   [x]         [y]         [z[i+3]]

Figure 1: Isomorphic statements that can be packed and executed in parallel.

Depending on what operations an architecture provides to facilitate general packing and unpacking, this technique can actually result in a performance degradation if packing and unpacking costs are high relative to ALU operations. One of the main objectives of our SLP detection technique is to minimize packing and unpacking by locating cases in which packed data produced as a result of one computation can be used directly as a source in another computation.

Packed statements that contain adjacent memory references among corresponding operands are particularly well suited for SLP execution. This is because operands are effectively pre-packed in memory and require no reshuffling within a register. In addition, an address calculation followed by a load or store need only be executed once instead of individually for each element. The combined effect can lead to a significant performance increase. This is not surprising since vector machines have been successful at exploiting the same phenomenon. In our experiments, instructions eliminated from operating on adjacent memory locations had the greatest impact on speedup. For this reason, locating adjacent memory references forms the basis of our algorithm, discussed in Section 3.
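To make the packed form of Figure 1 concrete, the four statements can be written with a generic four-wide vector type. The sketch below uses GCC's vector extensions with names of our own choosing; it is only an illustration (the compiler described in Section 5.3 emits AltiVec macros instead).

    /* Statement packing for the Figure 1 example: one SIMD multiply and one SIMD
     * add replace four scalar multiplies and four scalar adds.  besx and cfty are
     * assumed to already hold the packed operands [b,e,s,x] and [c,f,t,y]. */
    typedef float v4sf __attribute__((vector_size(16)));   /* four 32-bit floats */

    v4sf packed_example(v4sf besx, v4sf cfty, const float *z, int i)
    {
        /* z[i+0..3] are adjacent in memory and form a pre-packed superword. */
        v4sf zvec = { z[i+0], z[i+1], z[i+2], z[i+3] };
        return besx + cfty * zvec;    /* yields the packed result [a, d, r, w] */
    }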
2.2 Vector Parallelism

Vector parallelism is a subset of superword level parallelism. Our results in Section 5 show that 20% of dynamic instruction savings on the SPEC95fp benchmark suite are from non-vectorizable code sequences.

To better explain the differences between superword level parallelism and vector parallelism, we present two short examples, shown in Figures 2 and 3. Although the first example can be molded into a vectorizable form, we know of no vector compilers that can be used to vectorize the second. Furthermore, the transformations required in the first example are unnecessarily complex and may not work in more complicated circumstances. In general, a vector compiler must employ a repertoire of tools in order to parallelize loops on a case by case basis. By comparison, our method is simple and robust, yet still capable of detecting the available parallelism.

    for (i=0; i<16; i++) {
        localdiff = ref[i] - curr[i];
        diff += abs(localdiff);
    }

    (a) Original loop.

    for (i=0; i<16; i++) {
        T[i] = ref[i] - curr[i];
    }
    for (i=0; i<16; i++) {
        diff += abs(T[i]);
    }

    (b) After scalar expansion and loop fission.

    for (i=0; i<16; i+=4) {
        localdiff = ref[i+0] - curr[i+0];
        diff += abs(localdiff);
        localdiff = ref[i+1] - curr[i+1];
        diff += abs(localdiff);
        localdiff = ref[i+2] - curr[i+2];
        diff += abs(localdiff);
        localdiff = ref[i+3] - curr[i+3];
        diff += abs(localdiff);
    }

    (c) Superword level parallelism exposed after unrolling.

    for (i=0; i<16; i+=4) {
        localdiff0 = ref[i+0] - curr[i+0];
        localdiff1 = ref[i+1] - curr[i+1];
        localdiff2 = ref[i+2] - curr[i+2];
        localdiff3 = ref[i+3] - curr[i+3];
        diff += abs(localdiff0);
        diff += abs(localdiff1);
        diff += abs(localdiff2);
        diff += abs(localdiff3);
    }

    (d) Packable statements grouped together after renaming.

Figure 2: A comparison between SLP and vector parallelization techniques.
Figure 2(a) presents the inner loop of the motion estimation algorithm used for MPEG encoding. Vectorization is inhibited by the presence of a loop carried dependence and a function call within the loop body. To overcome this, a vector compiler can perform a series of transformations to mold the loop into a vectorizable form. The first is scalar expansion, which allocates a new element in a temporary array for each iteration of the loop [4]. Loop fission is then used to divide the statements into separate loops [12]. The result of these transformations is shown in Figure 2(b). The first loop is vectorizable, but the second must be executed sequentially.

Figure 2(c) shows the loop from the perspective of SLP. After unrolling, the four statements corresponding to the first statement in the original loop can be packed together. The packing process effectively moves packable statements to contiguous positions, as shown in part (d). The code motion is legal because it does not violate any dependences (once scalar renaming is performed). The first four statements in the resulting loop body can be packed and executed in parallel. Their results are then unpacked so they can be used in the sequential computation of the final statements. In the end, this method has the same effect as the transformations used for vector compilation, while only requiring loop unrolling and scalar renaming.

Figure 3 shows a code segment that averages the elements of two 16x16 matrices. As is the case with many multimedia kernels, our example has been hand-optimized for a sequential machine. In order to vectorize this loop, a vector compiler would need to reverse the programmer-applied optimizations. Were such methods available, they would involve constructing a for loop, restoring the induction variable, and re-rolling the loop. In contrast, locating SLP within the loop body is simple. Since the optimized code is amenable to SLP analysis, hand-optimization has had no detrimental effects on our ability to detect the available parallelism.

    do {
        dst[0] = (src1[0] + src2[0]) >> 1;
        dst[1] = (src1[1] + src2[1]) >> 1;
        dst[2] = (src1[2] + src2[2]) >> 1;
        dst[3] = (src1[3] + src2[3]) >> 1;
        dst  += 4;
        src1 += 4;
        src2 += 4;
    } while (dst != end);

Figure 3: An example of a hand-optimized matrix operation that proves unvectorizable.

2.3 Loop Level Parallelism

Vector parallelism, exploited by vector computers, is a subset of loop level parallelism. General loop level parallelism is typically exploited by a multiprocessor or MIMD machine. In many cases, parallel loops may not yield performance gains because of fine-grain synchronization or loop-carried communication. It is therefore necessary to find coarse-grain parallel loops when compiling for MIMD machines. Traditionally, a MIMD machine is composed of multiple microprocessors. It is conceivable that loop level parallelism could be exploited orthogonally to superword level parallelism within each processor. Since coarse-grain parallelism is required to get good MIMD performance, extracting SLP should not detract from existing MIMD parallel performance.

2.4 SIMD Parallelism

SIMD parallelism came into prominence with the advent of massively parallel supercomputers such as the Illiac IV [9]. The association of the term "SIMD" with this type of computer is what led us to utilize the term Superword Level Parallelism when discussing short SIMD operations.

SIMD supercomputers were implemented using thousands of small processors that worked synchronously on a single instruction stream. While the cost of massive SIMD parallel execution and near-neighbor communication was low, distribution of data to these processors was expensive. For this reason, automatic SIMD parallelization centered on solving the data distribution problem [1]. In the end, the class of applications for which SIMD compilers were successful was even more restrictive than that of vector and MIMD machines.

2.5 Instruction Level Parallelism

Superword level parallelism is closely related to ILP. In fact, SLP can be viewed as a subset of instruction level parallelism. Most processors that support SLP also support ILP in the form of superscalar execution. Because of their similarities, methods for locating SLP and ILP may extract the same information. Under circumstances where these types of parallelism completely overlap, SLP execution is preferred because it provides a less expensive and more energy efficient solution.

In practice, the majority of ILP is found in the presence of loops. Therefore, unrolling the loop multiple times may provide enough parallelism to satisfy both ILP and SLP processor utilization. In this situation, ILP performance would not noticeably degrade after SLP is extracted from a program.
3 SLP Compiler Algorithm

Our SLP compiler algorithm can be divided into several distinct phases. First, loop unrolling is used to transform vector parallelism into SLP. Alignment analysis then attempts to determine the address alignment of each load and store instruction. This is needed for compiling to architectures that do not support unaligned memory accesses. Next, the intermediate representation is transformed into a low level form and a series of standard compiler optimizations is applied.

The core of our algorithm begins by locating statements with adjacent memory references and packing them into groups of size two. From this initial seed, more groups are discovered based on the active set of packed data. All groups are then merged into larger clusters of a size consistent with the superword datapath width. Finally, a new schedule is produced for each basic block, where groups of packed statements are replaced with SIMD instructions.

The following subsections describe each of these phases in detail. Figure 4 presents a simple example to highlight the core routines and Figure 5 lists the pseudo code. Both will be referenced throughout this section.

3.1 Loop Unrolling

Loop unrolling is performed early since it is most easily done at a high level. As discussed, it is used to transform vector parallelism into basic blocks with superword level parallelism. In order to ensure full utilization of the superword datapath in the presence of a vectorizable loop, the unroll factor must be customized to the data sizes used within the loop. For example, a vectorizable loop containing 16-bit values should be unrolled 8 times for a 128-bit datapath. Our system currently unrolls loops based on the smallest data type present.
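For instance, the unroll factor for a fully vectorizable loop follows directly from the datapath width and the narrowest data type in the loop body. A minimal sketch of that calculation (our code, with hypothetical parameter names) is:

    /* Choose the unroll factor so that one unrolled iteration fills the
     * superword datapath, e.g. 128-bit datapath / 16-bit values = unroll by 8. */
    int unroll_factor(int datapath_bits, int smallest_type_bits)
    {
        return datapath_bits / smallest_type_bits;
    }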
3.2 Alignment Analysis

Alignment analysis determines the alignment of memory accesses with respect to a certain superword datapath width. For architectures that do not support unaligned memory accesses, alignment analysis can greatly improve the performance of our system. Without it, memory accesses are assumed to be unaligned and the proper merging code must be emitted for every wide load and store.

One situation in which merging overhead can be amortized is when a contiguous block of memory is accessed within a loop. In this situation, overhead can be reduced to one additional merge operation per load or store by using data from previous iterations.

Alignment analysis, however, can completely remove this overhead. For FORTRAN sources, a simple interprocedural analysis can determine alignment information in a single pass. This analysis is flow-insensitive, context-insensitive, and visits the call graph in breadth-first order. For C sources, we use an enhanced pointer analysis package developed by Rugina and Rinard [21]. Since this pass also provides location set information, we can consider dependences more carefully when combining packing candidates. A full discussion of alignment analysis is beyond the scope of this paper. A complete description will be given in [13].

Our compilation system is capable of operating both with and without alignment constraints. For simplicity, we describe subsequent phases of the algorithm assuming no architectural support for unaligned accesses. As such, later phases assume alignment information has been annotated to each load and store instruction where possible.
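To make the merging overhead concrete, the sketch below shows the usual way an unaligned 128-bit load is synthesized on AltiVec-style hardware, using standard AltiVec intrinsics. It is an illustration of the overhead being discussed, not code emitted by our system.

    #include <altivec.h>

    /* An unaligned wide load built from two aligned loads plus a permute.  When a
     * contiguous block is streamed inside a loop, the high word of one iteration
     * can be reused as the low word of the next, leaving roughly one extra merge
     * per load, as described above.  Alignment analysis can remove even that. */
    vector float load_unaligned(const float *p)
    {
        vector unsigned char mask = vec_lvsl(0, p);   /* permute mask from the address */
        vector float lo = vec_ld(0, p);               /* aligned word containing p[0]  */
        vector float hi = vec_ld(15, p);              /* the next aligned word         */
        return vec_perm(lo, hi, mask);                /* merge into the unaligned data */
    }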
3.3 Pre-optimization

SLP analysis is most useful when performed on a three address representation. This way, the algorithm has full flexibility in choosing which operations to pack. If isomorphic statements are instead matched by the tree structure inherited from the source code, long expressions must be identical in order to parallelize. On the other hand, identifying adjacent memory references is much easier if address calculations maintain their original form. We therefore annotate each load and store instruction with this information before flattening.

After flattening, several standard optimizations are applied to an input program. This ensures that parallelism is not extracted from computation that would otherwise be eliminated. Optimizations include constant propagation, copy propagation, dead code elimination, common subexpression elimination, loop-invariant code motion, and redundant load/store elimination. As a final step, scalar renaming is performed to remove output and anti-dependences since they can inhibit parallelization.
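A small illustration of the difference (our example, not one from the paper): matched as whole expression trees, the two statements below are not packable, but after flattening the two multiplies become isomorphic and can be considered on their own.

    /* Original source statements: the expression trees differ. */
    a = (b + c) * x[i+0];
    d = e * x[i+1];

    /* Three-address (flattened) form: t1 is a compiler temporary. */
    t1 = b + c;
    a  = t1 * x[i+0];
    d  = e  * x[i+1];     /* isomorphic with the statement above; candidates for packing */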
3.4 Identifying Adjacent Memory References

Because of their obvious impact, statements containing adjacent memory references are the first candidates for packing. We therefore begin the core of our analysis by scanning each basic block to find independent pairs of such statements. Adjacency is determined using both alignment information and array analysis.

In general, duplicate memory operations can introduce several different packing possibilities. Dependences will eliminate many of these possibilities and redundant load/store elimination will usually remove the rest. In practice, nearly every memory reference is directly adjacent to at most two other references. These correspond to the references that access memory on either side of the reference in question. When located, the first occurrence of each pair is added to the PackSet.

Definition 3.1 A Pack is an n-tuple, ⟨s1, ..., sn⟩, where s1, ..., sn are independent isomorphic statements in a basic block.

Definition 3.2 A PackSet is a set of Packs.

In this phase of the algorithm, only groups of two statements are constructed. We refer to these as pairs with a left and right element.

Definition 3.3 A Pair is a Pack of size two, where the first statement is considered the left element, and the second statement is considered the right element.

As an intermediate step, statements are allowed to belong to two groups as long as they occupy a left position in one of the groups and a right position in the other. Enforcing this discipline here allows the Combination phase to easily merge groups into larger clusters. These details are discussed in Section 3.6.

Figure 4(a) presents an example sequence of statements. Figure 4(b) shows the results of adjacent memory identification in which two pairs have been added to the PackSet. The pseudo code for this phase is shown in Figure 5 as find_adj_refs.

    (a) U = { (1) b = a[i+0]
              (2) c = 5
              (3) d = b + c
              (4) e = a[i+1]
              (5) f = 6
              (6) g = e + f
              (7) h = a[i+2]
              (8) j = 7
              (9) k = h + j }      P = ∅

    (b) P = { ⟨(1), (4)⟩, ⟨(4), (7)⟩ }

    (c) P = { ⟨(1), (4)⟩, ⟨(4), (7)⟩, ⟨(3), (6)⟩, ⟨(6), (9)⟩ }

    (d) P = { ⟨(1), (4)⟩, ⟨(4), (7)⟩, ⟨(3), (6)⟩, ⟨(6), (9)⟩,
              ⟨(2), (5)⟩, ⟨(5), (8)⟩ }

    (e) P = { ⟨(1), (4), (7)⟩, ⟨(3), (6), (9)⟩, ⟨(2), (5), (8)⟩ }

    (f) [b, e, h] = [a[i+0], a[i+1], a[i+2]]
        [c, f, j] = [5, 6, 7]
        [d, g, k] = [b, e, h] + [c, f, j]

Figure 4: Various phases of SLP analysis. U and P represent the current set of unpacked and packed statements, respectively. (a) Initial sequence of instructions. (b) Statements with adjacent memory references are paired and added to the PackSet. (c) The PackSet is extended by following def-use chains of existing entries. (d) The PackSet is further extended by following use-def chains. (e) Combination merges groups containing the same expression. (f) Each group is scheduled and SIMD operations are emitted in their place.
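One possible in-memory representation of these definitions, reused by the later sketches in this section, is given below. The paper does not specify its data structures, so the types and field names are assumptions of ours.

    typedef struct Stmt Stmt;        /* a statement in the flattened basic block */

    /* A Pack is an n-tuple of independent isomorphic statements.  A Pair is a
     * Pack with n == 2: stmts[0] is the left element, stmts[1] the right. */
    typedef struct {
        Stmt **stmts;
        int    n;
    } Pack;

    /* A PackSet is simply a set of Packs. */
    typedef struct {
        Pack *packs;
        int   count;
    } PackSet;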
    SLP_extract: BasicBlock B → BasicBlock
        PackSet P ← ∅
        P ← find_adj_refs(B, P)
        P ← extend_packlist(B, P)
        P ← combine_packs(P)
        return schedule(B, [ ], P)

    find_adj_refs: BasicBlock B × PackSet P → PackSet
        foreach Stmt s ∈ B do
            foreach Stmt s' ∈ B where s ≠ s' do
                if has_mem_ref(s) ∧ has_mem_ref(s') then
                    if adjacent(s, s') then
                        Int align ← get_alignment(s)
                        if stmts_can_pack(B, P, s, s', align) then
                            P ← P ∪ {⟨s, s'⟩}
        return P

    extend_packlist: BasicBlock B × PackSet P → PackSet
        repeat
            PackSet Pprev ← P
            foreach Pack p ∈ P do
                P ← follow_use_defs(B, P, p)
                P ← follow_def_uses(B, P, p)
        until P ≡ Pprev
        return P

    combine_packs: PackSet P → PackSet
        repeat
            PackSet Pprev ← P
            foreach Pack p = ⟨s1, ..., sn⟩ ∈ P do
                foreach Pack p' = ⟨s'1, ..., s'm⟩ ∈ P do
                    if sn ≡ s'1 then
                        P ← P − {p, p'} ∪ {⟨s1, ..., sn, s'2, ..., s'm⟩}
        until P ≡ Pprev
        return P

    schedule: BasicBlock B × BasicBlock B' × PackSet P → BasicBlock
        for i ← 0 to |B| do
            if ∃ p = ⟨..., si, ...⟩ ∈ P then
                if ∀ s ∈ p . deps_scheduled(s, B') then
                    foreach Stmt s ∈ p do
                        B ← B − s
                        B' ← B' · s
                    return schedule(B, B', P)
            else if deps_scheduled(si, B') then
                return schedule(B − si, B' · si, P)
        if |B| ≠ 0 then
            P ← P − {p} where p = first(B, P)
            return schedule(B, B', P)
        return B'

    stmts_can_pack: BasicBlock B × PackSet P × Stmt s × Stmt s' × Int align → Boolean
        if isomorphic(s, s') then
            if independent(s, s') then
                if ∀ ⟨t, t'⟩ ∈ P . t ≠ s then
                    if ∀ ⟨t, t'⟩ ∈ P . t' ≠ s' then
                        Int aligns ← get_alignment(s)
                        Int aligns' ← get_alignment(s')
                        if aligns ≡ ⊤ ∨ aligns ≡ align then
                            if aligns' ≡ ⊤ ∨ aligns' ≡ align + data_size(s') then
                                return true
        return false

    follow_use_defs: BasicBlock B × PackSet P × Pack p → PackSet
        where p = ⟨s, s'⟩, s = [ x0 := f(x1, ..., xm) ], s' = [ x'0 := f(x'1, ..., x'm) ]
        Int align ← get_alignment(s)
        for j ← 1 to m do
            if ∃ t ∈ B . t = [ xj := ... ] ∧ ∃ t' ∈ B . t' = [ x'j := ... ] then
                if stmts_can_pack(B, P, t, t', align) then
                    if est_savings(⟨t, t'⟩, P) ≥ 0 then
                        P ← P ∪ {⟨t, t'⟩}
                        set_alignment(s, s', align)
        return P

    follow_def_uses: BasicBlock B × PackSet P × Pack p → PackSet
        where p = ⟨s, s'⟩, s = [ x0 := f(x1, ..., xm) ], s' = [ x'0 := f(x'1, ..., x'm) ]
        Int align ← get_alignment(s)
        Int savings ← −1
        foreach Stmt t ∈ B where t = [ ... := g(..., x0, ...) ] do
            foreach Stmt t' ∈ B where t ≠ t' = [ ... := h(..., x'0, ...) ] do
                if stmts_can_pack(B, P, t, t', align) then
                    if est_savings(⟨t, t'⟩, P) > savings then
                        savings ← est_savings(⟨t, t'⟩, P)
                        Stmt u ← t
                        Stmt u' ← t'
        if savings ≥ 0 then
            P ← P ∪ {⟨u, u'⟩}
            set_alignment(u, u')
        return P

Figure 5: Pseudo code for the SLP extraction algorithm. Only key procedures are listed. Helper functions include: 1) has_mem_ref, which returns true if a statement accesses memory, 2) adjacent, which checks adjacency between two memory references, 3) get_alignment, which retrieves alignment information, 4) set_alignment, which sets alignment information when it is not already set, 5) deps_scheduled, which returns true when, for a given statement, all statements upon which it is dependent have been scheduled, 6) first, which returns the PackSet member containing the earliest unscheduled statement, 7) est_savings, which estimates the savings of a potential group, 8) isomorphic, which checks for statement isomorphism, and 9) independent, which returns true when two statements are independent.
3.5 Extending the PackSet

Once the PackSet has been seeded with an initial set of packed statements, more groups can be added. This is done by finding new candidates that can either:

• Produce needed source operands in packed form, or
• Use existing packed data as source operands.

This is accomplished by following def-use and use-def chains of existing PackSet entries. If these chains lead to fresh packable statements, a new group is created and added to the PackSet. For two statements to be packable, they must meet the following criteria:

• The statements are isomorphic.
• The statements are independent.
• The left statement is not already packed in a left position.
• The right statement is not already packed in a right position.
• Alignment information is consistent.
• Execution time of the new parallel operation is estimated to be less than the sequential version.

The analysis computes an estimated speedup of each potential SIMD instruction based on a cost model for each instruction added and removed. This includes any packing or unpacking that must be performed in conjunction with the new instruction. If the proper packed operand data already exist in the PackSet, then packing cost is set to zero.
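The paper does not give the cost model itself, so the following is only a sketch of one consistent with the description above; the Pack and PackSet types are the hypothetical structures sketched in Section 3.4, and every helper below is an assumption.

    /* Hypothetical helpers: per-statement cost, SIMD operation cost, and the
     * packing/unpacking overhead a new group would introduce. */
    int cost_of_stmt(const Stmt *s);
    int cost_of_simd_op(const Pack *p);
    int cost_of_packing(const Pack *p);
    int cost_of_unpacking(const Pack *p);
    int operands_already_packed(const Pack *p, const PackSet *set);
    int results_needed_unpacked(const Pack *p, const PackSet *set);

    /* est_savings sketch: cycles saved by replacing the scalar statements with
     * one SIMD operation, minus any packing and unpacking the group requires.
     * Packing cost is zero when the operands are already packed in the PackSet. */
    int est_savings(const Pack *p, const PackSet *set)
    {
        int scalar_cost = 0;
        for (int i = 0; i < p->n; i++)
            scalar_cost += cost_of_stmt(p->stmts[i]);

        int new_cost = cost_of_simd_op(p);
        if (!operands_already_packed(p, set))
            new_cost += cost_of_packing(p);
        if (results_needed_unpacked(p, set))
            new_cost += cost_of_unpacking(p);

        return scalar_cost - new_cost;
    }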
As new groups are added to the PackSet, alignment information is propagated from existing groups via use-def or def-use chains. Once set, a statement's alignment determines which position it will occupy in the datapath during its computation. For this reason, a statement can have only one alignment. New groups are created only if their alignment requirements are consistent with those already in place.

When a single definition has multiple uses, there is the potential for many different packing possibilities. If this occurs, the cost model is used to estimate the most profitable possibilities based on what is currently packed. These groups are added to the PackSet in order of their estimated profitability as long as there are no conflicts with existing PackSet entries.

In the example, part (c) shows new groups that are added after following def-use chains of the two existing PackSet entries. Part (d) introduces new groups discovered by following use-def chains. The pseudo code for this phase is listed as extend_packlist in Figure 5.
3.6 Combination

Once all profitable pairs have been chosen, they can be combined into larger groups. Two groups can be combined when the left statement of one is the same as the right statement of the other. In fact, groups must be combined in this fashion in order to prevent a statement from appearing in more than one group in the final PackSet. This process, provided by the combine_packs routine, checks all groups against one another and repeats until all possible combinations have been made. Figure 4(e) shows the result of our example after combination.

Since the adjacent memory identification phase uses alignment information, it will never create pairs of memory accesses that cross an alignment boundary. All packed statements are aligned based on this initial seed. As a result, the combination phase will never produce a group that spans an alignment boundary. Combined groups are therefore guaranteed to be less than or equal to the superword datapath size.
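A direct transcription of this combination rule, reusing the hypothetical Pack structure from Section 3.4 (our sketch, not the compiler's implementation), is:

    #include <stdlib.h>

    /* Two packs combine when the last (right) statement of one is the first
     * (left) statement of the other, e.g. <s1,s2> and <s2,s3> merge into
     * <s1,s2,s3>.  combine_packs applies this repeatedly until nothing changes. */
    int can_combine(const Pack *p, const Pack *q)
    {
        return p->stmts[p->n - 1] == q->stmts[0];
    }

    Pack combine(const Pack *p, const Pack *q)
    {
        Pack r;
        r.n = p->n + q->n - 1;                 /* the shared statement is counted once */
        r.stmts = malloc(r.n * sizeof(Stmt *));
        for (int i = 0; i < p->n; i++)
            r.stmts[i] = p->stmts[i];
        for (int j = 1; j < q->n; j++)         /* skip the duplicated first statement */
            r.stmts[p->n + j - 1] = q->stmts[j];
        return r;
    }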
3.7 Scheduling

Dependence analysis before packing ensures that statements within a group can be executed safely in parallel. However, it may be the case that executing two groups produces a dependence violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between groups if a statement in one group is dependent on a statement in the other. As long as there are no cycles in this dependence graph, all groups can be scheduled such that no violations occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group will need to be eliminated. Although experimental data has shown this case to be extremely rare, care must be taken to ensure correctness.

The scheduling phase begins by scheduling statements based on their order in the original basic block. Each statement is scheduled as soon as all statements on which it is dependent have been scheduled. For groups of packed statements, this property must be satisfied for each statement in the group. If scheduling is ever inhibited by the presence of a cycle, the group containing the earliest unscheduled statement is split apart. Scheduling continues until all statements have been scheduled.

Whenever a group of packed statements is scheduled, a new SIMD operation is emitted instead. If this new operation requires operand packing or reshuffling, the necessary operations are scheduled first. Similarly, if any statements require unpacking of their source data, the required steps are taken. Since our analysis operates at the level of basic blocks, each basic block assumes all data are in an unpacked configuration upon entry to the block. For this reason, all variables that are live on exit are unpacked at the end of the block.

Scheduling is provided by the schedule routine in Figure 5. In the example of Figure 4, the result of scheduling is shown in part (f). At the completion of this phase, a new basic block has been constructed wherever parallelization was successful. These blocks contain SIMD instructions in place of packed isomorphic statements. As we will show in Section 5, the algorithm can be used to achieve speedups on a microprocessor with multimedia extensions.

    Group 1:
        x = a[i+0] + k1
        y = a[i+1] + k2
        z = a[i+2] + s

    Group 2:
        q = b[i+0] + y
        r = b[i+1] + k3
        s = b[i+2] + k4

Figure 6: Example of a dependence between groups of packed statements.
4 A Simple Vectorizing Compiler

The SLP concepts presented in the previous section lead to an elegant implementation of a vectorizing compiler. Vector parallelism is characterized by the execution of multiple iterations of an instruction using a single vector operation. This same computation can be uncovered with unrolling by limiting packing to unrolled versions of the same statement. With this technique, each statement has only one possible grouping, which means that no searching is required. Instead, every statement can be packed automatically with its siblings if they are found to be independent. The profitability of each group can then be evaluated in the context of the entire set of packed data. Any groups that are deemed unprofitable can be dropped in favor of their sequential counterparts. The pseudo code for this algorithm is shown in Figure 7.

While not as general as the algorithm described in the previous section, this technique shares many of the same desirable properties. First, the analysis itself is extremely simple and robust. Second, partially vectorizable loops can be parallelized without complicated loop transformations. Most importantly, this analysis is able to achieve good results on scientific and multimedia benchmarks.

The drawback to this method is that it may not be applicable to long vector architectures. Since the unroll factor must be consistent with the vector size, unrolling may produce basic blocks that overwhelm the analysis and the code generator. As such, this method is mainly applicable to architectures with short vectors.

In Section 5, we will provide data that compare this approach to the algorithm described in Section 3.

    vector_parallelize: BasicBlock B → BasicBlock
        PackSet P ← ∅
        P ← find_all_packs(B, P)
        P ← eliminate_unprofitable_packs(P)
        return schedule(B, [ ], P)

    find_all_packs: BasicBlock B × PackSet P → PackSet
        foreach Stmt s ∈ B do
            if ∀ p ∈ P . s ∉ p then
                Pack p ← [s]
                foreach Stmt s' ∈ B where s' ≠ s do
                    if stmts_are_packable(s, s') then
                        p ← p · s'
                if |p| > 1 then
                    P ← P ∪ {p}
        return P

    stmts_are_packable: Stmt s × Stmt s' → Boolean
        if same_orig_stmt(s, s') then
            if independent(s, s') then
                return true
        return false

    eliminate_unprofitable_packs: PackSet P → PackSet
        repeat
            PackSet P' ← P
            foreach Pack p ∈ P do
                if est_savings(p, P) < 0 then
                    P ← P − {p}
        until P ≡ P'
        return P

Figure 7: Pseudo code for the vector extraction algorithm. Procedures that are identical to those in Figure 5 are omitted. same_orig_stmt returns true if two statements are unrolled versions of the same original statement.
5 Results

This section presents potential performance gains for SLP compiler techniques and substantiates them using a Motorola MPC7400 microprocessor with the AltiVec instruction set. All results were gathered using the compiler algorithms described in Sections 3 and 4. Both were implemented within the SUIF compiler infrastructure [23].

5.1 Benchmarks

We measure the success of our SLP algorithm on both scientific and multimedia applications. For scientific codes, we use the SPEC95fp benchmark suite. Our multimedia benchmarks are provided by the kernels listed in Table 1.

    Name   Description                        Datatype
    FIR    Finite impulse response filter     32-bit float
    IIR    Infinite impulse response filter   32-bit float
    VMM    Vector-matrix multiply             32-bit float
    MMM    Matrix-matrix multiply             32-bit float
    YUV    RGB to YUV conversion              16-bit integer

Table 1: Multimedia kernels used to evaluate the effectiveness of SLP analysis.
5.2 SLP Availability

To evaluate the availability of superword level parallelism in our benchmarks, we calculated the percentage of dynamic instructions eliminated from a sequential program after parallelization. All instructions were counted equally, including SIMD operations. When packing was required, we assumed that n-1 instructions were required to pack n values into a single SIMD register. These values were also used for unpacking costs.

Measurements were obtained by instrumenting source code with counters in order to determine the number of times each basic block was executed. These numbers were then multiplied by the number of static SUIF instructions in each basic block. Results for both sets of benchmarks are listed in Table 2 and illustrated in Figure 8. The performance of each benchmark is shown for a variety of hypothetical datapath widths. It is assumed that each datapath can accommodate SIMD versions of any standard data type. For example, a datapath of 512 bits can perform eight 64-bit floating point operations in parallel. To uncover the maximum amount of superword level parallelism available, we compiled each benchmark without alignment constraints. This allowed for a maximum degree of freedom when making packing decisions.

    Benchmark   128 bits   256 bits   512 bits   1024 bits
    swim        61.59%     64.45%     73.44%     77.17%
    tomcatv     40.91%     61.28%     69.50%     73.85%
    mgrid       43.49%     55.13%     60.51%     61.52%
    su2cor      33.99%     48.73%     56.06%     59.63%
    wave5       26.69%     37.25%     41.97%     43.87%
    apsi        24.19%     29.93%     31.32%     29.85%
    hydro2d     18.53%     26.17%     28.88%     30.80%
    turb3d      21.16%     24.76%     21.55%     15.13%
    applu       15.54%     22.56%     10.29%     0.01%
    fpppp       4.22%      8.14%      8.27%      8.27%
    FIR         38.72%     45.37%     48.56%     49.84%
    IIR         51.83%     60.59%     64.77%     66.45%
    VMM         36.92%     43.37%     46.63%     51.90%
    MMM         61.75%     73.63%     79.76%     82.86%
    YUV         87.21%     93.59%     96.79%     98.36%

Table 2: Percentage of dynamic instructions eliminated by SLP analysis for a variety of hypothetical datapath widths.

Figure 8: Percentage of dynamic instructions eliminated by SLP analysis for a variety of hypothetical datapath widths. (Bar chart of the Table 2 data; y-axis: % of dynamic instructions eliminated; one bar per benchmark for 128-, 256-, 512-, and 1024-bit datapaths.)
For the multimedia benchmarks, YUV greatly outperforms the other kernels. This is because it operates on 16-bit values and is entirely vectorizable. The remaining kernels are partially vectorizable and still exhibit large performance gains.

For the SPEC95fp benchmark suite, some of the applications exhibit a performance degradation as the datapath width is increased. This is due to the large unroll factor required to fill a wide datapath. If the dynamic iteration counts for these loops are smaller than the unroll factor, the unrolled loop is never executed. For turb3d and applu, the optimal unroll factor is four. A 256-bit datapath is therefore sufficient since it can accommodate four 64-bit operations. In fpppp, the most time-intensive loop is already unrolled by a factor of three. A 192-bit datapath can support the available parallelism in this situation.

In Figure 9 and Table 3 we compare the SLP algorithm to the vectorization technique described in Section 4. For the multimedia benchmarks, both methods perform identically. However, there are many cases in the scientific applications for which the SLP algorithm is able to find additional packing opportunities. In Figure 10, we show the available vector parallelism as a subset of the available superword level parallelism.

Figure 9: Percentage of dynamic instructions eliminated with SLP parallelization and with vector parallelization on a 256-bit datapath. (Bar chart of the Table 3 data; y-axis: % of dynamic instructions eliminated; series: Vector Parallelism, Superword Level Parallelism.)

    Benchmark   SLP       Vector
    swim        64.45%    62.29%
    tomcatv     61.28%    56.87%
    mgrid       55.13%    34.29%
    su2cor      48.73%    44.20%
    wave5       37.25%    28.73%
    apsi        29.93%    15.89%
    hydro2d     26.17%    22.91%
    turb3d      24.76%    20.35%
    applu       22.56%    14.67%
    fpppp       8.14%     0.00%
    FIR         45.37%    45.37%
    IIR         60.59%    60.59%
    VMM         43.37%    43.37%
    MMM         73.63%    73.63%
    YUV         93.59%    93.59%

Table 3: Percentage of dynamic instructions eliminated with SLP parallelization and with vector parallelization on a 256-bit datapath.

Figure 10: Contribution of vectorizable and non-vectorizable code sequences in total SLP savings for the SPEC95fp benchmark suite. (Bar chart; y-axis: % contribution to dynamic instructions eliminated; series: Vector Component, Non-vector Component.)

5.3 SLP Performance

To test the performance of our SLP algorithm in a real environment, we targeted our compilation system to the AltiVec [19] instruction set. Of the popular multimedia extensions available in commercial microprocessors, we believe AltiVec best matches the compilation technique described in this paper. AltiVec defines 128-bit floating point and integer SIMD operations and provides a complementary set of 32 general purpose registers. It also defines load and store instructions capable of moving a full 128 bits of data.

Our compiler automatically generates C code with AltiVec macros inserted where parallelization is successful. We then use an extended gcc compiler to generate machine code. This compiler was provided by Motorola and supports the AltiVec ABI (application binary interface). Due to the experimental nature of the AltiVec compiler extensions, it was necessary to compile all benchmarks without optimization. Base measurements were made by compiling the unparallelized version for execution on the MPC7400 superscalar unit. In both cases, the same set of SUIF optimizations and the same gcc backend were used. Since AltiVec does not support unaligned memory accesses, all benchmarks were compiled with alignment constraints in place [13].
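As an illustration of the flavor of such generated code (a hand-written sketch, not actual output of our compiler), a fully packed single precision loop written with AltiVec intrinsics looks roughly like this, assuming the arrays are 16-byte aligned:

    #include <altivec.h>

    /* Four 32-bit floats are processed per iteration; vec_ld/vec_st rely on the
     * 16-byte alignment guaranteed by the alignment constraints mentioned above. */
    void vadd(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            vector float va = vec_ld(0, &a[i]);
            vector float vb = vec_ld(0, &b[i]);
            vec_st(vec_add(va, vb), 0, &c[i]);
        }
    }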
Table 4 and Figure 11 present performance comparisons on a 450MHz G4 PowerMac workstation. Most of the SPEC95fp benchmarks require double precision floating point support to operate correctly. Since this is not supported by AltiVec, we were unable to compile vectorized versions for all but two of the benchmarks. swim utilizes single precision floating point operations, and the SPEC92fp version of tomcatv provides a result similar to the 64-bit version.

Our compiler currently assumes that all packed operations are executed on the AltiVec unit and all sequential operations are performed on the superscalar unit. Operations to pack and unpack data are therefore required to go through memory since AltiVec provides no instructions to move data between register files. Despite this high cost, our compiler is still able to exploit superword level parallelism and provide speedups.

    Benchmark   Speedup
    swim        1.24
    tomcatv     1.57
    FIR         1.26
    IIR         1.41
    VMM         1.70
    MMM         1.79
    YUV         6.70

Table 4: Speedup on an MPC7400 processor using SLP compilation.

Figure 11: Percentage improvement of execution time on an MPC7400 processor using SLP compilation. (Bar chart over Tomcatv, Swim, FIR, IIR, VMM, MMM, and YUV; y-axis: % improvement of the execution time; the YUV improvement reaches 570%.)
6 Architectural Support for SLP

The compiler algorithm presented in Section 3 was inspired by the multimedia extensions in modern processors. However, several limitations make it difficult to fully realize the potential provided by SLP analysis. We list some of these limitations below:

• Many multimedia instructions are designed for a specific high-level operation. For example, HP's MAX-2 extensions offer matrix transform instructions [16] and SUN's VIS extensions include instructions to compute pixel distances [18]. The complex CISC-like semantics of these instructions make automatic code generation difficult.

• SLP hardware is typically viewed as a multimedia engine alone and is not designed for general purpose computation. Floating point capabilities, for example, have only recently been added to some architectures. Furthermore, even the most advanced multimedia extensions lack certain fundamental operations such as 32-bit integer multiplication and division [19].

• In current architectures, data sets are usually considered to belong exclusively to either multimedia or superscalar hardware. This design philosophy is portrayed in the lack of inter register file move operations in the AltiVec instruction set. If SLP compilation techniques can show a need for a better coupling between these two units, future architectures may provide the necessary support.

• Most current multimedia instruction sets are designed with the assumption that data are always stored in the proper packed configuration. As a result, data packing and unpacking instructions are generally not well supported. This important operation is useful to our system. With better support, SLP performance can be further increased.

• Although our system is capable of compiling for machines that do not support unaligned memory accesses, the algorithm is potentially more effective without this constraint. Architectures supplying efficient unaligned load and store instructions might improve the performance of SLP analysis.

The first three points discuss simple processor modifications that we hope will be incorporated into future multimedia instruction sets as they mature. The last two points address difficult issues. Solving them in either hardware or software is not trivial. More research is required to determine the best approach.
7 Keys to General Acceptance of SLP

Many of the techniques developed by the academic compiler community are not accepted in mainstream computing. A good example is the work on loop level parallelization that has continued for over three decades. However, in a very short period of time, ILP compilers have become universal. We believe the following characteristics are critical to the general acceptance of a compiler optimization:

• Robustness: If simple source code modifications drastically alter program performance, success becomes dependent upon the user's understanding of compiler intricacies. For example, techniques to uncover loop level parallelism are prone to wide fluctuations in performance. A change in one statement of the loop body may result in a vector compiler's sequentialization of the entire loop. In the case of ILP and SLP, failure to parallelize a few statements will not significantly impact aggregate performance. This makes methods for their extraction much more robust.

• Scalability: Compiler techniques must be able to handle large programs if they are to gain acceptance for real applications. Some analyses required by loop optimizations do not scale well to large code sizes because of dependence on global program analysis. Although global analysis can improve the effectiveness of ILP and SLP, it is not required. Therefore, complexity grows linearly with program size. This results in smooth scaling to larger applications.

• Simplicity: Complex compiler transformations are more prone to bugs than simple analyses. Problems are likely to appear only under very specific conditions, making them difficult to detect. Many time-critical projects are compiled without optimizations in order to avoid possible compiler errors. Coarse-grain parallelization and vectorization require involved analyses that are more likely to exhibit this behavior. However, most ILP techniques, as well as the SLP techniques presented in Section 3, are extremely simple to understand, implement and validate. In addition, it is often the case that simplicity leads to faster compilation.

• Portability: Optimizations that are dependent on particular features of a source language or programming style will not become universal. Techniques for extracting loop level parallelism are limited because they only apply to programs written with loops and arrays. Alternatively, ILP and SLP techniques are applied at the level of basic blocks, making them less dependent on source code characteristics.

• Effectiveness: No compiler technique will be used if it does not substantially improve program performance. In Section 5, we showed that our algorithm for detecting SLP can provide remarkable performance gains.

We believe SLP compiler techniques have the potential to become universally accepted as viable and effective methods of extracting SIMD parallelism. As a result, we expect future architectures to place increasing importance on SLP operations.

8 Conclusion

In this paper we introduced superword level parallelism, the notion of viewing parallelism from the perspective of partitioned operations on packed superwords. We showed that SLP can be exploited with a simple and robust compiler implementation that exhibits speedups ranging from 1.24 to 6.70 on a set of scientific and multimedia benchmarks.

We also showed that SLP concepts lead to an elegant implementation of a vectorizing compiler. By comparing the performance of this compiler to the more general SLP algorithm, we demonstrated that vector parallelism is a subset of superword level parallelism.

Our current compiler implementation is still in its infancy. While successful, we believe its effectiveness can be improved. By extending the SLP analysis beyond basic blocks, more packing opportunities could be found. Furthermore, SLP could offer a form of predication, in which unfilled slots of a wide operation could be filled with speculative computation. If data are invalidated due to control flow, they could simply be discarded.

Recent research has shown that compiler analysis can significantly reduce the size of data types needed to store program variables [22]. Incorporating this analysis into our own has the potential of drastically improving performance by increasing the number of operands that can be packed and executed in parallel.

Today, most desktop processors are equipped with multimedia extensions. Nonuniformities in the different instruction sets, exacerbated by a lack of compiler support, have left these extensions underutilized. We have shown that SLP compilation is not only possible, but also applicable to a wider class of application domains. As such, we believe SLP compilation techniques have the potential to become an integral part of general purpose computing in the near future.

9 Acknowledgments

We thank Krste Asanović for his input on vector processing and the CAG group members who provided feedback during the writing of this paper, particularly Michael Taylor, Derek Bruening, Mike Zhang, Darko Marinov, Matt Frank and Mark Stephenson. Radu Rugina contributed his pointer analysis package as well as his own time to implement new extensions. Manas Mandal, Kalpesh Gala, Brian Grayson and James Yang at Motorola provided access to much needed development tools and technical expertise. We also thank the anonymous reviewers for their constructive comments. Finally, we thank Matt Deeds for his help in the completion of this paper.

This research was funded in part by NSF grant EIA9810173 and DARPA grant DBT63-96-C-0036.

References

[1] E. Albert, K. Knobe, J. Lukas, and G. Steele, Jr. Compiling Fortran 8x array features for the Connection Machine computer system. In Proceedings of the ACM SIGPLAN Symposium on Parallel Programming: Experience with Applications, Languages, and Systems (PPEALS), New Haven, CT, July 1988.

[2] J. R. Allen and K. Kennedy. PFC: A Program to Convert Fortran to Parallel Form. In K. Hwang, editor, Supercomputers: Design and Applications, pages 186–203. IEEE Computer Society Press, Silver Spring, MD, 1984.

[3] Krste Asanović, James Beck, Bertrand Irissou, Brian E. D. Kingsbury, Nelson Morgan, and John Wawrzynek. The T0 Vector Microprocessor. In Proceedings of Hot Chips VII, August 1995.

[4] D. Callahan and P. Havlak. Scalar expansion in PFC: Modifications for Parallelization. Supercomputer Software Newsletter 5, Dept. of Computer Science, Rice University, October 1986.

[5] Derek J. DeVries. A Vectorizing SUIF Compiler: Implementation and Performance. Master's thesis, University of Toronto, June 1997.

[6] Keith Diefendorff. Pentium III = Pentium II + SSE. Microprocessor Report, 13(3):1,6–11, March 1999.

[7] Keith Diefendorff. Sony's Emotionally Charged Chip. Microprocessor Report, 13(5):1,6–11, April 1999.

[8] Keith Diefendorff and Pradeep K. Dubey. How Multimedia Workloads Will Change Processor Design. IEEE Computer, 30(9):43–45, September 1997.

[9] G. H. Barnes, R. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. Stokes. The Illiac IV Computer. IEEE Transactions on Computers, C(17):746–757, August 1968.

[10] Linley Gwennap. AltiVec Vectorizes PowerPC. Microprocessor Report, 12(6):1,6–9, May 1998.

[11] Craig Hansen. MicroUnity's MediaProcessor Architecture. IEEE Micro, 16(4):34–41, Aug 1996.

[12] D. J. Kuck, R. H. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence Graphs and Compiler Optimizations. In Proceedings of the 8th ACM Symposium on Principles of Programming Languages, pages 207–218, Williamsburg, VA, Jan 1981.

[13] Samuel Larsen, Radu Rugina, and Saman Amarasinghe. Alignment Analysis. Technical Report LCS-TM-605, Massachusetts Institute of Technology, June 2000.

[14] Corina G. Lee and Derek J. DeVries. Initial Results on the Performance and Cost of Vector Microprocessors. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 171–182, Research Triangle Park, USA, December 1997.

[15] Corina G. Lee and Mark G. Stoodley. Simple Vector Microprocessors for Multimedia Applications. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 25–36, Dallas, TX, December 1998.

[16] Ruby Lee. Subword Parallelism with MAX-2. IEEE Micro, 16(4):51–59, Aug 1996.

[17] Glenn Luecke and Waqar Haque. Evaluation of Fortran Vector Compilers and Preprocessors. Software—Practice and Experience, 21(9), September 1991.

[18] Marc Tremblay, Michael O'Connor, Venkatesh Narayanan, and Liang He. VIS Speeds New Media Processing. IEEE Micro, 16(4):10–20, Aug 1996.

[19] Motorola. AltiVec Technology Programming Environments Manual, November 1998.

[20] Alex Peleg and Uri Weiser. MMX Technology Extension to Intel Architecture. IEEE Micro, 16(4):42–50, Aug 1996.

[21] Radu Rugina and Martin Rinard. Pointer Analysis for Multithreaded Programs. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation, Atlanta, GA, May 1999.

[22] Mark Stephenson, Jonathon Babb, and Saman Amarasinghe. Bitwidth Analysis with Application to Silicon Compilation. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, Vancouver, BC, June 2000.

[23] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. ACM SIGPLAN Notices, 29(12):31–37, December 1994.
