(a) Initial sequence of statements:
        (1) b = a[i+0]
        (2) c = 5
        (3) d = b + c
        (4) e = a[i+1]
        (5) f = 6
        (6) g = e + f
        (7) h = a[i+2]
        (8) j = 7
        (9) k = h + j

(b) Adjacent memory references paired:
        P = { ⟨(1),(4)⟩, ⟨(4),(7)⟩ }; the remaining statements stay in U.

(c) Extended along def-use chains:
        P = { ⟨(1),(4)⟩, ⟨(4),(7)⟩, ⟨(3),(6)⟩, ⟨(6),(9)⟩ }

(d) Extended along use-def chains:
        P = { ⟨(1),(4)⟩, ⟨(4),(7)⟩, ⟨(3),(6)⟩, ⟨(6),(9)⟩, ⟨(2),(5)⟩, ⟨(5),(8)⟩ }

(e) Packs sharing a statement combined:
        P = { ⟨(1),(4),(7)⟩, ⟨(2),(5),(8)⟩, ⟨(3),(6),(9)⟩ }

(f) Scheduled SIMD result (one superword operation per group):
        (b, e, h) ← (a[i+0], a[i+1], a[i+2])
        (c, f, j) ← (5, 6, 7)
        (d, g, k) ← (b, e, h) + (c, f, j)
Figure 4: Various phases of SLP analysis. U and P represent the current set of unpacked and packed statements, respectively.
(a) Initial sequence of instructions. (b) Statements with adjacent memory references are paired and added to the PackSet.
(c) The PackSet is extended by following def-use chains of existing entries. (d) The PackSet is further extended by following
use-def chains. (e) Combination merges groups containing the same expression. (f) Each group is scheduled and SIMD
operations are emitted in their place.
SLP_extract: BasicBlock B → BasicBlock
    PackSet P ← ∅
    P ← find_adj_refs(B, P)
    P ← extend_packlist(B, P)
    P ← combine_packs(P)
    return schedule(B, [ ], P)

find_adj_refs: BasicBlock B × PackSet P → PackSet
    foreach Stmt s ∈ B do
        foreach Stmt s' ∈ B where s ≠ s' do
            if has_mem_ref(s) ∧ has_mem_ref(s') then
                if adjacent(s, s') then
                    Int align ← get_alignment(s)
                    if stmts_can_pack(B, P, s, s', align) then
                        P ← P ∪ {⟨s, s'⟩}
    return P

extend_packlist: BasicBlock B × PackSet P → PackSet
    repeat
        PackSet P_prev ← P
        foreach Pack p ∈ P do
            P ← follow_use_defs(B, P, p)
            P ← follow_def_uses(B, P, p)
    until P ≡ P_prev
    return P

combine_packs: PackSet P → PackSet
    repeat
        PackSet P_prev ← P
        foreach Pack p = ⟨s_1, ..., s_n⟩ ∈ P do
            foreach Pack p' = ⟨s'_1, ..., s'_m⟩ ∈ P do
                if s_n ≡ s'_1 then
                    P ← P − {p, p'} ∪ {⟨s_1, ..., s_n, s'_2, ..., s'_m⟩}
    until P ≡ P_prev
    return P

stmts_can_pack: BasicBlock B × PackSet P × Stmt s × Stmt s' × Int align → Boolean
    if isomorphic(s, s') then
        if independent(s, s') then
            if ∀⟨t, t'⟩ ∈ P . t ≠ s then
                if ∀⟨t, t'⟩ ∈ P . t' ≠ s' then
                    Int align_s ← get_alignment(s)
                    Int align_s' ← get_alignment(s')
                    if align_s ≡ ⊤ ∨ align_s ≡ align then
                        if align_s' ≡ ⊤ ∨ align_s' ≡ align + data_size(s') then
                            return true
    return false

follow_use_defs: BasicBlock B × PackSet P × Pack p → PackSet
    where p = ⟨s, s'⟩, s = [ x_0 := f(x_1, ..., x_m) ], s' = [ x'_0 := f(x'_1, ..., x'_m) ]
    Int align ← get_alignment(s)
    for j ← 1 to m do
        if ∃t ∈ B . t = [ x_j := ... ] ∧ ∃t' ∈ B . t' = [ x'_j := ... ] then
            if stmts_can_pack(B, P, t, t', align) then
                if est_savings(⟨t, t'⟩, P) ≥ 0 then
                    P ← P ∪ {⟨t, t'⟩}
                    set_alignment(s, s', align)
    return P

follow_def_uses: BasicBlock B × PackSet P × Pack p → PackSet
    where p = ⟨s, s'⟩, s = [ x_0 := f(x_1, ..., x_m) ], s' = [ x'_0 := f(x'_1, ..., x'_m) ]
    Int align ← get_alignment(s)
    Int savings ← −1
    foreach Stmt t ∈ B where t = [ ... := g(..., x_0, ...) ] do
        foreach Stmt t' ∈ B where t ≠ t' = [ ... := h(..., x'_0, ...) ] do
            if stmts_can_pack(B, P, t, t', align) then
                if est_savings(⟨t, t'⟩, P) > savings then
                    savings ← est_savings(⟨t, t'⟩, P)
                    Stmt u ← t
                    Stmt u' ← t'
    if savings ≥ 0 then
        P ← P ∪ {⟨u, u'⟩}
        set_alignment(u, u')
    return P

schedule: BasicBlock B × BasicBlock B' × PackSet P → BasicBlock
    for i ← 0 to |B| do
        if ∃p = ⟨..., s_i, ...⟩ ∈ P then
            if ∀s ∈ p . deps_scheduled(s, B') then
                foreach Stmt s ∈ p do
                    B ← B − s
                    B' ← B' · s
                return schedule(B, B', P)
        else if deps_scheduled(s_i, B') then
            return schedule(B − s_i, B' · s_i, P)
    if |B| ≠ 0 then
        P ← P − {p} where p = first(B, P)
        return schedule(B, B', P)
    return B'
Figure 5: Pseudo code for the SLP extraction algorithm. Only key procedures are listed. Helper functions include: 1) has_mem_ref, which returns true if a statement accesses memory; 2) adjacent, which checks adjacency between two memory references; 3) get_alignment, which retrieves alignment information; 4) set_alignment, which sets alignment information when it is not already set; 5) deps_scheduled, which returns true when, for a given statement, all statements upon which it is dependent have been scheduled; 6) first, which returns the PackSet member containing the earliest unscheduled statement; 7) est_savings, which estimates the savings of a potential group; 8) isomorphic, which checks for statement isomorphism; and 9) independent, which returns true when two statements are independent.
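For concreteness, the following C sketch models the stmts_can_pack checks over a simplified statement record. The Stmt fields and the stubbed helpers are our own illustrative assumptions, not the authors' code:

    #include <stdbool.h>

    /* Simplified statement record for illustration; the real analysis
       works on compiler IR.  Fields and names are ours. */
    typedef struct {
        int  op;            /* opcode: equal opcodes model isomorphism       */
        bool packed_left;   /* already the left element of some pack?       */
        bool packed_right;  /* already the right element of some pack?      */
        int  align;         /* byte offset in a superword; -1 = unknown (⊤) */
        int  size;          /* data size in bytes                           */
    } Stmt;

    static bool isomorphic(const Stmt *s, const Stmt *t) { return s->op == t->op; }

    /* Stub: a real compiler consults dependence analysis here. */
    static bool independent(const Stmt *s, const Stmt *t) { (void)s; (void)t; return true; }

    /* Mirrors the stmts_can_pack checks in Figure 5. */
    bool stmts_can_pack(const Stmt *s, const Stmt *t, int align)
    {
        if (!isomorphic(s, t) || !independent(s, t))       return false;
        if (s->packed_left || t->packed_right)             return false;
        if (s->align != -1 && s->align != align)           return false;
        if (t->align != -1 && t->align != align + t->size) return false;
        return true;
    }

These guards correspond one-to-one with the packing conditions listed next.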
• The statements are independent.

• The left statement is not already packed in a left position.

• The right statement is not already packed in a right position.

• Alignment information is consistent.

• Execution time of the new parallel operation is estimated to be less than the sequential version.

    x = a[i+0] + k1          q = b[i+0] + y
    y = a[i+1] + k2          r = b[i+1] + k3
    z = a[i+2] + s           s = b[i+2] + k4

Figure 6: Example of a dependence between groups of packed statements.
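The dependence in Figure 6 forms a cycle between the two candidate groups: the first group uses s, defined in the second, while the second uses y, defined in the first. A minimal C sketch of such a cross-group check (our own simplified model, not the paper's implementation) follows:

    #include <stdbool.h>

    /* Illustrative model: each statement records the variable it defines
       and the variables it uses, encoded as small integer ids. */
    typedef struct { int def; int uses[2]; } Stmt;

    /* True if some statement in group a uses a value defined in group b. */
    static bool group_depends_on(const Stmt *a, int na, const Stmt *b, int nb)
    {
        for (int i = 0; i < na; i++)
            for (int u = 0; u < 2; u++)
                for (int j = 0; j < nb; j++)
                    if (a[i].uses[u] == b[j].def)
                        return true;
        return false;
    }

    /* Two packed groups can be scheduled as whole units only if they do
       not depend on each other in both directions. */
    bool packs_schedulable(const Stmt *a, int na, const Stmt *b, int nb)
    {
        return !(group_depends_on(a, na, b, nb) &&
                 group_depends_on(b, nb, a, na));
    }

For the Figure 6 groups both directional checks succeed, so packs_schedulable returns false; schedule() then drops a pack (the first(B, P) path in Figure 5) and emits its statements sequentially.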
We measure the success of our SLP algorithm on both scientific and multimedia applications. For scientific codes, we use the SPEC95fp benchmark suite.

[Table 2: Percentage of dynamic instructions eliminated by SLP analysis for hypothetical datapath widths of 128, 256, 512, and 1024 bits.]
[Figure 8: Percentage of dynamic instructions eliminated by SLP analysis for a variety of hypothetical datapath widths (128, 256, 512, and 1024 bits), per benchmark.]

[Figure 10: Contribution of vectorizable and non-vectorizable code sequences in total SLP savings for the SPEC95fp benchmark suite.]
Some applications exhibit a performance degradation as the datapath width is increased. This is due to the large unroll factor required to fill a wide datapath. If the dynamic iteration counts for these loops are smaller than the unroll factor, the unrolled loop is never executed. For turb3d and applu, the optimal unroll factor is four. A 256-bit datapath is therefore sufficient since it can accommodate four 64-bit operations. In fpppp, the most time-intensive loop is already unrolled by a factor of three. A 192-bit datapath can support the available parallelism in this situation.

In Figure 9 and Table 3 we compare the SLP algorithm to the vectorization technique described in Section 4. For the multimedia benchmarks, both methods perform identically. However, there are many cases in the scientific applications for which the SLP algorithm is able to find additional packing opportunities. In Figure 10, we show the available vector parallelism as a subset of the available superword level parallelism.

5.3 SLP Performance

To test the performance of our SLP algorithm in a real environment, we targeted our compilation system to the AltiVec [19] instruction set. Of the popular multimedia extensions available in commercial microprocessors, we believe AltiVec best matches the compilation technique described in this paper. AltiVec defines 128-bit floating point and integer SIMD operations and provides a complementary set of 32 general purpose registers. It also defines load and store instructions capable of moving a full 128 bits of data.

Our compiler automatically generates C code with AltiVec macros inserted where parallelization is successful. We then use an extended gcc compiler to generate machine code. This compiler was provided by Motorola and supports the AltiVec ABI (application binary interface). Due to the experimental nature of the AltiVec compiler extensions, it was necessary to compile all benchmarks without optimization. Base measurements were made by compiling the unparallelized version for execution on the MPC7400 superscalar unit. In both cases, the same set of SUIF optimizations and the same gcc backend were used. Since AltiVec does not support unaligned memory accesses, all benchmarks were compiled with alignment constraints in place [13].

Table 4 and Figure 11 present performance comparisons on a 450MHz G4 PowerMac workstation. Most of the SPEC95fp benchmarks require double precision floating point support to operate correctly. Since this is not supported by AltiVec, we were unable to compile vectorized versions for all but two of the benchmarks. swim utilizes single precision floating point operations, and the SPEC92fp version of tomcatv provides a result similar to the 64-bit version.

[Figure 11: Percentage improvement of execution time on an MPC7400 processor using SLP compilation, for tomcatv, swim, FIR, IIR, VMM, MMM, and YUV.]
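As an illustration of the target form, a minimal hand-written AltiVec example follows. It is not the compiler's actual output (which uses SUIF-inserted macros); the function and array names are ours, and the arrays are assumed 16-byte aligned, consistent with the alignment constraints above:

    #include <altivec.h>

    /* Four adjacent single-precision adds packed into one AltiVec
       operation.  Assumes a, b, and c are 16-byte aligned; vec_ld and
       vec_st perform aligned accesses. */
    void add4(const float *a, const float *b, float *c, int i)
    {
        vector float va = vec_ld(0, &a[i]);   /* packs a[i+0..i+3]      */
        vector float vb = vec_ld(0, &b[i]);   /* packs b[i+0..i+3]      */
        vector float vc = vec_add(va, vb);    /* one SIMD add           */
        vec_st(vc, 0, &c[i]);                 /* unpacks to c[i+0..i+3] */
    }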
Our compiler currently assumes that all packed operations are executed on the AltiVec unit and all sequential operations are performed on the superscalar unit. Operations to pack and unpack data are therefore required to go through memory since AltiVec provides no instructions to move data between register files. Despite this high cost, our compiler is still able to exploit superword level parallelism and provide speedups.
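For example, staging a scalar value into a vector register plausibly looks like the following sketch (our own illustration, assuming a gcc-style aligned buffer; not the compiler's actual output):

    #include <altivec.h>

    /* AltiVec has no instruction to move data between the scalar and
       vector register files, so packing a scalar result costs a store
       plus a reload through memory. */
    vector float pack_scalar(float x)
    {
        float tmp[4] __attribute__((aligned(16))) = {0};
        tmp[0] = x;              /* superscalar unit stores to memory */
        return vec_ld(0, tmp);   /* vector unit reloads the superword */
    }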
6 Architectural Support for SLP

The compiler algorithm presented in Section 3 was inspired by the multimedia extensions in modern processors. However, several limitations make it difficult to fully realize the potential provided by SLP analysis. We list some of these limitations below:

• Many multimedia instructions are designed for a specific high-level operation. For example, HP's MAX-2 extensions offer matrix transform instructions [16] and SUN's VIS extensions include instructions to compute pixel distances [18]. The complex CISC-like semantics of these instructions make automatic code generation difficult.

• SLP hardware is typically viewed as a multimedia engine alone and is not designed for general purpose computation. Floating point capabilities, for example, have only recently been added to some architectures. Furthermore, even the most advanced multimedia extensions lack certain fundamental operations such as 32-bit integer multiplication and division [19].

• In current architectures, data sets are usually considered to belong exclusively to either multimedia or superscalar hardware. This design philosophy is portrayed in the lack of inter register file move operations in the AltiVec instruction set. If SLP compilation techniques can show a need for a better coupling between these two units, future architectures may provide the necessary support.

• Most current multimedia instruction sets are designed with the assumption that data are always stored in the proper packed configuration. As a result, data packing and unpacking instructions are generally not well supported. This important operation is useful to our system. With better support, SLP performance can be further increased.

• Although our system is capable of compiling for machines that do not support unaligned memory accesses, the algorithm is potentially more effective without this constraint. Architectures supplying efficient unaligned load and store instructions might improve the performance of SLP analysis.

The first three points discuss simple processor modifications that we hope will be incorporated into future multimedia instruction sets as they mature. The last two points address difficult issues. Solving them in either hardware or software is not trivial. More research is required to determine the best approach.
7 Keys to General Acceptance of SLP

Many of the techniques developed by the academic compiler community are not accepted in mainstream computing. A good example is the work on loop level parallelization that has continued for over three decades. However, in a very short period of time, ILP compilers have become universal. We believe the following characteristics are critical to the general acceptance of a compiler optimization:

• Robustness: If simple source code modifications drastically alter program performance, success becomes dependent upon the user's understanding of compiler intricacies. For example, techniques to uncover loop level parallelism are prone to wide fluctuations in performance. A change in one statement of the loop body may result in a vector compiler's sequentialization of the entire loop. In the case of ILP and SLP, failure to parallelize a few statements will not significantly impact aggregate performance. This makes methods for their extraction much more robust.

• Scalability: Compiler techniques must be able to handle large programs if they are to gain acceptance for real applications. Some analyses required by loop optimizations do not scale well to large code sizes because of dependence on global program analysis. Although global analysis can improve the effectiveness of ILP and SLP, it is not required. Therefore, complexity grows linearly with program size. This results in smooth scaling to larger applications.

• Simplicity: Complex compiler transformations are more prone to bugs than simple analyses. Problems are likely to appear only under very specific conditions, making them difficult to detect. Many time-critical projects are compiled without optimizations in order to avoid possible compiler errors. Coarse-grain parallelization and vectorization require involved analyses that are more likely to exhibit this behavior. However, most ILP techniques, as well as the SLP techniques presented in Section 3, are extremely simple to understand, implement and validate. In addition, it is often the case that simplicity leads to faster compilation.

• Portability: Optimizations that are dependent on particular features of a source language or programming style will not become universal. Techniques for extracting loop level parallelism are limited because they only apply to programs written with loops and arrays. Alternatively, ILP and SLP techniques are applied at the level of basic blocks, making them less dependent on source code characteristics.

• Effectiveness: No compiler technique will be used if it does not substantially improve program performance. In Section 5, we showed that our algorithm for detecting SLP can provide remarkable performance gains.

We believe SLP compiler techniques have the potential to become universally accepted as viable and effective methods of extracting SIMD parallelism. As a result, we expect future architectures to place increasing importance on SLP operations.
8 Conclusion

In this paper we introduced superword level parallelism, the notion of viewing parallelism from the perspective of partitioned operations on packed superwords. We showed that SLP can be exploited with a simple and robust compiler implementation that exhibits speedups ranging from 1.24 to 6.70 on a set of scientific and multimedia benchmarks.

We also showed that SLP concepts lead to an elegant implementation of a vectorizing compiler. By comparing the performance of this compiler to the more general SLP algorithm, we demonstrated that vector parallelism is a subset of superword level parallelism.

Our current compiler implementation is still in its infancy. While successful, we believe its effectiveness can be improved. By extending the SLP analysis beyond basic blocks, more packing opportunities could be found. Furthermore, SLP could offer a form of predication, in which unfilled slots of a wide operation could be filled with speculative computation. If data are invalidated due to control flow, they could simply be discarded.

Recent research has shown that compiler analysis can significantly reduce the size of data types needed to store program variables [22]. Incorporating this analysis into our own has the potential of drastically improving performance by increasing the number of operands that can be packed and executed in parallel.

Today, most desktop processors are equipped with multimedia extensions. Nonuniformities in the different instruction sets, exacerbated by a lack of compiler support, have left these extensions underutilized. We have shown that SLP compilation is not only possible, but also applicable to a wider class of application domains. As such, we believe SLP compilation techniques have the potential to become an integral part of general purpose computing in the near future.

9 Acknowledgments

We thank Krste Asanović for his input on vector processing and the CAG group members who provided feedback during the writing of this paper, particularly Michael Taylor, Derek Bruening, Mike Zhang, Darko Marinov, Matt Frank and Mark Stephenson. Radu Rugina contributed his pointer analysis package as well as his own time to implement new extensions. Manas Mandal, Kalpesh Gala, Brian Grayson and James Yang at Motorola provided access to much needed development tools and technical expertise. We also thank the anonymous reviewers for their constructive comments. Finally, we thank Matt Deeds for his help in the completion of this paper.

This research was funded in part by NSF grant EIA9810173 and DARPA grant DBT63-96-C-0036.

References

[1] E. Albert, K. Knobe, J. Lukas, and G. Steele, Jr. Compiling Fortran 8x Array Features for the Connection Machine Computer System. In Proceedings of the ACM SIGPLAN Symposium on Parallel Programming: Experience with Applications, Languages, and Systems (PPEALS), New Haven, CT, July 1988.

[2] J. R. Allen and K. Kennedy. PFC: A Program to Convert Fortran to Parallel Form. In K. Hwang, editor, Supercomputers: Design and Applications, pages 186–203. IEEE Computer Society Press, Silver Spring, MD, 1984.

[3] Krste Asanović, James Beck, Bertrand Irissou, Brian E. D. Kingsbury, Nelson Morgan, and John Wawrzynek. The T0 Vector Microprocessor. In Proceedings of Hot Chips VII, August 1995.

[4] D. Callahan and P. Havlak. Scalar Expansion in PFC: Modifications for Parallelization. Supercomputer Software Newsletter 5, Dept. of Computer Science, Rice University, October 1986.

[5] Derek J. DeVries. A Vectorizing SUIF Compiler: Implementation and Performance. Master's thesis, University of Toronto, June 1997.

[6] Keith Diefendorff. Pentium III = Pentium II + SSE. Microprocessor Report, 13(3):1,6–11, March 1999.

[7] Keith Diefendorff. Sony's Emotionally Charged Chip. Microprocessor Report, 13(5):1,6–11, April 1999.

[8] Keith Diefendorff and Pradeep K. Dubey. How Multimedia Workloads Will Change Processor Design. IEEE Computer, 30(9):43–45, September 1997.

[9] G. H. Barnes, R. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. Stokes. The Illiac IV Computer. IEEE Transactions on Computers, C(17):746–757, August 1968.

[10] Linley Gwennap. AltiVec Vectorizes PowerPC. Microprocessor Report, 12(6):1,6–9, May 1998.

[11] Craig Hansen. MicroUnity's MediaProcessor Architecture. IEEE Micro, 16(4):34–41, August 1996.
[12] D. J. Kuck, R. H. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence Graphs and Compiler Optimizations. In Proceedings of the 8th ACM Symposium on Principles of Programming Languages, pages 207–218, Williamsburg, VA, January 1981.

[13] Samuel Larsen, Radu Rugina, and Saman Amarasinghe. Alignment Analysis. Technical Report LCS-TM-605, Massachusetts Institute of Technology, June 2000.

[14] Corina G. Lee and Derek J. DeVries. Initial Results on the Performance and Cost of Vector Microprocessors. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 171–182, Research Triangle Park, USA, December 1997.

[15] Corina G. Lee and Mark G. Stoodley. Simple Vector Microprocessors for Multimedia Applications. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 25–36, Dallas, TX, December 1998.

[16] Ruby Lee. Subword Parallelism with MAX-2. IEEE Micro, 16(4):51–59, August 1996.

[17] Glenn Luecke and Waqar Haque. Evaluation of Fortran Vector Compilers and Preprocessors. Software—Practice and Experience, 21(9), September 1991.

[18] Marc Tremblay, Michael O'Connor, Venkatesh Narayanan, and Liang He. VIS Speeds New Media Processing. IEEE Micro, 16(4):10–20, August 1996.

[19] Motorola. AltiVec Technology Programming Environments Manual, November 1998.

[20] Alex Peleg and Uri Weiser. MMX Technology Extension to Intel Architecture. IEEE Micro, 16(4):42–50, August 1996.

[21] Radu Rugina and Martin Rinard. Pointer Analysis for Multithreaded Programs. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation, Atlanta, GA, May 1999.

[22] Mark Stephenson, Jonathon Babb, and Saman Amarasinghe. Bitwidth Analysis with Application to Silicon Compilation. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, Vancouver, BC, June 2000.

[23] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. ACM SIGPLAN Notices, 29(12):31–37, December 1994.