Programming the Cell Broadband Engine Architecture: Examples and Best Practices
Abraham Arevalo
Ricardo M. Matinata
Maharaja Pandian
Eitan Peri
Kurtis Ruby
Francois Thomas
Chris Almond
ibm.com/redbooks
International Technical Support Organization
August 2008
SG24-7575-00
Note: Before using this information and the product it supports, read the information in
“Notices” on page xi.
This edition applies to Version 3.0 of the IBM Cell Broadband Engine SDK, and the IBM
BladeCenter QS-21 platform.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
The team that wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
3.6 A few scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.7 Design patterns for Cell/B.E. programming . . . . . . . . . . . . . . . . . . . . . . . . 70
3.7.1 Shared queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.7.2 Indirect addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.7.3 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.7.4 Multi-SPE software cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.7.5 Plug-in . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6.4 SIMD programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
4.6.5 Auto-SIMDizing by compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
4.6.6 Working with scalars and converting between different vector types . . . 277
4.6.7 Code transfer using SPU code overlay . . . . . . . . . . . . . . . . . . . . . . 283
4.6.8 Eliminating and predicting branches . . . . . . . . . . . . . . . . . . . . . . . . 284
4.7 Frameworks and domain-specific libraries . . . . . . . . . . . . . . . . . . . . . . . 289
4.7.1 Data Communication and Synchronization . . . . . . . . . . . . . . . . . . . 291
4.7.2 Accelerated Library Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
4.7.3 Domain-specific libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
4.8 Programming guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
4.8.1 General guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
4.8.2 SPE programming guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
4.8.3 Data transfers and synchronization guidelines . . . . . . . . . . . . . . . . 325
4.8.4 Inter-processor communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Chapter 6. The performance tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
6.1 Analysis of the FFT16M application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
6.2 Preparing and building for profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
6.2.1 Step 1: Copying the application from SDK tree. . . . . . . . . . . . . . . . 418
6.2.2 Step 2: Preparing the makefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
6.3 Creating and working with the profile data . . . . . . . . . . . . . . . . . . . . . . . 423
6.3.1 Step 1: Collecting data with CPC . . . . . . . . . . . . . . . . . . . . . . . . . . 423
6.3.2 Step 2: Visualizing counter information using Counter Analyzer . . 423
6.3.3 Step 3: Collecting data with OProfile. . . . . . . . . . . . . . . . . . . . . . . . 424
6.3.4 Step 4: Examining profile information with Profile Analyzer . . . . . . 426
6.3.5 Step 5: Gathering profile information with FDPR-Pro . . . . . . . . . . . 428
6.3.6 Step 6: Analyzing profile data with Code Analyzer . . . . . . . . . . . . . 430
6.4 Creating and working with trace data . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
6.4.1 Step 1: Collecting trace data with PDT . . . . . . . . . . . . . . . . . . . . . . 438
6.4.2 Step 2: Importing the PDT data into Trace Analyzer. . . . . . . . . . . . 441
Part 3. Application re-engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Chapter 10. SDK 3.0 and BladeCenter QS21 system configuration . . . . 541
10.1 BladeCenter QS21 characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
10.2 Installing the operating system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
10.2.1 Important considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
10.2.2 Managing and accessing the blade server . . . . . . . . . . . . . . . . . . 544
10.2.3 Installation from network storage . . . . . . . . . . . . . . . . . . . . . . . . . 547
10.2.4 Example of installation from network storage . . . . . . . . . . . . . . . . 550
10.3 Installing SDK 3.0 on BladeCenter QS21 . . . . . . . . . . . . . . . . . . . . . . . 560
10.3.1 Pre-installation steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
10.3.2 Installation steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
10.3.3 Post-installation steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
10.4 Firmware considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
10.4.1 Updating firmware for the BladeCenter QS21. . . . . . . . . . . . . . . . 566
10.5 Options for managing multiple blades . . . . . . . . . . . . . . . . . . . . . . . . . . 569
10.5.1 Distributed Image Management . . . . . . . . . . . . . . . . . . . . . . . . . . 569
10.5.2 Extreme Cluster Administration Toolkit . . . . . . . . . . . . . . . . . . . . . 589
10.6 Method for installing a minimized distribution . . . . . . . . . . . . . . . . . . . . 593
10.6.1 During installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
10.6.2 Post-installation package removal . . . . . . . . . . . . . . . . . . . . . . . . 596
10.6.3 Shutting off services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
How to get Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
AMD, AMD Opteron, the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro
Devices, Inc.
InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade
Association.
Snapshot, and the NetApp logo are trademarks or registered trademarks of NetApp, Inc. in the U.S. and
other countries.
SUSE, the Novell logo, and the N logo are registered trademarks of Novell, Inc. in the United States and
other countries.
Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both, and are used under license therefrom.
Java, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other
countries, or both.
ESP, Excel, Fluent, Microsoft, Visual Basic, Visual C++, Windows, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States, other countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
MultiCore Plus is a trademark of Mercury Computer Systems, Inc. in the United States, other countries, or
both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
We also describe the content and packaging of the IBM Software Development Kit (SDK) version 3.0 for Multicore Acceleration. This SDK provides all the tools and resources that are necessary to build applications that run on IBM BladeCenter® QS21 and QS20 blade servers. We show in-depth and real-world
usage of the tools and resources found in the SDK. We also provide installation,
configuration, and administration tips and best practices for the IBM BladeCenter
QS21. In addition, we discuss the supporting software that is provided by IBM
alphaWorks®.
This book was written for developers and programmers, IBM technical specialists, Business Partners, Clients, and the Cell/B.E. community who want to understand how to develop applications by using the Cell/B.E. SDK 3.0.
Eitan Peri works in the IBM Haifa Research Lab as the technical lead for
Cell/B.E. pre-sales activities in Israel. He is currently working on projects
focusing on porting applications to the CBEA within the health care, computer
graphics, aerospace, and defense industries. He has nine years of experience in
real-time programming, chip design, and chip verification. His areas of expertise
include Cell/B.E. programming and consulting, application parallelization and
optimization, algorithm performance analysis, and medical imaging. Eitan holds
a Bachelor of Science degree in Computer Engineering from the Israel Institute
of Technology (the Technion) and a Master of Science degree in Biomedical
Engineering from Tel-Aviv University, where he specialized in brain imaging
analysis.
Kurtis Ruby is a software consultant with IBM Lab Services at IBM in Rochester,
Minnesota. He has over twenty-five years of experience in various programming
assignments in IBM. His expertise includes Cell/B.E. programming and
consulting. Kurtis holds a degree in mathematics from Iowa State University.
Chris Almond is an ITSO Project Leader and IT Architect based at the ITSO
Center in Austin, Texas. In his current role, he specializes in managing content
development projects focused on Linux, AIX 5L™ systems engineering, and
other innovation programs. He has a total of 17 years of IT industry experience,
including the last eight with IBM. Chris handled the production of this IBM
Redbooks publication.
Acknowledgements
This IBM Redbooks publication would not have been possible without the generous support and contributions provided by many people from IBM. The authoring team gratefully acknowledges the critical support and sponsorship for this project provided by the following IBM specialists:
Rebecca Austen, Director, Systems Software, Systems and Technology
Group
Daniel Brokenshire, Senior Technical Staff Member and Software Architect,
Quasar/Cell Broadband Engine Software Development, Systems and
Technology Group
Paula Richards, Director, Global Engineering Solutions, Systems and
Technology Group
Jeffrey Scheel, Blue Gene® Software Program Manager and Software
Architect, Systems and Technology Group
Tanaz Sowdagar, Marketing Manager, Systems and Technology Group
We also thank the following IBM specialists for their significant input to this
project during the development and technical review process:
Marina Biberstein, Research Scientist, Haifa Research Lab, IBM Research
Michael Brutman, Solutions Architect, Lab Services, IBM Systems and
Technology Group
Dean Burdick, Developer, Cell Software Development, IBM Systems and
Technology Group
Catherine Crawford, Senior Technical Staff Member and Chief Architect, Next
Generation Systems Software, IBM Systems and Technology Group
Bruce D’Amora, Research Scientist, Systems Engineering, IBM Research
Matthew Drahzal, Software Architect, Deep Computing, IBM Systems and
Technology Group
Matthias Fritsch, Enterprise System Development, IBM Systems and
Technology Group
Gad Haber, Manager, Performance Analysis and Optimization Technology,
Haifa Research Lab, IBM Research
Francesco Iorio, Solutions Architect, Next Generation Computing, IBM
Software Group
Kirk Jordan, Solutions Executive, Deep Computing and Emerging HPC
Technologies, IBM Systems and Technology Group
Melvin Kendrick, Manager, Cell Ecosystem Technical Enablement, IBM
Systems and Technology Group
Mark Mendell, Team Lead, Cell/B.E. Compiler Development, IBM Software
Group
Michael P. Perrone, Ph.D., Manager Cell Solutions, IBM Systems and
Technology Group
Juan Jose Porta, Executive Architect HPC and e-Science Platforms, IBM
Systems and Technology Group
Uzi Shvadron, Research Scientist, Cell/B.E. Performance Tools, Haifa
Research Lab, IBM Research
Van To, Advisory Software Engineer, Cell/B.E. and Next Generation
Computing Systems, IBM Systems and Technology Group
Duc J. Vianney, Ph.D., Technical Education Lead, Cell/B.E. Ecosystem and
Solutions Enablement, IBM Systems and Technology Group
Brian Watt, Systems Development, Quasar Design Center Development, IBM
Systems and Technology Group
Ulrich Weigand, Developer, Linux on Cell/B.E., IBM Systems and Technology
Group
Cornell Wright, Developer, Cell Software Development, IBM Systems and
Technology Group
Become a published author
Join us for a two- to six-week residency program! Help write a book dealing with
specific products or solutions, while getting hands-on experience with
leading-edge technologies. You will have the opportunity to team with IBM
technical professionals, Business Partners, and Clients.
Your efforts will help increase product acceptance and customer satisfaction. As
a bonus, you will develop a network of contacts in IBM development labs, and
increase your productivity and marketability.
Find out more about the residency program, browse the residency index, and
apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
Part 1
There is also a brief discussion of the IBM Software Development Kit (SDK) for
Multicore Acceleration from a content and packaging perspective. These
discussions complement the in-depth content of the remaining chapters of this
book.
1.1 Motivation
The CBEA has been designed to support a broad range of applications. The first
implementation is a single-chip multiprocessor with nine processor elements that
operate on a shared memory model, as shown in Figure 1-1. In this respect, the
Cell/B.E. processor extends current trends in PC and server processors. The
most distinguishing feature of the Cell/B.E. processor is that, although all
processor elements can share or access all available memory, their function is
specialized into two types:
Power Processor Element (PPE)
Synergistic Processor Element (SPE)
The Cell/B.E. processor has one PPE and eight SPEs. The PPE contains a
64-bit PowerPC® Architecture™ core. It complies with the 64-bit PowerPC
Architecture and can run 32-bit and 64-bit operating systems and applications.
The SPE is optimized for running compute-intensive single-instruction,
multiple-data (SIMD) applications. It is not optimized for running an operating
system.
The SPEs are independent processor elements, each running its own individual application programs or threads. Each SPE has full access to shared
memory, including the memory-mapped I/O space implemented by multiple
direct memory access (DMA) units. There is a mutual dependence between the
PPE and the SPEs. The SPEs depend on the PPE to run the operating system,
and, in many cases, the top-level thread control for an application. The PPE
depends on the SPEs to provide the bulk of the application performance.
The most significant difference between the SPE and PPE lies in how they
access memory. The PPE accesses main storage (the effective address space)
with load and store instructions that move data between main storage and a
private register file, the contents of which may be cached. PPE memory access is thus like that of a conventional processor.
The SPEs, in contrast, access main storage with DMA commands that move
data and instructions between main storage and a private local memory, called
local storage (LS). Instruction fetches and load and store instructions of an SPE access its private LS rather than shared main storage, and the LS has no associated cache. This three-level organization of storage (register file, LS, and
main storage), with asynchronous DMA transfers between LS and main storage,
is a radical break from conventional architecture and programming models. The
organization explicitly parallelizes computation with the transfers of data and
instructions that feed computation and store the results of computation in main
storage.
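As a concrete illustration, the following minimal sketch shows the basic SPE-side DMA pattern using the standard spu_mfcio.h interface from the SDK. The buffer size, tag value, and function name are illustrative assumptions, not examples from this book.

    #include <spu_mfcio.h>

    #define CHUNK 4096                /* multiple of 16 bytes, at most 16 KB per DMA */
    volatile char ls_buf[CHUNK] __attribute__((aligned(128)));

    void fetch_chunk(unsigned long long ea)
    {
        unsigned int tag = 5;                    /* any tag ID in 0..31 */
        mfc_get(ls_buf, ea, CHUNK, tag, 0, 0);   /* main storage -> local storage */
        mfc_write_tag_mask(1 << tag);            /* select this tag group */
        mfc_read_tag_status_all();               /* block until the transfer completes */
        /* ls_buf now holds the data; compute on it here */
    }

In real code, the wait is typically deferred so that computation overlaps the transfer, which is the point of the asynchronous model just described.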
One of the motivations for this radical change is that memory latency, measured
in processor cycles, has gone up several hundredfold from about the years 1980
to 2000. The result is that application performance is, in most cases, limited by
memory latency rather than peak compute capability or peak bandwidth.
One of the DMA transfer methods of the SPE supports a list, such as a
scatter-gather list, of DMA transfers that is constructed in the local storage of an
SPE. Therefore, the DMA controller of the SPE can process the list
asynchronously while the SPE operates on previously transferred data. In
several cases, this approach to accessing memory has led to application
performance exceeding that of conventional processors by almost two orders of
magnitude. This result is significantly more than you would expect from the peak
performance ratio (approximately 10 times) between the Cell/B.E. processor and
conventional PC processors. The DMA transfers can be set up and controlled by
the SPE that is sourcing or receiving the data, or in some circumstances, by the
PPE or another SPE.
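A hedged sketch of such a DMA list transfer follows, again using the spu_mfcio.h interface; the element count, transfer sizes, and addresses are hypothetical. The list is built in local storage and handed to the MFC, which processes it while the SPU keeps computing.

    #include <spu_mfcio.h>

    #define N 8
    static mfc_list_element_t dma_list[N] __attribute__((aligned(8)));
    static char buf[N * 1024] __attribute__((aligned(128)));

    void gather(unsigned long long ea_base)
    {
        unsigned int tag = 3;
        int i;
        for (i = 0; i < N; i++) {
            dma_list[i].notify = 0;           /* no stall-and-notify */
            dma_list[i].size   = 1024;        /* bytes per list element */
            dma_list[i].eal    = (unsigned int)(ea_base + i * 8192);  /* low 32 bits of each EA */
        }
        /* Issue the whole list; the high 32 bits of the EA are taken from ea_base */
        mfc_getl(buf, ea_base, dma_list, sizeof(dma_list), tag, 0, 0);
        /* ... overlap computation on previously fetched data here ... */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }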
The Cell/B.E. processor does this by providing a general-purpose PPE to run the
operating system and other control-plane code, as well as eight SPEs
specialized for computing data-rich (data-plane) applications. The specialized
SPEs are more compute efficient because they have simpler hardware
implementations. The hardware does not devote transistors to branch prediction,
out-of-order execution, speculative execution, shadow registers and register
renaming, extensive pipeline interlocks, and so on. By weight, more of the
transistors are used for computation than in conventional processor cores.
The SPEs of the Cell/B.E. processor use two mechanisms to deal with long
main-memory latencies:
Three-level memory structure (main storage, local storage in each SPE, and
large register files in each SPE)
Asynchronous DMA transfers between main storage and local storage
By specializing the PPE and the SPEs for control and compute-intensive tasks,
respectively, the CBEA, on which the Cell/B.E. processor is based, allows both
the PPE and the SPEs to be designed for high frequency without excessive
overhead. The PPE achieves efficiency primarily by executing two threads
simultaneously rather than by optimizing single-thread performance.
Actual application performance varies. Some applications may benefit little from the SPEs, whereas others show a performance increase well in excess of ten-fold.
In general, compute-intensive applications that use 32-bit or smaller data
formats, such as single-precision floating-point and integer, are excellent
candidates for the Cell/B.E. processor.
The PPE contains a 64-bit, dual-thread PowerPC Architecture RISC core and
supports a PowerPC virtual-memory subsystem. It has 32 KB level-1 (L1)
instruction and data caches and a 512 KB level-2 (L2) unified (instruction and
data) cache. It is intended primarily for control processing, running operating
systems, managing system resources, and managing SPE threads. It can run
existing PowerPC Architecture software and is well-suited to executing
system-control code. The instruction set for the PPE is an extended version of
the PowerPC instruction set. It includes the vector/SIMD multimedia extensions
and associated C/C++ intrinsic extensions.
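For illustration, a minimal PPE-side snippet follows; the vec_* intrinsics from altivec.h map directly onto the vector/SIMD multimedia extension instructions mentioned above (the function name and usage are ours, not from this book):

    #include <altivec.h>

    vector float add4(vector float a, vector float b)
    {
        return vec_add(a, b);    /* four single-precision additions at once */
    }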
The eight identical SPEs are SIMD processor elements that are optimized for
data-rich operations allocated to them by the PPE. Each SPE contains a RISC
core, 256 KB software-controlled LS for instructions and data, and a 128-bit,
128-entry unified register file. The SPEs support a special SIMD instruction set,
called the Synergistic Processor Unit Instruction Set Architecture, and a unique
set of commands for managing DMA transfers and interprocessor messaging
and control. SPE DMA transfers access main storage by using PowerPC
effective addresses. As in the PPE, SPE address translation is governed by
PowerPC Architecture segment and page tables, which are loaded into the SPEs
by privileged software running on the PPE. The SPEs are not intended to run an
operating system.
An SPE controls DMA transfers and communicates with the system by means of
channels that are implemented in and managed by the memory flow controller
(MFC) of the SPE. The channels are unidirectional message-passing interfaces
(MPIs). The PPE and other devices in the system, including other SPEs, can also
access this MFC state through the memory-mapped I/O (MMIO) registers and
queues of the MFC. These registers and queues are visible to software in the
main-storage address space.
The EIB consists of four 16-byte-wide data rings. Each ring transfers 128 bytes
(one PPE cache line) at a time. Each processor element has one on-ramp and
one off-ramp. Processor elements can drive and receive data simultaneously.
Figure 1-1 on page 4 shows the unit ID numbers of each element and the order
in which the elements are connected to the EIB. The connection order is
important to programmers who are seeking to minimize the latency of transfers
on the EIB. The latency is a function of the number of connection hops, so that
transfers between adjacent elements have the shortest latencies, and transfers
between elements separated by six hops have the longest latencies.
Memory accesses on each interface are 1 to 8, 16, 32, 64, or 128 bytes, with
coherent memory ordering. Up to 64 reads and 64 writes can be queued. The
resource-allocation token manager provides feedback about queue levels.
The MIC has multiple software-controlled modes, which include the following types:
Fast-path mode, for improved latency when command queues are empty
High-priority read, for prioritizing SPE reads in front of all other reads
Early read, for starting a read before a previous write completes
Speculative read
Slow mode, for power management
The MIC implements a closed-page controller (bank rows are closed after being
read, written, or refreshed), memory initialization, and memory scrubbing.
The XDR DRAM memory is ECC-protected, with multi-bit error detection and optional single-bit error correction. It also supports write-masking, initial and periodic timing calibration, dynamic width control, sub-page activation, dynamic clock gating, and 4, 8, or 16 banks.
1 The Cell Broadband Engine Architecture document is on the Web at the following address:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1AEEE1270EA2776387257060006E61BA/$file/CBEA_v1.02_11Oct2007_pub.pdf
manages data transfers between the EIB and I/O devices and provides I/O
address translation and command processing.
The BEI supports two Rambus FlexIO interfaces. One of the two interfaces,
IOIF1, supports only a noncoherent I/O interface (IOIF) protocol, which is
suitable for I/O devices. The other interface, IOIF0 (also called BIF/IOIF0), is
software-selectable between the noncoherent IOIF protocol and the
memory-coherent Cell/B.E. interface (Broadband Interface (BIF)) protocol. The
BIF protocol is the internal protocol of the EIB. It can be used to coherently
extend the EIB, through IOIF0, to another memory-coherent device, which can be another Cell/B.E. processor.
The instruction set for the SPEs is a new SIMD instruction set, the Synergistic
Processor Unit Instruction Set Architecture, with accompanying C/C++ intrinsics.
It also has a unique set of commands for managing DMA transfer, external
events, interprocessor messaging, and other functions. The instruction set for the
SPEs is similar to that of the vector/SIMD multimedia extensions for the PPE, in
the sense that they operate on SIMD vectors. However, the two vector instruction
sets are distinct. Programs for the PPE and SPEs are often compiled by different
compilers, generating code streams for two entirely different instruction sets.
Although most coding for the Cell/B.E. processor might be done by using a
high-level language such as C or C++, an understanding of the PPE and SPE
machine instructions adds considerably to a software developer’s ability to
produce efficient, optimized code. This is particularly true because most of the
C/C++ intrinsics have a mnemonic that relates directly to the underlying
assembly language mnemonic.
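As a small example of that direct mapping, the following SPU snippet uses the spu_madd intrinsic, which corresponds one-to-one to the SPU fused multiply-add instruction (the wrapper function itself is a made-up illustration):

    #include <spu_intrinsics.h>

    vector float saxpy4(vector float a, vector float x, vector float y)
    {
        return spu_madd(a, x, y);    /* a*x + y on all four float slots */
    }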
in a non-coherent manner. The PPE accesses the main-storage space through
its PowerPC processor storage subsystem (PPSS).
Data transfer between an SPE local storage and main storage is performed by
the MFC that is local to the SPE. Software running on an SPE sends commands
to its MFC by using the private channel interface. The MFC can also be
manipulated by remote SPEs, the PPE, or I/O devices by using memory mapped
I/O. Software running on the associated SPE interacts with its own MFC through
its channel interface. The channels support enqueueing of DMA commands and
other facilities, such as mailbox and signal-notification messages. Software
running on the PPE or another SPE can interact with an MFC through MMIO
registers, which are associated with the channels and visible in the main-storage
space.
Each MFC maintains and processes two independent command queues for DMA
and other commands: one queue for its associated SPU and another queue for
other devices that access the SPE through the main-storage space. Each MFC
can process multiple in-progress DMA commands. Each MFC can also
autonomously manage a sequence of DMA transfers in response to a DMA list
command from its associated SPU (but not from the PPE or other SPEs). Each
DMA command is tagged with a tag group ID that allows software to check or
wait on the completion of commands in a particular tag group.
The MFCs support naturally aligned DMA transfer sizes of 1, 2, 4, or 8 bytes, and
multiples of 16 bytes, with a maximum transfer size of 16 KB per DMA transfer.
DMA list commands can initiate up to 2,048 such DMA transfers. Peak transfer
performance is achieved if both the effective addresses and the LS addresses
are 128-byte aligned and the size of the transfer is an even multiple of 128 bytes.
Each MFC has a synergistic memory management (SMM) unit that processes
address-translation and access-permission information that is supplied by the
2 Cell Broadband Engine Programming Handbook, Version 1.1 is available at the following Web address:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F/$file/CBE_Handbook_v1.1_24APR2007_pub.pdf
Figure 1-3 shows a summary of the byte ordering and bit ordering in memory
and the bit-numbering conventions.
Neither the PPE nor the SPEs, including their MFCs, support little-endian byte
ordering. The DMA transfers of the MFC are simply byte moves, without regard
to the numeric significance of any byte. Thus, the big-endian or little-endian issue
becomes irrelevant to the movement of a block of data. The byte-order mapping
only becomes significant when data is loaded or interpreted by a processor
element or an MFC.
1.4.4 Runtime environment
The PPE runs PowerPC Architecture applications and operating systems, which
can include both PowerPC Architecture instructions and vector/SIMD multimedia
extension instructions. To use all of the Cell/B.E. processor’s features, the PPE
requires an operating system that supports these features, such as
multiprocessing with the SPEs, access to the PPE vector/SIMD multimedia
extension operations, the Cell/B.E. interrupt controller, and all the other functions
provided by the Cell/B.E. processor.
A main thread running on the PPE can interact directly with an SPE thread
through the LS of the SPE. The thread can interact indirectly through the
main-storage space. A thread can poll or sleep while waiting for SPE threads.
The PPE thread can also communicate through mailbox and signal events that
are implemented in the hardware.
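A minimal sketch of this mailbox handshake, assuming the libspe2 interface on the PPE side, might look as follows; ctx is an SPE context created elsewhere, and the message values are arbitrary:

    #include <libspe2.h>

    void notify_spe(spe_context_ptr_t ctx, unsigned int msg)
    {
        unsigned int reply;
        /* Write one 32-bit message into the SPE's inbound mailbox */
        spe_in_mbox_write(ctx, &msg, 1, SPE_MBOX_ALL_BLOCKING);
        /* Poll the SPE's outbound mailbox for the acknowledgement */
        while (spe_out_mbox_status(ctx) == 0)
            ;    /* the thread could also sleep here, as noted above */
        spe_out_mbox_read(ctx, &reply, 1);
    }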
The operating system defines the mechanism and policy for selecting an
available SPE to schedule an SPU thread to run on. It must prioritize among all
the Cell/B.E. applications in the system, and it must schedule SPE execution
independently from regular main threads. The operating system is also
responsible for runtime loading, the passing of parameters to SPE programs,
notification of SPE events and errors, and debugger support.
OpenMP compiler: The IBM XL C/C++ compiler that comes with SDK 3.0 is an OpenMP-directed, single-source compiler that supports automatic program partitioning, data virtualization, code overlay, and more. This version of the compiler is in beta mode. Therefore, users should not base production applications on this compiler.
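A loop such as the following standard OpenMP fragment is the kind of code that this single-source compiler can partition across the SPEs; nothing in it is Cell/B.E.-specific, which is the point:

    #include <omp.h>

    void scale(float *a, int n, float s)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] *= s;    /* iterations are distributed across threads/SPEs */
    }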
2.1.4 IBM XL Fortran for Multicore Acceleration for Linux
IBM XL Fortran for Multicore Acceleration for Linux is the latest addition to the
IBM XL family of compilers. It adopts proven high-performance compiler
technologies that are used in its compiler family predecessors. It also adds new
features that are tailored to exploit the unique performance capabilities of
processors that are compliant with the new CBEA.
The simulator for the Cell/B.E. processor provides a cycle-accurate SPU core model that can be used for performance analysis of computationally intensive applications.
1 Cell Broadband Engine Software Development Kit 2.1 Installation Guide Version 2.1, SC33-8323-02, is available on the Web at:
ftp://ftp.software.ibm.com/systems/support/bladecenter/cpbsdk00.pdf
2.4 Cell/B.E. libraries
In the following sections, we describe the various programming libraries.
The BLAS API is available as standard ANSI C and standard FORTRAN 77/90
interfaces. BLAS implementations are also available in open source (netlib.org).
The BLAS library in IBM SDK for Multicore Acceleration supports only real
single-precision (SP) and real double-precision (DP) versions. All SP and DP
routines in the three levels of standard BLAS are supported on the PPE. These
routines are available as PPE APIs and conform to the standard BLAS interface.
Some of these routines are optimized by using the SPEs and show a marked
increase in performance compared to the corresponding versions that are
implemented solely on the PPE. These optimized routines have an SPE interface
in addition to the PPE interface. The SPE interface does not conform to, yet
provides a restricted version of, the standard BLAS interface. Moreover, the single
precision versions of these routines have been further optimized for maximum
performance by using various features of the SPE.
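For example, assuming the standard CBLAS binding is available, a single-precision matrix multiply that can be accelerated by the SPE-optimized sgemm looks like ordinary BLAS code (the wrapper function is ours):

    #include <cblas.h>

    void matmul(int n, const float *A, const float *B, float *C)
    {
        /* C = 1.0*A*B + 0.0*C, all matrices n x n, row-major */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
    }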
2.4.6 Data Communication and Synchronization library
The Data Communication and Synchronization (DaCS) library provides a set of
services for handling process-to-process communication in a heterogeneous
multi-core system. In addition to the basic message passing service, the library
includes the following services:
Mailbox services
Resource reservation
Process and process group management
Process and data synchronization
Remote memory services
Error handling
Additional examples and demonstrations show how you can exploit the on-chip
computational capacity.
Lastly, this is a static analysis tool. It does not identify branch behavior or
memory transfer delays.
2.6.2 OProfile
OProfile is a Linux tool that exists on other architectures besides the Cell/B.E. system. It has been extended to support the unique hardware on the PPU and SPUs. It is a sampling-based tool that does not require special source compile flags to produce useful data reports.
The opreport tool produces the output report. Reports can be generated based
on the file names that correspond to the samples, symbol names, or annotated
source code listings. Special source compiler flags are required in this case.
counters, are provided in the Cell/B.E. performance monitoring unit (PMU) for
counting these events.
ways to move data and functions from a host processor to an accelerator
processor and vice versa.
Part 2. Programming environment
In this part, we provide in-depth coverage of various programming methods,
tools, strategies, and adaptations to different computing workloads. Specifically, this part includes the following chapters:
Chapter 3, “Enabling applications on the Cell Broadband Engine hardware”
on page 31
Chapter 4, “Cell Broadband Engine programming” on page 75
Chapter 5, “Programming tools and debugging techniques” on page 329
Chapter 6, “The performance tools” on page 417
Chapter 7, “Programming in distributed environments” on page 445
Next we describe the relationship between the computational kernels and the
Cell/B.E. features and between parallel programming models and Cell/B.E.
programming frameworks. We give examples of the most common parallel
programming models and contrast them in terms of control parallelism and data
transfers. We also present design patterns for Cell/B.E. programming following a formalism used in other areas of computer science.
3.1 Concepts and terminology
Figure 3-1 shows the concepts and how they are related. As described in this
section, the figure shows an application as having one or more computational
kernels and one or more potential parallel programming models. The
computational kernels exercise or stress one or more of the Cell/B.E. features
(the Q1 connection). The different Cell/B.E. features can either strongly or weakly
support the different parallel programming model choices (the Q2 connection).
The chosen parallel programming model can be implemented on the CBEA by
using various programming frameworks (the Q3 connection).
To answer questions Q1 and Q2, the programmer must be able to match the
characteristics of the computational kernel and parallel programming model to
the strengths of the CBEA. Many programming frameworks are available for the
CBEA. Which one is best suited to implement the parallel programming model
that is chosen for the application? We provide advice to the programmer for
question Q3 in 3.4, “Deciding which Cell/B.E. programming framework to use” on
page 61.
This work is also based on input from other domains, including machine learning, databases, computer graphics, and games. The first seven dwarfs are the ones that were initially identified by Phillip Colella. The six remaining ones were identified by Patterson et al. The intent of the study is to help the parallel computing research community, in both academia and industry, by providing a limited set of patterns against which new ideas for hardware and software can be evaluated.
1 Intel also classifies applications in three categories, named RMS for Recognition, Mining, and Synthesis, to direct its research in computer architecture. See 9 on page 624.
Dwarf name              Description              Example, application, benchmark
Map-reduce                                       (Unknown)
This table is likely to change with future products. Most of the items in the table
are described in detail in Chapter 4, “Cell Broadband Engine programming” on
page 75.
Table 3-3 Important Cell/B.E. features as seen from a programmer’s perspective
Good                                          Not so good
                                              Branching
A parallel application tries to bind multiple resources for its own use, typically processors and memory. The purpose is either to speed up the
whole computation (more processors) or to treat bigger problems (more
memory). The work of parallelizing an application involves the following tasks:
Distributing the work across processors
Distributing the data if the memory is distributed
Synchronizing the subtasks, possibly through shared data access if the
memory is shared
Communicating the data if the memory is distributed
Let us look at the options for each of these components in the following sections.
Work distribution
The first task is to find concurrency in the application by exposing multiple
independent tasks and grouping them together inside execution threads. The
following options are possible:
Independent tasks operating on largely independent data
Domain decomposition, where the whole data can be split in multiple
subdomains, each of which is assigned to a single task
Streaming, where the same piece of data undergoes successive
transformations, each of which is performed by a single task
All tasks are arranged in a chain and pass data in a producer-consumer mode. The amount of concurrency here is the number of different steps in the computation.
Now each parallel task can perform the work itself, or it can call other processing
resources for assistance. This process is completely transparent to the rest of the
participating tasks. We find the following techniques:
Function offload, where the compute-intensive part of the job is offloaded to a
supposedly faster processor
Accelerator mode, which is a variant of the previous technique, where multiple
processors can be called to collectively speed up the computation
Data distribution
The data model is an important part of the parallelization work. Currently, the following choices are available:
Shared memory, in which every execution thread has direct access to other
threads’ memory contents
Distributed memory, which is the opposite of shared memory in that each
execution thread can only access its own memory space
Specialized functions are required to import other threads’ data into its own
memory space.
Partitioned Global Address Space (PGAS), where each piece of data must be
explicitly declared as either shared or local, but within a unified address space
Task synchronization
Sometimes, during program execution, the parallel tasks must synchronize.
Synchronization can be realized in the following ways:
Messages or other asynchronous events emitted by other tasks
Locks or other synchronization primitives, for example, to access a queue
Transactional memory mechanisms
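As a generic illustration of the lock-based option above (POSIX threads, not a Cell/B.E.-specific API), tasks can serialize access to a shared work queue as follows:

    #include <pthread.h>

    typedef struct { int items[64]; int head, tail; } queue_t;

    static queue_t q;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;

    int dequeue(int *out)
    {
        int ok = 0;
        pthread_mutex_lock(&q_lock);    /* one task at a time touches the queue */
        if (q.head != q.tail) {
            *out = q.items[q.head % 64];
            q.head++;
            ok = 1;
        }
        pthread_mutex_unlock(&q_lock);
        return ok;
    }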
The programming models can further be classified according to the way each of
these tasks is taken care of, either explicitly by the programmer or implicitly by
the underlying runtime environment. Table 3-4 lists common parallel
programming models and shows how they can be described according to the criteria just presented.
2 Clusters based on commodity hardware. See http://www.beowulf.org/overview/index.html
In this discussion about the parallel programming models, we ignore the
instruction level (multiple execution units) and word level (SIMD) parallelisms.
They are to be considered as well to maximize the application performance, but
they usually do not interfere with the high level tasks of data and work
distribution.
At the lowest level, the Cell/B.E. chip appears to the programmer as a distributed
memory cluster of 8+1 computational cores, with an ultra high-speed
interconnect and a remote DMA engine on every core. On a single blade server,
two Cell/B.E. chips can be viewed as either a single 16+2-core compute resource (SMP mode) or a NUMA machine with two NUMA nodes.
The frameworks that are part of the IBM SDK for Multicore Acceleration are
described in greater detail in 4.7, “Frameworks and domain-specific libraries” on
page 289, and 7.1, “Hybrid programming models in SDK 3.0” on page 446. We
only give a brief overview in the following sections.
Software cache
A software cache can help implement a shared memory parallel programming
model when the data that the SPEs reference cannot be easily predicted. See
4.3.6, “Automatic software caching on SPE” on page 157, for more details.
Mailboxes
Remote DMA operations
Process synchronization using barriers
Data synchronization using mutexes to protect memory accesses
The DaCS services are implemented as an API for the Cell/B.E.-only version and
are complemented by a runtime daemon for the hybrid case. For a complete
discussion, see 4.7.1, “Data Communication and Synchronization” on page 291,
and 7.2, “Hybrid Data Communication and Synchronization” on page 449.
MPI
MPI is not part of the IBM SDK for Multicore Acceleration. However, any
implementation for Linux on Power can run on the Cell/B.E., leveraging the PPE
only. The following implementations are the most common:
MPICH/MPICH2, from Argonne National Laboratory
MVAPICH/MVAPICH2, from Ohio State University
OpenMPI, from a large consortium involving IBM and Los Alamos National
Laboratory, among others
On the accelerator side, the programmer only has to code the computational
kernel, unwrap the input data, and pack the output data when the kernel finishes
processing. In between, the runtime system manages the work blocks queue on
the accelerated side and gives control to the computational kernel upon receiving
a new work block on the accelerator side.
The ALF also offers more sophisticated mechanisms for managing multiple computational kernels, expressing dependencies, or further tuning the data movement. As with DaCS, the ALF can operate inside a Cell/B.E. server or in
hybrid mode.
DAV comes with tools to generate ad-hoc stub libraries based on the prototypes
of the offloaded functions on the client side and similar information about the
server side (the Cell/B.E. system) where the functions are implemented. For the
main application, the Cell/B.E. system is completely hidden. The actual
implementation of the function on the Cell/B.E. system uses the existing
programming frameworks to maximize the application performance.
See 7.4, “Dynamic Application Virtualization” on page 475, for a more complete
description.
OpenMP
The IBM SDK for Multicore Acceleration contains a technology preview of the XL
C/C++ single source compiler. Usage of this compiler completely hides the Cell/B.E. system from the application programmer, who can continue to use
OpenMP, which is a familiar shared memory parallel programming model. The
compiler runtime library takes care of spawning threads of execution on the
SPEs and manages the data movement and synchronization of the PPE threads
to SPE threads.
PeakStream
The PeakStream platform offers an API, a generalized array type, and a virtual
machine environment that abstracts the programmer from the real hardware.
Data is moved back and forth between the application and the virtual machine
that accesses the Cell/B.E. resources by using an I/O interface. All the work in
the virtual machine is asynchronous from the main application perspective, which
can keep doing work before reading the data from the virtual machine.
CodeSourcery
CodeSourcery offers Sourcery VSIPL++, a C++ implementation of the open
standard VSIPL++ library that is used in signal and image processing. The
programmer is freed from accessing the low level mechanisms of the Cell/B.E.
platform. This is handled by the CodeSourcery runtime library.
Gedae
Gedae tries to automate the software development by using a model-driven
approach. The algorithm is captured in a flow diagram that is then used by the
multiprocessor compiler to generate a code that will match both the target
architecture and the data movements required by the application.
RapidMind
RapidMind works with standard C++ language constructs and augments the
language by using specialized macro language functions. The whole integration
involves the following steps:
1. Replace float or int arrays by RapidMind equivalent types (Array, Value).
2. Capture the computations that are enclosed between the RapidMind
keywords Program BEGIN and END and convert them into object modules.
3. Stream the recorded computations to the underlying hardware by using
platform-specific constructs, such as Cell/B.E., processor, or GPU, when the
modules are invoked.
There are also research groups working on implementing other frameworks onto
the Cell/B.E. platform. It is worth noting the efforts of the Barcelona Supercomputing Center teams with CellSs (Cell Superscalar) and derivatives, such as SMPSs (SMP Superscalar), and of Los Alamos National Laboratory with
CellFS, based on concepts taken from the Plan9 operating system.
Summary
In Figure 3-3, we plot these frameworks on a scale ranging from the closest to
the hardware to the most abstract.
IBM DAV is of particular note here. On the accelerated program side (the client
side in DAV terminology), the Cell/B.E. system is completely hidden by using the
stub dynamic link library (DLL) mechanism. On the accelerator side (the server
side for DAV), any Cell/B.E. programming model can be used to implement the
functions that have been offloaded from the client application.
Figure 3-4 Determining whether the Cell/B.E. system is a good fit for this application
3.2.1 Higher performance/watt
The main driver for enabling applications on the Cell/B.E. system is the need for
a higher level of performance per watt. This is a concern shared by many
customers as reported by the IDC study referenced in Solutions for the
Datacenter’s Thermal Challenges [14 on page 624]. Customers may want to accomplish the following goals:
Lower their electricity bills.
Overcome computer room limits in space, power, and cooling.
Adopt a green strategy for their IT, a green “ITtude.”
Allow for more computing power for a given space and electrical power budget, as is often the case in embedded computing.
The design choices for the Cell/B.E. system exactly match these new
requirements with a power efficiency (expressed in peak GFlops per Watt) that is
over two times better than conventional general purpose processors.
The more parallel processing opportunities the application can leverage, the
better.
The map-reduce dwarf is embarrassingly parallel and, therefore, a perfect fit for the Cell/B.E. system. Examples can be found in ray tracing or Monte Carlo simulations.
The graph traversal dwarf is a more difficult target for the Cell/B.E. system due to random memory accesses, although some new sorting algorithms (AA-sort in [5 on page 623]) have been shown to exploit the CBEA.
The N-Body simulation dwarf does not yet seem ready for Cell/B.E. exploitation, although research efforts are providing good early results [19 on page 624].
Table 3-6 on page 51 summarizes the results of these studies. We present each
of the “13 dwarfs”, their Cell/B.E. affinity (from 1 (poor) to 5 (excellent)), and the
Cell/B.E. features that are of most value for each kernel.
The algorithm match also depends on the data types that are being used. The current Cell/B.E. system has an affinity for single-precision floating point. Much larger memory and enhanced double-precision floating-point capabilities will come in later versions of the Cell/B.E. system.
Table 3-6 The 13 dwarfs from Patterson et al. and their Cell/B.E. affinity
Dwarf name              Cell/B.E. affinity (1 = poor to 5 = excellent)   Main Cell/B.E. features
Dynamic programming     4                                                SIMD
What are the alternatives? The parallelization effort may have already been done
by using OpenMP at the process level. In this case, using the prototype of the
XLC single source compiler might be the only viable alternative. Despite high
usability, these compilers are still far from providing the level of performance that
can be attained with native SPE programming. The portability of the code is maintained, which, for some customers, might be a key requirement.
For new developments, it might be a good idea to use the higher level of abstraction provided by technologies such as PeakStream, RapidMind, or StreamIt.3 The portability of the code is maintained between the Cell/B.E., GPU,
and general multi-core processors. However, the application is tied to the
development environment, which is a different form of lock-in.
In the long run, new standardized languages may emerge. Such projects as X10
from IBM4 or Chapel from Cray, Inc.5 might become the preferred language for
writing applications to run on massively multi-core systems. Adopting new
languages has historically been a slow process. Even if we get a new language,
that still does not help the millions of lines of code written in C/C++ and Fortran.
The standard API for the host-accelerator model may be closer. ALF is a good
candidate. The fast adoption of MPI in the mid-1990s has proven that an API can
be what we need to enable a wide range of applications.
Can we wait for these languages and standards to emerge? If the answer is “no”
and the decision has been made to enable the application on the Cell/B.E.
system now, there is a list of items to consider and possible workarounds when
problems are encountered as highlighted in Table 3-7 on page 53.
3 http://www.cag.csail.mit.edu/streamit
4 http://domino.research.ibm.com/comm/research_projects.nsf/pages/x10.index.html
5 http://chapel.cs.washington.edu
Table 3-7 Considerations when enabling an application on the Cell/B.E. system

Source code changes
Potential problems: Portability concerns; limits on the scope of code changes.
Workaround: The Cell/B.E. APIs are standard C. Approaches such as host-accelerator can limit the amount of source code changes.

Operating systems
Potential problem: Windows applications.
Workaround: The Cell/B.E. system runs Linux only. If the application runs on Windows, we may want to use DAV to offload only the computational part to the Cell/B.E. system.

Libraries
Potential problems: Not many libraries are supported yet; little ISV support.
Workaround: Use the workload libraries that are provided by the IBM SDK for Multicore Acceleration.

Data types
Potential problem: 8-bit, 16-bit, and 32-bit data are well supported; 64-bit floating point is supported.
Workaround: Full-speed double-precision support is soon to be available.

Memory requirements
Potential problem: Maximum of 2 GB per blade server.
Workaround: Use fewer, larger MPI tasks. On an IBM blade server, use a single MPI task with 16 SPEs rather than two MPI tasks with eight SPEs. (This is subject to change because a much larger memory configuration per blade is due in future product releases.)

Memory requirements
Potential problem: Local storage (LS) size of 256 KB.
Workaround: Large functions need to be split. We will have to use overlays. Limit recursion (stack space).

I/O
Potential problem: I/O-intensive tasks.
Workaround: The Cell/B.E. system does not help I/O-bound workloads.
We first describe the parallel programming models that are found in the literature
and then focus on the Cell/B.E. chip or board-level parallelism and the
host-accelerator model.
Finding concurrency: Find parallel tasks, and group and order them.
In the algorithm space, Mattson et al. propose to look at three different ways of
decomposing the work, each with two modalities. This leads to six major
algorithm structures, which are described in Table 3-9.
Table 3-9 The six major algorithm structures (after Mattson et al.)

Organization principle    Organization subtype    Algorithm structure
Organize by tasks         Linear                  Task parallelism
Organize by tasks         Recursive               Divide and conquer
Organize by data          Linear                  Geometric decomposition
Organize by data          Recursive               Recursive data (tree)
Organize by flow of data  Linear                  Pipeline
Organize by flow of data  Recursive               Event-based coordination
As for the supporting structures, Mattson et al. identified four structures for
organizing tasks and three for organizing data. They are given side by side in
Table 3-10.6
Table 3-10 Supporting structures (after Mattson et al.)

Task organization: SPMD, Master/worker, Loop parallelism, Fork/join
Data organization: Shared data, Shared queue, Distributed array
Shared data refers to the constructs that are necessary to share data between
execution threads. Shared queue is the coordination among tasks to process a
queue of work items. Distributed array addresses the decomposition of
multi-dimensional arrays into smaller sub-arrays that are spread across multiple
execution units.
6 Timothy G. Mattson, Berna L. Massingill, Beverly A. Sanders. Patterns for Parallel Programming. Addison Wesley, 2004.
Figure 3-5 shows the following different models:
A small single SPE program, where the whole code fits in the local storage
of a single SPE
A large single SPE program, where one SPE program accesses system memory
A small multi-SPE program
A general Cell/B.E. program with multiple SPE and PPE threads
When multiple SPEs are used, they can be arranged in a data parallel, streaming
mode, as illustrated in Figure 3-6.
Each piece of input data (I0, I1, ...) is streamed through one SPE to produce a
piece of output data (O0, O1, and so on). The exact same code runs on all SPEs.
Sophisticated load balancing mechanisms can be applied here to account for
differing compute time per data chunk.
The main benefit is that we aggregate the code size of all the SPEs that
participate in the pipeline. We also benefit from the huge EIB bandwidth to
transfer the data. A possible variation is to move the code instead of the data,
whichever is easier or smaller to move around. Good load balancing is much
more challenging here because it relies on a constant per-stage computation time.
acceleration. This method is the easiest from a program development
perspective because it limits the scope of source code changes and does not
require much re-engineering at the application logic level. It is very much a
fork/join model and, therefore, requires care in giving enough work to the
accelerator threads to compensate for the startup time. This method is typically
implemented with specialized workload libraries, such as BLAS, FFT, or RNG,
for which a Cell/B.E.-tuned version exists. BLAS is the only library that can be
considered a “drop-in replacement” at this time.
Code structure — Things to look at (Cell/B.E. suitability rated from 1 = poor to 5 = excellent)

Geometric decomposition: ALF if data blocks can be processed independently; DaCS for more general data decomposition.
Event-based coordination: libspe.
Fork/join: Workload-specific libraries if they exist, and ALF otherwise. ALF can be used to create new workload-specific libraries. This is the accelerator model. Use DAV if necessary to accelerate a Windows application.
3.5 The application enablement process
The process of enabling an application on the Cell/B.E. system (Figure 3-9) can
be incremental and iterative. It is incremental in the sense that the hotspots of the
application should be moved progressively off the PPE to the SPE. It is iterative
for each hotspot. The optimization can be refined at the SIMD, synchronization,
and data movement levels until satisfactory levels of performance are obtained.
Figure 3-9 General flow for enabling an application on the Cell/B.E. system
Figure 3-11 shows the process for ALF.
7 http://well.cellperformance.com
Figure 3-12 A typical workflow for Cell/B.E. tuned workload libraries
Starting and terminating SPE contexts takes time. We must ensure that the time
spent in the library far exceeds the SPE context startup time.
A variation of this scheme is when the application calls the library repeatedly. In
this case, it might be interesting to keep the library contexts running on the SPE
and set them to work with a lightweight mailbox operation, for example, as shown
in Figure 3-13 on page 68.
68 Programming the Cell Broadband Engine Architecture: Examples and Best Practices
Figure 3-13 on page 68 shows three successive invocations of the library with
data1, data2, and data3. The dotted lines indicate an SPE context that is active
but waiting. This arrangement minimizes the impact of SPE context creation
but only works if the application has a single computational kernel that is
called repeatedly.
Figure 3-14 shows the typical workflow of an ALF application. The PPE thread
prepares the work blocks (numbered wb0 to wb12 here), which execute on the
SPE accelerators.
By using the same formalism, we started to build a list of design patterns that
apply to Cell/B.E. programming. This is clearly only a start, and we hope that new
patterns will emerge as we gain more expertise in porting code to the Cell/B.E.
environment.
Forces
The two main forces are the need for a good load balance between the SPEs and
minimal contention.
Solution
We can envision three solutions for dividing work between SPEs. They are
described in the following sections.
Master/worker
The PPE assigns the work elements to the SPEs, giving out the pieces of work
one at a time. When an SPE finishes its piece of work, it signals the PPE, which
then feeds the calling SPE with a new item, automatically balancing the work.
This scheme is good for load balance but may lead to contention if many SPEs
are being used, because the PPE might be overwhelmed by the task of assigning
the work blocks.
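One possible shape for the PPE side of this scheme uses the mailboxes for the signaling. The following is a minimal sketch, not code from this book: the ctx[] contexts, the initial seeding of one item per SPE, and the meaning of the mailbox messages are all assumptions.

#include <libspe2.h>

/* Master loop: assumes num_spes running SPE contexts in ctx[], each already
   seeded with one work item, and num_items work items in total. An SPE is
   assumed to post a message to its outbound mailbox when it finishes an item. */
void master_loop(spe_context_ptr_t ctx[], int num_spes, unsigned int num_items)
{
    unsigned int msg, next = (unsigned int)num_spes, done = 0;

    while (done < num_items) {
        for (int i = 0; i < num_spes; i++) {
            if (spe_out_mbox_status(ctx[i]) > 0) {   // SPE i finished an item
                spe_out_mbox_read(ctx[i], &msg, 1);
                done++;
                if (next < num_items) {              // feed it the next item
                    msg = next++;
                    spe_in_mbox_write(ctx[i], &msg, 1, SPE_MBOX_ALL_BLOCKING);
                }
            }
        }
    }
}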
Forces
The index array by which the x vector is accessed leads to random memory
accesses.
Solution
The index array is known in advance, and we can exploit this information by using
software pipelining with a multi-buffering scheme and DMA lists as shown in
Figure 3-15.
We do not show the accesses to matrix A and array y. They are accessed
sequentially, and a simple multi-buffering scheme can also be applied to them.
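The following SPU-side sketch shows the gather step with a DMA list; the names ea_x, idx, and CHUNK are our assumptions, and each x element is assumed to be padded to a 16-byte quadword so that every list entry is naturally aligned.

#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 64

static volatile char x_local[CHUNK * 16] __attribute__((aligned(128)));
static mfc_list_element_t gather_list[CHUNK] __attribute__((aligned(8)));

/* ea_x is the effective address of x[]; its upper 32 bits serve as the EAH
   that is shared by all of the list elements. */
void gather_x(uint64_t ea_x, unsigned int idx[], unsigned int tag)
{
    for (int i = 0; i < CHUNK; i++) {
        gather_list[i].notify = 0;                       // no stall-and-notify
        gather_list[i].size = 16;                        // one quadword each
        gather_list[i].eal = (unsigned int)ea_x + idx[i] * 16;
    }
    // one getl command replaces CHUNK individual get commands
    mfc_getl(x_local, ea_x, gather_list, sizeof(gather_list), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                           // wait for completion
}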
3.7.3 Pipeline
We want to arrange SPE contexts in a multi-stage pipeline manner.
Forces
We want to minimize the time it takes for the data to move from one pipeline
stage to the other.
Solution
By using the affinity functions as described in 4.3.4, “Direct problem state access
and LS-to-LS transfer” on page 144, we can make sure that successive SPE
contexts are ideally placed on the EIB to maximize the LS-to-LS transfer speed.
An alternative arrangement is to move the code instead of the data, whichever is
the fastest. Often when the programming model is a pipeline, some state data
must reside on each stage, and moving the function also requires moving the
state data.
Forces
We want to push the software cache a bit further by allowing data to be cached
not necessarily in the SPE that encounters a “miss” but also in the LS of another
SPE. The idea is to exploit the high EIB bandwidth.
Solution
We do not have a solution for this yet. The first step is to look at the cache
coherency protocols (MESI, MOESI, and MESIF)8 that are in use today on
multiprocessor systems and try to adapt them to the Cell/B.E. system.
8 M=Modified, O=Owner, E=Exclusive, S=Shared, I=Invalid, F=Forwarding
Forces
The forces are similar to what happens with a Web browser when the flow of data
coming from the Web server contains data that requires the loading of external
plug-ins to be displayed. The challenge is to load on the SPE both the data and
the code to process it when the data is discovered.
Solution
The Cell/B.E. system has overlay support already, which is one solution.
However, there might be a better solution to this particular problem by using
dynamically loaded code. We can imagine loading code together with data by
using exactly the same DMA functions. Nothing in the SPE memory differentiates
code from data. This has been implemented successfully by Eric Christensen et
al. in Dynamic Code Uploading on the SPU [26 on page 625] by using the
following process:
1. Compile the code.
2. Dump the object code as binary.
3. Load the binary code as data.
4. Use DMA on the data (containing the code) just as with regular data to the
SPE. Actual data can also be loaded during the same DMA operation.
5. On the SPE, jump to the address location where the code has been loaded to
pass the control to the plug-in that has just been loaded.
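The following SPU-side sketch condenses steps 4 and 5 of this process. The names code_ea and code_size and the fixed buffer size are assumptions, and the plug-in is assumed to be self-contained, position-independent code that is entered at the start of the image (code_size must not exceed the buffer size).

#include <stdint.h>
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

typedef void (*plugin_fn_t)(void *data);

static char plugin_buf[16384] __attribute__((aligned(128)));

void run_plugin(uint64_t code_ea, uint32_t code_size, void *data, uint32_t tag)
{
    mfc_get(plugin_buf, code_ea, code_size, tag, 0, 0); // code moves like data
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                          // wait for the transfer
    spu_sync();            // make the newly loaded instructions fetchable
    ((plugin_fn_t)plugin_buf)(data);                    // jump into the plug-in
}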
Chapter 4. Cell Broadband Engine programming
The programming techniques and libraries in this chapter are divided into
sections according to the program functionality that they address. We hope that
this approach helps you find the relevant subject according to your current stage
in development or the specific part of the program that you are currently
implementing.
In 4.6, “SPU programming” on page 244, we show how to write an optimized
synergistic processor unit (SPU) program. The intention is to cover
programming issues that relate only to the SPU itself, without interaction with
external components, such as the PPE, other SPEs, or main storage.
In 4.7, “Frameworks and domain-specific libraries” on page 289, we discuss
high-level programming frameworks and libraries, such as DaCS, ALF, and
domain-specific libraries, that can reduce development effort and hide the
Cell/B.E.-specific architecture. In some cases, such frameworks provide
performance similar to programming with the low-level libraries.
Finally, in 4.8, “Programming guidelines” on page 319, we provide a collection
of programming guidelines and tips. In this section, we discuss:
– Information gathered from various resources and new items that we added
– Issues that are described in detail in other chapters and references to
those chapters
From a programmer’s point of view, managing the work with the SPEs is similar
to working with Linux threads. Also, the SDK contains libraries that assist in
managing the code that runs on the SPE and communicate with this code during
execution.
The PPE itself conforms to the PowerPC Architecture so that programs written
for the PowerPC 970 processor, for example, should run on the Cell/B.E.
processor without modification. In addition, most programs that run on a
Linux-based Power system and use the operating system (OS) facilities should
work properly on a Cell/B.E.-based system. Such facilities include accessing the
file system, using sockets and message passing interface (MPI) for
communication with remote nodes, and managing memory allocation.
The programmer should know that usage of the operating system facilities in any
Cell/B.E. application always takes place on the PPE. While SPE code might use
In this section, we include the issues related to PPE programming that we found
are the most important when running most Cell/B.E. applications. However, in
case you are interested in learning more about this subject or need specific
details that are not covered in this section, refer to the “PowerPC Processor
Element” chapter in the Cell Broadband Engine Programming Handbook,
Version 1.1 as a good starting point.1
1 Cell Broadband Engine Programming Handbook, Version 1.1 is on the Web at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F/$file/CBE_Handbook_v1.1_24APR2007_pub.pdf
512 KB L2 unified (instruction and data) cache
Cache line of 128 bytes
Instructions executed in order
In most cases, programmers prefer to use the eight SPEs to perform the massive
SIMD operations and let the PPU program manage the application flow.
However, it may be useful in some cases to add some SIMD computation on the
PPU.
Most of the coding for the Cell/B.E. processor is in a high-level language, such as
C or C++. However, an understanding of the PPE architecture and PPU
instruction sets considerably helps a developer to produce efficient, optimized
code. This is particularly true because C-language intrinsics are provided for
parts of the PPU instruction set. In the following section, we discuss the PPU
intrinsics (C/C++ language extensions) and how to use them. We also discuss
intrinsics that operate both on scalars and on vector data types.
We discuss the two main types of PPU intrinsics in the following sections.
Scalar intrinsics
Scalar intrinsics provide a minimal set of specific intrinsics to make the PPU
instruction set accessible from the C programming language. Except for
__setflm, each of these intrinsics has a one-to-one assembly language mapping,
unless compiled for a 32-bit application binary interface (ABI) in which the high
and low halves of a 64-bit double word are maintained in separate registers.
The most useful intrinsics under this category are those related to shared
memory access and synchronization and those related to cache management.
Efficient use of those intrinsics can assist in improving the overall performance of
the application.
All such intrinsics are declared in the ppu_intrinsics.h header file that must be
included in order to use the intrinsics. They may be either defined within the
header as macros or implemented internally within the compiler.
Intrinsics do not have a specific ordering unless otherwise noted. They can be
optimized by the compiler and be scheduled like any other instruction.
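For example, the following sketch (ours, not the SDK's) uses two such intrinsics: __lwsync() to order a data store before the store of a ready flag, and __dcbt() to touch a cache line before it is loaded.

#include <ppu_intrinsics.h>

volatile int ready = 0;
int shared_data;

void publish(int v)
{
    shared_data = v;
    __lwsync();        /* order the data store before the flag store */
    ready = 1;
}

void prefetch(const void *p)
{
    __dcbt((void *)p); /* hint: bring the cache line in before it is loaded */
}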
You can find additional information about PPU scalar intrinsics in the following
resources:
The “PPU specific intrinsics” chapter of C/C++ Language Extensions for Cell
Broadband Engine Architecture document, which contains a list of the
available intrinsics and their meaning2
The “PPE instruction sets” chapter of the Software Development Kit for
Multicore Acceleration Version 3.0 Programming Tutorial document, which
provides a useful table that summarizes those intrinsics3
2 C/C++ Language Extensions for Cell Broadband Engine Architecture is available on the Web at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E/$file/Language_Extensions_for_CBEA_2.5.pdf
3 Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial is available on the Web at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/FC857AE550F7EB83872571A80061F788/$file/CBE_Programming_Tutorial_v3.0.pdf
The vector instructions include a rich set of operations that can be performed
on those vectors. Such operations include arithmetic operations, rounding and
conversion, floating-point estimate intrinsics, compare intrinsics, logical intrinsics,
rotate and shift intrinsics, load and store intrinsics, and pack and unpack
intrinsics.
VMX data types and Vector/SIMD Multimedia Extension intrinsics can be used in
a seamless way throughout a C-language program. The programmer does not
need to set up to enter a special mode. The intrinsics may be either defined as
macros within the system header file or implemented internally within the
compiler.
To use the Vector/SIMD intrinsics of the PPU, the programmer should ensure the
following settings:
Include the system header file altivec.h, which defines the intrinsics.
Set the -qaltivec and -qenablevmx flags in case XLC compilation is used.
Set the -mabi=altivec and -maltivec flags in case GCC compilation is
used.
Example 4-1 demonstrates simple PPU code that initiates two unsigned integer
vectors and adds them while placing the results in a third similar vector.
Source code: The code in Example 4-1 is included in the additional material
for this book. See “Simple PPU vector/SIMD code” on page 617 for more
information.
#include <stdio.h>
#include <altivec.h>

typedef union {
    int i[4];
    vector unsigned int v;
} vec_u;

int main()
{
    vec_u a, b, d;
    a.v = (vector unsigned int){1, 2, 3, 4}; // initialize the input vectors
    b.v = (vector unsigned int){5, 6, 7, 8};
    d.v = vec_add(a.v, b.v);                 // element-wise addition
    printf("%d %d %d %d\n", d.i[0], d.i[1], d.i[2], d.i[3]);
    return 0;
}
For additional information about PPU vector data type intrinsics, refer to the
following resources:
AltiVec Technology Programming Interface Manual, which provides a detailed
description of VMX intrinsics
The “Vector Multimedia Extension intrinsics” chapter of the C/C++ Language
Extensions for Cell Broadband Engine Architecture document, which
includes a list of the available intrinsics and their meaning4
The “PPE instruction sets” chapter of the Software Development Kit for
Multicore Acceleration Version 3.0 Programming Tutorial document, which
includes a useful table that summarizes the intrinsics5
In most cases, programmers prefer to use the eight SPEs to perform the massive
SIMD operations and let the PPU program manage the application flow. For this
practical reason, we do not discuss PPU Vector/SIMD operations in detail; most
of the relevant concepts are covered when we discuss the SPU SIMD
instructions in 4.6.4, “SIMD programming” on page 258.
The SDK also provides a set of header files that aim to minimize the effort of
porting a PPU program to the SPU and vice versa:
vmx2spu.h
The macros and inline functions to map PPU Vector/SIMD intrinsics to
generic SPU intrinsics
spu2vmx.h
The macros and inline functions to map generic SPU intrinsics to PPU
Vector/SIMD intrinsics
4 See note 2 on page 80.
5 See note 3 on page 80.
vec_types.h
An SPU header file that defines a set of single token vector data types that
are available on both the PPU and SPU. The SDK 3.0 provides both GCC and
XLC versions of this header file.
To learn more about this issue, we recommend that you read the “SPU and PPU
Vector Multimedia Extension Intrinsics” chapter and “Header files” chapter in the
C/C++ Language Extensions for Cell Broadband Engine Architecture document.6
The SIMD Math Library is supported both by the SPU and PPU. The SPU
version of this library is discussed in “SIMD Math Library” on page 262. The PPU
version is similar, but the location of the library files is different:
The simdmath.h file is in the /usr/include directory.
The inline headers are in the /usr/include/simdmath directory.
The libsimdmath.a library is in the /usr/lib directory.
The operating system schedules SPE contexts from all running applications onto
the physical SPE resources in the system for execution according to the
scheduling priorities and policies that are associated with the runnable SPE
contexts.
The programmer must run each SPE context on a separate Linux thread, which
enables the operating system to run the contexts in parallel with the PPE
threads and with other SPEs.
When creating an SPE thread, similar to Linux threads, the PPE program might
pass up to three parameters to this function. The parameters may be either 64-bit
parameters or 128-bit vectors. These parameters may be used later by the code
that runs on the SPE. One common use of the parameters is to pass the
effective address of a control block that may be larger and contain additional
information. The SPE can use this address to fetch the control block into its local
storage.
7 SPE Runtime Management Library is on the Web at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/771EC60D862C5857872571A8006A206B/$file/libspe_v1.2.pdf
There are two main methods to load SPE programs:
Static loading of SPE object
Statically compile the SPE object within the PPE program. At run time, the
object is accessed as an external pointer that can be used by the programmer
to load the program into local storage. The loading itself is implemented
internally by the library API by using DMA.
Dynamic loading of SPE executable
Compile the SPE program as a stand-alone application. At run time, open the
executable file, map it into main memory, and then load it into the local storage
of the SPE. This method is more flexible because you can decide, at run time,
which program to load, for example, depending on run-time parameters. This
method saves linking the SPE program with the PPE program, at the cost of
lost encapsulation: the program is now a set of files and not just a single
executable.
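A minimal sketch of the dynamic method follows; the executable file name spu_app is hypothetical, and error handling is kept to a minimum.

#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>

int run_spe_from_file(void)
{
    spe_program_handle_t *prog = spe_image_open("spu_app"); // map the file
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;

    if (prog == NULL || ctx == NULL) { perror("open/create"); exit(1); }
    spe_program_load(ctx, prog);              // copy the image into the LS
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    spe_context_destroy(ctx);
    spe_image_close(prog);                    // unmap the executable file
    return 0;
}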
Example 4-2 on page 86 through Example 4-4 on page 89 include the following
actions for the PPU code, which are ordered according to how they are executed
in the code:
1. Initiate a control structure to point to the input and output data buffers, and
   initiate the SPU executable's parameter to point to this structure (step 1 in the
   code).
2. Create the SPE context by using the spe_context_create function.
3. Statically load the SPE object into the SPE context local storage by using the
   spe_program_load function.
4. Run the SPE context by using the spe_context_run function.
5. (Optional) Print the reason why the SPE stopped. In this example, the
   preferred one is obviously the end of its main function with return code 0.
6. Destroy the SPE context by using the spe_context_destroy function.
Example 4-2 on page 86 shows the common header file. Be sure that the
libspe2.h header file is included in order to run the SPE program.
Source code: The code in Example 4-2 through Example 4-4 on page 89 is
included in the additional material for this book. See “Running a single SPE”
on page 618 for more information.
#endif // _COMMON_H_
#include "common.h"
volatile char out_data[BUFF_SIZE] __attribute__ ((aligned(128)));
switch (stop_info->stop_reason) {
case SPE_EXIT:
    printf(")PPE: SPE stop_reason=SPE_EXIT, exit_code=");
    break;
case SPE_STOP_AND_SIGNAL:
    printf(")PPE: SPE stop_reason=SPE_STOP_AND_SIGNAL, signal_code=");
    break;
case SPE_RUNTIME_ERROR:
    printf(")PPE: SPE stop_reason=SPE_RUNTIME_ERROR, runtime_error=");
    break;
case SPE_RUNTIME_EXCEPTION:
    printf(")PPE: SPE stop_reason=SPE_RUNTIME_EXCEPTION, runtime_exception=");
    break;
case SPE_RUNTIME_FATAL:
    printf(")PPE: SPE stop_reason=SPE_RUNTIME_FATAL, runtime_fatal=");
    break;
case SPE_CALLBACK_ERROR:
    printf(")PPE: SPE stop_reason=SPE_CALLBACK_ERROR, callback_error=");
    break;
default:
    printf(")PPE: SPE stop_reason=UNKNOWN, result=");
    break;
}
printf("%d, status=%d\n", result, stop_info->spu_status);
}
// main
//====================================================================
int main( )
{
spe_stop_info_t stop_info;
Example 4-4 shows the SPU code.
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include "common.h"
uint32_t tag_id;
//STEP 3: get input buffer, process it, and put results in output
// buffer
In the code example in this section, we execute two SPE threads and include the
following actions:
Initiate the SPE control structures.
Dynamically load the SPE executable into several SPEs:
– Create SPE contexts.
– Open images of SPE programs and map them into main storage.
– Load SPE objects into SPE context local storage.
Initiate the Linux threads and run the SPE executable concurrently on those
threads. The PPU forwards parameters to the SPU programs.
In this example, the common header file is the same as Example 4-2 on page 86
in “Running a single SPE program” on page 85. You must include the libspe2.h
header file to run the SPE programs and include the pthread.h file to use the
Linux threads.
Source code: The code shown in Example 4-5 and Example 4-6 on page 93
is included in the additional material for this book. See “Running multiple SPEs
concurrently” on page 618 for more information.
#include "common.h"
#define NUM_SPES 2
spe_program_handle_t *program[NUM_SPES];
spu_data_t data[NUM_SPES];
if(spe_context_run(datp->spe_ctx,&entry,0,datp->argp,NULL,NULL)<0){
perror ("Failed running context"); exit (1);
}
pthread_exit(NULL);
}
// main ===============================================================
int main( )
{
int num;
if (spe_context_destroy( data[num].spe_ctx )) {
perror("Failed spe_context_destroy"); exit(1);
}
}
printf(")PPE:) Complete running all super-fast SPEs\n");
return (0);
}
Example 4-6 and Example 4-7 on page 94 show the SPU code.
The SPE scheduler, which is responsible for mapping the SPE logical contexts to
the physical SPEs, honors this relationship by trying to schedule the SPE contexts
on physically adjacent SPUs. Whether it can do so depends on the current status
of the system. If the PPE program creates such affinity when no other code is
running on the SPEs (in this program or any other program), the scheduler
should succeed in doing so.
The usage of SPE-to-SPE affinity can create performance advantages in some
cases. The performance gain is based mainly on the following characteristics of
the Cell Broadband Engine Architecture (CBEA) and systems:
On a Cell/B.E.-based symmetric multiprocessor (SMP) system, such as a
BladeCenter QS21, communication between SPEs that are located on the
same Cell/B.E. chip is more efficient than data transfer between SPEs
that are located on different Cell/B.E. chips. This includes both data transfer,
for example LS to LS, and other types of communication, such as mailboxes
and signals.
Similarly to the previous characteristic, but on the same chip, communication
between SPEs that are adjacent on the local element interconnect bus (EIB)
is more efficient than between SPEs that are not adjacent.
Example 4-8 shows PPU code that creates such a chain of SPEs. This example
is inspired by the SDK code example named dmabench that is in the
/opt/cell/sdk/src/benchmarks/dma directory.
The article “Cell Broadband Engine Architecture and its first implementation: A
performance view” [18 on page 624] provides information about the bandwidth
that was measured for some SPE-to-SPE DMA transfers. This information
might be useful when deciding how to place the SPEs relative to each other
for a given algorithm.
Example 4-8 PPU code for creating SPE physical chain using affinity
spe_gang_context_ptr_t gang;
spe_context_ptr_t ctx[NUM_SPES];
int main( )
gang = NULL;
// create a gang
if ((gang = spe_gang_context_create(0))==NULL) {
perror("Failed spe_gang_context_create"); exit(1);
}
for (i = 0; i < NUM_SPES; i++) {
    // chain each context after the previously created one in the gang
    ctx[i] = spe_context_create_affinity(0, (i == 0) ? NULL : ctx[i-1], gang);
    if (ctx[i] == NULL) {
        perror("Failed spe_context_create_affinity"); exit(1);
    }
}
// (the entire source code for this example comes with the book's
// additional material)
4.2 Storage domains, channels, and MMIO interfaces
In this section, we describe the main storage domains of the CBEA. The CBEA
has a unique memory architecture, and understanding its domains is key to
knowing how to program a Cell/B.E. application and how the data can be
partitioned and transferred in such an application. We discuss the storage
domains in 4.2.1, “Storage domains” on page 97.
The main-storage domain, which is the entire effective address space, can be
configured by the PPE operating system to be shared by all processors in the
system. In contrast, the local-storage and channel problem-state (user-state)
domains are private to the SPE components. The main components in each SPE
are the SPU, the LS, and the MFC, which handles the DMA data transfer.
Main storage: In this document, we use the term main storage to describe
any component that has an effective address mapping on the main storage
domain.
An SPE program references its own LS by using a local store address (LSA). The
LS of each SPE is also assigned a real address (RA) range within the system’s
memory map. As a result, privileged software on the PPE can map LS areas into
the effective address (EA) space, where the PPE, other SPEs, and other devices
that generate EAs can access the LS like any regular component on the main
storage.
Code that runs on an SPU can only fetch instructions from its own LS, and loads
and stores can only access that LS.
Data transfers between the LS and main storage of the SPE are primarily
executed by using DMA transfers that are controlled by the MFC DMA controller
for that SPE. The MFC of each SPE serves as a data-transfer engine. DMA
transfer requests contain both an LSA and an EA. Therefore, they can address
both the LS and main storage of an SPE and thereby initiate DMA transfers
between the domains. The MFC accomplishes this by maintaining and
processing an MFC command queue.
Because the local storage can be mapped to the main storage, SPEs can use
DMA operations to transfer data directly from their LS to the LS of another
SPE. This mode of data transfer is efficient because the DMA transfers go
directly from SPE to SPE on the high-performance local bus without involving
the system memory.
When accessing the two interfaces, commands are inserted into one of two
independent MFC command queues:
The channel interface is associated with the MFC SPU command queue.
The MMIO interface is associated with the MFC proxy command queue.
The stalling mechanism reduces SPE software complexity and allows an SPE to
minimize the power consumed by message-based synchronization. To avoid
stalling on access to a blocking channel, SPE software can read the channel
count to determine the available channel capacity. In addition, many of the
channels have a corresponding and independent event that can be enabled to
cause an asynchronous interrupt.
Similarly, reading from an MMIO register when a queue is empty returns invalid
data. Therefore, the PPE (or other SPE) should first read the corresponding
status register. Only if there is a valid entry (queue is not empty), the MMIO
register itself should be read.
8 See note 1 on page 78.
Table 4-1 summarizes the main attributes of the two main interfaces of the MFC.
Channels interface
– Associated queue: MFC SPU command queue
– Who can access: Local SPU
– Blocking behavior: Blocking or nonblocking
– Behavior when the queue is full: Wait until the queue has an available entry
– Usage: For MFC commands sent from the SPU through the channel interface

MMIO interface
– Associated queue: MFC proxy command queue
– Who can access: PPE or other SPEs
– Blocking behavior: Always nonblocking
– Behavior when the queue is full: Overwrites the last entry
– Usage: For MFC commands sent from the PPE, other SPUs, or other devices through the MMIO registers
Many code examples in the SDK package also use this method. However, the
examples in the SDK documentation rely mostly on composite intrinsics and
low-level intrinsics. Such examples are available in the Software Development
Kit for Multicore Acceleration Version 3.0 Programming Tutorial document.a
a. See note 3 on page 80.
In this section, we illustrate the differences between the four methods by using
the DMA get command, which moves data from a component in main storage to
local storage. This is done only for demonstration purposes. Similar
implementation can be performed for using each of the other MFC facilities, such
as mailboxes, signals, and events. If you are not familiar with the MFC DMA
Some parameters, which are listed in Table 4-2, are common to all DMA transfer
commands. For all of the alternatives, we assume that the DMA transfer
parameters described in Table 4-2 are defined prior to executing the DMA
command.
In the following sections, we describe the four methods to access the MFC
facilities.
MFC functions
MFC functions are a set of convenience functions, each of which performs a
single DMA command (for example, get, put, or barrier). The functions are
implemented either as macros or as built-in functions within the compiler,
causing the compiler to map each of those functions to a certain composite
intrinsic (similar to those discussed in “Composite intrinsics” on page 103)
with the corresponding operands.
Table 4-3 on page 113 provides a list and descriptions of all the available MFC
functions. For a more detailed description, see the “Programming support for
MFC input and output” chapter in the C/C++ Language Extensions for Cell
Broadband Engine Architecture document.9
To use the intrinsics, the programmer must include the spu_mfcio.h header file.
Example 4-9 on page 103 shows the initiation of a single get command by using
the MFC functions.
9 See note 2 on page 80.
Example 4-9 SPU MFC function get command example
#include "spu_mfcio.h"
// wait until DMA transfer is complete (or do other things before that)
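The body of the example reduces to one mfc_get call followed by a tag-group wait. The following is a minimal sketch of those calls, assuming the lsa, eah, eal, size, and tag_id parameters of Table 4-2 are already set:

mfc_get((void *)lsa, ((uint64_t)eah << 32) | eal, size, tag_id, 0, 0); // enqueue the get
mfc_write_tag_mask(1 << tag_id);   // select this tag group
mfc_read_tag_status_all();         // block until all of its DMAs complete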
Composite intrinsics
The SDK 3.0 defines a small number of composite intrinsics to handle DMA
commands. Each composite intrinsic handles one DMA command and is
constructed from a series of low-level intrinsics (similar to those discussed in
“Low-level intrinsics” on page 104). These intrinsics are further described in the
“Composite intrinsics” chapter in the C/C++ Language Extensions for Cell
Broadband Engine Architecture document10 and in the Cell Broadband Engine
Architecture document.11
To use the intrinsics, the programmer must include the spu_intrinsics.h header
file. In addition, the spu_mfcio.h header file includes useful predefined values of
the DMA commands (for example, MFC_GET_CMD in Example 4-10). The
programmer can include this file and use its predefined values instead of
explicitly writing the corresponding values. Example 4-10 shows the initiation of a
single get command by using composite intrinsics.
#include <spu_intrinsics.h>
#include "spu_mfcio.h"
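The composite-intrinsic call itself presumably follows these includes; a minimal sketch, again assuming the parameters of Table 4-2:

// one composite intrinsic enqueues the whole get command
spu_mfcdma64((void *)lsa, eah, eal, size, tag_id, MFC_GET_CMD);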
10 See note 2 on page 80.
11 The Cell Broadband Engine Architecture document is on the Web at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1AEEE1270EA2776387257060006E61BA/$file/CBEA_v1.02_11Oct2007_pub.pdf
The relevant low-level intrinsic are described in the “Channel control intrinsics”
chapter in the C/C++ Language Extensions for Cell Broadband Engine
Architecture document.12
To use the intrinsics, the programmer must include the spu_intrinsics.h header
file. Example 4-11 shows the initiation of a single get command by using
low-level intrinsics.
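Such an example writes one channel per DMA parameter; a minimal sketch of the equivalent channel-write sequence, assuming the parameters of Table 4-2:

spu_writech(MFC_LSA, lsa);          // local storage address
spu_writech(MFC_EAH, eah);          // effective address, high word
spu_writech(MFC_EAL, eal);          // effective address, low word
spu_writech(MFC_Size, size);        // transfer size in bytes
spu_writech(MFC_TagID, tag_id);     // tag group
spu_writech(MFC_Cmd, MFC_GET_CMD);  // issue the get command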
Assembly-language instructions
Assembly-language instructions are similar to the low-level intrinsics: a single
DMA transfer is executed by a series of ABI-compliant assembly-language
instructions, where each low-level intrinsic represents one assembly instruction.
From a practical point of view, the only case in which we can recommend using
this method instead of the low-level intrinsics is when the program is written in
assembly.
wrch $MFC_LSA, $3
wrch $MFC_EAH, $4
wrch $MFC_EAL, $5
wrch $MFC_Size, $6
wrch $MFC_TagID, $7
wrch $MFC_Cmd, $8
bi $0

12 See note 2 on page 80.
Unlike the SPU case when using the channel interface, in the PPU case, it is not
always recommended to use the MFC functions. The following list summarizes
the differences between the two methods and recommendations for using them:
MFC functions are simpler from a programmer’s point of view. Therefore,
usage of this method can reduce development time and make the code more
readable.
Direct problem state access gives the programmer more flexibility, especially
when non-standard mechanism must be implemented.
Direct problem state access has significantly better performance in many
cases, such as when writing to the inbound mailbox. Two reasons for the
reduced performance of the MFC functions are the call overhead and the
mutex locking associated with the library functions being thread safe.
Therefore, in cases where the performance, for example latency, of the PPE
access to the MFC is important, use direct SPE access, which may have
significantly better performance than the MFC functions.
Most of the examples in this document, as well as many code examples in the
SDK package, use the MFC functions method. However, the examples in the
SDK documentation rely mostly on the direct SPE access method. Many such
examples are available in the Software Development Kit for Multicore
Acceleration Version 3.0 Programming Tutorial.13
In this section, we illustrate the differences between the two methods by using
the DMA get command to move data from a component on the main storage to
the local storage. Similar implementation may be performed for using each of the
other MFC facilities, including mailboxes, signals, events, and so on. We used the
same parameters that are defined in Table 4-2 on page 102, but the additional
spe_context_ptr_t spe_ctx parameter, which identifies the target SPE context,
is added in the PPU case.
13 See note 3 on page 80.
In the following sections, we describe the two main methods for a PPE to access
the MFC facilities.
MFC functions
MFC functions are a set of convenience functions, each of which implements a
single DMA command (for example, get, put, or barrier). Table 4-3 on page 113
provides a list and description of all the available functions. For a more detailed
description, see the “SPE MFC problem state facilities” chapter in the SPE
Runtime Management Library document.14
Unlike the SPE implementation, the implementation of the MFC functions for the
PPE usually involves accessing the operating system kernel, which adds a
non-negligible number of cycles and increases the latency of those functions.
To use these functions, the programmer must include the libspe2.h header file.
Example 4-13 illustrates the initiation of a single get command by using MFC
functions.
#include "libspe2.h"
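A minimal sketch of the call that presumably follows, assuming the Table 4-2 parameters plus the spe_ctx context (here ea is a void pointer to the main-storage buffer):

// enqueue a get on this SPE's proxy command queue
ret = spe_mfcio_get(spe_ctx, lsa, ea, size, tag, 0, 0);

// wait for completion of the whole tag group
spe_mfcio_tag_status_read(spe_ctx, 1 << tag, SPE_TAG_ALL, &status);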
14 See note 7 on page 84.
– Use the inline functions that are defined in the cbe_mfc.h file, which make
using the direct problem state as easy as using the libspe functions. For
example, the _spe_sig_notify_1_read function reads the SPU_Sig_Notify_1
register, the _spe_out_mbox_read function reads a value from the
SPU_Out_Mbox mailbox register, and the _spe_mfc_dma function
enqueues a DMA request.
– Use direct memory load or store instructions to access the relevant MMIO
registers. The easiest way to do so is by using the enums and structs that
describe the problem state areas and the offsets of the MMIO registers.
The enums and structs are defined in the libspe2_types.h and
cbea_map.h header files (see Example 4-15 on page 109). However, to use
them, the programmer should include only the libspe2.h file.
After the problem state area is mapped, direct access to this area by the
application does not involve the kernel, and therefore, has a smaller latency than
the corresponding MFC function.
SPE_MAP_PS flag: The PPE programmer must set the SPE_MAP_PS flag
when creating the SPE context (in the spe_context_create function) of the
SPE whose problem state area the programmer later will try to map by using
the spe_ps_area_get function. See Example 4-14.
Example 4-14 shows the PPU code for mapping an SPE problem state to the
thread address space and initiating a single get command by using direct SPE
access.
Source code: The code of Example 4-14 is included in the additional material
for this book. See “Simple PPU vector/SIMD code” on page 617 for more
information.
#include <libspe2.h>
#include <cbe_mfc.h>
#include <pthread.h>
spe_context_ptr_t spe_ctx;
uint32_t lsa, eah, eal, tag, size, ret, status;
volatile spe_mfc_command_area_t* mfc_cmd;
volatile char data[BUFF_SIZE] __attribute__ ((aligned (128)));
// create SPE context: must set SPE_MAP_PS flag to access problem state
spe_ctx = spe_context_create (SPE_MAP_PS, NULL);
do{
    mfc_cmd->MFC_LSA = lsa;
    mfc_cmd->MFC_EAH = eah;
    mfc_cmd->MFC_EAL = eal;
    mfc_cmd->MFC_Size_Tag = (size<<16) | tag;
    mfc_cmd->MFC_ClassID_CMD = MFC_GET_CMD;
    ret = mfc_cmd->MFC_CMDStatus;
} while (ret & 0x3); // retry until the command is accepted into the queue
The SDK 3.0 header files libspe2_types.h and cbea_map.h contain several
enums and structs that define the problem state areas and registers, which
makes accessing the MMIO interface from the PPE more convenient.
Example 4-15 on page 109 shows these enums and structs.
Example 4-15 Enums and structs for defining problem state areas and registers
// From libspe2_types.h header file
// =================================================================
enum ps_area { SPE_MSSYNC_AREA, SPE_MFC_COMMAND_AREA, SPE_CONTROL_AREA,
SPE_SIG_NOTIFY_1_AREA, SPE_SIG_NOTIFY_2_AREA };
Programming the DMA data transfer can be done either by using an SPU
program with the channel interface or by using a PPU program with the
MMIO interface. Refer to 4.2, “Storage domains, channels, and MMIO interfaces”
on page 97, about the usage of these interfaces.
Regarding issuing DMA commands to the MFC, the channel interface has 16
entries in its corresponding MFC SPU command queue, which means that up to
16 DMA commands can be handled simultaneously by the MFC. The
corresponding MMIO interface has only eight entries in its corresponding MFC
proxy command queue. For this reason, and for other reasons, such as smaller
latency in issuing the DMA commands and less overhead on the internal EIB,
the programmer should run DMA commands from the SPU program rather than
from the PPU.
In this section, we explain the DMA data transfer methods, as well as other data
transfer methods, such as direct load and store, that may be used to transfer data
between the LS and main memory or between one LS and another LS.
In the next three sections, we explain how to initiate various data transfers by
using the SDK 3.0 core libraries:
In 4.3.2, “SPE-initiated DMA transfer between the LS and main storage” on
page 120, we discuss how a program running on an SPU can initiate DMA
commands between its LS and the main memory by using the associated
MFC.
In 4.3.3, “PPU initiated DMA transfer between LS and main storage” on
page 138, we explain how a program running on a PPU can initiate DMA
commands between the LS of some SPEs and main memory by using the
MFC that is associated with this SPE.
In 4.3.4, “Direct problem state access and LS-to-LS transfer” on page 144, we
discuss two different issues. We explain how an LS of some SPEs can be
accessed directly by the PPU or by an SPU program running on another SPE.
In the next two sections, we discuss two alternatives (other than the core libraries)
that come with the SDK 3.0 and can be used for simpler initiation of data transfer
between the LS and main storage:
In 4.3.5, “Facilitating random data access by using the SPU software cache”
on page 148, we explain how to use the SPU software managed cache and in
which cases to use it.
In 4.3.6, “Automatic software caching on SPE” on page 157, we discuss an
automated version of the SPU software cache that provides an even simpler
programming method but with the potential for reduced performance.
Another topic that is relevant to data transfer, and that is not covered in the
following sections, is the ordering of data transfers and synchronization
techniques. Refer to 4.5, “Shared storage synchronizing and data ordering”
on page 218, to learn more about this topic.
Each MFC has an associated memory management unit (MMU) that holds and
processes address-translation and access-permission information that is
supplied by the PPE operating system. While this MMU is distinct from the one
used by the PPE, to process an effective address provided by a DMA command,
the MMU uses the same method as the PPE memory-management functions.
Thus, DMA transfers are coherent with respect to system storage. Attributes of
system storage are governed by the page and segment tables of the PowerPC
Architecture.
In the following sections, we discuss several issues related to the supported DMA
commands.
MFC supports a set of DMA commands. DMA commands can initiate or monitor
the status of data transfers.
Each MFC can maintain and process up to 16 in-progress DMA command
requests and DMA transfers, which are executed asynchronously to the code
execution.
The MFC can autonomously manage a sequence of DMA transfers in
response to a DMA list command from its associated SPU. DMA lists are a
sequence of eight-byte list elements, stored in the LS of an SPE, each of
which describes a single DMA transfer.
Each DMA command is tagged with a 5-bit Tag ID, which defines up to 32 IDs.
The software can use this identifier to check or wait on the completion of all
queued commands in one or more tag groups.
Refer to “Supported and recommended values for DMA-list parameters” on
page 117 for the supported and recommended parameters of a DMA list.
Table 4-3 summarizes all the DMA commands that are supported by the MFC.
For each command, we mention the SPU and the PPE MFC functions that
implement it, if any. (A blank cell indicates that this command is not supported by
either the SPE or PPE.) For detailed information about the MFC commands, see
the “DMA transfers and inter-processor communication” chapter in the Cell
Broadband Engine Programming Handbook.15
The SPU functions are defined in the spu_mfcio.h header file and are described
in the C/C++ Language Extensions for Cell Broadband Engine Architecture
document.16 The PPE functions are defined in the libspe2.h header file and are
described in the SPE Runtime Management Library document.17
SDK 3.0 defines another set of PPE inline functions for handling DMA data
transfers in the cbe_mfc.h file, which are preferred from a performance point of
view over the libspe2.h functions. While the cbe_mfc.h functions are not well
described in the official SDK documents, they are straightforward and easy to
use. To enqueue a DMA command, the programmer can issue the _spe_mfc_dma
function with the cmd parameter indicating which DMA command should be
enqueued. For example, set the cmd parameter to MFC_PUT_CMD for the put
command, set it to MFC_GETS_CMD for the gets command, and so on.
Put commands
put mfc_put spe_mfcio_put Moves data from the LS to the effective address.
puts Unsupported Nonea Moves data from the LS to the effective address and
starts the SPU after the DMA operation completes.
putf mfc_putf spe_mfcio_putf Moves data from the LS to the effective address with
the fence option. This command is locally ordered with
respect to all previously issued commands within the
same tag group and command queue.
15 See note 1 on page 78.
16 See note 2 on page 80.
17 See note 7 on page 84.
putb mfc_putb spe_mfcio_putb Moves data from the LS to the effective address with
the barrier option. This command and all subsequent
commands with the same tag ID as this command are
locally ordered with respect to all previously issued
commands within the same tag group and command
queue.
putfs Unsupported Nonea Moves data from the LS to the effective address with
the fence option. This command is locally ordered with
respect to all previously issued commands within the
same tag group and command queue. Starts the SPU
after the DMA operation completes.
putbs Unsupported Nonea Moves data from the LS to the effective address with
the barrier option. This command and all subsequent
commands with the same tag ID as this command are
locally ordered with respect to all previously issued
commands within the same tag group and command
queue. Starts the SPU after the DMA operation
completes.
putl mfc_putl Unsupported Moves data from the LS to the effective address by
using an MFC list.
putlf mfc_putlf Unsupported Moves data from the LS to the effective address by
using an MFC list with the fence option. This command
is locally ordered with respect to all previously issued
commands within the same tag group and command
queue.
putlb mfc_putlb Unsupported Moves data from the LS to the effective address by
using an MFC list with the barrier option. This
command and all subsequent commands with the
same tag ID as this command are locally ordered with
respect to all previously issued commands within the
same tag group and command queue.
get commands
get mfc_get spe_mfcio_get Moves data from the effective address to the LS.
gets Unsupported Nonea Moves data from the effective address to the LS and
starts the SPU after the DMA operation completes.
getf mfc_getf spe_mfcio_getf Moves data from the effective address to the LS with
the fence option. This command is locally ordered with
respect to all previously issued commands within the
same tag group and command queue.
getb mfc_getb spe_mfcio_getb Moves data from the effective address to the LS with
the barrier option. This command and all subsequent
commands with the same tag ID as this command are
locally ordered with respect to all previously issued
commands within the same tag group and command
queue.
getfs Unsupported Nonea Moves data from the effective address to the LS with
the fence option. This command is locally ordered with
respect to all previously issued commands within the
same tag group. Starts the SPU after the DMA
operation completes.
getbs Unsupported Nonea Moves data from the effective address to LS with the
barrier option. This command and all subsequent
commands with the same tag ID as this command are
locally ordered with respect to all previously issued
commands within the same tag group and command
queue. Starts the SPU after the DMA operation
completes.
getl mfc_getl Unsupported Moves data from the effective address to the LS by
using an MFC list.
getlf mfc_getlf Unsupported Moves data from the effective address to the LS by
using an MFC list with the fence option. This command
is locally ordered with respect to all previously issued
commands within the same tag group and command
queue.
getlb mfc_getlb Unsupported Moves data from the effective address to the LS by
using an MFC list with the barrier option. This
command and all subsequent commands with the
same tag ID as this command are locally ordered with
respect to all previously issued commands within the
same tag group and command queue.
a. While this command can be issued by the PPE, no MFC function supports it.
s Yes Starts the SPU. Starts the SPU running at the address in the SPU Next Program Counter Register (SPU_NPC) after the MFC command completes.
Alignment
Alignment of the LSA and the EA should obey the following guidelines:
– The source and destination addresses must have the same four least
significant bits.
– For transfer sizes less than 16 bytes, the address must be naturally
aligned. Bits 28 through 31 must provide natural alignment based on the
transfer size.
– For transfer sizes of 16 bytes or greater, the address must be aligned to at
least a 16-byte boundary. Bits 28 through 31 must be 0.
– The peak performance is achieved when both the source and destination
are aligned on a 128-byte boundary. Bits 25 through 31 must be cleared
to 0.
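For illustration (our sketch, not the document's), the following declarations satisfy these rules:

/* minimum legal alignment for a 16-byte or larger transfer */
volatile float quad[4] __attribute__((aligned(16)));

/* 128-byte alignment, for peak DMA performance */
volatile char dma_buf[4096] __attribute__((aligned(128)));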
The MFC checks the validity of the effective address during the transfers. Partial
transfers can be performed before the MFC encounters an invalid address and
raises the interrupt to the PPE.
Table 4-5 on page 119 shows the synchronization and atomic commands that
are supported by the MFC. For each command, we mention the SPU and the
PPE MFC functions that implement it, if any. A blank cell indicates that this
command is not supported by either the SPE or the PPE. For detailed information
about the MFC commands, see the “DMA transfers and inter-processor
communication” chapter in the Cell Broadband Engine Programming
Handbook.18
The SPU MFC functions are defined in the spu_mfcio.h header file and are
described in the C/C++ Language Extensions for Cell Broadband Engine
Architecture document.19
18 See note 1 on page 78.
19 See note 2 on page 80.
The PPE functions are defined in the libspe2.h header file and are described in the SPE
Runtime Management Library document.20
SPU PPE
Synchronization commands
mfceieio mfc_eieio __eieio Controls the ordering of get commands with respect to
put commands, and of get commands with respect to
get commands accessing storage that is caching
inhibited and guarded. Also controls the ordering of
put commands with respect to put commands
accessing storage that is memory coherence required
and not caching inhibited.
mfcsync mfc_sync __sync Controls the ordering of DMA put and get operations
within the specified tag group with respect to other
processing units and mechanisms in the system.
Atomic commands
20 See note 7 on page 84.
Tag groups can be formed separately within any of the two MFC command
queues. Therefore, tags that are assigned to commands in the SPU command
queue are independent of the tags that are assigned to commands in the proxy
command queue of the MFC.
Tagging is useful when using barriers to control the ordering of MFC commands
within a single command queue. DMA commands within a tag group can be
synchronized with a fence or barrier option by appending an “f” or “b,”
respectively, to the command mnemonic:
The execution of a fenced command option is delayed until all previously
issued commands within the same tag group have been performed.
The execution of a barrier command option and all subsequent commands is
delayed until all previously issued commands in the same tag group have
been performed.
The MFC supports additional data transfer commands, such as putf, putlb,
getlf, and getb, that guarantee ordering between the data transfer. These
commands are initiated similar to the way in which the basic get and put
commands are initiated, although their behavior is different.
For detailed information about the channel interface and the MFC commands,
see the “SPE channel and related MMIO interface” chapter and the “DMA
transfers and interprocessor communication” chapter in the Cell Broadband
Engine Programming Handbook.21
21 See note 1 on page 78.
Tag manager
The tag manager facilitates the management of tag identifiers that are used for
DMA operations in an SPU application. It is implemented through a set of
functions that the programmer must use to reserve tag IDs before initializing
DMA transactions and release them upon completion.
The functions are defined in the spu_mfcio.h header file and are described in the
C/C++ Language Extensions for Cell Broadband Engine Architecture
document.22 The following functions are the main ones:
mfc_tag_reserve, which reserves a single tag ID
mfc_tag_release, which releases a single tag ID
Some tags may be pre-allocated and used by the operating environment, for
example, by the software managed cache or the PDT (Performance Debugging
Tool). Therefore, the implementation of the tag manager does not guarantee
making all 32 architected tag IDs available for user allocation. If the
programmer uses a fixed value for tag IDs instead of using the tag manager,
inefficiencies can result from waiting for DMA completions on tag groups that
contain DMAs that were issued by other software components.
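A minimal sketch of the reserve/use/release discipline (the transfer itself is
elided; the function name is illustrative):

#include <stdio.h>
#include <spu_mfcio.h>

int do_transfer(void)
{
    uint32_t tag_id;

    /* reserve a tag ID from the tag manager instead of hardcoding one */
    if ((tag_id = mfc_tag_reserve()) == MFC_TAG_INVALID) {
        printf("SPE: ERROR - can't reserve a tag ID\n");
        return 1;
    }

    /* ... issue DMA commands that use tag_id and wait for completion ... */

    mfc_tag_release(tag_id);   /* return the tag ID to the tag manager */
    return 0;
}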
In the following sections, we explain how to initialize basic get and put DMA
commands. We illustrate this concept by using a code example that also includes
the use of the tag manager.
22. See note 2 on page 80.
These functions are nonblocking in terms of issuing the DMA command. The
software continues its execution after enqueueing the commands into the MFC
SPU command queue; it does not wait until the DMA commands are issued on
the EIB. However, these functions will block if the command queue is full and
then wait until space is available in that queue. Table 4-3 on page 113 shows the
full list of the supported commands.
The programmer should be aware of the fact that the implementation of these
functions involves a sequence of the following six channel writes:
1. Write the LSA parameter to the MFC_LSA channel.
2. Write the effective address higher (EAH) bits parameter to the MFC_EAH
channel.
3. Write the effective address lower (EAL) bits parameter to the MFC_EAL
channel.
4. Write the transfer size parameter to the MFC_Size channel.
5. Write the tag ID parameter to the MFC_TagID channel.
6. Write the class ID and command opcode to the MFC_Cmd channel. The
opcode command defines the transfer type, for example get or put.
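Conceptually, this six-write sequence is what a function such as mfc_get boils
down to. The following is a simplified sketch (not the SDK implementation,
which also packs the replacement and transfer class IDs into the command
word):

#include <spu_intrinsics.h>
#include <spu_mfcio.h>

/* sketch of issuing a get through the channel interface */
static inline void sketch_dma_get(volatile void *lsa, uint64_t ea,
                                  uint32_t size, uint32_t tag_id)
{
    spu_writech(MFC_LSA,   (uint32_t)lsa);   /* 1. LS address         */
    spu_writech(MFC_EAH,   mfc_ea2h(ea));    /* 2. EA upper 32 bits   */
    spu_writech(MFC_EAL,   mfc_ea2l(ea));    /* 3. EA lower 32 bits   */
    spu_writech(MFC_Size,  size);            /* 4. transfer size      */
    spu_writech(MFC_TagID, tag_id);          /* 5. tag group ID       */
    spu_writech(MFC_Cmd,   MFC_GET_CMD);     /* 6. class ID + opcode  */
}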
The mfc_read_tag_status_all function waits until all of the specified tagged
DMA commands are completed.
The last two functions are blocking and, therefore, cause the software to halt
until all DMA transfers that are related to the tag ID are complete. Refer to
Table 4-3 on page 113 for a full list of the supported commands.
The implementation of the first function generates the channel operation that
sets the bit representing the tag ID by writing the corresponding value to the
MFC_WrTagMask channel. All bits are 0 except the bit whose number equals
the tag ID.
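Put together, a minimal sketch of waiting for one tag group looks as follows; a
helper such as the waitag call that appears in Example 4-16 presumably wraps
exactly this pair:

#include <spu_mfcio.h>

/* block until every DMA issued on tag group tag_id has completed */
static inline void waitag(uint32_t tag_id)
{
    mfc_write_tag_mask(1 << tag_id);   /* only our tag bit is set  */
    mfc_read_tag_status_all();         /* stall until all complete */
}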
Example 4-16 on page 124 and Example 4-17 on page 125 contain the
corresponding SPU and PPU code respectively.
Example 4-16 SPU initiated basic DMA between LS and main storage - SPU code
#include <spu_mfcio.h>

// ... (buffer declarations and DMA initiation elided; see the additional
// material)
waitag(tag_id);
return (0);
}
Example 4-17 SPU initiated basic DMA between LS and main storage - PPU code
#include <libspe2.h>
// macro for rounding input value to the next higher multiple of either
// 16 or 128 (to fulfill MFC’s DMA requirements)
#define spu_mfc_ceil128(value) ((value + 127) & ~127)
#define spu_mfc_ceil16(value) ((value + 15) & ~15)
spe_argp=(void*)str;
spe_envp=(void*)strlen(str);
spe_envp=(void*)spu_mfc_ceil16((uint32_t)spe_envp);//round up to 16B
// Initialize and run the SPE thread using the four functions:
// 1) spe_context_create 2) spe_image_open
// 3) spe_program_load 4) spe_context_run
In the following sections, we describe the steps that a programmer who wants to
initiate a sequence of transfers by using a DMA list should perform:
In “Creating a DMA list” on page 126, we create and initialize the DMA list in
the LS of an SPE. Either the local SPE, the PPE, or another SPE can do this
step.
In “Initiating the DMA list command” on page 127, we issue a DMA list
command, such as getl or putl. Such DMA list commands can only be
issued by programs that run on the local SPE.
In “Waiting for completion of the data transfer” on page 128, we wait for the
completion of the data transfers.
Finally, in “DMA list transfer: Code example” on page 129, we provide a code
example that illustrates the sequence of steps.
The SPU software creates the list and stores it in the LS. The basic element of
the list is an mfc_list_element structure that describes a single data transfer.
This structure is defined in the spu_mfcio.h header file as shown in Example 4-18.
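Example 4-18 is not reproduced in full here; the structure essentially has the
following layout (per the C/C++ Language Extensions document; field names
match spu_mfcio.h):

typedef struct mfc_list_element {
    uint64_t notify   :  1;  /* stall-and-notify flag for this element */
    uint64_t reserved : 16;  /* must be zero                           */
    uint64_t size     : 15;  /* transfer size in bytes                 */
    uint64_t eal      : 32;  /* lower 32 bits of the effective address */
} mfc_list_element_t;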
Transfer elements are processed sequentially in the order in which they are
stored. If the notify flag is set for a transfer element, the MFC stops processing
the DMA list after performing the transfer for that element until the SPE program
sends an acknowledgement. This procedure is described in “Waiting for
completion of the data transfer” on page 128.
These functions are nonblocking in terms of issuing the DMA command. The
software continues its execution after enqueueing the commands into the MFC
SPU command queue but does not block until the DMA commands are issued on
the EIB. However, these functions will block if the command queue is full and will
wait until space is available in that queue. Refer to Table 4-3 on page 113 for the
full list of supported commands.
Initializing a DMA list command requires similar steps and parameters as when
initializing a basic DMA command. See “Initiating a DMA transfer” on page 122.
However, a DMA list command requires two parameters that differ from those of
a single-transfer DMA command:

- EAL, which is written to the MFC_EAL channel, should be the starting LSA of
  the DMA list, rather than the EAL that is specified in each transfer element
  separately.
- Transfer size, which is written to the MFC_Size channel, should be the size in
  bytes of the DMA list itself, rather than the transfer size that is specified in
  each transfer element.
The starting LSA and the EAH are specified only once in the DMA list command
that initiates the transfers. The LSA is internally incremented based on the
amount of data transferred by each transfer element. However, if the starting LSA
for each transfer element in a list does not begin on a 16-byte boundary, then the
hardware automatically increments the LSA to the next 16-byte boundary. The
EAL for each transfer element is in the 4 GB area defined by the EAH.
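A minimal sketch of issuing the list command (names are illustrative and follow
the later examples; the list is assumed to be built already, as in Example 4-20):

#include <spu_mfcio.h>

/* dma_list must reside in the LS and be 8-byte aligned */
void issue_get_list(void *ls_buffer, uint64_t ea_base,
                    mfc_list_element_t *dma_list, uint32_t dma_list_len,
                    uint32_t tag_id)
{
    /* EAH is taken from ea_base; each element supplies its own EAL/size */
    mfc_getl(ls_buffer, ea_base, dma_list,
             dma_list_len * sizeof(mfc_list_element_t), /* list size in bytes */
             tag_id, 0, 0);
}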
The second mechanism is to use the stall-and-notify flag, which enables the
software to be notified by the MFC about the completion of a subset of the
transfers in the list. The MFC halts processing of the remaining transfers in the
list until it is acknowledged by the software. This mechanism can be useful if the
software needs to update the characteristics of stalled subsequent transfers,
depending on the data that was just transferred to the LS by the previous
transfers. In any case, the number of elements in the queued DMA list cannot be
changed.
To use this mechanism, the SPE software and the local MFC perform the
following actions:
1. The software enables the stall-and-notify event of the DMA list command,
which is illustrated in the notify_event_enable function of Example 4-20 on
page 131.
2. The software sets the notify bit in a certain element of the DMA list so that
the MFC indicates when that element's transfer is done.
3. The software issues a DMA list command on this list.
4. The MFC stops processing the DMA list after performing the transfer for that
specific element, which activates the DMA list command stall-and-notify
event.
5. The software handles the event, optionally modifies the subsequent transfer
elements before they are processed by the MFC, and then acknowledges the
MFC. This step is illustrated in the notify_event_handler function of
Example 4-20 on page 131.
6. The MFC continues processing the subsequent transfer elements in the list,
until perhaps another element sets the notify bit.
Source code: The code in Example 4-19 and Example 4-20 on page 131 is
included in the additional material for this book. See “SPU initiated DMA list
transfers between LS and main storage” on page 619 for more information.
Example 4-19 SPU initiated DMA list transfers between LS and main storage - Shared
header file
// common.h file =====================================================
typedef struct {
char cmd;
char data[DATA_LEN];
} data_elem; // aligned to 16B
Example 4-20 shows the SPU code.
Example 4-20 SPU initiated DMA list transfers between LS and main storage - SPU code
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include "common.h"
// global variables
int elem_per_dma, tot_num_elem, byte_per_dma, byte_total, dma_list_len;
int event_num=1, continue_dma=1;
int notify_incr=NOTIFY_INCR;
// (tail of the notify_event_enable function, which enables the DMA list
// stall-and-notify event)
eve_mask = spu_read_event_mask();
spu_write_event_mask(eve_mask | MFC_LIST_STALL_NOTIFY_EVENT);
}
// updates the remaining DMA list according to data that was already
// transferred to LS
//===================================================================
static inline void notify_event_update_list( )
{
int i, j, start, end;
start = (event_num-1)*notify_incr*elem_per_dma;
end = event_num*notify_incr*elem_per_dma-1;
spu_write_event_mask(eve_mask | MFC_LIST_STALL_NOTIFY_EVENT);
}while ( !(eve_mask&(uint32_t)MFC_LIST_STALL_NOTIFY_EVENT) );
// determine which tag group has stalled or which element in the DMA list
// command has stalled
tag_mask = mfc_read_list_stall_status();
printf("<SPE: done\n");
}
if(tag_id==MFC_TAG_INVALID){
printf("SPE: ERROR - can't reserve a tag ID\n");
return 1;
}
if(tot_num_elem>BUFF_SIZE){
printf("SPE: ERROR - dma length bigger than local buffer\n");
exit_handler( tag_id ); return 1;
}
for (i=0; i<dma_list_len; i++) {
dma_list[i].size = byte_per_dma;
dma_list[i].eal = addr;
dma_list[i].notify = 0;
addr += byte_per_dma;
}
exit_handler(tag_id);
return 0;
}
Example 4-21 SPU initiated DMA list transfers between LS and main storage - PPU code
#include <libspe2.h>
#include <cbe_mfc.h>
#include "common.h"
int main(int argc, char *argv[])
{
spe_program_handle_t *program;
int i, j, error=0;
// ================================================================
// tell the SPE to stop in some random element (number 3) after 10
// stall-and-notify events.
// ================================================================
in_data[3+10*ELEM_PER_DMA*NOTIFY_INCR].cmd = CMD_STOP;
data.argp = (void*)&ctx;
// (the entire source code for this example is part of the book’s
// additional material).
return (0);
}
For detailed information about the MMIO (or direct problem state) interface and
the MFC commands, see the “SPE channel and related MMIO interface” chapter
and the “DMA transfers and interprocessor communication” chapter respectively
in the Cell Broadband Engine Programming Handbook.23
Tag IDs: The tag ID that is used for the PPE-initiated DMA transfer is not
related to the tag ID that is used by the software that runs on this SPE. Each
tag is related to a different queue of the MFC. Currently no mechanism is
available for allocating tag IDs on the PPE side, such as the SPE tag manager.
Therefore, the programmer should use a predefined tag ID. Since tag IDs 16
to 31 are reserved for the Linux kernel, the user must use only tag IDs 0 to 15.
23. See note 1 on page 78.
Another alternative for PPU software to access the LS of an SPE is to map the
LS to the main storage and then use regular direct memory access. This issue is
discussed in “Direct PPE access to LS of some SPE” on page 145.
The naming of the commands is based on an SPE-centric view. For example, put
means a transfer from the SPE LS to an effective address.
DMA commands from the PPE: The programmer should avoid initiating
DMA commands from the PPE and should instead initiate them from the local
SPE, for three reasons. First, MMIO access by the PPE is executed over the
interconnect bus, which has a larger latency than an SPU access to the local
channel interface. The latency is high because the SPE problem state is
mapped as guarded, cache-inhibited memory. Second, the additional traffic
reduces the available bandwidth for other resources on the interconnect bus.
Third, since the PPE is an expensive resource, it is better to have the SPEs
do more work instead.
The functions are nonblocking so that the software continues its execution after
issuing the commands. Table 4-3 on page 113 provides a complete list of the
supported commands.
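A minimal sketch of a PPE-initiated put and the matching wait, using the
libspe2.h functions (the context, offset, and buffer names are illustrative;
spe_ctx is assumed to be a running SPE context):

#include <stdio.h>
#include <libspe2.h>

/* enqueue a put from the SPE's LS into a main-storage buffer and
   wait until the tag group completes */
int ppe_put_and_wait(spe_context_ptr_t spe_ctx, unsigned int ls_offset,
                     void *buffer, unsigned int size)
{
    unsigned int tag = 7;   /* tags 0-15 only; 16-31 are kernel-reserved */
    unsigned int status;

    if (spe_mfcio_put(spe_ctx, ls_offset, buffer, size, tag, 0, 0) != 0) {
        perror("spe_mfcio_put failed");
        return -1;
    }
    /* SPE_TAG_ALL blocks until all DMAs of the masked tag groups finish */
    if (spe_mfcio_tag_status_read(spe_ctx, 1 << tag, SPE_TAG_ALL,
                                  &status) != 0) {
        perror("spe_mfcio_tag_status_read failed");
        return -1;
    }
    return 0;
}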
Another alternative is to use the inline function that is defined in the cbe_mfc.h
file and supports all the DMA commands. This function is the _spe_mfc_dma
function, which enqueues a DMA request by using the values that are provided. It
supports all types of DMA commands according to the value of the cmd input
parameter. For example, the cmd parameter is set to MFC_PUT_CMD for the put
command, to MFC_GETS_CMD for the gets command, and so on.
DMA commands from the PPE: When issuing DMA commands from the
PPE, the cbe_mfc.h functions are preferred from a performance point of
view over the libspe2.h functions. While the cbe_mfc.h functions are not well
described in the SDK documentation, they are quite straightforward and easy
to use. In our examples, we used the libspe2.h functions.
The programmer should be aware of the fact that the implementation of the
functions involves a sequence of the following commands:
1. Write the LSA (local store address) parameter to the MFC_LSA register.
2. Write the EAH and EAL parameters to the MFC_EAH and MFC_EAL registers,
respectively. The software can implement this by two 32-bit stores or one
64-bit store.
3. Write the transfer size and tag ID parameters to the MFC_Size and
MFC_TagID registers, respectively. The software can implement this by one
32-bit store (MFC_Size in the upper 16 bits, MFC_TagID in the lower 16 bits)
or combine it with the MFC_ClassID_CMD value in one 64-bit store.
4. Write the class ID and command opcode to the MFC_ClassID_CMD register.
The opcode defines the transfer type, for example, get or put.
5. Read the MFC_CMDStatus register by using a single 32-bit load to
determine the success or failure of the attempt to enqueue a DMA command,
as indicated by the two least-significant bits of the returned value:
0 The enqueue was successful.
1 A sequence error occurred while enqueuing the DMA (for example, an
interrupt occurred and another DMA was started within the interrupt
handler). The software should restart the DMA sequence by going to
step 1.
2 The enqueue failed due to insufficient space in the command queue. The
software can either wait for space to become available before attempting
the DMA transfer again or simply continue attempting to enqueue the DMA
until successful (go to step 1).
3 Both errors occurred.
Waiting for completion of a DMA transfer
After the DMA command is initiated, the software might wait for completion of the
DMA transaction. The programmer calls one of the functions that are defined in
libspe2.h header, for example the spe_mfcio_tag_status_read function. The
function input parameters include a mask that defines the group ID and blocking
behavior (continue waiting until completion or quit after one read).
The programmer must be aware of the fact that the implementation of this
function includes a sequence of the following actions:
1. Set the Prxy_QueryMask register to the groups of interest. Each tag ID is
represented by one bit. Tag 31 is assigned the most-significant bit, and tag 0
is assigned the least-significant bit.
2. Issue an eieio instruction before reading the Prxy_TagStatus register to
ensure the effects of all previous stores are complete.
3. Read the Prxy_TagStatus register.
4. If the value is a nonzero number, at least one of the tag groups of interest has
completed. If you are waiting for all the tag groups of interest to complete, you
can perform a logical XOR between tag group status value and the tag group
query mask. A result of 0 indicates that all groups of interest are complete.
5. Repeat steps 3 and 4 until the tag groups of interest are complete.
Another alternative is to use the inline functions that are defined in cbe_mfc.h file:
The _spe_mfc_write_tag_mask function is a nonblocking function that writes
the mask value to the Prxy_QueryMask register.
The _spe_mfc_read_tag_status_immediate function is a nonblocking function
that reads the Prxy_TagStatus register and returns the value read. Before
calling this function, the _spe_mfc_write_tag_mask function must be called to
set the tag mask.
Various other methods are possible to wait for the completion of the DMA transfer
as described in the “PPE-initiated DMA transfers” chapter in the Cell Broadband
Engine Programming Handbook.24 We chose to show the simplest one.
24. See note 1 on page 78.
Example 4-22 shows the PPU code. We use the MFC functions method to
access the DMA mechanism from the PPU side. Each of these functions
implements several of the actions that were mentioned previously, which makes
the code simpler. Nevertheless, the programmer should be familiar with the
commands that are involved in order to understand their impact on application
execution.
Source code: The code that is shown in Example 4-22 and Example 4-23 on
page 144 is included in the additional material for this book. See “PPU
initiated DMA transfers between LS and main storage” on page 619 for more
information.
Example 4-22 PPU initiated DMA transfers between LS and main storage - PPU code
#include <libspe2.h>
#include <cbe_mfc.h>
spe_context_ptr_t spe_ctx;
int ret;
uint32_t tag, status;
// MUST use only tag 0-15 since 16-31 are used by kernel
tag = 7; // choose my lucky number
// collect from the SPE the offset in LS of the data buffer. NOT the
// most efficient using mailbox- but sufficient for initialization
while(spe_out_mbox_read( data.spe_ctx, &ls_offset, 1)<=0);
if(ret!=0){
perror ("Error status was returned");
// ‘status’ variable may provide more information
exit (1);
}
return (0);
Example 4-23 PPU initiated DMA transfers between LS and main storage - SPU code
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
// send to PPE the offset of the data buffer - stalls if mailbox is full
spu_write_out_mbox((uint32_t)my_data);
return 0;
}
The PPE software can participate in the initiating of such DMA list commands by
creating and initializing the DMA list in the LS of an SPE. The way in which PPU
software can access the LS is described in “Direct PPE access to LS of some
SPE” on page 145. After such a list is created, only code running on this SPU
can proceed with the execution of the command itself. Refer to “DMA list data
transfer” on page 126, in which we describe the entire process.
PPE access to the LS is described in the following section, “Direct PPE access to
LS of some SPE”. The programmer should avoid massive use of this technique
because of performance considerations.
We explain the access of LS by other SPEs in “SPU initiated LS-to-LS DMA data
transfer” on page 147. For memory bandwidth reasons, use this technique
whenever it fits the application structure.
A code running on the PPU can access the LS by performing the following
actions:
1. Map the LS to the main storage and provide an effective address pointer to
the LS base address. The spe_ls_area_get function of the libspe2.h header
file implements this step as explained in the SPE Runtime Management
Library document.25 Note that this type of memory access is not cache
coherent.
2. (Optional) From the SPE, get the offset and compare it to the LS base of the
data to be read or written. This step can be implemented by using the mailbox
mechanism.
3. Access this region as any regular data access to the main storage by using
direct load and store instructions.
Important: The LS stores the SPU program’s instructions, program stack, and
global data structure. Therefore, use care with the PPU code in accessing the
LS to avoid overriding the SPU program’s components. Let the SPU code
manage its LS space. Using any other communication technique, the SPU
code can send to the PPE the offset of the region in the LS that the PPE can
access.
25. See note 7 on page 84.
Source code: The code in Example 4-24 is included in the additional material
for this book. See “Direct PPE access to LS of some SPEs” on page 619 for
more information.
#include <ppu_intrinsics.h>
ea_ls_str = ea_ls_base + ls_offset;
return (0);
}
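Because Example 4-24 is heavily abbreviated here, the following is a minimal
sketch of the three steps (variable names follow the example; the mailbox
exchange is one possible way to obtain the offset, and the written string is
illustrative):

#include <stdio.h>
#include <string.h>
#include <libspe2.h>

/* spe_ctx is an already-created and running SPE context */
int write_into_ls(spe_context_ptr_t spe_ctx)
{
    unsigned int ls_offset;
    char *ea_ls_base, *ea_ls_str;

    /* step 1: map the LS to main storage (not cache coherent) */
    if ((ea_ls_base = (char *)spe_ls_area_get(spe_ctx)) == NULL) {
        perror("Failed mapping SPE local store");
        return -1;
    }
    /* step 2: get from the SPE the offset of the buffer we may touch */
    while (spe_out_mbox_read(spe_ctx, &ls_offset, 1) <= 0) ;

    /* step 3: ordinary load/store access to the mapped region */
    ea_ls_str = ea_ls_base + ls_offset;
    strcpy(ea_ls_str, "written by the PPE");
    return 0;
}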
Use LS-to-LS data transfer whenever it fits the application structure. This type of
data transfer is efficient because it goes directly from SPE to SPE on the internal
EIB without involving the main memory interface. The internal bus has a much
higher bandwidth than the memory interface (up to 10 times faster) and lower
latency.
The following actions enable a group of SPEs to initiate the DMA transfer
between each local storage:
1. The PPE maps the local storage to the main storage and provides an effective
address pointer to the local storage base addresses. The spe_ls_area_get
function of the libspe2.h header file implements this step as described in the
SPE Runtime Management Library document.26
2. The SPEs send to the PPE the offset, relative to the LS base, of the data to
be read or written.
26. See note 7 on page 84.
Source code: A code example that uses LS-to-LS data transfer to implement
a multistage pipeline programming mode is available as part the additional
material for this book. See “Multistage pipeline using LS-to-LS DMA transfer”
on page 619 for more information.
Software cache library: In this chapter, we use the term software cache
library. However, its full name is SPU software managed cache. We use the
shortened name for simplicity. For more information about the library
specification, see the Example Library API Reference document.a
a. Example Library API Reference is available on the Web at the following address:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/3B6ED257EE6235D900257353006E0F6A/$file/SDK_Example_Library_API_v3.0.pdf
The library provides a set of SPU function calls to manage the data on the LS
and to transfer the data between the LS and the main storage.
The library maintains a cache memory that is statically allocated on the LS.
From a programmer’s point of view, accessing the data by using the software
cache is similar to using ordinary load and store instructions, unlike the typical
DMA interface of the SPU for transferring data.
For the data on the main storage that the program tries to access, the cache
mechanism first checks whether it is already in the cache (that is, in the LS). If
it is, then the data is taken from there, which saves the latency of bringing the
data from the main storage. Otherwise, the cache automatically and
transparently to the programmer performs the DMA transfer.
The cache provides an asynchronous interface which, like double buffering,
enables the programmer to hide the memory access latency by overlapping
data transfer and computation.
The library has the following advantages over using standard SDK function
calls to activate the DMA transfer:

- Better performance can be achieved in some applications by taking
  advantage of locality of reference and saving redundant data transfers when
  the corresponding data is already in the LS.

  More information: Refer to “When and how to use the software cache” on
  page 156, which has examples of applications for which the software
  cache provides a good solution.

- Usage of familiar load and store instructions with an effective address is
  easier to program in most cases.
- Since the topology and behavior of the cache are configurable, it can be easily
  optimized to match data access patterns, unlike most hardware caches.
- The library decreases the development time that is needed to port some
  applications to the SPE.
Note: The software cache activity is local to a single SPU, managing the data
access of such an SPU program to the main storage and the LS. The software
cache does not coordinate between data accesses of different SPUs to main
storage, nor does it handle coherency between them.
Cache line: Unless explicitly mentioned, we use the term cache line to
define the software cache line. While the hardware cache line is fixed at
128 bytes, the software line can be configured to any power-of-two value
between 16 bytes and 4 KB.
To set the software cache attributes, the programmer must statically add the
corresponding definitions to the program code. With this method, the cache
attributes are taken into account during compilation of the SPE program (not at
run time), when many of the cache structures and functions are constructed.
In addition, the programmer must assign a specific name to the software cache
so that the programmer can define several separate caches in the same
program. This is useful in cases where several different data structures are
accessed by the program and each structure has different attributes. For
example, some structures are read-only, some are read-write, some are integers,
and some are single precision.
By default, the software managed cache can use the entire range of the 32 tag
IDs that are available for DMA transfers. It does not take into account other
application uses of tag IDs. If a program also initiates DMA transfers, which
require separate tag IDs, the programmer must limit the number of tag IDs that
are used by the software cache by explicitly configuring a range of tag IDs that
the software cache can use.
Safe mode and synchronous interface
The safe interfaces provide the programmer with a set of functions to access
data simply by using the data’s effective address. The software cache library
performs the data transfer between the LS and the main memory transparently to
the programmer and manages the data that is already in the LS.
One advantage of using this method is that the programmer does not need to
worry about the LS addresses and can use effective addresses as any other
PPE program does. From a programmer’s point of view, this method is quite
simple.
Data access function calls with this mode are done synchronously and are
performed according to the following guidelines:
If data is already in the LS (in the cache), a simple load from the LS is
performed.
If data is not currently in the LS, the software cache function performs DMA
between the LS and the main storage. The program is blocked until the DMA
is completed.
Unsafe mode and asynchronous interface
In this mode, as in safe mode, the software cache keeps track of the data that is
already in the LS and performs a data transfer between the LS and the main
storage only if the data is not currently in the LS.
One advantage of using this method is that the programmer can have the
software cache asynchronously prefetch the data by “touching” it. The
programmer can implement a double-buffering-like data transfer. In doing so, the
software cache starts transferring the next data to be processed while the
program continues performing computation on the current data.
Source code: The code in Example 4-25 on page 153 through Example 4-27
on page 155 are included in the additional material for this book. See “SPU
software managed cache” on page 620 for more information.
Next in the code, following the definitions, comes the cache header file, which is
located at /opt/cell/sdk/usr/spu/include/cache-api.h. Multiple caches can be
defined in the same program by redefining the attributes and re-including the
cache-api.h header file. The only restriction is that CACHE_NAME must be
different for each cache.
Example 4-25 on page 153 shows code for constructing the software cache. It
shows the following actions:
Constructing a software cache named MY_CACHE and defining both its
mandatory and optional attributes
Reserving the tag ID to be used by the software cache
Example 4-25 Constructing the software cache
unsigned int tag_base; // should be defined before the cache

// Mandatory attributes
#define CACHE_NAME MY_CACHE // name of the cache
#define CACHED_TYPE int // type of basic element in cache

// Optional attributes
#define CACHE_TYPE CACHE_TYPE_RW // rw type of cache
#define CACHELINE_LOG2SIZE 7 // 2^7 = 128 bytes cache line
#define CACHE_LOG2NWAY 2 // 2^2 = 4-way cache
#define CACHE_LOG2NSETS 4 // 2^4 = 16 sets
#define CACHE_SET_TAGID(set) (tag_base + (set & 7)) // use 8 tag IDs
#define CACHE_STATS // collect statistics
#include <cache-api.h>
When using this mode, only the effective addresses are used to access the main
memory data. There is no need for the programmer to be aware of the LS
address to which the data was transferred (that is, by the software cache).
Example 4-26 on page 154 shows the code for the following actions:
Using the safe mode to perform synchronous data access
Flushing the cache so that the modified data is written into main memory
int a, b;
unsigned eaddr_a, eaddr_b;
...
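Example 4-26 is abbreviated above; the following is a minimal sketch of the
safe-mode calls, assuming the cache_rd, cache_wr, and cache_flush macros of
cache-api.h and the MY_CACHE cache of Example 4-25 (the function and
variable names are illustrative):

#include <cache-api.h>   /* included after the #defines of Example 4-25 */

void update_sum(unsigned eaddr_a, unsigned eaddr_b)
{
    int a, b;

    a = cache_rd(MY_CACHE, eaddr_a);     /* DMA-in only on a cache miss */
    b = cache_rd(MY_CACHE, eaddr_b);
    cache_wr(MY_CACHE, eaddr_b, a + b);  /* modifies the cached copy    */
    cache_flush(MY_CACHE);               /* write modified data back    */
}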
When using this mode, the software cache maps the effective address of the
data in the main memory into the local storage. The programmer should use the
mapped local addresses later to use the data.
Example 4-27 shows the code for performing the following actions:
Using unsafe mode to perform synchronous data access
Touching a variable so that the cache starts asynchronous prefetching of the
variable from the main memory to the local storage
Waiting until the prefetched data is present in the LS before modifying it
Flushing the cache so that the modified data is written into main memory
Printing the software cache statistics
mfc_multi_tag_release(tag_base, 8);
return (0);
}
156 Programming the Cell Broadband Engine Architecture: Examples and Best Practices
- An indirect mechanism: A program first reads index vectors from the main
  storage, and those vectors contain the location of the next blocks of data that
  need to be processed. Only after computing the results of the current
  iteration does the program know which blocks to read next.
If the application also has a high cache hit rate, the software cache can
provide better performance compared to other techniques. Such a high rate can
occur if blocks that are read in one iteration are likely to be used in subsequent
iterations. It can also occur when blocks that are read in one iteration are close
enough to the blocks of the previous iteration to fall within the same software
cache line.
A high cache hit rate ensures that, in most cases, when a structure is accessed
by the program, the corresponding data is already in the LS. Therefore, the
software cache can take the data from the LS instead of transferring it again from
main memory.
If the hit rate is significantly high, performing synchronous data access by using
safe mode provides good performance, since waiting for data transfer completion
does not occur often. Even so, in most cases the programmer might try
asynchronous data access in the unsafe mode and measure whether a
performance improvement is achieved.
To use this mechanism, the programmer must use the __ea address space
identifier when declaring a variable to indicate to the SPU compiler that a
memory reference is in the remote (or effective) address space, rather than in the
local storage. The compiler automatically generates code to access these data
objects by DMA into the local storage and caches references to these data
objects.
Accessing an __ea variable from an SPU program creates a copy of this value in
the local storage of the SPU. Subsequent modifications to the value in main
storage are not automatically reflected in the copy of the value in local storage.
The programmer is responsible for ensuring data coherence for __ea variables
that are accessed by both SPE and PPE programs.
After the pointer is initiated, it can be used as any SPU pointer while the software
cache maps it into DMA memory access, for example:

for (i = 0; i < size; i++) {
    *ppe_pointer++ = i;   // memory accesses use software cache
}
Another case is pointers in the LS space of the SPE that can be cast to pointers
in the main storage address space. This action transforms an LS address into an
equivalent address in the main storage because the LS is also mapped to the
main storage domain. Consider the following example:
int x;
__ea int *ppe_pointer_to_x = &x;
The PPE can then access this variable (from the main storage) while the SPE
accesses it (from the LS domain). Similarly, this pointer can be used to transfer
data between the LS of one SPE and the LS of another SPE.
GCC for the SPU provides the command line options shown in Table 4-6 to
control the runtime behavior of programs that use the __ea extension. Many of
these options specify parameters for the software-managed cache. In
combination, these options cause GCC to link the program to a single
software-managed cache library that satisfies those options. Table 4-6 describes
these options.
Table 4-6 GCC options for supporting main storage access from the SPE

-matomic-updates
    Uses DMA atomic updates when flushing a cache line back to PPU memory.
    This is the default.
You can find a complete example of using the __ea qualifiers to implement a
quick sort algorithm on the SPU accessing PPE memory in the SDK
/opt/cell/sdk/src/examples/ppe_address_space directory.
This sequence is not efficient because it wastes a lot of time waiting for the
completion of the DMA transfer and has no overlap between the data transfer
and the computation. Figure 4-2 illustrates the time graph for this scenario.
Double buffering
We can significantly accelerate the previous process by allocating two buffers, B0
and B1, and overlapping computation on one buffer with data transfer in the
other. This technique is called double buffering, which is illustrated by the flow
diagram scheme in Figure 4-3.
Double buffering is a special case of multibuffering, which extends this idea by
using multiple buffers in a circular queue instead of only the two buffers of double
buffering. In most cases, the two buffers of the double-buffering scheme are
enough to guarantee overlap between the computation and the data transfer.
However, if the software must still wait for completion of the data transfer, the
programmer might consider extending the number of buffers and moving to a
multibuffering scheme. Obviously, this requires more memory in the LS, which
might be a problem in some cases. Refer to “Multibuffering” on page 166 for
more information about the multibuffering technique.
In the code that follows, we show an example of the double buffering mechanism.
Example 4-29 is the header file, which is common to the SPE and PPE side.
Source code: The code in Example 4-29 through Example 4-31 on page 164
is included in the additional material for this book. See “Double buffering” on
page 620 for more information.
The code demonstrates the use of the barrier on the SPE side to ensure that all
the output data that SPE updates in memory is written into memory before the
PPE tries to read it.
typedef struct {
uint32_t *in_data;
uint32_t *out_data;
uint32_t *status;
int size;
} parm_context;
uint32_t tag_id[2];
tag_id[0] = mfc_tag_reserve();
tag_id[1] = mfc_tag_reserve();
// Init parameters
in_data = ctx.in_data;
out_data = ctx.out_data;
left = ctx.size;
cnt = (left<ELEM_PER_BLOCK) ? left : ELEM_PER_BLOCK;
while (cnt < left) {
left -= cnt;   // update the number of elements left to process
mfc_getb((void*)(&ls_in_data[nxt_buf][0]),
(uint32_t)(nxt_in_data) , nxt_cnt*sizeof(uint32_t),
tag_id[nxt_buf], 0, 0);
buf = nxt_buf;
cnt = nxt_cnt;
}
// process_buffer
for (i=0; i<ELEM_PER_BLOCK; i++){
ls_out_data[buf][i] = ~(ls_in_data[buf][i]);
}
// Put the output buffer back into main storage, with a barrier to ensure
// all data is written to memory before the status is updated
// ... (DMA calls elided; see the additional material)
mfc_tag_release(tag_id[0]);
mfc_tag_release(tag_id[1]);
return (0);
}
#include "common.h"
status = STATUS_NO_DONE;
// Init input buffer and zero output buffer
for (i=0; i<NUM_OF_ELEM; i++){
in_data[i] = i;
out_data[i] = 0;
}
ctx.in_data = in_data;
ctx.out_data = out_data;
ctx.size = NUM_OF_ELEM;
ctx.status = &status;
data.argp = &ctx;
// (the entire source code for this example is part of the book’s
// additional material).
return 0;
}
This algorithm waits for and processes each Bi in round-robin order, regardless
of when the transfers complete with respect to one another. In this regard, the
algorithm uses a strongly ordered transfer model. Strongly ordered transfers are
useful when the data must be processed in a known order, which happens in
many streaming model applications.
The huge page support in the SDK aims to reduce the latency of the address
translation mechanism on the SPEs. This mechanism is implemented by using
256-entry translation look-aside buffers (TLBs) that reside on the SPEs and
store the address translation information. The operating system on the PPE is
responsible for managing the buffers.
The following process runs whenever the SPE tries to access data on the main
storage:
1. The SPU code initiates the MFC DMA command for accessing data on the
main storage and provides the effective address of the data in the main
storage.
2. The SPE synergistic memory management (SMM) checks whether the
effective address falls within one of the TLB entries:27
– If it exists (page hit), use this entry to translate to the real address and exit
the translation process.
– If it does not exist (a page miss), continue to step 3.
3. The SPU halts program execution and generates an external interrupt to the
PPE.
4. The operating system on the PPE allocates the page and writes the required
information to the TLB of this particular SPE by using memory access to the
problem state of this SPE.
5. The PPE signals to the SPE that translation is complete.
6. The MFC starts transferring the data, and the SPU code continues running.
This mechanism causes the SPU program to halt until the translation process is
complete, which can take a significant amount of time. This may not be efficient if
the process repeats itself many times during the program execution. However,
the process occurs only the first time a page is accessed, unless the translation
information in the TLB is replaced with information from other pages that are
accessed later.
Therefore, using huge pages can significantly improve the performance in cases
where the application operates on large data sets. In such cases, using huge
pages can significantly reduce the amount of time in which this process occurs
(only once for each page).
The SDK supports the huge TLB file system, which allows the programmer to
reserve 16 MB huge pages of pinned, contiguous memory. For example, if 50
pages are configured, they provide 800 MB of pinned, contiguous memory. In the
worst case, where each SPE accesses the entire memory range, a TLB miss
occurs only once for each of the 50 pages, since the TLB has enough room to
store all of those pages. For comparison, the size of ordinary pages on the
operating system that runs on the Cell/B.E. system is either 4 KB or 64 KB.
27. The SMM unit is responsible for address translation in the SPE.
To configure huge pages, a root user must execute a set of commands. The
commands can be executed at any time and create memory mapped files in the
/huge/ path that stores the huge pages content.
The first part of Example 4-32 on page 169 shows the commands that are
required to set 20 huge pages that provide 320 MB of memory. The last four
commands in this part, the groupadd, usermod, chgrp, and chmod commands,
provide permission to the user of the huge pages files. Without executing the
commands, only the root user can access the files later and use the huge pages.
The second part of this example demonstrates how to verify whether the huge
pages were successfully allocated.
Some programmers might use huge pages while also using NUMA to restrict
memory allocation to a specific node as described in 4.3.9, “Improving memory
access using NUMA” on page 171. The number of available huge pages for the
specific node in this case is half of what is reported in /proc/meminfo. This is
because, on Cell/B.E.-based blade systems, the huge pages are equally
distributed across both memory nodes.
Example 4-32 Configuring huge pages
> Part 1: Configuring huge pages:
mkdir -p /huge
echo 20 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs nodev /huge
groupadd hugetlb
usermod -a -G hugetlb <user>
chgrp -R hugetlb /huge
chmod -R g+w /huge
cat /proc/meminfo
After the huge pages are configured, any application can allocate data on the
corresponding memory mapped file. The programmer explicitly invokes mmap of a
/huge file of the specified size.
Example 4-33 on page 170 shows a code example which opens a huge page file
using the open function and allocates 32 MB of private huge paged memory
using mmap function (32 MB indicated by the 0x2000000 parameter of the mmap
function).
Source code: The code in Example 4-33 on page 170 is included in the
additional material for this book. See “Huge pages” on page 620 for more
information.
The mmap function: The mmap function succeeds even if there are insufficient
huge pages to satisfy the request. Upon first access to a page that cannot be
backed by a huge TLB file system, the application process is terminated and
the message “killed” is shown. Therefore, the programmer must ensure that
the number of huge pages that are requested does not exceed the number of
huge pages that are available.
// now we can use ‘ptr’ effective addr. pointer to store our data
// for example forward to the SPEs to use it
return (0);
}
28. See http://sourceforge.net/projects/libhugetlbfs
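Since Example 4-33 is abbreviated above, the following is a minimal sketch of
the open and mmap calls that it describes (the file name under /huge is
illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

char *alloc_huge(void)
{
    int   fmem;
    char *ptr;

    /* create a memory mapped file on the hugetlbfs mount point */
    if ((fmem = open("/huge/myfile.bin", O_CREAT | O_RDWR, 0755)) == -1) {
        perror("ERROR: can't open huge page file");
        exit(1);
    }
    /* 0x2000000 = 32 MB of private huge-paged memory */
    ptr = mmap(0, 0x2000000, PROT_READ | PROT_WRITE, MAP_PRIVATE, fmem, 0);
    if (ptr == MAP_FAILED) {
        perror("ERROR: can't mmap huge page file");
        close(fmem);
        exit(1);
    }
    return ptr;
}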
4.3.9 Improving memory access using NUMA
The first two Cell/B.E.-based blade systems, the BladeCenter QS20 and
BladeCenter QS21 are both NUMA systems, which consist of two Cell/B.E.
processors, each with its own system memory. The two processors are
interconnected through a FlexIO interface by using the fully coherent Broadband
Interface (BIF) protocol.
Linux provides a NUMA API to address this issue29 and to enable allocating
memory on a specific node. When doing so, the programmer can use the NUMA
API in either of the following ways:

- Allocating memory on the same processor on which the current thread runs
- Guaranteeing that this thread keeps running on a specific processor (node
  affinity)
29. Refer to the white paper A NUMA API for LINUX* on the Web at the following address:
http://www.novell.com/collateral/4621437/4621437.pdf
After NUMA is configured and the application completes its execution, the
programmer can use the NUMA numastat command to retrieve statistics
regarding the status of the NUMA allocation and data access on each of the
nodes. This information can be used to estimate the effectiveness of the current
NUMA configuration.
To use the NUMA API, the programmer must perform the following tasks:
Include the numa.h header file in the source code.
Add the -lnuma flag to the compilation command in order to link the library to
the application.
Additional information is available in the man pages of this library and can be
retrieved by using the man numa command.
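A minimal sketch of the libnuma calls that the following example relies on,
binding the calling thread and its memory allocations to node 0 (the nodemask_t
usage matches Example 4-34; the function name is illustrative):

#include <numa.h>   /* link with -lnuma */

void bind_to_node0(void)
{
    nodemask_t mask0;

    nodemask_zero(&mask0);
    nodemask_set(&mask0, 0);   /* node 0 = one Cell/B.E. processor  */
    numa_bind(&mask0);         /* CPU and memory affinity to node 0 */
}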
Example 4-34 on page 174 shows a suggested method for using NUMA and a
corresponding PPU code. The example is inspired by the matrix multiply
demonstration of the SDK in the /opt/cell/sdk/src//demos/matrix_mul directory.
Node: NUMA terminology uses the term node in the following example to refer
to one Cell/B.E. processor. Two are present on a Cell/B.E.-based blade.
The following principles are behind the NUMA example that we present:
1. Use the NUMA API to allocate two continuous memory regions, one on each
of the node’s memories.
2. The allocation is done by using huge pages to minimize the page miss of the
SPE. Notice that the huge pages are equally distributed across both memory
nodes on a Cell/B.E.-based blade system. Refer to 4.3.8, “Improving the page
hit ratio by using huge pages” on page 166, which further discusses huge
pages.
3. Duplicate the input data structures (matrix and vector, in this case) by
initiating two different copies, one on each of the regions that were allocated
in step 1.
4. Use NUMA to split the SPE threads so that each half of the threads is initiated
and runs on a separate node.
5. The threads that run on node number 0 are assigned to work on the memory
region that was allocated on this node, and node 1’s threads are assigned to
work on node 1’s memory region.
In this example, we needed to duplicate the input data since the entire input
matrix is needed for the threads. While this is not the optimal solution, in many
other applications, this duplication is not necessary. The input data can be
divided between the two nodes. For example, when adding two matrixes, one
half of those matrixes can be located on one node’s memory and the second half
can be loaded on the other node’s memory.
#define MAX_SPUS 16
#define HUGE_PAGE_SIZE (size_t)(16*1024*1024)
// main===============================================================
int main(int argc, char *argv[])
{
int i, nodes, phys_spus, spus;
unsigned int offset0, offset1;
nodemask_t mask0, mask1;
// allocate inout buffers - mem_addr0 on node 0, mem_addr1 on node 1
mem_addr0 = allocate_buffer(offset0, &mask0);
mem_addr1 = allocate_buffer(offset1, &mask1);
if (i < spus/2) {
// lower half of the SPE threads uses input buffer of node 0
threads[i].input_buffer = mem_addr0;
}else{
// similarly - second half use buffer of node 1
threads[i].input_buffer = mem_addr1;
numa_bind(&mask1);
}
// allocate_buffer====================================================
// allocate a cacheline aligned memory buffer from huge pages or the
// heap
char * allocate_buffer(size_t size, nodemask_t *mask)
{
char *addr;
int fmem = -1;
size_t huge_size;
if (addr==MAP_FAILED) {
printf("ERROR: unable to mmap file (errno=%d).\n", errno);
close (fmem); exit(1);
}
For example, the following command invokes a program that allocates all
processors on node 0 with a preferred memory allocation on node 0:
numactl --cpunodebind=0 --preferred=0 ./matrix_mul
The following command is a shorter version that performs the same action:
numactl -c 0 -m 0 ./matrix_mul
To read the man pages of this command, run the man numactl command.
One advantage of using this method is that there is no need to recompile the
program to run with different NUMA configuration settings. On the other hand,
the command utility offers less flexibility than calling the API of the libnuma
library from the program itself.
Controlling the NUMA policy by using the command utility is usually sufficient in
cases where all SPU threads can run on a single Cell/B.E. processor. If more
than one processor is required (usually because more than eight threads are
necessary), and the application requires dynamic allocation of data, it is usually
difficult to use only the command utility. Using the libnuma library API from the
program is more appropriate and allows greater flexibility in this case.
Refer to 4.2, “Storage domains, channels, and MMIO interfaces” on page 97, in
which we describe the MFC interfaces and the different programming methods
by which programs can interact with the MFC. In this section, we use only the
MFC functions method to interact with the MFC.
Both mailboxes and signals are mechanisms that can be used for program
control and for sending short messages between the different processors. While
these mechanisms have a lot in common, there are differences between them. In
general, a mailbox implements a queue for sending separate 32-bit messages,
while signaling is similar to interrupts, which can be accumulated when being
written and reset when being read. Table 4-7 compares the two mechanisms.
Direction: mailboxes have one inbound and two outbound; signals have two
inbound (toward the SPE).
Unique SPU commands: mailboxes have none (programs use channel reads
and writes); signals have sndsig, sndsigf, and sndsigb, which enable an SPU
to send signals to another SPE.
4.4.1 Mailboxes
In this section, we discuss the mailbox mechanism, which easily enables the
sending of 32-bit messages between the different processors on the chip (PPE
and SPEs). We discuss the following topics:
In “Mailbox overview” on page 180, we provide an overview about the
mechanism and its hardware implementation.
In “Programming interface for accessing mailboxes” on page 182, we
describe the main software interfaces for an SPU or PPU program to use the
mailbox mechanism.
In “Blocking versus nonblocking access to the mailboxes” on page 183, we
explain how the programmer can implement either blocking or nonblocking
access to the mailbox on either an SPU or PPU program.
Monitoring the mailbox status can be done asynchronously by using events that
are generated whenever a new mailbox is written or read by external source, for
example, a PPE or other SPE. Refer to 4.4.3, “SPE events” on page 203, in
which we discuss the events mechanism in general.
Mailbox overview
The mailbox mechanism is easy to use and enables the software to exchange
32-bit messages between the local SPU and the PPE or local SPU and other
SPEs. The term local SPU refers to the SPU of the same SPE where the mailbox
is located. The mailboxes are accessed from the local SPU by using the channel
interface and from the PPE or other SPEs by using the MMIO interface.
Mailbox access: Local SPU access to the mailbox is internal to an SPE and
has small latency (less than or equal to six cycles for nonblocking access).
Alternatively, PPE or other SPE access to the mailbox is done through the
local memory EIB. The result is larger latency and overloading of the bus
bandwidth, especially when polling to wait for the mailbox to become available.
The MFC of each SPE contains three mailboxes divided into two categories:
Outbound mailboxes
Two mailboxes are used to send messages from the local SPE to the PPE or
other SPEs:
– SPU Write Outbound mailbox (SPU_WrOutMbox)
– SPU Write Outbound Interrupt mailbox (SPU_WrOutIntrMbox)
Inbound mailbox
One mailbox, SPU Read Inbound mailbox (SPU_RdInMbox), is used to send
messages to the local SPE from the PPE or other SPEs.
Table 4-8 summarizes the main attributes of the mailboxes and the differences
between outbound and inbound mailboxes. It also describes the differences
between accessing the mailboxes from the SPU programs and accessing them
from the PPU and other SPE programs.
Direction
  Inbound: messages from the PPE or other SPEs to the local SPE.
  Outbound: messages from the local SPE to the PPE or other SPEs.

Number of mailboxes
  Inbound: 1. Outbound: 2.

Number of entries
  Inbound: 4. Outbound: 1.

Counter(a)
  Inbound: counts the number of valid entries; decremented when the SPU
  program reads from the mailbox, incremented when the PPU program writes
  to the mailbox.(b)
  Outbound: counts the number of empty entries; decremented when the SPU
  program writes to the mailbox, incremented when the PPU program reads
  from the mailbox.(b)

Buffer
  Both are first-in-first-out (FIFO) queues; the SPU program reads the oldest
  data first.

Overrun
  Inbound: a PPU program that writes new data when the buffer is full overruns
  the last entry in the FIFO.(b)
  Outbound: an SPU program that writes new data when the buffer is full blocks
  until space is available in the buffer, for example, until the PPE reads from
  the mailbox.(b)

Blocking
  Inbound: the SPU program blocks when trying to read an empty buffer and
  continues only when there is a valid entry, for example, after the PPE writes
  to the mailbox.(b) The PPU program never blocks:(b) writing to the mailbox
  when it is full overrides the last entry, and the PPU immediately continues.
  Outbound: the SPU program blocks when trying to write to a full buffer and
  continues only when there is an empty entry, for example, after the PPE
  reads from the mailbox.(b) The PPU program never blocks:(b) reading from
  the mailbox when it is empty returns invalid data, and the PPU program
  immediately continues.

a. This per-mailbox counter can be read by a local SPU program by using a separate channel or by
the PPU or other SPUs program using a separate MMIO register.
b. Or another SPE that accesses the mailbox of the local SPE.
For more information about the spu_mfcio.h functions, see the “SPU mailboxes”
chapter in the C/C++ Language Extensions for Cell Broadband Engine
Architecture document.30 For information about the libspe2.h functions, see the
“SPE mailbox functions” chapter in the SPE Runtime Management Library
document.31
Table 4-9 summarizes the simple functions in these files for accessing the
mailboxes from a local SPU program or from a PPU program. In addition to the
values of the mailbox messages, the counter that is mentioned in Table 4-8 can
also be read by the software by using the SPU *_stat_* functions or the PPU
*_status functions.
SPU function (channel interface) / Blocking | PPE function (MMIO interface) / Blocking
spu_stat_in_mbox / No | spe_in_mbox_status / No
a. A user parameter to this function determines whether the function is blocking or nonblocking.
30. See note 2 on page 80.
31. See note 7 on page 84.
To access a mailbox of the local SPU from another SPU, the following actions
are required:
1. The PPU code maps the controls area of the SPU to main storage by using
the libspe2.h file’s spe_ps_area_get function with the SPE_CONTROL_AREA flag
set.
2. The PPE forwards the control area base address of the SPU to another SPU.
3. The other SPU uses ordinary DMA transfers to access the mailbox. The
effective address should be control area base, plus offset to a specific mailbox
register.
For the SPU, the instructions to access the mailbox are blocking by nature and
stall when the mailbox is not available (empty for read or full for write). The
SDK functions simply wrap these instructions.
For the PPU, the instructions to access the mailbox are nonblocking by nature.
The SDK functions provide a software abstraction of blocking behavior functions
for some of the mailboxes, which is implemented by polling the mailbox counter
until an entry is available.
The programmer can explicitly read the mailbox status (the counter mentioned in
Table 4-8 on page 181) by calling the *_stat_* functions for the SPU program
and the *_status functions for the PPU program.
SPU, In mailbox
  Blocking: reads the mailbox by using the spu_read_in_mbox function.
  Nonblocking: before reading the mailbox, polls the counter by using the
  spu_stat_in_mbox function until the FIFO is not empty.

SPU, Out mailbox
  Blocking: writes to the mailbox by using the spu_write_out_mbox function.
  Nonblocking: before writing to the mailbox, polls the counter by using the
  spu_stat_out_mbox function until the FIFO is not full.

SPU, OutIntr mailbox
  Blocking: writes to the mailbox by using the spu_write_out_intr_mbox
  function.
  Nonblocking: before writing to the mailbox, polls the counter by using the
  spu_stat_out_intr_mbox function until the FIFO is not full.
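A minimal sketch of the nonblocking pattern on the SPU side (the work done
while polling is application specific; the function name is illustrative):

#include <spu_mfcio.h>

uint32_t read_mbox_nonblocking(void)
{
    /* poll the counter; spu_read_in_mbox would otherwise stall */
    while (spu_stat_in_mbox() == 0) {
        /* no valid entry yet - do other useful work here */
    }
    return spu_read_in_mbox();   /* guaranteed not to block now */
}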
In the following sections, we describe different scenarios for using the mailbox
mechanism and provide a mailbox code example.
Notifying the PPE about data transfer completion by using a mailbox
A mailbox can be useful when an SPE must notify the PPE about completion of
transferring data that was previously computed by the SPE to the main memory.
To implement step 4, the PPU might need to poll the mailbox status to see
whether there is valid data in the mailbox. Such polling is not efficient because it
causes overhead on the bus bandwidth, which can affect other data transfers on
the bus, such as SPEs reading from main memory.
Alternative: An SPU can notify the PPU that it has completed a computation
by using a fenced DMA to write a notification to an address in main storage.
The PPU can poll this memory area; the polling stays local to the PPE when
the data is in the L2 cache, which minimizes the overhead on the EIB and the
memory subsystem. Example 4-19 on page 129, Example 4-20 on
page 131, and Example 4-21 on page 136 provide code for such a
mechanism.
Because the operating system runs on the PPE, only the PPE originally is aware
of the effective addresses of different variables in the program, for example, when
the PPE dynamically allocates data buffers or when it maps the local storage or
problem state of the SPE to an effective address on the main storage. The
inbound mailboxes can be used to transfer those addresses to the SPE.
In the other direction, an SPE can use the outbound mailbox to notify the PPE
about a local storage offset of a buffer that is located on the local storage and can
be accessed later by either the PPE or another SPE. Refer to the following
section, “Code example for using mailboxes”, for a code example for such a
mechanism.
Source code: The code in Example 4-35, Example 4-36 on page 188, and
Example 4-39 on page 199 is included in the additional material for this book.
See “Simple mailbox” on page 620 for more information.
Example 4-35 PPU code of the simple mailbox example

// main===============================================================
int main()
{
int num, ack;
uint64_t ea;
char str[2][8] = {"0 is up","1 is down"};
// (the entire source code for this example is part of the book’s
// additional material)
// STEP 0: map SPEs’ MFC problem state to main storage (get EA)
for( num=0; num<2; num++){
if ((mfc_ctl[num] = (spe_spu_control_area_t*)spe_ps_area_get(
data[num].spe_ctx, SPE_CONTROL_AREA))==NULL){
perror ("Failed mapping MFC control area");exit (1);
}
}
// STEP 1: send each SPE its number using BLOCKING mailbox write
for( num=0; num<2; num++){
   // (blocking write of 'num' to this SPE's inbound mailbox; elided)
}
// STEP 2: send each SPE the EA of other SPE's MFC area and a string
// Use NON-BLOCKING mailbox write after first verifying
// availability of space.
for( num=0; num<2; num++){
ea = (uint64_t)mfc_ctl[(num==0)?1:0];
// (the entire source code for this example is part of the book’s
// additional material)
return (0);
}
Example 4-36 SPU code for accessing local mailboxes and another SPE’s mailbox

// add the ordinary SDK and C libraries header files...
#include "spu_mfcio_ext.h" // the file described in Example 4-39 on page 199
uint32_t my_num;
int main( )
{
uint32_t data[2],ret, mbx, ea_mfc_h, ea_mfc_l, tag_id;
uint64_t ea_mfc;
if ((tag_id= mfc_tag_reserve())==MFC_TAG_INVALID){
printf("SPE: ERROR can't allocate tag ID\n"); return -1;
}
// STEP 2: receive from PPE the EA of other SPE's MFC and string.
// use BLOCKING mailbox, but to avoid blocking we first read the
// status to check that we have 4 valid entries
while( spu_stat_in_mbox()<4 ){
// SPE can do other things meanwhile before checking the status again
}
return 0;
}
Monitoring the signal status can be done asynchronously by using events that
are generated whenever a new signal is sent by an external source, for example,
the PPE or another SPE. Refer to 4.4.3, “SPE events” on page 203, in which we
discuss the events mechanism in general.
Each SPE contains two identical signal notification registers: Signal Notification 1
(SPU_RdSigNotify1) and Signal Notification 2 (SPU_RdSigNotify2).
Unlike the mailboxes, the signal notification has only one direction and can send
information toward the SPU that resides in the same SPE as the signal registers
(and not vice versa). Programs can access the signals by using the following
interfaces:
A local SPU program reads the signal notification by using the channel
interface.
A PPU program signals an SPE by writing to its MMIO interface.
An SPU program signals another SPE by using special signaling commands,
which include sndsig, sndsigf, and sndsigb. (The “f” and “b” suffix in the
commands indicate “fence” and “barrier” respectively.) The commands are
implemented by using the DMA put commands and optionally contain
ordering information.
When the local SPU program reads a signal notification, the value of the signal’s
register is reset to 0. Reading the signal’s MMIO (or problem state) register from
the PPU or other SPUs does not reset its value.
When the PPU or other SPUs write to the signal registers, two different modes
can be configured:
OR mode (many-to-one)
The MFC accumulates several writes to the signal-notification register by
combining all the values that are written to this register by using a logical OR
operation. The register is reset when the SPU reads it.
Overwrite mode (one-to-one)
Writing a value to a signal-notification register overwrites the value in this
register. This mode is similar to using an inbound mailbox and has similar
performance.
Configuring the signaling mode can be done by the PPU when it creates the
corresponding SPE context.
OR mode: By using the OR mode, signal producers can send their signals at
any time and independently of other signal producers. (No synchronization is
needed.) When the SPU program reads the signal notification register, it
becomes aware of all the signals that have been sent since the most recent
read of the register.
Table 4-7 on page 179 summarizes the similarities and differences between the
signal notification and mailbox mechanism.
The spu_mfcio.h functions are described in the “SPU signal notification” chapter
in the C/C++ Language Extensions for Cell Broadband Engine Architecture
document.32 The libspe2.h functions are described in the “SPE SPU signal
notification functions” chapter in the SPE Runtime Management Library
document.33
32. See note 2 on page 80.
33. See note 7 on page 84.
To signal a local SPU from another SPU, the following actions are required:
1. The PPU code maps the signaling area of the SPU to the main storage by
using the spe_ps_area_get function of the libspe2.h file with the
SPE_SIG_NOTIFY_x_AREA flag set.
2. The PPE forwards the signaling area base address of the SPU to another
SPU.
3. The other SPU uses the mfc_sndsig function of the spu_mfcio.h file to access
the signals. The effective address should be the signal area base plus the
offset to a specific signal register, as sketched in the following example.
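A minimal sketch of step 3 follows. The SIG_NOTIFY_REG_OFFSET constant is our assumption for the offset of the signal-notification register within the mapped area; check the problem-state layout in the SDK headers of your installation:

#include <stdint.h>
#include <spu_mfcio.h>

#define SIG_NOTIFY_REG_OFFSET 12 // assumed offset within the mapped area

// send a 32-bit signal value to another SPE whose signal area starts
// at effective address ea_sig_area (forwarded to us by the PPE)
void send_signal(uint32_t value, uint64_t ea_sig_area, uint32_t tag_id)
{
   static volatile uint32_t buf[4] __attribute__((aligned(16)));
   uint64_t ea_reg = ea_sig_area + SIG_NOTIFY_REG_OFFSET;
   uint32_t slot = (ea_reg & 0xF) >> 2; // match the register's LS slot

   buf[slot] = value;
   mfc_sndsig((void*)&buf[slot], ea_reg, tag_id, 0, 0);
   mfc_write_tag_mask(1<<tag_id);
   mfc_read_tag_status_all(); // wait until the signal has been sent
}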
The programmer can take either the blocking or nonblocking approach when
reading the signals from the local SPU. The programming methods for either
approach are similar to the approach for mailboxes as explained in “Blocking
versus nonblocking access to the mailboxes” on page 183. However, setting
signals from the PPU program or other SPUs is always nonblocking.
In regard to blocking behavior, a local SPU read of the signal register blocks
when no bits are set. To avoid blocking, the program can first read the counter.
Signaling another SPU from the PPE or other SPEs is always nonblocking.
When using the OR mode, the PPE or other SPEs usually do not need to poll the
signals counter because the events are accumulated. Otherwise (overwrite
mode) the signals have a similar behavior to inbound mailboxes.
In the following sections, we describe two different scenarios for using the signal
notification mechanism. We also provide a signals code example.
Example 4-39 on page 199 shows the SPU code that contains functions that
access the mailbox and signals of another SPE. The code also contains
functions for writing to the other SPE’s mailbox and reading the mailbox status
by using DMA transactions.
Example 4-40 on page 202 shows the PPU and SPU printing macros for
tracing interprocessor communication, such as sending mailboxes and
signaling between the PPE and SPE and the SPE and SPE.
Source code: The code in Example 4-37 through Example 4-40 on page 202
is included in the additional material for this book. See “Simple signals” on
page 621 for more information.
Example 4-37 PPU code of the simple signals example

// main===============================================================
int main()
{
int num, ret[2],mbx[2];
uint32_t sig=0x80000000; // bit 31 indicates signal from PPE
uint64_t ea;
// (the entire source code for this example is part of the book’s
// additional material).
// STEP 1: send each SPE the EA of the other SPE's signals area
// first write to this SPE, so we know its mailbox has 4 empty entries
for( num=0; num<2; num++){
spe_in_mbox_write(data[num].spe_ctx, (uint32_t*)&num,1,
SPE_MBOX_ANY_NONBLOCKING);
ea = (uint64_t)ea_sig1[(num==0)?1:0];
spe_in_mbox_write(data[num].spe_ctx, (uint32_t*)&ea,2,
SPE_MBOX_ANY_NONBLOCKING);
// wait until 2 entries are free and then send the last 2 entries
while(spe_in_mbox_status(data[num].spe_ctx)<2);
ea = (uint64_t)ea_sig2[(num==0)?1:0];
spe_in_mbox_write(data[num].spe_ctx, (uint32_t*)&ea,2,
SPE_MBOX_ANY_NONBLOCKING);
}
ret[0]= spe_signal_write(data[0].spe_ctx, SPE_SIG_NOTIFY_REG_1,sig);
ret[1]= spe_signal_write(data[1].spe_ctx, SPE_SIG_NOTIFY_REG_2,sig);
if (ret[0]==-1 || ret[1]==-1){
perror ("Failed writing signal to SPEs"); exit (1);
}
// (the entire source code for this example is part of the book’s
// additional material).
return (0);
}
Example 4-38 SPU code for reading local signals and signaling other SPE
// add the ordinary SDK and C libraries header files...
#include "spu_mfcio_ext.h" // the code from Example 4-39 on page 199
#include <com_print.h> // the code from Example 4-40 on page 202
uint32_t num;
int main( )
{
uint32_t in_sig,out_sig,mbx,idx,i,ea_h,ea_l,tag_id;
uint64_t ea_sig[2];
if ((tag_id= mfc_tag_reserve())==MFC_TAG_INVALID){
printf("SPE: ERROR can't allocate tag ID\n"); return -1;
}
// STEP 2: receive from PPE EA of other SPE's signal area and string
while( spu_stat_in_mbox()<4 ); //wait till we have 4 entries
for (i=0;i<2;i++){
ea_h = spu_read_in_mbox(); // read EA higher 32 bits
ea_l = spu_read_in_mbox(); // read EA lower 32 bits
ea_sig[i] = mfc_hl2ea( ea_h, ea_l);
}
while(1){
in_sig = spu_read_signal2();
if (in_sig&0x80000000){ break; } // PPE signals us to stop
// STEP 6: blocking mailbox read from the PPE, so that we do not
// finish before the other SPE
mbx = spu_read_in_mbox(); prn_s_mbx_p2m(5,num,mbx);
mfc_tag_release(tag_id);
return 0;
}
Example 4-39 SPU code for accessing the mailbox and signals of another SPE
spu_mfcio_ext.h ======================================================
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
// (tail of a helper that reads a remote SPE's mailbox status register;
// the full function is in the additional material)
   idx = SPU_MBOX_STAT_OFFSET_SLOT;
   return status[idx];
}
// writing to a remote SPE’s inbound mailbox
inline int write_in_mbox(uint32_t data, uint64_t ea_mfc,
                         uint32_t tag_id)
{
   int status;
   uint64_t ea_in_mbox = ea_mfc + SPU_IN_MBOX_OFFSET;
   uint32_t mbx[4], idx;

   idx = SPU_IN_MBOX_OFFSET_SLOT;
   mbx[idx] = data;
   // (DMA put of mbx[idx] to ea_in_mbox and status handling elided;
   // see the additional material)
}

// writing to a remote SPE’s signal notification register (fragment)
   idx = SPU_SIG_NOTIFY_OFFSET_SLOT;
   msg[idx] = data;
   // (remainder of the function elided; see the additional material)
Example 4-40 PPU and SPU macros for tracing inter-processor communication
com_print.h ======================================================
// add the ordinary SDK and C libraries header files...
4.4.3 SPE events
In this section, we discuss the SPE events mechanism that enables code that
runs on the SPU to trace events that are external to the program execution. The
SDK package provides a software interface that also enables a PPE program to
trace events that occurred on the SPEs.
The main events that can be monitored fall into the following categories:
MFC DMA
These events are related to the DMA commands of the MFC. Specifically,
refer to Example 4-20 on page 131 which shows an example for handling the
MFC DMA list command stall-and-notify.
Mailbox or signals
This category refers to the external write or read to a mailbox or signal
notification registers.
The events are generated asynchronously to the program execution, but the
programmer can choose to monitor and respond to those events either
synchronously or asynchronously:
Synchronous monitoring
The program explicitly checks the events status in one of the following ways:
– Nonblocking
It polls for pending events by testing the event counts in a loop.
– Blocking
It reads the event status, which stalls when no events are pending.
Asynchronous monitoring
The program implements an event interrupt handler.
Intermediate approach
The program sprinkles bisled instructions, either manually or automatically
throughout the application code by using code-generation tools, so that they
are executed frequently enough to approximate asynchronous event
detection.
Four different 32-bit channels enable the SPU software to manage the events
mechanism. The channels have identical bit definitions, and each event is
represented by a single bit. The SPE software should use the following sequence
of actions to deal with SPE events (a minimal sketch follows the list):
1. Initialize event handling by writing to the “SPU Write Event Mask” channel,
setting the bits that correspond to the events that the program wants to monitor.
2. Monitor some events that are pending by using either a synchronous,
asynchronous, or intermediate approach as described previously.
3. Recognize which events are pending by reading from the “SPU Read Event
Status” channel and checking which bits are set.
4. Clear the events by writing a value to “SPU Write Event Acknowledge”, setting
the bits that correspond to the pending events in the written value.
5. Service the events by executing application-specific code to handle the
specific events that are pending.
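The following sketch implements this five-step sequence for a single event, the inbound-mailbox-available event (MFC_IN_MBOX_AVAILABLE_EVENT is defined in spu_mfcio.h; the function name is ours):

#include <spu_mfcio.h>

// wait for a message to arrive in the inbound mailbox by using events
uint32_t wait_for_mbox_event(void)
{
   uint32_t events;

   // step 1: enable only the event we want to monitor
   spu_write_event_mask(MFC_IN_MBOX_AVAILABLE_EVENT);

   // steps 2 and 3: blocking read of the event status; stalls until
   // at least one enabled event is pending
   events = spu_read_event_status();

   // step 4: acknowledge (clear) the pending events
   spu_write_event_ack(events);

   // step 5: service the event with application-specific code
   return spu_read_in_mbox(); // will not stall; the FIFO is not empty
}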
Similarly to the mailbox and signal notification mechanisms, each of the registers
maintains a counter that can be read by the SPU software. The only counter that
is usually relevant to the software is the one related to the “SPU Read Event
Status” channel. The software can read this channel to detect the number of
events that are pending. If the counter returns a value of 0, no enabled events are
pending. If the counter returns a value of 1, enabled events have been raised
since the last read of the status.
SPU Write Event Mask (SPU_WrEventMask), write:
To enable only the events that are relevant to its operation, the SPU program
can initialize a mask value with the event bits set to 1 only for the relevant events.
SPU Read Event Status (SPU_RdEventStat), read:
Reading this channel reports events that are both pending at the time of the
channel read and enabled, that is, the corresponding bit is set in “SPU Write
Event Mask”.
SPU Write Event Acknowledgment (SPU_WrEventAck), write:
Before an SPE program services the events reported in “SPU Read Event
Status”, it should write a value to “SPU Write Event Acknowledgment” to
acknowledge (clear) the events that will be processed. Each bit in the written
value acknowledges the corresponding event.
SPU Read Event Mask (SPU_RdEventMask), read:
This channel enables the software to read the value that was most recently
written to “SPU Write Event Mask”.
The spu_mfcio.h functions are described in the “SPU event” chapter in the
C/C++ Language Extensions for Cell Broadband Engine Architecture
document.34 The libspe2.h functions are described in the “SPE event handling”
chapter in the SPE Runtime Management Library document.35
Regarding blocking behavior, a local SPU that reads the events register when no
bits are set is blocking. To avoid blocking, the program can first read the counter.
34. See note 2 on page 80.
35. See note 7 on page 84.
Based on the event that is monitored, the events can be used for the following
scenarios:
DMA list dynamic updates
Monitor stall-notify-event to update the DMA list according to the data that
was transferred to the local storage from the main storage. See Example 4-20
on page 131 for this type of scenario.
Profiling or implementing a watchdog on an SPU program
Use the decrementer to periodically profile the program or implement a
watchdog about the program execution.
Example 4-41 on page 209 shows another scenario for using the SPE events, in
which code is included for implementing an event handler on the PPU.
36. See note 1 on page 78.
Refer to the following section, “PPU code example for implementing the SPE
events handler”, in which we suggest how to implement such an event handler on
the PPU.
There is a large latency between the generation of an SPE event and the
execution of the corresponding PPU event handler (which involves running kernel
functions). For this reason, this approach makes sense and provides good
performance results only when the delay between one command and the next is
large.
Example 4-41 on page 209 contains the corresponding PPU code that creates
and registers an event handler for monitoring whenever the inbound mailbox is
no longer full. Any time the mailbox is not full, which indicates that the SPU has
read a command from it, the PPU puts new commands in the mailbox.
The goal of this example is only to demonstrate how to implement a PPE handler
for SPE events and use the event of an SPE read from an inbound mailbox.
While supporting only this type of event is not always practical, it can be easily
extended to support a few different types of other events. For example, it can
support an event in which an SPE has stopped execution, a PPE-initiated DMA
operation has completed, or an SPE has written to the outbound mailbox. For
more information, see the method for making a callback to the PPE side of the
SPE thread (stop and signal mechanism) as described in the “PPE-assisted
library facilities” chapter in the SPE Runtime Management Library document.37
The SPU code is not shown, but generally it should include a simple loop that
reads an incoming message from the mailbox and processes it.
Source code: The code in Example 4-41 is included in the additional material
for this book. See “PPE event handler” on page 621 for more information.
Example 4-41 PPU code for implementing an SPE events handler

#define NUM_EVENTS 1
#define NUM_MBX 30
int main()
{
int i, ret, num_events, cnt;
spe_event_handler_ptr_t event_hand;
spe_event_unit_t event_uni, pend_events[NUM_EVENTS];
uint32_t mbx=1;
data.argp = NULL;
// load the program to the local stores, and run the SPE threads.
if (!(program = spe_image_open("spu/spu"))) {
37. See note 7 on page 84.
spe_event_handler_destroy(event_hand); //destroy event handle
return (0);
}
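A condensed sketch of the handler logic follows, assuming that spe_ctx is a context created with the SPE_EVENTS_ENABLE flag and reusing the NUM_MBX constant from the example (the function and variable names are ours):

#include <libspe2.h>

// write NUM_MBX commands, one whenever the mailbox is writable again
void feed_spe_mailbox( spe_context_ptr_t spe_ctx )
{
   spe_event_handler_ptr_t event_hand = spe_event_handler_create();
   spe_event_unit_t event_uni, pend_event;
   uint32_t mbx = 1;

   event_uni.events = SPE_EVENT_IN_MBOX; // mailbox no longer full
   event_uni.spe = spe_ctx;
   spe_event_handler_register(event_hand, &event_uni);

   while( mbx <= NUM_MBX ){
      // block until at least one registered event is pending
      if (spe_event_wait(event_hand, &pend_event, 1, -1) > 0 &&
          (pend_event.events & SPE_EVENT_IN_MBOX)){
         spe_in_mbox_write(spe_ctx, &mbx, 1, SPE_MBOX_ANY_NONBLOCKING);
         mbx++;
      }
   }
   spe_event_handler_destroy(event_hand);
}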
When all the SPEs and the PPE perform atomic operations on a cache line with
an identical effective address, a reservation for that cache line is present in at
least one of the MFC units. When this occurs, the cache snooping and update
processes are performed by transferring the cache line contents to the
requesting SPE or PPE over the EIB, without requiring a read or write to the main
system memory. Hardware support is essential for efficient atomic operations on
shared data structures that consist of up to 512 bytes, divided into four 128-byte
blocks mapped on 128-byte aligned data structures in the local storage of the
SPE.
The approach of exploiting this facility is to extend the principles behind the
handling of a mutex lock or an atomic addition and ensure that the operations
that are involved always affect the same four cache lines.
4. Upon the successful update of the cache line, the program can continue to
have both the previous structure values in the temporary buffer and the
modified values in the local storage mapped structure.
A fundamental difference between the behavior of the PPE and SPE in managing
atomic operations is worth noting. While both use the cache line size (128 bytes)
as the reservation granularity, the PPU instructions operate on a maximum of
4 bytes (__lwarx and __stwcx) or 8 bytes (__ldarx and __stdcx) at once. The
SPE atomic functions update the entire cache line contents.
For more details about how to use the atomic instructions on the SPE
(mfc_getllar and mfc_putllc) and on the PPE (__lwarx, __ldarx, __stwcx, and
__stdcx), see 4.5.2, “Atomic synchronization” on page 233.
The atomic cache in one of the MFC units or the PPE cache should always hold
the desired cache lines before another SPE or the PPE requests a reservation on
those lines. Provided that this occurs, the data refresh relies entirely on the
internal data bus, which offers high performance.
The libsync synchronization primitives also use the cache line reservation facility
in the SPE MFC. Therefore, special care is necessary to avoid conflicts that can
occur when simultaneously exploiting manual usage of the atomic unit and other
atomic operations provided by libsync.
Example 4-42 shows PPU code that initiates the shared structure, runs the SPE
threads, and when the threads complete, reads the shared variable. No atomic
access to this structure is done by the PPE.
Example 4-43 on page 215 shows SPU code that uses the atomic instructions to
synchronize access to the shared variables between the SPEs.
Example 4-42 PPU code for using the atomic unit and atomic cache
// add the ordinary SDK and C libraries header files...
// take ‘spu_data_t’ structure and ‘spu_pthread’ function from
// Example 4-5 on page 90
#define SPU_NUM 8
spu_data_t data[SPU_NUM];
typedef struct {
int processingStep; // contains the overall workload processing step
int exitSignal; // variable to signal end of processing step
uint64_t accumulatedTime[8]; // contains workload dynamic execution
// statistics (max. 8 SPE)
int accumulatedSteps[8];
char _dummyAlignment[24]; // dummy variables to set the structure
// size equal to cache line (128 bytes)
} SharedData_s;
SharedData.processingStep = 0;
// (the entire source code for this example is part of the book’s
// additional material).
return (0);
}
Example 4-43 SPU code for using the atomic unit and atomic cache
// add the ordinary SDK and C libraries header files...
do{
exitFlag = 0;
switch( SharedData.processingStep ){
case 0:
for( i = 0 ; i < (spuNum * 10) + 10 ; ++i ){
if( rand() <= 100 ){ //found the first result
exitFlag = 1;
break;
}
}
break;
case 1:
for( i = 0 ; i < (spuNum * 10) + 10 ; ++i ){
if( rand() <= 10 ){ // found the second result
exitFlag = 1;
break;
}
}
break;
}
// End performance profile information collection
t_spu = t_start - spu_read_decrementer();
// ...
// Because we have statistics on all the SPEs average workload
// time we can have some inter-SPE dynamic load balancing,
// especially for workloads that operate in pipelined fashion
// using multiple SPEs
do{
   // get and lock the cache line of the shared structure
   mfc_getllar((void*)&SharedData, SharedData_ea, 0, 0);
   (void)mfc_read_atomic_status();

   if( exitFlag ){
      SharedData.processingStep++;
      if(SharedData.processingStep > 1)
         SharedData.exitSignal = 1;
   }
   // (update of the accumulated statistics and the conditional
   // 'mfc_putllc' store with its status check are elided here;
   // see the additional material)
}while (status);
return 0;
}
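The core retry pattern of this example boils down to the following self-contained sketch, which atomically increments a counter in a shared 128-byte cache line (the type and function names are ours):

#include <stdint.h>
#include <spu_mfcio.h>

typedef struct {
   uint32_t counter;
   unsigned char _pad[124]; // fill the line up to 128 bytes
} shared_line_t;

static volatile shared_line_t line __attribute__((aligned(128)));

void atomic_increment(uint64_t ea_line)
{
   uint32_t status;
   do{
      // load the cache line and set a reservation on it
      mfc_getllar((void*)&line, ea_line, 0, 0);
      (void)mfc_read_atomic_status();

      line.counter++; // modify the local copy

      // conditional store; it fails if the reservation was lost
      mfc_putllc((void*)&line, ea_line, 0, 0);
      status = mfc_read_atomic_status() & MFC_PUTLLC_STATUS;
   }while (status); // retry until the conditional store succeeds
}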
When using this model, the programmer must explicitly order access to storage
by using special synchronization instructions whenever storage accesses must
occur in program order. If this is not done correctly, hard-to-find timing bugs can
result. For example, the program can run correctly on one system and fail on
another, or run correctly on one execution and fail on another on the same
system. Conversely, overuse of the synchronization instructions can significantly
reduce performance because they take a long time to complete.
In this section, we discuss the Cell/B.E. storage model and the software utilities
to control the data transfer ordering. For the reasons mentioned previously, it is
important to understand this topic in order to obtain efficient and correct results.
To learn more about this topic, see the “Shared-storage synchronization” chapter
in the Cell Broadband Engine Programming Handbook.38
38. See note 1 on page 78.
In 4.5.2, “Atomic synchronization” on page 233, we explain the instructions
that enable the different components on the Cell/B.E. chip to synchronize
atomic access to some shared data structures.
In 4.5.3, “Using the sync library facilities” on page 238, we describe the sync
library, which provides more high-level synchronization functions (based on
the instructions mentioned previously). The supported C functions closely
match those found in current traditional operating systems such as mutex,
atomic increment, and decrements of variables and conditional variables.
In 4.5.4, “Practical examples of using ordering and synchronization
mechanisms” on page 240, we describe specific real-life scenarios for using
the ordering and synchronization instructions in the previous sections.
Table 4-12 on page 220 summarizes the effects of the different ordering and
synchronization instructions on three storage domains: main storage, local
storage, and channel interface. It shows the effects of instructions that are issued
by different components: the PPU code, the SPU code, and the MFC. Regarding
the MFC, data transfers are executed by the MFC following commands that were
issued to the MFC by either the SPU code (using the channel interface) or PPU
code (using the MMIO interface).
Shaded table cells: The gray shading in a table cell indicates that the
instruction, command, or facility has no effect on the referenced domain.
(fragment of Table 4-12) lwsynce: accesses to memory-coherence-required locations
d. These accesses can exist only if the LS is mapped by the PPE operating system to the
main-storage space. This can only be done if the LS is assigned caching-inhibited and guarded
attributes.
e. The PowerPC sync instruction with L = 1.
To ensure that access to the shared storage is performed in program order, the
software must place memory-barrier instructions between storage accesses.
The term storage access refers to access to main storage that is caused by a
load, a store, a DMA read, or a DMA write. There are two orders to consider:
Order of instructions execution
The Cell/B.E. processor is an in-order machine, which means that, from a
programmer’s point of view, the instructions are executed in the order
specified by the program.
Order of shared-storage access
The order in which shared-storage access is performed might be different
from both the program order and the order in which the instructions that
caused the access are executed.
Storage barriers

__sync()
Known as the heavyweight sync. Ensures that all instructions that precede the
sync have completed before the sync instruction completes, and that no
subsequent instructions are initiated until after the sync instruction completes.
This does not mean that the previous storage accesses have completed before
the sync instruction completes.
Typical use: ensure that the results of all stores into a data structure, caused by
store instructions executed in a critical section of a program, are seen by other
processor elements before the data structure is seen as unlocked.

__lwsync()
Also known as lightweight sync. Creates the same barrier as the sync instruction
for storage accesses that are memory coherent. Therefore, unlike the sync
instruction, it orders only the PPE main-storage accesses and has no effect on
the main-storage accesses of other processor elements.
Typical use: when ordering is required only for coherent memory, because it
executes faster than sync.

__eieio()
Means “enforce in-order execution of I/O”. All main-storage accesses caused by
instructions preceding the eieio have completed, with respect to main storage,
before any main-storage access caused by instructions following the eieio. The
eieio instruction does not order accesses with differing storage attributes, for
example, if an eieio is placed between a caching-enabled store and a
caching-inhibited store.
Typical use: managing shared data structures, accessing memory-mapped I/O
(such as the SPE MMIO interface), and preventing load or store combining.

Instruction barriers

__isync()
Ensures that all PPE instructions preceding the isync are completed before the
isync completes. Causes an issue stall and blocks all other instructions from
both PPE threads until the isync instruction completes.
Typical use: in conjunction with self-modifying PPU code, execute isync after an
instruction is modified and before it is run. It can also be used during context
switching when the MMU translation rules are being changed.
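For example, the classic producer pattern on the PPU uses __lwsync so that a consumer can never observe the ready flag before the data it guards (a minimal sketch; the variable names are ours):

#include <stdint.h>
#include <ppu_intrinsics.h>

volatile uint32_t payload;
volatile uint32_t ready;

void publish(uint32_t value)
{
   payload = value; // store the data first
   __lwsync();      // both accesses are coherent system memory,
                    // so the lightweight sync is sufficient
   ready = 1;       // consumers now see payload before ready
}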
Table 4-14 summarizes the use of the storage barrier instructions for two
common types of main-storage memory:
System memory
The coherent main memory of the system. The XDR main memory falls into
this category, as does the local storage when it is accessed from the EIB
(that is, from the PPE or other SPEs).
Device memory
Memory that is caching-inhibited and guarded. In a Cell/B.E. system, it is
typical of memory-mapped I/O devices, such as the double data rate (DDR)
that is attached to the south bridge. Mapping of local storage of the SPEs to
the main storage is caching-inhibited but not guarded.
In Table 4-14, “Yes” (and “No”) mean that the instruction performs (or does not
perform) a barrier function on the related storage sequence. “Rec.” (for
“recommended”) means that the instruction is the preferred one. “Not rec.”
means that the instruction will work but is not the preferred one. “Not req.” (for
“not required”) and “No effect” mean the instruction has no effect.
Table 4-14 Storage-barrier ordering of accesses to system memory and device memory
(only the store-barrier-store row survives in this excerpt)

Storage-access sequence: store-barrier-store
System memory: sync = Yes; lwsync = Rec.; eieio = Not rec.
Device memory: sync = Not req.a; lwsync = No effect; eieio = Not req.a

a. Two stores to caching-inhibited storage are performed in the order specified by
the program, regardless of whether they are separated by a barrier instruction.
With regard to an SPU, the Cell/B.E. in-order execution model guarantees only
that SPU instructions that access the LS of that SPU appear to be performed in
program order with respect to that SPU. However, it is not necessarily with
respect to external accesses to that LS or with respect to the instruction fetch of
the SPU.
Therefore, from an architecture point of view, an SPE might write data to the LS
and immediately generate an MFC put command that reads this data (and
transfers it to the main storage). In this case, without synchronization
instructions, it is not guaranteed that the MFC will read the latest data, since it is
not guaranteed that the MFC read of the data is performed after the SPU write of
the data. From a practical point of view, there is no need to add a
synchronization command to guarantee this ordering: executing the six
commands for issuing the DMA always takes longer than executing the earlier
write to the LS.
Consider a case where the LS and MFC resources of some SPEs that are
mapped to the system-storage address space are accessed by software running
on the PPE or other SPEs. In this case, there is no guarantee that two accesses
to two different resources are ordered, unless a synchronization command, such
as eieio or sync, is explicitly executed by the PPE or other SPEs, as explained in
“PPE ordering instructions” on page 221.
In the following descriptions, we use the terms SPU load and SPU store to
describe the access by the same SPU that executes the synchronization
instruction. Several practical examples for using the SPU ordering instructions
are discussed in the “Synchronization and ordering” chapter of the Synergistic
Processor Unit Instruction Set Architecture Version 1.2 document.39 Specifically,
39. You can find the Synergistic Processor Unit Instruction Set Architecture Version 1.2 document on the Web at:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44
refer to the subtopic “External local storage access” in that chapter. It shows how
the instructions can be used when the processor, which is external to the SPE
(for example, the PPE), accesses the LS, for example, to write data to the LS and
later notifies the code that runs on the associated SPU that the writing of the data
is completed by writing to another address in this same LS.
The SPU instruction set provides three synchronization instructions. The easiest
way to use the instructions is through intrinsics, in which case the programmer
must include the spu_intrinsics.h header file. Table 4-15 briefly describes the
intrinsics and their main usage.
spu_sync
The sync (synchronize) instruction causes the SPU to wait until all pending load,
store, and channel accesses to the LS have completed before fetching the next
instruction.
Typical use: most often used in conjunction with self-modifying SPU code. It must
be used before attempting to execute new code that either arrives through DMA
transfers or is written with store instructions.

spu_dsync
The dsync (synchronize data) instruction ensures that data has been stored in
the LS before the data becomes visible to the local MFC or other external
devices.
Typical use: architecturally, Cell/B.E. DMA transfers may interfere with store
instructions and the store buffer, so the dsync instruction is meant to ensure that
all store buffers are flushed to the LS; that is, all previous stores to the LS will be
seen by subsequent LS accesses. However, the current Cell/B.E.
implementation does not require the dsync instruction for this purpose because it
is handled by the hardware.

spu_sync_c
The syncc (synchronize channel) instruction ensures channel synchronization
followed by the same synchronization provided by the sync instruction.
Typical use: ensures that the effects on SPU state that are caused by a previous
write to a nonblocking channel are propagated and influence the execution of the
instructions that follow.
spu_dsync
Guarantees: subsequent external reads access data written by prior SPU
stores; subsequent SPU loads access data written by external writes.
Effects: forces SPU load and SPU store accesses of the LS due to instructions
before the dsync to be completed before completion of the dsync; forces
read-channel operations due to instructions before the dsync to be completed
before completion of the dsync; forces SPU load and SPU store accesses of the
LS due to instructions after the dsync to occur after completion of the dsync;
forces read-channel and write-channel operations due to instructions after the
dsync to occur after completion of the dsync.

spu_sync
Guarantees: the two effects of spu_dsync; in addition, subsequent instruction
fetches access data written by prior SPU stores and external writes.
Effects: all accesses of the LS and channels due to instructions before the sync
are completed before completion of the sync; all accesses of the LS and
channels due to instructions after the sync occur after completion of the sync.

spu_sync_c
Guarantees: the two effects of spu_dsync and the second effect of spu_sync; in
addition, subsequent instruction processing is influenced by all internal
execution states modified by previous write instructions to a channel.
Effects: all accesses of the LS and channels due to instructions before the syncc
are completed before completion of the syncc; all accesses of the LS and
channels due to instructions after the syncc occur after completion of the syncc.
Table 4-17 shows which SPU synchronization instructions are required between
LS writes and LS reads to ensure that reads access data written by prior writes.
MFC ordering mechanisms
The SPU can use the MFC channel interface to issue commands to the
associated MFC. The PPU or other SPUs outside of this SPE similarly can use
the MMIO interface of the MFC to send commands to a particular MFC. For each
interface, the MFC independently accepts only queuable commands that are
entered into one of the MFC SPU command queues (one queue for the channel
interfaces and another for the MMIO). The MFC then processes these
commands, possibly out of order to improve efficiency.
Barrier commands: getb, getbs, getlb, putb, putbs, putrb, putlb, putrlb, and sndsigb
Fence commands: getf, getfs, getlf, putf, putfs, putrf, putlf, putrlf, and sndsigf
When the data transfers are issued, the storage system can complete the
requests in an order that is different than the order in which they are issued,
depending on the storage attributes.
Figure 4-4 on page 229 illustrates the different effects of the fence and barrier
command. The row of white boxes represents command-execution slots, in
real-time, in which the DMA commands (the solid red (darker) and green (lighter)
boxes) might execute. Each DMA command is assumed to transfer the same
amount of data. Therefore, all boxes are the same size. The arrows show how
the DMA hardware, using out-of-order execution, might execute the DMA
commands over time.
Figure 4-4 Barrier and fence effect
The commands are useful and efficient in synchronizing SPU code data access
to the shared storage with access to other elements in the system. A common
use of the command is in a double-buffering mechanism, as explained in “Double
buffering” on page 160 and shown in the code in Example 4-30 on page 162. For
more examples, see 4.5.4, “Practical examples of using ordering and
synchronization mechanisms” on page 240.
Barrier commands
The barrier commands order storage access that is made through the MFC with
respect to all other MFCs, processor elements, and other devices in the system.
While the Cell Broadband Engine Architecture (CBEA) specifies those
commands as having tag-specific effects (controlling only the order in which
transfers that belong to one tag-ID group are executed relative to each other),
the current Cell/B.E. implementation has no tag-specific effects.
The commands can be activated only by the SPU that is associated with the
MFC that uses the channel interface. There is no support from the MMIO
interface. However, the PPU can achieve similar effects by using the
non-MFC-specific ordering instructions that are described in “PPE ordering
instructions” on page 221.
mfc_sync
The mfcsync command is similar to the PPE sync instruction and controls the
order in which the MFC commands are executed with respect to storage
accesses by all the other elements in the system.
Typical use: designed for inter-processor or device synchronization. Because it
creates a large load on the memory system, use it only between commands that
involve storage with different storage attributes; otherwise, use other
synchronization commands.

mfc_eieio
The mfceieio command controls the order in which the DMA commands are
executed with respect to storage accesses by all other system elements, only
when the storage that is accessed has the caching-inhibited and guarded
attributes (typical for I/O devices). The command is similar to the PPE eieio
instruction. For details regarding the effects of accessing different types of
memory, see Table 4-14 on page 223.
Typical use: managing shared data structures, performing memory-mapped I/O,
and preventing load and store combining in main storage. The fence and barrier
options of other commands are preferred from a performance point of view and
should be used if they are sufficient.

mfc_barrier
The barrier command orders all subsequent MFC commands with respect to all
MFC commands that precede the barrier command in the DMA command
queue, independent of the tag groups. The barrier command will not complete
until all preceding commands in the queue have completed. After the command
completes, subsequent commands in the queue can be started.
Typical use: managing data structures that are in the main storage and are
shared by other elements in the system.
The MFC multisource synchronization facility addresses this cumulative-ordering
need by providing two independent multisource synchronization-request
methods:
The MMIO interface allows the PPE or other processor elements or devices
to control synchronization from the main-storage domain.
The Channel interface allows an SPE to control synchronization from its
LS-address and channel domain.
Example 4-44 shows how the PPU programmer can achieve cumulative ordering
by using the two corresponding libspe2.h functions.
Example 4-44 MMIO interface of the MFC multisource synchronization facility

#include "libspe2.h"

int status;
spe_context_ptr_t spe_ctx;

// request the SPE's MFC to start tracking its outstanding transfers
status = spe_mssync_start(spe_ctx);
if (status==-1){
   // do whatever need to do on ERROR but do not continue to next step
}
// Check if all the transfers that are being tracked are completed.
// Repeat this step till the function returns 0 indicating the
// completions of those transfers
while(1){
status = spe_mssync_status(spe_ctx); // nonblocking function
if (status==0){
break; // synchronization was completed
}else{
if (status==-1){
// do whatever need to do on ERROR
break; //unless we already exit program because of the error
}
}
};
41. See note 2 on page 80.
Example 4-45 Channel interface of MFC multisource synchronization facility
#include "spu_mfcio.h"
uint32_t status;
mfc_write_multi_src_sync_request();
// Check if all the transfers that are being tracked are completed.
// Repeat this step till the function returns 1 indicating the
// completions of those transfers
do{
status = mfc_stat_multi_src_sync_request(); // nonblocking function
} while(!status);
The atomic operations that are implemented in the Cell/B.E. processor are not
blocking, which enables the programmer to implement algorithms that are
lock-free and wait-free.
42. See note 1 on page 78.
43. Ibid.
4. Determines if the store was successful:
– If successful, the sequence of instructions from steps 1 to step 3 gives the
appearance of having been executed atomically.
– Otherwise, the other processor accesses the semaphore, so that the
software can repeat this process (back to step 1).
These actions also control the order in which memory operations are completed,
with respect to asynchronous events, and the order in which memory operations
are seen by other processor elements or memory access mechanisms.
Table 4-20 summarizes the PPE and SPE atomic instructions. For each PPE
instruction, an MFC command of the SPE implements a similar mechanism. For
all PPE instructions, a reservation (lock) is set for the entire cache line in which
the accessed word resides.

__lwarx/__ldarx (PPE): loads a word (__lwarx) or a double word (__ldarx) and
sets a reservation on the cache line in which it resides.
mfc_getllar (SPE): transfers a cache line from the main storage to the LS and
creates a reservation (lock). The command is not tagged and is executed
immediately (not queued behind other DMA commands).
__stwcx/__stdcx (PPE): stores a word (__stwcx) or a double word (__stdcx)
only if a reservation (lock) exists.
mfc_putllc (SPE): transfers a cache line from the LS to the main storage only if
a reservation (lock) exists.
Two pairs of atomic instructions are implemented on both the PPE and SPE. The
first pair is __ldarx/__lwarx and mfc_getllar, and the second pair is
__stdcx/__stwcx and mfc_putllc, for the PPE and SPE respectively. These
functions provide atomic read-modify-write operations that can be used to derive
other synchronization primitives between a program that runs on the PPU and a
program that runs on the SPU or SPUs. Example 4-46 on page 237 and
Example 4-47 on page 237 show the PPU code and SPU code for implementing
a mutex-lock mechanism using the two pairs of atomic instructions for PPE and
SPE.
Refer to 4.5.3, “Using the sync library facilities” on page 238, which explains how
the sync library implements many of the standard synchronization mechanisms,
such as mutex and semaphore, by using the atomic instructions. Example 4-46 is
based on the sync library code from the mutex_lock.h header file.
Refer to the “PPE Atomic Synchronization” chapter in the Cell Broadband Engine
Programming Handbook, which provides more code examples that show how
synchronization mechanisms can be implemented on both a PPE program and
an SPE program to achieve synchronization between the two programs.44
Example 4-46 is based on one of those examples.
44. See note 1 on page 78.
Example 4-46 PPE implementation of mutex_lock function in sync library
#include "ppu_intrinsics.h"

uint32_t done = 0;
do{
   if (__lwarx((void*)mutex_ptr) == 0)
      done = __stwcx((void*)mutex_ptr, (uint32_t)1);
}while (done == 0); // retry until the conditional store succeeds
__isync(); // discard instructions fetched before the lock was obtained
Example 4-47 SPE implementation of mutex_lock function in sync library

#include <spu_mfcio.h>
#include <spu_intrinsics.h>

// determine the offset to the mutex word within its cache line and
// align the effective address to a cache line boundary.
uint32_t offset = mfc_ea2l(mutex_ptr) & 0x7F;
uint32_t mutex_lo = mfc_ea2l(mutex_ptr) & ~0x7F;
mutex_ptr = mfc_hl2ea(mfc_ea2h(mutex_ptr), mutex_lo);
// set up for possible use of the lock-line-reservation-lost event.
// detect and discard phantom events.
mask = spu_read_event_mask();
spu_write_event_mask(0);

if (*lock_ptr) {
   // The mutex is currently locked. Wait for the lock line
   // reservation lost event before checking again.
   spu_write_event_ack( spu_read_event_status());
   status = MFC_PUTLLC_STATUS;
} else {
   // The mutex is not currently locked. Attempt to lock.
   *lock_ptr = 1;
   // (conditional 'mfc_putllc' store and retry loop elided; see the
   // mutex_lock.h header of the sync library)
Most of the functions in the sync library are supported by both the PPE and the
SPE, but a small portion of them is supported only by the SPE. The functions are
all based upon the Cell/B.E. load-and-reserve and store-conditional functionality
that is described in 4.5.2, “Atomic synchronization” on page 233.
To use the facilities of the sync library, the programmer must consider the
following files:
libsync.h: This header file contains most of the definitions.
libsync.a: This library contains the implementation and should be linked with
the program.
Function-specific header files: Each function is also defined in its own header
file. The programmer can include such a header file instead of the libsync.h
file; in this case, an underscore is added when calling the function. For
example, when the mutex_lock operation is required, the programmer can
include the mutex_lock.h header file and call the _mutex_lock function,
instead of including the libsync.h file and calling mutex_lock.
Note: From a performance point of view, use the function specific header
files because the functions are defined as inline, unlike the definition of the
corresponding function in the libsync.h file. However, a similar effect can be
achieved by setting the appropriate compilation flags.
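A minimal sketch of the inline variant follows. We assume a mutex word at effective address mutex_ea that was initialized earlier (for example, with _mutex_init) and that a matching mutex_unlock.h header exists alongside mutex_lock.h, as the per-function-header convention above suggests:

#include <stdint.h>
#include <mutex_lock.h>
#include <mutex_unlock.h>

void update_shared_data(uint64_t mutex_ea)
{
   _mutex_lock(mutex_ea);   // inline version (note the underscore)
   // ... access the shared data here ...
   _mutex_unlock(mutex_ea);
}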
To ensure the ordering of the DMA write of the data (step 2) and of the
notification (step 3), the notification can be sent by using a fenced DMA
command. This guarantees that the notification is not sent until all previously
issued DMA commands of the same tag group have completed.
In this example, the writing of both the data and the notification should have the
same tag ID in order to guarantee that the fence will work.
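A minimal sketch of this pattern on the SPU follows (the buffer names are ours; the data and the notification deliberately use the same tag ID, and ea_notify is assumed to be 16-byte aligned):

#include <stdint.h>
#include <spu_mfcio.h>

static volatile uint32_t notify[4] __attribute__((aligned(16)));

void put_data_then_notify(void *ls_data, uint64_t ea_data, uint32_t size,
                          uint64_t ea_notify, uint32_t tag_id)
{
   notify[0] = 1; // the completion flag that the PPU polls for

   // step 2: write the computed data to main storage
   mfc_put(ls_data, ea_data, size, tag_id, 0, 0);

   // step 3: fenced put on the SAME tag ID; it is not started before
   // all previously issued commands of this tag group have completed
   mfc_putf((void*)&notify[0], ea_notify, sizeof(uint32_t), tag_id, 0, 0);
}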
45. See note 1 on page 78.
Ordering reads followed by writes using the barrier option
A barrier option is useful when a buffer read takes multiple commands that must
be performed before writing the buffer, which also takes multiple commands as
shown in the following scenario:
1. Issue several get commands to read data into the LS.
2. Issue a single barrier putb to write data to the main storage from the LS. The
barrier guarantees that the putb command and the subsequent put
commands issued in step 3 occur only after the get commands of step 1 are
complete.
3. Issue a basic put command (without a barrier) to write the data to main
storage.
4. Wait for the completion of all the commands issued in the previous steps.
By using the barrier form for the first command that writes the buffer (step 2), the
commands that put the buffer (steps 2 and 3) can be queued without waiting for
the completion of the get commands (step 1). The programmer can
take advantage of this mechanism to overlap the data transfers (read and write)
with computation, allowing the hardware to manage the ordering. This scenario
can occur on either the SPU or PPU that uses the MFC to initiate data transfers.
The get and put commands should have the same tag ID to guarantee that the
barrier option that comes with the get and put commands ensures writing the
buffer just after data is read. If the get and put commands are issued by using
multiple tag IDs, then an MFC barrier command can be inserted between the
get and put command instead of using a put command with a barrier option for
the first put command.
If multiple commands are used to read and write the buffer, the barrier option
allows the read commands and the write commands each to be performed in
any order among themselves, which provides better performance, while still
forcing all reads to finish before the writes start.
int i;
i = 0;
'get' buffer 0
while (more buffers) {
   'getb' buffer i^1 // 'mfc_getb' function (with barrier)
   // (wait for, compute on, and 'put' buffer i, then i = i^1; the
   // complete loop is shown in Example 4-30 on page 162)
In the put command at the end of each loop iteration, data is written from the
same local buffer into which data is read at the beginning of the next iteration’s
get command. Therefore, it is critical to use a barrier for the get command to
ensure that the writes complete before the reads are started and to prevent the
wrong data from being written. Example 4-30 on page 162 shows the code of an
SPU program that implements such a double-buffering mechanism.
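The essence of the mechanism can be sketched as follows, with the tag ID equal to the buffer index so that each barriered get orders itself after the put that previously drained the same buffer (the buffer size and names are our assumptions):

#include <stdint.h>
#include <spu_mfcio.h>

#define BUFSIZE 4096 // an assumed block size

static unsigned char buf[2][BUFSIZE] __attribute__((aligned(128)));

void double_buffer(uint64_t ea_in, uint64_t ea_out, uint32_t nbufs)
{
   uint32_t i = 0, n;

   mfc_get(buf[0], ea_in, BUFSIZE, 0, 0, 0); // prime buffer 0 (tag 0)
   for (n = 1; n < nbufs; n++){
      // barriered get: ordered after the put that previously drained
      // this same buffer, because both use the same tag ID
      mfc_getb(buf[i^1], ea_in + n*BUFSIZE, BUFSIZE, i^1, 0, 0);

      mfc_write_tag_mask(1<<i); // wait only for buffer i
      mfc_read_tag_status_all();

      // ... compute on buf[i] here ...

      mfc_put(buf[i], ea_out + (n-1)*BUFSIZE, BUFSIZE, i, 0, 0);
      i ^= 1;
   }
   // (wait for, compute on, and put the final buffer; elided)
}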
To make this feasible, the data storage performed in step 1 must be visible to the
SPE before it receives the work-request notification (steps 3 and 4). To
guarantee this ordering, the PPE must issue a sync storage-barrier instruction
between the final data store in memory and the PPE write to the SPE mailbox or
signal-notification register. This barrier instruction is shown as step 2.
SPEs updating shared structures using atomic operation
In some cases, several SPEs can maintain a shared structure, for example when
using the following programming model:
A list of work elements in the memory defines the work that needs to be done.
Each of the elements defines one task out of the overall work that can be
executed parallel with the others.
A shared structure contains the pointer to the next work element and
potentially other shared information.
An SPE that is available to execute the next work element atomically reads
the shared structure to evaluate the pointer to the next work element and
updates it to point to the next element. Then it can get the relevant information
and process it.
Atomic operations are useful in such cases when several SPEs need to
atomically read and update the value of the shared structure. Potentially, the PPE
can also update this shared structure by using atomic instructions on the PPU
program. Example 4-49 illustrates such a scenario and how the SPEs can
manage the work and access to the shared structure by using the atomic
operations.
Example 4-49 SPU code for updating a shared structure by using atomic operations

while (1){
   do {
      // get and lock the cache line of the shared structure
      mfc_getllar((void*)&ls_var, ea_var, 0, 0);
      (void)mfc_read_atomic_status();
      // (advance the next-work-element pointer in ls_var and try to
      // store it conditionally with 'mfc_putllc'; the loop repeats
      // while the reservation was lost - code elided)
   } while (status);

   //get data of current work, process it, and put results in memory
}
In 4.6.4, “SIMD programming” on page 258, we explain how the programmer
can explicitly exploit the SIMD instructions to the SPU.
In 4.6.5, “Auto-SIMDizing by compiler” on page 270, we describe how the
programmer can use compilers to automatically convert scalar code to SIMD
code.
In 4.6.6, “Working with scalars and converting between different vector types”
on page 277, we describe how to work with different vector data types and
how to convert between vectors and scalars and vice versa.
In 4.6.7, “Code transfer using SPU code overlay” on page 283, we explain
how the programmer can use the SDK 3.0 SPU code overlay facility in
situations in which the code is too big to fit into the local storage.
In 4.6.8, “Eliminating and predicting branches” on page 284, we explain how
to write efficient code when branches are required.
This section mainly covers issues that are related to writing a program that runs
efficiently on the SPU while fetching instructions from the attached LS. However,
in most cases, an SPU program should also interact with the associated MFC to
transfer data to and from the main storage and to communicate with other processors
on the Cell/B.E. chip. Therefore, it is important that you understand the issues
that are explained in the previous sections of this chapter:
Refer to 4.3, “Data transfer” on page 110, to understand how an SPU can
transfer data between the LS and main storage.
Refer to 4.4, “Inter-processor communication” on page 178, to understand
how the SPU communicates with other processors on the chip (PPE and
SPEs).
Refer to 4.5, “Shared storage synchronizing and data ordering” on page 218,
to understand how the data transfers of the SPU and other processors are
ordered and how the SPU can synchronize with other processors.
Register file
The register file offers the following features:
A large register file supports 128 entries of 128-bits each.
A unified register file, in which all data types (such as floating-point, integer,
vector, and pointer values) are stored in the same registers.
The large, unified register file allows for instruction-latency hiding by using a
deep pipeline without speculation.
Big-endian data ordering is supported. The lowest-address byte and the
lowest-numbered bit are the most-significant byte and bit, respectively.
LS arbitration
Arbitration to the LS is done according to the following priorities (from high to low):
1. MMIO, DMA, and DMA list transfer element fetch
2. ECC scrub
3. SPU load/store and hint instruction prefetch
4. Inline instruction prefetch
Instruction set
The following features are supported:
The single-instruction, multiple-data (SIMD) instruction architecture that
works on 128-bit vectors
Scalar instructions
SIMD instructions
The programmer should use the SIMD instructions as much as possible for
performance reasons. This can be done by using functions that are defined
by the SDK language extensions for C/C++ or by using the auto-vectorization
feature of the compiler.
Floating-point operations
The instructions are executed as follows:
Single-precision instructions are performed in 4-way SIMD fashion and are
fully pipelined. Since the instructions have good performance, we recommend
that the programmer use them as the application allows.
Double-precision instructions are performed in 2-way SIMD fashion, are only
partially pipelined, and stall dual issue of other instructions. The
performance of these instructions makes the Cell/B.E. processor less
attractive for applications that make massive use of double-precision
instructions.
The data format follows the IEEE standard 754 definition, but the
single-precision results are not fully compliant with this standard (different
overflow and underflow behavior, support only for the truncation rounding
mode, different denormal results). The programmer should be aware that, in
some cases, the computation results will not be identical to IEEE standard 754
results.
Features that are not supported by the SPU
The SPU does not support many of the features that are provided in most
general purpose processors:
There is no direct (SPU-program addressable) access to the main storage.
The SPU accesses main storage only by using the DMA transfers of the MFC.
There is no direct access to system control, such as page-table entries. The
PPE privileged software provides the SPU with the address-translation
information that its MFC needs.
With respect to access by its SPU, the LS is unprotected and untranslated
storage.
The SDK provides a rich set of language extensions for C/C++ that define
SIMD data types and intrinsics that map one or more assembly-language
instructions into C-language functions. By using these extensions, the
programmer has convenient and productive control over code performance
without needing to perform assembly-language programming.
The ISA has 204 instructions, which are grouped into several classes according
to their functionality. Most of the instructions are mapped into either generic
intrinsics or specific intrinsics that can be called as C functions from the program.
For a full description of the instruction set, refer to the Synergistic Processor
Unit Instruction Set Architecture Version 1.2 document.48
The ISA provides a rich set of SIMD operations that can be performed on
128-bit vectors of several fixed-point or floating-point elements. Instructions are
available to access any of the MFC channels in order to initiate DMA transfers or
to communicate with other processors.
Channel access
A set of instructions is provided to access the MFC channels. The instructions
can be used to initiate the DMA data transfer, communicate with other
processors, access the SPE decrementer, and more. Refer to 4.6, “SPU
programming” on page 244, in which we discuss the SPU interface with the MFC
channel further.
SIMD operations
ISA SIMD instructions provide a rich set of operations (such as logical,
arithmetic, casting, load and store, and so on) that can be performed on 128-bit
vectors of either fixed-point or floating-point values. The vectors can contain
various sizes of variables, such as 8, 16, 32 or 64 bits.
The performance of the program can be significantly affected by the way the
SIMD instructions are used. For example, using SIMD instructions on 32-bit
variables (single-precision floating point or 32-bit integers) can speed up the
program by at least four times compared to the equivalent scalar program. In
every cycle, the instruction works on four different elements in parallel, since
one 128-bit vector holds four 32-bit variables.
48
See note 39 on page 224.
Figure 4-5 shows an example of the SPU SIMD add instruction of four 32-bit
vector elements. This instruction simultaneously adds four pairs of floating-point
vector elements, which are stored in registers VA and VB, and produces four
floating-point results, which are written to register VC.
In addition, SIMD instructions are often used by the compiler. Whenever a
high-level C/C++ function operates on scalars, the compiler translates the
operation into a set of 16-byte read, modify, and write operations. In this
process, the compiler uses the scalar store-assist and extract instructions to
access the appropriate scalar element.
The ISA provides instructions that use or produce scalar operands or addresses.
In this case, the values are set in the preferred slot of the 128-bit vector
registers, as illustrated in Figure 4-6 on page 252. The compiler can use the
scalar store-assist and extract instructions to shift a nonaligned scalar into the
preferred slot when such a scalar is used.
To eliminate the need for such shift operations, the programmer can explicitly
define the alignment of frequently used scalar variables so that they fall in the
preferred slot.
Figure 4-7 shows an example instruction. Bytes are selected from vectors VA
and VB based on the byte entries in control vector VC. The control vector entries
are indices of bytes in the 32-byte concatenation of VA and VB. While the shuffle
operation is purely byte oriented, it can also be applied to vectors with larger
elements, for example, vectors of floating-point values or 32-bit integers.
SPU C/C++ language extensions (intrinsics)
A large set of SPU C/C++ language extensions (intrinsics) makes the underlying
SPU ISA and hardware features conveniently available to C programmers:
Intrinsics are essentially inline assembly-language instructions in the form of
C-language function calls.
Intrinsics can be used in place of assembly-language code when writing in the
C or C++ languages.
A single intrinsic maps to one or more assembly-language instructions.
Intrinsics provide the programmer with explicit control of the SPE SIMD
instructions without directly managing registers.
Intrinsics provide the programmer with access to all MFC channels as well as
other system registers, for example the decrementer and SPU state
save/restore register.
For a full description of the extensions, see the C/C++ Language Extensions for
Cell Broadband Engine Architecture document.49
Directory of system header file: The SPU intrinsics are defined in the
spu_intrinsics.h system header file, which should be included if the
programmer wants to use them. The directory in which this file is located
varies depending on which compiler is used. When using GCC, the file is in
/usr/lib/gcc/spu/4.1.1/include/. When using XLC, the file is in
/opt/ibmcmp/xlc/cbe/9.0/include/.
The SDK compilers support these intrinsics and emit efficient code for the SPE
that are used by the compilers to generate efficient code include register
coloring, instruction scheduling (dual-issue optimization), loop unrolling and auto
vectorization, up-stream placement of branch hints, and more.
The PPU and SPU instruction sets have similar, but distinct, SIMD intrinsics. You
must understand the mapping between the PPU and SPU SIMD intrinsics when
developing applications on the PPE that will eventually be ported to the SPEs.
Refer to 4.1.1, “PPE architecture and PPU programming” on page 78, in which
we further discuss this issue.
49
See note 2 on page 80.
The translation from function to instruction depends on the data type of the
arguments. For example, spu_add(a,b) can translate to a floating add or a
signed int add depending on the input parameters.
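The following sketch (our illustration; the variable names are not from the original) shows the same generic intrinsic compiling to different instructions:

#include <spu_intrinsics.h>

vector float f1, f2, f3;
vector signed int i1, i2, i3;

f3 = spu_add(f1, f2); /* translates to fa (floating add) */
i3 = spu_add(i1, i2); /* translates to a (add word) */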
Some operations cannot be performed on all data types. For example,
multiplication using spu_mul can be performed only on floating-point data types.
Refer to the C/C++ Language Extensions for Cell Broadband Engine
Architecture document, which provides detailed information about all the
intrinsics, including the data types that are supported for each of them.50
Intrinsics classes
SPU intrinsics are grouped into the following classes:
Specific intrinsics
These intrinsics have a one-to-one mapping with a single assembly-language
instruction and are provided for all instructions except some branch and
interrupt-related ones. All specific intrinsics are named after the corresponding
SPU assembly instruction, prefixed by the string si_. For example, the
specific intrinsic that implements the stop assembly instruction is si_stop.
Programmers rarely need these intrinsics because all of them are mapped into
generic intrinsics, which are more convenient.
Generic and built-in intrinsics
These intrinsics map to one or more assembly-language instructions as a
function of the type of input parameters and are often implemented as
compiler built-ins. Intrinsics of this group are useful and cover almost all the
assembly-language instructions including the SIMD ones. Instructions that
are not covered are naturally accessible through the C/C++ language
semantics.
50
See note 2 on page 80.
All of the generic intrinsics are prefixed by the string spu_. For example, the
intrinsic that implements the stop assembly instruction is named spu_stop.
Composite and MFC-related intrinsics
These convenience intrinsics are constructed from a sequence of specific or
generic intrinsics. The intrinsics are further discussed in other sections in this
chapter that discuss DMA data transfer and inter-processor communication
using the MFC.
In the following section, we discuss some of the more useful intrinsics. For a list
that summarizes all the SPU intrinsics, refer to Table 18 in the Software
Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial
document.51
To align a variable on a quadword (16-byte) boundary, the SDK GCC and XLC
implementations of the aligned attribute use the following syntax:
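float factor __attribute__((aligned (16)));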
51
See note 3 on page 80.
The compilers currently do not support the alignment of automatic (stack)
variables to an alignment that is stricter than the alignment of the stack itself (16
bytes).
The volatile keyword for the SDK has the following syntax:
volatile float factor;
For branch prediction, the __builtin_expect directive takes a value argument
that can be a constant for a compile-time prediction or a variable used for a
run-time prediction.
Refer to “Branch hint” on page 287, in which we further discuss use of this
directive and provide useful code examples.
An alternative to converting scalar code into SIMD code by hand is to let the
compiler perform the conversion automatically. This approach is called
auto-SIMDizing, which we discuss further in 4.6.5, “Auto-SIMDizing by
compiler” on page 270.
Vector data types
SPU SIMD programming operates on vector data types. Vector data types have
the following main attributes:
128-bits (16-bytes) long
Aligned on quadword (16-bytes) boundaries
Support for different data types, including fixed point (for example, signed or
unsigned char, short, and int) and floating point (for example, float and double)
From 1 to 16 elements per vector, depending on the corresponding element type
Stored in memory like an array of the corresponding data type (for example,
a vector of integers is laid out like an array of four 32-bit integers)
To use the data types, the programmer must include the spu_intrinsics.h
header file.
In general, the vector data types share a lot in common with ordinary C language
scalar data types:
Pointers to vector types can be defined as well as operations on those
pointers. For example, if the pointer vector float *p is defined, then p+1
points to the next vector (16-bytes) after that pointed to by p.
Arrays of vectors can be defined, as well as operations on those arrays. For
example, if the array vector float p[10] is defined, then p[3] is the fourth
vector in this array.
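The following short sketch (our illustration) shows both forms:

#include <spu_intrinsics.h>

vector float varr[10]; /* like an array of 10 quadwords */
vector float *p = &varr[0]; /* p+1 points 16 bytes past *p */
vector float v = *(p + 1); /* loads the second vector, varr[1] */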
For example, the vector unsigned long long type (vec_ullong2) holds two 64-bit
unsigned doublewords.
SIMD operations
In this section, we explain how the programmer can perform SIMD operations on
an SPU program vector. There are four main options to perform SIMD operations
as discussed in the following sections:
In “SIMD arithmetic and logical operators” on page 261, we explain how SDK
3.0 compilers support a vector version of some of the common arithmetic and
logical operators. The operators work on each element of the vector.
In “SIMD low-level intrinsics” on page 261, we discuss the high level C
functions that support almost all the SIMD assembler instructions of the SPU.
The intrinsics contain basic logical and arithmetic operations between 128-bit
vectors from different types and some operations between elements of a
single vector.
In “SIMD Math Library” on page 262, we explain how the library extends the
low level intrinsic and provides functions that implement more complex
mathematical operations, such as root square and trigonometric operations,
on 128-bit vectors.
In “MASS and MASSV libraries” on page 264, we explain the Mathematical
Acceleration Subsystem (MASS) library functions, which are similar to those
of the SIMD Math Library, but are optimized to have better performance at the
price of reduced accuracy. The MASS vector (MASSV) library functions
perform similar operations on longer vectors whose length is any multiple of four.
While the compilers support basic arithmetic, logical, and relational operators,
not all the existing operators are currently supported. If a required operator is
not supported, the programmer should use the other alternatives that are
described in the following sections.
For more details about this subject, see the “Operator overloading for vector data
types” chapter in the C/C++ Language Extensions for Cell Broadband Engine
Architecture document.52
Example 4-50 shows a simple code that uses some SIMD operators.
52
See note 2 on page 80.
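A minimal sketch of such code (our illustration, assuming a compiler with the vector operator support described above) follows:

#include <spu_intrinsics.h>

int main( )
{
   vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
   vector float b = {4.0f, 3.0f, 2.0f, 1.0f};
   vector float c;

   c = a + b; /* element-wise add: every element of c becomes 5.0f */
   c = c * a; /* element-wise multiply */

   return 0;
}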
The intrinsics are grouped into several types according to their functionality, as
described in “Intrinsics: Functional types” on page 255. The following groups
contain the most significant SIMD operations:
Arithmetic intrinsics, which perform arithmetic operations on all the elements
of the given vectors (for example spu_add, spu_madd, spu_nmadd, ...)
Logical intrinsics, which perform logical operations on all the elements of the
given vectors (spu_and, spu_or, ...)
Byte operations, which perform operations between bytes of the same vector
(for example spu_absd, spu_avg,...)
The intrinsics support different data types. It is up to the compiler to translate the
intrinsics to the correct assembly instruction depending on the type of the
intrinsic operands.
To use the SIMD intrinsics, the programmer must include the spu_intrinsics.h
header file.
Example 4-51 shows a simple code that uses low-level SIMD intrinsics.
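A minimal sketch along the same lines (our illustration, not the book's exact listing) follows:

#include <spu_intrinsics.h>

int main( )
{
   vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
   vector float b = {0.5f, 0.5f, 0.5f, 0.5f};
   vector float c;

   c = spu_add(a, b); /* arithmetic intrinsic: element-wise add */
   c = spu_madd(a, b, c); /* multiply-add: c = a*b + c */
   c = spu_and(c, a); /* logical intrinsic: bitwise AND */

   return 0;
}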
The SIMD Math Library provides functions in the following categories:
Absolute value and sign functions: Remove or extract the signs from values.
Classification and comparison functions: Return boolean values from
comparison or classification of elements.
Divide, multiply, modulus, remainder, and reciprocal functions: Standard
arithmetic operations.
Exponentiation, root, and logarithmic functions: Functions related to
exponentiation or its inverse.
Gamma and error functions: Probability functions.
Minimum and maximum functions: Return the larger, smaller, or absolute
difference between elements.
Rounding and next functions: Convert floating-point values to integers.
Trigonometric functions: sin, cos, tan, and their inverses.
Hyperbolic functions: sinh, cosh, tanh, and their inverses.
The SIMD Math Library is an implementation of most of the C99 math library
(-lm) that operates on short SIMD vectors. The library functions conform as
closely as possible to the specifications set by the scalar standards. However,
fundamental differences between scalar architectures and the CBEA require
deviations, including the handling of rounding, error conditions, floating-point
exceptions, and special operands such as NaN and infinities.
To use the SIMD Math Library, the programmer must perform the following
actions:
For the linkable library archive version, include the primary
/usr/spu/include/simdmath.h header file.
For the linkable library archive version, link the SPU application with the
/usr/spu/lib/libsimdmath.a library.
For the inline functions version, include a distinct header file for each function
that is used. Those header files are in the /usr/spu/include/simdmath
directory. For example, add #include <simdmath/fabsf4.h> to use the
_fabsf4 inline function.
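As a minimal sketch (our illustration) of the inline-function version:

#include <spu_intrinsics.h>
#include <simdmath/fabsf4.h>

int main( )
{
   vector float x = {-1.0f, 2.0f, -3.0f, 4.0f};
   vector float y = _fabsf4(x); /* absolute value of all four elements at once */
   return 0;
}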
For additional information about this library, refer to the following resources:
For code examples and additional usage instructions, see Chapter 8, “Case
study: Monte Carlo simulation” on page 499.
For the function calls format, see the SIMD Math Library API Reference,
SC33-8335-01:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/6DFAEFEDE179041E8725724200782367
For function specifications, refer to the “Performance Information for the
MASS Libraries for CBE SPU” product documentation on the Web at the
following address:
http://www-1.ibm.com/support/docview.wss?rs=2021&context=SSVKBV&dc=DA400&uid=swg27009548&loc=en_US&cs=UTF-8&lang=en&rss=ct2021other
Similar to the SIMD Math Library, the MASS libraries can be used in two different
versions, the linkable library archive version and the inline functions version.
The implementation of the MASS and MASSV libraries differs from the SIMD
Math Library in the following aspects:
The SIMD Math Library is focused on accuracy while the MASS and MASSV
libraries are focused on having better performance. For a performance
comparison between the libraries, see the “Performance Information for the
MASS Libraries for CBE SPU” document.
The SIMD Math Library has support across the entire input domain, while the
MASS and MASSV libraries can restrict the input domain.
The MASS and MASSV libraries support a subset of the SIMD Math Library
functions.
The MASSV library can work on long vectors whose length is any multiple of 4.
The functions of the MASSV library have names that are similar to the
SIMD Math and MASS functions, but with a prefix of “vs.”
To use the MASS libraries, the programmer must perform the following actions:
For both libraries, include the mass_simd.h and simdmath.h header files
from the /usr/spu/include/ directory in order to use the MASS functions, and
include the massv.h header file for the MASSV functions.
For both libraries, link the SPU application with the libmass_simd.a library
for MASS functions and with the libmassv.a library for MASSV functions.
Both files are in the /usr/spu/lib/ directory.
For the inline functions version, include a distinct header file for each function
that is used. The header files are in the /usr/spu/include/mass directory. For
example, include the acosf4.h header file to use the acosf4 inline function.
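A corresponding sketch for the MASS inline version (our illustration, assuming /usr/spu/include is on the include path):

#include <spu_intrinsics.h>
#include <mass/acosf4.h>

int main( )
{
   vector float x = {0.0f, 0.25f, 0.5f, 0.75f};
   vector float y = acosf4(x); /* arc cosine of all four elements at once */
   return 0;
}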
For more information about these libraries, including the function call format and
description, as well as usage instructions, see the “MASS C/C++ function
prototypes for CBE SPU” product documentation on the Web at the following
address:
http://www-1.ibm.com/support/docview.wss?rs=2021&context=SSVKBV&dc=DA400&uid=swg27009546&loc=en_US&cs=UTF-8&lang=en&rss=ct2021other
Example 4-52 shows the loop unrolling technique. The code contains two
versions of element-wise multiplication of two input arrays of floats. The first
version is an ordinary scalar version (the mult_ function), and the second is a
loop-unrolled SIMD version (the vmult_ function).
Source code: The code in Example 4-52 is included in the additional material
for this book. See “SPE loop unrolling” on page 621 for more information.
In this example, we require the arrays to be quadword aligned and the array
length to be divisible by 4 (that is, by the four float elements in a vector of
floats). Consider the following comments regarding the alignment and length of
the vectors:
We ensure quadword alignment by using the aligned attribute, which is
recommended in most cases. If this is not the case, a scalar prefix can be
added to the unrolled loop to handle the first unaligned elements.
We recommend working with arrays whose length is divisible by 4. If this
is not the case, a suffix can be added to the unrolled loop to handle the last
elements.
#include <spu_intrinsics.h>
#define NN 100
/* scalar version */
void mult_(float *in1, float *in2, float *out, int N)
{
   int i;
   for (i=0; i<N; i++){
      out[i] = in1[i] * in2[i];
   }
}
/* SIMD version (body reconstructed as a sketch; assumes quadword-aligned
   arrays and N divisible by 4) */
void vmult_(float *in1, float *in2, float *out, int N)
{
   int i, Nv = N>>2; /* four floats per vector */
   vec_float4 *vin1 = (vec_float4 *)in1, *vin2 = (vec_float4 *)in2;
   vec_float4 *vout = (vec_float4 *)out;
   for (i=0; i<Nv; i++){
      vout[i] = spu_mul(vin1[i], vin2[i]);
   }
}
int main( )
{
   float in1[NN] __attribute__((aligned (16)));
   float in2[NN] __attribute__((aligned (16)));
   float out[NN];
   float vout[NN] __attribute__((aligned (16)));
   mult_(in1, in2, out, NN);
   vmult_(in1, in2, vout, NN);
   return 0;
}
The first method for organizing data in the SIMD vectors is called an array of
structures (AOS) as demonstrated in Figure 4-8. This figure shows a natural way
to use the SIMD vectors to store the homogenous data values (x, y, z, w) for the
three vertices (a, b, c) of a triangle in a 3-D graphics application. This
organization has the name array of structures because the data values for each
vertex are organized in a single structure, and the set of all such structures
(vertices) is an array.
The second method is a structure of arrays (SOA), which is shown in Figure 4-9.
This figure shows the SOA organization to represent the x, y, z vertices for four
triangles. The data types are the same across the vector, and now their data
interpretation is the same. Each corresponding data value for each vertex is
stored in a corresponding location in a set of vectors. This is different from the
AOS method, where the four values of each vertex are stored in one vector.
The AOS data-packing approach often produces small code sizes, but it typically
executes poorly and generally requires significant loop-unrolling to improve its
efficiency. If the vertices contain fewer components than the SIMD vector can
hold (for example, three components instead of four), SIMD efficiencies are
compromised.
Alternatively, when using SOA, it is usually easy to perform loop unrolling or other
SIMD programming. The programmer can think of the data as though it were
scalar, and the vectors are populated with independent data across the vector.
The structure of an unrolled loop iteration should be similar to the scalar case.
The one difference is that the scalar operations are replaced with identical
vector operations that work simultaneously on a few elements gathered in
one vector.
Example 4-53 illustrates the process of taking a scalar loop in which the
elements are stored by using the AOS method and writing the equivalent
unrolled SOA-based loop, which has four times fewer iterations. The scalar and
unrolled SOA loops are similar and use the same + operators. The only
difference is how the indexing into the data structure is performed.
Source code: The code in Example 4-53 is included in the additional material
for this book. See “SPE SOA loop unrolling” on page 621 for more information.
#include <spu_intrinsics.h>
#define NN 20
typedef struct { float x, y, z, w; } vertices; /* AOS layout (sketch) */
typedef struct { vec_float4 x[NN>>2], y[NN>>2], z[NN>>2]; } vvertices; /* SOA layout (sketch) */
int main( )
{
   int i, Nv=NN>>2;
   vertices vers[NN] __attribute__((aligned (16)));
   vvertices vvers __attribute__((aligned (16)));
   for (i=0; i<NN; i++) vers[i].x = vers[i].y + vers[i].z; /* scalar AOS loop */
   for (i=0; i<Nv; i++) vvers.x[i] = spu_add(vvers.y[i], vvers.z[i]); /* unrolled SOA loop */
   return 0;
}
However, at this point, there are limitations on the capabilities of the compilers
in translating certain scalar code to SIMD. Not all scalar code that can
theoretically be translated into SIMD code will actually be translated by the
compilers.
53
See note 1 on page 78.
Therefore, a programmer must have knowledge of the compiler limitations, so
that the programmer can choose one of the following options:
Write code in a way that is supported by the compilers for auto-SIMDizing.
Recognize places in the code where auto-SIMDizing is not realistic and
perform explicit SIMD programming in those places.
Use arrays and not pointer arithmetic in the application to access large data
structures.
Use global arrays that are statically declared.
Use global arrays that are aligned to 16-byte boundaries, for example using
the aligned attribute. In general, lay out the data to maximize 16-byte aligned
accesses.
If there is more than a single misaligned store, distribute them into a separate
loop. (Currently the vectorizer peels the loop to align a misaligned store.)
If it is not possible to use aligned data, use the alignx directive to indicate to
the compiler what the alignment is, for example:
#pragma alignx(16, p[i+n+1]);
If it is known that arrays are disjoint, use the disjoint directive to indicate to
the compiler that the arrays specified by the pragma are not overlapping, for
example:
#pragma disjoint(*ptr_a, b)
#pragma disjoint(*ptr_b, a)
Scatter-gather
Scatter-gather refers to a technique for operating on sparse data, using an index
vector. A gather operation takes an index vector and loads the data that resides
at a base address that is added to the offsets in the index vector. A scatter
operation stores data back to memory, using the same index vector.
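In scalar C terms, a gather followed by a scatter looks like the following sketch (the array names are ours):

float base[1024], dense[64];
int idx[64];
int i, n = 64;

/* gather: load sparse elements through the index vector */
for (i = 0; i < n; i++)
   dense[i] = base[idx[i]];

/* ... process dense[] ... then scatter the results back */
for (i = 0; i < n; i++)
   base[idx[i]] = dense[i];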
Reports are generated for all loops that are considered for SIMDization.
Specifically, successful candidates are reported. In addition, if SIMDization was
not possible, the reasons that prevented it are also reported.
Similarly, the GCC compiler can provide debug information about the
auto-SIMDizing process by using the following options:
-ftree-vectorizer-verbose=[X]
This option dumps information about which loops were vectorized, which
were not, and why (X=1 least information, X=6 all information). The
information is dumped to stderr unless the following flag is used.
-fdump-tree-vect
This option dumps information into <C file name>.c.t##.vect.
-fdump-tree-vect-details
This option is equivalent to setting the combination of the first two flags:
-fdump-tree-vect -ftree-vectorizer-verbose=6
In the rest of this section, we show how to debug code that might not be
SIMDized, as well as other code that can be successfully SIMDized. We illustrate
this by using the XLC debug features with the -qreport option enabled.
Example 4-54 shows an SPU code of a program named t.c, which is hard to
SIMDize because of dependencies between sequential iterations.
/* array declarations assumed; not shown in the original listing */
float b[1032], c_buf[1032], *c = &c_buf[1]; /* c offset so c[i-1] stays in bounds */
int main(){
   for (int i=0; i<1024; ++i)
      b[i+1] = b[i+2] - c[i-1];
}
The code is then compiled with the -qreport option enabled, using a command
similar to the following (the exact invocation here is our assumption):
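spuxlc -c -O3 -qhot -qreport t.c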
Example 4-55 shows the t.lst file that is generated by the XLC compiler. It
contains the problems in SIMDizing the loop and the transformed “pseudo
source.”
1586-535 (I) Loop (loop index 1) at t.c <line 5> was not SIMD
vectorized because the aliasing-induced dependence prevents SIMD
vectorization.
1586-536 (I) Loop (loop index 1) at t.c <line 5> was not SIMD
vectorized because it contains memory references with non-vectorizable
alignment.
1586-536 (I) Loop (loop index 1) at t.c <line 6> was not SIMD
vectorized because it contains memory references ((char *)b +
(4)*(($.CIV0 + 1))) with non-vectorizable alignment.
1586-543 (I) <SIMD info> Total number of the innermost loops considered
<"1">. Total number of the innermost loops SIMD vectorized <"0">.
3 | long main()
{
5 | if (!1) goto lab_5;
$.CIV0 = 0;
6 | $.ICM.b0 = b;
$.ICM.c1 = c;
5 | do { /* id=1 guarded */ /* ~4 */
/* region = 8 */
/* bump-normalized */
6 | $.ICM.b0[$.CIV0 + 1] = $.ICM.b0[$.CIV0 + 2] -
$.ICM.c1[$.CIV0 - 1];
5 | $.CIV0 = $.CIV0 + 1;
} while ((unsigned) $.CIV0 < 1024u); /* ~4 */
lab_5:
rstr = 0;
Example 4-56 shows SPU code that is similar to Example 4-54 but with the
SIMD inhibitors corrected.
/* declarations and loop body reconstructed as a sketch */
float b[1024] __attribute__((aligned (16)));
float c[1024] __attribute__((aligned (16)));
int main()
{
   // __alignx(16, c); Not strictly required since compiler
   // __alignx(16, b); inserts runtime alignment check
   for (int i=0; i<1024; ++i)
      b[i] = b[i] - c[i];
}
Example 4-57 shows the output t.lst file after compiling with the -qreport
option enabled. The example reports successful auto-SIMDizing and the
transformed “pseudo source.”
Making such scalar operations more efficient requires the following static
technique:
1. Use the aligned attribute and extra padding if necessary to statically align the
scalar to the preferred slot. Refer to “The aligned attribute” on page 256
regarding use of this attribute.
2. Change the scalars to quadword vectors. Doing so eliminates the three extra
instructions that are associated with loading and storing scalars, which
reduces the code size and execution time.
In addition, the programmer can use one of the SPU intrinsics to efficiently
promote scalars to vectors, or vectors to scalars:
spu_insert inserts a scalar into a specified vector element.
spu_promote promotes a scalar to a vector that contains the scalar in the
element that is specified by the input parameter. Other elements of the vector
are undefined.
spu_extract extracts a vector element from its vector and returns the element
as a scalar.
spu_splats replicates a single scalar value across all elements of a vector of
the same type.
These instructions are efficient. By using them, the programmer can eliminate
redundant loads and stores. One example for using the instructions is to cluster
several scalars into vectors, load multiple scalars at one instruction using a
quadword memory, and perform a SIMD operation that will operate on all the
scalars at once.
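The following sketch (our illustration) shows these intrinsics in use:

#include <spu_intrinsics.h>

int main( )
{
   float s = 3.0f;
   vector float v, w;

   v = spu_splats(s); /* all four elements become 3.0f */
   w = spu_promote(s, 2); /* s placed in element 2, other elements undefined */
   w = spu_insert(s, v, 0); /* copy of v with element 0 replaced by s */
   s = spu_extract(v, 1); /* element 1 of v returned as a scalar */

   return 0;
}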
Casts between vector types and scalar types are not allowed. On the SPU, the
spu_extract, spu_insert, and spu_promote generic intrinsics or the specific
casting intrinsics can be used to efficiently achieve the same results.
Source code: The code in Example 4-60 and Example 4-61 on page 282 is
included in the additional material for this book. See “SPE scalar-to-vector
conversion using unions” on page 622 for more information.
An SPU program that can use the casting unions is shown in Example 4-61 on
page 282. The program uses the unions to perform a combination of SIMD
operations on the entire vector and scalar operations between the vector
elements.
While the scalar operations are easy to program, they are not efficient from a
performance point of view. Therefore, the programmer must try to minimize the
frequency in which they happen and use them only if there is not a simple SIMD
solution.
Example 4-60 Header file for casting between scalars and vectors
#include <spu_intrinsics.h>
typedef union {
   vector signed char c_v;
   signed char c_s[16];
} vec128;
/* float variant, reconstructed from its use in Example 4-61 */
typedef union {
   vector float v;
   float s[4];
} vec_float;
#include <spu_intrinsics.h>
/* uses the vec_float union from the header file in Example 4-60 */
int main( )
{
   vec_float a __attribute__((aligned (16)));
   vec_float b __attribute__((aligned (16)));

   a.v = spu_mul(a.v, b.v); /* SIMD operation on the entire vector */
   a.s[0] = a.s[1] + a.s[3]; /* scalar access to individual elements (sketch) */

   return 0;
}
An overlay is a program segment that is not loaded into an LS before the main
program begins to execute, but is instead left in main storage until it is required.
When the SPU program calls code in an overlay segment, this segment is
transferred to the LS where it can be executed. This transfer usually overwrites
another overlay segment that is not immediately required by the program.
The overlay feature is supported on SDK 3.0 for SPU programming but not for
PPU programming.
The most effective way to reduce the impact of branches is to eliminate them by
using the first three methods in the following list. The next most effective
method for reducing the impact of branches is to use a branch hint, which is
presented last in the list:
In “Function inlining”, we explain how this method defines the functions as
inline and avoids the branch when a function is called and another branch
when it is returned.
In “Loop unrolling” on page 285, we explain how this method removes loops
or reduces the number of iterations in a loop in order to reduce the number of
branches (that appears at the end of the loop).
In “Branchless control flow statement” on page 287, we explain how this
method uses spu_sel intrinsics to replace a simple control statement.
In “Branch hint” on page 287, we discuss the hint-for branch instructions. If
the software speculates that the instruction branches to a target path, a
branch hint is provided. If a hint is not provided, the software speculates that
the branch is not taken. That is, the instruction execution continues
sequentially.
Function inlining
The function-inlining technique can be used to increase the size of basic blocks
(sequences of consecutive instructions without branches). This technique
eliminates the two branches that are associated with function-call linkage, the
branch for function-call entry and the branch indirect for function-call return.
54
See note 1 on page 78.
To use function inlining, the programmer can use either of the following
techniques:
Explicitly add the inline attribute to the declaration of any function that the
programmer wants to inline. Inlining is recommended for functions that are
short, and for functions that have a small number of instances in the code
but are often executed at run time, for example, when they appear inside
a loop.
Use the compiler options for automatic inlining of the appropriate functions.
Table 4-22 lists such options of the GCC compiler.
Over-aggressive use of inlining can result in larger code, which reduces the LS
space that is available for data storage or, in extreme cases, is too large to fit in
the LS.
-finline-small-functions: Integrates functions into their callers when their body
is smaller than the expected function call code, so that the overall size of the
program gets smaller. The compiler heuristically decides which functions are
simple enough to be worth integrating this way.
-finline-functions: Integrates all simple functions into their callers. The compiler
heuristically decides which functions are simple enough to be worth integrating
this way. If all calls to a given function are integrated, and the function is
declared static, then the function is normally not output as assembler code
in its own right.
-finline-functions-called-once: Considers all static functions that are called once
for inlining into their caller, even if they are not marked inline. If a call to a
given function is integrated, then the function is not output as assembler code
in its own right.
-finline-limit=n: By default, GCC limits the size of the functions that can be
inlined. This flag allows control of this limit for functions that are explicitly
marked as inline.
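For the explicit technique, it is enough to mark a short, frequently executed function as inline (a trivial sketch of ours):

/* short function called inside hot loops: a good inlining candidate */
static inline float sqr(float x)
{
   return x * x;
}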
Loop unrolling
Loop unrolling is another technique that can be used to increase the size of basic
blocks (sequences of consecutive instructions without branches), which
increases scheduling opportunities. It eliminates branches by decreasing the
number of loop iterations.
// original loop
for (i=0;i<3;i++) x[i]=y[i];
// can be completely unrolled, eliminating the branch, to
x[0]=y[0];
x[1]=y[1];
x[2]=y[2];
If the number of iterations is larger, but the iterations are independent of each
other, the programmer can reduce the number of iterations and work on several
items in each iteration, as illustrated in Example 4-63. Another advantage of this
technique is that it usually improves dual-issue utilization. The loop-unrolling
technique is often used when moving from scalar to vector instructions.
// original loop
for (i=0;i<300;i++) x[i]=y[i];
// can be unrolled to
for (i=0;i<300;i+=3){
x[i] =y[i];
x[i+1]=y[i+1];
x[i+2]=y[i+2];
}
Automatic loop unrolling can be performed by the compiler when the optimization
level is high enough or when one of the appropriate options is set, for example
-funroll-loops, -funroll-all-loops.
Typically, branches that are associated with the loop with a relatively large
number of iterations are inexpensive because they are highly predictable. In this
case, a non-predicted branch usually occurs only in the first and last iterations.
Branchless control flow statement
The select-bits (selb) instruction is the key to eliminating branches for simple
control flow statements such as if and if-then-else constructs. An if-then-else
statement can be made branchless by computing the results of both the then and
else clauses and by using select-bits intrinsics (spu_sel) to choose the result as
a function of the conditional. If computing both results costs less than a
mispredicted branch, then a performance improvement is expected.
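For example, a branchless vector version of if (a > b) d = c + a; else d = c - a; might look like the following sketch (our illustration):

#include <spu_intrinsics.h>

vector float branchless(vector float a, vector float b, vector float c)
{
   vector unsigned int gt = spu_cmpgt(a, b); /* per-element mask: all 1s where a > b */
   vector float sum = spu_add(c, a); /* "then" result, always computed */
   vector float diff = spu_sub(c, a); /* "else" result, always computed */
   return spu_sel(diff, sum, gt); /* select sum where the mask is set */
}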
Branch hint
The SPU supports branch prediction through a set of hint-for branch (HBR)
instructions (hbr, hbra, and hbrr) and a branch-target buffer (BTB). These
instructions support efficient branch processing by allowing programs to avoid
the penalty of taken branches.
As with all programmer-provided hints, use care when using branch hints
because, if the information provided is incorrect, performance might degrade.
There are immediate and indirect forms for this instruction class. The location of
the branch is always specified by an immediate operand in the instruction.
A branch hint should be placed early enough in the code. At minimum, a hint
should precede the branch by eleven cycles plus four instruction pairs. Hints
that are too close to the branch do not affect the speculation after the branch.
There are many arguments against profiling large bodies of code, but most SPU
code is not like that. SPU code tends to be well-understood loops. Thus,
obtaining realistic profile data should not be time consuming. Compilers should
be able to use this information to arrange code to increase the number of
fall-through branches (that is, conditional branches are not taken). The
information can also be used to select candidates for loop unrolling and other
optimizations that tend to unduly consume LS space.
Example 4-65 Predicting a false conditional statement
if(__builtin_expect((a>b),0))
c += a;
else
d += 1;
The __builtin_expect directive can be used not only for static branch prediction,
but also for dynamic branch prediction. The return value of
__builtin_expect is the value of the exp argument, which must be an integral
expression. For dynamic prediction, the value argument can be either a
compile-time constant or a variable. The __builtin_expect function assumes
that exp equals value. Example 4-66 shows code for a dynamic prediction.
The two main purposes of the high-level frameworks are to reduce the
development time of programming a Cell/B.E. application and to create an
abstraction layer that hides the specific features of the CBEA from the
programmer. In some cases, the performance of an application that uses those
frameworks is similar to programming that uses the lower-level libraries. Given
that development time is shortened and the code is more architecture
independent, using the framework in those cases is preferred. However, in
general, using the lower-level libraries can provide better performance, because
the programmer can tune the program to the application-specific requirements.
We discuss the main frameworks that are provided with SDK 3.0 in the first two
sections:
In “Data Communication and Synchronization” on page 291, we discuss
DaCS, which is an API and a library of C callable functions that provides
communication and synchronization services among tasks of a parallel
application running on a Cell/B.E. system. Another version of DaCS provides
similar functionality for a hybrid system and is discussed in 7.2, “Hybrid Data
Communication and Synchronization” on page 449.
In “Accelerated Library Framework” on page 298, we discuss the ALF, which
offers a framework for implementing the function off-load model on a Cell/B.E.
system. It uses the PPE as the program control and SPEs as functions
off-load accelerators. A hybrid version is also available and is discussed in
7.3, “Hybrid ALF” on page 463.
The functions that these libraries implement are optimized specifically for the
Cell/B.E. processor and can reduce development time in cases where the
developed application uses similar functions. In such cases, the programmer can
use the corresponding library to implement those functions. Alternatively, they
can use the library code as a starting point and customize it to the specific
requirements of the application. (The libraries are open source.)
DaCS can be used to implement various types of dialogs between parallel tasks
by using common parallel programming mechanisms, such as message passing,
mailboxes, mutex, and remote memory accesses. The only assumption is that
there is a master task and slave tasks, a host element (HE), and accelerator
elements (AE) in DaCS terminology. This is to be contrasted with MPI, which
treats all tasks as equal. The goal of DaCS is to provide services for applications
by using the host/accelerator model, where one task subcontracts lower-level
tasks to perform a given piece of work. One model might be an application that is
written by using MPI communication at the global level with each MPI task
connected to accelerators that communicate with DaCS as illustrated in
Figure 4-10 on page 292.
As shown in Figure 4-10, five MPI tasks exchange MPI messages and use DaCS
communication with their accelerators. No direct communication occurs between
the accelerators that report to a different MPI task.
DaCS also supports a hierarchy of accelerators. A task can be an accelerator for
a task that is higher in the hierarchy. It can also be a host element for lower-level
accelerators as shown in Figure 4-11.
A DaCS program does not need to be an MPI program nor use a complex
multi-level hierarchy. DaCS can be used for an application that consists of a
single host process and its set of accelerators.
The main benefit of DaCS is to offer abstractions (message passing, mutex, and
remote memory access) that are more common to application programmers than
the DMA model, which is probably better known by system programmers. It also
hides the fact that the host element and its accelerators might not be running on
the same system.
DaCS groups
In DaCS, groups are created by an HE, and an AE can only join a group
that was previously created by its HE. The groups are used to enable
synchronization (barriers) between tasks.
DaCS mutex
An HE can create a mutex that an AE agrees to share. After the sharing is
explicitly set up, the AE can use lock-unlock primitives to serialize accesses to
shared resources.
DaCS communication
Apart from the put and get primitives for remote memory regions, DaCS offers
two other mechanisms for data transfer: mailboxes and message passing using
basic send and recv functions. These mechanisms are asymmetric: both
mailboxes and the send and recv functions operate between an HE and its
AEs only.
These functions use a wait identifier that must be explicitly reserved before
being used. The identifier should also be released when it is no longer in use.
DaCS services
The functions in the API are organized in groups as explained in the following
sections.
Group management
Groups are required for synchronization operations between tasks. Currently,
only barriers are implemented.
Message passing
Two primitives are provided for sending and receiving data by using a message
passing model. The operations are nonblocking and the calling task must wait
later for completion. The exchanges are point-to-point only, and one of the
endpoints must be an HE.
Mailboxes
Mailboxes provide an efficient message notification mechanism for small,
32-bit data between two separate processes. In the case of the Cell/B.E. processor,
mailboxes are implemented in the hardware by using an interrupt mechanism for
communication between the SPE and the PPE or other devices.
Synchronization
Mutexes might be required to protect the remote memory operations and
serialize access to shared resources.
Common patterns
The group, remote memory, and synchronization services are implemented with
a consistent set of create, share/accept, use, and release/destroy primitives.
The HE and AE sides pair up as follows:
The HE invites each AE to share the resource and waits for confirmation from
each AE, one by one. The AE accepts the invitation to share the resource.
The HE uses the resource. (The HE can take part in the sharing, but it is not
mandatory.) The AE uses the resource.
The HE destroys the shared resource, waiting for each AE to indicate that it no
longer uses the resource. The AE releases the resource, signaling to the HE
that the resource is no longer used.
For remote memory regions and groups, the HE-side and AE-side calls pair up
as follows:
dacs_remote_mem_share() pairs with dacs_remote_mem_accept()
dacs_remote_mem_destroy() pairs with dacs_remote_mem_release()
dacs_group_add_member() pairs with dacs_group_accept()
dacs_group_destroy() pairs with dacs_group_leave()
For mutexes, refer to Table 4-26; the pairs are as follows:
dacs_mutex_share() pairs with dacs_mutex_accept()
dacs_mutex_destroy() pairs with dacs_mutex_release()
Source code: The DaCS code example is part of the additional material for
this book. See “Data Communication and Synchronization programming
example” on page 617 for more details.
The DaCS libraries are fully supported by the debugging and tracing
infrastructure that is provided by the IBM SDK for Multicore Acceleration. The
sample code can be built with the debug and trace types of the DaCS library.
In the current release, no message passing between AEs is allowed, and complex
exchanges either require more HE intervention or must be implemented by using
the shared memory mechanisms (remote memory and mutexes). A useful
extension would be to allow AE-to-AE messages, because some data movement
patterns (pipeline, ring of tasks) would be easier to implement in DaCS that way.
However, we can always call libspe2 functions directly from within a DaCS task
to implement custom task synchronizations and data communications, although
this technique is not supported by the SDK.
ALF overview
With the ALF, the application developer is required to divide the application into
two parts: the control part and the computational kernels. The control part runs on
the host, which subcontracts accelerators to run the computational kernels.
These kernels take their input data from the host memory and write back the
output data to the host memory.
The ALF is an extension of the subroutine concept, with the difference that input
arguments and output data have to move back and forth between the host
memory and the accelerator memory, similar to the Remote Procedure Call
(RPC) model. The input and output data might have to be further divided into
blocks to be made small enough to fit the limited size of the accelerator’s
memory. The individual blocks are organized in a queue and are meant to be
independent of each other. The ALF run time manages the queue and balances
the work between the accelerators. The application programmer only has to put
the individual pieces of work in the queue.
Each MPI task must know about the other MPI tasks for synchronization and
message passing. However, an accelerator task is not required to know anything
about its host task nor about its siblings nor about the accelerators running on
behalf of foreign MPI tasks. An accelerator task has no visibility to the outside
world. It only answers to requests. It is fed with input data, does some
processing, and the output data it produces is sent back.
There is no need for an accelerator program to know about its host program
because the ALF run time handles all the data movement between the
accelerator memory and the host memory on behalf of the accelerator task. The
ALF run time does the data transfer by using various tricks, exploiting DMA,
double or triple buffering and pipelining techniques that the programmer does not
need to learn about. The programmer must only describe, generally at the host
level, the layout of the input and output data in host memory that the accelerator
task will work with.
The ALF gives a lot of flexibility to manage accelerator tasks. It supports the
Multiple Program Multiple Data (MPMD) model in two ways:
A subset of accelerator nodes can run task A providing the computational
kernel ckA, while another subset runs task B providing the kernel ckB.
A single accelerator task can perform only a single kernel at any one time.
However, there are ways for the accelerator to load a different kernel after
execution starts.
The ALF can also express dependencies between tasks by allowing for complex
ordering of tasks when synchronization is required.
On the host side, the application programmer must make calls to the ALF API to
perform the following actions:
Create the tasks
Split the work into work blocks, that is, describe the input and output data for
each block
Express the dependencies between tasks if needed
Put the work blocks in the queue
On the accelerator side, the programmer must only write the computational
kernels. As we see later, this is slightly oversimplified, because the separation of
duties between the host program and the accelerator program can become
blurred in the interest of performance.
Near the top is the host view, with presumably large memory areas holding the
input and output data for the function that is to be accelerated. In the middle, lies
the data partitioning where the input and output are split into smaller chunks,
small enough to fit in the accelerator memory. They are “work blocks.” At the
bottom, the accelerator tasks process one work block at a time. The data transfer
part between the host and accelerator memory is handled by the ALF run time.
Figure 4-13 on page 301 shows the split between the host task and the
accelerator tasks. On the host side, we create the accelerator tasks, create the
work blocks, enqueue the blocks, and wait for the tasks to collectively empty the
work block queue. On the accelerator task, for each work block, the ALF run time
fetches the data from the host memory, calls the user provided computational
kernel, and sends the output data back to the host memory.
55
You can find the Accelerated Library Framework for Cell Broadband Engine Programmer’s Guide
and API Reference on the Web at the following address:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/41838EDB5A15CCCD002573530063D465/$file/ALF_Prog_Guide_API_v3.0.pdf
Figure 4-13 The host and accelerator sides
// The input data list is ready. Fetch the items from the
// host into the task input buffer.
get_input_data();
Computational kernel
The ALF is used to offload the computation kernels of the application from the
host task to accelerator tasks running on the accelerator nodes, which are the
SPEs in our case. A computational kernel is a function that requires input, does
some processing, and produces output data. Within the ALF, a computational
kernel function has a given prototype to which application programmers must
conform. In the simplest case, an application programmer must implement a
302 Programming the Cell Broadband Engine Architecture: Examples and Best Practices
single routine, which is the one that does the computation. In the most general
case, up to five functions might need to be written:
The compute kernel
The input data transfer list prepare function
The output data transfer list prepare function
The task context setup function
The task context merge function
The full prototypes are given in Example 4-69 and taken from the alf_accel.h
file, which an accelerator program must include. The ALF run time fills in the
buffers (input, output, context) before calling the user-provided function. From an
application programmer’s perspective, the function gets called after the run time
has filled in all the necessary data, transferring the data from host memory to
accelerator memory, that is, the local storage. The programmers do not have to
be concerned with that. They are given pointers to where the data has been
made available. This is similar to what the shell does when the main() function of
a Linux program is called: the runtime system (the exec() system call, in this
case) has filled in the char *argv[] array for the program to use.
The computational kernel functions must be registered to the ALF run time in
order for them to be called when a work block is received. This is accomplished
by using export statements that usually come at the end of the accelerator
source code. Example 4-70 on page 305 shows the typical layout for an
accelerator task.
struct task_descriptor {
// The task context buffer holds status data for the task.
// It is loaded at task start time and can be copied back
// at task unloading time. It is meant to hold data that is
// kept across multiple invocations of the computational kernel
// with different work blocks
struct task_context {
task_context_buffer [SIZE];
task_context_entries [NUMBER];
};
// These are the names of the functions that this accelerator task
// implements. Only the kernel function is mandatory. The context
// setup and merge, if specified, get called upon loading and
// unloading of the task. The input and output data transfer list
// routines are called when the accelerator does the data
// partitioning itself.
struct accelerator_image {
char *compute_kernel_function;
char *input_dtl_prepare_function;
char *output_dtl_prepare_function;
char *task_context_setup_function;
char *task_context_merge_function;
};
struct work_blocks {
parameter_and_context_size;
input_buffer_size;
output_buffer_size;
overlapped_buffer_size;
number_of_dtl_entries;
};
The task context is a memory buffer that is used for two purposes:
Store persistent data across work blocks. For example, the programmer loads
state data that is to be read every time we process a work block. It contains
work block “invariants.”
Store data that can be reduced between multiple tasks. The task context can
be used to implement all-reduce (associative) operations such as min, max,
or global sum.
Work blocks
A work block is a single invocation of a task with a given set of input, output, and
parameter data. Single-use work blocks are processed only once, and multi-use
work blocks are processed up to total_count times.
The input and output data description of single-use work blocks can be
performed either at the host level or the accelerator level. For multi-use work
blocks, data partitioning is always performed at the accelerator level. The current
count of the multi-use work block and the total count are passed as arguments
every time the input list preparation routine, the compute kernel, and the output
data preparation routine are called.
With multi-use work blocks, the work block creation loop that was running on the
host task is now performed jointly by all the accelerator tasks that this host has
allocated. The only information that a given task needs to create, on the fly, the
proper input and output data transfer lists is the work block context buffer plus
the current and total counts. The previous host loop is now parallelized across
the accelerator tasks, which balance the work automatically between themselves.
The purpose of the multi-use work blocks is to make sure that the PPE, which
runs the host tasks, does not become a bottleneck, too busy creating work blocks
for the SPEs.
Data partitioning
Data partitioning ensures that each work block gets the right data to work with.
The partitioning can be performed at the host level or accelerator level. We use a
data transfer list, consisting of multiple entries of type <start address, type,
count> that describe from where, in the host memory, we must gather data to be
sent to the accelerators. The API calls differ depending on whether the host or
the accelerator performs the data partitioning.
Data sets
At the host level, data sets are created that assign attributes to the data buffers
that are to be used by the accelerator tasks. A memory region can be described
as read-only, read-write, or write-only. This information gives hints to the ALF run
time to help improve the data movement performance and scheduling.
Figure 4-14 shows the memory map of an ALF program. The user code contains
the computational kernels and optionally the input/output data transfer list
functions and context setup/merge functions.
As shown in Figure 4-15, five pointers to data buffers are passed to the
computational kernels (see Example 4-69 on page 303). Various combinations
are possible, depending on the use of overlapped buffers for input and output. In
the simplest case, no overlap exists between the input and output buffers.
In the most general case, three data buffer pointers can be defined as shown in
Figure 4-17.
Figure 4-17 Memory map with all five data pointers defined
API description
The API has two components: the host API and the accelerator API. A host
program must include the alf.h file, and an accelerator program must include
the alf_accel.h file.
The accelerator API is much leaner because it includes only a few functions to
perform the data partitioning. See Table 4-28.
ALF optimization tips
Apart from tuning the computational kernel itself and ensuring that the amount of
work per data communication is maximized, it can be beneficial to tune the data
movement part. To do so, explore the following techniques:
Data partitioning on the accelerator side
Multi-use work blocks
These techniques lower the workload on the host task, which otherwise might not
be able to keep up with the speed of the accelerators, thus becoming a
bottleneck for the whole application. Also, using data sets on the host side and
using overlapped input and output buffers whenever possible gives more
flexibility to the ALF run time to optimize the data transfers.
The examples in the following sections show the type of work that is involved
when accelerating applications with the ALF. Of particular interest in this respect
are the matrix_add and matrix_transpose examples.
Table 4-29 ALF examples in the IBM SDK for Multicore Acceleration
Example Description
matrix_add This example gives the steps that were taken to enable and tune this application
with the ALF and includes the following successive versions:
scalar: The reference version
host_partition: First ALF version, data partitioning on the host
host_partition_simd: The compute kernel is tuned using SIMD
accel_partition: Data partitioning performed by the accelerators
dataset: Use of the dataset feature
overlapped_io: Use of overlapped input and output buffers
PI Shows the use of task context buffers for global parameters and reduction
operations
pipe_line Shows the implementation of a pipeline by using task dependencies and task
context merge operations
matrix_transpose Like matrix_add, shows the steps going from a scalar version to a tuned ALF
version and includes the following successive versions:
scalar: The reference version
STEP1a: Using ALF and host data partitioning
STEP1b: Using accelerator data partitioning
STEP2: Using a tuned SIMD computational kernel
task_context Shows the use of the task context buffer for associative reduction operations (min,
max), a global sum, and a storage for a table lookup.
A software developer who starts developing an application for the Cell/B.E.
processor (or ports an existing application) can first check whether some parts of
the application are already implemented in one of the SDK's domain-specific
libraries. If they are, using the corresponding library can provide an easy way
to save development effort.
Most of those libraries are open source. Therefore, even if the exact functionality
that the application requires is not implemented, the programmer can use
those functions as a reference that can be customized and tailored when
developing the application-specific functions.
In the next four sections, we provide a brief description of the following libraries:
Fast Fourier Transform library
Monte Carlo libraries
Basic Linear Algebra Subprograms library
Matrix, large matrix, and vector libraries
While we found these libraries to be the most useful, SDK 3.0 provides several
other libraries. The Example Library API Reference document discusses the
additional libraries and provides a detailed description of some of the libraries
that are discussed in this chapter (and are described only briefly).56
56 The Example Library API Reference is available on the Web at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/3B6ED257EE6235D900257353006E0F6A/$file/SDK_Example_Library_API_v3.0.pdf
Fast Fourier Transform library
Both the FFT transform and inverse FFT transform are supported by this library.
To retrieve more information about this library, the programmer can do either of
the following actions:
Enter the following command on a system where the SDK is installed:
man /opt/cell/sdk/prototype/usr/include/libfft.3
Read the “Fast Fourier Transform (FFT) library” chapter in the Example
Library API Reference document.57
Another alternative library that implements FFT for the Cell/B.E. processor is the
FFTW library. The Cell/B.E. implementation of this library is currently available
only as an alpha preview release. For more information, refer to the following
Web address:
http://www.fftw.org/cell/
Monte Carlo libraries
For a detailed description of this library and how to use it, refer to the Monte
Carlo Library API Reference Manual document.58
57 See note 56 on page 315.
58 The Monte Carlo Library API Reference Manual is available on the Web at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/8D78C965B984D1DE00257353006590B7/$file/CBE_Monte-Carlo_API_v1.0.pdf
Basic Linear Algebra Subprograms library
The BLAS library is based upon a published standard interface (described in the
BLAS Technical Forum Standard document)59 for commonly used linear algebra
operations in high-performance computing (HPC) and other scientific domains. It
is widely used as the basis for other high quality linear algebra software, for
example LAPACK and ScaLAPACK. The Linpack (HPL) benchmark largely
depends on a single BLAS routine (DGEMM) for good performance.
The BLAS API is available as standard ANSI C and standard FORTRAN 77/90
interfaces. BLAS implementations are also available in open-source (netlib.org).
Based on its functionality, BLAS is divided into three levels:
Level 1 routines are for scalar and vector operations.
Level 2 routines are for matrix-vector operations.
Level 3 routines are for matrix-matrix operations.
The BLAS library in SDK 3.0 supports only real single precision and real double
precision versions.
Some of the routines have been optimized by using the SPEs and show a
marked increase in performance in comparison to the corresponding versions
that are implemented solely on the PPE. The optimized routines have an SPE
interface in addition to the PPE interface. However, the SPE interface does not
conform to the standard BLAS interface and provides a restricted version of the
standard BLAS interface.
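As an illustration of a Level 3 call through the standard C interface (a
sketch assuming the common CBLAS binding; n and the matrices are illustrative):

#include <cblas.h>

/* C = 1.0*A*B + 0.0*C for n x n single-precision matrices (SGEMM, a
 * Level 3 routine; DGEMM is the double-precision counterpart used by
 * the Linpack benchmark mentioned above). */
void multiply(int n, const float *a, const float *b, float *c)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
}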
59 You can find the BLAS Technical Forum Standard document on the Web at the following address: http://www.netlib.org/blas/blast-forum/
For a detailed description of this library and how to use it, refer to the BLAS
Programmer’s Guide and API Reference document.60
Matrix, large matrix, and vector libraries
The matrix library consists of various utility functions that operate on 4x4
matrices as well as quaternions. The library is supported on both the PPE and
SPE. In most cases, all 4x4 matrices are maintained as an array of four 128-bit
SIMD vectors, while both single-precision and double-precision operands are
supported.
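A small sketch of this layout and the kind of operation it enables (our own
illustration, not a function from the library):

#include <spu_intrinsics.h>

/* A 4x4 single-precision matrix kept as four 128-bit SIMD vectors. */
typedef struct {
    vector float row[4];
} mat44;

/* Scaling touches one row per SIMD multiply instead of four scalars. */
void mat44_scale(mat44 *m, float s)
{
    vector float vs = spu_splats(s);
    int i;
    for (i = 0; i < 4; i++)
        m->row[i] = spu_mul(m->row[i], vs);
}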
The large matrix library consists of various utility functions that operate on large
vectors and large matrices of single precision floating-point numbers. The size of
the input vectors and matrices is limited by the local storage size of the SPE.
This library is currently only supported on the SPE.
60 The BLAS Programmer's Guide and API Reference is available at the following address: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/F6DF42E93A55E57400257353006480B2/$file/BLAS_Prog_Guide_API_v3.0.pdf
The matrix and large matrix libraries support different matrix operations, such
as multiplication, addition, transpose, and inverse. Similar to the SIMDmath and MASS
libraries, the libraries can be used either as a linkable library archive or as a set
of inline function headers. For more details, see “SIMD Math Library” on
page 262 and “MASS and MASSV libraries” on page 264.
The vector library consists of a set of general purpose routines that operate on
vectors. This library is supported on both the PPE and SPE.
For a detailed description of the three libraries and how to use them, see the
“Matrix library,” “Large matrix library,” and “Vector library” chapters in the Example
Library API Reference document.61
61 See note 56 on page 315.
4.8.2 SPE programming guidelines
In this section, we briefly summarize the programming guidelines for optimizing
the performance of SPE programs. The intention is to address programming issues
that are related only to programming the SPU itself, without interacting with
external components, such as PPE, other SPEs, or main storage.
Since almost any SPE program needs to interact with external components, we
recommend that you become familiar with the programming guidelines in 4.8,
“Programming guidelines” on page 319.
General
Avoid overuse of 128-byte alignment. Consider the cases in which alignment is
essential, such as for data transfers that are performed often, and use a
smaller alignment, for example 16 bytes, for the other cases (see the sketch at
the end of this list). There are two main
reasons why 128-byte alignment can reduce the performance:
– The usage of 128-byte alignment requires the variables to be defined
globally. This causes the program to use more registers and reduces the
number of free registers, which also reduces performance through, for
example, increased loads and stores, increased stack size, and reduced
loop unrolling. Therefore, if only a smaller alignment is required, for
example 16-byte alignment, you can keep the variables local, which can
significantly increase performance.
– The usage of 128-byte alignment increases the code size because of the
padding that is added by the compiler to make the data aligned.
Avoid writing recursive SPE code that uses a lot of stack variables because
they can cause stack overflow errors. The compilers provide support for
runtime stack overflow checking that can be enabled during application
debugging of such errors.
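The following sketch contrasts the two alignment cases (buffer names and sizes
are illustrative):

#include <spu_intrinsics.h>

/* 128-byte alignment only for a buffer that is a frequent DMA target;
 * it must be defined globally, which is exactly the register-pressure
 * cost described above. */
static char dma_buf[16384] __attribute__((aligned(128)));

void compute(void)
{
    /* 16-byte alignment is enough for plain SIMD access, so the array
     * can stay local and register allocation stays effective. */
    vector float tmp[64] __attribute__((aligned(16)));
    tmp[0] = spu_splats(1.0f);
    dma_buf[0] = (char)spu_extract(tmp[0], 0);  /* keep both referenced */
}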
Intrinsics
Use intrinsics to achieve machine-level control without needing to write
assembly language code.
Understand how the intrinsics map to assembly instructions and what the
assembly instructions do.
Local storage
Design for the LS size. The LS holds up to 256 KB for the program, stack,
local data structures, heap, and DMA buffers. You can do a lot with 256 KB,
but be aware of this size.
Loops
If the number of loop iterations is a small constant, then consider removing
the loop altogether.
If the number of loop iterations is variable, consider unrolling the loop as long
as the loop is relatively independent. That is, an iteration of the loop does not
depend upon the previous iteration. Unrolling the loops reduces
dependencies and increases dual-issue rates. By doing so, compiler
optimization can exploit the large SPU register file.
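For example (a trivial sketch; n is assumed to be a multiple of 4 in the
unrolled version):

/* Rolled loop: iterations are independent, a good unrolling candidate. */
void add_rolled(float *c, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Unrolled by four: more independent operations in flight improves
 * dual-issue rates, and the large SPU register file absorbs the extra
 * temporaries. */
void add_unrolled(float *c, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}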
SIMD programming
Exploit SIMD programming as much as possible, which can increase the
performance of your application in several ways, especially in regard to
computation bounded programs.
Consider using the compiler auto-SIMDizing feature. This feature can convert
ordinary scalar code into SIMD code. Be aware of the compiler's limitations
and of the code structures that are supported for auto-SIMDizing, and try to
code according to them. For more information, see 4.6.5,
“Auto-SIMDizing by compiler” on page 270.
An alternative is to explicitly do SIMD programming by using intrinsics, SIMD
libraries, and supported data types. For more information, see 4.6.4, “SIMD
programming” on page 258.
Not all SIMD operations are supported by all data types. Review the
operations that are critical to your application and verify in which data types
they are supported.
Choose an SIMD strategy that is appropriate for the application that is under
development. The two common strategies are AOS and SOA:
– Array-of-structure organization
From a programmer's point of view, you can have more-efficient code size
and simpler DMA needs, but SIMDizing is more difficult. From a computation
point of view, this organization can be less efficient, though that depends
on the specific application.
– Structure-of-arrays organization
From a programmer’s point of view, this organization is usually easier to
SIMDize, but the data must be maintained in separate arrays or the SPU
must shuffle the AOS data into an SOA form.
If the data is in AOS organization, consider converting the AOS data to SOA
at run time, performing the calculations, and converting the results back. For
more information, see "Data organization: AOS versus SOA" on page 267.
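A sketch of such a runtime conversion for four 4-element structures, done as a
standard 4x4 transpose with spu_shuffle (the patterns interleave 32-bit words
from the two source vectors; byte values 0x10 and above select from the second
operand):

#include <spu_intrinsics.h>

/* Convert four AOS vectors {x,y,z,w} into SOA vectors {x0..x3}, {y0..y3},
 * {z0..z3}, and {w0..w3}. */
void aos_to_soa(const vector float r[4], vector float soa[4])
{
    const vector unsigned char lo = {0x00,0x01,0x02,0x03, 0x10,0x11,0x12,0x13,
                                     0x04,0x05,0x06,0x07, 0x14,0x15,0x16,0x17};
    const vector unsigned char hi = {0x08,0x09,0x0A,0x0B, 0x18,0x19,0x1A,0x1B,
                                     0x0C,0x0D,0x0E,0x0F, 0x1C,0x1D,0x1E,0x1F};
    const vector unsigned char ab_lo = {0x00,0x01,0x02,0x03, 0x04,0x05,0x06,0x07,
                                        0x10,0x11,0x12,0x13, 0x14,0x15,0x16,0x17};
    const vector unsigned char ab_hi = {0x08,0x09,0x0A,0x0B, 0x0C,0x0D,0x0E,0x0F,
                                        0x18,0x19,0x1A,0x1B, 0x1C,0x1D,0x1E,0x1F};
    vector float t0 = spu_shuffle(r[0], r[1], lo);   /* x0 x1 y0 y1 */
    vector float t1 = spu_shuffle(r[0], r[1], hi);   /* z0 z1 w0 w1 */
    vector float t2 = spu_shuffle(r[2], r[3], lo);   /* x2 x3 y2 y3 */
    vector float t3 = spu_shuffle(r[2], r[3], hi);   /* z2 z3 w2 w3 */
    soa[0] = spu_shuffle(t0, t2, ab_lo);             /* x0 x1 x2 x3 */
    soa[1] = spu_shuffle(t0, t2, ab_hi);             /* y0 y1 y2 y3 */
    soa[2] = spu_shuffle(t1, t3, ab_lo);             /* z0 z1 z2 z3 */
    soa[3] = spu_shuffle(t1, t3, ab_hi);             /* w0 w1 w2 w3 */
}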
In general, note the following guidelines regarding SIMD programming:
– Use of auto-SIMDizing requires less development time, but in most cases,
the performance is inferior compared to explicit SIMD programming,
unless the program can perfectly fit into the code structures that are
supported by the compiler for auto-SIMDizing.
– By contrast, explicit SIMD programming provides the best
performance, but it requires non-negligible development effort.
Scalars
Because SPUs only support quadword loads and stores, scalar loads and
stores (less than a quadword) are slow, with long latency.
Align the scalar loads that are often used with the quadword address to
improve the performance of operations that are done on those scalars.
Cluster scalars into groups and load multiple scalars at a time by using
quadword memory access. Later use extract or insert intrinsics to explicitly
move between scalars and vector data types, which eliminates redundant
loads and stores.
For details, see 4.6.6, "Working with scalars and converting between different
vector types" on page 277.
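For example (an illustrative sketch):

#include <spu_intrinsics.h>

/* Four related scalars clustered in one quadword: a single aligned load
 * brings them all in, and the extract/insert intrinsics move individual
 * elements without redundant loads and stores. */
static vector float params __attribute__((aligned(16)));

float get_scale(void)    { return spu_extract(params, 2); }
void  set_scale(float s) { params = spu_insert(s, params, 2); }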
Branches
Eliminate nonpredicted branches by using select bit intrinsics (spu_sel).
For branches that are highly predicted, use the __builtin_expect directive to
explicitly do direct branch prediction. Compiler optimization can add the
corresponding branch hint in this case.
Inline functions that are called often, either by explicitly defining them as
inline in the program or by using compiler optimization.
Use feedback-directed optimization, for example by using the FDPRPro tool.
For details, see 4.6.8, “Eliminating and predicting branches” on page 284.
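Both guidelines in a short sketch (handle_error is a hypothetical routine):

#include <spu_intrinsics.h>

/* Branch-free maximum: the comparison produces a mask and spu_sel picks
 * per element, so there is no branch to mispredict. */
vector float vmax(vector float a, vector float b)
{
    return spu_sel(b, a, spu_cmpgt(a, b));
}

/* A highly predictable branch: __builtin_expect tells the compiler the
 * error path is unlikely, so it can place the branch hint accordingly. */
void check(int err)
{
    extern void handle_error(int err);   /* hypothetical */
    if (__builtin_expect(err != 0, 0))
        handle_error(err);
}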
Dual issue
Choose intrinsics carefully to maximize dual-issue rates or reduce latencies.
Dual issue occurs if a pipe-0 instruction is even-addressed, a pipe-1
instruction is odd-addressed, and there are no dependencies (operands are
available).
Manually insert nops to align instructions for dual issue when writing
non-optimizing assembly programs. In other cases, the compilers
automatically insert nops when needed.
Use software pipeline loops to improve dual-issue rates as described in the
“Loop unrolling and pipelining” chapter in the Cell Broadband Engine
Programming Handbook.64
Understand the fetch and issue rules to maximize the dual-issue rate. The
rules are explained in the “SPU pipelines and dual-issue rules” chapter in the
Cell Broadband Engine Programming Handbook.
Avoid overusing the odd pipeline for load instructions, which can cause
instruction starvation. This can happen, for example, on a large matrix
transpose on SPE when there are many loads on an odd pipeline and minimal
usage of the even pipeline for computation. A similar case can happen for the
dot product of large vectors. To solve this problem, the programmer can add
more computation on the data that is being loaded.
64 See note 1 on page 78.
4.8.3 Data transfers and synchronization guidelines
In this section, we provide a summary of the programming guidelines for
performing efficient data transfers and synchronization in a Cell/B.E. program:
Choose the transfer mechanism that fits your application data access pattern:
– If the pattern is predictable, for example sequential access array or matrix,
use explicit DMA requests to transfer data. The requests can be
implemented with SDK core libraries functions.
– If the pattern is random or unpredictable, for example a sparse matrix
operation, consider using the software-managed cache, especially if there is
a high ratio of data re-use, for example the same data or cache line is used
in different iterations of the algorithm.
When using the core libraries to explicitly initiate DMA transfers:
– Follow the supported and recommended values for the DMA parameters.
See “Supported and recommended values for DMA parameters” on
page 116 and “Supported and recommended values for DMA-list
parameters” on page 117.
– DMA throughput is maximized if the transfers are at least 128 bytes, and
transfers greater than or equal to 128 bytes should be cache-line aligned
(aligned to 128 bytes). This refers to the data transfer size and source and
destination addresses as well.
– Overlap DMA data transfers with computation by using a double-buffering
or multibuffering mechanism (see the sketch at the end of this section).
See 4.3.7, "Efficient data transfers by overlapping DMA and computation"
on page 159.
– Minimize small transfers. Transfers of less than one cache line consume a
bus bandwidth that is equivalent to a full cache-line transfer.
When explicitly using the software-managed cache, try to exploit the unsafe
asynchronous mode because it can provide significantly better results than
the safe synchronous mode. A double-buffering mechanism can also be
implemented by using this mode.
Uniformly distribute memory bank accesses. The Cell/B.E. memory
subsystem has 16 banks, interleaved on cache line boundaries. Addresses that
are 2 KB apart access the same bank. System memory throughput is maximized
if all memory banks are uniformly accessed.
Exploit the on-chip data transfer and communication mechanism by using
LS-to-LS DMA transfers when sharing data between SPEs and using
mailboxes, signal notification registers for small data communications, and
synchronization. The reason is that the EIB provides significantly more
bandwidth than system memory (on the order of 10 times or more).
Be aware that when an SPE receives notification that a DMA put data transfer
has completed, the local MFC has completed the transaction from its side, but
this does not necessarily mean that the data is already stored in memory.
Therefore, the data might not be accessible yet to other processors.
Use explicit commands to force data ordering when sharing data between the
SPEs and the PPE and among the SPEs themselves because the CBEA
does not guarantee such ordering between the different storage domains.
Coherency is guaranteed on each of the memory domains separately:
– The DMA can be re-ordered compared to the order in which the SPE
program initiates the corresponding DMA commands. Explicit DMA
ordering commands must be issued to force ordering.
– Use fence or barrier DMA commands to order DMA transfers within a tag
group.
– Use a barrier command to order DMA transfers within the queue.
– Minimize the use of such ordering commands because they have a
negative effect on the performance.
– See 4.5, “Shared storage synchronizing and data ordering” on page 218,
for more information and 4.5.4, “Practical examples of using ordering and
synchronization mechanisms” on page 240, for practical scenarios.
Use affinity to improve the communication between SPEs (for example
LS-to-LS DMA data transfer, mailbox, and signals). See 4.1.3, “Creating
SPEs affinity by using a gang” on page 94, for more information.
Minimize the use of atomic, synchronizing, and data-ordering commands
because they can add significant overhead.
Atomic operations operate on reservation blocks that correspond to 128-byte
cache lines. As a result, place synchronization variables in their own cache
line so that other non-atomic loads and stores do not cause inadvertent lost
reservations.
Avoid having the PPE wait for the SPEs to complete by polling the SPE outbox
mailbox.
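The double-buffering sketch promised above (a minimal illustration using the
spu_mfcio.h calls; CHUNK, the effective address ea, and process_chunk are
illustrative):

#include <spu_mfcio.h>

#define CHUNK 4096
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data);   /* hypothetical compute kernel */

/* While the SPU computes on buf[cur], the MFC fetches the next chunk into
 * buf[1-cur]; sizes and addresses are 128-byte multiples, as recommended. */
void process_stream(unsigned long long ea, int nchunks)
{
    int i, cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);
    for (i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)   /* start fetching chunk i+1 early */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);
        mfc_write_tag_mask(1 << cur);    /* wait only for chunk i */
        mfc_read_tag_status_all();
        process_chunk(buf[cur]);
        cur = next;
    }
}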
In contrast, the synergistic processor units (SPUs) are not intended to run an
operating system. SPE programs can access the main-storage address space,
called the effective address (EA) space, only indirectly through the direct memory
access (DMA) controller in the memory flow controller (MFC).
The two processor architectures are different enough to require two distinct
toolchains for software development.
The toolchains for both the PPE and SPEs produce object files in the Executable
and Linking Format (ELF). The ELF is a flexible, portable container for
re-locatable, executable, and shared object (dynamically linkable) output files of
assemblers and linkers. The terms PPE-ELF and SPE-ELF are used to
differentiate between ELF for the two architectures.
Figure 5-1 CESOF layout
This release of the GNU toolchain includes a GCC compiler and utilities that
optimize code for the Cell/B.E. processor:
The spu-gcc compiler for creating an SPU binary
The ppu-embedspu (and ppu32-embedspu) tool that enables an SPU binary
to be linked with a PPU binary into a single executable program
The ppu-gcc (and ppu32-gcc) compiler
Creating a program
In the following scenario, we create the executable program, called simple, that
contains the SPU code, simple_spu.c, and PPU code, simple.c:
1. Compile and link the SPE executable as shown in Example 5-1.
2. (Optional) Run the embedspu command to wrap the SPU binary into a CESOF
linkable file that contains additional PPE symbol information. See
Example 5-2.
3. Compile the PPE side and link it together with the embedded SPU binary
(Example 5-3).
Alternatively compile the PPE side and link it directly with the SPU binary
(Example 5-4). The linker invokes embedspu and uses the file name of the
SPU binary as the name of the program handle struct.
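The referenced examples follow this general shape (a sketch of typical command
lines for the scenario above; exact options can vary):

spu-gcc -o simple_spu simple_spu.c
ppu-embedspu simple_spu simple_spu simple_spu-embed.o
ppu-gcc -o simple simple.c simple_spu-embed.o -lspe2

Alternatively, for the direct-link variant, the last two steps collapse into
one:

ppu-gcc -o simple simple.c simple_spu -lspe2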
ppu-gcc options
The ppu-gcc compiler offers the following options:
-mcpu=cell | celledp
This option selects the instruction set architecture to generate code for,
either Cell/B.E. or PowerXCell. Code that is compiled with -mcpu=celledp can
use the new instructions but does not run on the Cell/B.E. processor.
-mcpu=celledp also implies -mtune=celledp.
-m32
Selects the 32-bit option. The ppu32-gcc defaults to 32 bit.
-m64
Selects the 64-bit option. The ppu-gcc defaults to 64 bit.
-maltivec
This option enables code generation that uses AltiVec vector instructions (the
default in ppu-gcc).
spu-gcc options
The spu-gcc compiler offers the following options:
-march=cell | celledp
Selects between the CBEA and the PowerXCell architecture, as well as its
registers, mnemonics, and instruction scheduling parameters.
-mtune=cell | celledp
Tunes the generated code for either the Cell/B.E. or PowerXCell architecture.
It mostly affects the instruction scheduling strategy.
-mfloat=accurate | fast
Selects whether to use the fast fused-multiply operations for
single-precision floating point operations or to enable calls to library
routines that implement more precise operations.
-mdouble=accurate | fast
Selects the same trade-off for double-precision floating point operations.
-mstdmain
Provides a standard argv/argc calling convention for the main SPU function.
-fpic
Generates position-independent code.
-mwarn-reloc
-merror-reloc
Issues a warning (-mwarn-reloc) or an error (-merror-reloc) if the resulting
code requires load-time relocations.
-msafe-dma
-munsafe-dma
Controls whether compiler optimizations may move load and store instructions
past DMA operations: -msafe-dma prevents such reordering, and -munsafe-dma
allows it.
-ffixed-<reg>
-mfixed-range=<reg>-<reg>
Reserves a specific register, or a range of registers, for the user
application.
Language options
The GNU GCC compiler offers the following language extensions to provide
access to specific Cell/B.E. architectural features, from a programmer’s point of
view:
Vectors
GNU compiler language support offers the vector data type for both PPU and
SPU, with support for arithmetic operations. Refer to 4.6.4, "SIMD
programming" on page 258, for more details, and see the sketch after this list.
Intrinsics
The full set of AltiVec and SPU intrinsics are available. Refer to 4.6.2, “SPU
instruction set and C/C++ language extensions (intrinsics)” on page 249, for
more details.
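A short sketch combining both extensions (our own illustration):

#include <spu_intrinsics.h>

/* The GNU vector extension lets arithmetic operators act on vector types
 * directly; intrinsics such as spu_madd mix in freely. */
vector float axpy(vector float a, vector float x, vector float y)
{
    vector float sum = x + y;        /* operator form on vectors */
    return spu_madd(a, x, sum);      /* fused multiply-add: a*x + sum */
}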
Optimization
The GNU compiler offers mechanisms to optimize code generation specifically to
the underlying architecture. For a complete reference of all optimization-related
options, consult the GCC manual on the Web at the following address:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Optimize-Options.html
Tip: When in doubt, use the following set of compiler options to generate
optimized code for SPE:
-O3 -funroll-loops -fmodulo-sched -ftree-vectorize -ffast-math
The -ffast-math option allows optimizations that relax strict Institute of
Electrical and Electronics Engineers (IEEE) semantics. For example, with this
option, the vectorizer can change the order of computation of complex
expressions.
The benefits of such a technique are to avoid function call overhead and to keep
the function as a whole for combined optimization. The disadvantages are an
increase in code size and compilation time.
Auto-vectorization
The auto-vectorization feature, which is enabled by the -ftree-vectorize switch,
automatically detects situations in the source code where loop-over-scalar
instructions can be transformed into loop-over-vector instructions. This feature is
usually beneficial on the SPE. In cases where the compiler manages to
transform a loop that is performance-critical to the overall application, a
significant speedup can be observed.
If your code has loops that you think should be vectorized, but are not, you can
use the -ftree-vectorizer-verbose=[X] option to determine why this occurs.
X=1 is the least amount of output, and X=6 yields the largest amount of output.
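For example, a loop of the following shape normally vectorizes, and the
verbose option reports the decision (saxpy.c is an illustrative file name):

/* saxpy.c */
void saxpy(float a, const float *x, float *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

spu-gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=4 -c saxpy.c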
[Figure: profile-directed optimization flow. The source file src.c is compiled with gcc -ftest-coverage or gcc -fprofile-generate; a test run produces profile data (src.gcda), which gcov turns into a coverage report and gcc -fprofile-use turns into an optimized executable.]
For full documentation regarding the IBM XL C/C++ compiler, refer to the XL
C/C++ Library at the following address:
http://www.ibm.com/software/awdtools/xlcpp/library/
Optimization
The IBM XL C/C++ compiler introduces several innovations, especially with
regard to the optimization options. We discuss the general concepts that are
involved and provide some useful tips.
Levels 2 and 3
The -O2 level brings comprehensive low-level optimizations while keeping
partial support for debugging:
– Global assignment of user variables to registers
– Strength reduction and effective usage of addressing modes
– Elimination of unused or redundant code
– Movement of invariant code out of loops
– Scheduling of instructions for the target machine
– Some loop unrolling and pipelining
– Visible externals and parameter registers at procedure boundaries
– Additional program points for storage visibility created by the Snapshot™
pragma/directive
– Parameters forced to memory on entry, by the -qkeepparm option, so that
they can be visible in a stack trace
The -O3 level has extensive optimization, but might introduce precision
trade-offs:
– Deeper inner loop unrolling
– Loop nest optimizations, such as unroll-and-jam and interchange (-qhot
subset)
– Better loop scheduling
– Additional optimizations allowed by -qnostrict
– Widened optimization scope (typically the whole procedure)
– No implicit memory usage limits (-qmaxmem=-1)
– Reordering of floating point computations
– Reordering or elimination of possible exceptions, for example divide by
zero, overflow
To achieve the most of the level 2 and 3 optimizations, consider the following
guidance:
Ensure that your code is standard-compliant and, if possible, test and debug
your code without optimization before using -O2.
With regard to C code, ensure that pointer use follows type restrictions
(generic pointers should be char* or void*), and verify that all shared
variables, and pointers to them, are marked as volatile.
Try to be uniform, by compiling as much of your code as possible with -O2. If
you encounter problems with -O2, consider using -qalias=noansi or
-qalias=nostd rather than turning off optimization.
Use -O3 on as much code as possible. If you encounter problems or
performance degradations, consider using –qstrict, -qcompact, or -qnohot
along with -O3 where necessary. If you still have problems with -O3, switch to
-O2 for a subset of files or subroutines, but consider using -qmaxmem=-1,
-qnostrict, or both.
The -qhot option is designed to be used with other optimization levels, such as
-O2 and -O3, since it has a neutral effect if no optimization opportunities exist.
The link-time optimization can be enabled per compile unit (compile step) or on
the whole program (compile and link), where it expands its reach to the whole
final artifact (executable or library).
– inline=: Provides precise user control of inlining.
– Fine-tuning suboptions: Specify library code behavior, tune program
partitioning, or read commands from a file.
Although -qipa works when building executables or shared libraries, make sure
that you compile the main and exported functions with -qipa. Again, try to
apply this option as much as possible.
Levels 4 and 5
Optimization levels 4 (-O4) and 5 (-O5) automatically apply all of the previous
optimization level techniques (-O3). Additionally, they include their own
"packaged" options:
-qhot
-qipa
-qarch=auto
-qtune=auto
-qcache=auto
(In -O5 only) -qipa=level=2
Although the compiler does a thorough analysis to produce the best-fit
auto-vectorized code, the programmer can still influence the overall process
and make it more efficient. Consider the following most relevant tips:
Loop structure
– Inline function calls inside innermost loops, either automatically (-O5 is
more aggressive) or by using inline pragmas/directives.
Obviously, if you already manually unrolled any of the loops, it becomes more
difficult for the SIMDization process. Even in that case, you can manually instruct
the compiler to skip those loops:
#pragma nosimd (right before the innermost loop)
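For example (an illustrative fragment):

/* Suppose this innermost loop was manually unrolled or is otherwise
 * unsuited to SIMDization; the pragma makes the compiler skip it. */
void scale_rows(float out[][64], float in[][64], int rows)
{
    int i, j;
    for (j = 0; j < rows; j++) {
#pragma nosimd
        for (i = 0; i < 64; i++)
            out[j][i] = in[j][i] * 2.0f;
    }
}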
Changing the environment
The environment variables in the /opt/cell/sdk/buildutils/make.* files are used to
determine which compiler is used to build the examples. The
/opt/cell/sdk/buildutils/cellsdk_select_compiler script can be used to switch the
compiler. This command has the following syntax:
/opt/cell/sdk/buildutils/cellsdk_select_compiler [xlc | gcc]
In this command, the xlc flag selects the XL C/C++ compiler and the gcc flag
selects the GCC compiler. The default, if unspecified, is to compile the examples
with the GCC compiler. After selecting a particular compiler, that same compiler
is used for all future builds, unless it is specifically overwritten by the shell
environment variables SPU_COMPILER, PPU_COMPILER,
PPU32_COMPILER, or PPU64_COMPILER.
5.3 Debugger
The debugger is a tool to help find and remove problems in your code. In addition
to fixing problems, the debugger can help you understand the program, because
it typically gives you memory and register contents, call stack traces, and
step-by-step execution. Figure 5-6 highlights the debug stage of the process.
In this section, we describe how to debug Cell/B.E. software by using the new
and extended features of the GDB that is supplied with SDK 3.0.
When you use the top-level makefiles of the SDK, you can specify the -g option
on compilation commands by setting the CC_OPT_LVL makefile variable to -g.
Regardless of the method that you choose, after you start the application under
ppu-gdb, you can use the standard GDB commands that are available to debug
the application. For more information, refer to the GDB user manual, which is
available from the GNU Web site at the following address:
http://www.gnu.org/software/gdb/gdb.html
Debugging multi-threaded code
Typically a simple program contains only one thread. For example, a PPU “hello
world” program is run in a process with a single thread and the GDB attaches to
that single thread.
On many operating systems, a single program can have more than one thread.
With the ppu-gdb program, you can debug programs with one or more threads.
The debugger shows all threads while your program runs, but whenever the
debugger runs a debugging command, the user interface shows the single thread
involved. This thread is called the current thread. Debugging commands always
show program information from the point of view of the current thread.
For more information about GDB support for debugging multithreaded programs,
see “Debugging programs with multiple threads” and “Stopping and starting
multi-thread programs” in the GDB user manual, which is available from the GNU
Web site at the following address:
http://www.gnu.org/software/gdb/gdb.html
The info threads command shows the set of threads that are active for the
program. The thread command can be used to select the current thread for
debugging.
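A typical interaction looks like the following:

(gdb) info threads
(gdb) thread 2

The first command lists the active threads; the second makes thread 2 the
current thread.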
Debugging architecture
On the Cell/B.E. processor, a thread can run on either the PPE or on an SPE at
any given point in time. All threads, both the main thread of execution and
secondary threads that are started by using the pthread library, start execution
on the PPE. Execution can switch from the PPE to an SPE when a thread
executes the spe_context_run function. See SPE Runtime Management Library
Version 2.2, SC33-8334-01, which is available on the Web at the following
address:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B321111
2587257242007883F3
Conversely, a thread that currently executes on an SPE can switch to use the
PPE when executing a library routine that is implemented via the PPE-assisted
call mechanism. See the Cell BE Linux Reference Implementation Application
Binary Interface Specification document for details.
When you choose a thread to debug, the debugger automatically detects the
architecture that the thread is currently running on. If the thread is currently
running on the PPE, the debugger uses the PowerPC architecture. If the thread
is currently running on an SPE, the debugger uses the SPE architecture. A
thread that is currently executing code on an SPE can also be referred to as an
SPE thread.
Using scheduler-locking
Scheduler-locking is a feature of GDB that simplifies multithread debugging by
enabling you to control the behavior of multiple threads when you single-step
through a thread. By default, scheduler-locking is off, which is the recommended
setting.
If scheduler-locking is turned on, there is the potential for deadlocking where one
or more threads cannot continue to run. Consider, for example, an application
that consists of multiple SPE threads that communicate with each other through
a mailbox. Let us assume that you single-step one thread across an instruction
that reads from the mailbox, and that mailbox happens to be empty at the
moment. This instruction (and thus the debugging session) blocks until another
thread writes a message to the mailbox. However, if scheduler-locking is on, the
other thread remains stopped by the debugger because you are single-stepping.
In this situation, none of the threads can continue, and the whole program stalls
indefinitely. This situation cannot occur when scheduler-locking is off, because all
other threads continue to run while the first thread is single-stepped. Ensure that
you enable scheduler-locking only for applications where such deadlocks cannot
occur.
There are situations where you can safely set scheduler-locking on, but you
should do so only when you are sure there are no deadlocks.
You set this mode with the set scheduler-locking <mode> command, where mode
has one of the following values:
off
on
step
You can check the scheduler-locking mode with the following command:
show scheduler-locking
The per-frame architecture selection allows the Cell/B.E. back ends to
represent the flow of control that switches from PPE code to SPE code and back.
The full back-trace can represent the following program state, for example:
1. (Current frame) PPE libc printf routine
2. PPE libspe code implementing a PPE-assisted call
3. SPE newlib code that issued the PPE-assisted call
4. SPE user application code that called printf
5. SPE main
6. PPE libspe code that started SPE execution
7. PPE user application code that called spe_context_run
8. PPE main
When you choose a particular stack frame to examine by using the frame, up, or
down commands, the debugger switches its notion of the current architecture as
appropriate for the selected frame. For example, if you use the info registers
command to look at the selected frame’s register contents, the debugger shows
the SPE register set if the selected frame belongs to an SPE context. It shows
the PPE register set if the selected frame belongs to PPE code.
You can use breakpoints for both PPE and SPE portions of the code. In some
instances, however, GDB must defer insertion of a breakpoint because the code
that contains the breakpoint location has not yet been loaded into memory. This
occurs when you want to set the breakpoint for code that is dynamically loaded
later in the program. If ppu-gdb cannot find the location of the breakpoint, it sets
the breakpoint to pending. When the code is loaded, the breakpoint is inserted
and the pending breakpoint is deleted. You can use the set breakpoint command
to control the behavior of GDB when it determines that the code for a breakpoint
location is not loaded into memory.
To catch the most common usage cases, GDB uses the following rules when
looking up a global symbol:
If the command is issued while currently debugging PPE code, the debugger
first attempts to look up a definition in the PPE executable. If none is found,
the debugger searches all currently loaded SPE executables and uses the
first definition of a symbol with the given name it finds.
When referring to a global symbol from the command line while currently
debugging SPE context, the debugger first attempts to look up a definition in
that SPE context. If none is found there, the debugger continues to search the
PPE executable and all other currently loaded SPE executables and uses the
first matching definition.
Note: The set spu stop-on-load command has no effect in the SPU
standalone debugger spu-gdb. To let an SPU standalone program proceed to
its “main” function, you can use the start command in spu-gdb.
The spu stop-on-load command has the following syntax, where mode is either
on or off:
set spu stop-on-load <mode>
To check the status of the spu stop-on-load, use the following command:
show spu stop-on-load
If you are working in GDB, you can access help for these new commands. To
access help, type the help info spu command, followed by the info spu
subcommand name, which displays full documentation. Command name
abbreviations are allowed if they are unambiguous.
Example 5-8 Output of the info spu mailbox command
(gdb) info spu mailbox
SPU Outbound Mailbox
0x00000000
SPU Outbound Interrupt Mailbox
0x00000000
SPU Inbound Mailbox
0x00000000
0x00000000
0x00000000
0x00000000
The simulator infrastructure is designed for modeling processor and system-level
architecture at levels of abstraction, which vary from functional to performance
simulation models with a number of hybrid fidelity points in between:
Functional-only simulation
This simulation models the program-visible effects of instructions without
modeling the time it takes to run these instructions. Functional-only simulation
assumes that each instruction can be run in a constant number of cycles.
Memory access is synchronous and is also performed in a constant number
of cycles. This simulation model is useful for software development and
debugging when a precise measure of execution time is not significant.
Functional simulation proceeds much more rapidly than performance
simulation, and therefore, is useful for fast-forwarding to a specific point of
interest.
Performance simulation
For system and application performance analysis, the simulator provides
performance simulation (also referred to as timing simulation). A
performance simulation model represents internal policies and mechanisms
for system components, such as arbiters, queues, and pipelines. Operation
latencies are modeled dynamically to account for both processing time and
resource constraints. Performance simulation models have been correlated
against hardware or other references to acceptable levels of tolerance.
The simulator for the Cell/B.E. processor provides a cycle-accurate SPU core
model that can be used for performance analysis of computation-intensive
applications. The simulator for SDK 3.0 provides additional support for
performance simulation, which is described in the IBM Full-System Simulator
Users Guide and Performance Analysis document.
The simulator can also be configured to fast forward the simulation, by using a
functional model, to a specific point of interest in the application and to switch to
a timing-accurate mode to conduct performance studies. This means that various
types of operational details can be gathered to help you understand real-world
hardware and software systems.
The systemsim script, which is found in the simulator's bin directory, launches
the simulator. The -g parameter starts the graphical user interface (GUI).
You can use the GUI of the simulator to gain a better understanding of the CBEA.
For example, the simulator shows two sets of the PPE state. This is because the
PPE processor core is dual-threaded, and each thread has its own registers and
context. You can also look at the state of the SPEs, including the state of their
MFC.
To specify that you want to update the sysroot image file with any changes made
in the simulator session, change the newcow parameter on the mysim bogus disk
init command in .systemsim.tcl to rw (specifying read/write access) and remove
the last two parameters. The following changed line is from .systemsim.tcl:
mysim bogus disk init 0 $sysrootfile rw
When running the simulator with read/write access to the sysroot image file,
ensure that the file system in the sysroot image file is not corrupted by
incomplete writes or a premature shutdown of the Linux operating system
running in the simulator. In particular, ensure that Linux writes any cached data
to the file system before exiting the simulator. You do this by typing the following
command in the Linux console window just before you exit the simulator:
sync ; sync
The future processor supports five optional instructions that are added to the
SPU ISA:
DFCEQ
DFCGT
DFCMEQ
DFCMGT
DFTSV
Detailed documentation for these instructions is provided in version 1.2 (or later)
of the SPU ISA specification. The future processor also supports improved issue
and latency for all double-precision instructions.
The simulator also supports simulation of the future processor. The simulator
installation provides a tcl run script to configure it for such simulation. For
example, the following sequence of commands starts the simulator that is
configured for the future processor with a GUI:
export PATH=$PATH:/opt/ibm/systemsim-cell/bin
systemsim -g -f config_edp_smp.tcl
The vertical panel represents the simulated system and its components. The
rows of buttons are used to control the simulator.
To start the GUI from the Linux run directory, enter the following command:
PATH=/opt/ibm/systemsim-cell/bin:$PATH; systemsim -g
The script then configures the simulator as a Cell/B.E. system and shows the
main GUI window, which is labeled with the name of the application program.
When the GUI window is first displayed, click the Go button to start the Linux
operating system.
The folders that represent the processors can be further expanded to show the
viewable objects, as well as the options and actions that are available.
PPE components
Five PPE components are visible in the expanded PPE folder:
PCTrack
PPCCore
GPRegs
FPRegs
PCAddressing
You can view the general-purpose registers (GPRs) and the floating-point
registers (FPRs) separately by double-clicking the GPRegs and the FPRegs
folders respectively. As data changes in the simulated registers, the data in the
windows is updated, and registers that have changed state are highlighted.
The PPE Core window (PPCCore) shows the contents of all the registers of the
PPE, including the VMX registers.
SPE components
The SPE folders (SPE0 ... SPE7) each have ten subitems. Five of the subitems
represent windows that show data in the registers, channels, and memory:
SPUTrack
SPUCore
SPEChannel
LS_Stats
SPUMemory
Two of the subitems represent windows that show state information about the
MFC:
MFC
MFC_XLate
The last three subitems represent actions to perform on the SPE:
SPUStats
Model
Load-Exec
The last three items in an SPE folder represent actions to perform, with respect
to the associated SPE. The first of these is SPUStats. When the system is
stopped and you double-click this item, the simulator displays program
performance statistics in its own window. These statistics are collected only when
Model is set to pipeline mode.
The next item in the SPE folder has one of the following labels:
Model: instruction
Model: pipeline
Model: fast
The label indicates whether the simulation is in one of the following modes:
Instruction mode for checking and debugging the functionality of a program
Pipeline mode for collecting performance statistics on the program
Fast mode for fast functional simulation only
You can toggle the model by double-clicking the item. You can use the Perf
Models button on the GUI to display a menu for setting the simulator model
modes of all of the SPEs simultaneously.
The last item in the SPE folder, Load-Exec, is used for loading an executable
onto an SPE. When you double-click this item, a file-browsing window opens in
which you can find and select the executable file to load.
Perf Models
This option displays a window in which various performance models can be
selected for the various system simulator components. It provides a
convenient means to set each SPU's simulation mode to cycle-accurate
pipeline mode, instruction mode, or fast functional-only mode. The same
capabilities are available by using the Model:instruction, Model:pipeline,
and Model:fast toggle menu subitems under each SPE in the tree menu at the
left of the main control panel.
SPE Visualization
This option plots histograms of SPU and DMA event counts. The counts are
sampled at user-defined intervals and are continuously displayed. Two modes
of display are provided:
– A "scroll" view, which tracks only the most recent time segment
– A "compress" view, which accumulates samples to provide an overview
of the event counts during the time elapsed
Users can view collected data in either detail or summary panels:
– The detailed, single-SPE panel tracks SPU pipeline phenomena (such as
stalls, instructions executed by type, and issue events) and DMA
transaction counts by type (such as gets, puts, atomics, and so forth).
– The summary panel tracks all eight SPEs for the Cell/B.E. processor, with
each plot showing a subset of the detailed event count data that is
available.
Process-Tree and Process-Tree-Stats
This option requires OS kernel hooks that allow the simulator to display
process information. This feature is currently not provided in the SDK kernel.
SPU Modes
This option provides a convenient means to set each SPU's simulation
mode to either cycle-accurate pipeline mode or fast functional-only mode. The
same capabilities are available using the Model:instruction or Model:pipeline
toggle menu subitem under each SPE in the tree menu at the left of the main
control panel.
Event Log
This option enables a set of predefined triggers to start collecting the log
information. The window provides a set of buttons that can be used to set the
marker cycle to a point in the process.
Exit
This option exits the simulator and closes the GUI window.
Defining projects
The Cell/B.E. IDE leverages the Eclipse CDT framework, which is the tooling
support for developing C/C++ applications. In addition to the CDT framework,
you can choose whether you manage the build structure yourself or you have
Eclipse automatically generate it for you. This choice corresponds to the
Standard Make and Managed Make options for project creation, respectively
(Table 5-2).
Managed Make C/C++: Eclipse auto-creates and manages the makefiles for you.
From the menu bar, you select File → New → Project. In the New Project
Wizard window that opens (Figure 5-11), you can select Standard Make or
Managed Make projects under both the C and C++ project groups.
Table 5-3 describes each of the available project types.
Cell PPU Executable: Creates a PPU executable binary. This project can
reference any other SPU project binary in order to produce a Cell/B.E.
combined binary.
Cell PPU Shared Library: Creates a PPU shared library binary. This project can
reference any other SPU project binary in order to produce a Cell/B.E.
combined library.
Cell PPU Static Library: Creates a PPU static library binary. This project can
reference any other SPU project binary in order to produce a Cell/B.E.
combined library.
Cell SPU Executable: Creates an SPU binary. The resulting binary can be
executed as a spulet or embedded in a PPU binary.
Cell SPU Static Library: Creates an SPU static library. The resulting library
can be linked together with other SPU libraries and be embedded in a PPU
binary.
Project configuration
The newly created project should be displayed in the C/C++ view, on the left side
of the window. Next configure the project’s build options.
The C/C++ Build options carry all of the necessary build and compile
configuration entry points, so that we can customize the whole process. Each
tool that is part of the toolchain has its configuration entry point on the Tools
Settings tab. After you alter or add any of the values, click Apply and then OK to
trigger the automatic makefile generation process of the IDE, so that your
project's build is updated immediately.
5.5.2 Step 2: Choosing the target environments with remote tools
Now that the projects are created and properly configured, we must first create
and start a programming environment before we can test the program. The
Cell/B.E. IDE integrates the IBM Full System Simulator for the Cell/B.E.
processor and Cell/B.E. blades into Eclipse, so that you are only a few clicks
away from testing your application on a cell environment.
Simulator integration
In the Cell Environments view at the bottom, right-click Local Cell Simulator and
select Create.
The Local Cell Simulator properties window (Figure 5-14) opens, in which you
can configure the simulator to meet any specific needs that you might have. You
can modify an existing cell environment's configuration at any time (as long as it is
not running). To modify the configuration, right-click the environment and select
Edit. Enter a name for this simulator, such as My Local Cell Simulator, and
then click Finish.
The Cell Box properties window (Figure 5-15) opens in which you can configure
remote access to your Cell/B.E. blade to meet any specific needs. You can
modify an existing cell environment’s configuration at any time (as long as it is not
running). Right-click the environment and select Edit. Enter a name for this
configuration, such as My Cell Blade, and then click Finish.
clicking this button, you activate the connection between the chosen environment
and the IDE.
Click the Target tab. On the Target tab (Figure 5-17 on page 370), you can select
which remote environment you want to debug your application with. It is possible
to select any of the previously configured environments, as long as they are
active, that is, started.
Click the Launch tab. On the Launch tab (Figure 5-18), you can specify any
command line arguments and shell commands that must be executed before
your application, after your application, or both.
Click the Synchronize tab (Figure 5-19), on which you can specify resources,
such as input/output files, that must be synchronized with the cell environment’s
file system before the application executes, after the application executes, or
both. Click New upload rule to specify the resource or resources to copy to the
cell environment before the application executes. Click New download rule to
copy the resource or resources back to your local file system after execution.
Select the Upload rules enabled and Download rules enabled boxes
respectively after adding any upload or download rules.
Configure the debugger parameters. Click the Debugger tab (Figure 5-20 on
page 372). In the Debugger field, choose Cell PPU gdbserver, Cell SPU
gdbserver, or Cell/B.E. gdbserver. To debug only PPU or SPU programs, select
Cell PPU gdbserver or Cell SPU gdbserver, respectively. The Cell/B.E.
gdbserver option is the combined debugger, which allows for the debugging of
PPU and SPU source code in one debug session.
If your application is 32-bit, choose the Cell/B.E. 32-bit gdbserver option in
the Debugger field. Otherwise, if you have a 64-bit application, select the
Cell/B.E. 64-bit gdbserver option.
If there are no pending configuration problems, click the Apply button and then
the Debug button. The IDE switches to the Debug Perspective.
Debug Perspective
In the Eclipse Debug Perspective (Figure 5-22), you can use any of the regular
debugging features such as breakpoint setting, variables inspection, registers,
and memory.
5.6 Performance tools
The Cell/B.E. SDK 3.0 has a set of performance tools. It is crucial to understand
how each one relates to the other, from a time-flow perspective. Figure 5-24
highlights the optimize stage of the process.
Hardware events are available from all of the logical units within the Cell/B.E.
processor including the following units:
The PPE
The SPEs
The interface bus
Memory
I/O controllers
The Cell/B.E. performance monitoring unit (PMU) provides four 32-bit counters,
which can also be configured as pairs of 16-bit counters, for counting these
events. The CPC tool also uses the hardware sampling capabilities of the
Cell/B.E. PMU. By using this feature, the hardware can collect precise counter
data at programmable time intervals. The accumulated data can be used to
monitor changes in performance of the Cell/B.E. system over longer periods of
time.
Operation
The tool offers two modes of execution:
Workload mode
PMU counters are active only during complete execution of a workload,
providing an accurate view of the performance of a single process.
System-wide mode
PMU counters monitor all processes that run on specified processors for a
specified duration.
The results are grouped according to seven logical blocks, PPU, PPSS, SPU,
MFC, EIB, MIC, and BEI, where each block has signals (hardware events)
organized into groups. The PMU can monitor any number of signals within one
group, with a maximum of two signal groups at a time.
Hardware sampling
The Cell/B.E. PMU provides a mechanism for the hardware to periodically read
the counters and store the results in a hardware buffer. By doing so, the CPC tool
can collect a large number of counter samples while greatly reducing the number
of calls that the CPC tool must make into the kernel.
As the default behavior, the hardware buffers contain the total number of each monitored signal's hits for the specified interval, which is called count mode. In addition to sampling the counters and accumulating them in the buffers, the Cell/B.E. PMU offers other sampling modes:
Occurrence mode
This mode monitors one or two entire groups of signals, allowing any signal within the desired group to be specified. It indicates whether each event occurred at least once during each sampling interval.
Threshold mode
In this mode, a threshold value is specified for each monitored event (see the --sampling-mode option in "Typical options").
PPU Bookmarks
The CPC tool offers a feature that allows a finer-grained method of signal sampling. The PPU Bookmark mode is provided as an option to start or stop the counters when a value is written to the bookmark register. The triggering write can be issued from the command line or from within your application, narrowing the sampling scope as desired.
The chosen bookmark register must be specified as a command line option for
the CPC, as well as the desired action to be performed on the counter sampling.
The registers can be reached as files in the sysfs file system:
/sys/devices/system/cpu/cpu*/pmu_bookmark
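As an illustration, the following minimal C sketch (not taken from the SDK; the value written and the choice of cpu0 are assumptions) triggers the configured bookmark action from within an application:

#include <stdio.h>

int main(void)
{
    /* Bookmark register exposed through sysfs, as listed above */
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/pmu_bookmark", "w");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    /* Writing a value to the register starts or stops the counters,
       depending on how CPC was invoked */
    fprintf(f, "1\n");
    return fclose(f);
}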
Overall usage
The typical command line syntax for CPC is as follows:
cpc [options] [workload]
Here, the presence of the [workload] parameter controls whether CPC runs against a single application (workload mode) or against the whole running system (system-wide mode). In system-wide mode, the following options control the duration and breadth of the sampling:
--cpus <CPUS>
Controls which processors to use, where CPUS is a comma-separated list of processors or the keyword all
--time <TIME>
Controls the sampling duration (in seconds)
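For example, the following hedged invocations illustrate the two modes (my_app is a hypothetical workload, and 2104 is the PPU signal shown in Example 5-9):
cpc --event 2104.E ./my_app                  (workload mode)
cpc --event 2104.E --cpus all --time 10      (system-wide mode)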
Typical options
The following options enable the features of CPC:
--list-events
Returns the list of all possible events, grouped by logic units within the
Cell/B.E. system (see Example 5-9).
Example 5-9 Possible events returned
Performance Monitor Signals
Key:
1) Signal Number:
Digit 1 = Section Number
Digit 2 = Subsection Number
Digit 3,4 = Bit Number
.C (Cycles) or .E (Events) if the signal can record
either
2) Count Type
C = Count Cycles
E = Count Event Edges
V = Count Event Cycles
S = Count Single-Cycle Events
3) Signal Name
4) Signal Description
********************************************************************
* Unit 2: PowerPC Processing Unit (PPU)
********************************************************************
2104.CC
2104.EE  IL1_Miss_Cycles_t0  L1 Instruction cache miss cycles. Counts
the cycles from the miss event until the returned instruction is
dispatched or cancelled due to branch misprediction, completion
restart, or exceptions (see Note 1). (Thread 0)
--event <ID>
Specifies the event to be counted.
--event <ID.E>
Some events allow counting of either events (E) or cycles (C).
--event <ID:SUB>
When specifying SPU or MFC events, the desired subunit can be given (see
Example 5-10).
--event <ID[.E][:SUB]>,...
Specifies multiple comma-separated events (in any of the previous forms) to be counted.
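To make these forms concrete, the following hedged examples reuse the PPU signal 2104 from Example 5-9 with a hypothetical workload (for SPU or MFC signals, a subunit index is appended, as in <ID>:0):
cpc --event 2104 ./my_app
cpc --event 2104.C ./my_app
cpc --event 2104.E,2104.C ./my_app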
--switch-timeout <ms>
Multiple specifications of the event option allow events to be grouped in sets, with the kernel cycling through the sets at the interval defined by the switch-timeout option (see Example 5-11).
--interval <TIME>
Specifies the interval time for the hardware sampling mode. Use a suffix on the value: "n" for nanoseconds, "u" for microseconds, or "m" for milliseconds.
--sampling-mode <MODE>
Used in conjunction with the interval option, this option defines the behavior of the hardware sampling mode (as explained in "Hardware sampling" on page 377). The "threshold" mode has one peculiarity with regard to the option syntax: it requires the specification of the desired threshold value for each event (see Example 5-12).
--start-on-ppu-th0-bookmark
Starts counters upon the PPU hardware-thread 0 bookmark start.
--start-on-ppu-th1-bookmark
Starts counters upon the PPU hardware-thread 1 bookmark start.
--stop-on-ppu-th0-bookmark
Stops counters upon the PPU hardware-thread 0 bookmark stop.
--stop-on-ppu-th1-bookmark
Stops counters upon the PPU hardware-thread 1 bookmark stop. See Example 5-13.
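Combining these flags, a hedged invocation that confines counting to the region delimited by thread 0's bookmark writes might look like the following (my_app is hypothetical):
cpc --start-on-ppu-th0-bookmark --stop-on-ppu-th0-bookmark --event 2104.E ./my_app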
5.6.3 OProfile
OProfile for the Cell/B.E. system is a system-level profiler for Cell Linux systems
and is capable of profiling all running code at low overhead. It consists of a kernel
driver, a daemon for collecting sample data, and several post-profiling tools for
turning data into information.
Use the OProfile tool when you require low overhead and cannot tolerate more intrusive profiling methods, or when you require instruction-level profiles and call-graph profiles.
Operation
OProfile requires root privileges and exclusive access to the Cell/B.E. PMU. For these reasons, it supports only one session at a time, and no other PMU-related tool (such as CPC) can run simultaneously.
Tools
The tool is composed of utilities that control the profiling session and reporting.
Table 5-4 lists the most relevant utilities.
opcontrol Configures and controls the profiling system. Sets the performance
counter event to be monitored and the sampling rate.
Considerations
The current SDK 3.0 version of OProfile for the Cell/B.E. system supports
profiling on POWER processor events and SPU cycle profiling. These events
include cycles as well as the various processor, cache, and memory events. It is
possible to profile up to four events simultaneously on the Cell/B.E. system.
There are restrictions on which PPU events can be measured simultaneously.
(The tool now verifies that multiple events specified can be profiled
simultaneously. In the previous release, the user had to do this verification.)
When using SPU cycle profiling, events must be within the same group due to
restrictions in the underlying hardware support for the performance counters. You
can use the following command to view the events and which group contains
each event:
opcontrol --list-events
Overall process
The opcontrol utility drives the profiling process, which can be initiated at
compilation time, if source annotation is desired. Both the opreport and
opannotate tools are deployed at the end of the process to collect, format, and
co-relate sampled information.
Step 1: Compiling
Build the application with the -Wl,q linker flag, which preserves the relocation and the line number information in the final integrated executable.
Step 2: Initializing
As previously explained, the opcontrol utility drives the process from initialization to the end of sampling. Since the OProfile solution comprises a kernel module, a daemon, and a collection of tools, both the kernel module and the daemon must be operational before a session can run properly, and no stale session information must be present. Example 5-16 shows the initialization procedure.
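The following hedged sequence illustrates the kind of cleanup and initialization involved (these are standard opcontrol options, but the exact commands in Example 5-16 may differ):
opcontrol --deinit    # unload a stale kernel module, if present
opcontrol --init      # load the OProfile kernel module
opcontrol --reset     # clear previously collected samples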
Step 3: Running
After properly cleaning up and ensuring that the kernel module and daemon are
present, we can proceed with the sampling. First, we must properly set how
samples should be organized, indicate the desired event group to monitor, and
define the number of events per sampling as shown in Example 5-17.
At this point, the OProfile environment is ready to initiate the sampling. We can
proceed with the commands shown in Example 5-18.
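A hedged illustration of the configuration and start-up follows (the event name and count are placeholders; use opcontrol --list-events for real values, and my_app is a hypothetical workload):
opcontrol --separate=all --no-vmlinux --event=<EVENT>:<COUNT>
opcontrol --start
./my_app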
Step 4: Stopping
As soon as the application returns, stop the sampling and dump the results as
shown in Example 5-19.
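A hedged illustration of stopping and dumping follows:
opcontrol --stop    # stop collecting samples
opcontrol --dump    # flush the collected samples
opreport            # summarize the results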
5.6.4 Performance Debugging Tool
The Performance Debugging Tool (PDT) provides tracing of performance-relevant events during the execution of an application. The PDT achieves this objective by instrumenting the code that implements key functions of the events on the SPEs and PPE and collecting the trace records.
This instrumentation requires additional communication between the SPEs and
PPE as trace records are collected in the PPE memory. The traced records can
then be viewed and analyzed by using additional SDK tools.
Operation
Tracing is enabled at the application level (user space). After the application is
enabled, the tracing facility trace data is gathered every time the application runs.
Prior to each application run, the user can configure the PDT to trace events of
interest. The user can also use the PDT API to dynamically control the tracing.
During the application run, the PPE and SPE trace records are gathered in a
memory-mapped (mmap) file in the PPE memory. These records are written into
the file system when appropriate. The event-records order is maintained in this
file. The SPEs use efficient DMA transfers to write the trace records into the
mmap file. The trace records are written in the trace file using a format that is set
by an external definition (using an XML file). The PDTR (see “PDTR” on
page 392) and Trace Analyzer (see “Trace Analyzer” on page 409) tools, which
use PDT traces as input, use the same format definition for visualization and
analysis. Figure 5-26 illustrates the tracing architecture.
Considerations
Tracing 16 SPEs using one central PPE might lead to a heavy load on the PPE
and the bus, and therefore, might influence the application performance. The
PDT is designed to reduce the tracing execution load and provide a means for
throttling the tracing activity on the PPE and each SPE. In addition, the SPE
tracing code size is minimized, so that it fits into the small SPE local storage.
Performance events are captured by the SDK functions that are already instrumented for tracing. These functions include SPE activation, DMA transfers, synchronization, signaling, user-defined events, and so on.
Overall process
The PDT tracing facility is designed to minimize the effort that is needed to
enable the tracing facility for a given application. The process includes compiling,
linking with trace libraries, setting environment variables, (optionally) adjusting
the trace configuration file, and running the application. (In most cases, SPU code must be recompiled, since it is statically linked.)
Step 1: Compilation
The compilation part involves the specification of a few tracing flags and the tracing libraries. There are two procedures, one for the PPE and one for the SPE.
3. Add the instrumented libraries (for example, libtrace.a) in /usr/lib/trace (or
/usr/lib64/trace for 64-bit applications) to the linker process. The LDFLAGS
variable is in the SDK makefile for the library path, and the IMPORTS variable
is in the SDK makefile for the library. Add either of the following flag sets, depending on whether you have a 32-bit or a 64-bit application:
-L/usr/lib/trace -ltrace
-L/usr/lib64/trace -ltrace
4. (Optional) To enable the correlation between events and the source code, for
the analysis tools, rebuild the application by using the linking relocation flags.
The LDFLAGS variable is in the SDK makefile:
-Wl,q
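A hedged makefile fragment that combines steps 3 and 4 might look like the following (the placement of the variables within the SDK make.footer structure is assumed):
LDFLAGS += -L/usr/lib/trace -Wl,q
IMPORTS += -ltrace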
PDT_CONFIG_FILE This is the full path to the PDT configuration file for the
application run. The PDT package contains a
pdt_cbe_configuration.xml file in the /usr/share/pdt/config
directory that can be used “as is” or copied and modified
for each application run.
PDT_TRACE_OUTPUT (Optional) This is the full path to the PDT output directory,
which must exist prior to the application running.
The first line of the configuration file contains the application name. This name is
used as a prefix for the PDT output files. To correlate the output name with a
specific run, the name can be changed before each run. The PDT output
directory is also defined in the output_dir attribute. This location is used if the
PDT_TRACE_OUTPUT environment variable is not defined.
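For example (the output directory here is illustrative and must exist before the run):
export PDT_CONFIG_FILE=/usr/share/pdt/config/pdt_cbe_configuration.xml
export PDT_TRACE_OUTPUT=/tmp/pdt_out
./my_app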
The first section of the file, <groups>, defines the groups of events for the run.
The events of each group are defined in other definition files, which are also in
XML format, and are included in the configuration file. These files reside in the
/usr/share/pdt/config directory. They are provided with the instrumented library
and should not be modified by the programmer.
Each file contains a list of events with the definition of the trace-record data for
each event. Some of the events define an interval (with StartTime and EndTime),
and some are single events in which the StartTime is 0 and the EndTime is set to
the event time.
The names of the trace-record fields are the same as the names defined by the
API functions. There are two types of records: one for the PPE and one for the
SPE. Each of these record types has a different header that is defined in a
separate file: pdt_ppe_event_header.xml for the PPE and
pdt_spe_event_header.xml for the SPE.
The SDK provides instrumentation for the following libraries (events are defined
in the XML files):
GENERAL (pdt_general.xml)
These are the general trace events such as trace start and trace stop. Tracing
of these events is always active.
LIBSPE2 (pdt_libspe2.xml)
These are the libspe2 events.
SPU_MFCIO (pdt_mfcio.xml)
These are the spu_mfcio events that are defined in the spu_mfcio.h header
file.
LIBSYNC (pdt_libsync.xml)
These are the mutex events that are part of the libsync library.
DACS (pdt_dacs.xml, pdt_dacs_perf.xml, and pdt_dacs_spu.xml)
These are the DaCS events, separated into three groups of events.
ALF (pdt_alf.xml, pdt_alf_perf.xml, and pdt_alf_spu.xml)
These are the ALF events, separated into three groups of events.
The second section of the file contains the tracing control definitions for each
type of processor. The PDT is made ready for the hybrid environment so that
each processor has a host, <host>. On each processor, several groups of events
can be activated in the group control, <groupControl>. Each group is divided into
subgroups, and each subgroup, <subgroup>, has a set of events. Each group,
subgroup, and event has an active attribute that can be either true or false. This
attribute affects tracing as follows:
If a group is active, all of its events are traced.
If a group is not active, and the subgroup is active, all of its subgroup’s events
are traced.
If a group and subgroup are not active, and an event is active, that event is
traced.
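As an illustration of these rules, the following hedged fragment sketches a group control in which only one event is traced. The <groupControl> and <subgroup> elements and the active attribute are named in the text; the inner element names and the name attributes shown here are assumptions:
<groupControl>
  <group name="SPU_MFCIO" active="false">
    <subgroup name="DMA" active="false">
      <!-- group and subgroup are inactive, so only this event is traced -->
      <event name="MFC_GET" active="true"/>
    </subgroup>
  </group>
</groupControl>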
We recommend that you enable tracing only for those events that are of interest. We also recommend that you turn off tracing of non-stalling events, since their relative overhead is higher. Depending on the number of processors that are involved, programs might produce events at a high rate. If this scenario occurs, the number of traced events might also be high.
PDTR
The PDTR command-line tool (pdtr command) provides both viewing and post-processing of PDT traces on the target (client) machine. The alternative is to use the graphical Trace Analyzer, which is part of VPA (see 5.6.6, "Visual Performance Analyzer" on page 400).
To use this tool, you must instrument your application by building with the PDT.
After the instrumented application has run and created the trace output files, you
can run the pdtr command to show the trace output.
The pdtr command produces the output file app-20071115094957.pep.
The PDTR tool produces various summary output reports with lock statistics,
DMA statistics, mailbox usage statistics, and overall event profiles. The tool can
also produce sequential reports with time-stamped events and parameters per
line.
Example 5-21, Example 5-22 on page 394, and Example 5-23 on page 394
show the reports that are produced by PDTR. See the PDTR man page for
additional output examples and usage details.
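As a hedged illustration only (the argument form is an assumption; the man page gives the actual syntax), a post-processing run over the trace prefix from the example above might be:
pdtr app-20071115094957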
5.6.5 FDPR-Pro
The Post-link Optimization for Linux on Power tool (FDPR-Pro or fdprpro) is a
performance tuning utility that reduces execution time and real memory
utilization of user space application programs. It optimizes the executable image
of a program by collecting information about the behavior of the program under a
workload. It then creates a new version of that program optimized for that
workload. The new program typically runs faster and uses less real memory than
the original program.
Operation
The post-link optimizer builds an optimized executable program in three distinct
phases:
1. Instrumentation phase, where the optimizer creates an instrumented
executable program and an empty template profile file
2. Training phase, where the instrumented program is executed with a
representative workload, and as it runs, it updates the profile file
3. Optimization phase, where the optimizer generates the optimized executable
program file
You can control the behavior of the optimizer with options specified on the
command line.
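A hedged sketch of the three phases follows. The -a instr and -a opt action flags and the .instr suffix reflect typical FDPR-Pro usage and are assumptions here; myapp is a hypothetical program, and .nprof is the default PPE profile name mentioned later in this section:
fdprpro -a instr myapp                      # 1. create instrumented program and empty profile
./myapp.instr <representative workload>     # 2. training run fills the profile
fdprpro -a opt -O3 myapp -f myapp.nprof     # 3. generate the optimized program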
Considerations
The FDPR-Pro tool applies advanced optimization techniques to a program.
Some aggressive optimizations might produce programs that do not behave as
expected. You should test the resulting optimized program with the same test
suite that is used to test the original program. You cannot re-optimize an
optimized program by passing it as input to FDPR-Pro.
Build the executable program with relocation information. To do this, call the linker with the --emit-relocs (or -q) option. Alternatively, pass the -Wl,--emit-relocs (or -Wl,-q) option to the GCC or XLC compiler.
If you are using the SDK makefiles structure (with make.footer), set the variables
shown in Example 5-24 (depending on the compiler) in the makefile.
spe_i: Replace the variable <spe_i> with the name of the SPU file
obtained in the extraction process. The profile and optimization options
must be specified only in the optimization phase (see “Step 4:
Optimization phase” on page 399).
SPE instrumentation
When the optimizer processes PPE executables, it generates a profile file and an
instrumented file. The profile file is filled with counts while the instrumented file
runs. In contrast, when the optimizer processes SPE executables, the profile is
generated when the instrumented executable runs. Running a PPE or SPE
instrumented executable typically generates a number of profiles, one for each
SPE image whose thread is executed. This type of profile accumulates the
counts of all threads that execute the corresponding image. The SPE
instrumented executable generates an SPE profile named <spename>.mprof in
the output directory, where <spename> represents the name of the SPE thread.
The resulting instrumented file is 5% to 20% larger than the original file. Because
of the limited local storage size of the CBEA, instrumentation might cause SPE
memory overflow. If this happens, fdprpro issues an error message and exits. To
avoid this problem, the user can use the --ignore-function-list file or -ifl file
option. The file that is referenced by the file parameter contains the names of the
functions that should not be instrumented and optimized. This results in a
reduced instrumented file size. Specify the same -ifl option in both the
instrumentation and optimization phases.
If an old profile exists before instrumentation starts, fdprpro accumulates new
data into it. In this way, you can combine the profiles of multiple workloads. If you
do not want to combine profiles, remove the old SPE profiles (the .mprof files)
and replace the PPE profile (by default .nprof file) with its original copy, before
starting the instrumented program.
The default directory for the profile file is the directory that contains the
instrumented program. To specify a different directory, set the environment
variable FDPR_PROF_DIR to the directory that contains the profile file.
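For example (the directory is illustrative):
export FDPR_PROF_DIR=/tmp/fdpr_profiles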
Optimization options
If you invoke fdprpro with the basic optimization flag -O, it performs code
reordering optimization. It also optimizes branch prediction, branch folding, code
alignment, and removal of redundant NOOP instructions.
To specify higher levels of optimization, pass one of the flags -O2, -O3, or -O4 to
the optimizer. Higher optimization levels perform more aggressive function
inlining, data flow analysis optimizations, data reordering, and code restructuring,
such as loop unrolling. These high-level optimization flags work well for most
applications. You can achieve optimal performance by selecting and testing
specific optimizations for your program.
5.6.6 Visual Performance Analyzer
Visual Performance Analyzer (VPA) does not itself supply performance data collection tools. Instead, it relies on platform-specific tools, such as OProfile and the PDT, to collect the performance data, as illustrated in Figure 5-29.
VPA is best described through its individual tools. In the case of the Cell/B.E. processor, the following relevant tools are available:
OProfile
CPC
PDT
FDPR-Pro
Tool usage: The following sections provide a brief overview of each Cell/B.E.-relevant VPA tool. For more detailed coverage of tool usage, see Chapter 6, "The performance tools" on page 417.
Profile Analyzer
With the Profile Analyzer tool, you can navigate through a system profile, looking
for performance bottlenecks. This tool provides a powerful set of graphical and
text-based views for users to narrow down performance problems to a particular
process, thread, module, symbol, offset, instruction, or source line. It supports
profiles that are generated by the Cell/B.E. OProfile.
Overall process
Profile Analyzer works with properly formatted data gathered by OProfile. First, you run the desired application under OProfile and format the resulting data. Then you load the information into Profile Analyzer and explore its visualization options.
Step 1: Collecting the profile data
As outlined in 5.6.3, “OProfile” on page 382, initiate the profile session as usual,
adding the selected events to be measured. As a required step for VPA,
configure the session to properly separate the samples with the following
command:
opcontrol --separate=all
As soon as the measuring has ended, prepare the output data for VPA with the opreport tool as follows:
opreport -X -g -d -l myapp.opm
In VPA, a profile data file loading process can run as a background job. When VPA is loading a file, you can click a button to have the loading job run in the background. While the loading job is running in the background, you can use Profile Analyzer to view already loaded profile data files or start another loading job at the same time.
The process hierarchy view (Figure 5-32) appears by default in the top center pane. It shows an expandable list of all processes within the current profile. You can expand a process to view its modules, then threads, and so on. You can also view the profile in the form of a thread or module, for example. You can define the hierarchy view by right-clicking the profile and choosing Hierarchy Management.
Code Analyzer
Code Analyzer displays detailed information about basic blocks, functions, and assembly instructions of executable files and shared libraries. It is built on top of FDPR-Pro technology and allows the addition of FDPR-Pro and tprof profile information. Code Analyzer can show statistics, navigate the code, display performance comments and grouping information about the executable, and map back to source code.
Overall process
Code Analyzer works with the artifacts that are generated from the FDPR-Pro session on your executable. Initially, the original executable is loaded, followed by the generated .nprof files. After that, you can visualize the information.
Locate the original application binary and add it to Code Analyzer by choosing File → Code Analyzer → Analyze Executable. In the Choose Executable File to Analyze window (Figure 5-34), select the desired file and click Open.
To enhance the views with profile information for the loaded executable, you can use either an instrumentation profile file or a sampling profile. To do so, choose File → Code Analyzer → Add Profile Information for each of the executable tabs that are available in the center of the window. The profile information that is added must match the executable tab that is selected. For the ppu tabs, add the *.nprof profiles, and for the spu tab, add the respective *.mprof file. Figure 5-36 on page 409 shows the added information.
Figure 5-36 Added profile information
Trace Analyzer
Trace Analyzer visualizes Cell/B.E. traces that contain such information as DMA
communication, locking and unlocking activities, and mailbox messages. Trace
Analyzer shows this data organized by core along a common time line. Extra
details are available for each kind of event, for example, lock identifier for lock
operations and accessed address for DMA transfers.
The tool introduces the following concepts with which you must be familiar:
Events Events are records that have no duration, for example, records describing non-stalling operations, such as releasing a lock. The impact of events on performance is normally insignificant, but they can be important for understanding the application and tracking down sources of performance problems.
Intervals Intervals are records that may have non-zero duration. They
normally come from stalling operations, such as acquiring a lock.
Intervals are often a significant performance factor. Identifying long
stalls and their sources is an important task in performance
debugging. A special case of an interval is a live interval, which
starts when an SPE thread begins to execute and ends when the
thread exits.
Select File → Open File and locate the .pex file generated during the tracing
session. After loading in the trace data, the Trace Analyzer Perspective displays
the data in its views and editors. See Figure 5-38.
From the view shown in Figure 5-38, going from the top left clockwise, we see the
following components:
The navigator view
Trace Editor
This editor shows the trace visualization by core, where data from each core
is displayed in a separate row, and each trace record is represented by a
rectangle. Time is represented on the horizontal axis, so that the location and
size of a rectangle on the horizontal axis represent the corresponding event’s
time and duration. The color of the rectangle represents the type of event, as
defined by the Color Map View.
Details View
This view shows the details of the selected record (if any).
Color Map View
With this view, the user can view and modify color mapping for different kinds
of events.
Trace Table View
This view shows all the events on the trace in the order of their occurrence.
Counter Analyzer
The Counter Analyzer tool is a common tool for analyzing hardware performance counter data among many IBM System i™ platforms, including systems that run Linux on a Cell/B.E. processor.
The Counter Analyzer tool accepts hardware performance counter data in the
form of a cross-platform XML file format. The tool provides multiple views to help
the user identify the data. The views can be divided into two categories.
The first category is the "table" views, which are two-dimensional tables that display data. The data can be raw performance counter values, derived metrics, counter comparison results, and so on.
The second category is the "plot" views. In these views, data is represented by different kinds of plots. The data can also be raw performance counter values, derived metrics, comparison results, and so on. In addition to these "table" views and "plot" views, "utility" views help the user configure and customize the tool.
Overall process
In the case of the Cell/B.E. processor, Counter Analyzer works with count data that is produced by the CPC tool in .pmf XML-format files. After loading the counter data of the .pmf file, the Counter Analyzer Perspective (Figure 5-41) displays the data in its views and editors. Primary information, including details, metrics, and CPI breakdown, is displayed in the Counter Editor. Resource statistics information about the file (if available) is shown in the tabular Resource Statistics view. The View Graph shows the details, metrics, and CPI breakdown in graphic form.
Metrics
The metric information is calculated with a user-defined formula and event counts from a performance monitor counter. It provides performance information such as the processor utilization rate or millions of instructions per second (MIPS), which helps the algorithm designer or programmer identify and eliminate performance bottlenecks.
CPI Breakdown Model
Cycles per instruction (CPI) is a measurement for analyzing the performance of a workload. CPI is defined as the number of processor clock cycles that are needed to complete an instruction. It is calculated as shown in the following equation:
CPI = Total Cycles / Number of Instructions Completed
For example, a workload that takes 4,000,000 cycles to complete 1,000,000 instructions has a CPI of 4. A high CPI value usually implies underutilization of machine resources.
For more information, consult the IBM Visual Performance Analyzer User Guide
Version 6.1 manual at the following Web address:
http://dl.alphaworks.ibm.com/technologies/vpa/vpa.pdf
Notes:
No further makefile modifications, beyond these, are required.
There are specific changes depending on whether you use the GCC or
XLC compiler.
Modifying the ~/FFT16M/ppu/Makefile
In Example 6-1 (for the GCC compiler) and Example 6-2 (for the XLC compiler), we comment out the install directives and add the -g and -Wl,-q flags.
Example 6-1 Modifying ~/FFT16M/ppu/Makefile for gcc compiler
PROGRAM_ppu= fft
#######################################################################
#
# Objects
#######################################################################
#
#INSTALL_DIR= $(EXP_SDKBIN)/demos
#INSTALL_FILES= $(PROGRAM_ppu)
LDFLAGS_gcc = -Wl,-q
CFLAGS_gcc = -g
#######################################################################
#
# buildutils/make.footer
#######################################################################
#
ifdef CELL_TOP
include $(CELL_TOP)/buildutils/make.footer
else
include ../../../../buildutils/make.footer
endif
Example 6-2 Modifying ~/FFT16M/ppu/Makefile for xlc compiler
PROGRAM_ppu= fft
#######################################################################
#
# Objects
#######################################################################
#
PPU_COMPILER = xlc
#INSTALL_DIR= $(EXP_SDKBIN)/demos
#INSTALL_FILES= $(PROGRAM_ppu)
LDFLAGS_xlc = -Wl,-q
CFLAGS_xlc = -g
#######################################################################
#
# buildutils/make.footer
#######################################################################
#
ifdef CELL_TOP
include $(CELL_TOP)/buildutils/make.footer
else
include ../../../../buildutils/make.footer
endif
Modifying the ~/FFT16M/spu/Makefile
In Example 6-3 and Example 6-4 on page 422, we introduce the -g and -Wl,-q
compilation flags in order to preserve the relocation and the line number
information in the final integrated executable.
Example 6-3 Modifying ~/FFT16M/spu/Makefile for gcc compiler
PROGRAMS_spu:= fft_spu
LIBRARY_embed:= fft_spu.a
#######################################################################
#
# Local Defines
#######################################################################
#
# The -g and -Wl,-q flags described in the text; their exact placement
# in the original listing is assumed
CFLAGS_gcc = -g
LDFLAGS_gcc = -Wl,-q
#######################################################################
#
# buildutils/make.footer
#######################################################################
#
ifdef CELL_TOP
include $(CELL_TOP)/buildutils/make.footer
else
include ../../../../buildutils/make.footer
endif
Example 6-4 Modifying ~/FFT16M/spu/Makefile for xlc compiler
SPU_COMPILER = xlc
PROGRAMS_spu:= fft_spu
LIBRARY_embed:= fft_spu.a
#######################################################################
#
# Local Defines
#######################################################################
#
# The -g and -Wl,-q flags described in the text; their exact placement
# in the original listing is assumed
CFLAGS_xlc = -g
LDFLAGS_xlc = -Wl,-q
#######################################################################
#
# buildutils/make.footer
#######################################################################
#
ifdef CELL_TOP
include $(CELL_TOP)/buildutils/make.footer
else
include ../../../../buildutils/make.footer
endif
Before the actual build, set the default compiler by entering the following command:
/opt/cell/sdk/buildutils/cellsdk_select_compiler [gcc|xlc]
Now we are ready to build, which we do by entering the following command:
cd ~/FFT16M ; CELL_TOP=/opt/cell/sdk make
6.3 Creating and working with the profile data
Assuming that you have properly set up the project tree and completed a successful build, you can collect and work with the profile data.
Given that CPC is properly working, we collect counter data for the FFT16M application. The following example counts PowerPC instructions committed in one event-set. It also counts L1 cache load misses in a second event-set and writes the output in XML format (suitable for Counter Analyzer) to the file fft_cpc.pmf:
cd ~/FFT16M
cpc --events C,2100,2119 --cpus all --xml fft_cpc.pmf ./ppu/fft 40 1
1. Initialize the OProfile environment for SPU and run the fft workload to collect
SPU average cycle events as shown in Example 6-5.
5. Examine the disassembly information.
a. Select the fft_spu entry in the Modules section (center of the Profile
Analyzer window in Figure 6-3) and double-click main under the
Symbol/Functions view (right side of Figure 6-3). The result is displayed in
the Disassembly view in the lower right pane of the window.
If desired, you can repeat this procedure to analyze the fft_ppu.opm profile
results.
Example 6-7 shows the output of the command.
3. Associate the PPU profile information:
a. Select the PPU editor tab view.
b. From the menu bar, select File → Code Analyzer → Add Profile Info
(Figure 6-6).
c. Locate the fft.nprof file.
4. Repeat the same procedure in step 3 for the SPU part. This time, click the
SPU editor tab and locate the fft_spu.mprof file.
Code Analyzer offers a variety of ways to work with the code:
In addition to the instructions, associate the source code:
a. Select a symbol in the program tree, right-click, and select Open Source Code.
b. Locate the proper source code.
As a result, the Source Code tab is displayed at the center of the window (Figure 6-8), where you can see rates of execution per line of source code. You can also click the Collect hazard info button to collect comments about performance bottlenecks, which appear above the source lines to which they apply. See Figure 6-10.
Display pipeline population for each dispatch group. Click the Dispatch Info
tab in the lower right pane (the Instruction Properties tab) and click the Link
with table button. See Figure 6-11.
Figure 6-11 Dispatch Info tab with “Link with Table” option
Code Analyzer also offers the possibility of inspecting SPU Timing information at the pipeline level, with detailed stages of the Cell/B.E. pipeline population. Select the fft SPU editor tab, locate the desired symbol in the program tree, right-click, and select Show SPU-Timing as shown in Figure 6-13.
Figure 6-14 shows the timing results as they are displayed in the Pipeline view.
Example 6-8 Modifying ~/FFT16M/spu/Makefile for gcc compiler
PROGRAMS_spu:= fft_spu
LIBRARY_embed:= fft_spu.a
#######################################################################
#
# Local Defines
#######################################################################
#
#######################################################################
#
# buildutils/make.footer
#######################################################################
#
ifdef CELL_TOP
include $(CELL_TOP)/buildutils/make.footer
else
include ../../../../buildutils/make.footer
endif
Example 6-9 Modifying ~/FFT16M/spu/Makefile for xlc compiler
#######################################################################
#
# Target
#######################################################################
#
SPU_COMPILER = xlc
PROGRAMS_spu:= fft_spu
LIBRARY_embed:= fft_spu.a
#######################################################################
#
# Local Defines
#######################################################################
#
#######################################################################
#
# buildutils/make.footer
#######################################################################
#
ifdef CELL_TOP
include $(CELL_TOP)/buildutils/make.footer
else
include ../../../../buildutils/make.footer
endif
Notes:
The default PDT_CONFIG_FILE for the SDK establishes the trace file prefix as "test". If you have not modified the file, look for the trace files with "test" as the prefix.
Remember to unset the LD_LIBRARY_PATH environment variable before running the original (non-PDT) binary later.
6.4.2 Step 2: Importing the PDT data into Trace Analyzer
The Trace Analyzer allows the visualization of the application’s stages of
execution. It works with data generated from the PDT tool. More specifically, it
reads information that is available in the generated .pex file. To visualize the data
on the Trace Analyzer:
1. With VPA open, select Tools → Trace Analyzer.
2. Select File → Open File.
3. In the Open File window, locate the .pex file that was generated in the
previous steps.
This window corresponds to the FFT16M application that was run with 16 SPEs and no huge pages. As we can observe, a less intense blue has been selected. Next, we zoom into this area, which is all that interests us in this benchmark. As we can observe in Figure 6-16, the mailboxes (red bars) break the execution into six stages. The stages have different behaviors. For example, the third and sixth stages are much longer than the rest and have many massive stalls.
By using Trace Analyzer, we can select a stall to obtain further details by clicking the stall (the yellow highlight circled in the SPE10 row in Figure 6-16). The selection marker rulers on the left and top show the location of the selected item and can be used to return to it if you scroll away. The data collected by the PDT for the selected stall is shown in the Record Details tab in the lower right corner of Figure 6-16. The stall is huge at almost 12K ticks. We check the Cell/B.E. performance tips for the cause of the stall and determine that translation look-aside buffer (TLB) misses are a possible culprit and huge pages a possible fix.
Figure 7-1 [27 on page 625] shows the slowdown in single-thread performance growth.
The hybrid model system architecture (Figure 7-2) proposal combines characteristics of traditional superscalar multicore solutions with Cell/B.E. accelerator features. The idea is to use traditional, general-purpose superscalar clusters of machines as "processing masters" (hosts) to handle large data partitioning and I/O-bound computations (for example, message passing). Meanwhile, the clusters off-load well-characterized, computation-intensive functions to computing kernels running on Cell/B.E. accelerator nodes.
This solution enables finer-grained control over the applications' parallelism and a more flexible offering, because it easily accommodates mixing Multiple Process Multiple Data (MPMD) tasks with Single Process Multiple Data (SPMD) tasks.
Motivations
There is a well-known trade-off between generality and performance. While the Cell/B.E. processor is not a general all-purpose solution, it maintains strong general-purpose capabilities, especially with the presence of the Power Processor Element (PPE). See Figure 7-3.
7.2 Hybrid Data Communication and Synchronization
The Hybrid version of the Data Communication and Synchronization (DaCS)
library provides a set of services that eases the development of applications and
application frameworks in a heterogeneous multi-tiered system (for example, a 64-bit x86 system (x86_64) and one or more Cell/B.E. systems). The DaCS services
are implemented as a set of application programming interfaces (APIs) that
provide an architecturally neutral layer for application developers on a variety of
multi-core systems.
One of the key abstractions that further differentiates DaCS from other
programming frameworks is a hierarchical topology of processing elements, each
referred to as a DaCS Element (DE). Within the hierarchy, each DE can serve
one or both of the following roles:
A general purpose processing element, acting as a supervisor, control, or
master processor
This type of element usually runs a full operating system and manages jobs
running on other DEs. This element is referred to as a host element (HE).
A general or special purpose processing element running tasks assigned by
an HE, which is referred to as an accelerator element (AE)
A PPE program that works with the SPEs can also be a DaCS program. In this case, the program uses DaCS for Cell (DaCSC; see the Data Communication and Synchronization Library for Cell Broadband Engine Programmer's Guide and API Reference).1 The PPE acts as an AE for DaCSH (communicating with the x86_64 system) and as an HE for DaCSC (communicating with the SPEs). The DaCS API on the PPE is supported by a combined library. When the PPE is used with both DaCSH and DaCSC, the library automatically uses the parameters that are passed to the API to determine if the PPE is an AE communicating with its HE (DaCSH) or an HE communicating with its AEs (DaCSC).
1 Data Communication and Synchronization Library for Cell Broadband Engine Programmer's Guide and API Reference is available on the Web at the following address:
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/EDEC4547DFD111FF00257353006BC64A/$file/DaCS_Prog_Guide_API_v3.0.pdf
To manage the interactions between the HE and the AEs, DaCSH starts a
service on each of them. On the host system, the service is the host DaCS
daemon (hdacsd). On the accelerator, the service is the accelerator DaCS
daemon (adacsd). These services are shared between all DaCSH processes for
an operating system image.
For example, consider an x86_64 system that has multiple cores that each run a
host application using DaCSH. In this case, only a single instance of the hdacsd
service is needed to manage the interactions of each of the host applications
with their AEs via DaCSH. Similarly, on the accelerator, if the Cell/B.E. system is
on a blade server (which has two Cell/B.E. processors), a single instance of the
adacsd service is needed to manage both Cell/B.E. systems acting as AEs, even if they are used by different HEs.
When a host application starts using DaCSH, it connects to the hdacsd service.
This service manages the system topology from a DaCS perspective (managing
reservations) and starts the accelerator application on the AE. Only process
management requests use the hdacsd and adacsd services. All other
interactions between the host and accelerator application flow via a direct socket
connection.
Services
The DaCS services can be divided into the following categories:
Resource reservation
The resource reservation services allow an HE to reserve AEs below itself in
the hierarchy. The APIs abstract the specifics of the reservation system
(operating system, middleware, and so on) to allocate resources for an HE.
When reserved, the AEs can be used by the HE to execute tasks for
accelerated applications.
Process management
The process management services provide the means for an HE to execute
and manage accelerated applications on AEs, including remote process
launch and remote error notification among others. The hdacsd provides
services to the HE applications. The adacsd provides services to the hdacsd
and HE application, including the launching of the AE applications on the
accelerator for the HE applications.
In general, the return value from these functions is an error code. Data is
returned within parameters that are passed to the functions.
Process management model
When working with the host and accelerators, there must be a way to uniquely
identify the participants that are communicating. From an architectural
perspective, each accelerator can have multiple processes simultaneously
running. Therefore, it is not enough to only identify the accelerator. Instead the
unit of execution on the accelerator (the DaCS process) must be identified by
using its DaCS element ID (DE ID) and its process ID (Pid). The DE ID is
retrieved when the accelerator is reserved (by using dacs_reserve_children())
and the Pid when the process is started (by using dacs_de_start()).
Since the parent is not reserved, and no process is started on it, two constants
are provided to identify the parent: DACS_DE_PARENT and
DACS_PID_PARENT. Similarly, to identify the calling process itself, the constants
DACS_DE_SELF and DACS_PID_SELF are provided.
Important: DaCS for Hybrid and DaCS for the Cell/B.E. system share the
same API set, although they are two different implementations. For more
information about DaCS API, refer to 4.7.1, “Data Communication and
Synchronization” on page 291.
Both static and shared libraries are provided for the x86_64 and PPU. The
desired library is selected by linking to the chosen library in the appropriate path.
The static library is named libdacs.a, and the shared library is named libdacs.so.
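For example, a hedged host-side link line (the library path is assumed) might be:
cc -o my_host_app my_host_app.c -L/usr/lib64 -ldacs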
DaCS configuration
The hdacsd and adacsd are both configured by using their respective
/etc/dacsd.conf files. The default versions of these files are automatically
installed with each of the daemons. These default files contain comments on the
parameters and values that are currently supported.
When one of these files is changed, the changes do not take effect until the respective daemon is restarted, as described previously.
DaCS topology
The topology configuration file /etc/dacs_topology.config is only used by the host
daemon service. Ensure that you back up this file before you change it. Changes
do not take effect until the daemon is restarted.
The topology configuration file identifies the hosts and accelerators and their
relationship to one another. The host can contain more than one processor core,
for example a multicore x86_64 blade. The host can be attached to one or more
accelerators, for example a Cell/B.E. blade. By using the topology configuration
file, you can specify a number of configurations for this hardware. For example, it
can be configured so that each core is assigned one Cell/B.E. accelerator or it
might be configured so that each core can reserve any (or all) of the Cell/B.E.
accelerators.
The default topology configuration file, shown in Example 7-1, is for a host that
has four cores and is attached to a single Cell/B.E. blade.
The <hardware> section identifies the host system with its four cores (OC1-OC4)
and the Cell/B.E. BladeCenter (CB1) with its two Cell/B.E. systems (CBE11 and
CBE12).
The <topology> section identifies what each core (host) can use as an
accelerator. In this example, each core can reserve and use either the entire
Cell/B.E. BladeCenter (CB1) or one or more of the Cell/B.E. systems on the
BladeCenter.
The ability to use the Cell/B.E. system is implicit in the <canreserve> element. This element has an only attribute that defaults to false. If the fourth <canreserve> element were changed to <canreserve he="OC4" ae="CB1" only="TRUE"></canreserve>, then OC4 could reserve only the Cell/B.E. BladeCenter. The usage can be made more restrictive by being more specific in the <canreserve> element. If the fourth <canreserve> element were changed to <canreserve he="OC4" ae="CBE12"></canreserve>, then OC4 could reserve only CBE12 and could not reserve the Cell/B.E. BladeCenter.
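Putting these rules together, the <topology> section of such a file might look like the following hedged sketch, which uses only the he, ae, and only attributes described above (the <hardware> section is omitted):
<topology>
  <canreserve he="OC1" ae="CB1"></canreserve>
  <canreserve he="OC2" ae="CB1"></canreserve>
  <canreserve he="OC3" ae="CB1"></canreserve>
  <canreserve he="OC4" ae="CB1" only="TRUE"></canreserve>
</topology>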
DaCS daemons
The daemons can be stopped and started by using the shell service command
in the sbin directory. For example, to stop the host daemon, type the following
command:
/sbin/service hdacsd stop
The adacsd can be restarted in a like manner. See the man page for service for
more details about the service command.
Running an application
A hybrid DaCS application on the host (x86_64) must have CPU affinity to start,
which can be done on the command line. The following command line example
shows how to set affinity of the shell to the first processor:
taskset -p 0x00000001 $$
The bit mask, indexed from 0 starting at the rightmost bit, selects the CPUs: bit 0 is on or off for CPU 0, bit 1 for CPU 1, and bit x for CPU x. For example, a mask of 0x00000009 (binary 1001) selects CPUs 0 and 3. $$ means that the current process gets the affinity setting. The following command returns the mask setting as an integer:
taskset -p $$
You can use the -c option to make taskset more usable. For example, the
following command sets the processor CPU affinity to CPU 3 for the current
process:
taskset -pc 3 $$
The -pc parameter sets by process and CPU number. The following command
returns the current processor setting for affinity for the current process:
taskset -pc $$
According to the man page for taskset, a user must have CAP_SYS_NICE
permission to change CPU affinity. See the man page for taskset for more
details.
We also need a makefile in each of the created folders (only on the host
machine), which we create as shown in Example 7-3, Example 7-4 on page 459,
and Example 7-5 on page 459.
Example 7-4 hdacshello/ppu/Makefile
DIRS := spu
INCLUDE := -I/opt/cell/sdk/sysroot/usr/include
IMPORTS := /opt/cell/sysroot/opt/cell/sdk/prototype/usr/lib64/libdacs_hybrid.so \
	spu/hdacshello_spu
LDFLAGS += -lstdc++
CC_OPT_LEVEL = -g
PROGRAM_ppu64 := hdacshello_ppu
include $(CELL_TOP)/buildutils/make.footer
de_id_t cbe[2];
dacs_process_id_t pid;
uint32_t num_cbe = 1;
/* The enclosing main(), the dacs_rc declaration, the runtime
   initialization and reservation calls, and the program/argp/envp
   definitions are not part of this excerpt; the lines marked
   "assumed" follow the DaCS API but are reconstructions */
DACS_ERR_T dacs_rc;                                          /* assumed */
dacs_rc = dacs_runtime_init(NULL,NULL);                      /* assumed */
dacs_rc = dacs_reserve_children(DACS_DE_CBE,&num_cbe,cbe);   /* assumed */
dacs_rc = dacs_de_start(cbe[0],program,argp,envp,DACS_PROC_REMOTE_FILE,&pid);
printf("HOST: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
int32_t status = 0;
dacs_rc = dacs_de_wait(cbe[0],pid,&status);
printf("HOST: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
dacs_rc = dacs_release_de_list(num_cbe,cbe);
printf("HOST: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
dacs_rc = dacs_runtime_exit();
printf("HOST: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
return 0;
}
extern spe_program_handle_t hdacshello_spu;
de_id_t spe[2];
dacs_process_id_t pid;
uint32_t num_spe = 1;
/* The dacs_rc declaration, the runtime initialization, and the
   dacs_de_start() call for the embedded SPE program are not part of
   this excerpt; the lines marked "assumed" are reconstructions */
DACS_ERR_T dacs_rc;                                          /* assumed */
dacs_rc = dacs_runtime_init(NULL,NULL);                      /* assumed */
dacs_rc = dacs_reserve_children(DACS_DE_SPE,&num_spe,spe);
printf("PPU: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
printf("PPU: %d : num children = %d, spe = %08x\n",__LINE__,num_spe,spe[0]);
fflush(stdout);
dacs_rc = dacs_de_start(spe[0],&hdacshello_spu,NULL,NULL,DACS_PROC_EMBEDDED,&pid); /* assumed */
int32_t status = 0;
dacs_rc = dacs_de_wait(spe[0],pid,&status);
printf("PPU: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
dacs_rc = dacs_release_de_list(num_spe,spe);
printf("PPU: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
dacs_rc = dacs_runtime_exit();
printf("PPU: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
return 0;
}
DACS_ERR_T dacs_rc; /* declaration assumed; the excerpt begins at dacs_runtime_init() */
dacs_rc = dacs_runtime_init(NULL,NULL);
printf("SPU: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
dacs_rc = dacs_runtime_exit();
printf("SPU: %d : rc = %s\n",__LINE__,dacs_strerror(dacs_rc));
fflush(stdout);
return 0;
}
Ensure that you have permission to execute each of the binaries, for example, by using the following command:
chmod a+x ~/hdacshello/hdacshello # repeat for the other executables
Next, deploy the CBE binary hdacshello/ppu/hdacshello_ppu to the matching
location on the accelerator (QS21 Cell Blade) machine. You can use scp, as
shown in the following example:
scp ~/hdacshello/ppu/hdacshello_ppu user@qs21:~/hdacshello/ppu
Make sure that all daemons are properly running on the host and the accelerator,
and execute the helper script:
~/hdacshello/run.sh
This implementation of the ALF API uses the DaCS library as the process
management and data transport layer. Refer to 7.2.1, “DaCS for Hybrid
implementation” on page 450, for more information about how to set up DaCS in
this environment.
To manage the interaction between the ALF host run time on the Opteron system
and the ALF accelerator run time on the SPE, this implementation starts a PPE
process (ALF PPE daemon) for each ALF run time. The PPE program is
provided as part of the standard ALF runtime library. Figure 7-8 on page 465
shows the hybrid ALF flow.
Figure 7-8 Hybrid ALF flow
Additionally, the ALF host runtime is provided as both static and shared libraries. The ALF SPE runtime library is provided only as a static library.
2. Copy the PPE shared library with the embedded SPE binaries from the host
where it was built to a selected directory on the Cell/B.E. processor where it is
to be executed, for example:
scp my_appl.so <CBE>:/tmp/my_directory
3. Set the environment variable ALF_LIBRARY_PATH to the directory shown in step 2, on the Cell/B.E. processor, for example:
export ALF_LIBRARY_PATH=/tmp/my_directory
4. Set the CPU affinity on the Hybrid-x86 host, for example:
taskset -p 0x00000001 $$
5. Run the x86_64 host application on the host environment, for example:
./my_appl
We also need a makefile in each of the created folders. See Example 7-12,
Example 7-13, and Example 7-14 on page 468.
#define DEBUG
#ifdef DEBUG
#define debug_print(fmt, arg...) printf(fmt,##arg)
#else
#define debug_print(fmt, arg...) { }
#endif
char output_dtl_name[] = "output_prep";
/* The following declarations are not part of this excerpt; the names
   and values are illustrative assumptions based on the strings used
   elsewhere in the listing */
char library_name[128];
char spu_image_name[] = "alf_hello_world_hybrid_spu";
char kernel_name[] = "comp_kernel";
char input_dtl_name[] = "input_prep";
int main()
{
int ret;
alf_handle_t handle;
alf_task_desc_handle_t task_desc_handle;
alf_task_handle_t task_handle;
alf_wb_handle_t wb_handle;
void *config_parms = NULL;
sprintf(library_name, "alf_hello_world_hybrid_spu64.so");
debug_print("Before alf_init\n");
/* The alf_init() call is not part of this excerpt; this form follows
   the error-handling pattern of the surrounding calls */
if ((ret = alf_init(config_parms, &handle)) < 0) {
fprintf(stderr, "Error: alf_init failed, ret=%d\n", ret);
return 1;
}
debug_print("Before alf_num_instances_set\n");
if ((ret = alf_num_instances_set(handle, 1)) < 0) {
fprintf(stderr, "Error: alf_num_instances_set failed, ret=%d\n",
ret);
return 1;
} else if (ret > 0) {
debug_print("alf_num_instances_set returned number of SPUs=%d\n",
ret);
}
debug_print("Before alf_task_desc_create\n");
if ((ret = alf_task_desc_create(handle, 0, &task_desc_handle)) < 0) {
fprintf(stderr, "Error: alf_task_desc_create failed, ret=%d\n",
ret);
return 1;
} else if (ret > 0) {
debug_print("alf_task_desc_create returned number of SPUs=%d\n",
ret);
}
debug_print("Before alf_task_desc_set_int32\n");
if ((ret = alf_task_desc_set_int32(task_desc_handle,
ALF_TASK_DESC_MAX_STACK_SIZE, 4096)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int32 failed, ret=%d\n",
ret);
debug_print("Before alf_task_desc_set_int32\n");
if ((ret = alf_task_desc_set_int32(task_desc_handle,
ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, 0)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int32 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int32\n");
if ((ret = alf_task_desc_set_int32(task_desc_handle,
ALF_TASK_DESC_WB_IN_BUF_SIZE, 0)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int32 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int32\n");
if ((ret = alf_task_desc_set_int32(task_desc_handle,
ALF_TASK_DESC_WB_OUT_BUF_SIZE, 0)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int32 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int32\n");
if ((ret = alf_task_desc_set_int32(task_desc_handle,
ALF_TASK_DESC_WB_INOUT_BUF_SIZE, 0)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int32 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int32\n");
if ((ret = alf_task_desc_set_int32(task_desc_handle,
ALF_TASK_DESC_TSK_CTX_SIZE, 0)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int32 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int64\n");
if ((ret = alf_task_desc_set_int64(task_desc_handle,
ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long)spu_image_name)) <
0) {
fprintf(stderr, "Error: alf_task_desc_set_int64 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int64\n");
if ((ret = alf_task_desc_set_int64(task_desc_handle,
ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long)library_name)) <
0) {
fprintf(stderr, "Error: alf_task_desc_set_int64 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int64\n");
if ((ret = alf_task_desc_set_int64(task_desc_handle,
ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long)kernel_name)) <
0) {
fprintf(stderr, "Error: alf_task_desc_set_int64 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int64\n");
if ((ret = alf_task_desc_set_int64(task_desc_handle,
ALF_TASK_DESC_ACCEL_INPUT_DTL_REF_L, (unsigned long
long)input_dtl_name)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int64 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_desc_set_int64\n");
if ((ret = alf_task_desc_set_int64(task_desc_handle,
ALF_TASK_DESC_ACCEL_OUTPUT_DTL_REF_L, (unsigned long
long)output_dtl_name)) < 0) {
fprintf(stderr, "Error: alf_task_desc_set_int64 failed, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_task_create\n");
debug_print("Before alf_task_desc_destroy\n");
if ((ret = alf_task_desc_destroy(task_desc_handle)) < 0) {
fprintf(stderr, "Error: alf_exit alf_task_desc_destroy, ret=%d\n",
ret);
return 1;
}
debug_print("Before alf_wb_create\n");
if ((ret = alf_wb_create(task_handle, ALF_WB_SINGLE, 1, &wb_handle))
< 0) {
fprintf(stderr, "Error: alf_wb_create failed, ret=%d\n", ret);
return 1;
}
debug_print("Before alf_wb_enqueue\n");
if ((ret = alf_wb_enqueue(wb_handle)) < 0) {
fprintf(stderr, "Error: alf_wb_enqueue failed, ret=%d\n", ret);
return 1;
}
debug_print("Before alf_task_finalize\n");
if ((ret = alf_task_finalize(task_handle)) < 0) {
fprintf(stderr, "Error: alf_task_finalize failed, ret=%d\n", ret);
return 1;
}
debug_print("Before alf_task_wait\n");
if ((ret = alf_task_wait(task_handle, -1)) < 0) {
fprintf(stderr, "Error: alf_task_wait failed, ret=%d\n", ret);
return 1;
} else if (ret > 0) {
debug_print("alf_task_wait returned number of work blocks=%d\n",
ret);
}
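debug_print("Before alf_exit\n");
/* The alf_exit call was elided from this listing; a hedged
   reconstruction based on the ALF API is: */
if ((ret = alf_exit(handle, ALF_EXIT_POLICY_WAIT, 0)) < 0) {
fprintf(stderr, "Error: alf_exit failed, ret=%d\n",
ret);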
return 1;
}
return 0;
}
ALF_ACCEL_EXPORT_API_LIST_BEGIN
ALF_ACCEL_EXPORT_API("", comp_kernel);
ALF_ACCEL_EXPORT_API("", input_prep);
ALF_ACCEL_EXPORT_API("", output_prep);
ALF_ACCEL_EXPORT_API_LIST_END
Step 4: Running the application
To properly run the application, execute the sequence shown in Example 7-18.
Before running the application, make sure all Hybrid DaCS daemons are
properly running on the host and the accelerator.
The technology was developed initially for the financial services sector but can be
used in other areas. The ideas apply wherever an application exhibits
computational kernels with a high computational intensity (ratio of computation
over data transfers) and cannot be rewritten by using other Cell/B.E. frameworks.
The reason it cannot be rewritten is possibly because we do not have its source
code or because we do not want to port the whole application to the Cell/B.E.
environment.
DAV opens up a whole new world of opportunities to exploit the CBEA for
applications that did not initially target the Cell/B.E. processor. It also offers
increased flexibility. For example, an application can have its front-end run on a
mobile computer or desktop with GUI features and have the hard-core, number
crunching part run on specialized hardware such as the Cell/B.E.
technology-based blade servers.
The best candidate functions for offloading to a Cell/B.E. platform are the ones
that show a high ratio of computation over communication. The increased levels
of performance that are obtained by running on the Cell/B.E. processor should
not be offset by the communication overhead between the client and the server.
The transport mechanism that is currently used by DAV is TCP/IP sockets. Other
options can be explored in the future to lower the latency and increase the
network bandwidth. Clearly, any advance on this front will increase, for a given
application, the number of functions that can be accelerated.
Workloads that have a strong affinity for parallelization are optimal. A good
example is option pricing by using the Monte Carlo method, as described in
Chapter 8, “Case study: Monte Carlo simulation” on page 499.
2. On the server side, write an implementation of the exact same functions,
exploiting all the Cell/B.E. strengths. The implementation can use any
Cell/B.E. programming techniques. The functions must be made into 32-bit
DLLs (the equivalent of shared libraries in Linux.)
3. Back on the client side, fake the application with a DAV DLL, called the stub
library, that replaces the original DLL. This DLL satisfies the application but
interfaces with the DAV infrastructure to ship data back and forth between the
client and the server.
The whole process is shown in Figure 7-9. It shows the unchanged application
linked with the stub library on the client side (left) and the implementation of the
accelerated library on the server side (right).
The client application is then run as usual, but it is faster, thanks to the
acceleration provided by the Cell/B.E. processor.
The DAV client package comes with a few samples in the C:\Program
Files\IBM\DAV\sample directory. We describe the use of DAV following one of the
examples.
7.4.5 A Visual C++ application example
In this section, we show the steps to enable a simple Visual C++ application by
using DAV. The application initializes two arrays and calls two functions to
perform computations on the arrays. Example 7-19 shows the C source code for
the main program.
for(i=0;i<N;i++)
in[i]=1.*i;
return 0;
}
The function prototypes are defined in a header file. This file is important
because this is the input for the whole DAV process. In real situations, the C
source code for the main program, the functions, and the header might not be
available. You should find a way to create a prototype for each function that you
want to offload. This is the only source file that is absolutely required to get DAV
to do what it needs to do. You might have to ask the original writer of the
functions or do reverse engineering to determine the parameters that are passed
to the function.
#if defined(__cplusplus)
extern "C" {
#endif
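/* Hypothetical prototypes for the two offloaded functions; the actual
   names and signatures from Example 7-20 and Example 7-21 are not
   shown in this excerpt. */
int compute_array_a(double *in, double *out, int size);
int compute_array_b(double *in, double *out, int size);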
#if defined(__cplusplus)
}
#endif
The following steps are required to enable DAV acceleration for this application:
1. Build the original application, which creates an executable file and a DLL file
for the functions.
2. Run the DAV Tooling component by using the header file for the functions.
This creates the stub DLL that replaces the original DLL.
3. Instruct the original application to use the stub DLL.
4. On the Cell/B.E. processor, compile the functions and put them in a shared
library.
5. Start the DAV server on the Cell/B.E. processor.
6. Change the DAV parameters on the client machine, so that it points to the
right accelerator node.
7. Run the application with DAV acceleration.
Figure 7-11 The CalculateApp and Calculate projects in Visual C++ 2005 Express Edition
Applying the DAV Tooling component
Next, we apply the DAV Tooling, which produces the stub DLLs and the source
code for the server-side data marshalling, as shown in Figure 7-12. This step
only requires the .h file.
2. In the Header File Location window (Figure 7-14), complete the following
steps:
a. Choose a name for the project and open the header file. We use the one
supplied in the IBM DAV client package in C:\Program
Files\IBM\DAV\sample\Library\Library.h.
b. Click the Check Syntax button to check the syntax of the header file. The
Finish button does not become active until you check the syntax.
c. Click the Finish button.
4. After every function is completely described, click the Generate Stub Code
button (Figure 7-16) to generate the stub DLL.
Figure 7-16 Generate Stub Code button to create the stub DLL
We experienced a slight problem with the Visual C++ 2005 Express Edition here.
The free version lacks libraries that the linker tries to bring in when creating the
stub DLL. These libraries (odbc32.lib and odbccp32.lib) are not needed.
Therefore, we changed the DAV script C:\Program Files\IBM\IBM DAV
Tooling\eclipse\plugins\com.ibm.oai.appweb.tooling_1.0.1\SDK\compilers\vc, so
that it would not try to link them.
This step creates many important files for the client and the server side. The files
are in the C:\Documents and Settings\username\workspace\projectname directory.
The following files are in that directory:
Client directory
– <libraryname>_oai.dll
– <libraryname>_oai.lib
– <libraryname>_stub.dll
– <libraryname>_stub.lib
Server directory
– <libraryname>_oai.cpp
– <libraryname>_oai.h
– makefile
– changes_to_server_config_file
The client files are needed to run the DAV-enabled application. They must be
available in the search path for shared libraries. The _stub tagged files contain
the fake functions. The _oai tagged files contain the client side code for the input
and output data marshalling between the client and the server. We copied the
four files to the C:\Program Files\IBM\DAV\bin directory. We point the executable
file to this path later, so that it can find the DLLs when it needs to load them.
The server files (tagged _oai) contain the server side code for the input and
output data marshalling between the client and the server. These files must be
copied over and compiled on the server node. The makefile is provided to do so.
The changes_to_server_config_file file contains the instructions to make the
changes to the DAV configuration files on the server to serve our now offloaded
functions.
built by using the Calculate.cpp and Calculate.h files listed in Example 7-20 on
page 479 and Example 7-21 on page 480 using a command similar to the
following example:
$ gcc -shared -fPIC -o libLibrary.so Calculate.cpp
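Because the davServer process is a 32-bit program (see the end of this section), the shared library must be built as a 32-bit object, and a C++ source file links more cleanly with g++. A hedged variant of the command is:

$ g++ -m32 -shared -fPIC -o libLibrary.so Calculate.cpp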
dav.listen.port=12456
dav.listen.max=20
#Services
dav.server.service.restarts=1
dav.server.service.restart.interval=3
dav.server.service.root=services
#Service Entries
#These entries will be automatically generated by IBM DAV Tooling in a
"changes_to_server_config_file.txt" file.
#The entries should be inserted here
Library is the name of the service in this example. The Cell/B.E. server is ready
to receive requests from the client application. We now return to the client side.
#Connections to servers
#If sample1 is not available, sample2 will be tried
dav.servers=qdc221
qdc221.dav.server.ip=9.3.84.221
qdc221.dav.server.port=12456
The server is ready. The application has been instructed to load its computational
DLL from the path where the stub DLL has been stored. The DAV client knows
which accelerator it can work with. We are ready to run the application by using
Cell/B.E. acceleration.
Figure 7-19 Launching the application
Upon returning from the computational routines, the application continues its
work as shown in Figure 7-21.
If we look at the server side, we see that transactions are logged to the DAV log
file as shown in Figure 7-22. We are calling from 9.41.223.243.
Looking at the processes running on the server, we see that two processes are
running on the host, the davStart daemon and the davService process forked to
handle our request as shown in Figure 7-23.
We also see from the maps file that our shared libraries have been loaded into
the DAV server process as shown in Figure 7-24. See the seventh to tenth lines
of output.
In essence, this is how IBM DAV works. Every application, provided that it gets its
computational functions from a DLL, can be accelerated on a Cell/B.E. system by
using DAV.
IBM DAV is best used for accelerating highly computational functions that require
little input and output. This general guideline is reinforced by the fact that the
data transfer is currently performed over TCP/IP, which has high latency.
Keep in mind that no state data can be kept on the accelerator side between two
invocations. Also, all the data needed by the accelerated functions must be
passed as arguments for the DAV run time to ship all the required information to
the accelerator. The functions should be self-contained.
The davServer process that runs on the Cell/B.E. system is a 32-bit Linux
program. Therefore, we are limited to 32-bit shared libraries on the server side.
Part 3. Application re-engineering
In this part, we focus on specific application re-engineering topics as explained in
the following chapters:
Chapter 8, “Case study: Monte Carlo simulation” on page 499
Chapter 9, “Case study: Implementing a Fast Fourier Transform algorithm” on
page 521
The Monte Carlo simulation technique is also used to calculate option pricing
(option value). An option is an agreement between a buyer and a seller. For
example, the buyer of a European call option buys the right to buy a financial
instrument for a preset price (strike price) at expiration (maturity). Similarly, the
buyer of a European put option buys the right to sell a financial instrument for a
preset price at expiration. The buyer of an option is not obligated to exercise the
option. For example, if the market price of the underlying asset is below the strike
price on the expiration date, then the buyer of a call option can decide not to
exercise that option.
In this chapter, we show how to implement Monte Carlo simulation on the Cell
Broadband Engine (Cell/B.E.) system to calculate option pricing based on the
Black-Scholes model by providing sample codes. We include techniques to
improve the performance and provide performance data. Also, since such
mathematical functions as log, exp, sine, and cosine are used extensively in the
simulation, we show how to compute them efficiently with the SIMD math and
MASS libraries.
Example 8-1 shows the basic steps for the Monte Carlo simulation to calculate
the price S of an asset (stock) at different time steps 0 = t_0 < t_1 < t_2 < ... < t_M = T,
where T is the time to expiration (maturity). The current stock price S(t_0) = S_0,
the interest rate (r), and the volatility (v) are known. In this
example, N is the number of cycles and M is the number of time points in each
cycle.
}
}
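Because the body of Example 8-1 is not reproduced here, the following minimal scalar sketch (with our own variable names) illustrates the simulated price path under the Black-Scholes model that the example implements; gaussian_rand() is an assumed helper that returns a standard normal random number:

/* Sketch of the Monte Carlo price-path loop (names are ours) */
int n, m;
float dt = T / (float)M;
float drift = (r - 0.5f * v * v) * dt;  /* (r - v*v/2) * dt */
float diff = v * sqrtf(dt);             /* v * sqrt(dt)     */
for (n = 0; n < N; n++) {
    float S = S0;
    for (m = 0; m < M; m++)
        S *= expf(drift + diff * gaussian_rand());
    /* S is now the simulated price at expiration T */
}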
Example 8-2 shows pseudo code for calculating European option call and put
values. In this example, S is the spot price, and K is the strike price.
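Because the pseudo code itself is not reproduced here, a hedged sketch of the European payoff step (using the discounted expected payoff and our own variable names) is:

/* Accumulate payoffs over the N cycles (sketch; names are ours) */
call_sum += (S > K) ? (S - K) : 0.0f;
put_sum  += (K > S) ? (K - S) : 0.0f;
/* after the loop, discount the averages back to time zero: */
call_value = expf(-r * T) * call_sum / N;
put_value  = expf(-r * T) * put_sum / N;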
Example 8-3 shows pseudo code to calculate Asian option call and put values.
Again, S is the spot price, and K is the strike price.
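The only difference from the European case is that the payoff uses the arithmetic average of the prices along the path, as in this sketch (path_sum is an assumed accumulator updated inside the time-step loop):

avg_S = path_sum / M;
call_sum += (avg_S > K) ? (avg_S - K) : 0.0f;
put_sum  += (K > avg_S) ? (K - avg_S) : 0.0f;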
Different types of options are traded in the market. For example, an American call
or put option gives the buyer the right to buy or sell, respectively, the underlying
asset at the strike price on or before the expiration date [12 on page 624].
The main computational steps for option values are in Example 8-1 on page 500.
The most time-consuming part is generating the standard Gaussian
(normal) random variables. These random numbers have the following probability
density function with mean 0 and standard deviation 1:
f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right), \qquad -\infty < x < \infty
Other methods are as good as the Mersenne Twister method. SDK 3.0
provides the Mersenne Twister method as one of the random number generators
[21 on page 624]. This method takes an unsigned integer as a seed and
generates a sequence of random numbers. Then it changes these unsigned
32-bit integers to uniform random variables in (0, 1). Finally it transforms the
uniform random numbers to standard normal random numbers for which the
Box-Muller method or its variant polar method can be used. The Box-Muller
method and the polar method require two uniform random numbers and return
two normal random numbers [22 on page 625].
//Box-Muller method
void box_muller_normal(float *z)
{
float pi=3.14159f;
float t1,t2;
t1 = sqrtf(-2.0f * logf( rand_unif() ));
t2 = 2.0f*pi*rand_unif();
z[0] = t1 * cosf(t2);
z[1] = t1 * sinf(t2);
}
//polar method
void polar_normal(float *z)
{
float t1, t2, y;
do {
t1 = 2.0f * rand_unif() - 1.0f;
t2 = 2.0f * rand_unif() - 1.0f;
y = t1 * t1 + t2 * t2;
} while ( y >= 1.0f );
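/* Completion of the polar method (elided above): scale the accepted
   point to obtain two standard normal values. */
y = sqrtf(-2.0f * logf(y) / y);
z[0] = t1 * y;
z[1] = t2 * y;
}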
5. Gaussian random numbers
Each SPE generates unsigned integer random numbers and converts them to
standard Gaussian random numbers by using Box-Muller transformation.
6. Monte Carlo cycles
Each SPE performs its portion of Monte Carlo cycles and computes an
average option value as result.
7. Send the result
Each SPE sends its result to the Power Processor Element (PPE) by using
DMA.
8. Collect results
Check if the work is done by the SPEs and collect the results from the SPEs.
9. Final value
Calculate the average of the results that are collected and compute the final
option value.
Some of the steps are explained further in the discussion that follows. Because
the computational steps for getting European option call and put values and
Asian option values are similar, we restrict our discussion to European option call
value.
As noted previously, in Example 8-1 on page 500, the main computational part
generates standard normal random numbers. Therefore, it requires careful
implementation to reduce the overall computing time. For computational
efficiency and because of limited local storage available on SPEs, avoid
precomputing all random numbers and storing them. Instead, generate the
random numbers during the Monte Carlo cycles in each SPE.
However, this poses a major challenge in the sense that we cannot simply
implement a serial random number generator, such as Mersenne Twister, that
generates a sequence of random numbers based on a single seed value.
Using different seeds on different SPEs is not enough since the generated
random numbers can be correlated. That is, the random numbers are not
independent. Therefore, the quality of Monte Carlo simulations is not good, which
leads to inaccurate results. These remarks apply to all parallel machines, not
specific to the Cell/B.E. system.
One way to avoid these problems is to use parallel random number generators
such as Dynamic Creator [10 on page 624]. Dynamic Creator is based on the
Mersenne Twister algorithm, which depends on a set of parameters, called
Mersenne Twister parameters, to generate a sequence of random numbers for a
given seed. With the Dynamic Creator algorithm, we can be certain that the
Mersenne Twister parameters are different on different SPEs so that the
generated sequences are independent of each other, resulting in high quality
random numbers overall. Also, Dynamic Creator provides the capability to
precompute these parameters, which can be done on a PPE and saved in an
array, but only once for a given period.
Following the logical steps shown in Figure 8-1 on page 504, in step 2, on the
PPE, we store the values, such as the number of simulations
(J=N/number_of_SPUs), interest rate (r), volatility (v), number of time steps,
Mersenne Twister parameters, and the initial seeds in the control structure,
defined in Example 8-5 on page 506, to transfer the data to SPEs. Based on
these input values, in step 6 on page 505 (Figure 8-1), the average call value is
calculated as follows:
C_k = (C_0 + C_1 + ... + C_{J-1})/J, for k = 1, 2, ..., number_of_SPUs.
In Example 8-5, we define the control structure that will be used to share data
between PPU and SPUs.
Example 8-5 Control structure to share data between the PPU and SPUs
typedef struct _control_st {
unsigned int seedvs[4]; /* array of seeds */
unsigned int dcvala[4]; /* MT parameters */
unsigned int dcvalb[4]; /* MT parameters */
unsigned int dcvalc[4]; /* MT parameters */
int num_simulations; /* number of MC simulations */
float spot_price;
float strike_price;
float interest_rate;
float volatility;
int time_to_maturity;
int num_time_steps;
float *valp;
char pad[28]; /* padding */
} control_st;
In Example 8-6, we provide a sample code for the main program on SPUs to
calculate the European option call value.
Example 8-6 SPU main program for the Monte Carlo simulation
#include <spu_mfcio.h>
/* control structure */
control_st cb __attribute__ ((aligned (128)));
//
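/* The DMA transfer of the control block was elided from this listing.
   A hedged sketch, assuming argp carries the effective address of the
   PPU-side structure: */
mfc_get(&cb, argp, sizeof(cb), 31, 0, 0);
mfc_write_tag_mask(1 << 31);
mfc_read_tag_status_all();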
my_num_simulations = cb.num_simulations;
s = cb.spot_price;
x = cb.strike_price;
r = cb.interest_rate;
sigma = cb.volatility;
T = cb.time_to_maturity;
nt = cb.num_time_steps;
//get seed
//
//Get Mersenne Twister parameters that are different on SPUs
return 0;
}
Example 8-7 SPU vector code for European option call value
#include <spu_intrinsics.h>
#include <simdmath/expf4.h>
#include <simdmath/sqrtf4.h>
//
#include <sum_across_float4.h>
//
/*
s spot price
x strike (exercise) price,
r interest rate
sigma volatility
t_m time to maturity
nt number of time steps
*/
void monte_carlo_eur_option_call(float s, float x, float r,
float sigma, float t_m,int nt,int nsim,float *avg)
{
vector float c,v0,q,zv;
vector float dtv,rv,vs,p,y,rvt,sigmav,xv;
vector float sum,u,sv,sinitv,sqdtv;
int i,j,tot_sim;
//
v0 = spu_splats(0.0f );
c = spu_splats(-0.5f);
dtv = spu_splats( (t_m/(float)nt) );
sqdtv = _sqrtf4(dtv);
sigmav = spu_splats(sigma);
sinitv = spu_splats(s);
xv = spu_splats(x);
rv = spu_splats(r);
vs = spu_mul(sigmav, sigmav);
p = spu_mul(sigmav, sqdtv);
y = spu_madd(vs, c, rv);
rvt = spu_mul(y,dtv);
sv = _fmaxf4(q, v0);
sum = spu_add(sum,sv);
In this example, in the step that computes tot_sim, we first make the number of
simulations a multiple of four before dividing it by four. In addition to using
SIMDmath functions expf4 and sqrtf4, we used the function _fmaxf4 to get the
component-wise maximum of two vectors and SDK library function
_sum_across_float4 to compute the sum of the vector components.
For the vector version of Dynamic Creator, the components of a seed vector
should be independent. Therefore, on 16 SPUs, 64 independent seed values,
unsigned integers, are required. For example, one can use thread IDs as seed
values.
For Dynamic Creator, as indicated in Example 8-7, only three Mersenne Twister
parameters (A, maskB, and maskC) that are different on SPUs need to be set
when the random number generator is initialized. The rest of the Mersenne
Twister parameters do not change and can be inlined in the random number
generator code.
For step 2, we show the sample SPU code in Example 8-8, which is a vector
version of the Box-Muller method in Example 8-4 on page 502. In this example,
we convert the vectors of random unsigned integers to vectors of floats, generate
uniform random numbers, and use Box-Muller transformation to obtain vectors of
standard normal random numbers.
Example 8-8 Single precision SPU code to generate Gaussian random numbers
#include <spu_mfcio.h>
#include <simdmath/sqrtf4.h>
#include <simdmath/cosf4.h>
#include <simdmath/sinf4.h>
#include <simdmath/logf4.h>
//
For double precision, the generated 32-bit unsigned integer random numbers are
converted to floating-point values and extended to double-precision values. A
double-precision vector has only two elements because each element is 64 bits
long. In Example 8-8, the vector y1 has four elements that are 32-bit unsigned
integer random numbers. Therefore, we can compute the required two
double-precision vectors of uniform random numbers for Box-Muller
transformation by shuffling the components of y1. Doing this avoids the
computation of y2 in Example 8-8. Example 8-9 explains this idea.
//
y2 = spu_shuffle(y1, y1, pattern);
// Box-Muller transformation
w1 = _sqrtd2( spu_mul(c4,_logd2(u1)) );
p1 = spu_mul(c3, u2);
z[0] = spu_mul(w1, _cosd2(p1) );
z[1] = spu_mul(w1, _sind2(p1) );
}
Next we discuss ways to tune the code on the SPUs. To improve register
utilization and instruction scheduling, the outer loop in Example 8-7 on page 509
can be unrolled. Further, a vector version of the Box-Muller method and Polar
method computes two vectors of standard normal random numbers out of two
vectors of unsigned integer random numbers. Therefore, we can use all
generated vectors of normal random numbers by unrolling the outer loop, for
example, to a depth of two or four.
In Example 8-8 on page 511 and Example 8-9 on page 512, we used
mathematical intrinsic functions, such as exp, sqrt, cos and sin, from SIMD math
library, which takes advantage of SPU SIMD (vector) instructions and provides
significant performance gain over the standard libm math library [20 on page
624]. MASS, which is available in SDK 3.0, provides mathematical intrinsic
functions that are tuned for optimum performance on PPU and SPU. The MASS
libraries provide better performance than SIMD math libraries for most of the
intrinsic functions. In some cases, the results for MASS functions might not be as
accurate as the corresponding functions in the SIMD math libraries. In addition,
MASS can handle edge cases differently.
Alternatively, in Example 8-10 and in Example 8-11, you can use MASS SIMD
library functions instead of inlining them. To do this, you must change the include
statements and the function names as shown in Example 8-12, which is a
modified version of Example 8-8 on page 511.
Example 8-12 SPU code to generate Gaussian random numbers using the MASS SIMD
library
#include <spu_mfcio.h>
#include <mass_simd.h>
//
//convert to uniform random numbers
v1 = spu_convtf( (vector signed int) y1, 0) ;
v2 = spu_convtf( (vector signed int) y2, 0) ;
u1 = spu_madd( v1, c2, c1);
u2 = spu_madd( v2, c2, c1);
// Box-Muller transformation
w1 = sqrtf4( spu_mul(c4,logf4(u1)) );
p1 = spu_mul(c3, u2);
z[0] = spu_mul(w1, cosf4(p1) );
z[1] = spu_mul(w1, sinf4(p1) );
}
Further, you must add the libmass_simd.a MASS SIMD library at the link step.
Example 8-13 shows the makefile to use the library.
PROGRAMS_spu := mceuro_spu
INCLUDE = -I/usr/spu/include
OBJS = mceuro_spu.o
LIBRARY_embed := mceuro_spu.a
IMPORTS = /usr/spu/lib/libmass_simd.a
#######################################################################
# make.footer
#######################################################################
Recall that the number of Monte Carlo cycles for European option pricing is
typically very large, in the hundreds of thousands or millions. In such cases, the
call overhead for math functions can degrade the performance. To avoid call
overhead and improve the performance, you can use the MASS Vector library [21
on page 624] to provide the math functions to calculate the results for an array of
input values with a single call. To link to MASS vector library, you must replace
libmass_simd.a with libmassv.a in IMPORTS in the makefile as shown in
Example 8-13 on page 515.
Example 8-14 shows how to restructure the code to use the MASS vector
functions.
// get random numbers in the array y
//convert to uniform random numbers
//MASS
vslog ((float *) u1, (float *) u1, &size);
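The remaining Box-Muller steps are elided above; the MASS vector calls follow the same output-array, input-array, count convention as vslog, as in this hedged sketch (w1, p1, s1, and c1 are our names, not the book's):

vssqrt((float *) w1, (float *) w1, &size);
vscos ((float *) c1, (float *) p1, &size);
vssin ((float *) s1, (float *) p1, &size);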
In Example 8-14 on page 516, pointers to vector floats, such as u1 and u2, are
cast to pointers of floats in the MASS vector function calls since the array
arguments in the MASS vector functions are defined as pointers to floats. For
details, see the prototypes for the SPU MASS vector functions in
/usr/spu/include/massv.h.
With our European option pricing code implementation, we found that XLC gives
better performance than gcc. Further, besides using MASS, we used the
following techniques to improve performance:
Unrolled the outer loop in Example 8-7 to improve register utilization and
instruction scheduling
Note that, since a vector version of the Box-Muller method and Polar method
computes two vectors of standard normal random numbers out of two vectors
of unsigned integer random numbers, all generated vectors of normal random
numbers can be used by unrolling the outer loop, for example, to a depth of
four or eight.
Inlined some of the routines to avoid call overhead
Avoided branching (if and else statements) as much as possible within a loop
Reduced the number of global variables to help compiler with register
optimization
Moreover, we used the -O3 compiler optimization flag with XLC. We also tried
higher level optimization flags such as -O5, which did not make a significant
performance difference, compared to -O3.
Example 8-15 shows the input values for the Monte Carlo simulation.
Figure 8-2 shows the performance results in terms of millions of simulations per
second (M/sec) for the tuned single precision code for European option on QS21
(clock speed 3.0 GHz).
Figure 8-2 Performance (M/sec) of the tuned single precision European option code:
1 SPU: 150; 2 SPUs: 300; 4 SPUs: 600; 8 SPUs: 1197; 16 SPUs: 2392
The IBM SDK 3.0 for the Cell/B.E. processor contains a prototype FFT library
that is written to explicitly exploit the features of the Cell/B.E. processor. This
library uses several different implementations of an algorithm to solve a small
class of FFT problems. The algorithm is based on a modified Cooley-Tukey type
algorithm. All of the implementations use the same basic algorithm, but each
implementation does something different to make the algorithm perform the best
for a particular range of problem sizes.
The first step in any development process is to start with a good algorithm that
maps well to the underlying architecture of the machine; even the best compilers
and hardware cannot hide the deficiencies that are imposed by a poor algorithm.
Figure 9-1 illustrates the stages and process used during implementation of the
FFT code for Cell/B.E. processor. We describe this figure in more detail in the
sections that follow.
9.2.1 Code
The “code” box in the Process box in Figure 9-1 represents the actual physical
writing of code. This can include the implementation of a formal program
specification or coding done without any documentation other than code directly
from the programmer.
Coding can be done in any manner that is convenient to the programmer. The
IBM Eclipse integrated development environment (IDE) for the Cell/B.E.
processor in the SDK is a good choice for writing and debugging Cell/B.E. code.
9.2.2 Test
The “test” box inside the Process box in Figure 9-1 on page 523 represents the
testing of code that has been compiled. For Cell/B.E. applications, testing is still a
necessary and critical step in producing well performing applications.
The test process for the Cell/B.E. application undoubtedly requires testing for
code performance. Testing is normally focused on code that runs on the Cell/B.E.
SPU. The focus is on making sure that all code paths are exercised and that the
output is accurate.
9.2.3 Verify
The “verify” box in the Process box in Figure 9-1 on page 523 represents an
important aspect of many Cell/B.E. applications. The need for verification of code
that runs on the SPU is of special importance. The Cell/B.E. Synergistic
Processor Element (SPE) single-precision floating point is not the same
implementation as found in the PowerPC Processor Unit (PPU) and other
processors.
Code that is ported from other processor platforms presents a unique opportunity
for the verification of Cell/B.E. programs by comparing output from the original
application for accuracy. A binary comparison of single-precision floating point is
performed to eliminate conversions to and from textual format.
The FFT coding team chose to write a separate verification program to verify
output. The verification code uses a well-known open source FFT library that is
supported on PowerPC processors. The results from the Cell/B.E. FFT code
were compared at a binary level with the results from the open source FFT
results. Example 9-2 on page 525 shows how the single-precision floating point
was converted to displayable hexadecimal and then converted back to
single-precision floating point. These functions are one way to compare SPE and
PPE floats.
Example 9-1 shows a simple C function from the test program. It illustrates how
two single-precision floating point values that represent a complex number are
output in displayable hexadecimal form.
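Because Example 9-1 itself is not reproduced here, a minimal sketch of such a function (the name and exact format string are our assumptions, chosen to match the scanf format in Example 9-2) looks like this:

/* Print a complex value as raw hex so that no precision is lost */
void print_complex_hex(float re, float im)
{
    union { float f; unsigned int i; } r, m;
    r.f = re;
    m.f = im;
    printf("%8.8X %8.8X\n", r.i, m.i);
}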
Example 9-2 shows a code snippet from the verification program, which reads
displayable hexadecimal from stdin and converts it into a complex data type. The
verify program reads multiple FFT results from stdin, which explains why
variable p is a two-dimensional matrix.
Complex *t = team[i].srcAddr;
MT_FFTW_Complex *p = fftw[i].srcAddrFC;
unsigned int j;
for ( j=0; j < n; j++ ) {
volatile union {
float f;
unsigned int i;
} t1, t2;
scanf( "%x %x\n", &t1.i, &t2.i );
#define INPUT_FMT "i=%d j=%d r=%+13.10f/%8.8X i=%+13.10f/%8.8X\n"
In the following sections, we describe the different stages and include sample
code to illustrate the evolution of some aspects of the code that are unique to the
Cell/B.E. processor.
This version of the FFT code included an initial functional interface from a test
program to the actual FFT code and a primitive method for displaying the results.
The verification of FFT output was essential to ensure accuracy.
The PowerPC version of the code performed much slower than the x86 version of
the code. This performance is due to the difference in the relative power of the
PPU portion of the Cell/B.E. processor compared to a fairly high-end dual-core
x86 CPU on which the x86 code was developed. Even though this code was
simple and single threaded, the x86 processor core that was used for
development is a much more powerful processor core than the PPU. Fortunately,
for the Cell/B.E. processor, the power of the chip lies in the eight SPUs, which we
take advantage of. (The PPU is eventually left to manage the flow of work to the
SPUs.)
Example 9-3 shows a code snippet that demonstrates how a single SPU thread
is started from the main FFT PPU code. The use of spe_context_run() is a
synchronous API and blocks until the SPU program finishes execution. The
multiple-SPU version of this code, which uses pThread support, is the preferred
solution and is discussed in 9.3.5, “Multiple SPUs” on page 529.
spe_context_ptr_t ctx;
unsigned int entry = SPE_DEFAULT_ENTRY;
/* Create context */
if ((ctx = spe_context_create (0, NULL)) == NULL) {
perror ("Failed creating context");
exit (1);
}
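/* Continuation (hedged sketch): load the embedded SPU program and run
   it synchronously; fft_spu is an assumed spe_program_handle_t name. */
if (spe_program_load (ctx, &fft_spu)) {
perror ("Failed loading program");
exit (1);
}
if (spe_context_run (ctx, &entry, 0, NULL, NULL, NULL) < 0) {
perror ("Failed running context");
exit (1);
}
spe_context_destroy (ctx);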
Note: The program in Example 9-3 on page 527 can be easily compiled and
run as a spulet.
The time spent in the computation phase of the problem shrank as the code
became more optimized by using various techniques, which are described later
in this section. As a result, DMA transfer time grew larger relative to the time
spent performing the computation phase. Eventually the team turned their
attention from focusing on the computation phase to the growing DMA transfer
time.
The programming team knew double buffering could be used to hide the time it
takes to do DMA transfers if done correctly. At this phase of the project, the team
determined that DMA transfers accounted for about 10 percent of the total time it took
to solve a problem. Correctly implemented DMA transfers were added to the
code at this phase to overlap with current ongoing computations. Given the
optimization effort that had been put into the base algorithm, the ability to reduce
our run time per problem by 10 percent was worth the effort in the end.
However, double buffering does not come for free. In this case, the
double-buffering algorithm required three buffer areas, one buffer for incoming
data and outgoing DMAs and two for the current computation. The use of a
single buffer for both input and output DMAs was possible by observing that the
data transfer time was significantly smaller than compute time. It was possible to
allocate the third buffer area when working on a small problem, but for the largest
problems, memory was already constrained and another buffer area was not
possible.
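A minimal sketch of the double-buffering pattern looks like the following (the buffer names, tag numbers, and compute() helper are our assumptions, not the FFT team's actual code):

/* buf[2] are two local store buffers; ea is the effective address of
   the data in main memory; CHUNK and nchunks are assumed constants. */
int i, b = 0;
mfc_get(buf[0], ea, CHUNK, 0, 0, 0); /* prime the first chunk */
for (i = 0; i < nchunks; i++) {
    if (i + 1 < nchunks) /* start fetching the next chunk */
        mfc_get(buf[1 - b], ea + (i + 1) * CHUNK, CHUNK, 1 - b, 0, 0);
    mfc_write_tag_mask(1 << b); /* wait for the current chunk */
    mfc_read_tag_status_all();
    compute(buf[b]); /* overlaps with the outstanding DMA */
    b = 1 - b;
}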
The solution was to have several implementations of the algorithm, where each
specific implementation would use a different data transfer strategy:
For large problems where an extra data buffer was not possible, double
buffering would not be attempted.
For smaller problems where an extra data buffer was possible, one extra data
buffer would be allocated and used for background data transfers, both to and
from SPU memory.
The library on the PPU that dispatched the problems to the SPUs now had to
look at the problem size that was provided by the user and choose between two
different implementations. The appropriate implementation was then loaded on
an SPU, and problems that were suitable for that implementation were sent to
that SPU.
With this solution, the code was flexible enough that small problems could
take advantage of double buffering and large problems were still able to execute.
As a result, extending the code to run on multiple SPUs allowed for increasing
the throughput, not the maximum problem size. All of the SPUs that get used will
work independently, getting their work from a master work queue maintained in
PPU memory.
If there are four FFT problems in memory and those problems are of the same
length, then this is trivial to implement. All four problems are done in lockstep,
and the code that performs the indexing, the twiddle calculations, and other code
can be reused for all four problems.
The problem with this approach is that it requires multiple FFT problems to reside
in SPU memory. For the smallest problem sizes (up to around 2500 points), this
is feasible. However, for larger problem sizes, this technique cannot be leveraged
because the memory requirements are too great for a single SPU.
Perform the math using the vector registers.
Split the results in the vector registers apart and put them back into their
proper places in memory.
The advantage of this solution is that it allows the use of the vector capabilities of
the SPU with only one FFT problem resident in memory. Parallelism is achieved
by performing loop unrolling on portions of the code.
The spu_timing tool allowed for analyzing the code generated by the compiler to
look for inefficient sections of code. Understanding why the code was inefficient
provided insight into where to focus efforts on generating better performing code.
The simulator gave insight as to how the code would behave on real hardware.
The dynamic profiling capability of the simulator let us verify our assumptions
about branching behavior, DMA transfer time, and overall SPU utilization.
The stopwatch technique provided verification that the changes to the code were
actually faster than the code that was replaced.
An important feature of the SIMD Math library is the special vector versions of
sinf() and cosf() that take four angles in a vector as input and return four sine or
cosine values in a vector as output. These functions were used to dramatically
speed up the twiddle factor calculations.
Code inlining
Code inlining is an effective way to improve branching performance, because the
fastest branch is the one that is never taken. During the development
cycle, code is often written in functions or procedures to keep the code modular.
During the performance tuning stage, these functions and procedure calls are
reviewed for potential candidates that can be inlined. Besides reducing
branching, inlining has a side benefit of giving the compiler more code to work
with, allowing it to make better use of the SPU pipelines.
When the FFT problems first come into memory, they are in an interleaved format
where the real portion of a complex number is followed immediately by the
imaginary portion of the complex number. Loading a vector from memory results
in two consecutive complex numbers from the same problem in a vector register.
That then requires shuffle operations to split the reals from the imaginaries and
more shuffle operations to end up with four problems striped across the vector
registers.
It is most efficient to do all of the data re-arrangement after the problems are in
SPU memory, but before computation begins. Example 9-4 shows a simple data
structure that is used to represent the problems in memory and a loop that does
the re-arrangement.
Workarea_t wa[2];
// Separate into arrays of reals and arrays of imaginaries
// Do it naively, one float at a time for now.
short int i;
for ( i=0; i < worklist.problemSize; i++ ) {
wa[0].u.sep.real[i*4+0] = wa[1].prob[0][i].real;
wa[0].u.sep.real[i*4+1] = wa[1].prob[1][i].real;
wa[0].u.sep.real[i*4+2] = wa[1].prob[2][i].real;
wa[0].u.sep.real[i*4+3] = wa[1].prob[3][i].real;
wa[0].u.sep.imag[i*4+0] = wa[1].prob[0][i].imag;
wa[0].u.sep.imag[i*4+1] = wa[1].prob[1][i].imag;
wa[0].u.sep.imag[i*4+2] = wa[1].prob[2][i].imag;
wa[0].u.sep.imag[i*4+3] = wa[1].prob[3][i].imag;
}
On entry to the code wa[1] contains four problems in interleaved format. This
code copies all of the interleaved real and imaginary values from wa[1] to wa[0],
where they appear as separate arrays of reals and imaginaries. At the end of the
code, the view in wa[0] is of four problems striped across vectors in memory.
A review of the assembly code generated by the compiler and annotated by the
spu_timing tool (Example 9-5) shows that the code is inefficient for this
architecture. Notice the large number of stall cycles. The loop body takes 126
cycles to execute.
003000 1D 0123 cwx $8,$79,$24
003001 0 1234 shli $71,$6,2
003002 0 2345 shli $63,$7,2
003003 0D 34 a $70,$17,$29
003003 1D 3 hbrp # 2
003004 1 4567 rotqby $37,$39,$10
003005 0 56 ai $5,$35,3
003006 0D 67 a $62,$17,$28
003006 1D 6789 cwx $32,$71,$80
003007 0D 7890 shli $54,$5,2
003007 1D 7890 cwx $21,$63,$80
003008 0D 89 ai $11,$10,4
003008 1D 8901 shufb $34,$37,$36,$38
003009 0D 90 ai $77,$78,4
003009 1D 9012 cwx $75,$71,$24
003010 0D 01 ai $69,$70,4
003010 1D 0123 cwx $67,$63,$24
003011 0D 12 ai $60,$62,4
003011 1D 1234 cwx $14,$54,$80
003012 0D 23 ai $27,$27,1
003012 1D 234567 stqx $34,$79,$80
003013 0D 34 ai $26,$26,-1
003013 1D 345678 lqx $33,$17,$30
003014 1 456789 lqx $25,$71,$80
003015 1 5678 cwx $58,$54,$24
003019 1 ---9012 rotqby $31,$33,$78
003023 1 ---3456 shufb $23,$31,$25,$32
003024 0d 45 ori $25,$27,0
003027 1d ---789012 stqx $23,$71,$80
003028 1 890123 lqx $22,$17,$29
003029 1 901234 lqx $19,$63,$80
003034 1 ----4567 rotqby $20,$22,$70
003038 1 ---8901 shufb $18,$20,$19,$21
003042 1 ---234567 stqx $18,$63,$80
003043 1 345678 lqx $16,$17,$28
003044 1 456789 lqx $13,$54,$80
003049 1 012 ----9 rotqby $15,$16,$62
003053 1 ---3456 shufb $12,$15,$13,$14
003057 1 ---789012 stqx $12,$54,$80
003058 1 890123 lqx $9,$10,$61
003059 1 901234 lqx $4,$79,$24
003064 1 ----4567 rotqby $2,$9,$11
003068 1 ---8901 shufb $3,$2,$4,$8
003072 1 ---234567 stqx $3,$79,$24
003073 1 345678 lqx $76,$78,$61
003074 1 456789 lqx $73,$71,$24
003079 1 ----9012 rotqby $74,$76,$77
003083 1 ---3456 shufb $72,$74,$73,$75
003087 1 ---789012 stqx $72,$71,$24
The code in Example 9-6, while less intuitive, shows how the shuffle intrinsic can
be used to dramatically speed up the rearranging of complex data.
vector float i0 = spu_shuffle(
spu_shuffle( q0, q1, secondFloat ),
spu_shuffle( q2, q3, secondFloat ),
firstDword );
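/* A matching line for the real parts might look like the following
   (hedged reconstruction; the firstFloat pattern is an assumption
   mirroring the secondFloat pattern above): */
vector float r0 = spu_shuffle(
spu_shuffle( q0, q1, firstFloat ),
spu_shuffle( q2, q3, firstFloat ),
firstDword );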
This code executes much better on the SPU for the following reasons:
All loads and stores to main memory are done by using full quadwords, which
makes the best use of memory bandwidth.
The compiler is working exclusively with vectors, avoiding the need to move
values into “preferred slots.”
Example 9-7 on page 538 shows the code annotated with the spu_timing tool.
This loop body is much shorter at only 23 cycles. It also has virtually no stall
cycles. The body of the second loop is approximately five times faster than the
body of the first loop. However, examination of the code shows that the second
loop is executed one half as many times as the first loop. With the savings
between the reduced cycles for the loop body and the reduced number of
iterations, the second loop works out to be 10 times faster than the first loop. This
result was verified with the simulator as well.
Part 4. Systems
This part includes Chapter 10, “SDK 3.0 and BladeCenter QS21 system
configuration” on page 541, which covers detailed system installation,
configuration, and management topics.
The Cell Broadband Engine Software Development Kit 2.1 Installation Guide
Version 2.1, which is available from IBM alphaWorks, explains the necessary
steps to install the operating system on a QS21 blade and the additional steps to
set up a diskless system.1 In this chapter, we take this detail as given and address the
topics that are complementary to this guide.
1. Cell Broadband Engine Software Development Kit 2.1 Installation Guide Version 2.1 is available on
the Web at: ftp://ftp.software.ibm.com/systems/support/bladecenter/cpbsdk00.pdf
The BladeCenter QS21 is a single-wide blade that uses one BladeCenter H slot
and can coexist with any other blade in the same chassis. To ensure compatibility
with existing blades, the BladeCenter QS21 provides two midplane connectors
that contain Gigabit Ethernet links, Universal Serial Bus (USB) ports, power, and
a unit management bus. The local service processor supports environmental
monitoring, front panel, chip initialization, and the Advanced Management
Module Interface of the BladeCenter unit.
The blade includes support for an optional InfiniBand expansion card and an
optional Serial Attached SCSI (SAS) card.
Additionally, the BladeCenter QS21 blade server has the following major
components:
Two Cell/B.E. processor chips (Cell/B.E.-0 and Cell/B.E.-1) operating at
3.2 GHz
2 GB extreme data rate (XDR) system memory with ECC, 1 GB per Cell/B.E.
chip, and two Cell/B.E. companion chips, one per Cell/B.E. chip
2x8 Peripheral Component Interconnect Express (PCIe) as high-speed
daughter cards (HSDC)
1 PCI-X as a daughter card
Interface to optional DDR2 memory, for use as the I/O buffer
Onboard Dual Channel Gb-Ethernet controller BCM5704S
Onboard USB controller NEC uPD720101
One BladeCenter PCI-X expansion card connector
One BladeCenter High-Speed connector for two x8 PCIe buses
One special additional I/O expansion connector for two x16 PCIe buses
Four dual inline memory module (DIMM) slots (two slots per Cell/B.E.
companion chip) for optional I/O buffer DDR2 VLP DIMMs
Integrated Renesas 2166 service processor (baseboard management
controller (BMC) supporting Intelligent Platform Management Interface (IPMI)
and Serial over LAN (SOL))
An important characteristic of the BladeCenter QS21 blade server is that it does
not contain onboard hard disk or other storage. Storage on the BladeCenter
QS21 blade server can be allocated through a network or a SAS-attached
device. In the following section, we discuss the installation of the operating
system, through a network storage, on the QS21 blade server.
zImage for DIM: Through the use of cluster management tools, such as
Distributed Image Management (DIM) or Extreme Cluster Administration
Toolkit (xCAT), the process of creating individual zImages can be
automated. DIM is capable of applying one RHEL 5.1 kernel
zImage to multiple BladeCenter QS21 blade servers. For more information,
see “DIM implementation on BladeCenter QS21 blade servers” on
page 570.
For Fedora 7, the same kernel zImage can be applied across multiple blades
and is accessible through the Barcelona Supercomputing Center at the
following address:
http://www.bsc.es/
Each blade must have its own separate root file system over the Network File
System (NFS) server. This restriction applies even if the root file system is
read only.
SWAP is not supported over NFS. Any root file system that is NFS mounted
cannot have a SWAP space.
SELinux cannot be enabled on nfsroot clients.
SDK 3.0 supports both Fedora 7 and RHEL 5.1, but it is officially supported
only on RHEL 5.1.
External Internet access is required for installing SDK 3.0 open source
packages.
Tip: If your BladeCenter QS21 does not have external Internet access, you
can download the SDK 3.0 open source components from the Barcelona
Supercomputing Center Web site from another machine that has external
Internet access and apply them to the BladeCenter QS21.
Advanced Management Module through the Web interface
The Advanced Management Module (AMM) is a management and configuration
program for the BladeCenter system. Through its Web interface, the AMM allows
configuration of the BladeCenter unit, including such components as the
BladeCenter QS21. Systems status and an event log are also accessible to
monitor errors that are related to the chassis or its connected blades.
For more information and instructions about using the command-line interface,
refer to the IBM BladeCenter Management Module Command-Line Interface
Reference Guide (part number 42C4887) on the Web at the following address:
http://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-5069717
To establish an SOL connection, you must configure the SOL feature and start an
SOL session. You must also ensure that the BladeCenter and AMM are
configured properly to enable the SOL connection.
For further information and details about establishing an SOL connection, refer to
the IBM BladeCenter Serial over LAN Setup Guide (part number 42C4885) on
the Web at the following address:
http://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-54666
Serial interface
In addition to using SOL to access the BladeCenter QS21 server’s console, you
can use a serial interface. This method requires the connection of a specific
UART cable to the BladeCenter H chassis. This cable is not included with the
BladeCenter H chassis, so you must obtain it separately.
Ensure that you set the following parameters for the serial connection on the terminal client:
115200 baud
8 data bits
No parity
One stop bit
No flow control
By default, input is provided to the blade server through the SOL connection.
Therefore, if you prefer input to be provided through a device connected to the
serial port, ensure that you press any key on that device while the server boots.
To access the SMS utility program, you must have input to the blade’s console
(accessible either through a serial interface or SOL) as it is starting. Early in its
boot process, press F1 as shown in Example 10-1.
SYSTEM INFORMATION
Processor = Cell/B.E.(TM) DD3.2 @ 3200 MHz
I/O Bridge = Cell Companion chip DD2.x
Timebase = 26666 kHz (internal)
SMP Size = 2 (4 threads)
Boot-Date = 2007-10-26 23:52
Memory = 2048MB (CPU0: 1024MB, CPU1: 1024MB)
The following configurations on the SMS utility cannot be implemented on the
AMM:
SAS configurations
Choosing which firmware image to boot from, either TEMP or PERM
Choosing static IP for network startup (otherwise, the default method is strictly
DHCP)
Due to the POWER technology-based architecture of the Cell/B.E. system, the
initial installation must be created on a 64-bit POWER technology-based
system. After this installation, you copy the resulting root file
system to an NFS server, make it network bootable so that it can be mounted via
NFS, and adapt it to the specifics of the individual blade server.
Multiple copies: You can create multiple copies of this file system if you
want to apply the same distribution on multiple BladeCenter QS21
systems. For more information, refer to 10.2.4, “Example of installation
from network storage” on page 550.
3. Edit the copied root file system to reflect the specific blade to which you want
to mount the file system.
Additional files to change: In addition to changing the basic network
configuration files on the root file system, you must change a few files to
enable NFS root as explained in the examples in 10.2.4, “Example of
installation from network storage” on page 550.
For these reasons, you must have a zImage kernel with an initial RAM disk
(initrd) that supports booting from NFS and apply it to your BladeCenter QS21
through a Trivial File Transfer Protocol (TFTP) server.
Applying the zImage file and root file system
You have now obtained the most important components that are needed to boot
up a BladeCenter QS21: the root file system to be mounted and the zImage file.
You must establish how to pass these components onto the BladeCenter QS21.
When the BladeCenter QS21 system boots up, it first must access the kernel.
Therefore, you must load the zImage file. After the kernel is successfully loaded
and booted up, the root file system is mounted via NFS.
You can provide the zImage file to a DHCP server that sends the file to the
BladeCenter QS21 by using TFTP as explained in the following steps:
1. Ensure that the DHCP and TFTP packages are installed and the
corresponding services are enabled on the server.
2. Place the zImage file in a directory that will be exported. Ensure that this
directory is exported and the TFTP server is enabled by editing the
/etc/xinetd.d/tftp file:
disable = no
server_args = -s <directory to export> -vvvvv
3. Edit the /etc/dhcpd.conf file to reflect your settings for the zImage by editing
the filename argument.
Fedora 7: If you are booting up a Fedora 7 file system, you must add the
following option entry to the /etc/dhcpd.conf file:
option root-path “<NFS server>:<path to nfsroot>”;
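For step 1, on a RHEL-style system, the packages can be verified and the
services enabled with commands along the following lines (a sketch; the
package and service names assume the stock dhcp and tftp-server packages):
# rpm -q dhcp tftp-server xinetd   # verify that the packages are installed
# chkconfig dhcpd on
# chkconfig xinetd on
# service xinetd restart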
You can start the BladeCenter QS21 blade server from the following locations:
The optical drive of the BladeCenter unit media tray
A SAS storage device, typically one or more hard disks attached to the
BladeCenter unit
A storage device attached to the network
To start the blade server through a device attached to the network, ensure that
the boot sequence for the BladeCenter QS21 is set to Network. This
configuration can be established through the AMM Web browser by selecting
Blade Tasks → Configuration → Boot Sequence. You can now start your
BladeCenter QS21 system.
10.2.4 Example of installation from network storage
In this section, we take you through an example based on the steps in 10.2.3,
“Installation from network storage” on page 547. In this example, we cover
specific modifications to the root file system in order for it to successfully start on
the BladeCenter QS21 blade server. We implement these steps, mainly by using
bash scripts, to show how these steps can be applied in an automated fashion.
Note: We use the NFS, TFTP, and DHCP server as the same system. Ideally,
you want them to be the same, but the NFS server can be a different machine
if preferred for storage purposes.
Example 10-3 shows the complete cell_build_master.sh script.
Arguments:
POWER_MACHINE Server where '/' will be copied from
DESTINATION_PATH Path where the master directory will be
stored
Example:
EOF
exit 1
}
if [ $# != 2 ]; then
usage
fi
POWER_MACHINE=$1
DESTINATION_DIR=$2
RSYNC_OPTS="-avp -e ssh -x"
set -u
set -e
### Check if master tree already exists ###########################
test -d $DESTINATION_DIR || mkdir -p $DESTINATION_DIR
### Copy root filesystem from POWER-based machine to NFS server ###
rsync $RSYNC_OPTS $POWER_MACHINE:/ $DESTINATION_DIR
### Remove 'swap', '/' and '/boot' entries from /etc/fstab ########
grep -v "swap" $DESTINATION_DIR/etc/fstab | grep -v " / " \
| grep -v " /boot" > $DESTINATION_DIR/etc/fstab.bak
### Ensure SELinux is disabled ####################################
sed -i "s%^\(SELINUX\=\).*%\\1disabled%" \
$DESTINATION_DIR/etc/selinux/config
The changes to the /etc/fstab file are placed in the /etc/fstab.bak file, which will
eventually overwrite the /etc/fstab file. At this point, we have a master copy of
this distribution that we can apply to multiple BladeCenter QS21
blade servers.
Next, we use a copy of this master root file system to edit some files and make it
more specific to the individual BladeCenter QS21s. We use the
cell_copy_rootfs.sh script:
root@dhcp-server# ./cell_copy_rootfs.sh \
/srv/netboot/qs21/RHEL5.1/master /srv/netboot/qs21/RHEL5.1/boot/ \
192.168.70.30 192.168.70.51-62 -i qs21cell 51 - 62
In this case, we copy the master root file system to 12 individual BladeCenter
QS21 blade servers. In addition, we configure some files in each of the copied
root file systems to accurately reflect the network identity of each corresponding
BladeCenter QS21. Example 10-4 shows the complete cell_copy_rootfs.sh
script.
usage: $0 [MASTER] [TARGET] [NFS_IP] [QS21_IP] -i [QS21_HOSTNAME]
Arguments:
MASTER Full path of master root filesystem
TARGET Path where the root filesystems of the blades
will be stored.
NFS_IP IP Address of NFS Server
QS21_IP IP Address of QS21 Blade(s). If creating
for multiple blades, put in range form:
10.10.2-12.10
-i [QS21_HOSTNAME] Hostname of BladeCenter QS21. If creating
root filesystems for multiple blades, use:
Example:
./cell_copy_rootfs.sh /srv/netboot/qs21/RHEL5.1/master \
/srv/netboot/qs21/RHEL5.1 10.10.10.50 10.10.10.25-30 \
-i cell 25 - 30
This will create root paths for QS21 blades cell25 to cell30,
with IP addresses ranging from 10.10.10.25 to 10.10.10.30.
These paths will be copied from
/srv/netboot/qs21/RHEL5.1/master into
/srv/netboot/qs21/RHEL5.1/cell<25-30> on NFS server 10.10.10.50
EOF
exit 1
}
QS21_TEMP=`echo ${QS21_IP_ARY[*]}`
for a in `seq $RANGE`; do
NEW_IP=`echo ${QS21_TEMP[*]} | \
sed "s#new#$a#" | sed "s# #.#g"`
QS21_IP[b]=$NEW_IP
((b++))
done
echo ${QS21_IP[*]}
break
fi
done
if [ -z "$QS21_TEMP" ]; then
echo $PSD_QS21_IP
fi
}
MASTER=$1
TARGET=$2
NFS_IP=$3
QS21_IP=$4
shift 4
BASE=${BLADES[0]}
FIRST=${BLADES[1]}
LAST=${BLADES[3]}
BLADES=()
for i in `seq $FIRST $LAST`;
do
BLADES[a]=${BASE}${i}
((a++))
done
fi
### Ensure same number of IP and Hostnames have been provided ####
if [ "${#BLADES[*]}" != "${#QS21_IP[*]}" ] ; then
echo "Error: Mismatch in number of IP Addresses & Hostnames"
exit 1
fi
echo "Now configuring network for target machine...."
Two Ethernet connections: There might be situations where you want two
Ethernet connections for the BladeCenter QS21 blade server. For example, on
eth0, you have a private network and you want to establish public access to
the BladeCenter QS21 via eth1. In this case, make sure that you edit the
ifcfg-eth1 file in /etc/sysconfig/network-scripts/. In our example, we edit
/srv/netboot/QS21boot/192.168.70.51/etc/sysconfig/
network-scripts/ifcfg-eth1, so that it reflects the network settings for the
BladeCenter QS21 blade server.
Our root file system is modified and ready to be NFS mounted to the
BladeCenter QS21 blade server. However, first, we must create a corresponding
zImage. For this purpose, we use the cell_zImage.sh script, which will run
on the POWER technology-based system where the RHEL 5.1 installation was
done:
root@POWERbox# ./cell_zImage.sh 192.168.70.50 2.6.18-53.el5 \
/srv/netboot/qs21/RHEL5.1/boot/ -i cell 51 - 62
In Example 10-5, we create 12 zImage files, one for each BladeCenter QS21
blade server.
set -e
usage()
{
cat << EOF
${0##*/} - creates zImage for Cell QS21.
Arguments:
NFS_SERVER Server that will mount the root filesystem
to a BladeCenter QS21.
KERN_VERSION Kernel version the zImage will be based on
NFS_PATH Path on NFS_SERVER where the root
filesystems for blades will be stored
Example:
This will create zImages for QS21 blades cell25 to cell30, based
on kernel 2.6.18-53.el5, and whose NFS_PATH will be
10.10.10.1:/srv/netboot/qs21/RHEL5.1/cell<25-30>
EOF
exit 1
}
if [ $# = 0 ]; then
usage
fi
NFS_SERVER=$1
KERN_VERSION=$2
NFS_PATH=$3
shift 3
while getopts hi: OPTION; do
case $OPTION in
i)
shift
BLADES=( $* )
;;
h|?)
usage
;;
esac
done
rm -rf initrd-nfs-${QS21}-${KERN_VERSION}.img > \
/dev/null 2>&1
echo "zImage has been built as zImage-nfs-${QS21}-\
${KERN_VERSION}.img"
done
Now that we have placed the root file system in our NFS server and the zImage
file in our DHCP server, we must ensure that they are accessible by the
BladeCenter QS21 blade server. We edit the /etc/exports file in the NFS server to
give the proper access permissions as follows:
/srv/netboot/QS21/RHEL5.1/boot/192.168.70.51
192.168.70.0(rw,sync,no_root_squash)
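After editing /etc/exports, the NFS server must re-read the file. A typical
sequence, assuming a standard RHEL 5.1 NFS setup, is the following:
# exportfs -ra             # re-export everything listed in /etc/exports
# showmount -e localhost   # verify the export list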
We then edit the /etc/xinetd.d/tftp file. We are mostly interested in the server_args
and disable parameters. Example 10-6 shows how our file looks now.
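A configuration along these lines does the job (a sketch; the exported
directory /srv/netboot is an assumption based on our setup):
service tftp
{
        disable         = no
        socket_type     = dgram
        protocol        = udp
        wait            = yes
        user            = root
        server          = /usr/sbin/in.tftpd
        server_args     = -s /srv/netboot -vvvvv
}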
Next, we turn to the /etc/dhcpd.conf file on our DHCP server. For changes to
the dhcpd.conf file to take effect, we must restart the dhcpd service each time
we change the file. To minimize this need, we create a soft link that points to
the zImage file as shown in the following example:
[email protected]# ln -snf \
/srv/netboot/QS21/RHEL5.1/images/POWERbox/zImage-POWERbox-2.6.18-53\.el
5 /srv/netboot/QS21/images/QS21BLADE
In the future, if we want to change the zImage for the BladeCenter QS21 blade
server, we change where the soft link points instead of editing the dhcpd.conf file
and restarting the dhcpd service.
Next, we ensure that the /etc/dhcpd.conf file is properly set. Example 10-7 shows
how the file looks now.
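A minimal host entry along the following lines illustrates the relevant settings
(a sketch; the MAC address is a placeholder, and the addresses are taken from
our example):
host qs21cell51 {
        hardware ethernet 00:00:00:00:00:51;   # placeholder MAC address
        fixed-address 192.168.70.51;
        next-server 192.168.70.50;             # the TFTP server
        filename "QS21BLADE";                  # relative to the TFTP directory
        option root-path "192.168.70.50:/srv/netboot/QS21/RHEL5.1/boot/192.168.70.51";
}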
Notice that the filename variable is set to “QS21BLADE”. This path is resolved
relative to the directory that is specified by the server_args variable in the
/etc/xinetd.d/tftp file. “QS21BLADE” is the soft link that we created earlier,
which points to the appropriate zImage file.
Whenever we edit our /etc/dhcpd.conf file, we must restart the dhcpd service
as shown in the following example:
root@dhcp-server# service dhcpd restart
We have now completed this setup. Assuming that our BladeCenter QS21 blade
server is properly configured for starting via a network, we can restart it.
Furthermore, while SDK 3.0 is available for both of these Linux distributions, SDK
3.0 support is available only for RHEL 5.1.
Additionally, the following SDK 3.0 components are not available for RHEL 5.1:
Crash SPU commands
Cell performance counter
OProfile
SPU-Isolation
Full-System Simulator and Sysroot Image
Table 10-1 and Table 10-2 on page 562 provide the SDK 3.0 ISO images that are
provided for each distribution, their corresponding contents, and how to obtain
them.
Table 10-2 Fedora 7 ISO images
Product set ISO name Locations
If your BladeCenter QS21 blade server does not have outside Internet access,
you can download these components into one of your local servers that has
external access and install the RPMs manually on the BladeCenter QS21 blade
server.
The SDK open source components for RHEL 5.1, the GCC toolchain and
numactl, are available for download from the following address:
http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk3
.0/CellSDK-Open-RHEL/cbea/
The following SDK open source components for RHEL 5.1 are included with
the distribution:
LIBSPE/LIBSPE2
netpbm
The following SDK open source components for RHEL 5.1 are not available:
Crash SPU commands
OProfile
Sysroot image
All SDK 3.0 open source components are available for Fedora 7 from the
following Web address:
http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk3
.0/CellSDK-Open-Fedora/cbea/
4. Ensure that the rsync, sed, tcl, and wget packages are installed on your
BladeCenter QS21 blade server. The SDK Installer requires these packages.
5. If you plan to install SDK 3.0 through the ISO images, create the
/tmp/cellsdkiso directory and place the SDK 3.0 ISO images in this directory.
Tip: If you prefer to install SDK 3.0 with a graphical user interface (GUI), you
can add the --gui flag when executing the cellsdk script.
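For example, assuming the ISO directory that was created in step 5, and
following the option syntax of the mount and unmount commands shown later
in this section, the installation can be started as follows:
# ./cellsdk --iso /tmp/cellsdkiso install         # text-mode installation
# ./cellsdk --iso /tmp/cellsdkiso --gui install   # the same, with the GUI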
Make sure to read the corresponding license agreements for GPL and LGPL and
additionally, either the International Programming License Agreement (IPLA) or
the International License Agreement for Non-Warranted Programs (ILAN). If you
install the “Extras” ISO, you are presented with the International License
Agreement for Early Release of Programs (ILAER).
After you read and accept the license agreements, confirm to begin the
YUM-based installation of the specified SDK.
First, if you do not intend to install additional packages, you must unmount
the ISOs that were automatically mounted when you ran the installation, by
entering the following command:
# ./cellsdk --iso /tmp/cellsdkiso unmount
Note: If you must install additional packages that are included in the SDK 3.0
but are not part of the default installation, you must mount these ISOs again by
using the following command:
# ./cellsdk --iso /tmp/cellsdkiso mount
Next, edit the /etc/yum.conf file to prevent automatic updates from overwriting
certain SDK components. Add the following clause to the [Main] section of this
file to prevent a YUM update from overwriting SDK versions of the following
runtime RPMs:
exclude=blas kernel numactl oprofile
Re-enable the YUM updater daemon that was initially disabled before the SDK
3.0 installation:
# /etc/init.d/yumupdater start
You have now completed all of the necessary post-installation steps. If you are
interested in installing additional SDK 3.0 components for development
purposes, refer to the Cell Broadband Engine Software Development Kit 2.1
Installation Guide Version 2.1.² Refer to this same guide for additional details
about SDK 3.0 installation and installing a supported distribution on a
BladeCenter QS21 system.
² See note 1 on page 541.
System firmware
– Takes control after the BMC has successfully initialized the board
– Acts as the BIOS
– Includes boot-time diagnostics and power-on self test
– Prepares the system for operating system boot
It is important that these two firmware levels match at any given time in order
to avoid problems and system performance issues.
The AMM gives information about the system firmware and the BMC firmware.
Through this interface, under Monitors → Firmware VPD, you can view the build
identifier, release, and revision level of both firmware types.
You can also view the system firmware level through the command line by typing
the following command:
# xxd /proc/device-tree/openprom/ibm,fw-vernum_encoded
In the output, the value of interest is the last field, which starts with “QB.” This is
your system firmware level.
This command automatically updates the firmware silently and reboots the
machine. You can also extract the .bin system firmware file into a directory by
using the following command:
# ./<firmware_script> -x <directory to extract to>
After you extract the .bin file into a chosen directory, you can update the firmware
by using the following command on Linux:
# update_flash -f <rom.bin file obtained from IBM support site>
Machine reboot: Both of the previous two commands cause the BladeCenter
QS21 machine to reboot. Make sure that you run these commands when you
have access to the machine’s console, either via a serial connection or SOL
connection.
Now that you have updated and successfully started the new system firmware,
ensure that you have a backup copy of this firmware image on your server. There
are always two copies of the system firmware image on the blade server:
TEMP
This is the firmware image that is normally used in the boot process. When
you update the firmware, the TEMP image is updated.
PERM
This is a backup copy of the system firmware boot image. Should your TEMP
image be corrupt, you can recover to a working firmware from this copy. We
provide more information about this process later in this section.
You can check from which image the blade server has booted by running the
following command:
# cat /proc/device-tree/openprom/ibm,fw-bank
If the output returns “P”, you have booted on the PERM side. You usually boot
on the TEMP side unless that particular image is corrupt.
After you have successfully booted up your blade server from the TEMP image
with the new firmware, you can copy this image to the backup PERM side by
typing the following command:
# update_flash -c
Additionally, you can accomplish this task of copying from TEMP to PERM by
typing the following command:
# echo 0 > /proc/rtas/manage_flash
Refer to “Recovering to a working system firmware” on page 568 if you
encounter problems with your TEMP image file.
You should now be set with your updated BMC firmware. You can start your
BladeCenter QS21 blade server.
To choose which image to start from, access the SMS utility. On the main menu,
select Firmware Boot Side Options.
To check which image your machine is currently booted on, type the following
command:
# cat /proc/device-tree/openprom/ibm,fw-bank
A returned value of “P” means that you have booted from the PERM side. If you
booted from the PERM side and want to boot from the TEMP side instead:
1. Copy the PERM image to the TEMP image by running the following
command:
# update_flash -r
2. Shut down the blade server.
3. Restart the blade system management processor through the AMM.
4. Turn on the BladeCenter QS21 blade server again.
In this section, we introduce two tools that simplify these steps at scale and
help establish a cluster environment in an organized fashion. The two tools are
Extreme Cluster Administration Toolkit
(xCAT) and Distributed Image Management (DIM) for Linux Clusters.
DIM for Linux Clusters is a cluster image management utility. It does not contain
tools for cluster monitoring, event management, or remote console management.
The primary focus of DIM is to address the difficult task of managing Linux
distribution images for all nodes of a reasonably large cluster.
DIM includes the following additional characteristics:
Offers an automated IP and DHCP configuration through an XML file that
describes the cluster network and naming taxonomy
Allows setup of multiple images in parallel for every node (for example,
both Fedora and RHEL 5.1 images for one blade)
Allows fast incremental maintenance of file system images for changes such as
user IDs, passwords, and RPM installations
Manages multiple configurations to be implemented across a spectrum of
individual blades:
– IP addresses
– DHCP
– NFS
– File system images
– Network boot images (BOOTP/PXE)
– Node remote control
DIM has been tested on BladeCenter QS21 blade servers, and the latest release
supports DIM implementation on Cell blades.
Before installing and setting up DIM, the following prerequisites are required:
A single, regular disk-based installation on a POWER technology-based
machine
A POWER technology-based or x86-64 machine to be the DIM server
An image server to store the master and node trees (can be the same as the
DIM server)
We recommend that you have at least 20 GB of storage space plus 0.3 GB per
node.
The following software:
– BusyBox
– Perl Net-SNMP
– Perl XML-Simple
– Perl XML-Parser
We recommend that you establish a local YUM repository to ease the process of
package installation and for instances where your BladeCenter QS21 does not
have external Internet access. Assuming that you will apply the rpms from a
distribution installation DVD, use the following commands:
# mount /dev/cdrom /mnt
rsync -a /mnt /repository
umount /dev/cdrom
rpm -i /repository/Server/createrepo-*.noarch.rpm
createrepo /repository
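To point YUM at this repository, a definition along the following lines can be
placed in /etc/yum.repos.d/ (a sketch; the file name local.repo and the
assumption that /repository is reachable locally, for example over an NFS
mount, are ours):
# cat > /etc/yum.repos.d/local.repo <<EOF
[local]
name=Local distribution repository
baseurl=file:///repository
enabled=1
gpgcheck=0
EOF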
Because DIM loop-mounts the BladeCenter QS21 images, you must increase
the number of allowed loop devices. Additionally, you want to automatically start
DIM_NFS upon startup of the DIM server. You can accomplish both of these
tasks by editing the /etc/rc.d/rc.local file so that it is configured as shown in
Example 10-8.
DIM_SERVER=<DIM_IP_ADDRESS>
touch /var/lock/subsys/local
if grep -q "^/dev/root / ext" /proc/mounts; then
#commands for the DIM Server
modprobe -r loop
modprobe loop max_loop=64
if [ -x /opt/dim/bin/dim_nfs ]; then
/opt/dim/bin/dim_nfs fsck -f
/opt/dim/bin/dim_nfs start
fi
else
# commands for the DIM Nodes
ln -sf /proc/mounts /etc/mtab
test -d /home || mkdir /home
mount -t nfs $DIM_SERVER:/home /home
mount -t spufs spufs /spu
fi
if [ -x /etc/rc.d/rc.local.real ]; then
. /etc/rc.d/rc.local.real
fi
You must replace the variable DIM_SERVER with your particular DIM server IP
address. Next, run the following command to edit /etc/init.d/halt:
root@dim-server# perl -pi -e \
's#loopfs\|autofs#\\/readonly\|loopfs\|autofs#' /etc/init.d/halt
Editing this file prevents the reboot command from failing on BladeCenter QS21
nodes. Without this fix, the QS21 blade server tries to unmount its own root file
system when switching into run level 6.
Now you can reboot your DIM server and proceed to install the DIM.
2. Download the additional required software:
– For Busybox:
http://www.busybox.net/download.html
– For Perl Net-SNMP:
http://www.cpan.org/modules/by-module/Net/Net-SNMP-5.1.0.tar.gz
– For Perl XML-Parser:
http://www.cpan.org/modules/by-module/XML/XML-Parser-2.34.tar.gz
– For Perl XML-Simple:
http://www.cpan.org/modules/by-module/XML/XML-Simple-2.18.tar.gz
3. Place all of the downloaded packages and the DIM rpm into a newly created
/tmp/dim_install directory. Install the DIM rpm from this directory.
4. Add /opt/dim/bin to the PATH environment variable:
root@dim-server# echo 'export PATH=$PATH:/opt/dim/bin' >> \
$HOME/.bashrc
root@dim-server# . ~/.bashrc
5. Install the additional Perl modules that are required for DIM (Net-SNMP,
XML-Parser, XML-Simple). Ignore the warnings that might be displayed.
root@dim-server# cd /tmp/dim_install
make -f /opt/dim/doc/Makefile.perl
make -f /opt/dim/doc/Makefile.perl install
6. Build the Busybox binary and copy it to the DIM directory:
root@dim-server# cd /tmp/dim_install
tar xzf busybox-1.1.3.tar.gz
cp /opt/dim/doc/busybox.config \
busybox-1.1.3/.config
patch -p0 < /opt/dim/doc/busybox.patch
cd busybox-1.1.3
make
cp busybox /opt/dim/dist/busybox.ppc
cd /
At this point, DIM and its dependent packages are installed on your DIM server.
Continue with the steps to set up the software.
Setting up the DIM software
In this section, we begin by identifying specific configuration files where
modifications can be made to meet your specific network needs. Then we explain
the basic setup steps and the options that are available for them.
You must modify the following DIM configuration files to meet your specific
network setup:
/opt/dim/config/dim_ip.xml
You must create this file by copying one of the template examples named
“dim_ip.xml.<example>” and modifying it as you see fit. The purpose of this
file is to set a DIM IP configuration for your Cell clusters. It defines your Cell
DIM nodes, their corresponding host names, and IP addresses. It also defines
the DIM server.
/opt/dim/config/dim.cfg
This file defines the larger scale locations of the NFS, DHCP, and DIM
configuration files. The default values apply in most cases. However, the
values for the following configuration parameters frequently change:
– DIM_DATA
In this directory, you store all of the data that is relevant to the distribution
that you are deploying across your network. This data includes the
zImages, the master root file system, and the blade file system images.
– NFS_NETWORK
Here, you define your specific IP address deployment network.
– PLUGIN_MAC_ADDR
Here, you define the IP address of your BladeCenter H. Through this
variable, the DIM accesses the BladeCenter H chassis to obtain MAC
address information about the QS21 nodes and to execute basic,
operational, blade-specific commands.
/opt/dim/config/<DISTRO>/dist.cfg
This file defines variables that are specific to your distribution deployment. It
also defines the name of the DIM_MASTER machine, should it be different.
You might need to change the following variables:
– KERNEL_VERSION
This is the kernel version from which you want to create bootable
zImages for your particular distribution.
– KERNEL_MODULES
These are modules that you want but that are not enabled by default in
your kernel. The only module that the BladeCenter QS21 needs to function
is “tg3.”
– DIM_MASTER
This variable defines the machine that contains the master root file
system. In cases where the distribution that you want to deploy in your
cluster is on a different machine, you specify from which machine to obtain
the root file system. Otherwise, if it is in the same box as your DIM server,
you can leave the default value.
/opt/dim/<DIST>/master.exclude
This file contains a list of directories that are excluded from being copied to
the master root file system.
/opt/dim/<DIST>/image.exclude
This file contains a list of directories that are excluded from being copied to
the image root file system of each BladeCenter QS21 blade server.
Before you proceed with using the DIM commands to create your images, ensure
that you adapt the files and variables defined previously to meet your particular
network needs and preferences.
Now you can execute the DIM commands to create your distribution- and
node-specific images. You can find the full set of DIM commands under the
/opt/dim/lib directory. Each of these commands should have an accessible
manual page on your system:
man <dim_command>
4. Add all of these DIM images to /etc/exports:
root@dim-server# dim_nfs add -d <distribution> all
5. Mount and confirm all the DIM images for NFS exporting:
root@dim-server# dim_nfs start
dim_nfs status
Important: Ensure that you have enabled the maximum number of loop
devices on your dim-server. Otherwise, you will see an error message
related to this when running the previous command. To increase the
number of loop devices, enter the following command:
root@dim-server# modprobe -r loop
modprobe loop max_loop=64
Set the base configuration of DHCPD with your own IP subnet address:
root@dim-server# dim_dhcp add option -O
UseHostDeclNames=on:DdnsUpdateStyle=none
dim_dhcp add subnet -O Routers=<IP_subnet> \
dim-server[1]-bootnet1
dim_dhcp add option -O DomainName=dim
The following steps require the BladeCenter QS21 blade server to be
installed in the BladeCenter H chassis. The previous steps did not require
this.
6. Add each QS21 blade to the dhcpd.conf file by using the following command:
root@dim-server# dim_dhcp add node -d <distribution> \
dim-server[y]-blade[x]
Here, “y” represents the dim-server number (if there is more than one DIM
server) and “x” represents the QS21 blade.
7. After you add all of the nodes of interest, restart the DHCP service:
root@dim-server# dim_dhcp restart dhcpd
You should now be ready to boot up your QS21 to your distribution as configured
under DIM.
After your setup is complete, you may later want to add additional software to
your nodes. For this purpose, apply the software maintenance on the original
machine from which the root file system was copied. Then you can sync the
master root file system and DIM images:
root@powerbox# dim_sync_master -d <distribution> <directory
updated>..<directory updated>..
Notice that you list the directories to sync on the master and DIM images,
depending on which directories are affected by your software installation.
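For instance, after installing an RPM whose payload lands in /usr and /etc, the
call might look like the following (the directories are illustrative):
root@powerbox# dim_sync_master -d RHEL5.1 /usr /etc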
We chose to use both Fedora 7 and RHEL 5.1, splitting our cluster in half
between the two distributions. Also, note that our DIM server and our image
server are two different machines.
Our dim-server is a System x server, while our DIM boot image server is a
POWER technology-based system. We chose a POWER technology-based
system for our boot image server because such a system is required to create
the kernel zImage file. Because we are working with two different distributions,
we must copy the root file systems from two different POWER technology-based
system installations to our System x server.
We install DIM only on the dim-server. When we create a zImage, we mount
the DIM directory on our boot image server. We specify the machine whenever
a procedure applies to only one of these two servers.
We do not show the steps for the initial hard-disk installation on a POWER
technology-based system. However, note that we included the Development
Tools package group in our installation.
1. We want to give our dim-server the ability to partially manage the
BladeCenter QS21 blade server from the dim-server’s command prompt. To
achieve this, we access the AMM through a Web browser. Then we select MM
Control → Network Protocols → Simple Network Management Protocol
(SNMP) and set the following values:
– SNMPv1 Agent: enable
– Community Name: public
– Access Type: set
– Hostname or IP: 192.168.20.50
2. Since one of our distributions is Fedora 7, we install the kernel rpm that is
provided by the Barcelona Supercomputing Center on our Fedora 7
machine:
root@f7-powerbox# wget \
http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/s
dk3.0/kernel-2.6.22-5.20070920bsc.ppc64.rpm
rpm -ivh --force --noscripts \
kernel-2.6.22-5.20070920bsc.ppc64.rpm
depmod -a 2.6.22-5.20070920bsc
3. We increase the number of maximum loop devices and start the NFS service
by default by editing the file as shown in Example 10-8 on page 571. The file
looks the same, except that we assign the variable DIM_SERVER the IP
address in our example, 192.168.20.70.
4. We make the two changes that are required on the /etc/init.d/halt file on both
the f7-powerbox and rhel51-powerbox:
root@f7-powerbox# perl -pi -e \
's#loopfs\|autofs#\\/readonly\|loopfs\|autofs#'\
/etc/init.d/halt
perl -pi -e \
's#/\^\\/dev\\/ram/#/(^\\/dev\\/ram|\\/readonly)/#'\
/etc/init.d/halt
5. We add all of the QS21 blade host names under /etc/hosts. This applies only
to the dim-server:
root@dim-server# echo "192.168.20.50 mm mm1" >> /etc/hosts
echo "192.168.20.70 dim-server" >> /etc/hosts
for i in `seq 71 82`; do echo "192.168.20.$i cell$i b$i"; \
done >> /etc/hosts
6. We have now completed all of the preliminary steps and can proceed with
the DIM software installation:
a. For the DIM software installation, we first download the DIM RPM and
the required dependent software for DIM from the IBM alphaWorks Web
site at the following address:
http://www.alphaworks.ibm.com/tech/dim
b. We create a /tmp/dim_install directory and place our downloaded DIM rpm
in that directory. This applies to both the dim-server and the dim-storage
machine:
root@dim-server # mkdir /tmp/dim_install
Example 10-9 dim_install.sh script
#!/bin/bash
##################################################################
# DIM Installation Script #
# #
# #
# #
##################################################################
set -e
d. The required busybox binary must be built on a POWER technology-based
system. Because of this, we build this binary on our boot image creation
POWER machine:
root@qs21-imagecreate# cd /tmp/dim_install
tar xzf busybox-1.1.3.tar.gz
cp /opt/dim/doc/busybox.config \
busybox-1.1.3/.config
patch -p0 < /opt/dim/doc/busybox.patch
cd busybox-1.1.3
make
scp busybox \
root@qs21-dim-server:/opt/dim/dist/busybox.ppc
7. With the DIM installed on our server, we must make some modifications to set
up DIM:
a. We use one of the example templates (dim_ip.xml.example4) and modify
the /opt/dim/config/dim_ip.xml file as needed on both servers.
root@dim-server# cp /opt/dim/config/dim_ip.xml.example4 \
/opt/dim/config/dim_ip.xml
Example 10-10 shows the xml file.
Examples:
DIM Component IP-Address Hostname Comment
===============================================================================================
dim-server[1]-bootnet1 192.168.70.14 s1 DIM Server JS21 Slot 14 eth0
dim-server[1]-blade[1]-bootnet1 192.168.70.1 cell1 Cell Blade 1 Slot 1 eth0
dim-server[1]-blade[13]-bootnet1 192.168.70.13 cell13 Cell Blade 13 Slot 13 eth0
dim-server[1]-blade[1]-usernet1 10.0.0.1 cell1u Cell Blade 1 Slot 1 eth1
-->
<dim_ip xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="dim_ip.xsd">
<configuration name="dim">
<component name="server" min="1" max="1">
<network name="bootnet1">
<mask>255.255.255.0</mask>
<addr>192.168.20.0</addr>
<ip>192, 168, 20, 30</ip>
<hostname>dim-server%d, server</hostname>
</network>
<component name="blade" min="1" max="13">
<network name="bootnet1" use="1">
<mask>255.255.255.0</mask>
<addr>192.168.20.0</addr>
<ip>192, 168, 20, ( blade + 70 )</ip>
<hostname>qs21cell%d, ( blade + 70 )</hostname>
</network>
<network name="usernet1">
<mask>255.0.0.0</mask>
<addr>10.10.10.0</addr>
<ip>10, 10, 10, ( blade + 70 )</ip>
<hostname>qs21cell%du, blade</hostname>
</network>
</component>
</component>
</configuration>
</dim_ip>
The main changes made to the example template revolve around the IP
addresses and host names for both the DIM server and the individual
BladeCenter QS21 blade servers. You can modify this file to meet your
specific hardware configuration.
b. We modify /opt/dim/config/dim.cfg to fit our needs as shown in
Example 10-11.
# dhcpd restart command
DHCPD_RESTART_CMD service dhcpd restart
# network interface
NETWORK_INTERFACE eth0
8. Because we are working with two distributions, we must create entries for
both distributions:
root@qs21-storage# mkdir /opt/dim/config/Fedora7
mkdir /opt/dim/dist/Fedora7
mkdir /opt/dim/config/RHEL51
mkdir /opt/dim/dist/RHEL51
cp /opt/dim/CELL/* /opt/dim/config/Fedora7
cp /opt/dim/dist/CELL/* /opt/dim/dist/Fedora7/
cp /opt/dim/CELL/* /opt/dim/config/RHEL51
cp /opt/dim/dist/CELL/* /opt/dim/dist/RHEL51
As you can see, we copied all the files under /opt/dim/CELL and
/opt/dim/dist/CELL into our distribution config directories. The files copied
from /opt/dim/CELL were dist.cfg, image.exclude, and master.exclude. We
configure these files to meet our needs. The files under /opt/dim/dist/CELL do
not need to be changed.
Example 10-12 and Example 10-13 on page 586 show the dist.cfg file for
Fedora 7 and RHEL 5.1, respectively.
# Ethernet DHCP trials
ETHERNET_DHCP_TRIALS 3
# Boot method
BOOT_METHOD bootp
# Kernel modules
KERNEL_MODULES tg3.ko sunrpc.ko nfs_acl.ko lockd.ko nfs.ko
Example 10-13 /opt/dim/config/RHEL51/dist.cfg file
#----------------------------------------------------------------------
# (C) Copyright IBM Corporation 2006
# All rights reserved.
#
#----------------------------------------------------------------------
# $Id: dist.cfg 1760 2007-10-22 16:30:11Z morjan $
#----------------------------------------------------------------------
# Boot method
BOOT_METHOD bootp
LINUX_RC linuxrc
# Kernel modules
KERNEL_MODULES tg3.ko fscache.ko sunrpc.ko nfs_acl.ko lockd.ko nfs.ko
12. Now that these directories are exportable from the dim-server, we mount
them on the POWER technology-based boot image server:
root@qs21-imagecreate# mkdir /opt/dim
mkdir /var/dim
mkdir /srv/tftpboot
mount dim-server:/opt/dim /opt/dim
mount dim-server:/var/dim /var/dim
mount dim-server:/srv/tftpboot /srv/tftpboot
echo "export PATH=$PATH:/opt/dim/bin" >> \
$HOME/.bashrc
. ~/.bashrc
13. We create the zImage files on the DIM boot image server and afterward
unmount the directories:
root@qs21-imagecreate# dim_zimage -d Fedora7
dim_zimage -d RHEL5.1
umount /opt/dim
umount /var/dim
umount /srv/tftpboot
14. We return to the dim-server and create the images:
root@dim-server# dim_image -d Fedora7 readonly \
dim-server[1]-blade[{1..5}]
dim_image -d RHEL5.1 readonly \
dim-server[1]-blade[{6..12}]
dim_nfs add -d Fedora7 all
dim_nfs add -d RHEL5.1 all
dim_nfs start
15. We add the base configuration to our dhcpd.conf file; this applies to our
dim-server only:
root@dim-server# dim_dhcp add option -O \
UseHostDeclNames=on:DdnsUpdateStyle=none
dim_dhcp add subnet -O Routers=192.168.20.100 \
dim-server[1]-bootnet1
dim_dhcp add option -O DomainName=dim
16. We add the entries to our /etc/dhcpd.conf file:
root@dim-server# for i in `seq 1 5`; do dim_dhcp add node -d \
Fedora7 dim-server[1]-blade[$i]; done
for i in `seq 6 12`; do dim_dhcp add node -d \
RHEL5.1 dim-server[1]-blade[$i]; done
dim_dhcp restart
17. We use DIM to ensure that each blade is configured to boot via the network
and then power on each QS21 blade node:
root@dim-server# for i in `seq 1 12`; do dim_bbs -H mm1 $i network; \
done
for i in `seq 1 12`; do dim_bctool -H mm1 $i on; \
done
We have now completed implementing DIM on 12 of our QS21 nodes, using both
Fedora 7 and RHEL 5.1 as deployment distributions.
xCAT is written entirely in scripting languages such as Korn shell, Perl, and
Expect. Many of these scripts can be altered to reflect the needs of your
particular network. xCAT provides cluster management through four main
branches, which are automated installation, hardware management and
monitoring, software administration, and remote console support for text and
graphics.
Hardware management and monitoring
– Supports the Advanced Systems Management features on the System x
server
• Remote power control (on/off/state) via IBM Management Processor
Network or APC Master Switch
• Remote Network BIOS/firmware update and configuration on extensive
IBM hardware
• Remote vital statistics (fan speed, temperatures, voltages)
• Remote inventory (serial numbers, BIOS levels)
• Hardware event logs
• Hardware alerts via SNMP
• Creation and management of diskless clusters
– Supports remote power control switches for control of other devices
• APC MasterSwitch
• BayTech Switches
• Intel EMP
– Traps SNMP alerts and notifies administrators via e-mail
Software administration
– Parallel shell to run commands simultaneously on all nodes or any subset
of nodes
– Parallel copy and file synchronization
– Provides installation and configuration assistance of the high-performance
computing (HPC) software stack
• Message passing interface: Build scripts, documentation, automated
setup for MPICH, MPICH-GM, and LAM
• Maui and PBS for scheduling and queuing of jobs
• GM for fast and low latency interprocess communication using Myrinet
Remote console support for text and graphics
– Terminal servers for remote console
• Equinox ELS and ESP™
• iTouch In-Reach and LX series
• Remote Console Redirect feature in System x BIOS
– VNC for remote graphics
As can be seen, the offerings from xCAT are mostly geared toward automating
the basic setup and management steps for both small clusters and larger, more
complex ones.
As previously mentioned, xCAT’s offerings are broad enough to provide the
cluster management functions that DIM does not.
Additionally, xCAT provides its own method of handling diskless clusters. While
we do not go into further detail about xCAT’s full offerings and implementation,
we briefly discuss xCAT’s solution to diskless clusters.
Network booting
– Boot image built from Virtual Node File System (VNFS)
• A small chroot Linux distribution that resides on the master node
• Network boot image created using VNFS as a template
• Destined to live in the RAM on the nodes
– Nodes boot using Etherboot
• Open source project that facilitates network booting
• Uses DHCP and TFTP to obtain boot image
– Implements RAM-disk file systems
All nodes capable of running diskless
– Account user management
• Password file built for all nodes
• Standard authentication schemes (files, NIS, LDAP)
• Rsync used to push files to nodes
There are many similarities between Warewulf and DIM, most of them
revolving around easing the process of installation and management in an
efficient, scalable manner. Both of these tools automate many of the initial
installation steps that are required for a BladeCenter QS21 blade server to start.
The primary difference between these two node installation and management
tools rests on the machine state. Warewulf provides solutions for stateless
machines through the use of RAM-disk file systems that are shared and read
only. DIM preserves the state of individual nodes by using NFS root methods to
provide read/write images on each node along with making certain components
of the file system read only.
Warewulf version 2.6.x in conjunction with xCAT version 1.3 has been tested on
BladeCenter QS21s. DIM version 9.14 has been tested on BladeCenter QS21
blade servers.
Both of these offerings are good solutions whose benefits depend on the needs
of the particular private cluster setup. In instances where storage is a limited
resource and maintaining node state is not important, Warewulf might be better
suited for your needs. If storage is not a limited resource and maintaining the
state of individual nodes is important, then DIM might be the preferred option.
10.6 Method for installing a minimized distribution
Because the BladeCenter QS21 blade server is diskless, there is a common
need for the operating system to use a minimal amount of resources. Doing so
saves storage space and minimizes memory usage.
We do not cover these two topics in this section because they are beyond the
scope of this documentation. Additionally, DIM addresses both of these issues
and provides a resource-efficient root file system and zImage kernel.
We cover additional steps that you can take during and shortly after installation
to remove packages that, in most cases, are not necessary for a BladeCenter
QS21. This example is shown for RHEL 5.1 only, but it can be applied, with
minor changes, to Fedora 7 installations as well.
10.6.1 During installation
In this section, we discuss the packages that are installed. While this example
uses RHEL 5.1, we do not explain the detailed steps of a RHEL 5.1
installation.
During the package installation step, you must select Customize Software
Selection as shown in Figure 10-1.
On the Package Group Selection display (Figure 10-2 on page 595), you are
prompted to specify which group of software packages you want to install. Some
are already selected by default. Be sure to deselect all of the packages.
After you ensure that none of the package groups are selected, you can proceed
with your installation.
In determining which packages to remove from our newly installed system, we
keep in mind the common purpose of a BladeCenter QS21 server. We remove
packages related to graphics, video, sound, documentation and word
processing, networking and e-mail, security, and other categories that are
usually unnecessary on a BladeCenter QS21 blade server.
Package GNOME Sound Other graphics Doc Other
alsa-utils X
antlr X
esound X
firstboot X
gjdoc X
gnome-mount X
gnome-python2 X
gnome-python2-bonobo X
gnome-python2-canvas X
gnome-python2-extras X
gnome-python2-gconf X
gnome-python2-gnomevfs X
gnome-python2-gtkhtml2 X
gnome-vfs2 X
gtkhtml2 X
gtkhtml2-ppc64 X
java-1.4.2-gcj-compat X
libbonoboui X
libgcj X
libgcj-ppc64 X
libgnome X
libgnomeui X
rhn-setup-gnome X
sox X
2. Remove the ATK library, which adds accessibility support to applications
and GUI toolkits, by using the following command:
# yum remove atk
Removing this package also removes the packages listed in Table 10-4
through YUM.
Table 10-4 Packages removed with atk
Package GNOME Other graphics Other
GConf2 X
GConf2-ppc64 X
authconfig-gtk X
bluez-gnome X
bluez-utils X
gail X
gail-ppc64 X
gnome-keyring X
gtk2 X
gtk2-ppc64 X
gtk2-engines X
libglade2 X
libglade2-ppc64 X
libgnomecanvas X
libgnomecanvas-ppc64 X
libnotify X
libwnck X
metacity X
notification-daemon X
notify-python X
pygtk2 X
pygtk2-libglade X
redhat-artwork X
usermode-gtk X
xsri X
3. Remove the X.org X11 runtime library by entering the following statement:
# yum remove libX11
Removing this package also removes the packages listed in Table 10-5.
Table 10-5 Packages removed with libX11
Package X.org Other graphics Other
libXcursor X
libXcursor-ppc64 X
libXext X
libXext-ppc64 X
libXfixes X
libXfixes-ppc64 X
libXfontcache X
libXft X
libXft-ppc64 X
libXi X
libXi-ppc64 X
libXinerama X
libXinerama-ppc64 X
libXpm X
libXrandr X
libXrandr-ppc64 X
libXrender X
libXrender-ppc64 X
libXres X
libXtst X
libXtst-ppc64 X
libXv X
libXxf86dga X
libXxf86misc X
libXxf86vm X
libXxf86vm-ppc64 X
libxkbfile X
mesa-libGL X
mesa-libGL-ppc64 X
tclx X
tk X X
libSM X
libSM-ppc64 X
libXTrap X
libXaw X
libXmu X
libXt X
libXt-ppc64 X
startup-notification X
startup-notification-ppc64 X
xorg-x11-server-utils X
xorg-x11-xkb-utils X
5. Remove the vector graphics library cairo:
# yum remove cairo
Removing this package also removes the packages listed in Table 10-7.
Table 10-7 Packages removed with cairo
Package Doc Printing Other graphics
cups X
pango X
pango-ppc64 X
paps X
pycairo X
Other graphics-related packages are removed by default due to
dependencies on other non-graphics packages.
Documentation, word processing, and file manipulation packages
Remove the packages that are related to documentation, word processing, and
file manipulation:
1. Remove the diffutils file comparison tool:
# yum remove diffutils
Removing this package also removes the packages listed in Table 10-8.
Table 10-8 Packages removed with diffutils
Packages X.org SELinux GNOME Other
chkfontpath X
policycoreutils X
rhpxl X
sabayon-apply X
selinux-policy X
selinux-policy-targeted X
urw-fonts X
xorg-x11-drv-evdev X
xorg-x11-drv-keyboard X
xorg-x11-drv-mouse X
xorg-x11-drv-vesa X
xorg-x11-drv-void X
xorg-x11-fonts-base X
xorg-x11-server-Xnest X
xorg-x11-server-Xorg X
xorg-x11-xfs X
Network, e-mail, and printing packages
Now you can remove the network-, printing-, and e-mail-related packages from
your default distribution installation:
1. Remove the network packages:
# yum remove ppp
This command also removes the rp-pppoe and wvdial packages. Then enter:
# yum remove avahi avahi-glib yp-tools ypbind mtr lftp
NetworkManager
2. Remove the printing related packages:
# yum remove cups-libs mgetty
3. Remove the e-mail related packages:
# yum remove mailx coolkey
Remove the remaining packages from the RHEL 5.1 default installation:
# yum remove bluez-libs irda-utils eject gpm hdparm hicolor \
ifd-egate iprutils parted pcsc-lite pcsc-lite-libs smartmontools \
wpa_supplicant minicom
At this point, we have removed the packages that are most relevant to graphics,
audio, word processing, networking, and other unneeded tools. This process
should reduce the number of installed packages by about 50%.
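To gauge the effect, you can count the installed packages before and after the
cleanup:
# rpm -qa | wc -l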
10.6.3 Shutting off services
The final step in making the system faster and more efficient is to turn off the
remaining services that are not needed. Doing so saves run-time memory and
speeds up the BladeCenter QS21 boot process. Use the chkconfig
command as shown in the following example to turn off the services that are
specified:
# chkconfig --level 12345 atd off
chkconfig --level 12345 auditd off
chkconfig --level 12345 autofs off
chkconfig --level 12345 cpuspeed off
chkconfig --level 12345 iptables off
chkconfig --level 12345 ip6tables off
chkconfig --level 12345 irqbalance off
chkconfig --level 12345 isdn off
chkconfig --level 12345 mcstrans off
chkconfig --level 12345 rpcgssd off
chkconfig --level 12345 rhnsd off
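To verify the result, you can list the services that remain enabled in any
runlevel:
# chkconfig --list | grep ":on"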
With the distribution stripped down to a smaller number of installed packages
and a minimum set of services running, you can copy this root file system to a
master directory and deploy it to the individual BladeCenter QS21 blade
servers. As stated previously, DIM takes further steps toward resource
efficiency; that implementation can be complemented with the process shown
in this section.
Part 5 Appendixes
In this part, we provide additional information particularly in regard to the
references made in this book and the code examples in Chapter 4, “Cell
Broadband Engine programming” on page 75. This part contains the following
appendixes:
Appendix A, “Software Developer Kit 3.0 topic index” on page 607
Appendix B, “Additional material” on page 615
Topic: Cell/B.E. SDK 3.0 documentation
C and C++ Standard Libraries: C/C++ Language Extensions for Cell Broadband Engine Architecture
Data types and programming directives: C/C++ Language Extensions for Cell Broadband Engine Architecture
Floating-point arithmetic on the SPU: C/C++ Language Extensions for Cell Broadband Engine Architecture
Low-level specific and generic intrinsics: C/C++ Language Extensions for Cell Broadband Engine Architecture
Objects, executables, and SPE loading: Cell Broadband Engine Programming Handbook
Overview of the Cell Broadband Engine processor: Cell Broadband Engine Programming Handbook
Performance data collection and analysis with emitters: IBM Full-System Simulator Performance Analysis
Performance simulation and analysis with Mambo: IBM Full-System Simulator Performance Analysis
PPE serviced SPE C library functions and PPE-assisted functions: Security SDK Installation and User’s Guide
Program loading and dynamic linking: SPU Application Binary Interface Specification
SPE channel and related MMIO interface: Cell Broadband Engine Programming Handbook
SPE local storage memory allocation: Cell Broadband Engine SDK Libraries
SPE serviced C library functions: Security SDK Installation and User’s Guide
SPU and vector multimedia extension intrinsics: C/C++ Language Extensions for Cell Broadband Engine Architecture
SPU compare, branch, and halt instructions: SPU Instruction Set Architecture
The Cell/B.E. SDK 3.0 documentation is installed as part of the installation
package regardless of the product selected. The following list shows the online
documentation:
Software Development Kit (SDK) 3.0 for Multicore Acceleration
– Cell Broadband Engine Software Development Kit 2.1 Installation Guide
Version 2.1
– Software Development Kit for Multicore Acceleration Version 3.0
Programming Tutorial
– Cell Broadband Engine Programming Handbook
– Security SDK Installation and User’s Guide
Programming tools and standards
– C/C++ Language Extensions for Cell Broadband Engine Architecture
– IBM Full-System Simulator Users Guide and Performance Analysis
– IBM XL C/C++ single-source compiler
– SPU Application Binary Interface Specification
– SIMD Math Library Specification
– Cell BE Linux Reference Implementation Application Binary Interface
Specification
– SPU Assembly Language Specification
Programming library documentation
– Accelerated Library Framework for Cell Broadband Engine Programmer’s
Guide and API Reference
– ALF Programmer’s Guide and API Reference For Cell For Hybrid-x86 (P)
– BLAS Programmer’s Guide and API Reference (P)
– Data Communication and Synchronization Library for Cell Broadband
Engine Programmer’s Guide and API Reference
– DaCS Programmer’s Guide and API Reference - For Hybrid-x86
(prototype)
– Example Library API Reference
– Monte Carlo Library API Reference Manual (prototype)
– SPE Runtime Management Library
– SPE Runtime Management Library Version 1.2 to 2.2 Migration Guide
– SPU Timer Library (prototype)
Hardware documentation
– PowerPC User Instruction Set Architecture - Book I
– PowerPC Virtual Environment Architecture - Book II
– PowerPC Operating Environment Architecture - Book III
– Vector/SIMD Multimedia Extension Technology Programming
Environments Manual
– Cell Broadband Engine Architecture
– Cell Broadband Engine Registers
– Synergistic Processor Unit (SPU) Instruction Set Architecture
Select the Additional materials and open the directory that corresponds with
the IBM Redbooks form number, SG247575.
In the rest of this appendix, we describe the contents of the additional material
examples in more detail.
Data Communication and Synchronization programming
example
In this section, we describe the Data Communication and Synchronization
(DaCS) code example that is related to 4.7.1, “Data Communication and
Synchronization” on page 291.
SPU initiated DMA list transfers between LS and main storage
The spe_dma_list directory contains code that demonstrates how the SPU
program initiates DMA list transfers between LS and main storage. This example
also shows how to use the stall-and-notify mechanism of the DMA and how to
implement the event handler in the SPU program to handle the
stall-and-notify events. This code is related to the following examples:
Example 4-19 on page 129
Example 4-20 on page 131
Double buffering
The spe_double_buffer directory contains an SPU program that implements a
double-buffering mechanism. This program is related to the following examples:
Example 4-29 on page 161
Example 4-30 on page 162
Example 4-31 on page 164
Huge pages
The simple_huge directory contains a PPU program that uses buffers that are
allocated on huge pages. This program is related to Example 4-33 on page 170.
Simple mailbox
The simple_mailbox directory contains a simple PPU and SPU program that
demonstrates how to use the SPE inbound and outbound mailboxes. This
program is related to the following examples:
Example 4-35 on page 186
Example 4-36 on page 188
Example 4-39 on page 199
Simple signals
The simple_signals directory contains a simple PPU and SPU program that
demonstrates how to use the SPE signal notification mechanism. This program is
related to the following examples:
Example 4-37 on page 195
Example 4-38 on page 197
Example 4-39 on page 199
Example 4-40 on page 202
Related publications
The publications listed in this section are considered particularly suitable for a
more detailed discussion of the topics covered in this book.
IBM Redbooks
This is the first IBM Redbooks publication regarding the Cell Broadband Engine
(Cell/B.E.) Architecture (CBEA). There are no other related IBM Redbooks or
Redpaper publications at this time.
Other publications
These publications are also relevant as further information sources:
1. K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D.
Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick. The Landscape of
Parallel Computing Research: A View from Berkeley. Technical Report
UCB/EECS-2006-183, EECS Department, University of California at Berkeley,
December 2006.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
2. Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands,
Katherine Yelick. Scientific Computing Kernels on the Cell Processor.
http://crd.lbl.gov/~oliker/papers/IJPP07.pdf
3. Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides. Design
Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley,
1994. ISBN 0201633612.
4. Timothy G. Mattson, Berna L. Massingill, Beverly A. Sanders. Patterns for
Parallel Programming. Addison-Wesley, 2004. ISBN 0321228111.
5. Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, Toshio Nakatani. AA-Sort:
A New Parallel Sorting Algorithm for Multi-core SIMD processors.
International Conference on Parallel Architecture and Compilation
Techniques, 2007.
http://portal.acm.org/citation.cfm?id=1299042.1299047&coll=GUIDE&dl=
6. Marc Snir, Tim Mattson. “Programming Design Patterns, Patterns for High
Performance Computing”. February 2006.
http://www.cs.uiuc.edu/homes/snir/PDF/Dagstuhl.pdf
22.Cell/B.E. Monte Carlo Library API Reference Manual.
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/8D78C965B984D1DE00257353006590B7
23.Domingo Tavella. Quantitative Methods in Derivatives Pricing. John Wiley &
Sons, Inc., 2002. ISBN 0471394475.
24.Mike Acton, Eric Christensen. “Developing Technology for Ratchet and Clank
Future: Tools of Destruction.” June 2007.
http://sti.cc.gatech.edu/Slides/Acton-070619.pdf
25.Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick,
James Demmel. Optimization of Sparse Matrix-vector Multiplication on
Emerging Multicore Platforms.
http://cacs.usc.edu/education/cs596/Williams-OptMultiCore-SC07.pdf
26.Eric Christensen, Mike Acton. Dynamic Code Uploading on the SPU. May
2007.
http://www.insomniacgames.com/tech/articles/0807/files/dynamic_spu_code.txt
27.John L. Hennessy, David A. Patterson. Computer Architecture: A
Quantitative Approach, Fourth Edition. Morgan Kaufmann, 2006. ISBN
0123704901.
Online resources
These Web sites are also relevant as further information sources:
Cell/B.E. resource center on IBM developerWorks with complete
documentation
http://www-128.ibm.com/developerworks/power/cell/
Distributed Image Management for Linux Clusters on alphaWorks
http://alphaworks.ibm.com/tech/dim
Extreme Cluster Administration Toolkit (xCAT)
http://www.alphaworks.ibm.com/tech/xCAT/
OProfile on sourceforge.net
http://oprofile.sourceforge.net
IBM Dynamic Application Virtualization
http://www.alphaworks.ibm.com/tech/dav
MASS
http://www.ibm.com/software/awdtools/mass
Index
Symbols
/etc/fstab 552

Numerics
13 dwarfs 34
3-level memory structure 7

A
ABAQUS 35
ABI (application binary interface) 330
ABI-compliant assembly language instructions 104
Accelerated Library Framework (ALF) 298
   accelerator API functions 312
   accelerator code writer 44
   accelerator task workflow 301
   architecture 300
   computational kernel 302
   concepts 302
   data partitioning 308
   data sets 308
   defined 43
   host API functions 312
   host code writer 44
   library 22
   optimization tips 313
   run time 44
   runtime and programmer’s tasks 299
   tasks and task descriptors 306
   word blocks 307
accelerator element (AE) 23, 449
accelerator mode 39
accelerator task memory layout 308
access ordering 607
accessing events programming interface 205
accessing signaling programming interface 192
acosf4.h 265
ADA compiler 18
adacsd 458
   service 451
additional material 615
Advanced Management Module (AMM) 545
AE (accelerator element) 23, 449
AES 35
affinity 94
ALF (Accelerated Library Framework) 298
   accelerator API functions 312
   accelerator code writer 44
   accelerator task workflow 301
   architecture 300
   computational kernel 302
   concepts 302
   data partitioning 308
   data sets 308
   defined 43
   host API functions 312
   host code writer 44
   library 22
   optimization tips 313
   run time 44
   runtime and programmer’s tasks 299
   tasks and task descriptors 306
   word blocks 307
alf_accel.h 303, 311
algorithm match 50
align_hint 257
aligned attribute 256
alphaWorks 541
altivec.h 81
AMM (Advanced Management Module) 545
application binary interface (ABI) 330
application enablement on Cell/B.E. 31
application enablement process 63
application libraries
   Fast Fourier Transform (FFT) 23
   game math 23
   image processing 23
   matrix operation 23
   multi-precision math 23
   software managed cache 23
   synchronization 23
   vector 23
application profiling 63
Argonne National Laboratory 43
array of structures (AOS) 268
assembly-language instructions 104
asynchronous computation server 207

C
channel 97
   interface 99, 227, 231
   MFC_Cmd 122
   MFC_EAH 122
   MFC_LSA 122
   MFC_RdTagStat 123
   MFC_Size 122
   MFC_TagID 122
   MFC_WrTagMask 123
   problem-state 98
chapel 52
chgrp command 168
chmod command 168
Christopher Alexander 70
closed-page controller 10
clustering scalars into vectors 280
Code Analyzer 25, 401, 430
Code Sourcery 46
collecting trace data with PDT 438
combinatorial logic 35, 51
command queues 10
compat-libstdc++ 563
compiler directives 256
compilers 18
   xlc 339
completion variables 239
complex number 533
   rearrangement 536
composite intrinsics 103, 608
Compressed Sparse Row (CSR) format 71
computation kernels 34
computational kernels 33
condition variable 239
conditional variable 608
constant formation 255
constraint optimization 35
context switching 84
contexts 84
continuous area of LS 126
Control Flow Analyzer 25, 402
conversion intrinsic 255
Cooley-Tukey 522
count mode 377
Counter (mailboxes) 181
Counter Analyzer 25, 402, 412, 423
CPC 423
   hardware sampling 377
   occurrence mode 377
   threshold mode 378
   tool 376
CPI breakdown 414
CPU affinity 467
cross-element shuffle instructions 252
CSR (Compressed Sparse Row) format 71
Cygwin 478

D
DaCS (Data Communication and Synchronization) 291, 449
   common patterns 295
   concepts 294
   configuration 455
   daemons 456
   defined 42
   elements (HE/AE) 294
   group management 295
   groups 294
   hybrid 449
   hybrid implementation 450
   mailboxes 295
   message passing 295
   mutex 294
   programming considerations 452
   remote memory operations 295
   remote memory regions 294
   resource and process management 295
   services 295
   step-by-step example 458
   synchronization 295
   topology 455
   wait identifiers 294
DaCS element (DE) 23
DaCS services
   API environment 454
   data synchronization 452
   error handling 452
   group management 452
   mailboxes 452
   message passing 452
   process management 451
   process management model 453
   process synchronization 452
   remote memory 452
   resource reservation 451
   resource sharing model 453
data
   alignment 344
   communication 40
   distribution 39
   ordering 218
   organization, AOS versus SOA 267
   transfer 110
   transfers and synchronization guidelines 325
Data Communication and Synchronization (DaCS) 291, 449
   common patterns 295
   concepts 294
   configuration 455
   daemons 456
   defined 42
   elements (HE/AE) 294
   group management 295
   groups 294
   hybrid implementation 450
   mailboxes 295
   message passing 295
   mutex 294
   programming considerations 452
   remote memory operations 295
   remote memory regions 294
   resource and process management 295
   services 295
   step-by-step example 458
   synchronization 295
   topology 455
   wait identifiers 294
DAV (Dynamic Application Virtualization) 47, 475
   architecture 476
   DAV-enabled application 478
   defined 44
   IBM alphaWorks 475
   log file 494
   stub library 477
   target applications 476
DAVClientInstall.exe 478
David Patterson 34
dav-server.ppc64.rpm 478
davService 495
davStart daemon 495
DAVToolingInstall.exe 478
DAXPY 318
DCOPY 318
DDOT 318
DE (DaCS element) 23
debug format (DWARF) 608
Debug Perspective 373
debugger, per-frame selection 349
debugging
   architecture 347
   multi-threaded code 347
   techniques 329
   using scheduler-locking 348
decision tree 35
decrementer 207
   events 204
dense matrices 34, 51
DES 35
device memory 223
DGEMM 317–318
DGEMV 318
DHCP server 543
dhcpd.conf 549
DIM implementation example 577
DIM_DATA 574
DIM_MASTER 575
direct problem state access 106
direction (mailboxes) 181
discontinuous areas 126
distributed array 55, 61–62
distributed image management 569
distributed programming 445
divide and conquer 60, 62
DMA
   commands 112–113
   controller 330
   data transfer
      SPU initiated LS to LS 147
   events 203
   get and put transfers 123
   list command 127
   list creation 126
   list data transfer 126
   list dynamic updates 207
   optimization 528
   transfer 113, 139
      initiating 122
      PPU initiated between LS and main storage 138
      waiting for completion 122
domain 97
   channel problem-state 98
   decomposition 39
   user-state 98
domain-specific libraries 289
double buffering 160
   barrier-option 241
   common header file 161
   PPU code mechanism 164
   SPU code mechanism 162
double-precision instructions 247
DSCAL 318
DSYRK 318
DTRSM 318
dual issue 248
   optimization 253
   programming considerations 324
DWARF 608
dwarfs, 13 34
Dynamic Application Virtualization (DAV) 47, 475
   architecture 476
   DAV-enabled application 478
   defined 44
   IBM alphaWorks 475
   log file 494
   stub library 477
   target applications 476
dynamic branch prediction 289
Dynamic Creator 510
Dynamic Linking 610
dynamic loading of the SPE executable 85
dynamic programming 35, 51

E
EA (effective address) 99
Eclipse 362
   IDE 26
EEMBC (Embedded Microprocessor Benchmark Consortium) 34
effective address (EA) space 5, 98–99, 330
effective auto-SIMDization 271
EIB (Element Interconnect Bus) 9, 122
   exploitation 60
Element Interconnect Bus (EIB) 9, 122
   exploitation 60
elfspe utility 563
Embedded Microprocessor Benchmark Consortium (EEMBC) 34
embedspu command 333
encapsulation 398
encryption 35
Eric Christensen 74
Euler scheme 500
Event Mask channel 205
event-based coordination 60, 62
events 203
   decrementer 204
   mailbox or signals 203
   MFC DMA 203
   SPU
      write event acknowledgement 205
      write event mask 205
   SPU read event mask 205
   SPU read event status 205
   synchronization 204
Extreme Cluster Administration Toolkit 589

F
fabsf4.h 264
Fast Fourier Transform (FFT) 23, 522
   algorithm 521
   branch hint directives 532
   code inlining 532
   DMA optimization 528
   FFT library 608
   library 315, 521
   multiple SPUs 529
   performance 531
   port to PowerPC 526
   shuffle intrinsic 532
   SIMD Math Library 531
   SIMD strategies 529
   single SPU 527
   striping multiple problems across a vector 530
   synthesizing vectors by loop unrolling 530
   transforms 34
   x86 implementation 526
fast-path mode 10
FDPR_PROF_DIR 399
FDPR-Pro 25, 248, 375, 394–395, 428
FDPR-Pro 563
fdprpro 428
FDPR-Pro process 396
Fedora 544
Feedback Directed Program Restructuring (FDPR-Pro) 25
fence or barrier command options 227
Fenced command 228
fenced option 240
fetch-and-increment 234
FFT (Fast Fourier Transform) 23, 522
   algorithm 521
   branch hint directives 532
   code inlining 532
   DMA optimization 528
   FFT library 608
   library 521
   multiple SPUs 529
   performance 531
   port to PowerPC 526
   shuffle intrinsic 532
   SIMD Math Library 531
   SIMD strategies 529
   single SPU 527
   striping multiple problems across a vector 530
   synthesizing vectors by loop unrolling 530
   transforms 34
   x86 implementation 526
FFT (Fast Fourier Transform) library 315
FFT16M
   analysis 418
   makefile 418
FFTW 34
FIDAP 35
financial services 499
finite elements 35
finite state machine 35, 51
firewall 494
firmware 566
   considerations 565
first pass SPU implementation 156
fixed work assignment 71
floating-point operations 247
fluent 35
FNFS (Virtual Node File System) 592
fork/join 61
   programming construct 62
   structure 55
FORTRAN 22
FORTRAN 77/90 317
FPRegs 358
frameworks 289
fstab 552
Full System Simulator 19, 354
function inlining 284, 337
function offload 39
function specific header files 239
functional-only simulation 19

G
game math library 608
gang 94
Gaussian
   random numbers on SPUs 510
   random variables 502
GCC
   compiler 332
   compiler directives 337
   specific optimization passes 336
gcc 478
gdb 345
   debugging PPE code 346
   debugging SPE code 346
gdbserver 371
Gedae 46
generic and built-in intrinsics 254
genomics 35
geometric decomposition 60, 62
get command 101, 114
getb 115
getbs 115
getf 115
getfs 115
getl 115
getlb 115
getlf 115
getllar 119
gets 114
GNU ADA compiler 18
GNU toolchain 18, 332
GPRegs 358
gprof 24
graph traversal 35, 51
graph traversal dwarf 50
graphical models 35, 51
graphical Trace Analyzer 392
GROMACS 35
groupadd command 168

H
hard disk 543
hardware sampling 377
hbr 287
HBR (hint-for branch) 287
hbra 287
hbrr 287
hdacsd service 451
HE (host element) 23
hello_world 314
hierarchy of accelerators 293
High Performance Computing Challenge (HPCC) 34
hint-for branch (HBR) 287
HMMER 35
host element (HE) 23, 449
host-accelerator model 58
hotspots 63
HPCC (High Performance Computing Challenge) 34
   FFT 34
   HPL 34
huge pages 166
Hybrid ALF application
   building and running 465
   step-by-step example 467
hybrid architecture motivations 447
Hybrid DaCS 449
   building and running an application 454
hybrid implementation of DaCS 450
hybrid model
   performance 448
   system 447
Hybrid Model System architecture 447
hybrid programming model 445–446
Hybrid-x86 programming model 26

I
IBM DAV Tooling wizard 484
IBM Eclipse IDE for the SDK 26
IBM Full System Simulator 19, 354
IBM SDK for Multicore Acceleration 17
IBM XL C/C++ 339
IBM XL Fortran for Multicore Acceleration for Linux 19
IBM XLC/C++ compiler 18
IBM_DAV_PATH 489
IDAMAX 318
IEEE-754 80
Image Management 569
IMD programming 610
inbound mailboxes 180
independent processor elements 4
indirect addressing 71
InfiniBand 41, 568
initrd 548
inlining 337
inout_buffer 314
installation of SDK 3.0 560
instruction barrier 222
Instruction Set Architecture (ISA) 249–250
   SIMD instructions 250
instruction sets 11
inter-processor communication 178
   PPU and SPU macros for tracing 202
   programming considerations 328
Interrupt Handler 207
intrinsics 249
   arithmetic 255
   bits and masks 255
   branch 255
   channel control 256
   classes 254
   compare 255
   composite 103
   constant formation 255
   control 256
   conversion 255
   functional types 255
   halt 255
   logical 255
   low level 104
   ordering 256
   programming considerations 321
   scalar 255
   shift and rotate 255
   shuffle 532
   synchronization 256
inverse_matrix_ovl 314
IOIF 11
irregular grids 35
ISA (Instruction Set Architecture) 249–250
   SIMD instructions 250
ISAMAX 318

K
kernel zImage 543
KERNEL_MODULES 575
KERNEL_VERSION 574
Kirkpatrick-Stoll 316

L
language options 335
LAPACK 22, 317
Large Matrix Library 608
libhugetlbfs 170
libmassv.a 265
libnuma library 172
libsimdmath.a 83, 264
libspe library 83
libspe2 41
libspe2_types.h 107–108
libspe2.h 86, 90, 106, 113, 140, 145, 182, 193, 206
libsync.h 239
lightweight mailbox operation 67
Linpack (HPL) benchmark 22
Linux Kernel 20
little endian 14
load-and-reserve
   functionality 238
   instructions 234
Load-Exec 359
local storage (LS) 110, 246
   programming considerations 321
local store address (LSA) 99, 128
lock 119
lock report example 394
loop
   parallelism 55, 61–62
   programming considerations 322
   unrolling 285
   unrolling for converting scalar data to SIMD data 265
Los Alamos National Laboratory 43
low level intrinsics 104
LS (local storage) 110, 246
   programming considerations 321
LS arbitration 247
LSA (local store address) 99, 128

M
mailbox or signal events 203
mailboxes 179
   attributes 181
   blocking versus nonblocking access 183
   MFC functions for accessing 182
   programming interface for accessing 182
mailboxes and signals comparison 179
main storage 98
   and DMA 246
   domain 98
Mambo 609
Managed Make C/C++ 363
managing SPE threads 83
many-to-one signalling mode 191
map-reduce 35, 51
map-reduce dwarf 50
Markov models 35
MASS (Mathematical Acceleration Subsystem) 500, 513
   intrinsic functions 514
   libraries 21
MASS and MASSV libraries 264
master nodes 591
master/slave relationship 591
master/worker 60, 62, 71
   structure 55
Mathematical Acceleration Subsystem (MASS) 500, 513
   intrinsic functions 514
   libraries 21
matrix libraries 318
matrix_add 314
matrix_transpose 314
matrix-matrix operations 34
matrix-vector operations 34
Mattson 54
memory
   initialization 10
   latency 5
   locality 342
   maps 609
   scrubbing 10
   structure of an accelerator 59
memory flow controller (MFC) 99, 105, 330, 356, 609
   channels 99
   DMA events 203
   functions 102, 105–106
   functions for accessing mailboxes 182
   multisource synchronization 231
   multisource synchronization facility 220, 230
   ordering mechanisms 227
memory interface controller (MIC) 10
Memory Management Unit (MMU)
MMU (Memory Management Unit) 112
memory-mapped I/O (MMIO)
   interface 97, 99, 190, 227, 231
   register 13
Mercury Computer Systems 45
Mersenne Twister 316, 502
   algorithm 506
   parameters 506
MESI 73
MESIF 73
message passing 40
message passing interface (MPI) 40, 43, 62
   DaCS application arrangement 292
MFC (memory flow controller) 99, 220, 330, 356, 609
   channels 99
   DMA events 203
   functions 102, 105–106
   functions for accessing mailboxes 182
   MMIO interface programming methods 105
   multisource synchronization 231
   multisource synchronization facility 220, 230
   ordering mechanisms 227
mfc_barrier 119, 230
MFC_Cmd channel 122
MFC_CMDStatus register 140
MFC_EAH channel 122
MFC_EAL channel 127
mfc_eieio 119, 230
mfc_get 114, 122
mfc_getb 115
mfc_getf 115, 228
mfc_getl 115, 127
mfc_getlf 115
mfc_getllar 214, 235
MFC_GETS_CMD 113, 139
mfc_list_element 127–128
MFC_LSA channel 122
MFC_MAX_DMA_LIST_SIZE 118
MFC_MAX_DMA_SIZE 117
MFC_MSSync 231
MFC_OUT_MBOX_AVAILABLE_EVENT 205
mfc_put 113, 122
MFC_PUT_CMD 113, 139
mfc_putb 114, 228
mfc_putf 113
mfc_putl 114, 127
mfc_putlb 114
mfc_putlf 114
mfc_putllc 214, 235
mfc_putlluc 236
mfc_putqlluc 236
MFC_RdTagStat channel 123, 128
mfc_read_tag_status_all 123
mfc_read_tag_status_any 122
MFC_SIGNAL_NOTIFY_1_EVENT 205
MFC_Size channel 122, 127
mfc_sndsig 192
mfc_sync 230
mfc_tag_release 121
mfc_tag_reserve 121
MFC_TagID channel 122
mfc_write_tag_mask 122
MFC_WrMSSyncReq 232
MFC_WrTagMask channel 123, 128
mfceieio 119
mfcsync 119
MIC (memory interface controller) 10
microprocessor performance 6
Microsoft Visual C++ 478
minimized distribution 593
mkinitrd 548
MMIO (memory-mapped I/O)
   interface 97, 99, 190, 227, 231
   register 13
MOESI 73
monitoring asynchronously 204
Monte Carlo 35
   Dynamic Creator 506
   European option sample code 508
   Gaussian random numbers 505
   Gaussian variables 502
   libraries 316
   option pricing 499
   parallel and vector implementation 503
   parallelizing the simulation 504
   performance improvement 518
   Polar method 518
   simulation 499
   simulation for option pricing 500
   work partitioning 504
Moro’s Inversion transformation 316
most-significant bit 14
MPI (message passing interface) 40, 43, 62
   DaCS application arrangement 292
MPICH 43, 62
MPMD 22
multibuffering 161, 166
Multicore Acceleration 45
Multicore Acceleration Integrated Development Environment 362
multiple SPE
   concurrently running
      PPU code 90
      SPU code version 94
multiple-program-multiple-data (MPMD) programming module 22
multiplies
   programming considerations 324
Multi-Precision Math Library 609
multisource synchronization facility 231
multi-SPE implementation 63
multi-stage pipeline 73
multi-threaded program, SPE 89
mutex 234, 239, 609
mutex lock 234
   SPE implementation 237
MVAPICH 43, 62
mysim 356

N
NAMD 35
NAS
   CG 34
   EP 35
   FT 34
   LU 34
   MG 35
NAS (NASA Advanced Supercomputing) 34
NASA Advanced Supercomputing (NAS) 34
N-Body dwarf 50
N-body methods 35, 51
network booting 592
newlib 41
NFS 550
NFS_NETWORK 574
Noise Library 609
noncoherent I/O interface (IOIF) protocol 11
nonuniform memory access (NUMA) 41, 112, 168, 326
   BladeCenter 171
   code example 174
   command utility (numactl) 176
   memory access improvement 171
   policy considerations 177
notify_event_handler function 128
NUMA (nonuniform memory access) 41, 112, 168, 326
   BladeCenter 171
   code example 174
   command utility (numactl) 176
   memory access improvement 171
   policy considerations 177
numactl 176

O
object files 609
Ohio State University 43
one-sided communication 295
one-to-one signalling mode 191
opannotate 383
opcontrol 383
OpenIB (OFED) for InfiniBand networks 43
OpenMP 40, 45, 62
OpenMPI 43
operating system installation 543
opreport tool 24, 383
OProfile 24, 382, 424
optical drive 549
optimization level 271
ordering and synchronization mechanisms 240
ordering reads 241
oscillator libraries 609
outbound mailboxes 180
overlay 283
overrun (mailboxes) 181

P
package
   removal 596
   selection 594
page hit ratio 166
parallel computing research community 34
parallel programming models 37
   taxonomy 54
parallelism 49
Partitioned Global Address Space (PGAS) 39–40
PDT (Performance Debug Tool) 25, 386, 438
   importing data into Trace Analyzer 441
   trace file set 392
pdt_cbe_configuration.xml 440
PDT_CONFIG_FILE 440
PDT_TRACE_OUTPUT 391
PDTR 393
PDTR Report Example 393
PeakStream 45, 52
pending breakpoints 350
performance bottlenecks 35
Performance Debug Tool (PDT) 25, 386, 438
   importing data into Trace Analyzer 441
   trace file set 392
performance instrumentation 609
performance simulation 20, 355
performance tools 24, 417
performance tuning 66
performance/watt 49
PFA 522
PGAS (Partitioned Global Address Space) 39–40
Phillip Colella 34
pipeline 60, 62, 248
Pipeline Analyzer 25, 401
pipeline model 58
PLUGIN_MAC_ADDR 574
pointer aliasing 344
Polar method 518
   transformation 316
Post-link Optimization for Linux 394
Power Processing Element (PPE) 22
   atomic implementation 235
   barrier intrinsics 222
   interrupts 609
   multithreading 610
   mutex_lock function implementation in sync library 237
   ordering instructions 221
   oscillator subroutines 610
   programming 77
   variables 331
PowerPC 77
   Architecture 8
PowerPC processor storage subsystem (PPSS) 13
PowerPC Processor Unit (PPU)
   bookmark mode 378
   double buffering code 162, 164
   executable 365
   shared library 365
   static library 365
PPE (Power Processing Element) 22
   atomic implementation 235
   barrier intrinsics 222
   interrupts 609
   multithreading 610
   mutex_lock function implementation in sync library 237
   ordering instructions 221
   oscillator subroutines 610
   programming 77
   variables 331
PPE-assisted functions 610
PPE-assisted library facilities 208
PPE-ELF 330
PPE-to-SPE communications 242
PPSS (PowerPC processor storage subsystem) 13
PPU (PowerPC Processor Unit)
   bookmark mode 378
   double buffering code 162, 164
   executable 365
   shared library 365
   static library 365
ppu_intrinsics.h 235
ppu32-embedspu 333
ppu32-gcc 333
ppu-embedspu utility 466
ppu-gcc command line options 334
Prime Factor Algorithm 522
Privileged Mode Environment 610
Problem State Memory-Mapped Registers 610
processor elements 8
Profile Analyzer 25, 401, 426
Profile Checkpoints 609
profile data 423
profile directed feedback optimization 338
profile information
   gathering with FDPR-Pro 428
profiling 63, 418
profiling or watchdog of SPU program 207
program loading 610
programming
   considerations 33
   distributed 445
   dynamic 35, 51
   environment 11
   frameworks 61
   guidelines 319
   IMD 610
   models 40
   SIMD 258
   techniques 75
   tools 329
programming interface
   accessing events 205
   accessing signaling 192
project configuration 365
proxydma 352
Prxy_QueryMask register 141
Prxy_TagStatus register 141
pthread.h 90
pthreads 40, 62
put 113
putb 114
putbs 114
putf 113
putfs 114
putl 114
putlb 114
putlf 114
putllc 119
putlluc 119
putqlluc 119
puts 113

Q
QS21
   boot up 546
   Firmware considerations 565
   Installing the Operating System 543
   network installation 547
   overview 542
   Updating firmware 566
quadword boundaries 259
queues 99
Quicksort 35

R
RA (Real address) 99
random data access
   high cache hit rate 156
   SPU software cache 148
random numbers, Monte Carlo generation 502
RapidMind 46, 52
Ray tracing 35
rc.sysinit 168
rDMA (remote direct memory access) 40
reader/writer locks 239
Real Address (RA) range 99
Redbooks Web site 626
   Contact us xvii
register file 246
relational operators 261
remote direct memory access (rDMA) 40
Remote Procedure Call (RPC) 298
remote tools for choosing the target environments 367
removal
   alsa-lib packages 596
   atk packages 597
   cairo packages 601
   diffutils packages 602
   libICE packages 600
   libX11 packages 599
restrict qualifier 258
RHEL 5.1 544
   installation package selection 594
RISC 8
root file system 547
runtime environment 15

S
safe mode 153
SAS 542
SAXPY 318
ScaLAPACK 22, 34, 317
scalar 255
   overlay on SIMD in SPE 252
   overlay on SIMD instructions 278
   programming considerations 323
   related instructions 251
scatter-gather 273
scenarios 66
Scientific Cluster Support 591
SCOPY 318
SCS 591
SDE 500
SDK 3.0
   installation 560
   pre-installation 563
SDOT 318
SELinux 544
semaphore 234
sequence alignment 35
Sequential Trace Output Example 394
Serial Attached SCSI 542
serial interface 543, 545
Serial over LAN (SOL) 545
set spu stop-on-load command 351
SGEMM 318
SGEMV 318
shared data 55, 61–62
shared memory 4
shared queue 55, 61–62
shared storage
   model 221
   synchronization 610
   synchronizing 218
shift and rotate 255
shuffle instructions 252
shuffle intrinsic 532
shutting off services 604
signal
   notification 190, 610
   notification code example 194
   OR mode 191
   Overwrite mode 191
signalling commands
   sndsig 191
   sndsigb 191
   sndsigf 191
signals and mailboxes comparison 179
SIMD
   arithmetic and logical operators 261
   low level intrinsics 261
   Math Library 83, 262, 500, 531
   operations 250, 260
   programming 258
      considerations 322
   scalar overlay in SPE 252
SIMDization 343
   problems 275
simdmath.h 83, 264
simplex algorithm 35
simulation control 359
simulator 19, 354
   GUI 357
   integration 367
simulator image 356
Single Program Multiple Data (SPMD) 60, 62
   structure 55
single SPE
   PPU code 86
   program 85
   shared header file 86
single thread performance 446
single-precision instructions 247
slave nodes 591
slow mode 10
SMM (synergistic memory management) unit 13
SMS (System Management Services) 546
   utility program 546
sndsig 119, 191
sndsigb 119, 191
sndsigf 119, 191
SOA (structure of arrays) 268
Sobol 316
software cache 42, 152
   when and how to use 156
software cache activity 149
software pipelining 336
software-controlled modes 10
SOL (Serial over LAN) 545
sparse matrices 34, 51
SPE (Synergistic Processor Element) 22, 611
   affinity using gang 94
   atomic implementation 235
   automatic software caching 157
   Channel and Related MMIO Interface 610
   code compile 333
   context switching 610
   contexts 84
   events 203, 610
   instrumentation 398
   local storage memory allocation 610
   managing threads 83
   multi-threaded program 89
   oscillator subroutines 610
   persistent threads on each 59
   physical chain 95
   PPU code 86
   process-management primitives 330
   programming tips 610
   programs loading 85
   runtime management library 21, 83–84, 231
   Serviced C Library Functions 610
   shared header file 86
   single program 85
   SPU_RdSigNotify 190
   thread 347
   updating shared structures 243
   writing notifications to PPE 240
spe_context_create 85, 107
spe_context_destroy 85
spe_context_run 85
spe_cpu_info_get 173
spe_event_wait 206
spe_ls_area_get 145, 147
SPE_MAP_PS 107
spe_mfcio_getf 228
spe_mfcio_put 113
spe_mfcio_putb 114, 228
spe_mfcio_putf 113
spe_mfcio_tag_status_read 141
spe_mfcio.h 129
spe_ps_area_get 106–107
SPEC (Standard Performance Evaluation Consortium) int and fp 34
Specific Intrinsics 254
SPECInt:gcc 35
Spectral methods 34, 51
speculative read 10
SPE-to-SPE DMA transfers 95
SPMD (Single Program Multiple Data) 60, 62
   structure 55
SpMV 34
SPU (Synergistic Processor Unit) 611
   application binary interface 610
   architectural overview 611
   as computation server 207
   C/C++ language extensions (intrinsics) 253
   channel instructions 611
   channel map 611
   channels 612
   code transfer using SPU code overlay 283
   compare, branch, and halt instructions 611
   constant-formation instructions 611
   control instructions 611
   executable 365
   floating-point instructions 611
   hint-for-branch instructions 611
   instruction set 249
   Instruction Set Architecture 9
   integer instructions 611
   interrupt facility 611
   intrinsics 254
   isolation facility 611
   load/store instructions 611
   logical instructions 611
   multimedia extension intrinsics 610
   ordering instructions 223
   performance evaluation 611
   performance evaluation criteria 611
   programming 244, 612
   programming methods 101
   read event mask (SPU_RdEventMask) 205
   read event status (SPU_RdEventStat) 205
   rotate and mask 611
   rotate instructions 611
   shift 611
   signal notification 192
   static library 365
   statistics 611
   synchronization and ordering 611
   write event acknowledgment (SPU_WrEventAck) 205
   write event mask (SPU_WrEventMask) 205
spu_absd 255
spu_add 254–255
spu_and 255
spu_cmpeq 255
spu_cmpgt 255
spu_convtf 255
spu_convts 255
spu_dsync 225–226, 256
spu_extract 255, 279
spu_idisable 256
spu_ienable 256
spu_insert 255, 279
spu_internals.h 225
spu_intrinsics.h 162, 253
spu_madd 255
spu_mfcdma32 256
spu_mfcdma64 256
spu_mfcio.h 113, 117, 122, 162, 182, 192, 232
spu_mfcstat 256
spu_nmadd 255
spu_or 255
spu_promote 255, 279
SPU_RdSigNotify 190
spu_read_event_status 205
spu_readch 256
spu_rlqw 255
spu_rlqwbyte 255
spu_sel 255
spu_shuffle 255
spu_splats 255, 279
spu_stat_event_status 205
spu_stop 256
spu_sync 225–226
spu_sync_c 225–226
spu_timing 24
spu_timing tool 537
spu_writech 256
spu2vmx.h 82
spu-gcc 333
   command line options 334
SPUStats 359
SPU-Timing information 436
SSCAL 318
SSYRK 318
stall 443
stall-and-notify event 128
stall-and-notify flag 128
stalling mechanism 100
Standard Make C/C++ 363
Standard Performance Evaluation Consortium (SPEC) int and fp 34
static branch prediction 289
static loading of SPE object 85
static timing tool 24
stochastic differential equation 500
stop-on-load 351
storage
   access 221
   access ordering 611
   barriers 222
   domains 97
   domains and interfaces 12
   models 611
store-conditional 238
   instructions 234
streaming 39
streaming model 57
StreamIt 40, 52
STRSM 318
structure of arrays (SOA) 268
structured grids 35, 51
SuperLU 34
SWAP 544
Swing Modulo Scheduling 336
symbols 350
sync library 611
   facilities 238
synchronization
   events 204
   primitives 39
synchronous data access
   using safe mode 153
synchronous monitoring 204
synergistic memory management (SMM) unit 13
Synergistic Processor Element (SPE) 22, 59, 84, 610–611
   atomic implementation 235
   code compile 333
   context switching 610
   events 203, 610
   instrumentation 398
   local storage memory allocation 610
   oscillator subroutines 610
   process-management primitives 330
   programming tips 610
   programs loading 85
   runtime management library 21, 83–84, 231
   Serviced C Library Functions 610
   SPU_RdSigNotify 190
   thread 347
   updating shared structures 243
   writing notifications to PPE 240
Synergistic Processor Unit (SPU) 611
   application binary interface 610
   architectural overview 611
   as a computation server 207
   C/C++ language extensions (intrinsics) 253
   channel instructions 611
   channel map 611
   channels 612
   code transfer using SPU code overlay 283
   compare, branch, and halt instructions 611
   constant-formation instructions 611
   control instructions 611
   executable 365
   floating-point instructions 611
   hint-for-branch instructions 611
   instruction set 249
   Instruction Set Architecture 9
   integer instructions 611
   interrupt facility 611
   intrinsics 254
   isolation facility 611
   load/store instructions 611
   logical instructions 611
   multimedia extension intrinsics 610
   ordering instructions 223
   performance evaluation 611
   performance evaluation criteria 611
   programming 244, 612
   programming methods 101
   read event mask (SPU_RdEventMask) 205
   read event status (SPU_RdEventStat) 205
   rotate and mask 611
   rotate instructions 611
   shift 611
   signal notification 192
   static library 365
   static timing tool 24, 248
   statistics 611
   synchronization and ordering 611
   write event acknowledgment (SPU_WrEventAck) 205
   write event mask (SPU_WrEventMask) 205
System Management Services (SMS) 546
   utility program 546
system memory 223
system root image 20
systemsim script 356

T
tag manager 121
task descriptors 306
task parallelism 60, 62
task synchronization 39
task_context 314
tasks 306
test-and-set 234
TFTP 548
time base 612
timing simulation 20, 355
TLB (translation lookaside buffer) 166
TLB misses 443
Tprofs 25
Trace Analyzer 25, 387, 402, 409, 441
trace data 437
tracing 386
   architecture 387
transactional memory mechanisms 39
translation lookaside buffer (TLB) 166
tree 60, 62
triggers 360

U
unary operators 261
unsafe mode 154
unstructured grids 35, 51
UPC 40, 62
user mode environment 612
usermod command 168
user-state 98

V
vec_types.h 83
vector data types 259
vector data types intrinsics 80
vector library 612
vector subscripting 261
Vector/SIMD Multimedia Extension 612
Virtual Node File System (VNFS) 592
virtual storage environment 612
Visual Performance Analyzer (VPA) 25, 400, 423
vmx2spu.h 82
volatile keyword 257
VPA (Visual Performance Analyzer) 25, 400, 423

W
Warewulf 591
work blocks 307
work distribution 39
workload specific libraries 45
WRF 35

X
X10 62
X10 (PGAS) 40
xCAT 589
   diskless systems 591
XCOFF 401
XDR memory 542
XL compiler 340
   High order transformations 341
   Link-time Optimization 342
   Optimization levels 340
   Vectorization 343
XL Fortran for Multicore Acceleration for Linux 19
xlc 339
XLC/C++ compiler 18
XML parsing 35

Y
YUM updater daemon 563

Z
zImage 543
   creation of files 556