
Cell Broadband Engine 

Software Development Kit 2.1


Programmer's Guide
Version 2.1

SC33-8325-01
Note: Before using this information and the product it supports, read the general information in Appendix A, “Notices,” on page
57.

Second Edition (March 2007)


This edition applies to the SDK 2.1 and to all subsequent releases and modifications until otherwise indicated in
new editions.
© Copyright International Business Machines Corporation 2006. All rights reserved.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents

Preface  v
  About this book  v
  New in this release  v
  Supported platforms  vi
  Supported languages  vi
  How to use the SDK  vi
  Getting support  vi
  Related documentation  vii

Chapter 1. SDK overview  1
  GNU tool chain  1
  IBM XL C/C++ compiler  2
  IBM Full-System Simulator  3
  System root image for the simulator  4
  Linux kernel  5
  Cell/B.E. libraries  5
    SPE Runtime Management Library Version 2.1  5
    SIMD math libraries  6
    Mathematical Acceleration Subsystem (MASS) library  7
  Prototype code  7
    Prototype libraries and samples package  7
      Libraries and samples subdirectories  8
    Accelerated Library Framework (ALF)  10
  Performance support libraries and utilities  10
    SPU timing tool  10
    Feedback Directed Program Restructuring (FDPR-Pro)  11
    OProfile  11
      SPU profiling restrictions  12
      SPU report anomalies  12
      Known restrictions  13
    Cell-perf-counter tool  13
  IBM Eclipse IDE for Cell/B.E. SDK  13

Chapter 2. Programming with the SDK  15
  System root directories  15
  Running the simulator  16
    The callthru utility  17
    Read and write access to the simulator sysroot image  18
    Enabling Symmetric Multiprocessing support  18
    Enabling xclients from the simulator  18
  Specifying the processor architecture  19
  SDK programming samples  20
    Changing the default compiler  20
    Building and running a specific program  21
    Compiling and linking with the GNU tool chain  21
  Support for huge TLB file systems  21
  SDK development best practices  22
    Using sandboxes for application development  22
    Using a shared development environment  22

Chapter 3. Debugging Cell/B.E. applications  25
  Introduction  25
    GDB for SDK 2.1  25
    Compiling with GCC or XLC  25
  Using the debugger  26
    Debugging PPE code  26
    Debugging SPE code  26
      Source level debugging  27
      Assembler level debugging  27
      How spu-gdb manages SPE registers  28
  Debugging in the Cell/B.E. environment  29
    Debugging multithreaded code  29
      Debugging architecture  29
      Viewing symbolic and additional information  30
      Using scheduler-locking  31
    Using the combined debugger  32
      Setting pending breakpoints  32
      Using the set spu stop-on-load command  33
    New command reference  34
      info spu event  35
      info spu signal  35
      info spu mailbox  35
      info spu dma  35
      info spu proxydma  36
  Setting up remote debugging  36
    Remote debugging overview  36
    Using remote debugging  36
    Starting remote debugging  37

Chapter 4. SPU code overlays  39
  What are overlays  39
  How overlays work  39
  Restrictions on the use of overlays  40
  Planning to use overlays  40
    Overview  40
    Sizing  40
    Scaling considerations  41
    Overlay tree structure example  41
    Length of an overlay program  42
    Segment origin  42
    Overlay processing  43
      Call stubs  44
      Segment and region tables  44
    Overlay graph structure example  44
  Specification of an SPU overlay program  47
  Coding for overlays  48
    Migration/Co-Existence/Binary-Compatibility Considerations  48
    Compiler options  48
  SDK overlay samples  49
    Simple overlay sample  49
    Overview overlay sample  52
    Large matrix overlay sample  53
  Using the GNU SPU linker for overlays  55

Appendix A. Notices  57
  Edition notice  59
  Trademarks  60

Appendix B. Related documentation  61

Glossary  63

Index  67
Preface
The Software Development Kit 2.1 (SDK) for the Cell Broadband Engine™
(Cell/B.E.™) is a complete package of tools to enable you to program applications
for the Cell/B.E. processor. The SDK is composed of development tool chains,
software libraries and sample source files, a system simulator, and a Linux® kernel,
all of which fully support the capabilities of the Cell/B.E. processor.

About this book


This book describes how to use the SDK to write applications for the Cell/B.E.
platform. How to install the SDK is described in a separate manual, SDK 2.1
Installation Guide, and there is also a programming tutorial provided within the
SDK ISO image.

Each section of this book covers a different topic:


- Chapter 1, “SDK overview,” on page 1 describes the components of the SDK
- Chapter 2, “Programming with the SDK,” on page 15 explains how to program applications for the Cell/B.E. platform
- Chapter 3, “Debugging Cell/B.E. applications,” on page 25 describes how to debug your applications
- Chapter 4, “SPU code overlays,” on page 39 describes how to use overlays

New in this release


SDK 2.1 contains a number of significant enhancements over previous versions of
the SDK and completely replaces these SDK versions. These enhancements include:
- Upgraded Linux kernel to 2.6.20 with enhancements for preemptive scheduling of Synergistic Processor Element (SPE) tasks, SPE logical affinity support, and improved performance via 64 KB Local Store page mapping.
- Standardization on the new SPE Runtime Management Library (libspe2). The older and less functional library, libspe 1.x, is being deprecated in this release.
- Migration of example libraries and code to libspe2. A migration guide is provided to help you move existing applications to libspe2.
- Enhancements and improvements to the Accelerated Library Framework (ALF), including additional examples that use ALF.
- Improvements and additions to the SIMD math library.
- Addition of SIMD MASS and vector MASS libraries for the SPE.
- Addition of example benchmarking code to measure and report on the performance of a representative set of DMA operations.
- Added GNU GCC, XL C/C++ compiler, and Full-System Simulator support for an enhanced CBEA-compliant processor with a fully pipelined, double precision SPE.
- Addition of a sample DMA channel profiling tool. Support for cycle count profiling of code running on the SPE using OProfile.
- Addition of the Cell Performance Counter utility, which can be used to monitor and count Cell/B.E. performance events.



- Improved PowerPC Processor Element (PPE) model in the Full-System Simulator for better performance correlation across the Cell Broadband Engine. Improved integration between the Full-System Simulator and the Eclipse IDE for Cell Broadband Engine.
- Addition of Linux man pages for some libraries and tools.
- Upgraded the XL C/C++ compiler version to 0.8.2.
- Upgraded binutils version to 2.18 prerelease.
- Upgraded GDB version to 6.6.
- Upgraded newlib version to 1.15.0.

Supported platforms
Cell/B.E. applications can be developed on the following platforms:
- x86
- x86-64
- 64-bit PowerPC® (PPC64)
- BladeCenter QS20

Supported languages
The supported languages are:
- C/C++
- Assembler

Note: Although C++ is supported, take care when you write code for the
Synergistic Processing Units (SPUs) because many of the C++ libraries are
too large for the memory available.

How to use the SDK


The SDK includes both PPU and SPU compilers for all the supported platforms. A
Cell/B.E. application can run either natively on a BladeCenter QS20 or in the IBM®
Full-System Simulator, which is supported on all of the supported platforms. The
Full-System Simulator is useful for debugging or verifying a problem with
applications that you plan to run on the BladeCenter QS20. For example, it is
possible to build an application on an x86 system, test it under the Full-System
Simulator on that system, and then later run the same binary natively on a
BladeCenter QS20.

Getting support
The SDK is supported through the Cell/B.E. architecture forum on the
developerWorks® Web site at
http://www.ibm.com/developerworks/power/cell/

or on the Barcelona Supercomputing Center (BSC) Web site at


http://www.bsc.es/projects/deepcomputing/linuxoncell

There is also support for the Full-System Simulator and XL C/C++ Compiler
through their individual alphaWorks® forums. If in doubt, start with the Cell/B.E.
architecture forum.

GDB is supported through many different forums on the Web, but primarily at the
GDB Web site
http://www.gnu.org/software/gdb/gdb.html

This version (2.1) of the Cell/B.E. SDK supersedes all previous versions of the
SDK.

Note: The Cell/B.E. SDK is provided on an “as-is” basis. Wherever possible, workarounds to problems are provided in the respective forums.

Related documentation
For a full list of documentation available on the SDK 2.1 ISO image, see Appendix
B. Related documentation.

Chapter 1. SDK overview
This section describes the contents of the SDK, where it is installed on the system,
and how the various components work together.

GNU tool chain


The GNU tool chain contains the GCC C-language compiler (GCC compiler) for
the PPU and the SPU. For the PPU it is a replacement for the native GCC compiler
on PowerPC (PPC) platforms and it is a cross-compiler on x86. The GCC compiler
for the PPU is preferred and the Makefiles are configured to use it when building
the libraries and samples.

The GCC compiler also contains a separate SPE cross-compiler that supports the
standards defined in the following documents:
- C/C++ Language Extensions for Cell BE Architecture V2.4. The GCC compiler shipped in SDK 2.1 supports all language extensions described in the specification except for the following:
  - The GCC compilers currently do not support alignment of stack variables greater than 16 bytes as described in section 1.3.1.
  - The GCC compilers currently do not support the optional alternate vector literal format specified in section 1.4.6.
  - The GCC compilers currently support mapping between SPU and VMX intrinsics as defined in section 5 only in C++ code.
  - The PPU GCC compiler does not support the PPU VMX intrinsics vec_extract, vec_insert, vec_promote, and vec_splats as defined in section 7. (The other eight intrinsics in that section are supported.)
  - The recommended vector printf format controls as specified in section 8.1.1 are not supported.
  - The SPU GCC compiler and library do not fully conform to the behavior of floating-point operators and standard library functions as documented in section 9.3.
  - The GCC compilers support operator overloading for vector data types as described in section 10 only for the following set of operators: unary +, -, ~; binary +, +=, -, -=, *, *=, /, /=, &, &=, |, |=, ^, ^=.
- Application Binary Interface (ABI) Specification V1.7
- SPU Instruction Set Architecture V1.2

The associated assembler and linker additionally support the SPU Assembly
Language Specification V1.5. The assembler and linker are common to both the
GCC and XL C/C++ compilers. GDB support is provided for both PPU and SPU
debugging, and the debugger client can be in the same process or a remote
process. GDB also supports combined (PPU and SPU) debugging.

On a non-PPC system, the install directory for the GNU tool chain is /opt/cell.
There is a single bin subdirectory, which contains both PPU and SPU tools.

On a PPC64 or BladeCenter QS20, both tool chains are installed into /usr. See
“System root directories” on page 15 for further information.


The patches to the standard 4.1.1 GCC compiler are distributed under the GPL
license and are available on the BSC Web site
http://www.bsc.es/projects/deepcomputing/linuxoncell

IBM XL C/C++ compiler


IBM XL C/C++ Alpha Edition for Cell Broadband Engine Processor on Linux is an
advanced, high-performance cross-compiler that is tuned for the CBEA. The XL
C/C++ compiler, which is hosted on an x86, IBM PowerPC technology-based
system, or BladeCenter® QS20, generates code for the PPU or SPU. The compiler
requires the GCC toolchain for the CBEA, which provides tools for
cross-assembling and cross-linking applications for both the PPE and SPE.

IBM XL C/C++ supports the revised 2003 International C++ Standard ISO/IEC
14882:2003(E), Programming Languages -- C++ and the ISO/IEC 9899:1999,
Programming Languages -- C standard, also known as C99. The compiler also
supports the C89 Standard and K & R style of programming, as well as language
extensions for vector programming and language extensions for SPU
programming. In addition, the compiler supports numerous GCC C and C++
extensions to help users port their applications from GCC.

The XL C/C++ compiler provided in SDK 2.1 supports the language extensions as specified by the C/C++ Language Extensions for Cell BE Architecture V2.4 specification except:
- Alignment of greater than 16 for automatic variables as described in section 1.3.1 is not currently supported.
- Most, but not all, GCC inline assembly capabilities are supported as described in section 1.7.
- Operator overloading for vector data, as described in section 10, is not currently supported.
- The PPU VMX intrinsics vec_extract, vec_insert, vec_promote, and vec_splats specified in section 7 are not currently supported.

The XL C/C++ compiler provides the following invocation commands:
- ppuxlc
- ppuxlc++
- spuxlc
- spuxlc++

The commands ppuxlc and ppuxlc++ are used to compile and generate code for the PPU, and spuxlc and spuxlc++ are used to compile and generate code for the SPU.

The compiler invocation commands for the PPU perform all necessary steps to
compile C source files by ppuxlc (C++ source using ppuxlc++) into .o files and to
link the object files and libraries by ppu-ld into an executable program.

The compiler invocation command for the SPU performs all necessary steps to
compile C source files by spuxlc (C++ source using spuxlc++) into .s files,
assembling .s files into .o files by spu-as, and linking the object files and libraries
into an executable program by spu-ld. The ppu-embedspu tool that is part of the
GNU tool chain is used to link PPU object files and a SPU executable program into
a single PPU executable program.
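
For example, a simple application split between the two processor types might be built as follows; this is a sketch, and all file and symbol names here are illustrative:

spuxlc -o spu_kernel spu_kernel.c                      # compile and link the SPU program
ppu-embedspu spu_kernel spu_kernel spu_kernel_embed.o  # wrap the SPU executable in a PPU object
ppuxlc -o myapp ppu_main.c spu_kernel_embed.o -lspe2   # link the final PPU executable

The second argument to ppu-embedspu is the SPU executable to embed, and the first is the symbol name under which the embedded image is exposed to the PPU code.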

The XL C/C++ compiler includes the following base optimization levels:
- -O0: almost no optimization
- -O2: strong, low-level optimization that benefits most programs
- -O3: intense, low-level optimization analysis with basic loop optimization
- -O4: all of -O3 plus detailed loop analysis and good whole-program analysis at link time
- -O5: all of -O4 plus detailed whole-program analysis at link time

Auto-SIMDization is enabled by default at -O3 -qhot, and at -O4 and -O5.

The XL C/C++ compiler is installed into the /opt/ibmcmp/xlc/8.2 directory and is distributed under the IBM ILAR license.

Note: The source code is not available.

IBM Full-System Simulator


The IBM Full-System Simulator (referred to as the simulator in this document) is a
software application that emulates the behavior of a full system that contains a
Cell/B.E. processor. You can start a Linux operating system on the simulator and
run applications on the simulated operating system. The simulator also supports
the loading and running of statically-linked executable programs and standalone
tests without an underlying operating system.

The simulator infrastructure is designed for modeling processor and system-level architecture at levels of abstraction, which vary from functional to performance simulation models with a number of hybrid fidelity points in between:
- Functional-only simulation: Models the program-visible effects of instructions
without modeling the time it takes to run these instructions. Functional-only
simulation assumes that each instruction can be run in a constant number of
cycles. Memory accesses are synchronous and are also performed in a constant
number of cycles.
This simulation model is useful for software development and debugging when
a precise measure of execution time is not significant. Functional simulation
proceeds much more rapidly than performance simulation, and so is also useful
for fast-forwarding to a specific point of interest.
- Performance simulation: For system and application performance analysis, the
simulator provides performance simulation (also referred to as timing
simulation). A performance simulation model represents internal policies and
mechanisms for system components, such as arbiters, queues, and pipelines.
Operation latencies are modeled dynamically to account for both processing time
and resource constraints. Performance simulation models have been correlated
against hardware or other references to acceptable levels of tolerance.
The simulator for the Cell/B.E. processor provides a cycle-accurate SPU core model that can be used for performance analysis of computationally intensive
applications. For additional information, refer to the IBM developerWorks SPU
Pipeline Examination article
http://www.ibm.com/developerworks/power/library/pa-cellspu/

The simulator for SDK 2.1 provides additional support for performance
simulation. This is described in the IBM Full-System Simulator Users Guide.

The simulator can also be configured to fast-forward the simulation, using a functional model, to a specific point of interest in the application and to switch to a timing-accurate mode to conduct performance studies. Then various types of operational details can be gathered to help you understand real-world hardware and software systems.

See the /opt/ibm/systemsim-cell/docs subdirectory for complete documentation, including a simulator user’s guide. The prerelease name of the simulator is “Mambo” and this name may appear in some of the documentation.

The simulator for the Cell/B.E. processor is also available as an independent technology at
http://www.alphaworks.ibm.com/tech/cellsystemsim

The simulator is installed into the /opt/ibm/systemsim-cell directory and is distributed under the IBM ILAR license.

Note: The source code for the simulator is not available.

System root image for the simulator


The system root image for the simulator is a file that contains a disk image of
Fedora Core 6 files, libraries and binaries that can be used within the system
simulator. This disk image file is preloaded with a full range of Fedora Core 6
utilities and also includes all of the Cell/B.E. Linux support libraries described in
“Performance support libraries and utilities” on page 10. This RPM file is the
largest of the RPM files and when it is installed, it takes up to 1.6GB on the host
server’s hard disk. See also “System root directories” on page 15.

The system root image for the simulator must be located either in the current
directory when you start the simulator or the default /opt/ibm/systemsim-cell/
images/cell directory. The cellsdk script automatically puts the system root image
into the default directory.

You can mount the system root image to see what it contains. Assuming a mount
point of /mnt/cell-sdk-sysroot, which is the mount point used by the cellsdk
script, the command to mount the system root image is:
mount -o loop /opt/ibm/systemsim-cell/images/cell/sysroot_disk /mnt/cell-sdk-sysroot/

The command to unmount the image is:


umount /mnt/cell-sdk-sysroot/

Do not attempt to mount the image on the host system while the simulator is
running. You should always unmount the system root image before you start the
simulator. You should not mount the system root image to the same point as the
root on the host server because the system can become corrupted and fail to boot.

You can change files on the system root image disk in the following ways:
- Mount it as described above. Then change directory (cd) to the mount point directory or below and modify the file using host system tools, such as vi or cp. However, do not attempt to use the rpm utility on an x86 platform to install packages to the sysroot disk, because the rpm database formats are not compatible between the x86 and PPC platforms.
- Use the ./cellsdk synch command to synchronize the system root image with the /opt/ibm/cell-sdk/prototype/sysroot directory for libraries and samples (see “System root directories” on page 15) that have been cross-compiled and linked on a host system and need to be copied to the target system.
- Use the callthru mechanism (see “The callthru utility” on page 17) to source or sink the host system file when the simulator is running. This is the only method that can be used while the simulator is running.

The source is distributed under the GPL license and the system root image is
available on the BSC Web site
http://www.bsc.es/projects/deepcomputing/linuxoncell

Linux kernel
A number of patches have been made to the Linux 2.6.18 kernel to provide the
services that are required to support the hardware facilities of the Cell/B.E.
processor.

For the BladeCenter QS20, the kernel is installed into the /boot directory,
yaboot.conf is modified and a reboot is required to activate this kernel. The
cellsdk install task (see SDK 2.1 Installation Guide) provides an option,
--nokernel, not to install this kernel.

Note: The cellsdk uninstall command does not automatically uninstall the
kernel to avoid leaving the system in an unusable state.

The kernel image for the simulator must be located either in the current directory when you start the simulator or in the default /opt/ibm/systemsim-cell/images/cell directory. The cellsdk script automatically puts the kernel image into the default directory.

The patches for the 2.6.18 kernel are distributed under the GPL license and are
available on the BSC Web site
http://www.bsc.es/projects/deepcomputing/linuxoncell

Cell/B.E. libraries
The libraries listed here have been function tested and are considered ready for Cell/B.E. application development. Any problems with these libraries should be reported in:
http://www.alphaworks.ibm.com/tech/cellsw/forum

SPE Runtime Management Library Version 2.1


SPE Runtime Management Library (libspe) version 2.1 is an upgrade to version 1.2.
A libspe version 1.2 interface is provided for backward compatibility. It is
recommended that you port your applications from libspe version 1.2 to the new
version of the runtime library. For more information about how to do this, see SPE
Runtime Management Library Version 1.2 to Version 2.1 Migration Guide.

libspe2 constitutes the standardized low-level application programming interface (API) for application access to the Cell/B.E. SPEs. This library provides an API that
is neutral with respect to the underlying operating system and its methods to
manage SPEs. Implementations of this library may provide additional functionality
that allows for access to operating system or implementation dependent aspects of
SPE runtime management. These capabilities are not subject to standardization in
this document and their use may lead to non-portable code and dependencies on
certain implemented versions of the library.
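
As a minimal sketch of the basic libspe2 flow (assuming an SPE program embedded under the symbol name spu_kernel, for example with ppu-embedspu; names are illustrative and error checking is omitted for brevity):

#include <libspe2.h>

extern spe_program_handle_t spu_kernel;   /* hypothetical embedded SPE image */

int main(void)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);  /* create an SPE context */
    spe_program_load(ctx, &spu_kernel);                   /* load the SPE image    */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);    /* run it to completion  */
    spe_context_destroy(ctx);
    return 0;
}

Note that spe_context_run blocks in the calling thread until the SPE program stops; to run several SPEs concurrently, an application typically creates one PPE thread per SPE context.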


elfspe is a PPE program that allows an SPE program to run directly from a
Linux command prompt without needing a PPE application to create an SPE
thread and wait for it to complete.
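
For example, a trivial SPU-only program (the file name is illustrative) can be built and run directly:

spu-gcc -o hello_spu hello_spu.c
./hello_spu        # started via elfspe; no PPE wrapper program is needed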

For the BladeCenter QS20, the SDK installs the libspe headers, libraries, and
binaries into the /usr directory and the standalone SPE executive, elfspe, is
registered with the kernel during boot by commands added to /etc/rc.d/init.d
using the binfmt_misc facility.

For the simulator, the libspe and elfspe binaries and libraries are preinstalled in
the same directories in the system root image and no further action is required at
install time.

The source for the SPE runtime management library is distributed under the GPL
license and available on the BSC Web site
http://www.bsc.es/projects/deepcomputing/linuxoncell

SIMD math libraries


The traditional math functions specified by standards such as ISO/IEC 9899:1999
(more commonly known as “the C99 standard”) are defined in terms of scalar
instructions, and do not take advantage of the powerful Single Instruction,
Multiple Data (SIMD) instructions provided by both the PPU and SPU instruction
sets of the Cell/B.E. Architecture. SIMD instructions perform computations on short vectors of data in parallel, instead of on individual scalar data elements. SIMD instructions often provide significant increases in program speed because more computation can be done in fewer instructions.

The SIMD math library provides short vector versions of a subset of the traditional
math functions. The MASS library provides long vector versions. These vector
versions conform as closely as possible to the specifications set out by the scalar
standards. However, fundamental differences between scalar architectures and the
Cell/B.E. Architecture require some deviations, including the handling of
rounding, error conditions, floating-point exceptions, and special operands, such as
NaN and infinities.

The SIMD math library is provided by the SDK as both a linkable library archive
and as a set of inlinable headers. Names of the SIMD math functions are
differentiated from their scalar counterparts by a vector type suffix that is
appended to the standard scalar function name. For example, the SIMD version of
fabsf(), which acts on a vector float, is called fabsf4(). Similarly, a SIMD version
of a standard scalar function that acts on a vector double has d2 appended to the
name, for example, fabsd2(). Inlinable versions of functions are prefixed with the
character “_” (underscore), so the inlinable version of fabsf4() is called _fabsf4().

Both versions require the inclusion of the primary header file, simdmath.h, and
linking against the libsimdmath.a archive. Additionally, the inlinable versions
require inclusion of a distinct header file for each function used. For example, to
use the inlinable function _fabsf4(), the fabsf4.h header file needs to be included
in addition to the simdmath.h header file.
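
As an illustration, the following SPU-side sketch uses both forms of the single-precision absolute value function described above (the include path for the inlinable header follows the directory layout described below; the compile line is illustrative):

#include <spu_intrinsics.h>
#include <simdmath.h>          /* primary header, always required */
#include <simdmath/fabsf4.h>   /* extra header for the inlinable _fabsf4() */

vector float library_abs(vector float v)
{
    return fabsf4(v);    /* resolved from the libsimdmath.a archive */
}

vector float inlined_abs(vector float v)
{
    return _fabsf4(v);   /* expanded inline by the compiler */
}

The first form requires linking against the archive, for example: spu-gcc -O2 abs_demo.c -lsimdmath.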

The linkable library archive is more convenient, requiring the inclusion of a single header file, but produces slower, larger binaries due to limitations of the linker and the required branching inherent to function calls. The inlinable headers require distinct header files to be included for each math function used, but produce faster, smaller binaries because the compiler is able to reduce branching and often achieves better dual-issue rates and optimization. In general, most developers should use the inlinable versions whenever possible.

For the PPU, the SIMD math library header file simdmath.h is located in the
/usr/include directory, with the inlinable headers located in the
/usr/include/simdmath directory, and the library archive libsimdmath.a located in
the /usr/lib directory.

For the SPU, the SIMD math library header file simdmath.h is located in the
/usr/spu/include directory, with the inlinable headers located in the
/usr/spu/include/simdmath directory, and the library archive libsimdmath.a
located in the /usr/spu/lib directory.

For more information about the SIMD math library, refer to SIMD Math Library
Specification for Cell Broadband Engine Architecture.

Note: Some of the functions documented in the specification are not yet available.
The source code and the man pages document the functions that are
currently supported.

Mathematical Acceleration Subsystem (MASS) library


The Mathematical Acceleration Subsystem (MASS) consists of libraries of
mathematical intrinsic functions, which are tuned specifically for optimum
performance on the Cell/B.E. processor. Currently the 32-bit, 64-bit PPU and SPU
libraries are supported.

These libraries:
- Include both scalar and vector functions
- Are thread-safe
- Support both 32- and 64-bit compilations
- Offer improved performance over the corresponding standard system library routines
- Are intended for use in applications where slight differences in accuracy or handling of exceptional values can be tolerated
You can find information about using these libraries on the MASS Web site:
http://www.ibm.com/software/awdtools/mass

The MASS and MASS/V libraries are distributed under a modified ILAR license.

Prototype code
The functions in these packages represent prototype or sample code that you can use for experimentation. This code may change in future releases of the SDK.

Prototype libraries and samples package


The prototype libraries and samples package provides a set of optimized library
routines that greatly reduce the development cost and enhance the performance of
Cell/B.E. programs.

To demonstrate the versatility of the Cell/B.E. architecture, a variety of application-oriented libraries are included, such as:
- Fast Fourier Transform (FFT)
- Image processing
- Audio resample
- Software managed cache
- Game math
- Intrinsics
- Matrix operation
- Multi-precision math
- Noise generation
- Oscillator
- Surface
- Synchronization
- Vector
- Accelerated Library Framework (ALF)
Additional samples and workloads demonstrate how you can exploit the on-chip
computational capacity.

The libraries and sample sources are installed in the /opt/ibm/cell-sdk/prototype directory under the IBM CPL license.

Libraries and samples subdirectories


The libraries and samples RPM has been partitioned into the following
subdirectories.
Table 1. Subdirectories for the libraries and samples RPM
Subdirectory Description
bin Executable programs directory containing the SPU Timing tool and the ILAR license for this
tool.
docs Contains documentation about libraries and tools such as IDL compiler.
host Host system executables, headers and libraries for the IDL tool.
license Contains the text for the CPL license.
src/include System header files. These files are exported to the $SDKINC_<target> (where target is either ppu
or spu) directory for general use throughout the SDK.
src/lib Series of libraries and reusable header files. These are exported to the $SDKLIB_<target> or $SDKINC_<target> directories (respectively). Complete documentation for all library functions is available in /opt/ibm/cell-sdk/prototype/docs/libraries_SDK.pdf.

src/samples The samples directory contains examples of Cell/B.E. programming techniques. Each program shows a particular technique, or set of related techniques, in detail. You may review these programs when you want to perform a specific task, such as performing double-buffered DMA transfers to and from a program, performing local operations on an SPU, or providing access to main memory objects to SPU programs.

Some subdirectories contain multiple programs. The sync subdirectory has examples of various
synchronization techniques, including mutex operations and atomic operations.

The spulet model is intended to encourage testing and refinement of programs that need to be
ported to the SPUs; it also provides an easy way to build filters that take advantage of the huge
computational capacity of the SPUs, while reading and writing standard input and output.

Other samples worth noting are:
- Overlay samples
- SW managed cache samples
- Tutorial code samples
src/tests The tests directory includes some regression tests for the system. These programs exercise key
system components.
src/tools The tools directory contains tools that are useful for software development such as the Interface
Definition Language (IDL) compiler and the callthru program. The IDL compiler reads
high-level interface specifications and generates PPU and SPU stub functions to allow PPU code
to “call” a function that is implemented and runs on the SPU. This is most useful for cases
where you have a computationally intensive task that is well suited to simply being handed off
to an SPU for processing. The code this generates is informative about PPU/SPU
communications, but is primarily intended as a prototyping tool rather than a learning tool.

The IDL tool has its own samples which show how to offload some processing to the SPU. The
IDL’s PPU stub code supports dynamic allocation of multiple SPUs to handle simultaneous
offloaded functions, and multiple functions can be loaded on a single SPU, if they are small
enough. Some features are still under development, such as double buffering support.
src/workloads The workloads directory provides a handful of examples that can be used to better understand
the performance characteristics of the Cell/B.E. processor. There are four sample programs,
which contain insights into how real-world code should run.
Note: Running these examples using the simulator takes much longer than on the native
Cell/B.E.-based hardware. The performance characteristics in wall-clock time using the
simulator are extremely inaccurate, especially when running on multiple SPUs. You need to
examine the emulator CPU cycle counts instead.

For example, the matrix_mul program lets you perform matrix multiplications on one or more
SPUs. Matrix multiplication is a good example of a function which the SPUs can accelerate
dramatically.

Unlike some of the other sample programs, these examples have been tuned to get the best
performance. This makes them harder to read and understand, but it gives an idea for the type
of performance code that you can write for the Cell/B.E. processor.
sysroot Contains some of the headers and libraries used during cross-compiling and contains the
compiled results of the libraries and samples. This can be synched up with the system root
image by using the command: /opt/ibm/cell-sdk/prototype/cellsdk synch
src/benchmarks The benchmarks directory contains sample benchmarks for various operations that are
commonly performed in Cell/B.E. applications. The intent of these benchmarks is to guide you
in the design, development, and performance analysis of applications for systems based on the
Cell/B.E. processor. The benchmarks are provided in source form to allow you to understand in
detail the actual operations that are performed in the benchmark. This also provides you with a
basis for creating your own benchmark codes to characterize performance for operations that
are not currently covered in the provided set of benchmarks.


Accelerated Library Framework (ALF)
The ALF application programming interface (API) provides a set of functions to
solve parallel problems on multi-core memory hierarchy systems. The ALF 1.1 API
is focused on data parallel problems on a host-accelerator type hybrid system. ALF
supports the single-program-multiple-data (SPMD) programming style with a
single program running on all allocated accelerator elements at one time. ALF
offers an interface to write data parallel applications without requiring
architecturally dependent code. Features of ALF include task scheduling, data
transfer management, parallel task management, double buffering, and data
partitioning.

ALF considers a natural division of labor between the two types of processing
elements in a hybrid system: the host element and the accelerator element. Within
ALF, two different types of tasks are defined in a typical parallel program: the
control task and the compute task. These tasks are assigned to the different
processing elements in the hybrid system. The control task typically resides on the
host element, while the compute task typically resides on the accelerator element.
This division of labor enables programmers to specialize in different parts of a
given parallel workload.

ALF defines three different types of work that can be assigned to three different
types of programmers. At the highest level, application developers only program at
the host level. Application programmers can use the provided accelerated libraries
without understanding the inner workings of the hybrid system. The second type
of programmers is accelerated library developers. Using the provided ALF APIs,
the library developers provide the library wrappers to invoke the computational
kernels on the accelerators. Library developers are responsible for breaking the
problem into the control process running on the host and the compute kernel
running on the accelerators. Library developers then partition the input and output
into work blocks which ALF can schedule to run on different accelerators. At the
accelerator level, the computational kernel developers write optimized accelerator
code. The ALF API provides a common interface for the compute task to be
invoked automatically by the framework.

Performance support libraries and utilities


The following support libraries and utilities are provided by the Cell/B.E. SDK to help you with development and performance testing of your Cell/B.E. applications.

SPU timing tool


The SPU static timing tool, spu_timing, annotates an SPU assembly file with
scheduling, timing, and instruction issue estimates assuming straight, linear
execution of the program. The tool generates a textual output of the execution
pipeline of the SPE instruction stream from this input assembly file. Run
spu_timing --help to see its usage syntax.
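
For example, assuming an SPU source file spu_kernel.c (the file name is illustrative), you could produce an annotated listing as follows:

spu-gcc -O3 -S spu_kernel.c    # generates the assembly file spu_kernel.s
spu_timing spu_kernel.s        # writes the annotated pipeline listing for that file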

The SPU timing tool is distributed as an RPM under the IBM ILAR license and is
located in the /opt/ibm/cell-sdk/prototype/bin directory.

Note: The source code is not available.

Feedback Directed Program Restructuring (FDPR-Pro)
The FDPR-Pro is a performance-tuning utility that reduces the runtime of
user-level application programs. The tool optimizes the executable image of a
program by collecting information about the program's behavior under a typical
workload, and creating a new version of the program that is optimized for that
workload. The new program generated by the post-link optimizer typically runs
faster than the original program. The FDPR-Pro utility is distributed as an RPM
under the IBM ILAR.

A typical use of the program consists of the following steps:


1. Instrument the program:
$ fdprpro -a instr myprog

The instrumented program is generated in myprog.instr and the PPE profile template in myprog.nprof.
2. Collect a profile using a typical workload:
$ myprog.instr [args...]

The PPE profile is collected in .nprof, and the SPE profile in .mprof files.
3. Optimize the program:
$ fdprpro -a opt myprog [optimization options...] -f myprog.nprof

The optimized program is generated in myprog.fdpr


The steps operate on both the PPE and the SPE parts of the program.
Notes:
1. This utility is only available for PPC 64-bit platforms including the BladeCenter
QS20.
2. The source code is not available.
3. Overlays are not supported.

OProfile
OProfile is a tool for profiling user and kernel level code. It uses the hardware
performance counters to sample the program counter every N events. You specify
the value of N as part of the event specification. The system enforces a minimum
value on N to ensure the system does not get completely swamped trying to
capture a profile.

Make sure you select a large enough value of N to ensure the overhead of
collecting the profile is not excessively high. The opreport tool produces the output
report. Reports can be generated based on the file names that correspond to the
samples, symbol names or annotated source code listings. The basic use of OProfile
and the postprocessing tool is described in the user manual available at
http://oprofile.sourceforge.net/doc/

The current SDK 2.1 version of OProfile for Cell/B.E. supports profiling on the
POWER™ processor events and SPU cycle profiling. These events include cycles as
well as the various processor, cache and memory events. It is possible to profile on
up to four events simultaneously on the Cell/B.E. system. There are restrictions on
which of the PPU events can be measured simultaneously. (The tool now verifies
that multiple events specified can be profiled simultaneously. In the previous
release it was up to the user to verify that.). When using SPU cycle profiling,
events must be within the same group due to restrictions in the underlying


hardware support for the performance counters. You can use the opcontrol
--list-events command to view the events and which group contains each event.

There is one set of performance counters for each node, shared between the two CPUs on the node. For a given profile period, only half of the time is spent collecting data for the even CPUs and half of the time for the odd CPUs. You may need to allow more time to collect the profile data across all CPUs.
Notes:
1. Before you issue an opcontrol --start, you should issue the following
command:
opcontrol --start-daemon
2. To produce a report with Linux kernel symbol information you should install
the corresponding Kernel debuginfo RPM which is available from the BSC Web
site.
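
A typical SPU cycle-profiling session (the event specification, count, and application name are illustrative) might look like this:

opcontrol --event=SPU_CYCLES:100000   # choose the event and the sampling interval N
opcontrol --start-daemon              # start the daemon first (see note 1 above)
opcontrol --start                     # begin collecting samples
./myapp                               # run the workload being profiled
opcontrol --stop                      # stop collection
opreport                              # generate the default report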

SPU profiling restrictions


When SPU cycle profiling is used, the opcontrol command is configured for separating the profile based on SPUs and on the library. This corresponds to you specifying --separate=cpu and --separate=lib. The separate CPU option is required because it is possible to have multiple SPU binary images embedded into the executable file or into a shared library. So for a given executable, the various SPUs may be running different SPU images.

With --separate=cpu, the image and corresponding symbols can be displayed for each SPU. The user can use the opreport --merge command to create a single report for all SPUs that shows the counts for each symbol in the various embedded SPU binaries. By default, opreport does not display the app name column when it reports samples for a single application, such as when it profiles a single SPU application. For opreport to attribute samples to a binary image, the opcontrol script defaults to using --separate=lib when profiling SPU applications so that the image name column is always displayed in the generated reports.

SPU report anomalies


The report file uses the term CPUs when the event is SPU_CYCLES. In this case, the term CPUs actually refers to the various SPUs in the system. For all other events, the CPU term refers to the virtual PPU processors.

With SPU profiling, opreport’s --long-filenames option may not print the full path
of the SPU binary image for which samples were collected. Short image names are
used for SPU applications that employ the technique of embedding SPU images in
another file (executable or shared library). The embedded SPU ELF data contains
only the filename and no path information to the SPU binary file being embedded
because this file may not exist or be accessible at runtime. You must have sufficient
knowledge of the application’s build process to be able to correlate the SPU binary
image names found in the report to the application’s source files.

Tip: Compile the application with -g and generate the OProfile report with -g to facilitate finding the right source file(s) to focus on.

Generally, when the report contains information about a single application,
opreport does not include the report column for the application name. It is
assumed that the performance analyst knows the name of the application being
profiled.

Known restrictions
Currently there are two known issues with the Cell/B.E. OProfile code.
- The first issue occurs when you use the opreport tool to generate XML output with details for SPE embedded applications, specifically with the command opreport --xml --details. The command is supposed to include the binary code in the XML output, but the binary code is missing.
- The second issue is with the opannotate tool for SPE embedded applications. The opannotate tool reports the samples in the wrong source code file. It works fine for PPU applications and where the SPE code is not embedded.

OProfile is distributed under the GPL license and is available on the BSC Web site
http://www.bsc.es/projects/deepcomputing/linuxoncell

Cell-perf-counter tool
The cell-perf-counter (cpc) tool is used for setting up and using the hardware
performance counters in the Cell/B.E. processor. These counters allow you to see
how many times certain hardware events are occurring, which is useful if you are
analyzing the performance of software running on a Cell/B.E. system. Hardware
events are available from all of the logical units within the Cell/B.E. processor,
including the PPE, SPEs, interface bus, and memory and I/O controllers. Four
32-bit counters, which can also be configured as pairs of 16-bit counters, are
provided in the Cell/B.E. performance monitoring unit (PMU) for counting these
events.

CPC also makes use of the hardware sampling capabilities of the Cell/B.E. PMU.
This feature allows the hardware to collect very precise counter data at
programmable time intervals. The accumulated data can be used to monitor the
changes in performance of the Cell/B.E. system over longer periods of time.

The cpc tool provides a variety of output formats for the counter data. Simple text
output is shown in the terminal session, HTML output is available for viewing in a
Web browser, and XML output can be generated for use by higher-level analysis
tools such as the Visual Performance Analyzer (VPA).

You can find details in the documentation and manual pages included with the cellperfctr-tools package, which can be found in the /usr/share/doc/cellperfctr-<version>/ directory after you have installed the package.

IBM Eclipse IDE for Cell/B.E. SDK


IBM Eclipse IDE for Cell/B.E. SDK is built upon the Eclipse and C Development
Tools (CDT) platform. It integrates the Cell/B.E. GNU tool chain, compilers, the
Full-System Simulator, and other development components to provide a
comprehensive, Eclipse-based development platform that simplifies Cell/B.E.
development. The key features include the following:
- A C/C++ editor that supports syntax highlighting, a customizable template, and an outline window view for procedures, variables, declarations, and functions that appear in source code
- A visual interface for the PPE and SPE combined GDB (GNU debugger)
- Seamless integration of the simulator into Eclipse
- Automatic builder, performance tools, and several other enhancements
- Remote launching, running and debugging on a BladeCenter QS20

To create a Cell/B.E.-specific project in the integrated Eclipse environment, you need to install Cell/B.E. SDK 2.1, Eclipse 3.2, CDT 3.1.1, and Eclipse IDE for Cell/B.E. SDK. For information about Eclipse IDE for Cell/B.E. SDK, including a tutorial and downloadable installation files, refer to
http://alphaworks.ibm.com/tech/cellide

The IBM Eclipse IDE for Cell/B.E. SDK is available from the SDK ISO image and
is distributed under the IBM ILAR license.

Note: The source code is not available.

Chapter 2. Programming with the SDK
This section is a short introduction about programming with the SDK. Refer to the
Cell BE Programming Tutorial, the Full-System Simulator User’s Guide, and other
documentation for more details.

System root directories


Because of the cross-compile environment and simulator in the Cell/B.E. SDK,
there are several different system root directories. Table 2 describes these
directories.
Table 2. System root directories
Directory name Description
Host The system root for the host system is “/”. The SDK is
installed relative to this host system root.
GCC Toolchain The system root for the GCC tool chain depends on the host
platform. For PPC platforms including the BladeCenter QS20,
this directory is the same as the host system root. For x86 and
x86-64 systems this directory is /opt/cell/sysroot. The tool
chain PPU header and library files are stored relative to the
GCC Tool chain system root in directories such as usr/include
and usr/lib. The tool chain SPU header and library files are
stored relative to the GCC Toolchain system root in directories
such as usr/spu/include and usr/spu/lib.
Simulator The simulator runs using a 2.6.18 kernel and a Fedora Core 6 system root image. This system root image has a root directory of “/”. When this system root image is mounted into a host-based directory such as /mnt/cell-sdk, that directory is termed the simulator system root.
Samples and Libraries The Samples and Libraries system root directory is /opt/ibm/cell-sdk/prototype/sysroot. When the samples and libraries are compiled and linked, the resulting header files, libraries and binaries are placed relative to this directory in directories such as usr/include, usr/lib, and /opt/ibm/cell-sdk/prototype/bin. The libspe 2.1 beta library is also installed into this system root.

After you have logged in as root, you can synchronize this sysroot directory with the simulator sysroot image file. To do this, use the cellsdk script with the synch task. The command is:
./cellsdk synch

This command is very useful whenever a library or sample has been recompiled. The script reduces user error because it provides a standard mechanism to mount the system root image, rsync the contents of the two corresponding directories, and unmount the system root image.


Running the simulator
To verify that the simulator is operating correctly and then run it, issue the
following commands:
# export PATH=/opt/ibm/systemsim-cell/bin:$PATH
# systemsim -g

The systemsim script found in the simulator’s bin directory launches the simulator
and the -g parameter starts the graphical user interface.

Note: It is no longer necessary to have a local copy of .systemsim.tcl. The simulator looks in the local directory first (as it always did), but if it is not there, it uses the systemsim.tcl (no leading dot) in lib/cell of the simulator install directory.

Figure 1. Running the simulator. The GUI is shown with a console window for the system running on the simulator.

Notes:
1. You must be on a graphical console, or at least have the DISPLAY environment
variable pointed to an X server to run the simulator's graphical user interface
(GUI).
2. If an error message about libtk8.4.so is displayed, you must load the TK
package as described in SDK 2.1 Installation Guide.

When the GUI is displayed, click Go to start the simulator.

Note: To make the simulator run in fast mode, you can click Mode and then Fast.
This forces the simulator to bypass its standard analysis and statistics
collection features. Fast mode is useful if you want to advance the simulator
through setup or initialization functions that are not the focus of analysis,
such as the Linux boot processing. You should disable fast mode when you
reach the point at which you wish to do detailed analysis or debug the
application. You can also select Simple or Cycle mode.

You can use the simulator's GUI to get a better understanding of the Cell/B.E.
architecture. For example, the simulator shows two sets of PPE state. This is
because the PPE processor core is dual-threaded and each thread has its own
registers and context. You can also look at the state of the SPEs, including the state of their Memory Flow Controllers (MFCs).

The systemsim command syntax is:


systemsim [-f file] [-g] [-n]

where:

Parameter Description
-f <filename> specifies an initial run script (TCL file)
-g specifies GUI mode, otherwise the simulator starts in command-line
mode
-n specifies that the simulator should not open a separate console
window

You can find documentation about the simulator including a user’s guide in the
/opt/ibm/systemsim-cell/doc directory.

The callthru utility


The callthru utility allows you to copy files to or from the simulated system while
it is running. The utility is located in the simulator system root image in the
/usr/bin directory.

If you call the utility as:
- callthru sink <filename>, it writes its standard input into <filename> on the host system
- callthru source <filename>, it writes the contents of <filename> on the host system to standard output.

Redirecting appropriately lets you copy files to and from the host. For example,
when the simulator is running on the host, you could copy a Cell/B.E. application
into /tmp:
cp matrix_mul /tmp

Then, in the console window of the simulated system, you could access it as
follows:
callthru source /tmp/matrix_mul > matrix_mul
chmod +x matrix_mul
./matrix_mul

The /tmp directory is shown as an example only.


The source files for the callthru utility are in /opt/ibm/systemsim-cell/sample/callthru. The Makefile to build the utility is in /opt/ibm/cell-sdk/prototype/src/tools/callthru. The callthru utility is built and installed onto the sysroot disk as part of the SDK installation process.

Read and write access to the simulator sysroot image


By default the simulator does not write changes back to the simulator system root
(sysroot) image. This means that the simulator always begins in the same initial
state of the sysroot image. When necessary, you can modify the simulator
configuration so that any file changes made by the simulated system to the sysroot
image are stored in the sysroot disk file so that they are available to subsequent
simulator sessions.

To specify that you want to update the sysroot image file with any changes made in
the simulator session, change the newcow parameter on the mysim bogus disk init
command in .systemsim.tcl to rw (specifying read/write access) and remove the
last two parameters. The following is the changed line from .systemsim.tcl:
mysim bogus disk init 0 $sysrootfile rw

Enabling Symmetric Multiprocessing support


By default the simulator provides an environment that simulates one Cell/B.E.
processor. To simulate an environment where two Cell/B.E. processors exist,
similar to a BladeCenter QS20, you must enable Symmetric Multiprocessing
(SMP) support. A tcl run script is provided with the simulator to configure
it for SMP simulation. For example, the following sequence of commands starts
the simulator configured with a graphical user interface and SMP:
export PATH=$PATH:/opt/ibm/systemsim/bin
systemsim -g -f config_smp.tcl

When the simulator is started, it has access to 16 SPEs across two Cell/B.E.
processors.

Enabling xclients from the simulator


To enable xclients from the simulator, you need to configure BogusNet (see the
BogusNet HowTo), and then perform the following configuration steps:
1. Enable ip-forward:
echo 1 > /proc/sys/net/ipv4/ip_forward
2. Configure IPTABLES
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i eth0 -o tap0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i tap0 -o eth0 -j ACCEPT
Notes:
1. The IPTABLES commands need to use the correct "tap#" interface configured
with BogusNet.
2. The first iptables command fails unless the Linux kernel was configured to
allow the NAT feature. To enable your kernel for the NAT feature, you need to
rebuild the kernel and reboot your simulator host system.

Specifying the processor architecture
Many of the tools provided in SDK 2.1 support multiple implementations of the
CBEA. These include the Cell/B.E. processor and a future processor. This future
processor is a CBEA-compliant processor with a fully pipelined, enhanced double
precision SPU.

The processor supports five optional instructions to the SPU Instruction Set
Architecture. These include:
v DFCEQ
v DFCGT
v DFCMEQ
v DFCMGT
v DFTSV
Detailed documentation for these instructions is provided in version 1.2 (or later)
of the Synergistic Processor Unit Instruction Set Architecture specification. The future
processor also supports improved issue and latency for all double precision
instructions.

The SDK compilers support compilation for either the Cell/B.E. processor or the
future processor.
Table 3. spu-gcc compiler options
Options Description
-march=<cpu type> Generate machine code for the SPU architecture specified by
the CPU type. Supported CPU types are either cell (default)
or celledp, corresponding to the Cell/B.E. processor or
future processor, respectively.
-mtune=<cpu type> Schedule instructions according to the pipeline model of the
specified CPU type. Supported CPU types are either cell
(default) or celledp, corresponding to the Cell/B.E. processor
or future processor, respectively.

Table 4. spu-xlc compiler options


Options Description
-qarch=<cpu type> Generate machine code for the SPU architecture specified by
the CPU type. Supported CPU types are either spu (default)
or edp, corresponding to the Cell/B.E. processor or future
processor, respectively.
-qtune=<cpu type> Schedule instructions according to the pipeline model of the
specified CPU type. Supported CPU types are either spu
(default) or edp, corresponding to the Cell/B.E. processor or
future processor, respectively.
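
For example, the following invocations (with an illustrative file name)
compile an SPU source file for the future processor with each compiler:

spu-gcc -march=celledp -mtune=celledp -o myprog_spu myprog_spu.c
spu-xlc -qarch=edp -qtune=edp -o myprog_spu myprog_spu.c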

The simulator also supports simulation of the future processor. The simulator
installation provides a tcl run script to configure it for such simulation. For
example, the following sequence of commands starts the simulator configured
for the future processor with a graphical user interface:
export PATH=$PATH:/opt/ibm/systemsim/bin
systemsim -g -f config_edp_smp.tcl



The static timing analysis tool, spu_timing, also supports multiple processor
implementations. The command line option -march=celledp can be used to specify
that the timing analysis be done corresponding to the future processor's
enhanced pipeline model. If the architecture is unspecified or invoked with
the command line option -march=cell, then analysis is done corresponding to
the Cell/B.E. processor's pipeline model.
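
For example, assuming an assembly listing myprog_spu.s produced from an SPU
source file (the file name is illustrative), a timing analysis for the future
processor might be requested as follows:

spu_timing -march=celledp myprog_spu.s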

SDK programming samples


Each of the samples has an associated README.txt file. There is also a top-level
readme in the /opt/ibm/cell-sdk/prototype/src directory, which introduces the
structure of the sample code source tree.

Almost all of the samples run both within the simulator and on the BladeCenter
QS20. Some samples include SPU-only programs that can be run on the simulator
in standalone mode.

The source code, which is specific to a given Cell/B.E. processor unit type, is in the
corresponding subdirectory within a given sample’s directory:
v ppu for code compiled to run on the PPE
v ppu64 for code specifically compiled for 64-bit ABI on the PPE
v spu for code compiled to run on an SPE
v spu_sim for code compiled to run on an SPE under the system simulator in
standalone environment

Changing the default compiler


In /opt/ibm/cell-sdk/prototype there are some top level Makefiles that control the
build environment for all of the samples. Most of the directories in the libraries
and samples contain a Makefile for that directory and everything below it. All of
the samples have their own Makefile but the common definitions are in the top
level Makefiles.

The build environment Makefiles are documented in
/opt/ibm/cell-sdk/prototype/README_build_env.txt.

Environment variables in the /opt/ibm/cell-sdk/prototype/make.env Makefiles
are used to determine which compiler is used to build the samples.

The cellsdk script contains a task which automatically switches the compiler, does
a make clean and then a make which rebuilds all of the samples and libraries. The
syntax of this command is:
./cellsdk build [-xlc | -gcc ]

where the -xlc or -x flag selects the XL C/C++ compiler and the -gcc or -g
flag selects the GCC compiler. The default, if unspecified, is to compile the
samples with the GCC compiler.

After you have selected a particular compiler, that same compiler is used for all
future builds, unless it is specifically overwritten by shell environment variables,
SPU_COMPILER, PPU_COMPILER, PPU32_COMPILER, or PPU64_COMPILER.

Building and running a specific program
You do not need to build all the sample code at once; you can build each
program separately. To start from scratch, issue a make clean using the
Makefile in the /opt/ibm/cell-sdk/prototype/src directory or anywhere in the
path to a specific library or sample.

If you have performed a make clean at the top level, you need to rebuild the
include files and libraries first before you compile anything else. To do this run a
make in the src/include and src/lib directories.
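
For example, the following sequence (a sketch of the steps just described)
rebuilds the include files and libraries:

cd /opt/ibm/cell-sdk/prototype/src/include
make
cd ../lib
make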

Compiling and linking with the GNU tool chain


This release of the GNU tool chain includes a GCC compiler and utilities that
optimize code for the Cell/B.E. processor. These are:
v The spu-gcc compiler for creating an SPU binary.
v The ppu-embedspu tool which enables an SPU binary to be linked with a PPU
binary into a single executable program.
v The ppu32-gcc compiler for compiling the PPU binary and linking it with the
SPU binary.

The example below shows the steps required to create the executable program
simple which contains SPU code, simple_spu.c, and PPU code, simple.c.
1. Compile and link the SPE executable.
/usr/bin/spu-gcc -g -o simple_spu simple_spu.c
2. Run embedspu to wrap the SPU binary into a CESOF ( CBE Embedded SPE
Object Format) linkable file. This contains additional PPE symbol information.
/usr/bin/ppu32-embedspu simple_spu simple_spu simple_spu-embed.o
3. Compile the PPE side and link it together with the embedded SPU binary
/usr/bin/ppu32-gcc -g -o simple simple.c simple_spu-embed.o -lspe
Notes:
1. This section only highlights 32-bit ABI compilation. To compile for 64-bit,
use ppu-gcc (instead of ppu32-gcc) and ppu-embedspu (instead of ppu32-embedspu).
2. You are strongly advised to use the -g switch as shown in the examples. This
embeds extra debugging information into the code for later use by the GDB
debuggers supplied with the SDK. See Chapter 3, “Debugging Cell/B.E.
applications,” on page 25 for more information.
3. The GCC compiler does not support the optional Altivec style of vector literal
construction using parentheses ("(" and ")"). The standard C method of array
initialization using curly braces ("{" and "}") should be used.
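
As a reference point, the PPE side of this example might look like the
following minimal sketch. It assumes the libspe 1.x API linked with -lspe in
step 3 above; the handle name simple_spu matches the name passed to
ppu32-embedspu, and error handling is kept to a minimum:

#include <stdio.h>
#include <libspe.h>

/* Symbol created by ppu32-embedspu for the embedded SPU program. */
extern spe_program_handle_t simple_spu;

int main(void)
{
    int status = 0;

    /* Create an SPE thread running the embedded SPU image
       (gid 0 requests the default SPE group). */
    speid_t spe = spe_create_thread(0, &simple_spu, NULL, NULL, -1, 0);
    if (spe == NULL) {
        perror("spe_create_thread");
        return 1;
    }

    /* Wait for the SPE program to terminate. */
    spe_wait(spe, &status, 0);
    return 0;
}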

Support for huge TLB file systems


The SDK supports the huge translation lookaside buffer (TLB) file system, which
allows you to reserve 16 MB large pages of pinned, contiguous memory. This
feature is particularly useful for some Cell/B.E. applications that operate on large
data sets, such as the FFT16M workload sample.

To configure the Cell/B.E.-based blade server for 20 pages (320 MB), run the
following commands:
mkdir -p /huge
echo 20 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs nodev /huge



If you have difficulties configuring adequate huge pages, it could be that the
memory is fragmented and you need to reboot. You can add the command
sequence shown above to a startup initialization script, such as
/etc/rc.d/rc.sysinit, so that the huge TLB file system is configured during the
system boot.

To verify the large memory allocation, run the command cat /proc/meminfo. The
output is similar to:
MemTotal: 1010168 kB
MemFree: 155276 kB
. . .

HugePages_Total: 20
HugePages_Free: 20
Hugepagesize: 16384 kB
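
Applications typically use the large pages by mapping a file that resides in
the huge TLB file system. The following minimal sketch (the file name and size
are illustrative, and error handling is omitted) maps 32 MB, that is two 16 MB
pages, of pinned, contiguous memory:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_FILE "/huge/myapp_data"    /* hypothetical file on the /huge mount */
#define MAP_SIZE  (32 * 1024 * 1024)    /* two 16 MB huge pages */

int main(void)
{
    /* Create a file in the hugetlbfs mount; it is backed by 16 MB pages. */
    int fd = open(HUGE_FILE, O_CREAT | O_RDWR, 0755);

    /* Map the region; the kernel backs it with pinned, contiguous pages. */
    void *buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* ... place large data sets in buf ... */

    munmap(buf, MAP_SIZE);
    close(fd);
    unlink(HUGE_FILE);
    return 0;
}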

SDK development best practices


This section documents some best practices in terms of developing applications
using the Cell/B.E. SDK. See also developerWorks articles about programming tips
and best practices for writing Cell/B.E. applications at
http://www.ibm.com/developerworks/power/cell/

Using sandboxes for application development


Sandboxes are useful for both a single developer or multiple developers to
experiment with the SDK. If multiple developers are using the same machine, then
each developer should have a different sandbox.

The simplest way to create a sandbox is to recursively copy the
/opt/ibm/cell-sdk/prototype directory to a sandbox directory, which can be a
subdirectory of your home directory.

To establish a sandbox containing only a subset of the prototype samples and
libraries, copy only a subtree of the sample or samples of interest into a
private directory and set the CELL_TOP environment variable to
/opt/ibm/cell-sdk/prototype. This enables the makefiles to locate make rules,
header files, and other libraries that are in the SDK.

For example, to create a personal version of the tutorial sample, do the following:
cp -r /opt/ibm/cell-sdk/prototype/src/samples/tutorial/simple mysimple
export CELL_TOP=/opt/ibm/cell-sdk/prototype
cd mysimple
<modify the sample as desired>
make

Note: Some of the samples may attempt to install the built binaries. Errors result
when the install attempts to write private builds to system directories,
unless the user has root authority. To avoid this, either create a complete
sandbox (as previously recommended) without setting CELL_TOP or modify
the Makefile to not install the built binary.

Using a shared development environment


Multiple users should not update the common simulator sysroot image file by
mounting it read-write in the simulator. In this case, the callthru utility (see “The
callthru utility” on page 17) can be used to get files in and out of the simulator.

Alternatively, users can copy the sysroot image file to their own sandbox area and
then mount this version with read/write permissions to make persistent updates to
the image.

If multiple users need to run Cell/B.E. applications on a BladeCenter QS20,
you need a machine reservation mechanism to reduce collisions between two
people who are using SPEs at the same time. This is because SPE threads are
not fully preemptable in this version of the SDK.

Chapter 3. Debugging Cell/B.E. applications
This section describes how to debug Cell/B.E. applications. It describes the
following:
v Using the debugger
v Debugging in a combined environment
v Setting up remote debugging

Introduction
GDB is the standard command-line debugger available as part of the GNU
development environment. GDB has been modified to allow debugging in a
Cell/B.E. processor environment and this section describes how to debug Cell/B.E.
software using the new and extended features of the GDBs which are supplied
with SDK 2.1.

Debugging in a Cell/B.E. processor environment differs from debugging in a
typical multithreaded environment, because threads can run either on the PPE
or on an SPE.

There are three versions of GDB which can be installed on a BladeCenter QS20:
v gdb which is installed with Fedora Core 6 for debugging PowerPC applications.
You should NOT use this debugger for Cell/B.E. applications.
v ppu-gdb for debugging PPE code or for debugging combined PPE and SPE code.
This is the combined debugger.
v spu-gdb for debugging SPE code only. This is the standalone debugger.

This section also describes how to run applications under gdbserver. The
gdbserver program allows remote debugging.

GDB for SDK 2.1


The GDB program released with SDK 2.1 replaces previous versions and contains
the following enhancements:
v It is based on GDB 6.6 (instead of GDB 6.5)
v A large number of bugfixes

Compiling with GCC or XLC


You should use the -g option when compiling both SPE and PPE code with GCC
or XLC. The linker embeds all the symbolic and additional information required
for the SPE binary within the PPE binary so it is available for the debugger to
access when the program runs. The -g option adds debugging information to the
binary which then enables GDB to lookup symbols and show the symbolic
information.

For more information about compiling with GCC, see “Compiling and linking with
the GNU tool chain” on page 21.



Using the debugger
This section assumes that you are familiar with the standard features of GDB. It
describes the following topics:
v “Debugging PPE code” on page 26
v “Debugging SPE code” on page 26

Debugging PPE code


There are several ways to debug programs designed for the Cell/B.E. processor. If
you have access to Cell/B.E. hardware, you can debug directly using ppu-gdb. You
can also run the application under ppu-gdb inside the simulator. Alternatively, you
can debug remotely as described in “Setting up remote debugging” on page 36.

Whichever method you choose, after you have started the application under
ppu-gdb, you can use the standard GDB commands available to debug the
application. The GDB manual is available at the GNU Web site
http://www.gnu.org/software/gdb/gdb.html, and there are many other resources
available on the World Wide Web.

Note: Do not use gdb, which is the version of GDB which comes with the
operating system. Use ppu-gdb instead.

Debugging SPE code


Standalone SPE programs or spulets are self-contained applications that run
entirely on the SPE. Use spu-gdb to launch and debug standalone SPE programs in
the same way as you use ppu-gdb on PPE programs.

Note: You can use either spu-gdb or ppu-gdb to debug SPE only programs. In this
section spu-gdb is used.

The examples in this section use a standalone SPE (spulet) program, simple.c,
whose source code and Makefile are given below:

Source code:
#include <stdio.h>
#include <spu_intrinsics.h>

unsigned int
fibn(unsigned int n)
{
    if (n <= 2)
        return 1;
    return (fibn (n-1) + fibn (n-2));
}

int
main(int argc, char **argv)
{
    unsigned int c;
    c = fibn (8);
    printf ("c=%d\n", c);
    return 0;
}

Note: Recursive SPE programs are generally not recommended due to the limited
size of local storage. An exception is made here because such a program can
be used to illustrate the backtrace command of GDB.

Makefile:
simple: simple.c
spu-gcc simple.c -g -o simple

Source level debugging


Source-level debugging of SPE programs with spu-gdb is similar in nearly all
aspects to source-level debugging for the PPE. For example, you can:
v Set breakpoints on source lines
v Display variables by name
v Display a stack trace and single-step program execution
The following example illustrates the backtrace output for the simple.c standalone
SPE program.
$ spu-gdb ./simple
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "--host=powerpc64-unknown-linux-gnu --target=spu"...
(gdb) break 8
Breakpoint 1 at 0x184: file simple.c, line 8.
(gdb) break 18
Breakpoint 2 at 0x220: file simple.c, line 18.
(gdb) run
Starting program: /home/usr/md/fib/simple

Breakpoint 1, fibn (n=2) at simple.c:8


8 return 1;
(gdb) backtrace
#0 fibn (n=2) at simple.c:8
#1 0x000001a0 in fibn (n=3) at simple.c:10
#2 0x000001a0 in fibn (n=4) at simple.c:10
#3 0x000001a0 in fibn (n=5) at simple.c:10
#4 0x000001a0 in fibn (n=6) at simple.c:10
#5 0x000001a0 in fibn (n=7) at simple.c:10
#6 0x000001a0 in fibn (n=8) at simple.c:10
#7 0x0000020c in main (argc=1, argv=0x3ffe0) at simple.c:16
(gdb) delete breakpoint 1
(gdb) continue
Continuing.

Breakpoint 2, main (argc=1, argv=0x3ffe0) at simple.c:18


18 printf("c=%d\n", c);
(gdb) print c
$1 = 21
(gdb)

Assembler level debugging


The spu-gdb program also supports many of the familiar techniques for debugging
SPE programs at the assembler code level. For example, you can:
v Display register values
v Examine the contents of memory (which for the SPE means local storage)
v Disassemble sections of the program
v Step through a program at the machine instruction level

Chapter 3. Debugging Cell/B.E. applications 27


The following example illustrates some of these facilities.
$ spu-gdb ./simple
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "--host=powerpc64-unknown-linux-gnu --target=spu"...
(gdb) br 18
Breakpoint 1 at 0x220: file simple.c, line 18.
(gdb) r
Starting program: /home/usr/md/fib/simple

Breakpoint 1, main (argc=1, argv=0x3ffe0) at simple.c:18


18 printf("c=%d\n", c);
(gdb) print c
$1 = 21
(gdb) x /8i $pc
0x220 <main+72>: ila $3,0x8c0 <_fini+32>
0x224 <main+76>: lqd $4,32($1) # 20
0x228 <main+80>: brsl $0,0x2a0 <printf> # 2a0
0x22c <main+84>: il $2,0
0x230 <main+88>: ori $3,$2,0
0x234 <main+92>: ai $1,$1,80 # 50
0x238 <main+96>: lqd $0,16($1)
0x23c <main+100>: bi $0
(gdb) nexti
0x00000224 18 printf("c=%d\n", c);
(gdb) nexti
0x00000228 18 printf("c=%d\n", c);
(gdb) print $r4
$1 = {uint128 = 0x0003ffd0000000001002002000000030, v2_int64 = {
    1125693748412416, 1153484591999221808}, v4_int32 = {262096, 0, 268566560, 48},
  v8_int16 = {3, -48, 0, 0, 4098, 32, 0, 48}, v16_int8 =
    "\000\003??\000\000\000\000\020\002\000 \000\000\0000",
  v2_double = {5.5616660882883401e-309, 1.4492977868115269e-231},
  v4_float = {3.67274722e-40, 0, 2.56380757e-29, 6.72623263e-44}}

How spu-gdb manages SPE registers


Because each SPE register can hold multiple fixed or floating point values of
several different sizes, spu-gdb treats each register as a data structure that can be
accessed with multiple formats. The spu-gdb ptype command, illustrated in the
following example, shows the mapping used for SPE registers:
(gdb) ptype $r80
type = union __spu_builtin_type_vec128 {
int128_t uint128;
int64_t v2_int64[2];
int32_t v4_int32[4];
int16_t v8_int16[8];
int8_t v16_int8[16];
double v2_double[2];
float v4_float[4];
}

To display or update a specific vector element in an SPE register, specify
the appropriate field in the data structure, as shown in the following
example:
(gdb) p $r80.uint128
$1 = 0x00018ff000018ff000018ff000018ff0
(gdb) set $r80.v4_int32[2]=0xbaadf00d
(gdb) p $r80.uint128
$2 = 0x00018ff000018ff0baadf00d00018ff0

Debugging in the Cell/B.E. environment
To debug combined code, that is code containing both PPE and SPE code, you
must use ppu-gdb.

Debugging multithreaded code


Typically a simple program contains only one thread. For example, a PPU "hello
world" program is run in a process with a single thread and the GDB attaches to
that single thread.

On many operating systems, a single program can have more than one thread. The
ppu-gdb program allows you to debug programs with one or more threads. The
debugger shows all threads while your program runs, but whenever the debugger
runs a debugging command, the user interface shows the single thread involved.
This thread is called the current thread. Debugging commands always show
program information from the point of view of the current thread. For more
information about GDB support for debugging multithreaded programs, see the
sections 'Debugging programs with multiple threads' and 'Stopping and starting
multi-thread programs' of the GDB User's Manual, available at
http://www.gnu.org/software/gdb/gdb.html

The info threads command displays the set of threads that are active for the
program, and the thread command can be used to select the current thread for
debugging.

Note: The source code for the program simple.c used in the examples below
comes with the SDK and can be found at /opt/ibm/cell-sdk/prototype/
src/samples/tutorial/simple.

Debugging architecture
On the Cell/B.E. processor, a thread can run on either the PPE or on an SPE. A
program typically starts as a single thread running on the PPE, which can then
spawn new threads that run on either the PPE or on an SPE. When you choose a
thread to debug, the debugger automatically uses the correct architecture for the
thread. If the thread is running on the PPE, the debugger uses the PowerPC
architecture. If the thread is running on an SPE, the debugger uses the SPU
architecture.

To see which architecture the debugger is using, use the show architecture
command.

Example: show architecture

The example below shows the results of the show architecture command at two
different breakpoints in a program. At breakpoint 1 the program is executing in the
original PPE thread, where the show architecture command indicates that
the architecture is powerpc:common. The program then spawns an SPE thread which
will execute the SPU code in simple_spu.c. When the debugger detects that the
SPE thread has reached breakpoint 3, it switches to this thread and sets the
architecture to spu:256K. For more information about breakpoint 2, see “Setting
pending breakpoints” on page 32.

Note: The source code for the example below can be found at
/opt/ibm/cell-sdk/prototype/src/samples/tutorial/simple.



[user@localhost simple]$ ppu-gdb ./simple
...
...
...
(gdb) break main
Breakpoint 1 at 0x1801654: file simple.c, line 23.
(gdb) run
Starting program: /home/user/md/simple/simple
[Thread debugging using libthread_db enabled]
[New Thread 4160655360 (LWP 2490)]
[Switching to Thread 4160655360 (LWP 2490)]

Breakpoint 1, main (argc=1, argv=0xfff7a9e4) at simple.c:23


23 int i, status = 0;
(gdb) show architecture
The target architecture is set automatically (currently powerpc:common)
(gdb) break simple_spu.c:5
No source file named simple_spu.c.
Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 2 (simple_spu.c:5) pending.


(gdb) continue
Continuing.
Breakpoint 3 at 0x158: file simple_spu.c, line 5.
Pending breakpoint "simple_spu.c:5" resolved
[New Thread 4160390384 (LWP 2495)]
[Switching to Thread 4160390384 (LWP 2495)]

Breakpoint 3, main (id=103079215104) at simple_spu.c:13


13 {
(gdb) show architecture
The target architecture is set automatically (currently spu:256K)
(gdb)

Viewing symbolic and additional information


Compiling with the '-g' option adds debugging information to the binary that
enables GDB to look up symbols and show the symbolic information.

The debugger sees SPE executable programs as shared libraries. The info
sharedlibrary command shows all the shared libraries including the SPE
executables when running SPE threads.

Example: info sharedlibrary

The example below shows the results of the info sharedlibrary command at two
breakpoints on one thread. At breakpoint 1, the thread is running on the PPE, at
breakpoint 3 the thread is running on the SPE. For more information about
breakpoint 2, see “Setting pending breakpoints” on page 32.
(gdb) break main
Breakpoint 1 at 0x1801654: file simple.c, line 23.
(gdb) r
Starting program: /home/user/md/simple/simple
[Thread debugging using libthread_db enabled]
[New Thread 4160655360 (LWP 2528)]
[Switching to Thread 4160655360 (LWP 2528)]

Breakpoint 1, main (argc=1, argv=0xffacb9e4) at simple.c:23


23 int i, status = 0;
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
0x0ffc1980 0x0ffd9af0 Yes /lib/ld.so.1
0x0fe14b40 0x0fe20a00 Yes /usr/lib/libspe.so.1
0x0fe5d340 0x0ff78e30 Yes /lib/libc.so.6
0x0fce47b0 0x0fcf1a40 Yes /lib/libpthread.so.0

0x0f291cc0 0x0f2970e0 Yes /lib/librt.so.1
(gdb) break simple_spu.c:5
No source file named simple_spu.c.
Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 2 (simple_spu.c:5) pending.


(gdb) c
Continuing.
Breakpoint 3 at 0x158: file simple_spu.c, line 5.
Pending breakpoint "simple_spu.c:5" resolved
[New Thread 4160390384 (LWP 2531)]
[Switching to Thread 4160390384 (LWP 2531)]

Breakpoint 3, main (id=103079215104) at simple_spu.c:13


13 {
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
0x0ffc1980 0x0ffd9af0 Yes /lib/ld.so.1
0x0fe14b40 0x0fe20a00 Yes /usr/lib/libspe.so.1
0x0fe5d340 0x0ff78e30 Yes /lib/libc.so.6
0x0fce47b0 0x0fcf1a40 Yes /lib/libpthread.so.0
0x0f291cc0 0x0f2970e0 Yes /lib/librt.so.1
0x00000028 0x00000860 Yes simple_spu@0x1801d00 <6>
(gdb)

GDB creates a unique name for each shared library entry representing SPE code.
That name consists of the SPE executable name, followed by the location in PPE
memory where the SPE is mapped (or embedded into the PPE executable image),
and the SPE ID of the SPE thread where the code is loaded.

Using scheduler-locking
Scheduler-locking is a feature of GDB that simplifies multithread debugging by
enabling you to control the behavior of multiple threads when you single-step
through a thread. By default scheduler-locking is off, and this is the recommended
setting.

In the default mode where scheduler-locking is off, single-stepping through one


particular thread does not stop other threads of the application from running, but
allows them to continue to execute. This applies to both threads executing on the
PPE and on the SPE. This may not always be what you expect or want when
debugging multithreaded applications, because those threads executing in the
background may affect global application state asynchronously in ways that can
make it difficult to reliably reproduce the problem you are debugging. If this is a
concern, you can turn scheduler-locking on. In that mode, all other threads remain
stopped while you are debugging one particular thread. A third option is to set
scheduler-locking to step, which stops other threads while you are single-stepping
the current thread, but lets them execute while the current thread is freely running.

However, if scheduler-locking is turned on, there is the potential for deadlocking


where one or more threads cannot continue to run. Consider, for example, an
application consisting of multiple SPE threads that communicate with each other
through a mailbox. If you single-step one thread across an instruction that reads
from the mailbox, and that mailbox happens to be empty at the moment, this
instruction (and thus the debugging session) will block until another thread writes
a message to the mailbox. However, if scheduler-locking is on, that other thread
will remain stopped by the debugger because you are single-stepping. In this
situation none of the threads can continue, and the whole program stalls
indefinitely. This situation cannot occur when scheduler-locking is off,
because in that case all other threads continue to run while the first thread
is single-stepped. You should ensure that you enable scheduler-locking only
for applications where such deadlocks cannot occur.

There are situations where you can safely set scheduler-locking on, but you should
do so only when you are sure there are no deadlocks.

The syntax of the command is:


set scheduler-locking <mode>

where mode has one of the following values:


v off
v on
v step
You can check the scheduler-locking mode with the following command:
show scheduler-locking
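
For example, to stop other threads only while you are single-stepping the
current thread, and then to verify the setting, you could enter:

set scheduler-locking step
show scheduler-locking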

Using the combined debugger


Generally speaking, you can use the same procedures to debug code for Cell/B.E.
as you would for PPC code. However, some existing features of GDB and one new
command can help you to debug in the Cell/B.E. processor multithreaded
environment. These features are described below.

Setting pending breakpoints


Breakpoints stop programs running when a certain location is reached. You set
breakpoints with the break command, followed by the line number, function name,
or exact address in the program.

You can use breakpoints for both PPE and SPE portions of the code. There are
some instances, however, where GDB must defer insertion of a breakpoint because
the code containing the breakpoint location has not yet been loaded into memory.
This occurs when you wish to set the breakpoint for code that is dynamically
loaded later in the program. If ppu-gdb cannot find the location of the breakpoint it
sets the breakpoint to pending. When the code is loaded, the breakpoint is inserted
and the pending breakpoint deleted.

You can use the set breakpoint command to control the behavior of GDB when it
determines that the code for a breakpoint location is not loaded into memory. The
syntax for this command is:
set breakpoint pending <on|off|auto>

where
v on specifies that GDB should set a pending breakpoint if the code for the
breakpoint location is not loaded.
v off specifies that GDB should not create pending breakpoints, and break
commands for a breakpoint location that is not loaded result in an error.
v auto specifies that GDB should prompt the user to determine if a pending
breakpoint should be set if the code for the breakpoint location is not
loaded. This is the default behavior.

Example: Pending breakpoints

The example below shows the use of pending breakpoints. Breakpoint 1 is a
standard breakpoint set for simple.c, line 23. When the breakpoint is reached,
the program stops running for debugging. After set breakpoint pending is set to
off, GDB cannot set breakpoint 2 (break simple_spu.c:5) and generates the error
message No source file named simple_spu.c. After set breakpoint pending is
changed to auto, GDB sets a pending breakpoint for the location simple_spu.c:5.
At the point where GDB can resolve the location, it sets the next breakpoint,
breakpoint 3.
(gdb) break main
Breakpoint 1 at 0x1801654: file simple.c, line 23.
(gdb) r
Starting program: /home/user/md/simple/simple
[Thread debugging using libthread_db enabled]
[New Thread 4160655360 (LWP 2651)]
[Switching to Thread 4160655360 (LWP 2651)]

Breakpoint 1, main (argc=1, argv=0xff95f9e4) at simple.c:23


23 int i, status = 0;
(gdb) set breakpoint pending off
(gdb) break simple_spu.c:5
No source file named simple_spu.c.
(gdb) set breakpoint pending auto
(gdb) break simple_spu.c:5
No source file named simple_spu.c.
Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 2 (simple_spu.c:5) pending.


(gdb) c
Continuing.
Breakpoint 3 at 0x158: file simple_spu.c, line 5.
Pending breakpoint "simple_spu.c:5" resolved
[New Thread 4160390384 (LWP 2656)]
[Switching to Thread 4160390384 (LWP 2656)]

Breakpoint 3, main (id=103079215104) at simple_spu.c:13


13 {
(gdb)

Note: The example above shows one of the ways to use pending breakpoints. For
more information about other options, see the documentation available at
http://www.gnu.org/software/gdb/gdb.html

Using the set spu stop-on-load command


The new set spu stop-on-load command stops each thread before it starts
running on the SPE. While set spu stop-on-load is in effect, the debugger
automatically sets a temporary breakpoint on the "main" function of each new
SPE thread immediately after it is loaded. You can use the set spu
stop-on-load command to do this instead of simply issuing a break main
command, because the latter is always interpreted to set a breakpoint on the
"main" function of the PPE executable.

Note: The set spu stop-on-load command has no effect in the SPU standalone
debugger spu-gdb. To let an SPU standalone program proceed to its "main"
function, you can use the start command in spu-gdb.

The syntax of the command is:


set spu stop-on-load <mode>

where mode is on or off.

To check the status of spu stop-on-load, use the show spu stop-on-load command.



Example: set spu stop-on-load on
(gdb) break main
Breakpoint 1 at 0x1801654: file simple.c, line 23.
(gdb) r
Starting program: /home/user/md/simple/simple
[Thread debugging using libthread_db enabled]
[New Thread 4160655360 (LWP 3009)]
[Switching to Thread 4160655360 (LWP 3009)]

Breakpoint 1, main (argc=1, argv=0xffc7c9e4) at simple.c:23


23 int i, status = 0;
(gdb) show spu stop-on-load
Stopping for new SPE threads is off.
(gdb) set spu stop-on-load on
(gdb) c
Continuing.
Breakpoint 2 at 0x174: file simple_spu.c, line 16.
[New Thread 4160390384 (LWP 3013)]
Breakpoint 3 at 0x174: file simple_spu.c, line 16.
[Switching to Thread 4160390384 (LWP 3013)]
main (id=25272376) at simple_spu.c:16
16 for (i = 0, n = 0; i<5; i++) {
(gdb) info threads
* 2 Thread 4160390384 (LWP 3013) main (id=25272376) at simple_spu.c:16
1 Thread 4160655360 (LWP 3009) 0x0ff27428 in mmap () from /lib/libc.so.6
(gdb) c
Continuing.
Hello Cell (0x181a038) n=3
Hello Cell (0x181a038) n=6
Hello Cell (0x181a038) n=9
Hello Cell (0x181a038) n=12
Hello Cell (0x181a038) n=15
[Thread 4160390384 (LWP 3013) exited]
[New Thread 4151739632 (LWP 3015)]
[Switching to Thread 4151739632 (LWP 3015)]
main (id=25272840) at simple_spu.c:16
16 for (i = 0, n = 0; i<5; i++) {
(gdb) info threads
* 3 Thread 4151739632 (LWP 3015) main (id=25272840) at simple_spu.c:16
1 Thread 4160655360 (LWP 3009) 0x0fe14f38 in load_spe_elf (
handle=0x181a3d8, ld_buffer=0xf6f29000, ld_info=0xffc7c230)
at elf_loader.c:224
(gdb)

New command reference


In addition to the set spu stop-on-load command, the ppu-gdb and spu-gdb
programs offer an extended set of the standard GDB info commands. These are:
v info spu event
v info spu signal
v info spu mailbox
v info spu dma
v info spu proxydma

If you are working in GDB, you can access help for these new commands. To
access help, type help info spu followed by the info spu subcommand name. This
displays full documentation. Command name abbreviations are allowed if
unambiguous.
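
For example, to display the full documentation for the DMA status command,
enter:

(gdb) help info spu dma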

Note: For more information about the various output elements, refer to the
Cell Broadband Engine Architecture specification available at
http://www.ibm.com/developerworks/power/cell/

info spu event


Displays SPE event facility status. The output is similar to:
(gdb) info spu event
Event Status 0x00000000
Event Mask 0x00000000

info spu signal


Displays SPE signal notification facility status. The output is similar to:
(gdb) info spu signal
Signal 1 not pending (Type Or)
Signal 2 control word 0x30000001 (Type Or)

info spu mailbox


Displays SPE mailbox facility status. Only pending entries are shown. Entries are
displayed in the order of processing, that is, the first data element in the list is the
element that is returned on the next read from the mailbox. The output is similar
to:
(gdb) info spu mailbox
SPU Outbound Mailbox
0x00000000
SPU Outbound Interrupt Mailbox
0x00000000
SPU Inbound Mailbox
0x00000000
0x00000000
0x00000000
0x00000000

info spu dma


Displays MFC DMA status. For each pending DMA command, the opcode, tag,
and class IDs are shown, followed by the current effective address, local store
address, and transfer size (updated as the command is processed). For commands
using a DMA list, the local store address and size of the list is shown. The
"E" column indicates commands flagged as erroneous by the MFC. The output is
similar to:
(gdb) info spu dma
Tag-Group Status 0x00000000
Tag-Group Mask 0x00000000 (no query pending)
Stall-and-Notify 0x00000000
Atomic Cmd Status 0x00000000

Opcode Tag TId RId EA LSA Size LstAddr LstSize E


get 1 2 3 0x000000000ffc0001 0x02a80 0x00020 *
putllc 0 0 0 0xd000000000230080 0x00080 0x00000
get 4 1 1 0x000000000ffc0004 0x02b00 0x00004 *
mfcsync 0 0 0 0x00300 0x00880
get 0 0 0 0xd000000000230900 0x00e00 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000

0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000

info spu proxydma


Displays MFC Proxy-DMA status. The output is similar to:
(gdb) info spu proxydma
Tag-Group Status 0x00000000
Tag-Group Mask 0x00000000 (no query pending)

Opcode Tag TId RId EA LSA Size LstAddr LstSize E


getfs 0 0 0 0xc000000000379100 0x00e00 0x00000
get 0 0 0 0xd000000000243000 0x04000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000
0 0 0 0 0x00000 0x00000

Setting up remote debugging


There are three versions of gdbserver provided with SDK 2.1:
v spu-gdbserver to run a stand-alone spulet. You must use spu-gdb on the client.
v ppu32-gdbserver to run a 32-bit PPE or combined executable. You must use
ppu-gdb on the client.
v ppu-gdbserver to run a 64-bit PPE or combined executable. You must use
ppu-gdb on the client.

Note: In the following section, gdbserver is used as the generic term for all
three versions. Similarly, GDB is used to refer to the two different debuggers.

This section describes how to set up remote debugging for the Cell/B.E. processor
and the simulator. It covers the following topics:
v “Remote debugging overview” on page 36
v “Using remote debugging” on page 36
v “Starting remote debugging” on page 37

Remote debugging overview


You can run an application under gdbserver to allow remote hardware and
simulator-based debugging. Gdbserver is a companion program to GDB that
implements the GDB remote serial protocol. This is used to convert GDB into a
client/server-style application, where gdbserver launches and controls the
application on the target platform, and GDB connects to gdbserver to specify
debugging commands.

The connection between GDB and gdbserver can either be through a traditional
serial line or through TCP/IP. For example, you can run gdbserver on a
BladeCenter QS20 and GDB on an Intel® x86 platform, which then connects to the
gdbserver using TCP/IP.

Using remote debugging


You can use gdbserver to run and debug applications on a BladeCenter QS20 or
under the simulator. For example, you can debug applications written on an Intel
x86 platform remotely on a BladeCenter QS20 target or on the simulator.

Note: IDEs such as Eclipse do not directly communicate with gdbserver. However,
an IDE can communicate with GDB running on the same host which can
then in turn communicate with gdbserver running on a remote machine.

To use remote debugging, you need a version of the program for the target
platform and network connectivity. The gdbserver program comes packaged with
GDB and is installed with the SDK 2.1.

Note: To connect through the network to the simulator, you must enable bogusnet
support in the simulator. This creates a special Ethernet device that uses a
"call-thru" interface to send and receive packets to the host system. See the
simulator documentation for details about how to enable bogusnet.

Further information on the remote debugging of Cell Broadband Engine
applications is available in the DeveloperWorks article at
http://www-128.ibm.com/developerworks/power/library/pa-celldebug/

Starting remote debugging


To start a remote debugging session, do the following:
1. Use gdbserver to launch the application on the target platform (either the
BladeCenter QS20 or inside the Simulator). To do this enter:
<gdbserver version> [ip address] :<port> <application> [arg1 arg2 ...]
where
v <gdbserver version> refers to the version of gdbserver appropriate for the
program you wish to debug
v [ip address] is optional
v :<port> specifies the TCP/IP port to be used for communication with
gdbserver
v <application> is the name of the program to be debugged
v [arg1 arg2 ...] are the command line arguments for the program
An example for ppu-gdbserver using port 2101 for the program myprog which
requires no command line arguments would be:
ppu-gdbserver :2101 myprog

Note: If you use ppu-gdbserver as shown here then you must use ppu-gdb on
the client.
2. Start GDB from the client system (if you are using the simulator this is the host
system of the simulator).
For the simulator this is:
/opt/cell/bin/ppu-gdb myprog
For the BladeCenter QS20 this is:
/usr/bin/ppu-gdb myprog
You should have the source and compiled executable version for myprog on the
host system. If your program links to dynamic libraries, GDB attempts to locate
these when it attaches to the program. If you are cross-debugging, you need to
direct GDB to the correct versions of the libraries; otherwise it tries to
load the libraries from the host platform. The default path is
/opt/cell/sysroot. For the Cell/B.E. SDK 2.1, issue the following GDB command
so that GDB looks for the target versions of the libraries under that path:
set solib-absolute-prefix /opt/cell/sysroot



Note: If you have not installed the libraries in the default directory you must
indicate the path to them. Generally the lib/ and lib64/ directories are
under /opt/cell/sysroot/.
3. At the GDB prompt, connect to the server with the command:
target remote 172.20.0.2:2101
where 172.20.0.2 is the IP address of the Cell system that is running
gdbserver, and the :2101 parameter is the TCP/IP port parameter that was used
to start gdbserver. If you are running the client on the same machine then the
IP address can be omitted. If you are using the simulator, the IP address is
generally fixed at 172.20.0.2. To verify this, enter the ifconfig command in
the console window of the simulator.
If gdbserver is running on the simulator, you can use a symbolic host name for
the simulator, for example:
target remote systemsim:2101

To do this, edit the host system's /etc/hosts as follows:


# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
172.20.0.2 systemsim

The following shows an example of a remote debugging session for myprog:


8 {
9 char *needle, *haystack;
10 int count = 0;
11
12 if (argc < 3) {
13 return 0;
14 }
15
16 needle = argv[1];
17 haystack = argv[2];
18
B+>19 while (*haystack != '\0')
20 {
21 int i = 0;
22 while (needle[i] != '\0' && haystack[i] != '\0' && needle[i] == haystack[i]) {
23 i++;
24 }
25 if (needle[i] == '\0') {
26 count++;
27 }
28 haystack++;
29 }
30
31 return count;
32 }
remote Thread 42000 In: main Line: 19 PC 0x18004c8
Type "how copying" to see conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "--host=i686-pc-linux-gnu --target=powerpc64-linux"....
(gdb) target remote 172.20.0.2:2101
Remote debugging using 172.20.0.2:2101
0xf7ff80f0 in ?? ()
(gdb) b 19
Breakpoint 1 at 0x18004c8: file myprog.c, line 19.
(gdb) c
Continuing.

Breakpoint 1, main (argc=3, argv=0xffab6b74) at myprog.c:19


(gdb)

Chapter 4. SPU code overlays
This section describes how to use the overlay facility to overcome the physical
limitations on code and data size in the SPU.

What are overlays


Optimally a complete SPU program is loaded into the local storage of the SPU
before it is executed. This is the most efficient method of execution. However,
when the sum of the code and data lengths of the program exceeds the local
storage size it is necessary to use overlays. (For BladeCenter QS20 the storage size
is 256 KB.) Overlays may be used in other circumstances; for example performance
might be improved if the size of data areas can be increased by moving rarely
used functions to overlays.

An overlay is a program segment which is not loaded into SPU local storage
before the main program begins to execute, but is instead left in Cell main storage
until it is required. When the SPU program calls code in an overlay segment, this
segment is transferred to local storage where it can be executed. This transfer will
usually overwrite another overlay segment which is not immediately required by
the program.

In an overlay structure the local storage is divided into a root segment, which is
always in storage, and one or more overlay regions, where overlay segments are
loaded when needed. Any given overlay segment will always be loaded into the
same region. A region may contain more than one overlay segment, but a segment
will never cross a region boundary.

(A segment is the smallest unit which can be loaded as one logical entity during
execution. Segments contain program sections such as functions and data areas.)

The overlay feature is supported for Cell SPU programming (but not for PPU
programming) on a native BladeCenter QS20 or on the simulator hosted on an x86
or PowerPC machine.

How overlays work


The code size problem can be addressed through the generation of overlays by the
linker. Two or more code segments can be mapped to the same physical address in
local storage. The linker also generates call stubs and associated tables for overlay
management. Instructions to call functions in overlay segments are replaced by
branches to these call stubs, which load the function code to be called, if necessary,
and then branch to the function.

In most cases all that is needed to convert an ordinary program to an overlay


program is the addition of a linker script to structure the module. In the script you
specify which segments of the program can be overlaid. The linker then prepares
the required segments so that they may be loaded when needed during execution
of the program, and also adds supporting code from the overlay manager library.

At execution time when a call is made from an executing segment to another
segment, the system determines from the overlay tables whether the requested
segment is already in local storage. If not, this segment is loaded
dynamically (this
is carried out by a DMA command), and may overlay another segment which had
been loaded previously.

Restrictions on the use of overlays


When using overlays you must consider the scope of data very carefully. It is a
widespread practice to group together code sections and the data sections used by
them. If these are located in an overlay region the data can only be used
transiently; overlay sections are not 'swapped out' (written back to Cell/B.E.
main storage) as on other platforms, but are replaced entirely by other
overlays.

Ideally all data sections are kept in the root segment which is never overlaid. If the
data size is too large for this then sections for transient data may be included in
overlay regions, but the implications of this must be carefully considered.

Planning to use overlays


The overlay structure should be considered at the program planning stage, as soon
as code sizes can be estimated. The planning needs to include the number of
overlay regions that are required; the number of segments which will be overlaid
into each region; and the number of functions within each segment. At this stage it
is better to overestimate the number of segments required than to underestimate
them. It is easier to combine segments later than to break up oversize segments
after they are coded.

Overview
The structure of an overlay SPU program module depends on the relationships
between the segments within the module. Two segments which do not have to be
in storage at the same time may share an address range. These segments can be
assigned the same load addresses, as they are loaded only when called. For
example, segments that handle error conditions or unusual data are used
infrequently and need not occupy storage until they are required.

Program sections which are required at any time are grouped into a special
segment called the root segment. This segment remains in storage throughout
the execution of a program.

Some overlay segments may be called by several other overlay segments. This can
be optimized by placing the called and calling segments in separate regions.

To design an overlay structure you should start by identifying the code sections or
stubs which receive control at the beginning of execution, and also any code
sections which should always remain in storage. These together form the root
segment. The rest of the structure is developed by checking the links between the
remaining sections and analyzing these to determine which sections can share the
same local storage locations at different times during execution.

Sizing
Because the minimum indivisible code unit is at the function level, the minimum
size of the overlay region is the size of the largest overlaid function. If this function
is so large that the generated SPU program does not fit in local storage then a
warning is issued by the linker. The user must address this problem by splitting
the function into one or more smaller functions.

Scaling considerations
Even with overlays there are limits on how large an SPE executable can become.
An infrastructure of manager code, tables, and stubs is required to support
overlays and this infrastructure itself cannot reside in an overlay. For a program
with s overlay segments in r regions, making cross-segment calls to f functions, this
infrastructure requires the following amounts of local storage:
v manager: about 400 bytes,
v tables: s * 16 + r * 4 bytes,
v stubs: f * 8 + s * 8 bytes.
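
As an illustrative calculation, a program with 4 overlay segments in 2
regions, making cross-segment calls to 10 functions, would need about
400 + (4 * 16 + 2 * 4) + (10 * 8 + 4 * 8) = 584 bytes of local storage for
this infrastructure.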

This allows a maximum available code size of about 512 megabytes, split into 4096
overlay sections of 128 kilobytes each. (This assumes a single entry point into each
section and no global data segment or stack.)

Except for the local storage memory requirements described above, this design
does not impose any limitations on the numbers of overlay segments or regions
supported.

Overlay tree structure example


Suppose that a program contains seven sections which are labelled SA through SG,
and that the total length of these exceeds the amount of local storage available.
Before the program is restructured it must be analyzed to find the optimum
overlay design.

The relationship between segments can be shown with a tree structure. This
graphically shows how segments can use local storage at different times. It does
not imply the order of execution (although the root segment is always the first to
receive control). Figure 2 shows the tree structure for this program. The structure
includes five segments (see Figure 2).

Figure 2. Overlay tree structure

The position of the segments in an overlay tree structure does not imply the
sequence in which the segments are executed; in particular sections in the root
segment may be called from any segment. A segment can be loaded and overlaid
as many times as the logic of the program requires.



Length of an overlay program
For purposes of illustration, assume the sections in the sample program have the
following lengths:
Table 5. Sample program lengths
Section Length (in bytes)
SA 30,000
SB 20,000
SC 60,000
SD 40,000
SE 30,000
SF 60,000
SG 80,000

If the program did not use overlays it would require 320 KB of local storage; the
sum of all sections. With overlays, however, the storage needed for the program is
the sum of all overlay regions, where the size of each region is the size of its
largest segment. In this structure the maximum is formed by segments 0, 4, and 2;
these being the largest segments in regions 0, 1, and 2. The sum of the regions is
then 200 KB, as shown in Figure 3.

Figure 3. Length of an overlay module

Note: The sum of all regions is not the minimum requirement for an overlay
program. When a program uses overlays, extra programming and tables are
used and their storage requirements must also be considered. The storage
required by these is described in “Scaling considerations” on page 41.

Segment origin
The linker typically assigns the origin of the root segment (the origin of the
program) to address 0x80. The relative origin of each segment is determined
by the length of all previously defined regions. For example, the origin of
segments 2 and
3 is equal to the root origin plus 80 KB (the length of region 1 and segment 4) plus
50 KB (the length of the root segment), or 0x80 plus 130 KB. The origins of all the
segments are as follows:
Table 6. Segment origins
Segment Origin
0 0x80 + 0
1 0x80 + 50,000
2 0x80 + 130,000
3 0x80 + 130,000
4 0x80 + 50,000

The segment origin is also called the load point, because it is the relative location
where the segment is loaded. Figure 4 shows the segment origin for each segment
and the way storage is used by the sample program. The vertical bars indicate
segment origin; two segments with the same origin can use the same storage area.
This figure also shows that the longest path is that for segments 0, 4, and 2.

Figure 4. Segment origin and use of storage

Overlay processing
The overlay processing is initiated when a section in local storage calls a section
not in storage. The function which determines when an overlay is to occur is the
overlay manager. This checks which segment the called section is in and, if

necessary, loads the segment. When a segment is loaded it overlays any
segment in storage with the same relative origin. No overlay occurs if one
section calls another section which is in a segment already in storage (in
another region or in the root segment).

The overlay manager uses special stubs and tables to determine when an overlay is
necessary. These stubs and tables are generated by the linker and are part of the
output program module. The special stubs are used for each inter-segment call.
The tables generated are the overlay segment table and the overlay region table.
Figure 5 shows the location of the call stubs and the segment and region tables in
the root segment in the sample program.

Figure 5. Location of stubs and tables in an overlay program

The size of these tables must be considered when planning the use of local storage.

Call stubs
There is one call stub for each function in an overlay segment which is called from
a different segment. No call stub is needed if the function is called within the same
segment. All call stubs are in the root segment. During execution the call stub
specifies, to the overlay manager, the segment to be loaded and the offset within
that segment to which control is transferred once the segment is loaded.
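
In outline, an inter-segment call stub behaves as described in the following
comment-style sketch (an illustration only; the real stub is a short sequence of
SPU instructions generated by the linker):

/* Conceptual behavior of an inter-segment call stub:
 *
 *   1. The stub carries two constants: the target overlay segment and the
 *      offset of the called function within that segment.
 *   2. If that segment is not the one currently mapped in its region, the
 *      stub invokes the overlay manager, which DMAs the segment into the
 *      region's load point.
 *   3. Control then branches to (load point of the region) + (offset of
 *      the function).
 */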

Segment and region tables


Each overlay program contains one overlay segment table and one overlay region
table. These tables are in the root segment. The segment table contains static
(read-only) information about the relationship of the segments and regions in the
program. During execution the region table contains dynamic (read-write) control
information such as which segments are loaded into each region.

Overlay graph structure example


If the same section is used by several segments it is usually desirable to place that
section in the root segment. However, the root segment can get so large that the
benefits of overlay are lost. If some of the sections in the root segment could
overlay each other then the program might be described as an overlay graph
structure (as opposed to an overlay tree structure) and it should use multiple
regions.

With multiple regions each segment has access to both the root segment and other
overlay segments in other regions. Therefore regions are independent of each other.

Figure 6 shows the relationship between the sections in the sample program and
two new sections: SH and SI. The two new sections are each used by two other
sections in different segments. Placing SH and SI in the root segment makes the
root segment larger than necessary, because SH and SI can overlay each other. The
two sections cannot be duplicated in two paths, because the linker automatically
deletes the duplicates.

Figure 6. Overlay graph structure

However, if the two sections are placed in another region they can be in local
storage when needed, regardless of the segments executed in the other regions.
Figure 7 on page 46 shows the sections in a four-region structure. Either segment
in region 3 can be in local storage regardless of the segments being executed in
regions 0, 1, or 2. Segments in region 3 can cause segments in region 0, 1 or 2 to be
loaded without being overlaid themselves.


Figure 7. Overlay graph using multiple regions

The relative origin of region 3 is determined by the length of the preceding regions
(200 KB). Region 3, therefore, begins at the origin plus 200 KB.

The local storage required for the program is determined by adding the lengths of
the longest segment in each region. In Figure 7, if SH is 40 KB and SI is 30 KB, the
storage required is 240 KB (50 KB + 80 KB + 70 KB + 40 KB), plus the storage
required by the overlay manager, its call stubs, and its overlay tables. Figure 8 on
page 47 shows the segment origin for
each segment and the way storage is used by the sample program.

Figure 8. Overlay graph segment origin and use of storage

Specification of an SPU overlay program


Once you have designed an overlay structure, the program must be arranged into
that structure. You must indicate to the linker the relative positions of the
segments, the regions, and the sections in each segment, by using OVERLAY
statements. Positioning is accomplished as follows:
Regions
Are defined by each OVERLAY statement. Each OVERLAY statement begins a
new region.
Segments
Are defined within an OVERLAY statement. Each segment statement within
an overlay statement defines a new segment. In addition, it provides a
means to equate each load point with a unique symbolic name.
Sections
Are positioned in the segment specified by the segment statement with
which they are associated.

The input sequence of control statements and sections should reflect the sequence
of the segments in the overlay structure (for example the graph in Figure 7 on page
46), region by region, from top to bottom and from left to right. This sequence is
illustrated in later examples.

The origin of every region is specified with an OVERLAY statement. Each OVERLAY
statement defines a load point at the end of the previous region. That load point is
logically assigned a relative address at the quadword boundary that follows the
last byte of the largest segment in the preceding region. Subsequent segments
defined in the same region have their origin at the same load point.


In the sample overlay tree program, two load points are assigned to the origins of
the two OVERLAY statements and their regions, as shown in Figure 2 on page 41.
Segments 1 and 4 are at the first load point; segments 2 and 3 are at the second
load point.

The following sequence of linker script statements results in the structure in
Figure 3 on page 42.
OVERLAY {
.segment1 {./sc.o(.text)}
.segment4 {./sg.o(.text)}
}
OVERLAY {
.segment2 {./sd.o(.text) ./se.o(.text)}
.segment3 {./sf.o(.text)}
}

Note: By implication sections SA and SB are associated with the root segment
because they are not specified in the OVERLAY statements.

In the sample overlay graph program, as shown in Figure 6 on page 45, one more
load point is assigned to the origin of the last OVERLAY statement and its region.
Segments 5 and 6 are at the third load point.

The following linker script statements add to the sequence for the overlay tree
program creating the structure shown in Figure 7 on page 46:
...
OVERLAY {
.segment5 {./si.o(.text)}
.segment6 {./sh.o(.text)}
}

Coding for overlays

Migration/Co-Existence/Binary-Compatibility Considerations
This feature will work with both IPA and non-IPA code, though the partitioning
algorithm will generate better overlays with IPA code.

Compiler options
Table 7. Compiler options

-qipa=overlay
    Specifies that the compiler should automatically create code overlays.
    The -qipa=partition={small|medium|large} option is used to control the
    size of the overlay buffer. The overlay buffer will be placed after the
    text segment of the linker script.

-qipa=nooverlay
    Specifies that the compiler should not automatically create code
    overlays. This is the default behavior for the dual source compiler.

-qipa=overlayproc=<names_list>
    Specifies a comma-separated list of functions that should be in the
    same overlay. Multiple overlayproc suboptions may be present to specify
    multiple overlay groups. If a procedure is listed in multiple groups,
    it will be cloned for each group referencing it. C++ function names
    must be mangled.

-qipa=nooverlayproc=<names_list>
    Specifies a comma-separated list of functions that should not be
    overlaid. These will always be resident in the local store. C++
    function names must be mangled.

Examples:
# Compile and link without overlays.
xlc foo.c bar.c
xlc foo.c bar.c -qipa=nooverlay

# Compile and link with automatic overlays.
xlc foo.c bar.c -qipa=overlay

# Compile and link with automatic overlays and ensure that foo and bar are
# in the same overlay. The main function is always resident.
xlc foo.c bar.c -qipa=overlay:overlayproc=foo,bar:nooverlayproc=main

# Compile and link with automatic overlays and a custom linker script.
xlc foo.c bar.c -qipa=overlay -Wl,-Tmyldscript
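
The overlay buffer size suboption from Table 7 can be combined in the same way.
A hypothetical invocation (the suboption spelling follows the table above; check
the compiler reference for the exact syntax accepted by your compiler level):

# Compile and link with automatic overlays and a small overlay buffer.
xlc foo.c bar.c -qipa=overlay:partition=small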

SDK overlay samples


Three samples are considered:
1. a very simple overlay program: “Simple overlay sample”;
2. the sample used in the overview above: “Overview overlay sample” on page 52;
3. and a "large matrix" sample: “Large matrix overlay sample” on page 53.

These samples can be found in the SDK at
/opt/ibm/cell-sdk/prototype/src/samples/overlay.

Simple overlay sample


This sample consists of a single PPU program named driver which creates an SPU
thread and launches an embedded SPU main program named spu_main. The SPU
program calls four functions: o1_test1, o1_test2, o2_test1, and o2_test2. The first
two functions are defined in a single compilation unit, olay1/test.c, and the
second two functions are similarly defined in olay2/test.c. See the calling
diagram in Figure 9 on page 50. Upon completion of the SPU thread the driver
returns a value from the SPU program to the PPU program.
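
The driver source itself is in the sample directory. As a rough sketch, a PPE
driver of this shape written against the SPE Runtime Management Library
version 2.1 might look like the following (the embedded program handle name and
the omission of error checking are illustrative, not the sample's actual code):

#include <libspe2.h>

extern spe_program_handle_t spu_main;   /* embedded SPU image; name assumed */

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop_info;

    spe_program_load(ctx, &spu_main);    /* load the embedded SPU program   */
    spe_context_run(ctx, &entry, 0, NULL, NULL, &stop_info); /* run to end  */
    spe_context_destroy(ctx);
    return 0;   /* the real driver propagates the SPU program's exit value  */
}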


Figure 9. Simple overlay program call graph

The SPU program is organized as an overlay program with two regions and three
segments. The first region is the non-overlay region containing the root segment
(segment 0). This root segment contains the spu_main function along with overlay
support programming and tables (not shown). The second region is an overlay
region and contains segments 1 and 2. In segment 1 are the code sections of
functions o1_test1 and o1_test2, and in segment 2 are the code sections of
functions o2_test1 and o2_test2, as shown in Figure 10.

Figure 10. Simple overlay program regions, segments and sections

Combining these figures yields the following diagram showing the structure of the
SPU program.

Figure 11. Simple overlay program logical structure

The physical view of this sample (Figure 12) shows one region containing the
non-overlay root segment, and a second region containing one of two overlay
segments. Because the functions in these two overlay segments are quite similar
their lengths happen to be the same.

Figure 12. Simple overlay program physical structure

The spu_main program calls its sub-functions multiple times. Specifically,
spu_main first calls two functions, o1_test1 and o1_test2, passing in an
integer value (101 and 102 respectively); upon return it expects an integer
result (1 and 2 respectively). Next spu_main calls the two other functions,
o2_test1 and o2_test2, passing in an integer value (201 and 202 respectively);
upon return it expects an integer result (11 and 12 respectively). Finally
spu_main calls the first two functions again, o1_test1 and o1_test2, passing in
an integer value (301 and 302 respectively); upon return it expects an integer
result (1 and 2 respectively). Between each pair of calls, the overlay manager
loads the appropriate segment into the appropriate region: for the first pair it
loads segment 1 into region 1, for the second pair it loads segment 2 into
region 1, and for the last pair it reloads segment 1 into region 1. See Figure 13
on page 52.
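
Expressed as code, the sequence above might look like the following sketch of the
SPU main program (argument and expected return values are taken from the
description; the real program also verifies each result):

extern int o1_test1(int), o1_test2(int);   /* segment 1 */
extern int o2_test1(int), o2_test2(int);   /* segment 2 */

int main(void)                             /* SPU-side main, simplified */
{
    int sum = 0;
    sum += o1_test1(101);  /* overlay manager loads segment 1 into region 1 */
    sum += o1_test2(102);  /* segment 1 already resident; no load           */
    sum += o2_test1(201);  /* segment 2 overlays segment 1                  */
    sum += o2_test2(202);  /* segment 2 already resident; no load           */
    sum += o1_test1(301);  /* segment 1 is reloaded into region 1           */
    sum += o1_test2(302);
    return sum;            /* returned through the driver to the PPU        */
}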


Figure 13. Sample overlay program interaction diagram

The linker flags used are:

LDFLAGS = -Wl,-T,linker.script

The linker commands are in linker.script in
/opt/ibm/cell-sdk/prototype/src/samples/overlay/simple. In this script the files
in the overlay segments are explicitly excluded from the .text non-overlay
segment:

Note: To simplify the linker scripts only the affected statements are shown in this
and the following examples.
...
.text :
{
*( EXCLUDE_FILE(./olay1/test.o ./olay2/test.o)
.text .stub .text.* .gnu.linkonce.t.*)
*(.gnu.warning)
}
OVERLAY {
.segment1 {./olay1/test.o(.text)}
.segment2 {./olay2/test.o(.text)}
}
...
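
A hypothetical SPU-side link step using this script (object file names follow the
sample layout; the actual build rules are in the sample's makefile):

spu-gcc -o spu_main spu_main.o olay1/test.o olay2/test.o -Wl,-T,linker.script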

Overview overlay sample


The overview overlay program is an adaptation of the program described in
“Overlay graph structure example” on page 44. The structure is the same as that
shown in Figure 7 on page 46 but the sizes of each segment are different. Each
function is defined in its own compilation unit; a distinct file with a name the
same as the function name.

The sample consists of a single SPU main program. The main program calls the SA
function which in turn calls the SB function. These three functions are all located in
the root segment (segment 0) and cannot be overlaid.

The SB function calls the SC and SG functions. These are in two segments which are
both located in region 1 and overlay each other.

SC calls SD and SF. SD in turn calls SE. The SD and SE functions are in segment 2
and the SF function is in segment 3. These two segments are both located in region
2 and overlay each other.

The SF and SG functions call the SH and SI functions. SI is in segment 6, and SH is
in segment 5. These two segments are both located in region 3 and overlay each
other.

The physical view of this sample (Figure 8 on page 47) shows the four regions; one
region containing a single non-overlay root segment and three regions containing
six overlay segments.

The linker flags used are:

LDFLAGS = -Wl,-T,linker.script

The linker commands are in linker.script in
/opt/ibm/cell-sdk/prototype/src/samples/overlay:
...
.text :
{
*( EXCLUDE_FILE(./sc.o ./sd.o ./se.o ./sf.o ./sg.o ./sh.o ./si.o)
.text .stub .text.* .gnu.linkonce.t.*)
crt1.o*(.gnu.warning)
}

OVERLAY :
{
.segment1 {./sc.o(.text)}
.segment4 {./sg.o(.text) }
}
OVERLAY :
{
.segment2 {./sd.o(.text) ./se.o(.text)}
.segment3 {./sf.o(.text)}
}
OVERLAY :
{
.segment5 {./sh.o(.text)}
.segment6 {./si.o(.text)}
}
...

Large matrix overlay sample


The large matrix overlay program uses an existing large matrix library test in
/opt/ibm/cell-sdk/prototype/src/tests/lib. The current test consists of a single
monolithic non-overlay SPU standalone program. This new sample takes the
existing program and converts it to an overlay program by providing a linker
script. No changes (such as re-compilation) are made to the current library or to
the test case code.

The updated sample consists of a single standalone SPU program, large_matrix,
which calls test functions test_index_max_abs_vec and test_solve_linear_system
amongst others. These functions are defined in the single compilation unit
large_matrix.c. A simplified structure is shown in Figure 14 on page 54 (some
functions, and some calls within a region, have been omitted for clarity). If the
test completes successfully the function returns a zero value; in other cases it
returns a non-zero value.

Figure 14. Large matrix overlay program call graph

Figure 15. Large matrix program physical structure

The physical view of this sample in Figure 15 shows three regions; one containing
a single non-overlay root segment, and two containing twelve overlay segments.

This assumes the archive library directory,
/opt/ibm/cell-sdk/prototype/sysroot/usr/spu/lib/, and the archive library,
liblarge_matrix.a, are specified to the SPU linker.

The linker flags used are:

LDFLAGS = -Wl,-T,linker.script

The linker commands are in linker.script in
/opt/ibm/cell-sdk/prototype/src/samples/overlay/large_matrix.

Note: This is a subset of all the functions in the large_matrix library. Only those
needed by the test case driver, large_matrix.c, are used in this example.

The relevant parts of the linker script are:


...
.text :
{
*( EXCLUDE_FILE(index_max_abs_vec.o*
index_max_abs_col.o*
madd_vector_vector.o*
nmsub_vector_vector.o*
solve_triangular.o*
...
madd_number_vector.o*
nmsub_number_vector.o*)
.text .stub .text.* .gnu.linkonce.t.*)
*(.gnu.warning)
}
OVERLAY {
.segment01 {index_max_abs_vec.o*(.text)}
.segment02 {index_max_abs_col.o*(.text)}
.segment03 {madd_vector_vector.o*(.text)}
.segment04 {nmsub_vector_vector.o*(.text)}
.segment05 {solve_triangular.o*(.text)}
...
}
OVERLAY {
.segment11 {madd_number_vector.o*(.text)}
.segment12 {nmsub_number_vector.o*(.text)}
...
}
...

Using the GNU SPU linker for overlays


The GNU SPU linker takes object files, object libraries, linker scripts, and
command line options as its inputs and produces a fully or partially linked object
file as its output. It is natural to control generation of overlays via a linker script as
this allows maximum flexibility in specifying overlay regions and in mapping
input files and functions to overlay segments. The linker has been enhanced so
that one or more overlay regions may be created by simply inserting multiple
OVERLAY statements in a standard script; no modification of the subsequent output
section specifications, such as setting the load address, is necessary. (It is also
possible to generate overlay regions without using OVERLAY statements by defining
loadable output sections with overlapping virtual memory address (VMA) ranges.)
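
As a sketch of that alternative (section and file names here are hypothetical),
two loadable output sections can simply be given the same virtual address with
distinct load addresses:

SECTIONS
{
  ...
  .ovly1 0x4000 : AT (0x10000) { a.o(.text) }  /* both sections share VMA 0x4000 */
  .ovly2 0x4000 : AT (0x18000) { b.o(.text) }  /* distinct load (file) addresses */
  ...
}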

On detection of overlays the linker automatically generates the data structures
used to manage them, and scans all non-debug relocations for calls to addresses
which map to overlay segments.
which map to overlay segments. Any such call, apart from those used in branch
instructions within the same section, causes the linker to generate an overlay call
stub for that address and to remap the call to branch to that stub. At execution
time these stubs call an overlay manager function which loads the overlay segment
into storage, if necessary, before branching to the final destination.

If the linker command option --extra-overlay-stubs is specified then the linker
generates call stubs for all calls within an overlay segment, even if the target
does not lie within an overlay segment (for example if it is in the root segment).
Note that a non-branch instruction referencing a function symbol in the same
section will also cause a stub to be generated; this ensures that function
addresses which escape via pointers are always remapped to a stub as well.
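
A hypothetical invocation passing this option through the compiler driver (only
the option name itself is taken from this section):

spu-gcc -o spu_main spu_main.o olay1/test.o olay2/test.o \
        -Wl,-T,linker.script -Wl,--extra-overlay-stubs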

The management data structures generated include two overlay tables in a .ovtab
section. The first of these is a table with one entry per overlay segment. This table
is read-only to the overlay manager, and should never change during execution of
the program. It has the format:
struct {
u32 vma; // SPU local store address that the section is loaded to.
u32 size; // Size of the overlay in bytes.
u32 offset; // Offset in SPE executable where the section can be found.
u32 buf; // One-origin index into the _ovly_buf_table.
} _ovly_table[];

The second table has one entry per overlay region. This table is read-write to the
overlay manager, and changes to reflect the current overlay mapping state. The
format is:
struct {
u32 mapped; // One-origin index into _ovly_table for the
// currently loaded overlay. 0 if none.
} _ovly_buf_table[];

Note: These tables, all stubs, and the overlay manager itself must reside in the root
(non-overlay) segment.

Whenever the overlay manager loads a segment into a region it updates the
mapped field in the _ovly_buf_table entry for the region with the index of the
segment's entry in the _ovly_table.
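
Putting the two tables together, the residency check and bookkeeping the manager
performs can be sketched in C roughly as follows (a simplified illustration of the
documented table usage, using the structures declared above; it is not the shipped
manager, which runs on the SPU and performs the load with DMA):

/* Ensure overlay segment 'ovly' (a one-origin index) is resident. */
static void ensure_loaded(u32 ovly)
{
    u32 buf = _ovly_table[ovly - 1].buf;        /* region owning this segment */
    if (_ovly_buf_table[buf - 1].mapped != ovly) {
        /* DMA _ovly_table[ovly - 1].size bytes from file offset
           _ovly_table[ovly - 1].offset to local store address
           _ovly_table[ovly - 1].vma ... */
        _ovly_buf_table[buf - 1].mapped = ovly; /* record the resident segment */
    }
}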

The overlay manager may be provided by the user as a library containing the
entries __ovly_load and _ovly_debug_event. (It is an error for the user to provide
_ovly_debug_event without also providing __ovly_load.) If these entries are not
provided the linker will use a built-in overlay manager which contains these
symbols in the .stub section.

Appendix A. Notices
This information was developed for products and services offered in the U.S.A.

The manufacturer may not offer the products, services, or features discussed in this
document in other countries. Consult the manufacturer’s representative for
information on the products and services currently available in your area. Any
reference to the manufacturer’s product, program, or service is not intended to
state or imply that only that product, program, or service may be used. Any
functionally equivalent product, program, or service that does not infringe any
intellectual property right of the manufacturer may be used instead. However, it is
the user’s responsibility to evaluate and verify the operation of any product,
program, or service.

The manufacturer may have patents or pending patent applications covering
subject matter described in this document. The furnishing of this document does
not give you any license to these patents. You can send license inquiries, in
writing, to the manufacturer.

The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law: THIS
INFORMATION IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may
not apply to you.

This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. The manufacturer may make
improvements and/or changes in the product(s) and/or the program(s) described
in this publication at any time without notice.

Any references in this information to Web sites not owned by the manufacturer are
provided for convenience only and do not in any manner serve as an endorsement
of those Web sites. The materials at those Web sites are not part of the materials for
this product and use of those Web sites is at your own risk.

The manufacturer may use or distribute any of the information you supply in any
way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact the manufacturer.

Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.

The licensed program described in this information and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement, IBM License Agreement for
Machine Code, or any equivalent agreement between us.

Information concerning products not produced by this manufacturer was obtained
from the suppliers of those products, their published announcements or other
publicly available sources. This manufacturer has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims
related to products not produced by this manufacturer. Questions on the
capabilities of products not produced by this manufacturer should be addressed to
the suppliers of those products.

All statements regarding the manufacturer’s future direction or intent are subject to
change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to the
manufacturer, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the
operating platform for which the sample programs are written. These examples
have not been thoroughly tested under all conditions. The manufacturer, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.

SUBJECT TO ANY STATUTORY WARRANTIES WHICH CAN NOT BE
EXCLUDED, THE MANUFACTURER, ITS PROGRAM DEVELOPERS AND
SUPPLIERS MAKE NO WARRANTIES OR CONDITIONS EITHER EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, AND NON-INFRINGEMENT, REGARDING THE PROGRAM OR
TECHNICAL SUPPORT, IF ANY.

UNDER NO CIRCUMSTANCES IS THE MANUFACTURER, ITS PROGRAM
DEVELOPERS OR SUPPLIERS LIABLE FOR ANY OF THE FOLLOWING, EVEN
IF INFORMED OF THEIR POSSIBILITY:
1. LOSS OF, OR DAMAGE TO, DATA;
2. SPECIAL, INCIDENTAL, OR INDIRECT DAMAGES, OR FOR ANY
ECONOMIC CONSEQUENTIAL DAMAGES; OR
3. LOST PROFITS, BUSINESS, REVENUE, GOODWILL, OR ANTICIPATED
SAVINGS. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR
LIMITATION OF INCIDENTAL OR CONSEQUENTIAL DAMAGES, SO THE
ABOVE LIMITATION OR EXCLUSION MAY NOT APPLY TO YOU.

If you are viewing this information in softcopy, the photographs and color
illustrations may not appear.

Edition notice
© Copyright International Business Machines Corporation 2005. All rights
reserved.

U.S. Government Users Restricted Rights — Use, duplication, or disclosure
restricted by GSA ADP Schedule Contract with IBM Corp.

Trademarks
The following terms are trademarks of International Business Machines
Corporation in the United States, other countries, or both:

alphaWorks
BladeCenter
developerWorks
IBM
POWER
Power PC®
PowerPC
PowerPC Architecture™

Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer
Entertainment, Inc., in the United States, other countries, or both, and are used
under license therefrom.

Intel, MMX, and Pentium® are trademarks of Intel Corporation in the United
States, other countries, or both.

Microsoft®, Windows®, and Windows NT® are trademarks of Microsoft Corporation
in the United States, other countries, or both.

UNIX® is a registered trademark of The Open Group in the United States and
other countries.

Java™ and all Java-based trademarks and logos are trademarks or registered
trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds in the United States, other countries, or
both.

Red Hat, the Red Hat “Shadow Man” logo, and all Red Hat-based trademarks and
logos are trademarks or registered trademarks of Red Hat, Inc., in the United
States and other countries.

XDR is a trademark of Rambus Inc. in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of
others.

Appendix B. Related documentation
All of the documentation listed in this section is available on the ISO image. The
latest versions of some documents may be available from the referenced web pages
or on your system after installing components of the SDK.

Cell/B.E. processor

There is a set of tutorial and reference documentation for the Cell/B.E. stored in
the IBM online technical library at:

http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
- Cell Broadband Engine Architecture
- Cell Broadband Engine Programming Handbook
- Cell Broadband Engine Registers
- C/C++ Language Extensions for Cell Broadband Engine Architecture
- Synergistic Processor Unit (SPU) Instruction Set Architecture
- SPU Application Binary Interface Specification
- Assembly Language Specification
- Cell Broadband Engine Linux Reference Implementation Application Binary
  Interface Specification

Cell/B.E. programming using the SDK


- SDK 2.1 Installation Guide
- SDK 2.1 Programmer’s Guide
- Cell Broadband Engine Programming Tutorial
- SIMD Math Library
- Accelerated Library Framework Programmer’s Guide and API Reference

After you have installed the SDK, you can also find the following PDFs in the
/opt/ibm/cell-sdk/prototype/docs directory:
- SDK Sample Library documentation
- IDL compiler documentation

The following documents are available as downloads from:

http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
- Cell Broadband Engine Programming Tutorial documentation
- SPE Runtime Management library documentation Version 1.2
- SPE Runtime Management library documentation Version 2.1
- SPE Runtime Management library Version 1.2 to Version 2.0 Migration Guide

IBM XL C/C++ Compiler

After you have installed the SDK, you can find the following PDFs in the
/opt/ibmcmp/xlc/8.2/doc directory.
- Getting Started with IBM XL C/C++ Compiler
- IBM XL C/C++ Compiler Language Reference
- IBM XL C/C++ Compiler Programming Guide
- IBM XL C/C++ Compiler Reference
- IBM XL C/C++ Compiler Installation Guide

IBM Full-System Simulator

After you have installed the SDK, you can also find the following PDFs in the
/opt/ibm/systemsim-cell/doc directory.
- IBM Full-System Simulator Users Guide
- IBM Full-System Simulator Command Reference
- Performance Analysis with the IBM Full-System Simulator
- IBM Full-System Simulator BogusNet HowTo

PowerPC Base

The following documents can be found on the developerWorks Web site at:

http://www.ibm.com/developerworks/eserver/library
- PowerPC Architecture Book, Version 2.02
  - Book I: PowerPC User Instruction Set Architecture
  - Book II: PowerPC Virtual Environment Architecture
  - Book III: PowerPC Operating Environment Architecture
- PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology
  Programming Environments Manual Version 2.07c

Glossary
This glossary contains terms and abbreviations used in Cell/B.E. systems.

ABI. Application Binary Interface. This is the standard that a program follows to ensure that code generated by different compilers (and perhaps linking with various, third-party libraries) will run correctly on the Cell Broadband Engine. The ABI defines data types, register use, calling conventions and object formats.

ALF. Accelerated Library Framework. This is an API that provides a set of functions to help programmers solve data parallel problems on a hybrid system. ALF supports the single-program-multiple-data (SPMD) programming style with a single program running on all accelerator elements at one time. ALF offers programmers an interface to partition data across a set of parallel processes without requiring architecturally dependent code.

atomic operation. A set of operations, such as read-write, that are performed as an uninterrupted unit.

Auto-SIMDize. To automatically transform scalar code to vector code.

Barcelona Supercomputing Center. Spanish National Supercomputing Center, supporting BladeCenter and Linux on Cell.

BSC. See Barcelona Supercomputing Center.

BE. Broadband Engine.

Broadband Engine. See CBEA.

C++. Deriving from C, C++ is an object-oriented programming language.

cache. High-speed memory close to a processor. A cache usually contains recently-accessed data or instructions, but certain cache-control instructions can lock, evict, or otherwise modify the caching of data or instructions.

call stub. A small piece of code used as a link to other code which is not immediately accessible.

CBEA. Cell Broadband Engine Architecture. A new architecture that extends the 64-bit PowerPC Architecture. The CBEA and the Cell Broadband Engine are the result of a collaboration between Sony, Toshiba, and IBM, known as STI, formally started in early 2001.

Cell/B.E. Cell Broadband Engine. See CBEA.

Cell Broadband Engine Linux application. An application running on the PPE and SPE. Each such application has one or more Linux threads and some number of SPE threads. All the Linux threads within the application share the application’s resources, including access to the SPE threads.

Cell Broadband Engine program. A PPE program with one or more embedded SPE programs.

code section. A self-contained area of code, in particular one which may be used in an overlay segment.

compiler. A program that translates a high-level programming language, such as C++, into executable code.

CPL. Common Public License.

Cycle-accurate simulation. See Performance simulation.

cycle. Unless otherwise specified, one tick of the PPE clock.

DMA. Direct Memory Access. A technique for using a special-purpose controller to generate the source and destination addresses for a memory or I/O transfer.

DMA command. A type of MFC command that transfers or controls the transfer of a memory location containing data or instructions. See MFC.

ELF. Executable and Linking Format. The standard object format for many UNIX operating systems, including Linux. Originally defined by AT&T and placed in the public domain. Compilers generate ELF files. Linkers link to files with ELF files in libraries. Systems run ELF files.

elfspe. The SPE that allows an SPE program to run directly from a Linux command prompt without needing a PPE application to create an SPE thread and wait for it to complete.

ext3. Extended file system 3. One of the file system options available for Linux partitions.

Fedora Core. An operating system built entirely from open-source software and therefore freely available. Often, but mistakenly, known as Fedora Linux.

FDPR-Pro. Feedback Directed Program Restructuring. A feedback-based post-link optimization tool.

FFT. Fast Fourier Transform.

firmware. A set of instructions contained in ROM usually used to enable peripheral devices at boot.

FSF. Free Software Foundation. Organization promoting the use of open-source software such as Linux.

FSS. IBM Full-System Simulator. IBM’s tool which simulates the Cell processor environment on other host computers.

GCC. GNU C compiler.

GDB. GNU application debugger. A modified version of gdb, ppu-gdb, can be used to debug a Cell Broadband Engine program. The PPE component runs first and uses system calls, hidden by the SPU programming library, to move the SPU component of the Cell Broadband Engine program into the local store of the SPU and start it running. A modified version of gdb, spu-gdb, can be used to debug code executing on SPEs.

GPL. GNU General Public License. Guarantees freedom to share, change and distribute free software.

GNU. GNU is Not Unix. A project to develop free Unix-like operating systems such as Linux.

graph structure. A program design in which each child segment is linked to one or more parent segments.

GUI. Graphical User Interface. User interface for interacting with a computer which employs graphical images and widgets in addition to text to represent the information and actions available to the user. Usually the actions are performed through direct manipulation of the graphical elements.

HTTP. Hypertext Transfer Protocol. A method used to transfer or convey information on the World Wide Web.

I/O device. Input/output device. From the viewpoint of software, I/O devices exist as memory-mapped registers that are accessed in main-storage space by load/store instructions.

IDE. Integrated Development Environment. Integrates the Cell/B.E. GNU tool chain, compilers, the Full-System Simulator, and other development components to provide a comprehensive, Eclipse-based development platform that simplifies Cell/B.E. development.

IDL. Interface definition language. Not the same as CORBA IDL.

ILAR. IBM International License Agreement for early release of programs.

initrd. A command file read at boot.

ISO image. Commonly a disk image which can be burnt to CD. Technically it is a disk image of an ISO 9660 file system.

kernel. The core of an operating system, which provides services for other parts of the operating system and provides multitasking. In a Linux or UNIX operating system, the kernel can easily be rebuilt to incorporate enhancements which then become operating-system wide.

K&R programming. A reference to a well-known book on programming written by Brian Kernighan and Dennis Ritchie.

L1. Level-1 cache memory. The closest cache to a processor, measured in access time.

L2. Level-2 cache memory. The second-closest cache to a processor, measured in access time. An L2 cache is typically larger than an L1 cache.

latency. The time between when a function (or instruction) is called and when it returns. Programmers often optimize code so that functions return as quickly as possible; this is referred to as the low-latency approach to optimization. Low-latency designs often leave the processor data-starved, and performance can suffer.

libspe. A SPU-thread runtime management library.

Linux. An open-source Unix-like computer operating system.

LGPL. Lesser General Public License. Similar to the GPL, but does less to protect the user’s freedom.

local store. The 256-KB local store associated with each SPE. It holds both instructions and data.

LS. See local store.

LSA. Local Store Address. An address in the local store of a SPU through which programs running in the SPU, and DMA transfers managed by the MFC, access the local store.

main memory. See main storage.

main storage. The effective-address (EA) space. It consists physically of real memory (whatever is external to the memory-interface controller, including both volatile and nonvolatile memory), SPU LSs, memory-mapped registers and arrays, memory-mapped I/O devices (all I/O is memory-mapped), and pages of virtual memory that reside on disk. It does not include caches or execution-unit register files. See also local store.

Makefile. A descriptive file used by the make command in which the user specifies: (a) target program or library, (b) rules about how the target is to be built, (c) dependencies which, if updated, require that the target be rebuilt.

Mambo. Pre-release name of the IBM Full-System Simulator. See FSS.

MASS. Mathematical Acceleration Subsystem. The MASS and MASS/V libraries contain scalar and vector operations not in the standard C math library.

MFC. Memory Flow Controller. Part of an SPE which provides two main functions: it moves data via DMA between the SPE’s local store (LS) and main storage, and it synchronizes the SPU with the rest of the processing units in the system.

MT. See multithreading.

multithreading. Simultaneous execution of more than one program thread. It is implemented by sharing one software process and one set of execution resources but duplicating the architectural state (registers, program counter, flags and associated items) of each thread.

netboot. Command to boot a device from another on the same network. Requires a TFTP server.

NUMA. Non-uniform memory access. In a multiprocessing system such as the Cell/B.E., memory is configured so that it can be shared locally, thus giving performance benefits.

overlay segment. Code that is dynamically loaded and executed by a running SPU program. A segment contains one or more code sections.

overlay region. An area of storage, with a fixed address range, into which overlay segments are loaded. A region only contains one segment at any time.

page table. A table that maps virtual addresses (VAs) to real addresses (RAs) and contains related protection parameters and other information about memory locations.

PDF. Portable document format.

Performance simulation. Simulation by the IBM Full-System Simulator for the Cell Broadband Engine in which both the functional behavior of operations and the time required to perform the operations is simulated. Also called cycle-accurate simulation.

pipelining. A technique that breaks operations, such as instruction processing or bus transactions, into smaller stages so that a subsequent stage in the pipeline can begin before the previous stage has completed.

plugin. Code that is dynamically loaded and executed by a running SPU program. Plugins facilitate code overlays.

PowerPC. Of or relating to the PowerPC Architecture or the microprocessors that implement this architecture.

PPC-64. 64-bit implementation of the PowerPC Architecture.

PowerPC Architecture. A computer architecture that is based on the third generation of RISC processors. The PowerPC architecture was developed jointly by Apple, Motorola, and IBM.

PPC. See PowerPC.

PPE. PowerPC Processor Element. The general-purpose processor in the Cell.

PPSS. PowerPC Processor Storage Subsystem. Part of the PPE. It operates at half the frequency of the PPU and includes an L2 cache and a Bus Interface Unit (BIU).

PPU. PowerPC Processor Unit. The part of the PPE that includes the execution units, memory-management unit, and L1 cache.

program section. See code section.

proxy. Allows many network devices to connect to the internet using a single IP address. Usually a single server, often acting as a firewall, connects to the internet behind which other network devices connect using the IP address of that server.

region. See overlay region.

RPM. Originally an acronym for Red Hat Package Manager; an RPM file is a packaging format for one or more files used by many Linux systems when installing software programs.

root segment. Code that is always in storage when a SPU program runs. The root segment contains overlay control sections and may also contain code sections and data areas.

Sandbox. Safe place for running programs or scripts without affecting other users or programs.

SDK. Software development toolkit. A complete package of tools for application development. The Cell/B.E. SDK includes sample software for the Cell Broadband Engine.

section. See code section.

segment. See overlay segment and root segment.

SIMD. Single Instruction Multiple Data. Processing in which a single instruction operates on multiple data elements that make up a vector data-type. Also known as vector processing. This style of programming implements data-level parallelism.

SIMDize. To transform scalar code to vector code.

SPE. Synergistic Processor Element. Extends the PowerPC 64 architecture by acting as cooperative offload processors (synergistic processors), with the direct memory access (DMA) and synchronization mechanisms to communicate with them (memory flow control), and with enhancements for real-time management. There are 8 SPEs on each Cell processor.

SPE thread. A thread scheduled and run on a SPE. A program has one or more SPE threads. Each such thread has its own SPU local store (LS), 128 x 128-bit register file, program counter, and MFC Command Queues, and it can communicate with other execution units (or with effective-address memory through the MFC channel interface).

SPU. Synergistic Processor Unit. The part of an SPE that executes instructions from its local store (LS).

spulet. 1) A standalone SPU program that is managed by a PPE executive. 2) A programming model that allows legacy C programs to be compiled and run on an SPE directly from the Linux command prompt.

tag group. A group of DMA commands. Each DMA command is tagged with a 5-bit tag group identifier. Software can use this identifier to check or wait on the completion of all queued commands in one or more tag groups. All DMA commands except getllar, putllc, and putlluc are associated with a tag group.

Tcl. Tool Command Language. An interpreted script language used to develop GUIs, application prototypes, Common Gateway Interface (CGI) scripts, and other scripts. Used as the command language for the Full System Simulator.

TFTP. Trivial File Transfer Protocol. Similar to, but simpler than, the File Transfer Protocol (FTP) but less capable. Uses UDP as its transport mechanism.

thread. A sequence of instructions executed within the global context (shared memory space and other global resources) of a process that has created (spawned) the thread. Multiple threads (including multiple instances of the same sequence of instructions) can run simultaneously if each thread has its own architectural state (registers, program counter, flags, and other program-visible state). Each SPE can support only a single thread at any one time. Multiple SPEs can simultaneously support multiple threads. The PPE supports two threads at any one time, without the need for software to create the threads. It does this by duplicating the architectural state.

TLB. Translation Lookaside Buffer. An on-chip cache that translates virtual addresses (VAs) to real addresses (RAs). A TLB caches page-table entries for the most recently accessed pages, thereby eliminating the necessity to access the page table from memory during load/store operations.

tree structure. A program design in which each child segment is linked to a single parent segment.

UDP. User Datagram Protocol. Transports data as a connectionless protocol, i.e. without acknowledgement or receipt. Fast but fragile.

vector. An instruction operand containing a set of data elements packed into a one-dimensional array. The elements can be fixed-point or floating-point values. Most Vector/SIMD Multimedia Extension and SPU SIMD instructions operate on vector operands. Vectors are also called SIMD operands or packed operands.

virtual memory. The address space created using the memory management facilities of a processor.

virtual storage. See virtual memory.

VMA. Virtual memory address. See virtual memory.

workload. A set of code samples in the SDK that characterizes the performance of the architecture, algorithms, libraries, tools, and compilers.

XDR. Rambus Extreme Data Rate DRAM memory technology.

XLC. The IBM optimizing C/C++ compiler.

x86. Generic name for Intel-based processors.

yaboot. Linux utility which is a boot loader for PowerPC-based hardware.
Index
Special characters
__ovly_load 56
_ovly_debug_event 56
--extra-overlay-stubs
  linker command 55

A
address
  load 40
archive library 54
  directory 54
  overlay manager 39

B
best practices 22
BogusNet 18
bogusnet support 37
breakpoints
  setting pending 32

C
call stub 39, 44, 55
callthru utility 17
cell-perf-counter 13
command
  linker 52, 53, 55
compiler
  changing 20
compiling and linking
  GNU tool chain 21
control statement
  linker 47

D
data
  transient 40
debugging
  architecture 29
  commands 34
  compiling with GCC 25
  compiling with XLC 25
  GDB 25
  info spu dma 35
  info spu event 35
  info spu mailbox 35
  info spu proxydma 36
  info spu signal 35
  multithreaded code 29
  pending breakpoints 32
  PPE code 26
  remotely 36
  scheduler-locking 31
  SPE code 26
  SPE registers 28
  starting remote 37
  using the combined debugger 32
directory
  archive library 54
directory structure
  libraries and samples 8
  programming sample 20
  system root 15
DMA 40
documentation 61

E
elfspe 6
example
  overlay graph structure 44

F
fast mode
  Full-System Simulator 17
FDPR-Pro 11
flags
  linker 52, 53, 54
Full-System Simulator
  callthru utility 17
  description 3
  fast mode 17
  running 16
  system root image 4, 18
  systemsim 16
function 40

G
GCC compiler 1
GNU SPU linker 55
GNU tool chain 1
  compiling and linking 21

I
IDE 13
info spu dma 35
info spu event 35
info spu mailbox 35
info spu proxydma 36
info spu signal 35
Integrated Development Environment 13

K
kernel 5

L
length of an overlay program 42
libraries and samples
  cell-perf-counter 13
  Cell/B.E. library 5
  FDPR-Pro 11
  libspe version 2.1 5
  MASS 7
  OProfile 11
  prototype 7
  SIMD math library 6
  SPU timing tool 10
  subdirectories 8
  support libraries 10
library
  archive 54
libspe
  version 2.1 5
linker 39, 44, 45, 54
  command 52, 55
  commands 53
  control statement 47
  flags 52, 53, 54
  GNU 55
  OVERLAY statement 55
  script 52, 53, 55
linker command
  --extra-overlay-stubs 55
linker statement 48
  OVERLAY 47
Linux
  kernel 5
load address 40
load point 43, 47, 48

M
makefile
  for samples 20
manager
  overlay 43, 44, 46, 56
MASS library 7

N
native debugging
  setting up 36

O
OProfile 11
  restrictions 13
  SPU profiling restrictions 12
  SPU report anomalies 12
origin
  segment 43, 44, 46
overlay 39
  graph structure example 44
  large matrix sample 53
  manager 43, 44, 46, 51, 56
  manager library 39
  manager user 56
  overview sample 52
  processing 43
  program length 42
  region 40, 45, 51, 53, 54, 55
  region size 40
  region table 44, 56
  restriction 40
  segment 39, 45, 51, 53, 54, 55
  segment table 44, 56
  simple sample 49
  SPU program specification 47
  table 39, 56
  tree structure example 41
OVERLAY
  linker statement 47, 48
  statement 55

P
performance
  support libraries 10
ppu-gdb 26
ppuxlc 2
ppuxlc++ 2
processor 19
  architecture 19
  compiler support 19
  configuring the simulator 19
  for the future 19
programming sample
  building and running 21
  compiler 20
  directory structure 20
programs
  debugging 26

R
region 44, 47, 50, 51
  overlay 40, 44, 45, 51, 53, 54, 55
  overlay table 44, 56
remote debugging
  setting up 36
root segment 40, 44, 48, 51, 52, 54, 55
  address 42

S
sample
  large matrix overlay 53
  overview overlay 52
  simple overlay 49
sandbox
  setting up 22
scheduler-locking 31
script
  linker 52, 53, 55
  systemsim 16
SDK
  how to use vi
  overlay samples 49
SDK documentation 61
section 47
segment 47, 50
  overlay 39, 45, 51, 53, 54, 55
  overlay table 44, 56
  root 40, 44, 48, 51, 52, 54, 55
segment origin 43, 44, 46
setting up
  native debugging 36
  remote debugging 36
shared development environment 22
SIMD math library 6
simulator
  BogusNet 18
  configuring the processor 19
  enabling xclient 18
  symmetric multiprocessing 18
SPE executable
  size 41
SPE registers 28
SPE Runtime Management Library
  version 2.1 5
specification
  SPU overlay program 47
SPU
  overlay program specification 47
  thread 49
SPU timing tool 10
spu_main 50, 51, 52
spu-gdb 26
  SPE registers 28
spuxlc 2
spuxlc++ 2
statement
  OVERLAY 55
support libraries 10
supported platforms vi
symmetric multiprocessing support 18
sysroot
  Full-System Simulator 18
system root
  directory 15
  image 4

T
table
  overlay 39, 56
  overlay region 44, 56
  overlay segment 44, 56
thread
  SPU 49
TLB file system
  configuring 21
trademarks 60
transient data 40

U
user overlay manager 56
utility
  callthru 17

V
virtual memory address (VMA) 55

X
xclient
  enabling from simulator 18
XL C/C++ compiler 2
  commands 2
  optimization levels 2