ALF Prog Guide API v1.1
Accelerated Library Framework for Cell Broadband Engine Programmers Guide and API Reference Version 1.1 DRAFT
SC33-8333-02
Note Before using this information and the product it supports, read the information in Notices on page 147.
Edition notice This edition applies to Beta Version of the Software Development Kit for Multicore Acceleration Version 3.0 (program number 5724-S84) and to all subsequent releases and modifications until otherwise indicated in new editions. This edition replaces SC33-8333-01. Copyright International Business Machines Corporation 2006, 2007 - DRAFT. All rights reserved. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
About this publication . . . . . . . . v
   How to send your comments . . . . . . . . v

Chapter 10. Performance and debug trace . . . . . . . . 37
Chapter 11. Trace control . . . . . . . . 39

Chapter 13. Platform-specific constraints for the ALF implementation on Cell BE architecture . . . . . . . . 45
   Local memory constraints . . . . . . . . 45
   Data transfer list limitations . . . . . . . . 46
Appendix C. ALF trace events . . . . . . . . 133
Appendix D. Attributes and descriptions . . . . . . . . 137
Appendix E. Error codes and descriptions . . . . . . . . 141
Appendix F. Related documentation . . . . . . . . 143
Appendix G. Accessibility features . . . . . . . . 145

Notices . . . . . . . . 147
   Trademarks . . . . . . . . 149
   Terms and conditions . . . . . . . . 150
Related information
See Appendix F, Related documentation, on page 143.
ALF tasks
The ALF design enables a separation of work. There are three distinct types of task within a given application:

Application
   You develop programs only at the host level. You can use the provided accelerated libraries without direct knowledge of the inner workings of the underlying system.

Accelerated library
   You use the ALF APIs to provide the library interfaces to invoke the computational kernels on the accelerators. You divide the problem into the control process, which runs on the host, and the computational kernel, which runs on the accelerators. You then partition the input and output into work blocks, which ALF can schedule to run on different accelerators.

Computational kernel
   You write optimized accelerator code at the accelerator level. The ALF API provides a common interface for the compute task to be invoked automatically by the framework.
Figure 1. Overview of ALF (the ALF runtime on the accelerator node runs compute tasks, such as Compute Task A and Compute Task B, behind the accelerator API)
Chapter 4. Concepts
The following sections explain the main concepts and terms used in ALF. It covers the following topics:
v Computational kernel
v Task on page 10
v Task descriptor on page 10
v Work blocks on page 12
v Data partitioning on page 16
v Accelerator buffer management on page 18
v Using work blocks and order of function calls per task instance on the accelerator on page 20
v Modifying the work block parameter and context buffer when using multi-use work blocks on page 22
v Data set on page 22
v Error handling on page 23
Computational kernel
A computational kernel is a user-defined accelerator routine that takes a given set of input data and returns the output data based on the given input. You should implement the computational kernel according to the function prototype definitions with the data in the provided buffers (see Accelerator buffer management on page 18). The computational kernel must then be registered to the ALF runtime when the corresponding task descriptor is created. The computational kernel is usually accompanied by four other auxiliary functions. Together, the five functions form a 5-tuple for a task:
{ alf_accel_comp_kernel, alf_accel_input_dtl_prepare, alf_accel_output_dtl_prepare, alf_accel_task_context_setup, alf_accel_task_context_merge }
Note: The above accelerator function names are used as conventions for this document only. You can provide your own function name for each of these functions and register the function name through the task descriptor service. Based on the different application requirements, some of the elements in this 5-tuple can be NULL. For more information about the APIs that define computational kernels, see User-provided computational kernel APIs on page 94.
Task descriptor
A task descriptor contains all the relevant task descriptions. To maximize accelerator performance, ALF employs a static memory allocation model per task execution on the accelerator. This means that ALF requires you to provide information about buffers, stack usage, and the number of data transfer list entries ahead of time. As well as accelerator memory usage information, the task descriptor also contains information about the names of the different user-defined accelerator functions and the data partition attribute.

The following information is used to define a task descriptor:
v Task context description
      Task context buffer size
      Task context entries: entry size, entry type
v Accelerator executable image that contains the computational kernel:
      The name of the accelerator computational kernel function
      Optionally, the name of the accelerator input data transfer list prepare function
      Optionally, the name of the accelerator output data transfer list prepare function
      Optionally, the name of the accelerator task context setup function
      Optionally, the name of the accelerator task context merge function
v Work block parameter context buffer size
v Work block input buffer size
v Work block output buffer size
v Work block overlapped buffer size
v Work block number of data transfer list entries
v Task data partition attribute:
      Partition on accelerator
      Partition on host
v Accelerator stack size

For more information about the compute task APIs, see Compute task API on page 62.
Task
A task is defined as a ready-to-be-scheduled instantiation of a task descriptor. You use the num_instances parameter in the task creation function (alf_task_create) either to explicitly request a number of accelerators or to let the ALF runtime decide the necessary number of accelerators to run the compute task. You can also provide the data for the context buffer associated with this particular task, and you can register an event handler to monitor different task events, see Task events on page 12.

After you have created a task, you can create work blocks and enqueue the work blocks on to the working queue associated with the task. The ALF framework employs an immediate runtime mode. After a work block has been enqueued, if the task has no unresolved dependency on other tasks, the task is scheduled to process the work blocks.
For information about work blocks, see Work blocks on page 12.
Task finalize
After you have finished adding work blocks to the work queue, you must call the alf_task_finalize function to notify ALF that there are no more work blocks for this particular task. A task that is not finalized cannot be run to the completion state.
Task instance
A task can be scheduled to run on multiple accelerators. Each task running on an accelerator is a task instance. If a task is created without the ALF_TASK_ATTR_SCHED_FIXED attribute, the ALF runtime can load and unload an instance of a task to and from an accelerator anytime. The ALF runtime posts an event after a task instance is started on an accelerator or unloaded from an accelerator. You can choose to register an event handler for this event, see Task events on page 12.
2. Set the ALF_TASK_ATTR_SCHED_FIXED task attribute In this case, the runtime makes sure all the task instances are started before work blocks are assigned to them.
Task context
Note: For more information, refer to Task context buffer on page 18. A task context is used to address the following usage scenarios:
Task events
The ALF framework provides notifications for the following task events:
v ALF_TASK_EVENT_READY - the task is ready to be scheduled
v ALF_TASK_EVENT_FINISHED - the task has finished running
v ALF_TASK_EVENT_FINALIZED - all the work blocks for the task have been enqueued (alf_task_finalize has been called)
v ALF_TASK_EVENT_INSTANCE_START - one new instance of the task starts
v ALF_TASK_EVENT_INSTANCE_END - one instance of the task ends
v ALF_TASK_EVENT_DESTROY - the task is destroyed explicitly

For information about how to set event handling, see alf_task_event_handler_register.
Work blocks
A work block represents an invocation of a task with a specific set of related input data, output data, and parameters. The input and output data are described by corresponding data transfer lists. The parameters are provided through the ALF APIs. Depending on the application, the data transfer list can either be generated on the host (host data partition) or by the accelerators (accelerator data partition).
As the ALF accelerator runtime processes a work block, before it calls the compute task it retrieves the parameters and the input data, based on the input data transfer list, from host memory into the input buffer. After it has invoked the computational kernel, the ALF accelerator runtime puts the output results back into host memory. The ALF accelerator runtime manages the memory of the accelerator to accommodate the work block's input and output data. The ALF accelerator runtime also transparently supports overlapping data transfers and computations through double buffering techniques if there is enough free memory.
[Figures: work blocks WB1-WB9 are enqueued with alf_wb_enqueue and assigned across Task Instances 1-3.]
Data partitioning
An important part of solving data parallel problems using multiple accelerators is figuring out how to partition data across the accelerators. The ALF API does not automatically partition data; however, it does provide a framework so that you can systematically partition the data. The ALF API provides the following data partition methods:
v Host data partitioning on page 17
v Accelerator data partitioning on page 17

These methods are described in the following sections together with Figure 5 on page 17.
Figure 5. Data transfer list
To maximize accelerator performance, ALF employs a static memory allocation model per task execution on the accelerator. This means programmers need to explicitly specify the maximum number of entries a data transfer list in a task can have. This can be set through the alf_task_desc_set_int32 function with the ALF_TASK_DESC_NUM_DTL_ENTRIES key. For information about data transfer list limitations for Cell BE implementations, see Data transfer list limitations on page 46.
Buffer types
The ALF accelerator runtime code provides handles to the following different buffers for each instance of a task:
v Task context buffer
v Work block parameter and context buffer on page 19
v Work block input data buffer on page 20
v Work block output data buffer on page 20
v Work block overlapped input and output data buffer on page 20
After a compute task has been scheduled to be run on the accelerators, the ALF framework creates private copies of the task context for the task instance that is running. You can provide a function to initialize the task context (alf_accel_task_context_setup) on the accelerator. The ALF runtime invokes this function when the running task instance is first loaded on an accelerator as shown in Figure 6 (a). All work blocks that are processed by one task instance share the same private copy of the task context on that accelerator as shown in Figure 6 (b). When the ALF scheduler requests an accelerator to unload a task instance, you can provide a merge function (alf_accel_task_context_merge), which is called by the runtime, to merge that accelerator's task context with an active task context on another accelerator as shown in Figure 6 (c). When a task is shut down and all instances of the task are destroyed, the runtime automatically calls the merge function on the task instances to merge all of the private copies of the task context into a single task context and write the final result to the task context in host memory provided when the task is created, as shown in Figure 6 (d).
[Figure 6. Task context usage: (a) context setup at instance load, (b) work blocks sharing one instance's private context, (c) context merge when an instance is unloaded, (d) final merge when the task shuts down.]
For more information, see Modifying the work block parameter and context buffer when using multi-use work blocks on page 22.
Using work blocks and order of function calls per task instance on the accelerator
Based on the characteristics of an application, you can use single-use work blocks or multi-use work blocks to efficiently implement data partitioning on the accelerators. For a given task that can be partitioned into N work blocks, the following describes how the different types of work blocks can be used, and also the order of function calls per task instance, based on a single instance of the task on a single accelerator:
1. Task instance initialization (this is done by the ALF runtime)
2. Conditional execute: alf_accel_task_context_setup. This is only called if the task has context. The runtime calls it when the initial task context data has been loaded to the accelerator and before any work blocks are processed.
3. For each work block WB(k):
   a. If there are pending context merges, go to Step 4.
   b. For each iteration i < N of a multi-use work block (where N is the total number of iterations):
      1) alf_accel_input_list_prepare(WB(k), i, N): only called when the task requires accelerator data partition.
      2) alf_accel_comp_kernel(WB(k), i, N): the computational kernel is always called.
      3) alf_accel_output_list_prepare(WB(k), i, N): only called when the task requires accelerator data partition.
4. Conditional execute: alf_accel_task_context_merge. This API is only called when the context of another unloaded task instance is to be merged into the current instance.
   a. If there are pending work blocks, go to Step 3.
5. Write out the task context.
6. Unload the image or pend for the next scheduling.
   a. If a new task instance is created, go to Step 2.

For Step 3, the calling order of the three function calls is defined by the following rules:
v For a specific single-use work block WB(k), the following calling order is guaranteed:
   1. alf_accel_input_list_prepare(WB(k))
   2. alf_accel_comp_kernel(WB(k))
   3. alf_accel_output_list_prepare(WB(k))
v For two single-use work blocks that are assigned to the same task instance in the order WB(k) and WB(k+1), ALF only guarantees the following calling orders:
   alf_accel_input_list_prepare(WB(k)) is called before alf_accel_input_list_prepare(WB(k+1))
   alf_accel_comp_kernel(WB(k)) is called before alf_accel_comp_kernel(WB(k+1))
   alf_accel_output_list_prepare(WB(k)) is called before alf_accel_output_list_prepare(WB(k+1))
v A multi-use work block WB(k,N) is considered as N single-use work blocks assigned to the same task instance in the order of incrementing iteration index: WB(k,0), WB(k,1), ..., WB(k,N-1). The only difference is that all these work blocks share the same work block parameter and context buffer. Other than that, the API calling order is still decided by the previous two rules. See Modifying the work block parameter and context buffer when using multi-use work blocks on page 22.
Modifying the work block parameter and context buffer when using multi-use work blocks
The work block parameter and context buffer of a multi-use work block is shared by multiple invocations of the alf_accel_input_dtl_prepare accelerator function and the alf_accel_output_dtl_prepare accelerator function. Take care when you change the contents of this buffer. Because the ALF runtime does double buffering transparently, it is possible that the current_count arguments for succeeding calls to the alf_accel_input_dtl_prepare function, the alf_accel_comp_kernel function, and the alf_accel_output_dtl_prepare function are not strictly incremented when a multi-use work block is processed. Because of this, modifying the parameter and context buffer according to the current_count in one of the subroutines can cause unexpected effects to other subroutines when they are called with different current_count values at a later time.
Data set
An ALF data set is a logical set of data buffers. A data set informs the ALF runtime about the set of all data to which the task's work blocks refer. The ALF runtime uses this information to optimize how data is moved from the host's memory to the accelerators' memory and back.

You set up a data set independently of tasks or work blocks using the alf_dataset_create and alf_dataset_buffer_add functions. Before enqueuing the first work block, you must associate the data set with one or more tasks using the alf_task_dataset_associate function. As work blocks are enqueued, they are checked against the associated data set to ensure they reside within one of the buffers. Finally, after finishing with the data set, you destroy it by using the alf_dataset_destroy function.

A data set can have a set of data buffers associated with it. A data buffer can be identified as read-only, write-only, or read and write. You can add as many data buffers to the data set as needed. Different ALF implementations can choose to limit the number of data buffers in a specific data set; refer to the implementation documentation for restrictions on the number of data buffers in a data set. However, after a data set has been associated with a task, you cannot add additional data buffers to the data set.

A task can optionally be associated with one and only one data set. Work blocks within this task refer to data within the data set for input, output, and in-out buffers. References to work block input and output data that are outside of the data set result in an error. The task context buffer and work block parameter buffer do not need to reside within the data set and are not checked against it.

Multiple tasks can share the same data set. It is your responsibility to make sure that the data in the data set is used correctly. If two tasks with no dependency on each other use the same data from the same data set, ALF cannot guarantee the consistency of the data. For tasks with a dependency on each other and which use the same data set, the data set is updated in the order in which the tasks are run. Although data sets are optional for host data partitioning, it is recommended that you use them. For accelerator data partitioning, you must create and use data sets.
Error handling
ALF supports a limited capability to handle runtime errors. Upon encountering an error, the ALF runtime tries to free up resources, then exits by default. To allow accelerated library developers to handle errors in a more graceful manner, you can register a callback error handler function to the ALF runtime. Depending on the type of error, the error handler function can direct the ALF runtime to retry the current operation, stop the current operation, or shut down. These behaviors are controlled by the return values of the callback error handler function. When several errors happen in a short time or at the same time, the ALF runtime attempts to invoke the error handler in sequential order.

Possible runtime errors include the following:
v Compute task runtime errors such as bus errors, undefined computing kernel function names, invalid task execution images, memory allocation issues, deadlocks, and others
v Detectable internal data structure corruption errors, which might be caused by improper data transfers or access boundary issues
v Application detectable/catchable errors

Standard error codes on supported platforms are used for return values when an error occurs. For this implementation, the standard C/C++ header file, errno.h, is used. See Appendix E, Error codes and descriptions, on page 141 and also the API definitions in Chapter 14, ALF API overview, on page 51 for a list of possible error codes.
[Figure: host/accelerator flow - host-side initialization and wait task; the ALF runtime on the accelerator prepares the input DTL and runs the compute kernel.]
If you choose to partition data on the accelerator, you need to generate the data transfer lists for the input buffer, the overlapped input buffer, and the overlapped I/O buffer in the user-provided alf_accel_input_dtl_prepare function and generate the data transfer lists for both the output buffer and the overlapped output buffer in the user-provided alf_accel_output_dtl_prepare function.
[Figure 9. Double buffering inside ALF: the four-buffer scheme, the three-buffer scheme (Buf0-Buf2), and the overlapped I/O buffer scheme (Buf0-Buf1), showing input (I), compute (C), and output (O) phases overlapping in time.]
See Figure 9 for an illustration of how double buffering works inside ALF. The ALF runtime evaluates each work block and decides which buffering scheme is most efficient. At each decision point, if the conditions are met, that buffering scheme is used. The ALF runtime first checks if the work block uses the overlapped I/O buffer. If the overlapped I/O buffer is not used, the ALF runtime next checks the conditions for the four-buffer scheme, then the conditions of the three-buffer scheme. If the conditions for neither scheme are met, the ALF runtime does not use double buffering. If the work block uses the overlapped I/O buffer, the ALF runtime first checks the conditions for the overlapped I/O buffer scheme, and if those conditions are not met, double buffering is not used.

These examples use the following assumptions:
1. All SPUs have 256 KB of local memory.
2. 16 KB of memory is used for code and runtime data including stack, the task context buffer, and the data transfer list. This leaves 240 KB of local storage for the work block buffers.
3. Transferring data in or out of accelerator memory takes one unit of time and each computation takes two units of time.
4. The input buffer size of the work block is represented as in_size, the output buffer size as out_size, and the overlapped I/O buffer size as overlap_size.
5. There are three computations to be done on three inputs, which produces three outputs.
Buffer schemes
The conditions and decision tree are further explained in the examples below.

v Four-buffer scheme: In the four-buffer scheme, two buffers are dedicated for input data and two buffers are dedicated for output data. This buffer use is shown in the Four-buffer scheme section of Figure 9.
   Conditions satisfied: The ALF runtime chooses the four-buffer scheme if the work block does not use the overlapped I/O buffer and the buffer sizes satisfy the following condition: 2*(in_size + out_size) <= 240 KB.
   Conditions not satisfied: If the buffer sizes do not satisfy the four-buffer scheme condition, the ALF runtime checks if the buffer sizes satisfy the conditions of the three-buffer scheme.
v Three-buffer scheme: In the three-buffer scheme, the buffer is divided into three equally sized buffers of the size max(in_size, out_size). The buffers in this scheme are used for both input and output as shown in the Three-buffer scheme section of Figure 9. This scheme requires the output data movement of the previous result to be finished before the input data movement of the next work block starts, so the DMA operations must be done in order. The advantage of this approach is that for a specific work block, if the input and output buffer are almost the same size, the total effective buffer size can be 2*240/3 = 160 KB.
   Conditions satisfied: The ALF runtime chooses the three-buffer scheme if the work block does not use the overlapped I/O buffer and the buffer sizes satisfy the following condition: 3*max(in_size, out_size) <= 240 KB.
   Conditions not satisfied: If the conditions are not satisfied, the single-buffer scheme is used.
v Overlapped I/O buffer scheme: In the overlapped I/O buffer scheme, two contiguous buffers are allocated as shown in the Overlapped I/O buffer scheme section of Figure 9. The overlapped I/O buffer scheme requires the output data movement of the previous result to be finished before the input data movement of the next work block starts.
   Conditions satisfied: The ALF runtime chooses the overlapped I/O buffer scheme if the work block uses the overlapped I/O buffer and the buffer sizes satisfy the following condition: 2*(in_size + overlap_size + out_size) <= 240 KB.
   Conditions not satisfied: If the conditions are not satisfied, the single-buffer scheme is used.
v Single-buffer scheme: If none of the cases outlined above can be satisfied, double buffering is not used, but performance might not be optimal.

When creating buffers and data partitions, remember the conditions of these buffering schemes. If your buffer sizes can meet the conditions required for double buffering, it can result in a performance gain, but double buffering does not double the performance in all cases. When the time periods required by data movements and computation are significantly different, the problem becomes either I/O-bound or computing-bound. In this case, enlarging the buffers to allow more data for a single computation might improve the performance even with a single buffer.
Environment variable
PDT supports an environment variable (PDT_CONFIG_FILE) that allows you to specify the relative or full path to a configuration file. ALF ships an example configuration file that lists all of the ALF groups and events, and allows the user to turn selected ones off as desired. This file is shipped as /usr/share/pdt/config/pdt_alf_config_cell.xml.
Chapter 13. Platform-specific constraints for the ALF implementation on Cell BE architecture
This section describes constraints that apply when you program ALF for Cell BE.
[Figures: SPE local storage layout from 0x00000 to 0x3FFFF - text/code and global data at the bottom, the work block data buffer above them, and the runtime stack below a reserved area at the top.]
The ALF runtime does NOT help you to automatically deal with alignment issues. An ALF_ERR_INVAL error is returned if there is an unaligned address. The same limitation also applies to the offset_to_accel_buf parameter of alf_wb_dtl_begin.
ALF_NULL_HANDLE
The constant ALF_NULL_HANDLE is used to indicate a non-initialized handle in the ALF runtime environment. All handles should be initialized to this value to avoid ambiguity in code semantics.
ALF runtime APIs that create handles always return results through pointers to handles. After the API call is successful, the original content of the handle is overwritten. Otherwise, the content is kept unchanged. ALF runtime APIs that destroy handles modify the contents of handle pointers and initialize the contents to ALF_NULL_HANDLE.
ALF_STRING_TOKEN_MAX
This constant defines the maximum allowed length of string tokens in units of bytes, excluding the trailing zero. These string tokens are used in ALF as identifiers of function names or for other purposes. Currently, this value is defined to be 251 bytes.
alf_handle_t
This data structure is used as a reference to one instance of the ALF runtime. The data structure is initialized by calling the alf_init API call and is destroyed by alf_exit.
ALF_ERR_POLICY_T
NAME
ALF_ERR_POLICY_T - Callback function prototype that can be registered to the ALF runtime for customized error handling.
SYNOPSIS
ALF_ERR_POLICY_T(*alf_error_handler_t)(void *p_context_data, int error_type, int error_code, char *error_string)
Parameters

p_context_data [IN]
   A pointer given to the ALF runtime when the error handler is registered. The ALF runtime passes it to the error handler when the error handler is invoked. The error handler can use this pointer to keep its private data.
error_type [IN]
   A system-wide definition of error type codes, including the following:
   v ALF_ERR_FATAL: Cannot continue, the framework must shut down.
   v ALF_ERR_EXCEPTION: You can choose to retry or skip the current operation.
   v ALF_ERR_WARNING: You can choose to continue by ignoring the error.
error_code [IN]
   A type-specific error code.
error_string [IN]
   A C string that holds a printable text string that provides information about the error.
DESCRIPTION
This is a callback function prototype that can be registered to the ALF runtime for customized error handling.
RETURN VALUE
ALF_ERR_POLICY_RETRY
   Indicates that the ALF runtime should retry the operation that caused the error. If a severe error occurs and the ALF runtime cannot retry this operation, it reports an error and shuts down.
ALF_ERR_POLICY_SKIP
   Indicates that the ALF runtime should stop the operation that caused the error and continue processing. If the error is severe and the ALF runtime cannot continue, it reports an error and shuts down.
ALF_ERR_POLICY_ABORT
   Indicates that the ALF runtime must stop the operations and shut down.
ALF_ERR_POLICY_IGNORE
   Indicates that the ALF runtime ignores the error and continues. If the error is severe and the ALF runtime cannot continue, it reports an error and shuts down.
EXAMPLES
See alf_register_error_handler on page 61 for an example of this function.
alf_init
NAME
alf_init - Initializes the ALF runtime.
SYNOPSIS
int alf_init(void* p_sys_config_info, alf_handle_t* p_alf_handle);
Parameters

p_sys_config_info [IN]
   A platform-dependent configuration information placeholder so that the ALF can get the necessary data for system configuration information. This parameter should point to a sys_config_info_CBEA_t data structure. This data structure is defined as follows:

   typedef struct {
       char* library_path;
   } alf_sys_config_t_CBEA_t;

p_alf_handle [OUT]
   A pointer to a handle for a data structure that represents the ALF runtime. This buffer is initialized with proper data if the call is successful. Otherwise, the content is not modified.
DESCRIPTION
This function initializes the ALF runtime. It allocates the necessary resources and global data for ALF, and sets up any platform-specific configurations.
RETURN VALUE
>= 0
   Successful
less than 0
   Errors:
   v ALF_ERR_INVAL: Invalid input parameter
   v ALF_ERR_NODATA: Some system configuration data is not available
   v ALF_ERR_NOMEM: Out of memory or some system resources have been used up
   v ALF_ERR_GENERIC: Generic internal errors
OPTIONS
Field: library_path
   The path to all of the application's computational kernel shared object files. If the pointer is NULL, the ALF_LIBRARY_PATH environment variable is checked, and if it is defined then it is used. If neither is set, the default . (the current directory) is used.
alf_query_system_info
NAME
alf_query_system_info - Queries basic configuration information.
SYNOPSIS
int alf_query_system_info(alf_handle_t alf_handle, ALF_QUERY_SYS_INFO_T query_info, ALF_ACCEL_TYPE_T accel_type, unsigned int * p_query_result);
Parameters

alf_handle [IN]
   Handle to the ALF runtime.
query_info [IN]
   A query identification that indicates the item to be queried:
   v ALF_QUERY_NUM_ACCEL: Returns the number of accelerators in the system.
   v ALF_QUERY_HOST_MEM_SIZE: Returns the memory size of control nodes up to 4T bytes, in units of kilobytes (2^10 bytes). When the size of memory is more than 4T bytes, the total reported memory size is (ALF_QUERY_HOST_MEM_SIZE_EXT*4T + ALF_QUERY_HOST_MEM_SIZE*1K) bytes. On systems where virtual memory is supported, this should be the maximum size of one contiguous memory block that a single user space application could allocate.
   v ALF_QUERY_HOST_MEM_SIZE_EXT: Returns the memory size of control nodes, in units of 4T bytes (2^42 bytes).
   v ALF_QUERY_ACCEL_MEM_SIZE: Returns the memory size of accelerator nodes up to 4T bytes, in units of kilobytes (2^10 bytes). When the size of memory is more than 4T bytes, the total reported memory size is (ALF_QUERY_ACCEL_MEM_SIZE_EXT*4T + ALF_QUERY_ACCEL_MEM_SIZE*1K) bytes. On systems where virtual memory is supported, this should be the maximum size of one contiguous memory block that a single user space application could allocate.
   v ALF_QUERY_ACCEL_MEM_SIZE_EXT: Returns the memory size of accelerator nodes, in units of 4T bytes (2^42 bytes).
   v ALF_QUERY_HOST_ADDR_ALIGN: Returns the basic requirement of memory address alignment on the control node side, as an exponent of 2. A zero stands for a byte-aligned address; a 4 means alignment on 16-byte boundaries.
   v ALF_QUERY_ACCEL_ADDR_ALIGN: Returns the basic requirement of memory address alignment on the accelerator node side, as an exponent of 2. A zero stands for a byte-aligned address; an 8 means alignment on 256-byte boundaries.
   v ALF_QUERY_DTL_ADDR_ALIGN: Returns the address alignment of data transfer list entries, as an exponent of 2. A zero stands for a byte-aligned address; an 8 means alignment on 256-byte boundaries.
   v ALF_QUERY_ACCEL_ENDIAN_ORDER: Returns ALF_ENDIAN_ORDER_BIG or ALF_ENDIAN_ORDER_LITTLE.
   v ALF_QUERY_HOST_ENDIAN_ORDER: Returns ALF_ENDIAN_ORDER_BIG or ALF_ENDIAN_ORDER_LITTLE.
accel_type [IN]
   Accelerator type. There is only one accelerator type defined, which is ALF_ACCEL_TYPE_SPE.
p_query_result [OUT]
   Pointer to a buffer where the return value of the query is saved. If the query fails, the result is undefined. If a NULL pointer is provided, the query value is not returned, but the call returns zero.
DESCRIPTION
This function queries basic configuration information for the specific system on which ALF is running.
RETURN VALUE
0
    Successful. The result of the query is returned through p_query_result if that pointer is not NULL.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Unsupported query
    - ALF_ERR_BADF: Invalid ALF handle
    - ALF_ERR_GENERIC: Generic internal errors
alf_num_instances_set
NAME
alf_num_instances_set - Sets the maximum total number of parallel task instances ALF can have at one time.
SYNOPSIS
int alf_num_instances_set(alf_handle_t alf_handle, unsigned int number_of_instances);
Parameters

alf_handle [IN]
    A handle to the ALF runtime code.
number_of_instances [IN]
    Specifies the maximum number of task instances that the caller wants to have. When this parameter is zero, the runtime allocates as many task instances as requested by the application. However, a subsequent alf_task_create call returns an error if ALF cannot accommodate the request.
DESCRIPTION
This function sets the maximum total number of parallel task instances ALF can have at one time. If number_of_instances is zero, no limit is set by the application, and ALF returns an error if it cannot accommodate a particular task creation request with a large number of instances. Note: In SDK 3.0, this function can be called only once, after alf_init and before any alf_task_create call. Calling this function a second time to reset the number of instances is not supported; ALF_ERR_PERM is returned in this situation.
RETURN VALUE
>0
    The actual number of instances provided by the ALF runtime.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid ALF handle
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_GENERIC: Generic internal errors
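A minimal sketch of the SDK 3.0 call order, assuming a handle obtained from a prior alf_init call and the SDK's alf.h host header; the error-handling helper is hypothetical:

```c
#include <alf.h>   /* assumed ALF host API header from the Cell SDK */

/* Sketch: reserve up to 8 parallel task instances right after
 * initialization and before any alf_task_create call. */
static int setup_instances(alf_handle_t alf_handle)
{
    int n = alf_num_instances_set(alf_handle, 8);
    if (n < 0) {
        /* e.g. ALF_ERR_PERM if a task was already created */
        return n;
    }
    /* n is the actual number of instances the runtime provided,
     * which may be fewer than the 8 requested. */
    return n;
}
```
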
alf_exit
NAME
alf_exit - Shuts down the ALF runtime.
SYNOPSIS
int alf_exit(alf_handle_t alf_handle, ALF_EXIT_POLICY_T policy, int timeout);
Parameters

alf_handle [IN]
    The ALF handle.
policy [IN]
    Defines the shutdown behavior:
    - ALF_EXIT_POLICY_FORCE: Performs a shutdown immediately and stops all unfinished tasks, if there are any.
    - ALF_EXIT_POLICY_WAIT: Waits for all tasks to be processed and then shuts down.
    - ALF_EXIT_POLICY_TRY: Returns with a failure if there are unfinished tasks.
timeout [IN]
    A timeout value with the following meanings:
    - > 0: Waits at most the specified number of milliseconds before a timeout error happens or a forced shutdown occurs.
    - = 0: Shuts down or returns without waiting.
    - less than 0: Waits forever; only valid with ALF_EXIT_POLICY_WAIT.
DESCRIPTION
This function shuts down the ALF runtime. It frees allocated accelerator resources and stops all running or pending work queues and tasks, depending on the policy parameter.
RETURN VALUE
>= 0
    The shutdown succeeded. The number of unfinished work blocks is returned.
less than 0
    The shutdown failed:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid ALF handle
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_NOSYS: The required policy is not supported
    - ALF_ERR_TIME: Timeout
    - ALF_ERR_BUSY: There are tasks still running
    - ALF_ERR_GENERIC: Generic internal errors
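The three policies can be combined into an escalating shutdown sequence. This is a sketch, not a prescribed pattern: it assumes the runtime remains usable after a failed ALF_EXIT_POLICY_TRY or timed-out ALF_EXIT_POLICY_WAIT call, and that alf.h is the host header:

```c
#include <alf.h>   /* assumed ALF host API header */

/* Sketch: attempt a graceful shutdown first, then escalate.
 * POLICY_TRY fails if work is still pending; POLICY_WAIT with a
 * positive timeout bounds the wait; POLICY_FORCE always shuts down. */
static void shutdown_alf(alf_handle_t h)
{
    if (alf_exit(h, ALF_EXIT_POLICY_TRY, 0) >= 0)
        return;                                /* nothing was pending */
    if (alf_exit(h, ALF_EXIT_POLICY_WAIT, 5000) >= 0)
        return;                                /* drained within 5 s */
    alf_exit(h, ALF_EXIT_POLICY_FORCE, 0);     /* stop remaining tasks */
}
```
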
alf_register_error_handler
NAME
alf_register_error_handler - Registers a global error handler function to the ALF runtime code.
SYNOPSIS
int alf_register_error_handler(alf_handle_t alf_handle, alf_error_handler_t error_handler_function, void *p_context)
Parameters

alf_handle [IN]
    A handle to the ALF runtime code.
error_handler_function [IN]
    A pointer to the user-defined error handler function. A NULL value resets the error handler to the ALF default handler.
p_context [IN]
    A pointer to the user-defined context data for the error handler function. This pointer is passed to the user-defined error handler function when it is invoked.
DESCRIPTION
This function registers a global error handler function to the ALF runtime code. If an error handler has already been registered, the new one replaces it.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid ALF handle
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_FAULT: Invalid buffer or error handler address (only when it is possible to detect the fault)
    - ALF_ERR_GENERIC: Generic internal errors
alf_task_handle_t
This data structure is a handle to a specific compute task running on the accelerators. It is created by calling the alf_task_create function and destroyed either by calling the alf_task_destroy function or when the alf_exit function is called. Call the alf_task_wait function to wait for the task to finish processing all queued work blocks. The alf_task_wait API is also an indication to the ALF runtime that no new work blocks will be added to the work queue of the corresponding task in the future.
alf_task_desc_handle_t
This data structure is a handle to a task descriptor. It is used to access and set up task descriptor information. It is created by calling alf_task_desc_create and destroyed by calling alf_task_desc_destroy.
alf_task_desc_create
NAME
alf_task_desc_create - Creates a task descriptor.
SYNOPSIS
int alf_task_desc_create (alf_handle_t alf_handle, ALF_ACCEL_TYPE_T accel_type, alf_task_desc_handle_t * p_desc_info_handle);
Parameters

alf_handle [IN]
    Handle to the ALF runtime.
accel_type [IN]
    The type of accelerator that tasks created from this descriptor are expected to run on.
p_desc_info_handle [OUT]
    Returns a handle to the created task descriptor. The content of the pointer is not modified if the call fails.
DESCRIPTION
This function creates a task descriptor. The data structure is returned through the pointer to its handle. The created data structure contains all the information relevant for a compute task.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid ALF handle
    - ALF_ERR_NOMEM: Out of memory or system resource
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_GENERIC: Generic internal errors
alf_task_desc_destroy
NAME
alf_task_desc_destroy - Destroys the specified task descriptor and frees up the resources associated with this task descriptor.
SYNOPSIS
int alf_task_desc_destroy (alf_task_desc_handle_t task_desc_handle);
Parameters

task_desc_handle [IN/OUT]
    Handle to a task descriptor. This data structure is destroyed when the call returns.
DESCRIPTION
This function destroys the specified task descriptor and frees up the resources associated with this task descriptor. A task descriptor cannot be destroyed if it is being used by a task. An attempt to destroy an occupied task descriptor results in an error.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument.
    - ALF_ERR_BADF: Invalid task descriptor handle.
    - ALF_ERR_BUSY: This task descriptor is being used. You must destroy all tasks using this descriptor before you can destroy the descriptor.
    - ALF_ERR_PERM: The API call is not permitted in the current context.
    - ALF_ERR_GENERIC: Generic internal errors.
alf_task_desc_ctx_entry_add
NAME
alf_task_desc_ctx_entry_add - Adds a description of one entry in the task context associated with this task descriptor.
SYNOPSIS
int alf_task_desc_ctx_entry_add (alf_task_desc_handle_t task_desc_handle, ALF_DATA_TYPE_T data_type, unsigned int size);
Parameters

task_desc_handle [IN]
    Handle to the task descriptor structure.
data_type [IN]
    Data type of the data in the entry.
size [IN]
    Number of elements of type data_type.
DESCRIPTION
This function adds a description of one entry in the task context associated with this task descriptor.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid task descriptor handle
    - ALF_ERR_NOSYS: The ALF_DATA_TYPE_T provided is not supported
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_NOBUFS: The requested entry has exceeded the maximum buffer size
    - ALF_ERR_GENERIC: Generic internal errors
alf_task_desc_set_int32
NAME
alf_task_desc_set_int32 - Sets the value for a specific integer field of the task descriptor.
SYNOPSIS
int alf_task_desc_set_int32 (alf_task_desc_handle_t task_desc_handle, ALF_TASK_DESC_FIELD_T field, unsigned int value);
Parameters

task_desc_handle [IN/OUT]
    Handle to the task descriptor structure.
field [IN]
    The field to be set. Possible inputs are:
    - ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE: Size of the work block parameter buffer.
    - ALF_TASK_DESC_WB_IN_BUF_SIZE: Size of the work block input buffer.
    - ALF_TASK_DESC_WB_OUT_BUF_SIZE: Size of the work block output buffer.
    - ALF_TASK_DESC_WB_INOUT_BUF_SIZE: Size of the work block overlapped input/output buffer.
    - ALF_TASK_DESC_NUM_DTL_ENTRIES: Maximum number of entries for the data transfer list.
    - ALF_TASK_DESC_TSK_CTX_SIZE: Size of the task context buffer.
    - ALF_TASK_DESC_PARTITION_ON_ACCEL: Specifies whether the accelerator functions (alf_accel_input_dtl_prepare and alf_accel_output_dtl_prepare) are invoked to generate data transfer lists for input and output data.
    - ALF_TASK_DESC_MAX_STACK_SIZE: Maximum stack size for the task.
value [IN]
    New value of the specified field.
DESCRIPTION
This function sets the value for a specific integer field of the task descriptor.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid task descriptor handle
    - ALF_ERR_NOSYS: The ALF_TASK_DESC_FIELD_T provided is not supported
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_RANGE: The specified value is out of the allowed range
    - ALF_ERR_GENERIC: Generic internal errors
alf_task_desc_set_int64
NAME
alf_task_desc_set_int64 - Sets the value for a specific long integer field of the task descriptor structure.
SYNOPSIS
int alf_task_desc_set_int64(alf_task_desc_handle_t task_desc_handle, ALF_TASK_DESC_FIELD_T field, unsigned long long value);
Parameters

task_desc_handle [IN/OUT]
    Handle to the task descriptor structure.
field [IN]
    The field to be set. Possible inputs are:
    - ALF_TASK_DESC_ACCEL_LIBRARY_REF_L: Specifies the name of the library that contains the accelerator image.
    - ALF_TASK_DESC_ACCEL_IMAGE_REF_L: Specifies the name of the accelerator image that is contained in the library.
    - ALF_TASK_DESC_ACCEL_KERNEL_REF_L: Specifies the name of the computational kernel function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
    - ALF_TASK_DESC_ACCEL_INPUT_DTL_REF_L: Specifies the name of the input list prepare function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
    - ALF_TASK_DESC_ACCEL_OUTPUT_DTL_REF_L: Specifies the name of the output list prepare function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
    - ALF_TASK_DESC_ACCEL_CTX_SETUP_REF_L: Specifies the name of the context setup function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
    - ALF_TASK_DESC_ACCEL_CTX_MERGE_REF_L: Specifies the name of the context merge function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
value [IN]
    New value of the specified field.
DESCRIPTION
This function sets the value for a specific long integer field of the task descriptor structure. All string constants are limited to a maximum length of ALF_STRING_TOKEN_MAX.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid task descriptor handle
    - ALF_ERR_NOSYS: The ALF_TASK_DESC_FIELD_T provided is not supported
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_RANGE: The specified value is out of the allowed range
    - ALF_ERR_GENERIC: Generic internal errors
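A descriptor is typically created once and then populated with alf_task_desc_set_int32 and alf_task_desc_set_int64 before any task is created from it. The sketch below assumes the alf.h host header; the buffer sizes and the library/image/kernel names are illustrative placeholders, and the cast of a string pointer to unsigned long long for the *_REF_L fields is an assumption about how string values are passed:

```c
#include <alf.h>     /* assumed ALF host API header */
#include <stdint.h>

/* Sketch: create and populate a task descriptor for SPE tasks. */
static int make_descriptor(alf_handle_t h, alf_task_desc_handle_t *out)
{
    alf_task_desc_handle_t d;
    int rc = alf_task_desc_create(h, ALF_ACCEL_TYPE_SPE, &d);
    if (rc < 0)
        return rc;

    /* Work block buffer geometry (see alf_task_desc_set_int32). */
    alf_task_desc_set_int32(d, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, 64);
    alf_task_desc_set_int32(d, ALF_TASK_DESC_WB_IN_BUF_SIZE,  16 * 1024);
    alf_task_desc_set_int32(d, ALF_TASK_DESC_WB_OUT_BUF_SIZE, 16 * 1024);
    alf_task_desc_set_int32(d, ALF_TASK_DESC_NUM_DTL_ENTRIES, 8);

    /* Names used to locate the accelerator-side code
     * (see alf_task_desc_set_int64); these strings are examples. */
    alf_task_desc_set_int64(d, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L,
                            (unsigned long long)(uintptr_t)"my_accel_lib.so");
    alf_task_desc_set_int64(d, ALF_TASK_DESC_ACCEL_IMAGE_REF_L,
                            (unsigned long long)(uintptr_t)"my_accel_image");
    alf_task_desc_set_int64(d, ALF_TASK_DESC_ACCEL_KERNEL_REF_L,
                            (unsigned long long)(uintptr_t)"my_kernel");

    *out = d;
    return 0;
}
```

The descriptor would later be released with alf_task_desc_destroy once no task uses it.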
alf_task_create
NAME
alf_task_create - Creates a task and allows you to add work blocks to the work queue of the task.
SYNOPSIS
int alf_task_create(alf_task_desc_handle_t task_desc_handle, void* p_task_context_data, unsigned int num_instances, unsigned int tsk_attr, unsigned int wb_dist_size, alf_task_handle_t *p_task_handle);
Parameters

task_desc_handle [IN]
    Handle to a task descriptor structure.
p_task_context_data [IN]
    Pointer to the task context data for this task. The structure and size of the task context are defined through alf_task_desc_ctx_entry_add. If there is no task context, a NULL pointer can be provided.
num_instances [IN]
    Number of instances of the task; only used when ALF_TASK_ATTR_SCHED_FIXED is provided.
tsk_attr [IN]
    Attributes for the task. This value can be set to a bitwise OR of the following:
    - ALF_TASK_ATTR_SCHED_FIXED: The task must be scheduled on the specified number of accelerators. By default, a task can be scheduled on any number of accelerators, and the number of accelerators can be adjusted at any time during the execution of the task.
    - ALF_TASK_ATTR_WB_CYCLIC: The work blocks for this task are distributed to the accelerators in a cyclic order as specified by num_instances. By default, the work block distribution order is determined by the ALF runtime. This option must be used in combination with ALF_TASK_ATTR_SCHED_FIXED.
wb_dist_size [IN]
    The block distribution bundle size in number of work blocks per distribution unit. A 0 (zero) value is treated as 1 (one).
p_task_handle [OUT]
    Returns a handle to the created task. The content of the pointer is not modified if the call returns failure.
DESCRIPTION
This function creates a task and allows you to enqueue work blocks to the task. The task remains in a pending state until all of its dependencies are satisfied and either at least one work block is added or the task is finalized by calling alf_task_finalize. When this condition is met, the task becomes ready to run. However, when the task actually starts to run depends on the available accelerator resources and the scheduling decisions of the ALF runtime. Multiple independent tasks can run concurrently if there are enough accelerator resources. Once the task starts to run, it keeps running until at least one of the following two conditions is met:
- The task has been finalized by calling alf_task_finalize, all the enqueued work blocks have been processed, and the task context has been merged and written back.
- alf_task_destroy is called to explicitly destroy the task.
Chapter 15. Host API
Note: A finalized task without any work block enqueued is never actually loaded and run. The runtime considers such a task completed immediately after its dependencies are satisfied.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid ALF handle
    - ALF_ERR_NOMEM: Out of memory or system resource
    - ALF_ERR_PERM: The API call is not permitted in the current context
    - ALF_ERR_NOEXEC: Invalid task image format or description information
    - ALF_ERR_2BIG: Memory requirement for the task exceeds maximum range
    - ALF_ERR_NOSYS: The required task attribute is not supported
    - ALF_ERR_BADR: The requested number of accelerator resources is not available
    - ALF_ERR_GENERIC: Generic internal errors
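For the common case of runtime-managed scheduling, most of the arguments can be left at their defaults. A sketch, assuming the alf.h host header and a descriptor prepared elsewhere:

```c
#include <alf.h>   /* assumed ALF host API header */

/* Sketch: create a task from a prepared descriptor.  Scheduling is
 * left to the runtime (no ALF_TASK_ATTR_SCHED_FIXED), so
 * num_instances is ignored; no task context is used here. */
static int start_task(alf_task_desc_handle_t desc, alf_task_handle_t *out)
{
    return alf_task_create(desc,
                           NULL, /* p_task_context_data: no task context */
                           0,    /* num_instances: unused without SCHED_FIXED */
                           0,    /* tsk_attr: default scheduling */
                           1,    /* wb_dist_size: one work block per unit */
                           out);
}
```
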
alf_task_finalize
NAME
alf_task_finalize - Finalizes the work block queue of the specified task.
SYNOPSIS
int alf_task_finalize (alf_task_handle_t task_handle)
Parameters

task_handle [IN]
    The task handle that is returned by the alf_task_create API.
DESCRIPTION
This function finalizes the task. After the task has been finalized, future calls to alf_wb_create, alf_task_depends_on, and alf_task_event_handler_register return errors. Note: Task finalization is a compulsory condition for a task to run and complete normally.
RETURN VALUE
0
    Successful.
less than 0
    Errors occurred:
    - ALF_ERR_BADF: Invalid task handle.
    - ALF_ERR_SRCH: The task handle has already been finalized.
    - ALF_ERR_PERM: The API call is not permitted in the current context. For example, some created work block handles have not been enqueued.
    - ALF_ERR_GENERIC: Generic internal errors.
alf_task_wait
NAME
alf_task_wait - Waits for the specified task to finish processing all work blocks on all the scheduled accelerators.
SYNOPSIS
int alf_task_wait(alf_task_handle_t task_handle, int time_out);
Parameters

task_handle [IN]
    A task handle that is returned by the alf_task_create API.
time_out [IN]
    A timeout value with the following options:
    - > 0: Waits up to the specified number of milliseconds before a timeout error occurs.
    - less than 0: Waits until all of the accelerators finish processing.
    - 0: Returns immediately.
DESCRIPTION
This function waits for the specified task to finish processing all work blocks on all the scheduled accelerators. The task must be finalized (alf_task_finalize must have been called) before this function is called; otherwise, ALF_ERR_PERM is returned. Data referenced by the task's work blocks can only be used safely after this function returns. If the host application updates the data buffers referenced by work blocks or the task context buffer while the task is running, the result is undefined. If you need to update the buffer contents, the only safe point is before the ALF_TASK_EVENT_READY task event is handled by the task event handler registered with alf_task_event_handler_register.
RETURN VALUE
0
    All of the accelerators finished the job.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument.
    - ALF_ERR_BADF: Invalid task handle.
    - ALF_ERR_NODATA: The task is (during wait) or was (before wait) destroyed explicitly.
    - ALF_ERR_TIME: Timeout.
    - ALF_ERR_PERM: The API call is not permitted in the current context. For example, the task is not finalized.
    - ALF_ERR_GENERIC: Generic internal errors.
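Because alf_task_wait requires a finalized task, the two calls usually appear together at the end of a task's submission phase. A sketch, assuming the alf.h host header:

```c
#include <alf.h>   /* assumed ALF host API header */

/* Sketch: the shutdown half of a task's life.  alf_task_finalize
 * tells the runtime that no more work blocks are coming;
 * alf_task_wait then blocks until every enqueued work block has been
 * processed.  Output buffers are only safe to read after
 * alf_task_wait returns. */
static int drain_task(alf_task_handle_t task)
{
    int rc = alf_task_finalize(task);
    if (rc < 0)
        return rc;
    return alf_task_wait(task, -1);  /* < 0: wait without a time limit */
}
```
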
alf_task_query
NAME
alf_task_query - Queries the current status of a task.
SYNOPSIS
int alf_task_query( alf_task_handle_t task_handle, unsigned int *p_unfinished_wbs, unsigned int *p_total_wbs);
Parameters

task_handle [IN]
    The task handle to be checked.
p_unfinished_wbs [OUT]
    A pointer to an integer buffer where the number of unfinished work blocks of this task is returned. When a NULL pointer is given, this value is not returned. On error, the returned value is undefined.
p_total_wbs [OUT]
    A pointer to an integer buffer where the total number of submitted work blocks of this task is returned. When a NULL pointer is given, this value is not returned. On error, the returned value is undefined.
DESCRIPTION
This function queries the current status of a task.
RETURN VALUE
>1
    The task is pending or ready to run.
1
    The task is currently running.
0
    The task finished normally.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument.
    - ALF_ERR_BADF: Invalid task handle.
    - ALF_ERR_NODATA: The task was explicitly destroyed.
    - ALF_ERR_GENERIC: Generic internal errors.
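The return value encodes the task state while the two counters give work block progress, which makes this call suitable for non-blocking progress reports. A sketch, assuming the alf.h host header:

```c
#include <alf.h>   /* assumed ALF host API header */
#include <stdio.h>

/* Sketch: print a one-line progress report without blocking.
 * NULL could be passed for either counter if it is not wanted. */
static void report_progress(alf_task_handle_t task)
{
    unsigned int unfinished, total;
    int state = alf_task_query(task, &unfinished, &total);

    if (state < 0)
        fprintf(stderr, "query failed: %d\n", state);
    else if (state == 0)
        printf("task finished (%u work blocks)\n", total);
    else if (state == 1)
        printf("running: %u of %u work blocks left\n", unfinished, total);
    else /* > 1 */
        printf("pending or ready to run\n");
}
```
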
alf_task_destroy
NAME
alf_task_destroy - Destroys the specified task.
SYNOPSIS
int alf_task_destroy(alf_task_handle_t* p_task_handle)
Parameters

p_task_handle [IN]
    A pointer to the task handle that is returned by the alf_task_create API.
DESCRIPTION
This function explicitly destroys the specified task if it is in the pending or running state. If there are work blocks that have not yet been processed, this routine stops the execution of those work blocks. If a task is running when this API is invoked, the task is cancelled before the API returns. Resources associated with this task are recycled by the runtime either synchronously or asynchronously, depending on the runtime implementation. This API does nothing on an already completed task. If a task is destroyed explicitly, all tasks that depend on it, directly or indirectly, are also destroyed. Because ALF frees task resources automatically, it is not necessary to call this API after a task has run to completion normally. Use this API only when you need to end a task explicitly.
RETURN VALUE
0
    Success.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument.
    - ALF_ERR_BADF: Invalid task handle.
    - ALF_ERR_PERM: The API call is not permitted in the current context.
    - ALF_ERR_BUSY: Resource busy.
    - ALF_ERR_SRCH: The task handle has already been destroyed.
    - ALF_ERR_GENERIC: Generic internal errors.
alf_task_depends_on
NAME
alf_task_depends_on - Describes a relationship between two tasks.
SYNOPSIS
int alf_task_depends_on (alf_task_handle_t task_handle_dependent, alf_task_handle_t task_handle);
Parameters

task_handle_dependent [IN]
    The handle to the dependent task.
task_handle [IN]
    The handle to the task that task_handle_dependent depends on.
DESCRIPTION
This function describes a dependency relationship between two tasks. The task specified by task_handle_dependent cannot be scheduled to run until the task specified by task_handle has finished normally. When this API is called, task_handle must not refer to an explicitly destroyed task; an error is reported if it does. If the task associated with task_handle is destroyed before normal completion, the task_handle_dependent task is also destroyed because its dependency can no longer be satisfied. If task A depends on task B, a call to alf_task_wait(A_handle) effectively enforces a wait on task B as well. A duplicate dependency is handled silently and is not treated as an error. Note: This function can only be called before any work blocks are enqueued to task_handle_dependent and before task_handle_dependent is finalized. This constraint does not apply to task_handle. Whenever a situation occurs that is not permitted, the function returns ALF_ERR_PERM.
RETURN VALUE
0
    Success.
less than 0
    Errors occurred:
    - ALF_ERR_BADF: Invalid task handle.
    - ALF_ERR_PERM: The API call is not permitted in the current context. For example, the dependency cannot be set because of the current state of the task.
    - ALF_ERR_GENERIC: Generic internal errors.
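A minimal sketch of a two-stage pipeline, assuming the alf.h host header. The dependency must be declared before any work blocks are enqueued to the dependent task and before that task is finalized:

```c
#include <alf.h>   /* assumed ALF host API header */

/* Sketch: "consumer" must not start until "producer" finishes
 * normally.  A later alf_task_wait on the consumer then implicitly
 * waits on the producer as well. */
static int chain_tasks(alf_task_handle_t producer,
                       alf_task_handle_t consumer)
{
    /* consumer depends on producer */
    return alf_task_depends_on(consumer, producer);
}
```
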
alf_task_event_handler_register
NAME
alf_task_event_handler_register - Allows you to register and unregister an event handler for a specific task.
SYNOPSIS
int alf_task_event_handler_register (alf_task_handle_t task_handle, int (*task_event_handler)( alf_task_handle_t task_handle, ALF_TASK_EVENT_TYPE_T event, void* p_data), void* p_data, unsigned int data_size, unsigned int event_mask);
Parameters

task_handle [IN]
    The handle to a task.
task_event_handler [IN]
    Pointer to the event handler function for the specified task. A NULL value indicates that the current event handler is to be unregistered.
p_data [IN]
    A pointer to a context buffer that is copied to another buffer managed by the ALF runtime. The pointer to this buffer is passed to the event handler. The content of the context buffer is copied by value only. A NULL value indicates no context buffer.
data_size [IN]
    The size of the context buffer in bytes. Zero indicates no context buffer.
event_mask [IN]
    A bitwise OR of ALF_TASK_EVENT_TYPE_T values. ALF_TASK_EVENT_TYPE_T is defined as follows:
    - ALF_TASK_EVENT_FINALIZED: This task has been finalized. No additional work block can be added to this task. The registered event handler is invoked right before alf_task_finalize returns.
    - ALF_TASK_EVENT_READY: This task has been scheduled for execution. The registered event handler is invoked as soon as the ALF runtime determines that all dependencies have been satisfied for this specific task; the runtime can schedule this task for execution as soon as this event handler returns.
    - ALF_TASK_EVENT_FINISHED: All work blocks in this task have been processed. The registered event handler is invoked as soon as the last work block has been processed and the task context has been written back to host memory.
    - ALF_TASK_EVENT_INSTANCE_START: One new instance of the task is started on an accelerator after the event handler returns.
    - ALF_TASK_EVENT_INSTANCE_END: One existing instance of the task has ended, and the task context has been copied out to the original location or merged into another current instance of the same task. The event handler is called as soon as the task instance has ended and has been unloaded from the accelerator.
    - ALF_TASK_EVENT_DESTROY: The task is destroyed explicitly.
DESCRIPTION
This function allows you to register an event handler for a specified task. This function can only be called before alf_task_finalize is invoked. An error is returned if you try to register an event handler for a task that has already been finalized.
If the task_event_handler function is NULL, this function unregisters the current event handler. If there is no current event handler, nothing happens. Note: If the event handler is registered after the task begins to run, some of the events might not be seen.
RETURN VALUE
0
    Success.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input handle.
    - ALF_ERR_BADF: Invalid ALF task handle.
    - ALF_ERR_PERM: The API call is not permitted in the current context.
    - ALF_ERR_NOMEM: Out of memory.
    - ALF_ERR_FAULT: Invalid buffer or error handler address (only when it is possible to detect the fault).
    - ALF_ERR_GENERIC: Generic internal errors.
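The handler signature follows the synopsis above. In this sketch the alf.h host header is assumed, and the convention that the handler returns a non-negative value on success is an assumption; because the runtime copies the context buffer, a stack variable is safe to pass at registration time:

```c
#include <alf.h>   /* assumed ALF host API header */
#include <stdio.h>

/* Sketch: log selected task events. */
static int my_event_handler(alf_task_handle_t task,
                            ALF_TASK_EVENT_TYPE_T event, void *p_data)
{
    const char *tag = (const char *)p_data;
    (void)task;
    if (event == ALF_TASK_EVENT_READY)
        printf("[%s] task ready to run\n", tag);
    else if (event == ALF_TASK_EVENT_FINISHED)
        printf("[%s] all work blocks processed\n", tag);
    return 0;  /* assumed convention: non-negative on success */
}

static int watch_task(alf_task_handle_t task)
{
    char tag[] = "demo";
    /* Must be called before alf_task_finalize. */
    return alf_task_event_handler_register(
        task, my_event_handler, tag, sizeof tag,
        ALF_TASK_EVENT_READY | ALF_TASK_EVENT_FINISHED);
}
```
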
Data structures
alf_wb_handle_t
This data structure refers to the work block being constructed by the control node.
alf_wb_create
NAME
alf_wb_create - Creates a new work block for the specified compute task.
SYNOPSIS
int alf_wb_create(alf_task_handle_t task_handle, ALF_WORK_BLOCK_TYPE_T work_block_type, unsigned int repeat_count, alf_wb_handle_t *p_wb_handle);
Parameters

task_handle [IN]
    The handle to the compute task.
work_block_type [IN]
    The type of work block to be created. Choose from the following types:
    - ALF_WB_SINGLE: Creates a single-use work block.
    - ALF_WB_MULTI: Creates a multi-use work block. This work block type is only supported when the task is created with the ALF_PARTITION_ON_ACCEL attribute.
repeat_count [IN]
    Specifies the number of iterations for a multi-use work block. This parameter is ignored when a single-use work block is created.
p_wb_handle [OUT]
    A pointer to a buffer where the created handle is returned. The contents are not modified if this call fails.
DESCRIPTION
This function creates a new work block for the specified compute task. The work block is added to the work queue of the task, and the runtime releases the allocated resources once the work block is processed. The caller can only update the contents of a work block before it is added to the work queue. After the work block is added to the work queue, the lifespan of the data structure is left to the ALF runtime, which is responsible for cleaning up any resource allocated for the work block. This API can only be called before alf_task_finalize is invoked. After alf_task_finalize is called, further calls to this API return an error.
RETURN VALUE
0
    Success.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument.
    - ALF_ERR_PERM: Operation not allowed in the current context. For example, the task has already been finalized or the work block has been enqueued.
    - ALF_ERR_BADF: Invalid task handle.
    - ALF_ERR_NOMEM: Out of memory.
    - ALF_ERR_GENERIC: Generic internal errors.
alf_wb_enqueue
NAME
alf_wb_enqueue - Adds the work block to the work queue of the specified task handle.
SYNOPSIS
int alf_wb_enqueue(alf_wb_handle_t wb_handle)
Parameters

wb_handle [IN]
    The handle of the work block to be put into the work queue.
DESCRIPTION
This function adds the work block to the work queue of the specified task handle. The caller can only update the contents of a work block before it is added to the work queue. After it is added to the work queue, you cannot access the wb_handle.
RETURN VALUE
0
    Success.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument
    - ALF_ERR_BADF: Invalid task handle or work block handle
    - ALF_ERR_PERM: Operation not allowed in the current context
    - ALF_ERR_BUSY: An internal resource is occupied
    - ALF_ERR_GENERIC: Generic internal errors
alf_wb_parm_add
NAME
alf_wb_parm_add - Adds the given parameter to the parameter and context buffer of the work block in the order that this function is called.
SYNOPSIS
int alf_wb_parm_add(alf_wb_handle_t wb_handle, void *pdata, unsigned int size_of_data, ALF_DATA_TYPE_T data_type, unsigned int address_alignment)
Parameters

wb_handle [IN]
    The work block handle.
pdata [IN]
    A pointer to the data to be copied.
size_of_data [IN]
    The size of the data in units of the data type.
data_type [IN]
    The type of data. This value is required if data endianness conversion is necessary when moving the data.
address_alignment [IN]
    The byte alignment of the data, expressed as a power of 2: the data is aligned on a 2^address_alignment byte boundary. The valid range is from 0 to 16. A zero indicates a byte-aligned address; an 8 indicates alignment on 256-byte boundaries.
DESCRIPTION
This function adds the given parameter to the parameter and context buffer of the work block in the order that this function is called. The starting address is from offset zero. The added data is copied to the internal parameter and context buffer immediately. The relative address of the data can be aligned as specified. For a specific work block, additional calls to this API return an error after the work block is put into the work queue by calling the alf_wb_enqueue function.
RETURN VALUE
0
    Success.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument.
    - ALF_ERR_PERM: Operation not allowed in the current context.
    - ALF_ERR_BADF: Invalid task handle or work block handle.
    - ALF_ERR_NOBUFS: Some internal resource is occupied.
    - ALF_ERR_GENERIC: Generic internal errors.
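A sketch of the create / add-parameters / enqueue sequence for one single-use work block, assuming the alf.h host header; the ALF_DATA_FLOAT and ALF_DATA_INT32 element types are assumed members of ALF_DATA_TYPE_T:

```c
#include <alf.h>   /* assumed ALF host API header */

/* Sketch: build and submit one single-use work block carrying two
 * scalar parameters.  After alf_wb_enqueue, the handle must not be
 * touched again. */
static int submit_block(alf_task_handle_t task, float scale, unsigned int n)
{
    alf_wb_handle_t wb;
    int rc = alf_wb_create(task, ALF_WB_SINGLE, 0, &wb);
    if (rc < 0)
        return rc;

    /* Parameters are copied immediately, in call order, starting at
     * offset zero of the parameter/context buffer. */
    alf_wb_parm_add(wb, &scale, 1, ALF_DATA_FLOAT, 2);  /* 2^2 = 4-byte aligned */
    alf_wb_parm_add(wb, &n,     1, ALF_DATA_INT32, 2);

    return alf_wb_enqueue(wb);
}
```
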
alf_wb_dtl_begin
NAME
alf_wb_dtl_begin - Marks the beginning of a data transfer list for the specified target buffer_type.
SYNOPSIS
int alf_wb_dtl_begin (alf_wb_handle_t wb_handle, ALF_BUF_TYPE_T buffer_type, unsigned int offset_to_accel_buf);
Parameters

wb_handle [IN]
    The work block handle.
buffer_type [IN]
    The type of the buffer. Possible values are:
    - ALF_BUF_IN: Input to the input-only buffer.
    - ALF_BUF_OUT: Output from the output-only buffer.
    - ALF_BUF_OVL_IN: Input to the overlapped buffer.
    - ALF_BUF_OVL_OUT: Output from the overlapped buffer.
    - ALF_BUF_OVL_INOUT: Input/output to/from the overlapped buffer.
offset_to_accel_buf [IN]
    Offset of the target buffer on the accelerator.
DESCRIPTION
This function marks the beginning of a data transfer list for the specified target buffer_type. Further calls to alf_wb_dtl_entry_add refer to the currently open data transfer list. You can create multiple data transfer lists per buffer type; however, only one data transfer list can be open for entries at any time for a specific work block. Data transfer lists cannot be nested.
RETURN VALUE
0
    Success.
less than 0
    Errors occurred:
    - ALF_ERR_INVAL: Invalid input argument.
    - ALF_ERR_PERM: Operation not allowed.
    - ALF_ERR_BADF: Invalid work block handle.
    - ALF_ERR_2BIG: The offset to the accelerator buffer is larger than the size of the buffer.
    - ALF_ERR_NOSYS: The specified I/O type feature is not supported.
    - ALF_ERR_BADR: The requested buffer is not defined in the task context.
    - ALF_ERR_NOBUFS: The internal data buffer is used up.
    - ALF_ERR_GENERIC: Generic internal errors.
alf_wb_dtl_entry_add
NAME
alf_wb_dtl_entry_add - Adds an entry to the input or output data transfer lists of a single use work block.
SYNOPSIS
int alf_wb_dtl_entry_add (alf_wb_handle_t wb_handle, void* host_addr, unsigned int size, ALF_DATA_TYPE_T data_type);
Parameters

wb_handle [IN]
    The work block handle.
host_addr [IN]
    The pointer (effective address) to the data in remote memory.
size [IN]
    The size of the data in units of the data type.
data_type [IN]
    The type of data. This value is required if data endianness conversion is necessary when moving the data.
DESCRIPTION
This function adds an entry to the input or output data transfer lists of a single-use work block. The entry describes a single piece of data transferred to or from remote memory. For a specific work block, further calls to this API return errors after the work block is put on the work queue by calling alf_wb_enqueue. If the work block's task is associated with a dataset, the buffer specified by host_addr and size must be contained within the dataset; adding a data transfer list entry that describes a buffer outside the associated dataset returns an ALF_ERR_PERM error. This function can only be called if the task descriptor associated with the work block's task was created with the task descriptor attribute ALF_TASK_DESC_PARTITION_ON_ACCEL set to false.
RETURN VALUE
0
   Success.
less than 0
   Errors occurred:
   v ALF_ERR_INVAL: Invalid input argument.
   v ALF_ERR_PERM: Operation not allowed.
   v ALF_ERR_BADF: Invalid work block handle.
   v ALF_ERR_2BIG: Trying to add too many lists.
   v ALF_ERR_NOBUFS: The amount of data to move exceeds the maximum buffer size.
   v ALF_ERR_FAULT: Invalid host address (if it can be detected).
   v ALF_ERR_GENERIC: Generic internal errors.
alf_wb_dtl_end
NAME
alf_wb_dtl_end - Marks the end of a data transfer list.
SYNOPSIS
int alf_wb_dtl_end (alf_wb_handle_t wb_handle);
Parameters wb_handle [IN] The work block handle
DESCRIPTION
This function marks the end of a data transfer list.
RETURN VALUE
0
   Success.
less than 0
   Errors occurred:
   v ALF_ERR_PERM: Operation not allowed.
   v ALF_ERR_BADF: Invalid work block handle.
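The three functions alf_wb_dtl_begin, alf_wb_dtl_entry_add, and alf_wb_dtl_end are used as a bracketed sequence on a work block before it is enqueued. The sketch below illustrates that calling pattern; because it must be self-contained, the alf_* functions here are minimal stand-ins that only track list state, and the matrix layout is a hypothetical example, not the real ALF implementation:

```c
#include <assert.h>

/* --- minimal stand-ins so the sketch is self-contained (not the real ALF) --- */
typedef enum { ALF_BUF_IN, ALF_BUF_OUT } ALF_BUF_TYPE_T;
typedef enum { ALF_DATA_FLOAT } ALF_DATA_TYPE_T;
typedef struct { int open; int entries; } alf_wb_handle_t;  /* stub handle */

static int alf_wb_dtl_begin(alf_wb_handle_t *wb, ALF_BUF_TYPE_T t, unsigned off) {
    if (wb->open) return -1;            /* no nesting of data transfer lists */
    wb->open = 1; (void)t; (void)off; return 0;
}
static int alf_wb_dtl_entry_add(alf_wb_handle_t *wb, void *ea, unsigned n,
                                ALF_DATA_TYPE_T ty) {
    if (!wb->open) return -1;           /* list must be open */
    wb->entries++; (void)ea; (void)n; (void)ty; return 0;
}
static int alf_wb_dtl_end(alf_wb_handle_t *wb) {
    if (!wb->open) return -1;
    wb->open = 0; return 0;
}

/* Describe the input of one matrix-add work block: H rows of mat_a, then mat_b. */
#define H 4
#define W 16
static float mat_a[H][W], mat_b[H][W];

int describe_input(alf_wb_handle_t *wb) {
    if (alf_wb_dtl_begin(wb, ALF_BUF_IN, 0) < 0) return -1;
    for (int i = 0; i < H; i++)         /* one entry per row of each source matrix */
        if (alf_wb_dtl_entry_add(wb, mat_a[i], W, ALF_DATA_FLOAT) < 0) return -1;
    for (int i = 0; i < H; i++)
        if (alf_wb_dtl_entry_add(wb, mat_b[i], W, ALF_DATA_FLOAT) < 0) return -1;
    return alf_wb_dtl_end(wb);          /* close the list before alf_wb_enqueue */
}
```

In a real program the same begin/add/end bracket is issued against the work block handle returned by alf_wb_create, and the work block is then passed to alf_wb_enqueue.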
alf_dataset_create
NAME
alf_dataset_create - Creates a dataset.
SYNOPSIS
int alf_dataset_create(alf_handle_t alf_handle, alf_dataset_handle_t * p_dataset_handle);
Parameters
alf_handle [IN]
   Handle to the ALF runtime.
p_dataset_handle [OUT]
   Handle to the dataset.
DESCRIPTION
This function creates a dataset.
RETURN VALUE
0
   Success.
less than 0
   Errors occurred:
   v ALF_ERR_INVAL: Invalid input argument.
   v ALF_ERR_BADF: Invalid ALF handle.
   v ALF_ERR_GENERIC: Generic internal errors.
alf_dataset_buffer_add
NAME
alf_dataset_buffer_add - Adds a data buffer to the data set.
SYNOPSIS
int alf_dataset_buffer_add(alf_dataset_handle_t dataset, void *buffer, unsigned long long size, ALF_CACHE_ACCESS_MODE_T access_mode);
Parameters
dataset [IN]
   Handle to the dataset.
buffer [IN]
   Address of the buffer to be added.
size [IN]
   Size of the buffer.
access_mode [IN]
   Access mode for the buffer. A buffer can have one of the following access modes:
   v ALF_DATASET_READ_ONLY: The dataset is read-only. Work blocks referencing the data in this buffer cannot update this buffer as an output buffer.
   v ALF_DATASET_WRITE_ONLY: The dataset is write-only. Work blocks referencing the data in this buffer as input data result in indeterminate behavior.
   v ALF_DATASET_READ_WRITE: The dataset allows both read and write access. Work blocks can use this buffer as input buffers, output buffers, and/or in/out buffers.
DESCRIPTION
This function adds a data buffer to the data set.
RETURN VALUE
0
   Success.
less than 0
   Errors occurred:
   v ALF_ERR_INVAL: Invalid input argument.
   v ALF_ERR_BADF: Invalid ALF handle.
   v ALF_ERR_PERM: The API call is not permitted with the current calling context. The dataset has been associated with a task and thus closed from further buffer additions.
   v ALF_ERR_GENERIC: Generic internal errors.
alf_dataset_destroy
NAME
alf_dataset_destroy - Destroys a given data set.
SYNOPSIS
int alf_dataset_destroy(alf_dataset_handle_t dataset_handle);
Parameters
dataset_handle [IN]
   Handle to the dataset.
DESCRIPTION
This function destroys a given dataset. Further references to the dataset result in indeterminate behavior; references to the data within the dataset remain valid. You cannot destroy a dataset while there are still running tasks associated with it.
RETURN VALUE
0
   Success.
less than 0
   Errors occurred:
   v ALF_ERR_INVAL: Invalid input argument.
   v ALF_ERR_BADF: Invalid ALF handle.
   v ALF_ERR_PERM: The API call is not permitted in the current calling context; for example, tasks associated with the dataset are still running.
   v ALF_ERR_GENERIC: Generic internal errors.
alf_task_dataset_associate
NAME
alf_task_dataset_associate - Associates a given task with a dataset.
SYNOPSIS
int alf_task_dataset_associate(alf_task_handle_t task, alf_dataset_handle_t dataset);
Parameters
task [IN]
   Handle to the task.
dataset [IN]
   Handle to the dataset.
DESCRIPTION
This function associates a given task with a dataset. It can only be called before any work block is enqueued for the task. After a task is associated with a dataset, all work blocks subsequently created and enqueued for this task cannot reference data outside the dataset, and further calls to alf_dataset_buffer_add for that dataset result in an error. After a task is associated with a dataset, the host application program can only use the data after alf_task_wait has been called and has returned.
RETURN VALUE
0
   Success.
less than 0
   Errors occurred:
   v ALF_ERR_INVAL: Invalid input argument.
   v ALF_ERR_BADF: Invalid ALF handle.
   v ALF_ERR_PERM: The API call is not permitted with the current calling context. The dataset has been associated with a task and thus closed from further buffer additions.
   v ALF_ERR_SRCH: Already destroyed task handle.
   v ALF_ERR_GENERIC: Generic internal errors.
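The dataset calls above follow a fixed lifecycle: create the dataset, add buffers, associate it with a task (which closes it to further additions), then destroy it after the task completes. The sketch below models just the documented closed-for-additions rule; the dataset_t struct and the stub bodies are illustrative stand-ins, not the real ALF implementation:

```c
#include <assert.h>

/* Minimal stand-ins (not the real ALF): just enough state to model the rule
 * that alf_task_dataset_associate closes a dataset to further buffer additions. */
enum { ALF_ERR_PERM = -1 };
typedef struct { int buffers; int associated; } dataset_t;

static int alf_dataset_create(dataset_t *ds) {
    ds->buffers = 0;
    ds->associated = 0;
    return 0;
}
static int alf_dataset_buffer_add(dataset_t *ds, void *buf, unsigned long long size) {
    (void)buf; (void)size;
    if (ds->associated)
        return ALF_ERR_PERM;   /* closed once associated with a task */
    ds->buffers++;
    return 0;
}
static int alf_task_dataset_associate(dataset_t *ds) {
    ds->associated = 1;        /* all later work blocks must stay inside */
    return 0;
}
```

In real code the association is made between an alf_task_handle_t and an alf_dataset_handle_t, and the dataset is destroyed with alf_dataset_destroy once no running task references it.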
ALF_ACCEL_EXPORT_API_LIST_BEGIN
This macro declares the beginning of the computational kernel API export definition section. This macro must be the first statement of the definition section.
ALF_ACCEL_EXPORT_API_LIST_END
This macro declares the end of the computational kernel API export definition section. This macro must be the last statement of the definition section.
ALF_ACCEL_EXPORT_API
NAME
ALF_ACCEL_EXPORT_API - Declares one entry of the computational kernel API export definition section.
SYNOPSIS
ALF_ACCEL_EXPORT_API(const char *p_api_name, int (*p_api)())
Parameters
p_api_name [IN]
   A string constant that uniquely identifies the exported API. It is recommended that this be the same as the corresponding function identifier.
p_api [IN]
   The exported function entry pointer.
DESCRIPTION
This macro declares one entry of the computational kernel API export definition section. The ALF runtime locates the entry addresses of the user-implemented computational kernel functions based on the information provided by these entries.
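Putting the three macros together, an accelerator image typically declares one export section listing its kernel functions. Because this sketch must compile on its own, it supplies stand-in macro definitions and hypothetical kernel names; in a real image the macros come from the ALF accelerator-side header and the list is consumed by the runtime:

```c
#include <assert.h>
#include <string.h>

/* Stand-in macro definitions so the sketch is self-contained; the real macros
 * are provided by the ALF accelerator-side header. */
typedef struct { const char *name; int (*fn)(void); } export_entry_t;
#define ALF_ACCEL_EXPORT_API_LIST_BEGIN  static export_entry_t alf_export_list[] = {
#define ALF_ACCEL_EXPORT_API(name, fn)       { name, fn },
#define ALF_ACCEL_EXPORT_API_LIST_END        { 0, 0 } };

/* User-implemented computational kernel entry points (hypothetical names;
 * real ALF kernels take the documented parameter lists). */
static int my_comp_kernel(void) { return 0; }
static int my_input_dtl_prepare(void) { return 0; }

/* The export definition section: BEGIN first, END last, one entry per API. */
ALF_ACCEL_EXPORT_API_LIST_BEGIN
ALF_ACCEL_EXPORT_API("my_comp_kernel", my_comp_kernel)
ALF_ACCEL_EXPORT_API("my_input_dtl_prepare", my_input_dtl_prepare)
ALF_ACCEL_EXPORT_API_LIST_END
```

The string passed as p_api_name is what the host side references (for example, as the computational kernel name in the task descriptor), which is why it is recommended to match the function identifier.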
alf_accel_comp_kernel
NAME
alf_accel_comp_kernel - Computes the work blocks.
SYNOPSIS
int alf_accel_comp_kernel(void* p_task_ctx, void *p_parm_ctx_buffer, void *p_input_buffer, void *p_output_buffer, void* p_inout_buffer, unsigned int current_iter, unsigned int num_iter);
Parameters
p_task_ctx [IN]
   A pointer to the local memory block where the task context buffer is kept.
p_parm_ctx_buffer [IN]
   A pointer to the local memory block where the parameter and context data are kept.
p_input_buffer [IN]
   A pointer to the local memory block where the input data is loaded.
p_output_buffer [IN]
   A pointer to the local memory block where the output data is written.
p_inout_buffer [IN]
   A pointer to the accelerator memory block where the in/out buffers are located.
current_iter [IN]
   The current iteration count of multi-use work blocks. This value starts at 0. For single-use work blocks, this value is always 0.
num_iter [IN]
   The total number of iterations of multi-use work blocks. For single-use work blocks, this value is always 1.
DESCRIPTION
This is the computational kernel that does the computation of the work blocks. The ALF runtime ensures that all input data are available before invoking this call. You must provide an implementation for this function.
RETURN VALUE
0
   The computation finished correctly.
less than 0
   An error occurred during the computation. The error code is passed back to you to be handled.
alf_accel_input_dtl_prepare
NAME
alf_accel_input_dtl_prepare - Defines the data transfer lists for input data.
SYNOPSIS
int alf_accel_input_dtl_prepare (void* p_task_context, void *p_parm_context, void *p_dtl, unsigned int current_iter, unsigned int num_iter);
Parameters
p_task_context [IN]
   Pointer to the task context buffer in accelerator memory.
p_parm_context [IN]
   Pointer to the work block parameter context buffer in accelerator memory.
p_dtl [IN]
   Pointer to the buffer where the generated data transfer list should be saved.
current_iter [IN]
   The current iteration count of multi-use work blocks. This value starts at 0. For single-use work blocks, this value is always 0.
num_iter [IN]
   The total number of iterations of multi-use work blocks. For single-use work blocks, this value is always 1.
DESCRIPTION
This function is called by the ALF runtime when it needs the accelerator to define the data transfer lists for input data. One important point to consider is that because the ALF framework may do double buffering, the function must refer only to the information provided by p_parm_context. This function should generate the data transfer lists for the input buffer (ALF_BUF_IN), the overlapped input buffer (ALF_BUF_OVL_IN), and the overlapped I/O buffer (ALF_BUF_OVL_INOUT) when these buffers are enabled. For the overlapped I/O buffer (ALF_BUF_OVL_INOUT), the data transfer list generated in this function is reused by the runtime to push the data back to host memory. This function is optional. It is called only if the task descriptor attribute ALF_TASK_DESC_PARTITION_ON_ACCEL is set to true. When this attribute is not set, or is set to false, you can either choose not to implement this API (when the programming environment supports weak linking) or implement an empty function that returns zero (when weak linking is not supported).
RETURN VALUE
0
   The computation finished correctly.
less than 0
   An error occurred during the call. The error code is passed back to you to be handled.
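Inside alf_accel_input_dtl_prepare, the work block parameters typically carry a base address and geometry, and current_iter selects which slice of the input a multi-use work block fetches. The helper below is a hypothetical illustration of that arithmetic only (the names and layout are assumptions, not part of the ALF API): it splits a vector of total_elems elements into num_iter nearly equal slices and returns the slice for a given iteration.

```c
#include <assert.h>

/* Hypothetical slice computation for a multi-use work block: iteration
 * current_iter of num_iter covers elements [begin, begin + count) of a
 * vector of total_elems elements. */
typedef struct {
    unsigned long long begin;   /* index of first element of this slice */
    unsigned long long count;   /* number of elements in this slice */
} slice_t;

slice_t input_slice(unsigned long long total_elems,
                    unsigned int current_iter, unsigned int num_iter) {
    slice_t s;
    unsigned long long base = total_elems / num_iter;
    unsigned long long rem  = total_elems % num_iter;  /* first `rem` slices get one extra */
    s.count = base + (current_iter < rem ? 1 : 0);
    s.begin = (unsigned long long)current_iter * base
            + (current_iter < rem ? current_iter : rem);
    return s;
}
```

A real implementation would convert begin/count into host addresses and pass them to ALF_ACCEL_DTL_ENTRY_ADD between ALF_ACCEL_DTL_BEGIN and ALF_ACCEL_DTL_END.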
alf_accel_output_dtl_prepare
NAME
alf_accel_output_dtl_prepare - Defines the partition of output data.
SYNOPSIS
int alf_accel_output_dtl_prepare (void* p_task_context, void *p_parm_ctx_buffer, void *p_io_container, unsigned int current_iter, unsigned int num_iter);
Parameters
p_task_context [IN]
   Pointer to the task context buffer in accelerator memory.
p_parm_ctx_buffer [IN]
   Pointer to the work block parameter context in accelerator memory.
p_io_container [IN]
   Pointer to the buffer where the generated data transfer list should be saved.
current_iter [IN]
   The current iteration count of multi-use work blocks. This value starts at 0. For single-use work blocks, this value is always 0.
num_iter [IN]
   The total number of iterations of multi-use work blocks. For single-use work blocks, this value is always 1.
DESCRIPTION
This function is called by the ALF runtime when it needs the accelerator to define the partition of output data. Because the ALF framework may be doing double buffering, the function must refer only to the information provided by p_parm_ctx_buffer. This function generates the data transfer lists for the output buffer (ALF_BUF_OUT) and the overlapped output buffer (ALF_BUF_OVL_OUT) when these buffers are enabled. This function is called only if the task descriptor attribute ALF_TASK_DESC_PARTITION_ON_ACCEL is set to true. When this attribute is not set, or is set to false, you can either choose not to implement this API (when the programming environment supports weak linking) or implement an empty function that returns zero (when weak linking is not supported).
RETURN VALUE
0
   The computation finished correctly.
less than 0
   An error occurred during the call. The error code is passed back to you to be handled.
alf_accel_task_context_setup
NAME
alf_accel_task_context_setup - Initializes a task.
SYNOPSIS
int alf_accel_task_context_setup (void* p_task_context);
Parameters p_task_context [IN/OUT] Pointer to task context in accelerator memory.
DESCRIPTION
This function is called by the ALF runtime when a task starts running on an accelerator. The runtime loads the initial task context into local memory and calls this function to do task-instance-specific initialization. The ALF runtime invokes this API only when the task has a task context. When the task does not have a task context, or the application does not need extra setup of the initial context, you can either choose not to implement this API (when the programming environment supports weak linking) or implement an empty function that returns zero (when weak linking is not supported).
RETURN VALUE
0
   The API call finished correctly.
less than 0
   An error happened during the call. The error code is passed back to you to be handled.
alf_accel_task_context_merge
NAME
alf_accel_task_context_merge - Merges the context after a task has stopped running.
SYNOPSIS
int alf_accel_task_context_merge (void* p_task_context_to_be_merged, void* p_task_context);
Parameters
p_task_context_to_be_merged [IN]
   Pointer to the local memory block where the task context to be merged is kept.
p_task_context [IN/OUT]
   Pointer to the local memory block where the target task context is kept.
DESCRIPTION
This function is called by the ALF runtime when a task stops running on an accelerator. The runtime loads the corresponding task context into the memory of an accelerator that is running this task and calls this function to do the context merge. The ALF runtime invokes this API only when the task has a task context. If the task does not have a task context, or the application does not need to do a context merge, you can either choose not to implement this API (when the programming environment supports weak linking) or implement an empty function that returns zero (when weak linking is not supported).
RETURN VALUE
0
   The API call finishes correctly.
less than 0
   An error occurred during the call. The error code is passed back to you to be handled.
Runtime APIs
This section lists the APIs that the accelerator-side ALF runtime provides.
alf_accel_num_instances
NAME
alf_accel_num_instances - Returns the number of instances that are running this computational kernel.
SYNOPSIS
int alf_accel_num_instances (void);
Parameters None
DESCRIPTION
This function returns the number of instances that are currently executing this computational kernel. It should only be used when a task is created with the task attribute ALF_TASK_ATTR_SCHED_FIXED. If you call this function without ALF_TASK_ATTR_SCHED_FIXED, the number returned might change from one invocation to the next as the ALF runtime dynamically loads and unloads task instances.
RETURN VALUE
>0
   The number of accelerators that are executing this compute task.
less than 0
   Internal error.
alf_instance_id
NAME
alf_instance_id - Returns the ID of the current task instance.
SYNOPSIS
int alf_instance_id (void);
Parameters None
DESCRIPTION
This function returns the current instance ID of the task. This ID ranges from 0 to alf_accel_num_instances() - 1.
RETURN VALUE
>=0
   The ID of the current accelerator. This is guaranteed to be unique within the accelerators reserved for the ALF runtime.
less than 0
   Internal error.
ALF_ACCEL_DTL_BEGIN
NAME
ALF_ACCEL_DTL_BEGIN - Marks the beginning of a data transfer list for the specified target buffer_type.
SYNOPSIS
ALF_ACCEL_DTL_BEGIN (void* p_dtl, ALF_IO_BUF_TYPE_T buf_type, unsigned int offset);
Parameters
p_dtl [IN/OUT]
   Pointer to the buffer for the data transfer list data structure.
buf_type [IN]
   The type of the buffer. Possible values are:
   v ALF_BUF_IN
   v ALF_BUF_OUT
   v ALF_OVL_IN
   v ALF_OVL_OUT
   v ALF_OVL_INOUT
offset [IN]
   Offset into the input or output buffer in local memory to which the data transfer list refers.
DESCRIPTION
This utility marks the beginning of a data transfer list for the specified target buffer_type. Further calls to ALF_ACCEL_DTL_ENTRY_ADD refer to the currently open data transfer list. You can create multiple data transfer lists per buffer type. However, only one data transfer list is open for entry at any time. Note: This API is for the accelerator node side to generate data transfer list entries. It may be implemented as a macro on some platforms.
RETURN VALUE
None.
ALF_ACCEL_DTL_ENTRY_ADD
NAME
ALF_ACCEL_DTL_ENTRY_ADD - Fills the data transfer list entry.
SYNOPSIS
ALF_ACCEL_DTL_ENTRY_ADD (void *p_dtl, unsigned int data_size, ALF_DATA_TYPE_T data_type, alf_data_addr64_t p_host_address);
Parameters
p_dtl [IN]
   Pointer to the buffer for the data transfer list data structure.
data_size [IN]
   Size of the data in units of the data type.
data_type [IN]
   The type of the data. This value is required if data endianness conversion is necessary when moving the data.
p_host_address [IN]
   Address of the host memory.
DESCRIPTION
This function fills in a data transfer list entry. Note: This API is for the accelerator node side to generate data transfer list entries. It may be implemented as a macro on some platforms.
RETURN VALUE
None.
ALF_ACCEL_DTL_END
NAME
ALF_ACCEL_DTL_END - Marks the end of a data transfer list.
SYNOPSIS
ALF_ACCEL_DTL_END(void* p_dtl);
Parameters p_dtl[IN] Pointer to buffer for the data transfer list data structure.
DESCRIPTION
This utility marks the end of a data transfer list.
RETURN VALUE
None.
ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_GET
NAME
ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_GET - Gets the internal DMA list buffers.
SYNOPSIS
ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_GET (void *p_dtl, void **pp_dma_list_buffer, unsigned int *p_max_entries);
Parameters
p_dtl [IN]
   A pointer to the buffer for the data transfer list data structure.
pp_dma_list_buffer [OUT]
   Returns a pointer to the internal DMA list buffer.
p_max_entries [OUT]
   Returns the maximum number of entries allowed in this buffer.
DESCRIPTION
This utility gets the internal DMA list buffers so that you can directly access them. It must be called after ALF_ACCEL_DTL_BEGIN and before ALF_ACCEL_DTL_END. After this call, ALF_ACCEL_DTL_ENTRY_ADD must not be used before ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_UPDATE is called.
RETURN VALUE
0
   The computation finished correctly.
less than 0
   An error occurred during the computation. The error code is passed back to the library developer to be handled.
ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_UPDATE
NAME
ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_UPDATE - Updates the internal data structure when the direct access completes.
SYNOPSIS
ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_UPDATE (void *p_dtl, unsigned int num_entries);
Parameters
p_dtl [IN]
   A pointer to the buffer for the data transfer list data structure.
num_entries [IN]
   The number of DMA list entries filled in during the direct access.
DESCRIPTION
This utility updates the internal data structure when the direct access completes. It must be called after ALF_ACCEL_DTL_CBEA_DMA_LIST_BUFFER_GET and before ALF_ACCEL_DTL_END. Any further calls to ALF_ACCEL_DTL_ENTRY_ADD can only be made after this call.
RETURN VALUE
Not specified
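The GET/UPDATE pair brackets a window of direct access during which ALF_ACCEL_DTL_ENTRY_ADD must not be used. The following stand-in state machine (illustrative stubs, not the real ALF implementation) makes the documented ordering rules concrete — begin, optional entry adds, get, direct fill, update, more entry adds, end:

```c
#include <assert.h>

/* Stand-in state machine modeling the documented ordering: ENTRY_ADD is not
 * allowed between DMA_LIST_BUFFER_GET and DMA_LIST_BUFFER_UPDATE. */
typedef struct {
    int open;            /* between begin and end */
    int direct;          /* between buffer_get and buffer_update */
    unsigned entries;    /* entries committed so far */
    unsigned list[16];   /* stand-in for the internal DMA list buffer */
} dtl_t;

static int dtl_begin(dtl_t *d)     { d->open = 1; d->direct = 0; d->entries = 0; return 0; }
static int dtl_entry_add(dtl_t *d) { return (d->open && !d->direct) ? (int)++d->entries : -1; }
static int dtl_buffer_get(dtl_t *d, unsigned **buf, unsigned *max) {
    if (!d->open) return -1;
    d->direct = 1;                       /* entry_add now forbidden */
    *buf = d->list + d->entries;
    *max = 16 - d->entries;
    return 0;
}
static int dtl_buffer_update(dtl_t *d, unsigned filled) {
    if (!d->direct) return -1;
    d->direct = 0;                       /* direct access complete */
    d->entries += filled;
    return 0;
}
static int dtl_end(dtl_t *d)       { if (!d->open || d->direct) return -1; d->open = 0; return 0; }
```

The real macros operate on the p_dtl buffer passed into the dtl_prepare callbacks; the point of the sketch is only the legal call ordering.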
Part 5. Appendixes
alf_query_system_info alf_num_instances_set alf_exit alf_register_error_handler alf_configure Compute APIs alf_task_handle_t alf_task_desc_handle_t alf_task_desc_create
N X X The task descriptor concept replaces the task_info structure in SDK 2.1 The task descriptor concept replaces the task_info structure in SDK 2.1 The task descriptor concept replaces the task_info structure in SDK 2.1
alf_task_desc_destroy
alf_task_desc_ctx_entry_add
alf_task_desc_set_int32 alf_task_desc_set_int64
Y X
Table 1. API changes (continued) API has been updated for this release Y/N API removed for this release
New API for this release Changes from SDK 2.1 X The alf_task_create function in this new API is very different from the alf_task_create function in SDK 2.1. The differences are: v Task is created based on a task descriptor, not task_info v You can specify the number of instances of a task in this function. v Users can specify work block distribution v Task context data is provided through this function.
alf_task_finalize alf_task_wait Y
X In SDK 2.1, alf_task_wait also signifies that you cannot add work blocks into a task. In this API, alf_task_wait is divided into two separate functions, alf_task_finalize and alf_task_wait.
alf_task_query alf_task_destroy
N N It is no longer required to call this API to release the resources that a task uses. X X API replaced by alf_desc_task_handle_t alf_task_create alf_task_desc_ctx_entry_add alf_task_create Y Y Y Y
alf_task_depends_on alf_task_event_handler_register alf_task_info_t alf_task_context_create alf_task_context_add_entry alf_task_context_register Work block APIs alf_wb_handle_t alf_wb_create alf_wb_enqueue alf_wb_dtl_begin alf_wb_add_parm alf_wb_dtl_entry_add alf_wb_dtl_end alf_wb_add_io_buffer N N Y N
Table 1. API changes (continued) API has been updated for this release Y/N API removed for this release Y Y Y Y
API Name alf_wb_sync sync_callback_func alf_wb_sync_wait alf_wb_sync_handle_t Data set APIs alf_dataset_handle_t alf_dataset_create alf_dataset_buffer_add alf_dataset_destroy alf_task_dataset_associate Accelerator APIs ALF_ACCEL_EXPORT_API_ LIST_BEGIN ALF_ACCEL_EXPORT_API ALF_ACCEL_EXPORT_API_ LIST_END Computational kernel APIs alf_accel_comp_kernel alf_accel_input_dtl_prepare alf_accel_output_dtl_prepare
X X X X X
X X X
X X X X
alf_accel_task_context_setup alf_accel_task_context_merge Runtime APIs alf_accel_num_instances alf_instance_id ALF_ACCEL_DTL_BEGIN ALF_ACCEL_DTL_ENTRY_ADD ALF_ACCEL_DTL_END alf_comp_kernel alf_prepare_input_list alf_prepare_output_list ALF_DT_LIST_CREATE Cell BE platform specific APIs ALF_ACCEL_DTL_CBEA_ DMA_LIST_BUFFER_UPDATE ALF_ACCEL_DTL_CBEA_ DMA_LIST_BUFFER_GET
X X
X X
Table 1. API changes (continued) API has been updated for this release Y/N API removed for this release Y
Appendix B. Examples
The following examples are described in this section:
v Matrix add - host data partitioning example
v Matrix add - accelerator data partitioning example on page 120
v Table lookup example on page 121
v Min-max finder example on page 122
v Multiple vector dot products on page 124
v Overlapped I/O buffer example on page 127
v Task dependency example on page 129
Basic examples
This section describes the following basic examples:
v Matrix add - host data partitioning example. This example includes the source code.
v Matrix add - accelerator data partitioning example on page 120.
where m and n are the dimensions of the matrices. This simple example demonstrates how to:
v Use a task descriptor
v Start a task on the accelerators
v Create and add a work block to a task
v Exit the ALF runtime environment correctly
You can also use this sample as a template to build a more complicated application. In this example, the host application:
v Initializes the ALF runtime environment
v Creates a task descriptor
v Creates a task based on that task descriptor
v Creates work blocks with the appropriate data transfer lists, which start invocations of the computational kernel on the accelerator
v Waits for the computational kernel to finish and exits
The accelerator application includes a simple computational kernel that computes the addition of the two matrices. The scalar code to add two matrices on a uniprocessor machine is provided below:
float mat_a[NUM_ROW][NUM_COL];
float mat_b[NUM_ROW][NUM_COL];
float mat_c[NUM_ROW][NUM_COL];

int main(void)
{
    int i, j;
    for (i = 0; i < NUM_ROW; i++)
        for (j = 0; j < NUM_COL; j++)
            mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
    return 0;
}
An ALF host program can be logically divided into several sections:
v Initialization
v Task setup
v Work block setup
v Task wait and exit
Source code
The following code listings show only the relevant sections of the code. For a complete listing, refer to the ALF samples directory:
matrix_add/STEP1a_partition_scheme_A
Initialization
The following code segment shows how ALF is initialized and accelerators allocated for a specific ALF runtime.
alf_handle_t alf_handle;
unsigned int nodes;
/* initialize the runtime environment for ALF */
alf_init(&config_parms, &alf_handle);
/* get the number of SPE accelerators available from the Opteron */
rc = alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes);
/* set the total number of accelerator instances (in this case, SPEs) */
/* that the ALF runtime will have during its lifetime */
rc = alf_num_instances_set(alf_handle, nodes);
Task setup
The next section of an ALF host program contains information about the description of a task and the creation of the task runtime. The alf_task_desc_create function creates a task descriptor. This descriptor can be used multiple times to create different executable tasks. The function alf_task_create creates a task to run an SPE program with the name spe_add_program.
/* variable declarations */
alf_task_desc_handle_t task_desc_handle;
alf_task_handle_t task_handle;
const char* spe_image_name;
const char* library_path_name;
const char* comp_kernel_name;
/* describing a task that is executable on the SPE */
alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &task_desc_handle);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE,
118
    sizeof(add_parms_t));
alf_task_desc_set_int32(task_desc_handle, sizeof(float));
alf_task_desc_set_int32(task_desc_handle, sizeof(float));
alf_task_desc_set_int32(task_desc_handle,
alf_task_desc_set_int32(task_desc_handle,
/* providing the SPE executable name */
alf_task_desc_set_int64(task_desc_handle, (unsigned long long) spe_image_name);
alf_task_desc_set_int64(task_desc_handle, (unsigned long) library_path_name);
alf_task_desc_set_int64(task_desc_handle, (unsigned long) comp_kernel_name);
/* waiting for all work blocks to be done */
alf_task_wait(task_handle, -1);
/* exit ALF runtime */
alf_exit(alf_handle, ALF_EXIT_WAIT, -1);
Accelerator side
On the accelerator side, you need to provide the actual computational kernel that computes the addition of the two blocks of matrices. The ALF runtime on the accelerator is responsible for getting the input buffer to the accelerator memory before it runs the user-provided alf_accel_comp_kernel function. After alf_accel_comp_kernel returns, the ALF runtime is responsible for getting the output data back to host memory space. Double buffering or triple buffering is employed as appropriate to ensure that the latency for the input buffer to get into accelerator memory and the output buffer to get to host memory space is well covered with computation.
int alf_accel_comp_kernel(void *p_task_context, void *p_parm_context,
                          void *p_input_buffer, void *p_output_buffer,
                          void *p_inout_buffer, unsigned int current_count,
                          unsigned int total_count)
{
    unsigned int i, cnt;
    vector float *sa, *sb, *sc;
    add_parms_t *p_parm = (add_parms_t *) p_parm_context;
    cnt = p_parm->h * p_parm->v / 4;
    sa = (vector float *) p_input_buffer;
    sb = sa + cnt;
    sc = (vector float *) p_output_buffer;
    for (i = 0; i < cnt; i += 4) {
        sc[i] = spu_add(sa[i], sb[i]);
        sc[i + 1] = spu_add(sa[i + 1], sb[i + 1]);
        sc[i + 2] = spu_add(sa[i + 2], sb[i + 2]);
        sc[i + 3] = spu_add(sa[i + 3], sb[i + 3]);
    }
    return 0;
}
For all -32768 <= in < 32768:

              / 0,     in < -4096
  Table(in) = | in/32, -4096 <= in < 4096
              \ 255,   in >= 4096
The following is the stripped-down code listing. The routines of less interest have been removed so that you can focus on the key features. Because the task context buffer (the lookup table) is already initialized by the host code and the table is used as read-only data, you do not need the context setup and context merge functions on the accelerator side.
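As a point of reference, the piecewise definition above can be written directly as a scalar C helper. This is a hypothetical function, not part of the sample, and note that C's integer division truncates toward zero (so negative inputs in the middle band yield negative values here); the sample instead precomputes the entire 64 K-entry byte table on the host:

```c
/* Scalar sketch of the lookup function defined above: clamp to 0 below
 * -4096, clamp to 255 at or above 4096, and use in/32 in between.
 * Hypothetical helper for illustration only. */
int table_value(int in)
{
    if (in < -4096)
        return 0;       /* lower clamp */
    if (in >= 4096)
        return 255;     /* upper clamp */
    return in / 32;     /* middle band (C division truncates toward zero) */
}
```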
Accelerator code
The following code is the accelerator-side code.
/* ---------------------------------------------- */
/* the accelerator side code                      */
/* ---------------------------------------------- */

/* the computation kernel function */
int comp_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                void *p_input_buffer, void *p_output_buffer,
                void *p_inout_buffer, unsigned int current_count,
                unsigned int total_count)
{
    my_task_context_t *p_ctx = (my_task_context_t *) p_task_context;
    my_wb_parms_t *p_parm = (my_wb_parms_t *) p_parm_ctx_buffer;
    alf_data_int16_t *in = (alf_data_int16_t *) p_input_buffer;
    alf_data_byte_t *out = (alf_data_byte_t *) p_output_buffer;
    unsigned int size = p_parm->num_data;
    unsigned int i;
    /* it is just a simple table lookup */
    for (i = 0; i < size; i++) {
        out[i] = p_ctx->table[(unsigned short) in[i]];
    }
    return 0;
}
You can use the ALF framework to convert the sequential code into a parallel algorithm. The data set must be partitioned into smaller work blocks, which are then assigned to the different task instances running on the accelerators. Each invocation of a computational kernel on a task instance finds the maximum or minimum value in the work block assigned to it. After all the work blocks are processed, there are multiple intermediate best values in the contexts of the task instances. The ALF runtime then calls the context merge function on the accelerators to reduce the intermediate results into the final results.
[Figure: the input data is partitioned into work blocks (WB); the task instances process the work blocks, and the intermediate results kept in each instance's context are combined in merge steps to produce the final result.]
Source code
You can find the source code in the sample directory task_context/min_max.
Computational kernel
The following code section shows the computational kernel for this application. The computational kernel finds the maximum and minimum values in the provided input buffer then updates the task_context with those values.
/* ---------------------------------------------- */
/* the accelerator side code                      */
/* ---------------------------------------------- */

/* the computation kernel function */
int comp_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                void *p_input_buffer, void *p_output_buffer,
                void *p_inout_buffer, unsigned int current_count,
                unsigned int total_count)
{
    my_task_context_t *p_ctx = (my_task_context_t *) p_task_context;
    my_wb_parms_t *p_parm = (my_wb_parms_t *) p_parm_ctx_buffer;
    alf_data_int32_t *a = (alf_data_int32_t *) p_input_buffer;
    unsigned int size = p_parm->num_data;
    unsigned int i;
    /* update the best known values in the context buffer */
    for (i = 0; i < size; i++) {
        if (a[i] > p_ctx->max)
            p_ctx->max = a[i];
        else if (a[i] < p_ctx->min)
            p_ctx->min = a[i];
    }
    return 0;
}
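The listing above omits the context merge step that reduces the per-instance results. As a sketch of what a merge callback such as alf_accel_task_context_merge could look like for this example (the context layout is assumed from the kernel above, and the function name is hypothetical):

```c
/* Assumed context layout, matching the min-max kernel above. */
typedef int alf_data_int32_t;
typedef struct {
    alf_data_int32_t min;
    alf_data_int32_t max;
} my_task_context_t;

/* Hedged sketch of a context-merge callback for the min-max example:
 * fold the finished instance's best values (p_from) into the target
 * context (p_into), mirroring alf_accel_task_context_merge's
 * (to-be-merged, target) parameter order. */
int my_context_merge(void *p_from, void *p_into)
{
    my_task_context_t *a = (my_task_context_t *) p_from;
    my_task_context_t *b = (my_task_context_t *) p_into;
    if (a->max > b->max)
        b->max = a->max;
    if (a->min < b->min)
        b->min = a->min;
    return 0;
}
```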
The dot product requires the element-wise products of the vectors to be accumulated. In the case where a single work block can hold all the data for vectors Ai and Bi, the calculation is straightforward. However, when the size of the vectors is too big to fit into a single work block, the straightforward approach does not work. For example, with the Cell BE processor, there are only 256 KB of local memory on the SPE. It is impossible to store two double precision vectors when the dimension exceeds 16384. In addition, if you consider the extra memory needed for double buffering, code storage, and so on, you are only able to handle two vectors of about 7500 double precision floating-point elements each (7500 * 8 [size of double] * 2 [two vectors] * 2 [double buffering] = 240 KB of local storage). In this case, large vectors must be partitioned into multiple work blocks, and each work block can only return the partial result of a complete dot product. You can choose to accumulate the partial results of these work blocks on the host to get the final result, but this is not an elegant solution, and performance is also affected. The better solution is to do these accumulations on the accelerators, and to do them in parallel. ALF provides the following two implementations for this problem:
v Implementation 1: Making use of task context and bundled work block distribution
v Implementation 2: Making use of multi-use work blocks together with task context or work block parameter/context buffers on page 126, with the limitation that accelerator-side data partitioning is required
Source code
The source code for the two implementations is provided so that you can compare it with the shipped samples in the following directories:
v task_context/dot_prod directory: Implementation 1, task context and bundled work block distribution
v task_context/dot_prod_multi directory: Implementation 2, multi-use work blocks together with task context or work block parameter/context buffers
Implementation 1: Making use of task context and bundled work block distribution
For this implementation, all the work blocks for a single vector are put into one bundle. All the work blocks in a bundle are assigned to one task instance in the order of enqueuing, so it is possible to use the task context to accumulate the intermediate results and write out the final result when the last work block is processed. The accumulator in the task context is initialized to zero each time a new work block bundle starts. When the last work block in the bundle is processed, the accumulated value in the task context is copied to the output buffer and then written back to the result area.
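A computational kernel following this scheme might look as follows. This is a sketch only: the context and parameter structures, the is_last_block flag, and the kernel name are hypothetical, not the shipped sample's definitions, although the kernel signature mirrors the one used elsewhere in this appendix:

```c
/* Hypothetical context and work block parameter layouts. */
typedef struct { double accumulator; } my_dot_ctx_t;
typedef struct {
    unsigned int num_elements;   /* elements of A (and of B) in this WB */
    unsigned int is_last_block;  /* nonzero for the bundle's last WB    */
} my_dot_wb_parms_t;

/* Accumulate a partial dot product in the task context;
   emit the final value when the bundle's last block arrives. */
int dot_comp_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                    void *p_input_buffer, void *p_output_buffer,
                    void *p_inout_buffer,
                    unsigned int current_count, unsigned int total_count)
{
    my_dot_ctx_t *ctx = (my_dot_ctx_t *) p_task_context;
    my_dot_wb_parms_t *parm = (my_dot_wb_parms_t *) p_parm_ctx_buffer;
    double *a = (double *) p_input_buffer;  /* slice of A          */
    double *b = a + parm->num_elements;     /* slice of B follows  */
    unsigned int i;

    (void) p_inout_buffer; (void) current_count; (void) total_count;

    for (i = 0; i < parm->num_elements; i++)
        ctx->accumulator += a[i] * b[i];

    if (parm->is_last_block) {
        /* copy the final result out and reset for the next bundle */
        *(double *) p_output_buffer = ctx->accumulator;
        ctx->accumulator = 0.0;
    }
    return 0;
}
```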
Appendix B. Examples
125
Figure 13. Making use of task context and bundled work block distribution
Implementation 2: Making use of multi-use work blocks together with task context or work block parameter/context buffers
The second implementation is based on multi-use work blocks and work block parameter and context buffers. A multi-use work block is similar to an iteration operation: the accelerator side runtime repeatedly processes the work block until the specified number of iterations is reached. By using accelerator-side data partitioning, it is possible to access different input data during each iteration of the work block. This means the application can handle data too large for a single work block to cover because of local storage limitations. Also, the parameter and context buffer of a multi-use work block is preserved across iterations, so you can choose to keep the accumulator in this buffer instead of in the task context buffer. Both methods, using the task context and using the multi-use work block buffer, are equally valid.
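A kernel for the multi-use variant might keep the accumulator in the persistent parameter/context buffer, as sketched below. The structure layout and kernel name are hypothetical, and the sketch assumes current_count runs from 0 to total_count - 1 and that the accelerator-side partitioner has already fetched this iteration's slices of A and B into the input buffer:

```c
/* Hypothetical parameter/context buffer for a multi-use work
   block; it persists across iterations, so the running sum can
   live here instead of in the task context. */
typedef struct {
    unsigned int elems_per_iter;  /* elements of A (and B) per iteration */
    double accumulator;           /* running dot product                 */
} my_multi_parm_ctx_t;

/* Invoked once per iteration of the multi-use work block. */
int dot_multi_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                     void *p_input_buffer, void *p_output_buffer,
                     void *p_inout_buffer,
                     unsigned int current_count, unsigned int total_count)
{
    my_multi_parm_ctx_t *pc = (my_multi_parm_ctx_t *) p_parm_ctx_buffer;
    double *a = (double *) p_input_buffer;
    double *b = a + pc->elems_per_iter;
    unsigned int i;

    (void) p_task_context; (void) p_inout_buffer;

    for (i = 0; i < pc->elems_per_iter; i++)
        pc->accumulator += a[i] * b[i];

    if (current_count == total_count - 1) {   /* last iteration */
        *(double *) p_output_buffer = pc->accumulator;
        pc->accumulator = 0.0;
    }
    return 0;
}
```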
Figure 14. Making use of multi-use work blocks together with task context or work block parameter/context buffers
The two cases lay out the overlapped buffer differently: in the C = A + B case, the output C overlaps the input A (ALF_BUF_OVL_IN / ALF_BUF_OVL_OUT); in the A = A + B case, A is both input and output in a single overlapped buffer (ALF_BUF_OVL_INOUT).
Matrix setup
Note: The code is similar to the matrix_add example, see Matrix add - host data partitioning example on page 117. Only the relevant code listing is shown here.
/* ---------------------------------------------- */
/* matrix declaration for the two cases           */
/* ---------------------------------------------- */
#ifdef C_A_B // C = A + B
alf_data_int32_t mat_a[ROW_SIZE][COL_SIZE]; // the matrix a
alf_data_int32_t mat_b[ROW_SIZE][COL_SIZE]; // the matrix b
alf_data_int32_t mat_c[ROW_SIZE][COL_SIZE]; // the matrix c
#else // A = A + B
alf_data_int32_t mat_a[ROW_SIZE][COL_SIZE]; // the matrix a
alf_data_int32_t mat_b[ROW_SIZE][COL_SIZE]; // the matrix b
#endif
#ifdef C_A_B // C = A + B
// offset at 0
// the output data C is overlapped with input data A
alf_wb_dtl_begin(wb_handle, ALF_BUF_OVL_OUT, 0); // offset at 0, overlapped with A
alf_wb_dtl_entry_add(wb_handle, &mat_c[i][0],
                     wb_parm.num_data*COL_SIZE, ALF_DATA_INT32); // C
alf_wb_dtl_end(wb_handle);
#else // A = A + B
// the input and output data A
alf_wb_dtl_begin(wb_handle, ALF_BUF_OVL_INOUT, 0); // offset 0
alf_wb_dtl_entry_add(wb_handle, &mat_a[i][0],
                     wb_parm.num_data*COL_SIZE, ALF_DATA_INT32); // A
alf_wb_dtl_end(wb_handle);
// the input data B is placed after A
alf_wb_dtl_begin(wb_handle, ALF_BUF_OVL_IN,
                 wb_parm.num_data*COL_SIZE*sizeof(alf_data_int32_t)); // placed after A
alf_wb_dtl_entry_add(wb_handle, &mat_b[i][0],
                     wb_parm.num_data*COL_SIZE, ALF_DATA_INT32); // B
alf_wb_dtl_end(wb_handle);
#endif
alf_wb_parm_add(wb_handle, (void *)&wb_parm,
                sizeof(wb_parm)/sizeof(unsigned int), ALF_DATA_INT32, 0);
alf_wb_enqueue(wb_handle);
}
Accelerator code
The accelerator code is shown here. In both cases, the output sc can be set to the same location in accelerator memory as sa and sb.
/* ---------------------------------------------- */
/* the accelerator side code                      */
/* ---------------------------------------------- */
/* the computation kernel function */
int comp_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                void *p_input_buffer, void *p_output_buffer,
                void *p_inout_buffer, unsigned int current_count,
                unsigned int total_count)
{
    unsigned int i, cnt;
    int *sa, *sb, *sc;
    my_wb_parms_t *p_parm = (my_wb_parms_t *) p_parm_ctx_buffer;
    cnt = p_parm->num_data * COL_SIZE;
    sa = (int *) p_inout_buffer;
    sb = sa + cnt;
    sc = sa;
    for (i = 0; i < cnt; i++)
        sc[i] = sa[i] + sb[i];
    return 0;
}
A two-stage pipeline is used to solve the problem so that the random number generation and the simulation can be parallelized:
v The first stage generates random numbers using a pseudo-random number generator
v The second stage simulates the movements
Because ALF currently does not support pipelining directly, a pipeline structure is simulated using task dependency. There are two tasks, which correspond to the two pipeline stages. For this problem, each simulation step only needs a small amount of data, such as a motion vector. Although ALF does not have a strict limit on how small the data can be, it is better to use larger data blocks for performance reasons. Therefore, the data for thousands of simulation steps is grouped into a single work block.
Stage 1 task: For the stage 1 task, a lagged Fibonacci pseudo-random number generator (PRNG) is used for simplicity. In this example, the algorithm is as follows:
S(n) = ( S(n-j) XOR S(n-k) ) mod 2^32
where 0 < j < k, with k = 71 and j = 65. The algorithm requires a history buffer of length k to save the older values. In this implementation, the task context is used for the history buffer. Because no input data is needed, the work block for this task only has output data.
Stage 2 task: For the stage 2 task, the task context is used to save the current status of the simulation, including the position of the object and the number of hits on the walls. The work block in this stage only has input data, which is the PRNG results from stage 1.
Another goal of pipelining is to overlap the execution of different stages for better performance. However, this requires work-block-level task synchronization between stages, which is not yet supported by ALF. The alternative approach is to use multiple tasks, where each task only handles a percentage of the work blocks for the whole simulation, so each stage now consists of multiple tasks. For each chunk of work blocks, the following two tasks are created:
v The stage 1 task generates the random numbers and writes the results to a temporary buffer
v The stage 2 task reads the random numbers from the temporary buffer to do the simulation
A task dependency is set between the two tasks to make sure the stage 2 task gets the correct results from the stage 1 task. Because both the PRNG and the simulation have internal states, you have to pass the state data between succeeding tasks of the same stage to preserve the states. The approach described here lets the tasks for the same stage share the same task context buffer; dependencies are used to make sure the tasks access the shared task context in the correct order. Figure 17 (a) shows the task dependency as described above. To further reduce the use of temporary intermediate buffers, you can apply double buffering or multi-buffering to the intermediate buffers. The task dependency graph for double buffering the intermediate buffers is shown in Figure 17 (b), where a new dependency is added between the (n-2)th stage 2 task and the nth stage 1 task to make sure the stage 1 task does not overwrite data that may still be in use by the previous stage 2 task. This is what is implemented in the sample code.
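The stage 1 generator described above can be modeled on the host with a small lagged Fibonacci implementation. The seeding scheme below is illustrative only (any nonzero fill works for a demonstration) and is not the shipped sample's; in the ALF sample the history buffer lives in the stage 1 task context:

```c
#include <stdint.h>

#define LF_K 71
#define LF_J 65

/* Circular history buffer holding the last LF_K values. */
typedef struct {
    uint32_t hist[LF_K];
    unsigned int pos;   /* index of the oldest entry, S(n-k) */
} lf_prng_t;

/* Illustrative seeding via a linear congruential fill. */
void lf_seed(lf_prng_t *g, uint32_t seed)
{
    unsigned int i;
    for (i = 0; i < LF_K; i++)
        g->hist[i] = seed = seed * 1664525u + 1013904223u;
    g->pos = 0;
}

/* S(n) = ( S(n-j) XOR S(n-k) ) mod 2^32; the modulus is
   implicit in 32-bit unsigned arithmetic. */
uint32_t lf_next(lf_prng_t *g)
{
    uint32_t s_n_k = g->hist[g->pos];
    uint32_t s_n_j = g->hist[(g->pos + (LF_K - LF_J)) % LF_K];
    uint32_t s_n = s_n_j ^ s_n_k;
    g->hist[g->pos] = s_n;          /* replace oldest with newest */
    g->pos = (g->pos + 1) % LF_K;
    return s_n;
}
```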
Figure 17. Task dependency graphs for the two-stage pipeline: (a) basic dependency between stage 1 (S1) and stage 2 (S2) tasks sharing per-stage contexts (C1, C2); (b) double buffering of the intermediate buffers (buffers 0 and 1)
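The dependency rules just described can be captured as two small index functions. This is only a model of the ordering constraints; in the sample the dependencies are registered through the ALF task API, and the function names here are hypothetical:

```c
/* Stage 2 task n always waits for stage 1 task n, which
   produces its input random numbers. */
int stage2_waits_for_stage1(int n)
{
    return n;
}

/* With num_buffers intermediate buffers used in rotation
   (2 for double buffering), stage 1 task n must wait for
   stage 2 task n - num_buffers, the last user of the same
   buffer, before overwriting it. Returns -1 when there is
   no such predecessor (the first num_buffers chunks). */
int stage1_waits_for_stage2(int n, int num_buffers)
{
    return (n >= num_buffers) ? n - num_buffers : -1;
}
```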
Source code
The complete source code can be found in the sample directory pipe_line.
Table 2. ALF debug hooks (continued)
_ALF_TASK_DESC_DESTROY_ENTRY: task_desc_handle
_ALF_TASK_DESC_DESTROY_EXIT_INTERVAL: retcode
_ALF_TASK_DESC_SET_INT32_ENTRY: task_desc_handle, field, value
_ALF_TASK_DESC_SET_INT32_EXIT_INTERVAL: retcode
_ALF_TASK_DESC_SET_INT64_ENTRY: task_desc_handle, field, value
_ALF_TASK_DESC_SET_INT64_EXIT_INTERVAL: retcode
_ALF_TASK_DESTROY_ENTRY: task_handle
_ALF_TASK_DESTROY_EXIT_INTERVAL: retcode
_ALF_TASK_EVENT_HANDLER_REGISTER_ENTRY: task_handle, task_event_handler, p_data, data_size, event_mask
_ALF_TASK_EVENT_HANDLER_REGISTER_EXIT_INTERVAL: retcode
_ALF_TASK_FINALIZE_ENTRY: task_handle
_ALF_TASK_FINALIZE_EXIT_INTERVAL: retcode
_ALF_TASK_QUERY_ENTRY: task_handle, p_unfinished_wbs, p_total_wbs
_ALF_TASK_QUERY_EXIT_INTERVAL: unfinished_wbs, total_wbs, retcode
_ALF_TASK_WAIT_ENTRY: task_handle, time_out
_ALF_TASK_WAIT_EXIT_INTERVAL: retcode
_ALF_WB_CREATE_ENTRY: task_handle, work_block_type, repeat_count, p_wb_handle
_ALF_WB_CREATE_EXIT_INTERVAL: wb_handle, retcode
_ALF_WB_DTL_SET_BEGIN_ENTRY: wb_handle, buffer_type, offset_to_the_local_buffer
_ALF_WB_DTL_SET_BEGIN_EXIT_INTERVAL: retcode
_ALF_WB_DTL_SET_END_ENTRY: wb_handle
_ALF_WB_DTL_SET_END_EXIT_INTERVAL: retcode
_ALF_WB_DTL_SET_ENTRY_ADD_ENTRY: wb_handle, p_address, size_of_data, data_type
_ALF_WB_DTL_SET_ENTRY_ADD_EXIT_INTERVAL: retcode
_ALF_WB_ENQUEUE_ENTRY: wb_handle
_ALF_WB_ENQUEUE_EXIT_INTERVAL: retcode
_ALF_WB_PARM_ADD_ENTRY: wb_handle, pdata, size_of_data, data_type, address_alignment
_ALF_WB_PARM_ADD_EXIT_INTERVAL: retcode
Table 3. ALF performance hooks
_ALF_GENERIC_PERFORM_HOST: long1, long2, long3, long4, long5, long6, long7, long8, long9, long10
_ALF_GENERIC_PERFORM_SPU: long1, long2, long3, long4, long5, long6, long7, long8, long9, long10
_ALF_HOST_COUNTERS: alf_task_creates, alf_task_waits, alf_wb_enqueues, thread_total_count, thread_reuse_count, x
_ALF_HOST_TIMERS: alf_runtime, alf_accel_utilize, x1, x2
_ALF_SPU_COUNTERS: alf_input_bytes, alf_output_bytes, alf_workblock_total, double_buffer_used, x1, x2
_ALF_SPU_TIMERS: alf_lqueue_empty, alf_wait_data_dtl, alf_prep_input_dtl, alf_prep_output_dtl, alf_compute_kernel, alf_spu_task_run, x1, x2
_ALF_TASK_BEFORE_EXEC_INTERVAL: task_flag
_ALF_TASK_CONTEXT_MERGE_INTERVAL: task_flag
_ALF_TASK_CONTEXT_SWAP_INTERVAL: task_flag
_ALF_TASK_EXEC_INTERVAL: task_flag
_ALF_THREAD_RUN_INTERVAL: task_flag
_ALF_WAIT_FIRST_WB_INTERVAL: task_flag, wb_flag, packet_flag
_ALF_WB_COMPUTE_KERNEL_INTERVAL: task_flag, wb_flag, wb_idx
_ALF_WB_DATA_TRANSFER_WAIT_INTERVAL: task_flag, wb_flag, wb_idx
_ALF_WB_DTL_PREPARE_IN_INTERVAL: task_flag, wb_flag, wb_idx
_ALF_WB_DTL_PREPARE_OUT_INTERVAL: task_flag, wb_flag, wb_idx
_ALF_WB_LQUEUE_EMPTY_INTERVAL: task_flag, packet_flag
Table 4. ALF SPU hooks
_ALF_ACCEL_COMP_KERNEL_EXIT
_ALF_ACCEL_DTL_BEGIN_ENTRY
_ALF_ACCEL_DTL_BEGIN_EXIT
_ALF_ACCEL_DTL_END_ENTRY
_ALF_ACCEL_DTL_END_EXIT
_ALF_ACCEL_DTL_ENTRY_ADD_ENTRY
_ALF_ACCEL_DTL_ENTRY_ADD_EXIT
_ALF_ACCEL_INPUT_DTL_PREPARE_ENTRY
_ALF_ACCEL_INPUT_DTL_PREPARE_EXIT
_ALF_ACCEL_NUM_INSTANCES
Table 4. ALF SPU hooks (continued)
_ALF_ACCEL_OUTPUT_DTL_PREPARE_ENTRY: p_task_context, p_parm_ctx_buffer, p_io_container, current_iter, num_iter
_ALF_ACCEL_OUTPUT_DTL_PREPARE_EXIT: retcode
_ALF_ACCEL_TASK_CONTEXT_MERGE_ENTRY: p_task_context_to_be_merged, p_task_context
_ALF_ACCEL_TASK_CONTEXT_MERGE_EXIT: retcode
_ALF_ACCEL_TASK_CONTEXT_SETUP_ENTRY: p_task_context
_ALF_ACCEL_TASK_CONTEXT_SETUP_EXIT: retcode
_ALF_INSTANCES_ID: retcode
_ALF_SPE_GENERIC_DEBUG: long1, long2, long3, long4, long5, long6, long7, long8, long9, long10
ALF_QUERY_DTL_ADDR_ALIGN
ALF_ACCEL_TYPE_SPE
ALF_EXIT_POLICY_FORCE
ALF_EXIT_POLICY_WAIT
ALF_EXIT_POLICY_TRY
ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE
ALF_TASK_DESC_WB_IN_BUF_SIZE
ALF_TASK_DESC_WB_OUT_BUF_SIZE
ALF_TASK_DESC_WB_INOUT_BUF_SIZE
ALF_TASK_DESC_NUM_DTL_ENTRIES
ALF_TASK_DESC_TSK_CTX_SIZE
ALF_TASK_DESC_PARTITION_ON_ACCEL
Table 5. Attributes and descriptions (continued)
ALF_TASK_DESC_ACCEL_KERNEL_REF_L: Specifies the name of the computational kernel function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
ALF_TASK_DESC_ACCEL_INPUT_DTL_REF_L: Specifies the name of the input list prepare function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
ALF_TASK_DESC_ACCEL_OUTPUT_DTL_REF_L: Specifies the name of the output list prepare function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
ALF_TASK_DESC_ACCEL_CTX_SETUP_REF_L: Specifies the name of the context setup function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
ALF_TASK_DESC_ACCEL_CTX_MERGE_REF_L: Specifies the name of the context merge function. This is usually a string constant that the accelerator runtime can use to find the corresponding function.
ALF_TASK_ATTR_SCHED_FIXED: The task must be scheduled on num_instances of accelerators.
ALF_TASK_ATTR_WB_CYCLIC: The work blocks for this task are distributed to the accelerators in a cyclic order as specified by num_accelerators.
ALF_TASK_EVENT_TYPE_T: Defined as follows:
v ALF_TASK_EVENT_FINALIZED: This task has been finalized. No additional work block can be added to this task.
v ALF_TASK_EVENT_READY: This task has been scheduled for execution.
v ALF_TASK_EVENT_FINISHED: All work blocks in this task have been processed.
v ALF_TASK_EVENT_INSTANCE_START: One new instance of the task is started on an accelerator after the event handler returns.
v ALF_TASK_EVENT_INSTANCE_END: One existing instance of the task ends and the task context has been copied out to the original location or has been merged into another current instance of the same task.
v ALF_TASK_EVENT_DESTROY: The task is destroyed explicitly.
ALF_WB_SINGLE: Create a single-use work block.
ALF_WB_MULTI: Create a multi-use work block. This work block type is only supported when the task is created with ALF_PARTITION_ON_ACCELERATOR.
ALF_BUF_IN: Input to the input-only buffer.
ALF_BUF_OUT: Output from the output-only buffer.
ALF_BUF_OVL_IN: Input to the overlapped buffer.
ALF_BUF_OVL_OUT: Output from the overlapped buffer.
Table 5. Attributes and descriptions (continued)
ALF_BUF_OVL_INOUT: Input to and output from the overlapped buffer.
ALF_DATASET_READ_ONLY: The dataset is read-only. Work blocks referencing the data in this buffer cannot update this buffer as an output buffer.
ALF_DATASET_WRITE_ONLY: The dataset is write-only. Work blocks referencing the data in this buffer as input data result in indeterminate behavior.
ALF_DATASET_READ_WRITE: The dataset allows both read and write access. Work blocks can use this buffer as input, output, and/or inout buffers.
Document location
Links to documentation for the SDK are provided on the developerWorks Web site located at:
http://www-128.ibm.com/developerworks/power/cell/
Click on the Docs tab. The following documents are available, organized by category:
Architecture
v Cell Broadband Engine Architecture v Cell Broadband Engine Registers v SPU Instruction Set Architecture
Standards
v C/C++ Language Extensions for Cell Broadband Engine Architecture v SPU Assembly Language Specification v SPU Application Binary Interface Specification v SIMD Math Library Specification for Cell Broadband Engine Architecture v Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification
Programming
v Cell Broadband Engine Programming Handbook v Programming Tutorial v SDK for Multicore Acceleration Version 3.0 Programmers Guide
Library
v SPE Runtime Management library
v SPE Runtime Management library Version 1.2 to Version 2.0 Migration Guide
v Accelerated Library Framework for Cell Programmers Guide and API Reference
v Accelerated Library Framework for Hybrid-x86 Programmers Guide and API Reference
v Data Communication and Synchronization for Cell Programmers Guide and API Reference
v Data Communication and Synchronization for Hybrid-x86 Programmers Guide and API Reference
v SIMD Math Library Specification
v Monte Carlo Library API Reference Manual (Prototype)
Installation
v SDK for Multicore Acceleration Version 3.0 Installation Guide
PowerPC Base
v PowerPC Architecture Book, Version 2.02
  Book I: PowerPC User Instruction Set Architecture
  Book II: PowerPC Virtual Environment Architecture
  Book III: PowerPC Operating Environment Architecture
v PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual Version 2.07c
Notices
This information was developed for products and services offered in the U.S.A. The manufacturer may not offer the products, services, or features discussed in this document in other countries. Consult the manufacturer's representative for information on the products and services currently available in your area. Any reference to the manufacturer's product, program, or service is not intended to state or imply that only that product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any intellectual property right of the manufacturer may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any product, program, or service. The manufacturer may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the manufacturer. For license inquiries regarding double-byte (DBCS) information, contact the Intellectual Property Department in your country or send inquiries, in writing, to the manufacturer. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: THIS INFORMATION IS PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication.
The manufacturer may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to Web sites not owned by the manufacturer are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this product and use of those Web sites is at your own risk. The manufacturer may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact the manufacturer. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.
Copyright IBM Corp. 2006, 2007 - DRAFT
The licensed program described in this information and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement, IBM License Agreement for Machine Code, or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment. Information concerning products not produced by this manufacturer was obtained from the suppliers of those products, their published announcements or other publicly available sources. This manufacturer has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to products not produced by this manufacturer. Questions on the capabilities of products not produced by this manufacturer should be addressed to the suppliers of those products. All statements regarding the manufacturer's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. The manufacturer's prices shown are the manufacturer's suggested retail prices, are current and are subject to change without notice. Dealer prices may vary. This information is for planning purposes only. The information herein is subject to change before the products described become available. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to the manufacturer, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. The manufacturer, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. CODE LICENSE AND DISCLAIMER INFORMATION: The manufacturer grants you a nonexclusive copyright license to use all programming code examples from which you can generate similar function tailored to your own specific needs. SUBJECT TO ANY STATUTORY WARRANTIES WHICH CANNOT BE EXCLUDED, THE MANUFACTURER, ITS PROGRAM DEVELOPERS AND SUPPLIERS, MAKE NO WARRANTIES OR CONDITIONS EITHER EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT, REGARDING THE PROGRAM OR TECHNICAL SUPPORT, IF ANY. UNDER NO CIRCUMSTANCES IS THE MANUFACTURER, ITS PROGRAM DEVELOPERS OR SUPPLIERS LIABLE FOR ANY OF THE FOLLOWING, EVEN IF INFORMED OF THEIR POSSIBILITY: 1. LOSS OF, OR DAMAGE TO, DATA; 2. SPECIAL, INCIDENTAL, OR INDIRECT DAMAGES, OR FOR ANY ECONOMIC CONSEQUENTIAL DAMAGES; OR 3. LOST PROFITS, BUSINESS, REVENUE, GOODWILL, OR ANTICIPATED SAVINGS. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR LIMITATION OF DIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, SO SOME OR ALL OF THE ABOVE LIMITATIONS OR EXCLUSIONS MAY NOT APPLY TO YOU. Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved. If you are viewing this information in softcopy, the photographs and color illustrations may not appear.
Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM developerWorks PowerPC PowerPC Architecture Resource Link Adobe, Acrobat, Portable Document Format (PDF), and PostScript are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, other countries, or both. Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product or service names may be trademarks or service marks of others.
Glossary
ABI
Application Binary Interface. This is the standard that a program follows to ensure that code generated by different compilers (and perhaps linked with various third-party libraries) runs correctly on the Cell BE. The ABI defines data types, register use, calling conventions, and object formats.
C++
C++ is an object-oriented programming language, derived from C.
cache
High-speed memory close to a processor. A cache usually contains recently-accessed data or instructions, but certain cache-control instructions can lock, evict, or otherwise modify the caching of data or instructions.
accelerator
General or special purpose processing element in a hybrid system. An accelerator can have a multi-level architecture with both host elements and accelerator elements. An accelerator, as defined here, is a hierarchy with potentially multiple layers of hosts and accelerators. An accelerator element is always associated with one host. Aside from its direct host, an accelerator cannot communicate with other processing elements in the system. The memory subsystem of the accelerator can be viewed as distinct and independent from a host. This is referred to as the subordinate in a cluster collective.
CBEA
Cell Broadband Engine Architecture. A new architecture that extends the 64-bit PowerPC Architecture. The CBEA and the Cell Broadband Engine are the result of a collaboration between Sony, Toshiba, and IBM, known as STI, formally started in early 2001.
Cell BE processor
The Cell BE processor is a multi-core broadband processor based on IBMs Power Architecture.
ALF
Accelerated Library Framework. This is an API that provides a set of services to help programmers solve data parallel problems on a hybrid system. ALF supports the multiple-program-multiple-data (MPMD) programming style, where multiple programs can be scheduled to run on multiple accelerator elements at the same time. ALF offers programmers an interface to partition data across a set of parallel processes without requiring architecturally-dependent code.
cluster
A collection of nodes.
compiler
A program that translates a high-level programming language, such as C++, into executable code.
API
Application Program Interface.
computational kernel
The part of the accelerator code that performs a stateless computation task on one piece of input data and generates the corresponding output results.
ATO
Atomic Unit. Part of an SPE's MFC. It is used to synchronize with other processor units.
Broadband Engine
See CBEA.
compute task
An accelerator execution image that consists of a compute kernel linked with the accelerated library framework accelerator runtime library.
address (RA) that accesses real (physical) memory. The maximum size of the effective address space is 2^64 bytes.
exception
An error, unusual condition, or external signal that may alter a status bit and will cause a corresponding interrupt, if the interrupt is enabled. See interrupt.
data set
An ALF data set is a logical set of data buffers.
DMA
Direct Memory Access. A technique for using a special-purpose controller to generate the source and destination addresses for a memory or I/O transfer.
FFT
Fast Fourier Transform.
GCC
GNU C compiler
DMA command
A type of MFC command that transfers or controls the transfer of a memory location containing data or instructions. See MFC.
handle
A handle is an abstraction of a data object; usually a pointer to a structure.
DMA list
A sequence of transfer elements (or list entries) that, together with an initiating DMA-list command, specify a sequence of DMA transfers between a single area of LS and discontinuous areas in main storage. Such lists are stored in an SPE's LS, and the sequence of transfers is initiated with a DMA-list command such as getl or putl. DMA-list commands can only be issued by programs running on an SPE, but the PPE or other devices can create and store the lists in an SPE's LS. DMA lists can be used to implement scatter-gather functions between main storage and the LS.
host
A general purpose processing element in a hybrid system. A host can have multiple accelerators attached to it. This is often referred to as the master node in a cluster collective.
HTTP
Hypertext Transfer Protocol. A method used to transfer or convey information on the World Wide Web.
Hybrid
A module comprised of two Cell BE cards connected via an AMD Opteron processor.
DMA-list command
A type of MFC command that initiates a sequence of DMA transfers specified by a DMA list stored in an SPE's LS. See DMA list.
IDL
Interface definition language. Not the same as CORBA IDL.
EA
See Effective address.
kernel
The core of an operating system, which provides services for other parts of the operating system and provides multitasking. In the Linux or UNIX operating system, the kernel can easily be rebuilt to incorporate enhancements, which then become operating-system wide.
effective address
An address generated or used by a program to reference memory. A memory-management unit translates an effective address (EA) to a virtual address (VA), which it then translates to a real
latency
The time between when a function (or instruction) is called and when it returns. Programmers often optimize code so that functions return as quickly as possible; this is referred to as the low-latency approach to optimization. Low-latency designs often leave the processor data-starved, and performance can suffer.
MPMD
Multiple Program Multiple Data. Parallel programming model with several distinct executable programs operating on different sets of data.
node
A node is a functional unit in the system topology, consisting of one host together with all the accelerators connected as children in the topology (this includes any children of accelerators).
local store
The 256-KB local store associated with each SPE. It holds both instructions and data.
LS
See local store.
PDF
Portable document format.
main storage
The effective-address (EA) space. It consists physically of real memory (whatever is external to the memory-interface controller, including both volatile and nonvolatile memory), SPU LSs, memory-mapped registers and arrays, memory-mapped I/O devices (all I/O is memory-mapped), and pages of virtual memory that reside on disk. It does not include caches or execution-unit register files. See also local store.
pipelining
A technique that breaks operations, such as instruction processing or bus transactions, into smaller stages so that a subsequent stage in the pipeline can begin before the previous stage has completed.
PPE
PowerPC Processor Element. The general-purpose processor in the Cell BE processor.
main thread
The main thread of the application. In many cases, Cell BE architecture programs are multi-threaded using multiple SPEs running concurrently. A typical scenario is that the application consists of a main thread that creates as many SPE threads as needed and the application organizes them.
PPU
PowerPC Processor Unit. The part of the PPE that includes the execution units, memory-management unit, and L1 cache.
MFC
Memory Flow Controller. Part of an SPE that provides two main functions: it moves data via DMA between the SPE's local store (LS) and main storage, and it synchronizes the SPU with the rest of the processing units in the system.
process
A process is a standard UNIX-type process with a separate address space.
program section
See code section.
SDK
Software Development Kit for Multicore Acceleration. A complete package of tools for application development.
section
See code section.
vector
An instruction operand containing a set of data elements packed into a one-dimensional array. The elements can be fixed-point or floating-point values. Most Vector/SIMD Multimedia Extension and SPU SIMD instructions operate on vector operands. Vectors are also called SIMD operands or packed operands.
SIMD
Single Instruction Multiple Data. Processing in which a single instruction operates on multiple data elements that make up a vector data-type. Also known as vector processing. This style of programming implements data-level parallelism.
virtual storage
See virtual memory.
work block
A basic unit of data to be managed by the framework. It consists of one piece of the partitioned data, the corresponding output buffer, and related parameters. A work block is associated with a task. A task can have as many work blocks as necessary.
SPU
Synergistic Processor Unit. The part of an SPE that executes instructions from its local store (LS).
SPU
Synergistic Processor Unit. The part of an SPE that executes instructions from its local store (LS).
workload
A set of code samples in the SDK that characterizes the performance of the architecture, algorithms, libraries, tools, and compilers.
synchronization
The order in which storage accesses are performed.
work queue
An internal data structure of the accelerated library framework that holds the lists of work blocks to be processed by the active instances of the compute task.
thread
A sequence of instructions executed within the global context (shared memory space and other global resources) of a process that has created (spawned) the thread. Multiple threads (including multiple instances of the same sequence of instructions) can run simultaneously if each thread has its own architectural state (registers, program counter, flags, and other program-visible state). Each SPE can support only a single thread at any one time. Multiple SPEs can simultaneously support multiple threads. The PPE supports two threads at any one time, without the need for software to create the threads. It does this by duplicating the architectural state. A thread is typically created by the pthreads library.
x86
Generic name for Intel-based processors.
XLC
The IBM optimizing C/C++ compiler.