Parallel Programming in OpenCL: Advanced Graphics & Image Processing
Rafał Mantiuk
Computer Laboratory, University of Cambridge
Single Program Multiple Data (SPMD)
Consider the following vector addition example.

Serial program: one program completes the entire task.

for( i = 0:11 ) {
    C[ i ] = A[ i ] + B[ i ]
}

[Figure: the whole of A and B added element-wise to produce C]
Multiple copies of the same program execute on different data in parallel.

SPMD program: multiple copies of the same program run on different chunks of the data.

for( i = 0:3 )  { C[ i ] = A[ i ] + B[ i ] }
for( i = 4:7 )  { C[ i ] = A[ i ] + B[ i ] }
for( i = 8:11 ) { C[ i ] = A[ i ] + B[ i ] }

[Figure: each copy adds one chunk of A and B to produce the corresponding chunk of C]
Multi-threaded (CPU)

// tid is the thread id
// P is the number of cores, N is the number of elements
for(i = tid*N/P; i < (tid+1)*N/P; i++)
    C[i] = A[i] + B[i]

[Figure: with P = 4 threads and N = 16, T0 processes elements 0-3, T1 elements 4-7, T2 elements 8-11, T3 elements 12-15]

Massively multi-threaded (GPU)

// tid is the thread id
C[tid] = A[tid] + B[tid]

[Figure: one thread per element, T0..T15 processing elements 0..15]
[Figure: parallel programming frameworks positioned between CPU and GPU: OpenMP, OpenACC, OpenCL, CUDA and Metal; OpenCL targets both CPUs and GPUs]
OpenCL

OpenCL is a framework for writing parallel code for CPUs, GPUs, DSPs, FPGAs and other processors.

Initially developed by Apple, now supported by AMD, IBM, Qualcomm, Intel and Nvidia (reluctantly).

Versions
    Latest: OpenCL 2.2
        OpenCL C++ kernel language
        SPIR-V as intermediate representation for kernels
            Vulkan uses the same Standard Portable Intermediate Representation
        Supported by AMD, Intel
    Mostly supported: OpenCL 1.2
        Nvidia, OSX
OpenCL platforms and drivers

To run OpenCL code you need:
    Generic ICD loader
        Included in the OS
    Installable Client Driver (ICD)
        From Nvidia, Intel, etc.
    This applies to Windows and Linux; there is only one platform on Mac

To develop OpenCL code you need:
    OpenCL headers/libraries
        Included in the SDKs
            Nvidia – CUDA Toolkit
            Intel OpenCL SDK
        But lightweight options are also available
Programming OpenCL

OpenCL natively offers a C99 API
    But there is also a standard OpenCL C++ API wrapper
        Strongly recommended – it reduces the amount of code

Programming OpenCL is similar to programming shaders in OpenGL
    Host code runs on the CPU and invokes kernels
    Kernels are written in a C-like programming language
        In many respects similar to GLSL
    Kernels are passed to the API as strings and compiled at runtime
        Kernels are usually stored in text files
        Kernels can be precompiled into SPIR from OpenCL 2.1
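
For instance, a minimal vector-addition kernel can be embedded in the host program as a plain string (a hypothetical example; the kernel and argument names are illustrative):

// OpenCL C kernel source, held as a C string and
// compiled at runtime by the OpenCL driver
const char* kernel_source =
    "__kernel void vector_add(__global const int* A,  \n"
    "                         __global const int* B,  \n"
    "                         __global int* C) {      \n"
    "    int i = get_global_id(0);                    \n"
    "    C[i] = A[i] + B[i];                          \n"
    "}                                                \n";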
Example: Step 1 - Select device

Get all Platforms → Select Platform → Get all Devices → Select Device
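
A sketch of this step using the C++ wrapper API (assuming the cl.hpp header and a GPU device; error handling omitted):

#include <CL/cl.hpp>
#include <vector>

std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);           // get all platforms
cl::Platform platform = platforms[0];    // select a platform

std::vector<cl::Device> devices;
platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);  // get all devices
cl::Device device = devices[0];          // select a device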
Example: Step 2 - Build program

Create Context → Load sources (usually from files) → Create Program → Build Program
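
A sketch of this step, continuing the example above (the file name vector_add.cl is hypothetical):

#include <fstream>
#include <string>
#include <iostream>

cl::Context context({ device });

// load kernel sources from a text file
std::ifstream file("vector_add.cl");
std::string src((std::istreambuf_iterator<char>(file)),
                std::istreambuf_iterator<char>());

cl::Program::Sources sources;
sources.push_back({ src.c_str(), src.length() });

cl::Program program(context, sources);
if (program.build({ device }) != CL_SUCCESS)
    std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device) << "\n";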
Example: Step 3 - Create buffers and copy memory

Create Buffers → Create Queue → Enqueue Memory Copy
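
A sketch of this step (the buffer size of 10 ints is illustrative):

int A[10], B[10];
// ... fill A and B with input data ...

// allocate device memory
cl::Buffer buffer_A(context, CL_MEM_READ_ONLY,  sizeof(int) * 10);
cl::Buffer buffer_B(context, CL_MEM_READ_ONLY,  sizeof(int) * 10);
cl::Buffer buffer_C(context, CL_MEM_WRITE_ONLY, sizeof(int) * 10);

cl::CommandQueue queue(context, device);

// copy host data to the device (CL_TRUE = blocking copy)
queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int) * 10, A);
queue.enqueueWriteBuffer(buffer_B, CL_TRUE, 0, sizeof(int) * 10, B);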
Example: Step 4 - Execute kernel and retrieve the results

Create Kernel → Set Kernel Arguments → Enqueue Kernel → Enqueue Memory Copy
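
A sketch of this step, assuming the vector_add kernel shown earlier:

cl::Kernel kernel(program, "vector_add");
kernel.setArg(0, buffer_A);
kernel.setArg(1, buffer_B);
kernel.setArg(2, buffer_C);

// one work-item per element; let the driver pick the work-group size
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(10), cl::NullRange);

int C[10];
queue.enqueueReadBuffer(buffer_C, CL_TRUE, 0, sizeof(int) * 10, C);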
Execution model

Each kernel executes over a 1D, 2D or 3D array (NDRange)
The array is split into work-groups
Work-items (threads) in each work-group share some local memory
A kernel can query its position in the NDRange:
    get_global_id(dim)
    get_group_id(dim)
    get_local_id(dim)
Work-items are not bound to any memory entity (unlike GLSL shaders)
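
A small illustrative kernel that records what these queries return for each work-item (the kernel name and encoding are hypothetical):

__kernel void where_am_i(__global int* out) {
    int gid = get_global_id(0);   // index within the whole NDRange
    int grp = get_group_id(0);    // index of this work-item's work-group
    int lid = get_local_id(0);    // index within the work-group
    out[gid] = grp * 1000 + lid;  // encode both values for inspection
}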
Memory model

Host memory
    Usually CPU memory; the device does not have access to that memory
Global memory [__global]
    Device memory, for storing large data
Constant memory [__constant]
Local memory [__local]
    Fast, accessible to all work-items (threads) within a work-group
Private memory [__private]
    Accessible to a single work-item (thread)
Memory objects

[Diagram: class hierarchy – cl::Memory is the base of cl::Buffer and cl::Image; cl::Image1DBuffer is an image backed by a buffer]

Buffer
    Like an ArrayBuffer in OpenGL
    Accessed directly via C pointers
Image
    Like a Texture in OpenGL
    Accessed via texture look-up functions
    Can interpolate values, clamp, etc.
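
On the host side, the two kinds of objects might be created like this (a sketch; the dimensions and pixel format are illustrative):

size_t width = 512, height = 512;

// Buffer: raw linear memory, indexed with C pointers in the kernel
cl::Buffer buf(context, CL_MEM_READ_WRITE,
               sizeof(float) * width * height);

// Image: texture-like object with an explicit pixel format,
// read in the kernel via look-up functions such as read_imagef()
cl::Image2D img(context, CL_MEM_READ_ONLY,
                cl::ImageFormat(CL_R, CL_FLOAT), width, height);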
Programming model

Data-parallel programming
    Each NDRange element is assigned to a work-item (thread)
Task-parallel programming
    Multiple different kernels can be executed in parallel
Each kernel can use the vector types of the device (float4, etc.)
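
For example, a hypothetical kernel operating on float4 vectors performs four additions per work-item:

__kernel void vector_add4(__global const float4* A,
                          __global const float4* B,
                          __global float4* C) {
    int i = get_global_id(0);
    C[i] = A[i] + B[i];   // component-wise float4 addition
}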
Command queue

queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int)*10, A);

CL_TRUE – blocking call: it returns only once the copy has completed
CL_FALSE – non-blocking call: it returns immediately and the copy is performed asynchronously
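
With a non-blocking copy, the host must not touch the source array until the queue has caught up; a sketch, reusing the buffers from the earlier steps:

// non-blocking: returns immediately, the copy runs asynchronously
queue.enqueueWriteBuffer(buffer_A, CL_FALSE, 0, sizeof(int) * 10, A);
// ... A must remain valid and unmodified here ...
queue.finish();   // block until all enqueued commands have completed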
Thread Mapping

By using different mappings, the same thread can be assigned to access different data elements. The examples below show three different possible mappings of threads to data (assuming the thread id is used to access an element).

Mapping 1:
int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);

Mapping 2:
int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);

Mapping 3 (assuming 2x2 groups):
int group_size = get_local_size(0) * get_local_size(1);
int tid = get_group_id(1) * get_num_groups(0) * group_size +
          get_group_id(0) * group_size +
          get_local_id(1) * get_local_size(0) +
          get_local_id(0);

Thread IDs for a 4x4 NDRange:

Mapping 1:        Mapping 2:        Mapping 3:
 0  1  2  3        0  4  8 12        0  1  4  5
 4  5  6  7        1  5  9 13        2  3  6  7
 8  9 10 11        2  6 10 14        8  9 12 13
12 13 14 15        3  7 11 15       10 11 14 15

From: OpenCL 1.2 University Kit - http://developer.amd.com/partners/university-programs/
Thread Mapping

Consider a serial matrix multiplication algorithm. Thread mapping 2: with an NxM index space, each work-item computes one element of C (a kernel sketch follows below).

Mapping for C:
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
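
A sketch of such a kernel, assuming row-major storage, C of size NxM, A of size NxP, B of size PxM, and one work-item per element of C (names and argument order are illustrative):

__kernel void matmul(__global const float* A,
                     __global const float* B,
                     __global float* C,
                     const int M, const int P) {
    int i = get_global_id(0);       // row of C
    int j = get_global_id(1);       // column of C
    float sum = 0.0f;
    for (int k = 0; k < P; k++)
        sum += A[i * P + k] * B[k * M + j];
    C[i * M + j] = sum;
}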
The following slides are based on AMD’s OpenCL™ Optimization Case Study: Simple Reductions
http://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
Reduction tree for the min operation
__kernel
void reduce_min(__global float* buffer,
barrier ensures that all threads
__local float* scratch, (work units) in the local group
__const int length,
__global float* result) { reach that point before execution
int global_index = get_global_id(0);
continue
int local_index = get_local_id(0);
// Load data into local memory Each iteration of the for loop
if (global_index < length) { computes next level of the
scratch[local_index] = buffer[global_index];
} else { reduction pyramid
scratch[local_index] = INFINITY;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0; offset >>= 1) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine :
other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
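
On the host side, the __local scratch array is sized with cl::Local when the kernel arguments are set; a sketch assuming a work-group size of 256, with buffer, result, length and global_size set up as in the earlier steps:

cl::Kernel kernel(program, "reduce_min");
kernel.setArg(0, buffer);                           // __global input data
kernel.setArg(1, cl::Local(sizeof(float) * 256));   // __local scratch, one float per work-item
kernel.setArg(2, length);
kernel.setArg(3, result);                           // one partial minimum per work-group

queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(global_size), cl::NDRange(256));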
Multistage reduction

Local memory is usually limited (e.g. 50 kB), which restricts the maximum size of the array that can be processed. Therefore, large arrays need to be processed in multiple stages: the result of each local-memory reduction is stored in an array, and then this array is reduced in the next stage.
Two-stage reduction

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            const int length,
            __global float* result) {
    // ... sequential stage, then tree reduction
    // in local memory (complete sketch below)
}
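
A sketch of the complete kernel, following the AMD case study cited above (details may differ from the original):

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            const int length,
            __global float* result) {
    // Stage 1: each work-item sequentially reduces a strided chunk
    // of the input, keeping the running minimum in private memory
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }

    // Stage 2: tree reduction in local memory, as in reduce_min
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}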
Reduction performance CPU/GPU