
WRITING DATA PARALLEL
ALGORITHMS ON GPUs
WITH C++ AMP

Ade Miller
Technical Director, CenturyLink Cloud.
ABSTRACT
• TODAY MOST PCS, TABLETS AND PHONES SUPPORT MULTI-CORE PROCESSORS,
AND MOST PROGRAMMERS HAVE SOME FAMILIARITY WITH WRITING (TASK)
PARALLEL CODE. MANY OF THOSE SAME DEVICES ALSO HAVE GPUS, BUT
WRITING CODE TO RUN ON A GPU IS HARDER. OR IS IT?

GETTING TO GRIPS WITH GPU PROGRAMMING IS REALLY ABOUT
UNDERSTANDING THINGS IN A DATA PARALLEL WAY. THIS TALK WILL
LOOK AT SOME OF THE COMMON PATTERNS FOR IMPLEMENTING
ALGORITHMS ON TODAY'S GPUS, USING EXAMPLES FROM THE C++
AMP ALGORITHMS LIBRARY. ALONG THE WAY IT WILL COVER SOME
OF THE UNIQUE ASPECTS OF WRITING CODE FOR GPUS AND
CONTRAST THEM WITH MORE CONVENTIONAL CODE RUNNING
ON A CPU.

© Ade Miller, September 2014


I AM NOT A PROFESSIONAL
I DO THIS FOR FUN

© Ade Miller, September 2014


WHAT YOU’LL LEARN
• A BETTER UNDERSTANDING OF HOW TO WRITE CODE FOR GPUS
• SOME MORE C++ AMP
• PASS ON SOME OF THE THINGS I LEARNT
  • WHEN WRITING THE BOOK, CASE STUDIES AND SAMPLES
  • DEVELOPING THE C++ AMP ALGORITHMS LIBRARY (AAL)

WHAT YOU’LL NOT LEARN


• HOW TO BE A PERFORMANCE GURU.
• ALL OF THE DEEP DARK SECRETS OF GPUS

© Ade Miller, September 2014


ALGORITHM FAMILY TREE

remove_if is_sorted count_if

copy_if adjacent_find

Map-Reduce

radix_sort fill transform reduce

scan generate transform reduce

parallel_for_each

© Ade Miller, September 2014


TRANSFORM / MAP
GPU PROGRAMMING 101
TRANSFORM / MAP

outₙ = f(inₙ)
struct doubler_functor
{
    int operator()(const int& x) const { return x * 2; }
};

std::vector<int> input(1024);
std::iota(begin(input), end(input), 1);
std::vector<int> output(1024);

std::transform(begin(input), end(input), begin(output),
               doubler_functor());
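The same map can also be expressed with a lambda instead of a named functor; a minimal sketch (not from the slides):

// Equivalent call using a lambda in place of doubler_functor.
std::transform(begin(input), end(input), begin(output),
               [](int x) { return x * 2; });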

© Ade Miller, September 2014


TRANSFORM

1 → 1 MAPPINGS ARE TRIVIAL WITH GPUs


idx [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

F(a) = 2 x a
+ + + + + + + + + + + + + + + +
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

Lines represent read or write memory accesses.
+   Red circles are an operation executed by a thread.
Blue boxes are memory locations in global memory and their current value.
Green boxes are memory locations in tile memory and their current value.

© Ade Miller, September 2014


TRANSFORM WITH AAL
struct doubler_functor
{
    int operator()(const int& x) const restrict(amp, cpu)
    {
        return x * 2;
    }
};

std::vector<int> input(1024);
std::iota(begin(input), end(input), 1);
std::vector<int> output(1024);

concurrency::array_view<const int> input_av(input);
concurrency::array_view<int> output_av(output);
output_av.discard_data();

amp_stl_algorithms::transform(begin(input_av), end(input_av),
                              begin(output_av), doubler_functor());

© Ade Miller, September 2014


TRANSFORM UNDER THE HOOD
concurrency::array_view<const int> input_av(input);
concurrency::array_view<int> output_av(output);
output_av.discard_data();

auto doubler_func = doubler_functor();

concurrency::parallel_for_each(output_av.extent,
    [=](concurrency::index<1> idx) restrict(amp)
    {
        output_av[idx] = doubler_func(input_av[idx]);
    });

© Ade Miller, September 2014


REDUCE

© Ade Miller, September 2014


REDUCE

s = Σ aₙ   (n = 0 … N−1)
std::vector<int> data(1024);
int s = 0;
for (int i : data)
{
s += i;
}

s = std::accumulate(cbegin(data), cend(data), 0);

REMEMBER… (𝒂 + 𝒃) + 𝒄 ≠ 𝒂 + (𝒃 + 𝒄)
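The grouping really does change the answer for floating point types; a minimal sketch (not from the slides) showing two groupings of the same three values producing different sums:

#include <cstdio>

int main()
{
    float a = 1e8f, b = -1e8f, c = 1.0f;

    // In one grouping the large values cancel before c is added;
    // in the other, c is lost to rounding against 1e8f.
    float left  = (a + b) + c;   // 1.0
    float right = a + (b + c);   // 0.0

    std::printf("%f %f\n", left, right);
    return 0;
}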

© Ade Miller, September 2014


SIMPLE REDUCTION

idx [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

stride = 8
+ + + + + + + +
10 12 14 16 18 20 22 24

stride = 4
+ + + +
28 32 36 40

stride = 2
+ +
64 72

stride = 1
+
136

© Ade Miller, September 2014


SIMPLE REDUCTION WITH C++ AMP
template <typename T>
int reduce_simple(const concurrency::array_view<T>& source) const
{
int element_count = source.extent.size();
std::vector<T> result(1);
for (int stride = (element_count / 2); stride > 0; stride /= 2)
{
concurrency::parallel_for_each(concurrency::extent<1>(stride),
[=](concurrency::index<1> idx) restrict(amp)
{
source[idx] += source[idx + stride];
});
}
concurrency::copy(source.section(0, 1), result.begin());
return result[0];
}
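A minimal usage sketch (not from the slides); it assumes reduce_simple is callable from here and that the element count is a power of two, which the halving loop requires. Note that the reduction runs in place, so the input data is destroyed:

std::vector<int> data(1024);
std::iota(begin(data), end(data), 1);

concurrency::array_view<int> data_av(data);
int sum = reduce_simple(data_av);   // sum of 1..1024 = 524800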

© Ade Miller, September 2014


PROBLEMS WITH SIMPLE REDUCE
• DOESN'T USE TILE STATIC MEMORY
• ALL READS AND WRITES ARE TO GLOBAL MEMORY

• MOST OF THE THREADS ARE IDLE MOST OF THE TIME

© Ade Miller, September 2014


USING TILES AND TILE MEMORY: 1
• TILES ARE THE UNITS OF WORK USED TO SCHEDULE THE GPU'S STREAMING
  MULTIPROCESSORS.
• THE SCHEDULER ALLOCATES WORK BASED ON THE AVAILABLE RESOURCES.
• EACH PROCESSOR FURTHER DIVIDES WORK UP INTO GROUPS OF THREADS
  CALLED WARPS.
• THE WARP SCHEDULER HIDES (MEMORY) LATENCY WITH COMPUTATION FROM
  OTHER CORES.

[Figure: application kernels are split into tiles, which are scheduled onto
streaming multiprocessors and further divided into warps.]
© Ade Miller, September 2014
USING TILES AND TILE MEMORY: 2
• TILES CAN ACCESS GLOBAL MEMORY, BUT WITH LIMITED SYNCHRONIZATION
  (ATOMIC OPERATIONS).
• TILES CAN ACCESS LOCAL TILE MEMORY, USING SYNCHRONIZATION PRIMITIVES
  FOR COORDINATING TILE MEMORY ACCESS WITHIN A TILE.
• EACH SM ALSO HAS REGISTERS WHICH ARE USED BY THE TILES.
• YOUR APPLICATION MUST BALANCE RESOURCE USE TO MAXIMIZE (THREAD)
  OCCUPANCY.

[Figure: a streaming multiprocessor (SM) with per-tile tile memory (~KB) and
registers, backed by global memory (~GB).]
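A minimal sketch (not from the slides) of the tiled building blocks these bullets describe: a tiled extent, tile_static storage, and a barrier. The tile size of 256 and the buffer name data are illustrative choices.

const int tile_size = 256;                          // illustrative choice
std::vector<int> data(1024);
concurrency::array_view<const int> data_av(data);   // extent assumed divisible by tile_size

concurrency::parallel_for_each(
    data_av.extent.tile<tile_size>(),
    [=](concurrency::tiled_index<tile_size> tidx) restrict(amp)
    {
        // Per-tile scratch storage lives in the SM's tile memory (~KB).
        tile_static int scratch[tile_size];
        scratch[tidx.local[0]] = data_av[tidx.global[0]];

        // All threads in the tile wait here before reading each other's data.
        tidx.barrier.wait();

        // ... per-tile work using scratch[] goes here ...
    });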

© Ade Miller, September 2014


USE TILE MEMORY

global = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

tile = [0] [1]


local = [0] [1] [2] [3] [4] [5] [6] [7] [0] [1] [2] [3] [4] [5] [6] [7]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

stride = 1
+ + + + + + + +
3 7 11 15 19 23 27 31

stride = 2
+ + + +
10 26 42 58

stride = 4
+ +
36 100

36 100

© Ade Miller, September 2014


A TYPICAL TILED KERNEL
template <typename T, int tile_size = 64>
int reduce_tiled(concurrency::array_view<T>& source) const
{
    int element_count = source.extent.size();

    // Create a (global) array for storing per-tile results.
    concurrency::array<int, 1> per_tile_results(element_count / tile_size);
    concurrency::array_view<int, 1> per_tile_results_av(per_tile_results);
    per_tile_results_av.discard_data();

    while (element_count >= tile_size)
    {
        concurrency::extent<1> ext(element_count);
        concurrency::parallel_for_each(ext.tile<tile_size>(),
            [=](tiled_index<tile_size> tidx) restrict(amp)
        {
            int tid = tidx.local[0];

            // Load data into tile_static memory & wait.
            tile_static int tile_data[tile_size];
            tile_data[tid] = source[tidx.global[0]];
            tidx.barrier.wait();

            // Calculate result for each tile & wait.
            for (int stride = 1; stride < tile_size; stride *= 2)
            {
                if (tid % (2 * stride) == 0)
                    tile_data[tid] += tile_data[tid + stride];

                tidx.barrier.wait_with_tile_static_memory_fence();
            }

            // Write tile results back into the (global) array.
            if (tid == 0)
                per_tile_results_av[tidx.tile[0]] = tile_data[0];
        });
        element_count /= tile_size;
        std::swap(per_tile_results_av, source);
    }
    per_tile_results_av.discard_data();

    // Read the final tile results out and accumulate on the CPU.
    std::vector<int> partialResult(element_count);
    concurrency::copy(source.section(0, element_count), partialResult.begin());
    source.discard_data();
    return std::accumulate(partialResult.cbegin(), partialResult.cend(), 0);
}

© Ade Miller, September 2014


A TYPICAL TILED KERNEL
concurrency::extent<1> ext(element_count);
concurrency::parallel_for_each(ext.tile<tile_size>(),
    [=](tiled_index<tile_size> tidx) restrict(amp)
{
    int tid = tidx.local[0];

    // Each thread copies to tile_static memory and waits.
    tile_static int tile_data[tile_size];
    tile_data[tid] = source[tidx.global[0]];
    tidx.barrier.wait();

    // Each thread sums neighbors in tile_static memory and waits.
    for (int stride = 1; stride < tile_size; stride *= 2)
    {
        if (tid % (2 * stride) == 0)
            tile_data[tid] += tile_data[tid + stride];

        tidx.barrier.wait_with_tile_static_memory_fence();
    }

    // 1st tile thread copies result to global memory.
    if (tid == 0)
        per_tile_results_av[tidx.tile[0]] = tile_data[0];
});

© Ade Miller, September 2014


REDUCE IDLE THREADS

global = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

tile = [0]
local = [0] [1] [2] [3] [4] [5] [6] [7]
+ + + + + + + +
3 7 11 15 19 23 27 31

stride = 1
+ + + +
10 26 42 58

stride = 2
+ +
36 100

stride = 4
+
136

136

© Ade Miller, September 2014


REDUCE IDLE THREADS
concurrency::extent<1> ext(element_count / 2);

concurrency::parallel_for_each(ext.tile<tile_size>(),
    [=](tiled_index<tile_size> tidx) restrict(amp)
{
    int tid = tidx.local[0];
    tile_static int tile_data[tile_size];

    int rel_idx = tidx.tile[0] * (tile_size * 2) + tid;
    tile_data[tid] = source[rel_idx] + source[rel_idx + tile_size];

    tidx.barrier.wait();

    // Loop that does all the actual work...
});

© Ade Miller, September 2014


NEW PROBLEMS…
NOW THE KERNEL CONTAINS DIVERGENT CODE
• A TIGHT LOOP CONTAINING A CONDITIONAL

OUR MEMORY ACCESS PATTERNS ARE ALSO BAD

© Ade Miller, September 2014


MINIMIZE DIVERGENT CODE

[0] [1] [2] [3] [4] [5] [6] [7]
 3   7  11  15  19  23  27  31
10  26  42  58
36 100

for (int stride = 1; stride < tile_size; stride *= 2)
{
    if (tid % (2 * stride) == 0)
        tile_data[tid] += tile_data[tid + stride];

    tidx.barrier.wait_with_tile_static_memory_fence();
}

© Ade Miller, September 2014


MINIMIZE DIVERGENT CODE

[0] [1] [2] [3] [4] [5] [6] [7]
 3   7  11  15  19  23  27  31
10  26  42  58
36 100

for (int stride = 1; stride < tile_size; stride *= 2)
{
    int index = 2 * stride * tid;
    if (index < tile_size)
        tile_data[index] += tile_data[index + stride];

    tidx.barrier.wait_with_tile_static_memory_fence();
}

© Ade Miller, September 2014


REDUCE BANK CONFLICTS

[0] [1] [2] [3] [4] [5] [6] [7]
 3   7  11  15  19  23  27  31
22  30  38  46
60  76

for (int stride = (tile_size / 2); stride > 0; stride /= 2)
{
    if (tid < stride)
        tile_data[tid] += tile_data[tid + stride];

    tidx.barrier.wait_with_tile_static_memory_fence();
}
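Putting the two fixes together, a hedged sketch (not from the slides) of what the inner kernel might look like once the two-elements-per-thread load and the sequential-addressing loop are combined; it reuses the names (source, per_tile_results_av, element_count, tile_size) from the reduce_tiled kernel above:

concurrency::extent<1> ext(element_count / 2);
concurrency::parallel_for_each(ext.tile<tile_size>(),
    [=](concurrency::tiled_index<tile_size> tidx) restrict(amp)
{
    int tid = tidx.local[0];
    tile_static int tile_data[tile_size];

    // Each thread loads and adds two elements, so no thread is idle on entry.
    int rel_idx = tidx.tile[0] * (tile_size * 2) + tid;
    tile_data[tid] = source[rel_idx] + source[rel_idx + tile_size];
    tidx.barrier.wait();

    // Sequential addressing: active threads stay contiguous, which keeps
    // warps convergent and avoids tile memory bank conflicts.
    for (int stride = (tile_size / 2); stride > 0; stride /= 2)
    {
        if (tid < stride)
            tile_data[tid] += tile_data[tid + stride];

        tidx.barrier.wait_with_tile_static_memory_fence();
    }

    if (tid == 0)
        per_tile_results_av[tidx.tile[0]] = tile_data[0];
});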

© Ade Miller, September 2014


REDUCTION THE EASY WAY…
THE AAL INCLUDES AN IMPLEMENTATION OF REDUCE.

concurrency::array_view<int> input_av(input);

int result =
    amp_algorithms::reduce(input_av, amp_algorithms::plus<int>());

THIS IS ANOTHER EXAMPLE OF WHY YOU SHOULD USE LIBRARIES WHERE POSSIBLE.
LET SOMEONE ELSE DO THE WORK FOR YOU!

© Ade Miller, September 2014


SCAN

© Ade Miller, September 2014


INCLUSIVE SCAN

aᵢ = Σ aₙ   (n = 0 … i)

std::vector<int> data(1024);
for (size_t i = 1; i < data.size(); ++i)
{
    data[i] += data[i - 1];
}
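The same inclusive scan exists in the standard library as std::partial_sum (and, since C++17, std::inclusive_scan); a minimal sketch:

#include <numeric>

// In-place inclusive scan: data[i] becomes the sum of data[0..i].
std::partial_sum(begin(data), end(data), begin(data));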

© Ade Miller, September 2014


SIMPLE INCLUSIVE SCAN

idx [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

stride = 1
+ + + + + + + + + + + + + + +
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

stride = 2
+ + + + + + + + + + + + + +
1 3 6 10 14 18 22 26 30 34 38 42 46 50 54 58

stride = 4
+ + + + + + + + + + + +
1 3 6 10 15 21 28 36 44 52 60 68 76 84 92 100

stride = 8
+ + + + + + + +
1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136

© Ade Miller, September 2014


PROBLEMS WITH SIMPLE INCLUSIVE SCAN

SEQUENTIAL SCAN IS O(N)

SIMPLE SCAN IS O(N LOG2 N)

© Ade Miller, September 2014


EXCLUSIVE SCAN

aᵢ = Σ aₙ   (n = 0 … i−1)

std::vector<int> input(1024);
std::vector<int> output(1024);
output[0] = 0;
for (size_t i = 1; i < output.size(); ++i)
{
    output[i] = output[i - 1] + input[i - 1];
}
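Since C++17 the standard library provides the same operation as std::exclusive_scan; a minimal sketch:

#include <numeric>

// output[i] becomes the sum of input[0..i-1]; the trailing 0 is the initial value.
std::exclusive_scan(cbegin(input), cend(input), begin(output), 0);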

© Ade Miller, September 2014


EXCLUSIVE SCAN STEP 1: UP-SWEEP

idx [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 3 3 10 5 11 7 36 9 19 11 42 13 27 15 136
+

stride = 8

1 3 3 10 5 11 7 36 9 19 11 42 13 27 15 100
+ +

stride = 4

1 3 3 10 5 11 7 26 9 19 11 42 13 27 15 58
+ + + +
stride = 2

1 3 3 7 5 11 7 15 9 19 11 23 13 27 15 31
+ + + + + + + +
stride = 1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

© Ade Miller, September 2014


EXCLUSIVE SCAN STEP 2: DOWN-SWEEP

idx [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 3 3 10 5 11 7 36 9 19 11 42 13 27 15 0

© Ade Miller, September 2014


EXCLUSIVE SCAN STEP 2: DOWN-SWEEP

idx [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]

1 3 3 10 5 11 7 36 9 19 11 42 13 27 15 0

stride = 8
+
1 3 3 10 5 11 7 0 9 19 11 42 13 27 15 36

stride = 4
+ +
1 3 3 0 5 11 7 10 9 19 11 36 13 27 15 78

stride = 2
+ + + +
1 0 3 3 5 10 7 21 9 36 11 55 13 78 15 105

stride = 1
+ + + + + + + +
0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120
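The two diagrams above are the classic up-sweep / down-sweep (Blelloch) exclusive scan. A CPU-side sketch (not from the slides, and assuming a power-of-two element count) of the same two phases:

#include <vector>

void exclusive_scan_sketch(std::vector<int>& a)   // hypothetical helper
{
    const int n = static_cast<int>(a.size());     // assumed to be a power of two

    // Up-sweep: build partial sums in place; a[n-1] ends up holding the total.
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];

    // Down-sweep: clear the last element, then push prefixes back down the tree.
    a[n - 1] = 0;
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride)
        {
            int t = a[i - stride];
            a[i - stride] = a[i];
            a[i] += t;
        }
}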

© Ade Miller, September 2014


SCAN THE EASY WAY…
THE AAL INCLUDES A C++ AMP IMPLEMENTATION OF SCAN.

concurrency::array_view<int> input_av(input);

scan<warp_size, scan_mode::exclusive>(input_av, input_av,
    amp_algorithms::plus<int>());

THERE IS ALSO A DIRECTX SCAN WRAPPER.

© Ade Miller, September 2014


ALGORITHM FAMILY TREE

remove_if is_sorted count_if

copy_if adjacent_find

Map-Reduce

radix_sort fill transform reduce

scan generate transform reduce

parallel_for_each

© Ade Miller, September 2014


SUMMARY
• MEMORY
• MINIMIZE UNNECESSARY COPYING TO AND FROM THE GPU
• MAKE USE OF TILE STATIC MEMORY WHERE POSSIBLE
• THINK ABOUT MEMORY ACCESS PATTERNS IN BOTH GLOBAL AND TILE
MEMORY

• COMPUTE
• MINIMIZE THE DIVERGENCE IN YOUR CODE
• REDUCE THE NUMBER OF STALLED OR IDLE THREADS
• THINK VERY CAREFULLY BEFORE RESORTING TO ATOMIC OPERATIONS

© Ade Miller, September 2014


WHAT’S NEXT
AMP ALGORITHMS LIBRARY
• 0.9.4: RELEASE LATER THIS WEEK…
  • NUMEROUS NEW FEATURES INCLUDING: RADIX SORT & SCAN
• 0.9.5: BY END OF YEAR (I HOPE)
  • PORTING TO RUN ON CLANG AND LLVM
  • ADDING SOME MORE FEATURES
  • SOME KEY PERFORMANCE TUNING

IF YOU WANT TO HELP, I'M ALWAYS LOOKING FOR OSS DEVELOPERS.

(THANKS TO AMD FOR PROVIDING SOME HARDWARE FOR THE LINUX PORT)

© Ade Miller, September 2014


THE C++ AMP BOOK

• HTTP://WWW.GREGCONS.COM/CPPAMP
• WRITTEN BY KATE GREGORY & ADE MILLER
• COVERS ALL OF C++ AMP IN DETAIL, 350 PAGES
• ALL SOURCE CODE AVAILABLE ONLINE:
  HTTP://AMPBOOK.CODEPLEX.COM/
• EBOOK ALSO AVAILABLE FROM O'REILLY BOOKS (OR AMAZON)
© Ade Miller, September 2014


RESOURCES
• C++ AMP ALGORITHMS LIBRARY
  • HTTP://AMPALGORITHMS.CODEPLEX.COM/

• C++ AMP TEAM
  • BLOG: HTTP://BLOGS.MSDN.COM/B/NATIVECONCURRENCY/
  • SAMPLES: HTTP://BLOGS.MSDN.COM/B/NATIVECONCURRENCY/ARCHIVE/2012/01/30/C-AMP-SAMPLE-PROJECTS-FOR-DOWNLOAD.ASPX

• HSA FOUNDATION
  • HTTP://WWW.HSAFOUNDATION.COM/BRINGING-CAMP-BEYOND-WINDOWS-VIA-CLANG-LLVM/

• FORUMS TO ASK QUESTIONS
  • HTTP://STACKOVERFLOW.COM/QUESTIONS/TAGGED/C%2B%2B-AMP
  • HTTP://SOCIAL.MSDN.MICROSOFT.COM/FORUMS/EN/PARALLELCPPNATIVE/THREADS

© Ade Miller, September 2014


QUESTIONS?

© Ade Miller, September 2014
