Writing Data Parallel Algorithms On GPUs - Ade Miller - CppCon 2014
WRITING DATA PARALLEL ALGORITHMS ON GPUs WITH C++ AMP
Ade Miller
Technical Director, CenturyLink Cloud.
ABSTRACT
• Today most PCs, tablets and phones support multi-core processors, and most programmers have some familiarity with writing (task) parallel code. Many of those same devices also have GPUs, but writing code to run on a GPU is harder. Or is it?
copy_if
adjacent_find
Map-Reduce
parallel_for_each
$out_n = f(in_n)$
struct doubler_functor
{
int operator()(const int& x) const { return x * 2; }
};
std::vector<int> input(1024);
std::iota(begin(input), end(input), 1);
std::vector<int> output(1024);
F(a) = 2 x a
input:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
output: 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
concurrency::array_view<int> input_av(int(input.size()), input);
concurrency::array_view<int> output_av(int(output.size()), output);

amp_stl_algorithms::transform(begin(input_av), end(input_av),
                              begin(output_av), doubler_functor());
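For comparison, here is a minimal sketch of the same map written directly with parallel_for_each rather than the library call; the function name double_all and the way the array_views are passed in are illustrative, not part of the talk's sample code.

#include <amp.h>

// Element-wise map on the accelerator: output_av[i] = 2 * input_av[i].
void double_all(concurrency::array_view<const int> input_av,
                concurrency::array_view<int> output_av)
{
    output_av.discard_data();     // output is write-only, so don't copy it in
    concurrency::parallel_for_each(output_av.extent,
        [=](concurrency::index<1> idx) restrict(amp)
    {
        output_av[idx] = input_av[idx] * 2;   // F(a) = 2 x a
    });
    output_av.synchronize();      // copy the results back to the host
}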
$s = \sum_{n=0}^{N-1} a_n$
// Serial reduction on the CPU: sum all the elements of data.
std::vector<int> data(1024);
int s = 0;
for (int i : data)
{
    s += i;
}
REMEMBER… $(a + b) + c \neq a + (b + c)$ for floating-point addition.
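A quick host-side illustration of why the grouping matters for floating-point addition; the values here are chosen purely to make the rounding visible.

#include <iostream>

int main()
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    std::cout << (a + b) + c << "\n";   // prints 1: a + b is exactly 0, then + c
    std::cout << a + (b + c) << "\n";   // prints 0: c vanishes when added to b at this magnitude
}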
idx    = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
input  = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
stride = 8: 10 12 14 16 18 20 22 24
stride = 4: 28 32 36 40
stride = 2: 64 72
stride = 1: 136
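A minimal sketch of this strided reduction written directly against an array_view, with no tiles; reduce_simple is an illustrative name and the element count is assumed to be a power of two.

#include <amp.h>
#include <vector>

int reduce_simple(std::vector<int>& data)   // note: reduces in place, overwriting data
{
    concurrency::array_view<int> av(int(data.size()), data);
    // Each pass adds the top half of the remaining elements onto the bottom half.
    for (int stride = int(data.size()) / 2; stride > 0; stride /= 2)
    {
        concurrency::parallel_for_each(concurrency::extent<1>(stride),
            [=](concurrency::index<1> idx) restrict(amp)
        {
            av[idx] += av[idx + stride];
        });
    }
    return av[0];   // reading on the CPU synchronizes the data back from the accelerator
}

Every pass is a separate kernel launch over fewer and fewer elements, which is part of what motivates the tiled versions that follow.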
USING TILES AND TILE MEMORY: 2
Streaming Multiprocessor (SM) • TILES CAN ACCESS GLOBAL MEMORY
BUT WITH LIMITED SYNCHRONIZATION
Tile memory Tile memory Tile memory
~ KB ~ KB ~ KB (ATOMIC OPERATIONS).
• TILES CAN ACCESS LOCAL TILE MEMORY
1 2 3 USING SYNCHRONIZATION PRIMITIVES
FOR COORDINATING TILE MEMORY
ACCESS WITHIN A TILE.
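A minimal sketch of the tiling boilerplate this implies: tile the compute domain, stage data in tile_static memory, and synchronize the threads of a tile with a barrier. The tile size and names are illustrative, and the extent is assumed to be an exact multiple of the tile size.

#include <amp.h>

static const int tile_size = 8;

void tiled_skeleton(concurrency::array_view<const int> source,
                    concurrency::array_view<int> per_tile_results)
{
    concurrency::parallel_for_each(source.extent.tile<tile_size>(),
        [=](concurrency::tiled_index<tile_size> tidx) restrict(amp)
    {
        // Fast per-tile storage, visible only to the threads in this tile.
        tile_static int tile_data[tile_size];
        tile_data[tidx.local[0]] = source[tidx.global[0]];

        // Every thread in the tile waits here before any of them reads tile_data.
        tidx.barrier.wait();

        // ... per-tile work on tile_data goes here ...

        if (tidx.local[0] == 0)
            per_tile_results[tidx.tile[0]] = tile_data[0];
    });
}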
global = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
input  = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
stride = 1: 3 7 11 15 19 23 27 31
stride = 2: 10 26 42 58
stride = 4: 36 100
per-tile results: 36 100
}
per_tile_results_av.discard_data();

// Read the final tile results out and accumulate on the CPU.
std::vector<int> partialResult(element_count);
concurrency::copy(source.section(0, element_count), partialResult.begin());
source.discard_data();
return std::accumulate(partialResult.cbegin(), partialResult.cend(), 0);
}
global = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
input  = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
tile   = [0]
local  = [0] [1] [2] [3] [4] [5] [6] [7]
pairwise sums: 3 7 11 15 19 23 27 31
stride = 1: 10 26 42 58
stride = 2: 36 100
stride = 4: 136
concurrency::parallel_for_each(ext.tile<tile_size>(),
    [=](tiled_index<tile_size> tidx) restrict(amp)
{
    int tid = tidx.local[0];
    tile_static int tile_data[tile_size];
    // (Each thread is assumed to have loaded its value into tile_data here.)
    tidx.barrier.wait();

    // Interleaved addressing with a modulo test: simple, but on every pass most
    // threads fail the test, so the threads in the tile diverge.
    for (int stride = 1; stride < tile_size; stride *= 2)
    {
        if (tid % (2 * stride) == 0)
            tile_data[tid] += tile_data[tid + stride];
        tidx.barrier
            .wait_with_tile_static_memory_fence();
    }

tile_data after each pass: 3 7 11 15 19 23 27 31 → 10 26 42 58 → 36 100
    // Alternative: compute the index from contiguous thread ids, so the active
    // threads stay packed together and the modulo (and most divergence) goes away.
    for (int stride = 1; stride < tile_size; stride *= 2)
    {
        int index = 2 * stride * tid;
        if (index < tile_size)
            tile_data[index] += tile_data[index + stride];
        tidx.barrier
            .wait_with_tile_static_memory_fence();
    }

tile_data after each pass: 3 7 11 15 19 23 27 31 → 10 26 42 58 → 36 100
    // Alternative: sequential addressing. Start with a large stride and halve it,
    // so the active threads are always the contiguous lower half of the tile.
    for (int stride = tile_size / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            tile_data[tid] += tile_data[tid + stride];
        tidx.barrier
            .wait_with_tile_static_memory_fence();
    }

tile_data after each pass: 3 7 11 15 19 23 27 31 → 22 30 38 46 → 60 76
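Putting the pieces together, here is a hedged sketch of a single-pass tiled reduce along these lines, using the sequential addressing loop; reduce_tiled and its helpers are illustrative names rather than the talk's sample code, and the input size is assumed to be a multiple of the tile size.

#include <amp.h>
#include <numeric>
#include <vector>

static const int tile_size = 512;

int reduce_tiled(const std::vector<int>& data)
{
    concurrency::array_view<const int> source(int(data.size()), data);
    const int tile_count = int(data.size()) / tile_size;
    std::vector<int> partial(tile_count);
    concurrency::array_view<int> per_tile_results_av(tile_count, partial);
    per_tile_results_av.discard_data();

    concurrency::parallel_for_each(source.extent.tile<tile_size>(),
        [=](concurrency::tiled_index<tile_size> tidx) restrict(amp)
    {
        const int tid = tidx.local[0];
        tile_static int tile_data[tile_size];
        tile_data[tid] = source[tidx.global[0]];
        tidx.barrier.wait();

        // Sequential addressing: the active threads stay contiguous, so the
        // loop body does not diverge within a tile.
        for (int stride = tile_size / 2; stride > 0; stride /= 2)
        {
            if (tid < stride)
                tile_data[tid] += tile_data[tid + stride];
            tidx.barrier.wait_with_tile_static_memory_fence();
        }

        if (tid == 0)
            per_tile_results_av[tidx.tile[0]] = tile_data[0];
    });

    // Read the per-tile results back and accumulate them on the CPU.
    per_tile_results_av.synchronize();
    return std::accumulate(partial.cbegin(), partial.cend(), 0);
}

A production version would also cope with sizes that are not a multiple of the tile size and with very large inputs (by reducing the per-tile results in further passes), which is what a library call such as the amp_algorithms::reduce shown next gives you.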
concurrency::array_view<int> input_av(int(input.size()), input);
int result =
amp_algorithms::reduce(input_av, amp_algorithms::plus<int>());
$a_i = \sum_{n=0}^{i} a_n$
// Serial inclusive scan (prefix sum) on the CPU, in place.
std::vector<int> data(1024);
for (size_t i = 1; i < data.size(); ++i)
{
    data[i] += data[i - 1];
}
idx    = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
input  = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
stride = 1: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
stride = 2: 1 3 6 10 14 18 22 26 30 34 38 42 46 50 54 58
stride = 4: 1 3 6 10 15 21 28 36 44 52 60 68 76 84 92 100
stride = 8: 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136
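A hedged sketch of this naive scan written with parallel_for_each; it ping-pongs between two buffers so each pass reads only the previous pass's values. Function and buffer names are illustrative, and the size is assumed to be a power of two.

#include <amp.h>
#include <utility>
#include <vector>

// Naive (Hillis-Steele style) inclusive scan: log2(n) passes, each adding the
// element `stride` places to the left.
std::vector<int> scan_naive(const std::vector<int>& input)
{
    const int n = int(input.size());
    std::vector<int> a(input), b(n);
    concurrency::array_view<int> in(n, a);
    concurrency::array_view<int> out(n, b);
    out.discard_data();

    for (int stride = 1; stride < n; stride *= 2)
    {
        concurrency::parallel_for_each(in.extent,
            [=](concurrency::index<1> idx) restrict(amp)
        {
            const int i = idx[0];
            out[idx] = (i >= stride) ? in[idx] + in[i - stride] : in[idx];
        });
        std::swap(in, out);   // this pass's output becomes the next pass's input
    }

    // `in` now wraps whichever buffer holds the final result; copy it out.
    std::vector<int> result(n);
    concurrency::copy(in, result.begin());
    return result;
}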
$a_i = \sum_{n=0}^{i-1} a_n$
// Serial exclusive scan on the CPU: each output element is the sum of all
// preceding input elements.
std::vector<int> input(1024);
std::vector<int> output(1024);
output[0] = 0;
for (size_t i = 1; i < output.size(); ++i)
{
    output[i] = output[i - 1] + input[i - 1];
}
Up-sweep (reduce) phase:
idx    = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
input  = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
stride = 1: 1 3 3 7 5 11 7 15 9 19 11 23 13 27 15 31
stride = 2: 1 3 3 10 5 11 7 26 9 19 11 42 13 27 15 58
stride = 4: 1 3 3 10 5 11 7 36 9 19 11 42 13 27 15 100
stride = 8: 1 3 3 10 5 11 7 36 9 19 11 42 13 27 15 136
Down-sweep phase (the last element is first zeroed):
idx    = [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
start  = 1 3 3 10 5 11 7 36 9 19 11 42 13 27 15 0
stride = 8: 1 3 3 10 5 11 7 0 9 19 11 42 13 27 15 36
stride = 4: 1 3 3 0 5 11 7 10 9 19 11 36 13 27 15 78
stride = 2: 1 0 3 3 5 10 7 21 9 36 11 55 13 78 15 105
stride = 1: 0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120
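A hedged sketch of this work-efficient (up-sweep then down-sweep) exclusive scan for a single tile's worth of data held in tile_static memory; scanning larger inputs needs a further pass over the per-tile totals, which is omitted here. Names and the tile size are illustrative, and the input extent is assumed to be a multiple of 2 * tile_size.

#include <amp.h>

static const int tile_size = 8;   // threads per tile; each tile scans 2 * tile_size elements

void scan_exclusive_per_tile(concurrency::array_view<const int> input,
                             concurrency::array_view<int> output)
{
    output.discard_data();
    concurrency::parallel_for_each(
        concurrency::extent<1>(input.extent[0] / 2).tile<tile_size>(),
        [=](concurrency::tiled_index<tile_size> tidx) restrict(amp)
    {
        const int tid  = tidx.local[0];
        const int base = tidx.tile[0] * 2 * tile_size;
        tile_static int data[2 * tile_size];

        // Each thread loads two elements of its tile into tile_static memory.
        data[2 * tid]     = input[base + 2 * tid];
        data[2 * tid + 1] = input[base + 2 * tid + 1];
        tidx.barrier.wait();

        // Up-sweep (reduce) phase: build partial sums in place.
        for (int stride = 1; stride < 2 * tile_size; stride *= 2)
        {
            const int i = 2 * stride * (tid + 1) - 1;
            if (i < 2 * tile_size)
                data[i] += data[i - stride];
            tidx.barrier.wait_with_tile_static_memory_fence();
        }

        // Zero the last element, then down-sweep to distribute the partial sums.
        if (tid == 0)
            data[2 * tile_size - 1] = 0;
        tidx.barrier.wait_with_tile_static_memory_fence();

        for (int stride = tile_size; stride > 0; stride /= 2)
        {
            const int i = 2 * stride * (tid + 1) - 1;
            if (i < 2 * tile_size)
            {
                const int t      = data[i - stride];
                data[i - stride] = data[i];
                data[i]         += t;
            }
            tidx.barrier.wait_with_tile_static_memory_fence();
        }

        // Write this tile's exclusive scan back to global memory.
        output[base + 2 * tid]     = data[2 * tid];
        output[base + 2 * tid + 1] = data[2 * tid + 1];
    });
}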
concurrency::array_view<int> input_av(int(input.size()), input);
copy_if adjacent_find
Map-Reduce
parallel_for_each
• Compute
  • Minimize the divergence in your code
  • Reduce the number of stalled or idle threads
  • Think very carefully before resorting to atomic operations
• http://www.gregcons.com/cppamp
• HSA Foundation:
  http://www.hsafoundation.com/bringing-camp-beyond-windows-via-clang-llvm/