Accelerating Data Parallelism in GPUs through APGAS

Under the guidance of B. Thanasekhar, Assistant Professor, MIT, Anna University.

Project by P. Subhasree (2010614021)

Objective:
In this project I intend to add data parallel constructs to high-level programming languages based on the APGAS programming model, and thereby enhance data parallelism in GPUs by programming them through APGAS. The constructs of an existing data-parallel programming environment can be reused for this purpose.

Design:

There are two types of parallelism: task parallelism and data parallelism. In this project, data parallelism is enhanced by adding a data parallel construct (MapReduce) to X10, mapping it to CUDA, and finally running it on the GPU.

Modules:
1. Building the X10/CUDA environment
2. Testing X10 support for CUDA
3. Building the data parallel skeleton
4. Testing the data parallel skeleton in CUDA

Flow diagram: (figure omitted)

Module 1: Building the X10/CUDA environment:

X10 is a parallel programming language based on the APGAS programming model. Among its many features is an API for CUDA, which lets X10 programs run on GPUs. In this module the X10/CUDA environment is set up and some programs are executed on the GPU through the CUDA API.

Module 2: Testing X10 support for CUDA:

In this module the kernel parameters and the thread and block configurations of the GPU are set from X10. Some CUDA constructs are mapped from X10 constructs as given below:

    S.No  Operation                Construct in X10 model          Construct in CUDA model
    1     Thread creation          async                           kernel<<< >>>
    2     Number of threads        for loop (dynamic allocation)   grid in CUDA
    3     Hierarchical execution   finish                          inbuilt
    4     Synchronization          clocks                          __syncthreads()
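For reference, the CUDA side of this mapping can be sketched as follows; the kernel name sqrtKernel, the 8x64 launch configuration, and the shared array are illustrative choices that mirror the X10 examples below, not code from this project:

    // One kernel launch creates all threads (cf. async); the grid and block
    // sizes fix the thread count (cf. the X10 for loop).
    __global__ void sqrtKernel(const float *in, float *out, int len) {
        __shared__ float shm[64];                          // per-block shared memory
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
        int tids = gridDim.x  * blockDim.x;                // total thread count
        // uniform loop bound so every thread reaches the barriers together
        for (int base = 0; base < len; base += tids) {
            int i = base + tid;
            if (i < len) shm[threadIdx.x] = in[i];
            __syncthreads();                     // barrier, cf. X10 clocks / next
            if (i < len) out[i] = sqrtf(shm[threadIdx.x]);
            __syncthreads();
        }
    }

    // Launch with 8 blocks of 64 threads:
    //   sqrtKernel<<<8, 64>>>(d_in, d_out, len);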

Some examples of how the threads, blocks and shared memory of the GPU can be expressed in X10:

Specifying blocks, threads and shared memory:

    at (p) {
        finish for ([b] in 0..num_blocks-1) async {
            val shm = new Array[Int](num_elements, init);
            finish for ([t] in 0..num_threads-1) async {
                ...
            }
        }
    }

Using clocks to represent the barrier construct:

    at (p) {
        finish for ([b] in 0..num_blocks-1) async {
            clocked finish for ([t] in 0..num_threads-1) clocked async {
                ...
                next;
                ...
            }
        }
    }

Sample code for testing the X10 support for CUDA:

    public class CUDAKernelTest {
        static def doTest1(init:Array[Float](1){rail}, recv:Array[Float](1){rail},
                           p:Place, len:Int) {
            // allocate a remote array in GPU memory
            val remote = CUDAUtilities.makeRemoteArray[Float](p, len, (Int)=>0.0 as Float);
            finish async at (p) @CUDA {
                finish for (block in 0..7) async {
                    clocked finish for (thread in 0..63) clocked async {
                        val tid = block*64 + thread;
                        val tids = 8*64;
                        for (var i:Int=tid; i<len; i+=tids) {
                            remote(i) = Math.sqrtf(@NoInline init(i));
                        }
                    }
                }
            }
            // copy the result back from the GPU
            finish Array.asyncCopy(remote, 0, recv, 0, len);
            // validate
            var success:Boolean = true;
            for ([i] in recv.region) {
                val oracle = i as Float;
                if (Math.abs(1 - (recv(i)*recv(i))/oracle) > 1E-6f) {
                    Console.ERR.println("recv("+i+"): "+recv(i)+" * "+recv(i)
                                        +" = "+(recv(i)*recv(i)));
                    success = false;
                }
            }
            Console.OUT.println((success?"SUCCESS":"FAIL")+" at "+p);
        }

        // calling in the main function:
        // doTest2(gpu);
    }

Module 3: Building the MapReduce skeleton:

MapReduce is a distributed programming framework originally proposed by Google to ease the development of web-search applications on large numbers of CPUs. It has since been applied to various domains such as data mining, machine learning, and bioinformatics. The framework hides the programming complexity of the GPU behind a simple MapReduce interface, and automatically distributes and executes tasks on the GPU.
This framework provides two primitive operations:

- a map function to process input key/value pairs and to generate intermediate key/value pairs
- a reduce function to merge all intermediate pairs associated with the same key

Map: (k1, v1) → (k2, v2)
Reduce: (k2, v2) → (k2, v3)

User-defined APIs:

    // MAP_COUNT counts the result size of the map function.
    void MAP_COUNT(void *key, void *val, int keySize, int valSize);
    // The map function.
    void MAP(void *key, void *val, int keySize, int valSize);
    // REDUCE_COUNT counts the result size of the reduce function.
    void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount);
    // The reduce function.
    void REDUCE(void *key, void *vals, int keySize, int valCount);
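To illustrate how these APIs fit together, here is a rough word-count-style sketch. The emit helpers (EMIT_INTERMEDIATE_COUNT, EMIT_INTERMEDIATE, EMIT_COUNT, EMIT) are assumed placeholders standing in for the framework's actual emit primitives, and the whole block is a sketch rather than real framework code:

    // Hypothetical emit primitives, declared only so the sketch is
    // self-contained; the real framework supplies its own equivalents.
    void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);
    void EMIT_INTERMEDIATE(void *key, void *val, int keySize, int valSize);
    void EMIT_COUNT(int keySize, int valSize);
    void EMIT(void *key, void *val, int keySize, int valSize);

    // Reserve space for one intermediate pair: the word plus an int count.
    void MAP_COUNT(void *key, void *val, int keySize, int valSize) {
        EMIT_INTERMEDIATE_COUNT(keySize, sizeof(int));
    }

    // Emit (word, 1) for each input word.
    void MAP(void *key, void *val, int keySize, int valSize) {
        int one = 1;
        EMIT_INTERMEDIATE(key, &one, keySize, sizeof(int));
    }

    // One output pair per distinct word.
    void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount) {
        EMIT_COUNT(keySize, sizeof(int));
    }

    // Sum all the 1s grouped under the same word.
    void REDUCE(void *key, void *vals, int keySize, int valCount) {
        int sum = 0;
        int *counts = (int *)vals;
        for (int i = 0; i < valCount; i++) sum += counts[i];
        EMIT(key, &sum, keySize, sizeof(int));
    }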

MapReduce workflow: (figure omitted)

Algorithm for the MapReduce implementation:

1. Prepare input key/value pairs in the main memory and store them into input arrays.
2. Initialize the parameters in the run-time configuration.
3. Copy the input arrays from the main memory to the GPU device memory.
4. Start the map stage on the GPU and store the intermediate key/value pairs into arrays.
5. If a sort is required, sort the intermediate results.
6. If a reduce is required, start the reduce stage on the GPU and generate the final results. Otherwise, the intermediate results are the final results.
7. Copy the final results from the GPU device memory to the main memory.
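To make these steps concrete, here is a self-contained CUDA sketch of the host-side workflow, using a toy map (square each element) and a block-wise sum as the reduce; the kernel names, the skipped sort, and the whole pipeline are illustrative simplifications, not the actual skeleton:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Step 4: the map stage; here, a toy map that squares each input.
    __global__ void mapStage(const float *in, float *inter, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) inter[i] = in[i] * in[i];
    }

    // Step 6: the reduce stage; a tree reduction giving one partial sum per block.
    __global__ void reduceStage(const float *inter, float *out, int n) {
        __shared__ float shm[64];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        shm[threadIdx.x] = (i < n) ? inter[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) shm[threadIdx.x] += shm[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = shm[0];
    }

    int main() {
        const int n = 512, threads = 64, blocks = (n + threads - 1) / threads;
        float h_in[n], h_out[blocks];
        for (int i = 0; i < n; i++) h_in[i] = 1.0f;          // step 1: prepare input

        float *d_in, *d_inter, *d_out;                       // step 2: configuration
        cudaMalloc(&d_in,    n * sizeof(float));
        cudaMalloc(&d_inter, n * sizeof(float));
        cudaMalloc(&d_out,   blocks * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float),
                   cudaMemcpyHostToDevice);                  // step 3: copy input in

        mapStage<<<blocks, threads>>>(d_in, d_inter, n);     // step 4: map stage
        // step 5: sort omitted (not needed for this toy example)
        reduceStage<<<blocks, threads>>>(d_inter, d_out, n); // step 6: reduce stage

        cudaMemcpy(h_out, d_out, blocks * sizeof(float),
                   cudaMemcpyDeviceToHost);                  // step 7: copy results back
        float total = 0;
        for (int i = 0; i < blocks; i++) total += h_out[i];
        printf("sum of squares = %f\n", total);              // expect 512.0
        cudaFree(d_in); cudaFree(d_inter); cudaFree(d_out);
        return 0;
    }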

Implementation details:

Since the GPU does not support dynamic memory allocation in device memory while GPU code is executing, arrays are used as the main data structures: a key array, a value array, and a directory index (one entry of <key offset, key size, value offset, value size> per key/value pair). With this array structure, space in device memory is allocated for the input data, as well as for the result output, before the GPU program executes.
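A minimal C sketch of this layout (the struct name DirEntry, the field names, and the array names are illustrative, not taken from the project):

    // Directory index entry: one per key/value pair, as described above.
    typedef struct {
        int keyOffset;   // byte offset of the key in the key array
        int keySize;     // size of the key in bytes
        int valOffset;   // byte offset of the value in the value array
        int valSize;     // size of the value in bytes
    } DirEntry;

    // The three preallocated arrays (sizes fixed before the kernel launches):
    //   char     *keys;   // all keys, packed back to back
    //   char     *vals;   // all values, packed back to back
    //   DirEntry *dir;    // one directory entry per pair
    //
    // Reading pair i on the device then reduces to pointer arithmetic:
    //   key i starts at keys + dir[i].keyOffset and is dir[i].keySize bytes;
    //   value i starts at vals + dir[i].valOffset and is dir[i].valSize bytes.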

Avoiding write conflicts:

Each map task outputs three counts: the number of intermediate results, the total size of its keys, and the total size of its values. Based on the key sizes of all map tasks, the runtime system computes a prefix sum on these sizes and produces an array of write locations. Since each map task then has deterministic, non-overlapping positions to write to, write conflicts are avoided.
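The idea behind the write-location computation can be shown with a sequential C sketch of an exclusive prefix sum (names are illustrative; a real runtime would perform the scan on the GPU):

    // Given keySizes[i] = total key bytes produced by map task i, compute
    // keyWriteLoc[i] = byte offset at which task i starts writing its keys.
    void computeWriteLocations(const int *keySizes, int *keyWriteLoc, int nTasks) {
        int offset = 0;
        for (int i = 0; i < nTasks; i++) {
            keyWriteLoc[i] = offset;   // task i writes starting at this offset
            offset += keySizes[i];     // the next task starts after task i's output
        }
        // 'offset' now holds the total output size, used to allocate the key array
    }

For example, key sizes {3, 5, 2} yield write locations {0, 3, 8}: task 1 writes bytes [3, 8) and never touches task 0's region.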

