Accelerating Data Parallelism in GPUs through APGAS

Under the guidance of B. Thanasekhar, Assistant Professor, MIT, Anna University.

Project by P. Subhasree (2010614021)

Objective:
In this project I intend to add data parallel constructs to high-level programming languages based on the APGAS programming model, and thereby enhance data parallelism in GPUs by programming them through APGAS. The constructs of an existing data-parallel programming environment can be reused for this purpose.

Design:

There are two types of parallelism: task parallelism and data parallelism. In this project, data parallelism is enhanced by adding a data parallel construct (MapReduce) to X10, mapping it to CUDA, and finally running it on the GPU.

Modules:
1. Building the X10/CUDA environment
2. Testing X10 support for CUDA
3. Building the data parallel skeleton
4. Testing the data parallel skeleton in CUDA

Flow diagram: (figure omitted)

Module 1: Building the X10/CUDA environment:

X10 is a parallel programming language based on the APGAS programming model. Among its many features is an API for CUDA, which lets X10 programs run on GPUs. In this module the X10/CUDA environment is set up and some programs are executed on the GPU through the CUDA API.

Module 2: Testing X10 support for CUDA:

In this module the kernel parameters and the thread and block configurations of the GPU are set from X10. Some CUDA constructs are mapped from X10 constructs as given below:

    S.No  Operation                Construct in X10 model          Construct in CUDA model
    1     Thread creation          async                           kernel<<< >>>
    2     Number of threads        for loop (dynamic allocation)   grid in CUDA
    3     Hierarchical execution   finish                          inbuilt
    4     Synchronization          clocks                          __syncthreads()
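For reference, the CUDA side of this mapping can be sketched as follows; the kernel name sqrtKernel, the 8x64 launch configuration, and the shared array are illustrative choices that mirror the X10 examples below, not code from this project:

    // One kernel launch creates all threads (cf. async); the grid and block
    // sizes fix the thread count (cf. the X10 for loop).
    __global__ void sqrtKernel(const float *in, float *out, int len) {
        __shared__ float shm[64];                          // per-block shared memory
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
        int tids = gridDim.x  * blockDim.x;                // total thread count
        // uniform loop bound so every thread reaches the barriers together
        for (int base = 0; base < len; base += tids) {
            int i = base + tid;
            if (i < len) shm[threadIdx.x] = in[i];
            __syncthreads();                     // barrier, cf. X10 clocks / next
            if (i < len) out[i] = sqrtf(shm[threadIdx.x]);
            __syncthreads();
        }
    }

    // Launch with 8 blocks of 64 threads:
    //   sqrtKernel<<<8, 64>>>(d_in, d_out, len);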

Some examples of how the threads, blocks and shared memory of the GPU can be expressed in X10:

Specifying blocks, threads and shared memory:

    at (p) {
        finish for ([b] in 0..num_blocks-1) async {
            val shm = new Array[Int](num_elements, init);
            finish for ([t] in 0..num_threads-1) async {
                ...
            }
        }
    }

Using clocks to represent the barrier construct:

    at (p) {
        finish for ([b] in 0..num_blocks-1) async {
            clocked finish for ([t] in 0..num_threads-1) clocked async {
                ...
                next;
                ...
            }
        }
    }

Sample code for testing the X10 support for CUDA:

    public class CUDAKernelTest {
        static def doTest1(init:Array[Float](1){rail}, recv:Array[Float](1){rail},
                           p:Place, len:Int) {
            // allocate a remote array in GPU memory
            val remote = CUDAUtilities.makeRemoteArray[Float](p, len, (Int)=>0.0 as Float);
            finish async at (p) @CUDA {
                finish for (block in 0..7) async {
                    clocked finish for (thread in 0..63) clocked async {
                        val tid = block*64 + thread;
                        val tids = 8*64;
                        for (var i:Int=tid; i<len; i+=tids) {
                            remote(i) = Math.sqrtf(@NoInline init(i));
                        }
                    }
                }
            }
            // copy the result back from the GPU
            finish Array.asyncCopy(remote, 0, recv, 0, len);
            // validate
            var success:Boolean = true;
            for ([i] in recv.region) {
                val oracle = i as Float;
                if (Math.abs(1 - (recv(i)*recv(i))/oracle) > 1E-6f) {
                    Console.ERR.println("recv("+i+"): "+recv(i)+" * "+recv(i)
                                        +" = "+(recv(i)*recv(i)));
                    success = false;
                }
            }
            Console.OUT.println((success?"SUCCESS":"FAIL")+" at "+p);
        }

        // calling in the main function:
        // doTest2(gpu);
    }

Module 3: Building the MapReduce skeleton:

MapReduce is a distributed programming framework originally proposed by Google to ease the development of web-search applications on large numbers of CPUs. It has since been applied to various domains such as data mining, machine learning, and bioinformatics. The framework hides the programming complexity of the GPU behind a simple MapReduce interface, and automatically distributes and executes tasks on the GPU.
This framework provides two primitive operations:

- a map function to process input key/value pairs and to generate intermediate key/value pairs
- a reduce function to merge all intermediate pairs associated with the same key

Map: (k1, v1) → (k2, v2)
Reduce: (k2, v2) → (k2, v3)

User-defined APIs:

    // MAP_COUNT counts the result size of the map function.
    void MAP_COUNT(void *key, void *val, int keySize, int valSize);
    // The map function.
    void MAP(void *key, void *val, int keySize, int valSize);
    // REDUCE_COUNT counts the result size of the reduce function.
    void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount);
    // The reduce function.
    void REDUCE(void *key, void *vals, int keySize, int valCount);
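To illustrate how these APIs fit together, here is a rough word-count-style sketch. The emit helpers (EMIT_INTERMEDIATE_COUNT, EMIT_INTERMEDIATE, EMIT_COUNT, EMIT) are assumed placeholders standing in for the framework's actual emit primitives, and the whole block is a sketch rather than real framework code:

    // Hypothetical emit primitives, declared only so the sketch is
    // self-contained; the real framework supplies its own equivalents.
    void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);
    void EMIT_INTERMEDIATE(void *key, void *val, int keySize, int valSize);
    void EMIT_COUNT(int keySize, int valSize);
    void EMIT(void *key, void *val, int keySize, int valSize);

    // Reserve space for one intermediate pair: the word plus an int count.
    void MAP_COUNT(void *key, void *val, int keySize, int valSize) {
        EMIT_INTERMEDIATE_COUNT(keySize, sizeof(int));
    }

    // Emit (word, 1) for each input word.
    void MAP(void *key, void *val, int keySize, int valSize) {
        int one = 1;
        EMIT_INTERMEDIATE(key, &one, keySize, sizeof(int));
    }

    // One output pair per distinct word.
    void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount) {
        EMIT_COUNT(keySize, sizeof(int));
    }

    // Sum all the 1s grouped under the same word.
    void REDUCE(void *key, void *vals, int keySize, int valCount) {
        int sum = 0;
        int *counts = (int *)vals;
        for (int i = 0; i < valCount; i++) sum += counts[i];
        EMIT(key, &sum, keySize, sizeof(int));
    }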

MapReduce workflow: (figure omitted)

Algorithm for the MapReduce implementation:

1. Prepare input key/value pairs in the main memory and store them into input arrays.
2. Initialize the parameters in the run-time configuration.
3. Copy the input arrays from the main memory to the GPU device memory.
4. Start the map stage on the GPU and store the intermediate key/value pairs into arrays.
5. If a sort is required, sort the intermediate results.
6. If a reduce is required, start the reduce stage on the GPU and generate the final results. Otherwise, the intermediate results are the final results.
7. Copy the final results from the GPU device memory to the main memory.
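To make these steps concrete, here is a self-contained CUDA sketch of the host-side workflow, using a toy map (square each element) and a block-wise sum as the reduce; the kernel names, the skipped sort, and the whole pipeline are illustrative simplifications, not the actual skeleton:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Step 4: the map stage; here, a toy map that squares each input.
    __global__ void mapStage(const float *in, float *inter, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) inter[i] = in[i] * in[i];
    }

    // Step 6: the reduce stage; a tree reduction giving one partial sum per block.
    __global__ void reduceStage(const float *inter, float *out, int n) {
        __shared__ float shm[64];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        shm[threadIdx.x] = (i < n) ? inter[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) shm[threadIdx.x] += shm[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = shm[0];
    }

    int main() {
        const int n = 512, threads = 64, blocks = (n + threads - 1) / threads;
        float h_in[n], h_out[blocks];
        for (int i = 0; i < n; i++) h_in[i] = 1.0f;          // step 1: prepare input

        float *d_in, *d_inter, *d_out;                       // step 2: configuration
        cudaMalloc(&d_in,    n * sizeof(float));
        cudaMalloc(&d_inter, n * sizeof(float));
        cudaMalloc(&d_out,   blocks * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float),
                   cudaMemcpyHostToDevice);                  // step 3: copy input in

        mapStage<<<blocks, threads>>>(d_in, d_inter, n);     // step 4: map stage
        // step 5: sort omitted (not needed for this toy example)
        reduceStage<<<blocks, threads>>>(d_inter, d_out, n); // step 6: reduce stage

        cudaMemcpy(h_out, d_out, blocks * sizeof(float),
                   cudaMemcpyDeviceToHost);                  // step 7: copy results back
        float total = 0;
        for (int i = 0; i < blocks; i++) total += h_out[i];
        printf("sum of squares = %f\n", total);              // expect 512.0
        cudaFree(d_in); cudaFree(d_inter); cudaFree(d_out);
        return 0;
    }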

Implementation details:

Since the GPU does not support dynamic memory allocation in device memory while GPU code is executing, arrays are used as the main data structures: a key array, a value array, and a directory index (one entry of <key offset, key size, value offset, value size> per key/value pair). With this array structure, space in device memory is allocated for the input data, as well as for the result output, before the GPU program executes.
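A minimal C sketch of this layout (the struct name DirEntry, the field names, and the array names are illustrative, not taken from the project):

    // Directory index entry: one per key/value pair, as described above.
    typedef struct {
        int keyOffset;   // byte offset of the key in the key array
        int keySize;     // size of the key in bytes
        int valOffset;   // byte offset of the value in the value array
        int valSize;     // size of the value in bytes
    } DirEntry;

    // The three preallocated arrays (sizes fixed before the kernel launches):
    //   char     *keys;   // all keys, packed back to back
    //   char     *vals;   // all values, packed back to back
    //   DirEntry *dir;    // one directory entry per pair
    //
    // Reading pair i on the device then reduces to pointer arithmetic:
    //   key i starts at keys + dir[i].keyOffset and is dir[i].keySize bytes;
    //   value i starts at vals + dir[i].valOffset and is dir[i].valSize bytes.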

Avoiding write conflicts:

Each map task outputs three counts: the number of intermediate results, the total size of its keys, and the total size of its values. Based on the key sizes of all map tasks, the runtime system computes a prefix sum on these sizes and produces an array of write locations. Since each map task then has deterministic, non-overlapping positions to write to, write conflicts are avoided.
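The idea behind the write-location computation can be shown with a sequential C sketch of an exclusive prefix sum (names are illustrative; a real runtime would perform the scan on the GPU):

    // Given keySizes[i] = total key bytes produced by map task i, compute
    // keyWriteLoc[i] = byte offset at which task i starts writing its keys.
    void computeWriteLocations(const int *keySizes, int *keyWriteLoc, int nTasks) {
        int offset = 0;
        for (int i = 0; i < nTasks; i++) {
            keyWriteLoc[i] = offset;   // task i writes starting at this offset
            offset += keySizes[i];     // the next task starts after task i's output
        }
        // 'offset' now holds the total output size, used to allocate the key array
    }

For example, key sizes {3, 5, 2} yield write locations {0, 3, 8}: task 1 writes bytes [3, 8) and never touches task 0's region.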

