Accelerating Data Parallelism in GPUs Through APGAS
Objective:
In this project I intend to add data-parallel constructs to the high-level programming languages of the APGAS programming model, and thereby enhance data parallelism on GPUs by programming the GPU through APGAS. The constructs of an existing data-parallel programming environment can be reused for this purpose.
Design:
Parallelism comes in two forms: task parallelism and data parallelism. In this project, data parallelism is enhanced by adding a data-parallel construct (MapReduce) to X10 and mapping it to CUDA, so that the resulting program runs on the GPU.
Modules:
Building the X10/CUDA environment
Testing X10 support for CUDA
Building the data-parallel skeleton
Testing the data-parallel skeleton in CUDA
Flow diagram:
(Flow diagram not reproduced; it shows the X10 synchronization constructs finish and clocks being mapped onto CUDA's built-in __syncthreads() barrier.)
Some examples of how the GPU's threads, blocks, and shared memory can be expressed in X10 (a corresponding CUDA sketch follows after the examples):

Specifying blocks, threads, and shared memory:

    at (p) {
        finish for ([b] in 0..num_blocks-1) async {
            val shm = new Array[Int](num_elements, init);
            finish for ([t] in 0..num_threads-1) async {
                ...
            }
        }
    }

Using clocks to represent the barrier construct:

    at (p) {
        finish for ([b] in 0..num_blocks-1) async {
            clocked finish for ([t] in 0..num_threads-1) clocked async {
                ...
                next;
                ...
            }
        }
    }
An excerpt from a test that computes square roots on the GPU and then validates the results on the host (the enclosing place and block-loop constructs are not shown):

    clocked finish for (thread in 0..63) clocked async {
        val tid = block*64 + thread;
        val tids = 8*64;
        for (var i:Int = tid; i < len; i += tids) {
            remote(i) = Math.sqrtf(@NoInline init(i));
        }
    }
    }} // closes enclosing blocks not shown in this excerpt
    finish Array.asyncCopy(remote, 0, recv, 0, len);
    // validate
    var success:Boolean = true;
    for ([i] in recv.region) {
        val oracle = i as Float;
        if (Math.abs(1 - (recv(i)*recv(i))/oracle) > 1E-6f) {
            Console.ERR.println("recv("+i+"): "+recv(i)+" * "+recv(i)+" = "+(recv(i)*recv(i)));
            success = false;
        }
    }
    Console.OUT.println((success ? "SUCCESS" : "FAIL") + " at " + p);

    // calling it from the main function:
doTest2(gpu);
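For comparison, below is a minimal CUDA sketch (written for this document; the kernel name and the work it performs are illustrative, not taken from the X10 test above) of the structure the X10 constructs are mapped to: the outer async over blocks becomes the grid, the inner clocked asyncs become the threads of a block, the per-block X10 array becomes __shared__ memory, and next becomes __syncthreads().

    // Illustrative CUDA kernel with the same block/thread/shared-memory/barrier
    // structure as the X10 examples above.
    #define NUM_ELEMENTS 64

    __global__ void exampleKernel(int *out, int len) {
        __shared__ int shm[NUM_ELEMENTS];   // X10: val shm = new Array[Int](num_elements, init)
        int t = threadIdx.x;                // X10: inner async index [t]
        int b = blockIdx.x;                 // X10: outer async index [b]

        if (t < NUM_ELEMENTS) shm[t] = t;   // illustrative per-block initialization
        __syncthreads();                    // X10: next (barrier among the clocked asyncs)

        int i = b * blockDim.x + t;
        if (i < len) out[i] = shm[t % NUM_ELEMENTS];
    }

    // Host side: exampleKernel<<<num_blocks, num_threads>>>(d_out, len);
    // mirrors the two nested X10 loops over blocks and threads.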
MapReduce framework:
MapReduce is widely used in application domains such as data mining, machine learning, and bioinformatics. A MapReduce framework on graphics processors (Mars [12]) hides the programming complexity of the GPU behind the simple MapReduce interface, and its runtime automatically distributes and executes the tasks on the GPU.
This framework provides two primitive operations:
a map function that processes input key/value pairs and generates intermediate key/value pairs;
a reduce function that merges all intermediate pairs associated with the same key.
Map: (k1, v1) → (k2, v2). Reduce: (k2, v2) → (k2, v3).
User-defined APIs:
    // MAP_COUNT counts the result size of the map function.
    void MAP_COUNT(void *key, void *val, int keySize, int valSize);
    // The map function.
    void MAP(void *key, void *val, int keySize, int valSize);
    // REDUCE_COUNT counts the result size of the reduce function.
    void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount);
    // The reduce function.
    void REDUCE(void *key, void *vals, int keySize, int valCount);
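As an illustration of how these callbacks fit together, the sketch below counts occurrences of each key: the map stage emits (key, 1) for every input pair, and the reduce stage sums the values for each key. This is not taken from the framework's sources; the EMIT_* helpers are hypothetical placeholders for whatever emit interface the runtime actually provides, stubbed out here only so the example is self-contained.

    /* Placeholder emit helpers (assumptions, not the framework's real API). */
    static void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize) { /* record sizes for the prefix-sum pass */ }
    static void EMIT_INTERMEDIATE(void *key, void *val, int keySize, int valSize) { /* write at a precomputed offset */ }
    static void EMIT_COUNT(int keySize, int valSize) { /* record sizes for the prefix-sum pass */ }
    static void EMIT(void *key, void *val, int keySize, int valSize) { /* write at a precomputed offset */ }

    /* MAP_COUNT: declare how much output MAP will produce for this input pair
     * (here: the same key plus one 4-byte count). */
    void MAP_COUNT(void *key, void *val, int keySize, int valSize) {
        EMIT_INTERMEDIATE_COUNT(keySize, sizeof(int));
    }

    /* MAP: emit the intermediate pair (key, 1). */
    void MAP(void *key, void *val, int keySize, int valSize) {
        int one = 1;
        EMIT_INTERMEDIATE(key, &one, keySize, sizeof(int));
    }

    /* REDUCE_COUNT: declare the size of the final result for this key group. */
    void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount) {
        EMIT_COUNT(keySize, sizeof(int));
    }

    /* REDUCE: sum the valCount intermediate values for this key, assuming the
     * runtime passes them as a packed int array. */
    void REDUCE(void *key, void *vals, int keySize, int valCount) {
        int sum = 0;
        const int *v = (const int *)vals;
        for (int i = 0; i < valCount; ++i) sum += v[i];
        EMIT(key, &sum, keySize, sizeof(int));
    }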
MapReduce Workflow:
Implementation details:
Since the GPU does not support dynamic memory allocation in device memory while GPU code is executing, arrays are used as the main data structures: a key array, a value array, and a directory index (one entry of <key offset, key size, value offset, value size> per key/value pair). With this array structure, space in device memory is allocated for the input data as well as for the output results before the GPU program is executed.
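A minimal sketch of how one directory-index entry could be declared (the struct and field names are illustrative assumptions, not the framework's actual definition):

    /* One directory-index entry per key/value pair, pointing into the flat
     * key array and value array. */
    struct DirectoryEntry {
        int keyOffset;   /* byte offset of the key in the key array     */
        int keySize;     /* size of the key in bytes                    */
        int valOffset;   /* byte offset of the value in the value array */
        int valSize;     /* size of the value in bytes                  */
    };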
The count functions (MAP_COUNT and REDUCE_COUNT) report, for each task, the number of results, the total size of the keys, and the total size of the values. Based on the key sizes reported by all map tasks, the runtime system computes a prefix sum over these sizes and produces an array of write locations. Since each map task then has a deterministic, non-overlapping region to write to, write conflicts are avoided.
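A simplified sketch of how the write locations can be derived from the sizes reported by the count pass (function and variable names are assumptions; in the real system the scan itself can run on the GPU): an exclusive prefix sum over the per-task key sizes gives each task a private byte range in the output key array.

    /* Illustrative computation of non-overlapping write offsets from the
     * per-task key sizes reported by MAP_COUNT. */
    static int computeWriteOffsets(const int *keySizes, int *writeOffsets, int numTasks) {
        int running = 0;
        for (int i = 0; i < numTasks; ++i) {  /* exclusive prefix sum            */
            writeOffsets[i] = running;        /* task i starts writing here      */
            running += keySizes[i];
        }
        return running;                       /* total bytes to allocate up front */
    }

Task i then writes only into the range [writeOffsets[i], writeOffsets[i] + keySizes[i]), so the outputs of different map tasks never overlap, and the whole output buffer can be sized and allocated in device memory before the kernel is launched. The same scheme applies to the value sizes and to the reduce stage.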
References:
[1] Michael D. McCool, "Scalable Programming Models for Massively Multicore Processors," High Performance Computing, Vol. 96, No. 5, May 2008.
[2] Duncan K. G. Campbell, "Towards the Classification of Algorithmic Skeletons," Department of Computer Science, University of York, December 3, 1996.
[3] Horacio González-Vélez and Mario Leyton, "A Survey of Algorithmic Skeleton Frameworks: High-Level Structured Parallel Programming Enablers," Software: Practice and Experience, pp. 1-26, May 15, 2010.
[4] David Luebke, "CUDA: Scalable Parallel Programming for High-Performance Scientific Computing," NVIDIA Corporation.
[5] NVIDIA, CUDA Programming Guide 1.1, 2007; http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf
[6] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron, "Scalable Parallel Programming with CUDA."
[7] Kemal Ebcioglu, Vijay Saraswat, and Vivek Sarkar, "X10: An Experimental Language for High Productivity Programming of Scalable Systems," IBM T.J. Watson Research Center.
[8] Saraswat et al., "The Asynchronous Partitioned Global Address Space Model," IBM, 2010.
[9] Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier Tardieu, and David Grove, "X10 Language Specification," http://x10.codehaus.org/.
[10] Charles, Donawa, Ebcioglu, Grothoff, Kielstra, von Praun, Saraswat, and Sarkar, "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing," OOPSLA '05, October 16-20, 2005, San Diego, California, USA.
[11] Jonathan K. Lee and Jens Palsberg, "Featherweight X10: A Core Calculus for Async-Finish Parallelism," in Proceedings of PPoPP '10, 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010.
[12] Bingsheng He, Wenbin Fang, and Qiong Luo, "Mars: A MapReduce Framework on Graphics Processors," ACM, October 25-29, 2008.