Assignment 5 - OpenCL Optimizations

This document provides instructions for an assignment on optimizing OpenCL code. Students are asked to complete two parts: Part I involves measuring the performance of two kernels that access arrays with different offset and stride values to demonstrate the benefit of memory coalescing. Part II involves implementing and optimizing a matrix multiplication using OpenCL. Students are asked to measure performance of a naive implementation, a transposed matrix version, and a local memory cached version. They are also instructed to test their code on the instructor's machine if further optimization experiments with a GPU are desired.


Addis Ababa University
Addis Ababa Institute of Technology
School of Electrical and Computer Engineering

ECEG-6518: Parallel Computing


Assignment V: OpenCL Optimization

Follow the lecture notes posted on the course page to do this assignment.

Part I - Benefit of memory coalescing

1. Write a kernel that accepts two arrays of N = 64 * 1024 * 1024 chars each, together with an
offset. The kernel does the following:

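__kernel void kernelA(__global char *A, __global const char *B, int offset)
{
    /* each work-item copies one element, reading B shifted by offset */
    size_t i = get_global_id(0);
    A[i] = B[i + offset];
}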

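Measure the time it takes to run this kernel to completion. Vary the offset over
0, 1, 2, ..., 16 and repeat the measurement for each value. Note that with N work-items,
B must hold at least N + 16 elements (or the global size must be reduced by 16) for
i + offset to stay in bounds.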
2. Also do the same kind of measurement for the following kernel.

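__kernel void kernelB(__global char *A, __global const char *B, int stride)
{
    /* each work-item copies one element, reading B with the given stride */
    size_t i = get_global_id(0);
    A[i] = B[i * stride];
}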

Here too, vary the stride over 1, 2, ..., 16. You will need to limit the number of global
work-items to N/16 so that i * stride stays within the array.
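
To build intuition for what the two measurements above should show, the following is a minimal sketch in plain C (host-side, not OpenCL) of a simplified coalescing model. It counts how many fixed-size memory segments a group of consecutive work-items touches for a given offset and stride. The 64-byte segment size, the function names, and the idea of a 32-wide group are assumptions made for illustration only; real devices differ, so treat it as a way to reason about the trend rather than a prediction of measured times.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only: SEGMENT_BYTES is an assumed value, not a
   property of any particular device. */
#define SEGMENT_BYTES 64

/* Count the distinct SEGMENT_BYTES-sized segments of B that `width`
   consecutive work-items (ids first_id .. first_id + width - 1) read when
   work-item i reads the byte B[i * stride + offset].
   kernelA corresponds to stride == 1, kernelB to offset == 0. */
size_t segments_touched(size_t first_id, size_t width,
                        size_t offset, size_t stride)
{
    assert(width > 0 && stride > 0);
    size_t count = 0;
    size_t last_segment = 0;
    for (size_t k = 0; k < width; ++k) {
        size_t address = (first_id + k) * stride + offset;
        size_t segment = address / SEGMENT_BYTES;
        /* Addresses grow with k, so a different segment is always a new one. */
        if (k == 0 || segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}

/* Effective bandwidth in GB/s for a kernel that moved `bytes_moved` bytes
   (reads plus writes) in `seconds`. */
double effective_bandwidth_gbs(double bytes_moved, double seconds)
{
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1e9;
}
```

Under these assumptions, segments_touched(0, 32, 0, 1) is 1, while segments_touched(0, 32, 0, 16) is 8: the strided group reads the same number of useful bytes but pulls in eight times as many segments. For kernelA, bytes_moved would be 2 * N (one read and one write per element) when reporting measured times as bandwidth.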

Part II - Benefit of caching on local memory

1. Implement a naive matrix multiplication using OpenCL. Measure the time it takes to
multiply two floating-point (real) matrices of dimensions 1024x1024 (if this runs too
quickly to give a meaningful result, increase the size to 2048x2048). Also vary the
work-group size from 4x4, 8x8, ... up to the largest size that the device's maximum
work-group size can accommodate.

Instructor: Fitsum Assamnew (Dr.)


2. Improve the naive implementation by transposing the second matrix for better data
locality in the cache. Do this for the CPU implementation as well. When measuring the
runtime, include the transpose operation.

3. Implement a local-memory cached version of the matrix multiplication and repeat the
measurements asked for in 1. To appreciate the results, this experiment needs to be run
on a GPU. I advise you to write your OpenCL code and test it on your own machine (which
can be a computer without a GPU). You can then run your experiments on a computer with
a dedicated GPU in our lab. The operating system on this machine is Ubuntu. Please make
arrangements with me if you want to test your code on this machine.
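
The three variants above can be sketched on the CPU side as follows. This is a minimal reference in plain C, not the OpenCL kernels the assignment asks for: it may help when checking kernel output and serves as the CPU baseline mentioned in 2. The function names are hypothetical, matrices are assumed square and row-major, and the blocked version only mimics the tile-by-tile reuse that a local-memory kernel would get from staging tiles per work-group.

```c
#include <assert.h>
#include <stdlib.h>

/* C = A * B for row-major n x n matrices. The inner loop walks B down a
   column, i.e. with stride n. */
void matmul_naive(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Same product, but B is transposed first so the inner loop reads both
   operands with unit stride. The transpose is part of the function so that
   timing it includes the transpose cost.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int matmul_transposed(const float *A, const float *B, float *C, size_t n)
{
    float *Bt = malloc(n * n * sizeof *Bt);
    if (Bt == NULL)
        return -1;
    for (size_t k = 0; k < n; ++k)
        for (size_t j = 0; j < n; ++j)
            Bt[j * n + k] = B[k * n + j];
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
    free(Bt);
    return 0;
}

/* Blocked product: C is accumulated one (tile x tile) block pair at a time,
   so each block of A and B is reused while it is still in fast memory.
   Assumes n is a multiple of tile. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t n, size_t tile)
{
    assert(tile > 0 && n % tile == 0);
    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0f;
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t jj = 0; jj < n; jj += tile)
            for (size_t kk = 0; kk < n; kk += tile)
                for (size_t i = ii; i < ii + tile; ++i)
                    for (size_t j = jj; j < jj + tile; ++j) {
                        float sum = C[i * n + j];
                        for (size_t k = kk; k < kk + tile; ++k)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}
```

All three functions compute the same product, so on small integer-valued inputs their outputs can be compared element by element; for large floating-point inputs the different summation order in the tiled version may produce small rounding differences, so a tolerance-based comparison is the safer check.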

Due Date: ____________

