Assignment 5 - OpenCL Optimizations


Uploaded by Abdulahi Abebe


Addis Ababa University
Addis Ababa Institute of Technology
School of Electrical and Computer Engineering

ECEG-6518: Parallel Computing

Assignment V: OpenCL Optimization

Follow the lecture notes posted on the course page to do this assignment.

Part I - Benefit of memory coalescing

1. Write a kernel that accepts two char arrays of size N = 64 * 1024 * 1024 and an
integer offset. The kernel is shown below:

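__kernel void kernelA(__global char *A, __global const char *B, int offset)
{
    int i = get_global_id(0);
    /* B needs N + 16 chars (or launch N - 16 work items) so B[i + offset] stays in bounds */
    A[i] = B[i + offset];
}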

Measure the time the kernel takes to run to completion. Vary the offset over
0, 1, 2, ..., 16 and repeat the measurement for each value.
2. Make the same kind of measurement for the following kernel:

__kernel void kernelB(__global char *A, __global const char *B, int stride)
{
    int i = get_global_id(0);   /* global size is limited to N / 16 */
    A[i] = B[i * stride];
}

Here, vary the stride over 1, 2, ..., 16, but limit the number of global work items
to N/16 so that B[i * stride] stays within the N-element array even at stride 16.
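One way to anticipate the Part I results is to count how many aligned memory segments a group of consecutive work items touches when reading B (the stores to A[i] are always contiguous and are ignored here). The C sketch below uses a hypothetical model that is not taken from the lecture notes or from any particular GPU: 32 consecutive work items issue their loads together, B starts on a segment boundary, and loads are served in aligned 32-byte segments. The names `GROUP`, `SEG`, `segments_for_offset`, `segments_for_stride` and `load_efficiency` are illustrative.

```c
#include <assert.h>

/* Hypothetical coalescing model (assumption, not a device specification):
 * GROUP consecutive work items load together, and the memory system serves
 * them in aligned SEG-byte segments. Real widths and sizes vary by GPU. */
#define GROUP 32
#define SEG   32

/* kernelA: distinct segments touched when work items
 * first .. first+GROUP-1 each read the char B[i + offset]. */
int segments_for_offset(long first, int offset)
{
    assert(first >= 0 && offset >= 0);
    long lo = first + offset;              /* byte address of the first load */
    long hi = first + GROUP - 1 + offset;  /* byte address of the last load */
    /* the loads are contiguous, so every segment from lo to hi is touched */
    return (int)(hi / SEG - lo / SEG + 1);
}

/* kernelB: distinct segments touched when work items 0 .. GROUP-1
 * each read the char B[i * stride]. */
int segments_for_stride(int stride)
{
    assert(stride >= 1);
    int count = 0;
    long last = -1;
    for (long i = 0; i < GROUP; i++) {
        long seg = (i * stride) / SEG;  /* non-decreasing, so count changes */
        if (seg != last) {
            count++;
            last = seg;
        }
    }
    return count;
}

/* Fraction of transferred bytes that are actually used:
 * GROUP useful chars out of segments * SEG bytes fetched. */
double load_efficiency(int stride)
{
    return (double)GROUP / ((double)segments_for_stride(stride) * SEG);
}
```

Under this model, offset 0 touches one segment per group while every offset from 1 to 16 touches two, and a stride of s (for s up to 32) touches exactly s segments, so only 1/s of the fetched bytes are used. Caches on recent GPUs can soften the misalignment penalty, so measured curves may be flatter than the model suggests.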

Part II - Benefit of caching on local memory

1. Implement a naive matrix multiplication using OpenCL. Measure the time it takes to
multiply two floating-point (real) matrices of size 1024x1024 (if this runs quickly and
you want a more meaningful result, use 2048x2048 instead). Also vary the work-group
size over 4x4, 8x8, ..., up to the maximum work-group size the device supports.

Instructor: Fitsum Assamnew (Dr.)

2. Optimize the naive implementation by transposing the second matrix for better data
locality in the cache. Do this for the CPU implementation as well. Include the transpose
operation in the measured runtime.

3. Implement a local-memory cached version of the matrix multiplication and make the
same measurements as in item 1. To carry out this experiment and appreciate the results,
you need to run it on a GPU. I advise you to write your OpenCL code and test it on your
own machine (which can be a computer without a GPU). You can then run your experiments
on a computer with a dedicated GPU in our lab. This machine runs Ubuntu. Please make
arrangements with me if you want to test your code on it.
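For items 1 and 2 of Part II, a CPU reference is useful both for checking the GPU output and as the CPU baseline that item 2 asks you to transpose. The C sketch below assumes n x n row-major float matrices; `matmul_naive`, `matmul_transposed` and `square_group_edges` are hypothetical helper names. In the naive loop, each (row, col) iteration is the work one OpenCL work item would do, and B is walked down a column with stride n, which is the poor locality that transposing B removes. The work-group helper assumes power-of-two square edges and requires each edge to divide n, because OpenCL 1.x requires the global size to be a multiple of the work-group size; `max_wg` would be the value your device reports for `CL_DEVICE_MAX_WORK_GROUP_SIZE`.

```c
#include <assert.h>
#include <stdlib.h>

/* Naive C = A * B for n x n row-major float matrices. Each (row, col)
 * iteration is the work of one OpenCL work item in the naive kernel. */
void matmul_naive(const float *A, const float *B, float *C, size_t n)
{
    for (size_t row = 0; row < n; row++)
        for (size_t col = 0; col < n; col++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[row * n + k] * B[k * n + col]; /* B read with stride n */
            C[row * n + col] = sum;
        }
}

/* Transposed variant: copies B into BT (BT[j][k] = B[k][j]) so the inner
 * loop reads A and BT along contiguous rows. The transpose is done inside
 * the function, so timing one call includes its cost.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int matmul_transposed(const float *A, const float *B, float *C, size_t n)
{
    float *BT = malloc(n * n * sizeof *BT);
    if (BT == NULL)
        return -1;
    for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < n; j++)
            BT[j * n + k] = B[k * n + j];
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * BT[j * n + k]; /* both unit stride */
            C[i * n + j] = sum;
        }
    free(BT);
    return 0;
}

/* Writes the square work-group edges 4, 8, 16, ... whose area fits in
 * max_wg and which divide n evenly; returns how many were written. */
int square_group_edges(size_t max_wg, size_t n, size_t *out, int cap)
{
    int count = 0;
    assert(out != NULL && cap >= 0);
    for (size_t e = 4; e * e <= max_wg && count < cap; e *= 2)
        if (n % e == 0)
            out[count++] = e;
    return count;
}
```

To time the CPU versions, wrap a single call with a timer such as `clock()` from `<time.h>`. In an OpenCL version of the transposed multiply, work item (row, col) would read `BT[col * n + k]` instead of `B[k * n + col]`.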
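For item 3 of Part II, a common local-memory scheme is tiling: each TILE x TILE work group copies one tile of A and one tile of B into `__local` arrays, synchronizes with `barrier(CLK_LOCAL_MEM_FENCE)`, accumulates partial dot products from local memory, and repeats along k. The C sketch below is a serial CPU emulation of that data flow, not an OpenCL kernel; `TILE = 16` and the name `matmul_tiled` are assumptions, and it requires n to be a multiple of TILE (true for 1024 and 2048). Comments mark where a kernel would place its barriers.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16 /* assumed work-group edge: TILE x TILE work items */

/* Serial emulation of a local-memory tiled C = A * B for n x n
 * row-major float matrices; n must be a multiple of TILE. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    float Asub[TILE][TILE], Bsub[TILE][TILE]; /* stand-ins for __local tiles */
    float acc[TILE][TILE];                    /* one private sum per work item */
    assert(n % TILE == 0);
    for (size_t bi = 0; bi < n; bi += TILE)        /* work-group row */
        for (size_t bj = 0; bj < n; bj += TILE) {  /* work-group column */
            for (size_t ty = 0; ty < TILE; ty++)
                for (size_t tx = 0; tx < TILE; tx++)
                    acc[ty][tx] = 0.0f;
            for (size_t bk = 0; bk < n; bk += TILE) {
                /* Each work item (ty, tx) loads one element of each tile;
                 * the kernel then calls barrier(CLK_LOCAL_MEM_FENCE). */
                for (size_t ty = 0; ty < TILE; ty++)
                    for (size_t tx = 0; tx < TILE; tx++) {
                        Asub[ty][tx] = A[(bi + ty) * n + (bk + tx)];
                        Bsub[ty][tx] = B[(bk + ty) * n + (bj + tx)];
                    }
                /* Each work item accumulates from local memory only; a
                 * second barrier follows before the tiles are overwritten. */
                for (size_t ty = 0; ty < TILE; ty++)
                    for (size_t tx = 0; tx < TILE; tx++)
                        for (size_t k = 0; k < TILE; k++)
                            acc[ty][tx] += Asub[ty][k] * Bsub[k][tx];
            }
            for (size_t ty = 0; ty < TILE; ty++)
                for (size_t tx = 0; tx < TILE; tx++)
                    C[(bi + ty) * n + (bj + tx)] = acc[ty][tx];
        }
}
```

With this scheme, each element of A and B is read from global memory n / TILE times instead of n times, which is the saving the local-memory version is meant to expose. In this scheme the tile edge equals the work-group edge, so each work-group size you test in item 1 implies a matching TILE.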

Due Date: ____________

