
Linear Algebra, Spring 2024

HW3 Matlab assignment


CSIE department, NCKU, Taiwan

1. Matrix multiplication by parallel computing


In modern computer systems, it is common to see hardware architectures designed for heterogeneous computing; in other words, systems that contain both CPUs (central processing units) and GPUs (graphics processing units). GPUs are known for their advantage in massively parallel computing, which is valuable for numerical computing. In this assignment, we will show you an example of the data parallelism approach to accelerating matrix multiplication.

Figure 1. A naïve algorithm for achieving data parallelism in the computation of the matrix multiplication C = A x B in (a). Here, the computation of the entries of C is performed in parallel, without any synchronization. Therefore, before calculating an entry such as c11, a row vector a_1 and a column vector b_1 need to be read from global memory and stored in shared memory (blue and brown arrows in (b)) [1]. In other words, there is no data sharing to reduce global memory access when calculating entries in C.

1.1 Naïve algorithm for parallel matrix multiplication


Figure 1(a) illustrates a computation task of C = A x B. A straightforward way of performing
such matrix multiplication is to calculate all entries in C in a sequential manner. However, to
accelerate matrix multiplication (particularly when their dimensions are large), we can use a
GPU to achieve parallel computing (i.e., computing entries in C in parallel and then combining
results). In a GPU, we usually compute an entry of C with a thread (i.e., a small task to be executed). Before we can actually calculate an entry such as c11, a row vector a_1 and a column vector b_1 need to be read from global memory and stored in shared memory (blue and brown
arrows in Figure 1(b)). In other words, to compute an entry in C, two vectors are read from
GPU global memory.

Please write a Matlab function to simulate the following matrix multiplication process. For
simplicity, we set matrix dimensions n1 = n2 = n3 = n in Figure 1(a). Please use separate
variables to simulate data in global memory and in shared memory. For example, you may use
A_global, B_global, C_global, a_i_shared, b_j_shared, c_ij_shared to simulate data storage in
a GPU.

Naïve algorithm for parallel matrix multiplication

[C, global_memory_read_count] = naive_multiply(A, B)
1. Input parameters include A and B, both with a size of n x n.
2. Set n1 = n2 = n3 = n.
3. Set global_memory_read_count to 0.
4. Calculate entries in C:
   for i = 1 to n1
     for j = 1 to n3
       Read a_i from global memory (and store it in shared memory).
       Read b_j from global memory (and store it in shared memory).
       Add 2 to global_memory_read_count.
       Calculate the entry cij as a_i x b_j.
       Store cij in C.
5. Return C and global_memory_read_count.
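
Following the pseudocode, one possible implementation is sketched below. The variable names A_global, B_global, C_global, a_i_shared, and b_j_shared come from the suggestion above; the remaining choices (e.g., taking n from size(A, 1)) are our own, so treat this as a sketch rather than the required solution.

function [C, global_memory_read_count] = naive_multiply(A, B)
% Simulate the naive parallel algorithm: each (i, j) pair models one
% GPU thread that reads one row of A and one column of B from
% (simulated) global memory before computing a single entry of C.
n = size(A, 1);                        % n1 = n2 = n3 = n
A_global = A;                          % data residing in global memory
B_global = B;
C_global = zeros(n, n);
global_memory_read_count = 0;
for i = 1:n                            % in a GPU, these iterations would
    for j = 1:n                        % run in parallel, in any order
        a_i_shared = A_global(i, :);   % read row a_i into shared memory
        b_j_shared = B_global(:, j);   % read column b_j into shared memory
        global_memory_read_count = global_memory_read_count + 2;
        C_global(i, j) = a_i_shared * b_j_shared;  % inner product c_ij
    end
end
C = C_global;
end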

Here, you can see that there are two for loops, which make the matrix multiplication a sequential process. However, in a GPU, imagine that the iterations of the two for loops are executed in parallel, potentially accelerating the computation. In addition, the entries cij can be calculated in any order because a_i and b_j are read from global memory, independently of the computation of the other entries in C.

1.2 Tiled matrix multiplication (simplified version)


In Section 1.1, we learned that a sequential calculation (i.e., two for loops) can be broken up into many smaller tasks that are executed in parallel. This approach is called data parallelism because parallel computing is achieved by breaking the data into smaller parts. This idea is attractive for reducing computation time, particularly when the matrices to be multiplied are large. However, in practice, the bottleneck of such an approach usually lies in global memory access, which is much slower than shared memory access.
Therefore, tiled matrix multiplication is frequently used to alleviate this bottleneck. The idea is to group the computation of the entries of C into tiles (or blocks) so that global memory access is reduced. Figure 2 shows a simplified version of tiled matrix multiplication. Notice that in Figure 2(a), the tiles of C are marked with dark green lines. Taking the calculation of the upper-left tile as an example, we first read a_1, a_2, b_1, b_2 from global memory and store them in shared memory, as shown in Figure 2(b). Note that, here, we calculate all the entries of the tile (c11, c12, c21, c22) together and reuse the same shared memory. In this way, the global memory read count for a tile is four instead of the eight (4 entries x 2 reads) of Section 1.1. This approach is like carpooling on a highway to reduce traffic (i.e., global memory reads).

Figure 2. A simplified version of tiled matrix multiplication, which groups the calculation of the entries of C into tiles (or blocks). For example, in (b), with a 2 x 2 tile size, the computations of the four entries c11, c12, c21, and c22 are grouped together to reuse shared memory, thereby reducing global memory access to half of that of the naïve algorithm [1].
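
As a quick sanity check (this derivation is ours, not part of the assignment, but it follows directly from the read counts above): the naïve algorithm issues 2 reads for each of the n^2 entries of C, while the tiled algorithm issues 2m reads for each of the (n/m)^2 tiles, so

    reads_naive = 2 x n^2,    reads_tiled = (n/m)^2 x 2m = 2 x n^2 / m.

In other words, a tile size of m cuts global memory reads by a factor of m, which predicts the shape of the plot requested in Section 1.3, task 5.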

Please write a Matlab function to simulate tiled matrix multiplication. For simplicity, we set
matrix dimensions n1 = n2 = n3 = n in Figure 2(a). Please use separate variables to simulate data
in global memory and in shared memory. For example, you may use A_global, B_global,
C_global, a_i_shared, b_j_shared, c_ij_shared to simulate data storage in a GPU.

Tiled parallel matrix multiplication

[C, global_memory_read_count] = tiled_multiply(A, B, m)
1. Input parameters include A and B (both with a size of n x n), and m (tile size).
2. Set n1 = n2 = n3 = n.
3. Set global_memory_read_count to 0.
4. Calculate entries in C:
   Row tile number k1 = n1 / m
   Column tile number k3 = n3 / m
   for i = 1 to k1
     for j = 1 to k3
       For the current m x m tile:
         Read m rows of A from global memory.
         Read m columns of B from global memory.
         Add 2m to global_memory_read_count.
         Perform the multiplication for the current tile.
         Store the resulting m x m tile in C.
5. Return C and global_memory_read_count.
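
A possible implementation of the tiled algorithm is sketched below, again following the pseudocode. The tile index arithmetic and the names a_tile_shared and b_tile_shared are our own (the assignment only suggests names such as a_i_shared); we also assume n is divisible by m, which holds for the values used in Section 1.3.

function [C, global_memory_read_count] = tiled_multiply(A, B, m)
% Simulate tiled matrix multiplication: each m x m tile of C reads
% m rows of A and m columns of B from (simulated) global memory once,
% then reuses them in shared memory for all m^2 entries of the tile.
n = size(A, 1);                        % assumes mod(n, m) == 0
A_global = A;                          % data residing in global memory
B_global = B;
C_global = zeros(n, n);
global_memory_read_count = 0;
k1 = n / m;                            % number of row tiles
k3 = n / m;                            % number of column tiles
for i = 1:k1
    for j = 1:k3
        rows = (i-1)*m + 1 : i*m;      % row indices of the current tile
        cols = (j-1)*m + 1 : j*m;      % column indices of the current tile
        a_tile_shared = A_global(rows, :);   % m rows into shared memory
        b_tile_shared = B_global(:, cols);   % m columns into shared memory
        global_memory_read_count = global_memory_read_count + 2*m;
        C_global(rows, cols) = a_tile_shared * b_tile_shared;  % m x m tile
    end
end
C = C_global;
end

With m = 1 this reduces to the naïve algorithm, which is why the two read counts coincide there.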

1.3 Compare two matrix multiplication algorithms


Please perform the following tasks; a sketch of a driver script follows the list.
1. Randomly initialize two matrices A and B with n = 128 (e.g., A = round(100*rand(128))).
2. Use naive_multiply to get C_naive and its global memory read count.
3. Use tiled_multiply (m = 4) to get C_tiled and its global memory read count.
4. Compare the two Cs from the parallel algorithms with the correct C (i.e., C = A x B). Check the errors of both Cs (e.g., rms(C_naive - C, 'all')).
5. Use tiled_multiply to get C_tiled and the global memory read count for m = 1, 2, 4, 8, 16, 32, 64. Plot global_memory_read_count versus m. Note that the global_memory_read_count for m = 1 is equal to that of the naïve parallel algorithm.
In the written homework, please discuss: (1) Are C_naive and C_tiled accurate? (2) What is your interpretation of the plot from task 5?
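
A possible driver script for these tasks is sketched below. The rng seed is our own addition for reproducibility, and rms(X, 'all') (taken from the task description above) requires a reasonably recent MATLAB release; max(abs(...), [], 'all') would be an alternative error check.

% Driver script for Section 1.3 (a sketch)
rng(0);                                % optional: fix the seed for reproducibility
n = 128;
A = round(100 * rand(n));              % task 1
B = round(100 * rand(n));
C_ref = A * B;                         % correct result for comparison

[C_naive, count_naive] = naive_multiply(A, B);      % task 2
[C_tiled, count_tiled] = tiled_multiply(A, B, 4);   % task 3
fprintf('naive: error = %g, reads = %d\n', rms(C_naive - C_ref, 'all'), count_naive);
fprintf('tiled: error = %g, reads = %d\n', rms(C_tiled - C_ref, 'all'), count_tiled);

ms = [1 2 4 8 16 32 64];               % task 5
counts = zeros(size(ms));
for k = 1:numel(ms)
    [~, counts(k)] = tiled_multiply(A, B, ms(k));
end
plot(ms, counts, '-o');
xlabel('tile size m');
ylabel('global\_memory\_read\_count');   % backslash escapes the TeX subscript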

References
[1] Programming Heterogeneous Computing Systems with GPUs and Other Accelerators, ETH Zürich, https://www.youtube.com/watch?v=ZQKMZIP3Fzg
[2] W.-m. W. Hwu, D. B. Kirk, and I. El Hajj, Programming Massively Parallel Processors, 4th ed., 2022.
