LinearAlgebra Matlab HW3 V2s
executed). Before we can actually calculate an entry such as c11, a row vector a1 and a column
vector b1 need to be read from global memory and stored in shared memory (blue and brown
arrows in Figure 1(b)). In other words, to compute an entry in C, two vectors are read from
GPU global memory.
Please write a Matlab function to simulate the following matrix multiplication process. For
simplicity, we set matrix dimensions n1 = n2 = n3 = n in Figure 1(a). Please use separate
variables to simulate data in global memory and in shared memory. For example, you may use
A_global, B_global, C_global, a_i_shared, b_j_shared, c_ij_shared to simulate data storage in
a GPU.
Store cij in C.
5. Return C and global_memory_read_count.
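Although the assignment asks for a Matlab function, the counting logic in the steps above can be sketched as follows. This is a Python sketch for illustration only; the function name naive_matmul_sim and the use of NumPy arrays are assumptions, while the variable names mirror the suggested A_global, B_global, C_global, a_i_shared, b_j_shared, c_ij_shared:

```python
import numpy as np

def naive_matmul_sim(A_global, B_global):
    """Simulate naive GPU matrix multiplication.

    For each entry cij, one row of A and one column of B are
    'read' from global memory into shared-memory variables,
    so global_memory_read_count increases by 2 per entry.
    """
    n = A_global.shape[0]
    C_global = np.zeros((n, n))
    global_memory_read_count = 0
    for i in range(n):
        for j in range(n):
            # two global-memory reads per entry: a_i and b_j
            a_i_shared = A_global[i, :].copy()
            b_j_shared = B_global[:, j].copy()
            global_memory_read_count += 2
            # dot product in "shared memory"
            c_ij_shared = a_i_shared @ b_j_shared
            C_global[i, j] = c_ij_shared
    return C_global, global_memory_read_count
```

For n x n inputs this performs two vector reads per entry, so global_memory_read_count ends at 2n^2.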
Here, you would see that there are two for loops, which make the matrix multiplication a
sequential process. However, in a GPU, please imagine that iterations in the two for loops can
be executed in parallel, potentially accelerating computation. In addition, the calculation of
entries cij can be executed in any order because ai and bj are read from global memory. As
an example, we first read a1, a2, b1, b2 from global memory and store them in shared
memory, as shown in Figure 2(b). Note that, here, we calculate all entries in the tile (c11, c12,
c21, c22) together and reuse the same shared memory. In this way, the global memory read
count for a tile is four vector reads instead of the eight (4 entries x 2 reads) required in
Section 1.1. This approach is like carpooling on a highway to reduce traffic (i.e., global
memory reads).
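The traffic saving described above can be checked with a little arithmetic. In this sketch the helper names naive_reads and tiled_reads are my own, and n is assumed divisible by the tile size m:

```python
def naive_reads(n):
    # naive algorithm: 2 vector reads (one row, one column) per entry
    return 2 * n * n

def tiled_reads(n, m):
    # (n/m)^2 tiles; each tile reads m rows of A and m columns of B
    k = n // m
    return k * k * 2 * m

# with n = 4 and 2x2 tiles, reads drop from 32 to 16 (half)
print(naive_reads(4), tiled_reads(4, 2))
```

In general the tiled count is 2n^2/m, so larger tiles save more global memory traffic, and m = 2 gives exactly half of the naive count.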
Figure 2. A simplified version of tiled matrix multiplication which
groups calculation of entries in C into tiles (or blocks). For example,
in (b), with a 2x2 tile size, computations for the four entries c11, c12, c21,
c22 are grouped together to reuse shared memory, thereby
reducing global memory access to half of that in the naïve algorithm
[1].
Please write a Matlab function to simulate tiled matrix multiplication. For simplicity, we set
matrix dimensions n1 = n2 = n3 = n in Figure 2(a). Please use separate variables to simulate data
in global memory and in shared memory. For example, you may use A_global, B_global,
C_global, a_i_shared, b_j_shared, c_ij_shared to simulate data storage in a GPU.
2. Set n1 = n2 = n3 = n.
3. Set global_memory_read_count to 0.
4. Calculate entries in C:
for i = 1 to k1
    for j = 1 to k3
        For the current m x m tile:
            Read m rows of A from global memory.
            Read m columns of B from global memory.
            Add 2m to global_memory_read_count.
            Perform multiplication for the current tile.
            Store the m x m tile to C.
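The steps above can be sketched as follows. Again, this is a Python sketch for illustration only; tiled_matmul_sim is an assumed name, k1 = k3 = n/m in the square case, and n is assumed divisible by m:

```python
import numpy as np

def tiled_matmul_sim(A_global, B_global, m):
    """Simulate tiled matrix multiplication with m x m tiles.

    Each tile reads m rows of A and m columns of B once and reuses
    them for all m*m entries of the tile (n divisible by m assumed).
    """
    n = A_global.shape[0]
    k = n // m  # k1 = k3 = number of tiles per dimension
    C_global = np.zeros((n, n))
    global_memory_read_count = 0
    for i in range(k):
        for j in range(k):
            rows = slice(i * m, (i + 1) * m)
            cols = slice(j * m, (j + 1) * m)
            # read m rows of A and m columns of B for this tile
            a_shared = A_global[rows, :].copy()
            b_shared = B_global[:, cols].copy()
            global_memory_read_count += 2 * m
            # compute the whole m x m tile from shared data
            c_shared = a_shared @ b_shared
            C_global[rows, cols] = c_shared
    return C_global, global_memory_read_count
```

With n = 4 and m = 2 this gives the 16 reads worked out above, half of the naive algorithm's 32.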
Reference
[1] Programming Heterogeneous Computing Systems with GPUs and Other Accelerators,
ETH Zurich, https://www.youtube.com/watch?v=ZQKMZIP3Fzg
[2] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj, Programming Massively Parallel
Processors, 4th edition, 2022.