CS683 PA1
* Disclaimer: The memes included in this document are intended solely to make learning fun. They are
not meant to offend, mislead, or be taken as factual information. Please enjoy them in that spirit.
We write a lot of programs with the algorithm as our primary concern, but while we do,
we often forget about the underlying hardware on which our programs will run. A wise
man once said:
Just a friendly reminder: if you think copying code is a clever shortcut, think again. It’s
not only easily spotted but also a great way to miss out on the chance to actually learn
something. Why not impress us with your own work?
NOTE:
- You need an Intel-based x86 machine to implement software prefetching.
- Make changes to the base code only. Do not use a different implementation of the
base code.
- Do not use compiler optimizations or any other optimizations. Points will be deducted
for using any optimization techniques other than the ones mentioned in the assignment.
The assignment is divided into two tasks, each having its own subparts. The tasks and
their respective points are shown below:
Bonus Points: A bonus of 5 points will be given to the top 10 teams getting the best
speedups.
Task 1A: Tile it to See it
One of the most common matrix operations is taking the transpose. This operation is particularly
brutal on the cache when the matrix is huge, since we access the matrix in column-major order.
What if we divide the matrix into tiles and transpose it tile by tile? Maybe we'll be gentler
on the cache then…
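To make the idea concrete, here is a minimal sketch (with hypothetical names n, b, src, and dst;
the base code in transpose.c is what you should actually modify):

    #include <stddef.h>

    /* Minimal sketch, not the base code: transpose an n x n matrix in
       b x b tiles, so the strided column-major accesses stay within a
       small set of cache lines while each tile is processed. */
    void transpose_tiled(size_t n, size_t b, const double *src, double *dst)
    {
        for (size_t ii = 0; ii < n; ii += b)          /* tile row start    */
            for (size_t jj = 0; jj < n; jj += b)      /* tile column start */
                for (size_t i = ii; i < ii + b && i < n; i++)
                    for (size_t j = jj; j < jj + b && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }

A reasonable starting point is a tile size at which one source tile and one destination tile
fit in the L1-D cache together.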
TODOs
1. Report the L1-D cache MPKI (defined just after this list) when executing only the naive matrix transpose.
2. Implement the tiled matrix transpose.
3. Report the L1-D cache MPKI when executing only the tiled matrix transpose.
4. Compare the performance by calculating the speedup.
5. Do this for multiple matrix sizes and tile sizes and analyze the results.
6. Plot the MPKI and speedup for different matrix sizes and tile sizes of your choice. Select
the sizes such that you can draw clear conclusions from the results.
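For reference, the metrics requested above have their standard definitions (stated here for
convenience; nothing new is being asked):

    MPKI = 1000 × (L1-D cache misses) / (retired instructions)
    Speedup = (execution time of the naive version) / (execution time of the optimized version)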
Software prefetching is a technique that aims to reduce cache misses by fetching data into the
cache before it is needed. In this section, you will optimize the matrix transpose code using
software prefetching techniques. Explain the concept of software prefetching and the different
strategies that can be employed; strategies such as choosing prefetch hints based on the temporal
locality of the fetched data, or fetching a variable number of addresses at a time, can be considered.
Note:
1. You will have to turn off hardware prefetching to see the effects of software prefetching.
How?
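One possible answer, offered as a hedged sketch and not as an official part of this handout:
many (but not all) Intel Core processors expose hardware prefetcher control through bits 0–3 of
MSR 0x1A4, which Linux lets you write through the msr driver (root privileges and `modprobe msr`
required). The wrmsr utility from msr-tools, or a BIOS option, achieves the same thing. Verify
that this MSR applies to your exact CPU model before relying on it.

    /* Hedged sketch: disable the four L1/L2 hardware prefetchers on many
       Intel Core CPUs by setting bits 0-3 of MSR 0x1A4.  Assumes Linux,
       root privileges, and a loaded 'msr' kernel module; repeat for every
       core (/dev/cpu/<N>/msr) and write 0x0 to re-enable afterwards. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t val = 0xF;                          /* 0xF = all four off */
        int fd = open("/dev/cpu/0/msr", O_WRONLY);   /* core 0 only here   */
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
        /* The msr driver maps the file offset to the MSR index. */
        if (pwrite(fd, &val, sizeof val, 0x1A4) != (ssize_t)sizeof val) {
            perror("wrmsr 0x1A4");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }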
TODOs
1. Report the number of instructions and L1-D cache MPKI when executing only the
naive matrix transpose.
2. Implement the software-prefetched matrix transpose.
3. Report the number of instructions and L1-D cache MPKI when executing only the
software-prefetched matrix transpose.
4. Compare the performance by calculating the speedup.
5. Do this for multiple matrix sizes and analyze the results.
6. Plot the MPKI and speedup for different matrix sizes of your choice. Select matrix sizes
such that you can draw clear conclusions from the results.
Resources:
1. To implement software prefetching, you will use _mm_prefetch (a usage sketch follows these resources).
2. _mm_prefetch is an intrinsic function provided by Intel that prefetches data into the
cache to enhance memory access efficiency. It enables programmers to give the processor
advance notice about which memory locations will be accessed soon, reducing cache
misses and improving performance.
3. The function is part of Intel's SSE (Streaming SIMD Extensions) and is highly optimized
for Intel architectures. It is especially effective when used with SIMD (Single Instruction,
Multiple Data) operations.
4. Following are the links where you can find details about _mm_prefetch:
- https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=5152&text=prefetch
- https://stackoverflow.com/questions/46521694/what-are-mm-prefetch-locality-hints
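To make the intrinsic concrete, here is a minimal sketch (hypothetical names n, src, and dst,
and an illustrative prefetch distance; tune both the distance and the hint experimentally).
While the transpose walks src row by row, it prefetches the destination element a few iterations
ahead, since the strided column-major writes are where the misses occur:

    #include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
    #include <stddef.h>

    #define DIST 8  /* prefetch distance in iterations; tune experimentally */

    /* Minimal sketch, not the base code: prefetch the destination element
       DIST iterations ahead of the current strided write. */
    void transpose_prefetch(size_t n, const double *src, double *dst)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                if (j + DIST < n)
                    _mm_prefetch((const char *)&dst[(j + DIST) * n + i],
                                 _MM_HINT_T0);
                dst[j * n + i] = src[i * n + j];
            }
    }

The other hints (_MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA) trade off which cache levels the line
is brought into; comparing them is exactly the kind of strategy analysis this section asks for.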
TODOs
1. Report the number of instructions and L1-D cache MPKI when executing only the
naive matrix transpose.
2. Implement the tiled + software-prefetched matrix transpose (see the sketch after this list).
3. Report the number of instructions and L1-D cache MPKI when executing only the
tiled + software-prefetched matrix transpose.
4. Compare the performance by calculating the speedup.
5. Do this for multiple matrix sizes and analyze the results.
6. Plot the MPKI and speedup for different matrix sizes and tile sizes of your choice. Select
sizes such that you can draw clear conclusions from the results.
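Combining the two techniques is mostly mechanical. Here is a compact sketch under the same
assumptions as the earlier ones (hypothetical names, one illustrative prefetch placement):
prefetch the destination lines of the next tile while the current tile is being transposed.

    #include <xmmintrin.h>
    #include <stddef.h>

    /* Sketch: the tiled loop nest from before, with one prefetch per
       destination row of the next tile issued before the current b x b
       tile is walked. */
    void transpose_tiled_prefetch(size_t n, size_t b,
                                  const double *src, double *dst)
    {
        for (size_t ii = 0; ii < n; ii += b)
            for (size_t jj = 0; jj < n; jj += b) {
                for (size_t j = jj + b; j < jj + 2 * b && j < n; j++)
                    _mm_prefetch((const char *)&dst[j * n + ii], _MM_HINT_T0);
                for (size_t i = ii; i < ii + b && i < n; i++)
                    for (size_t j = jj; j < jj + b && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
            }
    }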
TODOs
1. Report the number of instructions when executing only the naive convolution
algorithm.
2. Implement the SIMD 2D convolution algorithm.
3. Report the number of instructions when executing only the SIMD 2D convolution
algorithm.
4. Compare the performance by calculating the speedup.
5. Do this using SIMD registers of width 128 bits and 256 bits (and 512 bits if available)
and compare the speedups.
6. Do this for multiple matrix sizes and kernel sizes and analyze the results.
7. Plot the speedup for the different matrix sizes and kernel sizes of your choice. Select the
sizes such that you can draw clear conclusions from the results.
Answer the following:
1. Report the change in the number of instructions that you observed while moving from the
naive to the SIMD 2D convolution algorithm, and explain why it occurs.
2. Did you achieve any speedup? If so, how much and what contributed to it? If not, what
were the reasons?
Resources:
Here are a few basic points to get started with using SIMD (Single Instruction, Multiple Data)
instructions; a minimal example is sketched below.
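For instance, here is a minimal 256-bit sketch (hypothetical names: an n × n float input in,
a k × k kernel ker, and the (n − k + 1) × (n − k + 1) valid output out; this is an illustration,
not the handout's original pointers). Eight adjacent output columns are computed per iteration.

    #include <immintrin.h>
    #include <stddef.h>

    /* Minimal AVX sketch; compile with -mavx (an ISA flag, not an
       optimization level).  Eight adjacent output columns are computed
       per iteration; a scalar loop handles the tail. */
    void conv2d_avx(size_t n, size_t k, const float *in,
                    const float *ker, float *out)
    {
        size_t on = n - k + 1;                     /* output dimension */
        for (size_t i = 0; i < on; i++) {
            size_t j = 0;
            for (; j + 8 <= on; j += 8) {
                __m256 acc = _mm256_setzero_ps();
                for (size_t ki = 0; ki < k; ki++)
                    for (size_t kj = 0; kj < k; kj++) {
                        __m256 v = _mm256_loadu_ps(&in[(i + ki) * n + j + kj]);
                        __m256 w = _mm256_set1_ps(ker[ki * k + kj]);
                        acc = _mm256_add_ps(acc, _mm256_mul_ps(v, w));
                    }
                _mm256_storeu_ps(&out[i * on + j], acc);
            }
            for (; j < on; j++) {                  /* scalar tail */
                float s = 0.0f;
                for (size_t ki = 0; ki < k; ki++)
                    for (size_t kj = 0; kj < k; kj++)
                        s += in[(i + ki) * n + j + kj] * ker[ki * k + kj];
                out[i * on + j] = s;
            }
        }
    }

The 128-bit variant swaps __m256/_mm256_* for __m128/_mm_* with a step of 4, and the 512-bit
variant uses __m512/_mm512_* where AVX-512 is available.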
TODOs
1. Report the L1-D cache MPKI when executing only the naive 2D convolution.
2. Implement the tiled 2D convolution (a sketch follows this list).
3. Report the L1-D cache MPKI when executing only the tiled 2D convolution.
4. Compare the performance by calculating the speedup.
5. Do this for multiple matrix sizes and kernel sizes and analyze the results.
6. Plot the MPKI and speedup for different matrix sizes, kernel sizes, and tile sizes of your
choice. Select the sizes such that you can draw clear conclusions from the results.
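A compact sketch of the idea, with the same hypothetical names and layout as the SIMD example:
tiling the output loops keeps the (b + k − 1) × (b + k − 1) input patch that each tile touches
resident in the L1-D cache while the tile is computed.

    #include <stddef.h>

    /* Sketch, not the base code: compute the valid output in b x b tiles
       so each tile reuses a small, cache-resident input patch. */
    void conv2d_tiled(size_t n, size_t k, size_t b, const float *in,
                      const float *ker, float *out)
    {
        size_t on = n - k + 1;
        for (size_t ii = 0; ii < on; ii += b)
            for (size_t jj = 0; jj < on; jj += b)
                for (size_t i = ii; i < ii + b && i < on; i++)
                    for (size_t j = jj; j < jj + b && j < on; j++) {
                        float s = 0.0f;
                        for (size_t ki = 0; ki < k; ki++)
                            for (size_t kj = 0; kj < k; kj++)
                                s += in[(i + ki) * n + j + kj] * ker[ki * k + kj];
                        out[i * on + j] = s;
                    }
    }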
Note:
1. You will have to turn off hardware prefetching to see the effects of software prefetching.
TODOs
1. Report the number of instructions and L1-D cache MPKI when executing only the
naive 2D convolution.
2. Implement the software-prefetched 2D convolution (a sketch follows this list).
3. Report the number of instructions and L1-D cache MPKI when executing only the
software-prefetched 2D convolution.
4. Compare the performance by calculating the speedup.
5. Do this for multiple matrix sizes and kernel sizes and analyze the results.
6. Plot the MPKI and speedup for different matrix sizes and kernel sizes of your choice.
Select the sizes such that you can draw clear conclusions from the results.
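A compact sketch under the same assumptions (hypothetical names, one illustrative prefetch
policy): while output row i is being accumulated, prefetch the input row that row i + 1 will
newly need, one cache line (16 floats) at a time, hiding its miss latency behind the current
row's arithmetic.

    #include <xmmintrin.h>
    #include <stddef.h>

    /* Sketch: output row i reads input rows i..i+k-1; row i+1 newly
       needs input row i+k, so prefetch it one 64-byte line at a time. */
    void conv2d_prefetch(size_t n, size_t k, const float *in,
                         const float *ker, float *out)
    {
        size_t on = n - k + 1;
        for (size_t i = 0; i < on; i++)
            for (size_t j = 0; j < on; j++) {
                if (j % 16 == 0 && i + k < n)
                    _mm_prefetch((const char *)&in[(i + k) * n + j],
                                 _MM_HINT_T0);
                float s = 0.0f;
                for (size_t ki = 0; ki < k; ki++)
                    for (size_t kj = 0; kj < k; kj++)
                        s += in[(i + ki) * n + j + kj] * ker[ki * k + kj];
                out[i * on + j] = s;
            }
    }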
Deliverables
1. Source code for all the tasks in transpose.c and convolution.c files.
2. README.md summarizing all the tasks and their respective todos. Describe what you
did, why you did it, and how much it benefited you. Compare and analyze the
performance improvements achieved by each of the tasks. Discuss any trade-offs or
limitations associated with each optimization technique. Reflect on the importance of
understanding hardware architecture and the impact it has on performance.
3. All the plots should be included in the README.md file along with their summary.
4. Include a plot comparing the performance of each technique, along with the combination
of the techniques, against the different matrix sizes in the README.md file. There will be
two separate plots, one for the transpose and one for the convolution operation. This is
what your plot should look like:
Submission
● You should submit a single tar.gz file with the name roll_number_pa1.tar.gz on
Moodle.
● The folder structure within the tar.gz should be in the format below. Place the files in the
appropriate folders for all the tasks.
---- pa1-chak-de-microarchitecture-template
|----- part1
|---- Makefile
|---- transpose.c
|----- part2
|---- Makefile
|---- convolution.c
|----- README.md
● Kindly read the document at this link to create a private repository for the assignment. Do
not push everything at the last moment. Maintain a proper commit history.
Appendix
Instructions to build and run the project:
1. For part1 (cd to the part1 directory)
There are various sections to run for part 1:
1. naive
2. tiling
3. prefetch
4. tiling-prefetch
5. all
The block size for the tiling tasks can be defined in the program itself.
Deadline
Chak De Microarchitecture
Best wishes
See you in Vivas 🙃