Assignment 03 (1)
Assignment-03
Akash Maji (24212) [email protected]
Nehal Shah (24623) [email protected]
Abstract
In this assignment, we use `perf` to record CPU events such as cache references,
cache misses, cycles, instructions, and time taken, to help us decide the best CPU
configuration for running the threads. The `perf` utility is a powerful profiling tool on Linux.
It is used to analyze and measure performance issues in Linux systems by collecting data such as
CPU usage, memory usage, and input/output operations. This tool helps developers identify
bottlenecks, optimize code, and improve overall system performance.
Introduction
In the main.cpp program, three threads (T0, T1 and T2) are run, and the time taken by each
is reported. These threads incur cache misses that depend on the CPU configuration they are
allotted, so the measured parameters, such as the number of cache misses, vary from one
configuration to another. We are asked to find the best CPU configuration. In other words, we
are to find an assignment of CPUs to these three threads that minimizes the parameter
values. We are given 4 logical CPU cores to work with, namely C0, C1, C2 and C3.
Analysis Methodology
We are given three threads T0, T1 and T2, and four logical cores C0, C1, C2 and C3
are available for use. There are 4 × 3 × 2 = 24 ways in which these 3 threads can be allotted
distinct CPUs out of the 4. We do not know beforehand which of these configurations will be
most beneficial. Hence, our analysis deploys the three threads on the four CPUs as per each of
the following 24 configurations:
[0, 1, 2], [0, 2, 1], [0, 1, 3], [0, 3, 1], [0, 2, 3], [0, 3, 2],
[1, 0, 2], [1, 2, 0], [1, 0, 3], [1, 3, 0], [1, 2, 3], [1, 3, 2],
[2, 0, 1], [2, 1, 0], [2, 0, 3], [2, 3, 0], [2, 1, 3], [2, 3, 1],
[3, 0, 1], [3, 1, 0], [3, 0, 2], [3, 2, 0], [3, 1, 2], [3, 2, 1]
Here, the first configuration [0, 1, 2] means that T0 gets C0, T1 gets C1 and T2 gets C2.
The other 23 configurations are read the same way.
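The 24 configurations can also be generated programmatically rather than written out by hand. The following is a minimal sketch, not the code used in main.cpp; the names `all_configs`, `NUM_CORES` and `NUM_THREADS` are hypothetical. It produces the same set of assignments as the list above, although in a different order.

```python
from itertools import permutations

NUM_CORES = 4    # logical cores C0..C3
NUM_THREADS = 3  # threads T0..T2


def all_configs(num_cores=NUM_CORES, num_threads=NUM_THREADS):
    """Return every assignment of distinct cores to threads.

    Element i of each tuple is the core given to thread Ti.
    """
    return list(permutations(range(num_cores), num_threads))


configs = all_configs()
print(len(configs))  # 24
```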
Commands To Run
To run `perf` and the other Linux tools needed by the program, we have to ensure that
certain Linux tools and utilities are installed and configured, and that certain accesses are
enabled for `perf`. The 3 commands below are to be executed once for initial setup.
Our main.cpp program internally runs the following command to start one perf process per
thread for each configuration. Each perf process writes to its own log file, and its process ID
is appended to perf_pids.txt, so that the perf processes can be ended when the threads end,
which makes them write their data to the respective log files.
perf stat -e cycles,instructions,cache-references,cache-misses -p <TID> \
    --output perf_thread_config_<CONFIG>_<TID>.log & echo $! >> perf_pids.txt
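To show how the placeholders in this command expand for one thread, here is a small sketch that only builds the command string. The helper name `build_perf_command` and the form of the configuration label (for example "012") are assumptions for illustration; the actual label format used by main.cpp is not stated in this report.

```python
def build_perf_command(tid, config_label,
                       events=("cycles", "instructions",
                               "cache-references", "cache-misses")):
    """Build the shell command that attaches `perf stat` to one thread.

    tid          -- ID of the thread to monitor (fills <TID>)
    config_label -- hypothetical label for the configuration (fills <CONFIG>)
    """
    log_file = f"perf_thread_config_{config_label}_{tid}.log"
    # '&' runs perf in the background; '$!' is the PID of that background job.
    return (f"perf stat -e {','.join(events)} -p {tid} "
            f"--output {log_file} & echo $! >> perf_pids.txt")
```

The function returns a string and does not start perf itself, so it can be inspected or logged before being passed to a shell.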
The above command runs the perf processes, one for each thread in a particular
configuration, in the background. When the threads end, we end the perf processes too, and the
data captured so far is written to the respective log files. If we do not end the perf
processes, they continue to run in the background until the main process, which runs all the
configurations one by one, finishes. In that case the perf data is not written to the log files
until the entire process finishes, and analyze.py, which is called internally to analyze the
logs and determine the best CPU configuration, gets no data. Thus, we need to ensure that the
log files are written as soon as all threads of a configuration end. We also generate one log
file per thread per configuration, so as to avoid data races and make sure our script reads
well-formatted data for analysis.
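One way to end the recorded perf processes is sketched below. It assumes that perf_pids.txt holds one PID per line and that `perf stat` writes its counters when it receives SIGINT, as it does on Ctrl-C; the helper names `read_pids` and `stop_perf` are hypothetical, and the report does not state which mechanism main.cpp actually uses.

```python
import os
import signal


def read_pids(path="perf_pids.txt"):
    """Read the whitespace-separated perf PIDs recorded in the PID file."""
    with open(path) as f:
        return [int(token) for token in f.read().split()]


def stop_perf(pids, send_signal=os.kill):
    """Send SIGINT to each perf process; return the PIDs that were signalled.

    send_signal defaults to os.kill and can be replaced for dry runs.
    """
    stopped = []
    for pid in pids:
        try:
            send_signal(pid, signal.SIGINT)
            stopped.append(pid)
        except ProcessLookupError:
            pass  # this perf process has already exited
    return stopped
```

After stopping the processes for one configuration, the PID file would need to be cleared before the next configuration starts, so that stale PIDs are not signalled again.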
Our main.cpp program internally runs the following command to analyze all 72 perf log files
(24 configurations × 3 threads) and obtain the best configuration, that is, the one that
minimizes the total cache-misses.
python3 analyze.py
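The selection step can be sketched as follows. This is not the actual analyze.py; it assumes the log names follow the perf_thread_config_<CONFIG>_<TID>.log pattern and that each log contains a `perf stat` line whose first field is the count and whose second field ends in "cache-misses". The names `parse_cache_misses` and `best_config` are hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "perf_thread_config_012_1234.log" (config label, then TID).
LOG_NAME = re.compile(r"perf_thread_config_(?P<config>.+)_(?P<tid>\d+)\.log$")
# Matches e.g. "     1,234,567      cache-misses   # ..." in perf stat output.
MISS_LINE = re.compile(r"^\s*([\d,]+)\s+\S*cache-misses", re.MULTILINE)


def parse_cache_misses(log_text):
    """Return the cache-miss count from one perf log, or None if absent."""
    match = MISS_LINE.search(log_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def best_config(logs):
    """Pick the configuration with the fewest total cache misses.

    logs maps each log file name to its contents. Returns a pair
    (best configuration label or None, totals per configuration).
    """
    totals = defaultdict(int)
    for name, text in logs.items():
        name_match = LOG_NAME.search(name)
        misses = parse_cache_misses(text)
        if name_match is None or misses is None:
            continue  # skip unrelated or incomplete logs
        totals[name_match.group("config")] += misses
    if not totals:
        return None, {}
    return min(totals, key=totals.get), dict(totals)
```

Summing over the three per-thread logs of a configuration gives the total that is compared across the 24 configurations.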
When the analysis finishes, the file cpu_affinities_of_threads.txt contains the preferred CPU
for each of the three threads, in thread order. This file is then read by our main.cpp program
to answer the int32_t get_thread_affinity(uint32_t threadIdx) calls.
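The lookup can be illustrated with a short Python sketch of the same idea; the real implementation is in C++ inside main.cpp. It assumes the file stores the three CPU numbers as whitespace-separated integers, and the fallback value of -1 for an out-of-range index is likewise an assumption.

```python
def load_affinities(path="cpu_affinities_of_threads.txt"):
    """Read the per-thread CPU numbers (assumed whitespace-separated)."""
    with open(path) as f:
        return [int(token) for token in f.read().split()]


def get_thread_affinity(thread_idx, affinities):
    """Return the CPU for thread_idx, or -1 if no preference is recorded."""
    if 0 <= thread_idx < len(affinities):
        return affinities[thread_idx]
    return -1  # assumed fallback, not taken from the original program
```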
In this scenario, when our main program is asked for the CPU affinities of the three
threads, it returns 3, 1 and 2 for T0, T1 and T2 respectively, since we took total
cache-misses as our optimality criterion.
When we run SEED=24212 make run for the original main.cpp program, we obtain the
following execution times, as printed to stdout:
TheadIdx: 1 completed, time was 37.595887069
TheadIdx: 0 completed, time was 47.173499375
TheadIdx: 2 completed, time was 47.657227231
So time taken by program = max(time taken) = 47.657227231
When we run SEED=24212 make run for the modified main.cpp program, we obtain the
following execution times for the config [0, 2, 1], which is the best configuration in terms of time.
TheadIdx: 2 completed, time was 35.351452013
TheadIdx: 1 completed, time was 35.694299852
TheadIdx: 0 completed, time was 37.569915071
So time taken by program = max(time taken) = 37.569915071
The speedup achieved by the modified program running in the best config, relative to the original program, is:
47.657227231 / 37.569915071 ≈ 1.2685
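The same arithmetic can be reproduced from the per-thread times reported above. In this short sketch the helper name `program_time` is hypothetical; the numbers are the ones printed by the two runs.

```python
# Per-thread times reported for the original and the modified program.
original_times = [37.595887069, 47.173499375, 47.657227231]
modified_times = [35.351452013, 35.694299852, 37.569915071]


def program_time(thread_times):
    """The program finishes when its slowest thread finishes."""
    return max(thread_times)


speedup = program_time(original_times) / program_time(modified_times)
print(f"{speedup:.4f}")  # 1.2685
```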
Conclusion
We run all possible configurations and obtain the optimal configuration, the one which
minimizes a chosen parameter (here, the number of cache-misses as reported by perf). We also
compute the speedup to see how much the optimal configuration gains over the original program.
We found that, depending on the SEED used to run the program, a different configuration may
turn out to be optimal.
CPU Specifications:
References
● https://man7.org/linux/man-pages/man1/perf.1.html
● https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/getting-started-with-perf_monitoring-and-managing-system-status-and-performance#introduction-to-perf_getting-started-with-perf
● https://en.cppreference.com/w/