Assignment 03 (1)

Uploaded by

Akash Maji
E0-243 Computer Architecture

Assignment-03
Akash Maji (24212) [email protected]
Nehal Shah (24623) [email protected]

Abstract
In this assignment, we use `perf` to record CPU events such as cache references,
cache-misses, cycles, and instructions, along with the time taken, to help us decide the best CPU
configuration for running the threads. `perf` is a profiling tool for Linux. It is used
to measure and analyze performance issues in Linux systems by collecting data such as CPU
usage, memory usage, and input/output operations. This helps developers identify
bottlenecks, optimize code, and improve overall system performance.

Introduction
In the main.cpp program, three threads (T0, T1 and T2) are run, and the time taken by
each is reported. These threads incur cache-misses that depend on the CPUs they are
assigned to, so parameters such as the number of cache-misses vary with the CPU
configuration. We are asked to find the best CPU configuration, that is, an assignment of
CPUs to the three threads that minimizes these parameter values. We are given 4 logical
CPU cores to work with, namely C0, C1, C2 and C3.

Analysis Methodology
We are given three threads T0, T1 and T2, and four logical cores C0, C1, C2 and C3.
There are 4 × 3 × 2 = 24 ways to assign these 3 threads to distinct CPUs out of the 4.
We do not know beforehand which of these configurations is the most beneficial.
Hence, our analysis deploys the three threads on the four CPUs as per each of the
following 24 configurations:
[0, 1, 2], [0, 2, 1], [0, 1, 3], [0, 3, 1], [0, 2, 3], [0, 3, 2],
[1, 0, 2], [1, 2, 0], [1, 0, 3], [1, 3, 0], [1, 2, 3], [1, 3, 2],
[2, 0, 1], [2, 1, 0], [2, 0, 3], [2, 3, 0], [2, 1, 3], [2, 3, 1],
[3, 0, 1], [3, 1, 0], [3, 0, 2], [3, 2, 0], [3, 1, 2], [3, 2, 1]
Here, the first configuration [0, 1, 2] means T0 gets C0, T1 gets C1 and T2 gets C2.
The other 23 configurations are read the same way.
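As a minimal sketch of how the 24 configurations can be enumerated, the snippet below uses Python's itertools.permutations. The names CORES, NUM_THREADS and configs are illustrative and not taken from the assignment code. The enumeration order differs from the listing above, but the set of configurations is the same.

```python
from itertools import permutations

CORES = [0, 1, 2, 3]   # logical cores C0..C3
NUM_THREADS = 3        # threads T0..T2

# Each tuple (c0, c1, c2) means T0 -> C<c0>, T1 -> C<c1>, T2 -> C<c2>.
# permutations() yields ordered selections without repetition: 4 * 3 * 2 = 24.
configs = list(permutations(CORES, NUM_THREADS))

print(len(configs))   # number of configurations
print(configs[0])     # first configuration in enumeration order
```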

We use the function pthread_setaffinity_np(), a non-portable (GNU) extension to the POSIX
threads (pthreads) library, to set the CPU affinity of a specific thread, that is, to specify which
CPU core the thread can run on. Our main program (main.cpp) runs the three threads simultaneously
under one configuration and collects the perf log records in three separate files, one for each
thread in that configuration. With 3 perf log files per CPU configuration and 24 configurations,
72 files are generated in total. The main program produces them in 24 batches of 3 (one file
per thread of the configuration currently executing) as it assigns threads to CPUs as per each
of the 24 configurations.
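The idea of pinning each thread to one CPU can be sketched in Python as well. Note that this is not the pthread_setaffinity_np() call used in main.cpp: it swaps in os.sched_setaffinity(), which is Linux-only and, when given pid 0, applies to the calling thread. The helper run_pinned and the stand-in config are hypothetical.

```python
import os
import threading

def run_pinned(cpu, results, idx):
    """Pin the calling thread to one CPU, then record where it may run."""
    # pid 0 refers to the calling thread for the underlying Linux system call.
    os.sched_setaffinity(0, {cpu})
    results[idx] = os.sched_getaffinity(0)

# Only pick CPUs this process is allowed to use. This is a stand-in for a
# real configuration such as [0, 1, 2]; CPUs repeat if fewer than 3 exist.
allowed = sorted(os.sched_getaffinity(0))
config = [allowed[i % len(allowed)] for i in range(3)]

results = [None] * 3
threads = [threading.Thread(target=run_pinned, args=(cpu, results, i))
           for i, cpu in enumerate(config)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # one single-CPU set per thread
```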
The best configuration is the one with the minimum total number of cache-misses,
summed over the 3 threads in that batch. That is, if config C incurs cache-misses
t0, t1 and t2 for the three threads and sum(t0, t1, t2) is the least among all configurations,
then C is the best configuration. We also use other parameters, such as time taken, for speedup
calculations.
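The selection rule can be sketched as follows. The function name best_config and the counts in example are hypothetical and chosen only to illustrate the rule; they are not measured data.

```python
def best_config(misses_by_config):
    """Return the config whose per-thread cache-miss counts sum to the minimum."""
    return min(misses_by_config, key=lambda cfg: sum(misses_by_config[cfg]))

# Hypothetical counts (t0, t1, t2) per configuration, for illustration only.
example = {
    (0, 1, 2): (500, 700, 900),
    (0, 2, 1): (400, 650, 800),
    (3, 1, 2): (300, 600, 850),
}

print(best_config(example))
```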

Commands To Run
To run `perf` and the other tools the program needs, certain Linux tools and utilities must be
installed and configured, and `perf` must be granted certain access. The 3 commands below
are to be executed once for initial setup.

Linux build tools and `perf` utility:


sudo apt install -y build-essential linux-tools-common linux-tools-generic linux-tools-$(uname -r)
C++ and Python Language support:
sudo apt install python3 gcc g++ make
Access to `perf` so as to collect samples:
sudo sysctl kernel.perf_event_paranoid=-1

We run the main program main.cpp as:


SEED=24212 make run
This cleans all files, runs `perf` to generate the perf log files, runs the analysis script on
those log files internally, and obtains the best configuration. Note that when the modified
program is run with a different SEED or at a different time, the optimal configuration may change.

For each config, our main.cpp program internally runs the command below to generate one perf
log file per thread. It saves the perf process IDs in a perf_pids.txt file so that the perf
processes can be ended when the threads end, which makes them write their data to the
respective log files.
perf stat -e cycles,instructions,cache-references,cache-misses -p <TID> --output perf_thread_config_<CONFIG>_<TID>.log& echo $!>>perf_pids.txt

The above command runs the perf processes, one for each thread in a particular
configuration, in the background. When the threads end, the perf processes are
ended too, and the captured data is dumped into the respective log files. If we did not end
the perf processes, they would keep running in the background until the main process,
which runs all the configurations one by one, finishes. The perf log data would then not be
dumped into the log files until the entire process finishes, and analyze.py, which is called
internally to determine the best CPU configuration, would not get any data.
Thus, we need to ensure that the log files are dumped as soon as all threads of a config end.
We also generate a separate log file per thread per configuration, to avoid data races and to
make sure our script reads well-formatted data for analysis.
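The start-and-stop pattern described above can be sketched in Python with subprocess. This is an illustration, not the code in main.cpp: the helper names are hypothetical, config_label stands for however <CONFIG> is rendered in the file name, and stopping perf with SIGINT is an assumption (perf stat prints its counters when interrupted).

```python
import signal
import subprocess

EVENTS = "cycles,instructions,cache-references,cache-misses"

def perf_stat_argv(tid, config_label):
    """Build the perf stat argument list for one thread of one configuration."""
    log_name = f"perf_thread_config_{config_label}_{tid}.log"
    return ["perf", "stat", "-e", EVENTS, "-p", str(tid), "--output", log_name]

def start_perf(tid, config_label):
    """Start perf in the background and return its process handle."""
    return subprocess.Popen(perf_stat_argv(tid, config_label))

def stop_perf(proc):
    """Interrupt perf so it writes its counters to the log file, then wait for it."""
    proc.send_signal(signal.SIGINT)  # assumption: SIGINT is the signal used
    proc.wait()

print(" ".join(perf_stat_argv(1234, "0_1_2")))
```

Keeping the Popen handles plays the same role as perf_pids.txt: it records which perf processes to stop once the threads of the current configuration have finished.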

Our main.cpp program internally runs the command below to analyze all 72 perf log files
generated across the configurations and obtain the best configuration, that is, the one that
minimizes the total cache-misses.
python3 analyze.py
When the analysis finishes, the file cpu_affinities_of_threads.txt contains the chosen CPUs
for the three threads, in thread order. Our main.cpp program then reads this file to answer
the int32_t get_thread_affinity(uint32_t threadIdx) calls.
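A minimal sketch of how a counter value might be read from a perf stat log is given below. It is not the actual analyze.py: the function read_counter and the sample text are hypothetical, and the parser assumes each counter line starts with a comma-grouped integer followed by the plain event name. Event names carrying a suffix or prefix would need extra handling.

```python
import re

# A counter line looks like "<count> <event> ..."; perf may group digits with commas.
COUNTER_RE = re.compile(r"^\s*([\d,]+)\s+(\S+)")

def read_counter(log_text, event):
    """Return the integer count for `event` in perf stat text, or None if absent."""
    for line in log_text.splitlines():
        m = COUNTER_RE.match(line)
        if m and m.group(2) == event:
            return int(m.group(1).replace(",", ""))
    return None

# Hypothetical log excerpt with made-up numbers, for illustration only.
sample = """
     9,876,543      cache-references
     1,234,567      cache-misses
"""

print(read_counter(sample, "cache-misses"))
```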

Analysis and Reporting


When we run SEED=24210 make run for the modified main.cpp program, we obtain the
following average execution times and total cache-misses for the three threads from the perf
logs for the various configs, as illustrated in graph1 and graph2.
From graph1, we see that config [0, 2, 1], with 37.54830839633334 seconds, performs best
overall in terms of time taken.
From graph2, we see that config [3, 1, 2], with 2040013194 cache-misses, performs best
overall in terms of the number of cache-misses.

In this scenario, when our main program is called to output the CPU affinities for the three
threads, it returns 3, 1 and 2 for T0, T1 and T2, since we take cache-misses as our
criterion of optimality.

Speedup gained by best configuration


As suggested, the time taken by the program is the time taken by the last thread to end.

When we run SEED=24212 make run for the original main.cpp program, we obtain the
following execution times as outputted to stdout:
TheadIdx: 1 completed, time was 37.595887069
TheadIdx: 0 completed, time was 47.173499375
TheadIdx: 2 completed, time was 47.657227231
So time taken by program = max(time taken) = 47.657227231

When we run SEED=24212 make run for the modified main.cpp program, we obtain the
following execution times for the config [0, 2, 1], which is the best configuration in terms of time:
TheadIdx: 2 completed, time was 35.351452013
TheadIdx: 1 completed, time was 35.694299852
TheadIdx: 0 completed, time was 37.569915071
So time taken by program = max(time taken) = 37.569915071

The speedup of the modified program in its best config over the original program is:
47.657227231 / 37.569915071 = 1.2685
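The arithmetic above can be reproduced with a short Python check. The per-thread times are the ones reported in this section; the variable names are illustrative.

```python
# Per-thread completion times in seconds, as reported above.
original_times = [37.595887069, 47.173499375, 47.657227231]
modified_times = [35.351452013, 35.694299852, 37.569915071]

# The program's time is the time of the last thread to finish.
program_time_original = max(original_times)
program_time_modified = max(modified_times)

speedup = program_time_original / program_time_modified
print(f"{speedup:.4f}")  # prints 1.2685
```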

Conclusion
We run all possible configurations and obtain the optimal one, which minimizes a chosen
parameter (here, the number of cache-misses as reported by perf). We also compute the
speedup to quantify how much the optimal configuration gains over the original program.
We found that, depending on the SEED used to run the program, a different configuration
may come out as optimal.

CPU Specifications:

Architecture: x86_64 (64-bit), little endian
Address sizes: 48 bits physical, 48 bits virtual
CPUs: 8 physical, 16 logical, 2-way SMT
Model: AMD Ryzen 7 5800H + Radeon Graphics
RAM / SSD: 24 GB @ 3200 MHz / 512 GB
LLC Cache: L3, 16 MiB (1 instance, shared)

References
● https://man7.org/linux/man-pages/man1/perf.1.html
● https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/getting-started-with-perf_monitoring-and-managing-system-status-and-performance#introduction-to-perf_getting-started-with-perf
● https://en.cppreference.com/w/
