Lab05_IntroToLinuxKernelandApplicationProfiling
GOAL
The goal of this Lab is to allow you to use the Arm Development Studio Streamline tool to profile the execution of
code running on the Raspberry Pi 3.
PREREQUISITES
WORKPLACE SETUP
Assuming you completed all the previous labs, move to the directory raspberryPi3 and prepare the build
environment:
cd ~/raspberryPi3
source sources/poky/oe-init-build-env rpi-build
You now have the system ready for building the embedded Linux distribution for the Raspberry Pi 3. We now need
to:
Customize the configuration of the Linux kernel to activate the functionalities needed to run code
profiling, which is measuring the execution time of the code on the Raspberry Pi 3 target;
Prepare the Yocto recipe to deploy Streamline-specific kernel modules/middleware to the Raspberry Pi 3.
To configure the Linux kernel, we have to modify the Linux configuration file (.config) using the kernel
configuration utility. For this purpose, we proceed as follows:
bitbake linux-raspberrypi -c devshell
This command tells the Yocto build system to prepare the Linux kernel compilation (bitbake linux-raspberrypi)
and to open a development shell (option -c devshell) with the build system configured for the
selected recipe (linux-raspberrypi). From this development shell, we can run the Linux kernel configuration
utility as follows:
make menuconfig
Customize the Linux kernel configuration as follows (in the following, an asterisk marks an option that must be
activated):
When done with this configuration, you can exit from the Linux configuration utility and save the configuration to
the .config file. The current configuration is valid only for the current instance of the bitbake command. If you
close the development shell, thus terminating the current instance of the bitbake command, the next instance of
the bitbake command will bring the configuration back to the default one. Therefore, we have to make the current
configuration the new default one. For this purpose, we have to copy the .config file to the default configuration
file in the linux-raspberrypi recipe as follows:
cd /home/user/raspberryPi3/rpi-build/tmp/work/raspberrypi3-poky-linux-gnueabi/linux-raspberrypi/1_4.1.21+gitAUTOINC+ff45bc0e89-r0/linux-raspberrypi3-standard-build
This directory will be printed to the console after closing the kernel configuration utility; tailor this to the directory
your system prints.
cp .config ~/raspberryPi3/sources/meta-raspberrypi/recipes-kernel/linux/linux-raspberrypi/defconfig
With this command, we make the modifications to the Linux kernel configuration permanent.
In order to profile the execution of the code on the Raspberry Pi 3, we will exploit the Streamline tool, part of the
Arm Development Studio suite.
To install the trial version of Arm Development Studio, go to
https://developer.arm.com/tools-and-software/embedded/arm-development-studio.
Select "Try for free" and follow the on-page instructions for downloading and installing it.
gator.ko is a loadable kernel module that shall be installed on the Raspberry Pi 3 to provide user-level
access to the performance counters of the Arm architecture;
gatord is a service that shall be installed on the Raspberry Pi 3; it interacts with gator.ko to collect
the performance counter values and delivers them over the Ethernet network to the development
host;
streamline is an application running on the development host that configures the operation of
gator.ko/gatord and displays the gathered execution traces.
To integrate gator.ko/gatord in the Linux system for the Raspberry Pi 3, we have to prepare a new Yocto
recipe that tells the build system where to find the needed source code, how to compile it, and how to deploy it
in the root file system.
mkdir ~/raspberryPi3/sources/meta-raspberrypi/recipes-kernel/gator
mkdir ~/raspberryPi3/sources/meta-raspberrypi/recipes-kernel/gator/files
Line 4 specifies the working directory to be used for downloading/building the code. Upon
executing the recipe, Yocto will access the Arm GitHub repository and download the source code to the defined
directory, where the gator directory contains the source code for the loadable kernel module and gatord
contains the source code for the service;
Lines 8-9-11 instruct Yocto to download the source code for gator.ko/gatord from the Arm GitHub repository so
that the latest distributed source code is always used;
Line 11 specifies the patch file to adapt the gatord Makefile to the Yocto build environment;
Lines 19-25 specify how the loadable kernel module (gator.ko) and the middleware (gatord) shall be
compiled/linked;
Lines 27-35 instruct Yocto how to deploy the obtained binaries to the target root file system; three new
files will be created on the Raspberry Pi 3 root file system: /etc/init.d/gator.ko,
/etc/init.d/gatord, /etc/init.d/rungator.sh
@@ -8,8 +8,8 @@

-CC = $(CROSS_COMPILE)gcc
-CXX = $(CROSS_COMPILE)g++
+#CC = $(CROSS_COMPILE)gcc
+#CXX = $(CROSS_COMPILE)g++
Once these operations are completed, you have to tell the machine layer configuration that the new driver is
needed. For this purpose, edit the file
raspberryPi3/sources/meta-raspberrypi/conf/machine/raspberrypi3.conf
and add the line:
MACHINE_ESSENTIAL_EXTRA_RRECOMMENDS += "kernel-module-gator"
Now you need to ensure that the local.conf file includes gator in the build by adding the line:
IMAGE_INSTALL_append += "gator"
Also add:
This will signal that the gator module is already-stripped and the build process will not
reattempt it.
After a while, a new Micro SD card image will be available, which you can deploy to the Micro SD card as follows
(assuming the Micro SD card is available to the PC as /dev/sdN). Alternatively, use a program of your preference to
flash the image.
First, use the
sudo fdisk -l
command to determine which device to flash to (plug in and unplug the SD card to determine which device it is).
For this example, the SD card is under the name “sdc” (this may be different in your environment). Next, ensure
that the device is unmounted. This can be done using the command:
sudo umount /dev/sdc*
Once this is done, the following command can be used to copy the image across to the SD card (substitute any
folder names and device names to ensure they are relevant to your specific environment).
Note that if this is not done properly, the image flashed to the SD card may cause problems when
attempting to turn on the board. If this is the case, it may be worth repeating the process and ensuring that it
is done properly, or using a flash program to automate the process.
Also note that this time, the most recently built image should be in the “tmp” folder, not “tmp-glibc”.
After booting the new Linux system, and logging into the Raspberry Pi 3, you can type the following commands:
The above commands insert the gator.ko module into the Linux kernel and run the gatord service.
Finally, we have to set up the IP address of the target by editing the file /etc/network/interfaces and
modifying the configuration of the network device eth0 as follows:
Make sure that the Raspberry Pi 3 is connected to the development host via Ethernet cable and that the IP
address of the Ethernet connection on the development host is 192.168.1.1.
To profile code execution on the Raspberry Pi 3, we have to start Streamline on the development host, but first we
need to add it to the PATH:
Then add the path to Streamline to the “PATH” string. In this case, the path to Streamline is
“/home/user/DS-5_CE_v5.29.1/bin”, as this is where it was installed.
source /etc/environment
Then you should be able to call Streamline directly:
Or, if you are using a Windows machine, find and run the Streamline application (DS-5 CE x.xx.x).
Then, click on the Capture & analysis button and configure button:
In the dialogue window, configure the address for the connection, setting it to 192.168.1.2 as shown below. Then
click on the Add elf image button in the Program images section (first icon on the left), and add the elf image of
the Linux kernel located at:
/home/user/raspberryPi3/rpi-build/tmp/work/raspberrypi3-poky-linux-gnueabi/linux-raspberrypi/1_4.1.21+gitAUTOINC+ff45bc0e89-r0/linux-raspberrypi3-standard-build/vmlinux
Streamline uses the symbols in the elf files to decode the profile timeline collected on the target, to display the
collected data in a useful format.
Once done, click on Save.
You can now click on Start capture, and after specifying where to save the collected timeline, you can see that the
main window of Streamline starts displaying Raspberry Pi 3 activity.
When Streamline is running, on the target console, run the top command and obtain the following output:
root@raspberrypi3:~#
While top is running on the Raspberry Pi 3, you can recognize increased system activity on the Streamline timeline
(in the timeline below, top was executed from second 4 to about second 5).
After collecting a few seconds of data, click on Stop. The obtained timeline should look as follows (please refer to
the Streamline manual for details on the obtained output):
POST-LAB PRACTICE
In the previous lab, you successfully enabled and used the debugger for a given program; the GNU debugger
provides a wide range of functionalities, and we will be using them extensively in this session. In addition to the
GNU debugger, you have also used Streamline for resource monitoring in the current lab. This is particularly useful
when examining the execution time, memory usage, and clock frequency of a running program.
Now run the executable hello (or helloChallenge) file that is provided with this challenge; this program is
compiled with debugging features enabled. Note that the program is completely different from the one used in the
Lab. Try to answer the following questions with the help of the GNU debugger and Streamline.
1. Use an appropriate tool to identify the program, and explain its functionality and operation.
A: The program is given as an executable; however, because it was compiled with debugging features enabled, you
can view the source code using the “list” command of the GNU debugger. Assuming your image is from the previous
lab with debugging features enabled, simply run it on your Raspberry Pi 3:
(gdb) list
2
3 unsigned long long operation(unsigned int x)
4 {
5 if(x == 0)
6 return 0;
7 else if (x == 1)
8 return 1;
9 else
10 return (operation(x-1)+operation(x-2));
11 }
12
13 unsigned long long main()
14 {
15 unsigned long long num;
16 printf("Enter the input number: ");
17 scanf("%llu", &num);
18 printf("%llu\n", operation(num));
19 return 0;
20 }
Upon viewing the code, the main function only asks for an input and passes it to the “operation” function. The
“operation” function is non-tail-recursive with two base cases, x=0 and x=1; by the mathematical
definition, the program calculates the Fibonacci number whose index is the given input number.
2. Try the program with inputs of 30, 35, 40, and 45, respectively. Record the time taken for each
execution. What do you find?
To conveniently record and monitor the time taken for executing a program, we shall use Streamline from the
main lab body. Assuming you have completed the main lab and your Linux image has the “gator.ko” kernel module
included, type the following commands to set it up:
Now you are ready to start Streamline and begin monitoring the operation. Open Streamline and follow the same
steps as in the main lab to begin a new capture session. Then, execute the hello program with required input
numbers; you can count the execution time by monitoring CPU activity and the Clock frequency. Here are example
measurements and a corresponding plot (note that this may vary slightly):
[Plot: execution time against Input Number; the horizontal axis runs from 28 to 46 and the vertical axis from 0 to 30]
Upon inspecting the plot above, it is clear that the execution time increases exponentially with the input number.
The curve becomes significantly steeper as input number increases and the trend becomes observable from
around 35.
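As a rough cross-check of the Streamline measurements, the growth can also be observed directly on the target. The following is a minimal sketch, not part of the lab material: the names fib_recursive and time_fib are hypothetical, and it assumes that the CPU time reported by the standard C clock() function is precise enough for runs lasting a second or more.

```c
#include <time.h>

/* Same recursion as the "operation" function of the challenge program. */
unsigned long long fib_recursive(unsigned int x)
{
    if (x < 2)
        return x;
    return fib_recursive(x - 1) + fib_recursive(x - 2);
}

/* Hypothetical helper: returns the CPU time in seconds spent computing
 * fib_recursive(n) and stores the computed value in *result. */
double time_fib(unsigned int n, unsigned long long *result)
{
    clock_t start = clock();
    *result = fib_recursive(n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_fib for the inputs 30, 35, 40, and 45 and printing the returned values gives a second set of numbers to compare with the timeline; absolute values depend on the board and the compiler options.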
3. You should have observed an increase in the time taken as the input number increases. Why does this happen?
Which factors account for the increase in execution time?
Two main factors cause the exponential growth: the input number and function call overhead. Since
the program is implemented with non-tail recursion, every call to operation with an input greater than 1 makes two
further calls, so the number of calls grows exponentially with the input number. Each function call requires
additional stack space and time to save the state of the calling function and its variables. This overhead becomes
more significant as the input number grows, and hence becomes unmanageable. Under extreme circumstances, a stack
overflow may occur.
(gdb) list
2
3 unsigned long long operation(unsigned int x)
4 {
5 if(x == 0)
6 return 0;
7 else if (x == 1)
8 return 1;
9 else
10 return (operation(x-1)+operation(x-2));
11 }
12
13 unsigned long long main()
14 {
15 unsigned long long num;
16 printf("Enter the input number: ");
17 scanf("%llu", &num);
18 printf("%llu\n", operation(num));
19 return 0;
20 }
Now you can insert a break point at line 10, at which the program will stop temporarily:
(gdb) break 10
Breakpoint 1 at 0x104f4: file hello.c, line 10.
With the help of the break point, you can continue the program by typing continue or c; meanwhile, you can also
use backtrace or bt to show the function calls. Typing these two commands repeatedly allows you to examine
the number of function calls and how it changes with the input number. You can try, for example, input numbers 3,
5, and 7; the number of function calls does not increase linearly but exponentially.
For example, when the input number equals 3, operation(3) is called once,
operation(2) is called once, and operation(1) is called twice; when the input number equals 4, operation(4) is called
once, operation(3) is called once, operation(2) is called twice, and operation(1) is called three times.
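The same counts can be obtained without stepping through the break point by hand. The sketch below instruments the recursive function with a counter per argument value; the calls array, MAX_INPUT, and count_calls are hypothetical names introduced here for illustration and are not part of the challenge program.

```c
#define MAX_INPUT 64

/* calls[k] records how many times operation(k) has been entered. */
static unsigned long long calls[MAX_INPUT + 1];

unsigned long long operation(unsigned int x)
{
    calls[x]++;
    if (x == 0)
        return 0;
    else if (x == 1)
        return 1;
    else
        return operation(x - 1) + operation(x - 2);
}

/* Hypothetical helper: clears the counters, runs operation(n), and returns
 * the total number of calls made. Returns 0 if n exceeds MAX_INPUT. */
unsigned long long count_calls(unsigned int n)
{
    unsigned long long total = 0;
    unsigned int k;

    if (n > MAX_INPUT)
        return 0;
    for (k = 0; k <= MAX_INPUT; k++)
        calls[k] = 0;
    operation(n);
    for (k = 0; k <= n; k++)
        total += calls[k];
    return total;
}
```

Printing count_calls(n) for a few small inputs, together with the individual calls[k] entries, shows how quickly the call tree grows.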
On the other hand, if you are familiar with big O notation, by inspecting the last line of code, it will become clear
that:
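A sketch of that reasoning in standard notation, where T(n) is the running time for input n and the constant c for the work done inside one call is an assumption introduced here for illustration:

```latex
\begin{aligned}
T(0) &= T(1) = c, \\
T(n) &= T(n-1) + T(n-2) + c \quad (n \ge 2), \\
T(n) &\le 2\,T(n-1) + c \;\Longrightarrow\; T(n) = O(2^n).
\end{aligned}
```

The tight bound is \Theta(\varphi^n) with \varphi = (1+\sqrt{5})/2 \approx 1.618, the same growth rate as the Fibonacci numbers themselves.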
5. How much memory does one function call occupy? Use your GNU debugger to find out a specific value.
Moving on from the previous question, a very useful command from GNU debugger is info frame:
By subtracting the addresses of two adjacent stack frames, it turns out that one stack frame occupies 24
addresses, and hence 24 bytes. See also the discussion of “info stack” at:
https://stackoverflow.com/questions/37321252/check-used-stack-size-using-core-file
6. Suggest and implement a new approach to solve the above timing issue (Hint: think about recursion and
iteration; you can reuse the Makefile from the Lab session).
Since the majority of the increase in execution time is due to the exponentially growing number of function calls,
we can eliminate it by replacing recursion with iteration. Here is an example solution:
#include <stdio.h>

/* Iteratively computes the n-th Fibonacci number (operation(0) = 0). */
unsigned long long operation(unsigned int n)
{
    unsigned long long a = 0;   /* current Fibonacci number */
    unsigned long long b = 1;   /* next Fibonacci number */
    unsigned long long temp;

    while (n != 0)
    {
        temp = a;
        a = b;
        b = b + temp;
        n--;
    }
    return a;
}

int main(void)
{
    unsigned int num;

    printf("Enter the input number: ");
    scanf("%u", &num);   /* %u matches the unsigned int variable */
    printf("%llu\n", operation(num));
    return 0;
}
Now build your program with debug features included, as you did in the previous lab, run it with or without remote
debugging, and use Streamline to record the execution time.
The iterative method has a time complexity of O(n) and a memory complexity of O(1). In addition, since only one
function call is made during the computation, the program does not suffer from significant function call
overhead. Therefore, the program responds quickly for the given inputs.
8. Now try your new program with 91, 92, 93, and 94 as the input number. Do you observe any error? What might be
the cause of the potential error?
Assuming you have declared the variables as unsigned long long, you should get the following output:
Input   Output
91      4,660,046,610,375,530,309
92      7,540,113,804,746,346,429
93      12,200,160,415,121,876,738
94      1,293,530,146,158,671,551
From the table above, it is clear that the output number for input 94 is incorrect. This is called overflow: the
correct result is greater than the maximum value an unsigned long long can hold, which is 64 bits wide here.
The maximum for a 64-bit unsigned long long is 2^64-1 = 18,446,744,073,709,551,615; hence, when
the 94th Fibonacci number exceeds this, the carry out of the most significant bit is lost, leaving
only the low 64 bits, which form an incorrect value. Therefore, the overflow is the cause of the error.
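The wrap-around can be reproduced in isolation from the Fibonacci loop. The following is a minimal sketch under the assumption that unsigned long long is exactly 64 bits wide, as it is on the Raspberry Pi 3 toolchain used here; wrapping_add and add_would_wrap are hypothetical helper names.

```c
#include <limits.h>
#include <stdbool.h>

/* Unsigned addition in C is defined to wrap modulo 2^N, where N is the
 * width of the type, so the carry out of the top bit is silently dropped. */
unsigned long long wrapping_add(unsigned long long a, unsigned long long b)
{
    return a + b;
}

/* Returns true when a + b would exceed ULLONG_MAX. The comparison is
 * rearranged so that the check itself cannot overflow. */
bool add_would_wrap(unsigned long long a, unsigned long long b)
{
    return a > ULLONG_MAX - b;
}
```

Adding the outputs for inputs 93 and 92 with wrapping_add yields the incorrect value reported for input 94, while add_would_wrap flags the same pair before the addition is performed.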
9. With the help of the GNU debugger, justify the cause of the above error.
Assuming you have correctly analyzed the cause of the above error as overflow, we shall visualize it with the GNU
debugger. Arm is a load-store architecture, so arithmetic is performed only on registers; hence, the overflow
occurs in the registers at the moment when the last addition operation is
executed. Therefore, by showing the content of the registers at an appropriate time, we can inspect the overflow in
the registers that caused the error.
Then, use the following command to show the disassembly view of the program:
(gdb) disas /m main
Dump of assembler code for function main:
warning: Source file is more recent than executable.
6 unsigned long long a = 0;
0x000103d0 <+136>: mov r4, #0
0x000103d4 <+140>: mov r5, #0
0x000103d8 <+144>: b 0x103ac <main+100>
11 {
12 temp = a;
13 a = b;
14 b = b + temp;
0x0001038c <+68>: mov r4, r6
0x00010390 <+72>: mov r5, r7
0x00010394 <+76>: adds r6, r0, r4
0x00010398 <+80>: mov r0, r4
0x0001039c <+84>: adc r7, r1, r5
15 n--;
16 }
17
18 return a;
19 }
20
21 unsigned long long main()
22 {
0x0001034c <+4>: push {r4, r5, r6, r7, lr}
0x00010354 <+12>: sub sp, sp, #12
25 scanf("%du", &num);
0x0001035c <+20>: movw r0, #1504 ; 0x5e0
0x00010360 <+24>: movt r0, #1
0x00010364 <+28>: add r1, sp, #4
0x00010368 <+32>: bl 0x10330 <__isoc99_scanf@plt>
26 printf("%llu\n", operation(num));
0x0001036c <+36>: ldr r3, [sp, #4]
0x000103ac <+100>: movw r0, #1512 ; 0x5e8
0x000103b0 <+104>: mov r2, r4
---Type <return> to continue, or q <return> to quit---
0x000103b4 <+108>: mov r3, r5
0x000103b8 <+112>: movt r0, #1
0x000103bc <+116>: bl 0x1030c <printf@plt>
27 return 0;
28 }
0x000103c0 <+120>: mov r0, #0
0x000103c4 <+124>: mov r1, #0
0x000103c8 <+128>: add sp, sp, #12
0x000103cc <+132>: pop {r4, r5, r6, r7, pc}
End of assembler dump.
From the assembly code, it is clear that the variable a, which is returned as the result of the computation, is
represented by registers r4 and r5 (since an unsigned long long holds 64 bits and the registers of this 32-bit Arm
target are 32 bits wide). The r4 and r5 registers are initialized to 0.
break *0x0001039c
With the break point, your program will stop whenever the break point is reached. Since the instruction is in a
while loop and we intend to observe the register overflow, we shall set the break point as “conditional” by
ignoring the first few hits until the overflow is about to happen:
ignore 1 45
The above command tells the debugger to ignore break point 1 for the first 45 times; therefore, the program
will only stop the 46th time it reaches the break point.
continue
or simply type c. The Raspberry Pi 3 will then ask you for an input number; enter 96 or larger.
When the program stops at the break point, we can display the content of the registers with:
info registers
From the value stored in r4, it is clear that the current result is 1836311903, which is indeed the 46th Fibonacci
number. On the other hand, the value of r5 is 0.
continue
info registers
Note that the value in r4 is 2971215073, which is indeed the 47th Fibonacci number. Then, continue the program:
continue
info registers
If we look at r4, the value stored is 512559680, which is inconsistent with the 48th Fibonacci number
and smaller than it should be. The reason is that each register is 32 bits long, and the 48th Fibonacci number is
4807526976, exceeding the range covered by an unsigned 32-bit number. However, since we declared the variable a
as unsigned long long, it is stored in two registers with 64 bits available in total. Therefore, the
carry is stored in r5, with the most significant bits in r5 and the least significant bits in r4.
We can prove it by combining these two registers: 0x00000001 and 0x1e8d0a40. The result is a 64-bit number:
0x000000011e8d0a40, which is equivalent to decimal: 4807526976, indeed the 48th Fibonacci number.
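The way the register pair forms one 64-bit value, and the way the adds/adc instructions in the disassembly propagate the carry from the low word to the high word, can be mimicked in plain C. This is an illustrative sketch only; combine_words and add_with_carry are hypothetical names, and the code models the arithmetic rather than the actual processor flags.

```c
#include <stdint.h>

/* Joins the high word (r5) and the low word (r4) into one 64-bit value. */
uint64_t combine_words(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* Models the adds/adc pair: add the low words first, then add the high
 * words together with the carry produced by the low-word addition. */
uint64_t add_with_carry(uint32_t a_hi, uint32_t a_lo,
                        uint32_t b_hi, uint32_t b_lo)
{
    uint32_t lo = a_lo + b_lo;          /* adds: low-word sum, may wrap */
    uint32_t carry = (lo < a_lo);       /* 1 if the low-word sum wrapped */
    uint32_t hi = a_hi + b_hi + carry;  /* adc: high words plus carry */

    return combine_words(hi, lo);
}
```

Feeding the 46th and 47th Fibonacci numbers into add_with_carry as two values with zero high words reproduces the split seen in the debugger: the low word wraps and the carry ends up in the high word.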
Following the same approach, we ignore the next 45 iterations until the next overflow in r5 occurs:
ignore 1 45
continue
info registers
Now the values are: r4=0x221f2702 and r5=0xa94fad42. If we combine them as before, the decimal equivalent is:
12200160415121876738, which is the 93rd Fibonacci number.
As we know, the 94th Fibonacci number is too large to be stored in a pair of registers, so the next iteration
should overflow and produce an incorrect result:
continue
info registers
Look into r4 and r5; we can combine the values, and the decimal equivalent is: 1293530146158671551, an
incorrect result due to register overflow and identical to what we get in the previous question.
10. Fix the above error and test it again.
#include <stdio.h>
#include <stdlib.h>
return bits;
}
    while (n != 0)
    {
        temp = a;
        a = b;
        if (overflow_check(temp,b))
        {
            b = b + temp;
        }
        else
        {
            printf("Potential overflow detected: Computation terminating\n ");
            exit(1);
        }
        n--;
    }
    return a;
}
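For comparison, here is a self-contained sketch of the same idea. It is one possible approach, not the lab's reference solution: fib_checked is a hypothetical name, the check compares against ULLONG_MAX from <limits.h> rather than using the overflow_check helper above, and the function reports failure through its return value instead of calling exit. Unlike the loop above, it skips the final addition, whose result is never used, so the 93rd Fibonacci number is still returned.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores the n-th Fibonacci number in *result and
 * returns true, or returns false if the value does not fit in an
 * unsigned long long. */
bool fib_checked(unsigned int n, unsigned long long *result)
{
    unsigned long long a = 0;   /* current Fibonacci number */
    unsigned long long b = 1;   /* next Fibonacci number */
    unsigned long long temp;

    while (n != 0)
    {
        temp = a;
        a = b;
        if (n > 1)              /* the next value is needed only if we loop again */
        {
            if (temp > ULLONG_MAX - b)
                return false;   /* b + temp would wrap around */
            b = b + temp;
        }
        n--;
    }
    *result = a;
    return true;
}
```

A caller can print the result when fib_checked returns true and an error message otherwise, which keeps the decision about terminating the program in main.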