Requirements: Profiling Nios II Systems
AN-391-3.0
Application Note
This application note describes the methods to measure the performance of a Nios II system with the GNU profiler (nios2-elf-gprof), the performance counter component, and the timestamp interval timer component. This application note also includes two tutorials to measure performance in the Altera Nios II Software Build Tools (SBT) development flow.
Requirements
To use the tutorials, you must be familiar with the Nios II SBT development flow for Nios II systems, including the Quartus II software and Qsys.
© 2011 Altera Corporation. All rights reserved. ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX are Reg. U.S. Pat. & Tm. Off. and/or trademarks of Altera Corporation in the U.S. and other countries. All other trademarks and service marks are the property of their respective holders as described at www.altera.com/common/legal.html. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.
Tools
You can use the GNU profiler without making any hardware changes to your Nios II system. This tool directs the compiler to add calls to profiler library functions into your application code.
The performance counter component and the timestamp component are minimally intrusive hardware methods for measuring the performance of a Nios II system. This application note describes and compares the two components. To use these methods, you add the hardware components to your system, and you add macro invocations to your source code to start and stop the components. The hardware components perform the measurements.
Compiler speed optimizations affect functions to widely varying degrees. Compiler size optimizations also affect functions in different ways. These differences impact cache usage and resource contention, which can change the relative start times, and therefore the execution times, of functions. For these reasons, you must optimize your code with the -O3 compiler switch, and then profile the optimized code to gain the most insight into how to improve an application in its final form.
The tutorials use three tools to measure the performance of a Nios II system: the GNU profiler, the performance counter component, and the high-resolution timer. The following sections describe these tools.
In addition, the program counter trace collection tool is available for some Nios II processors. However, the tutorials do not use this tool. You use the GNU profiler to identify the areas of code that consume the most CPU time, and a performance counter or a timer component to analyze functional bottlenecks.
GNU Profiler
You must make minimal changes to the source code to take measurements for analysis with the GNU profiler. To implement the required changes, follow these steps:
1. In the Nios II SBT, enable the GNU profiler in your project by turning on the hal.enable_gprof and hal.enable_exit board support package (BSP) settings.
Note: If you use the Nios II SBT for Eclipse, the software enables hal.enable_exit by default.
2. Verify that your main() function returns.
Note: When main() returns, alt_main() calls exit() as appropriate for profiling. The exit() function runs the BREAK 2 instruction, which causes the profiling data to be written to the gmon.out on the host computer.
Time: A 64-bit time (clock tick) counter that counts the number of clock ticks while the code section runs.
Occurrences: A 32-bit event counter that counts the number of times the code section runs.
The performance counter component has up to seven pairs of counters, supporting as many as seven measured sections of C/C++ code. You can change the maximum number of measured code sections by editing the performance counter component in Qsys. These counters enable you to measure the execution time of the designated sections of code. To designate a section, you add macros to your code at its start and end. An additional, built-in pair of counters aggregates the individual code section counters, enabling you to measure each section as a fraction of a larger program. You can use performance counters for analyzing determinism and other runtime issues.
The performance counter component occupies a substantial number of logic elements (LEs) on your device, and requires software instrumentation to obtain performance measurements.
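The bookkeeping that each counter pair performs can be illustrated with a small software model. The following is only a sketch for reasoning about the Time and Occurrences values: the names section_counter, section_begin(), section_end(), and section_percent() are hypothetical and are not part of the performance counter component or its macros, and the tick values are passed in by the caller rather than read from hardware.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SECTIONS 7  /* the component supports up to seven section counter pairs */

/* Hypothetical software model of one counter pair; not the component's register map. */
typedef struct {
    uint64_t time;         /* 64-bit total of clock ticks spent in the section */
    uint32_t occurrences;  /* 32-bit count of times the section ran */
    uint64_t started_at;   /* tick value captured at the most recent begin */
} section_counter;

static section_counter sections[NUM_SECTIONS];

/* Mark the start of a measured section at tick value `now`. */
void section_begin(unsigned section, uint64_t now)
{
    assert(section < NUM_SECTIONS);
    sections[section].started_at = now;
    sections[section].occurrences++;
}

/* Mark the end of a measured section at tick value `now`. */
void section_end(unsigned section, uint64_t now)
{
    assert(section < NUM_SECTIONS);
    sections[section].time += now - sections[section].started_at;
}

/* Share of a larger tick total spent in one section, in percent. */
double section_percent(unsigned section, uint64_t total_ticks)
{
    assert(section < NUM_SECTIONS && total_ticks > 0);
    return 100.0 * (double)sections[section].time / (double)total_ticks;
}
```

In this model, a section that runs twice for 50 ticks each time within a 400-tick program accumulates a Time of 100, an Occurrences count of 2, and a 25% share of the total.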
High-Resolution Timer
A high-resolution timer, in contrast to a performance counter component, does not use a large number of LEs on your device, and does not require heavy instrumentation of every function call in your code to obtain performance measurements. Timers require explicit read calls in the sections of the source code that you want to measure, so they are better suited to pinpointing performance issues in a program. You must instrument the source code manually; however, because this instrumentation is less pervasive, it is also less intrusive. Compared with the performance counter macros, the timer requires many more processor cycles, because each measurement makes two function calls: one to read the time at the beginning of a measured section, and one to read the time at the end.
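The two-read pattern reduces to a subtraction and a unit conversion. The sketch below assumes a 32-bit timer reading and a caller-supplied timer frequency; the helper names elapsed_ticks() and ticks_to_us() are hypothetical. On a HAL-based system the two readings would typically come from alt_timestamp() after a call to alt_timestamp_start(), and the frequency from alt_timestamp_freq(), but the helpers are written so they can be checked on any host.

```c
#include <stdint.h>

/* Ticks between two timer reads. Unsigned subtraction gives the correct
 * result even if the 32-bit timer wrapped once between the reads. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Convert a tick count to microseconds for a timer running at freq_hz. */
double ticks_to_us(uint32_t ticks, uint32_t freq_hz)
{
    return (double)ticks * 1e6 / (double)freq_hz;
}
```

For example, readings of 100 and 600 on a timer assumed to run at 50 MHz correspond to 500 ticks, or 10 microseconds. The measured interval includes the cost of the two read calls themselves.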
Each function is slightly larger due to the additional function call to collect profiling information.
The entry and exit times of each function increase due to profiling information collection.
Instruction-cache misses increase because the profiling functions occupy instruction cache memory.
The memory that records the profiling data can change the behavior of the data cache.
These effects can mask the time-sensitive issues that you are trying to uncover through profiling.
The GNU profiler determines the percentage of time spent in each function by interpolation, based on periodic samplings of the program counter. The GNU profiler ties the periodic samples to the timer tick of the system clock. The GNU profiler can take samples only when interrupts are enabled, and therefore cannot record the processor cycles spent in interrupt routines.
The GNU profiler cannot profile individual functions. You can use the GNU profiler to profile the entire system, or not at all.
The profiling data is a sampling of the program counter at the resolution of the system timer tick. Therefore, the profiling data provides an estimate, not an exact representation, of the processor time spent in different functions. You can improve the statistical significance of the sampling by increasing the frequency of the system timer tick. However, increasing the frequency of the tick increases the time spent recording samples, which in turn affects the integrity of the measurement.
Note: To use the GNU profiler successfully with your custom hardware design, you must ensure that your design includes a system clock timer. The GNU profiler requires this component to produce proper output.
Software Considerations
The GNU profiler instruments your source code with functions to track processor usage.
Profiler Mechanics
You enable the GNU profiler by turning on the hal.enable_gprof switch in the scripts that generate the BSP. Turning on this switch automatically turns on the -pg compiler switch and then links the profiling library code in the altera_nios2 software component with the BSP. This code counts the number of calls to each profiled function.
The -pg compiler option forces the compiler to insert a call to the mcount() function (located in the file altera_nios2/HAL/src/alt_mcount.S) at the beginning of every function. The calls to mcount() track every dynamic parent and child function call relationship to enable the construction of a call graph. The option also installs the nios2_pcsample() function (located in the file altera_nios2/HAL/src/alt_gmon.c), which samples the foreground program counter at every system clock interrupt. When the program executes, the GNU profiler collects data, which is written to the gmon.out on the host. The nios2-elf-gprof utility can read this file and display profiling information about the program.
The profiling code operates on the target by performing the following steps:
1. The compiler instruments function prologues with a call to mcount() to enable the GNU profiler to determine the function call graph. The GNU profiler documentation refers to this data as the function call arc data.
2. The timer interrupt handler registers an alarm to capture information about the foreground function (histogram data) that executes when the alarm triggers.
3. Target memory is allocated from the heap to store the profiling data.
4. When your code exits with a BREAK 2 instruction, the nios2-download utility copies the profiling data from the target to the host.
Note: The nios2-elf-gprof utility requires the function call arc data and the histogram data to work correctly.
For more information about the GNU profiler, refer to the Nios II GNU profiler documentation, included with the GCC documentation, available on the Nios II Embedded Design Suite Support page of the Altera website.
Profiler Overhead
Using the GNU profiler impacts memory and processor cycles.
Memory
The impact of the profiling information on the .text section size is proportional to the number of small functions in the application. The code overhead (the size of the .text section) increases when you enable profiling, due to the addition of the nios2_pcsample() and mcount() functions. The GNU profiler instruments the system timer with a call to nios2_pcsample(), and instruments every function with a call to mcount(). The .text section increases by the additional function calls and by the sizes of these two functions.
Note: To view the impact on the .text section, you can compare the sizes of the .text sections in the .objdump.
The GNU profiler uses buckets to store data on the heap during profiling. Each bucket is two bytes in size. Each bucket holds samples for 32 bytes of code in the .text section. The total number of profiler buckets allocated from the heap is the size of the .text section divided by 32. The heap memory that the GNU profiler buckets consume is therefore:
((.text section size) / 32) × 2 bytes
The GNU profiler measures all functions in the object code that is compiled with profiling information. This set of functions includes the library functions, which include the run-time library and the BSP.
Processor Cycles
The GNU profiler tracks each individual function with a call to mcount(). Therefore, if the application code contains many small functions, the impact of the GNU profiler on processor time is larger. However, the resolution of the profiled data is higher. To calculate the additional processor time consumed by profiling with mcount(), multiply the amount of time that the processor requires to execute mcount() by the number of run-time function calls in your application run.
On every clock tick, the processor calls the nios2_pcsample() function. To calculate the additional processor time required to perform profiling with nios2_pcsample(), multiply the time the processor requires to execute this function by the number of clock ticks that your application requires, which includes the time that the mcount() calls require.
To calculate the number of additional processor cycles used for profiling, add the overhead you calculated for all the calls to mcount() to the overhead you calculated for all the calls to nios2_pcsample().
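The memory and processor-cycle calculations above can be written as two small helpers. This is a sketch of the arithmetic only: the function names are hypothetical, and the cycle costs of mcount() and nios2_pcsample() are inputs that you would have to measure on your own system, because this application note does not state them.

```c
#include <stdint.h>

/* Heap bytes consumed by profiler buckets: one 2-byte bucket
 * for every 32 bytes of code in the .text section. */
uint32_t gprof_bucket_heap_bytes(uint32_t text_size_bytes)
{
    return (text_size_bytes / 32u) * 2u;
}

/* Additional processor cycles used for profiling: the cost of mcount()
 * for every run-time function call, plus the cost of nios2_pcsample()
 * for every system clock tick during the run. */
uint64_t gprof_overhead_cycles(uint64_t mcount_cycles, uint64_t function_calls,
                               uint64_t pcsample_cycles, uint64_t clock_ticks)
{
    return mcount_cycles * function_calls + pcsample_cycles * clock_ticks;
}
```

For example, a 64 KB .text section needs 4,096 bytes of heap for buckets. With purely illustrative costs of 40 cycles per mcount() call and 200 cycles per nios2_pcsample() call, a run with 1,000 function calls and 255 clock ticks spends 91,000 extra cycles on profiling.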
Hardware Considerations
The GNU profiler requires only a system timer. If your Nios II hardware design includes a system timer, you do not need to change your design.
1. Open a Nios II command shell. In the Windows operating system, on the Start menu, point to Programs > Altera > Nios II EDS <version>, and click Nios II <version> Command Shell. In the Linux operating system, in a command shell, change directories to <Nios II EDS install path>, and type the command ./sdk_shell.
2. Change to the directory <profiler_software_examples>/app/profiler_gnu.
3. Create and build the application with the create-this-app script, by typing the following command:
./create-this-app
The create-this-app script runs the create-this-bsp script, which reads settings from the parameter_definition.tcl in <profiler_software_examples>/bsp/hal_profiler_gnu. This Tcl file contains the following lines:
set_setting hal.enable_gprof true
set_setting hal.enable_exit true
The first setting enables the GNU profiler, and the second setting enables the alt_main() function to call exit() following main().
Running the Profiler Software Example
To run the application and collect the GNU profiler data, follow these steps:
1. Open a second Nios II command shell.
2. In the second shell, open a nios2-terminal session by typing the following command:
nios2-terminal
3. In your original Nios II command shell, download the .elf to the development board, run your design, and write the GNU profiler data to the gmon.out, by typing the following command:
nios2-download -g --write-gmon gmon.out *.elf
The GNU profiler collects data while the application runs, and then writes the data to the gmon.out when the application calls the exit() function. Figure 1 shows an example of the GNU profiler output in the Nios II command shell.
4. Exit nios2-terminal by typing Ctrl+C.
Figure 1. GNU Profiler Output on Nios II Command Shell
Creating the GNU Profiler Report
When you run your project, the run creates the gmon.out. You must convert this file to a readable format. To convert the file, follow these steps:
1. In the original Nios II command shell, change your directory to <profiler_software_examples>/app/profiler_gnu.
2. Type the following command:
nios2-elf-gprof profiler_gnu.elf gmon.out > report.txt
This command generates a flat profile report and a call graph, which you can view in the report.txt.
3. Use any text editor to view the report.txt. For more information about the GNU profiler report, refer to "Analyzing the GNU Profiler Report".
Viewing the GNU Profiler Report
The software creates the gmon.out in your project folder, which you can view in the Project Explorer view of the Nios II SBT for Eclipse. If the gmon.out does not appear, right click your project and select Refresh. When you open gmon.out, the Nios II SBT for Eclipse switches to the Profiling view, in which you can view the report. For more information about the GNU profiler report, refer to "Analyzing the GNU Profiler Report".
The flat profile portion of the report lists the child functions in descending order of the processing time that they consume.
The call graph portion of the report describes the call tree of the program, sorted by the total amount of time spent in each function and its children. Each entry in this table consists of several lines. The line with the index number at the left-hand margin lists the current function. The lines above it list the functions that called this function, and the lines below it list the functions that this function called, with exceptions and conditions detailed further in the report itself and in the GNU profiler documentation.
For more information, refer to the Nios II GNU profiler documentation, included with the GCC documentation, available at the Nios II Embedded Design Suite Support page.
Example 1 shows the GNU profiler report excerpts from the previous tutorial. In Example 1, the flat profile shows that the checksum_test_routine() function call consumed 79.19% of the processing time during the execution.
The granularity statement in the call graph report states that the report covers 2.55 seconds (2550 milliseconds). The Nios II timer (sys_clk_timer) has a 10 millisecond period. The timer interrupt occurs once at the beginning, before a full clock period has elapsed, and once every 10 milliseconds thereafter. A precise report, therefore, would show 255 calls to the timer interrupt handler. The entry at index [13] shows 256 calls to alt_avalon_timer_sc_irq(), which is within the precision range of this measurement method.
Note that the result you see may vary from Example 1.
granularity: each sample hit covers 32 byte(s) for 0.39% of 2.55 seconds
index  % time    self  children    called      name
                 0.00    0.00      273/273         alt_irq_entry [106]
[13]      0.0    0.00    0.00      273         alt_irq_handler [13]
                 0.00    0.00      256/256         alt_avalon_timer_sc_irq [14]
                 0.00    0.00       17/17          altera_avalon_jtag_uart_irq [17]
. . .
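The figures in the granularity statement follow directly from the run time and the timer period. The helpers below are a sketch of that arithmetic; the function names are hypothetical.

```c
/* Number of periodic timer samples expected in a run of run_ms milliseconds
 * when the system timer ticks every tick_ms milliseconds. */
unsigned expected_samples(unsigned run_ms, unsigned tick_ms)
{
    return run_ms / tick_ms;
}

/* Share of the total run time that a single sample represents, in percent. */
double sample_percent(unsigned run_ms, unsigned tick_ms)
{
    return 100.0 * (double)tick_ms / (double)run_ms;
}
```

With the values from the report, a 2550 millisecond run and a 10 millisecond tick, expected_samples() gives 255, and sample_percent() gives roughly 0.39, matching the percentage in the granularity line.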
Timer Advantages
The performance counter can track only seven sections of code simultaneously; the timer has no such limit. You can read the timer 1,000 times and store the readings in 1,000 different variables as section start times. Then, you can compare each start time to one of 1,000 end-time readings. The only practical limiting factors are memory consumption, processor overhead, and complexity.
Hardware Considerations
Performance counters and timestamp interval timers are Qsys components. When you add one to an existing system, you must regenerate the Qsys system and recompile the .sof in the Quartus II software. Like any hardware counter, timers and performance counters eventually overflow.
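How soon a counter overflows depends on its width and on the clock that drives it. The following sketch estimates the time to wraparound for a free-running counter; the function name is hypothetical, and the 50 MHz clock in the example is an assumption chosen for illustration, not a value taken from the hardware design in this application note.

```c
/* Seconds until a free-running counter that is `bits` wide wraps around
 * when it increments once per cycle of a clock running at clock_hz. */
double seconds_until_wrap(unsigned bits, double clock_hz)
{
    double counts = 1.0;
    for (unsigned i = 0; i < bits; i++) {
        counts *= 2.0;  /* the counter has 2^bits distinct values */
    }
    return counts / clock_hz;
}
```

At an assumed 50 MHz, a 32-bit counter wraps after about 86 seconds, while a 64-bit counter, such as the performance counter's time counter, would take thousands of years.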
The create-this-app script runs the create-this-bsp script, which reads settings from the parameter_definition.tcl in <profiler_software_examples>/bsp/hal_profiler_performance_counter. This Tcl file contains the following lines:
set_setting hal.sys_clk_timer peripheral_subsystem_sys_clk_timer
set_setting hal.timestamp_timer peripheral_subsystem_high_res_timer
set_setting hal.enable_gprof true
set_setting hal.enable_exit true
The first two lines set the system clock timer and timestamp timer to the corresponding timers in the Qsys system. The third line enables the GNU profiler, and the last line enables the alt_main() function to call exit() following main().
Running the Performance Counter Software Example
To run the application and collect the GNU profiler data, follow these steps:
1. Open a second Nios II command shell.
2. In the second shell, open a nios2-terminal session by typing the following command:
nios2-terminal
3. In your original Nios II command shell, run the program by typing the following command:
nios2-download -g *.elf
Figure 3 shows an example of the output that appears in the Nios II command shell. Your output might vary. For more information, refer to Analyzing the Performance Counter Report.
Figure 3. Performance Counter Report on Nios II Command Shell
12. Generate your BSP project and exit.
13. Right click your project in the Project Explorer view, and click Build Project.
14. To run the profiler_performance_counter software, right click your application project, point to Run As, and click Nios II Hardware.
Figure 4 shows the Nios II Console output after running profiler_performance_counter. The data are similar to the command-line example in Figure 3. For more information, refer to Analyzing the Performance Counter Report.
Figure 4. Performance Counter Report on Nios II Console
Conclusion
The Nios II development environment provides several tools to analyze the performance of your project. The software-only GNU profiler approach requires no hardware changes, but adds memory and processor overhead. To analyze deterministic real-time performance issues, you can use a hardware timer or a performance counter. To choose the best tool for your task, consider the problem that you are solving.
Troubleshooting
The following sections describe several problems that might occur, and suggest ways to troubleshoot the problems.
Assume that your code specifies a fourth section counter for a performance counter component that is defined in Qsys to have only three section counters (the default value). In Example 3, the test is performed on a hardware design that does not have any other component with registers mapped immediately after the registers of the performance counter component. Therefore, there is no impact on other components. However, depending on how you configure the component register base addresses in Qsys for a particular hardware design, unpredictable system behavior could occur.
Output From a printf() or perf_print_formatted_output() Call Near the End of main() Might Be Prematurely Truncated
This issue occurs when the Nios II application executes a BREAK instruction to transfer profiling data to the development workstation during the exit() call or the return from main(). As a workaround, call usleep(500000) before exiting or returning from main(). This call creates an adequate delay for the pending I/O to be transmitted to the JTAG UART before main() returns (or calls exit()). If the output is still partially truncated, increase the delay value passed to usleep(). Use #include <unistd.h> for the usleep() function prototype.
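The workaround can be sketched as a small helper that main() calls as its last action. The helper name report_and_drain() and the message text are hypothetical; only the usleep() call and the <unistd.h> include come from the workaround described above.

```c
#include <stdio.h>
#include <unistd.h>  /* usleep() function prototype */

/* Print the final output, then wait delay_us microseconds so that the
 * output can reach the JTAG UART before main() returns. */
int report_and_drain(unsigned int delay_us)
{
    printf("Profiling run complete.\n");  /* placeholder for the final report output */
    usleep(delay_us);
    return 0;
}
```

With this helper, the last statement of main() would be return report_and_drain(500000); if the output is still truncated, pass a larger delay value.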
Fitting a Performance Counter in a Hardware Design That Consumes Most of a Device's Resources
During development, you can measure the system on a device that is larger than the device in your deployed system. To save the most resources, configure the performance counter to have only one section counter.
The Histogram for the gmon.out File Is Missing, Even Though My main() Function Terminates
If you do not define a system timer for the system, the profiler does not call the nios2_pcsample() function, and does not generate the histogram for the gmon.out. Define a system timer for your system.
Further Reading
For information about the GNU profiler, refer to the Nios II GNU profiler documentation, included with the GCC documentation, available at the Nios II Embedded Design Suite Support page.
Because Altera has rewritten the lib-gprof library, the information in this application note about data collection deviates from Altera's implementation.
For information about the performance counter, refer to the Performance Counter Core chapter in the Embedded Peripherals IP User Guide. For information about the high-speed timer, refer to the Timer Core chapter in the Embedded Peripherals IP User Guide.
Replaced mentions of SOPC Builder with Qsys.
Updated "Obtaining the Hardware Design", "Obtaining the Software Examples", "Program Counter Trace Information", "Tutorial: Using the GNU Profiler", "Creating the Profiler Software Example", "Creating the GNU Profiler Report", "Creating and Running the Profiler Software Example", "Analyzing the GNU Profiler Report", "Flat Profile and Call Graph Example", "Modifying the Nios II Hardware Design", "Creating the Performance Counter Software Example", "Running the Performance Counter Software Example", and "Performance Counter Example with Nios II SBT for Eclipse".
Updated document, software and screen shots for the Nios II SBT for Eclipse.
Added the Nios II SBT for Eclipse flow.
Updated examples for the NEEK.
Updated document for the Quartus II software and Nios II EDS v8.0.
Replaced references to the Nios II IDE with instructions in the Nios II software build flow.
General updates for the Quartus II software v8.0.
July 2010
3.0
Updated document for the Quartus II software and Nios II EDS v5.1 SP1.
Updated document for the Quartus II software and Nios II EDS v5.1.
Initial release.