Tracy Manual
Profiler
The user manual
https://github.com/wolfpld/tracy
Tracy Profiler The user manual
Quick overview
Hello and welcome to the Tracy Profiler user manual! Here you will find all the information you need to
start using the profiler. This manual has the following layout:
• Chapter 1, A quick look at Tracy Profiler, gives a short description of what Tracy is and how it works.
• Chapter 2, First steps, shows how the profiler can be integrated into your application, and how to build
the graphical user interface (section 2.3). At this point you will be able to establish a connection from
the profiler to your application.
• Chapter 3, Client markup, provides information on how to instrument your application in order to
retrieve useful profiling data. This includes a description of the C API (section 3.12), which enables usage
of Tracy in any programming language.
• Chapter 4, Capturing the data, goes into more detail on how the profiling information can be captured
and stored on disk.
• Chapter 5, Analyzing captured data, guides you through the graphical user interface of the profiler.
• Chapter 6, Exporting zone statistics to CSV, explains how to export some zone timing statistics into a CSV
format.
• Chapter 7, Importing external profiling data, documents how to import data from other profilers.
Contents
2 First steps
2.1 Initial client setup
2.1.1 Short-lived applications
2.1.2 On-demand profiling
2.1.3 Client discovery
2.1.4 Client network interface
2.1.5 Setup for multi-DLL projects
2.1.6 Problematic platforms
2.1.6.1 Microsoft Visual Studio
2.1.6.2 Apple woes
2.1.6.3 Android lunacy
2.1.6.4 Virtual machines
2.1.7 Changing network port
2.1.8 Limitations
2.2 Check your environment
2.2.1 Operating system
2.2.2 CPU design
2.2.2.1 Superscalar out-of-order speculative execution
2.2.2.2 Simultaneous multithreading
2.2.2.3 Turbo mode frequency scaling
2.2.2.4 Power saving
2.2.2.5 AVX offset and power licenses
2.2.2.6 Summing it up
2.3 Building the server
2.3.1 Required libraries
2.3.1.1 Windows
2.3.1.2 Unix
2.3.2 Build process
2.3.3 Embedding the server in profiled application
2.4 Naming threads
2.5 Crash handling
2.6 Feature support matrix
3 Client markup
3.1 Handling text strings
3.1.1 Program data lifetime
3.2 Specifying colors
3.3 Marking frames
3.3.1 Secondary frame sets
3.3.2 Discontinuous frames
3.3.3 Frame images
3.3.3.1 OpenGL screen capture code example
3.4 Marking zones
3.4.1 Manual management of zone scope
3.4.2 Multiple zones in one scope
3.4.3 Filtering zones
3.4.4 Transient zones
3.4.5 Variable shadowing
3.4.6 Exiting program from within a zone
3.5 Marking locks
3.5.1 Custom locks
3.6 Plotting data
3.7 Message log
3.7.1 Application information
3.8 Memory profiling
3.9 GPU profiling
3.9.1 OpenGL
3.9.2 Vulkan
3.9.3 Direct3D 12
3.9.4 OpenCL
3.9.5 Multiple zones in one scope
3.10 Collecting call stacks
3.11 Lua support
3.11.1 Call stacks
3.11.2 Instrumentation cleanup
3.12 C API
3.12.1 Setting thread names
3.12.2 Frame markup
3.12.3 Zone markup
3.12.3.1 Zone context data structure
3.12.3.2 Zone validation
3.12.3.3 Transient zones in C API
3.12.4 Memory profiling
3.12.5 Plots and messages
3.12.6 Call stacks
3.12.7 Using the C API to implement bindings
3.13 Automated data collection
3.13.1 CPU usage
3.13.2 Context switches
3.13.3 CPU topology
3.13.4 Call stack sampling
3.13.5 Executable code retrieval
3.13.6 Vertical synchronization
3.14 Trace parameters
3.15 Connection status
8 Configuration files
8.1 Root directory
8.2 Trace specific settings
A License
B List of contributors
1.1 Real-time
The concept of Tracy being a real-time profiler may be explained in a couple of different ways:
1. The profiled application is not slowed down by profiling1. The act of recording a profiling event has
virtually zero cost – it only takes a few nanoseconds. Even on low-power mobile devices there’s no
perceptible impact on execution speed.
2. The profiler itself works in real-time, without the need to process collected data in a complex way.
Actually, it is quite inefficient in the way it works, as the data it presents is calculated anew each frame.
And yet it can run at 60 frames per second.
3. The profiler has full functionality when the profiled application is running and the data is still being
collected. You may interact with your application and then immediately switch to the profiler when a
performance drop occurs.
system, which do have significantly reduced resolution (about 300 ns – 1 µs). This is enough to hide the
subtle impact of cache access optimization, etc.
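[Figure 1 timeline: timer ticks spaced 300 ns apart along the time axis; range endpoints appear in the
order C1, B1, D1, D2, A1, A2, B2, C2.]
Figure 1: Low precision (300 ns) timer. Discrete timer ticks are indicated by an icon on the time axis.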
• The 𝐴 and 𝐷 ranges both take a very short amount of time (10 ns), but the 𝐴 range is reported as 300 ns,
and the 𝐷 range is reported as 0 ns.
• The 𝐵 range takes a considerable amount of time (590 ns), but according to the timer readings, it took
the same time (300 ns) as the short lived 𝐴 range.
• The 𝐶 range (610 ns) is only 20 ns longer than the 𝐵 range, but it is reported as 900 ns, a 600 ns difference!
Here you can see why it is important to use a high precision timer. While there is no escape from the
measurement errors, their impact can be reduced by increasing the timer accuracy.
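The three observations above can be reproduced with a few lines of arithmetic. The sketch below models a timer that only advances every 300 ns and computes what such a timer would report for each range. The absolute start and end positions used in the usage note are hypothetical; they were chosen only so that the range lengths and their order match figure 1.

```cpp
#include <cstdint>

// Timer resolution from figure 1.
constexpr std::int64_t kTickNs = 300;

// What a low-resolution timer returns at a given true time: the last tick
// that has already happened.
constexpr std::int64_t read_timer(std::int64_t true_ns) {
    return (true_ns / kTickNs) * kTickNs;
}

// Duration reported for a range, computed from two quantized readings.
constexpr std::int64_t reported_ns(std::int64_t begin_ns, std::int64_t end_ns) {
    return read_timer(end_ns) - read_timer(begin_ns);
}
```

With ticks at 300, 600 and 900 ns, placing A at 595 to 605 ns gives a reported 300 ns, D at 400 to 410 ns gives 0 ns, B at 305 to 895 ns gives 300 ns, and C at 295 to 905 ns gives 900 ns. The reported value depends only on how many ticks fall inside the range, not on its true length.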
[Figure: diagram with the labels Thread 1, Thread 3, Display and Storage.]
In Tracy terminology, the profiled application is a client and the profiler itself is a server. It was named this
way because the client is a thin layer that just collects events and sends them for processing and long-term
storage on the server. The fact that the server needs to connect to the client to begin the profiling session may
be a bit confusing at first.
• Tracy is free and open source (BSD license), while RAD Telemetry costs about $8000 per year.
• Tracy provides out-of-the-box Lua bindings. It has been successfully integrated with other native and
interpreted languages (Rust, Arma scripting language) using the C API (see chapter 3.12 for reference).
• Tracy has a wide variety of profiling options. You can profile CPU, GPU, locks, memory allocations,
context switches and more.
• Tracy is feature rich. Statistical information about zones, trace comparisons, or inclusion of inline
function frames in call stacks (even in statistics of sampled stacks) are features unique to Tracy.
• Tracy focuses on performance. Many tricks are used to reduce memory requirements and network
bandwidth. The impact on the client execution speed is minimal, while other profilers perform heavy
data processing within the profiled application (and then claim to be lightweight).
• Tracy uses low-level kernel APIs, or even raw assembly, where other profilers rely on layers of
abstraction.
7See section 2.3.3 for guidelines.
• Tracy has been multi-platform from the very beginning, on both the client and the server side. Other
profilers tend to have Windows-specific graphical interfaces.
• Tracy can handle millions of frames, zones, memory events, and so on, while other profilers tend to
target very short captures.
• Tracy doesn’t require manual markup of interesting areas in your code to start profiling. You may rely
on automated call stack sampling and add instrumentation later, when you know where it’s needed.
• Tracy provides mapping of source code to the assembly, with detailed information about cost of
executing each instruction on the CPU.
Mode  Zones (total)  Zones (single image)  Clean run  Profiling run  Difference
ETC1  201,326,592    16,777,216            110.9 ms   148.2 ms       +37.3 ms
ETC2  201,326,592    16,777,216            212.4 ms   250.5 ms       +38.1 ms
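The table can be turned into a rough per-zone cost. The sketch below divides the run-time difference by the number of zones, assuming (this is not stated in the table itself) that the run times are measured per single image, so the "Zones (single image)" column is the right divisor. The function and constant names are illustrative only.

```cpp
// Extra time per zone in nanoseconds, given the run-time difference in
// milliseconds and the number of zones recorded during that run.
constexpr double per_zone_ns(double difference_ms, double zones) {
    return difference_ms * 1.0e6 / zones;
}

// Assumption: the timings in the table are per single image.
constexpr double kEtc1NsPerZone = per_zone_ns(37.3, 16777216.0);  // about 2.2 ns
constexpr double kEtc2NsPerZone = per_zone_ns(38.1, 16777216.0);  // about 2.3 ns
```

Under that assumption both modes land at a little over 2 ns per zone, which is in line with the "few nanoseconds" per profiling event mentioned in section 1.1.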
mov  byte ptr [rsp+0C0h],1          ; store zone activity information
mov  r15d,28h
mov  rax,qword ptr gs:[58h]         ; TLS
mov  r14,qword ptr [rax]            ; queue address
mov  rdi,qword ptr [r15+r14]        ; data address
mov  rbp,qword ptr [rdi+28h]        ; buffer counter
mov  rbx,rbp
and  ebx,7Fh                        ; 128 item buffer
jne  function+54h  -----------+     ; check if current buffer is usable
mov  rdx,rbp                  |
mov  rcx,rdi                  |
call enqueue_begin_alloc      |     ; reclaim / alloc next buffer
shl  rbx,5  <-----------------+     ; buffer items are 32 bytes
add  rbx,qword ptr [rdi+48h]        ; calculate queue item address
mov  byte ptr [rbx],10h             ; queue item type
rdtsc                               ; retrieve time
shl  rdx,20h
or   rax,rdx                        ; construct 64 bit timestamp
mov  qword ptr [rbx+1],rax          ; write timestamp
lea  rax,[__tracy_source_location]  ; static struct address
mov  qword ptr [rbx+9],rax          ; write source location data
lea  rax,[rbp+1]                    ; increment buffer counter
mov  qword ptr [rdi+28h],rax        ; write buffer counter
8https://github.com/wolfpld/etcpak
The second code block, responsible for ending a zone, is similar, but smaller, as it can reuse some variables
retrieved in the above code.
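For readers who do not read x64 assembly daily, the zone-begin listing above can be paraphrased in C++. This is only a sketch of the arithmetic visible in the listing (128-item buffers, 32-byte items, a type byte followed by a timestamp and a source location pointer). All function and constant names are hypothetical and are not taken from the Tracy sources.

```cpp
#include <cstdint>
#include <cstring>

constexpr std::uint64_t kBufferItems = 128;  // "and ebx,7Fh"
constexpr std::uint64_t kItemBytes = 32;     // "shl rbx,5"

// True when the counter points at slot 0, i.e. the slow path
// (reclaim / alloc next buffer) has to run before writing.
constexpr bool needs_next_buffer(std::uint64_t counter) {
    return (counter & (kBufferItems - 1)) == 0;
}

// Byte offset of the queue item inside the current buffer.
constexpr std::uint64_t item_offset(std::uint64_t counter) {
    return (counter & (kBufferItems - 1)) * kItemBytes;
}

// rdtsc returns the timestamp in two 32-bit halves (eax and edx).
constexpr std::uint64_t make_timestamp(std::uint32_t low, std::uint32_t high) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Fills one queue item: type at byte 0, timestamp at byte 1,
// source location pointer at byte 9 (both unaligned, hence memcpy).
inline void write_zone_begin(unsigned char* item, std::uint64_t timestamp,
                             const void* source_location) {
    item[0] = 0x10;  // queue item type used in the listing
    std::memcpy(item + 1, &timestamp, sizeof(timestamp));
    std::memcpy(item + 9, &source_location, sizeof(source_location));
}
```

The point of the layout is that the common path is a mask, a shift and three stores; only one call in 128 has to take the slower buffer allocation path.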
1.8 Examples
To see how Tracy can be integrated into an application, you may look at example programs in the examples
directory. Looking at the commit history might be the best way to do that.
• Homepage – https://github.com/wolfpld/tracy
2 First steps
Tracy Profiler supports MSVC, gcc and clang. A reasonably recent version of the compiler is needed, due to
the C++11 requirement. The following platforms are confirmed to be working (this is not a complete list):
• FreeBSD (x64)
• Cygwin (x64)
• MinGW (x64)
• WSL (x64)
• OSX (x64)
Moreover, the following platforms are not supported due to how secretive their owners are, but were
reported to be working after extending the system integration layer:
• PlayStation 4
• Xbox One
• Nintendo Switch
• Google Stadia
• Using the last-version-tagged revision will give you a stable platform to work with. You won’t
experience any breakages, major UI overhauls or network protocol changes. Unfortunately, you
also won’t be getting any bug fixes.
• Working with the bleeding edge master development branch will give you access to all the new
improvements and features added to the profiler. While it is generally expected that master
should always be usable, there are no guarantees that it will be so.
Do note that all bug fixes and pull requests are made against the master branch.
With the source code included in your project, add the tracy/TracyClient.cpp source file to the IDE
project and/or makefile. You’re done. Tracy is now integrated into the application.
In the default configuration Tracy is disabled. This way you don’t have to worry that the production builds
will perform collection of profiling data. You will probably want to create a separate build configuration,
with the TRACY_ENABLE define, which enables profiling.
Important
Double-check that the define name is entered correctly (as TRACY_ENABLE); do not make the mistake of
adding an additional D at the end. Make sure that this macro is defined for all files across your project
(e.g. it should be specified in the CFLAGS variable, which is always passed to the compiler, or in an
equivalent way), and not as a #define in just some of the source files.
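One way to catch the misspelled define early is a small preprocessor guard placed in a header that every source file includes. This is a sketch and not part of Tracy: the header placement and the kProfilingEnabled constant are hypothetical. Tracy itself only looks at TRACY_ENABLE.

```cpp
// Fail the build if only the misspelled macro is present.
#if defined(TRACY_ENABLED) && !defined(TRACY_ENABLE)
#  error "TRACY_ENABLED is defined, but the profiler is controlled by TRACY_ENABLE"
#endif

// Hypothetical convenience constant mirroring the build configuration.
#ifdef TRACY_ENABLE
constexpr bool kProfilingEnabled = true;
#else
constexpr bool kProfilingEnabled = false;
#endif
```

Because the guard is evaluated per translation unit, it only protects the files that actually include it, which is one more reason to pass the define on the compiler command line.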
The application you want to profile should be compiled with all the usual optimization options enabled
(i.e. make a release build). It makes no sense to profile debugging builds, as the unoptimized code and
additional checks (asserts, etc.) completely change how the program behaves.
Finally, on Unix make sure that the application is linked with libraries libpthread and libdl. BSD
systems will also need to be linked with libexecinfo.
Caveats
The client with on-demand profiling enabled needs to perform additional bookkeeping, in order to
present a coherent application state to the profiler. This incurs additional time cost for each profiling
event.
• Profiling is interrupted when the application exits. This will result in missing zones, memory allocations,
or even source location names.
setenforce 0
mount -o remount,hidepid=0 /proc
echo 0 > /proc/sys/kernel/perf_event_paranoid
The first command will allow access to system CPU statistics. The second one will allow inspection of
foreign processes (which is required for context switch capture). The last one will lower restrictions on access
to performance counters. Be sure that you are fully aware of the consequences of making these changes.
• Inability to obtain precise time stamps, resulting in error messages such as CPU doesn’t support RDTSC
instruction, or CPU doesn’t support invariant TSC. On Windows this can be worked around by rebuilding
the profiled application with the TRACY_TIMER_QPC define, which severely lowers resolution of time
readings.
Important
To enable network communication, Tracy needs to open a listening port. Make sure it is not blocked by
an overzealous firewall or anti-virus program.
2.1.8 Limitations
When using Tracy Profiler, keep in mind the following requirements:
The following conditions also need to apply, but don’t trouble yourself with them too much. You would
probably already know if you were breaking any.
• How many cores are being used? Just one, or all 8? All 16?
• What type of work is being performed? Integer? Floating point? 128-wide SIMD? 256-wide SIMD?
512-wide SIMD?
• Were you lucky in the silicon lottery? Some dies are simply better made and are able to achieve higher
frequencies.
• Are you running on the best-rated core, or on the worst-rated core? Some cores may be unable to match
the performance of other cores in the same processor.
• What kind of cooling solution are you using? The cheap one bundled with the CPU, or a beefy chunk
of metal that has no problem with heat dissipation?
• Do you have complete control over the power profile? Spoiler alert: no. The operating system may run
anything at any time on any of the other cores, which will impact the turbo frequency you’re able to
achieve.
As you can see, this feature basically screams ’unreliable results!’ Best keep it disabled and run at the
base frequency. Otherwise your timings won’t make much sense. A true example: a branchless compression
function, executed multiple times with the same input data, was measured running at four different speeds.
Keep in mind that even at the base frequency you may hit thermal limits of the silicon and be downthrottled.
2.2.2.6 Summing it up
Power management schemes employed in various CPUs make it hard to reason about the true performance of
the code. For example, figure 3 contains a histogram of function execution times (as described in chapter 5.7),
as measured on an AMD Ryzen CPU. The results ranged from 13.05 µs to 61.25 µs (extreme outliers were not
included in the graph, limiting the longest displayed time to 36.04 µs).
We can immediately see that there are two distinct peaks, at 13.4 µs and 15.3 µs. A reasonable assumption
would be that there are two paths in the code, one that can omit some work, and a second one that must
do some additional work. But here’s the catch – the measured code is actually branchless and is always executed
the same way. The two peaks represent two turbo frequencies between which the CPU was aggressively
switching.
We can also see that the graph gradually falls off to the right (representing longer times), with a small
bump near the end. This can be attributed to running in power saving mode, with differing reaction times to
the required operating frequency boost to full power.
Important
Due to the memory requirements for data storage, Tracy server is only supposed to run on 64-bit
platforms. While there is nothing preventing the program from building and executing in a 32-bit
environment, doing so is not supported.
Capstone library At the time of writing, the capstone library is in a bit of disarray. The officially released
version 4.0.2 can’t disassemble some AVX instructions, which are successfully parsed by the next branch.
However, the next branch somehow lost information about input/output registers for some functions. You
may want to explore the various available versions to find the one that best suits your needs.
2.3.1.1 Windows
On Windows you will need to use the vcpkg utility. If you are not familiar with this tool, please read the
description at the following address: https://docs.microsoft.com/en-us/cpp/build/vcpkg.
There are two ways you can run vcpkg to install the dependencies for Tracy:
• Local installation within the project directory – run this script to download and build both vcpkg and
the required dependencies:
This writes files only to the vcpkg\vcpkg directory and makes no other changes on your machine.
• System-wide installation – install vcpkg by following the instructions on its website, and then execute
the following commands:
2.3.1.2 Unix
On Unix systems you will need to install the pkg-config utility and the following libraries: glfw, freetype,
capstone, GTK3. Some Linux distributions will require you to add a lib prefix and a -dev, or -devel postfix
to library names. You may also need to add a seemingly random number to the library name (for example:
freetype2, or freetype6). The GTK library could be installed as libgtk-3-dev on some systems. How fun!
Installation of the libraries on OSX can be facilitated using the brew package manager.
• TRACY_NO_FILESELECTOR – controls whether a system load/save dialog is compiled in. If it’s enabled,
the saved traces will be named trace.tracy.
• TRACY_NO_STATISTICS – Tracy will perform statistical data collection on the fly, if this macro is not
defined. This allows extended analysis of the trace (for example, you can perform a live search for
matching zones) at a small CPU processing cost and a considerable memory usage increase (at least 8
bytes per zone).
• TRACY_NO_ROOT_WINDOW – the main profiler view won’t occupy the whole window if this macro is defined.
Additional setup is required for this to work. If you are embedding the server into your application
you probably want to enable this option.
Caveats
On MSVC the debugger has priority over the application in handling exceptions. If you want to finish
the profiler data collection with the debugger hooked-up, select the continue option in the debugger
pop-up dialog.
19For example, invalid memory accesses (’segmentation faults’, ’null pointer exceptions’), divisions by zero, etc.
3 Client markup
With the aforementioned steps you will be able to connect to the profiled program, but there probably
won’t be any data collection performed20. Unless you’re able to perform automatic call stack sampling
(see chapter 3.13.4), you will have to manually instrument the application. All the user-facing interface is
contained in the tracy/Tracy.hpp header file.
Manual instrumentation is best started with adding markup to the main loop of the application, along
with a few functions that are called there. This will give you a rough outline of the functions’ time cost, which
you may then further refine by instrumenting functions deeper in the call stack. Alternatively, automated
sampling might guide you more quickly to places of interest.
1. When a macro only accepts a pointer (for example: TracyMessageL(text)), the provided string data
must be accessible at any time in program execution (this also includes the time after exiting the main
function). The string also cannot be changed. This basically means that the only option is to use a string
literal (e.g.: TracyMessageL("Hello")).
2. If there’s a string pointer with a size parameter (for example: TracyMessage(text, size)), the profiler
will allocate an internal temporary buffer to store the data. The size count should not include the
terminating null character; using strlen(text) is fine. The pointed-to data is not used afterwards.
Remember that allocating and copying memory involved in this operation has a small time cost.
Be aware that each single instance of text string data passed to the profiler can’t be larger than 64 KB.
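The two cases can be illustrated with the message macros named above. The sketch below is written so that it compiles with or without the Tracy sources: when tracy/Tracy.hpp is not found, no-op stand-ins replace the real macros. The helper names and the kMaxTextBytes cap are hypothetical; the cap is deliberately conservative and simply stays below the 64 KB limit.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Use the real header when it is available; otherwise fall back to no-op
// stand-ins so that the sketch is self-contained.
#if defined(__has_include)
#  if __has_include("tracy/Tracy.hpp")
#    include "tracy/Tracy.hpp"
#    define SKETCH_HAS_TRACY 1
#  endif
#endif
#ifndef SKETCH_HAS_TRACY
#  define TracyMessageL(text) ((void)(text))
#  define TracyMessage(text, size) ((void)(text), (void)(size))
#endif

// Hypothetical cap, chosen to stay safely below the 64 KB limit.
constexpr std::size_t kMaxTextBytes = 60 * 1024;

// Case 2: pointer plus size. The data is copied by the profiler, so the
// string may be destroyed or modified right after the call.
// Returns the number of bytes handed over.
std::size_t send_message(const std::string& text) {
    const std::size_t size = std::min(text.size(), kMaxTextBytes);
    TracyMessage(text.data(), size);
    return size;
}

// Case 1: pointer only. A string literal stays valid and unchanged for the
// whole program run, which is what this macro variant requires.
void report_startup() {
    TracyMessageL("Startup finished");
}
```

Passing the result of std::string::c_str() to the pointer-only variant would be a bug, because the buffer can be freed long before the profiler reads it.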
Do I need this?
This step is optional, as some applications do not use the concept of a frame.
Important
• Frame types must not be mixed. For each frame set, identified by a unique name, use either
continuous or discontinuous frames only!
• You must issue the FrameMarkStart and FrameMarkEnd macros in proper order. Be extra careful,
especially if multi-threading is involved.
• Discontinuous frames may not work correctly if the profiled program doesn’t have string pooling
enabled. This is an implementation issue which will be fixed in the future.
Table 3: Client compression time of 320 × 180 image. x86: Ryzen 9 3900X (MSVC); ARM: ODROID-C2 (gcc).
Caveats
• Frame images are compressed on a second client profiler thread (a small part of the compression
task is performed on the server), to reduce memory usage of queued images. This might have an
impact on the performance of the profiled application.
• Due to implementation details of the network buffer, a single frame image cannot be greater than
256 KB after compression. Note that a 960 × 540 image fits in this limit.
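The 256 KB limit can be checked with a little arithmetic. The sketch below assumes that the compressed format is DXT1, which stores each 4 × 4 pixel block in 8 bytes, and that the image width and height are multiples of 4. Both points are assumptions of this example, not statements from the text above, and the function names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kFrameImageLimit = 256 * 1024;  // 262,144 bytes

// Size of the uncompressed RGBA image handed to the profiler.
constexpr std::size_t rgba_bytes(std::size_t width, std::size_t height) {
    return width * height * 4;
}

// Size after compression, assuming DXT1 (8 bytes per 4x4 block) and
// dimensions that are multiples of 4.
constexpr std::size_t dxt1_bytes(std::size_t width, std::size_t height) {
    return (width / 4) * (height / 4) * 8;
}

constexpr bool fits_limit(std::size_t width, std::size_t height) {
    return dxt1_bytes(width, height) <= kFrameImageLimit;
}
```

Under these assumptions a 960 × 540 image compresses to 259,200 bytes, just under the limit, while a 320 × 180 image needs 230,400 bytes uncompressed and 28,800 bytes compressed.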
Everything needs to be properly initialized (the cleanup is left for the reader to figure out).
We will now set up a screen capture, which will downscale the screen contents to 320 × 180 pixels and
copy the resulting image to a buffer which will be accessible by the CPU when the operation is done. This
should be placed right before the swap buffers or present call.
assert(m_fiQueue.empty() || m_fiQueue.front() != m_fiIdx);  // check for buffer overrun
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, m_fiFramebuffer[m_fiIdx]);
glBlitFramebuffer(0, 0, res.x, res.y, 0, 0, 320, 180, GL_COLOR_BUFFER_BIT, GL_LINEAR);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBindFramebuffer(GL_READ_FRAMEBUFFER, m_fiFramebuffer[m_fiIdx]);
glBindBuffer(GL_PIXEL_PACK_BUFFER, m_fiPbo[m_fiIdx]);
glReadPixels(0, 0, 320, 180, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
m_fiFence[m_fiIdx] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
m_fiQueue.emplace_back(m_fiIdx);
m_fiIdx = (m_fiIdx + 1) % 4;
And lastly, just before the capture setup code that was just added27, we need to have the image retrieval
code. We check whether the capture operation has finished and, if it has, we map the pixel buffer object to
memory, inform the profiler that there’s image data to be handled, unmap the buffer and go on to check the
next queue item. If a capture is still pending, we break out of the loop and wait until the next frame to check
if the GPU has finished the capture.
Notice that in the call to FrameImage we are passing the remaining queue size as the offset parameter.
Queue size represents how many frames ahead our program is relative to the GPU. Since we are sending
past frame images, we need to specify how many frames behind the images are. Of course, if this were a
synchronous capture (without use of fences and with retrieval code after the capture setup), we would set
offset to zero, as there would be no frame lag.
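The relation between queue size and the offset parameter is easier to see in a model without any graphics calls. The sketch below is a simulation, not the retrieval code itself: the OpenGL fence is replaced by a hypothetical readyAtFrame number, and all type and function names are made up for illustration. It only shows which offset would be reported for each retrieved image.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// One capture that has been submitted but not yet read back.
struct PendingCapture {
    int slot;          // which of the 4 buffers holds the image
    int readyAtFrame;  // stand-in for the GPU fence being signaled
};

struct CaptureQueue {
    static constexpr int kSlots = 4;
    std::deque<PendingCapture> pending;
    int nextSlot = 0;

    // Capture setup: runs right before the frame is presented.
    void submit(int frame, int gpuLatencyFrames) {
        pending.push_back({nextSlot, frame + gpuLatencyFrames});
        nextSlot = (nextSlot + 1) % kSlots;
    }

    // Image retrieval: runs before submit() in the same frame. Returns the
    // offset that would be reported for every image that is ready.
    std::vector<std::size_t> retrieve(int frame) {
        std::vector<std::size_t> offsets;
        while (!pending.empty()) {
            if (pending.front().readyAtFrame > frame) break;  // still pending
            // The remaining queue size (including this item) tells how many
            // frames behind the image is.
            offsets.push_back(pending.size());
            pending.pop_front();
        }
        return offsets;
    }
};
```

With a GPU latency of two frames, the image captured in frame 0 is retrieved during frame 2 while two captures are queued, so it is reported with an offset of 2.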
High quality capture The code above uses the glBlitFramebuffer function, which offers at most bilinear
filtering (GL_LINEAR, as used above) and no mipmapped minification, so a strong downscale skips most of
the source pixels. This can result in low-quality screenshots, as shown in figure 4. With a bit more work it
is possible to obtain much nicer looking screenshots, as presented in figure 5. Unfortunately, you will need
to set up a complete rendering pipeline for this to work.
First, you need to allocate an additional set of intermediate frame buffers and textures, sized the same as the
screen. These new textures should have their minification filter set to GL_LINEAR_MIPMAP_LINEAR. You will also
need to set up everything needed to render a full-screen quad: a simple texturing shader and a vertex buffer
with appropriate data. Since this vertex buffer will be used to render to the scaled-down framebuffer, you
may prepare its contents beforehand and update it only when the aspect ratio changes.
27Yes, before. We are handling past screen captures here.
With all this done, the screen capture can be performed as follows:
• Set up the vertex buffer configuration for the full-screen quad buffer (you only need position and uv
coordinates).
While this approach is much more complex than the previously discussed one, the resulting image quality
increase makes it worth it.
You can see the performance results you may expect in a simple application in table 4. The naïve capture
performs synchronous retrieval of full screen image and resizes it using stb_image_resize. The proper and
high quality captures do things as described in this chapter.
constant in the recording (don’t try to parametrize it). You may also set a custom name for the zone, using the
ZoneScopedN(name) macro. Color and name may be combined by using the ZoneScopedNC(name, color)
macro.
Use the ZoneText(text, size) macro to add a custom text string that will be displayed along the zone
information (for example, name of the file you are opening). Multiple text strings can be attached to any
single zone. If you want to send a numeric value and don’t want to pay the cost of converting it to a string,
you may use the ZoneValue(uint64_t) macro.
If you want to set zone name on a per-call basis, you may do so using the ZoneName(text, size) macro.
This name won’t be used in the process of grouping the zones for statistical purposes (sections 5.6 and 5.7).
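As a minimal sketch of the macros described above, the following function opens a named zone and attaches a text string and a numeric value to it. The LoadFile function and its placeholder workload are hypothetical. The fallback definitions are an assumption added only so that the snippet also compiles where the Tracy headers are not installed; they behave like a build in which profiling is disabled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: stand-in no-op macros, used only when Tracy is not installed.
#  define ZoneScopedN(name)
#  define ZoneText(text, size)
#  define ZoneValue(value)
#endif

// Hypothetical function: pretends to load a file and returns its size in bytes.
std::size_t LoadFile(const std::string& path)
{
    ZoneScopedN("LoadFile");                       // zone name is a string literal
    ZoneText(path.c_str(), path.size());           // per-call text shown with the zone

    const std::size_t bytes = path.size() * 1024;  // placeholder for the real work

    ZoneValue(static_cast<std::uint64_t>(bytes));  // numeric value, no string conversion
    return bytes;
}
```

The zone ends automatically when LoadFile returns, so no explicit end call is needed.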
Important
Zones are identified using static data structures embedded in program code. You need to consider the
lifetime of code in your application, as discussed in section 3.1.1, to make sure that the profiler is able
to access this data at any time during the program lifetime.
If this requirement can’t be fulfilled, you must use transient zones, described in section 3.4.4.
Zone stack
The ZoneScoped macros impose the creation and use of an implicit zone stack. You must follow
the rules of this stack also when you are using the named macros, which give you some more leeway
in doing things. For example, you can only set the text for the zone which is on top of the stack, just
as with the ZoneText macro. It doesn’t matter that you can call the Text method of a
non-top zone which is accessible through a variable. Take a look at the following code:
{
    ZoneNamed(Zone1, true);
    // (a)
    {
        ZoneNamed(Zone2, true);
        // (b)
    }
    // (c)
}
It is valid to set the Zone1 text or name only in places (a) or (c). After Zone2 is created at (b) you can
no longer perform operations on Zone1, until Zone2 is destroyed.
29https://en.cppreference.com/w/cpp/language/raii
30The last parameter is explained in section 3.4.3.
enum SubSystems
{
    Sys_Physics = 1 << 0,
    Sys_Rendering = 1 << 1,
    Sys_NasalDemons = 1 << 2
};
...
...
void Function()
{
    ZoneScoped;
    ...
    for(int i=0; i<10; i++)
    {
        ZoneScoped;
        ...
    }
}
This doesn’t stop some compilers from dispensing fashion advice about variable shadowing (as both
ZoneScoped calls create a variable with the same name, with the inner scope one shadowing the one in the
outer scope). If you want to avoid these warnings, you will also need to use the ZoneNamed macros.
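A minimal sketch of the ZoneNamed alternative is shown below. The function SumSquares and the variable names outerZone and iterationZone are hypothetical; the fallback macro is an assumption that only lets the snippet compile where the Tracy headers are not installed.

```cpp
#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: stand-in no-op macro, used only when Tracy is not installed.
#  define ZoneNamed(varname, active)
#endif

// Hypothetical function with an outer zone and one inner zone per iteration.
int SumSquares(int count)
{
    ZoneNamed(outerZone, true);          // explicit variable name for the outer zone
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        ZoneNamed(iterationZone, true);  // distinct name, so nothing is shadowed
        sum += i * i;
    }
    return sum;
}
```

Because each zone variable has its own name, the inner declaration no longer hides the outer one, which is what the shadowing warnings complain about.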
with
Alternatively, you may use TracyLockableN(type, varname, description) to provide a custom lock
name at a global level, which will replace the automatically generated ’std::mutex m_lock’-like name. You
may also set a custom name for a specific instance of a lock, through the LockableName(varname, name,
size) macro.
The standard std::lock_guard and std::unique_lock wrappers should use the LockableBase(type)
macro for their template parameter (unless you’re using C++17, with improved template argument deduction).
For example:
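The following sketch shows one way the pieces may fit together: a lock declared with TracyLockable, a std::lock_guard parametrized with LockableBase, and an optional LockMark. The Counter class is hypothetical. The fallback definitions are an assumption for environments without the Tracy headers; they collapse the wrappers to the plain lock type, as a build with profiling disabled would.

```cpp
#include <mutex>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: stand-ins that reduce the wrappers to the plain lock type.
#  define TracyLockable(type, varname) type varname
#  define LockableBase(type) type
#  define LockMark(varname)
#endif

// Hypothetical counter protected by an instrumented mutex.
class Counter
{
public:
    int Increment()
    {
        std::lock_guard<LockableBase(std::mutex)> lock(m_lock);
        LockMark(m_lock);   // optional: marks the place where the lock is held
        return ++m_value;
    }

private:
    TracyLockable(std::mutex, m_lock);   // replaces "std::mutex m_lock;"
    int m_value = 0;
};
```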
To mark the location of a lock being held, use the LockMark(varname) macro, after you have obtained the
lock. Note that the varname must be a lock variable (a reference is also valid). This step is optional.
31https://en.cppreference.com/w/cpp/named_req/Mutex
Condition variables
The standard std::condition_variable is only able to accept std::mutex locks. To use the
Tracy lock wrapper, use std::condition_variable_any instead.
Caveats
Due to limits of internal bookkeeping in the profiler, each lock may be used in no more than 64 unique
threads. If you have many short-lived temporary threads, consider using a thread pool to limit the
number of created threads.
To mark memory events, use the TracyAlloc(ptr, size) and TracyFree(ptr) macros. Typically you
would do that in overloads of operator new and operator delete, for example:
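A minimal sketch of such overloads follows. It is not taken from the Tracy sources; the use of malloc and free as the underlying allocator is an assumption, and the fallback macros only exist so that the snippet compiles where the Tracy headers are not installed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: stand-in no-op macros, used only when Tracy is not installed.
#  define TracyAlloc(ptr, size)
#  define TracyFree(ptr)
#endif

void* operator new(std::size_t count)
{
    void* ptr = std::malloc(count != 0 ? count : 1);
    if (ptr == nullptr) throw std::bad_alloc();
    TracyAlloc(ptr, count);   // report the allocation once it has succeeded
    return ptr;
}

void operator delete(void* ptr) noexcept
{
    if (ptr == nullptr) return;   // deleting a null pointer is a no-op, nothing to report
    TracyFree(ptr);               // report the free before the memory is released
    std::free(ptr);
}
```

Reporting the free before calling free avoids a window in which another thread could receive the same address and report it as allocated while the profiler still considers it in use.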
In some rare cases (e.g. destruction of TLS block), events may be reported after the profiler is no longer
available, which would lead to a crash. To work around this issue, you may use the TracySecureAlloc and
TracySecureFree variants of the macros.
Important
Each tracked memory free event must also have a corresponding memory allocation event. Tracy will
terminate the profiling session if this assumption is broken (see section 4.6). If you encounter this issue,
you may want to check for:
• Reporting the same memory address being allocated twice (without a free between two allocs).
• Untracked allocations made in external libraries, that are freed in the application.
This requirement is relaxed in the on-demand mode (section 2.1.2), because the memory allocation
event might have happened before the connection was made.
{
    vkBeginCommandBuffer(cmd, &beginInfo);
    TracyVkZone(ctx, cmd, "Render");
    vkEndCommandBuffer(cmd);
}
To fix such issues, add a nested scope, encompassing the command buffer recording section.
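The effect of the nested scope can be sketched without a Vulkan device. In the example below every name is a hypothetical stand-in: ScopedGpuZone plays the role of the scoped object created by TracyVkZone, and BeginCommandBuffer and EndCommandBuffer stand for vkBeginCommandBuffer and vkEndCommandBuffer. The code only records the order of events, under the assumption that the problem being fixed is a zone that ends after command buffer recording has finished.

```cpp
#include <string>
#include <vector>

// Hypothetical event log, used only to observe the ordering.
static std::vector<std::string> g_events;

// Stand-in for the scoped object created by TracyVkZone.
struct ScopedGpuZone
{
    ScopedGpuZone()  { g_events.push_back("zone begin"); }
    ~ScopedGpuZone() { g_events.push_back("zone end"); }
};

void BeginCommandBuffer() { g_events.push_back("begin command buffer"); }
void EndCommandBuffer()   { g_events.push_back("end command buffer"); }

void RecordFrame()
{
    BeginCommandBuffer();
    {
        ScopedGpuZone zone;   // TracyVkZone(ctx, cmd, "Render") in real code
        // ... record rendering commands here ...
    }                         // the zone ends here, while recording is still active
    EndCommandBuffer();
}
```

With the extra braces the zone is closed before the command buffer leaves the recording state, which matches the requirement that the command buffer passed to the zone macro is in the recording state.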
3.9.1 OpenGL
You will need to include the tracy/TracyOpenGL.hpp header file and declare each of your rendering contexts
using the TracyGpuContext macro (typically you will only have one context). Tracy expects no more than
one context per thread and no context migration.
To mark a GPU zone use the TracyGpuZone(name) macro, where name is a string literal name of the zone.
Alternatively you may use TracyGpuZoneC(name, color) to specify zone color.
You also need to periodically collect the GPU events using the TracyGpuCollect macro. A good place to
do it is after the swap buffers function call.
Caveats
• GPU profiling is not supported on OSX and iOS.
• Android devices do work, if GPU drivers are not broken. Disjoint events are not currently
handled, so some readings may be a bit spotty.
• Nvidia drivers are unable to provide consistent timing results when two OpenGL contexts are
used simultaneously.
3.9.2 Vulkan
Similarly, for Vulkan support you should include the tracy/TracyVulkan.hpp header file. Tracing Vulkan
devices and queues is a bit more involved, and the Vulkan initialization macro TracyVkContext(physdev,
device, queue, cmdbuf) returns an instance of TracyVkCtx object, which tracks an associated Vulkan
queue. Cleanup is performed using the TracyVkDestroy(ctx) macro. You may create multiple Vulkan
contexts.
The physical device, logical device, queue and command buffer must be related to each other. The queue
must support graphics or compute operations. The command buffer must be in the initial state and be able
to be reset. It will be rerecorded and submitted to the queue multiple times, and it will be in the executable
state on exit from the initialization function.
To mark a GPU zone use the TracyVkZone(ctx, cmdbuf, name) macro, where name is a string literal
name of the zone. Alternatively you may use TracyVkZoneC(ctx, cmdbuf, name, color) to specify zone
color. The provided command buffer must be in the recording state and it must be created within the queue
that is associated with ctx context.
You also need to periodically collect the GPU events using the TracyVkCollect(ctx, cmdbuf) macro34.
The provided command buffer must be in the recording state and outside of a render pass instance.
Calibrated context In order to maintain synchronization between CPU and GPU time domains, you will
need to enable the VK_EXT_calibrated_timestamps device extension and retrieve the following function
pointers: vkGetPhysicalDeviceCalibrateableTimeDomainsEXT and vkGetCalibratedTimestampsEXT.
To enable calibrated context, replace the macro TracyVkContext with TracyVkContextCalibrated and
pass the two functions as additional parameters, in the order specified above.
3.9.3 Direct3D 12
To enable Direct3D 12 support, include the tracy/TracyD3D12.hpp header file. Tracing Direct3D 12 queues
is nearly on par with the Vulkan implementation, where a TracyD3D12Ctx is returned from a call to
TracyD3D12Context(device, queue), which should be later cleaned up with the TracyD3D12Destroy(ctx)
macro. Multiple contexts can be created, each with any queue type.
The queue must have been created through the specified device, however a command list is not needed
for this stage.
Using GPU zones is the same as the Vulkan implementation, where the TracyD3D12Zone(ctx, cmdList,
name) macro is used, with name as a string literal. TracyD3D12ZoneC(ctx, cmdList, name, color) can be
used to create a custom-colored zone. The given command list must be in an open state.
The macro TracyD3D12NewFrame(ctx) is used to mark a new frame, and should appear before or after
recording command lists, similar to FrameMark. This macro is a key component that enables automatic query
data synchronization, so the user doesn’t have to worry about synchronizing GPU execution before invoking
a collection. Event data can then be collected and sent to the profiler using the TracyD3D12Collect(ctx)
macro.
Note that due to artifacts from dynamic frequency scaling, GPU profiling may be slightly inaccurate.
To counter this, ID3D12Device::SetStablePowerState() can be used to enable accurate profiling, at the
expense of some performance. If the machine is not in developer mode, the device will be removed upon
calling. Do not use this in shipping code.
Direct3D 12 contexts are always calibrated.
3.9.4 OpenCL
OpenCL support is achieved by including the tracy/TracyOpenCL.hpp header file. Tracing OpenCL requires
the creation of a Tracy OpenCL context using the macro TracyCLContext(context, device), which will
return an instance of TracyCLCtx object that must be used when creating zones. The specified device
must be part of the context. Cleanup is performed using the TracyCLDestroy(ctx) macro. Although not
common, it is possible to create multiple OpenCL contexts for the same application.
34It is considerably faster than the OpenGL’s TracyGpuCollect.
To mark an OpenCL zone one must make sure that a valid OpenCL cl_event object is available. The
event will be the object that Tracy will use to query profiling information from the OpenCL driver. For this to
work, all OpenCL queues must be created with the CL_QUEUE_PROFILING_ENABLE property.
OpenCL zones can be created with the TracyCLZone(ctx, name) macro, where name will usually be a
descriptive name for the operation represented by the cl_event. Within the scope of the zone, you must call
TracyCLSetEvent(event) for the event to be registered in Tracy.
Similarly to Vulkan and OpenGL, you also need to periodically collect the OpenCL events using the
TracyCLCollect(ctx) macro. A good place to perform this operation is after a clFinish, since this will
ensure that any previously queued OpenCL commands will have finished by this point.
Figure 6: Plot of call stack capture times for x64 and x86, time (ns) against call stack depth (see table 5).
Notice that the capture time grows linearly with requested capture depth.
Table 5: Median times of zone capture with call stack. x86, x64: i7 8700K; ARM: Banana Pi; ARM64: ODROID-C2. Selected
architectures are plotted on figure 6
You can force call stack capture in the non-S postfixed macros by adding the TRACY_CALLSTACK define, set
to the desired call stack capture depth. This setting doesn’t affect the explicit call stack macros.
The maximum call stack depth that can be retrieved is 62 frames. This is a restriction at the level of
the operating system.
Debugging symbols
To have proper call stack information, the profiled application must be compiled with debugging
symbols enabled. You can achieve that in the following way:
• On MSVC open the project properties and go to Linker→Debugging→Generate Debug Info, where
the Generate Debug Information option should be selected.
• On gcc or clang remember to specify the debugging information -g parameter during compilation
and do not add the strip symbols -s parameter. Additionally, omitting frame pointers will severely
reduce the quality of stack traces, which can be fixed by adding the -fno-omit-frame-pointer
parameter. Link the executable with an additional option -rdynamic (or --export-dynamic, if
you are passing parameters directly to the linker).
• On OSX you may need to run dsymutil to extract the debugging data out of the executable binary.
• On iOS you will have to add a New Run Script Phase to your XCode project, which shall execute
the following shell script:
You will also need to set up proper dependencies, by setting the following input file:
${TARGET_BUILD_DIR}/${WRAPPER_NAME}.dSYM, and the following output file:
${TARGET_BUILD_DIR}/${UNLOCALIZED_RESOURCES_FOLDER_PATH}/${PRODUCT_NAME}.dSYM.
You may also be interested in symbols from external libraries, especially if you have sam-
pling profiling enabled (section 3.13.4). In MSVC you can retrieve such symbols by going to
Tools→Options→Debugging→Symbols and selecting appropriate Symbol file (.pdb) location servers. Note
that additional symbols may significantly increase application startup times.
3.12 C API
In order to profile code written in the C programming language, you will need to include the tracy/TracyC.h
header file, which exposes the C API.
At the moment there’s no support for C API based markup of locks, GPU zones, or Lua.
35While technically this name doesn’t need to be constant, like in the ZoneScopedN macro, it should be, as it is used to group the
zones together. This grouping is then used to display various statistics in the profiler. You may still set the per-call name using the
tracy.ZoneName method.
Depth Time
1 707 ns
2 699 ns
3 624 ns
4 727 ns
5 836 ns
10 1.77 µs
15 2.44 µs
20 2.51 µs
25 2.98 µs
30 3.6 µs
35 4.33 µs
40 5.17 µs
45 6.01 µs
50 6.99 µs
55 8.11 µs
60 9.17 µs
Table 6: Median times of Lua zone capture with call stack (x64, 13 native frames)
Important
Tracy is written in C++, so you will need to have a C++ compiler and link with C++ standard library,
even if your program is strictly pure C.
• TracyCFrameMark
• TracyCFrameMarkNamed(name)
• TracyCFrameMarkStart(name)
• TracyCFrameMarkEnd(name)
• TracyCZone(ctx, active)
Refer to sections 3.4 and 3.4.2 for description of macro variants and parameters. The ctx parameter
specifies the name of a data structure, which will be created on stack to hold the internal zone data.
Unlike C++, there’s no automatic destruction mechanism in C, so you will need to manually mark where
the zone ends. To do so use the TracyCZoneEnd(ctx) macro.
Zone text and name may be set by using the TracyCZoneText(ctx, txt, size), TracyCZoneValue(ctx,
value) and TracyCZoneName(ctx, txt, size) macros. Make sure you are following the zone stack rules,
as described in section 3.4.2!
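The manual zone handling described above might look as follows. This is a sketch only: the count_spaces function is hypothetical, it is compiled as C++ here to match the other examples in this manual, and the fallback macros are an assumption that lets the snippet compile where the Tracy headers are not installed.

```cpp
#include <cstddef>
#include <cstring>

#if __has_include(<tracy/TracyC.h>)
#  include <tracy/TracyC.h>
#else
// Assumption: stand-in no-op macros, used only when Tracy is not installed.
#  define TracyCZone(ctx, active)
#  define TracyCZoneText(ctx, txt, size)
#  define TracyCZoneValue(ctx, value)
#  define TracyCZoneEnd(ctx)
#endif

// Hypothetical function: counts the spaces in a null-terminated string.
std::size_t count_spaces(const char* text)
{
    TracyCZone(zone, 1);                             // creates the variable "zone"
    TracyCZoneText(zone, text, std::strlen(text));   // text shown with the zone

    std::size_t spaces = 0;
    for (const char* p = text; *p != '\0'; ++p)
    {
        if (*p == ' ') ++spaces;
    }

    TracyCZoneValue(zone, spaces);
    TracyCZoneEnd(zone);                             // must be reached on every exit path
    return spaces;
}
```

The function has a single exit point on purpose. With several return statements, each of them would need its own TracyCZoneEnd call.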
• The created variable name is exactly what you pass as the ctx parameter.
• Contents of the data structure can be copied by assignment. Do not retrieve or use the structure’s
address – this is asking for trouble.
• You must use the data structure (or any of its copies) exactly once to end a zone.
• TracyCAlloc(ptr, size)
• TracyCFree(ptr)
• TracyCSecureAlloc(ptr, size)
• TracyCSecureFree(ptr)
Using this functionality in a proper way can be quite tricky, as you will need to handle not only the memory
allocations made by external libraries (which typically allow usage of custom memory allocation functions),
but also the allocations made by system functions. If such an allocation can’t be tracked, you will need to
make sure freeing is not reported36.
There is no explicit support for realloc function. You will need to handle it by marking memory
allocations and frees, according to the system manual describing behavior of this routine.
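One possible way to report realloc is sketched below. The tracked_realloc wrapper and its oldSize parameter are hypothetical and not part of Tracy; the fallback macros are an assumption for environments without the Tracy headers. The sketch reports the free before calling realloc, and reports the old block again if the call fails, since a failed realloc leaves the original block untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if __has_include(<tracy/TracyC.h>)
#  include <tracy/TracyC.h>
#else
// Assumption: stand-in no-op macros, used only when Tracy is not installed.
#  define TracyCAlloc(ptr, size)
#  define TracyCFree(ptr)
#endif

// Hypothetical wrapper. The caller supplies the old size, so the old block
// can be reported again when realloc fails.
void* tracked_realloc(void* ptr, std::size_t oldSize, std::size_t newSize)
{
    assert(newSize != 0);   // zero-size realloc is implementation-defined, not handled
    (void)oldSize;          // only read by the macro below when profiling is enabled

    // Report the free first, so the address is never seen as allocated twice.
    if (ptr != nullptr) { TracyCFree(ptr); }

    void* result = std::realloc(ptr, newSize);
    if (result != nullptr)
    {
        TracyCAlloc(result, newSize);   // moved or resized: report as a new allocation
    }
    else if (ptr != nullptr)
    {
        TracyCAlloc(ptr, oldSize);      // failure: the old block is still valid
    }
    return result;
}
```

Passing a null pointer makes the wrapper behave like a tracked malloc, which mirrors the behavior of realloc itself.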
For more information refer to section 3.8.
• TracyCPlot(name, val)
• TracyCMessage(txt, size)
• TracyCMessageL(txt)
• TracyCMessageLC(txt, color)
• TracyCAppInfo(txt, size)
• ___tracy_init_thread(void)
Here line is the line number in the source file and function is the name of the function in which the
zone is created. sourceSz and functionSz are the sizes of the corresponding string arguments in bytes. You
may additionally specify an optional zone name, by providing it in the name variable, and specifying its size
in nameSz.
The ___tracy_alloc_srcloc and ___tracy_alloc_srcloc_name functions return a uint64_t source
location identifier corresponding to an allocated source location. As these functions do not require the
provided string data to remain available after they return, the calling code is free to deallocate it at any time
afterwards. This way the string lifetime requirements described in section 3.1 are relaxed.
Before the ___tracy_alloc_* functions are called on a non-main thread for the first time, care should
be taken to ensure that ___tracy_init_thread has been called first. The ___tracy_init_thread function
initializes per-thread structures Tracy uses and can be safely called multiple times.
The uint64_t return value from allocation functions must be passed to one of the zone begin functions:
• ___tracy_emit_zone_begin_alloc(srcloc, active)
These functions return a TracyCZoneCtx context value, which must be handled, as described in
sections 3.12.3 and 3.12.3.1.
The variable representing an allocated source location is of an opaque type. After it is passed to one of the
zone begin functions, its value cannot be reused (the variable is consumed). You must allocate a new source
location for each zone begin event, even if the location data would be the same as in the previous instance.
Important
Since you are directly calling the profiler functions here, you will need to take care of manually disabling
the code, if the TRACY_ENABLE macro is not defined.
Caveats
• Context switch data is retrieved using the kernel profiling facilities, which are not available to
users with normal privilege level. To collect context switches you will need to elevate your rights
to admin level, either by running the profiled program from the root account on Unix, or through
the Run as administrator option on Windows. On Android context switches will be collected if you
have a rooted device (see section 2.1.6.3 for additional information).
• Android context switch capture requires spawning an elevated process to read kernel data. While
the standard cat utility can be used for this task, the CPU usage is not acceptable due to how
the kernel handles blocking reads. As a workaround, Tracy will inject a specialized kernel data
reader program at /data/tracy_systrace, which has more acceptable resource requirements.
37A context switch happens when any given CPU core stops executing one thread and starts running another one.
38Commonly known as Hyper-threading.
Important
In this manual, the word core is typically used as a short term for logical CPU. Do not confuse it with
physical processor cores.
Important
For proper program code retrieval no module used by the application can be unloaded during the
runtime. See section 3.1.1 for an explanation.
The idx argument is a user-defined parameter index and val is the value set in the profiler user interface.
To specify individual parameters, use the TracyParameterSetup(idx, name, isBool, val) macro. The
idx value will be passed to the callback function for identification purposes (Tracy doesn’t care what it’s set
to). Name is the parameter label, displayed on the list of parameters. Finally, isBool determines if val should
be interpreted as a boolean value, or as an integer number.
Important
Usage of trace parameters makes profiling runs dependent on user interaction with the profiler, and
thus is not recommended if a consistent profiling environment is desired. Furthermore,
interaction with the parameters is only possible in the graphical profiling application, and not in the
command line capture utility.
• -a address – specifies the IP address (or a domain name) of the client application (uses localhost if
not provided).
If there is no client running at the given address, the server will wait until a connection can be made.
During the capture the following information will be displayed:
The queue delay and timer resolution parameters are calibration results of timers used by the client. The
next line is a status bar, which displays: network connection speed, connection compression ratio, and the
resulting uncompressed data rate; total amount of data transferred over the network; memory usage of the
capture utility; time extent of the captured data.
You can disconnect from the client and save the captured trace by pressing Ctrl + C .
[Figure: dialog with an address entry field, Connect and Open trace buttons, and a list of discovered clients (for example 127.0.0.1 | 21 s | Application).]
Both connecting to a client and opening a saved trace will present you with the main profiler view, which
you can use to analyze the data (see section 5).
If frame image capture has been implemented (chapter 3.3.3), a thumbnail of the last received frame
image will be provided for reference.
If the profiled application opted to provide trace parameters (see section 3.14) and the connection is still
active, this pop-up will also contain a trace parameters section, listing all the provided options. When you
change any value here, a callback function will be executed on the client.
set of events.
44The operating system is able to manage memory paging much better than Tracy would be ever able to.
The new file contains the same data as the old one, but in the updated internal representation. Note that
to perform an upgrade, the whole trace needs to be loaded into memory.
Figure 10: Plot of trace sizes for different compression modes (see table 7).
Figure 11: Logarithmic plot of trace compression times for different compression modes (see table 7).
Figure 12: Plot of trace load times for different compression modes (zstd, default, hc, extreme; see table 7).
Trace files created using the default, hc and extreme modes are optimized for fast decompression and can
be further compressed using file compression utilities. For example, using 7-zip results in archives of the
following sizes: 77.2 MB, 54.3 MB, 52.4 MB.
For archival purposes it is however much better to use the zstd compression modes, which are faster,
compress trace files more tightly, and are directly loadable by the profiler, without the intermediate
decompression step.
• l – locks.
• m – messages.
• p – plots.
• M – memory.
• i – frame images.
• c – context switches.
• s – sampling data.
• C – symbol code.
Flags can be concatenated, for example specifying -s CSi will remove symbol code, source file cache and
frame images in the destination trace file.
Timeline view
Figure 13: Main profiler window. Note that the top line of buttons has been split into two rows in this manual.
• Connection – Opens the connection information popup (see section 4.2.1). Only available when live
capture is in progress.
• Close – This button unloads the current profiling trace and returns to the welcome menu, where
another trace can be loaded. In live captures it is replaced by the Pause, Resume and Stopped buttons.
• Pause – While a live capture is in progress, the profiler will display recent events, as either the last
three fully captured frames, or a certain time range. This can be used to see the current behavior of the
program. The pause button45 will stop the automatic updates of the timeline view (the capture will
still be progressing).
• Resume – This button allows you to resume following the most recent events in a live capture. You can
select one of the following options: Newest three frames, or Use current zoom level.
• Stopped – Inactive button used to indicate that the client application was terminated.
• Messages – Toggles the message log window (section 5.5), which displays custom messages sent by
the client, as described in section 3.7.
• Find zone – This button toggles the find zone window, which allows inspection of zone behavior
statistics (section 5.7).
• Statistics – Toggles the statistics window, which displays zones sorted by their total time cost
(section 5.6).
• Memory – Various memory profiling options may be accessed here (section 5.9).
45Or perform any action on the timeline view, apart from changing the zoom level.
• Compare – Toggles the trace compare window, which allows you to see the performance difference
between two profiling runs (section 5.8).
• Tools – Allows access to optional data collected during capture. Some choices might be unavailable.
– Playback – If frame images were captured (section 3.3.3), you will have the option to open the frame
image playback window, described in chapter 5.18.
– CPU data – If context switch data was captured (section 3.13.2), this button will allow inspecting
the processor load during the capture, as described in section 5.19.
– Annotations – If annotations have been made (section 5.3.1), you can open a list of all annotations,
described in chapter 5.21.
– Limits – Displays the time range limits window (section 5.3).
The frame information block consists of four elements: the current frame set name along with the number
of captured frames (click on it with the left mouse button to go to a specified frame), the two navigational
buttons, which allow you to focus the timeline view on the previous or next frame, and the frame set
selection button, which is used to switch to another frame set46. For more information about marking
frames, see section 3.3.
The next three items show the view time range, the time span of the whole capture (clicking on it with
the middle mouse button will set the view range to the entire capture), and the memory usage of the
profiler.
• – At least one timeline item (e.g. a single thread, a single plot, a single lock, etc.) is hidden.
46See section 5.2.3.2 for another way to change the active frame set.
Each bar displayed on the graph represents a unique frame in the current frame set47. Time progresses
from left to right. The height of the bar indicates the time spent in the frame, complemented with the
color information:
• If the bar is blue, then the frame met the best time of 143 FPS, or 6.99 ms48 (represented by blue target
line).
• If the bar is green, then the frame met the good time of 59 FPS, or 16.94 ms (represented by green target
line).
• If the bar is yellow, then the frame met the bad time of 29 FPS, or 34.48 ms (represented by yellow target
line).
• If the bar is red, then the frame didn’t meet any time limits.
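The color rules above can be restated as a small classification function. This is an illustrative sketch and not the profiler's source code: the names FrameColor and ClassifyFrame are hypothetical, and treating a frame time exactly equal to a target as meeting it is an assumption.

```cpp
#include <cstdint>

enum class FrameColor { Blue, Green, Yellow, Red };

// Illustrative only. Thresholds are the frame times of the FPS targets
// listed above, expressed in nanoseconds.
FrameColor ClassifyFrame(std::int64_t frameTimeNs)
{
    constexpr std::int64_t second = 1000000000;
    constexpr std::int64_t best = second / 143;   // about 6.99 ms
    constexpr std::int64_t good = second / 59;    // about 16.9 ms
    constexpr std::int64_t bad  = second / 29;    // about 34.48 ms

    if (frameTimeNs <= best) return FrameColor::Blue;
    if (frameTimeNs <= good) return FrameColor::Green;
    if (frameTimeNs <= bad)  return FrameColor::Yellow;
    return FrameColor::Red;
}
```

For instance, a frame rendered at a steady 60 FPS takes roughly 16.67 ms, which misses the best target but meets the good one, so its bar would be green.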
The frames visible on the timeline are marked with a violet box drawn over them.
When a zone is displayed in the find zone window (section 5.7), the coloring of frames may be changed,
as described in section 5.7.2.
Moving the mouse cursor over the frames displayed on the graph will display a tooltip with information
about frame number, frame time, frame image (if available, see chapter 3.3.3), etc. Such tooltips are common
for many UI elements in the profiler and won’t be mentioned later in the manual.
The timeline view may be focused on the frames, by clicking or dragging the left mouse button on the
graph. The graph may be scrolled left and right by dragging the right mouse button over the graph. The
view may be zoomed in and out by using the mouse scroll. If the view is zoomed out, so that multiple
frames are merged into one column, the highest frame time will be used to represent the given column.
Clicking the left mouse button on the graph while the Ctrl key is pressed will open the frame image
playback window (section 5.18) and set the playback to the selected frame. See section 3.3.3 for more
information about frame images.
Collapsed items Due to extreme differences in time scales, you will almost constantly see events that
are too small to be displayed on the screen. Such events have preset minimum size (so they can be seen) and
are marked with a zig-zag pattern, to indicate that you need to zoom-in to see more detail.
The zig-zag pattern can be seen applied to frame sets on figure 16, and to zones on figure 17.
47Unless the view is zoomed out and multiple frames are merged into one column.
48The actual target is 144 FPS, but one frame leeway is allowed to account for timing inaccuracies.
+13.76 s 20 µs 40 µs 60 µs 80 µs 100 µs
The leftmost value on the scale represents the time at which the timeline starts. The rest of numbers label
the notches on the scale, with some numbers omitted, if there’s no space to display them.
Hovering the mouse pointer over the time scale will display a tooltip with the exact timestamp at the position
of the mouse cursor.
On figure 16 we can see the fully described frames 312 and 347. The description consists of the frame
name, which is Frame for the default frame set (section 3.3) or the name you used for the secondary name
set (section 3.3.1), the frame number and the frame time. The frame 348 is too small to be fully displayed,
so only the frame time is shown. The frame 349 is even smaller, with no space for any text. Moreover,
frames 313 to 346 are too small to be displayed individually, so they are replaced with a zig-zag pattern, as
described in section 5.2.3.
You can also see that there are frame separators, projected down to the rest of the timeline view. Note
that only the separators for the currently selected frame set are displayed. You can make a frame set active by
clicking the left mouse button on a frame set row you want to select (also see section 5.2.1).
Clicking the middle mouse button on a frame will zoom the view to the extent of the frame.
If a frame has an associated frame image (see chapter 3.3.3), you can hold the Ctrl key and click the left
mouse button on the frame, to open the frame image playback window (see chapter 5.18) and set the playback
to the selected frame.
• Light blue label – GPU context. Multi-threaded Vulkan, OpenCL and Direct3D 12 contexts are additionally
split into separate threads.
• White label – A CPU thread. Will be replaced by a bright red label in a thread that has crashed
(section 2.5). If automated sampling was performed, clicking the left mouse button on the ghost
zones button will switch zone display mode between ’instrumented’ and ’ghost’.
Labels accompanied by a collapse symbol can be collapsed out of the view, to reduce visual clutter. Hover
the mouse pointer over the label to display additional information. Click the middle mouse button on a
label to zoom the view to the extent of the label contents.
Zones In an example on figure 17 you can see that there are two threads: Main thread and Streaming
thread49. We can see that the Main thread has two root level zones visible: Update and Render. The Update
zone is split into further sub-zones, some of which are too small to be displayed at the current zoom level.
This is indicated by drawing a zig-zag pattern over the merged zones box (section 5.2.3), with the number of
collapsed zones printed in place of zone name. We can also see that the Physics zone acquires the Physics lock
mutex for most of its run time.
Meanwhile the Streaming thread is performing some Streaming jobs. The first Streaming job sent a message
(section 3.7), which in addition to being listed in the message log is being indicated by a triangle over the
thread separator. When there are multiple messages in one place, the triangle outline shape changes to a
filled triangle.
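As a rough illustration, client code along the following lines could produce a timeline resembling the one described above. This is only a sketch: the function names, the g_physicsLock variable, the counters and the message text are hypothetical, and the header location is an assumption that depends on the Tracy release you use. The fallback macro definitions are not part of Tracy and exist only so that the sketch compiles when Tracy is not present.

```cpp
#include <cassert>
#include <mutex>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>   // assumed header location; it differs between Tracy releases
#else
// No-op stand-ins so that the sketch compiles without Tracy. They are not part of Tracy.
#  define ZoneScopedN(name)
#  define FrameMark
#  define TracyMessageL(text)
#  define TracyLockable(type, var) type var
#  define LockableBase(type) type
#endif

TracyLockable(std::mutex, g_physicsLock);   // hypothetical mutex behind the lock row
int g_physicsSteps = 0;                     // hypothetical state guarded by the lock
int g_streamingJobs = 0;                    // hypothetical counter for the streaming work

void RunPhysics()
{
    ZoneScopedN("Physics");
    // The lock is held for most of the zone, as in the described example.
    std::lock_guard<LockableBase(std::mutex)> guard(g_physicsLock);
    ++g_physicsSteps;
}

void UpdateFrame()
{
    ZoneScopedN("Update");   // root level zone with child zones
    RunPhysics();
}

void RenderFrame()
{
    ZoneScopedN("Render");   // second root level zone
}

void RunFrame()
{
    UpdateFrame();
    RenderFrame();
    FrameMark;               // marks the end of a frame
}

void StreamingJob()          // in a real program this would run on the streaming thread
{
    ZoneScopedN("Streaming job");
    TracyMessageL("Streaming job started");   // hypothetical message text
    ++g_streamingJobs;
}

// Small usage check: one frame runs the physics step exactly once.
void DemoFrame()
{
    const int before = g_physicsSteps;
    RunFrame();
    StreamingJob();
    assert(g_physicsSteps == before + 1);
}
```

Everything runs on one thread here to keep the sketch short. In a real application the streaming work would live on its own thread, which is what gives it a separate row on the timeline.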
At high zoom levels, the zones will be displayed with additional markers, as presented on figure 18. The
red regions at the start and end of a zone indicate the cost associated with recording an event (Queue delay).
The error bars show the timer inaccuracy (Timer resolution). Note that these markers are only approximations,
as there are many factors that can impact the true cost of capturing a zone, for example cache effects, or CPU
frequency scaling, which is unaccounted for (see section 2.2.2).
[Figure 18: zone markers at high zoom levels. Labels: Zone, Zone extent.]
49By clicking on a thread name you can temporarily disable display of the zones in this thread.
The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/OpenCL context
in place of a thread name.
Hovering the mouse pointer over a zone will highlight all other zones that have the same source location
with a white outline. Clicking the left mouse button on a zone will open the zone information window
(section 5.13). Holding the Ctrl key and clicking the left mouse button on a zone will open the zone statistics
window (section 5.7). Clicking the middle mouse button on a zone will zoom the view to the extent of the
zone.
Ghost zones Display of ghost zones (not pictured on figure 17, but similar to the normal zones view) can
be enabled by clicking on the ghost zones icon next to the thread label, available if automated sampling (see
chapter 3.13.4) was performed. Ghost zones will also be displayed by default if no instrumented zones are
available for a given thread, to help with pinpointing functions that should be instrumented.
Ghost zones represent true function calls in the program, periodically reported by the operating system.
Due to the limited resolution of sampling, you need to take great care when looking at reported timing data.
While it may be apparent that some small function requires a relatively long time to execute, for example
125 µs (8 kHz sampling rate), in reality this time represents a period between taking two distinct samples,
not the actual function run time. Similarly, two (or more) distinct function calls may be represented as a
single ghost zone, because the profiler doesn’t have the information needed to know about true lifetime of a
sampled function.
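The relation between the sampling rate and the reported time can be checked with a few lines of arithmetic. The sketch below is not part of Tracy, and both function names are hypothetical. It assumes a simplified model in which a function seen in a number of consecutive samples is drawn as if it lasted that many sampling periods.

```cpp
#include <cassert>

// Time between two consecutive samples, in microseconds, for a sampling
// frequency given in Hz.
double SamplingPeriodUs(double frequencyHz)
{
    assert(frequencyHz > 0.0);
    return 1e6 / frequencyHz;
}

// Simplified model (an assumption): a function observed in `consecutiveSamples`
// samples in a row appears to last that many sampling periods, regardless of
// how long it really ran between the samples.
double ApparentDurationUs(int consecutiveSamples, double frequencyHz)
{
    return consecutiveSamples * SamplingPeriodUs(frequencyHz);
}
```

With an 8 kHz sampling rate the period is 125 µs, so a function that really ran for a few hundred nanoseconds, but happened to be caught by one sample, would still be drawn with a 125 µs extent.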
Another common pitfall to watch for is the order of presented functions. It is not what you expect it to be!
Read chapter 5.14.1 for a critical insight on how call stacks might seem nonsensical at first, and why they
aren’t.
The available information about ghost zones is quite limited, but it’s enough to give you a rough outlook
on the execution of your application. The timeline view alone is more than any other statistical profiler is
able to present. In addition to that, Tracy properly handles inlined function calls, which are indicated by
darker colored ghost zones.
Clicking the left mouse button on a ghost zone will open the corresponding source file location, if able
(see chapter 5.16 for conditions). There are three ways in which source locations can be assigned to a ghost
zone:
1. If the selected ghost zone is not an inline frame and its symbol data has been retrieved, the source
location points to the function entry location (first line of the function).
2. If the selected ghost zone is not an inline frame, but its symbol data is not available, the source location
will point to a semi-random location within the function body (i.e. to one of the sampled addresses in
the program, but not necessarily the one representing the selected time stamp, as multiple samples
with different addresses may be merged into one ghost zone).
3. If the selected ghost zone is an inline frame, the source location will point to a semi-random location
within the inlined function body (see details in the above point). It is not possible to go to the entry
location of such function, as it doesn’t exist in the program binary. Inlined functions begin in the parent
function.
Call stack samples The row of dots right below the Main thread label shows call stack sample points,
which may have been automatically captured (see chapter 3.13.4 for more detail). Hovering the mouse
pointer over each dot will display a short call stack summary, while clicking on a dot with the left mouse
button will open a more detailed call stack information window (see section 5.14).
Context switches The thick line right below the samples represents context switch data (see sec-
tion 3.13.2). We can see that the main thread, as displayed, starts in a suspended state, represented by
the dotted region. Then it is woken up and starts execution of the Update zone. In the midst of the physics
processing it is preempted, which explains why there is an empty space between child zones. Then it is
resumed again and continues execution into the Render zone, where it is preempted again, but for a shorter
time. After rendering is done, the thread sleeps again, presumably waiting for the vertical blanking that
signals the next frame. Similar information is also available for the streaming thread.
Context switch regions use the following color key:
• Red – Thread is waiting to be resumed by the scheduler. There are many reasons why a thread may be
in the waiting state. Hovering the mouse pointer over the region will display more information.
• Blue – Thread is waiting to be resumed and is migrating to another CPU core. This might have visible
performance effects, because low level CPU caches are not shared between cores, which may result in
additional cache misses. To avoid this problem, you may pin a thread to a specific core, by setting its
affinity.
• Bronze – Thread has been placed in the scheduler’s run queue and is about to be resumed.
CPU data This label is only available if context switch data was collected. It is split into two parts: a
graph of CPU load by various threads running in the system, and a per-core thread execution display.
The CPU load graph shows how much of the CPU resources were used at any given time during program
execution. The green part of the graph represents threads belonging to the profiled application and the gray
part of the graph shows all other programs running in the system. Hovering the mouse pointer over the
graph will display a list of threads running on the CPU at the given time.
Each line in the thread execution display represents a separate logical CPU thread. If CPU topology data
is available (see section 3.13.3), package and core assignment will be displayed in brackets, in addition to
numerical processor identifier (i.e. [package:core] CPU thread). When a core is busy executing a thread,
a zone will be drawn at the appropriate time. Zones are colored according to the following key:
• Bright color – or orange if dynamic thread colors are disabled – Thread tracked by the profiler.
• Dark blue – Thread existing in the profiled application, but not known to the profiler. This may include
internal profiler threads, helper threads created by external libraries, etc.
When the mouse pointer is hovered over either the CPU data zone, or the thread timeline label, Tracy
will display a line connecting all zones associated with the selected thread. This can be used to easily see
how the thread was migrating across the CPU cores.
Careful examination of the data presented on this graph may allow you to determine areas where the
profiled application was fighting for system resources with other programs (see section 2.2.1), or give you a
hint to add more instrumentation macros.
Locks Mutual exclusion zones are displayed in each thread that tries to acquire them. There are three
color-coded kinds of lock event regions that may be displayed. Note that when the timeline view is zoomed
out, the contention regions are always displayed over the uncontended ones.
• Green region50 – The lock is being held solely by one thread and no other thread tries to access it. In case
of shared locks it is possible that multiple threads hold the read lock, but no thread requires a write
lock.
• Yellow region – The lock is being owned by this thread and some other thread also wants to acquire the
lock.
50This region type is disabled by default and needs to be enabled in options (section 5.4).
• Red region – The thread wants to acquire the lock, but is blocked by another thread, or by several threads
in case of a shared lock.
Hovering the mouse pointer over a lock timeline will highlight the lock in all threads to help reading
the lock behavior. Hovering the mouse pointer over a lock event will display important information,
for example a list of threads that are currently blocking, or which are blocked by the lock. Clicking the
left mouse button on a lock event or a lock label will open the lock information window, as described in
section 5.17. Clicking the middle mouse button on a lock event will zoom the view to the extent of the
event.
Plots The numerical data values (figure 19) are plotted right below the zones and locks. Note that
the minimum and maximum values currently displayed on the plot are visible on the screen, along with
the y range of the plot and number of drawn data points. The discrete data points are indicated with little
rectangles. Multiple data points are indicated by a filled rectangle.
When memory profiling (section 3.8) is enabled, Tracy will automatically generate a Memory usage
plot, which has extended capabilities. Hovering over a data point (memory allocation event) will visually
display the duration of the allocation. Clicking the left mouse button on the data point will open the memory
allocation information window, which will display the duration of the allocation as long as the window is
open.
Another plot that is automatically provided by Tracy is the CPU usage plot, which represents the total
system CPU usage percentage (it is not limited to the profiled application).
• Limit find zone time range – this will limit find zone results. See chapter 5.7 for more details.
• Limit statistics time range – selecting this option will limit statistics results. See chapter 5.6 for more
details.
Alternatively, you may specify the time range by clicking the right mouse button on a zone or a frame.
The resulting time extent will match the selected item.
In order to reduce clutter, time ranges limiting the find zone51 or statistics52 results are only displayed if
the find zone or statistics windows are open, or if the time range limits control window is open (section 5.22).
The time range limits window can be accessed through the Tools button on the control menu.
Each time range can be freely adjusted on the timeline by clicking the left mouse button on the range’s
edge and dragging the mouse.
Description
Please note that while the annotations persist between profiling sessions, they are not saved in the trace,
but in the user data files, as described in section 8.2.
• Draw empty labels – By default threads that don’t have anything to display at the current zoom level
are hidden. Enabling this option will show them anyway.
– Darken inactive thread – If enabled, inactive regions in threads will be dimmed out.
– Draw CPU usage graph – You can disable drawing of the CPU usage graph here.
51Marked with green striped pattern.
52Marked with red striped pattern.
• Draw CPU zones – Determines whether CPU zones are displayed.
– Draw ghost zones – Controls if ghost zones should be displayed in threads which don’t have any
instrumented zones available.
– Zone colors – Zones with no user-set color may be colored according to the following schemes:
∗ Disabled – A constant color (blue) will be used.
∗ Thread dynamic – Zones are colored according to a thread (identifier number) they belong to
and depth level.
∗ Source location dynamic – Zone color is determined by source location (function name) and
depth level.
– Namespaces – Controls the display behavior of long zone names, which don’t fit inside a zone box:
∗ Full – Zone names are always fully displayed (e.g. std::sort).
∗ Shortened – Namespaces are shortened to one letter (e.g. s::sort).
∗ None – Namespaces are completely omitted (e.g. sort).
• Draw locks – Controls the display of locks. If the Only contended option is selected, the non-blocking
regions of locks won’t be displayed (see section 5.2.3.3). The Locks drop-down allows disabling display
of locks on a per-lock basis. As a convenience, the list of locks is split into the single-threaded and
multi-threaded (contended and uncontended) categories. Clicking the right mouse button on a lock
label opens the lock information window (section 5.17).
• Draw plots – Allows disabling display of plots. Individual plots can be disabled in the Plots
drop-down.
• Visible threads – Here you can select which threads are visible on the timeline. Display order of
threads can be changed by dragging thread labels.
• Visible frame sets – Frame set display can be enabled or disabled here. Note that disabled frame sets
are still available for selection in the frame set selection drop-down (section 5.2.1), but are marked with
a dimmed font.
Disabling display of some events is especially recommended when the profiler performance drops below
acceptable levels for interactive usage.
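The three Namespaces display modes listed above can be sketched as a small string transformation. This is a hypothetical helper, not Tracy’s implementation. It assumes a plain qualified name without templates or operators, such as std::sort, and it ignores the check whether the name fits inside the zone box.

```cpp
#include <cassert>
#include <string>

enum class NamespaceMode { Full, Shortened, None };

// Hypothetical helper illustrating the Full / Shortened / None modes.
std::string FormatZoneName(const std::string& name, NamespaceMode mode)
{
    if (mode == NamespaceMode::Full) return name;

    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t sep = name.find("::", pos);
        if (sep == std::string::npos) break;   // the remainder is the function name
        if (mode == NamespaceMode::Shortened && sep > pos) {
            out += name[pos];                  // keep only the first letter of the namespace
            out += "::";
        }
        pos = sep + 2;                         // skip past the separator
    }
    out += name.substr(pos);
    return out;
}

// Usage check mirroring the examples given in the list.
void CheckNamespaceExamples()
{
    assert(FormatZoneName("std::sort", NamespaceMode::Full) == "std::sort");
    assert(FormatZoneName("std::sort", NamespaceMode::Shortened) == "s::sort");
    assert(FormatZoneName("std::sort", NamespaceMode::None) == "sort");
}
```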
In a live capture, the message list will automatically scroll down to display the most recent message. This
behavior can be disabled by manually scrolling the message list up. When the view is scrolled down to
display the last message, the auto-scrolling feature will be enabled again.
The message list can be filtered in the following ways:
• By matching the message text to the expression in the Filter messages entry field. Multiple filter
expressions can be comma-separated (e.g. ’warn, info’ will match messages containing strings ’warn’
or ’info’). Matches can be excluded by preceding the term with a minus character (e.g. ’-debug’ will
hide all messages containing string ’debug’).
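The filter syntax described above can be modelled with the following sketch. It is not the profiler’s code, and both function names are hypothetical. It assumes case-sensitive substring matching, and it assumes that a message is shown when it contains at least one of the plain terms (or when there are no plain terms) and none of the terms preceded by a minus character. The real matching rules may differ in details such as case handling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits a filter expression on commas and trims surrounding spaces.
std::vector<std::string> SplitTerms(const std::string& filter)
{
    std::vector<std::string> terms;
    std::size_t start = 0;
    while (start <= filter.size()) {
        std::size_t end = filter.find(',', start);
        if (end == std::string::npos) end = filter.size();
        const std::size_t b = filter.find_first_not_of(' ', start);
        if (b != std::string::npos && b < end) {
            const std::size_t e = filter.find_last_not_of(' ', end - 1);
            terms.push_back(filter.substr(b, e - b + 1));
        }
        start = end + 1;
    }
    return terms;
}

// Returns true if the message should stay visible under the given filter.
bool MessagePassesFilter(const std::string& message, const std::string& filter)
{
    bool hasInclude = false;
    bool included = false;
    for (const std::string& term : SplitTerms(filter)) {
        if (term[0] == '-') {
            // Exclusion term: hide the message if the rest of the term occurs in it.
            if (term.size() > 1 && message.find(term.substr(1)) != std::string::npos) return false;
        } else {
            hasInclude = true;
            if (message.find(term) != std::string::npos) included = true;
        }
    }
    return !hasInclude || included;
}

// Usage check based on the examples in the text.
void CheckFilterExamples()
{
    assert(MessagePassesFilter("warning: low memory", "warn, info"));
    assert(!MessagePassesFilter("debug: tick", "-debug"));
}
```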
corresponding functions list (some functions may be hidden if the Show all option is disabled, due to
lack of sampling data). Clicking on a function name will open the call stack sample parents window (see
chapter 5.15). Note that if inclusive times are displayed, listed functions will be partially or completely
coming from mid-stack frames, which will prevent, or limit the capability to display parent call stacks.
The Location column displays the corresponding source file name and line number. Depending on the
Location option selection it can either show function entry address, or the instruction at which the sampling
was performed. The Entry mode points at the beginning of a non-inlined function, or at the place where
inlined function was inserted in its parent function. The Sample mode is not useful for non-inlined functions,
as it points to one randomly selected sampling point out of many that were captured. However, in case of
inlined functions, this random sampling point is within the inlined function body. Using these options in
tandem enables you to look at both the inlined function code and the place where it was inserted. If the Smart
location is selected, the profiler will display the entry point position for non-inlined functions and the sample
location for inlined functions. Selecting the Address option will instead print the symbol address.
The location data is complemented by the originating executable image name, contained in the Image
column.
Some function locations may not be found, due to insufficient debugging data available on the client side.
To filter out such entries, use the Hide unknown option.
The Time or Count column (depending on the Show time option selection) shows the number of taken
samples, either as a raw count, or in an easier to understand time format. Note that the percentage value of
time is calculated relative to the wall-clock time, and the percentage value of sample counts is relative to the
total number of collected samples.
The last column, Code size, displays the size of the symbol in the executable image of the program. Since
inlined routines are directly embedded into other functions, their symbol size will be based on the parent
symbol, and displayed as ’less than’. In some cases this data won’t be available. If the symbol code has been
retrieved54, the symbol size will be prepended with the ó icon, and clicking the right mouse button on the
location column entry will open the symbol view window (section 5.16.2).
Finally, the list can be filtered using the Filter symbols entry field, just like in the instrumentation
mode case. Additionally, you can also filter results by the originating image name of the symbol. The
exclusive/inclusive time counting mode can be switched using the Self time switch. Limiting the time
range is also available, but is restricted to self time. If the Show all option is selected, the list will include
not only call stack samples, but also all other symbols collected during the profiling process (this is enabled
by default, if no sampling was performed).
1 and 10 ms marks, which can be ignored on most occasions, as these are single occurrences.
[Histogram plot; time axis marks: 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms.]
Figure 21: Zone execution time histogram. Note that the extreme time labels and time range indicator (middle time value) are
displayed in a separate line.
The histogram is accompanied by various statistics about the displayed data, for example the total time
of the displayed samples, or the maximum number of counts in histogram bins. The following options control
how the data is presented:
• Log values – Switches between linear and logarithmic scale on the y axis of the graph, representing the
call counts55.
• Log time – Switches between linear and logarithmic scale on the x axis of the graph, representing the
time bins.
• Cumulate time – Changes how the histogram bin values are calculated. By default the vertical bars on
the graph represent the call counts of zones that fit in the given time bin. If this option is enabled, the
bars represent the time spent in the zones. For example, on graph presented on figure 21 the 10 µs
cluster is the dominating one, if we look at the time spent in zone, even if the 300 ns cluster has greater
number of call counts.
• Self time – Removes children time from the analysed zones, which results in displaying only the time
spent in the zone itself (or in non-instrumented function calls). Cannot be selected when Running time
is active.
• Running time – Removes time when zone’s thread execution was suspended by the operating system
due to preemption by other threads, waiting for system resources, lock contention, etc. Available only
when context switch capture was performed (section 3.13.2). Cannot be selected when Self time is active.
• Minimum values in bin – Excludes display of bins which do not hold enough values at both ends of the
time range. Increasing this parameter will eliminate outliers, allowing you to concentrate on the interesting
part of the graph.
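The difference between the default bar heights and the Cumulate time option can be shown with a simplified model. The sketch below is not Tracy’s histogram code: it uses one bin per decade of nanoseconds, which is much coarser than the real graph, and all names and sample values are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

struct Bin {
    std::uint64_t calls = 0;    // bar height by default (call count)
    std::uint64_t timeNs = 0;   // bar height when time is cumulated
};

// Decade of a duration: 0 for 1-9 ns, 1 for 10-99 ns, 2 for 100-999 ns, ...
int Decade(std::uint64_t ns)
{
    int d = 0;
    while (ns >= 10) { ns /= 10; ++d; }
    return d;
}

// Assigns every zone duration to a logarithmic time bin.
std::map<int, Bin> BuildHistogram(const std::vector<std::uint64_t>& durationsNs)
{
    std::map<int, Bin> bins;
    for (std::uint64_t ns : durationsNs) {
        Bin& bin = bins[Decade(ns)];
        bin.calls += 1;
        bin.timeNs += ns;
    }
    return bins;
}

// Returns the bin with the tallest bar, by call count or by cumulated time.
// Returns -1 for an empty histogram.
int DominantBin(const std::map<int, Bin>& bins, bool cumulateTime)
{
    int best = -1;
    std::uint64_t bestValue = 0;
    for (const auto& entry : bins) {
        const std::uint64_t value = cumulateTime ? entry.second.timeNs : entry.second.calls;
        if (best < 0 || value > bestValue) { best = entry.first; bestValue = value; }
    }
    return best;
}

// Usage check: 100 calls of 300 ns against 20 calls of 10 µs (made-up values).
void CheckCumulateTime()
{
    std::vector<std::uint64_t> durations(100, 300);
    durations.insert(durations.end(), 20, 10000);
    const std::map<int, Bin> bins = BuildHistogram(durations);
    assert(DominantBin(bins, false) == 2);   // the 300 ns cluster has more calls
    assert(DominantBin(bins, true) == 4);    // the 10 µs cluster takes more time
}
```

In this made-up data set the 300 ns cluster accounts for 30 µs in total, while the 10 µs cluster accounts for 200 µs, which is why the dominating bin changes when time is cumulated.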
You can drag the left mouse button over the histogram to select a time range that you want to closely
look at. This will display the data in the histogram info section and it will also filter zones displayed in
the found zones section. This is quite useful, if you want to actually look at the outliers, i.e. where did they
originate from, what the program was doing at the moment, etc56. You can reset the selection range by
pressing the right mouse button on the histogram.
The found zones section displays the individual zones grouped according to the following criteria:
• Thread – In this mode you can see which threads were executing the zone.
55Or time, if the cumulate time option is enabled.
56More often than not you will find out, that the application was just starting, or an access to a cold file was required and there’s not
much you can do to optimize that particular case.
• User text – Splits the zones according to the custom user text (see section 3.4).
• Zone name – Groups zones by the name set on a per-call basis (see section 3.4).
• Call stacks – Zones are grouped by the originating call stack (see section 3.10). Note that two call stacks
may sometimes appear identical, even if they are not, due to an easy to overlook difference in the source
line numbers.
• Parent – Groups zones according to the parent zone. This mode relies on the zone hierarchy, and not on
the call stack information.
• No grouping – Disables zone grouping. May be useful in cases when you just want to see zones in order
as they appeared.
Each group may be sorted according to the order in which it appeared, the call count, the total time spent
in the group, or the mean time per call. Expanding the group view will display individual occurrences of the
zone, which can be sorted by application’s time, execution time or zone’s name. Clicking the left mouse
button on a zone will open the zone information window (section 5.13). Clicking the middle mouse
button on a zone will zoom the timeline view to the zone’s extent.
Clicking the left mouse button on group name will highlight the group time data on the histogram
(figure 22). This function provides a quick insight about the impact of the originating thread, or input data
on the zone performance. Clicking the right mouse button on the group names area will reset the group
selection.
[Figure 22: histogram with the selected group time data highlighted; time axis marks: 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms.]
The call stack grouping mode has a different way of listing groups. Here only one group is displayed at
any time, due to the need to display the call stack frames. You can switch between call stack groups by using
the and buttons. The group can be selected by clicking on the Select button (to reset the group selection
use the right mouse button, as usual). You can open the call stack window (section 5.14) by pressing
the Call stack button.
Tracy displays a variety of statistical values regarding the selected function: mean (average value), median
(middle value), mode (most common value, quantized using histogram bins), and σ (standard deviation).
The mean and median zone times are also displayed on the histogram as red (mean) and blue (median)
vertical bars. When a group is selected, additional bars will indicate the mean group time (orange) and
median group time (green). You can disable drawing of either set of markers by clicking on the check-box
next to the color legend.
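For reference, the mean, median and σ values can be computed as in the sketch below. It is not taken from the profiler, and the names are hypothetical. It assumes that σ is the population standard deviation, which the text does not state, and it leaves out the mode, because that value depends on how the histogram bins are laid out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct ZoneStats {
    double mean;
    double median;
    double sigma;   // population standard deviation (an assumption)
};

// Computes basic statistics over zone execution times. The input must not be empty.
ZoneStats ComputeStats(std::vector<double> timesNs)
{
    assert(!timesNs.empty());
    std::sort(timesNs.begin(), timesNs.end());
    const std::size_t n = timesNs.size();

    double sum = 0.0;
    for (double t : timesNs) sum += t;
    const double mean = sum / n;

    // Middle value, or the average of the two middle values for an even count.
    const double median = (n % 2 == 1)
        ? timesNs[n / 2]
        : 0.5 * (timesNs[n / 2 - 1] + timesNs[n / 2]);

    double squares = 0.0;
    for (double t : timesNs) squares += (t - mean) * (t - mean);

    return { mean, median, std::sqrt(squares / n) };
}
```

For the made-up input 1, 2, 3, 4, 100 the mean is 22 and the median is 3. A single outlier moves the mean a long way but leaves the median alone, which is why it helps to have both markers drawn on the histogram.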
Hovering the mouse cursor over a zone on the timeline, which is currently selected in the find zone
window, will display a pulsing vertical bar on the histogram, highlighting the bin to which the hovered zone
has been assigned. The zone entry on the zone list will also be highlighted.
Keyboard shortcut
You may press Ctrl + F to open or focus the find zone window and set the keyboard input on the
search box.
Caveats
When using the execution times histogram you must be aware of hardware peculiarities. Read
section 2.2.2 for more detail.
Caveats
The displayed data might not be calculated correctly and some zones may not be included in the
reported times.
Now things start to get familiar. You search for a zone, much like in the find zone window, choose the
one you want in the matched source locations drop-down, and then you look at the histogram57. This time there
are two overlaid graphs, one representing the current trace, and the second one representing the external
(reference) trace (figure 23). You can easily see how the performance characteristics of the zone were affected
by your modifications.
[Figure 23: two overlaid histograms, current and external trace; time axis marks: 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms.]
Note that the traces are color and symbol coded. The current trace is marked by a yellow symbol, and
the external one is marked by a red symbol.
When searching for source locations it’s not uncommon to match more than one zone (for example a
search for Draw may result in DrawCircle and DrawRectangle matches). Typically you wouldn’t want to
compare execution profiles of two unrelated functions. This is prevented by the link selection option, which
ensures that when you choose a source location in one trace, the corresponding one is also selected in the second
trace. Be aware that this may still result in a mismatch, for example if you have overloaded functions. In such
a case you will need to manually select the appropriate function in the other trace.
It may be difficult, if not impossible, to perform identical runs of a program. This means that the number
of collected zones may differ in both traces, which would influence the displayed results. To fix this problem
enable the Normalize values option, which will adjust the displayed results as if both traces had the same
number of recorded zones.
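One simple way to picture such an adjustment is to rescale the bin counts of one trace so that they add up to the zone count of the other. The sketch below only illustrates that idea. The exact formula used by the profiler is not described here, so treat the function and its name as assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rescales histogram bin counts so that they sum up to targetTotal.
// An all-zero input yields all-zero output.
std::vector<double> NormalizeToTotal(const std::vector<std::uint64_t>& binCounts,
                                     std::uint64_t targetTotal)
{
    std::uint64_t total = 0;
    for (std::uint64_t c : binCounts) total += c;

    const double factor = (total == 0)
        ? 0.0
        : static_cast<double>(targetTotal) / static_cast<double>(total);

    std::vector<double> scaled;
    scaled.reserve(binCounts.size());
    for (std::uint64_t c : binCounts) scaled.push_back(static_cast<double>(c) * factor);
    return scaled;
}

// Usage check: an external trace with 40 zones compared against a trace with 80 zones.
void CheckNormalize()
{
    const std::vector<double> scaled = NormalizeToTotal({10, 30}, 80);
    assert(scaled[0] == 20.0);
    assert(scaled[1] == 60.0);
}
```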
Trace descriptions
Set custom trace descriptions (see section 5.12) to easily differentiate the two loaded traces. If no trace
description is set, a name of the profiled program will be displayed along with the capture time.
The allocation’s timing data is contained in two columns: appeared at and duration. Clicking the left
mouse button on the first one will center the timeline view at the beginning of allocation, and likewise,
clicking on the second one will center the timeline view at the end of allocation. Note that allocations that
have not yet been freed will have their duration displayed in green color.
The memory event location in the code is displayed in the last four columns. The thread column contains
the thread where the allocation was made and freed (if applicable), or an alloc / free pair of threads, if it was
allocated in one thread and freed in another. The zone alloc contains the zone in which the allocation was
performed60, or - if there was no active zone in the given thread at the time of allocation. Clicking the left
mouse button on the zone name will open the zone information window (section 5.13). Similarly, the zone
free column displays the zone which freed the allocation, which may be colored yellow, if it is the same exact
zone that did the allocation. Alternatively, if the allocation has not yet been freed, a green active text is displayed.
The last column contains the alloc and free call stack buttons, or their placeholders, if no call stack is available
(see section 3.10 for more information). Clicking on either of the buttons will open the call stack window
(section 5.14). Note that the call stack buttons that match the information window will be highlighted.
The memory window is split into the following sections:
5.9.1 Allocations
The @ Allocations pane allows you to search for the specified address usage during the whole lifetime of the
program. All recorded memory allocations that match the query will be displayed on a list.
the function name level, which will result in less valid source file locations, as multiple entries are collapsed
into one.
Enabling the Only active allocations option will limit the call stack tree to only display active allocations.
Clicking the right mouse button on the function name will open the allocations list window (see section
5.10), which lists all the allocations included at the current call stack tree level. Clicking the right mouse
button on the source file location will open the source file view window (if applicable, see section 5.16).
Some function names may be too long to be properly displayed, with the events count data at the end. In
such cases, you may press the control button, which will display the events count tooltip.
62See section 5.7 for a description of the histogram. Note that there are subtle differences in the available functionality.
63This has no effect on source files cached during the profiling run.
Quick example
Let’s say we have a Unix-based operating system with program sources in the /home/user/program/src/
directory. We have also performed a capture of an application running under Windows, with sources
in the C:\Users\user\Desktop\program\src directory. Obviously, the source locations don’t match and
the profiler can’t access the source files we have on our disk. We can fix that by adding two substitution
patterns:
• ^C:\\Users\\user\\Desktop → /home/user
• \\ → /
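The effect of the two patterns can be reproduced outside of the profiler with a few lines of code. The sketch below assumes that the patterns are regular expressions in the ECMAScript flavor used by std::regex and that they are applied in the listed order. The function names and the main.cpp file name are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

using Substitution = std::pair<std::string, std::string>;   // (pattern, replacement)

// The two rules from the example, written as C++ raw string literals.
std::vector<Substitution> ExampleRules()
{
    return {
        { R"(^C:\\Users\\user\\Desktop)", "/home/user" },
        { R"(\\)", "/" },
    };
}

// Applies every rule in order; each rule replaces all of its matches.
std::string ApplySubstitutions(std::string path, const std::vector<Substitution>& rules)
{
    for (const Substitution& rule : rules) {
        path = std::regex_replace(path, std::regex(rule.first), rule.second);
    }
    return path;
}

// Usage check: the Windows path from the capture maps onto the local source tree.
void CheckSubstitutions()
{
    const std::string captured = R"(C:\Users\user\Desktop\program\src\main.cpp)";
    assert(ApplySubstitutions(captured, ExampleRules()) == "/home/user/program/src/main.cpp");
}
```

The order matters: the first rule has to see the original backslashes, so it must run before the second rule turns them into forward slashes.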
In this window you can view the information about the machine on which the profiled application was
running. This includes the operating system, used compiler, CPU name, amount of total available RAM, etc.
If application information was provided (see section 3.7.1), it will also be displayed here.
If an application should crash during profiling (section 2.5), the crash information will be displayed in
this window. It provides information about the thread that has crashed, the crash reason and the crash
call stack (section 5.14).
• Basic source location information: function name, source file location and the thread name.
• Timing information.
• If context switch capture was performed (section 3.13.2) and a thread was suspended during zone
execution, a list of wait regions will be displayed, with complete information about timing, CPU
migrations and wait reasons. If CPU topology data is available (section 3.13.3), zone migrations across
cores will be marked with ’C’, and migrations across packages – with ’P’. In some cases context switch
data might be incomplete64, in which case a warning message will be displayed.
• Memory events list, both summarized and a list of individual allocation/free events (see section 5.9 for
more information on the memory events list).
• List of messages that were logged in the zone’s scope (including its children).
• Zone trace, taking into account the zone tree and call stack information (section 3.10), trying to
reconstruct a combined zone + call stack trace65. Captured zones are displayed as normal text, while
functions that were not instrumented are dimmed. Hovering the mouse pointer over a zone will
highlight it on the timeline view with a red outline. Clicking the left mouse button on a zone will
switch the zone info window to that zone. Clicking the middle mouse button on a zone will zoom
the timeline view to the zone’s extent. Clicking the right mouse button on a source file location will
open the source file view window (if applicable, see section 5.16).
• Child zones list, showing how the current zone’s execution time was used. Zones on this list can be
grouped according to their source location. Each group can be expanded to show individual entries.
All the controls from the zone trace are also available here.
64For example, when a capture is ongoing and context switch information has not yet been received.
65Reconstruction is only possible, if all zones have full call stack capture data available. In case where that’s not available, an unknown
frames entry will be present.
• Time distribution in child zones, which expands the information provided in the child zones list by
processing all zone children (including multiple levels of grandchildren). This results in a statistical list
of zones that were really doing the work in the current zone’s time span. If a group of zones is selected
on this list, the find zone window (section 5.7) will open, with time range limited to show only the
children of the current zone.
• Go to parent – Switches the zone information window to display the current zone’s parent zone (if
available).
• Statistics – Displays the zone’s general performance characteristics in the find zone window (section 5.7).
• Call stack – Views the current zone’s call stack in the call stack window (section 5.14). The button
will be highlighted if the call stack window shows the zone’s call stack. Only available if the zone has
captured call stack data (section 3.10).
• Source – Displays the source file view window with the zone’s source code (only available if applicable, see
section 5.16). The button will be highlighted if the source file is currently being displayed (but the focused
source line might be different).
• Go back – Returns to the previously viewed zone. The viewing history is lost when the zone
information window is closed, or when the type of displayed zone changes (from CPU to GPU or vice
versa).
Clicking on the Copy to clipboard buttons will copy the appropriate data to the clipboard.
• Source code – displays source file and line number associated with the frame.
• Entry point – source code at the beginning of the function containing selected frame, or function call
place in case of inline frames.
• Return address – shows return address, which may be used to pinpoint the exact instruction in the
disassembly.
• Symbol address – displays the start address of the function containing the frame address.
66Executable images are called modules by Microsoft.
67Or ’’ icon in case of call stack tooltips.
In some cases it may not be possible to properly decode a stack frame address. Such frames will be presented
with a dimmed ’[ntdll.dll]’ name of the image containing the frame address, or simply ’[unknown]’ if
even this information cannot be retrieved. Additionally, ’[kernel]’ is used to indicate unknown stack frames
within the operating system’s internal routines.
If the displayed call stack is a sampled call stack (chapter 3.13.4), an additional button will be available,
Global entry statistics. Clicking it will open the call stack sample parents window (chapter 5.15) for the
current call stack.
int main()
{
    auto app = std::make_unique<Application>();
    app->Run();
    app.reset();
}
Let’s say you are looking at the call stack of some function called within Application::Run. This is the
result you might get:
0. ...
1. ...
2. Application::Run
3. std::unique_ptr<Application>::reset
4. main
At first glance it may look like unique_ptr::reset was the call site of Application::Run, which
would make no sense, but this is not the case here. Once you remember that these are the function return points, it
becomes much clearer what is happening. As an optimization, Application::Run returns directly
into unique_ptr::reset, skipping the return to main and an unnecessary reset function call.
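The effect of such optimizations on return addresses can be illustrated with a simpler case, a plain tail call. The following is a minimal sketch only; the names Leaf, Middle and Top are hypothetical, and whether the compiler actually emits a jump here depends on the compiler and optimization level.

```cpp
#include <cassert>

// Hypothetical leaf function; imagine the call stack is captured inside it.
int Leaf( int x )
{
    assert( x >= 0 );
    return x * 2;
}

// The call to Leaf is the last thing Middle does. An optimizing compiler may
// emit a jump instead of a call here, in which case no return address
// pointing into Middle is ever pushed on the stack.
int Middle( int x )
{
    return Leaf( x + 1 );
}

// Top still has work to do after Middle returns, so this is a regular call
// and a return address pointing into Top is pushed on the stack.
int Top( int x )
{
    int r = Middle( x );
    return r + 1;
}
```

If the tail call is emitted as a jump, a call stack captured inside Leaf would list Leaf followed by Top, with Middle missing, even though the source code clearly goes through Middle.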
Moreover, in some rare cases the linker may determine that two functions in your program are
identical68. As a result, only one copy of the binary code will be provided in the executable for both functions
to share. While this optimization produces more compact programs, it also means that there’s no way to
tell the two functions apart in the resulting machine code. In effect, some call stacks may look
nonsensical until you perform a small investigation.
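As an illustration of functions that may end up identical, consider the sketch below. The types and function names are hypothetical. Both accessors read the first int of the object they are given, so they are likely to compile to the same machine code. Whether they are actually merged depends on the toolchain and its settings (for example the /OPT:ICF option of the MSVC linker, or --icf=all in gold and lld).

```cpp
#include <cassert>

// Two unrelated, hypothetical types with the same memory layout.
struct Texture { int width; int height; };
struct Buffer { int size; int capacity; };

// Both functions load the first int member of their argument. After
// compilation their bodies are likely to be byte-for-byte identical.
int GetWidth( const Texture& t ) { return t.width; }
int GetSize( const Buffer& b ) { return b.size; }

// Sanity check: from the caller's point of view the functions stay distinct.
void CheckAccessors()
{
    assert( GetWidth( Texture{ 3, 4 } ) == 3 );
    assert( GetSize( Buffer{ 7, 8 } ) == 7 );
}
```

If the two bodies are folded into one, a call stack passing through GetSize might be displayed as passing through GetWidth, or the other way round.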
more feature-rich.
Important
Source file view depends on local files you have on your disk, as the profiled application doesn’t need
them to run. If the profiler can access the source files during capture, it will cache them for further
reference. Otherwise, you will need to make them available, possibly by using file path substitutions.
Keep the following rules in mind:
• Source files can only be used if the source file location recorded in the trace matches the files you
have on your disk. See section 5.12 for information on redirecting source file locations.
• The time stamp of the source file cannot be newer than the trace, as that would typically indicate that
the file has been changed and no longer contains the code that was profiled.
• The displayed source files might not reflect the code that was profiled! It is up to you to verify
that you don’t have a version of the code that was modified relative to the trace.
In some circumstances (missing or outdated source files, lack of machine code) some modes may be
unavailable.
Exploring microarchitecture If the listed assembly code targets the x86 or x64 instruction set architectures,
hovering the mouse pointer over an instruction will display a tooltip with microarchitectural data, based on
measurements made in [AR19]. This information is retrieved from instruction cycle tables and does not represent
the true behavior of the profiled code. Reading the cited article will give you a detailed definition of the presented
data, but here’s a quick (and inaccurate) explanation:
• Throughput – How many cycles are required to execute an instruction in a stream of independent
instructions of the same kind. For example, if two independent add instructions may be executed simultaneously on
different execution units, then the throughput (cycle cost per instruction) is 0.5.
• Latency – How many cycles it takes for an instruction to finish executing. This is reported as a min-max
range, as some output values may be available earlier than the rest.
• µops – How many microcode operations have to be dispatched for an instruction to retire. For example,
adding a value from memory to a register may consist of two microinstructions: first load the value
from memory, then add it to the register.
• Ports – Which ports (execution units) are required for dispatch of microinstructions. For example,
2*p0+1*p015 would mean that out of the three microinstructions implementing the assembly instruction,
two can only be executed on port 0, and one microinstruction can be executed on ports 0, 1, or 5.
69This includes jumps, procedure calls and returns. For example, in x86 assembly the respective instruction mnemonics can be: jmp, call,
ret.
The number of available ports and their capabilities vary between processor architectures. Refer
to https://wikichip.org/ for more information.
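To make the port notation more concrete, here is a minimal sketch that splits a string such as 2*p0+1*p015 into its terms. It is not the parser used by the profiler. The names PortTerm, ParsePortUsage and TotalMicroOps are hypothetical, and the code assumes well-formed input with single-digit port numbers, as in the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One term of a port usage string: `count` microinstructions, each of which
// may be dispatched to any one of the ports listed in `ports`.
struct PortTerm
{
    int count;
    std::vector<int> ports;
};

static bool IsDigit( char c ) { return c >= '0' && c <= '9'; }

// Parses strings such as "2*p0+1*p015".
// Assumption: every port number is a single digit and the input is well-formed.
std::vector<PortTerm> ParsePortUsage( const std::string& s )
{
    std::vector<PortTerm> terms;
    std::size_t i = 0;
    while( i < s.size() )
    {
        PortTerm term{ 0, {} };
        // Number of microinstructions in this term.
        while( i < s.size() && IsDigit( s[i] ) ) term.count = term.count * 10 + ( s[i++] - '0' );
        // Separator between the count and the port list.
        assert( i + 1 < s.size() && s[i] == '*' && s[i+1] == 'p' );
        i += 2;
        // Each following digit names one port the microinstruction may use.
        while( i < s.size() && IsDigit( s[i] ) ) term.ports.push_back( s[i++] - '0' );
        terms.push_back( std::move( term ) );
        // Terms are joined with '+'.
        if( i < s.size() )
        {
            assert( s[i] == '+' );
            i++;
        }
    }
    return terms;
}

// Sums the microinstruction counts of all terms.
int TotalMicroOps( const std::vector<PortTerm>& terms )
{
    int total = 0;
    for( const auto& t : terms ) total += t.count;
    return total;
}
```

For the string 2*p0+1*p015 this yields two terms: two microinstructions restricted to port 0, and one microinstruction that may run on port 0, 1 or 5, for a total of three microinstructions.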
Selection of the CPU microarchitecture can be performed using the µarch drop-down. Each architecture
is accompanied by the name of an example CPU implementing it.
Enabling the Latency option will display a graphical representation of instruction latencies on the listing.
The minimum latency of an instruction is represented by a red bar, while the maximum latency is represented
by a yellow bar.
Clicking on the Save button lets you write the disassembly listing to a file. You can then manually
extract some critical loop kernel and pass it to a CPU simulator, such as LLVM Machine Code Analyzer
(llvm-mca)70, in order to see how the code is executed and if there are any pipeline bubbles. Consult the
llvm-mca documentation for more details. Alternatively, you might click the right mouse button on a
jump arrow and save only the instructions within the jump range, using the Save jump range button.
Instruction dependencies Assembly instructions may read values stored in registers and may also
write values to registers. A dependency between two instructions is created when one produces some
result, which is then consumed by the other one. Combining this dependency graph with information about
instruction latencies may give a deep understanding of the bottlenecks in code performance.
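The producer and consumer relation described above can be sketched in a few lines. This is a simplified illustration and not the profiler’s implementation: it only handles straight-line code (no jumps), and the Instr structure, the FindProducers function and the maxDistance parameter are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified model of an assembly instruction: only the registers it reads
// and the registers it writes are recorded.
struct Instr
{
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

static bool Contains( const std::vector<std::string>& regs, const std::string& reg )
{
    for( const auto& r : regs ) if( r == reg ) return true;
    return false;
}

// For each register read by code[target], finds the closest earlier
// instruction that writes this register, looking back at most maxDistance
// instructions. Returns the indices of these producers, in the order of the
// registers in code[target].reads. Registers without a producer in range
// are skipped.
std::vector<std::size_t> FindProducers( const std::vector<Instr>& code, std::size_t target, std::size_t maxDistance = 64 )
{
    assert( target < code.size() );
    std::vector<std::size_t> producers;
    for( const auto& reg : code[target].reads )
    {
        for( std::size_t dist = 1; dist <= maxDistance && dist <= target; dist++ )
        {
            const std::size_t idx = target - dist;
            if( Contains( code[idx].writes, reg ) )
            {
                producers.push_back( idx );
                break;  // an earlier write is hidden by this one
            }
        }
    }
    return producers;
}
```

Consumers of a value can be found the same way by scanning forward instead of backward. A real implementation additionally has to follow the control flow of the program, which is why there may be more than one producer for a single register.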
Clicking the left mouse button on any assembly instruction will mark it as a target for resolving
register dependencies between instructions. To cancel this selection, click on any assembly instruction with
the right mouse button.
The selected instruction will be highlighted in red, while its dependencies will be highlighted in violet.
Additionally, the dependent registers will be listed next to each instruction which reads or writes
them, with the following color code:
• Grey – Value in a register is either discarded (overwritten), or was already consumed by an earlier
instruction (i.e. it is readily available71). The dependency will not be followed further.
The search for dependencies follows program control flow, so there may be multiple producers and consumers
for any single register. While the after and before guidelines mentioned above hold in the general case, things may
be more complicated when there’s a large number of conditional jumps in the code. Note that dependencies
further away than 64 instructions are not displayed.
For easier navigation, dependencies are also marked on the left side of the scroll bar, following the green,
red and yellow convention. The selected instruction is marked in blue.
70https://llvm.org/docs/CommandGuide/llvm-mca.html
71This is actually a bit of simplification. Run a pipeline simulator, e.g. llvm-mca for a better analysis.
Important
An assembly instruction may be associated with only a single source line, but a source line might be
associated with multiple assembly lines, sometimes intermixed with other assembly instructions.
Important
Be aware that the data is not fully accurate, as it is the result of random sampling of program execution.
Furthermore, undocumented implementation details of an out-of-order CPU architecture will highly
impact the measurement. Read chapter 2.2.2 to see the tip of the iceberg.
If the Sync timeline option is selected, the timeline view will be focused on the frame corresponding to the
currently displayed screen shot. The Zoom 2× option enlarges the image, for easier viewing.
Each displayed frame image is also accompanied by the following parameters: timestamp, showing at
which time the image was captured, frame, displaying the numerical value of corresponding frame, and ratio,
telling how well the in-memory lossless compression was able to reduce the image data size.
• Remove – Removes the annotation. You must press the Ctrl key to enable this button.
A Text description
• Set from annotation – Allows using the annotation region for limiting purposes.
• Copy from find zone – Copies the find zone time range limit.
Note that ranges displayed in the window have color hints that match color of the striped regions on the
timeline.
• src_line – Line in the source file where the zone was set
• mean_ns – Mean zone time (equivalent to MPTC in the profiler GUI) in nanoseconds
You can customize the output with the following command line options:
• -e, --self – use self time (equivalent to the “Self time” toggle in the profiler GUI)
• -u, --unwrap – report each zone individually; this will discard the statistics columns and instead
report the timestamp and duration of each zone entry.
Limitations
• Tracy is a single-process profiler. There is no differentiation between data coming from different
pids.
• Tracy uses thread identifiers assigned by the operating system. This means that no two concurrent
threads can have the same tid. Be aware that some external data formats may encourage usage of
duplicated thread identifiers.
• The imported data may be severely limited, either by not mapping directly to the data structures
used by Tracy, or by following undocumented practices.
8 Configuration files
While the client part doesn’t read or write anything to the disk (with the exception of accessing the /proc
filesystem on Linux), the server part has to keep some persistent state. The naming conventions or internal
data format of the files are not meant to be known by profiler users, but you may want to back up the
configuration, or move it to another machine.
On Windows, settings are stored in the %APPDATA%/tracy directory. All other platforms use the
$XDG_CONFIG_HOME/tracy directory, or $HOME/.config/tracy if the XDG_CONFIG_HOME environment variable
is not set.
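If you want to script a backup of the configuration, the lookup rules above can be expressed as follows. This is a sketch based only on the description in this section, not code taken from the profiler. The function names ResolveConfigDir and TracyConfigDir are hypothetical, and treating an empty XDG_CONFIG_HOME the same as an unset one is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Selects the configuration directory from environment values that were
// already read by the caller. Pointers may be null if a variable is not set.
std::string ResolveConfigDir( bool isWindows, const char* appData, const char* xdgConfigHome, const char* home )
{
    if( isWindows )
    {
        assert( appData );
        return std::string( appData ) + "/tracy";
    }
    // Assumption: an empty XDG_CONFIG_HOME is treated as not set.
    if( xdgConfigHome && *xdgConfigHome ) return std::string( xdgConfigHome ) + "/tracy";
    assert( home );
    return std::string( home ) + "/.config/tracy";
}

// Convenience wrapper that reads the environment of the current process.
std::string TracyConfigDir()
{
#ifdef _WIN32
    return ResolveConfigDir( true, std::getenv( "APPDATA" ), nullptr, nullptr );
#else
    return ResolveConfigDir( false, nullptr, std::getenv( "XDG_CONFIG_HOME" ), std::getenv( "HOME" ) );
#endif
}
```

Keeping the selection logic separate from the environment access makes it possible to check the rules without modifying the environment of the running process.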
Appendices
A License
Tracy Profiler (https://github.com/wolfpld/tracy) is licensed under the
3-clause BSD license.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
B List of contributors
Bartosz Taudul <[email protected]>
Kamil Klimek <[email protected]> (initial find zone implementation)
Bartosz Szreder <[email protected]> (view/worker split)
Arvid Gerstmann <[email protected]> (compatibility fixes)
Rokas Kupstys <[email protected]> (compatibility fixes, initial CI work, MingW suppo
Till Rathmann <[email protected]> (DLL support)
Sherief Farouk <[email protected]> (compatibility fixes)
Dedmen Miller <[email protected]> (find zone bug fixes, improvements)
Michał Cichoń <[email protected]> (OSX call stack decoding backport)
Thales Sabino <[email protected]> (OpenCL support)
Andrew Depke <[email protected]> (Direct3D 12 support)
Simonas Kazlauskas <[email protected]> (OSX CI, external bindings)
Jakub Žádník <[email protected]> (csvexport utility)
Andrey Voroshilov <[email protected]> (multi-DLL fixes)
References
[AR19] Andreas Abel and Jan Reineke. uops.info: Characterizing latency, throughput, and port usage of
instructions on Intel microarchitectures. In ASPLOS ’19, pages 673–686, New York, NY,
USA, 2019. ACM.