Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 Processors
Contents
Introduction
Basic Intel® Core™ i7 Processor and Intel® Xeon™ 5500 Processor Architecture and Performance Analysis
  Core Out of Order Pipeline
  Core Memory Subsystem
  Uncore Memory Subsystem
    Overview
    Intel® Xeon™ 5500 Processor
  Core Performance Monitoring Unit (PMU)
  Uncore Performance Monitoring Unit (PMU)
Performance Analysis and the Intel® Core™ i7 Processor and Intel® Xeon™ 5500 Processor Performance Events: Overview
Cycle Accounting and Uop Flow
  Branch Mispredictions, Wasted Work, Misprediction Penalties and Uop Flow
Stall Decomposition Overview
  Measuring Penalties
Core Precise Events
  Overview
  Precise Memory Access Events
  Latency Event
  Precise Execution Events
  Shadowing
  Loop Tripcounts
  Last Branch Record (LBR)
Non-PEBS Core Memory Access Events
  Bandwidth per core
  L1D, L2 Cache Access and More Offcore Events
  Store Forwarding
Front End Events
  Branch Mispredictions
  FE Code Generation Metrics
  Microcode and Exceptions
Uncore Performance Events
  The Global Queue
  L3 CACHE Events
  Intel® QuickPath Interconnect Home Logic (QHL)
  Integrated Memory Controller (IMC)
  Intel® QuickPath Interconnect Home Logic Opcode Matching
  Measuring Bandwidth From the Uncore
Conclusion: Intel® Core™ i7 Processors and Intel® Xeon™ 5500 Processors open a new class of performance analysis capabilities
Appendix 1
  Profiles
    General Exploration
Introduction
With the introduction of the Intel® Core™ i7 processor and Intel® Xeon™ 5500 processors, mass market computing enters a new era, and with it comes a new need for performance analysis techniques and capabilities. The performance monitoring unit (PMU) of the processor has progressed in step, providing a wide variety of new capabilities to illuminate the code's interaction with the architecture.
In this paper I will discuss the basic performance analysis methodology that applies to the Intel® Core™ i7 processor and to platforms that support Non-Uniform Memory Access (NUMA) using two Intel® Xeon™ 5500 processors. The Intel® Xeon™ 5500 processors are based on the same microarchitecture as the Intel® Core™ i7 processor, so the events and methodology described here for the Intel® Core™ i7 processor, including statements made only about it, also apply to Intel® Xeon™ 5500 processor based systems. The discussion starts with extensions to the basic cycle accounting methodology outlined for Intel® Core™2 processors(1) and also covers both the NUMA-directed capabilities and the large extension to precise event based sampling (PEBS).
Software optimization based on performance analysis of large existing applications, in most cases, reduces to optimizing the code generation by the compiler and optimizing the memory access; this paper will focus on that approach. Optimizing the code generation by the compiler requires inspecting the assembly code of the time-consuming parts of the application and verifying that the compiler generated a reasonable code stream. Optimizing the memory access is a complex issue involving the bandwidth and latency capabilities of the platform, the hardware and software prefetching efficiencies, and the virtual address layout of the heavily accessed variables. The memory access is where the NUMA nature of the Intel® Core™ i7 processor based platforms becomes an issue.
Performance analysis illuminates how the existing invocation of an algorithm executes. It allows a software developer to improve the performance of that invocation. It does not offer much insight about how to change an algorithm, as that really requires a better understanding of the problem being solved rather than the performance of the existing solution. That being said, the performance gains that can be achieved on a large existing code base can regularly exceed a factor of 2 (particularly in HPC), which is certainly worth the comparatively small effort required.
Basic Intel® Core™ i7 Processor and Intel® Xeon™ 5500 Processor Architecture and Performance Analysis
Figure 1: A two socket platform. Each socket has four cores (C0-C3) sharing an 8M last level cache (LLC), with locally attached DDR3 memory.
Each core is quite similar to that of the Intel® Core™2 processor. The pipelines are rather similar, except that the Intel® Core™ i7 core and pipeline support Intel® Hyper-Threading Technology (HT), allowing the hardware to interleave instructions of two threads during execution to maximize utilization of the core's resources. Intel® Hyper-Threading Technology (HT) can be enabled or disabled through a BIOS setting.
Each core has a 32KB data cache, a 32KB instruction cache, a 256KB unified mid-level cache and a two-level DTLB system of 64 and 512 entries. There is a single 32-entry large-page DTLB. The cores in a socket share an inclusive last level cache. The inclusive aspect of this cache is an important issue and will be discussed later. In the usual DP configuration the shared, inclusive last level cache is 8MB and 16-way associative.
The cache coherency protocol messages between the multiple sockets are exchanged over the Intel® QuickPath Interconnect. The inclusive L3 CACHE allows this protocol to be extremely fast, with the latency to the L3 CACHE of the adjacent socket being even less than the latency to the local memory.
One of the main virtues of the integrated memory controller is the separation of the cache coherency traffic and the memory access traffic. This enables an enormous increase in memory access bandwidth, but it also results in non-uniform memory access (NUMA): the latency to the memory DIMMs attached to a remote socket is considerably longer than to the local DIMMs. A second advantage is that the memory control logic can run at processor frequencies and thereby reduce the latency.
Core Out of Order Pipeline
Figure 2: The core out of order pipeline: instructions are fetched and decoded into uops by the front end, issued to the out of order engine, dispatched to the execution units, and finally retired with writeback of architectural state.
After instructions are decoded into the executable micro operations (uops), they are
assigned their required resources. They can only be issued to the downstream stages
when there are sufficient free resources. This would include (among other requirements):
1) space in the Reservation Station (RS), where the uops wait until their inputs are
available
2) space in the Reorder Buffer, where the uops wait until they can be retired
3) sufficient load and store buffers in the case of memory related uops (loads and
stores)
Retirement and write back of state to visible registers is only done for instructions and
uops that are on the correct execution path. Instructions and uops of incorrectly predicted
paths are flushed upon identification of the misprediction and the correct paths are then
processed. Retirement of the correct execution path instructions can proceed when two
conditions are satisfied
1) The uops associated with the instruction to be retired have completed, allowing the retirement of the entire instruction, or, in the case of instructions that generate a very large number of uops, enough have completed to fill the retirement window
2) Older instructions and their uops of correctly predicted paths have retired
The mechanics of enforcing these requirements ensure that the visible state is always consistent with in-order execution of the instructions.
The “magic” of this design is that if the oldest instruction is blocked, for example waiting
for the arrival of data from memory, younger independent instructions and uops, whose
inputs are available, can be dispatched to the execution units and warehoused in the
ROB upon completion. They will then retire when all the older work has completed.
The terms “issued”, “dispatched”, “executed” and “retired” have very precise meanings
as to where in this sequence they occur and are used in the event names to help document
what is being measured.
In the Intel® Core™ i7 processor, the reservation station has 36 entries, which are shared between the Hyper-Threads when that mode (HT) is enabled in the BIOS, with some entries reserved for each thread to avoid one thread locking out the other. If HT is not enabled, all 36 entries are available to the single running thread. There are 128 positions in the reorder buffer, which are likewise divided if HT is enabled or entirely available to the single thread if HT is not enabled. As on Core™2 processors, the RS dispatches the uops to one of 6 dispatch ports where they are consumed by the execution units. This implies that on any cycle between 0 and 6 uops can be dispatched for execution.
The hardware branch prediction requests the bytes of instructions for the predicted code paths from the 32KB L1 instruction cache at a maximum bandwidth of 16 bytes/cycle. Instruction fetches are always 16-byte aligned, so if a hot code path starts on the 15th byte, the front end will only receive 1 byte on that cycle. This can aggravate instruction bandwidth issues. The instructions are referenced by virtual address and translated to physical address with the help of a 128-entry instruction translation lookaside buffer (ITLB). The x86 instructions are decoded into the processor's uops by the pipeline front end. Four instructions can be decoded and issued per cycle.
If the branch prediction hardware mispredicts the execution path, the uops from the
incorrect path which are in the instruction pipeline are simply removed where they are,
without stalling execution. This reduces the cost of branch mispredictions. Thus the
“cost” associated with such mispredictions is only the wasted work associated with any of
the incorrect path uops that actually got dispatched and executed and any cycles that are
idle while the correct path instructions are located, decoded and inserted into the
execution pipeline.
Core Memory Subsystem
Virtual to physical address translations are kept in the data translation lookaside buffers (DTLBs) for future reuse, as all load and
store operations require such a translation to access the data caches. Programs reference
virtual addresses but access the cachelines in the caches through the physical addresses.
As mentioned before, there is a multi level TLB system in each core for the 4KB pages.
The level 1 caches have TLBs of 64 and 128 entries respectively for the data and
instruction caches. There is a shared 512 entry second level TLB. There is a 32 entry
DTLB for the large 2/4MB pages should the application allocate and access any large
pages. There are 7 large page ITLB entries per HT. When a translation entry cannot be
found in the DTLBs the hardware page walker (HPW) works with the OS translation data
structures to retrieve the needed translation and updates the DTLBs. The hardware page
walker begins its search in the cache for the table entry and then can continue searching in
memory if the page containing the entry required is not found.
Cacheline coherency in a multi core multi socket system must be maintained to ensure
that the correct values for the data variables can be retrieved. This has traditionally been
done through the use of a four-value state for each copy of each cacheline. The four-state
(MESI) cacheline protocol allows for coherent use of data in a multi-core, multi-socket
platform. A line that is only read can be shared and the cacheline access protocol supports
this by allowing multiple copies of the cacheline to coexist in the multiple cores. Under
these conditions, the multiple copies of the cacheline would be in what is called a Shared
state (S). A cacheline can be put in an Exclusive state (E) in response to a “read for
ownership” (RFO) in order to store a value. All instructions containing a lock prefix will
result in an RFO since they always result in a write to the cache line. The lock prefix (0xF0)
is either present in the opcode or implied by the xchg and cmpxchg instructions when a
memory access is one of the operands. The exclusive state ensures exclusive access of the
line. Once one of the copies is modified the cacheline’s state is changed to Modified (M).
That change of state is propagated to the other cores, whose copies are changed to the
Invalid state (I).
With the introduction of the Intel® QuickPath Interconnect protocol the 4 MESI states
are supplemented with a fifth, Forward (F), state for lines forwarded from one socket to
another.
When a cacheline, required by a data access instruction, cannot be found in the L1 data
cache it must be retrieved from a higher level and longer latency component of the
memory access subsystem. Such a cache miss results in an invalid state being set for the
cacheline. This mechanism can be used to count cache misses.
The L1D miss creates an entry in the 16 element superqueue and allocates a line fill
buffer. If the line is found in the 256KB mid level cache (MLC, also referred to as L2), it
is transferred to the L1 data cache and the data access instruction can be serviced. The
load latency from the L2 CACHE is 10 cycles, resulting in a performance penalty of
around 6 cycles, the difference of the effective L2 CACHE and L1D latencies. If the line
is not found in the L2 CACHE, then it must be retrieved from the uncore.
When all the line fill buffers are in use, the data access operations in the load and store
buffers cannot be processed. They are thus queued up in the load and store buffers. When
all the load or store buffers are occupied, the front end is inhibited from issuing uops to
the RS and OOO engine. This is the same mechanism as used in Core™2 processors to
maintain pipeline consistency.
The Intel® Core™ i7 processor has a four-component hardware prefetcher very similar to that of the Core™ processors: two components associated with the L2 CACHE and two components associated with the L1 data cache. The two L2 CACHE hardware prefetcher components are similar to those in the Pentium™ 4 and Core™ processors. There is a "streaming" component that looks for multiple accesses in a local address window as a trigger, and an "adjacency" component that causes 2 lines to be fetched instead of one with each triggering of the "streaming" component. The L1 data cache prefetcher is similar to the L1 data cache prefetcher familiar from the Core™ processors. It has another "streaming" component (which was usually disabled in the BIOS for the Core™ processors) and a "stride" or "IP" component that detects constant-stride accesses at individual instruction pointers. The Intel® Core™ i7 processor has various improvements in the details of the hardware pattern identification used in the prefetchers.
Uncore Memory Subsystem
Figure 3: The Intel® Xeon™ 5500 processor uncore: the Global Queue (GQ) connects the cores (C0-C3) to the last level cache (LLC), the Integrated Memory Controller (IMC), the Intel® QuickPath Interconnect Home Logic (QHL) and the Intel® QuickPath Interconnect link controller; a power control unit, a control register access bus controller and the PLL farm complete the uncore.
Cacheline requests from the cores, from a remote package or from the I/O Hub are handled by the Intel® Xeon™ 5500 processor uncore's Global Queue (GQ). The GQ contains three request queues for this purpose: one of 16 entries for writes, one of 12 entries for off-package requests delivered by the Intel® QuickPath Interconnect, and one of 32 entries for load requests from the cores.
On receiving a cacheline request from one of the cores, the GQ first checks the Last Level
Cache (L3 CACHE) to see if the line is on the package. As the L3 CACHE is inclusive,
the answer can be quickly ascertained. If the line is in the L3 CACHE and was owned by
the requesting core it can be returned to the core from the L3 CACHE directly. If the line
is being used by multiple cores, the GQ will snoop the other cores to see if there is a
modified copy. If so the L3 CACHE is updated and the line is sent to the requesting core.
In the event of an L3 CACHE miss the GQ must send out requests for the line. Since the
cacheline could be in the other package, a request through the Intel® QuickPath
Interconnect (Intel QPI) to the remote L3 CACHE must be made. As each Intel® Core™
i7 processor package has its own local integrated memory controller the GQ must identify
the “home” location of the requested cacheline from the physical address. If the address
identifies home as being on the local package, then the GQ makes a simultaneous request
to the local memory controller, the Integrated memory controller (IMC). If home is
identified as belonging to the remote package, the request sent by the QPI will also be
used to access the remote IMC.
This process can be viewed in the terms used by the Intel® QuickPath Interconnect
protocol. Each socket has a Caching agent (that might be thought of as the GQ plus the L3
CACHE) and a Home agent (the IMC). An L3 CACHE miss results in simultaneous
queries for the line from all the Caching Agents and the Home agent (wherever it is). This
is shown diagrammatically below for a system with 3 caching agents (2 sockets and an I/O hub), none of which has the line, and a home agent, which ultimately delivers the line to the caching agent C that requested it.
Figure 4: Intel® QuickPath Interconnect message flow for an L3 CACHE miss. Caching agent C issues the read request; caching agents A and B are snooped (SnpData) and respond that they do not have the line, while the home agent reads memory and delivers the line to caching agent C, which receives it in the Exclusive state.
Clearly, the IMC has queues for handling local and remote, read and write requests. These
will be discussed at greater length as the events that monitor their use are described.
Cycle Accounting and Uop Flow
Figure 5: The uop flow and the events that measure it: UOPS_ISSUED counts uops leaving the front end (instruction fetch, branch prediction and decode), RESOURCE_STALLS counts cycles where issue is blocked by full downstream resources, UOPS_EXECUTED counts uops dispatched to the execution units, and UOPS_RETIRED counts uops at retirement/writeback.
The pipeline has buffers distributed along the uop flow path, for example the RS and ROB. The result is that flow discontinuities (stalls) at one location do not necessarily propagate to all locations. The OOO execution can keep the execution units occupied during cycles where no uops retire, with the completed uops simply being staged in the ROB for future retirement. Similarly, the buffering in the RS can keep the execution units occupied during short discontinuities in the uops being issued by the pipeline front end. The design optimizes the continuity of the uop flow at the dispatch to the execution units. Thus SW performance optimization should also focus on this objective.
In order to evaluate the efficiency of execution, cycles are divided into those where micro-ops are dispatched to the execution units and those where no micro-ops are dispatched; the latter are thought of as execution stalls. In the Intel® Core™ i7 processor (as on the Intel® Core™ 2 processors) uops are dispatched to one of six ports. By comparing the total uop count on each cycle to 1 (cmask=1) and using a "less than" (inv=1) and "greater than or equal to" (inv=0) comparison, the PMU can divide all cycles into "stalled" and "unstalled" classes. These PMU programmings are predefined, and by construction they satisfy:
UOPS_EXECUTED.CORE_STALL_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES = Total cycles
This expression is in a sense a trivial truism: uops either are, or are not, executed on any given cycle. This technique can be applied to any core event, with any threshold (cmask) value, and it will always hold. Any event, with a given cmask threshold value, counts the cycles where the event's value is >= the cmask value (inv=0), or < the cmask value (inv=1). Thus the sum of the counts for inv=0 and inv=1 for a non-zero cmask will always be the total core cycles, not just the unhalted cycles. This sum is of course subject to any frequency throttling the core might experience during the counting period.
The choice of dividing cycles at execution in this particular manner is driven by the
realization that ultimately keeping the execution units occupied is one of the essential
objectives of optimization.
Total cycles can be directly measured with CPU_CLK_UNHALTED.TOTAL_CYCLES.
This event is derived from CPU_CLK_UNHALTED.THREAD by setting the cmask = 2
and inv = 1, creating a condition that is always true. The difference between these two is
the halted cycles. These occur when the OS runs the null process.
The signals used to count the memory access uops executed (ports 2, 3 and 4) are the
only core events which cannot be counted on a logical core or HT basis. Thus the total
execution stall cycles can only be evaluated on a per core basis. If HT is disabled this presents no difficulty; there is some added complexity when HT is enabled, however. While the memory ports only count on a per core basis, the ALU ports (0, 1, 5) count on a per thread basis. Consequently, the number of cycles where no uops were dispatched on the ALU ports can be evaluated on a per thread basis. This event is called
UOPS_EXECUTED.PORT015_STALL_CYCLES. Thus in the case where HT is
enabled we have the following inequality
UOPS_EXECUTED.CORE_STALL_CYCLES <=
True execution stalls per thread <=
UOPS_EXECUTED.PORT015_STALL_CYCLES
In addition the uop flow can be measured at issue and retirement on a per thread basis
and so can the number of cycles where no uops flow at those points. These events are
predefined as UOPS_ISSUED.STALL_CYCLES for measuring stalls in uop issue and
UOPS_RETIRED.STALL_CYCLES for measuring stalls in uop retirement, respectively.
The edge detection option in the PMU can be used to count the number of times an event's value changes, by detecting the rising signal edge. If this is applied to UOPS_EXECUTED.CORE_STALL_CYCLES, as (UOPS_EXECUTED:CMASK=1:INV=1:EDGE=1), then the PMU will count the number of stalls. This programming is defined as the event UOPS_EXECUTED.CORE_STALL_COUNT. The ratio,
UOPS_EXECUTED.CORE_STALL_CYCLES / UOPS_EXECUTED.CORE_STALL_COUNT
is the average stall duration, and with the use of sampling can be measured reasonably
accurately even within a code region like a single loop.
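To make the arithmetic concrete, the following Python sketch computes the stall breakdown and the average stall duration from raw event counts; the counts used here are placeholders standing in for values collected with any PMU-based tool, not measured data.

# Sketch: derive stall metrics from raw PMU event counts (placeholder values).
counts = {
    "UOPS_EXECUTED.CORE_STALL_CYCLES": 1_200_000,
    "UOPS_EXECUTED.CORE_ACTIVE_CYCLES": 2_800_000,
    "UOPS_EXECUTED.CORE_STALL_COUNT": 150_000,
}

total_cycles = (counts["UOPS_EXECUTED.CORE_STALL_CYCLES"]
                + counts["UOPS_EXECUTED.CORE_ACTIVE_CYCLES"])
stall_fraction = counts["UOPS_EXECUTED.CORE_STALL_CYCLES"] / total_cycles
# Average stall duration = stall cycles / number of distinct stall intervals.
avg_stall_duration = (counts["UOPS_EXECUTED.CORE_STALL_CYCLES"]
                      / counts["UOPS_EXECUTED.CORE_STALL_COUNT"])
print(f"stalled fraction of cycles: {stall_fraction:.2%}")
print(f"average stall duration:     {avg_stall_duration:.1f} cycles")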
The events were designed to be used in this manner without corrections for micro or
macro fusion. If HT is disabled, the count for the second HT is not needed.
A "per thread" measurement can be made by looking at the difference between the uops issued and the uops retired, as both of these events can be counted per logical core/HT. It overcounts slightly, by the mispredicted uops that are eliminated in the RS before they can waste cycles being executed, but this is a small correction.
As stated above, there is no interruption in uop dispatch or execution due to flushing the
pipeline. Thus the second component of the misprediction penalty is zero.
The third component of the misprediction penalty, instruction starvation, occurs when the
instructions associated with the correct path are far away from the core and execution is
stalled due to a lack of uops. This can now be explicitly measured at the output of the
resource allocation as follows. Using a cmask =1 and inv=1 logic applied to
UOPS_ISSUED, we can count the total number of cycles where no uops were issued to
the OOO engine.
UOPS_ISSUED.STALL_CYCLES = UOPS_ISSUED.ANY:CMASK=1:INV=1
Since the event RESOURCE_STALLS.ANY counts the number of cycles where uops could not be issued due to a lack of downstream resources (RS or ROB slots, load or store buffers, etc), the difference is the number of cycles where no uops were issued because none were available.
With HT disabled we can identify an instruction starvation condition indicating that the
front end was not delivering uops when the execution stage could have accepted them.
Instruction Starvation =
UOPS_ISSUED.STALL_CYCLES - RESOURCE_STALLS.ANY
When HT is enabled, the uop delivery to the RS alternates between the two threads. In an
ideal case the above condition would then count 50% of the cycles, as those cycles were
delivering uops for the other thread. We can modify the expression by subtracting the
cycles that the other thread is having uops issued.
Instruction Starvation =
UOPS_ISSUED.STALL_CYCLES - RESOURCE_STALLS.ANY
-UOPS_ISSUED.ANY:CMASK=1(other thread)
But this will overcount, as the RESOURCE_STALLS condition could exist on "this" thread while the other thread was issuing uops. An alternative might be
CPU_CLK_UNHALTED.THREAD – UOPS_ISSUED.CORE_CYCLES_ACTIVE – RESOURCE_STALLS.ANY
Where UOPS_ISSUED.CORE_CYCLES_ACTIVE counts the UOPS_ISSUED.ANY
event with cmask=1 and allthreads=1, thus counting the cycles either thread issues uops.
The problem of course is that if the other thread can always issue uops, it will mask the
stalls in the thread that cannot.
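The two instruction starvation expressions above can be combined in a small Python sketch; the event counts are placeholders for collected data, and the HT-enabled form is only the approximation discussed above.

# Sketch: instruction starvation estimates from raw event counts (placeholders).
c = {
    "UOPS_ISSUED.STALL_CYCLES": 900_000,          # this thread, cmask=1, inv=1
    "RESOURCE_STALLS.ANY": 400_000,               # this thread
    "UOPS_ISSUED.CORE_CYCLES_ACTIVE": 3_100_000,  # cmask=1, allthreads=1
    "CPU_CLK_UNHALTED.THREAD": 4_000_000,
}
# HT disabled: issue-stall cycles not explained by back-end resource stalls.
starvation_ht_off = c["UOPS_ISSUED.STALL_CYCLES"] - c["RESOURCE_STALLS.ANY"]
# HT enabled: use the cycles where neither thread issued uops instead.
starvation_ht_on = (c["CPU_CLK_UNHALTED.THREAD"]
                    - c["UOPS_ISSUED.CORE_CYCLES_ACTIVE"]
                    - c["RESOURCE_STALLS.ANY"])
print(f"instruction starvation (HT off): {starvation_ht_off} cycles")
print(f"instruction starvation (HT on, approximate): {starvation_ht_on} cycles")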
The per-instruction sample counts are not statistically reliable unless there are many thousands of samples. The solution to this is to average the sample counts over the instructions of the basic block, which yields the best measurement of the basic block execution count.
Basic Block Execution Count = (Σ over instructions in BB of Samples(inst_retired)) * Sample_After_Value / (Number of instructions in BB)
When analyzing the execution of loops, the basic block execution counts can be used to
get the average tripcount (iteration count) of the loop. For a simple loop with no
conditional branches, this ends up being the ratio of the basic block execution count of
the loop block to the basic block execution count of the block immediately before and/or
after the loop block. Judicious use of averaging over multiple blocks can be used to
improve the accuracy. Usually the objective of the analysis is just to determine if the
tripcount is large (> 100) or very small (<10), so this rough technique is usually adequate.
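A minimal Python sketch of this averaging follows; the per-instruction sample counts and the sample-after value are illustrative placeholders, not measured data.

# Sketch: basic block execution counts from INST_RETIRED samples (placeholders).
SAMPLE_AFTER_VALUE = 100_000  # events per sample

def block_execution_count(samples_per_instruction):
    """Average the per-instruction sample counts over the block, then scale."""
    return (sum(samples_per_instruction.values())
            / len(samples_per_instruction)) * SAMPLE_AFTER_VALUE

loop_block = {0x401000: 12, 0x401004: 9, 0x401008: 11, 0x40100C: 10}
entry_block = {0x400FF0: 1, 0x400FF4: 0, 0x400FF8: 1}

loop_count = block_execution_count(loop_block)
entry_count = block_execution_count(entry_block)
# Average tripcount ~ loop block executions / executions of the block before it.
print(f"approximate tripcount: {loop_count / max(entry_count, 1):.1f}")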
There is a fixed counter version of the event and a version that can be programmed into
the general counters, which also uses the PEBS (precise event based sampling)
mechanism. The PEBS mechanism is armed by the overflow of the counter. There is a
short propagation delay between the counter overflow and when PEBS is ready to capture
the next event. This shadow makes the use of the precise event inappropriate for basic
block execution counting. By far the best mechanism for this is to use the PEBS
br_inst_retired.all_branches event and capture the LBRs (Last Branch Records). More
will be said of the use of the precise version in the section on precise events.
A final event should be mentioned in regard to stalled execution. Chains of dependent long latency instructions (fmul, fadd, imul, etc) can result in the dispatch being stalled while the outputs of the long latency instructions become available. In general there are no events that assist in counting such stalls, with the exception of the divide and sqrt instructions. For these two instructions the event ARITH can be used to count both the occurrences of these instructions and the duration in cycles that they kept their execution units occupied. The event ARITH.CYCLES_DIV_BUSY counts the cycles that the divide/sqrt execution unit was occupied (so the event's name is a bit misleading).
The flow of uops is mostly due to the decoded instructions. There are also uops that can enter the flow due to microcoded exception handling, like those associated with floating point exceptions. Microcode will be covered as part of the Front End discussion.
In summary, a table of these events is shown below, with C indicating the CMASK value, I indicating the INV value, E indicating the EDGE DETECT value and AT indicating the value of the ALLTHREAD bit. For the edge detect to work, a non-zero cmask value must also be used.
Table 1
Event Name | Description | Umask | Event | C | I | E | AT
CPU_CLK_UNHALTED.THREAD | Cycles when thread is not halted | 0 | Fixed Ctr | 0 | 0 | 0 | 0
CPU_CLK_UNHALTED.THREAD_P | Cycles when thread is not halted (programmable counter) | 0 | 3C | 0 | 0 | 0 | 0
CPU_CLK_UNHALTED.REF_P | Reference cycles when thread is not halted (programmable counter) | 1 | 3C | 0 | 0 | 0 | 0
INST_RETIRED.ANY | Instructions retired (fixed counter) | 0 | Fixed Ctr | 0 | 0 | 0 | 0
INST_RETIRED.ANY_P | Instructions retired (programmable counter) | 1 | C0 | 0 | 0 | 0 | 0
UOPS_EXECUTED.PORT0 | Uops dispatched from port 0 | 1 | B1 | 0 | 0 | 0 | 0
UOPS_EXECUTED.PORT1 | Uops dispatched on port 1 | 2 | B1 | 0 | 0 | 0 | 0
UOPS_EXECUTED.PORT2_CORE | Uops dispatched on port 2 | 4 | B1 | 0 | 0 | 0 | 1
UOPS_EXECUTED.PORT3_CORE | Uops dispatched on port 3 | 8 | B1 | 0 | 0 | 0 | 1
UOPS_EXECUTED.PORT4_CORE | Uops dispatched on port 4 | 10 | B1 | 0 | 0 | 0 | 1
UOPS_EXECUTED.PORT5 | Uops dispatched on port 5 | 20 | B1 | 0 | 0 | 0 | 0
UOPS_EXECUTED.PORT015 | Uops dispatched on ports 0, 1 or 5 | 40 | B1 | 0 | 0 | 0 | 0
UOPS_EXECUTED.PORT015_STALL_CYCLES | Cycles no Uops dispatched on ports 0, 1 or 5 | 40 | B1 | 1 | 1 | 0 | 0
UOPS_EXECUTED.PORT234_CORE | Uops dispatched on ports 2, 3 or 4 | 80 | B1 | 0 | 0 | 0 | 1
UOPS_EXECUTED.CORE_ACTIVE_CYCLES | Cycles Uops dispatched on any port | 3F | B1 | 1 | 0 | 0 | 1
UOPS_EXECUTED.CORE_STALL_COUNT | Number of times no Uops dispatched on any port | 3F | B1 | 1 | 1 | 1 | 1
UOPS_EXECUTED.CORE_STALL_CYCLES | Cycles no Uops dispatched on any port | 3F | B1 | 1 | 1 | 0 | 1
UOPS_ISSUED.ANY | Uops issued | 1 | 0E | 0 | 0 | 0 | 0
UOPS_ISSUED.STALL_CYCLES | Cycles no Uops were issued | 1 | 0E | 1 | 1 | 0 | 0
UOPS_ISSUED.FUSED | Fused Uops issued | 2 | 0E | 0 | 0 | 0 | 0
UOPS_RETIRED.ACTIVE_CYCLES | Cycles Micro-ops are retiring | 1 | C2 | 1 | 0 | 0 | 0
UOPS_RETIRED.ANY | Micro-ops retired | 1 | C2 | 0 | 0 | 0 | 0
UOPS_RETIRED.STALL_CYCLES | Cycles Micro-ops are not retiring | 1 | C2 | 1 | 1 | 0 | 0
UOPS_RETIRED.RETIRE_SLOTS | Number of retirement slots used | 2 | C2 | 0 | 0 | 0 | 0
UOPS_RETIRED.MACRO_FUSED | Number of macro-fused Uops retired | 4 | C2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.ANY | Resource related stall cycles | 1 | A2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.LOAD | Load buffer stall cycles | 2 | A2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.RS_FULL | Reservation Station full stall cycles | 4 | A2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.STORE | Store buffer stall cycles | 8 | A2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.ROB_FULL | ROB full stall cycles | 10 | A2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.FPCW | FPU control word write stall cycles | 20 | A2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.MXCSR | | 40 | A2 | 0 | 0 | 0 | 0
RESOURCE_STALLS.OTHER | Other Resource related stall cycles | 80 | A2 | 0 | 0 | 0 | 0
Stall Decomposition Overview
The stall cycles that can be attributed to counted events are approximated by summing, over those events, the penalty Pi of each event times the number of occurrences Ni:
Counted_Stall_Cycles = Σ Pi * Ni
This only accounts for the performance impacting events that are or can be counted with a PMU event. Ultimately there will be several sources of stalls that cannot be counted; however, their total contribution can be estimated by the difference between the total measured execution stall cycles and the counted stall cycles above.
Measuring Penalties
Decomposing the stalled cycles in this manner should always start by first considering
the large penalty events, events with penalties of greater than 10 cycles for example.
Short penalty events (P < 5 cycles) can frequently be hidden by the combined actions of
the OOO execution and the compiler. Both of these strive to create maximal parallel
execution for precisely the purpose of keeping the execution units busy during stalls due
to instruction dependencies. The large penalty operations are dominated by memory
access and the very long latency instructions for divide and sqrt.
The largest penalty events are associated with load operations that require a cacheline
which is not in one of the core’s two data caches. Not only must we count how many
occur, but we need to know what penalty to assign. The standard approach to measuring
latency is to measure the average number of cycles a request is in a queue.
Latency = (Σcycles Queue_entries_outstanding)/Queue_inserts
However, the penalty associated with each queue insert (i.e. cache miss) is the latency divided by the average queue occupancy. This correction is needed to avoid over counting associated with overlapping penalties.
Average Queue Depth = (Σ_cycles Queue_entries_outstanding) / Cycles_queue_not_empty
Thus
Penalty = Latency / Average Queue Depth = Cycles_queue_not_empty / Queue_inserts
An alternative way of thinking about this is to realize that the sum of all the penalties, for
an event that occupies a queue for its duration, cannot exceed the time that the queue is
not empty.
Cycles_queue_not_empty >= Events * <Penalty>
The equality results in the expression derived in the first part of the discussion.
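For completeness, a short Python sketch of the queue-occupancy bookkeeping is shown below; the three raw counts are placeholders for values read from occupancy-style events.

# Sketch: latency and per-insert penalty from queue occupancy counts (placeholders).
sum_outstanding_per_cycle = 5_000_000  # sum over cycles of entries outstanding
queue_inserts = 100_000                # requests inserted into the queue
cycles_queue_not_empty = 1_000_000     # cycles with at least one entry outstanding

latency = sum_outstanding_per_cycle / queue_inserts
avg_queue_depth = sum_outstanding_per_cycle / cycles_queue_not_empty
penalty = latency / avg_queue_depth    # == cycles_queue_not_empty / queue_inserts
print(f"average latency:     {latency:.1f} cycles")
print(f"average queue depth: {avg_queue_depth:.1f}")
print(f"effective penalty:   {penalty:.1f} cycles per insert")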
Neither of these more standard techniques will be used much for this processor, in part due to the wide number of data sources and the large variations in their data delivery latencies. Precise Event Based Sampling (PEBS) will be the technique of choice. The use of the precise latency event, discussed later, provides a more accurate and flexible measurement technique when sampling is used. As each sample records both a load to use latency and a data source, the average latency per data source can be evaluated. Further, as the PEBS hardware supports buffering the events without generating a PMI until the buffer is full, it is possible to make such an evaluation quite efficient.
While there are many events that will yield the number of L2 CACHE misses, the associated penalties may average over a wide variety of data sources which actually have individual penalties that vary by an order of magnitude. A more detailed decomposition is needed than just an L2 CACHE miss.
The approximate latencies for the individual data sources that respond to an L2 CACHE
miss are shown in table 2. These values are only approximate as they depend on
processor frequency and DIMM speed among other things.
Table 2
Data Source | Latency
L3 CACHE hit, line unshared | ~40 cycles
L3 CACHE hit, shared line in another core | ~65 cycles
L3 CACHE hit, modified in another core | ~75 cycles
Remote L3 CACHE | ~100-300 cycles
Local DRAM | ~60 ns
Remote DRAM | ~100 ns
As the PEBS mechanism captures the values of the registers at completion of the instruction, the dereferenced address for the following type of load instruction (Intel asm convention) cannot be reconstructed:
MOV RAX, [RAX+const]
This kind of instruction is mostly associated with pointer chasing
mystruc = mystruc->next;
This is a significant shortcoming of this approach to capturing memory instruction
addresses.
The basic memory access events are shown in the table below:
Table 3
Event Name | Description | Umask | Event
MEM_INST_RETIRED.LOADS | Instructions retired which contain a load | 01 | 0B
MEM_INST_RETIRED.STORES | Instructions retired which contain a store | 02 | 0B
MEM_LOAD_RETIRED.L1D_HIT | Retired loads that hit the L1 data cache | 01 | CB
MEM_LOAD_RETIRED.L2_HIT | Retired loads that hit the L2 cache | 02 | CB
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT | Retired loads that hit the L3 CACHE | 04 | CB
MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM | Retired loads that hit a sibling core's L2 | 08 | CB
MEM_LOAD_RETIRED.LLC_MISS | Retired loads that miss the L3 CACHE | 10 | CB
MEM_LOAD_RETIRED.DROPPED_EVENTS | Retired load info dropped due to data breakpoint | 20 | CB
MEM_LOAD_RETIRED.HIT_LFB | Retired loads that miss the L1 data cache and hit a line fill buffer | 40 | CB
MEM_LOAD_RETIRED.DTLB_MISS | Retired loads that miss the DTLB | 80 | CB
MEM_UNCORE_RETIRED.OTHER_CORE_L2_HITM | Memory instructions retired, L3 CACHE hit and HITM in sibling core | 02 | 0F
MEM_UNCORE_RETIRED.REMOTE_CACHE_LOCAL_HOME_HIT | Memory instructions retired, remote cache HIT | 08 | 0F
MEM_UNCORE_RETIRED.REMOTE_DRAM | Memory instructions retired, remote DRAM | 10 | 0F
MEM_UNCORE_RETIRED.LOCAL_DRAM | Memory instructions retired, local DRAM | 20 | 0F
MEM_UNCORE_RETIRED.UNCACHEABLE | Memory instructions retired, uncacheable IO | 80 | 0F
MEM_STORE_RETIRED.DTLB_MISS | Retired stores that miss the DTLB | 01 | 0C
MEM_STORE_RETIRED.DROPPED_EVENTS | Retired stores dropped due to data breakpoint | 02 | 0C
ITLB_MISS_RETIRED | Retired instructions that missed the ITLB | 20 | C8
Strictly speaking the ITLB miss event is really an execution event but is listed here as it
is associated with cacheline access.
The precise events listed above allow load driven cache misses to be identified by data
source. This does not identify the “home” location of the cachelines with respect to the
NUMA configuration. The exceptions to this statement are the events MEM_UNCORE_RETIRED.LOCAL_DRAM and MEM_UNCORE_RETIRED.REMOTE_DRAM. These can be used in conjunction
with instrumented malloc invocations to identify the NUMA “home” for the critical
contiguous buffers used in an application.
The sum of all the MEM_LOAD_RETIRED events will equal the
MEM_INST_RETIRED.LOADS count.
A count of L1D misses can be achieved with the use of all the MEM_LOAD_RETIRED
events, except MEM_LOAD_RETIRED.L1D_HIT. It is better to use all of the individual
MEM_LOAD_RETIRED events to do this, rather than the difference of
MEM_INST_RETIRED.LOADS-MEM_LOAD_RETIRED.L1D_HIT because while the
total counts of precise events will be correct, and they will correctly identify instructions
that caused the event in question, the distribution of the events may not be correct due to
PEBS SHADOWING, discussed later in this section.
L1D_MISSES = MEM_LOAD_RETIRED.HIT_LFB + MEM_LOAD_RETIRED.L2_HIT +
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM +
MEM_LOAD_RETIRED.LLC_MISS
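In a post-processing script this is just a sum over the individual precise events; the counts below are placeholders.

# Sketch: L1D load miss count and breakdown from the MEM_LOAD_RETIRED events.
mem_load_retired = {               # placeholder counts
    "HIT_LFB": 20_000,
    "L2_HIT": 50_000,
    "LLC_UNSHARED_HIT": 8_000,
    "OTHER_CORE_L2_HIT_HITM": 1_500,
    "LLC_MISS": 3_000,
}
l1d_misses = sum(mem_load_retired.values())
print(f"L1D load misses: {l1d_misses}")
for source, n in mem_load_retired.items():
    print(f"  {source:<24} {n / l1d_misses:.1%}")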
When a modified line is retrieved from another socket it is also written back to memory.
This causes remote HITM access to appear as coming from the home dram. The
MEM_UNCORE_RETIRED.LOCAL_DRAM and
MEM_UNCORE_RETIRED.REMOTE_DRAM thus also count the L3 CACHE misses
satisfied by modified lines in the caches of the remote socket.
Latency Event
Saving the best for last, the Intel® Core™ i7 processor has a “latency event” which is
very similar to the Itanium® Processor Family Data EAR event. This event samples
loads, recording the number of cycles between the execution of the instruction and the
actual delivery of the data. If the measured latency is larger than the minimum latency
programmed into MSR 0x3f6, bits 15:0, then the counter is incremented. Counter
overflow arms the PEBS mechanism and on the next event satisfying the latency
threshold, the measured latency, the virtual or linear address and the data source are
copied into 3 additional registers in the PEBS buffer. Because the virtual address is
captured into a known location, the sampling driver could also execute a virtual to
physical translation and capture the physical address. The physical address identifies the
NUMA home location and in principle allows an analysis of the details of the cache
occupancies.
Further, as the address is captured before retirement even the pointer chasing encodings
MOV RAX, [RAX+const]
have their addresses captured.
Because an MSR is used to program the latency only one minimum latency value can be
sampled on a core during a given period. To enable this, the Intel performance tools
restrict the programming of this event to counter 4 to simplify the scheduling.
The preprogrammed event files used by the Intel® PTU and VTune™ Performance
Analyzer contain the following latency events, differing in the minimum latencies
required to make them count. Both tools do the required programming of MSR 0x3f6.
Table 4
Event Name | Description | Umask | Event
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_0 | Load instructions retired above 0 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_4 | Load instructions retired above 4 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_8 | Load instructions retired above 8 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_10 | Load instructions retired above 16 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_20 | Load instructions retired above 32 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_40 | Load instructions retired above 64 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_80 | Load instructions retired above 128 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_100 | Load instructions retired above 256 cycles | 10 | 0B
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_… | Load instructions retired above … cycles | 10 | 0B
The Data Source register captured in the PEBS buffer with the Latency Event is
interpreted as follows:
Table 5
Data Source Encoding | Data Source Short Description | Data Source Longer Description
0x0 | Unknown Miss | Unknown cache miss.
0x1 | L1 Hit | Minimal latency core cache hit. This request was satisfied by the data cache.
0x2 | Fill Buffer Hit | Pending core cache hit. The data is not yet in the data cache, but is located in a line fill buffer and will soon be committed to cache. The data request was satisfied from the line fill buffer.
0x3 | L2 CACHE Hit | Highest latency core cache hit. This data request was satisfied by the L2 CACHE.
0x5 | L3 CACHE Hit, Other Core Hit Snp | Hit the last level cache and was serviced by another core with a cross core snoop where no modified copies were found (clean).
0x6 | L3 CACHE Hit, Other Core HitM | Hit the last level cache and was serviced by another core with a cross core snoop where modified copies were found (HITM).
0x7 | L3 CACHE, No Details | Encoding not supported.
The latency event is the recommended method to measure the penalties for a cycle
accounting decomposition. Each time a PMI is raised by this PEBS event a load to use
latency and a data source for the cacheline is recorded in the PEBS buffer. The data
source for the cacheline can be deduced from the low order 4 bits of the data source field
and the table shown above. Thus an average latency for each of the 16 sources can be
evaluated from the collected data. As only one minimum latency at a time can be collected, it may be awkward to evaluate the latency for both an L2 CACHE hit and a remote socket DRAM access. A minimum latency of 32 cycles should give a reasonable distribution for all the offcore sources, however. The Intel® PTU version 3.2 performance tool can
display the latency distribution in the data profiling mode and allows sophisticated event
filtering capabilities for this event.
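A small Python sketch of this post-processing is shown below; the PEBS records are placeholders for what a PEBS-capable sampling driver would collect, and only the low 4 bits of the data source field are interpreted, per Table 5.

# Sketch: average load-to-use latency per data source from PEBS latency records.
from collections import defaultdict

DATA_SOURCE_NAMES = {          # low 4 bits of the data source field (Table 5)
    0x0: "Unknown miss", 0x1: "L1 hit", 0x2: "Fill buffer hit", 0x3: "L2 hit",
    0x5: "L3 hit, other core hit (clean)", 0x6: "L3 hit, other core HITM",
}
pebs_records = [(0x3, 11), (0x3, 10), (0x6, 78), (0x5, 66), (0x1, 4)]  # placeholders

latencies = defaultdict(list)
for source_field, latency in pebs_records:
    latencies[source_field & 0xF].append(latency)   # keep only the low 4 bits

for src, vals in sorted(latencies.items()):
    name = DATA_SOURCE_NAMES.get(src, f"encoding {src:#x}")
    print(f"{name:<35} avg {sum(vals) / len(vals):5.1f} cycles over {len(vals)} samples")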
The IP captured in the PEBS record is that of the instruction following the one that triggered the event; thus for a taken, retired branch the PEBS IP is the branch target, and the branch that was not taken and retired is the instruction before the IP in the PEBS buffer.
In the case of near calls retired, this means that Event Based Sampling (EBS) can be used
to collect accurate function call counts. As this is the primary measurement for driving
the decision to inline a function, this is an important improvement. In order to measure
call counts, you must sample on calls. Any other trigger introduces a bias that cannot be
guaranteed to be corrected properly.
The precise branch events are shown in the table below:
Table 6
Event Name Description umask Event
BR_INST_RETIRED.CONDITIONAL Retired conditional branch instructions 01 C4
BR_INST_RETIRED.NEAR_CALL Retired near call instructions 02
BR_INST_RETIRED.ALL_BRANCHES Retired branch instructions 04
Shadowing
There is one source of sampling bias associated with precise events. It is due to the time delay between the PMU counter overflow and the arming of the PEBS hardware. During this period events cannot be detected due to the timing shadow. To illustrate the effect, consider a function call chain where a long duration function, foo, calls a chain of 3 very short duration functions, foo1 calling foo2 which calls foo3, followed by a call to a long duration function, foo4. If the durations of foo1, foo2 and foo3 are less than the shadow period, the distribution of PEBS-sampled calls will be severely distorted:
1) if the overflow occurs on the call to foo, the PEBS mechanism is armed by the time the call to foo1 is executed and samples will be taken showing the call to foo1 from foo.
2) If the overflow occurs due to the call to foo1, foo2 or foo3 however, the PEBS mechanism will not be armed until execution is in the body of foo4. Thus the calls to foo2, foo3 and foo4 cannot appear as PEBS-sampled calls.
Shadowing can affect the distribution of all PEBS events. It will also affect the distribution of basic block execution counts identified by using the combination of a branch retired event (PEBS or not) and the last entry in the LBR. If there were no delay between the PMU counter overflow and the LBR freeze, the last LBR entry could be used to sample taken retired branches and, from that, the basic block execution counts: all the instructions between the last taken branch and the previous target are executed once, and all the instructions in a basic block are by definition executed the same number of times. Such a sampling could be used to generate a "software" instruction retired event with uniform sampling, which in turn could be used to identify basic block execution counts. Unfortunately the shadowing causes the branches at the end of short basic blocks to not be the last entry in the LBR, distorting the measurement.
The shadowing effect on call counts and basic block execution counts can be
alleviated to a large degree by averaging over the entries in the LBR. This will be
discussed in the section on LBRs.
Loop Tripcounts
The available options for optimizing loops are completely constrained by the loop
tripcount. For counted loops it is very common for the induction variable to be compared
to the tripcount in the termination condition evaluation. This is particularly true if the
induction variable is used within the body of the loop, even in the face of heavy
optimization. Thus a sequence like the following will appear in the disassembly:
addq $0x8, %rcx
cmpq %rax, %rcx
jnge triad+0x27
This was from a heavily optimized triad loop that the compiler had unrolled by 8X. In this case the two registers, rax and rcx, are the tripcount and the induction variable. If the PEBS buffer is captured for the conditional branches retired event, the average values of the two registers in the compare can be evaluated; the one with the larger average will be the tripcount. Thus the average, RMS, min and max can be evaluated, and even a distribution of the recorded values.
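A Python sketch of this register averaging follows; the captured rax/rcx values are placeholders for the PEBS records of the conditional branch retired event inside the loop.

# Sketch: picking the tripcount register from PEBS register captures (placeholders).
from statistics import mean

pebs_regs = [
    {"rax": 4096, "rcx": 512}, {"rax": 4096, "rcx": 1024},
    {"rax": 4096, "rcx": 2048}, {"rax": 4096, "rcx": 3584},
]
avg_rax = mean(r["rax"] for r in pebs_regs)
avg_rcx = mean(r["rcx"] for r in pebs_regs)
# The register with the larger average is the tripcount; the other is the
# induction variable caught part way through the loop.
name, tripcount = max([("rax", avg_rax), ("rcx", avg_rcx)], key=lambda t: t[1])
print(f"tripcount register: {name}, average value: {tripcount:.0f}")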
Last Branch Record (LBR)
Table 7
LBR Filter Bit Name | Bit Description | Bit
CPL_EQ_0 | Exclude ring 0 | 0
CPL_NEQ_0 | Exclude ring 3 | 1
JCC | Exclude taken conditional branches | 2
NEAR_REL_CALL | Exclude near relative calls | 3
NEAR_INDIRECT_CALL | Exclude near indirect calls | 4
NEAR_RET | Exclude near returns | 5
NEAR_INDIRECT_JMP | Exclude near unconditional indirect branches | 6
NEAR_REL_JMP | Exclude near unconditional relative branches | 7
FAR_BRANCH | Exclude far branches | 8
The default is to capture all branches at all privilege levels (all bits zero). Another
reasonable programming would set all bits to 1 except bit 1 (capture ring 3) and bit 3
(capture near calls) and bits 6 and 7. This would leave only ring 3 calls and unconditional
jumps in the LBR. Such a programming would result in the LBR having the last 16 taken
calls and unconditional jumps retired and their targets in the buffer. A PMU sampling
driver could then capture this restricted “call chain” with any event, thereby providing a
“call tree” context. The inclusion of the unconditional jumps will unfortunately cause
problems, particularly when there are if-else structures within loops. In a case where
there were particularly frequent function calls at all levels, the inclusion of returns could
be added to clarify the context. However this would reduce the call chain depth that could
be captured. A fairly obvious usage would be to trigger the sampling on extremely long
latency loads, to enrich the sample with accesses to heavily contended locked variables,
and then capture the call chain to identify the context of the lock usage.
Call Counts and Function Arguments
If the LBRs are captured for PMIs triggered by the BR_INST_RETIRED.NEAR_CALL event, then the call count per calling function can be determined by simply using the last entry in the LBR. As the PEBS IP will equal the last target IP in the LBR, it is the entry point of the called function. Similarly, the last source in the LBR buffer was the call site from within the calling function. If the full PEBS record is captured as well, then for functions with limited numbers of arguments on Intel64 OS's, you can sample both the call counts and the function arguments.
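The per-call-site counting reduces to a histogram over the last LBR entry of each sample; the following Python sketch uses placeholder addresses and a placeholder sample-after value.

# Sketch: call counts per (call site, callee) pair from NEAR_CALL PEBS samples.
from collections import Counter

SAMPLE_AFTER_VALUE = 10_000            # calls per sample (placeholder)
samples_last_lbr = [                   # (source = call site, target = callee entry)
    (0x401A10, 0x402000), (0x401A10, 0x402000), (0x401B24, 0x403500),
    (0x401A10, 0x402000), (0x401C40, 0x402000),
]
for (call_site, callee), n in Counter(samples_last_lbr).most_common():
    print(f"call site {call_site:#x} -> callee {callee:#x}: ~{n * SAMPLE_AFTER_VALUE} calls")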
LBRs and Basic Block Execution Counts
Another interesting usage is to use the BR_INST_RETIRED.ALL_BRANCHES event
and the LBRs with no filter to evaluate the execution rate of basic blocks. As the LBRs
capture all taken branches, all the basic blocks between a branch IP (source) and the
previous target in the LBR buffer were executed one time. Thus a simple way to evaluate
the basic block execution counts for a given load module is to make a map of the starting
locations of every basic block. Then, for each sample triggered by the PEBS collection of BR_INST_RETIRED.ALL_BRANCHES, start from the PEBS address (a target, but perhaps of a not-taken branch and thus not necessarily in the LBR buffer) and walk backwards through the LBRs until finding an address not corresponding to the load module of interest, counting all the basic blocks that were executed. Calling this value "number_of_basic_blocks", increment the execution counts for all of those blocks by 1/(number_of_basic_blocks); a sketch of this walk is shown after the illustration below. This technique also yields the taken and not-taken rates for the active branches: all branch instructions between a source IP and the previous target IP (within the same module) were not taken, while the branches listed in the LBR were taken. This is illustrated in the graphics below.
Branch_0 Target_0
Branch_1 Target_1
All instructions between Target_0 and Branch_1 are retired 1 time
All Basic Blocks between Target_0 and Branch_1 are executed 1 time
All Branch Instructions between Target_0 and Branch_1 are not taken
Assuming every branch is taken and averaging over the basic blocks in the LBR trajectory then yields the per-block execution count estimates.
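The following Python sketch implements the backwards LBR walk described above under simplifying assumptions: the basic block start addresses, module range, LBR entries and PEBS address are all placeholders, and every address is assumed to fall in a known block.

# Sketch of the LBR walk for basic block execution counts (placeholder data).
import bisect

basic_block_starts = [0x400000, 0x400020, 0x400040, 0x400060, 0x400080]
block_exec_weight = {addr: 0.0 for addr in basic_block_starts}
MODULE_RANGE = (0x400000, 0x400100)

def blocks_between(target, branch_ip):
    """Basic blocks executed from a branch target up to the next taken branch."""
    lo = bisect.bisect_right(basic_block_starts, target) - 1
    hi = bisect.bisect_right(basic_block_starts, branch_ip) - 1
    return basic_block_starts[lo:hi + 1]

def process_sample(pebs_ip, lbr_entries):
    """lbr_entries is ordered oldest to newest as (source, target) pairs."""
    executed = []
    end_ip = pebs_ip                      # walk backwards from the PEBS address
    for source, target in reversed(lbr_entries):
        if not (MODULE_RANGE[0] <= source < MODULE_RANGE[1]):
            break                         # left the load module of interest
        executed.extend(blocks_between(target, end_ip))
        end_ip = source
    # Spread one execution over the trajectory: 1 / number_of_basic_blocks each.
    if executed:
        weight = 1.0 / len(executed)
        for block in executed:
            block_exec_weight[block] += weight

process_sample(0x400090, [(0x40003C, 0x400060), (0x40007A, 0x400080)])
print(block_exec_weight)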
As on Intel® Core™2 processors there is a precise instructions retired event that can be
used in a wide variety of ways. In addition there are precise events for uops_retired,
various SSE instruction classes and FP assists. It should be noted that the FP assist events
only detect x87 FP assists, not those involving SSE FP instructions. Detecting all assists
will be discussed in the section on the pipeline Front End.
The instructions retired event has a few special uses. While its distribution is not uniform,
the totals are correct. If the values recorded for all the instructions in a basic block are
averaged, a measure of the basic block execution count can be extracted. The ratios of
basic block executions can be used to estimate loop tripcounts when the counted loop
technique discussed above cannot be applied.
The PEBS version (general counter) of the instructions retired event can further be used to profile OS execution accurately even in the face of STI/CLI semantics, because the PEBS buffer retains the IP value of the OS code section where the overflow (+1) occurred. The interrupt then occurs after the critical section has completed, but the data was frozen correctly. If the cmask value is set to some very high value and the invert condition is applied, the result is always true, and the event will count core cycles (halted + unhalted). Consequently both cycles and instructions retired can be accurately profiled. The UOPS_RETIRED.ANY event, which is also precise, can likewise be used to profile Ring 0 execution and really gives a more accurate display of execution.
Table 8
Event Name | Description | Umask | Event | Cmask | Inv | Edge
INST_RETIRED.ANY_P | Instructions retired (general counter) | 01 | C0 | 0 | 0 | 0
INST_RETIRED.TOTAL_CYCLES | Total cycles | 01 | C0 | 10 | 1 | 0
INST_RETIRED.TOTAL_CYCLES_R0 | Total cycles in ring 0 | 01 | C0 | 10 | 1 | 0
INST_RETIRED.TOTAL_CYCLES_R3 | Total cycles in ring 3 | 01 | C0 | 10 | 1 | 0
INST_RETIRED.X87 | Retired floating-point operations | 02 | C0 | 0 | 0 | 0
INST_RETIRED.MMX | Retired MMX instructions | 04 | C0 | 0 | 0 | 0
UOPS_RETIRED.ACTIVE_CYCLES | Cycles Uops are being retired | 01 | C2 | 1 | 0 | 0
UOPS_RETIRED.ANY | Uops retired | 01 | C2 | 0 | 0 | 0
UOPS_RETIRED.STALL_CYCLES | Cycles Uops are not retiring | 01 | C2 | 1 | 1 | 0
UOPS_RETIRED.RETIRE_SLOTS | Retirement slots used | 02 | C2 | 0 | 0 | 0
UOPS_RETIRED.MACRO_FUSED | Macro-fused Uops retired | 04 | C2 | 0 | 0 | 0
SSEX_UOPS_RETIRED.PACKED_SINGLE | SIMD Packed-Single Uops retired | 01 | C7 | 0 | 0 | 0
SSEX_UOPS_RETIRED.SCALAR_SINGLE | SIMD Scalar-Single Uops retired | 02 | C7 | 0 | 0 | 0
SSEX_UOPS_RETIRED.PACKED_DOUBLE | SIMD Packed-Double Uops retired | 04 | C7 | 0 | 0 | 0
SSEX_UOPS_RETIRED.SCALAR_DOUBLE | SIMD Scalar-Double Uops retired | 08 | C7 | 0 | 0 | 0
SSEX_UOPS_RETIRED.VECTOR_INTEGER | SIMD Vector Integer Uops retired | 10 | C7 | 0 | 0 | 0
FP_ASSIST.ANY | X87 Floating point assists | 01 | F7 | 0 | 0 | 0
FP_ASSIST.OUTPUT | X87 FP assist on output values | 02 | F7 | 0 | 0 | 0
FP_ASSIST.INPUT | X87 FP assist on input values | 04 | F7 | 0 | 0 | 0
Table 9
Type | Bit Position | Description
Request Type | 0 | Demand Data Rd = DCU reads (includes partials, DCU Prefetch)
Request Type | 1 | Demand RFO = DCU RFOs
Request Type | 2 | Demand Ifetch = IFU Fetches
Request Type | 3 | Writeback = L2 CACHE_EVICT/DCUWB
Request Type | 4 | PF Data Rd = MPL Reads
Request Type | 5 | PF RFO = MPL RFOs
Request Type | 6 | PF Ifetch = MPL Fetches
Request Type | 7 | OTHER
Response Type | 8 | L3 CACHE_HIT_UNCORE_HIT
Response Type | 9 | L3 CACHE_HIT_OTHER_CORE_HIT_SNP
Response Type | 10 | L3 CACHE_HIT_OTHER_CORE_HITM
Response Type | 11 | L3 CACHE_MISS_REMOTE_HIT_SCRUB
Response Type | 12 | L3 CACHE_MISS_REMOTE_FWD
Response Type | 13 | L3 CACHE_MISS_REMOTE_DRAM
Response Type | 14 | L3 CACHE_MISS_LOCAL_DRAM
Response Type | 15 | IO_CSR_MMIO
The request type "other", selected by enabling bit 7, includes things like non-temporal SSE stores. The three L3 CACHE hit options correspond to an unshared line, a shared clean line, and a shared line that is modified in one of the other cores. The L3 CACHE miss remote options correspond to lines that must have the cores snooped (i.e. used by multiple cores) or clean lines that are used by only one core and can be safely forwarded directly from the remote L3 CACHE.
Due to the number of bits required to program the matrix selection, a dedicated MSR (1A6) is used.
However, because the MSR is global, only one version of the event can be collected during a given
collection period. The Intel Performance Tools (the VTune™ Performance Analyzer and the Intel
Performance Tuning Utility) enforce this constraint by requiring that the event be programmed
into counter 2.
In order to make data collection maximally efficient, a large set of predefined bit
combinations is included in the default event lists to minimize the number of data
collection runs needed. Multiple request type and response type bits can be set (there are
approximately 65K non-zero programmings of the event). The predefined combinations
used by the Intel Performance Tools are shown below.
The event names are constructed as OFFCORE_RESPONSE_0.REQUEST.RESPONSE,
where the REQUEST and RESPONSE strings correspond to unique programmings of the
lower 8 bits and the upper 8 bits of the 16-bit field, respectively. The *DEMAND*
events discussed in this document also include any requests made by the L1D cache
hardware prefetchers.
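As an illustration (a sketch, not the tools' implementation), the 16-bit request/response selection that is written to MSR 0x1A6 can be composed in software from the bit positions of Table 9; the macro names and the particular selection below are hypothetical.

```c
/* Sketch: compose an OFFCORE_RESPONSE_0 match value (MSR 0x1A6) from
 * request-type bits [7:0] and response-type bits [15:8], per Table 9.
 * The macros and the example selection are illustrative only. */
#include <stdint.h>
#include <stdio.h>

/* Request type bits (Table 9, bits 0-7) */
#define REQ_DEMAND_DATA_RD        (1u << 0)
#define REQ_DEMAND_RFO            (1u << 1)
#define REQ_DEMAND_IFETCH         (1u << 2)
#define REQ_OTHER                 (1u << 7)
/* Response type bits (Table 9, bits 8-15) */
#define RSP_LLC_HIT_UNCORE_HIT    (1u << 8)
#define RSP_LLC_MISS_REMOTE_DRAM  (1u << 13)
#define RSP_LLC_MISS_LOCAL_DRAM   (1u << 14)

int main(void) {
    /* e.g. a DEMAND_DATA.LOCAL_DRAM-style selection */
    uint16_t match = REQ_DEMAND_DATA_RD | RSP_LLC_MISS_LOCAL_DRAM;
    printf("MSR 0x1A6 match value: 0x%04x\n", match);
    return 0;
}
```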
Errata:
Non-temporal stores to locally homed cachelines might be expected to increment the
offcore_response_0 event when the MSR is set to a value of 0x4080
(other.local_dram). In fact these NT writes increment the event when the MSR is
programmed to 0x280 (other.llc_other_core_hit). This can make the analysis of total
traffic to dram a bit clumsy.
Table 10
Event Name Definition Umask Event
OFFCORE_REQUESTS_BUFFER_FULL Offcore request queue (SQ) full 01 B2
SQ_STALL Cycles SQ allocation stalled, SQ full 01 F6
But of course, none of the above includes the bandwidth associated with writebacks of
modified cacheable lines.
Table 11
Event Name Definition Umask Event
L2_RQSTS.LD_HIT Load requests that hit the L2 1 24
L2_RQSTS.LD_MISS Load requests that miss the L2 2
L2_RQSTS.LOADS All L2 load requests 3
L2_RQSTS.RFO_HIT Store RFO requests that hit the L2 4
L2_RQSTS.RFO_MISS Store RFO requests that miss the L2 8
L2_RQSTS.IFETCH_HIT Code requests that hit the L2 10
L2_RQSTS.IFETCH_MISS Code requests that miss the L2 20
L2_RQSTS.IFETCHES All L2 code requests 30
L2_RQSTS.PREFETCH_HIT Prefetch requests that hit the L2 40
L2_RQSTS.PREFETCH_MISS Prefetch requests that miss the L2 80
L2_RQSTS.RFOS All L2 store RFO requests 0C
L2_RQSTS.MISS All L2 misses AA
L2_RQSTS.PREFETCHES All L2 prefetches C0
Table 12
Event Name Definition Umask Event
L2_LINES_IN.S_STATE L2 lines allocated in the S state 2 F1
L2_LINES_IN.E_STATE L2 lines allocated in the E state 4
L2_LINES_IN.ANY Lines allocated in the L2 cache 7
L2_LINES_OUT.ANY All L2 lines evicted 0F F2
L2_LINES_OUT.DEMAND_CLEAN L2 lines evicted by a demand request 1
L2_LINES_OUT.DEMAND_DIRTY L2 modified lines evicted by a demand request 2
L2_LINES_OUT.PREFETCH_CLEAN L2 lines evicted by a prefetch request 4
L2_LINES_OUT.PREFETCH_DIRTY L2 modified lines evicted by a prefetch request 8
The event L2_TRANSACTIONS counts all interactions with the L2 CACHE and is
divided up as follows
Table 13
Event Name Definition Umask Event
L2_TRANSACTIONS.LOAD Load, SW prefetch and L1D prefetcher requests 1 F0
L2_TRANSACTIONS.RFO RFO requests from L1D 2
L2_TRANSACTIONS.IFETCH Cachelines requested from L1I 4
L2_TRANSACTIONS.PREFETCH L2 HW prefetches, includes L2 hits and misses 8
L2_TRANSACTIONS.L1D_WB Writebacks from L1D 10
L2_TRANSACTIONS.FILL Cachelines brought in from L3 CACHE 20
L2_TRANSACTIONS.WB Writebacks to the L3 CACHE 40
L2_TRANSACTIONS.ANY All actions taken by the L2 80
Table 14
Event Name Definition Umask Event
L2_WRITE.RFO.I_STATE L2 store RFOs in I state (misses) 1 27
L2_WRITE.RFO.S_STATE L2 store RFOs in S state 2
L2_WRITE.RFO.E_STATE L2 store RFOs in E state 4
L2_WRITE.RFO.M_STATE L2 store RFOs in M state 8
L2_WRITE.LOCK.I_STATE L2 lock RFOs in I state (misses) 10
L2_WRITE.LOCK.S_STATE L2 lock RFOs in S state 20
L2_WRITE.LOCK.E_STATE L2 lock RFOs in E state 40
L2_WRITE.LOCK.M_STATE L2 lock RFOs in M state 80
L2_WRITE.RFO.HIT All L2 store RFOs that hit the cache 0E
L2_WRITE.RFO.MESI All L2 store RFOs 0F
L2_WRITE.LOCK.HIT All L2 lock RFOs that hit the cache E0
L2_WRITE.LOCK.MESI All L2 lock RFOs F0
The next largest set of penalties is associated with the TLBs and with accessing more physical
pages than can be mapped with their finite number of entries. A miss in the first level
TLBs results in a very small penalty that can usually be hidden by the OOO execution
and the compiler's scheduling. A miss in the shared TLB results in the Page Walker being
invoked, and this penalty can be noticeable in the execution.
The (non pebs) TLB miss events break down into three sets: DTLB misses, Load DTLB
misses and ITLB misses. Store DTLB misses can be evaluated from the difference of the
DTLB misses and the Load DTLB misses. Each then has a set of sub events programmed
with the umask value. A summary of the non PEBS TLB miss events is in the table
below.
Table 15
The L1 data cache, L1D, is the final component to be discussed. These events can only be
counted with the first 2 of the 4 general counters. Most of the events are self explanatory.
The total number of references to the L1D can be counted with L1D_ALL_REF, either
just cacheable references or all. The cacheable references can be divided into loads and
stores with the L1D_CACHE_LD and L1D_CACHE_ST events. These events are further
subdivided by MESI state through their umask values, with the I-state references
indicating the cache misses.
The evictions of modified lines in the L1D result in writebacks to the L2 CACHE. These
are counted with the L1D_WB_L2 events. The umask values break these down by the
MESI state of the version of the line in the L2 CACHE.
Locked references can also be counted with the L1D_CACHE_LOCK events. Again
these are broken down by the M, E and S states of the lines in the L1D.
The total number of lines brought into L1D, the number that arrived in an M state and the
number of modified lines that get evicted due to receiving a snoop are counted with the
L1D event and its umask variations.
NOTE: many of these events are known to overcount (l1d_cache_ld, l1d_cache_lock) so
they can only be used for qualitative analysis.
These events and a few others are summarized below.
Table 16
Event Name Definition Umask Event
L1D_WB_L2.I_STATE L1 writebacks to L2 in I state (misses) 01 28
L1D_WB_L2.S_STATE L1 writebacks to L2 in S state 02 28
L1D_WB_L2.E_STATE L1 writebacks to L2 in E state 04 28
L1D_WB_L2.M_STATE L1 writebacks to L2 in M state 08 28
L1D_WB_L2.MESI All L1 writebacks to L2 0F 28
L1D_CACHE_LD.I_STATE L1 data cache read in I state (misses) 01 40
L1D_CACHE_LD.S_STATE L1 data cache read in S state 02 40
L1D_CACHE_LD.E_STATE L1 data cache read in E state 04 40
Store Forwarding
There are few cases of loads not being able to forward from active store buffers in Intel®
Core™ i7 processors. The predominant remaining case has to do with larger loads
overlapping smaller stores. There is no event that detects when this occurs. There is also
a “false store forwarding” case where the addresses only match in the lower 12 address
bits. This is sometimes referred to as 4K aliasing. It can be detected with the following
event:
Table 17
Event Name              Description                                           Event Code  Umask
PARTIAL_ADDRESS_ALIAS   False dependencies due to partial address aliasing    7           1
Branch Mispredictions
As discussed earlier there is good coverage of branch mispredictions with precise events.
These are enhanced by use of the LBR to identify the branch location to go along with
the target location captured in the PEBS buffer. It is not clear to the author that there is
more information to be garnered from the front end events that can be used by code
developers, but they are certainly of use to chip designers and architects. These events are
listed below.
Table 18
Event Name Description Event Code Umask
BACLEAR.CLEAR BAclears asserted, regardless of cause E6 1
BACLEAR.BAD_TARGET BACLEAR asserted with bad target address E6 2
BPU_CLEARS.EARLY Early Branch Prediction Unit clears E8 1
BPU_CLEARS.LATE Late Branch Prediction Unit clears E8 2
BPU_MISSED_CALL_RET Branch prediction unit missed call or return E5 1
BR_INST_DECODED Branch instructions decoded E0 1
BR_INST_EXEC.COND Conditional branch instructions executed 88 1
BR_INST_EXEC.DIRECT Unconditional branches executed 88 2
BR_INST_EXEC.INDIRECT_NON_CALL Indirect non call branches executed 88 4
BR_INST_EXEC.RETURN_NEAR Indirect return branches executed 88 8
BR_INST_EXEC.DIRECT_NEAR_CALL Unconditional call branches executed 88 10
BR_INST_EXEC.INDIRECT_NEAR_CALL Indirect call branches executed 88 20
BR_INST_EXEC.TAKEN Taken branches executed 88 40
BR_INST_EXEC.ANY Branch instructions executed 88 7F
BR_INST_EXEC.NON_CALLS All non call branches executed 88 3
BR_INST_EXEC.NEAR_CALLS Call branches executed 88 30
BR_MISP_EXEC.COND Mispredicted conditional branches executed 89 1
BR_MISP_EXEC.DIRECT Mispredicted unconditional branches executed 89 2
BR_MISP_EXEC.INDIRECT_NON_CALL Mispredicted indirect non call branches executed 89 4
BR_MISP_EXEC.RETURN_NEAR Mispredicted return branches executed 89 8
BR_MISP_EXEC.DIRECT_NEAR_CALL Mispredicted unconditional call branches executed 89 10
BR_MISP_EXEC.INDIRECT_NEAR_CALL Mispredicted indirect call branches executed 89 20
BR_MISP_EXEC.TAKEN Mispredicted taken branches executed 89 40
BR_MISP_EXEC.ANY Mispredicted branches executed 89 7F
BR_MISP_EXEC.NON_CALLS Mispredicted non call branches executed 89 3
BR_MISP_EXEC.NEAR_CALLS Mispredicted call branches executed 89 30
Table 19
Event Name Description Event Code Umask
L1I.HITS L1I instruction fetch hits 80 1
L1I.MISSES L1I instruction fetch misses 80 2
L1I.CYCLES_STALLED L1I instruction fetch stall cycles 80 4
L1I.READS L1I Instruction fetches 80 3
IFU_IVC.FULL victim cache full 81 1
IFU_IVC.L1I_EVICTION L1I eviction 81 2
ITLB_FLUSH ITLB flushes AE 1
ITLB_MISSES.ANY ITLB miss 85 1
ITLB_MISSES.WALK_COMPLETED ITLB miss page walks 85 2
ITLB_MISSES.LARGE_WALK_COMPLETED ITLB miss large page walks 85 80
LARGE_TLB.HIT large TLB hit 82 1
Table 21
Event Name               Description                            Event Code  Umask
UOPS_DECODED.MS          Uops decoded by Microcode Sequencer    D1          2
MACHINE_CLEARS.CYCLES    Cycles Machine Clears Asserted         C3          1
Table 22
A latency can be measured by the average duration of the queue occupancy, provided the
occupancy stops as soon as the data has been delivered. Thus the ratio
UNC_GQ_TRACKER_OCCUP.X/UNC_GQ_ALLOC.X measures an average duration
of queue occupancy. The total occupancy period measured by

Total Read Period = UNC_GQ_TRACKER_OCCUP.RT/UNC_GQ_ALLOC.RT

is longer than the data delivery latency because it includes time for extra General Queue
bookkeeping and cleanup. Similarly,

L3 CACHE response latency = UNC_GQ_TRACKER_OCCUP.RT_TO_LLC_RESP/UNC_GQ_ALLOC.RT_TO_LLC_RESP

is essentially a constant. It does not include the total time to snoop and retrieve
a modified line from another core, for example, just the time to scan the L3 CACHE and
see whether the line is present on the socket.
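A minimal numeric sketch of turning an occupancy count and an allocation count into an average duration (the counter values below are made up for illustration):

```c
/* Sketch: average queue occupancy duration, following
 * UNC_GQ_TRACKER_OCCUP.X / UNC_GQ_ALLOC.X. Values are placeholders. */
#include <stdio.h>

int main(void) {
    unsigned long long occupancy_cycles = 48000000ULL; /* UNC_GQ_TRACKER_OCCUP.RT */
    unsigned long long allocations      =   300000ULL; /* UNC_GQ_ALLOC.RT */

    double avg_read_period = (double)occupancy_cycles / (double)allocations;
    printf("average read tracker occupancy: %.1f uncore cycles per entry\n",
           avg_read_period);
    return 0;
}
```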
Table 15
OFFCORE_RESPONSE_0.DEMAND_DATA.LLC_HIT_NO_OTHER_CORE
OFFCORE_RESPONSE_0.DEMAND_DATA.LLC_HIT_OTHER_CORE_HIT
OFFCORE_RESPONSE_0.DEMAND_DATA.LLC_HIT_OTHER_CORE_HITM
OFFCORE_RESPONSE_0.DEMAND_DATA.LOCAL_CACHE
The *LOCAL_CACHE event should be used as the denominator. The individual
latencies would have to be measured with microbenchmarks, but using the precise
latency event is far more effective, as any bandwidth loading effects will be
included.
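A hedged sketch (not from the original text) of how such a weighted L3-hit latency might be combined, using the hit counts of the events listed above as weights; the counts and per-source latencies are placeholders, not measured values.

```c
/* Sketch: weighted-average L3 CACHE hit latency from per-data-source hit
 * counts (OFFCORE_RESPONSE_0.DEMAND_DATA.* events). All numbers below are
 * illustrative placeholders. */
#include <stdio.h>

int main(void) {
    /* hit counts; LOCAL_CACHE is the sum of the three components */
    double no_other_core = 700000.0;  /* LLC_HIT_NO_OTHER_CORE */
    double other_hit     = 200000.0;  /* LLC_HIT_OTHER_CORE_HIT */
    double other_hitm    = 100000.0;  /* LLC_HIT_OTHER_CORE_HITM */
    double local_cache   = no_other_core + other_hit + other_hitm;

    /* assumed per-source load-to-use latencies, in core cycles */
    double lat_no_other = 40.0, lat_hit = 65.0, lat_hitm = 75.0;

    double weighted = (no_other_core * lat_no_other +
                       other_hit     * lat_hit +
                       other_hitm    * lat_hitm) / local_cache;
    printf("weighted L3 hit latency: %.1f cycles\n", weighted);
    return 0;
}
```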
The L3 CACHE miss component is the weighted average over the latencies of hits in a
cache on another socket, with the multiple latencies as in the L3 CACHE hit just
discussed. It also includes in the weighted average the latencies to local and remote dram.
The local dram access and the remote socket access can be decomposed with more
uncore events. This will be discussed a bit later in this paper.
The *RTID* events allow the monitoring of a sub component of the Miss to fill latency
associated with the communications between the GQ and the QHL.
The write and peer probe latencies are the L3 CACHE response time + any other time
required. This can also include a fraction due to L3 CACHE misses and retrievals from
dram. The breakdowns cannot be done in the way discussed above as the extra required
events do not exist.
There are events which monitor if the three trackers are not empty (>= 1 entry) or full.
Table 23
The technique of dividing the latencies by the average queue occupancy in order to
determine a penalty does not work for the uncore. Overlapping entries from different
cores do not result in overlapping penalties and thus a reduction in stalled cycles. Each
core suffers the full latency independently. To evaluate the correction on a per core basis
one needs the number of cycles during which there is an entry from the core in question.
A *NOT_EMPTY_CORE_N type event would be needed, but no such event exists.
Consequently, in the cycle decomposition one must use the full latency for the estimate
of the penalty. As has been stated before it is best to use the PEBS latency event as the
data sources are also collected with the latency for the individual sample.
The individual components of the read tracker, discussed above, can also be monitored as
busy or full by setting the cmask value to 1 or 32 and applying it to the assorted RT
occupancy events.
Table 24
The GQ data buffer traffic controls the flow of data through the uncore. Diagrammatically
it can be shown as follows:
[Figure: GQ data buffer flow in the uncore, connecting Core 0/2 and Core 1/3, the LLC, the QPI link and the IMC through 16-byte data paths.]
The input and output flows can be monitored with the following events. They measure
the cycles during which the ports they monitor are busy. Most of the ports transfer a fixed
number of bits per cycle; however, the Intel® QuickPath Interconnect protocols can
result in either 8 or 16 bytes being transferred on the read Intel QPI and IMC ports.
Consequently these events cannot be used to measure total data transfers and bandwidths.
Table 25
Event Name                   Definition                                                        Umask  Event
UNC_GQ_DATA.FROM_QPI         Cycles GQ data is imported from the Quickpath interconnect        01     04
UNC_GQ_DATA.FROM_IMC         Cycles GQ data is imported from the integrated memory interface   02
UNC_GQ_DATA.FROM_LLC         Cycles GQ data is imported from the L3 CACHE                      04
UNC_GQ_DATA.FROM_CORES_02    Cycles GQ data is imported from Cores 0 and 2                     08
UNC_GQ_DATA.FROM_CORES_13    Cycles GQ data is imported from Cores 1 and 3                     10
UNC_GQ_DATA.TO_QPI_IMC       Cycles GQ data sent to the QPI or IMC                             01     05
UNC_GQ_DATA.TO_LLC           Cycles GQ data sent to the L3 CACHE                               02
UNC_GQ_DATA.TO_CORES         Cycles GQ data sent to the cores                                  04
The GQ handles the snoop responses for the cacheline requests that come in from the
Intel® QuickPath Interconnect. These correspond to the queue entries in the peer probe
tracker.
They are divided into requests for locally homed data and remotely homed data. If the
line is in a modified state and the GQ is responding to a read request, the line must also
be written back to memory. This would be wasted effort for a response to an RFO,
as the line will just be modified again, so no writeback is done for RFOs.
The local home events:
Table 26
Table 27
Event Name                                   Definition                                                           Umask  Event
UNC_SNP_RESP_TO_REMOTE_HOME.I_STATE          Remote home snoop response: L3 CACHE does not have the cache line    01     07
UNC_SNP_RESP_TO_REMOTE_HOME.S_STATE          L3 CACHE has the cache line in the S state                           02
UNC_SNP_RESP_TO_REMOTE_HOME.FWD_S_STATE      L3 CACHE line in E state, changed to S state and forwarded           04
UNC_SNP_RESP_TO_REMOTE_HOME.FWD_I_STATE      L3 CACHE has forwarded a modified cache line in response to an RFO   08
UNC_SNP_RESP_TO_REMOTE_HOME.CONFLICT         Remote home conflict snoop response                                  10
UNC_SNP_RESP_TO_REMOTE_HOME.WB               L3 CACHE has the cache line in the M state, responding to a read     20
UNC_SNP_RESP_TO_REMOTE_HOME.HITM             L3 CACHE HITM (WB, FWD_S_STATE)                                      24
UNC_SNP_RESP_TO_REMOTE_HOME.HIT              L3 CACHE HIT (S, FWD_I_STATE, conflict)                              1A
Some related events count the MESI transitions in response to snoops from other caching
agents (processors or IOH). Some of these rely on an MSR so they can only be measured
one at a time, as there is only one MSR. The Intel performance tools will schedule this
correctly by restricting these events to a single general uncore counter.
Table 28
Event Name                   Definition                             Umask  Event  MSR  MSR value
UNC_GQ_SNOOP.GOTO_S          change cache line to S state           01     0C     0    0
UNC_GQ_SNOOP.GOTO_I          change cache line to I state           02            0    0
UNC_GQ_SNOOP.GOTO_S_HIT_M    change cache line from M to S state    04            301  1
UNC_GQ_SNOOP.GOTO_S_HIT_E    change cache line from E to S state    04            301  2
UNC_GQ_SNOOP.GOTO_S_HIT_S    change cache line from S to S state    04            301  4
UNC_GQ_SNOOP.GOTO_S_HIT_F    change cache line from F to S state    04            301  8
UNC_GQ_SNOOP.GOTO_I_HIT_M    change cache line from M to I state    08            301  10
UNC_GQ_SNOOP.GOTO_I_HIT_E    change cache line from E to I state    08            301  20
UNC_GQ_SNOOP.GOTO_I_HIT_S    change cache line from S to I state    08            301  40
UNC_GQ_SNOOP.GOTO_I_HIT_F    change cache line from F to I state    08            301  80
L3 CACHE Events
The number of hits and misses can be determined from the GQ tracker allocation events,
but it is simpler with the following list:
Table 29
The MESI breakdown of lines allocated and victimized can also be monitored with
LINES_IN, LINES_OUT:
Table 30
OFFCORE_RESPONSE_0.DATA_IFETCH.LOCAL_DRAM
Table 31
Table 32
The bandwidths due to normal application dram access can be evaluated as follows:

Read Bandwidth (ch0) = 64 * UNC_IMC_NORMAL_READS.CH0 * Frequency / Cycles

where any of the cycle events (core or uncore) can be used as long as the corresponding
frequency is used as well. Similarly the write bandwidth can be evaluated (ignoring partial
writes) as:

Write Bandwidth (ch0) = 64 * UNC_IMC_WRITES.FULL.CH0 * Frequency / Cycles
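The same arithmetic in code, as a small sketch (the counter values and the uncore frequency below are placeholders, not measurements):

```c
/* Sketch: DRAM read/write bandwidth for channel 0 from IMC event counts,
 * following BW = 64 * count * Frequency / Cycles. Values are illustrative. */
#include <stdio.h>

int main(void) {
    double normal_reads_ch0 = 150e6;   /* UNC_IMC_NORMAL_READS.CH0 */
    double full_writes_ch0  =  40e6;   /* UNC_IMC_WRITES.FULL.CH0 */
    double cycles           = 2.66e9;  /* uncore cycle count over the interval */
    double frequency        = 2.66e9;  /* matching uncore frequency, in Hz */

    double read_bw  = 64.0 * normal_reads_ch0 * frequency / cycles;
    double write_bw = 64.0 * full_writes_ch0  * frequency / cycles;
    printf("ch0 read  bandwidth: %.2f GB/s\n", read_bw  / 1e9);
    printf("ch0 write bandwidth: %.2f GB/s\n", write_bw / 1e9);
    return 0;
}
```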
The penalty can be evaluated using the cycles that there are entries in the queues as usual.
Similarly there are events for counting the cycles during which the associated queues
were full.
Table 33
As the dram control is on the processor, the dram paging policy statistics can also be
collected. Some of the events related to this are listed below.
Table 34
Event Name                      Definition                            Umask  Event
UNC_DRAM_OPEN.CH0 DRAM Channel 0 open commands 01 60
UNC_DRAM_OPEN.CH1 DRAM Channel 1 open commands 02 60
UNC_DRAM_OPEN.CH2 DRAM Channel 2 open commands 04 60
UNC_DRAM_PAGE_CLOSE.CH0 DRAM Channel 0 page close 01 61
There are also queues to monitor the high priority accesses like those associated with
device drivers for video and network adapters.
Table 35
Event Name Definition Umask Event
UNC_IMC_ISOC_OCCUPANCY.CH0 IMC channel 0 ISOC read request occupancy 01 2B
UNC_IMC_ISOC_OCCUPANCY.CH1 IMC channel 1 ISOC read request occupancy 02 2B
UNC_IMC_ISOC_OCCUPANCY.CH2 IMC channel 2 ISOC read request occupancy 04 2B
UNC_IMC_ISOC_OCCUPANCY.ANY IMC ISOC read request occupancy 07 2B
UNC_IMC_HIGH_PRIORITY_READS.CH0 IMC channel 0 high priority read requests 01 2D
UNC_IMC_HIGH_PRIORITY_READS.CH1 IMC channel 1 high priority read requests 02 2D
UNC_IMC_HIGH_PRIORITY_READS.CH2 IMC channel 2 high priority read requests 04 2D
UNC_IMC_HIGH_PRIORITY_READS.ANY IMC high priority read requests 07 2D
UNC_IMC_CRITICAL_PRIORITY_READS.CH0 IMC channel 0 critical priority read requests 01 2E
UNC_IMC_CRITICAL_PRIORITY_READS.CH1 IMC channel 1 critical priority read requests 02 2E
UNC_IMC_CRITICAL_PRIORITY_READS.CH2 IMC channel 2 critical priority read requests 04 2E
UNC_IMC_CRITICAL_PRIORITY_READS.ANY IMC critical priority read requests 07 2E
Table 36
Event Name                        Definition                                         Umask  Event
UNC_DRAM_READ_CAS.CH0             DRAM Channel 0 read CAS commands                   01     63
UNC_DRAM_READ_CAS.AUTOPRE_CH0     DRAM Channel 0 read CAS auto page close commands   02     63
UNC_DRAM_READ_CAS.CH1             DRAM Channel 1 read CAS commands                   04     63
UNC_DRAM_READ_CAS.AUTOPRE_CH1     DRAM Channel 1 read CAS auto page close commands   08     63
UNC_DRAM_READ_CAS.CH2             DRAM Channel 2 read CAS commands                   10     63
Table 37
There is no reasonable way to predefine any address matching, of course, but several
opcodes identify writebacks and forwards from the caches; these are certainly useful,
as they identify (un)modified lines that were forwarded from the remote socket or to the
remote socket. Not all of the predefined entries currently make sense.
Table 38
These opcode uses can be seen in the dual socket QPI communications diagrams
below. The predefined opcode match encodings can be used to monitor HITM
accesses in particular, and they serve as the only events that allow profiling the requesting
code on the basis of the HITM transfers its requests generate.
[Figures: dual-socket Intel® QuickPath Interconnect protocol exchange diagrams (Socket 1 and Socket 2, each showing the cores, GQ, LLC, QHL and IMC). The panels illustrate the message flows for: a RdData request after an LLC miss to the local home with a clean snoop response; RdData requests after an LLC miss to the remote home with clean, HITM (RspIWb/WbIData) and local-home HITM responses; a RdData request where the remote LLC forwards the line in the F state (DataC_F/RspFwdS); and RdInvOwn (RFO) requests after an LLC miss with a clean response (RspI, DataC_E_cmp from the QHL), a HITM response (DataC_M/RspFwdI) and an E-state hit (DataC_E/RspFwdI).]
The diagrams show a series of QPI protocol exchanges associated with Data Reads and
Reads for Ownership (RFO), after an L3 CACHE miss, under a variety of combinations
of the local home of the cacheline and the MESI state in the remote cache. Of particular
note are the cases where the data comes from the remote QHL even when the data was in
the remote L3 CACHE. These are the data reads where the remote L3 CACHE has the
line in the M state. Whether the line is locally or remotely “homed”, it has to be written
back to dram before the originating GQ receives the line, so it always appears to come
from a QHL. The RFO does not do this. However, when responding to a remote RFO
(SnpInvOwn) and the line is in an S or F state, the cacheline gets invalidated and the line
is sent from the QHL.
The point is that the data source might not always be so obvious.
Table 39
Event                              Description
UNC_IMC_WRITES.FULL.ANY            All writes of full cachelines (cached and uncached)
UNC_IMC_WRITES.FULL.CH0            Writes of full lines to channel 0
UNC_IMC_WRITES.FULL.CH1            Writes of full lines to channel 1
UNC_IMC_WRITES.FULL.CH2            Writes of full lines to channel 2
UNC_QHL_REQUESTS.LOCAL_WRITES      Writes of modified cached lines from local cores
UNC_QHL_REQUESTS.[...]             Writes of modified cached lines AND uncached lines from [...]
Conclusion:
Intel® Core™ i7 Processors and Intel® Xeon™ 5500
Processors open a new class of performance analysis
capabilities
Appendix 1
Profiles
Basic Intel® PTU Profiles
General Exploration
A six event set that can be captured in a single run. The following events are included:
Cpu_clk_unhalted.core
Inst_retired.any
Br_inst_retired.all_branches
Mem_inst_retired.latency_above_threshold_32
Mem_load_retired.llc_miss
Uops_executed.core_stall_cycles
Thus this profile gives cycle usage and stalled cycles for the core (i.e. it works best with HT
disabled). Instructions retired can be used for basic block execution counts, particularly in
conjunction with the precise branch retired event.
Using the latency event with a 32 cycle threshold measures the distribution of offcore
accesses through the data source encoding captured with the event. As the latency events
capture data sources, latencies and linear addresses, this profile can also yield a
breakdown of data sources for loads that miss the core's data caches and the latencies that
result. The event randomly samples loads, and the sampling fraction depends on
the application. The fraction can be measured by normalizing the sum of the L3 CACHE miss
data sources with the Mem_uncore_retired.llc_miss event, which counts them all.
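A small sketch of that normalization (the counter totals below are hypothetical):

```c
/* Sketch: estimate the latency event's sampling fraction by comparing its
 * L3-miss data-source samples against MEM_UNCORE_RETIRED.LLC_MISS.
 * All numbers are placeholders. */
#include <stdio.h>

int main(void) {
    double latency_llc_miss_samples = 25000.0;    /* latency-event samples with an
                                                      L3-miss data source, scaled
                                                      by their sample-after value */
    double llc_miss_total           = 40000000.0; /* MEM_UNCORE_RETIRED.LLC_MISS */

    double sampling_fraction = latency_llc_miss_samples / llc_miss_total;
    printf("latency event sampled %.4f%% of L3-missing loads\n",
           100.0 * sampling_fraction);
    return 0;
}
```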
Branch Analysis
This profile is designed for detailed branch analysis. The 4 events:
br_inst_retired.all_branches
br_inst_retired.near_call:LBR=user_calls
cpu_clk_unhalted.thread
inst_retired.any
allow basic execution analysis and a variety of detailed loop and call analyses. The call
counts per source can be extracted as the LBRs are captured. If the call counts are low
you may need to make a copy of the profile and decrease the SAV value. As the LBRs
are captured this will take 2 runs and cannot be multiplexed. As the registers are
captured, on Intel(r) Core(tm) i7 systems running in Intel(r) 64 enabled mode, the integer
arguments of functions can be extracted from the register values displayed in the asm
display, for functions with limited numbers of arguments. Further the register values can
be used to get average tripcount values for counter loops where an induction variable is
compared to a tripcount, when using the all_branches event. The SAV value for the call
retired event must be tuned to the application as this can vary by orders of magnitude
between applications.
Memory Access
A set of 13 events that require 3 runs, or three groups if event multiplexing is enabled.
These events were selected to give a reasonably complete breakdown of cacheline traffic
due to loads and some overview of total offcore traffic. As cycles and instructions retired
have dedicated counters, they are also included. The additional events are
Mem_inst_retired.loads
Mem_inst_retired.stores
Mem_inst_retired.latency_above_threshold_32
Mem_inst_retired.latency_above_threshold_128
Mem_load_retired.llc_miss
Mem_load_retired.llc_unshared_hit
Mem_load_retired.other_core_l2_hit_hitm
Mem_uncore_retired.local_dram
Mem_uncore_retired.remote_dram
Offcore_response_0.data_in.local_dram
Offcore_response_0.data_in.remote_dram
The use of the offcore_response_0.any_request.local_dram/remote dram events was
selected because non temporal stores to local dram are miscounted by the
“other_core_hit_hitm” data source. This does not happen for the remote dram.
Using two latency thresholds allows simultaneous monitoring of L3 CACHE hits and L3
CACHE misses with reasonable statistical accuracy. As the latency events capture data
sources, latencies and linear addresses, it was decided that full data profiling with all the
events would not be needed. A copy that includes this option is also provided.
FE Investigation
A list of 14 events, thus collected in 3 runs, that yields a reasonably complete breakdown
of instruction delivery related performance issues.
Br_inst_exec.any
Br_misp_exec.any
Cpu_clk_unhalted.core
Inst_retired.any
Ild_stall.any
Ild_stall.lcp
Itlb_miss.retired
L1I.cycles_stalled
L1I.misses
Rat_stalls.flags
Rat_stalls.registers
Rat_stalls.rob_read_port
Resource_stalls.any
Uops_issued.stall_cycles
The difference uops_issued.stall_cycles - resource_stalls.any yields the instruction
starvation cycles when the machine is booted with HT disabled. This can be used as an
overall guide for identifying a uop delivery problem. The main causes of such issues are
usually branch mispredictions causing incorrect instruction “prefetching”, uop decoding
and resource allocation bandwidth issues, and excessively large active binaries. The
selected events should assist in the identification of these issues.
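A minimal sketch of that subtraction (the counter totals are hypothetical):

```c
/* Sketch: front-end (instruction starvation) cycles estimated as
 * UOPS_ISSUED.STALL_CYCLES - RESOURCE_STALLS.ANY, valid with HT disabled.
 * Counter values below are placeholders. */
#include <stdio.h>

int main(void) {
    double uops_issued_stall_cycles = 1.8e9; /* UOPS_ISSUED.STALL_CYCLES */
    double resource_stalls_any      = 1.1e9; /* RESOURCE_STALLS.ANY */
    double total_cycles             = 5.0e9; /* CPU_CLK_UNHALTED.CORE */

    double starvation = uops_issued_stall_cycles - resource_stalls_any;
    printf("instruction starvation: %.2f%% of cycles\n",
           100.0 * starvation / total_cycles);
    return 0;
}
```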
Working Set
A single run data collection gathering PEBS data on all loads and stores retired. In
addition, cycles, inst_retired.any, and conditional branches executed are also collected.
The Sample After Values for the load and store instructions are lowered from the default
values by a factor of 100. This will result in a severe performance distortion and an
enormous amount of data being collected. This is needed to accurately sample the address
space of a real application. The data profiling is enabled and the Intel® PTU Data Access
Analysis package can be used to get an idea of the working set size of the program in the
utility histogram pane. Address profiles can also be nicely extracted with this profile. An
application that normally runs in one minute on a single core will produce approximately
2GBs of data, so it is wise to use the SAV multipliers if the run time is longer than a
couple minutes.
Cpu_clk_unhalted.thread
DTLB_misses.any
Inst_retired.any
Load_hit_pre
Mem_inst_retired.Latency_above_threshold_32
Mem_load_retired.L2_hit
Mem_load_retired.llc_miss
Mem_load_retired.llc_unshared_hit
Mem_load_retired.other_core_l2_hit_hitm
Mem_uncore_retired.local_dram
Mem_uncore_retired.other_core_l2_hitm
Mem_uncore_retired.remote_dram
Offcore_response_0.data_in.any_dram
Offcore_response_0.data_in.local_dram
Offcore_response_0.data_in.remote_dram
Sq_full_stall_cycles
Rat_stalls.any
Rat_stalls.rob_read_port
Resource_stalls.load
Resource_stalls.ROB_full
Resource_stalls.RS_full
Resource_stalls.store
Uops_executed.core_stall_cycles
Uops_issued.any
Uops_issued.stall_cycles
Uops_retired.any
Uops_retired.stall_cycles
This list of events will allow computation of many of the more relevant predefined ratios
for loop execution. These include the cycles lost to assorted load latencies, stalls at
execution, FE stalls, stalls at retirement, wasted work, branch misprediction rates, basic
block execution counts, function call counts, a few specific loop related FE and saturation
effects and input bandwidth. The precise events will be collected with the full PEBS
buffer enabling data address profiling.
This profile was added in Intel® PTU with and without call site collection to allow
multiplexing. However, currently multiplexing in Intel® PTU will not work with this
many events. Further, Intel® PTU 3.2 multiplexing may crash some OSes when HT is
enabled.
Br_misp_exec.any
Cache_lock_cycles.l1d
Cpu_clk_unhalted.thread
DTLB_misses.any
Fp_mmx_trans.any
Ild_stall.any
ild_stalls.iq_full
Ild_stall.lcp
ild_stalls.mru
ild_stalls.regen
Inst_retired.any
Itlb_miss_retired
L1i.cycles_stalled
L1I.misses
Load_hit_pre
Machine_clears.cycles
Mem_inst_retired.Latency_above_threshold_32
Mem_inst_retired.loads
mem_inst_retired.stores
Mem_load_retired.hit_lfb
mem_load_retired.L1d_hit
Mem_load_retired.L2_hit
Mem_load_retired.llc_miss
Mem_load_retired.llc_unshared_hit
Mem_load_retired.other_core_l2_hit_hitm
Mem_uncore_retired.local_dram
Mem_uncore_retired.other_core_l2_hitm
Misalign_mem_ref.load
Misalign_mem_ref.store
Offcore_request.uncached_mem
Offcore_response_0.data_in.any_dram
Partial_address_alias
Sq_full_stall_cycles
Rat_stalls.any
Rat_stalls.flags
Rat_stalls.registers
Rat_stalls.rob_read_port
Resource_stalls.any
Resource_stalls.load
Resource_stalls.ROB_full
Resource_stalls.RS_full
Resource_stalls.store
Uops_executed.core_stall_cycles
Uops_issued.any
Uops_issued.stall_cycles
Uops_retired.any
Uops_retired.stall_cycles
Uops_issued.core_stall_cycles
This list of events will allow computation of many of the more relevant predefined ratios
of interest in client application execution. These include the cycles lost to assorted load
latencies, stalls at execution, FE stalls and most of the causes of FE stalls, stalls at
retirement, wasted work, branch misprediction rates, basic block execution counts,
function call counts, a few specific low penalty issues seen in client applications,
saturation effects and input bandwidth. With HT disabled, FE stalls can be evaluated with
uops_issued.stall_cycles - resource_stalls.any; with HT enabled use
uops_issued.core_stall_cycles - resource_stalls.any. The precise events will be collected
with the full PEBS buffer, enabling data address profiling.
This profile was added in Intel® PTU with and without call site collection to allow
multiplexing. However, currently multiplexing in Intel® PTU will not work with this
many events. Further, Intel® PTU 3.2 multiplexing may crash some OSes when HT is
enabled.
[Figure: layout of the core PMU event select register. The low 32 bits contain, from bit 0 upward: EVTSEL (event code, bits 7:0), EVTMSK (umask, bits 15:8), USR (bit 16), OS (bit 17), E (edge detect, bit 18), a reserved bit, INT (bit 20), AnyThr (bit 21), EN (bit 22), INV (bit 23) and CMASK (bits 31:24). Bits 63:32 are reserved.]
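As an illustration (a sketch, not the tools' actual implementation), the fields can be packed into an event select value in software; the bit positions follow the register layout above, and the example uses INST_RETIRED.TOTAL_CYCLES from Table 8 (event C0, umask 01, cmask 0x10, INV set).

```c
/* Sketch: build a core PMU event-select value from the fields shown in the
 * register layout above. The helper and example values are illustrative. */
#include <stdint.h>
#include <stdio.h>

static uint64_t evtsel(uint8_t event, uint8_t umask, int usr, int os,
                       int edge, int intr, int anythr, int en, int inv,
                       uint8_t cmask) {
    return (uint64_t)event |
           ((uint64_t)umask  << 8)  |
           ((uint64_t)usr    << 16) |
           ((uint64_t)os     << 17) |
           ((uint64_t)edge   << 18) |
           ((uint64_t)intr   << 20) |
           ((uint64_t)anythr << 21) |
           ((uint64_t)en     << 22) |
           ((uint64_t)inv    << 23) |
           ((uint64_t)cmask  << 24);
}

int main(void) {
    /* cmask = 0x10 with INV set: the comparison is always true, so the
     * counter counts cycles, as described earlier in this guide. */
    uint64_t total_cycles = evtsel(0xC0, 0x01, 1, 1, 0, 0, 0, 1, 1, 0x10);
    printf("event select value for INST_RETIRED.TOTAL_CYCLES: 0x%llx\n",
           (unsigned long long)total_cycles);
    return 0;
}
```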