Trace Event Programmers Guide
Vance Morrison
1. The Event Session (represents the entity controlling the logging). The
session has the ability to tell the providers of events to start and stop logging
and to control how verbose the logging is. It also has the ability to route the
data to various places: it can indicate that the data should be written directly
to a file (for maximum efficiency) or sent to the session itself (for on-the-fly
processing).
2. The Event Provider is the part of the logging system that is wired into the
application to be monitored. Its job is to call a logging API when interesting
things happen in the application.
3. The Event Consumer takes the data from the file or from the session and
consumes it in some way, typically generating aggregate statistics and
generating alerts.
[Diagram: an Event Provider (EventSource) emits event data under the control of commands from an Event Session (TraceEventSession); the data flows either to an Event File or directly to a TraceEventSource that feeds an Event Consumer, and a manifest (schema) describes the provider's events.]
Corresponding to each of these players, the TraceEvent library has a class that
supports that role.
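To make this correspondence concrete, here is a rough orientation sketch (not from the original example) of the classes in use, assuming the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package; the session name, file name, and provider name are placeholders:

    using Microsoft.Diagnostics.Tracing;          // ETWTraceEventSource, TraceEvent
    using Microsoft.Diagnostics.Tracing.Session;  // TraceEventSession

    class SessionSketch
    {
        static void Main()
        {
            // The Event Session: tells providers to start logging and routes data to a file.
            using (var session = new TraceEventSession("MyDemoSession", "MyData.etl"))
            {
                session.EnableProvider("Microsoft-Demos-MySource");   // turn on one Event Provider
                System.Threading.Thread.Sleep(10000);                 // let it log for a while
            }   // disposing the session stops the logging

            // The Event Consumer: reads the file back and dispatches events to callbacks.
            using (var source = new ETWTraceEventSource("MyData.etl"))
            {
                source.Dynamic.All += data => System.Console.WriteLine("Got event {0}", data.EventName);
                source.Process();   // pump events until the file is exhausted
            }
        }
    }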
Multi-player
An important aspect of the architecture that is not obvious in the diagram above is
that each of the elements of the diagram can have multiple instances. Thus there
can be:
1. Multiple event providers, each emitting events for the part of the system they
understand, and each of which has a manifest for the events it might emit.
2. Multiple event sessions, each of which gathers a different set of events for
a different purpose. Each of these sessions has the option of logging its
events to a file or to the session itself in real time.
3. Multiple event consumers that can process the events from a file or session.
It is not recommended, however, to have multiple consumers feeding from the
same event session in real time. Instead it is simpler and better to have
multiple sessions, each feeding a unique event consumer, in the real-time case.
Asynchronous
Another fundamental property of the system that is not obvious from the diagram
above is that the logging system is asynchronous. When a provider writes an
event, it is a fire and forget operation. The event quickly gets written to a buffer
and the program continues. From this point on processing of the event is
concurrent with the running of the application. This has a number of ramifications:
1. Logging is fast and scalable. Only the initial copy to the first buffer in
the logging pipeline actually delays the program. Note that this is
independent of the number of sessions or consumers. All the rest of the
logging activity happens on other threads of execution and can be
parallelized (thus it scales well).
2. Logging has minimal and predictable impact on the program. There
are no global locks that need to be taken by the provider when it logs
events. Because of this it is much more likely that behavior of threads
(races) with logging on will be essentially the same (statistically speaking) as
with logging off.
3. There is the possibility of lost events. If the providers generate events
faster than the file can store them (or the real time processor can process
them), then eventually events need to be dropped. This is the price you pay
for making the writing of an event asynchronous. The system will detect
that events are lost, but the event data itself is unrecoverable.
Quick Start Example
Enough theory. Let's see how this works in practice.
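A minimal sketch of the kind of program the following paragraphs describe (the class, event, and provider names are taken from the surrounding text; the original listing may differ in detail):

    using System.Diagnostics.Tracing;   // EventSource (the standalone NuGet version ships the same types in Microsoft.Diagnostics.Tracing)

    [EventSource(Name = "Microsoft-Demos-MySource")]
    sealed class Logger : EventSource
    {
        public static Logger Log = new Logger();

        // Event ID 1: logs a string and an integer.
        public void MyFirstEvent(string MyName, int MyId) { WriteEvent(1, MyName, MyId); }

        // Event ID 2: logs just the integer, so it can be correlated with MyFirstEvent.
        public void MySecondEvent(int MyId) { WriteEvent(2, MyId); }
    }

    class Program
    {
        static void Main()
        {
            Logger.Log.MyFirstEvent("Hello", 1);
            Logger.Log.MySecondEvent(1);
        }
    }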
Above is source code for a program that logs two events using the EventSource class.
To be strongly typed, we must provide a schema for each event, and we do this by
defining a subclass of the EventSource class. This class will have an instance
method for each event that can be logged by this provider. In the example above
we define two events.
1. The MyFirstEvent that logs a string MyName and an integer MyId.
2. The MySecondEvent that logs just the integer MyId (which allows it to be
correlated with the corresponding MyFirstEvent).
Notice that the method definitions provide all the information needed to generate
the schema for the events. The name of the method defines the name of the event,
and the argument names and types provide the names and types of each of the
properties associated with the event.
In a more perfect world, humans would only author the declarations of an
EventSource class, since it is these declarations that specify the programmer's intent.
However, to make these methods actually log events, the user needs to define a
boilerplate body for each event that does two things:
1. Defines a numeric value associated with the event. This is the first
parameter to the WriteEvent method call and is used to identify the event in
all further processing (the event name is only used to generate the manifest).
These event numbers start at 1 (0 is reserved) and need to be the ordinal
number of the method in the class. Thus it would be an error to reverse
the order of the MyFirstEvent and MySecondEvent declarations above
without also changing the first parameter to WriteEvent to match the order
in the class. If this restriction bugs you, we will see how to avoid it later, but
it will mean more typing.
2. Passes along all the arguments from the method to the WriteEvent method.
Because the arguments to the event method are used to generate the
manifest, and the manifest is supposed to accurately describe the event, it
would be an error to pass more or fewer arguments to WriteEvent. Thus the
WriteEvent method is intended to be used only in this very particular way
illustrated above.
The Logger class also has an attribute that defines the name for this provider to be
Microsoft-Demos-MySource. If this attribute had not been provided, the name of
the provider would have been the simple name of the class (e.g. Logger). If your
provider is for more than ad-hoc logging, it is STRONGLY encouraged that you define
a real name for it that avoids collisions and helps your users understand what
information your provider will log. You can see we followed the best practices
that the Windows operating system group uses by making our name Microsoft-Demos-MySource.
Finally, in our example above the Logger class also defines a global static variable
which creates an instance of the class that we use to log events. Having more than
one instance of a particular EventSource is not recommended, since there is a cost
to construct them, and having two serves no useful purpose. Thus most
EventSources will have a single instance, typically in a static variable.
Once we have our Logger EventSource defined, we simply call the event methods
to log events. At this point we have an application with a fully functional ETW
provider. Of course those events are off by default, so this program does not do
anything yet, which brings us to step 2.
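A sketch of the processing code the next paragraph explains (assuming the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package; the MyData.etl file name is a placeholder, and the file is assumed to contain kernel process events):

    using System;
    using Microsoft.Diagnostics.Tracing;                  // ETWTraceEventSource
    using Microsoft.Diagnostics.Tracing.Parsers.Kernel;   // ProcessTraceData

    class ConsumerSketch
    {
        static void Main()
        {
            using (var source = new ETWTraceEventSource("MyData.etl"))
            {
                // Subscribe to the kernel ProcessStart event; 'data' is strongly typed.
                source.Kernel.ProcessStart += (ProcessTraceData data) =>
                    Console.WriteLine("Process started: {0}", data.CommandLine);

                source.Process();   // dispatch events until the file is exhausted
            }
        }
    }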
This creates a C# delegate (the => operator) that takes a ProcessTraceData and
prints out the command line that started the process. Thus a parser:
1. Has a subscription point (a C# event) for every event it knows about (in this
case we see the ProcessStart event).
2. For each such event, if there are payload values, then there is a specific
subclass of TraceEvent, which defines properties for each data item in that
event. In the example above the ProcessStart event has payload values
(event arguments) like CommandLine and ParentID and you can simply use
C# property syntax to get their value, and the properties have the types you
would expect them to have (string for CommandLine and integer for
ParentID).
Here is where the strong typing of the logging becomes apparent. On one end
an EventSource logs a strongly typed set of arguments (with names), and on the
other end they pop out of a TraceEventParser as a class that has properties that let
you fetch the values with full type fidelity.
Thus in general you get a diagram that looks like this:
[Diagram: unparsed events flow from the event source into parsers such as ClrTraceEventParser, RegisteredTraceEventParser, and DynamicTraceEventParser, which deliver parsed events to callbacks (e.g. a GC event callback, an IIS event callback, an EventSource callback).]
In fact the first three parsers are so common that the ETWTraceEventSource has
three shortcut properties (Kernel, Clr, and Dynamic) that allow you to access these
parsers in a very easy way.
We are now finally in a position to explain the mysterious piece of code in our
original example:
source.Dynamic.All += delegate(TraceEvent data) { }
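For example, payload values can be fetched by name even without a compile-time parser (a sketch, continuing with the same source variable; PayloadByName looks a payload value up by its property name):

    source.Dynamic.All += delegate(TraceEvent data)
    {
        if (data.EventName == "MyFirstEvent")
            Console.WriteLine("MyName = {0}", data.PayloadByName("MyName"));
    };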
This will print out the value of the MyName property of the event. There are also
methods for discovering all the names of the properties, so you can dump the event
however you like (e.g. as XML or JSON).
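For instance, a sketch of a generic dump built on those discovery methods (PayloadNames and PayloadValue; the XML-ish output format is just illustrative):

    source.Dynamic.All += delegate(TraceEvent data)
    {
        Console.Write("<Event Name=\"{0}\"", data.EventName);
        for (int i = 0; i < data.PayloadNames.Length; i++)
            Console.Write(" {0}=\"{1}\"", data.PayloadNames[i], data.PayloadValue(i));
        Console.WriteLine("/>");
    };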
But clearly this experience is a step down from what you get with a compile time
solution. Certainly it is clunkier to write, and also error prone (if you misspell
MyName above the compiler will not catch it like it would if we misspelled
CommandLine when accessing process data). It is also MUCH less efficient to use
PayloadByName than to use compile time trace parser properties.
Thus even though DynamicTraceEventParser is sufficient to parse events from an
EventSource with full fidelity, if you are doing more than just printing the event, it is
a good idea to create a static (compile time) parser that is tailored for your
EventSource and thus can return compile time types tailored for your event
payloads. This is what the TraceParserGen tool is designed to do, which we will
cover later.
Capabilities of the TraceEvent library include:
 The ability to monitor ETW events, sending them either to a file or directly to
a programmatic callback in real time.
 The ability for those real-time events to be passed to the IObservable
interface and thus be used by the Reactive Extensions.
 The ability to turn on event providers selectively using ETW Keywords and
verbosity Levels. You can also pass additional arguments to your provider,
which EventSources can pick up. In that way you can create very
sophisticated filtering specifications as well as execute simple commands (e.g.
force a GC, flush the working set, etc.).
 The ability to enumerate the ETW providers on the system as well as in a
particular process, and the ability to determine what ETW groups (Keywords)
you can turn on.
 The ability to take ETL files and merge them together into a single file.
 The ability to read an ETL file or real time session and write an ETL file from it,
filtering it or adding new events (Windows 8 only).
 The ability to capture stack traces when events are being captured.
 The ability to convert the stacks to symbolic form for .NET, JScript, and
native code.
 The ability to store events in a new format (ETLX) that allows the events to be
accessed efficiently in a random fashion, enumerated backwards as well as
forwards, and stored with efficient stack information. (See TraceLog below.)
 The ability to generate C# code that implements a strongly typed
parser for any ETW provider with a manifest.
 The ability to read events written with the WPP Tracing system.
 The ability to access Activity IDs that allow you to track causality across
asynchronous operations (if all components emit the right events).
 Access Kernel events (along with stack traces), including:
o Process start/stop, Thread start/stop, DLL load and unload.
o CPU samples every MSec (but you can control the frequency down to
.125 msec).
o Every context switch (which means you know where you spent time
blocked) as well as the thread that unblocked the thread.
o Page faults.
o Virtual Memory allocation.
o C or C++ heap allocations.
o Disk I/O.
o File I/O (whether it hits the disk or not).
o Registry access.
o Network I/O.
o Every packet (with complete data) that comes on or off the network
(network sniffer).
o Every System call.
o Sampling of processor CPU counters (instructions executed, branch
mispredicts, cache misses, ...) (Windows 8 only).
o Remote Procedure calls.
o How the machine is configured (disk, memory, CPUs, ...).
 Access CLR (.NET Runtime) events, including:
o When GCs happen.
o When allocations are made (sampling and non-sampling).
o When objects are moved during a GC.
o When methods are Just In Time (JIT) compiled.
o When Exceptions are thrown.
o When System.Threading.Tasks.Task instances are created and scheduled.
o Additional information on why a .NET assembly failed to load (to
diagnose failures).
o Information to decode .NET frames in stack traces.
 Access ASP.NET events, which log when requests come in and when various
stages of the pipeline complete.
 Access WCF events, which log packets as they go through their pipeline.
 Access JScript runtime events, including:
o Garbage collection.
o Just-In-Time (JIT) compilation of methods.
o Information to decode JScript frames in stack traces.
You can also get a reasonably good idea of what is possible by taking a look at the
PerfView tool. PerfView was built on top of the TraceEvent library and all the ETW
capabilities of that tool are surfaced in the TraceEvent library.
ETW Limitations
Unfortunately, there are some limitations in ETW that sometimes block it from being
used in scenarios where it would otherwise be a natural fit. They are listed here for
emphasis.
 You send commands to providers on a machine-wide basis. Thus you can't
target particular processes (however, if you own the event provider code you
can pass extra information as arguments to the enable command and have
your provider implement logic to ignore enable commands not intended for it).
 Because commands are machine wide and thus give you access to all
information on the system, you have to be Elevated (Admin) to turn an ETW
session on or off.
 By design, the communication between the controllers and the providers is
fire and forget. Thus ETW is not intended to be a general purpose cross-process
communication mechanism. Don't try to use it as such.
 In real time mode, events are buffered and there is at least a second or so of
delay between the firing of the event and its reception by the session (to
allow events to be delivered in efficient clumps of many events).
 Before Windows 8, there could be only one kernel session. Thus using kernel
mode events for monitoring scenarios was problematic, because any other
tools that used kernel sessions were likely to interfere by overriding the single
kernel-mode event logging session.
 In general, scenarios having multiple controllers (sessions) controlling the
same providers are dangerous. It can be done in some cases, but there is a
significant potential for interference between the sessions.
 The file format is private, and before Windows 8 it could be quite space
inefficient (it compresses 8-to-1). Files can get big fast.
 Logging more than 10K events/sec will load the system noticeably (5%).
Logging more frequently than 10K/sec should be avoided if possible. Logging
1M events/sec will completely swamp a typical machine.
You will find PerfView very handy when debugging your own ETW
processing.
Full dumps of ETW events in PerfView
Normally PerfView's event view does not show all the data in an ETW file. Things
like the Provider GUID, EventID, Opcode and payload bytes are not shown because
they typically are not relevant. However, when you are debugging ETW processing
of your own, these low level details can be critical, so you would like to be able to
see them. You can do this selectively in PerfView by doing the following:
1) Find the Event that you are interested in (typically by looking at the
timestamp).
2) Select the time cell of that event.
3) Right click and select Dump Event.
It will then open PerfView's log, which will contain an XML dump of the event, including
all low level information (including the raw payload bytes).
This starts a circular logging session and leaves it on until you explicitly
turn it off. Now you will get exceptions whenever your code runs; however,
they will still be swallowed. To fix that, go to the Debug -> Exceptions
dialog and enable stopping on any thrown CLR exception. Any authoring
mistakes will now be very obvious.
3) Look at the Exceptions events in PerfView when you attempt to use your
EventSource (see this blog entry for more).
Architecture
TODO
Symbol Resolution
TODO