Sampling is the practice of discarding some traces or spans in order to reduce the amount of data that needs to be
stored and analyzed. Sampling is a trade-off between cost and completeness of data.
Head sampling means the decision to sample is made at the beginning of a trace. This is simpler and more common.
Tail sampling means the decision to sample is delayed, possibly until the end of a trace. This means there is more
information available to make the decision, but this adds complexity.
Sampling usually happens at the trace level, meaning entire traces are kept or discarded. This way the remaining traces
are generally complete.
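A minimal random head sampling setup could look like this (a sketch; `SamplingOptions(head=...)` takes the fraction of traces to keep):

```python
import logfire

# Keep a random ~50% of traces; the rest are discarded at the start of the trace.
logfire.configure(sampling=logfire.SamplingOptions(head=0.5))

for x in range(10):
    logfire.info(f'info {x}')  # roughly half of these traces will be kept
```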
Random head sampling often works well, but you may not want to lose any traces which indicate problems. In this case,
you can use tail sampling. Here's a simple example:
```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

for x in range(3):
    # None of these are logged
    with logfire.span('excluded span'):
        logfire.info(f'info {x}')

    # All of these are logged
    with logfire.span('included span'):
        logfire.error(f'error {x}')

for t in range(1, 10, 2):
    with logfire.span(f'span with duration {t}'):
        time.sleep(t)
```
This outputs something like:
```
11:37:45.484 included span
11:37:45.484 error 0
11:37:45.485 included span
11:37:45.485 error 1
11:37:45.485 included span
11:37:45.485 error 2
11:37:49.493 span with duration 5
11:37:54.499 span with duration 7
11:38:01.505 span with duration 9
```
By default, level_or_duration keeps a trace if at least one span or log in it:

- has a log level greater than info (the default of any span), or
- has a duration greater than 5 seconds.
This way you won't lose information about warnings/errors or long-running operations. You can customize what to keep
with the level_threshold and duration_threshold arguments.
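Head and tail sampling can be combined by additionally passing a head fraction, e.g. a sketch:

```python
import logfire

# First discard a random 90% of traces at the head; the remaining 10%
# are then subject to the usual level/duration tail sampling.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(head=0.1),
)
```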
With head=0.1, only 10% of traces are kept, even if they have a high log level or duration. Traces that don't meet the tail sampling criteria will be discarded every time.
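To also keep a random fraction of non-notable traces, you can pass a background rate. For example, a script along these lines (a sketch reconstructed to match the output below):

```python
import logfire

# Keep all notable traces, plus a random ~30% of traces that
# don't meet the level/duration criteria.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(background_rate=0.3),
)

for x in range(10):
    logfire.info(f'info {x}')  # each of these traces has a ~30% chance of being kept

for x in range(5):
    logfire.error(f'error {x}')  # these always meet the level criterion
```

might output something like: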
```
12:24:40.293 info 2
12:24:40.293 info 3
12:24:40.293 info 7
12:24:40.294 error 0
12:24:40.294 error 1
12:24:40.294 error 2
12:24:40.294 error 3
12:24:40.295 error 4
```
i.e. about 30% of the info logs and 100% of the error logs are kept.
(Technical note: the trace ID is compared against the head and background rates to determine inclusion, so the
probabilities don't depend on the number of spans in the trace, and the rates give the probabilities directly without
needing any further calculations. For example, with a head sample rate of 0.6 and a background rate of 0.3, the
chance of a non-notable trace being included is 0.3, not 0.6 * 0.3.)
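The comparison can be sketched in plain Python (the helper below is hypothetical; the real logic lives inside the SDK, and this assumes the background rate is not larger than the head rate):

```python
def included(trace_id_fraction: float, head_rate: float,
             background_rate: float, notable: bool) -> bool:
    """Sketch: the trace ID, scaled to [0, 1), is compared against both rates."""
    if trace_id_fraction >= head_rate:
        return False  # discarded by head sampling
    if notable:
        return True  # notable traces surviving head sampling are kept
    return trace_id_fraction < background_rate  # background rate applies directly


# With head=0.6 and background_rate=0.3, a non-notable trace is kept exactly
# when its scaled trace ID falls below 0.3, i.e. with probability 0.3 (not 0.18).
rate = sum(included(i / 1000, 0.6, 0.3, False) for i in range(1000)) / 1000
```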
For tail sampling to work, all the spans in a trace must be kept in memory until either the trace is included by
sampling or the trace is completed and discarded. In the above example, the spans named included span don't have a
high enough level to be included, so they are kept in memory until the error logs cause the entire trace to be included.
This means that traces with a large number of spans can consume a lot of memory, whereas without tail sampling the spans
would be regularly exported and freed from memory without waiting for the rest of the trace.
In practice this is usually OK, because such large traces will usually exceed the duration threshold, at which point the
trace will be included and the spans will be exported and freed. This works because the duration is measured as the time
between the start of the trace and the start/end of the most recent span, so the tail sampler can know that a span will
exceed the duration threshold even before it's complete. For example, running this script:
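The script isn't shown here; a sketch consistent with the output below (default level_or_duration, one info log per second inside a single long-running span) would be:

```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

with logfire.span('span'):
    for x in range(1, 10):
        time.sleep(1)
        logfire.info(f'info {x}')
```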
will do nothing for the first 5 seconds, before suddenly logging all this at once:
```
12:29:43.063 span
12:29:44.065 info 1
12:29:45.066 info 2
12:29:46.072 info 3
12:29:47.076 info 4
12:29:48.082 info 5
```
followed by additional logs once per second. This is despite the fact that at this stage the outer span hasn't completed
yet and the inner logs each have 0 duration.
However, memory usage can still be a problem in some cases, e.g. if the duration threshold is set very high or a trace accumulates a very large number of spans before crossing it.
Tail sampling and distributed tracing

Logfire's tail sampling is implemented in the SDK and only works for traces within one process. If you need tail sampling with distributed tracing, consider deploying
the Tail Sampling Processor in the OpenTelemetry Collector.
If a trace was started in another process and its context was propagated to a process using the Logfire SDK's tail sampling, the whole trace will be included.
If you start a trace with the Logfire SDK with tail sampling, and then propagate the context to another process, the
spans generated by the SDK may be discarded, while the spans generated by the other process may be included, leading to
an incomplete trace.
Spans starting after root ended, e.g. background tasks
When the root span of a trace ends, if the trace doesn't meet the tail sampling criteria, all spans in the trace are
discarded. If you start a new span in that trace (i.e. as a descendant of the root span) after the root span has ended,
the new span will always be included anyway, and its parent will be missing. This is because the tail sampling mechanism
only keeps track of active traces to save memory. This is similar to the distributed tracing case above.
Here's an example with a FastAPI background task which starts after the root span corresponding to the request has
ended:
```python
import uvicorn
from fastapi import BackgroundTasks, FastAPI

import logfire

app = FastAPI()

logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(
        duration_threshold=0.1,
    ),
)
logfire.instrument_fastapi(app)


async def background_task():
    # This will be included even if the root span was excluded.
    logfire.info('background')


@app.get('/')
async def index(background_tasks: BackgroundTasks):
    # Uncomment to prevent request span from being sampled out.
    # await asyncio.sleep(0.2)
    background_tasks.add_task(background_task)
    return {}


uvicorn.run(app)
```
A workaround is to explicitly put the new spans in their own trace using attach_context:
```python
import asyncio

import logfire
from logfire.propagate import attach_context


async def background_task():
    # `attach_context({})` forgets existing context
    # so that spans within start a new trace.
    with attach_context({}):
        with logfire.span('new trace'):
            await asyncio.sleep(0.2)
            logfire.info('background')
```
If you need more control than random sampling, you can pass an OpenTelemetry
Sampler. For example:
```python
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,
    ALWAYS_ON,
    ParentBased,
    Sampler,
    TraceIdRatioBased,
)

import logfire


class MySampler(Sampler):
    def should_sample(
        self,
        parent_context,
        trace_id,
        name,
        *args,
        **kwargs,
    ):
        if name == 'exclude me':
            sampler = ALWAYS_OFF
        elif name == 'include me minimally':
            sampler = TraceIdRatioBased(0.01)  # 1% sampling
        elif name == 'include me partially':
            sampler = TraceIdRatioBased(0.5)  # 50% sampling
        else:
            sampler = ALWAYS_ON
        return sampler.should_sample(
            parent_context,
            trace_id,
            name,
            *args,
            **kwargs,
        )

    def get_description(self):
        return 'MySampler'


logfire.configure(
    sampling=logfire.SamplingOptions(
        head=ParentBased(
            MySampler(),
        )
    )
)

with logfire.span('keep me'):
    logfire.info('kept child')

for i in range(5):
    with logfire.span('include me partially'):
        logfire.info(f'partial sample {i}')

for i in range(270):
    with logfire.span('include me minimally'):
        logfire.info(f'minimal sample {i}')

with logfire.span('exclude me'):
    logfire.info('excluded child')
```
This will output something like:
```
10:37:30.897 keep me
10:37:30.898 kept child
10:37:30.899 include me partially
10:37:30.900 partial sample 0
10:37:30.901 include me partially
10:37:30.902 partial sample 3
10:37:30.905 include me minimally
10:37:30.906 minimal sample 47
10:37:30.910 include me minimally
10:37:30.911 minimal sample 183
```
The sampler applies different strategies based on span names:

- exclude me: never sampled (ALWAYS_OFF)
- include me partially: 50% sampling, so roughly half appear
- include me minimally: 1% sampling, so roughly 1 in 100 appears
- keep me and all others: always sampled (ALWAYS_ON)
The sampler is wrapped in a ParentBased sampler, which ensures child spans follow their parent's sampling decision.
If you remove that and simply pass head=MySampler(), child spans might be included even when their parents are
excluded, resulting in incomplete traces.
You can also pass a Sampler to the head argument of SamplingOptions.level_or_duration to combine tail sampling
with custom head sampling.
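For instance, a sketch combining a ratio-based head sampler with level/duration tail sampling:

```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

import logfire

# Head-sample 10% of traces, then apply level/duration tail sampling
# to the survivors. ParentBased keeps child spans consistent with
# their parent's sampling decision.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(
        head=ParentBased(TraceIdRatioBased(0.1)),
    )
)
```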
If you want tail sampling with more control than level_or_duration, you can pass a function as the tail argument. It accepts an instance of TailSamplingSpanInfo and returns a float between 0 and 1 representing the probability that the trace should be included. For example:
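A sketch of such a function (the duration and level attributes on span_info are assumed here, by analogy with the level/duration criteria described above):

```python
import logfire


def get_tail_sample_rate(span_info):
    # Hypothetical policy: keep long traces more often than short, quiet ones.
    if span_info.duration >= 1:  # duration in seconds (assumed attribute)
        return 0.5
    if span_info.level > 'warn':  # comparable level attribute (assumed)
        return 0.3
    return 0.1


logfire.configure(
    sampling=logfire.SamplingOptions(tail=get_tail_sample_rate),
)
```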