Sampling is the practice of discarding some traces or spans in order to reduce the amount of data that needs to be
stored and analyzed. Sampling is a trade-off between cost and completeness of data.
Head sampling means the decision to sample is made at the beginning of a trace. This is simpler and more common.
Tail sampling means the decision to sample is delayed, possibly until the end of a trace. This means there is more
information available to make the decision, but this adds complexity.
Sampling usually happens at the trace level, meaning entire traces are kept or discarded. This way the remaining traces
are generally complete.
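A minimal random head sampling setup could look like this (a sketch; `SamplingOptions(head=...)` takes the fraction of traces to keep):

```python
import logfire

# Keep a random ~50% of traces; the rest are discarded at the start of the trace.
logfire.configure(sampling=logfire.SamplingOptions(head=0.5))

for x in range(10):
    logfire.info(f'info {x}')  # roughly half of these traces will be kept
```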
Random head sampling often works well, but you may not want to lose any traces which indicate problems. In this case,
you can use tail sampling. Here's a simple example:
```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

for x in range(3):
    # None of these are logged
    with logfire.span('excluded span'):
        logfire.info(f'info {x}')

    # All of these are logged
    with logfire.span('included span'):
        logfire.error(f'error {x}')

for t in range(1, 10, 2):
    with logfire.span(f'span with duration {t}'):
        time.sleep(t)
```
This outputs something like:
```
11:37:45.484 included span
11:37:45.484 error 0
11:37:45.485 included span
11:37:45.485 error 1
11:37:45.485 included span
11:37:45.485 error 2
11:37:49.493 span with duration 5
11:37:54.499 span with duration 7
11:38:01.505 span with duration 9
```
By default, level_or_duration keeps a trace if at least one span or log in it:

- has a log level greater than info (the default of any span), or
- has a duration greater than 5 seconds.
This way you won't lose information about warnings/errors or long-running operations. You can customize what to keep
with the level_threshold and duration_threshold arguments.
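Head and tail sampling can be combined by additionally passing a head fraction, e.g. a sketch:

```python
import logfire

# First discard a random 90% of traces at the head; the remaining 10%
# are then subject to the usual level/duration tail sampling.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(head=0.1),
)
```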
With head=0.1, only 10% of traces are kept, even if they have a high log level or duration. Traces that don't meet the tail sampling criteria will be discarded every time.
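To also keep a random fraction of non-notable traces, you can pass a background rate. For example, a script along these lines (a sketch reconstructed to match the output below):

```python
import logfire

# Keep all notable traces, plus a random ~30% of traces that
# don't meet the level/duration criteria.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(background_rate=0.3),
)

for x in range(10):
    logfire.info(f'info {x}')  # each of these traces has a ~30% chance of being kept

for x in range(5):
    logfire.error(f'error {x}')  # these always meet the level criterion
```

might output something like: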
```
12:24:40.293 info 2
12:24:40.293 info 3
12:24:40.293 info 7
12:24:40.294 error 0
12:24:40.294 error 1
12:24:40.294 error 2
12:24:40.294 error 3
12:24:40.295 error 4
```
i.e. about 30% of the info logs and 100% of the error logs are kept.
(Technical note: the trace ID is compared against the head and background rates to determine inclusion, so the
probabilities don't depend on the number of spans in the trace, and the rates give the probabilities directly without
needing any further calculations. For example, with a head sample rate of 0.6 and a background rate of 0.3, the
chance of a non-notable trace being included is 0.3, not 0.6 * 0.3.)
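The comparison can be sketched in plain Python (the helper below is hypothetical; the real logic lives inside the SDK, and this assumes the background rate is not larger than the head rate):

```python
def included(trace_id_fraction: float, head_rate: float,
             background_rate: float, notable: bool) -> bool:
    """Sketch: the trace ID, scaled to [0, 1), is compared against both rates."""
    if trace_id_fraction >= head_rate:
        return False  # discarded by head sampling
    if notable:
        return True  # notable traces surviving head sampling are kept
    return trace_id_fraction < background_rate  # background rate applies directly


# With head=0.6 and background_rate=0.3, a non-notable trace is kept exactly
# when its scaled trace ID falls below 0.3, i.e. with probability 0.3 (not 0.18).
rate = sum(included(i / 1000, 0.6, 0.3, False) for i in range(1000)) / 1000
```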
For tail sampling to work, all the spans in a trace must be kept in memory until either the trace is included by
sampling or the trace is completed and discarded. In the above example, the spans named included span don't have a
high enough level to be included, so they are kept in memory until the error logs cause the entire trace to be included.
This means that traces with a large number of spans can consume a lot of memory, whereas without tail sampling the spans
would be regularly exported and freed from memory without waiting for the rest of the trace.
In practice this is usually OK, because such large traces will usually exceed the duration threshold, at which point the
trace will be included and the spans will be exported and freed. This works because the duration is measured as the time
between the start of the trace and the start/end of the most recent span, so the tail sampler can know that a span will
exceed the duration threshold even before it's complete. For example, running this script:
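The script isn't shown here; a sketch consistent with the output below (default level_or_duration, one info log per second inside a single long-running span) would be:

```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

with logfire.span('span'):
    for x in range(1, 10):
        time.sleep(1)
        logfire.info(f'info {x}')
```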
will do nothing for the first 5 seconds, before suddenly logging all this at once:
```
12:29:43.063 span
12:29:44.065 info 1
12:29:45.066 info 2
12:29:46.072 info 3
12:29:47.076 info 4
12:29:48.082 info 5
```
followed by additional logs once per second. This is despite the fact that at this stage the outer span hasn't completed
yet and the inner logs each have 0 duration.
However, memory usage can still be a problem in some cases, e.g. if the duration threshold is set very high or a trace accumulates a very large number of spans before crossing it.
Tail sampling and distributed tracing

Logfire's tail sampling is implemented in the SDK and only works for traces within one process. If you need tail sampling with distributed tracing, consider deploying
the Tail Sampling Processor in the OpenTelemetry Collector.
If a trace was started in another process and its context was propagated to a process using the Logfire SDK's tail sampling, the whole trace will be included.
If you start a trace with the Logfire SDK with tail sampling, and then propagate the context to another process, the
spans generated by the SDK may be discarded, while the spans generated by the other process may be included, leading to
an incomplete trace.
Spans starting after root ended, e.g. background tasks
When the root span of a trace ends, if the trace doesn't meet the tail sampling criteria, all spans in the trace are
discarded. If you start a new span in that trace (i.e. as a descendant of the root span) after the root span has ended,
the new span will always be included anyway, and its parent will be missing. This is because the tail sampling mechanism
only keeps track of active traces to save memory. This is similar to the distributed tracing case above.
Here's an example with a FastAPI background task which starts after the root span corresponding to the request has
ended:
```python
import uvicorn
from fastapi import BackgroundTasks, FastAPI

import logfire

app = FastAPI()

logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(
        duration_threshold=0.1,
    ),
)
logfire.instrument_fastapi(app)


async def background_task():
    # This will be included even if the root span was excluded.
    logfire.info('background')


@app.get('/')
async def index(background_tasks: BackgroundTasks):
    # Uncomment to prevent request span from being sampled out.
    # await asyncio.sleep(0.2)
    background_tasks.add_task(background_task)
    return {}


uvicorn.run(app)
```
A workaround is to explicitly put the new spans in their own trace using attach_context:
```python
import asyncio

import logfire
from logfire.propagate import attach_context


async def background_task():
    # `attach_context({})` forgets existing context
    # so that spans within start a new trace.
    with attach_context({}):
        with logfire.span('new trace'):
            await asyncio.sleep(0.2)
            logfire.info('background')
```
If you need more control than random sampling, you can pass an OpenTelemetry
Sampler. For example:
```python
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,
    ALWAYS_ON,
    ParentBased,
    Sampler,
    TraceIdRatioBased,
)

import logfire


class MySampler(Sampler):
    def should_sample(
        self,
        parent_context,
        trace_id,
        name,
        *args,
        **kwargs,
    ):
        if name == 'exclude me':
            sampler = ALWAYS_OFF
        elif name == 'include me minimally':
            sampler = TraceIdRatioBased(0.01)  # 1% sampling
        elif name == 'include me partially':
            sampler = TraceIdRatioBased(0.5)  # 50% sampling
        else:
            sampler = ALWAYS_ON
        return sampler.should_sample(
            parent_context,
            trace_id,
            name,
            *args,
            **kwargs,
        )

    def get_description(self):
        return 'MySampler'


logfire.configure(
    sampling=logfire.SamplingOptions(
        head=ParentBased(
            MySampler(),
        )
    )
)

with logfire.span('keep me'):
    logfire.info('kept child')

for i in range(5):
    with logfire.span('include me partially'):
        logfire.info(f'partial sample {i}')

for i in range(270):
    with logfire.span('include me minimally'):
        logfire.info(f'minimal sample {i}')

with logfire.span('exclude me'):
    logfire.info('excluded child')
```
This will output something like:
```
10:37:30.897 keep me
10:37:30.898 kept child
10:37:30.899 include me partially
10:37:30.900 partial sample 0
10:37:30.901 include me partially
10:37:30.902 partial sample 3
10:37:30.905 include me minimally
10:37:30.906 minimal sample 47
10:37:30.910 include me minimally
10:37:30.911 minimal sample 183
```
The sampler applies different strategies based on span names:

- exclude me: never sampled (ALWAYS_OFF)
- include me partially: 50% sampling, so roughly half appear
- include me minimally: 1% sampling, so roughly 1 in 100 appears
- keep me and all others: always sampled (ALWAYS_ON)
The sampler is wrapped in a ParentBased sampler, which ensures child spans follow their parent's sampling decision.
If you remove that and simply pass head=MySampler(), child spans might be included even when their parents are
excluded, resulting in incomplete traces.
You can also pass a Sampler to the head argument of SamplingOptions.level_or_duration to combine tail sampling
with custom head sampling.
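For instance, a sketch combining a ratio-based head sampler with level/duration tail sampling:

```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

import logfire

# Head-sample 10% of traces, then apply level/duration tail sampling
# to the survivors. ParentBased keeps child spans consistent with
# their parent's sampling decision.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(
        head=ParentBased(TraceIdRatioBased(0.1)),
    )
)
```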
If you want tail sampling with more control than level_or_duration, you can pass a function as the tail argument. It accepts an instance of TailSamplingSpanInfo and returns a float between 0 and 1 representing the probability that the trace should be included. For example:
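A sketch of such a function (the duration and level attributes on span_info are assumed here, by analogy with the level/duration criteria described above):

```python
import logfire


def get_tail_sample_rate(span_info):
    # Hypothetical policy: keep long traces more often than short, quiet ones.
    if span_info.duration >= 1:  # duration in seconds (assumed attribute)
        return 0.5
    if span_info.level > 'warn':  # comparable level attribute (assumed)
        return 0.3
    return 0.1


logfire.configure(
    sampling=logfire.SamplingOptions(tail=get_tail_sample_rate),
)
```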