io_uring-BPF
Pavel Begunkov
asml.silence at gmail.com
io_uring: introduction
Lots of operations ...
enum {
	IORING_OP_NOP,			IORING_OP_READ,
	IORING_OP_READV,		IORING_OP_WRITE,
	IORING_OP_WRITEV,		IORING_OP_FADVISE,
	IORING_OP_FSYNC,		IORING_OP_MADVISE,
	IORING_OP_READ_FIXED,		IORING_OP_SEND,
	IORING_OP_WRITE_FIXED,		IORING_OP_RECV,
	IORING_OP_POLL_ADD,		IORING_OP_OPENAT2,
	IORING_OP_POLL_REMOVE,		IORING_OP_EPOLL_CTL,
	IORING_OP_SYNC_FILE_RANGE,	IORING_OP_SPLICE,
	IORING_OP_SENDMSG,		IORING_OP_PROVIDE_BUFFERS,
	IORING_OP_RECVMSG,		IORING_OP_REMOVE_BUFFERS,
	IORING_OP_TIMEOUT,		IORING_OP_TEE,
	IORING_OP_TIMEOUT_REMOVE,	IORING_OP_SHUTDOWN,
	IORING_OP_ACCEPT,		IORING_OP_RENAMEAT,
	IORING_OP_ASYNC_CANCEL,		IORING_OP_UNLINKAT,
	IORING_OP_LINK_TIMEOUT,		IORING_OP_MKDIRAT,
	IORING_OP_CONNECT,		IORING_OP_SYMLINKAT,
	IORING_OP_FALLOCATE,		IORING_OP_LINKAT,
	IORING_OP_OPENAT,
	IORING_OP_CLOSE,
	IORING_OP_FILES_UPDATE,
	IORING_OP_STATX,
	...
};
Features
• SQPOLL for syscall-less submission
• IOPOLL for beating performance records
• Registered resources with fast updates
- IORING_REGISTER_FILES: optimised file refcounting
- IORING_REGISTER_BUFFERS: eliminates page refcounting, no page table walking, etc.
- dynamic fast updates: no more full io_uring quiesce
• IOSQE_IO_LINK: request links for execution ordering
• IORING_FEAT_FAST_POLL: automatic poll fallback, no need for epoll
• IO-WQ: internal thread pool, when nothing else works
• multi-shot requests, e.g. poll generating multiple CQEs
• sharing executors (IO-WQ, SQPOLL) between rings
• and more ...
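Request links in practice: a minimal liburing sketch (liburing assumed available; in_fd/out_fd are assumed open descriptors; error handling omitted) that chains a read to a write, so the write starts only after the read completes:

```c
/* Sketch: ordering with IOSQE_IO_LINK; error handling omitted.
 * For simplicity the write uses the full buffer length rather than
 * the read's actual result. */
struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
char buf[4096];

io_uring_queue_init(8, &ring, 0);

sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, in_fd, buf, sizeof(buf), 0);
sqe->flags |= IOSQE_IO_LINK;		/* the next SQE waits for this one */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, out_fd, buf, sizeof(buf), 0);

io_uring_submit(&ring);			/* one syscall submits both */
io_uring_wait_cqe(&ring, &cqe);		/* then reap the two CQEs */
```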
Execution flow
First try nowait: IOCB_NOWAIT, LOOKUP_CACHED, etc.
• might just complete, e.g. if data is already there
• O_DIRECT goes async, -EIOCBQUEUED
• added to a waitqueue, e.g. poll requests
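The flow above as pseudocode; issue(), arm_poll() and punt_to_io_wq() are illustrative names, not the actual kernel functions:

```c
/* pseudocode sketch of the issue path */
ret = issue(req, IO_URING_F_NONBLOCK);	/* IOCB_NOWAIT, LOOKUP_CACHED, ... */
if (ret != -EAGAIN)
	complete(req);			/* data was already there */
else if (file_can_poll(req))
	arm_poll(req);			/* added to a waitqueue */
else
	punt_to_io_wq(req);		/* worker-thread fallback */
```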
Misconception debunking
io_uring is not "just a worker pool"
• worker threads are the slower path
The problem
syscall overhead
Vulnerability mitigations are expensive, and so are syscalls
• cost varies with CPU and enabled mitigations
Syscall overhead in a tight loop doing little work per call can reach 20-50%
(apparently, tested CPU is the worst case)
# mitigations enabled
# nop requests, batch 32
# fio/t/io_uring -d32 -s32 -c32 -N1
# mitigations enabled
# Null block device, “realistic batching” 4 requests at a time
# modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 submit_queues=16
# fio/t/io_uring -d4 -s4 -c4 -p1 -B1 -F1 -b512 /dev/nullb0
Sweet spot for optimisation. How about SQPOLL?
• still needs userspace to process completions
• takes a CPU core; high CPU consumption
• cache bouncing
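For reference, enabling SQPOLL with liburing looks roughly like this (sketch; error handling omitted):

```c
/* Sketch: SQPOLL setup; a kernel-side thread polls the SQ for new SQEs */
struct io_uring ring;
struct io_uring_params p = { };

p.flags = IORING_SETUP_SQPOLL;
p.sq_thread_idle = 2000;	/* ms of idle before the SQ thread sleeps */
io_uring_queue_init_params(64, &ring, &p);

/* submission now usually avoids io_uring_enter(2), but userspace
 * still has to spend cycles reaping completions itself */
```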
Requirements
Flexibility: what capabilities does BPF have to have?
• submitting new requests
• accessing CQEs, multiple if needed
• poking into userspace memory
Low overhead
• Traditionally we’ve optimised batched submission more
• BPF is expected to have a lower batch ratio
Idea 1: let's add a callback to each SQE

struct io_uring_sqe {
	...
	u32 callback_id;
};
New io_uring request type: IORING_OP_BPF
No extra per-request overhead: everything is enclosed in the opcode handlers.
And we can use generic io_uring infrastructure:
• locking and better control of execution context
• completion and other batching
• space in the internal request struct, i.e. struct io_kiocb
• can be linked to other requests
• possible to execute multiple times, i.e. keeping a BPF request alive
The downside is that extra requests are not free; there is a cost, but we can
work with it.
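With the proposed out-of-tree API, submission could look like the sketch below. This is hypothetical: the SQE field carrying the program index is an illustrative guess, not a fixed ABI.

```c
/* Hypothetical sketch of submitting a BPF request (bpf_v3 proposal) */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_BPF;
sqe->off = bpf_prog_idx;	/* index of a registered BPF program (guess) */
sqe->user_data = tag;
io_uring_submit(&ring);
/* the program runs in io_uring's context; it can submit further SQEs,
 * reap CQEs and be re-run, keeping the BPF request alive */
```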
Feeding BPF completions
BPF needs feedback from other requests.
The first idea: just use links and pass the CQE of the previous request to BPF!
• ugly again
• bound to linking by design
• no way to pass multiple CQEs
• extra overhead for non-BPF code
Multiple CQs
Introduce multiple CQs:
• sqe->cq_idx: each request specifies which CQ its completion goes to
• BPF can emit and consume CQEs to / from any CQ
• CQs can be waited on
• synchronisation is up to userspace / BPF
Pros:
• can pass multiple CQEs
• CQs can be waited on (including by BPF)
• an extra way of communication: posting to a CQ

Example: each BPF request has its own CQ. It keeps a number of operations
in-flight and posts to the main CQ when it's done with the job.
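A sketch of that scheme (hypothetical: sqe->cq_idx and the CQ indices follow the proposal, not mainline, and PRIVATE_CQ_IDX is an illustrative name):

```c
/* Hypothetical: route worker-request completions to the BPF
 * request's private CQ */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, off);
sqe->cq_idx = PRIVATE_CQ_IDX;	/* CQE lands in the BPF request's CQ */

/* the BPF program reaps from PRIVATE_CQ_IDX, refills the pipeline,
 * and posts a single CQE to the main CQ when the job is done */
```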
What about poking into the normal userspace memory?
BPF subsystem already has an answer: sleepable BPF programs
There are also BPF maps / arrays and other infrastructure provided by BPF
• not everything is supported with sleepable programs; restrictions may get lifted (if not already)
Overhead
There can be O(N) BPF requests, so it is important to keep overhead low
A lot of work has been done! Highlights:
• persistent submission state, request caching
• infrastructure around task_work and execution batching
• task_struct referencing and other overhead amortisation
• removing request refcounting
• completion batching
• native io-wq workers (planned to be used)
• upcoming IOSQE_CQE_SKIP_SUCCESS
• just cutting the number of instructions required per request ...
API: registration

enum {
...
IORING_REGISTER_BPF,
IORING_UNREGISTER_BPF,
};
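Registration would presumably follow the usual io_uring_register(2) pattern; a hypothetical sketch (the exact arguments are from the proposal, not mainline):

```c
/* Hypothetical: register loaded "iouring" BPF program fds with a ring */
int prog_fds[] = { bpf_prog_fd };

ret = io_uring_register(ring_fd, IORING_REGISTER_BPF, prog_fds, 1);
/* BPF requests would then refer to programs by index into this array;
 * IORING_UNREGISTER_BPF drops them */
```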
API: BPF request
enum {
...
IORING_OP_BPF,
};
API: BPF definitions
// Return values for io_uring BPF programs
enum {
	IORING_BPF_OK = 0,	// complete request
	IORING_BPF_WAIT,	// wait on CQ for completions
};
API: libbpf example
SEC("iouring")	// io_uring BPF program
int bpf_program_name(struct io_uring_bpf_ctx *ctx)
{
	struct io_uring_cqe cqe;
	int ret;

	ret = bpf_io_uring_reap_cqe(ctx, cq_idx, &cqe, sizeof(cqe));
	if (ret)
		return IORING_BPF_WAIT;	// no CQE yet, wait on the CQ
	...
	return IORING_BPF_OK;		// done, complete the BPF request
}
Testing
Not yet conclusive. Test case:
• copy a file 4KB at a time into /dev/zero, buffered and fully cached

Mitigations | Test case        | Time (ms)
ON          | read(2)/write(2) | 1350
ON          | read(2)/write(2) | 1320
Applicability
Shouldn't be of interest if batching is naturally "high enough".
High queue depth is not always possible and/or desirable:
• batching hurts latency
• ordering may matter, e.g. TCP sockets
• slow devices and memory/responsiveness restrictions
Resources
Kernel
https://github.com/isilence/linux.git bpf_v3
Liburing, see <liburing>/examples/bpf/*
https://github.com/isilence/liburing.git bpf_v3