This is the fourth post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at processing the virtio-serial device's transmit and receive virtqueues.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
The virtio-serial device has a pair of virtqueues that allow the driver to transmit and receive data. The driver enqueues empty buffers onto the receiveq (virtqueue 0) and the device fills them with received data. The driver enqueues buffers containing data onto the transmitq (virtqueue 1) and the device sends them.
This logic is split into two modules: virtqueue_reader for the transmitq and virtqueue_writer for the receiveq. The interface of virtqueue_reader looks like this:
/* Stream data from a virtqueue without framing */
module virtqueue_reader (
input clk,
input resetn,
/* Number of elements in descriptor table */
input [15:0] queue_size,
/* Lower 32-bits of Virtqueue Descriptor Area address */
input [31:0] queue_desc_low,
/* Lower 32-bits of Virtqueue Driver Area address */
input [31:0] queue_driver_low,
/* Lower 32-bits of Virtqueue Device Area address */
input [31:0] queue_device_low,
input queue_notify, /* kick */
input phase,
output reg [31:0] data = 0,
output reg [2:0] data_len = 0,
output ready,
/* For DMA */
output reg ram_valid = 0,
input ram_ready,
output reg [3:0] ram_wstrb = 0,
output reg [21:0] ram_addr = 0,
output reg [31:0] ram_wdata = 0,
input [31:0] ram_rdata
);
If you are familiar with the VIRTIO specification you might recognize queue_size, queue_desc_low, queue_driver_low, queue_device_low, and queue_notify since they are values provided by the VIRTIO MMIO Transport. The driver configures them with the memory addresses of the virtqueue data structures in RAM. The device uses DMA to access those data structures. The driver can kick the device using queue_notify to indicate that new buffers have been enqueued.
The reader interface consists of phase, data, data_len, and ready; this is what the rdwr_stream module needs in order to use virtqueue_reader as a data source. rdwr_stream keeps reading the next byte(s) by flipping the phase bit and waiting for ready to be asserted by the device. Note that the device can provide up to 4 bytes at a time through the 32-bit data register and data_len allows the device to indicate how much data was read.
Finally, the DMA interface is how virtqueue_reader initiates RAM accesses so it can fetch the virtqueue data structures that the driver has configured.
Virtqueue processing consists of multiple steps and cannot be completed within a single clock cycle. Therefore the processing is decomposed into a state machine where each step consists of a DMA transfer or waiting for an event. Here are the states:
`define STATE_WAIT_PHASE                 0  /* waiting for phase bit to flip */
`define STATE_READ_AVAIL_IDX             1  /* waiting for avail.idx read */
`define STATE_WAIT_NOTIFY                2  /* waiting for queue notify (kick) */
`define STATE_READ_DESCRIPTOR_ADDR_LOW   3  /* waiting for descriptor read */
`define STATE_READ_DESCRIPTOR_LEN        4
`define STATE_READ_DESCRIPTOR_FLAGS_NEXT 5
`define STATE_READ_BUFFER                6  /* waiting for data buffer read */
`define STATE_WRITE_USED_ELEM_ID         7  /* waiting for used element write */
`define STATE_WRITE_USED_ELEM_LEN        8
`define STATE_WRITE_USED_FLAGS_IDX       9  /* waiting for used.flags/used.idx write */
`define STATE_READ_AVAIL_RING_ENTRY      10 /* waiting for avail element read */
The device starts up in STATE_WAIT_PHASE because it is waiting to be asked to read the first byte(s). As soon as rdwr_stream flips the phase bit, virtqueue_reader must check the virtqueue to see if any data buffers are available.
I won't describe all the details of virtqueue processing, but here is a summary of the steps involved. See the VIRTIO specification or the code for the details.
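As a software analogy (not the actual Verilog), here is a sketch in Rust of those steps for a single-descriptor transmitq buffer. The struct and field names are illustrative; only the split-virtqueue roles of the descriptor table, avail ring, and used ring follow the VIRTIO specification:

```rust
#[derive(Clone, Copy)]
struct Desc {
    addr: usize, // offset of the data buffer in guest RAM
    len: u32,
    // flags/next omitted: this sketch only handles single-descriptor chains
}

struct UsedElem {
    id: u32,
    len: u32,
}

struct Virtqueue {
    desc: Vec<Desc>,      // descriptor table
    avail_ring: Vec<u16>, // driver writes descriptor indices here
    avail_idx: u16,       // driver increments after enqueuing
    last_avail: u16,      // device-side progress counter
    used: Vec<UsedElem>,  // used ring (simplified)
    used_idx: u16,
}

impl Virtqueue {
    /// One pass over the transmitq: pop an available buffer, copy its data,
    /// and mark the descriptor used so the driver can reclaim it.
    fn pop_and_complete(&mut self, ram: &[u8]) -> Option<Vec<u8>> {
        if self.last_avail == self.avail_idx {
            return None; // queue empty: the device would wait for a kick
        }
        let slot = self.last_avail as usize % self.avail_ring.len();
        let head = self.avail_ring[slot] as usize; // read avail ring entry
        let d = self.desc[head];                   // read descriptor fields
        let data = ram[d.addr..d.addr + d.len as usize].to_vec(); // read buffer
        self.used.push(UsedElem { id: head as u32, len: d.len }); // write used elem
        self.used_idx = self.used_idx.wrapping_add(1); // publish to the driver
        self.last_avail = self.last_avail.wrapping_add(1);
        Some(data)
    }
}
```

In hardware, each of these reads and writes is a separate DMA transfer, which is why the Verilog needs one state per access.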
After a buffer has been fully consumed, there are also several steps to fill out a used descriptor and increment the used.idx field so that the driver is aware that the buffer is done.
There are two wait states where the device stops until there is more work to do. First, rdwr_stream will stop asking to read more data if the writer is too slow. This flow control ensures that data is not dropped because of a slow writer. This is STATE_WAIT_PHASE. Second, if the device wants to read but the virtqueue is empty, then it has to wait until queue_notify goes high. This is STATE_WAIT_NOTIFY.
The virtqueue_writer module is similar to virtqueue_reader but it fills in the buffers with data instead of consuming them.
A quick side note about memory alignment: the memory interface is 32-bit aligned, so it is only possible to read an entire 32-bit value from memory at multiples of 4 bytes. On a fancier CPU with a cache the unit would be a cache line (e.g. 128 bytes). When the data structures being DMAed are not aligned it becomes tedious to handle the shifting and masking, especially when reading data from a source and writing it to a destination. Life is much simpler when everything is aligned, because data can be trivially read or written in a single access without any special logic to adjust the data to fit the cache line size.
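To illustrate the pain, here is a sketch (the function name is made up for this example) of what fetching a 32-bit value at an arbitrary byte offset from a 32-bit-word memory involves:

```rust
/// Read a u32 at any byte offset from a memory that only supports aligned
/// 32-bit word accesses (little-endian byte lanes).
fn read_u32_unaligned(words: &[u32], byte_offset: usize) -> u32 {
    let word = byte_offset / 4;
    let shift = (byte_offset % 4) * 8;
    if shift == 0 {
        words[word] // aligned: a single access, no fixup needed
    } else {
        // unaligned: two memory accesses plus shifting and masking
        (words[word] >> shift) | (words[word + 1] << (32 - shift))
    }
}
```

The aligned case is one access; the unaligned case needs two accesses and reassembly, and a writer needs the inverse masking on top of that.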
The virtqueue_reader and virtqueue_writer modules use DMA to read or write data from/to RAM buffers provided by the driver running on the PicoRV32 RISC-V CPU inside the FPGA. They are state machines that run through a sequence of DMA transfers and provide the reader/writer interfaces that the rdwr_stream module uses to transfer data. In the next post we will look at the UART receiver and transmitter.
This is the first post in a series about building a virtio-serial device in Verilog for a Field Programmable Gate Array (FPGA) development board. This was a project I did in my spare time to become familiar with logic design. I hope these blog posts will offer a glimpse into designing your own devices and into FPGA development.
Having developed systems software including firmware, device drivers for Linux, and device emulation in QEMU, I wanted to implement a device from scratch on an FPGA, leaving the comfort of the software world and getting some experience with hardware internals. It didn't take long before I had both good and bad experiences. For example, what a pain it becomes when a device has to process data structures that are not aligned in memory! More on that later.
A few years ago, I ordered a development board with an iCE40UP5k FPGA with the intention of implementing a CPU and maybe a USB controller. I was busy with other things though and the FPGA ended up in a drawer until I recently felt the time was right to dive in.
The muselab iCESugar board that I used for this project costs around 50 USD. It does not support high-speed interfaces like PCIe or Ethernet, but it has 5280 logic cells, 128 KB RAM, 8 MB of flash memory, and a collection of basic I/O including onboard LEDs, UART pins, and PMOD headers. That puts it roughly on par with an Arduino microcontroller board, except you're not stuck with a particular microcontroller because you can design your own or use existing soft-cores, as they are called.
The board can be flashed via USB and loading the manufacturer's demos was an eye opener: it can run several different CPU soft-cores (RISC-V, 6502, etc) and there is even enough capacity to run MicroPython on a soft-core. Typing Python into the prompt and getting output back knowing that the CPU it is running on is just some Verilog code that you can read and modify is neat.
Out of the available demo soft-cores, the PicoRV32 RISC-V soft-core interested me most because it's a 32-bit microcontroller with open source compiler toolchain support despite its tiny Verilog implementation. You can write firmware for the PicoRV32 in Rust, C, etc.
A tiny soft-core is important because it leaves logic cells free for integrating custom devices. There is no point in a fancier soft-core if it complicates the project or limits the number of cells available for my own logic.
The PicoRV32 code comes with an example System-on-Chip (SoC) called PicoSoC that integrates RAM, flash, and UART serial port communication. Custom memory-mapped I/O (MMIO) devices can be wired into the SoC by adding address decoding logic and connecting the devices to the bus. PicoSoC is a great time-saver for developing a custom RV32 SoC because RAM and flash are critical but not particularly exciting to integrate yourself.
The PicoSoC exposes a trivial MMIO register interface for the UART, but I wanted to replace it with a virtio-serial device in order to learn about implementing a more advanced device. VIRTIO devices use Direct Memory Access (DMA) and interrupts, although I ended up not implementing interrupts because I ran out of logic cells. This makes it an opportunity to implement a device from scratch that is small but not trivial.
While PicoSoC has no PCI bus for the popular VIRTIO PCI transport, it is possible to implement the VIRTIO MMIO transport for this SoC since that just involves selecting some address space for the device's registers where the PicoRV32 CPU can access the device.
Having covered all this, the goal of this project is to write a virtio-serial device in Verilog and integrate it into PicoSoC. This also requires writing firmware that runs on the PicoRV32 soft-core to prove that the virtio-serial device works. In the posts that follow, I'll describe the main stops on the journey to building this.
The next post will cover MMIO registers, DMA, and interrupts.
You can also check out the code for this project at https://gitlab.com/stefanha/virtio-serial-fpga.
This is the final post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at the firmware running on the PicoRV32 RISC-V soft-core in the FPGA.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
The PicoRV32 RISC-V soft-core boots up executing code from flash memory at 0x10000000. Since RISC-V is supported by LLVM and gcc, it is possible to write the firmware in several languages. For this project I wanted to use Rust and was aware of several existing crates that already provide APIs for things that would be needed.
I used a Rust no_std environment, which means that the standard library (std) is not available and only the core library (core) is available. Crates written for embedded systems and low-level programming often support no_std, but most other crates rely on the standard library and an operating system. no_std is a niche in the Rust ecosystem but it works pretty well.
Several crates came in handy, among them riscv-rt (the RISC-V runtime providing startup code and the entry point), virtio-drivers (drivers for VIRTIO devices), and embedded-alloc (a memory allocator for embedded environments).
Initially I thought I could get away without a memory allocator, since no_std does not have one by default and it would be extra work to set one up. However, virtio-drivers needed one for the virtio-serial device (I don't think it is really necessary, but the code is written that way). Luckily the embedded-alloc crate has memory allocators that are easy to set up and just need a piece of memory to operate in.
Aside from the setup code, the firmware is trivial. The CPU just sends a hello world message and then echoes back bytes received from the virtio-serial device.
#[riscv_rt::entry]
fn main() -> ! {
unsafe {
extern "C" {
static _heap_size: u8;
}
let heap_bottom = riscv_rt::heap_start() as usize;
let heap_size = &_heap_size as *const u8 as usize;
HEAP.init(heap_bottom, heap_size);
}
// Point virtio-drivers at the MMIO device registers
let header = NonNull::new(0x04000000u32 as *mut VirtIOHeader).unwrap();
let transport = unsafe { MmioTransport::new(header, 0x1000) }.unwrap();
// Put the string on the stack so the device can DMA (it cannot DMA flash memory)
let mut buf: [u8; 13] = *b"Hello world\r\n";
if transport.device_type() == DeviceType::Console {
let mut console = VirtIOConsole::<HalImpl, MmioTransport>::new(transport).unwrap();
console.send_bytes(&buf).unwrap();
loop {
if let Ok(Some(ch)) = console.recv(true) {
buf[0] = ch;
console.send_bytes(&buf[0..1]).unwrap();
}
}
}
loop {}
}
In the early phases I ran tests on the iCESugar board that lit up an LED to indicate the test result. As things became more complex I switched over to Verilog simulation. I wrote testbenches that exercise the Verilog modules I had written. This is similar to unit testing software.
In the later stages of the project, I changed the approach once more in order to do integration testing and debugging. To get more visibility into what was happening in the full design with a CPU and virtio-serial device, I used GTKWave to view the VCD files that Icarus Verilog can write during simulation. You can see every cycle and every value in each register or wire in the entire design, including the PicoRV32 RISC-V CPU, virtio-serial device, etc.
This allowed very powerful debugging since the CPU activity is visible (see the program counter in the reg_pc register in the screenshot) alongside the virtio-serial device's internal state. It is possible to look up the program counter in the firmware disassembly to follow the program flow and see where things went wrong.
The firmware is a small Rust codebase that uses existing crates, including riscv-rt and virtio-drivers. Throughout the project I used several debugging and simulation approaches, depending on the level of complexity. Thanks to the open source code and tools available, it was possible to complete this project using fairly convenient and powerful tools and without spending a lot of time reinventing the wheel. Or at least without reinventing the wheels I didn't want to reinvent :).
Let me know if you enjoy FPGAs and tell me about projects you've done!
This is the fifth post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at the UART receiver and transmitter.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
A Universal Asynchronous Receiver-Transmitter (UART) is a simple interface for data transfer that only requires a transmitter (tx) and a receiver (rx) wire. There is no clock wire because both sides of the connection use their own clocks and sample the signal in order to reconstruct the bits being transferred. The agreed-upon data transfer rate (or baud rate) is usually modest and the frame encoding is also not the most efficient way of transferring data, but UARTs get the job done and are commonly used for debug consoles, modems, and other relatively low data rate interfaces.
There is a framing protocol that makes it easier to reconstruct the transferred data. This is important because failure to correctly reconstruct the data results in corrupted data being received on the other side. In this project I used a 9,600 bit/s baud rate and 8 data bits, no parity bit, and 1 stop bit (sometimes written as 8N1). The framing works as follows: the line idles at 1; a start bit (0) marks the beginning of a frame; the 8 data bits follow, least significant bit first; and a stop bit (1) ends the frame, returning the line to idle.
The job of the transmitter is to follow this framing protocol. The job of the receiver is to detect the next frame and to reconstruct the byte being transferred.
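The transmitter's side of the 8N1 framing can be sketched in software. This hypothetical Rust function (not part of the project) expands one byte into the sequence of line bits a transmitter would shift out:

```rust
/// Expand one byte into the 8N1 line bits a transmitter shifts out:
/// start bit (0), 8 data bits least-significant-first, stop bit (1).
fn frame_8n1(byte: u8) -> Vec<u8> {
    let mut bits = vec![0u8]; // start bit pulls the idle line low
    for i in 0..8 {
        bits.push((byte >> i) & 1); // data bits, LSB first
    }
    bits.push(1); // stop bit returns the line to idle
    bits
}
```

So every byte costs 10 bit periods on the wire, which is part of why the encoding is not the most efficient.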
The uart_reader and uart_writer modules implement the UART receiver and transmitter, respectively. They are designed around the rdwr_stream module's reader and writer interfaces. That means uart_reader receives the next byte from the UART rx pin whenever it is asked to read more data and uart_writer transmits on the UART tx pin whenever it is asked to write more data.
uart_reader follows a trick I learnt from the PicoSoC's simpleuart module: once the rx pin goes from 1 to 0, it waits until half the bit period has passed (at 9,600 baud with a 12 MHz clock that is 1250 / 2 = 625 clock cycles) before sampling the rx pin. This works well because the UART only transfers data on the iCESugar PCB and is not exposed to much noise. Fancier approaches involve sampling the pin every clock cycle in order to try to reconstruct the value more accurately, but they don't seem to be necessary for this project.
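The timing arithmetic behind this sampling trick is simple. A sketch, assuming the 12 MHz clock and 9,600 baud rate used in this project:

```rust
const CLK_HZ: u32 = 12_000_000; // iCESugar clock frequency
const BAUD: u32 = 9_600;

/// Clock cycles per UART bit period.
fn clk_div() -> u32 {
    CLK_HZ / BAUD
}

/// Delay after the 1 -> 0 start-bit edge. Waiting half a period means every
/// subsequent full-period tick lands in the middle of a data bit.
fn start_delay() -> u32 {
    clk_div() / 2
}
```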
Here is the core uart_reader code, a state machine that parses the incoming frame:
always @(posedge clk) begin
...
div_counter = div_counter + 1;
case (bit_counter)
0: begin // looking for the start bit
if (rx == `START_BIT) begin
div_counter = 0;
bit_counter = 1;
end
end
1: begin
/* Sample in the middle of the period */
if (div_counter == clk_div >> 1) begin
div_counter = 0;
bit_counter = 2;
end
end
10: begin // expecting the stop bit
if (div_counter == clk_div) begin
if (rx == `STOP_BIT && !reg_ready) begin
data = {24'h0, rx_buf};
data_len = 1;
reg_ready = 1;
end
bit_counter = 0;
end
end
default: begin // receive the next data bit
if (div_counter == clk_div) begin
rx_buf = {rx, rx_buf[7:1]};
div_counter = 0;
bit_counter = bit_counter + 1;
end
end
endcase
The uart_writer module is similar, but it has a transmit buffer that it sends over the UART tx pin with the framing that I've described here.
The uart_reader and uart_writer modules are responsible for receiving and transmitting data over the UART rx/tx pins. They implement the framing protocol that UARTs use to protect data. In the next post we will cover the firmware running on the PicoRV32 RISC-V soft-core that drives the I/O.
This is the third post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at the design of the virtio-serial device and how to decompose it into modules.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
A virtio-serial device is a serial controller, enabling communication with the outside world. The iCESugar FPGA development board has UART rx and tx pins connecting the FPGA to a separate microcontroller that acts as a bridge for USB serial communication. That means the FPGA can wiggle the bits on the UART tx pin to send bytes to a computer connected to the board via USB and you can receive bits from the computer through the UART rx pin. The purpose of the virtio-serial device is to present a VIRTIO device to the PicoRV32 RISC-V CPU inside the FPGA so the software on the CPU can send and receive data.
The virtio-serial device implements the Console device type defined in the VIRTIO specification and exposes it to the driver running on the CPU via the VIRTIO MMIO Transport. The terms "serial" and "console" are used interchangeably in the VIRTIO community and I will usually use serial unless I'm specifically talking about the Console device type section in the VIRTIO specification.
VIRTIO separates the concept of a device type (like net, block, or console) from the transport that allows the driver to access the device. This architecture allows VIRTIO to be used across a range of different machines, including machines that have a PCI bus, MMIO devices, and so on. Fortunately the VIRTIO MMIO transport is fairly easy to implement from scratch.
The virtio_serial_mmio module implements the virtio-serial device from the following parts: the VIRTIO MMIO Transport register logic, the virtqueue_reader and virtqueue_writer modules that process the virtqueues, the uart_reader and uart_writer modules that drive the UART pins, rdwr_stream modules that pump data between readers and writers, and the spram_mux module that multiplexes DMA access.
The virtio-serial device interfaces with the outside world through an MMIO interface that the CPU uses to access the device registers, a DMA interface for initiating RAM memory transfers, and the UART rx/tx pins for actually sending and receiving data.
Note that both the virtqueue_reader and the virtqueue_writer modules require DMA access, so I reused the spram_mux module that multiplexes the CPU and the virtio-serial device's RAM accesses. spram_mux is used inside virtio_serial_mmio to multiplex access to the single DMA interface.
Since the job of the device is to transfer data between the virtqueues and the UART rx/tx pins, it is organized around a module named rdwr_stream that constantly attempts to read data from a source and write it to a destination:
/* Stream data from a reader to a writer */
module rdwr_stream (
input clk,
input resetn,
/* The reader interface */
output reg rd_phase = 0,
input [31:0] rd_data,
input [2:0] rd_data_len,
input rd_ready,
/* The writer interface */
output reg wr_phase = 0,
output reg [31:0] wr_data = 0,
output reg [2:0] wr_data_len = 0,
input wr_ready
);
By implementing the reader and writer interfaces for the virtqueues and UART rx/tx pins, it becomes possible to pump data between them using rdwr_stream. For testing it's also possible to configure virtqueue loopback or UART loopback so that the virtqueue logic or the UART logic can be exercised in isolation.
The reader and writer interfaces that the rdwr_stream module uses are the central abstraction in the virtio-serial device. You might notice that this interface uses a phase bit rather than a valid bit like in the valid/ready interface for MMIO and DMA. Every transfer is initiated by flipping the phase bit from its previous value. I find the phase bit approach easier to work with because it distinguishes back-to-back transfers, whereas interfaces that allow the valid bit to stay 1 for back-to-back transfers are harder to debug. It would be possible to switch to a valid/ready interface though.
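Here is a toy Rust model (not the Verilog implementation) of the responder side of the phase-bit handshake. It shows why back-to-back transfers stay unambiguous: every transfer requires an explicit flip relative to the last one served:

```rust
/// Toy model of a phase/ready responder. A new transfer is requested
/// whenever the requester's phase bit differs from the last phase this
/// responder served.
struct PhasePort {
    served_phase: bool,
}

impl PhasePort {
    /// Returns true (transfer accepted) exactly once per phase flip.
    fn poll(&mut self, phase: bool) -> bool {
        if phase != self.served_phase {
            self.served_phase = phase; // consume this request
            true
        } else {
            false // no flip means no new request, even across many cycles
        }
    }
}
```

With a valid/ready interface, a valid bit held at 1 across two transfers looks identical to one long transfer in a waveform; the phase flip makes the boundary visible.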
To summarize, there are 4 reader or writer implementations that can be connected freely through the rdwr_stream module: virtqueue_reader, virtqueue_writer, uart_reader, and uart_writer.
The virtio-serial device consists of the VIRTIO MMIO Transport device registers plus two rdwr_streams that transfer data between virtqueues and the UART. The next post will look at how virtqueue processing works.
This is the second post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at integrating MMIO devices into the PicoSoC (an open source System-on-Chip using the PicoRV32 RISC-V soft-core).
There are three common ways in which devices interact with a system: memory-mapped I/O (MMIO) device registers, interrupts, and Direct Memory Access (DMA).
PicoSoC supports MMIO device registers and interrupts out of the box. It does not support DMA, but I will explain later how this can be added by modifying the code.
First let's look at implementing MMIO registers for a device in PicoSoC. The PicoRV32 CPU's memory interface looks like this:
output mem_valid // request a memory transfer
output mem_instr // hint that CPU is fetching an instruction
input mem_ready // reply to memory transfer
output [31:0] mem_addr // address
output [31:0] mem_wdata // data being written
output [ 3:0] mem_wstrb // 0000 - read
// 0001 - write 1 byte
// 0011 - write 2 bytes
// 0111 - write 3 bytes
// 1111 - write 4 bytes
input [31:0] mem_rdata // data being read
When mem_valid is 1 the CPU is requesting a memory transfer. The memory address in mem_addr is decoded and the appropriate device is selected according to the memory map (e.g. virtio-serial device at 0x04000000-0x040000ff). The selected device then handles the memory transfer and asserts mem_ready to let the CPU know that the transfer has completed.
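The address decoding step can be sketched as a simple range match. In this hypothetical Rust sketch, the virtio-serial range comes from the example above, while the RAM window is a made-up placeholder for illustration:

```rust
enum Target {
    Ram,
    VirtioSerial,
    Unmapped,
}

/// Decode mem_addr to the device selected by the memory map.
fn decode(mem_addr: u32) -> Target {
    match mem_addr {
        0x0000_0000..=0x0001_ffff => Target::Ram, // hypothetical 128 KB RAM window
        0x0400_0000..=0x0400_00ff => Target::VirtioSerial, // from the example above
        _ => Target::Unmapped,
    }
}
```

In the SoC this match is a block of Verilog comparisons on mem_addr that routes mem_valid to the selected device.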
In order to handle MMIO device register accesses, the virtio-serial device needs a similar memory interface. The register logic is implemented in a case statement that handles wdata or rdata depending on the semantics of the register. Here is the VIRTIO MMIO MagicValue register implementation that reads a constant identifying this as a VIRTIO MMIO device:
module virtio_serial_mmio (
...
input iomem_valid,
output iomem_ready,
input [3:0] iomem_wstrb,
input [7:0] iomem_addr,
input [31:0] iomem_wdata,
output [31:0] iomem_rdata,
...
);
...
always @(posedge clk) begin
...
case (iomem_addr)
`REG_MAGIC_VALUE: begin
// Note that ready and rdata are basically iomem_ready
// and iomem_rdata but there is some more glue behind
// this.
ready = 1;
rdata = `MAGIC_VALUE;
end
MMIO registers are appropriate when the CPU needs to initiate some activity in the device, but accessing them ties up the CPU during the load/store instructions that touch the device registers. For bulk data transfer it is common to use DMA instead, where a device initiates RAM data transfers itself without CPU involvement. This allows the CPU to continue running independently of device activity.
VIRTIO is built around DMA because the virtqueues live in RAM and the device initiates accesses to both the virtqueue data structures as well as the actual data buffers containing the I/O payload.
The iCESugar board has Single Port RAM (SPRAM), which means that it can only be accessed through one interface, and that interface is already connected to the CPU. In order to allow the virtio-serial device to access RAM, it is necessary to multiplex the SPRAM interface between the CPU and the virtio-serial device. I chose to implement a fixed-priority arbiter because a fancier round-robin strategy is not necessary for this project. The virtio-serial device only accesses RAM in short bursts, so the CPU will not be starved.
You can look at the spram_mux module to see the implementation, but it basically has 2 input memory interfaces and 1 output memory interface. One input interface is high priority and the other is low priority. The virtio-serial device uses the high priority port and the CPU uses the low priority port.
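The fixed-priority policy itself is tiny. Here is a sketch of the per-cycle grant decision in Rust pseudocode (the port names are illustrative, not from spram_mux):

```rust
/// Fixed-priority grant for one cycle: the high-priority port (the
/// virtio-serial device) always wins over the low-priority port (the CPU).
fn arbitrate(hi_valid: bool, lo_valid: bool) -> Option<&'static str> {
    if hi_valid {
        Some("device") // DMA happens in short bursts, so the CPU recovers quickly
    } else if lo_valid {
        Some("cpu")
    } else {
        None // no requests this cycle
    }
}
```

A round-robin arbiter would additionally remember who was granted last; with the device's short bursts that bookkeeping buys nothing here.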
The virtio-serial device is designed for DMA via a state machine that keeps track of the current memory access that is being performed. When the device sees the ready input asserted, it knows the DMA transfer has completed and it transitions to the next state (often multiple memory accesses are performed in sequence to load the virtqueue data structures).
For example, here are state machine transitions for loading the first two fields of the virtqueue descriptor:
always @(posedge clk) begin
...
if (ram_valid && ram_ready) begin
...
case (state)
...
`STATE_READ_DESCRIPTOR_ADDR_LOW: begin
desc_addr_low = ram_rdata;
ram_addr = ram_addr + 2;
state = `STATE_READ_DESCRIPTOR_LEN;
end
`STATE_READ_DESCRIPTOR_LEN: begin
desc_len = ram_rdata;
ram_addr = ram_addr + 1;
state = `STATE_READ_DESCRIPTOR_FLAGS_NEXT;
end
When the DMA transfer completes in the STATE_READ_DESCRIPTOR_ADDR_LOW state, the virtqueue descriptor's buffer address (low 32 bits) is stored into the desc_addr_low register for later use and ram_addr is updated to the memory address of the virtqueue descriptor's length field. The STATE_READ_DESCRIPTOR_LEN state has similar logic.
In other words, DMA transfers require splitting up the device implementation into a state machine that handles DMA completion in a future clock cycle. In the software world this is similar to callbacks in event loops where code is split up because we need to wait for a completion.
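The callback analogy can be made concrete with a software version of the descriptor-loading states shown above. This Rust sketch mirrors the Verilog transitions; the word-address increments (+2 to skip the high address word, +1 to reach flags/next) are taken from the code above:

```rust
enum State {
    ReadDescAddrLow,
    ReadDescLen,
    ReadDescFlagsNext,
}

struct DescLoader {
    state: State,
    ram_addr: u32, // 32-bit word address, as in the Verilog
    desc_addr_low: u32,
    desc_len: u32,
}

impl DescLoader {
    /// Called when a DMA read completes, like the `ram_valid && ram_ready`
    /// branch of the always block.
    fn on_dma_complete(&mut self, ram_rdata: u32) {
        match self.state {
            State::ReadDescAddrLow => {
                self.desc_addr_low = ram_rdata;
                self.ram_addr += 2; // skip the high address word to reach len
                self.state = State::ReadDescLen;
            }
            State::ReadDescLen => {
                self.desc_len = ram_rdata;
                self.ram_addr += 1; // advance to the flags/next word
                self.state = State::ReadDescFlagsNext;
            }
            State::ReadDescFlagsNext => {
                // remaining states omitted in this sketch
            }
        }
    }
}
```

Each `on_dma_complete` call plays the role of a completion callback; the state enum is the "where was I?" bookkeeping that straight-line software code gets for free.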
The PicoRV32 soft-core has basic interrupt support, but it does not implement the standard RISC-V Control and Status Registers (CSRs) for interrupt handling. Supporting this would require extra work on the firmware side because the existing riscv-rt Rust crate doesn't implement the PicoRV32 interrupt mechanism. Also, I ended up running low on logic cells in the FPGA, so I disabled the PicoRV32's optional interrupt support to save space. Luckily VIRTIO devices support busy waiting, so interrupts are not required.
This post described how the virtio-serial device is connected to the PicoSoC and how MMIO registers and DMA work. MMIO register implementation was easy, but I spent quite a bit of time debugging waveforms with GTKWave to make sure that the memory interface and spram_mux were both working correctly and not wasting clock cycles. In the next post we'll look at the design of the virtio-serial device.
We’d like to announce the availability of the QEMU 10.2.0 release. This release contains 2300+ commits from 188 authors.
You can grab the tarball from our download page. The full list of changes are available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
This article brings some background information for security advisories GHSA-6pp6-cm5h-86g5 and CVE-2025-2296.
Let's start with some tech background and history, which is helpful for understanding the chain of events leading to CVE-2025-2296.
The x86 linux kernel has a 'setup' area at the start of the binary, and the traditional role of that area is to hold information needed by the linux kernel to boot, for example the command line and the initrd location. The boot loader patches the setup header before handing over control to the kernel, which is how the linux kernel finds the command line you have typed into the boot loader prompt. Booting in BIOS mode still works that way, and will most likely continue to do so, even though BIOS mode's days are numbered.
In the early days of UEFI support, the boot process in UEFI mode worked quite similarly. It's known as the 'EFI handover protocol'. It turned out to have a number of disadvantages though; for example, passing additional information requires updating both the linux kernel and the boot loader. The latter is true for BIOS mode too, but new developments there are very rare with the world moving towards UEFI.
Enter 'EFI stub'. With this, the linux kernel is simply an EFI binary. Execution starts in EFI mode, so the early kernel code can make EFI calls and gather all the information needed to boot without depending on the boot loader to do this for the kernel. Version dependencies are gone. An additional bonus is that no kernel-specific knowledge is needed any more: anything that is able to start EFI binaries -- for example the EFI shell -- can be used to launch the kernel.
Qemu offers the option to launch linux kernels directly, using the -kernel command line switch. What actually happens behind the scenes is that qemu exposes the linux kernel to the guest using the firmware config interface (fw_cfg for short). The virtual machine firmware (SeaBIOS or OVMF) will fetch the kernel from qemu, place it in guest memory and launch it.
OVMF supports both the 'EFI stub' and 'EFI handover protocol' boot methods. It will try the modern 'EFI stub' way first, which actually is just 'try to start it as an EFI binary'. This btw. implies that you can load any EFI binary that way; it is not limited to linux kernels.
If starting the kernel as an EFI binary fails, OVMF will try to fall back to the old 'EFI handover protocol' method. OVMF calls the latter the 'legacy linux kernel loader' in messages printed to the screen.
So, what is the problem with secure boot? Well, there isn't only one problem; we had multiple issues:
Secure boot bypass sounds scary, but is it really?
First, the bypass is restricted to exactly one binary: the linux kernel the firmware fetches from qemu. This issue does not allow running arbitrary code inside the guest, for example some unsigned efi binary an attacker places on the ESP after breaking into the virtual machine.
Second, the guest has to trust the virtualization host to not do evil things. The host has full control over the environment the guest is running in. Providing the linux kernel image for direct kernel boot is only one little thing out of many. If an evil host wants to attack or disturb the guest there are plenty of ways to do so. The host does not need an exploit for that.
Third, many typical virtual machine configurations do not use direct kernel boot. The kernel is loaded from the guest disk instead.
So, the actual impact is quite limited.
There is no quick and easy fix available. Luckily it is also not super urgent and critical. Time to play the long game ...
Fix #1: qemu exposes an unmodified kernel image to the guest now (additionally to the traditional setup/rest split which is kept for BIOS mode and backward compatibility). Fixes the first issue.
Fix #2: qemu can expose the shim binary to the guest, using the new -shim command line switch. Fixes the third issue.
Both qemu changes are available in qemu version 10.0 (released April 2025) and newer.
Fix #3: OVMF companion changes for fixes #1 + #2, use the new fw_cfg items exposed by qemu. Available in edk2-stable202502 and newer. Both qemu and OVMF must be updated for this to work.
Fix #4: Add a config option to disable the legacy 'EFI handover protocol' loader. Leave it enabled by default because of the low security impact and the existing use cases, but print warnings to the console in case the legacy loader is used. Also present in edk2-stable202502 and newer. Addresses the second issue.
With all that in place it is possible to plug the CVE-2025-2296 hole, by flipping the new config switch to disabled.
Fix #5: Do not use the legacy loader in confidential VMs. Present in edk2-stable202511 and newer.
Roughly one year has passed since the first changes were committed to qemu and edk2. What happened? libvirt also got support for passing shim to the guest for direct kernel boot (version 11.2.0 and newer). The new versions have found their way into the major linux distributions. debian, ubuntu and fedora should all be ready for the next step now.
Fix #6: Flip the default value for the legacy loader config option to disabled. This update just landed in the edk2 git repository and will be in edk2-stable202602.
What is left to do? The final cleanup. Purge the legacy loader from the edk2 code base. Will probably happen a year or two down the road.
The edk2 changes are in X86QemuLoadImageLib.
The qemu changes are in hw/i386/x86-common.c.
With the VIRTIO 1.4 specification for I/O devices expected to be published soon, here are the most prominent changes. For more fine-grained changes like the latest offloading capabilities in virtio-net devices, please refer to the draft specification.
The most exciting changes are new device types that allow for entirely new I/O devices to be built with VIRTIO. In 1.4 there are new device types that are especially relevant for automotive and embedded systems.
In addition to the new devices, VIRTIO itself has evolved to provide new functionality across device types:
This is a nice step forward for VIRTIO. Congratulations to everyone who contributed to VIRTIO 1.4!
Wait, what? RISC-V? In ‘the diary of AArch64 porter’? WTH?
Yes, I started working on Fedora packaging for the 64-bit RISC-V architecture port.
About a week ago, one of my work colleagues asked me about my old post about speeding up Mock. We had a discussion, I pointed him to the Mock documentation, and gave some hints.
It turned out that he was working on RISC-V related changes to Fedora packages. As I had some spare cycles, I decided to take a look. And I sank…
The 64-bit RISC-V port of Fedora Linux is going quite well. Over 90% of Fedora packages are already built for that architecture. And there are several packages with riscv64-specific changes, such as:
Note that these changes are temporary. There are people working on solving toolchain issues, languages are being bootstrapped (there was a review of Java changes earlier this week), patches are being integrated upstream and in Fedora, and so on.
There is the Fedora RISC-V tracker website showing the progress of the port:
This is a simple way to check what to work on. And there are several packages not built yet due to the “ExclusiveArch” setting in them.
A quick look at the work needed reminded me of the 2012-2014 period, when I worked on the same stuff but for AArch64 ports (OpenEmbedded, Debian/Ubuntu, Fedora/RHEL). So I had the knowledge, I knew the tools, and I started working.
In the beginning, I went through entries in the tracker and tried to triage as many packages as possible, so it would be more visible which ones need work and which can be ignored (for now). The tracker went from seven to over eighty triaged packages in a few days.
Then I looked at changes done by current porters. Which usually meant David Abdurachmanov. I used his changes as a base for the changes needed for Fedora packaging, while trying to keep them to the minimum required.
I opened over twenty pull requests to Fedora packages during a week of work.
But which hardware did I use to run those hundreds of builds? Was it HiFive Premier P550? Or maybe Milk-V Titan or another RISC-V SBC?
Nope. I used my 80-core, Altra-based, AArch64 desktop to run all those builds. With the QEMU userspace helper.
You see, Mock allows running builds for foreign architectures — all you need is
the proper qemu-user-static-* package and you are ready to go:
$ fedpkg mockbuild -r fedora-43-riscv64
You can extract the “fedora-43-riscv64” Mock config from the mock-riscv64-configs.patch hosted on the Fedora RISC-V port forge. I hope these configuration files will land in the “mock-core-configs” package in Fedora soon.
At some point I had 337 qemu-user-static-riscv processes running at the same
time. And you know what? It was still faster than a native build on 64-bit
RISC-V hardware.
But, to be honest, I only compared a few builds, so it may be better with other builders. Fedora RISC-V Koji uses a wide list of SBCs to build on:
Also note that using QEMU is not a solution for building a distribution. I used it only to check whether a package builds, and then scrapped the results.
Will I continue working on the RISC-V port of Fedora Linux? Probably yes. And, at some point, I will move to integrating those changes into CentOS Stream 10.
For sure I do not want to invest in RISC-V hardware. Existing models are not worth the money (in my opinion), incoming ones are still old (RVA20/RVA22) and they are slow. Maybe in two or three years there will be something fast enough.
The Google Summer of Code (GSoC) Mentor Summit 2025 took place from October 23rd to 25th in Munich, Germany. This event marks the conclusion of the annual program, bringing together mentors from all over the world. QEMU had another successful year with several interesting projects (details on our organization page), and it was a pleasure for me to represent the QEMU community at the summit, joining mentors from over 100 open source organizations to discuss the program, share experiences, and talk about open source challenges.
The summit follows an “unconference” format. There is no pre-planned schedule; instead, attendees propose sessions on the first day based on what they want to discuss. Since the event moved to Munich this year, it was a great opportunity for me to join and meet people from other communities face-to-face.

During the “Lightning talks” session, mentors had a short slot to introduce their projects. I presented the project I mentored this summer: vhost-user devices in Rust on macOS and *BSD.
The student, Wenyu Huang, worked on extending rust-vmm crates
(specifically vhost, vhost-user-backend, and vmm-sys-utils) to support
vhost-user devices on non-Linux POSIX systems. This work is important for
portability, allowing rust-vmm components to also run on macOS and *BSD.
You can find the full details and the code in the final project report.
This project focused primarily on the rust-vmm ecosystem rather than QEMU
itself. This was possible thanks to QEMU acting as an umbrella organization,
allowing related projects like rust-vmm to participate in the program.
Networking with other mentors was a key part of the event. It was nice to see that QEMU is well-recognized; many mentors I met were familiar with the project, which made it easy to start conversations. We exchanged views on how to handle the mentorship lifecycle, from interviewing GSoC applicants (and the impact of AI on that process) to the coding phase. We shared tips on how to best help students during the summer, such as setting up regular meetings and maintaining effective communication.
I also attended several sessions covering different topics. The most interesting discussions were:
The “sticker table” and “chocolate table” are traditions of the summit. I enjoyed trying chocolates from different countries. Unfortunately, I didn’t have any QEMU stickers to share this time. We should definitely plan to bring a stack for next year!
We really believe that GSoC is a great and useful program, as it brings new ideas and contributors to our community. We will definitely apply again for GSoC 2026, and we hope to have the chance to join the Mentor Summit again next year!
If something goes wrong it usually is very helpful to have log files at hand. Virtual machine firmware is no exception. So, let's have a look at common practices.
On the x86 platform qemu provides an isa-debugcon device. That is the simplest possible logging device, featuring a single ioport. Reading from the ioport returns a fixed value, which can be used to detect whether the device is present or not. Writing to the ioport sends the character written to the chardev backend linked to the device.
By convention the qemu firmware -- both seabios and OVMF -- uses the ioport address 0x402. So getting the firmware log printed on your terminal works this way:
qemu-system-x86_64 \
-chardev stdio,id=fw \
-device isa-debugcon,iobase=0x402,chardev=fw
When using libvirt you can use this snippet (in
the <devices> section) to send the firmware log
to file:
<serial type='null'>
<log file='/path/to/firmware.log' append='off'/>
<target type='isa-debug' port='1'>
<model name='isa-debugcon'/>
</target>
<address type='isa' iobase='0x402'/>
</serial>
Note that virsh console will connect to the first
serial device specified in the libvirt xml, so this should be
inserted after other serial devices to avoid breaking your
serial console setup.
On the arm virt platform there is no special device for the firmware log. The logs are sent to the serial port instead. That is inconvenient when using a serial console though, so linux distros typically ship two variants of the arm firmware image: one with logging turned on, and one with logging turned off (on RHEL and Fedora the latter has 'silent' in the filename). By default libvirt uses the silent builds, so in case you need the debug log you have to change the VM configuration to use the other variant.
Recently (end of 2023) the arm builds learned a new trick. In case two serial ports are present the output will be split. The first serial port is used for the console, the second port for the debug log. With that the console is actually usable with the verbose firmware builds. To enable that you simply need two serial devices in your libvirt config:
<serial type='pty'>
<log file='/path/to/console.log' append='off'/>
<target type='system-serial' port='0'>
<model name='pl011'/>
</target>
</serial>
<serial type='null'>
<log file='/path/to/firmware.log' append='off'/>
<target type='system-serial' port='1'>
<model name='pl011'/>
</target>
</serial>
Starting with the edk2-stable202508 tag OVMF supports
logging to a memory buffer. The feature is disabled by default and
must be turned on at compile time using the -D
DEBUG_TO_MEM=TRUE option when building the firmware.
There are multiple ways to access the log memory buffer. First is
a small efi application which can print the log to the efi console
(source
code,
x64
binary). Pass -p or pager as argument
on the command line to enable the built-in pager.
The second way is a recent linux kernel; version 6.17 got a new bool
config option: OVMF_DEBUG_LOG. When enabled the linux
kernel will make the firmware log available via sysfs. If supported
by both kernel and firmware the log will show up in
the /sys/firmware/efi directory with the
filename ovmf_debug_log.
The third option is a qemu monitor command. The changes just landed in the qemu master branch and will be available in the next qemu release (10.2) expected later this year. Both a qmp command (query-firmware-log) and a hmp command (info firmware-log) are available. This is useful to diagnose firmware failures that happen early enough in boot that the other options can not be used.
My KVM Forum 2025 talk "Making io_uring Pervasive in QEMU" is now available on YouTube. The slides are also available here (PDF).
This talk is about integrating Linux io_uring into QEMU's event loop to enable performance optimizations and use new kernel features available through io_uring. This topic is also relevant for other I/O-intensive applications (network services, software-defined networking or storage systems, databases, etc) that require modifications in order to take advantage of io_uring. The challenge is usually how to move from a reactor-based event loop that monitors file descriptors to a proactor-based event loop that waits for asynchronous operation completion. In QEMU this can be solved by keeping existing fd monitoring users while introducing an API for io_uring request submission that new code can use.
We’d like to announce the availability of the QEMU 10.1.0 release. This release contains 2700+ commits from 226 authors.
You can grab the tarball from our download page. The full list of changes are available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
File systems and relational databases are like cousins. They share more than is apparent at first glance.
It's not immediately obvious that relational databases and file systems rely upon the same underlying concept. That underlying concept is the key-value store and this article explores how both file systems and databases can be implemented on top of key-value stores.
Key-value stores provide an ordered map data structure. A map is a data structure that supports storing and retrieving from a collection of pairs. It's called a map because it is like a mathematical relation from a given key to an associated value. These are the key-value pairs that a key-value store holds. Finally, ordered means that the collection can be traversed in sorted key order. Not all key-value store implementations support ordered traversal, but both file systems and databases need this property as we shall see.
Here is a key-value store with an integer key and a string value:
Notice that the keys can be enumerated in sorted order: 2 → 14 → 17.
A key-value store provides the following interface for storing and retrieving values by a given key:
You've probably seen this sort of API if you have explored libraries like LevelDB, RocksDB, LMDB, BoltDB, etc or used NoSQL key-value stores. File systems and databases usually implement their own customized key-value stores rather than use these off-the-shelf solutions.
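To make the interface concrete, here is a toy ordered map in Python. The method names (put, get, delete, traverse) are placeholders for whatever a given library calls them, and the sorted list makes insertion O(n), so this only demonstrates the interface; real key-value stores use B+ trees or LSM-trees instead:

```python
import bisect

class OrderedKVStore:
    """Toy ordered map: a sorted key list with a parallel value list."""
    def __init__(self):
        self._keys = []
        self._values = []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value       # update an existing key
        else:
            self._keys.insert(i, key)     # insertion keeps keys sorted
            self._values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def delete(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]
            del self._values[i]

    def traverse(self):
        """Yield key-value pairs in sorted key order."""
        yield from zip(self._keys, self._values)

store = OrderedKVStore()
store.put(17, "c")
store.put(2, "a")
store.put(14, "b")
print([k for k, _ in store.traverse()])   # keys come out in order: [2, 14, 17]
```

Note how ordered traversal falls out of the representation; that property is what file systems and databases rely on, as we shall see.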
Let's look at how the key-value store interface relates to disks. Disks present a range of blocks that can be read or written at their block addresses. Disks can be thought of like arrays in programming. They have O(1) lookup and update time complexity but inserting or removing a value before the end of the array is O(n) because subsequent elements need to be copied. They are efficient for dense datasets where every element is populated but inefficient for sparse datasets that involve insertion and removal.
Workloads that involve insertion or removal are not practical when the cost is O(n) for realistic sizes of n. That's why programs often use in-memory data structures like hash tables or balanced trees instead of arrays. Key-value stores can be thought of as the on-disk equivalent to these in-memory data structures. Inserting or removing values from a key-value store takes sub-linear time, perhaps O(log n) or even better amortized time. We won't go into the data structures used to implement key-value stores, but B+ trees and Log-Structured Merge-Trees are popular choices.
This gives us an intuition about when key-value stores are needed and why they are an effective tool. Now let's look at how file systems and databases can be built on top of key-value stores next.
First let's start with how data is stored in files. A file system locates file data on disk by translating file offsets to Logical Block Addresses (LBAs). This is necessary because file data may not be stored contiguously on disk and files can be sparse with unallocated "holes" where nothing has been written yet. Thus, each file can be implemented as a key-value store with <Offset, <LBA, Length>> key-value pairs that comprise the translations needed to locate data on disk:
Reading and writing to the file involves looking up Offset -> LBA translations and inserting new translations when new blocks are allocated for the file. This is a good fit for a key-value store, but it's not the only place where file systems employ key-value stores.
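A hypothetical sketch of such a translation lookup in Python — the extent map, block numbers, and LBAs below are invented for illustration, with offsets and lengths counted in blocks:

```python
import bisect

# Extent map for one sparse file: file block offset -> (LBA, length in blocks).
# Blocks 4-7 are an unallocated "hole" where nothing has been written yet.
extents = {0: (1000, 4), 8: (2000, 2)}
offsets = sorted(extents)   # sorted keys enable the containing-extent search

def resolve(file_block):
    """Translate a file block number to an on-disk LBA, or None for a hole."""
    i = bisect.bisect_right(offsets, file_block) - 1   # last extent at or before
    if i >= 0:
        start = offsets[i]
        lba, length = extents[start]
        if file_block < start + length:
            return lba + (file_block - start)
    return None   # falls in a hole

print(resolve(1))   # inside the first extent -> 1001
print(resolve(5))   # hole -> None
print(resolve(9))   # inside the second extent -> 2001
```

The "find the largest key less than or equal to X" search used here is exactly the kind of ordered lookup that an unordered hash map cannot provide.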
File systems track free blocks that are not in use by files or metadata so that the block allocator can quickly satisfy allocation requests. This can be implemented as a key-value store with <LBA, Length> key-value pairs representing all free LBA ranges.
If the block allocator needs to satisfy contiguous allocation requests then a second key-value store with <Length, LBA> key-value pairs can serve as an efficient lookup or index. A best-fit allocator uses this key-value store by looking up the requested contiguous allocation size. Either a free LBA range of the matching size will be found, or, when that lookup fails, traversing to the next ordered key finds a bigger free range capable of satisfying the allocation request. This is an important pattern with key-value stores: we can have one main key-value store plus one or more indices that are derived from the same key-value pairs but use a different datum as the key than the primary key-value store, allowing efficient lookups and ordered traversal. The same pattern will come up in databases too.
Next, let's look at how to represent directory metadata in a key-value store. Files are organized into a hierarchy of directories (or folders). The file system stores the directory entries belonging to each directory. Each directory can be organized as a key-value store with filenames as keys and inode numbers as values. Path traversal consists of looking up directory entries in each directory along file path components like home, user, and file in the path /home/user/file. When a file is created, a new directory entry is inserted. When a file is deleted, its directory entry is removed. The contents of a directory can be listed by traversing the keys.
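A hypothetical sketch of path traversal over per-directory key-value stores — the inode numbers are made up, and plain dicts stand in for the on-disk structures:

```python
# Each directory is its own key-value store: filename -> inode number.
# The outer dict stands in for locating a directory by its inode number.
directories = {
    2:  {"home": 10},     # inode 2 is the root directory
    10: {"user": 11},
    11: {"file": 12},
}

def lookup(path):
    """Walk a path one component at a time, one key lookup per directory."""
    inode = 2                               # start at the root directory
    for name in path.strip("/").split("/"):
        entries = directories.get(inode)
        if entries is None or name not in entries:
            return None                     # ENOENT
        inode = entries[name]
    return inode

print(lookup("/home/user/file"))   # -> 12
print(lookup("/home/missing"))     # -> None
```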
Some file systems like BTRFS use key-value stores for other on-disk structures such as snapshots, checksums, etc, too. There is even a root key-value store in BTRFS from which all these other key-value stores can be looked up. We'll see that the same concept of a "forest of trees" or a root key-value store that points to other key-value stores also appears in databases below.
Update (2025-07-21): Another good example of the connection between file systems and key-value stores is the "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution" paper (PDF) where a key-value store is used to hold the metadata and raw blocks are used to hold file data.
The core concept in relational databases is the table, which contains the rows of the data we wish to store. The table columns are the various fields that are stored by each row. One or more columns make up the primary key by which table lookups are typically performed. The table can be implemented as a key-value store using the primary key columns as the key and the remainder of the columns as the value:
This key-value store can look up rows in the table by their Id. What if we want to look up a row by Username instead?
To enable efficient lookups by Username, a secondary key-value store called an index maintains a mapping from Username to Id. The index does not duplicate all the columns in the table, just the Username and Id. To perform a query like SELECT * FROM Users WHERE Username = 'codd', the index is first used to look up the Id and then the remainder of the columns are looked up from the table.
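A sketch of that two-step lookup in Python — plain dicts stand in for the table and index key-value stores, and the Email column and its values are invented for illustration:

```python
# The table: primary key (Id) -> the remaining columns.
users = {
    1: {"Username": "codd", "Email": "codd@example.com"},
    2: {"Username": "kent", "Email": "kent@example.com"},
}
# The secondary index: Username -> Id. It duplicates no other columns.
username_index = {"codd": 1, "kent": 2}

def select_by_username(username):
    """SELECT * FROM Users WHERE Username = ?: index first, then the table."""
    user_id = username_index.get(username)
    if user_id is None:
        return None
    row = {"Id": user_id}
    row.update(users[user_id])   # fetch the remaining columns from the table
    return row

print(select_by_username("codd"))
```

Keeping the index small (just Username and Id) means updates to other columns never touch it, at the cost of one extra lookup per indexed query.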
SQLite's file format documentation shows the details of how data is organized along these lines and the power of key-value stores. The file format has a header that references the "table b-tree" that points to the roots of all tables. This means there is an entry point key-value store that points to all the other key-value stores associated with tables, indices, etc in the database. This is similar to the forest of trees we saw in the BTRFS file system where the key-value store acts as the central data structure tying everything together.
If a disk is like an array in programming, then a key-value store is like a dict. It offers a convenient interface for storing and retrieving sparse data with good performance. Both file systems and databases are abundant with sparse data and therefore fit naturally on top of key-value stores. The actual key-value store implementations inside file systems and databases may be specialized variants of B-trees and other data structures that don't even call themselves key-value stores, but the fundamental abstraction upon which file systems and databases are built is the key-value store.
Some months ago, my colleague Madeeha Javed and I wrote a tool to convert QEMU disk images into qcow2, writing the result directly to stdout.
This tool is called qcow2-to-stdout.py and can be used for example to create a new image and pipe it through gzip and/or send it directly over the network without having to write it to disk first.
This program is included in the QEMU repository: https://github.com/qemu/qemu/blob/master/scripts/qcow2-to-stdout.py
If you simply want to use it then all you need to do is have a look at these examples:
$ qcow2-to-stdout.py source.raw > dest.qcow2
$ qcow2-to-stdout.py -f dmg source.dmg | gzip > dest.qcow2.gz
If you’re interested in the technical details, read on.
QEMU uses disk images to store the contents of the VM’s hard drive. Images are often in qcow2, QEMU’s native format, although a variety of other formats and protocols are also supported.
I have written in detail about the qcow2 format in the past (for example, here and here), but the general idea is very easy to understand: the virtual drive is divided into clusters of a certain size (64 KB by default), and only the clusters containing non-zero data need to be physically present in the qcow2 image. So what we have is essentially a collection of data clusters and a set of tables that map guest clusters (what the VM sees) to host clusters (what the qcow2 file actually stores).
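As a rough sketch of that mapping, here is how a guest offset decomposes into the indices used to walk the qcow2 L1/L2 tables, assuming the default 64 KB clusters and 8-byte L2 entries:

```python
CLUSTER_BITS = 16                    # 64 KB clusters, the qcow2 default
CLUSTER_SIZE = 1 << CLUSTER_BITS
L2_ENTRIES = CLUSTER_SIZE // 8       # 8192 entries, since each L2 entry is 8 bytes

def locate(guest_offset):
    """Split a guest offset into (L1 index, L2 index, offset within cluster)."""
    cluster = guest_offset >> CLUSTER_BITS
    return (cluster // L2_ENTRIES,           # which L2 table (index into L1)
            cluster % L2_ENTRIES,            # which entry within that L2 table
            guest_offset & (CLUSTER_SIZE - 1))

# 1 GiB into the disk is guest cluster 16384: entry 0 of the third L2 table.
print(locate(1 << 30))   # -> (2, 0, 0)
```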
qemu-img is a powerful and versatile tool that can be used to create, modify and convert disk images. It has many different options, but one question that sometimes arises is whether it can use stdin or stdout instead of regular files when converting images.
The short answer is that this is not possible in general. qemu-img convert works by checking the (virtual) size of the source image, creating a destination image of that same size and finally copying all the data from start to finish.
Reading a qcow2 image from stdin doesn’t work because data and metadata blocks can come in any arbitrary order, so it’s perfectly possible that the information that we need in order to start writing the destination image is at the end of the input data¹.
Writing a qcow2 image to stdout doesn’t work either because we need to know in advance the complete list of clusters from the source image that contain non-zero data (this is essential because it affects the destination file’s metadata). However, if we do have that information then writing a new image directly to stdout is technically possible.
The bad news is that qemu-img won’t help us here: it uses the same I/O code as the rest of QEMU. This generic approach makes total sense because it’s simple, versatile and is valid for any kind of source and destination image that QEMU supports. However, it needs random access to both images.
If we want to write a qcow2 file directly to stdout we need new code written specifically for this purpose, and since it cannot reuse the logic present in the QEMU code this was written as a separate tool (a Python script).
The process itself goes like this:
Images created with this program always have the same layout: header, refcount tables and blocks, L1 and L2 tables, and finally all data clusters.
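A simplified sketch of the first pass in Python — finding which clusters of the raw input contain non-zero data, which is the information needed before any metadata can be emitted (the real script streams from a file and also handles partial clusters, table layout, and the FUSE export described below):

```python
def nonzero_clusters(data, cluster_size=65536):
    """Return the indices of clusters that contain any non-zero byte.
    `data` stands in for the raw input image."""
    allocated = []
    for start in range(0, len(data), cluster_size):
        cluster = data[start:start + cluster_size]
        if cluster.count(0) != len(cluster):   # at least one non-zero byte
            allocated.append(start // cluster_size)
    return allocated

# A 3-cluster image where only the middle cluster holds data.
image = b"\x00" * 65536 + b"hello".ljust(65536, b"\x00") + b"\x00" * 65536
print(nonzero_clusters(image))   # -> [1]
```

Only after this pass is complete can the refcount tables, L1/L2 tables, and data clusters be written out in order.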
One problem here is that, while QEMU can read many different image formats, qcow2-to-stdout.py is an independent tool that does not share any of the code and therefore can only read raw files. The solution here is to use qemu-storage-daemon. This program is part of QEMU and it can use FUSE to export any file that QEMU can read as a raw file. The usage of qemu-storage-daemon is handled automatically and the user only needs to specify the format of the source file:
$ qcow2-to-stdout.py -f dmg source.dmg > dest.qcow2
qcow2-to-stdout.py can only create basic qcow2 files and does not support features like compression or encryption. However, a few parameters can be adjusted, like the cluster size (-c), the width of the reference count entries (-r) and whether the new image is created with the input as an external data file (-d and -R).
And this is all, I hope that you find this tool useful and this post informative. Enjoy!
This work has been developed by Igalia and sponsored by Outscale, a Dassault Systèmes brand.
¹ This problem would not happen if the input data was in raw format but in this case we would not know the size in advance.
Coinciding with the switch to the new Network Express adapters, we have updated our documentation to include the new card here.
From the introduction:
"This publication explores what PCI network adapters offer for network connections of Linux® on IBM Z® and IBM® LinuxONE. The publication applies to Network Express and RoCE Express2 or RoCE Express3 adapters."
Note that Network Express adapters now also support promiscuous mode as required by Open vSwitch in a KVM context!
A fair amount of the development work I do is related to storage performance in QEMU/KVM. Although I have written about disk I/O benchmarking and my performance analysis workflow in the past, I haven't covered the performance tools that I frequently rely on. In this post I'll go over what's in my toolbox and hopefully this will be helpful to others.
Performance analysis is hard when the system is slow but there is no clear bottleneck. If a CPU profile shows that a function is consuming significant amounts of time, then that's a good target for optimizations. On the other hand, if the profile is uniform and each function only consumes a small fraction of time, then it is difficult to gain much by optimizing just one function (although taking function call nesting into account may point towards parent functions that can be optimized):
If you are measuring just one metric, then eventually the profile will become uniform and there isn't much left to optimize. It helps to measure at multiple layers of the system in order to increase the chance of finding bottlenecks.
Here are the tools I like to use when hunting for QEMU storage performance optimizations.
kvm_stat is a tool that runs on the host and counts events from the kvm.ko kernel module, including device register accesses (MMIO), interrupt injections, and more. These are often associated with vmentry/vmexit events where control passes between the guest and the hypervisor. Each time this happens there is a cost and it is preferable to minimize the number of vmentry/vmexit events.
kvm_stat will let you identify inefficiencies when guest drivers are accessing devices as well as low-level activity (MSRs, nested page tables, etc) that can be optimized.
Here is output from an I/O intensive workload:
Event                      Total  %Total  CurAvg/s
kvm_entry                1066255    20.3     43513
kvm_exit                 1066266    20.3     43513
kvm_hv_timer_state        954944    18.2     38754
kvm_msr                   487702     9.3     19878
kvm_apicv_accept_irq      278926     5.3     11430
kvm_apic_accept_irq       278920     5.3     11430
kvm_vcpu_wakeup           250055     4.8     10128
kvm_pv_tlb_flush          250000     4.8     10123
kvm_msi_set_irq           229439     4.4      9345
kvm_ple_window_update     213309     4.1      8836
kvm_fast_mmio             123433     2.3      5000
kvm_wait_lapic_expire      39855     0.8      1628
kvm_apic_ipi                9614     0.2       457
kvm_apic                    9614     0.2       457
kvm_unmap_hva_range            9     0.0         1
kvm_fpu                       42     0.0         0
kvm_mmio                      28     0.0         0
kvm_userspace_exit            21     0.0         0
vcpu_match_mmio               19     0.0         0
kvm_emulate_insn              19     0.0         0
kvm_pio                        2     0.0         0
Total                    5258472            214493
Here I'm looking for high CurAvg/s rates. Any counters at 100k/s are well worth investigating.
Important events:
sysstat is a venerable suite of performance monitoring tools covering CPU, network, disk, and memory activity. It can be used equally within guests and on the host. It is like an extended version of the classic vmstat(8) tool.
The mpstat, pidstat, and iostat tools are the ones I use most often:
blktrace monitors Linux block I/O activity. This can be used equally within guests and on the host. Often it's interesting to capture traces in both the guest and on the host so they can be compared. If the I/O pattern is different then something in the I/O stack is modifying requests. That can be a sign of a misconfiguration.
The blktrace data can be analyzed and plotted with the btt command. For example, the latencies from driver submission to completion can be summarized to find the overhead compared to bare metal.
In its default mode, perf-top is a sampling CPU profiler. It periodically collects the CPU's program counter value so that a profile can be generated showing hot functions. It supports call graph recording with the -g option if you want to aggregate nested function calls and find out which parent functions are responsible for the most CPU usage.
perf-top (and its non-interactive perf-record/perf-report cousin) is good at identifying inner loops of programs, excessive memcpy/memset, expensive locking instructions, instructions with poor cache hit rates, etc. When I use it to profile QEMU it shows the host kernel and QEMU userspace activity. It does not show guest activity.
Here is the output without call graph recording where we can see vmexit activity, QEMU virtqueue processing, and excessive memset at the top of the profile:
3.95%  [kvm_intel]          [k] vmx_vmexit
3.37%  qemu-system-x86_64   [.] virtqueue_split_pop
2.64%  libc.so.6            [.] __memset_evex_unaligned_erms
2.50%  [kvm_intel]          [k] vmx_spec_ctrl_restore_host
1.57%  [kernel]             [k] native_irq_return_iret
1.13%  qemu-system-x86_64   [.] bdrv_co_preadv_part
1.13%  [kernel]             [k] sync_regs
1.09%  [kernel]             [k] native_write_msr
1.05%  [nvme]               [k] nvme_poll_cq
The goal is to find functions or families of functions that consume significant amounts of CPU time so they can be optimized. If the profile is uniform with most functions taking less than 1%, then the bottlenecks are more likely to be found with other tools.
perf-trace is an strace-like tool for monitoring system call activity. For performance monitoring it has a --summary option that shows time spent in system calls and the counts. This is how you can identify system calls that block for a long time or that are called too often.
Here is an example from the host showing a summary of a QEMU IOThread that uses io_uring for disk I/O and ioeventfd/irqfd for VIRTIO device activity:
IO iothread1 (332737), 1763284 events, 99.9%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
io_uring_enter 351680 0 8016.308 0.003 0.023 0.209 0.11%
write 390189 0 1238.501 0.002 0.003 0.098 0.08%
read 121057 0 305.355 0.002 0.003 0.019 0.14%
ppoll 18707 0 62.228 0.002 0.003 0.023 0.32%
We looked at kvm_stat, sysstat, blktrace, perf-top, and perf-trace. They provide performance metrics from different layers of the system. Another worthy mention is the bcc collection of eBPF-based tools that offers a huge array of performance monitoring and troubleshooting tools. Let me know which tools you like to use!
We’d like to announce the availability of the QEMU 10.0.0 release. This release contains 2800+ commits from 211 authors.
You can grab the tarball from our download page. The full list of changes are available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Canonical released a new version of their Ubuntu server offering Ubuntu Server 25.04!
Our latest offering in the IBM Z family, IBM z17, was announced yesterday. General availability will be June 18.
See the official announcement, the updated Linux support matrix, and the Technical Guide with all the insights you will need.
QEMU is participating in Google Summer of Code again this year! Google Summer of Code is an open source internship program that offers paid remote work opportunities for contributing to open source. Internships run May through August, so if you have time and want to experience open source development, read on to find out how you can apply.
Each intern is paired with one or more mentors, experienced QEMU contributors who support them during the internship. Code developed by the intern is submitted through the same open source development process that all QEMU contributions follow. This gives interns experience with contributing to open source software. Some interns then choose to pursue a career in open source software after completing their internship.
Information on who can apply for Google Summer of Code is here.
Look through the list of QEMU project ideas and see if there is something you are interested in working on. Once you have found a project idea you want to apply for, email the mentor for that project idea to ask any questions you may have and discuss the idea further.
You can apply for Google Summer of Code from March 24th to April 8th.
Good luck with your applications!
If you have questions about applying for QEMU GSoC, please email Stefan Hajnoczi or ask on the #qemu-gsoc IRC channel.
If you run into a situation where migration fails with something like
internal error: QEMU unexpectedly closed the monitor (vm='testguest'):
qemu-kvm: Some features requested in the CPU model are not available in the current configuration: pckmo-aes-256 pckmo-aes-192 pckmo-aes-128 pckmo-etdea-192 pckmo-etdea-128 pckmo-edea msa9_pckmo Consider a different accelerator, QEMU, or kernel version
This indicates that the two hosts are configured differently regarding the CPACF key management operations (PCKMO).

You can either (preferred, if your security scheme allows for it) configure both LPARs the same way, then de-activate and re-activate the LPAR,

or

you can change the CPU model of the guest to no longer use these pckmo functions. Change your guest XML from "host-model" to "host-model with some features disabled".
So shut down the guest and change the XML from
<cpu mode='host-model' check='partial'/>
to
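The target XML is not shown here, but as a hedged sketch, "host-model with some features disabled" could look like this, disabling the features named in the error message above (verify the exact feature names against your error output and libvirt version):

```xml
<!-- Hypothetical sketch: host-model with the pckmo features from the
     error message explicitly disabled. Adjust to your actual list. -->
<cpu mode='host-model' check='partial'>
  <feature policy='disable' name='pckmo-aes-256'/>
  <feature policy='disable' name='pckmo-aes-192'/>
  <feature policy='disable' name='pckmo-aes-128'/>
  <feature policy='disable' name='pckmo-etdea-192'/>
  <feature policy='disable' name='pckmo-etdea-128'/>
  <feature policy='disable' name='pckmo-edea'/>
  <feature policy='disable' name='msa9_pckmo'/>
</cpu>
```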
We’d like to announce the availability of the QEMU 9.2.0 release. This release contains 1700+ commits from 209 authors.
You can grab the tarball from our download page. The full list of changes is available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Recently, I needed to debug a problem that only occurred in RHCOS images running in secure execution mode on an IBM Z system. Since I don’t have an OCP installation at hand, I wanted to run such an image directly with QEMU or libvirt. This sounded easy at first glance, since there are qcow2 images available for RHCOS, but in the end it was quite tricky to get working, so I’d like to summarize the steps here in case it’s helpful for somebody else, too. Since the “secex” images are encrypted, you cannot play the usual tricks with e.g. guestfs here; you have to go through the ignition process of the image first. Well, maybe the right documentation for this is already available somewhere and I missed it, but most other documents mainly talk about x86 or normal (unencrypted) images (like the one for Fedora CoreOS on libvirt), so I think this summary will be helpful here anyway.
First, make sure that you have the right tools installed for this task:
sudo dnf install butane wget mkpasswd openssh virt-install qemu-img
Since we are interested in the secure execution image, we have to download the image with “secex” in the name, together with the right GPG key that is required for encrypting the config file later, for example:
wget https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.16/4.16.3/rhcos-qemu-secex.s390x.qcow2.gz
wget https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.16/4.16.3/rhcos-ignition-secex-key.gpg.pub
Finally, uncompress the image. And since we want to avoid modifying the original image, let’s also create an overlay qcow2 image for it:
gzip -d rhcos-qemu-secex.s390x.qcow2.gz
qemu-img create -f qcow2 -b rhcos-qemu-secex.s390x.qcow2 -F qcow2 rhcos.qcow2
To be able to log in to your guest via ssh later, you need an ssh key, so let’s create one and add it to your local ssh-agent:
ssh-keygen -f rhcos-key
ssh-add rhcos-key
If you also want to log in on the console via password, create a password hash with the mkpasswd tool, too.
Now create a butane configuration file and save it as “config.bu”:
variant: fcos
version: 1.4.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - INSERT_THE_CONTENTS_OF_YOUR_rhcos-key.pub_FILE_HERE
      password_hash: INSERT_THE_HASH_FROM_mkpasswd_HERE
      groups:
        - wheel
storage:
  files:
    - path: /etc/se-hostkeys/ibm-z-hostkey-1
      overwrite: true
      contents:
        local: HKD.crt
systemd:
  units:
    - name: [email protected]
      mask: false
Make sure to replace the “INSERT_…” markers in the file with the contents of your rhcos-key.pub and the hash from mkpasswd, and also make sure to have the host key document (required for encrypting the guest with genprotimg) available as HKD.crt in the current directory.
Next, the butane config file needs to be converted into an ignition file, which then needs to be encrypted with the GPG key of the RHCOS image:
butane -d . config.bu > config.ign
gpg --recipient-file rhcos-ignition-secex-key.gpg.pub --yes \
--output config.crypted --armor --encrypt config.ign
The encrypted config file can now be used to start the ignition of the guest. On s390x, the config file is not presented via the “fw_cfg” mechanism to the guest (like it is done on x86), but with a drive that has a special serial number. Thus QEMU should be started like this:
/usr/libexec/qemu-kvm -d guest_errors -accel kvm -m 4G -smp 4 -nographic \
-object s390-pv-guest,id=pv0 -machine confidential-guest-support=pv0 \
-drive if=none,id=dr1,file=rhcos.qcow2,auto-read-only=off,cache=unsafe \
-device virtio-blk,drive=dr1 -netdev user,id=n1,hostfwd=tcp::2222-:22 \
-device virtio-net-ccw,netdev=n1 \
-drive if=none,id=drv_cfg,format=raw,file=config.crypted,readonly=on \
-device virtio-blk,serial=ignition_crypted,iommu_platform=on,drive=drv_cfg
This should start the ignition process during the first boot of the guest. During future boots of the guest, you don’t have to specify the drive with the “config.crypted” file anymore.

Once the ignition is done, you can log in to the guest either on the console with the password that you created with mkpasswd, or via ssh:
ssh -p 2222 core@localhost
Now you should be able to use the image. But keep in mind that this is an rpm-ostree based image, so for installing additional packages you have to use rpm-ostree install instead of dnf install here. And the kernel can be replaced like this, for example:
sudo rpm-ostree override replace \
kernel-5.14.0-...s390x.rpm \
kernel-core-5.14.0-...s390x.rpm \
kernel-modules-5.14.0-...s390x.rpm \
kernel-modules-core-5.14.0-...s390x.rpm \
kernel-modules-extra-5.14.0-...s390x.rpm
That’s it! Now you can enjoy your configured secure-execution RHCOS image!
Special thanks to Nikita D. for helping me understand the ignition process of the secure execution images.
My KVM Forum 2024 talk "IOThread Virtqueue Mapping: Improving virtio-blk SMP scalability in QEMU" is now available on YouTube. The slides are also available here.
IOThread Virtqueue Mapping is a new QEMU feature for configuring multiple IOThreads that will handle a virtio-blk device's virtqueues. This means QEMU can take advantage of the host's Linux multi-queue block layer and assign CPUs to I/O processing. Giving additional resources to virtio-blk emulation allows QEMU to achieve higher IOPS and saturate fast local NVMe drives. This is especially important for applications that submit I/O on many vCPUs simultaneously - a workload that QEMU had trouble keeping up with in the past.
You can read more about IOThread Virtqueue Mapping in this Red Hat blog post.
We’d like to announce the availability of the QEMU 9.1.0 release. This release contains 2800+ commits from 263 authors.
You can grab the tarball from our download page. The full list of changes is available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Ever struggled to create configuration files for starting Linux on IBM Z and LinuxONE installations? Fear no more, we got you covered now: A new assistant available online will help you create parameter files!
Writing parameter files can be a challenge, with mistakes triggering debug cycles with lengthy turnaround times. Our new installation assistant generates installer parameter files by walking you through a step-by-step process, where you answer simple questions to generate a parameter file. It comes with contextual help at every stage, so you can follow along with what is happening!
Currently supports OSA and PCI networking devices, IPv4/v6, and VLAN installations.
Currently supports RHEL 9 and SLES 15 SP5 or later.
Access the assistant at https://ibm.github.io/liz/
Step number one for the firmware on any system is sending out a DHCP request, asking the DHCP server for an IP address, the boot server (called "next server" in dhcp terms) and the bootfile.
On success the firmware will contact the boot server, fetch the bootfile and hand over control to it. The traditional method to serve the bootfile is tftp (trivial file transfer protocol); modern systems support http too. I have an article on setting up the dhcp server for virtual machines you might want to check out.
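As a sketch, an ISC dhcpd configuration handing out the boot server and bootfile could look like this; the addresses and filename below are illustrative placeholders, not from the original article:

```
# Hypothetical dhcpd.conf snippet for UEFI netboot clients
subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.100 192.168.1.200;
  next-server 192.168.1.1;      # boot server (tftp)
  filename "snponly.efi";       # bootfile handed to the firmware
}
```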
What the bootfile is expected to be depends on the system being booted. There are embedded systems -- for example IP phones -- which load the complete system software that way.
When booting UEFI systems the bootfile typically is an EFI binary. That is not the only option though, more on that below.
The traditional way to netboot linux on UEFI systems is using a bootloader. The bootfile location handed out by the DHCP server points to the bootloader, which is the first file loaded over the network. Typical choices for the bootloader are grub.efi, snponly.efi (from the ipxe project) or syslinux.efi.
Next step is the bootloader fetching the config file. That works the same way the bootloader itself was loaded, using the EFI network driver provided by either the platform firmware (typically the case for onboard NICs) or via PCI option rom (plug-in NICs). The bootloader does not need its own network drivers.
The loaded config file controls how the boot will continue. This can be very simple, three lines asking the bootloader to fetch kernel + initrd from a fixed location, then start the kernel with some command line. This can also be very complex, creating an interactive menu system where the user has dozens of options to choose from (see for example netboot.xyz).
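The simple case can be sketched as a minimal grub.cfg like the following; the paths and kernel arguments are made up for illustration:

```
# Hypothetical grub.cfg fetched over the network
menuentry 'Netboot Linux' {
    linux  (tftp)/images/vmlinuz console=ttyS0
    initrd (tftp)/images/initrd.img
}
```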
Now the user can -- in case the config file defines menus -- pick what they want to boot.
Final step is the bootloader fetching the kernel and initrd (again using the EFI network driver) and starting the kernel. Voila.
When using secure boot there is one more intermediate step needed: the first binary needs to be shim.efi, which in turn will download the actual bootloader. Most distros ship only grub.efi with a secure boot signature, which limits the bootloader choice to that.

Also, all components (shim + grub + kernel) must come from the same distribution. shim.efi has the distro's secure boot signing certificate embedded, so the Fedora shim will only boot a grub + kernel with a secure boot signature from Fedora.
You probably do not have to worry about this. Shipping systems with an EFI network driver and UEFI network boot support is a standard feature today; snponly.efi should be used for these systems.

When using older hardware, network boot support might be missing though. Should that be the case, the ipxe project can help because it also features a large collection of firmware network drivers. It ships an all-in-one EFI binary named ipxe.efi which includes the bootloader and scripting features (which are in snponly.efi too) and additionally all the ipxe hardware drivers.

That way ipxe.efi can boot from the network even if the firmware does not provide a driver. In that case ipxe.efi itself must be loaded from local storage though. You can download the EFI binary and ready-to-use ISO/USB images from boot.ipxe.org.
A UKI (unified kernel image) is an EFI binary bundle. It contains a linux kernel, an initrd, the command line and a few more optional components (not covered here) in sections of the EFI binary, plus the systemd EFI stub, which handles booting the bundled linux kernel with the bundled initrd.
One advantage is that the secure boot signature of a UKI covers all components, not only the kernel itself, which is a big step forward for linux boot security.
Another advantage is that a UKI is self-contained. It does not need a bootloader which knows how to boot linux kernels and handle initrd loading. It is simply an EFI binary which you can start any way you want, for example from the EFI shell prompt.
The latter makes UKIs interesting for network booting, because they can be used as the bootfile too. The DHCP server hands out the UKI location, the UEFI firmware fetches the UKI and starts it. Done.
Combining the bootloader and UKI approaches is possible too. UEFI bootloaders can load not only linux kernels but also EFI binaries (including UKIs), in the case of grub.efi with the chainloader command. So if you want interactive menus to choose a UKI to boot, you can do that.
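A sketch of such a menu entry (the path is a made-up placeholder):

```
# Hypothetical grub.cfg entry chainloading a UKI over the network
menuentry 'Boot UKI' {
    chainloader (tftp)/uki/linux.efi
}
```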
Modern UEFI implementations can netboot ISO images too, although there are a few restrictions.
When the UEFI firmware gets an ISO image as bootfile from the DHCP server it will load the image into a ramdisk, register the ramdisk as block device and try to boot from it.
From that point on booting works the same way booting from a local cdrom device works. The firmware will look for the boot loader on the ramdisk and load it. The bootloader will find the other components needed on the ramdisk, i.e. kernel and initrd in case of linux. All without any network access.
The UEFI firmware will also create ACPI tables for a pseudo nvdimm device. That way the booted linux kernel will find the ramdisk too.

You can use the standard Fedora / CentOS / RHEL netinst ISO image; linux will find the images/install.img on the ramdisk and boot up all the way to anaconda. With enough RAM you can even load the DVD with all packages, then do the complete system install from ramdisk.
The big advantage of this approach is that the netboot workflow becomes very similar to other installation workflows. It's not the odd kid on the block any more where loading kernel and initrd works in a completely different way. The very same ISO image can be:
Bonus: secure boot support suddenly isn't a headache any more.
There is one problem with the fancy new world though. We have lots of places in the linux world which depend on the linux kernel command line for system configuration. For example anaconda expects getting the URL of the install repository and the kickstart file that way.
When using a boot loader that is simple. The kernel command line simply goes into the boot loader config file.
With ISO images it is more complicated; changing the grub config file on an ISO image is a cumbersome process. Also ISO images are not exactly small, so install images with a customized grub.cfg need quite some storage space.
UKIs can pass through command line arguments to the linux kernel, but that is only allowed in case secure boot is disabled. When using UKIs with secure boot the best option is to use the UKIs built and signed on distro build infrastructure. Which implies using the kernel command line for customization is not going to work with secure boot enabled.
So, all of the above (and UKIs in general) will work better if we can replace the kernel command line as universal configuration vehicle with something else. That will most likely not be a single thing but a number of different approaches depending on the use case. Some steps in that direction have happened already: systemd can autodetect partitions (so booting from disk without root=... on the kernel command line works), and systemd credentials can be used to configure some aspects of a linux system. There is still a loooong way to go though.
A new video illustrating the steps to perform on a KVM host and in a virtual server configuration to make AP queues of cryptographic adapters available to a KVM guest can be found here.