This is the fourth post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at processing the virtio-serial device's transmit and receive virtqueues.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
The virtio-serial device has a pair of virtqueues that allow the driver to transmit and receive data. The driver enqueues empty buffers onto the receiveq (virtqueue 0) and the device fills them with received data. The driver enqueues buffers containing data onto the transmitq (virtqueue 1) and the device sends them.
This logic is split into two modules: virtqueue_reader for the transmitq and virtqueue_writer for the receiveq. The interface of virtqueue_reader looks like this:
/* Stream data from a virtqueue without framing */
module virtqueue_reader (
input clk,
input resetn,
/* Number of elements in descriptor table */
input [15:0] queue_size,
/* Lower 32-bits of Virtqueue Descriptor Area address */
input [31:0] queue_desc_low,
/* Lower 32-bits of Virtqueue Driver Area address */
input [31:0] queue_driver_low,
/* Lower 32-bits of Virtqueue Device Area address */
input [31:0] queue_device_low,
input queue_notify, /* kick */
input phase,
output reg [31:0] data = 0,
output reg [2:0] data_len = 0,
output ready,
/* For DMA */
output reg ram_valid = 0,
input ram_ready,
output reg [3:0] ram_wstrb = 0,
output reg [21:0] ram_addr = 0,
output reg [31:0] ram_wdata = 0,
input [31:0] ram_rdata
);
If you are familiar with the VIRTIO specification you might recognize queue_size, queue_desc_low, queue_driver_low, queue_device_low, and queue_notify since they are values provided by the VIRTIO MMIO Transport. The driver configures them with the memory addresses of the virtqueue data structures in RAM. The device uses DMA to access those data structures. The driver can kick the device using queue_notify to indicate that new buffers have been enqueued.
The reader interface consists of phase, data, data_len, and ready; this is what the rdwr_stream module needs in order to use virtqueue_reader as a data source. rdwr_stream keeps reading the next byte(s) by flipping the phase bit and waiting for ready to be asserted by the device. Note that the device can provide up to 4 bytes at a time through the 32-bit data register and data_len allows the device to indicate how much data was read.
Finally, the DMA interface is how virtqueue_reader initiates RAM accesses so it can fetch the virtqueue data structures that the driver has configured.
Virtqueue processing consists of multiple steps and cannot be completed within a single clock cycle. Therefore the processing is decomposed into a state machine where each step consists of a DMA transfer or waiting for an event. Here are the states:
`define STATE_WAIT_PHASE                 0  /* waiting for phase bit to flip */
`define STATE_READ_AVAIL_IDX             1  /* waiting for avail.idx read */
`define STATE_WAIT_NOTIFY                2  /* waiting for queue notify (kick) */
`define STATE_READ_DESCRIPTOR_ADDR_LOW   3  /* waiting for descriptor read */
`define STATE_READ_DESCRIPTOR_LEN        4
`define STATE_READ_DESCRIPTOR_FLAGS_NEXT 5
`define STATE_READ_BUFFER                6  /* waiting for data buffer read */
`define STATE_WRITE_USED_ELEM_ID         7  /* waiting for used element write */
`define STATE_WRITE_USED_ELEM_LEN        8
`define STATE_WRITE_USED_FLAGS_IDX       9  /* waiting for used.flags/used.idx write */
`define STATE_READ_AVAIL_RING_ENTRY      10 /* waiting for avail element read */
The device starts up in STATE_WAIT_PHASE because it is waiting to be asked to read the first byte(s). As soon as rdwr_stream flips the phase bit, virtqueue_reader must check the virtqueue to see if any data buffers are available.
I won't describe all the details of virtqueue processing, but here is a summary of the steps involved. See the VIRTIO specification or the code for the details.
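As a software analogy (not the actual Verilog), here is a sketch in Rust of those steps for a single-descriptor transmitq buffer. The struct and field names are illustrative; only the split-virtqueue roles of the descriptor table, avail ring, and used ring follow the VIRTIO specification:

```rust
#[derive(Clone, Copy)]
struct Desc {
    addr: usize, // offset of the data buffer in guest RAM
    len: u32,
    // flags/next omitted: this sketch only handles single-descriptor chains
}

struct UsedElem {
    id: u32,
    len: u32,
}

struct Virtqueue {
    desc: Vec<Desc>,      // descriptor table
    avail_ring: Vec<u16>, // driver writes descriptor indices here
    avail_idx: u16,       // driver increments after enqueuing
    last_avail: u16,      // device-side progress counter
    used: Vec<UsedElem>,  // used ring (simplified)
    used_idx: u16,
}

impl Virtqueue {
    /// One pass over the transmitq: pop an available buffer, copy its data,
    /// and mark the descriptor used so the driver can reclaim it.
    fn pop_and_complete(&mut self, ram: &[u8]) -> Option<Vec<u8>> {
        if self.last_avail == self.avail_idx {
            return None; // queue empty: the device would wait for a kick
        }
        let slot = self.last_avail as usize % self.avail_ring.len();
        let head = self.avail_ring[slot] as usize; // read avail ring entry
        let d = self.desc[head];                   // read descriptor fields
        let data = ram[d.addr..d.addr + d.len as usize].to_vec(); // read buffer
        self.used.push(UsedElem { id: head as u32, len: d.len }); // write used elem
        self.used_idx = self.used_idx.wrapping_add(1); // publish to the driver
        self.last_avail = self.last_avail.wrapping_add(1);
        Some(data)
    }
}
```

In hardware, each of these reads and writes is a separate DMA transfer, which is why the Verilog needs one state per access.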
After a buffer has been fully consumed, there are also several steps to fill out a used descriptor and increment the used.idx field so that the driver is aware that the buffer is done.
There are two wait states where the device stops until there is more work to do. First, rdwr_stream will stop asking to read more data if the writer is too slow. This flow control ensures that data is not dropped because of a slow writer. This is STATE_WAIT_PHASE. Second, if the device wants to read but the virtqueue is empty, then it has to wait until queue_notify goes high. This is STATE_WAIT_NOTIFY.
The virtqueue_writer module is similar to virtqueue_reader but it fills in the buffers with data instead of consuming them.
A quick side note about memory alignment: the memory interface is 32-bit aligned, so it is only possible to read an entire 32-bit value from memory at multiples of 4 bytes. On a fancier CPU with a cache the unit would be a cache line (e.g. 128 bytes). When the data structures being DMAed are not aligned it becomes tedious to handle the shifting and masking, especially when reading data from a source and writing it to a destination. Life is much simpler when everything is aligned, because data can be trivially read or written in a single access without any special logic to adjust the data to fit the cache line size.
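To illustrate the pain, here is a sketch (the function name is made up for this example) of what fetching a 32-bit value at an arbitrary byte offset from a 32-bit-word memory involves:

```rust
/// Read a u32 at any byte offset from a memory that only supports aligned
/// 32-bit word accesses (little-endian byte lanes).
fn read_u32_unaligned(words: &[u32], byte_offset: usize) -> u32 {
    let word = byte_offset / 4;
    let shift = (byte_offset % 4) * 8;
    if shift == 0 {
        words[word] // aligned: a single access, no fixup needed
    } else {
        // unaligned: two memory accesses plus shifting and masking
        (words[word] >> shift) | (words[word + 1] << (32 - shift))
    }
}
```

The aligned case is one access; the unaligned case needs two accesses and reassembly, and a writer needs the inverse masking on top of that.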
The virtqueue_reader and virtqueue_writer modules use DMA to read or write data from/to RAM buffers provided by the driver running on the PicoRV32 RISC-V CPU inside the FPGA. They are state machines that run through a sequence of DMA transfers and provide the reader/writer interfaces that the rdwr_stream module uses to transfer data. In the next post we will look at the UART receiver and transmitter.
This is the first post in a series about building a virtio-serial device in Verilog for a Field Programmable Gate Array (FPGA) development board. This was a project I did in my spare time to become familiar with logic design. I hope these blog posts will offer a glimpse into designing your own devices and into FPGA development.
Having developed systems software including firmware, device drivers for Linux, and device emulation in QEMU, I wanted to implement a device from scratch on an FPGA, leaving the comfort of the software world and getting some experience with hardware internals. It didn't take long before I had both good and bad experiences. For example, what a pain it becomes when a device has to process data structures that are not aligned in memory! More on that later.
A few years ago, I ordered a development board with an iCE40UP5k FPGA with the intention of implementing a CPU and maybe a USB controller. I was busy with other things though and the FPGA ended up in a drawer until I recently felt the time was right to dive in.
The muselab iCESugar board that I used for this project costs around 50 USD. It does not support high-speed interfaces like PCIe or Ethernet, but it has 5280 logic cells, 128 KB RAM, 8 MB of flash memory, and a collection of basic I/O including onboard LEDs, UART pins, and PMOD headers. That puts it roughly on par with an Arduino microcontroller board, except you're not stuck with a particular microcontroller because you can design your own or use existing soft-cores, as they are called.
The board can be flashed via USB and loading the manufacturer's demos was an eye opener: it can run several different CPU soft-cores (RISC-V, 6502, etc) and there is even enough capacity to run MicroPython on a soft-core. Typing Python into the prompt and getting output back knowing that the CPU it is running on is just some Verilog code that you can read and modify is neat.
Out of the available demo soft-cores, the PicoRV32 RISC-V soft-core interested me most because it's a 32-bit microcontroller with open source compiler toolchain support despite its tiny Verilog implementation. You can write firmware for the PicoRV32 in Rust, C, etc.
A tiny soft-core is important because it leaves logic cells free for integrating custom devices. There is no point in a fancier soft-core if it complicates the project or limits the number of cells available for my own logic.
The PicoRV32 code comes with an example System-on-Chip (SoC) called PicoSoC that integrates RAM, flash, and UART serial port communication. Custom memory-mapped I/O (MMIO) devices can be wired into the SoC by adding address decoding logic and connecting the devices to the bus. PicoSoC is a great time-saver for developing a custom RV32 SoC because RAM and flash are critical but not particularly exciting to integrate yourself.
The PicoSoC exposes a trivial MMIO register interface for the UART, but I wanted to replace it with a virtio-serial device in order to learn about implementing a more advanced device. VIRTIO devices use Direct Memory Access (DMA) and interrupts, although I ended up not implementing interrupts because I ran out of logic cells. This makes it an opportunity to implement a device from scratch that is small but not trivial.
While PicoSoC has no PCI bus for the popular VIRTIO PCI transport, it is possible to implement the VIRTIO MMIO transport for this SoC since that just involves selecting some address space for the device's registers where the PicoRV32 CPU can access the device.
Having covered all this, the goal of this project is to write a virtio-serial device in Verilog and integrate it into PicoSoC. This also requires writing firmware that runs on the PicoRV32 soft-core to prove that the virtio-serial device works. In the posts that follow, I'll describe the main stops on the journey to building this.
The next post will cover MMIO registers, DMA, and interrupts.
You can also check out the code for this project at https://gitlab.com/stefanha/virtio-serial-fpga.
This is the final post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at the firmware running on the PicoRV32 RISC-V soft-core in the FPGA.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
The PicoRV32 RISC-V soft-core boots up executing code from flash memory at 0x10000000. Since RISC-V is supported by LLVM and gcc, it is possible to write the firmware in several languages. For this project I wanted to use Rust and was aware of several existing crates that already provide APIs for things that would be needed.
I used a Rust no_std environment, which means that the standard library (std) is not available and only the core library (core) is available. Crates written for embedded systems and low-level programming often support no_std, but most other crates rely on the standard library and an operating system. no_std is a niche in the Rust ecosystem but it works pretty well.
Several crates came in handy, among them riscv-rt (the RISC-V runtime providing startup code and the entry point), virtio-drivers (drivers for VIRTIO devices), and embedded-alloc (a memory allocator for embedded environments).
Initially I thought I could get away without a memory allocator, since no_std does not have one by default and it would be extra work to set one up. However, virtio-drivers needed one for the virtio-serial device (I don't think it is really necessary, but the code is written that way). Luckily the embedded-alloc crate has memory allocators that are easy to set up and just need a piece of memory to operate in.
Aside from the setup code, the firmware is trivial. The CPU just sends a hello world message and then echoes back bytes received from the virtio-serial device.
#[riscv_rt::entry]
fn main() -> ! {
unsafe {
extern "C" {
static _heap_size: u8;
}
let heap_bottom = riscv_rt::heap_start() as usize;
let heap_size = &_heap_size as *const u8 as usize;
HEAP.init(heap_bottom, heap_size);
}
// Point virtio-drivers at the MMIO device registers
let header = NonNull::new(0x04000000u32 as *mut VirtIOHeader).unwrap();
let transport = unsafe { MmioTransport::new(header, 0x1000) }.unwrap();
// Put the string on the stack so the device can DMA (it cannot DMA flash memory)
let mut buf: [u8; 13] = *b"Hello world\r\n";
if transport.device_type() == DeviceType::Console {
let mut console = VirtIOConsole::<HalImpl, MmioTransport>::new(transport).unwrap();
console.send_bytes(&buf).unwrap();
loop {
if let Ok(Some(ch)) = console.recv(true) {
buf[0] = ch;
console.send_bytes(&buf[0..1]).unwrap();
}
}
}
loop {}
}
In the early phases I ran tests on the iCESugar board that lit up an LED to indicate the test result. As things became more complex I switched over to Verilog simulation. I wrote testbenches that exercise the Verilog modules I had written. This is similar to unit testing software.
In the later stages of the project, I changed the approach once more in order to do integration testing and debugging. To get more visibility into what was happening in the full design with a CPU and virtio-serial device, I used GTKWave to view the VCD files that Icarus Verilog can write during simulation. You can see every cycle and every value in each register or wire in the entire design, including the PicoRV32 RISC-V CPU, virtio-serial device, etc.
This allowed very powerful debugging since the CPU activity is visible (see the program counter in the reg_pc register in the screenshot) alongside the virtio-serial device's internal state. It is possible to look up the program counter in the firmware disassembly to follow the program flow and see where things went wrong.
The firmware is a small Rust codebase that uses existing crates, including riscv-rt and virtio-drivers. Throughout the project I used several debugging and simulation approaches, depending on the level of complexity. Thanks to the open source code and tools available, it was possible to complete this project using fairly convenient and powerful tools and without spending a lot of time reinventing the wheel. Or at least without reinventing the wheels I didn't want to reinvent :).
Let me know if you enjoy FPGAs and tell me about projects you've done!
This is the fifth post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at the UART receiver and transmitter.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
A Universal Asynchronous Receiver-Transmitter (UART) is a simple interface for data transfer that only requires a transmitter (tx) and a receiver (rx) wire. There is no clock wire because both sides of the connection use their own clocks and sample the signal in order to reconstruct the bits being transferred. The agreed-upon data transfer rate (or baud rate) is usually modest and the frame encoding is also not the most efficient way of transferring data, but UARTs get the job done and are commonly used for debug consoles, modems, and other relatively low data rate interfaces.
There is a framing protocol that makes it easier to reconstruct the transferred data. This is important because failure to correctly reconstruct the data results in corrupted data being received on the other side. In this project I used a 9,600 bit/s baud rate and 8 data bits, no parity bit, and 1 stop bit (sometimes written as 8N1). The framing works as follows: the line idles at 1; a start bit (0) marks the beginning of a frame; the 8 data bits follow, least significant bit first; and a stop bit (1) ends the frame, returning the line to idle.
The job of the transmitter is to follow this framing protocol. The job of the receiver is to detect the next frame and to reconstruct the byte being transferred.
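The transmitter's side of the 8N1 framing can be sketched in software. This hypothetical Rust function (not part of the project) expands one byte into the sequence of line bits a transmitter would shift out:

```rust
/// Expand one byte into the 8N1 line bits a transmitter shifts out:
/// start bit (0), 8 data bits least-significant-first, stop bit (1).
fn frame_8n1(byte: u8) -> Vec<u8> {
    let mut bits = vec![0u8]; // start bit pulls the idle line low
    for i in 0..8 {
        bits.push((byte >> i) & 1); // data bits, LSB first
    }
    bits.push(1); // stop bit returns the line to idle
    bits
}
```

So every byte costs 10 bit periods on the wire, which is part of why the encoding is not the most efficient.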
The uart_reader and uart_writer modules implement the UART receiver and transmitter, respectively. They are designed around the rdwr_stream module's reader and writer interfaces. That means uart_reader receives the next byte from the UART rx pin whenever it is asked to read more data and uart_writer transmits on the UART tx pin whenever it is asked to write more data.
uart_reader follows a trick I learnt from the PicoSoC's simpleuart module: once the rx pin goes from 1 to 0, it waits until half the bit period has passed (at 9,600 baud with a 12 MHz clock that is 1250 / 2 = 625 clock cycles) before sampling the rx pin. This works well because the UART only transfers data on the iCESugar PCB and is not exposed to much noise. Fancier approaches involve sampling the pin every clock cycle in order to try to reconstruct the value more accurately, but they don't seem to be necessary for this project.
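The timing arithmetic behind this sampling trick is simple. A sketch, assuming the 12 MHz clock and 9,600 baud rate used in this project:

```rust
const CLK_HZ: u32 = 12_000_000; // iCESugar clock frequency
const BAUD: u32 = 9_600;

/// Clock cycles per UART bit period.
fn clk_div() -> u32 {
    CLK_HZ / BAUD
}

/// Delay after the 1 -> 0 start-bit edge. Waiting half a period means every
/// subsequent full-period tick lands in the middle of a data bit.
fn start_delay() -> u32 {
    clk_div() / 2
}
```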
Here is the core uart_reader code, a state machine that parses the incoming frame:
always @(posedge clk) begin
...
div_counter = div_counter + 1;
case (bit_counter)
0: begin // looking for the start bit
if (rx == `START_BIT) begin
div_counter = 0;
bit_counter = 1;
end
end
1: begin
/* Sample in the middle of the period */
if (div_counter == clk_div >> 1) begin
div_counter = 0;
bit_counter = 2;
end
end
10: begin // expecting the stop bit
if (div_counter == clk_div) begin
if (rx == `STOP_BIT && !reg_ready) begin
data = {24'h0, rx_buf};
data_len = 1;
reg_ready = 1;
end
bit_counter = 0;
end
end
default: begin // receive the next data bit
if (div_counter == clk_div) begin
rx_buf = {rx, rx_buf[7:1]};
div_counter = 0;
bit_counter = bit_counter + 1;
end
end
endcase
The uart_writer module is similar, but it has a transmit buffer that it sends over the UART tx pin with the framing that I've described here.
The uart_reader and uart_writer modules are responsible for receiving and transmitting data over the UART rx/tx pins. They implement the framing protocol that UARTs use to protect data. In the next post we will cover the firmware running on the PicoRV32 RISC-V soft-core that drives the I/O.
This is the third post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at the design of the virtio-serial device and how to decompose it into modules.
The code is available at https://gitlab.com/stefanha/virtio-serial-fpga.
A virtio-serial device is a serial controller, enabling communication with the outside world. The iCESugar FPGA development board has UART rx and tx pins connecting the FPGA to a separate microcontroller that acts as a bridge for USB serial communication. That means the FPGA can wiggle the bits on the UART tx pin to send bytes to a computer connected to the board via USB and you can receive bits from the computer through the UART rx pin. The purpose of the virtio-serial device is to present a VIRTIO device to the PicoRV32 RISC-V CPU inside the FPGA so the software on the CPU can send and receive data.
The virtio-serial device implements the Console device type defined in the VIRTIO specification and exposes it to the driver running on the CPU via the VIRTIO MMIO Transport. The terms "serial" and "console" are used interchangeably in the VIRTIO community and I will usually use serial unless I'm specifically talking about the Console device type section in the VIRTIO specification.
VIRTIO separates the concept of a device type (like net, block, or console) from the transport that allows the driver to access the device. This architecture allows VIRTIO to be used across a range of different machines, including machines that have a PCI bus, MMIO devices, and so on. Fortunately the VIRTIO MMIO transport is fairly easy to implement from scratch.
The virtio_serial_mmio module implements the virtio-serial device from the following parts: the VIRTIO MMIO Transport register logic, the virtqueue_reader and virtqueue_writer modules that process the virtqueues, the uart_reader and uart_writer modules that drive the UART pins, rdwr_stream modules that pump data between readers and writers, and the spram_mux module that multiplexes DMA access.
The virtio-serial device interfaces with the outside world through an MMIO interface that the CPU uses to access the device registers, a DMA interface for initiating RAM memory transfers, and the UART rx/tx pins for actually sending and receiving data.
Note that both the virtqueue_reader and the virtqueue_writer modules require DMA access, so I reused the spram_mux module that multiplexes the CPU and the virtio-serial device's RAM accesses. spram_mux is used inside virtio_serial_mmio to multiplex access to the single DMA interface.
Since the job of the device is to transfer data between the virtqueues and the UART rx/tx pins, it is organized around a module named rdwr_stream that constantly attempts to read data from a source and write it to a destination:
/* Stream data from a reader to a writer */
module rdwr_stream (
input clk,
input resetn,
/* The reader interface */
output reg rd_phase = 0,
input [31:0] rd_data,
input [2:0] rd_data_len,
input rd_ready,
/* The writer interface */
output reg wr_phase = 0,
output reg [31:0] wr_data = 0,
output reg [2:0] wr_data_len = 0,
input wr_ready
);
By implementing the reader and writer interfaces for the virtqueues and UART rx/tx pins, it becomes possible to pump data between them using rdwr_stream. For testing it's also possible to configure virtqueue loopback or UART loopback so that the virtqueue logic or the UART logic can be exercised in isolation.
The reader and writer interfaces that the rdwr_stream module uses are the central abstraction in the virtio-serial device. You might notice that this interface uses a phase bit rather than a valid bit like in the valid/ready interface for MMIO and DMA. Every transfer is initiated by flipping the phase bit from its previous value. I find the phase bit approach easier to work with because it distinguishes back-to-back transfers, whereas interfaces that allow the valid bit to stay 1 for back-to-back transfers are harder to debug. It would be possible to switch to a valid/ready interface though.
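Here is a toy Rust model (not the Verilog implementation) of the responder side of the phase-bit handshake. It shows why back-to-back transfers stay unambiguous: every transfer requires an explicit flip relative to the last one served:

```rust
/// Toy model of a phase/ready responder. A new transfer is requested
/// whenever the requester's phase bit differs from the last phase this
/// responder served.
struct PhasePort {
    served_phase: bool,
}

impl PhasePort {
    /// Returns true (transfer accepted) exactly once per phase flip.
    fn poll(&mut self, phase: bool) -> bool {
        if phase != self.served_phase {
            self.served_phase = phase; // consume this request
            true
        } else {
            false // no flip means no new request, even across many cycles
        }
    }
}
```

With a valid/ready interface, a valid bit held at 1 across two transfers looks identical to one long transfer in a waveform; the phase flip makes the boundary visible.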
To summarize, there are 4 reader or writer implementations that can be connected freely through the rdwr_stream module: virtqueue_reader, virtqueue_writer, uart_reader, and uart_writer.
The virtio-serial device consists of the VIRTIO MMIO Transport device registers plus two rdwr_streams that transfer data between virtqueues and the UART. The next post will look at how virtqueue processing works.
This is the second post in a series about building a virtio-serial device in Verilog for an FPGA development board. This time we'll look at integrating MMIO devices into the PicoSoC (an open source System-on-Chip using the PicoRV32 RISC-V soft-core).
There are three common ways in which devices interact with a system: memory-mapped I/O (MMIO) device registers, interrupts, and Direct Memory Access (DMA).
PicoSoC supports MMIO device registers and interrupts out of the box. It does not support DMA, but I will explain later how this can be added by modifying the code.
First let's look at implementing MMIO registers for a device in PicoSoC. The PicoRV32 CPU's memory interface looks like this:
output mem_valid // request a memory transfer
output mem_instr // hint that CPU is fetching an instruction
input mem_ready // reply to memory transfer
output [31:0] mem_addr // address
output [31:0] mem_wdata // data being written
output [ 3:0] mem_wstrb // 0000 - read
// 0001 - write 1 byte
// 0011 - write 2 bytes
// 0111 - write 3 bytes
// 1111 - write 4 bytes
input [31:0] mem_rdata // data being read
When mem_valid is 1 the CPU is requesting a memory transfer. The memory address in mem_addr is decoded and the appropriate device is selected according to the memory map (e.g. virtio-serial device at 0x04000000-0x040000ff). The selected device then handles the memory transfer and asserts mem_ready to let the CPU know that the transfer has completed.
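The address decoding step can be sketched as a simple range match. In this hypothetical Rust sketch, the virtio-serial range comes from the example above, while the RAM window is a made-up placeholder for illustration:

```rust
enum Target {
    Ram,
    VirtioSerial,
    Unmapped,
}

/// Decode mem_addr to the device selected by the memory map.
fn decode(mem_addr: u32) -> Target {
    match mem_addr {
        0x0000_0000..=0x0001_ffff => Target::Ram, // hypothetical 128 KB RAM window
        0x0400_0000..=0x0400_00ff => Target::VirtioSerial, // from the example above
        _ => Target::Unmapped,
    }
}
```

In the SoC this match is a block of Verilog comparisons on mem_addr that routes mem_valid to the selected device.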
In order to handle MMIO device register accesses, the virtio-serial device needs a similar memory interface. The register logic is implemented in a case statement that handles wdata or rdata depending on the semantics of the register. Here is the VIRTIO MMIO MagicValue register implementation that reads a constant identifying this as a VIRTIO MMIO device:
module virtio_serial_mmio (
...
input iomem_valid,
output iomem_ready,
input [3:0] iomem_wstrb,
input [7:0] iomem_addr,
input [31:0] iomem_wdata,
output [31:0] iomem_rdata,
...
);
...
always @(posedge clk) begin
...
case (iomem_addr)
`REG_MAGIC_VALUE: begin
// Note that ready and rdata are basically iomem_ready
// and iomem_rdata but there is some more glue behind
// this.
ready = 1;
rdata = `MAGIC_VALUE;
end
MMIO registers are appropriate when the CPU needs to initiate some activity in the device, but accessing them ties up the CPU during the load/store instructions that touch the device registers. For bulk data transfer it is common to use DMA instead, where a device initiates RAM data transfers itself without CPU involvement. This allows the CPU to continue running independently of device activity.
VIRTIO is built around DMA because the virtqueues live in RAM and the device initiates accesses to both the virtqueue data structures as well as the actual data buffers containing the I/O payload.
The iCESugar board has Single Port RAM (SPRAM), which means that it can only be accessed through one interface, and that interface is already connected to the CPU. In order to allow the virtio-serial device to access RAM, it is necessary to multiplex the SPRAM interface between the CPU and the virtio-serial device. I chose to implement a fixed-priority arbiter because a fancier round-robin strategy is not necessary for this project. The virtio-serial device only accesses RAM in short bursts, so the CPU will not be starved.
You can look at the spram_mux module to see the implementation, but it basically has 2 input memory interfaces and 1 output memory interface. One input interface is high priority and the other is low priority. The virtio-serial device uses the high priority port and the CPU uses the low priority port.
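The fixed-priority policy itself is tiny. Here is a sketch of the per-cycle grant decision in Rust pseudocode (the port names are illustrative, not from spram_mux):

```rust
/// Fixed-priority grant for one cycle: the high-priority port (the
/// virtio-serial device) always wins over the low-priority port (the CPU).
fn arbitrate(hi_valid: bool, lo_valid: bool) -> Option<&'static str> {
    if hi_valid {
        Some("device") // DMA happens in short bursts, so the CPU recovers quickly
    } else if lo_valid {
        Some("cpu")
    } else {
        None // no requests this cycle
    }
}
```

A round-robin arbiter would additionally remember who was granted last; with the device's short bursts that bookkeeping buys nothing here.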
The virtio-serial device is designed for DMA via a state machine that keeps track of the current memory access that is being performed. When the device sees the ready input asserted, it knows the DMA transfer has completed and it transitions to the next state (often multiple memory accesses are performed in sequence to load the virtqueue data structures).
For example, here are state machine transitions for loading the first two fields of the virtqueue descriptor:
always @(posedge clk) begin
...
if (ram_valid && ram_ready) begin
...
case (state)
...
`STATE_READ_DESCRIPTOR_ADDR_LOW: begin
desc_addr_low = ram_rdata;
ram_addr = ram_addr + 2;
state = `STATE_READ_DESCRIPTOR_LEN;
end
`STATE_READ_DESCRIPTOR_LEN: begin
desc_len = ram_rdata;
ram_addr = ram_addr + 1;
state = `STATE_READ_DESCRIPTOR_FLAGS_NEXT;
end
When the DMA transfer completes in the STATE_READ_DESCRIPTOR_ADDR_LOW state, the virtqueue descriptor's buffer address (low 32 bits) is stored into the desc_addr_low register for later use and ram_addr is updated to the memory address of the virtqueue descriptor's length field. The STATE_READ_DESCRIPTOR_LEN state has similar logic.
In other words, DMA transfers require splitting up the device implementation into a state machine that handles DMA completion in a future clock cycle. In the software world this is similar to callbacks in event loops where code is split up because we need to wait for a completion.
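The callback analogy can be made concrete with a software version of the descriptor-loading states shown above. This Rust sketch mirrors the Verilog transitions; the word-address increments (+2 to skip the high address word, +1 to reach flags/next) are taken from the code above:

```rust
enum State {
    ReadDescAddrLow,
    ReadDescLen,
    ReadDescFlagsNext,
}

struct DescLoader {
    state: State,
    ram_addr: u32, // 32-bit word address, as in the Verilog
    desc_addr_low: u32,
    desc_len: u32,
}

impl DescLoader {
    /// Called when a DMA read completes, like the `ram_valid && ram_ready`
    /// branch of the always block.
    fn on_dma_complete(&mut self, ram_rdata: u32) {
        match self.state {
            State::ReadDescAddrLow => {
                self.desc_addr_low = ram_rdata;
                self.ram_addr += 2; // skip the high address word to reach len
                self.state = State::ReadDescLen;
            }
            State::ReadDescLen => {
                self.desc_len = ram_rdata;
                self.ram_addr += 1; // advance to the flags/next word
                self.state = State::ReadDescFlagsNext;
            }
            State::ReadDescFlagsNext => {
                // remaining states omitted in this sketch
            }
        }
    }
}
```

Each `on_dma_complete` call plays the role of a completion callback; the state enum is the "where was I?" bookkeeping that straight-line software code gets for free.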
The PicoRV32 soft-core has basic interrupt support, but it does not implement the standard RISC-V Control and Status Registers (CSRs) for interrupt handling. Supporting this would require extra work on the firmware side because the existing riscv-rt Rust crate doesn't implement the PicoRV32 interrupt mechanism. Also, I ended up running low on logic cells in the FPGA, so I disabled the PicoRV32's optional interrupt support to save space. Luckily VIRTIO devices support busy waiting, so interrupts are not required.
This post described how the virtio-serial device is connected to the PicoSoC and how MMIO registers and DMA work. MMIO register implementation was easy, but I spent quite a bit of time debugging waveforms with GTKWave to make sure that the memory interface and spram_mux were both working correctly and not wasting clock cycles. In the next post we'll look at the design of the virtio-serial device.
We’d like to announce the availability of the QEMU 10.2.0 release. This release contains 2300+ commits from 188 authors.
You can grab the tarball from our download page. The full list of changes are available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
This article brings some background information for security advisories GHSA-6pp6-cm5h-86g5 and CVE-2025-2296.
Let's start with some tech background and history, which is helpful for understanding the chain of events leading to CVE-2025-2296.
The x86 linux kernel has a 'setup' area at the start of the binary, and the traditional role of that area is to hold information needed by the linux kernel to boot, for example the command line and the initrd location. The boot loader patches the setup header before handing over control to the kernel, which is how the linux kernel finds the command line you have typed into the boot loader prompt. Booting in BIOS mode still works that way, and will most likely continue to do so, even though BIOS mode's days are numbered.
In the early days of UEFI support, the boot process in UEFI mode worked quite similarly. It's known as the 'EFI handover protocol'. It turned out to have a number of disadvantages though; for example, passing additional information requires updating both the linux kernel and the boot loader. The latter is true for BIOS mode too, but new developments there are very rare with the world moving towards UEFI.
Enter 'EFI stub'. With this, the linux kernel is simply an EFI binary. Execution starts in EFI mode, so the early kernel code can make EFI calls and gather all the information needed to boot without depending on the boot loader to do this for the kernel. Version dependencies are gone. An additional bonus is that no kernel-specific knowledge is needed any more: anything that is able to start EFI binaries -- for example the EFI shell -- can be used to launch the kernel.
Qemu offers the option to launch linux kernels directly, using the -kernel command line switch. What actually happens behind the scenes is that qemu exposes the linux kernel to the guest using the firmware config interface (fw_cfg for short). The virtual machine firmware (SeaBIOS or OVMF) will fetch the kernel from qemu, place it in guest memory and launch it.
OVMF supports both the 'EFI stub' and 'EFI handover protocol' boot methods. It will try the modern 'EFI stub' way first, which actually is just 'try to start it as an EFI binary'. This btw. implies that you can load any EFI binary that way; it is not limited to linux kernels.
If starting the kernel as an EFI binary fails, OVMF will try to fall back to the old 'EFI handover protocol' method. OVMF calls the latter the 'legacy linux kernel loader' in messages printed to the screen.
So, what is the problem with secure boot? Well, there isn't only one problem; we had multiple issues:
Secure boot bypass sounds scary, but is it really?
First, the bypass is restricted to exactly one binary: the linux kernel the firmware fetches from qemu. This issue does not allow running arbitrary code inside the guest, for example some unsigned efi binary an attacker places on the ESP after breaking into the virtual machine.
Second, the guest has to trust the virtualization host to not do evil things. The host has full control over the environment the guest is running in. Providing the linux kernel image for direct kernel boot is only one little thing out of many. If an evil host wants to attack or disturb the guest there are plenty of ways to do so. The host does not need an exploit for that.
Third, many typical virtual machine configurations do not use direct kernel boot. The kernel is loaded from the guest disk instead.
So, the actual impact is quite limited.
There is no quick and easy fix available. Luckily it is also not super urgent and critical. Time to play the long game ...
Fix #1: qemu exposes an unmodified kernel image to the guest now (additionally to the traditional setup/rest split which is kept for BIOS mode and backward compatibility). Fixes the first issue.
Fix #2: qemu can expose the shim binary to the guest, using the new -shim command line switch. Fixes the third issue.
Both qemu changes are available in qemu version 10.0 (released April 2025) and newer.
Fix #3: OVMF companion changes for fixes #1 + #2, use the new fw_cfg items exposed by qemu. Available in edk2-stable202502 and newer. Both qemu and OVMF must be updated for this to work.
Fix #4: Add a config option to disable the legacy 'EFI handover protocol' loader. Leave it enabled by default because of the low security impact and the existing use cases, but print warnings to the console in case the legacy loader is used. Also present in edk2-stable202502 and newer. Addresses the second issue.
With all that in place it is possible to plug the CVE-2025-2296 hole, by flipping the new config switch to disabled.
Fix #5: Do not use the legacy loader in confidential VMs. Present in edk2-stable202511 and newer.
Roughly one year has passed since the first changes were committed to qemu and edk2. What happened? libvirt also got support for passing shim to the guest for direct kernel boot (version 11.2.0 and newer). The new versions have found their way into the major linux distributions. debian, ubuntu and fedora should all be ready for the next step now.
Fix #6: Flip the default value for the legacy loader config option to disabled. This update just landed in the edk2 git repository and will be in edk2-stable202602.
What is left to do? The final cleanup. Purge the legacy loader from the edk2 code base. Will probably happen a year or two down the road.
The edk2 changes are in X86QemuLoadImageLib.
The qemu changes are in hw/i386/x86-common.c.
With the VIRTIO 1.4 specification for I/O devices expected to be published soon, here are the most prominent changes. For more fine-grained changes like the latest offloading capabilities in virtio-net devices, please refer to the draft specification.
The most exciting changes are new device types that allow for entirely new I/O devices to be built with VIRTIO. In 1.4 there are new device types that are especially relevant for automotive and embedded systems.
In addition to the new devices, VIRTIO itself has evolved to provide new functionality across device types:
This is a nice step forward for VIRTIO. Congratulations to everyone who contributed to VIRTIO 1.4!
Wait, what? RISC-V? In ‘the diary of AArch64 porter’? WTH?
Yes, I started working on Fedora packaging for the 64-bit RISC-V architecture port.
About a week ago, one of my work colleagues asked me about my old post about speeding up Mock. We had a discussion, I pointed him to the Mock documentation, and gave some hints.
It turned out that he was working on RISC-V related changes to Fedora packages. As I had some spare cycles, I decided to take a look. And I sank…
The 64-bit RISC-V port of Fedora Linux is going quite well. Over 90% of Fedora packages are already built for that architecture. And there are several packages with riscv64-specific changes, such as:
Note that these changes are temporary. There are people working on solving toolchain issues, languages are being bootstrapped (there was a review of Java changes earlier this week), patches are being integrated upstream and in Fedora, and so on.
There is the Fedora RISC-V tracker website showing the progress of the port:
This is a simple way to check what to work on. And there are several packages not built yet due to the “ExclusiveArch” setting in them.
A quick look at the work needed reminded me of the 2012-2014 period, when I worked on the same stuff but for AArch64 ports (OpenEmbedded, Debian/Ubuntu, Fedora/RHEL). So I had the knowledge, I knew the tools, and I started working.
In the beginning, I went through entries in the tracker and tried to triage as many packages as possible, so it would be more visible which ones need work and which can be ignored (for now). The tracker went from seven to over eighty triaged packages in a few days.
Then I looked at changes done by current porters. Which usually meant David Abdurachmanov. I used his changes as a base for the changes needed for Fedora packaging, while trying to keep them to the minimum required.
I opened over twenty pull requests to Fedora packages during a week of work.
But which hardware did I use to run those hundreds of builds? Was it HiFive Premier P550? Or maybe Milk-V Titan or another RISC-V SBC?
Nope. I used my 80-core, Altra-based, AArch64 desktop to run all those builds. With the QEMU userspace helper.
You see, Mock allows running builds for foreign architectures — all you need is
the proper qemu-user-static-* package and you are ready to go:
$ fedpkg mockbuild -r fedora-43-riscv64
You can extract the “fedora-43-riscv64” Mock config from the mock-riscv64-configs.patch hosted on the Fedora RISC-V port forge. I hope these configuration files will land in the “mock-core-configs” package in Fedora soon.
At some point I had 337 qemu-user-static-riscv processes running at the same
time. And you know what? It was still faster than a native build on 64-bit
RISC-V hardware.
But, to be honest, I only compared a few builds, so it may be better with other builders. Fedora RISC-V Koji uses a wide list of SBCs to build on:
Also note that using QEMU is not a solution for building a distribution. I used it only to check whether a package builds, and then scrapped the results.
Will I continue working on the RISC-V port of Fedora Linux? Probably yes. And, at some point, I will move to integrating those changes into CentOS Stream 10.
For sure I do not want to invest in RISC-V hardware. Existing models are not worth the money (in my opinion), incoming ones are still old (RVA20/RVA22) and they are slow. Maybe in two or three years there will be something fast enough.
The Google Summer of Code (GSoC) Mentor Summit 2025 took place from October 23rd to 25th in Munich, Germany. This event marks the conclusion of the annual program, bringing together mentors from all over the world. QEMU had another successful year with several interesting projects (details on our organization page), and it was a pleasure for me to represent the QEMU community at the summit, joining mentors from over 100 open source organizations to discuss the program, share experiences, and talk about open source challenges.
The summit follows an “unconference” format. There is no pre-planned schedule; instead, attendees propose sessions on the first day based on what they want to discuss. Since the event moved to Munich this year, it was a great opportunity for me to join and meet people from other communities face-to-face.

During the “Lightning talks” session, mentors had a short slot to introduce their projects. I presented the project I mentored this summer: vhost-user devices in Rust on macOS and *BSD.
The student, Wenyu Huang, worked on extending rust-vmm crates
(specifically vhost, vhost-user-backend, and vmm-sys-utils) to support
vhost-user devices on non-Linux POSIX systems. This work is important for
portability, allowing rust-vmm components to also run on macOS and *BSD.
You can find the full details and the code in the final project report.
This project focused primarily on the rust-vmm ecosystem rather than QEMU
itself. This was possible thanks to QEMU acting as an umbrella organization,
allowing related projects like rust-vmm to participate in the program.
Networking with other mentors was a key part of the event. It was nice to see that QEMU is well-recognized; many mentors I met were familiar with the project, which made it easy to start conversations. We exchanged views on how to handle the mentorship lifecycle, from interviewing GSoC applicants (and the impact of AI on that process) to the coding phase. We shared tips on how to best help students during the summer, such as setting up regular meetings and maintaining effective communication.
I also attended several sessions covering different topics. The most interesting discussions were:
The “sticker table” and “chocolate table” are traditions of the summit. I enjoyed trying chocolates from different countries. Unfortunately, I didn’t have any QEMU stickers to share this time. We should definitely plan to bring a stack for next year!
We really believe that GSoC is a great and useful program, as it brings new ideas and contributors to our community. We will definitely apply again for GSoC 2026, and we hope to have the chance to join the Mentor Summit again next year!
If something goes wrong it usually is very helpful to have log files at hand. Virtual machine firmware is no exception. So, let's have a look at common practices.
On the x86 platform qemu provides an isa-debugcon device. That is the simplest possible logging device, featuring a single ioport. Reading from the ioport returns a fixed value, which can be used to detect whether the device is present or not. Writing to the ioport sends the character written to the chardev backend linked to the device.
By convention the qemu firmware -- both seabios and OVMF -- uses the ioport address 0x402. So getting the firmware log printed on your terminal works this way:
qemu-system-x86_64 \
-chardev stdio,id=fw \
-device isa-debugcon,iobase=0x402,chardev=fw
When using libvirt you can use this snippet (in
the <devices> section) to send the firmware log
to file:
<serial type='null'>
<log file='/path/to/firmware.log' append='off'/>
<target type='isa-debug' port='1'>
<model name='isa-debugcon'/>
</target>
<address type='isa' iobase='0x402'/>
</serial>
Note that virsh console will connect to the first
serial device specified in the libvirt xml, so this should be
inserted after other serial devices to avoid breaking your
serial console setup.
On the arm virt platform there is no special device for the firmware log. The logs are sent to the serial port instead. That is inconvenient when using a serial console though, so linux distros typically ship two variants of the arm firmware image: one with logging turned on, and one with logging turned off (on RHEL and Fedora the latter has 'silent' in the filename). By default libvirt uses the silent builds, so in case you need the debug log you have to change the VM configuration to use the other variant.
Recently (end of 2023) the arm builds learned a new trick. In case two serial ports are present the output will be split. The first serial port is used for the console, the second port for the debug log. With that the console is actually usable with the verbose firmware builds. To enable that you simply need two serial devices in your libvirt config:
<serial type='pty'>
<log file='/path/to/console.log' append='off'/>
<target type='system-serial' port='0'>
<model name='pl011'/>
</target>
</serial>
<serial type='null'>
<log file='/path/to/firmware.log' append='off'/>
<target type='system-serial' port='1'>
<model name='pl011'/>
</target>
</serial>
Starting with the edk2-stable202508 tag OVMF supports
logging to a memory buffer. The feature is disabled by default and
must be turned on at compile time using the -D
DEBUG_TO_MEM=TRUE option when building the firmware.
There are multiple ways to access the log memory buffer. First is
a small efi application which can print the log to the efi console
(source
code,
x64
binary). Pass -p or pager as argument
on the command line to enable the built-in pager.
The second way is a recent linux kernel; version 6.17 got a new bool
config option: OVMF_DEBUG_LOG. When enabled the linux
kernel will make the firmware log available via sysfs. If supported
by both kernel and firmware the log will show up in
the /sys/firmware/efi directory with the
filename ovmf_debug_log.
The third option is a qemu monitor command. The changes just landed in the qemu master branch and will be available in the next qemu release (10.2) expected later this year. Both a qmp command (query-firmware-log) and a hmp command (info firmware-log) are available. This is useful to diagnose firmware failures that happen early enough in boot that the other options can not be used.
My KVM Forum 2025 talk "Making io_uring Pervasive in QEMU" is now available on YouTube. The slides are also available here (PDF).
This talk is about integrating Linux io_uring into QEMU's event loop to enable performance optimizations and use new kernel features available through io_uring. This topic is also relevant for other I/O-intensive applications (network services, software-defined networking or storage systems, databases, etc) that require modifications in order to take advantage of io_uring. The challenge is usually how to move from a reactor-based event loop that monitors file descriptors to a proactor-based event loop that waits for asynchronous operation completion. In QEMU this can be solved by keeping existing fd monitoring users while introducing an API for io_uring request submission that new code can use.
We’d like to announce the availability of the QEMU 10.1.0 release. This release contains 2700+ commits from 226 authors.
You can grab the tarball from our download page. The full list of changes are available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
File systems and relational databases are like cousins. They share more than is apparent at first glance.
It's not immediately obvious that relational databases and file systems rely upon the same underlying concept. That underlying concept is the key-value store and this article explores how both file systems and databases can be implemented on top of key-value stores.
Key-value stores provide an ordered map data structure. A map is a data structure that supports storing and retrieving from a collection of pairs. It's called a map because it is like a mathematical relation from a given key to an associated value. These are the key-value pairs that a key-value store holds. Finally, ordered means that the collection can be traversed in sorted key order. Not all key-value store implementations support ordered traversal, but both file systems and databases need this property as we shall see.
Here is a key-value store with an integer key and a string value:
Notice that the keys can be enumerated in sorted order: 2 → 14 → 17.
A key-value store provides the following interface for storing and retrieving values by a given key:
You've probably seen this sort of API if you have explored libraries like LevelDB, RocksDB, LMDB, BoltDB, etc or used NoSQL key-value stores. File systems and databases usually implement their own customized key-value stores rather than use these off-the-shelf solutions.
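To make the interface concrete, here is a toy ordered map in Python. The method names (put, get, delete, traverse) are placeholders for whatever a given library calls them, and the sorted list makes insertion O(n), so this only demonstrates the interface; real key-value stores use B+ trees or LSM-trees instead:

```python
import bisect

class OrderedKVStore:
    """Toy ordered map: a sorted key list with a parallel value list."""
    def __init__(self):
        self._keys = []
        self._values = []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value       # update an existing key
        else:
            self._keys.insert(i, key)     # insertion keeps keys sorted
            self._values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def delete(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]
            del self._values[i]

    def traverse(self):
        """Yield key-value pairs in sorted key order."""
        yield from zip(self._keys, self._values)

store = OrderedKVStore()
store.put(17, "c")
store.put(2, "a")
store.put(14, "b")
print([k for k, _ in store.traverse()])   # keys come out in order: [2, 14, 17]
```

Note how ordered traversal falls out of the representation; that property is what file systems and databases rely on, as we shall see.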
Let's look at how the key-value store interface relates to disks. Disks present a range of blocks that can be read or written at their block addresses. Disks can be thought of like arrays in programming. They have O(1) lookup and update time complexity but inserting or removing a value before the end of the array is O(n) because subsequent elements need to be copied. They are efficient for dense datasets where every element is populated but inefficient for sparse datasets that involve insertion and removal.
Workloads that involve insertion or removal are not practical when the cost is O(n) for realistic sizes of n. That's why programs often use in-memory data structures like hash tables or balanced trees instead of arrays. Key-value stores can be thought of as the on-disk equivalent to these in-memory data structures. Inserting or removing values from a key-value store takes sub-linear time, perhaps O(log n) or even better amortized time. We won't go into the data structures used to implement key-value stores, but B+ trees and Log-Structured Merge-Trees are popular choices.
This gives us an intuition about when key-value stores are needed and why they are an effective tool. Now let's look at how file systems and databases can be built on top of key-value stores next.
First let's start with how data is stored in files. A file system locates file data on disk by translating file offsets to Logical Block Addresses (LBAs). This is necessary because file data may not be stored contiguously on disk and files can be sparse with unallocated "holes" where nothing has been written yet. Thus, each file can be implemented as a key-value store with <Offset, <LBA, Length>> key-value pairs that comprise the translations needed to locate data on disk:
Reading and writing to the file involves looking up Offset -> LBA translations and inserting new translations when new blocks are allocated for the file. This is a good fit for a key-value store, but it's not the only place where file systems employ key-value stores.
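A hypothetical sketch of such a translation lookup in Python — the extent map, block numbers, and LBAs below are invented for illustration, with offsets and lengths counted in blocks:

```python
import bisect

# Extent map for one sparse file: file block offset -> (LBA, length in blocks).
# Blocks 4-7 are an unallocated "hole" where nothing has been written yet.
extents = {0: (1000, 4), 8: (2000, 2)}
offsets = sorted(extents)   # sorted keys enable the containing-extent search

def resolve(file_block):
    """Translate a file block number to an on-disk LBA, or None for a hole."""
    i = bisect.bisect_right(offsets, file_block) - 1   # last extent at or before
    if i >= 0:
        start = offsets[i]
        lba, length = extents[start]
        if file_block < start + length:
            return lba + (file_block - start)
    return None   # falls in a hole

print(resolve(1))   # inside the first extent -> 1001
print(resolve(5))   # hole -> None
print(resolve(9))   # inside the second extent -> 2001
```

The "find the largest key less than or equal to X" search used here is exactly the kind of ordered lookup that an unordered hash map cannot provide.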
File systems track free blocks that are not in use by files or metadata so that the block allocator can quickly satisfy allocation requests. This can be implemented as a key-value store with <LBA, Length> key-value pairs representing all free LBA ranges.
If the block allocator needs to satisfy contiguous allocation requests then a second key-value store with <Length, LBA> key-value pairs can serve as an efficient lookup or index. A best-fit allocator uses this key-value store by looking up the requested contiguous allocation size. Either a free LBA range of the matching size will be found, or, when that lookup fails, traversing to the next ordered key finds a bigger free range capable of satisfying the allocation request. This is an important pattern with key-value stores: we can have one main key-value store plus one or more indices that are derived from the same key-value pairs but use a different datum as the key than the primary key-value store, allowing efficient lookups and ordered traversal. The same pattern will come up in databases too.
Next, let's look at how to represent directory metadata in a key-value store. Files are organized into a hierarchy of directories (or folders). The file system stores the directory entries belonging to each directory. Each directory can be organized as a key-value store with filenames as keys and inode numbers as values. Path traversal consists of looking up directory entries in each directory along file path components like home, user, and file in the path /home/user/file. When a file is created, a new directory entry is inserted. When a file is deleted, its directory entry is removed. The contents of a directory can be listed by traversing the keys.
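A hypothetical sketch of path traversal over per-directory key-value stores — the inode numbers are made up, and plain dicts stand in for the on-disk structures:

```python
# Each directory is its own key-value store: filename -> inode number.
# The outer dict stands in for locating a directory by its inode number.
directories = {
    2:  {"home": 10},     # inode 2 is the root directory
    10: {"user": 11},
    11: {"file": 12},
}

def lookup(path):
    """Walk a path one component at a time, one key lookup per directory."""
    inode = 2                               # start at the root directory
    for name in path.strip("/").split("/"):
        entries = directories.get(inode)
        if entries is None or name not in entries:
            return None                     # ENOENT
        inode = entries[name]
    return inode

print(lookup("/home/user/file"))   # -> 12
print(lookup("/home/missing"))     # -> None
```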
Some file systems like BTRFS use key-value stores for other on-disk structures such as snapshots, checksums, etc, too. There is even a root key-value store in BTRFS from which all these other key-value stores can be looked up. We'll see that the same concept of a "forest of trees" or a root key-value store that points to other key-value stores also appears in databases below.
Update (2025-07-21): Another good example of the connection between file systems and key-value stores is the "File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution" paper (PDF) where a key-value store is used to hold the metadata and raw blocks are used to hold file data.
The core concept in relational databases is the table, which contains the rows of the data we wish to store. The table columns are the various fields that are stored by each row. One or more columns make up the primary key by which table lookups are typically performed. The table can be implemented as a key-value store using the primary key columns as the key and the remainder of the columns as the value:
This key-value store can look up rows in the table by their Id. What if we want to look up a row by Username instead?
To enable efficient lookups by Username, a secondary key-value store called an index maintains a mapping from Username to Id. The index does not duplicate all the columns in the table, just the Username and Id. To perform a query like SELECT * FROM Users WHERE Username = 'codd', the index is first used to look up the Id and then the remainder of the columns are looked up from the table.
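A sketch of that two-step lookup in Python — plain dicts stand in for the table and index key-value stores, and the Email column and its values are invented for illustration:

```python
# The table: primary key (Id) -> the remaining columns.
users = {
    1: {"Username": "codd", "Email": "codd@example.com"},
    2: {"Username": "kent", "Email": "kent@example.com"},
}
# The secondary index: Username -> Id. It duplicates no other columns.
username_index = {"codd": 1, "kent": 2}

def select_by_username(username):
    """SELECT * FROM Users WHERE Username = ?: index first, then the table."""
    user_id = username_index.get(username)
    if user_id is None:
        return None
    row = {"Id": user_id}
    row.update(users[user_id])   # fetch the remaining columns from the table
    return row

print(select_by_username("codd"))
```

Keeping the index small (just Username and Id) means updates to other columns never touch it, at the cost of one extra lookup per indexed query.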
SQLite's file format documentation shows the details of how data is organized along these lines and the power of key-value stores. The file format has a header that references the "table b-tree" that points to the roots of all tables. This means there is an entry point key-value store that points to all the other key-value stores associated with tables, indices, etc in the database. This is similar to the forest of trees we saw in the BTRFS file system where the key-value store acts as the central data structure tying everything together.
If a disk is like an array in programming, then a key-value store is like a dict. It offers a convenient interface for storing and retrieving sparse data with good performance. Both file systems and databases are abundant with sparse data and therefore fit naturally on top of key-value stores. The actual key-value store implementations inside file systems and databases may be specialized variants of B-trees and other data structures that don't even call themselves key-value stores, but the fundamental abstraction upon which file systems and databases are built is the key-value store.
Some months ago, my colleague Madeeha Javed and I wrote a tool to convert QEMU disk images into qcow2, writing the result directly to stdout.
This tool is called qcow2-to-stdout.py and can be used for example to create a new image and pipe it through gzip and/or send it directly over the network without having to write it to disk first.
This program is included in the QEMU repository: https://github.com/qemu/qemu/blob/master/scripts/qcow2-to-stdout.py
If you simply want to use it then all you need to do is have a look at these examples:
$ qcow2-to-stdout.py source.raw > dest.qcow2
$ qcow2-to-stdout.py -f dmg source.dmg | gzip > dest.qcow2.gz
If you’re interested in the technical details, read on.
QEMU uses disk images to store the contents of the VM’s hard drive. Images are often in qcow2, QEMU’s native format, although a variety of other formats and protocols are also supported.
I have written in detail about the qcow2 format in the past (for example, here and here), but the general idea is very easy to understand: the virtual drive is divided into clusters of a certain size (64 KB by default), and only the clusters containing non-zero data need to be physically present in the qcow2 image. So what we have is essentially a collection of data clusters and a set of tables that map guest clusters (what the VM sees) to host clusters (what the qcow2 file actually stores).
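As a rough sketch of that mapping, here is how a guest offset decomposes into the indices used to walk the qcow2 L1/L2 tables, assuming the default 64 KB clusters and 8-byte L2 entries:

```python
CLUSTER_BITS = 16                    # 64 KB clusters, the qcow2 default
CLUSTER_SIZE = 1 << CLUSTER_BITS
L2_ENTRIES = CLUSTER_SIZE // 8       # 8192 entries, since each L2 entry is 8 bytes

def locate(guest_offset):
    """Split a guest offset into (L1 index, L2 index, offset within cluster)."""
    cluster = guest_offset >> CLUSTER_BITS
    return (cluster // L2_ENTRIES,           # which L2 table (index into L1)
            cluster % L2_ENTRIES,            # which entry within that L2 table
            guest_offset & (CLUSTER_SIZE - 1))

# 1 GiB into the disk is guest cluster 16384: entry 0 of the third L2 table.
print(locate(1 << 30))   # -> (2, 0, 0)
```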
qemu-img is a powerful and versatile tool that can be used to create, modify and convert disk images. It has many different options, but one question that sometimes arises is whether it can use stdin or stdout instead of regular files when converting images.
The short answer is that this is not possible in general. qemu-img convert works by checking the (virtual) size of the source image, creating a destination image of that same size and finally copying all the data from start to finish.
Reading a qcow2 image from stdin doesn’t work because data and metadata blocks can come in any arbitrary order, so it’s perfectly possible that the information that we need in order to start writing the destination image is at the end of the input data¹.
Writing a qcow2 image to stdout doesn’t work either because we need to know in advance the complete list of clusters from the source image that contain non-zero data (this is essential because it affects the destination file’s metadata). However, if we do have that information then writing a new image directly to stdout is technically possible.
The bad news is that qemu-img won’t help us here: it uses the same I/O code as the rest of QEMU. This generic approach makes total sense because it’s simple, versatile and is valid for any kind of source and destination image that QEMU supports. However, it needs random access to both images.
If we want to write a qcow2 file directly to stdout we need new code written specifically for this purpose, and since it cannot reuse the logic present in the QEMU code this was written as a separate tool (a Python script).
The process itself goes like this:
Images created with this program always have the same layout: header, refcount tables and blocks, L1 and L2 tables, and finally all data clusters.
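A simplified sketch of the first pass in Python — finding which clusters of the raw input contain non-zero data, which is the information needed before any metadata can be emitted (the real script streams from a file and also handles partial clusters, table layout, and the FUSE export described below):

```python
def nonzero_clusters(data, cluster_size=65536):
    """Return the indices of clusters that contain any non-zero byte.
    `data` stands in for the raw input image."""
    allocated = []
    for start in range(0, len(data), cluster_size):
        cluster = data[start:start + cluster_size]
        if cluster.count(0) != len(cluster):   # at least one non-zero byte
            allocated.append(start // cluster_size)
    return allocated

# A 3-cluster image where only the middle cluster holds data.
image = b"\x00" * 65536 + b"hello".ljust(65536, b"\x00") + b"\x00" * 65536
print(nonzero_clusters(image))   # -> [1]
```

Only after this pass is complete can the refcount tables, L1/L2 tables, and data clusters be written out in order.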
One problem here is that, while QEMU can read many different image formats, qcow2-to-stdout.py is an independent tool that does not share any of the code and therefore can only read raw files. The solution here is to use qemu-storage-daemon. This program is part of QEMU and it can use FUSE to export any file that QEMU can read as a raw file. The usage of qemu-storage-daemon is handled automatically and the user only needs to specify the format of the source file:
$ qcow2-to-stdout.py -f dmg source.dmg > dest.qcow2
qcow2-to-stdout.py can only create basic qcow2 files and does not support features like compression or encryption. However, a few parameters can be adjusted, like the cluster size (-c), the width of the reference count entries (-r) and whether the new image is created with the input as an external data file (-d and -R).
And this is all, I hope that you find this tool useful and this post informative. Enjoy!
This work has been developed by Igalia and sponsored by Outscale, a Dassault Systèmes brand.
¹ This problem would not happen if the input data was in raw format but in this case we would not know the size in advance.
Coinciding with the switch to the new Network Express adapters, we have updated our documentation to include the new card here.
From the introduction:
"This publication explores what PCI network adapters offer for network connections of Linux® on IBM Z® and IBM® LinuxONE. The publication applies to Network Express and RoCE Express2 or RoCE Express3 adapters."
Note that Network Express adapters now also support promiscuous mode as required by Open vSwitch in a KVM context!
A fair amount of the development work I do is related to storage performance in QEMU/KVM. Although I have written about disk I/O benchmarking and my performance analysis workflow in the past, I haven't covered the performance tools that I frequently rely on. In this post I'll go over what's in my toolbox and hopefully this will be helpful to others.
Performance analysis is hard when the system is slow but there is no clear bottleneck. If a CPU profile shows that a function is consuming significant amounts of time, then that's a good target for optimizations. On the other hand, if the profile is uniform and each function only consumes a small fraction of time, then it is difficult to gain much by optimizing just one function (although taking function call nesting into account may point towards parent functions that can be optimized):
If you are measuring just one metric, then eventually the profile will become uniform and there isn't much left to optimize. It helps to measure at multiple layers of the system in order to increase the chance of finding bottlenecks.
Here are the tools I like to use when hunting for QEMU storage performance optimizations.
kvm_stat is a tool that runs on the host and counts events from the kvm.ko kernel module, including device register accesses (MMIO), interrupt injections, and more. These are often associated with vmentry/vmexit events where control passes between the guest and the hypervisor. Each time this happens there is a cost and it is preferable to minimize the number of vmentry/vmexit events.
kvm_stat will let you identify inefficiencies when guest drivers are accessing devices as well as low-level activity (MSRs, nested page tables, etc) that can be optimized.
Here is output from an I/O intensive workload:
Event                      Total  %Total  CurAvg/s
kvm_entry                1066255    20.3     43513
kvm_exit                 1066266    20.3     43513
kvm_hv_timer_state        954944    18.2     38754
kvm_msr                   487702     9.3     19878
kvm_apicv_accept_irq      278926     5.3     11430
kvm_apic_accept_irq       278920     5.3     11430
kvm_vcpu_wakeup           250055     4.8     10128
kvm_pv_tlb_flush          250000     4.8     10123
kvm_msi_set_irq           229439     4.4      9345
kvm_ple_window_update     213309     4.1      8836
kvm_fast_mmio             123433     2.3      5000
kvm_wait_lapic_expire      39855     0.8      1628
kvm_apic_ipi                9614     0.2       457
kvm_apic                    9614     0.2       457
kvm_unmap_hva_range            9     0.0         1
kvm_fpu                       42     0.0         0
kvm_mmio                      28     0.0         0
kvm_userspace_exit            21     0.0         0
vcpu_match_mmio               19     0.0         0
kvm_emulate_insn              19     0.0         0
kvm_pio                        2     0.0         0
Total                    5258472            214493
Here I'm looking for high CurAvg/s rates. Any counters at 100k/s are well worth investigating.
Important events:
sysstat is a venerable suite of performance monitoring tools covering CPU, network, disk, and memory activity. It can be used equally within guests and on the host. It is like an extended version of the classic vmstat(8) tool.
The mpstat, pidstat, and iostat tools are the ones I use most often:
blktrace monitors Linux block I/O activity. This can be used equally within guests and on the host. Often it's interesting to capture traces in both the guest and on the host so they can be compared. If the I/O pattern is different then something in the I/O stack is modifying requests. That can be a sign of a misconfiguration.
The blktrace data can be analyzed and plotted with the btt command. For example, the latencies from driver submission to completion can be summarized to find the overhead compared to bare metal.
In its default mode, perf-top is a sampling CPU profiler. It periodically collects the CPU's program counter value so that a profile can be generated showing hot functions. It supports call graph recording with the -g option if you want to aggregate nested function calls and find out which parent functions are responsible for the most CPU usage.
perf-top (and its non-interactive perf-record/perf-report cousin) is good at identifying inner loops of programs, excessive memcpy/memset, expensive locking instructions, instructions with poor cache hit rates, etc. When I use it to profile QEMU it shows the host kernel and QEMU userspace activity. It does not show guest activity.
Here is the output without call graph recording where we can see vmexit activity, QEMU virtqueue processing, and excessive memset at the top of the profile:
3.95%  [kvm_intel]          [k] vmx_vmexit
3.37%  qemu-system-x86_64   [.] virtqueue_split_pop
2.64%  libc.so.6            [.] __memset_evex_unaligned_erms
2.50%  [kvm_intel]          [k] vmx_spec_ctrl_restore_host
1.57%  [kernel]             [k] native_irq_return_iret
1.13%  qemu-system-x86_64   [.] bdrv_co_preadv_part
1.13%  [kernel]             [k] sync_regs
1.09%  [kernel]             [k] native_write_msr
1.05%  [nvme]               [k] nvme_poll_cq
The goal is to find functions or families of functions that consume significant amounts of CPU time so they can be optimized. If the profile is uniform with most functions taking less than 1%, then the bottlenecks are more likely to be found with other tools.
perf-trace is an strace-like tool for monitoring system call activity. For performance monitoring it has a --summary option that shows time spent in system calls and the counts. This is how you can identify system calls that block for a long time or that are called too often.
Here is an example from the host showing a summary of a QEMU IOThread that uses io_uring for disk I/O and ioeventfd/irqfd for VIRTIO device activity:
IO iothread1 (332737), 1763284 events, 99.9%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
io_uring_enter 351680 0 8016.308 0.003 0.023 0.209 0.11%
write 390189 0 1238.501 0.002 0.003 0.098 0.08%
read 121057 0 305.355 0.002 0.003 0.019 0.14%
ppoll 18707 0 62.228 0.002 0.003 0.023 0.32%
We looked at kvm_stat, sysstat, blktrace, perf-top, and perf-trace. They provide performance metrics from different layers of the system. Another worthy mention is the bcc collection of eBPF-based tools that offers a huge array of performance monitoring and troubleshooting tools. Let me know which tools you like to use!
We’d like to announce the availability of the QEMU 10.0.0 release. This release contains 2800+ commits from 211 authors.
You can grab the tarball from our download page. The full list of changes are available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Canonical released a new version of their Ubuntu server offering Ubuntu Server 25.04!
Our latest offering in the IBM Z family, IBM z17, was announced yesterday. General availability will be June 18.
See the official announcement, the updated Linux support matrix, and the Technical Guide with all the insights you will need.
QEMU is participating in Google Summer of Code again this year! Google Summer of Code is an open source internship program that offers paid remote work opportunities for contributing to open source. Internships run May through August, so if you have time and want to experience open source development, read on to find out how you can apply.
Each intern is paired with one or more mentors, experienced QEMU contributors who support them during the internship. Code developed by the intern is submitted through the same open source development process that all QEMU contributions follow. This gives interns experience with contributing to open source software. Some interns then choose to pursue a career in open source software after completing their internship.
Information on who can apply for Google Summer of Code is here.
Look through the list of QEMU project ideas and see if there is something you are interested in working on. Once you have found a project idea you want to apply for, email the mentor for that project idea to ask any questions you may have and discuss the idea further.
You can apply for Google Summer of Code from March 24th to April 8th.
Good luck with your applications!
If you have questions about applying for QEMU GSoC, please email Stefan Hajnoczi or ask on the #qemu-gsoc IRC channel.
If you run into a situation where migration fails with something like
internal error: QEMU unexpectedly closed the monitor (vm='testguest'):
qemu-kvm: Some features requested in the CPU model are not available in the current configuration: pckmo-aes-256 pckmo-aes-192 pckmo-aes-128 pckmo-etdea-192 pckmo-etdea-128 pckmo-edea msa9_pckmo Consider a different accelerator, QEMU, or kernel version
This indicates that the two hosts are configured differently regarding the CPACF key management operations (PCKMO).

You can either (preferred, if your security scheme allows for it) configure both LPARs the same way, then de-activate and re-activate the LPAR,

or

you can change the CPU model of the guest to no longer use these pckmo functions. Change your guest XML from "host-model" to "host-model with some features disabled".
So shut down the guest and change the XML from
<cpu mode='host-model' check='partial'/>
to
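The target XML is not shown here, but as a hedged sketch, "host-model with some features disabled" could look like this, disabling the features named in the error message above (verify the exact feature names against your error output and libvirt version):

```xml
<!-- Hypothetical sketch: host-model with the pckmo features from the
     error message explicitly disabled. Adjust to your actual list. -->
<cpu mode='host-model' check='partial'>
  <feature policy='disable' name='pckmo-aes-256'/>
  <feature policy='disable' name='pckmo-aes-192'/>
  <feature policy='disable' name='pckmo-aes-128'/>
  <feature policy='disable' name='pckmo-etdea-192'/>
  <feature policy='disable' name='pckmo-etdea-128'/>
  <feature policy='disable' name='pckmo-edea'/>
  <feature policy='disable' name='msa9_pckmo'/>
</cpu>
```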
We’d like to announce the availability of the QEMU 9.2.0 release. This release contains 1700+ commits from 209 authors.
You can grab the tarball from our download page. The full list of changes is available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Recently, I needed to debug a problem that only occurred in RHCOS images running in secure execution mode on an IBM Z system. Since I don’t have an OCP installation at hand, I wanted to run such an image directly with QEMU or libvirt. This sounded easy at first glance, since there are qcow2 images available for RHCOS, but in the end it was quite tricky to get working, so I’d like to summarize the steps here in case it’s helpful for somebody else, too. Since the “secex” images are encrypted, you cannot play the usual tricks with e.g. guestfs here; you have to go through the ignition process of the image first. Well, maybe the right documentation for this is already available somewhere and I missed it, but most other documents mainly talk about x86 or normal (unencrypted) images (like the one for Fedora CoreOS on libvirt), so I think this summary will be helpful here anyway.
First, make sure that you have the right tools installed for this task:
sudo dnf install butane wget mkpasswd openssh virt-install qemu-img
Since we are interested in the secure execution image, we have to download the image with “secex” in the name, together with the right GPG key that is required for encrypting the config file later, for example:
wget https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.16/4.16.3/rhcos-qemu-secex.s390x.qcow2.gz
wget https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.16/4.16.3/rhcos-ignition-secex-key.gpg.pub
Finally, uncompress the image. And since we want to avoid modifying the original image, let’s also create an overlay qcow2 image for it:
gzip -d rhcos-qemu-secex.s390x.qcow2.gz
qemu-img create -f qcow2 -b rhcos-qemu-secex.s390x.qcow2 -F qcow2 rhcos.qcow2
To be able to log in to your guest via ssh later, you need an ssh key, so let’s create one and add it to your local ssh-agent:
ssh-keygen -f rhcos-key
ssh-add rhcos-key
If you also want to log in on the console via password, create a password hash with the mkpasswd tool, too.
Now create a butane configuration file and save it as “config.bu”:
variant: fcos
version: 1.4.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - INSERT_THE_CONTENTS_OF_YOUR_rhcos-key.pub_FILE_HERE
      password_hash: INSERT_THE_HASH_FROM_mkpasswd_HERE
      groups:
        - wheel
storage:
  files:
    - path: /etc/se-hostkeys/ibm-z-hostkey-1
      overwrite: true
      contents:
        local: HKD.crt
systemd:
  units:
    - name: [email protected]
      mask: false
Make sure to replace the “INSERT_…” markers in the file with the contents of your rhcos-key.pub and the hash from mkpasswd, and also make sure to have the host key document (required for encrypting the guest with genprotimg) available as HKD.crt in the current directory.
Next, the butane config file needs to be converted into an ignition file, which then needs to be encrypted with the GPG key of the RHCOS image:
butane -d . config.bu > config.ign
gpg --recipient-file rhcos-ignition-secex-key.gpg.pub --yes \
--output config.crypted --armor --encrypt config.ign
The encrypted config file can now be used to start the ignition of the guest. On s390x, the config file is not presented via the “fw_cfg” mechanism to the guest (like it is done on x86), but with a drive that has a special serial number. Thus QEMU should be started like this:
/usr/libexec/qemu-kvm -d guest_errors -accel kvm -m 4G -smp 4 -nographic \
-object s390-pv-guest,id=pv0 -machine confidential-guest-support=pv0 \
-drive if=none,id=dr1,file=rhcos.qcow2,auto-read-only=off,cache=unsafe \
-device virtio-blk,drive=dr1 -netdev user,id=n1,hostfwd=tcp::2222-:22 \
-device virtio-net-ccw,netdev=n1 \
-drive if=none,id=drv_cfg,format=raw,file=config.crypted,readonly=on \
-device virtio-blk,serial=ignition_crypted,iommu_platform=on,drive=drv_cfg
This should start the ignition process during the first boot of the guest. During future boots of the guest, you don’t have to specify the drive with the “config.crypted” file anymore.

Once the ignition is done, you can log in to the guest either on the console with the password that you created with mkpasswd, or via ssh:
ssh -p 2222 core@localhost
Now you should be able to use the image. But keep in mind that this is an rpm-ostree based image, so for installing additional packages you have to use rpm-ostree install instead of dnf install here. And the kernel can be replaced like this, for example:
sudo rpm-ostree override replace \
kernel-5.14.0-...s390x.rpm \
kernel-core-5.14.0-...s390x.rpm \
kernel-modules-5.14.0-...s390x.rpm \
kernel-modules-core-5.14.0-...s390x.rpm \
kernel-modules-extra-5.14.0-...s390x.rpm
That’s it! Now you can enjoy your configured secure-execution RHCOS image!
Special thanks to Nikita D. for helping me understand the ignition process of the secure execution images.
My KVM Forum 2024 talk "IOThread Virtqueue Mapping: Improving virtio-blk SMP scalability in QEMU" is now available on YouTube. The slides are also available here.
IOThread Virtqueue Mapping is a new QEMU feature for configuring multiple IOThreads that will handle a virtio-blk device's virtqueues. This means QEMU can take advantage of the host's Linux multi-queue block layer and assign CPUs to I/O processing. Giving additional resources to virtio-blk emulation allows QEMU to achieve higher IOPS and saturate fast local NVMe drives. This is especially important for applications that submit I/O on many vCPUs simultaneously - a workload that QEMU had trouble keeping up with in the past.
You can read more about IOThread Virtqueue Mapping in this Red Hat blog post.
We’d like to announce the availability of the QEMU 9.1.0 release. This release contains 2800+ commits from 263 authors.
You can grab the tarball from our download page. The full list of changes is available in the changelog.
Highlights include:
Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!
Ever struggled to create configuration files for starting Linux on IBM Z and LinuxONE installations? Fear no more, we got you covered now: A new assistant available online will help you create parameter files!
Writing parameter files can be a challenge, with mistakes triggering debug cycles with lengthy turnaround times. Our new installation assistant generates installer parameter files by walking you through a step-by-step process, where you answer simple questions to generate a parameter file. It comes with contextual help at every stage, so you can follow along with what is happening!
Currently supports OSA and PCI networking devices, IPv4/v6, and VLAN installations.
Currently supports RHEL 9 and SLES 15 SP5 or later.
Access the assistant at https://ibm.github.io/liz/
Step number one for the firmware on any system is sending out a DHCP request, asking the DHCP server for an IP address, the boot server (called "next server" in dhcp terms) and the bootfile.
On success the firmware will contact the boot server, fetch the bootfile and hand over control to it. The traditional method to serve the bootfile is tftp (trivial file transfer protocol); modern systems support http too. I have an article on setting up the dhcp server for virtual machines you might want to check out.
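As a sketch, an ISC dhcpd configuration handing out the boot server and bootfile could look like this; the addresses and filename below are illustrative placeholders, not from the original article:

```
# Hypothetical dhcpd.conf snippet for UEFI netboot clients
subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.100 192.168.1.200;
  next-server 192.168.1.1;      # boot server (tftp)
  filename "snponly.efi";       # bootfile handed to the firmware
}
```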
What the bootfile is expected to be depends on the system being booted. There are embedded systems -- for example IP phones -- which load the complete system software that way.
When booting UEFI systems the bootfile typically is an EFI binary. That is not the only option though, more on that below.
The traditional way to netboot linux on UEFI systems is using a bootloader. The bootfile location handed out by the DHCP server points to the bootloader, which is the first file loaded over the network. Typical choices for the bootloader are grub.efi, snponly.efi (from the ipxe project) or syslinux.efi.
Next step is the bootloader fetching the config file. That works the same way the bootloader itself was loaded, using the EFI network driver provided by either the platform firmware (typically the case for onboard NICs) or via PCI option rom (plug-in NICs). The bootloader does not need its own network drivers.
The loaded config file controls how the boot will continue. This can be very simple, three lines asking the bootloader to fetch kernel + initrd from a fixed location, then start the kernel with some command line. This can also be very complex, creating an interactive menu system where the user has dozens of options to choose from (see for example netboot.xyz).
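The simple case can be sketched as a minimal grub.cfg like the following; the paths and kernel arguments are made up for illustration:

```
# Hypothetical grub.cfg fetched over the network
menuentry 'Netboot Linux' {
    linux  (tftp)/images/vmlinuz console=ttyS0
    initrd (tftp)/images/initrd.img
}
```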
Now the user can -- in case the config file defines menus -- pick what they want to boot.
Final step is the bootloader fetching the kernel and initrd (again using the EFI network driver) and starting the kernel. Voila.
When using secure boot there is one more intermediate step needed: the first binary needs to be shim.efi, which in turn will download the actual bootloader. Most distros ship only grub.efi with a secure boot signature, which limits the bootloader choice to that.

Also, all components (shim + grub + kernel) must come from the same distribution. shim.efi has the distro's secure boot signing certificate embedded, so the Fedora shim will only boot a grub + kernel with a secure boot signature from Fedora.
You probably do not have to worry about this. Shipping systems with an EFI network driver and UEFI network boot support is a standard feature today; snponly.efi should be used for these systems.

When using older hardware, network boot support might be missing though. Should that be the case, the ipxe project can help because it also features a large collection of firmware network drivers. It ships an all-in-one EFI binary named ipxe.efi which includes the bootloader and scripting features (which are in snponly.efi too) and additionally all the ipxe hardware drivers.

That way ipxe.efi can boot from the network even if the firmware does not provide a driver. In that case ipxe.efi itself must be loaded from local storage though. You can download the EFI binary and ready-to-use ISO/USB images from boot.ipxe.org.
A UKI (unified kernel image) is an EFI binary bundle. It contains a linux kernel, an initrd, the command line and a few more optional components (not covered here) in sections of the EFI binary, plus the systemd EFI stub, which handles booting the bundled linux kernel with the bundled initrd.
One advantage is that the secure boot signature of a UKI covers all components, not only the kernel itself, which is a big step forward for linux boot security.
Another advantage is that a UKI is self-contained. It does not need a bootloader which knows how to boot linux kernels and handle initrd loading. It is simply an EFI binary which you can start any way you want, for example from the EFI shell prompt.
The latter makes UKIs interesting for network booting, because they can be used as the bootfile too. The DHCP server hands out the UKI location, the UEFI firmware fetches the UKI and starts it. Done.
Combining the bootloader and UKI approaches is possible too. UEFI bootloaders can load not only linux kernels but also EFI binaries (including UKIs), in the case of grub.efi with the chainloader command. So if you want interactive menus to choose a UKI to boot, you can do that.
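A sketch of such a menu entry (the path is a made-up placeholder):

```
# Hypothetical grub.cfg entry chainloading a UKI over the network
menuentry 'Boot UKI' {
    chainloader (tftp)/uki/linux.efi
}
```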
Modern UEFI implementations can netboot ISO images too, although there are a few restrictions.
When the UEFI firmware gets an ISO image as bootfile from the DHCP server it will load the image into a ramdisk, register the ramdisk as block device and try to boot from it.
From that point on booting works the same way booting from a local cdrom device works. The firmware will look for the boot loader on the ramdisk and load it. The bootloader will find the other components needed on the ramdisk, i.e. kernel and initrd in case of linux. All without any network access.
The UEFI firmware will also create ACPI tables for a pseudo nvdimm device. That way the booted linux kernel will find the ramdisk too.

You can use the standard Fedora / CentOS / RHEL netinst ISO image; linux will find the images/install.img on the ramdisk and boot up all the way to anaconda. With enough RAM you can even load the DVD with all packages, then do the complete system install from ramdisk.
The big advantage of this approach is that the netboot workflow becomes very similar to other installation workflows. It's not the odd kid on the block any more where loading kernel and initrd works in a completely different way. The very same ISO image can be:
Bonus: secure boot support suddenly isn't a headache any more.
There is one problem with the fancy new world though. We have lots of places in the linux world which depend on the linux kernel command line for system configuration. For example anaconda expects getting the URL of the install repository and the kickstart file that way.
When using a boot loader that is simple. The kernel command line simply goes into the boot loader config file.
With ISO images it is more complicated; changing the grub config file on an ISO image is a cumbersome process. Also ISO images are not exactly small, so install images with a customized grub.cfg need quite some storage space.
UKIs can pass through command line arguments to the linux kernel, but that is only allowed in case secure boot is disabled. When using UKIs with secure boot the best option is to use the UKIs built and signed on distro build infrastructure. Which implies using the kernel command line for customization is not going to work with secure boot enabled.
So, all of the above (and UKIs in general) will work better if we can replace the kernel command line as universal configuration vehicle with something else. That will most likely not be a single thing but a number of different approaches depending on the use case. Some steps in that direction have happened already: systemd can autodetect partitions (so booting from disk without root=... on the kernel command line works), and systemd credentials can be used to configure some aspects of a linux system. There is still a loooong way to go though.
A new video illustrating the steps to perform on a KVM host and in a virtual server configuration to make AP queues of cryptographic adapters available to a KVM guest can be found here.