FreeBSD Journal • January/February 2023
2023 Editorial Calendar
• Building a FreeBSD Web Server (January-February)
• Embedded (March-April)
• FreeBSD at 30 (May-June)
• Containers and Cloud (Virtualization) (July-August)
• FreeBSD 14 (September-October)
• To be decided (November-December)
LETTER from the Foundation

Welcome to the January/February issue. This edition features articles on several topics to aid in using FreeBSD to deploy web applications. Drew Gurkowski provides an introduction to using ZFS to manage storage. Roller Angel describes the creation of a virtual test lab built on top of jails. Finally, Thomas Munro dives into interactions between PostgreSQL and ZFS.

On a different note, Anne Dickison sat down with Ed Maste from the FreeBSD Foundation to talk about support for desktop use in FreeBSD.

In a previous welcome letter, I was eagerly anticipating a return to an in-person conference: EuroBSDCon 2022, held in Vienna, Austria. As a trip report by Kyle Evans in the November/December 2022 issue indicated, the conference was a tremendous success. In the new year, many of our beloved conferences are returning to in-person formats. FOSDEM was recently held in person, and AsiaBSDCon, BSDCan, and EuroBSDCon are all returning this year. At BSDCan in Ottawa, we will celebrate FreeBSD's 30th birthday! Editorial board members and FreeBSD Journal authors will be at these conferences, and they're eager to chat with readers or answer questions.

As always, we love to hear from readers, whether in person or via email. If you have feedback or suggestions for topics for a future article, or are interested in writing an article, please email us at [email protected].

John Baldwin
Chair of the FreeBSD Journal Editorial Board

Editorial Board
John Baldwin • Member of the FreeBSD Core Team and Chair of FreeBSD Journal Editorial Board
Benedict Reuschling • FreeBSD Documentation Committer and Member of the FreeBSD Core Team
Mariusz Zaborski • FreeBSD Developer

Advisory Board
Anne Dickison • Marketing Director, FreeBSD Foundation
Justin Gibbs • Founder of the FreeBSD Foundation, President and Treasurer of the FreeBSD Foundation Board
Daichi Goto • Director at BSD Consulting Inc. (Tokyo)
Allan Jude • CTO at Klara Inc., the global FreeBSD Professional Services and Support company
Dru Lavigne • Author of BSD Hacks and The Best of FreeBSD Basics
Michael W Lucas • Author of more than 40 books including Absolute FreeBSD, the FreeBSD Mastery series, and git commit murder
Kirk McKusick • Lead author of The Design and Implementation book series
George Neville-Neil • Past President of the FreeBSD Foundation Board, and co-author of The Design and Implementation of the FreeBSD Operating System
Hiroki Sato • Director of the FreeBSD Foundation Board, Chair of AsiaBSDCon, and Assistant Professor at Tokyo Institute of Technology
Robert N. M. Watson • Director of the FreeBSD Foundation Board, Founder of the TrustedBSD Project, and University Senior Lecturer at the University of Cambridge

S&W Publishing LLC
PO Box 3757, Chapel Hill, NC 27515-3757

Editor-at-Large • James Maurer • [email protected]
Design & Production • Reuter & Associates
In This Issue
Foundation Letter • By John Baldwin
An Introduction to ZFS • By Drew Gurkowski
We Get Letters • By Michael W Lucas
WIP/CFT: Packet Batching • By Tom Jones and John Baldwin
Events Calendar • By Anne Dickison
PostgreSQL on ZFS
BY THOMAS MUNRO

PostgreSQL is a relational database management system implementing the SQL standard, with a BSD-like license. Its pre-SQL ancestor POSTGRES began at the University of California, Berkeley, in the mid 1980s. PostgreSQL is popular on FreeBSD, where it is usually deployed on ZFS storage.
Many articles about PostgreSQL on ZFS recommend changing ZFS's recordsize setting and PostgreSQL's full_page_writes setting. The real impact of the latter setting on performance and crash-safety is not often explained, perhaps because it's not generally safe to adjust it on most popular file systems. In this article I summarize the logic and trade-offs behind this mysterious mechanism, after a brief detour to talk about block sizes.
Blocks
Nearly all of PostgreSQL's disk I/O is aligned on 8KB blocks, or pages. It is possible to recompile it to use a different size, but that is rarely done. This size may originally have been chosen to match UFS's historical default block size (though note that FreeBSD's UFS now defaults to 32KB). ZFS uses the term record size, and defaults to 128KB. Unlike other file systems, ZFS allows the record size to be changed easily at any time, and to be configured separately for each dataset.
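For example, the record size can be set when a dataset is created or changed later. A minimal sketch (the pool and dataset names, tank/pgdata, are placeholders):

zfs create -o recordsize=8K tank/pgdata
zfs set recordsize=8K tank/pgdata
zfs get recordsize tank/pgdata

Changing recordsize on an existing dataset affects only newly written blocks; existing files keep their old record size until they are rewritten.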
If the data will be accessed randomly, then in theory the size should ideally match PostgreSQL's 8KB blocks. Otherwise, random I/O could suffer from two effects:
• I/O amplification, because every read or write of an 8KB block also transfers extra neighboring data
• read-before-write, when storage blocks are not currently in the OS's cache and an 8KB block must be written, so the neighboring data must be read first
If the data will be accessed mostly sequentially, or rarely, and especially if the benefits of ZFS compression using larger records outweigh concerns about I/O bandwidth and latency, then a larger record size can be a good idea.
Some sources make a blanket recommendation of a 16KB, 32KB or 128KB record size, as a sweet spot for better compression without too much write amplification or latency. My aim here isn't to make such recommendations (I doubt there is one answer) but rather to explain what's going on.
Some applications have a mix of requirements for different kinds of data. Tablespaces can be used to store different tables in different ZFS datasets with different record sizes, compression settings or physical media. It's also possible for a table to be partitioned, for example with older data in one tablespace and current active data in another.
ALTER TABLE t
SET TABLESPACE compressed_tablespace;
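Setting up such a tablespace might look like this. This is a sketch: the dataset, location and tablespace names are illustrative, the compression choice is just an example, and the directory must be empty and owned by the postgres user before PostgreSQL will accept it:

zfs create -o recordsize=128K -o compression=zstd tank/pg_compressed
chown postgres:postgres /tank/pg_compressed
psql -U postgres -c "CREATE TABLESPACE compressed_tablespace LOCATION '/tank/pg_compressed'"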
One problem reported with small ZFS record sizes is fragmentation. A table that receives frequent random updates might finish up with blocks scattered all over the place, and we'd prefer them to be physically clustered for good sequential read performance. A simple way to ask PostgreSQL to rewrite the files that hold a table and its indexes, in order to defragment them at the ZFS level, is to issue VACUUM FULL table_name or CLUSTER table_name, if you are prepared to lock queries out of the table for the duration of the rewrite. Rewriting a table also allows a new record size to take effect, if it has been changed at the dataset level.
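Combining the two might look like this (a sketch with placeholder names; remember that VACUUM FULL holds an exclusive lock on the table for the duration of the rewrite):

zfs set recordsize=32K tank/pgdata
psql -d mydb -c "VACUUM FULL my_table"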
Torn Writes
The PostgreSQL setting full_page_writes defaults to on, and ZFS users often turn it off. The performance of write-intensive workloads then becomes faster and more consistent. For example, in a simple pgbench test on a low-end cloud VM I measured a 32% increase in transactions per second by turning it off.
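A test of that general shape (a sketch, not the exact benchmark behind that number; it assumes a scratch database called testdb already exists, and the scale and duration are arbitrary) might look like this:

pgbench -i -s 100 testdb
pgbench -c 8 -j 8 -T 120 testdb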
So what does it really do? That requires a surprising amount of background explanation. The short version is that PostgreSQL uses physiological logging for crash safety, and that means that writes to individual database pages must be atomic on power failure, or it may not be able to recover after a crash. Unless you promise that your storage stack has that property, PostgreSQL has to do some extra work to protect your data.
Atomicity on power failure is the property that if a physical write was in progress when power was lost, we can later expect to read back either the old version or the new version of a block, for some given block size, but not a partially modified or torn version. This is not to be confused with atomicity of concurrent reads and writes (see below). Physiological logging, short for physical-to-the-page, logical-within-the-page, is a term from textbook classifications of logging strategies. It means that log records identify a block to be changed by file and block number, but then describe the change to make within that page in a notation that requires us to read in the existing page to understand how to modify it "logically", rather than just updating bits at a physical address.
After a crash, the recovery algorithm can cope with the "old" page contents or the "new" page contents, applying any logged changes required to bring the page up to date. If it encounters a non-atomic mash-up of old and new data, then logical changes to the page cannot be replayed, and recovery fails! A superficial problem is that if data_checksums is enabled, then PostgreSQL's page-level checksum check will fail even to read the page in. If checksums are disabled, we'll get further, but a logical change such as "insert tuple (42,Fred) in slot 3" can't be replayed reliably. In order to apply the change in this example, we need to understand a table of slots using pre-existing meta-data on the page, but that meta-data is potentially corrupted.
Physiological logging is a very widely used technique in the database industry, and different RDBMSs have found different solutions to the problem of torn pages. Since open source systems have been developed and used on a wide variety of low-end systems, often without various forms of hardware protection against power loss, failures were common and software solutions had to be developed.
PostgreSQL’s current solution is to switch to page-level physical-only logging or full page
writes, where the whole data page is dumped into the log, for the first modification to each
data page after each checkpoint. Checkpointing is a periodic background activity, and in an
ideal world would have minimal effects on foreground transaction performance. However,
due to the first-touch rule, once a checkpoint starts, write-heavy workloads might suddenly
start generating a lot more log data, as small updates suddenly require many 8KB pages to
be logged. This effect typically decays gradually because subsequent modifications to each
page go back to being physiological, until the next checkpoint, sometimes resulting in a
sawtooth pattern in I/O bandwidth and transaction latency.
Another popular open source database has a different solution that also involves writing all data out twice, with a synchronization barrier between, since both copies can't be torn.
ZFS needs none of that! It has record-level atomicity by virtue of its own copy-on-write design. It's not possible to see a mixture of the old and new contents of a ZFS record, because it doesn't physically overwrite records, and its system of TXGs and the ZIL makes writes transactional. Therefore, it is safe to set full_page_writes=off as long as recordsize is at least 8KB.
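Turning it off might look like this (a sketch; full_page_writes can be changed with a configuration reload, no server restart required):

psql -U postgres -c "ALTER SYSTEM SET full_page_writes = off"
psql -U postgres -c "SELECT pg_reload_conf()"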
Note that ZFS itself also physically writes data twice in some scenarios. A common recommendation is to consider setting logbias=throughput for the dataset holding the main data files (but perhaps not the one holding PostgreSQL's log directory pg_wal, a topic not explored in this article). That option tries to write blocks directly into their final location instead of logging them first in the ZIL. If you use the ZFS default logbias=latency and the PostgreSQL default full_page_writes=on, data may in fact be written out four times in total, as both PostgreSQL and ZFS perform extra work to create record-level atomicity, while making both of those changes brings it down to one copy.
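With the data files and WAL on separate datasets (placeholder names again), that might be:

zfs set logbias=throughput tank/pgdata

leaving the dataset holding pg_wal at the default logbias=latency.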
Unfortunately, there are two special scenarios where full_page_writes=on is still needed for correct behavior: while running pg_basebackup and pg_rewind. Those tools are used for backups, or to create or re-synchronize streaming replicas from another server; in the case of pg_basebackup, full page writes will be silently enabled while running the command, while in the case of pg_rewind, the command will refuse to run if the setting is not manually enabled (an annoying inconsistency in current releases). These tools make raw file system-level copies of data files, along with the logs required for crash recovery to deal with consistency problems caused by concurrent changes. Here we run into a different meaning of I/O atomicity: reading from a file that might be concurrently written to. The first problem is that file systems on Linux and Windows (but not ZFS, or any file system on FreeBSD, due to the use of range locks) can show readers a random selection of before and after bits when there is an overlapping concurrent write. Furthermore, the I/O is currently done in a way that isn't suitably aligned, so even on ZFS, torn pages could be copied. To defend against that, full_page_writes behavior is needed. This problem should eventually be fixed in PostgreSQL, by copying the raw data files with appropriate alignment and interlocking. Note that ZFS snapshots can be used instead of pg_basebackup if certain precautions are taken (primarily that the snapshot must atomically capture the logs and all data files), thus reducing the impact when cloning or backing up a very busy system.
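One way to satisfy that precaution is to keep all of the PostgreSQL directories under a single parent dataset and take a recursive snapshot, which ZFS creates atomically across all descendants. A sketch with placeholder names:

zfs snapshot -r tank/pg@nightly

Every descendant dataset (for example tank/pg/data and tank/pg/wal) is then captured at a single consistent point in time.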
Recovery
We've seen how full_page_writes=off improves the performance of write transactions, and how ZFS makes that safe. Unfortunately, there can also be negative performance implications for replication and crash recovery. These activities both perform recovery, meaning that they replay the log. Although full page images are a pessimization when they're written, they act as an optimization when they're replayed at recovery time. Instead of having to perform a random synchronous read that might block recovery's serial processing loop, we have the contents of the page to be modified already in our nice sequential log, and after that it is cached.
PostgreSQL 15 includes a partial solution to this problem: it looks ahead in the log to find pages that will soon be read, and issues POSIX_FADV_WILLNEED advice to generate a configurable degree of I/O concurrency (a sort of poor man's asynchronous I/O). At the time of writing, FreeBSD ignores the advice, but a future version of OpenZFS will hopefully connect it up to FreeBSD's VFS (OpenZFS pull request #13958). Eventually, this should be replaced by a true asynchronous I/O subsystem that is currently being developed and proposed for a future version of PostgreSQL.
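If you want to experiment with this on PostgreSQL 15, the relevant settings are recovery_prefetch and maintenance_io_concurrency (a sketch; check the documentation for your version, and note that the value 32 is just an example):

psql -U postgres -c "ALTER SYSTEM SET recovery_prefetch = 'try'"
psql -U postgres -c "ALTER SYSTEM SET maintenance_io_concurrency = 32"
psql -U postgres -c "SELECT pg_reload_conf()"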
The effect of full_page_writes=off on recovery I/O stalls was studied by a group using PostgreSQL on ZFS on the illumos operating system at scale. They developed a tool called pg_prefaulter as a workaround, after finding that their streaming replicas couldn't keep up with their primary servers due to predictable I/O stalls. They may have been uniquely placed to see this effect, since most large-scale users of PostgreSQL don't even have the option of setting full_page_writes=off. pg_prefaulter may be a solution if you run into this problem, until built-in prefetching is available.
Looking Ahead
Block size alignment is likely to become a bigger topic in future PostgreSQL releases, which will hopefully include proposed direct I/O support; for now it exists only in prototype form. This coincides happily with the development of direct I/O support for OpenZFS (pull request #10018), which will probably require block size agreement to work effectively (the current prototype reverts to the ARC otherwise; some other file systems simply refuse non-aligned direct I/O). Another OpenZFS feature in the works that is likely to be very useful for databases is block cloning (pull request #13392), along with new system interfaces for FreeBSD, which PostgreSQL should hopefully be able to use for fast cloning of databases and database objects with finer granularity than whole datasets.

THOMAS MUNRO is an open source database hacker working for Microsoft Azure, who is usually logged into a FreeBSD box.
Virtual Lab –
BSD Programming Workshop
BY ROLLER ANGEL
Our virtual lab will consist of a FreeBSD host system that uses the FreeBSD jails technology to provide each system we want to install within the lab its own separate environment to run services and perform its duties. These duties could be any number of things, like serving up web pages, storing and retrieving database records, querying and answering DNS requests, caching system update files, etc. The idea is to build a solid foundation that will enable future growth for our virtual lab. Given that the nature of FreeBSD jails is to provide a lightweight system for containing our services, we can rest assured knowing that any number of rabbit holes we may find ourselves in won't limit our creativity or exploration due to reaching our resource limitations. We can have a separate environment for each idea and not have to worry about the expense of provisioning another operating system to support the services necessary for that idea to flourish. I've experimented with many different methods of hosting operating system installs for my work, and I'm pleasantly surprised by the peace of mind I get when provisioning a new jail. It's so economical! I'm no longer burdened with a financial concern and a question of how long this expense will be ongoing; it's just another environment in my ever-growing lab and doesn't inherently come with a minimum monthly fee to use it. Not to get too technical and go down a financial rabbit hole regarding the cost of electricity, internet connectivity, and host hardware; yes, I agree those things have a cost, but once they are in place, the addition of new hosts doesn't come with anywhere near the considerations and expenses that go into the initial lab setup. I recommend repurposing an existing machine as the host machine, maybe even a laptop, as it comes with a built-in battery backup, giving you time to gracefully shut down your systems in the event of a sustained power outage. The need for multiple physical network interfaces, Ethernet switches, Ethernet cables, and access points that physical hosts require is not a material concern in our virtual lab. These interfaces can be created with words in a text file, and virtual Ethernet cables can be created to connect the pieces of our virtual network.
FreeBSD Host
This host machine will need a few network interfaces. We want the host to have its own way out to the internet. This can be a good ole DHCP-assigned address provided by a local router, or your host could be performing the role of router and have a direct connection to the outside world through a modem. However your host gets its internet connection, we'll want to have an additional network interface that we can use solely for our lab network. This interface may have a name like em0, igb0, or similar, depending on the network card driver and the number of installed cards. See the FreeBSD Handbook if you'd like to get some tips on installing and configuring FreeBSD.
Configuration
Now that we've covered what we're going to do, let's get to doing it. We start with a FreeBSD host running 13.1-RELEASE. As described above, this host should have an active internet connection. We'll use that connection to download some files for use in creating our jails. The host has ntpd running, so it gets accurate time. Check for any services listening on the host with sockstat -46 and turn them off if unused. Remember that the host should be limited in what it does; we'll have plenty of fun things to do in jails inside our lab, so do your best to limit the services on the host. I plan on doing any management of my host in person by logging in with the attached keyboard and screen, so I've not enabled SSH on the host.
Now we're ready to enable jails. A simple sysrc jail_enable=YES will do the trick. No need to install any package; jail management is built into FreeBSD. Take a look at the README file in /usr/share/examples/jails for some examples of how you might configure your jails. As you will see, there are many ways to go about jail configuration. I've done my research and picked the configuration method you'll see here. You're welcome to give one of the other approaches a try and see what fits. If you do go about this task another way, please consider writing about it so others can see what you've found useful and give it a try themselves. At this point, we've verified our host machine is ready for hosting jails and have enabled the jail service, so we can do a quick reboot, double-check that minimal services are listening on the host, and move on to creating the base configuration for all our jails to use.
When editing configuration files, we'll be using vim. For basic editing tasks you really only need to know a handful of things; spend a few minutes going through the interactive exercises in the vimtutor command to get your bearings and you'll be a vim novice in no time at all.
Note: We're running all the following commands as the root user. Type sudo -i to become root.
edit jail.conf
vim /etc/jail.conf
put the following into /etc/jail.conf
$labdir="/lab";
$domain="lab.bsd.pw";
path="$labdir/$name";
host.hostname="$name.$domain";
exec.clean;
base.txz
mkdir -p /lab/media/13.1-RELEASE
cd /lab/media/13.1-RELEASE
fetch http://ftp.freebsd.org/pub/FreeBSD/releases/amd64/13.1-RELEASE/base.txz
Gateway Jail
mkdir /lab/gateway
tar -xpf /lab/media/13.1-RELEASE/base.txz -C /lab/gateway
edit jail.conf
vim /etc/jail.conf
add to the bottom of the file
gateway {
ip4=inherit;
}
feel free to add a user account to the jail with the following optional command; for this article, we're just going to be using the user root
chroot /lab/gateway adduser
set the root password for the jail
chroot /lab/gateway passwd root
set up DNS resolution using OpenDNS servers
vim /lab/gateway/etc/resolv.conf
add the following lines to resolv.conf
nameserver 208.67.222.222
nameserver 208.67.220.220
copy the host's time zone setting
cp /etc/localtime /lab/gateway/etc/
create an empty file system table
touch /lab/gateway/etc/fstab
Gateway jail.conf
At this point, we're ready to move from inheriting the ip4 network from the host to using vnet. Remove the gateway {} configuration block from /etc/jail.conf and replace it with the following:
gateway {
vnet;
vnet.interface=e0b_$name, e1b_$name;
exec.prestart+="/lab/scripts/jib addm $name lab0 labnet";
exec.poststop+="/lab/scripts/jib destroy $name";
devfs_ruleset=666;
}
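The configuration above assumes the jib helper script that ships with FreeBSD in /usr/share/examples/jails has been staged under /lab/scripts, and that a devfs ruleset numbered 666 suitable for vnet jails is defined in /etc/devfs.rules. A minimal sketch of staging the script:

mkdir -p /lab/scripts
cp /usr/share/examples/jails/jib /lab/scripts/
chmod +x /lab/scripts/jib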
create the internal LAN network for the jails in the lab
sysrc cloned_interfaces=vlan2
sysrc ifconfig_vlan2_name=labnet
sysrc ifconfig_labnet=up
service netif restart
destroy and recreate gateway
jail -vr gateway
jail -vc gateway
configure networking for gateway jail
sysrc -j gateway gateway_enable=YES
sysrc -j gateway ifconfig_e0b_gateway=SYNCDHCP
sysrc -j gateway ifconfig_e1b_gateway="inet 10.66.6.1/24"
service jail restart gateway
jexec -l gateway login -f root
test connectivity
host bsd.pw
ping -c 3 bsd.pw
exit the jail
logout
ROLLER ANGEL spends most of his time helping people learn how to accomplish their goals using technology. He's an avid FreeBSD systems administrator and Pythonista who enjoys learning amazing things that can be done with open source technology, especially FreeBSD and Python, to solve issues. He's a firm believer that people can learn anything they wish to set their minds to. Roller is always seeking creative solutions to problems and enjoys a good challenge. He's driven and motivated to learn, explore new ideas, and keep his skills sharp. He enjoys participating in the research community and sharing his ideas.
An Introduction
to ZFS
BY DREW GURKOWSKI
ZFS combines the roles of volume manager and independent file system into one, giving multiple advantages over a stand-alone file system. It is renowned for speed, flexibility, and, most importantly, taking great care to prevent data loss. While many traditional file systems had to exist on a single disk at a time, ZFS is aware of the underlying structure of the disks and creates a pool of available storage, even across multiple disks. The existing file system will grow automatically when extra disks are added to the pool, with the new space immediately becoming available to the file system.
Getting Started
FreeBSD can mount ZFS pools and datasets during system initialization. To enable it, add
this line to /etc/rc.conf:
zfs_enable="YES"
Then start the service:
# service zfs start
Identify Hardware
Before setting up ZFS, identify the device names of the disks attached to the system. A quick way of doing this is with:
# egrep 'da[0-9]|cd[0-9]' /var/run/dmesg.boot
The output should identify the device names. Examples throughout the rest of this guide will use the default SCSI names: da0, da1, and da2. If the hardware differs, make sure to use the correct device names instead.
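The destroy commands below assume a simple single-disk pool named example with two datasets, created along these lines (a sketch following the FreeBSD Handbook quick start; adjust the device name to match your hardware):

# zpool create example /dev/da0
# zfs create example/data
# zfs create example/compressed
# zfs set compression=gzip example/compressed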
To destroy the file systems and then the pool that is no longer needed:
# zfs destroy example/compressed
# zfs destroy example/data
# zpool destroy example
RAID-Z
RAID-Z pools require three or more disks but offer protection from data loss if a disk were to fail. Because ZFS pools can use multiple disks, support for RAID is inherent in the design of the file system.
To create a RAID-Z pool, specify the disks to add to the pool:
# zpool create storage raidz da0 da1 da2
With the zpool created, a new file system can be made in that pool:
# zfs create storage/home
Enable compression and store an extra copy of directories and files:
# zfs set copies=2 storage/home
# zfs set compression=gzip storage/home
A RAID-Z pool is a great place to store crucial system files, such as the home directory
for users.
# cp -rp /home/* /storage/home
# rm -rf /home /usr/home
# ln -s /storage/home /home
# ln -s /storage/home /usr/home
File system snapshots can be created to roll back to later. The snapshot name (the part after the @) can be whatever you want:
# zfs snapshot storage/home@11-01-22
ZFS creates snapshots of a dataset, allowing users to back up a file system for rollbacks or data recovery in the future.
# zfs rollback storage/home@11-01-22
To list all available snapshots of a dataset and its children, zfs list can be used:
# zfs list -r -t snapshot storage/home
Recovering RAID-Z
Every software RAID has a method of monitoring its state. View the status of RAID-Z devices using:
# zpool status -x
If all pools are Online and everything is normal, the message shows:
all pools are healthy
If there is a problem, perhaps a disk being in the Offline state, the pool state will look like
this:
pool: storage
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
scrub: none requested
config:
Data Verification
ZFS uses checksums to verify the integrity of stored data. These checksums can be verified (an operation called scrubbing) to ensure the integrity of the storage pool:
# zpool scrub storage
Only one scrub can run at a time due to the heavy input/output requirements. The length of the scrub depends on how much data is stored in the pool. After scrubbing completes, view the status with zpool status:
# zpool status storage
pool: storage
state: ONLINE
scrub: scrub completed with 0 errors on Fri Nov 4 11:19:52 2022
config:
ZFS Administration
ZFS has two main utilities for administration: The zpool utility controls the operation of
the pool and allows adding, removing, replacing, and managing disks. The zfs utility allows
creating, destroying, and managing datasets, both file systems and volumes.
While this introductory guide won’t cover ZFS administration, you can refer to zfs(8) and
zpool(8) for other ZFS options.
Further Resources
While both the non-redundant and RAID-Z pools created using this guide will work in most use cases, more complex or specialized systems may require further ZFS management and setup. This guide barely scrapes the surface of what can be done using ZFS, as it is an extremely powerful and customizable file system. The OpenZFS wiki has expansive documentation on installation, ZFS system administration, and manual pages. If tuning is required due to system architecture, ZFS tuning guides can be found on both the OpenZFS and FreeBSD wiki pages.
We Get Letters
by Michael W Lucas
Dear NS,
Your letter is something of a relief, as it provides ample distraction from the horrors of administering the web server I set up in 2017, or my Sendmail configuration from 1992. It's like getting my mind off my abscessed tooth by busting a few of my ribs. Well done!
You might be a new sysadmin, but at least you understand that you must live with your bad decisions throughout the server's lifespan. Yes, you could get all DevOps and dynamically redeploy hosts with improved settings, but all you're doing is reducing the time you must live with one set of decisions before replacing them with a different set of equally bad ones. A change of poor choices is not as good as a rest.
How do you optimize a filesystem for a database at install time? My answer is: don't. Premature optimization is the root of all evil, along with poor privilege management and nano. You have no idea how your database will interact with the filesystem until you run the application under real load. The only sensible choice is to arrange your new system so that you will have empty disks to move your database to. Yes, this is pretty much the same thing as devopsing to a new host, except it's not a new host and you don't need Ansible. If you're using one of those virtual host providers that offers block storage, that's dandy, except you'll be formatting those blocks with a filesystem. Where The Cloud is really "other people's computers," Block Storage is "other people's cast-off hard drives arranged in a Redundant Array of Inexpensive Crap." The main advantage is you're not the person who needs to trace the alarm beep to a drive tray.
If you insist on optimizing your filesystems, well, here’s what you do.
First off, understand that storage devices are lying liars that lie. The newest solid state storage maintains a malformed compatibility with hard drives released in the previous century, which were built on standards designed for punch cards, which had their roots in 17th-century looms and the Luddites, so every time you plug in a storage device you're putting someone out of work, but there's no ethical data storage under capitalism, so go for it. The main lie that needs to concern you is the sector size. Today's drives overwhelmingly use 4K sectors, except for some NVMe devices that support multiple sector sizes, but I don't have any of those so I'll pretend they don't exist. You need to make sure that the partitions (not the filesystems, the partitions) on your drives align with those 4K sector sizes. If the fancy boot loader you like requires a 98K GPT partition, it fills twenty-four and a half disk sectors. Drives claim that they'll save you, but that's another lie. That next partition better begin at 100K, a nice multiple of 4K, or all your filesystem blocks will be split between drive sectors and every interaction with the hardware will take twice as long and burn out the drive twice as quickly.
Once you have partitions, the filesystem blocks need to also be multiples of 4K. ZFS defaults to 128K stripes. UFS defaults to block sizes of 32K, which lets it use 4K fragments, so it should be good as-is, but don't get clever and think that smaller blocks mean better performance because, no matter what a bunch of old blog posts say, they don't.
There you go. You can report to management that your filesystems are tuned for your
hardware. Return to playing nethack.
But that’s the partitions, I hear some of you whine. What about the filesystems? Filesys-
tems need tuning! Balderdash, I say! You wouldn’t tuna fish, why tune a filesystem? Filesys-
tems are written for lazy people. Leave them alone, they have only caused you as much pain
as their programmers insisted on, and that was only because marketing insisted on planned
obsolescence to compel upgrades. BSD operating systems are not driven by profit, and user
pain increases support requests, so the amount of filesystem agony has been methodical-
ly reduced until there’s hardly any. Why, using filesystems barely qualifies as “torment” any-
more. Any changes are likely to increase your pain.
You’re still here? You still want advice?
Huh. You do know that therapists build careers out of helping people like you, right?
Fine.
Your filesystem should reflect your data. If you know that your data consists of, say, many 64KB files, you can set your blocks to that size. You know your own data; you should be able to figure this out. If your data is less predictable, don't optimize.
Databases can tempt even the most jaded sysadmin into optimizing their filesystem. Databases have predictable block sizes. MySQL uses a 16K block, so you could configure the underlying ZFS dataset to use a recordsize of 16K. MySQL can compress its data, as can ZFS. Attempting to compress already-compressed data wastes system resources. Study your application and pick a place to compress data.
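For instance, if you decide that ZFS should be the layer doing the compressing, the dataset for MySQL's data directory might look like this (a sketch; the pool and dataset names are placeholders, and InnoDB's own compression would then be left off):

zfs create -o recordsize=16K -o compression=lz4 tank/mysql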
Don’t configure UFS to use 16K blocks, even when it’s supporting MySQL. A UFS block
has eight fragments, and thanks to the underlying disk each fragment has a minimum size
of 4K. Putting two UFS fragments on each disk drive sector would shatter performance.
Tuning your MySQL install to use larger blocks on UFS might make sense, but again, leave
the filesystem alone.
Postgres? 8K blocks everywhere by default. Again, tuning the database to match the disk
might make sense, but 8K blocks on either ZFS or UFS ruin system performance. If your
data demands 8K blocks, however, your best option is to use ZFS and set an 8K record size.
But in general, leave bad enough alone unless you want to make things worse. Which is
a very common impulse among people who won’t get the therapy they need. But rest as-
sured, a few months of struggling to understand the interactions between applications and
filesystems will soon make you an experienced system administrator.
You poor slob.
MICHAEL W LUCAS has written over fifty books, and the UN’s “Special Ambassador To
Make Lucas Shut Up” was accidentally-on-purpose crushed beneath a stack of them. He
wrote one book about UFS and two on ZFS. As the ZFS books were co-written with Allan
Jude, they might even contain something helpful.
FreeBSD at the Rocky Mountain Celebration of Women in Computing
BY DEB GOODKIN

We were thrilled to be part of the Rocky Mountain Celebration of Women in Computing Conference, September 29-30, 2022, here in Boulder, Colorado! The Rocky Mountain Celebration of Women in Computing (RMCWiC) is a regional meeting modeled after the highly successful international Grace Hopper Celebration. The goal of RMCWiC is to encourage the research and career interests of local women in computing. RMCWiC has been held every two years since 2008, with a pause in 2020.
RMCWiC offers an opportunity for students to present their research and to network with leaders from academia, government, and industry. In this way, RMCWiC provides a unique opportunity for technical women from Colorado and neighboring states to come together.
Just a few years ago, we were gaining momentum on showcasing FreeBSD at women in computing conferences and university groups. But that came to a standstill when Covid hit. We are now kickstarting that effort to attend more of these types of events, from meetups to celebrations of women in computing conferences. So, I was thrilled when I saw the local Rocky Mountain Celebration of Women in Computing was taking place right here in Boulder, Colorado in September!
First, I love these events because they are smaller and it's easier to talk to attendees about FreeBSD. This event brought in around 300 attendees from Colorado and surrounding states. I always love the energy of young folks as they meet others with similar interests in computing, while they learn from amazing role models in various technology fields.
I had the opportunity to give a talk on open source and why people should get involved. Of course, I used FreeBSD as an example of an open source project they should consider. After my talk, there was a career fair, where Justin Gibbs and I staffed a FreeBSD table, giving us the opportunity to talk with many of the attendees about FreeBSD. It was crazy loud, and everyone was wearing masks, so it was difficult, but we made it work. We had lots of attendees stop by our table to talk and ask us questions.
All in all, I'd say this was a great event for the Project, the students, and the Foundation. We always appreciate an opportunity to educate people about FreeBSD and encourage them to contribute to the project.
In 2023, we will be identifying a few women in computing conferences that we'd like to attend. Let us know if there is one you are familiar with that we should attend. Or maybe you'd like to present at one and staff a table in their career fair. We're here to support you if you choose that path!

DEB GOODKIN is the Executive Director of the FreeBSD Foundation. She's thrilled to be in her 15th year at the Foundation and is proud of her hardworking and dedicated team. She spent over 20 years in the data storage industry in engineering development, technical sales, and technical marketing. When not working, you'll find her on her road or mountain bike, running, hiking with her dogs, skiing the slopes of Colorado, or reading a good book.
Packet Batching
BY TOM JONES AND JOHN BALDWIN
In the last 30 years, the computers we use have grown unimaginably faster. The 1995 Alpha AXP paper talked about designing for machines continuing the trend of the previous 25 years: getting 1,000 times faster.
We have certainly managed to meet that goal; the 386 machines that were the original target of FreeBSD are akin to the microcontrollers we use in keyboards today.
Even with these changes, the core of computer performance has remained the same: execute fewer instructions per unit of work, and things will go faster. This fundamental truth underlying networking has led to several different approaches to improving performance. We have worked on mechanisms that move work away from our CPU and, instead, into the network card, such as checksum offload. If the card runs the instructions to checksum outgoing packets, then our precious CPU time can be spent doing other things.
Checksum offload saw great results, and we started to move other things away from the CPU and into the network interface. TCP Segmentation Offload (TSO) was the next great mechanism that improved performance for a network sender. Rather than forming the IP packets for the TCP segments we will send, we can form one template and send that with a large block of data to the card. The network interface handles the segmenting as it places the packets onto the wire. TSO gives huge benefits to a TCP sender, providing us the ability to saturate 10-Gigabit network interfaces well before we run out of even a single core.
TSO lets us be more efficient with precious resources. We reduce the number of bus (memory and PCI) transactions required to send each packet by batching them together and creating the final chunks at the point of transmission. This is straightforward for TCP to do most of the time: if we are bulk sending a stream of data, the chunking of the data is clear. To mirror these improvements on the TCP receiver, we have Large Receive Offload (LRO). LRO lets us again reduce the number of transactions required to maintain high-rate data transfers.
For UDP, Linux has generic mechanisms that attempt to replicate TSO-like behavior. This support comes with Generic Segmentation Offload (GSO) and Generic Receive Offload (GRO). GSO enables huge improvements, on the order of 20%, for a UDP sender; GRO is more difficult to measure, but the mechanism is there.
FreeBSD has excellent support for TSO and LRO, but is lacking mechanisms similar to GSO and GRO. At EuroBSDCon in Vienna last year, I spoke to John Baldwin about a mechanism similar to GRO that he is working on, which he calls Packet Batching.
Packet Batching
TJ: What is the background to the packet batching work?
JB: The idea of packet batching on receive has been around for a while, at least in the form
of a wish list item I’ve heard various people mention several times. We already have some
forms of packet batching specific to TCP for both sending (TSO) and receiving (LRO). This
packet batching aims to be more generic than LRO so that it can apply to other protocols
(primarily UDP).
TJ: Why is the work needed?
JB: The goal of packet batching approaches such as TSO and LRO is to amortize per-packet costs (various checks in the network stack on header fields, etc.) by doing them once per batch rather than once per packet. The cost of per-packet overheads becomes an increasingly worse problem as network speeds increase faster than CPU speeds. It is true that one of the fixes for this problem in general, which does help with per-packet overhead, is horizontal scaling by using RSS to distribute packets across separate queues bound to different CPUs. However, you can't distribute a single flow across multiple cores, and batching schemes are aimed at making the performance of a single queue more efficient.
TJ: What new features/enhancements does the work make possible?
JB: The goal is higher PPS and/or reduced CPU usage for network receive workloads. I don't expect it to help with TCP when LRO is enabled; it is mostly intended to help with UDP.
TJ: How can people test the work? Normally we need to emphasize testing with more diverse workloads; does this apply here?
JB: Benchmarking would be welcome. My initial set of simple benchmarks using iperf3 was mixed and not a clear enough win to justify the changes. The changes do add complexity, so it needs to be a clear win in some workloads, I think, before it should be considered a commit candidate. I have not observed any regressions in my benchmarks to date, just meager to zero gains.
TJ: How would you like feedback?
JB: E-mail directly to me is probably the best way to send feedback for now. At some point in the future, I will start a public RFC thread on net@ and/or arch@, at which point that thread will be the best place to send feedback. Folks wishing to test the patches or review them can find them at https://github.com/freebsd/freebsd-src/compare/main...bsdjhb:freebsd:cxgbe_batching.
From John's responses here, it isn't yet clear where the benefits should be seen. iperf3 measurements can't simulate the workload of a very busy server. For Packet Batching to offer a benefit in FreeBSD, it is likely that more workloads need to be tested and tuned for. By pulling down John's GitHub branch and experimenting with your network traffic, you can help establish a new receiver optimization in FreeBSD.
TOM JONES wants FreeBSD-based projects to get the attention they deserve. He lives in
the North East of Scotland and offers FreeBSD consulting.
JOHN BALDWIN is a systems software developer. He has directly committed changes to the
FreeBSD operating system for 20 years across various parts of the kernel (including x86 plat-
form support, SMP, various device drivers, and the virtual memory subsystem) and userspace
programs. In addition to writing code, John has served on the FreeBSD core and release en-
gineering teams. He has also contributed to the GDB debugger and LLVM. John lives in Con-
cord, California, with his wife, Kimberly, and three children: Janelle, Evan, and Bella.
The Foundation
and the FreeBSD
Desktop
BY ANNE DICKISON
The desktop experience can be formative. I got my first PC in 1990 as an 8th grade graduation gift. (Thanks, Dad!) It helped instill my interest in computers, and it got me through high school. I used it mostly for playing Zork and Jeopardy and, of course, writing papers in WordPerfect. The interface was rather clunky, but for the purposes of a small-town high school student in the 90s, it worked quite well. Once college came about, a new machine came my way, along with a GUI that made things work so much better. Using a computer became part of everyday life. In fact, one of the selling points of my university was that every dorm had its own desktop. Fast forward 20+ years, and the standards for a usable desktop are quite high. Intuitive, fast, pretty graphics and speedy wi-fi are all expected. FreeBSD's desktop experience over the years has had its ups and downs. Twenty or so years ago, FreeBSD and Linux were mostly neck and neck in terms of desktop usability. Unfortunately, as time went on, FreeBSD did fall behind, as the desktop experience became a lower priority. However, catch-up eventually ensued, and within the last 10 or so years, focusing on the desktop has become of increasingly greater importance for many members of the community. To help understand more about the Foundation's work on the desktop experience, we sat down with Ed Maste, Senior Director of Technology.
Unsurprisingly, one question the Foundation often gets is where the desktop experience falls in our list of priorities. The answer: well, it varies. Because the Foundation's main goal is to support the Project in technical areas that aren't being fully addressed by the community, the desktop sponsored work ebbs and flows. When work stagnated about 10 years ago and the Project began to fall behind in terms of hardware support, the Foundation funded Kostik Belousov to work on Intel graphics drivers. More recently, though, the Project has moved to using the Linux kernel programming interface (LinuxKPI) compatibility layer to help keep drivers up to date. The Foundation funded Bjorn Zeeb to work on the wireless side, and about 2 years ago, it funded Emmanuel Vadot to work on graphics drivers.
These days, the FreeBSD community has continued the graphics work via the LinuxKPI, while the Foundation is funding Bjorn to do the same on the wireless side. The net result is that generally, you can take a contemporary x86 laptop or desktop system, and the graphics and wireless will just work. The hope with this method is that as each new generation of hardware comes out, we'll be able to take the latest upstream drivers and just use them without any significant rework to make them work on FreeBSD. Ed notes that while using the LinuxKPI might not be the most popular solution, it does seem to be the most developer-efficient way to keep the drivers up to date.
"In an ideal world, with unlimited resources and an unlimited supply of qualified technical people, I would just have developers create bespoke FreeBSD drivers. While the current method may have its detractors, the result is that FreeBSD has a working driver that is performant and featureful, and that should allow us to basically remain up to date."
Speaking of up to date, Ed was quick to mention one caveat when it comes to wireless drivers. While the wi-fi does work out of the box on many desktop systems, the speed is sometimes lacking in comparison to contemporary wi-fi standards. That doesn't mean you can't use FreeBSD as your daily desktop, though. It's fast enough for video, conference calls, and web browsing. Ed mentioned how Bjorn's work has made the wi-fi on his Framework laptop stable and reliable. But when it comes to downloading large files, you will notice slower speeds. The Foundation has extended Bjorn's contract into 2023, and he is working on those standards now with the goal of having support available in FreeBSD 14.0, if not 13.2.
However, as mentioned above, FreeBSD can be used as your daily driver, an aspect that is very important to Ed and the Foundation. One of the reasons the Foundation has chosen to support the wi-fi efforts as of late is that there's a huge amount of value in being able to use the operating system that you're developing on as your desktop machine. In fact, Ed sees that as being connected to the Project's long-term viability and the ability to bring on new users.
"Let's take someone who is in university. I think it really is the case that FreeBSD is the best operating system for someone who is interested in learning about operating system internals, someone who wants to become an operating system developer or wants to explore and learn about operating systems. FreeBSD is advanced enough that it can do what you need, but you can still find a niche and make your own impact. But, without a user-friendly desktop experience, it's hard to make the argument that someone should try FreeBSD if they're already familiar with Linux on their laptop."
Thanks to great work from members of the community, along with Foundation-supported efforts in key areas, the FreeBSD desktop experience is on a positive trajectory. As we head into 2023, Ed says the Foundation plans to continue to support Bjorn's wi-fi work and take another look at the installer to help make sure that you're able to get a usable graphical desktop environment out of the box. Of course, that all may change as 2023 progresses, but ultimately, Ed and his team are dedicated to working with other community members to produce a modern and user-friendly desktop experience.
ANNE DICKISON joined the Foundation in 2015 and brings over 20 years of experience in technology-focused marketing and communications. Specifically, her work as the Marketing Director and then Co-Executive Director of the USENIX Association helped instill her commitment to spreading the word about the importance of free and open source technologies.
Events Calendar
BY ANNE DICKISON

SCaLE 20X
March 9-12, 2023
Pasadena, CA
https://www.socallinuxexpo.org/blog/scale-20x
SCaLE is the largest community-run open-source and free software conference in North
America. It is held annually in the greater Los Angeles area. Roller Angel will also be hosting a
FreeBSD workshop during the conference.
AsiaBSDCon 2023
March 30-April 2, 2023
Tokyo, Japan
https://2023.asiabsdcon.org/
AsiaBSDCon is for anyone developing, deploying and using systems based on FreeBSD,
NetBSD, OpenBSD, DragonFlyBSD, Darwin and MacOS X. It is a technical conference and
aims to collect the best technical papers and presentations available to ensure that the latest
developments in our open source community are shared with the widest possible audience.