virtual memory operations. The new VM system implemented the concept of vir-
tual file caching—a departure from the traditional physical file cache (known as
the “buffer cache” in previous versions of UNIX). The old buffer cache was layered
under the file systems and was responsible for caching physical blocks from the file
system to the storage device. The new model is layered above the file systems and
allows the VM system to act as a cache for files rather than blocks. The new sys-
tem caches page-sized pieces of files, whereby the file and a particular offset are
cached as pages of memory. From this point forward, the buffer cache was used
only for file system metadata, and the VM system implemented the file system
caching. The introduction of virtual file caching affected file systems in many
ways and required significant changes to the vnode interface. At that point, UFS
was substantially modified to support the new vnode and VM interfaces.
The third major change to UFS came about in Solaris 2.4 in the year 1994 with
the introduction of file system metadata logging in an effort to provide better reli-
ability and faster reboot times after a system crash or outage. The first versions of
logging were introduced with the unbundled Online: DiskSuite 3.0 software package, the precursor to the Solstice DiskSuite (SDS) product and to the Solaris Volume Manager (SVM) as it is known today. Solaris 7 saw the integration of logging into
UFS, and after six years of development, Solaris 10 shipped with logging turned on
by default. Table 15.1 summarizes the major UFS development milestones.
UFS is built around the concept of a disk’s geometry, which is described as the
number of sectors in a track, the location of the head, and the number of tracks.
UFS uses a hybrid block allocation strategy that allocates full blocks or smaller parts of the block called fragments. A block is a set of contiguous fragments starting on a particular boundary. This boundary is determined by the size of a fragment and the number of fragments that constitute a block. For example, fragment 32 and block 32 both relate to the same physical location on disk. Although the next fragment on disk is 33, followed by 34, 35, 36, 37, and so on, the next block is at 40, which begins on fragment 40. This is true in the case of an 8-Kbyte block size and a 1-Kbyte fragment size, where 8 fragments constitute a file system block.
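To make the fragment-to-block relationship concrete, here is a minimal sketch (our own illustration, assuming the default 8-Kbyte block and 1-Kbyte fragment sizes mentioned above) that rounds a fragment number down to the block containing it:

#include <stdio.h>

/* Illustrative only: assumes the default 8-Kbyte block and 1-Kbyte fragment
 * sizes discussed above, so 8 fragments make up one block. */
#define FRAGS_PER_BLOCK 8

/* A block always starts on a fragment number that is a multiple of
 * FRAGS_PER_BLOCK, so rounding down gives the enclosing block. */
static int
block_of_fragment(int fragno)
{
        return (fragno - (fragno % FRAGS_PER_BLOCK));
}

int
main(void)
{
        int frag;

        for (frag = 32; frag <= 41; frag++)
                printf("fragment %2d lies in block starting at fragment %2d\n",
                    frag, block_of_fragment(frag));
        return (0);
}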
[Figure: The in-core inode (struct inode) embeds the on-disk inode fields as a struct icommon in its i_ic field and points to the file system's struct ufsvfs through i_ufsvfs.]
The on-disk portion of the inode is defined by struct icommon:
struct icommon {
o_mode_t ic_smode; /* 0: mode and type of file */
short ic_nlink; /* 2: number of links to file */
o_uid_t ic_suid; /* 4: owner's user id */
o_gid_t ic_sgid; /* 6: owner's group id */
u_offset_t ic_lsize; /* 8: number of bytes in file */
#ifdef _KERNEL
struct timeval32 ic_atime; /* 16: time last accessed */
struct timeval32 ic_mtime; /* 24: time last modified */
struct timeval32 ic_ctime; /* 32: last time inode changed */
#else
time32_t ic_atime; /* 16: time last accessed */
int32_t ic_atspare;
time32_t ic_mtime; /* 24: time last modified */
int32_t ic_mtspare;
time32_t ic_ctime; /* 32: last time inode changed */
int32_t ic_ctspare;
#endif
daddr32_t ic_db[NDADDR]; /* 40: disk block addresses */
daddr32_t ic_ib[NIADDR]; /* 88: indirect blocks */
int32_t ic_flags; /* 100: cflags */
int32_t ic_blocks; /* 104: 512 byte blocks actually held */
int32_t ic_gen; /* 108: generation number */
int32_t ic_shadow; /* 112: shadow inode */
uid_t ic_uid; /* 116: long EFT version of uid */
gid_t ic_gid; /* 120: long EFT version of gid */
uint32_t ic_oeftflag; /* 124: extended attr directory ino, 0 = none */
};
See usr/src/uts/common/sys/fs/ufs_inode.h
Most of the fields are self-explanatory, but a couple of them need a bit of help:
ic_smode. Indicates the type of inode. The type can be zero, a special node (IFCHR, IFBLK, IFIFO, IFSOCK), a symbolic link (IFLNK), a directory (IFDIR), a regular file (IFREG), or an extended metadata inode (IFSHAD, IFATTRDIR). Type zero indicates that the inode is not in use, and ic_nlink should then be zero unless logging's reclaim_needed flag is set. No data blocks are associated with the special nodes, which are used for character and block devices, pipes, and sockets. The remaining types indicate whether this inode is a directory, a regular file, a shadow inode, or an extended attribute directory.
ic_nlink. Refers to the number of links to a file, that is, the number of names in the namespace that correspond to a specific file identifier. A regular file has a link count of 1 because only one name in the namespace corresponds to that particular file identifier. A directory has a link count of 2 by default: one for the name of the directory itself, and one for the “.” entry within the directory. Each subdirectory within a directory causes the link count to be incremented by 1 because of the subdirectory's “..” entry, which counts against the parent directory only. Since the limit is 32,767, a directory can hold at most 32,765 subdirectories, and that is also the limit on the total number of links. (A short stat(2) example follows this list of fields.)
ic_db. Is an array that holds 12 pointers to data blocks. These are called the direct blocks. On a system with a block size of 8192 bytes (8 Kbytes), the direct blocks can accommodate up to 98,304 bytes, or 96 Kbytes. If the file consists entirely of direct blocks, then the last block of the file (not the last ic_db entry) may contain fragments. If the file size exceeds the capacity of the ic_db array, then the block list for the file must consist entirely of full-sized file system blocks.
ic_ib. Is a small array of only three pointers but allows a file to be up to one terabyte in size. How does this work? Well, the first entry in ic_ib points to a block that stores 2,048 block addresses. A file with a single indirect block can accommodate up to 8192 * (12 + 2048) bytes, or roughly 16 Mbytes. If more storage is required, another level of indirection is added and the second indirect block is used. The second entry in ic_ib points to 2,048 block addresses, and each of those 2,048 entries points to another block containing 2,048 entries that finally point to the data blocks. With two levels of indirection, a file can accommodate up to 8192 * (12 + 2048 + (2048 * 2048)) bytes, or roughly 32 Gbytes. A third level of indirection permits the file to be 8192 * (12 + 2048 + (2048 * 2048) + (2048 * 2048 * 2048)) = 70,403,120,791,552 bytes long or—yes, you guessed it—roughly 64 Tbytes! However, since all addresses must be addressable as fragments, that is, a 31-bit count, the maximum is 2 Tbytes (2^31 * 1 Kbyte). Multiterabyte UFS (MTBUFS) enables 16-Tbyte file system sizes by enforcing a minimum fragment size of 8 Kbytes, which gives 2^31 * 8 Kbytes, or 16 Tbytes. (A short calculation sketch reproducing these numbers appears after Figure 15.2.)
Figure 15.2 illustrates the layout.
ic_shadow. If non-zero, contains the number of an inode providing shadow
metadata (usually, this data would be ACLs).
ic_oeftflag. If non-zero, contains the number of an inode of type
IFATTRDIR, which is a directory containing extended attribute files.
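As a quick user-level check of the link-count rules described for ic_nlink above, the following short program (ours, not part of UFS) prints a directory's link count with stat(2); a directory holding N subdirectories reports N + 2 links:

#include <sys/stat.h>
#include <stdio.h>

/* Print the link count of a file or directory.  For a directory with N
 * subdirectories, st_nlink is N + 2: one link for the directory's own name,
 * one for its "." entry, and one for each subdirectory's ".." entry. */
int
main(int argc, char **argv)
{
        struct stat sb;

        if (argc != 2) {
                fprintf(stderr, "usage: %s path\n", argv[0]);
                return (1);
        }
        if (stat(argv[1], &sb) != 0) {
                perror("stat");
                return (1);
        }
        printf("%s: %lu links\n", argv[1], (unsigned long)sb.st_nlink);
        return (0);
}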
[Figure 15.2: Layout of the direct (ic_db, 12 entries) and indirect (ic_ib, 3 entries) block pointers. Each indirect block holds 2,048 block addresses; the second and third ic_ib entries add further levels of 2,048-way indirection before reaching the data blocks.]
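The capacities quoted for the direct and indirect blocks can be reproduced with a few lines of arithmetic. This sketch (ours, assuming the 8-Kbyte block size and 2,048 addresses per indirect block used in the text) prints the same numbers:

#include <stdio.h>

int
main(void)
{
        unsigned long long bsize = 8192;        /* file system block size */
        unsigned long long nindir = 8192 / 4;   /* 2,048 32-bit addresses per indirect block */
        unsigned long long direct = 12;         /* NDADDR direct pointers */

        unsigned long long single = direct + nindir;
        unsigned long long dbl    = single + nindir * nindir;
        unsigned long long triple = dbl + nindir * nindir * nindir;

        printf("direct only:     %llu bytes\n", direct * bsize);  /* 98,304 (96 Kbytes) */
        printf("single indirect: %llu bytes\n", single * bsize);  /* ~16 Mbytes */
        printf("double indirect: %llu bytes\n", dbl * bsize);     /* ~32 Gbytes */
        printf("triple indirect: %llu bytes\n", triple * bsize);  /* 70,403,120,791,552 (~64 Tbytes) */
        return (0);
}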
The directory itself is stored in a file as a series of chunks, which are groups of
the directory entries. Earlier file systems like the System V file system had a fixed
directory record length, which meant that a lot of space would be wasted if provision
was made for long file names. In the UFS, each directory entry can be of variable
length, thus providing a mechanism for long file names without a lot of wasted
space. UFS file names can be up to 255 characters long.
The group of directory chunks that constitute a directory is stored as a special
type of file. The notion of a directory as a type of file allows UFS to implement a
hierarchical directory structure: Directories can contain files that are directories.
For example, the root directory has a name, “/”, and an inode number, 2, which
holds a chunk of directory entries holding a number of files and directories. One of
these directory entries, named etc, is another directory containing more files and
directories. For traversal up and down the file system, the chdir system call opens
the directory file in question and then sets the current working directory to point
to the new directory file. Figure 15.3 illustrates the directory hierarchy.
[Figure 15.3: The directory hierarchy. Each directory entry (for example, passwd, group, and hosts under etc) points to an inode, which may itself be another directory.]
Each directory contains two special files. The file named “.” is a link to the direc-
tory itself; the file named “..” is a link to the parent directory. Thus, a change of
directory to .. leads to the parent directory.
Now let’s switch gears and see what the on-disk structures for directories look
like.
The contents of a directory are broken up into DIRBLKSIZ chunks, also known
as dirblks. Each of these contains one or more direct structures. DIRBLKSIZ was
chosen to be the same as the size of a disk sector so that modifications to directory
entries could be done atomically on the assumption that a sector write either com-
pletes successfully or fails (which can no longer be guaranteed with the advance-
ment of cached hard drives).
Each directory entry is stored in a structure called direct that contains the
inode number (d_ino), the length of the entry (d_reclen), the length of the name
(d_namlen), and a null-terminated string for the name itself (d_name).
struct direct {
uint32_t d_ino; /* inode number of entry */
ushort_t d_reclen; /* length of this record */
ushort_t d_namlen; /* length of string in d_name */
char d_name[MAXNAMLEN + 1]; /* name must be no longer than this */
};
See usr/src/uts/common/sys/fs/ufs_fsdir.h
d_reclen includes the space consumed by all the fields in a directory entry,
including d_name’s trailing null character. This facilitates directory entry deletion
because when an entry is deleted, if it is not the first entry in the current direc-
tory, the entry before it is grown to include the deleted one, that is, d_reclen is
incremented to account for the size of the next entry. The procedure is relatively
inexpensive and helps keep internal fragmentation down. Figure 15.4 illustrates
the concept of directory deletion.
[Figure 15.4: Directory entry deletion. Initial directory contents show each entry's d_ino, d_reclen, d_name, and excess space; after deletion, the preceding entry's d_reclen is extended to cover the deleted entry.]
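The deletion scheme just described can be sketched in a few lines of C. The helper below is illustrative only (it is not the UFS implementation); it walks one directory chunk and merges a deleted entry's record length into its predecessor:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAXNAMLEN 255

/* Same shape as the on-disk struct direct shown earlier (the real header
 * uses the Solaris ushort_t typedef for the 16-bit fields). */
struct direct {
        uint32_t d_ino;                 /* inode number of entry, 0 = unused */
        uint16_t d_reclen;              /* length of this record */
        uint16_t d_namlen;              /* length of string in d_name */
        char     d_name[MAXNAMLEN + 1];
};

/*
 * Delete the entry named 'name' from one DIRBLKSIZ chunk by growing the
 * previous entry's d_reclen to cover it.  (UFS handles the first entry of a
 * chunk differently: its d_ino is simply set to 0.)
 */
void
dirblk_remove(char *dirblk, size_t dirblksiz, const char *name)
{
        struct direct *dp, *prev = NULL;
        size_t off = 0;

        while (off < dirblksiz) {
                dp = (struct direct *)(dirblk + off);
                if (dp->d_reclen == 0)
                        break;                              /* malformed chunk */
                if (dp->d_ino != 0 && strcmp(dp->d_name, name) == 0) {
                        if (prev == NULL)
                                dp->d_ino = 0;              /* first entry in chunk */
                        else
                                prev->d_reclen += dp->d_reclen;  /* absorb the record */
                        return;
                }
                prev = dp;
                off += dp->d_reclen;
        }
}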
The way a shadow inode is laid out on disk is quite simple (see Figure 15.6). Each entry in the shadow inode carries a header that includes the type of the data and the length of the whole record (data plus header). The entries are simply concatenated and stored to disk as a separate inode whose ic_smode is set to IFSHAD. The parent's ic_shadow is then updated to point to this shadow inode.
[Figure 15.6: Shadow inode layout. The inode referenced by ic_shadow holds a sequence of records, each consisting of a header (fsd_type, fsd_size) followed by fsd_data, where fsd_size covers the header plus the data.]
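For reference, the record header drawn in Figure 15.6 corresponds to the ufs_fsd structure in usr/src/uts/common/sys/fs/ufs_acl.h; the sketch below is shown in paraphrased form, so the comments are approximate rather than verbatim:

/* On-disk shadow inode record: a header (type and total size) followed by
 * the data itself; records are simply concatenated within the shadow inode. */
typedef struct ufs_fsd {
        int     fsd_type;       /* type of data, e.g., ACL entries */
        int     fsd_size;       /* size in bytes of the header plus fsd_data */
        char    fsd_data[1];    /* data follows (actually longer) */
} ufs_fsd_t;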
[Figure: On-disk layout of a UFS file system — boot block, superblock, cylinder group metadata, and data blocks, with the pattern repeating for each cylinder group. Callouts: inodes contain disk block address pointers; disk addresses are 31-bit counts of file system fragments, so the address range is limited to 2^31 x 1 Kbyte for the default file system configuration, yielding a maximum file size of 1 Tbyte.]
The file system configuration parameters also reside in the superblock. The file
system parameters include some of the following, which are configured at the time
the file system is constructed. You can tune the parameters later with the tunefs
command.
And here are the significant logging related fields in the superblock:
fs_rolled. Indicates whether any data in the log still needs to be rolled to the file system.
fs_si. Indicates whether the logging summary information is up to date or whether it needs to be recalculated from the cylinder groups.
fs_clean. Is set to FS_LOG for a logging file system.
fs_logbno. Is the disk block number of the logging metadata.
fs_reclaim. Is set to indicate whether the reclaim thread is running or needs to be run.
The last cylinder group in a file system may be incomplete because the number of cylinders in a disk drive does not usually divide evenly into cylinder groups. In this case, we simply reduce the number of data blocks available in the last cylinder group; the metadata portion of the cylinder group, however, stays the same throughout the file system. The cg_ncyl and cg_ndblk fields of the cylinder group structure record the actual size so that we don't accidentally go out of bounds.
/*
* Cylinder group block for a file system.
*
* Writable fields in the cylinder group are protected by the associated
* super block lock fs->fs_lock.
*/
#define CG_MAGIC 0x090255
struct cg {
uint32_t cg_link; /* NOT USED linked list of cyl groups */
int32_t cg_magic; /* magic number */
time32_t cg_time; /* time last written */
int32_t cg_cgx; /* we are the cgx'th cylinder group */
short cg_ncyl; /* number of cyl's this cg */
short cg_niblk; /* number of inode blocks this cg */
int32_t cg_ndblk; /* number of data blocks this cg */
struct csum cg_cs; /* cylinder summary information */
int32_t cg_rotor; /* position of last used block */
int32_t cg_frotor; /* position of last used frag */
int32_t cg_irotor; /* position of last used inode */
int32_t cg_frsum[MAXFRAG]; /* counts of available frags */
int32_t cg_btotoff; /* (int32_t)block totals per cylinder */
int32_t cg_boff; /* (short) free block positions */
int32_t cg_iusedoff; /* (char) used inode map */
int32_t cg_freeoff; /* (uchar_t) free block map */
int32_t cg_nextfreeoff; /* (uchar_t) next available space */
int32_t cg_sparecon[16]; /* reserved for future use */
uchar_t cg_space[1]; /* space for cylinder group maps */
/* actually longer */
};
See usr/src/uts/common/sys/fs/ufs_fs.h
[Figure: The UFS I/O architecture. Vnode operations (open(), close(), creat(), read(), write(), link(), unlink(), rename(), mkdir(), rmdir(), seek(), fsync(), ioctl(), getpage()/putpage()) sit above the directory implementation, the directory name lookup cache, the metadata (inode) cache, and the block map (bmap_read()/bmap_write()), which provides the file/offset to disk-address mapping through the direct and indirect blocks. The file segment driver (seg_map, via _pagecreate(), _getmap(), and _release()) maps files into the kernel address space, and the VM system (pagelookup(), pageexists(), pvn_readkluster(), pvn_readdone(), pvn_writedone()) reads and writes vnode pages to and from disk. Cached I/O goes through bread()/bwrite() and the buffer cache (sized by BUFHWM); noncached I/O goes through bdev_strategy() to the sd/ssd drivers.]
The inode (Index Node) is UFS’s internal descriptor for a file. Each file system has
two forms of an inode: the on-disk inode and the in-core (in-memory) inode. The
on-disk inode resides on the physical medium and represents the on-disk format
and layout of the file.
See usr/src/uts/common/sys/fs/ufs_inode.h
New with Solaris 10, an inode sequence number was added to the in-core inode
structure to support NFSv3 and NFSv4 detection of atomic changes to the inode.
Two caveats with this new value: i_seq must be updated if i_ctime and i_mtime
are changed; the value of i_seq is only guaranteed to be persistent while the
inode is active.
Idle queue. Holds the idle or unreferenced inodes (where v_count equals 1 and i_nlink is greater than 0). This queue is managed by the
global file system idle thread, which frees entries, starting at the head. When
new entries are added, ufs_inactive() adds an inode to the head if the
inode has no associated pages; otherwise, the inode is added to the tail. This
ensures that pages are retained longer in memory for possible reuse—the
frees are done starting at the head.
[Figure: The inode free (idle) list.]
Starting with Solaris 10, the idle queue architecture was reorganized into
two separate hash queues: ufs_useful_iq and ufs_junk_iq. If an inode
has pages associated with it (vn_has_cached_data(vnode)) or is a fast
symbolic link (i_flag and IFASTSYMLNK), then it is attached to the useful
idle queue. All other inodes are attached to the junk idle queue instead.
These queues are not used for searching but only for grouping geographically
local inodes for faster updates and fewer disk seeks upon reuse. Entries from
the junk idle queue are destroyed first when ufs_idle_free() is invoked
by the UFS idle thread so that cached pages pertaining to entries in the
ufs_useful_iq idle queue stay in memory longer.
The idle thread is scheduled to run when the number of entries on the idle queue reaches 25% of ufs_ninode. When it runs, it gives back half of the idle queue
until the queue falls below the low water mark of ufs_q->uq_lowat. Inodes
on the junk queue get destroyed first. Figure 15.11 illustrates the process.
Delete queue. Is active if UFS logging is enabled and consists of inodes that
are unlinked or deleted (v_count equals 1 and i_nlink is less than or equal
to 0). This queue is a performance enhancer for file systems with logging
turned on and observing heavy deletion activity. The delete queue is handled
by the per-file system delete thread, which queues the inodes to be deleted by
the ufs_delete() thread. This significantly boosts response times for
removal of large amounts of data. If logging is not enabled, ufs_delete()
is called immediately. ufs_delete() calls VN_RELE() after it has finished
processing, which causes the inode to once again be processed by ufs_
inactive, which this time puts it on the idle queue. While on the delete
queue, the inode’s i_freef and i_freeb point to the inode itself since the
inodes are not free yet.
bytes of disk space for each partly filled file system block was wasted. To overcome
this disadvantage, UFS uses the notion of file system fragments. Fragments allow
a single block to be broken up into 2, 4, or 8 fragments when necessary (4 Kbytes, 2
Kbytes or 1 Kbyte, respectively).
UFS block allocation tries to prevent excessive disk seeking by attempting to co-
locate inodes within a directory and by attempting to co-locate a file’s inode and its
data blocks. When possible, all the inodes in a directory are allocated in the same
cylinder group. This scheme helps reduce disk seeking when directories are tra-
versed; for example, executing a simple ls -l of a directory will access all the
inodes in that directory. If all the inodes reside in the same cylinder group, most of
the data are cached after the first few files are accessed. A directory is placed in a
cylinder group different from that of its parent.
Blocks are allocated to a file sequentially, starting with the first 96 Kbytes (the
first 12 direct blocks), skipping to the next cylinder group and allocating blocks up
to the limit set by the file system parameter maxbpg (maximum-blocks-per-cylin-
der-group). After that, blocks are allocated from the next available cylinder group.
By default, on a file system greater than 1 Gbyte, the algorithm allocates 96
Kbytes in the first cylinder group, 16 Mbytes in the next available cylinder group,
16 Mbytes from the next, and so on. The maximum cylinder group size is 54
Mbytes, and the allocation algorithm allows only one-third of that space to be allo-
cated to each section of a single file when it is extended. The maxbpg parameter is set to 2,048 8-Kbyte blocks by default at the time the file system is created. It is also tunable, but it can only be tuned downward, since the maximum is a 16-Mbyte allocation per cylinder group.
Selection of a new cylinder group for the next segment of a file is governed by a
rotor and free-space algorithm. A per-file-system allocation rotor points to one of
the cylinder groups; each time new disk space is allocated, it starts with the cylin-
der group pointed to by the rotor. If the cylinder group has less than average free
space, then it is skipped and the next cylinder group is tried. This algorithm
makes the file system attempt to balance the allocation across the cylinder groups.
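A minimal model of the rotor and free-space policy just described might look like the following (our sketch, not the UFS code): starting at the rotor, skip cylinder groups with below-average free space, and remember where allocation ended so that the next allocation starts there.

#include <stdio.h>

/*
 * Toy model of cylinder group selection: 'cgfree' holds the free-block count
 * of each cylinder group, and 'rotor' points at the group to try first.
 */
static int
pick_cg(const unsigned long *cgfree, int ncg, int *rotor)
{
        unsigned long total = 0, avg;
        int i, cg;

        for (i = 0; i < ncg; i++)
                total += cgfree[i];
        avg = total / ncg;

        /* Walk forward from the rotor, skipping groups with less than
         * average free space; fall back to the rotor if none qualify. */
        for (i = 0; i < ncg; i++) {
                cg = (*rotor + i) % ncg;
                if (cgfree[cg] >= avg) {
                        *rotor = cg;            /* next allocation starts here */
                        return (cg);
                }
        }
        return (*rotor);
}

int
main(void)
{
        unsigned long cgfree[4] = { 100, 3000, 2500, 50 };
        int rotor = 0;

        printf("selected cylinder group %d\n", pick_cg(cgfree, 4, &rotor));
        return (0);
}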
Figure 15.12 shows the default allocation that is used if a file is created on a
large UFS. The first 96 Kbytes of file 1 are allocated from the first cylinder group.
Then, allocation skips to the second cylinder group and another 16 Mbytes of file 1
are allocated, and so on. When another file is created, we can see that it consumes
the holes in the allocated blocks alongside file 1. There is room for a third file to do
the same.
The actual on-disk layout will not be quite as simple as the example shown but
does reflect the allocation policies discussed. We can use an add-on tool, filestat,
to view the on-disk layout of a file, as shown below.
[Figure 15.12: Default allocation of a file across cylinder groups. The first 96 Kbytes of file1 go in the first cylinder group, and subsequent 16-Mbyte segments of file1 skip to later cylinder groups; file2 and file3 fill the gaps alongside file1. The figure marks offsets of 54 MB, 62 MB, 78 MB, and 110 MB.]
The filestat output shows that the first segment of the file occupies 192 (512-
byte) blocks, followed by the next 16 Mbytes, which start in a different cylinder
group. This particular file system was not empty when the file was created, which
is why the next cylinder group chosen is a long way from the first.
We can observe the file system parameters of an existing file system with the
fstyp command. The fstyp command simply dumps the superblock information
for the file, revealing all the cylinder group and allocation information. The follow-
ing example shows the output for a 4-Gbyte file system with default parameters.
We can see that the file system has 8,247,421 blocks and has 167 cylinder groups
spaced evenly at 6,272 (51-Mbyte) intervals. The maximum blocks to allocate for
each group is set to the default of 2,048 8-Kbyte blocks, or 16 Mbytes.
The UFS-specific version of the fstyp command dumps the superblock of a UFS
file system, as shown below.
cs[].cs_(nbfree,ndir,nifree,nffree):
(23,26,5708,102) (142,26,5724,244) (87,20,5725,132) (390,69,5737,80)
(72,87,5815,148) (3,87,5761,110) (267,87,5784,4) (0,66,5434,4)
(217,46,5606,94) (537,87,5789,70) (0,87,5901,68) (0,87,5752,20)
.
.
cylinders in last group 48
blocks in last group 6144
cg 0:
magic 90255 tell 6000 time Sat Feb 27 22:53:11 1999
cgx 0 ncyl 49 niblk 6144 ndblk 50176
nbfree 23 ndir 26 nifree 5708 nffree 102
rotor 1224 irotor 144 frotor 1224
frsum 7 7 3 1 1 0 9
sum of frsum: 102
iused: 0-143, 145-436
free: 1224-1295, 1304-1311, 1328-1343, 4054-4055, 4126-4127, 4446-4447, 4455, 4637-
4638,
bmap_read() queries the file system as to which physical disk sector a file
block resides on; that is, requests a lookup of the direct/indirect blocks that
contain the disk address(es) of the required blocks.
bmap_write() allocates, with the aid of helper functions, new disk blocks
when extending or allocating blocks for a file.
int
bmap_read(struct inode *ip, u_offset_t off, daddr_t *dap, int *lenp)
See usr/src/uts/common/fs/ufs/ufs_bmap.c
The file system uses the bmap_read() algorithm to locate the physical blocks
for the file being read. The bmap_read() function searches through the direct,
indirect, and double-indirect blocks of the inode to locate the disk address of the
disk blocks that map to the supplied offset. The function also searches forward
from the offset, looking for disk blocks that continue to map contiguous portions of
the inode, and returns the length of the contiguous segment (in blocks) in the
length pointer argument. The length and the file system block clustering parame-
ters are used within the file system as bounds for clustering contiguous blocks to
provide better performance by reading larger parts of a file from disk at a time.
See ufs_getpage_ra(), defined in usr/src/uts/common/fs/ufs/ufs_vnops.c, for more information on read-aheads.
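The forward scan for contiguity that bmap_read() performs can be illustrated with a small user-level function (ours, not the kernel code): given a list of disk addresses for consecutive file blocks, count how many are physically contiguous starting at a given index.

#include <stdio.h>
#include <stdint.h>

#define FRAGS_PER_BLOCK 8   /* assumed: 8-Kbyte blocks, 1-Kbyte fragments */

/*
 * Count how many blocks, starting at index 'start', are physically
 * contiguous on disk.  Disk addresses are in fragment units, so the next
 * contiguous block begins FRAGS_PER_BLOCK fragments after the previous one.
 */
static int
contig_run(const int32_t *blkaddrs, int nblocks, int start)
{
        int len = 1;

        while (start + len < nblocks &&
            blkaddrs[start + len] == blkaddrs[start + len - 1] + FRAGS_PER_BLOCK)
                len++;
        return (len);
}

int
main(void)
{
        int32_t addrs[] = { 1000, 1008, 1016, 1024, 2000, 2008 };

        printf("run starting at block 0: %d blocks\n", contig_run(addrs, 6, 0)); /* 4 */
        printf("run starting at block 4: %d blocks\n", contig_run(addrs, 6, 4)); /* 2 */
        return (0);
}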
int
bmap_write(struct inode *ip, u_offset_t off, int size,
int alloc_only, struct cred *cr);
See usr/src/uts/common/fs/ufs/ufs_bmap.c
The bmap_write() function allocates file space in the file system when a file is
extended or a file with holes has blocks written for the first time and is responsible
for storing the allocated block information in the inode. bmap_write() traverses
the block free lists, using the rotor algorithm (discussed in Section 15.3.3), and
updates the local, direct, and indirect blocks in the inode for the file being extended.
bmap_write calls several helper functions to facilitate the allocation of blocks.
daddr_t blkpref(struct inode *ip, daddr_t lbn, int indx, daddr32_t *bap)
Guides bmap_write in selecting the next desired block in the file. Sets the policy as
described in Section 15.3.3.1.
int realloccg(struct inode *ip, daddr_t bprev, daddr_t bpref, int osize, int nsize,
daddr_t *bnp, cred_t *cr)
Reallocates a fragment to a bigger size. The number and size of the old block are specified, and the allocator attempts to extend the original block. Failing that, the regular block allocator is called to obtain an appropriate block.
int alloc(struct inode *ip, daddr_t bpref, int size, daddr_t *bnp, cred_t *cr)
Allocates a block in the file system. The size of the block is specified and is a multiple of fs_fsize (up to fs_bsize). If a preference (usually obtained from blkpref()) is specified, the allocator tries to allocate the requested block. If that fails, a rotationally optimal block in the same cylinder is found. Failing that, a block in the same cylinder group is searched for, and if that also fails, the allocator quadratically rehashes into other cylinder groups (see hashalloc() in uts/common/fs/ufs/ufs_alloc.c) to locate an available block. If no preference is given, a block in the same cylinder is found, and failing that, the allocator quadratically searches other cylinder groups for one.
See uts/common/fs/ufs/ufs_alloc.c
In the case of an error, bmap_write() will call ufs_undo_allocation to free any blocks
which were used during the allocation process.
See uts/common/fs/ufs/ufs_bmap.c
15.3.4.1 ufs_read()
An example of the steps taken by a UFS read system call is shown in Figure 15.13.
A read system call invokes the file-system-dependent read function, which turns
the read request into a series of vop_getpage() calls by mapping the file into the
kernel’s address space with the seg_kpm driver (through the seg_map driver), as
described in Section 14.7.
The ufs_read method calls into the seg_map driver to locate a virtual address
in the kernel address space for the file and offset requested with the segmap_
getmapflt() function. The seg_map driver determines whether it already has a
mapping for the requested offset by looking into its hashed list of mapping slots.
Once a slot is located or created, an address for the page is located. segmap then
calls back into the file system with ufs_getpage() to soft-initiate a page fault to
read in the page at the virtual address of the seg_map slot. The page fault is initiated while we are still in the segmap_getmap() routine, by a call to segmap_fault(), which in turn calls back into the file system through ufs_getpage(). If no mapping slot already exists, a slot is created and ufs_getpage() is called to read in the pages.
The ufs_getpage() routine brings the requested range of the file (vnode, offset, and length) in from disk to the supplied virtual address. To do so, ufs_getpage() locates the file's blocks (through the block map functions discussed in Section 15.3.3.2) and reads them by calling the underlying device's strategy routine.
Once the page is read by the file system, the requested range is copied back to
the user by the uiomove() function. The file system then releases the slot associ-
ated with that block of the file by using the segmap_release() function. At this
point, the slot is not removed from the segment, because we may need the same
file and offset later (effectively caching the virtual address location); instead, it is
added to a seg_map free list so that it can be reclaimed or reused later.
15.3.4.2 ufs_write()
Writing to the file system is performed similarly, although it is more complex
because of some of the file system write performance enhancements, such as
delayed writes and write clustering. Writing to the file system follows the steps
shown in Figure 15.14.
The write system call calls the file-system-independent write, which in our
example calls ufs_write(). UFS breaks the write into 8-Kbyte chunks and then
processes each chunk. For each 8-Kbyte chunk, the following steps are performed.
1. UFS asks the segmap driver for an 8-Kbyte mapping of the file in the ker-
nel’s virtual address space. The page for the file and offset is mapped here so
that the data can be copied in and then written out with paged I/O.
2. If the file is being extended or a new page is being created within a hole of a
file, then a call is made to the segmap_pagecreate function to create and
lock the new pages. Next, a call is made to segmap_pageunlock() to unlock
the pages that were locked during the page_create.
3. If the write is to a whole file system block, then a new zeroed page is created
with segmap_pagecreate(). In the case of a partial block write, the block
must first be read in so that the partial block contents can be replaced.
4. The new page is returned, locked, to UFS. The buffer that is passed into the
write system call is copied from user address space into kernel address space.
5. The ufs_write throttle first checks to see if too many bytes are outstanding
for this file as a result of previous delayed writes. If more than the kernel
parameter ufs_HW bytes are outstanding, the write is put to sleep until the
amount of outstanding bytes drops below the kernel parameter ufs_LW.
The file system calls the seg_map driver to map in the portion of the file we are
going to write. The data is copied from the process’s user address space into the
kernel address space allocated by seg_map, and seg_map is then called to release
the address space containing the dirty pages to be written. This is when the real
work of write starts, because seg_map calls ufs_putpage() when it realizes
there are dirty pages in the address space it is releasing.
The traditional UNIX File System provides a simple file access scheme based on
users, groups, and world, whereby each file is assigned an owner and a UNIX
group, and then is assigned a bitmap of permissions for user, group, and world, as
illustrated in Figure 15.15.
This scheme is flexible when file access permissions align with users and groups
of users, but it does not provide a mechanism to assign access to lists of users that
do not coincide with a UNIX group. For example, if we want to give read access to
file 1 to Mark and Chuck, and then read access to file 2 to Chuck and Barb, then
we would need to create two UNIX groups, and Chuck would need to switch groups with the newgrp command to gain access to either file.
To overcome this drawback, some operating systems use an access control list
(ACL), whereby lists of users with different permissions can be assigned to a file.
Solaris introduced the notion of access control lists in the B1 secure version,
known as Trusted Solaris, in 1993. Trusted Solaris ACLs were later integrated
with the commercial Solaris version in 1995 with Solaris 2.5.
With Solaris ACLs, administrators can assign a list of UNIX user IDs and
groups to a file by using the setfacl command and can review the ACLs by using
the getfacl command, as shown below.
# file: memtool.c
# owner: rmc
# group: staff
user::r--
user:jon:rw- #effective:r--
group::r-- #effective:r--
mask:r--
other:r--
# ls -l memtool.c
-r--r--r--+ 1 rmc staff 638 Mar 30 11:32 memtool.c
For example, we can assign access to a file for a specific user by using the setfacl
command. Note that the UNIX permissions on the file now contain a +, signifying
that an access control list is assigned to this file.
Multiple users and groups can be assigned to a file, offering a flexible mecha-
nism for assigning access rights. ACLs can be assigned to directories as well. Note
that unlike the case with some other operating systems, access control lists are not
inherited from a parent, so a new directory created under a directory with an ACL
will not have an ACL assigned by default.
ACLs are represented in three formats: on-disk, in-core, and user level. The on-disk format represents the ACL data stored in the file's shadow inode, the in-core structure is used by UFS internally, and the user-level format is used by the system to present the data to the requester.
The ufs_acl structure defines an ACL entry that is encapsulated in the ufs_fsd
structure and then stored on disk in a shadow inode. Refer to Section 15.2.4 for
more information on shadow inode storage.
/*
* On-disk UFS ACL structure
*/
typedef struct ufs_acl {
union {
uint32_t acl_next; /* Pad for old structure */
ushort_t acl_tag; /* Entry type */
} acl_un;
o_mode_t acl_perm; /* Permission bits */
uid_t acl_who; /* User or group ID */
} ufs_acl_t;
See usr/src/uts/common/sys/fs/ufs_acl.h
The in-core format consists of the ufs_ic_acl structure and the in-core ACL
mask (ufs_aclmask) structure.
/*
* In-core UFS ACL structure
*/
typedef struct ufs_ic_acl {
struct ufs_ic_acl *acl_ic_next; /* Next ACL for this inode */
o_mode_t acl_ic_perm; /* Permission bits */
uid_t acl_ic_who; /* User or group ID */
} ufs_ic_acl_t;
/*
* In-core ACL mask
*/
typedef struct ufs_aclmask {
short acl_ismask; /* Is mask defined? */
o_mode_t acl_maskbits; /* Permission mask */
} ufs_aclmask_t;
See usr/src/uts/common/sys/fs/ufs_acl.h
When ACL data is exchanged to and from the application, a struct acl relays
the permission bits, user or group ID, and the type of ACL.
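That user-level structure is the aclent_t used by the acl(2) and facl(2) interfaces; its shape is roughly the following (paraphrased from sys/acl.h, with our own comments):

/* User-level ACL entry exchanged through acl(2)/facl(2); roughly as
 * declared in <sys/acl.h>. */
typedef struct acl {
        int       a_type;      /* entry type: USER_OBJ, USER, GROUP_OBJ, ... */
        uid_t     a_id;        /* user or group ID the entry applies to */
        o_mode_t  a_perm;      /* permission bits for this entry */
} aclent_t;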
static int
ufs_setsecattr(struct vnode *vp, vsecattr_t *vsap, int flag, struct cred *cr)
Used primarily for updates to ACLs. The structure vsecattr is converted to ufs_acl for
in-core storage of ACLs. All file mode changes are updated via this routine.
static int
ufs_getsecattr(struct vnode *vp, vsecattr_t *vsap, int flag,struct cred *cr)
int
ufs_acl_access(struct inode *ip, int mode, cred_t *cr)
int
ufs_acl_get(struct inode *ip, vsecattr_t *vsap, int flag, cred_t *cr)
int
ufs_acl_set(struct inode *ip, vsecattr_t *vsap, int flag, cred_t *cr)
si_t *
ufs_acl_cp(si_t *sp)
Copies ACL information from one shadow inode into a newly created shadow inode.
int
ufs_acl_setattr(struct inode *ip, struct vattr *vap, cred_t *cr)
See usr/src/uts/common/fs/ufs/ufs_acl.c
In Solaris 9, a new interface was added to UFS for the storage of attributes. Unlike ACLs, which add a shadow inode to each file for permission storage, extended attributes add a directory inode to each file (see struct icommon). This directory is not part of the regular file system name space; rather, it is in its own dimension and is attached to ours via a worm-hole of function calls, such as openat(2) and attropen(3C).
An excellent discussion of extended attributes can be found in fsattr(5). This interface exists to support any extra attributes desired for files; for example, it may be used to support files from other file systems that require the storing of non-UFS attributes. Other uses will be discovered over time.
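As a small illustration of those calls, the fragment below (ours; the file and attribute names are invented) creates a named attribute on a file with attropen(3C), which is roughly openat(2) with the O_XATTR flag:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        int fd;
        const char *msg = "stored in the attribute name space\n";

        /* Open (creating if necessary) the extended attribute "notes"
         * belonging to tardis.txt; see fsattr(5). */
        fd = attropen("tardis.txt", "notes", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
                perror("attropen");
                return (1);
        }
        (void) write(fd, msg, strlen(msg));
        (void) close(fd);
        return (0);
}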
The following demonstration should get to the point quickly. Here we create an
innocuous file, tardis.txt, and copy (yes, copy) several other files into its extended
attribute name space, purely as a demonstration.
$ ls -l tardis.txt
-rw-r--r-- 1 user1 other 29 Apr 3 10:46 tardis.txt
$ ls -@ tardis.txt
-rw-r--r--@ 1 user1 other 29 Apr 3 10:46 tardis.txt
$
$ du -ks tardis.txt
184 tardis.txt
UFS uses two basic types of locks: kmutex_t and krwlock_t. The workings of
these synchronization primitives are covered in Chapter 17. UFS locks can be
divided into eight categories:
Inode locks
Queue locks
ACL locks
VNODE locks
VFS locks
VOP_RWLOCK
ufs_iuniqtime_lock
Logging locks
[Figure residue: a diagram ordering locks from highest to lowest; the operations shown include ufs_close, ufs_putpage, ufs_inactive, ufs_addmap, ufs_delmap, ufs_rwlock, ufs_rwunlock, and ufs_poll.]
The basic principle here is that UFS supports various file system lock states
(see list below) and each vnode operation must initiate the protocol by calling ufs_
lockfs_begin() with an appropriate lock mask (a lock that this operation might
grab while it is being processed) and end the protocol by calling ufs_lockfs_end
before it returns. This way, UFS knows exactly how many vnode operations are in
progress for the given file system by incrementing and decrementing the ul_vnops_
cnt variable in the file-system-dependent ulockfs structure. If the file system is
hard-locked, the thread gets an EIO error. If the file system is error-locked, then
the thread is blocked.
Here are the file system locks and their actions.
Write lock. Suspends writes that would modify the file system. Access times
are not kept while a file system is write-locked.
Name lock. Suspends accesses that could change or remove existing directory entries.
Delete lock. Suspends access that could remove directory entries.
Hard lock. Returns an error upon every access to the locked file system and
cannot be unlocked. Hard-locked file systems can be unmounted. Hard lock
supports forcible unmount.
Error lock. Blocks all local access to the file system and returns EWOULDBLOCK
on all remote access. File systems are error-locked by UFS upon detection of internal inconsistencies.
While a vnode operation is being executed in UFS, a call can be made to another
vnode function on the same UFS or a different UFS. This is called recursive VOP.
The per-file system vnode operation counter is not incremented or decremented
during recursive calls.
Here is the basic ordering to initiate and complete the lock protocol when oper-
ating on an inode in UFS.
When working with directories, you need to make one minor change. i_rwlock
is acquired after the logging transaction is initialized, and i_rwlock is released
before the transaction is ended. Here are the steps.
15.7 Logging
Important criteria for commercial systems are reliability and availability, both of
which may be compromised if the file system does not provide the required level of
robustness. We have become familiar with the term journaling to mean just one
thing, but, in fact, file system logging can be implemented in several ways. The
three most common forms of journaling are
The most common form of file system logging is metadata logging, and this is
what UFS implements. When a file system makes changes to its on-disk struc-
ture, it uses several disconnected synchronous writes to make the changes. If an
outage occurs halfway through an operation, the state of the file system is
unknown, and the whole file system must be checked for consistency. For exam-
ple, if a file is being extended, the free block bitmap must be updated to mark the
newly allocated block as no longer free. The inode block list must also be updated
to indicate that the allocated block is owned by the file. If an outage occurs after
the block is allocated, but before the inode is updated, file system inconsistency
occurs.
A metadata logging file system such as UFS has an on-disk, cyclic, append-only
log area that it can use to record the state of each disk transaction. Before any on-
disk structures are changed, an intent-to-change record is written to the log. The
directory structure is then updated, and when complete, the log entry is marked
complete. Since every change to the file system structure is in the log, we can
check the consistency of the file system by looking in the log, and we need not do a
full file system scan. At mount time, if an intent-to-change entry is found but not
marked complete, the changes will not be applied to the file system. Figure 15.17
illustrates how metadata logging works.
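In outline, the intent-logging idea just described reduces to the pattern in the toy model below (purely illustrative; the names and the in-memory "log" are invented, and this is not how the UFS code is organized):

#include <stdio.h>

/* Toy model of metadata intent logging.  The "log" and the "file system"
 * are just in-memory arrays; real UFS logging works with deltas, a delta
 * map, and a log map, as described later in this section. */

#define LOGSZ 8

struct logrec {
        int used;
        int complete;
        int fs_index;           /* which metadata slot the change targets */
        int new_value;          /* the intended new contents */
};

static struct logrec logbuf[LOGSZ];
static int fs_metadata[LOGSZ];  /* stand-in for on-disk metadata */

static void
metadata_update(int slot, int value)
{
        struct logrec *r = &logbuf[slot % LOGSZ];

        /* 1. Record the intent before touching the "on-disk" structures. */
        r->used = 1;
        r->complete = 0;
        r->fs_index = slot;
        r->new_value = value;

        /* 2. Apply the change to the file system. */
        fs_metadata[slot] = value;

        /* 3. Mark the log record complete. */
        r->complete = 1;
}

static void
mount_recovery(void)
{
        int i;

        /* Records marked complete are reapplied; incomplete intents are
         * discarded, so the file system never sees a half-done change. */
        for (i = 0; i < LOGSZ; i++)
                if (logbuf[i].used && logbuf[i].complete)
                        fs_metadata[logbuf[i].fs_index] = logbuf[i].new_value;
}

int
main(void)
{
        metadata_update(3, 42);
        mount_recovery();
        printf("slot 3 = %d\n", fs_metadata[3]);
        return (0);
}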
Logging was first introduced in UFS in Solaris 2.4; it has come a long way since then, to being turned on by default in Solaris 10. Enabling logging turns the file system into a transaction-based file system: either the entire transaction is applied or it is completely discarded. Although logging is on by default in Solaris 10, it can also be turned on manually with mount(1M) -o logging (using the _FIOLOGENABLE ioctl). Logging is not compatible with Solaris Volume Manager (SVM) translogging, and an attempt to turn on logging on a UFS file system that resides on an SVM volume with translogging will fail.
[Figure 15.17: Metadata logging. (1) The log is updated to indicate the start of the transaction; (2) the file system is modified; (3) the log transaction is marked complete and deleted.]
See usr/src/uts/common/sys/fs/ufs_log.h
The extent_block structure describes logging metadata and is the main data
structure used to find the on-disk log. It is followed by a series of extents that con-
tain the physical block number for on-disk logging segments. The number of
extents present for the file system is described by the nextents field in the
extent_block structure.
See usr/src/uts/common/sys/fs/ufs_log.h
/*
* Important constants
*/
uint32_t od_maxtransfer; /* max transfer in bytes */
uint32_t od_devbsize; /* device bsize */
int32_t od_bol_lof; /* byte offset to begin of log */
int32_t od_eol_lof; /* byte offset to end of log */
/*
* The disk space is split into state and circular log
*/
uint32_t od_requestsize; /* size requested by user */
uint32_t od_statesize; /* size of state area in bytes */
uint32_t od_logsize; /* size of log area in bytes */
int32_t od_statebno; /* first block of state area */
int32_t od_unused2;
/*
* Head and tail of log
*/
int32_t od_head_lof; /* byte offset of head */
uint32_t od_head_ident; /* head sector id # */
int32_t od_tail_lof; /* byte offset of tail */
uint32_t od_tail_ident; /* tail sector id # */
uint32_t od_chksum; /* checksum to verify ondisk contents */
/*
* Used for error recovery
*/
uint32_t od_head_tid; /* used for logscan; set at sethead */
/*
* Debug bits
*/
int32_t od_debug;
/*
* Misc
*/
struct timeval od_timestamp; /* time of last state change */
} ml_odunit_t;
See usr/src/uts/common/sys/fs/ufs_log.h
The values in the ml_odunit_t structure represent the location, usage and
state of the on-disk log. The contents in the on-disk log consist of delta structures,
which define the changes, followed by the actual changes themselves. Each 512
byte disk block of the on-disk log will contain a sect_trailer at the end of the
block. This sect_trailer is used to identify the disk block as containing valid
deltas. The *_lof fields reference the byte offset in the logical on-disk layout and
not the physical on-disk contents.
struct delta {
int64_t d_mof; /* byte offset on device to start writing */
/* delta */
int32_t d_nb; /* # bytes in the delta */
delta_t d_typ; /* Type of delta. Defined in ufs_trans.h */
};
See usr/src/uts/common/sys/fs/ufs_log.h
[Figure: Relationships among the in-core logging structures. The ml_unit_t points (via un_deltamap and un_logmap) to mt_map_t structures; each mt_map_t (mtm_next, mtm_prev, mtm_hash, mtm_cancel) heads chains of mapentry_t structures linked by me_next, me_prev, me_hash, and me_cancel. Each mapentry_t carries its delta (d_mof, d_nb, d_typ), its log offset me_lof, and an optional me_crb pointer to a crb_t (c_mof, c_buf, c_nb, c_refcnt, c_invalid), which holds the data for the mapentries.]
ml_unit_t is the main in-core logging structure. There is only one per file sys-
tem, and it contains all logging information or pointers to all logging data struc-
tures for the file system. The un_ondisk field contains an in-memory replica of
the on-disk ml_odunit structure.
/*
* Used for managing transactions
*/
uint32_t un_maxresv; /* maximum reservable space */
uint32_t un_resv; /* reserved byte count for this trans */
uint32_t un_resv_wantin; /* reserved byte count for next trans */
/*
* Used during logscan
*/
uint32_t un_tid;
/*
* Read/Write Buffers
*/
cirbuf_t un_rdbuf; /* read buffer space */
cirbuf_t un_wrbuf; /* write buffer space */
/*
* Ondisk state
*/
ml_odunit_t un_ondisk; /* ondisk log information */
/*
* locks
*/
kmutex_t un_log_mutex; /* allows one log write at a time */
kmutex_t un_state_mutex; /* only 1 state update at a time */
} ml_unit_t;
See usr/src/uts/common/sys/fs/ufs_log.h
mt_map_t tracks all the deltas for the file system. At least three mt_map_t
structures are defined:
deltamap. Tracks all deltas for currently active transactions. When a file
system transaction completes, all deltas from the delta map are written to the
log map and all the entries are then removed from the delta map.
logmap. Tracks all committed deltas from completed transactions, not yet
applied to the file system.
matamap. Is the debug map for delta verification.
struct mapentry {
/*
* doubly linked list of all mapentries in map -- MUST BE FIRST
*/
mapentry_t *me_next;
mapentry_t *me_prev;
mapentry_t *me_hash;
mapentry_t *me_agenext;
mapentry_t *me_cancel;
crb_t *me_crb;
int (*me_func)();
ulong_t me_arg;
ulong_t me_age;
struct delta me_delta;
uint32_t me_tid;
off_t me_lof;
ushort_t me_flags;
};
See usr/src/uts/common/sys/fs/ufs_log.h
A canceled mapentry with the ME_CANCEL bit set in the me_flags field is a
special type of mapentry. This type of mapentry is basically a place holder for free
blocks and fragments. It can also represent an old mapentry that is no longer
valid due to a new mapentry for the same offset. Freed blocks and fragments are
not eligible for reallocation until all deltas have been written to the on-disk log.
Any attempt to allocate a block or fragment for which a corresponding canceled mapentry exists in the logmap results in the allocation of a different block or fragment.
See sys/fs/ufs_log.h
The crb_t, or cache roll buffer, caches the data for deltas that fall within the same disk block. It is merely a performance enhancement for when information is rolled to the file system: it reduces the reads and writes that can occur while the deltas of completed transactions are written to the file system, and it also improves performance on read hits of deltas.
UFS logging maintains private buf_t structures used for reading and writing of
the on-disk log. These buf_t structures are managed through cirbuf_t struc-
tures. Each file system will have 2 cirbuf_t structures. One is used to manage
log reads, and one to manage log writes.
See sys/fs/ufs_log.h
only for that respective cylinder group. All cylinder group summary information is
totaled; these numbers are kept in the fs_cstotal field of the superblock. A copy
of all the cylinder group’s summary information is also kept in a buffer pointed to
from the file system superblock’s fs_csp field. Also kept on disk for redundancy is
a copy of the fs_csp buffer, whose block address is stored in the fs_csaddr field
of the file system superblock.
All cylinder group information can be determined from reading the cylinder
groups, as opposed to reading them from fs_csaddr blocks on disk. Hence,
updates to fs_csaddr are logged only for large file systems (in which the total
number of cylinder groups exceeds ufs_ncg_log, which defaults to 10,000). If a
file system isn’t logging deltas to the fs_csaddr area, then the ufsvfs->vfs_
nolog_si is set to 1 and instead marks the fs_csaddr area as bad by setting the
superblock’s fs_si field to FS_SI_BAD. However, these changes are brought up to
date when an unmount or a log roll takes place.
15.7.4 Transactions
A transaction is defined as a file system operation that modifies file system metadata. A group of these file system transactions is known as a moby transaction.
Logging transactions are divided into two types:
Synchronous file system transactions are those that are committed and
written to the log as soon as the file system transaction ends.
Asynchronous file system transactions are those for which the file sys-
tem transactions are committed and written to the on-disk log after closure of
the moby transaction. In this case the file system transaction may complete, but the metadata that it modified is not written to the log and is not considered committed until the moby transaction has been completed.
So what exactly are committed transactions? Well, they are transactions whose
deltas (unit changes to the file system) have been moved from the delta map to the
log map and written to the on-disk log.
There are four steps involved in logging metadata changes of a file system
transaction:
All other file system transactions have a constant transaction size, and UFS has
predefined macros for these operations:
/*
* size calculations
*/
#define TOP_CREATE_SIZE(IP) \
(ACLSIZE(IP) + SIZECG(IP) + DIRSIZE(IP) + INODESIZE)
#define TOP_REMOVE_SIZE(IP) \
DIRSIZE(IP) + SIZECG(IP) + INODESIZE + SIZESB
#define TOP_LINK_SIZE(IP) \
DIRSIZE(IP) + INODESIZE
#define TOP_RENAME_SIZE(IP) \
DIRSIZE(IP) + DIRSIZE(IP) + SIZECG(IP)
#define TOP_MKDIR_SIZE(IP) \
DIRSIZE(IP) + INODESIZE + DIRSIZE(IP) + INODESIZE + FRAGSIZE(IP) + \
SIZECG(IP) + ACLSIZE(IP)
#define TOP_SYMLINK_SIZE(IP) \
DIRSIZE((IP)) + INODESIZE + INODESIZE + SIZECG(IP)
#define TOP_GETPAGE_SIZE(IP) \
ALLOCSIZE + ALLOCSIZE + ALLOCSIZE + INODESIZE + SIZECG(IP)
#define TOP_SYNCIP_SIZE INODESIZE
#define TOP_READ_SIZE INODESIZE
#define TOP_RMDIR_SIZE (SIZESB + (INODESIZE * 2) + SIZEDIR)
#define TOP_SETQUOTA_SIZE(FS) ((FS)->fs_bsize << 2)
#define TOP_QUOTA_SIZE (QUOTASIZE)
#define TOP_SETSECATTR_SIZE(IP) (MAXACLSIZE)
#define TOP_IUPDAT_SIZE(IP) INODESIZE + SIZECG(IP)
#define TOP_SBUPDATE_SIZE (SIZESB)
See sys/fs/ufs_trans.h
in the log will be applied to the file system and removed from the log. This is
known as “rolling the log” and is done by a separate thread.
The actual rolling of the log is handled by the log roll thread, which executes the
trans_roll() function found in usr/src/uts/common/fs/ufs/lufs_thread.c.
The trans_roll() function preallocates a number of rollbuf_t structures (based
on LUFS_DEFAULT_NUM_ROLL_BUF = 16, LUFS_DEFAULT_MIN_ROLL_BUFS = 4,
LUFS_DEFAULT_MAX_ROLL_BUFS = 64) to handle rolling deltas from the log to the
file system.
Along with allocating memory for the rollbuf_t structures, trans_roll also
allocates MAPBLOCKSIZE * lufs_num_roll_bufs bytes to be used by rollbuf_t’s
buf_t structure stored in rb_bh. These rollbuf_t’s are populated according to
information found in the rollable mapentries of the logmap. All rollable mapen-
tries will be rolled starting from the logmap’s un_head_lof offset, and continuing
until an unrollable mapentry is found. Once a rollable mapentry is found, all other
rollable mapentries within the same MAPBLOCKSIZE segment on the file system
device are located and mapped by the same rollbuf structure.
If all mapentries mapped by a rollbuf have the same cache roll buffer (crb),
then this crb maps the on-disk block and buffer containing the deltas for the roll-
buf’s buf_t. Otherwise, the rollbuf’s buf_t uses MAPBLOCKSIZE bytes of kernel
memory allocated by the trans_roll thread to do the transfer. The buf_t reads
the MAPBLOCKSIZE bytes on the file system device into the rollbuf buffer. The
deltas defined by each mapentry overlap the old data read into the rollbuf
buffer. This buffer is then written to the file system device.
If the rollbufs contain holes, these rollbufs may have to issue more than one
write to disk to complete writing the deltas. To asynchronously write these deltas,
the rollbuf’s buf_t structure is cloned for each additional write required for the
given rollbuf. These cloned buf_t structures are linked into the rollbuf's buf_t
structure at the b_list field. All writes defined by the rollbuf’s buf_t structures
and any clone buf_t structures are issued asynchronously.
The trans_roll() thread waits for all these writes to complete. If any fail, a
warning is printed to the console and the log is marked as LDL_ERROR in the log-
map->un_flags field. If the roll completes successfully, all corresponding mapen-
tries are completely removed from the log map. The head of the log map is then
adjusted to reflect this change, as illustrated in Figure 15.20.