Of File Systems and Storage Models
Disks are always full. It is futile to try to get more disk space.
Data expands to fill any void.
– Parkinson’s Law as applied to disks
4.1 Introduction
This chapter deals primarily with how we store data. Virtually all computer
systems require some way to store data permanently; even so-called “diskless”
systems do require access to certain files in order to boot, run and be useful.
Albeit stored remotely (or in memory), these bits reside on some sort of
storage system.
Most frequently, data is stored on local hard disks, but over the last
few years more and more of our files have moved “into the cloud”, where
different providers offer easy access to large amounts of storage over the
network. We have more and more computers depending on access to remote
systems, shifting our traditional view of what constitutes a storage device.
indirection we have between the operating system issuing I/O operations and
the bits actually ending up on a storage medium, the better.
At the same time, there are some disadvantages. Since the storage medium
is, well, directly attached, it implies a certain isolation from other systems on
the network. This is both an advantage and a drawback: on the one
hand, each server requires certain data to be private or unique to its operating
system; on the other hand, data on one machine cannot immediately be made
available to other systems. This restriction is overcome with either of the
two storage models we will review next: Network Attached Storage (NAS)
and Storage Area Networks (SANs).
DAS can easily become a shared resource by letting the operating system
make available a local storage device over the network. In fact, all network
file servers and appliances ultimately are managing direct attached storage
on behalf of their clients; DAS becomes a building block of NAS. Likewise,
physically separate storage enclosures can function as DAS if connected
directly to a server, or they may be combined with others and connected to a
network or storage fabric; that is, they become part of a SAN.²
other storage media, which, within this system, are effectively direct attached
storage. In order for the clients to be able to use the server’s file system
remotely, they require support for (and have to be in agreement with) the
protocols used.³ However, the clients do not require access to the storage
media at the block level; in fact, they cannot gain such access.
From the clients’ perspective, the job of managing storage has become
simpler: I/O operations are performed on the file system much as they would
be on a local file system, with the complexity of how to shuffle the data
over the network being handled in the protocol in question. This model
is illustrated in Figure 4.2, albeit in a somewhat simplified manner: even
though the file system is created on the file server, the clients still require
support for the network file system that brokers the transaction performed
locally with the file server.
³ The most common protocols in use with network attached storage solutions are NFS
on the Unix side and SMB/CIFS on the Windows side. The Apple Filing Protocol (AFP)
is still in use in some predominantly Mac OS environments, but Apple’s adoption of Unix
for their Mac OS X operating system made NFS more widespread there as well.
Figure 4.4: A SAN providing access to three devices; one host accesses parts
of the available storage as if it was DAS, while a file server manages other
parts as NAS for two clients.
need to create a logical structure on top of the block devices (as which the
SAN units appear), and they control all aspects of the I/O operations down
to the protocol. With this low-level access, clients can treat the storage like
any other device. In particular, they can boot off SAN attached devices, they
can partition the volumes, create different file systems for different purposes
on them, and export them via other protocols.
Storage area networks are frequently labeled an “enterprise solution” due
to their significant performance advantages and distributed nature. Espe-
cially when used in a switched fabric, additional resources can easily be
made available to all or a subset of clients. These networks utilize the Small
Computer System Interface (SCSI) protocol for communications between the
different devices; in order to build a network on top of this, an additional
protocol layer – the Fibre Channel Protocol (FCP) being the most common
one – is required. We will review the various protocols and interfaces in
Section 4.3.
SANs overcome their restriction to a local area network by further encap-
sulation of the protocol: Fibre Channel over Ethernet (FCoE) or iSCSI, for
example, allow connecting switched SAN components across a Wide Area
Network (or WAN). But the concept of network attached storage devices fa-
cilitating access to a larger storage area network becomes less accurate when
end users require access to their data from anywhere on the Internet. Cloud
storage solutions have been developed to address these needs. However, as
we take a closer look at these technologies, it is important to remember that
at the end of the day, somewhere a system administrator is in charge of mak-
ing available the actual physical storage devices underlying these solutions.
Much like a file server may provide NAS to its clients over a SAN, so do cloud
storage solutions provide access on “enterprise scale” (and at this size the use
of these words finally seems apt) based on the foundation of the technologies
we have discussed so far.
We also have come full circle from direct attached storage providing block-
level access, to distributed file systems, and then back around to block-level
access over a dedicated storage network. But this restricts access to clients
on this specific network. As more and more (especially smaller or mid-sized)
companies are moving away from maintaining their own infrastructure to-
wards a model of Infrastructure as a Service (IaaS) and Cloud Computing,
the storage requirements change significantly, and we enter the realm of Cloud
Storage.⁴
The term “cloud storage” still has a number of conflicting or surprisingly
different meanings. On the one hand, we have commercial services offering
file hosting or file storage services; common well-known providers currently
include Dropbox, Google Drive, Apple’s iCloud and Microsoft’s SkyDrive.
These services offer customers a way to not only store their files, but to
access them from different devices and locations: they effectively provide
network attached storage over the largest of WANs, the Internet.
On the other hand, we have companies in need of more flexible storage
solutions than can be provided with the existing models. Especially the
increased use of virtualization technologies demands faster and more flexible
access to reliable, persistent yet relocatable storage devices. In order to meet
these requirements, storage units are rapidly allocated from large storage
area networks spanning entire data centers.
Since the different interpretations of the meaning of “cloud storage” yield
significantly different requirements, the implementations naturally vary, and
there are no current industry standards defining an architecture. As such,
we are forced to treat each product independently as a black box; system
administrators and architects may choose to use any number of combinations
of the previously discussed models to provide the storage foundation upon
which the final solution is built.
We define three distinct categories within this storage model: (1) services
that provide file system level access as in the case of file hosting services such
as those mentioned above; (2) services that provide access on the object level,
hiding file system implementation details from the client and providing for
easier abstraction into an API, and commonly accessed via web services⁵ such
⁴ Large companies are of course also moving towards IaaS, only they frequently are
the ones simultaneously consuming as well as providing the service, either internally or to
outside customers.
⁵ “Web services” generally expose an API over HTTP or HTTPS using REpresentational
State Transfer (REST) or the Simple Object Access Protocol (SOAP).
Figure 4.5: A possible cloud storage model: an internal SAN is made available
over the Internet to multiple clients. In this example, the storage provider
effectively functions as a NAS server, though it should generally be treated
as a black box.
Figure 4.6: An open PATA (or IDE) hard drive (left) and a Solid State Drive
(right). The HDD shows the rotating disk platters, the read-write head with
its motor, the disk controller and the recognizable connector socket.
access to the shared storage, even though network-layer security mechanisms
such as IPsec may be combined with or integrated into the solution. Cloud
storage, on the other hand, has to directly address the problem of transmitting
data and providing access over untrusted networks and thus usually relies on
application layer protocols such as Transport Layer Security (TLS)/Secure
Sockets Layer (SSL).
We will touch on some of these aspects in future sections and chapters,
but you should keep them in mind as you evaluate di↵erent solutions for
di↵erent use cases. As we will see, in most cases the simpler model turns
out to be the more scalable and more secure one as well, so beware adding
unneeded layers of complexity!
were some of the advantages provided by the Serial ATA (SATA) interface.
A number of revisions and updates to the standard added more advanced
features and, most significantly, increasingly greater transfer speeds.
Most motherboards have integrated ATA host adapters, but a server can
be extended with additional HBAs via, for example, its PCI Express ex-
pansion slots; similarly, dedicated storage appliances make use of disk array
controllers to combine multiple drives into logical units (more on that in
Section 4.4). Fibre Channel HBAs finally allow a server to connect to a ded-
icated Fibre Channel SAN. All of these interfaces can be either internal (the
devices connected to the bus are housed within the same physical enclosure
as the server) or external (the devices are entirely separate from the server,
racked and powered independently and connected with suitable cables). In
the end, consider a host with a large amount of DAS, and a NAS server
managing multiple terabytes of file space which is housed in a separate device
and which it accesses over a SAN: the main difference lies not in the
technologies and protocols used, but in how they are combined.
Hands-on Hardware
I have found it useful to have actual hardware in class whenever
possible. In the past, I have brought with me different hard
drives by different manufacturers and with different interfaces
(PATA, SATA, SCSI); in particular, showing students the inside of a hard
drive, the rotating platters and the read-write arms has served as a great
illustration of the performance limiting factors, such as rotational latency,
seek time, etc. Old and by now possibly obsolete hardware, and the contrast
in which it stands to modern solutions, is always investigated with great
interest.
SCSI drives when compared to the individual workstations at the time, were rather fond of
this: when a disk failed, it could be replaced without scheduling downtime for all services.
Figure 4.7: On the left: Illustration of tracks and sectors on a hard disk.
Note that for simplicity, sectors on the inside and outside of the platters are
of identical size. On disks with Zone Bit Recording (shown on the right),
this is no longer the case.
sectors on the outer tracks of each disc than on the inner area.
The total number of 512-byte sectors across all platters of the hard drive
thus defines its total capacity. In order to access each storage unit, the read-
write head needs to be moved radially to the correct cylinder – a process we
call seeking – and the platter then spun until the correct sector is positioned
under it. In the worst case scenario, we just passed the sector we wish to
access and we have to perform a full rotation. Therefore, a drive’s perfor-
mance is largely defined by this rotational latency and the time to position
the read-write head (also known as seek time).
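To put rough numbers on these delays, consider a hypothetical 7200 RPM
drive (the RPM value is illustrative; substitute your drive’s specification):

```shell
# Average rotational latency is the time for half a revolution; a drive
# spinning at 7200 RPM completes one revolution in 60000/7200 milliseconds.
awk -v rpm=7200 'BEGIN {
    ms_per_rev = 60000 / rpm
    printf "full revolution:        %.2f ms\n", ms_per_rev
    printf "avg rotational latency: %.2f ms\n", ms_per_rev / 2
}'
```

At 7200 RPM this works out to about 4.17 ms of average rotational latency,
on top of which the seek time – often several milliseconds itself – is added.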
Since the motor of a drive rotates the discs at a constant angular velocity
(rather than a constant linear velocity), the discs are in fact moving slower
near the spindle than on the outer part. This means that more sectors can
be read in a given time frame from the outside of the discs than from the
inside. In fact, it used to be not uncommon for system administrators to
partition their disks such that large, frequently accessed files would reside on
around 2010, a number of vendors have started creating hard drives with a physical block
size of 4096 bytes. Interestingly, file systems having standardized on 512 byte blocks tend
to divide these blocks and continue to present to the OS 512 byte “physical” blocks.
cylinders near the beginning (i.e. the outside) of the platters. Nowadays such
fine-tuning may no longer be common; instead, people have started to
simply create a single partition occupying only about 25% of the disk at the
beginning, ignoring the rest. This technique, known as “short stroking”,
may seem wasteful, but given the low prices of today’s HDDs, the performance
gain may well make it worthwhile.
It is worth noting that the physical disk structure described here applies
only to traditional mechanical hard drives, not to Solid State Drives or other
storage media. Nevertheless, it is useful to understand the structure, as a
number of file system or partitioning conventions derive directly from these
physical restrictions.
view of the storage space as a single unit. Finally, the ways in which we
divide or combine disks have implications on system performance and data
redundancy.
4.4.1 Partitions
Now that we understand the physical layout of the hard disks, we can take
a look at how partitions are created and used. As we noted in the pre-
vious section, a disk partition is a grouping of adjacent cylinders through
all platters of a hard drive.¹⁰ Despite this unifying principle, we encounter
a variety of partition types and an abundance of related terminology: there
are partition tables and disklabels, primary and extended partitions; there
are whole-disk partitions, disks with multiple partitions, and some of the
partitions on a disk may even overlap.
Different file systems and anticipated uses of the data on a disk require
different kinds of partitions. First, in order for a disk to be bootable, we
require it to have a boot sector, a small region that contains the code that
the computer’s firmware (such as the Basic Input/Output System (BIOS))
can load into memory. In fact, this is precisely what a BIOS does: it runs
whatever code it finds in the first sector of the device so long as it matches
a very simple boot signature. That is, regardless of the total capacity of the
disk in question, the code that chooses how or what to boot needs to fit into
a single sector, 512 bytes. On most commodity servers¹¹, this code is known
as the Master Boot Record or MBR. In a classical MBR, the last two bytes of
this sector contain the signature 0x55 0xAA; the bootstrap code area itself
takes up 446 bytes, leaving 64 bytes of space. At 16 bytes per partition entry,
we can have at most four such BIOS partitions that the MBR can transfer
control to.
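The 446 + 64 + 2 byte layout can be sketched with a scratch file standing in
for the first sector of a disk (an illustration only; writing to a real disk
device this way would of course destroy its partition table):

```shell
mbr=$(mktemp)
dd if=/dev/zero of="$mbr" bs=446 count=1 2>/dev/null   # bootstrap code area
dd if=/dev/zero bs=16 count=4 2>/dev/null >>"$mbr"     # four 16-byte partition entries
printf '\125\252' >>"$mbr"                             # 0x55 0xAA boot signature
wc -c <"$mbr"                                          # 512 -- exactly one sector
od -An -tx1 -j 510 "$mbr"                              # last two bytes: 55 aa
rm -f "$mbr"
```

A BIOS inspecting this “sector” would find the magic signature at offset 510
and consider it bootable, signature-wise, despite the all-zero code area.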
Sometimes you may want to divide the available disk space into more
than four partitions. In order to accomplish this, instead of four primary
¹⁰ Some operating systems such as Solaris, for example, have traditionally referred to
partitions as “slices”; some of the BSD systems also refer to disk “slices” within the context
of BIOS partitions. Unfortunately, this easily brings to mind the misleading image of a
slice of pie, a wedge, which misrepresents how partitions are actually laid out on the disk.
¹¹ Even though many other hardware architectures used to be dominant in the server
market, nowadays the x86 instruction set (also known as “IBM PC-compatible” computers)
has replaced most other systems. For simplicity’s sake, we will assume this architecture
throughout this chapter.
# fdisk -l

Disk /dev/cciss/c0d0: 73.3 GB, 73372631040 bytes
255 heads, 63 sectors/track, 8920 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
partitions, the Master Boot Record (MBR) allows you to specify three pri-
mary and one so-called extended partition, which can be subdivided further
as needed. When the system boots, the BIOS will load the MBR code, which
searches its partition table for an “active” partition, from which it will then
load and execute the boot block. This allows the user to run multiple operat-
ing systems from the same physical hard drive, for example. BIOS partitions
are usually created or maintained using the fdisk(8) utility.
Just as the first sector of the disk contains the MBR, so does the first
sector of a BIOS partition contain a volume boot record, also known as a par-
tition boot sector. In this sector, the system administrator may have placed a
second-stage bootloader, a small program that allows the user some control
over the boot process by providing, for example, a selection of different
kernels or boot options to choose from. Only one partition is necessary to boot
the OS, but this partition needs to contain all the libraries and executables
to bootstrap the system. Any additional partitions are made available to the
OS at different points during the boot process.
In the BSD family of operating systems, the volume boot record contains
a disklabel, detailed information about the geometry of the disk and the par-
titions it is divided into. Listing 4.2 shows the output of the disklabel(8)
command on a NetBSD system. You can see the breakdown of the disk’s
geometry by cylinders, sectors and tracks and the partitioning of the disk
space by sector boundaries. This example shows a 40 GB¹² disk containing
three partitions: a 10 GB root partition, a 512 MB swap partition, and a data
partition comprising the remainder of the disk. Since the disk in question
¹² (78140160 sectors × 512 bytes/sector) / (1024³ bytes/GB) = 37.26 GB.
Note the difference in actual versus reported disk size:
(40 × 2³⁰ − 40 × 10⁹) / 1024³ = 40 − 37.26 = 2.74 GB.
# dmesg | grep xbd3
xbd3 at xenbus0 id 3: Xen Virtual Block Device Interface
xbd3: 24575 MB, 512 bytes/sect x 50331520 sectors
# disklabel /dev/xbd3
type: ESDI
disk: Universal Swap
label: disk1
bytes/sector: 512
sectors/track: 306432
tracks/cylinder: 1
sectors/cylinder: 306432
cylinders: 255
total sectors: 78140160
rpm: 3600
interleave: 1

8 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:  20972385        63     4.2BSD   4096 32768  1180  # (Cyl.     0*-  20805)
 b:   1048320  20972448       swap                     # (Cyl. 20806 -  21845)
 c:  78140097        63     unused      0     0        # (Cyl.     0*-  77519)
 d:  78140160         0     unused      0     0        # (Cyl.     0 -  77519)
 e:  56119392  22020768     4.2BSD   4096 32768 58528  # (Cyl. 21846 -  77519)
#
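The mismatch between marketed decimal gigabytes and the binary gibibytes
reported by the operating system is easy to reproduce from the disklabel’s
sector count:

```shell
awk 'BEGIN {
    sectors = 78140160                     # total sectors from the disklabel
    bytes   = sectors * 512
    printf "actual:   %.2f GiB\n", bytes / (1024^3)   # 37.26
    printf "marketed: %.2f GiB\n", 40e9  / (1024^3)   # a "40 GB" label: 37.25
}'
```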
Dividing a single large disk into multiple smaller partitions is done for a
number of good reasons: if you wish to install multiple operating systems, for
example, you need to have dedicated disk space as well as a bootable primary
partition for each OS. You may also use partitions to ensure that data written
to one location (log files, for example, commonly stored under e.g. /var/log)
cannot cause you to run out of disk space in another (such as user data under
/home). Other reasons to create di↵erent partitions frequently involve the
choice of file system or mount options, which necessarily can be applied only
on a per-partition basis. We will discuss a number of examples in Section
4.5.
Figure 4.8: Logical Volume Management lets you combine multiple physical
disks or partitions into a single volume group, from which logical volumes can
be allocated.
data on multiple devices. These last two advantages are also provided by the
RAID storage solution, which we will look at in more detail in the following
section.
4.4.3 RAID
Logical Volume Managers provide a good way to consolidate multiple disks
into a single large storage resource from which individual volumes can be
created. An LVM may also provide a performance boost by striping data, or
redundancy by mirroring data across multiple drives.
Another popular storage technology used for these purposes is RAID,
which stands for Redundant Array of Independent Disks.¹⁴ Multiple disks
can be combined in a number of ways to accomplish one or more of these
goals: (1) increased total disk space, (2) increased performance, (3) increased
data redundancy.
Much like an LVM, RAID hides the complexity of the management
of these devices from the OS and simply presents a virtual disk comprised
of multiple physical devices. However, unlike with logical volume manage-
¹⁴ The acronym “RAID” is sometimes expanded as “Redundant Array of Inexpensive
Disks”; since disks have become less and less expensive over time, it has become more
customary to stress the “independent” part. A cynic might suggest that this change in
terminology was driven by manufacturers, who have an interest in not explicitly promising
a low price.
RAID 0
By writing data blocks in parallel across all available disks (see Figure 4.9a),
RAID 0 accomplishes a significant performance increase. At the same time,
available disk space is linearly increased (i.e. two 500 GB drives yield 1 TB
of disk space, minus overhead). However, RAID 0 does not provide any fault
tolerance: any disk failure in the array causes data loss. What’s more, as
you increase the number of drives, you also increase the probability of disk
failure.
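The last point follows directly from the probabilities involved: the array
survives only if every drive does. A quick sketch, assuming independent
failures and an illustrative per-drive failure probability p:

```shell
# P(array loss) = 1 - (1 - p)^n for a RAID 0 of n drives, each failing
# within a given period with probability p.
awk -v n=4 -v p=0.03 'BEGIN {
    printf "single drive: %.1f%%  array of %d: %.1f%%\n",
           p * 100, n, (1 - (1 - p)^n) * 100
}'
```

With four drives and a 3% per-drive failure rate, the chance of losing the
whole array rises to roughly 11.5%.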
RAID 1
This configuration provides increased fault tolerance and data redundancy
by writing all blocks to all disks in the array, as shown in Figure 4.9b. When
a disk drive fails, the array goes into degraded mode, with all I/O operations
continuing on the healthy disk. The failed drive can then be replaced (hot-
swapped), and the RAID controller rebuilds the original array, copying all
data from the healthy drive to the new drive, after which full mirroring
will again happen for all writes. This fault tolerance comes at the price of
available disk space: for an array with two drives with a 500 GB capacity,
the total available space remains 500 GB.
Figure 4.9: Three of the most common RAID levels illustrated. RAID 0
increases performance, as blocks are written in parallel across all available
disks. RAID 1 provides redundancy, as blocks are written identically to all
available disks. RAID 5 aims to provide increased disk space as well as
redundancy, as data is striped and parity information distributed across all
disks.
RAID 5
This level provides a bit of both RAID 0 and RAID 1: data is written
across all available disks, and for each such stripe the data parity is recorded.
Unlike in levels 2 through 4, this parity is not stored on a single, dedicated
parity drive, but instead distributed across all disks. See Figure 4.9c for an
illustration of the block distribution.
Since parity information is written in addition to the raw data, a RAID 5
cannot increase disk capacity as linearly as a RAID 0. However, any one of
the drives in this array can fail without impacting data availability. Again,
as in the case of a RAID 1 configuration, the array will go into degraded
mode and get rebuilt when the failed disk has been replaced. However, the
performance of the array is decreased while a failed drive remains in the
array, as missing data has to be calculated from the parity; the performance
is similarly reduced as the array is being rebuilt. Depending on the size of
the disks in question, this task can take hours; all the while, the array remains
in degraded mode, and another failed drive would lead to data loss.
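The capacity cost of the distributed parity is exactly one drive’s worth of
space: n drives of size s yield (n − 1) × s of usable storage. For a
hypothetical array of five 500 GB drives:

```shell
awk -v n=5 -v s=500 'BEGIN {
    printf "raw: %d GB  usable: %d GB  parity overhead: %.0f%%\n",
           n * s, (n - 1) * s, 100 / n
}'
```

Note that the relative overhead shrinks as drives are added, which is one
reason wide RAID 5 stripes are tempting despite the rebuild risk.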
probability of a second drive failing is actually higher than one would expect
with truly independent drives.
Secondly, the actions performed by the RAID on the disks need to be
taken into consideration as well. When a disk in a RAID 5 array fails and
is replaced, all of the disks will undergo additional stress as all of the data
is read in order to rebuild the redundant array. This process may well take
several hours if not days – a very long period of time during which our data is
at increased risk. In other words, the failure of a single drive will necessarily
increase the probability of failure of the other drives.¹⁵
When choosing a composite RAID architecture, it is well worth our time
to consider the Mean Time to Recovery (MTTR) in addition to the other
factors. From the moment that a disk failure is detected until the array has
been rebuilt, our data is at severe risk. In order to reduce the MTTR, many
system administrators deploy so-called hot spares, disks that are installed in
the server and known to the RAID controller, but that are inactive until a
disk failure is detected. At that time, the array is immediately rebuilt using
this stand-by drive; when the faulty disk has been replaced, the array is
already in non-degraded mode and the new disk becomes the hot spare.
Figure 4.10: An Apple Xserve RAID, a now discontinued storage device with
14 Ultra-ATA slots offering Fibre Channel connectivity and implementing a
number of RAID levels in hardware as well as software across two independent
controllers.
¹⁵ RAID 6 improves this situation, at the expense of efficiency. Dropping prices have,
not very surprisingly, led more and more enterprise environments to consider this trade-off
entirely acceptable.
The “left” RAID was dedicated to storing large amounts of video and
audio data made available to clients running Mac OS X. We connected
the RAID controller via Fibre Channel to a SAN switch, and from there
to an Apple Xserve network server, which managed the HFS+ file system
on this storage component.
The second 2.2 TB of storage space, the “right” side of the array, was
meant to become the central data space for all workstations in the Computer
Science and Mathematics departments as well as their laboratories.
Up until then, this file space had been provided via NFS from a two-mod-
ule SGI Origin 200 server running IRIX, managing a few internal SCSI
disks as well as some Fibre Channel direct attached storage. We intended
to migrate the data onto the XServe RAID, and to have it served via a
Solaris 10 server, allowing us to take advantage of several advanced fea-
tures in the fairly new ZFS and to retire the aging IRIX box.
Neatly racked, I connected the second RAID controller and the new So-
laris server to the SAN switch, and then proceeded to create a new ZFS
file system. I connected the Fibre Channel storage from the IRIX server
and started to copy the data onto the new ZFS file system. As I was
sitting in the server room, I was able to see the XServe RAID; I noticed
the lights on the left side of the array indicating significant disk activity,
but I initially dismissed this as not out of the ordinary. But a few seconds
later, when the right side still did not show any I/O, it dawned on me:
the Solaris host was writing data over the live file system instead of onto
the new disks!
Now it was interesting to note that at the same time as I was overwriting
the live file system, data was still being written to and read from the
HFS+ file system on the Apple server. I was only able to observe inter-
mittent I/O errors. Thinking I could still save the data, I made my next
big mistake: I shut down the Apple server, hoping a clean boot and file
system check could correct what I still thought was a minor problem.
Unfortunately, however, when the server came back up, it was unable to
find a file system on the attached RAID array! It simply could not identify
the device. In retrospect, this is no surprise: the Solaris server had
constructed a new (and different) file system on the device and destroyed
all the HFS+ specific file system metadata stored at the beginning of
the disks. That is, even though the blocks containing the data were likely
not overwritten, there was no way to identify them. After many hours
of trying to recreate the HFS+ metadata, I had to face the fact that this
was simply impossible. What was worse, I had neglected to verify that
backups for the server were done before putting it into production use –
fatal mistake number three! The data was irrevocably lost; the only plus
side was that I had learned a lot about data recovery, SAN zoning, ZFS,
HFS+ and file systems in general.
As we discuss these file systems, we will find that the approach to storing
both the actual file data as well as the metadata, the information associated
with the files, differs significantly. We will also notice – as a recurring pattern
throughout this book – how different layers of abstraction help us improve
portability and provide a consistent interface, as standard file I/O semantics
can remain the same across different file systems.
the inode structures, or the Virtual File System layer to allow the kernel to
support multiple different file systems remains largely the same across the
different Unix operating and file systems.
System[5], for example, provides data confidentiality through the use of cryp-
tography, thus allowing complete decentralization over the (public) Internet.
On the other hand, the advent of cloud computing has further blurred the
boundaries of where a distributed file system ends and where a network
service begins: Amazon’s Simple Storage Service (S3), for example, despite
being an API driven web service, might as well be considered a distributed
file system, but then so could online storage providers, such as Dropbox, who
build their solutions on top of Amazon’s service. It will be interesting to see
how these distributed file systems or data stores evolve with time.
kernel space: file I/O system calls do happen here and devices and certain
other resources can only be controlled by the kernel; likewise, mounting or
unmounting file systems requires superuser privileges. But not all file sys-
tems are actually accessing any of the resources protected by the kernel. In
order to facilitate non-privileged access to file system interfaces, some Unix
systems have created kernel modules that provide a method to implement
virtual “File Systems in Userspace” (known as FUSE).
All of these file systems that aren’t actually file systems but rather an ab-
straction of other resources into a file API illustrate how much the concept of
simplicity permeates the Unix culture. At this point, few system administra-
tors would wish to part with the convenience of, for example, procfs – in fact,
its availability is almost universally assumed. For a polyglot administrator,
however, it is important to be familiar with the different pseudo- and virtual
file systems available on the different Unix versions and, more importantly,
know how to access the resources they represent in their absence.
Figure 4.11: The Unix file system is a tree-like structure, rooted at /; different
file systems can be attached at different directories or mount points. In this
illustration, /home and /usr reside on separate disks from /.
a string.18 Most Unix systems impose a maximum file name length of 255
bytes and a maximum pathname length of 1024 bytes; however, these are file
system and OS specific limits.
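These limits need not be hard-coded: they can be queried at runtime via the pathconf(3) interface. A small sketch using Python's os module (the values printed will vary by file system and OS):

```python
import os

# Query per-file-system limits via pathconf(3); the names below are
# standard POSIX configuration variables.
name_max = os.pathconf("/", "PC_NAME_MAX")   # maximum file name length
path_max = os.pathconf("/", "PC_PATH_MAX")   # maximum pathname length

print("NAME_MAX on /:", name_max)
print("PATH_MAX on /:", path_max)
```

On a typical Linux system this reports 255 and 4096 for the root file system, but mounting, say, a FAT or ISO 9660 file system can yield different limits for different parts of the tree.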
Every Unix process has the concept of a current working directory – run
the pwd(1) command or type “echo $PWD” in your shell to display where in
the file system hierarchy you currently are. Pathnames may either be relative
to this location, or, if they begin with a /, absolute.
Each directory contains at least two entries: “.” (dot), a name for the
current working directory and “..” (dot-dot), a name for the parent direc-
18 It is a common mistake to assume that a file name contains only printable characters.
The Unix system does not impose many restrictions on how a user might choose to name
a file, and file names containing e.g. control characters such as a newline ("\n"), while
confusing on a line-buffered terminal, are possible.
$ pwd
/home/jschauma                # pwd(1) writes the absolute pathname
$ echo hello > file           # create "file" in the current directory
$ cd /usr/share/doc           # cd(1) using an absolute pathname
$ pwd
/usr/share/doc                # no surprise here
$ cat /home/jschauma/file     # now using an absolute pathname
hello
$ cd ../../../home/jschauma   # cd(1) using a relative pathname
$ pwd
/home/jschauma
$ cat file                    # "./file" would also work
hello
$
Listing 4.3: Absolute pathnames begin with a / and are resolved from the
root of the file system; relative pathnames are resolved from the current
working directory.
tory.19 Since relative pathnames are resolved from within the current working
directory, the same file can be referred to by different names, as shown in the
examples in Listing 4.3.
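The pathname resolution performed by cd(1) in Listing 4.3 can also be traced by hand; a sketch using Python's os.path helpers (note that this resolution is purely lexical and, unlike the kernel's, ignores symbolic links):

```python
import os.path

# Resolving the relative cd(1) target from Listing 4.3: starting in
# /usr/share/doc, "../../../home/jschauma" climbs three levels up to
# the root and then descends into /home/jschauma.
cwd = "/usr/share/doc"
target = os.path.normpath(os.path.join(cwd, "../../../home/jschauma"))
print(target)                          # /home/jschauma

print(os.path.isabs(target))           # True: begins with a /
print(os.path.isabs("../../../home"))  # False: relative to the cwd
```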
the location of the inode and data blocks. As this information is critical
for the operation of the file system and corruption of this block would
be disastrous, it is replicated and stored in a number of (predictable)
locations. This allows the superuser to repair a corrupted file system
by pointing to an alternate superblock.
• A number of cylinder groups, which break the large file system into
more manageable chunks by distributing metadata evenly across the
physical partition.
solid state drives, for example, we no longer suffer performance penalties due
to seek time.
It is important to note that the data blocks used by the file system are
different from the physical blocks of the hard disk. The latter are, as we
discussed in Section 4.3.1, 512 bytes in size (or, in more recent drives, 4096
bytes); the former – called the logical block size – can be decided on by the
system administrator at file system creation time. UFS uses a minimum log-
ical block size of 4096 bytes, and defaults to larger block sizes based on the
overall size of the file system. Likewise, the number of inodes in a file system
is fixed once the file system has been created. Here, UFS defaults to an inode
density of one inode per 2048 bytes of space for small file systems – consult
the newfs(8) manual page for this and other parameters to define and tune
the file system at its creation time.
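To get a feel for these defaults, a bit of illustrative arithmetic (a sketch only; the one-inode-per-2048-bytes density is the small-file-system default cited above, and the actual values chosen by newfs(8) will differ by OS and file system size):

```python
def inode_count(fs_bytes: int, bytes_per_inode: int = 2048) -> int:
    """Approximate the number of inodes allocated at file system
    creation time for a given inode density (one inode per
    bytes_per_inode of space)."""
    return fs_bytes // bytes_per_inode

gib = 1024 ** 3
# A 1 GiB file system at the default small-file-system density:
print(inode_count(gib))        # 524288 inodes
# The same file system at a sparser density of one inode per 4 KiB:
print(inode_count(gib, 4096))  # 262144 inodes
```

Since this number is fixed at creation time, a file system populated with very many tiny files can run out of inodes long before it runs out of data blocks (a scenario explored in the exercises at the end of this chapter).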
Let us dive a little bit deeper and think about how the Unix File System
manages disk space: A file system’s primary task is to store data on behalf
of the users. In order to read or write this data, it needs to know in which
logical blocks it is located. That is, the file system needs a map of the blocks,
a way to identify and address each location. This is accomplished by way
of the inode and data block maps: the total number of inodes represents the
total number of files that can be referenced on this file system, while the data
blocks represent the space in which the file data is stored.
As we noted, a data block is, necessarily, of a fixed size. That means
that if we wish to store a file that is larger than a single block, we have to
allocate multiple blocks and make a note of which blocks belong to the given
file. This information is stored as pointers to the disk blocks within the inode
data structure.
Unfortunately, however, not all files will be multiples of the logical block
size. Likewise, it is possible that files will be smaller than a single block.
In other words, we will always end up with blocks that are only partially
allocated, wasting disk space. In order to allow more efficient management
of small files, UFS allows a logical block to be divided further into so-called
fragments, providing a way to let the file system address smaller units of
storage. The smallest possible fragment size then is the physical block size
of the disk, and logical blocks are only fragmented when needed.
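The savings from fragments can be sketched with a little arithmetic, assuming, for illustration, a 4096-byte logical block divided into 512-byte fragments:

```python
def blocks_and_frags(file_size: int, block: int = 4096, frag: int = 512):
    """Return (full_blocks, fragments) needed to store file_size bytes,
    with the tail of the file placed in fragments rather than in a
    mostly empty full block."""
    full = file_size // block
    tail = file_size % block
    frags = (tail + frag - 1) // frag   # round the tail up to whole fragments
    return full, frags

# A 10,000-byte file: two full 4096-byte blocks plus a 1,808-byte tail,
# which fits in four 512-byte fragments instead of wasting most of a
# third full block.
print(blocks_and_frags(10000))  # (2, 4)
print(blocks_and_frags(100))    # (0, 1): a tiny file costs one fragment
```

In this example the tail occupies 2048 bytes of fragments instead of a full 4096-byte block, halving the internal waste for that file.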
A pointer to the data blocks and fragments allocated to a given file is
stored in the inode data structure, which comprises all additional information
about the file. This allows for an elegant separation of a file’s metadata,
$ ls -ai /
      2 .          5740416 cdrom    5816448 libexec   2775168 stand
      2 ..         1558656 dev      1862784 mnt       2166912 tmp
3003280 .cshrc      988439 emul           4 netbsd    3573504 usr
3003284 .profile    342144 etc      1824768 proc      3497472 var
3421440 altroot    1026432 home      798336 rescue
5702400 bin        3763584 lib      3003264 root
      3 boot.cfg   2204928 libdata  5588352 sbin
$
Listing 4.4: Use of the ls(1) command on a NetBSD system to illustrate how
file names are mapped to inode numbers in a directory. (Note that in the
root directory both ’.’ and ’..’ have the same inode number as in this special
case they actually are the same directory.)
which takes up only a fixed and small amount of disk space, and its contents.
Accessing the metadata is therefore independent of the file size (itself a piece
of metadata), allowing for efficient and fast retrieval of the file’s properties
without requiring access to the disk blocks. Other pieces of information
stored in the inode data structure include the file’s permissions, the numeric
user-id of the owner, the numeric group-id of the owner, the file’s last access,
modification and file status change times20, the number of blocks allocated
for the file, the block size of the file system and the device the file resides on,
and of course the inode number identifying the data structure.
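These fields map almost directly onto the structure returned by the stat(2) system call; a sketch using Python's os.stat() wrapper (the temporary file is created only for illustration):

```python
import os
import tempfile

# Create a throwaway file and inspect its inode metadata.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello\n")
os.close(fd)

st = os.stat(path)
print("inode:  ", st.st_ino)          # inode number
print("mode:   ", oct(st.st_mode))    # file type and permission bits
print("uid/gid:", st.st_uid, st.st_gid)
print("size:   ", st.st_size)         # 6 bytes -- a piece of metadata
print("blocks: ", st.st_blocks)       # number of 512-byte blocks allocated
print("ctime:  ", st.st_ctime)        # last status change, not creation

os.unlink(path)
```

Note that none of this required reading the file's data blocks: retrieving the metadata is independent of the file's size.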
Now humans tend to be rather bad at remembering large numbers and
prefer the use of strings to represent a file, but the one piece of information
that is not stored in the inode is the file name. Instead, the Unix File System
allows for a mapping between a file name and its unique identifier – its inode
– through the use of directories. A directory really is nothing but a special
type of file: it has the same properties as any other file, but the data it
holds is well-structured (in contrast to the byte stream contained in so-called
“regular” files) and consists of inode number and file name pairs. You can
use the ls(1) command to illustrate this – see Listing 4.4.
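The same name-to-inode mapping can be read programmatically; a sketch mimicking the `ls -ai` output of Listing 4.4 with Python's os.scandir() (the directory and file names here are hypothetical):

```python
import os
import tempfile

# Build a small directory and print its name-to-inode mapping,
# much like "ls -ai" in Listing 4.4.
d = tempfile.mkdtemp()
for name in ("alpha", "beta"):
    open(os.path.join(d, name), "w").close()

mapping = {entry.name: entry.inode() for entry in os.scandir(d)}
for name, ino in sorted(mapping.items()):
    print(f"{ino:>10} {name}")
```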
A mapping of an inode number to a file name is called a hardlink, even
though humans tend to prefer the term “file name”. An inode can be accessed
by more than just one file name: within a single file system, each file may
20 A file's ctime, its time of last file status change, is frequently misinterpreted to be the
file's creation time; most file systems, including UFS, do not store this kind of information.
Instead, the ctime reflects the last time the metadata of a file was changed.
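That two names can reference one and the same inode is easy to verify; a sketch using Python's os.link() wrapper around link(2) (the file names are hypothetical, and this mirrors one of the exercises at the end of the chapter):

```python
import os
import tempfile

# Create a file, then a second hardlink to it: both names map to the
# same inode, so metadata and contents are shared between them.
d = tempfile.mkdtemp()
a = os.path.join(d, "name1")
b = os.path.join(d, "name2")
with open(a, "w") as f:
    f.write("data")
os.link(a, b)                        # second hardlink, same inode

st_a, st_b = os.stat(a), os.stat(b)
print(st_a.st_ino == st_b.st_ino)    # True: one inode, two names
print(st_a.st_nlink)                 # 2: the inode's link count went up
```

Appending data via one name and reading it back via the other confirms that there is only one underlying file; the data blocks are not duplicated.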
The file type as well as its permissions and all the other properties of
a file can be inspected using the stat(1) command (see Listing 4.5 for an
Listing 4.5: Sample output of the stat(1) command on a Linux system show-
ing the various pieces of information stored in the inode data structure for
the file “/etc/passwd”.
example), or, perhaps more commonly, using the ls(1) command (as illus-
trated in Figure 4.12). This command is so frequently used and its output
so ubiquitous that any system administrator can recite the meaning of the
fields in their sleep. The semantics and order in which permissions are ap-
plied, however, include a few non-obvious caveats, which is why we will look
at the Unix permissions model in more detail in Chapter 6.
Over the last 30 years, UFS has served as the canonical file system for
almost all Unix versions. With time, a number of changes have become
necessary: more fine-grained access controls than the traditional Unix per-
missions model allows were made possible via file system extensions such as
ACLs; larger storage devices required not only updates to the block address-
ing schemas, but also to the different data types representing various file
system aspects; today’s huge amounts of available data storage have made
log-based file systems or journaling capabilities a necessity, and massively dis-
tributed data stores pose entirely different requirements on the underlying
implementation of how the space is managed.
Yet through all this, as enhancements have been made both by commer-
cial vendors as well as by various open source projects, the principles, the very
fundamentals of the Unix File System have remained the same: the general
concept of the inode data structure and the separation of file metadata from
the data blocks have proven reliable and elegant in their simplicity. At the
same time, this simplicity has proven to yield scalability and adaptability:
Figure 4.12: The default output of the ls -l command includes most of the
metadata of a given file.
the persistent idea that “everything is a file”, and that files simply store bytes
and thus have no inherent structure (unlike a database or certain archives),
has allowed a surprising flexibility and led to the creation of many pseudo-
or virtual file systems providing a simple API and User Interface (UI) to any
number of resources.
4.8 Conclusions
Throughout this chapter, we have built our understanding of file systems
and storage models from the ground up. We have seen how simple concepts
are combined to construct increasingly complex systems – a pattern that
weaves like a red thread through all areas we cover. We noted the circular
nature of technology development: the simple DAS model repeats itself,
albeit more complex and with additional layers, in common SANs, much as
network attached storage utilizes both DAS and SAN solutions, yet is taken
to another extreme in cloud storage solutions.
Being aware of the physical disk structure helps us understand a file sys-
tem’s structure: we realize, for example, that concentric cylinder groups make
up partitions, and that the location of data blocks within these cylinders may
have an impact on I/O performance. What’s more, the concepts of file sys-
tem blocks become clearer the further we deepen our knowledge of both the
logical and physical components, and the distinction of metadata from the
actual contents of a file allows us to explain how the various file-system-related
Unix tools operate on a fairly low level as well as how to tune these values
at file system creation time. As system administrators, this understanding
is crucial.
But what about the three pillars of strong system architecture and design
we noted earlier: Scalability, Security, Simplicity? What role do
these play in the context of file systems and storage models?
the factors, and to ask the questions we’ve started to suggest here at the end
of this chapter. Not surprisingly, this line of questions quickly leads us be-
yond the basic concepts of file systems and storage models, but likewise often
brings us full circle when we have to consider higher level applications and
consider just how exactly they access data and what kinds of assumptions
and requirements they pose as to availability, security and performance. The
topics covered in this chapter are the foundation upon which our systems are
built.
Problems and Exercises
Problems
1. Identify the storage area model(s) predominantly used in your envi-
ronment(s). What kind of problems with each do you frequently en-
counter? Would changing the storage model be a feasible solution to
these problems? Why or why not?
4. Ask your system administrators if they have any old or broken hard
drives, if possible from different manufacturers or with different capac-
ities. Open up the drives and identify the various components. How
many read-write heads are there? How many platters? How do the
different models differ?
6. Compare the various composite RAID levels and analyze their respec-
tive fault tolerance and mean time to recovery in the case of one or
(a) Create a new file, then create a second hard link for this file. Verify
that both files are completely identical by using ls(1), stat(1)
and by appending data to one of the files and reading from the
other.
(b) Rename the original file and repeat – what changed? Why?
(c) Create a new file, then create a symbolic link for this file. Verify
that both the original file and the symbolic link are unique by
inspecting their inode numbers. Then append data to the file using
the regular name and confirm that reading from the symbolic link
yields the same data.
(d) Rename the original file and repeat – what changed? Why?
15. Create a very large file. Measure how long it takes to rename the file
within one directory using the mv(1) command. Next, use mv(1) to
move the file into a directory on a di↵erent file system or partition.
What do you observe? Explain the di↵erence.
(a) Identify how much free disk space you have, then fill it up. Can
you use up all available space using a single file? What are the
effects of using up all disk space? Can you still create new, empty
files? Can you still log in? Why/why not?
(b) Identify how many free inodes are available on your root file sys-
tem, then use them up by, e.g., creating lots and lots of empty files.
What happens to the directory size in which you create these files?
What is the error message once you run out of inodes? Can you
still create new files? Can you still write to existing files? How
much disk space is available now? Can you still log in? Why/why
not?
Bibliography
[1] EMC Education Service, Information Storage and Management: Stor-
ing, Managing, and Protecting Digital Information, John Wiley & Sons,
2009
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google
File System, in “Proceedings of the nineteenth ACM symposium on Op-
erating systems principles”, 2003, ACM, New York, NY; also available
on the Internet at http://research.google.com/archive/gfs-sosp2003.pdf
(visited September 6, 2012)
[9] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert
S. Fabry, A Fast File System for UNIX, ACM Transactions on Computer
Systems 2 (3), 1984, ACM, New York, NY; also available on the In-
ternet at http://www.cs.berkeley.edu/~brewer/cs262/FFS.pdf (visited
September 9, 2012)