The Linux Device File-System

Richard Gooch
EMC Corporation
rgooch@atnf.csiro.au

Abstract

The Device File-System (devfs) provides a powerful new device management mechanism for Linux. Unlike other existing and proposed device management schemes, it is powerful, flexible, scalable and efficient.

It is an alternative to conventional disc-based character and block special devices. Kernel device drivers can register devices by name rather than device numbers, and these device entries will appear in the file-system automatically.

Devfs provides an immediate benefit to system administrators, as it implements a device naming scheme which is more convenient for large systems (providing a topology-based name-space) and small systems (via a device-class based name-space) alike.

Device driver authors can benefit from devfs by avoiding the need to obtain an official device number or numbers, which are a diminishing resource, prior to releasing their product. Furthermore, they have complete autonomy in their assigned sub name-space, facilitating rapid product changes where necessary.

The full power of devfs is released when combined with devfsd, the devfs daemon. This combination allows administrators and developers to manage devices in new ways, from creating custom and virtual device name-spaces, to managing hot-plug devices (i.e. USB, PCMCIA and FireWire). Devfsd provides a lightweight, uniform mechanism for managing diverse device types.

1 Introduction

All Unix systems provide access to hardware via device drivers. These drivers need to provide entry points for user-space applications and system tools to access the hardware. Following the “everything is a file” philosophy of Unix, these entry points are exposed in the file name-space, and are called “device special files” or “device nodes”.

This paper discusses how these device nodes are created and managed in conventional Unix systems and the limitations this scheme imposes. An alternative mechanism is then presented.

1.1 Device numbers

Conventional Unix systems have the concept of a “device number”. Each instance of a driver and hardware component is assigned a unique device number. Within the kernel, this device number is used to refer to the hardware and driver instance. The device number is encoded in the appropriate device node, and the device nodes are stored on normal disc-based file-systems.

To aid management of device numbers, they are split into two components. These are called “major” and “minor” numbers. Each driver is assigned a major number. The minor number is used by the driver to determine which particular hardware instance is being accessed.

Prior to accessing a piece of hardware (for example, a disc drive), the appropriate device node(s) must be created. By convention, these are stored in the /dev directory.
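As an illustration of how the device number is carried by a device node, the following minimal Python sketch reads the number stored in a node and splits it into its major and minor components. It is not part of the original paper. The helper name describe_node is hypothetical; os.stat, os.major and os.minor are standard library calls which decode whatever encoding the running system uses.

```python
import os
import stat


def describe_node(path):
    """Return (kind, major, minor) for a device node, or None if path is not one."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "char"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        # Regular files, directories and so on carry no device number of their own.
        return None
    # st_rdev is the device number encoded in the node itself;
    # os.major/os.minor split it into its two components.
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)


if __name__ == "__main__":
    for name in ("/dev/null", "/dev/zero", "/dev/tty"):
        if os.path.exists(name):
            print(name, describe_node(name))
```

Run on a conventional system, this prints one line per node that exists. The values come from the nodes on disc and not from the drivers, which is the point the following sections build on.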
1.1.1 Linux Implementation

In the Linux kernel, device numbers are currently stored in 16 bit integers. The major number component is allocated 8 bits, and the remaining 8 bits are used for the minor number. Each driver is thus allocated one of the 256 possible major numbers.

Each driver must register the major number it wishes to use, and the “driver operation methods” which must be called when file operations for that driver must be performed. These operations include opening, closing, reading, writing and others. The driver methods are stored in a table, indexed by major number, for later use. This table is called the “major table”.

When a device node is opened (i.e. when a process executes the open(2) system call on that node), the major number is extracted and is used to index into the major table and determine the driver methods. These methods are then recorded into the “file structure”, which is the in-kernel representation of the new file descriptor handle given to the process.

Subsequent operations on the file descriptor (such as reading and writing) will result in the driver methods being called, and thus control is transferred to the driver. The driver will then use the minor number to determine which hardware instance must be manipulated.

1.2 Limitations

Device numbers, traditional device nodes and the Linux implementation have several limitations, discussed below.

1.2.1 Major and Minor size

Existing major and minor numbers are limited to 8 bits each. This is now a limiting factor for some drivers, particularly the SCSI disc driver, which originally consumed a single major number. Since 4 bits were assigned to the partition index (supporting 15 partitions per disc), this left 4 bits for the disc index. Thus, only 16 discs were supported.

A subsequent change reserved another 7 major numbers for SCSI discs, which has increased the number of supported discs to 128. While sufficient for medium enterprises, it is insufficient for large systems.

Large storage arrays can currently present thousands of logical volumes (one vendor can present 4096, and will soon double that figure). To support this under Linux would require the reservation of thousands of device numbers. This would rapidly erode the remaining device numbers, by consuming 16 or 32 major numbers for a single device driver. Combined with pressure from other device drivers, the device number space will soon be exhausted.

The limitation imposed by a 16 bit device number must be resolved if Linux is to grow in the enterprise market.

1.2.2 Device Number and Name Allocation

The conventional scheme requires the allocation of major and minor device numbers for each and every device. This means that a central co-ordinating authority is required to issue these device numbers (unless you’re developing a “private” device driver), in order to preserve uniqueness.

In addition, the name of each device node must be co-ordinated by this central authority, so that applications will look for the same device names, and system administrators and distributors create the approved device nodes.

This system is not well suited to delegation of this responsibility. Thus, a bottleneck is introduced, which can delay the development and release of new devices and their drivers.

1.2.3 Information Duplication

Since device nodes are stored on disc media, they must be created by the system administrator. For standard devices one can usually find a MAKEDEV programme which creates the thousands of device nodes in common use. Thus, for a change in any one of the hundreds of device drivers which requires a device name or number change, a corresponding change is required in the MAKEDEV programme, or else the system administrator creates device nodes by hand.
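To make the bookkeeping concrete, here is a small Python sketch of a MAKEDEV-style table. It is not taken from any real MAKEDEV script. The names NODES, scsi_disc_minor and mknod_commands are hypothetical, and the sketch only prints mknod(1) command lines instead of creating nodes. The numbers shown are the long-standing Linux assignments for these nodes, and the SCSI disc minor follows the 4 bit disc index and 4 bit partition index described in section 1.2.1.

```python
# Hypothetical miniature of a MAKEDEV-style table: (name, type, major, minor).
NODES = [
    ("null", "c", 1, 3),
    ("zero", "c", 1, 5),
    ("tty", "c", 5, 0),
]

SCSI_DISC_MAJOR = 8   # the single major number originally used for SCSI discs
PARTITION_BITS = 4    # low 4 bits of the minor: partition index (0 = whole disc)
MINOR_BITS = 8        # minor numbers are 8 bits wide


def scsi_disc_minor(disc, partition=0):
    """Minor number for a partition of a SCSI disc under one 8 bit major."""
    discs_per_major = 1 << (MINOR_BITS - PARTITION_BITS)   # 16 discs
    if not 0 <= disc < discs_per_major:
        raise ValueError("disc index does not fit in one major number")
    if not 0 <= partition < (1 << PARTITION_BITS):
        raise ValueError("partition index does not fit in 4 bits")
    return (disc << PARTITION_BITS) | partition


def mknod_commands(dev_dir="/dev"):
    """Return the mknod(1) command lines a script would run for the table."""
    cmds = [
        f"mknod {dev_dir}/{name} {kind} {major} {minor}"
        for name, kind, major, minor in NODES
    ]
    for disc in range(2):   # sda and sdb only, to keep the output short
        letter = chr(ord("a") + disc)
        whole = scsi_disc_minor(disc)
        first = scsi_disc_minor(disc, 1)
        cmds.append(f"mknod {dev_dir}/sd{letter} b {SCSI_DISC_MAJOR} {whole}")
        cmds.append(f"mknod {dev_dir}/sd{letter}1 b {SCSI_DISC_MAJOR} {first}")
    return cmds


if __name__ == "__main__":
    print("\n".join(mknod_commands()))
```

Every name and number in this table repeats something the kernel and its drivers already know, so a driver change that renumbers or renames a device also requires a change here.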
The fundamental problem is that there are multiple, separate databases of major and minor numbers and device names. Device numbers are stored in the following databases:

• inside the kernel and driver source code
• in the MAKEDEV programme
• in the /dev directory on many millions of computers

and device names are stored in the following databases:

• in the MAKEDEV programme
• in the /dev directory on many millions of computers
• in the source code of thousands of applications which access device nodes.

This is a classic case of information duplication, which results in “version skew”, where one database is updated while another is not. This results in applications not finding or using the correct device nodes. There is no central database which, when changed, automatically changes all other databases.

1.2.4 /dev growth

A typical /dev has over 1200 nodes. Most of these devices simply don’t exist because the hardware is not available on any one system. The reason for the large number of installed device nodes is to cater for the possibility of hardware that may be installed. A huge /dev increases the time to access devices, as directory and inode lookup operations will need to read more data from disc.

An example of how big /dev can grow is if we consider SCSI devices:

    host        6 bits
    channel     4 bits
    id          4 bits
    lun         3 bits
    partition   6 bits
    TOTAL      23 bits

This would require 8 Mega (8 * 1024 * 1024, i.e. 2^23) inodes if all possible device nodes were stored. This would result in an impractically large /dev directory.

1.2.5 Node to driver methods translation

As discussed in section 1.1.1, each driver must store its driver methods in the major table. This table is 256 entries long, and is indexed upon each device open. If the size of device numbers were to be increased (in response to the limitations discussed in section 1.2.1), then the major table would need to be converted to a list, since it would be impractical to store a table with a very large number of entries.

If the major table is converted to a list, this would require a list traversal for each device open. This is undesirable, as it would slow down device open operations. The effect could be reduced by using a hash function, but not eliminated.

1.2.6 /dev as a system administration tool

Because /dev must contain device nodes for all conceivable devices, it does not reflect the installed hardware on the system. Thus, it cannot serve as a system administration tool. There is no mechanism to determine all the installed hardware on a system.

It is possible to determine the presence of some hardware, as some device drivers report hardware through kernel messages or create informational files in /proc, but there is no uniform format, nor is it complete.

1.2.7 PTY security

Current pseudo-tty (pty) devices are owned by root and read-writable by everyone. The user of a pty-pair cannot change ownership/protections without having root privileges. Thus, programmes which allocate pseudo-tty devices and wish to set permissions on them must be privileged. This problem is caused by the storing of permissions on disc. Privileged programmes are a potential security risk, and should be avoided where possible.

This could be solved with a secure user-space daemon which runs as root and does the actual creation
of pty-pairs. Such a daemon would require modification to every programme that wants to use this new mechanism. It also slows down creation of pty-pairs.

2 The Alternative

The solution to these and other problems is to allow device drivers to create and manage their device nodes directly. To support this, a special device file-system (devfs) is implemented, and is mounted onto /dev. This file-system allows drivers to create, modify and remove entries.

This is not a new idea. Similar schemes have been implemented for FreeBSD, Solaris, BeOS, Plan 9, QNX and IRIX. The Linux implementation is more advanced, and includes a powerful device management daemon discussed in section 3.

2.1 Linux Implementation

The Linux implementation of devfs was initiated and developed by the author in January 1998, and was accepted for inclusion in the official kernel in February 2000. As well as implementing the core file-system itself, a large number of device drivers were modified to take advantage of this new file-system.

The Linux devfs provides an interface which allows device drivers to create, modify and destroy device nodes. When “registering” (creating) device nodes, the driver operation methods are provided by the driver, and these are recorded in the newly created entry. A generic pointer may also be provided by the driver, which may be subsequently used to identify a specific device instance.

Events in the file-system (initiated by drivers registering or unregistering entries, or user-space applications searching for or changing entries) may be selectively passed to a user-space device management daemon, discussed in section 3.

2.1.1 Naming Scheme

Each device driver, or class of device drivers, has been assigned a portion of the device name-space. In most cases, the names are different from the previous convention. The previous convention used a flat hierarchy, where all device nodes were kept directly in the /dev directory, rather than using sub-directories.

Devfs implements a hierarchical name-space which is designed to reflect the topology of the hardware as seen by the device drivers. This new naming scheme reduces the number of entries in the top-level /dev directory, which aids the administrator in navigating the available hardware.

Further, this scheme is better suited to automated tools which need to manipulate different types of devices, since all devices of the same type are found in a specific portion of the name-space, which does not contain device nodes for other device types. Thus, automated tools only need to know the name of the directory to process, rather than the names of all devices in that class.

2.2 Problems Solved

The limitations discussed in section 1.2 are revisited below, showing how devfs solves these problems.

2.2.1 Major and Minor size

Because the driver methods are stored in the device node itself, there is no need to use device numbers to identify the driver and device instance. Thus, the limitations imposed by a 16 bit device number are completely avoided. An unlimited number of devices may be supported using this scheme.

2.2.2 Device Number and Name Allocation

By eliminating device numbers, there is no need for a central co-ordinating authority to allocate device numbers. There is still need for a central co-ordinating authority to allocate device names, but with the elimination of device numbers, this can be easily delegated. Each device driver can be allocated a
directory in the device name-space, by the central authority. The driver may then create device nodes under this directory, without fear of conflict with other drivers.

After allocation of the directory, the driver author is free to assign entries in that directory, without any reference to a central authority. If the driver author is also responsible for distributing the application(s) that must access these device nodes, then these applications can be changed at the same time as the driver is changed. By distributing the changed driver and application(s) at the same time, designers are free to re-engineer without fear of breaking installed systems.

2.2.3 Information Duplication

Since drivers now create their own device nodes, there is no need to maintain a MAKEDEV programme, nor is there a need to administer a /dev directory. This removes most cases of duplicated information. Device drivers are now the primary “database” of device information.

Thus, if a user installs a new version of a driver, there is no need to create a device node. This in turn means that there is no need for a device driver author to contact the maintainer of the MAKEDEV programme to update and release a new version of MAKEDEV. The user will not be called upon to down-load a new version of MAKEDEV, or to manually create device nodes if a new version is not available.

2.2.4 /dev growth

Since device drivers now register device nodes as hardware is detected, /dev no longer needs to be filled with thousands (and potentially millions) of device nodes that may be needed. Instead, /dev will only contain a small number of device nodes.

In addition, devfs eliminates all disc storage requirements for the /dev directory. An extfs inode consumes 128 bytes of media storage. With over a thousand device nodes, /dev can consume 128 kBytes or more. Reclaiming this space is of particular benefit to embedded systems, which have limited memory and storage resources. Installation floppy discs, with their small capacities, also benefit.

2.2.5 Node to driver methods translation

When drivers register device entries, their driver methods are recorded in the device node. This is later used when the device node is opened, eliminating the need to find the driver methods. Thus, there is a direct link between the driver and the device node.

This linking is architecturally better, as it eliminates a level of indirection, and thus improves code clarity as well as avoiding extra processing cycles.

2.2.6 /dev as a system administration tool

With /dev being managed by device drivers, it now also becomes an administrative tool. A simple listing of the directory will reveal which devices are currently available. For the first time, there is a way to determine the available devices. Furthermore, all devices are presented in a uniform way, as device nodes in devfs.

2.2.7 PTY security

Devfs allows a device driver to “tag” certain device files so that when an unopened device is opened, the ownerships are changed to the current effective uid and gid of the opening process, and the protections are changed to the default registered by the driver. When the device is closed, ownership is set back to root and protections are set back to read-write for everybody.

This solves the problem of pseudo-terminal security, without the need to modify any programmes. The specialised “devpts” file-system has a similar feature for Unix98 pseudo-terminals, but this does not work for the classic Berkeley-style pseudo-terminals. The devfs implementation works for both pseudo-terminal variants.

2.3 Further Benefits

Besides overcoming the previously discussed limitations, devfs provides a number of other benefits, described below.
2.3.1 Read-only root file-system

Device nodes usually must reside on the root file-system, so that when mounting other file-systems, the device nodes for the corresponding disc media are available. Having device nodes on a read-only root file-system would prevent ownership and protection changes to these device nodes.

The most common need for changing permissions of device nodes is for terminal (tty) devices. Thus, it is impractical to mount a CD-ROM as the root file-system for a production system, since tty permissions will not be changeable. Similarly, the root file-system cannot reside on a ROM-FS (often used on embedded systems to save space).

A similar problem exists for systems where the root file-system is mounted from an NFS server. Multiple systems cannot mount the same NFS root file-system because there would be a conflict between the machines as device node permissions need to be changed.

These problems can be worked around by creating a RAMDISC at boot time, making an ext2 file-system in it, mounting it somewhere and copying the contents of /dev into it, then un-mounting it and mounting it over /dev.

I would argue that mounting devfs over /dev is a simpler solution, particularly given the other benefits that devfs provides.

2.3.2 Non-Unix root file-system

Non-Unix file-systems (such as NTFS) can’t be used for a root file-system because they don’t support device nodes. Having a separate disc-based or RAMDISC-based file-system mounted on /dev will not resolve this problem because device nodes are needed before these file-systems can be mounted.

Devfs can be mounted without any device nodes (because it is a virtual file-system), and thus avoids this problem.

An alternative solution is to use initrd to mount a RAMDISC initial root file-system (which is populated with a minimal set of device nodes), and then construct a new /dev in another RAMDISC, and finally switch to the non-Unix root file-system. This requires clever boot scripts and a fragile and conceptually complex boot procedure.

The approach of mounting devfs is more robust and conceptually simpler.

2.3.3 Intelligent device management

By providing a mechanism for device drivers to register device nodes, it is possible to send notifications to user-space when these registrations and un-registrations occur. This allows more sophisticated device management schemes and policies to be implemented.

Furthermore, a virtual file-system mounted onto /dev opens the possibility of capturing file-system events and notifying user-space. For example, opening a device node, or attempting to access a non-existent device node, can be used to trigger a specific action in user-space. This further enhances the level of sophistication possible in device management.

In section 3, the Linux devfs daemon is presented, which supports advanced device management.

2.3.4 Speculative Device Scanning

Consider an application (like cdparanoia) that needs to find all CD-ROM devices on the system (SCSI, IDE and other types), whether or not their respective modules are loaded. The application must speculatively open certain device nodes (such as /dev/sr0 for the SCSI CD-ROMs) in order to make sure the module is loaded. If the module is not loaded, an attempt to open /dev/sr0 will cause the driver to be automatically loaded.

This requires that all Linux distributions follow the standard device naming scheme. Some distributions chose to violate the standard and use other names (such as /dev/scd0). Devfs solves the naming problem, as the kernel presents a known set of names that applications may rely on.

The same application also needs to determine which devices are actually available on the system. With the existing system it needs to read the /dev directory and speculatively open each /dev/sr* device to determine if the device exists or not. With a large /dev this is an inefficient operation, especially
if there are many /dev/sr* nodes. In addition, each open operation may trigger the device to commence spinning the media, forcing the scanning operation to wait until the media is spinning at the rated speed (this can take several seconds per device).

With devfs, the application can open the /dev/cdroms directory (which triggers module auto-loading if required), and proceed to read /dev/cdroms. Since only available devices will have entries, there are no inefficiencies in directory scanning, and devices do not need to be speculatively opened to determine their existence. Furthermore, all types of CD-ROMs are presented in this directory, so the application does not have to be modified as new types of CD-ROMs are developed.

3 Advanced Device Management

Devfs implements a simple yet powerful protocol for communication with a device management daemon (devfsd(8)) which runs in user-space. It is possible to send a message (either synchronously or asynchronously) to devfsd(8) on any event, such as registration/un-registration of device entries, opening and closing devices, looking up inodes, scanning directories and more. This opens many possibilities for more advanced device management.

The daemon may be configured to take a variety of actions for any event type. These actions include setting permissions, running external programmes, loading modules, calling functions in shared objects, copying permissions to/from a database, and creating “compatibility” device entries. This yields enormous flexibility in the way devices are managed. Some of the more common ways these features are used include:

• device entry registration events can be used to change permissions of newly-created device nodes. This is one mechanism to control device permissions
• device entry registration events can be used to provide automatic mounting of file-systems when new block device media is inserted into the drive
• device entry registration/un-registration events can be used to run programmes or scripts which perform further configuration operations on the devices. This is required for hot-plug devices which need to make complex policy decisions which cannot be made in kernel-space
• device entry registration/un-registration events can be used to create “compatibility” entries, so that applications which use the old-style device names will work without modification. This eases the transition from a non-devfs system to a devfs-only system
• asynchronous device open and close events can be used to implement clever permissions management. For example, the default permissions on /dev/dsp do not allow everybody to read from the device. This is sensible, as you don’t want some remote user recording what you say at your console. However, the console user is also prevented from recording. This behaviour is not desirable. With asynchronous device open and close events, devfsd(8) can run a programme or script when console devices are opened to change the ownerships for other device nodes (such as /dev/dsp). On closure, a different script can be run to restore permissions
• synchronous device open events can be used to perform intelligent device access protections. Before the device driver open() method is called, the daemon must first validate the open attempt, by running an external programme or script. This is far more flexible than access control lists, as access can be determined on the basis of other system conditions instead of just the UID and GID
• inode lookup events can be used to authenticate module auto-load requests. Instead of using kmod directly, the event is sent to devfsd(8), which can implement arbitrary authentication before loading the module itself. For example, if the initiating process is owned by the console user, the module is loaded, otherwise it is not
• inode lookup events can also be used to construct arbitrary name-spaces, without having to resort to populating devfs with symlinks to devices that don’t exist.

In addition to these applications, devfsd(8) may be used to manage devices in many other novel ways. This powerful daemon relies on two important features that devfs provides:
• a unified mechanism for drivers to publish device nodes
• a virtual file-system that can capture common VFS events.

See: http://www.atnf.csiro.au/~rgooch/linux/ for more details.

4 Other Alternatives

Some of the limitations that devfs addresses have alternate proposed solutions. These do not solve all of the problems, but are described here for completeness, as are their respective limitations.

4.1 Why not just pass device create/remove events to a daemon?

Here the suggestion is to develop an API in the kernel so that devices can register create and remove events, and a daemon listens for those events. The daemon would then populate/de-populate /dev (which resides on disc).

This has several limitations:

• it only works for modules loaded and unloaded (or devices inserted and removed) after the kernel has finished booting. Without a database of events, there is no way the daemon could fully populate /dev
• if a database is added to this scheme, the question is then how to present that database to user-space. If it is a list of strings with embedded event codes which are passed through a pipe to the daemon, then this is only of use to the daemon. I argue that the natural way to present this data is via a file-system (since many of the events will be of a hierarchical nature), such as devfs. Presenting the data as a file-system makes it easy for the user to see what is available and also makes it easy to write scripts to scan the “database”
• the tight binding between device nodes and drivers is no longer possible (requiring the otherwise perfectly avoidable table lookups discussed in section 1.2.5)
• inode lookup events on /dev cannot be caught which in turn means that module auto-loading requires device nodes to be created. This is a problem, particularly for drivers where only a few inodes are created from a potentially large set
• this technique can’t be used when the root FS is mounted read-only.

4.2 Just implement a better scsidev

This suggestion involves taking the scsidev(8) programme and extending it to scan for all devices, not just SCSI devices. The scsidev(8) programme works by scanning /proc/scsi.

This proposal has the following problems:

• this programme would need to be run every time a new module was loaded, which would slow down module loading and unloading
• the kernel does not currently provide a list of all devices available. Not all drivers register entries in /proc or generate kernel messages
• there is no uniform mechanism to register devices other than the devfs API
• implementing such an API is then the same as the proposal above in section 4.1

4.3 Put /dev on a ramdisc

This suggestion involves creating a ramdisc and populating it with device nodes and then mounting it over /dev.

Problems:

• this doesn’t help when mounting the root file-system, since a device node is still required to do that
• if this technique is to be used for the root device node as well, initrd must be used. This complicates the booting sequence and makes it significantly harder to administer and configure. The initrd is essentially opaque, robbing the system administrator of easy configuration
• insufficient information is available to correctly populate the ramdisc. So we come back to the proposal in section 4.1 to “solve” this
• a ramdisc-based solution would take more kernel memory, since the backing store would be (at best) normal VFS inodes and dentries, which take 284 bytes and 112 bytes, respectively, for each entry. Compare that to 72 bytes for devfs

4.4 Do nothing: there’s no problem

Some claims have been made that the existing scheme is fine. These claims ignore the following:

• device number size (8 bits each for major and minor) is a real limitation, and must be fixed somehow. Systems with large numbers of SCSI devices, for example, will continue to consume the remaining unallocated major numbers. Hot-plug busses such as USB will also need to push beyond the 8 bit minor limitation
• simply increasing the device number size is insufficient. Besides breaking many applications (no libc 5 application can handle larger device numbers), it doesn’t solve the management issues of a /dev with thousands or more device nodes
• ignoring the problem of a huge /dev will not make it go away, and dismisses the legitimacy of a large number of people who want a dynamic /dev
• it does not address the problems of managing hot-plug devices
• the standard response then becomes: “write a device management daemon”, which brings us back to the proposal of section 4.1.

5 Future Work

Devfs has been available and widely used since 1998. It has attracted a user-base numbering in the several thousands (possibly far greater), and forms a critical technology in SGI Pro-Pack (a modified version of Red Hat Linux) which is distributed by SGI for their Linux server products.

A number of improvements to devfs and to the generic kernel are needed to complete this work so that Linux will be suitable for the large enterprise and “data-centre” portions of the industry. These are discussed below.

5.1 Mounting via WWN

The current devfs name-space is a significant improvement on the old name-space, since the removal or addition of a SCSI disc does not affect the names of other SCSI discs. Thus, the system is more robust.

In larger systems, however, discs are often moved between different controllers (the interface between the computer and groups of discs). This is often done when a system is being reconfigured for the addition of more storage capacity. If discs are mounted using their locations, the administrator must manually update the configuration file which specifies the locations (usually /etc/fstab). Thus, some means of addressing the disc, irrespective of where it is located, is required.

Ideally, each device would have a unique identifier to facilitate tracking it. This unique identifier is defined by the SCSI 3 standard, and is termed a WWN (World Wide Name). If a disc is mounted by specifying its WWN, then it may be moved to a different controller without requiring further work by the administrator. This is important for a system with a large number of discs.

The SCSI sub-system in the Linux kernel needs to be modified to query devices for their WWNs, which can then be used to register a device entry which includes the WWN. All WWN entries would be placed in a single directory (such as /dev/volumes/wwn or /dev/scsi/wwn).

5.2 Mounting via volume label

For administrative reasons, some devices may be divided into a number of “logical volumes”. This is often used for very large storage devices where different departments of an organisation are each given
a set of logical volumes for their private use. In this case, the storage may be presented as a single physical device, and thus would have a single WWN.

As with physical discs, logical volumes may need to be re-arranged for administrative reasons. Here, some mechanism which can address volumes by their contents is required. By storing a volume label on each volume, it is possible to address volumes by content.

Existing and planned logical volume managers need to be modified to support storing volume labels and must provide a common programming interface so that this information may be used in a generic way. Once these steps have been taken, volume labels may be exposed in the device name-space in a similar fashion to WWNs, placing entries in a directory such as /dev/volumes/labels.

5.3 Mounting via physical path

Prior to mounting via WWN or volume label, the initial location of a device is required. Once the device is located, the WWN may be obtained, or a volume label may be written. In order to initially locate the device, the physical path to the device must be used. To support this, device names which represent the physical location of devices are required. Two new naming schemes are proposed for this purpose.

A new /dev/bus hierarchy will be created, which reflects the logical enumeration of system busses, and sub-components thereof. For example, a specific PCI device (function 2 in slot 1 on PCI bus 0) would be represented by a directory named /dev/bus/pci0/slot1/function2. If this device was a SCSI controller, this directory would be the root of the SCSI host tree for this device. This naming scheme is a natural reflection of the Linux view of system busses.

Many larger systems have complicated topologies. For example, in a cc-NUMA system, I/O devices may be distributed across many nodes in the network. The detection order of busses may change if a new node is added, causing the existing identifiers in the /dev/bus hierarchy to change. A naming scheme is required which reflects the complete hardware topology. This is clearly vendor-specific, as different systems will have radically different topologies. Thus, designing a detailed structure to support different topologies is not feasible.

The solution I propose is to define a /dev/hw hierarchy, which is to be completely vendor-specific. This hierarchy will be created and managed by vendor-specific code, giving vendors complete flexibility in their design. The /dev/hw hierarchy will effectively be a wiring diagram of the system. The only imposed standard is that the vendor remaps the generic Linux bus directories into the /dev/hw tree. For example, /dev/bus/pci0 would become a symbolic link to a directory somewhere in the /dev/hw tree.

The combination of these two naming schemes should provide sufficient flexibility for a wide variety of applications. The /dev/bus hierarchy will suffice for uncomplicated systems which do not change their topology (such as embedded and desktop machines, which dominate the market). In addition, /dev/bus provides a convenient place in which to search for all system busses, which is of use for the system administrator as well as some system management programmes. Also, because /dev/bus is managed by the generic Linux bus management sub-system, it is always available, even on systems with complex topologies. A vendor need not implement a /dev/hw hierarchy if it considers the benefits to be marginal, or if time does not permit prior to product shipment. Implementing a /dev/hw tree will add value, but is not required for basic operation of a system.

5.4 Block Device Layer

The Linux block device layer is used to access all types of random-access storage media (hard discs, CD-ROMs, DVD-ROMs and so on). This layer has two limitations. The first is that each block device is limited to 2 TB on 32 bit machines. This limits the maximum size storage device that can be attached to most Linux machines.

The second limitation is that the block device layer uses device numbers in all its internal operations. These device numbers are used to identify devices and to look up partition sizes and other configuration information. This limits the number of devices that can be attached to a Linux machine. While devfs allows device drivers to bypass device number limitations, the drivers must be changed to make use of this. The block device layer is a critical layer that must be changed.
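The 2 TB figure given for the first limitation can be reproduced with a short calculation. The sketch below is one plausible reading only: it assumes 512-byte sectors addressed by a 32-bit index, which the text does not state explicitly. The function name max_device_bytes is hypothetical and is not a kernel interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest device size, in bytes, that can be addressed with an index of
 * index_bits bits over sectors of sector_bytes bytes each.
 * Assumption (not from the paper): the limit comes from a 32-bit sector
 * index combined with 512-byte sectors.
 */
uint64_t max_device_bytes(unsigned index_bits, uint64_t sector_bytes)
{
    /* Keep the shift well-defined and the product within 64 bits
       for ordinary sector sizes. */
    assert(index_bits <= 32);
    return ((uint64_t)1 << index_bits) * sector_bytes;
}
```

Under these assumptions, max_device_bytes(32, 512) evaluates to 2^41 bytes, which is 2 TB in binary units.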

The block device layer (as well as the SCSI layer) needs to be modified to use device descriptor objects rather than device numbers. This is a significant, but essential, project. There has been some work on this already (a new struct block_device class has been defined), but more work is required.
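The difference between the two approaches can be illustrated with a small user-space sketch. Everything here is hypothetical: the demo_ names, the table layout and the descriptor fields are chosen for illustration and do not reproduce the kernel's actual tables or the fields of struct block_device. The only detail taken from the text is the 8-bit major and 8-bit minor split.

```c
#include <assert.h>
#include <stdint.h>

/* Scheme 1: a 16-bit device number (8 bits major, 8 bits minor) is the
   key into fixed-size tables, so capacity is capped by the number width. */
#define DEMO_MINOR_BITS 8
#define DEMO_MAX_MAJOR  256
#define DEMO_MAX_MINOR  256

typedef uint16_t demo_dev_t;

/* Hypothetical size table, indexed by [major][minor]. */
static uint64_t demo_size_table[DEMO_MAX_MAJOR][DEMO_MAX_MINOR];

demo_dev_t demo_mkdev(unsigned major, unsigned minor)
{
    /* Anything outside 0..255 simply cannot be represented. */
    assert(major < DEMO_MAX_MAJOR && minor < DEMO_MAX_MINOR);
    return (demo_dev_t)((major << DEMO_MINOR_BITS) | minor);
}

void demo_set_size(demo_dev_t dev, uint64_t sectors)
{
    demo_size_table[dev >> DEMO_MINOR_BITS][dev & 0xff] = sectors;
}

uint64_t demo_size_by_number(demo_dev_t dev)
{
    /* Every query pays for a table lookup keyed by the number. */
    return demo_size_table[dev >> DEMO_MINOR_BITS][dev & 0xff];
}

/* Scheme 2: a descriptor object carries the configuration itself.
   Descriptors can be allocated freely, so no number-imposed cap applies. */
struct demo_blkdev {
    const char *name;          /* illustrative field */
    uint64_t    size_sectors;  /* illustrative field */
};

uint64_t demo_size_by_descriptor(const struct demo_blkdev *bdev)
{
    return bdev->size_sectors;
}
```

In the first scheme the number of addressable devices is bounded by the table dimensions; in the second, code that holds a pointer to the descriptor needs neither the number nor the table.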

5.5 Use of dcache for devfs internal database

The current implementation of devfs uses an internal database (a simple directory tree structure) to
store device node entries. Much of the code used to
manage this directory tree could be removed if the
dcache (directory entry cache) in the VFS was used
instead. This would significantly reduce the code
size and complexity of devfs. The cost would be an
increase in the memory consumption of devfs (from
72 to 112 bytes per device node entry).
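The memory cost quoted above can be made concrete with a short calculation. This is a minimal sketch that uses only the two per-entry figures from the text (72 and 112 bytes); the function name and constants are hypothetical, and the size of the corresponding code savings is not given in the text, so it is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Per-entry sizes quoted in the text. */
enum {
    DEVFS_ENTRY_BYTES  = 72,   /* current devfs internal database */
    DCACHE_ENTRY_BYTES = 112   /* VFS dcache entry */
};

/* Additional memory, in bytes, needed to hold n_entries device nodes
   in the dcache instead of the devfs internal database. */
size_t extra_bytes_with_dcache(size_t n_entries)
{
    const size_t per_entry = DCACHE_ENTRY_BYTES - DEVFS_ENTRY_BYTES;

    /* Guard against overflow of the multiplication. */
    assert(n_entries <= (size_t)-1 / per_entry);
    return n_entries * per_entry;
}
```

The difference is 40 bytes per entry, so a system with 100 device nodes would spend an extra 4000 bytes, while one with 10000 nodes would spend an extra 400000 bytes.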

When devfs was first implemented, this option was not available, but since then the VFS has matured,
and this option should now be practical. A further
modest change to the VFS is required (separation
of dcache entries from VFS inodes).

This change is not required for large systems, as the existing implementation of the core file-system does
not impose limitations. This change will provide a
benefit for very small (embedded) systems, which
have small numbers of devices, and thus the code
savings outweigh the increased memory consump-
tion in device node storage.

This change would also provide a political benefit, because of the code reduction, and would increase acceptance in some quarters. Despite its acceptance into the official kernel, devfs remains controversial, due to its departure from traditional Unix device nodes.
