The Structure of NTFS
Table of Contents

The Need for NTFS
NTFS's Disk Structure
Recoverability
Fault Tolerance/Data Redundancy
File Security
The Need for NTFS

Neither the FAT file system nor HPFS was suitable for mission-critical applications that required recoverability, fault tolerance and data redundancy, and file security. Instead, the design group for NT decided on constructing a new file system, NTFS (NT's file system).
Since Windows NT was targeting businesses and corporations, the reliability of the data stored on the system became more of a priority than speed, which is the main concern for home computer users. In a corporate environment, if a system fails and data is lost, speed becomes irrelevant. To support recoverability, the new file system, NTFS, provided file system recovery based upon a transaction-processing model as well as an improved write-caching feature.
To support fault tolerance and data redundancy, various kinds of disk volumes and volume sets were implemented, ranging from RAID (redundant array of independent disks) Levels 0, 1, and 5 to sector sparing. These features focus on the reliability of the data, on improving disk I/O, or on keeping multiple copies of the same data so that the data is always preserved (data redundancy).
For file security, NTFS used the security model that protects everything else that comes into contact with Windows NT, such as devices, named and anonymous pipes, processes, threads, events, mutexes, semaphores, waitable timers, access tokens, network shares, services, and so on. This security can be implemented for a file by establishing a security descriptor as a file record attribute in the MFT (master file table).
NTFS's Disk Structure

[Figure: disk sectors grouped into a cluster of four sectors]
The cluster size, or cluster factor, varies with the volume size. By varying the cluster factor, NTFS is able to support very large hard disks efficiently. For example, a small cluster factor reduces the amount of unused space within a cluster but increases the amount of file fragmentation (i.e., the clusters containing a file lie on noncontiguous parts of the disk). A large cluster factor reduces the amount of fragmentation but increases the amount of unused space within each cluster. To address parts of the disk, NTFS uses logical cluster numbers (LCNs) as disk addresses. LCNs are simply the numbering of all clusters of a volume from beginning to end. With this scheme, a physical disk address can be calculated by multiplying the LCN by the cluster factor to get the physical byte offset on the volume (Silberschatz and Galvin, 766). In regard to files, NTFS uses VCNs (virtual cluster numbers) to refer to the data within a file. VCNs number the clusters belonging to a particular file from 0 through n - 1. VCNs are not necessarily physically contiguous (especially when the file is fragmented), but they can be mapped to any number of LCNs on the NT volume (Solomon, 407).
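To make the address arithmetic concrete, here is a small C sketch. The run list, its field names, and the sample numbers are illustrative, not NTFS's on-disk format; the cluster size assumes the four-sector (2048-byte) cluster pictured earlier.

    #include <stdint.h>
    #include <stdio.h>

    /* A VCN-to-LCN run: "count" contiguous clusters of a file, starting
     * at virtual cluster "vcn" and stored beginning at logical cluster
     * "lcn". Illustrative only. */
    struct run { uint64_t vcn, lcn, count; };

    /* Physical byte offset of a cluster: the LCN times the cluster size. */
    static uint64_t lcn_to_offset(uint64_t lcn, uint64_t cluster_size)
    {
        return lcn * cluster_size;
    }

    /* Translate a file-relative VCN to an LCN by scanning the run list;
     * returns 0 on success, -1 if the VCN is not mapped. */
    static int vcn_to_lcn(const struct run *runs, int nruns,
                          uint64_t vcn, uint64_t *lcn)
    {
        for (int i = 0; i < nruns; i++) {
            if (vcn >= runs[i].vcn && vcn < runs[i].vcn + runs[i].count) {
                *lcn = runs[i].lcn + (vcn - runs[i].vcn);
                return 0;
            }
        }
        return -1;
    }

    int main(void)
    {
        /* A fragmented file: VCNs 0-2 at LCNs 1025-1027, VCNs 3-4 at
         * LCNs 1031-1032 (hypothetical numbers). */
        struct run runs[] = { { 0, 1025, 3 }, { 3, 1031, 2 } };
        uint64_t lcn;

        if (vcn_to_lcn(runs, 2, 4, &lcn) == 0)
            printf("VCN 4 -> LCN %llu, byte offset %llu\n",
                   (unsigned long long)lcn,
                   (unsigned long long)lcn_to_offset(lcn, 4 * 512));
        return 0;
    }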
The MFT (Master File Table) is the vital core of the NTFS structure. Every piece of data stored on a volume is described by a record in the MFT. By storing all of this information within a file, NTFS can easily locate and maintain the data, and a security descriptor, used by NT's security model, can protect each separate file. The MFT is essentially an array of records, with each record holding data about a particular file in the volume. In addition, the MFT includes a file record for itself so that it can be rebuilt in case it becomes damaged. The MFT also includes file
records for the NTFS metadata files that help implement the file system structure (i.e., data structures used to locate and retrieve files, the bootstrap data, and the bitmap that records the allocation state of the entire volume). Each of these NTFS metadata files has a name beginning with a dollar sign ($) to differentiate it from other system files and user files:
$MFTMirr: contains a copy of the first few rows of the MFT, used to locate the metadata files in case the MFT file is corrupt for some reason.

$LogFile: used to record all disk operations that affect the NTFS volume structure, such as file creation, file copying, file deletion, and so on. After a system failure, NTFS uses the log file to recover the NTFS volume.

$Bitmap: records the allocation state of the NTFS volume. Each bit in the bitmap represents a cluster on the volume, identifying whether the cluster is free or has been allocated to a file (a sketch of this bookkeeping follows the list).

$BadClus: records any bad clusters on the volume so that Windows NT will not write to those disk regions in the future.
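As a rough illustration of how a cluster bitmap like $Bitmap can be consulted, the following C sketch tests and updates the bit for a given cluster. It mirrors the one-bit-per-cluster idea described above, not $Bitmap's actual on-disk layout.

    #include <stdint.h>

    /* One bit per cluster: bit i is 1 if cluster i is allocated, 0 if
     * free. Illustrative only. */
    static int cluster_is_free(const uint8_t *bitmap, uint64_t lcn)
    {
        return (bitmap[lcn / 8] & (1u << (lcn % 8))) == 0;
    }

    static void allocate_cluster(uint8_t *bitmap, uint64_t lcn)
    {
        bitmap[lcn / 8] |= (uint8_t)(1u << (lcn % 8));
    }

    static void free_cluster(uint8_t *bitmap, uint64_t lcn)
    {
        bitmap[lcn / 8] &= (uint8_t)~(1u << (lcn % 8));
    }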
For NTFS to reference these file records, each file record has a unique 64-bit ID value called the file reference. This file reference consists of a file number and a sequence number. The file number corresponds to the position of the file's file record in the MFT minus 1, because the first file record is referenced as File Number 0 (the MFT itself). The file sequence number enables NTFS to perform internal consistency checks, since it is incremented every time an MFT file record position is reused.
[Figure: the 64-bit file reference; the file number occupies bits 0 through 47 and the sequence number occupies bits 48 through 63]
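A minimal C sketch of the split pictured above (the helper names are mine, not NTFS's): the low 48 bits carry the file number and the high 16 bits carry the sequence number.

    #include <stdint.h>

    #define FILE_NUMBER_MASK 0xFFFFFFFFFFFFULL /* low 48 bits */

    /* Pack a file number and a sequence number into a file reference. */
    static uint64_t make_file_reference(uint64_t file_number,
                                        uint16_t sequence)
    {
        return (file_number & FILE_NUMBER_MASK) | ((uint64_t)sequence << 48);
    }

    static uint64_t file_number_of(uint64_t ref)
    {
        return ref & FILE_NUMBER_MASK;
    }

    static uint16_t sequence_of(uint64_t ref)
    {
        return (uint16_t)(ref >> 48);
    }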
[Figure: MFT file records for a large file, holding standard information, filename, security descriptor, extended attributes, and data attributes; VCN-to-LCN mappings locate the data runs at LCNs 1025-1027, 1028-1030, and 1031-1032]
For the most part, a file only needs to be referenced by a single file record (with disk runs if the data attribute is large). Sometimes, though, a file has too many attributes to fit in the MFT file record, so a second MFT file record is used to contain the additional attributes. In this case, an attribute called the attribute list is added to the file's first file record. The attribute list stores the name and type code of each of the file's attributes and the file reference of the MFT record where the attribute is located. The attribute list is also necessary when a file grows so large or so fragmented that a single file record cannot contain the collection of VCN-to-LCN mappings required to locate all of its disk runs.
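The text above implies an entry layout along these lines. The C struct is purely illustrative; the field names and sizes are mine, not the on-disk NTFS format.

    #include <stdint.h>
    #include <wchar.h>

    /* One entry in a file's attribute list: enough information to find
     * an attribute stored in another MFT record. Illustrative only. */
    struct attribute_list_entry {
        uint32_t type_code;      /* which kind of attribute (data, filename, ...) */
        uint64_t file_reference; /* MFT record where the attribute lives */
        wchar_t  name[32];       /* the attribute's name, if it has one */
    };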
Along with the file records stored in the MFT, directory records are also stored in the MFT. A file directory is just an index of filenames with their file references, organized in a specific fashion for quick and easy access. NTFS creates this index structure for every directory; the figure below shows the file record for a root directory.
[Figure: the MFT file record for a root directory (\), with standard information, filename, index root, index allocation, and bitmap attributes; the index root holds first-level entries such as file4, file10, and file15, and VCN-to-LCN mappings locate index buffers holding entries such as file0, file1, file3, file6, file8, file9, and file11 through file14]
Each directory record in the MFT contains an index root attribute that in turn stores a sorted list of the files in the directory. For large directories, however, the filenames are actually stored in 4KB fixed-size index buffers that contain and organize the filenames. These index buffers use a b+ tree data structure that minimizes the number of disk accesses required to find a particular file. The index root attribute stores the first level (the root) of the b+ tree and points to the index buffers containing the next level. The index allocation attribute maps the VCNs of the index buffer disk runs to the LCNs that show where the index buffers are located on the disk. The bitmap attribute keeps track of which VCNs in the index buffers are in use and which are free. As the previous figure shows, a directory's index root attribute contains several filenames that act as indexes into the second level of the b+ tree. Each filename in the index root attribute has an optional pointer associated with it that points to an index buffer, and the index buffer it points to contains filenames with lexicographic values less than its own. For example, in the previous figure, file4 is a first-level entry in the b+ tree, and it points to an index buffer containing the filenames that are lexicographically less than itself: file0, file1, and file3. Storing filenames in b+ trees provides three major benefits: directory lookups are faster because the filenames are stored in sorted order; when higher-level software enumerates the files in a directory, NTFS returns names that have already been sorted; and because b+ trees tend to grow wide rather than deep, NTFS's fast lookup times do not degrade as directories get large (Solomon, 405-11).
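To see why sorted order speeds lookups, here is a small C sketch of a search within one sorted index buffer. It is a simplification: a real lookup descends the b+ tree levels described above, and NTFS's actual filename comparison rules are not shown.

    #include <stdio.h>
    #include <string.h>

    /* Binary search over a sorted index of filenames, as one index
     * buffer might be searched; returns the slot or -1 if absent. */
    static int find_name(const char *names[], int count, const char *target)
    {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = strcmp(target, names[mid]);
            if (cmp == 0) return mid;
            if (cmp < 0) hi = mid - 1; else lo = mid + 1;
        }
        return -1;
    }

    int main(void)
    {
        const char *buffer[] = { "file0", "file1", "file3" }; /* sorted */
        printf("file1 at slot %d\n", find_name(buffer, 3, "file1"));
        return 0;
    }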
Knowing about NTFS's disk structure allows for a better appreciation of how its mission-critical features operate.
Recoverability
When a disastrous system failure or power failure occurs, NTFS's recovery support makes sure that no file system operation is left half-complete and that the volume structure keeps its integrity without the need to run a disk repair utility. To implement this ability, NTFS uses transactions to keep track of modified system data. A transaction is a collection of file system operations that is executed atomically: either all of the operations execute successfully, or the modified data is rolled back to its previous state. Transactions work by writing redo and undo information to a log file before any data is modified. After the data is modified, a commit record is written to the log file to indicate that the transaction succeeded. After a system failure, NTFS recovers by redoing the operations of committed transactions and then undoing the operations for transactions that did not commit successfully
before the crash. The log file that these transactions use may seem to grow without bound, considering the large number of file system operations that occur during a typical user session (e.g., opening applications, changing resolutions, installing software, and so on), but this is not the case. Periodically (typically every 5 seconds), a checkpoint record is written to the log file. The checkpoint record marks the point before which log records do not have to be redone or undone for Windows NT to recover from a crash; records older than the checkpoint can then be discarded to limit the size of the log file. This seems like enormous overhead to maintain volume integrity, but the transaction-logging system applies only to file system data (NTFS's metadata files), in which case transaction execution and
recovery are fairly quick (Silberschatz and Galvin, 768). With this said, user file
data such as MS Word documents, assembly code files, and JPEG-format
picture files are not guaranteed to be in a stable condition after a major system
crash. This decision not to implement user data recovery in the file system
represents a trade-off between a fully fault tolerant file system and one that
provides optimum performance for all file operations.
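The transaction mechanism just described can be sketched in miniature. The following C program is schematic; the record types, fields, and recovery loop are illustrative, not NTFS's actual log format. Updates are logged before data is modified, a commit record marks success, and recovery redoes committed work and undoes the rest.

    #include <stdio.h>

    /* Schematic write-ahead log records. CHECKPOINT is shown for
     * completeness: records older than one can be discarded. */
    enum record_type { UPDATE, COMMIT, CHECKPOINT };

    struct log_record {
        enum record_type type;
        int txn;        /* transaction this record belongs to */
        const char *op; /* description of the metadata change */
    };

    /* Recovery, sketched: redo the operations of committed transactions,
     * undo operations of transactions with no COMMIT record. */
    static void recover(const struct log_record *log, int n)
    {
        for (int i = 0; i < n; i++) {
            if (log[i].type != UPDATE) continue;
            int committed = 0;
            for (int j = i + 1; j < n; j++)
                if (log[j].type == COMMIT && log[j].txn == log[i].txn)
                    committed = 1;
            printf("txn %d: %s '%s'\n", log[i].txn,
                   committed ? "redo" : "undo", log[i].op);
        }
    }

    int main(void)
    {
        struct log_record log[] = {
            { UPDATE, 1, "allocate clusters" }, { COMMIT, 1, "" },
            { UPDATE, 2, "extend MFT record" }, /* crash before COMMIT */
        };
        recover(log, 3);
        return 0;
    }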
An exploration into the evolution of file system design gives insight into how NTFS further implements recoverability for the Windows NT operating system. Two general techniques have dominated the design of a file system's input/output and caching support: careful write and lazy write. Each technique has its own trade-offs between safety and performance. If one or more file system operations were in progress at the time of a crash or loss of power, such an abrupt halt can produce inconsistencies in the file system. An inconsistency is some sort of corruption within the file system. For example, a filename may appear in a directory listing but be nonexistent as far as the file system is concerned.
A careful write file system does not try to prevent file system inconsistencies. Rather, it orders its write operations so that the worst a system crash can produce is a predictable, noncritical inconsistency that the file system can fix later without consequence. When the careful write file system receives a request to update the disk, it must perform several suboperations before the update is complete. These suboperations are always written serially (one at a time) to disk. For example, when allocating disk space for a file, the careful write file system first sets some bits in its bitmap and then allocates the space to the file. If the power to the computer is cut off just after the bits are set, the careful write file system loses access to some disk space (the space represented by the bitmap bits), but the existing data is not corrupted. Through this serialization of write operations, I/O requests are filled in the order in which they are received. If one process allocates disk space and then another process creates a file, the careful write file system completes the disk allocation before it starts to create the file, which is preferable to creating the file first and processing the allocation afterward. The major advantage of a careful write file system is that in the event of a failure, the disk volume stays consistent and usable without the need to immediately run a slow volume repair utility.
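The ordering just described can be sketched in C with POSIX calls. This is a simplification of careful write, assuming one file descriptor per on-disk structure; fsync forces each suboperation to disk before the next begins.

    #include <stddef.h>
    #include <unistd.h>

    /* Careful write, sketched: perform the suboperations of an update
     * one at a time, forcing each to disk before starting the next, so
     * a crash leaves only a predictable, noncritical inconsistency
     * (e.g., bitmap bits set for space that no file owns yet). */
    static int allocate_then_assign(int bitmap_fd, int record_fd,
                                    const void *bits, size_t bits_len,
                                    const void *record, size_t rec_len)
    {
        if (write(bitmap_fd, bits, bits_len) < 0 || fsync(bitmap_fd) < 0)
            return -1; /* crash here: space leaked, but no corruption */
        if (write(record_fd, record, rec_len) < 0 || fsync(record_fd) < 0)
            return -1;
        return 0;
    }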
The careful write file system compromises speed for the safety it provides because it processes only one I/O operation at a time. In a lazy write file system, by contrast, speed is increased because the file system can return control to the caller without waiting for disk writes to be completed. A lazy write file system also tolerates the inconsistent intermediate states on a disk volume that can result when the suboperations of two or more I/O requests are interleaved. This makes it easier for the file system to be multithreaded, allowing more than one I/O operation to be in progress at a time.
Fault Tolerance/Data Redundancy

Fault tolerance is a technique of managing the disk volumes and physical hard drives to improve performance, capacity, or reliability. Data redundancy is a technique of keeping multiple copies of data so that if one copy becomes corrupt, the data is still preserved in the other copy. Together these two schemes protect all of the files within a disk volume, giving NTFS the capability to protect even user files, whose recovery was left out of the transaction-based recovery system because of the high overhead. FtDisk is the fault tolerant driver that allows different configurations of NT's volumes. There are several schemes for NTFS volumes that improve performance or reliability.
A stripe set, also known as RAID Level 0 (redundant array of independent disks), is a series of partitions, one partition per disk, that NT's Disk Administrator combines into a single logical volume. NTFS distributes files in a stripe set in a round-robin manner, placing data in each of the partitions 64KB at a time. The advantage of stripe sets is that the data is distributed fairly evenly among the disk partitions, which in turn increases the probability that multiple read and write operations will be bound for different disks. And because data on all of the member disks can be accessed simultaneously, latency time for disk input/output is reduced.
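The round-robin placement reduces to simple arithmetic. This C sketch (the 64KB stripe size comes from the text; the function names are mine) maps a volume-relative byte offset to a member disk and an offset within that disk's partition.

    #include <stdint.h>
    #include <stdio.h>

    #define STRIPE_SIZE (64 * 1024) /* data is striped 64KB at a time */

    /* Map a volume-relative offset to (disk, offset within partition)
     * for a stripe set of ndisks member partitions. */
    static void locate(uint64_t volume_off, int ndisks,
                       int *disk, uint64_t *part_off)
    {
        uint64_t stripe = volume_off / STRIPE_SIZE;
        *disk = (int)(stripe % ndisks);             /* round-robin */
        *part_off = (stripe / ndisks) * STRIPE_SIZE /* full stripes below */
                    + volume_off % STRIPE_SIZE;     /* plus offset inside */
    }

    int main(void)
    {
        int disk; uint64_t off;
        locate(3 * STRIPE_SIZE + 100, 3, &disk, &off);
        printf("disk %d, partition offset %llu\n",
               disk, (unsigned long long)off);
        return 0;
    }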
Volume sets make managing disk volumes more convenient, and stripe sets spread the disk I/O load over multiple disks, yet neither provides the ability to recover data if a disk fails. For data recovery, FtDisk implements three redundant storage schemes: mirror sets, stripe sets with parity, and sector sparing.
In the mirror set (RAID Level 1) scheme, data is duplicated in two partitions. If the first disk or any data stored on it becomes unreadable because of a hardware or software failure, FtDisk automatically accesses the data from the mirror partition. Mirror sets can also help I/O throughput on heavily loaded systems. For example, when I/O activity is high, FtDisk balances its read operations between the primary and mirror partitions; two read operations can proceed simultaneously and thus theoretically finish in half the time. When a file is modified, both partitions of the mirror set must be written, but the disk writes are completed asynchronously (they do not have to occur at the same time), so the performance of user programs is generally not affected by the extra disk update.
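As a toy illustration of the balancing idea (the alternation policy here is my assumption, not FtDisk's documented algorithm):

    /* Toy read balancing for a mirror set: alternate reads between the
     * primary (0) and mirror (1) partitions so that, under load, two
     * reads can proceed simultaneously. Writes must always go to both. */
    static int choose_read_partition(void)
    {
        static int turn;
        turn ^= 1;
        return turn;
    }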
The second fault tolerant scheme, stripe sets with parity, is similar to the stripe set discussed earlier, except that fault tolerance is achieved by reserving the equivalent of one disk for storing parity for each stripe. The parity stripe contains a byte-for-byte logical sum (XOR) of the stripes with the same stripe number in the other partitions. For example, on a three-disk stripe set with parity, stripe 1 on disk 1 would contain the parity information for stripe 1 of disks 2 and 3, stripe 2 of disk 2 would contain the parity information for stripe 2 of disks 1 and 3, and so on. Recovering from a failed disk in this setup relies on an arithmetic principle: in an equation with n variables, if you know the value of n - 1 of the variables, you can determine the value of the missing variable by subtraction. For example, in the equation x + y = z, where z represents the parity stripe, FtDisk computes z - y to determine the contents of x; to find y, it computes z - x. FtDisk uses similar logic to recover lost data. If a disk in a stripe set with parity fails or if data on one disk becomes unreadable, FtDisk reconstructs the missing data by using the XOR operation.
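The XOR arithmetic can be shown directly in C. The stripe size and names in this sketch are illustrative; it rebuilds a lost stripe from the surviving stripe and the parity stripe.

    #include <stdio.h>

    #define STRIPE_BYTES 8 /* tiny stripes, for illustration */

    /* Parity is the byte-for-byte XOR of the data stripes. Because
     * x ^ y ^ y == x, XORing the survivors with the parity stripe
     * reproduces the missing stripe. */
    static void xor_stripes(const unsigned char *a, const unsigned char *b,
                            unsigned char *out)
    {
        for (int i = 0; i < STRIPE_BYTES; i++)
            out[i] = a[i] ^ b[i];
    }

    int main(void)
    {
        unsigned char x[STRIPE_BYTES] = "stripe1";
        unsigned char y[STRIPE_BYTES] = "stripe2";
        unsigned char parity[STRIPE_BYTES], rebuilt[STRIPE_BYTES];

        xor_stripes(x, y, parity);       /* parity = x ^ y */
        xor_stripes(parity, y, rebuilt); /* recover x after "losing" it */
        printf("recovered: %s\n", (char *)rebuilt);
        return 0;
    }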
In sector sparing, FtDisk uses its redundant data storage to dynamically
replace lost data when a disk sector becomes unreadable. The sector sparing
technique exploits a feature of some hard disks that provide a set of physical sectors reserved as spares. If FtDisk receives a data error from the hard disk, it
obtains a spare sector from the disk driver to replace the bad sector that caused
the data error. FtDisk recovers the data that was on the bad sector (by either
reading the data from a disk mirror or recalculating the data from a stripe set with
parity) and copies it to the spare sector.
If a bad-sector error occurs and the hard disk does not provide spares, runs out of them, or is a non-SCSI-based disk, FtDisk can still recover the data. It recalculates the unreadable data by accessing a stripe set with parity, or it reads a copy of the data from a disk mirror. It then passes the data to the file system along with a warning status that only one copy of the data is left in a disk mirror, or that one stripe is inaccessible in a stripe set with parity, and that data redundancy is therefore no longer in effect for that sector. It is the file system's decision to respond to or ignore the warning. FtDisk will re-recover the data each time the file system tries to read from the bad sector (Solomon, 440-45). With the data fairly secure from system crashes and failures, the only thing NTFS needs to do is protect files from unauthorized users.
File Security
Files in NTFS are protected by the security model in Windows NT. An exhaustive discussion of NT's security model will not be attempted here, since the model could warrant a research paper of its own. Instead, only the security features pertinent to actual NTFS files will be explored. With that in mind, looking back at the file records in the MFT, a security descriptor can be found as an attribute of every file record. This security descriptor controls who has what access to the file. It contains the following:
Owner SID: the security ID of the owner of the file

Group SID: the security ID of the primary group for the file

DACL: the discretionary access control list, which specifies who is allowed and who is denied access to the file

SACL: the system access control list, which specifies which operations on the file should be audited
An access control list consists of an ACL header and zero or more access control entry (ACE) structures. An access control list with zero ACEs indicates that no user has access to the file. In a DACL, each ACE contains a security ID and an access mask. Two types of ACEs can appear in a DACL: access allowed and access denied. As the names suggest, the access-allowed ACE grants access to a user, and the access-denied ACE denies the access rights specified in the access mask. The accumulation of access rights granted by individual ACEs forms the set of access rights granted by an ACL. If no DACL is present in a security descriptor, everyone has full access to the file. On the other hand, if the DACL is empty (has 0 ACEs), no user has access to the file. An SACL
contains only one type of ACE, called a system audit ACE, which specifies which
operations performed on the object by specific users or groups should be
audited. The audit information is stored in the system audit log. Both successful
and unsuccessful attempts can be audited. If the SACL is NULL, no auditing
occurs on the particular file (Solomon, 310-11).
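A schematic C sketch of DACL evaluation as just described (the structures are simplified; real NT ACEs and SIDs are more elaborate, and this accumulation model follows the text above rather than NT's exact evaluation order): access-allowed ACEs accumulate rights, access-denied ACEs remove them, a missing DACL grants everything, and an empty DACL grants nothing.

    #include <stdint.h>

    enum ace_type { ACCESS_ALLOWED, ACCESS_DENIED };

    struct ace {
        enum ace_type type;
        uint32_t sid;         /* simplified: a SID as a single number */
        uint32_t access_mask; /* rights this ACE grants or denies */
    };

    struct dacl { int count; const struct ace *aces; };

    /* Accumulate the rights the DACL grants to sid. A missing DACL
     * (NULL pointer here) grants full access; an empty DACL (0 ACEs)
     * grants none, because no ACE ever adds a right. */
    static uint32_t rights_for(const struct dacl *dacl, uint32_t sid)
    {
        if (dacl == 0)
            return 0xFFFFFFFFu; /* no DACL: everyone has full access */
        uint32_t granted = 0;
        for (int i = 0; i < dacl->count; i++) {
            if (dacl->aces[i].sid != sid) continue;
            if (dacl->aces[i].type == ACCESS_DENIED)
                granted &= ~dacl->aces[i].access_mask;
            else
                granted |= dacl->aces[i].access_mask;
        }
        return granted;
    }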
Conclusion
Windows NT's file system, NTFS, is a robust file system that expands upon earlier file systems such as the FAT (file allocation table) file system and the HPFS (high-performance file system). It was designed to provide recoverability, fault tolerance and data redundancy, and file security, so that it could support mission-critical applications used by businesses and corporations that demand data integrity and high performance. The outcome of Microsoft's efforts was a file system that can preserve and protect any data that is placed in its disk volumes.
Works Cited
Calingaert, Peter. Operating System Elements: A User Perspective. Englewood Cliffs: Prentice-Hall, 1982.
Galvin, Peter Baer, and Abraham Silberschatz. Operating System Concepts.
Reading: Addison Wesley Longman, 1998.
Habermann, A. N. Introduction to Operating System Design. Chicago: Science Research Associates, 1976.
Kaisler, Stephen H. The Design of Operating Systems for Small Computer
Systems. New York: John Wiley & Sons, 1983.
Solomon, David A. Inside Windows NT: Second Edition. Redmond: Microsoft Press, 1998.