Technical - Distributed File System
Submitted by:
March 4, 2010
Title: Distributed File System
Abstract
DFS (the Distributed File System) is used to build a unified, hierarchical view of multiple
file servers and shares on the network. Instead of having to think of a specific
machine name for each set of files, the user will only have to remember one
name, which will be the 'key' to a list of shares found on multiple servers on the
network. Think of it as the home of all file shares with links that point to one or
more servers that actually host those shares. DFS has the capability of routing a
client to the closest available file server by using Active Directory site metrics. It
can also be installed on a cluster for even better performance and reliability.
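To make the idea concrete, here is a minimal, purely illustrative Python sketch (not the actual Windows DFS service) of a namespace that maps one logical name to link targets hosted on several servers. Every server, share, and path name in it is made up for the example; a real DFS root would also rank targets using Active Directory site costs, which the ordering of the target lists only hints at.

    # Toy illustration (not the real Windows DFS service): a single logical
    # namespace name resolves to one or more real file-server shares.
    # All server and share names below are made up for the example.

    DFS_NAMESPACE = {
        # logical path          -> candidate targets, ordered by preference
        #                          (e.g. closest Active Directory site first)
        r"\\corp\files\projects": [r"\\fileserver1\projects", r"\\fileserver2\projects"],
        r"\\corp\files\software": [r"\\fileserver3\software"],
    }

    def resolve(logical_path: str) -> str:
        """Return the preferred real share backing a logical DFS path."""
        targets = DFS_NAMESPACE.get(logical_path)
        if not targets:
            raise KeyError(f"no DFS link for {logical_path!r}")
        return targets[0]

    if __name__ == "__main__":
        print(resolve(r"\\corp\files\projects"))   # -> \\fileserver1\projects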
Medium to large sized organizations are most likely to benefit from the use of
DFS - for smaller companies it is simply not worth setting up since an ordinary
file share is usually sufficient. DFS is included with Windows Server
2003 and can be found in the Administrative Tools folder. To open it, go to Start >
Programs > Administrative Tools > Distributed File System or in the Control
Panel, open the Administrative Tools folder and click on the Distributed File
System icon. This will open the management console where all the configuration
takes place.
Introduction
The distribution of file sizes matters a great deal for file system
design. In particular, if all the files are small, the disk block size should be small,
too, to avoid wasting too large a fraction of the disk. On the other hand, if files are
generally large, choosing a large block size is good since it leads to more
efficient transfers. Only by knowing the file size distribution can reasonable
choices be made. In 1984, we published the file size distribution for a university
computing environment, and we have now repeated the measurements some 20
years later to see how file sizes have changed. In short, the median file size has
more than doubled (from 1080 bytes to 2475 bytes), but large files still dominate
the disk usage.
Twenty years ago we published a study of static file sizes at the Computer
Science Dept. of the Vrije Universiteit (VU) [1]. These files represented the
totality of files owned by students and faculty members on the Dept.’s UNIX
machines. For another research project, we recently had a need to revisit these
measurements to see how much file sizes have changed in the past 20 years.
With the recent popularity of digital photos, music, and videos, it is certainly
possible that this data is now dated. In this note we present new measurements
on the Computer Science Dept.'s UNIX file systems and compare them to the old ones.
The Problem
The reason for collecting this data was to determine how big disk blocks
should be. File systems are generally free to choose any size they want since
block size is transparent to user programs, but the efficiency of the file system
depends on the block size. Large disk blocks are more efficient than small ones
since the seek time and rotational latency are then amortized over more bytes
transferred. This fact argues for using a very large disk block. On the other hand,
since most files are small, a large disk block wastes a great deal of space on the
disk. From Fig. 3, we see, for example, that even a 1-KB disk block wastes more
than half the block for the 20% smallest files. With a 16-KB disk block and files of
under 512 bytes, over 95% of the block is wasted. This fact argues for a small
block size. To make a reasonable choice, we need to look at the actual file size
distribution in detail. For a given block size, each file occupies a certain number
of disk blocks. For example, with a 1-KB block size, a file of 2300 bytes occupies
3 blocks and 3072 bytes of storage. With a 2-KB block size it uses 2 blocks and
4096 bytes of storage. With an 8-KB block size it uses only 1 block, but 8192
bytes of storage. Thus the block size affects how much storage is needed for the
same set of files.
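The allocation rule behind these numbers is simply a round-up to whole blocks. The following short Python sketch (an illustration added here, not part of the original study) reproduces the 2300-byte example from the text; treating an empty file as occupying one block is a simplifying assumption.

    import math

    def allocated_bytes(file_size: int, block_size: int) -> int:
        """Storage consumed by a file when space is allocated in whole blocks."""
        # Simplifying assumption: even an empty file occupies one block.
        blocks = max(1, math.ceil(file_size / block_size))
        return blocks * block_size

    # The 2300-byte example from the text:
    for bs in (1024, 2048, 8192):
        print(bs, allocated_bytes(2300, bs))
    # 1024 -> 3072 bytes (3 blocks), 2048 -> 4096 bytes (2 blocks),
    # 8192 -> 8192 bytes (1 block)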
The third column of Fig. 3 gives the corresponding disk usage assuming 1-KB disk
blocks. What it shows is that the collective size of the 10% smallest files together takes up
0.08% of the disk (excluding unused blocks), the collective size of the smallest
20% of the files take up 0.17% of the disk, and the collective size of the smallest
half of the files take up only 0.64% of the disk. In fact the 90% smallest files (1.5
million files) take up only 5.52% of the disk in use. The largest 10% of the files
use the other 94.48% of the disk. In short, virtually the entire disk is filled by the
10% largest files; the small files hardly matter at all. The fourth column assumes
2-KB disk blocks. For the very smallest 10% of the files (all 164 bytes or less),
doubling the block size does not change the number of blocks they occupy, but
it does double the number of KB of storage used, since these 170,000 files now
each occupy 2 KB instead of 1 KB. In percent terms, the first decile now occupies
0.16% of the disk instead of 0.08% of the disk. But as we move to larger files, the
effect is much smaller. With a file above 1 MB, going from 1 KB to 2 KB might at
most add another 1 KB to the space it occupies, but as a percentage of what the file needs,
it is negligible. All in all, with 2-KB blocks, the smallest 90% of the files use only
5.97% of the occupied blocks. The next three columns show the same
calculations for 4-KB, 8-KB, and 16-KB disk blocks. Even with 16-KB blocks,
which are too big for more than 3/4 of the files, the bottom 90% of the files still
account for only 13.65% of the allocated blocks. In short, while using large disk
blocks is very inefficient for small files, despite the fact that most files are small,
the total amount of wasted space on the disk is still very small. The reason is
simple: the disk is mostly occupied by very large files, whose space efficiency is
largely independent of the block size. One other figure that is interesting is how
the total amount of disk storage rises with block size. Remember that the
smallest 30% of the files are 968 bytes or less, so going from 1 KB to 2 KB
increases the total amount of storage each of these files uses by a factor of two.
For large files, the percentage increase is very small, of course. We have
calculated how much additional storage is needed for the entire file system when
going from 1 KB to larger sizes. For 2 KB, 4 KB, 8 KB, and 16 KB, the percent
increase in total disk storage required is 1%, 2%, 4%, and 10% respectively.
Thus going from a 1-KB block to a 16-KB block means the VU file system as a
whole needs only about 10% more disk space.
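The 1%/2%/4%/10% figures come from applying that same round-up to every file in the file system and comparing the totals. A small Python sketch of the calculation follows; the list of sizes at the bottom is made up for illustration, so the percentages it prints will of course differ from the VU numbers.

    import math

    def total_allocated(sizes, block_size):
        """Total bytes allocated when each file is rounded up to whole blocks."""
        return sum(max(1, math.ceil(s / block_size)) * block_size for s in sizes)

    def overhead_vs_1kb(sizes):
        base = total_allocated(sizes, 1024)
        for bs in (2048, 4096, 8192, 16384):
            extra = total_allocated(sizes, bs) / base - 1.0
            print(f"{bs // 1024:2d} KB blocks: {extra:6.1%} more storage than 1-KB blocks")

    # Example with a made-up mix of mostly small files plus a few large ones:
    overhead_vs_1kb([164, 485, 968, 2475, 56_788, 1_000_000, 50_000_000])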
While the ideal block size depends partly on the static file size distribution
and the waste incurred by choosing a certain block size, these are not the only
factors. For instance, if files larger than 4 KB are hardly ever accessed, it makes
no sense to make the blocks larger than 4 KB. In other words, it is also important
to consider the dynamic file size distribution: how often are files of a certain size
read and written? In order to answer this question, we logged all NFS traffic to
the faculty’s NFS servers for four days. The logs contain all read and write
operations, but also operations that do not incur disk block transfers to the client
and that are therefore given a transfer size of 0 (e.g., create, remove, and setattr
operations). The total number of operations in the logs was 4.3 million. Note that
these entries do not correspond to individual data transfers over
the wire. Rather, they represent full operations. For instance, a copy of a 10-MB
file from the NFS server to a local disk results in a single entry in a log. As
expected, the logs are dominated by reads and writes: the number of write
operations (0.54 million) exceeded all the other operations excluding reads
combined, and the number of reads was six times
higher even than that (3.2 million). Concentrating on reads and writes, we
measured the transfer sizes of each operation and grouped them by decile. The
results are shown in Fig. 4. For example, 20% of the reads were for files of length
1965 bytes or less. The results show that 50% of all read operations and 80% of
all write operations are for files smaller than 8 KB.
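For illustration, a Python sketch of this grouping by decile is shown below. It assumes a simplified one-operation-per-line log format ("operation,bytes"), which is an invented stand-in for whatever format the real NFS logs used; the sample log entries are likewise made up.

    # Sketch of the decile computation described above, assuming a simplified
    # "operation,bytes" log format (the real NFS logs were surely different).

    from statistics import quantiles

    def size_deciles(log_lines, wanted_op):
        sizes = []
        for line in log_lines:
            op, size = line.strip().split(",")
            if op == wanted_op:
                sizes.append(int(size))
        # quantiles(..., n=10) returns the 9 cut points separating the deciles
        return quantiles(sizes, n=10, method="inclusive")

    log = ["read,512", "read,1965", "read,8000", "write,300",
           "read,120000", "write,4096", "read,2475", "write,900"]
    print("read deciles: ", size_deciles(log, "read"))
    print("write deciles:", size_deciles(log, "write"))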
Note that we should be cautious in drawing conclusions about the ideal block
size from these figures, as we do not know the exact behavior of the disk
controller, the scheduling of requests on the disk, and indeed the transfers
between the physical disk and main memory. Still, the figures suggest that an
8-KB block size would on average result in a single-block transfer for 50% of all
read operations and 80% of all write operations, and a larger block would handle
a still greater share of all reads and 90% of all writes in a single block transfer.
Another point worth considering is the effect of the block size on the operating
system's buffer cache. By increasing the block size, we decrease the number of
blocks in the cache (assuming a constant amount of memory for the cache). This
will reduce the hit rate. To investigate this effect, we constructed a file system
with a 1-KB block size on a PC, set the operating system's cache size to 0.5 MB, compiled a
large collection of programs, and measured the total compilation time. We then
repeated the experiment for cache sizes of 1 MB, 2 MB, 4 MB, and 8 MB with
block sizes of 1 KB, 2 KB, 4 KB, and 8 KB. The results are presented in Fig. 5.
For example, with a 2-MB cache and 4-KB blocks, it took 90 seconds to compile
all the test programs. While the first row seems anomalous, we repeated the
measurements and kept getting the same numbers. We do not have an obvious
explanation for this behavior. The
effect here is dramatic for small caches. When the cache is too small to hold the
various passes of the compiler, the include files, and the libraries, having large
blocks (meaning few files in the cache) is a huge performance loss. When the
cache gets to 2 MB, everything fits and block size is no longer so important. This
experiment puts a premium on cache space because we kept calling the same
compiler passes and used the same standard header files over and over. When
the cache was too small to store them, disk activity shot up and performance
dropped sharply.
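The mechanism is easy to reproduce with a toy model: for a fixed cache size in bytes, doubling the block size halves the number of blocks that fit, and a workload that cycles through many small files starts missing badly. The Python sketch below is a simplified LRU model added for illustration, not the actual compile experiment; the workload and cache sizes are made up.

    from collections import OrderedDict

    def hit_rate(accesses, cache_bytes, block_size):
        """Toy LRU cache of whole blocks: fraction of block accesses that hit.
        `accesses` is a sequence of (file_id, byte_offset) pairs."""
        slots = max(1, cache_bytes // block_size)   # fewer slots with larger blocks
        cache, hits, total = OrderedDict(), 0, 0
        for file_id, offset in accesses:
            block = (file_id, offset // block_size)
            total += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)            # mark as most recently used
            else:
                cache[block] = True
                if len(cache) > slots:
                    cache.popitem(last=False)       # evict least recently used
        return hits / total

    # Repeatedly touching the start of many small files (headers, compiler passes):
    workload = [(f, 0) for _ in range(20) for f in range(400)]
    for bs in (1024, 2048, 4096, 8192):
        print(bs, round(hit_rate(workload, cache_bytes=512 * 1024, block_size=bs), 2))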
The first column of each table gives file sizes from 1 byte to 128 MB. The
second column tells what percentage of all files in 1984 were that size or smaller.
For example, 48.05% of all files in 1984 were 1024 bytes or less. This puts the
median file size at just over 1 KB. Put in other terms, a file system with a 1-KB
block size could store almost half of all files in a single disk block. Over 85% of
the files were ’small’ files in the sense of fitting in 10 1-KB disk blocks, and thus
not needing indirect blocks in the i-nodes to keep track of them. In 1984, all files
were 524,288 bytes or smaller. The third column gives the data for 2005. Files
are definitely larger now, with the median file size in 2005 being 2475 bytes and
the percentage of files that are ’small’ (10 1-KB disk blocks or fewer) being 73%.
Furthermore, the largest file recorded was now 2 GB, more than 4000
times larger than in 1984. While our university file system is probably typical of
other UNIX file systems at major research universities with 1000+ login names, it
may not be typical of other applications. To provide some idea of what the
distribution looks like for other applications, we ran the same measurements on a
Linux file system being used as a Web server (for www.electoral-vote.com). This
system, which was at a commercial hosting service in upstate New York, had
only 53,000 files (vs. 1.7 million at the VU). The results are given in the fourth
column. Somewhat surprisingly, on the whole, the Web server’s files were
smaller than the university’s, with a median file size of 1180 bytes. In part this
result is due to the relatively large number of icons, small .gif files, and short Perl
scripts on the server.
Another way of looking at the data is by graphing it. In Fig. 2 we see the
data of Fig. 1 in graphical form. We see more clearly here that the shape of the
curve for the three measurements is largely the same, with the 2005
measurements offset to the right to indicate the growth in file size. The Web has
relatively more small files than the university measurements, but for larger sizes,
it falls in between the 1984 and 2005 measurements. Another way of presenting
the data is to look at it by decile, as is done in Fig. 3 for the 2005 VU files. Here
the files are grouped into deciles by size. Each decile contains 10% of all the
files. The first decile contains files from 0 to 164 bytes, the second decile
contains files from 165 bytes to 485 bytes, and so on. Put in other words, 10% of
all files are 164 bytes or smaller, 20% of all files are 485 bytes or smaller, half of
all files are 2475 bytes or smaller and 90% of all files are 56,788 bytes or smaller.
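For readers who want to repeat this kind of static measurement on their own systems, a small Python sketch is given below. The starting directory is a placeholder, and this is of course not the measurement tooling used in the study itself.

    import os
    from statistics import median, quantiles

    def static_file_sizes(root):
        """Collect the size of every regular file below `root` (symlinks skipped)."""
        sizes = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    sizes.append(os.path.getsize(path))
        return sizes

    if __name__ == "__main__":
        sizes = static_file_sizes("/home")   # placeholder path; use any directory tree
        print("files:", len(sizes), "median:", median(sizes))
        print("decile boundaries:",
              [round(q) for q in quantiles(sizes, n=10, method="inclusive")])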
Generalization
In the global system, all cells (processor elements) can access the same file. The
degree of freedom of the global file system is high. However, when a plurality of
cells access the same address of the same file, the consistency of the file should
be controlled. Thus, the file system becomes large and its processing time
increases. In practice, however, since the probability that a plurality of cells
access the same address of the same file is low, this consistency control is
almost meaningless.
On the other hand, in the local file system, each cell has an independent
file system. Each cell stores only required data in the local file system. In the
local file system, each cell accesses only its local file. FIG. 1 is a schematic
diagram for explaining a related art reference against the present invention.
When a host computer accesses a local file in FIG. 1, it sees the data distributed to each cell as
independent files. FIGS. 1(A) and (B) show an example of a local file system in
which a substance of a file is present on the cell side. Files A0, A1, and A2 can
be accessed only by the corresponding cells (cell-0, cell-1, and cell-2), respectively. When the host
computer accesses these files, it must independently open and access the files
distributed to each cell. When each cell
accesses a global file, it can access the entire file. FIGS. 1(C) and (D) show an
example in the case that the substance of a file is present on the host computer
side. When the host computer or each cell opens file A, it can access the entire file in the same manner.
When the host computer accesses a local file, as shown in FIG. 1(A), it must
separately open and access each local file. When each cell accesses a file of the host computer, as shown in FIG.
1(D), since it can access files that are not required for the process thereof, it
should select and access only the required data. Thus, the load of the data
processing for such selection increases.
In the data processing of local files, as in an SPMD type parallel computer, when
the number of cells is varied, since data processed thereby is changed, data to
each local file should be re-distributed, thereby increasing the overhead of the
process. In addition, when the distribution conditions under which each cell read-
accesses data differ from the distribution conditions under which it write-
accesses the same data, the application side must deal with the change of
distribution itself.
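A toy Python sketch of such a redistribution is given here for illustration: records are block-distributed over N cells, and changing the number of cells forces every record to be gathered and scattered again, which is exactly the overhead described above. It is a deliberately simplified model, not the mechanism of the described system.

    def block_distribute(records, num_cells):
        """Split `records` into per-cell local files using a simple block distribution."""
        per_cell = -(-len(records) // num_cells)   # ceiling division
        return [records[i * per_cell:(i + 1) * per_cell] for i in range(num_cells)]

    def redistribute(local_files, new_num_cells):
        """Re-distribute per-cell data when the number of cells changes.
        Every record is gathered and scattered again, which is the overhead."""
        all_records = [r for cell in local_files for r in cell]
        return block_distribute(all_records, new_num_cells)

    cells3 = block_distribute(list(range(10)), 3)   # run with 3 cells
    cells4 = redistribute(cells3, 4)                # re-run with 4 cells
    print(cells3)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
    print(cells4)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]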
Thus, in the file system for use in the above-described high speed parallel
computing environment, the file access efficiency of each cell and host computer
becomes a critical factor for improving the performance thereof. To improve the file
access efficiency, a technique for allowing each cell and host computer to flexibly
access only a required portion of a file without increasing the load on the
application program is required.
Findings
File sizes at the university are still quite small, although larger than they were in 1984. The most likely
reason is that students keep their photo, music, and video files on their home
computers, and use the university computers mostly for small programs they are
working on for lab courses. The shape of the cumulative distribution of file sizes
(shown in Fig. 2) is remarkably similar in the two years measured, 1984 and
2005. The dynamic measurements suggest that larger block sizes may help speed
up a significant number of read and write operations, since they would
serve 1.5-2 times as many requests in a single disk block. Unfortunately we did
not collect dynamic measurements in 1984, so we cannot
compare the situation in 1984 with the current environment. A study of the
Windows NT file system based on dynamic measurements has been presented
by Vogels [2]. Interestingly, the file sizes seen in the dynamic file size distribution
are considerably larger than the static distribution would suggest. We
speculate that this might be because most reads or writes of small files incur a
read of one or more large files first, because most files are accessed using fairly
large applications. However, this is just a guess. This discussion of block sizes
does not lead to a mathematical optimum for trading off disk performance against
wasted space. Still, the data show that even for large block sizes (16 KB), the
total file system expands by only 10%. This suggests that large blocks are a
good idea. However, the data of Fig. 5 suggest that even with an 8-MB cache,
performance with 8-KB blocks is somewhat worse than that
observed with 4-KB blocks. In summary, in terms of wasted disk space, block
size hardly matters at all, but the cache performance suffers somewhat with 8-KB
blocks. Our tentative conclusion is that probably 4 KB is the sweet spot balancing
all the factors. With this size, almost 60% of all disk files fit in a single disk block
and the amount of disk space required to store all the files increases only 2%
over 1-KB blocks. Considering files actually read, though, only 35% of the files
read fit in a single 4-KB block.
References
[2] Vogels, W.: "File System Usage in Windows NT 4.0," Proc. 17th ACM Symp. on
Operating Systems Principles (SOSP '99), 1999.