A Fast File System for UNIX*
Marshall Kirk McKusick, William N. Joy†,
Samuel J. Leffler‡, Robert S. Fabry
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720
ABSTRACT
A reimplementation of the UNIX file system is described. The reimplementation
provides substantially higher throughput rates by using more flexible allocation policies
that allow better locality of reference and can be adapted to a wide range of peripheral
and processor characteristics. The new file system clusters data that is sequentially
accessed and provides two block sizes to allow fast access to large files while not wasting
large amounts of space for small files. File access rates of up to ten times faster than the
traditional UNIX file system are experienced. Long needed enhancements to the
programmers' interface are discussed. These include a mechanism to place advisory locks
on files, extensions of the name space across file systems, the ability to use long file
names, and provisions for administrative control of resource usage.
Revised February 18, 1984
CR Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management – file
organization, directory structures, access methods; D.4.2 [Operating Systems]: Storage Management –
allocation/deallocation strategies, secondary storage devices; D.4.8 [Operating Systems]: Performance –
measurements, operational analysis; H.3.2 [Information Systems]: Information Storage – file organization
Additional Keywords and Phrases: UNIX, file system organization, file system performance, file system
design, application program interface.
General Terms: file system, measurement, performance.
*UNIX is a trademark of Bell Laboratories.
†William N. Joy is currently employed by: Sun Microsystems, Inc, 2550 Garcia Avenue, Mountain View, CA
94043
‡Samuel J. Leffler is currently employed by: Lucasfilm Ltd., PO Box 2009, San Rafael, CA 94912
This work was done under grants from the National Science Foundation under grant MCS80-05144, and the
Defense Advanced Research Projects Agency (DoD) under ARPA Order No. 4031 monitored by Naval
Electronic System Command under Contract No. N00039-82-C-0235.
SMM:05-2 A Fast File System for UNIX
TABLE OF CONTENTS
1. Introduction
2. Old file system
3. New file system organization
3.1. Optimizing storage utilization
3.2. File system parameterization
3.3. Layout policies
4. Performance
5. File system functional enhancements
5.1. Long file names
5.2. File locking
5.3. Symbolic links
5.4. Rename
5.5. Quotas
Acknowledgements
References
1. Introduction
This paper describes the changes from the original 512 byte UNIX file system to the new one
released with the 4.2 Berkeley Software Distribution. It presents the motivations for the changes, the
methods used to effect these changes, the rationale behind the design decisions, and a description of the new
implementation. This discussion is followed by a summary of the results that have been obtained,
directions for future work, and the additions and changes that have been made to the facilities that are available
to programmers.
The original UNIX system that runs on the PDP-11† has simple and elegant file system facilities.
File system input/output is buffered by the kernel; there are no alignment constraints on data transfers and
all operations are made to appear synchronous. All transfers to the disk are in 512 byte blocks, which can
be placed arbitrarily within the data area of the file system. Virtually no constraints other than available
disk space are placed on file growth [Ritchie74], [Thompson78].*
When used on the VAX-11 together with other UNIX enhancements, the original 512 byte UNIX file
system is incapable of providing the data throughput rates that many applications require. For example,
applications such as VLSI design and image processing do a small amount of processing on large
quantities of data and need to have a high throughput from the file system. High throughput rates are also needed
by programs that map files from the file system into large virtual address spaces. Paging data in and out of
the file system is likely to occur frequently [Ferrin82b]. This requires a file system providing higher bandwidth
than the original 512 byte UNIX one, which provides only about two percent of the maximum disk
bandwidth, or about 20 kilobytes per second per arm [White80], [Smith81b].
Modifications have been made to the UNIX file system to improve its performance. Since the UNIX
file system interface is well understood and not inherently slow, this development retained the abstraction
and simply changed the underlying implementation to increase its throughput. Consequently, users of the
system have not been faced with massive software conversion.
Problems with file system performance have been dealt with extensively in the literature; see
[Smith81a] for a survey. Previous work to improve the UNIX file system performance has been done by
[Ferrin82a]. The UNIX operating system drew many of its ideas from Multics, a large, high performance
†DEC, PDP,VAX, MASSBUS, and UNIBUS are trademarks of Digital Equipment Corporation.
*In practice, a file's size is constrained to be less than about one gigabyte.
operating system [Feiertag71]. Other work includes Hydra [Almes78], Spice [Thompson80], and a file
system for a LISP environment [Symbolics81]. A good introduction to the physical latencies of disks is
described in [Pechura83].
2. Old File System
In the file system developed at Bell Laboratories (the ''traditional'' file system), each disk drive is
divided into one or more partitions. Each of these disk partitions may contain one file system. A file
system never spans multiple partitions.† A file system is described by its super-block, which contains the
basic parameters of the file system. These include the number of data blocks in the file system, a count of
the maximum number of files, and a pointer to the free list, a linked list of all the free blocks in the file
system.
Within the file system are files. Certain files are distinguished as directories and contain pointers to
files that may themselves be directories. Every file has a descriptor associated with it called an inode. An
inode contains information describing ownership of the file, time stamps marking last modification and
access times for the file, and an array of indices that point to the data blocks for the file. For the purposes
of this section, we assume that the first 8 blocks of the file are directly referenced by values stored in an
inode itself*. An inode may also contain references to indirect blocks containing further data block indices.
In a file system with a 512 byte block size, a singly indirect block contains 128 further block addresses, a
doubly indirect block contains 128 addresses of further singly indirect blocks, and a triply indirect block
contains 128 addresses of further doubly indirect blocks.
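Putting these numbers together gives the largest file this addressing scheme can describe. The sketch below (Python used for illustration, not code from the system) adopts this section's assumptions of 8 direct blocks and 4 byte block addresses:

```python
# Maximum file size addressable by the traditional inode scheme:
# 8 direct blocks plus singly, doubly, and triply indirect blocks,
# with 512 byte blocks and 128 addresses per indirect block.
# (The direct-block count of 8 is this section's assumption; as the
# footnote notes, real systems varied between 5 and 13.)

BLOCK_SIZE = 512
DIRECT_BLOCKS = 8
ADDRS_PER_INDIRECT = BLOCK_SIZE // 4  # 4 byte block addresses -> 128

max_file_bytes = (DIRECT_BLOCKS
                  + ADDRS_PER_INDIRECT
                  + ADDRS_PER_INDIRECT ** 2
                  + ADDRS_PER_INDIRECT ** 3) * BLOCK_SIZE
print(max_file_bytes)  # 1082200064, just over one gigabyte
```

The total of roughly one gigabyte agrees with the earlier footnote that a file's size is constrained to be less than about one gigabyte.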
A 150 megabyte traditional UNIX file system consists of 4 megabytes of inodes followed by 146
megabytes of data. This organization segregates the inode information from the data; thus accessing a file
normally incurs a long seek from the file's inode to its data. Files in a single directory are not typically
allocated consecutive slots in the 4 megabytes of inodes, causing many non-consecutive blocks of inodes to
be accessed when executing operations on the inodes of several files in a directory.
The allocation of data blocks to files is also suboptimum. The traditional file system never transfers
more than 512 bytes per disk transaction and often finds that the next sequential data block is not on the
same cylinder, forcing seeks between 512 byte transfers. The combination of the small block size, limited
read-ahead in the system, and many seeks severely limits file system throughput.
The first work at Berkeley on the UNIX file system attempted to improve both reliability and
throughput. The reliability was improved by staging modifications to critical file system information so
that they could either be completed or repaired cleanly by a program after a crash [Kowalski78]. The file
system performance was improved by a factor of more than two by changing the basic block size from 512
to 1024 bytes. The increase was because of two factors: each disk transfer accessed twice as much data,
and most files could be described without need to access indirect blocks since the direct blocks contained
twice as much data. The file system with these changes will henceforth be referred to as the old file system.
This performance improvement gave a strong indication that increasing the block size was a good
method for improving throughput. Although the throughput had doubled, the old file system was still using
only about four percent of the disk bandwidth. The main problem was that although the free list was
initially ordered for optimal access, it quickly became scrambled as files were created and removed. Eventually
the free list became entirely random, causing files to have their blocks allocated randomly over the
disk. This forced a seek before every block access. Although old file systems provided transfer rates of up
to 175 kilobytes per second when they were first created, this rate deteriorated to 30 kilobytes per second
after a few weeks of moderate use because of this randomization of data block placement. There was no
way of restoring the performance of an old file system except to dump, rebuild, and restore the file system.
Another possibility, as suggested by [Maruyama76], would be to have a process that periodically
†By ''partition'' here we refer to the subdivision of physical space on a disk drive. In the traditional file
system, as in the new file system, file systems are really located in logical disk partitions that may overlap. This
overlapping is made available, for example, to allow programs to copy entire disk drives containing multiple file
systems.
*The actual number may vary from system to system, but is usually in the range 5-13.
reorganized the data on the disk to restore locality.
3. New file system organization
In the new file system organization (as in the old file system organization), each disk drive contains
one or more file systems. A file system is described by its super-block, located at the beginning of the file
system's disk partition. Because the super-block contains critical data, it is replicated to protect against
catastrophic loss. This is done when the file system is created; since the super-block data does not change,
the copies need not be referenced unless a head crash or other hard disk error causes the default super-block
to be unusable.
To insure that it is possible to create files as large as 2^32 bytes with only two levels of indirection, the
minimum size of a file system block is 4096 bytes. The size of file system blocks can be any power of two
greater than or equal to 4096. The block size of a file system is recorded in the file system's super-block so
it is possible for file systems with different block sizes to be simultaneously accessible on the same system.
The block size must be decided at the time that the file system is created; it cannot be subsequently changed
without rebuilding the file system.
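The choice of 4096 bytes as the minimum block size can be checked directly. Assuming 4 byte block addresses (an assumption; the text does not state the address size here), a 4096 byte indirect block holds 1024 addresses, so the doubly indirect block alone spans 2^32 bytes:

```python
# With 4096 byte blocks and (assumed) 4 byte block addresses, two
# levels of indirection suffice to reach a 2^32 byte file: the
# doubly indirect block addresses 1024 singly indirect blocks,
# each addressing 1024 data blocks of 4096 bytes.

BLOCK_SIZE = 4096
ADDRS_PER_INDIRECT = BLOCK_SIZE // 4  # 1024

double_indirect_span = ADDRS_PER_INDIRECT ** 2 * BLOCK_SIZE
print(double_indirect_span == 2 ** 32)  # True
```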
The new file system organization divides a disk partition into one or more areas called cylinder
groups. A cylinder group is comprised of one or more consecutive cylinders on a disk. Associated with
each cylinder group is some bookkeeping information that includes a redundant copy of the super-block,
space for inodes, a bit map describing available blocks in the cylinder group, and summary information
describing the usage of data blocks within the cylinder group. The bit map of available blocks in the
cylinder group replaces the traditional file system's free list. For each cylinder group a static number of inodes
is allocated at file system creation time. The default policy is to allocate one inode for each 2048 bytes of
space in the cylinder group, expecting this to be far more than will ever be needed.
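The default inode allocation policy amounts to a one-line calculation; the sketch below is an illustration, not the allocator itself, and the example size is hypothetical:

```python
# Default policy: one inode for each 2048 bytes of space in the
# cylinder group.

def default_inode_count(group_bytes, bytes_per_inode=2048):
    return group_bytes // bytes_per_inode

# e.g. a hypothetical 150 megabyte file system's worth of space:
print(default_inode_count(150 * 1024 * 1024))  # 76800
```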
All the cylinder group bookkeeping information could be placed at the beginning of each cylinder
group. However, if this approach were used, all the redundant information would be on the top platter. A
single hardware failure that destroyed the top platter could cause the loss of all redundant copies of the
super-block. Thus the cylinder group bookkeeping information begins at a varying offset from the
beginning of the cylinder group. The offset for each successive cylinder group is calculated to be about one track
further from the beginning of the cylinder group than the preceding cylinder group. In this way the
redundant information spirals down into the pack so that any single track, cylinder, or platter can be lost without
losing all copies of the super-block. Except for the first cylinder group, the space between the beginning of
the cylinder group and the beginning of the cylinder group information is used for data blocks.†
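The spiraling offset can be sketched as a simple calculation. The geometry values below are hypothetical, and this is not the exact 4.2BSD formula, which the text does not give:

```python
# Each successive cylinder group places its bookkeeping information
# about one track further from the start of the group, wrapping
# within a cylinder, so no single track, cylinder, or platter holds
# every copy of the super-block.  Offsets are in sectors.

def bookkeeping_offset(group_index, sectors_per_track, tracks_per_cylinder):
    sectors_per_cylinder = sectors_per_track * tracks_per_cylinder
    return (group_index * sectors_per_track) % sectors_per_cylinder

# Successive groups land on successive tracks of their cylinder:
print([bookkeeping_offset(i, 32, 19) for i in range(4)])  # [0, 32, 64, 96]
```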
3.1. Optimizing storage utilization
Data is laid out so that larger blocks can be transferred in a single disk transaction, greatly increasing
file system throughput. As an example, consider a file in the new file system composed of 4096 byte data
blocks. In the old file system this file would be composed of 1024 byte blocks. By increasing the block
size, disk accesses in the new file system may transfer up to four times as much information per disk
transaction. In large files, several 4096 byte blocks may be allocated from the same cylinder so that even larger
data transfers are possible before requiring a seek.
The main problem with larger blocks is that most UNIX file systems are composed of many small
files. A uniformly large block size wastes space. Table 1 shows the effect of file system block size on the
amount of wasted space in the file system. The files measured to obtain these figures reside on one of our
†While it appears that the first cylinder group could be laid out with its super-block at the ''known'' location,
this would not work for file systems with block sizes of 16 kilobytes or greater. This is because of a
requirement that the first 8 kilobytes of the disk be reserved for a bootstrap program and a separate requirement that the
cylinder group information begin on a file system block boundary. To start the cylinder group on a file system
block boundary, file systems with block sizes larger than 8 kilobytes would have to leave an empty space
between the end of the boot block and the beginning of the cylinder group. Without knowing the size of the file
system blocks, the system would not know what roundup function to use to find the beginning of the first
cylinder group.
time sharing systems that has roughly 1.2 gigabytes of on-line storage. The measurements are based on the
active user file systems containing about 920 megabytes of formatted space.
Space used    % waste    Organization
775.2 Mb      0.0        Data only, no separation between files
807.8 Mb      4.2        Data only, each file starts on 512 byte boundary
828.7 Mb      6.9        Data + inodes, 512 byte block UNIX file system
866.5 Mb      11.8       Data + inodes, 1024 byte block UNIX file system
948.5 Mb      22.4       Data + inodes, 2048 byte block UNIX file system
1128.3 Mb     45.6       Data + inodes, 4096 byte block UNIX file system

Table 1 - Wasted space as a function of block size.
The space wasted is calculated to be the percentage of space on the disk not containing user data. As the
block size on the disk increases, the waste rises quickly, to an intolerable 45.6% waste with 4096 byte file
system blocks.
To be able to use large blocks without undue waste, small files must be stored in a more efficient way.
The new file system accomplishes this goal by allowing the division of a single file system block into one
or more fragments. The file system fragment size is specified at the time that the file system is created;
each file system block can optionally be broken into 2, 4, or 8 fragments, each of which is addressable. The
lower bound on the size of these fragments is constrained by the disk sector size, typically 512 bytes. The
block map associated with each cylinder group records the space available in a cylinder group at the
fragment level; to determine if a block is available, aligned fragments are examined. Figure 1 shows a piece of
a map from a 4096/1024 file system.

Bits in map         XXXX  XXOO  OOXX  OOOO
Fragment numbers     0-3   4-7  8-11 12-15
Block numbers          0     1     2     3

Figure 1 - Example layout of blocks and fragments in a 4096/1024 file system.

Each bit in the map records the status of a fragment; an ''X'' shows that the fragment is in use, while an ''O''
shows that the fragment is available for allocation. In this example, fragments 0-5, 10, and 11 are in use,
while fragments 6-9 and 12-15 are free. Fragments of adjoining blocks cannot be used as a full block,
even if they are large enough. In this example, fragments 6-9 cannot be allocated as a full block; only
fragments 12-15 can be coalesced into a full block.
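The aligned-fragment check can be sketched against the map of Figure 1; this is an illustration in Python, not the 4.2BSD allocator:

```python
# Scan a cylinder group fragment map for a fully free, block-aligned
# run of fragments.  1 marks a fragment in use ("X" in Figure 1),
# 0 a free fragment ("O").  FRAGS_PER_BLOCK is 4 for a 4096/1024
# file system.

FRAGS_PER_BLOCK = 4

def find_free_block(frag_map):
    """Return the number of a fully free, aligned block, or None."""
    for block in range(len(frag_map) // FRAGS_PER_BLOCK):
        start = block * FRAGS_PER_BLOCK
        if not any(frag_map[start:start + FRAGS_PER_BLOCK]):
            return block
    return None

# The map of Figure 1: XXXX XXOO OOXX OOOO
frag_map = [1,1,1,1, 1,1,0,0, 0,0,1,1, 0,0,0,0]
print(find_free_block(frag_map))  # 3
```

Fragments 6-9 are free but straddle blocks 1 and 2, so only block 3 (fragments 12-15) qualifies as a free full block.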
On a file system with a block size of 4096 bytes and a fragment size of 1024 bytes, a file is
represented by zero or more 4096 byte blocks of data, and possibly a single fragmented block. If a file system
block must be fragmented to obtain space for a small amount of data, the remaining fragments of the block
are made available for allocation to other files. As an example consider an 11000 byte file stored on a
4096/1024 byte file system. This file would use two full size blocks and one three fragment portion of
another block. If no block with three aligned fragments is available at the time the file is created, a full size
block is split, yielding the necessary fragments and a single unused fragment. This remaining fragment can
be allocated to another file as needed.
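The arithmetic for the 11000 byte example can be written out directly; a sketch using the sizes from the example:

```python
# Lay out a file on a 4096/1024 byte file system: whole 4096 byte
# blocks, plus one fragmented block covering the tail of the file
# in 1024 byte fragments.

BLOCK_SIZE = 4096
FRAG_SIZE = 1024

def file_layout(size_bytes):
    """Return (full_blocks, tail_fragments) for a file of this size."""
    full_blocks, tail = divmod(size_bytes, BLOCK_SIZE)
    tail_fragments = -(-tail // FRAG_SIZE)  # ceiling division
    return full_blocks, tail_fragments

print(file_layout(11000))  # (2, 3): two full blocks and three fragments
```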
Space is allocated to a file when a program does a write system call. Each time data is written to a
file, the system checks to see if the size of the file has increased*. If the file needs to be expanded to hold
the new data, one of three conditions exists:
1) There is enough space left in an already allocated block or fragment to hold the new data. The new
data is written into the available space.
2) The file contains no fragmented blocks (and the last block in the file contains insufficient space to
hold the new data). If space exists in a block already allocated, the space is filled with new data. If
the remainder of the new data contains more than a full block of data, a full block is allocated and the
*A program may be overwriting data in the middle of an existing file in which case space would already have
been allocated.