Overview of the Linux Virtual File System
Original author: Richard Gooch <rgooch@atnf.csiro.au>
Last updated on June 24, 2007.
Copyright (C) 1999 Richard Gooch
Copyright (C) 2005 Pekka Enberg
This file is released under the GPLv2.
Introduction
============
The Virtual File System (also known as the Virtual Filesystem Switch)
is the software layer in the kernel that provides the filesystem
interface to userspace programs. It also provides an abstraction
within the kernel which allows different filesystem implementations to
coexist.
VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
on are called from a process context. Filesystem locking is described
in the document Documentation/filesystems/Locking.
Directory Entry Cache (dcache)
------------------------------
The VFS implements the open(2), stat(2), chmod(2), and similar system
calls. The pathname argument that is passed to them is used by the VFS
to search through the directory entry cache (also known as the dentry
cache or dcache). This provides a very fast look-up mechanism to
translate a pathname (filename) into a specific dentry. Dentries live
in RAM and are never saved to disc: they exist only for performance.
The dentry cache is meant to be a view into your entire filespace. As
most computers cannot fit all dentries in the RAM at the same time,
some bits of the cache are missing. In order to resolve your pathname
into a dentry, the VFS may have to resort to creating dentries along
the way, and then loading the inode. This is done by looking up the
inode.
The Inode Object
----------------
An individual dentry usually has a pointer to an inode. Inodes are
filesystem objects such as regular files, directories, FIFOs and other
beasts. They live either on the disc (for block device filesystems)
or in the memory (for pseudo filesystems). Inodes that live on the
disc are copied into the memory when required and changes to the inode
are written back to disc. A single inode can be pointed to by multiple
dentries (hard links, for example, do this).
To look up an inode requires that the VFS calls the lookup() method of
the parent directory inode. This method is installed by the specific
filesystem implementation that the inode lives in. Once the VFS has
the required dentry (and hence the inode), we can do all those boring
things like open(2) the file, or stat(2) it to peek at the inode
data. The stat(2) operation is fairly simple: once the VFS has the
dentry, it peeks at the inode data and passes some of it back to
userspace.
The File Object
---------------
Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
descriptors). The freshly allocated file structure is initialized with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do its work. You
can see that this is another switch performed by the VFS. The file
structure is placed into the file descriptor table for the process.
Reading, writing and closing files (and other assorted VFS operations)
is done by using the userspace file descriptor to grab the appropriate
file structure, and then calling the required file structure method to
do whatever is required. For as long as the file is open, it keeps the
dentry in use, which in turn means that the VFS inode is still in use.
Registering and Mounting a Filesystem
=====================================
To register and unregister a filesystem, use the following API
functions:
#include <linux/fs.h>
extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
The passed struct file_system_type describes your filesystem. When a
request is made to mount a filesystem onto a directory in your namespace,
the VFS will call the appropriate mount() method for the specific
filesystem. New vfsmount referring to the tree returned by ->mount()
will be attached to the mountpoint, so that when pathname resolution
reaches the mountpoint it will jump into the root of that vfsmount.
You can see all filesystems that are registered to the kernel in the
file /proc/filesystems.
struct file_system_type
-----------------------
This describes the filesystem. As of kernel 2.6.39, the following
members are defined:
struct file_system_type {
const char *name;
int fs_flags;
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
struct list_head fs_supers;
struct lock_class_key s_lock_key;
struct lock_class_key s_umount_key;
};
name: the name of the filesystem type, such as "ext2", "iso9660",
"msdos" and so on
fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
mount: the method to call when a new instance of this
filesystem should be mounted
kill_sb: the method to call when an instance of this filesystem
should be shut down
owner: for internal VFS use: you should initialize this to THIS_MODULE in
most cases.
next: for internal VFS use: you should initialize this to NULL
s_lock_key, s_umount_key: lockdep-specific
The mount() method has the following arguments:
struct file_system_type *fs_type: describes the filesystem, partly initialized
by the specific filesystem code
int flags: mount flags
const char *dev_name: the device name we are mounting.
void *data: arbitrary mount options, usually comes as an ASCII
string (see "Mount Options" section)
The mount() method must return the root dentry of the tree requested by
caller. An active reference to its superblock must be grabbed and the
superblock must be locked. On failure it should return ERR_PTR(error).
The arguments match those of mount(2) and their interpretation
depends on filesystem type. E.g. for block filesystems, dev_name is
interpreted as block device name, that device is opened and if it
contains a suitable filesystem image the method creates and initializes
struct super_block accordingly, returning its root dentry to caller.
->mount() may choose to return a subtree of existing filesystem - it
doesn't have to create a new one. The main result from the caller's
point of view is a reference to dentry at the root of (sub)tree to
be attached; creation of new superblock is a common side effect.
The most interesting member of the superblock structure that the
mount() method fills in is the "s_op" field. This is a pointer to
a "struct super_operations" which describes the next level of the
filesystem implementation.
Usually, a filesystem uses one of the generic mount() implementations
and provides a fill_super() callback instead. The generic variants are:
mount_bdev: mount a filesystem residing on a block device
mount_nodev: mount a filesystem that is not backed by a device
mount_single: mount a filesystem which shares the instance between
all mounts
A fill_super() callback implementation has the following arguments:
struct super_block *sb: the superblock structure. The callback
must initialize this properly.
void *data: arbitrary mount options, usually comes as an ASCII
string (see "Mount Options" section)
int silent: whether or not to be silent on error
The Superblock Object
=====================
A superblock object represents a mounted filesystem.
struct super_operations
-----------------------
This describes how the VFS can manipulate the superblock of your
filesystem. As of kernel 2.6.22, the following members are defined:
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*dirty_inode) (struct inode *, int flags);
int (*write_inode) (struct inode *, i