Third, extensions are a leading cause of operating system
failure. In Windows XP, for example, drivers cause 85%
of recently reported failures [39]. In Linux, the frequency of
coding errors is seven times higher for device drivers than for
the rest of the kernel [10]. While the core operating system
kernel reaches high levels of reliability due to longevity and
repeated testing, the extended operating system cannot be
tested completely. With tens of thousands of extensions,
operating system vendors cannot even identify them all, let
alone test all possible combinations used in the marketplace.
Improving OS reliability will therefore require systems to
become highly tolerant of failures in drivers and other ex-
tensions. Furthermore, the hundreds of millions of existing
systems executing tens of thousands of extensions demand a
reliability solution that is at once backward compatible and
efficient for common extensions. Backward compatibility
improves the reliability of already deployed systems. Ef-
ficiency avoids the classic tradeoff between robustness and
performance.
Our focus on extensibility and reliability is not new. The
last twenty years have produced a substantial amount of
research on improving extensibility and reliability through
the use of new kernel architectures [15], new driver archi-
tectures [38], user-level extensions [18, 31, 55], new hard-
ware [16, 54], or type-safe languages [3].
While many of the underlying techniques used in Nooks
have been used in previous systems, Nooks differs from ear-
lier efforts in two key ways. First, we target existing exten-
sions for commodity operating systems rather than propose
a new extension architecture. We want today’s extensions
to execute on today’s platforms without change if possible.
Second, we use C, a conventional programming language.
We do not ask developers to change languages, development
environments, or, most importantly, perspective. Overall,
we focus on a single and very serious problem – reducing
the huge number of crashes due to drivers and other exten-
sions. In the end, we hope to see an isolation service such
as Nooks become standard on all non-performance-critical
systems, from desktops to servers to embedded appliances.
We implemented a prototype of Nooks in the Linux op-
erating system and experimented with a variety of kernel
extension types, including several device drivers, a file system,
and a kernel Web server. Using automatic fault injection [26],
we show that when synthetic bugs are injected into extensions,
Nooks can gracefully recover and restart the extension
in 99% of the cases that cause Linux to crash. In
addition, Nooks recovered from all of the common causes
of kernel crashes that we manually inserted. Extension re-
covery occurs quickly, as compared to a full system reboot,
leaving most applications running. For drivers – the most
common extension type – the impact on performance is low
to moderate. Finally, of the eight kernel extensions we iso-
lated with Nooks, seven required no code changes, while
only 13 lines changed in the eighth. Although our proto-
type is Linux based, we expect that the architecture and
many implementation features would port readily to other
commodity operating systems.
The rest of this paper describes the design, implementation,
and performance of Nooks. The next section summarizes
related work in OS extensibility and reliability. Section 3
describes the system’s guiding principles and high-level
architecture. Section 4 discusses the system’s implementation
on Linux. We present experiments that evaluate
the reliability of Nooks in Section 5 and its performance in
Section 6. Section 7 summarizes our work and draws conclusions.
2. RELATED WORK
Our work differs from the substantial body of research on
extensibility and reliability in many dimensions. Nooks re-
lies on a conventional processor architecture, a conventional
programming language, a conventional operating system ar-
chitecture, and existing extensions. It is designed to be
transparent to the extensions themselves, to support recov-
erability, and to impose only a modest performance penalty.
The major hardware approaches to improve reliability in-
clude capability-based architectures [25, 30, 36] and ring
and segment architectures [27, 40].¹ These systems support
fine-grained protection, enabling construction and isolation
of privileged subsystems. The OS is extended by adding
new privileged subsystems that exist in new domains or
segments. Recovery is not specifically addressed in either
architecture. In particular, capabilities support the fine-
grained sharing of data. If one sharing component fails, re-
covery may be difficult for others sharing the same resource.
Segmented architectures have been difficult to program and
plagued by poor performance. In contrast, Nooks isolates
existing code on commodity processors using standard vir-
tual memory and runtime techniques, and it supports recov-
ery through garbage collection of extension-allocated data.
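
The idea of recovering by garbage-collecting extension-allocated data can be illustrated with a minimal sketch in C. It assumes a hypothetical tracker (`struct extension`, `ext_alloc`, `ext_free`, and `ext_recover` are illustrative names, not Nooks interfaces) that records every allocation made on behalf of an extension, so that a recovery routine can reclaim whatever a failed extension left behind. User-level `malloc` stands in for kernel allocators purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not the Nooks API. */

/* One record per live allocation made on behalf of an extension. */
struct tracked_obj {
    void *ptr;
    struct tracked_obj *next;
};

/* Per-extension bookkeeping: the objects the extension still owns. */
struct extension {
    struct tracked_obj *objs;
    size_t live;
};

/* Allocate memory for an extension and remember that it owns it. */
void *ext_alloc(struct extension *ext, size_t size)
{
    assert(ext != NULL);
    struct tracked_obj *rec = malloc(sizeof *rec);
    if (rec == NULL)
        return NULL;
    rec->ptr = malloc(size);
    if (rec->ptr == NULL) {
        free(rec);
        return NULL;
    }
    rec->next = ext->objs;
    ext->objs = rec;
    ext->live++;
    return rec->ptr;
}

/* Normal release path: free the object and drop its record. */
void ext_free(struct extension *ext, void *ptr)
{
    assert(ext != NULL);
    for (struct tracked_obj **pp = &ext->objs; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->ptr == ptr) {
            struct tracked_obj *rec = *pp;
            *pp = rec->next;
            free(rec->ptr);
            free(rec);
            ext->live--;
            return;
        }
    }
}

/* Recovery path: reclaim everything a failed extension still owns.
 * Returns the number of objects reclaimed. */
size_t ext_recover(struct extension *ext)
{
    assert(ext != NULL);
    size_t reclaimed = 0;
    while (ext->objs != NULL) {
        struct tracked_obj *rec = ext->objs;
        ext->objs = rec->next;
        free(rec->ptr);
        free(rec);
        reclaimed++;
    }
    ext->live = 0;
    return reclaimed;
}
```

In this sketch, a caller would route all of an extension's allocations through `ext_alloc` and invoke `ext_recover` once the extension is judged to have failed. Because the tracker, not the extension, holds the list of live objects, cleanup does not depend on the failed code cooperating.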
Several projects have isolated kernel components through
new operating system structures. Microkernels [31, 55] and
their derivatives [15, 17, 23] promise another path to reliabil-
ity. These systems isolate extensions into separate address
spaces that interact with the OS through a kernel commu-
nication service, such as messages or remote procedure call
[2]. Therefore, the failure of an extension within an ad-
dress space does not necessarily crash the system. However,
as in capability-based systems, recovery has received little
attention in microkernel systems. In Mach, for example, a
user-level system service can fail without crashing the kernel,
but rebooting is often the only way to restart the service.
Despite much research in fast inter-process communication
(IPC) [2, 31], the reliance on separate address spaces raises
performance concerns that have prevented adoption in com-
modity systems.
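
The address-space isolation described above can be approximated at user level with ordinary POSIX processes. The following sketch is only an analogy, not how any particular microkernel is implemented: a hypothetical "extension" (`extension_handle`) runs in a child process, replies over a pipe, and deliberately crashes on negative input. The caller (`call_isolated`, also a hypothetical name) survives the crash and reports it as an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical extension: doubles its input, crashes on negative input. */
static int extension_handle(int request)
{
    if (request < 0)
        abort();                 /* stands in for a driver bug */
    return request * 2;
}

/* Run one request in a separate process (its own address space).
 * Returns 0 and stores the reply on success, -1 if the extension failed. */
int call_isolated(int request, int *reply)
{
    assert(reply != NULL);

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: compute the result and send it back as a message. */
        close(fds[0]);
        int result = extension_handle(request);
        ssize_t n = write(fds[1], &result, sizeof result);
        _exit(n == (ssize_t)sizeof result ? 0 : 1);
    }

    /* Parent: read the reply; a crashed child yields end-of-file instead. */
    close(fds[1]);
    int result = 0;
    ssize_t n = read(fds[0], &result, sizeof result);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n != (ssize_t)sizeof result || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;

    *reply = result;
    return 0;
}
```

Each call pays for a `fork`, a pipe, and two context switches, which hints at the performance concern raised above. The sketch also shows the gap in recovery: the caller learns that the extension died, but any state the extension held is simply lost.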
A number of transaction-based systems [41, 43] have ap-
plied recoverable database techniques within the OS to im-
prove reliability. In some cases, such as the file system, the
approach worked well, while in others it proved awkward
and slow [41]. Like the language-based approaches, these
strategies have limited applicability and audience. In con-
trast, Nooks integrates transparently into existing commod-
ity systems without requiring architectural change.
An alternative to operating system-based isolation is the
use of type-safe programming languages and run-time sys-
tems [3] that prevent many faults from occurring. Such sys-
tems can provide performance advantages, since compile-
time checking enables lightweight run-time structures (e.g.,
local procedure calls rather than cross-domain calls). To
date, however, OS suppliers have been unwilling to imple-
ment system code in type-safe, high-level languages. More-
over, the type-safe language approach makes it impossible
¹ [54] presents a similar approach in a newer context.