
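### Improving the Reliability of Commodity Operating Systems

#### Summary and Background

As computer technology has developed and spread, the operating system, as the bridge between hardware and software, has come to play a critical role in the overall information technology stack. Yet despite decades of research on extensible operating system technology, extensions such as device drivers remain one of the main causes of system failures. In Windows XP, for example, drivers account for 85% of recently reported failures. To address this problem, Michael M. Swift, Brian N. Bershad, and Henry M. Levy proposed Nooks, a reliability subsystem. Nooks aims to improve operating system reliability substantially by isolating the operating system from driver failures, preventing the vast majority of driver-caused crashes. Its design goal is to achieve this with little or no change to existing driver and system code.

#### Key Techniques and Design Principles of Nooks

##### Isolation

One of the core techniques in Nooks is to isolate drivers within lightweight protection domains located inside the kernel address space. Even if a driver fails or misbehaves, it is prevented from corrupting the kernel. The isolation relies on hardware and software working together.

##### Resource Tracking and Fast Recovery

Beyond isolation, Nooks tracks each driver's use of kernel resources so that automatic clean-up during recovery is faster. Once a driver failure is detected, Nooks can contain the fault and efficiently release what the failed driver left behind, returning the system to normal operation quickly.

#### Implementation and Testing

To validate the approach, the research team implemented Nooks in the Linux operating system and tested it extensively. In a series of 2000 fault-injection tests, Nooks recovered automatically from 99% of the faults that caused Linux to crash, which indicates that it can markedly improve operating system reliability in practice. Although Nooks was originally designed for driver failures, its techniques also apply to other kernel extensions, such as a kernel-mode file system and an in-kernel Internet service, and the authors demonstrate this by isolating both.

#### Conclusion and Outlook

Because Nooks supports existing C-language extensions, runs on a commodity operating system and hardware, and enables automated recovery, it represents a substantial step beyond the specialized architectures and type-safe languages required by previous efforts at safe extensibility. Its compatibility and ease of integration give it broad applicability. For businesses and individual users, a solution like Nooks can greatly reduce the data loss and service interruptions caused by system crashes, improving productivity and user experience.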


Improving the Reliability of Commodity Operating Systems

Michael M. Swift, Brian N. Bershad, and Henry M. Levy

Department of Computer Science and Engineering

University of Washington

Seattle, WA 98195 USA

{mikesw,bershad,levy}@cs.washington.edu

ABSTRACT

Despite decades of research in extensible operating system technology, extensions such as device drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85% of recently reported failures.

This paper describes Nooks, a reliability subsystem that seeks to greatly enhance OS reliability by isolating the OS from driver failures. The Nooks approach is practical: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, our goal is to prevent the vast majority of driver-caused crashes with little or no change to existing driver and system code. To achieve this, Nooks isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. Nooks also tracks a driver’s use of kernel resources to hasten automatic clean-up during recovery.

To prove the viability of our approach, we implemented Nooks in the Linux operating system and used it to fault-isolate several device drivers. Our results show that Nooks offers a substantial increase in the reliability of operating systems, catching and quickly recovering from many faults that would otherwise crash the system. In a series of 2000 fault-injection tests, Nooks recovered automatically from 99% of the faults that caused Linux to crash.

While Nooks was designed for drivers, our techniques generalize to other kernel extensions, as well. We demonstrate this by isolating a kernel-mode file system and an in-kernel Internet service. Overall, because Nooks supports existing C-language extensions, runs on a commodity operating system and hardware, and enables automated recovery, it represents a substantial step beyond the specialized architectures and type-safe languages required by previous efforts directed at safe extensibility.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.


Categories and Subject Descriptors

D.4.5 [Operating Systems]: Reliability—fault tolerance

General Terms

Reliability, Management

Keywords

Recovery, Device Drivers, Virtual Memory, Protection, I/O

1. INTRODUCTION

This paper describes the architecture, implementation, and performance of Nooks, a new operating system subsystem that allows existing OS extensions (such as device drivers and loadable file systems) to execute safely in commodity kernels. In contemporary systems, any fault in a kernel extension can corrupt vital kernel data, causing the system to crash. To reduce the threat of extension failures, Nooks executes each extension in a lightweight kernel protection domain – a privileged kernel-mode environment with restricted write access to kernel memory. Nooks’ interposition services track and validate all modifications to kernel data structures performed by the kernel-mode extension, thereby trapping bugs as they occur and facilitating subsequent automatic recovery.
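As an illustration of what it can mean for interposition to "track and validate all modifications to kernel data structures", the following C sketch models a protection domain as a list of memory regions that an extension has been granted write access to, and rejects writes that fall outside them. It is a user-space model under stated assumptions, not the Nooks implementation: the names `struct domain`, `domain_grant`, and `checked_write` are hypothetical, and the check here is done purely in software, whereas the paper describes hardware and software together restricting write access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_GRANTS 8

/* Hypothetical: one writable region granted to an extension. */
struct grant {
    uintptr_t start;
    size_t len;
};

/* Hypothetical protection domain: the regions an extension may modify. */
struct domain {
    struct grant grants[MAX_GRANTS];
    size_t ngrants;
    unsigned faults;   /* number of rejected writes observed so far */
};

/* Grant write access to [start, start + len). Returns 0, or -1 if the table is full. */
int domain_grant(struct domain *d, void *start, size_t len)
{
    if (d->ngrants == MAX_GRANTS)
        return -1;
    d->grants[d->ngrants].start = (uintptr_t)start;
    d->grants[d->ngrants].len = len;
    d->ngrants++;
    return 0;
}

/* True if [dst, dst + len) lies entirely inside one granted region. */
static int domain_may_write(const struct domain *d, const void *dst, size_t len)
{
    uintptr_t p = (uintptr_t)dst;
    for (size_t i = 0; i < d->ngrants; i++) {
        const struct grant *g = &d->grants[i];
        /* Written to avoid overflow: offset into region <= room left for len. */
        if (p >= g->start && len <= g->len && p - g->start <= g->len - len)
            return 1;
    }
    return 0;
}

/* Interposed write: validate the target first, copy only if it is allowed. */
int checked_write(struct domain *d, void *dst, const void *src, size_t len)
{
    assert(d != NULL);
    if (!domain_may_write(d, dst, len)) {
        d->faults++;   /* the bug is trapped here instead of corrupting memory */
        return -1;
    }
    memcpy(dst, src, len);
    return 0;
}
```

A caller would grant a region with `domain_grant(&d, buf, sizeof buf)` and route the extension's writes through `checked_write`. A rejected write leaves the target untouched and increments `d.faults`, which a recovery mechanism could use as its trigger.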

Three factors motivated our research. First, computer system reliability remains a crucial but unsolved problem [20, 37]. While the cost of high-performance computing continues to drop, the cost of failures (e.g., downtime on a stock exchange or e-commerce server, or the manpower required to service a help-desk request in an office environment) continues to rise. In addition, the growing sector of “unmanaged” systems, such as digital appliances and consumer devices based on commodity hardware and software [24, 48], amplifies the need for reliability.

Second, OS extensions have become increasingly prevalent in commodity systems such as Linux (where they are called modules [5]) and Windows (where they are called drivers [11]). Extensions are optional components that reside in the kernel address space and typically communicate with the kernel through published interfaces. In addition to device drivers, extensions include file systems, virus detectors, and network protocols. Extensions now account for over 70% of Linux kernel code [10], while over 35,000 different drivers with over 120,000 versions exist on Windows XP desktops [39]. Many, if not most, of these extensions are written by programmers significantly less experienced in kernel organization and programming than those who built the operating system itself.

Third, extensions are a leading cause of operating system failure. In Windows XP, for example, drivers cause 85% of recently reported failures [39]. In Linux, the frequency of coding errors is seven times higher for device drivers than for the rest of the kernel [10]. While the core operating system kernel reaches high levels of reliability due to longevity and repeated testing, the extended operating system cannot be tested completely. With tens of thousands of extensions, operating system vendors cannot even identify them all, let alone test all possible combinations used in the marketplace.

Improving OS reliability will therefore require systems to become highly tolerant of failures in drivers and other extensions. Furthermore, the hundreds of millions of existing systems executing tens of thousands of extensions demand a reliability solution that is at once backward compatible and efficient for common extensions. Backward compatibility improves the reliability of already deployed systems. Efficiency avoids the classic tradeoff between robustness and performance.

Our focus on extensibility and reliability is not new. The last twenty years have produced a substantial amount of research on improving extensibility and reliability through the use of new kernel architectures [15], new driver architectures [38], user-level extensions [18, 31, 55], new hardware [16, 54], or type-safe languages [3].

While many of the underlying techniques used in Nooks have been used in previous systems, Nooks differs from earlier efforts in two key ways. First, we target existing extensions for commodity operating systems rather than propose a new extension architecture. We want today’s extensions to execute on today’s platforms without change if possible. Second, we use C, a conventional programming language. We do not ask developers to change languages, development environments, or, most importantly, perspective. Overall, we focus on a single and very serious problem – reducing the huge number of crashes due to drivers and other extensions. In the end, we hope to see an isolation service such as Nooks become standard on all non-performance-critical systems, from desktops to servers to embedded appliances.

We implemented a prototype of Nooks in the Linux operating system and experimented with a variety of kernel extension types, including several device drivers, a file system, and a kernel Web server. Using automatic fault injection [26], we show that when injecting synthetic bugs into extensions, Nooks can gracefully recover and restart the extension in 99% of the cases that cause Linux to crash. In addition, Nooks recovered from all of the common causes of kernel crashes that we manually inserted. Extension recovery occurs quickly, as compared to a full system reboot, leaving most applications running. For drivers – the most common extension type – the impact on performance is low to moderate. Finally, of the eight kernel extensions we isolated with Nooks, seven required no code changes, while only 13 lines changed in the eighth. Although our prototype is Linux based, we expect that the architecture and many implementation features would port readily to other commodity operating systems.

The rest of this paper describes the design, implementation and performance of Nooks. The next section summarizes related work in OS extensibility and reliability. Section 3 describes the system’s guiding principles and high-level architecture. Section 4 discusses the system’s implementation on Linux. We present experiments that evaluate the reliability of Nooks in Section 5 and its performance in Section 6. Section 7 summarizes our work and draws conclusions.

2. RELATED WORK

Our work differs from the substantial body of research on extensibility and reliability in many dimensions. Nooks relies on a conventional processor architecture, a conventional programming language, a conventional operating system architecture, and existing extensions. It is designed to be transparent to the extensions themselves, to support recoverability, and to impose only a modest performance penalty.

The major hardware approaches to improve reliability include capability-based architectures [25, 30, 36] and ring and segment architectures [27, 40]. These systems support fine-grained protection, enabling construction and isolation of privileged subsystems. The OS is extended by adding new privileged subsystems that exist in new domains or segments. Recovery is not specifically addressed in either architecture. In particular, capabilities support the fine-grained sharing of data. If one sharing component fails, recovery may be difficult for others sharing the same resource. Segmented architectures have been difficult to program and plagued by poor performance. In contrast, Nooks isolates existing code on commodity processors using standard virtual memory and runtime techniques, and it supports recovery through garbage collection of extension-allocated data.
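The idea of recovering by garbage-collecting extension-allocated data can be sketched in C as follows. This is a minimal user-space model, assuming a simple linked list of allocation records per extension: `struct extension`, `ext_alloc`, `ext_free`, and `ext_recover` are hypothetical names, and `malloc`/`free` stand in for whatever kernel allocators a real system would interpose on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: one object an extension has allocated. */
struct tracked_obj {
    void *ptr;
    struct tracked_obj *next;
};

/* Hypothetical per-extension bookkeeping. */
struct extension {
    const char *name;
    struct tracked_obj *objs;   /* objects still owned by this extension */
    size_t live;                /* number of live tracked objects */
};

/* Allocate on behalf of an extension and remember the allocation. */
void *ext_alloc(struct extension *ext, size_t size)
{
    struct tracked_obj *rec = malloc(sizeof(*rec));
    if (rec == NULL)
        return NULL;
    rec->ptr = malloc(size);
    if (rec->ptr == NULL) {
        free(rec);
        return NULL;
    }
    rec->next = ext->objs;
    ext->objs = rec;
    ext->live++;
    return rec->ptr;
}

/* Normal release path: free one object and drop its record.
 * Returns 0 on success, -1 if the pointer is not tracked for this extension. */
int ext_free(struct extension *ext, void *ptr)
{
    struct tracked_obj **link = &ext->objs;
    while (*link != NULL) {
        struct tracked_obj *rec = *link;
        if (rec->ptr == ptr) {
            *link = rec->next;
            free(rec->ptr);
            free(rec);
            assert(ext->live > 0);
            ext->live--;
            return 0;
        }
        link = &rec->next;
    }
    return -1;
}

/* Recovery path: release everything the failed extension still holds.
 * Returns the number of objects reclaimed. */
size_t ext_recover(struct extension *ext)
{
    size_t reclaimed = 0;
    while (ext->objs != NULL) {
        struct tracked_obj *rec = ext->objs;
        ext->objs = rec->next;
        free(rec->ptr);
        free(rec);
        reclaimed++;
    }
    ext->live = 0;
    return reclaimed;
}
```

Because every allocation goes through the tracker, the recovery path does not need the failed extension's cooperation: it walks the record list and frees whatever is left. A linked list keeps the sketch short; a hash table keyed by pointer would make `ext_free` faster at the cost of more code.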

Several projects have isolated kernel components through new operating system structures. Microkernels [31, 55] and their derivatives [15, 17, 23] promise another path to reliability. These systems isolate extensions into separate address spaces that interact with the OS through a kernel communication service, such as messages or remote procedure call [2]. Therefore, the failure of an extension within an address space does not necessarily crash the system. However, as in capability-based systems, recovery has received little attention in microkernel systems. In Mach, for example, a user-level system service can fail without crashing the kernel, but rebooting is often the only way to restart the service. Despite much research in fast inter-process communication (IPC) [2, 31], the reliance on separate address spaces raises performance concerns that have prevented adoption in commodity systems.

A number of transaction-based systems [41, 43] have applied recoverable database techniques within the OS to improve reliability. In some cases, such as the file system, the approach worked well, while in others it proved awkward and slow [41]. Like the language-based approaches, these strategies have limited applicability and audience. In contrast, Nooks integrates transparently into existing commodity systems without requiring architectural change.

An alternative to operating system-based isolation is the use of type-safe programming languages and run-time systems [3] that prevent many faults from occurring. Such systems can provide performance advantages, since compile-time checking enables lightweight run-time structures (e.g., local procedure calls rather than cross-domain calls). To date, however, OS suppliers have been unwilling to implement system code in type-safe, high-level languages. Moreover, the type-safe language approach makes it impossible to leverage the enormous existing code base. In contrast, Nooks requires no specialized programming language.

Footnote: [54] presents a similar approach in a newer context.

Recent years have seen the development of software techniques that enforce code correctness properties, e.g., software fault isolation [51] and self-verifying assembly code [34]. These technologies are attractive and might replace or augment some of Nooks’ isolation techniques. Nevertheless, in their proposed form, they deal only with the isolation problem, leaving unsolved the problems of transparent integration and recovery. Recently, techniques for verifying the integrity of extensions in existing operating systems have proven effective at revealing programming errors [14]. This static approach obviously complements our own dynamic one.

In the past, virtual memory techniques have been used to isolate specific components or data from corruption, e.g., in a database [46] or in the file system cache [35]. Nooks uses similar techniques to protect the operating system from erroneous extension behavior.

Virtual machine technologies [7, 9, 45, 53] have been proposed as a solution to the reliability problem. They can reduce the amount of code that can crash the whole machine. Virtualization techniques typically run several entire operating systems on top of a virtual machine, so faulty extensions in one operating system cause only a few applications to fail. However, if the extension executes in the virtual machine monitor, such as device drivers for physical devices, a fault causes all virtual machines and their applications to fail. While applications can be partitioned among virtual machines to limit the scope of failure, doing so removes the benefits of sharing within an operating system, such as fast IPC and intelligent scheduling. The challenge for reliable extensibility is not in virtualizing the underlying hardware; rather it lies in virtualizing only the interface between the kernel and extension. In fact, this is a major feature of the Nooks architecture.
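One way to picture "virtualizing only the interface between the kernel and extension" is a wrapper that every call into an extension must pass through, and that turns a fault inside the extension into an ordinary error return. The C sketch below is an assumption-laden model rather than the paper's mechanism: `call_extension`, `ext_fault`, and the two toy extensions are hypothetical, the model is single-threaded, and standard `setjmp`/`longjmp` stand in for the exception path that a real kernel would take on a processor fault.

```c
#include <assert.h>
#include <setjmp.h>

/* Recovery point for the extension call in progress (single-threaded model). */
static jmp_buf recovery_point;
static int in_extension = 0;

/* Hypothetical fault hook: abandons the running extension call.
 * In a real kernel this role would be played by an exception handler. */
void ext_fault(void)
{
    assert(in_extension);   /* only meaningful while an extension runs */
    longjmp(recovery_point, 1);
}

typedef int (*ext_entry_t)(int arg);

/* Wrapper around an extension entry point.
 * Returns 0 and stores the result on success, -1 if the extension faulted. */
int call_extension(ext_entry_t fn, int arg, int *result)
{
    if (setjmp(recovery_point) != 0) {
        /* Reached via longjmp: the extension faulted, *result is untouched. */
        in_extension = 0;
        return -1;
    }
    in_extension = 1;
    *result = fn(arg);
    in_extension = 0;
    return 0;
}

/* Toy extension that always succeeds. */
int doubling_ext(int x)
{
    return 2 * x;
}

/* Toy extension with a bug: it faults on negative input. */
int buggy_ext(int x)
{
    if (x < 0)
        ext_fault();
    return x;
}
```

With this wrapper, `call_extension(buggy_ext, -1, &r)` returns -1 and leaves `r` unchanged, while the caller keeps running and can decide whether to restart the extension. The extension code itself is unchanged; only the call path is interposed.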

More recently, researchers have begun to focus on recovery as a general technique for dealing with failure in complex systems [37]. In [6], for example, the authors propose a model of recursive recovery; in the model a complex software system is decomposed into a multi-level implementation where each layer can fail and recover independently. Nooks is certainly complementary, although our focus to date has been limited to operating system kernels.

Table 1 shows the changes to hardware architecture, operating system architecture, or extension architecture required by other approaches to reliability. Nooks, virtual machines, and static analysis techniques need no architectural changes.

3. ARCHITECTURE

The Nooks architecture is based on two core principles:

1. Design for fault resistance, not fault tolerance. The system must prevent and recover from most, but not necessarily all, extension failures.

2. Design for mistakes, not abuse. Extensions are generally well-behaved but may fail due to errors in design or implementation.

From the first principle, we are not seeking a complete solution for all possible extension errors. However, since extensions cause the vast majority of system failures, eliminating most of them will substantially improve system reliability. From the second principle, we have chosen to occupy the design space between “unprotected” and “safe.” That is, the extension architecture for conventional operating systems (such as Linux or Windows) is unprotected: nearly any bug within the extension can corrupt or crash the rest of the system. In contrast, safe systems (such as SPIN [3] or the Java Virtual Machine [21]) strictly limit extension behavior and thus make no distinction between buggy and malicious code. We trust kernel extensions not to be malicious, but we do not trust them not to be buggy.

                              Required Modifications
Approach                      Hardware    OS     Extension
Capabilities                  yes         yes    yes
Microkernels                  no          yes    yes
Languages                     no          yes    yes
New Driver Architectures      no          yes    yes
Transactions                  no          no     yes
Virtual Machines              no          no     no
Static Analysis               no          no     no
Nooks                         no          no     no

Table 1: Components that require architectural changes for various approaches to reliability. A “yes” in a cell indicates that the reliability mechanism on that row requires architectural change to the component listed at the top of the column.

The practical impact of these principles is substantial, both positively and negatively. On the positive side, it allows us to define an architecture that directly supports existing driver code with only moderate performance costs. On the negative side, our solution does not detect or recover from 100% of all possible failures and can be easily circumvented by malicious code acting within the kernel. As examples, consider a malfunctioning driver that continues to run and does not corrupt kernel data, but returns a packet that is one byte short, or a malicious driver that explicitly corrupts the system page table. We do not attempt to detect or correct such failures.

Among failures that can crash the system, a spectrum of possible defensive approaches exists. These range from the Windows approach (i.e., to preemptively crash to avoid data corruption) to the full virtual machine approach (i.e., to virtualize the entire architecture and provide total isolation). Our approach lies in the middle. Like all possible approaches, it reflects tradeoffs among performance, compatibility, complexity, and completeness. Section 4.6 describes our current limitations. Some limitations are architectural, while others are induced by the current hardware or software implementation. Despite these limitations, given tens of thousands of existing drivers, and the millions of failures they cause, a fault-resistant solution like the one we propose has practical implications and value.

3.1 Goals

Given the preceding principles, the Nooks architecture seeks to achieve three major goals:

1. Isolation. The architecture must isolate the kernel from extension failures. Consequently, it must detect
