Concrete Problems in AI Safety

Dario Amodei∗, Google Brain
Chris Olah∗, Google Brain
Jacob Steinhardt, Stanford University
Paul Christiano, UC Berkeley
John Schulman, OpenAI
Dan Mané, Google Brain

Abstract
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing atten-
tion to the potential impacts of AI technologies on society. In this paper we discuss one such
potential impact: the problem of accidents in machine learning systems, defined as unintended
and harmful behavior that may emerge from poor design of real-world AI systems. We present a
list of five practical research problems related to accident risk, categorized according to whether
the problem originates from having the wrong objective function (“avoiding side effects” and
“avoiding reward hacking”), an objective function that is too expensive to evaluate frequently
(“scalable supervision”), or undesirable behavior during the learning process (“safe exploration”
and “distributional shift”). We review previous work in these areas as well as suggesting re-
search directions with a focus on relevance to cutting-edge AI systems. Finally, we consider
the high-level question of how to think most productively about the safety of forward-looking
applications of AI.
1 Introduction
The last few years have seen rapid progress on long-standing, difficult problems in machine learning
and artificial intelligence (AI), in areas as diverse as computer vision [82], video game playing [102],
autonomous vehicles [86], and Go [140]. These advances have brought excitement about the positive
potential for AI to transform medicine [126], science [59], and transportation [86], along with concerns
about the privacy [76], security [115], fairness [3], economic [32], and military [16] implications of
autonomous systems, as well as concerns about the longer-term implications of powerful AI [27, 167].
The authors believe that AI technologies are likely to be overwhelmingly beneficial for humanity, but
we also believe that it is worth giving serious thought to potential challenges and risks. We strongly
support work on privacy, security, fairness, economics, and policy, but in this document we discuss
another class of problem which we believe is also relevant to the societal impacts of AI: the problem
of accidents in machine learning systems. We define accidents as unintended and harmful behavior
that may emerge from machine learning systems when we specify the wrong objective function, are
not careful about the learning process, or commit other machine learning-related implementation
errors.

∗ These authors contributed equally.

arXiv:1606.06565v2 [cs.AI] 25 Jul 2016
There is a large and diverse literature in the machine learning community on issues related to
accidents, including robustness, risk-sensitivity, and safe exploration; we review these in detail below.
However, as machine learning systems are deployed in increasingly large-scale, autonomous, open-
domain situations, it is worth reflecting on the scalability of such approaches and understanding
what challenges remain to reducing accident risk in modern machine learning systems. Overall, we
believe there are many concrete open technical problems relating to accident prevention in machine
learning systems.
There has been a great deal of public discussion around accidents. To date much of this discussion has
highlighted extreme scenarios such as the risk of misspecified objective functions in superintelligent
agents [27]. However, in our opinion one need not invoke these extreme scenarios to productively
discuss accidents, and in fact doing so can lead to unnecessarily speculative discussions that lack
precision, as noted by some critics [38, 85]. We believe it is usually most productive to frame accident
risk in terms of practical (though often quite general) issues with modern ML techniques. As AI
capabilities advance and as AI systems take on increasingly important societal functions, we expect
the fundamental challenges discussed in this paper to become increasingly important. The more
successfully the AI and machine learning communities are able to anticipate and understand these
fundamental technical challenges, the more successful we will ultimately be in developing increasingly
useful, relevant, and important AI systems.
Our goal in this document is to highlight a few concrete safety problems that are ready for ex-
perimentation today and relevant to the cutting edge of AI systems, as well as reviewing existing
literature on these problems. In Section 2, we frame mitigating accident risk (often referred to as
“AI safety” in public discussions) in terms of classic methods in machine learning, such as supervised
classification and reinforcement learning. We explain why we feel that recent directions in machine
learning, such as the trend toward deep reinforcement learning and agents acting in broader environ-
ments, suggest an increasing relevance for research around accidents. In Sections 3-7, we explore five
concrete problems in AI safety. Each section is accompanied by proposals for relevant experiments.
Section 8 discusses related efforts, and Section 9 concludes.
2 Overview of Research Problems
Very broadly, an accident can be described as a situation where a human designer had in mind
a certain (perhaps informally specified) objective or task, but the system that was designed and
deployed for that task produced harmful and unexpected results. This issue arises in almost any
engineering discipline, but may be particularly important to address when building AI systems [146].
We can categorize safety problems according to where in the process things went wrong.
First, the designer may have specified the wrong formal objective function, such that maximizing that
objective function leads to harmful results, even in the limit of perfect learning and infinite data.
Negative side effects (Section 3) and reward hacking (Section 4) describe two broad mechanisms
that make it easy to produce wrong objective functions. In “negative side effects”, the designer
specifies an objective function that focuses on accomplishing some specific task in the environment,
but ignores other aspects of the (potentially very large) environment, and thus implicitly expresses
indifference over environmental variables that might actually be harmful to change. In “reward
hacking”, the objective function that the designer writes down admits of some clever “easy” solution
that formally maximizes it but perverts the spirit of the designer’s intent (i.e. the objective function
can be “gamed”), a generalization of the wireheading problem.
Second, the designer may know the correct objective function, or at least have a method of evaluating
it (for example explicitly consulting a human on a given situation), but it is too expensive to do so
frequently, leading to possible harmful behavior caused by bad extrapolations from limited samples.
“Scalable oversight” (Section 5) discusses ideas for how to ensure safe behavior even given limited
access to the true objective function.
Third, the designer may have specified the correct formal objective, such that we would get the
correct behavior were the system to have perfect beliefs, but something bad occurs due to making
decisions from insufficient or poorly curated training data or an insufficiently expressive model.
“Safe exploration” (Section 6) discusses how to ensure that exploratory actions in RL agents don’t
lead to negative or irrecoverable consequences that outweigh the long-term value of exploration.
“Robustness to distributional shift” (Section 7) discusses how to avoid having ML systems make bad
decisions (particularly silent and unpredictable bad decisions) when given inputs that are potentially
very different than what was seen during training.
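One common mitigation for the distributional-shift problem described above is to have a model abstain, rather than silently predict, on inputs far from anything it saw during training. The following is a minimal sketch of that idea under our own assumptions (a nearest-neighbor distance with a hand-picked threshold; it is not a method proposed in this paper):

```python
# Toy sketch: flag inputs whose nearest training example is too far away,
# so the system abstains instead of making a silent, unpredictable error.
import numpy as np

train = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])  # toy training inputs

def predict_or_abstain(x, train_inputs, threshold=2.0):
    """Abstain if the input's nearest training neighbor exceeds the threshold."""
    nearest = np.min(np.linalg.norm(train_inputs - x, axis=1))
    return "abstain" if nearest > threshold else "predict"

print(predict_or_abstain(np.array([0.6, 0.6]), train))    # in-distribution
print(predict_or_abstain(np.array([10.0, 10.0]), train))  # far from training data
```

The distance metric and threshold here are arbitrary; the point is only the abstention pattern, not a serious out-of-distribution detector.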
For concreteness, we will illustrate many of the accident risks with reference to a fictional robot
whose job is to clean up messes in an office using common cleaning tools. We return to the example
of the cleaning robot throughout the document, but here we begin by illustrating how it could behave
undesirably if its designers fall prey to each of the possible failure modes:
• Avoiding Negative Side Effects: How can we ensure that our cleaning robot will not
disturb the environment in negative ways while pursuing its goals, e.g. by knocking over a
vase because it can clean faster by doing so? Can we do this without manually specifying
everything the robot should not disturb?
• Avoiding Reward Hacking: How can we ensure that the cleaning robot won’t game its
reward function? For example, if we reward the robot for achieving an environment free of
messes, it might disable its vision so that it won’t find any messes, or cover over messes with
materials it can’t see through, or simply hide when humans are around so they can’t tell it
about new types of messes.
• Scalable Oversight: How can we efficiently ensure that the cleaning robot respects aspects of
the objective that are too expensive to be frequently evaluated during training? For instance, it
should throw out things that are unlikely to belong to anyone, but put aside things that might
belong to someone (it should handle stray candy wrappers differently from stray cellphones).
Asking the humans involved whether they lost anything can serve as a check on this, but this
check might have to be relatively infrequent—can the robot find a way to do the right thing
despite limited information?
• Safe Exploration: How do we ensure that the cleaning robot doesn’t make exploratory
moves with very bad repercussions? For example, the robot should experiment with mopping
strategies, but putting a wet mop in an electrical outlet is a very bad idea.
• Robustness to Distributional Shift: How do we ensure that the cleaning robot recognizes,
and behaves robustly, when in an environment different from its training environment? For
example, strategies it learned for cleaning an office might be dangerous on a factory workfloor.
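The side-effects failure mode in the first bullet can be made concrete with a toy comparison of two plans (the scenario, rewards, and penalty values below are our own invented illustration, not anything from the paper): because the written reward never mentions the vase, the reward-optimal plan breaks it.

```python
# Two candidate plans for moving a box. The "breaks_vase" field is a
# world-state change the designer forgot to encode in the reward.
PLANS = {
    "short_path": {"steps": 4, "breaks_vase": True},
    "long_path": {"steps": 7, "breaks_vase": False},
}

STEP_COST = -1.0    # small penalty per step taken
TASK_REWARD = 10.0  # reward for delivering the box

def misspecified_reward(plan):
    """Reward as actually specified: task completion minus step cost.
    The vase never appears, so the objective is indifferent to it."""
    return TASK_REWARD + STEP_COST * plan["steps"]

def intended_reward(plan, vase_penalty=-5.0):
    """What the designer meant: the same objective plus a hand-coded
    side-effect penalty. Enumerating every such penalty is the part
    that does not scale to large environments."""
    return misspecified_reward(plan) + (vase_penalty if plan["breaks_vase"] else 0.0)

best_specified = max(PLANS, key=lambda p: misspecified_reward(PLANS[p]))
best_intended = max(PLANS, key=lambda p: intended_reward(PLANS[p]))
print(best_specified)  # the short, vase-breaking path wins under the written reward
print(best_intended)   # the careful path wins once the side effect is priced in
```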
There are several trends which we believe point towards an increasing need to address these (and
other) safety problems. First is the increasing promise of reinforcement learning (RL), which al-
lows agents to have a highly intertwined interaction with their environment. Some of our research
problems only make sense in the context of RL, and others (like distributional shift and scalable
oversight) gain added complexity in an RL setting. Second is the trend toward more complex agents
and environments. “Side effects” are much more likely to occur in a complex environment, and an
agent may need to be quite sophisticated to hack its reward function in a dangerous way. This may
explain why these problems have received so little study in the past, while also suggesting their
importance in the future. Third is the general trend towards increasing autonomy in AI systems.
Systems that simply output a recommendation to human users, such as speech systems, typically
have relatively limited potential to cause harm. By contrast, systems that exert direct control over
the world, such as machines controlling industrial processes, can cause harms in a way that humans
cannot necessarily correct or oversee.
While safety problems can exist without any of these three trends, we consider each trend to be a
possible amplifier on such challenges. Together, we believe these trends suggest an increasing role
for research on accidents.
When discussing the problems in the remainder of this document, we will focus for concreteness on
either RL agents or supervised learning systems. These are not the only possible paradigms for AI
or ML systems, but we believe they are sufficient to illustrate the issues we have in mind, and that
similar issues are likely to arise for other kinds of AI systems.
Finally, the focus of our discussion will differ somewhat from section to section. When discussing
the problems that arise as part of the learning process (distributional shift and safe exploration),
where there is a sizable body of prior work, we devote substantial attention to reviewing this prior
work, although we also suggest open problems with a particular focus on emerging ML systems.
When discussing the problems that arise from having the wrong objective function (reward hacking
and side effects, and to a lesser extent scalable supervision), where less prior work exists, our aim is
more exploratory—we seek to more clearly define the problem and suggest possible broad avenues
of attack, with the understanding that these avenues are preliminary ideas that have not been fully
fleshed out. Of course, we still review prior work in these areas, and we draw attention to relevant
adjacent areas of research whenever possible.
3 Avoiding Negative Side Effects
Suppose a designer wants an RL agent (for example our cleaning robot) to achieve some goal, like
moving a box from one side of a room to the other. Sometimes the most effective way to achieve
the goal involves doing something unrelated and destructive to the rest of the environment, like
knocking over a vase of water that is in its path. If the agent is given reward only for moving the
box, it will probably knock over the vase.
If we’re worried in advance about the vase, we can always give the agent negative reward for knocking
it over. But what if there are many different kinds of “vase”—many disruptive things the agent could
do to the environment, like shorting out an electrical socket or damaging the walls of the room? It
may not be feasible to identify and penalize every possible disruption.
More broadly, for an agent operating in a large, multifaceted environment, an objective function
that focuses on only one aspect of the environment may implicitly express indifference over other
aspects of the environment.¹ An agent optimizing this objective function might thus engage in
major disruptions of the broader environment if doing so provides even a tiny advantage for the
task at hand. Put differently, objective functions that formalize “perform task X” may frequently
give undesired results, because what the designer really should have formalized is closer to “perform
task X subject to common-sense constraints on the environment,” or perhaps “perform task X but
avoid side effects to the extent possible.” Furthermore, there is reason to expect side effects to be
negative on average, since they tend to disrupt the wider environment away from a status quo state
that may reflect human preferences. A version of this problem has been discussed informally by [13]
under the heading of “low impact agents.”

¹ Intuitively, this seems related to the frame problem, an obstacle in efficient specification for knowledge representation raised by [95].
As with the other sources of mis-specified objective functions discussed later in this paper, we could
choose to view side effects as idiosyncratic to each individual task—as the responsibility of each
individual designer to capture as part of designing the correct objective function. However, side
effects can be conceptually quite similar even across highly diverse tasks (knocking over furniture
is probably bad for a wide variety of tasks), so it seems worth trying to attack the problem in
generality. A successful approach might be transferable across tasks, and thus help to counteract
one of the general mechanisms that produces wrong objective functions. We now discuss a few broad
approaches to attacking this problem:
• Define an Impact Regularizer: If we don’t want side effects, it seems natural to penalize
“change to the environment.” This idea wouldn’t be to stop the agent from ever having an
impact, but give it a preference for ways to achieve its goals with minimal side effects, or
to give the agent a limited “budget” of impact. The challenge is that we need to formalize
“change to the environment.”
A very naive approach would be to penalize state distance, d(s_i, s_0), between the present state
s_i and some initial state s_0. Unfortunately, such an agent wouldn’t just avoid changing the
environment—it will resist any other source of change, including the natural evolution of the
environment and the actions of any other agents!
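This failure mode can be shown in a few lines. The sketch below is our own (the vector state encoding, Euclidean distance, and penalty weight are assumptions): the penalty d(s_i, s_0) is blind to who caused a change, so a disruption made by the agent and an identical-magnitude change made by the outside world are penalized equally.

```python
# Naive impact regularizer: reward = task_reward - beta * d(s_i, s_0).
# The penalty cannot distinguish changes the agent caused from changes
# caused by other agents or the environment's natural evolution.
import numpy as np

def regularized_reward(task_reward, state, initial_state, beta=0.1):
    """Penalize Euclidean distance between the current state and s_0."""
    return task_reward - beta * np.linalg.norm(state - initial_state)

s0 = np.array([0.0, 0.0, 0.0])

# Case 1: the agent itself disturbed the environment.
s_agent = np.array([3.0, 0.0, 0.0])
# Case 2: the agent did nothing; a person moved things / a fan spun.
s_external = np.array([0.0, 3.0, 0.0])

r_agent = regularized_reward(1.0, s_agent, s0)
r_external = regularized_reward(1.0, s_external, s0)
print(r_agent == r_external)  # True: the two cases are indistinguishable
```

An agent maximizing this regularized reward is therefore incentivized to actively undo, or prevent, changes it did not cause, which is exactly the pathology noted above.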
A slightly more sophisticated approach might involve comparing the future state under the
agent’s current policy, to the future state (or distribution over future states) under a hypothetical
policy π_null where the agent acted very passively (for instance, where a robot just stood in
place and didn’t move any actuators). This attempts to factor out changes that occur in the
natural course of the environment’s evolution, leaving only changes attributable to the agent’s
intervention. However, defining the baseline policy π_null isn’t necessarily straightforward, since
suddenly ceasing your course of action may be anything but passive, as in the case of carrying
a heavy box. Thus, another approach could be to replace the null action with a known safe
(e.g. low side effect) but suboptimal policy, and then seek to improve the policy from there,
somewhat reminiscent of reachability analysis [93, 100] or robust policy improvement [73, 111].
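The baseline-comparison idea can be sketched with toy dynamics (the environment model, the two policies, and the distance below are stand-ins we invented for illustration): by diffing against a rollout of π_null from the same start, the natural drift of the world cancels out and only the agent's own contribution is penalized.

```python
# Compare the state reached under the agent's policy with the state
# reached under a passive baseline policy, so natural drift cancels out.
import numpy as np

def rollout(policy, start, steps, drift):
    """Toy dynamics: the world drifts on its own; the policy adds to that."""
    s = np.array(start, dtype=float)
    for _ in range(steps):
        s = s + drift + policy(s)
    return s

drift = np.array([0.5, 0.0])                     # natural evolution of the environment
null_policy = lambda s: np.zeros(2)              # "stand still" baseline (pi_null)
acting_policy = lambda s: np.array([0.0, 1.0])   # the agent pushes things around

start = [0.0, 0.0]
s_agent = rollout(acting_policy, start, steps=4, drift=drift)
s_baseline = rollout(null_policy, start, steps=4, drift=drift)

naive_impact = np.linalg.norm(s_agent - np.zeros(2))  # blames the drift too
baseline_impact = np.linalg.norm(s_agent - s_baseline)  # isolates the agent's effect
print(naive_impact > baseline_impact)  # True: the baseline removes the drift term
```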
These approaches may be very sensitive to the representation of the state and the metric being
used to compute the distance. For example, the choice of representation and distance metric
could determine whether a spinning fan is a constant environment or a constantly changing
one.
• Learn an Impact Regularizer: An alternative, more flexible approach is to learn (rather
than define) a generalized impact regularizer via training over many tasks. This would be
an instance of transfer learning. Of course, we could attempt to just apply transfer learning
directly to the tasks themselves instead of worrying about side effects, but the point is that side
effects may be more similar across tasks than the main goal is. For instance, both a painting
robot and a cleaning robot probably want to avoid knocking over furniture, and even something
very different, like a factory control robot, will likely want to avoid knocking over very similar
objects. Separating the side effect component from the task component, by training them
with separate parameters, might substantially speed transfer learning in cases where it makes
sense to retain one component but not the other. This would be similar to model-based RL
approaches that attempt to transfer a learned dynamics model but not the value-function [155],
the novelty being the isolation of side effects rather than state dynamics as the transferrable
component. As an added advantage, regularizers that were known or certified to produce safe
behavior on one task might be easier to establish as safe on other tasks.
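One way to sketch the separate-parameters idea above (purely our own construction; the paper proposes no concrete architecture) is an objective split into a task-specific component and a shared side-effect component, where only the side-effect parameters carry over to a new task:

```python
# Split objective: per-task parameters plus a shared side-effect
# regularizer, so the regularizer can transfer while the task is retrained.
import numpy as np

rng = np.random.default_rng(0)

class SplitObjective:
    def __init__(self, dim):
        self.task_w = rng.normal(size=dim)    # task-specific parameters
        self.impact_w = rng.normal(size=dim)  # shared side-effect parameters

    def score(self, features):
        """Task score minus a learned impact penalty on the same features."""
        return self.task_w @ features - np.abs(self.impact_w @ features)

    def transfer_to_new_task(self):
        """Keep the learned impact term; reinitialize only the task term."""
        new = SplitObjective(len(self.task_w))
        new.impact_w = self.impact_w.copy()
        return new

cleaner = SplitObjective(dim=8)
painter = cleaner.transfer_to_new_task()
# The side-effect component carries over unchanged; the task component does not.
print(np.array_equal(painter.impact_w, cleaner.impact_w))  # True
```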
• Penalize Influence: In addition to not doing things that have side effects, we might also
prefer the agent not get into positions where it could easily do things that have side effects,
even though that might be convenient. For example, we might prefer our cleaning robot not