Interactive Learning from Policy-Dependent Human Feedback
James MacGlashan¹, Mark K Ho², Robert Loftin³, Bei Peng⁴, Guan Wang², David L. Roberts³, Matthew E. Taylor⁴, Michael L. Littman²

*Equal contribution. ¹Cogitai, ²Brown University, ³North Carolina State University, ⁴Washington State University. Correspondence to: James MacGlashan <james@cogitai.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
Abstract
This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent of the learner's current policy. We present empirical results that show this assumption to be false: whether human trainers give positive or negative feedback for a decision is influenced by the learner's current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.
1. Introduction
Programming robots is very difficult, in part because the real world is inherently rich and, to some degree, unpredictable. In addition, our expectations for physical agents are quite high and often difficult to articulate. Nevertheless, for robots to have a significant impact on the lives of individuals, even non-programmers need to be able to specify and customize behavior. Because of these complexities, relying on end-users to provide instructions to robots programmatically seems destined to fail.
Reinforcement learning (RL) from human trainer feedback provides a compelling alternative to programming because agents can learn complex behavior from very simple positive and negative signals. Furthermore, real-world animal training is an existence proof that people can train complex behavior using these simple signals. Indeed, animals have been successfully trained to guide the blind, locate mines in the ocean, detect cancer or explosives, and even solve complex, multi-stage puzzles.
Despite success when learning from environmental reward, traditional reinforcement-learning algorithms have yielded limited success when the reward signal is provided by humans. This failure underscores the importance of basing algorithms that learn from humans on appropriate models of human feedback. Indeed, much human-centered RL work has investigated and employed different models of human feedback (Knox & Stone, 2009b; Thomaz & Breazeal, 2006; 2007; 2008; Griffith et al., 2013; Loftin et al., 2015). Many of these algorithms leverage the observation that people tend to give feedback that is best interpreted as guidance on the policy the agent should be following, rather than as a numeric value to be maximized by the agent. However, these approaches assume models of feedback that are independent of the policy the agent is currently following. We present empirical results that demonstrate that this assumption is incorrect, and we further demonstrate cases in which policy-independent learning algorithms suffer from this assumption.
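To make the distinction concrete, the following is a minimal Python sketch (our own illustration, with all names and constants assumed rather than taken from the paper) contrasting the two feedback models in a one-state toy task: a policy-independent trainer returns the same signal for a given action regardless of what the learner currently does, while a policy-dependent trainer's signal for the very same action shifts as the learner's policy improves.

# Toy setup (illustrative assumption): one state, and "right" is the
# action the trainer wants to teach.
TARGET_ACTION = "right"

def policy_independent_feedback(action):
    # Feedback depends only on the action taken: +1 if correct, -1 otherwise.
    # This is the assumption built into policy-independent models.
    return 1.0 if action == TARGET_ACTION else -1.0

def policy_dependent_feedback(action, p_correct):
    # Feedback also depends on the learner's current policy, summarized by
    # p_correct: the probability that the policy already takes the target
    # action. Praise shrinks as the behavior becomes routine (diminishing
    # returns), and mistakes are punished more once they are unexpected.
    if action == TARGET_ACTION:
        return 1.0 - p_correct
    return -p_correct

for p_correct in (0.1, 0.5, 0.9):
    print(policy_independent_feedback("right"),          # always +1.0
          policy_dependent_feedback("right", p_correct)) # 0.9, then 0.5, then 0.1

In this toy case, the policy-dependent signal is exactly the advantage of each action in the corresponding one-state problem, which anticipates the feedback model introduced next.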
Following this result, we present Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent human feedback. COACH is based on the insight that the advantage function (a value roughly corresponding to how much better or worse an action is compared to the current policy) provides a better model of human feedback, capturing human-feedback properties like diminishing returns, rewarding improvement, and giving 0-valued feedback a semantic meaning that combats forgetting. We compare COACH to other approaches in a simple domain with simulated feedback. Then, to validate that COACH scales to complex problems, we train five different behaviors on a TurtleBot robot.
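To make the update concrete, the following is a minimal sketch of a COACH-style step in Python. It is our illustration of the core idea, with the human's scalar feedback standing in for the advantage term of an actor-critic policy-gradient update; the tabular softmax policy, learning rate, and simulated trainer are assumptions of the example rather than details from the paper.

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 3               # illustrative toy sizes
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters
ALPHA = 0.1                              # assumed learning rate

def policy(state):
    # Softmax action probabilities for one state.
    prefs = theta[state] - theta[state].max()
    expd = np.exp(prefs)
    return expd / expd.sum()

def coach_style_update(state, action, feedback):
    # One COACH-style step: the human's scalar feedback f plays the role
    # of the advantage in a policy-gradient update,
    #     theta <- theta + alpha * f * grad log pi(action | state).
    grad = -policy(state)
    grad[action] += 1.0                  # gradient of the log-softmax
    theta[state] += ALPHA * feedback * grad

# Toy usage with a simulated trainer who wants action 0 in state 2:
for _ in range(500):
    s = int(rng.integers(N_STATES))
    a = int(rng.choice(N_ACTIONS, p=policy(s)))
    if s == 2:
        coach_style_update(s, a, 1.0 if a == 0 else -1.0)

print(policy(2))  # action 0 should now clearly dominate in state 2

Note that because the feedback multiplies the gradient term, 0-valued feedback leaves the policy parameters untouched, matching the semantic meaning of zero feedback described above.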
2. Background
For modeling the underlying decision-making problem of an agent being taught by a human, we adopt the Markov Decision Process (MDP) formalism. An MDP is a 5-tuple ⟨S, A, T, R, γ⟩, where S is the set of possible states of the environment, A is the set of actions available to the agent, T is the transition function giving the probability of each next state when an action is taken in a state, R is the reward function, and γ ∈ [0, 1] is the discount factor.
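For later reference, the advantage function on which COACH's feedback model is based is defined from the standard value functions. Writing r_t for the reward received at step t under policy π (standard definitions, stated here for completeness):

V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, π ]
Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ]
A^π(s, a) = Q^π(s, a) − V^π(s)

A^π(s, a) is positive when action a is better than the current policy's average behavior in state s and negative when it is worse.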