Interactive Learning from Policy-Dependent Human Feedback
James MacGlashan¹, Mark K Ho², Robert Loftin³, Bei Peng⁴, Guan Wang², David L. Roberts³, Matthew E. Taylor⁴, Michael L. Littman²

*Equal contribution. ¹Cogitai, ²Brown University, ³North Carolina State University, ⁴Washington State University. Correspondence to: James MacGlashan <james@cogitai.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
Abstract
This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent of the learner's current policy. We present empirical results that show this assumption to be false: whether human trainers give positive or negative feedback for a decision is influenced by the learner's current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.
1. Introduction
Programming robots is very difficult, in part because the real world is inherently rich and, to some degree, unpredictable. In addition, our expectations for physical agents are quite high and often difficult to articulate. Nevertheless, for robots to have a significant impact on the lives of individuals, even non-programmers need to be able to specify and customize behavior. Because of these complexities, relying on end-users to provide instructions to robots programmatically seems destined to fail.
Reinforcement learning (RL) from human trainer feedback provides a compelling alternative to programming because agents can learn complex behavior from very simple positive and negative signals. Furthermore, real-world animal training is an existence proof that people can train complex behavior using these simple signals. Indeed, animals have been successfully trained to guide the blind, locate mines in the ocean, detect cancer or explosives, and even solve complex, multi-stage puzzles.
Despite success when learning from environmental reward, traditional reinforcement-learning algorithms have yielded limited success when the reward signal is provided by humans. This failure underscores the importance of basing algorithms that learn from humans on appropriate models of human feedback. Indeed, much human-centered RL work has investigated and employed different models of human feedback (Knox & Stone, 2009b; Thomaz & Breazeal, 2006; 2007; 2008; Griffith et al., 2013; Loftin et al., 2015). Many of these algorithms leverage the observation that people tend to give feedback that is best interpreted as guidance on the policy the agent should be following, rather than as a numeric value to be maximized by the agent. However, these approaches assume models of feedback that are independent of the policy the agent is currently following. We present empirical results that demonstrate that this assumption is incorrect, and we further demonstrate cases in which policy-independent learning algorithms suffer from this assumption.
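To make the distinction concrete, the following is a minimal Python sketch (our own illustration, with all names and constants assumed rather than taken from the paper) contrasting the two feedback models in a one-state toy task: a policy-independent trainer returns the same signal for a given action regardless of what the learner currently does, while a policy-dependent trainer's signal for the very same action shifts as the learner's policy improves.

# Toy setup (illustrative assumption): one state, and "right" is the
# action the trainer wants to teach.
TARGET_ACTION = "right"

def policy_independent_feedback(action):
    # Feedback depends only on the action taken: +1 if correct, -1 otherwise.
    # This is the assumption built into policy-independent models.
    return 1.0 if action == TARGET_ACTION else -1.0

def policy_dependent_feedback(action, p_correct):
    # Feedback also depends on the learner's current policy, summarized by
    # p_correct: the probability that the policy already takes the target
    # action. Praise shrinks as the behavior becomes routine (diminishing
    # returns), and mistakes are punished more once they are unexpected.
    if action == TARGET_ACTION:
        return 1.0 - p_correct
    return -p_correct

for p_correct in (0.1, 0.5, 0.9):
    print(policy_independent_feedback("right"),          # always +1.0
          policy_dependent_feedback("right", p_correct)) # 0.9, then 0.5, then 0.1

In this toy case, the policy-dependent signal is exactly the advantage of each action in the corresponding one-state problem, which anticipates the feedback model introduced next.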
Following this result, we present Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent human feedback. COACH is based on the insight that the advantage function (a value roughly corresponding to how much better or worse an action is compared to the current policy) provides a better model of human feedback, capturing human-feedback properties like diminishing returns, rewarding improvement, and giving 0-valued feedback a semantic meaning that combats forgetting. We compare COACH to other approaches in a simple domain with simulated feedback. Then, to validate that COACH scales to complex problems, we train five different behaviors on a TurtleBot robot.
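To make the update concrete, the following is a minimal sketch of a COACH-style step in Python. It is our illustration of the core idea, with the human's scalar feedback standing in for the advantage term of an actor-critic policy-gradient update; the tabular softmax policy, learning rate, and simulated trainer are assumptions of the example rather than details from the paper.

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 3               # illustrative toy sizes
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters
ALPHA = 0.1                              # assumed learning rate

def policy(state):
    # Softmax action probabilities for one state.
    prefs = theta[state] - theta[state].max()
    expd = np.exp(prefs)
    return expd / expd.sum()

def coach_style_update(state, action, feedback):
    # One COACH-style step: the human's scalar feedback f plays the role
    # of the advantage in a policy-gradient update,
    #     theta <- theta + alpha * f * grad log pi(action | state).
    grad = -policy(state)
    grad[action] += 1.0                  # gradient of the log-softmax
    theta[state] += ALPHA * feedback * grad

# Toy usage with a simulated trainer who wants action 0 in state 2:
for _ in range(500):
    s = int(rng.integers(N_STATES))
    a = int(rng.choice(N_ACTIONS, p=policy(s)))
    if s == 2:
        coach_style_update(s, a, 1.0 if a == 0 else -1.0)

print(policy(2))  # action 0 should now clearly dominate in state 2

Note that because the feedback multiplies the gradient term, 0-valued feedback leaves the policy parameters untouched, matching the semantic meaning of zero feedback described above.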
2. Background
For modeling the underlying decision-making problem of an agent being taught by a human, we adopt the Markov Decision Process (MDP) formalism. An MDP is a 5-tuple ⟨S, A, T, R, γ⟩, where S is the set of possible states of the environment, A is the set of actions available to the agent, T is the transition function giving the probability of each next state when an action is taken in a state, R is the reward function, and γ ∈ [0, 1] is the discount factor.
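For later reference, the advantage function on which COACH's feedback model is based is defined from the standard value functions. Writing r_t for the reward received at step t under policy π (standard definitions, stated here for completeness):

V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, π ]
Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ]
A^π(s, a) = Q^π(s, a) − V^π(s)

A^π(s, a) is positive when action a is better than the current policy's average behavior in state s and negative when it is worse.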