One-Shot Imitation Learning
Yan Duan†§, Marcin Andrychowicz‡, Bradly Stadie†‡, Jonathan Ho†§,
Jonas Schneider‡, Ilya Sutskever‡, Pieter Abbeel†§, Wojciech Zaremba‡

† Berkeley AI Research Lab, ‡ OpenAI, § Work done while at OpenAI

{rockyduan, jonathanho, pabbeel}@eecs.berkeley.edu
{marcin, bstadie, jonas, ilyasu, woj}@openai.com
Abstract
Imitation learning has been commonly applied to solve different tasks in isolation.
This usually requires either careful feature engineering, or a significant number of
samples. This is far from what we desire: ideally, robots should be able to learn
from very few demonstrations of any given task, and instantly generalize to new
situations of the same task, without requiring task-specific engineering. In this
paper, we propose a meta-learning framework for achieving such capability, which
we call one-shot imitation learning.
Specifically, we consider the setting where there is a very large (maybe infinite)
set of tasks, and each task has many instantiations. For example, a task could be
to stack all blocks on a table into a single tower, another task could be to place
all blocks on a table into two-block towers, etc. In each case, different instances
of the task would consist of different sets of blocks with different initial states.
At training time, our algorithm is presented with pairs of demonstrations for a
subset of all tasks. A neural net is trained such that when it takes as input the first
demonstration and a state sampled from the second demonstration,
it should predict the action corresponding to the sampled state. At test time, a full
demonstration of a single instance of a new task is presented, and the neural net
is expected to perform well on new instances of this new task. Our experiments
show that the use of soft attention allows the model to generalize to conditions and
tasks unseen in the training data. We anticipate that by training this model on a
much greater variety of tasks and settings, we will obtain a general system that can
turn any demonstrations into robust policies that can accomplish an overwhelming
variety of tasks.
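The abstract's training procedure can be summarized as behavioral cloning conditioned on a demonstration: sample a task, sample two demonstrations of it, and train the policy to map (first demonstration, state from second demonstration) to the corresponding expert action. Below is a minimal sketch of that loop; the helper objects `task_distribution`, `sample_demo_pair`, `sample_state_action`, and the conditional policy network `policy_net` are hypothetical placeholders introduced here for illustration, not interfaces from the paper.

```python
# Minimal sketch of the one-shot imitation meta-training loop (assumptions noted above).
import torch
import torch.nn as nn


def train_one_shot_imitation(policy_net, task_distribution, optimizer, num_iters=10000):
    """Behavioral-cloning-style training over pairs of demonstrations of the same task."""
    loss_fn = nn.MSELoss()  # assuming continuous actions; a classification loss would work for discrete ones
    for _ in range(num_iters):
        task = task_distribution.sample_task()        # e.g. one block-stacking configuration
        demo_a, demo_b = task.sample_demo_pair()      # two demonstrations of the same task
        # Pick a state and the expert action taken there from the *second* demonstration.
        state, expert_action = demo_b.sample_state_action()
        # Condition the policy on the *first* demonstration plus the sampled state.
        predicted_action = policy_net(demo_a, state)
        loss = loss_fn(predicted_action, expert_action)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

At test time the same `policy_net` would be given a single demonstration of an unseen task instance and rolled out on new instances of that task.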
1 Introduction
We are interested in robotic systems that are able to perform a variety of complex useful tasks, e.g.
tidying up a home or preparing a meal. The robot should be able to learn new tasks without long
system interaction time. To accomplish this, we must solve two broad problems. The first problem is
that of dexterity: robots should learn how to approach, grasp and pick up complex objects, and how
to place or arrange them into a desired configuration. The second problem is that of communication:
how to communicate the intent of the task at hand, so that the robot can replicate it in a broader set of
initial conditions.
Demonstrations are an extremely convenient form of information we can use to teach robots to overcome
these two challenges. Using demonstrations, we can unambiguously communicate essentially
any manipulation task, and simultaneously provide clues about the specific motor skills required to
perform the task. We can compare this with an alternative form of communication, namely natural
language. Although language is highly versatile, effective, and efficient, natural language processing