Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

Jie Tan¹, Tingnan Zhang¹, Erwin Coumans¹, Atil Iscen¹, Yunfei Bai², Danijar Hafner¹, Steven Bohez³, and Vincent Vanhoucke¹

¹Google Brain, ²X, ³Google DeepMind
Abstract—Designing agile locomotion for quadruped robots
often requires extensive expertise and tedious manual tuning.
In this paper, we present a system to automate this process by
leveraging deep reinforcement learning techniques. Our system
can learn quadruped locomotion from scratch using simple
reward signals. In addition, users can provide an open-loop
reference to guide the learning process when more control over
the learned gait is needed. The control policies are learned in a
physics simulator and then deployed on real robots. In robotics,
policies trained in simulation often do not transfer to the real
world. We narrow this reality gap by improving the physics
simulator and learning robust policies. We improve the simulation
using system identification, developing an accurate actuator
model and simulating latency. We learn robust controllers by
randomizing the physical environments, adding perturbations
and designing a compact observation space. We evaluate our
system on two agile locomotion gaits: trotting and galloping.
After learning in simulation, a quadruped robot can successfully
perform both gaits in the real world.
I. INTRODUCTION
Designing agile locomotion for quadruped robots is a long-
standing research problem [1]. This is because it is difficult to
control an under-actuated robot performing highly dynamic
motion that involves intricate balance. Classical approaches
often require extensive experience and tedious manual tuning
[2, 3]. Can we automate this process?
Recently, we have seen tremendous progress in deep rein-
forcement learning (deep RL) [4, 5, 6]. These algorithms can
solve locomotion problems from scratch without much human
intervention. However, most of these studies are conducted
in simulation, and a controller learned in simulation often
performs poorly in the real world. This reality gap [7, 8]
is caused by model discrepancies between the simulated and
the real physical system. Many factors, including unmodeled
dynamics, wrong simulation parameters, and numerical errors,
contribute to this gap. Even worse, this gap is greatly amplified
in locomotion tasks. When a robot performs agile motions with frequent contact changes, each switch in contact configuration breaks the control space into fragmented pieces, so even a small model discrepancy can be magnified and lead to drastically different outcomes. Overcoming the reality gap is therefore challenging.
Fig. 1: The simulated and the real Minitaurs learned to gallop using deep reinforcement learning.

An alternative is to learn the task directly on the physical system. While this has been successfully demonstrated in robotic grasping [9], it is challenging to apply this method to locomotion tasks because of the difficulty of automatically resetting the experiments and continuously collecting data. In addition, every fall during learning can potentially damage the robot. Thus, for locomotion tasks, learning in simulation is more appealing because it is faster, cheaper, and safer.
In this paper, we present a complete learning system for
agile locomotion, in which control policies are learned in
simulation and deployed on real robots. There are two main
challenges: 1) learning controllable locomotion policies; and
2) transferring the policies to the physical system.
While learning from scratch can lead to better policies than
incorporating human guidance [10], in robotics, having control
of the learned policy is sometimes preferred. Our learning system provides users with a full spectrum of controllability over the learned policies: the user can choose anywhere from letting the system learn completely on its own to specifying an open-loop reference gait as human guidance. Our system keeps the learned gait close to the reference while, at the same time, maintaining balance and improving speed and energy efficiency.
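To make this trade-off concrete, the snippet below is a minimal sketch (not the exact reward used in this work) of how a per-step reward could combine forward progress, an energy penalty, and a penalty for deviating from a user-provided open-loop reference gait; the weights w_energy and w_ref and the helper signals are illustrative assumptions.

```python
import numpy as np

def locomotion_reward(x_curr, x_prev, torques, joint_vels,
                      motor_angles, reference_angles,
                      dt, w_energy=0.005, w_ref=0.1):
    """Illustrative per-step reward for guided gait learning.

    Rewards forward progress, penalizes mechanical energy use, and
    penalizes deviation from an open-loop reference gait. The weights
    are hypothetical, not taken from the paper.
    """
    # Forward progress of the robot base since the last control step.
    progress = x_curr - x_prev

    # Approximate mechanical energy spent during this step.
    energy = np.abs(np.dot(torques, joint_vels)) * dt

    # Deviation from the user-specified reference gait; setting
    # w_ref = 0 corresponds to learning entirely from scratch.
    ref_deviation = np.sum(np.abs(motor_angles - reference_angles))

    return progress - w_energy * energy - w_ref * ref_deviation
```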
To narrow the reality gap, we perform system identification
to find the correct simulation parameters. In addition, we improve
the fidelity of the physics simulator by adding a faithful
actuator model and latency handling. To further narrow the
gap, we experiment with three approaches to increase the
robustness of the learned controllers: dynamics randomization,