Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

Jie Tan¹, Tingnan Zhang¹, Erwin Coumans¹, Atil Iscen¹, Yunfei Bai², Danijar Hafner¹, Steven Bohez³, and Vincent Vanhoucke¹

¹Google Brain, ²X, ³Google DeepMind
Abstract—Designing agile locomotion for quadruped robots
often requires extensive expertise and tedious manual tuning.
In this paper, we present a system to automate this process by
leveraging deep reinforcement learning techniques. Our system
can learn quadruped locomotion from scratch using simple
reward signals. In addition, users can provide an open-loop
reference to guide the learning process when more control over
the learned gait is needed. The control policies are learned in a
physics simulator and then deployed on real robots. In robotics,
policies trained in simulation often do not transfer to the real
world. We narrow this reality gap by improving the physics
simulator and learning robust policies. We improve the simulation
using system identification, developing an accurate actuator
model and simulating latency. We learn robust controllers by
randomizing the physical environments, adding perturbations
and designing a compact observation space. We evaluate our
system on two agile locomotion gaits: trotting and galloping.
After learning in simulation, a quadruped robot can successfully
perform both gaits in the real world.
I. INTRODUCTION
Designing agile locomotion for quadruped robots is a long-
standing research problem [1]. This is because it is difficult to
control an under-actuated robot performing highly dynamic
motion that involves intricate balance. Classical approaches
often require extensive experience and tedious manual tuning
[2, 3]. Can we automate this process?
Recently, we have seen tremendous progress in deep rein-
forcement learning (deep RL) [4, 5, 6]. These algorithms can
solve locomotion problems from scratch without much human
intervention. However, most of these studies are conducted
in simulation, and a controller learned in simulation often
performs poorly in the real world. This reality gap [7, 8]
is caused by model discrepancies between the simulated and
the real physical system. Many factors, including unmodeled
dynamics, wrong simulation parameters, and numerical errors,
contribute to this gap. Even worse, this gap is greatly amplified
in locomotion tasks. When a robot performs agile motions with frequent contact changes, each switch in contact configuration breaks the control space into fragmented pieces, so even a small model discrepancy can be magnified and lead to drastically different outcomes. Overcoming the reality gap is therefore challenging.
Fig. 1: The simulated and the real Minitaurs learned to gallop using deep reinforcement learning.

An alternative is to learn the task directly on the physical system. While this has been successfully demonstrated in robotic grasping [9], it is challenging to apply this method to locomotion tasks because of the difficulty of automatically resetting the experiments and continuously collecting data. In addition, every fall during learning can potentially damage the robot. Thus, for locomotion tasks, learning in simulation is more appealing because it is faster, cheaper, and safer.
In this paper, we present a complete learning system for
agile locomotion, in which control policies are learned in
simulation and deployed on real robots. There are two main
challenges: 1) learning controllable locomotion policies; and
2) transferring the policies to the physical system.
While learning from scratch can lead to better policies than
incorporating human guidance [10], in robotics, having control
of the learned policy is sometimes preferred. Our learning system provides users with a full spectrum of controllability over the learned policies: the user can choose anywhere from letting the system learn completely on its own to specifying an open-loop reference gait as human guidance. Our system keeps the learned gait close to the reference while, at the same time, maintaining balance and improving speed and energy efficiency.
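To make this trade-off concrete, the snippet below is a minimal sketch (not the exact reward used in this work) of how a per-step reward could combine forward progress, an energy penalty, and a penalty for deviating from a user-provided open-loop reference gait; the weights w_energy and w_ref and the helper signals are illustrative assumptions.

```python
import numpy as np

def locomotion_reward(x_curr, x_prev, torques, joint_vels,
                      motor_angles, reference_angles,
                      dt, w_energy=0.005, w_ref=0.1):
    """Illustrative per-step reward for guided gait learning.

    Rewards forward progress, penalizes mechanical energy use, and
    penalizes deviation from an open-loop reference gait. The weights
    are hypothetical, not taken from the paper.
    """
    # Forward progress of the robot base since the last control step.
    progress = x_curr - x_prev

    # Approximate mechanical energy spent during this step.
    energy = np.abs(np.dot(torques, joint_vels)) * dt

    # Deviation from the user-specified reference gait; setting
    # w_ref = 0 corresponds to learning entirely from scratch.
    ref_deviation = np.sum(np.abs(motor_angles - reference_angles))

    return progress - w_energy * energy - w_ref * ref_deviation
```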
To narrow the reality gap, we perform system identification
to find the correct simulation parameters. In addition, we improve
the fidelity of the physics simulator by adding a faithful
actuator model and latency handling. To further narrow the
gap, we experiment with three approaches to increase the
robustness of the learned controllers: dynamics randomization,