Agent Inspired Trading Using Recurrent
Reinforcement Learning and LSTM Neural Networks
David W. Lu
Email: davie.w.lu@gmail.com
At the time of completion of this article, the author works for Bank of America Merrill Lynch. The views and opinions expressed in this article are those of the author and do not necessarily reflect the views or position of Bank of America Merrill Lynch.
Abstract—With breakthroughs in computational power and deep neural networks, techniques that were rigorously researched in the past can now be applied to many areas we have not yet explored. In this paper, we walk through possible concepts for achieving robo-like trading or advising. To accomplish a level of performance and generality comparable to a human trader, our agents learn for themselves to create successful strategies that lead to human-level long-term rewards. The learning model is implemented in Long Short Term Memory (LSTM) recurrent structures, with reinforcement learning or evolution strategies acting as agents. The robustness and feasibility of the system is verified on GBPUSD trading.
Keywords—Deep learning, Long Short Term Memory (LSTM),
neural network for finance, recurrent reinforcement learning,
evolution strategies, robo-advisers, robo-traders
I. INTRODUCTION
Many machine learning and artificial intelligence techniques can be traced back as early as the 1950s. Evolving from the study of pattern recognition and computational learning theory, researchers have explored and studied the construction of algorithms that can learn from and make predictions on data. Building on such predictions, researchers arrived at the idea of a learning system that makes decisions and adapts its behavior in order to maximize a signal from its environment. This was the creation of a "hedonistic" learning system[1]. The idea of this learning system may be viewed as adaptive optimal control; nowadays we call it reinforcement learning[2]. In order to accomplish a level of performance and generality comparable to a human, we need to construct and learn knowledge directly from raw inputs, such as vision, without any hand-engineered features; this can be achieved by deep learning with neural networks. Combining the two, often simply referred to as deep reinforcement learning, could create an artificial agent that comes as close as we can sanely get to true "artificial intelligence".
In this paper, we use direct reinforcement or recurrent reinforcement learning to refer to algorithms that do not have to learn a value function in order to derive a policy. Some researchers use the term direct reinforcement for policy gradient algorithms in a Markov Decision Process framework, or more generally for any reinforcement learning algorithm that does not require learning a value function. Herein, we will focus on recurrent reinforcement learning. Methods such as dynamic programming[3], TD Learning[4] and Q-Learning[5] have been the focus of most modern research; these methods attempt to learn a value function. Actor-Critic methods[6] are an intermediate between direct reinforcement and value function methods, in that the "critic" learns a value function which is then used to update the parameters of the "actor".
Why do we choose to focus on recurrent reinforcement learning? Though much theoretical progress has been made in recent years, there have been few publicly known applications in the financial world. Start-ups, quantitative hedge funds, client-driven investment services, wealth management companies, and most recently robo-advisers have all been focusing on financial decision making problems in which the system trades on its own.
Within the reinforcement learning community, much attention is given to the question of learning policies versus learning value functions. The value function approach described earlier has dominated the field throughout the last thirty years. The approach has worked well in many applications, AlphaGo and training a helicopter to name a few. However, the value function approach suffers from several limitations. Q-Learning is formulated for discrete state and action spaces, and in many situations it suffers from the "curse of dimensionality". When Q-Learning is extended to function approximators, researchers have shown that it can fail to converge even on a simple Markov Decision Process. It is also brittle, meaning that a small change in the value function can produce large changes in the policy. In the trading signal world, the data can contain large amounts of noise and nonstationarity, which can cause severe problems for a value function approach.
Recurrent reinforcement learning can provide immediate feedback to optimize the strategy and can produce real-valued actions or weights naturally, without resorting to the discretization necessary for the value function approach. There are other portfolio optimization techniques, such as evolution strategies and linear matrix inequalities, which rely on predicting the covariance matrix and optimizing over it. For any optimization problem, or in a reinforcement learning setup, we need an objective, and such an objective can be formulated in terms of risk or reward. Moody et al.[7] have shown how differential forms of the Sharpe Ratio and Downside Deviation Ratio can be formulated to enable efficient on-line learning with recurrent reinforcement learning, Lu[8] has shown that linear matrix inequalities can beat the risk-free rate, and Deng et al.[9] have shown that maximum return can be used as the objective in recurrent reinforcement learning, as well as how deep learning transformations can be used to initialize the features.
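As a brief illustration of such an objective, the differential Sharpe ratio of Moody et al.[7] can be restated as follows; this is a sketch in our notation rather than a full derivation. Let R_t denote the trading return at time t and let A_t and B_t be exponential moving estimates of its first and second moments with adaptation rate \eta,
\begin{align}
A_t &= A_{t-1} + \eta\,(R_t - A_{t-1}), & B_t &= B_{t-1} + \eta\,(R_t^2 - B_{t-1}),
\end{align}
so that S_t = A_t / \sqrt{B_t - A_t^2} is a moving Sharpe ratio. Expanding S_t to first order in \eta gives the differential Sharpe ratio
\begin{equation}
D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta = 0}
    = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}
           {\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}},
\qquad \Delta A_t = R_t - A_{t-1}, \quad \Delta B_t = R_t^2 - B_{t-1},
\end{equation}
which measures the marginal contribution of the latest return R_t to the Sharpe ratio and can therefore be maximized on-line at each time step, enabling the efficient recurrent reinforcement learning updates mentioned above.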
To extend the recurrent structure, we will further discuss in this paper how the backpropagation through time method is exploited to unfold the recurrent neural network as a