Agent Inspired Trading Using Recurrent
Reinforcement Learning and LSTM Neural Networks
David W. Lu
Email: davie.w.lu@gmail.com
At the time of completion of this article, the author works for Bank of America Merrill Lynch. The views and opinions expressed in this article are those of the author and do not necessarily reflect the views or position of Bank of America Merrill Lynch.
Abstract—With breakthroughs in computational power and deep neural networks, techniques that were rigorously researched in the past can now be applied to many areas we have not yet explored. In this paper, we walk through possible concepts for achieving robo-like trading or advising. To accomplish a level of performance and generality comparable to a human trader, our agents learn for themselves to create successful strategies that lead to human-level long-term rewards. The learning model is implemented in Long Short Term Memory (LSTM) recurrent structures, with reinforcement learning or evolution strategies acting as agents. The robustness and feasibility of the system is verified on GBPUSD trading.
Keywords—Deep learning, Long Short Term Memory (LSTM),
neural network for finance, recurrent reinforcement learning,
evolution strategies, robo-advisers, robo-traders
I. INTRODUCTION
Many machine learning and artificial intelligence techniques can be traced back as early as the 1950s. Evolving from the study of pattern recognition and computational learning theory, researchers have explored and studied the construction of algorithms that can learn from and make predictions on data. Building on such predictions, researchers arrived at the idea of a learning system that makes decisions and adapts its behavior in order to maximize a signal from its environment. This was the creation of a "hedonistic" learning system[1]. The idea of this learning system may be viewed as adaptive optimal control; nowadays we call it reinforcement learning[2]. In order to accomplish a level of performance and generality comparable to a human, we need to construct and learn knowledge directly from raw inputs, such as vision, without any hand-engineered features; this can be achieved by deep learning with neural networks. Combining the two, often simply referred to as deep reinforcement learning, could create an artificial agent that comes as close as we can sanely get to true "artificial intelligence".
In this paper, we use direct reinforcement or recurrent reinforcement learning to refer to algorithms that do not have to learn a value function in order to derive a policy. Some researchers use the term direct reinforcement for policy gradient algorithms in a Markov Decision Process framework, or more generally for any reinforcement learning algorithm that does not require learning a value function. Herein, we will focus on recurrent reinforcement learning. Methods such as dynamic programming[3], TD Learning[4] and Q-Learning[5] have been the focus of most modern research; these methods attempt to learn a value function. Actor-Critic methods[6] are an intermediate between direct reinforcement and value function methods, in that the "critic" learns a value function which is then used to update the parameters of the "actor".
Why do we choose to focus on recurrent reinforcement learning? Though much theoretical progress has been made in recent years, there have been few publicly known applications in the financial world. Start-ups, quantitative hedge funds, client-driven investment services, wealth management companies, and most recently robo-advisers have all been focusing on financial decision making problems in which the system trades on its own.
Within the reinforcement learning community, much attention is given to the question of learning policies versus learning value functions. The value function approach described earlier has dominated the field throughout the last thirty years. The approach has worked well in many applications, AlphaGo and training a helicopter to name a few. However, the value function approach suffers from several limitations. Q-Learning is formulated for discrete state and action spaces, and in many situations it suffers from the "curse of dimensionality". When Q-Learning is extended to function approximators, researchers have shown that it can fail to converge even on a simple Markov Decision Process. It is also brittle, meaning that a small change in the value function can produce large changes in the policy. In the trading signal world, the data can contain large amounts of noise and nonstationarity, which can cause severe problems for a value function approach.
Recurrent reinforcement learning can provide immediate feedback to optimize the strategy and can produce real-valued actions or weights naturally, without resorting to the discretization necessary for the value function approach. There are other portfolio optimization techniques, such as evolution strategies and linear matrix inequalities, which rely on predicting the covariance matrix and optimizing over it. For any optimization problem, or in a reinforcement learning setup, we need an objective, and such an objective can be formulated in terms of risk or reward. Moody et al.[7] have shown how differential forms of the Sharpe Ratio and Downside Deviation Ratio can be formulated to enable efficient on-line learning with recurrent reinforcement learning, Lu[8] has shown that linear matrix inequalities can beat the risk-free rate, and Deng et al.[9] have shown that maximum return can be used as the objective in recurrent reinforcement learning, as well as how deep learning transformations can be used to initialize the features.
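As a brief illustration of such an objective, the differential Sharpe ratio of Moody et al.[7] can be restated as follows; this is a sketch in our notation rather than a full derivation. Let R_t denote the trading return at time t and let A_t and B_t be exponential moving estimates of its first and second moments with adaptation rate \eta,
\begin{align}
A_t &= A_{t-1} + \eta\,(R_t - A_{t-1}), & B_t &= B_{t-1} + \eta\,(R_t^2 - B_{t-1}),
\end{align}
so that S_t = A_t / \sqrt{B_t - A_t^2} is a moving Sharpe ratio. Expanding S_t to first order in \eta gives the differential Sharpe ratio
\begin{equation}
D_t \equiv \left. \frac{dS_t}{d\eta} \right|_{\eta = 0}
    = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}
           {\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}},
\qquad \Delta A_t = R_t - A_{t-1}, \quad \Delta B_t = R_t^2 - B_{t-1},
\end{equation}
which measures the marginal contribution of the latest return R_t to the Sharpe ratio and can therefore be maximized on-line at each time step, enabling the efficient recurrent reinforcement learning updates mentioned above.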
To extend the recurrent structure, we will further discuss in this paper how the backpropagation through time method is exploited to unfold the recurrent neural network as a