# MDP-DP-RL
The goal of this project was to develop all Dynamic Programming and Reinforcement Learning algorithms
from scratch (i.e., with no use of standard libraries, except for basic numpy and scipy tools). The
"develop from scratch" goal was motivated by educational purposes - students learning this topic
can understand the concepts throroughly only when they develop and work with code developed from
scratch. I teach courses on this topic to a variety of student backgrounds, and each such course
is big on precise programming implementations of the techniques/algorithms. In particular, I
use this codebase when I teach Stanford CME 241: Reinforcement Learning for Stochastic
Control Problems in Finance (http://cme241.stanford.edu).
Any feedback on code readability, performance and bugs will be greatly appreciated as the code
is still fairly raw and untested in various parts (started working on this code in August 2018,
and have mainly been in code-growth mode so far).
The project started by implementing the foundational data structures for finite Markov Processes
(a.k.a. Markov Chains), Markov Reward Processes (MRP), and Markov Decision Processes (MDP). This was followed by
Dynamic Programming (DP) algorithms, where the focus was to represent Bellman equations in clear mathematical
terms within the code. Next was the core educational material of Reinforcement Learning, implementing
the Generalized Policy Iteration algorithms based on simulations (Monte Carlo and Temporal Difference,
including eligibility traces). However, the emphasis was to first implement the tabular methods so that
one works with explicit data structures (finite, hence tabular) - rather than with functions - to represent
MDP rewards and transition specifications as well as value functions and policies. Once the tabular RL
methods were implemented, it was straightforward to write the same algorithms as function approximation-based
algorithms. However, this required a detour to build some foundation for function approximation. I chose
to implement linear and deep neural network approximations, both of which require a specification of
feature functions. Backpropagation was developed from scratch, again for educational purposes. On a whim,
I also implemented Approximate Dynamic Programming (ADP) algorithms, which are basically the same old
Policy Iteration and Value Iteration algorithms, but now using the output of function approximation for
the right-hand-side of the Bellman update, and using the updated values as training data for gradient
descent on the parameters of the function approximation. So far, I am finding ADP
to be the most valuable algorithm for the MDP problems I typically work with. I am a bit surprised that the
"literature" focuses so much on model-free methods, whereas I often know the model for many of the MDPs I work on,
and so, ADP is ideal.
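To make the ADP idea concrete, here is a minimal sketch (not the repository's actual interface) of Approximate Value Iteration with a linear approximation; `transition`, `reward` and `features` are hypothetical stand-ins for whatever model and feature specification one has:

```python
import numpy as np

def approx_value_iteration(states, actions, transition, reward, gamma, features,
                           num_sweeps=100, learning_rate=0.1):
    # Linear value function approximation: v_hat(s) = features(s) . w
    w = np.zeros(len(features(states[0])))

    def v_hat(s):
        return features(s) @ w

    for _ in range(num_sweeps):
        for s in states:
            # Bellman-optimality right-hand-side, evaluated with the current approximation
            target = max(
                sum(p * (reward(s, a, s1) + gamma * v_hat(s1))
                    for s1, p in transition(s, a).items())
                for a in actions(s)
            )
            # Treat the updated value as a regression target: one gradient-descent
            # step on the squared error (target - v_hat(s))^2 with respect to w
            w += learning_rate * (target - v_hat(s)) * features(s)
    return w
```

Approximate Policy Iteration works analogously, with the policy-evaluation right-hand-side replacing the max over actions.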
I have chosen Python 3 as the language, mainly because I can't expect my students to have expertise in
the potentially more-appropriate languages for this project, such as Scala, OCaml and Haskell. These are
functional programming languages, and this topic/project is best done through a tasteful application of
Functional Programming. But Python 3 is not such a bad choice, as functions are first-class entities. My core
technique in this project is indeed Functional Programming, but I had to be very careful in getting around
Python's "naughty" handling of function closures. I have also made heavy use of classes and TypeVars.
Object-oriented polymorphism as well as type-parametrized polymorphism enabled me to cover a wide range of
algorithms with plenty of common code. Python 3 also provided me the benefit of type annotations, which I
have taken heavy advantage of in this project. Support for type annotations turned out to be extremely valuable,
as my IDE (PyCharm) caught a lot of errors/warnings statically - in fact, it would spot errors as I was typing
code. More importantly, type annotations make the interfaces very clear, and I believe any
sort of mathematical programming needs a strong style of type annotations (if not static typing).
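For a flavor of what this looks like (a simplified sketch - the project's actual TypeVars and aliases live in the utils module and differ in detail), algorithms can be written generically in a state type S and an action type A:

```python
from typing import Mapping, TypeVar

S = TypeVar('S')  # generic state type
A = TypeVar('A')  # generic action type

# Illustrative type aliases for tabular MDP specifications
SASf = Mapping[S, Mapping[A, Mapping[S, float]]]   # transition probabilities
SAf = Mapping[S, Mapping[A, float]]                # expected rewards
Policy = Mapping[S, Mapping[A, float]]             # stochastic policy

def expected_reward(policy: Policy, rewards: SAf, state: S) -> float:
    # Expected one-step reward from `state` when acting according to `policy`
    return sum(prob * rewards[state][action]
               for action, prob in policy[state].items())
```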
This is how the modules of the project are organized:
processes: All about Markov Processes, MRP, MDP, and classes that serve as minimal but complete representations
of an MDP for specific classes of algorithms, e.g., a representation for tabular RL, a representation for function
approximation RL, and a representation for ADP. A lot of the heavy lifting is done in the "helper" sub-module
mp_funcs.py.
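As a rough illustration (hypothetical class names, not the module's exact definitions), the representations differ in how much of the MDP they expose: DP and tabular RL can be handed the full transition/reward specification, while function-approximation RL only needs a way to sample transitions:

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Set, Tuple

State = str    # illustrative concrete types; the real code is generic in S and A
Action = str

@dataclass
class TabularMDPRep:
    # Full model: transition probabilities P(s'|s,a) and expected rewards R(s,a)
    transitions: Mapping[State, Mapping[Action, Mapping[State, float]]]
    rewards: Mapping[State, Mapping[Action, float]]
    gamma: float

@dataclass
class SamplingMDPRep:
    # Minimal simulator interface for model-free RL with function approximation
    init_state_gen: Callable[[], State]
    sample_next_state_reward: Callable[[State, Action], Tuple[State, float]]
    actions: Callable[[State], Set[Action]]
    gamma: float
```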
func_approx: Linear and Deep Neural Network (DNN) function approximation. Implements function evaluation (forward
propagation for DNN) and gradient calculation/gradient descent (backward propagation for DNN) using ADAM. Takes
advantage of numpy vectors, matrices and tensors, and computes with them efficiently.
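As an example of the gradient-descent machinery, a single ADAM step for a linear approximation might look like this sketch (standard ADAM hyperparameters; not the module's exact function signatures):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update biased first/second moment estimates, then take a
    # bias-corrected gradient step on the parameter vector w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def linear_sq_loss_grad(phi, w, target):
    # Gradient of 0.5 * (phi . w - target)^2 with respect to w
    return (phi @ w - target) * phi
```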
algorithms: within this, we have the modules dp (for Dynamic Programming), adp (for Approximate Dynamic Programming),
rl_tabular (for Tabular RL - Monte Carlo, SARSA, Q-Learning, Expected SARSA), rl_func_approx (for Function
Approximation RL - same algorithms as Tabular RL). Note that I have implemented TD(0) and TD(Lambda) separately
for both Tabular RL and Function Approximation RL, although TD(0) is a special case of TD(Lambda). TD(0) was
implemented separately for the usual pedagogical reason - showing students TD(0) as a special case, with simpler/lighter
code that focuses on the concept (versus the complication of eligibility traces), is quite beneficial. This is the
same reason I implemented Tabular RL separately (Tabular is a special case of Linear Function Approximation where the features
are indicator functions, one for each of the states/state-action pairs). Note the deep object-oriented inheritance
hierarchy - rooted at the abstract base class OptBase. Note also that a lot of heavy lifting happens in the
module helper_funcs.py. A couple of semi-advanced algorithms such as LSTD/LSPI and Policy Gradient are also implemented here (LSPI provides batch-efficiency and Policy Gradient is valuable when the action space is large/continuous). Some special but highly useful model-based algorithms such as Backward Induction (backward_dp.py) and Adaptive Multistage Sampling (ams.py) have also been implemented.
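For instance, tabular TD(0) prediction reduces to the following update (a bare-bones sketch over an iterable of sampled transitions); TD(Lambda) generalizes it by spreading each TD error over an eligibility trace, and replacing the table with a feature-weight dot product gives the function approximation version:

```python
from collections import defaultdict

def tabular_td0_prediction(transitions, gamma, alpha):
    # transitions: iterable of (state, reward, next_state, is_terminal) tuples
    # generated by simulating a fixed policy
    vf = defaultdict(float)
    for s, r, s1, terminal in transitions:
        target = r + (0.0 if terminal else gamma * vf[s1])
        vf[s] += alpha * (target - vf[s])   # move V(s) toward the TD target
    return vf
```

Tabular is literally the linear-function-approximation update with indicator features: the weight on the feature that indicates state s is just V(s).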
examples: Implemented a few common examples of problems that are ideal for RL: Windy Grid, Inventory Control. For http://cme241.stanford.edu, I have also implemented initial versions of two important and interesting finance problems that can be solved by modeling them as MDPs and solving with DP/RL: 1) Optimal Asset-Allocation and Consumption when managing a portfolio of risky assets and 1 riskless asset, 2) Optimal Exercise of American Options when the option-payoff is either path-dependent or if the state space of the option is high-dimensional.
utils: Some generic utility functions to transform data structures.