High-quality single-file implementations of deep reinforcement learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)

380 files in total

png: 178

py: 72

md: 58

Tags: deep learning, reinforcement learning, Python

Uploaded 2024-02-26 13:02:50 · 75.31 MB ZIP

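Deep reinforcement learning (DRL) is an important branch of artificial intelligence that combines the representational power of deep learning with the decision-making mechanisms of reinforcement learning. This resource is CleanRL, a Python library of DRL implementations that emphasizes research-friendly features and scalability. The main algorithms it covers are outlined below.

1. **Proximal Policy Optimization (PPO)**: PPO is a policy-gradient algorithm proposed by OpenAI. Its core idea is to clip the probability ratio between the new and old policies inside the surrogate objective, which limits how far a single update can move the policy and keeps training stable. PPO performs well in practice and is comparatively easy to implement and tune.
2. **Deep Q-Network (DQN)**: DQN brings deep learning to Q-learning and removes the table-size limit of tabular Q-learning. Because a neural network approximates the Q-function, DQN can handle high-dimensional state spaces, which is why it performs well in environments such as Atari games.
3. **Categorical DQN (C51)**: C51 is a DQN variant that learns a categorical distribution over returns for each action, on a fixed support of 51 atoms, instead of a single expected value as in DQN. Modeling the whole return distribution reflects the randomness of returns and can make learning more stable and accurate.
4. **Deep Deterministic Policy Gradient (DDPG)**: DDPG is an algorithm for continuous action spaces that combines the actor-critic framework with ideas from DQN, such as a replay buffer and target networks. The actor outputs a deterministic action and the critic estimates the value of that action; each is trained using the other.
5. **Twin Delayed Deep Deterministic Policy Gradient (TD3)**: TD3 improves on DDPG in three ways. It trains two independent critics and uses the smaller of their two estimates in the learning target to reduce overestimation, it updates the policy less often than the critics, and it adds clipped noise to the target action to smooth the value estimate. Together these changes make TD3 more stable on continuous-control tasks.
6. **Soft Actor-Critic (SAC)**: SAC is a maximum-entropy algorithm. It maximizes expected return plus an entropy bonus on the policy, so the policy stays as random as it can while still earning reward, which encourages exploration and balances it against exploitation.
7. **Phasic Policy Gradient (PPG)**: PPG builds on PPO and splits training into alternating phases: a policy phase that trains the policy with the PPO objective, and an auxiliary phase that further trains the value function and distills value information into the policy network while constraining the policy not to drift. This decouples policy and value optimization while still sharing learned features.

CleanRL stands out for its concise single-file implementations, which let researchers understand and reproduce these algorithms quickly. It also scales to large experiments through tools such as AWS Batch, which is useful for hyperparameter searches and comparison experiments. In short, CleanRL covers several of the most popular and successful reinforcement learning algorithms (PPO, DQN, C51, DDPG, TD3, SAC, and PPG) and is a valuable resource for developers and researchers who want to enter the DRL field or understand it in more depth.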
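Two of the mechanisms described above, PPO's ratio clipping and TD3's use of the smaller of two critic estimates, are small enough to sketch in plain Python. The block below is an illustrative sketch only and is not taken from CleanRL: the function names `ppo_clipped_loss` and `td3_target` are hypothetical, and the defaults `clip_coef=0.2` and `gamma=0.99` are assumed values chosen for the example.

```python
import math


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.2):
    """Mean clipped surrogate loss over a batch (hypothetical helper).

    PPO maximizes min(ratio * A, clip(ratio) * A); the sign is flipped
    here so the result can be minimized like an ordinary loss.
    """
    losses = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), from log-probabilities.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_coef, 1 + clip_coef].
        clipped_ratio = max(1.0 - clip_coef, min(ratio, 1.0 + clip_coef))
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bootstrapped target using the smaller of two critic estimates
    (hypothetical helper; `done` is 1.0 at episode end, else 0.0)."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


# The new policy makes the action e times more likely, but with a positive
# advantage the objective only credits a ratio of 1 + clip_coef.
print(ppo_clipped_loss([1.0], [0.0], [1.0]))
# The pessimistic critic (2.0) is used in the target, not the optimistic one.
print(td3_target(reward=1.0, done=0.0, q1_next=2.0, q2_next=3.0, gamma=0.5))
```

With a negative advantage the clip does not cap the penalty for making a bad action more likely, which is the asymmetry that discourages large policy changes in the harmful direction.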

Archive contents (380 files)

CNAME 16B

termynal.css 2KB

custom.css 2KB

extra.css 171B

Dockerfile 890B

.gitpod.Dockerfile 686B

.dockerignore 75B

.gitignore 2KB

.gitignore 33B

CleanRL_Huggingface_Integration_Demo.ipynb 697KB

termynal.js 9KB

custom.js 7KB

chat.js 77B

LICENSE 16KB

poetry.lock 313KB

poetry.lock 41KB

ppo.md 69KB

rpo.md 30KB

sac.md 29KB

cleanrl-v1.md 23KB

ddpg.md 21KB

dqn.md 19KB

ppo-isaacgymenvs.md 15KB

c51.md 15KB

README.md 15KB

td3.md 14KB

contribution.md 13KB

ppg.md 11KB

hyperparameter-tuning.md 11KB

qdagger.md 10KB

benchmark-utility.md 8KB

ppo_atari_envpool_xla_jax.md 8KB

ppo_atari_envpool_xla_jax_runtimes.md 8KB

basic-usage.md 7KB

overview.md 6KB

ppo-rnd.md 6KB

zoo.md 5KB

atari_returns.md 5KB

atari_hns.md 5KB

index.md 4KB

submit-experiments.md 4KB

cleanrl-supported-papers-projects.md 4KB

installation.md 3KB

resume-training.md 3KB

pull_request_template.md 2KB

ddpg.md 1KB

td3.md 1KB

td3_runtimes.md 1KB

examples.md 1KB

ppo_continuous_action.md 1KB

ppo_continuous_action_runtimes.md 1KB

ppo_envpool.md 1KB

ppo_envpool_runtimes.md 1KB

installation.md 1KB

issue_template.md 948B

ppo_atari_envpool_xla_jax_scan.md 875B

ppo_atari_envpool_xla_jax_scan_runtimes.md 869B

experiment-tracking.md 838B

ppo_atari_multigpu.md 790B

sac.md 773B

sac_runtimes.md 767B

ppo_atari_envpool.md 730B

ppo_atari_envpool_runtimes.md 724B

ppo_atari_multigpu_runtimes.md 484B

ppo_atari_lstm.md 467B

ppo_atari_lstm_runtimes.md 464B

ppo_atari.md 442B

ppo_atari_runtimes.md 439B

ppo_procgen.md 382B

ppo_procgen_runtimes.md 379B

ppo.md 367B

ppo_runtimes.md 364B

CONTRIBUTING.md 75B

index.md 7B

dm_control_all_ppo_rpo_8M.png 2.15MB

tensorboard.png 1.85MB

aws_batch2.png 1.78MB

aws_batch1.png 1.7MB

hms_each_game.png 1.27MB

Hopper-v2-time.png 1.23MB

Hopper-v2.png 1.15MB

ppo_continuous_action_gymnasium_dm_control.png 1.1MB

BossFight.png 1.06MB

Walker2d-v2.png 1.05MB

BigFish.png 1.04MB

Walker2d-v2-time.png 1.03MB

StarPilot.png 1MB

HalfCheetah-v2.png 967KB

Ant.png 896KB

Anymal.png 895KB

Humanoid.png 887KB

BallBalance.png 887KB

Hopper-v2.png 875KB

Hopper-v2-time.png 860KB

HalfCheetah-v2-time.png 843KB

Walker2d-v2-time.png 843KB

Cartpole.png 841KB

HalfCheetah-v2.png 827KB

Walker2d-v2.png 824KB

AllegroHand.png 824KB


# CleanRL (Clean Implementation of RL Algorithms)

[<img src="https://img.shields.io/badge/license-MIT-blue">](https://github.com/vwxyzjn/cleanrl)
[![tests](https://github.com/vwxyzjn/cleanrl/actions/workflows/tests.yaml/badge.svg)](https://github.com/vwxyzjn/cleanrl/actions/workflows/tests.yaml)
[![docs](https://img.shields.io/github/deployments/vwxyzjn/cleanrl/Production?label=docs&logo=vercel)](https://docs.cleanrl.dev/)
[<img src="https://img.shields.io/discord/767863440248143916?label=discord">](https://discord.gg/D6RCjA6sVT)
[<img src="https://img.shields.io/youtube/channel/views/UCDdC6BIFRI0jvcwuhi3aI6w?style=social">](https://www.youtube.com/channel/UCDdC6BIFRI0jvcwuhi3aI6w/videos)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Models-Huggingface-F8D521">](https://huggingface.co/cleanrl)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vwxyzjn/cleanrl/blob/master/docs/get-started/CleanRL_Huggingface_Integration_Demo.ipynb)

CleanRL is a Deep Reinforcement Learning library that provides high-quality single-file implementations with research-friendly features. The implementation is clean and simple, yet we can scale it to run thousands of experiments using AWS Batch. The highlight features of CleanRL are:

* Single-file implementation
   * *Every detail about an algorithm variant is put into a single standalone file.*
   * For example, our `ppo_atari.py` only has 340 lines of code but contains all implementation details on how PPO works with Atari games, **so it is a great reference implementation to read for folks who do not wish to read an entire modular library**.
* Benchmarked Implementation (7+ algorithms and 34+ games at https://benchmark.cleanrl.dev)
* Tensorboard Logging
* Local Reproducibility via Seeding
* Videos of Gameplay Capturing
* Experiment Management with [Weights and Biases](https://wandb.ai/site)
* Cloud Integration with docker and AWS

You can read more about CleanRL in our [JMLR paper](https://www.jmlr.org/papers/volume23/21-1342/21-1342.pdf) and [documentation](https://docs.cleanrl.dev/).

CleanRL only contains implementations of **online** deep reinforcement learning algorithms. If you are looking for **offline** algorithms, please check out [corl-team/CORL](https://github.com/corl-team/CORL), which shares a similar design philosophy to CleanRL.

> **Support for Gymnasium**: [Farama-Foundation/Gymnasium](https://github.com/Farama-Foundation/Gymnasium) is the next generation of [`openai/gym`](https://github.com/openai/gym) that will continue to be maintained and introduce new features. Please see their [announcement](https://farama.org/Announcing-The-Farama-Foundation) for further detail. We are migrating to `gymnasium` and the progress can be tracked in [vwxyzjn/cleanrl#277](https://github.com/vwxyzjn/cleanrl/pull/277).

> **NOTE**: CleanRL is *not* a modular library and therefore it is not meant to be imported. At the cost of duplicate code, we make all implementation details of a DRL algorithm variant easy to understand, so CleanRL comes with its own pros and cons. You should consider using CleanRL if you want to 1) understand all implementation details of an algorithm's variant or 2) prototype advanced features that other modular DRL libraries do not support (CleanRL has minimal lines of code, so it gives you a great debugging experience and you don't have to do a lot of subclassing like sometimes in modular DRL libraries).

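The "Local Reproducibility via Seeding" feature listed above comes down to drawing every random number from a generator initialized with a user-supplied seed. The sketch below illustrates the idea with the Python standard library only. It is not CleanRL code: `parse_args` and `rollout` are hypothetical names, the toy rollout stands in for a real training loop, and a real script would also have to seed every other source of randomness it uses (for example the array or deep learning library and the environment).

```python
import argparse
import random


def parse_args(argv=None):
    """Read a --seed option; passing argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=1, help="seed of the experiment")
    return parser.parse_args(argv)


def rollout(seed, num_steps=5, num_actions=2):
    """Toy stand-in for an experiment: sample actions from a seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randrange(num_actions) for _ in range(num_steps)]


# An explicit argument list is passed here so the example does not depend on
# how it is launched; a script would call parse_args() with no arguments.
args = parse_args(["--seed", "1"])
first = rollout(args.seed)
second = rollout(args.seed)
print(first == second)  # prints True: same seed, same action sequence
```

Using a private `random.Random` instance instead of the module-level functions keeps the sequence unaffected by any other code that draws from the global generator.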
## Get started

Prerequisites:
* Python >=3.7.1,<3.11
* [Poetry 1.2.1+](https://python-poetry.org)

To run experiments locally, give the following a try:

```bash
git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install

# alternatively, you could use `poetry shell` and do
# `python cleanrl/ppo.py`
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000

# open another terminal and enter `cd cleanrl/cleanrl`
tensorboard --logdir runs
```

To use experiment tracking with wandb, run

```bash
wandb login # only required for the first time
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --track \
    --wandb-project-name cleanrltest
```

If you are not using `poetry`, you can install CleanRL with `requirements.txt`:

```bash
# core dependencies
pip install -r requirements/requirements.txt

# optional dependencies
pip install -r requirements/requirements-atari.txt
pip install -r requirements/requirements-mujoco.txt
pip install -r requirements/requirements-mujoco_py.txt
pip install -r requirements/requirements-procgen.txt
pip install -r requirements/requirements-envpool.txt
pip install -r requirements/requirements-pettingzoo.txt
pip install -r requirements/requirements-jax.txt
pip install -r requirements/requirements-docs.txt
pip install -r requirements/requirements-cloud.txt
```

To run training scripts in other games:

```bash
poetry shell

# classic control
python cleanrl/dqn.py --env-id CartPole-v1
python cleanrl/ppo.py --env-id CartPole-v1
python cleanrl/c51.py --env-id CartPole-v1

# atari
poetry install -E atari
python cleanrl/dqn_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/sac_atari.py --env-id BreakoutNoFrameskip-v4

# NEW: 3-4x side-effects free speed up with envpool's atari (only available to linux)
poetry install -E envpool
python cleanrl/ppo_atari_envpool.py --env-id BreakoutNoFrameskip-v4
# Learn Pong-v5 in ~5-10 mins
# Side effects such as lower sample efficiency might occur
poetry run python cleanrl/ppo_atari_envpool.py --clip-coef=0.2 --num-envs=16 --num-minibatches=8 --num-steps=128 --update-epochs=3

# procgen
poetry install -E procgen
python cleanrl/ppo_procgen.py --env-id starpilot
python cleanrl/ppg_procgen.py --env-id starpilot

# ppo + lstm
poetry install -E atari
python cleanrl/ppo_atari_lstm.py --env-id BreakoutNoFrameskip-v4
```

You may also use a prebuilt development environment hosted in Gitpod:

[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/vwxyzjn/cleanrl)

## Algorithms Implemented

| Algorithm | Variants Implemented |
| ----------- | ----------- |
| [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) | [`ppo.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy) |
| | [`ppo_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_ataripy) |
| | [`ppo_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_continuous_actionpy) |
| | [`ppo_atari_lstm.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_lstm.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_lstmpy) |
| | [`ppo_atari_envpool.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_envpoolpy) |
| | [`ppo_atari_envpool_xla_jax.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool_xla_jax.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_envpool_xla_jaxpy) |
| | [`ppo_atari_envpool_xla_jax_scan.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool_xla_jax_scan.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_envpool_xla_jax_scanpy) |
| | [`ppo_procgen
