A simple implementation of the RLHF (Reinforcement Learning from Human Feedback) algorithm (.zip) - RLHF resources - CSDN Library

423 files in total: 139 py, 117 sh, 74 json

Uploaded 2024-05-11 17:18:40 · 12.13 MB · ZIP · 48 views

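A simple implementation of the RLHF (Reinforcement Learning from Human Feedback) algorithm.

Reinforcement learning (RL), which goes by several alternative names in Chinese, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent that, while interacting with an environment, learns a policy to maximize return or reach a specific goal. Its defining feature is that there is no supervised data, only a reward signal.

The standard model for RL is the Markov decision process (MDP). Depending on the setting, RL can be divided into model-based RL and model-free RL, and into active RL and passive RL. Variants include inverse RL, hierarchical RL, and RL for partially observable systems. Algorithms for solving RL problems fall into two classes: policy search algorithms and value function algorithms.

RL theory is inspired by behaviorist psychology. It emphasizes online learning and tries to balance exploration and exploitation. Unlike supervised and unsupervised learning, RL does not require a dataset to be given in advance; instead, it obtains learning information from the reward (feedback) the environment returns for each action and updates the model parameters accordingly. RL problems are also discussed in information theory, game theory, and automatic control, and RL is used to explain equilibria under bounded rationality and to design recommender systems and robot interaction systems. Some sophisticated RL algorithms have a degree of general intelligence for solving complex problems and can reach human-level performance in Go and video games.

RL is also widely applied in engineering. For example, Facebook released the open-source RL platform Horizon, which uses RL to optimize large-scale production systems. In healthcare, RL systems can propose treatment policies for patients, using past experience to find an optimal policy without prior information such as a mathematical model of the biological system, which makes RL-based systems more broadly applicable.

In short, RL is a learning process in which an agent interacts with an environment with the goal of maximizing cumulative reward, and it has shown strong application potential in many fields.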
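To make the agent-environment loop, the reward signal, and the exploration-exploitation trade-off described above concrete, here is a minimal sketch of tabular Q-learning (a value function method) with epsilon-greedy exploration. The toy chain environment, the function name `train_q_table`, and all hyperparameter values are assumptions chosen for illustration; none of them come from the archive.

```python
import random


def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                  epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1.

    The agent starts in state 0 and receives reward 1 only when it
    reaches the last state, which ends the episode.
    """
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 = step left, action 1 = step right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                # Explore: pick a random action.
                action = rng.randrange(2)
            else:
                # Exploit: pick a best-valued action, breaking ties randomly.
                best = max(q[state])
                action = rng.choice(
                    [a for a, v in enumerate(q[state]) if v == best])

            next_state = min(max(state + moves[action], 0), goal)
            done = next_state == goal
            reward = 1.0 if done else 0.0

            # Q-learning update: move Q(s, a) toward the bootstrapped target.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])

            state = next_state
            if done:
                break
    return q


q_table = train_q_table()
# Greedy action in every non-terminal state after training.
greedy_policy = ["left" if left >= right else "right"
                 for left, right in q_table[:-1]]
print(greedy_policy)
```

The agent is never told which action is correct; it only sees the reward, which is the sense in which RL has no supervised data.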


Archive contents: A simple implementation of the RLHF (Reinforcement Learning from Human Feedback) algorithm.zip (423 files)

CODEOWNERS 179B

ds-chat-single.gif 1.46MB

ds-chat.gif 289KB

.gitignore 2KB

.gitignore 50B

ds_config_structural_pruning_TEMPLATE.json 4KB

ds_config_TEMPLATE.json 4KB

ds_config.json 4KB

ds_config_W48A8_Qgroup48_lkd_fp32.json 4KB

ds_config_layer_reduction_fp16.json 3KB

ds_config_layer_reduction_W1Q8_fp32.json 3KB

ds_config_W1A8_Qgroup64_fp32.json 3KB

ds_config_W8A8_Qgroup48_fp32.json 3KB

ds_config_W1A8_Qgroup1_fp32.json 3KB

ds_config_W1or2A8_Qgroup64_fp16.json 3KB

ds_config_W1A8_Qgroup64_fp16.json 3KB

ds_config_channel_prune.json 3KB

ds_config_gpt2-medium_2clmetrics_TEMPLATE.json 2KB

ds_config_W4or8A8_Qgroup64_fp32.json 2KB

ds_config_W4or8A8_Qgroup64_fp16.json 2KB

ds_config_gpt2-medium_1clmetric_TEMPLATE.json 2KB

ds_config_W8A8_Qgroup64_fp16.json 2KB

bert_large_lamb.json 2KB

bert_large.json 2KB

bert_base.json 2KB

bert_large_lamb_nvidia_data.json 2KB

ds_config_W8A8_Qgroup64_fp32.json 2KB

bert_base_large_lr.json 1KB

ds_config_gpt_medium_random_ltd.json 1KB

ds_config_gpt_base_random_ltd.json 1KB

ds_config_imagenet_random_ltd.json 1KB

ds_config_cifar_random_ltd.json 1KB

ds_config_gpt2_TEMPLATE.json 954B

deepspeed_bsz64k_lamb_config_seq128.json 718B

deepspeed_bsz64k_onebitlamb_config_seq128_nccl.json 674B

deepspeed_bsz64k_onebitlamb_config_seq128_mpi_ethernet.json 673B

deepspeed_bsz64k_onebitlamb_config_seq128_mpi_infiniband.json 672B

deepspeed_bsz32k_onebitlamb_config_seq512_nccl.json 642B

deepspeed_bsz32k_onebitlamb_config_seq512_mpi_ethernet.json 641B

deepspeed_bsz32k_onebitlamb_config_seq512_mpi_infiniband.json 640B

deepspeed_bsz4k_01adam_config_seq512_mpi_infiniband.json 611B

deepspeed_bsz4k_01adam_config_seq512_nccl.json 611B

deepspeed_bsz4k_01adam_config_seq512_mpi_ethernet.json 611B

deepspeed_bsz4k_01adam_config_seq128_mpi_infiniband.json 556B

deepspeed_bsz4k_01adam_config_seq128_nccl.json 556B

deepspeed_bsz4k_01adam_config_seq128_mpi_ethernet.json 556B

test.json 532B

deepspeed_bsz4k_onebitadam_config_seq128_nccl.json 517B

deepspeed_bsz4k_onebitadam_config_seq128_mpi_ethernet.json 516B

deepspeed_bsz4k_onebitadam_config_seq128_mpi_infiniband.json 515B

deepspeed_bsz4k_progressive_layer_drop_config_seq128.json 515B

ds_config.json 510B

ds_config.json 508B

deepspeed_bsz32k_lamb_config_seq512.json 441B

bert-large-uncased-whole-word-masking-config.json 434B

deepspeed_onebitadam_bsz96_config.json 393B

deepspeed_onebitadam_bsz96_config.json 392B

deepspeed_onebitadam_bsz96_config.json 391B

ds_config_fp16_tune.json 371B

glue_bert_base.json 354B

glue_bert_large.json 353B

glue_bert_base.json 353B

ds_config_fp16_tune.json 341B

ds_config_tune.json 320B

ds_config_tune.json 304B

deepspeed_bsz24_config.json 299B

ds_config.json 292B

ds_config_tune.json 286B

gan_deepspeed_config.json 213B

ds_config_fp16_z2.json 129B

ds_config_fp16_z3.json 129B

ds_config_fp16_z0.json 129B

ds_config_fp16_z1.json 129B

ds_config_z0.json 92B

ds_config_z2.json 92B

ds_config_z3.json 92B

ds_config_z1.json 92B

LICENSE 11KB

LICENSE 1KB

opt-1.3b-globalBatchSize128.log 309KB

actor_opt-1.3b_critic_opt-350m_globalBatchSize64.log 215KB

opt-350m_globalBatchSize-64.log 199KB

README.md 28KB

README.md 21KB

README.md 14KB

README.md 12KB

README.md 6KB

README.md 5KB

README.md 4KB

BenckmarkSetting.md 3KB

README.md 3KB


<p align="center">
<img src="assets/image/ds-shiba.png" alt="DeepSpeed Shiba Inu!"/>
</p>

<div align="center">

## DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales

</div>

<div align="center">

[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](LICENSE)

</div>

A fast, affordable, scalable and open system framework for enabling an end-to-end Reinforcement Learning from Human Feedback (RLHF) training experience to generate high-quality ChatGPT-style models at all scales.

<div align="center">
<img src="assets/image/four_blocks.png" alt="DeepSpeed ChatGPT-Like Models Banner"/>
</div>

## Table of Contents

- [Latest News](#-latest-news-)
- [What is DeepSpeed Chat](#-what-is-deepspeed-chat-)
- [Capabilities](#-capabilities-)
- [Quick Start](#-quick-start-)
  - [Installation](#-installation)
  - [Single Script for Training 3-Step RLHF Pipeline](#-one-single-script-completes-all-three-stages-of-rlhf-training-and-generate-your-first-chatgpt-model)
  - [Demonstration: Individual Step Fine-Tuning](#-demonstration-individual-step-fine-tuning)
    - [Step 1 - Supervised Fine-Tuning](#-step-1---supervised-fine-tuning)
    - [Step 2 - Reward Model](#-step-2---reward-model)
    - [Step 3 - Reinforcement Learning with Human Feedback](#-step-3---reinforcement-learning-with-human-feedback)
  - [Adding and using your own datasets in DeepSpeed-Chat](#-adding-and-using-your-own-datasets-in-deepspeed-chat)
  - [Customizing RLHF training pipeline via DeepSpeed-Chat's APIs](#-customizing-your-own-rlhf-training-pipeline-using-deepspeed-chats-rlhf-apis)
  - [Serving Your Model: Plug-in and Test!](#-serving-plug-in-your-final-model-trained-by-deepspeed-chat-and-test-it-out)
- [Training Performance Evaluation](#-training-performance-evaluation-)
- [Supported Models](#-supported-models-)
- [Build Pipeline Status](#-build-pipeline-status-)
- [Documentation and Tutorial](#-documentation-and-tutorial-)
- [DeepSpeed Chat's Roadmap](#-deepspeed-chats-roadmap-)
- [DeepSpeed Chat and DeepSpeed Community](#-deepspeed-chat-and-deepspeed-community-)
- [Acknowledgement and Citation](#-acknowledgement-and-citation-)

## Latest News

* ***[2023/04] [DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat)*** [[English](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat/README.md)] [[中文](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat/chinese/README.md)] [[日本語](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat/japanese/README.md)]

To cite DeepSpeed Chat, please cite our [arxiv report](https://arxiv.org/abs/2308.01320):

```
@article{yao2023dschat,
  title={{DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales}},
  author={Zhewei Yao and Reza Yazdani Aminabadi and Olatunji Ruwase and Samyam Rajbhandari and Xiaoxia Wu and Ammar Ahmad Awan and Jeff Rasley and Minjia Zhang and Conglong Li and Connor Holmes and Zhongzhu Zhou and Michael Wyatt and Molly Smith and Lev Kurilenko and Heyang Qin and Masahiro Tanaka and Shuai Che and Shuaiwen Leon Song and Yuxiong He},
  journal={arXiv preprint arXiv:2308.01320},
  year={2023}
}
```

## What is DeepSpeed Chat

<div align="center">

https://user-images.githubusercontent.com/124002815/230290966-a78ea171-ab65-4fcc-b91e-67c7c4403497.mp4

</div>

In the spirit of democratizing ChatGPT-style models and their capabilities, DeepSpeed is proud to introduce a general system framework for enabling an end-to-end training experience for ChatGPT-like models, named ***DeepSpeed Chat***. It can automatically take your favorite pre-trained large language models through an OpenAI InstructGPT-style three-stage process to produce your very own high-quality ChatGPT-style model.
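As a rough illustration of the second of the three InstructGPT-style stages (reward model training), the sketch below computes the pairwise ranking loss commonly used for that stage: the reward model is pushed to score the human-preferred response above the rejected one. This is a plain-Python sketch under stated assumptions, not DeepSpeed Chat's code; the function name `pairwise_ranking_loss` and the sample reward values are hypothetical, and this excerpt does not show which exact loss the repository uses.

```python
import math


def pairwise_ranking_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs.

    Each pair holds the scalar rewards a reward model assigned to the
    human-preferred ("chosen") and the other ("rejected") response.
    """
    if not chosen_rewards or len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("need two non-empty lists of equal length")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) == log(1 + exp(-m)), written in a form that
        # does not overflow for large |m|.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


# Hypothetical rewards for three prompt/response comparison pairs.
loss_good = pairwise_ranking_loss([2.0, 1.5, 0.7], [0.1, -0.3, 0.2])
loss_bad = pairwise_ranking_loss([0.1, -0.3, 0.2], [2.0, 1.5, 0.7])
print(loss_good < loss_bad)
```

The loss shrinks as the margin between chosen and rejected rewards grows, so only the relative ordering of responses matters, not the absolute reward scale.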
DeepSpeed Chat makes training for high-quality ChatGPT-style models easy, fast, affordable and scalable. With just one click, you can train, generate and serve a 1.3 billion parameter ChatGPT model within 1.36 hours on a single consumer-grade NVIDIA A6000 GPU with 48GB memory. On a single DGX node with 8 NVIDIA A100-40G GPUs, DeepSpeed-Chat enables training for a 13 billion parameter ChatGPT model in 13.6 hours. On multi-GPU multi-node systems (cloud scenarios), i.e., 8 DGX nodes with 8 NVIDIA A100 GPUs/node, DeepSpeed-Chat can train a 66 billion parameter ChatGPT model in under 9 hours. Finally, it enables 15X faster training over the existing RLHF systems, and can handle training of ChatGPT-like models with over 200 billion parameters: another impossible feat with the existing systems. For the full range of discussion on various model sizes and low training cost enabled by DeepSpeed-Chat, please refer to the [Release Blog](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat) and [Training Performance Evaluation](#-training-performance-evaluation-).

Beyond this release, the DeepSpeed system has been proudly serving as the system backend for accelerating a range of ongoing efforts for fast training/fine-tuning of chat-style models (e.g., LLaMA). The following are some of the open-source examples that are powered by DeepSpeed:

- [Databricks Dolly](https://github.com/databrickslabs/dolly)
- [LMFlow](https://github.com/OptimalScale/LMFlow)
- [CarperAI-TRLX](https://github.com/CarperAI/trlx)
- [Huggingface-PEFT](https://github.com/huggingface/peft)

## Capabilities

DeepSpeed Chat is evolving fast to accommodate the increasing demand for system-level acceleration support for training/finetuning as well as serving emerging models. Please stay tuned for our upcoming milestones at [Roadmap](#-deepspeed-chats-roadmap-).
A summary of DeepSpeed Chat includes:

+ **DeepSpeed Chat**: a complete end-to-end three-stage OpenAI InstructGPT training strategy with Reinforcement Learning from Human Feedback (RLHF), to generate high-quality ChatGPT-style models from users' favorite pre-trained large language model checkpoints;
+ **DeepSpeed Hybrid Engine**: a new system support for fast, affordable and scalable RLHF training at all scales. It is built upon your favorite DeepSpeed system capabilities, such as ZeRO technologies and DeepSpeed-Inference;
+ **Easy-breezy Training Experience**: a single script capable of taking a pre-trained Huggingface model and running it through all three steps of the RLHF training.
+ **A Universal System Support for Today's ChatGPT-like Model Training**: DeepSpeed Chat can serve as the system backend for not only the 3-step instruct-based RLHF pipeline, but also the current single-model finetuning exploration (e.g., LLaMA-centric finetuning) and generic RLHF training for various models and scenarios.

Please check out our [Blog Release](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat) and [Documentation and Tutorial](#-documentation-and-tutorial-) for more details on our training methodology and new system technologies.

## Quick Start

### Installation

```bash
# Quote the requirement so the shell does not treat ">=" as a redirection.
pip install "deepspeed>=0.9.0"

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt
```

### One Single Script Completes All Three Steps of RLHF Training and Generates Your First ChatGPT Model

**:yellow_heart: DeepSpeed-Chat's RLHF Example 1: Coffee Time Training for a 1.3B ChatGPT Model**

<details><summary> Expand </summary><p>

If you only have around **1-2 hours** for coffee or lunch break, you c
