OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieves information dynamically - CSDN Library

1576 files in total

json: 464

py: 358

tsx: 192

176 views · Uploaded 2024-01-29 11:03:48 · 7.55MB ZIP

Archive contents (1576 files)

Dockerfile.backend 654B

Dockerfile.backend-worker 646B

CODEOWNERS 865B

nginx.conf 3KB

nginx.conf 2KB

pgbackrest.conf 477B

postgres.conf 153B

prometheus.conf 135B

redis.conf 57B

redis.conf 46B

redis.conf 45B

custom.css 2KB

index.module.css 365B

styles.module.css 138B

globals.css 100B

data.csv 56B

Dockerfile.discord-bot 207B

Dockerfile 296B

.dockerignore 30B

.env 1KB

.env.example 537B

.env.example 259B

.gitattributes 50B

.gitignore 2KB

.gitignore 524B

.gitignore 358B

.gitignore 238B

.gitignore 200B

.gitignore 70B

.gitignore 54B

.gitignore 44B

.gitignore 39B

.gitignore 33B

.gitignore 22B

.gitignore 14B

.gitinclude 0B

.gitkeep 0B

component-index.html 387B

alembic.ini 3KB

test.inventory.ini 36B

detoxify-evaluation.ipynb 2.63MB

public.ipynb 392KB

Closed Book QA Generator.ipynb 191KB

TSSB-3M-bugs_dataset.ipynb 165KB

3_10k_bart_trial.ipynb 163KB

writing_prompt-checkpoint.ipynb 142KB

writing_prompt.ipynb 134KB

2_wikitext_doc2query.ipynb 119KB

stackexchange-builder.ipynb 87KB

tlcv2_0_oa.ipynb 84KB

wikidata.ipynb 64KB

getting-started.ipynb 37KB

unified-qa.ipynb 36KB

ubuntu_parser.ipynb 35KB

safety data-augmentation.ipynb 34KB

diverse.ipynb 30KB

data_processor.ipynb 28KB

project_gutenberg_crawler.ipynb 25KB

convert-to-instruction-format.ipynb 20KB

prosocial-confessions.ipynb 19KB

cmu_parser.ipynb 15KB

prosocial.ipynb 15KB

essay-revision.ipynb 14KB

imsdb.ipynb 14KB

iapp_wiki_qa_squad_oa.ipynb 13KB

dataset_creation.ipynb 13KB

dataset-cookbook.ipynb 12KB

essay-instructions.ipynb 11KB

openbugger_example.ipynb 10KB

tasty_recipes.ipynb 9KB

HumanEval_and_MBPP_code_gen.ipynb 9KB

HumanEval_and_MBPP_test_gen.ipynb 9KB

movie_descriptions.ipynb 9KB

Summarize_codesearchnet_for_python.ipynb 8KB

hippocorpus.ipynb 7KB

GenerateOpenAssistantInstructionResponseFormat.ipynb 6KB

oa_leet10k.ipynb 6KB

tell_a_joke.ipynb 6KB

example.ipynb 4KB

redmond_logo.jpg 9KB

av4.jpg 8KB

av1.jpg 7KB

av2.jpg 6KB

av3.jpg 6KB

av5.jpg 6KB

mockServiceWorker.js 8KB

docusaurus.config.js 4KB

tailwind.config.js 3KB

sidebars.js 2KB

preview.js 2KB

jest.setup.js 1KB

next.config.js 1KB

wikipedia_emergency_info.js 982B

jest.config.js 803B

cypress.config.js 798B

main.js 756B

<a href="https://github-com.translate.goog/LAION-AI/Open-Assistant/blob/main/oasst-data/README.md?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp">![Translate](https://img.shields.io/badge/Translate-blue)</a>

# Open Assistant Data Module (oasst_data)

## Installation of oasst_data

If you get the exception `ModuleNotFoundError: No module named 'oasst_data'`, you first need to install the `oasst_data` package: run `pip install -e .` in the `oasst-data/` directory of the Open-Assistant repository to install the `oasst_data` Python package in editable mode.

## Reading Open-Assistant Export Files

Reading jsonl files is generally very simple in Python. To further simplify the process for OA data, the `oasst_data` module comes with Pydantic class definitions for validation and helper functions to load and traverse message trees.

Code example:

```python
# parsing OA data files with oasst_data helpers
from oasst_data import read_message_trees, visit_messages_depth_first, ExportMessageNode

messages: list[ExportMessageNode] = []

input_file_path = "data_file.jsonl.gz"
for tree in read_message_trees(input_file_path):
    if tree.prompt.lang not in ["en", "es"]:  # filtering by language tag (optional)
        continue

    # example use of the depth-first tree visitor helper function
    visit_messages_depth_first(tree.prompt, visitor=messages.append, predicate=None)
```

A more comprehensive example of loading all conversation threads ending in assistant replies can be found in the file [oasst_dataset.py](https://github.com/LAION-AI/Open-Assistant/blob/main/model/model_training/custom_datasets/oasst_dataset.py), which is used to load Open-Assistant export data for supervised fine-tuning (training) of our language models.

You can also load jsonl data without any dependency on `oasst_data`, using only standard Python libraries.
In this case the JSON objects are loaded as nested dicts which you need to parse manually:

```python
# loading jsonl files without using oasst_data
import gzip
import json
from pathlib import Path

input_file_path = Path("data_file.jsonl.gz")  # same example file name as above

# open gzip-compressed and plain files alike in text mode
if input_file_path.suffix == ".gz":
    file_in = gzip.open(str(input_file_path), mode="rt", encoding="UTF-8")
else:
    file_in = input_file_path.open("r", encoding="UTF-8")

with file_in:
    # read one object per line
    for line in file_in:
        dict_tree = json.loads(line)
        # manual parsing of data now goes here ...
```

## Open-Assistant JSON Lines Export Data Format

Open-Assistant export data is written as standard [JSON Lines data](https://jsonlines.org/). The generated files are UTF-8 encoded text files with a single JSON object on each line. The files come either uncompressed with the ending `.jsonl` or compressed with the ending `.jsonl.gz`.

Three different types of objects can appear in these files:

1. Individual Messages
2. Conversation Threads
3. Message Trees

For readability the following JSON examples are shown formatted with indentation on multiple lines, although they are stored without indentation in the actual data file.

### 1. Individual Messages

Message objects can be identified by the presence of a `"message_id"` property. In files written by Open-Assistant this property will appear as the first property on the line, directly after the opening curly brace. Each message needs at least an id (UUID), message text, a role (either "prompter" or "assistant") and a language tag ([BCP 47](https://en.wikipedia.org/wiki/IETF_language_tag)) like "en" for English.
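
The minimum requirements stated above can be checked on a raw dict without `oasst_data`. The following is only a minimal sketch: the helper name `message_problems` and its messages are hypothetical, and the Pydantic classes shipped with `oasst_data` remain the package's own way to validate.

```python
import uuid

# Fields every message needs according to the description above.
REQUIRED_FIELDS = ("message_id", "text", "role", "lang")
VALID_ROLES = ("prompter", "assistant")


def message_problems(message: dict) -> list[str]:
    """Return a list of problems found in a raw message dict (hypothetical helper)."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if name not in message]

    if "message_id" in message:
        try:
            uuid.UUID(str(message["message_id"]))  # raises ValueError if not a UUID
        except ValueError:
            problems.append("message_id is not a UUID")

    if "role" in message and message["role"] not in VALID_ROLES:
        problems.append("role must be 'prompter' or 'assistant'")

    return problems
```

An empty list means the dict satisfies the minimum; the sketch does not check that `lang` is a well-formed BCP 47 tag.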

Minimal example of a message:

```json
{
  "message_id": "13714ad5-3161-4ead-9593-7248b0a3f218",
  "text": "List the pieces of a reinforcement learning system (..)",
  "role": "prompter",
  "lang": "en"
}
```

Example of a message with more properties:

```json
{
  "message_id": "218440fd-5317-4355-91dc-d001416df62b",
  "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
  "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
  "text": "It was the winter of 2035, and artificial intelligence (..)",
  "role": "assistant",
  "lang": "en",
  "review_count": 3,
  "review_result": true,
  "deleted": false,
  "rank": 0,
  "synthetic": true,
  "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
  "labels": {
    "spam": { "value": 0.0, "count": 3 },
    "lang_mismatch": { "value": 0.0, "count": 3 },
    "pii": { "value": 0.0, "count": 3 },
    "not_appropriate": { "value": 0.0, "count": 3 },
    "hate_speech": { "value": 0.0, "count": 3 },
    "sexual_content": { "value": 0.0, "count": 3 },
    "quality": { "value": 0.416, "count": 3 },
    "toxicity": { "value": 0.16, "count": 3 },
    "humor": { "value": 0.0, "count": 3 },
    "creativity": { "value": 0.33, "count": 3 },
    "violence": { "value": 0.16, "count": 3 }
  }
}
```

The backend export tool ([export.py](https://github.com/LAION-AI/Open-Assistant/blob/main/backend/export.py)) will generate jsonl files with individual messages when the exported set of messages is not a full tree. This is the case, for example, when filtering messages based on properties like user, deleted, spam or synthetic. Spam messages are those which have a `review_result` that is `false`.

### 2. Conversation Threads

Conversation threads are linear lists of messages. These objects can be identified by the presence of the `"thread_id"` property, which contains the UUID of the last message of the thread (which can be used to reconstruct the thread by returning the list of ancestor messages up to the prompt root message).
The `message_id` of the first message is normally also the id of the message tree that contains the thread.

```json
{
  "thread_id": "534c7711-afb5-4410-9006-489dc885280e",
  "thread": [
    {
      "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
      "text": "Why can't we divide by 0? (..)",
      "role": "prompter",
      "lang": "en"
    },
    {
      "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
      "text": "The reason we cannot divide by zero is because (..)",
      "role": "assistant",
      "lang": "en"
    },
    {
      "message_id": "1c9210e9-af9e-4507-abc5-3b3c7bca4dce",
      "text": "Can you explain why we created a definition (..)",
      "role": "prompter",
      "lang": "en"
    },
    {
      "message_id": "534c7711-afb5-4410-9006-489dc885280e",
      "text": "The historical origin of the imaginary (..)",
      "role": "assistant",
      "lang": "en"
    }
  ]
}
```

### 3. Message Trees

Message trees have a prompt message at the root and can then branch out into multiple reply branches, each of which can again have further replies. Message trees can be identified by the `"message_tree_id"` property. The `message_tree_id` always matches the id of the prompt message.

Example of a tree with minimal messages. For clarity, only the mandatory elements of each message are shown here; the full export format contains all the message attributes shown above in the full message example.

```json
{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "1c9210e9-af9e-4507-abc5-3b3c7bca4dce",
            "text": "Can you explain why we created a definition (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "534c7711-afb5-4410-9006-489dc885280e",
                "text": "The historical origin of the imaginary (..)",
                "role": "assistant",
                "lang": "en",
```
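
When loading the data as plain dicts, the identifying properties described in the three sections above are enough to tell the object types apart, and a tree can be flattened into its root-to-leaf threads with a short recursive walk over `replies`. The following is a minimal sketch under those assumptions; the names `object_kind` and `iter_threads` are hypothetical and not part of `oasst_data`.

```python
from typing import Iterator


def object_kind(obj: dict) -> str:
    """Classify one exported JSON object by its identifying property."""
    if "message_id" in obj:
        return "message"
    if "thread_id" in obj:
        return "thread"
    if "message_tree_id" in obj:
        return "tree"
    return "unknown"


def iter_threads(node: dict, prefix: tuple = ()) -> Iterator[list[dict]]:
    """Yield every path from the given message down to a leaf reply."""
    path = prefix + (node,)
    replies = node.get("replies") or []
    if not replies:
        # a message without replies ends one conversation thread
        yield list(path)
        return
    for reply in replies:
        yield from iter_threads(reply, path)
```

For a tree object loaded as `dict_tree`, `iter_threads(dict_tree["prompt"])` yields one list of message dicts per branch, each starting at the prompt. Unlike the loader in `oasst_dataset.py`, this sketch does not restrict the result to threads ending in assistant replies.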
