<a href="https://github-com.translate.goog/LAION-AI/Open-Assistant/blob/main/oasst-data/README.md?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp">![Translate](https://img.shields.io/badge/Translate-blue)</a>
# Open Assistant Data Module (oasst_data)
## Installation of oasst_data
If you get the exception `ModuleNotFoundError: No module named 'oasst_data'`, you
first need to install the `oasst_data` package.
Run `pip install -e .` in the `oasst-data/` directory of the Open-Assistant
repository to install the `oasst_data` Python package in editable mode.
## Reading Open-Assistant Export Files
Reading jsonl files is in general very simple in Python. To further simplify the
process for OA data the `oasst_data` module comes with Pydantic class
definitions for validation and helper functions to load and traverse message
trees.
Code example:
```python
# parsing OA data files with oasst_data helpers
from oasst_data import read_message_trees, visit_messages_depth_first, ExportMessageNode

messages: list[ExportMessageNode] = []

input_file_path = "data_file.jsonl.gz"
for tree in read_message_trees(input_file_path):
    if tree.prompt.lang not in ["en", "es"]:  # filtering by language tag (optional)
        continue

    # example use of the depth-first tree visitor helper function:
    # every message of the tree is appended to the messages list
    visit_messages_depth_first(tree.prompt, visitor=messages.append, predicate=None)
```
A more comprehensive example of loading all conversation threads ending in
assistant replies can be found in the file
[oasst_dataset.py](https://github.com/LAION-AI/Open-Assistant/blob/main/model/model_training/custom_datasets/oasst_dataset.py)
which is used to load Open-Assistant export data for supervised fine-tuning
(training) of our language models.
You can also load jsonl data without any dependency on `oasst_data`, using only
the Python standard library. In this case the JSON objects are loaded as nested
dicts which you need to parse manually:
```python
# loading jsonl files without using oasst_data
import gzip
import json
from pathlib import Path

input_file_path = Path("data_file.jsonl.gz")

# open gzip-compressed files in text mode, plain files directly
if input_file_path.suffix == ".gz":
    file_in = gzip.open(str(input_file_path), mode="rt", encoding="UTF-8")
else:
    file_in = input_file_path.open("r", encoding="UTF-8")

with file_in:
    # read one object per line
    for line in file_in:
        dict_tree = json.loads(line)
        # manual parsing of data now goes here ...
```
## Open-Assistant JSON Lines Export Data Format
Open-Assistant export data is written as standard
[JSON Lines data](https://jsonlines.org/). The generated files are UTF-8 encoded
text files with a single JSON object on each line. The files come either
uncompressed with the ending `.jsonl` or compressed with the ending `.jsonl.gz`.
Three different types of objects can appear in these files:
1. Individual Messages
2. Conversation Threads
3. Message Trees
For readability, the following JSON examples are shown formatted with indentation
on multiple lines, although they are stored without indentation in the actual
data file.
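Because each of the three object types carries its own identifying property (`message_id`, `thread_id` or `message_tree_id`, as described in the sections that follow), a loader can branch on the keys of the parsed dict. The following is only a minimal sketch of that idea: the function name `classify_export_object` is hypothetical and not part of `oasst_data`, and the check order is an assumption chosen so that a top-level `message_id` always wins.

```python
import json


def classify_export_object(obj: dict) -> str:
    """Return "message", "thread" or "tree" for a parsed export line.

    Hypothetical helper for illustration; not part of oasst_data.
    """
    # A top-level message_id marks an individual message. It is checked
    # first in case a message also carries other id-like properties.
    if "message_id" in obj:
        return "message"
    if "thread_id" in obj:
        return "thread"
    if "message_tree_id" in obj:
        return "tree"
    raise ValueError("unknown export object: no identifying property found")


line = (
    '{"message_id": "13714ad5-3161-4ead-9593-7248b0a3f218", '
    '"text": "List the pieces (..)", "role": "prompter", "lang": "en"}'
)
kind = classify_export_object(json.loads(line))
print(kind)
```

In a reading loop, the returned string could be used to dispatch each line to a type-specific handler.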
### 1. Individual Messages
Message objects can be identified by the presence of a `"message_id"` property.
In files written by Open-Assistant this property will appear as the first
property on the line directly after the opening curly brace.
Each message needs at least an id (UUID), message text, a role (either
"prompter" or "assistant") and a language tag
([BCP 47](https://en.wikipedia.org/wiki/IETF_language_tag)) like "en" for
English.
Minimal example of a message:
```json
{
  "message_id": "13714ad5-3161-4ead-9593-7248b0a3f218",
  "text": "List the pieces of a reinforcement learning system (..)",
  "role": "prompter",
  "lang": "en"
}
```
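When loading data as plain dicts, the four mandatory properties can be checked by hand. The sketch below is a stand-alone illustration under stated assumptions: `check_minimal_message` is a hypothetical name, and the checks only cover what is described above (presence of the fields, a parseable UUID, and one of the two roles). The Pydantic classes shipped with `oasst_data` remain the intended way to validate.

```python
import uuid

REQUIRED_FIELDS = ("message_id", "text", "role", "lang")
VALID_ROLES = ("prompter", "assistant")


def check_minimal_message(msg: dict) -> list[str]:
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical helper for illustration; not part of oasst_data.
    """
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in msg]
    if problems:
        return problems
    try:
        uuid.UUID(msg["message_id"])
    except (ValueError, AttributeError, TypeError):
        problems.append("message_id is not a valid UUID")
    if msg["role"] not in VALID_ROLES:
        problems.append(f"unexpected role: {msg['role']}")
    return problems


minimal_message = {
    "message_id": "13714ad5-3161-4ead-9593-7248b0a3f218",
    "text": "List the pieces of a reinforcement learning system (..)",
    "role": "prompter",
    "lang": "en",
}
print(check_minimal_message(minimal_message))
```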
Example of a message with more properties:
```json
{
  "message_id": "218440fd-5317-4355-91dc-d001416df62b",
  "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
  "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
  "text": "It was the winter of 2035, and artificial intelligence (..)",
  "role": "assistant",
  "lang": "en",
  "review_count": 3,
  "review_result": true,
  "deleted": false,
  "rank": 0,
  "synthetic": true,
  "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
  "labels": {
    "spam": { "value": 0.0, "count": 3 },
    "lang_mismatch": { "value": 0.0, "count": 3 },
    "pii": { "value": 0.0, "count": 3 },
    "not_appropriate": { "value": 0.0, "count": 3 },
    "hate_speech": { "value": 0.0, "count": 3 },
    "sexual_content": { "value": 0.0, "count": 3 },
    "quality": { "value": 0.416, "count": 3 },
    "toxicity": { "value": 0.16, "count": 3 },
    "humor": { "value": 0.0, "count": 3 },
    "creativity": { "value": 0.33, "count": 3 },
    "violence": { "value": 0.16, "count": 3 }
  }
}
```
The backend export tool
([export.py](https://github.com/LAION-AI/Open-Assistant/blob/main/backend/export.py))
will generate jsonl files with individual messages when a set of messages is
exported that is not a full tree. This is for example the case when filtering
messages based on properties like user, deleted, spam or synthetic. Spam
messages are those which have a `review_result` that is `false`.
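The same kind of filtering can be approximated on already-loaded message dicts. This is a rough sketch and not the logic of `export.py`: the names `is_spam` and `select_messages` are hypothetical, the ids `m1` to `m4` are placeholders rather than real UUIDs, and treating a missing `review_result`, `deleted` or `synthetic` property as "not set" is an assumption.

```python
def is_spam(msg: dict) -> bool:
    """A message counts as spam when its review_result is explicitly false."""
    # Assumption: a missing review_result means "not reviewed", not spam.
    return msg.get("review_result") is False


def select_messages(messages, include_spam=False, include_deleted=False, include_synthetic=True):
    """Return the messages that pass the given filters (hypothetical helper)."""
    selected = []
    for msg in messages:
        if not include_spam and is_spam(msg):
            continue
        if not include_deleted and msg.get("deleted", False):
            continue
        if not include_synthetic and msg.get("synthetic", False):
            continue
        selected.append(msg)
    return selected


# Placeholder messages reduced to the properties used by the filters.
sample = [
    {"message_id": "m1", "review_result": True, "deleted": False, "synthetic": False},
    {"message_id": "m2", "review_result": False, "deleted": False, "synthetic": False},
    {"message_id": "m3", "review_result": True, "deleted": True, "synthetic": False},
    {"message_id": "m4", "review_result": True, "deleted": False, "synthetic": True},
]
kept_ids = [m["message_id"] for m in select_messages(sample)]
print(kept_ids)
```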
### 2. Conversation Threads
Conversation threads are linear lists of messages. These objects can be
identified by the presence of the `"thread_id"` property which contains the UUID
of the last message of the thread (which can be used to reconstruct the thread
by returning the list of ancestor messages up to the prompt root message). The
`message_id` of the first message is normally also the id of the message tree that
contains the thread.
```json
{
  "thread_id": "534c7711-afb5-4410-9006-489dc885280e",
  "thread": [
    {
      "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
      "text": "Why can't we divide by 0? (..)",
      "role": "prompter",
      "lang": "en"
    },
    {
      "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
      "text": "The reason we cannot divide by zero is because (..)",
      "role": "assistant",
      "lang": "en"
    },
    {
      "message_id": "1c9210e9-af9e-4507-abc5-3b3c7bca4dce",
      "text": "Can you explain why we created a definition (..)",
      "role": "prompter",
      "lang": "en"
    },
    {
      "message_id": "534c7711-afb5-4410-9006-489dc885280e",
      "text": "The historical origin of the imaginary (..)",
      "role": "assistant",
      "lang": "en"
    }
  ]
}
```
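The reconstruction mentioned above (walking from the last message up through its ancestors to the root prompt) might look roughly as follows. This is a sketch under assumptions: `reconstruct_thread` is a hypothetical name, the messages are assumed to be available as individual messages with a `parent_id` property, the root prompt is assumed to have no `parent_id` (or `null`), and the short ids are placeholders rather than real UUIDs.

```python
def reconstruct_thread(messages_by_id: dict, thread_id: str) -> list:
    """Return the messages from the root prompt down to the message thread_id.

    Hypothetical helper for illustration; not part of oasst_data.
    """
    thread = []
    current_id = thread_id
    while current_id is not None:
        msg = messages_by_id[current_id]
        thread.append(msg)
        # Assumption: the root prompt has no parent_id (missing or None).
        current_id = msg.get("parent_id")
    thread.reverse()  # collected leaf-to-root, returned root-to-leaf
    return thread


# Placeholder messages; "other" is a sibling reply that is not on the thread.
messages = [
    {"message_id": "root", "parent_id": None, "role": "prompter"},
    {"message_id": "reply", "parent_id": "root", "role": "assistant"},
    {"message_id": "other", "parent_id": "root", "role": "assistant"},
    {"message_id": "followup", "parent_id": "reply", "role": "prompter"},
]
messages_by_id = {m["message_id"]: m for m in messages}

thread = reconstruct_thread(messages_by_id, thread_id="followup")
print([m["message_id"] for m in thread])
```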
### 3. Message Trees
Message trees have a prompt message at the root and can then branch out into
multiple different reply branches, each of which can again have further replies.
Message trees can be identified by the `"message_tree_id"` property. The
`message_tree_id` always matches the id of the prompt message.
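When a tree is loaded as nested dicts, its branches can be walked recursively through the `replies` lists. The following sketch enumerates every path from the prompt to a leaf message; `iter_threads` is a hypothetical name, the ids are placeholders rather than real UUIDs, and treating a missing or empty `replies` property as a leaf is an assumption.

```python
from typing import Iterator


def iter_threads(node: dict, prefix: tuple = ()) -> Iterator[list]:
    """Yield each root-to-leaf path of a message tree as a list of messages.

    Hypothetical helper for illustration; not part of oasst_data.
    """
    path = prefix + (node,)
    # Assumption: a leaf has no "replies" property or an empty list.
    replies = node.get("replies") or []
    if not replies:
        yield list(path)
        return
    for reply in replies:
        yield from iter_threads(reply, path)


# Placeholder tree: one prompt with two assistant replies, one of which
# received a follow-up prompt.
tree = {
    "message_tree_id": "p",
    "prompt": {
        "message_id": "p",
        "role": "prompter",
        "replies": [
            {
                "message_id": "a1",
                "role": "assistant",
                "replies": [{"message_id": "p2", "role": "prompter", "replies": []}],
            },
            {"message_id": "a2", "role": "assistant", "replies": []},
        ],
    },
}

thread_ids = [[m["message_id"] for m in t] for t in iter_threads(tree["prompt"])]
print(thread_ids)
```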
Example of a tree with minimal messages:
For clarity only the mandatory elements of the message are shown here. The full
export format contains all the message attributes as shown above in the full
message example.
```json
{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "1c9210e9-af9e-4507-abc5-3b3c7bca4dce",
            "text": "Can you explain why we created a definition (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "534c7711-afb5-4410-9006-489dc885280e",
                "text": "The historical origin of the imaginary (..)",
                "role": "assistant",
                "lang": "en",