# CrossWOZ
CrossWOZ is the first large-scale Chinese Cross-Domain Wizard-of-Oz task-oriented dataset. It contains **6K** dialogue sessions and **102K** utterances across 5 domains: hotel, restaurant, attraction, metro, and taxi. The corpus also contains rich annotations of dialogue states and dialogue acts on both the user and system sides. In addition, we provide a user simulator and several benchmark models for pipelined task-oriented dialogue systems, which should help researchers compare and evaluate their models on this corpus.
Refer to our paper for more details: [CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset](https://arxiv.org/abs/2002.11893) (accepted by TACL)
If you have any questions, feel free to open an issue.
## Data
An example dialogue (hotel names are replaced by A, B, C for simplicity):
![example](example.png)
The data is in the `data/crosswoz` directory. Data statistics:
| Split | Train | Valid | Test |
| --------------------- | ------ | ----- | ----- |
| \# dialogues | 5,012 | 500 | 500 |
| \# Turns (utterances) | 84,692 | 8,458 | 8,476 |
| Vocab | 12,502 | 5,202 | 5,143 |
| Avg. sub-goals | 3.24 | 3.26 | 3.26 |
| Avg. semantic tuples | 14.8 | 14.9 | 15.0 |
| Avg. turns | 16.9 | 16.9 | 17.0 |
| Avg. tokens per turn | 16.3 | 16.3 | 16.2 |
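As a quick sanity check, the per-split counts in the table can be summed to recover the headline figures quoted in the introduction (6K dialogue sessions, 102K utterances). The snippet below is a minimal sketch that hard-codes the table values; the variable names are illustrative and not part of the repository.

```python
# Per-split counts copied from the statistics table above.
dialogues = {"train": 5012, "valid": 500, "test": 500}
turns = {"train": 84692, "valid": 8458, "test": 8476}

total_dialogues = sum(dialogues.values())
total_turns = sum(turns.values())

print(total_dialogues)  # 6012, i.e. about 6K
print(total_turns)      # 101626, i.e. about 102K

# Average turns per dialogue for each split, rounded to one decimal as in the table.
avg_turns = {split: round(turns[split] / dialogues[split], 1) for split in dialogues}
print(avg_turns)  # {'train': 16.9, 'valid': 16.9, 'test': 17.0}
```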
According to the type of user goal, we group the dialogues in the **training set** into five categories (HAR denotes the hotel, attraction, and restaurant domains):
- **S**: 417 dialogues have only one sub-goal in HAR domains.
- **M**: 1,573 dialogues have multiple sub-goals (2-3) in HAR domains. However, these sub-goals do not have cross-domain informable slots.
- **M+T**: 691 dialogues have multiple sub-goals in HAR domains and at least one sub-goal in the metro or taxi domain (3-5 sub-goals). The sub-goals in HAR domains do not have cross-domain informable slots.
- **CM**: 1,759 dialogues have multiple sub-goals (2-5) in HAR domains with cross-domain informable slots.
- **CM+T**: 572 dialogues have multiple sub-goals in HAR domains with cross-domain informable slots and at least one sub-goal in the metro or taxi domain (3-5 sub-goals).
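The five categories above can be read as a small decision rule. The following is a sketch of that rule only, not the authors' code: the `Goal` class, its field names, and the `goal_type` function are hypothetical, HAR is taken to mean the hotel, attraction, and restaurant domains, and whether a goal has cross-domain informable slots is assumed to be known in advance. The sub-goal count ranges given in the list are not checked.

```python
from dataclasses import dataclass
from typing import List

HAR_DOMAINS = {"hotel", "attraction", "restaurant"}
TRANSPORT_DOMAINS = {"metro", "taxi"}


@dataclass
class Goal:
    """Hypothetical, simplified view of a user goal."""
    sub_goal_domains: List[str]   # one domain name per sub-goal
    has_cross_domain_slot: bool   # assumed to be precomputed


def goal_type(goal: Goal) -> str:
    """Map a goal to S, M, M+T, CM, or CM+T following the list above."""
    har = sum(d in HAR_DOMAINS for d in goal.sub_goal_domains)
    transport = sum(d in TRANSPORT_DOMAINS for d in goal.sub_goal_domains)
    if har == 0:
        raise ValueError("goal has no sub-goal in HAR domains")
    if har == 1:
        if transport == 0:
            return "S"
        # A single HAR sub-goal plus metro/taxi is not one of the five categories.
        raise ValueError("combination not covered by the five categories")
    base = "CM" if goal.has_cross_domain_slot else "M"
    return base + "+T" if transport > 0 else base


print(goal_type(Goal(["hotel"], False)))                        # S
print(goal_type(Goal(["hotel", "restaurant"], False)))          # M
print(goal_type(Goal(["hotel", "attraction", "taxi"], True)))   # CM+T
```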
Statistics for dialogues of different goal types in the training set:
| Goal type | S | M | M+T | CM | CM+T |
| -------------------- | ---- | ---- | ---- | ---- | ---- |
| \# dialogues | 417 | 1573 | 691 | 1759 | 572 |
| NoOffer rate | 0.10 | 0.22 | 0.22 | 0.61 | 0.55 |
| Multi-query rate | 0.06 | 0.07 | 0.07 | 0.14 | 0.12 |
| Goal change rate | 0.10 | 0.28 | 0.31 | 0.69 | 0.63 |
| Avg. dialogue acts | 1.85 | 1.90 | 2.09 | 2.06 | 2.11 |
| Avg. sub-goals | 1.00 | 2.49 | 3.62 | 3.87 | 4.57 |
| Avg. semantic tuples | 4.5 | 11.3 | 15.8 | 18.2 | 20.7 |
| Avg. turns | 6.8 | 13.7 | 16.0 | 21.0 | 21.6 |
| Avg. tokens per turn | 13.2 | 15.2 | 16.3 | 16.9 | 17.0 |
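The per-type figures are consistent with the training-set column of the split statistics: the dialogue counts sum to 5,012, and weighting each per-type average by its dialogue count reproduces the overall training averages (3.24 sub-goals, 16.9 turns). A minimal sketch of that check, using only values copied from the two tables (the helper name `weighted_mean` is illustrative):

```python
# Values copied from the goal-type table above (training set).
counts = {"S": 417, "M": 1573, "M+T": 691, "CM": 1759, "CM+T": 572}
avg_sub_goals = {"S": 1.00, "M": 2.49, "M+T": 3.62, "CM": 3.87, "CM+T": 4.57}
avg_turns = {"S": 6.8, "M": 13.7, "M+T": 16.0, "CM": 21.0, "CM+T": 21.6}


def weighted_mean(values, weights):
    """Average of per-type values, weighted by the number of dialogues per type."""
    total = sum(weights.values())
    return sum(values[k] * weights[k] for k in weights) / total


total_dialogues = sum(counts.values())
print(total_dialogues)                                  # 5012
print(round(weighted_mean(avg_sub_goals, counts), 2))   # 3.24
print(round(weighted_mean(avg_turns, counts), 1))       # 16.9
```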
We also provide the database in `data/crosswoz/database`.
## Code
Please install the package from the repository root via:
```
pip install -e .
```
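After installing, one way to confirm that Python can find the package is a quick import check. This is a sketch using only the standard library; the top-level package name `convlab2` is an assumption inferred from the module paths listed in this README, and `is_importable` is an illustrative helper.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # `convlab2` is assumed from the paths in this README.
    print("convlab2 importable:", is_importable("convlab2"))
```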
Code:
- BERTNLU: `convlab2/nlu/jointBERT/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/bert_crosswoz_all_context.zip
- RuleDST: `convlab2/dst/rule/crosswoz`
- TRADE: `convlab2/dst/trade/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/trade_crosswoz_model.zip
  - Preprocessed data: https://convlab.blob.core.windows.net/convlab-2/trade_crosswoz_data.zip
- SL policy: `convlab2/policy/mle/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/mle_policy_crosswoz.zip
- SCLSTM: `convlab2/nlg/sclstm/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/nlg_sclstm_crosswoz.zip
- TemplateNLG: `convlab2/nlg/template/crosswoz`
- User simulator: `convlab2/policy/rule/crosswoz`
- Evaluate with user simulator: `convlab2/policy/mle/crosswoz/evaluate.py`
Result:
![result](result.png)
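The trained-model archives listed under Code can be fetched and unpacked with the Python standard library. The sketch below makes several assumptions: the dictionary keys, the helper names, and the target directory are illustrative; this README does not say where each component expects its model to be unpacked; and the URLs may no longer be reachable.

```python
import urllib.request
import zipfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

BASE = "https://convlab.blob.core.windows.net/convlab-2/"

# URLs copied from the list above; the keys are illustrative labels.
MODEL_URLS = {
    "bertnlu": BASE + "bert_crosswoz_all_context.zip",
    "trade": BASE + "trade_crosswoz_model.zip",
    "mle_policy": BASE + "mle_policy_crosswoz.zip",
    "sclstm": BASE + "nlg_sclstm_crosswoz.zip",
}


def archive_name(url: str) -> str:
    """Return the file name at the end of a URL path."""
    return PurePosixPath(urlparse(url).path).name


def download_and_extract(url: str, target_dir: str) -> Path:
    """Download a zip archive (if not already present) and unpack it into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive = target / archive_name(url)
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


if __name__ == "__main__":
    # "downloads/trade" is a hypothetical location chosen for this example.
    download_and_extract(MODEL_URLS["trade"], "downloads/trade")
```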
## Citing
Please cite our paper if you find the paper or the dataset helpful.
```
@article{zhu2020crosswoz,
author = {Qi Zhu and Kaili Huang and Zheng Zhang and Xiaoyan Zhu and Minlie Huang},
title = {Cross{WOZ}: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset},
journal = {Transactions of the Association for Computational Linguistics},
year = {2020}
}
```