CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset

Archive contents: 139 files in total (py: 100, json: 16, md: 8); 17.72MB ZIP, uploaded 2021-04-14 10:18:08.

CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset (139 files)

config_usr.cfg 414B

config.cfg 395B

setup.cfg 78B

.gitignore 949B

MANIFEST.in 67B

auto_user_template_nlg.json 3.94MB

auto_system_template_nlg.json 3.5MB

restaurant_db.json 1.27MB

hotel_db.json 1.18MB

metro_db.json 501KB

attraction_db.json 493KB

manual_system_template_nlg.json 74KB

manual_user_template_nlg.json 29KB

usr_da_voc.json 6KB

sys_da_voc.json 6KB

crosswoz_all_context_fr.json 675B

crosswoz_all_context.json 660B

crosswoz_all_fr.json 643B

crosswoz_all.json 629B

config.json 170B

taxi_db.json 119B

LICENSE 11KB

README.md 4KB

README.md 3KB

README.md 2KB

database.md 2KB

README.md 2KB

README.md 612B

README.md 342B

PULL_REQUEST_TEMPLATE.md 83B

mapping.pair 1KB

multi-bleu.perl 5KB

example.png 674KB

result.png 406KB

utils_multiWOZ_DST.py 36KB

trade.py 33KB

TRADE.py 29KB

nlg.py 19KB

generate_resources.py 19KB

goal_generator.py 18KB

utils_temp.py 18KB

rule_simulator.py 17KB

train.py 14KB

allennlp_file_utils.py 11KB

decoder_deep.py 11KB

sc_lstm.py 11KB

generate_auto_template.py 11KB

evaluate.py 10KB

train.py 10KB

dbquery.py 10KB

dataset_woz.py 9KB

evaluate.py 9KB

evaluate.py 8KB

hotel_generator.py 7KB

dataloader.py 7KB

GEM_train.py 6KB

rlmodule.py 6KB

masked_cross_entropy.py 6KB

session.py 6KB

agent.py 6KB

train.py 5KB

restaurant_generator.py 5KB

test.py 5KB

bleu.py 5KB

sentence_generator.py 5KB

jointBERT.py 5KB

EWC_train.py 5KB

dst.py 4KB

preprocess.py 4KB

lexicalize.py 4KB

config.py 4KB

measures.py 4KB

nlu.py 4KB

fix_label.py 4KB

train.py 4KB

loader.py 4KB

vector_crosswoz.py 4KB

train.py 4KB

attraction_generator.py 4KB

lm_deep.py 3KB

reorder.py 3KB

analyse.py 3KB

postprocess.py 3KB

fine_tune.py 3KB

evaluate.py 3KB

logger.py 2KB

loader.py 2KB

mle.py 2KB

setup.py 2KB

mle.py 2KB

masked_cross_entropy.py 2KB

state.py 1KB

taxi_generator.py 1020B

gen_da_voc.py 1009B

cnembedding.py 976B

metro_generator.py 949B

env.py 875B

train_util.py 819B

dataset.py 761B


# CrossWOZ

CrossWOZ is the first large-scale Chinese Cross-Domain Wizard-of-Oz task-oriented dataset. It contains **6K** dialogue sessions and **102K** utterances for 5 domains: hotel, restaurant, attraction, metro, and taxi. The corpus also contains rich annotation of dialogue states and dialogue acts on both the user and system sides.

We also provide a user simulator and several benchmark models for pipelined task-oriented dialogue systems, which help researchers compare and evaluate their models on this corpus.

Refer to our paper for more details: [CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset](https://arxiv.org/abs/2002.11893) (accepted by TACL)

If you have any questions, feel free to open an issue.

## Data

A piece of dialogue (names of hotels are replaced by A, B, C for simplicity):

![example](example.png)

The data is in the `data/crosswoz` directory.

Data statistics:

| Split                 | Train  | Valid | Test  |
| --------------------- | ------ | ----- | ----- |
| \# dialogues          | 5,012  | 500   | 500   |
| \# Turns (utterances) | 84,692 | 8,458 | 8,476 |
| Vocab                 | 12,502 | 5,202 | 5,143 |
| Avg. sub-goals        | 3.24   | 3.26  | 3.26  |
| Avg. semantic tuples  | 14.8   | 14.9  | 15.0  |
| Avg. turns            | 16.9   | 16.9  | 17.0  |
| Avg. tokens per turn  | 16.3   | 16.3  | 16.2  |

According to the type of user goal, we group the dialogues in the **training set** into five categories (HAR stands for the hotel, attraction, and restaurant domains):

- **S**: 417 dialogues have only one sub-goal in HAR domains.
- **M**: 1,573 dialogues have multiple sub-goals (2-3) in HAR domains. However, these sub-goals do not have cross-domain informable slots.
- **M+T**: 691 dialogues have multiple sub-goals in HAR domains and at least one sub-goal in the metro or taxi domain (3-5 sub-goals). The sub-goals in HAR domains do not have cross-domain informable slots.
- **CM**: 1,759 dialogues have multiple sub-goals (2-5) in HAR domains with cross-domain informable slots.
- **CM+T**: 572 dialogues have multiple sub-goals in HAR domains with cross-domain informable slots and at least one sub-goal in the metro or taxi domain (3-5 sub-goals).

Statistics for dialogues of different goal types in the training set:

| Goal type            | S    | M     | M+T  | CM    | CM+T |
| -------------------- | ---- | ----- | ---- | ----- | ---- |
| \# dialogues         | 417  | 1,573 | 691  | 1,759 | 572  |
| NoOffer rate         | 0.10 | 0.22  | 0.22 | 0.61  | 0.55 |
| Multi-query rate     | 0.06 | 0.07  | 0.07 | 0.14  | 0.12 |
| Goal change rate     | 0.10 | 0.28  | 0.31 | 0.69  | 0.63 |
| Avg. dialogue acts   | 1.85 | 1.90  | 2.09 | 2.06  | 2.11 |
| Avg. sub-goals       | 1.00 | 2.49  | 3.62 | 3.87  | 4.57 |
| Avg. semantic tuples | 4.5  | 11.3  | 15.8 | 18.2  | 20.7 |
| Avg. turns           | 6.8  | 13.7  | 16.0 | 21.0  | 21.6 |
| Avg. tokens per turn | 13.2 | 15.2  | 16.3 | 16.9  | 17.0 |

We also provide the database in `data/crosswoz/database`.

## Code

Please install via:

```
pip install -e .
```

Code:

- BERTNLU: `convlab2/nlu/jointBERT/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/bert_crosswoz_all_context.zip
- RuleDST: `convlab2/dst/rule/crosswoz`
- TRADE: `convlab2/dst/trade/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/trade_crosswoz_model.zip
  - Preprocessed data: https://convlab.blob.core.windows.net/convlab-2/trade_crosswoz_data.zip
- SL policy: `convlab2/policy/mle/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/mle_policy_crosswoz.zip
- SCLSTM: `convlab2/nlg/sclstm/crosswoz`
  - Trained model: https://convlab.blob.core.windows.net/convlab-2/nlg_sclstm_crosswoz.zip
- TemplateNLG: `convlab2/nlg/template/crosswoz`
- User simulator: `convlab2/policy/rule/crosswoz`
- Evaluate with user simulator: `convlab2/policy/mle/crosswoz/evaluate.py`

Result:

![result](result.png)

## Citing

Please cite our paper if you find the paper or the dataset helpful.

```
@article{zhu2020crosswoz,
    author = {Qi Zhu and Kaili Huang and Zheng Zhang and Xiaoyan Zhu and Minlie Huang},
    title = {Cross{WOZ}: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset},
    journal = {Transactions of the Association for Computational Linguistics},
    year = {2020}
}
```
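
The five goal-type categories described in the Data section (S, M, M+T, CM, CM+T) can be read as a small decision rule. The sketch below is only an illustration of that rule as stated in the text, not the code used to build the dataset. The function name `goal_type`, the English domain strings, and the boolean flag `has_cross_domain_slot` are all hypothetical; the actual data may encode domains and goals differently.

```python
from typing import Optional, Sequence

# Assumption: domains are given as the English names used in this README.
HAR_DOMAINS = {"hotel", "attraction", "restaurant"}
TRANSPORT_DOMAINS = {"metro", "taxi"}


def goal_type(sub_goal_domains: Sequence[str],
              has_cross_domain_slot: bool) -> Optional[str]:
    """Map a user goal to S / M / M+T / CM / CM+T (hypothetical helper).

    sub_goal_domains: one domain name per sub-goal.
    has_cross_domain_slot: True if the HAR sub-goals share a
        cross-domain informable slot.
    Returns None for combinations the README does not describe.
    """
    har_count = sum(1 for d in sub_goal_domains if d in HAR_DOMAINS)
    has_transport = any(d in TRANSPORT_DOMAINS for d in sub_goal_domains)

    # S: exactly one HAR sub-goal and nothing else.
    if har_count == 1 and not has_transport and not has_cross_domain_slot:
        return "S"
    # Every other category needs multiple HAR sub-goals.
    if har_count < 2:
        return None

    label = "CM" if has_cross_domain_slot else "M"
    return label + "+T" if has_transport else label


if __name__ == "__main__":
    print(goal_type(["hotel"], False))
    print(goal_type(["hotel", "restaurant", "taxi"], False))
    print(goal_type(["hotel", "attraction", "metro"], True))
```

The rule does not check the sub-goal count ranges quoted in the text (for example 2-3 for M); those are treated here as descriptive statistics rather than constraints.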
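
The two statistics tables in the Data section can be cross-checked against each other: the per-goal-type dialogue counts should add up to the size of the training set, and the count-weighted averages should agree with the Train column. The following sketch does that arithmetic using only numbers copied from the tables; the variable names are illustrative.

```python
# (dialogues, avg. sub-goals, avg. turns) per goal type, from the README table.
GOAL_TYPES = {
    "S": (417, 1.00, 6.8),
    "M": (1573, 2.49, 13.7),
    "M+T": (691, 3.62, 16.0),
    "CM": (1759, 3.87, 21.0),
    "CM+T": (572, 4.57, 21.6),
}

# Train column of the split table.
TRAIN_DIALOGUES = 5012
TRAIN_TURNS = 84692

total_dialogues = sum(n for n, _, _ in GOAL_TYPES.values())

# Count-weighted averages over the five goal types.
weighted_sub_goals = sum(n * s for n, s, _ in GOAL_TYPES.values()) / total_dialogues
weighted_turns = sum(n * t for n, _, t in GOAL_TYPES.values()) / total_dialogues

# Average turns computed directly from the split table.
turns_per_dialogue = TRAIN_TURNS / TRAIN_DIALOGUES

print(f"dialogues by goal type: {total_dialogues} (train split: {TRAIN_DIALOGUES})")
print(f"weighted avg. sub-goals: {weighted_sub_goals:.2f}")
print(f"weighted avg. turns:     {weighted_turns:.1f}")
print(f"turns / dialogues:       {turns_per_dialogue:.1f}")
```

With these inputs the counts sum to 5,012, and the weighted averages round to 3.24 sub-goals and 16.9 turns per dialogue, which matches the Train column.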
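
Statistics such as "Avg. turns" can be recomputed once a split is loaded into memory. The sketch below assumes, without confirmation from this README, that a split is a JSON object mapping dialogue IDs to dialogues, and that each dialogue has a `messages` list whose items carry a `content` string. The helper `summarize` and the toy data are hypothetical, and character counts are used only as a rough stand-in for the tokenized counts reported in the tables.

```python
from typing import Any, Dict


def summarize(dialogues: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Compute simple size statistics for one split (hypothetical helper).

    Assumed schema: {dialogue_id: {"messages": [{"content": str, ...}, ...]}}
    """
    n_dialogues = len(dialogues)
    n_turns = sum(len(d["messages"]) for d in dialogues.values())
    n_chars = sum(
        len(m["content"]) for d in dialogues.values() for m in d["messages"]
    )
    return {
        "dialogues": n_dialogues,
        "turns": n_turns,
        "avg_turns": n_turns / n_dialogues if n_dialogues else 0.0,
        # Characters per turn, not tokens: no tokenizer is applied here.
        "avg_chars_per_turn": n_chars / n_turns if n_turns else 0.0,
    }


# Toy data in the assumed schema; not taken from the dataset.
toy_split = {
    "d1": {"messages": [{"role": "usr", "content": "你好"},
                        {"role": "sys", "content": "您好"}]},
    "d2": {"messages": [{"role": "usr", "content": "推荐酒店"},
                        {"role": "sys", "content": "推荐A"},
                        {"role": "usr", "content": "谢谢"},
                        {"role": "sys", "content": "再见"}]},
}

if __name__ == "__main__":
    print(summarize(toy_split))
```

For real data, the same function could be applied to the result of `json.load` on an extracted split file, provided the file follows the assumed schema.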