This folder contains the "Cross-lingual Natural Language Inference" (XNLI) data.
XNLI consists of dev and test sets for the MNLI task, in 15 languages:
ar: Arabic
bg: Bulgarian
de: German
el: Greek
en: English
es: Spanish
fr: French
hi: Hindi
ru: Russian
sw: Swahili
th: Thai
tr: Turkish
ur: Urdu
vi: Vietnamese
zh: Chinese (Simplified)
Please consider citing [1] if using this dataset:
@InProceedings{conneau2018xnli,
author = "Conneau, Alexis
and Rinott, Ruty
and Lample, Guillaume
and Williams, Adina
and Bowman, Samuel R.
and Schwenk, Holger
and Stoyanov, Veselin",
title = "XNLI: Evaluating Cross-lingual Sentence Representations",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
year = "2018",
publisher = "Association for Computational Linguistics",
location = "Brussels, Belgium",
}
** The XNLI corpus (XNLI-1.0.zip) **
The files xnli.dev.{tsv,jsonl} and xnli.test.{tsv,jsonl} contain the dev and test sets of the 15 languages of XNLI. Each language has 2490 dev samples and 5010 test samples.
If you plan to report results publicly, you should not use the test set for tuning, early stopping, or model selection.
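The per-language counts stated above can be sanity-checked with a short script. This is a minimal sketch, not part of the official release: it assumes each line of the `.jsonl` files is a JSON object carrying a `language` field with the two-letter code, and the path in the usage comment is only a guess at where the archive was unpacked.

```python
import json
from collections import Counter


def count_by_language(lines):
    """Count examples per language code in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "language" is an assumed field name; adjust it if your copy differs.
        counts[record["language"]] += 1
    return counts


# Usage (the path is an assumption about where the archive was unpacked):
# with open("XNLI-1.0/xnli.dev.jsonl", encoding="utf-8") as f:
#     for lang, n in sorted(count_by_language(f).items()):
#         print(lang, n)
```

If the assumptions hold, the dev file should report 2490 examples for each of the 15 language codes, and the test file 5010.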
The format follows that of MultiNLI [2], with additional columns for tokenized text. We use the MOSES tokenizer [3] for all languages, falling back to the default English tokenizer when necessary, except for Chinese, where we use the Stanford `pku` segmenter [4]. Some columns are deliberately left empty.
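As an illustration of reading the tab-separated files, the sketch below pulls premise/hypothesis/label triples for one language. The column names `language`, `gold_label`, `sentence1`, and `sentence2` are assumptions based on the MultiNLI layout, so check them against the header row of your copy. The two sample rows are made up for the demonstration and are not taken from the dataset.

```python
import csv
import io


def read_pairs(tsv_file, language):
    """Yield (premise, hypothesis, label) for one language from an open TSV file."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Column names below are assumed; empty columns come back as "".
        if row["language"] == language:
            yield row["sentence1"], row["sentence2"], row["gold_label"]


# Made-up sample with the assumed header, only to show the call pattern.
sample = (
    "language\tgold_label\tsentence1\tsentence2\n"
    "en\tentailment\tA man is running.\tA person moves.\n"
    'fr\tneutral\tIl a dit "oui".\tIl est content.\n'
)
pairs = list(read_pairs(io.StringIO(sample), "fr"))

# With a real file (the path is an assumption):
# with open("XNLI-1.0/xnli.dev.tsv", encoding="utf-8", newline="") as f:
#     zh_pairs = list(read_pairs(f, "zh"))
```

Reading by header name rather than by column position keeps the sketch robust to the columns that are deliberately left empty.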
These dev and test sets are new, and are not derived from the dev and test sets of MultiNLI.
XNLI was created using professional translators who translated the new English dev and test sets.
** Machine Translation Baselines (XNLI-MT-1.0.zip) **
We provide machine-translated data for people to reproduce the MT baselines of the XNLI paper.
The XNLI-MT-1.0/XNLI subfolder contains the XNLI dev/test data in which the premise and hypothesis of each language $lg have been automatically translated into English.
These datasets are used in the "test-translation" baseline of the XNLI paper.
The XNLI-MT-1.0/multiNLI subfolder contains neural machine translations of the MultiNLI training data into each language $lg.
These datasets are used in the "train-translation" baseline of the XNLI paper.
** References **
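[1] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
[2] Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.
[3] Koehn P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL.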
[4] https://nlp.stanford.edu/software/segmenter.shtml