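Chinese and English Named Entity Recognition Datasets, Chinese-English Machine Translation Datasets, and Chinese Word Segmentation Datasets (.zip resource, CSDN Library)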

24 files in total

txt: 8

train: 3

dev: 3


78 views · Uploaded 2023-10-19 22:26:35 · 13.33 MB · ZIP


中英文实体识别数据集，中英文机器翻译数据集,中文分词数据集.zip (24 files)

nlp-public-dataset-master/
  ner-data/
    boson/
      data_util.py 4KB
      origindata.txt 1.78MB
      license.txt 2KB
      readme.txt 981B
    weibo/
      crfsuite.weiboNER.charpos.conll.dev 1.96MB
      pku_test_gold.utf8 701KB
      pku_training.utf8 7.37MB
      weiboNER_2nd_conll.dev 103KB
      crfsuite.weiboNER.charpos.conll.test 2MB
      weiboNER.conll.train 442KB
      weiboNER_2nd_conll.test 106KB
      weiboNER_2nd_conll.train 523KB
      weiboNER.conll.test 90KB
      weiboNER.conll.dev 88KB
      crfsuite.weiboNER.charpos.conll.train 9.93MB
    renMinRiBao/
      data_renmin_word.py 5KB
      renmin.txt 10.18MB
    MSRA/
      train2pkl.py 4KB
      test1.txt 514KB
      link.txt 49B
      testright1.txt 564KB
      train1.txt 9.99MB
  ss.md 0B
  README.md 3KB
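
For readers who want to load the `.conll` files listed above, the following is a minimal sketch of a reader. It assumes the common CoNLL-style layout (one token per line, whitespace-separated columns with the tag in the last column, and a blank line between sentences); the archive's files have not been inspected here, so the column layout, the `read_conll` helper, and the sample tags are all illustrative assumptions.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token ... tag' lines into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed layout (tags are illustrative only).
sample = ["我 O", "在 O", "北 B-LOC", "京 I-LOC", "", "你 O", "好 O"]
sentences = read_conll(sample)
for sentence in sentences:
    print(sentence)
```

To read a real file, pass an open file handle instead of the sample list, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the extracted `.conll` files.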

## NLP-dataset (General)

* [Huggingface, datasets](https://huggingface.co/datasets)
* [Awesome-Chinese-NLP, Chinese](https://github.com/crownpku/Awesome-Chinese-NLP)
* [CLUEDatasetSearch, Chinese](https://github.com/CLUEbenchmark/CLUEDatasetSearch)
* [funNLP, Chinese](https://github.com/fighting41love/funNLP)
* [ChineseNLPCorpus1, Chinese](https://github.com/InsaneLife/ChineseNLPCorpus)
* [ChineseNLPCorpus2, Chinese](https://github.com/SophonPlus/ChineseNlpCorpus)
* [CLUE, Chinese](https://www.cluebenchmarks.com/introduce.html)
* [Chinese NLP data by ShannonAI, Chinese](https://github.com/ShannonAI/glyce/blob/master/docs/dataset_download.md)
* [nlp-datasets, Multilingual](https://github.com/niderhoff/nlp-datasets)
* [awesome-nlp, Multilingual](https://github.com/keon/awesome-nlp#datasets)

## Word Segmentation (Chinese)

* [SIGHAN2005](http://sighan.cs.uchicago.edu/bakeoff2005/)
* [multi-criteria-cws](https://github.com/hankcs/multi-criteria-cws)
* [Chinese NLP data by ShannonAI, Chinese](https://github.com/ShannonAI/glyce/blob/master/docs/dataset_download.md)

## NER dataset (English)

* [Various NER datasets](https://github.com/juand-r/entity-recognition-datasets)
* [CoNLL-2003, Official](https://www.clips.uantwerpen.be/conll2003/ner/), [CoNLL-2003, other link](https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003)
* [WNUT-2016, Twitter](https://github.com/aritter/twitter_nlp/tree/master/data/annotated/wnut16)
* [OntoNotes-5.0, broadcast news, broadcast conversation, weblogs, magazine genre](https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO)
* [Wikigold](https://github.com/juand-r/entity-recognition-datasets/tree/master/data/wikigold)
* [Twitter](https://github.com/aritter/twitter_nlp/blob/master/data/annotated/ner.txt)
* [kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data)
* [MUC6](https://catalog.ldc.upenn.edu/LDC2003T13)
* [MUC7](https://catalog.ldc.upenn.edu/LDC2001T02)

## NER dataset (Chinese)

- [MSRA, OntoNotes 4.0, Resume, Weibo](https://drive.google.com/file/d/1mDKkc2-8e4wXAuAnGiZMHI59UgVbl1q4/view)
- [CLUENER](https://storage.googleapis.com/cluebenchmark/tasks/cluener_public.zip)
- [RenMinRiBao](https://github.com/quincyliang/nlp-dataset/tree/master/ner-data/renMinRiBao)
- [MSRA](https://github.com/quincyliang/nlp-dataset/tree/master/ner-data/MSRA)
- [Boson](https://github.com/quincyliang/nlp-dataset/tree/master/ner-data/boson)
- [Weibo](https://github.com/quincyliang/nlp-dataset/tree/master/ner-data/weibo)
- [Others](https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER)

## Machine Translation (Chinese-English)

- [WMT 2020](http://statmt.org/wmt20/translation-task.html)
- [AI challenger](https://challenger.ai/) (the largest English-Chinese bilingual parallel dataset in the spoken-language domain for English-to-Chinese translation)
- [UM-Corpus: A Large English-Chinese Parallel Corpus](http://nlp2ct.cis.umac.mo/um-corpus/)
- [OpenSubtitles2016](http://opus.nlpl.eu/OpenSubtitles2016.php)
- [MultiUN](http://opus.nlpl.eu/MultiUN.php)
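
As a rough illustration of how segmentation corpora such as the SIGHAN 2005 PKU data (`pku_training.utf8`) are often prepared for sequence labeling, the sketch below converts a whitespace-segmented sentence into per-character BMES tags. It assumes one sentence per line with words separated by whitespace; the BMES scheme and the `words_to_bmes` helper are illustrative choices, not something the README prescribes.

```python
from typing import List, Tuple


def words_to_bmes(line: str) -> List[Tuple[str, str]]:
    """Tag each character as B (begin), M (middle), E (end), or S (single)."""
    tagged: List[Tuple[str, str]] = []
    # Assumption: words are separated by one or more whitespace characters.
    for word in line.split():
        if len(word) == 1:
            tagged.append((word, "S"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "M") for ch in word[1:-1])
            tagged.append((word[-1], "E"))
    return tagged


# Hypothetical example sentence in the assumed format.
example = "我  爱  北京  天安门"
pairs = words_to_bmes(example)
print(" ".join(f"{ch}/{tag}" for ch, tag in pairs))
```

Character-level tags like these let a segmentation corpus be fed to the same kind of tagger used for the NER data, which is one reason the two dataset types are often distributed together.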
