# Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
This repo contains the annotated datasets and the experiment implementations introduced in our SIGIR 2022 resource paper, *Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval* [[Paper]](https://arxiv.org/pdf/2203.03367.pdf).
## Introduction
Multi-CPR is a multi-domain Chinese dataset for passage retrieval. It is collected from three different domains: E-commerce, Entertainment video, and Medical. Each domain contains millions of passages and a set of human-annotated query-passage relevance pairs.
Examples of annotated query-passage relevance pairs from the three domains:
| Domain | Query | Passage |
| ---- | ---- | ---- |
| E-commerce | 尼康z62 (Nikon z62) | Nikon/尼康二代全画幅微单机身Z62 Z72 24-70mm套机 (Nikon/Nikon II full-frame mirrorless camera body Z62 Z72 with 24-70mm kit lens) |
| Entertainment video | 海神妈祖 (Ma-tsu, Goddess of the Sea) | 海上女神妈祖 (Ma-tsu, Goddess of the Sea) |
| Medical | 大人能把手放在睡觉婴儿胸口吗 (Can adults put their hands on the chest of a sleeping baby?) | 大人不能把手放在睡觉婴儿胸口,对孩子呼吸不好,要注意 (Adults should not put their hands on the chest of a sleeping baby, as this is not good for the baby's breathing.) |
## Data Format
The datasets of each domain share a uniform format; more details can be found in our paper:
- qid: A unique id for each query that is used in evaluation
- pid: A unique id for each passage that is used in evaluation
| File name | Number of records | Format |
| ---- | ---- | ---- |
| corpus.tsv | 1002822 | pid, passage content |
| train.query.txt | 100000 | qid, query content |
| dev.query.txt | 1000 | qid, query content |
| qrels.train.tsv | 100000 | qid, '0', pid, '1' |
| qrels.dev.tsv | 1000 | qid, '0', pid, '1' |
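For illustration only, a minimal Python sketch of reading these files (it assumes every file is tab-separated with the columns listed above; the paths are placeholders, point them at one of the three domain folders):

```python
import csv

def load_tsv(path, cols):
    """Read a tab-separated file into a list of dicts with the given column names."""
    with open(path, encoding="utf-8") as f:
        return [dict(zip(cols, row)) for row in csv.reader(f, delimiter="\t")]

corpus = load_tsv("corpus.tsv", ["pid", "passage"])
queries = load_tsv("dev.query.txt", ["qid", "query"])
# TREC-style qrels: qid, iteration (always '0'), pid, relevance (always '1')
qrels = load_tsv("qrels.dev.tsv", ["qid", "iter", "pid", "rel"])

print(len(corpus), len(queries), len(qrels))
```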
## Experiments
The ```retrieval``` and ```rerank``` folders contain the code for training BERT-based dense passage retrieval and reranking models on the Multi-CPR dataset. The code builds on [tevatron](https://github.com/texttron/tevatron) and [reranker](https://github.com/luyug/Reranker) by [luyug](https://github.com/luyug). Many thanks to [luyug](https://github.com/luyug).
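As a rough illustration of what the retrieval stage does at inference time (the checkpoint name, [CLS] pooling, and flat inner-product index below are illustrative assumptions, not the repo's exact configuration; the real training and encoding scripts live in the folders above):

```python
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
model = AutoModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def encode(texts):
    """Encode texts into [CLS] embeddings, one common pooling choice for DPR-style models."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0].contiguous().numpy()

passages = ["海上女神妈祖", "Nikon/尼康二代全画幅微单机身Z62 Z72 24-70mm套机"]
index = faiss.IndexFlatIP(768)      # exact inner-product search over 768-d BERT embeddings
index.add(encode(passages))

scores, ids = index.search(encode(["海神妈祖"]), 2)
print(ids, scores)
```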
Dense Retrieval Results
| Models | Training data | Encoder | E-commerce MRR@10 | E-commerce Recall@1000 | Entertainment video MRR@10 | Entertainment video Recall@1000 | Medical MRR@10 | Medical Recall@1000 |
|:------:|:-------------:|:-------:|:-----------------:|:----------------------:|:--------------------------:|:-------------------------------:|:--------------:|:-------------------:|
| DPR | General | BERT | 0.2106 | 0.7750 | 0.1950 | 0.7710 | 0.2133 | 0.5220 |
| DPR-1 | In-domain | BERT | 0.2704 | 0.9210 | 0.2537 | 0.9340 | 0.3270 | 0.7470 |
| DPR-2 | In-domain | BERT-CT | 0.2894 | 0.9260 | 0.2627 | 0.9350 | 0.3388 | 0.7690 |
BERT Reranking Results
| Retrieval | Reranker | E-commerce MRR@10 | Entertainment video MRR@10 | Medical MRR@10 |
|:---------:|:--------:|:-----------------:|:--------------------------:|:--------------:|
| DPR-1 | - | 0.2704 | 0.2537 | 0.3270 |
| DPR-1 | BERT | 0.3624 | 0.3772 | 0.3885 |
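MRR@10 in the tables above is the mean reciprocal rank cut at depth 10. A minimal sketch of the metric (the dictionary-based inputs are assumptions for illustration, not the repo's evaluation script):

```python
def mrr_at_10(qrels, run):
    """qrels: {qid: set of relevant pids}; run: {qid: list of pids ranked by score}."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(run.get(qid, [])[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)

# Toy example: the relevant passage is ranked 2nd, so MRR@10 = 0.5.
print(mrr_at_10({"q1": {"p9"}}, {"q1": ["p3", "p9", "p7"]}))
```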
## Requirements
```
python=3.8
transformers==4.18.0
tqdm==4.49.0
datasets==1.11.0
torch==1.11.0
faiss==1.7.0
```
## Citing us
If you find the dataset helpful, please cite:
```
@inproceedings{Long2022MultiCPRAM,
  title={Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval},
  author={Dingkun Long and Qiong Gao and Kuan Zou and Guangwei Xu and Pengjun Xie and Rui Guo and Jianfeng Xu and Guanjun Jiang and Luxi Xing and P. Yang},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series={SIGIR '22},
  year={2022}
}
```