# Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
This repo contains the annotated datasets and the experiment implementations introduced in our SIGIR 2022 resource paper, Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval. [[Paper]](https://arxiv.org/pdf/2203.03367.pdf).
## Introduction
Multi-CPR is a multi-domain Chinese dataset for passage retrieval. The data is collected from three domains: E-commerce, Entertainment video, and Medical. Each domain's dataset contains millions of passages and a set of human-annotated relevant query-passage pairs.
Examples of annotated relevant query-passage pairs from the three domains:
| Domain | Query | Passage |
| ---- | ---- | ---- |
| E-commerce | 尼康z62 (<font color=Blue>Nikon z62</font>) | <div style="width: 150pt"> Nikon/尼康二代全画幅微单机身Z62 Z72 24-70mm套机 (<font color=Blue>Nikon/Nikon II, full-frame micro-single camera, body Z62 Z72 24-70mm set</font>) |
| Entertainment video | 海神妈祖 (<font color=Blue>Ma-tsu, Goddess of the Sea</font>) | 海上女神妈祖 (<font color=Blue>Ma-tsu, Goddess of the Sea</font>) |
| Medical | <div style="width: 150pt"> 大人能把手放在睡觉婴儿胸口吗 (<font color=Blue>Can adults put their hands on the chest of a sleeping baby?</font>) | <div style="width: 150pt"> 大人不能把手放在睡觉婴儿胸口,对孩子呼吸不好,要注意 (<font color=Blue>Adults should not put their hands on the chest of a sleeping baby as this is not good for the baby's breathing.</font>) |
## Data Format
The datasets of all three domains share a uniform format (more details can be found in our paper):
- qid: A unique id for each query that is used in evaluation
- pid: A unique id for each passage that is used in evaluation
| File name | Number of records | Format |
| ---- | ---- | ---- |
| corpus.tsv | 1002822 | pid, passage content |
| train.query.txt | 100000 | qid, query content |
| dev.query.txt | 1000 | qid, query content |
| qrels.train.tsv | 100000 | qid, '0', pid, '1' |
| qrels.dev.tsv | 1000 | qid, '0', pid, '1' |
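The file formats above can be read with a few lines of Python. The following is a minimal sketch, not code from this repo: the helper names `load_id_text` and `load_qrels` are hypothetical, the ids in the sample are made up, and it assumes that fields are tab-separated in both the `.tsv` and the `.query.txt` files.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_id_text(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'id<TAB>text' lines (corpus.tsv, *.query.txt) into a dict.

    Assumption: the id and the text are separated by a single tab.
    """
    table: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        item_id, text = line.split("\t", 1)
        table[item_id] = text
    return table


def load_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Parse "qid, '0', pid, '1'" lines into qid -> set of relevant pids."""
    qrels: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _, pid, rel = fields
        if int(rel) > 0:
            qrels[qid].add(pid)
    return dict(qrels)


# Tiny in-memory sample; the ids are invented for illustration only.
sample_queries = io.StringIO("1\t尼康z62\n")
sample_qrels = io.StringIO("1\t0\t42\t1\n")

queries = load_id_text(sample_queries)
qrels = load_qrels(sample_qrels)
print(queries, qrels)
```

To read the real files, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, encoding="utf-8")`, where `path` points to a query, corpus, or qrels file.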
## Experiments
The ```retrieval``` and ```rerank``` folders contain code for training a BERT-base dense passage retrieval model and a reranking model on the Multi-CPR dataset. The code builds on [tevatron](https://github.com/texttron/tevatron) and [reranker](https://github.com/luyug/Reranker) by [luyug](https://github.com/luyug). Many thanks to [luyug](https://github.com/luyug).
Dense retrieval results
| Models | Datasets | Encoder | E-commerce MRR@10 | E-commerce Recall@1000 | Entertainment video MRR@10 | Entertainment video Recall@1000 | Medical MRR@10 | Medical Recall@1000 |
|:------:|-----------|---------|------------|-------------|---------------------|-------------|---------------------|-------------|
| DPR | General | BERT | 0.2106 | 0.7750 | 0.1950 | 0.7710 | 0.2133 | 0.5220 |
| DPR-1 | In-domain | BERT | 0.2704 | 0.9210 | 0.2537 | 0.9340 | 0.3270 | 0.7470 |
| DPR-2 | In-domain | BERT-CT | 0.2894 | 0.9260 | 0.2627 | 0.9350 | 0.3388 | 0.7690 |
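For reference, MRR@10 and Recall@1000 can be computed as in the sketch below. It follows the standard definitions of the two metrics and is not taken from this repo, so the repo's own evaluation script may differ in details. The function names, the `rankings` structure (qid mapped to a ranked list of pids), and the toy ids are assumptions made for illustration.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             qrels: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(rankings.get(qid, [])[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


def recall_at_k(rankings: Dict[str, List[str]],
                qrels: Dict[str, Set[str]],
                k: int = 1000) -> float:
    """Average fraction of relevant passages found within the top k."""
    scored = [(qid, rel) for qid, rel in qrels.items() if rel]
    if not scored:
        return 0.0
    total = 0.0
    for qid, relevant in scored:
        retrieved = set(rankings.get(qid, [])[:k])
        total += len(retrieved & relevant) / len(relevant)
    return total / len(scored)


# Toy example with invented ids: q1 finds its passage at rank 2, q2 misses.
toy_qrels = {"q1": {"p3"}, "q2": {"p9"}}
toy_rankings = {"q1": ["p1", "p3", "p2"], "q2": ["p4", "p5"]}

mrr = mrr_at_k(toy_rankings, toy_qrels, k=10)            # (1/2 + 0) / 2 = 0.25
recall = recall_at_k(toy_rankings, toy_qrels, k=1000)    # (1 + 0) / 2 = 0.5
print(mrr, recall)
```

Queries with no retrieved relevant passage contribute zero to both metrics, which is why the averages are taken over all annotated queries rather than only over the successful ones.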
BERT-reranking results
| Retrieval | Reranker | E-commerce MRR@10 | Entertainment video MRR@10 | Medical MRR@10 |
|:---------:|:--------:|:----------:|:--------------------:|:-------:|
| DPR-1 | - | 0.2704 | 0.2537 | 0.3270 |
| DPR-1 | BERT | 0.3624 | 0.3772 | 0.3885 |
## Requirements
```
python=3.8
transformers==4.18.0
tqdm==4.49.0
datasets==1.11.0
torch==1.11.0
faiss==1.7.0
```
## Citing us
If you find the datasets helpful, please cite:
```
@inproceedings{Long2022MultiCPRAM,
title={Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval},
author={Dingkun Long and Qiong Gao and Kuan Zou and Guangwei Xu and Pengjun Xie and Rui Guo and Jianfeng Xu and Guanjun Jiang and Luxi Xing and P. Yang},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
series = {SIGIR 22},
year={2022}
}
```