# 〇、千言数据集:文本相似度比赛简介
aistudio地址:[打卡零基础PaddleNLP【千言数据集:文本相似度】比赛](https://aistudio.baidu.com/aistudio/projectdetail/2045895)
github地址:[https://github.com/livingbody/paddlenlp_text_similarity](https://github.com/livingbody/paddlenlp_text_similarity)
文本相似度旨在识别两段文本在语义上是否相似。文本相似度在自然语言处理领域是一个重要研究方向,同时在信息检索、新闻推荐、智能客服等领域都发挥重要作用,具有很高的商业价值。
[ 文本相似度:https://aistudio.baidu.com/aistudio/competition/detail/45](https://aistudio.baidu.com/aistudio/competition/detail/45)
目前学术界的一些公开中文文本相似度数据集,在相关论文的支撑下对现有的公开文本相似度模型进行了较全面的评估,具有较高权威性。因此,本开源项目收集了这些权威的数据集,期望对模型效果进行综合的评价,旨在为研究人员和开发者提供学术和技术交流的平台,进一步提升文本相似度的研究水平,推动文本相似度在自然语言处理领域的应用和发展。
## 1.数据集
本次评测的文本相似度数据集包括公开的三个文本相似度数据集,分别为哈尔滨工业大学(深圳)的 LCQMC [1] 和 BQ Coupus [2],以及谷歌的 PAWS-X(中文) [3]。各数据集的简介如下:
### a) LCQMC
LCQMC(A Large-scale Chinese Question Matching Corpus), 百度知道领域的中文问题匹配数据集,目的是为了解决在中文领域大规模问题匹配数据集的缺失。该数据集从百度知道不同领域的用户问题中抽取构建数据。
### b) BQ Corpus
BQ Corpus(Bank Question Corpus), 银行金融领域的问题匹配数据,包括了从一年的线上银行系统日志里抽取的问题pair对,是目前最大的银行领域问题匹配数据。
### c) PAWS-X (中文)
PAWS (Paraphrase Adversaries from Word Scrambling),谷歌发布的包含 7 种语言释义对的数据集,包括PAWS(英语) 与 PAWS-X(多语)。数据集里包含了释义对和非释义对,即识别一对句子是否具有相同的释义(含义),特点是具有高度重叠词汇,对于进一步提升模型对于强负例的判断很有帮助。
各个数据集的任务均一致,即判断两段文本在语义上是否相似的二分类任务。以LCQMC为例:
## 2.提交方式
```
# 文本相似度任务
index prediction
0 1
1 0
2 1
```
# 二、思路
* 1.搭建网络
* 2.更换数据集
* 3.针对不同数据集Finetune 生成不同模型
* 4.用不同模型进行预测
# 三、环境准备
## 1.包引入
```python
!python -m pip install --upgrade paddlenlp==2.0.2 -i https://mirror.baidu.com/pypi/simple
```
Looking in indexes: https://mirror.baidu.com/pypi/simple
Collecting paddlenlp==2.0.2
[?25l Downloading https://mirror.baidu.com/pypi/packages/b1/e9/128dfc1371db3fc2fa883d8ef27ab6b21e3876e76750a43f58cf3c24e707/paddlenlp-2.0.2-py3-none-any.whl (426kB)
[K |████████████████████████████████| 430kB 18.9MB/s eta 0:00:01
[?25hRequirement already satisfied, skipping upgrade: seqeval in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp==2.0.2) (1.2.2)
Requirement already satisfied, skipping upgrade: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp==2.0.2) (2.9.0)
Requirement already satisfied, skipping upgrade: colorlog in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp==2.0.2) (4.1.0)
Requirement already satisfied, skipping upgrade: colorama in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp==2.0.2) (0.4.4)
Requirement already satisfied, skipping upgrade: multiprocess in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp==2.0.2) (0.70.11.1)
Requirement already satisfied, skipping upgrade: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp==2.0.2) (0.42.1)
Requirement already satisfied, skipping upgrade: visualdl in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp==2.0.2) (2.1.1)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.21.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp==2.0.2) (0.24.2)
Requirement already satisfied, skipping upgrade: numpy>=1.14.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from seqeval->paddlenlp==2.0.2) (1.20.3)
Requirement already satisfied, skipping upgrade: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from h5py->paddlenlp==2.0.2) (1.15.0)
Requirement already satisfied, skipping upgrade: dill>=0.3.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from multiprocess->paddlenlp==2.0.2) (0.3.3)
Requirement already satisfied, skipping upgrade: protobuf>=3.11.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (3.14.0)
Requirement already satisfied, skipping upgrade: Flask-Babel>=1.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (1.0.0)
Requirement already satisfied, skipping upgrade: shellcheck-py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (0.7.1.1)
Requirement already satisfied, skipping upgrade: flask>=1.1.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (1.1.1)
Requirement already satisfied, skipping upgrade: Pillow>=7.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (7.1.2)
Requirement already satisfied, skipping upgrade: bce-python-sdk in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (0.8.53)
Requirement already satisfied, skipping upgrade: requests in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (2.22.0)
Requirement already satisfied, skipping upgrade: flake8>=3.7.9 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (3.8.2)
Requirement already satisfied, skipping upgrade: pre-commit in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from visualdl->paddlenlp==2.0.2) (1.21.0)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp==2.0.2) (1.6.3)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp==2.0.2) (2.1.0)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval->paddlenlp==2.0.2) (0.14.1)
Requirement already satisfied, skipping upgrade: Babel>=2.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Flask-Babel>=1.0.0->visualdl->paddlenlp==2.0.2) (2.8.0)
Requirement already satisfied, skipping upgrade: pytz in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Flask-Babel>=1.0.0->visualdl->paddlenlp==2.0.2)
没有合适的资源?快使用搜索试试~ 我知道了~
打卡零基础PaddleNLP【千言数据集:文本相似度】比赛.zip
共3个文件
txt:1个
md:1个
ipynb:1个
1.该资源内容由用户上传,如若侵权请联系客服进行举报
2.虚拟产品一经售出概不退款(资源遇到问题,请及时私信上传者)
2.虚拟产品一经售出概不退款(资源遇到问题,请及时私信上传者)
版权申诉
0 下载量 78 浏览量
2023-08-24
16:46:12
上传
评论
收藏 43KB ZIP 举报
温馨提示
全国大学生电子设计竞赛(National Undergraduate Electronics Design Contest),试题,解决方案及源码。计划或参加电赛的同学可以用来学习提升和参考。程序均是实战案例,经过测试可直接运行。 全国大学生电子设计竞赛(National Undergraduate Electronics Design Contest),试题,解决方案及源码。计划或参加电赛的同学可以用来学习提升和参考。程序均是实战案例,经过测试可直接运行。
资源推荐
资源详情
资源评论
收起资源包目录
打卡零基础PaddleNLP【千言数据集:文本相似度】比赛.zip (3个子文件)
ori_code
requirements.txt 79B
javaroom.ipynb 111KB
README.md 87KB
共 3 条
- 1
资源评论
白话机器学习
- 粉丝: 8726
- 资源: 7682
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功