# Chinese Word Vectors 中文词向量
[中文](https://github.com/Embedding/Chinese-Word-Vectors/blob/master/README_zh.md)
This project provides 100+ Chinese Word Vectors (embeddings) trained with different **representations** (dense and sparse), **context features** (word, ngram, character, and more), and **corpora**. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.
Moreover, we provide a Chinese analogical reasoning dataset **CA8** and an evaluation toolkit for users to evaluate the quality of their word vectors.
## Reference
Please cite the following paper if you use these embeddings or the CA8 dataset.
Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, <a href="http://aclweb.org/anthology/P18-2023"><em>Analogical Reasoning on Chinese Morphological and Semantic Relations</em></a>, ACL 2018.
```
@InProceedings{P18-2023,
author = "Li, Shen
and Zhao, Zhe
and Hu, Renfen
and Li, Wensi
and Liu, Tao
and Du, Xiaoyong",
title = "Analogical Reasoning on Chinese Morphological and Semantic Relations",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "138--143",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-2023"
}
```
A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is shown in the paper:
Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. <a href="http://www.cips-cl.org/static/anthology/CCL-2018/CCL-18-086.pdf"><em>Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings</em></a>. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)
```
@incollection{qiu2018revisiting,
title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
pages={209--221},
year={2018},
publisher={Springer}
}
```
## Format
The pre-trained vector files are in text format. Each line contains a word and its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the dimension of the vectors.
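As a minimal sketch (function name and file path are hypothetical), a dense vector file in this format can be loaded with the standard library alone:

```python
def load_dense(path):
    """Load a dense word-vector file: header line, then one word per line."""
    with open(path, encoding="utf-8") as f:
        # First line: "<number of words> <dimension>"
        header = f.readline().split()
        n_words, dim = int(header[0]), int(header[1])
        vectors = {}
        for line in f:
            parts = line.rstrip().split(" ")
            # First token is the word; the rest are the vector components.
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_words, dim, vectors
```

For real use you may prefer `gensim`'s `KeyedVectors.load_word2vec_format`, which reads the same format.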
Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They are in the same format as liblinear, where the number before ":" denotes the dimension index and the number after it denotes the value.
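A sparse line in this liblinear-style format can be parsed into an index-to-value map like so (a sketch; the function name and example line are made up):

```python
def parse_sparse_line(line):
    """Parse 'word idx:val idx:val ...' into (word, {idx: val})."""
    parts = line.rstrip().split(" ")
    word = parts[0]
    vec = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        vec[int(idx)] = float(val)
    return word, vec
```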
## Pre-trained Chinese Word Vectors
### Basic Settings
<table align="center">
<tr align="center">
<td><b>Window Size</b></td>
<td><b>Dynamic Window</b></td>
<td><b>Sub-sampling</b></td>
<td><b>Low-Frequency Word</b></td>
<td><b>Iteration</b></td>
<td><b>Negative Sampling<sup>*</sup></b></td>
</tr>
<tr align="center">
<td>5</td>
<td>Yes</td>
<td>1e-5</td>
<td>10</td>
<td>5</td>
<td>5</td>
</tr>
</table>
<sup>\*</sup>Only for SGNS.
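The sub-sampling threshold above follows word2vec's scheme, in which a word with corpus frequency `f` is randomly discarded with probability `1 - sqrt(t / f)` for threshold `t`. A quick illustration (the frequencies are made up):

```python
import math

def discard_prob(freq, t=1e-5):
    # word2vec sub-sampling: frequent words are dropped with
    # probability 1 - sqrt(t / f); rare words (f <= t) are always kept.
    return max(0.0, 1.0 - math.sqrt(t / freq))
```

For example, with `t = 1e-5` a word making up 0.1% of the corpus is discarded about 90% of the time, while a word at or below the threshold is never discarded.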
### Various Domains
Chinese Word Vectors trained with different representations, context features, and corpora.
<table align="center">
<tr align="center">
<td colspan="5"><b>Word2vec / Skip-Gram with Negative Sampling (SGNS)</b></td>
</tr>
<tr align="center">
<td rowspan="2">Corpus</td>
<td colspan="4">Context Features</td>
</tr>
<tr align="center">
<td>Word</td>
<td>Word + Ngram</td>
<td>Word + Character</td>
<td>Word + Character + Ngram</td>
</tr>
<tr align="center">
<td>Baidu Encyclopedia 百度百科</td>
<td><a href="https://pan.baidu.com/s/1Rn7LtTH0n7SHyHPfjRHbkg">300d</a></td>
<td><a href="https://pan.baidu.com/s/1XEmP_0FkQwOjipCjI2OPEw">300d</a></td>
<td><a href="https://pan.baidu.com/s/1eeCS7uD3e_qVN8rPwmXhAw">300d</a></td>
<td><a href="https://pan.baidu.com/s/1IiIbQGJ_AooTj5s8aZYcvA">300d</a> / PWD: 5555</td>
</tr>
<tr align="center">
<td>Wikipedia_zh 中文维基百科</td>
<td><a href="https://pan.baidu.com/s/11hSZJN-NWBEvryIED6Donw?pwd=qfgv">300d</a></td>
<td><a href="https://pan.baidu.com/s/1RWcPWQEiCrwna7xmhI8ARg?pwd=jp7e">300d</a></td>
<td><a href="https://pan.baidu.com/s/1DKvgg0RgtqwyDPs1IbS0TQ?pwd=s22w">300d</a></td>
<td><a href="https://pan.baidu.com/s/1OTfYo_sQamCYwJLdp3KHnw?pwd=k6p9">300d</a></td>
</tr>
<tr align="center">
<td>People's Daily News 人民日报</td>
<td><a href="https://pan.baidu.com/s/19sqMz-JAhhxh3o6ecvQxQw">300d</a></td>
<td><a href="https://pan.baidu.com/s/1upPkA8KJnxTZBfjuNDtaeQ">300d</a></td>
<td><a href="https://pan.baidu.com/s/1BvKk2QjbtQMch7EISppW2A">300d</a></td>
<td><a href="https://pan.baidu.com/s/19Vso_k79FZb5OZCWQPAnFQ">300d</a></td>
</tr>
<tr align="center">
<td>Sogou News 搜狗新闻</td>
<td><a href="https://pan.baidu.com/s/1tUghuTno5yOvOx4LXA9-wg">300d</a></td>
<td><a href="https://pan.baidu.com/s/13yVrXeGYkxdGW3P6juiQmA">300d</a></td>
<td><a href="https://pan.baidu.com/s/1pUqyn7mnPcUmzxT64gGpSw">300d</a></td>
<td><a href="https://pan.baidu.com/s/1svFOwFBKnnlsqrF1t99Lnw">300d</a></td>
</tr>
<tr align="center">
<td>Financial News 金融新闻</td>
<td><a href="https://pan.baidu.com/s/1c8wmsqdrfUbQQ6j2Dx5NwQ?pwd=nakr">300d</a></td>
<td><a href="https://pan.baidu.com/s/1EXVpN8-vMr1-f2l4kZICLg?pwd=ki7t">300d</a></td>
<td><a href="https://pan.baidu.com/s/1EXVpN8-vMr1-f2l4kZICLg?pwd=ki7t">300d</a></td>
<td><a href="https://pan.baidu.com/s/19JWtZL6U8P-XfE5LsTlftg?pwd=gbnb">300d</a></td>
</tr>
<tr align="center">
<td>Zhihu_QA 知乎问答 </td>
<td><a href="https://pan.baidu.com/s/1VGOs0RH7DXE5vRrtw6boQA">300d</a></td>
<td><a href="https://pan.baidu.com/s/1OQ6fQLCgqT43WTwh5fh_lg">300d</a></td>
<td><a href="https://pan.baidu.com/s/1_xogqF9kJT6tmQHSAYrYeg">300d</a></td>
<td><a href="https://pan.baidu.com/s/1Fo27Lv_0nz8FXg-xbOz14Q">300d</a></td>
</tr>
<tr align="center">
<td>Weibo 微博</td>
<td><a href="https://pan.baidu.com/s/1zbuUJEEEpZRNHxZ7Gezzmw">300d</a></td>
<td><a href="https://pan.baidu.com/s/11PWBcvruXEDvKf2TiIXntg">300d</a></td>
<td><a href="https://pan.baidu.com/s/10bhJpaXMCUK02nHvRAttqA">300d</a></td>
<td><a href="https://pan.baidu.com/s/1FHl_bQkYucvVk-j2KG4dxA">300d</a></td>
</tr>
<tr align="center">
<td>Literature 文学作品</td>
<td><a href="https://pan.baidu.com/s/1ciq8iXtcrHpu3ir_VhK0zg">300d</a></td>
<td><a href="https://pan.baidu.com/s/1Oa4CkPd8o2xd6LEAaa4gmg">300d</a> / PWD: z5b4</td>
<td><a href="https://pan.baidu.com/s/1IG8IxNp2s7vVklz-vyZR9A">300d</a></td>
<td><a href="https://pan.baidu.com/s/1SEOKrJYS14HpqIaQT462kA">300d</a> / PWD: yenb</td>
</tr>
<tr align="center">
<td>Complete Library in Four Sections<br />四库全书<sup>*</sup></td>
<td><a href="https://pan.baidu.com/s/1vPSeUsSiWYXEWAuokLR0qQ">300d</a></td>
<td><a href="https://pan.baidu.com/s/1sS9E7sclvS_UZcBgHN7xLQ">300d</a></td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr align="center">
<td>Mixed-large 综合<br>Baidu Netdisk / Google Drive</td>
<td>
<a href="https://pan.baidu.com/s/1luy-GlTdqqvJ3j-A4FcIOw">300d</a><br>
<a href="https://drive.google.com/open?id=1Zh9ZCEu8_eSQ-qkYVQufQDNKPC4mtEKR">300d</a>
</td>
<td>
<a href="https://pan.baidu.com/s/1oJol-GaRMk4-8Ejpzx">300d</a>
</td>
</tr>
</table>