# Chinese Word Analogy Benchmarks
The quality of word vectors is often evaluated with analogy question tasks. This project uses two benchmarks for evaluation. The first is CA-translated ([Chen et al., 2015](#reference)), in which most analogy questions are translated directly from an English benchmark. Although CA-translated has been widely used in Chinese word embedding papers, it contains only three types of semantic questions and covers only 134 Chinese words. In contrast, CA8 ([Li et al., 2018](#reference)) is designed specifically for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations.
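An analogy question has the form "a is to b as c is to ?". One common way to answer such questions is the vector-offset method, which picks the word whose vector is closest to `vec(b) - vec(a) + vec(c)`. The sketch below assumes this scoring rule; it is not taken from the project's evaluation scripts. The names `answer_analogy`, `accuracy`, and `toy_vectors`, and the vector values themselves, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_analogy(vectors, a, b, c):
    """Answer "a is to b as c is to ?" with the vector-offset method."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(answer_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)

# Hypothetical 3-d vectors, invented purely for illustration.
toy_vectors = {
    "爸": [1.0, 0.0, 0.0],
    "爸爸": [1.0, 0.0, 1.0],
    "妈": [0.0, 1.0, 0.0],
    "妈妈": [0.0, 1.0, 1.0],
    "天天": [0.5, 0.5, 1.0],
}
questions = [("爸", "爸爸", "妈", "妈妈")]
print(accuracy(toy_vectors, questions))
```

With real embeddings, `vectors` would map every vocabulary word to its trained vector, and `questions` would be read from a benchmark file.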
## CA8
CA8 incorporates comprehensive morphological and semantic relations in Chinese. Specifically, CA8-morphological (CA8-Mor) contains 10177 morphological questions, constructed from two types of relations: reduplication and semi-affixation. CA8-semantic (CA8-Sem) contains 7636 semantic questions, divided into 4 categories and 28 sub-categories. A detailed description follows:
<table>
<tr align="center">
<td colspan="5"><b>Morphological Questions: Reduplication</b></td>
</tr>
<tr align="center">
<td>Category</td>
<td>Sub-category</td>
<td>POS</td>
<td>Morphological Function</td>
<td>Example</td>
</tr>
<tr align="center">
<td rowspan="9">A</td>
<td rowspan="7">AA</td>
<td rowspan="2">Noun</td>
<td>Form kinship terms</td>
<td>爸 (dad) → 爸爸 (dad)</td>
</tr>
<tr align="center">
<td>Yield every / each meaning</td>
<td>天 (day) → 天天 (everyday)</td>
</tr>
<tr align="center">
<td rowspan="1">Measure</td>
<td>Yield every / each meaning</td>
<td>个 (-) → 个个 (every/each)</td>
</tr>
<tr align="center">
<td rowspan="2">Verb</td>
<td>Signal doing something a little bit</td>
<td>说 (say) → 说说 (say a little)</td>
</tr>
<tr align="center">
<td>Signal things happen briefly</td>
<td>看 (look) → 看看 (have a brief look)</td>
</tr>
<tr align="center">
<td rowspan="2">Adjective</td>
<td>Intensify the adjective</td>
<td>大 (big) → 大大 (very big)</td>
</tr>
<tr align="center">
<td>Transform it into an adverb</td>
<td>慢 (slow) → 慢慢 (slowly)</td>
</tr>
<tr align="center">
<td rowspan="1">A yi A</td>
<td rowspan="1">Verb</td>
<td>Signal trying to do something</td>
<td>吃 (eat) → 吃一吃 (try to eat)</td>
</tr>
<tr align="center">
<td rowspan="1">A lai A qu</td>
<td rowspan="1">Verb</td>
<td>Signal doing something repeatedly</td>
<td>飞 (fly) → 飞来飞去 (fly around)</td>
</tr>
<tr align="center">
<td rowspan="9">AB</td>
<td rowspan="5">AABB</td>
<td rowspan="1">Noun</td>
<td>Yield many / much meaning</td>
<td>山水 (mountain and river) → 山山水水 (many mountains and rivers)</td>
</tr>
<tr align="center">
<td rowspan="1">Verb</td>
<td>Indicate a continuous action</td>
<td>说笑 (laugh and chat) → 说说笑笑 (laugh and chat for a while)</td>
</tr>
<tr align="center">
<td rowspan="2">Adjective</td>
<td>Intensify the adjective</td>
<td>清楚 (clear) → 清清楚楚 (very clear)</td>
</tr>
<tr align="center">
<td>Yield a meaning of non-uniformity</td>
<td>大小 (size) → 大大小小 (all sizes)</td>
</tr>
<tr align="center">
<td rowspan="1">Adverb</td>
<td>Intensify the adverb</td>
<td>彻底 (completely) → 彻彻底底 (totally and completely)</td>
</tr>
<tr align="center">
<td rowspan="1">A li A B</td>
<td rowspan="1">Adjective</td>
<td>Make the adjective colloquial and yield a derogatory meaning</td>
<td>慌张 (flurried) → 慌里慌张 (anxious)</td>
</tr>
<tr align="center">
<td rowspan="3">ABAB</td>
<td rowspan="1">Verb</td>
<td>Signal doing something a little bit</td>
<td>注意 (pay attention) → 注意注意 (pay a little attention)</td>
</tr>
<tr align="center">
<td rowspan="2">Adjective</td>
<td>Intensify the adjective</td>
<td>雪白 (white) → 雪白雪白 (very white)</td>
</tr>
<tr align="center">
<td>Transform it to a verb</td>
<td>高兴 (happy) → 高兴高兴 (make someone happy)</td>
</tr>
</table>
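The reduplication patterns in the table are regular string transformations, so they can be sketched in a few lines. The function name `reduplicate` and the pattern keys are illustrative choices rather than part of the dataset or its tooling. The sketch produces only the surface form and says nothing about whether the result is an attested word.

```python
def reduplicate(word, pattern):
    """Apply one of the reduplication patterns from the table to a base word."""
    single = {
        "AA": lambda a: a + a,
        "A yi A": lambda a: a + "一" + a,
        "A lai A qu": lambda a: a + "来" + a + "去",
    }
    double = {
        "AABB": lambda ab: ab[0] * 2 + ab[1] * 2,
        "A li A B": lambda ab: ab[0] + "里" + ab,
        "ABAB": lambda ab: ab + ab,
    }
    if pattern in single:
        if len(word) != 1:
            raise ValueError("pattern %r expects a one-character base" % pattern)
        return single[pattern](word)
    if pattern in double:
        if len(word) != 2:
            raise ValueError("pattern %r expects a two-character base" % pattern)
        return double[pattern](word)
    raise ValueError("unknown pattern: %r" % pattern)

# Examples taken from the table above.
for base, pattern in [("看", "AA"), ("吃", "A yi A"), ("飞", "A lai A qu"),
                      ("清楚", "AABB"), ("慌张", "A li A B"), ("雪白", "ABAB")]:
    print(base, "→", reduplicate(base, pattern))
```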
Affixation is a morphological process in which a bound morpheme (an affix) is attached to a root or stem to form a new language unit. Chinese is a typical isolating language and has few affixes. [Liu et al. (2001)](#reference) point out that although affixes are rare in Chinese, some components behave like affixes yet can also be used as independent lexemes. These are called semi-affixes. We follow their work and adopt this concept.
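As a rough illustration of how word pairs that share a semi-affix could be turned into analogy questions, the sketch below combines every ordered pair of distinct word pairs into one question. This construction is an assumption made for illustration and is not confirmed as the procedure used to build CA8. The helper names and the bases 二 and 三 are hypothetical.

```python
from itertools import permutations

def attach_prefix(prefix, base):
    """Form a derived word by putting a semi-prefix in front of the base."""
    return prefix + base

def build_questions(pairs):
    """Combine every ordered pair of distinct word pairs into (a, b, c, d)."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]

# Bases chosen for illustration; only 一 → 第一 appears in the table.
pairs = [(base, attach_prefix("第", base)) for base in ["一", "二", "三"]]
for question in build_questions(pairs):
    print(" ".join(question))
```

Three word pairs yield six questions, since each pair can be combined with each of the other two in either order.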
<table>
<tr align="center">
<td colspan="3"><b>Morphological Questions: Semi-affixation</b></td>
</tr>
<tr align="center">
<td>Category</td>
<td>Semi-affix</td>
<td>Example</td>
</tr>
<tr align="center">
<td rowspan="21">Semi-prefix</td>
<td>第</td>
<td>一 (one) → 第一 (first)</td>
</tr>
<tr align="center">
<td>初</td>
<td>一 (one) → 初一 (the first day of a lunar month)</td>
</tr>
<tr align="center">
<td>十</td>
<td>一 (one) → 十一 (eleven)</td>
</tr>
<tr align="center">
<td>周</td>
<td>一 (one) → 周一 (Monday)</td>
</tr>
<tr align="center">
<td>星期</td>
<td>一 (one) → 星期一 (Monday)</td>
</tr>
<tr align="center">
<td>老</td>
<td>虎 (tiger) → 老虎 (tiger)</td>
</tr>
<tr align="center">
<td>小</td>
<td>草 (grass) → 小草 (grass)</td>
</tr>
<tr align="center">
<td>大</td>
<td>海 (sea) → 大海 (large sea)</td>
</tr>
<tr align="center">
<td>半</td>
<td>导体 (conductor) → 半导体 (semiconductor)</td>
</tr>
<tr align="center">
<td>单</td>
<td>细胞 (cell) → 单细胞 (unicellular)</td>
</tr>
<tr align="center">
<td>超</td>
<td>链接 (link) → 超链接 (hyperlink)</td>
</tr>
<tr align="center">
<td>次</td>
<td>大陆 (continent) → 次大陆 (subcontinent)</td>
</tr>
<tr align="center">
<td>非</td>
<td>常规 (conventional) → 非常规 (unconventional)</td>
</tr>
<tr align="center">
<td>每</td>
<td>次 (time) → 每次 (every time)</td>
</tr>
<tr align="center">
<td>全</td>
<td>明星 (star) → 全明星 (all star)</td>
</tr>
<tr align="center">
<td>伪</td>
<td>君子 (gentleman) → 伪君子 (hypocrite)</td>
</tr>
<tr align="center">
<td>亚</td>
<td>热带 (tropical zone) → 亚热带 (sub-tropical zone)</td>
</tr>
<tr align="center">
<td>洋</td>
<td>酒 (wine) → 洋酒 (foreign wine)</td>
</tr>
<tr align="center">
<td>总</td>
<td>比分 (score) → 总比分 (total score)</td>
</tr>
<tr align="center">
<td>反</td>
<td>物质 (matter) → 反物质 (antimatter)</td>
</tr>
<tr align="center">
<td>副</td>
<td>总统 (president) → 副总统 (vice president)</td>
</tr>
<tr align="center">
<td rowspan="41">Semi-suffix</td>
<td>们</td>
<td>我 (I) → 我们 (we)</td>
</tr>
<tr align="center">
<td>里</td>
<td>这 (this) → 这里 (here)</td>
</tr>
<tr align="center">
<td>些</td>
<td>这 (this) → 这些 (these)</td>
</tr>
<tr align="center">
<td>样</td>
<td>这 (this) → 这样 (such)</td>
</tr>
<tr align="center">
<td>个</td>
<td>这 (this) → 这个 (this one)</td>
</tr>
<tr align="center">
<td>边</td>
<td>这 (this) → 这边 (here)</td>
</tr>
<tr align="center">
<td>种</td>
<td>这 (this) → 这种 (this kind)</td>
</tr>
<tr align="center">
<td>次</td>
<td>这 (this) → 这次 (this time)</td>
</tr>
<tr align="center">
<td>儿</td>
<td>这 (this) → 这儿 (here)</td>
</tr>
</table>