MarkBERT: Marking Word Boundaries Improves Chinese BERT
Linyang Li²*, Yong Dai¹, Duyu Tang¹†, Zhangyin Feng¹, Cong Zhou¹, Xipeng Qiu², Zenglin Xu³, Shuming Shi¹
¹Tencent AI Lab, China  ²Fudan University  ³PengCheng Laboratory
{yongdai,brannzhou,aifeng,duyutang}@tencent.com, zenglin@gmail.com, {linyangli19,xpqiu}@fudan.edu.cn
Abstract

We present a Chinese BERT model dubbed MarkBERT that uses word information. Existing word-based BERT models regard words as basic units; however, due to the vocabulary limit of BERT, they only cover high-frequency words and fall back to the character level when encountering out-of-vocabulary (OOV) words. Unlike existing work, MarkBERT keeps the vocabulary at the Chinese character level and inserts boundary markers between contiguous words. This design enables the model to handle any word in the same way, regardless of whether it is an OOV word. Our model has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which are complementary to traditional character- and sentence-level pre-training tasks; second, it can easily incorporate richer semantics, such as the POS tags of words, by replacing generic markers with POS-tag-specific markers. MarkBERT pushes the state of the art of Chinese named entity recognition from 95.4% to 96.5% on the MSRA dataset and from 82.8% to 84.2% on the OntoNotes dataset. Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
1 Introduction

Chinese words can be composed of multiple Chinese characters. For instance, the word 地球 (earth) is made up of the two characters 地 (ground) and 球 (ball). However, there are no delimiters (i.e., spaces) between words in written Chinese sentences. Traditionally, word segmentation has been an important first step for Chinese natural language processing tasks (Chang et al., 2008). However, with the rise of pre-trained models (Devlin et al., 2018), Chinese BERT models are dominated by character-based ones (Cui et al., 2019a; Sun et al., 2019; Cui et al., 2020; Sun et al., 2021b,a), where a sentence is represented as a sequence of characters. There have been several attempts at building Chinese BERT models that consider word information. Existing studies tokenize a word as a basic unit (Su, 2020), as multiple characters (Cui et al., 2019a), or as a combination of both (Zhang and Li, 2020; Lai et al., 2021; Guo et al., 2021). However, due to the limited vocabulary size of BERT, these models only learn a limited number (e.g., 40K) of high-frequency words. Rare words below the frequency threshold are tokenized as separate characters, so their word information is neglected.

* Work done during internship at Tencent AI Lab.
† Corresponding author.
In this work, we present a simple framework, MarkBERT, that considers Chinese word information. Instead of regarding words as basic units, we use character-level tokenization and inject word information by inserting special markers between contiguous words. The occurrence of a marker gives the model a hint that its preceding character is the end of one word and the following character is the beginning of another. This simple design has the following advantages. First, it avoids the OOV problem, since common words and rare words (even words never seen in the pre-training data) are handled in the same way. Second, the introduction of markers allows us to design word-level pre-training tasks (such as the replaced word detection task illustrated in Section 2), which are complementary to traditional character-level pre-training tasks like masked language modeling and sentence-level pre-training tasks like next sentence prediction. Third, the model can easily be extended to inject richer word semantics.
In the pre-training stage, we train our model with two pre-training tasks. The first task is masked language modeling; we also mask markers so that word boundary knowledge can be learned. The second task is replaced word detection. We replace
arXiv:2203.06378v1 [cs.CL] 12 Mar 2022
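The page breaks off mid-sentence here, so the following is only an assumed sketch of what a replaced word detection setup over markers could look like, based on the abstract's mention of word-level objectives over markers; the replacement strategy, labeling scheme, and all names are hypothetical, not the paper's actual objective.

```python
# Assumed sketch: replace one word in the segmented sentence, then label
# each boundary marker 1 if the word preceding it was replaced, else 0,
# so a classifier over marker positions can detect the replacement.
MARKER = "[unused1]"  # hypothetical marker token

def build_example(words, replaced_index, replacement):
    """Build (tokens, per-marker labels) for replaced word detection."""
    tokens, labels = [], []
    for i, w in enumerate(words):
        if i == replaced_index:
            w = replacement        # swap in a confusable word
        tokens.extend(list(w))
        if i < len(words) - 1:
            tokens.append(MARKER)
            labels.append(1 if i == replaced_index else 0)
    return tokens, labels

tokens, labels = build_example(["地球", "自转"], 0, "大球")
# tokens: ['大', '球', '[unused1]', '自', '转'], labels: [1]
```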