MarkBERT: Marking Word Boundaries Improves Chinese BERT
Linyang Li²*, Yong Dai¹, Duyu Tang¹†, Zhangyin Feng¹, Cong Zhou¹, Xipeng Qiu², Zenglin Xu³, Shuming Shi¹
¹Tencent AI Lab, China  ²Fudan University  ³PengCheng Laboratory
{yongdai,brannzhou,aifeng,duyutang}@tencent.com, zenglin@gmail.com, {linyangli19,xpqiu}@fudan.edu.cn
Abstract

We present a Chinese BERT model dubbed MarkBERT that uses word information. Existing word-based BERT models regard words as basic units; however, due to the vocabulary limit of BERT, they only cover high-frequency words and fall back to the character level when encountering out-of-vocabulary (OOV) words. Unlike existing work, MarkBERT keeps the vocabulary at the Chinese character level and inserts boundary markers between contiguous words. This design enables the model to handle any word in the same way, regardless of whether it is an OOV word. Our model has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which are complementary to traditional character- and sentence-level pre-training tasks; second, it can easily incorporate richer semantics, such as the POS tags of words, by replacing generic markers with POS-tag-specific markers. MarkBERT pushes the state of the art of Chinese named entity recognition from 95.4% to 96.5% on the MSRA dataset and from 82.8% to 84.2% on the OntoNotes dataset. Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
1 Introduction

Chinese words can be composed of multiple Chinese characters. For instance, the word 地球 (earth) is made up of the two characters 地 (ground) and 球 (ball). However, there are no delimiters (i.e., spaces) between words in written Chinese sentences. Traditionally, word segmentation has been an important first step for Chinese natural language processing tasks (Chang et al., 2008). However, with the rise of pre-trained models (Devlin et al., 2018), Chinese BERT models are dominated by character-based ones (Cui et al., 2019a; Sun et al., 2019; Cui et al., 2020; Sun et al., 2021b,a), where a sentence is represented as a sequence of characters. There have been several attempts at building Chinese BERT models that consider word information. Existing studies tokenize a word as a basic unit (Su, 2020), as multiple characters (Cui et al., 2019a), or as a combination of both (Zhang and Li, 2020; Lai et al., 2021; Guo et al., 2021). However, due to the limited vocabulary size of BERT, these models only learn a limited number (e.g., 40K) of high-frequency words. Rare words below the frequency threshold are tokenized as separate characters, so their word information is neglected.

* Work done during internship at Tencent AI Lab.
† Corresponding author.
In this work, we present a simple framework, MarkBERT, that considers Chinese word information. Instead of regarding words as basic units, we use character-level tokenization and inject word information by inserting special markers between contiguous words. The occurrence of a marker gives the model a hint that its preceding character is the end of one word and the following character is the beginning of another. This simple design has the following advantages. First, it avoids the OOV problem, since common words and rare words (even words never seen in the pre-training data) are handled in the same way. Second, the introduction of markers allows us to design word-level pre-training tasks (such as the replaced word detection task illustrated in Section 2), which are complementary to traditional character-level pre-training tasks like masked language modeling and sentence-level pre-training tasks like next sentence prediction. Third, the model can easily be extended to inject richer word semantics.
In the pre-training stage, we train our model with two pre-training tasks. The first task is masked language modeling; we also mask markers so that word boundary knowledge can be learned. The second task is replaced word detection. We replace
arXiv:2203.06378v1 [cs.CL] 12 Mar 2022
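The page breaks off mid-sentence here, so the following is only an assumed sketch of what a replaced word detection setup over markers could look like, based on the abstract's mention of word-level objectives over markers; the replacement strategy, labeling scheme, and all names are hypothetical, not the paper's actual objective.

```python
# Assumed sketch: replace one word in the segmented sentence, then label
# each boundary marker 1 if the word preceding it was replaced, else 0,
# so a classifier over marker positions can detect the replacement.
MARKER = "[unused1]"  # hypothetical marker token

def build_example(words, replaced_index, replacement):
    """Build (tokens, per-marker labels) for replaced word detection."""
    tokens, labels = [], []
    for i, w in enumerate(words):
        if i == replaced_index:
            w = replacement        # swap in a confusable word
        tokens.extend(list(w))
        if i < len(words) - 1:
            tokens.append(MARKER)
            labels.append(1 if i == replaced_index else 0)
    return tokens, labels

tokens, labels = build_example(["地球", "自转"], 0, "大球")
# tokens: ['大', '球', '[unused1]', '自', '转'], labels: [1]
```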