‘‘MoGCN’’ (Mixture of Gated Convolutional Neural Network)
Research on Chinese Ancient and Modern Writing Habits Based on Ergonomics
ABSTRACT
This paper addresses the respective issues and proposes an efficient automatic processing solution for NER on ancient Chinese data, including data-driven tagging and an innovative end-to-end network, ‘‘MoGCN’’ (Mixture of Gated Convolutional Neural Network).
Future work should focus on further exploration of NER optimization for massive traditional Chinese texts, incorporating linguistic features and learning strategies.
I. INTRODUCTION
Named Entity Recognition (NER) is a fundamental task that automatically extracts useful named entities from text, and it plays a crucial role in Natural Language Processing (NLP) and Information Retrieval (IR).
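As a concrete illustration, entity extraction is usually cast as sequence labeling over characters. The sketch below decodes entities from a BIO-tagged character sequence; the BIO scheme, the entity types, and the example text are assumptions for illustration only, since this excerpt does not specify the paper's tag set.

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, entity_type) spans from per-character BIO tags.

    A hypothetical decoder for illustration: "B-X" opens an entity of type X,
    "I-X" continues it, anything else (e.g. "O") closes the current span.
    """
    entities, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                entities.append(("".join(cur), cur_type))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur.append(tok)
        else:
            if cur:
                entities.append(("".join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        entities.append(("".join(cur), cur_type))
    return entities


# Hypothetical example: "司马迁著史记" (Sima Qian wrote the Shiji)
spans = decode_bio(list("司马迁著史记"),
                   ["B-PER", "I-PER", "I-PER", "O", "B-BOOK", "I-BOOK"])
# → [("司马迁", "PER"), ("史记", "BOOK")]
```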
However, this task encounters two major problems. First, the distinction of text styles in diachronic corpora [4] brings great difficulty to entity identification, since different types of ancient texts differ considerably in entity content and in the complexity of their Chinese character distributions.
To tackle these problems, we propose a novel entity extraction approach designed specifically for Chinese historical texts: a prior semi-automatic procedure that constructs an annotated corpus through database-oriented mapping rules, followed by a novel supervised sequential tagger built on Deep Learning (DL) architectures.
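The database-oriented mapping rules are not detailed in this excerpt. One plausible minimal reading is longest-match projection of a curated entity database onto raw text; the function name, the lexicon, and the longest-match policy below are all assumptions, not the authors' procedure.

```python
def annotate_with_lexicon(text, lexicon):
    """Project a database of known entities onto raw text as BIO tags.

    lexicon: {entity_string: entity_type}. Longer entries are matched first
    so that e.g. a three-character name wins over a two-character substring.
    """
    tags = ["O"] * len(text)
    for name in sorted(lexicon, key=len, reverse=True):
        start = 0
        while (i := text.find(name, start)) != -1:
            # only tag spans not already claimed by a longer match
            if all(t == "O" for t in tags[i:i + len(name)]):
                tags[i] = "B-" + lexicon[name]
                for j in range(i + 1, i + len(name)):
                    tags[j] = "I-" + lexicon[name]
            start = i + 1
    return tags
```

The resulting silver-standard tags would then be reviewed by annotators, which is what makes the procedure semi-automatic.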
One innovation of this paper is the introduction of a more effective deep block, ‘‘MoGCN’’ (Mixture of Gated Convolutional Neural Network), which applies a gated mechanism to the traditional local representation of multiple character-based embeddings.
II. RELATED WORK
The major approaches to NER for Chinese historical texts have focused on handcrafted heuristic rules [6]–[11], which quickly formulate entity features from context-derived patterns but depend heavily on relevant domain knowledge.
B. DL-BASED NEURAL NETWORKS FOR NER TASKS
Recent advances in NER have utilized deep neural networks to model Chinese texts, demonstrating a more powerful capacity for feature abstraction. These methods can be divided into two main branches: RNN (Recurrent Neural Network)-based and CNN (Convolutional Neural Network)-based NER models.
In this paper, we offer a holistic solution to this research gap with an automatically constructed labeled corpus, extracting historical entities from unstructured raw texts.
III. PROPOSED FRAMEWORK
The overall architecture of the framework in this paper consists of two major phases: the semi-automatic construction of an annotated historical corpus and a supervised DL-based sequential model for NER, as shown in Figure 1.
However, the difference is that we replace the original CNN with a self-defined stacked module, ‘‘MoGCN’’ (Mixture of Gated Convolutional Neural Network), which exhibits a stronger fitting capability. To capture rich semantic representations, MoGCN blocks employ multiple kernels of varying widths, producing robust mappings from the embedding vectors into diverse hidden neural spaces, as demonstrated in Figure 2.
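The multi-kernel gated convolution described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the kernel widths, random weight initialization, and the `gated_conv_mixture` name are all assumptions, and the sigmoid gate follows the gated mechanism the section describes.

```python
import numpy as np


def conv1d_same(x, w):
    """1-D convolution with 'same' padding. x: (T, d_in), w: (k, d_in, d_out)."""
    k = w.shape[0]
    left = k // 2
    xp = np.pad(x, ((left, k - 1 - left), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_conv_mixture(x, widths, d_out, rng):
    """Run one gated convolution per kernel width, then concatenate.

    Each width yields a content map A and a gate map B over the same
    character embeddings; the gate modulates A element-wise, and the
    per-width outputs are concatenated into one hidden representation.
    """
    outs = []
    for k in widths:
        Wa = 0.1 * rng.standard_normal((k, x.shape[1], d_out))  # content filter
        Wb = 0.1 * rng.standard_normal((k, x.shape[1], d_out))  # gate filter
        A, B = conv1d_same(x, Wa), conv1d_same(x, Wb)
        outs.append(A * sigmoid(B))  # gated output for this kernel width
    return np.concatenate(outs, axis=-1)  # (T, d_out * len(widths))
```

Concatenating across widths is one plausible way to mix the kernels; the paper's exact combination rule is not given in this excerpt.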
As such, we develop a more powerful block, the Dilated Residual Network (DRN), which combines the strengths of Dilated CNNs (DCN) and residual blocks.
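A rough sketch of combining dilation with residual connections, under stated assumptions: the kernel size of 3, the tanh nonlinearity, the dilation-rate schedule, and the `drn_block` name are illustrative choices, not details from the paper.

```python
import numpy as np


def dilated_conv1d(x, w, rate):
    """Dilated 1-D convolution with 'same' padding. x: (T, d), w: (k, d, d).

    Taps are spaced `rate` positions apart, so the receptive field grows
    without adding parameters: effective span = (k - 1) * rate + 1.
    """
    k = w.shape[0]
    span = (k - 1) * rate
    left = span // 2
    xp = np.pad(x, ((left, span - left), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + span + 1:rate], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])


def drn_block(x, rates, rng):
    """Stack dilated convolutions, each wrapped in a residual connection."""
    h = x
    for r in rates:
        W = 0.1 * rng.standard_normal((3, h.shape[1], h.shape[1]))
        h = h + np.tanh(dilated_conv1d(h, W, r))  # residual skip keeps gradients flowing
    return h
```

Exponentially growing rates (1, 2, 4, ...) are the common convention for dilated stacks, letting a few layers cover long character contexts.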
Moreover, the Gated Linear Unit (GLU) [46] is used in place of a conventional activation function (i.e., tanh or sigmoid); its gated mechanism controls the non-linearity of the activation.
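The GLU of [46] computes GLU(X) = (XW + b) ⊗ σ(XV + c), where σ is the sigmoid: a linear path modulated element-wise by a learned gate. A minimal sketch (the parameter shapes are illustrative):

```python
import numpy as np


def glu(x, W, V, b, c):
    """GLU(X) = (XW + b) * sigmoid(XV + c).

    The linear half passes gradients through un-squashed, while the
    sigmoid gate decides, per element, how much of it to let through.
    """
    return (x @ W + b) * (1.0 / (1.0 + np.exp(-(x @ V + c))))
```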
IV. EXPERIMENTS
D. EVALUATION AND ANALYSIS
As shown in Table 4, our method (Avg. F1 = 86.99%) achieves the top performance among all the models, an improvement of 1.5%–11%. Notably, our model outperforms the previously best reported model (LSTM-CNNs-CRF) by 1.5%. In addition, to test model robustness, we run sub-experiments on the three sub-datasets; the results are provided in Table 5.