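Event extraction is a key information extraction task in natural language processing; its main goal is to identify the events in a text and assign the corresponding roles to them. In 2016, an English dataset for open-domain event extraction was proposed to address the limitations of existing datasets, particularly for developing and testing systems that learn templates without supervision.
In the traditional information extraction framework of the MUC, ACE and TAC evaluations, event extraction is defined as template filling: identifying the event types in a text and assigning specific roles, or slots, to each event. An earthquake event, for example, may involve roles such as location, time and magnitude. The existing MUC-4 dataset, however, is small, insufficiently representative, and has highly similar roles across templates, which limits research on unsupervised methods.
To address this, the researchers created a new, partially annotated English dataset. It takes news from the "Laws & Justice" category of Wikinews as its basis and uses the Google search engine to retrieve several documents about the same event. The Wikinews documents are manually annotated and can be used for evaluation, while the other documents are used for unsupervised learning. Building the dataset involves selecting topics, collecting related documents, and partial manual annotation.
The account of the dataset's creation details how the material was selected and organized to ensure diverse events and well-distinguished roles. The paper also evaluates some existing systems on the new dataset, which shows that the dataset is both usable and challenging.
Keywords: event extraction, corpus creation, unsupervised methods.
This dataset gives the natural language processing community a resource that is closer to real-world text, and it supports progress in event extraction and unsupervised template learning. With it, researchers can better evaluate and improve their algorithms on the complex event structures and varied role assignments found in real text, and study the generalization ability of event extraction models and of unsupervised learning strategies.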
A Dataset for Open Event Extraction in English
Kiem-Hieu Nguyen¹,∗, Xavier Tannier², Olivier Ferret³, Romaric Besançon³
1. Hanoi Univ. of Science and Technology, 1 Dai Co Viet, Hai Ba Trung, Hanoi, Vietnam
2. LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, rue John von Neumann, 91403 Orsay, France
3. CEA, LIST, Vision and Content Engineering Laboratory, F-91191, Gif-sur-Yvette, France
Abstract
This article presents a corpus for development and testing of event schema induction systems in English. Schema induction is the
task of learning templates with no supervision from unlabeled texts, and of grouping together entities corresponding to the same role in
a template. Most of the previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size,
non-representativeness, similarity of roles across templates) and propose a new, partially-annotated corpus in English which remedies
some of these shortcomings. We make use of Wikinews to select the data inside the category Laws & Justice, and query the Google search
engine to retrieve different documents on the same events. Only Wikinews documents are manually annotated and can be used for
evaluation, while the others can be used for unsupervised learning. We detail the methodology used for building the corpus and evaluate
some existing systems on this new data.
Keywords: Event extraction, corpus creation, unsupervised methods.
1. Introduction
Information Extraction has been defined by the Message
Understanding Conference (MUC) evaluations (Grishman
and Sundheim, 1996) and their successors, i.e. the Automatic
Content Extraction (ACE) (Doddington et al., 2004) and
Text Analysis Conference (TAC) (Ellis et al., 2014)
evaluations, specifically by the task of template filling. The
objective of this task is to assign event roles to individual
textual mentions. A template defines a specific type
of event (e.g. earthquakes), associated with semantic
roles (or slots) held by entities (for earthquakes, typically
their location, date, magnitude and the damage they
caused (Jean-Louis et al., 2011)). This kind of structure is
comparable to the schemas of Schank and Abelson (1977).
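To make the notion of a template concrete, here is a minimal Python sketch of an event type with its slots. The class name `Template`, its fields, and the example mentions are illustrative assumptions rather than anything defined in the paper; only the earthquake slot names come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """An event type together with its semantic roles (slots)."""
    event_type: str
    slot_names: tuple
    fillers: dict = field(default_factory=dict)

    def fill(self, slot, mention):
        """Assign a textual mention to one of the template's slots."""
        if slot not in self.slot_names:
            raise KeyError(f"{self.event_type!r} has no slot {slot!r}")
        self.fillers.setdefault(slot, []).append(mention)


# Slot names follow the earthquake example in the text;
# the mentions themselves are invented for illustration.
earthquake = Template("earthquake", ("location", "date", "magnitude", "damage"))
earthquake.fill("location", "the northern coast")
earthquake.fill("magnitude", "6.1")
```

Template filling, in this picture, amounts to calling `fill` with the right slot for each mention found in a document.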
Schema induction is the task of learning these structures
with no supervision from unlabeled texts. We focus here
more specifically on event schema induction (Chambers
and Jurafsky, 2011; Chambers, 2013; Cheung et al., 2013;
Nguyen et al., 2015). The idea is to group entities
corresponding to the same role into an event template.
Figure 1 illustrates this process.
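The grouping idea can be sketched with a toy example. The code below is not the method of any of the cited systems: it is a deliberately simple greedy clustering, assuming each entity mention comes with a set of hypothetical context features (such as `obj:kill`) and that mentions with overlapping contexts play the same role. The function names, the features, and the threshold are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_mentions(contexts, threshold=0.3):
    """Greedily add each mention to the first group whose pooled
    context features overlap enough; otherwise start a new group."""
    groups = []  # list of (mentions, pooled context features)
    for mention, ctx in contexts.items():
        for members, pooled in groups:
            if jaccard(ctx, pooled) >= threshold:
                members.append(mention)
                pooled |= ctx  # in-place update of the group's features
                break
        else:
            groups.append(([mention], set(ctx)))
    return [members for members, _ in groups]


# Hypothetical context features for four mentions.
contexts = {
    "civilian": {"obj:kill", "obj:injure"},
    "woman": {"obj:kill", "obj:wound"},
    "bomb": {"subj:explode", "obj:plant"},
    "explosive": {"obj:plant", "obj:detonate"},
}
slots = group_mentions(contexts)
print(slots)
```

Each resulting group plays the part of one induced slot; real systems rely on much richer evidence and probabilistic models rather than a fixed threshold.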
Previous work on event schema induction was evaluated
on the MUC-4 corpus (Grishman and Sundheim, 1996).
However, this corpus raises two main issues:
• It was annotated with templates describing all events
with the same set of slots.
• It doesn’t contain redundancy.
The first issue is clearly a limitation, since all the event
types considered in the MUC-4 corpus are close to each other,
while the second issue is more of an obstacle to
applying current machine learning methods.
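One rough way to quantify the first issue is to measure how much the slot sets of different templates overlap. The sketch below computes the mean pairwise Jaccard overlap; the measure and the function name `mean_slot_overlap` are assumptions chosen for illustration, not a metric used in the paper. The MUC-style templates are those of Figure 1, the `EARTHQUAKE` slots follow the earthquake example of the introduction, and the `TRIAL` template is hypothetical.

```python
from itertools import combinations


def mean_slot_overlap(templates):
    """Mean Jaccard overlap between the slot sets of all template pairs."""
    pairs = list(combinations(templates.values(), 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Templates from Figure 1: every event type has the same four slots.
shared = {"Perpetrator", "Instrument", "Target", "Victim"}
muc_style = {"ATTACK": set(shared), "BOMBING": set(shared), "ARSON": set(shared)}

# Contrasting example; the TRIAL template is invented for illustration.
distinct = {
    "EARTHQUAKE": {"Location", "Date", "Magnitude", "Damage"},
    "TRIAL": {"Defendant", "Charge", "Court"},
}

print(mean_slot_overlap(muc_style))  # 1.0
print(mean_slot_overlap(distinct))   # 0.0
```

An overlap of 1.0 means that a system can never be rewarded for telling templates apart by their roles, which is the limitation described above.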
In this paper, we propose the ASTRE corpus in order to
tackle these two issues. We report experimental results on
this corpus using state-of-the-art event schema induction
methods. The rest of the paper is organized as follows.
Section 2 presents the MUC-4 corpus and its limitations
for evaluating schema induction. It also discusses its
successors, i.e. the ACE and TAC corpora. Section 3
describes the creation of the ASTRE corpus while Section 4
shows the evaluation results of two state-of-the-art systems
for the open event extraction task on it. Finally, Section 5
concludes the paper.
Figure 1: Event induction process (MUC schema example). [Diagram: documents yield slots (Slot i, Slot i+1) filled with mentions such as citizen, woman, police, victim, civilian and bomb, explosion, fire, charge, explosive; data aggregation / schema building turns them into schemas (templates/slots) such as ATTACK, BOMBING and ARSON, each with the slots Perpetrator, Instrument, Target and Victim.]
∗ This author was affiliated with LIMSI-CNRS when working on this project.
2. MUC-4 Corpus
A significant part of the work in the field of event schema
induction from texts such as (Chambers and Jurafsky, 2011;
Chambers, 2013; Cheung et al., 2013; Nguyen et al.,
2015) relies on the MUC-4 corpus for its evaluation. This
corpus contains 1,700 news articles about terrorist incidents
happening in Latin America. The corpus is divided into