OntoNotes Release 5.0
with OntoNotes DB Tool v0.999 beta
http://www.bbn.com/NLP/OntoNotes
2012-09-28
Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti
Nianwen Xue
Martha Palmer, Jena D. Hwang, Claire Bonial, Jinho Choi, Aous Mansouri, Maha Foster and Abdel-aati Hawwary
Mitchell Marcus, Ann Taylor, Craig Greenberg
Eduard Hovy, Robert Belvin, Ann Houston (from Grammarsmith)
Contents
1 Introduction
1.1 Summary Description of the OntoNotes Project
1.2 Corpus and GALE Project Plans
2 Annotation Layers
2.1 Treebank
2.2 PropBank
2.3 Word Sense Annotation
2.3.1 Verbs
2.3.2 Nouns
2.3.3 Nominalizations and Eventive Noun Senses
2.4 Ontology
2.5 Coreference
2.6 Entity Names Annotation
3 English Release Notes
3.1 English Corpora
3.2 English Treebank Notes
3.3 English PropBank Notes
3.4 English Treebank/Propbank Merge Notes
3.4.1 Treebank Changes
3.4.2 Propbank changes
3.5 English Word Sense Notes
3.6 English Coreference Notes
3.7 English Name Annotation Notes
4 Chinese Release Notes
4.1 Chinese Corpora
4.2 Chinese Treebank Notes
4.3 Chinese PropBank Notes
4.4 Chinese Word Sense Notes
4.5 Chinese Coreference Notes
4.6 Chinese Name Annotation Notes
5 Arabic Release Notes
5.1 Arabic Corpora
5.2 Arabic Treebank Notes
5.3 Arabic Word Sense Notes
5.4 Arabic Coreference Notes
5.5 Arabic Name Annotation Notes
6 Database, Views, Supplementary Data, and Data Access Guide
6.1 How the OntoNotes Data is Organized
6.2 OntoNotes Annotation Database
6.3 OntoNotes Normal Form (ONF) View
6.4 The Treebank View
6.5 Proposition Bank View
6.6 Word Sense View
6.7 Coreference View
6.8 Entity Names View
6.9 Parallel View
6.10 Speaker View
6.11 Ontology View
6.12 Supplementary Data
6.12.1 PropBank Frame Files
6.12.2 Sense Inventory Files
6.13 Access Script Documentation
7 References
1 Introduction
This document describes the final release (v5.0) of OntoNotes, an annotated corpus
whose development was supported under the GALE program of the Defense Advanced
Research Projects Agency, Contract No. HR0011-06-C-0022. The annotation is provided
both in separate text files for each annotation layer (Treebank, PropBank, word sense,
etc.) and in the form of an integrated relational database with a Python API to provide
convenient cross-layer access. More detailed documents (referred to at various points
below) that describe the annotation guidelines and document the routines for deriving
various views of the data from the database are included in the documentation directory
of the distribution.
1.1 Summary Description of the OntoNotes Project
Natural language applications like machine translation, question answering, and
summarization currently are forced to depend on impoverished text models like bags of
words or n-grams, while the decisions that they are making ought to be based on the
meanings of those words in context. That lack of semantics causes problems throughout
the applications. Misinterpreting the meaning of an ambiguous word results in failing to
extract data, incorrect alignments for translation, and ambiguous language models.
Incorrect coreference resolution results in missed information (because a connection is
not made) or incorrectly conflated information (due to false connections). Some richer
semantic representation is badly needed.
The OntoNotes project was a collaborative effort between BBN Technologies, Brandeis
University, the University of Colorado, the University of Pennsylvania, and the
University of Southern California's Information Sciences Institute. The goal was to annotate a
large corpus comprising various genres (news, broadcast, talk shows, weblogs, usenet
newsgroups, and conversational telephone speech) in three languages (English, Chinese,
and Arabic) with structural information (syntax and predicate argument structure) and
shallow semantics (word sense linked to an ontology and coreference). OntoNotes builds
on two time-tested resources, following the Penn Treebank for syntax and the Penn
PropBank for predicate-argument structure. Its semantic representation adds coreference
to PropBank, and includes partial word sense disambiguation for some nouns and verbs,
with the word senses connected to an ontology. OntoNotes includes roughly 1.5 million
words of English, 800 K of Chinese, and 300 K of Arabic. More details are provided in
Weischedel et al. (2011).
This resource is being made available to the natural language research community so that
decoders for these phenomena can be trained to generate the same structure in new
documents. Lessons learned over the years have shown that the quality of annotation is
crucial if it is going to be used for training machine learning algorithms. Taking this cue,
we strove to ensure that each layer of annotation in OntoNotes has at least 90%
inter-annotator agreement.
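As a rough illustration of what a 90% threshold means, the sketch below computes simple item-level percent agreement between two annotators. This is an assumption made for illustration only: this section does not state which agreement measure was applied to each layer, and the function name and the placeholder labels are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label.

    Assumption: plain percent agreement over aligned items. The actual
    measure used for each OntoNotes layer is not specified in this section.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one labeled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical sense labels for ten instances of one word.
annotator_a = ["s1", "s1", "s2", "s1", "s3", "s2", "s1", "s1", "s2", "s1"]
annotator_b = ["s1", "s1", "s2", "s1", "s3", "s2", "s1", "s2", "s2", "s1"]

agreement = percent_agreement(annotator_a, annotator_b)
meets_threshold = agreement >= 0.90
```

Here the two annotators differ on one of ten items, so the agreement sits exactly at the 90% mark.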
This level of semantic representation goes far beyond the entity and relation types
targeted in the ACE program, since every concept in the text is indexed, not just 100
pre-specified types. For example, consider this sentence: “The founder of Pakistan's nuclear
program, Abdul Qadeer Khan, has admitted that he transferred nuclear technology to
Iran, Libya, and North Korea”. In addition to the names, each of the nouns “founder”,
“program”, and “technology” would be assigned a word sense and linked to an
appropriate ontology node. The propositional connection signaled by “founder” between
Khan and the program would also be marked. The verbs “admit” and “transfer” would
have their word sense and argument structures identified and be linked to their equivalent
ontology nodes. One argument of “admit” is “he”, which would be connected by
coreference to Khan, and the other is the entire transfer clause. The verb “transfer”, in
turn, has “he/Khan” as the agent, the technology as the item transferred, and the three
nations Iran, Libya, and North Korea as the destination of the transfer. A graphical view
of the representation is shown below:
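In textual form, the layers described for this sentence can also be sketched as plain data. The Python below is only an illustration under stated assumptions: the class names, field names, and role labels are hypothetical and are neither the OntoNotes DB Tool API nor actual PropBank role labels. Sense identifiers and ontology nodes are omitted because the text does not give them.

```python
from dataclasses import dataclass


@dataclass
class Proposition:
    # Hypothetical container: a predicate plus informal role -> text pairs.
    predicate: str
    arguments: dict


@dataclass
class SentenceAnnotation:
    # Hypothetical container bundling the layers described in the text.
    text: str
    sense_tagged_nouns: list
    sense_tagged_verbs: list
    propositions: list
    coreference_chains: list  # each chain lists mentions of one entity


example = SentenceAnnotation(
    text=("The founder of Pakistan's nuclear program, Abdul Qadeer Khan, "
          "has admitted that he transferred nuclear technology to "
          "Iran, Libya, and North Korea"),
    sense_tagged_nouns=["founder", "program", "technology"],
    sense_tagged_verbs=["admit", "transfer"],
    propositions=[
        Proposition("founder", {"founder": "Abdul Qadeer Khan",
                                "thing founded": "Pakistan's nuclear program"}),
        Proposition("admit", {"admitter": "he",
                              "thing admitted": "he transferred nuclear technology "
                                                "to Iran, Libya, and North Korea"}),
        Proposition("transfer", {"agent": "he",
                                 "item transferred": "nuclear technology",
                                 "destination": "Iran, Libya, and North Korea"}),
    ],
    coreference_chains=[["Abdul Qadeer Khan", "he"]],
)


def resolve(mention, chains):
    """Return the first mention of the chain containing `mention`, else `mention`."""
    for chain in chains:
        if mention in chain:
            return chain[0]
    return mention


transfer = next(p for p in example.propositions if p.predicate == "transfer")
transfer_agent = resolve(transfer.arguments["agent"], example.coreference_chains)
```

Resolving the agent of “transfer” through the coreference chain yields Khan rather than the pronoun, which is the cross-layer connection the paragraph describes.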
Significant breakthroughs that change large sections of the field occur from time to time
in Human Language Technology. The Penn Treebank in the late 1980s transformed
parsing, and the statistical paradigm similarly transformed MT and other applications in
the early 1990s. We believe that OntoNotes has the potential for being a breakthrough of
this magnitude, since it is the first semantic resource of this substantial size ever
produced. As demonstrated with the Treebank and WordNet, a publicly available
resource can unleash an enormous amount of work internationally on algorithms and on
the automated creation of semantic resources in numerous other domains and genres. We
hope that this new level of semantic modeling will empower semantics-enabled
applications to break the current accuracy barriers in transcription, translation, and
question answering, fundamentally changing the nature of human language processing
technology.
1.2 Corpus and GALE Project Plans
The goal for OntoNotes was to achieve substantial coverage in various genres and in all
three GALE languages. The current 5.0 release covers newswire, broadcast news,
broadcast conversation, and web data in English and Chinese, a pivot corpus in English,
and newswire data in Arabic.[1]
[1] For simplicity, the numbers in this table are rounded to the nearest 50k.