
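The paper "How Universal is Genre in Universal Dependencies?" examines genre labeling across treebanks in the multilingual Universal Dependencies (UD) framework. UD is a project that aims to establish cross-linguistically consistent syntactic annotation. It contains 18 genres of varying specificity spread across 114 languages. Because most treebanks are labeled with multiple genres but lack annotations stating which instances belong to which genre, the paper proposes four methods for predicting instance-level genre using weak supervision from treebank metadata. Prior work typically performed genre identification in monolingual or bilingual setups with a small set of well-defined genre labels; the diversity of UD makes the task more challenging. Evaluated on the subset of UD with labeled instances, the four proposed methods recover instance-level genre better than competitive baselines and adhere better to the expected global distribution.

The paper also analyzes prior work that used UD genre metadata for treebank selection and finds that the metadata alone are a noisy signal that must be disentangled within treebanks before they can be applied universally. This finding underlines the importance of processing metadata carefully in order to understand and exploit their potential in cross-lingual tasks.

The study further asks whether pre-trained masked language models (MLMs) capture genre in a highly multilingual setting. Pre-trained MLMs have been shown to capture genre distinctions in monolingual settings; this work asks whether those distinctions are equally apparent across languages. Using the multilingual mBERT MLM, the authors investigate whether this property holds for the treebanks of the 114 languages covered by UD version 2.8. In doing so, the study assesses how sensitive mBERT is to the genre distribution and may shed light on how the model handles text of different genres in a multilingual context.

The contributions of the paper are: 1. the first in-depth analysis of genre in UD; 2. four methods for predicting instance-level genre from metadata; 3. a critical analysis of the limitations and noisiness of UD metadata; 4. an exploration of how a pre-trained multilingual model performs at multilingual genre identification. These results are relevant to cross-lingual natural language processing tasks such as document grouping and task-specific data selection, and to improving the use of pre-trained models. They also raise challenges for future research, including how to use metadata more accurately, how to make models more sensitive to genre differences, and how to build stable genre representations in multilingual settings.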

How Universal is Genre in Universal Dependencies?

Max Müller-Eberstein and Rob van der Goot and Barbara Plank

Department of Computer Science

IT University of Copenhagen, Denmark

mamy@itu.dk, robv@itu.dk, bapl@itu.dk

Abstract

This work provides the first in-depth analysis of genre in Universal Dependencies (UD). In contrast to prior work on genre identification which uses small sets of well-defined labels in mono-/bilingual setups, UD contains 18 genres with varying degrees of specificity spread across 114 languages. As most treebanks are labeled with multiple genres while lacking annotations about which instances belong to which genre, we propose four methods for predicting instance-level genre using weak supervision from treebank metadata. The proposed methods recover instance-level genre better than competitive baselines as measured on a subset of UD with labeled instances and adhere better to the global expected distribution. Our analysis sheds light on prior work using UD genre metadata for treebank selection, finding that metadata alone are a noisy signal and must be disentangled within treebanks before they can be universally applied.

1 Introduction

Identifying document genre automatically has long been of interest to the NLP community due to its immediate applications both in document grouping (Petrenz, 2012) as well as task-specific data selection (Ruder and Plank, 2017; Sato et al., 2017).

Cross-lingual genre identification has, however, remained a challenge, mainly due to the lack of stable cross-lingual representations (Petrenz, 2012). Recent work has shown that pre-trained masked language models (MLMs) capture monolingual genre (Aharoni and Goldberg, 2020). Do such distinctions manifest in highly multilingual spaces as well? In this work, we investigate whether this property holds for the genre distribution in the 114-language Universal Dependencies corpus (UD version 2.8; Zeman et al., 2021) using the multilingual mBERT MLM (Devlin et al., 2019).

In the absence of an exact definition of textual genre (Kessler et al., 1997; Webber, 2009; Plank, 2016), this work will focus on the information specifically denoted by the genres metadata tag in UD. We hope that an in-depth, cross-lingual analysis of what this label represents will enable practitioners to better control for the effects of domain shift in their experiments. Previous work using these UD metadata for proxy training data selection has produced mixed results (Stymne, 2020). We investigate possible reasons and identify inconsistencies in genre annotation. The fact that genre labels are only available at the level of treebanks makes it difficult to gather a clear picture of the sentence-level genre distribution, especially with some treebanks having up to 10 genre labels. We therefore investigate the degree to which instance-level genre is recoverable using only the treebank-level metadata as weak supervision.

Our contributions entail the, to our knowledge, first detailed definition of all UD metadata genre labels (Section 3), four weakly supervised methods for extracting instance-level genre across 114 languages (Section 4) as well as genre identification experiments which show that our proposed two-step procedure allows for effective genre recovery in multilingual setups where language relatedness typically outweighs genre similarities (Section 5).

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

Code available at https://personads.me/x/syntaxfest-2021-code.

arXiv:2112.04971v1 [cs.CL] 9 Dec 2021

2 Related Work

The largest hurdle for cross-lingual genre classification is the lack of shared representational spaces. Sharoff (2007) uses shared POS n-grams in order to jointly classify the genre of English and Russian documents. Petrenz (2012) similarly seeks out features which are stable across languages in order to classify English and Chinese documents into four shared genres. A recent data-driven approach finds that monolingual MLM embeddings can be clustered into five groups closely representing the data sources of the original corpus (Aharoni and Goldberg, 2020). In this work, we investigate whether this holds for multilingual settings as well.

Being able to identify textual genre has been crucial for domain-specific fine-tuning (Dai et al., 2020; Gururangan et al., 2020) including dependency parsing. For parser training, in-genre data is typically selected by proxy of the data source (Plank and van Noord, 2011; Rehbein and Bildhauer, 2017; Sato et al., 2017). Data-driven approaches which include automatically inferred topics based on word and embedding distributions (Ruder and Plank, 2017) as well as POS-based approaches (Søgaard, 2011; Rosa, 2015; Vania et al., 2019) have also been found effective.

Universal Dependencies (Nivre et al., 2020) aims to consolidate syntactic annotations for a wide variety of languages and genres under a single scheme. The latest release contains 114 languages, many with fewer than 100 sentences. In order for languages at all resource levels to benefit from domain adaptation, it will continue to be important to identify cross-lingually stable signals for genre. While language labels are generally agreed upon, differences in genre are more subtle. Metadata at the treebank level provides some insights into genres of original data sources, however these are “neither mutually exclusive nor based on homogeneous criteria, but [are] currently the best documentation that can be obtained” (Nivre et al., 2020).

Stymne (2020) performs an initial study on using these treebank metadata labels for the selection of spoken and Twitter data. Results show that training on out-of-language/in-genre data is superior to out-of-language/out-of-genre data. However, the best results are obtained using in-language data regardless of genre-adherence. This holds across multiple methods of proxy dataset selection (e.g. treebank embeddings; Smith et al., 2018).

Recently, Müller-Eberstein et al. (2021) have shown that combining UD genre metadata and MLM embeddings can improve proxy training data selection for zero-shot parsing of low-resource languages. The use of genre in their work is more implicit as it is mainly driven by the genre of the target data. In contrast, this work takes a holistic view and explicitly examines the classification of instance-level genre for all sentences in UD.

As genre appears to be a valuable signal, we set out to investigate how it is defined and distributed within UD. Due to the coarse, treebank-level nature of current genre annotations, we hypothesize that a clearer picture can only be obtained by moving to the sentence level. We therefore transition from prior supervised document genre prediction to weakly supervised instance genre prediction. Additionally, we expand the linguistic scope from mono- or bilingual corpora to all 114 languages currently in UD.

More generally, this task can be viewed as predicting genre labels for all sentences in all corpora of a collection while only being given the set of labels said to be contained in each corpus.
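To make this task formulation concrete, the following minimal sketch shows one way the corpus-level label set can act as a constraint on sentence-level predictions. It is not one of the four methods proposed in this paper. The per-sentence genre scores are a hypothetical input that could come from any classifier or clustering step, and the function names are illustrative only.

```python
from typing import Dict, List, Set


def constrain_to_treebank_labels(scores: Dict[str, float], allowed: Set[str]) -> str:
    """Return the highest-scoring genre among those listed for the treebank."""
    # Discard genres that the treebank metadata does not list.
    candidates = {genre: s for genre, s in scores.items() if genre in allowed}
    if not candidates:
        raise ValueError("no scored genre is listed for this treebank")
    # Highest score first; ties are broken alphabetically for determinism.
    return min(candidates, key=lambda genre: (-candidates[genre], genre))


def label_corpus(sentence_scores: List[Dict[str, float]], allowed: Set[str]) -> List[str]:
    """Assign one genre per sentence, restricted to the corpus-level label set."""
    return [constrain_to_treebank_labels(scores, allowed) for scores in sentence_scores]
```

In this sketch, a sentence whose unconstrained best guess is fiction would still be assigned news or wiki if the treebank metadata lists only those two genres.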

3 UD-level Genre

We analyze genre as currently used in the genres metadata of 200 treebanks from Universal Dependencies version 2.8 (Zeman et al., 2021). Section 3.1 provides an overview of all UD genre types and Section 3.2 analyzes how these global labels relate to the subset of treebanks which do provide treebank-specific, instance genre annotations.

3.1 Available Metadata

UD 2.8 (Zeman et al., 2021) contains 18 genres which are denoted in each treebank’s accompanying metadata. Around 36% of treebanks contain a single genre while the remaining majority can contain between 2–10 which are not further labeled at the instance level. There is no official description of each genre label, however they can be roughly categorized as follows:

academic: Collections of scientific articles covering multiple disciplines. Note that this label may subsume others such as medical.

bible: Passages from the bible, frequently from older languages (e.g. Old Church Slavonic-PROIEL by Haug and Jøhndal, 2008). Largely non-overlapping passages are used across treebanks.

blog: Internet documents on various topics which may overlap with other genres such as news. They are typically more informal in register. Some treebanks group social media content and reviews under this category (e.g. Russian-Taiga by Shavrina and Shapovalova, 2017).

email: Formal, written communication. This includes English-EWT’s (Silveira et al., 2014) subsection based on the Enronsent Corpus (Styler, 2011) as well as letters attributed to Dante Alighieri as part of Latin-UDante (Cecchini et al., 2020).

fiction: Mostly paragraphs from diverse sets of fiction books and magazines.

government: The least represented genre, mainly denoting texts from governmental sources. These include political speeches (English-GUM by Zeldes, 2017) as well as inscriptions from Neo-Assyrian kings from around 900 BCE (Akkadian-RIAO by Luukko et al., 2020).

grammar-examples: Sentences from teaching or grammatical reference books which are typically short, but cover a wide range of dependency relations (e.g. Tagalog-TRG by Samson and Çöltekin, 2020).

learner-essays: Small genre occurring in three single-genre treebanks. Sentences were written by second-language learners and either contain original errors (English-ESL by Berzak et al., 2016), manual corrections (IT-Valico by Di Nuovo et al., 2019) or both (Chinese-CFL by Lee et al., 2017).

legal: Relatively frequent genre based mostly on laws and legal corpora within the public domain.

medical: Scientific articles/books in the field of medicine (e.g. cardiology, diabetes, endocrinology for Romanian-SiMoNERo by Mitrofan et al., 2019). It is subsumed by academic for some treebanks (e.g. Czech-CAC by Hladká et al., 2008).

news: The highest-resource genre by a large margin corresponding to news-wire texts as well as online newspapers on specific topics (e.g. IT-news in German-HDT by Borges Völker et al., 2019).

nonfiction: Second most frequent genre with a high degree of variance, subsuming e.g. academic and legal. German-LIT (Salomoni, 2019) contains three philosophical books from the 18th century. Other non-fiction treebanks can originate from multiple sources (e.g. books and internet) and time spans.

poetry: Smaller, yet distinct genre covering mostly older texts and language variations (e.g. Old French-SRCMF by Stein and Prévost, 2013).

reviews: Medium-resource genre covering informal online reviews with unnormalized orthography (e.g. English-EWT) as well as formal reviews (e.g. newspaper film reviews in Czech-CAC).

social: Encompasses social media data such as tweets (e.g. Italian-TWITTIRÒ by Cignarella et al., 2019) as well as newsgroups (e.g. English-EWT). Some spoken data is co-labeled with this genre when it refers to colloquial speech (e.g. South Levantine Arabic-MADAR by Zahra, 2020).

spoken: Distinct genre which typically consists of spoken language transcriptions. Sentences contain filler words and may have abrupt boundaries. Sources range from elicited speech of native speakers (Komi Zyrian-IKDP by Partanen et al., 2018) to radio program transcriptions (Frisian Dutch-Fame by Braggaar and van der Goot, 2021).

web: Similarly ambiguous genre as non-fiction. It occurs in conjunction with specific genres such as blog and social and never appears alone (e.g. Persian-PerDT by Sadegh Rasooli et al., 2020).

wiki: Denotes data from Wikipedia for which cross-lingual authoring guidelines exist.
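The treebank-level labels listed above can be collected programmatically. The sketch below assumes that each treebank README carries a machine-readable line of the form "Genre: news wiki" with space-separated labels; that line format and the function names are assumptions for illustration, not something specified in this paper. It then computes the share of single-genre treebanks for a collection.

```python
from typing import Dict, List


def parse_genres(readme_text: str) -> List[str]:
    """Extract genre labels from an assumed 'Genre: a b c' metadata line."""
    for line in readme_text.splitlines():
        if line.lower().startswith("genre:"):
            # Everything after the first colon is a space-separated label list.
            return line.split(":", 1)[1].split()
    return []


def single_genre_share(treebank_genres: Dict[str, List[str]]) -> float:
    """Fraction of treebanks that list exactly one genre."""
    if not treebank_genres:
        return 0.0
    single = sum(1 for genres in treebank_genres.values() if len(genres) == 1)
    return single / len(treebank_genres)
```

Applied to the genre lists of all treebanks in a release, `single_genre_share` gives the kind of figure quoted at the start of this section.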

[Figure 1: bar chart of the percentage of sentences per genre. Genres shown on the axis: news, nonfiction, fiction, web, wiki, legal, reviews, blog, bible, grammar, spoken, social, academic, poetry, medical, learner, government.]

Figure 1: Genre Distribution in UD Version 2.8. Ranges indicate upper/lower bounds for sentences per genre inferred from UD metadata. Center marker reflects the distribution under the assumption that genres within treebanks are uniformly distributed. Labels above the bars indicate the number of treebanks which contain each genre.

Figure 1 shows the approximated distribution of these genres in UD. Maximum/minimum sentence counts are inferred from the size of single-genre treebanks plus the size of all treebanks in which a genre is said to occur. The center line denotes the distribution under the assumption that genres are uniformly distributed within each treebank.
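One reading of this estimate can be sketched as follows. The sketch assumes that the lower bound for a genre is the total size of the single-genre treebanks labeled with it, that the upper bound is the total size of all treebanks listing it, and that the center value gives each listed genre an equal share of its treebank. The input format and function name are hypothetical, and the computation behind Figure 1 may differ in detail.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def genre_bounds(treebanks: Dict[str, Tuple[int, List[str]]]) -> Dict[str, Dict[str, float]]:
    """Per-genre sentence-count bounds from treebank sizes and genre lists.

    `treebanks` maps a treebank name to (number of sentences, genre labels).
    """
    stats = defaultdict(lambda: {"min": 0.0, "uniform": 0.0, "max": 0.0})
    for size, genres in treebanks.values():
        for genre in genres:
            # Upper bound: every sentence of the treebank could be this genre.
            stats[genre]["max"] += size
            # Center: genres share the treebank's sentences equally.
            stats[genre]["uniform"] += size / len(genres)
            # Lower bound: only single-genre treebanks are certain.
            if len(genres) == 1:
                stats[genre]["min"] += size
    return dict(stats)
```

Dividing each value by the total number of sentences yields percentages comparable to the ranges in the figure.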

It is clear that news and non-fiction constitute more than half of the entire dataset. Specialized genres such as medical are less represented. For broader genres such as web, which frequently co-occurs with others, the exact number of sentences is hard to estimate, but must lie between 0–20%. Considering these large variances, access to instance-level genre will likely be crucial for effective proxy data selection and downstream domain adaptation.

3.2 Instance-level Annotations

In addition to the aforementioned 18 treebank-level genre labels, some treebanks provide instance-level genre annotations in the comment-metadata before each sentence. We find such annotations in 26 out of 200 treebanks in UD 2.8 amounting to 124k or 8.25% of all sentences.

Out of this set, 20 treebanks belong to the Parallel Universal Dependencies (PUD; Nivre et al., 2017). They are split 500/500 between news and wiki, as denoted by sentence IDs beginning with n and w respectively. The parallel nature of PUD makes it interesting for analyzing cross-lingual genre identification performance. However, these two genres only represent a small fraction of non-fiction texts and furthermore, each PUD-treebank is test-split-only. Note also that Polish-PUD as an exception has the metadata labels news and non-fiction.
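The PUD convention described above can be read directly from CoNLL-U comment lines. The sketch below assumes sentence IDs are given as `# sent_id = ...` comments and maps the first character of each ID to a genre. The helper names are illustrative, and IDs with any other prefix are mapped to `None` rather than guessed.

```python
from typing import Dict, Iterable, Optional

# Prefix convention for PUD sentence IDs: n = news, w = wiki.
PUD_PREFIX_TO_GENRE = {"n": "news", "w": "wiki"}


def pud_genre(sent_id: str) -> Optional[str]:
    """Map a PUD sentence ID to its genre, or None for unknown prefixes."""
    return PUD_PREFIX_TO_GENRE.get(sent_id[:1])


def genres_from_conllu(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Collect sentence IDs from CoNLL-U comment lines and label each one."""
    result: Dict[str, Optional[str]] = {}
    for line in lines:
        line = line.strip()
        if line.startswith("# sent_id") and "=" in line:
            sent_id = line.split("=", 1)[1].strip()
            result[sent_id] = pud_genre(sent_id)
    return result
```

For the remaining treebanks with instance-level annotations, the comment keys and label inventories differ per treebank, so a lookup of this kind would need to be adapted individually.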

The remaining six treebanks for which we were able to identify instance-level genre annotations are Belarusian-HSE (Lyashevskaya et al., 2017), Czech-CAC (Hladká et al., 2008), English-EWT (Silveira et al., 2014), German-LIT (Salomoni, 2019), Polish-LFG (Patejuk and Przepiórkowski, 2018) and Russian-Taiga (Shavrina and Shapovalova, 2017). They cover a wider set of 12 genres. Annotation schemas vary across treebanks and are neither fully compatible amongst each other nor with the 18 UD labels. Approximate mappings can however be drawn thanks to source data documentation by the respective authors (Section 4.2).

Further comment-metadata which may guide genre separation within treebanks includes document, paragraph and source identifiers. Again, these are unfortunately not available for all sentences (although coverage of these metadata reaches up to 45%) and their values do not provide further indications about genre adherence.
