2 Related Work
The largest hurdle for cross-lingual genre classification is the lack of shared representational spaces.
Sharoff (2007) uses shared POS n-grams in order to jointly classify the genre of English and Russian
documents. Petrenz (2012) similarly seeks out features which are stable across languages in order to
classify English and Chinese documents into four shared genres. A recent data-driven approach finds that
monolingual MLM embeddings can be clustered into five groups closely representing the data sources
of the original corpus (Aharoni and Goldberg, 2020). In this work, we investigate whether this holds for
multilingual settings as well.
Being able to identify textual genre has been crucial for domain-specific fine-tuning (Dai et al., 2020;
Gururangan et al., 2020), including for dependency parsing. For parser training, in-genre data is typically
selected by proxy of the data source (Plank and van Noord, 2011; Rehbein and Bildhauer, 2017; Sato
et al., 2017). Data-driven approaches which include automatically inferred topics based on word and
embedding distributions (Ruder and Plank, 2017) as well as POS-based approaches (Søgaard, 2011;
Rosa, 2015; Vania et al., 2019) have also been found effective.
Universal Dependencies (Nivre et al., 2020) aims to consolidate syntactic annotations for a wide variety
of languages and genres under a single scheme. The latest release contains 114 languages, many
with fewer than 100 sentences. In order for languages at all resource levels to benefit from domain
adaptation, it will continue to be important to identify cross-lingually stable signals for genre. While lan-
guage labels are generally agreed upon, differences in genre are more subtle. Metadata at the treebank
level provides some insight into the genres of the original data sources; however, these are “neither mutually
exclusive nor based on homogeneous criteria, but [are] currently the best documentation that can be
obtained” (Nivre et al., 2020).
Stymne (2020) performs an initial study on using these treebank metadata labels for the selection
of spoken and Twitter data. Results show that training on out-of-language/in-genre data is superior to
out-of-language/out-of-genre data. However, the best results are obtained using in-language data regardless
of genre adherence. This holds across multiple methods of proxy dataset selection (e.g. treebank
embeddings; Smith et al., 2018).
Recently, Müller-Eberstein et al. (2021) have shown that combining UD genre metadata and MLM
embeddings can improve proxy training data selection for zero-shot parsing of low-resource languages.
The use of genre in their work is more implicit as it is mainly driven by the genre of the target data. In
contrast, this work takes a holistic view and explicitly examines the classification of instance-level genre
for all sentences in UD.
As genre appears to be a valuable signal, we set out to investigate how it is defined and distributed
within UD. Due to the coarse, treebank-level nature of current genre annotations, we hypothesize that a
clearer picture can only be obtained by moving to the sentence level. We therefore transition from prior
supervised document genre prediction to weakly supervised instance genre prediction. Additionally, we
expand the linguistic scope from mono- or bilingual corpora to all 114 languages currently in UD.
More generally, this task can be viewed as predicting genre labels for all sentences in all corpora of a
collection while only being given the set of labels said to be contained in each corpus.
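To make this formulation concrete, the following minimal Python sketch illustrates the constraint it imposes: every sentence must receive one label drawn from the set declared for its corpus. It is an illustration only and not the method proposed in this work. The function name `constrained_predict`, the per-sentence score dictionaries, and the genre names are hypothetical, and the scores are assumed to come from some upstream model.

```python
from typing import Dict, List, Set


def constrained_predict(
    scores: List[Dict[str, float]], allowed: Set[str]
) -> List[str]:
    """For each sentence, pick the best-scoring genre among the corpus's declared labels.

    `scores` holds one dict per sentence, mapping genre name -> score
    (hypothetical output of some upstream model). `allowed` is the set of
    genre labels listed for the corpus as a whole.
    """
    if not allowed:
        raise ValueError("a corpus must declare at least one genre label")
    candidates = sorted(allowed)  # fixed order so that ties break deterministically
    predictions = []
    for sentence_scores in scores:
        # Genres missing from the score dict are treated as scoring -inf.
        best = max(candidates, key=lambda g: sentence_scores.get(g, float("-inf")))
        predictions.append(best)
    return predictions


# Toy corpus declaring two genres; "spoken" scores highest for the first
# sentence but is excluded because the corpus does not list it.
corpus_labels = {"news", "wiki"}
sentence_scores = [
    {"news": 0.2, "wiki": 0.5, "spoken": 0.9},
    {"news": 0.7, "wiki": 0.1, "spoken": 0.0},
]
print(constrained_predict(sentence_scores, corpus_labels))  # ['wiki', 'news']
```

For a corpus that declares a single genre the constraint fully determines every sentence label, whereas multi-genre corpora leave the assignment open; this is what makes the supervision weak.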
3 UD-level Genre
We analyze genre as currently used in the genres metadata of 200 treebanks from Universal Dependencies
version 2.8 (Zeman et al., 2021). Section 3.1 provides an overview of all UD genre types, and Section
3.2 analyzes how these global labels relate to the subset of treebanks which do provide treebank-specific,
instance-level genre annotations.
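As a rough illustration of how such treebank-level genre metadata could be tallied, the sketch below assumes that each treebank's README contains a line of the form `Genre: news wiki` listing space-separated genre labels. The helper names `parse_genres` and `genre_summary` and the toy treebank names are hypothetical; this is not the analysis code behind the figures reported in this section.

```python
from collections import Counter
from typing import Dict, List, Tuple


def parse_genres(readme_text: str) -> List[str]:
    """Return the labels on the first 'Genre:' line, or [] if there is none."""
    for line in readme_text.splitlines():
        if line.strip().lower().startswith("genre:"):
            return line.split(":", 1)[1].split()
    return []


def genre_summary(readmes: Dict[str, str]) -> Tuple[Counter, float]:
    """Count treebanks per genre and the share of treebanks with exactly one genre."""
    per_treebank = {name: parse_genres(text) for name, text in readmes.items()}
    genre_counts = Counter(g for genres in per_treebank.values() for g in genres)
    single = sum(1 for genres in per_treebank.values() if len(genres) == 1)
    share_single = single / len(per_treebank) if per_treebank else 0.0
    return genre_counts, share_single


# Toy READMEs standing in for real treebank metadata (assumed format).
toy_readmes = {
    "UD_Toy-A": "License: CC BY-SA 4.0\nGenre: news\n",
    "UD_Toy-B": "Genre: news wiki spoken\n",
    "UD_Toy-C": "No metadata block here\n",
}
counts, share = genre_summary(toy_readmes)
print(counts.most_common(), round(share, 2))
```

Treebanks without a recognizable genre line are kept in the denominator with an empty label list, so they lower the single-genre share rather than being silently dropped.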
3.1 Available Metadata
UD 2.8 (Zeman et al., 2021) contains 18 genres which are denoted in each treebank’s accompanying
metadata. Around 36% of treebanks contain a single genre, while the remaining majority contain
between 2 and 10 genres which are not further labeled at the instance level. There is no official description
of each genre label; however, they can be roughly categorized as follows: