Corpus for Chinese News Headline Categorization
1 Task Denition
This task aims to evaluate the automatic classication
techniques for very short texts, i.e., Chinese news head-
lines. Each news headline (i.e., news title) is required
to be classied into one or more predened categories.
With the rise of Internet and social media, the text data
on the web is growing exponentially. Make a human
being to analysis all those data is impractical, while ma-
chine learning techniques suits perfectly for this kind of
tasks. after all, human brain capacity is too limited and
precious for tedious and non-obvious phenomenons.
Formally, the task is dened as follows: given a news
headline x = (x
1
, x
2
, ..., x
n
), where x
j
represents jth
word in x, the object is to nd its possible category or
label c ∈ C. More specically, we need to nd a function
to predict in which category does x belong to.
c
∗
= arg max
c∈C
f(x; θ
c
), (1)
where θ is the parameter for the function.
2 Data
We collected news headlines (titles) from several Chinese
news websites, such as toutiao, sina, and so on.
There are 18 categories in total. The detailed infor-
mation of each category is shown in Table 1. All the
sentences are segmented by using the python Chinese
segmentation tool jieba.
Some samples from training dataset are shown in Ta-
ble 2.
Length The statistical information is also given in Fig.
3.
Figure 1 shows that most of title sentence character
number is less than 40, with a mean of 21.05. Title
sentence word length is even shorter, most of which is
less than 20 with a mean of 12.07.
The dataset is released on github along with a Ten-
sorow
[
Abadi et al., 2015
]
implemented demonstration
code.
3 Evaluation
We use the macro-averaged precision, recall and F1 to
evaulate the performance.
Category Train Dev Test
entertainment 10000 2000 2000
sports 10000 2000 2000
car 10000 2000 2000
society 10000 2000 2000
tech 10000 2000 2000
world 10000 2000 2000
nance 10000 2000 2000
game 10000 2000 2000
travel 10000 2000 2000
military 10000 2000 2000
history 10000 2000 2000
baby 10000 2000 2000
fashion 10000 2000 2000
food 10000 2000 2000
discovery 4000 2000 2000
story 4000 2000 2000
regimen 4000 2000 2000
essay 4000 2000 2000
Table 1: The information of categories.
The Macro Avg. is dened as follow:
Macro_avg =
1
m
m
∑
i=1
ρ
i
And Micro Avg. is dened as:
Micro_avg =
1
N
m
∑
i=1
w
i
ρ
i
Where m denotes the number of class, in the case of this
dataset is 18. ρ
i
is the accuracy of ith category, w
i
rep-
resents how many test examples reside in ith category,
N is total number of examples in the test set.
4 Baseline Implementations
As a branch of machine learning, Deep Learning (DL)
has gained much attention in recent years due to its
prominent achievement in several domains such as Com-
puter vision and Natural Language processing.