Discrimination of Chinese Quantitative Style Features
Based on Text Clustering
Hou Renkui, Jiang Minghu
Lab. of Computational Linguistics, School of Humanities and Social Sciences,
Tsinghua University, Beijing 100084, China
hourk0917@163.com, jiang.mh@tsinghua.edu.cn
Abstract—The styles of “News Broadcasting” and “Qiang Qiang
Conversation between Three Individuals” differ: the former is
broadcast speech, while the latter is conversational. This
paper collects corpora from both programs and selects sentence
length, word length and sentence-initial word POS as the
features used to generate text vectors. The texts are then
clustered using the Euclidean distance and Ward's algorithm.
The analysis shows that sentence length, word length and
sentence-initial word POS can serve as quantitative stylistic
features of Chinese.
Keywords- Text Clustering, type of writing, sentence length,
word length, sentence-initial word POS
I. INTRODUCTION
Style is both the starting point and the result of linguistic
performance, and it is formed in a specific context. It is a
functional variety of speech that reflects its object in a
particular way, using linguistic means chosen according to
context [1]. In communication, a speaker constructs a discourse
genre by choosing stylistic means suited to the context and
combining specific expressions with a large amount of neutral
language material. The use of linguistic means is partly
inevitable and partly accidental, and the accidental part can be
described by probability. Quantitative analysis lets us explain
the linguistic features of a style more objectively and
scientifically. A style is shaped by the frequencies of language
units, and the regularities of those units are the basis of
stylistic analysis [2]. Stylistic means are reflected in the
statistics of the language units, so the distribution of
linguistic features can be regarded as the basis of a language
style [2].
Text clustering is an unsupervised text mining technique in
which similar elements are placed in the same group and
dissimilar elements in different groups [3]. As a form of
cluster analysis, it shares the defining property of that
statistical method: the number and structure of the categories
are not known in advance, and clustering relies on the
similarity or dissimilarity between objects. This similarity is
treated as a "distance" between objects. Objects that are close
to each other are assigned to the same class, and objects that
are far apart are assigned to different classes.
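
As an illustration of distance-based grouping, the following is a minimal sketch of agglomerative clustering with Ward's criterion, the algorithm named in the abstract. It is not the authors' implementation: the function names and the toy vectors are hypothetical, and the merge cost used here is the standard increase in within-cluster sum of squares under the Euclidean distance.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(vectors, k):
    """Merge the cheapest pair of clusters until only k clusters remain.

    Returns a sorted list of clusters, each a sorted list of indices
    into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(
                [vectors[m] for m in clusters[pair[0]]],
                [vectors[m] for m in clusters[pair[1]]],
            ),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical 2-dimensional text vectors forming two tight groups.
toy_vectors = [[0.10, 0.90], [0.12, 0.88], [0.80, 0.20], [0.78, 0.22]]
result = ward_clusters(toy_vectors, k=2)
```

The number of clusters `k` is passed in only to stop the merging; the hierarchy itself is built without knowing the categories in advance, which matches the unsupervised setting described above.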
“News Broadcasting” belongs to the broadcast style [1, 4, 5],
in which there is no interaction between the hosts. “Qiang
Qiang” belongs to the conversational style, in which the host
and guests discuss hot social issues.
This paper selects sentence length, word length and
sentence-initial word POS as the feature representation of the
texts, and uses text clustering to determine whether these
linguistic features can distinguish the two styles and whether
they can serve as quantitative stylistic features.
II. CORPUS COLLECTION, PREPROCESSING, TEXT
REPRESENTATION AND CLUSTERING ALGORITHM
The "News Broadcasting" corpus was collected from the
language resource monitoring and research center and covers
30 days; the "Qiang Qiang" corpus was collected from the
ifeng website and covers 31 days¹. Both are raw, unprocessed
corpora.
Some tags in the corpus are not part of the linguistic
performance and need to be removed, such as the time stamps,
titles and blank lines in "News Broadcasting" and the speaker
marks in "Qiang Qiang". The cleaned text is then segmented into
words and POS-tagged with the Chinese lexical analysis system
of the Institute of Computing Technology.
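
The cleaning step might look like the sketch below. The source does not show the raw corpus format, so the time stamp pattern, the speaker mark pattern, the `title_prefix` parameter and the sample lines are all hypothetical and would need to be adapted to the real files.

```python
import re

# Hypothetical markup: the raw corpus format is not shown in the source.
TIMESTAMP = re.compile(r"\[\d{2}:\d{2}:\d{2}\]")  # e.g. [19:00:05]
SPEAKER = re.compile(r"^[^\s::]{1,10}[::]\s*")   # e.g. a name followed by a colon


def clean_lines(lines, title_prefix="标题:"):
    """Drop blank lines and title lines; strip time stamps and speaker marks."""
    cleaned = []
    for line in lines:
        line = TIMESTAMP.sub("", line).strip()
        if not line or line.startswith(title_prefix):
            continue
        cleaned.append(SPEAKER.sub("", line))
    return cleaned


raw = [
    "标题:晚间新闻",
    "",
    "[19:00:05] 各位观众,晚上好。",
    "嘉宾:我觉得这个问题很有意思。",
]
cleaned = clean_lines(raw)
```

A speaker pattern this loose would also strip a short sentence opening that happens to end in a colon, so on real data it would be safer to match against a known list of speaker names.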
To represent a text, we choose certain linguistic features,
compute their distributions, normalize them and generate a text
vector. We can then calculate the Euclidean distance between
text vectors, as in formula (1), where $X = [x_1, x_2, \ldots, x_p]$
and $Y = [y_1, y_2, \ldots, y_p]$ represent two texts, and $x_i$ and
$y_i$ represent their feature values.
$$Ed(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2} \qquad (1)$$
The smaller the Euclidean distance between two vectors, the
higher their similarity and the more likely they are to be
clustered together.
¹ http://phtv.ifeng.com/program/qqsrx
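
To make the text representation and formula (1) concrete, here is a minimal sketch that turns one feature, sentence length, into a normalized distribution and compares two texts. The paper does not specify its binning or normalization, so the relative-frequency normalization, the `max_len` cutoff, the sentence delimiters and the sample texts are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

SENTENCE_END = re.compile(r"[。!?]")  # assumed sentence delimiters


def sentence_length_vector(segmented_text, max_len=5):
    """Relative frequency of sentence lengths (in words) from 1 to max_len.

    `segmented_text` is assumed to be space-separated words, as a word
    segmenter would produce; longer sentences fall into the last bin.
    """
    lengths = [
        min(len(s.split()), max_len)
        for s in SENTENCE_END.split(segmented_text)
        if s.split()
    ]
    counts = Counter(lengths)
    total = len(lengths) or 1  # avoid division by zero on empty input
    return [counts[n] / total for n in range(1, max_len + 1)]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors, as in formula (1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


text_a = "今天 天气 好 。 我们 出发 。"
text_b = "你 觉得 呢 ? 我 觉得 这个 问题 很 有意思 。"
vec_a = sentence_length_vector(text_a)
vec_b = sentence_length_vector(text_b)
distance = euclidean(vec_a, vec_b)
```

Word length and sentence-initial word POS could be handled the same way, each contributing its own block of relative frequencies to the text vector.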
978-1-4673-2197-6/12/$31.00 ©2012 IEEE