Discrimination of Chinese Quantitative Style Features Based on Text Clustering

Hou Renkui, Jiang Minghu

Lab. of Computational Linguistics, School of Humanities and Social Sciences,

Tsinghua University, Beijing 100084, China

hourk0917@163.com, jiang.mh@tsinghua.edu.cn

Abstract—The styles of “News Broadcast” and “Qiang Qiang Conversation between Three Individuals” differ: the former is broadcast style, while the latter is conversational. This paper collects corpora of both programs and selects sentence length, word length and sentence-initial word POS as the features from which text vectors are generated. The texts are then clustered using Euclidean distance and Ward's algorithm. The analysis shows that sentence length, word length and sentence-initial word POS can serve as quantitative stylistic features of Chinese.

Keywords- text clustering, type of writing, sentence length, word length, sentence-initial word POS

I. INTRODUCTION

Style is both the starting point and the result of linguistic performance, and it is formed in a specific context. It is a functional variety of speech that reflects its object in a particular way, using linguistic means chosen according to the context [1]. In communication, a speaker constructs a discourse of a given genre by choosing stylistic means suited to the context, using specific expressions together with a large amount of stylistically neutral language material. The use of linguistic means involves both necessity and chance, and the element of chance can be described by probability. Quantitative analysis lets us explain the linguistic features of a style more objectively and scientifically. A style is formed by the frequencies of language units, and the regularities of those units are the basis of stylistic analysis [2]. Stylistic means are reflected in the statistics of language units, so the distribution of linguistic features can be regarded as the basis of a language style [2].

Text clustering is an unsupervised text mining technique in which similar elements are placed in the same group and dissimilar elements in different groups [3]. As a form of cluster analysis, it shares the characteristics of that statistical method: the number and structure of the categories are not known in advance, and clustering is based on the similarity or dissimilarity between objects. This similarity is treated as a "distance" measure between objects. Objects that are close to each other are assigned to the same class, and objects that are far apart are assigned to different classes.
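The grouping-by-distance idea above can be sketched with a small agglomerative procedure. The following is a minimal pure-Python illustration of Ward's criterion (the linkage named in the abstract), not the authors' implementation. The function names `sq_dist` and `ward_clusters`, the stopping rule of a fixed number of clusters `k`, and the toy vectors are all assumptions made for this example.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_clusters(vectors, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    Each cluster is stored as (member_indices, centroid). At every step the
    pair whose merge increases the within-cluster sum of squares the least
    is merged; that increase is |A||B| / (|A| + |B|) * ||cA - cB||^2.
    """
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            (ma, ca), (mb, cb) = clusters[i], clusters[j]
            na, nb = len(ma), len(mb)
            cost = na * nb / (na + nb) * sq_dist(ca, cb)
            if best is None or cost < best[0]:
                best = (cost, i, j)
        _, i, j = best
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        # Centroid of the merged cluster is the size-weighted mean.
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]


# Toy vectors invented for illustration; they are not data from the paper.
toy = [[0.1, 0.9], [0.15, 0.85], [0.8, 0.2], [0.85, 0.15], [0.12, 0.88]]
groups = sorted(ward_clusters(toy, 2))
print(groups)  # [[0, 1, 4], [2, 3]]
```

The number of clusters is not known in advance in cluster analysis, so in practice the full merge sequence would be inspected (for example as a dendrogram) rather than cut at a preset `k`; the fixed `k` here only keeps the sketch short.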

“News Broadcasting” belongs to the broadcast style [1, 4, 5], in which there is no interaction between the hosts. “Qiang Qiang” belongs to the conversational style, in which the host and guests discuss hot social issues.

This paper selects sentence length, word length and sentence-initial word POS as feature representations of the texts and uses text clustering to determine whether these linguistic features can distinguish the two styles of text and whether they can serve as quantitative stylistic features.

II. CORPUS COLLECTION, PREPROCESSING, TEXT REPRESENTATION AND CLUSTERING ALGORITHM

The "News Broadcasting" corpus was collected from the language resource monitoring and research center and covers 30 days; the "Qiang Qiang" corpus was collected from the ifeng website and covers 31 days. Both are raw corpora.

Some tags in the corpus are not part of the linguistic performance and need to be removed, such as the time stamps, titles and blank lines in "News Broadcasting" and the speaker marks in "Qiang Qiang". The texts are then segmented into words and POS-tagged with the Chinese lexical analysis system developed by the Institute of Computing Technology.
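Once a text has been segmented and tagged, the three features named in the introduction can be read off the tagged tokens. The sketch below is illustrative only. It assumes the tagger output is a string of space-separated `word/POS` tokens, that sentences end at 。, ！ or ？, that sentence length is counted in words and word length in characters, and that other punctuation has already been removed. The excerpt states none of these details, and the sample sentence and its tags are invented.

```python
# Assumed tagger output: space-separated "word/POS" tokens (hypothetical sample).
tagged = "今天/t 天气/n 很/d 好/a 。/w 我们/r 去/v 公园/n ！/w"

SENTENCE_END = {"。", "！", "？"}  # assumed sentence delimiters


def split_sentences(tagged_text):
    """Group (word, pos) pairs into sentences, dropping the end punctuation."""
    sentences, current = [], []
    for token in tagged_text.split():
        # rpartition splits on the last "/", so a word containing "/" survives.
        word, _, pos = token.rpartition("/")
        if word in SENTENCE_END:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append((word, pos))
    if current:  # text that does not end with a delimiter
        sentences.append(current)
    return sentences


sentences = split_sentences(tagged)
sentence_lengths = [len(s) for s in sentences]            # words per sentence
word_lengths = [len(w) for s in sentences for w, _ in s]  # characters per word
initial_pos = [s[0][1] for s in sentences]                # POS of first word

print(sentence_lengths)  # [4, 3]
print(word_lengths)      # [2, 2, 1, 1, 2, 1, 2]
print(initial_pos)       # ['t', 'r']
```

Each of the three lists would then be summarized as a frequency distribution per text before vectors are built.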

To represent a text, we choose certain linguistic features, compute their distributions, normalize them and generate a text vector. The Euclidean distance between two text vectors is calculated by formula (1), where $X = [x_1, x_2, \ldots, x_n]$ and $Y = [y_1, y_2, \ldots, y_n]$ represent two texts and $x_i$ and $y_i$ are their feature values.

$$E_d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$

The smaller the Euclidean distance between two vectors, the higher their similarity and the more likely they are to be clustered together.

Footnote: the "Qiang Qiang" corpus source is http://phtv.ifeng.com/program/qqsrx
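The step from feature counts to a distance can be illustrated as follows. This is a minimal sketch in which "normalized" is read as relative frequency, which is an assumption, since the excerpt does not specify the normalization. The helper names `distribution` and `euclidean` and the two lists of word lengths are invented for the example.

```python
import math
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category, in the fixed order given."""
    if not values:
        raise ValueError("cannot build a distribution from an empty list")
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors, as in formula (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Invented word lengths (in characters) for two hypothetical texts.
text_a = [1, 2, 2, 1, 2, 3, 2, 2]
text_b = [1, 1, 2, 1, 1, 2, 1, 1]
categories = [1, 2, 3]  # both vectors must share the same category order

vec_a = distribution(text_a, categories)  # [0.25, 0.625, 0.125]
vec_b = distribution(text_b, categories)  # [0.75, 0.25, 0.0]
d = euclidean(vec_a, vec_b)
print(round(d, 4))  # 0.6374
```

Fixing the category order before building the vectors matters: the distance is only meaningful when position i holds the same feature in both texts.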
