Journal of Physics: Conference Series
PAPER • OPEN ACCESS
Incremental Rapidly Grouping Aggregation Method for Similar Web
News Headline
To cite this article: Shengze Hu et al 2020 J. Phys.: Conf. Ser. 1453 012153
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
CISAI 2019
Journal of Physics: Conference Series 1453 (2020) 012153
IOP Publishing
doi:10.1088/1742-6596/1453/1/012153
Incremental Rapidly Grouping Aggregation Method for
Similar Web News Headline
Shengze Hu¹, Chunhui He¹*, Chong Zhang¹, Bin Ge¹, Huiming Zhu¹ and Wei Liu¹
¹ Science and Technology on Information Systems Engineering Laboratory, National
University of Defense Technology, Changsha, Hunan, P.R. China, 410073.
*Corresponding author’s e-mail: xtuhch@163.com
Abstract. With the popularity of the Internet, real-time web news has become a widely used part of
daily life. Because of competition and repeated reporting among media outlets, many similar news
items may cover the same hot event, and users waste considerable time obtaining and reading them.
Rapidly aggregating similar news so that users can find relevant information accurately is
therefore a meaningful and essential problem. However, the accuracy of existing aggregation methods
needs improvement, and they do not meet real-time incremental processing requirements.
In this paper, our main goal is the rapid incremental aggregation of large numbers of similar news
headlines. We propose a novel Rapidly Grouping Aggregation (RGA) method. It aggregates similar news
headlines through two core components: Improved Jaro-Winkler (IJW) similarity calculation, and
normalized mapping with rapid grouping aggregation. IJW uses hierarchical matching windows to
improve similarity calculation, while an MD5 normalization strategy and the Group By function of
the MongoDB database are used to group similar news headlines rapidly. Results from an actual
system deployment show that, compared with the state-of-the-art method, the RGA method achieves
higher performance on similar news headline aggregation tasks.
1. Introduction
With the rapid development of the Internet, simple and efficient short texts have gradually become an
important carrier of information exchange in daily life. Short texts mainly include Weibo posts, web
news headlines, entities and relations [1], etc. How to aggregate and mine these short texts is a very
challenging task.
In addition, aggregating large-scale web news headlines in practice is far from simple, owing to
several challenges: (a) similar web news headlines often differ slightly in wording, a common problem
in heterogeneous multi-source short text data sets; (b) compared with long text, web news headlines
often contain colloquial words, which makes short text aggregation more difficult than long text
aggregation; (c) news headlines are short texts, so they offer few usable features; (d) public opinion
systems contain large amounts of news, and the number of web news headlines under a single topic may
reach the million level. Directly calculating the similarity between every pair of headlines is very
inefficient at that scale. This makes the problem difficult to solve.
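To make the cost in challenge (d) concrete, the following worked count assumes that "directly calculating" means comparing every pair of headlines once, and takes n = 10^6 headlines as an illustrative figure for the "million level":

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2},
\qquad
n = 10^{6} \;\Rightarrow\; \frac{10^{6}\,(10^{6}-1)}{2} = 499{,}999{,}500{,}000 \approx 5 \times 10^{11}
\text{ similarity computations.}
\]
```

Under this assumption the work grows quadratically with the number of headlines, which is why restricting each comparison to a small set of retrieved candidates matters.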
Challenges (a) and (b) reflect the dilemma facing current short-text aggregation research: the same
object may appear under different names, so items that should be aggregated are missed, while
different objects with very similar names may be incorrectly aggregated together. However, in recent
years, with the vigorous development of big data technology, short-text aggregation has attracted
considerable attention from researchers [2]. Text aggregation methods fall into two categories:
statistical methods and machine learning methods [3]. Existing methods of both kinds do not support
incremental aggregation of similar web news headlines. To deal with challenges (a)-(d), we propose a
Rapidly Grouping Aggregation (RGA) model based on retrieval technology and the short text similarity
algorithm IJW, which can rapidly aggregate similar web news headlines. The method contains three core
steps. First, the BM25 algorithm retrieves the most relevant news headlines. Second, a character-based
IJW similarity algorithm and a specific normalization strategy rapidly identify the most similar news
headline and efficiently complete the construction of the index. Finally, the aggregation results are
obtained with the Group By aggregation function of the MongoDB database. The flow chart of the RGA
method is shown in Figure 1.
Figure 1. Flow chart of the RGA method.
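The first step, BM25 retrieval of candidate headlines, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses the common Okapi BM25 formula with assumed parameters k1 = 1.5 and b = 0.75, whitespace tokenization, and a linear scan in place of a real inverted index. The function names `bm25_scores` and `top_candidates` are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive for frequent terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_candidates(query, docs, limit=3):
    """Return indices of the best-scoring documents that share a query term."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:limit] if scores[i] > 0]


# Illustrative headlines (assumed data, not from the paper).
headlines = [
    "storm hits coastal city overnight",
    "coastal city hit by storm",
    "stock market closes higher",
]
docs = [h.split() for h in headlines]
query = "storm hits coastal city".split()
candidates = top_candidates(query, docs)
```

Only the headlines returned as candidates would then be passed to the more expensive similarity calculation, which keeps the number of string comparisons per incoming headline small.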
As shown in Figure 1, we use inverted index technology to build an index over the web news headlines.
In our system, the similarity threshold K was set to 0.75, which gave the best performance. Through
the RGA method, all similar news headlines are mapped to the same MD5 value and aggregated together.
The main contributions of this paper are as follows:
1) As far as we know, this is the first integrated method that combines a search engine,
similarity calculation, and normalized mapping technology to solve the similar news headline
aggregation task.
2) We improve the JW algorithm by designing a hierarchical matching window and a weight
update method, yielding the IJW algorithm, which enhances the accuracy of similarity calculation
between two strings.
3) We put forward a normalization method based on the MD5 hash algorithm, BM25, and the IJW
algorithm to map all similar news headlines to the same MD5 value.
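The similarity and normalized-mapping ideas in contributions 2) and 3) can be sketched as below. Several assumptions apply. The similarity function is the standard Jaro-Winkler measure, not the authors' IJW, because the hierarchical matching window and weight update are not specified in this section. Each incoming headline is compared against every group representative rather than against BM25 candidates. A plain dictionary stands in for the MongoDB Group By step, which could plausibly be expressed as a `$group` stage keyed on the stored MD5 field. The names `assign_keys` and `group_by_md5` are hypothetical.

```python
import hashlib


def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched equal character inside the matching window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the common prefix (at most 4 characters)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def md5_key(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def assign_keys(headlines, threshold=0.75):
    """Incrementally give each headline the MD5 key of its most similar group."""
    representatives = []  # one (headline, key) pair per group seen so far
    records = []
    for title in headlines:
        best_key, best_sim = None, threshold
        for rep, key in representatives:
            sim = jaro_winkler(title, rep)
            if sim >= best_sim:
                best_key, best_sim = key, sim
        if best_key is None:  # nothing similar enough: start a new group
            best_key = md5_key(title)
            representatives.append((title, best_key))
        records.append({"title": title, "md5": best_key})
    return records


def group_by_md5(records):
    """In-memory stand-in for a database group-by on the md5 field."""
    groups = {}
    for record in records:
        groups.setdefault(record["md5"], []).append(record["title"])
    return groups
```

Because every headline in a group carries the same MD5 key, new headlines can be added one at a time without recomputing existing groups, and the final aggregation reduces to a single grouping operation on that key.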
2. Related work
A short text aggregation model usually requires a text similarity algorithm. Common string-based
similarity algorithms generally fall into two categories [4]: Character-based