Journal of Physics: Conference Series

PAPER • OPEN ACCESS

Incremental Rapidly Grouping Aggregation Method for Similar Web News Headline

To cite this article: Shengze Hu et al 2020 J. Phys.: Conf. Ser. 1453 012153

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Published under licence by IOP Publishing Ltd

CISAI 2019

Journal of Physics: Conference Series 1453 (2020) 012153

IOP Publishing

doi:10.1088/1742-6596/1453/1/012153

Incremental Rapidly Grouping Aggregation Method for Similar Web News Headline

Shengze Hu, Chunhui He, Chong Zhang, Bin Ge, Huiming Zhu and Wei Liu

Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, Hunan, P.R. China, 410073.

*Corresponding author’s e-mail: xtuhch@163.com

Abstract. With the popularity of the Internet, real-time web news has become a widely used part of daily life. Because of competition and repetitive reporting among media outlets, multiple similar news items may report the same hot event, and users can waste a lot of time obtaining and reading them. How to rapidly aggregate similar news, so that users can obtain relevant news information accurately, is therefore a meaningful and essential problem. However, the accuracy of existing aggregation methods needs to be improved, and they do not meet real-time incremental processing requirements.

This paper targets the rapid incremental aggregation of large numbers of similar news headlines. We propose a novel Rapidly Grouping Aggregation (RGA) method. It aggregates similar news headlines through two core components: Improved Jaro-Winkler (IJW) similarity calculation, and normalized mapping with rapid grouping aggregation. IJW uses hierarchical matching windows to enhance the performance of the similarity calculation, and RGA uses an MD5 normalization strategy together with the Group By function of the MongoDB database to group similar news headlines rapidly. Results from an actual system deployment show that the RGA method achieves higher performance than the state-of-the-art method on similar news headline aggregation tasks.

1. Introduction

With the rapid development of the Internet, simple and efficient short texts have gradually become an important carrier of information exchange in daily life. Short texts mainly include Weibo content, web news headlines, entities and relations [1], etc. How to aggregate and mine these short texts is a very challenging task.

In addition, aggregating large-scale web news headlines in practice is not as simple as the above example, owing to a number of challenges: (a) similar web news headlines often differ slightly in wording, a common problem in multi-source heterogeneous short text data sets; (b) compared with long text, web news headlines often contain colloquial words, which makes short text aggregation more difficult than long text aggregation; (c) news headlines are short texts, so they offer few usable features; (d) public opinion systems contain large amounts of news, and the number of web news headlines under one topic may reach the million level. Directly computing the similarity for every news headline is very inefficient. Thus, it is a tricky problem that needs to be solved.

Challenges (a) and (b) reflect the dilemma facing current short text aggregation research: names for the same object may differ, so headlines that should be aggregated are not aggregated correctly, while different objects with very similar names may be incorrectly aggregated together. In recent years, with the vigorous development of big data technology, this problem has attracted much attention from researchers [2]. Text aggregation methods fall into two categories: statistical methods and machine learning methods [3]. Existing statistical and machine learning methods do not address the challenges of incremental aggregation of similar web news headlines. To deal with challenges (a)-(d), we propose a rapid group aggregation (RGA) model based on retrieval technology and the short text similarity calculation algorithm IJW. The RGA method can rapidly aggregate similar web news headlines. It contains three core steps. First, the BM25 algorithm retrieves the most relevant news headlines. Second, a character-based IJW similarity algorithm and a specific normalization strategy rapidly identify the most similar news headline and efficiently complete the construction of the index. Finally, the aggregation results are obtained with the Group By aggregation function of the MongoDB database. The flow chart of the RGA method is shown in Figure 1.

Figure 1. Flow chart of the RGA method
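The first step, BM25 retrieval of candidate headlines, can be sketched as follows. This is a minimal illustration of the standard Okapi BM25 scoring formula, not the authors' implementation: the function names `bm25_scores` and `top_candidates`, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and a production system would query an inverted index rather than scan every headline.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF variant; the +1 inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_candidates(query, headlines, limit=3):
    """Return up to `limit` headlines with a positive BM25 score, best first."""
    # Assumption: lowercase whitespace tokenization; a real system would use
    # a tokenizer suited to the language of the headlines.
    def tokenize(text):
        return text.lower().split()

    scores = bm25_scores(tokenize(query), [tokenize(h) for h in headlines])
    ranked = sorted(zip(scores, headlines), key=lambda pair: pair[0], reverse=True)
    return [headline for score, headline in ranked[:limit] if score > 0]
```

Calling `top_candidates` with a new headline as the query and the already indexed headlines as the collection yields the short candidate list that the similarity step then examines, so the expensive character-level comparison runs on a few headlines rather than on the whole collection.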

In Figure 1, we use inverted index technology to build the index over the web news headlines. In our system, the similarity threshold K is set to 0.75, which gives the best performance. Through the RGA method, we can map all similar news headlines to the same MD5 value and aggregate them together. The main contributions of this paper are as follows:

1) As far as we know, this is the first time an integrated method built on a search engine, similarity calculation, and normalized mapping technology has been used to solve the similar news headline aggregation task.

2) We improve the JW algorithm by designing a hierarchical matching window and a weight update method, yielding the IJW algorithm. It enhances the accuracy of similarity calculation between two strings.

3) We propose a normalization method based on the MD5 hash algorithm, BM25, and the IJW algorithm to map all similar news headlines to the same MD5 value.
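For reference, the baseline that IJW builds on and the normalized mapping idea can be sketched together. The code below implements the standard Jaro-Winkler similarity, not the IJW variant (its hierarchical matching windows and weight update are not reproduced here), and then assigns each headline the MD5 digest of the most similar group representative when the similarity reaches the threshold K = 0.75. The helper names, the linear scan over representatives (standing in for BM25 retrieval) and the in-memory dictionary (standing in for MongoDB's Group By) are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def assign_group_keys(headlines, similarity, threshold=0.75):
    """Give each headline the MD5 key of its most similar representative."""
    representatives = []  # (headline, md5 key) for each group seen so far
    keys = []
    for title in headlines:
        best_key, best_score = None, threshold
        for rep, rep_key in representatives:
            score = similarity(title, rep)
            if score >= best_score:
                best_key, best_score = rep_key, score
        if best_key is None:
            # No group is similar enough: the headline starts a new group.
            best_key = hashlib.md5(title.encode("utf-8")).hexdigest()
            representatives.append((title, best_key))
        keys.append(best_key)
    return keys


def group_by_key(headlines, keys):
    """In-memory stand-in for a database Group By over the MD5 key."""
    groups = defaultdict(list)
    for title, key in zip(headlines, keys):
        groups[key].append(title)
    return dict(groups)
```

Because new headlines are processed one at a time against existing representatives, the mapping is incremental: earlier assignments never need to be recomputed. If the keys were stored in a hypothetical `md5` field of each document, the final grouping could be delegated to MongoDB's `$group` aggregation stage instead of `group_by_key`.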

2. Related work

A short text aggregation model usually requires a text similarity algorithm. Common string-based similarity algorithms generally fall into two categories [4]: Character-based
