KV-match: A Subsequence Matching Approach Supporting Normalization and Time Warping


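KV-match is a subsequence matching method for mining time series data. The volume of time series data has grown explosively with the spread of new applications such as data center management and the Internet of Things (IoT), and subsequence matching is a fundamental task in mining such data. Subsequence matching finds, within a long time series, all subsequences whose distance to a query series Q is within a threshold ε. Existing index-based methods handle only raw subsequence matching (RSM) and do not support normalization, a common preprocessing step that removes differences in offset and scale so that sequences are compared on a common footing. UCR Suite can handle normalized subsequence matching (NSM), but it has to scan the entire time series, which is inefficient on large data. To address this, the paper proposes a new problem, constrained normalized subsequence matching (cNSM), which adds constraints to NSM. These constraints give users a knob for controlling the degree of offset shifting and amplitude scaling, and they make it possible to build an index for processing queries. To solve cNSM, the paper proposes a new index structure, KV-index, and a matching algorithm, KV-match.

KV-index is a key-value structure that can be implemented easily on local files or HBase tables. With a single index, KV-match supports both RSM and cNSM under either Euclidean distance (ED) or Dynamic Time Warping (DTW). To support queries of arbitrary length, the authors extend KV-match to KV-matchDP, which processes a query using multiple indexes of different lengths. Extensive experiments on synthetic and real-world datasets verify the effectiveness and efficiency of the approach. Time warping allows elasticity along the time axis when comparing time series. In practice, inconsistent sampling rates, noise, and similar factors distort raw time series, and time warping lets sequences that look mismatched only because of shifts along the time axis still match. It is widely used in areas such as speech recognition, bioinformatics, and industrial monitoring.

KV-match offers a new approach to subsequence matching on time series: by supporting normalization and time warping, it makes matching more flexible and queries more efficient. This matters most for large-scale time series in data-intensive applications such as data center management and IoT data analysis.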


KV-match: A Subsequence Matching Approach Supporting Normalization and Time Warping

Jiaye Wu, Peng Wang, Ningting Pan, Chen Wang∗, Wei Wang, Jianmin Wang∗

School of Computer Science, Fudan University, Shanghai, China
{wujy16, pengwang5, ntpan17, weiwang1}@fudan.edu.cn

∗School of Software, Tsinghua University, Beijing, China
{wang_chen, jimwang}@tsinghua.edu.cn

Abstract—The volume of time series data has exploded due to the popularity of new applications, such as data center management and IoT. Subsequence matching is a fundamental task in mining time series data. All existing index-based approaches only consider raw subsequence matching (RSM) and do not support subsequence normalization. UCR Suite can deal with the normalized subsequence matching problem (NSM), but it needs to scan the full time series. In this paper, we propose a novel problem, named the constrained normalized subsequence matching problem (cNSM), which adds some constraints to the NSM problem. The cNSM problem provides a knob to flexibly control the degree of offset shifting and amplitude scaling, which enables users to build an index to process the query. We propose a new index structure, KV-index, and a matching algorithm, KV-match. With a single index, our approach can support both RSM and cNSM problems under either ED or DTW distance. KV-index is a key-value structure, which can be easily implemented on local files or HBase tables. To support queries of arbitrary lengths, we extend KV-match to KV-matchDP, which utilizes multiple varied-length indexes to process the query. We conduct extensive experiments on synthetic and real-world datasets. The results verify the effectiveness and efficiency of our approach.

I. INTRODUCTION

Time series data are pervasive across almost all human endeavors, including medicine, finance and science. In consequence, there is an enormous interest in querying and mining time series data [1], [2].

The subsequence matching problem is a core subroutine for many time series mining algorithms. Specifically, given a long time series X, a query series Q, and a distance threshold ε, the subsequence matching problem finds all subsequences of X whose distance to Q falls within the threshold ε.
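To make the problem statement concrete, the following is a minimal brute-force sketch of raw subsequence matching under Euclidean distance. It is a naive full scan for illustration only, not the index-based method proposed in this paper; the function names `euclidean` and `rsm_scan` and the 0-based offsets are assumptions introduced here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rsm_scan(X, Q, eps):
    """Return (offset, distance) for every length-|Q| subsequence of X
    whose Euclidean distance to Q is at most eps (naive full scan)."""
    m = len(Q)
    matches = []
    for i in range(len(X) - m + 1):
        d = euclidean(X[i:i + m], Q)
        if d <= eps:
            matches.append((i, d))
    return matches


# Toy data: the pattern 1, 2, 1 occurs exactly at offset 1
# and approximately at offset 5.
X = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.1, 1.0]
Q = [1.0, 2.0, 1.0]
for offset, dist in rsm_scan(X, Q, eps=0.2):
    print(offset, round(dist, 3))
```

The scan costs O(|X| · |Q|) distance work per query, which is the cost that indexing approaches try to avoid.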

The work is supported by the Ministry of Science and Technology of China, National Key Research and Development Program (No. 2016YFB1000700), NSFC (61672163, U1509213), and Shanghai Innovation Action Project (No. 16DZ1100200).

Fig. 1. Illustrative example of cNSM. Panels: (a) PAMAP time series; (b) aligned normalized subsequences; (c) query Q (offset 877, length 17,124); (d) results of NSM with ε = 200.0, listed as offset [label] distance: 252,492 [S1] 117.78; 97,458 [S2] 130.80; 34,562 [S3] 138.12; 161,416 [S4] 149.37; ···; 134,456 [S5] 164.88; 296,063 [S6] 166.74; ···

FRM [3] is the pioneering work on subsequence matching. Many approaches have since been proposed, either to improve efficiency [4], [5] or to deal with various distance functions [6], [7], such as Euclidean distance and Dynamic Time Warping. However, all these approaches only consider the raw subsequence matching problem (RSM for short). In recent years, researchers have realized the importance of subsequence normalization [8]: it is more meaningful to compare the z-normalized subsequences than the raw ones. UCR Suite [8] is the state-of-the-art approach to solve the normalized subsequence matching problem (NSM for short).
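As an illustration of why normalization changes what counts as a match, the sketch below z-normalizes two sequences before comparing them. It assumes the population standard deviation (dividing by n) and maps a constant sequence to all zeros; both choices, and the names `znorm` and `normalized_distance`, are assumptions made for this example rather than details taken from the paper.

```python
import math


def znorm(s):
    """Z-normalize a sequence: subtract the mean, divide by the
    population standard deviation. Constant sequences map to zeros."""
    mu = sum(s) / len(s)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in s) / len(s))
    if sigma == 0:
        return [0.0] * len(s)
    return [(v - mu) / sigma for v in s]


def normalized_distance(s, q):
    """Euclidean distance between the z-normalized forms of s and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(znorm(s), znorm(q))))


Q = [1.0, 2.0, 1.0]
S = [10.0, 30.0, 10.0]  # same shape as Q, but shifted and scaled

raw = math.sqrt(sum((a - b) ** 2 for a, b in zip(S, Q)))
print("raw distance:", round(raw, 2))
print("normalized distance:", round(normalized_distance(S, Q), 6))
```

The raw distance between S and Q is large, while the normalized distance is essentially zero, because normalization removes both the offset and the amplitude difference.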

The NSM approach suffers from two drawbacks. First, it needs to scan the full time series X, which is prohibitively expensive for long time series. For example, for a time series of length $10^9$, UCR Suite needs more than 100 seconds to process a query of length 1,000. [8] analyzed the reason why it is impossible to build an index for the NSM problem. Second, an NSM query may output results that do not satisfy the user's intent, because NSM fully ignores offset shifting and amplitude scaling. However, in real-world applications, the extent of offset shifting and amplitude scaling may represent a specific physical mechanism or state. Users often only hope to find subsequences in a state similar to that of the query. We illustrate this with an example.

Example 1. The time series in Fig. 1(a) comes from the Physical Activity Monitoring for Aging People (PAMAP) dataset [1], collected from the z-accelerometer at the hand position. The monitored person performs various activities in turn, such as sitting, standing, and running. Each activity lasts for about 3 minutes, and the data collection frequency is 100 Hz. We use one subsequence corresponding to the lying activity as the query (Q in Fig. 1(c)) to find other “lying” subsequences. We issue an NSM query with Q, and Fig. 1(d) lists the top results. Unfortunately, all top-4 results correspond to other activities: S1 and S2 correspond to the sitting activity, while S3 and S4 correspond to the breaking activity. Although S5 and S6 are the desired results (they correspond to the lying activity), they are ranked outside the top 20. Fig. 1(b) shows the normalized Q together with two of these subsequences; it is difficult to distinguish them after normalization.

2019 IEEE 35th International Conference on Data Engineering (ICDE), DOI 10.1109/ICDE.2019.00082

By observing Fig. 1(a), one can filter the undesired results easily by adding an additional constraint: the output subsequences should have a mean value similar to that of Q. In fact, this new type of NSM query, NSM plus some constraints, is useful in many applications. We list two of them as follows.

• Industry application. In the wind power generation field, a LIDAR system can provide preview information about wind disturbances [9]. Extreme Operating Gust (EOG) is a typical gust pattern in which wind speed changes dramatically within a short period. Fig. 2 shows a typical EOG pattern. This pattern is important because it may damage the turbine. All EOG pattern occurrences have a similar shape, and their fluctuation degree falls within a certain range, because the wind speed cannot be arbitrarily high. If we hope to find all EOG pattern occurrences in the historical data, we can use a typical EOG pattern as the query, plus a constraint on the range of the values.

• IoT application. When a container truck goes through a bridge, the strain meter planted in the bridge will show a specific fluctuation pattern. The value range in the pattern depends on the weight of the truck. If we have one occurrence of the pattern as a query, we can additionally set a mean value range as the constraint to search for container trucks whose weight falls within a certain range.

Note that the above applications cannot be handled by an RSM query, because the existing offset shifting and amplitude scaling force us to set a very large distance threshold, which causes many false positive results.

Furthermore, to verify the universality of this new query type, we investigate the motif pairs in some popular real-world time series benchmarks. Motif mining [2] is an important time series mining task, which finds a pair (or set) of subsequences with minimal normalized distance. For a motif subsequence pair, say X and Y, we show the relative mean value difference ($\Delta\mathrm{Mean} = |\mu_X - \mu_Y| / (\max - \min)$) and the ratio of standard deviations (ΔStd) in Fig. 3. We can see that although these pairs are found without any constraint (as in an NSM query), both the mean value and the standard deviation of the motif subsequences are very similar. So we can find these pairs with a cNSM query, that is, an NSM query plus a small constraint.

Fig. 2. EOG pattern

Fig. 3. Motif example (panels: Mean and Std; x-axis: Dataset, covering Taxi, Power, Temperature, Penguin, Commute, ECG (two entries), NPRS, Video, TEK)

In this paper, we formally define a new subsequence matching problem, called the constrained normalized subsequence matching problem (cNSM for short). Two constraints, one on the mean value and the other on the standard deviation, are added to the traditional NSM problem. An example cNSM query looks like “given a query Q with mean value $\mu_Q$ and standard deviation $\sigma_Q$, return subsequences S which satisfy: (1) $\mathrm{Dist}(\hat{S}, \hat{Q}) \le 1.5$; (2) $|\mu_S - \mu_Q| \le 5$; (3) $0.5 \le \sigma_S / \sigma_Q \le 2$”, where $\hat{S}$ and $\hat{Q}$ denote the normalized S and Q. With the constraints, the cNSM problem provides a knob to flexibly control the degree of offset shifting (represented by the mean value) and amplitude scaling (represented by the standard deviation). Moreover, the cNSM problem offers us the opportunity to build an index for normalized subsequence matching.
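The three conditions of a cNSM query can be sketched as a naive scan that checks the mean and standard deviation constraints before computing the normalized distance. This is a baseline for illustration only, not the KV-match algorithm; the parameter names `max_mean_diff`, `min_std_ratio`, and `max_std_ratio`, the use of the population standard deviation, and the choice to skip constant subsequences are all assumptions of this sketch.

```python
import math


def mean_std(s):
    """Mean and population standard deviation of a sequence."""
    mu = sum(s) / len(s)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in s) / len(s))
    return mu, sigma


def cnsm_scan(X, Q, eps, max_mean_diff, min_std_ratio, max_std_ratio):
    """Naive cNSM: return (offset, normalized distance) for subsequences S
    of X with |mu_S - mu_Q| <= max_mean_diff,
    min_std_ratio <= sigma_S / sigma_Q <= max_std_ratio,
    and normalized Euclidean distance to Q at most eps."""
    m = len(Q)
    mu_q, sd_q = mean_std(Q)
    if sd_q == 0:
        raise ValueError("query must not be constant")
    q_hat = [(v - mu_q) / sd_q for v in Q]
    results = []
    for i in range(len(X) - m + 1):
        S = X[i:i + m]
        mu_s, sd_s = mean_std(S)
        if sd_s == 0:
            continue  # constant subsequence: skipped in this sketch
        if abs(mu_s - mu_q) > max_mean_diff:
            continue  # offset shifting outside the allowed range
        if not (min_std_ratio <= sd_s / sd_q <= max_std_ratio):
            continue  # amplitude scaling outside the allowed range
        d = math.sqrt(sum(((v - mu_s) / sd_s - q) ** 2
                          for v, q in zip(S, q_hat)))
        if d <= eps:
            results.append((i, d))
    return results


# Three occurrences of the same shape: one exact, one shifted by about 100,
# one scaled by 1.5.
X = [0.0, 1.0, 0.0, 100.0, 101.0, 100.0, 0.0, 1.5, 0.0]
Q = [0.0, 1.0, 0.0]
constrained = cnsm_scan(X, Q, eps=0.1, max_mean_diff=5.0,
                        min_std_ratio=0.5, max_std_ratio=2.0)
unconstrained = cnsm_scan(X, Q, eps=0.1, max_mean_diff=float("inf"),
                          min_std_ratio=0.0, max_std_ratio=float("inf"))
print([i for i, _ in constrained])
print([i for i, _ in unconstrained])
```

With the constraints relaxed to infinity the scan behaves like a plain NSM query and also returns the occurrence shifted by about 100; with the constraints in place that occurrence is filtered out, which mirrors the PAMAP example above.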

Challenges. Solving the cNSM problem faces the following challenges. First, how can we process the cNSM query efficiently? A straightforward approach is to first apply UCR Suite to find unconstrained results, and then use the mean value and standard deviation constraints to prune the unqualified ones. However, this still needs to scan the full series. Can we build an index and process the query more efficiently?

Second, users often conduct similar subsequence search in an exploratory and interactive fashion. Users may try different distance functions, like Euclidean distance or Dynamic Time Warping. Meanwhile, users may try RSM and cNSM queries simultaneously. Can we build a single index to support all these query types?
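For readers less familiar with the second distance function, here is a minimal dynamic-programming sketch of Dynamic Time Warping. It assumes a squared pointwise cost with a final square root and an optional warping-window parameter `band` (a Sakoe-Chiba style constraint); these conventions and the function name `dtw` are assumptions of this example, not definitions taken from the paper.

```python
import math


def dtw(a, b, band=None):
    """DTW distance between sequences a and b.

    band limits how far the warping path may deviate from the diagonal
    (|i - j| <= band); None means unconstrained warping."""
    n, m = len(a), len(b)
    w = max(n, m) if band is None else max(band, abs(n - m))
    INF = float("inf")
    # D[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[n][m])


# The same bump, shifted by one time step.
a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
print("DTW, unconstrained:", dtw(a, b))
print("DTW, band=0 (equals Euclidean distance):", dtw(a, b, band=0))
```

With `band=0` the warping path is forced onto the diagonal, so the result coincides with Euclidean distance; allowing warping lets the shifted bump align exactly.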

Contributions. Besides proposing the cNSM problem, we also make the following contributions.

• We present the filtering conditions for four query types, RSM-ED, RSM-DTW, cNSM-ED and cNSM-DTW, and prove their correctness. The conditions enable us to build an index while guaranteeing no false dismissals.

• We propose a new index structure, KV-index, and the query processing approach, KV-match, to support all these query types. The biggest advantage is that we can process various types of queries efficiently with a single index. Moreover, KV-match only needs a small number of sequential scans of the index, instead of many random accesses of tree nodes as in the traditional R-tree index, which makes it much more efficient.

• To support queries of arbitrary lengths efficiently, we extend KV-match to KV-matchDP, which utilizes multiple indexes with different window lengths. We conduct extensive experiments. The results verify the efficiency and effectiveness of our approach.

The rest of the paper is organized as follows. We present the preliminary knowledge and problem statements in Section II. In Section III we introduce the theoretical foundation and motivate the approach. Sections IV and V describe our index structure, index building algorithm and query processing algorithm. Section VI extends our method to use multi-level indexes with different window lengths. The experimental results are presented in Section VII and we discuss related
