measuring the similarity between two MTS items is also important
for classification. In practice, Euclidean distance [20] and dynamic
time warping (DTW) [21] are two of the most popular methods.
The former is fast, but its quality is easily affected by abnormal
data points, and it usually requires that the time series are equal
in length. DTW is robust, but the time and space complexity of its
computation are very high, which makes it unsuitable for long time
series with large numbers of variables.
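For illustration, the contrast between the two measures can be sketched in Python (an illustrative sketch, not code from this paper); the quadratic dynamic-programming table below is precisely the source of DTW's high time and space cost:

```python
import numpy as np

def euclidean(a, b):
    # Requires equal-length series; sensitive to outliers and misalignment.
    return np.sqrt(np.sum((a - b) ** 2))

def dtw(a, b):
    # Classic dynamic-programming DTW: O(len(a) * len(b)) time and space,
    # which is why it scales poorly to long series.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same shape, shifted and longer
print(dtw(a, b))  # → 0.0: DTW warps over the shift; Euclidean distance is
                  # not even defined here because the lengths differ
```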
In most cases, MTS have different lengths. Most of the traditional
techniques can reduce the variable-based dimensions, but the number
of time-based dimensions is retained; thus, MTS with different
lengths will yield representative sequences that also differ in
length. For instance, PCA transforms MTS into the corresponding PCS
with different lengths, which means that distance functions such as
DTW must be used to measure the similarity. However, although
DTW is an effective approach, it requires excessive amounts of time
and space to measure distances.
To overcome the problems mentioned above, we propose an
accurate and efficient MTS classification method. The main moti-
vations of our study are summarized as follows. First, we analyze
the weaknesses of the traditional common PCA (CPCA) applied to
the field of time series data mining. The limited effectiveness and
efficiency of CPCA are shortcomings of the algorithms used to
extract knowledge from MTS datasets; thus, it is necessary
to design a novel method to address these difficulties. Second, the
traditional methods based on PCA often fail to handle MTS data
with different lengths; therefore, the proposed method should
consider various lengths and improve the quality of PCA for
mining time series.
In this study, various MTS clusters are transformed to construct
the corresponding reduced subspaces by CPCA. Each space is then
organized by the eigenvectors of the common covariance matrix in
a cluster. The MTS without class labels in the test dataset are
projected onto the different subspaces and the minimal variance in
the reduced PCS according to different projections can specify the
label values for the MTS in the test dataset. The two main
contributions of our proposed method are as follows. First, MTS
items with the same label in a cluster are used to construct the
subspace, which means that the PCS of any MTS item projected
onto the corresponding subspace has a large variance. Thus, the
larger the variance in the PCS derived from a given subspace, the
more similar the projected item is to the items in the corresponding
cluster. Second, we treat the largest variance in the different
subspaces as a classifier with high efficiency, which improves the
performance of the proposed method. Three advantages may be
obtained, as follows. (1) Our proposed method is faster at
classifying MTS compared with the existing methods based on
PCA. (2) The quality of the classification results obtained by the
proposed method is often better than that of traditional
methods. (3) The proposed method is suitable for the classification
of MTS datasets where the lengths of the MTS items are different.
The results of our experimental evaluation also indicated that the
proposed method is more accurate and efficient.
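The classification idea outlined above can be sketched as follows. This is a simplified, hypothetical Python illustration under our own naming (`class_subspace`, `classify`), not the exact algorithm of this paper: each class contributes a subspace built from the leading eigenvectors of its covariance matrix, and a test item receives the label of the subspace in which its projection retains the largest variance.

```python
import numpy as np

def class_subspace(series_list, p):
    # Pool all observations of one class (rows = time points, columns =
    # variables) and keep the first p eigenvectors of the covariance matrix.
    X = np.vstack(series_list)              # works even if lengths differ
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]          # descending variance
    return vecs[:, order[:p]]               # m x p subspace basis

def classify(test_series, subspaces):
    # Project onto each class subspace; pick the class whose reduced PCS
    # retains the most variance.
    scores = {}
    for label, A in subspaces.items():
        Y = (test_series - test_series.mean(axis=0)) @ A
        scores[label] = Y.var(axis=0).sum()
    return max(scores, key=scores.get)

# Synthetic data: class 0 varies mostly in the first two variables,
# class 1 in the last two; training items may have different lengths.
rng = np.random.default_rng(0)
class0 = [rng.normal(size=(50, 4)) * np.array([5, 5, 1, 1]) for _ in range(3)]
class1 = [rng.normal(size=(60, 4)) * np.array([1, 1, 5, 5]) for _ in range(3)]
subs = {0: class_subspace(class0, 2), 1: class_subspace(class1, 2)}

test = rng.normal(size=(40, 4)) * np.array([5, 5, 1, 1])
print(classify(test, subs))  # → 0
```

Note that the test item is never compared point-by-point with training items, so no DTW-style alignment is needed and items of different lengths pose no difficulty.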
The remainder of this paper is organized as follows. Background
and related work are introduced in Section 2. In Section 3, we
describe the proposed new classification method. The results of
experimental evaluations of the proposed method are presented in
Section 4. In the final section, we give our conclusions and discuss
future work.
2. Background and related work
Due to the high dimensionality of MTS, techniques for dimen-
sionality reduction are very important for time series data mining,
and PCA [15,17,10] is one of the most commonly used methods. In
addition, compared with the traditional methods, CPCA can often
improve the performance of the algorithms used in MTS datasets.
In this section, we introduce both these methods and we review
related work.
2.1. PCA
PCA is used widely for MTS dimensionality reduction. PCA can
transform a MTS $X = \{x_1, x_2, \ldots, x_m\}$ into a PCS
$Y = \{y_1, y_2, \ldots, y_m\}$, where $m$ is the number of
variable-based dimensions and $y_i$ denotes the $i$th principal
component sequence. Moreover, the first principal component
sequence $y_1$ contains most of the information about the original
MTS, $y_2$ contains the second highest amount of information, and
so on. In fact, each principal component sequence $y_i$ is a linear
transformation of the variables in the original MTS, and the
coefficients defined in this transformation are considered as
weight vectors, i.e.,
$$y_i = a_{1i}x_1 + a_{2i}x_2 + \cdots + a_{mi}x_m, \quad i = 1, 2, \ldots, m, \qquad (1)$$
where $a_{ji}$ is the corresponding weight and $x_j$ denotes the
$j$th variable of the MTS. Moreover, the first principal component
sequence $y_1$ has the largest variance, $\lambda_1 = \mathrm{Var}(y_1)$;
the second principal component sequence accounts for the largest
portion of the remaining variance, $\lambda_2 = \mathrm{Var}(y_2)$;
and so on. In this manner, the first $p$ component sequences may
retain most of the variance present in all of the original $m$
variables, where $p < m$. Thus, the dimensionality reduction for a
MTS with $m$ variables can be achieved by projecting it onto the
$p$-dimensional subspace (also called the coordinate space). The
subspace can be constructed from an eigenmatrix of the covariance
matrix $\Sigma$ of $X$. According to the SVD, the covariance matrix
can be decomposed as
$$\Sigma = U \Lambda U^{T}, \qquad (2)$$
where $U$ contains the weights for the principal component
sequences and the matrix $\Lambda$ holds the corresponding
variances, which means that the first column vector of $U$ is the
weight vector of the first principal component sequence and its
variance is the first element of $\Lambda$ along the diagonal.
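The decomposition in Eq. (2) can be checked numerically; the following illustrative Python sketch (using NumPy's symmetric eigensolver and toy data, not code from this paper) decomposes a covariance matrix and orders the variances:

```python
import numpy as np

# Toy MTS: n = 100 observations of m = 4 variables (illustrative only).
X = np.random.default_rng(1).normal(size=(100, 4))
Sigma = np.cov(X, rowvar=False)          # covariance matrix of the variables

# Sigma = U Lambda U^T; eigh is appropriate for a symmetric matrix.
lam, U = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]            # variances in decreasing order
lam, U = lam[order], U[:, order]

print(np.allclose(U @ np.diag(lam) @ U.T, Sigma))   # → True
# U[:, 0] is the weight vector of the first principal component sequence,
# and lam[0] = Var(y_1) is the largest variance.
```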
To reduce the dimensionality of MTS, the first $p$ principal
component sequences are retained, which means that the first $p$
eigenvectors are used to construct the subspace, i.e.,
$A_{m \times p} = U(1{:}m, 1{:}p)$. Thus, the reduced PCS can be
formed by
$$Y_{n \times p} = X_{n \times m} A_{m \times p}, \qquad (3)$$
where $n$ is the length of the MTS, and it is often the case that
$p < m$ and $p < n$. In this manner, we can transform a MTS with a
size of $n \times m$ into a reduced representation with a size of
$n \times p$.
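The reduction in Eq. (3) then amounts to a single matrix product. A minimal Python sketch, with synthetic data chosen so that two of the six variables dominate the variance (illustrative assumptions, not data from this paper):

```python
import numpy as np

n, m, p = 200, 6, 2
rng = np.random.default_rng(2)
X = rng.normal(size=(n, m)) * np.array([10, 8, 1, 1, 1, 1])  # two dominant variables

Xc = X - X.mean(axis=0)
lam, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(lam)[::-1]
A = U[:, order[:p]]                  # A_{m x p}: first p eigenvectors

Y = Xc @ A                           # Y_{n x p} = X_{n x m} A_{m x p}
print(Y.shape)                       # → (200, 2)
retained = lam[order[:p]].sum() / lam.sum()
print(round(retained, 3))            # most of the total variance is kept
```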
In addition to dimensionality reduction and feature represen-
tation using PCA, some distance functions are often used to
measure the similarity between two representations following
the transformation of the MTS by PCA. The angles between all the
combinations of the selected principal components can be used to
measure the similarity [22]. Another approach was proposed by
[23] for modifying previous methods by weighting the angles with
the corresponding variances. Ref. [17] addressed the issue of
similar principal components present in a time series by using
the different values of the variables. Refs. [24,10] proposed Eros
based on the acute angles between the corresponding compo-
nents, which can measure the similarity better and faster than
previous methods. A fast similarity search for MTS using a
projection comparison based on PCA was proposed by Karamito-
poulos et al. [25]. In addition, since PCA is based on SVD, some
methods based on SVD [11,12] have been applied to dimension-
ality reduction and as similarity measures for MTS.
H. Li / Neurocomputing 171 (2016) 744–753 745