Fast Stochastic Ordinal Embedding with
Variance Reduction and Adaptive Step Size
Ke Ma, Jinshan Zeng, Jiechao Xiong, Qianqian Xu, Xiaochun Cao, Wei Liu, Yuan Yao
Abstract—Learning representations from relative similarity comparisons, often called ordinal embedding, has gained increasing attention in recent years. Most existing methods are based on semi-definite programming (SDP), which is generally time-consuming and scales poorly, especially on large-scale data. To overcome this challenge, we propose a stochastic algorithm called SVRG-SBB, which has the following features: i) good scalability, achieved by dropping the positive semi-definite (PSD) constraint and adopting a fast solver, namely the stochastic variance reduced gradient (SVRG) method; and ii) adaptive learning, achieved by introducing a new adaptive step size called the stabilized Barzilai-Borwein (SBB) step size. Theoretically, under some natural assumptions, we show an O(1/T) rate of convergence to a stationary point for the proposed algorithm, where T is the total number of iterations. Under the additional Polyak-Łojasiewicz assumption, we show global linear convergence (i.e., exponentially fast convergence to a global optimum) of the proposed algorithm. Extensive simulations and real-world data experiments are conducted to show the effectiveness of the proposed algorithm in comparison with state-of-the-art methods; notably, it achieves much lower computational cost with good prediction performance.
Index Terms—Ordinal Embedding, Stochastic Variance Reduced Gradient (SVRG), Non-Convex Optimization, Barzilai-Borwein (BB) Step Size.
1 INTRODUCTION
Ordinal embedding aims to learn a representation of data as points in a low-dimensional embedded space, where "low-dimensional" means that the embedding dimension is much smaller than the number of data points. The distances between these points should agree with a given set of relative similarity comparisons. Such comparisons are often collected from participants who are asked to answer questions like:
"Is the similarity between objects i and j larger than the similarity between l and k?"
• K. Ma is with the State Key Laboratory of Information Security (SKLOIS), Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 100093, China, and with the School of Cyber Security, University of Chinese Academy of Sciences, Beijing, 100049, China. E-mail: make@iie.ac.cn.
• J. Zeng is with the School of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi, 330022, China, and with the Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. Jinshan Zeng's work is supported in part by the National Natural Science Foundation (NNSF) of China (No. 61977038, 61603162, 61876074), and part of this work was performed when he was at the Department of Mathematics, The Hong Kong University of Science and Technology. E-mail: jsh.zeng@gmail.com.
• J. Xiong is with the Tencent AI Lab, Shenzhen, Guangdong, China. E-mail: jcxiong@tencent.com.
• Q. Xu is with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China. E-mail: qianqian.xu@vipl.ict.ac.cn, xuqianqian@ict.ac.cn.
• X. Cao is with the State Key Laboratory of Information Security (SKLOIS), Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 100093, China. E-mail: caoxiaochun@iie.ac.cn.
• W. Liu is with the Tencent AI Lab, Shenzhen, Guangdong, China. E-mail: wliu@ee.columbia.edu.
• Y. Yao is with the Department of Mathematics, and by courtesy, the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. E-mail: yuany@ust.hk.
The answers to these questions provide us with a set of quadruplets (i, j, l, k), each indicating that the similarity between objects i and j is larger than the similarity between objects l and k. These relative similarity comparisons serve as the supervision for ordinal embedding. Without prior knowledge, the comparisons may involve all objects, and the number of potential quadruplets is O(n^4). Even under the so-called "local" setting where we restrict l = i, the triplet comparisons (i, j, k) still have complexity O(n^3).
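To make these counts concrete, the following minimal sketch (our own illustration, not the paper's code; the helper name, the use of NumPy, and the Euclidean metric are assumptions) enumerates the "local" triplet comparisons induced by a point set:

import itertools
import numpy as np

def local_triplets(X):
    """Enumerate 'local' triplet comparisons (i, j, k) for the points in X
    (one row per point), keeping each triplet where i is strictly closer
    to j than to k. The candidate set grows as O(n^3)."""
    n = X.shape[0]
    # Pairwise Euclidean distance matrix, shape (n, n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return [(i, j, k)
            for i, j, k in itertools.permutations(range(n), 3)
            if D[i, j] < D[i, k]]

# Even a modest n = 100 yields 100 * 99 * 98 / 2 = 485,100 informative
# triplets, which is why stochastic (sampled) methods become attractive.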
The ordinal embedding problem was first studied in the psychometrics community [1], [2], [3], [4]. In recent years, it has drawn considerable attention in machine learning [5], [6], [7], [8], [9], [10], statistical ranking [11], [12], [13], artificial intelligence [14], [15], information retrieval [16], and computer vision [17], [18], among other areas.
Most ordinal embedding methods are based on semi-definite programming (SDP). Typical examples include Generalized Non-Metric Multidimensional Scaling (GNMDS) [19], Crowd Kernel Learning (CKL) [20], and Stochastic Triplet Embedding (STE/TSTE) [21]. The main idea of these methods is to formulate the ordinal embedding problem as a convex, low-rank SDP problem with respect to the Gram matrix of the embedding points. To solve such an SDP problem, traditional methods generally employ projected gradient descent to satisfy the positive semi-definite constraint, which requires a singular value decomposition (SVD) at every iteration. This inhibits the use of these methods in large-scale and online ordinal embedding applications.
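The costly step can be sketched as follows. This is our own illustrative Python, not code from GNMDS/CKL/STE, and the function name is hypothetical; the projection itself, however, is the standard one used by projected gradient descent on the Gram matrix:

import numpy as np

def project_psd(G):
    """Project a symmetric matrix G onto the PSD cone by zeroing out its
    negative eigenvalues. The full spectral decomposition (an SVD, since
    G is symmetric) costs O(n^3) and is executed at every iteration."""
    G = (G + G.T) / 2            # symmetrize against numerical round-off
    w, V = np.linalg.eigh(G)     # eigenvalues w, eigenvectors V
    w = np.clip(w, 0.0, None)    # drop the negative part of the spectrum
    return (V * w) @ V.T         # reassemble V diag(w) V^T

# One projected-gradient step on the Gram matrix then reads, schematically,
#   G = project_psd(G - eta * grad_loss(G))
# and this per-iteration decomposition is what limits SDP-based methods
# on large-scale data.

Optimizing directly over an embedding matrix X, so that the Gram matrix G = XX^T is positive semi-definite by construction, removes this projection entirely; this is the route taken below.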
To handle the large-scale ordinal embedding problem, we reformulate it in terms of the embedding matrix instead of its Gram matrix. By taking advantage of this new non-convex formulation, the positive