necessary to resolve the non-vanishing variance issue inherent in the vanilla SGD methods. Instead, these modern stochastic² methods dramatically improve upon SGD in various ways, without resorting to the usual variance-reduction techniques (such as decreasing stepsizes or mini-batching), which carry considerable costs that drastically reduce their power; that is, they improve upon SGD without any unwelcome side effects. This development led to a revolution in the area of first-order methods for solving problem (1)+(2). In both theoretical complexity and practical efficiency, these modern methods vastly outperform prior gradient-type methods.
In order to achieve ε-accuracy, that is,

E[P(x^k) − P(x^*)] ≤ ε [P(x^0) − P(x^*)],   (3)

modern stochastic methods such as SAG, SDCA, SVRG and S2GD require only

O((n + κ) log(1/ε))   (4)

units of work, where κ is a condition number associated with F, and one unit of work corresponds to the computation of the gradient of f_i for a random index i, followed by a call to a prox-mapping involving R. More specifically, κ = L/µ, where L is a uniform bound on the Lipschitz constants of the gradients of the functions f_i and µ is the strong convexity constant of P. These quantities will be defined precisely in Section IV.
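As a concrete illustration of these quantities (not taken from the paper), the following sketch estimates L, µ and κ for a hypothetical ℓ2-regularized least-squares problem with f_i(x) = (1/2)(a_i^T x − b_i)² + (λ/2)||x||²; in that case ||a_i||² + λ bounds the Lipschitz constant of ∇f_i and µ ≥ λ, which are standard estimates rather than definitions from this paper.

```python
import numpy as np

# Hypothetical data; the sizes and the regularizer lam are illustrative only.
rng = np.random.default_rng(0)
n, d, lam = 10_000, 100, 1e-4
A = rng.standard_normal((n, d))  # rows a_i^T

# For f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*lam*||x||^2 the gradient of f_i is
# Lipschitz with constant ||a_i||^2 + lam, and P is at least lam-strongly convex.
L = np.max(np.sum(A**2, axis=1)) + lam  # uniform bound on the constants L_i
mu = lam                                # conservative strong convexity estimate
kappa = L / mu
print(f"L ~ {L:.1f}, mu = {mu:g}, kappa = L/mu ~ {kappa:.2e}")
```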
The complexity bound (4) should be contrasted with that of proximal gradient descent (e.g., ISTA), which requires O(nκ log(1/ε)) units of work, or FISTA, which requires O(n√κ log(1/ε)) units of work³. Note that while all these methods enjoy a linear convergence rate, the modern stochastic methods can be many orders of magnitude faster than classical deterministic methods. Indeed, one can have

n + κ ≪ n√κ ≤ nκ.

Based on this, we see that these modern methods always beat (proximal) gradient descent (n + κ ≪ nκ), and also outperform FISTA as long as κ ≤ O(n²). In machine learning, for instance, one usually has κ ≈ n, in which case the improvement is by a factor of √n when compared to FISTA, and by a factor of n over ISTA. For applications where n is massive, these improvements are indeed dramatic.
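To get a feel for the size of this gap, the short computation below (illustrative only; constants hidden in the O(·) notation are ignored) evaluates the three work bounds for a hypothetical problem with n = 10^6 in the regime κ ≈ n:

```python
import numpy as np

# Hypothetical problem size; constants hidden in the O(.) notation are ignored.
n, eps = 10**6, 1e-6
kappa = n  # the regime kappa ~ n, common in machine learning

log_term = np.log(1.0 / eps)
ista = n * kappa * log_term            # proximal gradient descent: O(n*kappa*log(1/eps))
fista = n * np.sqrt(kappa) * log_term  # accelerated variant: O(n*sqrt(kappa)*log(1/eps))
modern = (n + kappa) * log_term        # SAG/SDCA/SVRG/S2GD-type bound: O((n+kappa)*log(1/eps))

print(f"ISTA  : {ista:.2e} units of work")
print(f"FISTA : {fista:.2e} units of work")
print(f"modern: {modern:.2e} units of work")
print(f"speedup over FISTA ~ {fista / modern:.0f}x, over ISTA ~ {ista / modern:.0f}x")
```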
For more information about modern dual and primal methods we refer the reader to the literature on randomized coordinate descent methods [8], [9], [10], [11], [12], [5], [13], [14], [15], [16], [17], [18] and stochastic gradient methods [4], [19], [20], [21], [22], [23], [17], [24], respectively.
² These methods are randomized algorithms. However, the term "stochastic" (somewhat incorrectly) appears in their names for historical reasons, and quite possibly due to their aspiration to improve upon stochastic gradient descent (SGD).

³ However, it should be remarked that the condition number κ in these latter methods is slightly different from that appearing in the bound (4).

C. Linear systems and sketching.

In the case when R ≡ 0, all stationary points (i.e., points satisfying ∇F(x) = 0) are optimal for (1)+(2). In the special case when the functions f_i are convex quadratics of the form f_i(x) = (1/2)(a_i^T x − b_i)², the equation ∇F(x) = 0 reduces to the linear system A^T Ax = A^T b, where A = [a_1, . . . , a_n]^T.
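This reduction is easy to verify numerically: with the vectors a_i stacked as the rows of A, the gradient of F is ∇F(x) = (1/n) A^T(Ax − b), which vanishes exactly at the solutions of A^T Ax = A^T b. A minimal check on synthetic data (the sizes and data below are illustrative, not from the paper):

```python
import numpy as np

# Illustrative sizes and synthetic data (not from the paper).
rng = np.random.default_rng(1)
n, d = 500, 20
A = rng.standard_normal((n, d))   # rows are the vectors a_i
b = rng.standard_normal(n)

# F(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2  =>  grad F(x) = (1/n) * A^T (A x - b)
def grad_F(x):
    return A.T @ (A @ x - b) / n

# Solve the normal equations A^T A x = A^T b and confirm the gradient vanishes there.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print("||grad F(x*)|| =", np.linalg.norm(grad_F(x_star)))  # numerically ~ 0
```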
Recently, there has been considerable interest in designing and
analyzing randomized methods for solving linear systems, also known under the name of sketching methods. Much of this
work was done independently from the developments in (non-
quadratic) optimization, despite the above connection between
optimization and linear systems. A randomized version of the
classical Kaczmarz method was studied in a seminal paper by
Strohmer and Vershynin [25]. Subsequently, the method was
extended and improved upon in several ways [26], [27], [28],
[29]. The randomized Kaczmarz method is equivalent to SGD
with a specific stepsize choice [30], [31]. The first randomized
coordinate descent method for linear systems was analyzed by Leventhal and Lewis [32], and subsequently generalized
in various ways by numerous authors (we refer the reader to
[17] and the references therein). Gower and Richtárik [31]
have recently studied randomized iterative methods for linear
systems in a general sketch and project framework, which
in special cases includes randomized Kaczmarz, randomized
coordinate descent, Gaussian descent, randomized Newton,
their block variants, variants with importance sampling, and
also an infinite array of new specific methods. For approaches
of a combinatorial flavour, specific to diagonally dominant
systems, we refer to the influential work of Spielman and Teng
[33].
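For concreteness, here is a minimal sketch of a randomized Kaczmarz iteration with the row-norm-squared sampling of Strohmer and Vershynin [25]; each step projects the iterate onto the hyperplane a_i^T x = b_i, which can also be read as an SGD step with stepsize 1/||a_i||² on the loss (1/2)(a_i^T x − b_i)², in line with the equivalence mentioned above [30], [31]. The synthetic data and iteration budget are illustrative choices only.

```python
import numpy as np

# Synthetic consistent system (illustrative only): rows of A are the a_i^T.
rng = np.random.default_rng(2)
n, d = 400, 50
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true

row_norms2 = np.sum(A**2, axis=1)
probs = row_norms2 / row_norms2.sum()   # sample row i with probability ~ ||a_i||^2

x = np.zeros(d)
for _ in range(20_000):                 # illustrative iteration budget
    i = rng.choice(n, p=probs)
    a_i = A[i]
    # Project onto {x : a_i^T x = b_i}; equivalently, an SGD step with
    # stepsize 1/||a_i||^2 on the loss 0.5*(a_i^T x - b_i)^2.
    x -= (a_i @ x - b[i]) / row_norms2[i] * a_i

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```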
II. CONTRIBUTIONS
In this paper we equip modern stochastic methods, which already enjoy the fast rate (4), with the ability to process data in mini-batches. None of the primal⁴ modern methods have been analyzed in the mini-batch setting. This paper fills this gap in the literature.
While we have argued above that the modern methods, S2GD included, do not suffer from the "non-vanishing variance" issue that SGD does, and hence do not need mini-batching for that purpose, mini-batching is still useful. In particular, we develop and analyze the complexity of mS2GD (Algorithm 1), a mini-batch proximal variant of semi-stochastic gradient descent (S2GD) [7]. While the S2GD method was analyzed in the R ≡ 0 case only, we develop and analyze our method in the proximal⁵ setting (1). We show that mS2GD enjoys several benefits when compared to previous modern methods. First, it trivially admits a parallel implementation, and hence enjoys a speedup in clock time in an HPC environment. This is critical for applications with massive datasets and is the main motivation and advantage of our method. Second, our results show that in order to attain a specified accuracy ε, mS2GD can get by with fewer gradient evaluations than S2GD. This is formalized in Theorem 2, which predicts more than linear
⁴ By a primal method we refer to an algorithm which operates directly on (1)+(2), without explicitly working with the dual problem. Dual methods have very recently been analyzed in the mini-batch setting; for a review of such methods we refer the reader to the paper describing the QUARTZ method [34] and the references therein.
⁵ Note that the Prox-SVRG method [35] can also handle the composite problem (1).
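To fix ideas about the mini-batch structure discussed in this section, the sketch below shows the generic shape of an SVRG/S2GD-type proximal method with mini-batches: each outer loop computes a full gradient at a reference point, and the inner loop combines it with mini-batch corrections before a prox step. This is only a schematic illustration under hypothetical choices (an ℓ1 regularizer, a fixed stepsize, a fixed inner-loop length, and the helper name ms2gd_like), not a transcription of Algorithm 1 or of the setting analyzed in Theorem 2.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ms2gd_like(A, b, lam=1e-3, stepsize=None, outer=20, inner=None, batch=8, seed=0):
    """Schematic mini-batch SVRG/S2GD-type proximal loop for
    min_x (1/(2n))*||Ax - b||^2 + lam*||x||_1 (illustrative choices only)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    if stepsize is None:
        stepsize = 1.0 / np.max(np.sum(A**2, axis=1))  # rough 1/L heuristic
    if inner is None:
        inner = n                                      # fixed inner-loop length (simplification)
    y = np.zeros(d)
    for _ in range(outer):
        full_grad = A.T @ (A @ y - b) / n              # full gradient at the reference point y
        x = y.copy()
        for _ in range(inner):
            S = rng.choice(n, size=batch, replace=False)
            A_S = A[S]
            # Mini-batch semi-stochastic gradient: the full gradient at y plus a
            # mini-batch correction; it is unbiased and its variance shrinks as x -> y.
            g = full_grad + A_S.T @ (A_S @ (x - y)) / batch
            x = prox_l1(x - stepsize * g, stepsize * lam)
        y = x
    return y
```

A call such as x_hat = ms2gd_like(A, b), for any data matrix A and vector b, then returns an approximate minimizer of the ℓ1-regularized least-squares objective used here purely for illustration.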