necessary to resolve the non-vanishing variance issue inherent in the vanilla SGD methods. Instead, these modern stochastic² methods dramatically improve upon SGD in various ways, without resorting to the usual variance-reduction techniques (such as decreasing stepsizes or mini-batching), which carry considerable costs that drastically reduce their power; that is, they improve upon SGD without any unwelcome side effects. This development led to a revolution in the area of first-order methods for solving problem (1)+(2). In both theoretical complexity and practical efficiency, these modern methods vastly outperform prior gradient-type methods.
In order to achieve ε-accuracy, that is,

E[P(x^k) − P(x^*)] ≤ ε [P(x^0) − P(x^*)],   (3)

modern stochastic methods such as SAG, SDCA, SVRG and S2GD require only

O((n + κ) log(1/ε))   (4)

units of work, where κ is a condition number associated with F, and one unit of work corresponds to the computation of the gradient of f_i for a random index i, followed by a call to a prox-mapping involving R. More specifically, κ = L/µ, where L is a uniform bound on the Lipschitz constants of the gradients of the functions f_i and µ is the strong convexity constant of P. These quantities will be defined precisely in Section IV.
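As a concrete illustration of these quantities (not taken from the paper), the following sketch estimates L, µ and κ for a hypothetical ℓ2-regularized least-squares problem with f_i(x) = (1/2)(a_i^T x − b_i)² + (λ/2)||x||²; in that case ||a_i||² + λ bounds the Lipschitz constant of ∇f_i and µ ≥ λ, which are standard estimates rather than definitions from this paper.

```python
import numpy as np

# Hypothetical data; the sizes and the regularizer lam are illustrative only.
rng = np.random.default_rng(0)
n, d, lam = 10_000, 100, 1e-4
A = rng.standard_normal((n, d))  # rows a_i^T

# For f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*lam*||x||^2 the gradient of f_i is
# Lipschitz with constant ||a_i||^2 + lam, and P is at least lam-strongly convex.
L = np.max(np.sum(A**2, axis=1)) + lam  # uniform bound on the constants L_i
mu = lam                                # conservative strong convexity estimate
kappa = L / mu
print(f"L ~ {L:.1f}, mu = {mu:g}, kappa = L/mu ~ {kappa:.2e}")
```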
The complexity bound (4) should be contrasted with that of proximal gradient descent (e.g., ISTA), which requires O(nκ log(1/ε)) units of work, or FISTA, which requires O(n√κ log(1/ε)) units of work³. Note that while all these methods enjoy a linear convergence rate, the modern stochastic methods can be many orders of magnitude faster than classical deterministic methods. Indeed, one can have

n + κ ≪ n√κ ≤ nκ.

Based on this, we see that these modern methods always beat (proximal) gradient descent (n + κ ≪ nκ), and also outperform FISTA as long as κ ≤ O(n²). In machine learning, for instance, one usually has κ ≈ n, in which case the improvement is by a factor of √n when compared to FISTA, and by a factor of n over ISTA. For applications where n is massive, these improvements are indeed dramatic.
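To get a feel for the size of this gap, the short computation below (illustrative only; constants hidden in the O(·) notation are ignored) evaluates the three work bounds for a hypothetical problem with n = 10^6 in the regime κ ≈ n:

```python
import numpy as np

# Hypothetical problem size; constants hidden in the O(.) notation are ignored.
n, eps = 10**6, 1e-6
kappa = n  # the regime kappa ~ n, common in machine learning

log_term = np.log(1.0 / eps)
ista = n * kappa * log_term            # proximal gradient descent: O(n*kappa*log(1/eps))
fista = n * np.sqrt(kappa) * log_term  # accelerated variant: O(n*sqrt(kappa)*log(1/eps))
modern = (n + kappa) * log_term        # SAG/SDCA/SVRG/S2GD-type bound: O((n+kappa)*log(1/eps))

print(f"ISTA  : {ista:.2e} units of work")
print(f"FISTA : {fista:.2e} units of work")
print(f"modern: {modern:.2e} units of work")
print(f"speedup over FISTA ~ {fista / modern:.0f}x, over ISTA ~ {ista / modern:.0f}x")
```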
For more information about modern dual and primal methods we refer the reader to the literature on randomized coordinate descent methods [8], [9], [10], [11], [12], [5], [13], [14], [15], [16], [17], [18] and stochastic gradient methods [4], [19], [20], [21], [22], [23], [17], [24], respectively.
² These methods are randomized algorithms. However, the term "stochastic" (somewhat incorrectly) appears in their names for historical reasons, and quite possibly due to their aspiration to improve upon stochastic gradient descent (SGD).

³ However, it should be remarked that the condition number κ in these latter methods is slightly different from that appearing in the bound (4).

C. Linear systems and sketching.

In the case when R ≡ 0, all stationary points (i.e., points satisfying ∇F(x) = 0) are optimal for (1)+(2). In the special case when the functions f_i are convex quadratics of the form f_i(x) = (1/2)(a_i^T x − b_i)², the equation ∇F(x) = 0 reduces to the linear system A^T Ax = A^T b, where A = [a_1, . . . , a_n]^T.
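This reduction is easy to verify numerically: with the vectors a_i stacked as the rows of A, the gradient of F is ∇F(x) = (1/n) A^T(Ax − b), which vanishes exactly at the solutions of A^T Ax = A^T b. A minimal check on synthetic data (the sizes and data below are illustrative, not from the paper):

```python
import numpy as np

# Illustrative sizes and synthetic data (not from the paper).
rng = np.random.default_rng(1)
n, d = 500, 20
A = rng.standard_normal((n, d))   # rows are the vectors a_i
b = rng.standard_normal(n)

# F(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2  =>  grad F(x) = (1/n) * A^T (A x - b)
def grad_F(x):
    return A.T @ (A @ x - b) / n

# Solve the normal equations A^T A x = A^T b and confirm the gradient vanishes there.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print("||grad F(x*)|| =", np.linalg.norm(grad_F(x_star)))  # numerically ~ 0
```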
Recently, there has been considerable interest in designing and
analyzing randomized methods for solving linear systems, also known under the name of sketching methods. Much of this
work was done independently from the developments in (non-
quadratic) optimization, despite the above connection between
optimization and linear systems. A randomized version of the
classical Kaczmarz method was studied in a seminal paper by
Strohmer and Vershynin [25]. Subsequently, the method was
extended and improved upon in several ways [26], [27], [28],
[29]. The randomized Kaczmarz method is equivalent to SGD
with a specific stepsize choice [30], [31]. The first randomized
coordinate descent method for linear systems was analyzed by Leventhal and Lewis [32], and subsequently generalized
in various ways by numerous authors (we refer the reader to
[17] and the references therein). Gower and Richtárik [31]
have recently studied randomized iterative methods for linear
systems in a general sketch and project framework, which
in special cases includes randomized Kaczmarz, randomized
coordinate descent, Gaussian descent, randomized Newton,
their block variants, variants with importance sampling, and
also an infinite array of new specific methods. For approaches
of a combinatorial flavour, specific to diagonally dominant
systems, we refer to the influential work of Spielman and Teng
[33].
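For concreteness, here is a minimal sketch of a randomized Kaczmarz iteration with the row-norm-squared sampling of Strohmer and Vershynin [25]; each step projects the iterate onto the hyperplane a_i^T x = b_i, which can also be read as an SGD step with stepsize 1/||a_i||² on the loss (1/2)(a_i^T x − b_i)², in line with the equivalence mentioned above [30], [31]. The synthetic data and iteration budget are illustrative choices only.

```python
import numpy as np

# Synthetic consistent system (illustrative only): rows of A are the a_i^T.
rng = np.random.default_rng(2)
n, d = 400, 50
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true

row_norms2 = np.sum(A**2, axis=1)
probs = row_norms2 / row_norms2.sum()   # sample row i with probability ~ ||a_i||^2

x = np.zeros(d)
for _ in range(20_000):                 # illustrative iteration budget
    i = rng.choice(n, p=probs)
    a_i = A[i]
    # Project onto {x : a_i^T x = b_i}; equivalently, an SGD step with
    # stepsize 1/||a_i||^2 on the loss 0.5*(a_i^T x - b_i)^2.
    x -= (a_i @ x - b[i]) / row_norms2[i] * a_i

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```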
II. CONTRIBUTIONS
In this paper we equip modern stochastic methods, which already enjoy the fast rate (4), with the ability to process data in mini-batches. None of the primal⁴ modern methods have been analyzed in the mini-batch setting. This paper fills this gap in the literature.
While we have argued above that the modern methods, S2GD included, do not suffer from the "non-vanishing variance" issue that SGD does, and hence do not need mini-batching for that purpose, mini-batching is still useful. In particular, we develop and analyze the complexity of mS2GD (Algorithm 1), a mini-batch proximal variant of semi-stochastic gradient descent (S2GD) [7]. While the S2GD method was analyzed in the R ≡ 0 case only, we develop and analyze our method in the proximal⁵ setting (1). We show that mS2GD enjoys several benefits when compared to previous modern methods. First, it trivially admits a parallel implementation, and hence enjoys a speedup in clock time in an HPC environment. This is critical for applications with massive datasets and is the main motivation and advantage of our method. Second, our results show that in order to attain a specified accuracy ε, mS2GD can get by with fewer gradient evaluations than S2GD. This is formalized in Theorem 2, which predicts more than linear
⁴ By a primal method we refer to an algorithm which operates directly on (1)+(2), without explicitly working with the dual problem. Dual methods have very recently been analyzed in the mini-batch setting; for a review of such methods we refer the reader to the paper describing the QUARTZ method [34] and the references therein.
⁵ Note that the Prox-SVRG method [35] can also handle the composite problem (1).
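To fix ideas about the mini-batch structure discussed in this section, the sketch below shows the generic shape of an SVRG/S2GD-type proximal method with mini-batches: each outer loop computes a full gradient at a reference point, and the inner loop combines it with mini-batch corrections before a prox step. This is only a schematic illustration under hypothetical choices (an ℓ1 regularizer, a fixed stepsize, a fixed inner-loop length, and the helper name ms2gd_like), not a transcription of Algorithm 1 or of the setting analyzed in Theorem 2.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ms2gd_like(A, b, lam=1e-3, stepsize=None, outer=20, inner=None, batch=8, seed=0):
    """Schematic mini-batch SVRG/S2GD-type proximal loop for
    min_x (1/(2n))*||Ax - b||^2 + lam*||x||_1 (illustrative choices only)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    if stepsize is None:
        stepsize = 1.0 / np.max(np.sum(A**2, axis=1))  # rough 1/L heuristic
    if inner is None:
        inner = n                                      # fixed inner-loop length (simplification)
    y = np.zeros(d)
    for _ in range(outer):
        full_grad = A.T @ (A @ y - b) / n              # full gradient at the reference point y
        x = y.copy()
        for _ in range(inner):
            S = rng.choice(n, size=batch, replace=False)
            A_S = A[S]
            # Mini-batch semi-stochastic gradient: the full gradient at y plus a
            # mini-batch correction; it is unbiased and its variance shrinks as x -> y.
            g = full_grad + A_S.T @ (A_S @ (x - y)) / batch
            x = prox_l1(x - stepsize * g, stepsize * lam)
        y = x
    return y
```

A call such as x_hat = ms2gd_like(A, b), for any data matrix A and vector b, then returns an approximate minimizer of the ℓ1-regularized least-squares objective used here purely for illustration.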