Optimization for deep learning: theory and algorithms
Ruoyu Sun∗
December 21, 2019
Abstract
When and why can a neural network be successfully trained? This article provides an
overview of optimization algorithms and theory for training neural networks. First, we discuss
the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum,
and then discuss practical solutions including careful initialization and normalization methods.
Second, we review generic optimization methods used in training neural networks, such as SGD,
adaptive gradient methods and distributed methods, and existing theoretical results for these
algorithms. Third, we review existing research on the global issues of neural network training,
including results on bad local minima, mode connectivity, the lottery ticket hypothesis, and
infinite-width analysis.
1 Introduction
A major theme of this article is to understand the practical components for successfully training
neural networks, and the possible factors that cause training to fail. Imagine you were in the
year 1980 trying to solve an image classification problem using neural networks. If you wanted to
train a neural network from scratch, your first few attempts would very likely have failed
to return reasonable results. What are the essential changes to make the algorithm work? At a
high level, you need three things (besides powerful hardware): a proper neural network, a proper
training algorithm, and proper training tricks.
• Proper neural network. This includes the neural architecture and the activation functions. For the
architecture, you may want to replace a fully connected network with a convolutional network
with at least 5 layers and enough neurons. For better performance, you may want to increase
the depth to 20 or even 100, and add skip connections. For activation functions, a good
starting point is ReLU activation, but using tanh or swish activation is also reasonable.
• Training algorithm. A big choice is to use a stochastic version of gradient descent (SGD) and
stick to it. A well-tuned constant step size is good enough, while momentum and an adaptive
step size can provide extra benefits. A minimal code sketch combining these choices with the
architecture above follows this list.
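To make the two bullets above concrete, the following is a minimal PyTorch sketch (not taken from
this article; the layer sizes and hyperparameters such as the 0.1 step size are illustrative
assumptions): a small convolutional network with ReLU activations and skip connections, trained for
one step with SGD plus momentum.

    # Illustrative sketch only: architecture and hyperparameters are assumptions,
    # not recommendations from the article.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Two 3x3 convolutions with an identity skip connection."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            out = F.relu(self.conv1(x))
            out = self.conv2(out)
            return F.relu(out + x)  # skip connection: add the input back in

    class SmallConvNet(nn.Module):
        """A small convolutional classifier with ReLU activations."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.stem = nn.Conv2d(3, 32, kernel_size=3, padding=1)
            self.blocks = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
            self.head = nn.Linear(32, num_classes)

        def forward(self, x):
            x = F.relu(self.stem(x))
            x = self.blocks(x)
            x = F.adaptive_avg_pool2d(x, 1).flatten(1)  # global average pooling
            return self.head(x)

    model = SmallConvNet()
    # SGD with a constant step size and momentum, as described in the bullet above.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    # One training step on a random batch (a stand-in for a real data loader).
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

The skip connection in ResidualBlock is the kind of architectural change that eases gradient flow in
deeper networks, a point revisited below in the discussion of gradient explosion/vanishing.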
∗Department of Industrial and Enterprise Systems Engineering (ISE), and affiliated with the Coordinated Science
Laboratory and the Department of ECE, University of Illinois at Urbana-Champaign, Urbana, IL. Email: ruoyus@illinois.edu.