Optimization for deep learning: theory and algorithms
Ruoyu Sun∗
December 21, 2019
Abstract
When and why can a neural network be successfully trained? This article provides an
overview of optimization algorithms and theory for training neural networks. First, we discuss
the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum,
and then discuss practical solutions including careful initialization and normalization methods.
Second, we review generic optimization methods used in training neural networks, such as SGD,
adaptive gradient methods and distributed methods, and existing theoretical results for these
algorithms. Third, we review existing research on the global issues of neural network training,
including results on bad local minima, mode connectivity, the lottery ticket hypothesis, and
infinite-width analysis.
1 Introduction
A major theme of this article is to understand the practical components for successfully training
neural networks, and the possible factors that cause training to fail. Imagine you were in the
year 1980 trying to solve an image classification problem using neural networks. If you wanted to
train a neural network from scratch, your first few attempts would very likely have failed
to return reasonable results. What are the essential changes to make the algorithm work? At a
high level, you need three things (besides powerful hardware): a proper neural network, a proper
training algorithm, and proper training tricks.
• Proper neural network. This includes the neural architecture and the activation functions. For the
architecture, you may want to replace a fully connected network with a convolutional network
with at least 5 layers and enough neurons. For better performance, you may want to increase
the depth to 20 or even 100, and add skip connections. For activation functions, a good
starting point is ReLU activation, but using tanh or swish activation is also reasonable.
• Training algorithm. A big choice is to use a stochastic version of gradient descent (SGD) and
stick to it. A well-tuned constant step size is good enough, while momentum and an adaptive
step size can provide extra benefits. A minimal code sketch combining these choices with the
architecture above follows this list.
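To make the two bullets above concrete, the following is a minimal PyTorch sketch (not taken from
this article; the layer sizes and hyperparameters such as the 0.1 step size are illustrative
assumptions): a small convolutional network with ReLU activations and skip connections, trained for
one step with SGD plus momentum.

    # Illustrative sketch only: architecture and hyperparameters are assumptions,
    # not recommendations from the article.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Two 3x3 convolutions with an identity skip connection."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            out = F.relu(self.conv1(x))
            out = self.conv2(out)
            return F.relu(out + x)  # skip connection: add the input back in

    class SmallConvNet(nn.Module):
        """A small convolutional classifier with ReLU activations."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.stem = nn.Conv2d(3, 32, kernel_size=3, padding=1)
            self.blocks = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
            self.head = nn.Linear(32, num_classes)

        def forward(self, x):
            x = F.relu(self.stem(x))
            x = self.blocks(x)
            x = F.adaptive_avg_pool2d(x, 1).flatten(1)  # global average pooling
            return self.head(x)

    model = SmallConvNet()
    # SGD with a constant step size and momentum, as described in the bullet above.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    # One training step on a random batch (a stand-in for a real data loader).
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

The skip connection in ResidualBlock is the kind of architectural change that eases gradient flow in
deeper networks, a point revisited below in the discussion of gradient explosion/vanishing.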
∗Department of Industrial and Enterprise Systems Engineering (ISE), and affiliated with the Coordinated Science
Laboratory and the Department of ECE, University of Illinois at Urbana-Champaign, Urbana, IL. Email: ruoyus@illinois.edu.