Published as a conference paper at ICLR 2018
mixup: BEYOND EMPIRICAL RISK MINIMIZATION
Hongyi Zhang
MIT
Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz∗
FAIR
ABSTRACT
Large deep neural networks are powerful, but exhibit undesirable behaviors such
as memorization and sensitivity to adversarial examples. In this work, we propose
mixup, a simple learning principle to alleviate these issues. In essence, mixup trains
a neural network on convex combinations of pairs of examples and their labels.
By doing so, mixup regularizes the neural network to favor simple linear behavior
in-between training examples. Our experiments on the ImageNet-2012, CIFAR-10,
CIFAR-100, Google commands and UCI datasets show that mixup improves the
generalization of state-of-the-art neural network architectures. We also find that
mixup reduces the memorization of corrupt labels, increases the robustness to
adversarial examples, and stabilizes the training of generative adversarial networks.
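In essence, given two labeled examples $(x_i, y_i)$ and $(x_j, y_j)$, mixup constructs the virtual example $(\tilde{x}, \tilde{y}) = (\lambda x_i + (1-\lambda) x_j,\; \lambda y_i + (1-\lambda) y_j)$. The following minimal numpy sketch illustrates this principle; the Beta$(\alpha, \alpha)$ draw follows the paper's formulation, while the function and argument names are illustrative.

import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2):
    # Convexly combine two inputs and their one-hot labels.
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2     # virtual input
    y = lam * y1 + (1.0 - lam) * y2     # virtual (soft) label
    return x, y

Training then proceeds as usual, minimizing the loss on the virtual pair rather than on the raw examples.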
1 INTRODUCTION
Large deep neural networks have enabled breakthroughs in fields such as computer vision (Krizhevsky
et al., 2012), speech recognition (Hinton et al., 2012), and reinforcement learning (Silver et al., 2016).
In most successful applications, these neural networks share two commonalities. First, they are
trained so as to minimize their average error over the training data, a learning rule also known as the
Empirical Risk Minimization (ERM) principle (Vapnik, 1998). Second, the size of these state-of-the-
art neural networks scales linearly with the number of training examples. For instance, the network of
Springenberg et al. (2015) used $10^6$ parameters to model the $5 \cdot 10^4$ images in the CIFAR-10 dataset,
the network of Simonyan & Zisserman (2015) used $10^8$ parameters to model the $10^6$ images in the
ImageNet-2012 dataset, and the network of Chelba et al. (2013) used $2 \cdot 10^{10}$ parameters to model
the $10^9$ words in the One Billion Word dataset.
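For concreteness, the ERM principle can be written in standard notation as

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big),$$

where $\{(x_i, y_i)\}_{i=1}^{n}$ is the training data, $\mathcal{F}$ is the class of functions realizable by the network, and $\ell$ is a loss penalizing the difference between predictions and labels.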
Strikingly, a classical result in learning theory (Vapnik & Chervonenkis, 1971) tells us that the
convergence of ERM is guaranteed as long as the size of the learning machine (e.g., the neural
network) does not increase with the number of training examples. Here, the size of a learning machine is
measured in terms of its number of parameters or, relatedly, its VC-complexity (Harvey et al., 2017).
This contradiction challenges the suitability of ERM to train our current neural network models, as
highlighted in recent research. On the one hand, ERM allows large neural networks to memorize
(instead of generalize from) the training data even in the presence of strong regularization, or in
classification problems where the labels are assigned at random (Zhang et al., 2017). On the other
hand, neural networks trained with ERM change their predictions drastically when evaluated on
examples just outside the training distribution (Szegedy et al., 2014), also known as adversarial
examples. This evidence suggests that ERM is unable to explain or provide generalization on testing
distributions that differ only slightly from the training data. However, what is the alternative to ERM?
The method of choice for training on examples similar to, but different from, the training data is known as data
augmentation (Simard et al., 1998), formalized by the Vicinal Risk Minimization (VRM) principle
(Chapelle et al., 2000). In VRM, human knowledge is required to describe a vicinity or neighborhood
around each example in the training data. Then, additional virtual examples can be drawn from the
vicinity distribution of the training examples to enlarge the support of the training distribution. For
instance, when performing image classification, it is common to define the vicinity of one image
as the set of its horizontal reflections, slight rotations, and mild scalings. While data augmentation
consistently leads to improved generalization (Simard et al., 1998), the procedure is dataset-dependent,
and thus requires the use of expert knowledge. Furthermore, data augmentation assumes that the
examples in the vicinity share the same class, and does not model the vicinity relation across
examples of different classes.
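As a concrete illustration, the following numpy sketch draws one virtual example from such a hand-designed vicinity of an image; the particular transformations and the helper name are illustrative assumptions rather than a prescribed recipe.

import numpy as np

def sample_vicinity(image, label, max_shift=2):
    # Draw a virtual example from a hand-crafted vicinity of (image, label).
    x = image.copy()
    if np.random.rand() < 0.5:
        x = x[:, ::-1]                         # horizontal reflection
    shift = np.random.randint(-max_shift, max_shift + 1)
    x = np.roll(x, shift, axis=1)              # mild (wrap-around) horizontal shift
    return x, label                            # label assumed shared within the vicinity

Note that the label is carried over unchanged: the vicinity is assumed to stay within a single class, which is precisely the assumption discussed above.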
∗Alphabetical order.