A Survey of Self-Supervised Visual Feature Learning with Deep Neural Networks


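To learn better image and video features for computer vision applications, large-scale labeled data are usually required to train deep neural networks. To avoid the high cost of collecting and annotating such data, self-supervised learning, a subset of unsupervised learning, learns general image and video features from large-scale unlabeled data without using any human-annotated labels.

Self-supervised visual feature learning with deep neural networks is an active research direction in computer vision. Conventional deep learning depends on large amounts of labeled data, which is costly in human labor and limits how well models generalize to unlabeled data. Self-supervised learning designs tasks or constraints that let a model learn useful feature representations from unlabeled data, so that on visual tasks it can reach performance comparable to, and sometimes better than, supervised learning.

The core idea of self-supervised learning is to use the structure of the data itself as the supervisory signal, for example different parts of an image or consecutive frames of a video. The methods fall into two broad groups: pretext-task methods and contrastive learning methods. A pretext-task method designs an intermediate task related to the final goal, such as rotation prediction, colorization, or jigsaw puzzles; the features the model learns while solving the task can be transferred to other visual tasks. Contrastive learning methods, such as SimCLR and MoCo, learn features by comparing the similarities and differences between samples.

Network architecture plays a key role in self-supervised learning. Convolutional neural networks (CNNs) perform well at image feature extraction because they are sensitive to local spatial structure, while Transformers, with their strong sequence modeling ability, are widely used in areas such as video understanding. Many researchers try to combine the strengths of both architectures to suit different self-supervised learning settings.

Self-supervised methods are usually evaluated by their performance on downstream tasks such as classification, detection, and segmentation, and by how well the learned features transfer. Commonly used image datasets include ImageNet and COCO; commonly used video datasets include Kinetics and Something-Something. The wide use of these datasets has driven the development of self-supervised methods and provides a fair basis for comparison.

In recent years, self-supervised methods have made significant progress, especially in image feature learning, where many methods fine-tuned after pre-training reach results on ImageNet classification comparable to supervised learning. For video feature learning, self-supervised methods are also becoming competitive, especially on action recognition and temporal analysis tasks.

Future research directions may include:
1. Stronger self-supervised signals: design more complex and more challenging pretext tasks, or explore new contrastive learning strategies, to obtain deeper visual representations.
2. Generality of models: improve the transferability of self-supervised models across different visual tasks.
3. Multimodal fusion: combine audio, text, and other modalities to improve the robustness and understanding ability of self-supervised learning.
4. Real-time performance and efficiency: optimize model architectures and reduce computational requirements so that models suit resource-limited devices and real-time applications.
5. Adversarial robustness and privacy protection: learn features from unlabeled data while strengthening resistance to adversarial attacks and protecting user privacy.

Self-supervised learning offers an effective way to reduce deep learning's dependence on labeled data, and as research progresses it is expected to play a larger role in visual feature learning and computer vision applications.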


Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

Longlong Jing and Yingli Tian∗, Fellow, IEEE

Abstract: Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Quantitative performance comparisons of the reviewed methods on benchmark datasets are then summarized and discussed for both image and video feature learning. Finally, the paper concludes and lists a set of promising future directions for self-supervised visual feature learning.

Index Terms: Self-supervised Learning, Unsupervised Learning, Convolutional Neural Network, Transfer Learning, Deep Learning.

1 INTRODUCTION

1.1 Motivation

Due to their powerful ability to learn different levels of general visual features, deep neural networks have been used as the basic structure for many computer vision applications such as object detection [1], [2], [3], semantic segmentation [4], [5], [6], image captioning [7], etc. Models trained on large-scale image datasets like ImageNet are widely used as pre-trained models and fine-tuned for other tasks for two main reasons: (1) the parameters learned from large-scale diverse datasets provide a good starting point, so networks trained on other tasks can converge faster; (2) a network trained on large-scale datasets has already learned hierarchical features, which helps reduce over-fitting during the training of other tasks, especially when the datasets of other tasks are small or training labels are scarce.

The performance of deep convolutional neural networks (ConvNets) greatly depends on their capacity and the amount of training data. Different kinds of network architectures have been developed to increase the capacity of network models, and larger and larger datasets have been collected. Various networks including AlexNet [8], VGG [9], GoogLeNet [10], ResNet [11], and DenseNet [12], and large-scale datasets such as ImageNet [13] and OpenImage [14], have been proposed to train very deep ConvNets. With sophisticated architectures and large-scale datasets, the performance of ConvNets keeps advancing the state of the art on many computer vision tasks [1], [4], [7], [15], [16].

• L. Jing is with the Department of Computer Science, The Graduate Center, The City University of New York, NY, 10016. E-mail: ljing@gradcenter.cuny.edu
• Y. Tian is with the Department of Electrical Engineering, The City College, and the Department of Computer Science, The Graduate Center, The City University of New York, NY, 10031. E-mail: ytian@ccny.cuny.edu
∗ Corresponding author
This material is based upon work supported by the National Science Foundation under award number IIS-1400802.

However, collecting and annotating large-scale datasets is time-consuming and expensive. As one of the most widely used datasets for pre-training very deep 2D convolutional neural networks (2DConvNets), ImageNet [13] contains about 1.3 million labeled images covering 1,000 classes, and each image is labeled by human workers with one class label. Compared to image datasets, video datasets are more expensive to collect and annotate because of the temporal dimension. The Kinetics dataset [17], which is mainly used to train ConvNets for video human action recognition, consists of 500,000 videos belonging to 600 categories, and each video lasts around 10 seconds. It took many Amazon Mechanical Turk workers a lot of time to collect and annotate a dataset at such a large scale.

To avoid time-consuming and expensive data annotation, many self-supervised methods have been proposed to learn visual features from large-scale unlabeled images or videos without using any human annotations. A popular solution is to propose pretext tasks for networks to solve: the networks are trained by optimizing the objective functions of the pretext tasks, and the features are learned through this process. Various pretext tasks have been proposed for self-supervised learning, including colorizing grayscale images [18], image inpainting [19], image jigsaw puzzle [20], etc. The pretext tasks share two common properties: (1) visual features of images or videos need to be captured by ConvNets to solve the pretext tasks; (2) pseudo labels for the pretext task can be automatically generated based on the attributes of images or videos.
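As an illustration of property (2), the following is a minimal sketch of how a pseudo label could be generated for a jigsaw-style pretext task. It is not the method of [20]: the function name `make_jigsaw_sample`, the grid size, and the use of nested Python lists as an image are all assumptions made for the example.

```python
import random


def make_jigsaw_sample(image, grid=2, rng=None):
    """Split a square image into grid x grid tiles and shuffle them.

    Returns (shuffled_tiles, order), where shuffled_tiles[k] is the
    original tile with index order[k]. The permutation `order` is the
    pseudo label: it is derived from the image alone, with no human input.
    """
    rng = rng or random.Random()
    size = len(image)
    if size % grid != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square and divisible by grid")
    tile = size // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            rows = image[r * tile:(r + 1) * tile]
            tiles.append([row[c * tile:(c + 1) * tile] for row in rows])
    order = list(range(len(tiles)))
    rng.shuffle(order)
    return [tiles[i] for i in order], order


# Hypothetical 4x4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled_tiles, pseudo_label = make_jigsaw_sample(image, grid=2)
print(pseudo_label)
```

A network trained to predict `pseudo_label` from `shuffled_tiles` has to use the spatial content of the tiles, which is property (1).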

arXiv:1902.06162v1 [cs.CV] 16 Feb 2019

[Figure 1: Self-supervised pretext task training (unlabeled dataset → ConvNet → pretext task) and supervised downstream task training (labeled dataset → ConvNet → downstream task), linked by knowledge transfer.]

Fig. 1. The general pipeline of self-supervised learning. The visual features are learned by training ConvNets to solve a pre-defined pretext task. After self-supervised pretext task training is finished, the learned parameters serve as a pre-trained model and are transferred to other downstream computer vision tasks by fine-tuning. The performance on these downstream tasks is used to evaluate the quality of the learned features. During the knowledge transfer for downstream tasks, the general features from only the first several layers are usually transferred to downstream tasks.

The general pipeline of self-supervised learning is shown in Fig. 1. During the self-supervised training phase, a pre-defined pretext task is designed for ConvNets to solve, and the pseudo labels for the pretext task are automatically generated based on some attributes of the data. Then the ConvNet is trained to optimize the objective function of the pretext task. After the self-supervised training is finished, the learned visual features can be transferred to downstream tasks (especially when only a relatively small amount of data is available) as pre-trained models to improve performance and overcome over-fitting. Generally, shallow layers capture general low-level features like edges, corners, and textures, while deeper layers capture task-related high-level features. Therefore, visual features from only the first several layers are transferred during the supervised downstream task training phase.
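The transfer step can be sketched as follows. This is a simplified illustration, assuming a model is represented as an ordered dictionary from layer name to a list of parameters; the layer names and the helper `transfer_first_layers` are hypothetical and do not correspond to any particular framework.

```python
def transfer_first_layers(pretrained, target, num_layers):
    """Return a copy of `target` whose first `num_layers` layers take
    their parameters from `pretrained`; deeper layers keep the target's
    own (e.g. freshly initialized) parameters.
    """
    merged = {name: list(params) for name, params in target.items()}
    for name in list(pretrained)[:num_layers]:
        if name not in merged:
            raise KeyError(f"layer {name!r} missing in target model")
        merged[name] = list(pretrained[name])
    return merged


# Parameters after pretext-task training (values are made up).
pretrained = {"conv1": [0.1, 0.2], "conv2": [0.3], "conv3": [0.4], "fc": [0.5]}
# Freshly initialized downstream model with the same layer names.
fresh = {"conv1": [0.0, 0.0], "conv2": [0.0], "conv3": [0.0], "fc": [0.0]}

# Transfer only the shallow, general-purpose layers.
model = transfer_first_layers(pretrained, fresh, num_layers=2)
print(model)
```

The downstream model would then be fine-tuned on the labeled dataset, with `conv3` and `fc` learned from scratch.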

1.2 Term Definition

To make this survey easy to read, we first define the terms used in the remaining sections.

• Human-annotated label: Human-annotated labels refer to labels of data that are manually annotated by human workers.
• Pseudo label: Pseudo labels are automatically generated labels based on data attributes for pretext tasks.
• Pretext Task: Pretext tasks are pre-designed tasks for networks to solve, and visual features are learned by learning objective functions of pretext tasks.
• Downstream Task: Downstream tasks are computer vision applications that are used to evaluate the quality of features learned by self-supervised learning. These applications can greatly benefit from the pre-trained models when training data are scarce. In general, human-annotated labels are needed to solve the downstream tasks. However, in some applications, the downstream task can be the same as the pretext task without using any human-annotated labels.
• Supervised Learning: Supervised learning indicates learning methods using data with fine-grained human-annotated labels to train networks.
• Semi-supervised Learning: Semi-supervised learning refers to learning methods using a small amount of labeled data in conjunction with a large amount of unlabeled data.
• Weakly-supervised Learning: Weakly supervised learning refers to learning methods to learn with coarse-grained labels or inaccurate labels. The cost of obtaining weak supervision labels is generally much cheaper than fine-grained labels for supervised methods.
• Unsupervised Learning: Unsupervised learning refers to learning methods without using any human-annotated labels.
• Self-supervised Learning: Self-supervised learning is a subset of unsupervised learning methods. Self-supervised learning refers to learning methods in which ConvNets are explicitly trained with automatically generated labels. This review only focuses on self-supervised learning methods for visual feature learning with ConvNets in which the features can be transferred to multiple different computer vision tasks.

Since no human annotations are needed to generate pseudo labels during self-supervised training, very large-scale datasets can be used for self-supervised training. Trained with these pseudo labels, self-supervised methods have achieved promising results, and the performance gap with supervised methods on downstream tasks has narrowed. This paper provides a comprehensive survey of deep ConvNets-based self-supervised visual feature learning methods. The key contributions of this paper are as follows:

• To the best of our knowledge, this is the first comprehensive survey about self-supervised visual feature learning with deep ConvNets, which will be helpful for researchers in this field.
• An in-depth review of recently developed self-supervised learning methods and datasets.
• Quantitative performance analysis and comparison of the existing methods are provided.
• A set of possible future directions for self-supervised learning is pointed out.

2 FORMULATION OF DIFFERENT LEARNING SCHEMAS

Based on the training labels, visual feature learning methods can be grouped into the following four categories: supervised, semi-supervised, weakly supervised, and unsupervised. In this section, the four types of learning methods are compared and key terminologies are defined.

2.1 Supervised Learning Formulation

For supervised learning, given a dataset $X$, for each data $X_i$ in $X$, there is a corresponding human-annotated label $Y_i$. For a set of $N$ labeled training data $D = \{X_i\}_{i=0}^{N}$, the training loss function is defined as:

$$loss(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} loss(X_i, Y_i). \quad (1)$$

Trained with accurate human-annotated labels, supervised learning methods obtained breakthrough results on different computer vision applications [1], [4], [8], [16]. However, data collection and annotation are usually expensive and may require special skills. Therefore, semi-supervised, weakly supervised, and unsupervised learning methods were proposed to reduce the cost.

2.2 Semi-Supervised Learning Formulation

For semi-supervised visual feature learning, given a small labeled dataset $X$ and a large unlabeled dataset $Z$, for each data $X_i$ in $X$, there is a corresponding human-annotated label $Y_i$. For a set of $N$ labeled training data $D_1 = \{X_i\}_{i=0}^{N}$ and $M$ unlabeled training data $D_2 = \{Z_i\}_{i=0}^{M}$, the training loss function is defined as:

$$loss(D_1, D_2) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} loss(X_i, Y_i) + \frac{1}{M} \sum_{i=1}^{M} loss(Z_i, R(Z_i, X)), \quad (2)$$

where $R(Z_i, X)$ is a task-specific function that represents the relation between each unlabeled training data $Z_i$ and the labeled dataset $X$.

2.3 Weakly Supervised Learning Formulation

For weakly supervised visual feature learning, given a dataset $X$, for each data $X_i$ in $X$, there is a corresponding coarse-grained label $C_i$. For a set of $N$ training data $D = \{X_i\}_{i=0}^{N}$, the training loss function is defined as:

$$loss(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} loss(X_i, C_i). \quad (3)$$

Since the cost of weak supervision is much lower than that of the fine-grained labels used by supervised methods, large-scale datasets are relatively easier to obtain. Recently, several papers proposed to learn image features from web-collected images using hashtags as category labels [21], [22], and obtained very good performance [21].

2.4 Unsupervised Learning Formulation

Unsupervised learning refers to learning methods that do not need any human-annotated labels. This type of method includes fully unsupervised learning methods, which do not need any labels at all, as well as self-supervised learning methods, in which networks are explicitly trained with automatically generated pseudo labels without involving any human annotation.

2.4.1 Self-supervised Learning

Recently, many self-supervised learning methods for visual feature learning have been developed without using any human-annotated labels [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. Some papers refer to this type of learning method as unsupervised learning [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48]. Compared to supervised learning methods, which require a data pair $X_i$ and $Y_i$ where $Y_i$ is annotated by human annotators, self-supervised learning is also trained with data $X_i$ along with its pseudo label $P_i$, where $P_i$ is automatically generated for a pre-defined pretext task without involving any human annotation. The pseudo label $P_i$ can be generated by using attributes of images or videos such as the context of images [18], [19], [20], [36], or by traditional hand-designed methods [49], [50], [51].

Given a set of $N$ training data $D = \{P_i\}_{i=0}^{N}$, the training loss function is defined as:

$$loss(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} loss(X_i, P_i). \quad (4)$$
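The following sketch evaluates a loss of the form in Eq. (4), read here as the average of per-sample losses over data paired with pseudo labels. The pretext task used, predicting which multiple of 90 degrees an image was rotated by, is an illustrative choice that is not taken from this section; `uniform_model` is a hypothetical stand-in for a ConvNet, and all function names are assumptions.

```python
import math


def rotate90(image, k):
    """Rotate a 2D list clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def make_rotation_dataset(images):
    """Build (X_i, P_i) pairs: each image in 4 orientations, where the
    pseudo label P_i is the number of 90-degree turns applied."""
    return [(rotate90(img, k), k) for img in images for k in range(4)]


def cross_entropy(probs, label):
    """Per-sample loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def empirical_loss(model, dataset):
    """Average per-sample loss: (1/N) * sum_i loss(model(X_i), P_i)."""
    return sum(cross_entropy(model(x), p) for x, p in dataset) / len(dataset)


def uniform_model(image):
    # Placeholder for a ConvNet: equal probability for all 4 rotations.
    return [0.25, 0.25, 0.25, 0.25]


images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
dataset = make_rotation_dataset(images)
loss = empirical_loss(uniform_model, dataset)
print(len(dataset), loss)
```

Training would minimize `empirical_loss` over the model parameters; an untrained model that guesses uniformly sits at a loss of ln 4, which is the baseline a useful feature extractor has to beat.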

As long as the pseudo labels $P$ are automatically generated without involving human annotations, the methods belong to self-supervised learning. Recently, self-supervised learning methods have achieved great progress. This paper focuses on the self-supervised learning methods that are mainly designed for visual feature learning, where the features can be transferred to multiple visual tasks and used to perform new tasks by learning from limited labeled data. This paper summarizes these self-supervised feature learning methods from different perspectives, including network architectures, commonly used pretext tasks, datasets, and applications.

3 COMMON DEEP NETWORK ARCHITECTURES

Regardless of the category of learning method, similar network architectures are used. This section reviews common architectures for learning both image and video features.

3.1 Architectures for Learning Image Features

Various 2DConvNets have been designed for image feature learning. Here, five milestone architectures for image feature learning, including AlexNet [8], VGG [9], GoogLeNet [10], ResNet [11], and DenseNet [12], are reviewed.

3.1.1 AlexNet

AlexNet obtained a big improvement in image classification performance on the ImageNet dataset compared to the previous state-of-the-art methods [8]. With the support of powerful GPUs, AlexNet, which has 62.4 million parameters, was trained on ImageNet with 1.3 million images. As shown in Fig. 2, the architecture of AlexNet has 8 layers, of which 5 are convolutional layers and 3 are fully connected layers. ReLU is applied after each convolutional layer. 94% of the network parameters come from the fully connected layers. With this many parameters, the network can easily over-fit. Therefore, several techniques are applied to avoid over-fitting, including data augmentation, dropout, and normalization.
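The parameter figures above can be checked with a short calculation. The layer shapes below are an assumption: they follow the commonly cited single-tower AlexNet configuration (no grouped convolutions, a 256 × 6 × 6 feature map entering the first fully connected layer) and are not given in the text. Biases are counted.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a square-kernel convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# (kernel size, input channels, output channels); assumed shapes.
conv_layers = [(11, 3, 96), (5, 96, 256), (3, 256, 384),
               (3, 384, 384), (3, 384, 256)]
# (input units, output units); assumed shapes.
fc_layers = [(256 * 6 * 6, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(*layer) for layer in conv_layers)
fc_total = sum(fc_params(*layer) for layer in fc_layers)
total = conv_total + fc_total
fc_fraction = fc_total / total

print(f"total parameters: {total:,}")
print(f"fully connected share: {fc_fraction:.1%}")
```

Under these assumed shapes the count comes to 62,378,344 parameters, of which about 94% sit in the three fully connected layers, consistent with the figures quoted above.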

[Figure 2: AlexNet architecture diagram; recoverable channel labels: 256, 384, 384, 384, 4096, 4096, 1000.]

Fig. 2. The architecture of AlexNet [8]. The numbers indicate the number of channels of each feature map. Figure is reproduced based on AlexNet [8].

3.1.2 VGG

VGG was proposed by Simonyan and Zisserman and took first place in the localization task and second place in the classification task of the ILSVRC 2014 competition [9]. Simonyan and Zisserman proposed networks of various depths, of which the 16-layer VGG is the most widely used due to its moderate model size and superior performance. The architecture of VGG-16 is shown in Fig. 3. It has 13 convolutional layers belonging to five convolution blocks, followed by 3 fully connected layers. The main difference between VGG and AlexNet is that AlexNet has a large convolution stride and large kernel size, while all the convolution kernels in VGG have the same small size (3×3) and small convolution stride (1×1). A large kernel size leads to too many parameters and a large model size, while a large convolution stride may cause the network to miss some fine features in the lower layers. The smaller kernel size makes the training of very deep convolutional neural networks feasible while still preserving fine-grained information in the network.
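The trade-off between kernel size, stride, and parameter count can be made concrete with a small calculation. The sketch below uses the standard output-size and receptive-field formulas for convolutions; the comparison of two stacked 3×3 layers against one 5×5 layer is the usual argument for small kernels and is not spelled out in the text. The channel count of 64 and the input sizes are illustrative assumptions.

```python
def conv_weights(kernel, channels):
    """Weights of a conv layer with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels


def receptive_field(kernels):
    """Receptive field of stacked stride-1 convolutions."""
    field = 1
    for kernel in kernels:
        field += kernel - 1
    return field


def output_size(n, kernel, stride, padding=0):
    """Spatial output size of a convolution on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


# Two stacked 3x3 layers see the same 5x5 region as one 5x5 layer,
# but with fewer weights.
print(receptive_field([3, 3]), 2 * conv_weights(3, 64), conv_weights(5, 64))

# A 3x3 kernel with stride 1 and padding 1 keeps full resolution,
# while an 11x11 kernel with stride 4 shrinks the map about fourfold.
print(output_size(224, kernel=3, stride=1, padding=1))
print(output_size(227, kernel=11, stride=4))
```

With 64 channels, the stacked 3×3 pair uses 73,728 weights against 102,400 for the single 5×5 layer, and the stride-1 layer preserves every spatial position that the stride-4 layer skips.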

[Figure 3: VGG architecture diagram. Legend: convolution + ReLU, max pooling, fully connected + ReLU, softmax; recoverable channel labels: 128, 256, 512, 4096, 1000.]

Fig. 3. The architecture of VGG [9]. Figure is reproduced based on VGG [9].

3.1.3 ResNet

VGG demonstrated that deeper networks can obtain better performance. However, deeper networks are more difficult to train due to two problems: gradient vanishing and gradient explosion. ResNet was proposed by He et al. to use skip connections in convolution blocks, sending the previous feature map to the next convolution block to overcome gradient vanishing and gradient explosion [11]. The details of the skip connection are shown in Fig. 4. With the skip connection, training of very deep neural networks on GPUs becomes feasible.

[Figure 4: Residual block diagram. The input x passes through weight layers to give F(x); an identity shortcut adds x to give F(x) + x, followed by ReLU.]

Fig. 4. The architecture of the residual block [11]. The identity mapping can effectively reduce gradient vanishing and explosion, which makes the training of very deep networks feasible. Figure is reproduced based on ResNet [11].

In ResNet [11], He et al. also evaluated networks with different depths for image classification. Due to its smaller model size and superior performance, ResNet is often used as the base network for other computer vision tasks. The convolution blocks with skip connections are also widely used as basic building blocks.
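The computation in a residual block, output = ReLU(F(x) + x), can be sketched in a few lines. This is a simplification for illustration: the weight layers here are small fully connected layers on a vector rather than the convolutions used in ResNet, and the function names are assumptions.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, value) for value in vector]


def linear(weights, vector):
    """Matrix-vector product; stands in for a weight layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x) with F(x) = W2 * relu(W1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0]
zeros = [[0, 0], [0, 0]]
identity = [[1, 0], [0, 1]]

# With zero weights F(x) = 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))
# With identity weights F(x) = x, so the block outputs 2x.
print(residual_block(x, identity, identity))
```

The zero-weight case shows why the shortcut helps: even if the weight layers contribute nothing, the input (and, during training, its gradient) still reaches the next block through the identity path.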

3.1.4 GoogLeNet

GoogLeNet, a 22-layer deep network, was proposed by Szegedy et al. and won the ILSVRC-2014 challenge with a top-5 test accuracy of 93.3% [10]. Compared to previous work that aimed to build deeper networks, Szegedy et al. explored building a wider network in which each layer has multiple parallel convolution layers. The basic block of GoogLeNet is the inception block, which consists of 4 parallel convolution layers with different kernel sizes, together with 1×1 convolutions for dimension reduction. The architecture of the inception block of GoogLeNet is shown in Fig. 5. With a carefully crafted design, they increased the depth and width of the network while keeping the computational cost constant.
