A SURVEY ON CONTRASTIVE SELF-SUPERVISED LEARNING
Ashish Jaiswal
The University of Texas at Arlington
Arlington, TX 76019
ashish.jaiswal@mavs.uta.edu
Ashwin Ramesh Babu
The University of Texas at Arlington
Arlington, TX 76019
ashwin.rameshbabu@mavs.uta.edu
Mohammad Zaki Zadeh
The University of Texas at Arlington
Arlington, TX 76019
mohammad.zakizadehgharie@mavs.uta.edu
Debapriya Banerjee
The University of Texas at Arlington
Arlington, TX 76019
debapriya.banerjee2@mavs.uta.edu
Fillia Makedon
The University of Texas at Arlington
Arlington, TX 76019
makedon@uta.edu
November 3, 2020
ABSTRACT
Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It is capable of adopting self-defined pseudo labels as supervision and using the learned representations for several downstream tasks. Specifically, contrastive learning has recently become a dominant component in self-supervised learning methods for computer vision, natural language processing (NLP), and other domains. It aims at embedding augmented versions of the same sample close to each other while trying to push away embeddings from different samples. This paper provides an extensive review of self-supervised methods that follow the contrastive approach. The work explains commonly used pretext tasks in a contrastive learning setup, followed by different architectures that have been proposed so far. Next, we present a performance comparison of different methods for multiple downstream tasks such as image classification, object detection, and action recognition. Finally, we conclude with the limitations of the current methods and the need for further techniques and future directions to make substantial progress.
Keywords: contrastive learning · self-supervised learning · discriminative learning · image/video classification · object detection · unsupervised learning · transfer learning
1 Introduction
The advancements in deep learning have elevated it to become one of the core components in most intelligent systems in
existence. The ability to learn rich patterns from the abundance of data available today has made deep neural networks
(DNNs) a compelling approach in the majority of computer vision (CV) tasks such as image classification, object
detection, image segmentation, activity recognition as well as natural language processing (NLP) tasks such as sentence
classification, language models, machine translation, etc. However, the supervised approach to learning features from
labeled data has almost reached its saturation due to intense labor required in manually annotating millions of data
samples. This is because most of the modern computer vision systems (that are supervised) try to learn some form of
image representations by finding a pattern between the data points and their respective annotations in large datasets.
Works such as GRAD-CAM [1] have proposed techniques that provide visual explanations for decisions made by a model to make them more transparent and explainable.
arXiv:2011.00362v1 [cs.CV] 31 Oct 2020
A PREPRINT - NOVEMBER 3, 2020
Traditional supervised learning approaches heavily rely on the amount of annotated training data available. Even
though there’s a plethora of data available out there, the lack of annotations has pushed researchers to find alternative
approaches that can leverage them. This is where self-supervised methods play a vital role in fueling the progress of deep learning without the need for expensive annotations, learning feature representations where the data itself provides supervision.
Figure 1: Basic intuition behind the contrastive learning paradigm: pull the original and augmented images closer while pushing the original and negative images apart
Supervised learning not only depends on expensive annotations but also suffers from issues such as generalization
error, spurious correlations, and adversarial attacks [2]. Recently, self-supervised learning methods have integrated
both generative and contrastive approaches that have been able to utilize unlabeled data to learn the underlying
representations. A popular approach has been to propose various pretext tasks that help in learning features using
pseudo-labels. Tasks such as image inpainting, colorizing grayscale images, jigsaw puzzles, super-resolution, video frame prediction, and audio-visual correspondence have proven to be effective for learning good representations.
Figure 2: Contrastive learning pipeline for self-supervised training
Generative models gained popularity after the introduction of Generative Adversarial Networks (GANs) [3] in 2014. The work later became the foundation for many successful architectures such as CycleGAN [4], StyleGAN [5], PixelRNN [6], Text2Image [7], DiscoGAN [8], etc. These methods inspired more researchers to switch to training deep learning models with unlabeled data in a self-supervised setup. Despite their success, researchers started realizing some of the complications in GAN-based approaches. They are harder to train for two main reasons: (a) non-convergence, where the model parameters oscillate and rarely converge, and (b) a discriminator that becomes so successful that the generator fails to create realistic fakes, at which point learning cannot continue. Also, proper synchronization is required between the generator and the discriminator to keep the discriminator from converging too early and the generator from diverging.
Figure 3: Top-1 classification accuracy of different contrastive learning methods against baseline supervised method on
ImageNet
Unlike generative models, contrastive learning (CL) is a discriminative approach that aims at grouping similar samples closer and diverse samples far from each other, as shown in figure 1. To achieve this, a similarity metric is used to measure how close two embeddings are. Especially for computer vision tasks, a contrastive loss is evaluated based on the feature representations of the images extracted from an encoder network. For instance, one sample from the training dataset is taken, and a transformed version of the sample is retrieved by applying appropriate data augmentation techniques. During training, referring to figure 2, the augmented version of the original sample is considered a positive sample, and the rest of the samples in the batch/dataset (depending on the method being used) are considered negative samples. Next, the model is trained in a way that it learns to differentiate positive samples from the negative ones. The differentiation is achieved with the help of some pretext task (explained in section 2). In doing so, the model learns quality representations of the samples, which are later used for transferring knowledge to downstream tasks. This idea is advocated by an interesting experiment conducted by Epstein [9] in 2016, where he asked his students to draw a dollar bill with and without looking at the bill. The results from the experiment show that the brain does not require complete information of a visual piece to differentiate one object from another. Instead, only a rough representation of an image is enough to do so.
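As a concrete illustration of the contrastive loss described above, the following is a minimal NumPy sketch of a normalized temperature-scaled cross-entropy (NT-Xent) objective of the kind popularized by SimCLR [15]. The function name, the NumPy formulation, and the default temperature are illustrative assumptions, not taken from this survey:

```python
import numpy as np

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss sketch: one positive pair per anchor.

    z_i, z_j: (N, d) arrays holding embeddings of two augmented views
    of the same N samples; all other in-batch samples act as negatives.
    """
    n = z_i.shape[0]
    z = np.concatenate([z_i, z_j], axis=0)             # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / temperature                        # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    # View i pairs with view i + N, and vice versa.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

In practice such a loss is computed over encoder outputs in a deep learning framework, usually with a numerically stabilized log-sum-exp; the sketch omits that for brevity.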
Most of the earlier works in this area combined some form of instance-level classification approach [10][11][12] with contrastive learning and were successful to some extent. However, recent methods such as SwAV [13], MoCo [14], and
SimCLR [15] with modified approaches have produced results comparable to the state-of-the-art supervised method on the ImageNet [16] dataset, as shown in figure 3. Similarly, PIRL [17], Selfie [18], and [19] are some papers that reflect the effectiveness of the pretext tasks being used and how they boost the performance of their models.
2 Pretext Tasks
Pretext tasks are self-supervised tasks that act as an important strategy to learn representations of the data using pseudo
labels. These pseudo labels are generated automatically based on the attributes found in the data. The learned model
from the pretext task can be used for any downstream tasks such as classification, segmentation, detection, etc. in
computer vision. Furthermore, these tasks can be applied to any kind of data such as image, video, speech, signals,
and so on. For a pretext task in contrastive learning, the original image acts as an anchor, its augmented (transformed)
version acts as a positive sample, and the rest of the images in the batch or in the training data act as negative samples.
Most of the commonly used pretext tasks are divided into four main categories: color transformation, geometric
transformation, context-based tasks, and cross-modal based tasks. These pretext tasks have been used in various
scenarios based on the problem intended to be solved.
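The anchor/positive/negative assignment described above can be made concrete with a small helper that marks which pairs in a batch are positives. The function name `contrastive_pair_mask` and the two-views-per-sample batch layout are assumptions made for illustration:

```python
import numpy as np

def contrastive_pair_mask(batch_size):
    """For a batch where each of `batch_size` anchors yields two augmented
    views (stored at indices i and i + batch_size), return a boolean
    (2N, 2N) mask marking positive pairs. Every other off-diagonal entry
    is treated as a negative, mirroring the in-batch pretext setup.
    """
    n = 2 * batch_size
    mask = np.zeros((n, n), dtype=bool)
    idx = np.arange(batch_size)
    mask[idx, idx + batch_size] = True   # view 1 -> view 2 of same sample
    mask[idx + batch_size, idx] = True   # view 2 -> view 1 of same sample
    return mask
```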
2.1 Color Transformation
Figure 4: Color transformation as a pretext task [15]. (a) Original (b) Gaussian noise (c) Gaussian blur (d) Color distortion (jitter)
Color transformation involves basic adjustments of color levels in an image such as blurring, color distortions, converting
to grayscale, etc. Figure 4 represents an example of color transformation applied on a sample image from the ImageNet
dataset [15]. During this pretext task, the network learns to recognize similar images invariant to their colors.
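A minimal sketch of such color-level augmentations, assuming float images in [0, 1] with shape H×W×3; the helper names and jitter strengths are illustrative and not the exact transforms of [15]:

```python
import numpy as np

rng = np.random.default_rng(0)

def color_jitter(img, brightness=0.4, contrast=0.4):
    """Randomly scale brightness and contrast of an HxWx3 float image."""
    b = 1.0 + rng.uniform(-brightness, brightness)
    c = 1.0 + rng.uniform(-contrast, contrast)
    mean = img.mean(axis=(0, 1), keepdims=True)
    out = (img - mean) * c + mean    # contrast around per-channel mean
    out = out * b                    # brightness scaling
    return np.clip(out, 0.0, 1.0)

def to_grayscale(img):
    """Grayscale via ITU-R 601 luma weights, kept 3-channel so the
    encoder input shape is unchanged."""
    luma = img @ np.array([0.299, 0.587, 0.114])
    return np.repeat(luma[..., None], 3, axis=2)
```

During pre-training, one of these transforms would be sampled per image to produce the positive view.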
2.2 Geometric Transformation
A geometric transformation is a spatial transformation where the geometry of the image is modified without altering
its actual pixel information. The transformations include scaling, random cropping, flipping (horizontally, vertically),
etc. as represented in figure 5 through which global-to-local view prediction is achieved. Here the original image is
considered as the global view and the transformed version as the local view. Chen et al. [15] performed such transformations to learn features during the pretext task.
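These geometric augmentations can be sketched as follows; the crop size, flip probability, and function name are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_flip(img, crop=16):
    """Geometric pretext augmentation sketch: take a random crop x crop
    patch (a local view) and apply a random horizontal flip; pixel
    values themselves are untouched."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]       # horizontal flip
    return patch
```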