
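The paper "Learning Transferable Visual Models From Natural Language Supervision" is a deep learning research paper on learning transferable visual models through natural language supervision. Computer vision models today usually need large amounts of labeled image data for training, which is slow and expensive. The paper may propose a new method that reduces the reliance on manually annotated data and uses abundant natural language resources to improve a model's generalization and transfer ability. Its core may revolve around the following key points:

1. Natural language supervision: using large-scale unannotated text data (such as text on the web) as the supervision signal to train a model to understand semantic information. Compared with traditional image annotation, this approach can draw on a broader and cheaper data source.
2. Transferable visual models: models that can apply knowledge learned on one task or dataset to other, different tasks or datasets. This is useful for new problems and for small datasets because it reduces the dependence on new data.
3. Zero-shot and few-shot learning: the paper may propose a method that lets a model learn effectively with no or only a few task-specific samples. This is achieved by learning general features from natural language, which overcomes the need of traditional machine learning models for large amounts of labeled data.
4. Cross-modal learning: the paper may involve a cross-modal framework that combines images and text so that the model can understand and relate visual and linguistic information. This helps the model understand the objects, scenes, and context in an image while also handling the associated text descriptions.
5. Transformer architecture: as a cornerstone of modern deep learning, the Transformer may serve as the base architecture in the paper, because it performs well on sequence data (such as natural language) and captures long-range dependencies effectively.
6. Experiments and evaluation: the paper may describe a series of experiments, including benchmarks and comparisons, to validate the proposed method. These experiments may cover several computer vision tasks, such as image classification, object detection, and semantic segmentation.
7. Application scenarios: beyond its theoretical contribution, the work may discuss the potential of these techniques in practical applications such as intelligent assistants, self-driving cars, and image search engines.
8. Challenges and future work: although the method brings many advantages, some challenges may remain, such as model complexity, compute requirements, and accurate matching of linguistic and visual information. The paper may raise these issues and discuss possible solutions or directions for future research.

By reading the paper, distributed here as "2103.00020.pdf", readers can gain a deeper understanding of how natural language supervision is used to build strong visual models and of how the method advances computer vision. For students working on a graduation project, it is a valuable reference that can help them explore new research directions and methods and improve model performance and practicality.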


Learning Transferable Visual Models From Natural Language Supervision

Alec Radford*, Jong Wook Kim*, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

1. Introduction and Motivating Work

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets removing the need for specialized output heads or dataset specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data.

* Equal contribution. OpenAI, San Francisco, CA 94110, USA. Correspondence to: <{alec, jongwook}@openai.com>.

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.

Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

arXiv:2103.00020v1 [cs.CV] 26 Feb 2021

[Figure 1 (diagram, three panels): (1) Contrastive pre-training: an Image Encoder and a Text Encoder embed a batch of images and captions (example caption: "Pepper the aussie pup") and every image embedding is scored against every text embedding. (2) Create dataset classifier from label text: class names (plane, car, dog, ..., bird) are inserted into the prompt "A photo of a {object}." and passed through the Text Encoder. (3) Use for zero-shot prediction: the Image Encoder output is scored against the class text embeddings, here selecting "A photo of a dog."]

Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.
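Panels (2) and (3) of Figure 1 can be illustrated with a minimal sketch. It assumes the image and text embeddings have already been produced by some pair of encoders; the toy vectors below are made-up stand-ins, not real CLIP features, and the helper names `build_prompts` and `zero_shot_predict` are hypothetical.

```python
import numpy as np

def build_prompts(class_names, template="A photo of a {}."):
    """Turn bare class names into caption-like prompts."""
    return [template.format(name) for name in class_names]

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the index of the most similar class prompt and all similarities."""
    image_embedding = l2_normalize(image_embedding)
    class_text_embeddings = l2_normalize(class_text_embeddings)
    similarities = class_text_embeddings @ image_embedding  # shape [num_classes]
    return int(np.argmax(similarities)), similarities

# Toy stand-ins for encoder outputs (hypothetical values for illustration only).
class_names = ["plane", "car", "dog", "bird"]
prompts = build_prompts(class_names)
text_embeddings = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0],
                            [1.0, 1.0, 0.0]])
image_embedding = np.array([0.1, 0.2, 0.9])

predicted_index, scores = zero_shot_predict(image_embedding, text_embeddings)
print(prompts[predicted_index])  # A photo of a dog.
```

The text embeddings act as the rows of a linear classifier, so switching to a different dataset only requires embedding a new list of class names.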

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.

[Figure 2 (plot): Zero-Shot ImageNet Accuracy versus # of images processed (2M to 400M) for three models: Bag of Words Contrastive (CLIP), Bag of Words Prediction, and Transformer Language Model, with the gaps between curves annotated as 3X efficiency and 4X efficiency.]

Figure 2. CLIP is much more efficient at zero-shot transfer than our image caption baseline. Although highly expressive, we found that transformer-based language models are relatively weak at zero-shot ImageNet classification. Here, we see that it learns 3x slower than a baseline which predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for the contrastive objective of CLIP further improves efficiency another 4x.

2. Approach

2.1. Natural Language Supervision

At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.

We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal. All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).

Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer. In the following subsections, we detail the specific approach we settled on.

2.2. Creating a Sufficiently Large Dataset

Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716_113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrank by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries.¹ We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.

¹ The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams
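The per-query cap used for approximate class balancing can be sketched as follows. The paper does not say which pairs are kept once a query exceeds the cap, nor how a pair matching several queries is handled, so the first-come rule and the `(query, image_id, text)` record layout here are assumptions made for illustration, and `class_balance` is a hypothetical helper name.

```python
from collections import defaultdict

def class_balance(pairs, max_per_query=20_000):
    """Keep at most `max_per_query` (image, text) pairs for each query.

    `pairs` is an iterable of (query, image_id, text) tuples; this record
    layout and the keep-the-first-N rule are assumptions of this sketch.
    """
    kept_per_query = defaultdict(int)
    balanced = []
    for query, image_id, text in pairs:
        if kept_per_query[query] < max_per_query:
            kept_per_query[query] += 1
            balanced.append((query, image_id, text))
    return balanced

# Tiny example with a cap of 2 instead of 20,000.
toy_pairs = [
    ("dog", "img1", "my dog at the park"),
    ("dog", "img2", "a dog asleep"),
    ("dog", "img3", "dog on a couch"),
    ("teapot", "img4", "an antique teapot"),
]
balanced = class_balance(toy_pairs, max_per_query=2)
print([image_id for _, image_id, _ in balanced])  # ['img1', 'img2', 'img4']
```

Capping frequent queries keeps very common concepts from dominating the dataset while leaving rare ones untouched.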

2.3. Selecting an Efficient Pre-Training Method

State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.

Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.

(Footnote 1, continued) with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume. Finally all WordNet synsets not already in the query list are added.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).
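The symmetric cross entropy objective over the N × N similarity matrix can be made concrete with a small NumPy sketch. It is not the authors' implementation: it starts from feature matrices that are assumed to be already projected into the shared embedding space, and the names `symmetric_contrastive_loss` and `log_scale` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_features, text_features, log_scale):
    """Row i of each [N, d] input is assumed to come from the same real pair."""
    image_emb = l2_normalize(image_features)
    text_emb = l2_normalize(text_features)
    logits = image_emb @ text_emb.T * np.exp(log_scale)  # [N, N], entry (i, j) = image i vs text j
    diag = np.arange(logits.shape[0])                    # the N real pairs sit on the diagonal
    loss_images = -log_softmax(logits, axis=1)[diag, diag].mean()  # each image picks its text
    loss_texts = -log_softmax(logits, axis=0)[diag, diag].mean()   # each text picks its image
    return (loss_images + loss_texts) / 2

# Toy check: correctly paired features give a lower loss than mismatched ones.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))
log_scale = np.log(1 / 0.07)
aligned = symmetric_contrastive_loss(features, features, log_scale)
shuffled = symmetric_contrastive_loss(features, features[::-1], log_scale)
print(aligned < shuffled)  # True
```

Each row-wise softmax treats the other N − 1 texts in the batch as negatives for an image, and each column-wise softmax does the same for a text, which is why the two terms are averaged.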

Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020), which uniformly samples a single sentence from the text, since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence. We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, τ, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
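A log-parameterized multiplicative scalar can be sketched in a few lines. The initial temperature of 0.07 and the cap of 100 on the logit multiplier are the values reported in the training section (2.5); applying the cap to the log-parameter itself, and the names `clip_log_scale` and `scale_logits`, are assumptions of this sketch.

```python
import numpy as np

INITIAL_TEMPERATURE = 0.07  # initial value reported in the training section
MAX_LOGIT_SCALE = 100.0     # cap on the multiplier reported in the training section

# The trainable quantity is the log of the multiplier, so exp() keeps it positive.
log_scale = np.log(1.0 / INITIAL_TEMPERATURE)

def clip_log_scale(log_scale):
    """Clamp the parameter so logits are never scaled by more than the cap."""
    return np.minimum(log_scale, np.log(MAX_LOGIT_SCALE))

def scale_logits(cosine_similarities, log_scale):
    """Multiply cosine similarities by the (clipped) learned scale."""
    return cosine_similarities * np.exp(clip_log_scale(log_scale))

print(round(float(np.exp(log_scale)), 2))  # 14.29, the multiplier at initialization
```

Because cosine similarities lie in [-1, 1], the multiplier is what lets the softmax become sharply peaked; learning its logarithm keeps the multiplier positive without a constraint.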

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2

Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP.

2.4. Choosing and Scaling a Model

We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNet-D improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.
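The attention pooling described for the ResNet encoder can be sketched in NumPy as a single multi-head attention layer whose only query is derived from the average-pooled feature. The text does not give details such as positional embeddings, biases, or whether the pooled token is also used as a key, so this sketch omits them; the weight matrices and the `attention_pool` name are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feature_map, w_q, w_k, w_v, w_out, num_heads):
    """Pool an [h, w, c] feature map into one vector with multi-head attention.

    The channel count c must be divisible by num_heads.
    """
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)        # one token per spatial position
    pooled = tokens.mean(axis=0, keepdims=True)   # global average-pooled token, [1, c]
    head_dim = c // num_heads
    # Project, then split channels into heads: [num_heads, tokens, head_dim].
    q = (pooled @ w_q).reshape(1, num_heads, head_dim).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(h * w, num_heads, head_dim).transpose(1, 0, 2)
    v = (tokens @ w_v).reshape(h * w, num_heads, head_dim).transpose(1, 0, 2)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # [heads, 1, h*w]
    attended = (weights @ v).transpose(1, 0, 2).reshape(1, c)        # merge heads back
    return (attended @ w_out)[0]

rng = np.random.default_rng(0)
channels, out_dim, heads = 64, 32, 8
w_q, w_k, w_v = (rng.normal(size=(channels, channels)) * 0.1 for _ in range(3))
w_out = rng.normal(size=(channels, out_dim)) * 0.1
feature_map = rng.normal(size=(7, 7, channels))
pooled_vector = attention_pool(feature_map, w_q, w_k, w_v, w_out, heads)
print(pooled_vector.shape)  # (32,)
```

Unlike plain average pooling, which weights every spatial position equally, the attention weights let the pooled query emphasize the positions most relevant to it.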

The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.
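The input and output conventions of the text encoder can be illustrated without the transformer itself. In this sketch the token ids for [SOS], [EOS], and padding are made-up values, and counting the two marker tokens toward the cap of 76 is an assumption; `bracket_and_pad`, `causal_mask`, and `eos_features` are hypothetical helper names.

```python
import numpy as np

SOS_ID, EOS_ID, PAD_ID = 1, 2, 0  # hypothetical token ids chosen for this example
MAX_LENGTH = 76                   # sequence length cap stated in the text

def bracket_and_pad(token_ids, max_length=MAX_LENGTH):
    """Wrap BPE token ids in [SOS]/[EOS], truncating and padding to a fixed length."""
    body = list(token_ids)[: max_length - 2]  # leave room for the two markers
    sequence = [SOS_ID] + body + [EOS_ID]
    return np.array(sequence + [PAD_ID] * (max_length - len(sequence)))

def causal_mask(length):
    """Masked self-attention: position i may attend only to positions <= i."""
    return np.tril(np.ones((length, length), dtype=bool))

def eos_features(top_layer_activations, sequence):
    """Pick the top-layer activation at the [EOS] position as the text feature."""
    eos_position = int(np.argmax(sequence == EOS_ID))
    return top_layer_activations[eos_position]

sequence = bracket_and_pad([11, 12, 13])           # three made-up BPE ids
activations = np.arange(76 * 4).reshape(76, 4)     # stand-in for transformer outputs
print(sequence[:6].tolist())                       # [1, 11, 12, 13, 2, 0]
print(eos_features(activations, sequence).tolist())  # [16, 17, 18, 19]
```

With a causal mask, [EOS] is the one position that can attend to every real token in the caption, which makes it a natural place to read out a summary of the whole text.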

While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms allocating it to only one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP’s performance to be less sensitive to the capacity of the text encoder.
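One way to read "allocating additional compute equally" is sketched below. It assumes the convolutional-network approximation from Tan & Le (2019) that compute grows linearly with depth and quadratically with width and resolution, and it interprets "equally" as giving each dimension the same share of the compute multiplier. Both the interpretation and the `equal_compute_scaling` name are assumptions, not the authors' stated procedure.

```python
def equal_compute_scaling(compute_multiplier):
    """Split a compute multiplier equally across depth, width, and resolution.

    Assumes compute ~ depth * width**2 * resolution**2, so each dimension is
    scaled until it alone accounts for the cube root of the multiplier.
    """
    per_dimension = compute_multiplier ** (1.0 / 3.0)  # compute share of each dimension
    return {
        "depth": per_dimension,              # compute is linear in depth
        "width": per_dimension ** 0.5,       # compute is quadratic in width
        "resolution": per_dimension ** 0.5,  # compute is quadratic in resolution
    }

for multiplier in (4, 16, 64):
    factors = equal_compute_scaling(multiplier)
    print(multiplier, {name: round(value, 2) for name, value in factors.items()})
```

Under this reading a 4x compute budget scales depth by about 1.59 and width and resolution by about 1.26 each; the text encoder would then be widened by the same width factor and left at its original depth.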

2.5. Training

We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model which we found to perform best.
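Two of the optimizer details above, the cosine learning-rate schedule and the exclusion of gains and biases from weight decay, can be sketched in plain Python. The base learning rate, the step counts, and the name-based parameter filter are illustrative assumptions; this section does not state them, and any warmup phase is left out.

```python
import math

def cosine_learning_rate(step, total_steps, base_lr):
    """Learning rate decayed from base_lr to zero along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def uses_weight_decay(parameter_name):
    """Decay weights only, skipping gains and biases (a name-based heuristic)."""
    return not (parameter_name.endswith("bias") or parameter_name.endswith("gain"))

base_lr = 1.0  # placeholder value so the decay fraction is easy to read
for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_learning_rate(step, 1000, base_lr), 3))

# Hypothetical parameter names, split into decayed and non-decayed groups.
names = ["conv1.weight", "conv1.bias", "norm1.gain", "proj.weight"]
print([name for name in names if uses_weight_decay(name)])  # ['conv1.weight', 'proj.weight']
```

The schedule starts at the base rate, passes through half of it at the midpoint, and reaches zero at the final step.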
