Exploring the Limits of
Weakly Supervised Pretraining
Dhruv Mahajan Ross Girshick Vignesh Ramanathan Kaiming He
Manohar Paluri Yixuan Li Ashwin Bharambe Laurens van der Maaten
Facebook
Abstract. State-of-the-art visual perception models for a wide range
of tasks rely on supervised pretraining. ImageNet classification is the de
facto pretraining task for these models. Yet, ImageNet is now nearly ten
years old and is by modern standards “small”. Even so, relatively little is
known about the behavior of pretraining with datasets that are multiple
orders of magnitude larger. The reasons are obvious: such datasets are
difficult to collect and annotate. In this paper, we present a unique study
of transfer learning with large convolutional networks trained to predict
hashtags on billions of social media images. Our experiments demon-
strate that training for large-scale hashtag prediction leads to excellent
results. We show improvements on several image classification and object
detection tasks, and report the highest ImageNet-1k single-crop, top-1
accuracy to date: 85.4% (97.6% top-5). We also perform extensive ex-
periments that provide novel empirical data on the relationship between
large-scale pretraining and transfer learning performance.
1 Introduction
Nearly all state-of-the-art visual perception algorithms rely on the same formula:
(1) pretrain a convolutional network on a large, manually annotated image clas-
sification dataset and (2) finetune the network on a smaller, task-specific dataset.
This formula [1,2,3] has been in wide use for several years and led to impres-
sive improvements on numerous tasks. Examples include: object detection [1,4],
semantic segmentation [5,6], human pose estimation [7,8], video recognition [9],
monocular depth estimation [10], and so on. In fact, it is so effective that it
would now be considered foolhardy not to use supervised pretraining.
The ImageNet dataset [11] is the de facto pretraining dataset. While there are
studies analyzing the effects of various ImageNet pretraining factors on transfer
learning (e.g., [12,13]) or the use of different datasets that are of the same size
magnitude as ImageNet (e.g., [14,15]), relatively little is known about pretraining
on datasets that are multiple orders of magnitude larger ([16,17] are the largest
studies to date). The reasons for this are numerous: few such datasets exist,
building new datasets is labor intensive, and large computational resources are
needed to conduct experiments. Yet, given the central role of pretraining it is
important to expand our scientific knowledge in this domain.
This paper tries to address this complex issue by studying an unexplored data
regime: billions of images “labeled” in the wild with social media hashtags. This
data source has the advantage of being large and continuously growing, as well
as “free” from an annotation perspective since no manual labeling is required.
However, the data source also has potential disadvantages: hashtags may be
too noisy to serve as an effective supervisory signal and the image distribution
might be biased in ways that harm transfer learning. It is not a priori obvious
that training on this data will yield good transfer learning results.
The main result of this paper is that without manual dataset curation or so-
phisticated data cleaning, models trained on billions of Instagram images using
thousands of distinct hashtags as labels exhibit excellent transfer learning per-
formance. For example, we observe improvements over the state-of-the-art for
image classification and object detection, where we obtain a single-crop, top-1
accuracy of 85.4% on the ImageNet-1k image-classification dataset and 45.2%
AP on the COCO object-detection dataset [18], compared to 79.8% and 43.7%,
respectively, when training (or pretraining) the same models on ImageNet-1k.
Our primary goal, however, is to contribute novel experimental data about this
previously unexplored regime. To that end, we conduct numerous experiments
that reveal interesting trends. For example, we find that “hashtag engineering”
(i.e., collecting images tagged with a specific subset of hashtags) is a promising
new direction for improving transfer learning results, that training on large-scale
hashtag data is unexpectedly robust to label noise, and that the features learned
allow a simple linear classifier to achieve state-of-the-art ImageNet-1k top-1 ac-
curacy of 83.6% without any finetuning (compared to 84.2% with finetuning).
2 Scaling up Supervised Pretraining
In our experiments, we train standard convolutional network architectures to
predict hashtags on up to 3.5 billion public Instagram images. To make training
at this scale practical, we adopt a distributed synchronous implementation of
stochastic gradient descent with large (8k image) minibatches, following Goyal
et al. [19]. We experiment on a variety of datasets, which we describe next.
2.1 Instagram Datasets
We use a simple data collection pipeline: (1) We select a set of hashtags. (2) We
download images that are tagged with at least one of these hashtags. (3) Then,
because multiple hashtags may refer to the same underlying concept, we apply
a simple process that utilizes WordNet [20] synsets to merge some hashtags into
a single canonical form (e.g., #brownbear and #ursusarctos are merged). (4)
Finally, for each downloaded image, we replace each hashtag with its canonical
form and discard any hashtags that were not in the selected set. The canonical
hashtags are used as labels for training and evaluation.
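To make step (3) concrete, here is a minimal sketch of synonym-based merging using NLTK's WordNet interface. The paper's actual selection and merging rules are in its supplemental material, so the segmentation of hashtags into WordNet lemma form (e.g., #brownbear → "brown_bear") and the choice of canonical lemma below are our assumptions.

```python
# Minimal sketch of hashtag merging via WordNet synsets (step 3 above).
# Requires NLTK with the WordNet corpus installed: nltk.download("wordnet").
# Assumption: hashtags are already segmented into WordNet lemma form; the
# paper's real merging rules are in its supplemental material.
from nltk.corpus import wordnet as wn

def canonical_hashtag(term: str) -> str:
    """Map a segmented hashtag to a canonical form shared by its synonyms."""
    synsets = wn.synsets(term, pos=wn.NOUN)
    if not synsets:
        return term.replace("_", "")  # unknown terms pass through unchanged
    # Use the first lemma of the most frequent noun synset as the canonical name.
    return synsets[0].lemmas()[0].name().replace("_", "").lower()

# #brownbear and #ursusarctos share the "brown bear" synset, so both map to
# the same canonical hashtag:
print(canonical_hashtag("brown_bear"))    # brownbear
print(canonical_hashtag("ursus_arctos"))  # brownbear
```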
By varying the selected hashtags and the number of images to sample, we
can construct a variety of datasets of different sizes and visual distributions.
Table 1 summarizes the datasets used in our experiments. Each dataset is named
by completing a template, role-source-I-L, that indicates its role (training,
validation, testing), source (IG for Instagram, IN for ImageNet, etc.), number
of images I, and number of labels L. We use approximate image and label counts
for convenience; for example, "train-IG-940M-1.5k" is an Instagram dataset for
training with ∼940e6 images and ∼1,500 labels. We omit the role and image
count when it is clear from context or not useful to present.

Name template          Description
train-IG-I-1.5k        Instagram training set of I images and ∼1.5k hashtags from ImageNet-1k.
train-IG-I-8.5k        Instagram training set of I images and ∼8.5k hashtags from WordNet.
train-IG-I-17k         Instagram training set of I images and ∼17k hashtags from WordNet.
train-IN-1M-1k         The standard ImageNet-1k ILSVRC training set with 1.28M images.
val-IN-50k-1k          The standard ImageNet-1k ILSVRC validation set with 50k images.
train-IN-I-L           Extended ImageNet training set of I images and L ∈ {5k, 9k} labels.
val-IN-I-L             Extended ImageNet validation set of I images and L ∈ {5k, 9k} labels.
train-CUB-6k-200       The Caltech-UCSD Birds-200-2011 training set.
val-CUB-6k-200         The Caltech-UCSD Birds-200-2011 validation set.
train-Places-1.8M-365  The Places365-Standard training set (high-resolution version).
val-Places-37k-365     The Places365-Standard validation set (high-resolution version).
train-COCO-135k-80     The standard COCO detection training set (2017 version).
val-COCO-5k-80         The standard COCO detection validation set (2017 version).
test-COCO-20k-80       The standard COCO detection test-dev set (2017 version).

Table 1: Summary of image classification datasets. Each dataset is named with a
template, role-source-I-L, that indicates its role (training, validation, testing),
source, number of images I, and number of labels L.
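The template is regular enough to parse mechanically; the small helper below is our illustration, not code from the paper.

```python
from typing import NamedTuple

class DatasetName(NamedTuple):
    role: str    # "train", "val", or "test"
    source: str  # "IG", "IN", "CUB", "Places", or "COCO"
    images: str  # approximate image count, e.g. "940M"
    labels: str  # approximate label count, e.g. "1.5k"

def parse_dataset_name(name: str) -> DatasetName:
    """Split a role-source-I-L name into its four template fields."""
    return DatasetName(*name.split("-"))

print(parse_dataset_name("train-IG-940M-1.5k"))
# DatasetName(role='train', source='IG', images='940M', labels='1.5k')
```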
We design three hashtag sets for the Instagram data: (1) A ∼1.5k set with
hashtags from the standard 1,000 IN-1k synsets (each synset contains at least
one synonym, hence there are more hashtags than synsets). (2) A ∼17k set with
hashtags that are synonyms in any of the noun synsets in WordNet. And (3)
an ∼8.5k set with the most frequent hashtags from the 17k set. The hashtag set
sizes are measured after merging the hashtags into their canonical forms. We
hypothesize that the first set has a visual distribution similar to IN-1k, while
the other two represent more general visual distributions covering fine-grained
visual categories. Details of how these hashtags are selected and how the merging
process works are given in supplemental material.
Image deduplication. When performing transfer learning, it is essential to
understand and properly address overlap between training and test sets. Overlap
can exist because images may come from the same underlying sources (e.g.,
Wikipedia, Flickr, Google). For instance, ∼5% of the images in the val-CUB-6k-200
set [21] also appear in train-IN-1M-1k, and 1.78% of images in the val-IN-50k-1k
set are in the JFT-300M training set [17]. To address this issue, we performed the
following deduplication procedure: we compute R-MAC features [22,23] for all
candidate images using a ResNet-50 model, and use these features to find the k =
21 nearest neighbors for each of the images in our test sets (additional details are
in the supplemental material). Subsequently, we manually inspected all images
and their nearest neighbors to identify duplicates. This procedure uncovered
150 val-IN-50k-1k (0.30%), 10 val-CUB-6k-200 (0.17%), 151 val-Places-37k-365
(0.41%), and 6 val-COCO-5k-80 (0.12%) duplicates. In our results, we report
the observed accuracy of our models; in the supplemental material, we report
a conservative lower bound on accuracy by marking all duplicates as incorrect.
Given the small percentage of duplicates, they do not impact our findings.
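For readers who want to run a similar check on their own data, the sketch below embeds images and retrieves each test image's nearest training neighbors for manual review. We substitute global average-pooled ResNet-50 features and scikit-learn's k-NN for the R-MAC descriptors used in the paper, so this approximates rather than reproduces the procedure.

```python
# Approximate sketch of the deduplication check: embed images, then pull each
# test image's k = 21 nearest training neighbors for manual inspection. The
# paper uses R-MAC descriptors [22,23]; we substitute global average-pooled
# ResNet-50 features and scikit-learn's brute-force k-NN for brevity.
import torch
import torchvision.models as models
from sklearn.neighbors import NearestNeighbors

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()  # expose the 2048-d pooled features
model.eval()

@torch.no_grad()
def embed(batch: torch.Tensor):
    """(N, 3, 224, 224) normalized images -> (N, 2048) L2-normalized features."""
    return torch.nn.functional.normalize(model(batch), dim=1).numpy()

# Stand-in tensors; in practice these are the candidate training images and
# the test-set images to deduplicate.
train_feats = embed(torch.randn(256, 3, 224, 224))
test_feats = embed(torch.randn(8, 3, 224, 224))

knn = NearestNeighbors(n_neighbors=21).fit(train_feats)  # k = 21, as above
dists, idxs = knn.kneighbors(test_feats)  # candidates for manual duplicate review
```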
Discussion. Our datasets have two nice properties: public visibility and sim-
plicity. By using publicly accessible images, the data used in our experiments is
visible to everyone. To see what it looks like, the images are browsable by hashtag
at https://www.instagram.com/explore/tags/ followed by a specific hashtag;
for example https://www.instagram.com/explore/tags/brownbear shows im-
ages tagged with #brownbear. Our data is also taken from the “wild”, essentially
as-is, with minimal effort to sanitize it. This makes the dataset construction pro-
cess particularly simple and transparent.
We contrast these properties with the JFT-300M dataset [17], which is not
publicly visible and is the result of a proprietary collection process (“The [JFT-
300M] images are labeled using an algorithm that uses a complex mixture of raw
web signals, connections between web-pages and user feedback.”). Additional
details describing the collection of JFT-300M have not been publicly disclosed.
Despite our efforts to make the dataset content and collection process trans-
parent, we acknowledge that, similar to JFT-300M, it is not possible for other
research groups to know exactly which images we used nor to download them
en masse. Hence it is not possible for others to replicate our results at this time.
However, we believe that it is better if we undertake this study and share the
results with the community than to not publish the results.
2.2 ImageNet Datasets
In addition to the standard IN-1k dataset, we experiment with larger subsets
of the full ImageNet 2011 release that contains 14.2M images and 22k labels.
We construct training and validation sets that include 5k and 9k labels. For the
5k set, we use the now standard IN-5k proposed in [15] (6.6M training images).
For the 9k label set, we follow the same protocol used to construct IN-5k, which
involves taking the next most frequent 4k labels and all of the associated images
(10.5M training images). In all cases, we use 50 images per class for validation.
2.3 Models
We use residual networks with grouped convolutional layers, called ResNeXt
[15]. Our experiments use ResNeXt-101 32×Cd, which has 101 layers, 32 groups,
and group widths C of: 4 (8B multiply-add FLOPs, 43M parameters), 8 (16B,
88M), 16 (36B, 193M), 32 (87B, 466M), and 48 (153B, 829M). Our implemen-
tation matches [19]. We believe our results will generalize to other architectures
[24,25,26].
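Of these capacity points, the 32×8d variant ships with torchvision (the other group widths would need a custom ResNeXt constructor); a quick check of its parameter count against the figure quoted above:

```python
import torchvision.models as models

# ResNeXt-101 32x8d: 101 layers, cardinality 32, group width 8.
model = models.resnext101_32x8d()  # random init; pass weights=... for pretrained
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~88.8M, matching the 88M above
```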
Loss function. In contrast to ImageNet, our Instagram datasets may contain
multiple labels per image (because a user specified multiple hashtags). The aver-
age number of hashtags per image varies depending on the dataset; for instance,
train-IG-1B-17k contains ∼2 hashtags per image. Our model computes probabil-
ities over all hashtags in the vocabulary using a softmax activation and is trained
to minimize the cross-entropy between the predicted softmax distribution and
the target distribution of each image. The target is a vector with k non-zero
entries each set to 1/k corresponding to the k ≥ 1 hashtags for the image.
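A minimal sketch of this loss in PyTorch; this is our restatement of the target construction just described, not the authors' code (their implementation follows [19]).

```python
import torch
import torch.nn.functional as F

def hashtag_cross_entropy(logits: torch.Tensor,
                          hashtag_ids: list[list[int]]) -> torch.Tensor:
    """Cross-entropy against a uniform 1/k target over each image's k hashtags.

    logits: (N, V) scores over the hashtag vocabulary.
    hashtag_ids: one list of hashtag indices per image (k >= 1 each).
    """
    n, v = logits.shape
    target = torch.zeros(n, v, device=logits.device)
    for i, ids in enumerate(hashtag_ids):
        target[i, ids] = 1.0 / len(ids)  # k non-zero entries, each set to 1/k
    log_probs = F.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()

# Example: 2 images over a 5-hashtag vocabulary; image 0 has one hashtag,
# image 1 has two.
logits = torch.randn(2, 5)
loss = hashtag_cross_entropy(logits, [[3], [0, 2]])
```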
We have also experimented with per-hashtag sigmoid outputs and binary
logistic loss, but obtained significantly worse results. While counter-intuitive
given the multi-label data, these findings match similar observations in [16].
The successful application of sigmoid activations and logistic loss may require
sophisticated label completion techniques [17] and more hyper-parameter search.
2.4 Pretraining Details
Our models are trained by synchronous stochastic gradient descent (SGD) on 336
GPUs across 42 machines with minibatches of 8,064 images. Each GPU processes
24 images at a time and batch normalization (BN) [27] statistics are computed
on these 24 image sets. The length of the training schedule, measured in units
of number-of-images-processed (i.e., minibatch size × total SGD updates), is
determined by a heuristic: we choose two training extremes (for instance, 120
epochs on 1.2e6 images and 2 epochs on 3.5e9 images) and linearly interpolate
the schedule between them to set the number-of-images-processed for each ex-
periment. Schedules for each experiment are in the supplemental material. Our
ResNeXt-101 32×16d networks took ∼22 days to train on 3.5B images.
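As an illustration of the heuristic, the snippet below pins the number-of-images-processed at the two extremes and interpolates between them. The text does not specify the interpolation variable, so interpolating in log(dataset size) is our assumption; the actual schedules are in the supplemental material.

```python
# Sketch of the schedule heuristic: fix number-of-images-processed at the two
# training extremes and interpolate for intermediate dataset sizes.
# Assumption: interpolation is linear in log(dataset size).
import numpy as np

def images_to_process(dataset_size: float) -> float:
    x0, y0 = 1.2e6, 120 * 1.2e6  # extreme 1: 120 epochs on 1.2e6 images
    x1, y1 = 3.5e9, 2 * 3.5e9    # extreme 2: 2 epochs on 3.5e9 images
    t = (np.log(dataset_size) - np.log(x0)) / (np.log(x1) - np.log(x0))
    return y0 + t * (y1 - y0)

n = images_to_process(940e6)  # e.g. the 940M-image Instagram dataset
print(f"{n:.2e} images -> {n / 940e6:.1f} epochs")
```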
To set the learning rate, we follow the linear scaling rule with gradual warm-
up described in [19]. We use a warm-up from 0.1 up to 0.1/256 × 8,064, where 0.1
and 256 are the canonical learning rate and minibatch size [28]. After the warm-up,
the learning rate is multiplied by 0.5 at equally spaced steps, such that the total
number of learning rate reductions is 20 over the course of training. The same
settings are used when training on ImageNet and Instagram data, except that
when training on ImageNet we use 128 GPUs in 16 machines (for a minibatch
size of 3,072) due to the smaller dataset size and we use the standard learning
rate schedule that involves three equally spaced reductions by a factor of 0.1.
All other initialization and training details match [19] and are summarized in
the supplemental material.
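Putting the learning-rate recipe above into runnable form: a sketch assuming a warm-up length of 5% of training, since the exact warm-up duration is deferred to [19] and the supplemental material.

```python
def learning_rate(step: int, total_steps: int, minibatch: int = 8064,
                  warmup_frac: float = 0.05) -> float:
    """Linear-scaling LR with gradual warm-up and 20 equally spaced halvings."""
    peak = 0.1 / 256 * minibatch  # linear scaling rule: 3.15 for minibatch 8,064
    warmup = max(1, int(total_steps * warmup_frac))  # warm-up length: assumption
    if step < warmup:             # gradual warm-up from 0.1 to the peak
        return 0.1 + (peak - 0.1) * step / warmup
    # 20 equally spaced x0.5 reductions over the remainder of training
    frac = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 ** min(20, int(frac * 21))

print(learning_rate(0, 100_000))       # 0.1 (start of warm-up)
print(learning_rate(5_000, 100_000))   # 3.15 (peak after warm-up)
print(learning_rate(99_999, 100_000))  # peak * 0.5**20 (end of training)
```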
3 Experiments
In our experiments, we pretrain convolutional networks for hashtag prediction
and transfer those networks to a variety of tasks. There are two established pro-
tocols for judging the quality of a pretrained model (see [29] §3 for a discussion).
Both analyze how pretraining on a source task, e.g. IN-1k classification, leads to
gains (or losses) on a target task, e.g. bird recognition or object detection.