Exploring the Limits of
Weakly Supervised Pretraining
Dhruv Mahajan Ross Girshick Vignesh Ramanathan Kaiming He
Manohar Paluri Yixuan Li Ashwin Bharambe Laurens van der Maaten
Facebook
Abstract. State-of-the-art visual perception models for a wide range
of tasks rely on supervised pretraining. ImageNet classification is the de
facto pretraining task for these models. Yet, ImageNet is now nearly ten
years old and is by modern standards “small”. Even so, relatively little is
known about the behavior of pretraining with datasets that are multiple
orders of magnitude larger. The reasons are obvious: such datasets are
difficult to collect and annotate. In this paper, we present a unique study
of transfer learning with large convolutional networks trained to predict
hashtags on billions of social media images. Our experiments demon-
strate that training for large-scale hashtag prediction leads to excellent
results. We show improvements on several image classification and object
detection tasks, and report the highest ImageNet-1k single-crop, top-1
accuracy to date: 85.4% (97.6% top-5). We also perform extensive ex-
periments that provide novel empirical data on the relationship between
large-scale pretraining and transfer learning performance.
1 Introduction
Nearly all state-of-the-art visual perception algorithms rely on the same formula:
(1) pretrain a convolutional network on a large, manually annotated image clas-
sification dataset and (2) finetune the network on a smaller, task-specific dataset.
This formula [1,2,3] has been in wide use for several years and led to impres-
sive improvements on numerous tasks. Examples include: object detection [1,4],
semantic segmentation [5,6], human pose estimation [7,8], video recognition [9],
monocular depth estimation [10], and so on. In fact, it is so effective that it
would now be considered foolhardy not to use supervised pretraining.
The ImageNet dataset [11] is the de facto pretraining dataset. While there are
studies analyzing the effects of various ImageNet pretraining factors on transfer
learning (e.g., [12,13]) or the use of different datasets that are of the same size
magnitude as ImageNet (e.g., [14,15]), relatively little is known about pretraining
on datasets that are multiple orders of magnitude larger ([16,17] are the largest
studies to date). The reasons for this are numerous: few such datasets exist,
building new datasets is labor intensive, and large computational resources are
needed to conduct experiments. Yet, given the central role of pretraining it is
important to expand our scientific knowledge in this domain.
This paper tries to address this complex issue by studying an unexplored data
regime: billions of images “labeled” in the wild with social media hashtags. This
data source has the advantage of being large and continuously growing, as well
as “free” from an annotation perspective since no manual labeling is required.
However, the data source also has potential disadvantages: hashtags may be
too noisy to serve as an effective supervisory signal and the image distribution
might be biased in ways that harm transfer learning. It is not a priori obvious
that training on this data will yield good transfer learning results.
The main result of this paper is that without manual dataset curation or so-
phisticated data cleaning, models trained on billions of Instagram images using
thousands of distinct hashtags as labels exhibit excellent transfer learning per-
formance. For example, we observe improvements over the state-of-the-art for
image classification and object detection, where we obtain a single-crop, top-1
accuracy of 85.4% on the ImageNet-1k image-classification dataset and 45.2%
AP on the COCO object-detection dataset [18], compared to 79.8% and 43.7%,
respectively, when training (or pretraining) the same models on ImageNet-1k.
Our primary goal, however, is to contribute novel experimental data about this
previously unexplored regime. To that end, we conduct numerous experiments
that reveal interesting trends. For example, we find that “hashtag engineering”
(i.e., collecting images tagged with a specific subset of hashtags) is a promising
new direction for improving transfer learning results, that training on large-scale
hashtag data is unexpectedly robust to label noise, and that the features learned
allow a simple linear classifier to achieve state-of-the-art ImageNet-1k top-1 ac-
curacy of 83.6% without any finetuning (compared to 84.2% with finetuning).
2 Scaling up Supervised Pretraining
In our experiments, we train standard convolutional network architectures to
predict hashtags on up to 3.5 billion public Instagram images. To make training
at this scale practical, we adopt a distributed synchronous implementation of
stochastic gradient descent with large (8k image) minibatches, following Goyal
et al. [19]. We experiment on a variety of datasets, which we describe next.
2.1 Instagram Datasets
We use a simple data collection pipeline: (1) We select a set of hashtags. (2) We
download images that are tagged with at least one of these hashtags. (3) Then,
because multiple hashtags may refer to the same underlying concept, we apply
a simple process that utilizes WordNet [20] synsets to merge some hashtags into
a single canonical form (e.g., #brownbear and #ursusarctos are merged). (4)
Finally, for each downloaded image, we replace each hashtag with its canonical
form and discard any hashtags that were not in the selected set. The canonical
hashtags are used as labels for training and evaluation.
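To make step (3) concrete, here is a minimal sketch of synonym-based merging using NLTK's WordNet interface. The paper's actual selection and merging rules are in its supplemental material, so the segmentation of hashtags into WordNet lemma form (e.g., #brownbear → "brown_bear") and the choice of canonical lemma below are our assumptions.

```python
# Minimal sketch of hashtag merging via WordNet synsets (step 3 above).
# Requires NLTK with the WordNet corpus installed: nltk.download("wordnet").
# Assumption: hashtags are already segmented into WordNet lemma form; the
# paper's real merging rules are in its supplemental material.
from nltk.corpus import wordnet as wn

def canonical_hashtag(term: str) -> str:
    """Map a segmented hashtag to a canonical form shared by its synonyms."""
    synsets = wn.synsets(term, pos=wn.NOUN)
    if not synsets:
        return term.replace("_", "")  # unknown terms pass through unchanged
    # Use the first lemma of the most frequent noun synset as the canonical name.
    return synsets[0].lemmas()[0].name().replace("_", "").lower()

# #brownbear and #ursusarctos share the "brown bear" synset, so both map to
# the same canonical hashtag:
print(canonical_hashtag("brown_bear"))    # brownbear
print(canonical_hashtag("ursus_arctos"))  # brownbear
```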
By varying the selected hashtags and the number of images to sample, we
can construct a variety of datasets of different sizes and visual distributions.
Table 1 summarizes the datasets used in our experiments. Each dataset is named
by completing a template, role-source-I-L, that indicates its role (training,
validation, testing), source (IG for Instagram, IN for ImageNet, etc.), number
of images I, and number of labels L. We use approximate image and label counts
for convenience; for example, "train-IG-940M-1.5k" is an Instagram dataset for
training with ∼940e6 images and ∼1,500 labels. We omit the role and image
count when it is clear from context or not useful to present.

Name template          Description
train-IG-I-1.5k        Instagram training set of I images and ∼1.5k hashtags from ImageNet-1k.
train-IG-I-8.5k        Instagram training set of I images and ∼8.5k hashtags from WordNet.
train-IG-I-17k         Instagram training set of I images and ∼17k hashtags from WordNet.
train-IN-1M-1k         The standard ImageNet-1k ILSVRC training set with 1.28M images.
val-IN-50k-1k          The standard ImageNet-1k ILSVRC validation set with 50k images.
train-IN-I-L           Extended ImageNet training set of I images and L ∈ {5k, 9k} labels.
val-IN-I-L             Extended ImageNet validation set of I images and L ∈ {5k, 9k} labels.
train-CUB-6k-200       The Caltech-UCSD Birds-200-2011 training set.
val-CUB-6k-200         The Caltech-UCSD Birds-200-2011 validation set.
train-Places-1.8M-365  The Places365-Standard training set (high-resolution version).
val-Places-37k-365     The Places365-Standard validation set (high-resolution version).
train-COCO-135k-80     The standard COCO detection training set (2017 version).
val-COCO-5k-80         The standard COCO detection validation set (2017 version).
test-COCO-20k-80       The standard COCO detection test-dev set (2017 version).

Table 1: Summary of image classification datasets. Each dataset is named with a
template, role-source-I-L, that indicates its role (training, validation, testing),
source, number of images I, and number of labels L.
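The template is regular enough to parse mechanically; the small helper below is our illustration, not code from the paper.

```python
from typing import NamedTuple

class DatasetName(NamedTuple):
    role: str    # "train", "val", or "test"
    source: str  # "IG", "IN", "CUB", "Places", or "COCO"
    images: str  # approximate image count, e.g. "940M"
    labels: str  # approximate label count, e.g. "1.5k"

def parse_dataset_name(name: str) -> DatasetName:
    """Split a role-source-I-L name into its four template fields."""
    return DatasetName(*name.split("-"))

print(parse_dataset_name("train-IG-940M-1.5k"))
# DatasetName(role='train', source='IG', images='940M', labels='1.5k')
```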
We design three hashtag sets for the Instagram data: (1) A ∼1.5k set with
hashtags from the standard 1,000 IN-1k synsets (each synset contains at least
one synonym, hence there are more hashtags than synsets). (2) A ∼17k set with
hashtags that are synonyms in any of the noun synsets in WordNet. And (3)
an ∼8.5k set with the most frequent hashtags from the 17k set. The hashtag set
sizes are measured after merging the hashtags into their canonical forms. We
hypothesize that the first set has a visual distribution similar to IN-1k, while
the other two represent more general visual distributions covering fine-grained
visual categories. Details of how these hashtags are selected and how the merging
process works are given in supplemental material.
Image deduplication. When performing transfer learning, it is essential to
understand and properly address overlap between training and test sets. Overlap
can exist because images may come from the same underlying sources (e.g.,
Wikipedia, Flickr, Google). For instance, ∼5% of the images in the val-CUB-6k-200
set [21] also appear in train-IN-1M-1k, and 1.78% of images in the val-IN-50k-1k
set are in the JFT-300M training set [17]. To address this issue, we performed the
following deduplication procedure: we compute R-MAC features [22,23] for all
candidate images using a ResNet-50 model, and use these features to find the k =
21 nearest neighbors for each of the images in our test sets (additional details are
in the supplemental material). Subsequently, we manually inspected all images
and their nearest neighbors to identify duplicates. This procedure uncovered
150 val-IN-50k-1k (0.30%), 10 val-CUB-6k-200 (0.17%), 151 val-Places-37k-365
(0.41%), and 6 val-COCO-5k-80 (0.12%) duplicates. In our results, we report
the observed accuracy of our models; in the supplemental material, we report
a conservative lower bound on accuracy by marking all duplicates as incorrect.
Given the small percentage of duplicates, they do not impact our findings.
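For readers who want to run a similar check on their own data, the sketch below embeds images and retrieves each test image's nearest training neighbors for manual review. We substitute global average-pooled ResNet-50 features and scikit-learn's k-NN for the R-MAC descriptors used in the paper, so this approximates rather than reproduces the procedure.

```python
# Approximate sketch of the deduplication check: embed images, then pull each
# test image's k = 21 nearest training neighbors for manual inspection. The
# paper uses R-MAC descriptors [22,23]; we substitute global average-pooled
# ResNet-50 features and scikit-learn's brute-force k-NN for brevity.
import torch
import torchvision.models as models
from sklearn.neighbors import NearestNeighbors

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()  # expose the 2048-d pooled features
model.eval()

@torch.no_grad()
def embed(batch: torch.Tensor):
    """(N, 3, 224, 224) normalized images -> (N, 2048) L2-normalized features."""
    return torch.nn.functional.normalize(model(batch), dim=1).numpy()

# Stand-in tensors; in practice these are the candidate training images and
# the test-set images to deduplicate.
train_feats = embed(torch.randn(256, 3, 224, 224))
test_feats = embed(torch.randn(8, 3, 224, 224))

knn = NearestNeighbors(n_neighbors=21).fit(train_feats)  # k = 21, as above
dists, idxs = knn.kneighbors(test_feats)  # candidates for manual duplicate review
```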
Discussion. Our datasets have two nice properties: public visibility and sim-
plicity. By using publicly accessible images, the data used in our experiments is
visible to everyone. To see what it looks like, the images are browsable by hashtag
at https://www.instagram.com/explore/tags/ followed by a specific hashtag;
for example https://www.instagram.com/explore/tags/brownbear shows im-
ages tagged with #brownbear. Our data is also taken from the “wild”, essentially
as-is, with minimal effort to sanitize it. This makes the dataset construction pro-
cess particularly simple and transparent.
We contrast these properties with the JFT-300M dataset [17], which is not
publicly visible and is the result of a proprietary collection process (“The [JFT-
300M] images are labeled using an algorithm that uses a complex mixture of raw
web signals, connections between web-pages and user feedback.”). Additional
details describing the collection of JFT-300M have not been publicly disclosed.
Despite our efforts to make the dataset content and collection process trans-
parent, we acknowledge that, similar to JFT-300M, it is not possible for other
research groups to know exactly which images we used nor to download them
en masse. Hence it is not possible for others to replicate our results at this time.
However, we believe that it is better if we undertake this study and share the
results with the community than to not publish the results.
2.2 ImageNet Datasets
In addition to the standard IN-1k dataset, we experiment with larger subsets
of the full ImageNet 2011 release that contains 14.2M images and 22k labels.
We construct training and validation sets that include 5k and 9k labels. For the
5k set, we use the now standard IN-5k proposed in [15] (6.6M training images).
For the 9k label set, we follow the same protocol used to construct IN-5k, which
involves taking the next most frequent 4k labels and all of the associated images
(10.5M training images). In all cases, we use 50 images per class for validation.
2.3 Models
We use residual networks with grouped convolutional layers, called ResNeXt
[15]. Our experiments use ResNeXt-101 32×Cd, which has 101 layers, 32 groups,
and group widths C of: 4 (8B multiply-add FLOPs, 43M parameters), 8 (16B,
88M), 16 (36B, 193M), 32 (87B, 466M), and 48 (153B, 829M). Our implemen-
tation matches [19]. We believe our results will generalize to other architectures
[24,25,26].
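Of these capacity points, the 32×8d variant ships with torchvision (the other group widths would need a custom ResNeXt constructor); a quick check of its parameter count against the figure quoted above:

```python
import torchvision.models as models

# ResNeXt-101 32x8d: 101 layers, cardinality 32, group width 8.
model = models.resnext101_32x8d()  # random init; pass weights=... for pretrained
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~88.8M, matching the 88M above
```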
Loss function. In contrast to ImageNet, our Instagram datasets may contain
multiple labels per image (because a user specified multiple hashtags). The aver-
age number of hashtags per image varies depending on the dataset; for instance,
train-IG-1B-17k contains ∼2 hashtags per image. Our model computes probabil-
ities over all hashtags in the vocabulary using a softmax activation and is trained
to minimize the cross-entropy between the predicted softmax distribution and
the target distribution of each image. The target is a vector with k non-zero
entries each set to 1/k corresponding to the k ≥ 1 hashtags for the image.
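A minimal sketch of this loss in PyTorch; this is our restatement of the target construction just described, not the authors' code (their implementation follows [19]).

```python
import torch
import torch.nn.functional as F

def hashtag_cross_entropy(logits: torch.Tensor,
                          hashtag_ids: list[list[int]]) -> torch.Tensor:
    """Cross-entropy against a uniform 1/k target over each image's k hashtags.

    logits: (N, V) scores over the hashtag vocabulary.
    hashtag_ids: one list of hashtag indices per image (k >= 1 each).
    """
    n, v = logits.shape
    target = torch.zeros(n, v, device=logits.device)
    for i, ids in enumerate(hashtag_ids):
        target[i, ids] = 1.0 / len(ids)  # k non-zero entries, each set to 1/k
    log_probs = F.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()

# Example: 2 images over a 5-hashtag vocabulary; image 0 has one hashtag,
# image 1 has two.
logits = torch.randn(2, 5)
loss = hashtag_cross_entropy(logits, [[3], [0, 2]])
```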
We have also experimented with per-hashtag sigmoid outputs and binary
logistic loss, but obtained significantly worse results. While counter-intuitive
given the multi-label data, these findings match similar observations in [16].
The successful application of sigmoid activations and logistic loss may require
sophisticated label completion techniques [17] and more hyper-parameter search.
2.4 Pretraining Details
Our models are trained by synchronous stochastic gradient descent (SGD) on 336
GPUs across 42 machines with minibatches of 8,064 images. Each GPU processes
24 images at a time and batch normalization (BN) [27] statistics are computed
on these 24 image sets. The length of the training schedule, measured in units
of number-of-images-processed (i.e., minibatch size × total SGD updates), is
determined by a heuristic: we choose two training extremes (for instance, 120
epochs on 1.2e6 images and 2 epochs on 3.5e9 images) and linearly interpolate
the schedule between them to set the number-of-images-processed for each ex-
periment. Schedules for each experiment are in the supplemental material. Our
ResNeXt-101 32×16d networks took ∼22 days to train on 3.5B images.
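As an illustration of the heuristic, the snippet below pins the number-of-images-processed at the two extremes and interpolates between them. The text does not specify the interpolation variable, so interpolating in log(dataset size) is our assumption; the actual schedules are in the supplemental material.

```python
# Sketch of the schedule heuristic: fix number-of-images-processed at the two
# training extremes and interpolate for intermediate dataset sizes.
# Assumption: interpolation is linear in log(dataset size).
import numpy as np

def images_to_process(dataset_size: float) -> float:
    x0, y0 = 1.2e6, 120 * 1.2e6  # extreme 1: 120 epochs on 1.2e6 images
    x1, y1 = 3.5e9, 2 * 3.5e9    # extreme 2: 2 epochs on 3.5e9 images
    t = (np.log(dataset_size) - np.log(x0)) / (np.log(x1) - np.log(x0))
    return y0 + t * (y1 - y0)

n = images_to_process(940e6)  # e.g. the 940M-image Instagram dataset
print(f"{n:.2e} images -> {n / 940e6:.1f} epochs")
```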
To set the learning rate, we follow the linear scaling rule with gradual warm-
up described in [19]. We use a warm-up from 0.1 up to 0.1/256 × 8,064, where 0.1
and 256 are the canonical learning rate and minibatch size [28]. After the warm-up,
the learning rate is multiplied by 0.5 at equally spaced steps, such that the total
number of learning rate reductions is 20 over the course of training. The same
settings are used when training on ImageNet and Instagram data, except that
when training on ImageNet we use 128 GPUs in 16 machines (for a minibatch
size of 3,072) due to the smaller dataset size and we use the standard learning
rate schedule that involves three equally spaced reductions by a factor of 0.1.
All other initialization and training details match [19] and are summarized in
the supplemental material.
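Putting the learning-rate recipe above into runnable form: a sketch assuming a warm-up length of 5% of training, since the exact warm-up duration is deferred to [19] and the supplemental material.

```python
def learning_rate(step: int, total_steps: int, minibatch: int = 8064,
                  warmup_frac: float = 0.05) -> float:
    """Linear-scaling LR with gradual warm-up and 20 equally spaced halvings."""
    peak = 0.1 / 256 * minibatch  # linear scaling rule: 3.15 for minibatch 8,064
    warmup = max(1, int(total_steps * warmup_frac))  # warm-up length: assumption
    if step < warmup:             # gradual warm-up from 0.1 to the peak
        return 0.1 + (peak - 0.1) * step / warmup
    # 20 equally spaced x0.5 reductions over the remainder of training
    frac = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 ** min(20, int(frac * 21))

print(learning_rate(0, 100_000))       # 0.1 (start of warm-up)
print(learning_rate(5_000, 100_000))   # 3.15 (peak after warm-up)
print(learning_rate(99_999, 100_000))  # peak * 0.5**20 (end of training)
```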
3 Experiments
In our experiments, we pretrain convolutional networks for hashtag prediction
and transfer those networks to a variety of tasks. There are two established pro-
tocols for judging the quality of a pretrained model (see [29] §3 for a discussion).
Both analyze how pretraining on a source task, e.g. IN-1k classification, leads to
gains (or losses) on a target task, e.g. bird recognition or object detection.