discrimination. Here, we discuss the two lines of work most
related to this paper: weakly-supervised object localization
and visualizing the internal representation of CNNs.
Weakly-supervised object localization: There have
been a number of recent works exploring weakly-
supervised object localization using CNNs [1, 16, 2, 15].
Bergamo et al. [1] propose a technique for self-taught object localization that masks out image regions to identify the regions causing the maximal activations, thereby localizing objects. Cinbis et al. [2] combine multiple-instance learning with CNN features to localize objects. Oquab et al. [15] propose a method for transferring mid-level image representations and show that some object localization can be achieved by evaluating the output of CNNs on multiple overlapping patches; however, the authors do not actually evaluate the localization ability. While these approaches yield promising results, they are not trained end-to-end and require multiple forward passes of a network to localize objects, making them difficult to scale to real-world datasets. Our approach, in contrast, is trained end-to-end and can localize objects in a single forward pass.
The approach most similar to ours is the work on global max pooling by Oquab et al. [16]. Instead of global average pooling, they apply global max pooling to localize a point on objects. However, their localization is limited to a point lying within the object rather than determining its full extent. We believe that while the max and average functions are rather similar, the use of average pooling encourages the network to identify the complete extent of the object. The basic intuition is that the loss for average pooling benefits when the network identifies all discriminative regions of an object, whereas the loss for max pooling does not. This is explained in greater detail and verified experimentally in Sec. 3.2. Furthermore, unlike [16], we demonstrate that this localization ability is generic and can be observed even for problems that the network was not trained on.
We use the term class activation map to refer to the weighted activation maps generated for each image, as described in Section 2. We would like to emphasize that while global average pooling is not a novel technique that we propose here, the observation that it can be applied for accurate discriminative localization is, to the best of our knowledge, unique to our work. We believe that the simplicity of this technique makes it portable, allowing it to be applied to a variety of computer vision tasks for fast and accurate localization.
Visualizing CNNs: There have been a number of recent works [29, 14, 4, 33] that visualize the internal representation learned by CNNs in an attempt to better understand their properties. Zeiler et al. [29] use deconvolutional networks to visualize what patterns activate each unit. Zhou et al. [33] show that CNNs learn object detectors while being trained to recognize scenes, and demonstrate that the same network can perform both scene recognition and object localization in a single forward pass. Both of these works only analyze the convolutional layers, ignoring the fully-connected layers, and thereby paint an incomplete picture of the full story. By removing the fully-connected layers and retaining most of the performance, we are able to understand our network from the beginning to the end.
Mahendran et al. [14] and Dosovitskiy et al. [4] analyze
the visual encoding of CNNs by inverting deep features
at different layers. While these approaches can invert the
fully-connected layers, they only show what information
is being preserved in the deep features without highlight-
ing the relative importance of this information. Unlike [14]
and [4], our approach can highlight exactly which regions
of an image are important for discrimination. Overall, our
approach provides another glimpse into the soul of CNNs.
2. Class Activation Mapping
In this section, we describe the procedure for generating
class activation maps (CAM) using global average pooling
(GAP) in CNNs. A class activation map for a particular cat-
egory indicates the discriminative image regions used by the
CNN to identify that category (e.g., Fig. 3). The procedure
for generating these maps is illustrated in Fig. 2.
We use a network architecture similar to Network in Network [13] and GoogLeNet [24]: the network largely consists of convolutional layers, and just before the final output layer (softmax in the case of categorization), we perform global average pooling on the convolutional feature maps and use those as features for a fully-connected layer that produces the desired output (categorical or otherwise).
Given this simple connectivity structure, we can identify the importance of the image regions by projecting the weights of the output layer back onto the convolutional feature maps, a technique we call class activation mapping.
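For concreteness, here is a minimal PyTorch sketch of this connectivity; the layer sizes, class count, and the name GAPNet are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class GAPNet(nn.Module):
    """Toy CNN ending in GAP and one fully-connected layer (sketch only)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        # Convolutional trunk; channel sizes are illustrative placeholders.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # Global average pooling collapses each feature map to a single value.
        self.gap = nn.AdaptiveAvgPool2d(1)
        # The only fully-connected layer; its weights w_k^c are what we
        # project back onto the feature maps to form CAMs.
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        fmaps = self.features(x)             # f_k(x, y): B x 512 x H x W
        pooled = self.gap(fmaps).flatten(1)  # F^k: B x 512
        return self.fc(pooled)               # S_c: B x num_classes
```

Because the fully-connected layer acts directly on the pooled features, each class score is a linear function of the per-unit spatial averages, which is what makes projecting the weights back onto the feature maps possible.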
As illustrated in Fig. 2, global average pooling outputs
the spatial average of the feature map of each unit at the
last convolutional layer. A weighted sum of these values is
used to generate the final output. Similarly, we compute a
weighted sum of the feature maps of the last convolutional
layer to obtain our class activation maps. We describe this
more formally below for the case of softmax. The same
technique can be applied to regression and other losses.
For a given image, let $f_k(x, y)$ represent the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$. Then, for unit $k$, the result of performing global average pooling, $F^k$, is $\sum_{x,y} f_k(x, y)$. Thus, for a given class $c$, the input to the softmax, $S_c$, is $\sum_k w^c_k F^k$, where $w^c_k$ is the weight corresponding to class $c$ for unit $k$. Essentially, $w^c_k$ indicates the importance of $F^k$ for class $c$. Finally, the output of the softmax for class $c$, $P_c$, is given by $\frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}$.
Here we ignore the bias term: we explicitly set the input bias of the softmax to 0 as it has little to no impact on the classification performance.
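To make these quantities concrete, the following NumPy sketch (with placeholder shapes and random values, which are assumptions for illustration only) computes $F^k$, $S_c$, $P_c$, and the resulting class activation map for one image.

```python
import numpy as np

# Placeholder shapes: K units in the last conv layer, H x W spatial maps,
# C classes. The random values stand in for a trained network's outputs.
K, H, W, C = 512, 14, 14, 1000
f = np.random.rand(K, H, W)          # f_k(x, y): last-conv activations
w = np.random.rand(C, K)             # w_k^c: weights of the output layer

F = f.sum(axis=(1, 2))               # F^k = sum_{x,y} f_k(x, y)
S = w @ F                            # S_c = sum_k w_k^c F^k
P = np.exp(S - S.max())              # softmax, shifted for numerical stability
P /= P.sum()                         # P_c = exp(S_c) / sum_c' exp(S_c')

# Class activation map: the same class weights applied to the spatial maps
# before pooling, i.e. a weighted sum of the last-layer feature maps.
c = int(S.argmax())                  # e.g., the predicted class
cam = np.tensordot(w[c], f, axes=1)  # sum_k w_k^c f_k(x, y), shape H x W
```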