[2015-ICCV] Bilinear CNN Models for Fine-grained Visual Recognition

PDF, 3.57 MB, uploaded 2022-08-03 14:50:59

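For fine-grained visual recognition, Bilinear CNN Models for Fine-grained Visual Recognition proposes a simple and effective architecture called Bilinear Convolutional Neural Networks (B-CNNs). The network represents an image as the pooled outer product of features extracted by two CNNs, capturing localized feature interactions in a translationally invariant manner. B-CNNs belong to the class of orderless texture representations but, unlike prior work, they can be trained end to end.

The core of a B-CNN is the bilinear operation, which captures the subtle differences that separate fine-grained categories. A conventional CNN may miss such small variations because it emphasizes global, high-level features, whereas the interaction between the two CNN streams captures richer local information. This is particularly useful for tasks such as recognizing species of birds, models of cars, or breeds of dogs. The most accurate model obtains 84.1%, 79.4%, 86.9% and 91.3% per-image accuracy on the Caltech-UCSD Birds, NABirds, FGVC aircraft, and Stanford Cars datasets respectively, and runs at 30 frames per second on an NVIDIA Titan X GPU.

The paper also presents a systematic analysis showing that:

1. The bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, which leaves room for lighter, more resource-efficient models.
2. The approach is not limited to fine-grained recognition; it is also effective for other image classification tasks such as texture and scene recognition.
3. B-CNNs can be trained from scratch on the ImageNet dataset and offer consistent improvements over the baseline architecture.

To show how the models respond to different parts of an input image, the authors visualize them using the top activations of neural units and gradient-based inversion techniques. The source code for the complete system has been released so that others can reproduce and apply the method.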


Bilinear CNNs for Fine-grained Visual Recognition

Tsung-Yu Lin Aruni RoyChowdhury Subhransu Maji

Abstract—We present a simple and effective architecture for fine-grained visual recognition called Bilinear Convolutional Neural Networks (B-CNNs). These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs belong to the class of orderless texture representations but unlike prior work they can be trained in an end-to-end manner. Our most accurate model obtains 84.1%, 79.4%, 86.9% and 91.3% per-image accuracy on the Caltech-UCSD birds [67], NABirds [64], FGVC aircraft [42], and Stanford cars [33] datasets, respectively, and runs at 30 frames-per-second on an NVIDIA Titan X GPU. We then present a systematic analysis of these networks and show that the bilinear features (1) are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, (2) are also effective for other image classification tasks such as texture and scene recognition, and (3) can be trained from scratch on the ImageNet dataset offering consistent improvements over the baseline architecture. Finally, we present visualizations of these models on various datasets using top activations of neural units and gradient-based inversion techniques. The source code for the complete system is available at http://vis-www.cs.umass.edu/bcnn.

Index Terms—Fine-grained recognition, Texture representations, Second-order pooling, Bilinear models, Convolutional networks

1 INTRODUCTION

FINE-GRAINED recognition involves classification of instances within a subordinate category. Examples include recognition of species of birds, models of cars, or

breeds of dogs. These tasks often require recognition of

highly localized attributes of objects while being invariant

to their pose and location in the image. For example, dis-

tinguishing a “California gull” from a “Ringed-bill gull”

requires the recognition of patterns on their bill, or subtle

color differences of their feathers [1]. There are two broad

classes of techniques that are effective for these tasks. Part-

based models construct representations by localizing parts

and extracting features conditioned on their detected lo-

cations. This makes subsequent reasoning about appear-

ance easier since the variations due to location, pose, and

viewpoint changes are factored out. Holistic models on the

other hand construct a representation of the entire image

directly. These include classical image representations, such

as Bag-of-Visual-Words [12] and their variants popularized

for texture analysis. Most modern approaches are based

on representations extracted using Convolutional Neural

Networks (CNNs) pre-trained on the ImageNet dataset [54].

While part-based models based on CNNs are more accurate,

they require part annotations during training. This makes

them less applicable in domains where such annotations are

difﬁcult or expensive to obtain, including categories without

a clearly deﬁned set of parts such as textures and scenes.

In this paper we argue that the effectiveness of part-

based reasoning is due to their invariance to position and

pose of the object. Texture representations are translationally

invariant by design as they are based on aggregation of

local image features in an orderless manner. While classical texture representations based on SIFT [40] and their recent extensions based on CNNs [11], [24], have been shown to be effective at fine-grained recognition, they have not matched the performance of part-based approaches. A potential reason for this gap is that the underlying features in texture representations are not learned in an end-to-end manner and are likely to be suboptimal for the recognition task.

• T.-Y. Lin, A. RoyChowdhury, and S. Maji are with the College of Information and Computer Sciences, University of Massachusetts Amherst, USA. E-mails: {tsungyulin, arunirc, smaji}@cs.umass.edu

We present Bilinear CNNs (B-CNNs) that address several

drawbacks of existing deep texture representations. Our key

insight is that several widely-used texture representations

can be written as a pooled outer product of two suitably

designed features. When these features are based on CNNs

the resulting architecture consists of standard CNN units

for feature extraction, followed by a specially designed

bilinear layer and a pooling layer. The output is a ﬁxed

high-dimensional representation which can be combined

with a fully-connected layer to predict class labels. The

simplest bilinear layer is one where two identical features

are combined with an outer product. This is closely re-

lated to the Second-Order Pooling approach of Carreira et

al. [8] popularized for semantic image segmentation. We

also show that other texture representations can be written

as B-CNNs once suitable non-linearities are applied to the

underlying features. This results in a family of layers which

can be plugged into existing CNNs for end-to-end training

on large datasets, or domain-speciﬁc ﬁne-tuning for transfer

learning. B-CNNs outperform existing models, including

those trained with part-level supervision, on a variety of

ﬁne-grained recognition datasets. Moreover, these models

are fairly efficient. Our most accurate model implemented in MatConvNet [66] runs at 30 frames-per-second on an NVIDIA Titan X GPU and obtains 84.1%, 79.4%, 86.9% and 91.3% per-image accuracy on the Caltech-UCSD birds [67], NABirds [64], FGVC aircraft [42], and Stanford cars [33] datasets, respectively.

arXiv:1504.07889v6 [cs.CV] 1 Jun 2017

This manuscript combines the analysis of our earlier

works [36], [37] and extends them in a number of ways.

We present an account of related work, including extensions

published subsequently (Section 2). We describe the B-CNN

architecture (Section 3), and present a uniﬁed analysis of

exact and approximate end-to-end trainable formulations

of Second-Order Pooling (O2P), Fisher Vector (FV), Vector-

of-Locally-Aggregated Descriptors (VLAD), Bag-of-Visual-

Words (BoVW) in terms of their accuracy on a variety of

ﬁne-grained recognition datasets (Section 3.2-4). We show

that the approach is general-purpose and is effective at

other image classiﬁcation tasks such as material, texture,

and scene recognition (Section 4). We present a detailed

analysis of dimensionality reduction techniques and pro-

vide trade-off curves between accuracy and dimensionality

for different models, including a direct comparison with

the recently proposed compact bilinear pooling technique [19]

(Section 5.1). Moreover, unlike prior texture representations

based on networks pre-trained on the ImageNet dataset,

B-CNNs can be trained from scratch and offer consistent

improvements over the baseline architecture with a mod-

est increase in the computation cost (Section 5.2). Finally

we visualize the top activations of several units in the

learned models and apply the gradient-based technique of

Mahendran and Vedaldi [41] to visualize inverse images

on various texture and scene datasets (Section 5.3). We

have released the complete source code for the system at

http://vis-www.cs.umass.edu/bcnn.

2 RELATED WORK

Fine-grained recognition techniques. After AlexNet’s [34]

impressive performance on the ImageNet classiﬁcation chal-

lenge, several authors (e.g., [13], [52]) have demonstrated

that features extracted from layers of a CNN are effective

at ﬁne-grained recognition tasks. Building on prior work

on part-based techniques (e.g., [5], [15], [71]), Zhang et

al. [70], and Branson et al. [6] demonstrated the beneﬁts

of combining CNN-based part detectors [23] and CNN-

based features for ﬁne-grained recognition tasks. Other

approaches use segmentation to guide part discovery in

a weakly-supervised manner and train part-based mod-

els [31]. Among the non part-based techniques, texture

descriptors such as FV and VLAD have traditionally been

effective for ﬁne-grained recognition. For example, the top

performing method on FGCOMP’12 challenge used SIFT-

based FV representation [25].

Recent improvements in deep architectures have also

resulted in improvements in ﬁne-grained recognition. These

include architectures that have increased depth such as the

“deep” [9] and “very deep” [59] networks from the Oxford’s

VGG group, inception networks [60], and “ultra deep”

residual networks [26]. Spatial Transformer Networks [29]

augment CNNs with parameterized image transformations

and are highly effective at ﬁne-grained recognition tasks.

Other techniques augment CNNs with “attention” mecha-

nisms that allow focused reasoning on regions of an im-

age [4], [43]. B-CNNs can be viewed as an implicit spatial

attention model since the outer product modulates one

feature based on the other, similar to the multiplicative

feature interactions in attention mechanisms. Although not

directly comparable, Krause et al. [32] showed that the

accuracy of deep networks can be improved signiﬁcantly by

using two orders of magnitude more training data obtained

by querying category labels on search engines. Recently,

Moghimi et al. [44] showed boosting B-CNNs offers con-

sistent improvements on ﬁne-grained tasks.

Texture representations and second-order features. Tex-

ture representations have been widely studied for decades.

Early work [35] represents the texture by computing the

statistics of linear ﬁlter-bank responses (e.g., wavelets and

steerable pyramids). The use of second-order features of

filter-bank responses was pioneered by Portilla and Simoncelli [50]. Recent variants such as FV [46] and O2P [8] with SIFT were shown to be highly effective for image classification and semantic segmentation [14] tasks respectively.

The advantages of combining orderless texture represen-

tations and deep features have been studied in a number of

recent works. Gong et al. performed a multi-scale orderless

pooling of CNN features [24] for scene classiﬁcation. Cimpoi

et al. [11] performed a systematic analysis of texture repre-

sentations by replacing linear ﬁlter-banks with non-linear

ﬁlter-banks derived from a CNN and showed it results

in signiﬁcant improvements on various texture, scene, and

ﬁne-grained recognition tasks. They found that orderless

aggregation of CNN features was more effective than the

commonly-used fully-connected layers on these tasks. How-

ever, a drawback of these approaches is that the ﬁlter banks

are not trained in an end-to-end manner. Our work is also

related to the cross-layer pooling approach of Liu et al. [38]

who showed that second-order aggregation of features from

two different layers of a CNN is effective at ﬁne-grained

recognition. Our work showed that feature normalization

and domain-speciﬁc ﬁne-tuning offers additional beneﬁts,

improving the accuracy from 77.0% to 84.1% using identical

networks on the Caltech-UCSD Birds dataset [67]. Another

subsequently published work of interest is the NetVLAD

architecture [3] which provides an end-to-end trainable approximation of VLAD. The approach was applied to the image-based geolocation problem. We include a comparison of

NetVLAD to other texture representations in Section 4.

Texture synthesis and style transfer. Concurrent to our

work, Gatys et al. showed that the Gram matrix of CNN

features is an effective texture representation and by match-

ing the Gram matrix of a target image one can create novel

images with the same texture [20] and transfer styles [21].

While the Gram matrix is identical to a pooled bilinear

representation when the two features are the same, the

emphasis of our work is recognition and not generation. This

distinction is important since Ustyuzhaninov et al. [63] show

that the Gram matrix of a shallow CNN with random ﬁlters

is sufﬁcient for texture synthesis, while discriminative pre-

training and subsequent ﬁne-tuning are essential to achieve

high performance for recognition.
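
The equivalence noted above, between the Gram matrix and a pooled bilinear representation built from a single feature, can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the array sizes and variable names are illustrative assumptions. It also shows that the resulting matrix is symmetric and positive semi-definite, so only the upper triangle carries distinct values.

```python
import numpy as np

rng = np.random.default_rng(0)
num_locations, m = 20, 4                  # illustrative sizes only (assumed)
feats = rng.standard_normal((num_locations, m))

# Pooled outer product of the feature with itself, i.e. the Gram matrix.
gram = feats.T @ feats

# A symmetric matrix has real eigenvalues; for a Gram matrix they are >= 0.
eigenvalues = np.linalg.eigvalsh(gram)

# Number of distinct entries: upper triangle including the diagonal.
unique_entries = m * (m + 1) // 2
```

With two different feature extractors the pooled matrix is generally not symmetric, so this saving applies only to the symmetric case.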

Polynomial kernels and sum-product networks. An

alternate strategy for combining features from two networks

is to concatenate them and learn their pairwise interactions

through a series of layers on top. However, doing this

naively requires a large number of parameters since there

are $O(n^2)$ interactions over $O(n)$ features requiring a layer with $O(n^2)$ parameters. Our explicit representation using

an outer product has no parameters and is similar to a

quadratic kernel expansion used in kernel support vector

machines [55]. However, one might be able to achieve

similar approximations using alternate architectures such as

sum-product networks that efﬁciently model multiplicative

interactions [22].
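
The link to a quadratic kernel can be made concrete with a small numerical check. The sketch below is an illustration under assumed sizes, not code from the paper: it flattens the outer product of a vector with itself and verifies that the inner product of two such expansions equals the squared inner product of the original vectors, which is the homogeneous quadratic kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                     # illustrative feature size (assumed)
x = rng.standard_normal(n)
y = rng.standard_normal(n)

# Explicit, parameter-free expansion: all n * n pairwise products.
explicit_x = np.outer(x, x).ravel()
explicit_y = np.outer(y, y).ravel()

# Inner product in the expanded space ...
lhs = explicit_x @ explicit_y
# ... matches the quadratic kernel evaluated on the original vectors.
rhs = (x @ y) ** 2
```

The expansion has n * n entries and no learned weights, in contrast with a learned layer over concatenated features that would need on the order of n squared parameters.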

Bilinear model variants and extensions. Bilinear models

were used by Tenenbaum and Freeman [62] to model two-factor variations such as “style” and “content” for images.

While we also model two factor variations in location and

appearance of parts, our goal is classiﬁcation and not the

explicit modeling of these factors. Our work is related to

bilinear classiﬁers [49] that express the classiﬁer as a prod-

uct of two low-rank matrices. Our models based on low

dimensional representations described in Section 5.1 can be

interpreted as bilinear classiﬁers. Our model is related to

“two-stream” architectures used to analyze videos where

one network models the temporal aspect, while the other

models the spatial aspect [17], [58]. The idea of combining

two features using the outer product has also been shown to be effective for other tasks such as visual question-answering [18], where text and visual features are combined, and action recognition [16], where optical flow and image features are combined.

Low-dimensional bilinear features. A drawback of the

bilinear features is the memory overhead of storing the high-

dimensional features. For example, the outer product of 512

dimensional features results in a 512×512 dimensional rep-

resentation. Our earlier work [37] showed that the overall

representation can be reduced to 512×64 dimensions by

projecting one of the features to a lower-dimensional space.

Alternatively, the compact bilinear pooling [19] applies tensor

sketching [48] to aggregate low-dimensional embeddings

that approximate the bilinear features. In Section 5.1 we

compare the two approaches and ﬁnd that the projection

method is simpler, faster, and equally effective. In most cases the feature size can be reduced 8-32× without significant loss in accuracy.
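
The size arithmetic for the projection approach can be sketched as follows. This is a hedged illustration only: the projection matrix `proj` is a random stand-in (an assumption, since the text here does not say how the projection is obtained), and the number of locations is arbitrary. The sketch shows the 512×512 versus 512×64 sizes quoted above and that projecting one feature before pooling gives the same result as projecting the full bilinear feature afterwards, without ever forming the 512×512 matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_locations, d, d_low = 10, 512, 64     # 512 and 64 come from the text
feat_a = rng.standard_normal((num_locations, d))
feat_b = rng.standard_normal((num_locations, d))

# Hypothetical projection matrix; random here purely for illustration.
proj = rng.standard_normal((d, d_low))

# Full bilinear feature: 512 x 512.
full = feat_a.T @ feat_b

# Project one feature first, then pool: 512 x 64.
low = feat_a.T @ (feat_b @ proj)

# How many times smaller the projected representation is.
reduction = full.size // low.size
```

Because the projection is linear, `low` equals `full @ proj` up to floating-point error, so the cheaper ordering loses nothing relative to projecting after pooling.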

Scalability and speed. B-CNNs compare favorably to

traditional CNN architectures in terms of speed since they

replace several fully-connected layers with a bilinear pool-

ing layer and a linear layer. Our MatConvNet-based [66]

implementation runs at between 30 and 100 frames per second on an NVIDIA Titan X GPU with cudnn-v5 depending on

the model architecture. Even with faster object detection

modules such as Faster R-CNNs [53] or Single-Shot Detector

(SSD) [39], part-based models for ﬁne-grained recognition

are 2-10× slower. The main advantage of B-CNNs is that

they require image labels only and can be easily applied to

different ﬁne-grained datasets.

3 B-CNNS FOR IMAGE CLASSIFICATION

In this section we introduce the B-CNN architecture for

image classiﬁcation and then show that various widely used

texture representations can be written as B-CNNs.

3.1 The B-CNN architecture

[Figure 1 diagram: an input image (Chestnut_Sided_Warbler_0110_164023.jpg) is fed to CNN stream A and CNN stream B (convolutional + pooling layers); their outputs form the bilinear vector, which is passed to a softmax that outputs the label "chestnut sided warbler".]

Fig. 1. Image classification using a B-CNN. An image is passed through CNNs A and B, and their outputs at each location are combined using the matrix outer product and average pooled to obtain the bilinear feature representation. This is passed through a linear and softmax layer to obtain class predictions.

A B-CNN for image classification consists of a quadruple $\mathcal{B} = (f_A, f_B, \mathcal{P}, \mathcal{C})$. Here $f_A$ and $f_B$ are feature functions based on CNNs, $\mathcal{P}$ is a pooling function, and $\mathcal{C}$ is a classification function. A feature function is a mapping $f : \mathcal{L} \times \mathcal{I} \to \mathbb{R}^{K \times D}$, that takes an image $I \in \mathcal{I}$ and a location $l \in \mathcal{L}$ and outputs a feature of size $K \times D$. We refer to locations generally, which can include position and scale. The feature outputs are combined at each location using the matrix outer product, i.e., the bilinear combination of $f_A$ and $f_B$ at a location $l$ is given by

$$\mathrm{bilinear}(l, I, f_A, f_B) = f_A(l, I)^T f_B(l, I). \tag{1}$$

Both $f_A$ and $f_B$ must have the same feature dimension $K$ to be compatible. The value of $K$ depends on the particular model. For example, $K = 1$ for BoVW model and equals the number of clusters in a FV model (details in Section 2). The pooling function $\mathcal{P}$ aggregates the bilinear combination of features across all locations in the image to obtain a global image representation $\Phi(I)$. We use sum pooling in all our experiments, i.e.,

$$\Phi(I) = \sum_{l \in \mathcal{L}} \mathrm{bilinear}(l, I, f_A, f_B) = \sum_{l \in \mathcal{L}} f_A(l, I)^T f_B(l, I). \tag{2}$$

Since the location of features is ignored during pooling, the bilinear feature $\Phi(I)$ is an orderless representation. If $f_A$ and $f_B$ extract features of size $K \times M$ and $K \times N$ respectively, then $\Phi(I)$ is of size $M \times N$. The bilinear feature is a general-purpose image representation that can be used with a classifier $\mathcal{C}$ (Figure 1). Intuitively, the outer product conditions the outputs of features $f_A$ and $f_B$ on each other by considering their pairwise interactions, similar to the feature expansion in a quadratic kernel.
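
The sum pooling of Eq. (2) can be sketched in a few lines of NumPy for the K = 1 case, where each stream emits one vector per location. This is a minimal illustration under assumptions, not the released MatConvNet code: the function name `bilinear_pool`, the random stand-in features, and all sizes are hypothetical. It checks that the matrix product form matches the explicit per-location sum and that shuffling the locations leaves the result unchanged, which is what "orderless" means here.

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Sum-pooled outer product over locations (K = 1 case).

    feat_a: (L, M) array, one M-dimensional feature per location.
    feat_b: (L, N) array, one N-dimensional feature per location.
    Returns an (M, N) matrix: the sum over l of outer(feat_a[l], feat_b[l]).
    """
    return feat_a.T @ feat_b

rng = np.random.default_rng(0)
num_locations, m, n = 49, 8, 5            # illustrative sizes only (assumed)
feat_a = rng.standard_normal((num_locations, m))   # stand-in for stream A
feat_b = rng.standard_normal((num_locations, n))   # stand-in for stream B

phi = bilinear_pool(feat_a, feat_b)

# The same quantity written as the explicit per-location sum of Eq. (2).
phi_loop = sum(np.outer(feat_a[l], feat_b[l]) for l in range(num_locations))

# Orderless: permuting the locations does not change the pooled feature.
perm = rng.permutation(num_locations)
phi_shuffled = bilinear_pool(feat_a[perm], feat_b[perm])
```

Writing the pooled sum as a single matrix product avoids materializing one M × N outer product per location.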

3.1.1 Feature functions

A natural candidate for the feature function f is a CNN

consisting of a hierarchy of convolutional and pooling

layers. In our experiments we use CNNs pre-trained on

the ImageNet dataset truncated at an intermediate layer as

feature functions. By pre-training we beneﬁt when domain-

speciﬁc data is limited. This has been shown to be effective

for a number of tasks ranging from object detection, texture

recognition, to ﬁne-grained classiﬁcation [10], [13], [23], [52].

Another advantage of using CNNs is that the resulting
