In this paper, we will focus on an efficient deep neural network architecture for computer vision,
codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12]
in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word
“deep” is used in two different meanings: first of all, in the sense that we introduce a new level of
organization in the form of the “Inception module” and also in the more direct sense of increased
network depth. In general, one can view the Inception model as a logical culmination of [12]
while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits
of the architecture are experimentally verified on the ILSVRC 2014 classification and detection
challenges, on which it significantly outperforms the current state of the art.
2 Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard
structure – stacked convolutional layers (optionally followed by contrast normalization and max-
pooling) are followed by one or more fully-connected layers. Variants of this basic design are
prevalent in the image classification literature and have yielded the best results to date on MNIST,
CIFAR and, most notably, on the ImageNet classification challenge [9, 21]. For larger datasets such
as ImageNet, the recent trend has been to increase the number of layers [12] and layer size [21, 14],
while using dropout [7] to address the problem of overfitting.
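As a rough illustration of this standard layout, the following shape-level sketch propagates a small input through stacked convolution/pooling stages before flattening for the fully-connected layers; all layer sizes here are hypothetical choices for illustration, not taken from any cited model.

```python
# Hypothetical shape bookkeeping for the "standard" CNN layout:
# stacked convolutions (with optional max-pooling) followed by
# fully-connected layers. Layer sizes are illustrative only.

def conv_shape(h, w, c, k, n_filters):
    # Output shape of an unpadded ('valid') k x k convolution.
    return (h - k + 1, w - k + 1, n_filters)

def pool_shape(h, w, c, k=2):
    # Output shape of k x k max-pooling with stride k.
    return (h // k, w // k, c)

s = (32, 32, 3)                        # e.g. a small RGB input
s = conv_shape(*s, k=5, n_filters=16)  # conv stage 1 -> (28, 28, 16)
s = pool_shape(*s)                     # -> (14, 14, 16)
s = conv_shape(*s, k=5, n_filters=32)  # conv stage 2 -> (10, 10, 32)
s = pool_shape(*s)                     # -> (5, 5, 32)
flat = s[0] * s[1] * s[2]              # flattened for fully-connected layers
print(s, flat)
```

The fully-connected layers at the end operate on the flattened vector, which is where most of the parameters of such designs typically reside.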
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same
convolutional network architecture as [9] has also been successfully employed for localization [9,
14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience
model of the primate visual cortex, Serre et al. [15] use a series of fixed Gabor filters of different sizes
in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed
2-layer deep model of [15], all filters in the Inception model are learned. Furthermore, Inception
layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet
model.
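The multi-scale idea can be sketched as filters of several sizes applied in parallel to the same input, with the branch outputs concatenated along a channel axis. The NumPy sketch below uses random stand-in filters purely to show the shapes; in the model of [15] these would be fixed Gabor filters, and in the Inception model they are learned.

```python
import numpy as np

def conv2d_same(x, k):
    # Naive single-channel 2-D correlation with zero 'same' padding,
    # so every branch preserves the input's spatial size.
    r = k.shape[0] // 2
    xp = np.pad(x, r)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
# Parallel branches with 1x1, 3x3 and 5x5 filters (random stand-ins here).
branches = [conv2d_same(x, rng.standard_normal((s, s))) for s in (1, 3, 5)]
y = np.stack(branches, axis=-1)  # concatenate along the channel axis
print(y.shape)  # (8, 8, 3)
```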
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representa-
tional power of neural networks. When applied to convolutional layers, the method could be viewed
as additional 1 × 1 convolutional layers, typically followed by the rectified linear activation [9]. This
enables it to be easily integrated into current CNN pipelines. We use this approach heavily in our
architecture. However, in our setting, 1 × 1 convolutions serve a dual purpose: most critically, they
are used as dimension-reduction modules to remove computational bottlenecks that would
otherwise limit the size of our networks. This allows for increasing not just the depth, but also the
width of our networks without significant performance penalty.
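To make the dimension-reduction role concrete: a 1 × 1 convolution is a per-pixel linear map across channels, and inserting one before an expensive large filter sharply cuts the multiply count. The channel sizes below are illustrative assumptions, not figures from this architecture.

```python
import numpy as np

def conv1x1(x, w):
    # x: (H, W, C_in); w: (C_in, C_out). A 1x1 convolution is just a
    # channel-wise matmul at every pixel, here followed by ReLU [9].
    return np.maximum(x @ w, 0.0)

H, W = 28, 28
x = np.random.default_rng(0).standard_normal((H, W, 256))
w = 0.01 * np.random.default_rng(1).standard_normal((256, 64))
y = conv1x1(x, w)  # (28, 28, 64): 256 channels reduced to 64

# Multiply counts for a 5x5 convolution producing 256 output channels:
direct  = H * W * 256 * 256 * 5 * 5                     # 5x5 on 256 channels
reduced = H * W * 256 * 64 + H * W * 64 * 256 * 5 * 5   # 1x1 to 64, then 5x5
print(reduced / direct)  # 0.26: roughly 4x fewer multiplications
```

The same reduction also helps the width: more parallel branches fit in a given computational budget when each expensive branch is preceded by a 1 × 1 reduction.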
The current leading approach for object detection is the Regions with Convolutional Neural Net-
works (R-CNN) proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem
into two subproblems: to first utilize low-level cues such as color and superpixel consistency for
potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify
object categories at those locations. Such a two-stage approach leverages the accuracy of
bounding-box segmentation with low-level cues together with the classification power of
state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have ex-
plored enhancements in both stages, such as multi-box [5] prediction for higher object bounding
box recall, and ensemble approaches for better categorization of bounding box proposals.
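The two-stage pipeline just described can be sketched abstractly as follows; `propose_regions` and `cnn_classify` are hypothetical stand-ins for the real components (a low-level proposal method and a trained CNN), not an implementation of R-CNN itself.

```python
# Hypothetical sketch of the two-stage detection pipeline: category-agnostic
# region proposals first, then per-region CNN classification.

def propose_regions(image):
    # Stage 1 stand-in: R-CNN derives proposals from low-level cues such as
    # color and superpixel consistency; here we just return fixed boxes.
    return [(0, 0, 50, 50), (10, 10, 80, 80), (30, 30, 40, 40)]

def cnn_classify(image, box):
    # Stage 2 stand-in: crop the box and score it with a trained CNN.
    # Here: a dummy score proportional to box area, for illustration only.
    x0, y0, x1, y1 = box
    score = min(1.0, (x1 - x0) * (y1 - y0) / 5000.0)
    return "object", score

def detect(image, threshold=0.4):
    detections = []
    for box in propose_regions(image):
        label, score = cnn_classify(image, box)
        if score >= threshold:
            detections.append((box, label, score))
    return detections
```

The enhancements mentioned above slot into this structure: multi-box prediction improves the recall of stage one, while ensembling improves the classification of stage two.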
3 Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increas-
ing their size. This includes both increasing the depth – the number of levels – of the network and its
width: the number of units at each level. This is an easy and safe way of training higher-quality
models, especially given the availability of a large amount of labeled training data. However, this
simple solution comes with two major drawbacks.
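One way to see the scale of the first drawback: in fully-connected layers the parameter count grows roughly quadratically with width, so doubling every layer's width roughly quadruples the parameters. The layer widths below are hypothetical, chosen only to illustrate the growth.

```python
# Illustrative only: parameter count of a chain of fully-connected layers.
def fc_params(widths):
    # weight matrix plus bias vector between each pair of adjacent layers
    return sum(a * b + b for a, b in zip(widths, widths[1:]))

base    = fc_params([1024, 1024, 1024])  # 2,099,200 parameters
doubled = fc_params([2048, 2048, 2048])  # 8,392,704 parameters
print(doubled / base)  # ~4x: quadratic growth in layer width
```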
Bigger size typically means a larger number of parameters, which makes the enlarged network more
prone to overfitting, especially if the number of labeled examples in the training set is limited.
This can become a major bottleneck, since the creation of high quality training sets can be tricky