Artificial Intelligence Study Reference, with PPT Lecture Slides - CSDN Library

14 files in total

pptx: 13

pdf: 1

Artificial intelligence

Uploaded 2023-03-12 01:26:58, 59.12MB ZIP

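ppt.zip (14 files)

ppt

Lecture 3: Logistic Regression and the Simplest Neural Network.pptx 3.73MB

Lecture 12: Recurrent Neural Networks (RNN).pptx 3.98MB

inception.pdf 1.15MB

Lecture 7: Improving Deep Neural Networks: Dataset Splits, Bias, Variance, Regularization, Gradients.pptx 9.17MB

Lecture 2: The Development of Neural Networks.pptx 4.76MB

~$Lecture 3: Logistic Regression and the Simplest Neural Network.pptx 165B

Lecture 6: Deep Neural Networks.pptx 3.75MB

Lecture 5: Shallow Neural Networks.pptx 3.38MB

Lecture 11: Classic Convolutional Neural Network Models.pptx 2.87MB

Lecture 10: Convolutional Neural Networks (CNN).pptx 11.74MB

Lecture 4: Gradient Descent.pptx 2.42MB

Lecture 13: Word Embeddings and Natural Language Processing.pptx 2.95MB

Lecture 1: Overview of Artificial Intelligence.pptx 5.92MB

Lecture 9: Hyperparameter Tuning, Normalization, Softmax and TensorFlow.pptx 4.11MB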

Going deeper with convolutions

Christian Szegedy

Google Inc.

Wei Liu

University of North Carolina, Chapel Hill

Yangqing Jia

Google Inc.

Pierre Sermanet

Google Inc.

Scott Reed

University of Michigan

Dragomir Anguelov

Google Inc.

Dumitru Erhan

Google Inc.

Vincent Vanhoucke

Google Inc.

Andrew Rabinovich

Google Inc.

Abstract

We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer deep network, the quality of which is assessed in the context of classification and detection.

1 Introduction

In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging sign is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. The biggest gains in object detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].

Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as a purely academic curiosity, but could be put to real-world use, even on large datasets, at a reasonable cost.

arXiv:1409.4842v1 [cs.CV] 17 Sep 2014

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al. [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.

2 Related Work

Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.

Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] use a series of fixed Gabor filters of different sizes in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception model are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.

Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers followed typically by the rectified linear activation [9]. This enables it to be easily integrated in the current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1×1 convolutions have a dual purpose: most critically, they are used as dimension reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for increasing not just the depth, but also the width of our networks without a significant performance penalty.
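The effect of 1×1 dimension reduction on the computational budget can be illustrated with a small multiply-add count. This is only a sketch: the feature-map size and channel counts below are hypothetical values chosen for illustration, the helper `conv_madds` is not from the paper, and biases and activations are ignored.

```python
def conv_madds(h, w, c_in, c_out, k):
    """Multiply-adds of a k x k convolution (stride 1, 'same' padding)
    on an h x w feature map with c_in input and c_out output channels."""
    return h * w * k * k * c_in * c_out

# Hypothetical sizes, for illustration only.
H, W = 28, 28
C_IN, C_OUT = 192, 32
C_REDUCED = 16  # channels left after the 1x1 reduction

# 5x5 convolution applied directly to the wide input.
direct = conv_madds(H, W, C_IN, C_OUT, 5)

# 1x1 reduction first, then the 5x5 convolution on the narrow tensor.
reduced = conv_madds(H, W, C_IN, C_REDUCED, 1) + conv_madds(H, W, C_REDUCED, C_OUT, 5)

print(direct)   # 120422400
print(reduced)  # 12443648
print(round(direct / reduced, 2))  # 9.68
```

Under these assumed sizes the reduced path needs roughly a tenth of the multiply-adds, which is the kind of saving that lets depth and width grow within a fixed budget.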

The current leading approach for object detection is the Regions with Convolutional Neural Networks (R-CNN) proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: to first utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify object categories at those locations. Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the strong classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.

3 Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of levels – of the network and its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.

Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This can become a major bottleneck, since the creation of high quality training sets can be tricky and expensive, especially if expert human raters are necessary to distinguish between fine-grained visual categories like those in ImageNet (even in the 1000-class ILSVRC subset) as demonstrated by Figure 1.

Figure 1: Two distinct classes from the 1000 classes of the ILSVRC 2014 classification challenge: (a) Siberian husky, (b) Eskimo dog.

Another drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up close to zero), then a lot of computation is wasted. Since in practice the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of results.
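The quadratic growth for two chained convolutional layers can be checked with a short calculation. This is a sketch under assumed layer sizes; the helper names and the numbers are illustrative, not taken from the paper.

```python
def conv_madds(h, w, c_in, c_out, k):
    """Multiply-adds of a k x k convolution (stride 1, 'same' padding)."""
    return h * w * k * k * c_in * c_out

def chained_cost(c_in, c1, c2, k=3, h=28, w=28):
    """Cost of two chained convolutions: c_in -> c1 filters -> c2 filters."""
    first = conv_madds(h, w, c_in, c1, k)
    second = conv_madds(h, w, c1, c2, k)
    return first, second

# Hypothetical baseline, then the same pair with both filter counts doubled.
base_first, base_second = chained_cost(c_in=64, c1=64, c2=64)
wide_first, wide_second = chained_cost(c_in=64, c1=128, c2=128)

print(wide_first / base_first)    # 2.0: the input width is unchanged
print(wide_second / base_second)  # 4.0: both c1 and c2 doubled
```

The second layer's cost is proportional to the product of the two filter counts, so scaling both by a factor s scales its cost by s squared.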

The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.
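The idea of grouping units with highly correlated outputs can be sketched as follows. This is a simple greedy stand-in written for illustration, not the construction analyzed by Arora et al. [2]; the function name, the threshold and the synthetic activations are all assumptions.

```python
import numpy as np

def cluster_by_correlation(activations, threshold=0.9):
    """Greedily group units whose outputs correlate with a seed unit.

    activations: array of shape (n_samples, n_units).
    Returns a list of clusters, each a list of unit indices.
    """
    corr = np.corrcoef(activations, rowvar=False)  # (n_units, n_units)
    unassigned = list(range(corr.shape[0]))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [u for u in unassigned if corr[seed, u] >= threshold]
        unassigned = [u for u in unassigned if u not in members]
        clusters.append(members)
    return clusters

# Synthetic activations: units 0 and 2 follow signal a, units 1 and 3 follow b.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
noise = 0.05 * rng.normal(size=(1000, 4))
acts = np.stack([a, b, a, b], axis=1) + noise

clusters = cluster_by_correlation(acts)
print(clusters)  # [[0, 2], [1, 3]]
```

Each resulting cluster would play the role of one unit in the next layer, connected to the units it groups.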

On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems utilize sparsity in the spatial domain just by virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, but the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure, a large number of filters and greater batch size allow for utilizing efficient dense computation.

This raises the question whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.

The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite this being a highly speculative undertaking, after only two iterations on the exact choice of topology, we could already see modest gains against the reference architecture based on [12]. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.

One must be cautious though: although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, if automated tools based on the principles described below would find similar, but better topology for the vision networks. The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with very differently looking global architecture. At the very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.

4 Architectural Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction in which one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from the earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. This means we would end up with a lot of clusters concentrated in a single region, and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; however, this decision was based more on convenience than on necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current state of the art convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too (see Figure 2(a)).
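The module described above, with parallel 1×1, 3×3 and 5×5 convolutions plus a pooling path whose outputs are concatenated along the channel axis, can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random weights and the tensor sizes are assumptions, activations and biases are omitted, and the 3×3 stride-1 max pooling is one plausible choice for the pooling path.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_same(x, weights):
    """Stride-1 convolution with zero padding.
    x: (H, W, C_in); weights: (k, k, C_in, C_out) with odd k."""
    k = weights.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    # windows has shape (H, W, C_in, k, k)
    windows = sliding_window_view(xp, (k, k), axis=(0, 1))
    return np.einsum("hwcij,ijco->hwo", windows, weights)

def max_pool_same(x, k=3):
    """Stride-1 max pooling that keeps the spatial size and channel count."""
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    windows = sliding_window_view(xp, (k, k), axis=(0, 1))
    return windows.max(axis=(-2, -1))

def naive_inception(x, w1, w3, w5):
    """Run all four branches on the same input and concatenate channels."""
    branches = [conv_same(x, w1), conv_same(x, w3), conv_same(x, w5), max_pool_same(x)]
    return np.concatenate(branches, axis=-1)

# Hypothetical sizes: an 8x8 map with 4 channels; 2, 3 and 1 output filters.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
w1 = rng.normal(size=(1, 1, 4, 2))
w3 = rng.normal(size=(3, 3, 4, 3))
w5 = rng.normal(size=(5, 5, 4, 1))

out = naive_inception(x, w1, w3, w5)
print(out.shape)  # (8, 8, 10)
```

The pooling branch contributes as many channels as the input has (4 here), so the concatenated output is always wider than the sum of the convolutional branches alone.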

As these “Inception modules” are stacked on top of each other, their output correlation statistics

are bound to vary: as features of higher abstraction are captured by higher layers, their spatial

concentration is expected to decrease suggesting that the ratio of 3×3 and 5×5 convolutions should

increase as we move to higher layers.

One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of convolutional layers would lead to an inevitable
