2.2 Transfer Learning
Training a deep neural network from scratch is often not feasible for various reasons: a dataset of sufficient size is required (and is not usually available), and reaching convergence can take too long for the experiments to be worthwhile. Even if a sufficiently large dataset is available and convergence does not take that long, it is often helpful to start from pre-trained weights instead of randomly initialized ones [20], [21]. Fine-tuning the weights of a pre-trained network by continuing the training process is one of the major transfer learning scenarios.
Yosinski et al. [22] showed that transferring features, even from distant tasks, can be better than using random initialization, while also noting that the transferability of features decreases as the difference between the pre-trained task and the target one increases.
However, applying this transfer learning technique is not completely straightforward. On the one hand, there are architectural constraints that must be met to use a pre-trained network. However, since it is not common to devise a whole new architecture from scratch, reusing existing network architectures (or components) is widespread, which in turn enables transfer learning. On the other hand, the training process differs slightly when fine-tuning instead of training from scratch. It is important to choose which layers to fine-tune – usually the higher-level part of the network, since the lower layers tend to contain more generic features – and to pick an appropriate policy for the learning rate, which is usually smaller than when training from scratch, since the pre-trained weights are expected to be relatively good already and should not be changed drastically.
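As an illustration, the following minimal PyTorch sketch (assuming a recent torchvision; the model choice and hyperparameters are assumptions, not taken from any reviewed method) freezes the generic lower layers of an ImageNet pre-trained backbone and fine-tunes only the higher-level ones with a reduced learning rate:

```python
import torch
import torchvision

# Minimal fine-tuning sketch (illustrative model and hyperparameters):
# start from ImageNet weights, freeze the generic lower layers, and train
# only the higher-level layers with a reduced learning rate.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Freeze everything except the last residual stage and the classifier.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

# Replace the classification head for the target task (e.g., 10 classes);
# the new layer is randomly initialized and trained from scratch.
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Use a learning rate well below the usual from-scratch value so that the
# pre-trained weights are only adjusted gently.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
    momentum=0.9,
)
```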
Due to the inherent difficulty of gathering and creating per-pixel labelled segmentation datasets, their scale is not as large as that of classification datasets such as ImageNet [23], [24]. This problem gets even worse when dealing with RGB-D or 3D datasets, which are even smaller. For that reason, transfer learning, and in particular fine-tuning from pre-trained classification networks, is a common trend for segmentation networks and has been successfully applied in the methods that we will review in the following sections.
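Recent versions of torchvision expose this pattern directly: a segmentation network can be instantiated with ImageNet classification weights for its backbone while the segmentation head is trained from scratch. The snippet below is a hedged sketch of that API (the specific model and class count are illustrative):

```python
import torchvision

# Illustrative instantiation (recent torchvision API): an FCN segmentation
# network whose ResNet-50 backbone starts from ImageNet classification
# weights, while the segmentation head is randomly initialized.
model = torchvision.models.segmentation.fcn_resnet50(
    weights=None,                      # no segmentation pre-training
    weights_backbone="IMAGENET1K_V1",  # classification weights only
    num_classes=21,                    # e.g., the 21 PASCAL VOC classes
)
```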
2.3 Data Preprocessing and Augmentation
Data augmentation is a common technique that has been proven to benefit the training of machine learning models in general, and of deep architectures in particular, either by speeding up convergence or by acting as a regularizer, thus avoiding overfitting and increasing generalization capabilities [25].
It typically consists of applying a set of transformations in either the data or the feature space, or even both. The most common augmentations are performed in the data space. That kind of augmentation generates new samples by applying transformations to the already existing data. Many transformations can be applied: translation, rotation, warping, scaling, color space shifts, crops, etc. The goal of those transformations is to generate more samples so as to create a larger dataset (preventing overfitting and presumably regularizing the model), to balance the classes within that dataset, and even to synthetically produce new samples that are more representative of the use case or task at hand.
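As a concrete sketch, the following pipeline (written with torchvision transforms; every parameter value is an illustrative assumption) covers the transformation families listed above:

```python
import torchvision.transforms as T

# Illustrative data-space augmentation pipeline covering the transformation
# families mentioned above; all parameter values here are assumptions.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),        # scaling and crops
    T.RandomAffine(degrees=15, translate=(0.1, 0.1)),  # rotation, translation
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2, hue=0.05),           # color space shifts
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```

Note that for semantic segmentation the geometric transformations must be applied identically to the input image and its label mask; otherwise the per-pixel correspondence between them is lost.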
Augmentations are especially helpful for small datasets, and have proven their efficacy in a long track record of success stories. For instance, in [26], a dataset of 1500 portrait images is augmented by synthesizing four new scales (0.6, 0.8, 1.2, 1.5), four new rotations (−45°, −22°, 22°, 45°), and four gamma variations (0.5, 0.8, 1.2, 1.5) to generate a new dataset of 19000 training images. That process allowed the authors to raise the accuracy of their portrait segmentation system from 73.09 to 94.20 Intersection over Union (IoU) when including that augmented dataset for fine-tuning.
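A minimal sketch of such an offline augmentation pass is shown below; the helper itself is hypothetical (not code from [26]) and uses Pillow/NumPy to produce the twelve synthesized variants per image:

```python
import numpy as np
from PIL import Image

# Hypothetical sketch of the offline augmentation reported in [26]:
# each portrait yields 12 variants (4 scales + 4 rotations + 4 gamma
# corrections), growing the 1500-image set to the roughly 19000
# training images mentioned above.
SCALES = (0.6, 0.8, 1.2, 1.5)
ROTATIONS = (-45, -22, 22, 45)  # degrees
GAMMAS = (0.5, 0.8, 1.2, 1.5)

def augment_portrait(img: Image.Image):
    """Yield the synthesized variants of a single training image."""
    w, h = img.size
    for s in SCALES:
        yield img.resize((int(w * s), int(h * s)), Image.BILINEAR)
    for r in ROTATIONS:
        yield img.rotate(r, resample=Image.BILINEAR, expand=True)
    for g in GAMMAS:
        arr = np.asarray(img, dtype=np.float32) / 255.0
        yield Image.fromarray(np.uint8(255.0 * np.clip(arr, 0, 1) ** g))
```

As in the previous example, the geometric variants (scales and rotations) would require the same transformations to be applied to the corresponding label masks.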
3 DATASETS AND CHALLENGES
Two kinds of readers are expected for this type of review: either they are new to the problem, or they are experienced enough and are just looking for the most recent advances made by other researchers in the last few years. Although the second kind is usually aware of two of the most important aspects to know before starting to research this problem, it is critical for newcomers to get a grasp of the top-quality datasets and challenges. Therefore, the purpose of this section is to kickstart novel scientists, providing them with a brief summary of datasets that might suit their needs, as well as data augmentation and preprocessing tips. Nevertheless, it can also be useful for seasoned researchers who want to review the fundamentals or perhaps discover new information.
Arguably, data is one of the most – if not the most – important parts of any machine learning system. When dealing with deep networks, this importance is even greater. For that reason, gathering adequate data into a dataset is critical for any segmentation system based on deep learning techniques. Gathering and constructing an appropriate dataset, which must be of a large enough scale and represent the use case of the system accurately, requires time, domain expertise to select relevant information, and infrastructure to capture that data and transform it into a representation that the system can properly understand and learn from. This task, despite the simplicity of its formulation in comparison with sophisticated neural network architecture definitions, is one of the hardest problems to solve in this context. Because of that, the most sensible approach usually means using an existing standard dataset which is representative enough for the domain of the problem. Following this approach has another advantage for the community: standardized datasets enable fair comparisons between systems. In fact, many datasets are part of a challenge which reserves some data – not released to participants – for evaluating their algorithms in a competition in which many methods are tested, generating a fair ranking of methods according to their actual performance without any kind of data cherry-picking.
In the following lines we describe the most popular
large-scale datasets currently in use for semantic segmen-
tation. All datasets listed here provide appropriate pixel-
wise or point-wise labels. The list is structured into three
parts according to the nature of the data: 2D or plain
RGB datasets, 2.5D or RGB-Depth (RGB-D) ones, and pure
volumetric or 3D databases. Table 1 shows a summarized
view, gathering all the described datasets and providing