2.2 Transfer Learning
Training a deep neural network from scratch is often not feasible for various reasons: a dataset of sufficient size is required (and is not usually available), and reaching convergence can take too long for the experiments to be worthwhile. Even if a sufficiently large dataset is available and convergence does not take that long, it is often helpful to start from pre-trained weights instead of randomly initialized ones [20], [21]. Fine-tuning the weights of a pre-trained network by continuing the training process is one of the major transfer learning scenarios.
Yosinski et al. [22] showed that transferring features, even from distant tasks, can be better than using random initialization, while also noting that the transferability of features decreases as the difference between the pre-trained task and the target one increases.
However, applying this transfer learning technique is not completely straightforward. On the one hand, there are architectural constraints that must be met to use a pre-trained network. However, since it is not common to devise a whole new architecture from scratch, reusing existing network architectures (or components) is widespread, which in turn enables transfer learning. On the other hand, the training process differs slightly when fine-tuning instead of training from scratch. It is important to choose which layers to fine-tune – usually the higher-level part of the network, since the lower layers tend to contain more generic features – and to pick an appropriate policy for the learning rate, which is usually smaller than when training from scratch, since the pre-trained weights are expected to be relatively good already and should not be changed drastically.
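As an illustration, the following minimal PyTorch sketch (assuming a recent torchvision; the model choice and hyperparameters are assumptions, not taken from any reviewed method) freezes the generic lower layers of an ImageNet pre-trained backbone and fine-tunes only the higher-level ones with a reduced learning rate:

```python
import torch
import torchvision

# Minimal fine-tuning sketch (illustrative model and hyperparameters):
# start from ImageNet weights, freeze the generic lower layers, and train
# only the higher-level layers with a reduced learning rate.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Freeze everything except the last residual stage and the classifier.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

# Replace the classification head for the target task (e.g., 10 classes);
# the new layer is randomly initialized and trained from scratch.
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Use a learning rate well below the usual from-scratch value so that the
# pre-trained weights are only adjusted gently.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
    momentum=0.9,
)
```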
Due to the inherent difficulty of gathering and creating per-pixel labelled segmentation datasets, their scale is not as large as that of classification datasets such as ImageNet [23], [24]. This problem gets even worse when dealing with RGB-D or 3D datasets, which are even smaller. For that reason, transfer learning, and in particular fine-tuning from pre-trained classification networks, is a common trend for segmentation networks and has been successfully applied in the methods that we will review in the following sections.
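Recent versions of torchvision expose this pattern directly: a segmentation network can be instantiated with ImageNet classification weights for its backbone while the segmentation head is trained from scratch. The snippet below is a hedged sketch of that API (the specific model and class count are illustrative):

```python
import torchvision

# Illustrative instantiation (recent torchvision API): an FCN segmentation
# network whose ResNet-50 backbone starts from ImageNet classification
# weights, while the segmentation head is randomly initialized.
model = torchvision.models.segmentation.fcn_resnet50(
    weights=None,                      # no segmentation pre-training
    weights_backbone="IMAGENET1K_V1",  # classification weights only
    num_classes=21,                    # e.g., the 21 PASCAL VOC classes
)
```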
2.3 Data Preprocessing and Augmentation
Data augmentation is a common technique that has been proven to benefit the training of machine learning models in general, and of deep architectures in particular, either by speeding up convergence or by acting as a regularizer, thus avoiding overfitting and increasing generalization capabilities [25].
It typically consists of applying a set of transformations in either the data or the feature space, or even both. The most common augmentations are performed in the data space. That kind of augmentation generates new samples by applying transformations to the already existing data. Many transformations can be applied: translation, rotation, warping, scaling, color space shifts, crops, etc. The goal of those transformations is to generate more samples so as to create a larger dataset (preventing overfitting and presumably regularizing the model), to balance the classes within that dataset, and even to synthetically produce new samples that are more representative of the use case or task at hand.
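As a concrete sketch, the following pipeline (written with torchvision transforms; every parameter value is an illustrative assumption) covers the transformation families listed above:

```python
import torchvision.transforms as T

# Illustrative data-space augmentation pipeline covering the transformation
# families mentioned above; all parameter values here are assumptions.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),        # scaling and crops
    T.RandomAffine(degrees=15, translate=(0.1, 0.1)),  # rotation, translation
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2, hue=0.05),           # color space shifts
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```

Note that for semantic segmentation the geometric transformations must be applied identically to the input image and its label mask; otherwise the per-pixel correspondence between them is lost.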
Augmentations are especially helpful for small datasets, and have proven their efficacy in a long track record of success stories. For instance, in [26], a dataset of 1500 portrait images is augmented by synthesizing four new scales (0.6, 0.8, 1.2, 1.5), four new rotations (−45°, −22°, 22°, 45°), and four gamma variations (0.5, 0.8, 1.2, 1.5) to generate a new dataset of 19000 training images. That process allowed the authors to raise the accuracy of their portrait segmentation system from 73.09 to 94.20 Intersection over Union (IoU) when including that augmented dataset for fine-tuning.
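A minimal sketch of such an offline augmentation pass is shown below; the helper itself is hypothetical (not code from [26]) and uses Pillow/NumPy to produce the twelve synthesized variants per image:

```python
import numpy as np
from PIL import Image

# Hypothetical sketch of the offline augmentation reported in [26]:
# each portrait yields 12 variants (4 scales + 4 rotations + 4 gamma
# corrections), growing the 1500-image set to the roughly 19000
# training images mentioned above.
SCALES = (0.6, 0.8, 1.2, 1.5)
ROTATIONS = (-45, -22, 22, 45)  # degrees
GAMMAS = (0.5, 0.8, 1.2, 1.5)

def augment_portrait(img: Image.Image):
    """Yield the synthesized variants of a single training image."""
    w, h = img.size
    for s in SCALES:
        yield img.resize((int(w * s), int(h * s)), Image.BILINEAR)
    for r in ROTATIONS:
        yield img.rotate(r, resample=Image.BILINEAR, expand=True)
    for g in GAMMAS:
        arr = np.asarray(img, dtype=np.float32) / 255.0
        yield Image.fromarray(np.uint8(255.0 * np.clip(arr, 0, 1) ** g))
```

As in the previous example, the geometric variants (scales and rotations) would require the same transformations to be applied to the corresponding label masks.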
3 DATASETS AND CHALLENGES
Two kinds of readers are expected for this type of review: either they are new to the problem, or they are experienced enough and are just looking for the most recent advances made by other researchers in the last few years. Although the second kind is usually aware of two of the most important aspects to know before starting to research this problem, it is critical for newcomers to get a grasp of the top-quality datasets and challenges. Therefore, the purpose of this section is to kickstart novel scientists, providing them with a brief summary of datasets that might suit their needs, as well as data augmentation and preprocessing tips. Nevertheless, it can also be useful for seasoned researchers who want to review the fundamentals or perhaps discover new information.
Arguably, data is one of the most – if not the most – important parts of any machine learning system. When dealing with deep networks, this importance is even greater. For that reason, gathering adequate data into a dataset is critical for any segmentation system based on deep learning techniques. Gathering and constructing an appropriate dataset, which must be of a large enough scale and represent the use case of the system accurately, requires time, domain expertise to select relevant information, and infrastructure to capture that data and transform it into a representation that the system can properly understand and learn from. This task, despite the simplicity of its formulation in comparison with sophisticated neural network architecture definitions, is one of the hardest problems to solve in this context. Because of that, the most sensible approach usually means using an existing standard dataset which is representative enough for the domain of the problem. Following this approach has another advantage for the community: standardized datasets enable fair comparisons between systems. In fact, many datasets are part of a challenge which reserves some data – not released to participants – for evaluating their algorithms in a competition in which many methods are tested, generating a fair ranking of methods according to their actual performance without any kind of data cherry-picking.
In the following lines we describe the most popular
large-scale datasets currently in use for semantic segmen-
tation. All datasets listed here provide appropriate pixel-
wise or point-wise labels. The list is structured into three
parts according to the nature of the data: 2D or plain
RGB datasets, 2.5D or RGB-Depth (RGB-D) ones, and pure
volumetric or 3D databases. Table 1 shows a summarized
view, gathering all the described datasets and providing