Bag-of-Visual-Words Models for Adult Image Classification and Filtering
Thomas Deselaers¹*, Lexi Pimenidis², Hermann Ney¹*
¹ Human Language Technology and Pattern Recognition, ² Security and Privacy Research
Computer Science Department, RWTH Aachen University, Aachen, Germany
E-mail: deselaers@cs.rwth-aachen.de
* This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG) under grant NE-572/6.
Abstract
We present a method to classify images into different categories of pornographic content in order to build a system for filtering pornographic images from network traffic. Although various systems for this application have been presented in the past, most of them are based on simple skin-colour features and perform rather poorly. Recent advances in image recognition, in particular in object classification, have shown that bag-of-visual-words approaches work well for many image classification problems. The system we present here is based on this approach, uses a task-specific visual vocabulary, and is trained and evaluated on an image database of 8500 images from different categories. We show that it clearly outperforms earlier systems on this dataset, and further evaluation on two novel web-traffic collections confirms the good performance of the proposed system.
1 Introduction
Rating images according to their content is an important
application area, with one main application in filtering
network traffic to prohibit e.g. viewing pornographic
material. One desired property of such a system is the
possibility to dynamically change the content-type that
is filtered to avoid the necessity of several such systems.
Different clients might require differently strict content-
filters (e.g. elementary schools or religious institutions
might require different filters than universities or private
employers). At home, people might want to enable such
a filter over the day, when children are using the com-
puter but disable it in the late evening [14]. Ideally, an
pornographic image filter is created once and then the
filter administrator can easily select which types of im-
ages he wants the filter to remove and which types of
images are allowed.
In the literature, different techniques for filtering pornographic images have been presented: the detection of skin-coloured areas is investigated in [10, 9], and skin-colour features are used in combination with other features such as texture features and colour histograms [7, 11, 15, 2, 9, 16]. Most of these systems build on neural networks or support vector machines as classifiers. In [14], specialised features for pornographic image classification are presented and used in a retrieval/nearest-neighbour classification scheme. The POESIA filter (http://www.poesia-filter.org/) contains an open-source implementation of a skin-colour-based filter. Other approaches fuse textual and visual information from web pages in order to achieve better performance [8].
Recently, bag-of-visual-words (BOVW) models, which were initially proposed for texture classification [3, 13], have gained enormous popularity in object classification [4, 5] and natural scene analysis [6]. BOVW models are inspired by the bag-of-words models used in text classification, where a document is represented by an unordered set of the words it contains. Analogously, an image is represented by an unordered set of discrete visual words, which are obtained by discretising local descriptors. The method presented here learns a task-specific visual vocabulary and employs a log-linear model to discriminate between different classes of content type.
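To make the classification step concrete, the following is a minimal sketch (not the authors' implementation) of a log-linear model over bag-of-visual-word histograms, trained by gradient ascent on the log-likelihood; the function names, learning rate, and number of epochs are illustrative assumptions.

```python
# Sketch of a log-linear classifier: p(c|h) = exp(w_c . h) / sum_c' exp(w_c' . h)
# over bag-of-visual-word histograms h. Illustrative only, not the paper's code.
import numpy as np

def train_loglinear(H, y, n_classes, lr=0.1, epochs=200):
    """H: (N, V) visual-word histograms, y: (N,) integer class labels."""
    N, V = H.shape
    W = np.zeros((n_classes, V + 1))          # one weight vector (plus bias) per class
    X = np.hstack([H, np.ones((N, 1))])       # append a constant bias feature
    Y = np.eye(n_classes)[y]                  # one-hot targets
    for _ in range(epochs):
        scores = X @ W.T
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)     # class posteriors (softmax)
        W += lr * (Y - P).T @ X / N           # gradient of the mean log-likelihood
    return W

def classify(W, h):
    x = np.append(h, 1.0)
    return int(np.argmax(W @ x))
```

In this formulation the class posterior is a softmax over linear scores of the histogram entries, which is what is commonly meant by a log-linear model.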
2 Porn Image Identification
For porn image identification, we follow the BOVW approach, where each image is represented as a histogram of visual words. The visual words are quantised local features extracted from the images, and the vocabulary is learnt task-specifically from a training database.
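As an illustration of this representation, the sketch below (an assumption-laden example, not the paper's code) assigns each local descriptor of an image to its nearest visual word in a given vocabulary and builds a normalised count histogram; hard nearest-neighbour assignment is a common simplification of mixture-based quantisation.

```python
# Minimal sketch, assuming the visual vocabulary is given as a matrix of
# codeword centres (e.g. the means of the trained mixture densities).
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """descriptors: (n, d) local descriptors of one image,
    vocabulary: (V, d) visual-word centres."""
    # squared Euclidean distance between every descriptor and every visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                           # hard assignment to nearest word
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)                  # normalise for varying patch counts
```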
2.1 Bag-of-Visual-Words Method
As local features, we extract image patches around difference-of-Gaussian interest points [12], which are scaled to a common size and then PCA-transformed, keeping 30 coefficients to reduce their dimensionality. The advantage of patches over, e.g., SIFT descriptors [12] is the straightforward inclusion of colour information, which is clearly important for the task addressed here.
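The following sketch illustrates this feature-extraction step under several assumptions: OpenCV's SIFT detector stands in for the difference-of-Gaussian interest point detector, the 16x16-pixel patch size is an illustrative choice not specified above, and the PCA is computed directly via an SVD.

```python
# Sketch of the local-feature step, not the original implementation.
import cv2
import numpy as np

PATCH_SIZE = 16   # assumed common patch size (not specified in the text)
PCA_DIMS = 30     # number of PCA coefficients kept, as stated above

def extract_patches(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = cv2.SIFT_create().detect(gray, None)    # DoG interest points (stand-in)
    patches = []
    for kp in keypoints:
        x, y, r = int(kp.pt[0]), int(kp.pt[1]), max(int(kp.size // 2), 1)
        patch = image_bgr[max(y - r, 0):y + r, max(x - r, 0):x + r]
        if patch.size == 0:
            continue
        patch = cv2.resize(patch, (PATCH_SIZE, PATCH_SIZE))   # common size, keeps colour
        patches.append(patch.reshape(-1).astype(np.float32))
    return np.array(patches)

def pca_reduce(patches, dims=PCA_DIMS):
    mean = patches.mean(axis=0)
    centred = patches - mean
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:dims].T    # project each patch onto the first 30 components
```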
To create a visual vocabulary, we use an algorithm for unsupervised training of Gaussian mixture models. This algorithm creates a set of 2^#splits densities by iteratively splitting each existing density in the