Int J Comput Vis
DOI 10.1007/s11263-009-0271-8
Unsupervised Object Discovery: A Comparison
Tinne Tuytelaars · Christoph H. Lampert · Matthew B. Blaschko · Wray Buntine
Received: 28 July 2008 / Accepted: 6 July 2009
© The Author(s) 2009. This article is published with open access at Springerlink.com
Abstract The goal of this paper is to evaluate and compare models and methods for learning to recognize basic entities in images in an unsupervised setting. In other words, we want to discover the objects present in the images by analyzing unlabeled data and searching for re-occurring patterns. We experiment with various baseline methods, methods based on latent variable models, as well as spectral clustering methods. The results are presented and compared on subsets of both Caltech256 and MSRC2, data sets that are larger, more challenging, and include more object classes than what has previously been reported in the literature. A rigorous framework for evaluating unsupervised object discovery methods is proposed.
Keywords Object discovery · Unsupervised object recognition · Evaluation
T. Tuytelaars (✉)
ESAT-PSI, K.U. Leuven, Kasteelpark Arenberg 10, bus 2241, Leuven 3001, Belgium
e-mail: Tinne.Tuytelaars@esat.kuleuven.be
C.H. Lampert · M.B. Blaschko
Max Planck Institute for Biological Cybernetics, Tübingen,
Germany
C.H. Lampert
e-mail: Christoph.Lampert@tuebingen.mpg.de
M.B. Blaschko
University of Oxford, Oxford, UK
e-mail: blaschko@robots.ox.ac.uk
W. Buntine
NICTA, Canberra, Australia
e-mail: Wray.Buntine@nicta.com.au
W. Buntine
Australian National University, Canberra, Australia
1 Introduction

Over the last decade, the category-level object recognition problem has drawn considerable attention within the computer vision research community and, as a result, much progress has been realized. International challenges and benchmarking data sets such as the yearly Pascal Visual Object Classes challenge, Caltech101 and later Caltech256 fostered this research by providing annotated data sets for training and testing. Nevertheless, researchers soon identified the dependence on the availability of annotated training data as a serious restriction: the expensive manual annotation process hampers the extension to large numbers of object classes (on the order of thousands or even more, taking into account that humans easily distinguish around 30,000 object categories (Biederman 1987)). More importantly, it forces methods to extrapolate from a relatively small number of training examples. Better performance, and especially better recall, is to be expected if more data were available for training. In response to this, weakly supervised as well as fully unsupervised methods have been investigated. In this paper, we want to experimentally validate the feasibility of fully unsupervised methods for object recognition, also referred to as object discovery, where only unlabeled data are provided.
Unsupervised learning can be formulated as a search for patterns in the data above and beyond what would be considered pure unstructured noise. Such patterns could be anything, and need not have a semantic interpretation. This makes it hard to evaluate unsupervised methods because multiple solutions (interpretations) can exist and all be equally valid as far as the data itself is concerned. Here, we build on the approach proposed in Sivic et al. (2005): a data set is constructed composed of images from a fixed number of predefined categories. The goal is then to separate the different categories. This approach provides ground truth information against which the results obtained by different methods can be evaluated quantitatively.
In contrast to the supervised case, where systematic comparison on benchmarking data sets and challenges is standard practice, to date no in-depth evaluation and comparison of different unsupervised methods has been performed. Instead, each paper reports results on its own data set, generated ad hoc by throwing together a selection of object category images from one of the many annotated data sets available. Moreover, more often than not, evaluation is limited to one or two such data sets. This may result in over-optimistic results due to parameter fine-tuning.

Various evaluation metrics have been proposed, with classification accuracy probably the most popular one (Sivic et al. 2005). In this case, each discovered category is linked (often manually) to a ground truth category. Images are then assigned to the most probable category and evaluation is performed as for a standard classification problem (with accuracy defined as the sum of the diagonal elements of the confusion matrix). However, as the number of object categories and/or the imbalance in the data set increases, categories can be merged or split. We argue that conditional entropy provides a more intuitive and scalable measure for a proper evaluation. We also propose a new evaluation scheme for the more realistic setting with multiple object categories per image, without imposing pixel-wise classification.
In this paper, a range of different methods are compared, including baseline methods such as random assignment, k-means clustering and principal component analysis, and more advanced methods such as various latent variable models and spectral clustering schemes. All of these start from the same underlying image representation: a simple bag-of-visual-words model that describes the image in terms of a set of quantized local image patch descriptors. We experiment with several local feature detectors, various vocabulary sizes, as well as different normalization schemes. More complex image representations that include spatial configuration information could yield further improvements, but fall outside the scope of this paper. Global image representations are also not considered, as these would probably be more sensitive to changing backgrounds, cannot deal with multiple objects per image, and would make the study too extensive.

We tried to avoid any kind of manual parameter tuning or model selection for a specific data set, as this could be considered a violation of the unsupervised character of our methods. Instead, we selected reasonable parameters in advance and held them fixed for all of the experiments. We report results on more than ten different test sets, each containing 20 or more different object categories, always using the same parameters.
The only parameter that we assume to be known in advance is the number of object categories in the data set. Ultimately, the machine should be able to decide on this too without any human intervention, but here we have preferred to avoid the complex issue of model selection and focus on comparing the different models given a known number of classes.
1.1 Related Work

Probably the first work to tackle the problem of unsupervised object category discovery was Weber et al. (2000), building on a simplified constellation-model-like scheme.

Probabilistic models in particular seem well suited for tackling the unsupervised object discovery task and have been studied by several authors. Sivic et al. (2005) have proposed a method that builds on Probabilistic Latent Semantic Analysis (PLSA), as proposed by Hofmann (1999), to separate images of four distinct object categories (faces, airplanes, rear cars, and motorbikes). They later extended their work, experimenting with both PLSA as well as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) and using multiple image segments as the equivalent of documents, so as to better localize the objects in the images (Russell et al. 2006). Very similar is the work of Tang and Lewis (2008), except that they use non-negative matrix factorization (Lee and Seung 1999). Liu and Chen (2007) extend the PLSA model with the integration of a temporal model so as to discover objects in video.

Grauman and Darrell (2006) propose an alternative method based on spectral clustering. They pay special attention to separating the objects from the background or other objects present in a single image, and also propose a semi-supervised extension.

Finally, Kim et al. (2008) build on link analysis techniques developed in the context of graph mining and typically used in web search engines.

Recently, unsupervised object discovery methods including spatial information (Wang and Grimson 2008; Todorovic and Ahuja 2006) and hierarchical organization of categories (Sivic et al. 2008; Bart et al. 2008) have been proposed as well, but these fall outside the scope of this paper.
The rest of this paper is organized as follows. First, we introduce our evaluation metrics, both for the case of a single object per image and for the case of multiple objects per image (Sect. 2). Then we give a concise description of the image representation we use throughout this paper (Sect. 3). Next, in Sect. 4, we give an overview of different methods for unsupervised object discovery, starting with the baseline methods (Sect. 4.1), followed by latent variable methods (Sect. 4.2) and spectral clustering (Sect. 4.3). In Sects. 5 and 6, all these methods are applied to subsets of the Caltech256 and MSRC2 data sets and compared both quantitatively and qualitatively. Section 7 concludes the paper.
2 Evaluation Metrics

Recent reviews of the literature on measures for clustering can be found in Meila (2007) and Rosenberg and Hirschberg (2007). Standard measures for scoring clustering quality against a known standard include purity, defined as the mean of the maximum class probabilities for the ground truth category labels X and obtained cluster labels Y. Given variables (x, y) sampled from the finite discrete joint space X × Y, this is defined as

$$\mathrm{Purity}(X|Y) = \sum_{y \in Y} p(y) \max_{x \in X} p(x|y).$$

In practice the distribution p(x, y) is unknown, so it is estimated from the observed frequencies in a test sample, resulting in an empirical purity estimate.
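To make this concrete, the following is a minimal sketch (our own code, not from the paper) of the empirical purity estimate computed from a sample of (ground-truth label, cluster label) pairs:

```python
from collections import Counter, defaultdict

def empirical_purity(pairs):
    """Purity(X|Y) = sum_y p(y) max_x p(x|y), estimated from observed
    (ground-truth label x, cluster label y) pairs: for each cluster y,
    count its most frequent true class, then divide by the sample size."""
    joint = Counter(pairs)              # n(x, y)
    best = defaultdict(int)             # max_x n(x, y), per cluster y
    for (x, y), count in joint.items():
        best[y] = max(best[y], count)
    return sum(best.values()) / len(pairs)

# A perfect clustering has purity 1.0; random assignment tends towards
# the relative frequency of the largest class.
pairs = [("car", 0), ("car", 0), ("face", 0), ("face", 1)]
print(empirical_purity(pairs))          # (2 + 1) / 4 = 0.75
```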
A second measure is the mutual information or entropy gain

$$I(X|Y) = H(X) - H(X|Y), \qquad (1)$$

where

$$H(X|Y) = \sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log \frac{1}{p(x|y)}. \qquad (2)$$

Likewise, observed frequencies are used instead of probabilities, so one has an empirical entropy gain estimate. Our experiments have shown that purity and entropy gain are highly correlated in their ranking of different clusterings Y. Note that we usually estimate these quantities from a test set, say T, so when we do, we denote them as $\mathrm{Purity}_T(X|Y)$ and $I_T(X|Y)$.
2.1 Conditional Entropy

The use of entropy gain was originally introduced to allow the measure to be compared across different domains. In a single domain, the base entropy H(X) is constant, so one can also use the conditional entropy, H(X|Y), as a simpler measure to evaluate different algorithms for generating clusters or components. This has a nice and intuitive interpretation, as it gives the average amount of uncertainty that remains about X once the value of Y is known. Here we use it to measure how much uncertainty remains in the true class given the instance's estimated topic or cluster label. Conditional entropy has the following properties:

$$0 \le H(X|Y) \quad \text{with equality iff } Y \text{ determines } X, \qquad (3)$$
$$H(X|Y) \le H(X) \quad \text{with equality iff } Y \text{ is independent of } X. \qquad (4)$$

Roughly, assuming base 2 logarithms, one can interpret H(X|Y) as saying that on average there remain about $2^{H(X|Y)}$ choices for X once Y is known. Thus with 20 classes in the ground truth, H(X|Y) = 1 leaves about 2 of the 20 choices and H(X|Y) = 2 leaves about 4 of the 20 choices.
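A minimal sketch of the empirical conditional entropy estimate of Eq. (2), using base-2 logarithms as in the interpretation above (again our own code, not the authors'):

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(X|Y) = sum_y p(y) sum_x p(x|y) log2(1/p(x|y)), estimated from
    observed (ground-truth label x, cluster label y) frequencies."""
    n = len(pairs)
    joint = Counter(pairs)                    # n(x, y)
    cluster = Counter(y for _, y in pairs)    # n(y)
    h = 0.0
    for (x, y), count in joint.items():
        p_y = cluster[y] / n                  # p(y)
        p_x_given_y = count / cluster[y]      # p(x|y)
        h -= p_y * p_x_given_y * math.log2(p_x_given_y)
    return h

# Two clusters, each an even mix of two classes: one bit of uncertainty
# about the true class remains, i.e. about 2^1 = 2 remaining choices.
pairs = [("car", 0), ("face", 0), ("dog", 1), ("cat", 1)]
print(conditional_entropy(pairs))             # 1.0
```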
2.2 Using More Clusters

In experiments, one can easily build more clusters or components than are known in the ground truth, that is, |X| < |Y|. How can these subsequently be evaluated against the ground truth? The problem with the evaluation metrics described so far is that as |Y| increases, purity and conditional entropy get better and better. If |Y| were allowed to grow arbitrarily large, purity would go to 1 and conditional entropy would go to 0. But this is due to over-fitting rather than having a good clustering.
A simple approach is to use an oracle to assign each discovered component to its best known class, and then evaluate the resultant assignments as before, using purity, conditional entropy or entropy gain. In practice, the oracle is created from a data set that we will call the tuning set. Care must be taken to ensure an unbiased evaluation and, therefore, the tuning set of the oracle must be separate from the test set used to estimate the probabilities p(x, y). We use tuning set $T_1$ for the oracle and test set $T_2$ to estimate probabilities. Probabilities estimated from either set are denoted $\hat{p}_{T_1}()$ and $\hat{p}_{T_2}()$ respectively.
We use a mapping σ : Y → X that represents the assignment of clusters to labels used in the ground truth. Each data point (x, y) ∈ X × Y is now mapped to a value (x, σ(y)), a co-domain we denote as X × σ(Y). Thus any joint distribution p(x, y) on X × Y infers a joint distribution p(x, x′) on X × σ(Y). We then take measures such as purity or entropy on this inferred distribution w.r.t. the test set. Note that σ() could be referred to as a prediction function, and x′ = σ(y) as the prediction.
Given the tuning set $T_1$, estimate the best mapping, i.e., the oracle, by maximizing the purity or minimizing the conditional entropy as follows:

$$\sigma_{\mathrm{purity}} = \operatorname*{argmax}_{\sigma} \mathrm{Purity}_{T_1}(X|\sigma(Y)), \qquad (5)$$
$$\sigma_{\mathrm{entropy}} = \operatorname*{argmin}_{\sigma} H_{T_1}(X|\sigma(Y)). \qquad (6)$$
Then use the test set $T_2$ to get an unbiased estimate of the measures:

$$\mathrm{Purity}(X|Y) \approx \mathrm{Purity}_{T_2}(X|\sigma_{\mathrm{purity}}(Y)), \qquad (7)$$
$$H(X|Y) \approx H_{T_2}(X|\sigma_{\mathrm{entropy}}(Y)). \qquad (8)$$
In the case where $T_1 = T_2$ and |Y| = |X|, these measures are equivalent to observed estimates of the previous versions, and the effect of bias is minimal. By keeping the tuning and test sets different, one can see that as |Y| gets arbitrarily large, the purity and conditional entropy measures no longer tend towards their ideal values (one and zero respectively). This is the effect of using unbiased estimates.

Note that, with the above formulation, one can easily do multiple train-test splits, and even perform N-way cross-validation to get, arguably, better estimates.
Also, note that for purity, the estimates can be re-phrased as sample-based sums:

$$\sigma_{\mathrm{purity}} = \operatorname*{argmax}_{\sigma} \sum_{(x,y) \in T_1} \mathbf{1}_{x = \sigma(y)}, \qquad (9)$$
$$\mathrm{Purity}(X|Y) \approx \frac{1}{|T_2|} \sum_{(x,y) \in T_2} \mathbf{1}_{x = \sigma_{\mathrm{purity}}(y)}. \qquad (10)$$

This form may be more intuitive to some people, but the formulation as in (5) and (7) brings out the correspondence with entropy.
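In sample-based form, the purity oracle of (9) reduces to a per-cluster majority vote on the tuning set. Below is a minimal sketch (our own code and naming, not the authors') that fits the mapping on $T_1$ and scores it on the disjoint test set $T_2$ as in (10):

```python
from collections import Counter, defaultdict

def fit_purity_oracle(tuning_pairs):
    """Eq. (9): map each cluster y to the ground-truth class x it most
    often co-occurs with in the tuning set T1 (a majority vote)."""
    votes = defaultdict(Counter)
    for x, y in tuning_pairs:
        votes[y][x] += 1
    return {y: counts.most_common(1)[0][0] for y, counts in votes.items()}

def test_purity(test_pairs, sigma):
    """Eq. (10): the fraction of test-set samples whose mapped cluster
    label sigma(y) equals the true class x. Clusters never seen in the
    tuning set have no mapping and count as errors."""
    hits = sum(1 for x, y in test_pairs if sigma.get(y) == x)
    return hits / len(test_pairs)

# With |Y| > |X|, several clusters may map to the same class; the unbiased
# estimate on the disjoint test set no longer tends to 1 as |Y| grows.
t1 = [("car", 0), ("car", 0), ("face", 1), ("face", 2), ("face", 2)]
t2 = [("car", 0), ("face", 1), ("car", 2)]
sigma = fit_purity_oracle(t1)            # {0: "car", 1: "face", 2: "face"}
print(test_purity(t2, sigma))            # 2/3
```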
2.3 Multi-Class Ground Truth

It is more realistic that each test image contains different classes of content. For instance, one image of a city scene might have labels including buildings, cars and trees. The MSRC2 image set, used in Sect. 6, is such a collection.

Now we could ascribe proportions of each image to the ground truth. Thus one image of a city scene might have proportions including 27% buildings, 10% cars, 22% trees and 41% other (or unknown) class. However, this is not generally considered a good measure of content, since background classes like sky or grass often cover larger parts of the image than the actual foreground objects. Because we do not know of any established means of assessing proportions in a meaningful manner to classes in an image, we only consider the existence or non-existence of each class. So the ground truth for an image is the subset of classes existing in the image.

In this case the purity measure is clearly useless, since it assumes each image belongs to one class. One can instead replace purity by accuracy, or go to the more general measures used in such cases where multiple classes are being assessed, precision and recall. In our case, precision shall be taken to mean the proportion of predictions we make that are accurate, and recall shall be the proportion of true classes that we predict. In evaluation, we use their harmonic mean, called F1, given by

$$F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
For a given image, let S denote the subset of X giving classes that occur in the image (and thus are positive). Likewise, let S′ denote the subset of σ(Y) giving classes that have been predicted to occur in the image. Then for one image and a particular prediction function σ(), the precision is $\frac{|S \cap S'|}{|S'|}$ and the recall is $\frac{|S \cap S'|}{|S|}$. We wish to accumulate those over the full test set, so the quantities we estimate empirically are instead gathered using a mean. Thus

$$\mathrm{precision}_\sigma = \frac{E(|S \cap S'|)}{E(|S'|)}, \qquad \mathrm{recall}_\sigma = \frac{E(|S \cap S'|)}{E(|S|)}, \qquad (11)$$

where E() is the expected value function, approximated as $E_T()$ by taking the mean of the quantity over a data set T.
Samples now take the form of a subset of classes (denoting the true classes for an image) together with either the best component or the probability scores for the components. Thus a sample is (S, y) where S ⊆ X and y ∈ Y, or, in the case of methods outputting a vector of component scores, (S, q) where S ⊆ X and $q \in \mathbb{R}^Y$. The prediction function σ on Y used previously maps to the space σ(Y) = X ∪ {void}, which now includes a null class. We need the null class since some components may not correspond to true classes but to some “other”. Thus each component either maps to a known class, or it can map to nothing at all. We arbitrarily extend σ() to the score case $q \in \mathbb{R}^Y$ by defining

$$\sigma(q) = \Big\{ x : \sum_{y : \sigma(y) = x} q_y > 0.01 \Big\}. \qquad (12)$$
In this case, the predicted classes S′ are given by σ(q).

The F1 score for the prediction function σ() estimated on the data set T is the harmonic mean of precision and recall given earlier, which simplifies to

$$F1_T^\sigma(X|Y) = \frac{2\,E_T(|S \cap S'|)}{E_T(|S| + |S'|)}. \qquad (13)$$
An unbiased estimate for F1, using the same formulation as for entropy, is then

$$\sigma_{F1} = \operatorname*{argmax}_{\sigma} F1_{T_1}^{\sigma}(X|Y), \qquad (14)$$
$$F1(X|Y) \approx F1_{T_2}^{\sigma_{F1}}(X|Y). \qquad (15)$$
Notice that an unbiased estimate for accuracy also follows the model of purity given earlier:

$$\sigma_{\mathrm{accuracy}} = \operatorname*{argmin}_{\sigma} \sum_{(S,y) \in T_1} \frac{\mathrm{hamming}(S, \sigma(y))}{|X|}, \qquad (16)$$
$$\mathrm{Accuracy}(X|Y) \approx 1 - \frac{1}{|T_2|} \sum_{(S,y) \in T_2} \frac{\mathrm{hamming}(S, \sigma_{\mathrm{accuracy}}(y))}{|X|}. \qquad (17)$$

Here, hamming() denotes the Hamming distance between the indicator vectors of the true and predicted class subsets; since a smaller distance means better agreement, the oracle minimizes it, and accuracy is one minus the mean normalized distance.
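To fix ideas, here is a small sketch (our own code, not the authors') of the multi-label evaluation: predictions per (12) from a score vector, the F1 of (13), and the Hamming-based accuracy of (17), with samples given as (true class set, predicted class set) pairs:

```python
def predict_classes(q, sigma, threshold=0.01):
    """Eq. (12): class x is predicted if the summed score of all components
    mapped to x exceeds the threshold; components mapped to the void class
    (represented by None here) contribute to no prediction."""
    totals = {}
    for y, q_y in enumerate(q):
        x = sigma.get(y)                     # None plays the role of 'void'
        if x is not None:
            totals[x] = totals.get(x, 0.0) + q_y
    return {x for x, total in totals.items() if total > threshold}

def multilabel_f1(samples):
    """Eq. (13): F1 = 2 E(|S ∩ S'|) / E(|S| + |S'|); the means over the
    test set cancel, leaving a ratio of sums."""
    inter = sum(len(s & s_pred) for s, s_pred in samples)
    total = sum(len(s) + len(s_pred) for s, s_pred in samples)
    return 2.0 * inter / total if total else 0.0

def multilabel_accuracy(samples, num_classes):
    """Eq. (17): one minus the mean normalized Hamming distance; for sets,
    the symmetric difference size is the Hamming distance between indicators."""
    errors = sum(len(s ^ s_pred) for s, s_pred in samples)
    return 1.0 - errors / (num_classes * len(samples))

# Example: three components mapped onto classes {building, car} plus void.
sigma = {0: "building", 1: "car", 2: None}
s_pred = predict_classes([0.6, 0.005, 0.395], sigma)   # {"building"}
samples = [({"building", "car"}, s_pred)]
print(multilabel_f1(samples))                          # 2*1/(2+1) ≈ 0.667
print(multilabel_accuracy(samples, num_classes=20))    # 1 - 1/20 = 0.95
```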