Heterogeneous multimedia cooperative annotation based on multimodal correlation learning

Feng Tian (corresponding author, tianfeng@nepu.edu.cn), Quge Wang, Xin Li, Ning Sun
School of Computer and Information Technology, Northeast Petroleum University, DaQing 163318, China

J. Vis. Commun. Image R. 58 (2019) 544–553. https://doi.org/10.1016/j.jvcir.2018.12.028
This article is part of the Special Issue on Multimodal_Cooperation. This work is supported by the Natural Science Foundation of China (No. 61502094) and the Natural Science Foundation of Heilongjiang Province of China (No. F2016002).
Article history: Received 13 July 2018; Revised 26 November 2018; Accepted 11 December 2018; Available online 13 December 2018.
Keywords: Multimedia annotation; Cooperative annotation; Multimodal correlation learning
Abstract
Rich multimedia content dominates the current Web. On popular social media platforms such as Facebook, Twitter, and Instagram, millions of multimedia items are created by users. At the same time, multimedia data consists of data in multiple modalities, such as text, images, videos, audio, and time series. Many research efforts have been devoted to multimedia annotation to further improve its performance; however, the prevailing methods are designed for single-media annotation tasks. In fact, heterogeneous media content describes a given label from the perspective of its own modality, and the modalities are complementary to each other, so it becomes critical to explore advanced techniques for heterogeneous data analysis and multimedia annotation. Inspired by this idea, this paper presents a new multimodal correlation learning method for heterogeneous multimedia cooperative annotation, named unified space learning, which projects heterogeneous media data into one unified space. We formulate the multimedia annotation task as a semi-supervised learning framework in which we learn a different projection matrix for each media type. By doing so, different media content is aligned cooperatively and jointly provides a more complete profile of the given semantic labels. Experimental results on a data set with images, audio clips, videos, and 3D models show that the proposed approach is more effective than the compared methods.
© 2018 Elsevier Inc. All rights reserved.
1. Introduction
Web-based sharing services, coupled with the growing amount of multimedia content available online, dominate the current Web. Managing multimedia data requires an effective retrieval mechanism [1–4]. Compared with instance-based multimedia retrieval, keyword-based methods are more convenient, since it is easier for users to provide keywords than multimedia samples. However, keyword-based retrieval requires text labels for multimedia data, so multimedia annotation, which assigns labels to multimedia objects, plays a significant role. Despite being studied extensively, most media annotation methods focus on a specific single type of media, and their performance is unsatisfactory [3,4]. In fact, different types of media data usually contain complementary information, and a semantic concept can be described from different views. Inspired by this, we can provide more diverse and accurate labels for multimedia data. However, because of the semantic gap, it is extremely difficult to measure the similarity between heterogeneous multimedia data in their original spaces, even when they share the same semantics (e.g., the 3D model of a wolf and an audio clip of a wolf). Moreover, how to learn the correlation between heterogeneous media data remains an open problem.
To further improve the performance of multimedia annotation, a possible solution is to model the correlations among heterogeneous media data and learn a unified representation for them; by doing so, we can make full use of their complementarity. In this paper, we present a novel multimodal correlation learning approach for heterogeneous media data, named unified space learning (USL for short), which predicts labels by exploring the associations between different types of media data cooperatively.
In summary, the advantages and characteristics of the method are as follows:
(1) We approach heterogeneous media cooperative annotation through multimodal correlation learning. We learn a unified space by taking both the correlation between heterogeneous multimedia data and the
semantic information into account simultaneously. By doing so, samples in the space are comparable, and the annotation problem is transformed into a neighbor-search problem.
(2) To make full use of the complementarity among heterogeneous multimedia data and to handle out-of-sample data, we learn a different projection matrix for each type of media simultaneously. By doing so, heterogeneous multimedia data are aligned with each other, which makes the method more robust to noise.
(3) To make the proposed method more suitable for real environments, we incorporate both labeled and unlabeled data into the learning framework; optimizing them together makes the solution smoother in the unified space. We conduct experiments on a heterogeneous media data set, exploring the correlations among four types of media data cooperatively, and the experimental results demonstrate the effectiveness of the method.
The rest of this paper is organized as follows. In Section 2, we review the existing related methods. The proposed method is introduced in detail in Section 3. In Section 4, extensive experiments are conducted to demonstrate the superiority of the proposed method. Finally, we conclude this work in Section 5.
2. Related work
In this section, we will review recent works closely related to
multimedia annotation [5–14] and media correlation analysis
[15–32].
2.1. Multimedia annotation
Recently, many methods have been proposed to solve the problem of multimedia annotation, and they basically focus on single-type media annotation, such as image annotation, audio annotation, and video annotation. Topic models find the correlation between media objects and labels through latent variables [5–8]. From the probabilistic modeling perspective, the problem of media annotation can be transformed into recovering the latent variables that describe the distribution of the media data and labels. Among them, latent Dirichlet allocation [7] goes beyond probabilistic latent semantic analysis [8] with a Bayesian network that models media data as a mixture over a set of topics; however, it is difficult to select the number of mixture components. Another important line of work is based on matrix factorization [9,10] and graphs [11,12]. These models still have limitations: given an unseen media object, the cost function needs to be recomputed. In other words, these methods cannot solve the "out-of-sample" problem. Instance-based methods estimate the relevance between a label and the content of a media object by accumulating votes from its neighbors; the most commonly used methods are neighbor voting algorithms [13,14].
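As an illustration of the neighbor-voting idea described above (a minimal sketch, not the exact algorithm of [13,14]), the following Python snippet scores a candidate tag for an unlabeled item by counting how many of its nearest labeled neighbors carry that tag; the feature extraction step and the data set are assumed to be given.

```python
import numpy as np

def neighbor_vote_scores(query_feat, train_feats, train_tags, all_tags, k=10):
    """Score each candidate tag for a query item by neighbor voting.

    query_feat  : (d,) feature vector of the unlabeled item
    train_feats : (n, d) features of labeled items
    train_tags  : list of n tag sets, one per labeled item
    all_tags    : candidate tag vocabulary
    """
    # Euclidean distances to all labeled items, then take the k nearest.
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]

    # A tag's score is the fraction of the k neighbors annotated with it.
    scores = {}
    for tag in all_tags:
        votes = sum(1 for i in neighbors if tag in train_tags[i])
        scores[tag] = votes / k
    return scores

# Toy usage with random features and two tags (illustrative only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))
tags = [{"wolf"} if i < 50 else {"car"} for i in range(100)]
print(neighbor_vote_scores(feats[0], feats, tags, ["wolf", "car"], k=5))
```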
2.2. Media correlation analysis
Recently, learning from different views has become a very promising topic with wide applicability [15–21]. In [15], the performance of the visual bag-of-words model is improved by exploiting the coherence between multiple feature spaces. In [16], the performance of multimedia retrieval is greatly improved by combining visual and text features. In [17], the performance of cross-media retrieval is improved by integrating the correlations between data of different modalities. It is notable that the essence of multimedia annotation is to learn the association between the semantic concept and the media content [22]. Canonical Correlation Analysis (CCA) [23] has been popular for its performance and its capability of modeling two sets of modalities: CCA computes a shared subspace of the two modalities by maximizing the correlation between them. Although CCA has been widely used in different fields, it still has limitations, and many methods based on CCA have been proposed [24]. Among them, Cross-modal Factor Analysis (CFA) [25] and the kernelization of CCA (KCCA) [26] are representative. Instead of maximizing the pairwise correspondence between two sets of features, CFA minimizes the Frobenius norm between the two modalities; compared with CCA, CFA shows advantages in cross-modal information retrieval. Because CCA ignores the nonlinearities of multimodal data, KCCA has been proposed to capture more properties of multimodal data by embedding the data into a higher-dimensional feature space. As the nonlinear extension of CCA, KCCA has been applied in many fields: it has been exploited to learn the mapping between visual words and text corpora [27], and it has also been successfully applied in cross-media retrieval [28], computational biology [29], and audio recognition [30].
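To make the two-modality correlation learning performed by CCA concrete, the following sketch uses scikit-learn's CCA implementation to project paired image and text features into a shared subspace; the feature matrices are random stand-ins for real extracted features.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200                                              # number of paired samples
img = rng.normal(size=(n, 64))                       # image features (e.g., a CNN or BoW descriptor)
txt = img[:, :32] + 0.1 * rng.normal(size=(n, 32))   # text features, correlated with the images

# Learn projections that maximize the correlation between the two views.
cca = CCA(n_components=10)
cca.fit(img, txt)

# Both modalities are now directly comparable in the 10-dimensional shared subspace.
img_c, txt_c = cca.transform(img, txt)
print(img_c.shape, txt_c.shape)                      # (200, 10) (200, 10)
```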
Recently, methods that model multimodal correlations with deep learning models and deep feature learning have been proposed [31–36]. These methods have a stronger ability to model nonlinear correlations. Among them, the Multimodal Deep Belief Network (MultimodalDBN) [31] and the Bimodal Deep Autoencoder (BimodalAE) [32] are two representative methods. MultimodalDBN consists of two separate DBNs and a joint Restricted Boltzmann Machine (RBM) layer on top of them: the features of each modality are modeled by a separate DBN, and the joint RBM then combines the outputs of the DBNs to model the joint distribution of the modalities. BimodalAE is a deep autoencoder network and an extension of the RBM for modeling multiple modalities: it learns a separate representation for each modality with two sub-networks, and a common representation is then learned by a joint layer. BimodalAE can learn high-order correlations by minimizing the reconstruction error between the original features and the reconstructed representations.
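A minimal PyTorch sketch of the bimodal-autoencoder idea just described (not the exact architecture of [32]): each modality is encoded by its own sub-network, the codes are fused in a joint layer, and both modalities are reconstructed from the shared code. The layer sizes and the training loop here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, dim_a=64, dim_b=32, hidden=48, shared=16):
        super().__init__()
        # Modality-specific encoders.
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        # Joint layer producing the shared representation.
        self.joint = nn.Linear(2 * hidden, shared)
        # Decoders reconstruct both modalities from the shared code.
        self.dec_a = nn.Linear(shared, dim_a)
        self.dec_b = nn.Linear(shared, dim_b)

    def forward(self, a, b):
        z = self.joint(torch.cat([self.enc_a(a), self.enc_b(b)], dim=1))
        return self.dec_a(z), self.dec_b(z), z

# Train by minimizing the reconstruction error of both modalities.
model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
a, b = torch.randn(256, 64), torch.randn(256, 32)   # stand-in paired features
for _ in range(100):
    rec_a, rec_b, _ = model(a, b)
    loss = nn.functional.mse_loss(rec_a, a) + nn.functional.mse_loss(rec_b, b)
    opt.zero_grad(); loss.backward(); opt.step()
```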
However, these methods only learn the correlations between two data modalities and cannot handle three or more types of media data, so the semantic information and the correlation information hidden among additional media types are ignored.
3. The proposed method
In this section, we present the proposed approach, named unified space learning. The framework of our solution is illustrated in Fig. 1. First of all, we assume that different views of a semantic concept (i.e., different types of media objects) can be embedded into a unified space regardless of media type; in that space, heterogeneous media objects share the same representation. This is a reasonable assumption, analogous to mapping people and labels into a space in which their locations reflect their characteristics. If an audio sample is close to a 3D sample, they share the same semantic concepts; as a consequence, distance and relevance are negatively correlated. Our approach aims to learn a projection matrix for each view (i.e., for the i-th media type). Considering that a large fraction of multimedia data carries no labels at all [4], we learn the projection matrices for heterogeneous media data using both labeled and unlabeled data. To obtain the unified space, we present an iterative algorithm that solves the optimization problem. Once the projection matrices are obtained, the different types of media data can be mapped into the unified space, where the structure of the original spaces and the pairwise correlations between multimedia objects are preserved. Given a new media sample, we can easily embed it into the unified space, and the associated labels of its neighbors can then be assigned to it.
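The paper's exact objective function and iterative solver appear in pages not included in this extract, so the following Python sketch only illustrates the overall workflow described above under simple assumptions: each media type receives its own projection matrix, learned here with a ridge-regression mapping from labeled features to a shared label space as a stand-in objective, after which any sample (including out-of-sample ones) is projected into the unified space and annotated by neighbor search.

```python
import numpy as np

def learn_projection(feats, labels, reg=1.0):
    """Ridge-regression stand-in for one modality's projection matrix:
    maps d-dimensional features into the c-dimensional unified (label) space."""
    d = feats.shape[1]
    return np.linalg.solve(feats.T @ feats + reg * np.eye(d), feats.T @ labels)

def annotate(query_feat, proj, unified_pts, unified_tags, k=5):
    """Project an out-of-sample item and transfer the labels of its neighbors."""
    q = query_feat @ proj
    nbrs = np.argsort(np.linalg.norm(unified_pts - q, axis=1))[:k]
    return set().union(*(unified_tags[i] for i in nbrs))

# Toy data: two modalities (e.g., image and audio) sharing 3 semantic labels.
rng = np.random.default_rng(0)
Y = np.eye(3)[rng.integers(0, 3, size=60)]                        # one-hot labels
img_feats = Y @ rng.normal(size=(3, 40)) + 0.1 * rng.normal(size=(60, 40))
aud_feats = Y @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(60, 20))

P_img = learn_projection(img_feats, Y)        # one projection matrix per modality
P_aud = learn_projection(aud_feats, Y)

# The unified space holds the projections of all modalities together.
unified = np.vstack([img_feats @ P_img, aud_feats @ P_aud])
tags = [{f"label_{y.argmax()}"} for y in Y] * 2

# Annotate a new audio clip by its neighbors, which may come from any modality.
print(annotate(aud_feats[0], P_aud, unified, tags))
```

In the actual USL method, the projection matrices are learned jointly within a semi-supervised objective over both labeled and unlabeled data, rather than independently per modality as in this sketch.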
(The remaining 9 pages of the paper are not included in this extract.)