Heterogeneous multimedia cooperative annotation based on multimodal correlation learning

Feng Tian (corresponding author, tianfeng@nepu.edu.cn), Quge Wang, Xin Li, Ning Sun
School of Computer and Information Technology, Northeast Petroleum University, DaQing 163318, China

J. Vis. Commun. Image R. 58 (2019) 544–553. https://doi.org/10.1016/j.jvcir.2018.12.028
This article is part of the Special Issue on Multimodal_Cooperation. This work is supported by the Natural Science Foundation of China (No. 61502094) and the Natural Science Foundation of Heilongjiang Province of China (No. F2016002).
Article history: Received 13 July 2018; Revised 26 November 2018; Accepted 11 December 2018; Available online 13 December 2018.
Keywords: Multimedia annotation; Cooperative annotation; Multimodal correlation learning
Abstract
Rich multimedia content dominates the current Web. On popular social media platforms such as Facebook, Twitter, and Instagram, millions of multimedia items are created by users. At the same time, multimedia data consists of data in multiple modalities, such as text, images, videos, audio, and time series. Many research efforts have been devoted to multimedia annotation to further improve its performance; however, the prevailing methods are designed for single-media annotation tasks. In fact, heterogeneous media content describes a given label from the perspective of its own modality, and the modalities are complementary to each other, so it becomes critical to explore advanced techniques for heterogeneous data analysis and multimedia annotation. Inspired by this idea, this paper presents a new multimodal correlation learning method for heterogeneous multimedia cooperative annotation, named unified space learning, which projects heterogeneous media data into one unified space. We formulate the multimedia annotation task as a semi-supervised learning framework in which we learn a different projection matrix for each media type. By doing so, different media content is aligned cooperatively and jointly provides a more complete profile of the given semantic labels. Experimental results on a data set with images, audio clips, videos, and 3D models show that the proposed approach is more effective than the compared methods.
© 2018 Elsevier Inc. All rights reserved.
1. Introduction
Web-based sharing services, coupled with the growing amount of multimedia content available online, dominate the current Web. Managing multimedia data requires an effective retrieval mechanism [1–4]. Compared with instance-based multimedia retrieval, keyword-based methods are more convenient, since it is easier for users to provide keywords than multimedia samples. However, keyword-based retrieval requires text labels for multimedia data, so multimedia annotation, which assigns labels to multimedia objects, plays a significant role. Despite being studied extensively, most media annotation methods focus on a specific single type of media, and their performance is unsatisfactory [3,4]. In fact, different types of media data usually contain complementary information, and a semantic concept can be described from different views. Inspired by this, we can provide more diverse and accurate labels for multimedia data. However, because of the semantic gap, it is extremely difficult to measure the similarity between heterogeneous multimedia data in their original spaces, even when they share the same semantics (e.g., the 3D model of a wolf and an audio clip of a wolf). Moreover, how to learn the correlation between heterogeneous media data remains an open problem.
To further improve the performance of multimedia annotation, a possible solution is to model the correlations among heterogeneous media data and learn a unified representation for them; by doing so, we can make full use of their complementarity. In this paper, we present a novel multimodal correlation learning approach for heterogeneous media data, named unified space learning (USL for short), which predicts labels by exploring the associations between different types of media data cooperatively.
In summary, the advantages and characteristics of the method are as follows:
(1) We approach heterogeneous media cooperative annotation through multimodal correlation learning. We learn a unified space by taking both the correlation between heterogeneous multimedia data and the
semantic information into account simultaneously. By doing so, samples in the space are comparable, and the annotation problem is transformed into a neighbor-search problem.
(2) To make full use of the complementarity among heterogeneous multimedia data and to handle out-of-sample data, we learn a different projection matrix for each type of media simultaneously. By doing so, heterogeneous multimedia data are aligned with each other, which makes the method more robust to noise.
(3) To make the proposed method more suitable for real environments, we incorporate both labeled and unlabeled data into the learning framework; optimizing them together makes the solution smoother in the unified space. We conduct experiments on a heterogeneous media data set, exploring the correlations among four types of media data cooperatively, and the experimental results demonstrate the effectiveness of the method.
The rest of this paper is organized as follows. In Section 2, we review the existing related methods. The proposed method is introduced in detail in Section 3. In Section 4, extensive experiments are conducted to demonstrate the superiority of the proposed method. Finally, we conclude this work in Section 5.
2. Related work
In this section, we will review recent works closely related to
multimedia annotation [5–14] and media correlation analysis
[15–32].
2.1. Multimedia annotation
Recently, many methods have been proposed to solve the problem of multimedia annotation, and they basically focus on single-type media annotation, such as image annotation, audio annotation, and video annotation. Topic models find the correlation between media objects and labels through latent variables [5–8]. From the probabilistic modeling perspective, the problem of media annotation can be transformed into recovering the latent variables that describe the distribution of the media data and labels. Among them, latent Dirichlet allocation [7] goes beyond probabilistic latent semantic analysis [8] with a Bayesian network that models media data as a mixture over a set of topics; however, it is difficult to select the number of mixture components. Another important line of work is based on matrix factorization [9,10] and graphs [11,12]. These models still have limitations: given an unseen media object, the cost function needs to be recomputed. In other words, these methods cannot solve the "out-of-sample" problem. Instance-based methods estimate the relevance between a label and the content of a media object by accumulating votes from its neighbors; the most commonly used methods are neighbor voting algorithms [13,14].
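As an illustration of the neighbor-voting idea described above (a minimal sketch, not the exact algorithm of [13,14]), the following Python snippet scores a candidate tag for an unlabeled item by counting how many of its nearest labeled neighbors carry that tag; the feature extraction step and the data set are assumed to be given.

```python
import numpy as np

def neighbor_vote_scores(query_feat, train_feats, train_tags, all_tags, k=10):
    """Score each candidate tag for a query item by neighbor voting.

    query_feat  : (d,) feature vector of the unlabeled item
    train_feats : (n, d) features of labeled items
    train_tags  : list of n tag sets, one per labeled item
    all_tags    : candidate tag vocabulary
    """
    # Euclidean distances to all labeled items, then take the k nearest.
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]

    # A tag's score is the fraction of the k neighbors annotated with it.
    scores = {}
    for tag in all_tags:
        votes = sum(1 for i in neighbors if tag in train_tags[i])
        scores[tag] = votes / k
    return scores

# Toy usage with random features and two tags (illustrative only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))
tags = [{"wolf"} if i < 50 else {"car"} for i in range(100)]
print(neighbor_vote_scores(feats[0], feats, tags, ["wolf", "car"], k=5))
```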
2.2. Media correlation analysis
Recently, learning from different views has become a very promising topic with wide applicability [15–21]. In [15], the performance of the visual bag-of-words model is improved by exploiting the coherence between multiple feature spaces. In [16], the performance of multimedia retrieval is greatly improved by combining visual and text features. In [17], the performance of cross-media retrieval is improved by integrating the correlations between data of different modalities. It is notable that the essence of multimedia annotation is to learn the association between the semantic concept and the media content [22]. Canonical Correlation Analysis (CCA) [23] has been popular for its performance and its capability of modeling two sets of modalities: CCA computes a shared subspace of the two modalities by maximizing the correlation between them. Although CCA has been widely used in different fields, it still has limitations, and many methods based on CCA have been proposed [24]. Among them, Cross-modal Factor Analysis (CFA) [25] and the kernelization of CCA (KCCA) [26] are representative. Instead of maximizing the pairwise correspondence between two sets of features, CFA minimizes the Frobenius norm between the two modalities; compared with CCA, CFA shows advantages in cross-modal information retrieval. Because CCA ignores the nonlinearities of multimodal data, KCCA has been proposed to capture more properties of multimodal data by embedding the data into a higher-dimensional feature space. As the nonlinear extension of CCA, KCCA has been applied in many fields: it has been exploited to learn the mapping between visual words and text corpora [27], and it has also been successfully applied in cross-media retrieval [28], computational biology [29], and audio recognition [30].
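To make the two-modality correlation learning performed by CCA concrete, the following sketch uses scikit-learn's CCA implementation to project paired image and text features into a shared subspace; the feature matrices are random stand-ins for real extracted features.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200                                              # number of paired samples
img = rng.normal(size=(n, 64))                       # image features (e.g., a CNN or BoW descriptor)
txt = img[:, :32] + 0.1 * rng.normal(size=(n, 32))   # text features, correlated with the images

# Learn projections that maximize the correlation between the two views.
cca = CCA(n_components=10)
cca.fit(img, txt)

# Both modalities are now directly comparable in the 10-dimensional shared subspace.
img_c, txt_c = cca.transform(img, txt)
print(img_c.shape, txt_c.shape)                      # (200, 10) (200, 10)
```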
Recently, methods that model multimodal correlations with deep learning models and deep feature learning have been proposed [31–36]. These methods have a stronger ability to model nonlinear correlations. Among them, the Multimodal Deep Belief Network (MultimodalDBN) [31] and the Bimodal Deep Autoencoder (BimodalAE) [32] are two representative methods. MultimodalDBN consists of two separate DBNs and a joint Restricted Boltzmann Machine (RBM) layer on top of them: the features of each modality are modeled by a separate DBN, and the joint RBM then combines the outputs of the DBNs to model the joint distribution of the modalities. BimodalAE is a deep autoencoder network and an extension of the RBM for modeling multiple modalities: it learns a separate representation for each modality with two sub-networks, and a common representation is then learned by a joint layer. BimodalAE can learn high-order correlations by minimizing the reconstruction error between the original features and the reconstructed representations.
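A minimal PyTorch sketch of the bimodal-autoencoder idea just described (not the exact architecture of [32]): each modality is encoded by its own sub-network, the codes are fused in a joint layer, and both modalities are reconstructed from the shared code. The layer sizes and the training loop here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, dim_a=64, dim_b=32, hidden=48, shared=16):
        super().__init__()
        # Modality-specific encoders.
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        # Joint layer producing the shared representation.
        self.joint = nn.Linear(2 * hidden, shared)
        # Decoders reconstruct both modalities from the shared code.
        self.dec_a = nn.Linear(shared, dim_a)
        self.dec_b = nn.Linear(shared, dim_b)

    def forward(self, a, b):
        z = self.joint(torch.cat([self.enc_a(a), self.enc_b(b)], dim=1))
        return self.dec_a(z), self.dec_b(z), z

# Train by minimizing the reconstruction error of both modalities.
model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
a, b = torch.randn(256, 64), torch.randn(256, 32)   # stand-in paired features
for _ in range(100):
    rec_a, rec_b, _ = model(a, b)
    loss = nn.functional.mse_loss(rec_a, a) + nn.functional.mse_loss(rec_b, b)
    opt.zero_grad(); loss.backward(); opt.step()
```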
However, these methods only learn the correlations between two data modalities and cannot handle three or more types of media data, so the semantic information and the correlation information hidden among additional media types are ignored.
3. The proposed method
In this section, we present the proposed approach, named unified space learning. The framework of our solution is illustrated in Fig. 1. First of all, we assume that different views of a semantic concept (i.e., different types of media objects) can be embedded into a unified space regardless of media type; in that space, heterogeneous media objects share the same representation. This is a reasonable assumption, analogous to mapping people and labels into a space in which their locations reflect their characteristics. If an audio sample is close to a 3D sample, they share the same semantic concepts; as a consequence, distance and relevance are negatively correlated. Our approach aims to learn a projection matrix for each view (i.e., for the i-th media type). Considering that a large fraction of multimedia data carries no labels at all [4], we learn the projection matrices for heterogeneous media data using both labeled and unlabeled data. To obtain the unified space, we present an iterative algorithm that solves the optimization problem. Once the projection matrices are obtained, the different types of media data can be mapped into the unified space, where the structure of the original spaces and the pairwise correlations between multimedia objects are preserved. Given a new media sample, we can easily embed it into the unified space, and the associated labels of its neighbors can then be assigned to it.
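The paper's exact objective function and iterative solver appear in pages not included in this extract, so the following Python sketch only illustrates the overall workflow described above under simple assumptions: each media type receives its own projection matrix, learned here with a ridge-regression mapping from labeled features to a shared label space as a stand-in objective, after which any sample (including out-of-sample ones) is projected into the unified space and annotated by neighbor search.

```python
import numpy as np

def learn_projection(feats, labels, reg=1.0):
    """Ridge-regression stand-in for one modality's projection matrix:
    maps d-dimensional features into the c-dimensional unified (label) space."""
    d = feats.shape[1]
    return np.linalg.solve(feats.T @ feats + reg * np.eye(d), feats.T @ labels)

def annotate(query_feat, proj, unified_pts, unified_tags, k=5):
    """Project an out-of-sample item and transfer the labels of its neighbors."""
    q = query_feat @ proj
    nbrs = np.argsort(np.linalg.norm(unified_pts - q, axis=1))[:k]
    return set().union(*(unified_tags[i] for i in nbrs))

# Toy data: two modalities (e.g., image and audio) sharing 3 semantic labels.
rng = np.random.default_rng(0)
Y = np.eye(3)[rng.integers(0, 3, size=60)]                        # one-hot labels
img_feats = Y @ rng.normal(size=(3, 40)) + 0.1 * rng.normal(size=(60, 40))
aud_feats = Y @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(60, 20))

P_img = learn_projection(img_feats, Y)        # one projection matrix per modality
P_aud = learn_projection(aud_feats, Y)

# The unified space holds the projections of all modalities together.
unified = np.vstack([img_feats @ P_img, aud_feats @ P_aud])
tags = [{f"label_{y.argmax()}"} for y in Y] * 2

# Annotate a new audio clip by its neighbors, which may come from any modality.
print(annotate(aud_feats[0], P_aud, unified, tags))
```

In the actual USL method, the projection matrices are learned jointly within a semi-supervised objective over both labeled and unlabeled data, rather than independently per modality as in this sketch.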
(The remaining 9 pages of the paper are not included in this extract.)