International Journal of Robotics and Automation
An Adversarial and Deep Hashing-Based Hierarchical Supervised Cross-modal Image
and Text Retrieval Algorithm
--Manuscript Draft--
Manuscript Number: 206-0919
Full Title: An Adversarial and Deep Hashing-Based Hierarchical Supervised Cross-modal Image
and Text Retrieval Algorithm
Article Type: Full Article
Keywords: Cross-modal image and text retrieval; deep hash algorithm; hierarchical supervision;
adversarial network
Manuscript received DD Month YYYY
AN ADVERSARIAL AND DEEP HASHING-BASED
HIERARCHICAL SUPERVISED CROSS-MODAL
IMAGE AND TEXT RETRIEVAL ALGORITHM
Abstract
With the rapid development of robotics and sensor technology, vast amounts of valuable multimodal data are collected. For the many robots that perform automated tasks, finding relevant multimodal information quickly and efficiently in such large volumes of data is critical. In this paper, we propose an adversarial and deep hashing-based hierarchical supervised cross-modal image and text retrieval algorithm that performs semantic analysis and association modeling on images and text by making full use of the rich semantic information of the label hierarchy. First, the modal adversarial block and the modal differentiation network perform adversarial learning against each other to draw different modalities with the same semantics as close as possible in a common subspace. Second, an intra-label-layer similarity loss and an inter-label-layer correlation loss are used to fully exploit the intrinsic similarity within each label layer and the correlation between label layers. Finally, an objective function for data with different semantics is redesigned to keep such data far apart in the common subspace, thus preventing data with different semantics from interfering with retrieval. Experimental results on two cross-modal retrieval datasets with hierarchically supervised information show that the proposed method substantially enhances retrieval performance and consistently outperforms other state-of-the-art methods.
Key Words
Cross-modal image and text retrieval; deep hash algorithm; hierarchical supervision; adversarial network
1. Introduction
In recent years, various types of intelligent robots [1] have developed rapidly. Cross-modal retrieval [2, 3], a key technology that enables robots to accomplish automated tasks through the understanding of multimodal content, is the process of taking a query from one modality and returning the data from other modalities that are most semantically relevant to it.
Many approaches have been proposed to address cross-modal retrieval. Traditional methods [4-9] construct feature matrices for the different media, project them uniformly into a shared subspace, and then use distance metrics such as Euclidean or cosine distance to measure the similarity between heterogeneous modalities. Canonical Correlation Analysis (CCA) [4] is widely used in cross-modal retrieval, and many cross-modal retrieval methods have been built on it. However, most traditional methods rely on hand-designed features, and it remains difficult for them to bridge the "heterogeneity gap" effectively.
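To make the shared-subspace idea concrete, the following is a minimal numpy sketch of CCA (a generic textbook formulation, not the exact procedure of any cited method): each view's covariance is whitened and the SVD of the whitened cross-covariance yields projections into a common subspace, where similarity can then be measured by cosine or Euclidean distance.

```python
import numpy as np

def inv_sqrt(S, eps=1e-6):
    # Inverse matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def cca(X, Y, k=2, reg=1e-3):
    """Return k-dimensional projection matrices (Wx, Wy) and the top-k
    canonical correlations for two views X (n x p) and Y (n x q)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    # Singular vectors of the whitened cross-covariance give the projections.
    U, s, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
    Wx = inv_sqrt(Sxx) @ U[:, :k]
    Wy = inv_sqrt(Syy) @ Vt.T[:, :k]
    return Wx, Wy, s[:k]
```

In a retrieval setting, image features would be projected with `Wx`, text features with `Wy`, and candidates ranked by distance in the shared space.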
Deep neural networks have made progress in many fields such as computer vision [10, 11] and natural language processing [12, 13], and they have also been effectively adopted in cross-modal retrieval. However, deep learning methods [14-16] suffer from high storage costs and slow retrieval speed when applied to cross-modal retrieval of large-scale data.
For the storage and retrieval of large-scale cross-modal data, hashing algorithms [17-21] are widely favored for their low storage cost and high retrieval efficiency. Jiang et al. [22] proposed deep cross-modal hashing (DCMH), which integrates feature learning and hash-code learning into a unified framework. Li et al. [23] proposed the self-supervised adversarial hashing (SSAH) method, which builds self-supervised semantic networks by using labels as self-supervised information.
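The efficiency argument for hashing can be sketched as follows (a generic illustration, not DCMH or SSAH themselves): features are binarized through the sign of a projection, standing in for a learned hash layer, and a database is ranked by Hamming distance, which needs only bit comparisons over compact binary codes.

```python
import numpy as np

def to_codes(features, projection):
    # Sign of a projection yields binary hash codes. The random projection
    # here is a hypothetical stand-in for a deep network's learned hash layer.
    return (features @ projection > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    # Hamming distance = number of differing bits between two codes.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    # Stable sort so ties keep database order; nearest codes come first.
    return np.argsort(dists, kind="stable")
```

In production systems the 0/1 arrays would be bit-packed so the distance becomes a hardware XOR plus popcount, which is where the storage and speed advantages come from.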
Most existing cross-modal retrieval methods are designed for non-hierarchically structured supervision and cannot fully exploit the supervisory information in the labels. However, in many real-world application scenarios, the label supervision of cross-modal data often has a hierarchical structure with rich semantic information. For example, in the field of public security, images or videos automatically collected by robots through sensors may carry multiple layers of label supervision information.
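As a toy illustration of such layered supervision (the label values are hypothetical), similarity between two samples can be graded by how many label layers they share, so a pair agreeing only on the coarse layer is partially similar rather than simply dissimilar:

```python
import numpy as np

def layered_similarity(labels_a, labels_b):
    """labels_*: lists of per-sample label tuples, one entry per label layer,
    e.g. (coarse_class, fine_class). Similarity is the fraction of layers on
    which two samples agree: sharing only the coarse layer of a two-layer
    hierarchy scores 0.5, sharing both layers scores 1.0."""
    A = np.asarray(labels_a)
    B = np.asarray(labels_b)
    # Broadcast to an (n, m, layers) agreement tensor, then average layers.
    return (A[:, None, :] == B[None, :, :]).mean(axis=2)
```

A supervision matrix of this kind is one simple way to encode both intra-layer similarity and the coarse-to-fine relation between layers.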
Only a few existing methods are designed for hierarchically structured label supervision. Wang et al. [24] proposed supervised hierarchical deep hashing (SHDH), which defines a similarity formula that weights the different levels of the labeled hierarchical supervision and verifies that hierarchical information can improve hash retrieval accuracy. However, this method is designed for single-modal retrieval. To verify the effectiveness of hierarchically structured labels in cross-modal retrieval, Sun et al. [25] proposed supervised hierarchical cross-modal hashing (HiCHNet), which learns hierarchical information and regularized cross-modal hashing simultaneously. However, these methods have the following problems:
• The distance between multimodal data with the same semantic information in the common subspace is
not sufficiently minimized.
• The inter-layer correlation of the supervisory information is not sufficiently considered, so complex inter-layer correlation information is not fully learned.
• Cross-modal retrieval is subject to interference from dissimilar data.
To address the above problems, we propose a novel method for hierarchical supervised cross-modal image and text retrieval. The contributions of this study are as follows:
• The feature extraction network and the modality differentiation network, acting as generator and adversary respectively, perform adversarial learning against each other so that different modalities with the same semantics end up as close as possible in the common space.
• The intra-label layer similarity loss and inter-label layer correlation loss are introduced to fully explore
the intrinsic similarity existing in each layer of labels and the correlation existing between label layers,
thus improving the accuracy of cross-modal retrieval.
• An objective function for the distance between different semantic categories of data is redesigned to
keep the modal data of different semantic categories distant from each other in the common space.
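The adversarial idea in the first contribution can be sketched as a min-max game (a minimal linear discriminator for illustration, not the paper's actual network): the discriminator learns to tell which modality a common-subspace feature came from, while the feature networks are trained to fool it, pushing the two modalities toward the same distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modality_adversarial_losses(img_feats, txt_feats, w, b):
    """A linear modality discriminator on common-subspace features:
    it predicts 1 for image features and 0 for text features."""
    p_img = sigmoid(img_feats @ w + b)
    p_txt = sigmoid(txt_feats @ w + b)
    eps = 1e-9
    # The discriminator minimizes cross-entropy for separating modalities...
    d_loss = -(np.log(p_img + eps).mean() + np.log(1.0 - p_txt + eps).mean())
    # ...while the feature networks (the "generators") maximize the same
    # quantity, i.e. minimize its negation, so the discriminator cannot
    # tell the modalities apart.
    g_loss = -d_loss
    return d_loss, g_loss
```

Training alternates gradient steps on the two losses; at the equilibrium of this game the discriminator is reduced to guessing, which is exactly the state where same-semantics features from different modalities are indistinguishable in the common space.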