IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 3, MARCH 2014 525
An Embedded System-on-Chip Architecture for
Real-time Visual Detection and Matching
Jianhui Wang, Sheng Zhong, Luxin Yan, Member, IEEE, and Zhiguo Cao
Abstract—Detecting and matching image features is a funda-
mental task in video analytics and computer vision systems. It
establishes the correspondences between two images taken at
different time instants or from different viewpoints. However,
its large computational complexity has been a challenge to most
embedded systems. This paper proposes a new FPGA-based em-
bedded system architecture for feature detection and matching.
It consists of scale-invariant feature transform (SIFT) feature
detection, as well as binary robust independent elementary
features (BRIEF) feature description and matching. It is able to
establish accurate correspondences between consecutive frames
for 720-p (1280 × 720) video. It optimizes the FPGA architecture
for the SIFT feature detection to reduce the utilization of
FPGA resources. Moreover, it implements the BRIEF feature
description and matching on FPGA. Due to these contributions,
the proposed system achieves feature detection and matching
at 60 frame/s for 720-p video. Its processing speed can meet
and even exceed the demand of most real-life real-time video
analytics applications. Extensive experiments have demonstrated
its efficiency and effectiveness.
Index Terms—Binary robust independent elementary features
(BRIEF), feature detection and matching, field programmable
gate array (FPGA), scale-invariant feature transform (SIFT),
system-on-chip (SoC).
I. Introduction
EFFICIENT detection and reliable matching of visual
features is a fundamental problem in computer vision
applications, such as object recognition, structure from motion,
image indexing, and visual localization. Most of these
applications must detect and match visual features in real
time. Although feature detection and matching methods have
been studied in the literature, their computational
complexity means that pure software implementations, without
special hardware, fall far short of real-time performance.
This paper focuses
on a new hardware design to enable real-time performance of
establishing correspondences between two consecutive frames
of high-resolution video.
Manuscript received March 21, 2013; revised June 19, 2013 and July 31,
2013; accepted August 5, 2013. Date of publication August 29, 2013; date
of current version March 4, 2014. This work was supported in part by the
National Pre-Research Foundation under Grant 625010221. This paper was
recommended by Associate Editor T.-S. Chang.
The authors are with the Science and Technology on Multi-Spectral
Information Processing Laboratory, School of Automation, Huazhong
University of Science and Technology, Wuhan 430074, China (e-mail:
wang.ddu@gmail.com; zhongsheng@hust.edu.cn; yanluxin@gmail.com;
zgcao@hust.edu.cn).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2013.2280040
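To make the real-time target concrete, the short Python sketch below works out the pixel throughput and per-frame time budget implied by 720-p video at 60 frame/s. It counts active pixels only; blanking intervals and the clock frequency of the actual design are not considered here, and the function names are hypothetical.

```python
def pixel_rate(width: int, height: int, fps: int) -> int:
    """Active pixels that must be processed per second."""
    return width * height * fps


def frame_budget_ms(fps: int) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


if __name__ == "__main__":
    # 720-p at 60 frame/s, the operating point stated in the abstract.
    print(pixel_rate(1280, 720, 60))          # 55296000
    print(f"{frame_budget_ms(60):.2f} ms")    # 16.67 ms
```

Under these assumptions, a design that consumes one pixel per clock cycle needs to sustain roughly 55.3 million pixels per second, with about 16.67 ms available per frame.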
In the literature, there are many different methods to detect
local features in an image, such as Harris [1], scale-invariant
feature transform (SIFT) [2], and SURF [3]. SIFT is one of
the most effective methods to detect and describe distinctive
invariant features from images. Its significant advantage over
other methods is that the SIFT feature is invariant to image
translation, scaling, and rotation, while at the same time quite
robust to illumination changes. However, it is known that it is
very difficult, if not impossible, to achieve software-based real-
time computing of SIFT due to its computational complexity.
Recently, there have been some studies using special hardware
[4]–[7] to accelerate the detection part of the SIFT algorithm,
and some of these works may achieve satisfactory real-time
performance, such as the design in [7]. However, to the best of our
knowledge, a full-fledged feature detection, description, and
matching system is yet to be designed. Besides the detection
part of SIFT, obtaining the SIFT feature descriptors is also
critical, and it has been the performance bottleneck of the
whole system because it is very difficult, if not impossible,
to integrate the description part of SIFT into an FPGA. The main
challenge is its operational complexity, which prevents it from
being parallelized effectively.
There have been many modifications and variants of the
original SIFT descriptor to speed it up at the algorithmic
level. Broadly speaking, these methods can be divided into
two classes. One is to reduce the size of the SIFT feature by
applying dimensionality reduction, such as principal component
analysis (PCA) [8], to the original SIFT feature descriptor.
The other is to quantize its floating-point coordinates into
integer codes with fewer bits, such as the results presented in
[9]–[11]. Building on these contributions, Calonder et al.
[12] presented a method to extract feature descriptors very
efficiently, called binary robust independent elementary features
(BRIEF), which greatly reduces the memory needed
to store the feature descriptors and the time needed to
match the features, while yielding comparable recognition
accuracy.
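To illustrate why a binary descriptor is cheap to store and match, the following is a minimal Python sketch of a BRIEF-style descriptor and Hamming-distance matching. It is not the hardware design proposed in this paper. It assumes a 256-bit descriptor, a 31 × 31 patch, uniformly random test pairs, and a grayscale image that has already been smoothed; the function names and parameter values are hypothetical choices for illustration.

```python
import numpy as np


def make_test_pairs(n_bits=256, patch_size=31, seed=0):
    """Draw n_bits random pixel-pair offsets inside a square patch.

    Each row is (dy1, dx1, dy2, dx2), relative to the keypoint.
    Uniform sampling is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    half = patch_size // 2
    return rng.integers(-half, half + 1, size=(n_bits, 4))


def brief_descriptor(image, keypoint, pairs):
    """Build a packed binary descriptor from pairwise intensity tests.

    The image is assumed to be pre-smoothed, and the keypoint must lie
    at least half a patch away from every image border.
    """
    y, x = keypoint
    a = image[y + pairs[:, 0], x + pairs[:, 1]]
    b = image[y + pairs[:, 2], x + pairs[:, 3]]
    bits = (a < b).astype(np.uint8)   # one bit per intensity test
    return np.packbits(bits)          # 256 bits -> 32 bytes


def hamming_distance(d1, d2):
    """Number of differing bits: XOR followed by a bit count."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())


def match(descs_a, descs_b):
    """For each descriptor in descs_a, index of its nearest in descs_b."""
    return [
        min(range(len(descs_b)),
            key=lambda j: hamming_distance(d, descs_b[j]))
        for d in descs_a
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
    pairs = make_test_pairs()
    d = brief_descriptor(img, (32, 32), pairs)
    print(d.nbytes)                # 32
    print(hamming_distance(d, d))  # 0
```

With these assumed parameters each descriptor occupies 32 bytes, compared with 128 values for the standard SIFT descriptor, and matching reduces to an XOR and a bit count rather than floating-point distance computation.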
In order to achieve real-time feature detection and matching
for 720-p video, we propose to replace the original SIFT
descriptor with the BRIEF descriptor in this paper. Considering
the space, power, and real-time constraints of an embedded
system, we implement the whole system on a single FPGA
chip. This system consists of SIFT detection, BRIEF descrip-
tion, and BRIEF matching. The proposed FPGA-based feature
1051-8215 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.