FPGA-Based Parallel Hardware Architecture
for Real-Time Image Classification
Murad Qasaimeh, Assim Sagahyroon, and Tamer Shanableh
Abstract—This paper proposes a parallel hardware architecture
for real-time image classification based on scale-invariant fea-
ture transform (SIFT), bag of features (BoFs), and support vector
machine (SVM) algorithms. The proposed architecture exploits
different forms of parallelism in these algorithms in order to accel-
erate their execution to achieve real-time performance. Different
techniques have been used to parallelize the execution and reduce
the hardware resource utilization of the computationally intensive
steps in these algorithms. The architecture takes a 640 × 480 pixel
image as an input and classifies it based on its content within
33 ms. A prototype of the proposed architecture is implemented on
an FPGA platform and evaluated using two benchmark datasets:
1) Caltech-256 and 2) the Belgium Traffic Sign datasets. The archi-
tecture is able to detect up to 1270 SIFT features per frame with
an increment of 380 extra features from t he best recent implemen-
tation. We were able to speedup the feature extraction algorithm
when compared to an equivalent software implementation by 54×
and for classification algorithm by 6×, while maintaining the
difference in classification accuracy within 3%. The hardware
resources utilized by our architecture were also less than those
used by other existing solutions.
Index Terms—Field-programmable gate array (FPGA), hard-
ware implementation, image classification, scale-invariant feature
transform (SIFT).
I. INTRODUCTION
Object detection and classification is the process of find-
ing objects of a certain class, such as faces, cars, and
buildings, in an image or a video frame. This task involves clas-
sifying the input image according to its visual content into a
general class of similar objects. Feature-based object classifi-
cation is a common image classification method in computer
vision. It typically uses a feature extraction algorithm to capture the image's salient content and then a classification algorithm to assign the image to one of the trained categories. Image classification has many potential applications including autonomous robots, intelligent traffic systems, human-computer interaction, quality control in production lines,
and biomedical image analysis.
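For concreteness, the following is a minimal software sketch of such a feature-based pipeline (SIFT descriptors, a bag-of-features histogram, and an SVM), assuming OpenCV and scikit-learn as the underlying libraries. The codebook size and SVM parameters are illustrative placeholders rather than the values used in this work; the architecture proposed in this paper implements the equivalent SIFT, BoF, and SVM stages directly in hardware.

```python
# Illustrative software sketch of a feature-based classification pipeline
# (SIFT descriptors -> bag-of-features histogram -> SVM). The codebook size
# and SVM parameters below are placeholders, not the values used by the
# hardware architecture described in this paper.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

VOCAB_SIZE = 200  # illustrative visual-word codebook size


def extract_sift(gray_image):
    """Detect SIFT keypoints and return their 128-D descriptors."""
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray_image, None)
    if descriptors is None:
        return np.empty((0, 128), dtype=np.float32)
    return descriptors


def build_vocabulary(descriptor_list):
    """Cluster all training descriptors into a visual-word vocabulary."""
    return KMeans(n_clusters=VOCAB_SIZE, n_init=4).fit(np.vstack(descriptor_list))


def bof_histogram(descriptors, vocabulary):
    """Quantize descriptors to visual words and build a normalized histogram."""
    words = vocabulary.predict(descriptors)
    hist, _ = np.histogram(words, bins=VOCAB_SIZE, range=(0, VOCAB_SIZE))
    return hist / max(hist.sum(), 1)


def train_classifier(histograms, labels):
    """Train an RBF-kernel SVM on bag-of-features histograms."""
    return SVC(kernel="rbf", C=10.0, gamma="scale").fit(histograms, labels)
```

At inference time, a query image would pass through extract_sift and bof_histogram, and the resulting histogram would be fed to the trained SVM's predict method; this is the software flow whose feature extraction and classification stages are parallelized in the proposed architecture.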
Many algorithms have been proposed for feature extraction
and classification in the last two decades. These algorithms
involve a tradeoff between the quality of the extracted features,
the classification accuracy, and the computational complexity
of these algorithms. As an example, the unsophisticated FAST
corner detector [1] executes in 13 ms for a 1024 × 768 image
on a desktop machine, but the extracted features are not robust
to changes in illumination, scale, or rotation. In contrast, a software implementation of a computationally expensive algorithm like SIFT [2] or HOG [3] processes the same image within 1920 ms; however, the extracted features are invariant to changes in illumination, scale, viewpoint, and rotation. The
argument also applies to classification algorithms. A simple
classification algorithm like Naive Bayes takes noticeably less time to execute than a sophisticated algorithm
like RBF-SVM. However, the classification accuracy achieved
by SVM is much higher than that of Naive Bayes. Thus, implementing a robust image classification system using complex algorithms like SIFT [2] and SVM [4] is computationally intensive, and it is very hard to reach real-time performance (33 ms per frame).
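As a rough illustration of this speed gap (not the measurement setup used in this paper), the following sketch times OpenCV's FAST detector against SIFT on a single 1024 × 768 frame; the random test image is a stand-in, and absolute timings depend on the machine, so they will differ from the figures quoted above.

```python
# Rough timing of a lightweight corner detector (FAST) versus SIFT on a
# single 1024 x 768 frame. Illustrative only: the random stand-in image and
# the resulting timings are not the benchmark reported in this paper.
import time

import cv2
import numpy as np

frame = np.random.randint(0, 256, (768, 1024), dtype=np.uint8)  # stand-in image

fast = cv2.FastFeatureDetector_create()
t0 = time.perf_counter()
fast_keypoints = fast.detect(frame, None)
fast_ms = (time.perf_counter() - t0) * 1000.0

sift = cv2.SIFT_create()
t0 = time.perf_counter()
sift_keypoints, _ = sift.detectAndCompute(frame, None)
sift_ms = (time.perf_counter() - t0) * 1000.0

print(f"FAST: {len(fast_keypoints)} keypoints in {fast_ms:.1f} ms")
print(f"SIFT: {len(sift_keypoints)} keypoints in {sift_ms:.1f} ms")
```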
To address the complexity problem, a number of simplified
versions of the SIFT algorithm have been proposed, such as
SURF [5] and the fast SIFT [6] algorithms. SURF reduces the complexity of the SIFT algorithm by approximating the Laplacian of Gaussian (LoG) with filters in the orientation assignment and feature description steps. This reduces the computation time from 1036 ms for SIFT to 354 ms for SURF on a standard Linux PC (Pentium IV, 3 GHz) [5]. Many SURF hardware implementations have been proposed to accelerate the algorithm [7]–[9]. Although SURF reduces the computational time for extracting local features, it produces fewer reliable features than the SIFT algorithm. Moreover, the hardware implementations require a considerable amount of internal memory, almost four times the
memory requirement of SIFT [10], [11].
Other researchers tried to accelerate the SIFT algorithm
using a GPU platform [12]–[14]. The implementation in [12]
achieved a speedup of 4–7× over an optimized CPU version when tested on a dataset of 320 × 280 pixel images. The
power consumption of this implementation on the NVIDIA
Tegra 250 development board was 3383 mW. In [14], another
GPU-based implementation was proposed. It was able to detect
SIFT features in images (640 × 480 pixels) within 58 ms,
which is around 20 frames/s. The GPU-based implementa-
tions can accelerate the SIFT algorithm to reach near real-time
performance, but they require an excessive amount of hard-
ware resources and consume considerably more power than other hardware platforms, which makes GPU implementations unsuitable for portable embedded systems with limited power budgets. Other solutions tried to accelerate the SIFT algorithm