Gradient Pursuits
Thomas Blumensath and Mike E. Davies

"This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible."
Gradient Pursuits
Thomas Blumensath, Member, IEEE, Mike E. Davies, Member, IEEE
Abstract— Sparse signal approximations have become a fundamental tool in signal processing, with wide-ranging applications from source separation to signal acquisition. The ever-growing number of possible applications, and in particular the ever-increasing problem sizes now addressed, lead to new challenges in terms of computational strategies, and the development of fast and efficient algorithms has become paramount.

Recently, very fast algorithms have been developed to solve convex optimisation problems that are often used to approximate the sparse approximation problem. However, it has also been shown that, in certain circumstances, greedy strategies such as Orthogonal Matching Pursuit can have better performance than the convex methods.

In this paper, improvements to greedy strategies are proposed and algorithms are developed that approximate Orthogonal Matching Pursuit with computational requirements more akin to Matching Pursuit. Three different directional optimisation schemes, based on the gradient, the conjugate gradient and an approximation to the conjugate gradient respectively, are discussed. It is shown that the conjugate gradient update leads to a novel implementation of Orthogonal Matching Pursuit, while the gradient-based approach as well as the approximate conjugate gradient method both lead to fast approximations to Orthogonal Matching Pursuit, with the approximate conjugate gradient method being superior to the gradient method.

Index Terms— Sparse Representations/Approximations, Matching Pursuit, Orthogonal Matching Pursuit, Gradient Optimisation, Conjugate Gradient Optimisation.
I. INTRODUCTION
A sparse signal expansion is a signal model that uses a linear
combination of a small number of elementary waveforms
selected from a large collection to represent or approximate
a signal. Such expansions are of increasing interest in signal
processing with applications ranging from source coding [1]
to de-noising [2], source separation [3] and signal acquisition
[4].
Let $x \in \mathbb{R}^M$ be a known vector and $\Phi \in \mathbb{R}^{M \times N}$ a matrix with $M < N$. We will refer to $\Phi$ as the dictionary and call the column vectors $\phi_i$ of $\Phi$ atoms. The problem addressed in this paper is to find a vector $y$ satisfying the relationship

$$x = \Phi y + \epsilon. \tag{1}$$

If we allow for a non-zero error $\epsilon$ we talk about a signal approximation, while for zero $\epsilon$ we have an exact signal representation.

Because $M < N$, there are an infinite number of vectors $y$ satisfying the above equation. It is therefore common to search
The authors are with IDCOM & Joint Research Institute for Signal and
Image Processing, Edinburgh University, King’s Buildings, Mayfield Road,
Edinburgh EH9 3JL, UK (Tel.: +44(0)131 6505659, Fax.: +44(0)131 6506554,
e-mail: thomas.blumensath@ed.ac.uk, mike.davies@ed.ac.uk).
This research was supported by EPSRC grant D000246/1. MED acknowl-
edges support of his position from the Scottish Funding Council and their
support of the Joint Research Institute with the Heriot-Watt University as a
component part of the Edinburgh Research Partnership.
for a vector y optimising a certain sparsity measure. For
example, it is common to look for a vector y with the smallest
number of non-zero elements. The problem of finding such a y
is, however, NP-hard in general [5], [6]. Therefore, different
sub-optimal strategies are used in practice. Commonly used
strategies are often based on convex relaxation, non-convex
(often gradient based) local optimisation or greedy search
strategies. Convex relaxation is used in algorithms such as
Basis Pursuit and Basis Pursuit De-Noising [7], the LASSO
and Least Angle Regression (LARS) [8]. Recently fast algo-
rithms solving the LASSO problem have been suggested in [9]
and [10]. Non-convex local optimisation procedures include
the Focal Underdetermined System Solver FOCUSS [11] and
Bayesian approaches such as the Relevance Vector Machine,
also known as Sparse Bayesian Learning [12], [13] or Monte
Carlo based approaches such as those in [14], [15], [16], [17]
and [18]. In this paper we are interested in greedy methods,
the most important of which are Matching Pursuit (MP) [19],
Orthogonal Matching Pursuit (OMP) [20] and Orthogonal
Least Squares (OLS) [21], also often known as ORMP, OOMP
or, in the regression literature, as forward selection.
MP is an algorithm often used for practical applications and there are now very efficient $O(N \log M)$ (per iteration) implementations [22], [23] whenever $\Phi$ is the union of dictionaries for which fast transforms are available. For general dictionaries MP is $O(NM)$ per iteration. On the other hand, OMP has superior performance. Current implementations, however, are more demanding in terms of both computation time and memory requirements.
In this paper we combine an MP type algorithm with direc-
tional optimisation to derive ‘Directional Pursuit’ algorithms.
These new algorithms use a greedy element selection similar to that of MP and OMP; however, the costly orthogonal projection is (approximately) computed using directional optimisation. We
propose two update directions that can be calculated efficiently
such that the algorithms have the same memory requirements
and computational complexity as MP. A third update direction
is based on the calculation of a conjugate gradient. This
leads to a novel implementation of OMP with computational
requirements similar to currently used methods based on QR
factorisation.
A. Paper Overview
The main part of this paper starts with a review of MP
and OMP in section II. Based on these two algorithms, we
develop the general Directional Pursuit framework in section
III. Particular directions are then suggested in the following
three subsections, starting with the gradient direction in sub-
section III-A, followed by the conjugate gradient in subsection
III-B and an approximate conjugate gradient in subsection
III-C. Section IV takes a closer look at the computational
requirements of the proposed algorithms and compares these to
MP and two different OMP implementations. Section V gives
theoretic bounds on the convergence of the gradient based
algorithm. The paper concludes with a range of experiments
presented in section VI. The first experiment explores how
the proposed approaches compare with MP and OMP, both in terms of approximation performance (subsection VI-A) and in their ability to exactly recover the underlying sparse structure (subsection VI-B). This is followed by an experiment
that highlights the applicability of the methods to an audio
de-noising example where OMP is not feasible (subsection
VI-C). The final example in subsection VI-D analyses the
performance of the methods for compressed sensing and
contrasts the performance to other approaches.
B. Notation
$\Gamma^n$ will denote a set containing the indices of the elements selected up to and including iteration $n$. Using this index set as a subscript, the matrix $\Phi_{\Gamma^n}$ will be a sub-matrix of $\Phi$ containing only those columns of $\Phi$ with indices in $\Gamma^n$. The same convention is used for vectors. For example, $y_{\Gamma^n}$ is a sub-vector of $y$ containing only those elements of $y$ with indices in $\Gamma^n$. In general, the superscript in the subscript of $y_{\Gamma^n}$ reminds us that we are in iteration $n$; on occasion, however, we resort to using superscripts (e.g. $y^n$) to label the iteration. The Gram matrix $G_{\Gamma^n} = \Phi_{\Gamma^n}^T \Phi_{\Gamma^n}$ will also be used frequently. In general, lower case bold face characters represent vectors while upper case bold characters are used for matrices. Individual elements from a vector will be in standard type face with a subscript. For example, $g$ will be used to refer to a gradient vector with $g_i$ denoting the $i$th element of this vector. Inner products between vectors will often be written using angled brackets, e.g. $\langle x, y \rangle = x^T y$.
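For readers who wish to follow the notation in code, the index-set subscripts map directly onto array slicing. The following NumPy fragment is purely illustrative (the variable names and problem sizes are our own choices, not taken from the paper):

```python
import numpy as np

M, N = 64, 256
Phi = np.random.randn(M, N)          # dictionary Phi with M < N
y = np.zeros(N)
Gamma = [3, 17, 42]                  # index set Gamma^n

Phi_Gamma = Phi[:, Gamma]            # sub-matrix Phi_{Gamma^n}
y_Gamma = y[Gamma]                   # sub-vector y_{Gamma^n}
G_Gamma = Phi_Gamma.T @ Phi_Gamma    # Gram matrix G_{Gamma^n}
```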
II. MATCHING PURSUIT AND ORTHOGONAL MATCHING
PURSUIT
The algorithms in this paper approximate a vector $x$ iteratively. In iteration $n$ we calculate an approximation using

$$\hat{x}^n = \Phi_{\Gamma^n} y_{\Gamma^n}, \tag{2}$$

and calculate the approximation error as

$$r^n = x - \hat{x}^n. \tag{3}$$

In each iteration, the approximation error is then used to determine a new element to be selected from $\Phi$ in order to find a better approximation.
Possibly the simplest algorithm of this form is Matching Pursuit (MP) [19]. A new element is selected based on the inner product between the current residual $r^n$ and the columns in $\Phi$, and the corresponding element of $y$ is updated. MP is summarised as follows:

1) Initialise $r^0 = x$, $y^0 = 0$
2) for $n = 1$; $n := n + 1$ till stopping criterion is met
   a) $g^n = \Phi^T r^{n-1}$
   b) $i^n = \arg\max_i |g_i^n|$
   c) $y_{i^n}^n = y_{i^n}^{n-1} + g_{i^n}^n$
   d) $r^n = r^{n-1} - \phi_{i^n} g_{i^n}^n$
3) Output $r^n$, $y^n$
Note the slight abuse of notation in the expression $\arg\max_i |g_i^n|$, which in general could return a set of indices. In this case one of the elements would have to be chosen.
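As an illustration of the listing above, here is a minimal Python sketch of MP. The code is ours, not the authors'; it assumes a NumPy dictionary with (approximately) unit-norm columns and uses a simple residual-norm stopping criterion.

```python
import numpy as np

def matching_pursuit(x, Phi, n_iter=100, tol=1e-6):
    """Minimal MP sketch following steps 1)-3) above (unit-norm atoms assumed)."""
    y = np.zeros(Phi.shape[1])
    r = x.copy()                           # r^0 = x
    for _ in range(n_iter):
        g = Phi.T @ r                      # 2.a) correlations g^n
        i = int(np.argmax(np.abs(g)))      # 2.b) atom selection i^n
        y[i] += g[i]                       # 2.c) coefficient update
        r = r - Phi[:, i] * g[i]           # 2.d) residual update
        if np.linalg.norm(r) < tol:        # simple stopping criterion
            break
    return y, r
```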
MP requires the evaluation of matrix multiplications involving $\Phi^T$. If $\Phi$ is a union of dictionaries for which fast transforms exist, then these matrix operations can be computed efficiently. Another trick used in MP [19] is to compute the inner products between the residual and the dictionary elements recursively. This can be done using

$$g_i^{n+1} = \langle r^{n+1}, \phi_i \rangle = g_i^n - g_{i^n}^n \langle \phi_{i^n}, \phi_i \rangle, \tag{4}$$

where $\phi_{i^n}$ is the last selected element. This result is a direct consequence of the MP error recursion

$$r^{n+1} = r^n - g_{i^n}^n \phi_{i^n}. \tag{5}$$

This approach is of advantage whenever the inner products $\langle \phi_{i^n}, \phi_i \rangle$ between the dictionary elements can either be stored or efficiently computed and, crucially, whenever these inner products are predominantly zero.
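To illustrate the recursion in (4), suppose the inner products between atoms are available as a precomputed Gram matrix (a simplification we introduce only for this example; in practice one would exploit the fact that most of these inner products are zero). All correlations can then be updated in one vectorised step:

```python
import numpy as np

def update_correlations(g, i_n, Gram):
    """Recursive update of equation (4).

    g    : current correlations g^n = Phi.T @ r^n
    i_n  : index of the atom selected in this iteration
    Gram : matrix of inner products <phi_j, phi_k>
    """
    return g - g[i_n] * Gram[:, i_n]
```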
In Orthogonal Matching Pursuit [20], [24] the approximation $y$ is updated in each iteration by projecting $x$ orthogonally onto all selected atoms. OMP therefore finds the optimum (in terms of squared error) signal approximation achievable with the selected atoms. This algorithm is:

1) Initialise $r^0 = x$, $y^0 = 0$, $\Gamma^0 = \emptyset$
2) for $n = 1$; $n := n + 1$ till stopping criterion is met
   a) $g^n = \Phi^T r^{n-1}$
   b) $i^n = \arg\max_i |g_i^n|$
   c) $\Gamma^n = \Gamma^{n-1} \cup i^n$
   d) $y_{\Gamma^n}^n = \Phi_{\Gamma^n}^{\dagger} x$
   e) $r^n = x - \Phi y^n$
3) Output $r^n$, $y^n$

Here the dagger $\dagger$ indicates the Moore-Penrose pseudo-inverse. Note that this inverse should never be calculated explicitly; more efficient implementations of OMP based on QR factorisation [24] or Cholesky factorisation [25] are available. The main drawback of these approaches is that they require additional storage, as discussed in detail in section IV. This storage requirement can become an issue for large problems in which $\Phi$ cannot be stored¹. The aim of this paper is therefore to develop fast approximate OMP algorithms that require less storage.
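For reference, a minimal OMP sketch corresponding to the listing above. This is our illustration only: the least-squares call stands in for the pseudo-inverse in step 2.d), which is acceptable for small examples but, as noted, efficient implementations use QR or Cholesky updates instead.

```python
import numpy as np

def omp(x, Phi, n_iter=10):
    """Minimal OMP sketch (illustration only; no factorisation updates)."""
    y = np.zeros(Phi.shape[1])
    r = x.copy()
    Gamma = []                                     # Gamma^0 = empty set
    for _ in range(n_iter):
        g = Phi.T @ r                              # 2.a)
        i = int(np.argmax(np.abs(g)))              # 2.b)
        if i not in Gamma:
            Gamma.append(i)                        # 2.c)
        y_sub, *_ = np.linalg.lstsq(Phi[:, Gamma], x, rcond=None)  # 2.d)
        y[:] = 0.0
        y[Gamma] = y_sub                           # orthogonal projection coefficients
        r = x - Phi @ y                            # 2.e)
    return y, r
```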
III. THE DIRECTIONAL PURSUIT FRAMEWORK

In iteration $n$ the problem solved by Orthogonal Matching Pursuit (OMP) is the minimisation over $y_{\Gamma^n}$ of the quadratic cost-function (in $n$ unknowns)

$$\|x - \Phi_{\Gamma^n} y_{\Gamma^n}\|_2^2. \tag{6}$$

¹If $\Phi$ could be stored explicitly as a matrix, then one could most likely also store a QR or Cholesky factorisation of a subset of its columns.
Instead of updating $y_{i^n}$ by adding $g_{i^n}^n$ as in Matching Pursuit (MP), we propose a directional update

$$y_{\Gamma^n}^n = y_{\Gamma^n}^{n-1} + a^n d_{\Gamma^n}, \tag{7}$$

where $d_{\Gamma^n}$ is an update direction. Different directions $d_{\Gamma^n}$ can be chosen and three different possibilities will be discussed below.
Once an update direction has been calculated, the step-size $a^n$ can be determined explicitly. As shown in [26, pp. 521], the optimum step size² is

$$a^n = \frac{\langle r^{n-1}, c^n \rangle}{\|c^n\|_2^2}, \tag{8}$$

where $c^n$ is the vector $c^n = \Phi_{\Gamma^n} d_{\Gamma^n}$ and $r^{n-1}$ is the residual before the update.
In general, a single directional update does not guarantee convergence to the minimum of expression (6) (one could therefore also term the method "nearly" Orthogonal Matching Pursuit) and an additional generalisation of the proposed approach would be to use several directional update steps before selecting a new element.

In MP and OMP, the selection of new elements is based on the inner product between the residual and the dictionary elements. In OMP, elements are not selected repeatedly because, due to the orthogonal projection used, the residual $r^n$ is always orthogonal to all previously selected elements. If directional optimisation is used to approximate the orthogonal projection, this orthogonality is no longer guaranteed. Because all previously chosen elements are updated in each iteration, it would nevertheless be possible to restrict the selection of new elements to those elements not selected previously. However, a large inner product between $r^n$ and any atom selected in a previous step indicates that the residual is 'far' from orthogonal to the selected elements (as would be required by the residual in exact OMP). Furthermore, if the algorithm always selects the element with the largest inner product, irrespective of whether this element has previously been selected or not, then the algorithm belongs to the family of General MP as defined in [27]. The theoretic results presented in [27] would then also hold for the directional pursuit algorithms. For example, the following theorem from [27] would hold for all algorithms in this paper.
Theorem 1 (Theorem 1 from [27]): Let $\Gamma$ be an index set and let $x = \Phi_{\Gamma} y_{\Gamma}$ for some $y_{\Gamma}$. If $\sup_{i \notin \Gamma} \|(\Phi_{\Gamma})^{\dagger} \phi_i\|_1 \leq 1$, then a general MP algorithm picks up an atom from the set $\Gamma$ in each step.
These theoretical considerations, as well as experimental results, some of which can be found in section VI, show that it is beneficial to always select the element with the largest inner product. We therefore allow the selection step to select elements more than once, i.e. we do not force the selection step to select a new element in each iteration, thereby letting the algorithm automatically decide how many directional update steps are required for each new element.

²The step size is optimum in terms of OMP, i.e. in terms of finding the minimum squared error solution in the update direction. Other algorithms, such as LARS, can also be understood in terms of directional optimisation. As these algorithms work with a different cost function, the optimal step size in these algorithms is different to the one used here.
The Directional Pursuit family of algorithms can be summarised as follows:

1) Initialise $r^0 = x$, $y^0 = 0$, $\Gamma^0 = \emptyset$
2) for $n = 1$; $n := n + 1$ till stopping criterion is met
   a) $g^n = \Phi^T r^{n-1}$
   b) $i^n = \arg\max_i |g_i^n|$
   c) $\Gamma^n = \Gamma^{n-1} \cup i^n$
   d) calculate update direction $d_{\Gamma^n}$
   e) $c^n = \Phi_{\Gamma^n} d_{\Gamma^n}$
   f) $a^n = \langle r^{n-1}, c^n \rangle / \|c^n\|_2^2$
   g) $y_{\Gamma^n}^n := y_{\Gamma^n}^{n-1} + a^n d_{\Gamma^n}$
   h) $r^n = r^{n-1} - a^n c^n$
3) Output $r^n$, $y^n$
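To make the framework concrete, the following Python sketch implements steps 2.a)-2.h) with the update direction supplied as a function argument. The function name and the small `state` dictionary are our own illustrative choices, not notation from the paper; the step size uses the residual from the previous iteration, as in step 2.f).

```python
import numpy as np

def directional_pursuit(x, Phi, direction_fn, n_iter=100, tol=1e-6):
    """Generic Directional Pursuit sketch.

    direction_fn(g_sub, Gamma, state) must return the direction d_{Gamma^n};
    'state' lets a particular choice of direction keep history between calls.
    """
    y = np.zeros(Phi.shape[1])
    r = x.copy()
    Gamma = []                                    # selected support
    state = {}                                    # scratch space for direction_fn
    for _ in range(n_iter):
        g = Phi.T @ r                             # 2.a) correlations
        i = int(np.argmax(np.abs(g)))             # 2.b) selection (re-selection allowed)
        if i not in Gamma:
            Gamma.append(i)                       # 2.c) grow the support
        d = direction_fn(g[Gamma], Gamma, state)  # 2.d) update direction
        c = Phi[:, Gamma] @ d                     # 2.e)
        a = (r @ c) / (c @ c)                     # 2.f) step size
        y[Gamma] += a * d                         # 2.g) coefficient update
        r = r - a * c                             # 2.h) residual update
        if np.linalg.norm(r) < tol:
            break
    return y, r
```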
Different choices for the update direction are feasible and the gradient $g^n$ discussed in subsection III-A is the obvious choice. However, the optimal choice of $d_{\Gamma^n}$ would find the minimum of (6) in a single step. It turns out that this optimal direction is a conjugate direction as discussed in subsection III-B and can be evaluated explicitly. Unfortunately, the evaluation of this direction requires computational resources similar to an OMP implementation based on QR factorisation. In subsection III-C we therefore propose the use of an approximation to the conjugate direction which can be evaluated more efficiently.
A. Gradient Pursuit
The gradient of expression (6) with respect to $y$ is

$$g_{\Gamma^n} = \Phi_{\Gamma^n}^T (x - \Phi_{\Gamma^n} y_{\Gamma^n}^{n-1}). \tag{9}$$

Using this gradient as the update direction gives a directional pursuit algorithm we call Gradient Pursuit (GP). It is important to realise that this gradient is exactly the vector $g^n$ restricted to the elements in $\Gamma^n$, which has already been calculated in step 2.a) of the Matching Pursuit (MP) algorithm, i.e. MP calculates this gradient in each iteration, so that the use of this gradient comes at no additional computational cost. The only additional cost compared to MP is then the evaluation of the step-size. Note, however, that when updating more than one element in $y^n$, the recursion in equation (4) becomes less efficient.
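In terms of the `directional_pursuit` sketch given after the algorithm listing above, Gradient Pursuit simply returns the restricted gradient unchanged; for example (with `x` and `Phi` defined as in the earlier illustrative fragments):

```python
# Gradient Pursuit: the update direction is the gradient g^n restricted to
# Gamma^n, which step 2.a) has already computed.
y_gp, r_gp = directional_pursuit(x, Phi, lambda g_sub, Gamma, state: g_sub)
```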
B. Conjugate Gradient Pursuit

Another popular directional optimisation algorithm is the conjugate gradient method [26, Section 10.2], [28, Section 16.4], which is a well known optimisation procedure that is guaranteed to solve quadratic optimisation problems in as many steps as the dimension of the problem³. The conjugate gradient algorithm can be summarised as follows. We denote the cost function to be minimised by $\frac{1}{2} y^T G y - b^T y$ (which is equivalent to solving $Gy = b$ for $y$). The conjugate gradient method uses directional updates that are $G$-conjugate to the previously chosen directions. A set of vectors $\{d_1, d_2, \ldots, d_n\}$ is $G$-conjugate if

$$d_n^T G d_k = 0 \tag{10}$$

for all $k \neq n$. More details can be found in, for example, [26, Section 10.2] and [28, Section 16.4].

³At least for infinite precision arithmetic. See for example [26] for a detailed discussion regarding stability issues.
The same idea can be used in the directional pursuit framework, where we now want to calculate an update direction that is $G_{\Gamma^n}$-conjugate to all previously used update directions. Here $G_{\Gamma^n} = \Phi_{\Gamma^n}^T \Phi_{\Gamma^n}$. The cost function is now

$$\|x - \Phi_{\Gamma^n} y_{\Gamma^n}\|_2^2, \tag{11}$$
where the dimension $n$ of the cost function changes whenever a new element is selected. Let $d_{\Gamma^n}^k$ be the $k$th conjugate direction. The subscript reminds us that this vector is $|\Gamma^n|$-dimensional. We update $y_{\Gamma^n}$ (which is an $|\Gamma^n|$-dimensional sub-vector of $y$) in direction $d_{\Gamma^n}^n$, which is equivalent to updating $y$ using a directional vector $d^n$ of dimension $N$ with all elements zero apart from the elements indexed by $\Gamma^n$. Therefore, we can think of all previous update directions $d_{\Gamma^k}^k$ (note the superscript in the subscript) as higher dimensional vectors $d_{\Gamma^n}^k$, where the elements associated with the 'new' dimensions are set to zero.

To derive the algorithm, we recall the conjugate gradient theorem [28, Theorem 16.2], which we give here using the notation introduced above.
Theorem 2 ([28, Theorem 16.2]): Let $[d_{\Gamma^n}^1, d_{\Gamma^n}^2, \ldots, d_{\Gamma^n}^n]$ be any set of non-zero $G_{\Gamma^n}$-conjugate vectors. Then the solution to the problem $G_{\Gamma^n} y_{\Gamma^n} = \Phi_{\Gamma^n}^T x$ is

$$y_{\Gamma^n}^n = \sum_{k=1}^{n} a^k d_{\Gamma^n}^k, \tag{12}$$

with step-sizes

$$a^k = \frac{\langle r^k, \Phi_{\Gamma^k} d_{\Gamma^n}^k \rangle}{\|\Phi_{\Gamma^k} d_{\Gamma^n}^k\|_2^2}, \tag{13}$$

where

$$r^k = x - \Phi_{\Gamma^{k-1}} \left( \sum_{i=1}^{k-1} a^i d_{\Gamma^n}^i \right). \tag{14}$$
The importance of this theorem lies in the fact that the step-size $a^k$ only depends on the current residual error and the current conjugate direction. This means that $y_{\Gamma^n}^n$ can be approximated iteratively by calculating a new direction and step-size in each iteration.

The other important aspect of this theorem is that it guarantees an optimal solution in $N$ iterations. In a directional pursuit framework, the dimensions can change from one iteration to the next. In the first iteration we have a single trivial direction and step-size. In general, if the $n-1$ previously used update directions are $G_{\Gamma^n}$-conjugate, then, by the above theorem, we only require one additional conjugate direction to exactly solve the $n$-dimensional problem.
Assume that the $n-1$ previously used update directions are $G_{\Gamma^{n-1}}$-conjugate. We want to show that they are also $G_{\Gamma^n}$-conjugate (note the different subscripts!). Use the matrix $D_{\Gamma^{n-1}}^{n-1}$ to denote the matrix containing all conjugate update directions from iteration $n-1$, and let $D_{\Gamma^n}^{n-1}$ be the same matrix but with an additional row of zeros at the bottom. From the definition of $G_{\Gamma^n}$-conjugacy we require $(D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} D_{\Gamma^n}^{n-1} = B$, where $B$ is a diagonal matrix. Because the last row of $D_{\Gamma^n}^{n-1}$ contains only zeros, the last row and column of $G_{\Gamma^n}$ are multiplied by zeros, which implies that

$$B = (D_{\Gamma^{n-1}}^{n-1})^T G_{\Gamma^{n-1}} D_{\Gamma^{n-1}}^{n-1} = (D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} D_{\Gamma^n}^{n-1}. \tag{15}$$
The main question is now how to calculate a new conjugate gradient. We require the new direction to be $G_{\Gamma^n}$-conjugate. Therefore, the new direction has to satisfy

$$(D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} d_{\Gamma^n}^n = 0. \tag{16}$$

We write each new direction as a combination of all previously chosen directions and the current gradient $g_{\Gamma^n}$ [26, Section 10.2]

$$d_{\Gamma^n}^n = b_0 g_{\Gamma^n} + D_{\Gamma^n}^{n-1} b. \tag{17}$$

Without loss of generality we can set $b_0 = 1$. Pre-multiplying by $(D_{\Gamma^n}^{n-1})^T G_{\Gamma^n}$ and using the $G_{\Gamma^n}$-conjugacy then leads to the $n-1$ constraints

$$(D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} (g_{\Gamma^n} + D_{\Gamma^n}^{n-1} b) = 0, \tag{18}$$

from which we can write

$$b = -\left((D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} D_{\Gamma^n}^{n-1}\right)^{-1} \left((D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} g_{\Gamma^n}\right). \tag{19}$$
Again using $G_{\Gamma^n}$-conjugacy we find that $(D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} D_{\Gamma^n}^{n-1}$ is diagonal, so that the conjugate gradient can be calculated without matrix inversion.

Note that in the standard conjugate gradient algorithm [26, Section 10.2], [28, Section 16.4] each new update direction can be calculated as a combination of the current gradient and the single previous update direction alone, i.e. in the standard conjugate gradient algorithm, all but the last conjugate update direction turn out to be $G$-conjugate to the current gradient, such that $b$ in equation (19) has only a single non-zero element. Unfortunately, in the context of OMP, the changing dimensionality destroys this property, so that we have to take account of all previous update directions in each step.
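A small sketch of the direction computation in equations (17)-(19), again our own illustration rather than the authors' code. It assumes the previous directions are kept, zero-padded to the current support size, as the columns of a matrix `D_prev`, and exploits the fact that $(D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} D_{\Gamma^n}^{n-1}$ is diagonal so that no matrix inversion is needed:

```python
import numpy as np

def conjugate_direction(G_sub, D_prev, g_sub):
    """Compute d_{Gamma^n} via equations (17)-(19) (illustrative sketch).

    G_sub  : Gram matrix Phi_{Gamma^n}^T Phi_{Gamma^n}
    D_prev : previous update directions, zero-padded to length |Gamma^n|,
             stored as columns (or None in the first iteration)
    g_sub  : current gradient restricted to Gamma^n
    """
    if D_prev is None or D_prev.shape[1] == 0:
        return g_sub                              # first iteration: gradient direction
    DtG = D_prev.T @ G_sub                        # (D^{n-1})^T G_{Gamma^n}
    diag = np.sum(DtG * D_prev.T, axis=1)         # diagonal of (D^T G D)
    b = -(DtG @ g_sub) / diag                     # equation (19), element-wise division
    return g_sub + D_prev @ b                     # equation (17) with b_0 = 1
```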
For an efficient implementation it is worth noting that in the calculation of $b$ the product $(D_{\Gamma^n}^{n-1})^T G_{\Gamma^n}$ can be updated recursively by adding a single new row and column in each iteration. Note also that $\left((D_{\Gamma^{n-1}}^{n-2})^T G_{\Gamma^{n-1}} D_{\Gamma^{n-1}}^{n-2}\right)^{-1}$ and $\left((D_{\Gamma^n}^{n-1})^T G_{\Gamma^n} D_{\Gamma^n}^{n-1}\right)^{-1}$ are equal apart from a single additional value added in each iteration. This value is $\|c^{n-1}\|_2^2$, which is used in step 2.f) of iteration $n-1$ and does not have to be recalculated.
It is important to realise that the algorithm derived here is different from an implementation of OMP in which a full conjugate gradient solver is used for each newly selected element. Instead, the proposed method only uses a single directional update step for each new element. The most similar method to the proposed algorithm is probably the implementation of OMP proposed in [20], which also uses a directional update. However, the method in [20] uses matrix inversions, which have to be updated iteratively and which can make the approach less stable.
Another approach to OMP is based on QR factorisation. The selected dictionary $\Phi_{\Gamma^n}$ is decomposed into $\Phi_{\Gamma^n} = Q_{\Gamma^n} R_{\Gamma^n}$, where in each iteration new elements are added to $Q_{\Gamma^{n-1}}$ and $R_{\Gamma^{n-1}}$. The algorithm then does not need to evaluate $y_{\Gamma^n}$ in each iteration; instead, $r^n = r^{n-1} - \langle q, x \rangle q$, where $q$ is the