AIC, BIC in regression, etc. [26]. A search method is then needed to find the best variable subset among all 2^m possible subsets [5]. Because the number of candidate subsets is 2^m, it is difficult to search the variable subsets exhaustively for large m. This poses the challenge of reducing computational complexity while still extracting a compact yet effective set of variables.
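To make the scale of the problem concrete, the following is a minimal sketch of exhaustive best subset selection scored by BIC under a Gaussian OLS model; the function names and the choice of BIC (rather than AIC or another criterion) are illustrative assumptions, not the procedure of any particular reference.

```python
import itertools
import numpy as np

def bic(y, X_sub):
    """BIC of an OLS fit on the columns in X_sub (Gaussian errors assumed)."""
    n, k = X_sub.shape
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    rss = np.sum((y - X_sub @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def best_subset(X, y):
    """Exhaustive search: every one of the 2^m - 1 non-empty subsets is fitted."""
    m = X.shape[1]
    best_score, best_vars = np.inf, None
    for size in range(1, m + 1):
        for subset in itertools.combinations(range(m), size):
            score = bic(y, X[:, list(subset)])
            if score < best_score:
                best_score, best_vars = score, subset
    return best_vars, best_score
```

Already at m = 30 this double loop would require fitting over a billion models, which is why cheaper stepwise and ensemble strategies are attractive.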
It is well known that stepwise selection has lower computational complexity than best subset selection. However, stepwise selection actually selects a sub-optimal subset because of the nested sequence of models it produces [39]. To improve traditional stepwise selection, Xin and Zhu propose a stochastic stepwise ensemble (ST2E). This ensemble method randomly includes or excludes a group of variables at each step, where the group size is itself randomly determined [42]. For a fixed group size, the number of candidate groups can be quite large. Instead of searching all candidate groups exhaustively, they only need to evaluate a few randomly selected subsets, and the best one is chosen from those. As demonstrated by a special example, Xin and Zhu find that the globally optimal subset obtained by exhaustive search may include some junk variables; it is partly efficiency and partly luck that allows the ensemble method to find the best subset without an exhaustive search. This ensemble learning algorithm has attracted our considerable attention due to its potential to significantly improve the performance of the traditional procedure.
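The sketch below shows one stochastic step of this idea, assuming a set-valued state and a generic score_fn to be minimised; the 50/50 add-or-drop rule, the uniform group-size draw, and the number of sampled candidate groups are our illustrative choices, not necessarily those of Xin and Zhu's ST2E.

```python
import numpy as np

def st2e_step(selected, m, score_fn, n_candidates=20, rng=None):
    """One stochastic stepwise move: add or drop a randomly sized group,
    chosen as the best of a few randomly sampled candidate groups rather
    than of all possible groups of that size."""
    rng = rng or np.random.default_rng()
    add = (not selected) or rng.random() < 0.5      # include or exclude a group
    pool = sorted(set(range(m)) - selected) if add else sorted(selected)
    if not pool:                                    # nothing left to add or drop
        return selected, score_fn(selected)
    size = int(rng.integers(1, len(pool) + 1))      # random group size
    best_subset, best_score = selected, np.inf
    for _ in range(n_candidates):                   # sample a few groups only
        group = set(rng.choice(pool, size=size, replace=False).tolist())
        trial = selected | group if add else selected - group
        score = score_fn(trial)
        if score < best_score:
            best_subset, best_score = trial, score
    return best_subset, best_score
```

Because only n_candidates groups are scored per step, each run of the procedure is cheap, and the randomness makes different runs explore different regions of the subset space.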
Inspired by the great success of ensemble learning algorithms in solving forecasting problems [1,3,16,25], the idea of ensemble learning has recently been introduced into variable selection [24,32,40,45]. In general, variable selection ensembles (VSEs) allow each optimization path (i.e. each ensemble member) to generate a sub-optimal rather than an optimal solution, while making these solutions as different from one another as possible. In other words, each member does not need to search the variable subsets exhaustively; it only needs to perform a simple search, which yields a good strength-diversity tradeoff across the ensemble. According to this strength-diversity tradeoff, to improve a VSE, all the ensemble members must be as strong as possible as variable selectors, while at the same time disagreeing with each other as much as possible [44]. As stated in [21], improving the strength of a VSE's members while keeping their diversity will produce a better VSE. Based on this idea, we propose a novel ensemble learning framework that injects an information measurement criterion into ST2E with the aim of improving the strength of each ensemble member.
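As a concrete illustration of how a VSE turns many diverse, individually sub-optimal members into one decision, the sketch below fuses member subsets by selection frequency; this majority-vote style fusion is one common scheme in the VSE literature, not necessarily the aggregation rule of the framework proposed here.

```python
import numpy as np

def aggregate_vse(member_subsets, m, threshold=0.5):
    """Average the 0/1 selection indicators of the members and keep the
    variables selected by at least a `threshold` fraction of them."""
    importance = np.zeros(m)
    for subset in member_subsets:
        importance[list(subset)] += 1.0
    importance /= len(member_subsets)   # selection frequency per variable
    return importance, np.flatnonzero(importance >= threshold)
```

Diversity matters here because identical members add no information to the average, while weak members vote for junk variables; hence the strength-diversity tradeoff.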
In probability theory and information theory, measuring the dependence between two random variables is a fundamental and interesting problem, with many applications in statistics, signal processing, economics and so on. The most popular and classical measures of nonlinear and linear dependence are the mutual information and the correlation coefficient, respectively. Mutual information (MI) between two variables X and Y measures how similar the joint distribution p(X, Y) is to the product of the factored marginal distributions p(X)p(Y); the MI of two random variables is therefore a generalised measure of the variables' mutual dependence [33]. Unlike the correlation coefficient, it is not limited to linear relationships: it captures both linear and nonlinear dependence.
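A minimal plug-in estimate of MI for two continuous samples, using histogram binning, is sketched below; the bin count and the use of natural logarithms (MI in nats) are arbitrary illustrative choices.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram (plug-in) MI estimate: how far the joint p(X, Y) is
    from the product of marginals p(X)p(Y); MI is zero iff X and Y
    are independent."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # estimated joint distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of X
    py = pxy.sum(axis=0, keepdims=True)       # marginal of Y
    nz = pxy > 0                              # skip empty cells, avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

Plug-in estimates like this are sensitive to the bin count and the sample size, which is precisely the estimation-error issue raised at the end of this section.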
Kojadinovic measures the similarity for agglomerative hierarchical clustering of continuous variables by using the notion of mutual information [20]. Typically, MI-based variable selection algorithms build a filter method by estimating the MI between each variable candidate and the target variable (see the literature [2,11,38]). However, the performance of the above selection algorithms is degraded by large errors in estimating the mutual information