1. Introduction
Bayesian networks (BNs), which were introduced by Pearl [1], can encode dependencies among all
variables. Their success has led to a recent flurry of algorithms for learning BNs from data [2–5].
A BN ⟨N, A, Θ⟩ is a directed acyclic graph with a conditional probability distribution for each node, collectively represented by Θ, which quantifies how much a node depends on its parents. Each node n ∈ N
represents a domain variable, and each arc a ∈ A between nodes represents a probabilistic dependency.
A BN can be used as a classifier that characterizes the joint distribution P(x, y) of the class variable Y and a set of attributes X = {X_1, X_2, ..., X_n}, and predicts the class label with the highest conditional probability. (In the following discussion, lower-case letters denote specific values taken by the corresponding attributes; for instance, x_i represents the event that X_i = x_i.) Denoting the parent nodes of x_i by Pa(x_i), the joint distribution P_B(x, y) can be represented by factors over the network structure B, as follows:

$$
P_B(x, y) = \prod_{i=1}^{n} P(x_i \mid Pa(x_i)). \qquad (1)
$$
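To make the factorization concrete, the sketch below (not from the paper) evaluates Eq. (1) in log space and predicts the class with the highest joint probability. The names `parents`, `cpt` and `prior` are hypothetical stand-ins for a learned structure B and parameters Θ; the class prior P(y) is written out explicitly as the factor for the (parentless) class node.

```python
# Hedged sketch: evaluating the factorization in Eq. (1) and predicting the
# class label with the highest joint probability. The dictionaries `parents`,
# `cpt` and `prior` are hypothetical stand-ins for a learned BN.

import math

def joint_log_prob(x, y, parents, cpt, prior):
    """log P_B(x, y) = log P(y) + sum_i log P(x_i | Pa(x_i)).

    x       : dict mapping attribute name -> observed value
    y       : candidate class label
    parents : dict mapping attribute name -> list of parent attribute names
              (the class Y is assumed to be a parent of every attribute)
    cpt     : dict mapping (attribute, value, parent values, y) -> probability
    prior   : dict mapping class label -> P(y)
    """
    logp = math.log(prior[y])
    for attr, value in x.items():
        parent_vals = tuple(x[p] for p in parents[attr])
        logp += math.log(cpt[(attr, value, parent_vals, y)])
    return logp

def predict(x, classes, parents, cpt, prior):
    """Return the class label maximizing P_B(x, y)."""
    return max(classes, key=lambda y: joint_log_prob(x, y, parents, cpt, prior))
```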
Inference in a general BN has been shown to be NP-hard [6], even for approximate
solutions [7]. However, learning unrestricted BNs does not necessarily lead to a classifier with good
performance. For example, naive Bayes (NB) [8] is the simplest BN, which considers only the
dependence between each attribute X_i and the class variable Y. However, Friedman et al. [9] observed
that unrestricted BN classifiers do not outperform NB in a large sample of benchmark data sets. Many
BN classifiers have been proposed to overcome the limitation of NB. One practical approach for
structure learning is to impose some restrictions on the structures of BNs, for example, learning tree-like
structures. Sahami [10] proposed the k-dependence Bayesian (KDB) classifier, a general framework for describing limited dependence among variables. Friedman et al. [9] proposed tree-augmented naive Bayes (TAN), a structure-learning algorithm that learns a maximum spanning tree over the attributes. Both algorithms use conditional mutual information to measure the weight of arcs between predictive attributes. As the data size grows, KDB's richer representation of high-order dependencies helps it achieve better classification performance than TAN.
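As a rough illustration (not taken from either algorithm's original presentation), the sketch below estimates the conditional mutual information I(X_i; X_j | Y) from empirical frequencies, the quantity TAN and KDB use to weight candidate arcs between predictive attributes. The variable `data`, a list of (x_i, x_j, y) triples, is hypothetical, and smoothing is omitted for clarity.

```python
# Hedged sketch: empirical conditional mutual information I(X_i; X_j | Y),
# used by TAN and KDB as the weight of the arc between attributes X_i and X_j.

import math
from collections import Counter

def conditional_mutual_information(data):
    """I(X_i; X_j | Y) = sum_{x_i, x_j, y} P(x_i, x_j, y) *
       log [ P(x_i, x_j | y) / (P(x_i | y) P(x_j | y)) ]."""
    n = len(data)
    c_xyz = Counter(data)                              # counts of (x_i, x_j, y)
    c_iy = Counter((xi, y) for xi, xj, y in data)      # counts of (x_i, y)
    c_jy = Counter((xj, y) for xi, xj, y in data)      # counts of (x_j, y)
    c_y = Counter(y for _, _, y in data)               # counts of y

    cmi = 0.0
    for (xi, xj, y), n_xyz in c_xyz.items():
        p_xyz = n_xyz / n
        # P(x_i, x_j | y) / (P(x_i | y) P(x_j | y)) expressed with raw counts
        ratio = (n_xyz * c_y[y]) / (c_iy[(xi, y)] * c_jy[(xj, y)])
        cmi += p_xyz * math.log(ratio)
    return cmi
```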
The key differences between Bayesian classifiers are their structure-learning algorithms. Many
criteria, such as the Bayesian scoring function [11], minimal description length (MDL) [12] and the Akaike Information Criterion (AIC) [13], have been proposed to find a single global graph structure B_G that best characterizes the true distribution of the given data. Considering the time and space complexity overhead, only a limited number of conditional probabilities can be encoded in a BN. All credible dependencies must
be represented to obtain a more accurate estimation of the true joint distribution. However, these criteria
can only approximately measure the overall interdependencies between attributes, but cannot identify the
change of interdependencies when attributes take different values. Thus the candidate graph structures
may have very close score values and are non-negligible in the posterior sense [14]. To extend the limited
representation of B_G, some researchers have proposed aggregating several candidate BNs. Averaged one-dependence estimators (AODE), proposed by Webb et al. [15], aggregate the predictions of all qualified members of a restricted class of one-dependence estimators. Zheng et al. [16] proposed subsumption resolution (SR) to efficiently identify occurrences of the specialization-generalization relationship and eliminate generalizations at classification time. By introducing Functional Dependency (FD) analysis