1.2 Relational patterns
Relational patterns involve multiple relations from a re-
lational database. They are typically stated in a more ex-
pressive language than patterns defined on a single data
table. The major types of relational patterns extend the
types of propositional patterns considered in single table
data mining. We can thus have relational classification rules,
relational regression trees, and relational association rules,
among others.
An example relational classification rule is given in Ta-
ble 1, which involves the relations Customer and MarriedTo.
It predicts a person to be a big spender if the person is mar-
ried to somebody with high income (compare this to the rule
that states a person is a big spender if he has high income,
listed above the relational rule). Note that the two persons
C1 and C2 are connected through the relation MarriedTo.
Relational patterns are typically expressed in subsets of
first-order logic (also called predicate or relational logic).
Essentials of predicate logic include predicates (MarriedTo)
and variables (C1, C2), which are not present in proposi-
tional logic. Relational patterns are thus more expressive
than propositional ones.
Most commonly, the logic programming subset of first-
order logic, which is strongly related to deductive databases,
is used as the formalism for expressing relational patterns.
E.g., the relational rule in Table 1 is a logic program clause.
Note that a relation in a relational database corresponds to
a predicate in first-order logic (and logic programming).
1.3 Relational to propositional
RDM tools can be applied directly to multi-relational
data to find relational patterns that involve multiple rela-
tions. Most other data mining approaches assume that the
data resides in a single table and require preprocessing to
integrate data from multiple tables (e.g., through joins or
aggregation) into a single table before they can be applied.
Integrating data from multiple tables through joins or aggre-
gation, however, can cause loss of meaning or information.
Suppose we are given the relations customer(CustID,
Name, Age, SpendsALot) and purchase(CustID,
P roductID, Date, V alue, P aymentMode), where each cus-
tomer can make multiple purchases, and we are interested in
characterizing customers that spend a lot. Integrating the
two relations via a natural join will give rise to a relation
purchase1 where each row corresponds to a purchase and
not to a customer. One possible aggregation would give rise
to the relation customer1(CustID, Age, NofP urchases,
T otalV alue, SpendsALot). In this case, however, some in-
formation has been clearly lost during aggregation.
The following pattern can be discovered if the relations
customer and purchase are considered together.
customer(CID, N ame, Age, yes) ←
Age > 30 ∧
purchase(CID, P ID, D, V alue, P M ) ∧
P M = credit card ∧ V alue > 100.
This pattern says: “a customer spends a lot if she is older
than 30, has purchased a product of value more than 100
and paid for it by credit card.” It would not be possible to
induce such a pattern from either of the relations purchase1
and customer1 considered on their own.
Besides the ability to deal with data stored in multi-
ple tables directly, RDM systems are usually able to take
into account generally valid background (domain) knowl-
edge given as a logic program. The ability to take into ac-
count background knowledge and the expressive power of the
language of discovered patterns are distinctive for RDM.
Note that data mining approaches that find patterns in a
given single table are referred to as attribute-value or propo-
sitional learning approaches, as the patterns they find can be
expressed in propositional logic. RDM approaches are also
referred to as first-order learning approaches, or relational
learning approaches, as the patterns they find are expressed
in the relational formalism of first-order logic. A more de-
tailed discussion of the single table assumption, the prob-
lems resulting from it and how a relational representation
alleviates these problems is given by Wrobel [50] (Chapter
4 of [16]).
1.4 Algorithms for relational data mining
A RDM algorithm searches a language of relational pat-
terns to find patterns valid in a given database. The search
algorithms used here are very similar to those used in single
table data mining: one can search exhaustively or heuristi-
cally (greedy search, best-first search, etc.). Just as for the
single table case, the space of patterns considered is typically
lattice-structured and exploiting this structure is essential
for achieving efficiency. The lattice structure is traversed by
using refinement operators [46], which are more complicated
in the relational case. In the propositional case, a refinement
operator may add a condition to a rule antecedent or an item
to an item set. In the relational case, a new relation can be
introduced as well.
Just as many data mining algorithms come from the field
of machine learning, many RDM algorithms come form the
field of inductive logic programming (ILP, [35; 30]). Situ-
ated at the intersection of machine learning and logic pro-
gramming, ILP has been concerned with finding patterns
expressed as logic programs. Initially, ILP focussed on au-
tomated program synthesis from examples, formulated as
a binary classification task. In recent years, however, the
scope of ILP has broadened to cover the whole spectrum of
data mining tasks (classification, regression, clustering, asso-
ciation analysis). The most common types of patterns have
been extended to their relational versions (relational classi-
fication rules, relational regression trees, relational associa-
tion rules) and so have the major data mining algorithms
(decision tree induction, distance-based clustering and pre-
diction, etc.).
Van Laer and De Raedt [49] (Chapter 10 of [16]) present
a generic approach of upgrading single table data mining
algorithms (propositional learners) to relational ones (first-
order learners). Note that it is not trivial to extend a single
table data mining algorithm to a relational one. Extending
the key notions to, e.g., defining distance measures for multi-
relational data requires considerable insight and creativity.
Efficiency concerns are also very important, as it is often the
case that even testing a given relational pattern for validity
is computationally expensive, let alone searching a space of
such patterns for valid ones. An alternative approach to
RDM (called propositionalization) is to create a single table
from a multi-relational database in a systematic fashion [28]
(Chapter 11 of [16]): this approach shares some efficiency
concerns and in addition can have limited expressiveness.