chromosomal regions (Daly et al. 2001), then search the
minimum subsets of tagging SNPs within each haplotype
block. These tagSNPs can distinguish each pair of common
haplotypes (Gabriel et al. 2002), or at least most of them
(Zhang et al. 2004). But block-based methods have the
disadvantage that the choice of tagSNPs depend on the
block definition. There is no general solution on how
the blocks are formed and there is no idea on how to
prevent the accuracy loss of the tagSNP selection which is
caused by the lack of the inter-block association.
Compared with the block-based methods, the LD based
methods use the pairwise association of SNPs. In these LD-
based methods, the SNPs in given chromosomal regions are
divided into several clusters according to their LD associ-
ations (measured by the pairwise r
2
). In such clusters, every
SNPs have strong LD association with each other (Carlson
et al. 2004; Ao et al. 2005). A minimum set of SNPs is
selected from these clusters as the tagSNPs. These SNPs,
which have highly LD associations with the rest SNPs in
their clusters, can represent the other associated SNPs even
with long distance. The information overlappings of these
tagSNPs are very small, which means the chosen SNPs only
have very low LD value with each others. But the tagSNPs
found by these methods may lose some important infor-
mation which are contained by the rest SNPs and fail to
distinguish all haplotypes in a LD cluster.
Bafna et al. (2003) proposed a different method based
on the prediction accuracy. It assumes that tagSNPs can
reconstruct the remaining SNPs of an unknown sample
with high accuracy. Thus, these methods introduce a new
measure called informativeness to quantify how well the
selected SNPs predict the remaining set of the unselected
SNPs and reconstruct the complete haplotypes (Halldors-
son et al. 2004; Halperin et al. 2005). These methods obtain
a combination of several tagSNPs through training, and
predict the other SNPs in the genotype by the haplotypes of
tagSNPs. However, their performances are limited by the
restrictions such as the fixed number of tagSNPs for each
prediction or the definitions of the informativeness.
In this paper, we propose a new hybrid method to solve
the problem of finding the minimum set of tagSNPs which
is called CMDStagger. Our method adopts the ideas of
block based and LD based methods, using the graph
algorithm to combine the diversity of common haplotypes
and the LD association between SNPs. Unlike the previous
methods, the proposed method does not look for subsets of
SNPs but discard redundant sites by judging whether they
can be represented by the other correlative sites in their
correlative maximum density subgraphs (MDS). It can give
better performance on large data sets by reducing the
running time. The method does not need to define block or
fix on the number of tagSNPs. In addition, the correlative
SNPs in our method are not only the SNPs that have high
LD value with each other, but also the ones that have the
similar haplotype identifying information in their cluster.
Thus, there are no limits such as the information loss and
haplotype identification failure. To reduce the computa-
tional complexity of graph algorithm, the haplotype data
are preprocessed by a clustering algorithm using the LD
association of SNPs, then the graphs are constructed on the
SNP clusters deriving from the preprocess clustering step.
We describe our method, including the introduction of
basic notations, clustering algorithm, graph construction,
and the MDS-based selection algorithm in Sect. 2. Our
experimental results on biological data and the comparisons
with other methods are presented in Sect. 3. In Sect. 4, the
conclusions and future research directions are summarized.
2 TagSNP selection method
2.1 Problem formulation
We first introduce some notations and definitions for our
method. Assume we are given n haploid sequences con-
sisting of m SNPs. Since we are only interested in bi-allelic
SNPs, each haplotype can be represented as a binary string.
The n sequences can form a matrix M of size n*m where
rows are haplotype sequences and columns are SNPs.
M½i; j2f0; 1; g is the allele of the ith sequence at the jth
SNP locus, where 0, 1 represent the major and minor alleles,
and—indicates missing data. To simplify the problem, we
assume no missing data in the sequences in our method.
Through the analysis of SNPs and haplotypes, it can be
discovered that the SNPs distribute in the space according to
their LD values and their related haplotype diversities. The
relative sites gather round and there are usually more sites
around the real tagSNPs than the others except some special
tagSNPs which have no association with any sites. TagSNPs
can usually distinguish all the common haplotypes or at least
most of them as the all SNPs can do. That means the hap-
lotype diversity information of these tagSNPs can represent
the information their related sites have. In order to get
enough genetic information for the future study, we assume
that tagSNPs must have at least two features—high LD value
with other associated SNPs and the common haplotype
identification ability. Thus, we can give the following formal
definition of the tagSNP selection problem in our method.
Problem: Minimum TagSNP Selection (MTS)
Input: An n*m haplotype matrix M
Output: The minimum tagSNP set ST. The tagSNPs in
the set ST satisfy:
(1) Each tagSNP has high LD values with their related
SNP sets;
(2) For each pair of haplotype patterns P
i
and P
j
in the M,
these is a SNP s
i
[ ST such that M [k, i] = M [k, j];
1144 M.-Z. Guo et al.
123