
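Based on the file information provided, the following key points can be summarized:

1. Importance of tagSNP selection: TagSNP selection aims to pick a small, representative subset from a large number of SNPs (single nucleotide polymorphisms) to stand in for the whole SNP set. This is essential to current genomics research. It reduces the cost of genotyping and accelerates genome-wide disease association studies by filtering out a large number of redundant SNPs. Beyond saving cost and time, it helps researchers identify the genetic components of diseases and the variation present in populations.

2. The proposed hybrid method: The paper proposes a new hybrid method called CMDStagger, which combines ideas from clustering and graph algorithms to find the minimum set of tagSNPs. When selecting tagSNPs, the method uses linkage disequilibrium and haplotype diversity information to reduce information loss. CMDStagger places no restriction on block partition.

3. Implementation and testing: The algorithm was tested on eight benchmark datasets from HapMap and chromosome 5q31. The results show that it reduces selection time and obtains fewer tagSNPs while keeping high prediction accuracy, which indicates better performance than earlier methods.

4. Linkage disequilibrium (LD) and haplotype diversity: Both are taken into account during tagSNP selection. Linkage disequilibrium is the nonrandom association between two or more SNPs on a chromosome. Haplotype diversity is the diversity of allele combinations across a set of SNPs, and it matters for understanding the mechanisms of genetic disease, drug response, and other phenotypes. By using this genetic information, the algorithm keeps enough information while reducing the number of tagSNPs, which preserves the accuracy and efficiency of the study.

5. Application background: The method is mainly intended for genome-wide disease association studies. SNPs and haplotypes provide the most complete information for identifying the genetic components of common diseases and the variation among people, and they are a major focus of current genomics research.

6. Principle and advantages: The method uses the maximum density subgraph (MDS) concept from graph algorithms to combine linkage disequilibrium and haplotype diversity. Because it does not restrict block partition, it is more flexible and adaptable. Compared with earlier methods, CMDStagger selects tagSNPs faster and substantially reduces the number of tagSNPs needed while keeping high prediction accuracy.

7. Publication information: The paper was written by Mao-Zu Guo, Jun Wang, Chun-yu Wang, and Yang Liu and published in Soft Comput (2009), volume 13, pages 1143-1151. It was published online on 21 March 2009 with DOI 10.1007/s00500-009-0419-z. The publisher is Springer-Verlag.

8. Research team and institution: The authors, Mao-Zu Guo, Jun Wang, Chun-yu Wang, and Yang Liu, are from the School of Computer Science and Technology, Harbin Institute of Technology. The work combines computational biology, genetics, and data mining, and it is relevant to understanding complex genetic data and disease association.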


FOCUS

A hybrid clustering and graph based algorithm for tagSNP selection

Mao-Zu Guo · Jun Wang · Chun-yu Wang · Yang Liu

Published online: 21 March 2009

© Springer-Verlag 2009

Abstract TagSNP selection, which aims to select a small subset of informative single nucleotide polymorphisms (SNPs) to represent the whole SNP set, plays an important role in current genomic research. It cuts the cost of genotyping by filtering out a large number of redundant SNPs, and it accelerates genome-wide disease association studies. In this paper, we propose a new hybrid method called CMDStagger that combines ideas from clustering and graph algorithms to find the minimum set of tagSNPs. The proposed algorithm uses linkage disequilibrium association and haplotype diversity to reduce the information loss in tagSNP selection, and it does not depend on a block partition. The approach is tested on eight benchmark datasets from HapMap and chromosome 5q31. Experimental results show that the algorithm reduces the selection time and obtains fewer tagSNPs with high prediction accuracy, which indicates that the method performs better than previous ones.

Keywords TagSNP selection · Clustering algorithm · Maximum density subgraph (MDS) · Linkage disequilibrium (LD) · Haplotype diversity

1 Introduction

Genome-wide disease association is a major interest of current genomics research. SNPs and haplotypes, which characterize most of the genetic variation among different people, can provide the most complete information for these association studies, such as identifying the genetic components of common diseases and finding variations among different people.

Single nucleotide polymorphisms (SNPs) are mutations at single nucleotide positions, and haplotypes are sets of associated SNPs located on one chromosome. It has been estimated that there are about ten million common SNPs in the human genome (Sachidanandam et al. 2001). Genotyping and studying all SNPs in a candidate region is costly and time-consuming for a large number of individuals, so it is essential to find useful strategies to improve genotyping performance. SNPs are not independently distributed in the genome. Many SNPs have nonrandom associations with each other, which is called linkage disequilibrium (LD) (Lewontin 1964). The LD association between SNPs causes redundancy in the genetic information. This redundancy makes it possible to overcome the genotyping and genome study problem through SNP selection. In the process of SNP selection, only a subset of informative SNPs (tagSNPs or tagging SNPs) is chosen, and the rest, which carry redundant information, are removed (Johnson et al. 2001). The selected SNPs can accurately represent the remaining SNPs and retain enough information from the original set for further study, while having little information overlap among themselves. The selection of tagSNPs has become a very active research topic, and several computational methods have been proposed in the past few years.

These methods fall into three main categories. The first common approach is based on haplotype blocks. It assumes that the human genome can be delimited as a set of discrete blocks with limited diversity, meaning that within each block a very small set of common haplotypes is shared by most of the population. Under this assumption, these methods first identify haplotype blocks on the chromosomal regions (Daly et al. 2001), then search for the minimum subsets of tagging SNPs within each haplotype block. These tagSNPs can distinguish each pair of common haplotypes (Gabriel et al. 2002), or at least most of them (Zhang et al. 2004). However, block-based methods have the disadvantage that the choice of tagSNPs depends on the block definition. There is no general agreement on how blocks should be formed, and no known way to prevent the loss of accuracy in tagSNP selection caused by ignoring inter-block association.

M.-Z. Guo · J. Wang (✉) · C.-y. Wang · Y. Liu
School of Computer Science and Technology, Harbin Institute of Technology, 150001 Harbin, Heilongjiang, China
e-mail: wjking@hit.edu.cn

Soft Comput (2009) 13:1143–1151
DOI 10.1007/s00500-009-0419-z

In contrast to the block-based methods, the LD-based methods use the pairwise association of SNPs. In these methods, the SNPs in given chromosomal regions are divided into several clusters according to their LD associations (measured by the pairwise $r^2$). Within such a cluster, all SNPs have strong LD association with each other (Carlson et al. 2004; Ao et al. 2005). A minimum set of SNPs is selected from these clusters as the tagSNPs. These SNPs, which are in high LD with the remaining SNPs in their clusters, can represent the other associated SNPs even over long distances. The information overlap among these tagSNPs is very small, which means the chosen SNPs have only very low LD values with each other. However, the tagSNPs found by these methods may lose some important information contained in the remaining SNPs and may fail to distinguish all haplotypes in an LD cluster.
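To make the pairwise LD measure concrete, the following is a minimal Python sketch of the standard $r^2$ statistic for two bi-allelic SNPs coded as 0/1 columns. The function name `r_squared` and the convention of returning 0 for a monomorphic site are assumptions made for illustration; they are not taken from the paper.

```python
def r_squared(snp_a, snp_b):
    """Pairwise LD r^2 between two bi-allelic SNPs.

    Each argument is a list of 0/1 alleles, one entry per haplotype.
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("SNP columns must be non-empty and of equal length")
    p_a = sum(snp_a) / n  # frequency of allele 1 at the first SNP
    p_b = sum(snp_b) / n  # frequency of allele 1 at the second SNP
    # frequency of haplotypes carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # assumed convention: a monomorphic site carries no LD signal
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    return d * d / denom
```

Two identical (or exactly complementary) columns give a value of 1, and two columns whose alleles co-occur at the rate expected under independence give 0.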

Bafna et al. (2003) proposed a different approach based on prediction accuracy. It assumes that tagSNPs can reconstruct the remaining SNPs of an unknown sample with high accuracy. These methods introduce a measure called informativeness to quantify how well the selected SNPs predict the unselected SNPs and reconstruct the complete haplotypes (Halldorsson et al. 2004; Halperin et al. 2005). They obtain a combination of several tagSNPs through training and predict the other SNPs in a genotype from the haplotypes of the tagSNPs. However, their performance is limited by restrictions such as a fixed number of tagSNPs for each prediction or the particular definition of informativeness.

In this paper, we propose a new hybrid method, called CMDStagger, to solve the problem of finding the minimum set of tagSNPs. Our method adopts ideas from both block-based and LD-based methods, using a graph algorithm to combine the diversity of common haplotypes with the LD association between SNPs. Unlike previous methods, the proposed method does not search for subsets of SNPs; instead, it discards redundant sites by judging whether they can be represented by the other correlated sites in their corresponding maximum density subgraphs (MDS). This gives better performance on large data sets by reducing the running time. The method does not need to define blocks or fix the number of tagSNPs. In addition, the correlated SNPs in our method are not only the SNPs that have high LD values with each other, but also the ones that carry similar haplotype-identifying information within their cluster. Thus, the method avoids limitations such as information loss and haplotype identification failure. To reduce the computational complexity of the graph algorithm, the haplotype data are preprocessed by a clustering algorithm that uses the LD association of SNPs, and the graphs are then constructed on the SNP clusters produced by this preprocessing step.

Section 2 describes our method, including the basic notation, the clustering algorithm, graph construction, and the MDS-based selection algorithm. Section 3 presents our experimental results on biological data and comparisons with other methods. Section 4 summarizes the conclusions and future research directions.

2 TagSNP selection method

2.1 Problem formulation

We first introduce some notation and definitions for our method. Assume we are given n haploid sequences consisting of m SNPs. Since we are only interested in bi-allelic SNPs, each haplotype can be represented as a binary string. The n sequences form a matrix M of size n × m, where rows are haplotype sequences and columns are SNPs. $M[i,j] \in \{0, 1, -\}$ is the allele of the ith sequence at the jth SNP locus, where 0 and 1 represent the major and minor alleles, and - indicates missing data. To simplify the problem, we assume there are no missing data in the sequences.
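As an illustration of this representation, the sketch below builds the 0/1 matrix from haplotypes given as equal-length allele strings. The helper name `encode_haplotypes`, the string input format, and the tie-breaking rule (when both alleles are equally frequent, the one seen first counts as major) are assumptions for the example only.

```python
from collections import Counter


def encode_haplotypes(sequences):
    """Turn equal-length allele strings into an n x m matrix of 0 (major) / 1 (minor)."""
    n, m = len(sequences), len(sequences[0])
    if any(len(s) != m for s in sequences):
        raise ValueError("all haplotypes must cover the same m SNPs")
    matrix = [[0] * m for _ in range(n)]
    for j in range(m):
        column = [s[j] for s in sequences]
        counts = Counter(column)
        if len(counts) > 2:
            raise ValueError(f"SNP {j} is not bi-allelic")
        # most_common lists equal counts in first-seen order (Python 3.7+)
        major = counts.most_common(1)[0][0]
        for i, allele in enumerate(column):
            matrix[i][j] = 0 if allele == major else 1
    return matrix
```

For example, the three haplotypes `"AC"`, `"AT"`, and `"GC"` become the rows `[0, 0]`, `[0, 1]`, and `[1, 0]`.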

Analysis of SNPs and haplotypes shows that SNPs are distributed in space according to their LD values and their related haplotype diversities. Related sites cluster together, and there are usually more sites around the real tagSNPs than around other sites, except for some special tagSNPs that have no association with any other site. TagSNPs can usually distinguish all the common haplotypes, or at least most of them, as well as the full SNP set can. This means the haplotype diversity information of these tagSNPs can represent the information carried by their related sites. To retain enough genetic information for future study, we assume that tagSNPs must have at least two features: high LD values with their associated SNPs, and the ability to identify common haplotypes. We can therefore give the following formal definition of the tagSNP selection problem in our method.

Problem: Minimum TagSNP Selection (MTS)

Input: An n × m haplotype matrix M

Output: The minimum tagSNP set ST. The tagSNPs in the set ST satisfy:

(1) Each tagSNP has high LD values with its related SNP set;

(2) For each pair of haplotype patterns $P_i$ and $P_j$ in M, there is a SNP $s_k \in ST$ such that $M[i,k] \neq M[j,k]$;
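Reading condition (2) as a requirement that every pair of distinct haplotype patterns differs at one or more selected SNPs, the check can be sketched in Python as follows. The function names are hypothetical, the sketch treats every distinct row of M as a pattern (the paper speaks of common haplotypes), it ignores the LD condition (1), and the exhaustive search is a brute-force illustration for tiny inputs, not the CMDStagger algorithm.

```python
from itertools import combinations


def distinguishes_all(matrix, tags):
    """True if every pair of distinct haplotype patterns differs at some tag column."""
    patterns = {tuple(row) for row in matrix}  # distinct full patterns
    projected = {tuple(p[k] for k in tags) for p in patterns}
    # if two patterns agreed on every tag, their projections would collapse
    return len(projected) == len(patterns)


def smallest_distinguishing_set(matrix):
    """Exhaustive search over column subsets; exponential in m, tiny inputs only."""
    m = len(matrix[0])
    for size in range(m + 1):
        for tags in combinations(range(m), size):
            if distinguishes_all(matrix, tags):
                return list(tags)
```

On the matrix with rows `[0, 0, 0]`, `[0, 1, 1]`, `[1, 0, 1]`, `[1, 1, 0]`, no single column separates all four patterns, while columns 0 and 1 together do.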
