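cd-hit is a very fast, easy-to-use clustering program written by a Chinese author (Weizhong Li); speed is its defining feature. The basic idea is to sort all sequences by length, start from the longest sequence to form the first cluster, and then process the remaining sequences in turn: if a new sequence's similarity to the representative of an existing cluster is above the cutoff, it is added to that cluster; otherwise it forms a new cluster. This guide is intended for learning cd-hit.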
CD-HIT User’s Guide
Last updated: April 5, 2010
http://cd-hit.org
http://bioinformatics.org/cd-hit/
Program developed by Weizhong Li’s lab at UCSD
http://weizhong-lab.ucsd.edu liwz@sdsc.edu
1. Introduction
2. Algorithm
2.1. cd-hit clustering algorithm
2.2. algorithm limitations
2.3. cd-hit-2d comparing algorithm
2.4. DNA/RNA clustering & comparing
2.5. psi-cd-hit algorithm
2.6. cd-hit-454
3. User’s Guide
3.1. installation
3.2. cd-hit
3.3. cd-hit-2d
3.4. cd-hit-est
3.5. cd-hit-est-2d
3.6. multi-threaded programs
3.7. cd-hit-para.pl, cd-hit-2d-para.pl
3.8. psi-cd-hit.pl
3.9. psi-cd-hit-2d.pl
3.10. incremental clustering
3.11. hierarchical clustering
3.12. cd-hit-454
4. CD-HIT tools
4.1. plot_len.pl
4.2. clstr_sort.pl
4.3. clstr_merge.pl
4.4. clstr_renumber.pl
4.5. clstr_rev.pl
5. CD-HIT web server
6. FAQ
7. References
Introduction
CD-HIT was originally a protein clustering program. The main advantage of this program is
its ultra-fast speed. It can be hundreds of times faster than other clustering programs, for
example, BLASTCLUST. Therefore it can handle very large databases, like NR.

The 1st version of this program, CD-HI, was published and released in 2001. The 2nd version,
called CD-HIT, was published in 2002 with significant improvements. Since 2004, CD-HIT
has been hosted at bioinformatics.org as an open source project.

Since its release, CD-HIT has become more and more popular. It has a significant user
base, which I estimate at several thousand users. It is used at many research and
educational institutions. For example, at UniProt, CD-HIT is used to generate the UniRef
reference data sets (http://www.pir.uniprot.org/database/DBDescription.shtml). It is also
used in PDB to treat redundant sequences (http://rutgers.rcsb.org/pdb/redundancy.html).

In 2006, the 3rd major update was published and released, with the ability to perform
various jobs such as clustering a protein database, clustering a DNA/RNA database, comparing
two databases (protein or DNA/RNA), generating protein families, and many others.

The CD-HIT web server was implemented in 2009. It allows users to cluster or
compare sequences without using the command-line CD-HIT. The server provides an interactive
interface and additional visualization tools. It also provides pre-calculated and regularly
updated sequence clusters for several widely used databases.

CD-HIT-454, a special version of CD-HIT, was implemented in 2010 to cluster artificially
duplicated reads in pyrosequencing (454) data.

Currently, the CD-HIT package has many programs: cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d,
cd-hit-para, cd-hit-2d-para, psi-cd-hit, psi-cd-hit-2d, cd-hit-454. I also developed some
utility tools, written in Perl, to help run and analyze CD-HIT jobs.

This program is still under active development; new features and new programs will be
released in the future.
Algorithm
Algorithms for CD-HIT were described in three papers published in Bioinformatics.
1. Clustering of highly homologous sequences to reduce the size of large protein databases.
Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2001) 17:282-283.
2. Tolerating some redundancy significantly speeds up clustering of large protein databases.
Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2002) 18:77-82.
3. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide
sequences. Weizhong Li & Adam Godzik. Bioinformatics (2006) 22:1658-1659.
I suggest that you read these papers if (1) you want to understand more details about the
algorithm or (2) you want to know why it is so fast. If you don't have time to read these
papers, the algorithms are summarized below. The CD-HIT web server and CD-HIT-454 are
described in the following two papers:
4. CD-HIT Suite: a web server for clustering and comparing biological sequences.
Ying Huang, Beifang Niu, Ying Gao, Limin Fu & Weizhong Li. Bioinformatics (2010) 26:680.
5. Artificial and natural duplicates in pyrosequencing reads of metagenomic data.
Beifang Niu, Limin Fu, Shulei Sun & Weizhong Li. BMC Bioinformatics (2010), accepted.
CD-HIT clustering algorithm
Clustering a sequence database requires all-by-all comparisons; therefore it is very
time-consuming. Many methods use BLAST to compute the all vs. all similarities. It is very
difficult for these methods to cluster large databases. CD-HIT, in contrast, can avoid many
pairwise sequence alignments with a short word filter I developed.

In CD-HIT, I use a greedy incremental clustering algorithm. Briefly, sequences are
first sorted in order of decreasing length. The longest one becomes the representative of
the first cluster. Then, each remaining sequence is compared to the representatives of
existing clusters. If the similarity with any representative is above a given threshold, it is
grouped into that cluster. Otherwise, a new cluster is defined with that sequence as the
representative.
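The greedy incremental procedure can be sketched in a few lines of Python. This is only an illustration of the steps described above, not CD-HIT's actual code: the function names are hypothetical, `toy_identity` is a stand-in for a real alignment-based identity, and the threshold test is treated here as "at or above".

```python
from typing import Callable, Dict, List


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in for a real alignment: the fraction of positions
    in the shorter sequence that match the longer one when left-aligned."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0
    matches = sum(1 for x, y in zip(shorter, longer) if x == y)
    return matches / len(shorter)


def greedy_cluster(
    seqs: Dict[str, str],
    threshold: float,
    identity: Callable[[str, str], float] = toy_identity,
) -> List[List[str]]:
    """Return clusters as lists of sequence names.

    The first name in each list is the cluster representative.
    """
    # Sort by decreasing length; the name is a tie-breaker for determinism.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters: List[List[str]] = []
    for name in order:
        for cluster in clusters:
            representative = cluster[0]
            if identity(seqs[name], seqs[representative]) >= threshold:
                cluster.append(name)  # join the first matching cluster
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([name])
    return clusters


example = {
    "s1": "MKTAYIAKQRQISFVKSHFSRQ",
    "s2": "MKTAYIAKQRQISFVKSH",
    "s3": "GSHMLEDPVDAFKLAAALEHHH",
    "s4": "GSHMLEDPVDAFK",
}
clusters = greedy_cluster(example, threshold=0.9)
print(clusters)
```

Because each sequence is compared only with cluster representatives, the number of comparisons grows with the number of clusters rather than with the number of sequences already processed.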
Here is how the short word filter works. Two proteins with a certain sequence identity
must have at least a specific number of identical dipeptides, tripeptides, etc. For
example, for two sequences to have 85% identity over a 100-residue window, they have to
have at least 70 identical dipeptides, 55 identical tripeptides, and 25 identical ...
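The figures in this example can be reproduced with simple counting. The sketch below is an assumption about the arithmetic, not code taken from CD-HIT: it treats the window as holding roughly `window` word positions and lets each mismatch spoil at most `k` of them. The function name `min_shared_words` is hypothetical.

```python
def min_shared_words(window: int, identity_percent: int, k: int) -> int:
    """Approximate lower bound on identical k-letter words in an aligned window.

    Assumption: about `window` word positions exist (edge effects ignored)
    and every mismatched residue can spoil at most k of them.
    """
    # Largest number of mismatches compatible with the required identity.
    max_mismatches = window * (100 - identity_percent) // 100
    return max(0, window - k * max_mismatches)


for k, label in [(2, "dipeptides"), (3, "tripeptides")]:
    print(label, min_shared_words(100, 85, k))
```

With a 100-residue window at 85% identity there are at most 15 mismatches, so the bound is 100 - 2*15 = 70 for dipeptides and 100 - 3*15 = 55 for tripeptides, matching the numbers quoted above; the same arithmetic gives 25 for words of length 5. A stricter count that uses the exact number of word positions (window - k + 1) would give slightly lower bounds. A pair of sequences that shares fewer words than the bound cannot reach the identity threshold, so the filter can skip its alignment.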