fasta-35.3.6.tar.gz_Waterman_fasta_fastaprogram_sequencealignm resource (CSDN Library)

116 views, uploaded 2022-09-14 15:23:39, 600KB GZ

267 files in total, including 67 .c files, 31 .h files, and 29 .aa files

fasta-35.3.6.tar.gz_Waterman_fasta_fasta program_sequence alignm (267 files)

fasta35.1 11KB

fasta3.1 10KB

pvcomp.1 7KB

prss3.1 5KB

fastf3.1 5KB

fasts3.1 4KB

map_db.1 948B

ps_lav.1 645B

readme.pvm_3.2 1KB

readme.pvm_3.3 7KB

readme.mpi_3.3 2KB

readme.pvm_3.4 3KB

titin_hum.aa 34KB

mwkw.aa 2KB

myosin_bp.aa 1KB

egmsmg.aa 1KB

qrhuld.aa 914B

n2.aa 692B

mwrtc1.aa 500B

oohu.aa 385B

prio_atepa.aa 340B

xurt8c.aa 302B

gtm1_human.aa 300B

gtt1_drome.aa 291B

gstt1_drome.aa 291B

mgstm1.aa 284B

xurtg.aa 281B

musplfm.aa 275B

lcbo.aa 271B

h10_human.aa 247B

n2t.aa 243B

hahu.aa 225B

ngt.aa 217B

mchu.aa 212B

n2s.aa 178B

ngts.aa 111B

m1r.aa 56B

m2.aa 50B

n1.aa 47B

ms1.aa 43B

n0.aa 26B

mgstm1.aaa 310B

gstt1_pssm.asn1 56KB

test.bat 3KB

test2.bat 3KB

smith_waterman_altivec.c 111KB

scaleswn.c 72KB

dropfz2.c 72KB

dropfx.c 68KB

comp_lib2.c 66KB

initfa.c 59KB

p2_complib2.c 56KB

nmgetlib.c 54KB

dropnfa.c 47KB

ncbl2_mlib.c 45KB

dropfs2.c 44KB

p2_workcomp2.c 37KB

scaleswt.c 37KB

dropff2.c 36KB

dropgsw2.c 33KB

pssm_asn_subs.c 31KB

lsim4.c 24KB

mshowalign.c 23KB

compacc.c 22KB

mmgetaa.c 21KB

mshowbest.c 18KB

pgsql_lib.c 17KB

dropnnw.c 16KB

mysql_lib.c 16KB

res_stats.c 16KB

lav2ps.c 14KB

lav2svg.c 14KB

faatran.c 14KB

karlin.c 13KB

tatstats.c 13KB

ncbl_lib.c 12KB

showsum.c 12KB

smith_waterman_sse2.c 12KB

c_dispn.c 11KB

cal_consf.c 11KB

cal_cons.c 11KB

print_pssm.c 11KB

map_db.c 11KB

apam.c 11KB

llgetaa.c 10KB

doinit.c 10KB

dropnsw.c 10KB

pthr_subs2.c 9KB

getseq.c 9KB

wm_align.c 8KB

work_thr2.c 8KB

lib_sel.c 7KB

dec_pthr_subs.c 7KB

workacc.c 5KB

list_db.c 5KB

FileDlog.c 5KB

last_tat.c 4KB

checkevent.c 3KB

url_subs.c 3KB

ag_stats.c 3KB

FASTA program and documentation may not be sold or incorporated
into a commercial product, in whole or in part, without written
consent of William R. Pearson and the University of Virginia.
For further information regarding permission for use or
reproduction, please contact: David Hudson, Assistant Provost for
Research, University of Virginia, P.O. Box 9025, Charlottesville,
VA 22906-9025, (434) 924-6853

The FASTA program package

Introduction

This documentation describes version 2.0x of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS
85:2444-2448, and W. R. Pearson (1990) "Rapid and Sensitive
Sequence Comparison with FASTP and FASTA", Methods in Enzymology
183:63-98). Version 2.0 modifies version 1.8 to include explicit
statistical estimates for similarity scores based on the extreme
value distribution. In addition, FASTA protein alignments now
use the Smith-Waterman algorithm with no limitation on gap size.
FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
options to change gap penalties on the command line. Version 1.7
replaces rdf2 and rss with prdf and prss, which use the
extreme-value distribution to calculate accurate probability
estimates.

Although there are a large number of programs in this package,
they belong to four groups:

Library search programs:   FASTA, FASTX, TFASTA, TFASTX, SSEARCH
Local homology programs:   LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
Statistical significance:  PRDF, RELATE, PRSS, RANDSEQ
Global alignment:          ALIGN

In addition, I have included several programs for protein
sequence analysis, including a Kyte-Doolittle hydropathicity
plotting program (GREASE, TGREASE), and a secondary structure
prediction package (GARNIER).

The FASTA sequence comparison programs on this disk are
improved versions of the FASTP program, originally described in
Science (Lipman and Pearson, (1985) Science 227:1435-1441). We
have made several improvements. First, the library search
programs use a more sensitive method for the initial comparison
of two sequences which allows the scores of several similar
regions to be combined. Consequently, a library search now
reports three scores: initn (the new initial score, which may
include several similar regions), init1 (the old fastp initial
score from the best initial region), and opt (the old fastp
optimized score allowing gaps in a 32 residue wide band).

These programs have also been modified to become "universal"
(hence FAST-A, for FAST-All, as opposed to FAST-P (protein) or
FAST-N (nucleotides)); by changing the environment variable
SMATRIX, the programs can be used to search protein sequences,
DNA sequences, or whatever you like. By default, FASTA, LFASTA,
and the PRDF programs automatically recognize protein and DNA
sequences. Sequences are first read as amino acids, and then
converted to nucleotides if more than 85% of the residues are
A, C, G, or T (the '-n' option can be used to indicate DNA
sequences). TFASTA compares protein sequences to a translated
DNA sequence.
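The 85% rule above can be sketched in a few lines of Python. This
is only an illustration of the rule as stated, not the package's
own C code; the function name looks_like_dna and the decision to
ignore non-letter characters are assumptions made for the example.

```python
def looks_like_dna(sequence: str, threshold: float = 0.85) -> bool:
    """Return True if more than `threshold` of the letters are A, C, G or T.

    Hypothetical helper: the name and the handling of non-letter
    characters are assumptions, not taken from the FASTA sources.
    """
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        return False  # nothing to classify
    acgt = sum(1 for ch in letters if ch in "ACGT")
    return acgt / len(letters) > threshold


if __name__ == "__main__":
    print(looks_like_dna("ACGTACGTAC"))           # all residues are A, C, G, T
    print(looks_like_dna("MKVLAAGIVGLLLAQPSFA"))  # mostly other amino acid codes
```

A sequence with many ambiguity codes (for example long runs of N)
would fall below the threshold in this sketch, which is the kind
of case where stating the sequence type explicitly with '-n' helps.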

Alternative scoring matrices can also be used. In addition to
the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
based on simple identities or the genetic code can also be used
for sequence comparisons or evaluation of significance. Several
different protein sequence matrices have been included;
instructions for constructing your own scoring matrix are
included in the file FORMAT.DOC.

The remainder of this document is divided into three sections:
(1) a brief history of the changes to the FASTA package; (2) a
guide to installing the programs and databases; (3) a guide to
using the FASTA programs. The programs are very easy to use, so
if you are using them on a machine that is administered by
someone else, you may want to skip to section (3) to learn how to
use the programs, and then read section (1) to look at some of
the more recent changes. If you are installing the programs on
your own machine, you will need to read section (2) carefully.

1. Revision History

1.1. Changes with version 2.0u

Version 2.0u provides several major improvements over
previous versions of FASTA (and SSEARCH). The most important is
the incorporation of explicit statistical estimates and
appropriate normalization of similarity scores. This improvement
is discussed in more detail below in the section entitled
Statistical Significance. In addition, all of the protein
comparison programs now use the BLOSUM50 matrix, with gap
penalties of -12, -2, by default. BLOSUM50 performs
significantly better than the older PAM250 matrix. PAM250 can
still be used with the command line option: -s 250. (DNA
sequence comparisons use a more stringent gap penalty of -16, -4,
which produces excellent statistical estimates when optimized
scores are used. TFASTA uses -16, -4 as well.)

The quality of the fit of the extreme value distribution to
the actual distribution of similarity scores is summarized with
the Kolmogorov-Smirnov statistic. The acceptance limits for this
statistic can be found in many statistics books. In general,
values <0.10 (N=30) indicate excellent agreement between the
actual and theoretical distributions. If this statistic is >
0.2, consider using a higher (more stringent) gap penalty, e.g.
-16, -4 rather than -12, -2. The default scoring matrix for DNA
has been changed to score +5 for an identity and -4 for a
mismatch. These are the same scores used by BLASTN.
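As a rough illustration of what the Kolmogorov-Smirnov statistic
measures, the sketch below compares a list of scores against an
extreme value (Gumbel) distribution and returns the largest gap
between the empirical and theoretical cumulative distributions.
The location/scale parameterization (mu, beta), the function
names, and the use of raw rather than binned scores are all
assumptions; how the programs fit and bin their scores is not
described here.

```python
import math


def gumbel_cdf(x: float, mu: float, beta: float) -> float:
    """CDF of the extreme value (Gumbel) distribution; mu and beta are assumed names."""
    return math.exp(-math.exp(-(x - mu) / beta))


def gumbel_quantile(p: float, mu: float, beta: float) -> float:
    """Inverse of gumbel_cdf, used here only to build example data."""
    return mu - beta * math.log(-math.log(p))


def ks_statistic(scores, mu: float, beta: float) -> float:
    """Largest distance between the empirical CDF of `scores` and the Gumbel CDF."""
    xs = sorted(scores)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = gumbel_cdf(x, mu, beta)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    mu, beta, n = 30.0, 6.0, 30
    # Scores placed exactly on the theoretical quantiles fit almost perfectly.
    good = [gumbel_quantile((i + 0.5) / n, mu, beta) for i in range(n)]
    print(round(ks_statistic(good, mu, beta), 4))
    # Shifting every score upward makes the fit visibly worse.
    print(round(ks_statistic([s + 10.0 for s in good], mu, beta), 4))
```

Values near zero mean the observed scores follow the theoretical
curve closely; the thresholds of 0.10 and 0.2 quoted above apply
to the statistic as the programs compute it, which may differ in
detail from this sketch.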

With explicit expectation calculations, the program now
shows all scores and alignments with expectations less than 10.0
(with optimized scores, 2.0 without optimization) when the "-Q"
(quiet) mode is used. The expectation threshold can be changed
with the "-E" option.

Finally, the algorithm used to produce the final alignments
of protein sequences is now a full Smith-Waterman, with unlimited
gaps. (The older band-limited alignments are used for DNA
sequences and TFASTA by default, because Smith-Waterman
alignments are very slow for long sequences.) Both the optimized
and Smith-Waterman scores are reported; if the Smith-Waterman
score is higher, then additional gaps allowed a better alignment
and similarity score to be calculated.
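The difference between a band-limited score and a full
Smith-Waterman score can be seen in a small sketch. The code
below is not the package's implementation; it is a textbook
affine-gap local alignment score with an optional band around the
main diagonal. It uses the +5/-4 identity scoring quoted above
instead of BLOSUM50 to stay short, and it assumes that a penalty
pair such as "-12, -2" means 12 for the first residue in a gap and
2 for each additional residue. The function name and the meaning
of `band` (maximum distance from the main diagonal) are
assumptions for the example.

```python
NEG_INF = float("-inf")


def smith_waterman_score(a, b, match=5, mismatch=-4,
                         gap_first=12, gap_extend=2, band=None):
    """Best local alignment score of a and b with affine gap costs.

    band=None scores the full matrix; band=k only visits cells
    within k of the main diagonal (a simplified band-limited search).
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]        # best score ending at (i, j)
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ending in a gap in a
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ending in a gap in b
    best = 0
    for i in range(1, m + 1):
        if band is None:
            lo, hi = 1, n
        else:
            lo, hi = max(1, i - band), min(n, i + band)
        for j in range(lo, hi + 1):
            E[i][j] = max(H[i][j - 1] - gap_first, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_first, F[i - 1][j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    a = "ACGTTGCAGGATCCAT"
    b = a[:8] + "TTT" + a[8:]                  # same sequence with a 3-residue insert
    print(smith_waterman_score(a, b))          # full matrix can bridge the insert
    print(smith_waterman_score(a, b, band=1))  # narrow band cannot reach it
```

In this example the full calculation scores higher than the
narrow band because it can pay for the three-residue gap and
continue the alignment, which is the situation described above
when the Smith-Waterman score exceeds the optimized score.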

FASTA searches now optimize similarity scores by default
(this slows searches about 2-fold (worst case) for ktup=2). Thus,
the meaning of the "-o" option has been reversed; "-o" now turns
off optimization and reports results sorted by "initn" scores.
Optimization significantly improves the sensitivity of FASTA, so
that it almost matches Smith-Waterman. With version 2.0, the
default band width used for optimized calculations can be varied
with the "-y" option. For proteins with ktup=2, a width of 16
(-y 16) is used; 16 is also used for DNA sequences. For proteins
and ktup=1, a width of 32 is used. Searches that disable
optimization with the "-o" option will work fine for sequences
that share 25% or more identity in general, but to detect
evolutionary relationships with 20% - 25% identity, the more
sensitive default optimization is often required. Optimization
is required for accurate statistical estimates with either
protein or DNA sequences.

The FASTA package now includes FASTX, a program that
compares a DNA sequence to a protein sequence database by
translating the DNA sequence in three frames (the reverse frames
are selected with the -i option) and aligning the three-frame
translation with the sequences in the protein database.
Alignment scores allow frameshifts so that a cDNA or EST sequence
with insertion/deletion errors can be aligned with its homologues
from beginning to end.
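To make the three-frame translation concrete, here is a minimal
sketch that translates a DNA string in frames 0, 1 and 2 using
the standard genetic code, optionally on the reverse complement.
It covers only the translation step; the frameshift-tolerant
alignment scoring is not modelled. The function names and the use
of "X" for codons containing unexpected characters are
assumptions for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ('*' marks a stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Reverse-complement a DNA string (only A, C, G, T are complemented)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_frame(dna: str, frame: int) -> str:
    """Translate starting at offset `frame` (0, 1 or 2); trailing bases are dropped."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna: str, reverse: bool = False):
    """Return the three translations of one strand as a list of strings."""
    if reverse:
        dna = reverse_complement(dna)
    return [translate_frame(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    print(three_frame_translation("ATGGCCTAA"))
    print(three_frame_translation("ATGGCCTAA", reverse=True))
```

The `reverse` flag in this sketch plays the role that the -i
option plays for FASTX: it selects the frames of the opposite
strand.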

With release 20u6, there is also a TFASTX program, which is
a replacement for TFASTA. TFASTA treats each of the six reading
frames of a DNA library sequence as a different sequence; TFASTX
compares a protein sequence against only two sequences from each
DNA sequence - the forward and reverse orientation. For a given
orientation, TFASTX calculates a similarity score for alignments
that allow frameshifts, thus considering all possible reading
frames.

Another new program is included - randseq - which will
produce a randomly shuffled sequence (uniform or local shuffle)
from an input sequence. This randomly shuffled sequence can be
used to evaluate the statistical estimates produced by FASTA,
SSEARCH, or BLAST.
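The two kinds of shuffle can be sketched as follows. A uniform
shuffle permutes the whole sequence, preserving only overall
composition; a local shuffle permutes residues inside consecutive
windows, so local composition is preserved as well. This is an
illustration rather than randseq's code: the function names, the
window size, and the use of non-overlapping windows are
assumptions.

```python
import random


def uniform_shuffle(seq: str, rng: random.Random) -> str:
    """Permute every residue of seq; overall composition is unchanged."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def local_shuffle(seq: str, window: int, rng: random.Random) -> str:
    """Permute residues within consecutive windows of `window` residues."""
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    seq = "MKVLAAGIVGLLLAQPSFAHHHHHHDEDEDEDE"
    print(uniform_shuffle(seq, rng))
    print(local_shuffle(seq, 10, rng))
```

Searching with a shuffled query should produce no significant
matches, so the scores it gets give a feel for how the reported
statistical estimates behave on unrelated sequences.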

1.2. Changes with version 1.7

Version 1.7 has been released to provide the PRDF and PRSS
programs for shuffling sequences and estimating accurately the
probabilities of the unshuffled-sequence scores.

PRDF    a version of RDF2 that calculates the probability
        of a similarity score more accurately by using a fit to
        an extreme value distribution. Code to fit the extreme
        value distribution parameters and the impetus to update
        RDF2 was provided by Phil Green, U. of Washington.

PRSS    a version of PRDF that uses a rigorous Smith-Waterman
        calculation to score similarities.

1.3. Changes with version 1.6

FASTA version 1.6 uses a new method for calculating optimal
scores in a band (the optimization or last step in the FASTA
algorithm). In addition, it uses a linear-space method for
calculating the actual alignments. The FASTA v1.6 package
includes several new programs:

SSEARCH a program to search a sequence database using the
        rigorous Smith-Waterman algorithm (for proteins, this
        program is about 100-fold slower than FASTA with
        ktup=2).

LALIGN  a rigorous local sequence alignment program that will
        display the N-best local alignments (N=10 by default).

PLALIGN a version of lalign that plots the local alignments to
        a tektronix display.

FLALIGN a version of lalign that plots the local alignments to
        a GCG Figure file.

The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
algorithm described by Huang and Miller (1991) Adv. Appl. Math.
12:337-357. The SSEARCH and PRSS programs incorporate algorithms
described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.

LFASTA and PLFASTA now calculate a different number of local
similarities; they now behave more like LALIGN/PLALIGN. Since
local alignments of identical sequences produce "mirror-image"
alignments, lalign and lfasta consider only one-half of the
potential alignments between sequences from identical file names.
Thus

        lfasta mchu.aa mchu.aa

displays only two alignments; with earlier versions of the
program, it would have displayed five, including the identity
alignment. PLFASTA does display five alignments; when two
identical filenames are given, it draws the identity alignment,
calculates the two unique local alignments, draws them, and draws
their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
filenames, rather than the actual sequences, to determine whether
sequences are identical; you can "trick" the programs into
behaving the old way by putting the same sequence in two
different files.

1.4. Changes with version 1.5

FASTA version 1.5 includes a number of substantial revisions
to improve the performance and sensitivity of the program. It is
now possible to tell the program to optimize all of the initn
scores greater than a threshold. The threshold is set at the
same value as the old FASTA cutoff score. Alternatively, you can
tell FASTA to sort the results by the init1, rather than the
initn, score by using the -1 option. FASTA -1 ... will report
the results the way the older FASTP program did.

A new method has been provided for selecting libraries. In
the past, one could enter the name of a sequence file to be
searched or a single letter that would specify a library from the
list included in the $FASTLIBS file. Now, you can specify a set
of library files with a string of letters preceded by a '%'.
Thus, if the FASTLIBS file has the lines:

        Genbank 70 primates$1P/seqlib/gbpri.seq 1
        Genbank 70 rodents$1R/seqlib/gbrod.seq 1
        Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
        Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1

Then the string: "%PRMB" would tell FASTA to search the four
libraries listed above. The %PRMB string can be entered either
on the command line or when the program asks for a filename or
library letter.
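The way a string such as %PRMB picks out library files can be
sketched as below. The parsing is inferred from the four example
lines only: text before '$' is taken as a description, the next
character is kept as an uninterpreted flag, the character after
it as the library letter, followed by the file name and a
trailing field. That layout, the field names, and both function
names are assumptions for the example; this section does not
define the FASTLIBS format.

```python
EXAMPLE_FASTLIBS = [
    "Genbank 70 primates$1P/seqlib/gbpri.seq 1",
    "Genbank 70 rodents$1R/seqlib/gbrod.seq 1",
    "Genbank 70 other mammals$1M/seqlib/gbmam.seq 1",
    "Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1",
]


def parse_fastlibs(lines):
    """Map each library letter to its description, flag, path and trailing field."""
    libraries = {}
    for line in lines:
        line = line.strip()
        if "$" not in line:
            continue  # not a library entry under the assumed layout
        description, rest = line.split("$", 1)
        if len(rest) < 3:
            continue  # too short to hold flag, letter and a file name
        flag, letter = rest[0], rest[1]
        fields = rest[2:].split()
        if not fields:
            continue  # no file name after the library letter
        libraries[letter] = {
            "description": description.strip(),
            "flag": flag,                    # meaning not interpreted here
            "path": fields[0],
            "trailing": fields[1] if len(fields) > 1 else None,
        }
    return libraries


def select_libraries(spec, libraries):
    """Expand a '%'-prefixed string of letters into a list of file paths."""
    if not spec.startswith("%"):
        raise ValueError("expected a string of library letters preceded by '%'")
    return [libraries[letter]["path"] for letter in spec[1:]]


if __name__ == "__main__":
    libs = parse_fastlibs(EXAMPLE_FASTLIBS)
    for path in select_libraries("%PRMB", libs):
        print(path)
```

The letters are expanded in the order given, so %PRMB and %BMRP
would select the same four files in a different order in this
sketch.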

FASTA1.5 also provides additional flexibility for specifying
