COGsoft.201204_doctorgza_cog source code resource - CSDN Library

35 files in total

h: 14

cpp: 14

makefile: 5

4 views, uploaded 2021-10-04 00:31:28, 220KB RAR

COGsoft.201204.rar (35 files)

COGmakehash

ereader.cpp 2KB

COGmakehash.cpp 2KB

Makefile 324B

ereader.h 1KB

Readme.2012.04.txt 21KB

COGlse

reader.h 4KB

logger.h 1KB

graph.h 5KB

reader.cpp 4KB

Makefile 281B

mklse.cpp 14KB

COGcognitor

enum.h 635B

cognitor.cpp 27KB

main.cpp 11KB

cognitorglob.h 8KB

os.h 524B

cognitor.h 5KB

Makefile 434B

enum.cpp 2KB

COGreadblast

bc.cpp 5KB

blastconvglob.cpp 4KB

blastconv.h 2KB

reader.h 2KB

cogreadblast.cpp 12KB

reader.cpp 4KB

blastconv.cpp 4KB

bc.h 2KB

Makefile 640B

blastconvglob.h 545B

COGtriangles

cogmaker.cpp 19KB

cogtriangles.cpp 6KB

cogmaker.h 3KB

os.h 602B

COGtriangles.reformat.pl 11KB

Makefile 396B

############################################################
0. General remarks.
############################################################

This is the April 2012 release of the COG software from Eugene Koonin's group, featuring the EdgeSearch algorithm.

#-----------------------------------------------------------
0.1. Disclaimers.

This version of the COG-related software is part of a research project of Eugene Koonin's group at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), at the National Institutes of Health (NIH).

This is not an official NCBI software product. The software is distributed "as is" with no stated or implied warranty whatsoever. Although we will make a reasonable effort to help if problems arise, we are in no way committed to supporting or maintaining this software.

#-----------------------------------------------------------
0.2. Credits, contacts etc.

COG software:

Credits:
    David M. Kristensen, NCBI
    Alexander V. Sorokin, NCBI
    Pavel S. Novichkov, NCBI
    Yuri I. Wolf, NCBI
    Eugene V. Koonin, NCBI

Contact:
    Yuri I. Wolf <[email protected]>

References:
    Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, Koonin EV, Mushegian A
    A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches.
    Bioinformatics. 2010
    http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq229?ijkey=zD7TIWnncGvDGYE&keytype=ref

Other papers that heavily rely on COGs, constructed with this software:
    Makarova KS, Sorokin AV, Novichkov PS, Wolf YI, Koonin EV
    Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea.
    Biol Direct. 2, 33, 2007

A blanket citation for the COG approach:
    Tatusov RL, Koonin EV, Lipman DJ
    A genomic perspective on protein families.
    Science 278, 631-637, 1997

EdgeSearch algorithm/implementation:

Credits:
    David Kristensen
    Lavanya Kannan
    Michael Coleman
    Arcady Mushegian
    all at Stowers Institute for Medical Research, Bioinformatics Department

Contact:
    David M. Kristensen <[email protected]>

############################################################
1. Software.
############################################################

This distribution package contains the UNIX source files for the following programs (each in its own directory):

    COGmakehash
    COGreadblast
    COGlse
    COGtriangles
    COGcognitor

Run the UNIX make command in each directory to create executable binaries. You might need to modify the makefiles if you are using a C++ compiler other than GNU g++.

All programs print a short description of their options when executed with the "-h" command line flag.

A note on Mac OSX: the hidden file .DS_Store.tab in each directory interferes with the execution of the COGreadblast program, since it reads all files with the extension '.tab'. One possible work-around is to change the file mask to something other than '.tab' (FMASK variable in bc.h).

############################################################
2. Data.
############################################################

#-----------------------------------------------------------
2.0. Remarks on data.

The description of the data set requirements that follows assumes that the standard NCBI BLAST suite will be used as part of the data processing workflow. Thus, the description reflects the limitations imposed by the current (as of May 2010) versions of that software. In many cases the COG-related software has additional flexibility (e.g. less strict conventions on naming proteins and domains) which can be used in alternative workflows. Here we will mostly ignore these options and present the more strict (i.e. most compatible) set of requirements.
The internal data formats are described below so that these data can be supplied to the COG software outside of the normal processing workflow.

#-----------------------------------------------------------
2.1. Protein sets.

Each protein is expected to have a unique short identifier - either a string of alphanumeric characters or an integer decimal number not exceeding 2147483647 (NB - longer strings of decimal digits can be used as sequence IDs if they are presented in the context of text IDs rather than numbers). For compatibility with BLAST it is strongly advised that the sequence IDs in FASTA files are constructed as "lcl|<text-id>" or "gi|<num-id>". Use of some non-alphanumeric characters (dot and dash) in names is allowed but discouraged; others (comma, semicolon, colon etc.) are prohibited. Some software does not distinguish upper and lower case letters, so it is strongly recommended to make the names unique in the case-insensitive mode.

#-----------------------------------------------------------
2.2. Proteins and genomes.

Each protein is expected to be assigned to one and only one genome; lists of proteins assigned to a genome are expected to be complete (i.e. include the full set of proteins encoded by the genome). The protein-genome data are given line-by-line in a comma-separated file as "<prot-id>,<genome-id>". Genome IDs are expected to be alphanumeric strings.

#-----------------------------------------------------------
2.3. BLAST results.

Typically two BLAST search passes are required - one without any low-complexity filtering and composition-based statistics (as it produces scores that, apparently, better reflect the phylogenetic distances); another with low-complexity filtering and/or composition-based statistics (as it produces fewer false positives). By convention, the COG software expects BLAST results in tabular ("-m 8" or "-m 9") format; each set of BLAST results is expected to reside in its own directory in *.tab files.
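
As an illustration of the tabular format mentioned above, here is a minimal Python sketch that reads "-m 8" / "-m 9" rows. It is not part of the COG distribution, whose programs are written in C++; the names BlastHit and parse_blast_tab are hypothetical. It assumes the standard 12-column layout of legacy BLAST tabular output and treats lines starting with '#' as the comment lines that "-m 9" adds.

```python
from typing import Iterable, Iterator, NamedTuple


class BlastHit(NamedTuple):
    """One row of legacy BLAST tabular output (hypothetical helper type)."""
    query: str
    subject: str
    identity: float      # percent identity
    length: int          # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_blast_tab(lines: Iterable[str]) -> Iterator[BlastHit]:
    """Yield one BlastHit per data row, skipping blank and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # "-m 9" writes '#' header lines; "-m 8" has none
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError(f"expected 12 tab-separated fields, got {len(f)}")
        yield BlastHit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        )
```

To read a real file, pass an open file handle, e.g. parse_blast_tab(open(path)) for each *.tab file in the results directory. The query and subject fields are kept as raw strings here; extracting the protein ID from them is a separate step.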
Depending on the precise composition of the query and subject deflines, BLAST output may contain either bare protein IDs (e.g. "18978112" or "MTH068") or full sequence IDs from the original FASTA files (e.g. "gi|18978112|ref|NP_579469.1|" or "lcl|MTH068"). The BLAST postprocessing software will split the string by delimiter characters and take the specified token (by default the 2nd) as the ID.

You need to make sure that BLAST statistics are consistent and comparable between different runs; one useful hint is to force the same effective database size in all searches (using the "-z" BLAST parameter) when running BLAST searches in batches against different databases.

#-----------------------------------------------------------
2.4. Internal IDs.

Internally all COG software uses numerical IDs for the sequences. Correspondence between the user-supplied IDs and these internal IDs is established by the COGmakehash program (see section 3.1.1). The correspondence file is called hash.csv and resides in a separate directory together with other processed BLAST data. The file consists of "<num-prot-id>,<user-prot-id>" records.

#-----------------------------------------------------------
2.5. Self-similarity data.

All BLAST similarity scores are measured against the self-similarity of the proteins involved. The self-similarity data are stored in the self.csv file residing in the processed BLAST data directory. The file consists of "<num-prot-id>,<prot-length>,<self-score>" records. This file is prepared by the COGreadblast program (see section 3.1.2).

#-----------------------------------------------------------
2.6. BLAST hits data.

Processed BLAST hits from the unfiltered BLAST search are stored in the hits.csv file residing in the processed BLAST data directory. The file consists of "<query-num-prot-id>,<subject-num-prot-id>,<query-start>,<query-end>,<subject-start>,<subject-end>,<e-value>,<score>" records.
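
For readers who want to consume hits.csv outside the COG programs, the following is a minimal Python sketch of parsing one record in the eight-field layout just described. Hit and parse_hit are hypothetical names, not part of the COG software. Reading the score as a floating-point number is an assumption, since the text does not say whether it is stored as an integer.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One hits.csv record (hypothetical helper type)."""
    query: int       # internal numeric ID of the query protein
    subject: int     # internal numeric ID of the subject protein
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    score: float     # assumption: parsed as float


def parse_hit(record: str) -> Hit:
    """Parse one comma-separated hits.csv record into a Hit."""
    fields = record.strip().split(",")
    if len(fields) != 8:
        raise ValueError(f"expected 8 comma-separated fields, got {len(fields)}")
    # The first six fields are integers: two IDs and four coordinates.
    query, subject, q_start, q_end, s_start, s_end = (int(x) for x in fields[:6])
    return Hit(query, subject, q_start, q_end, s_start, s_end,
               float(fields[6]), float(fields[7]))
```

For example, parse_hit("1,2,1,198,5,204,2e-40,160") (a made-up record) gives a Hit with query 1, subject 2 and subject range 5 to 204.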
All hits for the same query are stored in contiguous blocks sorted by decreasing score (increasing e-value). This file is prepared by the COGreadblast program (see section 3.1.2).

#------------------------
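
The ordering rule for hits.csv in section 2.6 (one contiguous block per query, scores decreasing within a block) can be checked before handing externally produced data to the COG software. The Python sketch below is one way to do that; check_hit_order is a hypothetical name, and accepting tied scores within a block is an assumption.

```python
from typing import Iterable, Optional, Set, Tuple


def check_hit_order(hits: Iterable[Tuple[int, float]]) -> bool:
    """Return True if (query_id, score) pairs, given in file order, form
    contiguous per-query blocks with non-increasing scores in each block."""
    seen: Set[int] = set()          # queries whose block has already started
    current: Optional[int] = None   # query of the block being read
    last_score = 0.0
    for query, score in hits:
        if query != current:
            if query in seen:
                return False        # this query's block was split in two
            seen.add(query)
            current, last_score = query, score
        else:
            if score > last_score:
                return False        # score went up inside a block
            last_score = score      # ties are accepted (assumption)
    return True
```

The function takes only the query ID and score of each record, so it can be fed from any parser that yields those two fields in file order.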
