############################################################
0. General remarks.
############################################################
This is the April 2012 release of the COG software from Eugene Koonin's group, featuring the EdgeSearch algorithm.
#-----------------------------------------------------------
0.1. Disclaimers.
This version of COG-related software is a part of a research project of Eugene Koonin's group at the National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM) at the National Institutes of Health (NIH). This is not an official NCBI software product. The software is distributed "as is" with no stated or implied warranty whatsoever. Although we'll make a reasonable effort to help if problems arise, we are in no way committed to support and maintenance of this software.
#-----------------------------------------------------------
0.2. Credits, contacts etc.
COG software:
Credits:
David M. Kristensen, NCBI
Alexander V. Sorokin, NCBI
Pavel S. Novichkov, NCBI
Yuri I. Wolf, NCBI
Eugene V. Koonin, NCBI
Contact:
Yuri I. Wolf <[email protected]>
References:
Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, Koonin EV, Mushegian A
A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches.
Bioinformatics. 2010
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq229?ijkey=zD7TIWnncGvDGYE&keytype=ref
Other papers that rely heavily on COGs constructed with this software:
Makarova KS, Sorokin AV, Novichkov PS, Wolf YI, Koonin EV
Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea.
Biol Direct. 2, 33, 2007
A blanket citation for the COG approach:
Tatusov RL, Koonin EV, Lipman DJ
A genomic perspective on protein families.
Science 278, 631-637, 1997
EdgeSearch algorithm/implementation:
Credits:
David Kristensen
Lavanya Kannan
Michael Coleman
Arcady Mushegian
all at Stowers Institute for Medical Research, Bioinformatics Department
Contact:
David M. Kristensen <[email protected]>
############################################################
1. Software.
############################################################
This distribution package contains the UNIX source files for the following programs (each in its own directory):
COGmakehash
COGreadblast
COGlse
COGtriangles
COGcognitor
Run the UNIX make command in each directory to create the executable binaries. You might need to modify the makefiles if you are using a C++ compiler other than GNU g++.
All programs print a short description of their options when executed with the "-h" command-line flag.
A note on Mac OS X: the hidden file .DS_Store.tab in each directory interferes with the execution of the COGreadblast program, since it reads all files with the extension '.tab'. One possible workaround is to change the file mask to something other than '.tab' (FMASK variable in bc.h).
############################################################
2. Data.
############################################################
#-----------------------------------------------------------
2.0. Remarks on data.
The description of the data set requirements below assumes that the standard NCBI BLAST software suite is used as part of the data processing workflow. It therefore reflects the limitations imposed by the versions of that software current as of May 2010. In many cases the COG-related software is more flexible (e.g. it has less strict conventions on naming proteins and domains), which can be exploited in alternative workflows. Here we mostly ignore these options and present the stricter (i.e. most compatible) set of requirements. The internal data formats are described below so that these data can be supplied to the COG software outside of the normal processing workflow.
#-----------------------------------------------------------
2.1. Protein sets.
Each protein is expected to have a unique short identifier: either a string of alphanumeric characters or a decimal integer not exceeding 2147483647 (NB - longer strings of decimal digits can be used as sequence IDs if they are presented as text IDs rather than as numbers). For compatibility with BLAST it is strongly advised that the sequence IDs in FASTA files are constructed as "lcl|<text-id>" or "gi|<num-id>". Use of some non-alphanumeric characters (dot and dash) in names is allowed but discouraged; others (comma, semicolon, colon etc.) are prohibited. Some software does not distinguish upper and lower case letters, so it is strongly recommended that names remain unique when compared case-insensitively.
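The naming rules above can be checked before any BLAST run. The following is a minimal sketch, not part of the COG distribution; the function name check_ids is hypothetical, and only the three prohibited characters named explicitly in the text (comma, semicolon, colon) are tested.

```python
from collections import defaultdict

PROHIBITED = set(",;:")   # the characters named explicitly as prohibited
DISCOURAGED = set(".-")   # allowed but discouraged

def check_ids(ids):
    """Return a list of (id, problem) pairs for a list of protein IDs.

    Hypothetical helper: it only covers the rules spelled out in the text.
    """
    problems = []
    seen = defaultdict(list)  # lower-cased ID -> original spellings
    for pid in ids:
        if PROHIBITED & set(pid):
            problems.append((pid, "prohibited character"))
        elif DISCOURAGED & set(pid):
            problems.append((pid, "discouraged character"))
        seen[pid.lower()].append(pid)
    # IDs must stay unique even when case is ignored.
    for group in seen.values():
        if len(group) > 1:
            for pid in group:
                problems.append((pid, "not unique ignoring case"))
    return problems
```

An empty result means that none of the listed rules is violated; it does not guarantee that every downstream tool accepts the IDs.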
#-----------------------------------------------------------
2.2. Proteins and genomes.
Each protein is expected to be assigned to one and only one genome; the list of proteins assigned to a genome is expected to be complete (i.e. to include the full set of proteins encoded by that genome). The protein-genome data are given line by line in a comma-separated file as "<prot-id>,<genome-id>". Genome IDs are expected to be alphanumeric strings.
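A protein-genome file in this format can be loaded and checked for the "one and only one genome" rule with a few lines of Python. This is an illustrative sketch; read_protein_genome is a hypothetical name, and completeness of the protein lists cannot be verified this way.

```python
import csv

def read_protein_genome(lines):
    """Map each protein ID to its genome ID.

    `lines` is any iterable of "<prot-id>,<genome-id>" strings, e.g. an
    open file. Raises ValueError if a protein appears with two genomes.
    """
    genome_of = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        prot_id, genome_id = row[0].strip(), row[1].strip()
        # setdefault returns the genome already stored, if any.
        if genome_of.setdefault(prot_id, genome_id) != genome_id:
            raise ValueError(f"{prot_id} assigned to more than one genome")
    return genome_of
```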
#-----------------------------------------------------------
2.3. BLAST results.
Typically two BLAST search passes are required: one without any low-complexity filtering or composition-based statistics (as it produces scores that, apparently, better reflect phylogenetic distances), and another with low-complexity filtering and/or composition-based statistics (as it produces fewer false positives). By convention, the COG software expects BLAST results in tabular ("-m 8" or "-m 9") format; each set of BLAST results is expected to reside in its own directory in *.tab files. Depending on the precise composition of the query and subject deflines, BLAST output may contain either bare protein IDs (e.g. "18978112" or "MTH068") or full sequence IDs from the original FASTA files (e.g. "gi|18978112|ref|NP_579469.1|" or "lcl|MTH068"). The BLAST postprocessing software splits the string by delimiter characters and takes the specified token (by default the 2nd) as the ID. You need to make sure that BLAST statistics are consistent and comparable between different runs; one useful approach is to force the same effective database size in all searches (using the "-z" BLAST parameter) when running BLAST searches in batches against different databases.
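The token-based ID extraction described above can be illustrated as follows. This is a sketch under stated assumptions, not the COGreadblast implementation: the delimiter is assumed to be "|", a string with too few tokens is assumed to be a bare ID and is returned unchanged, and the 12-column layout is the standard one for tabular BLAST output. The names extract_id and parse_tab_line are hypothetical.

```python
def extract_id(defline, token=2, delimiter="|"):
    """Return the token-th (1-based) non-empty field of a sequence ID.

    Assumption: if there are fewer fields than `token`, the string is
    treated as a bare protein ID and returned as-is.
    """
    parts = [p for p in defline.split(delimiter) if p]
    if len(parts) < token:
        return defline
    return parts[token - 1]

def parse_tab_line(line):
    """Parse one line of tabular BLAST output; return None for comments."""
    if line.startswith("#") or not line.strip():
        return None  # "-m 9" output carries comment lines starting with "#"
    f = line.rstrip("\n").split("\t")
    return {
        "query": extract_id(f[0]),
        "subject": extract_id(f[1]),
        "qstart": int(f[6]), "qend": int(f[7]),
        "sstart": int(f[8]), "send": int(f[9]),
        "evalue": float(f[10]),
        "score": float(f[11]),
    }
```

With these assumptions "gi|18978112|ref|NP_579469.1|" and "lcl|MTH068" yield "18978112" and "MTH068", matching the bare IDs quoted in the text.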
#-----------------------------------------------------------
2.4. Internal IDs.
Internally all COG software uses numerical IDs for the sequences. The correspondence between the user-supplied IDs and these internal IDs is established by the COGmakehash program (see p. 3.1.1). The correspondence file is called hash.csv and resides in a separate directory together with other processed BLAST data. The file consists of "<num-prot-id>,<user-prot-id>" records.
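For workflows that need to translate between the two ID spaces outside the COG programs, hash.csv can be read into a pair of lookup tables. This is a minimal sketch based only on the record layout given above; read_hash is a hypothetical name.

```python
def read_hash(lines):
    """Build lookup tables from "<num-prot-id>,<user-prot-id>" records.

    Returns (user_of, num_of): internal number -> user ID and the reverse.
    """
    user_of, num_of = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        num_str, user_id = line.split(",", 1)
        num = int(num_str)
        user_of[num] = user_id
        num_of[user_id] = num
    return user_of, num_of
```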
#-----------------------------------------------------------
2.5. Self-similarity data.
All BLAST similarity scores are measured against the self-similarity of the proteins involved. The self-similarity data is stored in the self.csv file residing in the processed BLAST data directory. The file consists of "<num-prot-id>,<prot-length>,<self-score>" records. This file is prepared by the COGreadblast program (see p. 3.1.2).
#-----------------------------------------------------------
2.6. BLAST hits data.
Processed BLAST hits from the unfiltered BLAST search are stored in the hits.csv file residing in the processed BLAST data directory. The file consists of "<query-num-prot-id>,<subject-num-prot-id>,<query-start>,<query-end>,<subject-start>,<subject-end>,<e-value>,<score>" records. All hits for the same query are stored in contiguous blocks sorted by decreasing score (increasing e-value). This file is prepared by the COGreadblast program (see p. 3.1.2).
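When hits.csv is produced outside the normal workflow, it is worth confirming that it follows the layout described above. The sketch below is illustrative only; read_hits and check_layout are hypothetical names, and coordinates are assumed to be integers while e-value and score are parsed as floating-point numbers.

```python
import csv
from itertools import groupby

def read_hits(lines):
    """Parse 8-field hit records into tuples
    (query, subject, qstart, qend, sstart, send, evalue, score)."""
    hits = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        ints = tuple(int(x) for x in row[:6])
        hits.append(ints + (float(row[6]), float(row[7])))
    return hits

def check_layout(hits):
    """True if each query forms one contiguous block with
    non-increasing scores."""
    seen = set()
    for query, block in groupby(hits, key=lambda h: h[0]):
        if query in seen:
            return False  # the same query appears in two separate blocks
        seen.add(query)
        scores = [h[7] for h in block]
        if any(a < b for a, b in zip(scores, scores[1:])):
            return False  # a score increases within the block
    return True
```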
#------------------------