PyPI official download | gtdbtk-0.0.8b1.tar.gz

Uploaded 2022-01-27 22:44:53 · 69 KB · GZ

39 files in total: 17 py, 10 pm, 4 txt


gtdbtk-0.0.8b1.tar.gz (39 files)

gtdbtk-0.0.8b1/
  MANIFEST.in 78B
  PKG-INFO 649B
  bin/
    gtdbtk 19KB
  gtdbtk.egg-info/
    PKG-INFO 649B
    requires.txt 56B
    SOURCES.txt 1KB
    top_level.txt 7B
    dependency_links.txt 1B
  gtdbtk/
    config/
      config_metadata.py 370B
      __init__.py 0B
      default_values.py 1KB
      config_template.py 7KB
    main.py 12KB
    tools.py 2KB
    reroot_tree.py 6KB
    classify.py 53KB
    __init__.py 1KB
    VERSION 554B
    relative_distance.py 18KB
    markers.py 24KB
    external/
      pfam_search.py 8KB
      Bio/
        Pfam/
          HMM/
            HMMResultsIO.pm 31KB
            HMMMatch.pm 1KB
            HMM.pm 5KB
            HMMSequence.pm 2KB
            HMMIO.pm 10KB
            HMMUnit.pm 2KB
            HMMResults.pm 13KB
          Scan/
            Seq.pm 1KB
            PfamScan.pm 28KB
          Active_site/
            as_search.pm 11KB
      __init__.py 0B
      pfam_search.pl 11KB
      hmm_aligner.py 12KB
      tigrfam_search.py 7KB
      prodigal.py 7KB
  setup.cfg 38B
  setup.py 2KB
  README.md 12KB

# GTDB-Tk

[![version status](https://img.shields.io/pypi/v/gtdbtk.svg)](https://pypi.python.org/pypi/gtdbtk)

**Note (19/04/2018)**:
- A new version of the data (release 83) is available under [this link](https://data.ace.uq.edu.au/public/gtdbtk/release_83/).
- This new version is recommended for running GTDB-Tk v0.0.6+.

GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes. It is computationally efficient and designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes. GTDB-Tk is open source and released under the GNU General Public License (Version 3).

GTDB-Tk is **under active development and validation**. Please independently confirm the GTDB-Tk predictions by manually inspecting the tree, and bring any discrepancies to our attention. Notifications about GTDB-Tk releases will be available through the ACE Twitter account (https://twitter.com/ace_uq).

## Hardware requirements
- ~90 GB of memory to run.
- ~70 GB of storage.

## Installation

GTDB-Tk requires the following Python libraries:
* [jinja2](http://jinja.pocoo.org/) >= 2.7.3: a full-featured template engine for Python.
* [mpld3](http://mpld3.github.io/) >= 0.2: D3 viewer for Matplotlib.
* [biolib](https://github.com/dparks1134/biolib) >= 0.0.44: Python package for common tasks in bioinformatics.
* [dendropy](http://dendropy.org/) >= 4.1.0: a Python library for phylogenetics and phylogenetic computing: reading, writing, simulation, processing and manipulation of phylogenetic trees (phylogenies) and characters.
* [SciPy Stack](https://www.scipy.org/install.html): at least the Matplotlib, NumPy, and SciPy libraries.

Jinja2, mpld3, dendropy and biolib will be installed as part of GTDB-Tk when installing via pip as described below. The SciPy Stack must be installed separately.

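The minimum versions listed above can be checked before installation with a small script. The following is only a sketch: the helper names (`version_tuple`, `check_module`) are hypothetical and not part of GTDB-Tk, it assumes each library's import name matches the name given in the list, and it assumes the library exposes a `__version__` attribute (when it does not, the module is only reported as found).

```python
import importlib

# Minimum versions copied from the library list in this README.
# Assumption: the import names are the same as the listed names.
MIN_VERSIONS = {
    "jinja2": "2.7.3",
    "mpld3": "0.2",
    "biolib": "0.0.44",
    "dendropy": "4.1.0",
}
# SciPy Stack members; the README gives no minimum versions for these.
SCIPY_STACK = ["matplotlib", "numpy", "scipy"]


def version_tuple(version):
    """Turn '4.1.0' into (4, 1, 0); suffixes such as 'b1' are dropped."""
    parts = []
    for field in str(version).split("."):
        digits = ""
        for ch in field:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_module(name, minimum=None):
    """Return (ok, message) describing whether `name` is importable and new enough."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return False, "%s: not importable" % name
    found = getattr(module, "__version__", None)
    if minimum is None or found is None:
        return True, "%s: found (version %s)" % (name, found or "unknown")
    ok = version_tuple(found) >= version_tuple(minimum)
    return ok, "%s: found %s, need >= %s" % (name, found, minimum)


if __name__ == "__main__":
    for name, minimum in sorted(MIN_VERSIONS.items()):
        print(check_module(name, minimum)[1])
    for name in SCIPY_STACK:
        print(check_module(name)[1])
```

The comparison is deliberately simple (numeric fields only), so pre-release suffixes are ignored rather than ordered.
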
GTDB-Tk makes use of the following 3rd party dependencies and assumes these are on your system path:
* [Prodigal](http://prodigal.ornl.gov/) >= 2.6.2: Hyatt D, et al. 2012. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics, 28, 2223-2230.
* [HMMER](http://hmmer.org/) >= 3.1: Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comp. Biol., 7, e1002195.
* [pplacer](http://matsen.fhcrc.org/pplacer/) >= 1.1: Matsen F, et al. 2010. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11, 538.
* [FastANI](https://github.com/ParBLiSS/FastANI) >= 1.0: Jain C, et al. 2017. High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries. bioRxiv, 256800.
* [FastTree](http://www.microbesonline.org/fasttree/) >= 2.1.9: Price MN, et al. 2010. FastTree 2 -- Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5, e9490.

GTDB-Tk also assumes the Python 2.7.x and Perl interpreters are on your system path.

_NOTE_: The Perl interpreter requires the Moose, Bundle::BioPerl and IPC::Run modules. You can install these modules using CPAN:
```
perl -MCPAN -e"install Moose"
perl -MCPAN -e"install IPC::Run"
perl -MCPAN -e"install Bundle::BioPerl"
```
If `perl -MCPAN -e"install Bundle::BioPerl"` does not run on your server, please install BioPerl following the steps under [this link](https://bioperl.org/INSTALL.html).

You need to make sure that the folder where the Perl modules (`*.pm`) are located is part of the `@INC` variable. If it is not, you can set the PERL5LIB (or PERLLIB) environment variable the same way you set the PATH environment variable. Every directory listed in this variable will be added to `@INC`.

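Since the third-party programs and both interpreters are assumed to be on the system path, it can help to confirm that before a run. The sketch below uses `shutil.which` from the standard library, so the helper itself needs Python 3.3+ even though GTDB-Tk runs under Python 2.7. The executable names are assumptions: this README names the tools, not the binaries GTDB-Tk actually invokes, and `find_missing` is a hypothetical helper.

```python
import shutil

# Assumed binary names for the tools listed in this README; adjust them
# to match your installation (e.g. HMMER ships several executables).
REQUIRED_EXECUTABLES = [
    "prodigal",
    "hmmsearch",
    "pplacer",
    "fastANI",
    "FastTree",
    "python",
    "perl",
]


def find_missing(executables, path=None):
    """Return the names that cannot be resolved on the search path.

    With path=None, shutil.which searches the PATH environment variable.
    """
    return [name for name in executables
            if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_EXECUTABLES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed executables were found on PATH.")
```

This only checks that each name resolves to an executable file; it does not verify the minimum versions given in the list.
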
For example, to add module directories to `PERL5LIB`:
```
export PERL5LIB="$PERL5LIB:/path/to/moose/module:/path/to/ipc/module:/path/to/bioperl/module"
```

GTDB-Tk requires ~70 GB+ of external data that needs to be downloaded and unarchived (preferably in the same directory):
```
wget https://data.ace.uq.edu.au/public/gtdbtk/release_xx/fastani.tar.gz
wget https://data.ace.uq.edu.au/public/gtdbtk/release_xx/markers.tar.gz
wget https://data.ace.uq.edu.au/public/gtdbtk/release_xx/masks.tar.gz
wget https://data.ace.uq.edu.au/public/gtdbtk/release_xx/msa.tar.gz
wget https://data.ace.uq.edu.au/public/gtdbtk/release_xx/pplacer.tar.gz
wget https://data.ace.uq.edu.au/public/gtdbtk/release_xx/taxonomy.tar.gz
```

Alternatively, all the data can be downloaded at once using:
```
wget https://data.ace.uq.edu.au/public/gtdbtk/release_xx/gtdbtk_rxx_data.tar.gz
```

Once these are installed, GTDB-Tk can be installed using [pip](https://pypi.python.org/pypi/gtdbtk):
```
> pip install gtdbtk
```

GTDB-Tk requires a config file. In the Python lib/site-packages directory, go to the gtdbtk directory and set up this config file:
```
cd config
cp config_template.py config.py
```

Edit the config.py file and modify the following variable:
- GENERIC_PATH should point to the directory containing the data downloaded from https://data.ace.uq.edu.au/public/gtdbtk/. Make sure the value ends with a slash '/'.

## Quick Start

The functionality provided by GTDB-Tk can be accessed through the help menu:
```
> gtdbtk -h
```

Usage information about each method can also be accessed through its specific help menu, e.g.:
```
> gtdbtk classify_wf -h
```

## Classify Workflow

The classify workflow consists of three steps: *identify*, *align*, and *classify*. The *identify* step calls genes using [Prodigal](http://prodigal.ornl.gov/) and then uses HMM models and the [HMMER](http://hmmer.org/) package to identify the marker genes used for phylogenetic inference. Consistent alignments are obtained by aligning marker genes to their respective HMM model.
The *align* step concatenates the aligned marker genes and applies all necessary filtering to the concatenated multiple sequence alignment (MSA). Finally, the *classify* step uses [pplacer](http://matsen.fhcrc.org/pplacer/) to find the maximum-likelihood placement of each genome's concatenated protein alignment in the GTDB-Tk reference tree. GTDB-Tk classifies each genome based on its placement in the reference tree, its relative evolutionary distance, and FastANI distance (see Chaumeil PA et al., 2018 for details).

The classify workflow can be run as follows:
```
> gtdbtk classify_wf --genome_dir <my_genomes> --out_dir <output_dir>
```
This will process all genomes in \<my_genomes\> using both bacterial and archaeal marker sets and place the results in \<output_dir\>. Genomes must be in FASTA format. The location of genomes can also be specified using a batch file with the --batchfile flag. The batch file is simply a two-column file indicating the location of each genome and the desired genome identifier (i.e., a Newick compatible alphanumeric string). These fields must be separated by a tab.

The workflow supports several optional flags, including:
* cpus: maximum number of CPUs to use

For other flags please consult the command line interface. Here is an example run of this workflow:
```
> gtdbtk classify_wf --cpus 24 --genome_dir ./my_genomes --out_dir gtdbtk_output
```
The taxonomic classification of each bacterial and archaeal genome is contained in the \<prefix\>.bac120.classification.tsv and \<prefix\>.ar122.classification.tsv output files.

##### Additional output files

Each step of the classify workflow generates a number of files that can be consulted for additional information about the processed genomes.

Identify step:
* \<prefix\>_bac120_markers_summary.tsv: summary of unique, duplicated, and missing markers within the 120 bacterial marker set for each submitted genome
* \<prefix\>_ar122_markers_summary.tsv: analogous to the above file, but for the 122 archaeal marker set
* marker_genes directory: contains individual genome result
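
The two-column batch file accepted by the --batchfile flag in the classify workflow (genome location, then genome identifier, separated by a tab) can be generated from a directory of FASTA files. The following is a minimal sketch, not part of GTDB-Tk: the function names and the list of FASTA extensions are assumptions, and restricting identifiers to letters, digits and underscores is this sketch's own reading of "Newick compatible alphanumeric string".

```python
import os
import re

# Assumed FASTA file extensions; adjust to match your genome files.
FASTA_EXTENSIONS = (".fna", ".fa", ".fasta")
# Assumption: identifiers limited to letters, digits and underscores.
VALID_ID = re.compile(r"^[A-Za-z0-9_]+$")


def batchfile_rows(genome_dir):
    """Return sorted (path, genome_id) pairs for FASTA files in genome_dir."""
    rows = []
    for filename in sorted(os.listdir(genome_dir)):
        stem, ext = os.path.splitext(filename)
        if ext.lower() not in FASTA_EXTENSIONS:
            continue
        if not VALID_ID.match(stem):
            raise ValueError("identifier is not alphanumeric: %r" % stem)
        rows.append((os.path.join(genome_dir, filename), stem))
    return rows


def write_batchfile(genome_dir, out_path):
    """Write one line per genome: location, a tab, then the identifier."""
    rows = batchfile_rows(genome_dir)
    with open(out_path, "w") as handle:
        for path, genome_id in rows:
            handle.write("%s\t%s\n" % (path, genome_id))
    return len(rows)


if __name__ == "__main__":
    count = write_batchfile("./my_genomes", "genomes.batch.tsv")
    print("wrote %d genomes" % count)
```

The resulting file would then be supplied with --batchfile instead of pointing --genome_dir at the directory.
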
