磷循环Pcycle功能基因分析过滤代码_宏基因组磷元素循环分析技术资源-CSDN文库

共13个文件

py：4个

txt：3个

test：1个

生信分析

117 浏览量 2023-12-14 10:09:27 上传评论收藏 3.76MB ZIP 举报

资源推荐

资源详情

资源评论

收起资源包目录

Phosphorus-cycling-database-main.zip （13个子文件）

Phosphorus-cycling-database-main

pafAORF.list 18KB

filter_Generate_ORF2gene.py 1KB

Extract_Seq.py 915B

Coverage_get.py 2KB

example file

S0028ORF2Pgene.txt 241KB

quant.sf 11.6MB

S0028.PCycDB.blastp 3.16MB

S0028.TPM.txt 2KB

merge_metaphlan_tables.py 4KB

readme.txt 2KB

pafAORF.nuc.fa 496KB

README.md 4KB

code

test 5B

# Phosphorus-cycling-database (PCyCDB) This is a curated phosphorus cycling database (PCyCDB) with 138 gene families and 10 metabolism processes. Homologous genes were added into the database to reduce the false positive rate. The criteria (i.e., identity, hit length) for filtering the alignment result generated by sequence similarity searching tools (e.g., BLAST, USEARCH, DIAMOND) were refined by identifying a known simulated gene dataset and mock bacteria community to obtain the best accuracy and further reduction of false positives and false negatives. The accuracy, PPV, sensitivity, specificity and NPV were 99.76%, 95.70%, 99.94%, 99.74% and 99.99%, respectively, at the 70% identity and 25 amino acid cutoffs. Importantly, the genes encoding the intracellular phosphorus metabolic processes are added into PCyCDB, which should help researchers broaden the insights into not only the geochemical P cycling but also the microbial P metabolisms. If you feel this database and utilities are useful, please cite: Zeng J, Tu Q, Yu X, et al. PCycDB: a comprehensive and accurate database for fast analysis of phosphorus cycling genes[J]. Microbiome, 2022, 10(1): 1-16. User guide: 1. Assuming you had a sample named $Sample.fa, and obtained a blast table named Sample.P.blast using BLAST+ or DIAMOND or other alignment tools, you can filter the result using filter_Generate_ORF2gene.py. Command: python filter_Generate_ORF2gene.py -s $identity -cov $alignment-coverage -hit $hitlength -b $Sample.P.blast Recommended filtering threthold: it is acceptable to use 30% identity and 25 amino acids cutoffs to investigate more potential PCGs from metagenome sequencing data, because all known PCGs can be detected by PCycDB, while a little bit of predicted false-positive PCGs (1.04%) might also have the potential ability in mediating the P cycling. Also, one may use a stricter cutoff (i.e., 70% identity, 25 aa) to control the false positives (< 0.25%). By doing this, you will obtain a ORF2gene file named $Sample.ORF2gene.txt, which descripts the P cycling gene for each ORF. If you have a lot of sample for analysis, a bash for loop is recommend for quickly processing the data. For example: for i in $Sample*.blast; do python filter_Generate_ORF2gene.py -s $identity -cov $alignment-coverage -hit $hitlength -b $i; done 2. Assuming you already have calculated the coverage (or TPM) using Bowtie2, CheckM, Salmon or other tools, and obtained a abuncance file named $Sample.quant for all ORFs, you can easily extracted the abuncance of PCGs using Coverage_get.py. Command: python Coverage_get.py -i $Sample.ORF2gene.txt -t $Sample.quant -o $Sample.P.profile.txt By doing this, you will obtain a abundance file for each PCG of the $Sample. Extract_Seq.py was a useful python script for fast sequence extraction. Command: python Extract_Seq.py -m $MAP -f $fasta -o $outputfile Where $MAP contains the ORF id you wish to extract, $fasta is the sequence file from which for extraction, and $outputfile is the outputfile name. The file "pafAORF.list" is an example of $MAP. The file "pafAORF.nuc.fa" is the sequence of pafA gene analyzed in this paper. Please note that this script uses {dictionary} function to fast extract the targeted sequence, it will preload all the sequence from $fasta into your system memory. 3. Finally, you can merged all the $Sample.P.profile.txt into one matrix table using merge_metaphlan_tables.py provided by MetaPhlan. Reference Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3 Francesco Beghini, Lauren J McIver, Aitor Blanco-Míguez, Leonard Dubois, Francesco Asnicar, Sagun Maharjan, Ana Mailyan, Paolo Manghi, Matthias Scholz, Andrew Maltez Thomas, Mireia Valles-Colomer, George Weingart, Yancong Zhang, Moreno Zolfo, Curtis Huttenhower, Eric A Franzosa, Nicola Segata. eLife (2021)

评论收藏

内容反馈