# Phosphorus-cycling-database (PCycDB)
This is a curated phosphorus cycling database (PCycDB) covering 138 gene families and 10 metabolic processes.
Homologous genes were added to the database to reduce the false-positive rate. The criteria (i.e., identity and hit length) for filtering alignment results generated by sequence similarity search tools (e.g., BLAST, USEARCH, DIAMOND) were refined using a simulated gene dataset of known composition and a mock bacterial community, to obtain the best accuracy and further reduce false positives and false negatives. At the 70% identity and 25 amino acid cutoffs, the accuracy, PPV, sensitivity, specificity and NPV were 99.76%, 95.70%, 99.94%, 99.74% and 99.99%, respectively.
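For readers unfamiliar with the five metrics quoted above, the following minimal sketch shows their standard definitions in terms of a confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the PCycDB benchmark data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, PPV, sensitivity, specificity and NPV as fractions."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # correct calls over all calls
        "ppv": tp / (tp + fp),              # positive predictive value (precision)
        "sensitivity": tp / (tp + fn),      # true positive rate (recall)
        "specificity": tn / (tn + fp),      # true negative rate
        "npv": tn / (tn + fn),              # negative predictive value
    }


# Made-up counts for illustration only; not taken from the PCycDB paper.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```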
Importantly, genes encoding intracellular phosphorus metabolic processes are included in PCycDB, which should help researchers gain insight into not only geochemical P cycling but also microbial P metabolism.
If you find this database and its utilities useful, please cite:
Zeng J, Tu Q, Yu X, et al. PCycDB: a comprehensive and accurate database for fast analysis of phosphorus cycling genes[J]. Microbiome, 2022, 10(1): 1-16.
User guide:
1. Assuming you have a sample named $Sample.fa and have obtained a BLAST table named $Sample.P.blast using BLAST+, DIAMOND or another alignment tool, you can filter the result using filter_Generate_ORF2gene.py.
Command: python filter_Generate_ORF2gene.py -s $identity -cov $alignment-coverage -hit $hitlength -b $Sample.P.blast
Recommended filtering threshold: it is acceptable to use the 30% identity and 25 amino acid cutoffs to recover more potential phosphorus cycling genes (PCGs) from metagenome sequencing data, because all known PCGs can be detected by PCycDB, and the small fraction of predicted false-positive PCGs (1.04%) may still have the potential to mediate P cycling. Alternatively, one may use a stricter cutoff (i.e., 70% identity, 25 aa) to keep false positives low (< 0.25%).
By doing this, you will obtain an ORF2gene file named $Sample.ORF2gene.txt, which describes the P cycling gene assigned to each ORF.
If you have many samples to analyze, a bash for loop is recommended for processing the data quickly. For example:
for i in $Sample*.blast; do python filter_Generate_ORF2gene.py -s $identity -cov $alignment-coverage -hit $hitlength -b $i; done
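To make the filtering idea in step 1 concrete, here is a minimal sketch of an identity and hit-length filter. It is not the code of filter_Generate_ORF2gene.py. It assumes the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, ..., bit score), keeps the highest-scoring passing hit per ORF, and ignores the alignment-coverage option. The function name, example IDs and example rows are hypothetical.

```python
import csv
import io


def filter_hits(handle, min_identity=70.0, min_hit_length=25):
    """Keep the best-scoring hit per query that passes both cutoffs.

    Assumes 12 tab-separated columns in the standard BLAST tabular order.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank, comment or malformed lines
        query, subject = row[0], row[1]
        identity, length, bitscore = float(row[2]), int(row[3]), float(row[11])
        if identity < min_identity or length < min_hit_length:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical rows: orf2 fails the identity cutoff, orf3 the length cutoff.
example_rows = (
    "orf1\tphoD_ref1\t85.0\t120\t18\t0\t1\t120\t1\t120\t1e-50\t200\n"
    "orf1\tphoX_ref2\t72.0\t110\t30\t0\t1\t110\t1\t110\t1e-30\t150\n"
    "orf2\tpstS_ref3\t45.0\t200\t110\t0\t1\t200\t1\t200\t1e-20\t90\n"
    "orf3\tppk1_ref4\t90.0\t20\t2\t0\t1\t20\t1\t20\t1e-05\t40\n"
)
hits = filter_hits(io.StringIO(example_rows))
print(hits)
```

Lowering `min_identity` to 30.0 in this sketch would also retain orf2, which mirrors the trade-off between the relaxed and strict cutoffs described above.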
2. Assuming you have already calculated the coverage (or TPM) using Bowtie2, CheckM, Salmon or other tools, and obtained an abundance file named $Sample.quant for all ORFs, you can easily extract the abundance of PCGs using Coverage_get.py.
Command: python Coverage_get.py -i $Sample.ORF2gene.txt -t $Sample.quant -o $Sample.P.profile.txt
By doing this, you will obtain an abundance file covering each PCG in $Sample.
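The sketch below illustrates one plausible way to turn per-ORF abundances into per-gene abundances. It is an assumption that abundances of ORFs assigned to the same gene are summed; Coverage_get.py may aggregate differently. The function name and the example values are hypothetical.

```python
from collections import defaultdict


def gene_abundance(orf2gene, orf_abundance):
    """Sum the abundance of all ORFs assigned to each gene.

    ORFs without an abundance value are assumed to contribute 0.
    """
    totals = defaultdict(float)
    for orf, gene in orf2gene.items():
        totals[gene] += orf_abundance.get(orf, 0.0)
    return dict(totals)


# Hypothetical inputs: orf9 has an abundance but no PCG assignment,
# so it does not appear in the resulting profile.
orf2gene = {"orf1": "phoD", "orf2": "phoD", "orf3": "pstS"}
orf_abundance = {"orf1": 1.5, "orf2": 2.5, "orf3": 4.0, "orf9": 7.0}
profile = gene_abundance(orf2gene, orf_abundance)
print(profile)
```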
Extract_Seq.py is a Python script for fast sequence extraction.
Command: python Extract_Seq.py -m $MAP -f $fasta -o $outputfile
Here $MAP lists the ORF IDs you wish to extract, $fasta is the sequence file to extract them from, and $outputfile is the name of the output file. The file "pafAORF.list" is an example of $MAP. The file "pafAORF.nuc.fa" contains the pafA gene sequences analyzed in this paper.
Please note that this script uses a Python dictionary to look up the target sequences quickly, so it preloads all sequences from $fasta into system memory.
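The dictionary-based approach, and why it holds the whole FASTA file in memory, can be sketched as follows. This is an illustration of the technique, not the code of Extract_Seq.py. The function names are hypothetical, and using the first word of each header as the record ID is an assumption.

```python
import io


def read_fasta(handle):
    """Load every record into a dict keyed by the first word of its header."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join wrapped sequence lines; the full file now lives in memory.
    return {key: "".join(chunks) for key, chunks in records.items()}


def extract(records, wanted_ids):
    """Return the requested sequences, silently skipping unknown IDs."""
    return {i: records[i] for i in wanted_ids if i in records}


# Hypothetical FASTA content and ID list.
fasta_text = ">orf1 some description\nATGC\nGGTA\n>orf2\nTTTT\n>orf3\nCCCC\n"
records = read_fasta(io.StringIO(fasta_text))
selected = extract(records, ["orf1", "orf3", "missing"])
print(selected)
```

Each lookup is constant time once the dictionary is built, which is what makes extraction fast; the cost is memory proportional to the size of the input file.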
3. Finally, you can merge all the $Sample.P.profile.txt files into one matrix table using merge_metaphlan_tables.py, provided by MetaPhlAn.
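The shape of the merged table can be illustrated with the short sketch below. It does not call or reproduce merge_metaphlan_tables.py. The function name, sample names and values are hypothetical, and filling genes that are absent from a sample with 0 is an assumption.

```python
def merge_profiles(profiles):
    """Combine {sample: {gene: abundance}} into a gene-by-sample matrix.

    Returns (header, rows); genes missing from a sample are set to 0.0.
    """
    samples = sorted(profiles)
    genes = sorted({gene for profile in profiles.values() for gene in profile})
    header = ["gene"] + samples
    rows = [[gene] + [profiles[s].get(gene, 0.0) for s in samples] for gene in genes]
    return header, rows


# Hypothetical per-sample profiles.
profiles = {
    "S1": {"phoD": 4.0, "pstS": 1.0},
    "S2": {"phoD": 2.0, "ppk1": 3.0},
}
header, rows = merge_profiles(profiles)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```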
Reference
Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Francesco Beghini, Lauren J McIver, Aitor Blanco-Míguez, Leonard Dubois, Francesco Asnicar, Sagun Maharjan, Ana Mailyan, Paolo Manghi, Matthias Scholz, Andrew Maltez Thomas, Mireia Valles-Colomer, George Weingart, Yancong Zhang, Moreno Zolfo, Curtis Huttenhower, Eric A Franzosa, Nicola Segata. eLife (2021).