## Objective
This tutorial illustrates the complete workflow of a chewBBACA pipeline for creating a wgMLST and a cgMLST schema for a collection of 714 _Streptococcus agalactiae_ genomes (32 complete genomes and 682 draft genome assemblies deposited in the NCBI databases). It provides step-by-step instructions and shows the outputs obtained.
All information about the NCBI genomes used in this example is in the [.tsv file](https://github.com/B-UMMI/chewBBACA_tutorial/tree/master/genomes/NCBI_genomes_proks.Sagalactiae_allGenomes.2016_08_03.tsv) inside the `genomes` folder.
Please start by going through the following steps:
1. Install chewBBACA. Check [Installing chewBBACA](https://github.com/B-UMMI/chewBBACA/wiki/0.-Setting-up-chewBBACA) for instructions on how to install chewBBACA. chewBBACA includes Prodigal training files for several species, including for _Streptococcus agalactiae_. You can check the list of available training files [here](https://github.com/B-UMMI/chewBBACA/raw/master/CHEWBBACA/prodigal_training_files/). We have included the training file for _Streptococcus agalactiae_ in this repository.
2. Clone this repository to the local folder of your choice. To clone, run the following command:
`git clone https://github.com/B-UMMI/chewBBACA_tutorial`
3. Go to the top-level directory of the cloned repository, `.../chewBBACA_tutorial/`, and run `unzip genomes/complete_genomes.zip` to extract all the complete genomes.
The execution times reported in this tutorial were obtained for a DELL XPS13 (10th Generation Intel® Core™ i7-10710U Processor - 12MB Cache, up to 4.7 GHz, using 6 cores). Using a computer with less powerful specifications can greatly increase the duration of the analyses.
The commands used in this tutorial assume that the working directory is the top-level directory of the cloned repository, `.../chewBBACA_tutorial/`. The commands should be modified if they are executed from a different working directory. We have included the expected results for each section in the `expected_results` folder for reference (each subfolder has the name of one of the sections).
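Since every command below depends on the working directory, it can help to confirm the inputs are in place before starting. The following is a minimal sketch, not part of chewBBACA: the function name `missing_inputs` is hypothetical, and the two paths are the ones the tutorial commands refer to (the training file and the extracted `complete_genomes` directory).

```python
from pathlib import Path

# Inputs referenced by the tutorial commands, relative to the
# top-level directory of the cloned repository.
REQUIRED = ["Streptococcus_agalactiae.trn", "complete_genomes"]


def missing_inputs(workdir="."):
    """Return the required tutorial inputs that are absent from workdir."""
    root = Path(workdir)
    return [name for name in REQUIRED if not (root / name).exists()]
```

Calling `missing_inputs()` from the top-level directory should return an empty list once the repository is cloned and the genomes are unzipped.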
## Schema creation
We will start by creating a wgMLST schema based on **32** _Streptococcus agalactiae_ complete genomes (assembly level classified as complete genome or chromosome) available at NCBI. The sequences are in the `complete_genomes/` directory. To create the wgMLST schema, run the following command:
```
chewBBACA.py CreateSchema -i complete_genomes/ -o tutorial_schema --ptf Streptococcus_agalactiae.trn --cpu 6
```
The schema seed will be available at `tutorial_schema/schema_seed`. We passed the value `6` to the `--cpu` parameter to use 6 CPU cores, but you should pass a value based on the specifications of your machine. On our system, the process took 56 seconds to complete, resulting in a wgMLST schema with 3128 loci. At this point the schema is defined as a set of loci, each with a single representative allele.
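One way to check the number of loci is to count the FASTA files in the schema directory. This is a minimal sketch under one assumption: that the schema stores one `.fasta` file per locus directly inside `tutorial_schema/schema_seed`, with any auxiliary files kept in subdirectories or under other extensions. The helper name `count_loci` is hypothetical.

```python
from pathlib import Path


def count_loci(schema_dir):
    """Count the FASTA files directly inside schema_dir.

    Assumes one .fasta file per locus; files in subdirectories
    and files with other extensions are ignored.
    """
    return sum(1 for path in Path(schema_dir).glob("*.fasta") if path.is_file())
```

If the assumption holds, `count_loci("tutorial_schema/schema_seed")` should agree with the number of loci reported for your run.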
## Allele calling
The next step is to perform allele calling with the wgMLST schema created in the previous step for the **32** complete genomes. The allele call step determines the allelic profiles of the analyzed strains, identifying known and novel alleles in the analyzed genomes. Novel alleles are assigned an allele identifier and added to the schema. To perform allele call, run the following command:
```
chewBBACA.py AlleleCall -i complete_genomes/ -g tutorial_schema/schema_seed -o results32_wgMLST --cpu 6
```
The allele call used the default BSR threshold of 0.6 (more information on the threshold [here](https://github.com/B-UMMI/chewBBACA/wiki/2.-Allele-Calling)) and took approximately 17 minutes to complete (an average of 32 seconds per genome). The allele call identified 14,720 novel alleles and added those alleles to the schema, increasing the number of alleles in the schema from 3,128 to 17,848.
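To see how the schema grew after allele calling, the alleles stored for each locus can be counted from the FASTA headers. The sketch below assumes one `.fasta` file per locus in the schema directory and one header line (starting with `>`) per allele; `alleles_per_locus` is a hypothetical helper, not a chewBBACA function.

```python
from pathlib import Path


def alleles_per_locus(schema_dir):
    """Map each locus FASTA file name to its number of allele records.

    Assumes each allele is one FASTA record, so header lines
    (those starting with '>') are counted.
    """
    counts = {}
    for fasta in sorted(Path(schema_dir).glob("*.fasta")):
        with fasta.open() as handle:
            counts[fasta.name] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

The schema-wide total is then `sum(alleles_per_locus("tutorial_schema/schema_seed").values())`, which can be compared with the figure reported above (3,128 + 14,720 = 17,848 in our run).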
## Paralog detection
The next step in the analysis is to determine whether some of the loci can be considered paralogs, based on the results of the wgMLST allele calling. The _Allele call_ returns a list of paralogous genes in the `RepeatedLoci.txt` file, which can be found in the `results32_wgMLST/results_<datestamp>` folder.
The `RepeatedLoci.txt` file contains a set of 20 loci that were identified as possible paralogs. These loci should be removed from the schema due to the potential uncertainty in allele assignment (for a more detailed description see the [Allele Calling](https://github.com/B-UMMI/chewBBACA/wiki/2.-Allele-Calling) entry on the wiki). To remove the set of 20 paralogous loci from the allele calling results, run the following command:
```
chewBBACA.py RemoveGenes -i results32_wgMLST/results_<datestamp>/results_alleles.tsv -g results32_wgMLST/results_<datestamp>/RepeatedLoci.txt -o results32_wgMLST/results_<datestamp>/results_alleles_NoParalogs.tsv
```
This will remove the columns matching the 20 paralogous loci from the allele calling results and save the allelic profiles into the `results_alleles_NoParalogs.tsv` file (the new file contains allelic profiles with 3108 loci).
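Conceptually, this step is a column filter on the allelic profile matrix. The sketch below illustrates the idea and is not the RemoveGenes implementation. It assumes a tab-separated table whose first row holds the column names (genome identifier first, then one column per locus); the function name `drop_loci` and the locus names in the usage note are hypothetical.

```python
import csv
import io


def drop_loci(tsv_text, loci_to_drop):
    """Return tsv_text without the columns whose header is in loci_to_drop."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ""
    drop = set(loci_to_drop)
    # Indices of the columns to keep, in their original order.
    keep = [i for i, name in enumerate(rows[0]) if name not in drop]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()
```

For example, dropping `locusB` from a table with columns `FILE`, `locusA`, `locusB`, `locusC` leaves `FILE`, `locusA`, `locusC` and the same number of rows.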
## cgMLST schema determination
We can now determine the set of loci in the core genome based on the allele calling results. The set of loci in the core genome is determined based on a threshold of loci presence in the analysed genomes. We can run the TestGenomeQuality module to determine the impact of several threshold values on the number of loci in the core genome.
```
chewBBACA.py TestGenomeQuality -i results32_wgMLST/results_<datestamp>/results_alleles_NoParalogs.tsv -n 13 -t 200 -s 5 -o results32_wgMLST/results_<datestamp>/genome_quality_32
```
The process will automatically open an HTML file with the following plot:
![Genome quality testing of complete genomes](https://i.imgur.com/uf3Hygd.png)
[larger image fig 1](https://i.imgur.com/uf3Hygd.png) or [see interactive plot online](http://im.fm.ul.pt/chewBBACA/GenomeQual/GenomeQualityPlot_complete_genomes.html)
A set of **1136 loci** was found to be present in all the analyzed complete genomes, while **1267 loci** were present in at least 95% of them.
For further analysis, only the **1267** loci present in at least 95% of the complete genomes will be used. We selected that threshold value to account for loci that may not be identified due to sequencing coverage and assembly problems.
We can run the ExtractCgMLST module to quickly determine the set of loci in the core genome at 95%.
```
chewBBACA.py ExtractCgMLST -i results32_wgMLST/results_<datestamp>/results_alleles_NoParalogs.tsv -o results32_wgMLST/results_<datestamp>/cgMLST_95 --t 0.95
```
The list with the 1267 loci in the core genome at 95% is in the `results32_wgMLST/results_<datestamp>/cgMLST_95/cgMLSTschema.txt` file. This file can be passed to the `--gl` parameter of the AlleleCall process to perform allele calling only for the set of genes that constitute the core genome.
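The presence threshold can be illustrated with a short sketch that keeps the loci called in at least a given fraction of genomes. This is not the ExtractCgMLST implementation. It assumes a tab-separated profile table (genome identifier first, one column per locus) in which a locus counts as present when the cell is an allele number, optionally prefixed with `INF-`, and any other value (for example `LNF`) counts as missing. The names `is_called` and `core_loci` are hypothetical.

```python
import csv
import io


def is_called(cell):
    """True if the cell holds an allele number such as '12' or 'INF-12'."""
    value = cell.strip()
    if value.startswith("INF-"):
        value = value[4:]
    return value.isdigit()


def core_loci(tsv_text, threshold=0.95):
    """Return the loci present in at least `threshold` of the genomes."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if len(rows) < 2:
        return []
    header, profiles = rows[0], rows[1:]
    core = []
    for col in range(1, len(header)):
        present = sum(is_called(row[col]) for row in profiles)
        if present / len(profiles) >= threshold:
            core.append(header[col])
    return core
```

With `threshold=1.0` the function returns only the loci found in every genome, which corresponds to the strict core; lowering the value tolerates loci missed in a few assemblies.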
## Allele call for 682 _Streptococcus agalactiae_ assemblies
**682 assemblies** of _Streptococcus agalactiae_ available on NCBI were downloaded (03-08-2016, downloadable zip file [here](https://drive.google.com/file/d/0Bw6VuoagsdhmaWEtR25fODlJTEk/view?usp=sharing); run `unzip GBS_Aug2016.zip` to extract the genome files into a folder named `GBS_Aug2016`). The assemblies were analyzed with [MLST](https://github.com/tseemann/mlst) to exclude samples possibly mislabeled as _Streptococcus agalactiae_. Out of the **682 genomes**, 2 (GCA_000323065.2_ASM32306v2 and GCA_001017915.1_ASM101791v1) were detected as belonging to a different species or as contaminated and were removed from the analysis.
Allele calling was performed on the **680** bona fide _Streptococcus agalactiae_ **genomes** using the **1267 loci** that constitute the core genome at 95%. Paralog detection found no paralogous loci.
```
chewBBACA.py AlleleCall -i path/to/GBS_Aug2016/ -g tutorial_schema/schema_seed --gl results32_wgMLST/results_<datestamp>/cgMLST_95/cgMLSTschema.txt -o
```