# CVRCseq
This is a collection of commonly used pipelines integrated into a single workflow via snakemake. Previously, I had all of these as individual snakemake workflows. This workflow is designed to run on NYU's UltraViolet HPC, which utilizes Slurm and has a variety of different node types.
## RNA-seq
There are currently 4 RNA-seq analysis pipelines available
1. RNAseq_PE
paired-end data fastqc > fastp > STAR > featurecounts
2. RNAseq_SE
single-end data fastqc > fastp > STAR > featurecounts
3. RNAseq_HISAT2_stringtie
paired-end data fastqc > fastp > HISAT2 > stringtie
4. RNAseq_HISAT2_stringtie_nvltrx
paired-end data fastqc > fastp > HISAT2 > stringtie novel transcript identification
## DNA Binding/enrichment
There are currently 2 analysis pipelines available
1. ChIPseq_PE
paired-end data fastqc > fastp > bowtie2 > macs2
2. CUT-RUN_PE
paired-end data fastqc > fastp > bowtie2 > seacr
3. ATACseq_PE
paired-end data fastqc > fastp > bowtie2 > macs2
# Description of files:
## Snakefiles
workflow/Snakefile launches the individual pipelines in workflow/rules
## config/samples_info.tab
This file contains a tab deliminated table with:
1. The names of R1 and R2 of each fastq file as received from the sequencing center. If sample was split over multiple lanes, remove the lane number ('L00X') from the fastq file name. cat_rename.py removes this when it concatenates .fastq files split over multiple lanes.
2. Simple sample names
3. Condition (e.g. diabetic vs non_diabetic)
4. Replicate #
5. If using ChIPseq or CUT-RUN a column titles 'antibody' is required. antibody specifies if the sample is the ChIP antibody or a control (input or IgG etc...)
6. Sample name is the concatenated final sample_id. This is a concatenation of the sample name, condition, replicate, and antibody (if present) columns
7. Additional metadata can be added to this table for downstream analysis
8. For ChIPseq and CUT-RUN, sample name, condition, and replicate should be identical for each pair of antibody and control fastq files. The antibody column specifies which of the pair is antibody and which is control.
## config/config.yaml
This file contains required general and workflow specific configuaration info.
Generic requirements
sample_file: Where to locate the samples_info.tab file (default config/samples_info.tab)
workflow: name of workflow being used
genome: location of indexed genome.
1. For RNAseq_PE or RNAseq_SE - star 2.7.7a index
2. For HISAT2 workflows - HISAT2 index
3. For ChIPseq or CUT-RUN - bowtie2 index
GTF: location of .gtf file
CUT-RUN_PE
spike_genome: Location of spike-in genome index. This is only implemented in CUT-RUN. bowtie2 index
chromosome_lengths: location of chromosome lengths file. required for spike-in normalization in CUT-RUN
ChIPseq_PE
effective_genome_size: Effective genome size for MACS2
ATACseq_PE
effective_genome_size: Effective genome size for MACS2
RNAseq_HISAT2_stringtie or RNAseq_HISAT2_stringtie_nvltrx
prepDE_length: Average fragment length for stringtie prepDE script
## config/profile/config.yaml
This file contains the default slurm resources for each rule
## workflow/scripts/cat_rename.py
This script:
1. Concatenates fastq files for samples that were split over multiple sequencing lanes
2. Renames the fastq files from the generally verbose ids given by the sequencing center to those supplied in Samples_info.tab.
3. The sample name, condition, and replicate columns are concatenated and form the new sample_id_Rx.fastq.gz files
4. This script is executed via snakemake_init.sh prior to launching the appropriate snakemake pipeline
## workflow/scripts/snakemake_init.sh
This bash script:
1. Executes cat_rename.py
2. loads the miniconda3/cpu/4.9.2 module on Ultraviolet
3. Executes snakemake
4. Runs multiqc
## workflow/scripts/FRP.py
This file computes the fraction of reads in peaks (FRP) and outputs a table with FRP, total fragments, and fragments within peaks.
## workflow/envs/CVRCseq.yml
This file contains the conda environment info used by this pipeline.
## Usage
When starting a new project:
1. Clone the git repo using 'git clone https://github.com/mgildea87/CVRCseq.git'
2. Update the samples_info.tab file with fastq.gz file names and desired sample, condition, replicate names, and Antibody/IgG control status (if using)
3. Update config.yaml
4. Modify parameters in the appropriate worklow/rules .smk file if desired. e.g. alignment parameters.
5. Run 'bash workflow/scripts/snakemake_init.sh'
Description of parameters
-h help"
-d .fastq directory"
-c Absolute path to conda environment"
-w workflow
# To-do
* Add testing data and tests
* Enrichment pipelines
* Add irreproducible discovery rate (IDR) for identifying robust peak sets between replicates. See ENCODE pipeline
* Add deduplication by default. Likely via Picard prior to peak calling. (rn macs does this)
* enable more efficient handling of experimental designs where the same input is used for multiple pull-down/antibody samples. e.g. ChIRPseq.
* Simplify cat_rename.py to take sample prefixes (text upstream of the lane number '_L00X') supplied via samples_info.tab.
* Add parameter to specify output directory name. Right now its given the pipeline name.
* Add salmon pipeline for RNAseq
* Add rules to snakefiles containing R scripts for some downstream QC and plotting. e.g. for RNAseq: PCA, replicate scatter plots, count statistics or for ATACseq: fragment length distributions, FRP plots, replicate comparisons.
没有合适的资源?快使用搜索试试~ 我知道了~
用于RNAseq、CUT-RUN和ChIPseq分析的蛇行工作流集合.zip
共16个文件
smk:7个
yaml:2个
py:2个
1.该资源内容由用户上传,如若侵权请联系客服进行举报
2.虚拟产品一经售出概不退款(资源遇到问题,请及时私信上传者)
2.虚拟产品一经售出概不退款(资源遇到问题,请及时私信上传者)
版权申诉
0 下载量 103 浏览量
2023-11-15
10:54:05
上传
评论
收藏 21KB ZIP 举报
温馨提示
用于RNAseq、CUT-RUN和ChIPseq分析的蛇行工作流集合.zip
资源推荐
资源详情
资源评论
收起资源包目录
用于RNAseq、CUT-RUN和ChIPseq分析的蛇行工作流集合.zip (16个子文件)
CVRCseq-main
workflow
rules
RNAseq_SE.smk 3KB
RNAseq_PE.smk 4KB
RNAseq_PE_HISAT2_stringtie_nvltrx.smk 5KB
CUT-RUN_PE.smk 6KB
RNAseq_PE_HISAT2_stringtie.smk 3KB
ChIPseq_PE.smk 5KB
ATACseq_PE.smk 5KB
Snakefile 276B
envs
CVRCseq.yml 9KB
scripts
FRP.py 2KB
cat_rename.py 5KB
snakemake_init.sh 2KB
README.md 6KB
config
profile
config.yaml 252B
config.yaml 163B
samples_info.tab 64B
A
共 16 条
- 1
资源评论
GZM888888
- 粉丝: 206
- 资源: 2970
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功