# CVRCseq
This is a collection of commonly used pipelines integrated into a single workflow via snakemake. Previously, I had all of these as individual snakemake workflows. This workflow is designed to run on NYU's UltraViolet HPC, which utilizes Slurm and has a variety of different node types.
## RNA-seq
There are currently four RNA-seq analysis pipelines available:
1. RNAseq_PE: paired-end data; FastQC > fastp > STAR > featureCounts
2. RNAseq_SE: single-end data; FastQC > fastp > STAR > featureCounts
3. RNAseq_HISAT2_stringtie: paired-end data; FastQC > fastp > HISAT2 > StringTie
4. RNAseq_HISAT2_stringtie_nvltrx: paired-end data; FastQC > fastp > HISAT2 > StringTie with novel transcript identification
## DNA Binding/enrichment
There are currently three analysis pipelines available:
1. ChIPseq_PE: paired-end data; FastQC > fastp > Bowtie2 > MACS2
2. CUT-RUN_PE: paired-end data; FastQC > fastp > Bowtie2 > SEACR
3. ATACseq_PE: paired-end data; FastQC > fastp > Bowtie2 > MACS2
# Description of files:
## Snakefiles
workflow/Snakefile launches the individual pipelines defined in workflow/rules.
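A minimal sketch of how the top-level Snakefile can dispatch to a pipeline based on the config; the repo's actual include logic and rule-file naming may differ:

```python
# workflow/Snakefile (illustrative sketch only; the real dispatch logic may differ)
configfile: "config/config.yaml"

# Assumption: each pipeline has a matching rule file, e.g. workflow/rules/RNAseq_PE.smk,
# and config.yaml's 'workflow' key holds the pipeline name.
include: "rules/" + config["workflow"] + ".smk"
```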
## config/samples_info.tab
This file contains a tab-delimited table (an example follows the list) with:
1. The names of the R1 and R2 fastq files for each sample as received from the sequencing center. If a sample was split over multiple lanes, omit the lane number ('L00X') from the fastq file names; cat_rename.py strips it when it concatenates fastq files split over multiple lanes.
2. Simple sample names
3. Condition (e.g. diabetic vs non_diabetic)
4. Replicate #
5. If using ChIPseq or CUT-RUN, a column titled 'antibody' is required. This column specifies whether the sample is the ChIP/pull-down antibody or a control (input, IgG, etc.)
6. The final sample_id is the concatenation of the sample name, condition, replicate, and antibody (if present) columns
7. Additional metadata can be added to this table for downstream analysis
8. For ChIPseq and CUT-RUN, the sample name, condition, and replicate should be identical for each pair of antibody and control fastq files; the antibody column specifies which member of the pair is the antibody and which is the control.
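For illustration, a CUT-RUN samples_info.tab might look something like the table below (tab-delimited in the actual file; the exact header names other than 'antibody' are assumptions, so check the template shipped with the repo):

```
fastq_R1                    fastq_R2                    sample  condition  replicate  antibody
Heart1_S1_R1_001.fastq.gz   Heart1_S1_R2_001.fastq.gz   heart   diabetic   1          H3K27ac
Heart2_S2_R1_001.fastq.gz   Heart2_S2_R2_001.fastq.gz   heart   diabetic   1          IgG
```

Here the first row would produce the final sample_id heart_diabetic_1_H3K27ac.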
## config/config.yaml
This file contains required general and workflow-specific configuration info (an illustrative example follows the lists below).
Generic requirements:
* sample_file: location of the samples_info.tab file (default: config/samples_info.tab)
* workflow: name of the workflow being used
* genome: location of the indexed genome
  1. For RNAseq_PE or RNAseq_SE - STAR 2.7.7a index
  2. For the HISAT2 workflows - HISAT2 index
  3. For ChIPseq, CUT-RUN, or ATACseq - Bowtie2 index
* GTF: location of the .gtf file

Workflow-specific requirements:
* CUT-RUN_PE
  * spike_genome: location of the spike-in genome index (Bowtie2 index). Spike-in normalization is only implemented in CUT-RUN.
  * chromosome_lengths: location of the chromosome lengths file; required for spike-in normalization in CUT-RUN
* ChIPseq_PE
  * effective_genome_size: effective genome size for MACS2
* ATACseq_PE
  * effective_genome_size: effective genome size for MACS2
* RNAseq_HISAT2_stringtie or RNAseq_HISAT2_stringtie_nvltrx
  * prepDE_length: average fragment length for the StringTie prepDE script
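As an illustration, a config.yaml for a CUT-RUN_PE project might look roughly like this (paths are placeholders and the exact key spellings should be checked against the template in the repo):

```yaml
sample_file: config/samples_info.tab
workflow: CUT-RUN_PE
genome: /path/to/bowtie2_index/GRCh38           # Bowtie2 index prefix
GTF: /path/to/annotation.gtf
spike_genome: /path/to/bowtie2_index/ecoli      # spike-in genome (Bowtie2 index), CUT-RUN only
chromosome_lengths: /path/to/GRCh38.chrom.sizes
```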
## config/profile/config.yaml
This file contains the default Slurm resources for each rule.
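The exact layout depends on the Snakemake version and executor in use; as a rough sketch, per-rule Slurm resources in a profile often look something like the following (the rule name and values here are made up for illustration and are not the contents of this repo's profile):

```yaml
# Sketch only; the repo's profile may use a different layout (e.g. a cluster/cluster-config setup)
default-resources:
  mem_mb: 8000
  runtime: 120          # minutes
set-threads:
  star_align: 12        # hypothetical rule name
set-resources:
  star_align:
    mem_mb: 40000
```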
## workflow/scripts/cat_rename.py
This script:
1. Concatenates fastq files for samples that were split over multiple sequencing lanes
2. Renames the fastq files from the typically verbose IDs given by the sequencing center to those supplied in samples_info.tab (a sketch of the concatenate-and-rename step follows this list)
3. The sample name, condition, and replicate columns are concatenated and form the new sample_id_Rx.fastq.gz files
4. This script is executed via snakemake_init.sh prior to launching the appropriate snakemake pipeline
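A minimal sketch of the concatenate-and-rename step, assuming standard Illumina lane naming; the real cat_rename.py parses samples_info.tab, handles both single- and paired-end data, and may structure this differently:

```python
import glob
import shutil

def cat_rename(fastq_dir, orig_name, new_sample_id, read):
    """Concatenate lane-split fastq.gz files and write them under the new sample_id.

    orig_name is the sequencing-center file name with the lane token ('L00X') removed,
    e.g. 'Heart1_S1_R1_001.fastq.gz'; the lane files are globbed back in with 'L00*'.
    """
    prefix, suffix = orig_name.split(f"_{read}_")
    lane_files = sorted(glob.glob(f"{fastq_dir}/{prefix}_L00*_{read}_{suffix}"))
    out_path = f"{fastq_dir}/{new_sample_id}_{read}.fastq.gz"
    with open(out_path, "wb") as out:
        for lane_file in lane_files:
            with open(lane_file, "rb") as fh:
                shutil.copyfileobj(fh, out)  # gzip members can be concatenated byte-wise
```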
## workflow/scripts/snakemake_init.sh
This bash script:
1. Executes cat_rename.py
2. Loads the miniconda3/cpu/4.9.2 module on UltraViolet
3. Executes snakemake
4. Runs multiqc
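In outline, the script does something like the following; the exact commands and flags are assumptions based on the steps listed above, so treat this as a sketch rather than the script's actual contents:

```bash
#!/bin/bash
# Illustrative outline only; the real script parses -h/-d/-c/-w and builds these commands itself.

python workflow/scripts/cat_rename.py      # 1. concatenate/rename fastqs (arguments omitted here)
module load miniconda3/cpu/4.9.2           # 2. load conda on UltraViolet
snakemake --profile config/profile         # 3. launch the selected pipeline (flags are assumptions)
multiqc .                                  # 4. aggregate QC reports into a single report
```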
## workflow/scripts/FRP.py
This script computes the fraction of reads in peaks (FRP) and outputs a table with FRP, total fragments, and fragments within peaks.
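A hedged sketch of how FRP can be computed from a coordinate-sorted, indexed BAM and a peak BED file; the real FRP.py may count fragments rather than reads and take different inputs:

```python
import csv
import pysam

def fraction_reads_in_peaks(bam_path, peaks_bed, out_tsv):
    """Count reads overlapping peaks and write FRP, total mapped reads, and reads in peaks."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    total = bam.mapped                       # total mapped reads, read from the BAM index
    in_peaks = 0
    with open(peaks_bed) as bed:
        for line in bed:
            chrom, start, end = line.split("\t")[:3]
            in_peaks += bam.count(chrom, int(start), int(end))
    with open(out_tsv, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["FRP", "total_reads", "reads_in_peaks"])
        writer.writerow([in_peaks / total if total else 0, total, in_peaks])
```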
## workflow/envs/CVRCseq.yml
This file contains the conda environment info used by this pipeline.
## Usage
When starting a new project:
1. Clone the git repo using 'git clone https://github.com/mgildea87/CVRCseq.git'
2. Update the samples_info.tab file with fastq.gz file names and desired sample, condition, replicate names, and Antibody/IgG control status (if using)
3. Update config.yaml
4. Modify parameters in the appropriate workflow/rules .smk file if desired, e.g. alignment parameters
5. Run 'bash workflow/scripts/snakemake_init.sh'
Description of parameters:
* -h: help
* -d: .fastq directory
* -c: absolute path to conda environment
* -w: workflow
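For example, to launch the paired-end RNA-seq pipeline (paths are placeholders; the assumption here is that -w takes the pipeline name):

```bash
bash workflow/scripts/snakemake_init.sh \
    -d /path/to/fastq_dir \
    -c /path/to/conda/env \
    -w RNAseq_PE
```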
# To-do
* Add testing data and tests
* Enrichment pipelines
* Add irreproducible discovery rate (IDR) for identifying robust peak sets between replicates. See ENCODE pipeline
* Add deduplication by default, likely via Picard prior to peak calling (currently MACS2 handles this)
* Enable more efficient handling of experimental designs where the same input is used for multiple pull-down/antibody samples, e.g. ChIRPseq
* Simplify cat_rename.py to take sample prefixes (text upstream of the lane number '_L00X') supplied via samples_info.tab.
* Add a parameter to specify the output directory name. Right now it's given the pipeline name.
* Add salmon pipeline for RNAseq
* Add rules to snakefiles containing R scripts for some downstream QC and plotting. e.g. for RNAseq: PCA, replicate scatter plots, count statistics or for ATACseq: fragment length distributions, FRP plots, replicate comparisons.