[Free] Review of best-practice RNA-seq analysis workflows: A_survey_of_best_practices_for_RNA_seq_data


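Since its introduction, RNA-seq (RNA sequencing) has become an indispensable core tool in molecular biology, and it plays a key role in transcriptomics in particular. The technique combines transcript discovery with quantification of gene expression in a single high-throughput sequencing assay. As RNA-seq has spread from genomics to the wider life sciences, a variety of experimental designs and analysis strategies have emerged. The review covers best practices for RNA-seq data analysis across the whole workflow, from experimental design and quality control to interpretation of the data.

Experimental design is key to a successful RNA-seq study: sample size, number of replicates, treatment conditions and suitable controls all need to be considered to ensure statistical power and reliable results. Quality control is the first step of data analysis and includes evaluating the raw sequencing data, for example read quality, length distribution and possible contamination. Commonly used tools are FastQC, which reports these metrics, and Trimmomatic, which removes low-quality reads and adapter sequences to improve the accuracy of downstream analysis. Read alignment then maps the sequencing data to a reference genome or transcriptome. Aligners such as STAR, TopHat and HISAT2 each have strengths and weaknesses, and the choice should be weighed against the research goals and the characteristics of the data. After alignment, quantification estimates the expression levels of genes and transcripts, for example with Cufflinks. On that basis, differential expression analysis with tools such as DESeq2 or edgeR examines how gene expression changes between conditions or samples.

RNA-seq can also reveal splicing variation, such as intron retention and altered splice sites, which is important for understanding the regulation of gene expression and disease mechanisms. Beyond conventional gene expression analysis, RNA-seq can be used to detect gene fusions, which matters particularly in some cancer studies. eQTL (expression quantitative trait loci) mapping helps to relate gene expression to genetic variation and so to dissect gene regulatory networks. The analysis of small RNAs (such as miRNA and siRNA), which act in post-transcriptional regulation, is another important application. Integrating RNA-seq with other omics data, for example ChIP-seq or ATAC-seq, can help to reveal the multiple layers of gene expression regulation.

Looking ahead, new technologies such as single-molecule sequencing and spatial transcriptomics will bring both challenges and opportunities for RNA-seq analysis. They are expected to provide finer information on transcript structure and on spatiotemporal expression patterns within cells. In summary, RNA-seq data analysis involves several complex steps, each with its own challenges, and choosing best practices requires weighing experimental design, analysis tools and research goals together.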


REVIEW Open Access

A survey of best practices for RNA-seq data analysis

Ana Conesa1,2*, Pedro Madrigal3,4*, Sonia Tarazona2,5, David Gomez-Cabrero6,7,8,9, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, Daniel J. Gaffney, Laura L. Elo, Xuegong Zhang14,15 and Ali Mortazavi16,17*

Abstract

RNA-sequencing (RNA-seq) has a wide variety of

applications, but no single analysis pipeline can be

used in all cases. We review all of the major steps in

RNA-seq data analysis, including experimental design,

quality control, read alignment, quantification of gene

and transcript levels, visualization, differential gene

expression, alternative splicing, functional analysis,

gene fusion detection and eQTL mapping. We

highlight the challenges associated with each step.

We discuss the analysis of small RNAs and the

integration of RNA-seq with other functional

genomics techniques. Finally, we discuss the outlook

for novel technologies that are changing the state of

the art in transcriptomics.

Background

Transcript identification and the quantification of gene expression have been distinct core activities in molecular biology ever since the discovery of RNA’s role as the key intermediate between the genome and the proteome. The power of sequencing RNA lies in the fact that the twin aspects of discovery and quantification can be combined in a single high-throughput sequencing assay called RNA-sequencing (RNA-seq). The pervasive adoption of RNA-seq has spread well beyond the genomics community and has become a standard part of the toolkit used by the life sciences research community. Many variations of RNA-seq protocols and analyses have been published, making it challenging for new users to appreciate all of the steps necessary to conduct an RNA-seq study properly.

* Correspondence: aconesa@ufl.edu; pm12@sanger.ac.uk; ali.mortazavi@uci.edu

Institute for Food and Agricultural Sciences, Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32603, USA

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK

Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697-2300, USA

Full list of author information is available at the end of the article

There is no optimal pipeline for the variety of different

applications and analysis scenarios in which RNA-seq

can be used. Scientists plan experiments and adopt dif-

ferent analysis strategies depending on the organism be-

ing studied and their research goals. For example, if a

genome sequence is available for the studied organism,

it should be possible to identify transcripts by mapping

RNA-seq reads onto the genome. By contrast, for organ-

isms without sequenced genomes, quantification would

be achieved by first assembling reads de novo into con-

tigs and then mapping these contigs onto the transcrip-

tome. For well-annotated genomes such as the human

genome, researchers may choose to base their RNA-seq

analysis on the existing annotated reference transcrip-

tome alone, or might try to identify new transcripts and

their differential regulation. Furthermore, investigators

might be interested only in messenger RNA isoform ex-

pression or microRNA (miRNA) levels or allele variant

identification. Both the experimental design and the ana-

lysis procedures will vary greatly in each of these cases.

RNA-seq can be used solo for transcriptome profiling or

in combination with other functional genomics methods

to enhance the analysis of gene expression. Finally, RNA-

seq can be coupled with different types of biochemical

assay to analyze many other aspects of RNA biology, such

as RNA–protein binding, RNA structure, or RNA–RNA

interactions. These applications are, however, beyond the

scope of this review as we focus on ‘typical’ RNA-seq.

Every RNA-seq experimental scenario could potentially have different optimal methods for transcript quantification, normalization, and ultimately differential expression analysis. Moreover, quality control checks should be applied pertinently at different stages of the analysis to ensure both reproducibility and reliability of the results. Our focus is to outline current standards and resources for the bioinformatics analysis of RNA-seq data. We do not aim to provide an exhaustive compilation of resources or software tools nor to indicate one best analysis pipeline. Rather, we aim to provide a commented guideline for RNA-seq data analysis. Figure 1 depicts a generic roadmap for experimental design and analysis using standard Illumina sequencing. We also briefly list several data integration paradigms that have been proposed and comment on their potential and limitations. We finally discuss the opportunities as well as challenges provided by single-cell RNA-seq and long-read technologies when compared to traditional short-read RNA-seq.

International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Conesa et al. Genome Biology (2016) 17:13

DOI 10.1186/s13059-016-0881-8

Experimental design

A crucial prerequisite for a successful RNA-seq study is

that the data generated have the potential to answer the

biological questions of interest. This is achieved by first

defining a good experimental design, that is, by choosing

the library type, sequencing depth and number of repli-

cates appropriate for the biological system under study,

and second by planning an adequate execution of the se-

quencing experiment itself, ensuring that data acquisi-

tion does not become contaminated with unnecessary

biases. In this section, we discuss both considerations.

One important aspect of the experimental design is the RNA-extraction protocol used to remove the highly abundant ribosomal RNA (rRNA), which typically constitutes over 90 % of total RNA in the cell, leaving the 1–2 % comprising messenger RNA (mRNA) that we are normally interested in. For eukaryotes, this involves choosing whether to enrich for mRNA using poly(A) selection or to deplete rRNA. Poly(A) selection typically requires a relatively high proportion of mRNA with minimal degradation as measured by RNA integrity number (RIN), which normally yields a higher overall fraction of reads falling onto known exons. Many biologically relevant samples (such as tissue biopsies) cannot, however, be obtained in great enough quantity or good enough mRNA integrity to produce good poly(A) RNA-seq libraries and therefore require ribosomal depletion. For bacterial samples, in which mRNA is not polyadenylated, the only viable alternative is ribosomal depletion. Another consideration is whether to generate strand-preserving libraries. The first generation of Illumina-based RNA-seq used random hexamer priming to reverse-transcribe poly(A)-selected mRNA. This methodology did not retain information about which DNA strand is actually expressed [1] and therefore complicates the analysis and quantification of antisense or overlapping transcripts. Several strand-specific protocols [2], such as the widely used dUTP method, extend the original protocol by incorporating dUTP nucleotides during the second cDNA synthesis step, prior to adapter ligation followed by digestion of the strand containing dUTP [3]. In all cases, the size of the final fragments (usually less than 500 bp for Illumina) will be crucial for proper sequencing and subsequent analysis. Furthermore, sequencing can involve single-end (SE) or paired-end (PE) reads, although the latter is preferable for de novo transcript discovery or isoform expression analysis [4, 5]. Similarly, longer reads improve mappability and transcript identification [5, 6]. The best sequencing option depends on the analysis goals. The cheaper, short SE reads are normally sufficient for studies of gene expression levels in well-annotated organisms, whereas longer and PE reads are preferable to characterize poorly annotated transcriptomes.

Fig. 1 A generic roadmap for RNA-seq computational analyses. The major analysis steps are listed above the lines for pre-analysis, core analysis and advanced analysis. The key analysis issues for each step that are listed below the lines are discussed in the text. a Preprocessing includes experimental design, sequencing design, and quality control steps. b Core analyses include transcriptome profiling, differential gene expression, and functional profiling. c Advanced analysis includes visualization, other RNA-seq technologies, and data integration. Abbreviations: ChIP-seq Chromatin immunoprecipitation sequencing, eQTL Expression quantitative trait loci, FPKM Fragments per kilobase of exon model per million mapped reads, GSEA Gene set enrichment analysis, PCA Principal component analysis, RPKM Reads per kilobase of exon model per million reads, sQTL Splicing quantitative trait loci, TF Transcription factor, TPM Transcripts per million

Another important factor is sequencing depth or li-

brary size, which is the number of sequenced reads for a

given sample. More transcripts will be detected and their

quantification will be more precise as the sample is se-

quenced to a deeper level [1]. Nevertheless, optimal se-

quencing depth again depends on the aims of the

experiment. While some authors will argue that as few

as five million mapped reads are sufficient to quantify

accurately medium to highly expressed genes in most

eukaryotic transcriptomes, others will sequence up to

100 million reads to quantify precisely genes and tran-

scripts that have low expression levels [7]. When study-

ing single cells, which have limited sample complexity,

quantification is often carried out with just one million

reads but may be done reliably for highly expressed

genes with as few as 50,000 reads [8]; even 20,000 reads

have been used to differentiate cell types in splenic tissue

[9]. Moreover, optimal library size depends on the com-

plexity of the targeted transcriptome. Experimental results

suggest that deep sequencing improves quantification and

identification but might also result in the detection of

transcriptional noise and off-target transcripts [10]. Satur-

ation curves can be used to assess the improvement in

transcriptome coverage to be expected at a given sequen-

cing depth [10].
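The idea of a saturation curve can be illustrated with a minimal sketch: reads are subsampled at increasing fractions and the number of genes detected at each depth is recorded. The function name `saturation_curve`, the toy gene labels and the detection threshold `min_reads` are assumptions made for illustration; real analyses would work from aligned reads or count tables.

```python
import random
from collections import Counter


def saturation_curve(read_gene_ids, fractions, min_reads=1, seed=0):
    """Return (reads_sampled, genes_detected) for each subsampling fraction.

    read_gene_ids: one gene label per read (hypothetical pre-assigned reads).
    min_reads: reads required before a gene counts as detected (assumed threshold).
    """
    rng = random.Random(seed)
    shuffled = list(read_gene_ids)
    rng.shuffle(shuffled)  # one shuffle: each prefix is a random subsample
    curve = []
    for frac in fractions:
        n = int(len(shuffled) * frac)
        counts = Counter(shuffled[:n])
        detected = sum(1 for c in counts.values() if c >= min_reads)
        curve.append((n, detected))
    return curve


# Toy library with four hypothetical genes of very different abundance
toy_reads = ["geneA"] * 9000 + ["geneB"] * 900 + ["geneC"] * 90 + ["geneD"] * 10
curve = saturation_curve(toy_reads, fractions=[0.01, 0.1, 0.5, 1.0])
for n_reads, n_genes in curve:
    print(f"{n_reads:>6} reads -> {n_genes} genes detected")
```

When the number of detected genes stops rising as more reads are added, further sequencing mostly adds precision for genes already detected rather than new transcripts.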

Finally, a crucial design factor is the number of repli-

cates. The number of replicates that should be included in

an RNA-seq experiment depends on both the amount of

technical variability in the RNA-seq procedures and the

biological variability of the system under study, as well as

on the desired statistical power (that is, the capacity for

detecting statistically significant differences in gene ex-

pression between experimental groups). These two aspects

are part of power analysis calculations (Fig. 1a; Box 1).

The adequate planning of sequencing experiments so as to avoid technical biases is as important as good experimental design, especially when the experiment involves a large number of samples that need to be processed in several batches. In this case, including controls, randomizing sample processing and smart management of sequencing runs are crucial to obtain error-free data (Fig. 1a; Box 2).

Box 1. Number of replicates

Three factors determine the number of replicates required in an RNA-seq experiment. The first factor is the variability in the measurements, which is influenced by the technical noise and the biological variation. While reproducibility in RNA-seq is usually high at the level of sequencing [1, 45], other steps such as RNA extraction and library preparation are noisier and may introduce biases in the data that can be minimized by adopting good experimental procedures (Box 2). Biological variability is particular to each experimental system and is harder to control [189]. Nevertheless, biological replication is required if inference on the population is to be made, with three replicates being the minimum for any inferential analysis. For a proper statistical power analysis, estimates of the within-group variance and gene expression levels are required. This information is typically not available beforehand but can be obtained from similar experiments. The exact power will depend on the method used for differential expression analysis, and software packages exist that provide a theoretical estimate of power over a range of variables, given the within-group variance of the samples, which is intrinsic to the experiment [190, 191]. Table 1 shows an example of statistical power calculations over a range of fold-changes (or effect sizes) and number of replicates in a human blood RNA-seq sample sequenced at 30 million mapped reads. It should be noted that these estimates apply to the average gene expression level, but as dynamic ranges in RNA-seq data are large, the probability that highly expressed genes will be detected as differentially expressed is greater than that for low-count genes [192]. For methods that return a false discovery rate (FDR), the proportion of genes that are highly expressed out of the total set of genes being tested will also influence the power of detection after multiple testing correction [193]. Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the correction and may improve the power of detection [20]. Increasing sequencing depth can also improve statistical power for lowly expressed genes [10, 194], and for any given sample there exists a level of sequencing at which power improvement is best achieved by increasing the number of replicates [195]. Tools such as Scotty are available to calculate the best trade-off between sequencing depth and replicate number given some budgetary constraints [191].
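Box 1 notes that filtering out low-count genes before testing reduces the severity of the multiple testing correction. The sketch below illustrates that effect with the Benjamini-Hochberg procedure, used here as one common FDR method rather than the one any particular package applies. The gene names, counts, p-values and the `MIN_MEAN_COUNT` threshold are invented for illustration, and the filter is assumed to use the overall mean count so that it does not depend on group labels.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene results: (mean count across all samples, raw p-value)
genes = {
    "g1": (500, 0.001), "g2": (300, 0.012), "g3": (250, 0.030),
    "g4": (2, 0.40), "g5": (1, 0.55), "g6": (3, 0.70),
    "g7": (0, 0.85), "g8": (1, 0.95),
}
MIN_MEAN_COUNT = 10  # arbitrary illustrative threshold

# Correction over all eight genes
all_adj = dict(zip(genes, benjamini_hochberg([p for _, p in genes.values()])))

# Correction after removing low-count genes
kept = {g: v for g, v in genes.items() if v[0] >= MIN_MEAN_COUNT}
kept_adj = dict(zip(kept, benjamini_hochberg([p for _, p in kept.values()])))

for g in kept:
    print(f"{g}: adjusted p = {all_adj[g]:.3f} unfiltered, {kept_adj[g]:.3f} filtered")
```

In this toy data the hypothetical gene g3 has an adjusted p-value of about 0.08 when all eight genes are tested and about 0.03 once the five low-count genes are removed, so it passes a 5 % FDR threshold only after filtering.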

Analysis of the RNA-seq data

The actual analysis of RNA-seq data has as many varia-

tions as there are applications of the technology. In this

section, we address all of the major analysis steps for a

typical RNA-seq experiment, which involve quality con-

trol, read alignment with and without a reference genome,

obtaining metrics for gene and transcript expression, and

approaches for detecting differential gene expression. We

also discuss analysis options for applications of RNA-seq

involving alternative splicing, fusion transcripts and small

RNA expression. Finally, we review useful packages for

data visualization.

Quality-control checkpoints

The acquisition of RNA-seq data consists of several

steps — obtaining raw reads, read alignment and quanti-

fication. At each of these steps, specific checks should

be applied to monitor the quality of the data (Fig. 1a).

Raw reads

Quality control for the raw reads involves the analysis of sequence quality, GC content, the presence of adaptors, overrepresented k-mers and duplicated reads in order to detect sequencing errors, PCR artifacts or contaminations. Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiments. We recommend that outliers with over 30 % disagreement be discarded. FastQC [11] is a popular tool to perform these analyses on Illumina reads, whereas NGSQC [12] can be applied to any platform. As a general rule, read quality decreases towards the 3’ end of reads, and if it becomes too low, bases should be removed to improve mappability. Software tools such as the FASTX-Toolkit [13] and Trimmomatic [14] can be used to discard low-quality reads, trim adaptor sequences, and eliminate poor-quality bases.
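To make the 3’ quality trimming step concrete, here is a minimal sketch that decodes Phred+33 quality strings and removes trailing bases below a threshold, then reports the GC fraction of what remains. It is a simplification and not the algorithm used by Trimmomatic or the FASTX-Toolkit; the function names and the `min_q=20` cutoff are assumptions for illustration.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string in Phred+33 encoding to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(seq, qual, min_q=20):
    """Remove bases from the 3' end while their quality is below min_q."""
    scores = phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Toy read: 'I' encodes Q40, '#' encodes Q2, '!' encodes Q0
seq = "ACGTGGCCAT"
qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, gc_fraction(trimmed_seq))
```

Comparing per-sample summaries such as GC fraction across all samples of an experiment is one simple way to spot the outliers mentioned above.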

Read alignment

Reads are typically mapped to either a genome or a tran-

scriptome, as will be discussed later. An important map-

ping quality parameter is the percentage of mapped

reads, which is a global indicator of the overall sequen-

cing accuracy and of the presence of contaminating

DNA. For example, we expect between 70 and 90 % of

regular RNA-seq reads to map onto the human genome

(depending on the read mapper used) [15], with a sig-

nificant fraction of reads mapping to a limited number

of identical regions equally well (‘multi-mapping reads’).

When reads are mapped against the transcriptome, we

expect slightly lower total mapping percentages because

reads coming from unannotated transcripts will be lost,

and significantly more multi-mapping reads because of

reads falling onto exons that are shared by different

transcript isoforms of the same gene.
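The percentage of mapped reads and the share of multi-mapping reads can be sketched from SAM records using only the FLAG field and the optional NH tag. This is a simplified illustration: it assumes the aligner writes NH (number of reported alignments), counts each primary record once (so paired-end mates are counted separately), and uses an invented helper `toy_record` to build example lines.

```python
def mapping_summary(sam_lines):
    """Count primary records, mapped records and multi-mappers in SAM text lines."""
    total = mapped = multi = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:  # skip secondary and supplementary records
            continue
        total += 1
        if flag & 0x4:  # read is unmapped
            continue
        mapped += 1
        nh = 1  # assume a unique hit when the NH tag is absent
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
        if nh > 1:
            multi += 1
    return {
        "total": total,
        "mapped": mapped,
        "multi": multi,
        "pct_mapped": 100.0 * mapped / total if total else 0.0,
    }


def toy_record(name, flag, nh=None):
    """Build a minimal 11-field SAM line for illustration (hypothetical helper)."""
    unmapped = bool(flag & 0x4)
    fields = [
        name, str(flag),
        "*" if unmapped else "chr1",
        "0" if unmapped else "100",
        "0" if unmapped else "255",
        "*" if unmapped else "10M",
        "*", "0", "0", "ACGTACGTAC", "IIIIIIIIII",
    ]
    if nh is not None:
        fields.append(f"NH:i:{nh}")
    return "\t".join(fields)


toy_sam = [
    "@HD\tVN:1.6",
    toy_record("r1", 0, nh=1),    # uniquely mapped
    toy_record("r2", 16, nh=3),   # multi-mapping read, primary record
    toy_record("r2", 272, nh=3),  # secondary record of r2, not counted
    toy_record("r3", 4),          # unmapped
    toy_record("r4", 0),          # mapped, no NH tag
]
summary = mapping_summary(toy_sam)
print(summary)
```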

Other important parameters are the uniformity of read

coverage on exons and the mapped strand. If reads

Box 2. Experiment execution choices

RNA-seq library preparation and sequencing procedures include

a number of steps (RNA fragmentation, cDNA synthesis, adapter

ligation, PCR amplification, bar-coding, and lane loading) that

might introduce biases into the resulting data [196]. Including

exogenous reference transcripts (‘spike-ins’) is useful both for

quality control [1, 197] and for library-size normalization [198].

For bias minimization, we recommend following the suggestions

made by Van Dijk et al. [199], such as the use of adapters with

random nucleotides at the extremities or the use of chemical-based

fragmentation instead of RNase III-based fragmentation. If the

RNA-seq experiment is large and samples have to be processed in

different batches and/or Illumina runs, caution should be taken to

randomize samples across library preparation batches and lanes so

as to avoid technical factors becoming confounded with

experimental factors. Another option, when samples are individually

barcoded and multiple Illumina lanes are needed to achieve the

desired sequencing depth, is to include all samples in each lane,

which would minimize any possible lane effect.
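One way to keep technical factors from becoming confounded with experimental factors, as Box 2 recommends, is to randomize within each experimental group and spread each group evenly across batches. The sketch below shows one such scheme; the function name `assign_batches`, the sample labels and the batch count are assumptions for illustration, and other randomization designs are equally valid.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Assign samples to batches so each condition is spread across all batches.

    samples: dict mapping sample id -> experimental condition.
    Returns a dict mapping batch index -> list of sample ids.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples.items():
        by_condition[condition].append(sample_id)

    batches = defaultdict(list)
    for condition, ids in sorted(by_condition.items()):
        rng.shuffle(ids)                    # random order within the condition
        offset = rng.randrange(n_batches)   # random starting batch
        for k, sample_id in enumerate(ids):
            batches[(offset + k) % n_batches].append(sample_id)
    return dict(batches)


# Hypothetical experiment: 6 control and 6 treated samples, 3 library-prep batches
samples = {f"C{i}": "control" for i in range(1, 7)}
samples.update({f"T{i}": "treated" for i in range(1, 7)})
batches = assign_batches(samples, n_batches=3)
for batch, members in sorted(batches.items()):
    print(batch, sorted(members))
```

With this design every batch contains two control and two treated samples, so a batch effect cannot be mistaken for a treatment effect.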

Table 1 Statistical power to detect differential expression varies with effect size, sequencing depth and number of replicates

                                        Replicates per group
                                        3         5         10
Effect size (fold change)
  1.25                                  17 %      25 %      44 %
  1.5                                   43 %      64 %      91 %
  2                                     87 %      98 %      100 %
Sequencing depth (millions of reads)
  3                                     19 %      29 %      52 %
  10                                    33 %      51 %      80 %
  15                                    38 %      57 %      85 %

Example of calculations for the probability of detecting differential expression in a single test at a significance level of 5 %, for a two-group comparison using a Negative Binomial model, as computed by the RNASeqPower package of Hart et al. [190]. For a fixed within-group variance (package default value), the statistical power increases with the difference between the two groups (effect size), the sequencing depth, and the number of replicates per group. This table shows the statistical power for a gene with 70 aligned reads, which was the median coverage for a protein-coding gene for one whole-blood RNA-seq sample with 30 million aligned reads from the GTEx Project [214]
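The qualitative trends in Table 1 can be explored with a rough normal-approximation sketch in which power depends on the per-gene read depth, the within-group coefficient of variation, the fold change and the number of replicates. This is an assumption-laden illustration, not the RNASeqPower implementation: the function name `approx_power` and the value `cv=0.4` are chosen for illustration, and the output is not expected to reproduce the percentages in the table.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(n_per_group, fold_change, depth, cv, alpha=0.05):
    """Approximate power of a two-sided, two-group test on log expression.

    depth: expected reads for the gene; cv: within-group coefficient of variation.
    Variance of the log-ratio is approximated by 2 * (1/depth + cv**2) / n.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    variance_term = 1.0 / depth + cv ** 2
    z_effect = sqrt(n_per_group * log(fold_change) ** 2 / (2 * variance_term))
    return NormalDist().cdf(z_effect - z_alpha)


CV = 0.4  # illustrative within-group variability, not a package default
for fold_change in (1.25, 1.5, 2.0):
    row = [approx_power(n, fold_change, depth=70, cv=CV) for n in (3, 5, 10)]
    print(fold_change, [f"{p:.0%}" for p in row])
```

Under this approximation, power rises with more replicates, larger fold changes and deeper coverage of the gene, which matches the direction of the trends reported in the table.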
