Investigating the transcriptomic differences between physiological and pathological conditions helps us gain insights into the mechanisms underlying diseases and the development of therapeutic strategies. The traditional approach relied on low-throughput techniques such as reverse transcription polymerase chain reaction and quantitative polymerase chain reaction, which are limited to analyzing single or a few transcripts of interest [1]. However, alteration of particular gene expression may not always directly lead to the phenotype of interest, but expression change of multiple gene sets can be involved in the consequential biological phenotypes [2]. With rapid technological advancements, researchers are able to analyze global transcriptome profiling. The first transcriptome study was conducted using complementary DNA microarray to monitor the expression of 45 Arabidopsis genes with a single reaction [3]. This study has opened new avenues for investigating transcriptomes on a genome-wide scale, going beyond single-gene analysis. Although current DNA microarrays can offer comprehensive coverage of the genome, this technique is limited to pre-defined transcripts due to the requirement of hybridization with pre-fixed probes on the DNA chip. In addition, the other disadvantages of microarrays are relatively (1) high cost, (2) low specificity, and (3) low reproducibility. After next-generation sequencing (NGS) became available, RNA-sequencing (RNA-seq) has been gradually overtaking DNA microarray as the tool of choice for studying global transcriptome to overcome the disadvantages of microarray [4]. In contrast to hybridization-based microarrays, RNA-seq does not require predesigned probes and provides more sensitive and accurate data at a lower cost [5]. However, RNA-seq presents numerous challenges that need to be overcome for accurate data interpretation. Our goals in this review are to describe distinct RNA-Seq to select the most appropriate assay and how to interpret the RNA-Seq data to gain insights into relevant biological meaning.
RNA-seq is one of the most popular high-throughput technologies that uses NGS to reveal patterns and quantify cellular transcriptomes. Since the development of RNA-seq, nearly 100 distinct methods have evolved from the standard RNA-seq protocol [6]. Nevertheless, the majority of RNA-seq data in public repositories have been generated using Illumina sequencing technology, and most of the steps have not changed substantially [6]. Therefore, this review focuses on the pros and cons of each RNA-seq methods between standard RNA-seq and other two popular variants of RNA-seq, rather than describing the detailed workflows of each RNA-seq.
The standard RNA-seq procedure consists of several steps including RNA isolation, converting to complementary DNA (cDNA), adaptor ligation, constructing a sequencing library, sequencing by synthesis, and analysis. Standard RNA-seq provides transcriptome information on gene expression profiling, splice variant analysis and single nucleotide polymorphism (SNP) discovery (Table 1) [7]. However, standard RNA-seq requires a relatively higher amount of RNA and a higher RNA integrity number (RIN), which can be a significant obstacle for analyzing less abundant cell populations in tissue and low-quality RNA from forensic and certain clinical samples [8, 9].
Comparison of distinct RNA-seq described in this review
Standard RNA-seq | 3’ Tag RNA-seq | De novo transcript assembly | |
---|---|---|---|
Purpose | Discovery of DEGs Splice variant analysis SNP discovery |
Discovery of DEGs | Working with unavailable model organism and/or high amount of genomic alteration sample |
Applicable species | Prokaryote, eukaryote | Eukaryote | Prokaryote, eukaryote |
Recommended read depth | 10~30 million reads | <5 million reads | 100~200 million reads |
Cost | Average | Low | High |
Recommended input RNA (ng) | >500 | >25 | >500 |
Recommended RIN | >8.0 | >5.0 | >5.3 (10.1111/1755-0998.12485) |
Reference genome | Required | Required | Not required |
Abbreviations: RNA-seq, RNA-sequencing; DEGs, differentially expressed genes; SNP, single nucleotide polymorphism; RIN, RNA integrity number.
One popular variation of RNA-seq is 3’ Tag RNA-seq, which uses oligo-dT priming for cDNA conversion, resulting in constructed libraries that are enriched near the 3’ end of polyadenylated messenger RNAs (mRNAs) [10]. Because the 3’ end of mRNA in mammals is more stable than other mRNA regions, 3’ Tag RNA-seq is less sensitive to RNA degradation [11]. In addition, 3’ Tag RNA-seq requires much fewer sequencing reads to identify differentially expressed genes (DEGs) between the samples, leading to substantial cost savings as well as providing low-noise gene expression profiles. However, 3’ Tag RNA-seq does have certain limitations. For instance, it is only suitable for eukaryotic total RNA samples due to the requirement of a poly-A tail. In addition, it is not suitable for identifying splice variant analysis and SNP discovery. Therefore, 3’ Tag RNA-seq can be the best sequencing option in case the primary goal of the analysis is to identify DEGs between eukaryotic samples that have a lower quantity and lower RIN.
The resulting sequenced reads generated from both standard RNA-seq and 3’ Tag RNA-seq are required reference genomes for mapping and assembling them to reveal the transcript (Table 2). In contrast, de novo transcript assembly involves the process of directly joining overlapping reads into longer contiguous sequences and don’t need reference genome for analysis. Therefore, de novo transcript assembly becomes a useful approach for analyzing cellular transcriptomes with an unavailable reference genome sequence. Moreover, de novo transcript assembly successfully generates transcripts even in cases where a reference-guided assembly may fail to reconstruct them correctly due to gaps, high fragmentation, or significant alterations in the genomic sequence, as is often the case in cancer cells [12]. Nonetheless, it is essential to acknowledge certain limitations associated with de novo transcript assembly. de novo transcript assembly requires a high amount of sequencing read counts, which causes a much higher cost. Moreover, most genomes contain lots of repetitive regions, which make it difficult to achieve high-quality transcript assembly and often cause errors leading to misarrangements in the assembly results [13].
Most widely studied species of latest reference genome assembly available in the UCSC Browser (http://genome.ucsc.edu.)
Taxonomic name | UCSC version | Genome assembly name | Release date |
---|---|---|---|
Homo sapiens | hs1 | T2T CHM13v2.0 | 24 Jan. 2022 |
Mus musculus | mm39 | GRCm39 | Jun. 2020 |
Rattus norvegicus | rn7 | mRatBN7.2 | Nov. 2020 |
Danio rerio | danRer11 | GRCz11 | May 2017 |
Drosophila melanogaster | dm6 | BDGP Release 6+ISO1 MT | Aug. 2014 |
Caenorhabditis elegans | ce11 | WBcel235 | Feb. 2013 |
Saccharomyces cerevisiae S288C | sacCer3 | R64 | Apr. 2011 |
Abbreviation: UCSC, University of California Santa Cruz.
In summary, in order to select a suitable RNA-seq method, several factors should be considered such as experimental objectives, RNA sample quality and availability of a reference transcriptome.
Exploratory data analysis refers to the approach of investigating and summarizing the main characteristics of the data sets to facilitate a better understanding of the data [14]. Several statistical data visualization methods, such as principal component analysis (PCA), heatmap, and volcano plot, are categorized under exploratory data analysis. With these analyses, we can quickly reveal the overall trends in the dataset, which can guide us in determining the most suitable analytical approach.
Since RNA-seq is involved in complex multiple steps, extreme deviation of intrasample called an outlier is often generated [15, 16]. Finding and removing outliers in RNA-seq datasets is a prerequisite for improving quality and preventing misinterpretation of the biological meaning derived from RNA-seq data. Given that data generated from RNA-seq is high-dimensional, considering a global overview of the data is necessary to determine intrasample outliers. In this regard, PCA is the most widely used way to distinguish outliers. PCA is a widely used exploratory data analysis to reduce the dimensionality of the dataset, which can allow for visualization of the dominant patterns of the data [17]. Principal components are linear combinations of the genes that collectively explain the variation across the samples [18]. As depicted in Figure 1, the PCA plot provides the overall similarity of the dataset. Through the PCA plot in Figure 1 (right), we can successfully identify an outlier that deviates significantly from the overall distribution pattern of the control group.
Heatmap is a graphical representation of the data using a color gradient for easier interpretation and visualization of RNA-seq data [19]. Genes with higher expression levels are typically colored red, while those with lower expression levels are colored blue, thus providing a simultaneous illustration of gene expression patterns within large datasets across all the samples. In most cases, rows representing each gene and columns representing each sample are reordered by certain clustering algorithms. This reordering ensures the data matrices with similar patterns are placed closely on the heatmap. In biology, a hierarchical clustering algorithm is the most widely applied algorithm for generating a heatmap [20, 21]. The hierarchical clustering algorithm is a type of unsupervised machine learning algorithm, which pairs objects based on the degree of similarity [22]. Therefore, a heatmap combined with a hierarchical clustering algorithm provides better visualization of patterns, relationships, and similarities across all the samples (Figure 2) [21, 23]. In ideal RNA-seq data, hierarchical clustering algorithms would pair the same groups together, as the similarity within the same groups is greater than the similarity between different groups.
A volcano plot is a type of scatter plot to explore the most interesting genes within large datasets. Typically, the x-axis of a volcano plot represents log2 of the fold change (FC), and the y-axis represents the –log10 of the adjusted P-value, which is called as “double filtering” criterion [24, 25]. As depicted in Figure 3 [26], interesting genes (DEGs) meet the two criteria: (1) |log2FC|>1 and (2) adjusted P-value<0.05. Double filtering criteria can be beneficial to exclude (1) genes with large expression differences that are caused by large variations in the dataset (outlier) and (2) genes that show statistical significance but have low expression differences, which could be false positives caused by batch effect or low expression level [24, 27]. In a volcano plot, the points representing each gene that is located in the upper-right and upper-left corners are considered the most promising findings, which can be candidate biomarkers and therapeutic targets. However, double filtering criteria itself relies on arbitrary cutoffs and there is no standard rule for setting up a cutoff threshold, which results in filtering out potentially valuable genes that are close to the cutoff threshold.
Therefore, researchers should be aware that while exploratory data analysis is useful for an initial overview, it does not provide sufficient information to derive meaningful conclusions.
With the advent of high-throughput technologies such as RNA-seq, the main hurdle has shifted from acquiring gene expression profiles to the correct interpretation of the transcriptomic data to gain insights into relevant biological meaning. Typically, RNA-seq data generates the gene list of hundreds to thousands of DEGs, making it nearly impossible to manually search the literature and interpret the biological nuances. A standard strategy to overcome this issue is a pathway enrichment analysis which identifies a smaller list of interpretable biological pathways from overwhelmingly large gene lists [28]. Biological pathways are achieved with statistical testing whether provided gene lists are enriched in particular pathways from a variety of databases (Table 3). Pathway enrichment analysis can be categorized into over representation analysis (ORA), functional class scoring (FCS), and topology-based pathway analysis (TPA) (Figure 4) [29, 30].
Popular public repositories commonly used in pathway enrichment analysis
Database | Website | TPA availability | Reference (DOI) |
---|---|---|---|
KEGG | https://www.kegg.jp/ | Yes | 10.1093/nar/28.1.27 |
GO | https://geneontology.org/ | No | 10.1002/pro.4218 |
REACTOME | https://reactome.org/ | Yes | 10.1093/nar/gkab1028 |
WikiPathways | https://www.wikipathways.org/ | Yes | 10.1093/nar/gkaa1024 |
MsigDB | https://www.gsea-msigdb.org/gsea/msigdb | No | 10.1016/j.cels.2015.12.004 |
ORA is the first generation of pathway enrichment analysis that explores the list of DEGs that are enriched in certain biological pathways from public repositories (Table 3). The first step of ORA is identifying DEGs from RNA-seq data using certain criteria such as false discovery rate (FDR) and/or FC of gene expression. The next step involves counting the number of selected DEGs within each pathway. This counting process is performed for each pathway individually. Subsequently, the statistical evaluation of each pathway is carried out using a Fisher’s exact test based on the hypergeometric distribution [28, 31]. Since hundreds of pathways (hypothesis) are statistically evaluated simultaneously, each statistical evaluation has a false positive error probability (Type I error) [32, 33]. Therefore, multiple-testing correction is required to correct this error. Benjamini-Hochberg FDR procedure is the most commonly applied method to correct P-value as the adjusted P-value (Q-value) [34]. However, there are still limitations to ORA. Since the selection of DEGs relies on arbitrary cutoffs, such as FDR and/or FC of gene expression, standardization can be challenging. Additionally, ORA is that it tends to identify DEGs associated with substantial expression changes by arbitrary cutoffs. This tendency can exclude sets of functionally related genes with milder expression changes, which could coordinately exert as much influence as a single gene with a large expression change. Moreover, once DEGs are chosen, ORA considers the entire list of genes for analysis. This approach results in each gene within the DEGs list having an equal impact on pathway enrichment, regardless of differences in their FDR and gene expression levels.
FCS is the second generation of pathway enrichment analysis to overcome the limitations of ORA. In contrast to ORA, FCS does not filter out with a particular threshold to isolate the list of DEGs. Instead, FCS calculates and assigns the gene score to each gene and analyzes pathway enrichment based on these assigned gene scores, ensuring that all genes are considered in the analysis. The most widely used FCS tool is gene set enrichment analysis (GSEA) [2, 35]. GSEA computes gene scores with several methods such as signal-to-noise ratio, t-test and gene expression between two phenotypes [36]. Then, it evaluates the distribution of a set of genes from each pathway of the database repository to assign an enrichment score through a weighted Kolmogorov-Smirnov-like statistic [2]. The statistical significance of the enrichment score is evaluated by an empirical phenotype-based permutation test for larger replicates or a gene set for smaller replicates (below 7 replicates) [36]. Lastly, the enrichment score is adjusted by multiple hypotheses testing to reduce a false positive finding. Since the FCS method uses all available information from RNA-seq data, it has a better resolution to detect the pathways associated with weak but coordinated gene sets. However, FCS also comes with certain limitations. FCS does not account for the relationships within the gene sets of the pathway, often referred to as the ‘gene independence assumption’ [37], which is far removed from the interconnected nature of biology. Similarly, another limitation of FCS is that it does not consider the interplay between pathways. Given that biological pathways are not isolated entities, and one pathway can affect the activity of others, the approach of FCS neglects actual biological processes.
To overcome the limitations of ORA and FCS, TPA was developed. Unlike the first two-generation analysis, TPA takes into consideration not only the lists of genes and gene ranks but also the integration of topological information from particular biological pathway repositories such as KEGG, REACTOME, or WikiPathway [38]. There are several publicly available TPA based packages such as SPIA and SEMgsa [39, 40]. The algorithm of TPA is fundamentally similar to that of FCS. However, the main difference lies in the fact that TPA considers topological features of the genes such as the position of genes within a pathway and its interaction with other genes [41]. Essentially, TPA computes a pathway-level perturbation using both expression and the topology of the pathway, which enables a better assessment of relevant pathway derived from the RNA-seq data [42-44]. Although TPA is the latest generation, this method also has its own limitations, which may be addressed in future methods. First, it is not feasible to consider activation and inactivation time and spatial distribution for the pathway components in the model. Second, the genuine pathway underlying the RNA-seq data can be different from the pathways of the database. Lastly, the limited database is available due to the cell and tissue specificity of the pathway.
As technology advances, RNA-seq has become the first choice for interpreting cellular transcriptomes both in research and clinical applications. To provide general considerations for choosing appropriate types of RNA-seq, we introduce 3’ Tag RNA-seq and de novo transcript assembly as well as standard RNA-seq in this review (Table 1). For example, if the primary objectives of the experiment are to isolate DEGs from a small amount or low-quality eukaryotic RNA samples, then 3’ Tag RNA-seq would be a better option compared to standard RNA-seq. Moreover, we also summarize three generations of pathway enrichment analysis, which has become one of the foremost tools of RNA-seq data. Basically, all the pathway enrichment analyses aim to simplify the complex transcriptomic profiling, identifying a smaller number of significantly enriched pathways. As described above, all three have their own advantages and limitations. It is important for researchers to understand that none of the existing approaches is flawless. Therefore, these tools should be used to generate more refined hypotheses for uncovering biological meanings, rather than making definitive conclusions.
차세대 염기서열 분석이 개발되고 널리 사용됨에 따라 RNA-시퀀싱(RNA-sequencing, RNA-seq)이 글로벌 전사체 프로파일링을 검증하기 위한 도구의 첫번째 선택으로 급부상하게 되었다. RNA-seq의 상당한 발전으로 다양한 유형의 RNA-seq가 생물정보학(bioinformatics) 발전과 함께 진화했으나, 다양한 RNA-seq 기법 및 생물정보학에 대한 전반적인 이해 없이는 RNA-seq의 복잡한 데이터를 해석하여 생물학적 의미를 도출하기는 어렵다. 이와 관련하여 본 리뷰에서는 RNA-seq의 두 가지 주요 섹션을 논의하고 있다. 첫째, Standard RNA-seq과 주요하게 자주 사용되는 두 가지 RNA-seq variant method를 비교하였다. 이 비교는 어떤 RNA-seq 방법이 연구 목적에 가장 적절한지에 대한 시사점을 제공한다. 둘째, 가장 널리 사용되는 RNA-seq에서 생성된 데이터 분석; (1) 탐색적 자료 분석 및 (2) enriched pathway 분석에 대해 논의하였다. 데이터 세트의 전반적인 추세를 제공할 수 있는 주 성분 분석, Heatmap 및 Volcano plot과 같이 RNA-seq에 대해 가장 널리 사용되는 탐색적 자료 분석을 소개하였다. Enriched pathway 분석 섹션에서는 3가지 세대의 enriched pathway 분석에 대해 소개하고 각 세대가 어떤 식으로 RNA-seq 데이터 세트로부터 enriched pathway를 도출하는지를 소개하였다.
None
None
None
Woo SH1, Graduate student; Jung BC2, Researcher.
-Writing - original draft: Woo SH, Jung BC.
-Writing - review & editing: Woo SH, Jung BC.
This article does not require IRB/IACUC approval because there are no human and animal participants.