Genomic transcription is tightly regulated, and assessing its expression can be a key to understanding pathological conditions. Currently, bulk RNA-sequencing (RNA-seq) is the most commonly used technique for analyzing transcriptomic profiling in a variety of fields, including neuroscience, diabetes, and oncology [1-4]. Although bulk RNA-seq is a powerful tool for estimating global transcriptomic profiling in the target samples, it can only provide an average of gene expression across entire populations within the samples. This becomes particularly problematic when dealing with heterogeneous samples such as blood and biopsies, as the uniqueness of each cell within the whole population is masked [5]. To overcome this issue, a technique for RNA-seq at the single-cell level, known as single-cell RNA-Seq (scRNA-seq), has been invented [6]. Since its first introduction, scRNA-seq has rapidly gained attention due to its ability to facilitate the deconvolution of cell types in heterogeneous samples. In particular, scRNA-seq allows us to identify rare populations that were unable to be addressed using traditional bulk RNA-seq [7]. Moreover, scRNA-seq can be used to trace the trajectories of distinct cell lineage in tissue development, cancer, and immune cells [8]. Currently, the most widely used and most common commercially available platform is a droplet-based microfluidics system by 10x Genomics [9]. This platform captures single cells into each droplet containing beads conjugated to the primers with both common and unique barcodes and enzymes for library preparation. Each droplet acts as an individual reaction chamber where cell lysis and library generation take place. Since each generated library has unique barcodes, each cell can be distinguished and analyzed at a single-cell level. However, the data generated from scRNA-seq is much more complex than bulk RNA-seq and there is not a universal standardization to analyze the dataset. For analyzing scRNA-seq, we need to use a bunch of bioinformatic tools (Table 1). In this article, we aim to provide general considerations when dealing with analyzing the scRNA-seq datasets. In addition, we focused on introducing relatively new and most widely used tools for scRNA-seq rather than the explaining complex algorithm and mathematics underlying each tool. For new researchers interested in this field, this paper can serve as a reference to build a workflow for scRNA-seq analysis that can be modified to their own research context.
The descriptions of the popular methods for scRNA-seq pipelines
Task | Tool (language) | Year | Description |
---|---|---|---|
General scRNA-seq | Seurat (R) | 2015∼2023 | A popular R package for the preprocessing and explorative downstream analysis of single-cell RNA sequencing data. Commonly, it is used with a variety of R-based package |
General scRNA-seq | Scanpy (Python) | 2018 | A popular Python package for the preprocessing and explorative downstream analysis of single-cell RNA sequencing data. Commonly, it is used with a variety of Python-based package |
Empty-drop identification | EmptyDrops (R) | 2019 | It estimates the background levels of RNA present in empty droplets, then identifies droplets containing cell that significantly deviate from the background |
Ambient RNA identification | DecontX (Python) | 2020 | It utilizes Bayesian method to estimate the percentage of contaminating transcripts from ambient RNA, then removing contaminated transcripts in each cell data |
Doublet identification | DoubletFinder (R) Scrublet (Python) | 2019 2019 | Generate artificial doublets using a nearest-neighbor algorithm, then identifies the doublets that are similar to artificial doublets |
Normalization | SCtransform (R) | 2019 | It utilizes regularized negative binomial regression, which represent normalized data value without affected by technical issues |
Visualization | t-SNE (R, Python) UMAP (R, Python) | 2008 2018 | Both are unsupervised non-linear dimensionality reduction method for visualization. UMAP has been rapidly overtaking t-SNE due to its superior ability to preserve large-scale structures |
Differential expression testing | ROTSvoom (R) D3E (Python) Limma-trend (R) Wilcoxon rank-sum (R, Python) | 2020 | They are famous packages for differential expression testing, which show good performance after prefiltering lowly expressed genes. Wilcoxon rank-sum test is most widely used option |
Pseudotime | Monocle3 (R) scTEP (R) | 2019 2023 | Monocle3 is the most popular package for pseudotime while scTEP is the most recently developed package, which may show better accuracy |
Abbreviations: scRNA-seq, single-cell RNA-sequencing; t-SNE, t-distributed stochastic neighbor embedding; UMAP, uniform manifold approximation and projection; scTEP, single-cell data trajectory inference method using ensemble pseudotime.
Unlike traditional bulk RNA-seq, droplet-based scRNA-seq can generate a variety of artifacts such as empty droplets and ambient RNA. In addition, we need to remove the data generated from damaged cells and non-single cells (i.e. duplet or multiplets) by computational tools. In the first section, we will discuss how to remove the data from low-quality cells. Then, we will discuss data normalization and visualization in low-dimensional space.
The first step of scRNA-seq data analysis is to ensure that each transcriptomic data corresponds to intact live cells. With a droplet-based microfluidics system, it is unavoidable to have empty drops as cells in the target sample are highly diluted to achieve a single cell in each droplet. In addition, ambient RNA released from damaged and dead cells in the microfluidics system can be encapsulated in empty drops. As a result, ambient RNA also can be amplified and have its own barcode, which can be incorrectly considered as data from real cells. Therefore, it is necessary to filter out the data obtained from ambient RNA and should not be included for further analysis. Initially, this was performed by removing all barcodes with lower transcripts [10]. Although this method is simple and straightforward, it can wrongly filter out the droplets containing small cells with low RNA amounts. Recently, several computational tools were developed including EmptyDrops to address this issue [11]. This method first estimates expression profiles of ambient RNA in empty droplets and then identifies cell-containing droplets that significantly differ from ambient RNA. Moreover, Yang et al [12] developed computational tools called DecontX, which are designed to remove transcripts from ambient RNA. DecontX used a Bayesian method to estimate the percentage of contaminating transcripts from ambient RNA. After estimation, DecontX can remove contaminated transcripts in each cell. With these approaches, researchers will have better cell type recovery as it does not filter out the data obtained from cells with low RNA content. Next, the data generated by the damaged cell should be filtered out. Since mitochondrial RNA (mtRNA) is more likely to be retained in damaged cells due to the mitochondrial membranes, a higher proportion of mtRNA is a widely used criterion to determine damaged cells [13]. The cutoff value for the acceptable proportion of mtRNA is varied depending on the tissue types, technical factors, etc [14]. In general, 5%∼10% of the threshold can be a starting point to adjust the cut-off value to distinguish normal and damaged cells [14, 15]. Another important aspect to take into account is the removal of multiplet artifacts, which represent two or more cells in a single droplet, from the dataset for further analysis. Typically, the frequency of multiplets is positively correlated with the concentration of the input cell. Although diluted cell suspension can reduce the frequency of multiplets, it is not feasible to dilute too much as it reduces cell recovery. In this regard, several computational tools such as DoubletFinder and Scrublet were proposed [16, 17]. The algorithms of the two tools are similar. The tools generate artificial doublets by combining two randomly selected droplets of gene expression data. Subsequently, doublet scores in real scRNA-seq data are calculated based on the similarity of artificial droplets using k-nearest neighbors algorithm [18]. One of the main differences between these tools is the hyperparameter setting such as the number of artificial droplets and the number of principal components to determine nearest neighbors [18]. Although these two tools have their own strength, the recent benchmarking result indicates DoubletFinder method shows the best doublet detection accuracy among the 9 cutting-edge computational tools [18]. An overview of the computational methods to isolate intact live cells is illustrated in Figure 1.
After cell-level quality control, data normalization is a prerequisite to correct for cell-to-cell differences due to technical variability such as capture efficiency, amplification biases, and sequencing depth (number of transcripts detected per cell) [19]. If data is not normalized properly, downstream analysis such as comparison of gene expression and clustering of subpopulations would be biased. One of the common normalized methods is library size (total counts) normalization [20]. This process involves dividing the read count of each cell by a size factor to normalize the library size across all cells. For example, let us assume that we have scRNA-seq data derived from mouse tissue. Then, we can suppose that cell A has 10 reads from the
In the data generated from scRNA-seq, the expression value of each gene represents a dimension. While high-dimensional data is more informative compared to traditional approaches that rely on low-throughput techniques, interpreting high-dimensional data can be much more difficult. Feature selection is the computational technique, which selects genes that efficiently describe the original dataset. In other words, it excludes genes that act as noise and redundancy. For example, if two genes (features) are perfectly correlated, selecting one feature can be enough to describe the original dataset and the other feature is no longer informative but can serve as a noise. Therefore, feature selection (excluding the irrelevant features) can help in not only reducing computational burden but also providing a better understanding of the data to improve downstream analysis [25]. To exclude irrelevant features, it is necessary to estimate the relevance of each feature to the dataset. A commonly used approach is selecting highly variable genes (HVGs) because it does not require any prior knowledge such as predefined cell marker genes [26]. Several tools, like Seurat, can calculate the mean expression and dispersion of genes to identify HVGs [27]. Typically, genes with the highest variance-to-mean ratio are selected first as HVGs, with the number ranging from 1,000 to 5,000 depending on the complexity of the dataset [28].
After selecting HVGs, the dimensions of the dataset can be further reduced by several dimensionality reduction methods for visualization in two- or three-dimensional space. For the data generated from scRNA-seq, t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are the most widely used dimensionality reduction techniques for two dimensions visualization (Figure 2) [29]. t-SNE is an unsupervised non-linear dimensionality reduction method that maps high-dimensional data points to two dimensions while preserving the local structures [30]. The algorithm of t-SNE is creating a probability distribution for the high-dimensional data that represents data (each cell) similarities a high-dimensional space. For example, similar cells are assigned a higher probability that the cell would choose another similar cell as its neighbor in high dimensions. Then, t-SNE defines a probability distribution that represents data similarities in the lower dimensions (usually two dimension) and try to minimize the Kullback–Leibler divergence (KL divergence) between the two distributions [31]. By minimizing KL divergence, the data in low dimensional space becomes similar to the original structure of the data (the data in high dimensional space) [32]. However, the disadvantages of t-SNE algorithms are (1) inaccuracies of the global structure, (2) slow computation time due to the complex calculations, and (3) computationally unfeasibility with large datasets, such as those with over 10,000 cells [33]. In this regard, UMAP has been introduced and has become preferable over t-SNE [34]. The underlying algorithm of UMAP is constructing a fuzzy topological structure that represents the likelihood of a connection between the cells in high dimensional space. Then it optimizes the low dimensional representation to make its fuzzy topological structure as similar as possible to the original dataset [35]. Recently, UMAP has been rapidly overtaking t-SNE due to its superior ability to preserve large-scale structures and its better scaling performance, which results in faster processing time [36]. However, a recent paper argues that an advanced t-SNE algorithm can perform at a similar speed and can preserve the global structure in a manner similar to UMAP [37]. Therefore, it is still controversial which algorithm is more suitable for visualizing the data from scRNA-seq.
After applying appropriate preprocessing of the scRNA-seq dataset, the processed dataset can provide a vast amount of information including inference of the cell types and the ordering of cells along a lineage based on the gene expression at the single cell level. In the following section, we provide information regarding the most popular and widely used downstream analyses.
A common analysis step after dimensionality reduction is unsupervised clustering to identify the groups of cells with similar characteristics based on the expression profiles. Although there are several tools were developed, Leiden clustering is recently developed and shows better performance, compared to classical methods such as Louvain clustering [38]. The Leiden algorithm is based on modularity and hierarchical clustering. Modularity is a measure of the strength of communities and shows dense connections between the nodes when communities have high modularity [39]. Hierarchical clustering is an unsupervised machine learning algorithm that pairs objects based on the similarity between the dataset [40]. After running the Leiden algorithm, the clustering can be visualized with t-SNE or UMAP. When running Leiden clustering for your dataset, it is important to note that resolution parameters used for computing the modularity can specify the number of communities (clusters). Figure 3, which was implemented by Scanpy (python based), shows the Leiden clustering visualized in t-SNE and UMAP representation at two different resolution settings [41]. The resolution parameters are generally optimized with cell type annotations by checking gene expression profiles in each cluster and the researcher needs to decide the degree of annotation detail. For example, a satisfactory cell label could be “natural killer (NK) cells,” but it can be further divided into “NK cells” and “activated NK cells” [42]. To identify cell types, researchers need to explore the scRNA dataset, which shows transcriptomic profiles of clusters, and manually inspect whether established marker genes that have been reported for a given cell type are expressed in certain clusters. For example, CD31‒, CD45‒, Ter119‒ and PDGFRα+ can be the cell maker for adipose precursor cells in stromal vascular fraction [2].
Differential expression testing can provide biological insights into differences between two experimental conditions, and it is very well-documented in bulk RNA-seq [1]. Briefly, differential expression testing can be conducted using three generational statistical tests: (1) over-representation analysis, (2) functional class scoring, and (3) topology-based pathway analysis. Through differential expression testing, researchers can identify important biological pathways from the gene list of differentially expressed genes or the total gene list assigned with gene scores (ranking). With the dataset from scRNA-seq, it is particularly useful to dissect cell-type-specific responses to perturbations such as chemical treatment, disease, or genetical modification. For example, imagine a scenario where we have identified immune cells in the whole blood from wild-type and genetically altered mice through scRNA-seq analysis. Then, we can select certain cell types, such as monocyte only, from the dataset and identify relevant biological pathways from the two genotypes of mice. This allows us to determine which cell types are most perturbed or mostly involved in phenotypic change by genetic alteration. However, differential expression testing for scRNA-seq is much more challenging, compared to bulk RNA-seq because scRNA-seq data typically has lower library sizes, high noise levels, and “dropout” events, where certain genes are detected in small populations but not detected in other populations (i.e. lots of the zero value of certain genes in certain populations) [43]. Moreover, given the clustering is already defined based on the gene profiling before differential expression testing in scRNA-seq analysis, clustering and differential expression testing are not independent which results in artificial false discoveries [44]. In this regard, Soneson and Robinson [45] compared several methods to find differential expression testing in the scRNA-seq dataset. They suggest that prefiltering of lowly expressed genes can be beneficial to type I error control and ROTSvoom, D3E, limma-trend, the t-test, and the Wilcoxon test performed well in terms of lower false discovery proportions. Indeed, the Wilcoxon rank-sum test is the most widely used statistical method for differential expression testing employed in recent scRNA-seq analysis [44]. Wilcoxon rank-sum test is the default option for differential expression testing in the single-cell toolkit, Seurat.
One of the most revolutionary uses of scRNA-seq is to assess cellular state transitions, which are characterized by cascades of gene expression changes [7]. For example, cascades of gene expression changes are well documented during the cell cycle and cell differentiation [46, 47]. Pseudotime analysis can assign each cell to a specific position based on its gene expression patterns, which provides an ordering of cells along a trajectory or lineage [48]. More than 70 pseudotime analysis tools have already been developed, but the most famous and widely used method is an R package, Monocle [49, 50]. Monocle3, the most recent version of Monocle, uses a principal graph algorithm and calculates the geodesic distance of cells from the user-selected root node in the trajectory as a hypothesized time course, pseudotime [51]. Recently, the single-cell data trajectory inference method using ensemble pseudotime inference (scTEP) is proposed by Zhang et al [52]. It takes advantage of the multiple clustering results and fine-tuning algorithm for improving pseudotime accuracy. They compared the scTEP with other pseudotime analysis methods using 41 real scRNA-seq data sets and showed better robustness and accuracy.
Since the introduction of scRNA-seq, numerous studies have been conducted mostly by specialized research groups with expertise in single-cell isolation and computational analysis. Recently, scRNA-seq has become more accessible to the broader research community including biomedical researchers and clinical researchers [8]. Especially, the field of oncology has been actively studied using scRNA-seq [53]. Since one of the major causative factors for cancer is genomic disruption, cancer cells likely have distinct transcriptome profiling from their normal counterparts [54]. Therefore, cancer cells would be shown in a distinct cluster whereas normal cells would be located in broader clusters in the UMAP of scRNA-seq [55]. Then, we could further analyze the cancer-specific cluster if there are certain mutations converting normal cells to cancer cells. In Figure 4, we can consider that cluster 8 is a cancer-specific cluster. If we use bulk RNA-seq, we may not find the mutation because cancer-specific cluster 8 is a relatively small population, which could be masked from other major populations. With scRNA-seq, we can specifically characterize the cancer cell and apply appropriate therapeutic options (Figure 4A). For example, mutations in epidermal growth factor receptor (EGFR) are almost exclusively found in lung adenocarcinomas and the drug for each mutation varies (e.g. EGFR exon 19 deletions and exon 21 L858R alterations-erlotinib, gefitinib, and afatinib; EGFR exon 20 insertions-osimertinib and poziotinib) [56, 57]. In addition, we may find specific biomarkers after differential expression testing. In dot plots, a gene I exclusively expressed in the cancer-specific cluster, which could be considered as a potential biomarker gene (Figure 4B). Furthermore, the construction of tumor microenvironment is available through scRNA-seq. Since tumor tissue has different microenvironments consisting of heterogeneous cell types such as tissue-resident cells, endothelial cells, and tumor-infiltrating immune cells, the population ratio of each cell type can be valuable information to understand characterization of the cancers [58]. In Figure 4C, we can compare cell populations between healthy and cancer patients to construct the tumor microenvironment. Among the 9 clusters, clusters 3 and 4 are increased whereas cluster 1 is decreased in cancer patients (Figure 4C). Therefore, comparison between these clusters can be interesting in understanding the mechanisms of disease and finding new therapeutic targets.
scRNA-seq technology has opened the avenue to investigate cellular heterogeneity of RNA transcripts at the single-cell level. The field of scRNA-seq analysis is rapidly expanding, with advanced platforms and computational tools emerging regularly. In this review, we provide an overview of analyzing the scRNA-seq datasets with the introduction of the most commonly used R packages and recently developed tools for preprocessing of the data and downstream analysis. However, researchers need to be aware that none of the analytical tools is flawless and the tools of scRNA-seq analysis are rapidly emerging to overcome current limitations. Therefore, researchers should expect there will be other improvements to the packages shortly although the computational tools described here remain valid over time.
RNA-시퀀싱은 표본에 대한 전사체 전체의 패턴을 제공하는 기법이다. 그러나 RNA-시퀀싱은 표본 내 전체 세포에 대한 평균 유전자 발현만 제공할 수 있으며, 표본 내의 이질성(heterogeneity)에 대한 정보는 제공하지 못한다. 단일 세포 RNA-시퀀싱 기술의 발전을 통해 우리는 표본의 단일 세포 수준에서 이질성과 유전자 발현의 동역학(dynamics)에 대한 이해를 할 수 있게 되었다. 예를 들어, 우리는 단일 세포 RNA-시퀀싱을 통해 복잡한 조직을 구성하는 다양한 세포 유형을 식별할 수 있으며, 특정 세포 유형의 유전자 발현 변화와 같은 정보를 알 수 있다. 단일 세포 RNA-시퀀싱은 처음 도입된 이후 많은 이들의 관심을 끌게 되었으며, 이를 활용하기 위한 대규모 생물정보학(bioinformatics) 도구가 개발되었다. 그러나 단일 세포 RNA-시퀀싱에서 생성된 빅데이터 분석에는 데이터 전처리에 대한 이해와 전처리 이후 다양한 분석 기술에 대한 이해가 필요하다. 본 종설에서는 단일 세포 RNA-시퀀싱 데이터분석과 관련된 작업과정의 개요를 제시한다. 먼저 데이터의 품질 관리, 정규화 및 차원 감소와 같은 데이터의 전 처리 과정에 대해 설명한다. 그 이후, 가장 일반적으로 사용되는 생물정보학 도구를 활용한 데이터의 후속 분석에 대해 설명한다. 본 종설은 이 분야에 관심이 있는 새로운 연구자를 위한 가이드라인을 제공하는 것을 목표로 한다.
None
None
None
Woo SH1, Graduate student; Jung BC2, Researcher.
-Writing - original draft: Woo SH, Jung BC.
-Writing - review & editing: Woo SH, Jung BC.
This article does not require IRB/IACUC approval because there are no human and animal participants.