RNA-seq normalization is essential for accurate RNA-seq data analysis. Various factors affect transcript quantification in RNA-seq data, such as sequencing depth, transcript length, and sample-to-sample and batch-to-batch variability (Conesa et al., 2016). Normalization methods exist to minimize these variables and ensure reliable transcriptomic data.
In this article, you’ll find an overview of why RNA-seq normalization is essential, and a breakdown of different RNA-seq normalization methods to help you master your next RNA-seq analysis.
RNA-seq normalization adjusts raw transcriptomic data to account for various technical factors that may mask actual biological effects and lead to incorrect conclusions.
Bioinformatic approaches can correct these factors, and multiple RNA-seq normalization methods exist for different datasets and comparisons (Abrams et al., 2019; Zhao et al., 2021).
Simplified user-friendly RNA-seq data analysis platforms now allow laboratory scientists with no coding experience to normalize and explore their own RNA-seq data with standardized input files (Akhmedov et al., 2020).
Sequencing technologies introduce technical variability (Conesa et al., 2016).
Therefore, raw transcriptomic data must be adjusted to account for these technical factors before researchers can compare gene expression within or between samples (Abrams et al., 2019).
Normalized gene expression units ensure comparable and consistent data for exploratory or differential expression analysis while limiting false positive or negative results.
It is essential to choose the correct RNA-seq normalization method for your dataset. This choice will impact the sensitivity of your differential expression analysis (Bullard et al., 2010).
There are three main RNA-seq normalization stages you should consider: within a sample, between samples within a dataset, and across datasets.
Within sample normalization is required to compare the expression of genes within an individual sample (Zhao et al., 2021). It can adjust data for two primary technical variables: transcript length and sequencing depth.
Longer genes often have more mapped reads than shorter genes at the same expression level (Mortazavi et al., 2008). Therefore, their expression level can only be accurately compared within a sample after normalization.
Furthermore, the number of sequencing reads per sample may vary. This can also be corrected by within sample normalization.
Within sample normalization is not sufficient to compare gene expression between samples. For this, between sample RNA-seq normalization methods are required.
Samples within a dataset can be simultaneously normalized as a complete set to adjust for different technical variations such as sequencing depth (Bolstad et al., 2003).
RNA-seq is a relative, not an absolute, measure of transcript abundance. This means that the transcript population as a whole affects relative levels of transcripts (Robinson and Oshlack, 2010).
This creates biases for gene expression analyses, and these are minimized by between sample RNA-seq normalization methods.
Researchers often integrate RNA-seq data from multiple independent studies. These datasets are usually sequenced at different times, with varying methods across multiple facilities, and contain other experimental variables (Leek et al., 2010).
This results in a batch effect.
The batch effect is often the greatest source of differential expression when data are combined. It can mask any true biological differences and lead to incorrect conclusions (Leek et al., 2010; Ritchie et al., 2015).
RNA-seq normalization across datasets can correct for known variables across batches, such as the sequencing center and date of sequencing, as well as unknown variables (Johnson et al., 2007; Leek et al., 2012; Ritchie et al., 2015).
Here we provide an overview of normalization units and methods to help you choose the right one for your dataset.
Counts per million (CPM) mapped reads are the number of raw reads mapped to a transcript, divided by the total number of mapped reads in your sample and multiplied by one million.
It normalizes RNA-seq data for sequencing depth but not gene length.
Therefore, although it is a within sample normalization approach, CPM normalization is unsuitable for within sample comparisons of gene expression.
Between sample comparisons can be made when CPM is used alongside ‘within a dataset’ normalization methods.
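The CPM calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any particular tool; the function name `cpm` and the toy counts are assumptions made up for this example.

```python
def cpm(counts):
    """Counts per million for one sample.

    counts: dict mapping gene name -> raw mapped read count.
    """
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("sample has no mapped reads")
    # Scale each gene's count by the library size, then by one million.
    return {gene: n * 1_000_000 / total_reads for gene, n in counts.items()}


# Hypothetical toy sample, not real data.
sample = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_cpm = cpm(sample)
print(sample_cpm)
```

Note that gene length never enters the calculation, which is why CPM values alone cannot be used to rank genes against each other within a sample.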
FPKM (fragments per kilobase of transcript per million fragments mapped) for paired-end data and RPKM (reads per kilobase of transcript per million reads mapped) for single-end data correct for variations in library size and gene length (Mortazavi et al., 2008).
One issue with FPKM/RPKM units is that the expression of a gene in one sample will appear different from its expression in another sample, even when its true expression level is the same (Wagner, Kin and Lynch, 2012).
This is because it depends on the relative abundance of a transcript among a population of sequenced transcripts.
FPKM/RPKM units best compare gene expression within a single sample (Zhao, Ye and Stanton, 2020).
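As a rough sketch, RPKM can be computed as shown below, and the same toy data illustrate why values are hard to compare between samples. The function name `rpkm`, the gene lengths, and the counts are hypothetical; for paired-end data you would count fragments instead of reads to obtain FPKM.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads, for one sample.

    counts: dict gene -> mapped read count.
    lengths_bp: dict gene -> transcript length in base pairs.
    """
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("sample has no mapped reads")
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    return {
        gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
        for gene in counts
    }


# Hypothetical toy data: two samples with the same library size
# but a different mix of transcripts.
lengths_bp = {"geneA": 1000, "geneB": 2000}
sample_1 = {"geneA": 100, "geneB": 900}
sample_2 = {"geneA": 500, "geneB": 500}

rpkm_1 = rpkm(sample_1, lengths_bp)
rpkm_2 = rpkm(sample_2, lengths_bp)

# The per-sample totals differ, so a given RPKM value does not
# represent the same share of the transcript pool in each sample.
print(sum(rpkm_1.values()), sum(rpkm_2.values()))
```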
Transcripts per million (TPM) represents the relative number of transcripts you would detect for a gene if you had sequenced one million full-length transcripts (Wagner, Kin and Lynch, 2012).
It is calculated by dividing the number of reads mapped to a transcript by the transcript length. This value is then divided by the sum of the length-normalized read counts across all transcripts. It is then multiplied by one million to make further analyses easier (Wagner, Kin and Lynch, 2012; Zhao et al., 2021).
It normalizes RNA-seq data for sequencing depth and transcript length.
TPM and FPKM/RPKM are closely related. However, in contrast to FPKM/RPKM, there is limited variation in values between samples because the sum of all TPMs in each sample is the same.
TPM can be used for within sample comparisons but requires ‘within a dataset’ normalization for between sample comparisons (Zhao, Ye and Stanton, 2020).
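The TPM steps above (length-normalize first, then scale by the sum of the length-normalized values) can be sketched as follows. The function name `tpm` and the toy inputs are assumptions for illustration only.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample.

    counts: dict gene -> mapped read count.
    lengths_bp: dict gene -> transcript length in base pairs.
    """
    # Step 1: reads per kilobase of transcript.
    per_kb = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total_per_kb = sum(per_kb.values())
    if total_per_kb == 0:
        raise ValueError("sample has no mapped reads")
    # Step 2: scale so that the values in the sample sum to one million.
    return {gene: value * 1_000_000 / total_per_kb for gene, value in per_kb.items()}


# Hypothetical toy data: same library size, different transcript mix.
lengths_bp = {"geneA": 1000, "geneB": 2000}
tpm_1 = tpm({"geneA": 100, "geneB": 900}, lengths_bp)
tpm_2 = tpm({"geneA": 500, "geneB": 500}, lengths_bp)

# Both samples sum to one million, whatever their composition.
print(sum(tpm_1.values()), sum(tpm_2.values()))
```

Because the length normalization happens before the depth scaling, every sample ends up with the same total, which is the property that distinguishes TPM from FPKM/RPKM.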
The quantile method aims to make the distribution of gene expression levels the same for each sample in a dataset (Bolstad et al., 2003).
It assumes that the global differences in distributions between samples are all due to technical variation. Any remaining differences are likely actual biological effects.
For each sample, genes are ranked based on their expression level. An average value is calculated across all samples for genes of the same rank. This average value then replaces the original value of all genes in that rank. These genes are then placed in their original order.
This method is implemented in our Omics Playground.
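The rank, average, and replace procedure described above can be sketched in plain Python. This is a simplified illustration and not the Omics Playground implementation; the function name `quantile_normalize` and the toy values are assumptions, and tied values are handled naively (they keep their input order instead of sharing an averaged rank).

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples.

    samples: list of equal-length lists of expression values,
    one list per sample, with genes in the same order in each.
    """
    n_genes = len(samples[0])
    if any(len(s) != n_genes for s in samples):
        raise ValueError("all samples must contain the same genes")

    # For each sample, gene indices ordered from lowest to highest expression.
    orders = [sorted(range(n_genes), key=s.__getitem__) for s in samples]

    # Average, across samples, of the values that share the same rank.
    rank_means = [
        sum(s[order[rank]] for s, order in zip(samples, orders)) / len(samples)
        for rank in range(n_genes)
    ]

    # Replace each value with the mean for its rank, keeping the gene order.
    normalized = []
    for order in orders:
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: three samples, four genes.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
for row in normalized:
    print(row)
```

After normalization, every sample contains exactly the same set of values; only the assignment of values to genes differs between samples.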
TMM (trimmed mean of M-values) also assumes that most genes are not differentially expressed between samples (Robinson and Oshlack, 2010).
If many genes are uniquely or highly expressed in one experimental condition, it will affect the accurate quantification of the remaining genes.
To adjust for this possibility, TMM calculates scaling factors to adjust library sizes for the normalization of samples within a dataset.
To do this, one sample is chosen as a reference sample. The fold changes and absolute expression levels of other samples within the dataset are then calculated relative to the reference sample (Robinson and Oshlack, 2010).
Next, the genes in the data set are ‘trimmed’ to remove differentially expressed genes using these two values. The trimmed mean of the fold changes is then found for each sample.
Finally, read counts are scaled by this trimmed mean and the total count of their sample.
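To make these steps more concrete, here is a deliberately simplified sketch of a TMM-style scaling factor. It is not the edgeR implementation: the published method uses a precision-weighted trimmed mean and further adjustments, whereas this sketch uses an unweighted mean. The function name `tmm_factor`, the trimming fractions as written here, and the toy counts are assumptions for illustration.

```python
import math


def tmm_factor(sample, reference, trim_m=0.3, trim_a=0.05):
    """Simplified (unweighted) TMM scaling factor of `sample` against `reference`.

    sample, reference: lists of raw counts with genes in the same order.
    trim_m: fraction trimmed from each tail of the log fold changes (M).
    trim_a: fraction trimmed from each tail of the mean log expression (A).
    """
    n_sample, n_reference = sum(sample), sum(reference)
    stats = []
    for y_s, y_r in zip(sample, reference):
        if y_s == 0 or y_r == 0:
            continue  # log ratios are undefined for zero counts
        p_s, p_r = y_s / n_sample, y_r / n_reference
        m = math.log2(p_s / p_r)         # log fold change
        a = 0.5 * math.log2(p_s * p_r)   # average log expression
        stats.append((m, a))

    n = len(stats)
    by_m = sorted(range(n), key=lambda i: stats[i][0])
    by_a = sorted(range(n), key=lambda i: stats[i][1])
    cut_m, cut_a = int(n * trim_m), int(n * trim_a)
    # Keep only genes that survive both trimming steps.
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    if not keep:
        raise ValueError("no genes left after trimming")

    mean_m = sum(stats[i][0] for i in keep) / len(keep)
    return 2 ** mean_m


# Hypothetical toy data: nine genes are unchanged, one gene is
# strongly induced in the sample and inflates its library size.
reference = [100] * 10
sample = [100] * 9 + [1100]

factor = tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor

# Scale by the effective library size instead of the raw total.
naive_cpm = [y * 1_000_000 / sum(sample) for y in sample]
tmm_cpm = [y * 1_000_000 / effective_library_size for y in sample]
reference_cpm = [y * 1_000_000 / sum(reference) for y in reference]
print(factor, naive_cpm[0], tmm_cpm[0], reference_cpm[0])
```

In this toy case, plain CPM halves the apparent expression of the nine unchanged genes because one induced gene dominates the library, whereas scaling by the effective library size brings them back in line with the reference.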
Methods such as Limma and ComBat remove batch effects when the sources of variation, such as the date of sequencing, are known (Johnson, Li and Rabinovic, 2007; Ritchie et al., 2015).
Within dataset normalization should be applied before batch correction so that gene expression values are on the same scale between samples.
Limma and ComBat use empirical Bayes statistics to estimate the prior probability distributions from the data.
They work well even for small sample sizes because information is ‘borrowed’ across genes in each batch. This leads to more robust adjustments for the batch effect on each gene.
Empirical Bayes methods are based on two stages.
Firstly, the methods assume a model for the variance and/or mean of the data within batches.
Secondly, batches are then adjusted to meet assumed model specifications. This leaves gene expression variation that is more likely due to biological effects.
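The two stages can be illustrated with a much simpler stand-in: assume each batch shifts the mean of a gene, then remove that shift. The sketch below is not limma or ComBat, and it leaves out the empirical Bayes step that borrows information across genes, as well as any modeling of biological covariates. The function name `center_batches` and the toy values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Remove per-batch mean shifts for a single gene.

    values: expression values (for example log-scale) for one gene, one per sample.
    batches: batch label for each sample, in the same order.
    """
    overall_mean = mean(values)
    # Stage 1: model the data as having a batch-specific mean.
    batch_means = {
        label: mean([v for v, b in zip(values, batches) if b == label])
        for label in set(batches)
    }
    # Stage 2: shift each batch so that its mean matches the overall mean.
    return [v - batch_means[b] + overall_mean for v, b in zip(values, batches)]


# Hypothetical toy data: batch "B" reads about two units higher than batch "A".
values = [5.0, 6.0, 7.0, 8.0]
batches = ["A", "A", "B", "B"]
adjusted = center_batches(values, batches)
print(adjusted)
```

With only a handful of samples per batch, per-gene estimates like these are noisy, which is the motivation for the empirical Bayes approach of sharing information across genes.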
After correction for known batch effects, surrogate variable analysis (sva) can be used to identify and estimate unknown sources of variation (Leek et al., 2012).
These different approaches to batch correction can be performed visually in Omics Playground, where you can immediately see the impact of each normalization method on your RNA-seq data.