Why and How to Normalize RNA-Seq Data

Introduction

RNA-seq normalization is essential for accurate RNA-seq data analysis. Various factors affect transcript quantification in RNA-seq data, such as sequencing depth, transcript length, and sample-to-sample and batch-to-batch variability (Conesa et al., 2016). Normalization methods exist to minimize the effect of these variables and ensure reliable transcriptomic data.

In this article, you’ll find an overview of why RNA-seq normalization is essential, and a breakdown of different RNA-seq normalization methods to help you master your next RNA-seq analysis.

What is normalization in RNA-seq data analysis?

RNA-seq normalization adjusts raw transcriptomic data to account for various technical factors that may mask actual biological effects and lead to incorrect conclusions.

Bioinformatic approaches can correct these factors, and multiple RNA-seq normalization methods exist for different datasets and comparisons (Abrams et al., 2019; Zhao et al., 2021).

Simplified user-friendly RNA-seq data analysis platforms now allow laboratory scientists with no coding experience to normalize and explore their own RNA-seq data with standardized input files (Akhmedov et al., 2020).

Why is RNA-seq normalization needed?

Sequencing technologies introduce technical variability (Conesa et al., 2016).

Therefore, raw transcriptomic data must be adjusted to account for these technical factors before researchers can compare gene expression within or between samples (Abrams et al., 2019).

Normalized gene expression units ensure comparable and consistent data for exploratory or differential expression analysis while limiting false positive or negative results.

The differences between the three normalization stages in RNA-seq data analysis

It is essential to choose the correct RNA-seq normalization method for your dataset. This choice will impact the sensitivity of your differential expression analysis (Bullard et al., 2010).

There are three main RNA-seq normalization stages you should consider:

1. Within sample

Within sample normalization is required to compare the expression of genes within an individual sample (Zhao et al., 2021). It can adjust data for two primary technical variables: transcript length and sequencing depth.

Longer genes often have more mapped reads than shorter genes at the same expression level (Mortazavi et al., 2008). Therefore, their expression level can only be accurately compared within a sample after normalization.

Furthermore, the number of sequencing reads per sample may vary. This can also be corrected by within sample normalization.

Within sample normalization is not sufficient to compare gene expression between samples. For this, between sample RNA-seq normalization methods are required.

2. Within a dataset (between samples)

Samples within a dataset can be simultaneously normalized as a complete set to adjust for different technical variations such as sequencing depth (Bolstad et al., 2003).

RNA-seq is a relative, not an absolute, measure of transcript abundance. This means that the transcript population as a whole affects relative levels of transcripts (Robinson and Oshlack, 2010).

This creates biases for gene expression analyses, and these are minimized by between sample RNA-seq normalization methods.
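
As a toy illustration of this composition effect, the Python sketch below compares each gene's share of the total counts in two hypothetical samples. The gene names and counts are made up for the example: genes A to C have identical raw counts in both samples, and only gene D differs.

```python
# Hypothetical raw counts for four genes in two samples (illustrative only).
# Genes A-C are unchanged; only gene D is much higher in sample 2.
sample_1 = {"A": 100, "B": 200, "C": 300, "D": 400}
sample_2 = {"A": 100, "B": 200, "C": 300, "D": 3400}


def proportions(counts):
    """Return each gene's share of the sample's total counts."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


p1 = proportions(sample_1)
p2 = proportions(sample_2)

# Print the share of each gene in both samples side by side.
for gene in sample_1:
    print(gene, round(p1[gene], 3), round(p2[gene], 3))
```

In this example, genes A to C appear four-fold lower in sample 2 once counts are expressed relative to the library total, even though only gene D changed.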

3. Across datasets

Researchers often integrate RNA-seq data from multiple independent studies. These datasets are usually sequenced at different times, with varying methods across multiple facilities, and contain other experimental variables (Leek et al., 2010).

This results in a batch effect.

The batch effect is often the greatest source of differential expression when data is combined. It can mask any true biological differences and lead to incorrect conclusions (Leek et al., 2010; Ritchie et al., 2015).

RNA-seq normalization across datasets can correct for known variables across batches, such as the sequencing center and date of sequencing, as well as unknown variables (Johnson, Li and Rabinovic, 2007; Leek et al., 2012; Ritchie et al., 2015).

How to normalize RNA-seq data within a sample

Here we provide an overview of normalization units and methods to help you choose the right one for your dataset.

CPM

Counts per million (CPM) is the number of raw reads mapped to a transcript, divided by the total number of mapped reads in your sample and multiplied by one million.

It normalizes RNA-seq data for sequencing depth but not gene length.

Therefore, although it is a within sample normalization approach, CPM is unsuitable for comparing the expression of different genes within a sample, because longer genes accumulate more reads than shorter genes at the same expression level.

Between sample comparisons can be made when CPM is used alongside ‘within a dataset’ normalization methods.
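
The CPM calculation described above can be sketched in a few lines of Python. The function name `cpm`, the gene names, and the counts are illustrative assumptions and do not come from any particular package.

```python
def cpm(counts):
    """Counts per million: raw count / total mapped reads * 1,000,000."""
    total = sum(counts.values())
    return {gene: n / total * 1_000_000 for gene, n in counts.items()}


# Hypothetical raw counts for one sample (gene -> mapped reads).
raw_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}

cpm_values = cpm(raw_counts)
print(cpm_values)
```

Because every gene is divided by the same library total, the CPM values of a sample always add up to one million, whatever the original sequencing depth.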

Quick Tip: With Omics Playground, normalizing your data is just a click away. Choose from Counts per Million (CPM), CPM + Quantile normalization, maxMedian, maxSum, or Reference methods during data upload — so you can focus on uncovering insights in your data!

FPKM/RPKM

FPKM (fragments per kilobase of transcript per million fragments mapped) for paired-end data and RPKM (reads per kilobase of transcript per million reads mapped) for single-end data correct for variations in library size and gene length (Mortazavi et al., 2008).

One issue with FPKM/RPKM units is that the expression of a gene in one sample can appear different from its expression in another sample, even when its true expression level is the same (Wagner, Kin and Lynch, 2012).

This is because the value depends on the relative abundance of a transcript among the population of sequenced transcripts, and the sum of all FPKM/RPKM values differs from sample to sample.

FPKM/RPKM units are best suited to comparing gene expression within a single sample (Zhao, Ye and Stanton, 2020).
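
A minimal Python sketch of the RPKM calculation follows. The function name `rpkm`, the gene lengths, and the counts are hypothetical values chosen for illustration. For FPKM, the same arithmetic would be applied to fragment counts rather than read counts.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to count / (length in kb) / (total mapped reads in millions).
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


# Hypothetical transcript lengths (base pairs) and raw counts for two samples.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 4000}
sample_1 = {"geneA": 100, "geneB": 200, "geneC": 700}
sample_2 = {"geneA": 100, "geneB": 700, "geneC": 200}

rpkm_1 = rpkm(sample_1, lengths_bp)
rpkm_2 = rpkm(sample_2, lengths_bp)

print(rpkm_1, sum(rpkm_1.values()))
print(rpkm_2, sum(rpkm_2.values()))
```

In this toy example the RPKM values sum to different totals in the two samples, so the same RPKM value corresponds to a different share of the transcript pool in each sample.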

TPM

Transcripts per million (TPM) represents the relative number of transcripts you would detect for a gene if you had sequenced one million full-length transcripts (Wagner, Kin and Lynch, 2012).

It is calculated by dividing the number of reads mapped to a transcript by the transcript length. This value is then divided by the sum of the length-normalized read counts of all transcripts in the sample. It is then multiplied by one million to allow easier further analyses (Wagner, Kin and Lynch, 2012; Zhao et al., 2021).

It normalizes RNA-seq data for sequencing depth and transcript length.

TPM and FPKM/RPKM are closely related. However, in contrast to FPKM/RPKM, the sum of all TPMs is the same in each sample, which limits the variation in values between samples.

TPM can be used for within sample comparisons but requires ‘within a dataset’ normalization for between sample comparisons (Zhao, Ye and Stanton, 2020).
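
The TPM steps described above can be sketched in Python as follows. The function name `tpm`, the gene lengths, and the counts are illustrative assumptions.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    # Step 1: reads per kilobase of transcript (length normalization).
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    # Step 2: divide by the sum of length-normalized values, then scale to one million.
    scale = sum(rate.values())
    return {g: rate[g] / scale * 1_000_000 for g in rate}


# Hypothetical transcript lengths (base pairs) and raw counts for two samples.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 4000}
sample_1 = {"geneA": 100, "geneB": 200, "geneC": 700}
sample_2 = {"geneA": 100, "geneB": 700, "geneC": 200}

tpm_1 = tpm(sample_1, lengths_bp)
tpm_2 = tpm(sample_2, lengths_bp)

print(tpm_1, sum(tpm_1.values()))
print(tpm_2, sum(tpm_2.values()))
```

Both samples sum to one million. In this example geneA has a different TPM in each sample (about 266,667 versus 200,000) because it makes up a different share of each sample's length-normalized reads.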

How to normalize RNA-seq data within a dataset (between samples)

Quantile

The quantile method aims to make the distribution of gene expression levels the same for each sample in a dataset (Bolstad et al., 2003).

It assumes that the global differences in distributions between samples are all due to technical variation. Any remaining differences are likely actual biological effects.

For each sample, genes are ranked based on their expression level. An average value is calculated across all samples for genes of the same rank. This average value then replaces the original value of all genes in that rank. These genes are then placed in their original order.

This method is implemented in our Omics Playground.
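
The ranking and averaging steps of quantile normalization can be sketched in plain Python. This is a simplified illustration, not the Omics Playground implementation: the function name `quantile_normalize` and the example values are assumptions, and tied values are not given any special treatment.

```python
def quantile_normalize(samples):
    """Quantile-normalize expression values.

    samples: dict of sample name -> list of values, with genes in the same order.
    """
    names = list(samples)
    n_genes = len(samples[names[0]])

    # Sort each sample, then average the values found at each rank.
    sorted_cols = [sorted(samples[name]) for name in names]
    rank_means = [
        sum(col[i] for col in sorted_cols) / len(names) for i in range(n_genes)
    ]

    normalized = {}
    for name in names:
        values = samples[name]
        # Gene indices ordered from lowest to highest expression in this sample.
        order = sorted(range(n_genes), key=lambda i: values[i])
        result = [0.0] * n_genes
        # Replace each value with the mean of its rank, keeping the gene order.
        for rank, gene_index in enumerate(order):
            result[gene_index] = rank_means[rank]
        normalized[name] = result
    return normalized


# Hypothetical expression values for four genes in three samples.
data = {
    "s1": [5, 2, 3, 4],
    "s2": [4, 1, 3, 2],
    "s3": [3, 4, 6, 8],
}

normalized = quantile_normalize(data)
print(normalized)
```

After normalization every sample contains the same set of values, and only the assignment of those values to genes differs between samples.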

TMM

TMM (trimmed mean of M-values) assumes that most genes are not differentially expressed between samples (Robinson and Oshlack, 2010).

If many genes are uniquely or highly expressed in one experimental condition, it will affect the accurate quantification of the remaining genes.

To adjust for this possibility, TMM calculates scaling factors to adjust library sizes for the normalization of samples within a dataset.

To do this, one sample is chosen as a reference sample. The fold changes and absolute expression levels of other samples within the dataset are then calculated relative to the reference sample (Robinson and Oshlack, 2010).

Next, the genes in the dataset are ‘trimmed’ using these two values to remove genes that are likely to be differentially expressed. The trimmed mean of the fold changes is then found for each sample.

Finally, read counts are scaled by this trimmed mean and the total count of their sample.
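
The steps above can be sketched in Python under some simplifying assumptions. This is not the published implementation: it uses an unweighted trimmed mean, whereas the method of Robinson and Oshlack (2010) weights genes by their precision. The function name `tmm_factor`, the trimming defaults (30% of fold changes and 5% of expression levels from each end), and the counts are chosen for illustration.

```python
import math


def tmm_factor(sample, reference, trim_m=0.30, trim_a=0.05):
    """Simplified (unweighted) TMM scaling factor of a sample against a reference."""
    n_s, n_r = sum(sample), sum(reference)
    genes = []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:  # log ratios are undefined for zero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m = math.log2(p_s / p_r)        # M: log fold change
            a = 0.5 * math.log2(p_s * p_r)  # A: average log expression level
            genes.append((m, a))

    n = len(genes)
    by_m = sorted(range(n), key=lambda i: genes[i][0])
    by_a = sorted(range(n), key=lambda i: genes[i][1])
    # Keep genes that survive trimming of both the M and the A values.
    keep = set(by_m[int(n * trim_m): n - int(n * trim_m)]) & set(
        by_a[int(n * trim_a): n - int(n * trim_a)]
    )
    if not keep:
        return 1.0  # assumption of this sketch: no adjustment if nothing is left

    mean_m = sum(genes[i][0] for i in keep) / len(keep)
    return 2 ** mean_m


# Hypothetical counts: nine unchanged genes, and one gene strongly up in the sample.
reference = [10, 20, 30, 40, 50, 60, 70, 80, 90, 550]
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 5050]

factor = tmm_factor(sample, reference)
effective_size = sum(sample) * factor  # library size adjusted by the TMM factor
normalized_cpm = [y / effective_size * 1e6 for y in sample]
print(factor, normalized_cpm[:3])
```

In this toy dataset the highly expressed gene is trimmed away, so the nine unchanged genes receive the same normalized values in the sample as they would in the reference.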

How to normalize RNA-seq data across datasets

Batch correction

Methods such as Limma and ComBat remove batch effects when the sources of variation, such as the date of sequencing, are known (Johnson, Li and Rabinovic, 2007; Ritchie et al., 2015).

Within dataset normalization should be applied before batch correction so that gene expression values are on the same scale between samples.

Limma and ComBat use empirical Bayes statistics to estimate the prior probability distributions from the data.

They work well even for small sample sizes because information is ‘borrowed’ across genes in each batch. This leads to more robust adjustments for the batch effect on each gene.

Empirical Bayes methods are based on two stages.

Firstly, the methods assume a model for the variance and/or mean of the data within batches.

Secondly, batches are then adjusted to meet assumed model specifications. This leaves gene expression variation that is more likely due to biological effects.
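
To make the idea of adjusting batches toward a common model concrete, here is a deliberately simplified Python sketch that centers each batch on the overall mean of one gene. This is plain per-batch mean centering, not ComBat or limma: it involves no empirical Bayes estimation and borrows no information across genes. The function name `center_batches` and the values are hypothetical, and the values are assumed to be log-scale expression that has already been normalized within the dataset.

```python
def center_batches(values, batches):
    """Shift each batch so that its mean matches the overall mean of the gene.

    values: expression of one gene across samples (log scale assumed).
    batches: batch label for each sample, in the same order as values.
    """
    overall_mean = sum(values) / len(values)

    # Stage 1: estimate a simple model per batch (here, only the batch mean).
    grouped = {}
    for v, b in zip(values, batches):
        grouped.setdefault(b, []).append(v)
    batch_means = {b: sum(vs) / len(vs) for b, vs in grouped.items()}

    # Stage 2: adjust every sample by removing its batch-specific offset.
    return [v - batch_means[b] + overall_mean for v, b in zip(values, batches)]


# Hypothetical log expression of one gene in two batches of three samples.
values = [5.0, 5.2, 4.8, 7.0, 7.2, 6.8]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]

corrected = center_batches(values, batches)
print([round(v, 2) for v in corrected])
```

Note that centering of this kind would also remove real biological differences if an experimental condition were confounded with batch, which is one reason why dedicated methods model the known covariates explicitly.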

After correction for known batch effects, surrogate variable analysis (sva) can be used to identify and estimate unknown sources of variation (Leek et al., 2012).

About the Author

Ivo Kwee

Ivo Kwee holds a BSc degree in Engineering Physics, an MEng in Applied Physics and a PhD in Medical Physics. He has over 16 years of experience in bioinformatics and is currently CTO and co-founder of BigOmics Analytics, where he contributes to the mission of creating the best self-service analytics platform that enables life scientists to analyze their omics data.

References

Abrams, Z.B. et al. (2019) ‘A protocol to evaluate RNA sequencing normalization methods’, BMC Bioinformatics, 20(24), pp. 1–7. Available at: https://doi.org/10.1186/s12859-019-3247-x.
Akhmedov, M. et al. (2020) ‘Omics Playground: A comprehensive self-service platform for visualization, analytics and exploration of Big Omics Data’, NAR Genomics and Bioinformatics, 2(1), p. lqz019. Available at: https://doi.org/10.1093/nargab/lqz019.
Bolstad, B.M. et al. (2003) ‘A comparison of normalization methods for high density oligonucleotide array data based on variance and bias’, Bioinformatics, 19(2), pp. 185–193. Available at: https://doi.org/10.1093/bioinformatics/19.2.185.
Bullard, J.H. et al. (2010) ‘Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments’, BMC Bioinformatics, 11(1), pp. 1–13. Available at: https://doi.org/10.1186/1471-2105-11-94.
Conesa, A. et al. (2016) ‘A survey of best practices for RNA-seq data analysis’, Genome Biology, 17(1), pp. 1–19. Available at: https://doi.org/10.1186/s13059-016-0881-8.
Johnson, W.E., Li, C. and Rabinovic, A. (2007) ‘Adjusting batch effects in microarray expression data using empirical Bayes methods’, Biostatistics, 8(1), pp. 118–127. Available at: https://doi.org/10.1093/biostatistics/kxj037.
Leek, J.T. et al. (2010) ‘Tackling the widespread and critical impact of batch effects in high-throughput data’, Nature Reviews Genetics, 11(10), pp. 733–739. Available at: https://doi.org/10.1038/nrg2825.
Leek, J.T. et al. (2012) ‘The SVA package for removing batch effects and other unwanted variation in high-throughput experiments’, Bioinformatics, 28(6), pp. 882–883. Available at: https://doi.org/10.1093/bioinformatics/bts034.
Mortazavi, A. et al. (2008) ‘Mapping and quantifying mammalian transcriptomes by RNA-Seq’, Nature Methods, 5(7), pp. 621–628. Available at: https://doi.org/10.1038/nmeth.1226.
Ritchie, M.E. et al. (2015) ‘Limma powers differential expression analyses for RNA-sequencing and microarray studies’, Nucleic Acids Research, 43(7), p. e47. Available at: https://doi.org/10.1093/nar/gkv007.
Robinson, M.D. and Oshlack, A. (2010) ‘A scaling normalization method for differential expression analysis of RNA-seq data’, Genome Biology, 11(3), pp. 1–9. Available at: https://doi.org/10.1186/gb-2010-11-3-r25.
Wagner, G.P., Kin, K. and Lynch, V.J. (2012) ‘Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples’, Theory in Biosciences, 131(4), pp. 281–285. Available at: https://doi.org/10.1007/s12064-012-0162-3.
Zhao, S., Ye, Z. and Stanton, R. (2020) ‘Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols’, RNA, 26(8), pp. 903–909. Available at: https://doi.org/10.1261/rna.074922.120.
Zhao, Y. et al. (2021) ‘TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository’, Journal of Translational Medicine, 19(1), pp. 1–15. Available at: https://doi.org/10.1186/s12967-021-02936-w.