Beyond Basic Transcriptomics Data Analysis: How to Extract the Most Information from Your Data

7 approaches for your transcriptomics data analysis to get the most out of your experiments.

Published on February 27th, 2024
Written by Axel Martinelli, PhD

Introduction

Most efforts to interpret new transcriptomics datasets focus on clustering and the identification of differentially expressed genes, coupled with some form of gene set or pathway enrichment analysis.

While this was the standard analysis a decade ago, and still is for less-studied species, there is now a wealth of resources and databases for human and mouse/rat datasets that enable additional analysis approaches.

In this post, we will highlight some of these approaches and databases and explain their usage in research. 

1. Gene Network Analysis (GNA): PCSF and WGCNA

Gene Network Analysis has become one of the most popular downstream analysis types for transcriptomics data in the past decade. It involves studying the interactions and relationships between genes within a biological system. Basically, it’s like creating a map of connections between different genes to understand how they work together.

A simple way to explain Gene Network Analysis is to imagine genes as individual components in a complex machine (the cell or organism). It aims to identify which genes “talk” to each other, influence each other’s activity, or work together to perform specific functions. It helps scientists unravel the intricate web of gene interactions and how they contribute to various biological processes.

The interaction strength between genes is often quantified using gene-to-gene correlation. Two common frameworks in gene network analysis are Prize-collecting Steiner Forest (PCSF) and Weighted Gene Co-expression Network Analysis (WGCNA).
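As a minimal sketch of how gene-to-gene correlation can be turned into network edges, the example below computes a Pearson correlation matrix with NumPy on a toy expression matrix. The gene names, expression values and the 0.9 cut-off are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy expression matrix: rows are genes, columns are samples.
genes = ["GENE_A", "GENE_B", "GENE_C"]
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # GENE_A rises across samples
    [2.1, 3.9, 6.2, 8.0, 9.8],   # GENE_B tracks GENE_A
    [5.0, 4.1, 2.9, 2.2, 1.0],   # GENE_C falls as GENE_A rises
])

# np.corrcoef treats each row as one variable, giving a gene-by-gene matrix.
corr = np.corrcoef(expr)

# Keep gene pairs whose absolute correlation reaches an arbitrary cut-off.
threshold = 0.9
edges = [
    (genes[i], genes[j], round(float(corr[i, j]), 3))
    for i in range(len(genes))
    for j in range(i + 1, len(genes))
    if abs(corr[i, j]) >= threshold
]

for gene_1, gene_2, r in edges:
    print(f"{gene_1} - {gene_2}: r = {r}")
```

Both strongly positive and strongly negative correlations are kept as edges here; whether the sign should matter depends on the biological question.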

PCSF (Akhmedov et al, 2017)  is a method used to obtain subnetworks in large biological networks, where each subnetwork represents a specific biological process under study. It focuses on identifying key genes that act as “hubs” in the network, connecting various biological pathways. 

By applying PCSF to protein-protein interaction data, researchers can, for example, identify critical genes that serve as central players in cancer-related pathways.

Figure 1. A typical PCSF plot produced with the Omics Playground platform, with node and edge sizes proportional to the importance of a gene (node) or the strength of a connection between genes (edge). You can try it yourself by creating a free account here.

WGCNA (Langfelder and Horvath, 2008) is a method that groups genes into co-expression modules based on the similarity of their expression patterns across samples. It assigns a “weight” to the strength of co-expression relationships, helping identify modules of functionally related genes.
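The "weight" idea can be sketched as follows: in an unsigned weighted network, the absolute correlation between two genes is raised to a soft-thresholding power, which suppresses weak correlations while keeping strong ones. This is only an illustration in Python of that single step, not the WGCNA R package itself. The simulated data and the power of 6 are assumptions, and module detection (hierarchical clustering on a topological overlap measure) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6 genes x 20 samples, built as two groups of three
# co-expressed genes, each group following its own hidden "driver" signal.
n_samples = 20
driver_1 = rng.normal(size=n_samples)
driver_2 = rng.normal(size=n_samples)
expr = np.vstack(
    [driver_1 + 0.1 * rng.normal(size=n_samples) for _ in range(3)]
    + [driver_2 + 0.1 * rng.normal(size=n_samples) for _ in range(3)]
)

beta = 6  # assumed soft-thresholding power, for illustration only

# Unsigned weighted adjacency: |correlation| raised to the power beta.
corr = np.corrcoef(expr)
adjacency = np.abs(corr) ** beta
np.fill_diagonal(adjacency, 0.0)  # ignore self-connections

# Connectivity of each gene: the sum of its connection weights.
connectivity = adjacency.sum(axis=1)
```

Genes within the same simulated group end up with weights close to 1, while weights between groups shrink towards 0.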

A typical use of WGCNA could be the identification of modules of co-expressed genes related to processes such as neurodevelopment or synaptic function. WGCNA results are typically represented as a series of plots, as shown in Figure 2. 

You can read more about WGCNA analysis and how to perform it using Omics Playground in our dedicated blog post.

Figure 2. Example of some of the plots produced during WGCNA analysis (adapted from Langfelder and Horvath, 2008).

2. Word Clouds

Word clouds, although initially designed for text data, can be creatively adapted for visualizing transcriptomics data in a way that conveys the relative importance of genes or terms. A few examples include:

Gene Expression Word Clouds

Instead of words, each “word” in the cloud represents a gene, and the size of the word corresponds to the gene’s expression level. Genes with higher expression levels will have larger font sizes, making them visually prominent.
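A minimal sketch of the size mapping is shown below. It assumes a dictionary of mean expression values per gene (the values are hypothetical) and rescales them on a log scale to an arbitrary font-size range. The function name and font limits are illustrative choices, not part of any particular word-cloud tool.

```python
import math

# Hypothetical mean expression values (e.g. normalised counts) per gene.
mean_expression = {"ACTB": 5200.0, "GAPDH": 4100.0, "TP53": 310.0, "MYC": 95.0}

MIN_FONT, MAX_FONT = 10, 48  # arbitrary point sizes


def font_sizes(values, min_font=MIN_FONT, max_font=MAX_FONT):
    """Scale log2(value + 1) linearly onto the [min_font, max_font] range."""
    logged = {gene: math.log2(v + 1.0) for gene, v in values.items()}
    low, high = min(logged.values()), max(logged.values())
    span = (high - low) or 1.0  # avoid division by zero if all values are equal
    return {
        gene: round(min_font + (x - low) / span * (max_font - min_font))
        for gene, x in logged.items()
    }


sizes = font_sizes(mean_expression)
```

The log transform keeps a handful of very highly expressed genes from dwarfing everything else in the cloud.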

Functional Annotation Word Clouds

Each word represents a gene function or biological term, and the size of the word indicates the prevalence or significance of that term in the dataset. This is based on the number of significant genes associated with the term or their statistical significance (Figure 3).

Figure 3. Word-cloud visualization of the functional annotation of age co-expressed genes in two tissues (A) adipose and (B) heart (adapted from Yang et al, 2017).

Pathway Enrichment Word Clouds

Words in the cloud represent biological pathways, and the size indicates the significance of pathway enrichment. Pathways with more genes from the dataset or higher statistical significance have larger fonts.
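One common way to obtain such a significance value is an over-representation test based on the hypergeometric distribution. The sketch below uses only the Python standard library. The pathway names, gene counts and background size are hypothetical, and -log10(p) is used as one possible word size.

```python
from math import comb, log10


def hypergeom_pvalue(overlap, pathway_size, n_selected, n_background):
    """P(X >= overlap) when n_selected genes are drawn from n_background genes,
    of which pathway_size belong to the pathway."""
    upper = min(pathway_size, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(pathway_size, k) * comb(n_background - pathway_size, n_selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


n_background = 1000  # hypothetical number of measured genes
n_selected = 50      # hypothetical number of significant genes

# Hypothetical pathways: (genes overlapping the significant list, pathway size).
pathways = {"pathway_A": (10, 40), "pathway_B": (3, 40), "pathway_C": (2, 100)}

p_values = {
    name: hypergeom_pvalue(k, size, n_selected, n_background)
    for name, (k, size) in pathways.items()
}

# A larger -log10(p) means stronger enrichment and hence a larger word.
scores = {name: -log10(p) for name, p in p_values.items()}
```

In a real analysis the p-values would also be corrected for multiple testing before being used for display.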

3. Biomarker Discovery

Biomarker discovery using transcriptomics data involves identifying specific genes or gene expression patterns that are associated with a particular condition (such as a disease or treatment response). There are both simple and sophisticated approaches for this purpose, as nicely summarized in a recent review (Ng et al, 2023).

At the simplest level, the analysis starts with the identification of genes that show significant differences in expression between the condition and control groups using statistical tests (e.g., t-tests, ANOVA). Feature selection algorithms within machine learning models can then be employed to automatically identify the most relevant genes. Methods like recursive feature elimination (RFE) or LASSO (Least Absolute Shrinkage and Selection Operator) can be effective.
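The first step can be sketched as a per-gene Welch t-statistic computed with NumPy on simulated data. The data, group sizes and the spiked-in shift are hypothetical, and only the test statistic is computed. Obtaining p-values would additionally require the t-distribution (for example from a statistics library) and multiple-testing correction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log-expression data: 50 genes, 10 control and 10 condition samples.
n_genes, n_per_group = 50, 10
control = rng.normal(loc=5.0, scale=1.0, size=(n_genes, n_per_group))
condition = rng.normal(loc=5.0, scale=1.0, size=(n_genes, n_per_group))
condition[:3] += 6.0  # spike a strong shift into the first three genes


def welch_t(a, b):
    """Row-wise Welch t-statistic for two genes-by-samples matrices."""
    mean_diff = b.mean(axis=1) - a.mean(axis=1)
    se = np.sqrt(
        a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1]
    )
    return mean_diff / se


t_stats = welch_t(control, condition)

# Rank genes by the absolute size of the statistic and keep the top three.
top_genes = np.argsort(-np.abs(t_stats))[:3]
```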

On a more sophisticated level, gene expression profiles from individuals with the condition of interest and from controls are used to train a machine learning model. The model learns to recognize patterns or features that distinguish between the two groups, allowing it to predict the presence or absence of the condition based on gene expression data. Features (genes) that contribute most to the model’s predictive performance are considered potential biomarkers. This approach not only automates the selection of relevant biomarkers but also provides a computational framework for understanding complex relationships within the transcriptomic data and leveraging them for diagnostic or prognostic purposes. However, one also needs to be aware of the limitations of such approaches and avoid over-interpreting the results without further verification.
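To illustrate how a trained model can point to candidate biomarkers, the sketch below fits an L1-penalised (LASSO-style) logistic regression with a simple proximal gradient loop in NumPy. The simulated data, penalty, step size and iteration count are assumptions chosen for illustration. In practice an established machine learning library and proper cross-validation would normally be used.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 60 samples x 30 genes, half controls (0), half cases (1).
n_samples, n_genes = 60, 30
labels = np.repeat([0.0, 1.0], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))
X[labels == 1.0, 0] += 3.0  # gene 0 is higher in cases
X[labels == 1.0, 1] -= 3.0  # gene 1 is lower in cases

# Standardise each gene so that coefficients are comparable.
X = (X - X.mean(axis=0)) / X.std(axis=0)


def soft_threshold(w, t):
    """Shrink coefficients towards zero; small ones become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1_logistic(X, y, penalty=0.1, step=0.1, n_iter=500):
    """Proximal gradient descent for L1-penalised logistic regression."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w = soft_threshold(w - step * grad_w, step * penalty)
        b -= step * grad_b
    return w, b


weights, intercept = fit_l1_logistic(X, labels)

# Genes with non-zero coefficients are the ones the model retained.
selected = np.flatnonzero(weights)
ranked = np.argsort(-np.abs(weights))
```

Genes with the largest absolute coefficients are the candidate biomarkers in this toy setting. As noted above, such a ranking still needs independent verification.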

Transcriptional analysis highlights clinical relevance of NDR readout - Le Compte, M., De La Hoz, E.C., Peeters, S. et al- 2023
Figure 4. Example of variable importance plot produced using Omics Playground (adapted from Le Compte, M., De La Hoz, E.C., Peeters, S. et al., 2023)

Various computational tools are available to facilitate biomarker discovery. Platforms like Omics Playground offer an integrated environment where users can apply these techniques seamlessly, combining statistical testing, machine learning, and interactive visualizations. 

Master Biomarkers Analysis with Omics Playground: A Step-by-Step Tutorial provides insights into how to leverage the platform for your biomarker analysis.

4. Drug Connectivity Map

The Drug Connectivity Map (Drug cMap) is a resource and tool in the field of pharmacogenomics and drug discovery. It was developed by the Broad Institute (Lamb et al, 2006), and it provides a large-scale collection of gene expression profiles in response to more than 5000 compounds, for a total of more than a million gene expression profiles. 

The primary goal of the Drug cMap is to help researchers identify connections between drugs, genes, and diseases, facilitating the discovery of new therapeutic targets and repurposing existing drugs for different indications.
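The underlying idea can be sketched by comparing a query signature with drug-induced signatures: a strongly positive score suggests the drug mimics the query, and a strongly negative score suggests it reverses it. The example below uses cosine similarity as a simplified stand-in and is not the scoring scheme actually used by the Connectivity Map. All gene labels, drug names and values are hypothetical.

```python
import numpy as np

genes = ["G1", "G2", "G3", "G4", "G5", "G6"]

# Hypothetical query signature (log fold changes, disease vs control).
query = np.array([2.0, 1.5, 1.0, -1.0, -1.5, -2.0])

# Hypothetical drug signatures (log fold changes, treated vs vehicle).
drug_signatures = {
    "drug_mimic": np.array([1.8, 1.2, 0.9, -0.8, -1.7, -2.1]),
    "drug_reverser": np.array([-2.2, -1.4, -0.7, 1.1, 1.3, 1.9]),
    "drug_unrelated": np.array([0.5, -1.0, 0.8, 0.9, -0.6, 0.4]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two signatures, in the range [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


scores = {
    name: cosine_similarity(query, signature)
    for name, signature in drug_signatures.items()
}

# Sort from most negative (candidate reversers) to most positive (mimics).
ranked = sorted(scores, key=scores.get)
```

For drug repurposing, the compounds at the negative end of such a ranking are usually the ones of interest, since they may counteract the disease signature.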

Similar to the Drug cMap, there are other drug-related databases, including the Cancer Therapeutics Response Portal (CTRP) database and the Genomics of Drug Sensitivity in Cancer (GDSC) project. 

The CTRP is a resource developed by the Broad Institute to link genetic (including gene expression), lineage, and other cellular features of cancer cell lines to small-molecule sensitivity (Basu et al, 2013). It provides open access to the results obtained through quantitatively measuring the sensitivity of genetically characterized cancer cell lines to a set of small-molecule probes and drugs.

The GDSC project is a collaboration between the Cancer Genome Project at the Wellcome Sanger Institute and the Center for Molecular Therapeutics at Massachusetts General Hospital (Yang et al, 2013). It involves the characterization of over 1000 human cancer cell lines and screening them with hundreds of compounds to provide drug response data and genomic (including transcriptomic) information. At the time of its publication, its database was described as the largest public resource for information on drug sensitivity in cancer cells and molecular markers of drug response, containing data for almost 75,000 experiments across almost 700 cancer cell lines.

These databases are usually queried through software tools such as Enrichr (Chen et al, 2013), and many websites now provide easy access to them, including the Enrichr website from the Ma’ayan lab (see Figure 5 for an example output), iLINCS and SigCom LINCS. One drawback of these tools is that they focus on the analysis of individual experiments, which can produce contradictory results that are difficult to interpret.

For this reason, an R package, metaLINCS (Kwee et al, 2022), has recently been developed to allow meta-analysis across multiple experiments rather than single ones. Those without coding skills can also find it conveniently included in the Omics Playground platform.
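The motivation for looking across experiments can be illustrated with a very simple aggregation: for each drug, average the connectivity scores from all experiments and record how consistently they agree in sign. This is a hypothetical sketch of the general idea only and does not reproduce the method implemented in metaLINCS. The drug names and scores are invented.

```python
from statistics import mean

# Hypothetical connectivity scores for one query, one score per experiment
# (e.g. different cell lines, doses or time points for the same drug).
experiment_scores = {
    "drug_A": [-0.62, -0.48, -0.55, 0.10],
    "drug_B": [0.70, -0.65, 0.05, -0.02],
    "drug_C": [0.35, 0.41, 0.29],
}


def summarise(scores):
    """Mean score plus the fraction of experiments agreeing with its sign."""
    avg = mean(scores)
    same_sign = sum(1 for s in scores if s * avg > 0)
    return {"mean": avg, "consistency": same_sign / len(scores), "n": len(scores)}


summary = {drug: summarise(scores) for drug, scores in experiment_scores.items()}
```

In this toy example drug_B has one strongly positive and one strongly negative experiment. Looked at individually they would contradict each other, whereas the summary shows a mean close to zero and low consistency.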

Figure 5. Bar chart of the top small molecules identified by an L1000 Drug cMap query in Enrichr whose signatures mimic the “STAT3” example expression profile.

5. Cell Profiling (or Cell Type Identification)

The development of single-cell RNA sequencing (scRNA-seq) has revolutionized transcriptomics data analysis over the past decade. This approach provides insights into gene expression levels and patterns within individual cells, allowing for the identification of cell types, subtypes, and their functional characteristics.

Databases for various cell types and organs, mostly for humans and rodent models, can now be queried to identify the nature of individual cells in experiments. These include the NCBI BioSample Database; the Genotype-Tissue Expression (GTEx) portal (GTEx Consortium, 2013), which contains expression profiles of various tissue types; the aforementioned CTRP, which contains unique expression profiles for various cancer cell lines; and various databases for immune cell profiling, e.g. the Database of Immune Cell eQTLs, Expression and Epigenomics (DICE; Schmiedel et al, 2018) and the ImmunoStates database (Vallania et al, 2018).
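A minimal sketch of reference-based cell type identification is shown below: each cell's expression profile is correlated with reference profiles, and the best-matching label is assigned. The reference and cell values are hypothetical and limited to four well-known marker genes. Real reference databases and dedicated annotation tools use far more genes and more elaborate statistics.

```python
import numpy as np

marker_genes = ["CD3E", "CD19", "CD14", "NKG7"]

# Hypothetical reference expression profiles over the marker genes.
reference = {
    "T cell": np.array([8.0, 0.5, 0.3, 2.0]),
    "B cell": np.array([0.4, 7.5, 0.2, 0.1]),
    "Monocyte": np.array([0.2, 0.3, 9.0, 0.5]),
}

# Hypothetical single-cell profiles (rows are cells, columns are marker genes).
cells = np.array([
    [6.5, 0.0, 0.5, 1.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.5, 0.0, 7.0, 1.0],
])


def assign_cell_type(profile, reference):
    """Return the reference label with the highest Pearson correlation."""
    scores = {
        name: float(np.corrcoef(profile, ref)[0, 1])
        for name, ref in reference.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


assignments = [assign_cell_type(cell, reference)[0] for cell in cells]
```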

The cell profiling information can be displayed in various ways, including bar charts, dot maps and, more commonly, UMAP plots (Figure 6).

Figure 6. Example of single cell UMAP representation for cell profiling. Based on data from Tirosh et al, 2016.

6. Feature Level Clustering

Most researchers are familiar with clustering at the sample level on a PCA, UMAP or t-SNE plot. An increasingly popular alternative is clustering at the feature level through gene or gene set UMAP plots (Figure 7).

Clustering at the gene level involves grouping genes based on their expression patterns across samples. This approach aims to identify sets of genes that exhibit similar expression profiles, which can provide insights into their potential co-regulation and shared biological functions.
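The sketch below illustrates the idea of placing genes, rather than samples, in a two-dimensional map. To stay dependency-free it uses PCA (via NumPy's SVD) as a stand-in for UMAP, so it is not the method used to produce the gene UMAP plots discussed here. The simulated data and group structure are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 40 genes x 12 samples. The first 20 genes are higher in
# the last six samples, the other 20 genes are lower in those samples.
n_samples = 12
pattern = np.r_[np.zeros(6), np.ones(6)]
expr = np.vstack([
    pattern * 2.0 + rng.normal(scale=0.3, size=(20, n_samples)),
    -pattern * 2.0 + rng.normal(scale=0.3, size=(20, n_samples)),
])

# Standardise each gene across samples so genes are compared by the shape
# of their expression pattern rather than by absolute level.
z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

# PCA with genes as observations: centre the columns, then take the SVD.
z_centred = z - z.mean(axis=0)
u, s, _ = np.linalg.svd(z_centred, full_matrices=False)
gene_coords = u[:, :2] * s[:2]  # one (x, y) point per gene

# Per-gene variation across samples, which could be used to colour the points.
gene_sd = expr.std(axis=1)
```

Genes with similar expression patterns land close together in `gene_coords`, so the two simulated groups separate along the first axis.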

Figure 7. Gene UMAP plots as generated on the transcriptomics dataset by Wang et al, 2021, using Omics Playground. Genes are clustered together based on similarity of expression across the samples and then coloured in terms of the variation in expression across samples. Areas of extreme variation (coloured in red/dark orange) indicate differences between samples that may be explained by existing phenotypes.

A further variation is the clustering of gene sets rather than individual genes in a UMAP plot, as performed on the Omics Playground platform. While unpublished, this approach can provide more meaningful information than individual genes and can already highlight impacted pathways or other features of interest in the dataset.

7. Comparative Analysis Between Datasets

With the availability of both large gene expression databases and hundreds of thousands of gene expression studies through portals such as GEO (Edgar et al, 2002), researchers are no longer limited to analyzing their datasets in isolation. Increasingly, comparing datasets across large collections is becoming the norm in both academic and private research.

Such comparative analysis can take various forms, with the simplest being a straightforward gene expression correlation analysis between pairwise comparisons from two datasets (Figure 8). 

The real challenge for inexperienced researchers is performing such an analysis across hundreds of thousands of pairwise comparisons, ordering the results based on correlation strength and statistical significance and then generating corresponding databases.
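On a small scale, the ranking step can be sketched as follows: the log fold changes of the studied comparison are correlated with those of each reference comparison, and the references are ordered by absolute correlation. All data here are simulated and the names are hypothetical. A real analysis would first match genes between datasets by a shared identifier and would also assess statistical significance.

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes = 200

# Hypothetical log fold changes for the studied pairwise comparison.
query_lfc = rng.normal(size=n_genes)

# Hypothetical reference comparisons measured on the same genes.
references = {
    "ref_similar": query_lfc + rng.normal(scale=0.5, size=n_genes),
    "ref_opposite": -query_lfc + rng.normal(scale=0.5, size=n_genes),
    "ref_unrelated": rng.normal(size=n_genes),
}


def rank_references(query, refs):
    """Correlate the query with every reference, strongest correlation first."""
    rows = [
        (name, float(np.corrcoef(query, lfc)[0, 1]))
        for name, lfc in refs.items()
    ]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


ranking = rank_references(query_lfc, references)

for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

Sorting by absolute correlation keeps both concordant and opposite reference comparisons near the top, since either can be informative.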

Figure 8. An example of how pairwise comparisons can be compared with each other in a simple correlation scatterplot, with the studied pairwise comparison value on the Y-axis and reference pairwise comparison on the X-axis. Plot generated based on the transcriptomics data by Wang et al, 2021 using Omics Playground.

Conclusion

Thanks to the availability of both large dataset collections and various types of gene profile databases, it is now possible to extend the bioinformatics analysis of transcriptomics datasets beyond the conventional analysis types (such as differential expression analysis or gene set analysis) typical of the field. 

In this post, we provided a brief overview of some of the potential applications, though the full list extends well beyond what is discussed here and includes, for example, spatial transcriptomics.

While these types of analyses may seem daunting for researchers with limited programming knowledge, there are now many tools available to both simplify them and speed up advanced analysis. 

In particular, online bioinformatics platforms such as Omics Playground have made great progress in making ever more sophisticated transcriptomics data analysis accessible and more efficient.

About the Author

Axel Martinelli

Axel Martinelli’s academic background is in molecular biology and parasitology. He earned a Ph.D. on the genetics of strain-specific immunity against malaria infections and a master’s degree in bioinformatics with specialization in the analysis of omics data. During his postdoctoral career, he worked on genomics and transcriptomics studies and is currently the head of biology at BigOmics Analytics.

References

Akhmedov M, Kedaigle A, Chong RE, Montemanni R, Bertoni F, Fraenkel E, Kwee I. PCSF: An R-package for network-based interpretation of high-throughput data. PLoS Comput Biol. 2017 Jul 31;13(7):e1005694.

Basu A, Bodycombe NE, Cheah JH, Price EV, Liu K, Schaefer GI, Ebright RY, Stewart ML, Ito D, Wang S, Bracha AL, Liefeld T, Wawer M, Gilbert JC, Wilson AJ, Stransky N, Kryukov GV, Dancik V, Barretina J, Garraway LA, Hon CS, Munoz B, Bittker JA, Stockwell BR, Khabele D, Stern AM, Clemons PA, Shamji AF, Schreiber SL. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell. 2013 Aug 29;154(5):1151-1161.

Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, Clark NR, Ma’ayan A. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013 Apr 15;14:128. doi: 10.1186/1471-2105-14-128.

Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002 Jan 1;30(1):207-10.

GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013 Jun;45(6):580-5.

Kwee I, Martinelli A, Khayal LA, Akhmedov M. metaLINCS: an R package for meta-level analysis of LINCS L1000 drug signatures using stratified connectivity mapping. Bioinform Adv. 2022 Sep 9;2(1):vbac064.

Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006 Sep 29;313(5795):1929-35.

Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008 Dec 29;9:559. doi: 10.1186/1471-2105-9-559.

Ng S, Masarone S, Watson D, Barnes MR. The benefits and pitfalls of machine learning for biomarker discovery. Cell Tissue Res. 2023 Oct;394(1):17-31.

Schmiedel BJ, Singh D, Madrigal A, Valdovino-Gonzalez AG, White BM, Zapardiel-Gonzalo J, Ha B, Altay G, Greenbaum JA, McVicker G, Seumois G, Rao A, Kronenberg M, Peters B, Vijayanand P. Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression. Cell. 2018 Nov 29;175(6):1701-1715.e16.

Tirosh I, Izar B, Prakadan SM, Wadsworth MH 2nd, Treacy D, Trombetta JJ, Rotem A, Rodman C, Lian C, Murphy G, Fallahi-Sichani M, Dutton-Regester K, Lin JR, Cohen O, Shah P, Lu D, Genshaft AS, Hughes TK, Ziegler CG, Kazer SW, Gaillard A, Kolb KE, Villani AC, Johannessen CM, Andreev AY, Van Allen EM, Bertagnolli M, Sorger PK, Sullivan RJ, Flaherty KT, Frederick DT, Jané-Valbuena J, Yoon CH, Rozenblatt-Rosen O, Shalek AK, Regev A, Garraway LA. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016 Apr 8;352(6282):189-96.

Vallania F, Tam A, Lofgren S, Schaffert S, Azad TD, Bongen E, Haynes W, Alsup M, Alonso M, Davis M, Engleman E, Khatri P. Leveraging heterogeneity across multiple datasets increases cell-mixture deconvolution accuracy and reduces biological and technical biases. Nat Commun. 2018 Nov 9;9(1):4735.

Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, Bindal N, Beare D, Smith JA, Thompson IR, Ramaswamy S, Futreal PA, Haber DA, Stratton MR, Benes C, McDermott U, Garnett MJ. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 2013 Jan;41(Database issue):D955-61.

Yang J, Qin Y, Zhang T, Wang F, Peng L, Zhu L, Yuan D, Gao P, Zhuang J, Zhang Z, Wang J, Fang Y. Identification of human age-associated gene co-expressions in functional modules using liquid association. Oncotarget. 2017 Dec 8;9(1):1063-1074.

Wang LB, Karpova A, Gritsenko MA, Kyle JE, Cao S, Li Y, Rykunov D, Colaprico A, Rothstein JH, Hong R, Stathias V, Cornwell M, Petralia F, Wu Y, Reva B, Krug K, Pugliese P, Kawaler E, Olsen LK, Liang WW, Song X, Dou Y, Wendl MC, Caravan W, Liu W, Cui Zhou D, Ji J, Tsai CF, Petyuk VA, Moon J, Ma W, Chu RK, Weitz KK, Moore RJ, Monroe ME, Zhao R, Yang X, Yoo S, Krek A, Demopoulos A, Zhu H, Wyczalkowski MA, McMichael JF, Henderson BL, Lindgren CM, Boekweg H, Lu S, Baral J, Yao L, Stratton KG, Bramer LM, Zink E, Couvillion SP, Bloodsworth KJ, Satpathy S, Sieh W, Boca SM, Schürer S, Chen F, Wiznerowicz M, Ketchum KA, Boja ES, Kinsinger CR, Robles AI, Hiltke T, Thiagarajan M, Nesvizhskii AI, Zhang B, Mani DR, Ceccarelli M, Chen XS, Cottingham SL, Li QK, Kim AH, Fenyö D, Ruggles KV, Rodriguez H, Mesri M, Payne SH, Resnick AC, Wang P, Smith RD, Iavarone A, Chheda MG, Barnholtz-Sloan JS, Rodland KD, Liu T, Ding L; Clinical Proteomic Tumor Analysis Consortium. Proteogenomic and metabolomic characterization of human glioblastoma. Cancer Cell. 2021 Apr 12;39(4):509-528.e20.