A Short Guide to Enrichment Analysis

by Ivo Kwee, CTO

Enrichment Analysis (EA), also called Gene Set Analysis (GSA), is a computational method used to analyze gene expression data and identify whether specific sets of genes or pathways show statistically significant differences between experimental conditions or phenotypes.

Enrichment Analysis helps uncover biologically relevant patterns in large-scale omics data by assessing the enrichment of predefined gene sets or pathways based on the expression levels of their individual genes. This approach provides insights into the collective behavior of functionally related genes, revealing potential biological processes associated with the observed changes in gene expression.

Enrichment Analysis involves several steps that we will look at individually in the following sections.

1. Input Genes

As input, Enrichment Analysis methods generally require either a list of significantly differentially expressed genes, or a scoring vector that can be used to rank the genes by their differential expression values (e.g., fold change or t-statistic) between two or more experimental conditions (see Figure 1). These are usually obtained from a differential expression (DE) analysis, which can be performed in R or Python. Alternatively, researchers can use more user-friendly platforms such as Galaxy, which require only minimal bioinformatics knowledge.
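As a rough illustration of the two input types, the sketch below starts from a small, made-up differential expression table and derives both a list of significant genes and a ranked scoring vector. The gene values are hypothetical, and the thresholds (adjusted p-value below 0.05, absolute log2 fold change above 1) are common conventions assumed here, not requirements.

```python
# Hypothetical DE results: gene -> (log2 fold change, adjusted p-value).
de_table = {
    "TP53":  (-1.8, 0.001),
    "MYC":   (2.4, 0.0004),
    "GAPDH": (0.1, 0.90),
    "EGFR":  (1.2, 0.03),
    "ACTB":  (-0.2, 0.75),
}

def significant_genes(table, max_padj=0.05, min_abs_lfc=1.0):
    """Input type 1: a plain list of significant genes."""
    return sorted(
        gene for gene, (lfc, padj) in table.items()
        if padj < max_padj and abs(lfc) > min_abs_lfc
    )

def ranked_genes(table):
    """Input type 2: all genes ranked by score, most upregulated first."""
    return sorted(
        ((gene, lfc) for gene, (lfc, _) in table.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

sig = significant_genes(de_table)
ranking = ranked_genes(de_table)
```

Note that the ranked vector keeps every measured gene, including those that do not pass the significance thresholds.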

2. Predefined Gene Sets

Enrichment Analysis relies on predefined gene sets, which can be obtained from various sources such as pathway databases (KEGG, Reactome), GO terms, or curated collections of genes associated with specific functions. Several such resources are available online; some of the most commonly used gene set collections are listed below:

Gene Ontology (GO): GO is a widely used collection of terms that categorize genes and gene products based on their biological functions, processes, and cellular components.
KEGG Pathways: The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides a comprehensive collection of pathways that represent various biological processes, such as metabolic pathways, signal transduction, and cellular processes.
Reactome Pathways: Reactome is a curated database of biological pathways and reactions, offering detailed insights into molecular events and processes.
MSigDB (Molecular Signatures Database): MSigDB contains a diverse set of gene sets, including pathways, gene ontologies, and curated sets from various sources. It includes the hallmark gene sets that represent key biological processes.
BioCarta Pathways: BioCarta provides curated pathways that focus on interactions, signaling, and biological processes relevant to human biology.
Transcription Factor Targets: These gene sets represent genes regulated by specific transcription factors, offering insights into regulatory networks.
WikiPathways: WikiPathways is an open, community-driven collaborative platform and database for creating, curating, and sharing biological pathways.
Drug Signatures Database (DSigDB): DSigDB is a bioinformatics resource and database that provides information about the transcriptional effects of small molecule drugs and compounds on gene expression profiles.

The Enrichr and Harmonizome websites gather gene sets from different resources and allow you to download them easily from a central place.
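Gene set collections are commonly distributed as GMT files (MSigDB uses this format): one gene set per line, tab-separated, with the set name first, a description second, and the member genes after that. Below is a minimal parser sketch. It uses a hypothetical two-line example rather than a real download, and the set names and URLs are made up.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into a dict: set name -> set of genes."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}  # drop empty trailing fields
    return gene_sets

# Hypothetical content; real files hold thousands of sets.
example = [
    "EXAMPLE_SET_A\thttp://example.org/a\tTP53\tMDM2\tCDKN1A\n",
    "EXAMPLE_SET_B\thttp://example.org/b\tMYC\tMAX\t\n",
]
gene_sets = parse_gmt(example)
```

For a real file, the same function can be called on an open file handle, for example parse_gmt(open(path)), where path is wherever you saved the download.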

3. Statistical Testing

Enrichment Analysis methods need a statistical test to determine whether the predefined gene sets are statistically enriched. The test assesses whether the genes within a set are found more frequently among the highly upregulated or downregulated genes than expected by chance.

Depending on the method, this involves the calculation of some kind of enrichment score, which represents the degree to which a particular gene set is overrepresented and quantifies the collective behavior of genes within the set. Furthermore, statistical p-values are computed to determine the significance of the enrichment scores. Since enrichment analysis performs a separate statistical test for each of many gene sets, the p-values must be adjusted for the number of gene sets tested. Normally this is achieved by applying a False Discovery Rate (FDR) correction.
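To make the correction step concrete, here is a minimal sketch of one widely used FDR procedure, Benjamini-Hochberg. This choice is an assumption, since the text above does not name a specific procedure, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for five gene sets.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
qvals = benjamini_hochberg(pvals)
```

Each raw p-value is scaled by the number of tests divided by its rank, so the smallest p-values are penalized the most, and no adjusted value exceeds that of a gene set with a larger raw p-value.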

Probably more than a hundred Enrichment Analysis methods have been proposed in the literature. Before you sink into despair at the thought of having to take a Master's degree in advanced statistics and coding to perform an enrichment analysis, note that there are several algorithms, usually conveniently assembled in Bioconductor or independent R packages, that can be used to perform all the required statistical analyses. The methods in the following short list are among the best or, in our experience, strike the best balance between performance and computational speed:

Fisher’s Exact Test: Fisher’s exact test is used to determine whether the overlap between a list of significant genes and a gene set is larger than would be expected by chance. This approach is also known as over-representation analysis. Instead of Fisher’s exact test, the Chi-squared test can be used.
GSEA (Gene Set Enrichment Analysis): This is the original algorithm developed by the Broad Institute. It assesses whether predefined gene sets are statistically enriched at the top or bottom of a ranked list of genes based on their differential expression between two or more experimental conditions. fgsea (Fast Gene Set Enrichment Analysis) is an efficient and faster implementation of GSEA. It is available as an R package and can be used for GSEA analysis on large-scale genomic datasets.
ssGSEA (Single Sample Gene Set Enrichment Analysis): ssGSEA is a variation of Gene Set Enrichment Analysis that is specifically designed for single-sample or individual-level analysis. Unlike traditional GSEA, which compares gene sets between two or more groups of samples, ssGSEA computes the gene set enrichment for individual samples in a dataset.
GSVA (Gene Set Variation Analysis): GSVA is an alternative approach to ssGSEA that estimates the variation of gene sets’ activity across samples, allowing for the analysis of pathway deregulation without prior gene ranking.
CAMERA (Competitive Gene Set Test Accounting for Inter-gene Correlation): This method takes into account the correlation structure among genes within a gene set and can be more powerful when dealing with gene sets that contain highly correlated genes. It is available in the R/Bioconductor package limma.
FRY/ROAST: ROAST uses residual space rotation as a sort of continuous version of sample permutation. FRY is a very fast approximation to the complete ROAST method.
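To illustrate the simplest method in the list above: when the input is a list of significant genes, Fisher's exact test reduces to a hypergeometric tail probability on the overlap between that list and a gene set. The sketch below computes the one-sided p-value with the standard library only. The counts are hypothetical and the function name is ours.

```python
from math import comb

def ora_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least n_overlap gene set members among
    n_selected genes drawn at random from a universe of n_universe genes,
    of which n_in_set belong to the gene set.
    """
    total = comb(n_universe, n_selected)
    max_k = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total

# Hypothetical numbers: 20 genes measured, 5 in the set,
# 6 called significant, 4 of which are in the set.
p = ora_pvalue(n_universe=20, n_in_set=5, n_selected=6, n_overlap=4)
```

The universe should be the set of genes that were actually measured, since the choice of background directly changes the p-value.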

Among these methods, GSEA is generally considered to be the best and has always been among the top performing methods in comparisons, while Fisher’s exact test is also widely used, mainly because of its speed. The right method depends foremost on the type of input that is available. If you only have a list of significant genes, then you must use over-representation analysis (ORA) methods (either Fisher’s exact or the Chi-squared test). If you have a ranking of all the genes (such as logFC or t-statistics), then you can use the rank-based methods. Rank-based methods are generally preferred because they can give a result even if none of the genes reach statistical significance. A decision diagram of which method to use is given in Figure 2. Arguably, the “best” method is to evaluate multiple methods and combine their results.
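One simple way to combine results from several methods is to keep, for each gene set, the largest (least significant) adjusted p-value across methods, so that a gene set is only called significant when every method agrees. The sketch below is a conservative illustration with made-up q-values. Other combination rules exist, and this is not presented as the procedure of any particular tool.

```python
def combine_by_max_q(results):
    """results: method name -> {gene set name -> q-value}.

    Returns gene set -> max q-value over the methods; only gene sets
    reported by every method are kept.
    """
    methods = list(results.values())
    shared = set(methods[0]).intersection(*methods[1:])
    return {gs: max(m[gs] for m in methods) for gs in sorted(shared)}

# Hypothetical q-values from three methods.
results = {
    "fisher": {"SET_A": 0.010, "SET_B": 0.200, "SET_C": 0.030},
    "fgsea":  {"SET_A": 0.004, "SET_B": 0.040, "SET_C": 0.020},
    "camera": {"SET_A": 0.020, "SET_B": 0.300},
}
meta_q = combine_by_max_q(results)
significant = [gs for gs, q in meta_q.items() if q < 0.05]
```

In this example only SET_A survives: SET_B fails in two of the methods, and SET_C is dropped because one method did not report it.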

4. Visualisation of the results

Once the statistical tests have been performed, plots are generally used to visualize the enrichment scores and distribution of gene set genes along the ranked gene list.

Enrichment plots are the most widely used representation (Figure 3A), although they require some background reading for a correct interpretation. They were originally introduced for the GSEA method. A useful guide to that end can be found here.
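The curve in an enrichment plot is a running-sum statistic: walking down the ranked list, the sum increases at each gene that belongs to the set and decreases otherwise, and the enrichment score is the largest deviation from zero. The sketch below is a simplified version of this statistic on a made-up ranking, assuming a weight exponent of 1. It omits the permutation-based p-value and the normalization that full GSEA implementations perform.

```python
def running_enrichment(ranked, gene_set, weight=1.0):
    """ranked: list of (gene, score), sorted by score descending.

    Returns (enrichment_score, running_sum_curve).
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_total = sum(abs(s) ** weight for (_, s), h in zip(ranked, hits) if h)
    if n_miss == 0 or hit_total == 0:
        raise ValueError(
            "need at least one miss and at least one hit with non-zero score")

    curve, running = [], 0.0
    for (_, score), is_hit in zip(ranked, hits):
        if is_hit:
            running += abs(score) ** weight / hit_total  # step up at a hit
        else:
            running -= 1.0 / n_miss                      # step down at a miss
        curve.append(running)
    es = max(curve, key=abs)  # largest deviation from zero, sign kept
    return es, curve

# Hypothetical ranking of six genes by score.
ranked = [("G1", 3.0), ("G2", 2.0), ("G3", 1.0),
          ("G4", -0.5), ("G5", -1.0), ("G6", -2.5)]
es, curve = running_enrichment(ranked, {"G1", "G2", "G5"})
```

A positive score means the set's genes cluster near the top of the ranking (upregulated), and a negative score means they cluster near the bottom. The curve always returns to zero at the end of the list.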

Alternatively, for gene sets that represent biological pathways, such as KEGG pathways, WikiPathways and Reactome pathways, a more visually appealing and biologically interpretable representation at the individual gene level (for example, Figure 3B) can be generated with software packages such as the pathview Bioconductor package. In these cases, the actual expression changes in each individual gene of a given gene set are visualized, rather than just being ranked.

A nice visualization is to annotate the gene set genes in the differential expression volcano plot as in Figure 3C.

To compare enrichment scores of all tested gene sets, one can create a volcano plot at gene set level, where the horizontal axis represents the enrichment score and the vertical axis the statistical significance (see Figure 3D). Other common visualizations are the barplot and dotplot where color could depict the p-value (see Figure 3E).

A problem is that there are many hundreds of thousands of gene sets. Many of them are highly overlapping or correlated, so it may become difficult to see the overarching theme. By clustering gene sets using UMAP (from the ssGSEA or GSVA matrix), one can nicely visualize a ‘landscape’ of gene sets (see Figure 3F). Then, by coloring the gene sets by their enrichment score or average logFC, one can easily show upregulated and downregulated regions of correlated gene sets. The Enrichment Map does a similar clustering of gene sets but uses a network based on gene overlap (Figure 3G).
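The gene-overlap network behind an enrichment map can be sketched by linking every pair of gene sets whose overlap exceeds a cutoff. The Jaccard index and the 0.25 cutoff used below are illustrative assumptions, as tools may use other similarity measures and thresholds. The gene sets are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared genes divided by the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_edges(gene_sets, min_similarity=0.25):
    """Return (set_1, set_2, similarity) for pairs at or above the cutoff."""
    items = sorted(gene_sets.items(), key=lambda kv: kv[0])
    edges = []
    for (name_a, genes_a), (name_b, genes_b) in combinations(items, 2):
        sim = jaccard(genes_a, genes_b)
        if sim >= min_similarity:
            edges.append((name_a, name_b, sim))
    return edges

# Hypothetical gene sets.
gene_sets = {
    "APOPTOSIS":   {"TP53", "BAX", "CASP3", "BCL2"},
    "P53_PATHWAY": {"TP53", "BAX", "MDM2", "CDKN1A"},
    "GLYCOLYSIS":  {"HK2", "PKM", "LDHA"},
}
edges = overlap_edges(gene_sets)
```

The resulting edge list can be passed to any graph layout tool. Connected groups of gene sets then show up as clusters that share an underlying theme.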

Figure 3. (A) Example of a typical GSEA plot. (B) Example of the rendering of a pathway gene set with genes colored based on the level of up- or downregulation. (C) Gene set genes annotated in the DE volcano plot. (D) Gene set level volcano where each point is a gene set. (E) Barplot and dot plot visualization. (F) Gene sets UMAP indicating regions of highly enriched gene sets. (G) Enrichment map colored by average logFC.

5. Performing GSEA analysis online using Omics Playground

Enrichment Analysis can be a daunting prospect for novices and requires a fair investment of time to learn R or other programming languages, as well as to become familiar with the various statistical approaches used.

An alternative to such toil is provided by user-friendly bioinformatics platforms such as Omics Playground. These platforms take care of the normalization, analysis and visualization steps, allowing users to concentrate on the biological interpretation of the results.

Omics Playground cross-references the experimental differential gene expression signatures against more than 50,000 gene sets from various public libraries. It also provides access to rendered images from the WikiPathways and Reactome collections and produces GO term graphs.

In particular, the platform offers an intuitive interface. Users simply need to provide a read counts table in CSV format and a description of the various phenotypes. They can then create pairwise comparisons between conditions through the platform itself and select from a collection of seven different peer-reviewed enrichment methods (Figure 4), whose results are then intersected.

The results will be presented in various tables and visual formats under the “Gene Sets” module, with a dedicated tab for general Enrichment Analysis (Figure 5) and a tab dedicated to the visual rendering of Wikipathways, Reactome pathways and GO terms (Figure 6).

As a user-friendly and up-to-date alternative to scripted pipelines, Omics Playground makes advanced analysis of transcriptomics and proteomics data easily accessible to casual users. It also provides bioinformaticians with an efficient way to share data with biologists across the organization, ensuring reproducible results.

About the Author

Ivo Kwee holds a BSc degree in Engineering Physics, an MEng in Applied Physics and a PhD in Medical Physics. He has over 16 years of experience in bioinformatics and is currently CTO and co-founder of BigOmics Analytics, where he contributes to the mission of creating the best self-service analytics platform that enables life scientists to analyze their omics data.