A Short Guide to Enrichment Analysis

Published on Aug 21, 2023
by Ivo Kwee

Enrichment Analysis (EA), also called Gene Set Analysis (GSA), is a computational method used to analyze gene expression data and identify whether specific sets of genes or pathways show statistically significant differences between experimental conditions or phenotypes.

Enrichment Analysis helps uncover biologically relevant patterns in large-scale omics data by assessing the enrichment of predefined gene sets or pathways based on the expression levels of their individual genes. This approach provides insights into the collective behavior of functionally related genes, revealing potential biological processes associated with the observed changes in gene expression.

Enrichment Analysis involves several steps that we will look at individually in the following sections.

1. Input Genes 

As input, Enrichment Analysis methods generally require either a list of significant differentially expressed genes, or a scoring vector that can be used to rank the genes by their differential expression values (e.g., fold change or t-statistic) between two or more experimental conditions (see Figure 1). These inputs are usually obtained from a differential expression (DE) analysis, which can be performed with R or Python. Alternatively, researchers can use more user-friendly platforms such as Galaxy that require only minimal bioinformatics knowledge.

Figure 1. The input for Enrichment Analysis is either a list of significant differentially expressed genes (gene list), or a scoring vector (e.g. logFC or t-statistics) that is used to rank the genes.
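To make the two input types concrete, here is a minimal Python sketch that derives both from a small DE table. The gene values, the thresholds (adjusted p-value of 0.05 and absolute logFC of 1) and the function names are illustrative assumptions, not recommendations from this guide.

```python
# Hypothetical DE results: gene -> (log2 fold change, adjusted p-value).
# The numbers are made up for illustration only.
de_results = {
    "TP53": (2.1, 0.001),
    "MYC": (1.4, 0.030),
    "EGFR": (-1.8, 0.004),
    "GAPDH": (0.1, 0.900),
    "ACTB": (-0.2, 0.750),
}


def significant_gene_list(de, max_padj=0.05, min_abs_logfc=1.0):
    """Return the genes passing both thresholds (input for ORA methods)."""
    return sorted(
        gene
        for gene, (logfc, padj) in de.items()
        if padj <= max_padj and abs(logfc) >= min_abs_logfc
    )


def ranked_scores(de):
    """Return (gene, logFC) pairs sorted from most up- to most downregulated."""
    pairs = [(gene, logfc) for gene, (logfc, _padj) in de.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


gene_list = significant_gene_list(de_results)  # input type 1: gene list
ranking = ranked_scores(de_results)            # input type 2: scoring vector
```

The gene list discards all genes below the thresholds, whereas the ranking keeps every measured gene together with its score.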

2. Predefined Gene Sets

Enrichment Analysis relies on predefined gene sets, which can be obtained from various sources such as pathway databases (KEGG, Reactome), GO terms, or curated collections of genes associated with specific functions. Several such resources are available online; some of the most commonly used gene set collections are listed below:

  1. Gene Ontology (GO): GO is a widely used collection of terms that categorize genes and gene products based on their biological functions, processes, and cellular components.
  2. KEGG Pathways: The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides a comprehensive collection of pathways that represent various biological processes, such as metabolic pathways, signal transduction, and cellular processes.
  3. Reactome Pathways: Reactome is a curated database of biological pathways and reactions, offering detailed insights into molecular events and processes.
  4. MSigDB (Molecular Signatures Database): MSigDB contains a diverse set of gene sets, including pathways, gene ontologies, and curated sets from various sources. It includes the hallmark gene sets that represent key biological processes.
  5. BioCarta Pathways: BioCarta provides curated pathways that focus on interactions, signaling, and biological processes relevant to human biology.
  6. Transcription Factor Targets: These gene sets represent genes regulated by specific transcription factors, offering insights into regulatory networks.
  7. WikiPathways: WikiPathways is an open, community-driven collaborative platform and database for creating, curating, and sharing biological pathways.
  8. Drug Signatures Database (DSigDB): DSigDB is a bioinformatics resource and database that provides information about the transcriptional effects of small molecule drugs and compounds on gene expression profiles.

The Enrichr and Harmonizome websites gather gene sets from different resources and allow users to download them easily from a central place.
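Downloaded gene set collections are commonly distributed as GMT files, a tab-separated text format in which each line holds a set name, a description and then the member genes. The sketch below assumes that format; the function name and the two example sets are made up for illustration.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into a dict {set_name: set_of_genes}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        # Drop empty fields caused by trailing tabs.
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


# Hypothetical two-line GMT content (not taken from any real database).
example_gmt = (
    "HYPOTHETICAL_APOPTOSIS\tmade-up description\tTP53\tBAX\tCASP3\n"
    "HYPOTHETICAL_CELL_CYCLE\tmade-up description\tMYC\tCCND1\tCDK4\tTP53\n"
)
gene_sets = parse_gmt(example_gmt)
```

Storing each gene set as a Python set makes the overlap computations used by enrichment tests straightforward.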

3. Statistical Testing

Enrichment Analysis methods need a statistical test to determine whether the predefined gene sets are statistically enriched. The test assesses whether the genes within a set are found more frequently among the highly upregulated or downregulated genes than would be expected by chance.

Depending on the method, this involves the calculation of some kind of enrichment score, which represents the degree to which a particular gene set is overrepresented and quantifies the collective behavior of genes within the set. Furthermore, statistical p-values are computed to determine the significance of the enrichment scores. Since enrichment analysis performs a separate statistical test for each of many gene sets, the p-values must be adjusted to account for the number of gene sets tested. This is normally achieved by controlling the False Discovery Rate (FDR), for example with the Benjamini-Hochberg procedure.
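As an illustration of the multiple-testing step, the following Python sketch implements the Benjamini-Hochberg procedure, one widely used way to control the FDR. This guide does not prescribe a specific FDR procedure, so the choice is an assumption, and the function name and example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four gene sets.
raw_pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted_pvalues = benjamini_hochberg(raw_pvalues)
```

Gene sets whose adjusted p-value falls below the chosen FDR threshold are then reported as enriched.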

Probably more than a hundred Enrichment Analysis methods have been proposed in the literature. Before you sink into despair at the thought of having to take a Master's degree in advanced statistics and coding to perform an enrichment analysis, note that several algorithms, usually conveniently assembled in Bioconductor or independent R packages, can be used to perform all the required statistical analyses. The methods in the following short list are among the best available or, in our experience, strike the best balance between performance and computational speed:

  1. Fisher’s Exact Test: Fisher’s exact test is an over-representation analysis (ORA) method. It determines whether the overlap between the list of significant genes and a gene set is larger than would be expected by chance, given the total number of genes measured. Instead of Fisher’s exact test, the Chi-squared test can be used.
  2. GSEA (Gene Set Enrichment Analysis): This is the original algorithm developed by the Broad Institute. It assesses whether predefined gene sets are statistically enriched at the top or bottom of a list of genes ranked by their differential expression between two or more experimental conditions. fgsea (Fast Gene Set Enrichment Analysis) is a faster, more efficient implementation of GSEA. It is available as an R package and can be used for GSEA analysis on large-scale genomic datasets.
  3. ssGSEA (Single Sample Gene Set Enrichment Analysis): ssGSEA is a variation of Gene Set Enrichment Analysis that is specifically designed for single-sample or individual-level analysis. Unlike traditional GSEA, which compares gene sets between two or more groups of samples, ssGSEA computes the gene set enrichment for each individual sample in a dataset.
  4. GSVA (Gene Set Variation Analysis): GSVA is an alternative approach to ssGSEA that estimates the variation of gene sets’ activity across samples, allowing for the analysis of pathway deregulation without prior gene ranking. 
  5. CAMERA (Competitive Gene Set Test Accounting for Inter-gene Correlation): This method takes into account the correlation structure among genes within a gene set and can be more powerful when dealing with gene sets that contain highly correlated genes. It is available in the R/Bioconductor package limma.
  6. FRY/ROAST: ROAST uses residual space rotation as a sort of continuous version of sample permutation. FRY is a very fast approximation to the complete ROAST method. Both are available in the R/Bioconductor package limma.
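To illustrate the over-representation idea behind the first method in the list, here is a minimal Python sketch of a one-sided Fisher's exact test, computed as a hypergeometric tail probability. The function name, the gene identifiers and the use of all measured genes as the background universe are assumptions made for this example.

```python
from math import comb


def ora_pvalue(significant, gene_set, universe):
    """One-sided Fisher's exact p-value for over-representation.

    Probability of seeing at least the observed overlap when the
    significant genes are drawn at random from the universe.
    """
    universe = set(universe)
    sig = set(significant) & universe
    members = set(gene_set) & universe
    n_universe, n_members, n_sig = len(universe), len(members), len(sig)
    overlap = len(sig & members)
    total = comb(n_universe, n_sig)
    # Sum the hypergeometric probabilities for overlaps >= observed.
    tail = sum(
        comb(n_members, x) * comb(n_universe - n_members, n_sig - x)
        for x in range(overlap, min(n_members, n_sig) + 1)
    )
    return tail / total


# Hypothetical example: 20 measured genes, a 5-gene set, 4 significant genes.
universe = [f"g{i}" for i in range(1, 21)]
example_set = {"g1", "g2", "g3", "g4", "g5"}
significant = ["g1", "g2", "g3", "g6"]
p_value = ora_pvalue(significant, example_set, universe)
```

In practice this p-value is computed for every gene set in the collection and then corrected for multiple testing.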

Among these methods, GSEA is generally considered to be the best and has consistently been among the top-performing methods in comparisons, while Fisher’s exact test is also widely used, mainly because of its speed. The right method depends foremost on the input type that is available. If you only have a list of significant genes available, then you must use ORA methods (either Fisher’s exact or Chi-squared test). If you have some ranking of all the genes (like logFC or t-statistics), then you can use the rank-based methods. Rank-based methods are generally preferred because they can give a result even if none of the genes reach statistical significance. A decision diagram of which method to use is given in Figure 2. Arguably, the “best” approach is to evaluate multiple methods and combine their results.

Figure 2. Decision diagram of how to choose different enrichment methods depending on the input type and speed of the methods. Over-representation methods (ORA) require a list of significant genes. Rank-based methods require for all genes some kind of score (e.g. logFC or t-statistics).

4. Visualisation of the results

Once the statistical tests have been performed, plots are generally used to visualize the enrichment scores and distribution of gene set genes along the ranked gene list.

Enrichment plots are the most widely used representation (Figure 3A), although they require some background reading to interpret correctly. They were originally introduced for the GSEA method. A useful guide to that end can be found here.
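The curve shown in an enrichment plot is a running sum over the ranked gene list: it steps up at each gene that belongs to the gene set and down at each gene that does not. The Python sketch below assumes the weighted running-sum statistic described for the original GSEA method, with the weight exponent set to 1. It is a simplified illustration, not the Broad Institute implementation, and it omits the permutation step used to obtain p-values. All names and values are hypothetical.

```python
def running_enrichment(ranked, gene_set, weight=1.0):
    """Return (profile, enrichment_score) for a ranked gene list.

    ranked: list of (gene, score) pairs sorted from highest to lowest score.
    profile: running sum after each gene, i.e. the curve of the plot.
    """
    hits = [gene in gene_set for gene, _score in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_weights = [
        abs(score) ** weight if hit else 0.0
        for (_gene, score), hit in zip(ranked, hits)
    ]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    profile, running = [], 0.0
    for hit, hit_weight in zip(hits, hit_weights):
        # Step up (weighted by the score) on a hit, step down on a miss.
        running += hit_weight / total_hit_weight if hit else -1.0 / n_miss
        profile.append(running)
    # The enrichment score is the largest deviation from zero.
    enrichment_score = max(profile, key=abs)
    return profile, enrichment_score


# Hypothetical ranking with six genes and a gene set at the top of the list.
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
profile, es = running_enrichment(ranked, {"A", "B"})
```

A positive score indicates that the gene set is concentrated at the top of the ranking (upregulated), a negative score that it is concentrated at the bottom.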

Alternatively, for gene sets that represent biological pathways, such as KEGG pathways, WikiPathways and Reactome pathways, a more visually appealing and biologically interpretable representation at the individual gene level can be generated (for example, Figure 3B) using pathway images rendered by software such as the pathview Bioconductor package. In these cases, the actual expression change of each individual gene in a given gene set is visualized, rather than just its rank.

Another useful visualization is to annotate the genes of a gene set in the differential expression volcano plot, as in Figure 3C.

To compare enrichment scores of all tested gene sets, one can create a volcano plot at gene set level, where the horizontal axis represents the enrichment score and the vertical axis the statistical significance (see Figure 3D). Other common visualizations are the barplot and dotplot where color could depict the p-value (see Figure 3E).

A problem is that there are hundreds of thousands of gene sets. Many of them are highly overlapping or correlated, so it can be difficult to see the overarching theme. By clustering gene sets using a UMAP embedding (computed from the ssGSEA or GSVA matrix), one can visualize a ‘landscape’ of gene sets (see Figure 3F). Then, by coloring the gene sets by their enrichment score or average logFC, one can easily show upregulated and downregulated regions of correlated gene sets. The Enrichment Map performs a similar clustering of gene sets, but uses a network based on gene overlap (Figure 3G).
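The gene overlap that underlies such a network can be sketched in a few lines of Python. The example below assumes the Jaccard index as the overlap measure and an arbitrary similarity threshold of 0.25. Both are illustrative choices rather than the settings of any particular tool, and the gene sets are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: shared genes divided by genes present in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_edges(gene_sets, min_similarity=0.25):
    """Return (set_a, set_b, similarity) for pairs at or above the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        similarity = jaccard(gene_sets[name_a], gene_sets[name_b])
        if similarity >= min_similarity:
            edges.append((name_a, name_b, similarity))
    return edges


# Hypothetical gene sets: SET_A and SET_B share three of five distinct genes.
example_sets = {
    "SET_A": {"TP53", "BAX", "CASP3", "BCL2"},
    "SET_B": {"TP53", "BAX", "CASP3", "MYC"},
    "SET_C": {"GAPDH", "ACTB"},
}
edges = overlap_edges(example_sets)
```

Each returned edge connects two gene sets in the network, so heavily overlapping gene sets end up grouped together.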

Figure 3. (A) Example of a typical GSEA plot. (B) Example of the rendering of a pathway gene set with genes colored based on the level of up- or downregulation. (C) Gene set genes annotated in the DE volcano plot. (D) Gene set level volcano where each point is a gene set. (E) Barplot and dot plot visualization. (F) Gene sets UMAP indicating regions of highly enriched gene sets. (G) Enrichment map colored by average logFC.

5. Performing Enrichment Analysis using Omics Playground

Enrichment Analysis can be a daunting prospect for novices, requiring a fair investment of time to learn R or other programming languages and to become familiar with the various statistical approaches used. An alternative to such toil is provided by user-friendly bioinformatics platforms such as our own Omics Playground. These platforms take care of the normalization, analysis and visualization steps, allowing users to concentrate on the biological interpretation of the results.

The platform cross-references the experimental differential gene expression signatures against more than 50,000 gene sets from various public libraries. It also provides access to rendered images from the WikiPathways and Reactome collections and produces GO term graphs.

Omics Playground in particular offers a very intuitive GUI. Users simply need to provide a read counts table in CSV format and a description of the various phenotypes. They can then create pairwise comparisons between conditions through the platform itself and select from a collection of seven different peer-reviewed enrichment methods (Figure 4), whose results are then intersected.

Figure 4. Selecting GSEA algorithms, highlighted by the red box, from the Omics Playground platform.

The results are presented in various tables and visual formats under the “Gene Sets” module, with a dedicated tab for general Enrichment Analysis (Figure 5) and a tab dedicated to the visual rendering of WikiPathways, Reactome pathways and GO terms (Figure 6).

Figure 5. Gene Set Enrichment tab of the Omics Playground platform displaying a list of positively and negatively correlated gene sets from over 40 databases in various table and graphic formats.

Figure 6. Tabular and visual representation of the results of the Pathway enrichment analysis based on the WikiPathways, Reactome and GO collections.

As a user-friendly and up-to-date alternative to scripted pipelines, Omics Playground makes advanced analysis of transcriptomics and proteomics data easily accessible to casual users, while also providing bioinformaticians with a tool to delegate routine dataset analysis and focus on more challenging tasks.