Published on Aug 21, 2023
by Ivo Kwee
Enrichment Analysis (EA), also called Gene Set Analysis (GSA), is a computational method used to analyze gene expression data and identify whether specific sets of genes or pathways show statistically significant differences between experimental conditions or phenotypes.
Enrichment Analysis helps uncover biologically relevant patterns in large-scale omics data by assessing the enrichment of predefined gene sets or pathways based on the expression levels of their individual genes. This approach provides insights into the collective behavior of functionally related genes, revealing potential biological processes associated with the observed changes in gene expression.
Enrichment Analysis involves several steps that we will look at individually in the following sections.
As input, Enrichment Analysis methods generally require either a list of significant differentially expressed genes, or a scoring vector that can be used to rank the genes by their differential expression values (e.g., fold change or t-statistic) between two or more experimental conditions (see Figure 1). These are usually obtained from a differential expression (DE) analysis, which can be performed in R or Python. Alternatively, researchers can use more user-friendly platforms such as Galaxy, which require only minimal bioinformatics knowledge.
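As a minimal sketch of the two input types, the Python snippet below derives both a list of significant genes and a full ranking from a small table of DE results. The gene values, the cutoffs (`max_padj`, `min_abs_lfc`) and the function names are illustrative assumptions, not output from a real analysis.

```python
# Hypothetical DE results: gene -> (log2 fold change, adjusted p-value).
de_results = {
    "TP53": (-1.8, 0.001),
    "MYC": (2.4, 0.0005),
    "GAPDH": (0.1, 0.90),
    "EGFR": (1.2, 0.03),
    "BRCA1": (-0.4, 0.20),
}

def significant_genes(results, max_padj=0.05, min_abs_lfc=1.0):
    """Input type 1: a list of significant differentially expressed genes."""
    return sorted(
        gene for gene, (lfc, padj) in results.items()
        if padj <= max_padj and abs(lfc) >= min_abs_lfc
    )

def ranked_genes(results):
    """Input type 2: all genes ranked by a score (here the log2 fold change)."""
    return sorted(results, key=lambda gene: results[gene][0], reverse=True)

sig = significant_genes(de_results)   # thresholded list, suited to ORA-type methods
ranking = ranked_genes(de_results)    # full ranking, suited to rank-based methods
```

Note that the ranking keeps every gene, including those that fail the significance cutoffs.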
Enrichment Analysis relies on predefined gene sets, which can be obtained from various sources such as pathway databases (KEGG, Reactome), GO terms, or curated collections of genes associated with specific functions. Several such resources are available online. Some of the most commonly used gene set collections are:
The Enrichr and Harmonizome websites gather gene sets from different resources and allow you to download them easily from a central place.
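Gene set collections are commonly distributed as GMT text files, with one gene set per line: a name, a description, then the gene symbols, all separated by tabs. Assuming a downloaded file follows this layout, a small parser could look like the sketch below. The function name `parse_gmt` and the two demo gene sets are made up for illustration.

```python
import io

def parse_gmt(handle):
    """Parse GMT-formatted lines into a dict {gene_set_name: set_of_genes}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}  # drop empty trailing fields
    return gene_sets

# In-memory stand-in for a downloaded file (made-up content).
example = io.StringIO(
    "APOPTOSIS_DEMO\tmade-up example set\tTP53\tBAX\tCASP3\n"
    "GROWTH_DEMO\tmade-up example set\tMYC\tEGFR\n"
)
gene_sets = parse_gmt(example)
```

For a real file, the same function can be called with an open file handle, e.g. `with open(path) as fh: parse_gmt(fh)`.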
Enrichment Analysis methods need a statistical test to determine whether the predefined gene sets are statistically enriched. The test assesses whether the genes within a set are found more frequently among the highly upregulated or downregulated genes than would be expected by chance.
Depending on the method, this involves the calculation of some kind of enrichment score, which represents the degree to which a particular gene set is overrepresented and quantifies the collective behavior of genes within the set. Statistical p-values are then computed to determine the significance of the enrichment scores. Since enrichment analysis performs a separate statistical test for each of many gene sets, the p-values must be adjusted to account for the number of gene sets tested. Normally this is achieved by applying a False Discovery Rate (FDR) correction.
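To make the testing and correction steps concrete, here is a minimal sketch in plain Python. It uses the hypergeometric upper tail, which is equivalent to a one-sided Fisher's exact test, as the enrichment test, and the Benjamini-Hochberg procedure as the FDR correction. Both are common choices but are assumptions here, since the appropriate test depends on the method. The numbers in the example are invented.

```python
from math import comb

def hypergeom_pvalue(k, n_set, n_sig, n_universe):
    """P(X >= k): probability of at least k gene set members among
    n_sig significant genes drawn from a universe of n_universe genes."""
    total = comb(n_universe, n_sig)
    upper = min(n_set, n_sig)
    tail = sum(
        comb(n_set, i) * comb(n_universe - n_set, n_sig - i)
        for i in range(k, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Invented example: 20 genes measured, 4 significant, a gene set of 5 genes,
# 3 of which are among the significant ones.
p = hypergeom_pvalue(k=3, n_set=5, n_sig=4, n_universe=20)  # about 0.032
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

In a real analysis the raw p-values of all tested gene sets would be passed to the correction together, because the adjustment depends on how many tests were performed.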
Probably more than a hundred Enrichment Analysis methods have been proposed in the literature. Before you sink into despair at the thought of having to take a Master's degree in advanced statistics and coding to perform an enrichment analysis, note that several algorithms, usually conveniently packaged in Bioconductor or independent R packages, can perform all the required statistical analyses. The Enrichment Analysis methods in the following short list are among the best or, in our experience, strike the best balance between performance and computational speed:
Among these methods, GSEA is generally considered to be the best and has consistently ranked among the top-performing methods in comparisons, while Fisher’s exact test is also widely used, mainly because of its speed. The right method depends foremost on the type of input that is available. If you only have a list of significant genes, then you must use over-representation analysis (ORA) methods (either Fisher’s exact or Chi-squared test). If you have a ranking of all the genes (such as logFC or t-statistics), then you can use the rank-based methods. Rank-based methods are generally preferred because they can give a result even if none of the genes reach statistical significance. A decision diagram of which method to use is given in Figure 2. Arguably, the “best” approach is to evaluate multiple methods and combine their results.
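To illustrate how a rank-based method can score a gene set without any significance cutoff, the sketch below computes a simplified GSEA-style running-sum enrichment score. It walks down the ranked list, stepping up at gene set members (weighted by their score) and down otherwise, and reports the largest deviation from zero. This is only an illustration under stated assumptions: it omits the permutation-based significance estimate and score normalization used by GSEA, and the `weight` parameter and toy data are hypothetical.

```python
def enrichment_score(ranked_scores, gene_set, weight=1.0):
    """Simplified running-sum enrichment score.

    ranked_scores: list of (gene, score) tuples sorted by decreasing score.
    gene_set: set of gene names.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hit_weights = [
        abs(score) ** weight if gene in gene_set else 0.0
        for gene, score in ranked_scores
    ]
    n_hits = sum(1 for gene, _ in ranked_scores if gene in gene_set)
    n_miss = len(ranked_scores) - n_hits
    total_hit_weight = sum(hit_weights)
    if n_hits == 0 or n_miss == 0 or total_hit_weight == 0:
        return 0.0  # score is undefined for empty or all-covering sets

    running, best = 0.0, 0.0
    for (gene, _), w in zip(ranked_scores, hit_weights):
        if gene in gene_set:
            running += w / total_hit_weight   # step up at a gene set member
        else:
            running -= 1.0 / n_miss           # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranking (made-up scores, e.g. t-statistics), sorted from high to low.
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
es_up = enrichment_score(ranked, {"A", "B"})     # set at the top of the ranking
es_down = enrichment_score(ranked, {"E", "F"})   # set at the bottom of the ranking
```

A positive score indicates that the gene set is concentrated among the upregulated genes, a negative score that it is concentrated among the downregulated genes.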
Once the statistical tests have been performed, plots are generally used to visualize the enrichment scores and distribution of gene set genes along the ranked gene list.
Enrichment plots are the most widely used representation (Figure 3A), although they require some background reading to be interpreted correctly. They were originally introduced for the GSEA method. A useful guide to that end can be found here.
Alternatively, for gene sets that represent biological pathways, such as KEGG, WikiPathways and Reactome pathways, a more visually appealing and biologically interpretable representation at the individual gene level (for example, Figure 3B) can be generated from pathway images rendered by software such as the pathview Bioconductor package. In these cases, the actual expression change of each individual gene in a given gene set is visualized, rather than just its rank.
A nice visualization is to annotate the gene set genes in the differential expression volcano plot as in Figure 3C.
To compare enrichment scores of all tested gene sets, one can create a volcano plot at gene set level, where the horizontal axis represents the enrichment score and the vertical axis the statistical significance (see Figure 3D). Other common visualizations are the barplot and dotplot where color could depict the p-value (see Figure 3E).
One problem is that there are hundreds of thousands of gene sets. Many of them are highly overlapping or correlated, so it can become difficult to see the overarching theme. By clustering gene sets using UMAP (computed from the ssGSEA or GSVA matrix), one can nicely visualize a ‘landscape’ of gene sets (see Figure 3F). Then, by coloring the gene sets by their enrichment score or average logFC, one can easily show upregulated and downregulated regions of correlated gene sets. The Enrichment Map does a similar clustering of gene sets, but uses a network based on gene overlap (Figure 3G).
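As a rough illustration of the overlap-based network idea, the sketch below connects two gene sets whenever they share enough genes. The Jaccard index, the `min_similarity` threshold and the toy gene sets are assumptions chosen for illustration; the actual Enrichment Map method may use a different overlap measure and cutoff.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index: shared genes divided by the union of both gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_edges(gene_sets, min_similarity=0.25):
    """Return (set_a, set_b, similarity) edges for sufficiently overlapping sets."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        similarity = jaccard(gene_sets[name_a], gene_sets[name_b])
        if similarity >= min_similarity:
            edges.append((name_a, name_b, similarity))
    return edges

# Toy gene sets (made-up names and members).
toy_sets = {
    "S1": {"A", "B", "C", "D"},
    "S2": {"B", "C", "D", "E"},
    "S3": {"X", "Y"},
}
edges = overlap_edges(toy_sets)
```

The resulting edge list can be passed to any graph layout tool. Gene sets without edges, such as `S3` here, remain isolated nodes.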
Enrichment Analysis can be a daunting prospect for novices: it requires a fair investment of time to learn R or another programming language, and to become familiar with the various statistical approaches used. An alternative to this toil is provided by user-friendly bioinformatics platforms such as our own Omics Playground. These platforms take care of the normalization, analysis and visualization steps, allowing users to concentrate on the biological interpretation of the results.
The platform cross-references the experimental differential gene expression signatures against more than 50,000 gene sets from various public libraries, provides access to rendered images from the WikiPathways and Reactome collections, and produces GO term graphs.
Omics Playground in particular offers a very intuitive GUI. Users simply need to provide a read counts table in CSV format and a description of the various phenotypes. They can then create pairwise comparisons between conditions through the platform itself and select from a collection of seven different peer-reviewed enrichment methods (Figure 4), the results of which will be intersected.
The results will be presented in various tables and visual formats under the “Gene Sets” module, with a dedicated tab for general Enrichment Analysis (Figure 5) and a tab dedicated to the visual rendering of Wikipathways, Reactome pathways and GO terms (Figure 6).
As a user-friendly and up-to-date alternative to scripted pipelines, Omics Playground makes advanced analysis of transcriptomics and proteomics data easily accessible to casual users, while also providing bioinformaticians with a tool to delegate routine dataset analysis and focus on more challenging tasks.