Combining Multi-Omics Factorization Methods for Robust Biomarker Identification

Written by Ivo Kwee

Introduction

Multi-omics biomarker identification is crucial for advancing biomedical research and personalized medicine. Computational genomic approaches are increasingly used to screen large biological datasets and identify features capable of classifying or predicting phenotypes. 

While traditionally focused on single-omics data, integration of multi-omics data enhances biomarker selection by capturing additional layers of biological variation. Matrix factorization methods have emerged as a powerful tool for multi-omics analysis, as they can learn latent factors that capture significant heterogeneity across different data types. 

However, each of the current multi-omics factorization algorithms has its own strengths and weaknesses. Distinct approaches may result in different biomarker sets, so relying on a single method risks losing potentially valuable information. Reconciling the biomarker sets identified by distinct methods is, in turn, difficult and error-prone.

In this blog post we’ll address this problem by comparing multi-omics factorization methods and combining their results for robust biomarker identification.

Methods

We used a multi-omics breast cancer dataset from TCGA (The Cancer Genome Atlas) of 150 samples, comprising transcriptomics, proteomics, and microRNA profiles.

We used this dataset to: 

  1. Perform multi-omics factorization using 10 distinct algorithms or variants thereof:
    • DIABLO (data integration analysis for biomarker discovery using latent components),
    • PCA (principal component analysis),
    • MOFA (multi-omics factor analysis),
    • NMF (non-negative matrix factorization),
    • WGCNA (weighted gene correlation network analysis),
    • SGCCA (sparse generalized canonical correlation analysis),
    • SGCCDA (sparse generalized canonical correlation discriminant analysis),
    • RGCCA (regularized generalized canonical correlation analysis),
    • RGCCDA (regularized generalized canonical correlation discriminant analysis),
    • MCIA (multiple co-inertia analysis);
  2. Combine results of all methods into a variable importance measure to identify a robust set of biomarkers;
  3. Compare the distinct methods to assess concordance between methods.

Preliminary data

All methods were able to identify a set of biomarkers. However, as anticipated, the biomarker features selected by different methods often did not overlap.

It remains challenging to determine the optimal criteria for selecting the most appropriate factorization approach. For example, distinct data types or data modalities may require tailored approaches. 

We compared the methods by correlating their factor loadings (weights) and clustering the resulting correlations in a heatmap (see Figure 1).

Figure 1. Clustered correlation heatmap of multi-omics factorization methods. The correlation is measured between the ranked weights of the factor most strongly correlated with the Her2 phenotype. Methods that cluster together have similar factor weights.
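
The comparison behind Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the method names, the toy weights, and the choice to rank absolute weights (so that the arbitrary sign of a factor does not matter) are all assumptions made for the example.

```python
from math import sqrt

def rank(values):
    """Return average ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_correlation_matrix(weights):
    """weights: {method: [weight per feature]} -> {(method_a, method_b): rho}.

    Spearman correlation is computed as the Pearson correlation of ranks.
    Assumption: absolute weights are ranked, since factor signs are arbitrary.
    """
    ranked = {m: rank([abs(w) for w in ws]) for m, ws in weights.items()}
    return {(a, b): pearson(ranked[a], ranked[b]) for a in ranked for b in ranked}

# Toy weights for four hypothetical features; these are not real results.
toy_weights = {
    "method_A": [0.9, -0.1, 0.4, 0.2],
    "method_B": [0.8, 0.05, 0.5, 0.3],
    "method_C": [0.1, 0.7, -0.6, 0.2],
}
corr = rank_correlation_matrix(toy_weights)
```

In this toy example method_A and method_B rank the features identically, while method_C ranks them very differently. A hierarchical clustering routine applied to such a matrix would then place the first two methods next to each other in the heatmap.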

We found that PCA, MOFA, and NMF2 are more similar to each other than to the other methods. This can be explained by algorithm similarity. For instance, both PCA and MOFA attempt to capture the maximum variance in a small set of components or factors, each constructed as an approximate linear combination of the original variables from each data modality.
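
As a rough sketch of what this linear-combination structure looks like, factorization methods of this family can be written in the form below. The notation is illustrative and not taken from the analysis itself: X^{(m)} is the samples-by-features matrix of modality m, Z is the shared samples-by-factors score matrix, W^{(m)} is the modality-specific weight (loading) matrix, and E^{(m)} is the residual.

```latex
X^{(m)} = Z \, {W^{(m)}}^{\top} + E^{(m)}, \qquad m = 1, \dots, M
```

Under this view, PCA on the concatenated data yields orthogonal loadings, whereas MOFA places sparsity-inducing priors on the weights; in both cases the columns of the weight matrices are the factor weights that can be ranked and compared between methods.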

Also, as expected, the canonical correlation analysis (CCA) methods, including SGCCA, RGCCA, and SGCCDA, were highly correlated with each other. We also found that DIABLO, a widely used supervised factorization method, was highly correlated with SGCCDA.

Correlation analysis also revealed significant divergence between methods. For instance, both PCA and MOFA were only weakly correlated with MCIA.

To combine the methods, we computed a variable importance score for each method. An aggregated score was then calculated as the cumulative rank of the variable importances across the different algorithms (see Figure 2).

Figure 2. Multi-omics variable importance for Her2 breast cancer classification. The total variable importance is determined as the cumulative rank of each multi-omics feature across the different multi-omics factorization methods.

To define a robust set of biomarkers, we selected the best predictive features as those with the highest cumulative ranks. 
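
The cumulative-rank aggregation described above might look something like the sketch below. The feature names, method names, and scores are hypothetical, and the sketch assumes every method scores the same set of features and that ties are broken by input order; the actual analysis may handle these details differently.

```python
def ranks_ascending(scores):
    """Rank features so that the most important feature gets the highest rank.

    scores: {feature: importance}. Ties are broken by input order (assumption).
    """
    order = sorted(scores, key=scores.get)
    return {feature: position + 1 for position, feature in enumerate(order)}

def cumulative_rank(importances):
    """importances: {method: {feature: importance}} -> {feature: summed rank}."""
    total = {}
    for scores in importances.values():
        for feature, r in ranks_ascending(scores).items():
            total[feature] = total.get(feature, 0) + r
    return total

def top_features(importances, n):
    """Return the n features with the highest cumulative rank."""
    total = cumulative_rank(importances)
    return sorted(total, key=total.get, reverse=True)[:n]

# Toy importances for three hypothetical features; these are not real results.
toy_importances = {
    "method_A": {"feat_a": 0.9, "feat_b": 0.2, "feat_c": 0.5},
    "method_B": {"feat_a": 0.7, "feat_b": 0.1, "feat_c": 0.8},
    "method_C": {"feat_a": 0.6, "feat_b": 0.4, "feat_c": 0.3},
}
totals = cumulative_rank(toy_importances)
selected = top_features(toy_importances, 2)
```

Summing ranks rather than raw scores keeps methods with differently scaled importances on an equal footing, which is one reason rank-based aggregation is a natural fit when the outputs of heterogeneous algorithms are combined.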

Finally, the factor-trait correlation matrices showed that the methods differ substantially in their support, or effective dimensionality (see Figure 3).

Figure 3. Factor-trait correlation heatmaps for different multi-omics factorization methods. Orthogonal factorization methods have a more compact support or effective dimensionality.
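
A factor-trait correlation matrix of the kind shown in Figure 3 can be sketched as below, together with the selection of the factor most strongly correlated with a trait. The factor names, scores, and binary Her2 coding are invented for illustration, and plain Pearson correlation is an assumption about how the matrix is computed.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def factor_trait_correlation(factors, traits):
    """factors: {factor: [score per sample]}, traits: {trait: [value per sample]}.

    Returns a nested dict {factor: {trait: correlation}}.
    """
    return {
        f: {t: pearson(scores, values) for t, values in traits.items()}
        for f, scores in factors.items()
    }

def best_factor(factors, traits, trait):
    """Return the factor with the largest absolute correlation to the trait."""
    corr = factor_trait_correlation(factors, traits)
    return max(corr, key=lambda f: abs(corr[f][trait]))

# Toy factor scores for six hypothetical samples; these are not real results.
toy_factors = {
    "factor_1": [2.1, 1.8, 2.4, -1.9, -2.2, -2.0],
    "factor_2": [0.3, -0.5, 0.1, 0.2, -0.4, 0.3],
}
toy_traits = {"Her2": [1, 1, 1, 0, 0, 0]}

matrix = factor_trait_correlation(toy_factors, toy_traits)
selected_factor = best_factor(toy_factors, toy_traits, "Her2")
```

The weights of the selected factor are what would then be ranked and compared between methods. A method whose trait signal is concentrated in one or two factors produces a sparse row pattern in this matrix, while a method that spreads the signal over many factors produces a denser one.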

Conclusion

Combining biomarker scores from multiple multi-omics integration methods delivers more robust biomarkers and reduces the risk of information loss that comes with relying on a single method.

Key takeaways

  • Integrating transcriptomics, proteomics, and microRNA data improves biomarker identification by capturing more biological variability than single-omics approaches.
  • Ten different factorization algorithms (e.g., PCA, MOFA, NMF, DIABLO, SGCCA) were applied to multi-omics data to extract latent factors and identify biomarkers.
  • PCA, MOFA, and NMF showed high similarity; CCA-based methods clustered together; MCIA was only weakly correlated with PCA and MOFA, suggesting differing model assumptions.
  • An aggregated variable importance score was created by combining the outputs of all methods, enabling the identification of a more reliable set of predictive biomarkers.
  • Selecting the most appropriate factorization method for specific data types remains difficult due to differing algorithmic assumptions and outputs.

Multi-Omics Analysis with Omics Playground

Omics Playground’s new multi-omics beta features combine RNA, protein expression, metabolomics, and integrated pathways, empowering you to transform complex multi-omics data into actionable insights.

Using multiple methods, including MOFA, MixOmics and Deep Learning, Omics Playground ensures a comprehensive and robust analysis of your multi-omics datasets.

The multi-omics features are currently available for testing. All you have to do is log in to your account or sign up for a trial. Once in, select “Omics Playground v4 (beta)” and start uploading your data for multi-omics data analysis!

Work in biotech or pharma? Contact us to learn more.

About the Author

Ivo Kwee

Ivo Kwee holds a BSc degree in Engineering Physics, an MEng in Applied Physics and a PhD in Medical Physics. He has over 16 years of experience in bioinformatics and is currently CTO and co-founder of BigOmics Analytics, where he contributes to the mission of creating the best self-service analytics platform that enables life scientists to analyze their omics data.