How to Identify Biomarkers:
Definition, Examples, and Algorithms for Detection

by Axel Martinelli


Introduction

Novel biomarker discovery is crucial to many areas, including clinical diagnostics and drug development.

The Food and Drug Administration (FDA) defines a biomarker as ‘a defined characteristic that is measured as an indicator of normal biological processes, pathogenic processes, or responses to an exposure or intervention, including therapeutic interventions.’

A biomarker could be almost any objective and quantifiable functional, physiological, biochemical, or molecular measurement. Examples of molecular biomarkers include the presence of proteins in the blood, such as prostate-specific antigen (PSA) used in the diagnosis of prostate cancer, or the presence of mutations in tumor suppressor genes, like those in BRCA1 or BRCA2, predictive of breast cancer risk.

Today, the growing affordability of large-scale, data-rich omics technologies like proteomics and transcriptomics has led to vast treasure troves of data that are primed for biomarker analyses. By analyzing these data with advanced machine learning approaches, researchers have identified features, such as protein or gene expression signatures, associated with disease and showing predictive power for diagnosis, prognosis, or response to therapeutics (1,2).

To improve the success rate and efficiency of drug development, we need new informative biomarkers that identify novel druggable targets or can highlight toxicity or safety concerns early in the development pipeline.

However, efficient and reliable identification of molecular biomarkers from complex omics data requires substantial knowledge of the most appropriate methods and their implementation. Today, tools such as Omics Playground integrate such data analysis and visualization methods, making biomarker discovery more interactive and accessible to scientists.

Discovering new molecular biomarkers with robust computational and statistical approaches has considerable potential for personalized medicine and patient care.

In this blog, we provide guidance on molecular biomarker analysis for biologists and bioinformaticians. We highlight some available methods and discuss where machine learning approaches offer advantages over traditional methods.

What is Biomarker Identification?

Molecular biomarker analysis refers to the process of discovering and verifying a specific gene or protein signature that can be used as a quantifiable and defined characteristic relevant to a desired outcome. These outcomes could be to diagnose a disease or to indicate the toxic effects of a therapeutic intervention.

At present, only part of the biomarker discovery process implemented in pharmaceutical or academic pipelines is solely based on computational analyses. 

Typically, initial computational predictions are performed on large-scale multi-omics data followed by proof-of-concept and functional genomics screens. The early stages of discovery largely rely on selecting and implementing the most appropriate advanced algorithms for the data and desired use.

What Makes a Good Biomarker?

In drug discovery, there is a need to increase the efficiency of drug development pipelines and improve the success rate of compounds reaching the clinic. Novel preclinical biomarkers identifying drug-induced toxicity early in the pipeline are crucial to achieving these goals.

The best biomarkers should have four key attributes in line with these aims.

1. Easy to access

The biomarker is present in peripheral tissue or biological fluids such as blood, urine, or saliva and requires minimally invasive collection.

2. Easy to detect

The biomarker is easy to detect, for example a highly expressed gene panel or an abundant protein, whether it is measured for clinical diagnosis or in response to treatment.

3. Specific and quantifiable

The biomarker must be as specific as possible for the perturbation caused in response to treatment or disease and should be easy to measure.

4. Robust to validation

The biomarker is highly robust such that it is successfully validated in independent assays. Reliable biomarkers are highly replicable.

How are Biomarkers Identified?


The exact process of biomarker analysis depends largely on the study’s desired outcome, such as identifying a predictive, diagnostic, or safety biomarker, and on the type of data to be analyzed (genomic, transcriptomic, proteomic, etc.). Additional layers of complexity arise when determining biomarkers for clinical purposes, which must undergo rigorous analytical or clinical validation to ensure they are fit-for-purpose.

Broadly, there are some crucial aspects to consider for successful biomarker analysis and discovery.

1. Appropriate study design

It is crucial to ensure the study is adequately powered for statistically and biologically meaningful detection of biomarkers, with sufficient sample numbers from the outset. Failure to have a sufficiently powered study could result in missing important biomarkers and/or lead to false positive findings.

The most appropriate technique to detect potential biomarkers should be selected. This could be transcriptomics, proteomics, or other omics approaches or combinations. The sample source should also be carefully selected. This could be essentially any biological sample, including non-invasive liquid biopsies, blood samples, or cultured cells.
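To make the point about statistical power concrete, here is a minimal sketch of a sample-size estimate for a two-group comparison. It assumes a normal approximation and a Bonferroni correction for the number of features tested; the function name `samples_per_group` and its parameters are illustrative, and a real study would normally use a dedicated power-analysis tool suited to the data type.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8, n_features=1):
    """Approximate samples per group for a two-group comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d)^2,
    where d is the standardized effect size. When many features are
    tested, alpha is divided by their number (Bonferroni, conservative).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    adjusted_alpha = alpha / n_features
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Same effect size, tested as one candidate marker versus a 20,000-gene screen.
print(samples_per_group(1.0))
print(samples_per_group(1.0, n_features=20_000))
```

The second call needs noticeably more samples than the first, which is why the number of features screened should be part of the study design from the outset.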

2. Data quality, standardization, and preprocessing

Next, ensuring suitable data quality is vital. Poor-quality data will make it much more challenging to discover accurate and reliable biomarkers for your desired purpose. Data curation with quality control and filtering of samples based on standardized or custom criteria are important initial steps in data processing pipelines. Appropriate preprocessing of data will increase the likelihood of subsequent analysis steps being successful.
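As an illustration of what such preprocessing can look like for count data, the sketch below filters out barely detected genes and converts the rest to log2 counts per million. The thresholds, gene names, and function names are assumptions chosen for the example, not recommended defaults.

```python
from math import log2


def filter_low_expression(counts, min_count=10, min_samples=2):
    """Keep genes with at least min_count reads in at least min_samples samples."""
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_samples
    }


def log_cpm(counts, prior=1.0):
    """Convert raw counts (gene -> per-sample counts) to log2 counts per million."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total reads per sample across the genes provided.
    library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [log2(row[j] / library_sizes[j] * 1e6 + prior) for j in range(n_samples)]
        for gene, row in counts.items()
    }


# Hypothetical raw counts for three genes across four samples.
raw = {
    "GENE_A": [500, 620, 30, 25],
    "GENE_B": [0, 1, 0, 2],  # barely detected, removed by the filter
    "GENE_C": [100, 90, 110, 95],
}
kept = filter_low_expression(raw)
normalized = log_cpm(kept)
print(sorted(normalized))
```

Here the library sizes are computed after filtering; computing them on the unfiltered counts is an equally common choice.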

3. Select relevant modeling methods

Following data preprocessing, appropriate statistical and machine-learning methods must be chosen for biomarker analysis. The Oxford English Dictionary defines machine learning as “the use and development of computer systems that are able to learn and adapt without following explicit instructions by using algorithms and statistical models to analyze and draw inferences from patterns in data.” Analysis of proteomic and transcriptomic data with machine learning algorithms has led to invaluable discoveries of novel biomarkers for countless disorders, including Alzheimer’s Disease (3). The modeling method chosen will depend on the objective and design of your study and on the goals of the analysis, for example whether you need a probabilistic model of the data.

For instance, to compute a combined variable importance score for each feature, Omics Playground uses a combination of multiple machine learning algorithms to allow the selection of the best possible predictive features.
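One simple way to combine importance scores from several algorithms is to average each feature's rank across methods. The sketch below shows that idea only; it is not a description of how Omics Playground computes its score, and the method names and numbers are made up for illustration.

```python
def combined_importance(scores_by_method):
    """Average each feature's normalized rank across methods.

    scores_by_method maps a method name to {feature: importance}.
    Returns {feature: score in [0, 1]}; 1 means top-ranked by every method.
    """
    features = sorted(next(iter(scores_by_method.values())))
    n = len(features)
    totals = dict.fromkeys(features, 0.0)
    for scores in scores_by_method.values():
        # Rank by absolute importance so signed coefficients are comparable.
        ordered = sorted(features, key=lambda f: abs(scores[f]), reverse=True)
        for rank, feature in enumerate(ordered):
            totals[feature] += (1.0 - rank / (n - 1)) if n > 1 else 1.0
    return {f: totals[f] / len(scores_by_method) for f in features}


# Hypothetical importance scores from three different models.
scores = {
    "glmnet": {"GENE_A": 0.9, "GENE_B": 0.0, "GENE_C": -0.4},
    "random_forest": {"GENE_A": 0.5, "GENE_B": 0.1, "GENE_C": 0.4},
    "xgboost": {"GENE_A": 0.3, "GENE_B": 0.05, "GENE_C": 0.65},
}
combined = combined_importance(scores)
for gene in sorted(combined, key=combined.get, reverse=True):
    print(gene, round(combined[gene], 3))
```

Ranks are used instead of raw scores because importance values from different algorithms are not on the same scale.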

4. Validation analysis

Regardless of the study design and algorithm(s) employed for prioritization and identification, a biomarker can only be trusted to move on to subsequent phases of the discovery process upon validation in independent assays and datasets. Validation is an essential component of the rigorous analytical pipeline as it ensures that a biomarker is robust and fit for purpose for effective clinical application. Typically, validation is a multi-factorial process in which distinct, quantifiable criteria are examined to produce an overall measure of a biomarker’s quality and replicability.
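One quantifiable criterion often used at this stage is the area under the ROC curve (AUC), compared between the discovery data and an independent cohort. The sketch below computes AUC directly from its definition; the cohort labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Probability that a random case scores higher than a random control."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # Count case/control pairs ranked correctly; ties count as half.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical biomarker scores: 1 = disease, 0 = healthy.
discovery_labels = [1, 1, 1, 0, 0, 0]
discovery_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
validation_labels = [1, 1, 1, 0, 0, 0]
validation_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

discovery_auc = roc_auc(discovery_labels, discovery_scores)
validation_auc = roc_auc(validation_labels, validation_scores)
print(f"discovery AUC {discovery_auc:.2f}, validation AUC {validation_auc:.2f}")
```

A marked drop in performance between the discovery and validation cohorts is a warning sign that the signature may not replicate.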

Biomarker Analysis Algorithms

sPLS (sparse Partial Least Squares)

sPLS simultaneously combines integration and variable selection on two data sets (2). It aims to find a linear regression model by projecting the predictor and response variables into a new, lower-dimensional space, while a sparsity constraint keeps only a subset of the variables in the model. Linear regression aims to model the dependence relationship between one target variable and multiple independent or explanatory variables.
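The variable-selection step can be illustrated with a heavily simplified sketch: for a single response, the first PLS direction is proportional to the covariance of each feature with the response, and a soft-threshold sets the weakest weights to zero. The function name, the `eta` parameter, and the data are assumptions for this example; a full sPLS implementation computes several components and tunes the sparsity level.

```python
from math import sqrt


def first_sparse_direction(X, y, eta=0.5):
    """First sparse PLS weight vector for a single response (simplified).

    X is a list of sample rows, y a list of responses. Features whose
    covariance with y is below eta * (largest covariance) get weight 0.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Ordinary PLS direction: covariance of each centered feature with y.
    cov = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    threshold = eta * max(abs(c) for c in cov)
    # Soft-thresholding shrinks every weight and zeroes out the weak ones.
    w = [
        (abs(c) - threshold) * (1 if c > 0 else -1) if abs(c) > threshold else 0.0
        for c in cov
    ]
    norm = sqrt(sum(v * v for v in w))
    return [v / norm for v in w] if norm > 0 else w


# Hypothetical data: features 0 and 2 track the response, feature 1 is noise.
X = [[1, 2, 4], [2, 1, 3], [3, 2, 2], [4, 1, 1]]
y = [1, 2, 3, 4]
weights = first_sparse_direction(X, y)
print([round(v, 3) for v in weights])
```

Features with a weight of exactly zero drop out of the model, which is what makes the result usable as a shortlist of candidate biomarkers.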

XGBoost (eXtreme Gradient Boosting)

XGBoost is a gradient-boosting algorithm, a form of ensemble learning (4). Ensemble learning enlists many models to make predictions together; boosting does so by building a sequence of weak models, each correcting the errors of the previous ones, into an increasingly powerful combined model. It operates on decision trees that examine the input under various “if” statements. The algorithm progressively adds more trees, and with them more “if” conditions, to build a stronger model.
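The boosting idea can be sketched in a few lines with one-split decision trees ("stumps") fitted to the residuals of the model so far. This is plain gradient boosting for squared error on a single feature, written for illustration; XGBoost itself adds regularization, second-order gradient information, and many engineering optimizations. All names and data below are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single 'if x <= t' split that best fits the residuals."""
    best = None
    for t in sorted(set(x))[:-1]:  # the largest value would leave one side empty
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        t, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((t, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if v <= t else right_mean)
            for v, p in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, value):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left_mean if value <= t else right_mean for t, left_mean, right_mean in stumps
    )


def training_mse(model, x, y):
    return sum((predict(model, v) - yi) ** 2 for v, yi in zip(x, y)) / len(y)


# Hypothetical expression levels of one gene and a continuous response.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
response = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
for rounds in (1, 5, 20):
    model = boost(expression, response, n_rounds=rounds)
    print(rounds, round(training_mse(model, expression, response), 4))
```

The training error shrinks as rounds are added, which is also why boosted models need a held-out set or cross-validation to decide when to stop.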

Random forest

Random forest is another ensemble learning algorithm (5). It grows and combines the output of multiple decision trees to reach a single result and can be used for classification or regression tasks. This supervised machine learning algorithm allows for accurate and stable results by relying on many decision trees rather than a single tree.
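The two ingredients that distinguish a random forest, bootstrap resampling of samples and a random subset of features per tree, can be shown with a toy forest of depth-one trees. Real implementations grow much deeper trees and re-sample the feature subset at every split; the data and function names here are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def best_stump(X, y, features):
    """Best one-split tree over the given features (fewest training errors)."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X})[:-1]:
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(label != left_label for label in left)
                      + sum(label != right_label for label in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    return best[1:] if best else None


def fit_forest(X, y, n_trees=51, seed=0):
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    n_tried = max(1, round(n_features ** 0.5))  # features considered per tree
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]  # bootstrap
        features = rng.sample(range(n_features), n_tried)  # random subset
        stump = best_stump([X[i] for i in rows], [y[i] for i in rows], features)
        if stump is not None:  # skip resamples with nothing to split on
            forest.append(stump)
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    return majority(
        left_label if row[j] <= t else right_label
        for j, t, left_label, right_label in forest
    )


# Hypothetical data: feature 0 separates the classes, the others are noise.
X = [[1.0, 5, 2, 7], [1.2, 3, 9, 1], [0.8, 8, 4, 6], [1.1, 2, 7, 3],
     [3.0, 4, 3, 8], [3.3, 7, 8, 2], [2.9, 1, 1, 5], [3.1, 6, 6, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
split_counts = Counter(j for j, _, _, _ in forest)
print(split_counts.most_common())
print([predict(forest, row) for row in X])
```

Counting how often each feature is chosen for a split gives a crude importance measure; production libraries usually report impurity-based or permutation-based importances instead.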

Glmnet

In advanced data analytics, a key aim is to build predictive models without overfitting the data. Overfitting can occur when models become too complex and fit the training data too closely, so that they make accurate predictions on the training data but not on new, unseen data (6). Glmnet models are an extension of generalized linear models that reduce overfitting by using regularized regression to add a penalty term to the objective function and are suitable for high-dimensional datasets (7).
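For reference, the penalized objective for the linear (Gaussian) case, as given in (7), can be written as follows. The symbols are introduced here for illustration: N is the number of samples, λ ≥ 0 controls the overall penalty strength, and α in [0, 1] mixes the ridge penalty (α = 0) and the lasso penalty (α = 1).

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]
```

With α close to 1, many coefficients are driven to exactly zero, so the fitted model doubles as a feature selector. That is what makes it attractive for biomarker shortlisting.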

Similarities and Differences Between Differential Gene Expression (DGE) Testing and Biomarker Selection

One of the most common uses of RNA-seq data is to perform differential gene expression and pathway enrichment analyses to find genes and pathways expressed at significantly different levels between two or more sample sets. While this approach has provided valuable information in the discovery of novel gene expression changes of potential biomarkers, the analyses are generally univariate, focusing on simplified comparisons and individual gene expression.
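To show what "univariate" means in practice, the sketch below scores each gene on its own with a log2 fold change and Welch's t statistic. It is not a replacement for dedicated differential expression tools, which model count data and correct for multiple testing; the gene names and values are invented, and each group needs at least two samples with non-zero variance.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two groups of log2 expression values."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_genes(expression, is_case):
    """Score each gene independently of all the others.

    expression maps gene -> log2 expression per sample; is_case flags cases.
    Returns (gene, log2 fold change, t statistic), strongest signal first.
    """
    results = []
    for gene, values in expression.items():
        cases = [v for v, flag in zip(values, is_case) if flag]
        controls = [v for v, flag in zip(values, is_case) if not flag]
        log2_fc = mean(cases) - mean(controls)  # difference of log2 means
        results.append((gene, log2_fc, welch_t(cases, controls)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


# Hypothetical log2 expression: three treated samples, then three controls.
expression = {
    "GENE_A": [8.1, 8.3, 7.9, 6.0, 6.2, 5.8],
    "GENE_B": [5.0, 5.4, 4.9, 5.1, 5.2, 5.0],
}
is_case = [True, True, True, False, False, False]
ranked = rank_genes(expression, is_case)
for gene, log2_fc, t in ranked:
    print(gene, round(log2_fc, 2), round(t, 2))
```

Because each gene is scored in isolation, this kind of analysis cannot tell whether two genes carry redundant information or are only predictive in combination.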

If you want more information on how to perform differential gene expression analysis or enrichment analysis, you can check our tutorial guides.

In more complex cases, a single gene may not be enough as a biomarker and multiple genes may perform better. A representative example is the FDA-cleared PAM50 multi-gene signature, which profiles the expression of a multi-gene panel in breast cancer tissues. A combined assessment of the PAM50 multi-gene signature enables molecular-based classification of distinct tumor subtypes and provides a metric of the likelihood of risk for cancer recurrence, potentially supporting personalized treatment decisions.

Combining clinical with molecular omics biomarkers is an interesting problem. Clinical data such as age, sex, weight, BMI, blood pressure, and imaging are routinely used in the clinic for diagnosis and monitoring disease progression. Emerging research aims to investigate whether a combination of omics and clinical biomarkers can create more precise models for disease risk that may find application in personalized medicine.

Many biomarkers have been discovered with traditional differential expression analysis tools. Still, the power of complex machine learning algorithms to identify combinations of key genetic biomarkers in complex multivariate datasets is steadily being realized. For instance, these analyses may help unravel how multiple genes interact in distinct biological pathways to better indicate patient prognosis, survival, or drug toxicity.

Combining different machine learning techniques to provide a composite result takes advantage of the strengths and overcomes the weaknesses of the individual methods, providing robust biomarkers that are challenging to discover with single, traditional methods. This powerful combined approach is now available in Omics Playground at the click of a button. Results are fully reproducible with no coding skills required.

If you’d like to know more about how to perform biomarker analysis using Omics Playground, you can check our blog post: “Master Biomarkers Analysis with Omics Playground: A Step-by-Step Tutorial”.

Biomarker Selection tab in Omics Playground

Fast-track the prediction of biomarkers in your expression and proteomics data with Omics Playground. Perform your Biomarkers Analysis in a few clicks.

About the Author

Axel Martinelli

Axel Martinelli’s academic background is in molecular biology and parasitology. He earned a Ph.D. on the genetics of strain-specific immunity against malaria infections and a master’s degree in bioinformatics with specialization in the analysis of omics data. During his postdoctoral career, he worked on genomics and transcriptomics studies and is currently the head of biology at Bigomics Analytics.

References

  1. Merry E, Thway K, Jones RL, Huang PH. Predictive and prognostic transcriptomic biomarkers in soft tissue sarcomas. NPJ Precision Oncology. 2021 Mar 5;5(1):17.
  2. Chun H, Keleş S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2010 Jan;72(1):3-25.
  3. Tan MS, Cheah PL, Chin AV, Looi LM, Chang SW. A review on omics-based biomarkers discovery for Alzheimer’s disease from the bioinformatics perspectives: statistical approach vs machine learning approach. Computers in Biology and Medicine. 2021 Dec 1;139:104947.
  4. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016 Aug 13 (pp. 785-794).
  5. Breiman L. Random forests. Machine Learning. 2001 Oct;45:5-32.
  6. Bejani MM, Ghatee M. A systematic review on overfitting control in shallow and deep neural networks. Artificial Intelligence Review. 2021 Dec;54(8):6391-438.
  7. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1.