Published on April 23rd, 2025
Written by Shujia Dai, Ph.D.
⏱ 12 min read
Mass spectrometry (MS)-based proteomics has emerged as the gold standard for large-scale, unbiased profiling of protein expression and post-translational modifications, enabling transformative insights in biological research [1].
However, inherent technical variability in experimental protocols, sample preparation, instrumentation, and data processing necessitates rigorous preprocessing and quality control (QC) prior to downstream bioinformatics analysis.
This article focuses on bottom-up proteomics workflows for discovery (global proteome profiling) and targeted (fixed protein panels) studies, outlining key preprocessing steps, software tools, and considerations.
Raw MS data, characterized by high complexity and volume, require sophisticated preprocessing to convert millions of spectra into reliable protein identifications and quantitative profiles.
A typical proteomics experiment starts with sample collection and preparation, after which raw MS data are generated.
Despite varied experimental protocols, a generalized preprocessing workflow (Figure 1) begins with peak detection and feature extraction, where retention times are aligned and overlapping peaks are resolved using signal processing algorithms.
Next, peptide-spectrum matches (PSMs) are generated by comparing experimental spectra against theoretical databases. To ensure accuracy, false discovery rate (FDR) estimation methods, such as target-decoy approaches [2] or machine learning models (e.g. Percolator [3]), filter out incorrect matches.
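The target-decoy idea can be made concrete with a short sketch. The snippet below is illustrative only (not a production filter): it assumes each PSM is a (score, is_decoy) pair, estimates the FDR at each score cutoff as the decoy/target ratio, and converts those estimates into monotone q-values.

```python
# Minimal target-decoy FDR sketch (illustrative, hypothetical scores).
# Each PSM is a (score, is_decoy) tuple; a higher score means a better match.

def target_decoy_qvalues(psms):
    """Return (score, q-value) for each target PSM, best scores first."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        # FDR estimate at this cutoff: decoy hits / target hits
        fdrs.append(decoys / max(targets, 1))
    # q-value: the minimum FDR achievable at or below each score cutoff
    qvals, best = [], float("inf")
    for fdr in reversed(fdrs):
        best = min(best, fdr)
        qvals.append(best)
    qvals.reverse()
    return [(score, q)
            for (score, is_decoy), q in zip(ranked, qvals)
            if not is_decoy]
```

Filtering at, say, q ≤ 0.01 then keeps only targets passing a 1% estimated FDR, which is the same logic tools like Percolator apply after rescoring.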
Subsequently, peptide-to-protein inference [4] assigns peptides to proteins, addressing challenges posed by shared sequences across isoforms.
Finally, quantitation integrates peptide-level abundances into protein-level estimates, supported by label-free or labeling-based strategies [5].
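As a minimal sketch of the peptide-to-protein rollup step, the function below aggregates peptide intensities to a protein-level estimate using a median summary. The data shape is hypothetical; real tools (e.g. MaxLFQ) use more sophisticated ratio-based aggregation.

```python
# Illustrative peptide-to-protein quantitation rollup.
# Input: (protein_id, peptide_intensity) pairs from one sample.
import statistics

def rollup(peptides):
    """Aggregate peptide intensities into a per-protein median estimate."""
    by_protein = {}
    for protein, intensity in peptides:
        by_protein.setdefault(protein, []).append(intensity)
    # Median is robust to a single outlier peptide
    return {prot: statistics.median(vals) for prot, vals in by_protein.items()}
```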
The workflow concludes with quality control and statistical analysis, including transformation, normalization, missing value imputation, and differential expression testing [6].
According to the 2023 guidelines[7] from the Human Proteome Organization (HUPO), reliable proteomics results must meet two critical criteria:
Discovery proteomics is designed for hypothesis generation (e.g. mechanism of action, target identification, and biomarker discovery), providing unbiased identification and relative quantitation of thousands of proteins. This approach primarily employs two data acquisition modes: data-dependent acquisition (DDA) and data-independent acquisition (DIA). DDA selects the most abundant precursor ions for fragmentation, prioritizing sensitivity but risking undersampling of low-abundance species. In contrast, DIA fragments all ions within predefined m/z windows, enabling retrospective analysis and enhanced reproducibility. Advances in hybrid MS systems (e.g. Orbitrap Astral, timsTOF MS) and AI-driven tools (e.g. DIA-NN, Prosit) have positioned DIA as the preferred method for large-scale studies. A limitation of discovery workflows is their reliance on spectral libraries or de novo sequencing (via DDA) for species lacking genome reference databases.
Targeted proteomics, conversely, focuses on absolute quantitation of predefined proteins or peptides, such as clinical biomarker validation in a large cohort of patients. This approach employs selected reaction monitoring (SRM), parallel reaction monitoring (PRM), or multiple reaction monitoring (MRM) to monitor specific precursors with high sensitivity. Heavy isotope-labeled internal standards ensure precise quantitation, while compatibility with both high- and low-resolution MS systems enhances flexibility. Targeted workflows are particularly suited for validation studies requiring high reproducibility and compliance with regulatory standards (e.g. GLP/GCP).
Raw MS data are generated in vendor-specific formats (e.g. .RAW for Thermo, .d for Bruker/Agilent, .WIFF for Sciex), which are often converted to open formats like mzML or mzXML using tools such as MSConvert (ProteoWizard). These standardized formats ensure compatibility with open-source preprocessing software. Public repositories like the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and ProteomeXchange provide benchmark datasets for method validation and cross-study comparisons.
Common preprocessing tools vary by study type (Figure 2). For discovery proteomics, DDA workflows leverage software such as MaxQuant, FragPipe, SpectroMine, and PEAKS. DIA workflows rely on tools like DIA-NN and Spectronaut, which utilize spectral libraries or AI-predicted spectra for library-free analysis. Targeted proteomics predominantly employs Skyline and SpectroDive, which integrate heavy isotope standards to detect their counterpart light peaks.
Quantitative strategies in proteomics vary by study type. In discovery proteomics, researchers implement either label-based or label-free methods to estimate relative protein abundance across samples. Targeted proteomics, in contrast, focuses on absolute quantitation of predefined proteins or peptides, which fits applications like clinical validation or regulatory compliance.
Label-based quantitation uses chemical (e.g. TMT or iTRAQ) or metabolic labeling (e.g. SILAC) to incorporate stable isotopes or reporter ions into peptides. These methods enable multiplexed comparisons within a single experiment, enhancing precision and throughput. For example, TMT allows simultaneous analysis of 16 or 35 samples, which is ideal for small-scale studies requiring high reproducibility. However, label-based workflows are limited by the number of available labeling channels, carrier effects [8] from boosting channels (in single-cell proteomics), and ratio compression [9].
Label-free quantitation (LFQ) directly compares unlabeled peptide intensities across multiple samples, making it suitable for large-scale studies. Common LFQ methods include: intensity-based absolute quantitation (iBAQ, using peptide intensities normalized by protein length), spectral counting (counting PSMs corresponding to each protein), and LFQ intensity (MaxLFQ algorithm). Most preprocessing tools (e.g. DIA-NN, FragPipe) support LFQ for both DDA and DIA data, offering flexibility in study design.
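The two simplest LFQ summaries above can be sketched in a few lines. This is a toy illustration with hypothetical inputs: an iBAQ-style value divides the protein's summed peptide intensity by its number of theoretically observable (e.g. tryptic) peptides, while spectral counting simply counts the matched PSMs.

```python
# Toy label-free quantitation metrics for one protein (hypothetical data).

def lfq_summaries(peptide_intensities, n_theoretical_peptides):
    """Return simple LFQ metrics: spectral count and an iBAQ-style value."""
    total = sum(peptide_intensities)
    return {
        "spectral_count": len(peptide_intensities),   # PSMs matched to protein
        "iBAQ": total / n_theoretical_peptides,       # length-normalized intensity
    }
```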
Absolute quantitation for targeted proteomics. Absolute quantitation requires heavy stable isotope-labeled peptides as internal standards to calibrate analyte concentrations. Advanced MS platforms, such as the Stellar MS (Thermo), enable simultaneous monitoring of thousands of targets within a single LC-MS/MS run (~1 hour) [10], ensuring high sensitivity and reproducibility. Proprietary software from MS vendors aligns with Good Laboratory/Clinical Practice (GLP/GCP) standards, streamlining workflows for regulated environments. The FDA's Bioanalytical Method Validation Guidance (2018) [11] provides detailed recommendations for assay development.
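In its simplest single-point form, the internal-standard calculation reduces to one ratio: the analyte concentration is the light/heavy peak-area ratio multiplied by the spiked-in amount of the heavy standard. The numbers below are hypothetical, and validated assays use multi-point calibration curves rather than this one-point sketch.

```python
# Single-point absolute quantitation with a heavy-labeled internal standard
# (illustrative; peak areas and spike amount are hypothetical).

def absolute_quant(light_area, heavy_area, spiked_fmol):
    """Endogenous (light) amount inferred from the light/heavy area ratio."""
    return (light_area / heavy_area) * spiked_fmol
```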
Quality Control. Post-quantitation, raw protein abundances undergo log2 transformation and normalization (e.g. total ion current or median normalization) to improve the data distribution. Missing values are imputed using methods such as k-nearest neighbors (kNN) or random forest (RF), though careful evaluation is required to avoid overrepresentation of artifactual changes. Statistical tools like Limma [12] and MSstats facilitate differential expression analysis, while interactive processing platforms (e.g. Protigy and SpectroPipeR) generate integrated reports with PCA, heatmaps, and clustering to identify outliers or batch effects. A representative workflow is summarized in Figure 3.
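The transformation and normalization steps can be sketched directly. The function below, a minimal illustration on a hypothetical protein-by-sample matrix, log2-transforms the abundances and then median-centers each sample (column), leaving missing values (here encoded as None) untouched for a downstream imputation step.

```python
# Log2 transform + per-sample median normalization (illustrative sketch).
# matrix: rows = proteins, columns = samples; None marks a missing value.
import math
import statistics

def log2_median_normalize(matrix):
    """Log2-transform, then subtract each sample's median abundance."""
    logm = [[math.log2(v) if v is not None else None for v in row]
            for row in matrix]
    n_samples = len(matrix[0])
    for j in range(n_samples):
        observed = [row[j] for row in logm if row[j] is not None]
        med = statistics.median(observed)
        for row in logm:
            if row[j] is not None:
                row[j] -= med   # center this sample at zero
    return logm
```

After centering, every sample's median log2 abundance is zero, which removes simple loading differences before imputation and differential testing.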
Following preprocessing, proteomics data analysis progresses through systematic workflows tailored to biological questions. Preprocessed data—structured as metadata (e.g., experimental conditions, patient demographics) and normalized protein expression matrices (fold changes, p-values)—undergo quality control to address batch effects and outliers (Figure 4).
Downstream analysis proceeds in three main steps:
The globally identified proteins, including the significantly altered proteins, feed into functional annotation and classification (e.g., Gene Ontology terms via PANTHER, domain analysis via InterPro/Pfam) and subsequent enrichment analyses. Enrichment analysis includes gene set enrichment analysis (GSEA) to detect coordinated pathway changes in ranked protein lists, over-representation analysis (ORA) to identify enriched pathways among top differential proteins, and upstream regulator analysis to infer transcription factors or kinases driving observed changes. Network analysis maps protein-protein interactions (PPI) and functional modules using tools like STRING or Cytoscape. Furthermore, co-expression analysis (e.g., WGCNA) identifies co-regulated protein clusters linked to phenotypes.
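Of the enrichment methods above, ORA is the easiest to make explicit: it asks whether a pathway contains more of the top differential proteins than expected by chance, via a hypergeometric tail probability. The sketch below uses only the standard library; the parameter names are illustrative.

```python
# Over-representation analysis p-value via the hypergeometric tail
# (illustrative; gene-set sizes below are hypothetical).
import math

def ora_pvalue(hits, selected, pathway_size, background):
    """P(X >= hits): probability that a pathway of `pathway_size` proteins
    captures at least `hits` of the `selected` differential proteins,
    drawn from `background` quantified proteins."""
    def pmf(k):
        return (math.comb(pathway_size, k)
                * math.comb(background - pathway_size, selected - k)
                / math.comb(background, selected))
    return sum(pmf(k) for k in range(hits, min(selected, pathway_size) + 1))
```

Dedicated tools apply the same test across thousands of gene sets and then correct the resulting p-values for multiple testing.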
For biomarker discovery, machine learning models (e.g., LASSO, random forests) validate candidate proteins by correlating expression patterns with clinical outcomes. Cross-validation and independent cohort testing ensure robustness, while tools like ROC curves assess diagnostic accuracy. Phenotypic classification leverages supervised learning to stratify samples (e.g., disease subtypes) based on proteomic signatures. See a schematic workflow of validation and modeling in Figure 5.
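The ROC assessment mentioned above has a compact definition worth showing: the area under the ROC curve equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (the Mann-Whitney formulation). The snippet below is a minimal sketch on hypothetical classifier scores.

```python
# ROC AUC via the Mann-Whitney U statistic (illustrative sketch).

def roc_auc(scores, labels):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as half a win. labels: 1 = positive, 0 = negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means the candidate biomarker separates the groups perfectly; 0.5 means it is no better than chance.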
Over the past five years, emerging hybrid MS systems—such as the Orbitrap Astral MS (Thermo Fisher Scientific) and timsTOF MS (Bruker)—have achieved remarkable improvements in sensitivity, dynamic range, scan rate, and resolution. These advancements have led to an exponential increase in data volume and complexity, significantly enhancing proteome coverage, depth, and throughput. As a result, it is now possible to identify and quantify thousands of proteins across diverse biological samples with unprecedented accuracy and speed. However, these innovations have also introduced new challenges in data analysis.
Data management now demands cloud-based solutions (e.g. AWS/S3) to handle terabyte-scale datasets, ensuring secure storage, standardized data formats and seamless integration with other tools, like electronic lab notebooks (Benchling). High-performance preprocessing relies on optimized algorithms (e.g. cloud-based parallelization) and high-performance computing (HPC) environments to process billions of spectra efficiently.
A critical frontier is the development of integration methods that bridge proteomics data with multi-omics data types [13] (genomics, metabolomics) which are essential for holistic insights into biological systems, particularly in drug discovery and translational research.
Some of these challenges are already being tackled by BigOmics through its Omics Playground [14] platform.
Omics Playground is a cloud-based platform designed for intuitive analysis and visualization of proteomics data. It empowers researchers to transform complex omics datasets into meaningful biological insights quickly, reliably, and at scale.
Once your data has been pre-processed and abundance tables are ready, simply upload them to Omics Playground to interactively explore patterns, trends, and correlations.
Built with flexibility in mind, Omics Playground’s modular design lets you tailor your analysis workflow to your specific research questions. Whether you’re conducting differential expression analysis, pathway enrichment, biomarker analysis, or dimensionality reduction, the platform supports a wide range of methods through 18+ analytical modules and over 100 interactive visualizations.
Why Omics Playground?
To learn more about the types of analyses you can perform, read How to Analyze Proteomics Data Using Omics Playground.
Unlock the Full Potential of Your Proteomics Data
Shujia Dai, Ph.D., is a multidisciplinary expert in mass spectrometry and proteomics with over 19 years of experience in drug discovery and development. He has held key roles at ProFound Therapeutics, Sanofi, and Northeastern University, leading innovative proteomics initiatives and developing advanced multi-omics platforms for phenotypic profiling, target identification, and biomarker discovery. His work spans oncology, immunology, and multiple therapeutic modalities, with a strong focus on translating proteomic insights into therapeutic strategies.
Dr. Dai holds a Ph.D. in Analytical Chemistry and has authored over 40 peer-reviewed publications, successfully bridging academic rigor and industry impact to address complex biological challenges.
[1] Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003 Mar 13;422(6928):198-207.
[2] Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007 Mar;4(3):207-14.
[3] Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods. 2007 Nov;4(11):923-5.
[4] Nesvizhskii AI, Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics. 2005 Oct;4(10):1419-40.
[5] Matthiesen R, Carvalho AS. Methods and Algorithms for Quantitative Proteomics by Mass Spectrometry. Methods Mol Biol. 2020;2051:161-197.
[6] Kohler D, Staniak M, Tsai TH, Huang T, Shulman N, Bernhardt OM, MacLean BX, Nesvizhskii AI, Reiter L, Sabido E, Choi M, Vitek O. MSstats Version 4.0: Statistical Analyses of Quantitative Mass Spectrometry-Based Proteomic Experiments with Chromatography-Based Quantification at Scale. J Proteome Res. 2023 May 5;22(5):1466-1482.
[7] Omenn GS, Lane L, Overall CM, Lindskog C, Pineau C, Packer NH, Cristea IM, Weintraub ST, Orchard S, Roehrl MHA, Nice E, Guo T, Van Eyk JE, Liu S, Bandeira N, Aebersold R, Moritz RL, Deutsch EW. The 2023 Report on the Proteome from the HUPO Human Proteome Project. J Proteome Res. 2024;23(2):532-549.
[8] Dwivedi P, Rose CM. Understanding the effect of carrier proteomes in single cell proteomic studies – key lessons. Expert Rev Proteomics. 2022 Jan;19(1):5-15.
[9] Ting L, Rad R, Gygi SP, Haas W. MS3 eliminates ratio distortion in isobaric multiplexed quantitative proteomics. Nat Methods. 2011 Oct 2;8(11):937-40.
[10] Plubell DL, Remes PM, Wu CC, Jacob CC, Merrihew GE, Hsu C, Shulman N, MacLean BX, Heil L, Poston K, Montine T, MacCoss MJ. Development of highly multiplex targeted proteomics assays in biofluids using the Stellar mass spectrometer. bioRxiv [Preprint]. 2024 Jun 11:2024.06.04.597431.
[11] Bioanalytical Method Validation Guidance for Industry, May 2018, https://www.fda.gov/regulatory-information/search-fda-guidance-documents/bioanalytical-method-validation-guidance-industry
[12] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015 Apr 20;43(7):e47.
[13] Simchi N, Givton O, Rinberg J, Shtrikman A, Geiger T, Nesvizhskii AI, Seger E, Pevzner K. A Novel Algorithm for the Harmonization of Pan-cancer Proteomics. bioRxiv [Preprint]. 2025.03.17.642820.
[14] Akhmedov M, Martinelli A, Geiger R, Kwee I. Omics Playground: a comprehensive self-service platform for visualization, analytics and exploration of Big Omics Data. NAR Genom Bioinform. 2019 Dec 6;2(1):lqz019. doi: 10.1093/nargab/lqz019. PMID: 33575569; PMCID: PMC7671354.
