Published on August 20th, 2024
Written by Thorben Sauer
Proteomics, the large-scale study of proteins, is central to understanding cellular processes, disease mechanisms, and potential therapeutic targets. Proteins are the workhorses of the cell and play a crucial role in virtually every biological process. Proteomics therefore provides valuable insight into the functional dynamics of the proteome, the complete set of proteins expressed by a genome.
In this guide, we’ll look at some of the top proteomics databases, each offering unique resources for protein expression and interaction data. Whether you’re a biologist looking for expression patterns or a bioinformatician searching for interaction networks, this guide will help you find the right database for your needs. Finally, we’ll show how to use public data in Omics Playground and how to compare your experimental data to public datasets.
Before we dive into the databases, it’s important to understand the types of data they provide:
UniProt (Universal Protein Resource) is one of the most comprehensive protein sequence and annotation databases available. While not a true protein expression database, it provides detailed, high-quality information on the function of proteins, their structures, and their roles in various biological processes.
UniProt integrates data from multiple sources and provides comprehensive protein annotations, including functional information, domain structures, post-translational modifications and variants. It consists of several components: UniProtKB (knowledgebase), UniRef (reference clusters), and UniParc (archive).
UniProt is the source of choice for reference proteomes and for the sequence databases used in mass spectrometry data processing. It provides the expertly curated UniProtKB/Swiss-Prot sequence database as well as the unreviewed UniProtKB/TrEMBL database, which is derived from translations of the EMBL (European Molecular Biology Laboratory) nucleotide sequence database and is annotated automatically.
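If you use a UniProt FASTA file as a search database, the header of each entry already indicates whether the sequence is reviewed. The following is a minimal sketch, assuming the usual UniProtKB header layout (`>db|accession|entry_name description ... GN=...`), in which `sp` marks Swiss-Prot and `tr` marks TrEMBL. The function name and the keys of the returned dictionary are illustrative choices and not part of any UniProt tool.

```python
import re

# Assumed UniProtKB FASTA header layout, for example:
# >sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")
GENE_RE = re.compile(r"\bGN=(\S+)")


def parse_uniprot_header(header):
    """Split a UniProtKB-style FASTA header into its main fields (illustrative helper)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"Not a UniProtKB-style header: {header!r}")
    db, accession, entry_name, rest = match.groups()
    gene = GENE_RE.search(rest)
    return {
        "accession": accession,
        "entry_name": entry_name,
        "reviewed": db == "sp",  # sp = Swiss-Prot (reviewed), tr = TrEMBL (unreviewed)
        "gene": gene.group(1) if gene else None,  # GN= is optional in headers
    }


example = ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
print(parse_uniprot_header(example))
```

A helper like this makes it easy to count how many identified proteins come from the reviewed part of the database before you interpret the results.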
The UniProt platform also offers practical analysis tools, such as BLAST (Basic Local Alignment Search Tool) for sequence alignment, a peptide search tool, and an ID mapping tool. UniProt can be accessed through its web interface, where users can perform searches, retrieve detailed protein information, and download data in various formats. Omics Playground accepts UniProt accession numbers/IDs as the input format for your proteomic features.
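Before uploading a feature list, it can help to check that the identifiers really look like UniProt accessions. The sketch below uses the accession pattern given in the UniProt help pages; the helper names are hypothetical, and the per-entry FASTA URL reflects the public REST interface as we understand it, so verify it against the current UniProt API documentation before relying on it.

```python
import re

# Accession pattern as documented in the UniProt help pages (6 or 10 characters).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(identifier):
    """Return True if the whole string matches the UniProt accession pattern."""
    return ACCESSION_RE.fullmatch(identifier.strip()) is not None


def split_identifiers(identifiers):
    """Separate identifiers into accession-like ones and everything else."""
    valid, rejected = [], []
    for identifier in identifiers:
        (valid if is_uniprot_accession(identifier) else rejected).append(identifier)
    return valid, rejected


def fasta_url(accession):
    """Build the per-entry FASTA URL (assumed REST layout, check the UniProt docs)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


valid, rejected = split_identifiers(["P04637", "A0A024R161", "TP53", "P53_HUMAN"])
print(valid, rejected)
```

Gene symbols and entry names end up in the rejected list here; these are the identifiers you would pass through the ID mapping tool first.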
PRIDE (Proteomics Identifications Database) is a prominent public repository for protein and peptide identifications and is part of the ProteomeXchange consortium. It contains data from a wide range of mass spectrometry-based experiments, making it a go-to resource for researchers who need raw data for re-analysis or integration.
The database provides the proteomics community not only with the protein and peptide identifications described in scientific publications, but also with the evidence supporting them, and it includes a collection of post-translational modification data. PRIDE combines raw MS files, typically from bottom-up but also from top-down and imaging experiments, with processed results such as identification and quantification files. Additional material, such as the spectral libraries and sequence databases used or analysis scripts, can also be deposited. PRIDE supports data from different species and experimental setups, which facilitates broad comparative studies, and it integrates with other resources to improve data discoverability.
To access proteomic datasets, visit the PRIDE website and either search for a dataset identifier from a scientific publication to access a specific dataset, or search for a specific disease to obtain a list of matching datasets. From the dataset page, you can download all deposited files and use them for your own analysis purposes.
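Dataset identifiers issued through ProteomeXchange for PRIDE submissions follow the pattern `PXD` plus six digits, so they are easy to pick out of a data availability statement. The following sketch assumes that pattern and the usual PRIDE Archive project page layout; the function names and the example sentence are made up for illustration.

```python
import re

# ProteomeXchange identifiers for PRIDE datasets: "PXD" followed by six digits.
PXD_RE = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text):
    """Return the unique PXD identifiers in a text, in order of first appearance."""
    seen = []
    for hit in PXD_RE.findall(text):
        if hit not in seen:
            seen.append(hit)
    return seen


def pride_project_url(accession):
    """Build the PRIDE Archive project page URL (assumed layout of the website)."""
    return f"https://www.ebi.ac.uk/pride/archive/projects/{accession}"


# Invented sentence in the style of a data availability statement.
statement = (
    "The mass spectrometry data have been deposited to ProteomeXchange via PRIDE "
    "with the dataset identifier PXD000001 (reviewers: see PXD000001)."
)
for accession in find_pxd_identifiers(statement):
    print(accession, pride_project_url(accession))
```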
PeptideAtlas aggregates mass spectrometry-based proteomics data to create a high-quality, publicly accessible peptide resource. It collects raw mass spectrometry output files from human, mouse, yeast and several other organisms and reprocesses them through a unified analysis and validation pipeline. The results are loaded into a database, and the information derived from the raw data is made available to the community through several exploration tools. Accepting only raw data as input and processing it in a consistent manner ensures high-quality results with well-understood false discovery rates (FDR).
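To make the term FDR more concrete, here is a generic target-decoy illustration. It is not the validation pipeline used by PeptideAtlas; it only shows the common idea of estimating the share of false identifications above a score threshold from the number of decoy hits. The scores and the function name are invented for the example.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits among matches scoring >= threshold.

    `psms` is a list of (score, is_decoy) tuples; higher scores are better.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so there is nothing to be wrong about
    return decoys / targets


# Invented peptide-spectrum matches: (score, is_decoy)
psms = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
for threshold in (0.85, 0.65, 0.5):
    print(threshold, fdr_at_threshold(psms, threshold))
```

Raising the threshold lowers the estimated FDR at the cost of fewer accepted identifications, which is the trade-off a consistent reprocessing pipeline controls for all datasets at once.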
You can use the PeptideAtlas website to explore the datasets. Advanced search functions allow for detailed queries, and data is available for download for custom analysis. PeptideAtlas is particularly useful for exploring information on specific peptides or proteins via the respective "Views". To extract information for many peptides or proteins at once, the database can be queried for specific parameters and multiple entries, and the results can either be explored on the web page or exported and viewed in the spreadsheet viewer of your choice. When planning a targeted proteomics experiment, it is essential to identify optimal target peptides, known as proteotypic peptides. These peptides can be extracted using the query described above.
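One property of a proteotypic peptide is that it maps to a single protein. The sketch below illustrates only this uniqueness criterion with a simple in-silico tryptic digest (cleavage after K or R, but not before P); it ignores how reliably a peptide is observed, which is exactly the empirical information PeptideAtlas adds. The toy sequences and function names are invented.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence, min_len=6):
    """Cleave after K or R unless the next residue is P; keep peptides >= min_len."""
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return [peptide for peptide in peptides if len(peptide) >= min_len]


def unique_peptides(proteins, min_len=6):
    """Map each protein to the tryptic peptides found in no other protein."""
    owners = defaultdict(set)
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence, min_len):
            owners[peptide].add(name)
    result = {name: [] for name in proteins}
    for peptide, names in owners.items():
        if len(names) == 1:
            result[next(iter(names))].append(peptide)
    return {name: sorted(peptides) for name, peptides in result.items()}


# Toy sequences, not real proteins.
proteins = {"protA": "AAAAAAKCCCCCCK", "protB": "AAAAAAKDDDDDDK"}
print(unique_peptides(proteins))
```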
The Human Protein Atlas (HPA) is a comprehensive resource that maps human proteins in cells, tissues, and organs by integrating various omics technologies.
The atlas consists of 12 separate sections, each focusing on a specific aspect of the genome-wide analysis of human proteins. It covers expression and distribution in tissues, immune cells and blood, cell type and tissue specificity (based on scRNA-seq data), subcellular localisation (based on immunofluorescence microscopy), cell line expression, protein structure, and protein-protein interactions. Tissue-specific RNA-seq expression data are available for download from the HPA. You can then use Omics Playground to visualize the data and compare it to your own. For this, you can learn how to prepare the data in this blog post.
Finally, the HPA includes pathology information, showing the impact of protein levels on the survival of cancer patients, the Disease Blood Atlas, showing the protein levels in the blood of patients with different diseases, and protein panels used for disease prediction. All of this information makes the HPA an exceptional knowledge base for assessing the clinical value of your target proteins.
You can access the HPA via its website, where you can search for specific proteins or browse expression data across different tissues and organs.
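Downloaded expression tables often come in long format, with one row per gene and tissue, whereas analysis tools usually expect a genes-by-samples matrix. The following sketch reshapes such a table using only the standard library. The column names (`Gene name`, `Tissue`, `nTPM`) are assumptions about the downloaded file and the values are invented, so check the header of your own file; the exact upload format for Omics Playground is described in the post linked above.

```python
import csv
import io
from collections import defaultdict


def long_to_wide(tsv_text, gene_col="Gene name", sample_col="Tissue", value_col="nTPM"):
    """Turn a long-format TSV (one row per gene/tissue) into a nested dict matrix."""
    matrix = defaultdict(dict)
    samples = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        sample = row[sample_col]
        if sample not in samples:
            samples.append(sample)  # preserve the order of first appearance
        matrix[row[gene_col]][sample] = float(row[value_col])
    return samples, dict(matrix)


def to_csv(samples, matrix):
    """Write the matrix as CSV with genes as rows and samples as columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["gene"] + samples)
    for gene in sorted(matrix):
        writer.writerow([gene] + [matrix[gene].get(sample, "") for sample in samples])
    return out.getvalue()


# Invented example rows; column names are assumptions about the download.
toy = (
    "Gene\tGene name\tTissue\tnTPM\n"
    "ENSG_A\tGENE1\tliver\t10.5\n"
    "ENSG_A\tGENE1\tlung\t3.0\n"
    "ENSG_B\tGENE2\tliver\t0.5\n"
)
samples, matrix = long_to_wide(toy)
print(to_csv(samples, matrix))
```

Missing gene and tissue combinations are written as empty cells rather than zeros, so you can decide later how to treat them.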
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a widely used database of known and predicted protein-protein interactions. These include direct (physical) and indirect (functional) associations, which are derived from computational prediction, high-throughput laboratory experiments, conserved co-expression, text mining and other knowledge bases.
To date, STRING covers >59 million proteins from >12,000 organisms. You can access STRING through the web interface and search the database by single or multiple protein names, amino acid sequences, protein families and more. STRING visualizes the interaction networks, allowing users to see the broader context of protein interactions and their role in cellular processes. It also provides a confidence score for each interaction, helping the user to assess the reliability of the results.
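STRING can also be queried programmatically. The sketch below builds a request URL for the network endpoint and filters a tab-separated response by confidence score. The parameter names (`identifiers`, `species`, `required_score`, `caller_identity`) and the column names (`preferredName_A`, `preferredName_B`, `score`) follow the STRING API documentation as we understand it, so verify them there; the response text and its scores are invented so that the example runs without a network connection.

```python
import csv
import io
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def network_url(identifiers, species=9606, required_score=None, caller_identity="example_script"):
    """Build a URL for the STRING network endpoint (TSV output)."""
    params = {
        "identifiers": "\r".join(identifiers),  # STRING separates names with a carriage return
        "species": species,                     # NCBI taxon ID, 9606 = human
        "caller_identity": caller_identity,     # hypothetical name identifying your script
    }
    if required_score is not None:
        params["required_score"] = required_score  # assumed scale: 0 to 1000
    return f"{STRING_API}/tsv/network?{urlencode(params)}"


def high_confidence_edges(tsv_text, min_score=0.7):
    """Keep interactions whose combined score is at least min_score (0 to 1 scale)."""
    edges = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        score = float(row["score"])
        if score >= min_score:
            edges.append((row["preferredName_A"], row["preferredName_B"], score))
    return edges


# Invented response excerpt with the assumed column names.
response = (
    "preferredName_A\tpreferredName_B\tscore\n"
    "TP53\tMDM2\t0.999\n"
    "TP53\tGENE_X\t0.45\n"
)
print(network_url(["TP53", "MDM2"]))
print(high_confidence_edges(response))
```

A cutoff of 0.7 corresponds to what STRING calls high confidence; lower cutoffs return larger but noisier networks.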
Another way to access STRING information is through Omics Playground’s beta module called PCSF (Prize-Collecting Steiner Forest), which focuses on identifying key genes that act as “hubs” in the network, connecting various biological pathways. Here, the STRING protein-protein interaction network is used as a template. You can access this module under Clustering > PCSF (beta) after activating beta features in your Omics Playground account (Video 1).
Video 1. PCSF analysis in Omics Playground.
IntAct is a high-quality, open-source database of protein-protein interactions maintained by EMBL's European Bioinformatics Institute (EMBL-EBI).
IntAct provides curated data from experimental sources and allows researchers to contribute their own interaction data. A robust and detailed curation process ensures that the interaction data is reliable. The database supports data from a variety of interaction detection methods and species.
IntAct also provides advanced tools for data visualization and analysis, including views of complex interaction networks. On the IntAct website, you can search for interaction data, view detailed annotations, and use the visualization tools.
Suppose you want to use Omics Playground to explore protein expression data from a public database. You could find a suitable dataset by searching the PRIDE database. The dataset entry usually contains a list of associated files, including the raw files and often the search engine output, all of which can be downloaded.
You can either search the raw files with the search engine of your choice and use the output as input for Omics Playground, or take the quicker route and directly use the search engine results deposited by the original researchers.
How to prepare proteomics data for Omics Playground is described here. Once the data is uploaded, you are ready to explore!
If you want to compare the results of your experimental data with a public dataset, you can use Omics Playground to do this.
Upload the experimental dataset to Omics Playground, open the Compare module and select the Compare dataset sub-module. There, you can select the previously uploaded public dataset (or any other dataset loaded into your Omics Playground) and examine how the pairwise comparisons of the two datasets relate to each other.
The sub-module is divided into three tabs: Compare expression, Foldchange, and Gene Correlation.
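The idea behind a fold-change comparison can be illustrated outside the platform as well. The sketch below is not how Omics Playground implements it; it simply correlates the log2 fold changes of proteins that were quantified in both datasets. The protein identifiers and values are invented, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


def compare_fold_changes(fc_a, fc_b):
    """Correlate log2 fold changes for the proteins shared by two datasets."""
    shared = sorted(set(fc_a) & set(fc_b))
    if len(shared) < 2:
        raise ValueError("Need at least two shared proteins")
    xs = [fc_a[protein] for protein in shared]
    ys = [fc_b[protein] for protein in shared]
    return shared, pearson(xs, ys)


# Invented log2 fold changes keyed by protein identifier.
own_data = {"P1": 1.0, "P2": 2.0, "P3": 3.0, "P4": 0.5}
public_data = {"P1": 2.0, "P2": 4.0, "P3": 6.0, "P5": 1.0}
shared, r = compare_fold_changes(own_data, public_data)
print(shared, r)
```

Only proteins present in both datasets enter the calculation, which is why consistent identifiers (for example UniProt accessions) matter when you combine your own data with a public dataset.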
Navigating the wealth of proteomics data can be daunting, but choosing the right database is critical to effective research.
For comprehensive protein sequence and annotation data, UniProt is the first choice, providing detailed information on protein functions, structures and PTMs. For raw mass spectrometry data, PRIDE and PeptideAtlas are excellent starting points, offering comprehensive datasets that support comparative studies and data re-analysis. For high-quality expression data, the Human Protein Atlas is invaluable. Researchers seeking protein-protein interaction data will find STRING and IntAct to be essential resources.
Once you have interesting protein expression datasets at your fingertips, you are only a short step away from analyzing and exploring these data yourself in the Omics Playground. You can also leverage these resources and Omics Playground to compare and validate your own research findings. Have fun exploring!
