Published on March 2nd, 2023
The wealth of publicly available RNA-sequencing (RNA-seq) and single-cell RNA-seq (scRNA-seq) data has empowered biologists to contextualize their own data and findings, generate informed data-driven hypotheses, and discover trends across diverse studies. If you are new to the field, however, the sheer amount of data at your fingertips can be daunting!
Today, most journals and funding agencies require scientists to deposit all generated genomic/sequencing data into public repositories alongside publication. Some groups also create portals or Shiny apps to display their data in interactive formats, making it more accessible to a wider audience. Code repositories like GitHub have also made it easier to share the code developed for genomic analyses, promoting reproducibility and open access.
The ways in which scientists can engage with published RNA-seq data are growing at an amazing pace. However, understanding how to access and utilize public RNA-seq datasets can be difficult if you don’t know where to look or how to find data relevant to your interests.
Here, I outline some of the most comprehensive databases for bulk RNA-seq data, followed by those for scRNA-seq data.
The Gene Expression Omnibus (GEO) is a broad, NIH-hosted repository of gene expression data generated across multiple platforms (e.g., microarray, bulk RNA-seq, scRNA-seq) and from multiple organisms.
When publishing new RNA-seq data, authors often choose GEO for data deposition due to its inclusiveness and NIH oversight. Importantly, GEO interfaces with another NIH database called the Sequence Read Archive (SRA). SRA hosts the raw sequencing data (i.e., FASTQs) associated with GEO entries and other datasets, making it straightforward to download count matrices, FASTQs, or both, depending on your questions and interests.
While not the prettiest database, GEO has a thorough advanced search function to help you narrow down the datasets you want to find. For example, you can search on metadata such as organism, experimental variables, study author, and number of samples. Once you have found a study whose data you want to access, the “GEO Accession” page will contain information on the experimental design, the associated publication, and links to the data for download.
At the bottom of each accession page, you can find links to various parts of the dataset. Typically, RNA-seq count matrices are provided under Supplementary file and can be downloaded from your browser using the (http) link. Also take note of the Series Matrix File(s), which contain the associated metadata for each sample included in the study. Finally, clicking on the SRA Run Selector will bring you to the associated SRA collection with links to FASTQ files.
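If you prefer to read the sample metadata programmatically, here is a minimal Python sketch. It assumes the Series Matrix file is tab-separated text in which per-sample metadata lines start with `!Sample_` and values are quoted; the function name `parse_series_matrix_metadata` and the accession numbers in the example are made up for illustration.

```python
import csv
import io


def parse_series_matrix_metadata(text):
    """Return {sample_accession: {field: value}} from Series Matrix text.

    Assumes per-sample metadata lines look like:
    !Sample_<field> <tab> "value for sample 1" <tab> "value for sample 2" ...
    """
    fields = []  # (field_name, [one value per sample]) in file order
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].startswith("!Sample_"):
            continue  # skip series-level lines and the expression table
        fields.append((row[0][len("!Sample_"):], row[1:]))

    accessions = next(v for name, v in fields if name == "geo_accession")
    samples = {acc: {} for acc in accessions}
    for name, values in fields:
        for acc, value in zip(accessions, values):
            # Fields that repeat (e.g. characteristics_ch1) are joined with "; "
            prev = samples[acc].get(name)
            samples[acc][name] = value if prev is None else prev + "; " + value
    return samples


# Hypothetical two-sample file content (accessions are placeholders)
example = (
    '!Series_title\t"Demo study"\n'
    '!Sample_title\t"ctrl_1"\t"treated_1"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_characteristics_ch1\t"tissue: liver"\t"tissue: liver"\n'
    '!Sample_characteristics_ch1\t"treatment: none"\t"treatment: drug"\n'
)
metadata = parse_series_matrix_metadata(example)
print(metadata["GSM0000002"]["characteristics_ch1"])
```

For a real download you would decompress the file first and pass its contents to the function; the resulting dictionary can then be matched against the column names of the count matrix.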
In an effort to collate and normalize GEO and SRA data across diverse studies, the Ma’ayan Lab at the Mount Sinai Center for Bioinformatics has released ARCHS4.
ARCHS4 makes available the results of uniformly processed RNA-seq data from mouse and human in an interactive exploration interface, allowing the user to select samples across studies based on tissue type/cell line, gene set enrichment (e.g., GO or KEGG), or a user-provided list of signature genes. Once you have selected a subset of samples, ARCHS4 provides a small R script that can be run locally to download a gene expression matrix with all of the requested samples.
Access to the popular ARCHS4 database is also provided within Omics Playground for comparative analysis against users’ experimental datasets.
EMBL-EBI hosts a suite of NGS tools and databases, including the Expression Atlas, which has explorable and downloadable RNA-seq datasets from many organisms, tissues, and diseases.
While not as large as GEO, the Expression Atlas has some high-level annotation of datasets to enhance browsing and searching. Datasets are categorized as “baseline” or “differential” depending on the experimental design.
Baseline studies assess gene expression in various tissues at steady state, while differential studies compare two or more conditions (e.g., disease or gene knock-out). When browsing experiments, you can easily filter studies based on experimental factors (e.g., time or disease).
Once you have selected a study, you are often presented first with a “Results” page that offers a basic exploration of the data with a heatmap and gene selection. Additional tabs for “Experimental Design”, “Supplementary Information”, and “Downloads” contain a metadata table, information on data processing, and links to raw and normalized RNA-seq data matrices, respectively.
The Genotype-Tissue Expression (GTEx) project houses both bulk and single-cell RNA-seq data from humans, organized by tissue to enable both within-tissue and cross-tissue analyses.
GTEx is hosted by the Broad Institute of MIT and Harvard but is contributed to by scientists across the world. To get started, the portal has several exploration tools built in. Users can easily browse for expression of a specific gene across tissues, look in detail at the sample makeup of a particular tissue in the database, and even browse histology data to corroborate gene expression.
Open-access RNA-seq data can be easily downloaded from the portal and is split up by tissue type, with both counts and TPM data available. Unfortunately, GTEx doesn’t have a way to further subset its database prior to download, so you will have to do any further sample selection yourself.
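As a rough illustration of that post-download sample selection, the Python sketch below keeps only chosen sample columns. It assumes the expression file follows a GCT-style layout (a version line, a dimensions line, then a header beginning with `Name` and `Description`); the function name `subset_gct`, the gene identifiers, and the sample IDs are placeholders, so check the layout of the file you actually downloaded.

```python
import csv
import io


def subset_gct(text, keep_samples):
    """Keep only the selected sample columns from GCT-style text.

    Returns a list of rows (header first), each a list of strings.
    """
    lines = io.StringIO(text)
    next(lines)  # version line, e.g. "#1.2"
    next(lines)  # dimensions line: number of genes <tab> number of samples
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    keep = set(keep_samples)
    # Columns 0 and 1 hold gene annotations; the remaining columns are samples
    idx = [0, 1] + [i for i, s in enumerate(header) if i > 1 and s in keep]
    return [[row[i] for i in idx] for row in [header, *reader]]


# Hypothetical miniature file with three samples and two genes
gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tSAMPLE-A\tSAMPLE-B\tSAMPLE-C\n"
    "ENSG_X\tGENE1\t10\t0\t5\n"
    "ENSG_Y\tGENE2\t3\t8\t1\n"
)
subset = subset_gct(gct, ["SAMPLE-A", "SAMPLE-C"])
for row in subset:
    print("\t".join(row))
```

The list of sample IDs to keep would typically come from filtering the sample annotation table that accompanies the expression data. For full-size files, reading line by line from disk rather than from one string would be kinder to memory.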
In addition to bulk RNA-seq data, GTEx also houses the data from a large, multi-tissue single-nucleus RNA-seq dataset (Eraslan et al., Science, 2022) that can be downloaded or explored using the “Multi-Gene Single Cell Query”. Additionally, quantitative trait loci (QTL) data are available to browse across tissues.
The Cancer Genome Atlas (TCGA) and its associated Genomic Data Commons (GDC) Data Portal form a regularly updated, NIH-maintained repository for sequencing data and related files from human cancer studies.
The portal allows for highly interactive exploration of the datasets in the repository, including breaking down studies by data availability, primary cancer site, and other metadata. It also has a suite of analysis tools to compare metadata and clinical data across studies.
To get RNA-seq data, users can start from either the “Exploration” or “Repository” browser, though I suggest you start from Exploration. Once in Exploration, you can select filters like Primary Site, Project, and Disease Type to reduce the number of studies to those of interest. You can then click the “View Files in Repository” button on the top right to see the associated data available.
Once in the Repository, you should select “transcriptome profiling” under Data Category and “Gene Expression Quantification” under Data Type to subset to RNA-seq data files. To actually download the data, you need to click on “Add All Files to Cart”, from where you can download collected metadata on these samples and the RNA-seq counts matrices.
Note that, unlike most other portals, TCGA provides a separate counts file for each sample; these files will need to be combined into a single matrix for downstream analysis.
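Combining per-sample files can be sketched in a few lines of Python. This is only a sketch under simplifying assumptions: each file is treated as tab-separated with a gene identifier in the first column and an integer count in the second, and lines starting with `#` are skipped. Real GDC files may carry extra columns or summary rows, so adjust the column index and filtering to match your download. The function name `merge_sample_counts` and the sample and gene names are hypothetical.

```python
import csv
import io


def merge_sample_counts(sample_files):
    """Merge {sample_id: file-like object} into (genes, samples, matrix).

    matrix[i][j] is the count for genes[i] in samples[j].
    """
    per_sample = {}
    for sample_id, handle in sample_files.items():
        counts = {}
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comment lines
            counts[row[0]] = int(row[1])
        per_sample[sample_id] = counts

    samples = sorted(per_sample)
    genes = sorted(set().union(*per_sample.values()))
    # Assumption: a gene absent from a sample's file is recorded as 0
    matrix = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix


# Hypothetical per-sample files, held in memory for the example
files = {
    "sample_1": io.StringIO("geneA\t5\ngeneB\t0\n"),
    "sample_2": io.StringIO("geneA\t2\ngeneC\t7\n"),
}
genes, samples, matrix = merge_sample_counts(files)
print(samples)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Filling missing genes with zero is a convenience here; if your files do not all share the same gene list, it is worth checking why before analysis.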
For those with more R experience, recount3 from the Langmead lab provides access to normalized and uniformly processed RNA-seq data from GEO/SRA, GTEx, and TCGA.
Like ARCHS4, this database contains datasets that were re-processed from sequencing reads, and it also provides an R package through Bioconductor to access its data. Users can use the Study Explorer to narrow down samples of interest by metadata terms, and then input those accession numbers into their code to download the data into a RangedSummarizedExperiment object. This object can then be readily used in various RNA-seq analysis packages like DESeq2 (differential expression) and EnrichmentBrowser (pathway enrichment).
With the explosive growth of scRNA-seq studies, existing databases have been adapting to accept single-cell data and new databases have been created specifically for its curation and exploration.
scRNA-seq data provides high-resolution information on complex tissues that can be used to estimate the relative contributions of cell populations to bulk RNA-seq data, a process called deconvolution. scRNA-seq also captures the heterogeneity present in populations of cells and empowers discovery of gene expression programs that vary over time, developmental trajectories, mutational burden, and disease condition.
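To make the idea of deconvolution concrete, here is a toy Python sketch. It assumes a bulk profile is a simple weighted average of two cell-type signatures and finds the mixing fraction by brute-force grid search. This is a deliberately naive illustration with made-up numbers, not the method used by any published deconvolution tool; the function name `estimate_fraction` is hypothetical.

```python
def estimate_fraction(bulk, sig_a, sig_b, steps=100):
    """Estimate the fraction of cell type A in a two-cell-type mixture.

    Tries fractions 0, 1/steps, ..., 1 and returns the one whose predicted
    bulk profile has the smallest squared error against the observed one.
    """
    best_fraction, best_error = 0.0, float("inf")
    for i in range(steps + 1):
        f = i / steps
        error = sum(
            (b - (f * a_val + (1 - f) * b_val)) ** 2
            for b, a_val, b_val in zip(bulk, sig_a, sig_b)
        )
        if error < best_error:
            best_fraction, best_error = f, error
    return best_fraction


# Made-up signatures over three genes (e.g. averaged from scRNA-seq clusters)
signature_a = [10.0, 0.0, 4.0]
signature_b = [0.0, 20.0, 4.0]
# Made-up bulk profile, constructed as 30% type A and 70% type B
bulk_profile = [3.0, 14.0, 4.0]

fraction_a = estimate_fraction(bulk_profile, signature_a, signature_b)
print(fraction_a)
```

Real tissues contain many cell types and noisy measurements, so dedicated tools use constrained regression over many genes rather than a one-dimensional search.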
GEO/SRA already accepts and hosts scRNA-seq datasets, which are as easily searchable and accessible as bulk RNA-seq datasets. Many of these scRNA-seq datasets can be explored and visualized using PanglaoDB, which allows users to explore gene expression across cell types and studies and provides links to read counts data as either R objects or text files. EMBL-EBI also has a companion Single Cell Expression Atlas to explore and download their hosted datasets. However, new databases made specifically for scRNA-seq, with more advanced exploration tools, are also gaining traction.
One of the largest scRNA-seq specific databases is the Single Cell Portal, hosted by the Broad Institute of MIT and Harvard. Here, users can search for studies by organ, species, disease, and cell type. Each study has built-in exploration functions, such as t-SNE or UMAP embeddings and violin plots for visualizing gene expression across clusters. Raw and normalized data can be easily downloaded after creating an account.
As part of the Chan Zuckerberg Initiative, CZI Science has also created their own scRNA-seq database named CZ Cell x Gene Discover, built around their open-source data exploration tool for scRNA-seq data. Like the Broad’s Single Cell Portal, CZ Cell x Gene Discover hosts over 500 datasets, provides exploration capacity through their software, and allows for easy downloading.
Finally, for those who are more comfortable in R, the scRNAseq package on Bioconductor provides access to dozens of scRNA-seq datasets for easy downloading. To promote interoperability, all of the datasets are provided as SingleCellExperiment objects, which can be readily used with many other Bioconductor packages for downstream scRNA-seq analysis. Moreover, the SingleCellExperiment object can be converted for use in larger data exploration packages like Seurat and Scanpy.
While this blog post is not exhaustive, the RNA-seq databases above provide good starting points for finding and downloading bulk and single-cell RNA-seq datasets.
Once you have some datasets of interest in hand, it’s straightforward to prepare your files and start doing your own RNA-seq data analysis in Omics Playground!
Written in collaboration with Samuel Kazer for BigOmics Analytics.