Amidst all the talk of personalised medicine, it is easy to forget that cheap, large-scale genome sequencing is a recent phenomenon that started in the early 2000s with the development of next generation sequencing (NGS) techniques. While this development initially addressed a bottleneck in genomic studies (namely the need for more rapid data generation), it also brought new challenges that needed to be addressed.
In this brief blog post, we present the data analysis process in connection with genomic data and focus on the bottlenecks that have arisen as technology has moved on from the pre-NGS era to the current state of affairs. Generally speaking, the discovery process must go through the phases of data acquisition, data pre-processing and data discovery. We will point out that the bottleneck is shifting towards the final discovery phase and conclude by discussing the role that bioinformatic platforms can play in addressing the latest challenges.
Bottleneck 1.0 (Data Acquisition)
Before the advent of NGS, experiments aimed at discovering the genetic basis of distinct phenotypes were limited by the lack of a technology that allowed the rapid identification of alterations at a genome-wide level. Researchers were restricted to Sanger sequencing, which can reliably sequence at most 1 kb of nucleotides (considerably less than the size of the average human gene) in one reaction, or to performing real-time PCR reactions on fragments of individual genes. Neither technique scaled to a genome-wide level, which meant that scientists would often spend months or even years sequencing individual genes or designing probes for quantification before they found a significant association.
The advent of NGS (e.g. Illumina Solexa, 454 and SOLiD sequencing) helped overcome this initial bottleneck. It became possible to accurately sequence not only large chunks of chromosomes, but entire genomes in a single experiment. All the necessary genetic information became available for multiple samples, helping to rapidly address questions that would previously have taken months or even years.
Bottleneck 2.0 (Data Processing)
A new bottleneck then developed: how to accurately and rapidly map the millions of reads generated by such experiments to a reference genome. Mapping could often take days (or even weeks) on a single desktop computer for large genomes, and that time had then to be multiplied by the number of genomes sequenced for a given project.
This bottleneck was addressed thanks to the constant increase in computing power and the development of faster and more efficient mapping algorithms (1). Nowadays, a human whole genome sequencing dataset can be mapped in as little as 30 minutes (2), for example.
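The core idea behind many of these faster mapping algorithms is "seed and extend": index the reference genome by short k-mers, use a k-mer from the read to find candidate positions, and only then verify each candidate in full. The snippet below is a deliberately minimal sketch of that idea (real mappers such as those surveyed in (1) use far more sophisticated indexes and alignment steps); all names and sequences are illustrative.

```python
def build_kmer_index(reference, k=5):
    """Index every k-mer in the reference by its start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def map_read(read, reference, index, k=5, max_mismatches=1):
    """Seed with the read's first k-mer, then verify each candidate
    position by counting mismatches (a crude 'extend' step)."""
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits

# Toy example: find where a short read maps on a tiny "reference".
reference = "ACGTACGTTGACCAGTACGGA"
index = build_kmer_index(reference)
print(map_read("TGACCAGT", reference, index))  # [8]
```

Because candidate positions are looked up in constant time instead of scanning the whole reference for every read, this approach scales to millions of reads, which is what made genome-scale mapping tractable.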
The increase in computing power and mapping algorithm efficiency, coupled with the decreasing costs of sequencing, has led to an explosion in data generation in recent years, with genomics projects expected to generate between 2 and 40 exabytes of data within the next decade (3).
Bottleneck 3.0 (Data Discovery)
With the boom in genomic data generation, a new bottleneck has surfaced: exponentially more datasets need to be analysed each year, and the analysis itself has become increasingly sophisticated.
In the first stages of the omics revolution, it was sufficient to assemble a genome and compile an incomplete list of mutations that could be associated with specific phenotypes to secure a high-quality publication. Nowadays, such an analysis is barely sufficient for a short communication in a low-tier journal. Comparative analysis of multiple genomes or transcriptomes, which can easily number in the hundreds of thousands in the case of single-cell RNA sequencing (scRNA-seq) datasets, is becoming the norm.
This development has moved research away from the initial bottlenecks discussed above to a situation where the bottleneck is the inability to analyse all the data as it is being released (Fig. 1). Analysing omics data requires training, and although the number of scientists able to deal with omics data is increasing, so is the amount of data being generated and, more crucially, the level of sophistication that such an analysis requires. Furthermore, the increasing availability of different omics data types for the same experiment (e.g. genomic, transcriptomic, metabolomic and proteomic data) also requires sophisticated mathematical approaches (4) that can find correlations across them.
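At its simplest, finding correlations across omics layers means asking how measurements of the same feature co-vary across samples, for example whether a gene's transcript abundance tracks its protein abundance. The sketch below illustrates this with a plain Pearson correlation; the numbers are invented for illustration, and real multi-omics integration methods (4) go well beyond pairwise correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length measurement vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical measurements of one gene across five samples:
mrna    = [2.1, 3.4, 1.8, 4.0, 2.9]   # transcript abundance (a.u.)
protein = [1.0, 1.9, 0.8, 2.3, 1.5]   # protein abundance (a.u.)

r = pearson(mrna, protein)
print(round(r, 3))  # close to 1: the two layers co-vary strongly here
```

Scaling this simple idea from one gene to tens of thousands of features across several omics layers, while controlling for noise and multiple testing, is precisely what makes the discovery phase computationally and statistically demanding.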
Overcoming the Third Bottleneck through Self-Service Bioinformatic Platforms
As technology rapidly moves on, the learning curve steepens for biologists who lack formal training in programming and advanced statistics but wish to perform their own analysis. Relying on trained bioinformaticians is one solution, but they are confronted with, and sometimes overwhelmed by, ever-increasing amounts of data from researchers, which also greatly reduces the time they can devote to developing new tools and algorithms (5).
Luckily, as bioinformatic analysis tools and methods mature and become established, it is getting easier to delegate tasks to so-called self-service bioinformatic platforms. They can automate several aspects of the data analysis, speeding up the rate at which data can be processed and presented and thus helping to address the analysis bottleneck. These self-service platforms do not require any prior programming knowledge and are therefore accessible to any biologist. Furthermore, they free up bioinformaticians to focus on more creative tasks rather than spending large amounts of time running and re-running standard analysis pipelines in response to feedback from biologists. Platforms for the analysis of genomic data are also growing in sophistication to keep pace with new methodologies, support new and larger data types (such as scRNA-seq), and access the vast amounts of public data already available for comparative analysis. Examples of such platforms include the CLC workbench, Rosalind from OnRamp Bio, Qlucore Omics Explorer, the NASQAR online platform (6) and our very own Omics Playground.
The study of the genetic basis of phenotypes has undergone massive changes in the past 20 years. From a pre-NGS state where the generation of data represented the primary bottleneck, it rapidly moved to a stage where data processing underwent massive changes, driven mainly by hardware and software improvements, and finally to the current state of affairs, where analysis cannot keep pace with the vast amounts of data being generated.
There is no doubt that self-service bioinformatic platforms will play an increasingly important role to help address the analysis bottleneck. Furthermore, the advent of personalised medicine will boost omics data production to unprecedented levels and expand its interpretation beyond biologists to clinicians and even their patients (7-8). This is a development that will require a new generation of platforms, less focused on statistical details and more on immediate interpretation in a medical context. It is a challenge that we intend to meet through our platforms in the coming years.
(1) Canzar S, Salzberg SL. Short Read Mapping: An Algorithmic Tour. Proc IEEE Inst Electr Electron Eng. 2017;105(3):436-458.
(2) Zhang G, Zhang Y, Jin J. The Ultrafast and Accurate Mapping Algorithm FANSe3: Mapping a Human Whole-Genome Sequencing Dataset Within 30 Minutes. Phenomics. 2021;1:22–30.
(4) Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights. 2020;14:1177932219899051.
(5) Chang J. Core services: Reward bioinformaticians. Nature. 2015 Apr 9;520(7546):151-2.
(6) Yousif A, Drou N, Rowe J, Khalfan M, Gunsalus KC. NASQAR: a web-based platform for high-throughput sequencing data analysis and visualization. BMC Bioinformatics. 2020;21(1):267.