The opportunities for solving big health problems through genomics research are tremendous; yet, the challenges associated with storing, managing and processing the massive amounts of data involved in these jobs are daunting. Accessing more compute power can help.
The field of genomics research is exploding as researchers make groundbreaking discoveries through next-generation DNA sequencing (NGS). The demand for sequencing is increasing exponentially while the cost of sequencing has dropped tremendously.
Recent estimates suggest that genomics data is doubling roughly every seven months. As you might imagine, keeping pace with such incredible growth presents new challenges.
Genomics Research Today
Genomics research may include studying the entire human genome or large subsets of DNA sequence data. Genomic testing locates variations in DNA that affect health, disease or drug responses.
Enabling personalized medicine is just one example of what’s possible from the study of genomes. Personalized medicine equips clinicians to more accurately determine an individual’s risk of cancer, treat diagnosed patients and monitor disease recurrence.
Researchers have already begun testing a personalized cancer vaccine on patients. This particular test resulted from collaborative efforts to identify tumor-specific mutations that could trigger immune responses in cancer patients – in effect, helping patients fight their own disease.
Technical advances in DNA sequencing and computational biology are enabling and accelerating the field of genomics. When the Human Genome Project launched in 1990, it took 13 years to complete the sequencing of a single human genome, finishing in 2003. According to a 2018 IDC Infobrief, sequencing can now be completed in as little as 27 hours – and turnaround times continue to drop.
The cost to sequence a single human genome has dropped dramatically. What started at a cost of $3 billion in 2003 has fallen to as low as $1,000. In their 2020 Big Ideas Report, ARK Invest reported a new target for the average cost of sequencing a whole human genome – $200 by 2024.
In addition to the astounding cost and time reductions achieved in sequencing, there’s the exploding volume of data being extracted from each genome. This accumulation of massive volumes of data creates big data storage, management, and processing challenges.
The Three V’s of Genomics Research Data
As the volume of data continues to soar, storing, managing and processing electronic information requires new approaches to support the growing demand for access. Data is often siloed, and access to significant compute power is a must-have to process it efficiently rather than leave it unused.
Volume (Data storage) – There are 3 billion units, or base pairs, of DNA across roughly 20,000 genes in a single genome. The base pairs of one sequenced genome alone equate to about 750 megabytes of raw data. Imagine the amount of storage space required as the data continues to accumulate with the completion of each genome study.
Variety (Data management) – In addition to the volume of data, the types of studies are expanding. There’s the Million Veteran Program, The Cancer Genome Atlas, 1000 Genomes Project, and others. As a result of these distinct initiatives, data is siloed and almost certainly includes different points of data that must be carefully aggregated before it can be studied as a whole.
Velocity (Data processing) – Researchers are constantly facing challenges accessing the right infrastructure and compute environments. Processing enormous amounts of data requires serious compute power – and this is in addition to the demands for faster results and seamless scaling of infrastructure.
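The volume figures above are easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes 3 billion base pairs per genome and 2 bits per base (A, C, G, T) – a theoretical minimum that ignores quality scores and sequencing redundancy, which inflate real-world file sizes considerably:

```python
# Back-of-envelope storage estimate for raw genome sequences.
# Assumes 3 billion base pairs per genome and 2 bits per base
# (four possible bases fit in 2 bits); real files are far larger.

BASE_PAIRS_PER_GENOME = 3_000_000_000
BITS_PER_BASE = 2

def raw_genome_megabytes(base_pairs: int = BASE_PAIRS_PER_GENOME) -> float:
    """Minimum raw storage for one genome, in megabytes (decimal)."""
    return base_pairs * BITS_PER_BASE / 8 / 1_000_000

def cohort_terabytes(num_genomes: int) -> float:
    """Minimum raw storage for a cohort of genomes, in terabytes."""
    return num_genomes * raw_genome_megabytes() / 1_000_000

print(f"One genome: {raw_genome_megabytes():.0f} MB")          # 750 MB
print(f"100,000 genomes: {cohort_terabytes(100_000):.0f} TB")  # 75 TB
```

Even at this idealized lower bound, a population-scale study of 100,000 genomes needs tens of terabytes before any analysis begins.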
Fortunately, advancements in technology solutions and tools are making great strides in solving the challenges associated with data volume, management, and processing.
Required Skill Set: A PhD in Genomics OR Computer Science?
Yes, that’s right: a few minutes spent perusing current job postings in the field of genomics shows that a PhD and hands-on software development experience are both listed as requirements. While a researcher’s domain centers on everything related to the study of genomes, they still have to know how to write software. Specifically, they have to code the algorithms they use every day to process substantial volumes of data.
Interestingly, many researchers learn how to code by necessity – essentially taking the hacker approach. Python has been their choice, as it is easier to learn in comparison to other programming languages. Knowing how to code in Python is what equips researchers with the skills necessary to build the algorithms for processing, analyzing, and extracting insights from their data.
While Python is relatively easy to learn, it’s also easy to use it to write bad code. In addition, many scientists are not experts in multi-processing so the algorithms aren’t set up to run as efficiently as they could. When code is not optimized – specifically algorithms – it delays processing and the production of insights researchers need to make important discoveries. Fortunately, solutions for working with less-than-perfect code exist today.
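To make the multi-processing point concrete, here is a minimal sketch of parallelizing a CPU-bound, per-sequence computation with Python's standard multiprocessing module. The gc_content function and the sample reads are hypothetical stand-ins for a researcher's real per-record analysis:

```python
# A minimal sketch: fanning a per-sequence computation out across
# CPU cores with Python's built-in multiprocessing module.
# gc_content and the sample reads are illustrative placeholders.
from multiprocessing import Pool

def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)

def analyze(sequences: list[str]) -> list[float]:
    # Pool.map distributes the work across available cores; the
    # serial equivalent is [gc_content(s) for s in sequences].
    with Pool() as pool:
        return pool.map(gc_content, sequences)

if __name__ == "__main__":
    reads = ["ACGTACGT", "GGGGCCCC", "ATATATAT"]
    print(analyze(reads))  # [0.5, 1.0, 0.0]
```

The serial and parallel versions produce identical results; the difference is that the parallel version actually uses the machine's cores – exactly the kind of optimization many self-taught scientific coders skip.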
In reality, coding may not be the only technical skill researchers need. Their responsibilities can also include setting up the technical environment and infrastructure to run that code. However, researchers are unlikely to be cloud natives or well versed in modern software development practices such as cloud infrastructure, web technologies, and tools for scheduling and orchestration.
Time-to-market is crucial when it comes to improving patient outcomes through genomics research. Now is the time to start exploring what’s available to streamline data management and make those important discoveries faster.
Access Powerful Compute Without DevOps
Imagine being able to enter a single command in a command-line interface to launch a script or algorithm – even if it’s not coded perfectly. Think single-source access to scalable, parallelized computing in the cloud, with no need to worry about where and how the job runs.
Dis.co’s compute platform leverages existing infrastructure (on-prem and cloud) as a single resource. The Dis.co Smart Scheduler seamlessly distributes processing jobs across available resources for optimal resource configuration and compute models. Dis.co also eliminates the need for DevOps teams to build in-house scheduling and orchestration tools.
Available resources on the Dis.co platform may be an on-prem or private cloud, or a public, hybrid, or multicloud environment. The platform even provides access to GPUs in addition to CPUs. Originally designed for gaming, GPUs pack nearly 200 times more processing cores per chip than CPUs, making them well suited to parallel workloads.
If you’re ready to simplify data management and accelerate running processing jobs to accomplish research goals, contact Dis.co today.