Why Big Data Isn’t the Big Problem for Genomic Medicine
Buzzwords, like a virus, spread inexorably from discipline to discipline. Take “big data,” which originated in supercomputing and has now infected finance, logistics, advertising and commerce, intelligence and defense, and, most recently, life science and health care. Is there some rule requiring every presentation on genomics to include a slide comparing sequencing costs to Moore’s Law, followed by slides lamenting how much data we are producing and the resources required to act on it?
We aren’t suggesting big data isn’t worth talking about in life science and health care—but we would argue it’s not the biggest barrier the industry faces, particularly in making genomics a part of routine clinical practice, which is the primary aim of precision medicine. Why? Because big data is a tractable problem. The problems associated with storing, managing and analyzing big data have been solved for other industries by many companies, including Appistry (St. Louis). Data is data, and whether it’s sequencing data, social media activity or credit card transactions, the patterns for storage and execution needed to make it actionable are the same.
Life science IT staff may still wring their hands about all of the data coming off sequencers, but look at how the industry’s conception of the problem has changed just in the past five years. Back then, storing, managing and analyzing terabytes of data annually seemed inconceivable. Today, with input file sizes for exome and whole-genome runs at 500 GB or more per sample, research and clinical laboratories routinely handle at least 100 TB annually.
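To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The throughput of 200 samples per year is our own illustrative assumption, chosen simply because it is the figure that turns 500 GB per sample into 100 TB per year; it is not a number reported by any particular laboratory.

```python
# Back-of-the-envelope annual storage estimate. Illustrative figures only:
# the 200-samples-per-year throughput is an assumption for this sketch,
# not a reported number; 500 GB per sample comes from the text above.

GB_PER_SAMPLE = 500       # exome/whole-genome input files, per sample
SAMPLES_PER_YEAR = 200    # assumed lab throughput (hypothetical)

def annual_storage_tb(samples_per_year: int, gb_per_sample: float) -> float:
    """Raw input volume per year, in terabytes (1 TB = 1,000 GB)."""
    return samples_per_year * gb_per_sample / 1_000

if __name__ == "__main__":
    tb = annual_storage_tb(SAMPLES_PER_YEAR, GB_PER_SAMPLE)
    print(f"{SAMPLES_PER_YEAR} samples x {GB_PER_SAMPLE} GB = {tb:.0f} TB/year")
    # 200 samples x 500 GB = 100 TB/year of raw inputs, before alignment
    # files, variant calls, replicas and backups multiply that footprint.
```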
Of course, big data isn’t just about file size. Doug Laney posited in 2001 that there are three dimensions to the data explosion that organizations must control: volume (the sheer amount of data), velocity (the rate at which data is produced and must be handled) and variety (the range of data types that must be handled and coordinated). Life science data has all of these characteristics, particularly when taking into account the data produced by next-generation sequencing (NGS) together with that associated with patient treatment, tissue handling and routing, data analysis and clinical interpretation—activities that must be coordinated to arrive at actionable clinical decisions.
What’s heartening is that big data is no longer an edge case demanding pioneering effort when applied to science at scale. An example is the NIH Undiagnosed Diseases Program (UDP). Meeting the scientific objectives of that program’s unique family-genetics pipeline has required the NIH UDP to launch 600 cores of processing power across 56 patient samples covering 18 families, with processing taking about one to two weeks. That’s a lot of data. But it’s manageable—and by leveraging production-grade big data capabilities, NIH UDP researchers have been able to iterate on and expand their analytics in weeks rather than months.
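The pattern behind that kind of scale-out is the same one other big data industries rely on: many independent per-sample jobs fanned out across a pool of workers. The sketch below is purely illustrative and is not the NIH UDP’s actual pipeline; it uses Python’s standard concurrent.futures module, and process_sample() is a hypothetical placeholder for the real per-sample analysis.

```python
# Illustrative fan-out of independent per-sample work across a local process
# pool. NOT the NIH UDP's pipeline; the sample count and process_sample()
# placeholder are assumptions made for this sketch.
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_sample(sample_id: str) -> str:
    # Placeholder for the real per-sample analysis (alignment, variant
    # calling, annotation, ...); here it simply echoes the sample ID.
    return f"{sample_id}: done"

def run_cohort(sample_ids, max_workers=8):
    """Distribute independent per-sample jobs across worker processes."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_sample, s): s for s in sample_ids}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

if __name__ == "__main__":
    cohort = [f"sample_{i:02d}" for i in range(1, 57)]  # e.g., a 56-sample cohort
    print(len(run_cohort(cohort)), "samples processed")
```

In production the same fan-out would be expressed against a cluster scheduler rather than one machine’s process pool, but the shape of the problem, many independent samples flowing through a fixed per-sample workflow, is unchanged.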
Yet, this same example also demonstrates how far the industry has to go. The data management challenges become more difficult—and more interesting and impactful—when processing hundreds or thousands of patient samples a year, rather than a mere 50 samples at a time. Using information in a patient’s genome to ascertain the cause of disease or suggest potential therapies is a big data problem health care wants to have, because it has the potential to lead not just to more targeted therapies for existing disease but, quite possibly, to preventative measures that can stop disease before it starts.
What’s keeping the bigger big data problem out of reach? It’s not technology. The tools, infrastructure and expertise to generate the data and manage it upon arrival exist and are well understood. But today, genomics testing remains research driven. The clinical applications that do occur happen in top-tier research hospitals on a case-by-case basis. Such hospitals can use research funding to acquire and maintain the sequencing instrumentation, bioinformatics skills and clinical interpretation expertise needed to support their tests. Yet, because only a limited number of patients can afford and obtain these tests, even these leading hospitals have a hard time proving the efficacy of the tests and justifying the costs associated with administering them.
Meanwhile, the patients are out there. Thousands of them are seen at regional hospitals and medical centers that lack the expertise to deliver genomics testing themselves. It’s untenable to expect every physician or health care provider interested in improving patient care through genomics testing to make the costly capital and other investments required to turn this science into a practical part of day-to-day patient care. Instead, the aim should be to connect the siloed capabilities associated with genomics testing into a simple, physician-friendly workflow that makes the best services accessible to every provider, regardless of geography or institutional size or affiliation.
Such an approach has several impacts. First, it empowers physicians to employ genomics tests as a part of routine patient care, particularly if access to services can be reduced to ordering a test and receiving an actionable clinical report a week to 10 days later. The best, highest-quality services will become something any physician can access—and any patient can receive, from a trusted local physician. Second, with proper consents in place, it provides the necessary influx of patients to enable test developers to assess the validity and efficacy of their tests.
Of course, as this technology is adopted more broadly, it will bring new challenges in data management and analytics. But it’s nothing this industry can’t handle. The true barrier to clinical adoption of genomic medicine isn’t data volume or scale; it’s how to empower physicians, both logistically and in terms of clinical genomics knowledge, while proving the fundamental efficacy of genomic medicine through improved patient diagnosis, treatment regimens, outcomes and patient management.