Big Data at the 2015 AAPS National Biotechnology Conference

Understanding complex biological processes requires large databases that link patient characteristics to outcomes. The 2015 AAPS (American Association of Pharmaceutical Scientists) National Biotechnology Conference (NBC), held June 8‒10, 2015, in San Francisco, Calif., opened with an extended session on the utility of big data technology in building a modern pharmaceutical industry. The session was chaired by Brian Moyer (BRMoyer & Associate LLC, Washington, D.C.).

According to Moyer, the key concerns are: 1) sustainability of big data; 2) worldwide harmonization; 3) indemnification in the case of adverse events; 4) privacy and confidentiality of data; 5) integrity of results, which depends upon data quality; and 6) experimental design. Concerns about biobanking, sampling errors and signal quality (such as signal-to-noise ratio) were also expressed.

Big data in pharma

In the keynote lecture, Tamara Dull of the SAS Institute in Cary, N.C., pointed out that only 20% of data is structured, as in tables; the remaining 80% is unstructured. She defined “big data” as data that exceeds one’s comfort zone. Only about 25% of big data projects are considered “successful,” and just 13% succeed at full scale; 48% of installations are rated “partially successful.” A project that fails to address and solve a key business issue is considered unsuccessful. People issues (motivations and process) are harder to resolve than technical issues.

SAS builds most of its systems around Hadoop-based cloud storage, since it is low in cost and provides easy global access. Before data can be analyzed, however, it must be organized. Dr. Dull cited GlaxoSmithKline (Philadelphia, Penn.) as a firm that has successfully deployed a cloud-based system supporting exploratory research. In closing, she explained that the two largest threats to big data projects are limited skills on the project team and inadequate data security, the latter a particular concern in healthcare.
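As a concrete illustration of that organizing step, here is a minimal PySpark sketch that turns free-form text landed in Hadoop storage into a structured, queryable table. The HDFS paths, delimiter and column names are illustrative assumptions, not details from Dull’s talk.

```python
# Minimal PySpark sketch: organizing unstructured text stored in Hadoop (HDFS)
# into a structured table. Paths and field layout are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("organize-raw-data").getOrCreate()

# Unstructured: one free-form line per record, e.g. "2015-06-08|siteA|42.7"
raw = spark.read.text("hdfs:///data/raw/instrument_logs/*.txt")  # assumed path

# Structured: split each line into named, typed columns
parts = split(col("value"), r"\|")
structured = raw.select(
    parts.getItem(0).alias("run_date"),
    parts.getItem(1).alias("site"),
    parts.getItem(2).cast("double").alias("reading"),
)

# Write a curated, columnar copy that analytics tools can query efficiently
structured.write.mode("overwrite").parquet("hdfs:///data/curated/instrument_logs")
spark.stop()
```

Downstream tools, whether SAS or a SQL engine, would then query the curated tables rather than the raw text.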

Cloud computing for biomedicine

Intel Corp. (Santa Clara, Calif.) is working to provide the electronic chips required by big data projects. Dr. Ketan Paranjape, general manager of Intel’s Life Sciences program, which focuses on biomedicine, talked about scalability, which led to a discussion of cloud computing. With all the emphasis on the volume, velocity, variety and veracity of data, he sees the cloud as an economic necessity, since server farms are expensive to build and operate.

Because many users do not run high-performance computing (HPC) jobs continuously, dedicated server farms can sit idle for long stretches. Apache CloudStack addresses this by pooling hardware and provisioning virtual machines on demand. The open-source software delivers infrastructure as a service (IaaS), the layer on which platform as a service (PaaS) and software as a service (SaaS) offerings are built; IaaS amounts to a virtual environment that can be configured to support the laboratory sciences.
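To show how such an IaaS environment is driven in practice, the sketch below makes a signed call to CloudStack’s HTTP API to list running virtual machines. The endpoint URL and the API/secret keys are placeholders; the signing follows CloudStack’s documented HMAC-SHA1 scheme, but treat the details as a sketch rather than production code.

```python
# Minimal sketch of a signed Apache CloudStack API call (listVirtualMachines).
# Endpoint and credentials are placeholders; error handling is omitted.
import base64
import hashlib
import hmac
import urllib.parse

import requests

ENDPOINT = "https://cloud.example.org/client/api"   # placeholder
API_KEY = "YOUR_API_KEY"                             # placeholder
SECRET_KEY = "YOUR_SECRET_KEY"                       # placeholder

def signed_request(command: str, **params) -> dict:
    params.update({"command": command, "apikey": API_KEY, "response": "json"})
    # CloudStack signs the alphabetically sorted, URL-encoded, lower-cased query string
    query = "&".join(
        f"{k}={urllib.parse.quote(str(v), safe='')}" for k, v in sorted(params.items())
    )
    digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(), hashlib.sha1).digest()
    params["signature"] = base64.b64encode(digest).decode()
    return requests.get(ENDPOINT, params=params, timeout=30).json()

if __name__ == "__main__":
    result = signed_request("listVirtualMachines")
    for vm in result.get("listvirtualmachinesresponse", {}).get("virtualmachine", []):
        print(vm["name"], vm["state"])
```

Command-line clients such as CloudMonkey wrap this signing step; the sketch only exposes the mechanics.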

While data security is a major concern, experts say this is a general issue and not specific to the cloud. Karen Riley of the FDA warns that data security is the responsibility of the trial sponsor.

Paranjape offered performance benchmarks derived from a case study of the Novartis Institutes for BioMedical Research (La Jolla and Emeryville, Calif.; Cambridge, Mass.; and Basel, Switzerland). Novartis wants to combine its large database of patients and diseases with insights from previous clinical trials to support proof of concept in new human trials. Since Novartis’s internal high-performance computational chemistry (HPCC) resources were already at capacity, the cloud was a cost-effective way to accelerate in silico modeling and lead optimization. This involved sustained access to more than 50,000 compute cores. With these resources, typical jobs took about a minute of run time and cost from $0.08 to $2.25 each, according to Paranjape. To run these workloads in house, Novartis would have needed to build and operate an HPCC facility at a projected cost of $44 million. In the future, Paranjape expects Novartis to use the cloud to support emerging imaging technologies that are expected to demand roughly ten times the resources of next-generation sequencing.
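A rough back-of-envelope calculation using only the figures quoted above shows why the cloud route is attractive; the annual job volume in the sketch is an illustrative assumption, not a Novartis number.

```python
# Back-of-envelope comparison using the figures quoted in the talk.
# The annual job count is an illustrative assumption, not a Novartis figure.
facility_cost = 44_000_000                          # projected in-house HPCC build-out, USD
cost_per_job_low, cost_per_job_high = 0.08, 2.25    # quoted cloud cost per job, USD

# Number of cloud jobs that could be run for the facility's capital cost
jobs_low = facility_cost / cost_per_job_high        # if every job hit the high end
jobs_high = facility_cost / cost_per_job_low        # if every job hit the low end
print(f"{jobs_low:,.0f} to {jobs_high:,.0f} cloud jobs for the facility's capital cost")

# Even at a hypothetical 1 million jobs per year, annual cloud spend stays modest
annual_jobs = 1_000_000                             # assumed workload for illustration
print(f"${annual_jobs * cost_per_job_low:,.0f} to "
      f"${annual_jobs * cost_per_job_high:,.0f} per year")
```

The comparison ignores operating costs on both sides and data-transfer charges; it is meant only to convey orders of magnitude.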

Microbiomics

Professor Michael Snyder of Stanford University (Stanford, Calif.) focused on the microbiome in “Personal Genomics,” a talk that began with a general description of the various ’omics classes (genomics, epigenomics, transcriptomics, proteomics, metabolomics and, ultimately, microbiomics). The microbiome can vary significantly depending on its location in or on the body, as well as with personal history. Snyder puts the number of analytes in the microbiome in the billions, compared with about 100,000 for all the remaining ’omics classes combined. The microbiome plays an important role in digestion and vitamin synthesis and is implicated in inflammatory bowel disease, diabetes and obesity.

Snyder used himself as the subject in developing an ’omics profile. He started with a simple timeline covering hundreds of data points over 62 consecutive months. A significant event occurred at day 123: a human rhinovirus infection that lasted 21 days. A respiratory syncytial virus infection followed about a year later. In all, Snyder had a total of seven viral infections over five years. The workup included a genome sequence, genotyping, pharmacogenomics and specific identification of the infectious agent. During this time, Snyder also developed type 2 diabetes, which manifested as a sudden 50% increase in glucose and HbA1c levels at about day 350.

Predictive disease models

Dr. Guna Rajagopal of Janssen Pharmaceutical (New Brunswick, N.J.) described how big data is transforming the pharmaceutical business model. Firms are integrating a variety of information sources to improve profitability and patient outcomes by developing and using predictive models of disease. Falling costs of computing power and sequencing have attracted academic and government institutions, as well as private firms, to explore the potential market.

Genotype/phenotype databases

As director of the Institute for Computational Health Science at the University of California, San Francisco, Professor Atul Butte has access to 12 million de-identified patient records in the University of California Research Exchange (UC ReX). Twelve million patient records, however, is not enough to satisfy Butte: he expects that big data analytics will be required to tease out causal relationships. As of 2015, about 1.7 million microarrays were publicly available through ArrayExpress, with a doubling time of just over two years. The Cancer Genome Atlas, which provides annotated data on cancer genomes, is another valuable resource, and the database of Genotypes and Phenotypes (dbGaP) collects studies that relate genotypes to phenotypes. After describing several other databases, Butte pointed out that a single experiment rarely isolates a clear causal factor, because candidate variables far outnumber samples; statisticians call this the degrees-of-freedom problem. Clearly, large databases exist and are underutilized.
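To put the quoted doubling time in perspective, the short sketch below projects repository growth under a constant two-year doubling period. The 2015 starting count comes from the talk; the projection horizon is arbitrary and assumes growth simply continues at the same rate.

```python
# Project growth of a data repository under a constant doubling time.
# The 2015 starting point (1.7 million arrays) is from the talk; the
# two-year doubling time and the horizon are illustrative assumptions.
def projected_count(start_count: float, start_year: int, year: int,
                    doubling_time_years: float = 2.0) -> float:
    """Exponential growth: the count doubles every `doubling_time_years`."""
    return start_count * 2 ** ((year - start_year) / doubling_time_years)

if __name__ == "__main__":
    for year in range(2015, 2026, 2):
        n = projected_count(1.7e6, 2015, year)
        print(f"{year}: ~{n / 1e6:.1f} million arrays")
```

Whether growth stays exponential is an empirical question, but the arithmetic illustrates why Butte expects analytics, rather than data collection, to be the bottleneck.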

Sustainability

The cloud offers a lower-cost path to big data by allowing organizations to share a very large storage and compute facility. The utilities required to run a large server farm are a significant expense: one estimate claims that in 2013, IT activities consumed as much energy as all of the world’s airlines combined, about 2% of the global energy diet. The doubling time for stored information is about two years; at this rate, IT could account for roughly 10% of global energy consumption by 2020.
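For reference, such projections rest on the standard doubling-time relation, restated below; how closely energy use tracks stored information depends on gains in performance per watt, so the projected share is an assumption rather than a measurement.

```latex
% Doubling-time growth relation behind such projections (illustrative).
\[
  N(t) = N_0 \, 2^{(t - t_0)/T_d}, \qquad T_d \approx 2\ \text{years},
\]
% where N_0 is the stored information at the reference year t_0.
% Energy use tracks N(t) only loosely, since performance per watt also improves.
```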

Conclusion

The conference team at AAPS organized a broad and timely meeting showing that the technology for mining big data exists and is being used successfully by early adopters.

The 2016 AAPS National Biotechnology Conference will be held May 16‒18 in Boston, Mass. For more information, visit www.aaps.org/nationalbiotech/.

Robert L. Stevenson, Ph.D., is Editor Emeritus, American Laboratory/Labcompare; e-mail: [email protected].
