“Genomical” Replaces “Astronomical” at Converged IT Summit

The Converged IT Summit, held September 9 and 10 in San Francisco, Calif., provided examples of integrating IT and science to support discovery—especially improved healthcare.

It’s a matter of scale

Life science research and its application to human health are all about information…lots of it. The recently coined word “genomical” refers to the nearly incomprehensible size and speed of operations and data that now characterize genomic research.

In the keynote lecture, Ari Berman, Ph.D., of BioTeam Inc. (Middleton, Mass.) explained how the convergence of new IT is the major enabling technology bridging the gap from life science research results to knowledge and its application to healthcare. His examples showed that converging science with advanced IT is working.

Berman’s formal training was in molecular biology, but he has spent the last 18 years in high-performance computing/infrastructure, and strives to turn IT into an essential laboratory tool. He sees the output of life science research laboratories converging with the bleeding edge of high-performance computing. The primary goal is science-based optimal therapy for humans.

Berman claimed that 25% of all life scientists would require high-performance computing (HPC) in 2015. Genomics and proteomics labs are generating gigantic amounts of data that must be moved, processed and stored for further processing. Much commercial “big data” IT supports transaction systems, with the emphasis on rapid input and output (response time). In contrast, life science IT is batch-oriented, often integrating or comparing very large databases and holding “genomical” amounts of data in active memory, with much longer response times. This difference is reflected in the computer system architecture; to the IT engineer, it makes life science IT unique.

Successful vendors must understand and adapt to the scientific workflow; Berman advised that traditional IT simply does not work for these workloads. The challenge for vendors is to understand the application and design IT that facilitates delivering the benefit.

He focused on the human genome, which has 3.2 Gbp (giga-basepairs) spread across 23 chromosomes, with about 21,000 genes and over 55 million known variants. Next-generation sequencing data is overloading the system, with an output of 10 terabytes (TB)/day/machine. There are probably over 1000 active sequencers, so the relevant database expands by some 500‒1000 petabytes (PB)/year of new data.

High-throughput screening (HTS) for disease research and drug development generates about 10 TB/week/facility, and there are a few hundred HTS core facilities around the world. MRI and confocal imaging of tissues also generate tens of TB/week/imager. Together, this adds up to data volumes in the mid-petabyte range per week, all of which must be stored, accessed, moved and analyzed.
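These round numbers invite a quick sanity check. The following minimal Python sketch works only from the figures quoted above, plus two assumptions of our own: roughly 300 HTS facilities, and sequencer utilization inferred from the reported totals.

```python
# Back-of-the-envelope check of the data volumes quoted above.
# Assumptions (not from the talk): ~300 HTS facilities; sequencer
# utilization inferred from the reported yearly totals.

TB_PER_PB = 1000

# Sequencing: 10 TB/day/machine, ~1000 active machines, run flat out.
peak_pb_per_year = 10 * 1000 * 365 / TB_PER_PB
print(f"Sequencing at peak: {peak_pb_per_year:,.0f} PB/year")  # 3,650 PB/year

# The reported 500-1000 PB/year implies the fleet runs well below peak:
for reported in (500, 1000):
    print(f"  {reported} PB/year -> ~{reported / peak_pb_per_year:.0%} of peak")

# HTS: 10 TB/week/facility at ~300 facilities.
print(f"HTS: ~{10 * 300 / TB_PER_PB:.0f} PB/week")  # ~3 PB/week
# With tens of TB/week per imager on top of that, the weekly total lands
# in the low-to-mid petabyte range, consistent with the figures above.
```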

Computer configuration depends upon the application. Molecular modeling, as in reaction simulation and rational drug design, requires a different computer architecture from sequencing, where assembly jobs need many terabytes of RAM to hold long contigs.

In Berman’s view, life scientists themselves contribute to the challenge: laboratories typically hold onto everything, and data is not always willingly shared. Data is stored in many places and, because file structures are not standardized, integrating it is time-consuming. Still, genomical-scale analytics require access to genomical-scale data.

To address these issues, Berman wants to improve networking. Commercial cloud-based networking, storage and computing is probably the best intermediate-term answer to the storage question; it is available now and is more economical than building private infrastructure. In the future, he envisions hybrid clouds that share storage and processing capacity between private and public centers, empowered with science gateways that have domain-specific knowledge and help scientists access data. But this, too, is a temporary fix, since file systems become increasingly unreliable above a million files. Beyond 50 TB, Berman said, it’s hard to keep track of anything, and life science databases are already in the petabytes.

His long-range solution is a converged IT infrastructure with campus nodes connected by global 100-Gbps networks. The Texas Advanced Computing Center (Austin, Texas) (see below) was cited as a leader in integrating computer hardware, software, networking, storage and staff, at scale.

Applications of converged IT

“Every revolution in the history of sciences has been driven by one and only one thing: access to data,” said Prof. John Quackenbush of Harvard (Cambridge, Mass.). Science-based precision medicine relies on data. The cost of data is declining so rapidly that sequencing an individual’s genome is now affordable in clinical settings. He pointed out that, while big data is a problem, messy data is worse.

Prof. Quackenbush is co-founder of GenoSpace, which offers a platform for cloud-based genomic data analysis. It uses semantic technology to integrate disparate databases while retaining the source data. GenoSpace for Research is intended for users researching a cohort of patients, GenoSpace for Clinical Care is for those working with individual patients, and GenoSpace for Patient Communities corrals data from many patients suffering from a common disease.

Data security is provided by encryption combined with advanced permission management. Prof. Quackenbush noted that precision medicine demands simplicity, especially for patients and treating physicians, since it is a new endeavor with few experienced practitioners.

Autism Speaks targets 10,000 genomes

Dr. Mathew Pletcher, president of Autism Speaks, reported on the progress of the Ten Thousand Genomes Program initiated in June 2014. The project, recently renamed MSSNG (the missing vowels represent missing information), seeks to gather 10,000 whole genome sequences with phenotypical information from families with autism spectrum disorder (ASD). The heritability figures mark ASD as a genomic disorder: risk of affliction is 90% for identical twins, 44% for fraternal twins and 7% for siblings, compared to a 1.5% risk in the general population. MSSNG has already posted 4017 sequenced whole genomes to its website, with another 1615 in process. About 2300 of these will be publicly available in the fall of 2015. The goal is to have 10,000 by March 2016.

Networking at scale and speed

Moving data is key to using it. The cloud is one solution, but uploading and downloading are slow and expensive, so Berman and others see the cloud as a near- to mid-term tool best suited to mid-size datasets.

The alternative is networks optimized for data transmission. These were explored by Robert Vietzke of Internet2 (Washington, D.C.), a computer networking consortium. The principal advantage of interconnecting institutions is the increased bandwidth, and hence speed, for high-volume communication. The consortium supports research aimed at improving network capacity. Services include opening the network architecture and control software so that developers can optimize the network for specific applications, along with end-to-end performance monitoring. Currently, the network provides 100-Gbps communication between selected campuses around the world.

In early 2014, IBM purchased Aspera (Emeryville, Calif.), known for its ability to transmit large files globally. Charles Shiflett of Aspera described the new FASP, which provides 80-Gbps transmission irrespective of distance. This is in contrast to legacy systems, which are usually limited to 10‒20 Gbps.
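To put these line rates in perspective, here is a small sketch of the time needed to move a dataset at each quoted rate; the 10-TB dataset size is an illustrative assumption, not a figure from the talk.

```python
# Time to move a 10-TB dataset at the line rates quoted above.
# The 10-TB dataset size is an illustrative assumption.

dataset_bits = 10 * 1e12 * 8  # 10 TB expressed in bits

for name, gbps in [("legacy (low)", 10), ("legacy (high)", 20), ("FASP", 80)]:
    hours = dataset_bits / (gbps * 1e9) / 3600
    print(f"{name:>13} @ {gbps:2d} Gbps: {hours:4.1f} h")

# legacy (low)  @ 10 Gbps:  2.2 h
# legacy (high) @ 20 Gbps:  1.1 h
# FASP          @ 80 Gbps:  0.3 h (~17 minutes)
```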

Genomical computing

Dr. Matthew Vaughn of the Texas Advanced Computing Center (TACC) described its charter as advancing science and society through the application of advanced computing technologies. TACC operates several large computers, known by names such as Stampede, Lonestar and Wrangler. The latter has 0.6 PB of flash storage and a 1-TB/sec read rate. Currently, more than 3 PB/month are moved in and out of the center, and this is growing by about 20% per month.
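Twenty percent monthly growth compounds quickly. A short sketch, using only the two figures quoted above, shows the implied trajectory:

```python
# TACC's traffic today (~3 PB/month) growing at ~20% per month.
import math

pb_now, growth = 3.0, 0.20
print(f"Doubling time: ~{math.log(2) / math.log(1 + growth):.1f} months")  # ~3.8

for months in (6, 12):
    print(f"In {months:2d} months: ~{pb_now * (1 + growth) ** months:.0f} PB/month")
# In  6 months: ~9 PB/month
# In 12 months: ~27 PB/month
```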

Converged solutions for genomics

James Reaney of SGI (Milpitas, Calif.) lectured on converged solutions for genomics. Next-generation sequencing (NGS) requires significant computing power, and SGI sought examples in which big IT provided useful results. The company first focused on the workflow from Illumina DNA sequencers at The Genome Analysis Centre (TGAC, Norwich, U.K.), resulting in the design of unique computer systems that avoid bottlenecks. One such system, the SGI UV300 “super node,” has been used to elucidate how cancerous mutations change protein signaling networks in human cells.

Hybrid clouds for storage and analytics

Dr. Aaron Black of the Inova Translational Medicine Institute (Fairfax, Va.) described a hybrid cloud IT system for whole genome patient data. Inova hospitals have a patient base of more than 10,000, with electronic health records encompassing millions of files and petabytes of storage. Supported research includes biobanking, whole genome sequencing, RNA sequencing, and miRNA and DNA methylation profiling. Amazon Web Services provides data storage and sharing. Security is a concern, so computing is handled locally on SGI systems, within a hybrid cloud architecture managed by Avere Systems (Pittsburgh, Penn.). Research data requests, data movement and analytics are now answered within hours to days; previously, this took weeks to months.

Data security

Data security solutions include firewalls, the cloud, encryption and demilitarized zones (DMZs), or perimeter networks. Prof. Quackenbush foresees that phenotypical information and individual patient histories will be required for individualized medicine, and expects that encryption of complete data is superior to de-identified databases.

AES-NI (Advanced Encryption Standard New Instructions) from Intel (Santa Clara, Calif.) provides rapid, hardware-accelerated encryption, which GenoSpace uses to secure its precision medicine operating systems. Encrypting complete records, rather than stripping identifiers, preserves the context of genomic data, making it more valuable to researchers.
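The talk did not detail GenoSpace’s implementation, but the pattern is standard authenticated encryption. A minimal sketch using Python’s cryptography package follows; the variant record and context label are hypothetical. On CPUs that support AES-NI, the AES rounds execute in hardware transparently.

```python
# Illustrative sketch only; GenoSpace's actual implementation was not described.
# Authenticated encryption with AES-256-GCM via the "cryptography" package.
# On CPUs with AES-NI, the AES rounds run in hardware automatically,
# which is what makes encrypting large genomic files fast.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # in practice, keep keys in a key manager
aesgcm = AESGCM(key)

record = b"sample-1234\tchr7\t117559590\tCTT>-"  # hypothetical variant record
context = b"cohort-42"                           # authenticated but not encrypted

nonce = os.urandom(12)  # 96-bit nonce; must never repeat for the same key
ciphertext = aesgcm.encrypt(nonce, record, context)
assert aesgcm.decrypt(nonce, ciphertext, context) == record
```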

Firewalls are used to monitor communications for malware and sometimes to encrypt outgoing packets, but they slow down communication and can introduce other problems. The penalty is insignificant when data is small, but as data volumes grow, firewalls can become the factor limiting communication speed.

To secure high-speed data transfers (greater than 10 Gbps), Energy Sciences Network (ESnet, Berkeley, Calif.) employs DMZs, situated outside the firewall but near the computing systems they serve. DMZs are marketed and supported by Juniper Networks (Sunnyvale, Calif.).

Credits

The staff of Cambridge Healthtech Institute (Needham, Mass.) deserves credit for organizing the first Converged IT Summit. Please see http://convergeditsummit.com/ for more information.

Robert L. Stevenson, Ph.D., is Editor Emeritus, American Laboratory/Labcompare; e-mail: [email protected].