Data management is the current choke point in many scientific endeavors. New workflows enabled by advances in hardware and software are making big data a reality for many labs and staff. To me, big data starts when the volume of data exceeds my capacity to manage it in my own memory. My Ph.D. dissertation involved the synthesis and characterization of about 30 new chelating agents. This was manageable, but if it had been 300 new compounds, I’d never have made sense of it. Any hope of insightful analysis would have exceeded my limited cranial bandwidth.
In September 2010, the number of chemicals in Chemical Abstracts was 55 million. This is up from 4 million when I was in graduate school in the mid-1960s. In our data-driven world, data are proliferating with a doubling time of less than two years. Computer processor performance improves by about 60% per year. Sure, a great part of these macro measures of growth may be due to espionage and brute-force attempts to improve public safety, but one need only look at how Amazon and Google are mining data for commercial advantage. Data mining combined with analytics is being used to record and then engineer consumer behavior, the most important driver of our economy.
So why big data now? The simple answer: Now people have the tools that are starting to enable the process. Plus there is a need, since some data may be relevant for several decades. FDA inspectors ask, “Is the product you made today the same as that licensed in 19XX? Show me the data….” Drug regulation is a huge data-intense endeavor that continues for the product lifetime.
Spotlight on bioinformatics/genomics
Three tracks at the Molecular Med Tri-Con 2014 provided a forum to record successes and spotlight problems. The particular focus was bioinformatics, particularly genomics, but several noted that “data are just data.” Domain-specific knowledge provides the context and is thus the key differentiator. Ontologies are essential to understanding context.
Labs are data generators. A report from Cycle Computing (New York, NY) described a high-throughput screen of a 205,000-member library of candidate semiconductors for potential use as photovoltaics. Data analysis was complex, requiring integration of many data sources. Using the cloud, they assembled a network utilizing 156,000 microprocessor cores. Throughput was 1.21 petaflops, which compressed 264 years of computing to 18 hr—for a total computing cost of only $33,000.
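As a sanity check on those numbers, a quick back-of-envelope calculation (assuming that “264 years” refers to single-core compute time, which the report does not state explicitly) shows the 18-hr wall-clock result is consistent with near-ideal parallel scaling across 156,000 cores:

```python
# Back-of-envelope check; assumes "264 years" means single-core compute time.
HOURS_PER_YEAR = 365.25 * 24            # ≈ 8766 hr per year
serial_hours = 264 * HOURS_PER_YEAR     # total single-core work
cores = 156_000                         # cloud cores assembled
ideal_hours = serial_hours / cores      # perfect parallel scaling
efficiency = ideal_hours / 18           # vs. the reported 18-hr run
print(f"Ideal wall-clock time: {ideal_hours:.1f} hr")    # ≈ 14.8 hr
print(f"Implied parallel efficiency: {efficiency:.0%}")  # ≈ 82%
```

At roughly 82% of ideal scaling, the run was remarkably efficient for a job spread across that many cores.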
Life science labs generate huge data files, but electronic medical records, including whole-genome sequences for individual patients, will be even larger. Where to store the data? The first response is in the cloud, which promises to reduce costs by sharing servers. Dr. Angel Pizarro of Amazon Web Services (Seattle, WA) described Amazon’s commercial cloud service. On an intuitive level this seems attractive, since server farms are expensive to build and operate. There are problems, though some have been solved in just the last few months.
Uploads to the cloud
Genomics data repositories around the world provide redundancy and local access for research. And, as data are distilled to knowledge, these farms will be essential in serving the needs of regional populations.
IBM (Armonk, NY) is one firm that seems well positioned to lead in personalized diagnostics and therapy. For example, the principal sequencing operations, such as the Broad Institute, are committed to daily updates of each other’s files. Open communication is essential for intercenter cooperation. However, the files are too large to upload via file transfer protocol (FTP). The accepted workaround has been the “FedEx transfer”: today’s data are loaded onto multiterabyte hard drives and shipped by FedEx to the receiving sites, which connect them directly to update their server farms. Since the receiving site is generally also generating sequences, it transfers its own results to the hard drives and ships them back to the original source.
Aspera (acquired by IBM in January 2014) developed a capability to transfer very large files by bypassing FTP and looking for unused bandwidth between the send and receive nodes. Its patented software finds unused bandwidth and fills it with portions of the large file. Over time (usually seconds) the large file is on its way.
On the exhibition floor, IBM and Accelrys (San Diego, CA) exhibited results of their cooperation in applying modern analytics to life science/medical applications. Accelrys has domain-specific knowledge in the life science space that is complementary to IBM’s expertise. One example is a project at SUNY Buffalo (Buffalo, NY) to elucidate the genetics of multiple sclerosis (MS). About a million people globally suffer from a loss of cognitive ability due to inflammation and degeneration of the brain and spinal cord. The project started in 2008 with collection and scanning of genomes from MS patients to see if genomics could identify factors in asymptomatic individuals that contribute to the risk of developing MS. The simplistic search for a single marker was not fruitful. Many mutations combine with external factors (epigenetics) to initiate MS. To tease out the relevant factors, the SUNY team developed AMBIENCE to find linear and nonlinear correlations between the various factors. Initially, the computational problem seemed too complex, requiring evaluation of 10¹⁸ possible correlations.
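A number on the order of a billion billion candidate correlations arises naturally from simple combinatorics: with on the order of a million genomic markers (an assumed magnitude for illustration, not a figure from the project), even low-order combinations explode. A minimal Python sketch:

```python
from math import comb

snps = 1_000_000               # assumed order of magnitude for marker count
pairs = comb(snps, 2)          # two-way combinations ≈ 5.0e11
triples = comb(snps, 3)        # three-way combinations ≈ 1.7e17
print(f"pairs: {pairs:.1e}, triples: {triples:.1e}")
```

Add environmental covariates to each tuple and the search space easily reaches the 10¹⁸ range, which is why massively parallel hardware was essential.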
Using IBM’s PureData™ System for Analytics powered by Netezza® technology (parallel processors) and R from Revolution Analytics (Mountain View, CA), ultimately the compute time required was reduced from 27 hr to 11.7 min. Even better: non-IT scientists can write the query to test a new hypothesis.
Another example: The medical school at Vanderbilt University (Nashville, TN) found that IBM’s help was crucial to the success of a project that examined de-identified medical records of 2.2 million patients collected over the last 20 years. The database, called the Synthetic Derivative, contains both numerical and text data. It is an information-rich source that elucidates disease patterns and guides therapy. When combined with a separate genetic database (BioVU), it is possible to explore phenotypic factors. Initially, responses to queries required more than six months. However, with IBM’s PureData System using Netezza technology, response time is less than a minute. For informatics this is noteworthy since it demonstrates successful integration of two large databases with different structure and content, including natural language. For the researcher, it is now practical to quickly test hypotheses while the idea is fresh.
Still another example is IBM’s co-sponsoring and organizing the sbv IMPROVER Challenge series, where sbv stands for systems biology verification. This series has stimulated labs around the globe to hone their skills developing informatics tools to elucidate biochemical functions.1,2
When you already know what you’re looking for in your data, you tend to put on blinders and miss what else you could learn. But searching for what you don’t know, and testing new hypotheses against existing data sources, is what is referred to as “data discovery,” a concept that is extremely relevant to the life sciences. However, comparing results across data sources is often problematic due to varying data structures. A lecture by Dave Anstey of YarcData (a Cray company, Pleasanton, CA) described advanced computing hardware that facilitates data discovery using semantic technology supported by massively parallel processing.
In more detail, semantic technology with data in the RDF (Resource Description Framework) model has many advantages over relational data models. Because RDF represents data as graphs—nodes of information connected by named links—the model enables you to merge data more easily and provides additional context to the data. Graphs of considerable size are difficult to analyze without semantic technology because, as nonpartitionable, dynamic structures, they require large memory and rapid processing.
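The merge advantage is easy to see even without a triple store. In this minimal sketch (plain Python; the identifiers and the variant ID are illustrative, not drawn from any lecture), each RDF-style triple is a (subject, predicate, object) tuple, so merging two sources is just set union, and context is recovered by following the named links:

```python
# Illustrative triples only; identifiers and variant ID are hypothetical.
patients = {
    ("patient:42", "hasDiagnosis", "disease:MS"),
    ("patient:42", "hasAge", "54"),
}
genomics = {
    ("patient:42", "hasVariant", "rs3135388"),
    ("rs3135388", "locatedIn", "gene:HLA-DRB1"),
}
merged = patients | genomics   # schema-free merge: just union the graphs

# Follow named links for context: which variants occur in MS patients?
ms_patients = {s for s, p, o in merged
               if p == "hasDiagnosis" and o == "disease:MS"}
variants = {o for s, p, o in merged
            if p == "hasVariant" and s in ms_patients}
print(variants)   # {'rs3135388'}
```

With relational tables, the same merge would require agreeing on schemas up front; with triples, new predicates can be added to the graph at any time.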
YarcData addressed these unique needs of semantic technology by introducing the Urika™ appliance with up to 512 TB of shared memory and Cray’s Threadstorm™ massively multithreaded processors. The Urika appliance is designed to handle billions of triples in memory, starting at 8‒12 billion at the low end and scaling up from there, which greatly accelerates processing.
One benchmark study at Sandia National Laboratories showed that a 32-processor Urika appliance required only 30 sec of execution time for a multithreaded implementation, compared to 10.8 hr on a 48-processor traditional system after months of optimization.
For those concerned about transitioning from SQL to NoSQL technology, Mr. Anstey pointed out that YarcData partner IO Informatics (Berkeley, CA) markets Data Manager, which can quickly convert relational data (RDB) to RDF and back. Thus, you can perform searches across numerous databases and then convert back to familiar formats, such as Excel, for more traditional data storage and presentation.
Innovative technologies that facilitate data discovery are the way of the future: YarcData estimates that the data discovery market segment generated revenue of approximately one billion dollars in 2013, with a growth rate of over 30% per year.
Displays and human interface
Dr. Richard Scheuermann of the J. Craig Venter Institute (La Jolla, CA) described a complex process for distilling masses of biodata into knowledge. Ontologies help organize data from different files. Graph and statistical analysis of the data files supports generation of semantic assertions. The process uses 288 parallel pipelines. Data are organized by clustering using a variety of visuals, including Venn diagrams supported by complex annotated tables. Dr. Scheuermann briefly discussed improved visual presentation of outputs from a fluorescence-activated cell sorter. While these were improvements over conventional displays, it is clear that more creativity in data display would help. Dr. Scott Kahn of Illumina (San Diego, CA) also cited this need and opportunity.
Semantic searches often produce complex graph visual displays, referred to by several lecturers as (feline) “fur balls.” The problem is how to visualize a multidimensional relationship in two dimensions. Color, line width, and internodal distance are all used, but displays can still be incomprehensible and often unreadable. New designs for the human interface for complex, data-rich relationships are needed.
Server farms use huge amounts of energy. One estimate is that in 2012, server farms consumed about as much energy as the commercial airlines. This is about 2% of global energy consumption. However, the doubling time of information is less than two years. If this continues for a decade, extrapolated energy consumption could grow by a factor of 2⁵ (32×) to 2⁶ (64×). This forecast is probably not sustainable. But what will give? Will new energy-efficient computer technology sweep in?
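To make the arithmetic behind that forecast explicit (assuming, simplistically, that energy consumption tracks data volume one-for-one):

```python
# Simplistic extrapolation: energy assumed to track data volume one-for-one.
share_2012 = 0.02                # server farms ≈ 2% of global energy in 2012
doubling_years = 2               # data doubling time
horizon = 10                     # extrapolate one decade ahead
growth = 2 ** (horizon / doubling_years)   # 2^5 = 32x
print(f"Growth factor: {growth:.0f}x")
print(f"Implied share of 2012 global energy: {share_2012 * growth:.0%}")
```

A 32× increase would put server farms near two-thirds of 2012’s total global energy use, which is why the trend looks unsustainable without a step change in computing efficiency.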
Genomics: Do we even know what we do not know?
Genomics will certainly involve curating and interrogating very large databases. Prof. Eric Topol (Scripps Clinic, San Diego, CA) urges digitizing everyone. He adds that it is not the tumor’s location but its mutation that is relevant to cancer diagnosis and therapy. In a few cases, this works.
However, our understanding of genomics does not yet give an adequate picture of cancer’s etiology. “Junk DNA” is no longer junk. Circulating tumor cells (CTCs) indicate cancer, but when individual cells are sequenced, the DNA is very heterogeneous. Which mutation should be targeted? Solid tumor masses show similar heterogeneity. CTCs and biopsies are probably useful in diagnosis, but how should we use the result prognostically? Then, what about exogenous RNA and epigenetics? It is clear that we have a lot to learn.
The Molecular Med Tri-Con 2014 attracted 3200 scientists to San Francisco’s Moscone Convention Center from February 9 to 14 for an intense, multifaceted program split into 22 tracks covering a range of topics. The exhibition also provided an opportunity for multidisciplinary scientists to see product and service offerings. Three of the tracks dealt with modern data management. I want to congratulate Ms. Cindy Crowninshield, Director of Pharmaceutical Strategy at Cambridge Healthtech Institute (Needham, MA), for organizing the outstanding technical and support program for the informatics tracks.
Molecular Med Tri-Con 2015 will take place February 15‒20 in San Francisco. For information, visit http://www.triconference.com/.
Robert L. Stevenson, Ph.D., is Editor, American Laboratory/Labcompare; e-mail: firstname.lastname@example.org.