World Genome Data Analysis Summit 2012: Big Data, Big Problems, and Thoughtful Solutions

The World Genome Data Analysis Summit attracted about 70 scientists and engineers to the Monaco Hotel in San Francisco, CA, November 27–29, to talk shop on DNA data. It quickly became apparent that the issues extend well beyond data analysis: the entire workflow is at stake. Every step presents options that must interoperate with the preceding and subsequent steps, and the massive scale compounds the problem. Next-generation sequencing (NGS) is delivering genomic data faster than they can be analyzed into information; only a small fraction is distilled into a form that supports decision-making. And the trend is toward even more data, delivered faster and combined with more sophisticated tools.

Genome data analysis is all about computer-enabled research. The body of literature is large and of variable quality. Annotations were cited as a particularly important and variable example. They are being curated by programs that search for prior reports using natural-language technology, but all too often the original report is not accessible: an annotation is attributed to a lecture or a poster that was never posted online and never made it into print. Some annotations seem to be based on intuition combined with the need to fill in a box…with anything.

Central information technology (IT) received particularly bad press. Part of the problem is the huge range of needs among researchers: IT staff are pulled in many directions by the disparate demands of different principal investigators (PIs), who are their clients. Another factor is that IT staffs have professional opinions about software, which is often a rub point with the PI. The PI wants to use the latest tools, often shared among colleagues on an open-source basis; the IT staff fears that open source means declining user support over time, hence their preference for commercial software.

During a discussion after a lecture, Dr. Richard LeDuc of Indiana University’s National Center for Genome Analysis Support related that one common requirement for a master’s degree is to write and test a new program. Generally, these programs are open-sourced. Upon graduation, the student moves on, leaving behind an orphaned program. IT professionals are all too familiar with this scenario, which leads to large issues down the road.

Genome sequencing is a team sport

PIs often start sequencing and gathering data without a good data plan: “Let’s put our line in the water and see what we can find.” Then they find something, but is it a fluke, or is it likely to mean something? Often, the determining criterion is, “Is it publishable?” On the receiving end, the question is, “How much faith should I place in this report? I know that some reports are trustworthy and others less so, but how do I evaluate them?”

Another issue is that the tools are so sophisticated that a team of specialists is necessary. A statistician is needed for experimental design on the front end and for evaluating the reliability of the results at the back end. A biobanker must control and manage the samples and preserve their integrity. A sequencing specialist runs the instruments and delivers a mass of base calls on billions of fragments, which are then assembled and aligned. Another experienced person then compares the assembled sequence with others; errors need to be identified, investigated, and resolved. Now the data are starting to be upgraded to information. Annotation is the next step. Eventually, the team’s head coach (usually the PI) has to relate the output to the purpose of the project, draw a conclusion, and write the paper. This is a team approach, and as in sports, the outcome depends on the performance of all the team members.

Exome sequencing versus whole genome sequencing

The first step is to define the problem and what constitutes success. Most attendees had the goal of ultimately contributing to improved human health care. Human health is a huge topic with many contributing factors, and a holistic approach is not conceivable with today’s technology, so scientists must focus on elucidating particular subsystems. We have some data on genomes, but the data set is too large, so some choose to focus on maladies associated with the protein-coding regions of DNA (the exome). They look for transcription products of exons to develop causal relationships that can aid diagnosis and treatment. However, the human exome accounts for only about 30 megabases of the 3-gigabase human genome. This is a narrow window, useful only if you really know what you are looking for.
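The fraction involved is worth making concrete. A back-of-the-envelope check, using the figures quoted above (my arithmetic, not a number from the talks):

```python
# How much of the genome does the exome actually represent?
exome_bases = 30e6      # ~30 megabases of protein-coding sequence
genome_bases = 3e9      # ~3 gigabases in the human genome

fraction = exome_bases / genome_bases
print(f"Exome share of the genome: {fraction:.1%}")  # -> Exome share of the genome: 1.0%
```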

In contrast, whole genome sequencing (WGS) is a longer process, but it should provide higher coverage with lower sampling bias than exome sequencing. Some regions of the genome, however, do not sequence as easily as most, so the percent coverage in those regions may not be sufficient to warrant high confidence in the results. Moreover, different base-calling, assembly, and alignment programs often disagree on the same sample data. This is damaging, especially if single-point mutations are the target.

For example, Prof. Liz Worthey of the Human and Molecular Genetics Center of the Medical College of Wisconsin elected to use next-generation whole genome sequencing to study pediatric patients who had not been helped by exome diagnostics. Although still in the pilot stage, her team has found causal relationships in about half the patients studied to date, many of which have been useful in guiding therapy.

In the discussion following Worthey’s lecture, there seemed to be a general consensus that whole genome sequencing (of both DNA and the transcriptome) will largely replace exome sequencing. The price of WGS is declining rapidly; today, exomes are probably still price-competitive for five samples or fewer, but with the advances in WGS, any price advantage for exomes will probably disappear in a year or two. With WGS, one avoids the gnawing question, “Am I missing something?”

Next-generation data analysis

In NGS, the process starts with the sequencer delivering a fire hose of base calls on short (50–200 bp) reads. Fragmentation is not truly random: some segments are harder to fragment than others, making reads from those segments rare. This is countered by fragmenting many DNA copies, typically 30–60-fold, in order to get linkage across the rare segments. Even so, this typically provides only about 95–99% coverage; completing the remainder requires special (slower) care or new technology.
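One way to see why high nominal depth can still leave gaps is the idealized Lander–Waterman model, a textbook assumption I add here for illustration (it was not presented at the meeting): under uniform random fragmentation at mean depth c, the expected fraction of the genome covered by at least one read is 1 − e⁻ᶜ. The 95–99% figures above then correspond to effective local depths of only ~3–5× in the hard-to-fragment regions, even when the genome-wide average is 30–60×.

```python
import math

def expected_coverage(mean_depth: float) -> float:
    """Expected covered fraction under idealized uniform random fragmentation."""
    return 1.0 - math.exp(-mean_depth)

for depth in (3, 5, 30):
    print(f"{depth:>2}x effective depth -> {expected_coverage(depth):.2%} covered")
```

At 3× the model predicts ~95% coverage and at 5× about 99.3%, while 30× would cover essentially everything; biased fragmentation is what drags the effective local depth down.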

Typically, then, the redundancy is 30–60×, and this is where the data start to become challenging. The raw data for one human genome run about 15 GB; at 60× redundancy, that is 60 × 15 = 900 GB, or roughly a terabyte. Simply transmitting these data is a difficult issue. At the meeting, there was general consensus that base calling and alignment should be done in close proximity to the sequencer. IT departments often fight this, until they succumb to the headaches.
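The data-volume arithmetic above can be reproduced directly (figures from the text):

```python
gb_per_pass = 15        # ~15 GB of raw data per 1x pass over one human genome
redundancy = 60         # 60x oversampling

total_gb = gb_per_pass * redundancy
print(f"{total_gb} GB of raw data, i.e. roughly {total_gb / 1000:.1f} TB")  # -> 900 GB, ~0.9 TB
```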

A new firm, Bina Technologies (Redwood City, CA), recognized an opportunity to build a standardized interface for secondary analysis of the raw data from the base caller. Secondary analysis includes alignment of the sequence fragments to produce the genome sequence, followed by variant detection, annotation, and data-quality evaluation. Bina packages this as the “Bina Box.” The box houses hardwired special-purpose kernels to speed up the data-intensive, repetitive tasks. Processing time is reduced by a factor of about 100, to an hour or two, and the data file for the finished genome is reduced by a factor of about 1000, making it compatible with routine transmission to enterprise data centers or to the cloud for tertiary analysis, which includes data mining and comparative genomics.
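The rough scale of those reductions, applied for illustration to the ~900 GB of raw data per genome estimated earlier (the 1.5-hr figure is my assumption, the midpoint of “an hour or two,” not a vendor number):

```python
raw_gb = 900               # raw data per genome from the earlier estimate
file_reduction = 1000      # claimed ~1000x smaller finished-genome file
speedup = 100              # claimed ~100x faster processing
accelerated_hr = 1.5       # assumed midpoint of "an hour or two"

finished_gb = raw_gb / file_reduction
baseline_days = accelerated_hr * speedup / 24
print(f"Finished file ~{finished_gb:.1f} GB; unaccelerated run would be ~{baseline_days:.1f} days")
```

A sub-gigabyte finished file is what makes routine transmission to a data center or the cloud plausible.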

In very large data compilations, rare random events are expected to be observed: even an error rate of one in a million will produce 3000 errors in one human genome. Different alignment protocols deliver different results on the same sample. If the Bina Box, or a similar processor, becomes the industry standard, a well-documented set of idiosyncrasies could help; right now, this information seems to be tribal knowledge.
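The error arithmetic checks out:

```python
error_rate = 1e-6        # one error per million bases
genome_bases = 3e9       # ~3 billion bases in the human genome

expected_errors = error_rate * genome_bases
print(f"Expected errors per genome: {expected_errors:.0f}")  # -> 3000
```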

A cloud on the horizon

Cloud computing was a contentious topic. The debate covered several questions: Where does the cloud fit in genome data analysis? The cloud is not suitable for huge files, such as raw reads or thousands of genome data sets, since transmission is too slow. Where do the data reside? Is storing patient data in the cloud compliant with HIPAA? Even cost estimates varied greatly.

Dr. Shanrong Zhao of Janssen Pharmaceuticals (Titusville, NJ) added the voice of experience with a lecture on “Large-Scale Whole Genome Sequencing Data Analysis in the Amazon Cloud.” Amazon (Seattle, WA) is a leading vendor of cloud computing.

Running in-house, a whole genome data package consisted of 70–180 GB of raw base calls, which took 2 hr to load. Extraction of the raw reads required 460 GB and two days of processing. Alignment at 60× coverage took 867 GB and 11 days. SNP calling required 250 GB and 10 days. The processing stages add up to roughly 1.5 TB of storage and 23 days of computer time. Janssen faced a major IT expense to improve turnaround time (TAT).
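Totaling the stage figures quoted above: the storage sum comes out just under 1.6 TB, consistent with the roughly 1.5 TB total, and the processing days sum to exactly 23.

```python
# (stage, storage in GB, processing time in days), from the text
stages = [
    ("raw read extraction", 460, 2),
    ("alignment, 60x",      867, 11),
    ("SNP calling",         250, 10),
]

total_gb = sum(gb for _, gb, _ in stages)
total_days = sum(days for _, _, days in stages)
print(f"~{total_gb / 1000:.1f} TB of storage, {total_days} days of processing")
```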

Amazon’s cloud offered potential time savings, since data storage and CPUs are effectively unlimited. There are concerns, however: How does one reserve, configure, and launch a study? Can one ensure that all the computers are “healthy”? When done, how does one release the resources? Amazon caps data transfers in and out at 5 GB, which is much too small. What is the workaround?

When a data file is too large for the wires, the accepted way to transmit it is still to record it on hard drives that are sent to the new host by air express. Internet2 may offer a way around this, but its nodes are not plentiful in the U.S.A. The TAT was 53 hr in the cloud versus 23 days in-house. A project involving 50 WGS runs would finish in parallel in less than three days in the cloud, compared with roughly three years for serial processing in-house, and would require about 60 TB of storage. Using Amazon, the cost was ~$120 per sample from BAM file to SNP calls.
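The turnaround comparison follows directly from the per-genome figures quoted above:

```python
in_house_days_per_genome = 23
cloud_hours = 53           # cloud TAT; the genomes run in parallel
n_genomes = 50

serial_years = n_genomes * in_house_days_per_genome / 365
cloud_days = cloud_hours / 24
print(f"In-house, serial: ~{serial_years:.1f} years")   # -> ~3.2 years
print(f"Cloud, parallel:  ~{cloud_days:.1f} days")      # -> ~2.2 days
```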

The Janssen team learned some lessons about the cloud: 1) be prepared to request, and then wait for, the required resources, especially storage; 2) moving a large data set within the cloud is not a trivial exercise; 3) cloud computing is not yet as robust as expected; and 4) premium customer support is recommended.

Credits and comments

The staff of Hanson Wade (London, U.K.) deserves special recognition for choosing excellent, authoritative speakers; two of the three best lecturers I heard in 2012 were at this meeting. Although the lectures jumped between DNA sequencing and RNA, this seemed a minor issue compared with the problems with data. The program included two networking sessions that were effective in getting attendees talking.

The downside was poor projection. Lectures often squeezed the font size to fit too much on a PowerPoint® slide; this is a common problem, but the projectors supplied by the hotel were not adequate even for well-prepared slides. A slide that looks great on a computer display can look quite different when projected on screen. Some of the small laser pointers also lacked the power to be seen clearly, even in the front row, where I customarily sit. Meeting organizers should invest in quality AV projectors and pointers that are air-shipped to each meeting.

All in all, World Genome Data Analysis Summit 2012 provided thoughtful solutions to big problems with big data.

Robert L. Stevenson, Ph.D., is a Consultant and Editor of Separation Science for American Laboratory/Labcompare; e-mail: