Molecular-Med Tri-Con 2012: Big Information Technology Faces Gigantic Problems

Next-generation sequencing technology is capable of sequencing human genomes for $1000/patient in about a week. But what does this mean? Not much on its own. The first problem is assembling the DNA reads into a sequence, which is computationally intensive. Then the genome sequence needs to be compared with existing databases to search for diagnostically significant findings. You have entered the realm of big data, where the data files are so large that they challenge current information technology (IT). For example, the Xeon® E7 processor family from Intel® (Santa Clara, CA) is limited to 2560 processor cores and 16 TB of memory. Some genomic sequencing labs produce that much data in less than a week. Big data also challenges “The Cloud,” since data transfer is too slow. The need for massive computing will only increase. For example, we may need to sequence individual circulating tumor cells for cancer therapy.

In addition to big data, big pharma is challenged by size-related IT problems that are much more complex, i.e., big IT. The challenges are not just genomics. Big pharma also faces the problem of providing big IT support for research centers around the globe. Novartis (Basel, Switzerland) reports that high-content screening is a huge challenge since its bioassay library holds 25,000 protocols, many of which are quite complex to describe.

These and other problems with big IT were discussed in two parallel tracks of the 19th International Molecular Medicine Tri-Conference held at the Moscone Center in San Francisco, CA, February 19–23, 2012.

In ’omics research, IT issues force researchers to simplify their hunt for markers and correlations. Biology is complex. Cofactors undoubtedly exist. Searching for correlations with blinders that limit the field of view to three, four, or five agents will probably provide only low-definition pictures.
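A rough count shows why the field of view gets narrowed. The sketch below is illustrative only: the pool of 1,000 candidate agents is a hypothetical figure, not one reported at the conference, and `count_combinations` is a helper name invented for this example.

```python
from math import comb


def count_combinations(pool_size: int, group_size: int) -> int:
    """Number of distinct groups of `group_size` agents drawn from `pool_size` candidates."""
    return comb(pool_size, group_size)


# Hypothetical pool of candidate agents (an assumption, not a figure from the article).
POOL_SIZE = 1_000

# Group sizes of three, four, and five agents, as mentioned in the text.
for group_size in (3, 4, 5):
    total = count_combinations(POOL_SIZE, group_size)
    print(f"{group_size} agents: {total:,} possible groups")
```

Under this assumption, each additional agent in the group multiplies the number of combinations to test by a factor of a few hundred, which is one way to see why searches are kept small.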

High-content screening

Today, the work flow has changed to employ automated high-content screening technology even for formulation work. Screens themselves are not so simple. When I was at the bench and witnessed a result, I asked myself: Was the observation consistent with expectation? Then I followed up as needed. Occasionally, I enjoyed an “aha” moment and, with great excitement, could drop everything to explore an unexpected result. With big data, all one sees is the large, dense forest. Which trees merit closer study and why? With big data, the research team is also big, and typically includes a statistician, an IT person, and scientists with domain-specific knowledge. Collaboration, aka teamwork, is required. But how does one build and manage a team, especially in high-turnover settings? Key people complete their degree requirements and leave, creating a hole in the team. Plus, the experiments themselves take time to plan and execute. Too often, by the time the experiment is completed, the principal investigator (PI) has forgotten the original thought and has moved on to a new idea.

The structure of databases is yet another constraint. Dr. Arturo Morales of Novartis reports that the company does not have a common ontology for global research operations, so the way results are recorded depends on the investigator. Further, federated database programs have had only a three-year life cycle from design to obsolescence because of exploding demand for additional features or sites. Adding more capability quickly makes the human interface unnavigable except for the IT expert. Documentation falls behind; technology changes; and, in only a couple of years, it is time to start anew.

Then there is the fear factor. Nearly every leading researcher is working to “cure cancer” or another nefarious disease. Each hopes to be recognized as the one who “cured XXX.” Thus, many PIs hold their data close, lest some sharp data-miner find the golden needle overlooked in their haystack.

Managers of big research have responded with requirements that primary data and metadata be published. Human genome data from the large sequencing operations, such as the Broad Institute (Cambridge, MA), are a noteworthy example. However, the collected library is too large for general access. Within the proprietary databases of big pharma, the key annotations, including the motivation for the experiment, are usually inadequate if they exist at all. These are big problems, and there may be more. Yet the need is also huge and growing.

Dedicated processors for genome assembly and high-content analysis

It is time to step back and examine the infrastructure. What about the computers, including servers? Can performance (speed) for specialized tasks be improved by making application-specific, optimized processors? This worked well two decades ago when task-specific video cards augmented the general-purpose CPU for games.

For gene sequencing, the work flow is standardized and must be repeated exactly for every sample. Engineers at Bina Technologies (Los Altos, CA) see an opportunity to accelerate the computer-intensive part of the work by creating application-specific processors where particular functions are hardwired into the chip, which increases speed at the expense of flexibility. Bina is making systems using dedicated chips for the repetitive operations such as assembling and matching. For example, a general-purpose computer requires about a day of computing time to assemble one genome. With a focused-design assembler chip, the time is less than 30 minutes.
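The arithmetic behind those two figures can be sketched as follows. This is a back-of-the-envelope calculation, not a benchmark: "about a day" is taken as exactly 24 hours, "less than 30 minutes" is used as an upper bound, and the function names are invented for this example.

```python
def speedup(baseline_minutes: float, accelerated_minutes: float) -> float:
    """Ratio of baseline run time to accelerated run time."""
    return baseline_minutes / accelerated_minutes


def genomes_per_day(minutes_per_genome: float) -> float:
    """Assemblies one machine could finish in 24 hours of continuous running."""
    return 24 * 60 / minutes_per_genome


# "About a day" on a general-purpose computer, taken here as 24 h (assumption).
GENERAL_PURPOSE_MINUTES = 24 * 60
# "Less than 30 minutes" on the dedicated chip, so 30 is an upper bound.
DEDICATED_CHIP_MINUTES = 30

factor = speedup(GENERAL_PURPOSE_MINUTES, DEDICATED_CHIP_MINUTES)
print(f"Speedup: at least {factor:.0f}x")
print(f"General-purpose: about {genomes_per_day(GENERAL_PURPOSE_MINUTES):.0f} genome/day")
print(f"Dedicated chip: at least {genomes_per_day(DEDICATED_CHIP_MINUTES):.0f} genomes/day")
```

Under these assumptions the dedicated chip is at least 48 times faster for the assembly step, which covers only the computing, not the sequencing run itself.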

What does the future hold? Let’s look at circulating tumor cells (CTCs). When the concentration of CTCs exceeds 5 cells/mL, the prognosis for the patient is bleak. The ideal number is zero. The cells are believed to be responsible for metastasis of tumors. Further, the CTCs can point to the organ of origin. This is relevant for classifying a recurrence or new disease.

The first problem with CTCs is the low concentration, which is seldom higher than 10 cells/mL. In contrast, 1 mL of whole blood contains several billion red blood cells and several million white blood cells. The cells are usually identified by their unusual morphology. Within the last year, instruments have been introduced for purification of CTCs from whole blood.
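To put the rarity in numbers, the sketch below computes how many background cells accompany each CTC. The CTC count of 10 cells/mL comes from the text; the values used for "several billion" and "several million" (5 billion and 5 million) are placeholders assumed for illustration, and `cells_per_target` is a hypothetical helper.

```python
def cells_per_target(background_per_ml: float, target_per_ml: float) -> float:
    """Number of background cells present for each target cell in the same volume."""
    return background_per_ml / target_per_ml


CTC_PER_ML = 10   # upper end of the typical range given in the text
RBC_PER_ML = 5e9  # placeholder for "several billion" (assumption)
WBC_PER_ML = 5e6  # placeholder for "several million" (assumption)

print(f"Red cells per CTC:   {cells_per_target(RBC_PER_ML, CTC_PER_ML):.0e}")
print(f"White cells per CTC: {cells_per_target(WBC_PER_ML, CTC_PER_ML):.0e}")
```

Even at the high end of the CTC range, each tumor cell would sit among hundreds of millions of red cells and hundreds of thousands of white cells under these assumptions.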

The isolated CTCs are heterogeneous. Part of the heterogeneity is no doubt due to their individual status in the cell cycle. The genomes are also mutated from that of the host. The mutations involve localized insertions or deletions, which can be probed with hybridization chips, and scrambling, where large segments of DNA are repositioned into nonnative chromosomes. Thus, connectivity is important, requiring whole-genome sequencing of individual CTCs at least at the research stage. If we need several genome sequences per patient, sample load increases proportionately.

High-content analysis involving fluorescence imaging of live cells is another information-rich application where dedicated computers should prove useful. Current technology does not permit real-time data analysis and response classification except on very restricted data fields.

Automated imaging could benefit from improved speed in sorting and classifying images. Current practice with trained humans is fraught with problems such as operator fatigue and variability. One can add more staff, but this is expensive and does little to improve reproducibility and accuracy.

Surely there are other examples, for, as science learns to use automated protocols, there will be a need for higher-speed IT. This will probably require IT with different architecture.

Robert L. Stevenson, Ph.D., is a Consultant and Editor of Separation Science for American Laboratory/Labcompare; e-mail: rlsteven@comcast.net.
