R&D Informatics at Molecular Med Tri-Con 2013

Integrated R&D Informatics and Knowledge Management (I&KM) was the title of one of the 15 programs spread over three days at the 20th Molecular Med Tri-Conference held February 13–15 at the Moscone Center in San Francisco, CA. The programs were grouped into five channels: Diagnostics, Therapeutics, Clinical, Cancer, and Informatics. I chose to focus on I&KM since it is an issue common to all the other programs. Indeed, I&KM is the sorest rub point.

Several factors are responsible for the interest in I&KM. R&D has become much more complex, with work flows involving massive files (TB to PB) of data. Data complexity is larger than can be handled by any single human. Programs require collaboration of a mixture of specialists in complementary domains. Information technology is split between natural language processing (NLP) and processing very large data sets (big data). Traditional IT functions are not adept at handling the new technology for I&KM.

It does not help that I&KM means many different things, depending on the particular environment and issue set. The simplest case is the chemist working on a paper or report that includes data from several techniques such as nuclear magnetic resonance (NMR), infrared (IR), mass spectrometry (MS), and X-ray crystal structure. The chemist often has to deal with data files with quite different structures. The I&KM life cycle here is short: do the work, gather the data, write it up, graduate, move on, and forget it.

Multidisciplinary research and collaboration among domain experts

Research is becoming multidisciplinary, where data are distilled to information and then transferred between experts with domain-specific knowledge within an organization such as a department or academic research lab. This is intraorganization I&KM. An example is a research group developing protein biomarkers. It might include collaboration among domain experts in biology, biobanking, assay development, liquid chromatography, LC-MS, statistics, and informatics. The degree of collaboration of the team members might vary considerably, but the output requires positive contributions of each domain expert.

The next step of complexity involves interorganization I&KM. Examples include data integration between a client and CRO or between a drug sponsor and the FDA. This is the most formal, with written protocols and specifications. In contrast to intraorganization I&KM, transfer of money is usually involved in interenterprise I&KM. Lawyers often join late in the program, which complicates and slows things down.

Crowd sourcing

Crowd sourcing (CS) is the newest form of I&KM. CS solicits participation from a large number of individuals or groups through an open call. This is the most extreme form of integration and knowledge management, since it invites participation by potentially anyone, even those who lack domain-specific training. Crowd sourcing is based on the hypothesis that if everyone contributes, the best performers will rise to the top. CS really makes the most sense when the status quo with domain experts is not working.

The meeting organizers, Cambridge Healthtech Institute (CHI, Needham, MA), prepared a lecture program that provided examples of each type of I&KM. The lecturers reported the idiosyncrasies of each. At the end of three days, it was clear that I&KM is a complex problem that is context specific. Let’s start with two examples of crowd sourcing since it is the most unusual.

Keynote on personalized oncology

The technical program opened on Wednesday with a plenary keynote discussion session on personalized oncology organized and chaired by Mr. Marty Tenenbaum, who founded Cancer Commons. Tenenbaum is a 15-year survivor of melanoma. He was initially puzzled as to why the therapy (drugs) in his clinical trial worked for him but not for most others. He may have benefited from a random conjunction of his particular form of melanoma and the therapy he was prescribed. Plus, he was fortunate that he was not in the control cohort.

Over the years, Tenenbaum has become an activist for personalized oncology. He pointed out that 900 oncology drugs are in the development pipeline. Many are for exceedingly rare cancers. How will the oncologist sift through the many options? Tenenbaum believes that HIPAA regulations inhibit connecting the dots from specific clinical trials to treatment and on to precision therapy. With rare cancers, one needs an index of others that have related genetics, therapy, and outcome. This could be particularly useful in guiding patients to the most suitable therapy or clinical trials.

Although HIPAA inhibits such open communication at the institutional level, it does not apply to individuals. Hence patients can volunteer information. Cancer Commons was founded as a forum for cancer patients to deposit their records for use by others. Records include institution, provider, cancer diagnostics and classification, and therapy plan: potentially anything and everything that might be relevant.

Even if nothing works, this negative information would be useful to warn subsequent providers not to go down a well-traveled path that leads to poor outcomes. Trying another may not work in every case, but at least intuitively there is hope for finding a new and better path. He has found general acceptance of this approach among cancer patients, especially those with poor prognosis. For those who are interested, Tenenbaum is recruiting collaborators at www.cancercommons.org.

So what will this crowd sourcing effort look like? Most likely a collection of data in multiple formats and of unspecified history, provided by sources who may see the exercise as an obligation imposed by a few fringe patients. The lack of uniformity and potentially variable data quality will present a significant I&KM problem.

IMPROVER Challenge

Another example of crowd sourcing was in the booth and lecture of sbv IMPROVER (where sbv stands for systems biology verification). sbv IMPROVER is a global collaboration involving IBM (Zurich, Switzerland) and funded by PMI (Neuchatel, Switzerland). PMI, once part of Philip Morris Inc., is very interested in developing tools for assessing toxicity. PMI’s vision is that improved tools might facilitate development of reduced-risk tobacco products.

The IMPROVER model is based on the Jeopardy! Challenge, where the massive computing power of Watson was able to outperform two gifted humans to claim the $1 million prize. The first IMPROVER round started in March 2012 with a challenge to find genetic signatures for four data sets that were selected for psoriasis, multiple sclerosis, chronic obstructive pulmonary disease (COPD), and lung cancer. Results were presented at a meeting in October 2012 and will be published in 2013. In addition to the recognition, the winners will receive $50,000 for research support.

Species Translation, the second challenge, is scheduled to start in Q2 2013 (www.sbvimprover.com/). The goals are: 1) identify rules that map measurements derived from high-throughput screening (HTS) from one species (mouse, rat, human, etc.) to another; 2) establish limits of species translation; and 3) quantify species translation.

IMPROVER is certainly an innovative and ambitious program. Holding a bake-off with significant monetary and reputation building rewards will attract participation. The financial clout of both sponsors adds to the probability of success. So, best of luck!

Personal informatics and knowledge management (small data)

At the personal level, the most challenging problems with I&KM arise from finding legacy data that may be stored in an obsolete file format or extension. Usually, there are ways to translate between file formats, but this becomes more difficult with the passage of time. This leads to lost data, which is a growing problem. Academic laboratories face especially difficult issues since the unwritten “tribal knowledge” is lost as the graduates move on.

Intragroup I&KM

Data integration within an administrative group such as an academic laboratory or industrial department adds to the complexity of data sharing. Many people are reluctant to share their data for a variety of reasons, stated or private. Most often the reluctance stems from a lack of trust in colleagues and/or management. Personal insecurity is another potential factor. The result is a series of information silos that hinder meeting the goals of the work group.

Silos were a common problem mentioned in at least half of the lectures. Dr. Chris Waller of the ChemoInformatics Group at Merck (West Point, PA) addressed them head on. Legacy work flows did not support information sharing because the technologies did not exist. Plus, staff was not incentivized to collaborate. However, today’s projects are large and multidisciplinary, which requires collaboration. Collaboration necessitates sharing information. Interoperability between silos requires adopting new technologies and rewarding collaborative behaviors. The new technologies he mentioned were standard ontologies from W3C and semantic platforms from IO Informatics (Berkeley, CA). The collaborative platforms he cited included CDD Vault Web-based software from Collaborative Drug Discovery (Burlingame, CA), HEOS® from Accelrys (San Diego, CA), and Semantic Fingerprint™ technology from Praxeon™ (Jamaica Plain, MA). Changing the culture is another issue, one that was less well developed. Incentives that reward collaborative behavior need to be created. Also, people need time to adjust to the new way of working. Waller listed many collaborative opportunities from the NIH, FDA, U.K., etc.

Dr. Arturo Morales, Global Leader of Data Federation for Novartis (Basel, Switzerland), discussed problems integrating data and knowledge for a global pharmaceutical firm with research centers in Europe, Asia, and America. First, Novartis insists on a common language, English, but the ability to communicate in English varies, and the terms used can differ significantly. Hence, Novartis has developed “authoritative vocabularies.” The vocabularies are domain specific. Electronic laboratory notebooks (ELNs) are useful in facilitating global use of the correct words. As an example, Morales noted that Novartis has over 25,000 assays, and the reporting units can differ, e.g., μg/L or ng/mL. All too often, researchers waste time developing and validating a new method because they are unaware of legacy methods. His efforts are a work in progress; progress is being made, but the job will probably never be completed.
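To make the vocabulary and unit problem concrete, here is a minimal Python sketch of what harmonizing a single assay record might look like. It is not Novartis's system, which the lecture did not describe in detail. The term list, the unit table, and the names `AssayResult` and `harmonize` are all hypothetical, chosen only for illustration. The one firm fact is the arithmetic: 1 μg/L equals 1 ng/mL, so the two units Morales cited differ only in notation.

```python
from dataclasses import dataclass

# Hypothetical "authoritative vocabulary": variant spellings mapped to one
# preferred term. A real vocabulary would be domain specific and far larger.
PREFERRED_TERMS = {
    "ic50": "IC50",
    "ic 50": "IC50",
    "half maximal inhibitory concentration": "IC50",
}

# Conversion factors from each reporting unit to one canonical unit (ng/mL).
# 1 ug/L = 1000 ng per 1000 mL = 1 ng/mL, so the factor is exactly 1.
TO_NG_PER_ML = {
    "ng/mL": 1.0,
    "ug/L": 1.0,
    "ug/mL": 1000.0,
    "mg/L": 1000.0,
}


@dataclass(frozen=True)
class AssayResult:
    term: str
    value: float
    unit: str


def harmonize(result: AssayResult) -> AssayResult:
    """Return the result with its preferred term and its value in ng/mL."""
    key = result.term.strip().lower()
    if key not in PREFERRED_TERMS:
        raise KeyError(f"term not in vocabulary: {result.term!r}")
    if result.unit not in TO_NG_PER_ML:
        raise KeyError(f"unknown reporting unit: {result.unit!r}")
    return AssayResult(
        term=PREFERRED_TERMS[key],
        value=result.value * TO_NG_PER_ML[result.unit],
        unit="ng/mL",
    )


# Two sites report the same measurement with different wording and units.
site_a = harmonize(AssayResult("IC 50", 5.0, "ug/L"))
site_b = harmonize(AssayResult("ic50", 5.0, "ng/mL"))
print(site_a == site_b)
```

Rejecting unknown terms and units, rather than guessing, is a deliberate choice in this sketch: an unrecognized entry is a prompt to extend the vocabulary, not something to coerce silently.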

During Thursday’s plenary, Dr. Gary Kennedy of Remedy Informatics (Sandy, UT) discussed an enlightened approach to IT. Ontologies are key to harmonization of data. He advises trying to separate the question from the way the question was asked. The benefits of improved collaboration were expanded upon further by Prof. Mark Musen of Stanford University (CA). By his count, there are over 300 domain-specific ontologies in the life sciences, and the list is growing. Stanford has developed a system named Protégé to assist in the creation of ontologies. It has a global user base of 200,000.

Three lectures reported case histories of developing I&KM systems for R&D groups. Their advice was: 1) keep the software development team small (no more than four) and listen carefully to what the end user says and why; 2) avoid consulting management since they often are unaware of what is really needed at the bench; and 3) start simple to show some success, which helps to get buy-in. The lecturers seem to have a vision and are on a mission to deliver. They deliver and then go on to the next project and repeat. One has to wonder how scalable this model is.

Interenterprise I&KM

Intraenterprise I&KM is difficult, but interenterprise I&KM adds another layer of complexity: different organizations, each with its own motives and culture. Thus, I found it surprising that some large pharma firms are outsourcing key functions in discovery and development. For example, lectures by Dr. Arun Nayar of AstraZeneca (Wilmington, DE) and Dr. Phyllis Post of Merck described the startup of collaboration programs.

In early 2012, AstraZeneca announced the creation of the virtual Neuroscience Innovative Medicine Group patterned after virtual biotechs. At the core was an informatics platform (called FIPnet) designed to support external preclinical and clinical work. The model involved much simultaneous collaboration to generate data and distill them to knowledge. IT services are provided through the cloud, which avoids having to scale in-house capacity for maximum load. It took only nine months for FIPnet to be fully operational. Diagrams of work flow support the process. When help is needed, knowledge leaders are identified with KNODE (www.knodeinc.com).

The AstraZeneca program delegated responsibility to find “good enough” or “suitable for purpose” solutions. Both Merck and AstraZeneca recognized the need to stay away from legacy in-house resources. Some of these appear to be candidates for outsourcing. Both companies also recognized the need for major revision of traditional security and IP protection. This also seems to reduce the corresponding expectation of rewards.

In terms of specific systems, the firms reported using products from major vendors, particularly Accelrys. The number of people involved favored vendors with sufficient staff to provide product support for large departments.

New tool for big data

The most dramatic example of new vendors supporting the new collaborative world was a lecture by Dr. Lukas Karlsson of the Broad Institute (Cambridge, MA). His group was charged to “create the best collaboration experience to accelerate ground-breaking science….” Broad decided to partner with Google since Google’s Enterprise Services appeared technically advanced and well supported. For example, Google’s Cloud Platform included modules called “Storage,” “BigQuery,” “SQL Compatibility,” “Prediction,” and “Translation,” in addition to its dominant position in search. For a new project, Google “Group” is used for “Collaboration,” and “Calendar” is used for scheduling events. “Drive Folders” automatically share content with the team and keep track of versions. “Hangout” supports virtual meetings and presentations. Google really seems to offer interoperable packages that merit consideration for large-scale I&KM.

Summary and credits

The Molecular Medicine Tri-Conference series celebrated its 20th anniversary at the February 11–15 meeting. The series started with three biorelated channels with a common exhibition. Over time, it has grown to five channels encompassing 15 programs. The 2013 meeting attracted over 3100 bioscientists and vendors to a sunny, warm week in San Francisco. The exhibit hall provided a forum for more than 50 posters and 170 vendors, including 30 first-timers. A quick “walk-by” survey of attendance in the lecture halls found that sessions on circulating tumor cells (CTCs) were the best attended. Interest in CTCs has been growing rapidly during the last four Tri-Conferences.

The staff of Cambridge Healthtech Institute deserves special recognition for organizing a complex program that required the collaboration of 40 people. A meeting like Tri-Con involves cooperation and hard work, so all at CHI should be proud of their accomplishment.

Robert L. Stevenson, Ph.D., is a Consultant and Editor of Separation Science for American Laboratory/Labcompare; e-mail: rlsteven@comcast.net.
