Regardless of their area of specialization, today’s biotechnology companies want to do two things: improve the speed and success of their research efforts and lower the costs involved in doing so. The “and” is the tricky part. In an increasingly global and distributed research environment, collaboration and information sharing are more essential than ever—between departments; between the corporate world and academia; and between various experts in biology, chemistry, materials science, toxicology, and more. It is this open collaboration that leads to successful innovation. Yet the increasing complexity, volume, and diversity of scientific data pose a significant challenge for researchers. How can they share information across scientific disciplines when it is stored in a host of different systems and saved in incompatible formats? How can they quickly find the most relevant information when it is hidden in a vast sea of data? And how can they integrate this information in order to make critical connections, conduct collaborative analysis, and make faster discoveries?
Rethinking the informatics infrastructure
Without a more streamlined approach to scientific informatics, organizations will continue to burn time and money on redundant experiments and inefficient processes. Equally important, opportunities for truly groundbreaking innovation will be missed. The reality of modern scientific research demands that R&D organizations move beyond format- and discipline-centric informatics solutions that trap information in silos, create barriers to collaboration, and add unnecessary effort and expense to the innovation process. This is particularly important as larger companies move toward a disease-centric, “franchise” model for R&D, which requires the integration of information and resources from a diverse array of internal and external sources, with the disease as the main focus for data aggregation. Data that used to be aggregated around processes, specialties, and departments now need to support multidisciplinary projects—such as translational medicine initiatives focused on a single disease—from early research through marketing and sales.
Figure 1 - Department-based data silos versus disease-centric data aggregation.
For too long, information management in R&D has been based on operational and departmental constructs. Organizations would build a data warehouse for research, another one for preclinical work, and another to house information on projects in development. Aggregating all this information to support the franchise model discussed above is a rather daunting task using traditional data warehousing architecture, especially in austere times. This is further complicated by the fundamental design principles on which the various warehouses are constructed. For instance, research data warehouses and experiments are typically compound-centric, while development data warehouses are study-centric (Figure 1). Integration of the data in these disparate warehouses is next to impossible using traditional key-based approaches; a “rip and replace” strategy is not the answer either, since it is costly.
An enterprise-level approach to scientific informatics is needed—one that is able to leverage existing investments in legacy systems while at the same time overcoming the integration challenges that these systems present. To make this a reality, however, organizations will have to rethink their informatics infrastructure.
Step 1: Move toward a service-oriented, global architecture
While the scope of R&D operations is broader and more global than ever, organizations are constrained by informatics tools that have not kept up. Data generated by a single chemist or biologist, much less an interdisciplinary project team or the broader industry, are often spread across a diverse array of formats, applications, and proprietary systems—such as unstructured text documents saved in an electronic laboratory notebook, data generated by a mass spectrometry instrument, or images from a microscope. In addition, the volume is enormous—spanning thousands or even millions of possible compounds, proteins, and genetic tests. Stakeholders can easily spend countless hours finding needed information; preparing data for analysis; and collating, formatting, and distributing results.
Figure 2 - A specific scientific process can be automated using a simple module approach to protocol building. The scientist performs a simple drag and drop of each module to construct the desired data flow.
Fortunately, new infrastructure paradigms such as service-oriented architecture (SOA) are changing this. A Web services-based IT foundation for scientific informatics can support the integration of multiple sources of information in a “plug-and-play” environment, so that organizations can build bridges across the research enterprise and create automated work flows that streamline highly complex projects (Figure 2). The idea is to enable data pipelining, which allows researchers to unlock and utilize all the rich information sources available to them (both within and outside the organization) without the time and expense involved in enlisting IT resources to build customized point solutions. In fact, a customized approach would be impossible for IT to support, since there would be no way to keep up with constantly changing user requirements for thousands of different integration points and work flow tasks.
With an SOA-enabled platform, all IT needs to do is support a base number of services (i.e., Web parts) instead of having to use SQL, Perl, Java, or other scripting languages to integrate information on a point-by-point basis. A researcher can then simply drag and drop prepackaged work flow items—such as statistical processes, simulations, or reports—from a services menu onto the desktop and assemble them as needed based on what the specific project and discipline require. This type of flexible architecture makes it easy to customize and automate scientific processes, which in turn will streamline and speed research efforts. Furthermore, previous investments in various process components can be maximized through reuse. Consider a highly specialized analysis protocol such as next-generation sequencing (NGS). If only applied once by a specific researcher or department and then hidden away in the “lost code” dust bin, its value is limited. Transformed into a service that can be used again and again by other researchers throughout the organization, it becomes a company asset.
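The pipelining idea described above can be sketched in a few lines: each step in a protocol is a small, reusable component, and a protocol is simply an ordered chain of components. This is a minimal illustration of the concept, not a vendor API; the component names (`filter_by_threshold`, `add_flag`) and the data fields are hypothetical.

```python
from typing import Callable, Iterable, List

# A component takes a stream of records and returns a transformed stream.
Component = Callable[[Iterable[dict]], List[dict]]

def pipeline(*components: Component) -> Component:
    """Compose components into a single protocol, applied left to right."""
    def run(records):
        for component in components:
            records = component(records)
        return records
    return run

# Example reusable components (hypothetical, for illustration only)
def filter_by_threshold(field: str, minimum: float) -> Component:
    def component(records):
        return [r for r in records if r.get(field, 0) >= minimum]
    return component

def add_flag(label: str) -> Component:
    def component(records):
        return [{**r, "flag": label} for r in records]
    return component

# Assemble a protocol the way a scientist would drag and drop modules.
protocol = pipeline(
    filter_by_threshold("abundance", 2.0),  # keep strong signals only
    add_flag("candidate"),                  # annotate the survivors
)

records = [{"id": "P1", "abundance": 3.1}, {"id": "P2", "abundance": 0.4}]
print(protocol(records))  # [{'id': 'P1', 'abundance': 3.1, 'flag': 'candidate'}]
```

Because each component is self-contained, a specialized step (such as an NGS analysis) written once can be dropped into any number of protocols—the reuse benefit described above.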
Finally, when data are integrated and available to the entire research enterprise, they can more easily be searched. This helps researchers avoid wasting time hunting for important information or repeating work that has already been carried out by another team or department.
Step 2: Foster greater collaboration with open standards and open source
Open data standards are essential to making an SOA architecture work, and they are absolutely imperative to enabling global collaboration between diverse groups both inside and outside of an organization. No one vendor is going to be able to provide all the tools that a researcher might use in any given project. What is important is that collaborators are able to link all the information they are generating through various tools and applications into integrated work flows. They also need to be able to pull in content from publicly available databases, from the academic world, from contract research partners, and so on. This is why the plug-and-play element of a services-based informatics platform presents such a compelling opportunity to drive new efficiencies. Open standards facilitate the interoperability that makes plug and play happen.
What, exactly, are open standards, and how can organizations effectively support them? Today there are a few good examples of open data standards groups in life sciences, including Health Level Seven (HL7) and the Clinical Data Interchange Standards Consortium (CDISC). These groups have worked diligently for several years to create medical record and clinical research standards. An emerging group, the Pistoia Alliance, is working on basic research data standards. Ultimately, these standards succeed when vendors and technology users implement them—a process that can take some time, given that applications often need to be reengineered to the standard—and they deliver lasting value once a community of users supports them.
Open standards help organizations build bridges between the many sources of knowledge available to them and an enterprise informatics infrastructure that gives researchers a holistic, cross-disciplinary view of information and an integrated forum to collaborate on the projects they are working on. Open interoperability also gives organizations the agility to pipe new streams of data into their informatics platform as needs or research partners change, as new algorithms are developed, and as new scientific tools emerge.
Finally, open standards help ensure that taking an enterprise approach to scientific informatics does not require that organizations “rip and replace” their legacy systems, since the data required for cross-enterprise scientific processes can be easily aggregated into project data marts without changing the source. A foundation for a services-based informatics infrastructure can therefore be established incrementally while still leveraging previous investments in systems and applications. It is an evolutionary strategy that allows IT to support a dynamic organization undergoing massive change without requiring massive systems overhaul.
Enterprise scientific informatics in the real world
An enterprise approach to scientific informatics is concerned with building an ecosystem that can integrate various sources of scientific data and best-of-breed applications, as well as commonly used data standards, into streamlined, work flow-driven processes that speed innovation and foster broad collaboration. This ecosystem should also enable research stakeholders to more easily search and reuse corporate and publicly available research, reduce the number of experiments required, and automate repeatable tasks. Finally, it should support both internal and external collaboration, as well as changing business models in the pharmaceutical and biotech industries.
To illustrate how this can be applied in a real-world setting, consider an example from translational research.
Figure 3 - A simple protocol/work flow that brings together gene expression analysis and text analysis to create the data mashup required to identify new targets.
By studying biomarkers that indicate disease or nondisease states, researchers can potentially discover new and more effective drug therapies, understand response to therapies, and make more accurate diagnoses of disease states. However, this can be a time-consuming and painstaking process when project stakeholders need to compare volumes of mass spectrometry data, microarray data, information from public databases, and patent files. Consider a research project that seeks to create a diagnostic test for acute respiratory distress syndrome (ARDS), a serious lung inflammation (Figure 3). In this case, researchers might want to analyze proteomics data generated by a mass spectrometer to identify proteins that are common to, and more abundant in, ARDS patients compared with a set of controls. Of further scientific interest would be to compare these overexpressed proteins to gene expression data sets for analogous sets of patients. If promising and consistent biomarker candidates are discovered, researchers would also want to search the available literature to find out what is known about the genes, proteins, and related pathways.
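Stripped to its essentials, this work flow is an intersection followed by an automated query: keep only biomarker candidates consistent across both platforms, then queue each survivor for a literature search. The gene symbols, data values, and the `literature_query` helper below are illustrative placeholders, not real ARDS results; a real protocol would call an actual search service such as PubMed at the final step.

```python
# Hypothetical analysis outputs from the two platforms (placeholder symbols).
ms_overexpressed = {"SFTPB", "IL6", "VWF"}    # proteomics hits from mass spec
geneexp_upregulated = {"IL6", "VWF", "TNF"}   # hits from gene expression data

# Step 1: keep only candidates that are consistent across both platforms.
candidates = sorted(ms_overexpressed & geneexp_upregulated)

# Step 2: placeholder for the automated literature step; a real protocol
# would send each query to a search service rather than just printing it.
def literature_query(symbol):
    return f"({symbol}) AND (ARDS OR acute respiratory distress syndrome)"

for symbol in candidates:
    print(literature_query(symbol))
```

In a services-based platform, each of these steps would be a prebuilt module in the Figure 3 protocol, so the same intersect-and-search pattern could be reused for any disease area.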
Without an underlying informatics platform capable of integrating this information and automating the steps involved in comparing findings across multiple data sources and formats, ARDS researchers would face a difficult and time-consuming task, and many worthwhile comparisons would simply go unmade. Bioinformatics professionals would need to spend a great deal of time, effort, and resources manually collecting and analyzing the information and searching databases such as PubMed. Key connections or relevant existing findings could easily be missed.
In contrast, a services-based, enterprise platform for scientific informatics would enable researchers to build an integrated work flow protocol, or pipeline, that brings together multiple sources of data and automates complex analysis steps. With such a solution in place, informatics experts can free themselves from many of the mundane tasks associated with data analysis and concentrate their efforts on problems that truly require human intervention and judgment. Nonexperts can broaden the scope of their research because they can more easily leverage experimental data generated outside of their domain and collaborate with colleagues across scientific disciplines. The research organization as a whole can take advantage of greater quantities of relevant data in order to not only make new discoveries, but also make them faster and more cost effectively than before.
Modern biotechnology demands an enterprise approach to scientific informatics. By looking to open, service-oriented, and standards-based solutions that encompass the entire R&D ecosystem, today’s research organizations can lay an IT foundation that is essential to streamlining processes, leveraging all relevant sources of knowledge, fostering collaboration, and making faster discoveries.
Dr. Brown is Chief Science Officer, Accelrys, Inc., 10188 Telesis Ct., Ste. 100, San Diego, CA 92121, U.S.A.; tel.: 858-799-5000; e-mail: FBrown@accelrys.com.