Facilitating Scientific Information Accessibility with a Next-Generation Search Solution

Information access is central to the success of a science-driven organization. Implementing laboratory informatics solutions that provide timely, regular, and accurate search results can have significant benefits. Science-driven organizations require a new generation of search technology that performs three vital functions: 1) searches simultaneously across two or more internal information repositories, 2) recognizes science objects and pulls them into the search results, and 3) indexes unstructured data.

The value of scientific data

What is the value of scientific reports, manuscripts, and experimental records in a science-based organization? A recent Oracle Corp. survey of 330 C-level executives from the pharmaceutical industry and 10 other industries found that 93% of the executives say their organizations are collecting more information than they were two years ago. The same study found that these organizations are leaving, on average, about 20% more revenue (~$98.5 million) on the table each year because they cannot fully leverage that information.1

How much does industry spend on commercial laboratory informatics systems to capture, catalog, and archive data? Strategic Directions International (SDI) estimates the annual market for laboratory informatics at around $1.1 billion, comprising chromatography data software [CDS], laboratory information management systems [LIMS], electronic laboratory notebooks [ELN], and laboratory execution systems [LES].2 Clearly, these market data illustrate the emphasis placed on information creation and collection.

Still, despite the large investment in laboratory informatics to capture laboratory data, currently available software does not address the needs of users trying to find information, i.e., information consumers. For example, a recent announcement by GSK that it spent $1.5 million developing an in-house scientific search capability illustrates that global science-based organizations need to complement information creation with information accessibility.3

Finding information from the laboratory perspective

As a scientist, what steps would you take to develop a new experimental technique or methodology? Typically, the first step would include talking to colleagues and performing a literature search. At some point, would you then attempt to search all of the experimental records stored within your organization’s information repositories? This could pose a problem. Given the way information is managed at most organizations today, searching across all of an organization’s scientific information repositories is simply out of the question because the information is:

  • Maintained in independent and unconnected data silos
  • Lacking support for science objects (spectra, chromatograms, images)
  • Overwhelmingly unstructured (in the form of manuscripts and reports).

See Figure 1.

Yet, an understanding of past work is an essential first step in any method development process. For example, suppose you were developing a new chromatographic separations method. If you wanted to leverage the prior work of your colleagues, you would be required to search across the following information repositories:

  • ELNs used in place of the traditional laboratory paper notebook
  • LIMS for managing routine sample-processing procedures
  • LES for managing more complex workflows at the bench level
  • CDS for automated control of analytical instrumentation
  • SDMS (scientific data management software) to serve as repositories for raw data as well as printed reports.

Figure 1 – Three challenges to leveraging information in the science-driven organization. To overcome the challenges, Scientific Search technology spans across isolated information silos, adds structure to unstructured information, and facilitates searches based on science objects.

Home-grown custom data repositories

And let’s not forget about home-grown custom data repositories. As is the case in most organizations today, the majority of these information silos do not talk to one another either, so there is no way to search all data repositories simultaneously. Even if they did, how would you know that no valuable information had been left behind because it is irretrievable by industry-standard search engine technology?

The point is that the language of science is not a language limited to numbers and words. Scientists communicate with images and specialized visuals as well. For example:

  • Molecular structure and chemical reaction diagrams are used by chemists to identify the molecules and the methods they are using to synthesize them
  • DNA, RNA, and amino acid sequences are used by biologists to identify the genes and proteins involved in disease states
  • Chromatographic and mass spectra are produced by analytical scientists during the process of characterizing molecular structures and biological sequences undergoing evaluation.

Visuals are an important element of communication, yet in their visual form they are unsuitable for computational use and are especially unsuitable for searching. Computers instead use special data structures to represent chromatograms, spectra, chemical structures, etc. These data structures (science objects) are generally optimized for storage in specialized data silos and require specialized engines for search and retrieval. Conventional web-based search engines cannot retrieve science objects because they are designed to search text.

Capturing unstructured information

Historically, science-driven organizations have relied primarily on collecting and harvesting structured information, i.e., data held in the tables of a relational database, such as the data found in CDS or LIMS. More recently, the type of knowledge captured has expanded to include unstructured information, i.e., the free text commonly found in manuscripts, reports, and research documents. Unstructured information has been growing 3–4 times faster than structured information, due largely to the greater usage of ELN and ECM (enterprise content management). Unlike structured information, however, unstructured information requires specialized algorithms to search and analyze.
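One common way to make free text searchable is an inverted index, which maps each word to the documents that contain it. The following is a minimal sketch of that general technique only, not a description of how any particular product works; the document names, their contents, and the function names are all hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase the text and split it into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Map every token to the set of document IDs in which it appears.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    # Return the IDs of documents that contain every term in the query.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical unstructured documents (assumed for illustration only).
DOCS = {
    "report-1": "Reversed-phase HPLC method for impurity profiling.",
    "report-2": "Stability study summary; no HPLC data included.",
    "memo-3": "Impurity limits for the new formulation.",
}

index = build_index(DOCS)
matches = search(index, "HPLC impurity")
print(sorted(matches))
```

Only the document that contains both query terms is returned. A production system would add steps such as stemming, ranking, and access control, all of which are omitted here.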

Science-driven organizations rely on ECM systems such as Microsoft SharePoint and EMC Documentum to help manage the lifecycle of controlled content found in manuscripts and reports to ensure compliance with internal policies and FDA regulations. These ECM systems contain vast amounts of historical data that can yield valuable information about a business.

Figure 2 – Science objects may include chemical structures, chemical reactions, chromatograms, spectra, biological sequences, and images.

Platform for scientifically relevant search

How do you find a good starting point for chromatographic method development? If you were exploring how to develop a chromatographic method and wanted to leverage your organization’s scientific repositories, your ideal tool would allow you to: search across internal information silos, use a combination of keywords and science objects, and search both unstructured and structured data (Figure 2). This ideal tool may sound similar to Federated Search, a search technology capable of searching across multiple web-based information servers simultaneously based on keywords only (often used in literature and library searching applications; see www.Science.gov for an example). A Federated Search consists of three steps:

  1. Keyword queries are submitted into a single user interface
  2. The user interface communicates with individual web-based servers and relies on the individual search engines at each server to perform the actual query
  3. Search results from all sites are sent back to the query interface and compiled into a single list.
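The three steps above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the implementation of any real Federated Search service: the two source functions are hypothetical stand-ins for remote search engines, and their contents are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for remote servers. Each one runs its own
# keyword search over its own (made-up) collection of titles.
def search_library(query: str) -> List[str]:
    docs = ["HPLC method validation", "Protein folding review"]
    return [d for d in docs if query.lower() in d.lower()]

def search_patents(query: str) -> List[str]:
    docs = ["Column packing for HPLC", "Inkjet nozzle design"]
    return [d for d in docs if query.lower() in d.lower()]

SOURCES: Dict[str, Callable[[str], List[str]]] = {
    "library": search_library,
    "patents": search_patents,
}

def federated_search(query: str) -> List[Tuple[str, str]]:
    # Step 1: one keyword query enters through a single interface.
    # Step 2: the query is forwarded to every source, and each source's
    # own engine performs the actual search.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in SOURCES.items()}
        # Step 3: results from all sources are compiled into a single list.
        merged: List[Tuple[str, str]] = []
        for name, future in futures.items():
            merged.extend((name, hit) for hit in future.result())
    return merged

results = federated_search("hplc")
for source, title in results:
    print(f"{source}: {title}")
```

Note that the interface itself never indexes anything; it depends entirely on whatever search capability each server exposes, and it accepts keywords only.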

Federated Search is suitable for keyword searches across different web sites, but is not suitable for searching across niche data repositories such as CDS, LIMS, ELN, SharePoint, Documentum, etc., because it lacks the appropriate secure communication connections and does not provide searches based on science objects, such as chemical structures.

Figure 3 – Types of isolated information silos include business applications (ERP, LIMS), laboratory applications (CDS, SDMS), science applications (ELN), office applications (word processing and spreadsheets), file servers, and electronic content management (Documentum, SharePoint).

Instead, to complete your search for a suitable starting point for chromatographic method development, you may want to use a new technology known as Scientific Search, a next-generation search technology that overcomes the limitations of other web-based search technologies. A commercially available solution, Paradigm Scientific Search software, is available through Waters Corp., Milford, MA (www.waters.com/Paradigm).4

Scientific Search:

  • Spans scientific data silos (Figure 3)
  • Can construct queries based on science objects and keywords (Figure 2)
  • Provides content enrichment by supplying structure to unstructured information (Figure 2).

What would a hypothetical “scientific search” look like? You could construct your search with a mixture of keywords and chemical structures (and any other useful identifying information such as spectra). Submitting the search would result in an index of results including published manuscripts (SharePoint), internal research reports (Documentum), and experimental conditions (ELN, CDS, SDMS, LIMS). In addition to the experimental conditions and results, you would also see the authors of the records, allowing you to contact them for further advice. Those who use Scientific Search for the first time are amazed at how quickly they can uncover information relevant to the task at hand. They may also conclude that the old graduate school saying, “an hour in the library will save you a week in the laboratory,” should read “a few seconds with Scientific Search will save you a week in the laboratory.”
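As a rough illustration of such a query, the sketch below filters a small set of records by keywords and by a chemical structure, then reports the silo and author of each hit. Everything in it is an assumption made for illustration: the record layout, the sample records, and the use of exact matching on a SMILES string to represent a structure. It does not describe the Paradigm software or any other product, and a real structure search would rely on a cheminformatics engine for substructure or similarity matching.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Record:
    silo: str                        # originating repository, e.g. "ELN"
    author: str                      # person to contact for advice
    text: str                        # free-text description of the record
    structure: Optional[str] = None  # science object, here a SMILES string

CAFFEINE = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
ASPIRIN = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Hypothetical records, as if already gathered from several silos.
INDEX = [
    Record("ELN", "A. Chemist", "Gradient HPLC separation of caffeine", CAFFEINE),
    Record("Documentum", "B. Analyst", "Validation report for HPLC caffeine assay", CAFFEINE),
    Record("CDS", "C. Tech", "HPLC run, aspirin impurity profile", ASPIRIN),
    Record("SharePoint", "D. Writer", "Quarterly budget summary"),
]

def scientific_search(keywords: List[str], structure: Optional[str] = None) -> List[Record]:
    hits = []
    for rec in INDEX:
        text = rec.text.lower()
        # Every keyword must appear in the record's text.
        if not all(k.lower() in text for k in keywords):
            continue
        # If a structure is supplied, it must match the record's structure
        # exactly (a simplification of real structure searching).
        if structure is not None and rec.structure != structure:
            continue
        hits.append(rec)
    return hits

hits = scientific_search(["hplc"], structure=CAFFEINE)
for rec in hits:
    print(f"{rec.silo}: {rec.text} (contact: {rec.author})")
```

Adding the structure narrows the keyword-only result set from three records to the two that concern the compound of interest, each with its silo and author attached.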

Conclusion

It is difficult for most organizations to assign a cost to the documents and information they have already created.3 Somewhere in the information storehouse could be the next innovative product, key compliance document required during a regulatory audit, or critical information required to resolve a product defect. Managers are frustrated by their inability to locate data that are stored in isolated data silos.1 Scientific Search offers a solution for accessing scientifically relevant information across an organization by spanning information silos, performing searches based on science objects, and adding structure to unstructured data.

References

  1. Grogan, K. Pharma Not Making the Most of Data, says Oracle. PharmaTimes Online, July 18, 2012; http://www.pharmatimes.com/article/12-07-18/Pharma_not_making_the_most_of_data_says_Oracle.aspx.
  2. Market Intelligence Management Informatics: CDS, LIMS, ELN/LES, and SDMS 2012-2017; Apr 8, 2013; Strategic Directions International, Inc., Los Angeles, CA.
  3. Luchette, M. Searching for Gold: GSK’s New Search Program That Saved Them Millions. Bio-itworld.com, June 5, 2013; http://www.bio-itworld.com/2013/6/4/searching-gold-gsk-new-search-program-saved-millions.html.
  4. Documents—An Opportunity for Cost Control and Business Transformation; 2003, Xerox Corp.; http://www.xerox.com/downloads/gbr/en/i/idc_Survey.pdf.

Chris Stumpf, Ph.D., is Senior Product Marketing Manager, Informatics; Steve F. Eaton is Senior Manager, Marketing; and Paul van Eikeren, Ph.D., is Senior Director, Data Science, Waters Corp., 34 Maple St., Milford, MA 01757, U.S.A.; tel.: 508-482-3108.