2012 in Retrospect: A Great Year for Scientific Software

Information technology (IT) has become an integral part of chemical and biochemical research. IT is not uniformly applied, however. In 2012, significant needs were recognized and fulfilled. Below are my top four picks of software focused on previously unmet needs in the chemistry and biochemistry laboratory space.

Cloud LIMS fits the academic lab

Cloud computing is not exactly new, but its adoption in smaller labs, particularly academic ones, is evolving more slowly than in other segments. For example, there are more than 100 vendors of laboratory information management systems (LIMS), but only a handful show up in a Google search of “Cloud LIMS.”

The touted advantages of cloud LIMS are: 1) lower cost, 2) no local server is required, 3) no IT infrastructure is required, 4) it is scalable from very small to large data, 5) it reduces data loss, 6) installation is simple, 7) it is flexible, and 8) it is easy to use. The last three attributes are claimed by nearly all software, so one should be skeptical.

Some questions to ask: Where are my data, and are they exported in accordance with export control regulations? How are the data backed up? Are they encrypted? Can they be hacked? Can they be erased? What is the response time? Can I send and retrieve terabyte files? What happens to data as they age? Can today’s data be accessed indefinitely? What happens to data when the service is terminated? Are upgrades to new LIMS versions likely to affect me? And although every LIMS is claimed to be flexible, LIMS are known to require specific work flows. What is the required work flow for (name your application or assay)?

Perhaps the biggest issue with cloud LIMS is: Who supports it, and how? Bugs will arise, as will new file structures and extensions. People will need training, and startup training may be different from training for newly hired staff. Product pitches to small labs claim that an IT department or specialist is not necessary, since customer support can be provided as part of the subscription fee.

The potential subscriber should also consider the attitude of his or her IT department toward cloud LIMS. IT departments seldom view the cloud favorably. Job security can be a hidden issue, as can the implied message that the IT group is not meeting one’s needs. Often, the IT staff will be risk averse. They may fear that the cloud could open a trap door that bypasses the enterprise firewall. Plus, the IT staff may need to update their skills beyond SQL-based search technology. These factors may significantly limit the adoption of cloud LIMS in large enterprises.

Then one needs to consider the host. Until recently, anecdotal reports of poor service were common, so evaluation of the host should be part of the selection process.

Academic labs, however, may really benefit, since individuals can access the information from any location, often with personal electronics. This is a definite improvement over files locked in an instrument or on a local network, where one needs to be an active student or faculty member for access. Frequently, reports and papers are written and edited after the student or postdoc leaves the institution.

GoLIMS (www.GoLIMS.com) is a new cloud LIMS offering that is optimized for smaller research labs, both commercial and academic. The product provides a flexible work flow at an affordable price. By keeping the human interface simple and avoiding a long, complicated feature set, GoLIMS covers the basics of reliably capturing, storing, and retrieving laboratory data. GoLIMS hosts the program, and all data are backed up at remotely located mirror sites. GoLIMS recognizes the need for data security while encouraging open access (within your account). Data security meets the requirements of HIPAA, SAS 70 Type II, ISO 27001, 21 CFR Part 11, and applicable FDA regulations. Training is web based. Small labs should look at GoLIMS and similar products; if the lab is flexible and creative, the software should meet its needs.

Extending the drug life cycle

In an interesting strategic move, Accelrys (San Diego, CA) recently acquired VelQuest. Accelrys had been building a strong base providing IT software and services for drug discovery and development, while VelQuest was strong in the process scaleup and early manufacturing segment. Prior to the acquisition, pharma and biopharma firms struggled to manage the information handoff from lab to production. There was need and room for improvement.

In September, Accelrys announced the Process Management and Compliance Suite of software, which extends seamless information flow to the entire life cycle of a drug. This speeds development and reduces costs while improving companies’ abilities to meet quality and regulatory compliance objectives. Plus, information generated in production can easily be used to develop other products. In particular, the ability to use data and methods developed early in the product life cycle to support global manufacturing significantly improves product knowledge. The FDA admonishes firms to “know your product.” With the Compliance Suite, this is much easier for all stakeholders.

Retrieving trapped data

As a scientist, I’m disturbed by the number of retractions of papers citing inability to repeat experiments. The ability to repeat and verify observations is central to the scientific method. However, today’s work flows are often very complex and may involve esoteric critical reagents such as hybrid cell lines, which can be carefully selected from billions of similar candidates. Even common reagents such as water entail uncontrolled and undocumented variables, including unique sources and tribal knowledge, and these can have a critical impact. Failure to reproduce a reported method can thus be due simply to difficulty in method transfer. In this environment, is it fair to accuse scientists of drylabbing when a method transfer fails?

Missing data also lead to retractions. Data can be inaccessible or trapped, and hence effectively lost, for a variety of reasons. Backward compatibility of software usually extends only about a decade, seldom two. Indeed, the entire research system is vulnerable to data loss due to data trapping by advances in IT. Electronic data storage is 35 years old, and over this period the structure of data and related applications technology has evolved through several generations. Yet the data and derived information should be timeless. After all, the product life cycle of a biotherapeutic may extend beyond the 20-year patent cycle, and biosimilars need to reference the quality attributes of the originator’s licensed product. These may be old data from obsolete technology, but they are the primary reference. With predictable advances in IT, today’s fresh data will be old in 2040, but will still be relevant.

The trapped data problem was first solved globally for text publishing and interoperability through the introduction of HTML, which was embraced by the W3C (World Wide Web Consortium) as the standard for text. Today, HTML text has become the basis of global, persistent text publishing, access, and sharing. In contrast, data are still stuck in a forest of silos with limited interoperability. Because global standards for publishing data resources within a common description framework have not been available, even relational databases (RDBs) have extreme difficulty with interconnectivity. IT departments spend much of their time programming SQL searches and “joins” and transforming data into new warehouses in efforts to overcome these inherent limitations of RDBs.
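To make the “join and transform” burden concrete, consider a minimal Python sketch. The table layouts, column names, and unit conversion below are invented for illustration; the point is that reconciling two silos that describe the same samples differently requires hand-written glue for keys, spellings, and units.

import sqlite3

# Two hypothetical silos describing the same samples with different schemas.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lims_samples (sample_id TEXT, analyte TEXT, conc_ng_ml REAL)")
db.execute("INSERT INTO lims_samples VALUES ('S-001', 'caffeine', 42.0)")
db.execute("CREATE TABLE inst_runs (run_id TEXT, barcode TEXT, compound TEXT, amount_mg_l REAL)")
db.execute("INSERT INTO inst_runs VALUES ('R-17', 'S-001', 'CAFFEINE', 0.042)")

# The "join and transform" step: reconcile keys, names, and units by hand.
rows = db.execute("""
    SELECT l.sample_id,
           l.analyte,
           l.conc_ng_ml,
           i.amount_mg_l * 1000.0 AS conc_ng_ml_from_instrument
    FROM lims_samples AS l
    JOIN inst_runs AS i
      ON i.barcode = l.sample_id
     AND i.compound = UPPER(l.analyte)
""").fetchall()

print(rows)  # [('S-001', 'caffeine', 42.0, 42.0)]

Every new silo multiplies the number of such hand-built mappings, which is precisely the burden a linked-data approach aims to remove.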

Almost a decade ago, systems biology started to stumble over the limitations of RDBs, as well as the plethora of specialized file formats in this market. Despite many efforts focusing on standardization (for example, Minimum Information About a Microarray Experiment, MIAME), there was much conflict and little commonality in data standards, file structures, and database schemas. In the early 2000s, Tim Berners-Lee and the W3C started to look for an analogous structure for data that would enjoy the same intuitive acceptance as HTML. They identified RDF (Resource Description Framework) as a common framework for describing and integrating data, even data from different standards. RDF is a much more flexible structure that avoids the “data trap” problem by moving information into an open-standards linked data format that makes search, retrieval, and integration much easier.
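A short Python sketch shows the idea, using the open-source rdflib package; the ex: namespace and property names are invented for illustration. Every fact is stored as a self-describing subject-predicate-object triple, so data from different sources can be merged and queried without agreeing on a shared table layout first.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/lab/")  # hypothetical vocabulary

g = Graph()
sample = EX["sample/S-001"]
g.add((sample, RDF.type, EX.Sample))
g.add((sample, EX.analyte, Literal("caffeine")))
g.add((sample, EX.concentration_ng_ml, Literal(42.0)))

# Query by meaning (concepts and properties), not by table layout.
query = """
    PREFIX ex: <http://example.org/lab/>
    SELECT ?s ?conc WHERE {
        ?s a ex:Sample ;
           ex:analyte "caffeine" ;
           ex:concentration_ng_ml ?conc .
    }
"""
for row in g.query(query):
    print(row.s, row.conc)

Data from a second source can be folded in simply by adding its triples to the graph; no schema migration or warehouse rebuild is needed.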

The latest Sentient product release from IO Informatics (Berkeley, CA) facilitates finding, retrieving, and integrating data from trapped or otherwise inaccessible, incompatible files. Sentient can access almost any file format (including useful but proprietary format “traps” like .CEL, .CNT, .GSP, .GSC, and DICOM), and from there the system can translate the data structure to the open-standard RDF format. Once in the global W3C standard RDF, the data can be much more easily integrated and are far less likely to be lost in a silo or format trap. The original file is unchanged. Since only relevant data are gathered, one does not need to manipulate big data files such as whole genomes. Interestingly, Sentient works with images as well as binary, numeric, string, and other kinds of data.
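Sentient’s internals are proprietary, but the general pattern it embodies (extract only the relevant fields from a legacy file and lift them into RDF, leaving the original untouched) can be sketched in a few lines of Python. The file layout and vocabulary below are invented for illustration; a real converter would parse a binary format such as .CEL or DICOM rather than a text export.

import csv
import io
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/lab/")  # hypothetical vocabulary

# A stand-in for a legacy instrument export; only two fields are relevant.
legacy_export = io.StringIO("probe_id,signal\nP123,812.5\nP124,17.3\n")

g = Graph()
for row in csv.DictReader(legacy_export):
    probe = EX["probe/" + row["probe_id"]]
    g.add((probe, EX.signal, Literal(float(row["signal"]))))

# The triples can now be merged with any other RDF graph;
# the source file itself is never modified.
print(g.serialize(format="turtle"))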

Semantic clipping

In midsummer, I received an e-mail calling attention to one of my publications in Sepu (Chinese Journal of Chromatography). Included in the message was a list of 20 recent publications that were relevant to mine. This was interesting, since ISI does not abstract Sepu because few manuscripts are in English.

The e-mail was from “Who is Publishing in My Domain?” (www.WIPIMD.com). WIPIMD starts with the textual data of the article, which is mapped against WIPIMD’s knowledge base to produce a searchable database (DB). The DB can then be searched with a few keywords to produce a list of relevant files, i.e., documents. WIPIMD’s knowledge base consists of 20 million synonyms expressing 2.5 million biomedical concepts. A semantic search layer queries and evaluates each candidate document to distill the list down to the most relevant. In practice, WIPIMD delivers superior specificity and sensitivity. In the case of chromatography, WIPIMD sorts through 140,000 papers to prepare a report customized to my specific profile.
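WIPIMD’s knowledge base and ranking method are proprietary, but the core idea (normalize many synonyms to a smaller set of concepts, then rank documents by conceptual overlap rather than literal keyword matches) can be illustrated with a toy Python sketch. The synonym table and scoring below are invented; the real knowledge base holds roughly 20 million synonyms for 2.5 million concepts.

# Toy synonym-to-concept table standing in for a biomedical knowledge base.
SYNONYMS = {
    "hplc": "liquid_chromatography",
    "high-performance liquid chromatography": "liquid_chromatography",
    "mass spectrometry": "mass_spectrometry",
    "mass spec": "mass_spectrometry",
}

def concepts(text):
    """Map the phrases found in a text onto canonical concepts."""
    lowered = text.lower()
    return {concept for phrase, concept in SYNONYMS.items() if phrase in lowered}

def rank(profile, documents):
    """Score documents by concept overlap with a reader's profile, best first."""
    wanted = concepts(profile)
    scored = [(len(wanted & concepts(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "Optimizing HPLC gradients for peptide mapping",
    "Mass spec imaging of tissue sections",
    "A survey of laboratory glassware",
]
for score, title in rank("high-performance liquid chromatography of peptides", docs):
    print(score, title)

A production system adds the semantic layer on top of this, disambiguating phrases in context and weighting concepts, which is where the claimed specificity and sensitivity come from.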

Subscribers can select from several services, including: 1) the top 20 articles in the domain of the article since 2007, 2) the top 20 articles in the domain of the article appearing in the past two months, 3) the top 20 free full-text articles in the domain of the article appearing in the past two months, and 4) articles citing the subscriber’s publication.

Summary

The above caught my attention as novel and useful software advances in 2012. Novelty, like beauty, is in the eye of the beholder, so differences of opinion should be anticipated and accepted. Readers’ comments are always welcome.

Robert L. Stevenson, Ph.D., is a Consultant and Editor of Separation Science for American Laboratory/Labcompare; e-mail: [email protected].