As chemists and especially life scientists work with more complex systems and experiments, the huge data sets arising from high-content screening (HCS) and next-generation sequencing (NGS) challenge and occasionally exceed the limits of the Internet. Even reporting and archiving the results is an issue for informatics. Let’s look at the challenges facing informatics in 2013, including the Internet, science, technology, engineering, and mathematics (STEM).
In 2010, the global society invested about $1 trillion in R&D (http://en.wikipedia.org/wiki/List_of_countries_by_research_and_development_spending). The majority of the spend was for applied projects by private firms, but about 15% was spent on academic R&D, which is generally supported by government grants. Grantors award financial support for work in their mission area. In exchange for funding, the awardee accepts the obligation to document and report the results of the work. Over time, various researchers build up a body of knowledge. This knowledge is often accumulated over several decades, and potentially centuries. Early reports serve as the foundation for later work. Thus, all of the reports are preserved for potential use by future researchers.
Results reporting faces several problems: Most public attention is on the battle of open-source versus for-profit data collection and publishing. This often comes down to who should pay. Preparing a manuscript for print publication typically costs about $4000–$5000. The cost for an electronic-only publication is about $1000/manuscript. Some journals charge the author to publish; others include the cost in the subscription fee. The matter of who pays is now a political issue. Public attitude strongly favors making the information available to all, especially individuals, without charge. After all, our income taxes pay for most of it.
But what about the ongoing cost? The costs above only cover preparing the manuscript for publishing or posting, and not the ongoing costs. Simply housing print journals is expensive. Libraries with reading facilities are large buildings, which are costly to build, staff, and maintain. Staying ahead of the volume of print material year after year is a challenge. Plus, large libraries are few and far between, which is a further impediment to access. Traditionally, print publications have used the subscription model. This leads to very restricted availability since new libraries will not have back issues. This is a particularly serious issue in the developing world.
Publishers of research journals tend to respond by offering electronic access to back issues, but the access price is still high and, for many, prohibitive. One can buy access to individual articles, but the price is typically about $40 each, which is significant. In the developing world, simply getting hard currency and a credit card is a major-to-impossible task. Thus, copies and subscriptions fail the “freely available” test.
Electronic publishing (posting) also has an ongoing cost that is significant. In 2013, the cost is buried in service fees, which cover the costs of providing access to the net via cable or satellite. Internet service providers spend billions per year laying new cable and maintaining the old. Enterprise information processing is usually the domain of the IT department, which appears as a cost center in enterprise overhead. No matter who pays, the servers that store and retrieve information operate 24/7, which requires electrical power and cooling.
Many enterprises scale their IT systems for maximum required capacity, which is often utilized less than half the time. Recently, cloud computing has emerged as a more cost-effective way to optimize the IT load by peak shaving. Cloud storage is advertised at $0.10/GB/month. A modest IT load of 1 TB costs $100/month, but a PB is $100,000/month. Indeed, IBM estimates that the world generates 2.5 × 10^18 bytes of data per day. This translates to adding $100 million/day to the total IT spend, assuming that all the data are stored. Ninety-five percent of the world’s data have been generated during the last two years (http://www-01.ibm.com/software/data/bigdata/). Certainly, some data are stored offline, with not much deleted.
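The storage arithmetic above can be checked with a short sketch. It assumes decimal units (1 GB = 10^9 bytes) and a flat list price with no volume discount; the function name `monthly_storage_cost` is hypothetical and used only for illustration.

```python
# Advertised cloud storage price quoted in the text (USD per GB per month).
PRICE_PER_GB_MONTH = 0.10

# Decimal (SI) unit sizes in bytes. This is an assumption: the text does not
# say whether decimal or binary units are meant.
GB = 10**9
TB = 10**12
PB = 10**15

# IBM's estimate of the data generated worldwide per day, as quoted in the text.
WORLD_BYTES_PER_DAY = 2.5e18


def monthly_storage_cost(num_bytes, price_per_gb_month=PRICE_PER_GB_MONTH):
    """Return the monthly cost in USD of keeping num_bytes in cloud storage."""
    return num_bytes / GB * price_per_gb_month


if __name__ == "__main__":
    cases = [
        ("1 TB", TB),
        ("1 PB", PB),
        ("one day of world data", WORLD_BYTES_PER_DAY),
    ]
    for label, size in cases:
        print(f"{label}: ${monthly_storage_cost(size):,.0f} per month")
```

Under these assumptions, 1 TB comes to $100 per month and 1 PB to $100,000 per month, matching the figures in the text. Storing a single day of the world's data at the same list price would add a recurring charge of roughly $250 million per month, which is the same order of magnitude as the estimate quoted above; the exact assumptions behind that estimate are not stated.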
Another looming constraint is that in 2012, the amount of power consumed by global IT operations was comparable to the power used by civilian air transport (~2% each). However, with Big Data in general and NGS in particular, the doubling time is forecast to be less than two years. Thus, in less than a decade, IT could be consuming 20% of the global energy diet. This is probably not sustainable.
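The projection above follows from simple exponential growth, as the sketch below shows. It rests on assumptions that are not in the text: that IT energy use grows in step with data volume, that total global energy use stays flat, and that hardware efficiency does not improve. The function names are hypothetical.

```python
import math


def years_to_reach(start_share, target_share, doubling_time_years):
    """Years for an exponentially growing share to rise from start to target."""
    doublings = math.log2(target_share / start_share)
    return doublings * doubling_time_years


def share_after(start_share, years, doubling_time_years):
    """Share reached after the given number of years of steady doubling."""
    return start_share * 2 ** (years / doubling_time_years)


if __name__ == "__main__":
    # Figures from the text: about 2% of global energy in 2012, doubling
    # every two years (used here as the upper bound on the doubling time).
    start, doubling = 0.02, 2.0
    print(f"Years to reach 20%: {years_to_reach(start, 0.20, doubling):.1f}")
    print(f"Share after 10 years: {share_after(start, 10, doubling):.0%}")
```

With a two-year doubling time, the 2% share reaches 20% in under seven years, consistent with "less than a decade." The same naive extrapolation gives 64% after ten years, which underlines why the trend cannot continue unchecked.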
Data security is another issue. Printing is nearly forever. Assuming proper care, which can be expensive, printed paper can last from decades to perhaps centuries. Plus, printed journals are widely distributed, which ensures that the contents will probably survive somewhere even if a major calamity should strike. Yes, there is a finite, but small risk of a calamity such as the Cretaceous–Paleogene extinction event of 65 million years ago, which annihilated the dinosaurs.
Electronic publications are easier to distribute globally, but also appear to be much more vulnerable. The Web could be attacked and rendered inoperable. Even if the knowledge survived, it would be lost to humanity if it could not be accessed. Just imagine the poor archeologist a hundred million years from now trying to figure out the architecture and function of an Intel® Atom™ Processor Z510 1.1 GHz 400 MT/s or supporting memory devices.
Technical obsolescence of electronic publications due to changes in software already makes it difficult to access files created 30 years ago. Software companies limit and ultimately refuse to support legacy products. Wikipedia lists more than 50 word processing programs; few are common names. Several software vendors have already disappeared. Think of what would happen if Microsoft went bankrupt—farfetched, but Microsoft is a single-technology company like Xerox and Kodak.
Clearly, there is a need to curate information for decades. Take pharmaceuticals, for example. The patent life of a pharmaceutical is 20 years. But with the practice of “evergreening” successful patented drugs, patents can endure for several multiples of 20 years. After 40 years, how is a generic drug developer going to reference the old methods used to characterize the innovator’s product and compare it with the new batch?
The great divide between white and gray data
Data accessibility is not uniform across all fields of interest. For example, in science, technology, engineering, and mathematics (STEM), the sciences benefit from a competitive cadre of research publishers. These publishers make original articles, written in English, available to scientists in the developed world. This body of literature, which is ranked by a complex set of metrics (impact factor, citation index, etc.), is called “white information.”
In contrast, the applied sections of STEM receive much less attention. Enter the world of “gray information.” The gray world suffers from a lack of indexing. For example, about half of the engineering information is not curated or internationally accessible. These reports often reside in flat files located in a university or department library.
This gray literature sits outside the formal publication process of journals and books, but it is starting to be curated in electronic form. Natural language processing using semantic technology offers a convenient route to search and retrieve this valuable information on a global basis.
Even in chemistry, works such as analytical methods that are not included in white publications (including AOAC, ASTM, and USP) are not systematically available. For example, identity assays of pharmaceuticals are often performed by HPLC in the developed world, but in the developing world, HPLC is not practical due to cost and a lack of reliable electrical power, mobile phases, and training. Thin-layer chromatography (TLC) is much more suitable to assure the user that the drug product contains the drug active. But just try to get a non-novel TLC method published for others to use.
Local language journals such as SEPU (The Chinese Journal of Chromatography) illustrate another hole: information overlooked by the firms serving the white information segment is lost to society. In my travels to China starting in the early 1990s, I was impressed with the ingenuity of the analysts in solving analytical problems by using the best technology available to them. This was often a packed-column GC with a thermal conductivity or flame ionization detector, which was 20 years out of date.
Instrumentation for capillary GC and MS was not generally available. But these chemists used chemistry to provide actionable data. The methods were time-consuming and not very precise, but they were used, and hence useful. Today, SEPU publishes articles in Chinese that would certainly pass peer review if they were written in English. Hopefully, computer-aided translation will soon automatically process these articles for the rest of the world to see.
Another issue: PowerPoints of posters and lectures presented at scientific meetings consume hours of intense preparation but are often quickly forgotten after the event. Six months later, only a few people remember them, but they cannot be retrieved or cited. The impact factor is probably positive, but small.
The gray information segment might benefit from a comprehensive effort by a major firm such as Google to scan and curate material from the gray STEM world.
In summary, informatics today is wrestling with several huge problems. Open-source is what the public can understand, but the other issues are fundamental. These must be recognized, addressed, and solved.
Robert L. Stevenson, Ph.D., is a Consultant and Editor of Separation Science for American Laboratory/Labcompare; e-mail: firstname.lastname@example.org.