Big data and in silico analytics are driving the current stage of the information revolution. Data integration is one important part. Multidisciplinary research requires incorporating data from other specialties. This can be difficult if the file structures are different. Even the headings for columns or rows can use the same word with a different meaning, depending upon context. For example, “resolution” has quite a different meaning to chromatographers and spectroscopists.
Across all disciplines, in silico analytics, which seeks to discover relationships buried in the data, is emerging as even more valuable, particularly in business, the life sciences, and healthcare.
DATAVERSITY™ (Studio City, CA) is the leader in education for networking and data management. Chemistry, life sciences, and applied clinical diagnostics are all data intensive, and rapidly growing more so. Over the last decade, applications lectures at the SemTech series chronicled the evolution of semantic technology. Recently the focus has been on data integration and predictive analytics, particularly in business.
Four years ago, DATAVERSITY organized the “NoSQL Now!” conference in San Jose, CA. This year they co-located SemTech and NoSQL Now! and added a third forum on Cognitive Computing (CC). The latter focused on the rapidly emerging field that uses neural computing and a new processor architecture called True North. Combined, the program attracted over 1000 IT professionals to the San Jose, CA, Convention Center from August 19 to 21, 2014.
Each of the meetings is relevant to the subscriber base of American Laboratory, since we all have significant problems in data integration, including searching, and most scientists see potential utility in predictive analytics. Hence I attended sessions in the three meetings with the interests of American Laboratory subscribers in mind.
Interesting business applications employing semantic technology (ST) continued to grow at the 10th Annual Semantic Technology & Business Conference (SemTech). This expansion of scope shows that ST has entered the growth phase powered by commercial adoption. Growth was confirmed by the increase in vendors in the exhibition. Several seemed to be first-timers.
ST incorporates ontologies and data in the Resource Description Framework (RDF) format and its extensions. An RDF database is often referred to as a triple store, since the information is stored in the form of triples, each consisting of a subject, a predicate, and an object. Triples are more flexible in organizing data, which facilitates data integration and simplifies the search query. The triple format also facilitates natural language search.
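To make the idea of a triple concrete, here is a minimal sketch in plain Python. It is not an RDF library and does not follow the RDF syntax; the sample names, predicates, and the `match` helper are hypothetical, chosen only to show how triples from two specialties can be merged and searched with a single pattern.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def match(store: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for t in store:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


# Two hypothetical data sets produced by different specialties.
chromatography: Set[Triple] = {
    ("sample42", "analyzedBy", "HPLC"),
    ("sample42", "hasAnalyte", "caffeine"),
}
spectroscopy: Set[Triple] = {
    ("sample42", "analyzedBy", "NMR"),
    ("caffeine", "hasFormula", "C8H10N4O2"),
}

# Because every fact has the same three-part shape, integration is a set union.
merged = chromatography | spectroscopy

# One pattern answers "which methods were used on sample42?" across both sources.
methods = sorted(t[2] for t in match(merged, s="sample42", p="analyzedBy"))
print(methods)
```

In a real triple store the subjects and predicates would be URIs drawn from an ontology, which is what lets two groups agree on what a term such as “resolution” means.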
The opening keynote by Phil Archer of the W3C (World Wide Web Consortium, Cambridge, MA) noted that 2014 is a great year for ST. The web is 25 years old, the W3C is 20 years old, and this is the 10th year of the annual SemTech series. Early on, the sciences and healthcare were obvious targets. This has continued: Open PHACTS has an RDF database of 470 million triples of open pharmacological and chemical data. The Monarch Initiative has compiled databases on model organisms, in vitro models, genes, gene expression, phenotypes, pathways, and more. DNA Digest is starting to provide deidentified genomic data.
Managing sepsis with semantics
Michael Grove (Clark and Parsia LLC, Washington, DC) described a case history of the development of a semantics-based model for treating sepsis. Sepsis is 10th among the leading causes of death in America, but the presenting symptoms are not unique. High fever, lack of appetite, tenderness, headache, etc., are common symptoms caused by many diseases.
Sepsis typically starts as a bacterial infection. In most cases, the body’s immune system responds and counters the infection. However, in some cases, the bacteria resist the immune system and the infection grows, leading to sepsis. At this point, therapy includes looking for alternative drugs and characterization of the bacteria. Ideally, this leads to matching of the bacteria with a particular drug. Usually this works, but in some cases, the bacterial load is so large that the residue from killing the bacteria is sensed by the host as increased infection. It responds with an all-out attack. The immune system overloads the regulatory system, leading to lower blood pressure, which in turn starves the organs of oxygen and inhibits waste removal, leading to organ failure and death. Along the way, there are many key questions that should be considered, such as “Is this a bacterial or viral infection?” and “Does the patient have drug allergies?”
Since sepsis accounts for 2% of hospital admissions, which cost America about $90 billion per year, MHN (Managed Health Network, San Rafael, CA) selected it as the pilot case. This is part of MHN’s initiative to use modern technology, including integrated data, to aid thinly staffed healthcare institutions in recognizing and managing sepsis and septic shock. Grove’s lecture showed how MHN’s pilot program used semantic graphs combined with 24 clinical measures to identify patients at risk. This used an RDF triple store managed under W3C guidance to identify and interconnect nodes.
The program has been particularly effective in guiding nonphysicians in diagnosis and initiating appropriate therapy with cell phones and tablets. This has improved outcomes for patients in remote locations or presenting at times when physicians are busy or not available. The sepsis package is already helping improve outcomes in facilities in the Pacific Northwest. Since the sepsis pilot is considered successful, Grove expects other diseases will be added.
The Yosemite manifesto
The Yosemite manifesto issued two years ago by a Working Group of the W3C recommends that electronic health records (EHRs) be constructed in RDF format. RDF files are easily expanded and machine accessible. This facilitates searching using SPARQL (SPARQL Protocol and RDF Query Language) and other search engines. EHRs are common in clinical trials as well as in healthcare facilities. Drs. Terry Roch (Capsicum Business Architects) and Dean Allemang (Working Ontologist LLC) presented a description of EHRs in Australia. Performance to date is good, and the approach has been recommended to the health standards community in Australia.
A developer’s view of semantic technology
According to David Wood, Ph.D., CTO of 3 Round Stones, Inc. (Arlington, VA), one of the many advantages of ST is that it is easier and faster to change data than code. Code requires coding specialists. Reducing code permits users to focus on the information rather than fight the constraints of the database architecture. This shows up as improved speed to market, lower cost, and more applications. Wood asserts that over the lifetime of a data product, maintenance of the data files accounts for about 60% of the total cost. However, software maintenance is seldom forecast in software budgeting. He presented several case histories supporting his contention that your next products should be semantic. Dr. Wood recently completed a book designed to help readers plan and build software using ST. His parting quip was that “Big Data” usually leads to a big “Data Problem.” Think about it.
Four years ago, a new meeting titled NoSQL Now! provided a forum for semantic technology advocates and offered ST as an alternative to relational databases (RDB) and SQL (structured query language). Although RDB and SQL dominated the IT space, SQL was a notoriously cumbersome query language for searching for data in relational databases. Non-IT and even skilled IT professionals found it difficult to extract data to respond to questions from the operating groups.
Common scenarios: “We know that the answer is in there, but we cannot get it out.” Or, in medical research: “We may have cured (name of disease) but we would not know it.” “We have two decades of files, all with evolving architecture stored with several operating systems. We really need easy access to it.”
SPARQL was clearly superior in power and ease of use. Despite the irrational exuberance of some semantic web proponents, cooler heads recognized that quickly replacing RDBs with RDF stores would be very expensive and would require revision of many of the programs that make our society run. It was not going to happen. Relational databases, with all their frustrating idiosyncrasies, are just too ubiquitous for quick replacement. Longer term, this may happen, but in the meantime, hybrid systems will exist.
Such a system was described by Marc Lieber of Trivadis, AG (Basel, Switzerland). Trivadis has several pharmaceutical clients that have years of data in relational databases, usually supported by several Oracle products. Trivadis was charged with interfacing the legacy structure to take advantage of RDF and SPARQL technology. Lieber started with Oracle’s Spatial and Graph software, which created triple stores. This facilitated operating on data with semantic technology and storing the results in relational databases, where the data are protected by Oracle’s DB administration.
This approach was amplified by a series of new product announcements by Andrew Mendelsohn, a vice president of Oracle. Generally, these were designed as layers to facilitate the best parts of ST, while protecting the Oracle base products. It will be interesting to benchmark this approach against one that uses ST alone.
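The hybrid idea, in which data stay in a relational database but are also exposed as triples, can be illustrated with a small sketch. This is not how Oracle’s products or the Trivadis system work internally; it uses Python’s built-in sqlite3 module, and the `assay` table, its columns, and the `rows_to_triples` helper are hypothetical names invented for the example.

```python
import sqlite3
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def rows_to_triples(conn: sqlite3.Connection, table: str, key: str) -> List[Triple]:
    """Emit one triple per non-key, non-NULL cell: (table/key value, column, cell)."""
    # The table name is interpolated directly, so it must come from trusted code.
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    key_idx = columns.index(key)
    triples: List[Triple] = []
    for row in cur:
        subject = f"{table}/{row[key_idx]}"
        for col, value in zip(columns, row):
            if col != key and value is not None:
                triples.append((subject, col, str(value)))
    return triples


# A hypothetical legacy table held in an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assay (id INTEGER PRIMARY KEY, compound TEXT, ic50_nm REAL)")
conn.executemany(
    "INSERT INTO assay VALUES (?, ?, ?)",
    [(1, "cmpd-A", 12.5), (2, "cmpd-B", None)],
)

triples = rows_to_triples(conn, "assay", "id")
for t in sorted(triples):
    print(t)
```

The relational table remains the system of record, so existing administration and security still apply, while the derived triples can be merged with data from other sources.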
The Cognitive Computing (CC) forum segment of the meeting was the first I’ve seen on this topic. CC significantly extends the predictive analytics part of semantic technology by adding statistics to the edges (associations) in graph computing. The statistics evaluate the probability or strength of the association between the nodes. This extension is impressive, since it allows machines to compare the probability of relevancy of an association; this is “cognitive analytics” since it mimics how humans process information. When the probability is high, one (human or machine) can now rapidly evaluate and compare many possible conclusions. Generally, only the most probable are reported. Historically, the Defense Advanced Research Projects Agency (DARPA) accelerated the development of CC by awarding IBM a contract in 2008. This led to the development of IBM’s Watson and probably True North.
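A rough sketch of what “statistics on the edges” can look like follows. The graph, the node names, and the strengths are invented for illustration, and the noisy-OR rule used to combine evidence is an assumption of this example, not something presented at the forum.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical graph: each edge (evidence node, conclusion node) carries an
# association strength between 0 and 1. All values are made up.
edges: Dict[Tuple[str, str], float] = {
    ("obs_1", "concl_A"): 0.30,
    ("obs_1", "concl_B"): 0.55,
    ("obs_2", "concl_A"): 0.60,
    ("obs_2", "concl_C"): 0.40,
}


def rank_conclusions(observations: Iterable[str], top_n: int = 2) -> List[Tuple[str, float]]:
    """Combine edge strengths per conclusion with a noisy-OR and keep the top_n."""
    seen = set(observations)
    miss: Dict[str, float] = {}  # probability that no observed edge supports the conclusion
    for (src, dst), strength in edges.items():
        if src in seen:
            miss[dst] = miss.get(dst, 1.0) * (1.0 - strength)
    scored = [(concl, 1.0 - m) for concl, m in miss.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]  # only the most probable conclusions are reported


ranked = rank_conclusions(["obs_1", "obs_2"])
for conclusion, score in ranked:
    print(f"{conclusion}: {score:.2f}")
```

With both observations present, concl_A is supported by two edges and ranks first, while the weakest candidate is dropped from the report.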
Given IBM’s early role, it is fitting that the keynote introduction to CC came from IBM. Dr. Christopher Welty (IBM, retired) opened the conference with a comparison of traditional computing with CC. The process starts with semantics, which can give multiple possible connections. The new software paradigm evaluates many inexact solutions using multiple (~100) hypothesis-scoring subprograms. Each subprogram is added only if it improves the overall probability of validating the hypothesis.
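The rule that a scoring subprogram is kept only if it helps can be sketched as a greedy selection loop. This is an illustration under stated assumptions, not IBM’s implementation: the toy scorers, the validation data, the averaging of scores, and the use of validation accuracy as the measure of improvement are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str], float]        # maps a candidate answer to a score in [0, 1]
Labeled = Sequence[Tuple[str, bool]]   # (candidate, is_correct) validation pairs


def accuracy(scorers: List[Scorer], data: Labeled) -> float:
    """Fraction of candidates classified correctly when the mean score must exceed 0.5."""
    if not scorers:
        return 0.0
    hits = 0
    for candidate, truth in data:
        mean = sum(s(candidate) for s in scorers) / len(scorers)
        if (mean > 0.5) == truth:
            hits += 1
    return hits / len(data)


def select_scorers(pool: List[Scorer], data: Labeled) -> List[Scorer]:
    """Greedy forward pass: keep a scorer only if it raises validation accuracy."""
    kept: List[Scorer] = []
    best = 0.0
    for scorer in pool:
        trial = accuracy(kept + [scorer], data)
        if trial > best:
            kept.append(scorer)
            best = trial
    return kept


# Three toy scorers for the hypothetical question "which city is the capital of France?"
def starts_with_p(c: str) -> float:
    return 1.0 if c.startswith("p") else 0.0


def always_yes(c: str) -> float:
    return 1.0


def has_five_letters(c: str) -> float:
    return 1.0 if len(c) == 5 else 0.0


data = [("paris", True), ("pisa", False), ("milan", False), ("rome", False)]
kept = select_scorers([starts_with_p, always_yes, has_five_letters], data)
print([s.__name__ for s in kept], accuracy(kept, data))
```

Here the uninformative scorer is rejected because it does not raise accuracy, while the two weak but complementary scorers are kept because together they classify every candidate correctly.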
Welty pointed out that CC builds upon logical associations generated by computers by evaluating the probability. Some will note that this has been done for several years in scoring matches of mass spectra of peptides and proteins in proteomics. What is different are the scale and natural language processing. Welty explained that the CC approach really requires large databases. He specifically claimed that the restriction imposed by HIPAA “is killing people (privately).”
CC is already showing utility in several fields, according to Paul Hofmann, Ph.D., CTO of Saffron Technology (Cary, NC). The company’s scoring algorithms use cognitive distance and associative memories to improve predictive value for threat analysis. One example was a predictive maintenance program for Chinook helicopters. The goal was to reduce maintenance costs by replacing fleet-wide generalities with airframe-specific maintenance. More than 40 data sources were consulted for each ’copter, including crew reports. This has materially improved uptime for each aircraft compared to a fleet-wide maintenance approach.
Mt. Sinai Hospital in New York City is well respected in cardiology, yet a significant fraction of its patients were misdiagnosed. There was a need for much more detailed pictures of fluid mechanics in the heart, especially the left ventricle. This involves detailed (10,000 data points/beat) electrophysiological measures. Computer-aided analysis of the data set is emerging as a significant aid to improving diagnostic accuracy and patient outcomes.
While listening to the lectures, I wondered about the impact of this computing power on the grid. How many Watsons can we afford and power up? Then I remembered that IBM also has developed the True North brain chip, which reduces power consumption 1000×. Improved processor technology may just be in time to make these software advances available to the masses, not just the rich.
Credits and next year
Tony Shaw, founder and CEO of DATAVERSITY, deserves special credit for organizing a complex, multifaceted program. I appreciate the assistance of Ms. Gretchen Hydo of Chatterbox PR, who managed the media center. The 2014 meeting was a technical success. I expect it will be repeated in 2015. Please monitor http://www.dataversity.net/ for details as they develop.
Robert L. Stevenson, Ph.D., is Editor, American Laboratory/Labcompare.