Seventh Annual Semantic Technology Conference (SemTech 2011): Natural Language Processing Catches up to Large-Scale Integration

In 2009, when I attended the Semantic Technology Conference in San Jose, CA, the meeting was a mixture of software development and applications reports.1 Many were from scientists seeking to improve data searches. Nearly all were quite technical. In the two years since, following a change in ownership, the SemTech meeting has morphed into a business informatics trade show, where reports on scientific applications of semantic search technology were rare indeed. In their place, business applications and practical benefits based on broad integration and searching of unstructured data, such as natural language text, have stolen the spotlight. When unstructured text integration is combined with broad semantic integration of related data sources, new applications are enabled.

This was illustrated in two keynote addresses. Bill Guinn of Amdocs Product Business Unit (Chesterfield, MO) described development of a modern call center that integrates existing customer-specific data and derives a probable hypothesis about the purpose of the call before the first ring. Caller ID is used to pull up the account history and display it on the screen of the next available customer representative (CR). For example, if the last invoice was recently mailed and the amount was higher than usual, the screen shows this and also suggests response scripts. Prior call history is also displayed. This all appears before the CR picks up the handset. If the anticipated topic is wrong, another screen contains routes to other customer support functions such as order handling, field service, shipment tracking, account updates, and complaints.
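To make the idea concrete, here is a minimal sketch of how caller ID might be used to pull account data and propose a probable call reason before the phone is answered. The account fields, rules, and thresholds are invented for illustration; this is not Amdocs' implementation.

```python
from datetime import date, timedelta

# Hypothetical account records keyed by caller ID (invented data model).
ACCOUNTS = {
    "+1-555-0100": {
        "last_invoice_date": date.today() - timedelta(days=3),
        "last_invoice_amount": 182.50,
        "average_invoice_amount": 95.00,
        "open_field_service_ticket": False,
    }
}

def probable_call_reason(caller_id):
    """Return a best-guess reason for the call, plus a suggested response script."""
    account = ACCOUNTS.get(caller_id)
    if account is None:
        return "unknown caller", "Ask for an account number."
    recently_billed = (date.today() - account["last_invoice_date"]).days <= 7
    unusually_high = account["last_invoice_amount"] > 1.5 * account["average_invoice_amount"]
    if recently_billed and unusually_high:
        return "billing question", "Explain the charges that raised this invoice."
    if account["open_field_service_ticket"]:
        return "field service follow-up", "Give the current ticket status."
    return "general inquiry", "Ask how you can help."

print(probable_call_reason("+1-555-0100"))
```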

Bill pointed out that a typical call handled the conventional way, with a detailed routing tree, still costs about $7.00/call. Fully one-third of callers simply "zero out" by pressing 0 for the operator. Overall, traditional trees have a very low success rate (only ~20%), measured as the percentage of issues handled satisfactorily on the first call. The new call center operation from Amdocs reduces the cost per call by more than half, while doubling call efficiency and improving customer satisfaction. Semantic search technology is used in the background to integrate the text and data files in real time and display the information for the CR. The implementation phase of a project for a new installation is about two-thirds shorter than with conventional approaches using relational databases.

The second keynote was given by John O'Donovan of the Press Association in the U.K. He led a team that designed a completely new publication system for the BBC's coverage of World Cup soccer. As the global primary source, the BBC had to take inputs from text, video feeds, rebroadcasts, etc., and make them available to all stakeholders (i.e., individuals, broadcasters, newspapers, and historians) in real time. The volume of information is large, with nearly 1000 pages for players, teams, locations, and so on. John described using a small, approximately 10-person team, including Ontotext AD (Sofia, Bulgaria), to create the semantic system, which ran nearly automatically, in 90 days. It made use of many existing support elements for computing and storage; thus, only two new processors were required. Today, the system is used in all BBC sports coverage and has been cloned into other BBC departments. The next challenge is the 2012 Olympics in London, which will require much more than a tenfold expansion in performance.

The take-home of the keynotes is that while semantic search technology, as a completely new way of modeling and integrating data, seems a bit more complicated initially, it is actually much simpler and faster, both for integrating data up front and for maintaining the integration and related applications at the back end. The front end requires expressing data as sets of "triples." Each triple is a statement consisting of a subject, a predicate (or operation), and an object. These are placed in a Resource Description Framework (RDF) store that can be read quickly by the computer. Other programs can link subjects and objects to form associations, and today's high processing speed makes this happen very rapidly. Because these data relationships are human readable, non-IT professionals can use semantic search technology in their daily work. This avoids the IT keyhole that is strangling many organizations.
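As an illustration of the triple model, here is a minimal sketch using the open-source rdflib package for Python and made-up example resources; it is not any vendor's store, just the general subject-predicate-object pattern.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")   # made-up namespace for illustration

g = Graph()
# Each statement is a (subject, predicate, object) triple.
g.add((EX.sample42, EX.measuredBy, EX.instrument7))
g.add((EX.sample42, EX.concentration, Literal(3.2)))
g.add((EX.instrument7, EX.locatedIn, EX.lab1))

# Other programs can traverse the links to form associations,
# e.g., every sample measured by an instrument located in lab1.
for sample, _, instrument in g.triples((None, EX.measuredBy, None)):
    if (instrument, EX.locatedIn, EX.lab1) in g:
        print(sample)

# The graph serializes to a human-readable form (Turtle).
print(g.serialize(format="turtle"))
```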

For biopharma, Robert Stanley of IO Informatics (Berkeley, CA) cited three recent cases to demonstrate how semantic technology can reduce the pain and frustration in data integration. The first was for a manufacturing unit of Pfizer (Groton, CT) that needed an easier way to integrate data from many sources to comply with lot release filings to regulators. Specifically, the FDA had asked how the data compared to the original data used at licensure. With IO's Sentient Query, the data from the experimental sources were semantically linked to the report data, and thus were quickly retrieved. Pfizer had allowed four months for the pilot to become operational: It took less than four weeks. Once operational, the number of full-time employees retrieving the experimental data associated with the original report was reduced by at least a third.

The second case involved the FDA, which wanted to reduce the need for animal testing by comparing toxicology patterns across several species of laboratory animals. The group hoped to find species-independent markers of toxicity or safety. The problem was to access and integrate many disparate data sources (genetic, proteomic, and pathology-image data) from numerous toxicity studies. Due to the complexity of the original data, the barrier to entry on this project was too high for the FDA team to move forward without more effective integration methods. Using the Sentient suite, subject matter experts were able to create an integrated semantic knowledge network, which can add new data sources with far less reformatting overhead than is required for traditional relational integrations. This group is currently comparing data sets to identify candidate cross-species biomarkers and relationships for validation. The system was set up and running far more quickly, and with fewer full-time employees, than the FDA staff had originally estimated.

The third example, which involved several collaborating centers working with data generated at St. Paul's Hospital at the University of British Columbia in Canada, focused on the discovery of markers useful in the stratification of patients with organ failure. Sentient technology was used with statistical algorithms developed by the Prevention of Organ Failure Centre of Excellence (PROOF Centre) to retrospectively analyze historical data sources for patterns related to organ rejection and nonrejection. Once patterns were discovered, they were evaluated for relevance. This work has led to a set of combined gene and protein expression classifiers currently undergoing validation for screening candidate transplant patients for the likelihood of transplant failure.

The small exhibition was dominated by business-to-business (B2B) offerings that were outside the laboratory sphere. An exception was Cambridge Semantics (Boston, MA), which is applying semantic technology (ST) to problems in the laboratory. Two years ago, I was impressed with the company's selection of Microsoft® Excel as the human interface. Nearly everyone in the laboratory space knows how to use Excel. The problem is that Excel is also the most common example of tabular, relational-style data handling, which many see as the antithesis of semantic technology. Cambridge accepts the challenge and puts the ST behind the familiar screens. For example, a common complaint about Excel is that it does not provide traceable data. Neither the data entry nor the organization of the table is inherently traceable to entry date, person, or form. Yet traceability, or provenance, is required in many regulated situations. Like other semantic technology vendors, Cambridge handles the provenance issue by extending the triple-store backbone of ST to a quadruple store, in which the fourth element is used for traceability. With this, one can quickly backtrack and see the complete history of any data point in any database.
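A minimal sketch of the quad-store idea follows, using plain Python tuples rather than any vendor's product: each statement carries a fourth element pointing to a provenance record, so any value can be traced back to who entered it, when, and from which form. The subjects, predicates, and form names are invented for illustration.

```python
from datetime import datetime

# Each record is a quad: (subject, predicate, object, provenance_id).
quads = []
provenance = {}

def assert_fact(subject, predicate, obj, person, form):
    """Store a statement together with a provenance record (the 'fourth element')."""
    prov_id = f"prov-{len(provenance) + 1}"
    provenance[prov_id] = {"person": person, "form": form,
                           "entered": datetime.now().isoformat(timespec="seconds")}
    quads.append((subject, predicate, obj, prov_id))

def history(subject, predicate):
    """Backtrack: list every recorded value for a data point with its provenance."""
    return [(obj, provenance[prov_id])
            for s, p, obj, prov_id in quads
            if s == subject and p == predicate]

assert_fact("lot-1234", "assay_purity", 98.7, "j.doe", "QC form 17")
assert_fact("lot-1234", "assay_purity", 99.1, "k.lee", "QC form 17 (corrected)")
print(history("lot-1234", "assay_purity"))
```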

Changes in technology often involve corresponding changes in corporate fortunes as the new replaces the old. As the leader in relational databases, Oracle Corp. (Redwood Shores, CA) probably has the most to lose from ST. To its credit, Oracle exhibited at SemTech 2011 and presented a short lecture describing an interim approach to bridge the gap between ST and legacy relational databases. This consists of a programming layer on top of the traditional Oracle database management programs that translates between ST and the underlying relational store. It has been under development for several years.
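The general idea of such a bridge can be sketched in a few lines; the table, column names, and URI scheme below are hypothetical, and this is not Oracle's implementation. Each relational row is exposed as a set of triples whose subject is derived from the primary key and whose predicates are derived from column names.

```python
# A hypothetical relational table, represented as rows of column/value pairs.
rows = [
    {"sample_id": 42, "analyte": "caffeine", "result_mg_l": 3.2},
    {"sample_id": 43, "analyte": "aspartame", "result_mg_l": 0.8},
]

def rows_to_triples(table_name, key_column, rows):
    """Expose each relational row as triples: one per non-key column."""
    triples = []
    for row in rows:
        subject = f"http://example.org/{table_name}/{row[key_column]}"
        for column, value in row.items():
            if column == key_column:
                continue
            predicate = f"http://example.org/schema/{column}"
            triples.append((subject, predicate, value))
    return triples

for t in rows_to_triples("sample", "sample_id", rows):
    print(t)
```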

Natural language searching

Over the last two years, ST has broadened from a narrow focus on data integration to natural language searching as a key piece of the semantic integration puzzle. By far the most conspicuous example was a lecture by Aditya Kalyanpur of IBM (San Jose, CA), who described building and improving Watson (the artificial intelligence computer system) to win the Jeopardy! Challenge. The higher-value questions in Jeopardy! are deliberately oblique and call for Byzantine reasoning to arrive at the correct answer. Cataloging fixed query structures seemed hopeless. Instead, IBM adopted several very flexible semantic association structures using ST, combined with weighting to discriminate among the possibilities. For example, one looked for associations in time, another for location, and still another for people. Repeated optimization cycles, at roughly six-month intervals of software improvement, produced measurable progress. After two years, IBM thought that Watson was competitive with humans. A year later, it was consistently competitive with the best.
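The flavor of the approach can be shown with a toy example; the candidate answers, scorers, and weights below are invented and greatly simplified, not IBM's actual scorers. Several independent evidence scorers (temporal, geographic, person) each grade a candidate answer against the clue, and a weighted combination ranks the candidates.

```python
# Toy evidence scorers; each returns a score in [0, 1] for a candidate answer.
def temporal_score(candidate, clue):
    return 1.0 if candidate["era"] in clue else 0.0

def location_score(candidate, clue):
    return 1.0 if candidate["place"] in clue else 0.0

def person_score(candidate, clue):
    return 1.0 if candidate["person"] in clue else 0.0

# Weights of this kind would be tuned over repeated optimization cycles.
WEIGHTS = {"temporal": 0.5, "location": 0.3, "person": 0.2}

def rank(candidates, clue):
    """Combine weighted evidence scores and rank candidates, best first."""
    scored = []
    for c in candidates:
        total = (WEIGHTS["temporal"] * temporal_score(c, clue)
                 + WEIGHTS["location"] * location_score(c, clue)
                 + WEIGHTS["person"] * person_score(c, clue))
        scored.append((total, c["answer"]))
    return sorted(scored, reverse=True)

clue = "In 1215 this king sealed a charter at Runnymede"
candidates = [
    {"answer": "King John", "era": "1215", "place": "Runnymede", "person": "king"},
    {"answer": "Henry VIII", "era": "1534", "place": "London", "person": "king"},
]
print(rank(candidates, clue))
```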

Typical association studies needed two hours on a single processor, but to be competitive on stage and beat the humans, IBM needed to respond in less than three seconds. So it simply ran the software across roughly 2880 POWER7 processor cores and called the product "Watson." The rest is history: brawn + brains beat brains alone. IBM's next challenge is to apply Watson to medical diagnostics. It will be interesting to see whether the editors of The New York Times Magazine add Watson's diagnoses to the weekly diagnosis series.

Back in the real world, semantic search of natural language text is a reality. A tutorial by Dan McCreary of Kelly-McCreary & Associates (Minneapolis, MN) on entity extraction (EE) described the process. Entity extraction targets include people's names, organizations, dates, times, events, products, prices, medical conditions, symptoms, and drugs. It uses the appearance of these items to derive meaning and to point to documents containing similar items in similar populations. Entity extraction does this by putting items into XML2 and then RDF. From there, the words move up to ontologies, where synonyms are discovered. Ambiguities are resolved, usually by referring to context. In EE, context relies on patterns of association with other words. For example, if the word is "lead" and the text also mentions "violin," the context is probably music. In contrast, if the text contains "battery" and "toxicity," the text is probably about lead in lead batteries. These articles belong to different taxonomies. Classification based on only a few words may be off, but if one has a pattern of, say, 10 words, confidence in categorization is much higher. This facilitates grouping documents into taxonomies and related ontologies. The information is then ready for querying with languages such as SPARQL, and inference engines can be used to make assertions which, if proven, become facts. A related development is RDFa, which embeds RDF annotations, or "tags," directly in Web documents. Although tagging involves human intervention, annotations can significantly improve the quality and speed of searches.
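A toy sketch of the context-based disambiguation step follows; the word lists are invented and far smaller than anything used in practice, and this is not Kelly-McCreary's method. The sense of an ambiguous term such as "lead" is chosen by counting how many of the words around it match each candidate context, and confidence grows with the number of matches.

```python
# Toy context profiles for two senses of the ambiguous term "lead".
CONTEXTS = {
    "music": {"violin", "orchestra", "concerto", "score"},
    "chemistry": {"battery", "toxicity", "exposure", "soil", "pb"},
}

def disambiguate(document_words):
    """Pick the sense whose context words co-occur most often in the document."""
    words = {w.lower() for w in document_words}
    scores = {sense: len(words & vocab) for sense, vocab in CONTEXTS.items()}
    best = max(scores, key=scores.get)
    # More matching words -> higher confidence in the categorization.
    return best, scores

doc = ["lead", "battery", "toxicity", "exposure", "recycling"]
print(disambiguate(doc))   # ('chemistry', {'music': 0, 'chemistry': 3})
```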

Looking ahead, the sciences will probably benefit more than any other discipline, since semantic technology will simplify both data integration and searching the literature. Elsevier B.V., in particular, stands to benefit, since the publisher owns the copyrights to about 25% of the world's scientific content and has a customer base of 15 million users. Elsevier seeks to capitalize on this incredibly strong strategic position to expand into natural language search through the SciVerse Platform. The core of the platform provides content-specific application programming interfaces (APIs) to ScienceDirect, Scopus, and the Hub, through which one makes search and retrieval requests enabled with ST. One can customize the APIs to form gadgets that serve as an interface to the Web, including third-party applications and content.

Credits

Although the focus of the Semantic Technology Conference has clearly shifted away from the sciences, it is still relevant to us. However, I expect that scientists will split off and form a more focused meeting.

I congratulate the new owners of the Semantic Technology Conference series, WebMediaBrands (New York, NY), for organizing an outstanding program with excellent content. On the downside, the lack of laser pointers made it difficult for lecturers to guide the audience to particular regions of the screen. This was compounded by the tendency of many lecturers to use screen dumps that simply did not project with useful clarity.

Next, WebMediaBrands plans a world tour of the meeting in Washington, D.C. and London, before returning to San Francisco, June 3–7, 2012.

References

  1. Stevenson, R.L. Am. Lab. Nov 2009, 4.
  2. en.wikipedia.org/wiki/XML.

Dr. Stevenson is a Consultant and Editor for American Laboratory/Labcompare; e-mail: [email protected].