Semantic Web Technologies as a Key to Successful Federation of Data in Life Science Organizations

Over the years, the RDF Semantic Web has become an important player in federating heterogeneous data sources through a common query language, SPARQL. The W3C's SPARQL is now established as a powerful query language that, combined with ontologies, enables data federation. This article presents a practical implementation of data federation at a pharmaceutical company based in Switzerland. The goal was not only to build a proof of concept, but to make Semantic Web technology accessible in a production environment.

The first Resource Description Framework (RDF) project in which the author was involved was a scientific knowledge portal used by many applications to search ontology-oriented data. It was organized around scientific concept types (genes, proteins, indications, anatomy, diseases, taxonomy, etc.) and built a semantic network of scientific concepts.

The former application, written entirely in PL/SQL, was migrated to Java, and the data were moved from Oracle relational tables to a triple-store-based technology. The challenges encountered included data validation, versioning at the concept level, query performance, replication, and security. Nevertheless, the project was deemed a success and has been running in production for the past two years.

This gave us the opportunity to become more visible inside the company and to participate in projects where Semantic Web ontologies were applied on a much larger scale. The amount of data to be handled and analyzed, as well as the mix of structured data and semistructured data such as text mining outputs, required a completely different approach known as polyglot programming and polyglot persistence. This approach takes advantage of the fact that different languages and storage technologies are suited to different problems.

Storing ontology-oriented data

Why was the W3C Semantic Web stack chosen initially? We were already running an application with excellent response times, but we lacked flexibility when attempting to map ontologies to a traditional relational database management system (RDBMS). Every new property or relation required changes to the data model and a rebuild of the application. Mapping a rich domain model to a relational schema is complicated.

We were looking for a flexible, changeable schema and rich multilevel taxonomies. We decided to use Oracle's RDF implementation, Oracle Spatial and Graph. This helped us manage a smooth transition from the old environment to the new one: database administration stays the same, PL/SQL programming is still supported, and we remain in an enterprise environment with full support from Oracle.

Oracle also allows data stored in the triple store to be federated with data stored in relational tables, using a dedicated SQL table function named SEM_MATCH. Overall query performance met the design goal of matching the legacy application while offering enhanced usability.
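
As a rough illustration of this hybrid querying style, the following sketch joins SEM_MATCH results with an ordinary relational table. The model name, vocabulary, and table and column names are invented for illustration, not taken from the project described here.

    -- Hedged sketch: joining triple-store data with a relational table.
    -- Model, prefix, table, and column names are illustrative only.
    SELECT g.sym, a.assay_name
    FROM TABLE(SEM_MATCH(
           '{ ?gene rdf:type :Gene . ?gene :symbol ?sym }',
           SEM_Models('scientific_concepts'),
           null,
           SEM_Aliases(SEM_Alias('', 'http://example.org/onto/')),
           null)) g,
         assays a
    WHERE a.gene_symbol = g.sym;

SEM_MATCH returns one column per SPARQL variable, so the triple-store results can be joined like any other table.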

Finally, the migration to RDF technology enabled the discovery of previously unknown relationships based on the meaning (semantics) of the data, and it made incorporating new kinds of data and relationships easy.

In a second step, we decided to test Semantic Web technology with much larger volumes of data. The scientific knowledge portal stored about 140 million triples; now we wanted to work with billions of triples. This proof of concept was done on a Franz AllegroGraph triple store instead of Oracle Spatial and Graph. We chose this not because the former triple store was better than Oracle, but because setting up a new machine through the Oracle database administrators, with the specific setup required for Oracle Spatial and Graph, takes time in an enterprise environment. AllegroGraph is easy to install on a Linux machine and requires no DBA. We benchmarked other triple stores as well, but in the end AllegroGraph proved a reasonable choice.

Data federation on a large scale

Our objective was to extend the search and analysis capabilities of the first project and be able to federate data coming from different sources having different structures:

  • RDF data stored locally as triples, like the data from the scientific knowledge portal
  • RDF data accessible through the Linked Open Data initiative via SPARQL endpoints, such as UniProt
  • Text mining outputs (MEDLINE data, genomic data)
  • Scientific data stored in large Oracle data warehouses
  • Existing in-memory indexing tools such as Apache Solr indexes (apache.org)

In short, this POC is a computationally searchable, integrated data environment built around concepts such as gene, phenotype, and compound, drawn from the external literature and internal data sources. It allows rapid review of the pertinent information. Even if the results generate some noise, the discovery of unknown relationships matters more: we want to find the needle in the haystack.

This discovery process requires heavy use of RDF reasoning, and this is where the way the ontology is built becomes extremely important. Our experience shows that reasoning over billions of triples with OWL modeling is a real challenge. Full OWL can be used for a fine-grained definition of an ontology, but reasoning then becomes too slow, and there is a definite risk of unexpected results if the ontology grows too complex.

For that reason, we restricted the ontology to RDF++ and, even then, only Oracle was able to materialize the inferred triples, after many hours of work. We also tried partial reasoning with Jena rules and with the CONSTRUCT and property path syntax of SPARQL 1.1. The main problem is that reasoners such as Pellet, TrOWL, and HermiT do not really stream; they require significant amounts of memory into which the data must be loaded. This shows that the triple store should only contain the ontology-relevant data, and that the rest of the data should be stored elsewhere and be accessible through different types of linking tools.
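
As one example of such partial reasoning, a transitive closure that a full reasoner would normally compute can be materialized with a SPARQL 1.1 property path. This is a minimal sketch of the pattern, not a query from the project:

    # Minimal sketch: materializing the transitive subclass closure with a
    # SPARQL 1.1 property path instead of invoking a full OWL reasoner.
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    CONSTRUCT { ?class rdfs:subClassOf ?ancestor }
    WHERE     { ?class rdfs:subClassOf+ ?ancestor }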

For example, the text mining data will be stored outside the triple store. We decided to keep them in a NoSQL database, either a document store or a key-value store; both solutions scale to petabytes of data. MongoDB was our first option, since it stores documents in JSON format and offers easy-to-use query functionality as well as full-text search.

The second option was Oracle NoSQL using its table model feature. Here again, JSON documents can be stored, but they are validated against a model. Oracle NoSQL is a proven, robust key-value store and requires less storage thanks to Avro compression, but it is less user-friendly than MongoDB.

We actually stored JSON-LD document structures. Although the JSON community does not primarily see JSON-LD as a way to represent Semantic Web data, it is fully compatible with RDF. This means that by storing JSON-LD documents in MongoDB or Oracle NoSQL, we are already RDF compliant.
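
For illustration, a text mining hit might be stored as a JSON-LD document along these lines; the vocabulary, IRIs, and field names are hypothetical:

    {
      "@context": { "@vocab": "http://example.org/onto/",
                    "gene": { "@type": "@id" } },
      "@id": "http://example.org/mining/doc-12345",
      "@type": "TextMiningHit",
      "gene": "http://example.org/gene/BRCA1",
      "sentence": "BRCA1 is associated with ...",
      "score": 0.87
    }

Because @context maps the plain JSON keys to IRIs, the same document is simultaneously an ordinary queryable MongoDB document and a set of RDF triples.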

The AllegroGraph triple store has a native way of calling MongoDB from inside SPARQL. This lets us retrieve the IDs of the matching MongoDB documents and map them to information stored in the triple store.
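
The shape of such a query is sketched below. The mongo: namespace and mongo:find property are placeholders; the exact magic-property syntax depends on the AllegroGraph version, so consult the Franz documentation rather than this sketch.

    # Hedged sketch: filtering documents in MongoDB from within SPARQL, then
    # joining the returned document IDs with triples in the store.
    # mongo:find and the ex: vocabulary are illustrative, not the exact API.
    PREFIX mongo: <http://franz.com/ns/allegrograph/mongo/>
    PREFIX ex:    <http://example.org/onto/>

    SELECT ?doc ?gene ?symbol
    WHERE {
      ?doc  mongo:find '{ "concept": "gene" }' .   # evaluated against MongoDB
      ?doc  ex:mentionsGene ?gene .                # joined in the triple store
      ?gene ex:symbol ?symbol .
    }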

RDBMS mapping to RDF

Scientific data such as assays are stored in Oracle relational tables. Already existing Solr indexes on assays drastically simplified access to those data through SPARQL. We decided to transform only the most relevant information into RDF and persist it in the triple store. Thanks to AllegroGraph's so-called magic properties, we could perform fast in-memory searches by issuing Solr index calls inside SPARQL. This allowed the impacted assays to be retrieved very efficiently, giving access to information missing from the triple store. An example of a SPARQL query using a Solr index is given in Figure 1.

Figure 1 ‒ Example of a SPARQL query using a Solr index.
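
In the same spirit as Figure 1, the following sketch shows the general shape of such a query. The solr: namespace and solr:match property stand in for AllegroGraph's actual free-text magic properties, and the ex: vocabulary is hypothetical:

    # Hedged sketch: in-memory full-text search on Solr-indexed assay data,
    # joined with structured triples persisted in the store.
    # solr:match and the ex: vocabulary are illustrative placeholders.
    PREFIX solr: <http://franz.com/ns/allegrograph/solr/>
    PREFIX ex:   <http://example.org/onto/>

    SELECT ?assay ?name
    WHERE {
      ?assay solr:match "beta-arrestin" .   # full-text hit from the Solr index
      ?assay ex:assayName ?name .           # structured data from the triple store
    }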

The translation of relational structures into RDF triples can be done with mapping tools such as D2RQ, RDB2RDF, R2RML, or OBDA. Oracle supports native RDB2RDF and R2RML inside the database. R2RML or RDB2RDF views can be created to make relational data look like a triple store: the data stay in the relational tables and are not duplicated.
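
A minimal R2RML mapping has the following shape; the table, column, and vocabulary names below are invented for illustration:

    # Hedged R2RML sketch: mapping an ASSAYS table to RDF.
    # Table, column, and vocabulary names are illustrative.
    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.org/onto/> .

    <#AssayMap>
      rr:logicalTable [ rr:tableName "ASSAYS" ] ;
      rr:subjectMap [
        rr:template "http://example.org/assay/{ASSAY_ID}" ;
        rr:class ex:Assay
      ] ;
      rr:predicateObjectMap [
        rr:predicate ex:assayName ;
        rr:objectMap [ rr:column "ASSAY_NAME" ]
      ] .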

In our case, there was a need for a generic mapping tool to access data stored in Oracle but also in Postgres. This is why we decided to use an OBDA-based tool named Ontop (http://ontop.inf.unibz.it/). Ontop is efficient, easy to use, and allows limited reasoning of the OWL 2 QL type. The mapping can be done with the Protégé RDF editor. The tool shows the result of the SQL query transformation and therefore allows fine-tuning for better performance.

The SQL query rewriting response time is especially critical when the mapping is done on the fly over a large data warehouse. In our project we had two distinct use cases:

  1. Use case 1: Persist assay-relevant data from an Oracle database in the AllegroGraph triple store. We used the Ontop Java API to generate the triples. The number of triples imported is limited (a few hundred million), and the Solr index extends the search functionality.
  2. Use case 2: Link a large Oracle data warehouse with the data stored in AllegroGraph. The amount of data is far too large to be persisted in a triple store. We therefore used transient views to query the RDBMS through OWL/RDFS ontologies on the fly, using SPARQL. The complexity of the query is hidden behind SQL views created with virtual primary key and foreign key definitions; these keys play an important role in the Ontop query rewrite process, as sketched below.
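
On Oracle, such virtual keys can be declared on a view as metadata-only constraints. The following sketch shows the general idea; the view, table, and column names are invented, and whether Ontop picks up view constraints automatically or via its implicit-constraints file depends on the setup:

    -- Hedged sketch: a view over the warehouse with a declared, non-enforced
    -- primary key to guide SPARQL-to-SQL rewriting. Names are illustrative.
    CREATE VIEW assay_result_v AS
    SELECT r.result_id, r.assay_id, r.compound_id, r.ic50
    FROM   warehouse_results r
    WHERE  r.qualified = 'Y';

    -- Oracle accepts view constraints only in DISABLE NOVALIDATE mode;
    -- RELY marks them as trustworthy metadata for rewriters and optimizers.
    ALTER VIEW assay_result_v
      ADD CONSTRAINT assay_result_v_pk PRIMARY KEY (result_id)
      RELY DISABLE NOVALIDATE;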

Overview of our data federation setting

Our project has thus far federated data coming from locally stored RDF (Oracle and AllegroGraph triple stores), RDBMS data, and MongoDB documents. Thanks to SPARQL 1.1, we also have access to external SPARQL endpoints using the SERVICE keyword. The challenge here is to avoid joining remote data with locally stored data using nested loops.
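
A sketch of such a federated query against the public UniProt endpoint follows. The up: namespace is UniProt's real core vocabulary; the ex: vocabulary and the local linking property are hypothetical. Keeping the local pattern selective, or shipping bindings explicitly with VALUES, helps the engine avoid a nested-loop join over the remote endpoint.

    # Hedged sketch: SPARQL 1.1 federation with the UniProt endpoint.
    # The ex: vocabulary is illustrative; up: is UniProt's core namespace.
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX ex: <http://example.org/onto/>

    SELECT ?gene ?protein ?fullName
    WHERE {
      ?gene a ex:Gene ;                 # local triples: genes linked to
            ex:uniprot ?protein .       # UniProt protein IRIs
      SERVICE <https://sparql.uniprot.org/sparql> {
        ?protein a up:Protein ;
                 up:recommendedName/up:fullName ?fullName .
      }
    }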

Text mining outputs, as well as additional semistructured data like genomic data, will become an important part of this POC in the near future. This will require a big data environment such as Hadoop. We will use MapReduce to transform the results of the analysis into JSON-LD and store them in MongoDB.

An overview of the federated data implementation is given in Figure 2. Figure 3 shows an example of query federation using SPARQL: find all assays and their associated compounds where beta-arrestin is measured with a qualified IC50 < 10.

Figure 2 ‒ Overview of federated data implementation.
Figure 3 ‒ Example of query federation using SPARQL: find all assays and their associated compounds where beta-arrestin is measured with a qualified IC50 < 10.

Web-based RDF open platforms

What if you don’t want to know about SQL, RDF, and SPARQL but still want to use Semantic Web technology? We have been looking at open-source implementations that make linked data accessible via APIs readily usable by developers. ELDA, for example, is Epimorphics’ open-source Java implementation of the Linked Data API specification. It is easy to use, and the query results are presented in familiar formats. Basically, it is a web interface for searching and navigating through RDF data.

Another approach is to put a web-based open platform for linked data on top of our data federation environment, presenting the outputs on wiki pages while simultaneously accessing data stored in NoSQL databases, RDBMSs, and Hadoop, as well as Excel files. FluidOps (http://www.fluidops.com) seems to fulfill most of our needs.

Conclusion

Linking data using Semantic Web technology has been a success for our pharma research department. Through the initial project, a migration of an existing knowledge portal to a Semantic Web-based environment, we were able to gain visibility inside the company, become involved in projects of much larger scale, and share our experience with other departments. Storing billions or even trillions of triples inside a triple store is feasible, but it rapidly reaches its limits when reasoning is required. We preferred a more pragmatic solution, polyglot persistence, and decided to store only the ontology-oriented data inside the triple stores.

We believe that this way of distributing the data will help us build a robust and expandable working environment for Semantic Web discovery.

Marc Lieber is Principal Consultant at Trivadis AG, Switzerland; tel.: +41-61-279 97 55; e-mail: [email protected]; www.trivadis.com