Since structured query language (SQL) was introduced 40 years ago, it has evolved and mutated to become the most popular tool for managing tabular databases. As data files get larger and more complex, new query options are emerging that are designed to be faster, easier to use, and able to work with different data file structures. The NoSQL Now! 2012 conference attracted about 800 database developers to the San Jose, CA, Convention Center from August 21 to 23, 2012. The vendor community supported the meeting by exhibiting products that offered improved performance.
The title of the meeting, “NoSQL,” illustrates the issue. The obvious interpretation of NoSQL is to kill and bury SQL, replacing it with improved language(s). However, SQL works, and it supports most data queries today. It is ideal for small relational databases with static architecture. It is less useful, however, for evolving database architectures and especially for natural language files. Replacing SQL quickly would entail a huge investment in IT infrastructure, which is unlikely to happen. So the preferred interpretation of “NoSQL Now” is “not only SQL, now.”
SQL will probably remain the main medium for most queries, especially in the domain of formal information technology departments, with a replacement half-life measured in decades. Newer query capabilities are evolving rapidly (Table 1), including XQuery (trees), semantic queries (triples with URIs), and star schema/on-line analytical processing (OLAP); these are easier to use, faster, and can search for and display relationships that are difficult to express in SQL. Semantic technology is particularly attractive since it is based on widely adopted W3C standards and works well with data from natural language processing (text mining) resources as well as with scientific data, for example, from biotechnology, health care, and other environments.
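The architectural difference is easy to see in miniature. The sketch below, in Python, contrasts a fixed-schema relational query (via the standard sqlite3 module) with a triple-style lookup of the kind SPARQL performs over RDF stores. The table, gene names, and triples are purely illustrative assumptions, and the triple matching is hand-rolled Python rather than a real triple store:

```python
import sqlite3

# Relational (SQL) style: data lives in a table with a fixed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE genes (symbol TEXT, organism TEXT, chromosome TEXT)")
db.executemany("INSERT INTO genes VALUES (?, ?, ?)",
               [("BRCA1", "human", "17"), ("TP53", "human", "17")])
rows = db.execute("SELECT symbol FROM genes WHERE chromosome = '17'").fetchall()
sql_result = [r[0] for r in rows]
print(sql_result)

# Triple style (as in RDF/SPARQL): every fact is a (subject, predicate,
# object) triple, so new kinds of facts need no schema change at all.
triples = [
    ("BRCA1", "locatedOn", "chr17"),
    ("TP53",  "locatedOn", "chr17"),
    ("TP53",  "regulates", "apoptosis"),  # new predicate, no ALTER TABLE
]
triple_result = [s for (s, p, o) in triples
                 if p == "locatedOn" and o == "chr17"]
print(triple_result)
```

The point of the contrast: adding the "regulates" fact to the relational version would require altering the table, while the triple version simply grows by one row, which is why triple stores suit evolving architectures.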
Table 1 – Data file architecture and related query programs
Big data is an issue for many domains, but the problems and attractive solutions are domain specific. For the sciences, large data files exist in various institutes. Their size makes data sharing difficult and restricts it to an elite group. Genomics, with whole genome sequencing, is probably the most conspicuous example. The $1000 genome sequence appears to be within reach, but development of the required data-analysis capability is trailing behind.
For a practical example, whole genome sequencing is proving useful in tracing “pop-up” infections of superbacteria. Sequencing provides much more sensitive and specific bacterial identification than traditional “swab and culture” methods. A report hit the news on the last day of the meeting. A patient presented with a lung/throat infection of K. pneumoniae. Despite strict protocols designed to prevent infection of other patients, three others were infected, and these infected still others, resulting in six deaths. Using whole genome sequencing, the route of hospital infection was traced to inadequate disinfection of a respirator and sink drain in the primary patient’s room. This is but one example of how genome sequencing of microbes may become the method of choice for elucidating the origin and spread of disease outbreaks. Yes, bacterial genomes are smaller (~1 to 10 Mbp compared to 3000 Mbp for humans), but the number of microbes is estimated at a nonillion (10^30). Big data, indeed.
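A back-of-envelope calculation shows why “big data, indeed” is warranted. The sketch below uses the figures from the text (a nonillion microbes, ~1 to 10 Mbp genomes); the 5-Mbp average and the 2-bits-per-base encoding are assumptions for the sake of arithmetic, and of course no one would sequence every microbe; the point is only scale:

```python
# Numbers from the text: ~10^30 microbes, bacterial genomes of ~1-10 Mbp.
microbes = 10**30
mean_genome_bp = 5 * 10**6        # assumed midpoint, 5 Mbp

total_bp = microbes * mean_genome_bp

# At 2 bits per base (4 bases per byte), the raw storage for all of it:
total_bytes = total_bp // 4
print(f"total bases: {total_bp:.1e}, raw storage: {total_bytes:.2e} bytes")
```

Even this lower-bound encoding lands around 10^36 bytes, roughly a trillion trillion terabytes, far beyond any conceivable archive.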
Circulating tumor cells (CTCs) present a related problem. CTCs are a heterogeneous population of cells, originating from a tumor, that circulate in the blood. The DNA in cancer cells is often scrambled between the chromosomes. Scrambling is implicated in cancer since the control (off) signal may be separated from the promoter and coding sequence. Thus, some cells can run wild, i.e., out of control, as in cancer. In contrast, other cells may not be viable since their reproductive machinery is damaged. There may be many structural variants in a population of CTCs, so it may be necessary to perform whole genome sequencing on many individual CTCs to find the ones perpetuating the cancer. The problem then becomes matching the particular cancer structure and mechanism to a therapeutic. After the analytics, which often include IT-intensive assembly of subgenome reads, the remaining problem is analysis of the genome of each CTC. Again, the ability of existing IT technology to analyze these data has fallen behind the productivity of chemical sequencing. The sample load could be very large: the number of new cancer patients forecast for 2012 is 1.7 million, and this may triple by 2040.
Computing in the cloud
The cloud is one aspect of NoSQL that may not be widely applicable to genomics, at least in the near future. At the meeting, cloud computing was considered an integral part of NoSQL. Drivers for the cloud include economics and ease of use. Many enterprises scale their computing farm to handle peak loads, but this entails adding a great deal of excess capacity relative to what is required for routine operation. Case studies report that using the cloud for peak loading can reduce computing costs by more than 50%. However, as reported previously, data files in genomics are so large that the upload and download times are too long. Until this choke point is resolved, sequencing operations need server farms close by. Knome (Cambridge, MA) claims to use cloud computing to provide genome interpretation.
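The peak-load economics can be sketched with a simple cost comparison. All of the numbers below (server counts, costs, peak hours) are illustrative assumptions, not figures from the case studies; they merely show how bursting to the cloud at peak times can beat owning peak capacity outright:

```python
# Assumed, illustrative figures for a computing farm.
peak_servers = 100            # servers needed at peak load
routine_servers = 30          # servers needed for routine operation
owned_cost_per_year = 5_000   # $/server/year to own and operate
cloud_cost_per_hour = 1.0     # $/server-hour rented in the cloud
peak_hours_per_year = 500     # hours/year actually spent at peak

# Option A: own enough servers to cover the peak all year.
own_for_peak = peak_servers * owned_cost_per_year

# Option B: own servers for routine load; rent the difference at peak.
burst_to_cloud = (routine_servers * owned_cost_per_year
                  + (peak_servers - routine_servers)
                  * cloud_cost_per_hour * peak_hours_per_year)

savings = 1 - burst_to_cloud / own_for_peak
print(f"own-for-peak ${own_for_peak:,}  "
      f"burst ${burst_to_cloud:,.0f}  savings {savings:.0%}")
```

Under these assumptions the savings come to roughly 60%, consistent with the “more than 50%” reported in the case studies; the real break-even depends on how peaked the workload actually is.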
Robert L. Stevenson, Ph.D., is a Consultant and Editor of Separation Science for American Laboratory/Labcompare; e-mail: email@example.com.