How Visualization and Analytics Help Scientists Make Better Decisions Faster

Artificial intelligence. Machine learning. Automation. As these technologies become more widely adopted among life sciences companies in hopes of better innovation and faster drug development, it is natural to wonder whether the researcher will become sidelined. While these methods are all making great strides, for the foreseeable future, at least, scientists will continue to review data, decisions, and results and will ultimately be responsible for advancing or halting important projects in the lab. So, the questions become:

  • How can scientists best work with their data to make effective and efficient decisions?
  • What are the characteristics of the best software solutions that help scientists review data, perform analyses, and support decision-making?

The most effective software must provide first-rate scientific visualization and scientific analysis capabilities. It must also be able to perform scientific data operations and enable the sharing of best practices.

One of the greatest challenges in research is the massive amount of data generated across multiple data types. Scientific visualization is one of the best ways for scientists to make sense of all this data. But to be effective, the visualization application must provide a smooth, fluid user experience with datasets of any size; a jerky, lagging user experience compromises data evaluation. The application must also be proficient at rendering all types of data, from textual and numeric information to scientific objects such as molecules, sequences, protein structures, surfaces, and more.

Put simply, how can software allow scientists to interact with data that could run to millions of rows by thousands of columns and include a large scientific object (think genome or protein) in each cell? To achieve this, software developers can employ a number of strategies, including data abstraction, progressive disclosure, and efficient low-level coding.

To understand how to assess visualization performance, it is first necessary to understand the concept of computational complexity. The computational complexity of an algorithm, whether used for visualization or analysis, is a measure of how difficult the algorithm is to execute and, importantly, how its execution will scale with the size of the data. As an example, the execution time of a first-order algorithm grows linearly with the size of the dataset. First-order scaling is often the best an algorithm can do. Algorithms for either computational analysis or visualization can become overwhelmed by large datasets, and performance becomes much worse than first-order; when this is the case, the software becomes unusable. However, in the case of graphics algorithms, even first-order degradation cannot be allowed as the dataset grows. In other words, the user experience must be consistent regardless of the size of the data; it must be of zero order.

Fortunately, programmers can take advantage of several important characteristics of end users. A visualization program, for example, only needs to draw one screen of data at a time, as that is all the user can see; not everything in a dataset needs to be rendered. It also only needs to draw a level of detail a user can understand, which may well be less than the level of detail in the data, and it only needs to draw motion at a rate the human eye perceives as smooth, generally at or above 60 frames per second (fps).
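To make the zero-order idea concrete, the following Python sketch shows viewport-limited drawing, in which only the rows currently on screen are rendered each frame; the function and parameter names are illustrative assumptions, not a description of any particular product's rendering engine.

    # Sketch of viewport-limited ("zero-order") drawing: only the rows visible
    # in the viewport are rendered, so per-frame work depends on the screen
    # size, never on the dataset size. All names here are illustrative.

    FRAME_BUDGET_MS = 1000 / 60  # ~16.7 ms available per frame at 60 fps

    def visible_slice(first_visible_row, rows_per_screen, total_rows):
        """Return the [start, stop) range of rows that fit in the viewport."""
        start = max(0, first_visible_row)
        stop = min(start + rows_per_screen, total_rows)
        return start, stop

    def render_row(row_index, row):
        pass  # a real table control would rasterize the row here

    def draw_frame(dataset, first_visible_row, rows_per_screen):
        """Render one frame; the loop length is bounded by rows_per_screen."""
        start, stop = visible_slice(first_visible_row, rows_per_screen, len(dataset))
        for row_index in range(start, stop):
            render_row(row_index, dataset[row_index])

    # Scrolling a table of 10 million rows draws the same ~40 rows per frame
    # as a table of 100 rows, which is what keeps the experience smooth.
    draw_frame(dataset=list(range(10_000_000)), first_visible_row=5_000_000, rows_per_screen=40)

Because the loop length is bounded by the number of rows that fit on screen, the per-frame cost stays inside the roughly 16.7-ms budget a 60-fps refresh allows, however large the dataset grows.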

To illustrate these strategies, consider what happens when an entire human genome is loaded, a massive amount of data in which some sequences are more than 200,000,000 bases in length. It would be impossible to absorb that level of detail if the entire sequence were drawn at once. Figure 1 is a simplified, abstract view showing only the most important features. As the user zooms in, increasing levels of annotation detail are drawn. This is called progressive disclosure. By abstracting the view and progressively disclosing data, the software constrains the complexity of the visualization.

Figure 1 – Abstract view of a long sequence on the left-hand side, shown in both linear and circular form. As the user zooms in to enlarge the circular view, an increasing level of annotation is disclosed, for example, as shown on the right-hand side. With much more screen real estate, a high level of annotation detail can be shown on the diagram.
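A minimal Python sketch of the progressive disclosure just described follows; the zoom thresholds and level names are invented for illustration and are not taken from any specific application.

    # Sketch of progressive disclosure: the level of annotation drawn is chosen
    # from the zoom level (bases per pixel), so the amount drawn per screen
    # stays roughly constant. The thresholds below are illustrative assumptions.

    def detail_level(bases_per_pixel):
        """Map zoom (bases per pixel) to the kind of detail worth drawing."""
        if bases_per_pixel > 10_000:
            return "abstract outline"     # whole-sequence view, major features only
        if bases_per_pixel > 100:
            return "feature blocks"       # annotation blocks and labels
        if bases_per_pixel > 1:
            return "dense annotations"    # all annotation tracks
        return "individual residues"      # the actual sequence letters

    def describe_view(view_width_px, visible_bases):
        level = detail_level(visible_bases / view_width_px)
        print(f"{visible_bases:,} bases across {view_width_px}px -> draw {level}")

    describe_view(view_width_px=1200, visible_bases=200_000_000)  # abstract overview
    describe_view(view_width_px=1200, visible_bases=600)          # zoomed to residues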

Taking the example further, the actual sequence itself is shown only when the user zooms in to a small region, in contrast to the abstract view of Figure 1. An even closer, hyperbolic view shows more detail in the center of the visible portion of the sequence than at the edges (Figure 2). Importantly, all of these views show the same amount of information, so they have the same visualization complexity, and each can be animated smoothly irrespective of the size of the underlying data. In large datasets, the other key to smooth rendering is to structure the data so that the software can instantly retrieve the portion it needs to start drawing the screen, without having to walk through the entire dataset. In this way, the user can step through the data and then manipulate individual scientific objects, all with the software rendering the views at >60 fps.

Figure 2 – Transition from an abstract view of a large part of a long sequence (top) to the individual residues in the sequence (middle) once the user has zoomed in to a region narrow enough for that level of detail to be understandable. The hyperbolic view (bottom) provides even greater focus on the individual residues at the center of the field of view.
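The "instantly retrieve the portion it needs" point can be illustrated with a small Python sketch: if annotation features are stored sorted by position, the slice covering the current view is found by binary search rather than by scanning the whole dataset. The data layout and names below are assumptions made for illustration.

    # Sketch of indexed retrieval: with annotation features kept sorted by start
    # position, the visible slice is found in O(log N) with a binary search
    # instead of walking all N features. The layout is an illustrative assumption.

    from bisect import bisect_left, bisect_right

    def features_in_view(starts, features, view_start, view_end):
        """Return only the features whose start position falls inside the view."""
        lo = bisect_left(starts, view_start)
        hi = bisect_right(starts, view_end)
        return features[lo:hi]

    # Toy data: one million features spaced along a 200-Mb sequence.
    starts = list(range(0, 200_000_000, 200))
    features = [f"feature_{i}" for i in range(len(starts))]

    visible = features_in_view(starts, features, view_start=150_000_000, view_end=150_002_000)
    print(len(visible), "features to draw")  # a handful, however large the dataset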

To achieve these performance levels, it becomes necessary to resort to low-level coding. High-level programming languages are designed for the mass market: they perform most tasks well enough on most datasets and allow rapid development. They also typically provide rich environments with prebuilt components for rendering graphical objects, such as table and graphing controls, which save programmers from having to “reinvent the wheel” every time. However, these controls are not optimized for huge data volumes and extreme calculation, so using them in scientific visualization applications would inevitably produce a slow, lagging user experience. Ultimately, to achieve the necessary performance, the data models, the table and graphics controls, and the sequence and structure rendering must all be coded from first principles, with careful consideration of everything from hardware design to compiler design to the way memory is managed.
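The effect of memory layout alone can be seen in a small, CPython-specific illustration; the figures it prints are approximate, and the point is the contrast between per-cell objects and a contiguous typed column, not the exact numbers.

    # Why memory layout matters for big tables: a contiguous typed array keeps a
    # numeric column compact and cache friendly, while one boxed object per cell
    # costs several times more memory plus pointer chasing. CPython-specific,
    # approximate figures; shown only to illustrate the principle.

    import sys
    from array import array

    n = 1_000_000
    boxed_column = [float(i) for i in range(n)]   # one Python object per cell
    flat_column = array("d", range(n))            # one contiguous 8-byte slot per cell

    boxed_bytes = sys.getsizeof(boxed_column) + sum(sys.getsizeof(x) for x in boxed_column)
    flat_bytes = sys.getsizeof(flat_column)

    print(f"boxed: ~{boxed_bytes / 1e6:.0f} MB, flat: ~{flat_bytes / 1e6:.0f} MB")

Purpose-built controls apply the same principle, in lower-level code with full control over allocation and layout, to stay responsive on millions of rows.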

A second important characteristic of software for decision-making support is scientific analysis. As with scientific visualization, it is necessary to look at computational complexity and to focus on optimizing computational methods so that they return answers quickly enough to support responsive decision-making. For computation, it is important to consider the difference between the complexity of a problem and the complexity of the algorithm used to solve it. Using biological sequence alignment as an example, as the number and size of the sequences grow, the size of the problem grows at a greater-than-factorial rate (in fact, N·N!), and a brute-force solution rapidly becomes untenable. A useful sequence alignment algorithm must therefore have lower complexity than the problem it addresses. Common literature methods such as Needleman-Wunsch and BLAST achieve this (N² and better than N², respectively). However, there are compromises: Needleman-Wunsch requires large amounts of memory, which limits its scalability, and BLAST is probabilistic rather than deterministic (that is, it may or may not give the best possible solution).
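A textbook Needleman-Wunsch score calculation, sketched below in Python, shows where the quadratic cost comes from: both the running time and the score matrix grow with the product of the two sequence lengths, which is the memory compromise noted above. The scoring values are arbitrary examples.

    # Textbook Needleman-Wunsch global alignment (score only). Time and memory
    # both scale with len(a) x len(b); the full score matrix is the memory cost
    # referred to in the text. Match/mismatch/gap values are arbitrary examples.

    def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]   # the (len(a)+1) x (len(b)+1) matrix
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
        return score[-1][-1]

    print(needleman_wunsch_score("GATTACA", "GCATGCU"))  # tiny example; real inputs are far larger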

This problem has been addressed with algorithms created by Dotmatics (Herts., U.K.) by again coding from first principles with variants that require only a small memory footprint and either:

  1. Run at the same speed as existing methods but are deterministic (always provide the best answer), or
  2. Run up to 100 times faster than BLAST while still providing good answers, as opposed to the best possible answers.

These algorithms have been added to the Dotmatics advanced analysis and visualization application, Vortex. Within Vortex, users can select which of these methods best fits their purpose.

A small memory footprint is important because it allows these methods to solve large alignment problems on a standard business laptop, without requiring that the job be sent to a large server, which typically delays and complicates the process.

The importance of the increased speed is that these algorithms not only solve standard sequence analysis problems much faster, but can also be used to create new analyses that would be impossible at existing alignment speeds. An example of this is sequence clustering on very large datasets: clustering 140K sequences requires 19 billion comparisons and can now be performed in just a few minutes, something that would be impractically slow, or impossible, with standard methods. This improved performance permits the development and use of advanced analyses, such as matched pair analysis, which relates individual differences between pairs of sequences to differences in response variables, such as activity.
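A rough check of the arithmetic helps explain why the speed matters: an all-against-all comparison of N sequences requires on the order of N² pairwise alignments, so

    140,000 × 140,000 ≈ 2 × 10^10 comparisons,

the same order of magnitude as the 19 billion comparisons quoted above, which indicates why acceleration of the underlying alignment step is essential at this scale.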

Scientific data operations, such as data compression and data comparison capabilities, are also critical for decision-making support. Dotmatics user Xenon Pharmaceuticals Inc. (Burnaby, Canada) offers a good example: its informaticians were able to compress large amounts of in-vitro and in-vivo data into easily digestible visuals for their science teams, and then applied algorithms to provide the context needed to compare the data for effective decision-making. In one example, by reducing a large SAR table to a single radar plot and overlaying benchmark compounds with assay results for test compounds, a scientist can learn what “good” SAR shapes look like and rapidly examine a large set of compounds by SAR shape (Figure 3). At the same time, additional views provide context, including a compound’s exposure profile and its place in ligand efficiency and physical property space, all of which influence compound selection. In another example, context is added to sparse in-vivo results to allow direct comparisons between compounds before all data points are filled in.

Figure 3 – Radar plot allowing rapid comparison of the profile of a molecule of interest (green) across 15 different assays versus a benchmark compound that exhibits a desirable SAR shape (blue). Visual inspection of the green versus blue shapes allows a rapid determination of the quality of the green compound.
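For readers who want to experiment with this kind of view outside any particular platform, the following Python/matplotlib sketch overlays a test compound on a benchmark across a panel of assays, in the style of Figure 3; the assay names and values are invented purely for illustration.

    # Minimal matplotlib sketch of a radar ("spider") comparison in the style of
    # Figure 3: a test compound (green) overlaid on a benchmark (blue) across a
    # panel of assays. Assay names and values are invented for illustration only.

    import numpy as np
    import matplotlib.pyplot as plt

    assays = [f"Assay {i + 1}" for i in range(15)]
    benchmark = np.random.default_rng(0).uniform(0.4, 0.9, len(assays))
    test_compound = np.random.default_rng(1).uniform(0.2, 1.0, len(assays))

    angles = np.linspace(0, 2 * np.pi, len(assays), endpoint=False)
    angles = np.concatenate([angles, angles[:1]])  # close the polygon

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    for values, color, label in [(benchmark, "tab:blue", "benchmark"),
                                 (test_compound, "tab:green", "test compound")]:
        closed = np.concatenate([values, values[:1]])
        ax.plot(angles, closed, color=color, label=label)
        ax.fill(angles, closed, color=color, alpha=0.15)

    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(assays, fontsize=7)
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
    plt.show()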

Finally, in today’s labs, specialized cheminformaticians and bioinformaticians can be rare and expensive resources. Applications must be able to ensure the success of scientists who are not experts; best practices must therefore be developed and communicated to all scientists. There are two ways to accomplish this. The first is for the software itself to make the smart choices. In the case of sequence alignment, for example, Vortex includes the industry-standard methods as well as the faster proprietary methods described above. An expert bioinformatician will know which methods suit which volumes and types of data, but what is the nonexpert supposed to make of the choices? Typical software may provide a default for those who do not have the knowledge to make the decision; however, there is no single good default. Instead, Vortex analyzes the size and type of the data and suggests the most appropriate algorithm for that particular dataset, guiding the user toward the most appropriate result.
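The "suggest the right algorithm" idea can be sketched as a simple rule of thumb; the thresholds and method descriptions below are hypothetical and do not describe Vortex's actual selection logic.

    # Hypothetical sketch of data-driven method suggestion: inspect the size of
    # the problem and whether the best possible answer is required, then
    # recommend a method. Thresholds and descriptions are invented for
    # illustration and are not Vortex's actual logic.

    def suggest_alignment_method(num_sequences, avg_length, need_best_answer):
        if num_sequences * avg_length < 1_000_000:
            return "exact dynamic programming (small data; best answer is affordable)"
        if need_best_answer:
            return "deterministic fast variant (best answer at conventional speed)"
        return "heuristic fast variant (good answers, up to ~100x faster)"

    print(suggest_alignment_method(num_sequences=140_000, avg_length=300, need_best_answer=False))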

A second way to ensure the success of nonexperts is to allow a small community of expert informaticians to encode their best practices for wider distribution. In Vortex, this can be done by constructing a “helper workflow” that walks a nonexpert through an analysis (Figure 4). An expert informatician would know how to load data into a table and then select and draw a histogram to understand a property of that data. The experts can construct a helper workflow to show the nonexpert how to create the table, select the variable to graph, bin the data, select the chart type, and even select a color. While this is a simple example, more complex analyses, such as the sequence matched pair analysis described above, become far more accessible to the wider audience with the aid of a helper workflow.

Figure 4 – A simple example helper workflow. The guide text and embedded controls on the left-hand side of the screen guide the novice user through the creation and configuration of the histogram on the right-hand side of the screen.
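As a rough illustration of what a helper workflow encodes, the following Python sketch walks a user through the same choices described above (load a table, pick a column, bin the data, choose a chart type and color); the prompts and defaults are invented, and a real workflow would drive the application's own controls rather than a command line.

    # Hypothetical command-line sketch of a helper workflow: a fixed sequence of
    # prompts with sensible defaults that walks a nonexpert through configuring
    # a histogram. Steps and defaults are invented for illustration only.

    WORKFLOW = [
        ("file_path", "Path to the data table to load"),
        ("column", "Which column should be plotted?"),
        ("bins", "How many bins? (press Enter for 20)"),
        ("chart", "Chart type? (press Enter for histogram)"),
        ("color", "Bar color? (press Enter for steelblue)"),
    ]
    DEFAULTS = {"bins": "20", "chart": "histogram", "color": "steelblue"}

    def run_helper_workflow():
        choices = {}
        for key, prompt in WORKFLOW:
            answer = input(f"{prompt}: ").strip()
            choices[key] = answer or DEFAULTS.get(key, "")
        print("Configured chart:", choices)  # a real workflow would now render it
        return choices

    if __name__ == "__main__":
        run_helper_workflow()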

While automation, artificial intelligence, and machine learning hold great potential to speed drug development, researchers will always be an organization’s greatest asset. Life sciences companies must now work to enhance researchers’ decision-making capabilities with software that excels at scientific visualization, scientific analysis, and scientific data operations while effectively communicating best practices.

Robert D. Brown, Ph.D., is vice president of Product Marketing, and Tom Oldfield, DPhil, is principal software architect, Dotmatics Ltd., The Old Monastery, Windhill Bishops, Stortford, Herts., CM23 2ND, U.K.; www.dotmatics.com. The authors would like to thank Steven Wesolowski, Ph.D., at Xenon Pharmaceuticals Inc. (Burnaby, Canada), for his help in preparing this article.
