Data navigation, etc.

3. Data navigation, data mining, and visualization of multidimensional databases including visualization methods for categorical data as well as text, hypertext and multimedia data bases

Marc Abrams and Cliff Shaffer

The fundamental mathematical structure of many data spaces is generally multidimensional, and the goal is to fathom the structures through 3D views of that space. The central problem is data navigation: either the data set is too massive to exhaustively visualize, or there are multiple data sets to be compared.

Many scientific disciplines produce and use large, multidimensional databases. One typical example is the General Social Survey (GSS), widely used in the social sciences. Collected nearly every year for about 20 years, the GSS contains hundreds of individual questions, each administered to hundreds of individuals, regarding preferences and demographic statistics. Another example is the US Census, issued as hundreds of statistics on each census tract. Large-scale computations, such as the multidisciplinary design optimization of an aircraft or weather forecasting, produce far too much data to visualize exhaustively.

Collections of text, multimedia, and hypermedia also require high-dimensional representations (e.g., vector spaces, feature spaces). Indeed, thousands of dimensions are the norm, and even powerful methods like Latent Semantic Indexing reduce the number only to hundreds of dimensions. Here the added challenge is that many dimensions are interconnected or related in complex ways, through links, synchronization, and content-based associations.

Most scientific visualization software today is designed to visualize numerical data, that is, data whose domain is a totally ordered set such as the integers or the real numbers. Yet many databases, such as the GSS or the US Census, include non-numeric, or categorical, data: a partial order may exist among the domain values, but a total ordering does not. Genome sequences and individuals in a demographic study are categorical data. So are the values in many computer and network traces, such as traces recording which program module is currently executing, used in the design of software for parallel computers. The trouble with categorical data is that there is no obvious way to visualize them, in contrast to numerical data, whose total ordering provides an obvious mapping to two or more Cartesian dimensions. A particular problem is visualizing categorical time series. The body of time series methods for numerical data, such as Fourier transforms, correlation graphs, and statistical methods such as ARIMA, is defined in terms of arithmetic operations, which are meaningless for categorical data.
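
One way to summarize a categorical time series without performing arithmetic on its values is to count how often each symbol follows another. The sketch below is only an illustration of that idea, not a method described in this proposal; the function name and the sample trace of program modules are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(trace):
    """Estimate first-order transition probabilities of a categorical sequence.

    Only equality of symbols is used; the symbols themselves are never
    added, averaged, or ordered.
    """
    counts = defaultdict(Counter)
    # Count each (current symbol, next symbol) pair in the sequence.
    for current, following in zip(trace, trace[1:]):
        counts[current][following] += 1
    probs = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        probs[state] = {nxt: n / total for nxt, n in successors.items()}
    return probs

# Hypothetical trace of which program module is executing at each sample.
trace = ["parse", "compute", "compute", "io", "compute", "io", "parse"]
model = transition_probabilities(trace)
```

The resulting table can be drawn as a directed graph or a matrix, which gives a view of the series that does not depend on any ordering of the categories.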

Just as a statistician would never make an inference based on a sample size of one, visualization systems will ultimately have to handle multiple data sets, representing multiple simulation runs, or multiple trace files, or multiple observations of a system. Doing so requires handling massive amounts of data, thereby exacerbating the scale problem in visualization tools: how to design a tool that can simultaneously analyze an arbitrary number of traces of arbitrary size, but not exceed the limited input bandwidth of a human's senses. There are four fundamental solution methods to allow analysis of multiple data sets, while addressing the scale problem:

Apply parallel processing to visualization to permit real-time navigation through large and multidimensional data sets.
Extend visualization tools to handle multiple data sets, rather than a single data set.
Move the function of interpreting data from human to machine (e.g., devise methods to filter, transform, and reduce the size of data sets; use statistical tests to compare data sets; use data mining methods to automatically discover hidden relationships).
Exploit more of a human's input bandwidth, for example through sonification and virtual reality.

How can one visualize the interplay of interaction types in a large information collection? For example, how can one see how news or announcements in email streams lead to "herd" behavior in accessing home pages on the WWW or segments of course materials in a digital library supporting education?

The following are some approaches to dealing with these problems.

Real time navigation: Visualization today is mostly concerned with representing one data set, and the hardware reflects that fact--one fast workstation or a synchronous multiprocessor with a few processors. True data navigation of large and multidimensional data sets will require massive computational power, which translates to parallel supercomputers with tens to hundreds of processors (like an Intel Paragon or a Cray T3E, rather than a Power Challenge). We propose development and implementation of parallel visualization algorithms, because they are a crucial enabling technology. This is especially important in handling real-time network traffic analysis, where thousands of streams are simultaneously active on our campus.

Extending visualizations: Techniques for visual analysis and browsing of such multidimensional databases are beginning to emerge. Integrated environments that offer multiple views of the data can help users gain new insights into their data. Such views include maps (for spatial distribution), graphs (for range distribution), and spreadsheets (for viewing multiple criteria at a time). Scatter plots and special techniques such as parallel coordinates attempt to let users visually spot correlations and other relevant relationships. Few of these techniques have been extended to 3D visualization environments and newer visualization technologies such as desktop and CAVE VR.
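
To make the parallel coordinates technique concrete: each dimension becomes a vertical axis, and each record becomes a polyline that crosses axis j at the record's value on dimension j. The following minimal sketch computes only the polyline vertices and leaves drawing to any plotting tool. It assumes purely numeric records and rescales each axis to [0, 1]; the function name and sample records are hypothetical.

```python
def parallel_coordinates(records):
    """Map each numeric record to a polyline of (axis_index, scaled_value) points."""
    if not records:
        return []
    # Transpose the records to obtain one column of values per dimension.
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    polylines = []
    for rec in records:
        line = []
        for axis, value in enumerate(rec):
            span = highs[axis] - lows[axis]
            # A constant dimension has no spread; place it mid-axis.
            scaled = (value - lows[axis]) / span if span else 0.5
            line.append((axis, scaled))
        polylines.append(line)
    return polylines

# Hypothetical records with three dimensions each.
records = [(20, 30000, 1), (40, 50000, 3), (60, 40000, 2)]
lines = parallel_coordinates(records)
```

Records whose polylines run roughly parallel between two adjacent axes suggest a positive relationship between those dimensions, while many crossing lines suggest a negative one.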

Machine interpretation: A methodology that helps identify "interesting" things to look at, such as outliers or steep gradients, or that intelligently condenses insignificant data, or that clusters similar data, can help in navigation. Our past work on Chitra, a system for analyzing time series data, analyzes not one data set but a set, or ensemble, of data sets. Chitra commands can manipulate the ensemble as a unit and can statistically analyze, model, and visualize selected members individually or as an ensemble. Chitra provides transforms to simplify the data and models to reduce trace data to a parsimonious, dynamic characterization of system behavior. We propose building on our work with Chitra to incorporate data navigation and techniques from the emerging field of "data mining." The solution used in Chitra is to visualize some traces, test all traces, transform all, and model all. Transforms are used to control state space explosion in the resultant model; examples of transforms include state aggregation by pattern aggregation and state aggregation by filtering in the frequency domain. Among its statistical tests, Chitra allows partitioning of an ensemble. Normally the user will partition the ensemble into mutually exclusive and exhaustive sub-ensembles, each of which is homogeneous, and then visualize one trace in each sub-ensemble, which by definition is "representative." Chitra also can generate a separate model of each sub-ensemble.
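
The partition-then-visualize workflow can be sketched in a few lines. This is not Chitra's implementation or its statistical tests: as a stand-in for a real homogeneity test, the sketch assumes that traces sharing the same most frequent state belong together, and all function names and sample traces are hypothetical.

```python
from collections import Counter, defaultdict

def dominant_state(trace):
    """Return the most frequent symbol; ties go to the symbol seen first."""
    counts = Counter(trace)
    return max(counts, key=lambda s: (counts[s], -trace.index(s)))

def partition_ensemble(ensemble, key=dominant_state):
    """Split named traces into sub-ensembles keyed by key(trace).

    Every trace lands in exactly one group, so the partition is
    mutually exclusive and exhaustive by construction.
    """
    groups = defaultdict(list)
    for name, trace in ensemble.items():
        groups[key(trace)].append(name)
    return dict(groups)

def representatives(partition):
    """Pick one trace per sub-ensemble to visualize (here, simply the first)."""
    return {label: members[0] for label, members in partition.items()}

# Hypothetical ensemble of three categorical traces.
ensemble = {
    "run1": ["io", "io", "cpu"],
    "run2": ["cpu", "cpu", "io"],
    "run3": ["io", "cpu", "io"],
}
partition = partition_ensemble(ensemble)
chosen = representatives(partition)
```

Only the traces named in `chosen` need to be rendered, while tests, transforms, and models can still be applied to every member of each sub-ensemble.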

Exploiting more human input bandwidth: Naturally a significant bottleneck in visualization of multidimensional databases has been the limits of the 2D computer monitor. 3D visualization hardware allows for the development of new visualization techniques that take advantage of 3D stereoscopic displays, or true 3D support through Virtual Reality. We propose studying VR interfaces and visualization techniques for multidimensional databases. Of particular interest are collections of information about network traffic, where sonification approaches might be helpful to add dimensions, allowing understanding of: media type, media size, communication origin and destination, are, topics, etc.