|
Metacombine Multidimensional Visualization Project
|
|
The following is a demonstration of a work in progress. Some links are broken and applets have bugs. Java demos run only under Java 1.4.2 or greater. Upgrade to J2SE v 1.4.2_05 JRE here.
Table of Contents
1 Introduction
A major difficulty in creating and exploring digital libraries is understanding the nature of their content. The questions a user or deployer of a harvest-created DL might ask are "what kind of resources exist in the library?" and "how do the resources relate to each other?" Because of the serial and static nature of written language, text is a difficult medium to express answers to these questions. A text-based representation may require many tedious readings to convey these kinds of relationships. Worse yet, the reader may come away with an incomplete or even wrong idea of a library's content, as text is a "low bandwidth" medium, unable to convey many attributes at once. Using pictures is a much more natural method of expressing this complex information, because the human visual system can more rapidly deconstruct images than read lists. A quick glance at an image tells a viewer what visual elements are present and what the spatial relationships between them are. More succinctly, the maxim "a picture is worth a thousand words" holds true.
The goal of the A3 experiment, or Multi-dimensional Visualization (hereafter, "MDV") is to explore ways of graphically conveying useful information about digital library content to a user without confusing or overwhelming them. It seeks to present a user with an interactive graph of numeric and textual information gleaned from a semantic clustering system. We experiment with several kinds of graphs, each type conveying information about the digital library collection in a different way.
Ultimately, the graphs will provide users with a system for efficient navigation through a document corpus. By graphically moving through a hierarchy of clusters, casual users and collection developers will browse through large collections of documents, understanding the arrangement of records at multiple levels of specificity. It is important that the browse interface be simple and intuitive for researchers with potentially no technical background. Through user feedback and focus groups, the system will continue to be tuned to researchers' needs and expectations.
2 Dimensionality Reduction
In a mathematical sense, semantic clusters are points in extremely high-dimensional space. The dimensions in this space correspond to text features like word stems and noun phrases. Because of the need to represent all the text features in a document corpus, the space often contains over 30,000-50,000 dimensions or more. To display the clusters on the screen, they must be transformed into a low-dimensional representation, as screens are only two-dimensional! In this process some information must be lost. Dimensionality reduction seeks to minimize lost information and preserve the spatial relationships of the data as much as possible. It is an active field of scientific research and there are a wide variety of techniques to try. We have explored the following techniques:
- PCA (Principal Components Analysis) - A classic dimensionality reduction technique based on the singular value decomposition (SVD) of numerical analysis.
- NMF (Nonnegative Matrix Factorization) - A modified form of the cluster-finding algorithm. We compile the centroids of semantic clusters into a matrix X similar to the document matrix used to do the initial clustering (see NMF link). We then perform NMF on X, specifying that the number of clusters equal the number of dimensions we want to visualize. This gives us UV as a factorization of X. We then project X onto U (more precisely Ut*X) to get a matrix P. The columns of P contain the clusters projected into a lower dimensionality.
- Euclidean tree - An interactive dimensionality reduction technique where a graph tree is constructed from a user-specified root node. Rather than presenting a picture of the global structure of the collection, this technique arranges the clusters while only considering their relationship to the root node.
- Isomap - A dimensionality reduction technique that searches for nonlinear manifolds in the data. A graph is constructed by placing edges between all nodes' k-nearest neighbors. Then, a distance matrix D is constructed between nodes by using Dijkstra's shortest-path in the constructed graph to compute distance instead of true linear distance. Finally, Multidimensional Scaling is performed on D to compute the final position of the nodes.
- LLE (Local Linear Embedding) - Similar to isomap. Uses a different edge-selection technique to construct the graph.
- Physical Simulation - a graph layout based on n-body physics.
It is important to remember that the specific placement of a single cluster in the plane by any the above techniques has no exact meaning. The important information is conveyed through the relative placement of the clusters as a group.
3 Visualizations
Once the dimensionality of the clusters has been reduced enough to be placed in the screen, there are various ways to visualize them interactively. Follow each of the links below to interact with the visualization on its own particular page. Each visualization page contains a java applet demo and explanatory information on how to read and understand its content.
- Scatter (Non-Java) - illustrates relationships
among the entire cluster collection
- Radial (Non-Java) - illustrates relationships
among clusters relative to a single user-specified cluster
- Force - similar to scatter, but provides a more flexible layout
4 Research Questions
- Automatically labeling nodes is equivalent to the problem of collection understanding or conveying the underlying theme of the collection. Some work has been done with regards to this problem in the domain of image collections. The paper is called Collection Understanding Through Streaming Collage. Perhaps using the technique of streaming collage with regards to textual information is a possible method for automatic collage generation.
- There are two particularly effective graphs: Scatter and Radial. Scatter is effective in illustrating the global cluster relationships, or how all the different clusters relate to each other at once. Radial is particularly effective in illustrating cluster relationships local to a single cluster, or how all the other clusters relate to a single cluster. Would it be particularly effective to merge the two schemes interactively, letting users switch in and out of the two graph-types? Such a scheme would give the researcher more flexibility in how they desire to percieve library content. Switching contexts from global to local is similar to the research practice of searching for a particular topic and then focusing on that topic.
- What is the best possible result of an optimal visualization of metadata?
- Are there general styles of research activity? Do certain kinds of visualizations lend themselves better to certain styles of research? Can/should certain styles of research be taught with metadata visualization?
- How can search be incorporated into information visualization?
- Would it be useful if the user could "bookmark" parts of space or connect it to other regions thus editing the space?
- What do "paths" through the space mean? Could such paths be saved and many of them combined to provide research "trails" like trails in a forest?
- Would it be useful to see what typical "non-visualized" research looked like mapped on to the visualization space?
- More soon...
5 Software
MDV software consists of two pieces: the dimensionality reduction part and the visualization part. The dimensionality reduction part is a set of perl scripts that glue together several different pieces of dimensionality reduction and XML processing software written in Perl, C, C++, and Java. The visualization software was written entirely in java. Much of this part of the project uses code from prefuse, a java-based information visualization toolkit designed to make development of info-vis projects as rapid as possible.