The group of circles in the area above is a graph representing semantic clusters. You can think of semantic clusters as groups of documents that share common subject matter. The cluster labels detail the subject matter of each cluster.
The position and color of the clusters in the graph let you see the "distance" between a selected cluster and the other clusters. How close two clusters are to each other, and how similar their colors are, reflects the "semantic distance" between them. Semantic distance can be thought of as a measure of how few similar words and references two clusters have in common: clusters that are close together are much more similar in actual content than clusters that are far apart. One can think of the above graph as a picture of the space of subject matter in a collection of documents. It is important to remember that the absolute layout of the clusters in the above graph has no exact meaning. The specific placement or color of a single cluster, taken on its own, is meaningless. The important information is conveyed by the placement of the clusters relative to one another.
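To make the idea of semantic distance concrete, here is a minimal sketch. It assumes word-count vectors and cosine distance, which is one common choice and not necessarily the measure used to build the graph. The names `term_vector` and `cosine_distance` and the sample texts are made up for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Count how often each lowercase word occurs in a text."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for identical word mixes, 1 for no shared words."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty text shares nothing with any other text
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sample texts standing in for three clusters.
doc_a = term_vector("matrix factorization of sparse matrix data")
doc_b = term_vector("sparse matrix factorization methods")
doc_c = term_vector("medieval castle architecture")

near = cosine_distance(doc_a, doc_b)  # many shared words: small distance
far = cosine_distance(doc_a, doc_c)   # no shared words: maximum distance
```

In this sketch the first two texts end up much closer to each other than either is to the third, which is the kind of relationship the graph tries to show.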
We currently supply you with two scatter plots of the clusters. One is determined by NMF (Nonnegative Matrix Factorization) and the other is determined by PCA (Principal Components Analysis). These are dimension reduction schemes that attempt to find the best layout of high-dimensional data (the semantic clusters) in a lower-dimensional space (the screen). Please read the technical explanation below if you're interested in understanding these schemes.
An important feature of the graph is that it is hierarchical. This means that the clustering algorithm is run again on the documents that make up a single cluster, which produces sub-clusters. In the above graph, if a cluster has a plus sign on it, then it contains sub-clusters. Clicking any of these clusters with the middle mouse button takes you into that cluster's sub-space, which resides on a lower level of the hierarchy. Clicking the middle mouse button anywhere on the graph where there is no cluster (i.e. blank space) returns you to the previous level in the cluster hierarchy.
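The navigation described above can be pictured as walking a tree of clusters. The following is only a sketch of that idea: the `Cluster` and `Navigator` classes and the sample labels are hypothetical and are not taken from the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    label: str
    children: list = field(default_factory=list)

    @property
    def has_subclusters(self):
        """True when the cluster would be drawn with a plus sign."""
        return bool(self.children)

class Navigator:
    """Tracks which level of the cluster hierarchy is on screen."""

    def __init__(self, root):
        self.path = [root]  # clusters entered so far, outermost first

    @property
    def visible(self):
        """Clusters shown at the current level."""
        return self.path[-1].children

    def drill_down(self, cluster):
        """Middle click on a cluster: enter it if it has sub-clusters."""
        if cluster in self.visible and cluster.has_subclusters:
            self.path.append(cluster)

    def go_back(self):
        """Middle click on blank space: return to the previous level."""
        if len(self.path) > 1:
            self.path.pop()

# Hypothetical hierarchy: "physics" has sub-clusters, "biology" does not.
physics = Cluster("physics", [Cluster("optics"), Cluster("quantum")])
biology = Cluster("biology")
nav = Navigator(Cluster("all documents", [physics, biology]))

nav.drill_down(physics)
sub_labels = [c.label for c in nav.visible]
nav.go_back()
top_labels = [c.label for c in nav.visible]
```

Keeping the visited clusters in a list makes "return to the previous level" a simple pop, and the top level can never be popped.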
Preliminary: To understand the exact criteria for placing the clusters, you will need to understand what vectors and matrices are.
Internally, each semantic cluster is represented by a vector in a high-dimensional space (over 30,000 dimensions). If we combine all these vectors into a single cluster matrix, we can apply certain techniques to find a low-dimensional (2- or 3-dimensional) representation of the original matrix. The two techniques we have applied to the cluster matrix are Principal Components Analysis and Nonnegative Matrix Factorization. These links give detailed implementation-level information regarding these techniques.
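For readers who prefer code, here is a rough sketch of both reductions on a tiny made-up cluster matrix (4 clusters, 5 dimensions instead of 30,000). It is an illustration under stated assumptions, not the implementation behind the graph: PCA is computed by power iteration with deflation, NMF by the standard multiplicative update rules, and the rows of the NMF factor `w` are assumed to serve as 2-D coordinates. All function names and data are hypothetical.

```python
import math
import random

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def pca_2d(rows, iterations=200, seed=0):
    """Project each row onto the top two principal components."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    x = [[value - mean for value, mean in zip(row, means)] for row in rows]
    axes = []
    for _ in range(2):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            # One power-iteration step: v <- X^T (X v), then normalize.
            scores = [sum(a * b for a, b in zip(row, v)) for row in x]
            w = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        axes.append(scores)
        # Deflate: remove the component just found before looking for the next.
        x = [[a - s * b for a, b in zip(row, v)] for row, s in zip(x, scores)]
    return list(zip(*axes))

def nmf_2d(rows, iterations=500, seed=0, eps=1e-9):
    """Factor rows ~ w * h with nonnegative w (n x 2) and h (2 x d); return rows of w."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    w = [[rng.random() + 0.1 for _ in range(2)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(d)] for _ in range(2)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, rows), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(d)] for i in range(2)]
        ht = transpose(h)
        num, den = matmul(rows, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(2)] for i in range(n)]
    return [tuple(row) for row in w]

# Made-up cluster matrix: rows 0 and 1 are similar, as are rows 2 and 3.
cluster_matrix = [
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 0, 5, 4, 1],
    [0, 1, 4, 5, 0],
]
pca_points = pca_2d(cluster_matrix)
nmf_points = nmf_2d(cluster_matrix)
```

Each function returns one (x, y) pair per cluster. In the PCA layout, the two similar clusters in rows 0 and 1 land closer to each other than to the cluster in row 2. The absolute coordinates depend on the random starting point, which is one reason only relative placement carries meaning.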