The group of circles in the area above is a graph representing semantic clusters. You can think of semantic clusters as groups of documents that share common subject matter. The size of each cluster shows the number of documents it contains relative to the other clusters. The cluster labels describe the subject matter of each cluster.
The position of the clusters in the graph lets you see the "distance" between a selected cluster and the other clusters. The selected cluster is the one with the red border. You can make any cluster the selected cluster by clicking it. How close the other clusters are to the selected cluster is proportional to the "semantic distance" between them. By semantic distance I mean a measure of how much the word content of two documents differs. For example, suppose the following three documents were the sole contents of three semantic clusters respectively: a speech by Daffy Duck, a speech by Bugs Bunny, and a speech by FDR.
If the speech by Daffy Duck were the selected cluster, then the speech by Bugs Bunny would be closer than FDR's speech. This is because the two cartoon speeches likely contain similar words and references to things like Elmer Fudd. In reality, clusters contain many documents, so the features of all their documents are averaged together into a single set of features characterizing the group as a whole.
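The averaging and distance idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the features are taken to be plain word counts, the helper names (`word_counts`, `centroid`, `euclidean`) are hypothetical, and the texts are invented toy strings rather than real speeches.

```python
import math
from collections import Counter

def word_counts(text):
    """Turn a document into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())

def centroid(docs):
    """Average the word counts of several documents into one feature set."""
    total = Counter()
    for doc in docs:
        total.update(word_counts(doc))
    return {word: count / len(docs) for word, count in total.items()}

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

# Invented toy texts standing in for the three clusters (assumption).
daffy = centroid(["you are despicable elmer fudd", "duck season elmer fudd"])
bugs = centroid(["what is up doc elmer fudd", "rabbit season elmer fudd"])
fdr = centroid(["the nation will endure and prosper"])

# The two cartoon clusters share words, so they end up closer together.
print(euclidean(daffy, bugs) < euclidean(daffy, fdr))  # prints True
```

With these toy texts the shared words ("elmer", "fudd", "season") cancel out in the difference, which is why the two cartoon clusters sit closer to each other than either does to the third.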
The color of a cluster is determined by a mathematical technique called PCA (Principal Components Analysis), which is used to find general trends in the data. While the placement of the clusters tells you one thing (the "distance" to the selected cluster at the center), PCA tries to tell you something a little more subtle. Individual colors don't have any exact meaning, but similar colors signal a strong relationship between two clusters. The result is that clusters with similarities have similar colors.
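One way similar clusters could end up with similar colors is sketched below. The page does not say how the PCA output is turned into a color, so the mapping here (the score on the first principal component becomes a hue) is purely an assumption, as are the function names and the small example vectors. The first component is found with power iteration to keep the sketch free of external libraries.

```python
import colorsys
import math

def first_component_scores(vectors, iterations=100):
    """Project each vector onto the first principal component."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    # Start the power iteration from any centered row that is not all zeros.
    direction = next((row[:] for row in centered if any(row)), None)
    if direction is None:
        return [0.0] * n  # all vectors identical: no trend to find
    for _ in range(iterations):
        scores = [sum(r[j] * direction[j] for j in range(dim)) for r in centered]
        direction = [sum(scores[i] * centered[i][j] for i in range(n))
                     for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        direction = [x / norm for x in direction]
    return [sum(r[j] * direction[j] for j in range(dim)) for r in centered]

def colors_from_scores(scores):
    """Map scores to RGB colors by spreading them over part of the hue wheel."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    colors = []
    for s in scores:
        hue = 0.75 * (s - lo) / span  # assumption: hue range red to violet
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors.append((round(r * 255), round(g * 255), round(b * 255)))
    return colors

# Hypothetical cluster vectors: the first two are similar, the third is not.
vectors = [[5.0, 1.0, 0.0], [4.0, 2.0, 0.0], [0.0, 1.0, 6.0]]
scores = first_component_scores(vectors)
colors = colors_from_scores(scores)
print(colors)
```

The sign of a principal component is arbitrary, so the exact colors carry no meaning; only the fact that the first two clusters receive nearby hues does.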
An important feature of the graph is that it is hierarchical. This means that the clustering algorithm is run again on the documents that make up a single cluster, resulting in sub-clusters. In the above graph, a cluster with a plus sign on it contains sub-clusters. Clicking any of these clusters with the middle mouse button takes you to that cluster's sub-space, which resides on a lower level of the hierarchy. Clicking the middle mouse button anywhere on the graph where there isn't a cluster (i.e., blank space) returns you to the previous level in the cluster hierarchy.
Preliminary: To understand the exact criteria for placing the clusters, you will need to know what Euclidean distance means and what a node-and-edge graph and a vector are.
Internally, each semantic cluster is represented by a vector in high-dimensional space (over 30,000 dimensions). First, every semantic cluster becomes a graph node which we add to our graph. Then for every pair of graph nodes we calculate the Euclidean distance between their vectors. If this value is under a certain threshold, then we add to the graph an edge between these two nodes, so that clusters joined by an edge are the semantically close ones. The threshold is controlled by the horizontal slider. This is how the internal node-and-edge graph is constructed.
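The construction of the internal graph can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names and the tiny two-dimensional vectors are hypothetical, and it assumes an edge links two clusters whose distance falls below the threshold, so that directly connected clusters are the semantically close ones.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(vectors, threshold):
    """Build an adjacency map (name -> set of neighbor names).

    vectors maps each cluster name to its feature vector. Every pair of
    clusters is compared once; an edge is added when the pair is closer
    than the threshold (assumption about the comparison direction).
    """
    graph = {name: set() for name in vectors}
    for a, b in combinations(vectors, 2):
        if euclidean(vectors[a], vectors[b]) < threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical 2-D vectors; the real ones have over 30,000 dimensions.
clusters = {
    "A": [0.0, 0.0],
    "B": [1.0, 0.0],
    "C": [2.0, 0.0],
    "D": [9.0, 9.0],
}
graph = build_graph(clusters, threshold=1.5)
print(graph)
```

Here A and C are not directly connected (their distance is 2.0), but both connect to B, and D has no edges at all. Raising the threshold, as the slider would, adds more edges.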
To place this graph on the screen, we pick a node, which we called the "selected cluster" above. Using this node as the root, we construct a tree from the internal graph. Then, on the actual screen, we define a center where we draw the selected cluster. Next, for each level of the tree (the root being level 0), we define a circle of radius r × n, where r is some base radius and n is the tree level, and draw all of that level's nodes on this circle. In other words, a node is placed on the Nth circle out from the center if there are at least N edges between it and the selected cluster.
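The placement step can be sketched like this. Several details are assumptions not stated on this page: the tree is built breadth-first, the nodes on each circle are spaced at equal angles, and clusters that cannot be reached from the selected cluster are simply left out. The function names and the example graph are hypothetical.

```python
import math
from collections import deque

def tree_levels(graph, root):
    """Breadth-first search from the root; returns node -> tree level."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(graph[node]):
            if neighbor not in levels:
                levels[neighbor] = levels[node] + 1
                queue.append(neighbor)
    return levels

def layout(graph, root, center=(0.0, 0.0), base_radius=100.0):
    """Place each reachable node on the circle of radius base_radius * level."""
    by_level = {}
    for node, level in tree_levels(graph, root).items():
        by_level.setdefault(level, []).append(node)
    positions = {}
    for level, nodes in by_level.items():
        radius = base_radius * level  # level 0 (the root) sits at the center
        for i, node in enumerate(nodes):
            angle = 2 * math.pi * i / len(nodes)  # even spacing (assumption)
            positions[node] = (center[0] + radius * math.cos(angle),
                               center[1] + radius * math.sin(angle))
    return positions

# Hypothetical internal graph: A-B and B-C are connected, D is isolated.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
positions = layout(graph, root="A")
print(positions)
```

With A selected, B lands on the first circle and C on the second, because two edges separate C from A. Selecting a different cluster only changes the `root` argument; the internal graph stays the same.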