Two Truths
Phenomenon in Spectral Graph ClusteringCarey E. Priebe, Youngser Park, Joshua T. Vogelstein, John M. Conroy, Vince Lyzinskic, Minh Tang, Avanti Athreya, Joshua Cape, and Eric Bridgeford, “On a ‘Two Truths’ Phenomenon in Spectral Graph Clustering,” Proceedings of National Academy of Science, submitted, 2018.
Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering – clustering the vertices of a graph based on their spectral embedding – is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical results provide new understanding of the problem and solutions, and lead us to a ‘Two Truths’ LSE vs. ASE spectral graph clustering phenomenon convincingly illustrated here via a diffusion MRI connectome data set: the different embedding methods yield different clustering results, with LSE capturing left hemisphere/right hemisphere affinity structure and ASE capturing gray matter/white matter core-periphery structure.
Keywords: Spectral Embedding, Spectral Clustering, Graph, Network, Connectome
Figure 1.1: Figure 1. A ‘Two Truths’ graph (connectome) depicting connectivity structure such that one grouping of the vertices yields affinity structure (e.g. left hemisphere/right hemisphere) and the other grouping yields core-periphery structure (e.g. gray matter/white matter). Top center: the graph with four vertex colors. Top left / top right: LSE groups one way; ASE groups another way. Bottom left: the LSE truth is two densely connected groups, with sparse interconnectivity between them (affinity structure). Bottom right: the ASE truth is one densely connected group, with sparse interconnectivity between it and the other group and sparse interconnectivity within the other group (core-periphery structure). This paper demonstrates the ‘Two Truths’ phenomenon illustrated in this figure - that LSE and ASE find fundamentally different but equally meaningful network structure - via theory, simulation, and real data analysis.
Here we make available the connectome data used in our paper: 114 graphs, and for each graph every vertex has a {Left,Right} label and a {Gray,White} label.
NB: This is not meant to be a finding of neurscientific significance; rather, this is an illustration of our ‘Two Truths’ phenomenon. As such, we consider binarized versions of the largest connected component of the graphs. (The original diffusion MRI connectomes are symmetric, hollow, and weighted.)
The \(m=114\) adjacency matrices on \(n \approx 40,000\) vertices and associated vertex label attributes can be downloaded as an R object. NB: 8GB!
print(load(url("http://www.cis.jhu.edu/~parky/TT/Data/TT-glist114-binary.rda")))
# ~8GB, may take several minutes to load.
# This may fail with "Error: vector memory exhaused (limit reached?)" if there is not enough memory on your computer!
#[1] "glist"
length(glist)
#[1] 114
library(igraph)
summary(glist[[1]])
#IGRAPH b40d5fe UNW- 40813 2224492 -- sub-0025864_ses-1_dwi_DS72784
#+ attr: name (g/c), name (v/c), hemisphere (v/c), tissue (v/c), Y
#| (v/c), weight (e/n)
table(V(glist[[1]])$hemisphere)
# left right
#20412 20401
table(V(glist[[1]])$tissue)
# gray white
#19353 21460
table(V(glist[[1]])$Y)
# LG LW RG RW
# 9664 10748 9689 10712
Figure 2.1: Figure S1. Summary of data: number of vertices per graph by hemisphere/tissue type.
Note that there is one bad graph in the data set: image processing failed for left hemisphere of subject 50 scan 2. (This anomaly is shown in the bar plot, not shown in the box plot.) We have left this graph in, as is.
R
code for the reproducing the experiemntal results presented in the manuscript is available in the demo
folder at github.
require(devtools)
devtools::install_github("youngser/TwoTruth")
# WARNING: This may take a while to install all the required packages!
# Also, depending on the enviroment, many packages need to be reinstalled/updated manually!
library(TwoTruth)