1 Introduction

On Fri, Dec 11, 2015 at 11:53 AM, joshua vogelstein [email protected] wrote:
we will get n=10^6 points, each in d=25 dimensions.
i want to hierarchically cluster them, in a few ways:

  1. recursive k-means on the data, maybe 5 levels
  2. compute approximate k-nearest neighbors, svd in d dimensions, and then #1
  3. maybe some other ways.
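
For reference, a minimal sketch of option 1 (recursive k-means to a fixed depth), written against a generic numeric matrix X; the split factor, depth, and minimum node size below are illustrative defaults, not the settings used in the analysis that follows.

## recursive k-means: split every node with k-means, recurse to 'depth' levels,
## and return one leaf label per row of X
recursive_kmeans <- function(X, k = 5, depth = 5, minnum = 100) {
  labels <- rep(1L, nrow(X))
  for (lev in seq_len(depth)) {
    newlab <- labels
    for (node in unique(labels)) {
      idx <- which(labels == node)
      if (length(idx) < max(minnum, k)) next        # too small to split further
      km <- kmeans(X[idx, , drop = FALSE], centers = k, nstart = 3)
      newlab[idx] <- max(newlab) + km$cluster       # fresh labels for the children
    }
    labels <- newlab
  }
  as.integer(factor(labels))                        # relabel leaves 1..(number of leaves)
}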

2 Data

The columns correspond to 24 channels x 6 features per synapse, ordered like c0f0,c0f1,c0f2,c0f3,c0f4,c0f5,c1f0,c1f1… etc

f0 = integrated brightness
f1 = local brightness
f2 = distance to Center of Mass
f3 = moment of inertia around synapsin maxima
f4, f5 are features that I forget what they are.. would need to ask brad.
i would throw them out, I did so in my kohonen code (which you have, it's in matlab).

and

On Feb 8, 2016, at 2:00 PM, Kristina Micheva [email protected] wrote:

library(data.table)   # fread
library(ggplot2)      # density plots
library(reshape2)     # melt
library(gplots)       # colorpanel, heatmap.2
feat <- fread("~/Dropbox/KDM-SYN-100824B analysis/synapsinR_7thA.tif.Pivots.txt.2011Features.txt",showProgress=FALSE)
dim(feat)
# [1] 1119299     144
channel <- c('Synap','Synap','VGlut1','VGlut1','VGlut2','Vglut3',
              'psd','glur2','nmdar1','nr2b','gad','VGAT',
              'PV','Gephyr','GABAR1','GABABR','CR1','5HT1A',
              'NOS','TH','VACht','Synapo','tubuli','DAPI')
channel.type <- c('ex.pre','ex.pre','ex.pre','ex.pre','ex.pre','in.pre.small',
                  'ex.post','ex.post','ex.post','ex.post','in.pre','in.pre',
                  'in.pre','in.post','in.post','in.post','in.pre.small','other',
                  'ex.post','other','other','ex.post','none','none')
nchannel <- length(channel)
nfeat <- ncol(feat) / nchannel
fchannel <- as.numeric(as.factor(channel.type))
ccol <- rainbow(max(fchannel))[fchannel]

# The columns correspond to 24 channels x 6 features per synapse, ordered like
# c0f0,c0f1,c0f2,c0f3,c0f4,c0f5,c1f0,c1f1... etc
#
#f0 = integrated brightness
#f1 = local brightness
#f2 = distance to Center of Mass
#f3 = moment of inertia around synapsin maxima
#f4,f5 are features that I forget what they are.. would need to ask brad. 
#i would throw them out, I did so in my kohonen code (which you have, its in matlab).
fname <- as.vector(sapply(channel,function(x) paste0(x,1:6)))
names(feat) <- fname
fcol <- rep(ccol, each=6)
mycol <- colorpanel(100,"blue","grey","red")

# ignore f4,f5: column indices of the 5th and 6th feature in each channel block of 6
f4 <- seq(5,ncol(feat),by=6)
f5 <- seq(6,ncol(feat),by=6)
#feat <- as.matrix(feat)[,-c(f4,f5)] 
feat <- subset(feat, select=-c(f4,f5))
nfeat <- ncol(feat) / nchannel
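
As a quick sanity check (a sketch, not part of the original chunk), the remaining columns should come in channel-major blocks of nfeat = 4 features each:

stopifnot(ncol(feat) == nchannel * nfeat)   # 24 channels x 4 remaining features = 96
head(names(feat), 8)
# expect the first channel block repeated: "Synap1" ... "Synap4" "Synap1" ... "Synap4"
# (the first two channels are both named 'Synap', so their column names repeat)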

3 Using “integrated brightness” features (f0) only

On Jan 19, 2016, at 11:54 AM, jovo Vogelstein [email protected] wrote:
the channels have an arbitrary independent linear transform,
so doing pca on the raw data is fairly meaningless.
also, the 6 features are fundamentally different, they are even different units,
so it is difficult to understand linear combinations across features.
converting to z-scores, or quantiles, is required to make sense of the resulting data.
so, i’d start with: just look at the first 24 features, which are the “integrated brightness” for each of the channels, and linearly transform to z-scores.
and then just do iterated k-means.
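
The z-score route is what the code below takes; for completeness, a minimal sketch of the quantile alternative mentioned above, applied column-wise to a generic numeric matrix m:

# map each column to its empirical quantiles in (0, 1]; ties get their average rank
to_quantiles <- function(m) apply(m, 2, function(x) rank(x, ties.method = "average") / length(x))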

# f0 ("integrated brightness"): the first column of each channel block
f0 <- seq(1,ncol(feat),by=nfeat)
feat3 <- as.matrix(feat)[,f0]
zfeat3 <- scale(feat3,center=TRUE,scale=TRUE)   # z-score each channel

#mycol2 <- matlab.like(nchannel)
df <- melt(feat3)
names(df) <- c("ind","channel","value")
ggplot(df, aes(x=value)) + 
    scale_color_manual(values=ccol) +
    scale_x_log10(limits=c(1e+03,1e+06), breaks=c(1e+03,5e+03,1e+04,5e+04,1e+05,5e+05,1e+06))+
    geom_density(aes(group=channel, colour=channel))
Figure 1: Kernel densities of the integrated-brightness (f0) values for each channel (log scale).

pr2 <- prcomp(zfeat3)
(elb <- getElbows(pr2$x,3,plot=TRUE))
Figure 2: Scree plot with three potential ‘elbows’ marked by red dots. We use the third elbow as \(\hat{d}\).
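
getElbows() is not a base R function; the assumption here is that it implements something like the Zhu and Ghodsi (2006) profile-likelihood criterion, applied to the ordered column standard deviations of the PCA scores. A minimal sketch of that criterion for a single elbow:

# profile-likelihood elbow: for each split point q, model the ordered values
# d[1:q] and d[(q+1):p] as two normals with a common variance and pick the q
# that maximizes the total log-likelihood
profile_elbow <- function(d) {
  d <- sort(d, decreasing = TRUE)
  p <- length(d)
  ll <- sapply(1:(p - 1), function(q) {
    mu1 <- mean(d[1:q]); mu2 <- mean(d[(q + 1):p])
    s2  <- (sum((d[1:q] - mu1)^2) + sum((d[(q + 1):p] - mu2)^2)) / (p - 2)
    sum(dnorm(d[1:q], mu1, sqrt(s2), log = TRUE)) +
      sum(dnorm(d[(q + 1):p], mu2, sqrt(s2), log = TRUE))
  })
  which.max(ll)
}
# e.g. profile_elbow(apply(pr2$x, 2, sd))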

# [1]  3 18 21
X <- pr2$x[,1:elb[3]]
out <- doIDT(as.matrix(X),
             FUN="pamk",
             Dmax=ncol(X), ## max dim for clustering
             Kmax=9,  ## max K for clustering 
             maxsamp=nrow(X), ## max n for clustering
             samp=1, # 1: no sampling, else n/sample sampling
             maxdepth=5, # termination, maximum depth for idt
             minnum=100, # termination, minimum observations per branch
             verbose=FALSE)  
# number of leaves (clusters) =  194
idtlab <- out$class
plot(X, pch=".",col=idtlab)
Figure 3: Scatter plot of the IDT clustering; points are colored by leaf cluster.

feat4 <- aggregate(zfeat3, by=list(lab=idtlab), FUN=mean)
feat4 <- feat4[,-1]
#image(Matrix(as.matrix(feat4)),lwd=0, aspect="fill")

# cluster rows and columns
heatmap.2(as.matrix(feat4), trace="none", col=mycol, cexRow=0.5, keysize=1, colCol=ccol,srtCol=90)
Figure 4: Heatmap of the cluster means vs channels. Rows and columns are rearranged according to hclust.

3.1 Level 1

# cut the tree at level 1: nodes at depth 2 are the children of the root,
# i.e. the level-1 clusters
idtall <- out$idtall
ppp <- sapply(idtall, function(x) x$depth==2)
qqq <- which(ppp)
tree1 <- idtall[qqq]
(n1 <- sapply(tree1, function(x) length(x$ids)))
# [1] 709107 277157 126907   6128
lab1 <- rep(1:length(n1), times=n1)
Figure 5: Average silhouette width plot of the level 1 clustering.

Figure 6: Scatter plot of first level from the IDT clustering.

feat5 <- aggregate(zfeat3, by=list(lab=lab1), FUN=mean)
feat5 <- feat5[,-1]
rownames(feat5) <- paste0("C",1:length(n1),", ",n1)
#image(Matrix(as.matrix(feat4)),lwd=0, aspect="fill")

# cluster rows and columns
heatmap.2(as.matrix(feat5), trace="none", col=mycol, keysize=1, colCol=ccol,srtCol=90,cexRow=0.7)
Figure 7: Heatmap of the first level cluster means vs channels. Rows and columns are rearranged according to hclust.

Figure 8: Pairwise correlation of the features.

Figure 9: The same pairwise correlations, reordered by synapse type.

4 Using “integrated brightness” features (f0) only and without “other”={5HT1A,TH,VACht} and “none”={tubuli,DAPI} channels

# [1] "5HT1A"  "TH"     "VACht"  "tubuli" "DAPI"
Figure 10: Scree plot with three potential ‘elbows’ marked by red dots. We use the third elbow as \(\hat{d}\).

# [1]  3 14 16
out <- doIDT(as.matrix(X),
             FUN="pamk",
             Dmax=ncol(X),    # max dim for clustering
             Kmax=9,          # max K for clustering 
             maxsamp=nrow(X), # max n for clustering
             samp=1,          # 1: no sampling, else n/sample sampling
             maxdepth=5,      # termination, maximum depth for idt
             minnum=100,      # termination, minimum observations per branch
             verbose=FALSE)  
# number of leaves (clusters) =  74
idtlab <- out$class
plot(X, pch=".",col=idtlab)
Figure 11: Scatter plot of the IDT clustering; points are colored by leaf cluster.

Figure 12: Heatmap of the cluster means vs channels. Rows and columns are rearranged according to hclust.

4.1 Level 1

# [1] 790708 150883 172509   5199
Figure 13: Average silhouette width plot of the level 1 clustering.

Figure 14: Scatter plot of first level from the IDT clustering.

Figure 15: Scaled heatmap of the first level cluster means vs channels. Rows and columns are rearranged according to hclust.

Figure 16: Pairwise correlation of the features.

Figure 17: The same pairwise correlations, reordered by synapse type.

4.2 2-means on “scaled” (= “z-transformed”) data

On Feb 10, 2016, at 6:47 AM, jovo Vogelstein [email protected] wrote:

i wonder if there are simply too many non-synapses, and their variance is big, so they are swamping the signal.

  1. i’m scared of using ZG, pamk, etc. can we just do 2-means on the “raw” data, where “raw” means z-transformed and only using the 5 groups you did, but no embedding. and then plot the scaled means plot and correlations?
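
A minimal sketch of that step (2-means directly on the z-scored f0 features of the five ex/in channel groups, with no embedding); nstart is an illustrative choice:

keep <- channel.type %in% c("ex.pre", "ex.post", "in.pre", "in.post", "in.pre.small")
zraw <- scale(as.matrix(feat)[, f0][, keep], center = TRUE, scale = TRUE)
km2  <- kmeans(zraw, centers = 2, nstart = 10)   # 2-means on the "raw" (z-scored) data
table(km2$cluster)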
Figure 18: Scaled heatmap of the 2-means cluster means vs channels. Rows and columns are rearranged according to hclust.

Figure 19: Pairwise correlation of the features.

Figure 20: The same pairwise correlations, reordered by synapse type.

4.3 Clustering based on feature group

We compute the ARI between the 2-means clustering based on feature group \(i\) and the 2-means clustering based on feature group \(j\), since we are thinking of feature group \(i\) as capturing, say, the excitatory synapses and feature group \(j\) the inhibitory ones.

ARI of 2-means based on feature group:

i        j              ari(2means_i, 2means_j)
ex.post  ex.pre          0.15
ex.post  in.post         0.09
ex.post  in.pre         -0.04
ex.post  in.pre.small    0.01
ex.pre   in.post         0.10
ex.pre   in.pre         -0.05
ex.pre   in.pre.small    0.00
in.post  in.pre          0.05
in.post  in.pre.small    0.02
in.pre   in.pre.small    0.04
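
A minimal sketch of how a table like this could be produced, assuming zfeat3 is the z-scored f0 matrix from the previous section (columns are the five kept channel types, in the original channel order) and using mclust::adjustedRandIndex for the ARI; nstart is illustrative:

library(mclust)                                   # adjustedRandIndex

groups <- c("ex.post", "ex.pre", "in.post", "in.pre", "in.pre.small")
ctype  <- channel.type[channel.type %in% groups]  # channel type of each column of zfeat3

# 2-means separately on the columns belonging to each feature (channel-type) group
lab <- lapply(groups, function(g)
  kmeans(zfeat3[, ctype == g, drop = FALSE], centers = 2, nstart = 10)$cluster)
names(lab) <- groups

# pairwise ARI between the 2-means labelings
pairs <- combn(groups, 2)
data.frame(i   = pairs[1, ],
           j   = pairs[2, ],
           ari = apply(pairs, 2, function(p) adjustedRandIndex(lab[[p[1]]], lab[[p[2]]])))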