Fri Sep 27 11:21:11 2013
The long term goal is to discover structure that is meaningful in terms of treatment outcomes.
Adelstein JS, Shehzad Z, Mennes M, DeYoung CG, Zuo X-N, et al. (2011) Personality Is Reflected in the Brain's Intrinsic Functional Architecture. PLoS ONE 6(11): e27633. doi:10.1371/journal.pone.0027633)
C-data (time_series_cc200_20130402): C is for connectome, which just refers, in this case, to the fMRI data. The 197 regions are obtained from the 'all voxel' data via parsing voxels into regions. We already know the function relating the 197 to the all voxels case, because we wrote it. It is in the form of \((region=197)\times (time=194)\times (subj=42)\). Unfortunately, the time-series across subjects are not aligned relative to one another.
P-data (NEO2012.csv): P is short for phenotype, in this case, it is the 5 factor model (\((subj=42)\times (factor=5)\)), which are the 5 factors: ENOAC (sometimes ordered differently and called OCEAN).
NEO2012.csv has 42 subjects and the following columns: ParticipantID,Sex,Age,E,N,O,A,C the last 5 are the “5 factors”
NEO2012_linkage.csv has 42 subjects and 2 columns: the first is the ParticipantID, which is listed in the other file, the second is the file name. so it maps participant id to file, which we need.
ScaledScales.xls: this is the file that had the original data that i extracted NEO2012.csv from. you shouldn't need it, but it is the raw data, so i didn't want to discard it.
dname <- "http://www.cis.jhu.edu/~parky/CGP/Data/time_series_cc200_20130402/"
urlfiles <- readHTMLTable(dname,skip.rows=1:2)[[1]]$V2
urlfiles <- urlfiles[!is.na(urlfiles)]
files <- paste0(dname,urlfiles)
(n <- length(files)) # 42
require(flexmix)
coh <- ts <- spect <- spect_norm <- NULL
for (i in 1:length(files)) {
ts[[i]] <- readMat(files[i])[[1]] # 194 x 197
spect[[i]] <- spectrum(t(ts[[i]]))$spec[6:40,] # 100 (freq) x 194 => (bandpass) 35 x 194
spect_norm[[i]] <- scale(spect[[i]],center=FALSE,scale=colSums(spect[[i]])) # normalization
coh[[i]] <- KLdiv(spect_norm[[i]]) # Delta: 194 x 194
}
dissmat <- matrix(0,n,n)
for (i in 1:(n-1)) {
for (j in (i+1):n) {
dissmat[i,j] <- helldist(coh[[i]],coh[[j]]) # hellinger distance
}
}
dissmat <- dissmat + t(dissmat)
diag(dissmat) <- 0
distC <- normalize(dissmat)
# [1] 42 42
n <- nrow(distC)
mdsC <- cmdscale(distC,n-1)
elbowC <- getElbows(mdsC,3)
mcC <- Mclust(mdsC[,1:elbowC[2]])
plot(mcC,"BIC")
require(gdata)
fname <- "http://www.cis.jhu.edu/~parky/CGP/Data/NEO2012.csv"
neo <- read.csv(fname)
pdat <- neo[,4:8]
distP <- as.matrix(dist(pdat))
plotmat(as.matrix(pdat),"subject","factor")
distP <- normalize(distP)
dim(distP)
plotmat(distP,"subject","subject")
hist(distP)
mdsP <- cmdscale(distP,n-1)
elbowP <- getElbows(mdsP,3)
mcP <- Mclust(mdsP[,1:elbowP[3]])
plot(mcP,"BIC")
#distC <- distP
#set.seed(123)
#distC <- jitter(distP,17000)
#plotmat(distC,main="distC = jitter(distP,100)")
#plotmat(distP,main="distP")
#plotmat(abs(distC-distP),main="abs(distC-distP)")
## form omnibus matrix
meanD <- (distC+distP)/2
omni <- rbind(cbind(distC,meanD),cbind(t(meanD),distP))
dim(omni)
# [1] 84 84
## embed
n2 <- n*2
mds <- cmdscale(omni,n2-1)
elbow <- getElbows(mds,3)
knee <- elbow[1]
dat <- mds[,1:knee]
the null (ie, truly matched subjects) and alternative holdout pdf estimates of the oos distances!
set.seed(12345)
nmc <- 1000
out <- foreach(mc=1:nmc, .combine="rbind") %dopar% {
ij1 <- sample(1:n,2,replace=FALSE) # i1, j1
ij2 <- ij1+n # i2, j2
ij <- c(ij1,ij2)
## "joint" oos
Y <- fast.oos2(omni[-ij,-ij], knee, omni[ij,-ij])
DY <- as.matrix(dist(Y$X.oos))
D0 <- c(DY[1,3],DY[2,4]) # d(i1,i2), d(j1,j2)
DA <- c(DY[1,4],DY[2,3]) # d(i1,j2), d(j1,i2)
c(D0,DA)
}
## omni
meanD <- (distC + distP)/2
omni <- rbind(cbind(distC, meanD), cbind(t(meanD), distP))
## smacof W
w <- 0.05
Wdiag <- matrix(w, n, n)
Woff <- matrix(0, n, n)
diag(Woff) <- 1 - w
Weight <- rbind(cbind(Wdiag, Woff), cbind(t(Woff), Wdiag))
suppressMessages(require(jhusmacof,quietly=TRUE))
## smacof embedding of all 2*n points
n2 <- n*2
knee <- 2
set.seed(12345)
smacof <- smacofSym(omni, knee, weightmat = Weight)$conf
lab <- factor(rep(c("C","P"),each=n))
plot(smacof[,1],smacof[,2],col=as.numeric(lab),pch=as.numeric(lab),main="C vs P via smacof embedding")
segments(smacof[1:n,1],smacof[1:n,2],smacof[(n+1):n2,1],smacof[(n+1):n2,2],col="orange")
legend("topright", c("C","P"),col=1:2,pch=1:2)
for each left-out (C,P) in turn
K <- 1 # 1-NN
Pjhat <- matrix(0,n,knee)
dp <- dpmean <- rep(0,n)
for(i in 1:n) {
j <- i+n # matching P
ij <- c(i,j) ## (Ci,Pj) pair
## in sample without (Ci,Pj) pair
Dij <- omni[-ij,-ij]
Wij <- Weight[-ij,-ij]
Ci <- smacof[i,]
Xij <- smacof[-ij,]
X <- smacofSym(Dij, knee, weightmat = Wij, init=Xij)$conf # 82 x 5
Chat <- X[1:(n-1),] # in-sampled of C \ Ci
Phat <- X[-c(1:(n-1)),] # in-sampled of P \ Pj
## oos Ci only
Di <- omni[-j,-j]
Wi <- matrix(0, nrow(omni), ncol = ncol(omni))
Wi[i,1:n] <- 1
Wi[1:n,i] <- 1
Wi <- Wi[-j,-j]
within <- rep(1,n2)
within[ij] <- 0
Cihat <- oosIM(Di, X, W=Wi, init=Ci, isWithin=within[-j]) # oos'ed of Ci
## estimate Pj
dChat <- outer(rowSums(Cihat^2),rowSums(Chat^2),"+") - (2*Cihat %*% t(Chat))
dChat <- sqrt((dChat+abs(dChat))/2) # dist from Cihat to the rest of Chat
ind <- order(dChat) # index of nearest neighbors, 1st one is 1NN, etc.
Pjhat[i,] <- colMeans(Phat[ind[1:K],,drop=FALSE]) # estimated Pj from Phat
## compare d(P,Pjhat) and d(P,mean(Phat))
dp[i] <- sqrt(sum((Pjhat[i,]-smacof[j,])^2))
dpmean[i] <- sqrt(sum((smacof[j,]-colMeans(Phat))^2))
}
Let \(dP\) be the distance in the original \(P\)-space.
Let \(dC\) be the distance in the original \(C\)-space.
Let \(dJOFC\) be the distance in the derived \(JOFC\)-space.
Consider the \(j^{th}\) subject,
then we care about \(dP(\hat{P}_j,P_j)\) as our figure of merit.
We observe \(C_j\), as well as all the other \((C,P)\) pairs.
\(\hat{P}_j = P_{\arg\min_{j' \neq j} d(C_{j},C_{j'})}\) where the distance \(d\) used to determine the nearest neighbor is \(dC\) for Method 1 and \(dJOFC\) for Method 2.
Let's look at which \(P\)'s are close, in all three cases,
that is,
for each \(j\),
If \(dist_2\) looks more like \(dist_0\) than \(dist_1\) looks like \(dist_0\), then JOFC wins!
Now, let's do Wilcoxon tests using \(dist_1\) and \(dist_2\).
Prediction of Phenotype based on Connectome is better via jofc than via raw!