Prediction of Phenotype based on Connectome

YP, MT, JV, CEP, …

Fri Sep 27 11:21:11 2013

Goal

The long term goal is to discover structure that is meaningful in terms of treatment outcomes.

Data

Adelstein JS, Shehzad Z, Mennes M, DeYoung CG, Zuo X-N, et al. (2011) Personality Is Reflected in the Brain's Intrinsic Functional Architecture. PLoS ONE 6(11): e27633. doi:10.1371/journal.pone.0027633)

C-data (time_series_cc200_20130402): C is for connectome, which just refers, in this case, to the fMRI data. The 197 regions are obtained from the 'all voxel' data via parsing voxels into regions. We already know the function relating the 197 to the all voxels case, because we wrote it. It is in the form of \((region=197)\times (time=194)\times (subj=42)\). Unfortunately, the time-series across subjects are not aligned relative to one another.
P-data (NEO2012.csv): P is short for phenotype, in this case, it is the 5 factor model (\((subj=42)\times (factor=5)\)), which are the 5 factors: ENOAC (sometimes ordered differently and called OCEAN).

Supplements

NEO2012.csv has 42 subjects and the following columns: ParticipantID,Sex,Age,E,N,O,A,C the last 5 are the “5 factors”
NEO2012_linkage.csv has 42 subjects and 2 columns: the first is the ParticipantID, which is listed in the other file, the second is the file name. so it maps participant id to file, which we need.
ScaledScales.xls: this is the file that had the original data that i extracted NEO2012.csv from. you shouldn't need it, but it is the raw data, so i didn't want to discard it.

C Data

    dname <- "http://www.cis.jhu.edu/~parky/CGP/Data/time_series_cc200_20130402/"
    urlfiles <- readHTMLTable(dname,skip.rows=1:2)[[1]]$V2
    urlfiles <- urlfiles[!is.na(urlfiles)]
    files <- paste0(dname,urlfiles)
    (n <- length(files)) # 42

    require(flexmix)
    coh <- ts <- spect <- spect_norm <- NULL
    for (i in 1:length(files)) {
        ts[[i]] <- readMat(files[i])[[1]] # 194 x 197
        spect[[i]] <- spectrum(t(ts[[i]]))$spec[6:40,] # 100 (freq) x 194 => (bandpass) 35 x 194
        spect_norm[[i]] <- scale(spect[[i]],center=FALSE,scale=colSums(spect[[i]])) # normalization
        coh[[i]] <- KLdiv(spect_norm[[i]]) # Delta: 194 x 194
    }

plot of chunk dissmatata

run spectrum (FFT) on ts for each region

plot of chunk unnamed-chunk-2

bandpass and normalize it

plot of chunk unnamed-chunk-3

calc KLdiv among regions

plot of chunk unnamed-chunk-4

calc normalized Hellinger distance among subjects

    dissmat <- matrix(0,n,n)
    for (i in 1:(n-1)) {
        for (j in (i+1):n) {
            dissmat[i,j] <- helldist(coh[[i]],coh[[j]]) # hellinger distance
        }
    }
    dissmat <- dissmat + t(dissmat)
    diag(dissmat) <- 0

distC <- normalize(dissmat)

# [1] 42 42

plot of chunk plotC

embed and cluster it

n <- nrow(distC)
mdsC <- cmdscale(distC,n-1)
elbowC <- getElbows(mdsC,3)

plot of chunk unnamed-chunk-6

mcC <- Mclust(mdsC[,1:elbowC[2]])
plot(mcC,"BIC")

plot of chunk unnamed-chunk-6

P Data

    require(gdata)
    fname <- "http://www.cis.jhu.edu/~parky/CGP/Data/NEO2012.csv"
    neo <- read.csv(fname)

    pdat <- neo[,4:8]
    distP <- as.matrix(dist(pdat))
plotmat(as.matrix(pdat),"subject","factor")
distP <- normalize(distP)
dim(distP)

plotmat(distP,"subject","subject")
hist(distP)

mdsP <- cmdscale(distP,n-1)
elbowP <- getElbows(mdsP,3)
mcP <- Mclust(mdsP[,1:elbowP[3]])
plot(mcP,"BIC")

plot of chunk distPdata

plot of chunk plotP

JOFC

#distC <- distP
#set.seed(123)
#distC <- jitter(distP,17000)
#plotmat(distC,main="distC = jitter(distP,100)")
#plotmat(distP,main="distP")
#plotmat(abs(distC-distP),main="abs(distC-distP)")

## form omnibus matrix
meanD <- (distC+distP)/2
omni <- rbind(cbind(distC,meanD),cbind(t(meanD),distP))
dim(omni)

# [1] 84 84

## embed
n2 <- n*2
mds <- cmdscale(omni,n2-1)

plot of chunk JOFC-omni-w

elbow <- getElbows(mds,3)

plot of chunk JOFC-mds

knee <- elbow[1]
dat <- mds[,1:knee]

plot of chunk JOFC-scatter

Validation

the null (ie, truly matched subjects) and alternative holdout pdf estimates of the oos distances!

loop through the data, embedding but holding out two matched pairs. call these two matched pairs (i1,i2) and (j1,j2).
- oos embed (i1,i2) under the null – that is, assuming they are a matched pair – and record d(i1,i2) in the embedded space.
- oos embed (j1,j2) under the null – that is, assuming they are a matched pair – and record d(j1,j2) in the embedded space.
- oos embed (i1,j2) under the null – that is, assuming they are a matched pair – and record d(i1,j2) in the embedded space.
- oos embed (j1,i2) under the null – that is, assuming they are a matched pair – and record d(j1,i2) in the embedded space.
calc null pdf estimates of the first two oos distances. calc alternative pdf estimates of the last two oos distances.
plot these two kernel estimates together on the same plot – null in blue, alt in red.

    set.seed(12345)
    nmc <- 1000
    out <- foreach(mc=1:nmc, .combine="rbind") %dopar% {
        ij1 <- sample(1:n,2,replace=FALSE) # i1, j1
        ij2 <- ij1+n         # i2, j2
        ij <- c(ij1,ij2) 

        ## "joint" oos
        Y <- fast.oos2(omni[-ij,-ij], knee, omni[ij,-ij])
        DY <- as.matrix(dist(Y$X.oos))
        D0 <- c(DY[1,3],DY[2,4]) # d(i1,i2), d(j1,j2)
        DA <- c(DY[1,4],DY[2,3]) # d(i1,j2), d(j1,i2)
        c(D0,DA)
    }

plot of chunk plotpdf

plot of chunk plotpwr

Predicting P from C

    ## omni
    meanD <- (distC + distP)/2
    omni <- rbind(cbind(distC, meanD), cbind(t(meanD), distP))

    ## smacof W
    w <- 0.05
    Wdiag <- matrix(w, n, n)
    Woff <- matrix(0, n, n)
    diag(Woff) <- 1 - w
    Weight <- rbind(cbind(Wdiag, Woff), cbind(t(Woff), Wdiag))

plot of chunk plotW2

    suppressMessages(require(jhusmacof,quietly=TRUE))

    ## smacof embedding of all 2*n points
    n2 <- n*2
    knee <- 2
    set.seed(12345)
    smacof <- smacofSym(omni, knee, weightmat = Weight)$conf 
    lab <- factor(rep(c("C","P"),each=n))
    plot(smacof[,1],smacof[,2],col=as.numeric(lab),pch=as.numeric(lab),main="C vs P via smacof embedding")
    segments(smacof[1:n,1],smacof[1:n,2],smacof[(n+1):n2,1],smacof[(n+1):n2,2],col="orange")
    legend("topright", c("C","P"),col=1:2,pch=1:2)

plot of chunk perf

Estimating P for a C for which we have no P

for each left-out (C,P) in turn

we choose Phat[C] = weighted avg of P's associated with C's nearby to embedded C.
we look at d( P[C], Phat[C] ).
we decide what type of person you are, based on your connectome.

    K <- 1 # 1-NN
    Pjhat <- matrix(0,n,knee)
    dp <- dpmean <- rep(0,n)
    for(i in 1:n) {
        j <- i+n # matching P
        ij <- c(i,j) ## (Ci,Pj) pair

        ## in sample without (Ci,Pj) pair
        Dij <- omni[-ij,-ij]
        Wij <- Weight[-ij,-ij]
        Ci <- smacof[i,]
        Xij <- smacof[-ij,]

        X <- smacofSym(Dij, knee, weightmat = Wij, init=Xij)$conf # 82 x 5
        Chat <- X[1:(n-1),] # in-sampled of C \ Ci
        Phat <- X[-c(1:(n-1)),] # in-sampled of P \ Pj

        ## oos Ci only
        Di <- omni[-j,-j]
        Wi <- matrix(0, nrow(omni), ncol = ncol(omni))
        Wi[i,1:n] <- 1
        Wi[1:n,i] <- 1
        Wi <- Wi[-j,-j]
        within <- rep(1,n2)
        within[ij] <- 0
        Cihat <- oosIM(Di, X, W=Wi, init=Ci, isWithin=within[-j]) # oos'ed of Ci

        ## estimate Pj
        dChat <- outer(rowSums(Cihat^2),rowSums(Chat^2),"+") - (2*Cihat %*% t(Chat))
        dChat <- sqrt((dChat+abs(dChat))/2) # dist from Cihat to the rest of Chat
        ind <- order(dChat) # index of nearest neighbors, 1st one is 1NN, etc.
        Pjhat[i,] <- colMeans(Phat[ind[1:K],,drop=FALSE]) # estimated Pj from Phat

        ## compare d(P,Pjhat) and d(P,mean(Phat))
        dp[i] <- sqrt(sum((Pjhat[i,]-smacof[j,])^2))
        dpmean[i] <- sqrt(sum((smacof[j,]-colMeans(Phat))^2))
    }

plot of chunk load1

Performance

Let \(dP\) be the distance in the original \(P\)-space.
Let \(dC\) be the distance in the original \(C\)-space.
Let \(dJOFC\) be the distance in the derived \(JOFC\)-space.

Consider the \(j^{th}\) subject, then we care about \(dP(\hat{P}_j,P_j)\) as our figure of merit.

We observe \(C_j\), as well as all the other \((C,P)\) pairs.

we estimate \(P_j\) via the \(P\)'s associated with the \(C\)'s which are \(dC\)-close to \(C_j\).
we estimate \(P_j\) via the \(P\)'s associated with the \(C\)'s which are \(JOFC\)-close to \(C_j\).
Both approaches are trying to approximate
we estimate \(P_j\) via the \(P\)'s which are \(dP\)-close to \(P_j\).
(Noe that this is a gold standard, since \(P_j\) is not available in our use-case!)

\(\hat{P}_j = P_{\arg\min_{j' \neq j} d(C_{j},C_{j'})}\) where the distance \(d\) used to determine the nearest neighbor is \(dC\) for Method 1 and \(dJOFC\) for Method 2.

Let's look at which \(P\)'s are close, in all three cases, that is,
for each \(j\),

\(dist_0 = dP(P_j, \hat{P})\), where \(\hat{P} = P_{\arg\min_{j' \neq j} dP(C_j,C_{j'})}\),
\(dist_1 = dP(P_j, \hat{P}_{dC})\), where \(\hat{P}_{dC} = P_{\arg\min_{j' \neq j} dC(C_j,C_{j'})}\),
\(dist_2 = dP(P_j, \hat{P}_{dJOFC})\), where \(\hat{P}_{dJOFC} = P_{\arg\min_{j' \neq j} dJOFC(C_j,C_{j'})}\)

If \(dist_2\) looks more like \(dist_0\) than \(dist_1\) looks like \(dist_0\), then JOFC wins!

plot of chunk pval0

Now, let's do Wilcoxon tests using \(dist_1\) and \(dist_2\).

plot of chunk pval

Prediction of Phenotype based on Connectome is better via jofc than via raw!