Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: [email protected]
1.1. The Painted Tree Classifier
The painted tree classifier is a model we have developed for dealing with limited data. It is based on several adjustments to the traditional classification tree model, which was introduced in the previous chapter, and it addresses the fact that in certain biomedical applications, such as cardiovascular studies, data arrive in different modalities and some modalities may not be available for certain subjects.
For example, we might have genotype data for certain subjects, electrophysiological data for other subjects, and so forth. This is fundamentally different from the traditional case of "missing data," in which a relatively small number of features may be missing for some subjects. Here, entire modalities, each comprising many features, may be unavailable for a significant number of subjects, and the subjects for whom all the data are available may in fact be a minority. As a result, while we can build a tree for a single modality by restricting attention to the subjects who have that modality, we cannot build a single tree that uses all the modalities. This is the motivation for painted trees: there is a tree for each modality, constructed in the usual manner. We think of the modalities as colors, and each tree constructed from a single modality is a colored tree that uses only data of that color to predict the dependent variable. Then, having learned a classification tree for each modality, the trees are combined to make decisions on new samples by voting among the different trees; the overall result is the painted tree classifier. A minimal sketch of this voting step is given below.
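To make the voting concrete, here is a minimal sketch in R, assuming a list of fitted trees (one per modality) and a logical vector indicating which modalities are available for the new subject; the names painted.vote and avail are ours for illustration, and the actual implementation appears in Section 2.1.
painted.vote<-function(tree.list,avail,newdata)
{
# Collect a predicted firing probability from each colored tree
# whose modality is available for this subject.
probs<-c()
for (m in seq(1,length(tree.list)))
{
if (avail[m])
{
probs<-c(probs,predict(tree.list[[m]],newdata))
}
}
# With no modality available, fall back to an uninformative 0.5.
if (length(probs)==0)
{
return(0.5)
}
# Combine the votes by averaging the probabilities,
# mirroring the averaging used in Section 2.1.
mean(probs)
}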
For the painted tree classifier, we first adjust the original dataset by replicating observations that are valuable and scarce. We then carry out the traditional classification tree analysis and prune each tree to a pre-specified depth. Finally, we use Leave-One-Out Cross-Validation (LOOCV) to test the classifier. Each part of this process is explained in the following sections.
1.2. The 50/50 Adjustment and Depth Pruning
In our case, we only have some hundreds of observations. The phenotype of interest is whether or not the device "fired" within a certain time from the implantation of an ICD. We want to predict this based on multi-modal data consisting of clinical, imaging, genomic, and electrophysiological variables. What makes this especially difficult is that the number of patients whose device has fired is very small. Hence, one of the two classes is very sparsely represented in the dataset. This type of situation is not uncommon in medical experiments due to cost and other factors. However, from an inductive learning point of view, it creates a major problem, since it becomes difficult to estimate the statistics of the features in the rare class. As we have seen, estimating the probabilities of events based on the features in each class is necessary for constructing classification trees. Given the limited dataset, we use the 50/50 adjustment to replicate the rare observations and make our estimates more stable and robust.
The adjustment is done as follows. We first calculate the ratio between the number of fired observations and the number of not-fired observations; usually, the number of fired observations is far smaller. We then "replicate" all the fired observations and add the copies back into the original dataset, forcing the ratio to be approximately one. The reason for this adjustment is that, in the classification tree model, after each split we label a node according to the majority class in that node ("fired" or "not fired"). Without the 50/50 adjustment, almost every node would contain more "not fired" than "fired" observations, so every node would be labeled "not fired," which makes the model useless. By doing the 50/50 adjustment, we give more weight to the "fired" data and make the model work. A toy example of the arithmetic is given below.
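As a toy illustration (the counts below are made up), suppose we have 20 fired and 180 not-fired observations; the replication factor and the resulting class counts can be checked directly in R.
# Toy illustration of the 50/50 adjustment (made-up counts).
k.fired<-20
k.not.fired<-180
rat<-floor(k.not.fired/k.fired)
rat                        # 9: each fired observation appears 9 times
c(rat*k.fired,k.not.fired) # 180 180: the two classes are now balanced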
We prune the traditional tree model to a pre-specified depth so that we can avoid over-fitting, which would result from estimating too many parameters relative to the amount of data we have. In our analysis, since we do not have enough data, we cannot reliably estimate the posterior probability for each node after more than two or three questions (splits). Each time we split the tree, we add one or two more parameters to the model and, at the same time, make the dataset in each node smaller, so the estimation becomes harder as well.
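The pruning in Section 2.1 relies on the standard binary node numbering used by the tree package, in which the root is node 1 and the children of node k are nodes 2k and 2k+1. The following sketch shows which node numbers survive pruning to depth 3, counting the root as level 1, as in reduce.tree.depth below.
# Node numbers kept when pruning to a given depth
# (root = node 1; children of node k are 2k and 2k+1).
depth<-3
kept<-1:(2^depth-1)                     # nodes 1 through 7 are kept
forced.leaves<-kept[kept>=2^(depth-1)]  # nodes 4 through 7 become leaves
kept
forced.leaves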
1.3. Leave-One-Out Cross-Validation (LOOCV)
After the analysis, we use LOOCV to estimate the generalization error of the painted tree classifier. Each time, we remove a single observation from the original dataset and use the remaining data to construct the painted trees in exactly the same way; the classifier constructed in each such "loop" might be slightly different. We then use that model to predict the observation that we removed. If there are n samples, then we build n classifiers. Dividing the total number of errors made by n gives the LOOCV estimate of the error rate.
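A generic sketch of this loop, with hypothetical fit and pred functions standing in for the painted tree construction and voting (the actual loop appears in Section 2.1):
# Generic LOOCV sketch; fit and pred are hypothetical stand-ins for
# building the painted trees and predicting a held-out subject.
# pred is assumed to return a TRUE/FALSE firing prediction.
loocv.error<-function(d,fit,pred)
{
n<-dim(d)[1]
errors<-0
for (i in seq(1,n))
{
model<-fit(d[-i,])       # train on everything except observation i
yhat<-pred(model,d[i,])  # predict the held-out observation
errors<-errors+(yhat!=d$fired[i])
}
errors/n                 # LOOCV estimate of the error rate
}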
2.1. Commented Code
# Clear memory
#
rm(list=ls())
#
# Load the "tree" model package.
#
library(tree)
#
# Painted forest code begins here.
#
# Read data into R.
#
d<-read.csv("data.csv",as.is=TRUE)
#
# Remove data points (rows) with LVEF being NA.
#
#d<-d[!is.na(d$LVEF),]
#
# Remove data points (rows) with LVEF larger than 35.
#
#d<-d[d$LVEF<=35,]
#
# Only keep the implanted subjects with positive days of observation.
#
d<-d[!is.na(d$Implant.Date),]
d<-d[d$Days.Of.Observation>0,]
#
# Only keep subjects with at least ndays (=365) of observation.
#
ndays<-365
d<-d[d$Days.Of.Observation>=ndays,]
#
# Create a "fired" variable.
#
d$fired<-d$Days.To.First.App.Firing<=ndays
#
# Display a contingency table of the counts for each level of d$fired.
#
table(d$fired)
#
# Create a discrete variable list, name it "discrete", and assign the corresponding column names.
#
discrete<-c(43:48,50,55,58,59)
names(discrete)<-names(d)[discrete]
#
# Convert the strings in the discrete variables to factors.
#
for (i in discrete)
{
d[,i]<-as.factor(d[,i])
}
#
# Create a continuous variable list, name it "continuous", and assign the corresponding column names.
#
continuous<-c(3:41,83:87)
names(continuous)<-names(d)[continuous]
#
# Convert all variables in the continuous list to numeric variables.
#
for (i in continuous)
{
d[,i]<-as.numeric(d[,i])
}
#
# Create a variable list named "snp", and assign the corresponding column names.
#
snp<-c(43:48)
names(snp)<-names(d)[snp]
#
# Create a variable list named "ecg", and assign the corresponding column names.
#
ecg<-c(3:41)
names(ecg)<-names(d)[ecg]
#
# Create a variable list named "image", and assign the corresponding column names.
#
image<-c(83:87)
names(image)<-names(d)[image]
#
# Create a variable list named "inducible", and assign the corresponding column names.
#
inducible<-c(59)
names(inducible)<-names(d)[inducible]
#
# Create a variable list named "race", and assign the corresponding column names.
#
race<-c(55)
names(race)<-names(d)[race]
#
# Create a variable list named "gender", and assign the corresponding column names.
#
gender<-c(50)
names(gender)<-names(d)[gender]
#
# Create a variable list named "discrete2", and assign the corresponding column names.
#
discrete2<-c(50,55,59)
names(discrete2)<-names(d)[discrete2]
#
# Trim a tree to a given maximum depth.
#
reduce.tree.depth<-function(tree,depth)
{
# Get the node numbers and store them as numbers in a vector r.
r<-as.numeric(rownames(tree$frame))
# Prune the tree to a pre-specified depth.
tree$frame<-tree$frame[r<=((2^depth)-1),]
# Store the number of nodes in n.
n<-dim(tree$frame)[1]
r<-as.numeric(rownames(tree$frame))
tree$frame$var[r>=(2^(depth-1))]<-"<leaf>"
tree
}
#reduce.tree.depth(TREE.image,1)
#reduce.tree.depth(TREE.image,2)
#reduce.tree.depth(TREE.image,3)
#reduce.tree.depth(TREE.image,4)
#reduce.tree.depth(TREE.image,5)
#
# Use data frame d to build a tree for "fired" using the variables in the
# named list xnames. Truncate the tree to the given depth.
#
build.tree<-function(d,xnames,depth)
{
#
#print("in build tree")
#print(names(d))
#print(dim(d))
str<-paste("dnew<-data.frame(",sep="")
for (i in seq(1,length(xnames)))
{
str<-paste(str,"d$",names(xnames)[i],",",sep="")
}
str<-paste(str,"d$fired",")",sep="")
#print(str)
eval(parse(text=str))
#print(names(dnew))
#print(dim(dnew))
dnew<-dnew[complete.cases(dnew),]
names(dnew)<-c(names(xnames),"fired")
dnew<-make.fired.5050(dnew)
#print(names(dnew))
#print(dim(dnew))
#
# Formula creator: input a list of n variable names for the predictors and
# the name of the response variable; the output is an expression that looks like this:
#
# fired~xnames[1]+...xnames[n]
#
str<-paste("fired~",sep="")
n<-length(xnames)
for (i in seq(1,n-1))
{
str<-paste(str,names(xnames)[i],"+",sep="")
}
str<-paste(str,names(xnames)[n],sep="")
#print(str)
TREE<-tree(as.formula(str),data=dnew)
# print(TREE)
RTREE<-reduce.tree.depth(TREE,depth)
# print(RTREE)
RTREE
}
#
# Given a data frame with a fired variable, replicate the fired cases
# so that we get as close as we can (but below if necessary) to 50% firing.
#
make.fired.5050<-function(d)
{
d.fired<-d[d$fired==1,]
d.not.fired<-d[!d$fired,]
k.fired<-dim(d.fired)[1]
k.not.fired<-dim(d.not.fired)[1]
# Guard: replication is impossible when there are no fired cases.
if (k.fired==0)
{
stop("error with replication: no firings for this dataset")
}
rat<-floor(k.not.fired/k.fired)
# print(rat)
d.temp<-d.fired
if (rat>1)
{
for (i in seq(1,rat-1))
{
d.temp<-rbind(d.temp,d.fired)
}
}
rbind(d.temp,d.not.fired)
}
#
# Build the trees for the full data set.
#
TREE.snp<-build.tree(d,snp,8)
TREE.ecg<-build.tree(d,ecg,8)
TREE.image<-build.tree(d,image,8)
TREE.discrete<-build.tree(d,discrete2,8)
TREE.LIST<-list(TREE.snp,TREE.ecg,TREE.image,TREE.discrete)
RTREE.snp<-build.tree(d,snp,3)
RTREE.ecg<-build.tree(d,ecg,3)
RTREE.image<-build.tree(d,image,3)
RTREE.discrete<-build.tree(d,discrete2,3)
RTREE.LIST<-list(RTREE.snp,RTREE.ecg,RTREE.image,RTREE.discrete)
#
#
#
N<-dim(d)[1]
d.pred<-data.frame(rep(0,N),rep(0,N),rep(0,N),rep(0,N))
names(d.pred)<-c("SNP","ECG","IMAGE","DISCRETE")
#
# LOOCV loop
#
for (i in seq(1,N))
{
print(i)
#
# Build the trees on the training set; build.tree replicates the "fired"
# cases so that we get close to 50-50 as the probability of firing.
#
d.train<-d[-i,]
TREE.snp<-build.tree(d.train,snp,3)
TREE.ecg<-build.tree(d.train,ecg,3)
TREE.image<-build.tree(d.train,image,3)
TREE.discrete<-build.tree(d.train,discrete2,3)
#
# Use each tree to predict outcomes for the test set.
#
d.test<-d[i,]
p.snp<-predict(TREE.snp,d.test)
# print(p.snp)
p.ecg<-predict(TREE.ecg,d.test)
# print(p.ecg)
p.image<-predict(TREE.image,d.test)
# print(p.image)
p.discrete<-predict(TREE.discrete,d.test)
# print(p.discrete)
#print("done with this
index")
d.pred[i,]<-c(p.snp,p.ecg,p.image,p.discrete)
}
for (i in seq(1,dim(d.pred)[2]))
{
d.pred[,i]<-round(d.pred[,i],3)
}
#
# Create a "fired" variable for the test set.
#
d.pred$fired<-d$fired
#
# Compute the average of the probabilities over all modalities.
# Only include those modalities for which we have data.
#
d.pred$pmean<-rep(0,N)
for (i in seq(1,N))
{
clist<-c()
if (d$SNP.ALL.IND[i])
{
clist<-c(clist,1)
}
if (d$ECG.IND[i])
{
clist<-c(clist,2)
}
if (d$IMAGE.IND[i])
{
clist<-c(clist,3)
}
if (d$IND.IND[i]&&d$GENDER.IND[i]&&d$RACE[i])
{
clist<-c(clist,4)
}
print("clist=")
print(clist)
print(d.pred[i,clist])
if (length(clist)>1)
{
d.pred$pmean[i]<-apply(d.pred[i,clist],1,mean)
}
if (length(clist)==1)
{
d.pred$pmean[i]<-d.pred[i,clist]
}
if (length(clist)==0)
{
d.pred$pmean[i]<-.5
}
}
write.table(d.pred,file="tree.pred.csv",
col.names=NA,sep=",")
#
# Work out the ROC curve.
#
threshold<-seq(0,1,by=.001)
p.false.det<-rep(0,length(threshold))
p.true.det<-rep(0,length(threshold))
for (i in seq(1,length(threshold)))
{
th<-threshold[i]
fired.pred<-(d.pred$pmean>th)
p.false.det[i]<-sum(fired.pred*!d.pred$fired)/sum(!d.pred$fired)
p.true.det[i]<-sum(fired.pred*d.pred$fired)/sum(d.pred$fired)
}
pdf("ROC.pdf")
plot(p.false.det,p.true.det,type="l",
xlab="False positive rate",
ylab="True positive rate",
main="Prediction of Appropriate Firing ROC Curve")
lines(threshold,threshold,col="red")
dev.off()
2.2. Uncommented Code
rm(list=ls())
library(tree)
d<-read.csv("data.csv",as.is=TRUE)
#d<-d[!is.na(d$LVEF),]
#d<-d[d$LVEF<=35,]
d<-d[!is.na(d$Implant.Date),]
d<-d[d$Days.Of.Observation>0,]
ndays<-365
d<-d[d$Days.Of.Observation>=ndays,]
d$fired<-d$Days.To.First.App.Firing<=ndays
table(d$fired)
discrete<-c(43:48,50,55,58,59)
names(discrete)<-names(d)[discrete]
for (i in discrete)
{
d[,i]<-as.factor(d[,i])
}
continuous<-c(3:41,83:87)
names(continuous)<-names(d)[continuous]
for (i in continuous)
{
d[,i]<-as.numeric(d[,i])
}
snp<-c(43:48)
names(snp)<-names(d)[snp]
ecg<-c(3:41)
names(ecg)<-names(d)[ecg]
image<-c(83:87)
names(image)<-names(d)[image]
inducible<-c(59)
names(inducible)<-names(d)[inducible]
race<-c(55)
names(race)<-names(d)[race]
gender<-c(50)
names(gender)<-names(d)[gender]
discrete2<-c(50,55,59)
names(discrete2)<-names(d)[discrete2]
reduce.tree.depth<-function(tree,depth)
{
r<-as.numeric(rownames(tree$frame))
tree$frame<-tree$frame[r<=((2^depth)-1),]
n<-dim(tree$frame)[1]
r<-as.numeric(rownames(tree$frame))
tree$frame$var[r>=(2^(depth-1))]<-"<leaf>"
tree
}
build.tree<-function(d,xnames,depth)
{
str<-paste("dnew<-data.frame(",sep="")
for (i in seq(1,length(xnames)))
{
str<-paste(str,"d$",names(xnames)[i],",",sep="")
}
str<-paste(str,"d$fired",")",sep="")
eval(parse(text=str))
dnew<-dnew[complete.cases(dnew),]
names(dnew)<-c(names(xnames),"fired")
dnew<-make.fired.5050(dnew)
str<-paste("fired~",sep="")
n<-length(xnames)
for (i in seq(1,n-1))
{
str<-paste(str,names(xnames)[i],"+",sep="")
}
str<-paste(str,names(xnames)[n],sep="")
TREE<-tree(as.formula(str),data=dnew)
RTREE<-reduce.tree.depth(TREE,depth)
RTREE
}
make.fired.5050<-function(d)
{
d.fired<-d[d$fired==1,]
d.not.fired<-d[!d$fired,]
k.fired<-dim(d.fired)[1]
k.not.fired<-dim(d.not.fired)[1]
if (k.fired==0)
{
stop("error with replication: no firings for this dataset")
}
rat<-floor(k.not.fired/k.fired)
# print(rat)
d.temp<-d.fired
if (rat>1)
{
for (i in seq(1,rat-1))
{
d.temp<-rbind(d.temp,d.fired)
}
}
rbind(d.temp,d.not.fired)
}
TREE.snp<-build.tree(d,snp,8)
TREE.ecg<-build.tree(d,ecg,8)
TREE.image<-build.tree(d,image,8)
TREE.discrete<-build.tree(d,discrete2,8)
TREE.LIST<-list(TREE.snp,TREE.ecg,TREE.image,TREE.discrete)
RTREE.snp<-build.tree(d,snp,3)
RTREE.ecg<-build.tree(d,ecg,3)
RTREE.image<-build.tree(d,image,3)
RTREE.discrete<-build.tree(d,discrete2,3)
RTREE.LIST<-list(RTREE.snp,RTREE.ecg,RTREE.image,RTREE.discrete)
N<-dim(d)[1]
d.pred<-data.frame(rep(0,N),rep(0,N),rep(0,N),rep(0,N))
names(d.pred)<-c("SNP","ECG","IMAGE","DISCRETE")
#
# LOOCV loop
#
for (i in seq(1,N))
{
print(i)
d.train<-d[-i,]
TREE.snp<-build.tree(d.train,snp,3)
TREE.ecg<-build.tree(d.train,ecg,3)
TREE.image<-build.tree(d.train,image,3)
TREE.discrete<-build.tree(d.train,discrete2,3)
d.test<-d[i,]
p.snp<-predict(TREE.snp,d.test)
p.ecg<-predict(TREE.ecg,d.test)
p.image<-predict(TREE.image,d.test)
p.discrete<-predict(TREE.discrete,d.test)
d.pred[i,]<-c(p.snp,p.ecg,p.image,p.discrete)
}
for (i in seq(1,dim(d.pred)[2]))
{
d.pred[,i]<-round(d.pred[,i],3)
}
d.pred$fired<-d$fired
d.pred$pmean<-rep(0,N)
for (i in seq(1,N))
{
clist<-c()
if (d$SNP.ALL.IND[i])
{
clist<-c(clist,1)
}
if (d$ECG.IND[i])
{
clist<-c(clist,2)
}
if (d$IMAGE.IND[i])
{
clist<-c(clist,3)
}
if (d$IND.IND[i]&&d$GENDER.IND[i]&&d$RACE[i])
{
clist<-c(clist,4)
}
print("clist=")
print(clist)
print(d.pred[i,clist])
if (length(clist)>1)
{
d.pred$pmean[i]<-apply(d.pred[i,clist],1,mean)
}
if (length(clist)==1)
{
d.pred$pmean[i]<-d.pred[i,clist]
}
if (length(clist)==0)
{
d.pred$pmean[i]<-.5
}
}
write.table(d.pred,file="tree.pred.csv",
col.names=NA,sep=",")
#
# Work out the ROC curve.
#
threshold<-seq(0,1,by=.001)
p.false.det<-rep(0,length(threshold))
p.true.det<-rep(0,length(threshold))
for (i in seq(1,length(threshold)))
{
th<-threshold[i]
fired.pred<-(d.pred$pmean>th)
p.false.det[i]<-sum(fired.pred*!d.pred$fired)/sum(!d.pred$fired)
p.true.det[i]<-sum(fired.pred*d.pred$fired)/sum(d.pred$fired)
}
pdf("ROC.pdf")
plot(p.false.det,p.true.det,type="l",
xlab="False positive rate",
ylab="True positive rate",
main="Prediction of Appropriate Firing ROC Curve")
lines(threshold,threshold,col="red")
dev.off()