Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: [email protected]
1.1. The Painted Tree Classifier
The painted tree classifier is a model we have developed for dealing with limited data. It is based on several adjustments to the traditional classification tree model, which was introduced in the previous chapter, and it addresses the fact that in certain biomedical applications, such as cardiovascular studies, data arrive in different modalities and some modalities may not be available for certain subjects.
For example, we might have genotype data for certain subjects, electrophysiological data for other subjects, and so forth. This is fundamentally different from the traditional case of "missing data," in which a relatively small number of features may be missing for some subjects. Here, entire modalities, each comprising many features, may be unavailable for a significant number of subjects, and the subjects for whom all the data are available may in fact be a minority. As a result, while we can build a tree for a single modality by restricting attention to the subjects who have that modality, we cannot build a single tree that uses all the modalities. This is the motivation for painted trees: there is a tree for each modality, constructed in the usual manner. We think of the modalities as colors, and each tree constructed from a single modality is a colored tree that uses only data of that color to predict the dependent variable. Then, having learned a classification tree for each modality, the trees are combined to make decisions on new samples by voting among the different trees; the overall result is the painted tree classifier. A minimal sketch of this voting step is given below.
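To make the voting concrete, here is a minimal sketch in R, assuming a list of fitted trees (one per modality) and a logical vector indicating which modalities are available for the new subject; the names painted.vote and avail are ours for illustration, and the actual implementation appears in Section 2.1.
painted.vote<-function(tree.list,avail,newdata)
{
# Collect a predicted firing probability from each colored tree
# whose modality is available for this subject.
probs<-c()
for (m in seq(1,length(tree.list)))
{
if (avail[m])
{
probs<-c(probs,predict(tree.list[[m]],newdata))
}
}
# With no modality available, fall back to an uninformative 0.5.
if (length(probs)==0)
{
return(0.5)
}
# Combine the votes by averaging the probabilities,
# mirroring the averaging used in Section 2.1.
mean(probs)
}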
For the painted tree classifier, we first adjust the original dataset by replicating observations that are valuable and scarce. We then carry out the traditional classification tree analysis and prune each tree to a pre-specified depth. Finally, we use Leave-One-Out Cross-Validation (LOOCV) to test the classifier. Each part of this process is explained in the following sections.
1.2. The 50/50 Adjustment and Depth Pruning
In our case, we only have some hundreds of observations. The phenotype of interest is whether or not the device "fired" within a certain time from the implantation of an ICD. We want to predict this based on multi-modal data consisting of clinical, imaging, genomic, and electrophysiological variables. What makes this especially difficult is that the number of patients whose device has fired is very small. Hence, one of the two classes is very sparsely represented in the dataset. This type of situation is not uncommon in medical experiments due to cost and other factors. However, from an inductive learning point of view, it creates a major problem, since it becomes difficult to estimate the statistics of the features in the rare class. As we have seen, estimating the probabilities of events based on the features in each class is necessary for constructing classification trees. Given the limited dataset, we use the 50/50 adjustment to replicate the rare observations and make our estimates more stable and robust.
The adjustment is done as follows. We first calculate the ratio between the number of fired observations and the number of not-fired observations; usually, the number of fired observations is far smaller. We then "replicate" all the fired observations and add the copies back into the original dataset, forcing the ratio to be approximately one. The reason for this adjustment is that, in the classification tree model, after each split we label a node according to the majority class in that node ("fired" or "not fired"). Without the 50/50 adjustment, almost every node would contain more "not fired" than "fired" observations, so every node would be labeled "not fired," which makes the model useless. By doing the 50/50 adjustment, we give more weight to the "fired" data and make the model work. A toy example of the arithmetic is given below.
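As a toy illustration (the counts below are made up), suppose we have 20 fired and 180 not-fired observations; the replication factor and the resulting class counts can be checked directly in R.
# Toy illustration of the 50/50 adjustment (made-up counts).
k.fired<-20
k.not.fired<-180
rat<-floor(k.not.fired/k.fired)
rat                        # 9: each fired observation appears 9 times
c(rat*k.fired,k.not.fired) # 180 180: the two classes are now balanced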
We prune the traditional tree model to a pre-specified depth so that we can avoid over-fitting, which would result from estimating too many parameters relative to the amount of data we have. In our analysis, since we do not have enough data, we cannot reliably estimate the posterior probability for each node after more than two or three questions (splits). Each time we split the tree, we add one or two more parameters to the model and, at the same time, make the dataset in each node smaller, so the estimation becomes harder as well.
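The pruning in Section 2.1 relies on the standard binary node numbering used by the tree package, in which the root is node 1 and the children of node k are nodes 2k and 2k+1. The following sketch shows which node numbers survive pruning to depth 3, counting the root as level 1, as in reduce.tree.depth below.
# Node numbers kept when pruning to a given depth
# (root = node 1; children of node k are 2k and 2k+1).
depth<-3
kept<-1:(2^depth-1)                     # nodes 1 through 7 are kept
forced.leaves<-kept[kept>=2^(depth-1)]  # nodes 4 through 7 become leaves
kept
forced.leaves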
1.3. Leave-One-Out Cross-Validation (LOOCV)
After the analysis, we use LOOCV to estimate the generalization error of the painted tree classifier. Each time, we remove a single observation from the original dataset and use the remaining data to construct the painted trees in exactly the same way; the classifier constructed in each such "loop" might be slightly different. We then use that model to predict the observation that we removed. If there are n samples, then we build n classifiers. Dividing the total number of errors made by n gives the LOOCV estimate of the error rate.
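A generic sketch of this loop, with hypothetical fit and pred functions standing in for the painted tree construction and voting (the actual loop appears in Section 2.1):
# Generic LOOCV sketch; fit and pred are hypothetical stand-ins for
# building the painted trees and predicting a held-out subject.
# pred is assumed to return a TRUE/FALSE firing prediction.
loocv.error<-function(d,fit,pred)
{
n<-dim(d)[1]
errors<-0
for (i in seq(1,n))
{
model<-fit(d[-i,])       # train on everything except observation i
yhat<-pred(model,d[i,])  # predict the held-out observation
errors<-errors+(yhat!=d$fired[i])
}
errors/n                 # LOOCV estimate of the error rate
}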
2.1. Commented Code
# Clear memory
#
rm(list=ls())
#
# Load the "tree" model package.
#
library(tree)
#
# Painted forest code begins here.
#
# Read data into R.
#
d<-read.csv("data.csv",as.is=TRUE)
#
# Remove data points (rows) with LVEF being NA.
#
#d<-d[!is.na(d$LVEF),]
#
# Remove data points (rows) with LVEF larger than 35.
#
#d<-d[d$LVEF<=35,]
#
# Only keep the implanted subjects with positive days of observation.
#
d<-d[!is.na(d$Implant.Date),]
d<-d[d$Days.Of.Observation>0,]
#
# Only keep subjects with at least ndays (=365) of observation.
#
ndays<-365
d<-d[d$Days.Of.Observation>=ndays,]
#
# Create a "fired" variable.
#
d$fired<-d$Days.To.First.App.Firing<=ndays
#
# Display a contingency table of the counts for each level of d$fired.
#
table(d$fired)
#
# Create a discrete variable list, name it "discrete", and assign the corresponding column names.
#
discrete<-c(43:48,50,55,58,59)
names(discrete)<-names(d)[discrete]
#
# Convert the strings in the discrete variables to factors.
#
for (i in discrete)
{
d[,i]<-as.factor(d[,i])
}
#
# Create a continuous variable list, name it "continuous", and assign the corresponding column names.
#
continuous<-c(3:41,83:87)
names(continuous)<-names(d)[continuous]
#
# Convert all variables in the continuous list to numeric variables.
#
for (i in continuous)
{
d[,i]<-as.numeric(d[,i])
}
#
# Create a variable list named "snp", and assign the corresponding column names.
#
snp<-c(43:48)
names(snp)<-names(d)[snp]
#
# Create a variable list named "ecg", and assign the corresponding column names.
#
ecg<-c(3:41)
names(ecg)<-names(d)[ecg]
#
# Create a variable list named "image", and assign the corresponding column names.
#
image<-c(83:87)
names(image)<-names(d)[image]
#
# Create a variable list named "inducible", and assign the corresponding column names.
#
inducible<-c(59)
names(inducible)<-names(d)[inducible]
#
# Create a variable list named "race", and assign the corresponding column names.
#
race<-c(55)
names(race)<-names(d)[race]
#
# Create a variable list named "gender", and assign the corresponding column names.
#
gender<-c(50)
names(gender)<-names(d)[gender]
#
# Create a variable list named "discrete2", and assign the corresponding column names.
#
discrete2<-c(50,55,59)
names(discrete2)<-names(d)[discrete2]
#
# Trim a tree to a given maximum depth.
#
reduce.tree.depth<-function(tree,depth)
{
# Get the node numbers and store them as numbers in a vector r.
r<-as.numeric(rownames(tree$frame))
# Prune the tree to a pre-specified depth.
tree$frame<-tree$frame[r<=((2^depth)-1),]
# Store the number of nodes in n.
n<-dim(tree$frame)[1]
r<-as.numeric(rownames(tree$frame))
tree$frame$var[r>=(2^(depth-1))]<-"<leaf>"
tree
}
#reduce.tree.depth(TREE.image,1)
#reduce.tree.depth(TREE.image,2)
#reduce.tree.depth(TREE.image,3)
#reduce.tree.depth(TREE.image,4)
#reduce.tree.depth(TREE.image,5)
#
# Use data frame d to build a tree for "fired" using the variables in the
# named list xnames. Truncate the tree to the given depth.
#
build.tree<-function(d,xnames,depth)
{
#
#print("in build tree")
#print(names(d))
#print(dim(d))
str<-paste("dnew<-data.frame(",sep="")
for (i in seq(1,length(xnames)))
{
str<-paste(str,"d$",names(xnames)[i],",",sep="")
}
str<-paste(str,"d$fired",")",sep="")
#print(str)
eval(parse(text=str))
#print(names(dnew))
#print(dim(dnew))
dnew<-dnew[complete.cases(dnew),]
names(dnew)<-c(names(xnames),"fired")
dnew<-make.fired.5050(dnew)
#print(names(dnew))
#print(dim(dnew))
#
# Formula creator: input a list of n variable names for the predictors and
# the name of the response variable; the output is an expression that looks like this:
#
# fired~xnames[1]+...xnames[n]
#
str<-paste("fired~",sep="")
n<-length(xnames)
for (i in seq(1,n-1))
{
str<-paste(str,names(xnames)[i],"+",sep="")
}
str<-paste(str,names(xnames)[n],sep="")
#print(str)
TREE<-tree(as.formula(str),data=dnew)
# print(TREE)
RTREE<-reduce.tree.depth(TREE,depth)
# print(RTREE)
RTREE
}
#
# Given a data frame with a fired variable, replicate the fired cases
# so that we get as close as we can (but below if necessary) to 50% firing.
#
make.fired.5050<-function(d)
{
d.fired<-d[d$fired==1,]
d.not.fired<-d[!d$fired,]
k.fired<-dim(d.fired)[1]
k.not.fired<-dim(d.not.fired)[1]
# Guard: replication is impossible when there are no fired cases.
if (k.fired==0)
{
stop("error with replication: no firings for this dataset")
}
rat<-floor(k.not.fired/k.fired)
# print(rat)
d.temp<-d.fired
if (rat>1)
{
for (i in seq(1,rat-1))
{
d.temp<-rbind(d.temp,d.fired)
}
}
rbind(d.temp,d.not.fired)
}
#
# Build the trees for the full data set.
#
TREE.snp<-build.tree(d,snp,8)
TREE.ecg<-build.tree(d,ecg,8)
TREE.image<-build.tree(d,image,8)
TREE.discrete<-build.tree(d,discrete2,8)
TREE.LIST<-list(TREE.snp,TREE.ecg,TREE.image,TREE.discrete)
RTREE.snp<-build.tree(d,snp,3)
RTREE.ecg<-build.tree(d,ecg,3)
RTREE.image<-build.tree(d,image,3)
RTREE.discrete<-build.tree(d,discrete2,3)
RTREE.LIST<-list(RTREE.snp,RTREE.ecg,RTREE.image,RTREE.discrete)
#
#
#
N<-dim(d)[1]
d.pred<-data.frame(rep(0,N),rep(0,N),rep(0,N),rep(0,N))
names(d.pred)<-c("SNP","ECG","IMAGE","DISCRETE")
#
# LOOCV loop
#
for (i in seq(1,N))
{
print(i)
#
# Build the trees on the training set; build.tree replicates the "fired"
# cases so that we get close to 50-50 as the probability of firing.
#
d.train<-d[-i,]
TREE.snp<-build.tree(d.train,snp,3)
TREE.ecg<-build.tree(d.train,ecg,3)
TREE.image<-build.tree(d.train,image,3)
TREE.discrete<-build.tree(d.train,discrete2,3)
#
# Use each tree to predict outcomes for the test set.
#
d.test<-d[i,]
p.snp<-predict(TREE.snp,d.test)
# print(p.snp)
p.ecg<-predict(TREE.ecg,d.test)
# print(p.ecg)
p.image<-predict(TREE.image,d.test)
# print(p.image)
p.discrete<-predict(TREE.discrete,d.test)
# print(p.discrete)
#print("done with this
index")
d.pred[i,]<-c(p.snp,p.ecg,p.image,p.discrete)
}
for (i in seq(1,dim(d.pred)[2]))
{
d.pred[,i]<-round(d.pred[,i],3)
}
#
# Create a "fired" variable for the test set.
#
d.pred$fired<-d$fired
#
# Compute the average of the probabilities over all modalities.
# Only include those modalities for which we have data.
#
d.pred$pmean<-rep(0,N)
for (i in seq(1,N))
{
clist<-c()
if (d$SNP.ALL.IND[i])
{
clist<-c(clist,1)
}
if (d$ECG.IND[i])
{
clist<-c(clist,2)
}
if (d$IMAGE.IND[i])
{
clist<-c(clist,3)
}
if (d$IND.IND[i]&&d$GENDER.IND[i]&&d$RACE[i])
{
clist<-c(clist,4)
}
print("clist=")
print(clist)
print(d.pred[i,clist])
if (length(clist)>1)
{
d.pred$pmean[i]<-apply(d.pred[i,clist],1,mean)
}
if (length(clist)==1)
{
d.pred$pmean[i]<-d.pred[i,clist]
}
if (length(clist)==0)
{
d.pred$pmean[i]<-.5
}
}
write.table(d.pred,file="tree.pred.csv",
col.names=NA,sep=",")
#
# Work out the ROC curve.
#
threshold<-seq(0,1,by=.001)
p.false.det<-rep(0,length(threshold))
p.true.det<-rep(0,length(threshold))
for (i in seq(1,length(threshold)))
{
th<-threshold[i]
fired.pred<-(d.pred$pmean>th)
p.false.det[i]<-sum(fired.pred*!d.pred$fired)/sum(!d.pred$fired)
p.true.det[i]<-sum(fired.pred*d.pred$fired)/sum(d.pred$fired)
}
pdf("ROC.pdf")
plot(p.false.det,p.true.det,type="l",
xlab="False positive rate",
ylab="True positive rate",
main="Prediction of Appropriate Firing ROC Curve")
lines(threshold,threshold,col="red")
dev.off()
2.2. Uncommented Code
rm(list=ls())
library(tree)
d<-read.csv("data.csv",as.is=TRUE)
#d<-d[!is.na(d$LVEF),]
#d<-d[d$LVEF<=35,]
d<-d[!is.na(d$Implant.Date),]
d<-d[d$Days.Of.Observation>0,]
ndays<-365
d<-d[d$Days.Of.Observation>=ndays,]
d$fired<-d$Days.To.First.App.Firing<=ndays
table(d$fired)
discrete<-c(43:48,50,55,58,59)
names(discrete)<-names(d)[discrete]
for (i in discrete)
{
d[,i]<-as.factor(d[,i])
}
continuous<-c(3:41,83:87)
names(continuous)<-names(d)[continuous]
for (i in continuous)
{
d[,i]<-as.numeric(d[,i])
}
snp<-c(43:48)
names(snp)<-names(d)[snp]
ecg<-c(3:41)
names(ecg)<-names(d)[ecg]
image<-c(83:87)
names(image)<-names(d)[image]
inducible<-c(59)
names(inducible)<-names(d)[inducible]
race<-c(55)
names(race)<-names(d)[race]
gender<-c(50)
names(gender)<-names(d)[gender]
discrete2<-c(50,55,59)
names(discrete2)<-names(d)[discrete2]
reduce.tree.depth<-function(tree,depth)
{
r<-as.numeric(rownames(tree$frame))
tree$frame<-tree$frame[r<=((2^depth)-1),]
n<-dim(tree$frame)[1]
r<-as.numeric(rownames(tree$frame))
tree$frame$var[r>=(2^(depth-1))]<-"<leaf>"
tree
}
build.tree<-function(d,xnames,depth)
{
str<-paste("dnew<-data.frame(",sep="")
for (i in seq(1,length(xnames)))
{
str<-paste(str,"d$",names(xnames)[i],",",sep="")
}
str<-paste(str,"d$fired",")",sep="")
eval(parse(text=str))
dnew<-dnew[complete.cases(dnew),]
names(dnew)<-c(names(xnames),"fired")
dnew<-make.fired.5050(dnew)
str<-paste("fired~",sep="")
n<-length(xnames)
for (i in seq(1,n-1))
{
str<-paste(str,names(xnames)[i],"+",sep="")
}
str<-paste(str,names(xnames)[n],sep="")
TREE<-tree(as.formula(str),data=dnew)
RTREE<-reduce.tree.depth(TREE,depth)
RTREE
}
make.fired.5050<-function(d)
{
d.fired<-d[d$fired==1,]
d.not.fired<-d[!d$fired,]
k.fired<-dim(d.fired)[1]
k.not.fired<-dim(d.not.fired)[1]
if (k.fired==0)
{
stop("error with replication: no firings for this dataset")
}
rat<-floor(k.not.fired/k.fired)
# print(rat)
d.temp<-d.fired
if (rat>1)
{
for (i in seq(1,rat-1))
{
d.temp<-rbind(d.temp,d.fired)
}
}
rbind(d.temp,d.not.fired)
}
TREE.snp<-build.tree(d,snp,8)
TREE.ecg<-build.tree(d,ecg,8)
TREE.image<-build.tree(d,image,8)
TREE.discrete<-build.tree(d,discrete2,8)
TREE.LIST<-list(TREE.snp,TREE.ecg,TREE.image,TREE.discrete)
RTREE.snp<-build.tree(d,snp,3)
RTREE.ecg<-build.tree(d,ecg,3)
RTREE.image<-build.tree(d,image,3)
RTREE.discrete<-build.tree(d,discrete2,3)
RTREE.LIST<-list(RTREE.snp,RTREE.ecg,RTREE.image,RTREE.discrete)
N<-dim(d)[1]
d.pred<-data.frame(rep(0,N),rep(0,N),rep(0,N),rep(0,N))
names(d.pred)<-c("SNP","ECG","IMAGE","DISCRETE")
#
# LOOCV loop
#
for (i in seq(1,N))
{
print(i)
d.train<-d[-i,]
TREE.snp<-build.tree(d.train,snp,3)
TREE.ecg<-build.tree(d.train,ecg,3)
TREE.image<-build.tree(d.train,image,3)
TREE.discrete<-build.tree(d.train,discrete2,3)
d.test<-d[i,]
p.snp<-predict(TREE.snp,d.test)
p.ecg<-predict(TREE.ecg,d.test)
p.image<-predict(TREE.image,d.test)
p.discrete<-predict(TREE.discrete,d.test)
d.pred[i,]<-c(p.snp,p.ecg,p.image,p.discrete)
}
for (i in seq(1,dim(d.pred)[2]))
{
d.pred[,i]<-round(d.pred[,i],3)
}
d.pred$fired<-d$fired
d.pred$pmean<-rep(0,N)
for (i in seq(1,N))
{
clist<-c()
if (d$SNP.ALL.IND[i])
{
clist<-c(clist,1)
}
if (d$ECG.IND[i])
{
clist<-c(clist,2)
}
if (d$IMAGE.IND[i])
{
clist<-c(clist,3)
}
if (d$IND.IND[i]&&d$GENDER.IND[i]&&d$RACE[i])
{
clist<-c(clist,4)
}
print("clist=")
print(clist)
print(d.pred[i,clist])
if (length(clist)>1)
{
d.pred$pmean[i]<-apply(d.pred[i,clist],1,mean)
}
if (length(clist)==1)
{
d.pred$pmean[i]<-d.pred[i,clist]
}
if (length(clist)==0)
{
d.pred$pmean[i]<-.5
}
}
write.table(d.pred,file="tree.pred.csv",
col.names=NA,sep=",")
#
# Work out the ROC curve.
#
threshold<-seq(0,1,by=.001)
p.false.det<-rep(0,length(threshold))
p.true.det<-rep(0,length(threshold))
for (i in seq(1,length(threshold)))
{
th<-threshold[i]
fired.pred<-(d.pred$pmean>th)
p.false.det[i]<-sum(fired.pred*!d.pred$fired)/sum(!d.pred$fired)
p.true.det[i]<-sum(fired.pred*d.pred$fired)/sum(d.pred$fired)
}
pdf("ROC.pdf")
plot(p.false.det,p.true.det,type="l",
xlab="False positive rate",
ylab="True positive rate",
main="Prediction of Appropriate Firing ROC Curve")
lines(threshold,threshold,col="red")
dev.off()