Document Information
Package: base, version 2.8.0
Date: 1/5/2009
Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: [email protected]
The quadrant tree classifier is based on the idea of selecting a pair of numerical or ordered features x and y and choosing thresholds c and d, one per feature, that break the feature space into four quadrants according to which of the inequalities x<c, x>c and y<d, y>d hold. Such a classifier amounts to a tree of depth two (a tree stump) in which the same splitting criterion is used at both nodes at depth two. Constraining the available trees in this manner reduces the complexity of the space of allowable classifiers, which helps avoid over-fitting while yielding a search space that is simple to search exhaustively.
In addition, our aim is to produce a classifier having a high degree of sensitivity with optimum specificity, so we build these criteria into the search through the classifier space: we identify an optimal pair of features and thresholds in the sense of maximizing apparent specificity subject to the constraint that the apparent sensitivity is above a prescribed level.
The search for feature pairs and corresponding thresholds requires calculations involving likelihoods as well as survival probabilities. We introduce these two aspects of the problem one by one in the following sections.
1.1. Sensitivity and Specificity
The most important concepts in the quadrant tree classifier are sensitivity and specificity, statistical measures of the performance of a classification test. Sensitivity measures the proportion of actual positives which are correctly identified as such (in our case, the percentage of "fired" patients who are identified as "fired"); specificity measures the proportion of negatives which are correctly identified (i.e. the percentage of "not-fired" patients who are identified as "not-fired"). They are closely related to the concepts of type I and type II errors.
Consider, as an example, a scenario we have been investigating, in which heart disease patients are implanted with cardioverter-defibrillators. We use firing of the device as a surrogate phenotype for sudden cardiac death; when the device discharges, we say that it has "fired". We are trying to build a model to identify those subjects who will "fire". The model outcomes for patients are "fired" or "not-fired". Since the model is not 100% accurate, the actual "fired" (or "not-fired") status of each person may differ from the prediction. In that setting:
* True "fired":
Patients whose devices fail are correctly identified as "fired"
* False "fired":
Patients whose devices do not fail are wrongly identified as "fired"
* True "not-fired":
Patients whose devices do not fail are correctly identified as "not-fired"
* False "not-fired":
Patients whose devices fail are wrongly identified as "not-fired"
Then:

Sensitivity = number of true "fired" / (number of true "fired" + number of false "not-fired")

i.e. patients who are correctly diagnosed as "fired" / "fired" patients in total

Specificity = number of true "not-fired" / (number of true "not-fired" + number of false "fired")

i.e. patients who are correctly diagnosed as "not-fired" / "not-fired" patients in total
We can put them in a table:

|                           | "fired" patients                       | "not-fired" patients              |
| Identified as "fired"     | A = True "fired"                       | B = False "fired" (Type I error)  |
| Identified as "not-fired" | C = False "not-fired" (Type II error)  | D = True "not-fired"              |
|                           | Sensitivity = A / (A + C)              | Specificity = D / (B + D)         |
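As a quick numerical illustration, here is a minimal R sketch that computes both quantities from the table's counts; the values of A, B, C and D are invented:

# Hypothetical confusion counts (A, B, C, D as in the table above).
A <- 40   # true "fired"
B <- 10   # false "fired" (Type I error)
C <- 5    # false "not-fired" (Type II error)
D <- 45   # true "not-fired"
sensitivity <- A/(A+C)   # 40/45 = 0.889: fraction of "fired" patients caught
specificity <- D/(B+D)   # 45/55 = 0.818: fraction of "not-fired" patients cleared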
1.2. Conditional Probability

Conditional probability plays an important role in our quadrant tree classifier. Conditional probability is the probability of some event A given the occurrence of some other event B, denoted P(A | B). It can be calculated as:

P(A | B) = P(A and B) / P(B)

The numerator is called the joint probability; that is, the probability of both events occurring. It is also written as P(A, B).
Based on conditional probability, we can solve more complicated problems in our setting. For example, suppose A is an event that can be caused by events B1, B2, ..., Bn. If we are given that event A occurred, what is the probability of Bi occurring, i.e. P(Bi | A)? We can use conditional probability (Bayes' theorem) to solve this problem:

P(Bi | A) = P(A | Bi) * P(Bi) / P(A)

where

P(A) = P(A | B1) * P(B1) + ... + P(A | Bn) * P(Bn)

Each term in the formula has a conventional name:
* P(Bi | A) is the conditional probability of Bi, given A. It is also called the posterior probability because it is derived from or depends upon the specified value of A.
* P(Bi) is the prior probability or marginal probability of Bi. It is "prior" in the sense that it does not take into account any information about A.
* P(A) is the prior or marginal probability of A.
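As a small worked example, the following R sketch applies Bayes' theorem to three hypothetical causes B1, B2, B3 of an event A; all numbers are made up:

# Hypothetical priors and likelihoods.
prior <- c(0.5, 0.3, 0.2)       # P(B_i); must sum to 1
lik   <- c(0.10, 0.40, 0.80)    # P(A | B_i)
pA    <- sum(lik*prior)         # P(A) by total probability: 0.33
posterior <- lik*prior/pA       # P(B_i | A): 0.152, 0.364, 0.485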
1.3. Survival Analysis

The last piece of the quadrant tree classifier is survival analysis. It is the part of statistics that specifically models death times and failure times, or more generally, time-to-event data. In our case, a fire (or failure) is considered an "event". Survival analysis attempts to answer questions such as: what is the probability of a patient not firing until a certain time? How do particular circumstances or individual characteristics increase or decrease the probability of firing?
The primary object of interest in survival analysis is the survival function, conventionally denoted S, which is defined as

S(t) = P(T > t)

where t is a time and T is a random variable denoting the time of death. So the survival function is the probability that the time of fire (or failure) is later than some specified time t.
There are many ways to estimate S(t). We will briefly discuss them in the following sections.
1.3.1. Empirical Distribution Function
The first and simplest method counts the subjects who remain in the study, event-free, for at least time t, and takes the ratio of this count to the total number of observations (i.e. the empirical distribution function):

S.hat(t) = (number of subjects observed, event-free, past time t) / (total number of subjects)

This method is the simplest, but not necessarily the best. Subjects who have only been observed for s time units are not counted as survivors if s < t, so we may be leaving out useful information. On the other hand, in the absence of censoring this method does lead to an unbiased estimate of the probability of firing before t time units since implant.
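A minimal R sketch of this estimate on made-up, fully observed (uncensored) event times:

# Complete event times in days (invented), and a target time t0.
event.time <- c(2, 5, 7, 7, 12, 15, 20)
t0 <- 10
# Empirical estimate of S(t0): the fraction observed to survive past t0.
S.hat <- mean(event.time > t0)   # 3/7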
1.3.2. Kaplan-Meier Estimator, Censoring and Truncation
To solve the problem that the empirical distribution function has, namely discarding information from patients who are not observed for a sufficient amount of time, we introduce another method: the Kaplan-Meier estimator (product-limit estimator). It was developed to address censored and truncated data, which is what we mostly deal with in this analysis.
Censoring (or right censoring) happens when patients are not observed for a sufficient amount of time, so we do not know whether and when they get "fired". All we know is that they did not get "fired" while they were in the study. This is the most common type of censoring in our analysis, since many patients participate in the study only for a short time and then decide to leave or lose contact with the hospital. Even so, their cases are still informative, so we want to use them to learn more about firing. To put their cases to use, we take them into consideration only during the period when they are in the study. This is the basic idea of the Kaplan-Meier estimator, and also how the Kaplan-Meier estimator improves on the empirical distribution function.
The Kaplan-Meier estimator uses more information than the empirical distribution function, which makes it more efficient: it takes the patients who leave the study into consideration. For example, to estimate a survival probability at a particular time t, the empirical distribution function counts only those who survive past time t, and all the calculation is based on this group of patients. It ignores patients who drop out in the middle of the study (before time t). These patients may also survive past time t, but they are effectively treated as having "fired" at the moment they leave the study, which is an unreasonable assumption. To estimate the same probability, the Kaplan-Meier estimator instead starts from the beginning of the study, calculates the conditional survival probability at every possible event time before t, and multiplies them together to get a more accurate, or in some sense less biased, estimate than the empirical distribution function.
The Kaplan-Meier estimator can be unbiased under the assumption that censoring times are independent of "fired" times. When these two types of times are independent, censoring does not give us any information about when the censored patients are going to get "fired", so we can treat censoring times as random times. But if they are dependent, then censoring may mean something. For example, if patients tend to drop out of the study when they feel bad or uncomfortable with their cardioverter-defibrillators, then there is a tendency for patients to get "fired" soon after they drop out, or vice versa. In that case, every conditional probability in the Kaplan-Meier estimator is biased, and thus the Kaplan-Meier estimator is biased. In our analysis there is no guarantee that this independence assumption holds, so we cannot say the Kaplan-Meier estimator is unbiased either. But it at least uses all the information we have about firing, so the Kaplan-Meier estimator is usually preferred over the empirical distribution function.
Besides censoring, truncation (or left truncation) is also a major issue in our analysis. It happens when some patients join the study a while after they are implanted with cardioverter-defibrillators, so we do not observe them before they join; everything we observe from them is conditional on the fact that they survived, unfired, up to the time they joined. This problem can also be handled by the Kaplan-Meier estimator, since it takes patients into consideration only while they are in the study, and everything in the Kaplan-Meier estimator is conditional (conditional survival probabilities). So truncation is easily accommodated by the Kaplan-Meier estimator. But again, the Kaplan-Meier estimator is only unbiased when truncation times are independent of "fired" times, for reasons similar to those given for censoring in the previous paragraph. For example, if patients tend to go back to the hospital when they sense that their cardioverter-defibrillators are malfunctioning, then there is a tendency for patients to get fired soon after they join the study, and the Kaplan-Meier estimator is biased in that case.
In general, the Kaplan-Meier estimator is a milestone of survival analysis. It is a very simple idea, and yet it improves the estimation quality significantly. The Kaplan-Meier paper is also one of the top five most referenced papers in this field.
To compute the Kaplan-Meier estimator, we first list all unique death times t1 < t2 < ... < tk, and obtain the number at risk ni and the number of deaths di at each time ti. The number at risk ni is the number of patients who are alive and in the study just before ti. The number of deaths di is the number of patients who die at time ti. Then the Kaplan-Meier estimator is:

S.hat(t) = product over all ti <= t of (1 - di/ni)

If there is no censored or truncated data, then n_{i+1} = ni - di, and the product telescopes, so the Kaplan-Meier estimator becomes

S.hat(t) = (number of subjects with death time > t) / (total number of subjects)

We can see that it is just the empirical distribution function estimate.
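Here is a short R sketch of the calculation, first with the survival package and then repeating the product-limit computation by hand; the follow-up data are invented, and we use the usual convention that a patient censored at time ti is still at risk at ti:

library(survival)
# Invented follow-up data: status 1 = "fired", 0 = censored.
time   <- c(3, 5, 5, 8, 10, 12, 15)
status <- c(1, 0, 1, 1,  0,  1,  0)
fit <- survfit(Surv(time, status) ~ 1)
summary(fit)   # prints n.risk (ni), n.event (di) and the estimate at each ti
# The same product-limit calculation done by hand:
ti <- sort(unique(time[status==1]))                     # distinct firing times
ni <- sapply(ti, function(t) sum(time >= t))            # at risk just before t
di <- sapply(ti, function(t) sum(time==t & status==1))  # firings at t
S.hat <- cumprod(1 - di/ni)   # 0.857, 0.714, 0.536, 0.268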
Other than the methods mentioned above, there are still a great number of estimators. For example, the proportional hazards model is another important one. It is developed under the assumption that each individual's hazard rate function is proportional to the others' over time.
Under the proportional hazards assumption, we can posit a baseline hazard rate, with everyone else's hazard rate function proportional to that baseline. We can then estimate the effect parameter(s) without modeling the baseline hazard function at all. This approach is called the Cox proportional hazards model. The Cox model is a semiparametric method, since only the effect parameters enter the model parametrically while the baseline hazard is left unspecified.
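For contrast with the Kaplan-Meier approach used in our classifier, a Cox fit in R looks like the sketch below; the data frame and the covariate age are purely hypothetical:

library(survival)
# Invented data: time in days, status 1 = fired, 0 = censored.
d <- data.frame(
  time   = c(4, 7, 9, 11, 13, 16, 20, 25),
  status = c(1, 1, 0, 1, 0, 1, 1, 0),
  age    = c(55, 63, 47, 71, 58, 66, 49, 60)
)
fit <- coxph(Surv(time, status) ~ age, data=d)
summary(fit)   # exp(coef) is the hazard ratio per year of age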
With these fundamentals introduced, let us move on to the quadrant tree classifier.
The quadrant tree classifier is developed from the traditional tree model, with only two features and two thresholds. These thresholds break up the feature space into four quadrants based on which of the inequalities x<tx, x>tx, y<ty and y>ty hold; at the same time, the thresholds also break the dataset into four pieces, which will be used for estimating survival probabilities. See the diagram below.
|        | x < tx                 | x > tx                 |
| y > ty | quadrant 2: x<tx, y>ty | quadrant 4: x>tx, y>ty |
| y < ty | quadrant 1: x<tx, y<ty | quadrant 3: x>tx, y<ty |
Within each quadrant i (i = 1, 2, 3, 4), we conduct a survival analysis based on the data falling in that quadrant. From it we can estimate the probability of firing before/after a certain time point (a pre-specified time threshold), i.e. P(fired before threshold | quadrant i) and P(fired after threshold | quadrant i), or simply P(fired | quadrant i) and P(not-fired | quadrant i).
Since we divide the original dataset into four parts, we can estimate the probability that an individual falls into each quadrant by:

P(quadrant i) = (number of patients in quadrant i) / (total number of patients)
Now that we have P(fired | quadrant i) (from the survival analysis) and P(quadrant i), we can apply conditional probability (Bayes' theorem) to get P(quadrant i | fired) and P(quadrant i | not-fired), denoted Pi and Qi respectively:

Pi = P(fired | quadrant i) * P(quadrant i) / sum over j of P(fired | quadrant j) * P(quadrant j)

and similarly for Qi with "not-fired" in place of "fired".
Now we have eight probabilities, Pi and Qi for i = 1, 2, 3, 4. We calculate the likelihood ratio Pi/Qi for every quadrant i and put the quadrants in descending order of likelihood ratio. For example, suppose we have a set of Pi and Qi like this:
| Quadrant | Pi   | Qi   | Likelihood ratio (Pi/Qi) |
| 1        | 0.05 | 0.25 | 0.2                      |
| 2        | 0.2  | 0.1  | 2                        |
| 3        | 0.35 | 0.35 | 1                        |
| 4        | 0.4  | 0.3  | 1.33                     |
After we put them in descending order of likelihood ratio, the table becomes:
| Quadrant       | Pi   | Qi   | Likelihood ratio | Indicator for specificity | (1 - indicator)*Qi   |
| 2              | 0.2  | 0.1  | 2                | 1                         | 0                    |
| 4              | 0.4  | 0.3  | 1.33             | 1                         | 0                    |
| 3              | 0.35 | 0.35 | 1                | 0.86 = (0.9 - 0.6)/0.35   | 0.05 = (1-0.86)*0.35 |
| 1              | 0.05 | 0.25 | 0.2              | 0                         | 0.25 = (1-0)*0.25    |
| Sum (rows 1-2) | 0.6  |      |                  |                           |                      |
| Sum (rows 1-4) | 1    | 1    |                  |                           | 0.3 (specificity)    |
Starting with the quadrant that has the largest likelihood ratio, we sum the Pi's, including quadrants one by one in likelihood ratio order. We stop when the summation reaches the largest value at or below the target sensitivity. In our case we set the sensitivity to 90%, so after we include quadrants 2 and 4 we stop; otherwise the summation would go over 0.9.
Then we assign a 1 to each quadrant whose Pi is included in the summation; call this the "indicator for specificity". For the quadrant that just missed being included, we calculate the difference between the target sensitivity and the summation, divide it by that quadrant's Pi, and assign the ratio to its indicator. In our case, the summation is 0.6, the sensitivity is 0.9, and the quadrant's Pi is 0.35, so for quadrant 3 the indicator is 0.86 = (0.9 - 0.6)/0.35. The remaining quadrants are assigned indicator 0; in our case, that is quadrant 1.
The indicators mean the following: if the indicator is 1, then persons falling in this quadrant will be identified as "fired". If it is 0, then persons falling in this quadrant will be identified as "not-fired". If it is a fraction p, then persons falling in this quadrant will be randomly identified as "fired" with probability p, and as "not-fired" with probability 1-p.
With this setting, we can lock the sensitivity at 90%. This is because sensitivity is the probability of being identified as "fired" given a "fired" patient, i.e. P(identified as "fired" | "fired"), so in our case it can be written as:

sensitivity = sum over quadrants i of (indicator for quadrant i) * Pi = 1*0.2 + 1*0.4 + ((0.9-0.6)/0.35)*0.35 + 0*0.05 = 0.9

With the sensitivity fixed, we can obtain the maximized specificity. Specificity is the probability of being identified as "not-fired" given a "not-fired" patient, i.e. P(identified as "not-fired" | "not-fired"), so it can be written as:

specificity = sum over quadrants i of (1 - indicator for quadrant i) * Qi = 0 + 0 + (1-0.86)*0.35 + (1-0)*0.25 = 0.3
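The following R sketch reproduces the worked example above from the Pi and Qi values, mirroring the indicator construction used in the code of Section 2:

# Pi and Qi from the example table.
P <- c(0.05, 0.20, 0.35, 0.40)   # P(quadrant i | fired)
Q <- c(0.25, 0.10, 0.35, 0.30)   # P(quadrant i | not-fired)
sens <- 0.9
perm <- order(P/Q, decreasing=TRUE)   # quadrants by likelihood ratio: 2,4,3,1
fprob <- rep(0,4)
psum <- 0
i <- 1
while (psum + P[perm[i]] < sens)
{
  fprob[perm[i]] <- 1          # include this quadrant fully
  psum <- psum + P[perm[i]]
  i <- i + 1
}
fprob[perm[i]] <- (sens - psum)/P[perm[i]]   # fractional indicator: 0.857
spec <- sum((1 - fprob)*Q)                   # 0.3, matching the table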
Now, with a time threshold and a sensitivity given, we can repeat the procedure above for different thresholds tx and ty and find the optimal pair with the largest specificity. This is basically how the quadrant tree classifier works.
2.1. Commented Code
library(survival)
#
# Given a vector of event times and a fired indicator (0=censored, 1=fired),
# and a time threshold, compute an estimate of the probability of surviving
# (not firing) past that time threshold. We use linear interpolation of the
# output from the survival package's Kaplan-Meier calculation. Generally, we
# err on the side of making the firing probability greater than needed, to
# be careful.
#
survival.prob<-function(time,fired,time.threshold)
{
# If no data, then return 0, so we err on the side of making everyone fire.
if (length(time)==0)
{
return(0)
}
# Fit a survival model with the Kaplan-Meier method.
S<-survfit(Surv(time,fired)~1,type="kaplan-meier")
# Assign the number of time points in the fitted curve to ntimes.
ntimes<-length(S$time)
# If there is just one time point, or the smallest time is at or past the
# threshold, return 0.
if ((ntimes==1)||(min(S$time)>=time.threshold))
{
return(0)
}
# If the largest time is at or below the threshold, return the last survival
# probability generated by the model.
if (max(S$time)<=time.threshold)
{
return(S$surv[length(S$surv)])
}
# In all other cases, do linear interpolation.
# Find the index of the observation with the largest time below the threshold.
index1<-max(which(S$time<time.threshold))
# The index of the smallest time above the threshold is then index1+1.
index2<-index1+1
# The probability of surviving past the time at index1.
p1<-S$surv[index1]
# The probability of surviving past the time at index2.
p2<-S$surv[index2]
# Linear interpolation
lambda<-(time.threshold-S$time[index1])/(S$time[index2]-S$time[index1])
p<-p1+lambda*(p2-p1)
return(p)
}
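A hypothetical call to survival.prob, with invented follow-up data:

time  <- c(30, 90, 150, 200, 400)   # days of follow-up or firing
fired <- c( 1,  0,   1,   1,   0)   # 1 = fired at that time, 0 = censored
survival.prob(time, fired, 180)     # estimated P(no firing within 180 days)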
build.quadrant.classifier<-function(x,y,implanted,days.to.firing,tx.list,ty.list,sens,time.threshold)
{
#
# Inputs:
#
#   x              = continuous feature variable vector
#   y              = continuous feature variable vector
#   implanted      = days of follow-up since implant (used as the
#                    censoring time when no firing is observed)
#   days.to.firing = days from implant to firing (NA if no firing observed)
#   tx.list        = list of x thresholds
#   ty.list        = list of y thresholds
#   sens           = desired sensitivity
#   time.threshold = time defining the "fired within" phenotype
#
# Output:
#
#   an assignment of probabilities to the cells:
#
#     cell 1: x<tx, y<ty
#     cell 2: x<tx, y>ty
#     cell 3: x>tx, y<ty
#     cell 4: x>tx, y>ty
#
# This function uses the R "survival" package to compute
# Kaplan-Meier probabilities.
#
# Create a time variable and an indicator of fired.
#
# Assign lx the length of x.
lx<-length(x)
# Assign fired an array of 0s with length lx.
fired<-rep(0,lx)
# Assign time an array of 0s with length lx.
time<-rep(0,lx)
# Handle NAs in days.to.firing and build (time, fired) for each observation.
for (i in seq(1,lx))
{
# For each i, i.e. each observation:
# If days.to.firing is NA (no firing observed), set fired[i] to 0 and
# use the follow-up time implanted[i] as the censoring time.
if (is.na(days.to.firing[i]))
{
fired[i]<-0
time[i]<-implanted[i]
} else
# Otherwise, if days.to.firing[i] is smaller than time.threshold, set
# fired[i] to 1 and copy days.to.firing[i] to time[i]; else set
# fired[i] to 0 and copy implanted[i] to time[i].
{
if (days.to.firing[i]<time.threshold)
{
fired[i]<-1
time[i]<-days.to.firing[i]
}else
{
fired[i]<-0
time[i]<-implanted[i]
}
}
}
#
# Get rid of missing data.
#
# Combine x, y, time, and fired into a data frame named d.temp.
d.temp<-data.frame(x=x,y=y,time=time,fired=fired)
# Remove all observations with missing values.
d.temp<-d.temp[complete.cases(d.temp),]
#
# Create a list with spec=0 to hold the best classifier found so far.
best.L<-list(spec=0)
# Loop over every combination of tx and ty to find an optimal pair with
# maximum specificity.
for (tx in tx.list)
{
for (ty in ty.list)
{
#
# Break up the data into the four cells based on tx and ty.
#
data<-list()
data[[1]]<-d.temp[(d.temp$x<tx)&(d.temp$y<ty),]
data[[2]]<-d.temp[(d.temp$x<tx)&(d.temp$y>ty),]
data[[3]]<-d.temp[(d.temp$x>tx)&(d.temp$y<ty),]
data[[4]]<-d.temp[(d.temp$x>tx)&(d.temp$y>ty),]
#
# Compute prior probabilities of the cells.
#
p.cells<-c(
dim(data[[1]])[1],
dim(data[[2]])[1],
dim(data[[3]])[1],
dim(data[[4]])[1])
p.cells<-p.cells/sum(p.cells)
#
# Compute the fire survival probabilities.
# Note the linear interpolation.
#
# Create an array for P[fire | cell = i].
p.fire<-c()
# Create an array for P[not fire | cell = i].
p.notfire<-c()
#
# Loop over the four cells.
for (i in seq(1,4))
{
#print("i = ")
#print(i)
# Store time for cell i
time<-data[[i]]$time
# Store fired for cell i
fired<-data[[i]]$fired
# Do survival analysis within cell i
p<-survival.prob(time,fired,time.threshold)
#print("p = ")
#print(p)
# Store P[not fire | cell = i] to p.notfire
p.notfire<-c(p.notfire,p)
# Store P[fire | cell = i] to p.fire
p.fire<-c(p.fire,1-p)
}
#print("end of loop p.fire p.notfire= ")
#print(p.fire)
#print(p.notfire)
#
# We estimated P[fire | cell = i] and P[not fire | cell = i] for
# each i, and we know P[cell = i]. Use Bayes' theorem to get the
# cell distributions under each class:
#
#   p = P[cell = i | fired]
#   q = P[cell = i | not fired]
#
# Calculate P[cell = i, fired].
p<-p.fire*p.cells
# Normalize to get P[cell = i | fired].
p<-p/sum(p)
# Calculate P[cell = i, not fired].
q<-p.notfire*p.cells
# Normalize to get P[cell = i | not fired].
q<-q/sum(q)
#
# Compute likelihood ratios.
#
lr<-p/q
#
# List the cell indices 1,2,3,4 in decreasing order of likelihood ratio.
#
perm<-order(lr, decreasing=TRUE)
#
# Define the cell probabilities for the classifier. The rule is to
# assign a probability of 1 to each cell, in likelihood ratio order,
# until we reach the desired sensitivity. The last probability can be
# a fraction in order to hit the target exactly.
#
#print("lr = ")
#print(lr)
#print("p = ")
#print(p)
#print("q = ")
#print(q)
#
# Start with all cell probabilities at 0.
fprob<-rep(0,4)
# Running sum of p over the included cells, and the loop index.
psum<-0
i<-1
# Keep adding whole cells while the next one still fits below the
# target sensitivity.
while(psum+p[perm[i]]<sens)
{
# Include this cell fully: classify it as "fired".
fprob[perm[i]]<-1
# Add its probability to the running sum.
psum<-psum+p[perm[i]]
i<-i+1
}
# For the last cell, compute a fractional probability so the
# sensitivity equals sens exactly.
fprob[perm[i]]<-(sens-psum)/p[perm[i]]
#
# Compute the specificity.
#
spec<-sum((1-fprob)*q)
# If we find a larger specificity, replace best.L with the new one.
if (spec>best.L$spec)
{
best.L<-list(fprob=fprob,spec=spec,tx=tx,ty=ty,p=p,q=q)
}
}
}
# Return best.L.
best.L
}
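A hypothetical end-to-end call, with simulated features, follow-up data and threshold grids (all invented for illustration):

set.seed(1)
n <- 200
x <- rnorm(n)                               # first feature
y <- rnorm(n)                               # second feature
implanted <- round(runif(n, 100, 1000))     # days of follow-up since implant
days.to.firing <- ifelse(runif(n) < 0.3,
                         round(runif(n, 1, 900)), NA)   # NA = never fired
cls <- build.quadrant.classifier(x, y, implanted, days.to.firing,
                                 tx.list=quantile(x, c(0.25, 0.5, 0.75)),
                                 ty.list=quantile(y, c(0.25, 0.5, 0.75)),
                                 sens=0.9, time.threshold=365)
cls$tx; cls$ty     # the chosen thresholds
cls$fprob          # per-cell probabilities of being called "fired"
cls$spec           # the maximized specificity at 90% sensitivity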
2.2. Uncommented Code
library(survival)
survival.prob<-function(time,fired,time.threshold)
{
if (length(time)==0)
{
return(0)
}
S<-survfit(Surv(time,fired)~1,type="kaplan-meier")
ntimes<-length(S$time)
if ((ntimes==1)||(min(S$time)>=time.threshold))
{
return(0)
}
if (max(S$time)<=time.threshold)
{
return(S$surv[length(S$surv)])
}
index1<-max(which(S$time<time.threshold))
index2<-index1+1
p1<-S$surv[index1]
p2<-S$surv[index2]
lambda<-(time.threshold-S$time[index1])/(S$time[index2]-S$time[index1])
p<-p1+lambda*(p2-p1)
return(p)
}
build.quadrant.classifier<-function(x,y,implanted,days.to.firing,tx.list,ty.list,sens,time.threshold)
{
lx<-length(x)
fired<-rep(0,lx)
time<-rep(0,lx)
for (i in seq(1,lx))
{
if (is.na(days.to.firing[i]))
{
fired[i]<-0
time[i]<-implanted[i]
} else
{
if (days.to.firing[i]<time.threshold)
{
fired[i]<-1
time[i]<-days.to.firing[i]
}else
{
fired[i]<-0
time[i]<-implanted[i]
}
}
}
d.temp<-data.frame(x=x,y=y,time=time,fired=fired)
d.temp<-d.temp[complete.cases(d.temp),]
best.L<-list(spec=0)
for (tx in tx.list)
{
for (ty in ty.list)
{
data<-list()
data[[1]]<-d.temp[(d.temp$x<tx)&(d.temp$y<ty),]
data[[2]]<-d.temp[(d.temp$x<tx)&(d.temp$y>ty),]
data[[3]]<-d.temp[(d.temp$x>tx)&(d.temp$y<ty),]
data[[4]]<-d.temp[(d.temp$x>tx)&(d.temp$y>ty),]
p.cells<-c(
dim(data[[1]])[1],
dim(data[[2]])[1],
dim(data[[3]])[1],
dim(data[[4]])[1])
p.cells<-p.cells/sum(p.cells)
p.fire<-c()
p.notfire<-c()
for (i in seq(1,4))
{
time<-data[[i]]$time
fired<-data[[i]]$fired
p<-survival.prob(time,fired,time.threshold)
p.notfire<-c(p.notfire,p)
p.fire<-c(p.fire,1-p)
}
p<-p.fire*p.cells
p<-p/sum(p)
q<-p.notfire*p.cells
q<-q/sum(q)
lr<-p/q
perm<-order(lr, decreasing=TRUE)
fprob<-rep(0,4)
psum<-0
i<-1
while(psum+p[perm[i]]<sens)
{
fprob[perm[i]]<-1
psum<-psum+p[perm[i]]
i<-i+1
}
fprob[perm[i]]<-(sens-psum)/p[perm[i]]
spec<-sum((1-fprob)*q)
if (spec>best.L$spec)
{
best.L<-list(fprob=fprob,spec=spec,tx=tx,ty=ty,p=p,q=q)
}
}
}
best.L
}