Merge Datasets Instructions

 

Document Information:

Package base version 2.7.1

9/19/2008

 

Project 5: Statistical Learning with Multi-Scale Cardiovascular Data

Contact email: [email protected]

CardioVascular Research Grid

Johns Hopkins University

Index

1. List and Data Frame

(1) List

(2) Data Frame

2. Merging Data Frames

(1) merge()

(2) read.table() and read.csv()

3. Example Description

4. Links to Datasets

5. R Code

(1) Commented Code

(2) Uncommented Code

6. References

 

 

1. List and Data Frame

 

(1) List

 

An R list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on.

 

Here is a simple example of how to make a list:

> Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))

 

Components of lists may be named. A component can be referred to by its number in double square brackets (Lst[[1]]), by its name as a character string in double square brackets (Lst[["name"]]), or by its name after a dollar sign (Lst$name). For example:

 

> Lst[[1]]

[1] "Fred"

> Lst$wife

[1] "Mary"

> Lst[[2]]

[1] "Mary"

> Lst$child.ages

[1] 4 7 9

 

This is a very useful convention as it makes it easier to get access to each individual component of a list.
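
The access rules above can be tried out with a short sketch. It rebuilds the Lst example from this section; everything it calls is base R.

```r
# The same list as in the example above
Lst <- list(name = "Fred", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))

names(Lst)               # the component names
Lst[["no.children"]]     # same component as Lst$no.children and Lst[[3]]
Lst[["child.ages"]][2]   # second element of the child.ages vector

# Single brackets return a sub-list, double brackets return the component itself
class(Lst[1])            # "list"
class(Lst[[1]])          # "character"
length(Lst)              # number of top-level components
```

The difference between Lst[1] and Lst[[1]] matters later on, because a data frame is itself a list of columns.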

 


(2) Data Frame

 

A data frame is the format we use most often for datasets in R. Formally, a data frame is a list with class "data.frame": each component of the list is one column (variable) of the dataset, and all columns have the same number of rows. There are restrictions on lists that may be made into data frames, namely:

 

(1) The components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames.

(2) Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively.

(3) Numeric vectors, logicals and factors are included as is. Character vectors are coerced to factors whose levels are the unique values appearing in the vector; this is the default in the R version documented here, while from R 4.0.0 onward data.frame() keeps them as character unless stringsAsFactors=TRUE is given.

(4) Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same number of rows.

 

A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions, for example:

 

> x<-data.frame(a=1:4,b=c(TRUE,TRUE,FALSE,TRUE),c=c("A","B","C","D"),d=13:16)

> x

   a     b     c   d

1  1  TRUE   A  13

2  2  TRUE   B  14

3  3  FALSE  C  15

4  4  TRUE   D  16

> x[,1] # the 1st column of x

[1] 1 2 3 4

> x[,2] # the 2nd column of x

[1]  TRUE  TRUE  FALSE  TRUE

> x[1,] # the 1st row of x

   a    b    c   d

1  1  TRUE  A  13

> x[3,] # the 3rd row of x

   a      b    c  d

3  3  FALSE  C  15

> x[3,4] # the element in 3rd row and 4th column

[1] 15
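
Because a data frame is a list of columns, the list access rules and the matrix indexing rules both apply to it. The sketch below rebuilds the data frame x from the example above; stringsAsFactors=FALSE is set explicitly (an addition of this sketch) so that column c is a character vector in every R version.

```r
x <- data.frame(a = 1:4,
                b = c(TRUE, TRUE, FALSE, TRUE),
                c = c("A", "B", "C", "D"),
                d = 13:16,
                stringsAsFactors = FALSE)

is.list(x)     # TRUE: a data frame is a list of columns
class(x)       # "data.frame"
dim(x)         # number of rows and number of columns
x$c            # a column by name, as for a list component
x[["d"]][3]    # 15, the same cell as x[3, 4]
x[x$b, ]       # the rows where column b is TRUE
str(x)         # compact summary of the column types
```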

 


2. Merging Data Frames (function: merge() )

 

(1) merge()

 

We need to merge data frames because a project often comes with several data files, each containing one piece of the information. For a thorough analysis, the files must be put together in an organized way so that the analysis can be run on the whole dataset.

 

merge() puts two data frames together based on their common columns or row names. By default it matches on the columns whose names appear in both data frames; the arguments by.x and by.y name the key columns explicitly when they are called differently in the two data frames. After the merging, the rows are by default sorted lexicographically on the common columns, unless sort = FALSE is specified in the merge() call. The other arguments are explained in the R help file, which opens when you type ?merge in the R console. Merging datasets example:

 

> authors

   surname     nationality  deceased

1    Tukey          US      yes

2 Venables     Australia       no

3  Tierney          US       no

4   Ripley          UK       no

5   McNeil    Australia       no

> books

      name                         title     other.author

1    Tukey     Exploratory Data Analysis             <NA>

2 Venables      Modern Applied Statistics ...           Ripley

3  Tierney                   LISP-STAT             <NA>

4   Ripley             Spatial Statistics              <NA>

5   Ripley         Stochastic Simulation             <NA>

6   McNeil     Interactive Data Analysis              <NA>

7   R Core          An Introduction to R   Venables & Smith

> merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)

   surname nationality deceased                         title

1   McNeil   Australia       no     Interactive Data Analysis

2   R Core        <NA>     <NA>          An Introduction to R

3   Ripley          UK       no            Spatial Statistics

4   Ripley          UK       no         Stochastic Simulation

5  Tierney          US       no                     LISP-STAT

6    Tukey          US      yes     Exploratory Data Analysis

7 Venables   Australia       no Modern Applied Statistics ...

 

      other.author

1             <NA>

2 Venables & Smith

3             <NA>

4             <NA>

5             <NA>

6             <NA>

7           Ripley
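
To experiment with the all, all.x and all.y arguments, the two data frames can be retyped by hand. The sketch below is one way to do that; the values are copied from the listing above, and stringsAsFactors=FALSE is an addition of this sketch so that the text columns stay character.

```r
authors <- data.frame(
  surname     = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased    = c("yes", "no", "no", "no", "no"),
  stringsAsFactors = FALSE)

books <- data.frame(
  name  = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"),
  title = c("Exploratory Data Analysis", "Modern Applied Statistics ...",
            "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
            "Interactive Data Analysis", "An Introduction to R"),
  other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith"),
  stringsAsFactors = FALSE)

# Only keys present in both data frames
inner <- merge(authors, books, by.x = "surname", by.y = "name")
# Every author, whether or not a book matches
left  <- merge(authors, books, by.x = "surname", by.y = "name", all.x = TRUE)
# Every key from either data frame, as in the example above
full  <- merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)

nrow(inner)   # 6
nrow(left)    # 6
nrow(full)    # 7
```

Ripley appears twice in books, so every result has two Ripley rows. The "R Core" row has no match in authors and therefore only appears when all = TRUE (or all.y = TRUE) is given.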

 

Another example:

> x

  k1 k2 data

1 NA  1    1

2 NA NA    2

3  3 NA    3

4  4  4    4

5  5  5    5

> y

  k1 k2 data

1 NA NA    1

2  2 NA    2

3 NA  3    3

4  4  4    4

5  5  5    5

> merge(x, y, by=c("k1","k2")) # NA's match

  k1 k2 data.x data.y

1  4  4      4      4

2  5  5      5      5

3 NA NA      2      1

> merge(x, y, by=c("k1","k2"),all=TRUE) # NA's match

  k1 k2 data.x data.y

1  2 NA     NA      2

2  3 NA      3     NA

3  4  4      4      4

4  5  5      5      5

5 NA  1      1     NA

6 NA  3     NA      3

7 NA NA      2      1
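
The NA matching shown above can be surprising when the keys are identifiers. The sketch below rebuilds x and y from the listing and then shows one possible way to keep NA keys from matching, by dropping rows with an incomplete key before the merge. The helper names x_ok and y_ok are made up for this sketch.

```r
x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)

both_keys <- merge(x, y, by = c("k1", "k2"))  # (NA, NA) in x matches (NA, NA) in y
one_key   <- merge(x, y, by = "k1")           # every NA key in x matches every NA key in y

# Keep only the rows whose key columns are both present, then merge
x_ok <- x[complete.cases(x[, c("k1", "k2")]), ]
y_ok <- y[complete.cases(y[, c("k1", "k2")]), ]
no_na <- merge(x_ok, y_ok, by = c("k1", "k2"))

nrow(both_keys)  # 3
nrow(one_key)    # 6
nrow(no_na)      # 2
```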

 

In order to merge data frames, all individual data files must first be read into R as data frames. This is what read.table() and read.csv() are for.

 


(2) read.table() and read.csv()

 

Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. If variables are to be held mainly in data frames, as we strongly suggest they should be, an entire data frame can be read directly with the read.table() function.

 

To read an entire data frame directly, the external file will normally have a special form.

 

(1) The first line of the file should have a name for each variable in the data frame.

(2) Each additional line of the file has as its first item a row label, followed by the values for each variable.

 

For example,

Input file form with names and row labels:

       Price    Floor     Area   Rooms     Age  Cent.heat

01     52.00    111.0      830       5     6.2      no

02     54.75    128.0      710       5     7.5      no

03     57.50    101.0     1000       5     4.2      no

04     57.50    131.0      690       6     8.8      no

05     59.75     93.0      900       5     1.9     yes

...

 

The function read.table() can then be used to read the data frame directly:

> HousePrice <- read.table("houses.data", header=TRUE, as.is=TRUE)

 

where the header=TRUE option specifies that the first line is a line of headings. Because that first line has one fewer field than the data lines, read.table() takes the first item of each data line as the row label. as.is=TRUE means that character variables are not converted to factors. In the R version documented here, read.table() and read.csv() by default convert character variables (those not converted to logical, numeric or complex) to factors; from R 4.0.0 onward the default is to leave them as character.

 

If the input data file is in csv format, then use read.csv() with arguments defined above.
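
The following sketch reads the small file shown above. Since "houses.data" is not distributed with this page, the sketch first writes the five example lines to a temporary file; the temporary path and the variable name houses_file are assumptions of this sketch.

```r
# Write the example lines to a temporary file standing in for "houses.data"
houses_file <- tempfile(fileext = ".data")
writeLines(c(
  "Price Floor Area Rooms Age Cent.heat",
  "01 52.00 111.0 830 5 6.2 no",
  "02 54.75 128.0 710 5 7.5 no",
  "03 57.50 101.0 1000 5 4.2 no",
  "04 57.50 131.0 690 6 8.8 no",
  "05 59.75 93.0 900 5 1.9 yes"), houses_file)

HousePrice <- read.table(houses_file, header = TRUE, as.is = TRUE)

dim(HousePrice)             # 5 rows and 6 variables; the row labels are not a variable
rownames(HousePrice)        # the row labels taken from the first item of each line
sapply(HousePrice, class)   # Cent.heat stays character because as.is = TRUE
```

For a comma-separated file the call would look the same with read.csv() in place of read.table().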

 


3. Example Description

 

Now we work through a simple example on real data. We first read several data files into R, then clean and recode each one as the analysis requires, and finally merge the files into a single dataset. The example uses most of the functions and commands commonly needed for data organization, so understanding it should help you with data manipulation in R in general. The next two sections give the links to the data and the R scripts for this example.

 


4. Links to Datasets

Here are the datasets used in the example:

(1) JHUcomb.csv (ECG)

 

(2) icd.data.oct11.2007.csv (SNP)

A cleaned version of the September file: the JHUIDs are in the common format, with any "new", "A" or "B" removed.

 

(3) age.gender.csv (AGE.GENDER)

 

(4) ReynRaceData-100207.csv (RACE)

 

(5) firing.datainducibility.csv (IND)

 

(6) img.data11.19.07.csv (IMAGE)

 


5. R Code

 

Here is the code used for the data above. There are two editions of the code. One is heavily commented for users with limited experience of R; the other has no comments at all. They perform the same steps, except that the uncommented edition ends with two extra lines that create CLINICAL.IND, so it is up to you which one to use.

 

(1) Commented Code

#

#

# Merge

#

#   1) JHUcomb.csv (ECG)

#   2) icd.data.oct11.2007.csv (SNP) -- cleaned version of sept so that the

#                                        JHUIDs are in the common format

#                                        got rid of "new" or "A" or "B"

#   3) age.gender.csv (AGE.GENDER)

#   4) ReynRaceData-100207.csv (RACE)

#   5) firing.datainducibility.csv (IND)

#   6) img.data11.19.07.csv (IMAGE)

#

# Read data from file "JHUcomb.csv" in the current folder

# which is specified under R menu->File->Change directory.

#

# The file is in csv format. We use the function

# read.csv("file path").

#

# The "as.is=T" means not converting character variables to

# factors. read.csv() by default converts the character

# variables (which are not converted to logical, numeric or

# complex) to factors.

#

ECG<-read.csv("JHUcomb.csv", as.is=T)

#

# Read data from "icd.data.oct.11.2007.csv" in the current

# folder. And do not convert character variables to factors.

#

SNP<-read.csv("icd.data.oct.11.2007.csv",as.is=T)

#

# Read data from "age.gender.csv" in the current

# folder. And do not convert character variables to factors.

#

AGE.GENDER<-read.csv("age.gender.csv",as.is=T)

#

# Read data from "ReynRaceData-100207.csv" in the current

# folder. And do not convert character variables to factors.

#

RACE<-read.csv("ReynRaceData-100207.csv",as.is=T)

#

# Read data from "firing.data.4.09.2008.csv" in the current

# folder. And do not convert character variables to factors.

#

IND<-read.csv( "firing.data.4.09.2008.csv",as.is=T)

#

# Read data from "img.data11.19.07.csv" in the current

# folder. And do not convert character variables to factors.

#

IMAGE<-read.csv("img.data11.19.07.csv",as.is=T)

#

# Need to create an ID variable for the ECG data

# based on the first column which is a filename.

#

# Assign N.ECG the value of number of rows in ECG dataframe.

#

N.ECG<-dim(ECG)[1]

#

# Add a column called "ID" in ECG dataframe. Assign all ""s

# to the column.

#

# rep("",N.ECG) means generating a vector by repeating ""

# N.ECG times.

#

ECG$ID<-rep("",N.ECG)

#

# Loop over every row of the ECG data frame. In each row, if

# the first column is NA, then assign NA to the ID column.

# Otherwise, if the first character of the first column is a ".",

# then assign to the ID column "JHU" followed by the substring of

# the first column (from the 7th character to the 9th character),

# pasted together without any separator.

# If it is not a ".", then do the same assignment except using

# the substring from the 4th character to the 6th character.

#

for (i in seq(1,N.ECG))

{

  #

  # is.na() detects whether a value is NA or not. It returns

  # either TRUE or FALSE.

  #

  # ECG[i,1] means the element in the ith row and 1st column.

  #

  if (is.na(ECG[i,1]))

  {

     # assign the element in ith row and ID column the value of NA

     #

     ECG$ID[i]<-NA

  }

  else

  {

     #

     # Determine whether the 1st character of the string in ith row

     # and 1st column is a "."

     #

     # == is an operator to detect whether both sides are equal

     # It returns a logical value of TRUE or FALSE.

     #

     if (substring(ECG[i,1],1,1)==".")

     {

        # assign the element in ith row and ID column a string, which

        # comprise "JHU" and from 7th to 9th characters in the string

        # in the element in ith row and 1st column.

        #

        ECG$ID[i]<-paste("JHU",substring(ECG[i,1],7,9),sep="")

     }

     else

     { 

        # assign the element in ith row and ID column a string, which

        # comprise "JHU" and from 4th to 6th characters in the string

        # in the element in ith row and 1st column.

        #

        ECG$ID[i]<-paste("JHU",substring(ECG[i,1],4,6),sep="")

     }

  }

}

#

# Recode the -999's in ECG as NA's

#

# Look through each row and each column to see if there is a -999 and

# replace it with NA.

#

# Look through each row in ECG.

#

for (i in seq(1,dim(ECG)[1]))

{

   # Look through each column.

   for (j in seq(1,dim(ECG)[2]))

   {

      # Determine whether the current cell is not a NA.

      #

      # ! is an operator for "not". For example. "!(1==2)" is TRUE

      if (!is.na(ECG[i,j]))

      {

         # Determine whether the current cell is -999, if TRUE, then

         # assign the current cell a NA.

         if (ECG[i,j]==-999)

         {

            ECG[i,j]<-NA

         }

      }

   } 

}

#

# Make all SNP calls that equal ERROR, UNDETERMINED or - into NA's

#

# Look through each row and in column 2 to 7, find all the cells that are

# ERROR, UNDETERMINED or -, replace them with NA's.

#

# create indicator for all SNP's having been called

#

# Look through each row

#

for (i in seq(1,dim(SNP)[1]))

{

   # Look through each element from 2nd column to 7th column.

   for (j in seq(2,7))

   {

     # Determine whether the current cell is ERROR, UNDETERMINED or -. If TRUE,

     # then rewrite it as NA. %in% returns FALSE for a cell that is already

     # NA, so such a cell does not stop the loop with an error.

     if (SNP[i,j] %in% c("UNDETERMINED","-","ERROR"))

     {

        SNP[i,j]<-NA

     }

   } 

}

# Rename the SNP columns as "ID", "snp1", "snp2", "snp3", "snp4", "snp5"

# and "snp6".

#

names(SNP)<-c("ID","snp1","snp2","snp3","snp4","snp5","snp6")

#

# Recode blank gender as NA

#

# Look through each row in column Gender in AGE.GENDER dataframe. Replace

# all the blank cells with NA's.

#

for (i in seq(1,dim(AGE.GENDER)[1]))

{

   # Determine whether the current cell is blank. If so, then assign it NA.

   # The is.na() check skips cells that are already NA.

   if (!is.na(AGE.GENDER$Gender[i]) && AGE.GENDER$Gender[i]=="")

   {

      AGE.GENDER$Gender[i]<-NA

   }

}

#

# Recode RACE as NA if it is not A, B, W or O

#

# Look through the column Race in data frame RACE.

#

for (i in seq(1,dim(RACE)[1]))

{

   # Determine if the current cell is A, B, W or O. If not, assign it NA.

   # %in% returns FALSE for NA, so a cell that is already NA stays NA.

   if (!(RACE$Race[i] %in% c("A","B","W","O")))

         {

           RACE$Race[i]<-NA

         }

}

#

# Rename the 1st column (PID variable) of data frame SNP, AGE.GENDER,

# RACE as ID.

#

names(AGE.GENDER)[1]<-"ID"

names(RACE)[1]<-"ID"

names(SNP)[1]<-"ID"

#

# Rename the IND Study.ID variable as ID

#

names(IND)[1]<-"ID"

#

#

# Rename the IMAGE ReynoldsNum variable as ID

#

names(IMAGE)[1]<-"ID"

#

# Clean the inducibility data so that

# (a) the ID's don't have trailing -I

# (b) the phenotype is either yes, no or NA

#

# Note that IND$Inducible is the variable telling us if we have

# inducible data (1) or not (0)

#

# Assign L the number of rows in data frame IND

#

L<-dim(IND)[1]

#

#

#

for (i in seq(1,L))

{

   #

   # fix the ID by extracting the first 6 characters

   #

   IND$ID[i]<-substr(IND$ID[i],1,6)

   # Determine if the Inducible variable is a NA

   if (!is.na(IND$Inducible[i]))

   {

      # Determine if the Inducible variable is "no" or "yes"

      if ((IND$Inducible[i]!="no")&&(IND$Inducible[i]!="yes"))

      { 

         # If the Inducible variable is "no, ", then we change it to

         # "no".

         if (IND$Inducible[i]=="no, ")

         {

            IND$Inducible[i]<-"no"

         }

         # For all other cases, we assign it a NA.

         else

         {

            IND$Inducible[i]<-NA

         }

      }

   }

}

#

# Create indicators that just tell us if an ID is in a dataset

#

# Add a column in SNP named IDIN.SNP.IND with all 1s.

#

SNP$IDIN.SNP.IND<-rep(1,dim(SNP)[1])

#

# Add a column in ECG named IDIN.ECG.IND with all 1s.

#

ECG$IDIN.ECG.IND<-rep(1,dim(ECG)[1])

#

# Add a column in AGE.GENDER named IDIN.AGE.GENDER.IND with all 1s.

#

AGE.GENDER$IDIN.AGE.GENDER.IND<-rep(1,dim(AGE.GENDER)[1])

#

# Add a column in RACE named IDIN.RACE.IND with all 1s.

#

RACE$IDIN.RACE.IND<-rep(1,dim(RACE)[1])

#

# Add a column in IND named IDIN.IND.IND with all 1s.

#

IND$IDIN.IND.IND<-rep(1,dim(IND)[1])

#

# Add a column in IMAGE named IDIN.IMAGE.IND with all 1s.

#

IMAGE$IDIN.IMAGE.IND<-rep(1,dim(IMAGE)[1])

#

# Add a column in IMAGE named IMAGE.IND with all 1s.

#

IMAGE$IMAGE.IND<-rep(1,dim(IMAGE)[1])

#

# Merge ECG and SNP data frames together by the common column ID, and name

# it d1.

# With all=TRUE, every row of both data frames is kept. A row of one data

# frame that has no matching row in the other is added to the output with

# NAs in the columns that would be filled from the other data frame.

#

d1<-merge(ECG,SNP,by.x="ID",by.y="ID",all=TRUE)

#

# Merge d1 and AGE.GENDER data frames together by the common column ID,

# and name it d2.

#

d2<-merge(d1,AGE.GENDER,by.x="ID",by.y="ID",all=TRUE)

#

# Merge d2 and RACE data frames together by the common column ID, and

# name it d3.

#

d3<-merge(d2,RACE,by.x="ID",by.y="ID",all=TRUE)

#

# Merge d3 and IND data frames together by the common column ID, and

# name it d4.

#

d4<-merge(d3,IND,by.x="ID",by.y="ID",all=TRUE)

#

# Merge d4 and IMAGE data frames together by the common column ID, and

# name it d5.

#

d5<-merge(d4,IMAGE,by.x="ID",by.y="ID",all=TRUE)

#

# Rename d5 as d.

#

d<-d5

# remove the variable d1, d2, d3, d4, d5.

rm(d1)

rm(d2)

rm(d3)

rm(d4)

rm(d5)

#

# Create indicators of data available

#

# If all the cells in the same row in column snp1, snp2, snp3, snp4, snp5

# snp6 are not missing value (NA), then assign the indicator TRUE. Otherwise,

# assign indicator FALSE. Name the indicator SNP.ALL.IND.

#

d$SNP.ALL.IND<-complete.cases(d$snp1,d$snp2,d$snp3,d$snp4,d$snp5,d$snp6)

#

# If the cell in column QTVI_log is not missing value (NA), then assign

# the indicator TRUE. Otherwise, assign indicator FALSE. Name the

# indicator ECG.IND.

#

d$ECG.IND<-complete.cases(d$QTVI_log)

#

# If the cell in column Birth.Year.x is not missing value (NA), then assign

# the indicator TRUE. Otherwise, assign indicator FALSE. Name the indicator

# AGE.IND.

#

d$AGE.IND<-complete.cases(d$Birth.Year.x)

#

# If the cell in column Gender.x is not missing value (NA), then assign

# the indicator TRUE. Otherwise, assign indicator FALSE. Name the indicator

# GENDER.IND.

#

d$GENDER.IND<-complete.cases(d$Gender.x)

#

# If the cell in column Race is not missing value (NA), then assign

# the indicator TRUE. Otherwise, assign indicator FALSE. Name the indicator

# RACE.IND.

#

d$RACE.IND<-complete.cases(d$Race)

#

# If the cell in column Inducible is not missing value (NA), then assign

# the indicator TRUE. Otherwise, assign indicator FALSE. Name the indicator

# IND.IND.

#

d$IND.IND<-complete.cases(d$Inducible)

#

# If the cell in column DEmass is not missing value (NA), then assign

# the indicator TRUE. Otherwise, assign indicator FALSE. Name the indicator

# IMAGE.IND.

#

d$IMAGE.IND<-complete.cases(d$DEmass)

#

# Filter out the non-adults

#

#

# Set missing birth years to zero

#

d$Birth.Year.x[is.na(d$Birth.Year.x)]<-0

d$Birth.Year.y[is.na(d$Birth.Year.y)]<-0

#

# Create a new Birth.Year variable:

#    if Birth.Year.x is missing, take Birth.Year.y, otherwise take

#    Birth.Year.x

#

d$Birth.Year<-d$Birth.Year.x+d$Birth.Year.y*(d$Birth.Year.x==0)

#

# Keep only those born in or before 1995.

#

# d[d$Birth.Year<=1995,] means all rows in d that have Birth.Year less than

# or equal to 1995. Rows with no birth year were set to 0 above, so they

# are kept as well.

#

d<-d[d$Birth.Year<=1995,]

#

# Convert firing & implant dates to date format and create an indicator for

# implantation.

#

# If Implant.Date is not a NA, assign TRUE to IMPLANT.IND. Otherwise, assign

# FALSE.

#

d$IMPLANT.IND<-complete.cases(d$Implant.Date)

#

# Convert Firings from a character string (e.g. "09/21/2008") to the Date

# class and name the result Firing.Date.

#

d$Firing.Date<-as.Date(d$Firings,format="%m/%d/%Y")

#

# Convert Implant.Date from a character string (e.g. "09/21/2008") to the

# Date class and still name it Implant.Date.

#

d$Implant.Date<-as.Date(d$Implant.Date,format="%m/%d/%Y")

#

# Calculate the number of TRUEs in column IMPLANT.IND

#

sum(d$IMPLANT.IND)

#

# Calculate the number of non-NA values in column Implant.Date

#

sum(!is.na(d$Implant.Date))

#

# Calculate the days between Firing.Date and Implant.Date, assign it to

# Days.To.Firing

#

d$Days.To.Firing<-d$Firing.Date-d$Implant.Date

#

# Create an indicator FIRED.IND that is TRUE where Days.To.Firing is not NA.

#

d$FIRED.IND<-!is.na(d$Days.To.Firing)

#

# Compute days to today, assuming today is March 4, 2008

#

today<-as.Date("3/04/2008",format="%m/%d/%Y")

d$Days.Of.Implant<-today-d$Implant.Date

#

# Create an indicator APP.FIRED.IND for AP.vs.IAP. If AP.vs.IAP is "AP",

# then assign the indicator TRUE, otherwise, assign FALSE.

#

d$APP.FIRED.IND<-(d$AP.vs.IAP=="AP")

#

#

# Write data frame d to a csv file

#

write.csv(d,file="data.csv",row.names=F)

#

#

# Make data frame of those for which we have inducibility data

#

dind<-d[d$IND.IND,]

#

#

# Write data frame "dind" to a csv file

#

write.csv(dind,file="data.ind.csv",row.names=F)
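
The row-by-row loops in the script are easy to follow, but the same recoding can also be written with vectorized operations. The sketch below is one possible alternative, not the script's own method. It runs on small made-up data frames, so every value and the column name "file" are assumptions of this sketch.

```r
# Toy stand-ins for the real files; all values are invented for illustration
ECG <- data.frame(file = c("./ecg/123_a.dat", "ab_456.dat", NA),
                  QTVI_log = c(-1.2, -999, 0.4),
                  stringsAsFactors = FALSE)
SNP <- data.frame(ID = c("JHU123", "JHU456"),
                  snp1 = c("AA", "ERROR"),
                  snp2 = c("-", "AG"),
                  stringsAsFactors = FALSE)
RACE <- data.frame(ID = c("JHU123", "JHU456"),
                   Race = c("W", "unknown"),
                   stringsAsFactors = FALSE)

# Build the ID from the first column: characters 7-9 if the name starts
# with ".", characters 4-6 otherwise, and NA if the name is missing
first <- ECG[[1]]
starts_dot <- substring(first, 1, 1) == "."
ECG$ID <- ifelse(is.na(first), NA,
                 paste("JHU",
                       ifelse(starts_dot, substring(first, 7, 9), substring(first, 4, 6)),
                       sep = ""))

# Recode -999 as NA, one column at a time
ECG[] <- lapply(ECG, function(col) {
  col[!is.na(col) & col == -999] <- NA
  col
})

# Recode uncalled SNPs as NA in the SNP columns (columns 2 and 3 here)
bad <- c("UNDETERMINED", "-", "ERROR")
SNP[2:3] <- lapply(SNP[2:3], function(col) {
  col[col %in% bad] <- NA
  col
})

# Recode any race other than A, B, W or O as NA
RACE$Race[!(RACE$Race %in% c("A", "B", "W", "O"))] <- NA
```

On large files the vectorized form usually runs faster than the loops, and %in% avoids the error that == inside if() raises when a cell is already NA.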

 


(2) Uncommented Code

 

ECG<-read.csv("JHUcomb.csv", as.is=T)

SNP<-read.csv("icd.data.oct.11.2007.csv",as.is=T)

AGE.GENDER<-read.csv("age.gender.csv",as.is=T)

RACE<-read.csv("ReynRaceData-100207.csv",as.is=T)

IND<-read.csv( "firing.data.4.09.2008.csv",as.is=T)

IMAGE<-read.csv("img.data11.19.07.csv",as.is=T)

N.ECG<-dim(ECG)[1]

ECG$ID<-rep("",N.ECG)

for (i in seq(1,N.ECG))

{

  if (is.na(ECG[i,1]))

  {

     ECG$ID[i]<-NA

  }

  else

  {

     if (substring(ECG[i,1],1,1)==".")

     {

        ECG$ID[i]<-paste("JHU",substring(ECG[i,1],7,9),sep="")

     }

     else

     { 

        ECG$ID[i]<-paste("JHU",substring(ECG[i,1],4,6),sep="")

     }

  }

}

for (i in seq(1,dim(ECG)[1]))

{

   for (j in seq(1,dim(ECG)[2]))

   {

      if (!is.na(ECG[i,j]))

      {

         if (ECG[i,j]==-999)

         {

            ECG[i,j]<-NA

         }

      }

   } 

}

for (i in seq(1,dim(SNP)[1]))

{

   for (j in seq(2,7))

   {

     if (SNP[i,j] %in% c("UNDETERMINED","-","ERROR"))

     {

        SNP[i,j]<-NA

     }

   } 

}

names(SNP)<-c("ID","snp1","snp2","snp3","snp4","snp5","snp6")

for (i in seq(1,dim(AGE.GENDER)[1]))

{

   if (!is.na(AGE.GENDER$Gender[i]) && AGE.GENDER$Gender[i]=="")

   {

      AGE.GENDER$Gender[i]<-NA

   }

}

for (i in seq(1,dim(RACE)[1]))

{

   if (!(RACE$Race[i] %in% c("A","B","W","O")))

         {

           RACE$Race[i]<-NA

         }

}

names(AGE.GENDER)[1]<-"ID"

names(RACE)[1]<-"ID"

names(SNP)[1]<-"ID"

names(IND)[1]<-"ID"

names(IMAGE)[1]<-"ID"

L<-dim(IND)[1]

for (i in seq(1,L))

{

   IND$ID[i]<-substr(IND$ID[i],1,6)

   if (!is.na(IND$Inducible[i]))

   {

      if ((IND$Inducible[i]!="no")&&(IND$Inducible[i]!="yes"))

      { 

         if (IND$Inducible[i]=="no, ")

         {

            IND$Inducible[i]<-"no"

         }

         else

         {

            IND$Inducible[i]<-NA

         }

      }

   }

}

SNP$IDIN.SNP.IND<-rep(1,dim(SNP)[1])

ECG$IDIN.ECG.IND<-rep(1,dim(ECG)[1])

AGE.GENDER$IDIN.AGE.GENDER.IND<-rep(1,dim(AGE.GENDER)[1])

RACE$IDIN.RACE.IND<-rep(1,dim(RACE)[1])

IND$IDIN.IND.IND<-rep(1,dim(IND)[1])

IMAGE$IDIN.IMAGE.IND<-rep(1,dim(IMAGE)[1])

IMAGE$IMAGE.IND<-rep(1,dim(IMAGE)[1])

d1<-merge(ECG,SNP,by.x="ID",by.y="ID",all=TRUE)

d2<-merge(d1,AGE.GENDER,by.x="ID",by.y="ID",all=TRUE)

d3<-merge(d2,RACE,by.x="ID",by.y="ID",all=TRUE)

d4<-merge(d3,IND,by.x="ID",by.y="ID",all=TRUE)

d5<-merge(d4,IMAGE,by.x="ID",by.y="ID",all=TRUE)

d<-d5

rm(d1)

rm(d2)

rm(d3)

rm(d4)

rm(d5)

d$SNP.ALL.IND<-complete.cases(d$snp1,d$snp2,d$snp3,d$snp4,d$snp5,d$snp6)

d$ECG.IND<-complete.cases(d$QTVI_log)

d$AGE.IND<-complete.cases(d$Birth.Year.x)

d$GENDER.IND<-complete.cases(d$Gender.x)

d$RACE.IND<-complete.cases(d$Race)

d$IND.IND<-complete.cases(d$Inducible)

d$IMAGE.IND<-complete.cases(d$DEmass)

d$Birth.Year.x[is.na(d$Birth.Year.x)]<-0

d$Birth.Year.y[is.na(d$Birth.Year.y)]<-0

d$Birth.Year<-d$Birth.Year.x+d$Birth.Year.y*(d$Birth.Year.x==0)

d<-d[d$Birth.Year<=1995,]

d$IMPLANT.IND<-complete.cases(d$Implant.Date)

d$Firing.Date<-as.Date(d$Firings,format="%m/%d/%Y")

d$Implant.Date<-as.Date(d$Implant.Date,format="%m/%d/%Y")

sum(d$IMPLANT.IND)

sum(!is.na(d$Implant.Date))

d$Days.To.Firing<-d$Firing.Date-d$Implant.Date

d$FIRED.IND<-!is.na(d$Days.To.Firing)

today<-as.Date("3/04/2008",format="%m/%d/%Y")

d$Days.Of.Implant<-today-d$Implant.Date

d$APP.FIRED.IND<-(d$AP.vs.IAP=="AP")

write.csv(d,file="data.csv",row.names=F)

dind<-d[d$IND.IND,]

write.csv(dind,file="data.ind.csv",row.names=F)

d$CLINICAL.IND<-(d$AGE.IND)*(d$RACE.IND)*(d$GENDER.IND)

dind$CLINICAL.IND<-(dind$AGE.IND)*(dind$RACE.IND)*(dind$GENDER.IND)
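
The chain of merges and the birth-year and date steps can be tried on toy data without the real files. The sketch below is an illustration under stated assumptions: the three small data frames, their values, and the names AGE and keep are invented, and Reduce() is used here as a compact alternative to the d1 to d5 sequence in the script.

```r
# Toy data frames keyed by ID; all values are invented for illustration
ECG <- data.frame(ID = c("JHU001", "JHU002"), QTVI_log = c(-1.1, NA),
                  stringsAsFactors = FALSE)
AGE <- data.frame(ID = c("JHU001", "JHU003"), Birth.Year = c(1950, NA),
                  stringsAsFactors = FALSE)
IND <- data.frame(ID = c("JHU002", "JHU003"), Birth.Year = c(1960, 1999),
                  Implant.Date = c("01/15/2007", "02/01/2008"),
                  Firings = c("03/01/2007", NA),
                  stringsAsFactors = FALSE)

# Merge all data frames on ID, keeping every ID from every data frame
d_all <- Reduce(function(a, b) merge(a, b, by = "ID", all = TRUE),
                list(ECG, AGE, IND))

# Birth.Year occurs in two inputs, so merge() names them Birth.Year.x and
# Birth.Year.y; take .x where it is present and .y otherwise
d_all$Birth.Year <- ifelse(is.na(d_all$Birth.Year.x),
                           d_all$Birth.Year.y, d_all$Birth.Year.x)

# Keep those born in or before 1995; rows without any birth year are kept,
# as in the script
keep <- is.na(d_all$Birth.Year) | d_all$Birth.Year <= 1995
d <- d_all[keep, ]

# Convert the date strings and compute the days from implant to firing
d$Firing.Date    <- as.Date(d$Firings, format = "%m/%d/%Y")
d$Implant.Date   <- as.Date(d$Implant.Date, format = "%m/%d/%Y")
d$Days.To.Firing <- d$Firing.Date - d$Implant.Date
d$FIRED.IND      <- !is.na(d$Days.To.Firing)
```

With these toy values, JHU003 is dropped by the birth-year filter and JHU002 has 45 days between implant and firing. Using ifelse() with is.na() gives the same Birth.Year as the script's zero-coding whenever at least one of the two birth years is present; when both are missing it leaves NA instead of 0.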

 


6. References

 

Bill Venables and David M. Smith, An introduction to R. www.r-project.org, 06/23/2008.

Phil Spector, Introduction to S & S-PLUS. Springer, 12/23/1993.

 
