Merge Datasets Instructions

An R list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on.

Here is a simple example of how to make a list:

> Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))

Components of lists may be named, and a component can be referred to either by its name (after a $ sign, or as a character string in double square brackets) or by its position number in double square brackets. For example:

> Lst[[1]]

[1] "Fred"

> Lst$wife

[1] "Mary"

> Lst[[2]]

[1] "Mary"

> Lst$child.ages

[1] 4 7 9

This is a very useful convention as it makes it easier to get access to each individual component of a list.
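As a minimal sketch of these access conventions, the following rebuilds the list from above and contrasts name-based and position-based access. It uses only base R. The variable names and the single-bracket comparison at the end are additions for illustration and are not part of the original example.

```r
# Rebuild the example list from above
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))

# Access one component by position, by name with $, or by name as a string
first_by_position <- Lst[[1]]
wife_by_dollar    <- Lst$wife
wife_by_string    <- Lst[["wife"]]

# Double brackets return the component itself, so it can be indexed further
second_child_age <- Lst[["child.ages"]][2]

# Single brackets return a sub-list rather than the component
sub_list <- Lst[1]

print(names(Lst))
print(second_child_age)
print(class(sub_list))
```

The distinction matters when the component is passed on to another function: Lst[[1]] is the character string itself, while Lst[1] is still a list of length one.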

(2) Data Frame

A data frame is the format we use most often for a dataset in R. It is essentially a list whose components are the variables (columns) of the dataset; formally, a data frame is a list with class "data.frame". There are restrictions on lists that may be made into data frames, namely:

(1) The components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames;

(2) Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively;

(3) Numeric vectors, logicals and factors are included as is. In versions of R before 4.0.0, character vectors are coerced to factors, whose levels are the unique values appearing in the vector; from R 4.0.0 onward they are kept as character vectors unless stringsAsFactors = TRUE is given;

(4) Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same number of rows.
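The restrictions above can be checked with a short sketch. The data frame df and its column names are made up for illustration; stringsAsFactors = TRUE is passed explicitly so that the factor conversion happens regardless of the R version in use.

```r
# Columns of different modes but equal length form a valid data frame
df <- data.frame(id    = 1:3,
                 flag  = c(TRUE, FALSE, TRUE),
                 label = c("a", "b", "a"),
                 stringsAsFactors = TRUE)  # request factor conversion explicitly

# A data frame is a list with class "data.frame"
is_list_underneath <- is.list(df)
df_class <- class(df)

# With stringsAsFactors = TRUE the character column becomes a factor
label_levels <- levels(df$label)

# Columns whose lengths cannot be recycled to a common length are rejected
bad <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) "error: differing number of rows")

print(df_class)
print(label_levels)
print(bad)
```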

A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions, for example:

> x <- data.frame(a=1:4, b=c(TRUE,TRUE,FALSE,TRUE), c=c("A","B","C","D"), d=13:16)

> x

  a     b c  d
1 1  TRUE A 13
2 2  TRUE B 14
3 3 FALSE C 15
4 4  TRUE D 16

> x[,1] # the 1st column of x

[1] 1 2 3 4

> x[,2] # the 2nd column of x

[1] TRUE TRUE FALSE TRUE

> x[1,] # the 1st row of x

  a    b c  d
1 1 TRUE A 13

> x[3,] # the 3rd row of x

  a     b c  d
3 3 FALSE C 15

> x[3,4] # the element in 3rd row and 4th column

[1] 15
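Besides the numeric positions used above, columns can be selected by name and rows by a logical condition. The following sketch rebuilds x and shows a few such forms; the result variable names are illustrative only.

```r
x <- data.frame(a = 1:4,
                b = c(TRUE, TRUE, FALSE, TRUE),
                c = c("A", "B", "C", "D"),
                d = 13:16)

col_by_name  <- x$d                     # same values as x[, 4]
col_by_label <- x[, "d"]                # matrix-style indexing with a column name
rows_where_b <- x[x$b, ]                # keep only the rows where b is TRUE
one_col_df   <- x[, "a", drop = FALSE]  # stay a data frame instead of a vector

print(rows_where_b)
print(dim(one_col_df))
```

Selecting by name is usually safer than selecting by position, because the code keeps working if columns are later added or reordered.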

2. Merging Data Frames (function: merge() )

(1) merge()

We need to merge data frames because the data for one project often arrive in several files, each containing one piece of information. For a thorough analysis, all of the files must be combined in an organized way so that the analysis can easily be run on the whole dataset.

merge() combines two data frames based on their common column names or row names. After merging, the rows are by default sorted lexicographically on the common columns; if "sort = FALSE" is specified in the merge() call, the rows are returned in an unspecified order instead. The other arguments of merge() are explained in the R help file, which can be opened by typing "?merge" in the R console. Merging datasets example:

> authors

   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no

> books

      name                         title     other.author
1    Tukey     Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics ...           Ripley
3  Tierney                     LISP-STAT             <NA>
4   Ripley            Spatial Statistics             <NA>
5   Ripley         Stochastic Simulation             <NA>
6   McNeil     Interactive Data Analysis             <NA>
7   R Core          An Introduction to R Venables & Smith

> merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)

   surname nationality deceased                         title     other.author
1   McNeil   Australia       no     Interactive Data Analysis             <NA>
2   R Core        <NA>     <NA>          An Introduction to R Venables & Smith
3   Ripley          UK       no            Spatial Statistics             <NA>
4   Ripley          UK       no         Stochastic Simulation             <NA>
5  Tierney          US       no                     LISP-STAT             <NA>
6    Tukey          US      yes     Exploratory Data Analysis             <NA>
7 Venables   Australia       no Modern Applied Statistics ...           Ripley
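To try this example yourself, the two data frames can be typed in directly. The sketch below rebuilds authors and books from the printed values above and compares three choices of the all arguments; the names inner, left and full are illustrative labels, not R terminology.

```r
authors <- data.frame(
  surname     = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased    = c("yes", "no", "no", "no", "no"))

books <- data.frame(
  name  = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley",
            "McNeil", "R Core"),
  title = c("Exploratory Data Analysis", "Modern Applied Statistics ...",
            "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
            "Interactive Data Analysis", "An Introduction to R"),
  other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith"))

# Default: keep only rows whose key appears in both data frames
inner <- merge(authors, books, by.x = "surname", by.y = "name")

# all.x = TRUE: also keep authors that have no book
left <- merge(authors, books, by.x = "surname", by.y = "name", all.x = TRUE)

# all = TRUE: keep unmatched rows from both sides, filling gaps with NA
full <- merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)

print(nrow(inner))
print(nrow(left))
print(full[full$surname == "R Core", ])
```

Here every author has at least one book, so inner and left have the same rows; only all = TRUE (or all.y = TRUE) brings in "R Core", which has no entry in authors.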

Another example:

> x

  k1 k2 data
1 NA  1    1
2 NA NA    2
3  3 NA    3
4  4  4    4
5  5  5    5

> y

  k1 k2 data
1 NA NA    1
2  2 NA    2
3 NA  3    3
4  4  4    4
5  5  5    5

> merge(x, y, by=c("k1","k2")) # NA's match

  k1 k2 data.x data.y
1  4  4      4      4
2  5  5      5      5
3 NA NA      2      1

> merge(x, y, by=c("k1","k2"),all=TRUE) # NA's match

  k1 k2 data.x data.y
1  2 NA     NA      2
2  3 NA      3     NA
3  4  4      4      4
4  5  5      5      5
5 NA  1      1     NA
6 NA  3     NA      3
7 NA NA      2      1
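The treatment of NA keys can be explored with the sketch below, which rebuilds x and y from the values printed above. It also uses the incomparables argument of merge(), which is not discussed in the text; it is intended for merging on a single key column and lists key values that must never be matched.

```r
x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)

# Two key columns: a row matches only if both keys agree, and NA matches NA
both_keys <- merge(x, y, by = c("k1", "k2"))

# One key column: each NA in x pairs with each NA in y
one_key <- merge(x, y, by = "k1")

# Declare NA as incomparable so that rows with a missing key never match
no_na <- merge(x, y, by = "k2", incomparables = NA)

print(both_keys)
print(nrow(one_key))
print(nrow(no_na))
```

Non-key columns that share a name in both inputs (here data) are kept apart with the suffixes .x and .y, which is why the result has data.x and data.y.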

To merge data frames, all of the individual data files must first be read into R as data frames, which is what read.table() and read.csv() are for.

(2) read.table() and read.csv()

Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. If variables are to be held mainly in data frames, as we strongly suggest they should be, an entire data frame can be read directly with the read.table() function.

To read an entire data frame directly, the external file will normally have a special form:

(1) The first line of the file should have a name for each variable in the data frame;

(2) Each additional line of the file has as its first item a row label, followed by the values for each variable.

For example,

Input file form with names and row labels:

     Price    Floor     Area   Rooms     Age  Cent.heat
01   52.00    111.0      830     5       6.2      no
02   54.75    128.0      710     5       7.5      no
03   57.50    101.0     1000     5       4.2      no
04   57.50    131.0      690     6       8.8      no
05   59.75     93.0      900     5       1.9     yes
...

The function read.table() can then be used to read the data frame directly:

> HousePrice <- read.table("houses.data", header=TRUE, as.is=TRUE)

where the header=TRUE option specifies that the first line is a line of headings. Because that line has one fewer entry than the data lines, read.table() takes the first item on each data line as the row label. as.is=TRUE means that character variables are not converted to factors. In versions of R before 4.0.0, read.table() and read.csv() by default convert character variables (those not converted to logical, numeric or complex) to factors; from R 4.0.0 onward they are left as character by default.

If the input data file is in csv format, then use read.csv() with arguments defined above.
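The reading step can be tried without any external file by writing a few lines to a temporary file first. In this sketch the temporary paths stand in for "houses.data", and only three of the data lines shown above are used.

```r
# Write a small file in the form shown above to a temporary location
path <- tempfile(fileext = ".data")
writeLines(c(
  "Price Floor Area Rooms Age Cent.heat",
  "01 52.00 111.0 830 5 6.2 no",
  "02 54.75 128.0 710 5 7.5 no",
  "05 59.75 93.0 900 5 1.9 yes"), path)

# The header has one fewer field than the data lines, so the first
# item on each data line is used as the row label
HousePrice <- read.table(path, header = TRUE, as.is = TRUE)

print(HousePrice)
print(rownames(HousePrice))
print(class(HousePrice$Cent.heat))

# The same data as a csv file without row labels, read with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(HousePrice, csv_path, row.names = FALSE)
FromCsv <- read.csv(csv_path, as.is = TRUE)
print(dim(FromCsv))
```

read.csv() is read.table() with defaults suited to comma-separated files (sep = "," and header = TRUE), so the header argument does not need to be repeated there.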

3. Example Description

Now we are going to work through a simple example on real data. We first read several data files into R, then perform the data manipulations required by the analysis, and finally merge the data files into one complete dataset. The example uses nearly all of the functions and commands commonly needed for data organization, so working through it should help you understand data manipulation in R. The next two sections give the data links and the R scripts for this example.
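The overall workflow (several data frames merged one after another on a shared identifier) can be sketched as follows. This is not the script for the example itself: the three small data frames, their values, and the key column name "JHUID" are hypothetical stand-ins for files read with read.csv().

```r
# Tiny made-up stand-ins for data frames read from separate files.
# The key column name "JHUID" and all values are hypothetical.
ecg  <- data.frame(JHUID = c(1, 2, 3), qrs  = c(90, 104, 98))
age  <- data.frame(JHUID = c(1, 2, 4), age  = c(61, 55, 70))
race <- data.frame(JHUID = c(2, 3, 4), race = c("A", "B", "A"))

# Merge two data frames on the key, keeping unmatched rows from both
merge_all <- function(a, b) merge(a, b, by = "JHUID", all = TRUE)

# Apply the pairwise merge from left to right over the whole list
combined <- Reduce(merge_all, list(ecg, age, race))

print(combined)
```

Using all = TRUE keeps subjects that are missing from some files, with NA in the corresponding columns; dropping it would keep only subjects present in every file.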

4. Links to Datasets

Here are the datasets used in the example:

(1) JHUcomb.csv (ECG)

(2) icd.data.oct11.2007.csv (SNP)

Cleaned version of the sept file, so that the JHUIDs are in the common format (got rid of "new", "A", or "B").

(3) age.gender.csv (AGE.GENDER)

(4) ReynRaceData-100207.csv (RACE)

(5) firing.datainducibility.csv (IND)

(6) img.data11.19.07.csv (IMAGE)

5. R Code

Here is the code used for the data above. There are two versions of the code: one is heavily commented for users with limited experience of R, and the other has no comments at all. They are equivalent, and it is up to you which one to use.

(1) Commented Code

# Merge
# 1) JHUcomb.csv (ECG)
# 2) icd.data.oct11.2007.csv (SNP) -- cleaned version of sept so that the
#    JHUIDs are in the common format
#    got rid of "new" or "A" or "B"
# 3) age.gender.csv (AGE.GENDER)
# 4) ReynRaceData-100207.csv (RACE)
# 5) firing.datainducibility.csv (IND)
# 6) img.data11.19.07.csv (IMAGE)

# Read data from file "JHUcomb.csv" in the current folder,
# which is specified under R menu->File->Change directory.
# The file is in csv format. We use the function
# read.csv("file path").
# The "as.is=TRUE" means not converting character variables to
# factors. In versions of R before 4.0.0, read.csv() by default
# converts the character variables (which are not converted to
# logical, numeric or complex) to factors.
ECG <- read.csv("JHUcomb.csv", as.is = TRUE)

# Read data from "icd.data.oct.11.2007.csv" in the current