Biostatistics with R

Data manipulation with vectors

Creating subsets of vectors

We learnt that an operation performed on a vector is applied to each of its elements separately. Thus, if v is a vector, any operation performed on v is applied to all its elements in turn, and the result is a new vector.

Similarly, if a logical condition is applied to a vector x, it is applied to each element of x, resulting in a vector of TRUE or FALSE values, one for every element of x.

As an example, if x is a vector of numbers, then the statement x > 12 will check whether each element of x is greater than 12. It will accordingly generate a vector of TRUE or FALSE boolean values. See here:


> x = c(8,10,12,7,14,16,2,4,9,19,20,3,6)
> x > 12
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[13] FALSE

If the above vector of TRUE and FALSE values is placed inside the square brackets of vector x, the elements of x corresponding to the TRUE values are selected and returned as a new vector. Carefully note the following code and its output:


> x = c(8,10,12,7,14,16,2,4,9,19,20,3,6)
> x > 12
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[13] FALSE
> y = x[x>12]
> y
[1] 14 16 19 20

The logical statement inside the square brackets can be complex. Thus, if we want to select the elements of vector x whose values are more than 10 and less than 20, we use

> x[ (x>10) & (x<20) ]
[1] 12 14 16 19
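
The same pattern extends to other logical operators. As a minimal sketch using the same vector x, the element-wise OR operator | selects values lying outside a range, and sum() applied to a logical vector counts how many elements satisfy a condition. The names low_or_high and n_large are illustrative only.

```r
x <- c(8, 10, 12, 7, 14, 16, 2, 4, 9, 19, 20, 3, 6)

# Element-wise OR: keep values below 5 or above 15
low_or_high <- x[(x < 5) | (x > 15)]
print(low_or_high)   # 16 2 4 19 20 3

# TRUE counts as 1, so sum() counts the matching elements
n_large <- sum(x > 12)
print(n_large)       # 4
```

Note that the single & and | operators work element by element, which is what vector filtering needs.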

In another example, we create a vector of numbers with some missing values (i.e., NA). We will apply a filter to select elements which are not NA and at the same time have values below 100, and write them into another vector. In a second operation, we will replace all the NA values in the original vector itself with 0.

The script below achieves this:


tarray <- c(2, 7, 29, 32, 41, 11, 15, NA, NA, 55, 32, NA, 42, 109)

karray <- tarray[ !is.na(tarray) & (tarray < 100) ]

tarray[is.na(tarray)] <- 0

print("Filter with NA's and numbers greater than 100 removed:")
print(karray)

print("Filter with NA's replaced by 0")
print(tarray)


When the above code lines are executed in an R script, the following output is created.

[1] "Filter with NA's and numbers greater than 100 removed:"
 [1]  2  7 29 32 41 11 15 55 32 42
[1] "Filter with NA's replaced by 0"
 [1]   2   7  29  32  41  11  15   0   0  55  32   0  42 109

In the above script, the statement tarray[ !is.na(tarray) & (tarray < 100) ] selects the elements of vector "tarray" that are not NA and at the same time less than 100. The statement tarray[is.na(tarray)] <- 0 assigns the value 0 to the elements of vector "tarray" that are missing values (NA). After this, all NA values in vector "tarray" have been replaced by 0.
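
To see why the !is.na() term matters, the following sketch (with a shortened, illustrative version of the vector) compares the filter with and without it. A comparison involving NA yields NA rather than TRUE or FALSE, and indexing with NA leaves an NA in the result.

```r
tarray <- c(2, 7, NA, 55, 109)

# A comparison involving NA yields NA, not TRUE or FALSE
print(tarray < 100)            # TRUE TRUE NA TRUE FALSE

# Indexing with an NA keeps an NA placeholder in the result
with_na <- tarray[tarray < 100]
print(with_na)                 # 2 7 NA 55

# !is.na() is FALSE at the NA positions, and FALSE & NA is FALSE,
# so those positions are dropped
without_na <- tarray[!is.na(tarray) & (tarray < 100)]
print(without_na)              # 2 7 55
```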

Creating subsets of data frames

From a data frame, a subset can be created using the subset() function by applying conditions on one or more columns.

For example, suppose a data frame called "datframe" has many columns, one of which is named "npcol". Then the statement

subdata <- subset(datframe, datframe$npcol > 30.0)

will select all the rows of datframe in which npcol is greater than 30 to create a new data frame called "subdata".


In the example code below, we will create a data frame with (imaginary) experimental data. In this data set, there are 7 gene entries (rows), for each of which measurements are available from 7 experiments.
We will use the subset() function to create subsets of this data by filtering on individual column values.
The code below demonstrates this. The comments are self-explanatory.


# creating a vector of gene names
genes = c("gene-1","gene-2","gene-3","gene-4","gene-5","gene-5","gene-6")

# creating a vector of gender
gender = c("M", "M", "F", "M", "F", "F", "M")

# creating 7 data vectors with experimental results
result1 = c(12.3, 11.5, 13.6, 15.4, 9.4, 8.1, 10.0)
result2 = c(22.1, 25.7, 32.5, 42.5, 12.6, 15.5, 17.6)
result3 = c(15.5, 13.4, 11.5, 21.7, 14.5, 16.5, 12.1)
result4 = c(14.4, 16.6, 45.0, 11.0, 9.7, 10.0, 12.5)
result51 = c(12.2, 15.5, 17.4, 19.4, 10.2, 9.8, 9.0)
result52 = c(13.3, 14.5, 21.6, 17.9, 15.6, 14.4, 12.0)
result6 = c(11.0, 10.0, 12.2, 14.3, 23.3, 19.8, 13.4)

# creating a data frame with this data.
# genes along rows, results along columns
datframe = data.frame(genes,gender,result1,result2,result3,result4,
                      result51,result52,result6)

# adding column names to data frame
names(datframe) = c("GeneName", "Gender", "expt1", "expt2", "expt3", "expt4",
                    "expt51", "expt52", "expt6")

# creating subset of data with expt2 values above 20
subframe1 = subset(datframe, datframe$expt2 > 20)

# creating a subset of data with only Female gender
subframe2 = subset(datframe, datframe$Gender == "F")

# creating a subset with male gender for which expt2 is less than 30
subframe3 = subset(datframe, (datframe$Gender == "M")&(datframe$expt2 < 30.0) )

# printing the data frames
print("subframe1 : Rows with expt2 > 20")
print(subframe1)

print("subframe2 : Rows with gender Female")
print(subframe2)

print("subframe3 : Rows with Male gender and expt2 < 30.0")
print(subframe3)

When the above code lines are executed in an R script, we get the following output:

[1] "subframe1 : Rows with expt2 > 20"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
1   gene-1      M  12.3  22.1  15.5  14.4   12.2   13.3  11.0
2   gene-2      M  11.5  25.7  13.4  16.6   15.5   14.5  10.0
3   gene-3      F  13.6  32.5  11.5  45.0   17.4   21.6  12.2
4   gene-4      M  15.4  42.5  21.7  11.0   19.4   17.9  14.3
[1] "subframe2 : Rows with gender Female"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
3   gene-3      F  13.6  32.5  11.5  45.0   17.4   21.6  12.2
5   gene-5      F   9.4  12.6  14.5   9.7   10.2   15.6  23.3
6   gene-5      F   8.1  15.5  16.5  10.0    9.8   14.4  19.8
[1] "subframe3 : Rows with Male gender and expt2 < 30.0"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
1   gene-1      M  12.3  22.1  15.5  14.4   12.2   13.3  11.0
2   gene-2      M  11.5  25.7  13.4  16.6   15.5   14.5  10.0
7   gene-6      M  10.0  17.6  12.1  12.5    9.0   12.0  13.4
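
Inside subset(), column names can also be written directly without the datframe$ prefix, and the select argument restricts which columns are returned. The sketch below uses a small stand-in data frame named df with illustrative values (not the full data set above) and compares the result with plain bracket indexing.

```r
# A small stand-in data frame (values are illustrative only)
df <- data.frame(GeneName = c("gene-1", "gene-2", "gene-3"),
                 Gender   = c("M", "F", "M"),
                 expt2    = c(22.1, 32.5, 17.6))

# Column names are used directly; select= picks the columns to keep
sub_a <- subset(df, Gender == "M" & expt2 > 20, select = c(GeneName, expt2))
print(sub_a)

# Equivalent selection with bracket indexing
sub_b <- df[df$Gender == "M" & df$expt2 > 20, c("GeneName", "expt2")]
print(sub_b)
```

Both forms return the single row for gene-1 here. The subset() form is convenient interactively, while bracket indexing is often preferred inside functions.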

Union and intersection of vectors

If two vectors x and y represent two sets of elements, their union and intersection operations can be performed by calling the functions union(x,y) and intersect(x,y) as demonstrated here:

> x = c('A','B','C','D','E')
> y = c('D','E','K','L','S','P')
> zu = union(x,y)
> zu
[1] "A" "B" "C" "D" "E" "K" "L" "S" "P"

> zi = intersect(x,y)
> zi
[1] "D" "E"
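
Two closely related base R tools are setdiff(), which returns the elements of one vector that are absent from another, and the %in% operator, which tests membership element by element. The following sketch reuses the vectors above; the names only_x and in_y are illustrative.

```r
x <- c('A', 'B', 'C', 'D', 'E')
y <- c('D', 'E', 'K', 'L', 'S', 'P')

# Elements of x that are not in y
only_x <- setdiff(x, y)
print(only_x)          # "A" "B" "C"

# Membership test: TRUE where an element of x also occurs in y
in_y <- x %in% y
print(in_y)            # FALSE FALSE FALSE TRUE TRUE

# Indexing with the membership vector reproduces the intersection
print(x[in_y])         # "D" "E"
```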

Computing the differences between elements of a vector

Given a vector of numbers, the differences between either successive or periodically positioned elements can be obtained using a function called diff(). This returns the result as a vector.

The simplest format is diff(x, n), where 'x' is the vector of numbers and 'n' is the lag, i.e., the spacing between the two elements being subtracted. For example, n=1 (the default) gives the differences between successive numbers, and n=2 gives the differences between alternate numbers.


> x = c(4,8,11,14,35,56,120,30)
> diff(x)
[1]   4   3   3  21  21  64 -90

> diff(x,2)
[1]   7   6  24  42  85 -26
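
To make the meaning of the lag concrete, this sketch computes the lag-2 differences by hand with index arithmetic and compares them with diff(). The names k, manual and builtin are illustrative.

```r
x <- c(4, 8, 11, 14, 35, 56, 120, 30)
n <- length(x)
k <- 2   # the lag

# Each element minus the element k positions before it
manual <- x[(k + 1):n] - x[1:(n - k)]
print(manual)                       # 7 6 24 42 85 -26

# Same result from diff(); the result has n - k elements
builtin <- diff(x, lag = k)
print(length(builtin))              # 6
```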

Cumulative sum and product of vector elements

Given a vector of numbers, we can find the cumulative sum and cumulative product up to each element using the cumsum() and cumprod() functions, as demonstrated below:


> x = c(1, 3, 5, 4, 6, 8, 2)
> cumsum(x)
[1]  1  4  9 13 19 27 29
> cumprod(x)
[1]    1    3   15   60  360 2880 5760
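
One possible use of cumsum(), shown here only as an illustration, is a running mean: the cumulative sum divided by the number of elements seen so far. The sketch also checks that the last cumulative values equal the overall sum and product.

```r
x <- c(1, 3, 5, 4, 6, 8, 2)

# Running mean: cumulative sum divided by the count of elements so far
running_mean <- cumsum(x) / seq_along(x)
print(running_mean)

# The last cumulative values equal the overall sum and product
print(cumsum(x)[length(x)] == sum(x))     # TRUE
print(cumprod(x)[length(x)] == prod(x))   # TRUE
```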

Finding Unique elements of a vector

A vector may contain multiple copies of the same element. We can obtain the unique elements of the vector by calling the unique() function:


> x = c('a','b','a','c','e','f','c','g','h')
> unique(x)
[1] "a" "b" "c" "e" "f" "g" "h"

Finding duplicated elements of a vector

The duplicated elements in a vector can be located with the duplicated() function. This function returns a vector of TRUE or FALSE values, one for each location in the input vector. If the element at a location has already appeared earlier in the vector, the value is TRUE, else it is FALSE. See below:

> x = c('a','b','a','c','e','f','c','g','h')
> duplicated(x)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

From this we can extract the actual duplicated elements by using this boolean vector as the index of x:

> x[duplicated(x)]
[1] "a" "c"

To extract the non-duplicated elements, use the logical NOT operator:

> x[!duplicated(x)]
[1] "a" "b" "c" "e" "f" "g" "h"

Note that the "not duplicated" subset is the same as the "unique" subset obtained before.
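
Because duplicated() marks only the second and later copies, the first copy of a repeated value is not flagged. As a sketch, the fromLast = TRUE argument scans from the end instead, and combining both directions marks every copy of a repeated value. The names from_end and all_copies are illustrative.

```r
x <- c('a', 'b', 'a', 'c', 'e', 'f', 'c', 'g', 'h')

# Scanning from the end flags the earlier copies instead of the later ones
from_end <- duplicated(x, fromLast = TRUE)

# Combine both directions to mark every copy of a repeated value
all_copies <- duplicated(x) | from_end
print(x[all_copies])      # "a" "a" "c" "c"

# Values that occur exactly once
print(x[!all_copies])     # "b" "e" "f" "g" "h"
```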

Creating a frequency table

Given a vector, we can get the frequency table of its elements with the table() function. This table function returns a data structure that can be converted to a data frame, and the frequency information can be extracted from there:


> xx = c('A','T','A','G','T','A','T','C','C','A','T','T','G')
> tab = table(xx)
> tab
xx
A C G T 
4 2 2 5 
> dat = as.data.frame(tab)
> dat
  xx Freq
1  A    4
2  C    2
3  G    2
4  T    5

The function call as.data.frame(tab) converts the object "tab" into a proper data frame. Now dat$xx holds the unique data members (stored as a factor) and dat$Freq is a vector of their frequencies.
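
The table object can also be queried directly, without converting it to a data frame. The sketch below looks up a single count by name, pulls out names and counts as plain vectors, and computes relative frequencies with prop.table(). The names count_T and props are illustrative.

```r
xx <- c('A','T','A','G','T','A','T','C','C','A','T','T','G')
tab <- table(xx)

# Look up a single count by name
count_T <- tab[["T"]]
print(count_T)                 # 5

# Names and counts as plain vectors
print(names(tab))              # "A" "C" "G" "T"
print(as.vector(tab))          # 4 2 2 5

# Relative frequencies (proportions that sum to 1)
props <- prop.table(tab)
print(round(props, 3))
```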

Getting the index of a vector element

Given a vector, suppose we want to know the index of a particular element, that is, its position in the vector (third, fourth, etc.) given its value. We can do this with the help of a function called which(). See here:


> x = c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh")
> which(x=="ddd")
[1] 4

In the above vector x, the element "ddd" is in the fourth position. Therefore, the above call to which() returns an integer 4.

If a particular element is present in the vector more than once, the which() function returns a vector containing the indices of all the locations of that element in the input vector:

> dat = c("ATG","TAG","ATG","TTA","TGC","ATT","ATG", "GGG")
> d = which(dat=="ATG")
> d
[1] 1 3 7
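
which() accepts any logical condition, not only an equality test. As a brief sketch with a numeric vector, it can return the positions of all elements above a threshold. The related base R functions which.max(), which.min() and match() return the position of the largest element, the smallest element, and the first occurrence of a value, respectively.

```r
x <- c(8, 10, 12, 7, 14, 16, 2, 4, 9, 19, 20, 3, 6)

# Positions of all elements greater than 12
big_idx <- which(x > 12)
print(big_idx)            # 5 6 10 11

# Position of the largest and the smallest element
print(which.max(x))       # 11
print(which.min(x))       # 7

# match() gives the position of the first occurrence of a value
print(match(14, x))       # 5
```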

Joining data frames

In R, data tables are generally loaded as data frames. We can join (bind) two data frames one below the other (vertical binding) or adjacent to each other (horizontal binding), provided the columns or the numbers of rows match accordingly.

We will create three data frames called frame1, frame2 and frame3 to demonstrate the data set binding.

> index = seq(1:8)
> product = c("wheat","rice","millet","ragi","corn","pulses","meat","sugarCane")
> quantity1 = c(118,179,24,39,32,59,72,84)
> quantity2 = c(128,169,29,35,30,57,67,78)
> sales = c(1200,1400,800,600,400,2900,3000,490 )
> frame1 = data.frame(index = index, product=product, quantity=quantity1)
> frame2 = data.frame(index=index, product=product, quantity=quantity2)
> frame3 = data.frame(index=index, product=product, sales=sales)
> frame1
  index   product quantity
1     1     wheat      118
2     2      rice      179
3     3    millet       24
4     4      ragi       39
5     5      corn       32
6     6    pulses       59
7     7      meat       72
8     8 sugarCane       84
> frame2
  index   product quantity
1     1     wheat      128
2     2      rice      169
3     3    millet       29
4     4      ragi       35
5     5      corn       30
6     6    pulses       57
7     7      meat       67
8     8 sugarCane       78
> frame3
  index   product sales
1     1     wheat  1200
2     2      rice  1400
3     3    millet   800
4     4      ragi   600
5     5      corn   400
6     6    pulses  2900
7     7      meat  3000
8     8 sugarCane   490

To join two frames vertically one below the other (row binding), use the rbind() function. For this, the two data frames must have the same variables (i.e., column names), though they need not be present in the same order:

> vbframe = rbind(frame1, frame2)
> vbframe
   index   product quantity
1      1     wheat      118
2      2      rice      179
3      3    millet       24
4      4      ragi       39
5      5      corn       32
6      6    pulses       59
7      7      meat       72
8      8 sugarCane       84
9      1     wheat      128
10     2      rice      169
11     3    millet       29
12     4      ragi       35
13     5      corn       30
14     6    pulses       57
15     7      meat       67
16     8 sugarCane       78

To join two data frames horizontally (column binding), we use the cbind() function. For this, they should have the same number of rows, and the variables (column names) can be the same or different:

> hbframe = cbind(frame1, frame2)
> hbframe
  index   product quantity index   product quantity
1     1     wheat      118     1     wheat      128
2     2      rice      179     2      rice      169
3     3    millet       24     3    millet       29
4     4      ragi       39     4      ragi       35
5     5      corn       32     5      corn       30
6     6    pulses       59     6    pulses       57
7     7      meat       72     7      meat       67
8     8 sugarCane       84     8 sugarCane       78
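
Two details of binding are worth seeing in a small case. The sketch below uses short illustrative frames a and b (not the frames above) to show that rbind() lines up columns by name even when their order differs, and that cbind() with a single named vector adds just one column, which avoids the repeated column names seen in hbframe.

```r
a <- data.frame(index = 1:2, product = c("wheat", "rice"),
                quantity = c(118, 179))

# Same columns as 'a' but in a different order
b <- data.frame(quantity = c(24, 39), index = 3:4,
                product = c("millet", "ragi"))

# rbind() matches columns by name and keeps the order of the first frame
stacked <- rbind(a, b)
print(stacked)

# cbind() with a single named vector adds one new column
widened <- cbind(a, sales = c(1200, 1400))
print(names(widened))     # "index" "product" "quantity" "sales"
```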

Merging data frames

To merge two data frames horizontally, use the merge() function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).


We will now merge the frame1 and the frame3 created before. These two frames have two common variables namely "index" and "product".



First we merge them by the variable "product":

> mrgA = merge(frame1, frame3, by="product")
> mrgA
    product index.x quantity index.y sales
1      corn       5       32       5   400
2      meat       7       72       7  3000
3    millet       3       24       3   800
4    pulses       6       59       6  2900
5      ragi       4       39       4   600
6      rice       2      179       2  1400
7 sugarCane       8       84       8   490
8     wheat       1      118       1  1200

Carefully note the following fact in the above merged frame "mrgA": since the two frames merged by "product" both have a variable called "index", the merged frame distinguishes the two copies by renaming them "index.x" and "index.y".

We can also merge by more than one variable. For example, we can merge the frame1 and frame3 by the two common variables "index" and "product" as shown below:

> mrgB = merge(frame1, frame3, by=c("index","product"))
> mrgB
  index   product quantity sales
1     1     wheat      118  1200
2     2      rice      179  1400
3     3    millet       24   800
4     4      ragi       39   600
5     5      corn       32   400
6     6    pulses       59  2900
7     7      meat       72  3000
8     8 sugarCane       84   490
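
merge() has two further arguments that are often useful: by.x and by.y name the key column separately for each frame when the keys are called differently, and suffixes replaces the default ".x" and ".y" labels. The frames stock, price, f1 and f3 below are hypothetical, made up for this sketch.

```r
# Hypothetical frames whose key columns have different names
stock <- data.frame(item = c("wheat", "rice", "corn"),
                    quantity = c(118, 179, 32))
price <- data.frame(product = c("rice", "corn", "wheat"),
                    sales = c(1400, 400, 1200))

# by.x and by.y name the key column in the first and second frame
merged <- merge(stock, price, by.x = "item", by.y = "product")
print(merged)

# suffixes= replaces the default ".x" / ".y" labels on clashing columns
f1 <- data.frame(index = 1:2, product = c("wheat", "rice"),
                 quantity = c(118, 179))
f3 <- data.frame(index = 1:2, product = c("wheat", "rice"),
                 sales = c(1200, 1400))
labelled <- merge(f1, f3, by = "product", suffixes = c(".qty", ".sales"))
print(names(labelled))
# "product" "index.qty" "quantity" "index.sales" "sales"
```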


Different types of merging like Outer join, Left outer, Right outer and Cross join are demonstrated below with new data frames called df1 and df2 :


> df1 = data.frame(experimentID = c(1:6), genes=c("g1","g1","g1","g2","g2","g2"))
> df2 = data.frame(experimentID = c(1,3,5), tissues = c("heart","heart","liver"))
> df1
  experimentID genes
1            1    g1
2            2    g1
3            3    g1
4            4    g2
5            5    g2
6            6    g2
> df2
  experimentID tissues
1            1   heart
2            3   heart
3            5   liver


Outer join :
> OJ = merge(x = df1, y = df2, by = "experimentID", all = TRUE)
> OJ
  experimentID genes tissues
1            1    g1   heart
2            2    g1    <NA>
3            3    g1   heart
4            4    g2    <NA>
5            5    g2   liver
6            6    g2    <NA>


Left Outer :
> LO = merge(x = df1, y = df2, by = "experimentID", all.x = TRUE)
> LO
  experimentID genes tissues
1            1    g1   heart
2            2    g1    <NA>
3            3    g1   heart
4            4    g2    <NA>
5            5    g2   liver
6            6    g2    <NA>


Right Outer :
> RO = merge(x = df1, y = df2, by = "experimentID", all.y = TRUE)
> RO
  experimentID genes tissues
1            1    g1   heart
2            3    g1   heart
3            5    g2   liver


Cross Join :
> CJ = merge(x = df1, y = df2, by = NULL)
> CJ
   experimentID.x genes experimentID.y tissues
1               1    g1              1   heart
2               2    g1              1   heart
3               3    g1              1   heart
4               4    g2              1   heart
5               5    g2              1   heart
6               6    g2              1   heart
7               1    g1              3   heart
8               2    g1              3   heart
9               3    g1              3   heart
10              4    g2              3   heart
11              5    g2              3   heart
12              6    g2              3   heart
13              1    g1              5   liver
14              2    g1              5   liver
15              3    g1              5   liver
16              4    g2              5   liver
17              5    g2              5   liver
18              6    g2              5   liver
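
A quick way to compare the join types is to count the rows each one returns. The sketch below rebuilds df1 and df2 and adds the default inner join, which keeps only the IDs present in both frames. Because every ID in df2 also occurs in df1, the inner join here has the same rows as the right outer join shown above. The names inner, outer and cross are illustrative.

```r
df1 <- data.frame(experimentID = 1:6,
                  genes = c("g1", "g1", "g1", "g2", "g2", "g2"))
df2 <- data.frame(experimentID = c(1, 3, 5),
                  tissues = c("heart", "heart", "liver"))

inner <- merge(df1, df2, by = "experimentID")              # default: inner join
outer <- merge(df1, df2, by = "experimentID", all = TRUE)  # outer join
cross <- merge(df1, df2, by = NULL)                        # every row pairing

print(c(inner = nrow(inner), outer = nrow(outer), cross = nrow(cross)))
# inner outer cross
#     3     6    18

# Rows of df1 without a partner in df2 carry NA in the tissues column
print(outer$experimentID[is.na(outer$tissues)])   # 2 4 6
```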