
Biostatistics with R

Data manipulation with vectors

Creating subsets of vectors

We learnt that a vector's name is treated as a variable in R, and that any operation performed on it is applied to each of its elements separately. Thus, if v is a vector, an operation on v yields a new vector of element-wise results.

Similarly, if a logical condition is applied to a vector x, it is applied to each element of x, resulting in a vector of TRUE or FALSE values, one for every element of x.

As an example, if x is a vector of numbers, then the statement x > 12 will check whether every element of x is greater than 12. It will accordingly generate a vector of TRUE or FALSE boolean values. See here:


 >  x = c(8,10,12,7,14,16,2,4,9,19,20,3,6)
 > 
 >  x > 12


 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[13] FALSE

If the above vector of TRUE and FALSE values is placed inside the square brackets of vector x, the elements of x corresponding to the TRUE values are extracted into a new vector. Carefully note the following code and its output:


 >  x = c(8,10,12,7,14,16,2,4,9,19,20,3,6)
 >  
 >  x > 12


 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[13] FALSE


 >  y = x[x>12]
 >  
 >  y

[1] 14 16 19 20

The logical statement inside the square brackets can be compound. Thus, if we want to select the elements of vector x whose values are more than 10 and less than 20, we use


 >  x[ (x>10) & (x<20) ]

[1] 12 14 16 19
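Conditions can also be combined with the logical OR operator |. The following is a minimal sketch on the same vector x; the variable names "ends" and "n_mid" are only illustrative and do not appear elsewhere in this tutorial:

```r
x <- c(8, 10, 12, 7, 14, 16, 2, 4, 9, 19, 20, 3, 6)

# elements that are either below 7 OR above 16
ends <- x[(x < 7) | (x > 16)]
print(ends)    # 2 4 19 20 3 6

# sum() of a logical vector counts its TRUE values,
# so this counts the elements between 10 and 20
n_mid <- sum((x > 10) & (x < 20))
print(n_mid)   # 4
```

Note that & and | work element by element, which is what is needed inside the square brackets.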

In another example, we create a vector of numbers with some missing values (i.e., NA). We will apply a filter to select the elements which are not NA and at the same time have values below 100, and write them into another vector. In a second operation, we will replace all the NA values in the original vector itself with 0.

The script below achieves this:



tarray <- c(2, 7, 29, 32, 41, 11, 15, NA, NA, 55, 32, NA, 42, 109)

karray <- tarray[ !is.na(tarray) & (tarray < 100) ]

tarray[is.na(tarray)] <- 0

print("Filter with NA's and numbers greater than 100 removed:")
print(karray)

print("Filter with NA's replaced by 0")
print(tarray)

When the above code lines are executed in an R script, the following output is created.


[1] "Filter with NA's and numbers greater than 100 removed:"
 [1]  2  7 29 32 41 11 15 55 32 42
[1] "Filter with NA's replaced by 0"
 [1]   2   7  29  32  41  11  15   0   0  55  32   0  42 109

In the above script, the statement tarray[ !is.na(tarray) & (tarray < 100) ] selects elements of vector "tarray" that are not NA's and at the same time less than 100. The statement tarray[is.na(tarray)] <- 0 assigns the value 0 to the elements of vector "tarray" that are missing values (NA's). After this, all NA's in vector "tarray" are replaced by 0.
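To see why the !is.na() check matters, the sketch below applies the condition on its own. A comparison with NA yields NA, and indexing with NA returns NA, so the missing values stay in the result. The names "with_na" and "no_na" are illustrative only:

```r
tarray <- c(2, 7, 29, 32, 41, 11, 15, NA, NA, 55, 32, NA, 42, 109)

# without the is.na() check, the NA positions come back as NA
with_na <- tarray[tarray < 100]
print(with_na)

# which() keeps only the positions where the condition is TRUE,
# so NA's are dropped without an explicit is.na() test
no_na <- tarray[which(tarray < 100)]
print(no_na)
```

Wrapping the condition in which() is one alternative to !is.na(); both give the same filtered vector here.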

Union and intersection of vectors

If two vectors x and y represent two sets of elements, their union and intersection operations can be performed by calling the functions union(x,y) and intersect(x,y) as demonstrated here:


 >  x = c('A','B','C','D','E')
 >  y = c('D','E','K','L','S','P')

 >  zu = union(x,y)

 >  zu

[1] "A" "B" "C" "D" "E" "K" "L" "S" "P"


 >  zi = intersect(x,y)
 > 
 >  zi

[1] "D" "E"
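Two related set operations are worth a quick look: setdiff(x,y) gives the elements of x that are not in y, and the %in% operator tests membership element by element. A short sketch with the same two vectors (the names "only_x" and "in_y" are illustrative):

```r
x <- c('A', 'B', 'C', 'D', 'E')
y <- c('D', 'E', 'K', 'L', 'S', 'P')

# elements of x that are absent from y
only_x <- setdiff(x, y)
print(only_x)   # "A" "B" "C"

# for every element of x, is it present in y?
in_y <- x %in% y
print(in_y)     # FALSE FALSE FALSE TRUE TRUE

# indexing x with this logical vector reproduces the intersection
print(x[in_y])  # "D" "E"
```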

Computing the differences between elements of a vector

Given a vector of numbers, the differences between either successive or periodically positioned elements can be obtained using a function called diff(). This returns the result as a vector.

The simplest form is diff(x,n), where 'x' is the vector of numbers and 'n' is the lag, i.e. the spacing between the two elements being subtracted. For example, n=1 (the default) gives the differences between successive elements, and n=2 gives the differences between elements two positions apart.


 >  x = c(4,8,11,14,35,56,120,30)
 >  
 >  diff(x)

[1] 4 3 3 21 21 64 -90


 >  diff(x,2)

[1] 7 6 24 42 85 -26
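What diff() computes for a given lag can be reproduced by subtracting two shifted copies of the vector. This is only a sketch to illustrate the idea; the names "k", "n" and "manual" are illustrative:

```r
x <- c(4, 8, 11, 14, 35, 56, 120, 30)

k <- 2            # the lag
n <- length(x)

# subtract each element from the one k positions after it
manual <- x[(k + 1):n] - x[1:(n - k)]
print(manual)     # 7 6 24 42 85 -26

# compare with the built-in function
print(isTRUE(all.equal(manual, diff(x, lag = k))))
```

The result has n - k elements, which is why diff(x,2) above returns six values for an eight-element vector.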

Cumulative sum and product of vector elements

Given a vector of numbers, we can find the cumulative sum and cumulative product up to each element using the cumsum() and cumprod() functions, as demonstrated below:


 >  x = c(1, 3, 5, 4, 6, 8, 2)
 >  
 >  cumsum(x)

[1] 1 4 9 13 19 27 29


 >  
 >  cumprod(x)

[1] 1 3 15 60 360 2880 5760
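As a quick consistency check, the last element of the cumulative sum equals sum(x), and diff() undoes cumsum(). A small sketch (the names "cs" and "recovered" are illustrative):

```r
x <- c(1, 3, 5, 4, 6, 8, 2)

cs <- cumsum(x)

# the last cumulative value is the total of the vector
print(cs[length(cs)] == sum(x))   # TRUE

# first cumulative value followed by successive differences
# gives back the original vector
recovered <- c(cs[1], diff(cs))
print(recovered)                  # 1 3 5 4 6 8 2
```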

Finding Unique elements of a vector

A vector may contain multiple copies of the same element(s). We can get the list of unique elements in the vector by calling the unique() function:


 >  x = c('a','b','a','c','e','f','c','g','h')
 >  
 >  unique(x)

[1] "a" "b" "c" "e" "f" "g" "h"

Finding duplicated elements of a vector

The duplicated() function identifies the repeated elements in a vector. It returns a vector of TRUE or FALSE values, one for each location in the input. If the element in a location has already appeared earlier in the vector, the value is TRUE, else it is FALSE. See below:


 >  x = c('a','b','a','c','e','f','c','g','h')
 >  
 >  duplicated(x)

[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE

From this we can extract the actual duplicated elements by using this boolean vector as the index of x:


 >   x[duplicated(x)]

[1] "a" "c"

To extract the non-duplicated elements, use the logical NOT operator:


 >   x[!duplicated(x)]

[1] "a" "b" "c" "e" "f" "g" "h"

Note that the "not duplicated" subset is the same as the "unique" subset obtained before.
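By default duplicated() scans from the start of the vector, so the first copy of a repeated element is not flagged. To pick out the elements that occur exactly once, one possible approach is to combine a forward scan with a backward scan using the fromLast argument. A sketch (the name "once" is illustrative):

```r
x <- c('a', 'b', 'a', 'c', 'e', 'f', 'c', 'g', 'h')

# unique() and the "not duplicated" subset agree
print(identical(unique(x), x[!duplicated(x)]))   # TRUE

# flag every copy of a repeated element: scan from both ends
repeated <- duplicated(x) | duplicated(x, fromLast = TRUE)

# elements that occur exactly once
once <- x[!repeated]
print(once)   # "b" "e" "f" "g" "h"
```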

Creating a frequency table

Given a vector, we can get the frequency table of its elements with the table() function. This table function returns a data structure that can be converted to a data frame, and the frequency information can be extracted from there:


 >  xx = c('A','T','A','G','T','A','T','C','C','A','T','T','G')
 > 
 >  tab = table(xx)
 >  
 >  tab


xx
A C G T 
4 2 2 5


 >  
 >  dat = as.data.frame(tab)
 >  
 >  dat


  xx Freq
1  A    4
2  C    2
3  G    2
4  T    5

The function call as.data.frame(tab) converts the object "tab" into a proper data frame. Now dat$xx contains the unique data members (stored as a factor) and dat$Freq is a vector of their frequencies.
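The counts can also be read directly from the table without converting it to a data frame. The sketch below copies them into a named vector first; the names "counts" and "most_common" are illustrative:

```r
xx <- c('A', 'T', 'A', 'G', 'T', 'A', 'T', 'C', 'C', 'A', 'T', 'T', 'G')

tab <- table(xx)

# copy the counts into a plain named integer vector
counts <- setNames(as.vector(tab), names(tab))

# frequency of a single category, looked up by name
print(counts[["T"]])     # 5

# the category with the highest count
most_common <- names(counts)[which.max(counts)]
print(most_common)       # "T"

# relative frequencies
print(counts / sum(counts))
```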

Getting the index of a vector element

Given a vector, suppose we want to know the index of a particular element, that is, its position in the vector (third, fourth, etc.) given its value. We can get this with the help of a function called which(). See here:


 >  x = c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh")
 >  
 >  which(x=="ddd")

[1] 4

In the above vector x, the element "ddd" is in the fourth position. Therefore, the above call to which() returns an integer 4.

If a particular element is present in the vector more than once, the which() function returns a vector containing the indices of all the locations of that element in the input vector:


 >  dat = c("ATG","TAG","ATG","TTA","TGC","ATT","ATG", "GGG")
 >  
 >  d = which(dat=="ATG")
 >  
 >  d

[1] 1 3 7
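Two variations on this lookup can be sketched as follows: match() returns only the first position of a value, and which() combined with %in% locates any of several values at once. The names "first_atg", "several" and "absent" are illustrative:

```r
dat <- c("ATG", "TAG", "ATG", "TTA", "TGC", "ATT", "ATG", "GGG")

# position of the first occurrence only
first_atg <- match("ATG", dat)
print(first_atg)   # 1

# positions of elements equal to any of the listed values
several <- which(dat %in% c("TAG", "GGG"))
print(several)     # 2 8

# a value that is not present gives an empty index vector
absent <- which(dat == "CCC")
print(length(absent))   # 0
```

Checking the length of the result is a simple way to test whether an element is present at all.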

Creating subsets of data frames

From a data frame, a subset can be created using the subset() function by applying conditions on one or more columns.

For example, suppose a data frame called "datframe" has many columns, one of which is named "npcol". Then the statement


subdata <- subset(datframe, datframe$npcol > 30.0)

will select all the rows of datframe in which npcol is greater than 30 to create a new data frame called "subdata".

In the example code below, we create a data frame of (imaginary) experimental data. This data set has 7 rows of gene entries (the name "gene-5" appears twice), with measurements from 7 experiments.
We then use the subset() function to create subsets of this data by filtering on individual column values.
The comments in the code explain each step.



# creating a vector of gene names
 genes = c("gene-1","gene-2","gene-3","gene-4","gene-5","gene-5","gene-6")

 # creating a vector of gender 
 gender = c("M", "M", "F", "M", "F", "F", "M")

 # creating 7 data vectors with experimental results
 result1 = c(12.3, 11.5, 13.6, 15.4, 9.4, 8.1, 10.0)
 result2 = c(22.1, 25.7, 32.5, 42.5, 12.6, 15.5, 17.6)
 result3 = c(15.5, 13.4, 11.5, 21.7, 14.5, 16.5, 12.1)
 result4 = c(14.4, 16.6, 45.0, 11.0, 9.7, 10.0, 12.5)
 result51 = c(12.2, 15.5, 17.4, 19.4, 10.2, 9.8, 9.0)
 result52 = c(13.3, 14.5, 21.6, 17.9, 15.6, 14.4, 12.0)
 result6 = c(11.0, 10.0, 12.2, 14.3, 23.3, 19.8, 13.4)

 # creating a data frame with this data.
 # genes along rows, results along columns
 datframe = data.frame(genes,gender,result1,result2,result3,result4,
                        result51,result52,result6)

 # adding column names to data frame
 names(datframe) = c("GeneName", "Gender", "expt1", "expt2", "expt3", "expt4",
                                "expt51", "expt52", "expt6")

 # creating subset of data with expt2 values above 20
 subframe1 = subset(datframe, datframe$expt2 > 20)

 # creating a subset of data with only Female gender
 subframe2 = subset(datframe, datframe$Gender == "F")

 # creating a subset with male gender for which expt2 is less than 30 
 subframe3 = subset(datframe, (datframe$Gender == "M")&(datframe$expt2 < 30.0) )

 # printing the data frames
 print("subframe1 : Rows with expt2 > 20")
 print(subframe1)
 
 print("subframe2 : Rows with gender Female")
 print(subframe2)
 
 print("subframe3 : Rows with Male gender and expt2 < 30.0")
 print(subframe3)

When the above code lines are executed in an R script, we get the following output:


[1] "subframe1 : Rows with expt2 > 20"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
1   gene-1      M  12.3  22.1  15.5  14.4   12.2   13.3  11.0
2   gene-2      M  11.5  25.7  13.4  16.6   15.5   14.5  10.0
3   gene-3      F  13.6  32.5  11.5  45.0   17.4   21.6  12.2
4   gene-4      M  15.4  42.5  21.7  11.0   19.4   17.9  14.3
[1] "subframe2 : Rows with gender Female"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
3   gene-3      F  13.6  32.5  11.5  45.0   17.4   21.6  12.2
5   gene-5      F   9.4  12.6  14.5   9.7   10.2   15.6  23.3
6   gene-5      F   8.1  15.5  16.5  10.0    9.8   14.4  19.8
[1] "subframe3 : Rows with Male gender and expt2 < 30.0"
  GeneName Gender expt1 expt2 expt3 expt4 expt51 expt52 expt6
1   gene-1      M  12.3  22.1  15.5  14.4   12.2   13.3  11.0
2   gene-2      M  11.5  25.7  13.4  16.6   15.5   14.5  10.0
7   gene-6      M  10.0  17.6  12.1  12.5    9.0   12.0  13.4
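Inside subset() the column names can be written directly, without the datframe$ prefix, and the select argument restricts the columns that are returned. The sketch below uses a shortened, four-row version of the data frame above and compares the result with plain bracket indexing; the names "sub_a" and "sub_b" are illustrative:

```r
# shortened version of the data frame used above
datframe <- data.frame(
  GeneName = c("gene-1", "gene-2", "gene-3", "gene-4"),
  Gender   = c("M", "M", "F", "M"),
  expt1    = c(12.3, 11.5, 13.6, 15.4),
  expt2    = c(22.1, 25.7, 32.5, 42.5)
)

# male rows with expt2 below 30, keeping only two columns
sub_a <- subset(datframe, Gender == "M" & expt2 < 30,
                select = c(GeneName, expt2))
print(sub_a)

# the same selection written with bracket indexing
sub_b <- datframe[datframe$Gender == "M" & datframe$expt2 < 30,
                  c("GeneName", "expt2")]
print(sub_b)
```

Bracket indexing is the more verbose form, but it behaves predictably inside functions, where the column names used by subset() may clash with other variables.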

Joining data frames

In R, data tables are generally loaded as data frames. We can join (bind) two data frames one below the other (vertical binding) or adjacent to each other (horizontal binding), provided the columns or the number of rows match accordingly.

We will create three data frames called frame1, frame2 and frame3 to demonstrate the data set binding.



 >  index = 1:8
 >  
 >  product = c("wheat","rice","millet","ragi","corn","pulses","meat","sugarCane")
 >  
 >  quantity1 = c(118,179,24,39,32,59,72,84) 
 >  
 >  quantity2 = c(128,169,29,35,30,57,67,78)
 >  
 >  sales = c(1200,1400,800,600,400,2900,3000,490 )
 >  
 >  frame1 = data.frame(index = index, product=product, quantity=quantity1)
 >  
 >  frame2 = data.frame(index=index, product=product, quantity=quantity2)
 >  
 >  frame3 = data.frame(index=index, product=product, sales=sales)
 >  
 >  
 >  frame1


  index   product quantity
1     1     wheat      118
2     2      rice      179
3     3    millet       24
4     4      ragi       39
5     5      corn       32
6     6    pulses       59
7     7      meat       72
8     8 sugarCane       84


 >  frame2


  index   product quantity
1     1     wheat      128
2     2      rice      169
3     3    millet       29
4     4      ragi       35
5     5      corn       30
6     6    pulses       57
7     7      meat       67
8     8 sugarCane       78


 >  frame3


  index   product sales
1     1     wheat  1200
2     2      rice  1400
3     3    millet   800
4     4      ragi   600
5     5      corn   400
6     6    pulses  2900
7     7      meat  3000
8     8 sugarCane   490

To join two frames vertically, one below the other (row binding), use the rbind() function. For this, the two data frames must have the same variables (i.e., column names), though they need not be present in the same order:


 >  vbframe = rbind(frame1, frame2)
 >  
 >  vbframe


   index   product quantity
1      1     wheat      118
2      2      rice      179
3      3    millet       24
4      4      ragi       39
5      5      corn       32
6      6    pulses       59
7      7      meat       72
8      8 sugarCane       84
9      1     wheat      128
10     2      rice      169
11     3    millet       29
12     4      ragi       35
13     5      corn       30
14     6    pulses       57
15     7      meat       67
16     8 sugarCane       78
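The point about column order can be checked with two small frames whose columns are arranged differently. This is only a sketch; the frames "a" and "b" and the result "ab" are illustrative and not part of the data above:

```r
a <- data.frame(index = 1:2,
                product = c("wheat", "rice"),
                quantity = c(118, 179))

# same column names, but listed in a different order
b <- data.frame(quantity = c(29, 35),
                product = c("millet", "ragi"),
                index = 3:4)

# rbind() matches the columns of data frames by name
ab <- rbind(a, b)
print(ab)
```

The combined frame takes its column order from the first argument.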

To join two data frames horizontally (column binding), we use the cbind() function. For this, they should have the same number of rows; the variables (column names) can be the same or different:


 >  hbframe = cbind(frame1, frame2)
 >  
 >  hbframe


  index   product quantity index   product quantity
1     1     wheat      118     1     wheat      128
2     2      rice      179     2      rice      169
3     3    millet       24     3    millet       29
4     4      ragi       39     4      ragi       35
5     5      corn       32     5      corn       30
6     6    pulses       59     6    pulses       57
7     7      meat       72     7      meat       67
8     8 sugarCane       84     8 sugarCane       78
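Binding two whole frames leaves repeated column names in the result, which makes later access by name ambiguous. One way around this, sketched below with three-row versions of the frames, is to bind only the column that is needed under a new name. The names "hb_all", "hb" and "quantity2" are illustrative:

```r
frame1 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     quantity = c(118, 179, 24))
frame2 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     quantity = c(128, 169, 29))

# binding whole frames repeats the column names
hb_all <- cbind(frame1, frame2)
print(anyDuplicated(names(hb_all)) > 0)   # TRUE

# bind just the second quantity column, under a distinct name
hb <- cbind(frame1, quantity2 = frame2$quantity)
print(names(hb))   # "index" "product" "quantity" "quantity2"
```

Keep in mind that cbind() pairs rows purely by position, so both frames must already be in the same row order.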

Merging data frames

To merge two data frames horizontally, use the merge() function. In most cases, you join two data frames by one or more common key variables; by default, merge() keeps only the rows whose keys are present in both frames (i.e., an inner join).

We will now merge the frame1 and the frame3 created before. These two frames have two common variables namely "index" and "product".

First we merge them by the variable "product":


 >  mrgA = merge(frame1, frame3, by="product")
 >  
 >  mrgA


    product index.x quantity index.y sales
1      corn       5       32       5   400
2      meat       7       72       7  3000
3    millet       3       24       3   800
4    pulses       6       59       6  2900
5      ragi       4       39       4   600
6      rice       2      179       2  1400
7 sugarCane       8       84       8   490
8     wheat       1      118       1  1200

Carefully note the following fact in the above merged frame "mrgA": since the two frames merged by "product" also have a common variable called "index", the merged frame distinguishes the two copies by renaming them "index.x" and "index.y".
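The default ".x" and ".y" endings can be replaced through the suffixes argument of merge(). A sketch with three-row versions of the frames; the suffix strings ".qty" and ".sales" and the name "mrg" are illustrative choices:

```r
frame1 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     quantity = c(118, 179, 24))
frame3 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     sales = c(1200, 1400, 800))

# name the clashing "index" columns after the frame they came from
mrg <- merge(frame1, frame3, by = "product",
             suffixes = c(".qty", ".sales"))
print(names(mrg))
# "product" "index.qty" "quantity" "index.sales" "sales"
```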

We can also merge by more than one variable. For example, we can merge the frame1 and frame3 by the two common variables "index" and "product" as shown below:


 >  mrgB = merge(frame1, frame3, by=c("index","product")) 
 >  
 >  mrgB


  index   product quantity sales
1     1     wheat      118  1200
2     2      rice      179  1400
3     3    millet       24   800
4     4      ragi       39   600
5     5      corn       32   400
6     6    pulses       59  2900
7     7      meat       72  3000
8     8 sugarCane       84   490

Different types of merging like Outer join, Left outer, Right outer and Cross join are demonstrated below with new data frames called df1 and df2 :


 >  df1 = data.frame(experimentID = c(1:6), genes=c("g1","g1","g1","g2","g2","g2"))
 >  
 >  df2 = data.frame(experimentID = c(1,3,5), tissues = c("heart","heart","liver"))
 >  
 >  df1


  experimentID genes
1            1    g1
2            2    g1
3            3    g1
4            4    g2
5            5    g2
6            6    g2


 >  
 >  df2


  experimentID tissues
1            1   heart
2            3   heart
3            5   liver

Outer join :


 >  OJ = merge(x = df1, y = df2, by = "experimentID", all = TRUE)
 >  
 >  OJ


  experimentID genes tissues
1            1    g1   heart
2            2    g1    <NA>
3            3    g1   heart
4            4    g2    <NA>
5            5    g2   liver
6            6    g2    <NA>

Left Outer :


 >  LO = merge(x = df1, y = df2, by = "experimentID", all.x = TRUE)
 >  
 >  LO


  experimentID genes tissues
1            1    g1   heart
2            2    g1    <NA>
3            3    g1   heart
4            4    g2    <NA>
5            5    g2   liver
6            6    g2    <NA>

Right Outer :


 >  RO = merge(x = df1, y = df2, by = "experimentID", all.y = TRUE)
 >  
 >  RO


  experimentID genes tissues
1            1    g1   heart
2            3    g1   heart
3            5    g2   liver

Cross Join :


 >  CJ = merge(x = df1, y = df2, by = NULL)
 >  
 >  CJ


   experimentID.x genes experimentID.y tissues
1               1    g1              1   heart
2               2    g1              1   heart
3               3    g1              1   heart
4               4    g2              1   heart
5               5    g2              1   heart
6               6    g2              1   heart
7               1    g1              3   heart
8               2    g1              3   heart
9               3    g1              3   heart
10              4    g2              3   heart
11              5    g2              3   heart
12              6    g2              3   heart
13              1    g1              5   liver
14              2    g1              5   liver
15              3    g1              5   liver
16              4    g2              5   liver
17              5    g2              5   liver
18              6    g2              5   liver
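The join types can be compared at a glance by counting the rows each one returns. The sketch below rebuilds df1 and df2 and adds the default inner join for reference; the names "IJ" and "n_rows" are illustrative:

```r
df1 <- data.frame(experimentID = 1:6,
                  genes = c("g1", "g1", "g1", "g2", "g2", "g2"))
df2 <- data.frame(experimentID = c(1, 3, 5),
                  tissues = c("heart", "heart", "liver"))

# inner join: only the experimentID values present in both frames
IJ <- merge(df1, df2, by = "experimentID")
print(IJ)

# number of rows produced by each type of join
n_rows <- c(
  inner = nrow(IJ),
  outer = nrow(merge(df1, df2, by = "experimentID", all = TRUE)),
  left  = nrow(merge(df1, df2, by = "experimentID", all.x = TRUE)),
  right = nrow(merge(df1, df2, by = "experimentID", all.y = TRUE)),
  cross = nrow(merge(df1, df2, by = NULL))
)
print(n_rows)   # inner 3, outer 6, left 6, right 3, cross 18
```

The cross join returns every pairing of rows, so its row count is the product of the two frame sizes (6 x 3 = 18).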

CountBio

Mathematical tools for natural sciences