Biostatistics with R

Vectors

A vector in R is an ordered collection of elements. We can have a vector of numbers or a vector of character strings. The elements of a given vector must be of the same data type. If we create a vector that mixes numbers and strings, R converts every element to a string and treats the result as a vector of strings.


Elements of a vector are stored in consecutive memory locations. R provides many built-in functions for manipulating the elements of a vector. Using these functions, we can perform operations like sorting, slicing, growing, splitting and filtering on a vector. The vector, along with its supporting functions, is one of the most powerful data structures in R. Many library functions of R require the input data in the form of vectors.

Creating vectors

A vector is defined by placing a comma-separated list of elements inside parentheses after the letter 'c' and assigning the result to a variable name, as shown below. We can use either the = or the <- operator for the assignment.

> avec <- c(10.2, 5.5, 6.9, 7.2, 8.1)
> avec
[1] 10.2 5.5 6.9 7.2 8.1

The data type of a vector is decided by the data types of the elements it contains. Thus, if all elements of the vector are numbers, the vector takes the type numeric, and we can do many numerical operations with it. However, if one or more elements of the vector are of type string, the entire vector is treated as a string vector, and we cannot perform numerical operations with it.

For example, the vector 'avec' defined above is a numeric vector, while the following two vectors are treated as string vectors:

> avec1 <- c("AEC", "AED", "AAB", "AFC")
> avec2 <- c(10.2, 5.5, "6.9", 7.2, 8.1)
> avec2
[1] "10.2" "5.5" "6.9" "7.2" "8.1"
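The type that R has assigned to a vector can be checked directly. The sketch below uses the base R functions class() and as.numeric(); the variable name avec2_num is only illustrative and does not appear elsewhere in this text.

```r
avec  <- c(10.2, 5.5, 6.9, 7.2, 8.1)
avec2 <- c(10.2, 5.5, "6.9", 7.2, 8.1)

class(avec)    # "numeric"
class(avec2)   # "character": one string turned every element into a string

# Convert the strings back to numbers so that arithmetic works again
avec2_num <- as.numeric(avec2)
class(avec2_num)   # "numeric"
sum(avec2_num)     # same value as sum(avec)
```

Note that as.numeric() returns NA, with a warning, for any string that does not look like a number.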

A vector can be assigned in several ways. We can use the assign() function instead of the above syntax. All of the following assignments define a vector named "vec" with elements (10.2, 5.5, 6.9, 7.2, 8.1):

> vec <- c(10.2, 5.5, 6.9, 7.2, 8.1)
> assign("vec", c(10.2, 5.5, 6.9, 7.2, 8.1) )
> vec = c(10.2, 5.5, 6.9, 7.2, 8.1)
> c(10.2, 5.5, 6.9, 7.2, 8.1) -> vec

Accessing vector elements

The individual elements of a vector can be accessed by subscripting the element number inside the square bracket. The subscripting starts with 1.

> x <- c(10,20,30,40,50,60,70,80,90,100,110,120,130)
> x[3]
[1] 30
> x[6]
[1] 60
> y = x[3] + x[6]
> y
[1] 90

In order to access specific elements of a vector, give the element indices as a vector inside square brackets. Thus, if we want to create a sub-vector 'z' with element numbers 1,3 and 6 from vector x defined above,

> z = x[c(1,3,6)]
> z
[1] 10 30 60

We can also access consecutive elements of a vector by specifying the start and end element indices separated by a colon inside square brackets. The following statement creates a subset vector 'z' with elements 4 to 9 of vector x:

> z = x[c(4:9)]
> z
[1] 40 50 60 70 80 90

Operations on a vector

Once a vector is defined, a basic mathematical operation like addition, subtraction, multiplication or division performed on it is applied to each of its elements individually, resulting in another vector. See the examples below:

> vstr <- c(1,2,3,4,5,6,7,8,9)
> vstr + 100
[1] 101 102 103 104 105 106 107 108 109
> vstr - 100
[1] -99 -98 -97 -96 -95 -94 -93 -92 -91
> vstr*100
[1] 100 200 300 400 500 600 700 800 900
> vstr/100
[1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

Operations between vectors

The algebraic operations between two or more vectors are applied to their individual elements, and a resulting vector is created.
Thus, if we add two vectors of the same length (i.e., both having the same number of elements), their corresponding elements are added to give a new vector. This is illustrated in the following operations between the vectors "vec1" and "vec2":

> vec1 <- c(1.5,2.5,3.5,4.5,5.5,6.5)
> vec2 <- c(10,20,30,40,50,60)
> vec1+vec2
[1] 11.5 22.5 33.5 44.5 55.5 66.5
> vec1-vec2
[1] -8.5 -17.5 -26.5 -35.5 -44.5 -53.5
> vec1*vec2
[1] 15 50 105 180 275 390
> vec1/vec2
[1] 0.1500000 0.1250000 0.1166667 0.1125000 0.1100000 0.1083333
> log(vec2)
[1] 2.302585 2.995732 3.401197 3.688879 3.912023 4.094345

Adding elements to a vector

We can start with an empty vector and add elements to it. The empty vector is created by c(). We can subsequently add elements to it as demonstrated below:

> avec = c()
> avec = c(avec,"ATG","TTG")
> avec
[1] "ATG" "TTG"
> avec = c(avec, "TATATA", "TTTTTAA")
> avec
[1] "ATG" "TTG" "TATATA" "TTTTTAA"

Combining vectors

We can combine two or more vectors to create a new vector as shown here:

> v1 = c(10,20,30)
> v2 = c(100,200,300,400)
> v3 = c(1000,2000,3000)
> combvec = c(v1,v2,v3)
> combvec
[1] 10 20 30 100 200 300 400 1000 2000 3000

Equations with vectors

An algebraic equation written with a vector is applied to its individual elements, resulting in a new vector. Thus, if 'x' is a vector of numbers, the equation $y = 3x^3 - 4x^2 + 5x - 6$ can be evaluated for each element of 'x', resulting in a corresponding 'y' vector. See below:

> x = c(1, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
> y = (3*x^3) - (4*x^2) + (5*x) - 6
> y
[1] -2.000 2.625 12.000 28.375 54.000 91.125 142.000
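When the same equation is needed more than once, it can be wrapped in a function that accepts a whole vector. The following is a minimal sketch; the function name cubic is hypothetical, and the constant term is -6, as in the code above.

```r
# Hypothetical helper: evaluates y = 3x^3 - 4x^2 + 5x - 6 for every element of x
cubic <- function(x) {
  3 * x^3 - 4 * x^2 + 5 * x - 6
}

x <- c(1, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
y <- cubic(x)   # one y value for each x value
y

# The same values, computed one element at a time for comparison
y_loop <- sapply(x, cubic)
all.equal(y, y_loop)   # TRUE
```

Because the arithmetic operators already work element by element, no explicit loop is needed inside the function.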

Generating sequences

It is very easy to generate a sequence of numbers in R. Use the seq() function to generate a sequence from a start number to an end number. The function call below returns the sequence of integers from 1 to 50, with the default step size of 1:

> sq <- seq(1,50)
> sq
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

A sequence can also be generated in steps larger than 1. For example, use the following syntax to generate a sequence from 1 to 50 in steps of 5:

> sq <- seq(1,50,5)
> sq
[1] 1 6 11 16 21 26 31 36 41 46

The sequence can be generated in reverse by starting from the larger number and flipping the sign of the step:

> sq <- seq(50,1,-5)
> sq
[1] 50 45 40 35 30 25 20 15 10 5
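The arguments of seq() can also be given by name, which makes the call easier to read. The sketch below uses the by and length.out arguments of base R's seq(); the variable names s_by, s_pos and s_len are illustrative only.

```r
# Step size given by name; same result as the positional form seq(1, 50, 5)
s_by  <- seq(1, 50, by = 5)
s_pos <- seq(1, 50, 5)
identical(s_by, s_pos)   # TRUE

# Ask for a fixed number of equally spaced values instead of a step size
s_len <- seq(0, 1, length.out = 5)
s_len   # 0.00 0.25 0.50 0.75 1.00

# The colon operator is a shorthand for steps of 1, in either direction
1:10
10:1
```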

Handling missing data

When data is collected, there is a possibility that one or more entries are 'missing', because the information on them is not available. As an example, in a data set consisting of the ages of 10 students, the age of two of them may be missing, while for all others it is available. See the data set below:


Student # Age(years)
1 17
2 18
3 17
4 19
5 missing data
6 17
7 20
8 missing data
9 16
10 22

We can have our own strategy to deal with the missing data in the downstream analysis. For example, we may replace the missing numbers by the average value of the data computed without them, or we may fill the missing numbers with zero.

But how do we indicate the missing data in a vector?

Vectors in R can handle missing values. A missing value is represented by the symbol NA. For example, the above mentioned data points can be represented as a numeric vector with two NA (missing) values:

> x = c(17, 18, 17, 19, NA, 17, 20, NA, 16, 22)
> x
[1] 17 18 17 19 NA 17 20 NA 16 22

Note that NA is used as a symbol, and not as a string. Vector operations and functions are aware of missing values. For example, if we multiply the vector x created above by a constant, only its genuine numbers are multiplied, and the missing values remain NA:

> x*10
[1] 170 180 170 190 NA 170 200 NA 160 220

We can identify the missing values in a vector and replace them with another value. A function called is.na() takes a vector as input, and returns a corresponding vector of TRUE or FALSE values. The positions of the input vector that hold missing values have TRUE in the new vector, and all other positions have the value FALSE. The logical NOT operation !is.na() returns the complementary values of is.na(). See here:

> x = c(17, 18, 17, 19, NA, 17, 20, NA, 16, 22)
> is.na(x)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
> !is.na(x)
[1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE

To get all the NA elements of a vector x, we use the vector of TRUE and FALSE values returned by is.na(x) as indices of x:

> x[is.na(x)]
[1] NA NA

To get all the non-NA elements of x, we use the vector of TRUE and FALSE values returned by !is.na(x) as indices of x:

> x[!is.na(x)]
[1] 17 18 17 19 17 20 16 22
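The same TRUE/FALSE indexing answers other common questions about missing data, such as how many values are missing and where they are. The sketch below uses the base R functions sum() and which(); the variable names n_missing, na_pos and older are illustrative.

```r
x <- c(17, 18, 17, 19, NA, 17, 20, NA, 16, 22)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of missing values
n_missing <- sum(is.na(x))   # 2

# Positions of the missing values
na_pos <- which(is.na(x))    # 5 8

# Any logical vector can be used as an index, for example a comparison;
# combine it with !is.na() so that NA entries are not selected
older <- x[!is.na(x) & x > 17]   # 18 19 20 22
```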

The missing values in a vector can be replaced by zeroes (or any other value) as shown below:

> x = c(1.5, 2.6, 4.3, NA, 2.2, 5.9, 6.0, NA, 1.2)
> x[is.na(x)] <- 0
> x
[1] 1.5 2.6 4.3 0.0 2.2 5.9 6.0 0.0 1.2
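The other strategy mentioned earlier, replacing missing numbers by the average of the available data, follows the same pattern. This is a minimal sketch using the student ages from the table above; the variable names ages, avg_age and ages_filled are illustrative, and mean() is called with its na.rm argument so that the NA values are left out of the average.

```r
ages <- c(17, 18, 17, 19, NA, 17, 20, NA, 16, 22)

# Mean of the available values only (na.rm = TRUE drops the NA entries)
avg_age <- mean(ages, na.rm = TRUE)   # 18.25

# Work on a copy so that the original data stay untouched
ages_filled <- ages
ages_filled[is.na(ages_filled)] <- avg_age
ages_filled
```

Whether zero, the mean or some other value is an appropriate replacement depends on the analysis that follows.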

To get the number of elements in a vector

The number of elements in a vector (called the vector length) is returned by the function length().

> xx = c('a','e','r','s','k','g','f')
> L = length(xx)
> L
[1] 7

Sorting a vector

A vector can be sorted in ascending or descending order by the sort() function, which returns the sorted vector. By default, the vector is sorted in ascending order:

> x = c(12,2,34,67,22,55,123)
> sor = sort(x)
> sor
[1] 2 12 22 34 55 67 123

A vector can be sorted in descending order by setting the boolean parameter decreasing to the value TRUE:

> x = c(12,2,34,67,22,55,123)
> ys = sort(x, decreasing=TRUE)
> ys
[1] 123 67 55 34 22 12 2

The default call to the sort() function ignores the NA values present in the vector:

> y = c(12,2,34,NA,67,29,NA,NA,45,99)
> sr = sort(y)
> sr
[1] 2 12 29 34 45 67 99

In case we want to include the NA values while sorting a vector, we have two choices: the NA values can be placed either at the beginning or at the end of the sorted vector. This is achieved by a boolean parameter called na.last. If it is TRUE, the NA values are placed at the end of the sorted vector. If it is FALSE, they are placed at the beginning. If this parameter is not used, the NA values are dropped from the sorted vector. See the code below:

> y = c(12,2,34,NA,67,29,NA,NA,45,99)
> sort(y)
[1] 2 12 29 34 45 67 99
> sort(y, na.last=TRUE)
[1] 2 12 29 34 45 67 99 NA NA NA
> sort(y, na.last=FALSE)
[1] NA NA NA 2 12 29 34 45 67 99

Get the maximum and minimum values of a numeric vector

We can get the maximum and minimum values among the elements of a vector by calling min() and max() functions:

> vec <- c(8.9, 1.5, 3.4, 6.7, 12.8, 7.4)
> max(vec)
[1] 12.8
> min(vec)
[1] 1.5

If one or more elements of a vector are NA, the max() and min() functions return NA as the maximum and minimum values respectively. We can tell max() and min() to drop the NA values before finding the maximum or minimum by setting the parameter na.rm to TRUE. Then only the valid numbers are compared, and the NA values are dropped. See here:

> y = c(12,2,34,NA,67,29,NA,NA,45,99)
> max(y)
[1] NA
> min(y)
[1] NA
> max(y, na.rm=TRUE)
[1] 99
> min(y, na.rm=TRUE)
[1] 2
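Two related base R helpers are often useful alongside max() and min(). The sketch below assumes we also want both extremes in one call, via range(), and the position of each extreme, via which.max() and which.min(); the variable names r, pos_max and pos_min are illustrative.

```r
y <- c(12, 2, 34, NA, 67, 29, NA, NA, 45, 99)

# range() returns the minimum and the maximum together
r <- range(y, na.rm = TRUE)   # 2 99

# which.max() and which.min() give the position of the extreme value;
# they skip NA entries on their own
pos_max <- which.max(y)   # 10
pos_min <- which.min(y)   # 2
y[pos_max]                # 99
```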

Create a character vector with indexed strings

We can create a character vector with indexed strings like "X1", "X2", "X3", etc. as follows:

> labs <- paste( c("X"), 1:20, sep="")
> labs
[1] "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10" "X11" "X12"
[13] "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20"

How does this work? The expression 1:20 creates a vector holding the sequence 1 to 20 in steps of 1. When this vector is pasted to the single character string "X", the string is reused for each of the 20 numbers, creating the vector elements "X1", "X2", ..., "X20".

Similarly, when two strings are supplied, they are reused in turn across the 20 numbers:


> labs <- paste( c("X","Y"), 1:20, sep="")
> labs
[1] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10" "X11" "Y12"
[13] "X13" "Y14" "X15" "Y16" "X17" "Y18" "X19" "Y20"
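The alternating pattern arises because paste() reuses the shorter vector until it matches the length of the longer one. The sketch below makes that reuse explicit with rep(), using a shorter sequence 1:6 for readability; the variable names prefix, labs_long and labs_short are illustrative.

```r
# rep() makes the reuse explicit: "X" and "Y" alternate until 6 values exist
prefix <- rep(c("X", "Y"), length.out = 6)
prefix   # "X" "Y" "X" "Y" "X" "Y"

# Pasting the expanded prefixes gives the same labels as the short form
labs_long  <- paste(prefix, 1:6, sep = "")
labs_short <- paste(c("X", "Y"), 1:6, sep = "")
identical(labs_long, labs_short)   # TRUE

# paste0() is a shorthand for paste(..., sep = "")
paste0("X", 1:6)
```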

Removing elements from a vector using index

A particular element or a set of elements can be removed from a vector by specifying the element index with a negative sign inside square bracket. See below:

> x = c(5,10,15,20,25,30,35,40)
> yr = x[-2]
> yr
[1] 5 15 20 25 30 35 40

Here, x[-2] has removed the second element of x, which is 10.


In order to remove consecutive elements, we give the start and end element locations, each with a negative sign. For example, to remove elements 2, 3, 4 and 5 from x,

> ya = x[-2:-5]
> ya
[1] 5 30 35 40

Specific elements can be removed by specifying the corresponding indices as a vector inside square bracket:

> x = c(5,10,15,20,25,30,35,40)
> y = x[c(-2,-4,-7)]
> y
[1] 5 15 25 30 40
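Negative indexing returns a new vector and leaves the original untouched, which is why the examples above assign the result to a new name. The sketch below shows this, and how to drop the elements from x itself by assigning the result back.

```r
x <- c(5, 10, 15, 20, 25, 30, 35, 40)

y <- x[-2]    # y lacks the second element
length(x)     # still 8: x itself is unchanged
length(y)     # 7

# To drop the elements from x itself, assign the result back to x
x <- x[-c(2, 4, 7)]   # same as x[c(-2, -4, -7)]
x                     # 5 15 25 30 40
```

Positive and negative indices cannot be mixed inside one pair of square brackets; R reports an error in that case.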