Biostatistics with R

Data frames

A data frame is a data structure similar to a matrix, with the special feature that different columns can have different data types.

A data frame is very useful for combining vectors of the same length but different data types into a single data structure.

As in a matrix, all the columns of a data frame must have the same number of rows.

Creating a data frame from vectors

A data frame is made up of individual vectors of the same length placed side by side as columns. We can easily create a data frame from vectors using the data.frame() function. We just have to pass the vectors as arguments to this function.

In the example below, we create a data frame called "frm1" from three vectors, "data1", "data2" and "data3". The resulting data frame has columns named "data1", "data2" and "data3":


> data1 <- c("Iron","Sulphur","Calcium", "Magnecium", "Copper")
> data2 <- c(12.5, 32.6, 16.7, 20.6, 7.5)
> data3 <- c(1122, 1123, 1124, 1125, 1126)
>
> frm1 <- data.frame(data1, data2, data3)
>
> frm1
      data1 data2 data3
1      Iron  12.5  1122
2   Sulphur  32.6  1123
3   Calcium  16.7  1124
4 Magnecium  20.6  1125
5    Copper   7.5  1126

In the above example, note that the column names of the data frame 'frm1' are simply the names of the vectors we passed in. A sequence of indices 1, 2, 3, 4 and 5 has been added as row names by default.
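To confirm that the columns really do keep their own data types, the frame can be inspected after it is created. The following is a minimal sketch using the base R functions nrow(), ncol(), sapply() and str(); the variable names n_rows, n_cols and col_classes are chosen here for illustration only.

```r
# Rebuild the example frame so that this block runs on its own.
data1 <- c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper")
data2 <- c(12.5, 32.6, 16.7, 20.6, 7.5)
data3 <- c(1122, 1123, 1124, 1125, 1126)
frm1 <- data.frame(data1, data2, data3)

# Number of rows and columns in the frame.
n_rows <- nrow(frm1)
n_cols <- ncol(frm1)

# Class of each column, returned as a named character vector.
# data1 is "factor" before R 4.0 and "character" from R 4.0 onwards.
col_classes <- sapply(frm1, class)

# Print a compact summary of the frame: dimensions, column names and types.
str(frm1)
```

The class reported for the string column depends on the R version, so code that relies on it should check rather than assume.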

Get the row and column names of a data frame

To get the column names of a data frame, call the names() function with the frame name as its argument. This function returns the column names as a vector of strings:

> names(frm1)
[1] "data1" "data2" "data3"

We can also get the row and column names of a data frame using the rownames() and colnames() functions:

> rname = rownames(frm1)
>
> rname
[1] "1" "2" "3" "4" "5"
>
> cname = colnames(frm1)
>
> cname
[1] "data1" "data2" "data3"

Name the rows and columns of a data frame

The columns of a data frame can be named explicitly using a vector of strings. For the above frame "frm1", we can set the column names with our own vector of strings.

> names(frm1) <- c("Element", "Proportion", "Product_ID")
>
> frm1
    Element Proportion Product_ID
1      Iron       12.5       1122
2   Sulphur       32.6       1123
3   Calcium       16.7       1124
4 Magnecium       20.6       1125
5    Copper        7.5       1126

In the above example, we can use colnames(frm1) instead of names(frm1). Both commands produce the same result.
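The equivalence of names() and colnames() can be checked directly. The sketch below also shows one way to rename a single column by indexing into the names vector; the replacement name "Percent" and the variable same_names are illustrative assumptions, not part of the example session.

```r
# Rebuild the example frame so that this block runs on its own.
data1 <- c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper")
data2 <- c(12.5, 32.6, 16.7, 20.6, 7.5)
data3 <- c(1122, 1123, 1124, 1125, 1126)
frm1 <- data.frame(data1, data2, data3)

# colnames() on the left-hand side renames the columns, just like names().
colnames(frm1) <- c("Element", "Proportion", "Product_ID")

# Both functions report the same vector of column names.
same_names <- identical(names(frm1), colnames(frm1))

# Rename only the second column by indexing into the names vector.
# "Percent" is an illustrative name, not one used in the text.
names(frm1)[2] <- "Percent"
```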



Similarly, the row names can be initialized by a vector of strings:


> rownames(frm1) = c("elmt-1","elmt-2","elmt-3","elmt-4","elmt-5")
>
> frm1
         Element Proportion Product_ID
elmt-1      Iron       12.5       1122
elmt-2   Sulphur       32.6       1123
elmt-3   Calcium       16.7       1124
elmt-4 Magnecium       20.6       1125
elmt-5    Copper        7.5       1126

Accessing the elements of a data frame by index

The elements of a data frame are accessed using the same subscript convention as matrices. Thus, frm1[1,3] is the element in the first row and third column, frm1[1,] is the entire first row, and frm1[,2] is the entire second column. Also, frm1[1:3,] gives rows 1, 2 and 3. This is illustrated here using the frame frm1 created above:

> frm1[1,3]
[1] 1122

> frm1[1,]
       Element Proportion Product_ID
elmt-1    Iron       12.5       1122

> frm1[,2]
[1] 12.5 32.6 16.7 20.6  7.5

> frm1[1:3,]
       Element Proportion Product_ID
elmt-1    Iron       12.5       1122
elmt-2 Sulphur       32.6       1123
elmt-3 Calcium       16.7       1124
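Because frm1 carries both row and column names, the same bracket notation also accepts names in place of numeric positions. The following is a minimal sketch of that idea; the variable names by_name, two_cols and col_as_frame are illustrative, and drop = FALSE is a standard argument of the data frame subscript operator that keeps a single column as a data frame instead of a vector.

```r
# Rebuild the named frame so that this block runs on its own.
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5),
  Product_ID = c(1122, 1123, 1124, 1125, 1126)
)
rownames(frm1) <- c("elmt-1", "elmt-2", "elmt-3", "elmt-4", "elmt-5")

# A single element, selected by row name and column name.
by_name <- frm1["elmt-2", "Proportion"]   # 32.6

# The first three rows of two named columns; the result is a data frame.
two_cols <- frm1[1:3, c("Element", "Proportion")]

# frm1[, 2] returns a plain vector; drop = FALSE keeps the data frame shape.
col_as_frame <- frm1[, 2, drop = FALSE]
```

Selecting by name is often safer than selecting by position, since it keeps working if columns are later added or reordered.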

Accessing a column of a data frame by name

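We can also access a column of a data frame by its name, by typing the frame name and the column name separated by a '$' sign. The accessed column is treated as a vector. For example, columns of the data frame 'frm1' can be accessed by their names as shown here. In R versions before 4.0, data.frame() converts string columns to factors by default, which is why the Element column below prints with a Levels line; R 4.0 and later keep it as a character vector.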


> frm1$Element
[1] Iron      Sulphur   Calcium   Magnecium Copper
Levels: Calcium Copper Iron Magnecium Sulphur
> frm1$Proportion
[1] 12.5 32.6 16.7 20.6  7.5
> frm1$Product_ID
[1] 1122 1123 1124 1125 1126
> 1000*frm1$Proportion
[1] 12500 32600 16700 20600  7500

Adding a new column to the data frame

We can add a new column to an existing data frame by creating a vector and assigning it to a new column name of the frame. This vector should have the same length as the number of rows of the existing frame. We will add a new column called "symbol" to the existing frame "frm1":


> frm1$symbol = c("Fe","S","Ca","Mg","Cu")
>
> frm1
         Element Proportion Product_ID symbol
elmt-1      Iron       12.5       1122     Fe
elmt-2   Sulphur       32.6       1123      S
elmt-3   Calcium       16.7       1124     Ca
elmt-4 Magnecium       20.6       1125     Mg
elmt-5    Copper        7.5       1126     Cu
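The length requirement can be checked before the assignment, and R itself rejects a vector whose length does not fit. The following is a minimal sketch of both; the column name "bad" and the variables new_col and result are hypothetical, introduced only to demonstrate the error.

```r
# Rebuild a frame with five rows so that this block runs on its own.
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5)
)

# Check the length explicitly before adding the column.
new_col <- c("Fe", "S", "Ca", "Mg", "Cu")
stopifnot(length(new_col) == nrow(frm1))
frm1$symbol <- new_col

# A vector of length 3 does not fit 5 rows, so the assignment raises an error.
# tryCatch() turns that error into a plain string for this demonstration.
result <- tryCatch({
  frm1$bad <- c("a", "b", "c")
  "added"
}, error = function(e) "rejected")
```

Note that R recycles a shorter vector when its length divides the number of rows evenly, so a single value assigned to a new column fills every row without an error.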

Removing a column by name from a data frame

A column can be removed from a data frame by accessing it by name and assigning the value NULL to it. In the following example, we will access the column named "Product_ID" from the frame "frm1" and remove it:


> frm1
         Element Proportion Product_ID symbol
elmt-1      Iron       12.5       1122     Fe
elmt-2   Sulphur       32.6       1123      S
elmt-3   Calcium       16.7       1124     Ca
elmt-4 Magnecium       20.6       1125     Mg
elmt-5    Copper        7.5       1126     Cu
>
> frm1$Product_ID <- NULL
>
> frm1
         Element Proportion symbol
elmt-1      Iron       12.5     Fe
elmt-2   Sulphur       32.6      S
elmt-3   Calcium       16.7     Ca
elmt-4 Magnecium       20.6     Mg
elmt-5    Copper        7.5     Cu

To attach a data frame

We have learnt to access a column of a data frame by writing the frame name and the column name separated by a '$' sign. When more than one data frame in memory has the same column name(s), this format distinguishes between them. When there is no such naming conflict, it is more convenient to access a column by its name alone, dropping the frame name. We use the attach() function for this.


The attach() function attaches a database to the R search path, so that the objects in the database can be accessed by simply giving their names.

As an example, for the data frame called "frm1" created before, we first try to access the column named "symbol" directly, which fails. However, after attaching the frame with the command attach(frm1), we can access the "symbol" column directly by its name:


> frm1
         Element Proportion symbol
elmt-1      Iron       12.5     Fe
elmt-2   Sulphur       32.6      S
elmt-3   Calcium       16.7     Ca
elmt-4 Magnecium       20.6     Mg
elmt-5    Copper        7.5     Cu
>
> symbol
Error: object 'symbol' not found
>
> attach(frm1)
>
> symbol
[1] "Fe" "S"  "Ca" "Mg" "Cu"
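An attached frame stays on the search path until it is removed with detach(). The sketch below pairs attach() with detach() and also shows with(), a base R function that evaluates an expression using the columns of a frame without attaching it. The variable names was_attached, still_attached, attached_symbol, sym and mean_prop are illustrative assumptions.

```r
# Rebuild the frame from the example above so that this block runs on its own.
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5)
)
frm1$symbol <- c("Fe", "S", "Ca", "Mg", "Cu")

# Attach the frame, read a column by its bare name, then detach again.
attach(frm1)
was_attached <- "frm1" %in% search()
attached_symbol <- symbol
detach(frm1)
still_attached <- "frm1" %in% search()

# with() gives the same convenience for a single expression only,
# without changing the search path.
sym <- with(frm1, symbol)
mean_prop <- with(frm1, mean(Proportion))
```

attach() works on a copy of the frame, so later changes to frm1 are not visible through the bare column names until the frame is detached and attached again. For short computations, with() avoids that pitfall.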