Microarray Data Analysis in
          R/Bioconductor

( R. Srivatsan )

Affymetrix Genechip HGU array analysis in
R/Bioconductor



This section contains the practical tutorial in Bioconductor/R for analysing data from various types of Affymetrix Genechip Human Genome (HGU) arrays

Using bioconductor libraries and information provided in their manuals, we can build a single analysis pipeline that can analyze data from a family of HGU arrays by tweaking input parameters to functions. From the point of view of the analysis pipeline, the arrays differ in terms of the number of genomic features (probes) represented by them and the annotation of the probes into genes represented by the probes. Rest of the core steps like reading the input cel files, data quality estimation, normalization, probe summarization and the statistical analysis are same for all the arrays in HGU family.

First, we will summarize essential information on the following important HGU arrays for which substantial data sets are available in the Gene Expression Omnibus (GEO) data base.

Affymetrix Genechip Human Genome U95 arrays

  • The Genechip Human Genome U95Av2 array (HG-U95Av2)
    Contains primarily full length genes. It has about 12000 sequences previously characterized in terms of functions or disease associations. Earlier experiments were done with this array, and good amount of data sets on disease studies with this array are available in GEO.


  • The Genechip Human Genome U95 B,C,D,E arrays
    These four are known as HG-U95B, HG_U95C, HG-U95D and HG-U95E arrays. They contain probes representing about 50,000 clusters comprised of only Expressed Sequence Tags (ESTs). (The EST's are short DNA sequences (~200-500 nucleotide length that are used to identify a gene being expressed in a cell at a particular time).


  • The above five arrays of U95 set represent more than 60000 full length genes and EST clusters. They are derived using data bases like Unique Build 95, GeneBank 113 and dbEST/10-02-99.

    These arrays have 16 probe pairs (PM and MM probe sets) for each oligonucleotide feature. Each probe is 25-mer long.

Affymetrix Genechip Human Genome U133 arrays

  • The Genechip Human Genome HG-U133A, HG-U133B arrays
    As in the HG-U95 set of arrays, the HG-U133 aarays were esigned to contain probe sets for well annotated genesx. This array has majority of RefSeq genes.

    The HG-U133A array has 21,722 probesets that contain full-length mRNA sequences wheras the 22,577 probe sets of HG-U133B array represent EST-only clusters.


  • The Genechip Human Genome HG-U133 Plus 2 array and HG-U133A 2.0 array
    The HG-U133 Plus 2 array incorporates all the probesets of HG-U133A and HG-U133B arrays plus an additional 9921 sets to give a more comprehensive coverage of human gene expression. For this array, data from various data bases like dbEST, Genebank, RefSeq and the draft assembly of human genome were used. A total of 54,120 probes are present.

    In this arrays A new non-unique probe set type, “_a”, was added to indicate those probe sets that recognize multiple alternative transcripts from the same gene.This was not there in HG-U133A and HG-U133B arrays. Probe sets with common probes among multiple transcripts from separate genes are annotated with a “_s” suffix

    The HG-U133A 2.0 array contains all the probesets of HG-U133A array, but a significantly increased coverage to include larger portions of human transcriptome. Substantial number of new probes have been added to the one in HG-U133A. Thus it is a superset of HG-U133A set.


Analysis pipeline for Affymetrix Genechip HGU array

In this section, we present an analysis pipeline made of R/Bioconductor libraries for performing the differential expression analysis of data from Affymetrix Human Gene (HGU) family of arrays.

The entire zipped directory containing input data files, R-script file for the analysis, covariate description file and output files can be downloaded from here

Example Data Set used in the analysis

The data sets (CEL files) used in this demonstration were downloaded from the open source repository Gene Expression Omnibus (GEO) maintained by the National Center for Biotechnology Information under the National Institute of Health (NIH) run by the Government of United States.

The data set (CEL files) on the studies of Rheumatoid Arthritis of Synovial Joints in human patients can be accessed and downloaded from the specific GEO web page for accession number GSE12021 in the address
$~~~~~~~~~~~~~~~~$"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12021"

The brief information on the data as in the above GEO site is copied below:

Status : Public on Sep 02, 2008

Title : Identification of inter-individual and gene-specific variances in mRNA expression profiles in the RA SM

Organism : Homo sapiens

Overall design : Expression variances were tested in synovial membrane samples of rheumatoid arthritis patients, osteoarthritis patients, and normal controls (see publication for further details).

Contributor(s) : Huber R, Hummert C, Gausmann U, Pohlers D, Koczan D, Guthke R, Kinne RW

Citation(s) : Huber R, Hummert C, Gausmann U, Pohlers D et al. Identification of intra-group, inter-individual, and gene-specific variances in mRNA expression profiles in the rheumatoid arthritis synovial membrane. Arthritis Res Ther 2008;10(4):R98. PMID: 18721452 Analyze with GEO2R

The above web page and the paper describe the experimental protocols, analysis methods and the patient information on each sample in detail. We have downloaded the basic data in the form of CEL files of the HGU133a affymetrix array they have used in this experiment and performed the analysis in our own way by writing the following R/Bioconductor pipeline.


File structure to run the analysis Pipeline in R/Bioconductor

The entire zipped directory for this analysis containing input files, covariate description file, R script for the analysis and the output files created by running the script can be downloaded as microarray_analysis.zip. In this model analysis, CEL files of control samples from eight healthy (normal) patients and eight RA patients are used.

The names and categories of these 16 CEL files are given below:

Eight files of Normal samples (healthy controls):

GSM302859_EB125.CEL
GSM302864_EB184.CEL
GSM302866_EB188.CEL
GSM302870_EB228.CEL
GSM303522_J19.CEL
GSM303523_J_2.CEL
GSM303525_J_6.CEL
GSM303531_J_7.CEL


Eight files of RA patients :

GSM302872_EB_20.CEL,
GSM302882_EB_27.CEL,
GSM302900_EB_35_A.CEL,
GSM302924_EB_40.CEL,
GSM302882_EB_27.CEL,
GSM302933_EB_56.CEL,
GSM302943_EB_74.CEL,
GSM303364_EB_108.CEL

We place these files inside a directory called "data_files".



Next, we create a table file in csv format called "covdesc.csv" (any name can be given) to store the covariate information on files. This table has 2 columns, one with file name and other with type of data as shown below:





Next, create a directory called "output_files". This is where the output files, both graph images as well as data tables created by the program will be written.



Next, we will write an R-script called " affy_HGU_chips_model_analysis_with_oligo_lirary_and_limma.r ". This will be described elaborately in the pipeline description that will come below.



Now, place the directories input_dir, output_dir and the files covdesct.csv and the above R-script files inside a single directory (say) microarray_analysis. The entire directory structure is shown in the screen shot below for the linux machine:



Explanation of the script lines

In this section, we will explain the various code sections of the R/Bioconductor script " " affy_HGU_chips_model_analysis_with_oligo_library_and_limma.r. We execute this script in R using the following command in R prompt:
$~~~~~~~~~~~~~> $source("affy_HGU_chips_model_analysis_with_oligo_lirary_and_limma.r")

Step-1 : To begin with, we will define a parameter to carry the affymetrix chip name whose data is used for the experiment. For this, we will type various chip names and comment them. We will uncomment the chip name we are using. Following this, we will also define the GEO number of the experiment:

# Affymetrix chip used # Affy_chip = "hgu95av2" Affy_chip = "hgu133a" # Affy_chip = "hgu133plus2" #Affy_chip = "hgu95a" # Affy_chip = "hugene11st" # Affy_chip = "hgu219" # GEO number for experimentqeq gse_current = "GSE12021"

Thus, we have defined the chip "hgu133a" and the GEO number "GSE12012". We will use this number to name output files.


Step-2 : Next, we create variable names for directories and files for read/write operations:

# input file directory. infile_dir = "data_files" # output file directory. outfile_dir = "output_files" # output file name for writing final gene expression table in csv format outfile_name = paste("outfile_dir/", gse_current,"_limma_output.csv", sep="")

Thus, input cel files will be read from directory "data_files", the plots and table outputs files will be created in "output_files" directory. Also, the name of output csv table file is created by pasting three strings with no separator in the last command above.

When these three above commands are executed, the outfile name will be
"outfile_dir/GSE12021_limma_output.csv". The final expression table will be written into this file in a csv format. We can print this variable "outfile_name" on the screen to check this.


Step-3 : The input CEL files will now be read into a data structure of R. This is done using "read.csv()" function of R in the following code lines. This function reads "covdesc.csv" file and stores the table with files names and category names in a data frame called "covdesc". We print the data frame "covdesc" on screen:

covdesc <- read.csv("covdesc.csv",header=T) print(covdesc)

The contents of "covdesk" data frame is printed on screen as below:
> print(covdesc) sample treatment 1 GSM302859_EB125.CEL normal 2 GSM302864_EB184.CEL normal 3 GSM302866_EB188.CEL normal 4 GSM302870_EB228.CEL normal 5 GSM303522_J19.CEL normal 6 GSM303523_J_2.CEL normal 7 GSM303525_J_6.CEL normal 8 GSM303531_J_7.CEL normal 9 GSM302872_EB_20.CEL RA 10 GSM302882_EB_27.CEL RA 11 GSM302900_EB_35_A.CEL RA 12 GSM302924_EB_40.CEL RA 13 GSM302882_EB_27.CEL RA 14 GSM302933_EB_56.CEL RA 15 GSM302943_EB_74.CEL RA 16 GSM303364_EB_108.CEL RA
We should carefully examine the above output and make sure that we have named the categories versus file names correctly. These category names will be tagged to sample names (file names) hereafter in all computations that follow.


Step-4 : At this step, we will define the two conditions of study as "normal" and 'RA", as in the covdesc.csv file. Also, we will create two vectors, one carrying the row numbers of "normal" samples and another carrying row numbers of "RA" samples in the covdesc data frame:

test_between = c("normal","RA") ## As in covdesc file control_cols = which(covdesc$treatment == test_between[1]) treatment_cols = which(covdesc$treatment == test_between[2])

After executing the above code lines, "test_between" will be a vector with two string elements ("normal", "RA").
"control_cols" is a numeric vector with elements (1,2,3,4,5,6,7,8) indicating that first eight files are normal samples.
"treatment_cols" is a numeric vector with elements (9,10,11,12,13,14,15,16) indicating that next eight files are RA samples
(we can print them on screen and check).


Step-5 : In this section, we will load the required R/bioconductor packages into code memory using "library()" function of R.

library(oligo) # normalization library(limma) # DE analysis library(multtest) # estimating False Discovery Rate (FDR) library(Cairo) ## for writing plots into image files # load chip library for annotation. Include more if needed for a new type of chip. if(Affy_chip == "hgu95a") { library(hgu95a.db) library(pd.hg.u95a) print("Included the library files hgu95a.db and pd.hg.u95a for the chip HG-U95A") } if(Affy_chip == "hgu95av2") { library(hgu95av2.db) library(pd.hg.u95av2) print("Included the library files hgu95av2.db and pd.hg.u95av2 for the chip HG-U95Av2") } if(Affy_chip == "hgu133plus2") { library(hgu133plus2.db) library(pd.hg.u133.plus.2) print("Included the library files hgu133plus2.db and pd.hg.u133.plus.2 for the chip HG-U133plus2") } if(Affy_chip == "hgu133a") { library(hgu133a.db) library(pd.hg.u133a) print("Included the library files hgu133a.db and pd.hg.u133a for the chip HG-U133A") } if(Affy_chip == "hgu133b") { library(hgu133b.db) library(pd.hg.u133b) print("Included the library files hgu133b.db and pd.hg.u133b for the chip HG-U133B") } if(Affy_chip == "hgu133a2") { library(hgu133a2.db) library(pd.hg.u133a.2) print("Included the library files hgu133a2.db and pd.hg.u133a.2 for the selected chip HG-U133A2") }

As indicated by the comments in the above comment lines,

The Bioconductor Library "oligo" reads the celfiles, runs the steps of RMA normalization and finally creates probe summarization.

The bioconductor library "limma" performs the statistical test with limma algorithm to compute differential expression p-values and fold changes. It also can annotate the affy probes into gene symbols and other information.

The bioconductor library "multtest" has functions for multiple testing of p-values to estimate the False Discovery rates (FDR) using the p-values of limma as input.

The R-library function "Cairo" writes the R-plots into image files like png, jpeg etc.

The cascade of "if" statements load the data base file and chip description file for the affy chip we have selected through variable "Affy_chip".

Thus, since we have selected "Affy_chip = hgu133a" in the beginning, the files "hgu133a.db" and "pd.hg.u133a" will be loaded for this chip. The other "if statements will not be executed.

Execution of above section code line result in the following output on the screen:
Loading required package: BiocGenerics Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: oligoClasses Welcome to oligoClasses version 1.60.0 Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Loading required package: Biostrings Loading required package: S4Vectors Loading required package: stats4 Attaching package: ‘S4Vectors’ The following objects are masked from ‘package:base’: expand.grid, I, unname Loading required package: IRanges Loading required package: XVector Loading required package: GenomeInfoDb Attaching package: ‘Biostrings’ The following object is masked from ‘package:base’: strsplit ================================================================================ Welcome to oligo version 1.62.2 ================================================================================ Attaching package: ‘limma’ The following object is masked from ‘package:oligo’: backgroundCorrect The following object is masked from ‘package:BiocGenerics’: plotMA Loading required package: AnnotationDbi Loading required package: org.Hs.eg.db Loading required package: RSQLite Loading required package: DBI [1] "Included the library files hgu133a.db and pd.hg.u133a for the selected chip HG-U133A"



Step-6 : Next step is to read the CEL data files into R data structures so that we can process the data. This is done in the following lines of code:

## Read the names of celfiles to be loaded. celFiles = list.celfiles(infile_dir,full.names=T) ## read the contents of all input cel files in the above list. rawData = read.celfiles(celFiles) # get the sample names in rawData sample_names = sampleNames(rawData) print(sample_names) # rawData is now a GeneFeatureSet object. # this object does not contain information on phenotypes, which has to be loaded # separately from the covdesc file. # pData(rawData) retrieves phenodata from the GenefeatureSet object as a dataframe. pData(rawData)$gsm = rownames(pData(rawData)) pData(rawData)$source = covdesc$treatment ## print the values of rawData, ie., raw values of probes across samples. ## we print 10 probe readings across 3 samples print(exprs(rawData)[1:4,1:3])

In the above lines of script,

"list.celfiles()" function reads the names of CEL files in the "infile_dir" directory and stores the string names in the variable "celFiles".

The function "read.celfiles()" of the oligo library reads these celfiles and stores the data into a huge data structure (R list) called "rawData". This list "rawData" has all the data of all the files. Various functions can be used to extract information stored in the variables inside "rawData".

The variable "rawData" is an ExpressionFeatureSet object. An expression set is an R data object (created in bioconductor) consisting of three entities: the expression matrix (exprs), the phenotye data (pData), and the feature data (fData). These information can be retrived using functions provided by the Biobase library of Bioconductor package.

The "sampleNames()" is a function of the Biobase library that can extract the sample names stored in the expression set object like "rawData". we extract this and store as a string vector "sample_names".

The function "pData()" of Biobase library can access the phenotypic data (e.g.,covariates) and meta-data (e.g., descriptions of covariates) associated with an experiment. In the above code lines, the function call "pData(rawData)" extracts the phenotype data from "rawData" structure and returns the information as a data frame.

Now, "rowNames(pData(rawData))" returns the file names as a string vector. This is added as a new variable "gsm" to the "rawData" structure. Similarly, "covdesc$treatment" is a vector of sample types and this is added as a new vector variable called "source" to the "rawData" structure.

After these additions, "rawData" now has phenotype information added to it.

The function "exprs()" of biobase derives the expression values from the "ExpressionFeatureSet" objects. Thus the function call "exprs(rawData)" returns a matrix with rows representing features (probes) and columns samples. The cell values are expression value for a given probe in a sample.

Two print statements are there in the code section. "print(sampleNames)" prints the sample names as a vector elements. The statement "print(exprs(rawData)[1:4,1:3])" prints the first 4 rows and 3 columns of the expression matrix. These printouts are as below:

[1] "GSM302859_EB125.CEL" "GSM302864_EB184.CEL" "GSM302866_EB188.CEL" [4] "GSM302870_EB228.CEL" "GSM302872_EB_20.CEL" "GSM302882_EB_27.CEL" [7] "GSM302900_EB_35_A.CEL" "GSM302924_EB_40.CEL" "GSM302927_EB_41.CEL" [10] "GSM302933_EB_56.CEL" "GSM302943_EB_74.CEL" "GSM303364_EB_108.CEL" [13] "GSM303522_J19.CEL" "GSM303523_J_2.CEL" "GSM303525_J_6.CEL" [16] "GSM303531_J_7.CEL" GSM302859_EB125.CEL GSM302864_EB184.CEL GSM302866_EB188.CEL 1 121.3 59.0 72.8 2 16873.5 12695.8 16688.5 3 102.3 61.0 93.3 4 17321.8 12280.8 16375.3



Step-7 : Once the raw data is read, we must check the quality of data sets. For this, we plot various quantities like the image files of microarrays, scatter plots between two arrays, density plots and box plot before normalization using the fillowing sets of commands:

# Check the image files of samples one by one using oligo::image() function # Due to some reason, not plotting many plots in one page when par(mfrow=c(2,2)) command is used. nplots = length(sample_names) for(i in 1:nplots){ X11() image(rawData, which = i) } ## Scatter plot between pairs of arrays plot(exprs(rawData)[,1], exprs(rawData)[,2], main="typical scatter plot between 2 arrays" pch=20, cex=0.5) ## density plot of data sets before normalizationtransfo- ## Note: transfo=log2 is necessary to convert to log2 scale. X11() oligo::hist(rawData, transfo=log2, which="pm", target="core", main="density plots of pm probes before normalization", font.lab=2) ## Boxplot before normalization xlabs = as.character(covdesc$treatment) X11() par(font.lab=2, font.axis=2) ## for boxplot, this is how we have to get font thickness oligo::boxplot(rawData, target="core", which="pm", boxwex=0.5, las=3, names = xlabs, main="Boxplot of pm probes before normalization", ylab = "log2(Expression) of individual probes ")

The first thing we do after extracting rawData is to check the microarray intensity images for any obvious defective across the image surface. This is done in the above code section by calling the image() function of R, which takes rawData structure to generate images of individual arrays. Using a for loop, we can plot images of all the files one by one. An example file is given here:

In the above image in Figure-3 we see that no defective region is present across the entire array. We must check every file by plotting the image.

Secondly, as a matter of curiosity, we test whether any pair of microarray samples are correlated.

exprs(rawData) is a matrix whose columns are expressions of samples across genes. We plot these columns one by one. This is done as a scatter plot call "plot(exprs(rawData)[,1], exprs(rawData)[,2],....)". The resulting scatter plot between gene expressions across sample 1 and sample2 is presented here:



In the above scatterplot between probe level expressions of control 1 and control2 samples, we see a a fair correlation between the two data sets. In addition there are deviations.

The density plot of the pm probes for each samples are plotted on the same graph using the hist() function of the "oligo" library. While calling the hist() function, the parameter setting which="pm" selects the pm probes for density plots. By the parameter setting target="core", we are summarizing the probe-level data to get gene level expressions using well annotated probes in the array. We log transform the data and plot the density histogram using oligo::hist() function of oligo library, as shown in the above code section. The resulting density plots for all arrays plotted together is shown below:

In the above distributions, we observe considerable shifts that exist between the density distributions of the individual arrays before normalization. We expect the RMA normalization to remove these shifts.

Another way of comparing the the signals from PM probes across samples is to create a Box-Whisker plot (Boxplot) samples. This is done by the boxplot() function of oligo library, as shown in the codeline. The executions leads to the creation of the following boxplots of signals from PM probes across samples:

We notice that considerable non-uniformities exist between the samples when comparing their pm signals. The boxes indicating the first to third quartiles do not overlap properly. This will be rectified after normalization from RMA algorithm.


Step-8 : In this crucial step we will run the RMA algorithm on the rawData to accomplish the three steps of Background Correction, Quantile Normalization and Probe summarization as discussed in the previous section. A single function "rma()" from in oligo library tamkes the ExpressionFeatureSet object "rawData" and runs the RMA algorithm to return and ExpressionSet object with gene level probe summarized results. Using specific functions provided, we can extract various information from this object. The single line of function call is given here:

# normalize with RMA algorithm. normalizedData <- rma(rawData)

The execution of above call prints the names of the three steps performed by the RMA algorithm:
Background correcting Normalizing Calculating Expression

Once we have the expression set object after RMA algorithm, we can extract the expression value matrix using "exprs()" function. This returns a matrix whose columns are sample names and rows are probeset names which we will annotate later to get gene symbols and other information.

From the matrix we call "expval" returned by "exprs()" function, we extract probe names and samples names as separate vectors. This section of 3 commands are given below:

# generate expression values matrix expval <- exprs(normalizedData) prbnames <- rownames(expval) clnames <- colnames(expval)



Finally, we create a vector of samples names from the file names by stripping off the ".cel" extension. We can use these names for plabelling samples in any plots, if required. Thus vector is added as a column name to the data matrix "expval". See the following lines of code:

## get rid of ".cel" extension names from clnames and from colnames(expval) nme = strsplit(colnames(expval), "\\.") for(i in 1:ncol(expval)){ clnames[i] = strsplit(colnames(expval), "\\.")[[i]][1] } colnames(expval) = clnames print(colnames(expval))

When the above code lines are executed, the file names are printed as strings:
[1] "GSM302859_EB125" "GSM302864_EB184" "GSM302866_EB188" [4] "GSM302870_EB228" "GSM302872_EB_20" "GSM302882_EB_27" [7] "GSM302900_EB_35_A" "GSM302924_EB_40" "GSM302927_EB_41" [10] "GSM302933_EB_56" "GSM302943_EB_74" "GSM303364_EB_108" [13] "GSM303522_J19" "GSM303523_J_2" "GSM303525_J_6" [16] "GSM303531_J_7"



Step-9 : Next step is to create and look at the quality plots after normalization. The probe values have been summarized into gene expression values for each gene across each sample, and written into "expval" matrix. We make three plots, namely the density plot of expression of genes across samples, box plots of gene expressions in individual samples and the MA-plot between average log expression and difference in log expression of control and treatment sets. See the code section below:

## density plot of data sets after normalization. oligo::hist(normalizedData, transfo=identity, main="density plots of gene expressions after normalization", font.lab=2)

We have called the hist() function of oligo library which takes the data structure "normalizedData" and plots the density plots. The parameter "transfo=identity" means do not take a log2 transform of data, since since it is already log2 transformed by the RMA algotrithm. (Use "transfo=log2" for log2 tranforming the data incase required for any other analysis).

Execution of the above code lines gives the following plot at the output:

Comparing this density plot with that of pm probes before normalization shown in Figure-6, we can see that the RMA normalization has shifted the density curves overlapping with each other, an indication that the RMA normalization has made the data consistent for comparison across samples.

Next, we create the Box-Whisker plot of gene expressions across samples using the following lines of code:

xlabs = as.character(covdesc$treatment) X11() par(font.lab=2, font.axis=2) ## for boxplot, this is how we have to get font thickness oligo::boxplot(normalizedData, target="core", boxwex=0.5, las=3, names = xlabs, main="boxplot of expression of probesets (genes) after RMA normalization", ylab = "log2(Expression)")


Running the above codelines creates the following Box-Whisker plot:


Comparing Figure-8 with Figure-6 we realize that the box plot after normalization has better overlapping box regions compared to the boxplot before normalization. This is the result of RMA normalization among the arrays.

Lastly we will create an MA plot between the mean values of control samples and the mean values of the treatment samples for every gene. It is a plot between average log expression and difference in log expression of control and treatment means across samples:

rm_control = rowMeans(expval[,control_cols]) rm_treatment = rowMeans(expval[,treatment_cols]) X11() plot( (rm_control + rm_treatment)/2, (rm_control-rm_treatment), pch=20, cex=0.3, xlab="Average log expressions", ylab="Difference in log expressions", main="MA plot between mean expression of control and treatment", font.lab=2 ) segments(0,0,20,0, lwd = 2, col="red")

Executing the above code lines results in the following MA plot:

The above MA plot clearly shows that the difference in log expression (logarithm of ratios of treatment and control means) remains steady across the entire range of the mean log expression. This indicates that the data after normalization has no systematic variation across the range of expression values.


Step-10 : The next section performs the differential expression analysis between the two conditions in the data. The LIMMA model is used for this.

The first step in this direction is to create the design matrix that labels the samples into normal and RA conditions in the form of a matrix. See the following lines of code:

# create design matrix for the test: group samples into 2 groups groups = pData(rawData)$source f = factor(groups,levels=test_between) design = model.matrix(~ 0 + f) colnames(design) = test_between print(design)

The call "pData(rawData)" extracts the covariates and the description of covariates (like "norml", "RA) in the rawData. The call "pData(rawData)$source" returns a vectors of strings "normal" or "RA" for each sample. Therefore, "groups" is a vector of strings.

The call "factor(groups, levels=test_between)" convertes the string elements in groups into factors. Thus. "normal" string in groups vector will become a factor called normal in f. Thus, f is a vector of factors (normal or RA) for each file, as printed below:
> f [1] normal normal normal normal normal normal normal normal RA RA [11] RA RA RA RA RA RA Levels: normal RA
The call "design = model.matrix(~0 + f)" creates the model matrix for the linear regression model in limma. "~0 + f" represents a model including the two factors in f ("normal" and "RA") without intercept term. If we print the "design" variable, it prints the design matrix for this test by placing an appropriate "0" or "1" torepresent whether a sample is "normal" or 'RA":
> design = model.matrix(~ 0 + f) > design fnormal fRA 1 1 0 2 1 0 3 1 0 4 1 0 5 1 0 6 1 0 7 1 0 8 1 0 9 0 1 10 0 1 11 0 1 12 0 1 13 0 1 14 0 1 15 0 1 16 0 1 attr(,"assign") [1] 1 1 attr(,"contrasts") attr(,"contrasts")$f [1] "contr.treatment"
In the above table, note that the column names are "fnormal" and "fRA". We convert them into "normal" and "RA" explicitly through the call "colnames(test_between)". After this call, the design matrix has column names "normal" and "RA".

Once the design matrix is created, we will go ahead and create a contrast matrix. The contrast matrix contains the information on the specific comparisons made in the statistical test we are performing. The scipt lines for creating the contrast matrix are:

# use makeContrasts() to create a contrast matrix that describes the comparison # between the two condition. Which condition is subtracted from which other condition? fstring = paste(test_between[2],"-",test_between[1], sep="") contrast.matrix = makeContrasts(fstring,levels=design) print(contrast.matrix)

In the above portion of script, a string "RA-normal" is created from the elements of string vector "test_between" that we created before. This is called "fstring".

The call "makeContrasts(fstring,levels=design)" calls the "makeContrasts" function of limma package. This takes the fstring whose value of "RA-normal" and the design matrix to create the contrast matrix. When printed in the next line, the contrast matrix looks like this:
Contrasts Levels RA-normal normal -1 RA 1
As explained in the previous section, the above matrix with 1 and -1 values indicates that the design is "RA-normal" in the comparison we are going to make. In this way, fold change will be log2(RA/normal).

In the next code section, we are performing the three steps of limma algorithm as explained in the previous web page section:

# 1. lmFit() calculates mean expression levels in each group # data.fit is an MArrayLM object defined in limma # first two columns contain mean log expression values of the two conditions data.fit = lmFit(expval,design) # 2. contrasts.fit() fits the contrast.matrix to the data.fit object with the log mean expression values data.fit.con = contrasts.fit(data.fit,contrast.matrix) # 3. perform moderated t-test using limma (eBayes()) data.fit.eb = eBayes(data.fit.con) # 4. Volcano plots with uncorrected p-values from moderated t-test in eBayes() ## log2 fold change versus uncorrected p-value volcanoplot(data.fit.eb,coef=1,col="cornflowerblue", xlab = "log2(Fold Change)", ylab = "-log10(p-value)", main="Volcano plots between Fold Change and uncorrected p-values", font.lab = 2)

The lmFit() function takes the expression matrix and the design matrix as two inputs to perform linear model fit on the whole data set. It returns various parameters of the fit as "data.fit" variable. If we print this variable through "print(data.fit)" statement (not shown here), various out puts will be printed, of ehich the following is an important one to note:
$coefficients normal RA 1007_s_at 9.151044 8.818945 1053_at 7.608187 7.682733 117_at 9.033585 9.249645 121_at 10.080587 9.901472 1255_g_at 5.348233 5.264829 22278 more rows ...
Each probe is separately fitted to get two coefficients of the fit. The "normal" and "RA" values for eachh probe (gene) represent the mean expression values of the gene in the "normal" and "RA" samples respectively. Note that these mean expression values are expressed in log2(scale).

The contrasts.fit() function takes the output of lmFit() function and the contrast matrix as input. It then recalculates the regression coefficients for a specific comparison decided by the contrast matrix. The coefficeints of this model are the subtraction of the control expression from the treatment in log2 scale.

If we print the coefficeints of contrasts.fit() model using the command "print(data.fit.con)" we get the following lines in the printout:
> print(data.fit.con) An object of class "MArrayLM" $coefficients Contrasts RA-normal 1007_s_at -0.33209979 1053_at 0.07454609 117_at 0.21606011 121_at -0.17911521 1255_g_at -0.08340427 22278 more rows ...
For example, the probe 1007_s_at has a difference of approximately -0.3321 in the differential expression difference between RA and normal samples in log2 scale. This means,

$~~~~~~~~~~~log2(RA~expression) - log2(normal~expression)~=~-0.3321$

$~~~~~~~~~~~log2(\dfrac{RA~expression}{normal~expression})~=~ -0.3321$

$~~~~~~~~~~~$ Therefore, the Linear Fold_Change $= \Large{2}^{-0.3321}~=~ $0.6083

ie., In RA condition, the probe 1007_s_at expressed approximately 61 percent of the expression in normal condition. This is $61\%$ under expression in a linear scale.

If this ratio is more than one, it is over expression.

The eBayes() function, as explained in the previous section, estimates reliable variances for each gene by borrowing information across all the genes. Using these estimated variances, it creates an output of moderated t-statistics, p-values and log2 fold changes in differential expression between the conditions specified.

The command print(data.fit.eb) to print the output of eBayes() function prints many computed quantities of the model. Among them, the follwing three, namely the coefficients, moderated t-statistic called "t" and p.values are printed as follows

(not given in the order in which they were printed. There are other variables printed in between these section. Selected and copy pasted here from the full screen printout of command print(data.fit.eb):

$coefficients Contrasts RA-normal 1007_s_at -0.33209979 1053_at 0.07454609 117_at 0.21606011 121_at -0.17911521 1255_g_at -0.08340427 22278 more rows ... $t Contrasts RA-normal 1007_s_at -2.8337037 1053_at 0.6838937 117_at 1.0426625 121_at -1.3434391 1255_g_at -0.9511786 22278 more rows ... $p.value Contrasts RA-normal 1007_s_at 0.0117363 1053_at 0.5035626 117_at 0.3121857 121_at 0.1973694 1255_g_at 0.3552804 22278 more rows ...


In the case of eBayes() estimate of moderated t-statistic, the information borrowed from all the genes is used to estimate the variance. This shrinks the variance. This may result in a situation whereby a gene with a smaller contrast might have a much larger variance, resulting in a smaller t-statistic, while a gene with a larger contrast might have a much smaller variance, leading to a larger moderated t-statistic.

Thus, in the case of probes(genes) with large fluctuations in standard deviations due to small sample sizes, the p-value estimates of limma model (as done in eBayes()) is found to be more reliable than normal t-tests.

Finally, the volcanoplot() function plot creates a scatter plot of log2(fold change) and -log10(uncorrected p-value). This function call, takes th eBayes object as input. The 'adjust.method="BH" ' performs the Bejamini-Hochburg procedure to control the False Discovery Rate (FDR). The parameter value "coeff=1" refers to the first coefficient, which is the log2(Fold Change). The Figure of volcanoplot is shown here:




Step-11 : The function "eBayes()" has performed a statistical test between specific conditions (such as "normal" versus "disease"). This has resulted in a p-value for the comparison for each probe(gene). The next stepis to apply corrections for the multiple hypothesis testing we have carried out for all the genes in the array. There is a function called "topTable()" in the limma package which can perform the p-value corrections for multiple testing or adjusting the False Discovery Rate (FDR). After adjusting the p-values for False Discovery rates, this function also can select and display top n selected genes from the corrected list. (ie., select certain number of most significant genes).

The code section below calls the topTable() function:

# multiple hypothesis correction with Benjamini and Hochberg False Discovery Rate (FDR) estimate. tab = topTable(data.fit.eb,coef=1, number = nrow(normalizedData),adjust.method="BH", sort.by = "P") # Print top 10 significant genes in terms of p-values print( tab[1:10,] ) # We also write the data with probe order same as in the data frame "normalizedData" limma_output = tab[order( match(rownames(tab), rownames(normalizedData)) ), , drop = FALSE] # Volcano plots with the FDR adjusted p-values plot(limma_output$logFC, -log10(limma_output$adj.P.Val), pch=20, cex=0.4, col="blue", xlab = "log2(Fold Change)", ylab = "-log10(p-value)", font.lab=2, main="Volcano plot between Fold Change and FDR adjusted p-vlaues")

In the first script line above, the "topTable()" funciton is called with a variable value "adjust.method="BH" which performs Benjamini Hochberg False Discovery Rate adjustments to the uncorrected p-values stored in data.fit.eb object of eBayes(). The resulting data frame is sorted according to lowest to highest p-value and stored as "tab" data frame ("sort.by=P" does thi). We then print the first 10 rows of the data frame which look like this:

> print( tab[1:10,] ) logFC AveExpr t P.Value adj.P.Val B 205975_s_at -0.6163573 8.327323 -6.367583 8.186662e-06 0.1824234 2.9194620 214354_x_at -0.4437778 8.545144 -5.028347 1.138034e-04 0.2683033 0.9869112 217020_at -0.3611141 6.579365 -4.853730 1.630056e-04 0.2683033 0.7120707 206390_x_at -0.4229211 7.845895 -4.839338 1.679294e-04 0.2683033 0.6892043 204679_at -0.8213926 7.075867 -4.818664 1.752705e-04 0.2683033 0.6563006 220522_at -0.3446578 6.603247 -4.811112 1.780331e-04 0.2683033 0.6442661 210169_at -0.4056663 7.086355 -4.795763 1.837865e-04 0.2683033 0.6197783 215949_x_at 1.2219149 9.050217 4.708165 2.204644e-04 0.2683033 0.4793487 218054_s_at -0.3225253 7.563564 -4.701597 2.234997e-04 0.2683033 0.4687736 202419_at -0.5385549 9.253322 -4.685037 2.313433e-04 0.2683033 0.4420833
In the above printout, the term "logFC" refers to uncorrected p-value in log10 scale, "AveExpr" refers to the average expression of each probe(gene), "t" is the t-statistics of the moderated t-test, "P.Value" is the unadjusted original p-value of the test, "adj.P.Val" is the Benjamini Hochberg FDR value and "B" is the B-statistics (log odds). As mentioned, the dmata is sorted from lowest to the highest palues.

While writing the analysis table into files, we want to restore the order of probes as in the beginning. But the data frame "tab" is sorted according to p-value. So, we restore the original order of the row names of "normalizedData" obtained below. This is done by the "order()" function called inside the dimension statement of tab() in the next line.

After this call, the data frame "limma_output" will have unsoreted p-values and order of genes(probes) as in "normalized Data". This is to be written in files.

Finally, we plot the volcano plot between log2(fold change) and log10(FDR coorected p-value) through the plot() function. This creates the following plot:




Step-11 : In this step, we annotate the probe names with the information on corresponsding genes they represent. We use the data base library "hgu133a.db" we have attached for this.

This library has map for many annotation parameters for the given chip HGU133A.

Suppose, we want to get the ENTREZID for probes in the hgu133a chip.

"hgu133aENTRZID" is an annotation object from hgu133a.db package which maps affymetrix probe set ID to ENTRAZID.

"as.list(hgu133aENTRZID)" extracts a list of ENTRAZID for all the probes in the array.

sapply(as.list(hgu133aENTREZID), paste, collapse=", ") --- In this call, The "paste" function is used to concatenate the elements within each entry of the list. The collapse=", " argument ensures that if a single probe set maps to multiple Entrez IDs, those IDs are combined into a single character string separated by a comma and a space (e.g., "123, 456").

ENTREZID = sapply(as.list(hgu133aENTREZID), paste, collapse=", ") ---- this results in a character vector called ENTREZID with one ID per probe set.

Using this method, the following command extracts various annotations like gene accession number (ACCM), gene symbol (SYMBOL), gene name (GENENAME) and chromosome number (CHR) from the data base. The full command is below:

annot <- data.frame(ENTREZID=sapply(as.list(hgu133aENTREZID), paste, collapse=", "), ACCNUM=sapply(as.list(hgu133aACCNUM), paste, collapse=", "), SYMBOL=sapply(as.list(hgu133aSYMBOL), paste, collapse=","), DESC=sapply(as.list(hgu133aGENENAME), paste, collapse=", "), ACHROMOSOME=sapply(as.list(hgu133aCHR), paste, collapse=", ")

However, this command hardcodes the chipname "hgu133a" everywhere. In order to make it work for any chip in general, we create the above command by pasting various component strings by substituting the chname we typed in the beginning as a variable "Affy_chip". The following lines of codes exactly do that and recreate a long string command called "annot_command":

entrez_id = parse(text = paste(Affy_chip, "ENTREZID", sep="") ) accession_number = parse(text = paste(Affy_chip, "ACCNUM", sep="") ) gene_symbol = parse(text = paste(Affy_chip, "SYMBOL", sep="") ) gene_name = parse( text = paste(Affy_chip, "GENENAME", sep="") ) chromosome = parse( text = paste(Affy_chip, "CHR", sep="") ) begin_string = "data.frame(" end_string = ")" str1 = paste("ENTREZID=sapply(as.list(",entrez_id, "), paste, collapse=\', \'),", sep="") str2 = paste("ACCNUM=sapply(as.list(",accession_number, "), paste, collapse=\', \'),", sep="") str3 = paste("SYMBOL=sapply(as.list(",gene_symbol, "), paste, collapse=\', \'),", sep="") str4 = paste("GENENAME=sapply(as.list(",gene_name, "), paste, collapse=\', \'),", sep="") str5 = paste("CHROMOSOME=sapply(as.list(",chromosome, "), paste, collapse=\', \')", sep="") annot_command = paste(begin_string, str1, str2, str3, str4, str5, end_string, sep="") print(annot_command)

The print statement in the above script segment will print the entire command as a string:
[1] "data.frame(ENTREZID=sapply(as.list(hgu133aENTREZID), paste, collapse=', '),ACCNUM=sapply(as.list(hgu133aACCNUM), paste, collapse=', '),SYMBOL=sapply(as.list(hgu133aSYMBOL), paste, collapse=', '),GENENAME=sapply(as.list(hgu133aGENENAME), paste, collapse=', '),CHROMOSOME=sapply(as.list(hgu133aCHR), paste, collapse=', '))"
In R, to execute any R command written as a string "s", we execute "eval(parse(text=s))".

Following this syntax, we execute the above command in R to get annotation data frame for all probes:

annot <- eval(parse(text=annot_command)) print(annot[1:4,])

The first 4 rows containing the annotations of first 4 probes are printed as below:
ENTREZID ACCNUM SYMBOL GENENAME 1007_s_at 780 U48705 DDR1 discoidin domain receptor tyrosine kinase 1 1053_at 5982 M87338 RFC2 replication factor C subunit 2 117_at 3310 X51757 HSPA6 heat shock protein family A (Hsp70) member 6 121_at 7849 X69699 PAX8 paired box 8 CHROMOSOME 1007_s_at 6 1053_at 7 117_at 1 121_at 2


Step-12 : Compile all results and write the data into a file:

Finally, we combine all the results and write the data into a csv file, as done in the following script segment. We use the file name we created before:

## create a data frame with results df = data.frame(prbnames, annot, limma_output, expval) # create an output file name outfile_name = paste("output_files/", gse_current,"_limma_output.csv", sep="") ## write the result table into file write.csv(df, file=outfile_name, row.names=FALSE)

The data table is now written into a csv file namely
"output_files/GSE12021_limma_output.csv"


Step-13 : Write the plot images as image files: So far, the script has been plotting every plot on the screen. Now we should write the required plots onto an image file using "Cairo" library of R.

This is done with four steps: create an output file name, define Cairo plot parameters with "Cairo()" function, give your plot command, and finally close the device with "dev.off". Example is given below:

## Create output file name outfile = paste(outfile_dir, "/", "1_density_plots_before_RMA_normalization.png", sep="") # call Cairo with parameters for 'png' file. Cairo(file=outfile, type="png", units="px", width=600, height=700, pointsize=12, dpi="auto") # plot the graph hist(rawData, transfo=log2, which=c("all"), nsample=10000, main="density plots of pm probes before RMA Rnormalization", font.lab=2) switch off device. dev.off()

In the "Cairo()" function, we can choose image file type, pixel unit, pixel width, pixel height, point size and many parameters. See Cairo manuals for details.

Similarly, do the above 4 segments for every plot separately one below other. All plots will be written into *.png" files in the output directory defined.