Microarray Data Analysis in
          R/Bioconductor

( R. Srivatsan )

Affymetrix Genechip Hugene array analysis in
R/Bioconductor



This section contains the practical tutorial in Bioconductor/R for analysing data from various types of Affymetrix Genechip Hugene arrays

The Affymetrix Human Gene array (Hugene array) has two major differences compared to HGU arrays:

(i) The probes span the entire length of mRNA transcript, unlike the HGU array where probes come from 3' region

(ii) It uses DNA targets, instead of cDNA targets of HGU arrays.

Thus the target-probe hybridization generates DNA-DNA duplexes, which are more specific than the RNA-DNA duplexes in the case of HGU arrays. The Hugene arrays are whole-transcription (WT) arrays, since their probes are derived from the entire length of the transcript.

The Hugene array comes in two versions, Hugene 1.0 ST array and Hugene 1.1 ST array with almost identical probes. The arrays are almost identical with 1.1 version has better annotated probes when compared to the 1.0 version.

Each array has about 760000 oligonucleotide probes of 25 mer length each. Typical feature size is 5 microns. The probes cover over 28000 well annotated genes. The oligonucleotide probes on the Affymetrix Human Gene 1.1 ST Array Plates are synthesized in situ using Affymetrix’ photolithographic process.

Except for the annotation process, the entire Bioconductor/R pipeline for analyzing the Hugene microarray is very similar to that of HGU arrays seen in the previous section.

Analysis pipeline for Affymetrix Hugene array

In this section, we present an analysis pipeline made of R/Bioconductor libraries for performing the differential expression analysis of data from Affymtrix Human Gene (HGU) family of arrays.

The entire zipped directory contaning input data files, R-script file for the analysis, covariate description file and output files can be downloaded from here

Explanation of the script lines

In this section, we will explain the various code sections of the R/Bioconductor script " " affy_HGU_chips_model_analysis_with_oligo_lirary_and_limma.r. We execute this script in R using the following command in R prompt:
$~~~~~~~~~~~~~> $source("affy_hugene_chips_model_analysis_with_oligo_library_and_limma.r")

Step-1 : To begin with, we will define a parameter to carry the affymetrix chip name whose data is used for the experiment. For this, we will type various chip names and comment them. We will uncomment the chip name we are using. Following this, we will also define the GEO number of the experiment:

# Affy_chip = "hugene11st" Affy_chip = "hugene10st" # GEO number for experiment gse_current = "GSE28315"

Thus, we have defined the chip "hugene10st" and the GEO number "GSE28315". We will use this number to name output files.


Step-2 : Next, we create variable names for directories and files for read/write operations:

# input file directory. infile_dir = "data_files" # output file directory. outfile_dir = "output_files" # output file name for writing final gene expression table in csv format outfile_name = paste("outfile_dir/", gse_current,"_limma_output.csv", sep="")

Thus, input cel files will be read from directory "data_files", the plots and table outputs files will be created in "output_files" directory. Also, the name of output csv table file is created by pasting three strings with no separator in the last command above.

When these three above commands are executed, the outfile name will be
"outfile_dir/GSE28315_limma_output.csv". The final expression table will be written into this file in a csv format. We can print this variable "outfile_name" on the screen to check this.


Step-3 : The input CEL files will now be read into a data structure of R. This is done using "read.csv()" function of R in the following code lines. This function reads "covdesc.csv" file and stores the table with files names and category names in a data frame called "covdesc". We print the data frame "covdesc" on screen:

covdesc <- read.csv("covdesc.csv",header=T) print(covdesk)

The contents of "covdesc" data frame is printed on screen as below:
print(covdesc) sample treatment 1 GSM700422.CEL control 2 GSM700423.CEL control 3 GSM700424.CEL control 4 GSM700425.CEL control 5 GSM700426.CEL control 6 GSM700427.CEL control 7 GSM700428.CEL EBS 8 GSM700429.CEL EBS 9 GSM700430.CEL EBS 10 GSM700431.CEL EBS 11 GSM700432.CEL EBS 12 GSM700433.CEL EBS
We should carefully examine the above output and make sure that we have named the categories versus file names correctly. These category names will be tagged to sample names (file names) hereafter in all computations that follow.


Step-4 : At this step, we will define the two conditions of study as "normal" and "EBS", as in the covdesc.csv file. Also, we will create two vectors, one carrying the row numbers of "normal" samples and another carrying row numbers of "RA" samples in the covdesc data frame:

test_between = c("control","EBS") ## As in covdesc file control_cols = which(covdesc$treatment == test_between[1]) treatment_cols = which(covdesc$treatment == test_between[2])

After executing the above code lines, "test_between" will be a vector with two string elements ("normal", "RA").
"control_cols" is a numeric vector with elements (1 2 3 4 5 6) indicating that first six files are normal samples.
"treatment_cols" is a numeric vector with elements (7 8 9 10 11 12) indicating that next six files are EBS samples
(we can print them on screen and check).


Step-5 : In this section, we will load the required R/bioconductor packages into code memory using "library()" function of R.

library(oligo) # normalization library(limma) # DE analysis library(multtest) # estimating False Diccovery Rate (FDR) library(Cairo) # for writing plots into files # load chip library for annotation # for hugene chips, oligo will automatically detect and load the required libraries. however, when not installed previously in the session, this shows an error. if(Affy_chip == "hugene11st") { library(pd.hugene.1.1.st.v1) library(hugene11stprobeset.db) library(hugene11sttranscriptcluster.db) # hugene platforms can be analysed at transcript level or probe level } if(Affy_chip == "hugene10st") { library(pd.hugene.1.0.st.v1) library(hugene10stprobeset.db) library(hugene10sttranscriptcluster.db) # hugene platforms can be analysed at transcript level or probe level }

As indicated by the comments in the above comment lines,

The Bioconductor Library "oligo" reads the celfiles, runs the steps of RMA normalization and finally creates probe summarization.

The bioconductor library "limma" performs the statistical test with liima algorithm to compute differential expression p-values and fold changes. It also can annotate the affy probes into gene symbols and other information.

The bioconductor library "multtest" has functions for multiple testing of p-values to estimate the False Discovery rates (FDR) using the p-values of limma as input.

The R-library function "Cairo" writes the R-plots into image files like png, jpeg etc.

The cascade of "if" statements load the data base file and chip description file for the affy chip we have selectcted through variable "Affy_chip".

Thus, since we have selected "Affy_chip = hugene10st" in the beginning, the files "pd.hugene.1.0.st.v1", "hugene10stprobeset.db" and "hugene10stprobeset.db" will be loaded for this chip. The other "if statements will not be executed.

Execution of above section code line result in the foillowing output on the screen:
Loading required package: BiocGenerics Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: oligoClasses Welcome to oligoClasses version 1.60.0 Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Loading required package: Biostrings Loading required package: S4Vectors Loading required package: stats4 Attaching package: ‘S4Vectors’ The following objects are masked from ‘package:base’: expand.grid, I, unname Loading required package: IRanges Loading required package: XVector Loading required package: GenomeInfoDb Attaching package: ‘Biostrings’ The following object is masked from ‘package:base’: strsplit ================================================================================ Welcome to oligo version 1.62.2 ================================================================================ Attaching package: ‘limma’ The following object is masked from ‘package:oligo’: backgroundCorrect The following object is masked from ‘package:BiocGenerics’: plotMA Loading required package: RSQLite Loading required package: DBI Loading required package: AnnotationDbi Loading required package: org.Hs.eg.db hugene10stprobeset.db is based on exon probesets. For a more gene-centric view, use the transcriptcluster version of this package.



Step-6 : Next step is to read the CEL data files into R data structures so that we can process the data. This is done in the following lines of code:

## Read the names of celfiles to be loaded. celFiles = list.celfiles("data_files",full.names=T) ## read the contents of all input cel files in the above list. rawData = read.celfiles(celFiles) # get the sample names in rawData sample_names = sampleNames(rawData) print(sample_names) cat("\n") # print a blank line # rawData is now a GeneFeatureSet object. # this object does not contain information on phenotypes, which has to be loaded # separately from the covdesc file # pData(rawData) retrieves phenodata from the GenefeatureSet object as a dataframe. pData(rawData)$gsm = rownames(pData(rawData)) pData(rawData)$source = covdesc$treatment ## print the values of rawData, ie., raw values of probes across samples. ## we print 4 probe readings across 3 samples print(exprs(rawData)[1:4,1:3])

In the above lines of script,

"list.celfiles()" function reads the names of CEL files in the "infile_dir" directory and stores the string names in the variable "celFiles".

The function "read.celfiles()" of the oligo library reads these celfiles and stores the data into a huge data structure (R list) called "rawData". This list "rawData" has all the data of all the files. Various functions can be used to extract information stored in the variables inside "rawData".

The variable "rawData" is an ExpressionFeatureSet object. An expression set is an R data object (created in bioconductor) consisting of three entities: the expression matrix (exprs), the phenotye data (pData), and the feature data (fData). These information can be retrived using functions provided by the Biobase library of Bioconductor package.

The "sampleNames()" is a function of the Biobase library that can extract the sample names stored in the expression set object like "rawData". we extract this and store as a string vector "sample_names".

The function "pData()" of Biobase library can access the phenotypic data (e.g.,covariates) and meta-data (e.g., descriptions of covariates) associated with an experiment. In the above code lines, the function call "pData(rawData)" extracts the phenotype data from "rawData" structure and returns the information as a data frame.

Now, "rowNames(pData(rawData))" returns the file names as a string vector. This is added as a new variable "gsm" to the "rawData" structure. Similarly, "covdesc$treatment" is a vector of sample types and this is added as a new vector variable called "source" to the "rawData" structure.

After these additions, "rawData" now has phenotype information added to it.

The function "exprs()" of biobase derives the expression values from the "ExpressionFeatureSet" objects. Thus the function call "exprs(rawData)" returns a matrix with rows representing features (probes) and columns samples. The cell values are expression value for a given probe in a sample.

Two print statements are there in the code section. "print(sampleNames)" prints the sample names as a vector elements. The statement "print(exprs(rawData)[1:4,1:3])" prints the first 4 rows and 3 columns of the expression matrix. These printouts are as below:

[1] "GSM700422.CEL" "GSM700423.CEL" "GSM700424.CEL" "GSM700425.CEL" [5] "GSM700426.CEL" "GSM700427.CEL" "GSM700428.CEL" "GSM700429.CEL" [9] "GSM700430.CEL" "GSM700431.CEL" "GSM700432.CEL" "GSM700433.CEL" GSM700422.CEL GSM700423.CEL GSM700424.CEL 1 3749 4328 3755 2 66 87 74 3 3962 4404 3680 4 66 46 42



Step-7 : Once the raw data is read, we must check the quality of data sets. For this, we plot various quantities like the image files of microarrays, scatter plots between two arrays, density plots and box plot before normalization using the fillowing sets of commands:

# Check the image files of samples one by one using oligo::image() function # Due to some reason, not plotting many plots in one page when par(mfrow=c(2,2)) command is used. nplots = length(sample_names) for(i in 1:nplots){ X11() image(rawData, which = i) } ## Scatter plot between pairs of arrays X11() par(font.lab=3, font.axis=2) plot(exprs(rawData)[,1], exprs(rawData)[,2], main="typical scatter plot between 2 arrays", pch=20, cex=0.5, xlab = "Normal sample 1", ylab="Normal sample 2" ) ## density plot of data sets before normalizationtransfo- ## Note: transfo=log2 is necessary to convert to log2 scale. X11() par(font.lab=2, font.axis=2) oligo::hist(rawData, transfo=log2, target="core", main="density plots of hugene array probes before normalization", font.lab=2, ) ## Boxplot before normalization xlabs = as.character(covdesc$treatment) X11() par(font.lab=2, font.axis=2) ## for boxplot, this is how we have to get font thickness oligo::boxplot(rawData, target="core", boxwex=0.5, las=3, names = xlabs, main="Boxplot of probes before normalization", ylab = "log2(Expression) of individual probes " )

The first thing we do after etracting rawData is to check the microarray intensity images for any obvious defected across the image surface. This is done in the above code section by calling the image() function of R, which takes rawData structure to generate images of individual arrays. Using a for loop, we can plot images of all the files one by one. An example file is given here:

In the above image in Figure-1 we see that no defective region is present across the entire array. We must check every file by plotting the image.

Secondly, as a matter of curiosity, we test whether any pari of microarray samples are correlated.

exprs(rawData) is a matrix whose columns are expressions of samples across gennes. We plot these colimns one by one. This is done as a scatter plot call "plot(exprs(rawData)[,1], exprs(rawData)[,2],....)". The resulting scatter plot between gene expressions across sample 1 and sample2 is presented here:



In the above scatterplot between probe level expressions of control 1 and control2 samples, we see a a fair correlatiopn between the two data sets. In addition there are deviations.

The density plot of the pm probes for each samples are plotted on the same graph using the hist() function of the "oligo" library. While calling the hist() function, the parameter setting which="pm" selects the pm probes for density plots. By the parameter setting target="core", we are summarizing the probe-level data to get gene level expressions using well annotated probes in the array. We log transform the data and plot the density histogram using oligo::hist() function of oligo library, as shown in the above code section. The resulting density plots for all arrays plotted together is shown below:

In the above distributions, we observe considerable shifts that exist between the density distributions of the individual arrays before normalization. We expect the RMA normalization to remove these shifts.

Another way of comparing the the signals from PM probes across samples is to create a Box-Whisker plot (Boxplot) samples. This is done by the boxplot() function of oligo library, as shown in the codeline. The executions leads to the creation of the following boxplots of signals from PM probes across samples:

We notice that certain level of non-uniformity exists netween the samples when comparing their pm signals. The boxes indicating the first to third quartiles do not overlap properly. This will be rectified after normalization from RMA algorithm.


Step-8 : In this crucial step we will run the RMA algorith on the rawData to accomplish the three steps of Background Correction, Quantile Normalization and Probe summarization as discussed in the previous section. A single function "rma()" from in oligo library tamkes the ExpressionFeatureSet object "rawData" and runs the RMA algorithm to return and ExpressionSet object with gene level probe summarized results. Using specific functions provided, we can extract various information from this object. The single line of function call is given here:

# normalize with RMA algorithm. normalizedData <- rma(rawData)

The execution of above call prints the names of the three steps performed by the RMA algorithm:
Background correcting Normalizing Calculating Expression

Once we have the expression set object after RMA algorithm, we can extract the expression value matrix using "exprs()" function. This returns a matrix whose columns are sample names and rows are probeset names which we will annotate later to get gene symbols and other information.

From the matrix we call "expval" returned by "exprs()" function, we extract probe names and samples names as separate vectors. This section of 3 commands are given below:

# generate expression values matrix expval <- exprs(normalizedData) prbnames <- rownames(expval) clnames <- colnames(expval)



Fianlly, we create a vector of samples names from the file names by stripping off the ".cel" extension. We can use these names for plabelling samples in any plots, if required. Thus vector is added as a column name to the data matrix "expval". See the following lines of code:

## get rid of ".cel" extension names from clnames and from colnames(expval) nme = strsplit(colnames(expval), "\\.") for(i in 1:ncol(expval)){ clnames[i] = strsplit(colnames(expval), "\\.")[[i]][1] } colnames(expval) = clnames print(colnames(expval))

When the above code lines are executed, the file nemes are printed as strings:
[1] "GSM700422" "GSM700423" "GSM700424" "GSM700425" "GSM700426" "GSM700427" [7] "GSM700428" "GSM700429" "GSM700430" "GSM700431" "GSM700432" "GSM700433"



Step-9 : Next step is to create and look at the quality plots after normalization. The probe values have been summamarized into gene expression values for each gene across each sample, and written into "expval" matrix. We make three plots, namely the density plot of expression of genes across samples, box plots of gene expressions in individual samples and the MA-plot between average log expression and difference in log expression of control and treatment sets. See the code section below:

## density plot of data sets after normalization. oligo::hist(normalizedData, transfo=identity, main="density plots of gene expressions after normalization", font.lab=2)

We have called the hist() function of oligo library which takes the data structure "normalizedData" and plots the density plots. The parameter "transfo=identity" means do not take a log2 transform of data, since since it is already log2 transformed by the RMA algotrithm. (Use "transfo=log2" for log2 tranforming the data incase required for any other analysis).

Execution of the above code lines gives the following plot at the output:

Comparing this density plot with that of pm probes before normalization shown in Figure-3, we can see that the RMA normalization has shifted the density curves overlapping with each other, an indication that the RMA normalization has made the data consistant for comparison across samples.

Next, we create the Box-Whisker plot of gene expressions across samples using the following lines of code:

xlabs = as.character(covdesc$treatment) X11() par(font.lab=2, font.axis=2) ## for boxplot, this is how we have to get font thickness oligo::boxplot(normalizedData, target="core", boxwex=0.5, las=3, names = xlabs, main="boxplot of expression of probesets (genes) after RMA normalization", ylab = "log2(Expression)")


Running the above codelines creates the following Box-Whisker plot:


Comparing Figure-6 with Figure-4 we ralize that the box plot after normalization has better overlapping box regions compared to the boxplot before normalization. This is the result of RMA normalization among the arrays.

Lastly we will create an MA plot between the mean values of control samples and the mean values of the treatment samples for every gene. It is a plot between average log expression and difference in log expression of control and treatment means across samples:

rm_control = rowMeans(expval[,control_cols]) rm_treatment = rowMeans(expval[,treatment_cols]) X11() plot( (rm_control + rm_treatment)/2, (rm_control-rm_treatment), pch=20, cex=0.3, xlab="Average log expressions", ylab="Difference in log expressions", main="MA plot between mean expression of control and treatment", font.lab=2) segments(0,0,20,0, lwd = 2, col="red")

The above code segment creates the following MA plot on the screen


This MA plot above indicates that the difference in gene expression between control and treatment means symmetrix around zero value across the entire range of mean expression values, indicating that our RMA normalization has removed any trend that existed in the difference across all mean values.


Step-10 : The next section performs the differential expression analysis between the two conditions in the data. The LIMMA model is used for this.

The first step in this direction is to create the design matrix that labels the samples into normal and RA conditions in the form of a matrix. See the following lines of code:

# create design matrix for the test: group samples into 2 groups groups = pData(rawData)$source f = factor(groups,levels=test_between) design = model.matrix(~ 0 + f) colnames(design) = test_between print(design)

The call "pData(rawData)" extreacts the covariates and the description of covariates (like "norml", "RA) in the rawData. The call "pData(rawData)$source" returns a vectors of strings "normal" or "RA" for each sample. Therefore, "groups" is a vector of strings.

The call "factor(groups, levels=test_between)" convertes the string elements in groups into factors. Thus. "normal" string in groups vector will become a factor called normal in f. Thus, f is a vector of factors (normal or RA) for each file, as printed below:
> f [1] control control control control control control EBS EBS EBS [10] EBS EBS EBS Levels: control EBS
The call "design = model.matrix(~0 + f)" creates the model matrix for the linear regression model in limma. "~0 + f" represents a model including the two factors in f ("normal" and "RA") without intercept term. If we print the "design" variable, it prints the design matrix for this test by placing an appropriate "0" or "1" torepresent whether a sample is "normal" or 'RA":
> design = model.matrix(~ 0 + f) > design control EBS 1 1 0 2 1 0 3 1 0 4 1 0 5 1 0 6 1 0 7 0 1 8 0 1 9 0 1 10 0 1 11 0 1 12 0 1 attr(,"assign") [1] 1 1 attr(,"contrasts") attr(,"contrasts")$f [1] "contr.treatment"
In the above table, note that the column names are "control and "EBS". We convert them into "control" and "EBS" explicitly through the call "colnames(test_between)". After this call, the design matrix has column names "control" and "RA".

Once the design matrix is created, we will go ahead and create a contrast matrix. The contrast EBSmatrix contains the information on the specific comparisons made in the statistical test we are performing. The scipt lines for creating the contrast matrix are:

# use makeContrasts() to create a contrast matrix that describes the comparison # between the two condition. Which condition is subtracted from which other condition? fstring = paste(test_between[2],"-",test_between[1], sep="") contrast.matrix = makeContrasts(fstring,levels=design) print(contrast.matrix)

In the above portion of script, a string "RA-EBSnormal" is created from the elements of string vector "test_between" that we created before. This is called "fstring".

The call "makeContrasts(fstring,levels=design)" calls the "makeContrasts" function of limma package. This takes the fstring whose value of "RA-normal" and the design matrix to create the contrast matrix. When printed in the next line, the contrast matrix looks like this:
Contrasts Levels EBS-control control -1 EBS 1
As explained in the previous section, the above matrix with 1 and -1 values indicates that the design is "RA-normal" in the comparison we are going to make. In theis way, fold change will be log2(RA/normal).

In the next code section, we are performing the three steps of limma algorithm as explained in the previous web page section:

# 1. lmFit() calculates mean expression levels in each group # data.fit is an MArrayLM onject defined in limma # first two columns contain mean log expression values of the two conditions data.fit = lmFit(expval,design) # 2. contrasts.fit() fits the contrast.matrix to the data.fit object with the log mean expression values data.fit.con = contrasts.fit(data.fit,contrast.matrix) # 3. perform moderated t-test using limma (eBayes()) data.fit.eb = eBayes(data.fit.con) # 4. Volcano plots with uncorrected p-values from moderated t-test in eBayes() ## log2 fold change versus uncorrected p-value volcanoplot(data.fit.eb,coef=1,col="cornflowerblue", xlab = "log2(Fold Change)", ylab = "-log10(p-value)", main="Volcano plots between Fold Change and uncorrected p-values", font.lab = 2)

The lmFit() function takes the expression matrix and the design matrix as two inputs to perform linear model fit on the whole data set. It returns various parameters of the fit as "data.fit" variable. If we print this variable through "print(data.fit)" statement (not shown here), various out puts will be printed, of ehich the following is an important one to note:
control EBS 7892501 3.697886 2.968512 7892502 4.643794 4.792548 7892503 4.171544 4.032888 7892504 8.738136 8.819950 7892505 3.032325 3.615698 33292 more rows ...
Each probe is separately fitted to get two coefficients of the fit. The "control" and "EBS" values foreasch probe (gene) represent the mean expression values of the gene in the "control" and "RA" samples respectively. Note that these mean expression values are expressed in log2(scale).

The contrasts.fit() function takes the output of lmFit() function and the contrast matrix as input. It then recalculates the regression coefficients for a specific comparison decided by the contrast matrix. The coefficeints of this model are the subtraction of the control expression from the treatment in log2 scale.

If we print the coefficeints of contrasts.fit() model using the command "print(data.fit.con)" we get the following lines in the printout:
> print(data.fit.con) $coefficients Contrasts EBS-control 7892501 -0.72937449 7892502 0.14875458 7892503 -0.13865616 7892504 0.08181429 7892505 0.58337340 33292 more rows ...
For example, the probe 7892501 has a difference of approximately -0.7293 in the differential expression difference between EBS and control samples in log2 scale. This means,

$~~~~~~~~~~~log2(RA~expression) - log2(normal~expression)~=~-0.7293$

$~~~~~~~~~~~log2(\dfrac{RA~expression}{normal~expression})~=~ -0.7293$

$~~~~~~~~~~~$ Therefore, the Linear Fold_Change $= \Large{2}^{-0.7293}~=~ $0.4822

ie., the expression of the probe 7892501 under EBS condition is only 48% of that in healthy controls. This is an under expression by a factor 0.48.

If this ratio is more than one, it is overexpression.

The eBayes() function, as explained in the previous section, estimates reliable variances for each gene by borrowing information across all the genes. Using these estimated variances, it creates an output of moderated t-statistics, p-values and log2 fold changes in differential expression between the conditions specified.

The command print(data.fit.eb) to print the output of eBayes() function prints many computed quantities of the model. Among them, the follwing three, namely the coefficients, moderated t-statistic called "t" and p.values are printed as follows

(not given in the order in which they were printed. There are other variables printed in between these section. Selected and copy pasted here from the full screen printout of command print(data.fit.eb):

$coefficients Contrasts EBS-control 7892501 -0.72937449 7892502 0.14875458 7892503 -0.13865616 7892504 0.08181429 7892505 0.58337340 33292 more rows ... $t Contrasts EBS-control 7892501 -2.4369465 7892502 0.9519078 7892503 -0.3469008 7892504 1.0048379 7892505 2.4707770 33292 more rows ... $p.value Contrasts EBS-control 7892501 0.03020348 7892502 0.35877916 7892503 0.73430437 7892504 0.33359909 7892505 0.02835202 33292 more rows ... $


In a normal t-test, the t-statistic is the ratio of the contrast value to the standard error. using the expression values of only the samples of the particular probe.

In the case of eBayes() estimate of moderated t-statistic, the information borrowed from all the genes is used to estimate the variance. This shrinks the variance. This may result in a situation whereby a gene with a smaller contrast might have a much larger variance, resulting in a smaller t-statistic, while a gene with a larger contrast might have a much smaller variance, leading to a larger moderated t-statistic.

Thus, in the case of probes(genes) with large fluctuations in standard deviations due to small sample sizes, the p-value estimates of limma model (as done in eBayes()) is found to be more reliable than normal t-tests.

Finally, the volcanoplot() function plot creates a scatter plot of log2(fold change) and -log10(uncorrectd p-value). This function call, takes th eBayes object as input. The 'adjust.method="BH" ' performs the Bejamini-Hochburg procedure to control the False Discovery Rate (FDR). The parameter value "coeff=1" refers to the first coefficient, which is the log2(Fold Change). The Figure of volcanoplot is shown here:




Step-11 : The function "eBayes()" has performed a statistical test between secific conditions (such as "normal" versus "disease"). This has resulted in a p-value for the comparison for each probe(gene). The next stepis to apply corrections for the multiple hypothesis testing that we have carried out for all the genes in the array. There is a function called "topTable()" in the limma package which can perform the p-value corrections for multiple testing or adjusting the False Discovery Rate (FDR). After adjusting the p-values for False Discovery rates, this function also can select and display top n selected genes from the corrected list. (ie., select certain number of most significant genes).

The code section below calls the toptable() function:

tab = topTable(data.fit.eb,coef=1, number = nrow(normalizedData),adjust.method="BH", sort.by = "none") plot(tab$logFC, -log10(tab$adj.P.Val), cex = 0.3, col="cornflowerblue", pch=20, main="Volcano plot of BH FDR corrected p-value", xlab = "log2(Fold Change)", ylab = "-log10(p-value)", font.lab=2)

In the first script line above, the "topTable()" funciton is called with a varbale value "adjust.method="BH" which performs Benjamini Hoochberg False Discovery Rate adjustments to the uncorrected p-values stored in data.fit.eb object of eBayes().
If sort.by=P option is set, the resulting data frame is sorted according to lowest to highest p-value and stored as "tab" data frame.
Here, we decided not to sort, so that we can merge this with other results as it is to maintain same probe order. Therefore, we set the option sort.by="none"
We then print the first 10 rows of the data frame which look like this:

> print( tab[1:10,] ) logFC AveExpr t P.Value adj.P.Val B 7892501 -0.72937449 3.333199 -2.4369465 0.03020348 0.6123302 -3.430980 7892502 0.14875458 4.718171 0.9519078 0.35877916 0.8423413 -4.913175 7892503 -0.13865616 4.102216 -0.3469008 0.73430437 0.9535343 -5.208816 7892504 0.08181429 8.779043 1.0048379 0.33359909 0.8315266 -4.875825 7892505 0.58337340 3.324011 2.4707770 0.02835202 0.6123302 -3.391320 7892506 0.02883061 5.182460 0.1542343 0.87983101 0.9816312 -5.246717 7892507 -0.09871354 6.647579 -0.7784638 0.45045164 0.8804114 -5.023479 7892508 0.16484331 7.188741 1.8565109 0.08654700 0.6789022 -4.089266 7892509 -0.12908777 10.056632 -1.7930923 0.09661443 0.6932334 -4.157161 7892510 0.03537473 5.132292 0.1783970 0.86120457 0.9791662 -5.243548
In the above printout, the term "logFC" referws to uncorrected p-value in log10 scale, "AveExpr" refers to the average expression of each probe(gene), "t" is the t-statistics of the moderated t-test, "P.Value" is the unadjusted original p-value of the test, "adj.P.Val" is the Benjamini Hochberg FDR value and "B" is the B-statistics (lods). As mentioned, the dmata is sorted from lowest to the highest palues.

While writing the analysis table into files, we want to restore the order of probes as in the beginning. But the data frame "tab" is sorted according to p-value. So, we restore the original order of the row names of "normalizedData" obtained below. This is done by the "order()" function called inside the dimension statement of tab() in the next line.

After this call, the data frame "limma_output" will have unsoreted p-values and order of genes(probes) as in "normalized Data". This is to be written in files.

Finally, we plot the volcano plot between log2(fold change) and log10(FDR coorected p-value) through the plot() function. This creates the following plot:




Step-11 : In this step, we annotate the probe names with the information on corrresponsding genes they represent. We use the data base library "hugene10sttranscriptcluster.db" we have attached for this.

This library has map for many annotation parameters for the given chip "hugene10st".

The code segment for the annotation of Hugene array is given below, which is very different from that of HGU arrays:

gns = select(hugene10sttranscriptcluster.db, featureNames(normalizedData), c("ENTREZID","SYMBOL","MAP")) # to account for one to many mappings that may result gns = gns[!duplicated(gns[,1]),] # to account for one to many mappings that may result limma_output = tab ## remove probe ID as row names. They will be put as columns. row.names(limma_output) = NULL ## df = data.frame(gns, limma_output, expval) ## select only those rows with properly anootated gene SYMBOL df_with_gene_symbols = subset(df, !is.na(df$SYMBOL))

In the above code section, the function call "featureNames(normalizedData)" returns the probe names of all genes in the order as they appear in normalizedData structure.

The function "select()" can extract data from "AnnotationDb" data base.

Here, we extract three columns namely "ENTREZID","SYMBOL" and "MAP" for all the probe names in the vector "normalizedData" from the "(hugene10sttranscriptcluster.db" data base we have loaded in the beginning. Thus, the data frame "gns" has 4 columns of data namely "PROBEID" "ENTREZID" "SYMBOL" and "MAP" for 40073 features. This data has many duplicated entries which are one to many mappings of genes.

We remove these duplicated entries by the call "gns = gns[!duplicated(gns[,1]),]". This reduces the entries from 40073 to 33297.

Next, the call we create a dataframe consisting of "gns", "limma_output" and the "expval" columns. From thsi data frame, we retain only the subset for which gene symbol is clearly defined. We acive this through the code line "df_with_gene_symbols = subset(df, !is.na(df$SYMBOL))" This completes our annotation.



The first 3 rows of "df_with_gene_symbols" frame containing the annotations of firse 3 genes are printed below:
df_with_gene_symbols[1:3,] PROBEID ENTREZID SYMBOL MAP logFC AveExpr t P.Value 4204 7896740 26682 OR4F4 15q26.3 0.01792991 4.249575 0.1660970 0.8706763 4207 7896742 55251 PCMTD2 20q13.33 -0.04347645 8.329749 -0.4418866 0.6659471 4209 7896744 729759 OR4F29 1p36.33 0.09463393 6.572043 0.4839291 0.6366128 adj.P.Val B GSM700422 GSM700423 GSM700424 GSM700425 GSM700426 4204 0.9802689 -5.245219 4.274170 3.999776 4.352176 4.374453 4.091046 4207 0.9398310 -5.179651 8.446579 8.496928 8.151416 8.278545 8.377593 4209 0.9325243 -5.164584 6.081149 6.501774 6.311935 6.430102 6.982950 GSM700427 GSM700428 GSM700429 GSM700430 GSM700431 GSM700432 GSM700433 4204 4.352040 4.367354 4.263249 4.536844 4.324408 4.219010 3.840375 4207 8.357861 8.276513 7.961237 8.238720 8.571960 8.521862 8.277772 4209 6.840448 6.777381 6.613562 6.731487 6.852563 6.935434 5.805734


Step-12 : Compile all results and write the data into a file:

Finally, we write the data frame "gf_with_gene_symbols" into a csv file, as done in the following script segment.

# write the table into output file outfile_name = paste("output_files/", gse_current,"_limma_output.csv", sep="") # write this table to file write.csv(df_with_gene_symbols, file=outfile_name, row.names=FALSE)

The data table is now written into a csv file namely
"output_files/GSE28315_limma_output.csv"


Step-13 : Write the plot images as image files: So far, the script has been plotting every plot on the screen. Now we should write the required plots onto an image file using "Cairo" library of R.

This is done with four steps: create an ourput file name, define Cairo plot parameters with "Cairo()" function, give your plot command, and finally close the device with "dev.off". Example is given below:

## Create output file name outfile = paste(outfile_dir, "/", "1_density_plots_before_RMA_normalization.png", sep="") # call Cairo with parameters for 'png' file. Cairo(file=outfile, type="png", units="px", width=600, height=700, pointsize=12, dpi="auto") # plot the graph hist(rawData, transfo=log2, which=c("all"), nsample=10000, main="density plots of pm probes before RMA Rnormalization", font.lab=2) switch off device. dev.off()

In the "Cairo()" function, we can choose image file type, pixel unit, pixel width, pixel height, point size and many parameters. See Cairo manuals for details.

Similarly, do the above 4 segments for every plot separately one blow other. All plots will be written into *.png" files in the output directory defined.