Microarray Data Analysis in
          R/Bioconductor

( R. Srivatsan )

Filtering the genes by p-value and Fold Change



Once we have obtained the differential expression of genes between the control and treatment conditions, the next step is to select a subet of genes whose expressions are significant.

In our statistical analysis of expression data using the limma algorithm, we have determined two parameters for each gene that represent its significance in the among others. They are, namely, the p-value and the Fold Change .

The Fold Change and its acale

The Fold change of a gene represents the ratio obtained by dividing the mean expression across the treatment samples by the mean expression across the control samples in the fitted data.

In a linear scale, a fold change of more than one represets an over expression and a fold change less than one represents an under expression of the gene studied. Values closer to one represnt gene expression not changed between the two conditions.

When expressed in logarithmic scale (of base 2 logarithm), the over expressed genes will have log2 expression values more than zero, the under expressed genes will have log2 expresion values less than zero (negative) and the unexpressed gene will have values closer and closer to zero.

Thus, of a gene is 3.2 times over expressed in a linear scale, log2(3.2) = 1.678 will be its over expression level in log2 scale. If a gene is 3.2 times under expressed, then
log2(1/3.2) = -1.678 will be its under expression level in log2 scale. The log2 scale thus makes the equal levels of over and under expression as same number with opposite signs. This is more easier to comprehend than looking at 3.2 and 1/3.2 (which is 0.3125) in lineae scale and figuring out that they are equal levels of over and under expressions.

The p-value and its acale

The p-value of a gene in the limma algorithm represents the statistical significance of observing the given difference in expression levels of control and treatment samples in the fitted data, using a modrated t-statistic. Obviously, this is a number between 0 and 1.

Suppose we take the p-values of five genes to be 0.7, 0.1, 0.01, 0.001 and 0.0001. If we plot them in a linear scale between 0 and 1, it will be very difficult to visualize all the five points in a the same axis, since 0.01, 0.0001 (and values less than this) will be compressed into a section which is less than one tenth of axis range, while the values 0.1, 0.7 will be spread across the entire length. Thus, 0.01, 0.001 will not be visible in the same scale along with 0.1 and 0.7.

This problem can be over ome by expressing the p-value in a logarithmic scale of base 10. In this scale, log(1)=0, log(0.1)=-1, log(0.01)=-2, log(0.001) =-3 etc. Thus, the probability values between 1 to zero will be lying in the range [0, -1, -2, -3,...]. it is very easy to read -1 as 0.1, -2 as 0.01, -3 as 0.003 etc.

The uncorrected and corrected p-values

Once the statistical significance of each gene, expressed as the p-value from the moderated t-test in limma is obtained, we need to apply the correction to control the False Discovery Rate (FDR) due to multiple testing of hypothesis across all the genes. In the limma procedure we have performed, we have controlled the False Discovery Rate using Benjamini-Hochberg (BH) procedure implemented in the topTable() function of limma tool in R.

The BH procedure takes a vector of p-values from moderated t-test (called "uncorrected p-values") to return FDR controlled p-values (called "corrected p-values").

Though both the uncorrected and FDR controlled p-values are numbers between 0 to 1, their interpretations are totally different. It is very important to understand this difference before applying p-value cutoff to any ony one of them for selecting significant genes.

From the list of N uncorrected p-values, if we apply a cutoff with a p-value of 0.05 and select 100 genes, it means that these 100 genes are false positives in our result. ie., by definition of false positives, $N \times 0.05 = 100$. The real hits are embedded in this 100. We do not know how many out of this 100 selected genes are real hits (ie., really differentially expressed) and how many are due to the theshold selection using 0.05 as cutoff.

From the list of N FDR adjusted p-values, if we apply a cutoff with a p-value of 0.05 and select 100 genes, it means that the maximun fraction of false positives in the 100 selected genes is 0.05. That is, out of 100 genes selcted with an FDR cut of 0.05, a maximum of $100\times 0.05 = 5$ genes will be expected to be false positives, leaving the remaining 95 to be true hits. However, this is only an expected result.

For the value of same significance level $\alpha$, the number of genes from the FDR adjusted
p-values will be much less than the ones from uncorrected p-values. Therefore, while selecting genes from FDR adjusted list, we can decide the fraction of maximum expected false positives we can tolerate in the selcted genes. For example, if we select 50 genes by applying a cutoff of $0.1$ in the FDR adjusted p-value list, this means we are ready to tolerate a maximum of $5\times 0.1 = 5$ genes among the 50 selected to be false positives, and remaining 45 will be true hits. This is an expected result.

In some microarray experiments, the general level differential gene expressions between control and treatment conditions can be very low due to various reasons. In that case, even if we are ready to tolerate 0.1 fraction of genes to be false positives after FDR adjustment, we may not find more than few genes that can cross the thresold. We may have to be ready to tolerate more than 10% false positives in the selcted genes. Alternately, we can apply a maximum allowed cut off of 0.05 on uncorrected genes to short list the differentiwlly expressed genes. However, we may not know the fraction of actual false positive in the final number selected. These are the issues.

The follwing R script filters genes from the expression table file "GSE12021_limma_output.csv".

The zipped directory containing this data file and the following R script can be downloaded from here

We have filtered data for various thresholds for uncorrected and FDR corrected p-values. Note that since the differential expression levels on this gene are is very low even in the uncorrected case, the thresholds on fold change and p-values have to be really lowered to get few selected genes after FDR adjustment. This situation varies from experiment to experiment.

## Model script for filtering genes based on p-values and fold changes ## Read the data file into a data frame dat = read.csv(file="GSE12021_limma_output.csv", header=TRUE) # print the dimension of data print(dim(dat)) # Print the column names in the data print(colnames(dat)) ## First filter data on unadjusted pvalues ## set linear fold change threshold (over/under expression) fold = 2.0 # convert to log2 scale log2fold = log2(fold) # set p-value threshold pval = 0.05

Running the above code on the fiile generates the following output on the screen:
> source("filtering.r") [1] "Genes filtered with fold change =2.0 and unadjusted p-value 0.05" [1] "SDC1" "IFRD1" "RGS1" "DPP4" "CCL4" "CD37" [7] "CD79A" "APOB" NA "ANOS1" "SCRG1" "GZMA" [13] "ADAM28" NA "CXCR4" "ANXA3" "PTN" "SLC19A2" [19] "CD69" "HLA-DQB1" NA "IGFBP3" "MYOC" "IGK" [25] "HLA-DQB1" "PTN" NA "IGK" "CXCR4" "HLA-DPA1" [31] "IGFBP3" "SEPTIN6" "PTPRC" "ELN" NA "HLA-DPA1" [37] NA NA NA NA "RARRES1" [1] "----------------------------------------------------" [1] "Genes filtered with fold change 1.5 and FDR adjusted p-value 0.2" [1] "HOXD1" [1] "----------------------------------------------------" [1] "Genes filtered with fold change 1.8 and FDR adjusted p-value 0.27" [1] "CALR" "LAPTM5" "IFRD1" "CCL4" "CXCR4" "HLA-DQB1" [7] "KCNJ15" "LILRB4" "MERTK" "CXCR4" "HLA-DPA1" NA [13] NA "HLA-DMA" [1] "----------------------------------------------------" >
The "NA"s against cetain gene names indicates that anootations are not avilable for these entries. This has to be performed separately outside this analysis with various available data bases.

After this step, we can write these data frames into csv files using "write.csv()" function in R.