Basic Statistics with R

Fisher's exact test on contingency table

Fisher's exact test is used to test for a nonrandom association between two categorical variables in a contingency table. Though generally used for $2 \times 2$ tables, it can be extended to tables of higher dimensions.

The test is called "exact" because the p-value, that is, the probability of obtaining a table at least as extreme as the observed one under the null hypothesis of no association, is computed exactly using a discrete distribution rather than approximated by a continuous distribution that is valid only for large samples.

We will understand the test through a specific example in genomics.

In an experiment, suppose we have measured differentially expressed genes by treating certain cells with a drug.

We also have a list of genes in a particular KEGG pathway, for example, glycolysis.

Is there any association between a gene being differentially expressed and being a part of the pathway?

In other words, "Is the given pathway enriched with the set of genes from the expression analysis?"


Suppose we take a particular KEGG pathway. We prepare the following contingency table by counting the genes in the experiment that are or are not present in the pathway gene list:

| | Differentially expressed | Not differentially expressed | Sum |
|---|---|---|---|
| Present in the pathway | a=12 | b=3 | a+b=15 |
| Not present in the pathway | c=3 | d=12 | c+d=15 |
| Sum | a+c=15 | b+d=15 | N=a+b+c+d=30 |
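The counting step can be sketched in R as follows. The gene identifiers, the 30-gene universe and the two gene lists are made up purely for illustration (they are chosen so that the counts match the table above); in practice they would come from the expression analysis and from the KEGG pathway annotation.

```r
# Hypothetical universe of 30 measured genes (illustrative identifiers)
all_genes <- paste0("gene", 1:30)

# Hypothetical gene lists, chosen to reproduce the counts in the table
de_genes      <- all_genes[1:15]             # differentially expressed
pathway_genes <- all_genes[c(1:12, 16:18)]   # members of the pathway

# Classify every gene by pathway membership and expression status
in_pathway <- factor(all_genes %in% pathway_genes,
                     levels = c(TRUE, FALSE),
                     labels = c("present", "not present"))
diff_expr  <- factor(all_genes %in% de_genes,
                     levels = c(TRUE, FALSE),
                     labels = c("DE", "not DE"))

# Cross-tabulate to obtain the 2 x 2 contingency table with its margins
counts <- table(pathway = in_pathway, expression = diff_expr)
print(addmargins(counts))
```

Using `factor` with explicit levels keeps the "present" row and the "DE" column first, so the cells appear in the same a, b, c, d order as in the table.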




From the above table, we see that out of 15 genes that are differentially expressed after drug treatment, 12 are present in the pathway considered. What is the statistical significance of this?

If we repeat the experiment, we may end up with 10 genes instead of 12. Is this also significant?

We do Fisher's exact test to get a significance level (p-value) for this table.

Fisher's test uses the hypergeometric probability for this. Suppose there is an urn containing two categories of objects. We have $N_1$ objects of category-1 and $N_2 = N-N_1$ objects of category-2.

Suppose we randomly draw $n$ objects from the urn without replacement. What is the probability that $x$ out of the $n$ draws will belong to category-1?

This is given by the hypergeometric probability distribution that we have studied:

\( {P_h(x,n,N_1,N) = \dfrac{_{N_1}C_x~\times~ _{N_2}C_{n-x}}{_{N_1+N_2}C_n} = \dfrac{_{N_1}C_x ~\times~ _{N-N_1}C_{n-x}}{_{N}C_n}} \)

In the context of the contingency table we have constructed, this hypergeometric probability becomes,

\( P_h(x=a, n=a+c, N_1=a+b, N=a+b+c+d) =\dfrac{_{a+b}C_a ~\times~ _{c+d}C_{c}}{_{N}C_{a+c}}~=~\dfrac{_{a+b}C_b ~\times~ _{c+d}C_{d}}{_{N}C_{b+d}} \)

The above expression gives the probability of the observed contingency table under the null hypothesis that the presence or absence of genes in the given KEGG pathway is independent of whether they are differentially expressed.
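As a quick check, the probability of the observed table can be evaluated in R in two ways: directly from the binomial coefficients with `choose`, and with the built-in hypergeometric density `dhyper`. The variable names are illustrative. Note that `dhyper(x, m, n, k)` takes the number of category-1 objects (`m`), the number of category-2 objects (`n`) and the number of draws (`k`).

```r
# Cell counts of the observed table
n_a <- 12; n_b <- 3
n_c <- 3;  n_d <- 12
n_total <- n_a + n_b + n_c + n_d

# Probability from the binomial coefficients in the formula above
p_choose <- choose(n_a + n_b, n_a) * choose(n_c + n_d, n_c) /
  choose(n_total, n_a + n_c)

# Same probability from the hypergeometric density
p_dhyper <- dhyper(x = n_a, m = n_a + n_b, n = n_c + n_d, k = n_a + n_c)

print(c(choose = p_choose, dhyper = p_dhyper))
```

Both values are about $1.3 \times 10^{-3}$ for this table.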

Now, for given row and column sums, there are many possible tables. We can compute the probability of each one of them. Adding the probabilities of the tables whose first cell is greater than or equal to the observed $a$ gives the probability of getting a value at least as large as the observed one.

Suppose we take a contingency table with $a+b=15$ and $c+d=15$, and with column sums $a+c=15$ and $b+d=15$ as in our example.

The possible values of a, b, c, d that satisfy these sums, and the corresponding probabilities, are:

$a=0~~~~b=15$

$c=15~~~~d=0~~~~~~~~~~~~ p= 6.45 \times 10^{-9}$



$a=1~~~~b=14$

$c=14~~~~d=1~~~~~~~~~~~~ p= 1.4 \times 10^{-6}$



$a=2~~~~b=13$

$c=13~~~~d=2~~~~~~~~~~~~ p= 7.1 \times 10^{-5}$



$a=3~~~~b=12$

$c=12~~~~d=3~~~~~~~~~~~~ p= 1.3 \times 10^{-3}$



$a=4~~~~b=11$

$c=11~~~~d=4~~~~~~~~~~~~ p= 1.2 \times 10^{-2}$



$a=5~~~~b=10$

$c=10~~~~d=5~~~~~~~~~~~~ p= 5.8 \times 10^{-2}$



$a=6~~~~b=9$

$c=9~~~~d=6~~~~~~~~~~~~ p= 1.6 \times 10^{-1}$



$a=7~~~~b=8$

$c=8~~~~d=7~~~~~~~~~~~~ p= 2.7 \times 10^{-1}$



$a=8~~~~b=7$

$c=7~~~~d=8~~~~~~~~~~~~ p= 2.7 \times 10^{-1}$



$a=9~~~~b=6$

$c=6~~~~d=9~~~~~~~~~~~~ p= 1.6 \times 10^{-1}$



$a=10~~~~b=5$

$c=5~~~~d=10~~~~~~~~~~~~ p= 5.8 \times 10^{-2}$



$a=11~~~~b=4$

$c=4~~~~d=11~~~~~~~~~~~~ p= 1.2 \times 10^{-2}$



$a=12~~~~b=3$

$c=3~~~~d=12~~~~~~~~~~~~ p= 1.3 \times 10^{-3}$



$a=13~~~~b=2$

$c=2~~~~d=13~~~~~~~~~~~~ p= 7.1 \times 10^{-5}$



$a=14~~~~b=1$

$c=1~~~~d=14~~~~~~~~~~~~ p= 1.4 \times 10^{-6}$



$a=15~~~~b=0$

$c=0~~~~d=15~~~~~~~~~~~~ p= 6.4 \times 10^{-9}$
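The probabilities listed above can be recomputed in R. The following is a minimal sketch that assumes the margins of our example (all row and column sums equal to 15); the variable names are illustrative and it relies only on the base-R function `dhyper`.

```r
# Margins of the example table, held fixed under the null hypothesis
row1_sum <- 15   # a + b: genes present in the pathway
row2_sum <- 15   # c + d: genes not present in the pathway
col1_sum <- 15   # a + c: differentially expressed genes

# With fixed margins, the value of a determines b, c and d
a_values <- max(0, col1_sum - row2_sum):min(row1_sum, col1_sum)

all_tables <- data.frame(
  a = a_values,
  b = row1_sum - a_values,
  c = col1_sum - a_values,
  d = row2_sum - (col1_sum - a_values),
  p = dhyper(a_values, m = row1_sum, n = row2_sum, k = col1_sum)
)
print(all_tables, digits = 3)

# The probabilities of all possible tables must add up to 1
print(sum(all_tables$p))
```

Checking that the probabilities sum to 1 is a simple way to catch a slip in any single entry of the list.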


From the above results, we compute the probability of getting 12 or more differentially expressed genes that are also in the pathway as,

\(\small{P~=~\sum_i P_i~=~ 1.3 \times 10^{-3} + 7.1 \times 10^{-5} + 1.4 \times 10^{-6} + 6.4 \times 10^{-9}~~=~~0.0014 }\)

From this result, we conclude that, under the assumption of no association between differential expression and pathway membership, the probability of our observed data or of data more extreme is 0.0014.
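The same upper-tail probability can be obtained in R in three equivalent ways, sketched below with illustrative variable names: by summing `dhyper` over a = 12, ..., 15, by the upper tail of `phyper`, and by calling `fisher.test` with `alternative = "greater"`, which requests the one-sided test.

```r
# Upper tail as an explicit sum over a = 12, 13, 14, 15
p_upper_sum <- sum(dhyper(12:15, m = 15, n = 15, k = 15))

# Upper tail from the cumulative distribution: P(X > 11) = P(X >= 12)
p_upper_phyper <- phyper(11, m = 15, n = 15, k = 15, lower.tail = FALSE)

# One-sided Fisher's exact test on the observed table
observed <- matrix(c(12, 3,
                      3, 12), nrow = 2, byrow = TRUE)
p_upper_fisher <- fisher.test(observed, alternative = "greater")$p.value

print(c(sum = p_upper_sum, phyper = p_upper_phyper, fisher = p_upper_fisher))
```

All three values are about 0.0014, in agreement with the sum above.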

Generalized formula for Fisher's exact test on contingency table

The above example performed Fisher's exact test for a contingency table of dimension $2 \times 2$. For a table of higher dimensions with r rows and c columns, the following generalized formula can be used to compute the probability $P$ of a table with cell counts $a_{ij}$, as described below.

First, compute the row sums $R_i$ and the column sums $C_j$ of the table, so that

$~~~~~N = \displaystyle \sum \limits_{i=1}^{r} R_i = \displaystyle \sum \limits_{j=1}^{c} C_j$

Then the probability P for the contingency table is given by,

$~~~~P~=~\dfrac{ (R_1!\, R_2! \cdots R_r!)(C_1!\, C_2! \cdots C_c!)} { N!~\prod_{i,j} a_{ij}!} $

where $a_{ij}$ is the number of observations in the cell (i,j).

As in the $2 \times 2$ example, the same probability P is calculated for every possible contingency table with the same row and column sums as the observed table, and the probabilities of the tables at least as extreme as the observed one are added to get the p-value.

| | C1 | C2 | C3 | Sum |
|---|---|---|---|---|
| R1 | 7 | 3 | 5 | 15 |
| R2 | 9 | 4 | 2 | 15 |
| Sum | 16 | 7 | 7 | 30 |






For the above table, we have,

$R_1= 15$, $R_2 = 15$, $C_1 = 16$, $C_2 = 7$, $C_3 = 7$ and $N=30$.

$~~~~P~=~\dfrac{ (R_1!\, R_2!)(C_1!\, C_2!\, C_3!)} { N!~\prod_{i,j} a_{ij}!}~=~ \dfrac{ (15!\, 15!)(16!\, 7!\, 7!)} { 30!\, (7!\, 3!\, 5!\, 9!\, 4!\, 2!)} ~\approx~0.0542$
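The factorials in this formula grow very quickly, so a direct evaluation is best done on the log scale. The helper below, `table_probability`, is a hypothetical name introduced here for illustration; it is a minimal sketch of the generalized formula using the base-R function `lfactorial`.

```r
# Observed 2 x 3 table
observed <- matrix(c(7, 3, 5,
                     9, 4, 2), nrow = 2, byrow = TRUE)

# Probability of a table given its margins:
# (prod R_i!)(prod C_j!) / (N! prod a_ij!), computed with log-factorials
table_probability <- function(tab) {
  log_p <- sum(lfactorial(rowSums(tab))) +
    sum(lfactorial(colSums(tab))) -
    lfactorial(sum(tab)) -
    sum(lfactorial(tab))
  exp(log_p)
}

p_table <- table_probability(observed)
print(p_table)
```

For the table above this gives approximately 0.0542.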

Proceed the same way with the other possible tables that have these row and column sums, and add the probabilities of those at least as extreme as the observed table to get the final p-value.
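For a $2 \times 3$ table the enumeration is small enough to write out by hand. The sketch below makes one assumption that is not stated in the text: a table counts as "at least as extreme" when its probability is no larger than that of the observed table. This is the two-sided convention documented for R's `fisher.test` on tables larger than $2 \times 2$, so the two printed p-values should agree up to numerical tolerance. The helper `table_probability` is a hypothetical name.

```r
observed <- matrix(c(7, 3, 5,
                     9, 4, 2), nrow = 2, byrow = TRUE)
row_sums <- rowSums(observed)
col_sums <- colSums(observed)

# Probability of one table with fixed margins (generalized formula, log scale)
table_probability <- function(tab) {
  exp(sum(lfactorial(rowSums(tab))) + sum(lfactorial(colSums(tab))) -
        lfactorial(sum(tab)) - sum(lfactorial(tab)))
}
p_observed <- table_probability(observed)

p_total <- 0   # sum over all possible tables, should be 1
p_value <- 0   # sum over tables no more probable than the observed one
for (x1 in 0:min(row_sums[1], col_sums[1])) {
  for (x2 in 0:min(row_sums[1] - x1, col_sums[2])) {
    x3 <- row_sums[1] - x1 - x2        # fixed by the first row sum
    if (x3 > col_sums[3]) next         # second row would go negative
    first_row <- c(x1, x2, x3)
    candidate <- rbind(first_row, col_sums - first_row)
    p_candidate <- table_probability(candidate)
    p_total <- p_total + p_candidate
    # Small relative tolerance guards against floating-point ties
    if (p_candidate <= p_observed * (1 + 1e-7)) {
      p_value <- p_value + p_candidate
    }
  }
}

print(c(total = p_total, enumerated = p_value,
        fisher_test = fisher.test(observed)$p.value))
```

Only the first two cells of the first row are free; every other cell follows from the margins, which is why two nested loops are enough.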

Fisher's exact test for a contingency table in R

Let us perform Fisher's exact test in R for the following contingency table, which we encountered above:

| | Differentially expressed | Not differentially expressed | Sum |
|---|---|---|---|
| Present in the pathway | 12 | 3 | 15 |
| Not present in the pathway | 3 | 12 | 15 |
| Sum | 15 | 15 | 30 |


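    # Create the matrix for the table.
    # Rows here are expression status and columns are pathway membership,
    # i.e. the transpose of the table above; the counts are symmetric and
    # the test result is the same either way.
    data_matrix <- matrix(c(12, 3, 3, 12), nrow = 2, byrow = TRUE)
    rownames(data_matrix) <- c("expressed", "not expressed")
    colnames(data_matrix) <- c("present", "not present")
    print(data_matrix)

    # Run Fisher's exact test
    test_result <- fisher.test(data_matrix)

    # Display the results
    print(test_result)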

Running the above script in R prints the following lines:

                  present not present
    expressed          12           3
    not expressed       3          12

    	Fisher's Exact Test for Count Data

    data:  data_matrix
    p-value = 0.002814
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
       2.107996 139.178330
    sample estimates:
    odds ratio
      14.11256

The p-value of 0.002814 (two-sided, which is R's default) is smaller than 0.05, so we reject the null hypothesis at the 0.05 significance level.

We therefore accept the alternative hypothesis: whether a gene is present in the pathway is associated with whether it is differentially expressed after drug treatment, at a significance level of 0.05.
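The p-value reported by `fisher.test` here is two-sided, whereas the hand calculation earlier summed only the upper tail (0.0014). For a $2 \times 2$ table, the two-sided p-value adds up the probabilities of all tables that are no more probable than the observed one. The sketch below, with illustrative variable names, reproduces it from `dhyper`; because the margins in this example are symmetric, the result is exactly twice the one-sided value.

```r
# Probabilities of all possible tables (a = 0, ..., 15) with the fixed margins
probs <- dhyper(0:15, m = 15, n = 15, k = 15)

# Probability of the observed table (a = 12)
p_observed <- dhyper(12, m = 15, n = 15, k = 15)

# Two-sided p-value: tables no more probable than the observed one
# (small relative tolerance guards against floating-point ties)
p_two_sided <- sum(probs[probs <= p_observed * (1 + 1e-7)])

# One-sided (upper-tail) p-value for comparison
p_one_sided <- sum(dhyper(12:15, m = 15, n = 15, k = 15))

observed <- matrix(c(12, 3, 3, 12), nrow = 2, byrow = TRUE)
print(c(two_sided = p_two_sided,
        twice_one_sided = 2 * p_one_sided,
        fisher_test = fisher.test(observed)$p.value))
```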

