Basic Statistics with R
Tests on contingency table - an overview
We have learnt to create and analyze contingency tables consisting of two or more categorical variables, each in turn divided into two or more categories. We have also learnt to compute various statistical quantities, such as true positives and true negatives, to interpret the data. The important underlying application of contingency table analysis is to study the association between the two variables that are categorized in the table.
Once a contingency table is created, we need to perform statistical tests to quantify the level of association between these two sets of categories.
The idea behind statistical tests on contingency tables is very similar to that of the other statistical tests we have learnt so far.
Under the null hypothesis that there is no association between the variables and their categories, what is the probability that a random draw of data results in the observed table purely by chance? If this probability is less than a pre-defined significance level, the null hypothesis is rejected and we conclude that there is a statistically significant association.
We can understand this idea better with a specific example. Let us construct a $2 \times 2$ contingency table to study whether the presence of a particular allele 'A' of a gene in a person is associated with a disease. We want to study whether a person testing positive (or negative) for a disease condition is correlated with the presence (or absence) of allele A in their genome.
Let $a, b, c, d$ be the frequencies in the four cells of the table, where $(a+b+c+d)$ is the total number of cases studied. Such a table is presented below:
|                  | Test Positive | Test Negative | Sum                          |
|------------------|---------------|---------------|------------------------------|
| Allele-A present | $a = 470$     | $b = 20$      | $a + b = 490$                |
| Allele-A absent  | $c = 30$      | $d = 480$     | $c + d = 510$                |
| Sum              | $a + c = 500$ | $b + d = 500$ | $n = a + b + c + d = 1000$   |
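As a quick illustration, the table above can be entered into R as a matrix (the name `tab` and the dimension labels are our own choices for this example, not fixed conventions):

```r
# Observed counts: rows = allele status, columns = test result
tab <- matrix(c(470, 20, 30, 480), nrow = 2, byrow = TRUE,
              dimnames = list(Allele = c("Present", "Absent"),
                              Test   = c("Positive", "Negative")))
addmargins(tab)   # appends the row and column sums shown in the table
```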
Under the assumption that the presence or absence of the disease is not correlated with the presence or absence of the allele A, we expect the proportion of cases with and without the allele to be the same among those with and without the disease.
In the above table, exactly 500 patients test positive and another 500 test negative for the disease. If the disease has nothing to do with the presence or absence of the allele, we expect close to 250 people with the allele and 250 people without it in each group. (We say "close to 250" instead of "exactly 250" to account for the statistical fluctuations expected in a finite sample.) Thus, among the 500 who contracted the disease, we expect close to 250 with the allele present and 250 with the allele absent. Similarly, among the 500 who do not have the disease, close to 250 will have the allele and close to 250 will not, within the expected statistical fluctuations of finite sampling.
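Under independence, the expected count in each cell is (row sum $\times$ column sum) / $n$. A minimal sketch in R (note that with the margins of this particular table, the expected counts come out as 245 and 255 rather than exactly 250, because the allele split is 490/510):

```r
tab <- matrix(c(470, 20, 30, 480), nrow = 2, byrow = TRUE)
n <- sum(tab)
# expected[i, j] = rowSums(tab)[i] * colSums(tab)[j] / n
expected <- outer(rowSums(tab), colSums(tab)) / n
expected   # 245 245 / 255 255 -- all close to 250, as argued above
```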
Thus, if we create many such tables by repeated random sampling of 500 participants with the disease and 500 participants without the disease, each time we get different numbers $a, b, c, d$ in the cells, each fluctuating around the expected value of 250.
Then the following question is asked:
Under the null hypothesis of no association between the disease and the presence of the allele, what is the probability of getting the observed table with $(a=470,~b=20,~c=30~and~d=480)$ in a random draw of 500 people with the disease and 500 people without the disease? (i.e., observing the result purely by chance).
If this probability is smaller than a significance level $\alpha$, we reject the null hypothesis and conclude that there is an association between the presence of the allele and the disease. Otherwise, we fail to reject the null hypothesis and conclude that there is no evidence of an association.
We thus need a statistical test procedure to estimate this chance probability under the null hypothesis of no association. Many statistical tests exist for this purpose. We will discuss the following two important tests:
Pearson's chi-squared test: Under the null hypothesis of no association, the expected frequency of each cell in the table is computed. Using the expected and observed values across the cells, a chi-squared test is performed. This method works well when the observed frequencies in the individual cells are not very small.
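In R, Pearson's test is available as the built-in `chisq.test()`. A minimal sketch on the allele example above (note that for $2 \times 2$ tables `chisq.test()` applies Yates' continuity correction by default):

```r
tab <- matrix(c(470, 20, 30, 480), nrow = 2, byrow = TRUE)
res <- chisq.test(tab)   # Yates' correction is applied for 2x2 tables
res$expected             # expected counts under the null: 245, 245, 255, 255
res$p.value              # far below any usual significance level for these data
```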
Fisher's exact test: In this test, the contingency table is assumed to have been created by a random draw of samples from an urn containing a finite number of samples of each observed category.
The probability of a given observation therefore follows the hypergeometric distribution, from which the probability of the observed table is computed.
This test works even when the observed frequencies are small.
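Fisher's exact test is likewise built into R as `fisher.test()`; a minimal sketch on the same table:

```r
tab <- matrix(c(470, 20, 30, 480), nrow = 2, byrow = TRUE)
res <- fisher.test(tab)
res$p.value    # exact p-value computed from the hypergeometric distribution
res$estimate   # conditional maximum-likelihood estimate of the odds ratio
```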
The rule of thumb for choosing between Pearson's chi-squared test and Fisher's exact test, from the point of view of sample size, is as follows:
Both tests assume that the observations are independent.
In general, if $\leq 20\%$ of the expected cell counts are less than 5, use the chi-squared test; if $\gt 20\%$ of the expected cell counts are less than 5, use Fisher's exact test.
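The rule of thumb above can be sketched as a small helper function; `choose_test()` is our own illustrative name, not a standard R function:

```r
# Hypothetical helper: pick a test from the fraction of small expected counts
choose_test <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  if (mean(expected < 5) > 0.20) "Fisher's exact test" else "chi-squared test"
}

tab <- matrix(c(470, 20, 30, 480), nrow = 2, byrow = TRUE)
choose_test(tab)   # every expected count here is at least 245, so "chi-squared test"
```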