Basic Statistics with R

Chi-squared test on contingency table

The chi-squared test for the contingency table works under the following assumptions:

$~~~~~~~~~~~~~~~~(i)~~$The data points are randomly sampled and are independent.

$~~~~~~~~~~~~~~~ (ii)~~$None of the expected values in the individual cells should be less than 1.

$~~~~~~~~~~~~~~~~(iii)~~$ No more than $20\%$ of the cells can have an expected value less than 5.

According to condition (ii), for example, we cannot perform this chi-square test on a table in which any cell has an expected value less than 1.

If we have a contingency table of dimension $2 \times 2$, then according to condition (iii), none of the four cells can have an expected value less than 5, since a single cell already makes up $25\%$ of the table. For a contingency table of dimension $2 \times 3$, only one cell ($1/6 \approx 16.7\%$ of the cells) can have an expected value less than 5.

These conditions on the data should be seriously considered before applying this test.

To compute a chi-square statistic for the table, we need the expected value for each cell.

The expected value for a given cell in a table with 'r' rows and 'c' columns is computed as,

Expected value of a cell $~ = ~ \dfrac{(row~sum~for~the~cell)\times(column~sum~for~the~cell)}{total~sum}$
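The expected values and the conditions (ii) and (iii) can be checked in a few lines of R. The following is a minimal sketch on a made-up $2 \times 3$ table; the counts and the variable names (observed, expected, cond_ii, cond_iii) are assumptions chosen only for illustration.

```r
# Hypothetical 2 x 3 table of observed counts (illustrative values only)
observed <- matrix(c(12, 7,  9,
                      8, 3, 11),
                   nrow = 2, byrow = TRUE)

# Expected count of cell (i, j) = (row i sum) * (column j sum) / (total sum)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Condition (ii): no expected count below 1
cond_ii <- all(expected >= 1)

# Condition (iii): at most 20% of the cells have an expected count below 5
cond_iii <- mean(expected < 5) <= 0.20

print(round(expected, 2))
print(c(cond_ii = cond_ii, cond_iii = cond_iii))
```

In this made-up table exactly one of the six cells has an expected count below 5, so both conditions hold.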

Following the methodology of chi-square test on correlations, the chi-square statistic is defined as,

$~~~~~~~~~~~~~\chi^2~=~\displaystyle \sum \limits_{i=1}^{n} \dfrac{(Observed(i) - Expected(i))^2}{Expected(i)} \rightarrow \chi^2((r-1)(c-1)) $

Here, Expected(i) and Observed(i) are the expected and observed values for cell i, and the sum runs over all $n = r \times c$ cells of the table.

This statistic follows a chi-square distribution with $(r-1)(c-1)$ degrees of freedom. We reject, or fail to reject, the null hypothesis of no correlation between the rows and columns of the table by comparing the p-value of the chi-square variable with a desired significance level $\alpha$, as in any chi-square goodness of fit test.
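As a rough sketch, the whole procedure (expected values, statistic, degrees of freedom, p-value) can be wrapped in a small R function. The name chi_square_table is a hypothetical helper, not a base R function, and the table used to exercise it contains made-up counts.

```r
# Pearson chi-square test for an r x c table of observed counts.
# chi_square_table is a hypothetical helper name, not part of base R.
chi_square_table <- function(observed) {
  expected  <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  statistic <- sum((observed - expected)^2 / expected)
  df        <- (nrow(observed) - 1) * (ncol(observed) - 1)
  # Upper-tail area of the chi-square distribution with df degrees of freedom
  p_value   <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p_value = p_value)
}

# Illustrative 2 x 3 table (made-up counts)
tab <- matrix(c(20, 30, 50,
                30, 30, 40), nrow = 2, byrow = TRUE)
res <- chi_square_table(tab)
print(res)

# Decision at significance level alpha = 0.05
alpha <- 0.05
print(res$p_value < alpha)
```

Using pchisq() with lower.tail = FALSE is equivalent to 1 - pchisq(), but avoids the subtraction for very small p-values.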

As an example, we consider the following contingency table that studies the mortality of patients with a particular disease under two different treatment methods called "Treatment-A" and "Treatment-B":

                   Dead   Alive   Sum
    Treatment-A      41     216   257
    Treatment-B      64     180   244
    Sum             105     396   501


In order to perform the chi-square test of association, let us first compute the expected frequencies of each cell.

Noting that the expected value of a cell is $E(i,j)~ = ~ \dfrac{(sum~of~i^{th}~row)\times(sum~of~j^{th}~column)}{total~sum}$, we compute

$~~~~~E(1,1)~=\dfrac{257 \times 105}{501}~=~53.86$

$~~~~~E(1,2)~=~\dfrac{257 \times 396}{501}~=~203.14$

$~~~~E(2,1)~=~\dfrac{244 \times 105}{501}~=~51.14$

$~~~~E(2,2)~=~\dfrac{244 \times 396}{501}~=~192.86$

Using the expected values and the observed values for the four cells of the table, we compute the chi-square variable as,

$\chi^2~=~\dfrac{(41-53.86)^2}{53.86} + \dfrac{(216-203.14 )^2}{203.14} + \dfrac{(64-51.14 )^2}{51.14} + \dfrac{(180-192.86 )^2}{192.86} ~\approx~7.98$

(Carrying the unrounded expected values through the calculation gives $\chi^2 = 7.9789$.)

With two rows and two columns in the table, the number of degrees of freedom is $~~df = (r-1)(c-1) = (2-1)(2-1) = 1$.

The significance value, or the p-value, of the test is the area under the curve to the right of $\chi^2 = 7.98$ in a chi-square distribution with 1 degree of freedom.

The R function call 1 - pchisq(7.98, 1) gives a p-value of 0.004729.
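The hand calculation for the treatment table can be reproduced step by step in R. This is a minimal sketch; the variable names (observed, expected, chi_sq, df, p_value) are chosen here for illustration. Because R keeps the unrounded expected values, the statistic and p-value agree with the hand calculation only up to rounding.

```r
# Observed counts from the treatment table
observed <- matrix(c(41, 216,
                     64, 180), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Treatment-A", "Treatment-B"),
                                   c("Dead", "Alive")))

# Expected count of each cell: (row sum) * (column sum) / (total sum)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# Pearson chi-square statistic, without continuity correction
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom and upper-tail p-value
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```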

Thus, comparing with a significance level of $\alpha = 0.05$, we reject the null hypothesis of no correlation between the treatment method and the survival.

We can conclude, at a significance level of 0.05, that the treatment method and the survival are correlated.

Among those who underwent Treatment-A, the fraction survived = $\dfrac{216}{257}=0.840~=84.0\%$

Among those who underwent Treatment-B, the fraction survived = $\dfrac{180}{244}=0.738~=73.8\%$

These two fractions, $84.0\%$ and $73.8\%$, are significantly different at a level of 0.05. This is the conclusion.
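The survival fractions per treatment can also be read off the table in R with prop.table(), which divides each row by its row total when margin = 1. A small sketch, with variable names chosen for illustration:

```r
# Observed counts from the treatment table
observed <- matrix(c(41, 216,
                     64, 180), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Treatment-A", "Treatment-B"),
                                   c("Dead", "Alive")))

# Row-wise proportions: each row sums to 1
row_fractions <- prop.table(observed, margin = 1)

# Fraction of patients who survived under each treatment, in percent
survival_fraction <- row_fractions[, "Alive"]
print(round(100 * survival_fraction, 1))
```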

Yates' continuity correction for Pearson's $\chi^2$ test

When we create a 2x2 contingency table, we have binomial variables. The observed frequencies of these binomial variables are discrete, whereas the chi-square distribution applies to continuous variables. Still, we treat these discrete variables as if they came from a continuous distribution. Approximating a discrete binomial variable by the continuous $\chi^2$ distribution creates an error, which has the effect of overestimating the statistic of Pearson's chi-square test and thus underestimating the p-value of the test (inflating the type I error rate).

This error becomes appreciable as the cell frequencies get smaller (say, less than 5 or so).

Yates proposed a correction for this error in using discrete frequencies for the continuous chi-square test. In practice the correction is applied to 2x2 tables; R's chisq.test(), for instance, applies it only in the 2x2 case.

To apply this correction, a value of 0.5 is subtracted from the absolute value of every (observed - expected) difference, and the rest of the test proceeds as described before. This correction takes care of the error due to approximating (discrete) frequencies as continuous variables in the test.

Accordingly, we compute the test statistic as,

$~~~~~~~~~~~~~\chi^2~=~\displaystyle \sum \limits_{i=1}^{n} \dfrac{(~|Observed(i) - Expected(i)| - 0.5~)^2}{Expected(i)} \rightarrow \chi^2((r-1)(c-1)) $


Let us now compute the chi-square variable for the above test with Yates' continuity correction:

$\chi^2~=~\dfrac{(|41-53.86|-0.5)^2}{53.86} + \dfrac{(|216-203.14|-0.5 )^2}{203.14} + \dfrac{(|64-51.14|-0.5 )^2}{51.14}$
$~~~~~~~~~~~~~~~+ \dfrac{(|180-192.86|-0.5 )^2}{192.86}$

$~~~~~~~\approx~7.37$

We see that the Yates correction for this data lowers the chi-square value from about 7.98 to about 7.37. The change is modest because of the large cell frequencies we have, and hence it does not alter the conclusions of the test obtained without the correction. The Yates correction for continuity is intended for tables with smaller frequencies in cells, of the order of 5 or so.
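The corrected statistic can be checked by hand in R. The sketch below assumes the absolute-value form of the correction, in which 0.5 is subtracted from |observed - expected| in every cell before squaring; this is the form that matches chisq.test() with correct = TRUE for this table. Variable names are illustrative.

```r
# Observed counts from the treatment table
observed <- matrix(c(41, 216,
                     64, 180), nrow = 2, byrow = TRUE)

expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
abs_diff <- abs(observed - expected)

# Pearson statistic without and with Yates' continuity correction
chi_sq_plain <- sum(abs_diff^2 / expected)
chi_sq_yates <- sum((abs_diff - 0.5)^2 / expected)

# Upper-tail p-values with 1 degree of freedom
p_plain <- pchisq(chi_sq_plain, df = 1, lower.tail = FALSE)
p_yates <- pchisq(chi_sq_yates, df = 1, lower.tail = FALSE)

print(c(plain = chi_sq_plain, yates = chi_sq_yates))
print(c(plain = p_plain, yates = p_yates))
```

The correction always shrinks the statistic, so the corrected p-value is never smaller than the uncorrected one.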

However, when the frequencies get less than 5 or so, the Yates correction can alter the conclusions of the test obtained without the correction by pushing marginal p-values upwards.
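To illustrate this effect, the following sketch runs chisq.test() with and without the correction on a hypothetical 2x2 table with small counts. The counts are made up for illustration and do not come from any study; two of the four expected values are below 5, so the table also violates condition (iii).

```r
# Hypothetical 2 x 2 table with small counts (made-up values)
small_table <- matrix(c(8, 2,
                        3, 7), nrow = 2, byrow = TRUE)

# chisq.test() warns when expected counts are small; the warning is
# suppressed here only to keep the output short
res_plain <- suppressWarnings(chisq.test(small_table, correct = FALSE))
res_yates <- suppressWarnings(chisq.test(small_table, correct = TRUE))

print(res_plain$expected)
print(c(plain = res_plain$p.value, yates = res_yates$p.value))

# Decision at the 0.05 level, without and with the correction
print(c(plain = res_plain$p.value < 0.05, yates = res_yates$p.value < 0.05))
```

For this table the uncorrected test rejects the null hypothesis at the 0.05 level while the corrected test does not, so the conclusion depends on whether the correction is applied.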

Therefore, for the contingency tables with smaller sample sizes, it is better to use Fisher's exact test described in the next section instead of the Pearson's chi-square test with continuity correction.

Chi-square test for contingency table in R

Let us perform Pearson's chi-square test in R for the following contingency table we encountered above:

                   Dead   Alive   Sum
    Treatment-A      41     216   257
    Treatment-B      64     180   244
    Sum             105     396   501


We can use the R function "chisq.test()" for this statistical test, as shown below. This function takes the data table as a matrix and can apply Pearson's chi-square test with or without Yates' correction.

    # Create a matrix of the data table
    data_table = matrix(c(41, 216, 64, 180), nrow=2, ncol=2, byrow=TRUE)

    ## assign row and column names
    rownames(data_table) = c("Treatment-A", "Treatment-B")
    colnames(data_table) = c("Dead","Alive")

    ## Perform chi-square test without Yates' continuity correction
    ## (by default, this function applies the continuity correction)
    res1 = chisq.test(data_table, correct=FALSE)

    ## print the result
    print("Pearson's chi-square test without yates continuity correction :")
    print(res1)

    print("-------------------------------------------------------------------")

    print("Pearson's chi-square test with yates continuity correction")

    ## Perform Pearson's chi-square test with Yates' correction (default is TRUE)
    res2 = chisq.test(data_table, correct=TRUE)

    ## print the result
    print(res2)

Running the above lines as an R script prints the following result:

    [1] "Pearson's chi-square test without yates continuity correction :"

        Pearson's Chi-squared test

    data:  data_table
    X-squared = 7.9789, df = 1, p-value = 0.004733

    [1] "-------------------------------------------------------------------"
    [1] "Pearson's chi-square test with yates continuity correction"

        Pearson's Chi-squared test with Yates' continuity correction

    data:  data_table
    X-squared = 7.3706, df = 1, p-value = 0.00663

We observe that the Yates continuity correction has decreased the $\chi^2$ value from 7.9789 to 7.3706, which has increased the p-value of the test from 0.00473 to 0.00663. This small increase in p-value has not altered the conclusions of the test, since the frequencies of the cells are large (more than a few tens). However, if the frequencies get smaller, the above correction can push the p-value up enough to change the conclusions of the test.
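Besides printing the result, the object returned by chisq.test() can be inspected component by component, which is convenient for checking the expected values against the assumptions of the test. A brief sketch, with the variable names res and alpha chosen here for illustration:

```r
# Same treatment table as above
data_table <- matrix(c(41, 216, 64, 180), nrow = 2, ncol = 2, byrow = TRUE,
                     dimnames = list(c("Treatment-A", "Treatment-B"),
                                     c("Dead", "Alive")))

res <- chisq.test(data_table, correct = FALSE)

# Individual components of the test result
print(res$statistic)   # chi-square statistic (named "X-squared")
print(res$parameter)   # degrees of freedom (named "df")
print(res$p.value)     # p-value of the test
print(res$expected)    # matrix of expected counts

# Check the expected counts, then decide at the chosen significance level
alpha <- 0.05
print(all(res$expected >= 5))
print(res$p.value < alpha)
```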