Contingency tables, Biostatistics

Basic Statistics with R

Contingency Tables

The contingency table is a simple yet powerful way of displaying the frequency of occurrence of categorical data.

The data in such a table involve two or more variables, each divided into two or more categories. The minimum size of a contingency table is $\small{2 \times 2}$, i.e., two variables (e.g., sex and vaccination status), each divided into two categories (e.g., male/female and vaccinated/not vaccinated), giving $\small{2 \times 2 = 4}$ combinations (cells) of data. There is no upper limit on the dimensions.

A contingency table is used to derive and interpret some important probability estimates from the data. We will provide two examples here: a classical $\small{2 \times 2 }$ table used for disease studies and clinical trials, followed by a table of larger dimensions for problem solving in probability theory.

Example : A 2x2 contingency table for clinical diagnosis

Suppose a new diagnostic kit developed for detecting a certain lung infection is sent for clinical trial. The kit shows either a positive or a negative result for the disease when tried on a person. A random sample of 500 adults with the disease (confirmed by a more powerful scanning method) and another 500 healthy adults without the disease were tested with the kit. The results are as follows:

Of the 500 persons with disease present, 470 tested positive and 30 tested negative.

Of the 500 healthy persons with disease absent, 20 tested positive and 480 tested negative.

We prepare the following contingency table to represent the above data, with a general labelling of the 4 cells as follows:

True Positives (TP) : These are the cases with the disease really present that (correctly) tested positive.

True Negatives (TN) : These are the cases with the disease really absent that (correctly) tested negative.

False Positives (FP) : These are the cases with the disease really absent that (wrongly) tested positive.

False Negatives (FN) : These are the cases with the disease really present that (wrongly) tested negative.

Let us denote

P = TP + FN = Number of cases with actual disease present

N = FP + TN = Number of cases with actual disease absent

P' = TP + FP = Number of cases that were diagnosed to be positive (irrespective of whether disease is present or not)

N' = FN + TN = Number of cases diagnosed to be negative (irrespective of whether disease is present or not)

As an additional notation, we refer to the number of cases that are TP, FP, FN and TN by the letters a, b, c and d respectively, as in the table below:

	Disease Present	Disease Absent	Sum
Test Positive	a=470 (TP)	b=20 (FP)	a+b=490 ( P' )
Test Negative	c=30 (FN)	d=480 (TN)	c+d=510 ( N' )
Sum	a+c=500 ( P )	b+d=500 ( N )	n=a+b+c+d=1000 ( P + N )

The above contingency table is referred to as a confusion matrix in machine learning.
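The cell counts and the marginal sums P, N, P' and N' can be checked with a short script. This is a minimal Python sketch rather than part of the original tutorial; the variable names (`tp`, `fn`, `fp`, `tn`, `P_prime`, `N_prime`, `table`) are illustrative assumptions, and the counts are the ones reported for the kit above.

```python
# Counts reported for the diagnostic kit in the trial described above.
tp, fn = 470, 30   # disease present: tested positive / tested negative
fp, tn = 20, 480   # disease absent:  tested positive / tested negative

# Marginal sums, following the notation P, N, P', N' defined above.
P = tp + fn         # cases with disease present
N = fp + tn         # cases with disease absent
P_prime = tp + fp   # cases that tested positive
N_prime = fn + tn   # cases that tested negative
n = tp + fn + fp + tn

# Rows: test result; columns: actual disease status.
table = [
    ["", "Disease present", "Disease absent", "Sum"],
    ["Test positive", tp, fp, P_prime],
    ["Test negative", fn, tn, N_prime],
    ["Sum", P, N, n],
]
for row in table:
    print("".join(f"{str(cell):>17}" for cell in row))
```

Both pairs of marginals must add up to the same grand total, which gives a quick consistency check on any hand-built table.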

We define the following important parameters for the test:

1. The sensitivity of a test (or a symptom) is the probability that the test is positive given that the disease is present.

$\small{ Sensitivity = P(Positive|Present) = \dfrac{TP}{P} = \dfrac{a}{a+c} =\dfrac{470}{500} = 0.94 }~~~~~~~~$ ( true positive rate)

Thus, sensitivity is the proportion of positives that are correctly identified.

2. The specificity of a test (or symptom) is the probability that the test is negative when the disease is absent.

$\small{ Specificity = P(Negative|Absent) = \dfrac{TN}{N} = \dfrac{d}{b+d} = \dfrac{480}{500} = 0.96 }~~~~~~~~$ ( true negative rate)

Thus, specificity is the proportion of negatives that are correctly identified.

3. The false positive rate is the fraction of healthy persons wrongly identified as positive by the test.
$\small{False~Positive~rate = \dfrac{b}{b+d} = \dfrac{20}{500} = 0.040 }$

4. The false negative rate is the fraction of people with the disease who were missed out (wrongly shown negative) by the test.
$\small{False~negative~rate = \dfrac{c}{a+c} = \dfrac{30}{500} = 0.060} $

5. The predictive value positive of a test (or symptom) is the probability that the subject has the disease given that he tested positive.
$\small{ Predictive~value~positive = P(Present|Positive) = \dfrac{a}{a+b} = \dfrac{470}{490} = 0.959 }$

6. The predictive value negative of a test (or symptom) is the probability that the subject does not have the disease given that he tested negative.
$\small{ Predictive~value~negative = P(Absent|Negative) = \dfrac{d}{c+d} = \dfrac{480}{510} = 0.941 }$

7. The likelihood ratio positive is the probability of a person with the disease testing positive divided by the probability of a person who does not have the disease testing positive.

$\small{Likelihood~ratio~positive~~=~~\dfrac{P(Positive|Present)}{P(Positive|Absent)}~~=~~\dfrac{sensitivity}{1-specificity} }$

8. The likelihood ratio negative is the probability of a person with the disease testing negative divided by the probability of a person who does not have the disease testing negative.

$\small{Likelihood~ratio~negative~~=~~\dfrac{P(Negative|Present)}{P(Negative|Absent)}~~=~~\dfrac{1 - sensitivity}{specificity} }$
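The eight quantities defined above can be evaluated directly from the four cell counts. The following is a minimal Python sketch under the assumption that the counts are TP = 470, FN = 30, FP = 20 and TN = 480, as reported in the trial; the variable names are illustrative and not part of the original text. For these counts the likelihood ratio positive comes out to about 23.5 and the likelihood ratio negative to about 0.0625.

```python
# Cell counts from the trial described above.
tp, fn, fp, tn = 470, 30, 20, 480

sensitivity = tp / (tp + fn)           # P(Positive | Present)
specificity = tn / (fp + tn)           # P(Negative | Absent)
false_positive_rate = fp / (fp + tn)   # P(Positive | Absent) = 1 - specificity
false_negative_rate = fn / (tp + fn)   # P(Negative | Present) = 1 - sensitivity
ppv = tp / (tp + fp)                   # predictive value positive
npv = tn / (fn + tn)                   # predictive value negative
lr_positive = sensitivity / (1 - specificity)
lr_negative = (1 - sensitivity) / specificity

results = [
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("False positive rate", false_positive_rate),
    ("False negative rate", false_negative_rate),
    ("Predictive value positive", ppv),
    ("Predictive value negative", npv),
    ("Likelihood ratio positive", lr_positive),
    ("Likelihood ratio negative", lr_negative),
]
for name, value in results:
    print(f"{name:<28}{value:.4f}")
```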

The sensitivity and specificity represent the ability of the test to detect the presence or absence of the disease correctly.

The positive and negative predictive values measure the extent to which we can trust the results of the test.

The higher the values of these four probabilities, the better the diagnostic test.

The false positive rate measures how often the test wrongly identifies the disease when it is not there. Similarly, the false negative rate measures how often the test wrongly rejects the disease when it is actually present. These two are undesirable quantities and any diagnostic procedure must minimize them. We will learn more about false positives and false negatives when we study statistical tests in the chapters ahead.

Contingency table and Bayes theorem

The predictive value positive and the predictive value negative can be obtained from the measured sensitivity and specificity (equivalently, from the false negative and false positive rates) using Bayes theorem. For the predictive value positive and the predictive value negative we write, respectively,

$\small{P(Present|Positive)~=~\dfrac{P(Positive|Present)~P(Present)}{P(Positive|Present)~P(Present) + P(Positive|Absent)~P(Absent)} }$

$\small{P(Absent|Negative)~=~\dfrac{P(Negative|Absent)~P(Absent)}{P(Negative|Absent)~P(Absent) + P(Negative|Present)~P(Present)} }$

In the above expressions of Bayes theorem, apart from the false positive and false negative rates, the right hand sides also need the probability that the disease or condition is present, P(Present), and the probability that it is absent, P(Absent), to give the predictive values positive and negative. In real life situations, it is not easy to get exact values of P(Present) and P(Absent).
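The dependence of the predictive values on P(Present) can be illustrated numerically. This is a minimal Python sketch, not part of the original tutorial: the function name `predictive_values` and its parameters are hypothetical. It uses the sensitivity (0.94) and specificity (0.96) measured for the kit. In the trial sample, 500 of the 1000 subjects had the disease, so P(Present) = 0.5 reproduces the values read from the table. The second call uses a prevalence of 1%, which is an assumed value chosen only for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes theorem for a given P(Present)."""
    p_present = prevalence
    p_absent = 1.0 - prevalence
    # P(Positive | Absent) = 1 - specificity
    # P(Negative | Present) = 1 - sensitivity
    ppv = (sensitivity * p_present) / (
        sensitivity * p_present + (1.0 - specificity) * p_absent
    )
    npv = (specificity * p_absent) / (
        specificity * p_absent + (1.0 - sensitivity) * p_present
    )
    return ppv, npv


# Trial sample: half of the subjects had the disease, so P(Present) = 0.5.
ppv_trial, npv_trial = predictive_values(0.94, 0.96, 0.5)
print(f"P(Present) = 0.50: PPV = {ppv_trial:.3f}, NPV = {npv_trial:.3f}")

# Hypothetical prevalence of 1% (assumed for illustration, not from the text).
ppv_rare, npv_rare = predictive_values(0.94, 0.96, 0.01)
print(f"P(Present) = 0.01: PPV = {ppv_rare:.3f}, NPV = {npv_rare:.3f}")
```

With the same kit, a lower P(Present) gives a lower predictive value positive, which is why the choice of P(Present) matters when the formula is applied outside the trial sample.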

Example problem :
(This problem is from the book "Biostatistics" by Wayne Daniel and Chad L. Cross, Chapter 3)

The following table shows 1000 nursing school applicants classified according to scores made on a college entrance examination and the quality of the high school from which they graduated, as rated by a group of teachers.

	Quality of High Schools
Score	Poor (P)	Average (A)	Superior (S)	Total
Low (L)	105	60	55	220
Medium (M)	70	175	145	390
High (H)	25	65	300	390
Total	200	300	500	1000

Calculate the probability that an applicant picked randomly from this group
1. Made a low score on the examination
2. Graduated from a superior high school
3. Made a low score on the examination and graduated from a superior high school
4. Made a low score on the examination given that he or she graduated from a superior high school.
5. Made a high score or graduated from a superior high school.

From the table, we can answer these questions:

1. Probability of low score = $\small{ P(L) = \dfrac{220}{1000} = 0.22 }$

2. Probability of graduation from superior high school = $\small{ P(S)= \dfrac{500}{1000} = 0.5 }$

3. Probability of low score on exam and from superior high school, read directly from the (L, S) cell of the table = $\small{P(L \cap S) = \dfrac{55}{1000} = 0.055}$

4. Probability of low score given that from superior school = $\small{P(L|S) = \dfrac{P(L \cap S)}{P(S)} = \dfrac{0.055}{0.5} = 0.11 }$

5. Probability of a high score or graduated from a superior school = $\small{P(H \cup S) = P(H) + P(S) - P(H \cap S) = \dfrac{390}{1000} + \dfrac{500}{1000} - \dfrac{300}{1000} = 0.59 }$
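The five answers can be verified by computing them from the cell counts. The following is a minimal Python sketch, assuming the table above is stored as a nested dictionary; the names `counts`, `p_score`, `p_school` and `p_joint` are hypothetical helpers introduced for illustration. Exact fractions from the standard `fractions` module are used so that the results can be compared with the hand calculation without rounding.

```python
from fractions import Fraction

# Cell counts: rows are exam scores (L, M, H), columns are school quality (P, A, S).
counts = {
    "L": {"P": 105, "A": 60, "S": 55},
    "M": {"P": 70, "A": 175, "S": 145},
    "H": {"P": 25, "A": 65, "S": 300},
}
total = sum(sum(row.values()) for row in counts.values())


def p_score(score):
    """Marginal probability of an exam score category (row total / n)."""
    return Fraction(sum(counts[score].values()), total)


def p_school(school):
    """Marginal probability of a school quality category (column total / n)."""
    return Fraction(sum(row[school] for row in counts.values()), total)


def p_joint(score, school):
    """Joint probability of a score category and a school category."""
    return Fraction(counts[score][school], total)


p_L = p_score("L")                                   # 1. P(L)
p_S = p_school("S")                                  # 2. P(S)
p_L_and_S = p_joint("L", "S")                        # 3. P(L and S)
p_L_given_S = p_L_and_S / p_S                        # 4. P(L | S)
p_H_or_S = p_score("H") + p_S - p_joint("H", "S")    # 5. P(H or S)

answers = [
    ("P(L)", p_L),
    ("P(S)", p_S),
    ("P(L and S)", p_L_and_S),
    ("P(L | S)", p_L_given_S),
    ("P(H or S)", p_H_or_S),
]
for label, p in answers:
    print(f"{label:<11}= {float(p):.3f}")
```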

CountBio

Mathematical tools for natural sciences
