Biostatistics with R

Distribution of difference between two sample means

The central limit theorem helps us to construct a confidence interval for the population mean using the sample mean and variance. On many occasions, we need to compare samples from two different distributions. We are generally interested in the difference between the sample means of these two distributions in order to draw inferences about the difference between the population means.

As an example, a school management wants to test a new methodology for teaching mathematics. They want to establish that, on average, the new methodology leads to a better understanding of the subject than the traditional one followed by the school. They choose a group of 60 students with identical maths background and ability from the school. The sample is then divided into two groups of 30 each. For one group, they continue to teach selected topics in maths using the traditional method. The second group is taught using the new method. After certain topics have been taught, the students from both groups take an appropriately devised maths test on the topics. We should be able to compare the mean scores of the students from the two groups to come to a conclusion about the effectiveness of the new method.



The independent and dependent samples

When comparing the sample means of two distributions, we must know whether the samples are dependent or independent.

Two sets of observations X and Y are independent if the measurements are taken on different samples. Here X and Y are independent random variables. For example, in order to test the effects of two medicines, we select a certain number of patients with similar health conditions and divide them into two groups. We then test medicine A on group 1 and medicine B on group 2. Here we assume that there are no individual-dependent factors that affect the working of a medicine on a patient.

Two sets of observations X and Y are dependent if the two measurements are taken on the same subject. In this case X and Y are dependent random variables. This method is used for measurements that aim at finding the before and after effects of something we want to test. For example, we may want to establish the weight reduction in individuals by comparing their weights before and after a fitness program. In this case it is recognized that different individuals vary in their response to the program, and hence the X and Y measurements are made on the same set of individuals before and after the fitness program.

This distinction is important because different statistical methods are used for comparing two data sets that are independent and dependent.


(A) Distribution of difference between the means of two independent samples

Suppose we want to compare the means of two normal distributions. Consider two variables \(\small{X}\) and \(\small{Y}\) following the normal distributions \(\small{N(\mu_{X}, \sigma_{X}^2) }\) and \(\small{N(\mu_{Y}, \sigma_{Y}^2) }\) respectively. Let \(\small{X_1,X_2,X_3,....,X_n }\) and \(\small{Y_1,Y_2,Y_3,....,Y_m }\) be two independent random samples of size \(\small{n}\) and \(\small{m}\) drawn from these two distributions.
Now, depending on the information on the population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\), we consider the following three different cases:


Case A.1: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are known

Suppose we assume that \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are known. Since X and Y are normally distributed, the respective sample means \(\small{\overline{X}}\) and \(\small{\overline{Y}}\) follow the normal distributions \(\small{N(\mu_X, \dfrac{\sigma_X^2}{n})}\) and \(\small{N(\mu_Y, \dfrac{\sigma_Y^2}{m})}\).
Because the two samples are independent, the difference \(\small{\overline{X} - \overline{Y} }\) follows a Gaussian with mean \(\small{\mu_X - \mu_Y }\) and the combined variance \(~~\small{\sigma_{W}^2 = \dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}\). Standardizing this difference, we can write,

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} \sim N(0,1) }\)
From the definition of confidence interval, we can define a \(\small{100(1-\alpha)\%}\) confidence interval for the difference in means \(\small{\mu_X - \mu_Y}\) as follows:
\(\small{[ (\overline{X} - \overline{Y}) - Z_{1 - \frac{\alpha}{2}}\sigma_W,~~~(\overline{X} - \overline{Y}) + Z_{1 - \frac{\alpha}{2}}\sigma_W ]~~~~~~~~~or~~~~~~~(\overline{X}-\overline{Y}) \pm Z_{1 - \frac{\alpha}{2}} \sqrt{ \dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m} } }\)
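
As an illustration of the interval for known variances, here is a minimal sketch in Python using only the standard library. The function name `z_interval_known_var` and the summary values passed to it are hypothetical, chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_known_var(xbar, ybar, var_x, var_y, n, m, alpha=0.05):
    """Confidence interval for mu_X - mu_Y when both population variances are known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # the quantile Z_{1 - alpha/2}
    sigma_w = sqrt(var_x / n + var_y / m)     # standard deviation of Xbar - Ybar
    diff = xbar - ybar
    return diff - z * sigma_w, diff + z * sigma_w


# Hypothetical summary values, made up for illustration only
low, high = z_interval_known_var(xbar=72.0, ybar=68.0,
                                 var_x=64.0, var_y=81.0, n=30, m=30)
print(f"95% CI for mu_X - mu_Y: ({low:.2f}, {high:.2f})")
```

The interval is symmetric about the observed difference in sample means, and its half-width shrinks as n and m grow.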


Case A.2: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown with large sample size

If the sample sizes are large (typically greater than 30) and the population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown, we can replace \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) with their corresponding sample estimates \(\small{S_X^2}\) and \(\small{S_Y^2}\). In this case, the statistic approximately follows the unit normal distribution,
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} \sim N(0,1) }\)
and we can write the approximate \(\small{100(1-\alpha)\%}\) confidence interval for the difference in means \(\small{\mu_X - \mu_Y}\) as,
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{(\overline{X}-\overline{Y}) \pm Z_{1 - \frac{\alpha}{2}}\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}} } \)
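
The large-sample interval can be sketched in Python from raw data as follows. The function name `z_interval_large_samples` is hypothetical, and the two samples are simulated with assumed means and standard deviations purely for illustration; they are not data from the text.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def z_interval_large_samples(x, y, alpha=0.05):
    """Approximate CI for mu_X - mu_Y using sample variances (large n and m)."""
    n, m = len(x), len(y)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # statistics.variance uses the (n - 1) denominator, i.e. S_X^2 and S_Y^2
    se = sqrt(variance(x) / n + variance(y) / m)
    diff = mean(x) - mean(y)
    return diff - z * se, diff + z * se


# Simulated samples (assumed parameters, for illustration only)
rng = random.Random(1)
x = [rng.gauss(70, 8) for _ in range(50)]
y = [rng.gauss(66, 9) for _ in range(60)]

low, high = z_interval_large_samples(x, y)
print(f"approximate 95% CI: ({low:.2f}, {high:.2f})")
```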



Case A.3: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown with small sample size

We now consider the confidence intervals for the difference between means of two variables drawn from normal distributions of unknown variance, when the sample sizes are small.

Consider two sets of data points \(\small{X_1,X_2,X_3,...,X_n}\) and \(\small{Y_1,Y_2,Y_3,...,Y_m }\) drawn from two normal distributions \(\small{N(\mu_X, \sigma_X^2) }\) and \(\small{N(\mu_Y,\sigma_Y^2)}\) respectively. Let \(\small{S_X , S_Y }\) be the sample standard deviations of these two data sets.

When the sample sizes of the two sets are small (less than 30 as a rule of thumb), the quantity \(\small{ \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y )}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} }\) does not follow the unit normal distribution \(\small{N(0,1)}\). In this case, a statistic that follows a t distribution is derived.

Two different cases are considered at this stage:
(i) Variances of the two normal distributions are unknown but assumed to be equal.
(ii) Variances of the two normal distributions are unknown but assumed to be unequal.

Case A.3.1: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown and equal with small sample size

Suppose we can afford to make the assumption that the two independent samples are drawn from normal distributions whose variances \(\small{\sigma_X^2, \sigma_Y^2}\) are equal, though we do not have their values. Under this assumption of equal population variance, the distribution followed by the difference in the observed sample means has been derived.

Since the population variances are assumed to be equal, we let \(\small{\sigma_X^2 = \sigma_Y^2 = \sigma^2 }\). We already learnt that in this case,
\(~~~~~~~~~~~~\small{Z = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma^2}{n} + \dfrac{\sigma^2}{m} }} }~~~~\) follows unit normal distribution \(\small{N(0,1) }\)

If the variables X and Y follow Normal distribution, then we know that
\(~~~~~~~~~~~~~~~\small{\dfrac{(n-1)S_X^2}{\sigma^2}}~~\) follows \(~~\small{\chi^2(n-1)}~~~\) and \(~~\small{\dfrac{(m-1)S_Y^2}{\sigma^2}}~~\) follows \(~~\small{\chi^2(m-1)}~~~\).

The Chi-square distribution has the property that the sum of two independent Chi-square variables is again a Chi-square variable, \(~~\small{\chi^2(r_1) + \chi^2(r_2) = \chi^2(r_1 + r_2)}\). Then, the sum U of the above two independent variables defined by,

\(\small{U = \dfrac{(n-1)S_X^2}{\sigma^2} + \dfrac{(m-1)S_Y^2}{\sigma^2}} ~~\) follows \(~~\small{\chi^2(n+m-2) }\)

A random variable T is defined in terms of the above mentioned Z and U variables, which are independent of each other, as,

\(~~~~~~~\small{T = \dfrac{Z}{ \sqrt{\dfrac{U}{n+m-2}} } }~~~~\) follows a t distribution with (n+m-2) degrees of freedom.

Substituting the expressions for U and Z into the above expression of the t variable, we get,

\(\small{ T = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\left[\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\right] \left[\dfrac{1}{n} + \dfrac{1}{m}\right] } } ~~ = ~~ \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{S_p \sqrt{\left[\dfrac{1}{n} + \dfrac{1}{m}\right]}} }\)
where the pooled standard deviation \(\small{S_p}\) is defined as, \(~~~\small{S_p = \sqrt{\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2} } }\)
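
The substitution can be written out in one more step, using only the definitions of Z and U given above. The unknown common variance \(\small{\sigma^2}\) appears in both the numerator and the denominator and cancels, which is why T can be computed from the data alone:

\(\small{ T = \dfrac{ \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} }{ \sqrt{\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{\sigma^2 (n+m-2)}} } ~~ = ~~ \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}} \sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} }\)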

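Once we have this expression for the T variable following a t distribution with n+m-2 degrees of freedom, we can identify a value \(\small{t = t_{\alpha/2}}\) such that the probability (i.e., the area under the symmetric curve) above \(\small{t_{\alpha/2}}\) and below -\(\small{t_{\alpha/2}}\) sums to \(\small{\alpha }\). Here \(\small{t_{\alpha/2}}\) denotes the upper \(\small{\alpha/2}\) critical point, which is the same number as the \(\small{(1-\alpha/2)}\) quantile written \(\small{t_{1-\alpha/2}}\). We write this as,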

\(\small{P(~~-t_{\alpha/2}(n+m-2)~~ \leq ~~T ~~\leq~~ t_{\alpha/2}(n+m-2)~~)~~ = ~~1 - \alpha }\)

Substituting the expression for T into the above equation and solving the inequality for \(\small{\mu_X - \mu_Y }\), we get a probability,

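\(\small{ P\left(~~\overline{X}-\overline{Y} - t_{\alpha/2}(n+m-2)\,S_p\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}~~\leq~~\mu_X-\mu_Y~~\leq~~ \overline{X}-\overline{Y} + t_{\alpha/2}(n+m-2)\,S_p\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}~~\right) ~~=~~ 1 - \alpha }\)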

We thus arrive at the following result:

If \(\small{\overline{x}}\) and \(\small{\overline{y}}\) are estimated means of two variables based on n and m observations respectively, and \(\small{s_p}\) is the estimated pooled standard deviation, then,

\(\small{(\overline{x} - \overline{y})~~ \pm ~~ t_{1-\alpha/2}(n+m-2) s_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m} } }\)

is a \(\small{100(1-\alpha)\% }\) confidence interval for \(\small{\mu_x - \mu_y },~~~~\) where \(~~~\small{s_p = \sqrt{\dfrac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2} } }\)



We can say with \(\small{100(1-\alpha)\% }\) confidence that the difference \(\small{\mu_x - \mu_y }\) in the unknown population means of the two variables is in the above interval.

Thus, using the difference in the sample means, we can give a confidence interval for the difference in the unknown population means.
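
The pooled interval can be sketched in Python as below. The function names `pooled_sd` and `pooled_t_interval` and the score lists are hypothetical. Because the Python standard library has no t quantile function, the critical value is passed in as a parameter and is taken from a t table.

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(x, y):
    """Pooled standard deviation s_p of two samples (equal-variance assumption)."""
    n, m = len(x), len(y)
    return sqrt(((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2))


def pooled_t_interval(x, y, t_crit):
    """CI for mu_x - mu_y; t_crit is the critical value for n + m - 2 df."""
    n, m = len(x), len(y)
    margin = t_crit * pooled_sd(x, y) * sqrt(1 / n + 1 / m)
    diff = mean(x) - mean(y)
    return diff - margin, diff + margin


# Made-up test scores for two small groups (illustration only)
x = [78, 85, 69, 91, 74, 88, 80, 77]
y = [72, 79, 65, 84, 70, 75]

# n + m - 2 = 12 degrees of freedom; 2.179 is the tabulated t_{0.025}(12)
low, high = pooled_t_interval(x, y, t_crit=2.179)
print(f"s_p = {pooled_sd(x, y):.3f}, 95% CI: ({low:.2f}, {high:.2f})")
```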


Case A.3.2: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown and unequal with small sample size

When the two population variances are unknown and unequal, the quantity

\(\small{W = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} }\) does not follow unit normal distribution, when n and m are small.

Welch proposed approximating the distribution of the above statistic W by a Student's t-distribution with a modified number of degrees of freedom. This modified value lies between the minimum possible value, given by the smaller of n-1 and m-1, and the maximum possible value of n+m-2.

The Welch-Satterthwaite correction gives an expression for the modified degrees of freedom as,

\(~~~~~~~~~~~~~~~~\small{r = \dfrac{\left(\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}\right)^2} {\dfrac{1}{n-1}\left(\dfrac{s_x^2}{n}\right)^2 + \dfrac{1}{m-1} \left(\dfrac{s_y^2}{m}\right)^2 } }\)

The statistic W defined above approximately follows a t-distribution with r degrees of freedom.

For the W statistic defined above, an approximate \(\small{100(1-\alpha) }\) percent confidence interval for \(\small{\mu_x - \mu_y }\) is given by,

\(~~~~~~~~~~~~~~~\small{\overline{x} - \overline{y}~~\pm~~t_{1-\alpha/2}(r) \sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m} } }\)
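
A minimal Python sketch of the Welch-Satterthwaite degrees of freedom and the resulting interval is given below. The names `welch_df`, `welch_interval` and `T_975` are hypothetical, and the data are made up. The small dictionary holds tabulated 95% critical values; rounding r down to a whole number before the lookup is a convenience assumed here, which makes the interval slightly wider than using the fractional r.

```python
from math import floor, sqrt
from statistics import mean, variance

# Tabulated upper 2.5% points of the t distribution, keyed by degrees of freedom
T_975 = {5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306,
         9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179}


def welch_df(x, y):
    """Welch-Satterthwaite degrees of freedom r."""
    n, m = len(x), len(y)
    a, b = variance(x) / n, variance(y) / m
    return (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))


def welch_interval(x, y, t_crit):
    """Approximate CI for mu_x - mu_y without assuming equal variances."""
    n, m = len(x), len(y)
    margin = t_crit * sqrt(variance(x) / n + variance(y) / m)
    diff = mean(x) - mean(y)
    return diff - margin, diff + margin


# Made-up data (illustration only)
x = [78, 85, 69, 91, 74, 88, 80, 77]
y = [72, 79, 65, 84, 70, 75]

r = welch_df(x, y)            # lies between min(n-1, m-1) = 5 and n+m-2 = 12
t_crit = T_975[floor(r)]      # round r down, then look up the table value
low, high = welch_interval(x, y, t_crit)
print(f"r = {r:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```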









(B) Distribution of difference between the means of two dependent samples

In some experiments, the two sets of measurements X and Y are taken on the same subjects under different conditions. As mentioned in the beginning of the section, this constitutes a dependent data set. Since X and Y are not independent, we cannot apply the two-sample Z or t statistics derived above to compute the confidence intervals. We have to take corresponding observations \(\small{(X_i,Y_i)}\) as pairs of values in order to arrive at a statistic.

Let \(\small{(X_1,Y_1), (X_2, Y_2), ....., (X_n,Y_n) }\) be the pairs of n observations.

From this, we can compute the individual differences given by \(\small{D_1=X_1 - Y_1,~~D_2=X_2-Y_2,........,D_n=X_n-Y_n}\) and proceed further to get their sample mean \(\small{\overline{D} }\) and sample standard deviation \(\small{S_D }\).

Since \(\small{X}\) and \(\small{Y}\) follow normal distributions, the differences \(\small{D_i = X_i-Y_i }\) are taken to follow a normal distribution \(\small{N(\mu_D, \sigma_D^2) }\) with mean \(\small{\mu_D = \mu_X - \mu_Y }\).

The statistic defined by,

\(~~~~~~~~~\small{T = \dfrac{\overline{D} - \mu_D}{\left(\dfrac{S_D}{\sqrt{n}}\right)} ~~~~ }\) follows a t distribution with n-1 degrees of freedom

The \(\small{100(1-\alpha) }\) percent confidence interval for the mean difference \(\small{\mu_D = \mu_X - \mu_Y }\) is given by,

\(\small{ \overline{D}~~\pm~~t_{\alpha/2}(n-1) \dfrac{S_D}{\sqrt{n}} }\)





We can say with \(\small{100(1-\alpha)\% }\) confidence that the difference \(\small{\mu_X - \mu_Y }\) in the unknown means of the two dependent data populations is within the above interval around the observed mean difference \(\small{\overline{D} }\) in the sample data sets.
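
The paired interval can be sketched in Python as follows. The function name `paired_t_interval` and the before/after weights are hypothetical. As the standard library has no t quantile function, the critical value for n-1 degrees of freedom is passed in from a t table.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_interval(x, y, t_crit):
    """CI for mu_D = mu_X - mu_Y from paired data; t_crit is for n - 1 df."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    d = [xi - yi for xi, yi in zip(x, y)]   # differences D_i = X_i - Y_i
    n = len(d)
    d_bar = mean(d)                         # sample mean of the differences
    margin = t_crit * stdev(d) / sqrt(n)    # stdev uses the (n - 1) denominator
    return d_bar - margin, d_bar + margin


# Made-up weights (kg) of 10 individuals before and after a program
before = [82.1, 90.4, 77.3, 85.0, 95.2, 88.7, 79.9, 91.5, 84.3, 86.6]
after = [80.0, 87.9, 76.8, 82.1, 92.0, 86.5, 79.1, 88.2, 82.9, 84.0]

# n - 1 = 9 degrees of freedom; 2.262 is the tabulated t_{0.025}(9)
low, high = paired_t_interval(before, after, t_crit=2.262)
print(f"95% CI for the mean reduction: ({low:.2f}, {high:.2f})")
```

Only the differences enter the calculation, so the variation between individuals, which is common to both measurements, does not inflate the interval.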