Basic Statistics with R

Distribution of difference between two sample means

The central limit theorem helps us to construct a confidence interval for the population mean using the sample mean and variance. On many occasions, we need to compare samples from two different distributions. We are generally interested in studying the difference between the sample means of these two distributions in order to make inferences.

As an example, suppose a school management wants to test a new methodology for teaching mathematics. They want to establish that the new methodology leads to a better understanding of the subject than the traditional one followed by the school.

But in reality, not all the students taught with the new methodology will perform better than all the students taught with the old one, since there are variations among individual students. The management can only expect the performance to improve at the average level rather than at the individual level in the groups considered.

In order to test the improvement in the average level of performance between the two groups, they can choose a group of 60 students with identical maths background and ability from the school. The sample is then divided into two groups of 30 each. For one group, they continue to teach selected topics in maths using the traditional method; the second group is taught using the new method. After certain topics have been taught, the students from both groups take an appropriately devised maths test on those topics. The management can then compare the mean scores of the students from the two groups to reach a conclusion about the effectiveness of the new method.

Suppose the scores obtained by the two groups of students are assumed to be random samples from two Gaussian distributions with different means and standard deviations. From statistical theory, if we know the distribution followed by the difference of the sample means in terms of the population means, we can quantify the average improvement in performance. We may have to make some assumptions about the unknown population variances.

Independent and dependent samples

While comparing the sample means of two distributions, we must know whether the samples are dependent or independent.

Two observations X and Y are independent if the two sets of measurements are taken on different samples; here X and Y are independent random variables. For example, in order to test the effects of two medicines, we select a certain number of patients with similar health conditions and divide them into two groups. We then test medicine A on group 1 and medicine B on group 2, assuming that there are no individual-dependent factors that affect the working of a medicine on a patient. In this case, the difference in means given by $\overline{X} - \overline{Y}$ is used as the test statistic.

Two observations X and Y are dependent if the two measurements are taken on the same subject; in this case X and Y are dependent random variables. This method is used for measurements that aim at finding the before and after effects of something we want to test, for example, establishing the weight reduction in individuals by comparing their weights before and after a fitness program. Since the response to the program varies among individuals, the X and Y measurements are made on the same person before and after the fitness program. For the individual represented by index i in the study, the difference $d_i = X_i - Y_i$ between the two measurements is computed. The mean of all the $d_i$ values is then used as the test statistic.

Thus, with independent samples we study the difference of the means of the two groups, while in the dependent case we study the mean of the differences among individuals.

This distinction is important because different statistical methods are used for comparing independent and dependent data sets.


(A) Distribution of difference between the means of two independent samples

Suppose we want to compare the means of two normal distributions. Consider two variables \(\small{X}\) and \(\small{Y}\) following the normal distributions \(\small{N(\mu_{X}, \sigma_{X}) }\) and \(\small{N(\mu_{Y}, \sigma_{Y}) }\) respectively. Let \(\small{X_1,X_2,X_3,....,X_n }\) and \(\small{Y_1,Y_2,Y_3,....,Y_m }\) be two independent random samples of sizes \(\small{n}\) and \(\small{m}\) drawn from these two distributions. Now, depending on the information available on the population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\), we consider the following three cases:


Case A.1: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are known

Suppose we assume that \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are known. Since the random samples are independent, the respective sample means \(\small{\overline{X}}\) and \(\small{\overline{Y}}\) follow the normal distributions \(\small{N(\mu_X, \dfrac{\sigma_X^2}{n})}\) and \(\small{N(\mu_Y, \dfrac{\sigma_Y^2}{m})}\).
From statistical theory, we know that the distribution of the difference \(\small{\overline{X} - \overline{Y} }\) is a Gaussian with mean \(\small{\mu_X - \mu_Y }\) and combined variance \(~~\small{\sigma_{W}^2 = \dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}\). Standardizing the difference, we can write,

\(~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} }~~\) that follows a unit normal distribution $N(0,1)$

From the definition of confidence interval, we can define a \(\small{100(1-\alpha)\%}\) confidence interval for the difference in population means \(\small{\mu_X - \mu_Y}\) as follows:
\(\small{[ (\overline{X} - \overline{Y}) - Z_{1 - \frac{\alpha}{2}}\sigma_W,~~~(\overline{X} - \overline{Y}) + Z_{1 - \frac{\alpha}{2}}\sigma_W ]~~~~~~~~~or~~~~~~~(\overline{X}-\overline{Y}) \pm Z_{1 - \frac{\alpha}{2}} \sqrt{ \dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m} } }\)
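
This interval is straightforward to compute in R. Below is a minimal sketch assuming two simulated samples x and y whose population variances are treated as known; all the numbers here are made up for illustration, and qnorm() supplies the \(\small{Z_{1 - \frac{\alpha}{2}}}\) quantile.

# Known population variances: Z-based CI for mu_X - mu_Y
set.seed(1)
sigma2_x <- 4; sigma2_y <- 9                        # known population variances (assumed)
x <- rnorm(40, mean = 70, sd = sqrt(sigma2_x))      # simulated sample of size n = 40
y <- rnorm(35, mean = 66, sd = sqrt(sigma2_y))      # simulated sample of size m = 35

alpha  <- 0.05
dbar   <- mean(x) - mean(y)                         # observed difference in sample means
se     <- sqrt(sigma2_x / length(x) + sigma2_y / length(y))   # sigma_W
z.crit <- qnorm(1 - alpha / 2)                      # Z_{1 - alpha/2}
c(lower = dbar - z.crit * se, upper = dbar + z.crit * se)     # 95% CI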


Case A.2: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown with large sample size

As the sample sizes increase, the sample variances $S_X^2$ and $S_Y^2$ approach the population variances $\sigma_X^2$ and $\sigma_Y^2$. Therefore, for large sample sizes (typically greater than 30) with unknown population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\), we can replace \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) with the corresponding sample estimates \(\small{S_X^2}\) and \(\small{S_Y^2}\) as an approximation. In this case,

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} ~~}\) follows the unit normal distribution $N(0,1)$.

We write a \(\small{100(1-\alpha)\%}\) confidence interval for the difference in means \(\small{\mu_X - \mu_Y}\) as,

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{(\overline{X}-\overline{Y}) \pm Z_{1 - \frac{\alpha}{2}}\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}} } \)
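
As a minimal R sketch, assuming two made-up large samples, we can simply substitute var(x) and var(y) for the unknown population variances:

# Large samples, unknown variances: substitute sample variances
set.seed(2)
x <- rnorm(100, mean = 52, sd = 6)                  # simulated large sample, n = 100
y <- rnorm(120, mean = 50, sd = 8)                  # simulated large sample, m = 120

alpha  <- 0.05
dbar   <- mean(x) - mean(y)
se     <- sqrt(var(x) / length(x) + var(y) / length(y))   # estimated standard error
z.crit <- qnorm(1 - alpha / 2)
c(lower = dbar - z.crit * se, upper = dbar + z.crit * se) # approximate 95% CI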



Case A.3: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown with small sample size

We now consider the confidence intervals for the difference between means of two variables drawn from normal distributions of unknown variances, when the sample sizes are small.

Consider two sets of data points \(\small{X_1,X_2,X_3,...,X_n}\) and \(\small{Y_1,Y_2,Y_3,...,Y_m }\) drawn from two normal distributions \(\small{N(\mu_X, \sigma_X) }\) and \(\small{N(\mu_Y,\sigma_Y)}\) respectively. Let \(\small{S_X , S_Y }\) be the sample standard deviations of these two data sets.

When the sample sizes of the two sets are small (less than 30 as a rule of thumb), the statistic \(\small{ \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y )}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} }\) does not follow the unit normal distribution \(\small{N(0,1)}\). In this case, a t-statistic with a t distribution is derived.

Two different cases are considered at this stage, based on our assumptions about the unknown standard deviations $\sigma_X$ and $\sigma_Y$:

(i) Variances of the two normal distributions are unknown but assumed to be equal.

(ii) Variances of the two normal distributions are unknown, but assumed to be unequal.

Case A.3.1: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown and equal with small sample size

Suppose we can afford to make the assumption that the two independent samples are drawn from normal distributions whose variances \(\small{\sigma_X^2, \sigma_Y^2}\) are equal, even though we do not know their common value. Under this assumption of equal population variance, the distribution followed by the difference in the observed sample means can be derived.

Since the population variances are assumed to be equal, we let \(\small{\sigma_X^2 = \sigma_Y^2 = \sigma^2 }\). We already learnt that in this case,
\(~~~~~~~~~~~~\small{Z = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma^2}{n} + \dfrac{\sigma^2}{m} }} }~~~~\) is a unit normal variable.

If the variables X and Y follow normal distributions, then we know from the properties of the Chi-square distribution that

\(~~~~~~~~~~~~~~~\small{\dfrac{(n-1)S_X^2}{\sigma^2}}~~\) follows \(~~\small{\chi^2(n-1)}~~~\) and \(~~\small{\dfrac{(m-1)S_Y^2}{\sigma^2}}~~\) follows \(~~\small{\chi^2(m-1)}~~~\).

The Chi-square distribution has the additive property that the sum of two independent variables following \(\small{\chi^2(r_1)}\) and \(\small{\chi^2(r_2)}\) follows \(\small{\chi^2(r_1 + r_2)}\). Then, the sum U of the above two independent variables, defined by,

\(\small{U = \dfrac{(n-1)S_X^2}{\sigma^2} + \dfrac{(m-1)S_Y^2}{\sigma^2}} ~~\) follows \(~~\small{\chi^2(n+m-2) }\)

A random variable T is defined in terms of the above-mentioned Z and U variables as,

\(~~~~~~~\small{T = \dfrac{Z}{ \sqrt{\dfrac{U}{n+m-2}} } }~~~~\) that follows a t distribution with (n+m-2) degrees of freedom.

Substituting the expressions for U and Z into the above expression of the t variable, we get,

\(\small{ T = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\left[\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\right] \left[\dfrac{1}{n} + \dfrac{1}{m}\right] } } ~~ = ~~ \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{S_p \sqrt{\left[\dfrac{1}{n} + \dfrac{1}{m}\right]}} }\)
where the pooled standard deviation \(\small{S_p}\) is defined as, \(~~~\small{S_p = \sqrt{\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2} } }\)

Once we have this expression for the T variable following a t distribution with n+m-2 degrees of freedom, we can identify, for given sample sizes n and m, a value \(\small{t_{1-\alpha/2}}\) such that the probability (i.e., the area under the symmetric curve) above \(\small{t_{1-\alpha/2}}\) and below \(\small{-t_{1-\alpha/2}}\) sums to \(\small{\alpha }\). We write this as,

\(\small{P(~~-t_{1-\alpha/2}(n+m-2)~~ \leq ~~T ~~\leq~~ t_{1-\alpha/2}(n+m-2)~~)~~ = ~~1 - \alpha }\)

Substituting the expression for T into the above equation and solving the inequality for \(\small{\mu_X - \mu_Y }\), we get the probability statement,

\(\small{ P\left(~\overline{X}-\overline{Y} - t_{1-\alpha/2}(n+m-2)\,S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}~~\leq~~\mu_X-\mu_Y~~\leq~~ \overline{X}-\overline{Y} + t_{1-\alpha/2}(n+m-2)\,S_p\sqrt{\dfrac{1}{n}+\dfrac{1}{m}}~\right) ~=~ 1 - \alpha }\)

We thus arrive at the following result:

If \(\small{\overline{x}}\) and \(\small{\overline{y}}\) are estimated means of two variables based on n and m observations respectively, and \(\small{s_p}\) is the estimated pooled standard deviation, then,

\(\small{(\overline{x} - \overline{y})~~ \pm ~~ t_{1-\alpha/2}(n+m-2) s_p \sqrt{\dfrac{1}{n} + \dfrac{1}{m} } }\)

is a \(\small{100(1-\alpha)\% }\) confidence interval for \(\small{\mu_X - \mu_Y },~~~~\) where \(~~~\small{s_p = \sqrt{\dfrac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2} } }\)



We can say with \(\small{100(1-\alpha)\% }\) confidence that the difference \(\small{\mu_X - \mu_Y }\) in the unknown population means of the two variables lies in the above interval.

Thus, using the difference in the sample means, we can give a confidence interval for the difference in the unknown population means.
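
A minimal R sketch of this pooled-variance interval, on two small hypothetical samples, is given below; R's built-in t.test() with var.equal = TRUE should reproduce the same interval.

# Pooled-variance t interval for mu_X - mu_Y (hypothetical small samples)
x <- c(71, 68, 75, 73, 70, 69, 74, 72)
y <- c(65, 70, 66, 68, 64, 67, 69)

n <- length(x); m <- length(y); alpha <- 0.05
sp     <- sqrt(((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2))  # pooled sd
t.crit <- qt(1 - alpha / 2, df = n + m - 2)                          # t_{1-alpha/2}(n+m-2)
dbar   <- mean(x) - mean(y)
half   <- t.crit * sp * sqrt(1 / n + 1 / m)                          # half-width of CI
c(lower = dbar - half, upper = dbar + half)

t.test(x, y, var.equal = TRUE)$conf.int   # built-in check: same interval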


Case A.3.2: The population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown and unequal with small sample size

When the two population variances are unknown and unequal, the quantity

\(\small{W = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} }\) does not follow the unit normal distribution when n and m are small.

Welch proposed a Student's t distribution as an approximation for the above statistic W, with a modified number of degrees of freedom. This modified degrees of freedom lies between the minimum possible value, given by the smaller of n-1 and m-1, and the maximum possible value of n+m-2.

The Welch-Satterthwaite correction gives an expression for the modified degrees of freedom as,

\(~~~~~~~~~~~~~~~~\small{r = \dfrac{\left(\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}\right)^2} {\dfrac{1}{n-1}\left(\dfrac{s_x^2}{n}\right)^2 + \dfrac{1}{m-1} \left(\dfrac{s_y^2}{m}\right)^2 } }\)

The statistic W defined above approximately follows a t-distribution with r degrees of freedom.

For the W statistic defined above, an approximate \(\small{100(1-\alpha) }\) percent confidence interval for \(\small{\mu_X - \mu_Y }\) is given by,

\(~~~~~~~~~~~~~~~\small{\overline{x} - \overline{y}~~\pm~~t_{1-\alpha/2}(r) \sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m} } }\)
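
The following R sketch, again on made-up small samples, computes r from the Welch-Satterthwaite formula and the corresponding interval; t.test() with its default var.equal = FALSE applies the same Welch approximation.

# Welch t interval for mu_X - mu_Y with unequal variances (hypothetical data)
x <- c(23, 27, 21, 25, 26, 22)
y <- c(30, 28, 33, 29, 35, 31, 34)

n <- length(x); m <- length(y); alpha <- 0.05
vx <- var(x) / n; vy <- var(y) / m                  # s_x^2/n and s_y^2/m
r  <- (vx + vy)^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))   # Welch-Satterthwaite df
dbar <- mean(x) - mean(y)
half <- qt(1 - alpha / 2, df = r) * sqrt(vx + vy)   # qt accepts non-integer df
c(lower = dbar - half, upper = dbar + half)

t.test(x, y)$conf.int                               # Welch is R's default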


(B) Distribution of difference between the means of two dependent samples

In some experiments, the two sets of measurements X and Y are taken on the same subjects under different conditions. As mentioned at the beginning of the section, this constitutes a dependent data set. Since X and Y are not independent, we cannot apply the Z or t distributions to $\overline{X} - \overline{Y}$ to compute the confidence intervals. We have to take the corresponding individual observations \(\small{(X_i,Y_i)}\) as pairs of values in order to arrive at a statistic.

Let \(\small{(X_1,Y_1), (X_2, Y_2), ....., (X_n,Y_n) }\) be the pairs of n observations.

From this, we can compute the individual differences given by \(\small{d_1=X_1 - Y_1,~~d_2=X_2-Y_2,........,d_n=X_n-Y_n}\) and proceed to compute their sample mean \(\small{\overline{d} }\) and sample standard deviation \(\small{S_d }\).

Since \(\small{X}\) and \(\small{Y}\) follow normal distributions, the difference \(\small{d_i = X_i-Y_i }\) must follow a normal distribution \(\small{N(\mu_d, \sigma_d) }\).

The statistic defined by,

\(~~~~~~~~~\small{T = \dfrac{\overline{d} - \mu_d}{\left(\dfrac{S_d}{\sqrt{n}}\right)} ~~~~ }\) has a t distribution with n-1 degrees of freedom

The \(\small{100(1-\alpha) }\) percent confidence interval for the difference in means \(\small{\mu_d = \mu_X - \mu_Y }\) is given by,

\(\small{ \overline{d}~~\pm~~t_{1-\alpha/2}(n-1) \dfrac{S_d}{\sqrt{n}} }\)

We can say with \(\small{100(1-\alpha) }\) percent confidence that the difference \(\small{\mu_X - \mu_Y }\) in the unknown means of the two dependent populations lies within the above interval around the observed difference \(\small{\overline{d} }\) in the sample data sets.
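
A minimal R sketch of this paired interval, using hypothetical before/after measurements, is given below; t.test() with paired = TRUE should reproduce it.

# Paired t interval: hypothetical weights before and after a fitness program
before <- c(82, 90, 77, 85, 88, 79, 93, 84)
after  <- c(78, 87, 76, 80, 85, 77, 90, 83)

d <- before - after                                 # individual differences d_i
n <- length(d); alpha <- 0.05
half <- qt(1 - alpha / 2, df = n - 1) * sd(d) / sqrt(n)
c(lower = mean(d) - half, upper = mean(d) + half)

t.test(before, after, paired = TRUE)$conf.int       # built-in check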