The central limit theorem helps us construct a confidence interval for the population mean using the sample mean and variance. On many occasions, we need to compare samples from two different distributions. We are generally interested in studying the difference between the sample means of these two distributions in order to make inferences. As an example, suppose a school management wants to test a new methodology for teaching mathematics. They want to establish that the new methodology leads to a better understanding of the subject than the traditional one followed by the school. In reality, not all the students taught with the new methodology will perform better than all the students taught with the old methodology, since there are variations among individual students. The management can only expect the performance to improve at the average level rather than at the individual level in the groups considered. In order to test the improvement in the average performance between the two groups, they can choose a group of 60 students with similar mathematics background and ability from the school. The sample is then divided into two groups of 30 each. One group continues to be taught selected topics in mathematics using the traditional method, while the second group is taught using the new method. After the topics have been taught, the students from both groups take an appropriately devised mathematics test on those topics. The mean scores of the students from the two groups can then be compared to draw a conclusion about the effectiveness of the new method. Suppose the scores obtained by the two groups of students are assumed to be random samples from two Gaussian distributions with different means and standard deviations. From statistical theory, if we know the distribution followed by the difference of the sample means in terms of the population means, we can quantify the average improvement in performance. We may have to make some assumptions about the unknown population variances.
While comparing the sample means of two distributions, we must know whether they are dependent or independent. Two observations X and Y are independent if the two sets of measurements are taken on different samples; here X and Y are independent random variables. For example, in order to test the effects of two medicines, we select a certain number of patients with similar health conditions and divide them into two groups. We then test medicine A on group 1 and medicine B on group 2. Here we assume that there are no individual factors that affect the working of a medicine on a patient. In this case, the difference in means given by $\overline{X} - \overline{Y}$ is used as the test statistic. Two observations X and Y are dependent if the two measurements are taken on the same subject; in this case X and Y are dependent random variables. This approach is used for measurements that aim at finding the before and after effects of something we want to test, for example, to establish the weight reduction in individuals by comparing their weights before and after a fitness program. Here it is recognized that the response to the program varies among individuals, and hence the X and Y measurements are made on the same person before and after the fitness program. For an individual represented by index i in the study, the difference in performance $d_i = X_i - Y_i$ between the two measurements is computed. The mean of all the $d_i$ values is then used as the test statistic. Thus, with independent samples we study the difference of means of the two groups, while in the dependent case we study the mean of differences among individuals. This distinction is important because different statistical methods are used for comparing data sets that are independent and data sets that are dependent.
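To make the distinction concrete, here is a minimal Python sketch contrasting the two test statistics; the score arrays are invented purely for illustration, and numpy is assumed to be available.

```python
import numpy as np

# Independent samples: measurements on two different groups of subjects.
group_a = np.array([72.0, 68.0, 75.0, 80.0, 66.0])
group_b = np.array([70.0, 65.0, 71.0, 74.0, 69.0])
diff_of_means = group_a.mean() - group_b.mean()    # X-bar minus Y-bar

# Dependent (paired) samples: before/after measurements on the same subjects.
before = np.array([82.0, 90.5, 77.0, 85.0, 79.5])
after = np.array([80.0, 88.0, 74.5, 84.0, 76.0])
d = before - after                                 # d_i = X_i - Y_i for each subject
mean_of_diffs = d.mean()                           # d-bar, used in the paired case

print(f"difference of means: {diff_of_means:.3f}")
print(f"mean of differences: {mean_of_diffs:.3f}")
```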
Suppose we want to compare the means of two normal distributions. Consider two variables \(\small{X}\) and \(\small{Y}\) following the normal distributions \(\small{N(\mu_{X}, \sigma_{X}) }\) and \(\small{N(\mu_{Y}, \sigma_{Y}) }\) respectively. Let \(\small{X_1,X_2,X_3,\ldots,X_n }\) and \(\small{Y_1,Y_2,Y_3,\ldots,Y_m }\) be two independent random samples of sizes \(\small{n}\) and \(\small{m}\) drawn from these two distributions. Now, depending on the information available on the population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\), we consider the following three different cases:
Suppose we assume that \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are known. Since the random samples are independent, the respective sample means \(\small{\overline{X}}\) and \(\small{\overline{Y}}\) follow the normal distributions \(\small{N(\mu_X, \dfrac{\sigma_X^2}{n})}\) and \(\small{N(\mu_Y, \dfrac{\sigma_Y^2}{m})}\). From statistical theory, we know that the distribution of the difference \(\small{\overline{X} - \overline{Y} }\) is a Gaussian with mean \(\small{\mu_X - \mu_Y }\) and combined variance \(~~\small{\sigma_{W}^2 = \dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}\). Standardizing this difference, we can write \(~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} }~~\), which follows a unit normal distribution $N(0,1)$. From the definition of confidence interval, we can define a \(\small{100(1-\alpha)\%}\) confidence interval for the difference in population means \(\small{\mu_X - \mu_Y}\) as follows: \(\small{[ (\overline{X} - \overline{Y}) - Z_{1 - \frac{\alpha}{2}}\sigma_W,~~~(\overline{X} - \overline{Y}) + Z_{1 - \frac{\alpha}{2}}\sigma_W ]~~~~~~~~~or~~~~~~~(\overline{X}-\overline{Y}) \pm Z_{1 - \frac{\alpha}{2}} \sqrt{ \dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m} } }\)
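A small helper for this known-variance interval is sketched below; the function name, the argument order, and the numbers in the example call are my own choices for illustration, with numpy and scipy assumed to be available.

```python
import numpy as np
from scipy.stats import norm

def z_confidence_interval(xbar, ybar, var_x, var_y, n, m, alpha=0.05):
    """100(1 - alpha)% interval for mu_X - mu_Y when sigma_X^2 and sigma_Y^2 are known."""
    se = np.sqrt(var_x / n + var_y / m)        # sigma_W, the combined standard error
    z = norm.ppf(1.0 - alpha / 2.0)            # Z_{1 - alpha/2}
    centre = xbar - ybar
    return centre - z * se, centre + z * se

# Example with assumed values: known variances 16 and 25, sample sizes 40 and 35.
print(z_confidence_interval(52.3, 49.8, 16.0, 25.0, 40, 35))
```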
As the sample sizes increase, the sample variances $S_X^2$ and $S_Y^2$ approach the population variances $\sigma_X^2$ and $\sigma_Y^2$. Therefore, for large sample sizes (typically greater than 30), when the population variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are unknown, we can replace them with the corresponding sample estimates \(\small{S_X^2}\) and \(\small{S_Y^2}\) as an approximation. In this case, \(~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} ~~}\) approximately follows a unit normal distribution $N(0,1)$. We write a \(\small{100(1-\alpha)\%}\) confidence interval for the difference in means \(\small{\mu_X - \mu_Y}\) as, \(~~~~~~~~~~~~~~~~~~\small{(\overline{X}-\overline{Y}) \pm Z_{1 - \frac{\alpha}{2}}\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}} } \)
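The sketch below illustrates this large-sample substitution on two simulated samples; the seed, means, and sample sizes are arbitrary choices made only to show the computation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=52.0, scale=4.0, size=60)   # simulated large sample of X
y = rng.normal(loc=50.0, scale=5.0, size=55)   # simulated large sample of Y

# Standard error computed from the sample variances S_X^2 and S_Y^2.
se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
z = norm.ppf(0.975)                            # Z_{1 - alpha/2} for alpha = 0.05
centre = x.mean() - y.mean()
print(centre - z * se, centre + z * se)        # approximate 95% interval
```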
Suppose we can afford to make the assumption that the two independent samples are drawn from normal distributions whose variances \(\small{\sigma_X^2}\) and \(\small{\sigma_Y^2}\) are equal, even though we do not know their values. Under this assumption of equal population variances, the distribution followed by the difference in the observed sample means can be derived as follows.
Since the population variances are assumed to be equal, we let \(\small{\sigma_X^2 = \sigma_Y^2 = \sigma^2 }\). We already learnt that in this case,
\(~~~~~~~~~~~~\small{Z = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma^2}{n} + \dfrac{\sigma^2}{m} }} }~~~~\) is a unit normal variable.
If the variables X and Y follow normal distributions, then we know from the properties of the chi-square distribution that
\(~~~~~~~~~~~~~~~\small{\dfrac{(n-1)S_X^2}{\sigma^2}}~~\) follows \(~~\small{\chi^2(n-1)}~~~\) and \(~~\small{\dfrac{(m-1)S_Y^2}{\sigma^2}}~~\) follows \(~~\small{\chi^2(m-1)}~~~\).
The chi-square distribution has the property that the sum of two independent chi-square variables with \(\small{r_1}\) and \(\small{r_2}\) degrees of freedom follows \(~~\small{\chi^2(r_1 + r_2)}\). Then the sum U of the above two independent variables, defined by,
\(\small{U = \dfrac{(n-1)S_X^2}{\sigma^2} + \dfrac{(m-1)S_Y^2}{\sigma^2}} ~~\) follows \(~~\small{\chi^2(n+m-2) }\)
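Before moving on, here is a quick Monte Carlo check of this fact; the sample sizes, the common \(\small{\sigma}\), and the number of replications are arbitrary choices made purely for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n, m, sigma = 8, 11, 2.0
reps = 100_000

x = rng.normal(0.0, sigma, size=(reps, n))
y = rng.normal(0.0, sigma, size=(reps, m))
u = ((n - 1) * x.var(axis=1, ddof=1) + (m - 1) * y.var(axis=1, ddof=1)) / sigma**2

# The simulated mean and variance of U should be close to the chi-square
# values n + m - 2 and 2(n + m - 2).
print(u.mean(), chi2.mean(df=n + m - 2))
print(u.var(), chi2.var(df=n + m - 2))
```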
A random variable T is defined in terms of the Z and U variables mentioned above as,
\(~~~~~~~\small{T = \dfrac{Z}{ \sqrt{\dfrac{U}{n+m-2}} } }~~~~\) that follows a t distribution with (n+m-2) degrees of freedom.
Substituting the expressions for U and Z into the above expression of the t variable, we get,
\(\small{ T = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\left[\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\right] \left[\dfrac{1}{n} + \dfrac{1}{m}\right] } } ~~ = ~~ \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{S_p \sqrt{\left[\dfrac{1}{n} + \dfrac{1}{m}\right]}} }\)
where the pooled standard deviation \(\small{S_p}\) is defined as,
\(~~~\small{S_p = \sqrt{\dfrac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2} } }\)
Once we have this expression for the T variable following a t distribution with n+m-2 degrees of freedom, we can identify a value \(\small{t = t_{\alpha/2}}\) for given sample sizes n and m such that the probability (i.e., the area under the symmetric density curve) above \(\small{t_{\alpha/2}}\) and below \(\small{-t_{\alpha/2}}\) sums to \(\small{\alpha }\). We write this as,
\(\small{P(~~-t_{\alpha/2}(n+m-2)~~ \leq ~~T ~~\leq~~ t_{\alpha/2}(n+m-2)~~)~~ = ~~1 - \alpha }\)
Substituting the expression for T into the above equation and solving the inequality for \(\small{\mu_X - \mu_Y }\), we get,
\(\small{ P\left(~\overline{X}-\overline{Y} - t_{\alpha/2}(n+m-2)\,S_p\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}~~\leq~~\mu_X-\mu_Y~~\leq~~ \overline{X}-\overline{Y} + t_{\alpha/2}(n+m-2)\,S_p\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}~\right) = 1 - \alpha }\)
We thus arrive at the following result: a \(\small{100(1-\alpha)\%}\) confidence interval for \(\small{\mu_X - \mu_Y}\) under the equal-variance assumption is given by, \(~~~\small{(\overline{X}-\overline{Y}) \pm t_{\alpha/2}(n+m-2)\,S_p\sqrt{\dfrac{1}{n} + \dfrac{1}{m}} }\)
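A numerical sketch of this pooled-variance interval follows; the two samples are simulated, and the seed, means, and sizes are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
x = rng.normal(loc=75.0, scale=6.0, size=12)   # small sample from X
y = rng.normal(loc=70.0, scale=6.0, size=10)   # small sample from Y, same spread
n, m = len(x), len(y)

# Pooled standard deviation S_p.
sp = np.sqrt(((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2))
tcrit = t.ppf(0.975, df=n + m - 2)             # t_{alpha/2}(n + m - 2), alpha = 0.05
centre = x.mean() - y.mean()
half_width = tcrit * sp * np.sqrt(1.0 / n + 1.0 / m)
print(centre - half_width, centre + half_width)
```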
When the two population variances are unknown and unequal, the quantity
\(\small{W = \dfrac{\overline{X} - \overline{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}} }\) does not follow a unit normal distribution when n and m are small.
Welch proposed a Student's t-distribution as an approximate distribution for the above statistic W, with a modified number of degrees of freedom. This modified degrees of freedom lies between the minimum possible value, given by the smaller of n-1 and m-1, and the maximum possible value of n+m-2.
The Welch-Satterthwaite correction gives an expression for the modified degrees of freedom as,
\(~~~~~~~~~~~~~~~~\small{r = \dfrac{\left(\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}\right)^2} {\dfrac{1}{n-1}\left(\dfrac{S_X^2}{n}\right)^2 + \dfrac{1}{m-1} \left(\dfrac{S_Y^2}{m}\right)^2 } }\)
The statistic W defined above approximately follows a t-distribution with r degrees of freedom.
For the W statistic defined above, an approximate \(\small{100(1-\alpha) }\) percent confidence interval
for \(\small{\mu_X - \mu_Y }\) is given by, \(~~~\small{(\overline{X}-\overline{Y}) \pm t_{\alpha/2}(r)\sqrt{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}} }\)
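The sketch below computes this Welch interval, including the Welch-Satterthwaite degrees of freedom; the simulated samples and their parameters are invented for illustration only.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
x = rng.normal(loc=75.0, scale=4.0, size=12)   # small sample, smaller spread
y = rng.normal(loc=70.0, scale=9.0, size=9)    # small sample, larger spread
n, m = len(x), len(y)

vx, vy = x.var(ddof=1) / n, y.var(ddof=1) / m  # S_X^2/n and S_Y^2/m
r = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))   # Welch-Satterthwaite df
tcrit = t.ppf(0.975, df=r)                     # t_{alpha/2}(r), alpha = 0.05
centre = x.mean() - y.mean()
half_width = tcrit * np.sqrt(vx + vy)
print(centre - half_width, centre + half_width)
```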
In some experiments, the two sets of measurements X and Y are taken on the same subjects under different conditions. As mentioned at the beginning of the section, this constitutes a dependent data set. Since X and Y are not independent, we cannot apply the Z or t distributions based on $\overline{X} - \overline{Y}$ to compute the confidence intervals. We have to take the corresponding individual observations \(\small{(X_i, Y_i)}\) as pairs of values in order to arrive at a statistic.
Let \(\small{(X_1,Y_1), (X_2, Y_2), \ldots, (X_n,Y_n) }\) be the pairs of n observations.
From this, we can compute the individual differences given by \(\small{d_1=X_1 - Y_1,~~d_2=X_2-Y_2,\ldots,d_n=X_n-Y_n}\) and proceed further to get their sample mean \(\small{\overline{d} }\) and sample standard deviation \(\small{S_d }\).
Since \(\small{X}\) and \(\small{Y}\) follow normal distributions, each difference \(\small{d_i = X_i-Y_i }\) must follow
a normal distribution \(\small{N(\mu_d, \sigma_d) }\), where \(\small{\mu_d = \mu_X - \mu_Y }\).
The statistic defined by,
\(~~~~~~~~~\small{T = \dfrac{\overline{d} - \mu_d}{\left(\dfrac{S_d}{\sqrt{n}}\right)} ~~~~ }\) has a t distribution with n-1 degrees of freedom
The \(\small{100(1-\alpha) }\) percent confidence interval for the mean difference \(\small{\mu_d = \mu_X - \mu_Y }\) is given by, \(~~~\small{\overline{d} \pm t_{\alpha/2}(n-1)\dfrac{S_d}{\sqrt{n}} }\)
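A short sketch of this paired interval is given below; the before/after vectors are invented data used only to show the computation.

```python
import numpy as np
from scipy.stats import t

before = np.array([82.0, 90.5, 77.0, 85.0, 79.5, 88.0, 73.5, 91.0])
after = np.array([80.0, 88.0, 74.5, 84.0, 76.0, 85.5, 72.0, 88.5])
d = before - after                             # d_i = X_i - Y_i for each subject
n = len(d)

tcrit = t.ppf(0.975, df=n - 1)                 # t_{alpha/2}(n - 1), alpha = 0.05
half_width = tcrit * d.std(ddof=1) / np.sqrt(n)    # t * S_d / sqrt(n)
print(d.mean() - half_width, d.mean() + half_width)
```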