Biostatistics with R

Two sample Z test

This test is used to compare two groups (populations) using the random samples drawn from them. We compare the means of two populations which are normally distributed with known variances . This is the condition for the test. Even if the population vriances are not available, if the sample sizes are large (n > 30), the test can be applied by replacing population variances with estimated sample variances.


Let X be a data set of n random samples from a Gaussian of known standard deviation \(\small{\sigma_X }\). Similarly, let Y be a data set of m random samples from a Gaussian distributio with known standard deviation \(\small{\sigma_Y }\). The two populations have means \(\small{\mu_X }\) and \(\small{\mu_Y }\) respectively.


We have leant that the distribution of difference between means of two Gaussian distributions \(\small{N(\mu_X, \sigma_X )}\) and \(\small{N(\mu_Y, \sigma_Y })\) is also a normal distribution with a mean \(\small{\mu_X - \mu_Y }\) and the combined standard deviation \(\small{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}}\). We can construct the Z variable for this distribution as,

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} = N(0,1) }\)


We can use this Z variable as the test statistic for the two sample Z test. The null hypothesis can be tested for three possibilities: \(\small{\mu_X - \mu_Y = 0,~~~~\mu_X - \mu_Y > 0~~and~~~\mu_X - \mu_Y < 0 }\).

With \(\small{\mu_X - \mu_Y = 0}\), the Z statistic for the test is written as,

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y})}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} = N(0,1) }\)


Under the null hypothesis, this Z statistics follows a unit normal distribution. We can test a null hypothesis for a given significance level of \(\small{\alpha}\), using the following steps of hypothesis testing.

  • We first compute the sample means \(\small{\overline{X}}\) and \(\small{\overline{Y}}\) from the data.

  • Knowing the value of standard deviations \(\small{\sigma_X}\) and \(\small{\sigma_Y}\) of the two populations, we compute the value of Z statistic using above expression.

  • The statistical significance (also called "p-value") of this data is then obtained by computing the probability \(\small{P(\gt Z) }\) or \(\small{P(\lt -Z })\) from the unit normal distribution. Under the null hypothesis, the p-value represents the probability of getting the observed statistic Z.

  • If the p-value is either smaller than a pre-decided value \(\small{\alpha}\) or the observed Z statistic is outside a given range (\(\small{-Z_0 \leq Z \leq Z_0) }\), we reject the null hypothesis and accept the alternate hypothesis.

  • Here, \(\small{Z_0}\) is the value of statistic above which the area under the unit normal curve is \(\small{\alpha}\).

  • We can also reject the null hypothesis if the computed Z statistic for th data is outside the \(\small{(1-\alpha)100\%}\) confidence interval on the population mean.



For the given problem in hand, we set our null hypothesis \(\small{H_0}\)and the alternate hypothesis \(\small{H_1}\) in one of the following three ways:

1. The population mean \(\small{\mu_X}\) is equal to \(\small{\mu_Y}\) . A two sided hypothesis test.
\(~~~~~~~~~~~~~~~~~~~\small{H_0 : \mu_X - \mu_Y = 0,~~or~~\mu_X = \mu_Y} \)
\(~~~~~~~~~~~~~~~~~~~\small{H_A : \mu_X - \mu_Y \neq 0~~or~~\mu_X \neq \mu_Y} \)

The null hypothesis will be rejected in favour of alternate hypothesis for smaller or larger values of Z statistic. Accordingly, the rejection region for a given significance level of \(\small{\alpha}\) is given by, \(~~~~\small{Z \gt Z_{1-\alpha/2} ~~~~}\) or \(~~~~\small{Z \lt -Z_{1-\alpha/2} }\)


2. The population mean \(\small{\mu_X }\) is greater than or equal to \(\small{\mu_Y}\). A one sided hypothesis test.
\(~~~~~~~~~~~~~~~~~~~\small{H_0 : \mu_X - \mu_Y \geq 0~~or~~\mu_X \geq \mu_Y} \)
\(~~~~~~~~~~~~~~~~~~~\small{H_A : \mu_X - \mu_Y \lt 0 ~~ or ~~ \mu_X \lt \mu_Y} \)

The null hypothesis will be rejected in favour of alternative hypothesis for smaller and smaller(ie., more negative) values of Z statistic. Accordingly, the rejection region for a given significance level of \(\small{\alpha}\) is given by \(~~~~\small{Z \lt -Z_{1-\alpha} }\)

3. The population mean \(\small{\mu_X}\) is less than or equal to \(\small{\mu_Y}\). A one sided hypothesis test.
\(~~~~~~~~~~~~~~~~~~~\small{H_0 : \mu_X - \mu_Y \leq 0~~or~~ \mu_X \leq \mu_Y} \)
\(~~~~~~~~~~~~~~~~~~~\small{H_A : \mu_X - \mu_Y \gt 0~~or~~\mu_X \gt \mu_Y} \)

The null hypothesis will be rejected in favour of alternative hypothesis for larger and larger(positive) values of Z statistic. Accordingly, the rejection region for a given significance level of \(\small{\alpha}\) is given by \(~~~~\small{Z \gt Z_{1-\alpha} }\)



Example-1 : Two sided hypothesis test

In order to determine whether two varieties of promagrnates significantly differ in their mean weights, 70 random samples were drawn from each variety and their mean weights measured. The mean weight of variety-A was measured to be 220.6 gram with standard deviation of 24.6 grams, while the mean weight of varity-B was 232.8 gram with a standard deviation of 27.8 gram. Test whether the mean weights are different to a significance level of 0.05.


We will assume that the weight of the two groups of pomgrnades follow Norml distributions. Since the sample sizes of 70 is greater than 30, we can use two sample Z test for comparing the population means of the weight from two groups.

Since we are testing whether the means are significantly different, the null and the alternate hypothesis are stated as follows:
\(~~~~~~~~~~~\small{H_0 : \mu_X = \mu_Y } \)
\(~~~~~~~~~~~\small{H_A : \mu_X \neq \mu_Y}\)

Under null hypothesis, the difference in the sample means follow a unit normal distribution givn by,
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(\overline{X} - \overline{Y})}{\sqrt{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}} = N(0,1) }\)

Substituting the values \(~~~\small{\mu_X = 220.6,~~\mu_Y=232.8,~~~\sigma_X=24.6,~~~\sigma_Y=27.8~~}\)into the above expression, the Z statistic is computed as,

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\small{ Z = \dfrac{(220.6 - 232.8)}{\sqrt{\dfrac{24.6^2}{70} + \dfrac{27.8^2}{70}}} = -2.75 }\)


Testing the null hypothesis using rejection regions:

We have taken \(\small{\alpha =0.05}\) to be the probability for rejecting the null hypothesis. Since the rejection can occur due to sufficiently small as well as large values of the test statistic, the rejection probability \(\small{\alpha =0.05}\) is divided equally between these two areas to give \(\small{\alpha/2 =0.025}\).

What is the Z value for which the area above Z or area below -Z under the unit Gaussian is equal to \(\small{\alpha/2 =0.025}\)?. From the Gaussian table, we read this to be approximately 1.96.

We reject null hypothesis if the computed test statistic Z is either greater than 1.96 or less than -1.96. See the figure below:

In our case, since the computed Z value of -2.75 is in the rejection region, the two sided null hypothesis is rejected to a significance level of 0.05.




Testing null hypothesis by computing the p-value for the observation:

If the null hypothesis is true, what is the probability of getting the computed Z statistic? ("p-value")

For the computed Z value of -2.75, the p-value is obtained from the Gaussian table to be \(\small{p = 0.00298 }\). This is the area under the curve to the right of \(\small{Z = 2.75 }\) or to the left of \(\small{Z =-2.75 }\).

Since the p-value \(\small{p=0.00298 }\) of the observed test statistic is less than \(\small{\alpha/2 = 0.025}\), we reject the null hypothesis to a significance level of 0.05.




Testing the null hypothesis by computing the confidence interval:

For a significance level \(\small{\alpha = 0.05}\), the $95\%$ two sided confidence interval(CI) for the population mean is given by,
\(~~~~~~~~~~~~\small{CI~=~(\overline{X}-\overline{Y}) \pm Z_{1 - \frac{\alpha}{2}} \sqrt{ \dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m} } }\).
Substituting \(\small{\overline{X}=220.6,~~\overline{Y}=232.8,~~\sigma_X=24.6,~~~\sigma_Y=27.8,~~n=m=12}\) from the data and \(~\small{Z_{0.975}~=~1.96}~\) from Gaussian table, we get a $95\%$ confidence interval of \(\small{CI = (220.6 - 232.8) \pm 1.96*\sqrt{\dfrac{24.6^2}{70} + \dfrac{27.8^2}{70}}= -12.2\pm 8.7 = (-20.9, -3.5)}\)
Since this $95\%$ confidence interval \(\small{(-20.9, -3.5)}\) does not contain zero, we reject the null hypothesis and say that the two population means are not equal to zero.




Example-2 : One sided hypothesis test

For the same data in Eample-1, suppose we want to test whether the mean weight of variety-A is less than the mean weight of variety-B. We do this by rejecting a null hypothesis when the mean weight of variety-A is greater than or equal to the mean weight of variety-B. Accordingly, we can set up the null and the alternate hypothesis as follows:
\(~~~~~~~~~~~\small{H_0 : \mu_X \geq \mu_Y } \)
\(~~~~~~~~~~~\small{H_A : \mu_X \lt \mu_Y}\)

Testing the null hypothesis using rejection regions:

Though the null hypothesis is true for the infinite number of values of \(\small{\mu_X \geq \mu_Y}\), it is tested at only one value \(\small{\mu_X = \mu_Y }\). If it is rejected at this value, it will be rejected at any value when \(\small{\mu_X \lt \mu_Y}\) .

From Example-1, the computed value of statistics is \(\small{Z = -2.75 }\) and given \(\small{\alpha = 0.05 }\). Since this is a one sided test, the rejection region lies to the left of \(\small{-Z_{1-\alpha} = -Z_{0.95 } \approx -1.645 }\). Since the computed Z value of -2.75 lies in the rejection region, we reject the null hypothesis to conclude that the mean weight of variety-A is less than the mean weight of variety-B. The rejection regions are marked below:

Testing null hypothesis by computing the p-value for the observation:

If the null hypothesis is true, what is the probability of getting the computed Z statistic? ("p-value")

For the computed Z value of -2.75, the p-value is obtained from the Gaussian table to be \(\small{p = 0.00298 }\). This is the area under the curve to the left of \(\small{Z =-2.75 }\).

Since the p-value \(\small{p=0.00298 }\) of the observed test statistic is less than \(\small{\alpha/2 = 0.05}\), we reject the null hypothesis to a significance level of 0.05 in an one sided test.

R-scripts

The R script given below performs the one sample Z test. Given a data set x that is assumed to be randomly drawn from a Gaussian distribution of population mean mu and standard deviation sigma, the function returns the conclusions of the test along with computed statistic values.


The function is defined as,

       two_sample_Z_test(x, y, sigma_x, sigma_y, alpha, null) 

where

       x, y  = data vectors of two observations

       sigma_x, sigma_y  = population standard deviations
 
       alpha  = significane level

       null   = string value indicating type of null hypothesis.
      
  Possible values of variable null are:   "equal", "less_than_or_equal", "more_than_or_equal" 

The function returns a vector with two numbers :  (p value, Z statistics) .


################################################### ## Two sample Z test ## The null hypothesis is tested at the eqiality of two population means. ## x,y = vector of data samples, which are numbers ## sigma_x, sigma_Y = population standard deviations ## alpha = significance level for testing ## null = string with three possible values "equal", "greater_than_or_equal", "less_than_or_equal" for indicating whether the test is one sided or two sided. two_sample_Z_test = function(x,y, sigma_x, sigma_y, alpha, null ){ ## compute sample means xbar = mean(x) ybar = mean(y) ## get the sample size n = length(x) m = length(y) ## compute the Z statistic Z_statistic = (xbar - ybar)/sqrt( (sigma_x^2/n)+(sigma_y^2/m) ) ## compute the p-value pvalue = 1.0 if(Z_statistic > 0) pvalue = 1 - pnorm(Z_statistic) if(Z_statistic < 0) pvalue = pnorm(Z_statistic) if(Z_statistic == 0) pvalue = 0.5 ## Perform the statitical test by comaring the computed Z statistic with the ## critical value for various cases ### Case 1 : Null hypothesis that populatin mean equals a given value if(null == "equal") { Z_critical = qnorm(1 - (alpha/2)) print("################################################################") print("Two sample Z test : ") print(paste("sample size = ", n,m)) if( (Z_statistic > Z_critical) | (Z_statistic < -Z_critical) ) { print(paste("Null hypothesis is rejected at the level of significance ", alpha/2)) print(paste("Population mean of x not equal to the population mean of y ")) print(paste("p value for the test = ", round(pvalue, digits=5))) print(paste("Value of Z statistic = ", round(Z_statistic, digits=2))) print(paste("Critical value of the test = ", round(Z_critical, digits=2))) resultVec = c(round(pvalue, digits=5), round(Z_statistic, digits=2)) } if( (Z_statistic < Z_critical) & (Z_statistic > -Z_critical) ) { print(paste("Null hypothesis is accepted at the level of significance ", alpha/2)) print(paste("Population mean of x equal to population mean of y ")) print(paste("p value for the test = ", round(pvalue, digits=5))) print(paste("Value of Z statistic = ", round(Z_statistic, digits=2))) print(paste("Critical value of the test = ", round(Z_critical, digits=2))) resultVec = c(round(pvalue, digits=5), round(Z_statistic, digits=2)) } } ##### Case 2 : Null hypothesis that population mean is less than or equal to a given value if(null == "less_than_or_equal") { Z_critical = qnorm(1 - alpha) print("################################################################") print("Two sample Z test : ") print(paste("sample sizes = ", n,m)) if( Z_statistic > Z_critical ) { print(paste("Null hypothesis is rejected at the level of significance ", alpha)) print(paste("Population mean of x greater than population mean of y ")) print(paste("p value for the test = ", round(pvalue, digits=5))) print(paste("Value of Z statistic = ", round(Z_statistic, digits=2))) print(paste("Critical value of the test = ", round(Z_critical, digits=2))) resultVec = c(round(pvalue, digits=5), round(Z_statistic, digits=2)) } if( Z_statistic &leq Z_critical ) { print(paste("Null hypothesis is accepted at the level of significance ", alpha)) print(paste("p value for the test = ", round(pvalue, digits=5))) print(paste("Population mean of x not greater than population mean of y ")) print(paste("Value of Z statistic = ", round(Z_statistic, digits=2))) print(paste("Critical value of the test = ", round(Z_critical, digits=2))) resultVec = c(round(pvalue, digits=5), round(Z_statistic, digits=2)) } } ###### Case 3 : Null hypothesis that the population mean is greater than or equal to a given value. if(null == "greater_than_or_equal") { Z_critical = qnorm(1 - alpha) print("################################################################") print("Two sample Z test : ") print(paste("sample sizes = ", n,m)) if( Z_statistic < Z_critical ) { print(paste("Null hypothesis is rejected at the level of significance ", alpha)) print(paste("Population mean of x is less than population mean of y ")) print(paste("p value for the test = ", round(pvalue, digits=5))) print(paste("Value of Z statistic = ", round(Z_statistic, digits=2))) print(paste("Critical value of the test = ", -round(Z_critical, digits=2))) resultVec = c(round(pvalue, digits=5), round(Z_statistic, digits=2)) } if( Z_statistic &geq Z_critical ) { print(paste("Null hypothesis is accepted at the level of significance ", alpha)) print(paste("Population mean of x is not less than population mean of y ")) print(paste("p value for the test = ", round(pvalue, digits=5))) print(paste("Value of Z statistic = ", round(Z_statistic, digits=2))) print(paste("Critical value of the test = ", round(Z_critical, digits=2))) resultVec = c(round(pvalue, digits=5), round(Z_statistic, digits=2)) } } return(resultVec) } ## end of the function ###############------------------------------------------------ ## Perform a two sample Z test ## Use this if the two data sets have more than 30 data points. ## (For less than 30 data points, use two sample independent t test) x = c( 258.0, 271.5, 189.1, 216.5, 237.2, 222.0, 231.3, 181.7, 220.0, 179.3, 238.1, 217.7, 246.2, 241.5, 233.8, 222.3, 199.2, 167.9, 216.2, 240.4, 235.3, 187.0, 233.7, 214.7 174.6, 246.3, 185.7, 207.0, 244.3, 237.7, 245.2, 228.3, 201.8, 218.3, 242.7, 213.8, 231.9, 257.3, 208.4, 250.7, 198.3, 206.7, 259.7, 253.3, 200.3, 196.6, 210.6, 257.6, 173.5, 267.5, 167.2, 227.1, 172.1, 197.6, 256.9, 203.7, 195.1, 237.4, 210.2, 208.8, 218.0, 205.1, 241.1, 216.8, 223.6, 191.0, 225.9, 215.1, 233.1, 243.0) y = c( 221.0, 213.0, 199.3, 211.2, 225.2, 229.1, 253.9, 194.6, 243.0, 221.9, 230.9, 221.1, 206.7, 217.2, 215.8, 203.0, 234.0, 196.3, 235.8, 234.3, 244.7, 248.8, 200.5, 232.0, 233.3, 220.6, 289.2, 244.9, 230.8, 182.9, 199.3, 263.2, 220.6, 266.7, 258.0, 243.9, 178.1, 200.7, 270.2, 224.4, 222.4, 234.6, 296.7, 202.3, 277.9, 204.3, 221.1, 257.0, 243.4, 239.4, 230.0, 263.5, 241.3, 216.6, 227.9, 230.1, 230.5, 188.6, 289.3, 234.4, 267.5, 256.0, 246.5, 210.5, 270.6, 295.5, 195.8, 235.3, 245.4, 245.4) sigma_x = 24.6 sigma_y = 27.8 ## alpha value alpha = 0.05 ## call the function. "res" is a vector with p-vlue and Z value for the test. res = two_sample_Z_test(x, y, sigma_x, sigma_y, alpha, "greater_than_or_equal") print(res)


Executing the script in R prints the following lnes of test results:

[1] "################################################################" [1] "Two sample Z test : " [1] "sample sizes = 70 70" [1] "Null hypothesis is rejected at the level of significance 0.05" [1] "Population mean of x is less than population mean of y " [1] "p value for the test = 0.00249" [1] "Value of Z statistic = -2.81" [1] "Critical value of the test = -1.64" > [1] 0.00249 -2.81000