# set seed for the random number generator
set.seed(3456)
## generate 20 random deviates from a Gaussian of mean 10, sigma 2
X = rnorm(20, mean=10, sd=2)
## perform the KS test against the normal CDF with mean 10 and sigma 2
res = ks.test(X, "pnorm", mean=10, sd=2)
# print the results
print(res)

Executing the above script lines in R prints the results:

        Exact one-sample Kolmogorov-Smirnov test

data:  X
D = 0.15464, p-value = 0.6693
alternative hypothesis: two-sided

As expected, the p-value of the two-sided hypothesis test is 0.6693, which is high enough to accept the null hypothesis that the data is drawn from the specified Gaussian of mean 10 and standard deviation 2. Similar results are obtained if we repeat with different seeds.
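To see what the D statistic reported by ks.test actually measures, here is a small cross-check sketched in Python with NumPy and SciPy (assumed to be available; this is an illustration, not part of the R session above). The one-sample statistic is the largest vertical distance between the empirical CDF of the sample and the reference CDF, and SciPy's kstest reports the same quantity:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(3456)
x = np.sort(rng.normal(loc=10, scale=2, size=20))
n = len(x)

# reference CDF evaluated at the sorted sample points
cdf = norm.cdf(x, loc=10, scale=2)

# the ECDF jumps from (i-1)/n to i/n at the i-th sorted point, so the
# supremum distance is attained just before or just after a sample point
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
D = max(d_plus, d_minus)

# SciPy's one-sample KS test computes the same statistic
res = kstest(x, "norm", args=(10, 2))
print(D, res.statistic)
```

The hand-computed D and SciPy's statistic agree exactly; only the random deviates differ from the R run, since the two generators are unrelated even with the same seed.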
To check the other end, let us now generate 20 data points from a normal distribution with mean 10 and standard deviation 2, but compare them with a Gaussian of mean 8 and standard deviation 2. We expect the null hypothesis to be defeated:

set.seed(3456)
X = rnorm(20, mean=10, sd=2)
res = ks.test(X, "pnorm", mean=8, sd=2)
print(res)

The output from this script gives a p-value well below 0.05, as expected:

        Exact one-sample Kolmogorov-Smirnov test

data:  X
D = 0.47686, p-value = 0.000106
alternative hypothesis: two-sided

We will observe similar results for most simulations if we change the seed. But what do we do if the population mean and variance of the sample are not known for the comparison in the KS test?
As a first example, we generate 20 samples from a Gaussian of mean 24 and standard deviation 3. We then take a Z-transform of this data and perform the KS test by comparing the Z-transformed data with the unit normal PDF, as demonstrated in the following lines of script:

set.seed(1234)
X = rnorm(20, mean=24, sd=3)
Z = (X - mean(X))/sd(X)
res = ks.test(Z, "pnorm")
print(res)

Executing the above script in R generates the following printout on the screen:

        Exact one-sample Kolmogorov-Smirnov test

data:  Z
D = 0.14964, p-value = 0.7072
alternative hypothesis: two-sided

As we see in the above output, the p-value of 0.7072, being higher than 0.05, accepts the two-sided null hypothesis that the given data set is indeed from a Gaussian distribution. This is what we expect, since the data were generated from a Gaussian. Similar results are expected most of the time when we repeat the simulation with varying random seeds.
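The same Z-transform procedure can be sketched in Python with NumPy and SciPy (assumed available; a cross-check, not the R session above). The key detail is standardizing with the sample mean and the sample standard deviation, matching R's sd(), before comparing against the unit normal:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1234)
x = rng.normal(loc=24, scale=3, size=20)

# Z-transform: subtract the sample mean and divide by the sample standard
# deviation (ddof=1, the n-1 denominator, as in R's sd())
z = (x - x.mean()) / x.std(ddof=1)

# compare the standardized data with the unit normal CDF
res = kstest(z, "norm")
print(res.statistic, res.pvalue)
```

After the transform, z has sample mean 0 and sample standard deviation 1 by construction, which is exactly what makes the comparison with the unit normal meaningful.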
Next, let us see what happens when the data is not Gaussian. We generate 20 random deviates from a chi-square distribution with 3 degrees of freedom, take a Z-transform, and compare with the unit normal in a KS test:

set.seed(1234)
## 20 random deviates from a chi-square distribution with 3 degrees of freedom:
X = rchisq(20, 3)
# take a Z-transform
Z = (X - mean(X))/sd(X)
## compare with the unit Gaussian in a KS test
res = ks.test(Z, "pnorm")
print(res)

The above script generates a p-value of 0.751, which accepts the null at a significance level of 0.05, concluding that the Z-transform of the data is from a unit normal distribution, as shown in the following output:

        Exact one-sample Kolmogorov-Smirnov test

data:  Z
D = 0.14374, p-value = 0.751
alternative hypothesis: two-sided

If we repeat the simulation with various seeds, we will get a similar result most of the time.
Secondly, let us repeat the above simulation with a large sample size, say, 200. We will generate 200 data points from a chi-square distribution with 3 degrees of freedom and compare with a Gaussian in a KS test. See the script below:

set.seed(1234)
X = rchisq(200, 3)
Z = (X - mean(X))/sd(X)
res = ks.test(Z, "pnorm")
print(res)

The results are printed here:

        Asymptotic one-sample Kolmogorov-Smirnov test

data:  Z
D = 0.1389, p-value = 0.0008904
alternative hypothesis: two-sided

As expected, the null is rejected by the small p-value of 0.0008904 at a significance level of 0.05. Therefore, in a one-sample Kolmogorov-Smirnov test, a large sample size is required to decisively reject the null hypothesis.
Now let us perform a Kolmogorov-Smirnov test comparing samples from two distributions. We first generate two data sets of size 30, sampled from the same Gaussian distribution with population mean 30 and standard deviation 3, and perform a Kolmogorov-Smirnov test comparing the two samples under the null hypothesis that they are drawn from the same distribution, against a two-sided alternative:

X1 = rnorm(30, mean = 30, sd=3)
Y1 = rnorm(30, mean = 30, sd=3)
res1 = ks.test(X1, Y1)
print(res1)

The above code lines generate the following output:

        Exact two-sample Kolmogorov-Smirnov test

data:  X1 and Y1
D = 0.2, p-value = 0.5941
alternative hypothesis: two-sided

As expected, the p-value for the test is high compared to 0.05, accepting the null hypothesis that the two data sets are drawn from the same Gaussian.
At the other extreme, let us now generate two data sets drawn randomly from two Gaussians with different means and the same standard deviation and compare them in a KS test. We expect them to be identified as two data sets drawn from different parent distributions. Data set X1 is from a Gaussian with mean=30, sd=3 and data set Y1 is from a Gaussian with mean=26, sd=3:

X1 = rnorm(30, mean = 30, sd=3)
Y1 = rnorm(30, mean = 26, sd=3)
res1 = ks.test(X1, Y1)
print(res1)

The above code results in the output reproduced below:

        Exact two-sample Kolmogorov-Smirnov test

data:  X1 and Y1
D = 0.8, p-value = 8.467e-10
alternative hypothesis: two-sided

As expected, the comparison of data sets from two different distributions results in a very small p-value compared to 0.05, which defeats the null hypothesis that they are from the same distribution.
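As in the one-sample case, the two-sample D statistic can be computed by hand: it is the largest gap between the two empirical CDFs, evaluated over the pooled sample. A sketch in Python with NumPy and SciPy (assumed available; the seed and data here are hypothetical, not the R run above):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
x = rng.normal(loc=30, scale=3, size=30)
y = rng.normal(loc=26, scale=3, size=30)

# right-continuous ECDFs of x and y evaluated at every pooled sample point
pooled = np.sort(np.concatenate([x, y]))
ecdf_x = np.searchsorted(np.sort(x), pooled, side="right") / len(x)
ecdf_y = np.searchsorted(np.sort(y), pooled, side="right") / len(y)

# two-sample KS statistic: largest vertical gap between the two ECDFs
D = np.max(np.abs(ecdf_x - ecdf_y))

# SciPy's two-sample KS test reports the same statistic
res = ks_2samp(x, y)
print(D, res.statistic, res.pvalue)
```

Evaluating both ECDFs at the pooled points is sufficient because the gap between two step functions can only attain its maximum at a jump of one of them.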