Basic Statistics with R

Kolmogorov-Smirnov test for normality

The Kolmogorov-Smirnov test (KS test) is a non-parametric test that can compare two continuous distributions. In the one-sample case, it tests whether the given data come from a specified reference distribution. In the two-sample case, it compares the underlying distributions of two data sets. The KS test compares cumulative distribution functions (CDFs) rather than probability density functions.

We can use the one-sample KS test to test the normality of the underlying distribution of observed data.

Suppose we have an ordered data set $\{x_1 \leq x_2 \leq x_3 \leq \ldots \leq x_n \}$ sampled from a population whose underlying and unknown cumulative distribution function is $P_n(x)$.

The Kolmogorov-Smirnov test is used to test whether the cumulative distribution function $P_n(x)$ of an ordered sample of $n$ data points is equal to a given cumulative distribution function $P_0(x)$.

The null and alternate hypotheses of this test are given by,

$~~~~~~~~~~~H_0 : P_n(x) = P_0(x)~~$ and $~~H_A : P_n(x) \neq P_0(x)$

For a given ordered sample data set, the empirical cumulative probability $F_n(x)$ at any possible value $x$ is the fraction obtained by dividing the cumulative frequency up to $x$ by the total frequency of the data. Alternatively, we can sort the data in ascending order and obtain the cumulative probability at each $x_i$ by dividing the rank of $x_i$ by the total number of data points.
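As a quick sketch of this computation (the five-point vector `x` below is a made-up example), we can sort the data and divide each rank by the sample size; the built-in ecdf() function in R returns the same step function:

```r
# a small made-up sample
x <- c(4.2, 3.1, 5.0, 2.8, 4.7)
n <- length(x)

# sort in ascending order; the empirical CDF at the i-th ordered
# point is i/n
x_sorted <- sort(x)
F_n <- seq_len(n) / n

# the built-in ecdf() returns the same step function
Fn_fun <- ecdf(x)
print(all.equal(Fn_fun(x_sorted), F_n))   # TRUE
```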

As the sample size $n$ grows, the empirical cumulative probability $F_n(x)$ approaches the underlying cumulative probability $P_n(x)$.

Let $F(x)$ be the cumulative probability computed at the observed data points from the reference CDF $P_0(x)$.

For every observed data point x, the difference $F_n(x) - F(x)$ between the empirical CDF and the reference CDF is computed.

From these differences, the supremum or the maximum of absolute differences is computed. It is represented as,

$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\sup_x|F_n(x) - F(x)|$

Figure 3 below depicts, as an example, the maximum difference at a particular data point between the empirical CDF and the reference CDF of the unit normal distribution:



Under the null hypothesis $H_0 : P_n(x) = P_0(x)$, the largest of the absolute differences between the empirical CDF and the reference CDF tends to zero:

$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\sup_x|F_n(x) - F(x)| \rightarrow 0$

There is a theorem (which we state here without proof) that if the function $F(x)$ is continuous, then the distribution of $\sup_x|F_n(x) - F(x)|$ does not depend on $F(x)$. This means we can use this hypothesis test with any continuous reference CDF $F(x)$.
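This distribution-free property can be checked by simulation (a sketch; the seed, sample size and number of replicates are arbitrary choices): under the null, the statistic has the same distribution whether the data come from a normal or an exponential distribution:

```r
set.seed(1)   # arbitrary seed

# draw the KS statistic repeatedly under the null for a given
# generator / reference-CDF pair
sim_D <- function(rgen, pref, n = 50, reps = 2000) {
  replicate(reps, unname(ks.test(rgen(n), pref)$statistic))
}

D_norm <- sim_D(rnorm, "pnorm")   # normal data vs normal CDF
D_exp  <- sim_D(rexp,  "pexp")    # exponential data vs exponential CDF

# the two simulated distributions of the statistic are essentially
# identical
print(c(mean(D_norm), mean(D_exp)))
```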

The variable defined by

$~~~~~~~~~~~~~~~~~~D_n = \sup_x|F_n(x) - F(x)|$

is used as the test statistic for the Kolmogorov-Smirnov test.
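The statistic can be computed by hand and checked against ks.test() (a sketch; the seed and sample size are arbitrary). Since $F_n$ is a step function, the supremum is attained either at a jump or just before one, so both one-sided differences must be examined:

```r
set.seed(42)                 # arbitrary seed
x <- sort(rnorm(25))         # 25 standard-normal deviates, ordered
n <- length(x)
F_ref <- pnorm(x)            # reference CDF evaluated at the data

# differences at each jump (right side) and just before each
# jump (left side)
D_plus  <- max(seq_len(n) / n - F_ref)
D_minus <- max(F_ref - (seq_len(n) - 1) / n)
D_n <- max(D_plus, D_minus)

# ks.test() reports the same statistic
print(all.equal(unname(ks.test(x, "pnorm")$statistic), D_n))   # TRUE
```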

For sufficiently large $n$, the statistic $\sqrt{n}D_n = \sqrt{n} \times \sup_x|F_n(x) - F(x)|$ follows the Kolmogorov distribution. The cumulative distribution function of the Kolmogorov distribution is given by,

$~~~~~~~~~~~~~~~~~~~~~~~~~~~G(t) = \dfrac{\sqrt{2\pi}}{t} \displaystyle{\sum_{k=1}^{\infty}{\Large{e}}^{-{(2k-1)^2 \pi^2/(8t^2)}}} ~=~ 1 - 2 \displaystyle \sum_{k=1}^{\infty} (-1)^{k-1}{\large{e}}^{-2k^2t^2}$
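The two series can be evaluated numerically to check that they agree (a sketch; the function names are made up, and the truncation point k_max = 100 is an arbitrary choice, far more terms than double precision needs at moderate $t$):

```r
# first form: theta-function series for the Kolmogorov CDF
G_theta <- function(t, k_max = 100) {
  k <- 1:k_max
  sqrt(2 * pi) / t * sum(exp(-(2 * k - 1)^2 * pi^2 / (8 * t^2)))
}

# second form: alternating exponential series
G_alt <- function(t, k_max = 100) {
  k <- 1:k_max
  1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * t^2))
}

# both forms agree; e.g. G(1) is approximately 0.73
print(c(G_theta(1), G_alt(1)))
```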

For a one-sample Kolmogorov-Smirnov test, let us denote the critical value of $D_n$ for a given significance level $\alpha$ as $D_{n,\alpha}$.

The critical values $D_{n,\alpha}$ are tabulated for various $n$ and $\alpha$ values in standard statistical tables.

For a given data set, we compute the empirical CDF $F_n(x)$ and the reference CDF $F(x)$ at the data points $x$, and from them the maximum absolute difference $D_n = \sup_x|F_n(x) - F(x)|$. We then look up $D_{n,\alpha}$ in the table for the given significance level $\alpha$.

The decision rule is given by,

$~~~~~~~~~~~~~~~~~~~~$ Accept null $H_0$ if $D_n \leq D_{n,\alpha}$

$~~~~~~~~~~~~~~~~~~~~$ Reject null $H_0$ if $D_n \gt D_{n,\alpha}$
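The decision rule can be sketched in R as follows (the function name is made up, and the critical value uses the common large-sample approximation $D_{n,\alpha} \approx \sqrt{-\ln(\alpha/2)/(2n)}$; for small $n$, exact tabulated values should be used instead):

```r
# decision rule sketch: compare the observed statistic D_n with an
# approximate critical value for sample size n and level alpha
ks_decision <- function(D_n, n, alpha = 0.05) {
  D_crit <- sqrt(-log(alpha / 2) / (2 * n))
  if (D_n <= D_crit) "accept H0" else "reject H0"
}

# for n = 50 and alpha = 0.05, the approximate critical value is ~0.192
print(ks_decision(0.10, 50))   # "accept H0"
print(ks_decision(0.25, 50))   # "reject H0"
```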


Kolmogorov-Smirnov test in R

In R, the stats library function ks.test() performs the one-sample and two-sample Kolmogorov-Smirnov tests. The one-sample test is performed under the null hypothesis that the given data set comes from the specified reference distribution. The function prints the D value and the p-value of the test.

In the R implementation of the KS test, we can compare the sample data to a Gaussian of known mean $\mu$ and standard deviation $\sigma$. The mean and standard deviation estimated from the data themselves should not be used in this test, since that invalidates the null distribution of the test statistic.

In the example below, we first generate 20 data points from a normal distribution with mean 10 and standard deviation 2. We then perform a KS test comparing the data to the CDF of the normal distribution with the exact mean 10 and sigma 2:

# set the seed for the random number generator
set.seed(3456)

## generate 20 random deviates from a Gaussian of mean 10, sigma 2
X = rnorm(20, mean=10, sd=2)

## perform the KS test against the normal CDF with mean 10 and sigma 2
res = ks.test(X, "pnorm", mean=10, sd=2)

# print the results
print(res)

Executing the above script lines in R prints the results:
Exact one-sample Kolmogorov-Smirnov test

data:  X
D = 0.15464, p-value = 0.6693
alternative hypothesis: two-sided

As expected, the p-value of the two-sided hypothesis test is 0.6693, which is high enough to accept the null hypothesis that the data are drawn from the specified Gaussian of mean 10 and standard deviation 2. Similar results are obtained if we repeat the run with different seeds.

To check the other end, let us now generate 20 data points from a normal distribution with mean 10 and standard deviation 2, but compare them with a Gaussian of mean 8 and standard deviation 2. We expect the null hypothesis to be rejected:

set.seed(3456)
X = rnorm(20, mean=10, sd=2)
res = ks.test(X, "pnorm", mean=8, sd=2)
print(res)

The output from this script gives a p-value well below 0.05, as expected:

Exact one-sample Kolmogorov-Smirnov test

data:  X
D = 0.47686, p-value = 0.000106
alternative hypothesis: two-sided

We will observe similar results for most simulations if we change the seed.

What do we do if we want to perform a KS test on sample data whose population mean and variance are not known?

As a first example, we generate 20 samples from a Gaussian of mean 24 and standard deviation 3. We then take a Z-transform of this data and perform the KS test by comparing the Z-transform with the unit normal CDF, as demonstrated in the following script:

set.seed(1234)
X = rnorm(20, mean=24, sd=3)
Z = (X - mean(X))/sd(X)
res = ks.test(Z, "pnorm")
print(res)

Executing the above script in R generates the following output:

Exact one-sample Kolmogorov-Smirnov test

data:  Z
D = 0.14964, p-value = 0.7072
alternative hypothesis: two-sided

As we see in the above output, the p-value of 0.7072 is well above 0.05, so we accept the two-sided null hypothesis that the given data set is indeed from a Gaussian distribution. This is expected, since we generated the data from a Gaussian. Similar results are expected most of the time when we repeat the simulation with varying random seeds.

However, when we compare data from a non-normal distribution to a normal CDF of unknown mean and standard deviation, the results of the KS test are reliable only when the sample size is large. This is demonstrated in the following two simulations:

First, we generate 20 data points (a small sample size) from a chi-square distribution with 3 degrees of freedom and compare the Z-transform of this data to the unit normal CDF in a KS test:

set.seed(1234)

## 20 random deviates from a chi-square distribution with 3 degrees of freedom:
X = rchisq(20, 3)

# take a Z-transform
Z = (X - mean(X))/sd(X)

## compare with the unit Gaussian in a KS test
res = ks.test(Z, "pnorm")
print(res)

The above script produces a p-value of 0.751, which accepts the null hypothesis at a significance level of 0.05 and concludes that the Z-transform of the data is from a unit normal distribution, even though the data were generated from a chi-square distribution:

Exact one-sample Kolmogorov-Smirnov test

data:  Z
D = 0.14374, p-value = 0.751
alternative hypothesis: two-sided

If we repeat the simulation with various seeds, we get a similar result most of the time.

Secondly, let us repeat the above simulation with a large sample size, say 200. We generate 200 data points from a chi-square distribution with 3 degrees of freedom and compare the Z-transform with a Gaussian in a KS test. See the script below:

set.seed(1234)
X = rchisq(200, 3)
Z = (X - mean(X))/sd(X)
res = ks.test(Z, "pnorm")
print(res)

The results are printed here:

Asymptotic one-sample Kolmogorov-Smirnov test

data:  Z
D = 0.1389, p-value = 0.0008904
alternative hypothesis: two-sided

As expected, the null is rejected by the small p-value of 0.0008904 at a significance level of 0.05.

Therefore, in a one-sample Kolmogorov-Smirnov test with parameters estimated from the data, a large sample size is required to decisively reject the null hypothesis.




Now let us perform a Kolmogorov-Smirnov test comparing samples from two distributions.

We first generate two data sets of size 30, sampled from the same Gaussian distribution with population mean 30 and standard deviation 3, and perform a Kolmogorov-Smirnov test comparing the two samples under the null hypothesis that they are drawn from the same distribution, against a two-sided alternative.

X1 = rnorm(30, mean = 30, sd=3)
Y1 = rnorm(30, mean = 30, sd=3)
res1 = ks.test(X1, Y1)
print(res1)

The above code lines generate the following output:

Exact two-sample Kolmogorov-Smirnov test

data:  X1 and Y1
D = 0.2, p-value = 0.5941
alternative hypothesis: two-sided

As expected, the p-value for the test is high compared to 0.05, so we accept the null hypothesis that the two data sets are drawn from the same Gaussian.

At the other extreme, let us now generate two data sets drawn randomly from two Gaussians with different means and the same standard deviation and compare them with a KS test. We expect them to be identified as two data sets drawn from different parent distributions. Data set X1 is from a Gaussian of mean=30, sd=3, and data set Y1 is from a Gaussian of mean=26, sd=3:

X1 = rnorm(30, mean = 30, sd=3)
Y1 = rnorm(30, mean = 26, sd=3)
res1 = ks.test(X1, Y1)
print(res1)

The above code results in the output reproduced below:

Exact two-sample Kolmogorov-Smirnov test

data:  X1 and Y1
D = 0.8, p-value = 8.467e-10
alternative hypothesis: two-sided

As expected, comparing data sets from two different distributions results in a p-value far smaller than 0.05, which rejects the null hypothesis that they come from the same distribution.