Basic Statistics with R

Tests of Normality - an overview

The families of parametric tests such as Z-tests, t-tests and ANOVA methods assume that the given data sets are random samples from Gaussian distributions. In reality, there is no guarantee that data from an experiment or a set of observations come from a Gaussian. It is therefore prudent to test the hypothesis that the data are sampled from a normal distribution. For this purpose, various tests of normality are employed.

We have already learnt that skewness measures the asymmetry of the distribution from which the random samples are drawn, and that it is zero for a Gaussian. It can therefore serve as a simple first check. However, a skewness near zero is expected for any distribution whose shape is symmetric about the mean. For example, random samples from the t distribution (with more than 3 degrees of freedom) will also have skewness close to 0, similar to the skewness of samples from the Gaussian. The Lorentzian (Cauchy) distribution is symmetric as well, although its moments are undefined, so its sample skewness is unstable. We need more decisive methods to test the normality of a data set.
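The limitation of skewness as a normality check can be seen numerically. The following is a minimal sketch, not a definitive recipe: `sample_skewness` is a hypothetical helper written here (base R has no skewness function), and the sample size, seed and degrees of freedom are arbitrary assumptions. It compares the moment-based sample skewness of Gaussian, t and exponential samples.

```r
# Hypothetical helper (not part of base R): moment-based sample skewness,
# i.e. the third central moment divided by the second central moment^(3/2).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)   # arbitrary seed, used only for reproducibility
n <- 10000    # assumed sample size

skew_norm <- sample_skewness(rnorm(n))        # Gaussian: symmetric
skew_t    <- sample_skewness(rt(n, df = 10))  # t with 10 d.o.f.: symmetric
skew_exp  <- sample_skewness(rexp(n))         # exponential: right-skewed

print(c(normal = skew_norm, t = skew_t, exponential = skew_exp))
```

The two symmetric samples should both give values near zero, so skewness cannot tell them apart, while the exponential sample gives a clearly positive value.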

There are many methods of testing the normality of a data set. In this section we will introduce the following three methods. The first two can, in general, be used to test whether the data belong to any continuous distribution, not necessarily Gaussian, while the third is specific to the normal distribution:

$~~~~~~~~~$ 1. The QQ plot: This is a graphical method in which the quantiles of the data set are plotted against the corresponding quantiles of a normal distribution. If the data points are random samples from a normal distribution, the points in this scatter plot fall close to a straight line. In general, this plot can be used for comparing the data with the quantiles of any continuous distribution.


$~~~~~~~~~$ 2. Kolmogorov-Smirnov test: A non-parametric statistical test that compares two continuous distributions. In the one-sample case, it tests whether the data are sampled from an underlying reference distribution. The test compares Cumulative Distribution Functions (CDFs) rather than probability density functions; its statistic is the largest absolute difference between the two CDFs. Under the null hypothesis, the data are drawn from the reference distribution (one-sample case) or the two data sets compared come from the same distribution (two-sample case).

$~~~~~~~~~$ 3. Shapiro-Wilk test: This is a non-parametric test of the hypothesis that the given n data points are random samples from a normal distribution with an unknown mean and nonzero variance. In the Shapiro-Wilk test, the order statistics and the standard deviation of the sample are combined to obtain a statistic for testing the hypothesis.
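The three methods listed above can be tried out with functions from base R (`qqnorm`, `qqline`, `ks.test` and `shapiro.test`). The sketch below runs on simulated data, so the sample size, mean, standard deviation and seed are assumptions made only for illustration. Note that the reference parameters passed to `ks.test` are fixed in advance here; if they are instead estimated from the same data, the reported p-value is no longer reliable.

```r
set.seed(42)                        # arbitrary seed, for reproducibility only
x <- rnorm(100, mean = 5, sd = 2)   # simulated data; replace with your own vector

# 1. QQ plot: sample quantiles against theoretical normal quantiles.
#    qqnorm() also returns the plotted coordinates (invisibly).
qq <- qqnorm(x, main = "Normal QQ plot")
qqline(x, col = "red")              # reference line through the quartiles
qq_cor <- cor(qq$x, qq$y)           # near 1 when the points lie on a line

# 2. One-sample Kolmogorov-Smirnov test against N(mean = 5, sd = 2),
#    with the reference parameters specified beforehand.
ks_res <- ks.test(x, "pnorm", mean = 5, sd = 2)

# 3. Shapiro-Wilk test of normality (mean and variance left unspecified).
sw_res <- shapiro.test(x)

print(qq_cor)
print(ks_res)
print(sw_res)
```

In both tests a small p-value (for example, below 0.05) is evidence against the null hypothesis, while a large p-value only means that the data are consistent with it.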

In principle, the Shapiro-Wilk test works well for smaller samples while the Kolmogorov-Smirnov test works well with larger samples. However, small sample sizes lead to inconsistent results across different data sets sampled from the same distribution. The normality tests perform more reliably on data of larger sizes, such as 50, 100 or more.
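The effect of sample size can be illustrated with a small simulation. This is only a sketch under stated assumptions: `reject_rate` is a hypothetical helper, the exponential distribution is chosen here as an example of clearly non-normal data, and the number of repetitions, significance level and seed are arbitrary. It repeatedly draws samples of a given size and records how often the Shapiro-Wilk test rejects normality.

```r
set.seed(123)   # arbitrary seed, for reproducibility only

# Hypothetical helper: fraction of repeated exponential samples of size n
# for which the Shapiro-Wilk test rejects normality at level alpha.
reject_rate <- function(n, reps = 1000, alpha = 0.05) {
  p_values <- replicate(reps, shapiro.test(rexp(n))$p.value)
  mean(p_values < alpha)
}

rate_small <- reject_rate(10)    # samples of size 10
rate_large <- reject_rate(100)   # samples of size 100

print(c(n10 = rate_small, n100 = rate_large))
```

With samples of size 10 the verdict changes from one data set to the next even though every sample comes from the same non-normal distribution, whereas with samples of size 100 the test rejects normality almost every time.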