Shapiro-Wilk test for normality
The Shapiro-Wilk test is a test of the hypothesis that the given n data points are a random sample from a normal distribution
with unknown mean and unknown, nonzero variance. This test performs better than the Kolmogorov-Smirnov test for small sample sizes. It is applicable for $n \geq 3$.
The reference to the original paper:
"An Analysis of Variance Test for Normality (Complete Samples)"
$~~~~~~~~~~$ S.S. Shapiro; M.B. Wilk,
Biometrika, Vol. 52, No. 3/4 (Dec., 1965), pp. 591-611
In the Shapiro-Wilk test, the order statistics and the standard deviation of the sample are combined
to obtain a statistic for testing the hypothesis. The important concepts and steps of this procedure are explained below, skipping
the mathematical details of the derivations and proofs.
Let $X = \{X_1, X_2, X_3, \ldots, X_n\}$ be the n data points from our observation. We want to test whether these n data
points are randomly drawn from a Gaussian distribution.
As a first step, we compute the order statistics for this data.
Let $~~X_{(1)} \leq X_{(2)} \leq X_{(3)} \leq \ldots \leq X_{(n)} $ be the order statistics for the observed data X.
( Example: Suppose we arrange the data X in ascending order. The order statistic $X_{(i)}$ denotes the i-th smallest value in the sample. For example, if the data point $X_1$ happens to be the 5th smallest value, then $X_{(5)} = X_1$. Similarly, if the data point $X_6$ is the 11th smallest value, then $X_{(11)} = X_6$, etc. )
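In R, the order statistics are obtained simply by sorting the sample, and rank() reports where each original data point lands in the ascending order. The data values below are made up purely for illustration:

```r
# Hypothetical data vector, just for illustration:
X <- c(4.2, 1.7, 3.9, 2.8, 5.1)

print(sort(X))   # the order statistics X_(1) <= X_(2) <= ... <= X_(n)
print(rank(X))   # the ascending rank of each original data point X_i
```

Here sort(X) gives the vector $(X_{(1)}, \ldots, X_{(n)})$, while rank(X) tells us, for each $X_i$, which order statistic it becomes.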
Since we are going to compare our data with the Gaussian distribution, we will draw n random deviates from a unit normal distribution N(0,1).
We represent these unit normal deviates by $\{Z_1, Z_2, Z_3, \ldots, Z_n\}$. These are random values from N(0,1).
Now we arrange these n deviates in ascending order and note their order statistics as
$~~Z_{(1)} \leq Z_{(2)} \leq Z_{(3)} \leq \ldots \leq Z_{(n)} $.
( Example: Suppose we sample n random deviates $Z_1, Z_2, Z_3, \ldots, Z_n$ from a unit normal distribution.
If we arrange the deviates in ascending order, they span from the largest negative value to the largest positive value.
For example, if the deviate $Z_4$ has the value -1.124 and is the third smallest number, then $Z_{(3)} = Z_4 = -1.124$. Similarly, if the deviate $Z_7$ has the value 2.916 and is the eleventh smallest number, then $Z_{(11)} = Z_7 = 2.916$. )
Next, we consider the expectation value of the element at each order i in the order statistics, $Z_{(i)}$.
Let us try to understand this expectation value of $Z_{(i)}$ through an imagined simulation study as follows:
Suppose we generate n random deviates from a unit normal distribution. We call it "iteration 1" of our simulation.
We compute the order statistics $R_1 = \{ Z1_{(1)} \leq Z1_{(2)} \leq \ldots \leq Z1_{(n)} \}$ for this iteration.
Next, we again generate n random deviates from a unit normal distribution. We call it "iteration 2" of our simulation.
We compute the order statistics $R_2 = \{ Z2_{(1)} \leq Z2_{(2)} \leq \ldots \leq Z2_{(n)} \}$ for the second iteration.
We repeat this for a very large number of iterations, say a million. We thus end up with the order statistics $R_1, R_2, R_3, \ldots, R_{1000000}$.
Looking across these million order statistics, we end up with a million random values of $Z_{(1)}$, namely $\{Z1_{(1)}, Z2_{(1)}, Z3_{(1)}, \ldots, Z1000000_{(1)} \}$; similarly, a million random values of $Z_{(2)}$, a million random values of $Z_{(n)}$, etc. Every order statistic $Z_{(i)}$ has been randomly generated a million times.
Now we compute the mean $m_i$ of the million values of $Z_{(i)}$. For a large number of iterations like a million, we can treat this $m_i$ as the expectation value of the order statistic $Z_{(i)}$; as the number of iterations grows without bound, $m_i$ converges to the population mean $E(Z_{(i)})$ of the distribution of $Z_{(i)}$.
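The imagined simulation above can be sketched in a few lines of R. The iteration count (1e5 instead of a million, for speed) and the sample size n = 5 are arbitrary choices for illustration:

```r
# A sketch of the simulation: estimate m_i = E(Z_(i)) by Monte Carlo.
set.seed(1)
n     <- 5
iters <- 1e5

# Each row of Zs is one iteration's sorted sample Z_(1) <= ... <= Z_(n).
Zs <- t(apply(matrix(rnorm(iters * n), nrow = iters), 1, sort))

m <- colMeans(Zs)   # Monte Carlo estimates of the expected order statistics
print(round(m, 3))

# By the symmetry of N(0,1), m_i = -m_{n+1-i}; in particular m_3 = 0 for n = 5.
```

Notice that the estimated $m_i$ come out antisymmetric about zero, as expected from the symmetry of the unit normal distribution.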
The important idea of this algorithm is that the vector of order statistics of the data, $\{X_{(1)}, X_{(2)}, \ldots, X_{(n)}\}$, and the corresponding vector $\{ Z_{(1)}, Z_{(2)}, \ldots, Z_{(n)}\}$ of order statistics from a unit normal
distribution are linearly related: if the data are Gaussian, then $X_{(i)}$ has the same distribution as $\mu + \sigma Z_{(i)}$.
We thus express the relationship between these two quantities, using the notation of linear regression, as
$~~~~~~X_{(i)} = \mu~+~ \sigma Z_{(i)}~~~~~~~~~~$ (1)
Using the methods of linear regression theory, we can get unbiased estimates of the population parameters $\mu$ and $\sigma$.
For this, we need the covariances of the normal order statistics. The order statistic $Z_{(i)}$ is strongly correlated with its neighbours $Z_{(i-1)}$ and $Z_{(i+1)}$ because of the ordering $Z_{(1)} \leq Z_{(2)} \leq \ldots \leq Z_{(n)}$.
Accordingly, the covariance matrix $\overline{V}$ of the order statistics for samples from a unit normal distribution is defined by the covariances of the pairs $( Z_{(i)}, Z_{(j)} )$. Its elements are
$~~~~~~~~~~~~~V_{ij}~=~E(~(Z_{(i)} - m_i)(Z_{(j)} - m_j)~) ~~~~~~~~~~~~~~$ (2)
The matrix $\overline{V}$ is a positive definite symmetric matrix by construction.
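Like $\overline{m}$, the matrix $\overline{V}$ can be estimated by the same kind of Monte Carlo simulation (in practice its exact values come from tables). The sample size n = 4 and iteration count below are arbitrary choices for illustration:

```r
# A sketch: estimate the covariance matrix V of the normal order statistics
# by Monte Carlo. Exact values are tabulated; this is only illustrative.
set.seed(2)
n     <- 4
iters <- 1e5

Zs <- t(apply(matrix(rnorm(iters * n), nrow = iters), 1, sort))
V  <- cov(Zs)       # V_ij = E( (Z_(i) - m_i)(Z_(j) - m_j) )
print(round(V, 3))

# V is symmetric and positive definite, and neighbouring order
# statistics are positively correlated.
```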
For a given data set with n data points, let $\overline{m}$ be the column vector of the expectation values of the order statistics of unit normal deviates, as defined before. Then, $\overline{m}' = (m_1, m_2, m_3, \ldots, m_n)$ is a row vector of dimension $1 \times n$.
Let $\overline{X}$ denote the column vector of the ordered data points so that $\overline{X}' = (X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(n)})$.
Let $~~X_{mean} = \dfrac{1}{n} \displaystyle \sum_{i=1}^n X_i ~~$ and $~~S^2 = \displaystyle \sum_{i=1}^n (X_i - X_{mean})^2~$ be the sample mean and the corrected sum of squares (so that $S^2/(n-1)$ is the usual sample variance).
Then, for a symmetric distribution, it can be shown that
$~\hat{\mu} = X_{mean} ~~~$ and $~~~~\hat{\sigma} = \dfrac{\overline{m}' \overline{V}^{-1} \overline{X}}{\overline{m}' \overline{V}^{-1} \overline{m} }$
Using these quantities, the W statistic for normality is defined as follows:
$ ~~~~~~~~W~=~\dfrac{R^4 \hat{\sigma}^2}{C^2 S^2}~=~\dfrac{b^2}{S^2}~=~\dfrac{ (\overline{a}'\overline{X})^2 }{S^2}~= \dfrac{\displaystyle {(\sum_{i=1}^n a_i X_{(i)})^2}}{\displaystyle \sum_{i=1}^n (X_i - X_{mean} )^2} $
where
$~~~~~~~~~~~~R^2 = \overline{m}' \overline{V}^{-1} \overline{m} $
$~~~~~~~~~~~~C^2 = \overline{m}' \overline{V}^{-1} \overline{V}^{-1} \overline{m}$
$~~~~~~~~~~~~\overline{a}' = (a_1, a_2, a_3, \ldots, a_n) = \dfrac{\overline{m}' \overline{V}^{-1} }{\sqrt{\overline{m}' \overline{V}^{-1} \overline{V}^{-1} \overline{m}}} $
$~~~~~~~~~~~~b = \dfrac{R^2 \hat{\sigma}}{C} $
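The exact coefficients $a_i$ require the tabulated $\overline{m}$ and $\overline{V}$. As an illustrative sketch, the Blom approximation $m_i \approx \Phi^{-1}\big((i - 3/8)/(n + 1/4)\big)$ together with $\overline{a} \approx \overline{m}/\sqrt{\overline{m}'\overline{m}}$ (i.e. dropping $\overline{V}^{-1}$) yields the closely related Shapiro-Francia statistic rather than the exact Shapiro-Wilk W, but it shows the structure of the computation:

```r
# A W-like statistic using the Blom approximation to m_i and a = m/sqrt(m'm).
# Dropping V^-1 gives the Shapiro-Francia variant, not the exact
# Shapiro-Wilk W; the structure of the computation is the same.
w_approx <- function(X) {
  n <- length(X)
  m <- qnorm(((1:n) - 3/8) / (n + 1/4))  # approximate E(Z_(i))
  a <- m / sqrt(sum(m^2))                # approximate coefficients a_i
  sum(a * sort(X))^2 / sum((X - mean(X))^2)
}

set.seed(3)
X <- rnorm(30)
print(w_approx(X))                  # close to 1 for Gaussian data
print(shapiro.test(X)$statistic)    # exact W, for comparison
```

For data that really are Gaussian, both statistics come out close to 1 and close to each other.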
The Shapiro-Wilk statistic W always satisfies $0 \lt W \leq 1$.
Depending on n, for a value of W close enough to 1, the normality hypothesis is not rejected. For smaller values of W,
the normality hypothesis is rejected.
For n = 2, W is identically 1, so normality can never be rejected; the Shapiro-Wilk test is therefore useful only for $n \geq 3$.
The R implementation of the Shapiro-Wilk test allows sample sizes n from 3 to 5000.
The probability density of W is not known in closed form; various approximations and formulae are used for different ranges of n. Tables exist for the coefficients $a_i$ for various values of n, and the percentage points of W have likewise been tabulated for various n.
Another way of testing is to perform simulations to estimate the lower-tailed p-value of the computed statistic W for a given sample size n. In practice, we can use the R statistical function "shapiro.test()", which takes the data vector as input and returns the test statistic W and the corresponding p-value.
If the p-value is less than the threshold $\alpha$, we reject the normality hypothesis (which is the null hypothesis). This function implements the test for $3 \leq n \leq 5000$.
Shapiro-Wilk normality test in R
In R, the Shapiro-Wilk test for normality is performed by the stats package function "shapiro.test()".
The test has the null hypothesis that the sample data is randomly drawn from a Gaussian.
As a first example, we randomly draw 20 points from a Gaussian of mean 10 and standard deviation 2 to perform the
test:
set.seed(1334)
X = rnorm(20, mean=10, sd=2)
res = shapiro.test(X)
print(res)
Running the above script in R creates the output:
Shapiro-Wilk normality test
data: X
W = 0.97415, p-value = 0.839
This p-value of 0.839 is large compared to 0.05, so we fail to reject the null hypothesis that this data is from a Gaussian, as expected.
As a next step, we will randomly draw 20 data points from a chi-square distribution with 3 degrees of freedom and perform the normality test on this data:
set.seed(74544)
X = rchisq(20, 3)
res = shapiro.test(X)
print(res)
This code, when run in R, prints the following output on the screen:
Shapiro-Wilk normality test
data: X
W = 0.86139, p-value = 0.008321
Since the random data points come from a chi-square distribution rather than a normal distribution,
the p-value of 0.008321 is small compared to 0.05, and hence we reject the null hypothesis.
However, if we repeat the simulation with different seeds, we will not reject the null for every data set; a significant number of data sets also fail to reject the null. This is due to the small sample size of 20.
Finally, we can increase the sample size to 50 and perform the same simulation as above:
set.seed(74544)
X = rchisq(50, 3)
res = shapiro.test(X)
print(res)
For the above script, we get the following output:
Shapiro-Wilk normality test
data: X
W = 0.83955, p-value = 8.161e-06
With the larger sample size of 50, the null is rejected at the 0.05 significance level.
Even when we repeat this simulation many times with different random seeds, the null is rejected
at that level most of the time.
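The repeated-seed experiments described above can be sketched as a small power study: estimate the fraction of chi-square samples for which shapiro.test() rejects at $\alpha = 0.05$, for n = 20 and n = 50. The number of repetitions below is an arbitrary choice for illustration:

```r
# Estimate the rejection rate (power) of shapiro.test() against a
# chi-square(3) alternative, for two sample sizes.
set.seed(4)
reject_rate <- function(n, reps = 2000, alpha = 0.05) {
  p <- replicate(reps, shapiro.test(rchisq(n, df = 3))$p.value)
  mean(p < alpha)
}

p20 <- reject_rate(20)
p50 <- reject_rate(50)
print(p20)   # the null survives for a good fraction of the n = 20 data sets
print(p50)   # the null is rejected most of the time for n = 50
```

This makes the sample-size effect concrete: the same non-normal population is detected far more reliably at n = 50 than at n = 20.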