Statistics for Machine Learning

Simple Linear Regression

Suppose we perform an experiment to establish the relationship between two variables X and Y. We treat one of the variables (say X) as an independent variable and measure the values of Y for various values of X. The variable Y is called a dependent variable, since it is measured for a set of fixed values of X. The independent variable X is also called the predictor variable and Y the response variable. For example, in an experiment on a laboratory mouse, we study the change in the concentration of blood sugar in response to the concentration of an infused drug in the blood. We measure the concentration of blood sugar at various known concentrations of the drug in the blood. Here, the drug concentration is the independent (predictor) variable and the blood sugar measured is the dependent (response) variable.

A set of data points between X and Y are shown in Figure-1 below:

By looking at the scatterplot, suppose we decide that the relationship between X and Y can be approximated to be linear. We want to approximate the data points by a straight line of the form $y = a + bx$ so that for any given value of x in the data range, we can easily compute the approximate value of y from this equation. This is called a linear model for this data set.


By constructing this linear model, we replace a set of data points by the equation of a straight line through the points which approximately describes the relationship between X and Y in the given range of data. This helps us to qualitatively and quantitatively understand such a variation. Once the linear relationship is established, we can examine the inherent phenomena that result in a linear variation of Y as a function of X.


The problem is that an infinite number of straight lines, each with a distinct slope and intercept, can pass through the given set of points. Three such lines are depicted in Figure-2 here:


In the above Figure-2, we notice that each one of the three lines through the data lies close to some data points and far away from some others. An infinite number of such lines can pass through these data points. Which line is the best one?


The method of linear regression uses the observed data points to compute the slope and the intercept of the best straight line that can pass through the data points. The mathematical method is called the least square fit to a straight line. A straight line obtained from the observed data points (X,Y) using the least square fit has, summed over all data points, the smallest squared deviation (error) between the expected Y value on the line and the experimentally observed Y value, when compared to any other line through the data points.


Important point : The method of least square fit to a straight line (linear regression) gives the best straight line through the data, with the error minimized compared to any other line. It does not mean that such a line is the best curve through the data points. For example, for a given set of data points, the line obtained from the least square method may have more error (deviations from the data points) than a polynomial of 5th degree obtained from a least square fit of the data to a polynomial. Having decided to fit a curve (straight line, exponential, polynomial etc.) through the data points, the least square method gives the best curve (the one that minimizes the error) in that family. If we decide to fit a straight line, we get the best straight line. If we decide to fit a parabola, we get the best parabola. Thus, for a given type of curve, the method of least squares gives the best parameters of the curve. If \(\small{y = mx+c}\) is the curve fitted, the method gives the best values for m and c that minimize the error. Similarly, if \(\small{y = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4}\) is the fitted curve, the method gives the best values for the coefficients \(\small{a_0, a_1, a_2, a_3~and~a_4}\) that minimize the error. If we fit a straight line to data points with a parabolic trend, we may still get the best straight line possible through the points, but many points will deviate from the best fitted line. We will learn more on this later.



Before going into the procedure, we will list the assumptions underlying linear regression:


Suppose, we denote the data points as ($x_1,y_1$), ($x_2,y_2$), ($x_3,y_3$), ...., ($x_n,y_n$).


1. The values of the independent variable X are fixed, and have no errors. X is a non-random variable.


2. For a given value of X, the possible Y values are randomly drawn from a sub population of normal distribution. Thus, for a given value $x_i$ of X, observed $y_i$ is a point randomly drawn from a normal distribution. Also, the Y values for different X values are independent.


3. The means of the parent normal distributions from which the points $y_1, y_2,...,y_n$ were drawn lie on a straight line. The linear regression method in fact computes the slope and intercept of this (parent) straight line. This can be expressed as an equation,
$$ \mu_{y,x} = \beta_0 + \beta_1x$$
where $\mu_{y,x}$ is the mean value of the population from which the point y was drawn for the given x value. The coefficients $\beta_0$ and $\beta_1$ are called regression coefficients .

4. The variances of the parent normal distributions to which $y_1, y_2, ..., y_n$ belong can be assumed to be either equal or unequal in the derivation. Different expressions will result from these two assumptions.

Regression analysis procedure

For the given data set (X,Y), let $y = \beta_0 + \beta_1x$ be the regression equation we want.


Since we will never know the true regression coefficients, we make an estimate of these coefficients from the data points. We name them $b_0$ and $b_1$ respectively, so that $y = b_0+b_1 x$ for the observed data set. The coefficients $b_0$ and $b_1$ estimated through regression theory are the best estimates of the original coefficients $\beta_0$ and $\beta_1$ respectively, which represent the straight line relation for the population data.


For a given $x_i$, let the observed Y value be $y_i$. Then, $P_{obs}(x_i, y_i)$ will be this observed data point.


For a given $x_i$ value, let $y_i^{r}$ denote the corresponding point on this regression line. Then $P_{exp}(x_i, y_i^r)$ will be the expected data point on the regression line. See Figure-3 below:


The linear regression procedure computes the square of the difference $(y_i - y_i^r)^2$ between the observed $y_i$ value and the expected $y_i^r$ value on the line for each data point. Using a minimization procedure, the sum of the squares of the differences $\sum(y_i - y_i^r)^2$ is minimized to get the coefficients $b_0$ and $b_1$.



For a given $x_i$, we have $y_i^r = b_0 + b_1x_i$. Using this, we can write the sum of squares, with each term normalized by its variance, as,

$$ \chi^2 = \sum_{i}\dfrac{1}{\sigma_i^2}(y_i - (b_0 + b_1 x_i) )^2$$

By minimizing the above $\chi^2$ expression with respect to the variables $b_0$ and $b_1$, we can get expressions for $b_0$ and $b_1$ in terms of the observed data points ($x_i, y_i$). Using them, we can compute the fitted values $\hat{y_i} = b_0 + b_1 x_i$ to plot the straight line through the data points.
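As a quick illustration (a minimal sketch with made-up numbers, not part of the derivation), the same idea can be tried numerically in R: define the sum of squares as a function of $b_0$ and $b_1$ and let a general-purpose minimizer search for the best values. Here all $\sigma_i$ are taken as 1.

## Minimal sketch: numerical minimization of the chi-square sum (toy data, all sigma_i = 1)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

chisq <- function(b) sum((y - (b[1] + b[2]*x))^2)   # b[1] = intercept, b[2] = slope

fit <- optim(c(0, 1), chisq)   # start the search at b0 = 0, b1 = 1
fit$par                        # approximate least square values of b0 and b1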



See the box below for a complete derivation.


In the derivation, we consider two cases. The observed value $y_i$ is assumed to be a random sample drawn from a Normal distribution $N(\mu_i, \sigma_i)$. In the first case, we assume that the standard deviations $\sigma_i$ are all equal; that is, all data points are drawn from distributions with the same variance. In the second case, we assume that these variances are unequal.


Derivation of linear regression coefficients:

We start with our basic assumption that for a given independent variable $x_i$, the observed value $y_i$ follows a Gaussian with mean $\mu_i = y(x_i) = b_0 + b_1x_i$ and standard deviation $\sigma_i$.

For each point $(x_i, y_i)$, we write the Gaussian probability density as,

\( ~~~~~~~~~~~~~~~\small{P(y_i|x_i, b_0, b_1, \sigma_i ) = \dfrac{1}{\sigma_i \sqrt{2\pi}} {\Large e}^{-\dfrac{1}{2}\left(\dfrac{y_i-y(x_i)}{\sigma_i}\right)^2} }\)
where \(\small{P(y_i|x_i, b_0, b_1, \sigma_i )} \) is the probability of obtaining a $y_i$ given a value of $x_i$ and all the parameters of the model.

The probability of observing n such independent data points is obtained by multiplying their individual probabilities. We denote this probability by $P(y|b_0,b_1,x)$. This is the "conditional probability of obtaining the data set y given the coefficients $b_0$, $b_1$ and the data set x".

\(\small{P(y|b_0,b_1,x)}= P(y_1|x_1, b_0, b_1, \sigma_1 ) \times P(y_2|x_2, b_0, b_1, \sigma_2 ) \times .....\times P(y_n|x_n, b_0, b_1, \sigma_n ) \)
\(~~~~~~~~~~~~~~~~~~~~~~\small{= \displaystyle\prod_{i=1}^{n} P(y_i|x_i, b_0, b_1, \sigma_i ) }\)
\(~~~~~~~~~~~~~~~~~~~~~~\small{= \displaystyle\prod_{i=1}^{n} \dfrac{1}{\sigma_i\sqrt{2\pi}} {\Large e}^{-\dfrac{1}{2} \left(\dfrac{y_i-y(x_i)}{\sigma_i} \right)^2 } }\)
\(~~~~~~~~~~~~~~~~~~~~~~\small{= \displaystyle\prod_{i=1}^{n} \dfrac{1}{\sigma_i\sqrt{2\pi}} {\Large e}^{-\dfrac{1}{2} \left(\dfrac{y_i-(b_0+b_1x_i)}{\sigma_i} \right)^2 } }\)

The function $\small{P(y|b_0,b_1,x)}$ is called the "likelihood", a function of the parameter values $b_0$ and $b_1$. Note that the data points $x_i$ and $y_i$ and the $\sigma_i$ are constants from the data.

Since the multiplication of exponential terms is equivalent to summing their exponents, we can write the above expression as,

\(~~~~~~~~\small{ \small{P(y|b_0,b_1,x)}~=~\displaystyle{\left( \prod_{i=1}^{n} \dfrac{1}{\sigma_i\sqrt{2\pi}}\right)} ~ {\Large e}^{-\dfrac{1}{2} \displaystyle \sum_{i=1}^{n} \left( \dfrac{y_i-(b_0+b_1x_i)}{\sigma_i} \right)^2 } } \)

The best values of $b_0$ and $b_1$ are the ones for which the above likelihood is maximum. Therefore we find the best fit values for the parameters by maximizing the likelihood function. This procedure is called the method of maximum likelihood. Generally, the logarithm of the likelihood, called the log likelihood, is maximized to get the parameters.

By looking at the expression we can realize that the probability is maximum when the negative exponent term given by,
\(~~~~~~~~~~~~~~~~\small{\chi^2~=~\displaystyle\sum_{i=1}^{n} \dfrac{1}{\sigma_i^2} \left(y_i - b_0 - b_1x_i \right)^2 }~~~~~~~\)is a minimum.

The above term is called the least square sum. We have a data set with n data points $(x_i, y_i)$ in which each $y_i$ is assumed to be randomly drawn from a Gaussian with mean $y(x_i) = b_0+b_1x_i$ and standard deviation $\sigma_i$. By minimizing the above least square sum with respect to the variables $b_0$ and $b_1$, we can obtain their best values for the data set. The line $y(x_i) = b_0 + b_1x_i$ is said to be the best fit to the data, which minimizes the errors between the observed $y_i$ values and the expected values $y(x_i)$ on the straight line. This procedure is also called a least square fit to a straight line.

The best fit values for $b_0$ and $b_1$ are obtained by solving the equations,

\( \dfrac{\partial \chi^2}{\partial b_0} = 0~~~~~~~~~and~~~~~~~ \dfrac{\partial \chi^2}{\partial b_1} = 0 \)

At this stage, we consider two cases related to our assumption regarding the standard deviations:

Case 1 : The standard deviations $\sigma_i$ are all equal

We take the case for which \(\small{\sigma_1 = \sigma_2 = .....= \sigma_n = \sigma }\)

The expression for \(\small{\chi^2 }\) becomes,

\(~~~~~~\small{\chi^2 = \displaystyle\sum_{i=1}^{n} \dfrac{1}{\sigma^2}(y_i - b_0 - b_1x_i)^2 }\)

\(~~~~~~\small{~~~~= \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^{n} (y_i - b_0 - b_1x_i)^2 }\)

Differentiating \(\small{\chi^2 }\) partially with respect to $b_0$ and $b_1$ and equating to zero, we get two equations:

\(\small{\dfrac{\partial \chi^2}{\partial b_0}~~=~~ \dfrac{-2}{\sigma^2} \displaystyle\sum_{i=1}^{n} (y_i - b_0 - b_1x_i)~~=~~0 }~~~~~~~~~~~~(1)\)

\(\small{\dfrac{\partial \chi^2}{\partial b_1}~~=~~ \dfrac{-2}{\sigma^2} \displaystyle\sum_{i=1}^{n} (y_i - b_0 - b_1x_i) x_i~~=~~0 }~~~~~~~~~~~~(2)\)

Equations (1) and (2) can be rearranged to get the following two simultaneous equations with $b_0$ and $b_1$ as variables:

\(\small{ \displaystyle\sum_{i=1}^{n} y_i~=~\displaystyle\sum_{i=1}^{n} b_0~+~\displaystyle\sum_{i=1}^{n} b_1x_i~=~ b_0n~+~b_1\displaystyle\sum_{i=1}^{n}x_i }~~~~~~~~(3)\)

\(\small{ \displaystyle\sum_{i=1}^{n}x_iy_i~=~ \displaystyle\sum_{i=1}^{n}b_0x_i~+~ \displaystyle\sum_{i=1}^{n} b_1 x_i^2~~=~~ b_0 \displaystyle\sum_{i=1}^{n}x_i~+~b_1 \displaystyle\sum_{i=1}^{n}x_i^2 }~~~~~~~~~~~~(4)\)

Solving the equations (3) and (4) for the variables $b_0$ and $b_1$, we get the following expressions for the coefficients of the least square fit for the case when all the data points have equal variances. All the summations are over the index $i=1~to~n$, where n is the number of data points.

\(\small{ \Delta~=~ \begin{vmatrix} n & \displaystyle\sum x_i \\ \displaystyle\sum x_i & \displaystyle\sum x_i^2 \\ \end{vmatrix} ~~=~~n \sum x_i^2 - (\sum x_i)^2 }\)

\(\small{ b_0~=~\dfrac{1}{\Delta} \begin{vmatrix} \displaystyle\sum y_i & \displaystyle\sum x_i \\ \displaystyle\sum x_i y_i & \displaystyle\sum x_i^2 \\ \end{vmatrix} ~~=~~\dfrac{1}{\Delta} \left(\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i\right) }\)

\(\small{ b_1~=~\dfrac{1}{\Delta} \begin{vmatrix} n & \displaystyle\sum y_i \\ \displaystyle\sum x_i & \displaystyle\sum x_i y_i \\ \end{vmatrix} ~~=~~\dfrac{1}{\Delta} \left(n \sum x_i y_i - \sum x_i \sum y_i \right) }\)

Case 2 : The standard deviations $\sigma_i$ are unequal

In this case, the standard deviations of the populations from which the $y_i$ were drawn are not equal. As a consequence, we cannot pull the standard deviation term $\sigma_i$ out of the summations. We go back to the step in the derivation where we defined the Chi-square term:



\(~~~~~~~~~~~~~~~~\small{\chi^2~=~\displaystyle\sum_{i=1}^{n} \dfrac{1}{\sigma_i^2} \left(y_i - b_0 - b_1x_i \right)^2 }~~~~~~~\)

Now, differentiating with respect to the variables $b_0$ and $b_1$, we get

\(\small{\dfrac{\partial \chi^2}{\partial b_0}~~=~~ -2 \displaystyle\sum_{i=1}^{n} \dfrac{1}{\sigma_i^2} (y_i - b_0 - b_1x_i)~~=~~0 }~~~~~~~~~~~~(5)\)

\(\small{\dfrac{\partial \chi^2}{\partial b_1}~~=~~ -2\displaystyle\sum_{i=1}^{n} \dfrac{1}{\sigma_i^2}(y_i - b_0 - b_1x_i) x_i~~=~~0 }~~~~~~~~~~~~(6)\)

Equations (5) and (6) can be rearranged to get the following two simultaneous equations with $b_0$ and $b_1$ as variables:

\(\small{ \displaystyle\sum_{i=1}^{n} \dfrac{y_i}{\sigma_i^2}~=~ b_0 \sum \dfrac{1}{\sigma_i^2}~+~b_1\displaystyle\sum_{i=1}^{n} \dfrac{x_i}{\sigma_i^2} } ~~~~~~~~(7)\)

\(\small{ \displaystyle\sum_{i=1}^{n} \dfrac{x_iy_i}{\sigma_i^2 }~=~ b_0 \displaystyle\sum_{i=1}^{n} \dfrac{x_i}{\sigma_i^2}~+~b_1 \displaystyle\sum_{i=1}^{n} \dfrac{x_i^2}{\sigma_i^2} }~~~~~~~~~~~~(8)\)

Solving the above two equations for the variables $b_0$ and $b_1$, we get their expressions in terms of the data points $(x_i, y_i)$:



\(\small{ \Delta~=~ \begin{vmatrix} \displaystyle\sum\dfrac{1}{\sigma_i^2} & \displaystyle\sum\dfrac{x_i}{\sigma_i^2} \\ \displaystyle\sum\dfrac{x_i}{\sigma_i^2} & \displaystyle\sum\dfrac{x_i^2}{\sigma_i^2} \\ \end{vmatrix} ~~=~~\displaystyle\sum\dfrac{1}{\sigma_i^2} \sum\dfrac{x_i^2}{\sigma_i^2} - \displaystyle\sum\dfrac{x_i}{\sigma_i^2} \sum\dfrac{x_i}{\sigma_i^2} }\)

\(\small{ b_0~=~\dfrac{1}{\Delta} \begin{vmatrix} \displaystyle\sum \dfrac{y_i}{\sigma_i^2} & \displaystyle\sum \dfrac{x_i}{\sigma_i^2} \\ \displaystyle\sum \dfrac{x_i y_i}{\sigma_i^2} & \displaystyle\sum \dfrac{x_i^2}{\sigma_i^2} \\ \end{vmatrix} ~~=~~\dfrac{1}{\Delta} \left(\sum \dfrac{y_i}{\sigma_i^2} \sum \dfrac{x_i^2}{\sigma_i^2} - \sum \dfrac{x_i}{\sigma_i^2} \sum \dfrac{x_i y_i}{\sigma_i^2}\right) }\)

\(\small{ b_1~=~\dfrac{1}{\Delta} \begin{vmatrix} \displaystyle\sum\dfrac{1}{\sigma_i^2} & \displaystyle\sum \dfrac{y_i}{\sigma_i^2} \\ \displaystyle\sum \dfrac{x_i}{\sigma_i^2} & \displaystyle\sum \dfrac{x_i y_i}{\sigma_i^2} \\ \end{vmatrix} ~~=~~\dfrac{1}{\Delta}\left(\displaystyle\sum\dfrac{1}{\sigma_i^2} \displaystyle\sum \dfrac{x_i y_i}{\sigma_i^2}~~-~~\displaystyle\sum \dfrac{x_i}{\sigma_i^2} \displaystyle\sum \dfrac{y_i}{\sigma_i^2}\right) }\)



We summarize the results for both cases:

1. When the errors on all the data points are assumed to be equal:


\(\small{ \Delta~=~n \sum x_i^2 - (\sum x_i)^2 }\)

\(\small{ b_0~=~\dfrac{1}{\Delta} \left(\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i\right) }\)

\(\small{ b_1~=~\dfrac{1}{\Delta} \left(n \sum x_i y_i - \sum x_i \sum y_i \right) }\)
















2. When the errors on all the data points are assumed to be unequal:

\(\small{ \Delta~=~\displaystyle\sum\dfrac{1}{\sigma_i^2} \sum\dfrac{x_i^2}{\sigma_i^2} - \displaystyle\sum\dfrac{x_i}{\sigma_i^2} \sum\dfrac{x_i}{\sigma_i^2} }\)

\(\small{ b_0~=~\dfrac{1}{\Delta} \left(\sum \dfrac{y_i}{\sigma_i^2} \sum \dfrac{x_i^2}{\sigma_i^2} - \sum \dfrac{x_i}{\sigma_i^2} \sum \dfrac{x_i y_i}{\sigma_i^2}\right) }\)

\(\small{ b_1~=~\dfrac{1}{\Delta}\left(\displaystyle\sum\dfrac{1}{\sigma_i^2} \displaystyle\sum \dfrac{x_i y_i}{\sigma_i^2}~~-~~\displaystyle\sum \dfrac{x_i}{\sigma_i^2} \displaystyle\sum \dfrac{y_i}{\sigma_i^2}\right) }\)


In the above expressions, the quantity $w_i = 1/\sigma_i^2$ is called the "weight" of each data point. Each data point is thus weighted by the inverse of its variance. This method is called the weighted least square estimate.
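A minimal R sketch (with made-up data and errors) of how the summary formulas above can be turned into code; the weighted case uses $w_i = 1/\sigma_i^2$:

## Minimal sketch: closed-form least square coefficients (toy data and errors assumed)
x   <- c(1, 2, 3, 4, 5, 6)
y   <- c(1.1, 3.2, 4.8, 7.1, 9.2, 10.9)
sig <- c(0.5, 0.5, 0.8, 0.8, 1.0, 1.0)    # assumed standard deviations on y
n   <- length(x)

## Case 1: equal errors
D  <- n*sum(x^2) - sum(x)^2
b0 <- (sum(y)*sum(x^2) - sum(x)*sum(x*y)) / D
b1 <- (n*sum(x*y) - sum(x)*sum(y)) / D

## Case 2: unequal errors, weights w = 1/sigma^2
w   <- 1/sig^2
Dw  <- sum(w)*sum(w*x^2) - sum(w*x)^2
b0w <- (sum(w*y)*sum(w*x^2) - sum(w*x)*sum(w*x*y)) / Dw
b1w <- (sum(w)*sum(w*x*y) - sum(w*x)*sum(w*y)) / Dw

print(c(b0 = b0, b1 = b1))
print(c(b0w = b0w, b1w = b1w))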



















Estimating the weights in a weighted least square procedure

1. Ideally, the known variances on the $y_i$ values should be used to form the weights.

2. In case many repeated measurements of $y_i$ are available at each $x_i$, we can use the estimated standard deviation to compute the weight.

3. In the data, if the variance on $y_i$ at each $x_i$ is directly proportional to the value of $x_i$, then we can use $1/x_i$ as the weight for each $y_i$.

4. If $y_i$ is an average of $n_i$ observations each with same variance $\sigma^2$, variance on $y_i = \sigma^2/n_i$ and thus the weight $w_i = n_i$.

5. Another method is also adopted. First perform the regression without errors to estimate the coefficients $b_0$ and $b_1$. Then estimate the residuals with respect to the fitted values. The $i^{th}$ squared residual is an estimator of $\sigma_i^2$ and the $i^{th}$ absolute residual is an estimator of $\sigma_i$. These can be used as weights at each data point (a short sketch is given after this list).

Many more methods are described in the literature, including minimization techniques to estimate the weights.
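As an illustration of item 5, here is a rough two-pass sketch in R (made-up data; in practice the raw residual-based weights are usually smoothed or binned before use):

## Rough sketch of residual-based weighting (toy data assumed)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 4.1, 5.8, 8.4, 9.6, 12.5, 13.1, 16.2)

fit0 <- lm(y ~ x)                 # first pass: ordinary (unweighted) fit
w    <- 1 / residuals(fit0)^2     # squared residuals as rough estimates of sigma_i^2
fit1 <- lm(y ~ x, weights = w)    # second pass: weighted fit
coef(fit1)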

Evaluating the quality of the fit

While performing linear regression on a data set, we assume that the trend in the data is a linear relationship between the independent and dependent variables. Based on this assumption, we fit a straight line of the form \(y_i = b_0 + b_1x_i\).


No matter what the real trend in the data is, the least square minimization procedure gives the best straight line through the data, the one that minimizes the sum of squared deviations between the data points and the assumed straight line.


Suppose the real trend in the data is not linear, and we still go ahead and fit a straight line to it. This is depicted in Figure-4 below:


In the above figure, the plot on the left side depicts a straight line fitted to data whose actual trend is fairly linear. On the other hand, the plot on the right side shows a best fit line to data that actually has a non-linear trend. But both are the "best fit" straight lines through their data points.


Therefore, it is necessary to have a parameter that distinguishes between these two cases: a parameter that indicates the extent to which the data points are aligned with the fitted line. We should derive an expression for such a parameter from the data points \((x_i, y_i)\) and the values of the fitted coefficients \( (b_0, b_1) \). Such a parameter is called the "coefficient of determination" or the \(r^2~\) (r-square) parameter. We will now derive this parameter.


The coefficient of determination, $r^2$

To obtain a number that represents the quality of the fit, we compare the scattering of the data points about the fitted line with their scattering around their mean value $\overline{y}$.


In the Figure-5 below, the blue line is the regression line through the data points that are represented by red circles.



We consider the following quantities shown in the figure:


$x_i$ = the X value of the $i^{th}$ data point named P
$y_i$ = the Y value of the $i^{th}$ data point named P
$\hat{y_i} = b_0 + b_1x_i = $ the fitted value of $Y$ for a given $x_i$
$\overline{y} = $ mean of all the $y_i$ values of the data set.


In the next step, we proceed to look at the deviations between the quantities $y_i$, $\overline{y}$ and $\hat{y_i}$ as follows:


$t_i = y_i - \overline{y}~~$ is the deviation of y-value of a data point i from the sample mean of all y values

This deviation from the sample mean can be decomposed into two parts. We define,

$m_i = \hat{y_i} - \overline{y}~~$ is the deviation of the fitted y-value of a data point i from the sample mean $\overline{y}$. $~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ this quantity is "explained by the model"

$e_i = y_i - \hat{y_i}~~$ is the deviation of a $y_i$ from the corresponding fitted value.
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ This is called the "residual associated with $y_i$".
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ This can be thought of as a "deviation unexplained by the model"

For every data point i,$~~~t_i = m_i ~+~e_i$

We will now consider the sum square of the above three deviations over all the data points. We define,

$ SST~=~\displaystyle \sum_i t_i^2 =\displaystyle \sum_i (y_i - \overline{y})^2~~~$ is the amount of variation in the response variable y.

$ SSR~=~\displaystyle \sum_i m_i^2 =\displaystyle \sum_i (\hat{y_i} - \overline{y})^2~~~$ is the variation in y explained by the regression model

$ SSE~=~\displaystyle \sum_i e_i^2 =\displaystyle \sum_i (y_i - \hat{y_i})^2~~~$ variation of y left unexplained by the regression model

The following relationship exists between the above three sums of squares (proof skipped):

\( \displaystyle \sum_i (y_i - \overline{y})^2~=~\displaystyle \sum_i (\hat{y_i} - \overline{y})^2~ +~\displaystyle \sum_i (y_i -~ \hat{y_i})^2 \)


$SST~=~SSR~+~SSE$
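The identity can be checked numerically with a short R sketch (made-up data):

## Sketch: verify SST = SSR + SSE for a fitted straight line (toy data assumed)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 2.9, 3.8, 5.2, 5.9, 7.4)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

SST <- sum((y - mean(y))^2)
SSR <- sum((yhat - mean(y))^2)
SSE <- sum((y - yhat)^2)

c(SST = SST, SSR.plus.SSE = SSR + SSE)   # the two numbers agree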









In a good straight line fit, we expect the data points to be as close to the fitted line as possible.

This means, the quantity SSE which is the sum square deviation between the data points $y_i$ and fitted values $\hat{y}$ must be close to zero in an ideal fit.

i.e., \(~~SSE~=~SST~-~SSR~~\) is close to zero, which means SST is close to SSR when the fit is ideal.

Therefore, we define a parameter called $r^2$ such that

\(~~~ r^2 = \dfrac{SSR}{SST}~=~1~-~\dfrac{SSE}{SST}~~~~ with~~~(0 \leq r^2 \leq 1) \)




In the case of an ideal fit, \(~r^2 \rightarrow 1~~\) as SSR gets very close to SST, making SSE close to 0.

As the data points get closer and closer to the best fit line, $~~r^2 \rightarrow 1$

As the data points deviate more and more from the best fit line, $~~r^2 \rightarrow 0$







$r^2$-value and correlation

The $r^2$ value of the fit is related to the Pearson correlation coefficient $cor(X,Y)$ between the variables X and Y.

It can be shown that,

\(r^2~ = ~\left( \dfrac{\displaystyle \sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\displaystyle \sum_{i=1}^{n}(x_i - \overline{x})^2} \sqrt{ \displaystyle \sum_{i=1}^{n}(y_i - \overline{y})^2}} \right)^2 ~=~ \left(\dfrac{cov(x,y)}{S_xS_y} \right)^2~=~[cor(x,y)]^2 \)









where $cov(x,y)$ is the covariance between the variables and $cor(x,y)$ is the Pearson correlation coefficient between X and Y, given by the term inside the brackets. Here, $S_x$ and $S_y$ are the sample standard deviations of the X and Y data.

Thus, the $r^2$ value of the best fit $y = b_0 + b_1x$ is equal to the square of the Pearson correlation coefficient between the x and y variables.

We can therefore get the $r^2$ value of the least square fit to a straight line by computing the Pearson correlation coefficient between the X and Y variables and squaring it, even before performing the fit.
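This can be verified with a small R sketch (made-up data): the $r^2$ reported by the fit equals the squared Pearson correlation computed directly from the data.

## Sketch: r-squared of the fit equals the squared Pearson correlation (toy data assumed)
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(2.2, 3.1, 4.9, 6.2, 6.8, 8.4, 9.1)

res <- lm(y ~ x)
summary(res)$r.squared    # r^2 of the least square fit
cor(x, y)^2               # square of the Pearson correlation coefficient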


Correlation between y and $\hat{y}$

We mention another important result by skipping the proof.

It can be shown that the correlation between the fitted values $\hat{y}$ and the response variable $y$ is the same as the correlation between the variables x and y:

\(~~~cor(\hat{y}, y)~=~cor(x,y) \)
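A one-line check in R (using the same kind of made-up data as above):

## Sketch: cor(y_hat, y) equals cor(x, y) (toy data assumed)
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(2.2, 3.1, 4.9, 6.2, 6.8, 8.4, 9.1)
res <- lm(y ~ x)
cor(fitted(res), y)    # same value as ...
cor(x, y)              # ... the correlation between x and y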





Hypothesis testing regression coefficients

After estimating the linear regression coefficients $b_0$ and $b_1$ by the method of least squares, we perform hypothesis tests to obtain their statistical significance.


Statistical test for the slope $\beta_1$

We set up a statistical test for testing whether the slope $\beta_1$ of the population is equal to zero, under the null hypothesis that there is no linear relationship between the independent variable x and the dependent variable y in the data. Accordingly, the statements of the null and alternate hypotheses are:


$~~~~~~~~~~~~~~~~~~~H_0 : \beta_1~=0~~~~~~~$ (The predictor variable X is not useful for making predictions)

$~~~~~~~~~~~~~~~~~~~H_1 : \beta_1 \neq 0~~~~~$ (The predictor variable X is useful for making predictions).


In general, the test statistic is a t-variable defined by,

\(~~~~~~~~~~~~~~~~~~t~=~\dfrac{b_1~-~\beta_1}{SE(b_1)}~~~~~ \) which follows a t-distribution with (n-2) degrees of freedom.







where $SE(b_1)$ is the standard error on the estimated coefficient $b_1$ and n is the number of data points used in the fit. The (n-2) degrees of freedom are explained below.

The above definition of the t statistic is similar to the one used in t-tests: it is the difference between the estimated coefficient and its hypothesized population value, divided by the standard error of the estimate.

The expression for the standard error on the estimated coefficient $b_1$ in terms of the observed data points can be derived from basic definitions. The expression is given below without the derivation:

\(~~~~~~~~~~SE(b_1)~=~\dfrac{ \sqrt{\dfrac{\displaystyle\sum_i(y_i - \hat{y_i})^2}{n-2}}}{\sqrt{\displaystyle\sum(x_i - \overline{x})^2 } }~=~ \dfrac{ \sqrt{\dfrac{SSE}{n-2}}}{\sqrt{\displaystyle \sum_i{(x_i - \overline{x})^2} } } \)









In the standard error expression, the quantity SSE has $n-2$ degrees of freedom, since two parameters, namely $b_0$ and $b_1$, have already been estimated from the data and are used in computing SSE.

With the standard error on the coefficient, we can define a $(1-\alpha)100\%$ Confidence Interval (CI) on the population parameter $\beta_1$ as follows:

\(~~~~~~CI~=~b_1~\pm~t_{\small{1-\alpha/2}}(n-2)~ \dfrac{ \sqrt{\dfrac{SSE}{n-2}}}{\sqrt{\displaystyle \sum_i{(x_i - \overline{x})^2} } } \)







Using the t-statistic $t~=~\dfrac{b_1 - \beta}{SE(b_1)}$, we can test the null hypothesis $H_0: \beta_1 = \beta$ against an alternative $H_A: \beta_1 \neq \beta$ using the usual two sided test at a significance level of $\alpha$.


In the linear regression, we test the null hypothesis $H_0: \beta_1=0$ against an alternative $H_A: \beta_1 \neq 0$ using the test statistic $t~=~\dfrac{b_1}{SE(b_1)}$, which is a t-distribution with (n-2) degrees of freedom.


We can also use the $(1-\alpha)100\%$ confidence interval on $b_1$ defined above to test the hypothesis.

The test procedure is similar to the one followed in the case of the usual two sided t-tests.
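A minimal R sketch of the slope test (made-up data; summary() and confint() on the lm object report the same quantities):

## Sketch: t-test and 95% confidence interval for the slope (toy data assumed)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 4.1, 5.8, 8.4, 9.6, 12.5, 13.1, 16.2)
n <- length(x)

fit  <- lm(y ~ x)
b1   <- coef(fit)[2]
SSE  <- sum(residuals(fit)^2)
SEb1 <- sqrt(SSE/(n - 2)) / sqrt(sum((x - mean(x))^2))

tstat <- b1 / SEb1                                 # test of H0: beta1 = 0
pval  <- 2 * pt(-abs(tstat), df = n - 2)           # two-sided p-value
ci    <- b1 + c(-1, 1) * qt(0.975, n - 2) * SEb1   # 95% confidence interval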


Statistical test for the intercept $\beta_0$

The intercept is the value of y when x is zero. On many occasions, the value x=0 for the data may not be meaningful. Still, we set up a procedure for performing the hypothesis testing for the intercept $\beta_0$.


We define the null and alternate hypothesis for the test as,

$~~~~~~~~~~~~H_0: \beta_0 = \beta$
$~~~~~~~~~~~~H_A: \beta_0 \neq \beta$

Under the null hypothesis, the test statistic is defined using the fitted intercept $b_0$, following the usual procedure, as,

\(~~~~~~~~~~~~~~~t~=~\dfrac{b_0 - \beta}{SE(b_0)}~~~~~~\) has a t-distribution with (n-2) degrees of freedom.

Here, $SE(b_0)$ is a standard error on the estimated coefficient $b_0$.

Skipping the derivation, we reproduce here the expression for $SE(b_0)$ in terms of the data variables, as given in the general literature:

\(SE(b_0)~=~ \sqrt{\dfrac{\displaystyle \sum_i(y_i - \hat{y_i})^2}{n-2}} \displaystyle{\sqrt{\dfrac{1}{n}~+~\dfrac{\overline{x}^2}{\displaystyle \sum_i(x_i - \overline{x})^2}}} \)






In order to test the null hypothesis $H_0:\beta_0=0$, we put $\beta=0$ in the above expression for the t-statistic to get,

\(~~~~~~~~~~t~=~\dfrac{b_0 }{SE(b_0)} \)

We define a $(1-\alpha)100\%$ Confidence Interval for $b_0$ with the usual definition:

\(~~~~~~~~~~CI~=~b_0 \pm t_{1-\alpha/2}(n-2)~SE(b_0)~~\)



With the t-statistic computed from the data, we can test the two sided null hypothesis $\beta_0 = 0$ at the level $\alpha$ either by computing the p-value for the statistic or by using the confidence interval. The test procedure is the same as that used in the one sample t-test.
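A corresponding sketch for the intercept (same made-up data as the slope sketch above):

## Sketch: t-test for the intercept (toy data assumed)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 4.1, 5.8, 8.4, 9.6, 12.5, 13.1, 16.2)
n <- length(x)

fit  <- lm(y ~ x)
b0   <- coef(fit)[1]
SSE  <- sum(residuals(fit)^2)
SEb0 <- sqrt(SSE/(n - 2)) * sqrt(1/n + mean(x)^2 / sum((x - mean(x))^2))

tstat <- b0 / SEb0                          # test of H0: beta0 = 0
pval  <- 2 * pt(-abs(tstat), df = n - 2)    # two-sided p-value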


Testing the significance of the entire model - the F-test

In linear regression, we fitted a model $y = b_0 + b_1x$ to the data between the predictor variable x and the response variable y.

In the previous section, we tested the null hypothesis on the coefficients $\beta_1$ and $\beta_0$ of the population data using the best fit coefficients $b_1$ and $b_0$ from the sample data.

Next we ask a bigger question: does a linear relationship (model) exist at all between the X and Y variables?

Suppose that no relationship whatsoever exists between the X and Y variables in the population. That is, if we create a scatter plot between X and Y for the population data, it will be a uniform scatter of points in the (X,Y) plane.

From this data in which X and Y are random variables, suppose we sample data points of finite size n in our experiment.

If n is very large, again there will be no relation between X and Y. However, if n is finite and not large, there is a chance that the relation between X and Y may show a linear trend by random chance. i.e., if we keep sampling (X,Y) data points of size n many times and do a least square fit each time, what is the probability that we get the observed $b_1$ and $b_0$ by chance in one such sample?

We set up a test with the null hypothesis that the response variable Y cannot be expressed as a function of the independent (predictor) variable X.

In the assumed model $~~y = b_0 + b_1x~~$, we set a null hypothesis $H_0: \beta_0 = \beta_1 = 0$,
i.e., there is no relation between X and Y, indicating that the linear regression model does not hold.

We use an F statistic for this test.

We test whether our entire model with predictor variable X explains the observed variations in the dependent variable Y when compared to the model that assumes a random association ("no relation") between X and Y. The model that assumes that the observed relation is only due to random chance is called the "null model".

In the F-test, a larger value of the F statistic means that the probability that a random association could have given the observed coefficients $b_1$ and $b_0$ is small, and the model explains the variance in the data to a significant level. A smaller value of the F-statistic indicates that the model does not explain the observed variation well, and the observed results may be due to random chance. Hence it is not a useful model to explain the data.

The F statistic, which is the ratio of two variances, is defined as,

\(~~~~~~~~~~~~F~=~\dfrac{ \left(\dfrac{SSR}{df_{SSR}}\right) } { \displaystyle \left(\dfrac{SSE}{df_{SSE}} \right) } \)







where, as defined before, for a linear model fitted with n data points,

\(~~~~SSR~=~\displaystyle \sum_i(\hat{y_i} - \overline{y})^2 ~~~\) is the variation in Y explained by the model

\(~~~~df_{SSR}~=~p~~~~~\) (the degrees of freedom of the regression model, equal to the number of predictor variables; for a simple linear model, \(p=1\))

\(~~~~SSE~=~\displaystyle \sum_i( y_i - \hat{y_i})^2 ~~~\) is the variation in Y unexplained by the model

\(~~~~df_{SSE}~=~n-p-1~~~~~\) (the degrees of freedom for the error).

The F statistic defined above has an F-distribution with p and n-p-1 degrees of freedom. Therefore, we write

\(~~~~~~~~~~~~F~=~\dfrac{ \left(\dfrac{SSR}{p}\right) } { \displaystyle \left(\dfrac{SSE}{n-p-1} \right) }~~~\rightarrow F(p, n-p-1) \)







Suppose we fit a model $y~=~\beta_0 + \beta_1x~~$ with 20 data points. Then $n=20$ and $p=1$, and the F statistic has $(1, 18)$ degrees of freedom.

The test procedure is similar to the one followed in one way ANOVA. First estimate the F statistic using the computed values of SSR and SSE. Then, perform an F-test at a significance level of $\alpha$ to reject or accept the null hypothesis.
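A minimal R sketch of the F-test (made-up data; summary() on the lm object reports the same F statistic and p-value):

## Sketch: F-test of the overall linear model (toy data assumed)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 4.1, 5.8, 8.4, 9.6, 12.5, 13.1, 16.2)
n <- length(x); p <- 1           # one predictor variable

fit  <- lm(y ~ x)
yhat <- fitted(fit)
SSR  <- sum((yhat - mean(y))^2)
SSE  <- sum((y - yhat)^2)

Fstat <- (SSR/p) / (SSE/(n - p - 1))
pval  <- pf(Fstat, p, n - p - 1, lower.tail = FALSE)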



Interpreting the results of hypothesis testing of slope


In linear regression, we test a null hypothesis $H_0:\beta_1=0$ against an alternative $H_A:\beta_1 \neq 0$. When we reject or fail to reject the null hypothesis, more than one conclusion can be drawn. We should be very careful in interpreting the result.

Suppose, we reject the null hypothesis. Then, one of the following is a possibility:

(a) There is a linear relationship between X and Y

(b) Though the linear relation fits the data well, there is a possibility that a non-linear function may fit the data even better.

(c) In reality, the slope $\beta_1=0$ in the population, but the sample we collected shows otherwise. This is a type-I error.

Similarly, when we fail to reject the null hypothesis, it is possible that

(a) The relationship between X and Y may not be linear

(b) The slope $\beta_1 \neq 0$ in the population, but the data sample we got in our observation shows otherwise. This is a Type-II error.


Inspecting the residuals in linear regression - The residual plot.

In linear regression, for each data point, the observed value $y_i$ deviates from the fitted value. At a given $x_i$, the difference between the fitted value $\hat{y_i}$ and the data value $y_i$ is called the residual. Since the $y_i$ values are assumed to come from Gaussian distributions, if the fit is good, the residuals $\epsilon_i = y_i - \hat{y_i}$ should fluctuate symmetrically around zero.

The residual plot is a scatter plot between the predicted values $\hat{y_i}$ and the corresponding residuals $\epsilon_i = y_i - \hat{y_i}$. In Figure-6 below, three types of scattering patterns in residual plots are depicted:



In the ideal case, the variance of the dependent variable should be the same over the entire range of the data. Therefore, we expect the residuals to be scattered uniformly around zero across the entire range of predicted values, as shown in Figure-6(A) above. This property is called homoscedasticity. The regression method is based on the assumption of homoscedasticity in the data.

In the scatter plots depicted in Figure-6(B) and Figure-6(C) above, we find a pattern in the residual plot across the range of fitted values. The residual values are not scattered uniformly around zero across the predicted values; there is a trend in the scatter plot. Any such pattern in the residual plot is called heteroscedasticity. A trend in the residual plot (heteroscedasticity) is likely an indication that the model we used has not fully captured the relationship in the data. The residual plot carries this important information.
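A minimal R sketch of a residual plot (the built-in cars data set is used here purely for illustration):

## Sketch: residuals versus fitted values
fit <- lm(dist ~ speed, data = cars)    # built-in data set, used only as an example
plot(fitted(fit), residuals(fit),
     xlab = "fitted values", ylab = "residuals", main = "Residual plot")
abline(h = 0, lty = 2)                  # reference line at zero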


Important assumptions of linear regression

1. The observations are independent.
2. A linear relationship exists between the independent and dependent variables.
3. The predictor variables are not correlated with each other (no "collinearity"); this is relevant when there is more than one predictor.
4. Homoscedasticity: the residuals have constant variance at every point in the linear model.
5. The residuals are normally distributed.



Example Calculations for Linear regression

Example 1 : unweighted fit with no errors

In a biological experiment, the concentration C of a chemical is known to vary linearly with time T. The quantity was measured every hour and the data is presented here.

\(\small{ T~(hours)~~~~~~~0~~~~~~1~~~~~~2~~~~~~3~~~~~~4~~~~~~~5~~~~~~~~6~~~~~~~~7~~~~~~~~~8 }\)

\(\small{ C~(ng/liter)~~~0.2~~~1.8~~~4.3~~~5.7~~~8.3~~~9.7~~~12.4~~~13.6~~~16.3 }\)

For this data, make a least square fit to a straight line of the form $C = a + bT$ to determine the slope b and intercept a.


To be consistent with the formulas we derived, we denote the independent variable T(time) as X and the dependent variable C(concentration) as Y.

Since the data does not include the standard deviations of the measurements on C, we assume that the variances on all the data points are equal. We write the data in a tabular format to calculate the required quantities:

Fitting the line to get the coefficients


The following table is computed from the data:

--------------------------------------------------------------------------
$~~~~~~~~~~~~~~~~~x_i~~~~~~~~~~~~~y_i~~~~~~~~~~~~~x_i^2~~~~~~~~~~~~~x_i y_i~~~~~~~~~~$
----------------------------------------------------------------------------
$~~~~~~~~~~~~~~~~~0~~~~~~~~~~~~~~0.2~~~~~~~~~~~0~~~~~~~~~~~~~~0.0 $
$~~~~~~~~~~~~~~~~~1~~~~~~~~~~~~~~1.8~~~~~~~~~~~1~~~~~~~~~~~~~~1.8 $
$~~~~~~~~~~~~~~~~~2~~~~~~~~~~~~~~4.3~~~~~~~~~~~4~~~~~~~~~~~~~~8.6 $
$~~~~~~~~~~~~~~~~~3~~~~~~~~~~~~~~5.7~~~~~~~~~~~9~~~~~~~~~~~~~~17.1 $
$~~~~~~~~~~~~~~~~~4~~~~~~~~~~~~~~8.3~~~~~~~~~~~16~~~~~~~~~~~~33.2 $
$~~~~~~~~~~~~~~~~~5~~~~~~~~~~~~~~9.7~~~~~~~~~~~25~~~~~~~~~~~~48.5 $
$~~~~~~~~~~~~~~~~~6~~~~~~~~~~~~~~12.4~~~~~~~~~36~~~~~~~~~~~~74.4 $
$~~~~~~~~~~~~~~~~~7~~~~~~~~~~~~~~13.6~~~~~~~~~49~~~~~~~~~~~~95.2 $
$~~~~~~~~~~~~~~~~~8~~~~~~~~~~~~~~16.3~~~~~~~~~64~~~~~~~~~~~~130.4 $
----------------------------------------------------------------------------
$sum=~~~~36~~~~~~~~~~~~~72.3~~~~~~~~204~~~~~~~~~~409.2 $
-----------------------------------------------------------------------------

number of data points = n = 9

\(\small{ \Delta~=~n \sum x_i^2 - (\sum x_i)^2~=~(9*204)-(36^2)~=~540 }\)

\(\small{ b_0~=~\dfrac{1}{\Delta} \left(\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i\right)~=~\dfrac{1}{540}(72.3*204 - 36*409.2)~=~0.03333 }\)

\(\small{ b_1~=~\dfrac{1}{\Delta} \left(n \sum x_i y_i - \sum x_i \sum y_i \right)~=~\dfrac{1}{540}(9*409.2 - 36*72.3)~=~2.0 }\)

With this result, we write the expression for the fitted line as

\(\small{\hat{y} = 0.03333 + 2 x }\)

Computing the $r^2$ parameter value

From the above table, we compute the following quantities:

$~~~~~~\hat{y_i}~=~0.03333 + 2*x_i,~~~~~~~~~~\overline{y}=\dfrac{72.3}{9}=8.0333,~~~~~~~\overline{x}=\dfrac{36}{9}=4.0$

For the fitted value, we compute $r^2$ by creating the following table:

----------------------------------------------------------------------------------------------------------------------------
$x_i~~~~~~y_i~~~~~~~~~~~~\hat{y_i}~~~~~~~~~~~~~~~(\hat{y_i}-\overline{y})~~~~~~~(\hat{y_i}-\overline{y})^2~~~~~(y_i - \overline{y})~~~~(y_i-\overline{y})^2~~~~~(x_i - \overline{x})^2$
-----------------------------------------------------------------------------------------------------------------------------
$0~~~~~~~0.2~~~~~~~~-0.03333~~~~-8.000~~~~~~~64.000~~~~~-7.833~~~~~~61.361~~~~~~~~~~16 $
$1~~~~~~~1.8~~~~~~~~~~~2.03333~~~~~-6.000~~~~~~~36.000~~~~~-6.233~~~~~~38.854~~~~~~~~~~~~9 $
$2~~~~~~~4.3~~~~~~~~~~~4.03333~~~~~-4.000~~~~~~~16.000~~~~~-3.733~~~~~~13.937~~~~~~~~~~~~4 $
$3~~~~~~~5.7~~~~~~~~~~~6.03333~~~~~-2.000~~~~~~~4.000~~~~~~~-2.333~~~~~~5.444~~~~~~~~~~~~~~1 $
$4~~~~~~~8.3~~~~~~~~~~~8.03333~~~~~~~~~~0.000~~~~~~~0.000~~~~~~~~~~~~0.266~~~~~~0.071~~~~~~~~~~~~~~0 $
$5~~~~~~~9.7~~~~~~~~~~~10.03333~~~~~~~~1.999~~~~~~~3.996~~~~~~~~~~~1.666~~~~~~2.777~~~~~~~~~~~~~~~1 $
$6~~~~~~~12.4~~~~~~~~~12.03333~~~~~~~~3.999~~~~~~~15.992~~~~~~~~~4.366~~~~~~19.068~~~~~~~~~~~~~4 $
$7~~~~~~~13.6~~~~~~~~~14.03333~~~~~~~~5.999~~~~~~~35.988~~~~~~~~~5.566~~~~~~30.988~~~~~~~~~~~~~9 $
$8~~~~~~~16.3~~~~~~~~~16.03333~~~~~~~~7.999~~~~~~~63.984~~~~~~~~~8.266~~~~~~68.337~~~~~~~~~~~~~16 $
----------------------------------------------------------------------------------------------------------------------------------
$36~~~~~72.3~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~240.0~~~~~~~~~~~~~~~~~~~~~~~~~240.84~~~~~~~~~~~60 $
--------------------------------------------------------------------------------------------------------------------------------

$$~~~ r^2 = \frac{\sum{(\hat{y_i} - \overline{y})^2} }{\sum{(y_i - \overline{y})^2}} = \dfrac{240.0}{240.84}~=~0.9965$$
Therefore, the best fit through data points is \(\small{C~= 0.0333 + 2*T }~~~\) with an $r^2=0.9965$

The $r^2$ value very close to 1.0 indicates a good fit in which the data points lie close to the fitted line.
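The hand calculation can be cross-checked with lm() in R (the function is described in a later section):

## Cross-check of Example 1 in R
hours <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
conc  <- c(0.2, 1.8, 4.3, 5.7, 8.3, 9.7, 12.4, 13.6, 16.3)

fit <- lm(conc ~ hours)
coef(fit)                 # intercept ~ 0.033, slope ~ 2.0
summary(fit)$r.squared    # ~ 0.9965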

The fitted line along with the data points is plotted in Figure-7 below:


Hypothesis testing the slope : t-test

Let us compute the standard error on the computed slope $b_1$ :

\(~~~~~~~~~~SE(b_1)~=~\dfrac{ \sqrt{\dfrac{\displaystyle\sum_i(y_i - \hat{y_i})^2}{n-2}}}{\sqrt{\displaystyle\sum_{i}(x_i - \overline{x})^2 } } ~=~ \dfrac{\sqrt{\dfrac{0.84}{9-2}}}{\sqrt{60}}~=~ 0.04472 \)

The t-statistic under the null hypothesis $H_0: \beta_1=0$ is computed as,

\(~~~~~~~~~~~~~t = \dfrac{b_1}{SE(b1)}~=~\dfrac{2.0}{0.04472}~=~44.72 \)

For a significance level $\alpha=0.05$, we compute the critical value of t for a two sided hypothesis test as,

\(~~~~~~t_{c}~=~t_{1-\alpha/2}(n-2)~=t_{0.975}(7)~= 2.3646 \)

Since \(~~t \gt t_c ~~\), we reject the null hypothesis $H_0:\beta_1=0$ at a significance level of 0.05 and conclude that there is a linear relationship between the two variables, namely the time T and the concentration C.
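The critical value and the p-value of this t-test can be obtained in R:

## Cross-check of the slope t-test in R
qt(0.975, df = 7)           # critical value, ~ 2.3646
2 * pt(-44.72, df = 7)      # two-sided p-value, extremely small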


Hypothesis testing the entire model: the F-test

To perform the F-test for the entire model, we need to compute the F statistic using the expression,

\(~~~~~~~~~~~~~~~~~F~=~\dfrac{SSR/df_{SSR}}{SSE/df_{SSE}}~~~~\)

We have,

\( SSR~=~\displaystyle\sum_i(\hat{y_i} - \overline{y})^2~=240.0,~~~~~~~~~~df_{SSR}~=~p~=~1 \)

\( SSE~=~\displaystyle\sum_i(y_i - \hat{y_i})^2~=0.84,~~~~~~df_{SSE}~=~n-p-1~=9-1-1~=~7 \)

Substituting, we get F-statistic as,

\(~~~~~~~F~=~\dfrac{(240/1)}{(0.84/7)}~=~2000~~\)

This is an F-distribution with $(p,n-p-1)$ degrees of freedom.

The critical value to a significance level of $\alpha=0.05$ is,

\(~~~~~~~~F_c~=~F_{1-\alpha}(p,n-p-1)~=F_{0.95}(1,7)~=~5.59\)

Since $~~~F \gt F_c,~~$ we reject the null hypothesis $H_0:\beta_0=\beta_1=0$ at a significance level of 0.05.

We conclude that the linear model we fitted explains a significant part of the variation in the dependent variable when compared to the null model, which assumes no relationship.

In fact, the F value of the test is so large that the corresponding p-value is extremely small.
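The critical value and the p-value of the F-test can be obtained in R:

## Cross-check of the F-test in R
qf(0.95, 1, 7)                         # critical value, ~ 5.59
pf(2000, 1, 7, lower.tail = FALSE)     # p-value, extremely small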


Example 2 : Weighted fit with errors

We will now perform linear regression on data in which each point has an associated error.

The data set has a variable y dependent on an independent variable x. For each value of y, the corresponding error value is given. This is a simulated data set with a linear relationship between y and x:

x = {1,2,3,4,5,6,7,8,9,10}

y = {16.5, 33.3, 35.5, 50.8, 57.9, 58.0, 73.4, 89.0, 97.9, 107.0}

errors = {3.0, 4.1, 5.0, 5.5, 6.0, 6.2, 1.0, 1.0, 1.0, 1.0}

To be consistent with the symbols used in the formulae, we rename the variable "errors" to $\sigma$.

Following a procedure similar to the one in Example-1 above, the following table is created:

-----------------------------------------------------------------------------------------------------------------------------
$x~~~~~~~~~~~y~~~~~~~~~~~~\sigma~~~~~~~~~~~\dfrac{1}{\sigma^2}~~~~~~~~~~~~~~\dfrac{x}{\sigma^2}~~~~~~~~~~~~~~\dfrac{y}{\sigma^2}~~~~~~~~~~~~~~~\dfrac{xy}{\sigma^2}~~~~~~~~~~~~~~\dfrac{x^2}{\sigma^2}$
-----------------------------------------------------------------------------------------------------------------------------
$1.0~~~~~~16.5~~~~~~~~3.0~~~~~~~~~0.1111 ~~~~~~0.1111~~~~~~~~1.8333~~~~~~~~~~~1.8333~~~~~~~~~~0.1111 $
$2.0~~~~~~33.3~~~~~~~~4.1~~~~~~~~~0.0595 ~~~~~~0.1189~~~~~~~~1.9809~~~~~~~~~~~3.9619~~~~~~~~~~0.2379 $
$3.0~~~~~~35.5~~~~~~~~5.0~~~~~~~~~0.0400 ~~~~~~0.1200~~~~~~~~1.4200~~~~~~~~~~~4.2600~~~~~~~~~~0.3600 $
$4.0~~~~~~50.8~~~~~~~~5.5~~~~~~~~~0.0330 ~~~~~~0.1322~~~~~~~~1.6793~~~~~~~~~~~6.7173~~~~~~~~~~0.5289 $
$5.0~~~~~~57.9~~~~~~~~6.0~~~~~~~~~0.0277 ~~~~~~0.1389~~~~~~~~1.6083~~~~~~~~~~~8.0417~~~~~~~~~~0.6944 $
$6.0~~~~~~58.0~~~~~~~~6.2~~~~~~~~~0.0260 ~~~~~~0.1561~~~~~~~~1.5088~~~~~~~~~~~9.0531~~~~~~~~~~0.9365 $
$7.0~~~~~~73.4~~~~~~~~1.0~~~~~~~~~1.0000 ~~~~~~7.0000~~~~~~~~73.4000~~~~~~~~~513.80~~~~~~~~~~49.000 $
$8.0~~~~~~89.0~~~~~~~~1.0~~~~~~~~~1.0000 ~~~~~~8.0000~~~~~~~~89.0000~~~~~~~~~712.00~~~~~~~~~~64.000 $
$9.0~~~~~~97.9~~~~~~~~1.0~~~~~~~~~1.0000 ~~~~~~9.0000~~~~~~~~97.9000~~~~~~~~~881.10~~~~~~~~~~81.000 $
$10.0~~~~107.0~~~~~~1.0~~~~~~~~~1.0000 ~~~~~~10.0000~~~~~~107.0000~~~~~~~1070.00~~~~~~~~100.00 $
-------------------------------------------------------------------------------------------------------------------------------
$55~~~~~~~~619.3~~~~~~~~~~~~~~~~~~~4.2975~~~~~~~~34.7773~~~~~377.3308~~~~~3210.767~~~~~296.869 $
-------------------------------------------------------------------------------------------------------------------------------

The coefficients can be computed by direct substitution of values from tables:



\(\small{ \Delta~=~\displaystyle\sum\dfrac{1}{\sigma_i^2} \sum\dfrac{x_i^2}{\sigma_i^2} - \displaystyle\sum\dfrac{x_i}{\sigma_i^2} \sum\dfrac{x_i}{\sigma_i^2} }~=~(4.2975 \times 296.869)~-~ (34.7773 \times 34.7773)\)
\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~=~ 66.3339 \)

\(\small{ b_0~=~~\dfrac{1}{\Delta} \left(\sum \dfrac{y_i}{\sigma_i^2} \sum \dfrac{x_i^2}{\sigma_i^2} - \sum \dfrac{x_i}{\sigma_i^2} \sum \dfrac{x_i y_i}{\sigma_i^2}\right) }\)

\(~~~~~=~ \dfrac{1}{66.3339} ( 377.3308 \times 296.869~-~34.7773 \times 3210.767) \)

\(~~~~~=~5.3669 \)

\(\small{ b_1~=~\dfrac{1}{\Delta}\left(\displaystyle\sum\dfrac{1}{\sigma_i^2} \displaystyle\sum \dfrac{x_i y_i}{\sigma_i^2}~~-~~\displaystyle\sum \dfrac{x_i}{\sigma_i^2} \displaystyle\sum \dfrac{y_i}{\sigma_i^2}\right) }\)

\(~~~~~=~ \dfrac{1}{66.3339} \left( (4.2975 \times 3210.767)~-~(34.7773 \times 377.3308) \right) \)

\(~~~~~=~10.2112 \)

With the values of slope and intercept obtained, we write the equation of the fitted line:

\(~~~~~~~~y~=5.3669~+~10.2112*x \)

Computing the $r^2$ parameter value

From the above table, we compute the following quantities:

$~~~~~~\hat{y_i}~=~5.3669 + 10.2112 x_i,~~~~~~~~~\overline{x}=\dfrac{55}{10}=5.5$

Since this is a weighted fit in which each computed quantity is weighted with the inverse of the variance on the y values, we define a weight $w_i$ for each $y_i$ as,

$~~~~~~~~~~~~~~~~~~~~~~~~~~w_i ~=~ \dfrac{1}{\sigma_i^2} $

Accordingly, we should use the weighted mean for the y values in the computation. The weighted mean for y is defined as,

$~~~~~~~~~~~~~~~~~~~~~~~~~{\overline{y}}_w~=~\displaystyle\dfrac{\sum_i{w_i y_i}}{\sum_iw_i} ~=~\dfrac{377.33}{4.297}~=~87.80~~~~~~~$(from table below)

In order to compute the $r^2$ value from the data, we create the following table:

-------------------------------------------------------------------------------------------------------------------------------------
$ x_i~~~~~~~~y_i~~~~~~~~~~\sigma_i~~~~~w_i=\dfrac{1}{\sigma^2}~~~~~~~\hat{y_i}~~~~~~w_i y_i~~w_i(\hat{y}_i - {\overline{y}}_w)^2~~~w_i (y_i - {\overline{y}}_w)^2~w_i (y_i - \hat{y_i})^2~~(x_i - \overline{x})^2$
-------------------------------------------------------------------------------------------------------------------------------------
$1.0~~~~~~16.5~~~~3.0~~~~~~~0.1111~~~~~15.572~~~~~1.833~~~~579.61~~~~~~~~~~564.91~~~~~~~~~~~~0.0944~~~~~~~~~~20.25 $
$2.0~~~~~~33.3~~~~4.1~~~~~~~0.0595~~~~~25.789~~~~~1.981~~~~228.78~~~~~~~~~~176.72~~~~~~~~~~~~3.3558~~~~~~~~~~12.25 $
$3.0~~~~~~35.5~~~~5.0~~~~~~~0.0400~~~~~36.000~~~~~1.420~~~~107.34~~~~~~~~~~109.42~~~~~~~~~~~~0.0100~~~~~~~~~~~~6.25 $
$4.0~~~~~~50.8~~~~5.5~~~~~~~0.0330~~~~~46.212~~~~~1.679~~~~~~57.18~~~~~~~~~~~~45.26~~~~~~~~~~~~0.6959~~~~~~~~~~~~2.25 $
$5.0~~~~~~57.9~~~~6.0~~~~~~~0.0278~~~~~56.423~~~~~1.608~~~~~~27.35~~~~~~~~~~~~24.84~~~~~~~~~~~~0.0606~~~~~~~~~~~~0.25 $
$6.0~~~~~~58.0~~~~6.2~~~~~~~0.0260~~~~~66.634~~~~~1.509~~~~~~11.66~~~~~~~~~~~~23.11~~~~~~~~~~~~1.9393~~~~~~~~~~~~0.25 $
$7.0~~~~~~73.4~~~~1.0~~~~~~~1.0000~~~~~76.845~~~~73.400~~~120.08~~~~~~~~~~207.46~~~~~~~~~~11.8701~~~~~~~~~~~~2.25 $
$8.0~~~~~~89.0~~~~1.0~~~~~~~1.0000~~~~~87.056~~~~89.000~~~~~~~~0.56~~~~~~~~~~~~~1.43~~~~~~~~~~~~3.7772~~~~~~~~~~~~6.25 $
$9.0~~~~~~97.9~~~~1.0~~~~~~~1.0000~~~~~97.268~~~~97.900~~~~~~89.57~~~~~~~~~~101.94~~~~~~~~~~~~0.3998~~~~~~~~~12.25 $
$10.0~~~~107.0~~1.0~~~~~~~1.0000~~~107.479~~107.000~~~~387.12~~~~~~~~~~368.50~~~~~~~~~~~~0.2293~~~~~~~~~20.25 $
---------------------------------------------------------------------------------------------------------------------------------
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~4.2973~~~~~~~~~~~~~~~~~~377.33~~~~~~1609.25~~~~~~1623.60~~~~~~~~~~22.43~~~~~~~~~~~~~~82.5 $
---------------------------------------------------------------------------------------------------------------------------------

With weights for each y value, the weighted sum square deviations are defined as,

\(SST~=~\displaystyle\sum_i w_i(y_i - \overline{y}_w)^2~=1623.60 \)

\(SSR~=~\displaystyle\sum_i w_i(\hat{y_i} - \overline{y}_w)^2~=1609.25 \)

\(SSE~=~\displaystyle\sum_i w_i(y_i - \hat{y})^2~=22.432 \)

\( ~~~ r^2~=~1-\dfrac{SSE}{SST}~=~0.9862 \)

Therefore, the fitted line through the data points (x, y) is \(~~y~=5.3669~+~10.2112*x~~\) with $r^2~=~0.9862$.
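This weighted fit can be cross-checked with lm() and its weights argument (the same data is analysed with the full R script in a later section):

## Cross-check of Example 2 in R
x   <- 1:10
y   <- c(16.5, 33.3, 35.5, 50.8, 57.9, 58.0, 73.4, 89.0, 97.9, 107.0)
err <- c(3.0, 4.1, 5.0, 5.5, 6.0, 6.2, 1.0, 1.0, 1.0, 1.0)

fit <- lm(y ~ x, weights = 1/err^2)
coef(fit)    # intercept ~ 5.37, slope ~ 10.19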

The $r^2$ value, being close to 1, indicates a good fit to the data. The fitted line along with the data points is plotted in Figure-8 below:




One interesting feature can be noticed in the above figure. In the linear regression with errors (the weighted fit), the fitted line stays closer to the points with smaller errors (larger weights) than to the points with larger errors (smaller weights). Thus, in the above figure, the fitted line passes closer to the last four data points, which have smaller errors compared to the previous points. Whereas, in the linear regression without errors, all the points are given equal importance.


Hypothesis testing the slope : t-test

Let us compute the standard error on the computed slope $b_1$. For a weighted fit, the expression for $SE(b_1)$ derived in the unweighted case no longer applies; with weights $w_i = 1/\sigma_i^2$ the corresponding expression (stated here without derivation) is

\(~~~~~~~~~~SE(b_1)~=~\sqrt{\dfrac{\displaystyle\sum_i w_i(y_i - \hat{y_i})^2}{n-2}}~\sqrt{\dfrac{\displaystyle\sum_i w_i}{\Delta}}~=~\sqrt{\dfrac{22.432}{10-2}}~\sqrt{\dfrac{4.2975}{66.3339}}~=~0.4262 \)

The t-statistic under the null hypothesis $H_0: \beta_1=0$ is computed as,

\(~~~~~~~~~~~~~t = \dfrac{b_1}{SE(b_1)}~=~\dfrac{10.2112}{0.4262}~=~23.96 \)

For a significance level $\alpha=0.05$, we compute the critical value of t for a two sided hypothesis test as,

\(~~~~~~t_{c}~=~t_{1-\alpha/2}(n-2)~=t_{0.975}(8)~= 2.306 \)

Since \(~~t \gt t_c ~~\), we reject the null hypothesis $H_0:\beta_1=0$ at a significance level of 0.05 and conclude that there is a linear relationship between the variables x and y. (The R output for this weighted fit, shown in the next section, reports the same standard error and t value up to rounding.)


Hypothesis testing the entire model: the F-test

To perform the F-test for the entire model, we need to compute the F statistic using the expression,

\(~~~~~~~~~~~~~~~~~F~=~\dfrac{SSR/df_{SSR}}{SSE/df_{SSE}}~~~~\)

We have,

\( SSR~=~\displaystyle\sum_i w_i(\hat{y_i} - \overline{y}_w)^2~=1609.25,~~~~~~~~~~df_{SSR}~=~p~=~1 \)

\( SSE~=~\displaystyle\sum_i w_i(y_i - \hat{y_i})^2~=22.432,~~~~~~df_{SSE}~=~n-p-1~=10-1-1~=~8 \)

Substituting, we get F-statistic as,

\(~~~~~~~F~=~\dfrac{(1609.25/1)}{(22.432/8)}~=~573.9 ~~\)

This is an F-distribution with $(p,n-p-1)$ degrees of freedom.

The critical value to a significance level of $\alpha=0.05$ is,

\(~~~~~~~~F_c~=~F_{1-\alpha}(p,n-p-1)~=F_{0.95}(1,8)~=~5.32\)

Since $~~~F \gt F_c,~~$ we reject the null hypothesis $H_0:\beta_0=\beta_1=0$ at a significance level of 0.05.

We conclude that the linear model we fitted explains a significant part of the variation in the dependent variable when compared to the null model, which assumes no relationship.

In fact, the F value of the test is so large that the corresponding p-value is extremely small (the R output below reports it as about $9.7\times10^{-9}$).
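The critical value and the p-value for this F-test can again be obtained in R:

## Cross-check of the F-test in R
qf(0.95, 1, 8)                          # critical value, ~ 5.32
pf(573.9, 1, 8, lower.tail = FALSE)     # p-value, extremely small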

Linear Regression in R


In R, the function lm() performs the linear regression of data between two vectors. It has options for weighted and unweighted fits, and the model formula can also describe fits to curves (such as polynomials) that are linear in their coefficients.

The function call with essential parameters is of the form,

 lm(formula, data, subset, weights) 

where,

 formula  ------>  the formula used for the fit. For a straight line fit, use "y~x"

 data  -------->  a data frame with columns for the x and y data

 subset  ------> an optional vector specifying a subset of the data to be used.

 weights  ------> an optional vector of weights to be used in the fit.

For many additional parameters and their usage, type help(lm) in R.



The R script below shows how to perform linear regression without errors (weights).

# Linear regression in R
# lm is used to fit linear models
#-----------------------------------------------------------
## Straight line fit without errors

# define 2 data sets as vectors
xdat = c(1,2,3,4,5,6,7,8,9,10)
#ydat = c(16.5, 23.3, 35.5, 45.8, 57.9, 68.0, 73.4, 89.0, 97.9, 107.0)
ydat = c(16.5, 33.3, 35.5, 50.8, 57.9, 58.0, 73.4, 89.0, 97.9, 107.0)

# create a data frame of vectors
df = data.frame(x=xdat, y=ydat)

## function call for lm(). Here, ydat~xdat represents a linear relationship "y = a0 + a1 x"
# (Use ydat~xdat-1 if you don't want an intercept)
res = lm(ydat~xdat, df)   ## call the lm() function to fit the linear model.
                          ## Output results are stored in the list "res"
# In the above formula, use y~x-1 if you don't want an intercept

# Print the summary of the entire analysis
print(summary(res))

# plot the results. First plot the data points and then x versus fitted y
plot(xdat, ydat, col="red", xlab="xdat", ylab="ydat", cex.lab=1.3,
     main="Linear regression without weights")
lines(xdat, res$fitted.values, col="blue")

## get slope, intercept and r-square values as strings.
inter_string   = paste("intercept = ", round(res$coefficients[[1]], digits=3) )
slope_string   = paste("slope = ", round(res$coefficients[[2]], digits=3) )
rsquare_string = paste("r-square = ", round(summary(res)$r.squared, digits=4))

## Write the above strings inside the plot.
## First plot normally, then decide the coordinates for centering the text in the graph
text(2.5, 90, inter_string, cex=1.3)
text(2.5, 85, slope_string, cex=1.3)
text(2.5, 80, rsquare_string, cex=1.3)

X11()

## Create the residual plot
## predicted values
predicted_values = res$fitted.values
## residuals
residuals = (ydat - predicted_values)
## residual plot
plot(predicted_values, residuals,
     ylim=c(2.0*min(residuals), 2.0*max(residuals)), main="Residual plot")
arrows(0.5*min(predicted_values), 0, 1.5*max(predicted_values), 0, lty=2, col="red", lwd=2)

## Access the fitted results as individual numbers.
## Uncomment the following line to see the residual plot, Q-Q plot, theoretical quantiles etc. (4 plots).
## These are plotted by the lm() function.
#plot(res)

# Uncomment the following line to print the intercept
#print(paste("intercept = ", res$coefficients[1]))

# Uncomment the following statement to print the slope
#print(paste("slope = ", res$coefficients[2]))

# Uncomment the following statement to print the fitted values
#print(res$fitted.values)

# Uncomment the following statement to see the 95% confidence interval
#confint(res, level=0.95)

# Uncomment the following line to get the r-square value. Note that it is part of the list "summary(res)"
#r.square = summary(res)$r.squared
##-----------------------------------------------------------------------------------

Running the above script in R prints the following output lines and graph on the screen:


Call:
lm(formula = ydat ~ xdat, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-8.785 -2.051  1.101  2.593  5.354

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.5267     2.9502    2.89   0.0202 *
xdat          9.7097     0.4755   20.42 3.46e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.319 on 8 degrees of freedom
Multiple R-squared:  0.9812,    Adjusted R-squared:  0.9788
F-statistic:   417 on 1 and 8 DF,  p-value: 3.458e-08


In the above output, we gather that

1. The Intercept and slope have been estimated with an r-square value of 0.9812.

2. The p-value of the t-test for the intercept is 0.0202 and the corresponding value for the slope t-test is $3.46 \times 10^{-8}$, indicating that the slope in particular is estimated with high statistical significance.

3. The p-value of the F-test for the overall fit is $3.458 \times 10^{-8}$, indicating that the linear model fits the data very well.

The following two figures are printed on the screen:









The R script below shows how to perform linear regression with errors (the weighted fit).

## Fit with errors (weights) on individual data points
## The weights are the inverse of the variances on the data points.
## In our data, we call the standard deviation the error.
## Hence the weights are the inverse of the square of the errors.

xdat = c(1,2,3,4,5,6,7,8,9,10)
ydat = c(16.5, 33.3, 35.5, 50.8, 57.9, 58.0, 73.4, 89.0, 97.9, 107.0)
errors = c(3.0, 4.1, 5.0, 5.5, 6.0, 6.2, 1.0, 1.0, 1.0, 1.0)

df = data.frame(x = xdat, y=ydat, err=errors)

res = lm(y~x, data = df, weights = 1/(df$err)^2)
print(summary(res))

# plot the results
plot(xdat, ydat, col="red", main="Linear regression with weights on Y-values", cex.lab=1.3)
lines(xdat, res$fitted.values, col="blue")

## draw error bars on individual data points
## From each y-value, draw a vertical arrow connecting ydat-errors to ydat+errors
## and make the arrow head 90 degrees. This gives a vertical error bar.
arrows(xdat, ydat-errors, xdat, ydat+errors, length=0.05, angle=90, code=3)

## get slope, intercept and r-square values as strings.
inter_string   = paste("intercept = ", round(res$coefficients[[1]], digits=3) )
slope_string   = paste("slope = ", round(res$coefficients[[2]], digits=3) )
rsquare_string = paste("r-square = ", round(summary(res)$r.squared, digits=4))

## Write the above strings inside the plot.
## First plot normally, then decide the coordinates for centering the text in the graph
text(2.5, 90, inter_string, cex=1.3)
text(2.5, 85, slope_string, cex=1.3)
text(2.5, 80, rsquare_string, cex=1.3)

X11()

## Create the residual plot
## predicted values
predicted_values = res$fitted.values
## residuals
residuals = (ydat - predicted_values)
## residual plot
plot(predicted_values, residuals,
     ylim=c(2.0*min(residuals), 2.0*max(residuals)), main="Residual plot", cex.lab=1.3)
arrows(0.5*min(predicted_values), 0, 1.5*max(predicted_values), 0, lty=2, col="red", lwd=2)

## Access the fitted results as individual numbers.
## Uncomment the following line to see the residual plot, Q-Q plot, theoretical quantiles etc. (4 plots).
## These are plotted by the lm() function.
#plot(res)

# Uncomment the following line to print the intercept
#print(paste("intercept = ", res$coefficients[1]))

# Uncomment the following statement to print the slope
#print(paste("slope = ", res$coefficients[2]))

# Uncomment the following statement to print the fitted values
#print(res$fitted.values)

# Uncomment the following statement to see the 95% confidence interval
#confint(res, level=0.95)

# Uncomment the following line to get the r-square value. Note that it is part of the list "summary(res)"
#r.square = summary(res)$r.squared

Running the above script in R prints the following output lines and graph on the screen:


Call:
lm(formula = y ~ x, data = df, weights = 1/(df$err)^2)

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-3.2741 -0.1968  0.2908  0.8525  2.1393

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.3680     3.5288   1.521    0.167
x            10.1866     0.4246  23.993 9.71e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.668 on 8 degrees of freedom
Multiple R-squared:  0.9863,    Adjusted R-squared:  0.9846
F-statistic: 575.6 on 1 and 8 DF,  p-value: 9.705e-09


We conclude the following from the output of this weighted fit:

1. The computed slope of 10.1866 is statistically significant (with a small p-value of $9.71 \times 10^{-9}$), while the intercept value has a low statistical significance (p-value 0.167) in the t-test for the coefficients.

2. The R-squared value of 0.9863 indicates a good weighted fit to the data points.

3. The extremely small p-value of $9.705\times 10^{-9}$ in the F-test shows the strength of the overall linear model assumed for the data.

4. The fitted line is closer to the data points with smaller errors.

The following two figures are printed on the screen: