Statistics for Machine Learning

Mathematical concepts required for nonlinear regression

To understand the algorithms for nonlinear least squares minimization, we need an understanding of the following topics:

1. Taylor series expansion of a continuous function

We can expand a continuous function f(x) around a point x=a in terms of its derivatives using the Taylor Series:

\begin{equation} f(x)~=~\displaystyle \sum_{n=0}^{\infty} \dfrac{f^{(n)}(a)}{n!} (x-a)^n~=~f(a)~+~\dfrac{f'(a)}{1!} (x-a)~+~ \dfrac{f''(a)}{2!}(x-a)^2~+~\dfrac{f'''(a)}{3!} (x-a)^3~+~\cdots \end{equation} If we expand the Taylor series around x=0, we get \begin{equation} f(x)~=~\displaystyle \sum_{n=0}^{\infty} \dfrac{f^{(n)}(0)}{n!} x^n~=~f(0)~+~\dfrac{f'(0)}{1!} x~+~ \dfrac{f''(0)}{2!}x^2~+~\dfrac{f'''(0)}{3!} x^3~+~\cdots \end{equation} When x is very close to a, the deviation (x-a) is very small and we can neglect the second- and higher-order terms of the Taylor series to write \begin{equation} f(x)~\approx~f(a)~+~f'(a) (x-a) \end{equation} This is the equation of the tangent to the curve f(x) at x=a. For a small region around x=a, we can therefore approximate the function f(x) by the straight line given by the above tangent equation. This important idea will be used in the algorithms for nonlinear least squares minimization.
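The tangent-line approximation can be checked numerically. Below is a minimal Python sketch using NumPy (the helper name `tangent_approx` is our own, introduced for illustration), comparing f(x) = e^x with its first-order Taylor expansion around a = 0, which is 1 + x:

```python
import numpy as np

def tangent_approx(f, fprime, a, x):
    """First-order Taylor approximation: f(a) + f'(a) * (x - a)."""
    return f(a) + fprime(a) * (x - a)

# For f(x) = exp(x), f'(x) = exp(x), so the tangent at a = 0 is 1 + x.
a = 0.0
for x in [0.01, 0.1, 0.5]:
    exact = np.exp(x)
    approx = tangent_approx(np.exp, np.exp, a, x)
    # The error shrinks rapidly as x approaches a, since the neglected
    # terms are of order (x - a)^2 and higher.
    print(f"x={x}: exact={exact:.6f}, tangent={approx:.6f}, error={exact - approx:.2e}")
```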

2. The gradient matrix of a function

Let $f, f_1, f_2, f_3,....$ be functions of the variables $(x_1,x_2,x_3,...,x_n)$.

Then, the gradient $\nabla f$ of f is a vector whose components are the first partial derivatives of f with respect to the variables:

\(~~~~~~\displaystyle \nabla f~=~\left(\dfrac{\partial f}{\partial x_1}, \dfrac{\partial f}{\partial x_2}, \dfrac{\partial f}{\partial x_3},......,\dfrac{\partial f}{\partial x_n}\right) \)

The gradient vector can also be represented as a column matrix:

\begin{equation} Gradient~matrix ~=~\overline{\nabla}f~=~ \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \dfrac{\partial f}{\partial x_3} \\ .... \\ .... \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} ~=~ \begin{bmatrix} F_1(x_1, x_2, x_3,...,x_n) \\ F_2(x_1, x_2, x_3,...,x_n) \\ F_3(x_1, x_2, x_3,...,x_n) \\ ....\\ ....\\ F_n(x_1, x_2, x_3,...,x_n) \end{bmatrix} \end{equation} where $F_1$,$F_2$,...,$F_n$ are functions of the variables $x_1,x_2,x_3,...,x_n$.
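In practice the gradient is often approximated numerically by central differences. A minimal Python sketch (the helper `numerical_gradient` is our own illustration, not a library routine):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        # (f(x + h*e_i) - f(x - h*e_i)) / (2h) approximates df/dx_i
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Example: f(x1, x2) = x1^2 + 3*x1*x2, whose analytic gradient is
# (2*x1 + 3*x2, 3*x1); at (1, 2) this is (8, 3).
f = lambda x: x[0]**2 + 3*x[0]*x[1]
g = numerical_gradient(f, [1.0, 2.0])
```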

3. The Hessian matrix of a function

The Hessian matrix of a function $f(x_1,x_2,x_3,....,x_n)$ is a matrix whose elements are the second partial derivatives of the function with respect to all the variables:

\begin{equation} Hessian~matrix~=~\overline{H}_f~=~\nabla^2 f~= \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_3} & .......& \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \dfrac{\partial^2 f}{\partial x_2 \partial x_3} & .......& \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_3 \partial x_1} & \dfrac{\partial^2 f}{\partial x_3 \partial x_2} & \dfrac{\partial^2 f}{\partial x_3^2} & .......& \dfrac{\partial^2 f}{\partial x_3 \partial x_n} \\ .........&............&...........&.............&.........\\ .........&............&...........&.............&.........\\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \dfrac{\partial^2 f}{\partial x_n \partial x_3} & .......& \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \end{equation}
The Hessian matrix is always a square matrix whose size equals the number of variables of the function.
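The Hessian can also be approximated by finite differences of the function values. A short Python sketch (the helper `numerical_hessian` is our own name for this illustration):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Finite-difference approximation: H[i, j] = d^2 f / (dx_i dx_j)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            # Standard four-point formula for mixed second derivatives.
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

# Example: f(x1, x2) = x1^2 * x2 has analytic Hessian
# [[2*x2, 2*x1], [2*x1, 0]]; at (1, 3) this is [[6, 2], [2, 0]].
f = lambda x: x[0]**2 * x[1]
H = numerical_hessian(f, [1.0, 3.0])
```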

4. The Jacobian matrix of a set of functions

Consider a set of n equations in n variables:

$ y_1~=~f_1(x_1, x_2, x_3,....,x_n)$
$ y_2~=~f_2(x_1, x_2, x_3,....,x_n) $
$ y_3~=~f_3(x_1, x_2, x_3,....,x_n) $
$ .... .... ... ... ... ... $
$ .... .... ... ... ... ... $
$ y_n~=~f_n(x_1, x_2, x_3,....,x_n) $

The Jacobian matrix of the above set of functions is the matrix of first partial derivatives given by

\begin{equation} Jacobian~matrix ~=~\overline{J}~=~ \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_1}{\partial x_2} & \dfrac{\partial y_1}{\partial x_3} & ... & \dfrac{\partial y_1}{\partial x_n} \\ \dfrac{\partial y_2}{\partial x_1} & \dfrac{\partial y_2}{\partial x_2} & \dfrac{\partial y_2}{\partial x_3} & ... & \dfrac{\partial y_2}{\partial x_n} \\ \dfrac{\partial y_3}{\partial x_1} & \dfrac{\partial y_3}{\partial x_2} & \dfrac{\partial y_3}{\partial x_3} & ... & \dfrac{\partial y_3}{\partial x_n} \\ ....&....&....&....&....\\ ....&....&....&....&....\\ \dfrac{\partial y_n}{\partial x_1} & \dfrac{\partial y_n}{\partial x_2} & \dfrac{\partial y_n}{\partial x_3} & ........ & \dfrac{\partial y_n}{\partial x_n} \end{bmatrix} \end{equation}
Note: since $y_i~=~f_i(x_1,...,x_n)$, the entries of the above matrix can also be written as $\dfrac{\partial f_1}{\partial x_1}$, $\dfrac{\partial f_1}{\partial x_2}$, etc.
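A Jacobian matrix can be approximated column by column with central differences. A Python sketch (the helper `numerical_jacobian` is our own illustration), using the familiar polar-to-Cartesian map as the test case:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference approximation: J[i, j] = d y_i / d x_j."""
    x = np.asarray(x, dtype=float)
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        # Each perturbation of x_j fills one column of J.
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

# Polar -> Cartesian: y1 = r*cos(t), y2 = r*sin(t).
# Analytic Jacobian: [[cos t, -r sin t], [sin t, r cos t]];
# at (r, t) = (2, 0) this is [[1, 0], [0, 2]], with determinant r = 2.
f = lambda x: [x[0]*np.cos(x[1]), x[0]*np.sin(x[1])]
J = numerical_jacobian(f, [2.0, 0.0])
```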


The determinant of the Jacobian matrix (also, somewhat confusingly, called the "Jacobian") is symbolically written as
\[ \begin{vmatrix} \overline{J} \end{vmatrix} ~=~J~=~ \begin{vmatrix} \dfrac{\partial (y_1, y_2, y_3, ...., y_n)}{ \partial (x_1, x_2, x_3, ...., x_n)} \end{vmatrix} \]


The Jacobian matrix can also be written in vector form:

Let $\overline{x}~=~\{x_1, x_2, x_3, ..., x_n \} $

We can symbolically write,
\[ \overline{y}(\overline{x})~=~ \begin{bmatrix} f_1(x_1, x_2, x_3,..., x_n) \\ f_2(x_1, x_2, x_3,..., x_n) \\ f_3(x_1, x_2, x_3,..., x_n) \\ ................... \\ ................... \\ f_n(x_1, x_2, x_3,..., x_n) \end{bmatrix} ~=~ \begin{bmatrix} f_1(\overline{x}) \\ f_2(\overline{x}) \\ f_3(\overline{x}) \\ .... \\ ....\\ f_n(\overline{x}) \\ \end{bmatrix} ~= \overline{f}(\overline{x}) \]

With this notation, the Jacobian matrix can be written as,

$ \overline{J}~=~ \dfrac{d(\overline{y})}{d\overline{x}} ~=~ \dfrac{d(\overline{f}(\overline{x}))}{d\overline{x}} $


The total differential of the vector $\overline{y}$ is,

$ d\overline{y}~=~\dfrac{d(\overline{y})}{d \overline{x}} d \overline{x}~=~\overline{J} d \overline{x} $

which gives, \[ dy_1~dy_2~dy_3.....dy_n ~ = \begin{vmatrix} \overline{J} \end{vmatrix} dx_1~dx_2~dx_3~.....dx_n \]
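The relation between the volume elements can be checked numerically. For the polar-to-Cartesian map, the Jacobian determinant is $|J| = r$, so $dx\,dy = r\,dr\,dt$; integrating this over the unit disk should reproduce the area $\pi$. A small Python sketch under those assumptions (the grid construction is ours):

```python
import numpy as np

# Midpoint-rule integration of |J| = r over 0 <= r <= 1, 0 <= t <= 2*pi.
# Since dx dy = |J| dr dt, the result should approximate the unit disk area, pi.
n = 400
dr, dt = 1.0 / n, 2 * np.pi / n
r_mid = (np.arange(n) + 0.5) * dr      # midpoints in r
t_mid = (np.arange(n) + 0.5) * dt      # midpoints in t
R, _ = np.meshgrid(r_mid, t_mid)
jac_det = R                            # |J| = r for x = r cos t, y = r sin t
area = np.sum(jac_det) * dr * dt
```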


A Jacobian matrix can also be formed from a set of functions of several variables, taken with respect to any subset of those variables:

For example, let f(u,v,w) and g(u,v,w) be two functions of the three variables u, v, and w. We can define the following Jacobian matrices:

\[ \dfrac{\partial(f,g)}{\partial(u,v)}~=~ \begin{bmatrix} \dfrac{\partial f}{\partial u} & \dfrac{\partial f}{\partial v} \\ \dfrac{\partial g}{\partial u} & \dfrac{\partial g}{\partial v} \end{bmatrix} \] \[ \dfrac{\partial(f,g)}{\partial(v,w)}~=~ \begin{bmatrix} \dfrac{\partial f}{\partial v} & \dfrac{\partial f}{\partial w} \\ \dfrac{\partial g}{\partial v} & \dfrac{\partial g}{\partial w} \end{bmatrix} \]


The relation between the gradient, Hessian, and Jacobian matrices

We have seen the gradient matrix $\overline{\nabla} f$ for a function $f(x_1,x_2,x_3,...,x_n)$ defined as,

\begin{equation} Gradient~matrix ~=~\overline{\nabla}f~=~ \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \dfrac{\partial f}{\partial x_3} \\ .... \\ .... \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} ~=~ \begin{bmatrix} F_1(x_1, x_2, x_3,...,x_n) \\ F_2(x_1, x_2, x_3,...,x_n) \\ F_3(x_1, x_2, x_3,...,x_n) \\ ....\\ ....\\ F_n(x_1, x_2, x_3,...,x_n) \end{bmatrix} \end{equation} Now we construct the Jacobian of this gradient:

\begin{equation} J_{\overline{\nabla} f }~=~ \begin{bmatrix} \dfrac{\partial F_1}{\partial x_1} & \dfrac{\partial F_1}{\partial x_2} & \dfrac{\partial F_1}{\partial x_3} & .... & \dfrac{\partial F_1}{\partial x_n} \\ \dfrac{\partial F_2}{\partial x_1} & \dfrac{\partial F_2}{\partial x_2} & \dfrac{\partial F_2}{\partial x_3} & .... & \dfrac{\partial F_2}{\partial x_n} \\ \dfrac{\partial F_3}{\partial x_1} & \dfrac{\partial F_3}{\partial x_2} & \dfrac{\partial F_3}{\partial x_3} & .... & \dfrac{\partial F_3}{\partial x_n} \\ .... & ....& ....& ....& \\ .... & ....& ....& ....& \\ \dfrac{\partial F_n}{\partial x_1} & \dfrac{\partial F_n}{\partial x_2} & \dfrac{\partial F_n}{\partial x_3} & .... & \dfrac{\partial F_n}{\partial x_n} \\ \end{bmatrix} = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_3} & .......& \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \dfrac{\partial^2 f}{\partial x_2 \partial x_3} & .......& \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_3 \partial x_1} & \dfrac{\partial^2 f}{\partial x_3 \partial x_2} & \dfrac{\partial^2 f}{\partial x_3^2} & .......& \dfrac{\partial^2 f}{\partial x_3 \partial x_n} \\ ....&....&....&....&....\\ ....&....&....&....&....\\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \dfrac{\partial^2 f}{\partial x_n \partial x_3} & .......& \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} =\overline{H}_f \end{equation}

ie., $~~~~~~\overline{H}_f~=~J_{\overline{\nabla} f }$

Thus, the Hessian of a multivariate function f is the Jacobian of its gradient.

We will use this important result in nonlinear regression analysis.
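This identity is easy to verify numerically: differentiating the analytic gradient with central differences should reproduce the analytic Hessian. A Python sketch (the helper `jacobian` and the test function are our own choices for illustration):

```python
import numpy as np

def jacobian(F, x, h=1e-6):
    """Central-difference Jacobian of a vector-valued function F."""
    x = np.asarray(x, dtype=float)
    m, n = len(F(x)), x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        J[:, j] = (F(x + e) - F(x - e)) / (2 * h)
    return J

# f(x1, x2) = x1^3 + x1*x2^2 has analytic gradient [3*x1^2 + x2^2, 2*x1*x2]
# and analytic Hessian [[6*x1, 2*x2], [2*x2, 2*x1]].
grad = lambda x: np.array([3*x[0]**2 + x[1]**2, 2*x[0]*x[1]])

# The Jacobian of the gradient at (1, 2) should match the Hessian [[6, 4], [4, 2]].
H = jacobian(grad, np.array([1.0, 2.0]))
```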


Properties of Hessian matrix

1. The Hessian matrix is a symmetric matrix consisting of the second partial derivatives of a function (symmetry holds whenever the second partial derivatives are continuous, so that the mixed partials are equal).

2. If the gradient $\overline\nabla f$ of a function f is zero at some point $ \overline{x}=\{x_1,x_2,...,x_n\}$, then $\overline{x}$ is said to be a critical point (also called a stationary point) of f.

3. The eigenvalues of the Hessian matrix are used to identify the local maxima and minima of a function.

If the eigenvalues of the Hessian matrix at the critical point are all positive (the matrix is then called positive definite), the critical point is a local minimum of the function.

If the eigenvalues of the Hessian matrix at the critical point are all negative (the matrix is then called negative definite), the critical point is a local maximum of the function.

If the Hessian matrix is indefinite (i.e., it has both positive and negative eigenvalues), the critical point is a saddle point.
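The classification above can be sketched in a few lines of Python (the helper `classify_critical_point` is our own illustrative name):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of its Hessian."""
    # Hessians are symmetric, so eigvalsh (for symmetric matrices) applies.
    eig = np.linalg.eigvalsh(np.asarray(H, dtype=float))
    if np.all(eig > tol):
        return "minimum"      # positive definite
    if np.all(eig < -tol):
        return "maximum"      # negative definite
    return "saddle"           # indefinite: mixed-sign eigenvalues

# f(x1, x2) = x1^2 - x2^2 has a critical point at the origin with
# Hessian [[2, 0], [0, -2]], whose eigenvalues 2 and -2 have mixed signs.
print(classify_critical_point([[2.0, 0.0], [0.0, -2.0]]))  # saddle
```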