Statistics for Machine Learning

Levenberg-Marquardt algorithm for nonlinear regression

The Levenberg-Marquardt method of nonlinear curve fitting combines the gradient descent and Gauss-Newton methods.

In the gradient descent algorithm, the sum of squared errors is minimized by moving along the steepest-descent direction.

If $\overline{b}$ is a column matrix of $p$ coefficients and $f(\overline{b})$ is the sum-of-squared-errors function to be minimized, then the gradient descent update step is given by (the constant factor of 2 in the gradient is absorbed into the step size $\lambda$), \begin{equation} { {\overline b}^{\,k+1}~=~ {\overline b}^{\,k}~-~\lambda {\overline \nabla} f ~=~ {\overline b}^{\,k}~-~\lambda~{\overline{J_r}}^T {\overline r} } \end{equation}
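As a minimal sketch of one such gradient-descent step, consider an illustrative model $f(\overline{b}, x) = b_0 e^{-b_1 x}$ (the model, the data, and the helper names `residuals` and `jacobian_r` are assumptions for illustration, not part of the algorithm itself):

```python
import numpy as np

# Illustrative model: f(b, x) = b0 * exp(-b1 * x); residuals r_i = y_i - f(x_i)
def residuals(b, x, y):
    return y - b[0] * np.exp(-b[1] * x)

def jacobian_r(b, x):
    # J_r[i, j] = d r_i / d b_j; the minus sign in column 0 comes from r = y - f
    e = np.exp(-b[1] * x)
    return np.column_stack([-e, b[0] * x * e])

x = np.linspace(0.0, 4.0, 20)
y = 2.0 * np.exp(-0.5 * x)          # noise-free data generated with b = (2.0, 0.5)
b = np.array([1.0, 1.0])            # initial guess
lam = 0.01                          # step size (learning rate)

r = residuals(b, x, y)
J = jacobian_r(b, x)
b_new = b - lam * (J.T @ r)         # one steepest-descent update
```

With a small enough step size, a single update along the negative gradient reduces the sum of squared errors.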

where $\overline{r}$ is a column matrix of error terms for each data point $(x_i, y_i)$. In the Gauss-Newton method, it is assumed that the sum-of-squared-errors function is locally quadratic. The sum of squared errors is then reduced by finding the minimum of this quadratic function. The algorithm for updating the step is given by, \begin{equation} {\overline b}^{k+1}~=~{\overline b}^{k}~-~ \left[ \displaystyle \left[\overline{J}_r(\overline{b}^k)\right]^{T} \overline{J}_r(\overline{b}^k) \right]^{-1} \displaystyle \left[\overline{J}_r(\overline{b}^k)\right]^{T} \overline{r}(\overline{b}^k) \end{equation} where $\overline{J_r}$ is the Jacobian matrix of the residuals and $\overline{r}$ is the column matrix of residuals.
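A single Gauss-Newton step can be sketched on the same illustrative model $f(\overline{b}, x) = b_0 e^{-b_1 x}$ (model, data, and helper names are assumptions for illustration); starting near the minimum, where the local-quadratic assumption holds well:

```python
import numpy as np

def residuals(b, x, y):
    return y - b[0] * np.exp(-b[1] * x)

def jacobian_r(b, x):
    e = np.exp(-b[1] * x)
    return np.column_stack([-e, b[0] * x * e])

x = np.linspace(0.0, 4.0, 20)
y = 2.0 * np.exp(-0.5 * x)          # data generated with b = (2.0, 0.5)
b = np.array([1.8, 0.45])           # guess reasonably close to the minimum

r = residuals(b, x, y)
J = jacobian_r(b, x)
# b_new = b - (J^T J)^{-1} J^T r, computed as a linear solve for stability
b_new = b - np.linalg.solve(J.T @ J, J.T @ r)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is the usual numerical practice.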


In the Levenberg-Marquardt method, the update step is written in such a way that when the parameter values are far from the minimum, the gradient descent method dominates, and when the parameter values are close to the minimum, the Gauss-Newton method dominates. Accordingly, the algorithm for the update step is given as,

\begin{equation} {\overline b}^{k+1}~=~{\overline b}^{k}~-~ (\overline{H} + \lambda \overline{I})^{-1} \overline{\nabla} f(\overline{b}^{k})~=~{\overline b}^{k}~-~( {\overline J}_r^T {\overline J}_r + \lambda \overline{I})^{-1} {\overline{J_r}}^T {\overline r} \tag{1} \end{equation}



where $\overline{I}$ is the identity matrix with the same dimensions as ${\overline{H}}$.

In the above update formula, when the parameter $\lambda$ is small, the Gauss-Newton method dominates, and when the value of $\lambda$ is large, the gradient descent method dominates the update.
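These two limits can be checked numerically. The sketch below (same illustrative model as before; helper names are assumptions) shows that the Levenberg-Marquardt step reduces to the Gauss-Newton step as $\lambda \to 0$ and to a small gradient-descent step, approximately $\overline{J}_r^T \overline{r} / \lambda$, as $\lambda \to \infty$:

```python
import numpy as np

def residuals(b, x, y):
    return y - b[0] * np.exp(-b[1] * x)

def jacobian_r(b, x):
    e = np.exp(-b[1] * x)
    return np.column_stack([-e, b[0] * x * e])

x = np.linspace(0.0, 4.0, 20)
y = 2.0 * np.exp(-0.5 * x)
b = np.array([1.0, 1.0])
r = residuals(b, x, y)
J = jacobian_r(b, x)
JtJ, Jtr = J.T @ J, J.T @ r
I = np.eye(2)

def lm_step(lam):
    # delta solves (J^T J + lam * I) delta = J^T r; the update is b - delta
    return np.linalg.solve(JtJ + lam * I, Jtr)

gn_step = np.linalg.solve(JtJ, Jtr)          # pure Gauss-Newton step
small_lam_matches_gn = np.allclose(lm_step(1e-10), gn_step)
large_lam_matches_gd = np.allclose(lm_step(1e8), Jtr / 1e8)
```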


The update procedure is as follows:

1. Starting from an initial guess, at each step compute an updated value of each parameter $b_i$ according to equation (1) given above.

2. Compute the error terms $r_i$ with the newly updated $b_i$.

3. If the error has increased compared to the previous step, revert to the $b_i$ values of the previous step and increase $\lambda$ by a large factor (e.g., a factor of 10). With this increased value of $\lambda$, go to step (1) and repeat.

4. If the error has decreased compared to the previous value, accept the newly updated values of $b_i$ and decrease the value of $\lambda$ by a factor of 10, or whatever factor was chosen before.
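The four steps above can be sketched as a complete loop. This is a minimal illustration, not a production implementation; the model $f(\overline{b}, x) = b_0 e^{-b_1 x}$, the data, and all helper names are assumptions for the example:

```python
import numpy as np

def residuals(b, x, y):
    return y - b[0] * np.exp(-b[1] * x)

def jacobian_r(b, x):
    e = np.exp(-b[1] * x)
    return np.column_stack([-e, b[0] * x * e])

def levenberg_marquardt(b, x, y, lam=1e-2, factor=10.0, n_iter=50):
    sse = np.sum(residuals(b, x, y) ** 2)
    for _ in range(n_iter):
        r = residuals(b, x, y)
        J = jacobian_r(b, x)
        # Step 1: candidate update from equation (1)
        delta = np.linalg.solve(J.T @ J + lam * np.eye(len(b)), J.T @ r)
        b_try = b - delta
        # Step 2: recompute the error with the candidate parameters
        sse_try = np.sum(residuals(b_try, x, y) ** 2)
        if sse_try >= sse:
            lam *= factor        # Step 3: reject; lean toward gradient descent
        else:
            b, sse = b_try, sse_try
            lam /= factor        # Step 4: accept; lean toward Gauss-Newton
    return b

x = np.linspace(0.0, 4.0, 30)
y = 2.0 * np.exp(-0.5 * x)           # data generated with b = (2.0, 0.5)
b_fit = levenberg_marquardt(np.array([1.0, 1.0]), x, y)
```

Note that a rejected step leaves $b_i$ untouched and only raises $\lambda$, so the next candidate step is shorter and closer to the steepest-descent direction.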

At each step of this algorithm, if the error $r_i$ increases after an update, we increase $\lambda$ by a large factor. This lets the gradient descent term numerically dominate the Gauss-Newton term (i.e., the $\lambda \overline{I}$ term of equation (1) dominates). On the other hand, if the error decreases after an update, we decrease $\lambda$ by the same factor, and hence let the Gauss-Newton term (the ${\overline J}_r^T {\overline J}_r$ term of equation (1)) dominate the gradient descent term.

In the above algorithm, the damping term $\lambda \overline{I}$ treats every parameter direction identically. It is advantageous to scale the step size according to the curvature, which can be achieved by using the second derivatives contained in the Hessian matrix. The logic is as follows:

The second derivative of a function at a point gives the rate of change of the slope at that point. A positive second derivative says that the function is curving upwards and a negative second derivative says that the function is curving downwards. The magnitude of the second derivative gives the extent of the curvature: in regions where the function curves sharply, the magnitude of the second derivative is large, and in regions where the function curves less (i.e., is getting flatter), the magnitude is smaller. Therefore, by multiplying the constant $\lambda$ by the second derivative and taking the inverse, we ensure that there is a larger movement along the directions where the curvature is small and a smaller movement along the directions where the curvature is large.

This important idea has been incorporated into the Levenberg-Marquardt algorithm. Since the diagonal elements of the Hessian matrix contain the second derivatives of the function $f$, the formula for the update was modified to

\begin{equation} {\overline b}^{k+1}={\overline b}^{k}- \left(\overline{H} + \lambda \, \mathrm{diag}(\overline{H})\right)^{-1} \overline{\nabla} f(\overline{b}^{k})~=~{\overline b}^{k}~-~\left( {\overline J}_r^T {\overline J}_r + \lambda \, \mathrm{diag}({\overline J}_r^T {\overline J}_r)\right)^{-1} {\overline{J_r}}^T {\overline r} \tag{2} \end{equation}

The above update rule ensures a larger step along directions with low curvature (where the function is flatter) and a smaller step along directions with high curvature (where the function is steeper).
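A single curvature-scaled step of equation (2) can be sketched as follows; as before, the model, data, helper names, and the choice $\lambda = 10$ are assumptions for illustration:

```python
import numpy as np

def residuals(b, x, y):
    return y - b[0] * np.exp(-b[1] * x)

def jacobian_r(b, x):
    e = np.exp(-b[1] * x)
    return np.column_stack([-e, b[0] * x * e])

def marquardt_step(b, x, y, lam):
    r = residuals(b, x, y)
    J = jacobian_r(b, x)
    JtJ = J.T @ J
    # Damping scaled by the diagonal of J^T J (approximate second derivatives):
    # large curvature -> stronger damping -> smaller step in that direction
    damped = JtJ + lam * np.diag(np.diag(JtJ))
    return b - np.linalg.solve(damped, J.T @ r)

x = np.linspace(0.0, 4.0, 20)
y = 2.0 * np.exp(-0.5 * x)           # data generated with b = (2.0, 0.5)
b = np.array([1.0, 1.0])
b_new = marquardt_step(b, x, y, lam=10.0)
```

Because the damped matrix is positive definite, the step is always a descent direction; the diagonal scaling merely reshapes it so that flat directions move farther than steep ones.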