Statistics for Machine Learning

Gradient descent method for nonlinear regression

Let us assume that we fit a nonlinear multivariate function g to a given data set using nonlinear least-squares minimization.

We write the model equations as,

\begin{equation} y_i~=~g(b_1,b_2,b_3,....,b_p;~~x_{i1}, x_{i2}, x_{i3},.....,x_{im})~+~e_i \end{equation}

$~~~~~~~~(1)$



where g is the nonlinear function to be fitted to the data, $b_1,b_2,b_3,....,b_p$ are the parameters of the model, and $(x_i, y_i)$ are the data points. Here x denotes the independent variables and y the dependent variable. As we see, there are m independent variables in the model.

There are n data points, so that $i=1,2,3,...,n$. The model has m independent variables $x_{i1}, x_{i2},....,x_{im}$ for data point i.

In the method of nonlinear least-squares minimization, the sum-of-squares deviation function is minimized with respect to the parameters of the model.

We write the sum-of-squares deviation (error) as,

\begin{equation} \displaystyle\sum_{i=1}^n e_i^2~=~\displaystyle\sum_{i=1}^n [y_i~-~g(b_1,b_2,b_3,....,b_p;~~x_{i1},x_{i2},x_{i3},....,x_{im})]^2 \end{equation}

$~~~~~~~~~(2)$
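As a concrete illustration, the sum-of-squares error of equation (2) can be computed directly. The exponential model, its parameter values, and the synthetic data below are assumptions made only for this example.

```python
import numpy as np

# Illustrative model (an assumption, not from the text):
#   g(b1, b2; x) = b1 * exp(-b2 * x), with a single independent variable
def g(b, x):
    return b[0] * np.exp(-b[1] * x)

def sum_square_error(b, x, y):
    """Sum of squared deviations, equation (2)."""
    residuals = y - g(b, x)          # e_i = y_i - g(...)
    return np.sum(residuals ** 2)

# Synthetic data generated from the model with known parameters b = (2, 1)
x = np.linspace(0.0, 2.0, 20)
y = g([2.0, 1.0], x)

print(sum_square_error([2.0, 1.0], x, y))   # exactly zero at the true parameters
print(sum_square_error([2.5, 1.0], x, y))   # positive away from them
```

Any candidate parameter vector can be scored this way; minimizing this score over b is the goal of the procedure developed below.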





Let

\begin{equation} r_i~=~[y_i~-~g(b_1,b_2,b_3,....,b_p;~~x_{i1},x_{i2},x_{i3},....,x_{im})] \end{equation}

$~~~~~~~~~(3)$



Now we introduce the function to be minimized with respect to each parameter $b_j$ as,

\begin{equation} f(b_1,b_2,b_3,...,b_p)~=~\dfrac{1}{2} \displaystyle\sum_{i=1}^n r_i^2 \end{equation}





where the constant $\dfrac{1}{2}$ multiplying the function is introduced so that it cancels the factor 2 that appears when we take partial derivatives. Indeed, applying the chain rule to equation (3) and the definition of f,

\begin{equation} \dfrac{\partial f}{\partial b_j}~=~\displaystyle\sum_{i=1}^n r_i \dfrac{\partial r_i}{\partial b_j}~=~-\displaystyle\sum_{i=1}^n r_i \dfrac{\partial g}{\partial b_j} \end{equation}

In order to minimize the function f with respect to the parameters, we set up an iterative procedure.

Starting with some initial value for each parameter, we update the parameters in the direction of decreasing f through successive iterations until we hit the bottom of the surface in the multidimensional space of parameters b.


The gradient descent method

For a continuous function f with p variables $b_1, b_2, b_3, ...,b_p$, the gradient vector ${\displaystyle\overline \nabla f}$ is a vector in p-dimensional space whose components are the partial derivatives of f with respect to the b variables:

${\displaystyle\overline \nabla f}~=~\displaystyle{(\dfrac{\partial f}{\partial b_1}, \dfrac{\partial f}{\partial b_2}, \dfrac{\partial f}{\partial b_3}, ......,\dfrac{\partial f}{\partial b_p} )} $

The gradient vector of a function f with p variables, computed at a point, points in the direction of maximum increase (steepest ascent) of f from that point in the p-dimensional space, and its magnitude gives the rate of that increase.

Therefore, the partial derivative $\dfrac{\partial f}{\partial b_j}$ gives the rate of change of f along the $j^{th}$ axis.

Starting from an initial value $b_j^1$ for $b_j$, the gradient descent algorithm changes the value in the next iteration by an amount proportional to the partial derivative $\dfrac{\partial f}{\partial b_j}$ computed with the values of $b_1, b_2, ..., b_p$ from the previous iteration. Thus, the value of $b_j$ is updated from iteration k to k+1 as,

\begin{equation} b_j^{k+1}~=~ b_j^k~-~\lambda \dfrac{\partial f}{\partial b_j} \end{equation}

$~~~~~~~~~(4)$




where $\lambda$ is a small positive constant (the step size, often called the learning rate).

The gradient at the iteration point k represents the direction of steepest ascent at that point. In order to move toward the minimum (steepest descent), a negative sign is attached to the gradient in our iteration from $b_j^k$ to $b_j^{k+1}$. Hence the name "gradient descent algorithm".

Iterative procedure for estimating the coefficients

1. Derive expressions for the partial derivatives $\dfrac{\partial f}{\partial b_j}$. Set initial values for the p coefficients: let them be called $b_1^1, b_2^1, b_3^1,....., b_p^1$.

2. Set a value for $\lambda$. For example, let $\lambda = 0.1$. For the initial values of b, compute the partial derivatives using the expressions derived in step 1.

3. Update each $b_j$ to the next step k=2 using equation (4) above:
$ b_j^{2}~=~ b_j^1~-~\lambda \dfrac{\partial f}{\partial b_j},~~~~~~~~~$ where $\dfrac{\partial f}{\partial b_j}$ is computed with the initial values of the b coefficients.

4. Carry out the next iteration k=3 for each $b_j$:

$ b_j^{3}~=~ b_j^2~-~\lambda \dfrac{\partial f}{\partial b_j},~~~~~~~~~$ where $\dfrac{\partial f}{\partial b_j}$ is computed with the values $b_j^2$ from the second iteration.

5. March the solution forward by computing $b_j^4, b_j^5, ....$

6. After each iteration, for each coefficient $b_j$, compute the difference of its value from the previous iteration, i.e., compute $b_j^{k+1} - b_j^k$.

If $~~~|b_j^{k+1} - b_j^k| \lt \epsilon~~~$ for every j, where $\epsilon$ is a pre-decided threshold difference, stop the iteration and take the final values of $b_j$ to be $b_j^{k+1}$. (For example, $\epsilon~=~ 10^{-6}$.)
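The whole procedure above can be sketched in Python. The exponential decay model, its hand-derived partial derivatives, the synthetic data, and the particular values of $\lambda$ and $\epsilon$ are all illustrative assumptions, not part of the text; the gradient follows the chain-rule form $\partial f/\partial b_j = -\sum_i r_i\, \partial g/\partial b_j$.

```python
import numpy as np

# Illustrative model (an assumption): g(b1, b2; x) = b1 * exp(-b2 * x)
def g(b, x):
    return b[0] * np.exp(-b[1] * x)

def gradient(b, x, y):
    """Partial derivatives of f = (1/2) sum r_i^2 via the chain rule:
    df/db_j = -sum_i r_i * dg/db_j."""
    r = y - g(b, x)                          # residuals r_i of equation (3)
    dg_db1 = np.exp(-b[1] * x)               # dg/db1
    dg_db2 = -b[0] * x * np.exp(-b[1] * x)   # dg/db2
    return np.array([-np.sum(r * dg_db1), -np.sum(r * dg_db2)])

def gradient_descent(x, y, b_init, lam=0.01, eps=1e-10, max_iter=100_000):
    b = np.asarray(b_init, dtype=float)
    for _ in range(max_iter):
        b_new = b - lam * gradient(b, x, y)   # equation (4) for every b_j
        if np.max(np.abs(b_new - b)) < eps:   # step-6 stopping test
            return b_new
        b = b_new
    return b

# Synthetic data from known parameters b = (2, 1)
x = np.linspace(0.0, 2.0, 50)
y = g([2.0, 1.0], x)

b_fit = gradient_descent(x, y, b_init=[1.0, 0.5])
print(b_fit)
```

With noise-free data the iterates recover the generating parameters to high accuracy; with noisy data they converge to the least-squares estimates instead.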

Iteration formula in matrix notation


We introduce the following matrices to represent various quantities in our iteration:

\begin{equation} {\large\bf\overline b}= \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ ... \\ ... \\ b_p \end{bmatrix} ~~~~~~~~~~ {\large\bf\overline b}^{~k}= \begin{bmatrix} b_1^k \\ b_2^k \\ b_3^k \\ ... \\ ... \\ b_p^k \end{bmatrix} ~~~~~~~~~~ {\large\bf\overline b}^{~k+1}= \begin{bmatrix} b_1^{k+1} \\ b_2^{k+1} \\ b_3^{k+1} \\ ... \\ ... \\ b_p^{k+1} \end{bmatrix} ~~~~~~~~~~ {\large\bf\overline \nabla f}= \begin{bmatrix} \dfrac{\partial f}{\partial b_1} \\ \dfrac{\partial f}{\partial b_2}\\ \dfrac{\partial f}{\partial b_3}\\ ... \\ ... \\ \dfrac{\partial f}{\partial b_p} \end{bmatrix} \end{equation}

In this matrix notation, the set of iterative equations given by equation (4) can be written for all the b coefficients as,

\begin{equation} \large { {\overline b}^{~k+1}~=~ {\overline b}^{~k}~-~\lambda {\overline \nabla} f } \end{equation}

$~~~~~~~~~~(5)$
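In code, the matrix form collapses to a single vector operation. The sketch below assumes a generic gradient callable; following the sign convention of equation (4), each step subtracts $\lambda$ times the gradient.

```python
import numpy as np

def descent_step(b, grad_f, lam):
    """One iteration of the vector update: b_next = b - lam * grad f(b)."""
    return b - lam * grad_f(b)

# Tiny usage example with an assumed quadratic f(b) = b1^2 + b2^2,
# whose gradient is (2*b1, 2*b2); the minimum is at the origin.
grad_f = lambda b: 2.0 * b

b = np.array([1.0, -1.0])
for _ in range(100):
    b = descent_step(b, grad_f, lam=0.1)
print(b)  # close to [0, 0]
```

Because the update acts on the whole vector at once, the same line of code handles any number of parameters p.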




Convergence problems with the gradient descent algorithm

The simple gradient descent method suffers from two important convergence issues.

1. Since the increment is in constant proportion to the gradient, there is no provision for varying the step length according to the steepness of the surface. Ideally, we would like to take short steps in steep regions of the surface, and long steps where the surface is flatter.

2. In a multiparameter fit, the gradient may not be the same along all directions. The surface may be steeper in one direction and less steep in another. In the gradient descent algorithm, there is no provision for using different step sizes in different directions.
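The first issue can be seen even in one dimension; the quadratic $f(b)=b^2$ used below is purely illustrative. Its gradient is $2b$, so the fixed-step update multiplies b by $(1-2\lambda)$ at each iteration: a small $\lambda$ shrinks b toward the minimum at 0, while $\lambda > 1$ makes the iterates alternate in sign and grow without bound.

```python
def iterate(b0, lam, steps=30):
    """Run gradient descent on f(b) = b^2 (gradient 2b) with a fixed step lam."""
    b = b0
    for _ in range(steps):
        b = b - lam * 2.0 * b   # equation (4) with df/db = 2b
    return b

print(iterate(1.0, lam=0.1))   # converges toward 0
print(iterate(1.0, lam=1.5))   # diverges: each step multiplies b by -2
```

Adaptive step sizes and per-direction scaling (as in line search or quasi-Newton methods) are standard remedies for these two issues.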