Steepest Descent

Published

February 16, 2024

The Algorithm

The problem we are interested in solving is minimizing the objective function:

\[ \begin{array}{cl} P: \operatorname{minimize} & f(x) \\ \text { s.t. } & x \in \Re^{n}, \end{array} \] where \(f(x)\) is differentiable.

By the Cauchy-Schwarz inequality, for any direction \(\tilde{d}\), \[ |\langle\nabla f(\bar{x}), \tilde{d}\rangle| \leq \|\nabla f(\bar{x})\|\,\|\tilde{d}\| \]

If \(x=\bar{x}\) is a given point, \(f(x)\) can be approximated by its linear expansion

\[ f(\bar{x}+d) \approx f(\bar{x})+\nabla f(\bar{x})^{T} d \]

if \(d\) is “small”, i.e., if \(\|d\|\) is small. Now notice that if the approximation in the above expression is good, then we want to choose \(d\) so that the inner product \(\nabla f(\bar{x})^{T} d\) is as small as possible. Let us normalize \(d\) so that \(\|d\|=1\).

Then, for any \(\tilde{d}\) with \(\|\tilde{d}\|=1\), the Cauchy-Schwarz inequality gives \[ |\langle\nabla f(\bar{x}), \tilde{d}\rangle| \leq \|\nabla f(\bar{x})\| \]

But if \[ \tilde{d}=\frac{-\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|}, \] then \(\tilde{d}\) attains this bound and makes the smallest inner product with \(\nabla f(\bar{x})\) among all unit-norm directions, since for any \(d\) with \(\|d\|=1\)

\[ \nabla f(\bar{x})^{T} d \geq-\|\nabla f(\bar{x})\|\|d\|=\nabla f(\bar{x})^{T}\left(\frac{-\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|}\right)=\nabla f(\bar{x})^{T} \tilde{d}. \]

For this reason the un-normalized direction:

\[ \bar{d}=-\nabla f(\bar{x}) \]

is called the direction of steepest descent at the point \(\bar{x}\).

Note that \(\bar{d}=-\nabla f(\bar{x})\) is a descent direction as long as \(\nabla f(\bar{x}) \neq 0\). To see this, simply observe that \(\bar{d}^{T} \nabla f(\bar{x})=-(\nabla f(\bar{x}))^{T} \nabla f(\bar{x})<0\) so long as \(\nabla f(\bar{x}) \neq 0\).
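As a quick sanity check, the following minimal sketch (the function \(f\) below is an illustrative assumption, not one from this post) verifies numerically that \(\bar{d}=-\nabla f(\bar{x})\) has a negative inner product with the gradient and that \(f\) decreases along \(\bar{d}\) for small step sizes.

```python
import numpy as np

# Minimal sketch (illustrative f, not from the post): verify that
# d_bar = -grad f(x_bar) is a descent direction at a point where the
# gradient is nonzero.

def f(x):
    return x[0] ** 2 + x[1] ** 2 + np.exp(x[0]) * np.sin(x[1])

def grad_f(x):
    return np.array([2 * x[0] + np.exp(x[0]) * np.sin(x[1]),
                     2 * x[1] + np.exp(x[0]) * np.cos(x[1])])

x_bar = np.array([0.5, -1.0])
d_bar = -grad_f(x_bar)                          # steepest descent direction

# The directional derivative along d_bar is -||grad f(x_bar)||^2 < 0 ...
print(d_bar @ grad_f(x_bar))                    # strictly negative

# ... so f decreases along d_bar for small enough steps.
for t in [1e-1, 1e-2, 1e-3]:
    print(t, f(x_bar + t * d_bar) < f(x_bar))   # True
```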

To analyze the convergence rate of steepest descent, suppose now that the objective function \(f(x)\) is a simple quadratic function of the form:

\[ f(x)=\frac{1}{2} x^{T} Q x+q^{T} x \]

where \(Q\) is a positive definite symmetric matrix. We will suppose that the eigenvalues of \(Q\) are

\[ A=a_{1} \geq a_{2} \geq \ldots \geq a_{n}=a>0 \]

i.e., \(A\) and \(a\) are the largest and smallest eigenvalues of \(Q\).

Let \(x\) denote the current point in the steepest descent algorithm and let \(d\) denote the current direction, which is the negative of the gradient, i.e.,

\[ d=-\nabla f(x)=-Q x-q . \]

Now let us compute the next iterate of the steepest descent algorithm. If \(\alpha\) is the generic step-length, then

\[ f(x+\alpha d)=\frac{1}{2}(x+\alpha d)^{T} Q(x+\alpha d)+q^{T}(x+\alpha d) \]

\[ =f(x) - \alpha d^{T}d + \frac{1}{2}\alpha^2d^{T}Qd, \] since \(d^{T}Qx + q^{T}d = (Qx+q)^{T}d = -d^{T}d\).

Differentiating with respect to \(\alpha\), and using the symmetry of \(Q\) (so that \(\frac{d}{d\alpha}\left[\frac{1}{2}(x+\alpha d)^{T} Q(x+\alpha d)\right]=(x+\alpha d)^{T} Q d\)), gives \[ \frac{df}{d\alpha} = (x+\alpha d)^{T} Q d + q^{T} d. \]

Since \(d=-Qx-q\), \[ \begin{gathered} d^{T}=-x^{T}Q^{T}-q^{T} = -x^{T}Q-q^{T}, \\ q^{T} = -x^{T}Q - d^{T}. \end{gathered} \] Substituting this into \[ \frac{df}{d\alpha} = x^{T}Q d + \alpha d^{T}Qd +q^{T} d = -d^{T}d + \alpha d^{T}Qd \] and solving \(\frac{df}{d\alpha}=0\) for \(\alpha\), we get

\[ \alpha=\frac{d^{T} d}{d^{T} Q d}. \] The optimal solution of \(P\) is found by setting the gradient to zero, \(\nabla f(x)=Qx+q=0\), which gives \[ x^{*}=-Q^{-1} q, \]

and direct substitution shows that the optimal objective function value is:

\[ f(x^{*})=\frac{1}{2}(-Q^{-1}q)^{T} Q(-Q^{-1}q)+q^{T}(-Q^{-1}q) \] \[ f\left(x^{*}\right)=-\frac{1}{2} q^{T} Q^{-1} q \]
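For concreteness, here is a small sketch of these formulas with made-up data (\(Q\), \(q\), and \(x\) below are arbitrary illustrative choices, not taken from the post): it computes the exact step length \(\alpha\), the minimizer \(x^{*}=-Q^{-1}q\), and checks that \(f(x^{*})=-\tfrac{1}{2}q^{T}Q^{-1}q\).

```python
import numpy as np

# Minimal sketch with made-up data (Q, q, x are illustrative, not from the post):
# exact step length, closed-form minimizer, and optimal value for the quadratic.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])                  # symmetric positive definite
q = np.array([-1.0, 2.0])
x = np.array([2.0, -1.0])

f = lambda z: 0.5 * z @ Q @ z + q @ z

d = -(Q @ x + q)                            # steepest descent direction d = -grad f(x)
alpha = (d @ d) / (d @ Q @ d)               # exact minimizing step length

x_star = -np.linalg.solve(Q, q)             # minimizer x* = -Q^{-1} q
f_star = -0.5 * q @ np.linalg.solve(Q, q)   # optimal value -1/2 q^T Q^{-1} q

print(alpha)
print(np.allclose(Q @ x_star + q, 0.0))     # gradient vanishes at x*
print(np.isclose(f(x_star), f_star))        # True
```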

The next iterate of the algorithm then is \[ x^{\prime}=x+\alpha d=x+\frac{d^{T} d}{d^{T} Q d} d \]

and

\[ f\left(x^{\prime}\right)=f(x+\alpha d)=f(x)-\alpha d^{T} d+\frac{1}{2} \alpha^{2} d^{T} Q d=f(x)-\frac{1}{2} \frac{\left(d^{T} d\right)^{2}}{d^{T} Q d} \]
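The per-iteration decrease can be checked directly; in the short sketch below (again with made-up \(Q\), \(q\), and \(x\)), one exact-line-search step reduces \(f\) by exactly \(\tfrac{1}{2}\,(d^{T}d)^{2}/(d^{T}Qd)\).

```python
import numpy as np

# Quick check with made-up data: one exact-line-search step decreases the
# objective by exactly 0.5 * (d^T d)^2 / (d^T Q d).
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
q = np.array([1.0, -1.0])
x = np.array([0.5, 2.0])
f = lambda z: 0.5 * z @ Q @ z + q @ z

d = -(Q @ x + q)
alpha = (d @ d) / (d @ Q @ d)
x_next = x + alpha * d

predicted_drop = 0.5 * (d @ d) ** 2 / (d @ Q @ d)
print(np.isclose(f(x) - f(x_next), predicted_drop))   # True
```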

Therefore,

\[ \begin{gathered} \frac{f\left(x^{\prime}\right)-f\left(x^{*}\right)}{f(x)-f\left(x^{*}\right)}=\frac{f(x)-\frac{1}{2} \frac{\left(d^{T} d\right)^{2}}{d^{T} Q d}-f\left(x^{*}\right)}{f(x)-f\left(x^{*}\right)} \\ =1-\frac{\frac{1}{2} \frac{\left(d^{T} d\right)^{2}}{d^{T} Q d}}{\frac{1}{2} x^{T} Q x+q^{T} x+\frac{1}{2} q^{T} Q^{-1} q} \\ =1-\frac{\frac{1}{2} \frac{\left(d^{T} d\right)^{2}}{d^{T} Q d}}{\frac{1}{2}(Q x+q)^{T} Q^{-1}(Q x+q)} \\ =1-\frac{\left(d^{T} d\right)^{2}}{\left(d^{T} Q d\right)\left(d^{T} Q^{-1} d\right)} \\ =1-\frac{1}{\beta} \end{gathered} \]

where

\[ \beta=\frac{\left(d^{T} Q d\right)\left(d^{T} Q^{-1} d\right)}{\left(d^{T} d\right)^{2}} \]

In order for the convergence constant to be good, which will translate to fast linear convergence, we would like the quantity \(\beta\) to be small. The following result provides an upper bound on the value of \(\beta\).

Kantorovich Inequality: Let \(A\) and \(a\) be the largest and the smallest eigenvalues of \(Q\), respectively; the ratio \(A/a\) is called the eigenvalue ratio (or condition number) of \(Q\). Then \[ \beta \leq \frac{(A+a)^2}{4 A a} . \]
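The bound is easy to probe numerically; the sketch below (with an arbitrary symmetric positive definite \(Q\) built purely for illustration) evaluates \(\beta\) for several random directions \(d\) and confirms that it never exceeds \((A+a)^{2}/(4Aa)\).

```python
import numpy as np

# Numerical probe of the Kantorovich bound: beta <= (A + a)^2 / (4 A a).
# Q below is an arbitrary symmetric positive definite matrix built for illustration.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Q = M @ M.T + 5.0 * np.eye(5)        # symmetric positive definite

eigs = np.linalg.eigvalsh(Q)         # eigenvalues in ascending order
A, a = eigs[-1], eigs[0]             # largest and smallest eigenvalues
bound = (A + a) ** 2 / (4.0 * A * a)

Q_inv = np.linalg.inv(Q)
for _ in range(5):
    d = rng.standard_normal(5)
    beta = (d @ Q @ d) * (d @ Q_inv @ d) / (d @ d) ** 2
    print(beta <= bound + 1e-12)     # True for every direction d
```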

Let us apply this inequality to the above analysis: \[ \frac{f\left(x^{\prime}\right)-f\left(x^*\right)}{f(x)-f\left(x^*\right)}=1-\frac{1}{\beta} \leq 1-\frac{4 A a}{(A+a)^2}=\frac{(A-a)^2}{(A+a)^2}=\left(\frac{A / a-1}{A / a+1}\right)^2= \delta . \]

Note by definition that \(A / a\) is always at least 1. If \(A / a\) is small (not much bigger than 1), then the convergence constant \(\delta\) will be much smaller than 1. However, if \(A / a\) is large, then the convergence constant \(\delta\) will be only slightly smaller than 1. Note that the number of iterations needed to reduce the optimality gap by a factor of 10 grows linearly in the ratio \(A / a\).
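To see this behavior in practice, the following sketch (with an illustrative diagonal \(Q\), \(q=0\), and an arbitrary starting point, all chosen here and not taken from the post) runs steepest descent with exact line search on an ill-conditioned quadratic and compares the per-iteration reduction of the optimality gap against \(\delta\).

```python
import numpy as np

# Illustrative run (all data made up for this sketch): steepest descent with
# exact line search on an ill-conditioned quadratic, A = 100 and a = 1, so the
# convergence constant delta = ((A - a)/(A + a))^2 is close to 1.
Q = np.diag([100.0, 1.0])
q = np.zeros(2)
f = lambda z: 0.5 * z @ Q @ z + q @ z
f_star = 0.0                                  # minimizer here is x* = 0

A, a = 100.0, 1.0
delta = ((A - a) / (A + a)) ** 2              # ~0.9608: slow linear convergence

x = np.array([1.0, 90.0])
gap = f(x) - f_star
for k in range(10):
    d = -(Q @ x + q)
    alpha = (d @ d) / (d @ Q @ d)
    x = x + alpha * d
    new_gap = f(x) - f_star
    # Observed per-iteration reduction is close to, and never exceeds, delta.
    print(k, round(new_gap / gap, 6), "<=", round(delta, 6))
    gap = new_gap
```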

First property:

In this section, let \(\mathbf{d}=\nabla f(\mathbf{x})\) denote the gradient vector itself (not the descent direction). The gradient vector \(\mathbf{d}\) of a function \(f(\mathbf{x})\) at a point \(\mathbf{x}\) is orthogonal to the tangent plane of the level surface of \(f\) passing through \(\mathbf{x}\): \[ \mathbf{d} \cdot \mathbf{T} = 0, \] where \(\mathbf{T}\) is any tangent vector to that surface at \(\mathbf{x}\).
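A small numerical illustration of this orthogonality, using an example function \(f(x, y)=x^{2}+2y^{2}\) chosen here only for the sketch: along a parametrized level curve of \(f\), the gradient is orthogonal to the curve's tangent.

```python
import numpy as np

# Minimal sketch (example function chosen for illustration): along a level
# curve of f(x, y) = x^2 + 2 y^2, the gradient is orthogonal to the tangent.
def grad_f(p):
    return np.array([2.0 * p[0], 4.0 * p[1]])

# Parametrize the level curve x^2 + 2 y^2 = 3 as (sqrt(3) cos t, sqrt(1.5) sin t).
for t in [0.3, 0.7, 2.1]:
    point = np.array([np.sqrt(3.0) * np.cos(t), np.sqrt(1.5) * np.sin(t)])
    tangent = np.array([-np.sqrt(3.0) * np.sin(t), np.sqrt(1.5) * np.cos(t)])
    print(np.isclose(grad_f(point) @ tangent, 0.0))   # True: d . T = 0
```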

Second property:

The gradient represents the direction of maximum increase of \(f(\mathbf{x})\) at \(\mathbf{x}\). Let \(\mathbf{u}\) be a vector that is not tangent to the surface and let \(t\) be a parameter along that vector. The derivative of \(f(\mathbf{x})\) in the direction \(\mathbf{u}\) at \(\mathbf{x}\) is defined as

\[ \frac{d f}{d t}=\lim _{\varepsilon \rightarrow 0} \frac{f(\mathbf{x}+\varepsilon \mathbf{u})-f(\mathbf{x})}{\varepsilon} \]

Using the Taylor series expansion: \[ f(\mathbf{x}+\varepsilon \mathbf{u})=f(\mathbf{x})+\varepsilon\left[u_1 \frac{\partial f}{\partial x_1}+u_2 \frac{\partial f}{\partial x_2}+\cdots+u_n \frac{\partial f}{\partial x_n}\right]+O\left(\varepsilon^2\right). \]

Dividing by \(\varepsilon\) and taking the limit gives \[ \frac{d f}{d t}=\sum_{i=1}^n u_i \frac{\partial f}{\partial x_i}=\mathbf{d} \cdot \mathbf{u} \text {. } \]

The right-hand side of this expression is a scalar product, which can be rewritten in the form: \[ \frac{d f}{d t}=\|\mathbf{d}\|\|\mathbf{u}\| \cos \theta, \] where \(\theta\) is the angle between \(\mathbf{d}\) and \(\mathbf{u}\).

Since \(\cos \theta\) ranges between \(-1\) and \(1\), two cases are of particular interest:

1. When \(\theta=0\), \(\cos \theta=1\) and \(d f / d t\) is a maximum; \(\mathbf{u}\) points along \(\mathbf{d}\). This case corresponds to the maximum rate of increase of \(f(\mathbf{x})\).
2. When \(\theta=180^{\circ}\), \(\cos \theta=-1\) and \(d f / d t\) is a minimum; \(\mathbf{u}\) points opposite to \(\mathbf{d}\). This case corresponds to the maximum rate of decrease of \(f(\mathbf{x})\).

We thus come to a very important concept in optimization: at any design point \(\mathbf{x}^{(k)}\), the gradient vector \(\mathbf{d}^{(k)}\) represents the direction of maximum rate of increase of \(f(\mathbf{x})\), and \(-\mathbf{d}^{(k)}\) represents the direction of maximum rate of decrease of \(f(\mathbf{x})\).

This is precisely why the direction used by the steepest descent algorithm at each point \(\bar{x}\) is \[ \bar{d}=-\nabla f(\bar{x}). \]
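As a final illustration (the quadratic function below is an example chosen here, not one from the post), the sketch samples unit vectors \(\mathbf{u}\) and confirms that the directional derivative \(\mathbf{d} \cdot \mathbf{u}\) is maximized when \(\mathbf{u}\) points along the gradient, with maximum value \(\|\mathbf{d}\|\).

```python
import numpy as np

# Minimal sketch (example function chosen for illustration): among unit vectors
# u, the directional derivative d . u is largest when u points along the
# gradient d, and the maximum value equals ||d||.
def grad_f(x):
    return np.array([2.0 * x[0] + x[1], 6.0 * x[1] + x[0]])   # f = x1^2 + 3 x2^2 + x1 x2

x = np.array([1.0, 2.0])
d = grad_f(x)

angles = np.linspace(0.0, 2.0 * np.pi, 721)
unit_vectors = np.column_stack([np.cos(angles), np.sin(angles)])
dir_derivs = unit_vectors @ d                  # d . u for each sampled u

u_best = unit_vectors[np.argmax(dir_derivs)]
print(np.allclose(u_best, d / np.linalg.norm(d), atol=1e-2))   # True
print(dir_derivs.max(), np.linalg.norm(d))     # maximum is (approximately) ||d||
```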