Gradient Descent Methods

Published

February 16, 2024

Keywords

optimisation, mdo

Gradient Methods

The basic algorithm for all gradient methods was outlined earlier. Here we recap it in the form of a mermaid diagram that shows the generic structure of a gradient-based method:

graph TB 
A[Start] --> B[Initial Guess]
B --> C[Compute Gradient]
C --> D[Check Convergence]
D -->|Yes| E[Stop]
D -->|No| F[Choose Descent Direction]
F --> G[Choose Step Size]
G --> H[Update Design]
H --> C

Note that the diagram shows a single convergence check. In practice, we may use multiple convergence criteria. Possible convergence criteria are:

  • maximum number of iterations,
  • maximum number of function evaluations,
  • maximum number of gradient evaluations,
  • maximum number of Hessian evaluations,
  • maximum number of line searches,
  • maximum number of iterations without improvement,
  • magnitude of the change in the design per iteration, and
  • magnitude of the change in the objective function per iteration.

This is not an exhaustive list, and several of these criteria are often used together, as in the sketch below.
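
As an illustration, a combined stopping test could look like the following sketch. The function name converged and the keyword names max_iter, x_tol, and f_tol are placeholders chosen here; they are not tied to any particular library.

using LinearAlgebra

# A minimal sketch of a combined convergence check; all names are illustrative.
function converged(k, x_new, x_old, f_new, f_old; max_iter=1000, x_tol=1e-8, f_tol=1e-8)
    k ≥ max_iter                && return true   # maximum number of iterations
    norm(x_new - x_old) ≤ x_tol && return true   # small change in the design per iteration
    abs(f_new - f_old) ≤ f_tol  && return true   # small change in the objective per iteration
    return false
end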

Various methods differ in the way they choose the new point \(x_1\) such that \(f(x_1)<f(x_0)\). There are two steps involved in choosing the new point \(x_1\):

  1. Choose the direction in which to move from \(x_0\). This is known as the descent direction.

  2. Choose the distance to move along that direction. This is known as the step-size.

In the one-dimensional case (\(x \in \mathbb{R}\)) the first step is simple: we move in the direction of the negative gradient, i.e. opposite to the sign of the derivative. Choosing the step size is more involved. We will discuss various methods for choosing the descent direction (and the associated step size) here. A generic skeleton of the resulting loop is sketched below.
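
The loop from the diagram can be sketched as follows. The names generic_gradient_method, direction, and step, as well as the quadratic example at the end, are placeholders introduced here for illustration; each specific gradient method fills in the direction and step-size rules differently.

using LinearAlgebra

# A sketch of the generic gradient-method loop from the diagram above.
# `direction` and `step` are placeholders: each method (steepest descent,
# conjugate gradient, Newton, ...) defines them differently.
function generic_gradient_method(f, ∇f, x₀, direction, step; g_tol=1e-8, max_iter=1000)
    x = copy(x₀)
    for k in 1:max_iter
        g = ∇f(x)                  # compute gradient
        norm(g) ≤ g_tol && break   # check convergence
        p = direction(g, x)        # choose descent direction
        α = step(f, ∇f, x, p)      # choose step size
        x = x + α * p              # update design
    end
    return x
end

# Steepest descent with a fixed step size fits this template:
quad(x)  = x[1]^2 + 4 * x[2]^2
∇quad(x) = [2 * x[1], 8 * x[2]]
generic_gradient_method(quad, ∇quad, [1.0, 1.0], (g, x) -> -g, (f, g, x, p) -> 0.1)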

Steepest Descent (Cauchy’s Method)

Augustin-Louis Cauchy

Cauchy proposed the method of steepest descent in 1847. He was a French mathematician who made major contributions to analysis and number theory. He was the first to prove the Cauchy integral theorem. He also made important contributions to mechanics and optics.

The steepest descent method is the simplest gradient method: it uses only first-order (gradient) information. It is also known as Cauchy’s method. The method is as follows:

  1. Start with an initial guess \(x_0\).

  2. Evaluate the gradient \(f'(x)\) at \(x_0\). If \(f'(x_0)=0\), then \(x_0\) is a stationary point (a candidate optimum) and we stop. If \(f'(x_0)\neq 0\), then choose a new point \(x_1\) such that \(f(x_1)<f(x_0)\).

  3. Repeat the process until \(f'(x_n)=0\).

In steepest descent the new point is chosen by moving from \(x_0\) along the negative gradient, \(x_1 = x_0 - \alpha f'(x_0)\), with a step size \(\alpha > 0\) small enough that \(f(x_1)<f(x_0)\). The negative gradient is known as the steepest descent direction. A minimal sketch of this iteration is given below.
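
The following sketch implements the iteration for a scalar function, assuming a fixed step size α; practical implementations instead choose α with a line search. The function name steepest_descent_1d and the quadratic test function are illustrative, not from the original.

# A minimal sketch of steepest descent in one dimension with a fixed step size α.
function steepest_descent_1d(f′, x₀; α=0.1, tol=1e-8, max_iter=1000)
    x = x₀
    for _ in 1:max_iter
        g = f′(x)
        abs(g) ≤ tol && break   # f'(xₙ) ≈ 0: stop
        x -= α * g              # move against the gradient
    end
    return x
end

# Example: minimise f(x) = (x - 2)², whose derivative is f'(x) = 2(x - 2).
steepest_descent_1d(x -> 2 * (x - 2), 0.0)   # ≈ 2.0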

Rosenbrock Problem

The Rosenbrock function is a non-convex function used as a performance test problem for optimization algorithms.

\[f(x,y)=(a-x)^{2}+b(y-x^{2})^{2}\]

The global minimum lies inside a long, narrow, parabolic-shaped flat valley. Setting the partial derivatives

\[\begin{aligned} f_{x} &= -2(a-x) - 4bx(y-x^{2})\\ f_{y} &= 2b(y-x^{2}) \end{aligned}\]

to zero gives the analytical solution \((x,y)=(a,a^{2})\), which is \((1,1)\) for the usual choice \(a=1\), \(b=100\). The numerical solution, however, poses a particular challenge.
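
As a quick sanity check, the hand-coded gradient above vanishes at \((a,a^{2})\). The function names and the default values a = 1, b = 100 in this sketch are assumptions for illustration.

# Hand-coded partial derivatives of the Rosenbrock function (illustrative names).
rosen_fx(x, y; a=1.0, b=100.0) = -2 * (a - x) - 4 * b * x * (y - x^2)
rosen_fy(x, y; a=1.0, b=100.0) =  2 * b * (y - x^2)

rosen_fx(1.0, 1.0), rosen_fy(1.0, 1.0)   # both evaluate to zero at the minimum (1, 1)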

using Optim, Plots, PlutoUI
pyplot()
plot([1.0], [1.0], seriestype=:scatter, label="Minima")

f(x::Vector) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2   # Rosenbrock with a = 1, b = 100

x₁ = -5:0.05:5; x₂ = -5:0.05:5
z = [f([xi; yi]) for yi in x₂, xi in x₁]   # rows ↔ x₂ (y-axis), columns ↔ x₁ (x-axis), as contour expects
plt = contour!(x₁,x₂,z,levels=50,
            xlabel="x₁",
            ylabel="x₂", 
            title="Rosenbrock Function",
            titlefontsize=10)

niter = 100
x̄₀ = [3.0, 10.0]
# Steepest Descent
xsd = ones(niter,2)
xsd[1,:] = x̄₀
res = optimize(f, x̄₀, GradientDescent(),
    Optim.Options(iterations=niter, store_trace=true, extended_trace=true))
function plot_optim_trace(plt, res)
    tmp = Optim.x_trace(res)
    n = length(tmp)          # the trace may be shorter than niter if converged early
    path = ones(n, 2)
    for i in 1:n
        path[i,1] = tmp[i][1]
        path[i,2] = tmp[i][2]
    end
    plot!(plt, path[:,1], path[:,2], seriestype=:scatter, label = "Steepest Descent")
end
savefig(joinpath(@OUTPUT,"ex1.svg"))

plt1 = plot()
plot_optim_trace(plt, res)
savefig(joinpath(@OUTPUT,"ex2.svg"))

Optim.x_trace(res)

Optim.minimizer(res)

function plot_optim_trace(plt, res, label_string)
    tmp = Optim.x_trace(res)
    niter = length(tmp)
    path = ones(niter, 2)
    for i in 1:niter
        path[i,1] = tmp[i][1]
        path[i,2] = tmp[i][2]
    end
    plot!(plt, path[:,1], path[:,2], seriestype=:line, markershape=:circle, lw=2, markersize=4, label=label_string)
end

function run_and_plot_method(f, x̄₀, niter, method, plt)
    if method == "GradientDescent"
        res = optimize(f, x̄₀, GradientDescent(),
                       Optim.Options(iterations=niter,
                                     store_trace=true,
                                     extended_trace=true);
                       autodiff = :forward)
    elseif method == "ConjugateGradient"
        res = optimize(f, x̄₀, ConjugateGradient(),
                       Optim.Options(iterations=niter,
                                     store_trace=true,
                                     extended_trace=true);
                       autodiff = :forward)
    end
    plot_optim_trace(plt, res, method)
end

using Optim, Plots


# Elementary example of an Ellipse
niter = 50
x̄₀ = [2.0, 1.0]

f(x::Vector) = x'*[1 0; 0 40]*x

plt = plot([0.0], [0.0], seriestype=:scatter, label="Minima")
# Steepest Descent
run_and_plot_method(f, x̄₀, niter, "GradientDescent", plt)
#Conjugate Gradient
run_and_plot_method(f, x̄₀, niter, "ConjugateGradient", plt)


x₁ = -2.5:0.05:2.5; x₂ = -1.5:0.05:1.5   # cover the start point (2, 1) and the minimum (0, 0)
z = [f([xi; yi]) for yi in x₂, xi in x₁]
plt = contour!(x₁,x₂,z,levels=50, xlabel="x₁", ylabel="x₂", title="Elliptic Quadratic Function", titlefontsize=10)

# Classic Example of Rosenbrock Function
niter = 10
x̄₀ = [3.0, 1.5]

plt = plot([1.0], [1.0], seriestype=:scatter, label="Minima")

f(x::Vector) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2   # Rosenbrock with a = 1, b = 100

x₁ = -5:0.05:5; x₂ = -5:0.05:5
z = [f([xi; yi]) for yi in x₂, xi in x₁]
plt = contour!(x₁,x₂,z,levels=50, xlabel="x₁", ylabel="x₂", title="Rosenbrock Function", titlefontsize=10)

# Steepest Descent
xsd = ones(niter,2)
xsd[1,:] = x̄₀
res = optimize(f, x̄₀, GradientDescent(),
    Optim.Options(iterations=niter, store_trace=true, extended_trace=true))
plot_optim_trace(plt, res, "Steepest Descent")

#Conjugate Gradient
xcg = ones(niter,2)
xcg[1,:] = x̄₀
res = optimize(f, x̄₀, ConjugateGradient(),
    Optim.Options(iterations=niter, store_trace=true, extended_trace=true))
plot_optim_trace(plt, res, "Conjugate Gradient")