Step Size Calculation

Published

February 5, 2025

In this section we will summarise various step-size calculation methods for the gradient descent algorithm. We will use the same notation as in the previous section. We will also assume that the objective function \(f\) is differentiable.

Armijo rule

Let \(f(x)\) be a differentiable function of a single variable \(x\), let \(x_0\) be the current point with \(f'(x_0)\neq 0\), and let \(x\) be a candidate point obtained by moving from \(x_0\) in a descent direction, so that \(f'(x_0)(x-x_0)<0\). The Armijo rule accepts the candidate only if

\[f(x) - f(x_0) \leq \alpha f'(x_0)(x-x_0)\]

where \(0<\alpha<1\) is a constant (it plays the role of the parameter \(c\) in the multivariate statement below). Since the right-hand side is negative, any accepted point satisfies \(f(x)<f(x_0)\).

The Armijo rule is therefore used to choose the new point \(x_1\) such that \(f(x_1)<f(x_0)\), with a decrease at least proportional to the one predicted by the first-order approximation of \(f\) at \(x_0\).
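
For instance (values chosen here purely for illustration), take \(f(x)=x^2\), \(x_0=1\) and \(\alpha=0.5\), and generate candidates by moving against the derivative, \(x = x_0 - t f'(x_0) = 1 - 2t\). The condition becomes \(f(1-2t) - 1 \leq 0.5 \cdot 2 \cdot (-2t) = -2t\). A step \(t=1\) gives \(x=-1\) and \(f(x)-1 = 0 \not\leq -2\), so it is rejected; a step \(t=0.25\) gives \(x=0.5\) and \(f(x)-1 = -0.75 \leq -0.5\), so it is accepted.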

Armijo’s Rule for Step Size Calculation

Armijo’s Rule, also referred to as the Armijo Condition, is a technique used in numerical optimization to determine a suitable step size in iterative optimization algorithms, particularly in gradient descent and line search methods. The purpose of Armijo’s rule is to ensure sufficient decrease in the objective function while balancing computational efficiency and convergence reliability.


1. Overview

In optimization problems, especially when minimizing a differentiable objective function \(f(x)\), the step size (or learning rate) determines how far the algorithm should move along the search direction \(p\) at each iteration. An overly small step size may lead to slow convergence, while an overly large step size might overshoot the optimal solution or even cause divergence.

Armijo’s rule provides a systematic way to choose a step size \(\alpha\) such that it ensures a meaningful reduction in the objective function value, avoiding both extremes.
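
To make the trade-off concrete, here is a small sketch (the step sizes and iteration count are chosen purely for illustration) that runs plain gradient descent with two fixed step sizes on the same quadratic \(f(x) = 2x_1^2 + x_2^2\) used in the code at the end of this post: the large step diverges, the small one crawls.

import numpy as np

def f(x):
    # Same quadratic used later in this post: f(x) = 2*x1^2 + x2^2
    return 2 * x[0]**2 + x[1]**2

def grad(x):
    return np.array([4 * x[0], 2 * x[1]])

for step in (1.0, 0.01):
    x = np.array([10.0, 10.0])
    for _ in range(20):
        x = x - step * grad(x)   # fixed step size, no line search
    print(f"fixed step {step}: f after 20 iterations = {f(x):.4g}")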


2. The Condition

For an objective function \(f: \mathbb{R}^n \to \mathbb{R}\), Armijo’s rule states that the step size \(\alpha > 0\) must satisfy:

\[ f(x_k + \alpha p_k) \leq f(x_k) + c\, \alpha\, \nabla f(x_k)^T p_k \]

Here:

  • \(x_k\) is the current point in the optimization process.
  • \(p_k\) is the descent direction (commonly \(-\nabla f(x_k)\) in gradient descent).
  • \(\alpha > 0\) is the step size to be determined.
  • \(\nabla f(x_k)\) is the gradient of \(f\) at \(x_k\).
  • \(c \in (0, 1)\) is a small, positive parameter (typically a small value such as \(c = 10^{-4}\)).

The right-hand side of the inequality represents a linear approximation of the objective function \(f\) at \(x_k\), scaled by the factor \(c\). The condition ensures that the actual reduction in \(f\) (left-hand side) is at least proportional to the reduction predicted by the linear model.
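
Written as code, the condition is a one-line predicate. The helper below is a minimal sketch (the function and variable names are my own, not part of the post's later code) that simply evaluates the inequality for a given \(\alpha\):

import numpy as np

def armijo_satisfied(f, grad_f, x_k, p_k, alpha, c=1e-4):
    # Sufficient-decrease test: f(x_k + alpha*p_k) <= f(x_k) + c*alpha*grad(x_k)^T p_k
    lhs = f(x_k + alpha * p_k)
    rhs = f(x_k) + c * alpha * np.dot(grad_f(x_k), p_k)
    return lhs <= rhs

# Example with f(x) = 2*x1^2 + x2^2 and the steepest-descent direction
f = lambda x: 2 * x[0]**2 + x[1]**2
grad_f = lambda x: np.array([4 * x[0], 2 * x[1]])
x_k = np.array([10.0, 10.0])
p_k = -grad_f(x_k)
print(armijo_satisfied(f, grad_f, x_k, p_k, alpha=10.0))    # False: step far too long
print(armijo_satisfied(f, grad_f, x_k, p_k, alpha=0.3125))  # True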


3. Key Concepts

Descent Direction

Armijo’s rule applies when \(p_k\) is a descent direction, i.e., \(\nabla f(x_k)^T p_k < 0\). For example:

  • In gradient descent, \(p_k = -\nabla f(x_k)\).
  • In Newton’s method, \(p_k\) is derived from solving a linear system involving the Hessian.
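
As a quick check of this property (an illustrative sketch only, reusing the same quadratic as elsewhere in this post), the Newton direction for \(f(x) = 2x_1^2 + x_2^2\) solves \(\nabla^2 f(x_k)\, p_k = -\nabla f(x_k)\), and both it and the steepest-descent direction give a negative inner product with the gradient:

import numpy as np

grad_f = lambda x: np.array([4 * x[0], 2 * x[1]])
hessian = np.array([[4.0, 0.0], [0.0, 2.0]])  # Hessian of 2*x1^2 + x2^2 (constant)

x_k = np.array([10.0, 10.0])
g = grad_f(x_k)

p_sd = -g                                # steepest-descent direction
p_newton = np.linalg.solve(hessian, -g)  # Newton direction

print(np.dot(g, p_sd))      # -2000.0 (negative => descent direction)
print(np.dot(g, p_newton))  # -600.0  (negative => descent direction)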

Sufficient Decrease

The condition guarantees a “sufficient” decrease in \(f\); combined with backtracking from a reasonably large initial step, it also avoids excessively small steps that slow convergence.

Adaptive Step Size

Armijo’s rule often operates within a line search algorithm, where \(\alpha\) is initialized at a large value and is iteratively reduced (commonly multiplied by a constant factor \(\beta\), with \(\beta \in (0, 1)\)) until the condition is met.


4. Algorithm

A typical implementation of Armijo’s rule within a backtracking line search framework involves the following steps (a standalone sketch in code follows the list):

  1. Input:
    • Starting point \(x_k\),
    • Descent direction \(p_k\),
    • Initial step size \(\alpha_0 > 0\),
    • Parameters \(c \in (0, 1)\) and \(\beta \in (0, 1)\).
  2. Initialize \(\alpha\):
    • Set \(\alpha = \alpha_0\).
  3. Iterate until Armijo’s condition is satisfied:
    • Check if \(f(x_k + \alpha p_k) \leq f(x_k) + c\, \alpha\, \nabla f(x_k)^T p_k\).
    • If the condition is not satisfied, update \(\alpha \leftarrow \beta \alpha\).
    • Repeat until the condition holds.
  4. Output:
    • Return the step size \(\alpha\).
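
The following is a minimal sketch of these steps as a standalone function (the name backtracking_line_search and the example parameters are my own; a fuller version, with plotting, appears in the Code section at the end of the post):

import numpy as np

def backtracking_line_search(f, grad_f, x_k, p_k, alpha_0=1.0, c=1e-4, beta=0.5):
    # Start from alpha_0 and shrink by beta until the Armijo condition holds
    alpha = alpha_0
    g_dot_p = np.dot(grad_f(x_k), p_k)  # should be negative for a descent direction
    while f(x_k + alpha * p_k) > f(x_k) + c * alpha * g_dot_p:
        alpha *= beta
    return alpha

# Example on f(x) = 2*x1^2 + x2^2 with the steepest-descent direction
f = lambda x: 2 * x[0]**2 + x[1]**2
grad_f = lambda x: np.array([4 * x[0], 2 * x[1]])
x_k = np.array([10.0, 10.0])
alpha = backtracking_line_search(f, grad_f, x_k, -grad_f(x_k), alpha_0=10.0, c=0.1)
print(alpha)  # 0.3125 for these parameters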

5. Practical Considerations

Choice of Parameters

  • \(c\) controls how much decrease is required. A larger \(c\) demands a larger reduction at each trial step, while a smaller \(c\) is more lenient.
  • \(\beta\) determines how quickly \(\alpha\) decreases during backtracking. Common choices are \(\beta = 0.5\) or \(\beta = 0.8\).

Initial Step Size (\(\alpha_0\))

  • A large \(\alpha_0\) is often chosen to exploit potential rapid initial progress.
  • Some algorithms dynamically adjust \(\alpha_0\) based on previous iterations.
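
A small experiment (illustrative only; the helper below is self-contained and not part of the post's later code) shows how the choice of \(\beta\) changes the amount of backtracking needed before the condition is met:

import numpy as np

f = lambda x: 2 * x[0]**2 + x[1]**2
grad_f = lambda x: np.array([4 * x[0], 2 * x[1]])

def backtracking_trials(x_k, p_k, alpha_0, c, beta):
    # Return the accepted alpha and how many times it had to be reduced
    alpha, reductions = alpha_0, 0
    g_dot_p = np.dot(grad_f(x_k), p_k)
    while f(x_k + alpha * p_k) > f(x_k) + c * alpha * g_dot_p:
        alpha *= beta
        reductions += 1
    return alpha, reductions

x_k = np.array([10.0, 10.0])
p_k = -grad_f(x_k)
for beta in (0.5, 0.8):
    alpha, n = backtracking_trials(x_k, p_k, alpha_0=10.0, c=0.1, beta=beta)
    print(f"beta={beta}: accepted alpha={alpha:.4f} after {n} reductions")

With these illustrative parameters, the larger \(\beta\) needs more trial evaluations but returns a step size closer to the largest one the condition accepts.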

6. Advantages of Armijo’s Rule

  1. Robustness: Ensures stability in optimization by preventing overly aggressive steps.
  2. Guaranteed Convergence: Combined with other line search conditions (e.g., Wolfe conditions), it contributes to theoretical guarantees of convergence.
  3. Computational Efficiency: Reduces the need for an exact line search, making it a computationally feasible alternative.

7. Limitations

  1. Parameter Sensitivity: Performance depends on the choice of \(c\), \(\beta\), and \(\alpha_0\), requiring careful tuning.
  2. Potential Overhead: In complex functions, repeatedly evaluating \(f\) during backtracking may increase computational cost.
  3. Local Perspective: Armijo’s rule focuses on local decrease, which may not always lead to global minima in non-convex problems.

8. Applications

  • Gradient Descent Algorithms: For convex and non-convex optimization problems.
  • Machine Learning: In training models where loss functions need optimization (e.g., neural networks).
  • Engineering: Optimization problems in control, signal processing, and design.

9. Extensions

Armijo’s rule is often paired with other line search strategies, such as:

  • Wolfe Conditions: Add curvature considerations for more robust step size selection.
  • Goldstein Conditions: Similar to Armijo but involve both lower and upper bounds on the decrease.
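
For reference, here is a minimal sketch of checking the (weak) Wolfe conditions, which combine the Armijo sufficient-decrease test with a curvature requirement \(\nabla f(x_k + \alpha p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k\) for some \(c_2 \in (c, 1)\). The function name and parameter values below are illustrative, not part of the post's implementation:

import numpy as np

def wolfe_conditions(f, grad_f, x_k, p_k, alpha, c1=1e-4, c2=0.9):
    # Armijo (sufficient decrease) part
    g_dot_p = np.dot(grad_f(x_k), p_k)
    armijo = f(x_k + alpha * p_k) <= f(x_k) + c1 * alpha * g_dot_p
    # Curvature part: the slope at the new point must not be too negative
    curvature = np.dot(grad_f(x_k + alpha * p_k), p_k) >= c2 * g_dot_p
    return armijo and curvature

f = lambda x: 2 * x[0]**2 + x[1]**2
grad_f = lambda x: np.array([4 * x[0], 2 * x[1]])
x_k = np.array([10.0, 10.0])
p_k = -grad_f(x_k)
print(wolfe_conditions(f, grad_f, x_k, p_k, alpha=0.3125))  # True for this step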


10. Conclusion

Armijo’s rule is a cornerstone in optimization, striking a balance between ensuring sufficient progress and maintaining computational efficiency. By adaptively adjusting the step size, it prevents the pitfalls of static step sizes and provides a flexible approach for solving a wide range of optimization problems effectively.

Code
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

def objective_function(x):
    """
    Example objective function: f(x) = 2x_1^2 + x_2^2
    """
    return 2 * x[0]**2 + x[1]**2

def gradient(x):
    """
    Gradient of the objective function: grad(f(x)) = [4x_1, 2x_2]
    """
    return np.array([4 * x[0], 2 * x[1]])

def armijo_rule(x_k, p_k, alpha_0, c, beta):
    """
    Implements the Armijo rule for line search.

    Parameters:
        x_k (np.array): Current point.
        p_k (np.array): Descent direction (negative gradient).
        alpha_0 (float): Initial step size.
        c (float): Armijo parameter (0 < c < 1).
        beta (float): Step size reduction factor (0 < beta < 1).

    Returns:
        alpha (float): Step size satisfying the Armijo condition.
    """
    alpha = alpha_0
    alphas = [alpha]
    values = [objective_function(x_k + alpha * p_k)]

    while objective_function(x_k + alpha * p_k) > objective_function(x_k) + c * alpha * np.dot(gradient(x_k), p_k):
        alpha *= beta  # Reduce step size
        alphas.append(alpha)
        values.append(objective_function(x_k + alpha * p_k))

    # Create animation for alphas and function values
    fig, ax = plt.subplots()
    ax.set_title("Armijo Rule: Successive Alphas")
    ax.set_xlabel("Alpha")
    ax.set_ylabel("Function Value")
    ax.grid()

    def update(frame):
        ax.clear()
        ax.set_title("Armijo Rule: Successive Alphas")
        ax.set_xlabel("Alpha")
        ax.set_ylabel("Function Value")
        ax.grid()
        ax.plot(alphas[:frame + 1], values[:frame + 1], marker='o', linestyle='-')

    ani = animation.FuncAnimation(fig, update, frames=len(alphas), interval=1000, repeat=False)
    ani.save("armijo_alpha_animation.gif", writer="pillow")
    plt.close(fig)

    return alpha

def steepest_gradient_descent(x_init, alpha_0, c, beta, tol=1e-6, max_iter=1):
    """
    Steepest gradient descent method using Armijo's rule for step size.

    Parameters:
        x_init (np.array): Initial point.
        alpha_0 (float): Initial step size.
        c (float): Armijo parameter (0 < c < 1).
        beta (float): Step size reduction factor (0 < beta < 1).
        tol (float): Tolerance for convergence.
        max_iter (int): Maximum number of iterations.

    Returns:
        x (np.array): Final solution.
        history (list): History of points visited.
    """
    x = np.array(x_init)
    history = [x.copy()]

    for _ in range(max_iter):
        grad = gradient(x)
        p = -grad  # Steepest descent direction

        # Stop if the gradient is sufficiently small
        if np.linalg.norm(grad) < tol:
            break

        # Find step size using Armijo's rule
        alpha = armijo_rule(x, p, alpha_0, c, beta)

        # Update the solution
        x = x + alpha * p
        history.append(x.copy())

    return x, history

# Parameters for the algorithm
x_init = [10, 10]  # Starting point
alpha_0 = 10  # Initial step size
c = 0.1  # Armijo parameter
beta = 0.5  # Step size reduction factor
tol = 1e-6  # Tolerance for convergence
max_iter = 1  # Maximum number of iterations (a single step, to illustrate one Armijo line search)

# Run the gradient descent algorithm
solution, history = steepest_gradient_descent(x_init, alpha_0, c, beta, tol, max_iter)

# Print the results
print("Solution found:", solution)
print("History of x values:", history)
Solution found: [-2.5   3.75]
History of x values: [array([10, 10]), array([-2.5 ,  3.75])]
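
To verify the single step by hand: at \(x_0 = (10, 10)\) the gradient is \(\nabla f(x_0) = (40, 20)\), so \(p_0 = (-40, -20)\) and \(\nabla f(x_0)^T p_0 = -2000\). With \(c = 0.1\), the Armijo condition \(f(x_0 + \alpha p_0) \leq 300 - 200\alpha\) simplifies (after expanding the quadratic) to \(3600\alpha^2 - 1800\alpha \leq 0\), i.e. \(\alpha \leq 0.5\). Backtracking from \(\alpha_0 = 10\) with \(\beta = 0.5\) tries \(10, 5, 2.5, 1.25, 0.625\) before accepting \(\alpha = 0.3125\), which gives \(x_1 = (10, 10) + 0.3125\,(-40, -20) = (-2.5, 3.75)\), matching the printed output.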