Acceleration

SCS includes Anderson acceleration (AA), a quasi-Newton method for accelerating fixed-point iterations. It can dramatically speed up convergence in practice, especially when higher accuracy solutions are desired. However, it can also cause severe instability of the solver, so it should be used with caution. How best to implement AA so that it performs well across all problems is an open research question, and we welcome any :ref:`contributions <contributing>` in that direction!

Mathematical details

The discussion here is taken from section 2 of our paper. Consider the problem of finding a fixed point of the function f: \mathbf{R}^n \rightarrow \mathbf{R}^n, i.e.,

\text{Find } x \in \mathbf{R}^n \text{ such that } x = f(x).

In our case f corresponds to one step of :ref:`Douglas-Rachford splitting <algorithm>` and the iterate is the w^k vector, which converges to a fixed point of the DR operator. At a high level, AA, starting from an initial point x^0 with max memory m (corresponding to the acceleration_lookback setting), works as follows:

\begin{array}{l}
\text{For } k=0, 1, \dots \\
\quad \text{Set } m_k=\min\{m, k\} \\
\quad \text{Select weights } \alpha_j^k \text{ based on the last } m_k \text{ iterations satisfying } \sum_{j=0}^{m_k}\alpha_j^k=1 \\
\quad \text{Set } x^{k+1}=\sum_{j=0}^{m_k}\alpha_j^kf(x^{k-m_k+j})
\end{array}

In other words, AA produces an iterate that is a linear combination of the last m_k + 1 outputs of the map. Thus, the main challenge is in choosing the weights \alpha \in \mathbf{R}^{m_k+1}. There are two ways to choose them, corresponding to type-I and type-II AA (named after the type-I and type-II Broyden updates). We present type-II first.
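
Before deriving the two weight choices, here is a minimal NumPy sketch of the loop above. The function choose_weights is a hypothetical placeholder standing in for the type-I or type-II rules derived next; this is an illustrative sketch, not the SCS implementation.

.. code-block:: python

    import numpy as np

    def aa_iterate(f, x0, m, iters, choose_weights):
        """Generic AA loop: each new iterate is a weighted combination of the
        last m_k + 1 outputs of the fixed-point map f."""
        xs = [np.asarray(x0, dtype=float)]   # history of iterates x^j
        fs = [f(xs[0])]                      # history of map outputs f(x^j)
        for k in range(iters):
            m_k = min(m, k)
            # weights alpha^k over the last m_k + 1 points, summing to one
            alpha = choose_weights(xs[-(m_k + 1):], fs[-(m_k + 1):])
            x_next = sum(a * fx for a, fx in zip(alpha, fs[-(m_k + 1):]))
            xs.append(x_next)
            fs.append(f(x_next))
        return xs[-1]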

Type-II AA

Define the residual g: \mathbf{R}^n \rightarrow \mathbf{R}^n of f to be g(x) = x - f(x). Note that any fixed point x^\star satisfies g(x^\star) = 0. In type-II AA the weights are selected by solving the following small constrained least-squares problem:

\begin{array}{ll}
\mbox{minimize} & \|\sum_{j=0}^{m_k}\alpha_j g(x^{k-m_k+j})\|_2^2\\
\mbox{subject to} & \sum_{j=0}^{m_k}\alpha_j=1,
\end{array}

More explicitly, we can reformulate the above as follows:

\begin{array}{ll}
\mbox{minimize} & \|g_k-Y_k\gamma\|_2,
\end{array}

with variable \gamma=(\gamma_0,\dots,\gamma_{m_k-1}) \in \mathbf{R}^{m_k}. Here g_i=g(x^i), Y_k=[y_{k-m_k}~\dots~y_{k-1}] with y_i=g_{i+1}-g_i for each i, and \alpha and \gamma are related by \alpha_0=\gamma_0, \alpha_i=\gamma_i-\gamma_{i-1} for 1\leq i\leq m_k-1 and \alpha_{m_k}=1-\gamma_{m_k-1}.

Assuming that Y_k is full column rank, the solution \gamma^k to the above is given by \gamma^k=(Y_k^\top Y_k)^{-1}Y_k^\top g_k, and hence by the relation between \alpha^k and \gamma^k, the next iterate of type-II AA can be written as

\begin{align}
x^{k+1}&=f(x^k)-\sum_{i=0}^{m_k-1}\gamma_i^k\left(f(x^{k-m_k+i+1})- f(x^{k-m_k+i})\right)\\
&=x^k-g_k-(S_k-Y_k)\gamma^k\\
&=x^k-(I+(S_k-Y_k)(Y_k^\top Y_k)^{-1}Y_k^\top )g_k\\
&=x^k-B_kg_k,
\end{align}

where S_k=[s_{k-m_k}~\dots~s_{k-1}], s_i=x^{i+1}-x^i for each i, and

B_k=I+(S_k-Y_k)(Y_k^\top Y_k)^{-1}Y_k^\top

Observe that B_k minimizes \|B_k-I\|_F subject to the inverse multi-secant condition B_kY_k=S_k, and hence can be regarded as an approximate inverse Jacobian of g. The update of x^k can then be considered as a quasi-Newton-type update, with B_k being a generalized second (or type-II) Broyden's update of I satisfying the inverse multi-secant condition.
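
The type-II update can be written in a few lines of NumPy. The following is an illustrative sketch (not the SCS implementation) that takes the last m_k + 1 iterates and map outputs and returns x^{k+1} = f(x^k) - (S_k - Y_k)\gamma^k:

.. code-block:: python

    import numpy as np

    def type2_aa_step(xs, fxs):
        """One type-II AA step.

        xs:  list of iterates   [x^{k-m_k}, ..., x^k]
        fxs: list of map values [f(x^{k-m_k}), ..., f(x^k)]
        Returns x^{k+1} = f(x^k) - (S_k - Y_k) gamma^k.
        """
        X = np.column_stack(xs)       # n x (m_k + 1)
        FX = np.column_stack(fxs)
        G = X - FX                    # residuals g(x^j) = x^j - f(x^j)
        S = np.diff(X, axis=1)        # S_k: columns s_i = x^{i+1} - x^i
        Y = np.diff(G, axis=1)        # Y_k: columns y_i = g_{i+1} - g_i
        g_k = G[:, -1]
        # least-squares solve: gamma^k = argmin || g_k - Y_k gamma ||_2
        gamma, *_ = np.linalg.lstsq(Y, g_k, rcond=None)
        return FX[:, -1] - (S - Y) @ gamma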

Type-I AA

In the same spirit, we define type-I AA, in which we find an approximate Jacobian of g minimizing \|H_k-I\|_F subject to the multi-secant condition H_kS_k=Y_k. Assuming that S_k is full column rank, we obtain (by symmetry) that

H_k=I+(Y_k-S_k)(S_k^\top S_k)^{-1}S_k^\top

and the update scheme is defined as

x^{k+1}=x^k-H_k^{-1}g_k

assuming H_k to be invertible. A direct application of the Woodbury matrix identity shows that

H_k^{-1}=I+(S_k-Y_k)(S_k^\top Y_k)^{-1}S_k^\top

where again we have assumed that S_k^\top Y_k is invertible. Notice that this explicit formula for H_k^{-1} is preferred because the most costly step, the matrix inversion, is applied only to a small m_k\times m_k matrix.
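
A corresponding sketch of one type-I step, using the explicit formula for H_k^{-1} and assuming S_k^\top Y_k is invertible (illustrative only):

.. code-block:: python

    import numpy as np

    def type1_aa_step(xs, fxs):
        """One type-I AA step: x^{k+1} = x^k - H_k^{-1} g_k with
        H_k^{-1} = I + (S_k - Y_k)(S_k^T Y_k)^{-1} S_k^T."""
        X = np.column_stack(xs)
        G = X - np.column_stack(fxs)   # residuals g(x^j)
        S = np.diff(X, axis=1)
        Y = np.diff(G, axis=1)
        g_k = G[:, -1]
        # only a small m_k x m_k system is solved; no n x n inversion needed
        gamma = np.linalg.solve(S.T @ Y, S.T @ g_k)
        return X[:, -1] - g_k - (S - Y) @ gamma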

In SCS

In SCS both types of acceleration are available, though by default type-I is used since it tends to have better performance. To use AA, set the acceleration_lookback setting to a non-zero value (10 works well for many problems and is the default). This setting corresponds to m, the maximum number of SCS iterates that AA will use to extrapolate to the new point.

To enable type-II acceleration, set acceleration_lookback to a negative value; the sign is interpreted as switching the AA type (this is mostly so that we can test it without fully exposing it to the user).

The setting acceleration_interval controls how frequently AA is applied. If acceleration_interval = k for some integer k \geq 1, then AA is applied every k iterations (AA simply ignores the intermediate iterations). This has the benefit of making AA k times faster and approximating a k times larger memory, as well as improving numerical stability by 'decorrelating' the data. On the other hand, older iterates might be stale. More work is needed to determine the optimal setting for this parameter, but 10 appears to work well in practice and is the default.
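
For instance, using the SCS Python interface, the AA settings can be passed as keyword arguments to the solver (the tiny problem below, minimizing x subject to x \geq 1, is purely illustrative, and the exact interface may differ across versions):

.. code-block:: python

    import numpy as np
    import scipy.sparse as sparse
    import scs

    # Tiny illustrative LP: minimize x subject to x >= 1, written in SCS form
    # min c^T x  s.t.  Ax + s = b,  s >= 0  (one positive-cone constraint).
    data = dict(
        A=sparse.csc_matrix(np.array([[-1.0]])),
        b=np.array([-1.0]),
        c=np.array([1.0]),
    )
    cone = dict(l=1)

    solver = scs.SCS(
        data,
        cone,
        acceleration_lookback=10,   # m: AA memory (negative switches to type-II)
        acceleration_interval=10,   # apply AA only every 10 iterations
    )
    sol = solver.solve()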

The details of how the linear systems are solved and updated are abstracted away into the AA package (e.g., QR decomposition, SVD decomposition, etc.). Exactly how best to solve and update the equations is still an open question.

Regularization

By default we also add a small amount of regularization to the matrices that are inverted in the above expressions, i.e., in the type-II update

(Y_k^\top Y_k)^{-1} \text{ becomes } (Y_k^\top Y_k + \epsilon I)^{-1}

for some small \epsilon > 0, and similarly for the type-I update

(S_k^\top Y_k)^{-1} \text{ becomes } (S_k^\top Y_k + \epsilon I)^{-1}

which is equivalent to adding regularization to the S_k^\top S_k matrix before using the Woodbury matrix identity. The regularization ensures the matrices are invertible and helps stability. In practice type-I tends to require more regularization than type-II for good performance. The regularization shrinks the AA update towards the update without AA, since if \epsilon\rightarrow\infty then \gamma^\star = 0 and the AA step reduces to x^{k+1} = f(x^k). Note that the regularization can be folded into the matrices by appending \sqrt{\epsilon} I to the bottom of S_k or Y_k, which is useful when using a QR or SVD decomposition to solve the equations.
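
The fold-in trick can be sketched as follows: stacking \sqrt{\epsilon} I below Y_k and zeros below g_k yields an ordinary least-squares problem whose normal equations are exactly (Y_k^\top Y_k + \epsilon I)\gamma = Y_k^\top g_k. This is an illustrative sketch of the type-II case:

.. code-block:: python

    import numpy as np

    def regularized_gamma_type2(Y, g_k, eps=1e-8):
        """Type-II weights with regularization: solves
        (Y^T Y + eps I) gamma = Y^T g_k by stacking sqrt(eps) I under Y,
        which is convenient with QR or SVD based least-squares solvers."""
        n, m = Y.shape
        Y_aug = np.vstack([Y, np.sqrt(eps) * np.eye(m)])
        g_aug = np.concatenate([g_k, np.zeros(m)])
        gamma, *_ = np.linalg.lstsq(Y_aug, g_aug, rcond=None)
        return gamma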

Max \gamma norm

As the algorithm converges to the fixed point, the matrices to be inverted can become ill-conditioned and AA can become unstable. In this case the \gamma vector can become very large. As a simple heuristic, we reject the AA update and reset the AA state whenever \|\gamma\|_2 is greater than max_weight_norm (e.g., something very large like 10^{10}).
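
A sketch of this check (the constant mirrors the max_weight_norm setting; names here are illustrative):

.. code-block:: python

    import numpy as np

    MAX_WEIGHT_NORM = 1e10   # corresponds to the max_weight_norm setting

    def gamma_norm_ok(gamma):
        """Accept the AA weights only if they have not blown up, which
        would signal an ill-conditioned solve; otherwise the AA update is
        rejected and the AA state is reset."""
        return np.linalg.norm(gamma) <= MAX_WEIGHT_NORM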

Safeguarding

We also apply a safeguarding step to the output of the AA step. Explicitly, let x^k be the current iterate and let x_\mathrm{AA} = x^{k+1} be the output of AA. We reject the AA step if

\|x_\mathrm{AA} - f(x_\mathrm{AA}) \|_2 > \zeta \|x^k - f(x^k) \|_2

where \zeta is the safeguarding tolerance factor (safeguard_factor), which defaults to 1. In other words, we reject the step if the residual norm after the AA step exceeds \zeta times the residual norm of the current iterate (e.g., with the default \zeta = 1, if it increases the residual). After rejecting a step we revert the iterate to x^k and reset the AA state.
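
A sketch of the safeguarding test, where zeta plays the role of safeguard_factor (names here are illustrative):

.. code-block:: python

    import numpy as np

    def accept_aa_step(f, x_k, x_aa, zeta=1.0):
        """Safeguard: accept the AA candidate only if it does not increase
        the fixed-point residual by more than the factor zeta."""
        res_aa = np.linalg.norm(x_aa - f(x_aa))
        res_k = np.linalg.norm(x_k - f(x_k))
        return res_aa <= zeta * res_k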

Relaxation

In some works relaxation has been shown to improve performance. Relaxation replaces the final step of AA by mixing the map inputs and outputs as follows:

x^{k+1} = \beta \sum_{j=0}^{m_k}\alpha_j^k f(x^{k-m_k+j}) + (1-\beta) \sum_{j=0}^{m_k}\alpha_j^k x^{k-m_k+j}

where \beta is the relaxation parameter, and \beta=1 recovers vanilla AA. This can be computed from the matrices defined above as

x^{k+1} = \beta (f(x^k) - (S_k - Y_k) \gamma^k) + (1-\beta) (x^k - S_k \gamma^k)
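
For illustration, the relaxed type-II step can be sketched in NumPy as follows (again a simplified sketch, not the SCS implementation):

.. code-block:: python

    import numpy as np

    def relaxed_type2_step(xs, fxs, beta=1.0):
        """Type-II AA step with relaxation parameter beta; beta = 1 gives
        the vanilla AA step."""
        X = np.column_stack(xs)
        FX = np.column_stack(fxs)
        G = X - FX
        S = np.diff(X, axis=1)
        Y = np.diff(G, axis=1)
        g_k = G[:, -1]
        gamma, *_ = np.linalg.lstsq(Y, g_k, rcond=None)
        aa_out = FX[:, -1] - (S - Y) @ gamma   # combination of map outputs
        aa_in = X[:, -1] - S @ gamma           # combination of map inputs
        return beta * aa_out + (1.0 - beta) * aa_in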

Anderson acceleration API

For completeness, we document the full Anderson acceleration API below.

.. doxygenfile:: include/aa.h