We decided that the last contribution will be a solver for linear least squares with bounds, i.e. we consider the problem:

$\min_x \frac{1}{2} \lVert A x - b \rVert^2 \quad \text{subject to} \quad l \le x \le u,$

where $A$ is a given $m \times n$ matrix and $b$ is a given $m$-vector. This is a convex optimization problem, thus it is well posed. It turns out, though, that approaches to solving it are not much simpler than for a nonlinear problem, although the convergence is generally better.

My implementation https://github.com/scipy/scipy/pull/5110 contains two methods:

- An adaptation of the Trust Region Reflective algorithm I used for the nonlinear solver. The difference is that the quadratic model is always exact in linear least squares, hence we don’t need to track or adjust the radius of a trust region — we assume it is big enough for full Gauss-Newton steps (with some precautions added). Initially I tried another method called “reflective Newton”, but it didn’t converge well on some problems. I couldn’t figure out what the canonical implementation of this method is, but got the feeling that it is not well designed for practical use. Note that MATLAB’s version doesn’t perform that well either (from my limited experience). On the other hand, I haven’t found problems that are difficult for my TRF adaptation.
- Classical Bounded-Variable Least Squares (BVLS) as described in the paper by Stark and Parker. This is an active-set method which separates variables into free and active sets by an intelligent inclusion-exclusion procedure. The strong point of this algorithm is that it eventually determines the optimal solution unambiguously. But the number of iterations required can easily be on the order of the number of variables (and the iterations are heavy), which restricts its usage for large problems. For small problems this method is very good. I also added a self-invented ad-hoc procedure for the method’s initialization, which should (hopefully) decrease the number of iterations done by BVLS.
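For context, this PR was eventually merged as scipy.optimize.lsq_linear, so (assuming a scipy version that includes it) a bounded linear least-squares problem can be solved like this; the tiny problem below is my own illustration, not one from the PR:

```python
import numpy as np
from scipy.optimize import lsq_linear

# min ||Ax - b||^2 subject to -1 <= x <= 1.  The unconstrained minimizer
# is x = (2, -2), so both bounds are active at the solution.
A = np.eye(2)
b = np.array([2.0, -2.0])

# method='bvls' selects the Bounded-Variable Least Squares algorithm;
# method='trf' would select the Trust Region Reflective adaptation.
res = lsq_linear(A, b, bounds=([-1, -1], [1, 1]), method='bvls')
print(res.x)  # both variables end up clipped to the bounds
```

Because the active set is determined exactly, BVLS reports both constraints as active here.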

I think that’s it for today. The next post will be final on GSoC.


Conventional least-squares algorithms require $O(mn)$ memory and $O(mn^2)$ floating point operations per iteration (again, $n$ is the number of variables, $m$ is the number of residuals). So on a regular PC it’s possible to solve moderately sized problems in a reasonable time, but increasing these numbers by an order or two will cause problems. These limitations are inevitable when working with dense matrices, but if the Jacobian matrix of a problem is significantly sparse (has only a few non-zero elements), then we can store it as a sparse matrix (eliminating memory issues) and avoid matrix factorizations in the algorithms (eliminating time issues). Here I explain how to avoid matrix factorizations and rely only on matrix-vector products.

The crucial part of all nonlinear least-squares algorithms is finding a (perhaps approximate) solution to a linear least-squares problem (it dominates the running time):

$\min_p \lVert J p + f \rVert^2,$

where $J$ is the Jacobian and $f$ is the vector of residuals at the current iterate.

As a method to solve it I chose the LSMR algorithm, which is available in scipy. I haven’t thoroughly investigated this algorithm, but conceptually it can be thought of as a specially preconditioned conjugate gradient method applied to the least-squares normal equation, but with better numerical properties. I preferred it over LSQR because it appeared much more recently and the authors claim that it’s more suitable for least-squares problems (as opposed to systems of equations). The LSMR algorithm requires only matrix-vector multiplications of the form $J u$ and $J^T v$.
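To illustrate the interface (a sketch with a made-up toy system, not scipy’s internal code), LSMR only needs those two products, which can be supplied through a scipy LinearOperator — no factorization of the matrix is ever formed:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsmr

# A small overdetermined system.  Only matvec (J u) and rmatvec (J^T v)
# are exposed to the solver.
J_dense = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 2.0]])
f = np.array([1.0, 2.0, 2.0])

J = LinearOperator(
    shape=J_dense.shape,
    matvec=lambda u: J_dense @ u,
    rmatvec=lambda v: J_dense.T @ v,
)

x = lsmr(J, f)[0]  # least-squares solution of J x = f
print(x)
```

For this consistent system the solver recovers the exact solution $x = (1, 1)$.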

In the large-scale setting both implemented algorithms, dogbox and Trust Region Reflective, first compute an approximate Gauss-Newton solution using LSMR. And then:

- dogbox operates in its usual way, i.e. this large-scale modification came almost for free.
- In Trust Region Reflective I apply the 2-d subspace approach to solve the trust-region problem. This subspace is spanned by the computed LSMR solution and the scaled gradient.

When the Jacobian is not provided by the user, we need to estimate it by finite differences. If the number of variables is large, say 100000, this operation becomes very expensive when performed in the standard way. But if the Jacobian contains only a few non-zero elements in each row (its structure must be provided by the user), then columns can be grouped such that all columns in one group are estimated by a single function evaluation, see “Numerical Optimization”, chapter 8.1. The simplest greedy grouping algorithm I used is described in this paper. Its average performance should be quite good — the number of function evaluations required is usually only slightly higher than the maximum number of non-zero elements in a row. More advanced algorithms treat this problem as a graph-coloring problem, but they come down to a simple reordering of columns before applying the greedy grouping (so they can perhaps be implemented later).
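A minimal sketch of such a greedy grouping (my own illustration of the idea, not the scipy implementation): scan the columns in order, putting each one into the first group whose columns don’t share a non-zero row with it.

```python
import numpy as np

def group_columns_greedy(structure):
    """Greedily group columns of a sparse Jacobian so that columns within
    one group have no non-zero elements in the same row, and hence can be
    estimated by a single function evaluation.

    structure : (m, n) boolean array marking the non-zero positions.
    Returns an array of group indices, one per column.
    """
    structure = np.asarray(structure, dtype=bool)
    m, n = structure.shape
    groups = np.full(n, -1)
    n_groups = 0
    for j in range(n):
        for g in range(n_groups):
            cols = groups == g
            # Column j fits into group g if it shares no row with its columns.
            if not np.any(structure[:, j] & structure[:, cols].any(axis=1)):
                groups[j] = g
                break
        else:
            groups[j] = n_groups  # open a new group
            n_groups += 1
    return groups
```

For a tridiagonal structure this needs 3 groups (3 evaluations) regardless of the number of variables, matching the rule of thumb above: slightly above the maximum number of non-zeros per row.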

In the next post I will report the results of the algorithms on sparse large-scale problems.

If such a problem is too big to be solved exactly, the following popular approach can be used. Select two vectors and put them in a matrix $S \in \mathbb{R}^{n \times 2}$. One of these vectors is usually the gradient $g$, the other is the unconstrained minimizer $-B^{-1} g$ of the quadratic function (in case $B$ is positive definite) or a direction of negative curvature otherwise. Then it’s helpful to make the vectors orthogonal to each other and of unit norm (apply QR to $S$). Now let $p$ lie in the subspace spanned by these two vectors, $p = S q$; substituting this in the original problem we get:

$\min_q \frac{1}{2} q^T \tilde{B} q + \tilde{g}^T q \quad \text{subject to} \quad \lVert q \rVert \le \Delta,$

where $\tilde{B} = S^T B S$ is a $2 \times 2$ matrix and $\tilde{g} = S^T g$ (note that $\lVert p \rVert = \lVert q \rVert$ because the columns of $S$ are orthonormal).

The problem becomes very small and supposedly easy to solve. But we still need to find its accurate solution somehow. The appealing approach, often mentioned in books without details, is to reduce the problem to a fourth-order algebraic equation. Let’s find out how to actually do that. As I mentioned in the previous posts there are two main cases: a) $B$ is positive definite and $-B^{-1} g$ lies within the trust region, then it is the optimal solution; b) otherwise the optimal solution lies on the boundary. Of course the only difficult part is case b. In this case let’s rewrite the problem with the obvious change of notation ($B$, $g$ and $p$ now denote the two-dimensional reduced quantities) and assume $\lVert p \rVert = \Delta$:

$\min_p \frac{1}{2} p^T B p + g^T p \quad \text{subject to} \quad \lVert p \rVert^2 = \Delta^2.$

To solve it we need to find the stationary points of the Lagrangian $L(p, \lambda) = \frac{1}{2} p^T B p + g^T p + \frac{\lambda}{2} (p^T p - \Delta^2)$. Setting the partial derivatives to zero, we come to the following system of equations:

$(B + \lambda I) p = -g, \quad p^T p = \Delta^2.$

After eliminating $\lambda$ from the first two (scalar) equations we get:

$p_2 (B_{11} p_1 + B_{12} p_2 + g_1) = p_1 (B_{12} p_1 + B_{22} p_2 + g_2), \quad p_1^2 + p_2^2 = \Delta^2.$

To exclude the last equation let’s use the parametrization $p_1 = \Delta \frac{2t}{1 + t^2}$, $p_2 = \Delta \frac{1 - t^2}{1 + t^2}$. Then substitute it into the first equation and multiply by the nonzero $(1 + t^2)^2$ to get (with a help of sympy):

$(-b + d)\, t^4 + 2 (a - c + f)\, t^3 + 6 b\, t^2 + 2 (-a + c + f)\, t + (-b - d) = 0,$

where $a = B_{11} \Delta^2$, $b = B_{12} \Delta^2$, $c = B_{22} \Delta^2$, $d = g_1 \Delta$, $f = g_2 \Delta$.

And this is our final fourth-order algebraic equation (note how it is symmetric in some sense). After finding all its roots, we discard the complex ones, compute the corresponding $p_1$ and $p_2$, substitute them into the original quadratic function and choose the one which gives the smallest value. Originally I thought that this equation can’t have complex roots, but that wasn’t confirmed in practice.
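The sympy step mentioned above can be reproduced with a short script (my reconstruction of the algebra; $a, b, c, d, f$ denote the same scaled quantities as in the implementation):

```python
import sympy as sp

t, a, b, c, d, f = sp.symbols('t a b c d f')

# Boundary parametrization p1 = Delta*2t/(1+t^2), p2 = Delta*(1-t^2)/(1+t^2),
# written through u, v, w before clearing the denominator (1 + t^2)^2.
u, v, w = 2 * t, 1 - t**2, 1 + t**2

# Stationarity condition with lambda eliminated, multiplied by (1 + t^2)^2.
expr = sp.expand(u * (b * u + c * v + f * w) - v * (a * u + b * v + d * w))
print(sp.Poly(expr, t).all_coeffs())
# coefficients match the coeffs array used in the solver code
```

The printed coefficients are exactly $[-b + d,\ 2(a - c + f),\ 6b,\ 2(-a + c + f),\ -b - d]$.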

Here is the code of my implementation. It contains the solver function and a function checking that the found solution is optimal according to the main optimality theorem for trust-region problems (see my introductory post on least-squares algorithms). Root-finding is done by numpy.roots, which I assume to be accurate and robust.

import numpy as np
from numpy.linalg import norm
from scipy.linalg import cho_factor, cho_solve, eigvalsh, orth, LinAlgError


def solve_2d_trust_region(B, g, Delta):
    """Solve a 2-dimensional general trust-region problem.

    Parameters
    ----------
    B : ndarray, shape (2, 2)
        Symmetric matrix, defines a quadratic term of the function.
    g : ndarray, shape (2,)
        Defines a linear term of the function.
    Delta : float
        Trust region radius.

    Returns
    -------
    p : ndarray, shape (2,)
        Found solution.
    newton_step : bool
        Whether the returned solution is the Newton step which lies within
        the trust region.
    """
    try:
        R, lower = cho_factor(B)
        p = -cho_solve((R, lower), g)
        if np.dot(p, p) <= Delta**2:
            return p, True
    except LinAlgError:
        pass

    a = B[0, 0] * Delta**2
    b = B[0, 1] * Delta**2
    c = B[1, 1] * Delta**2
    d = g[0] * Delta
    f = g[1] * Delta

    coeffs = np.array(
        [-b + d, 2 * (a - c + f), 6 * b, 2 * (-a + c + f), -b - d])
    t = np.roots(coeffs)  # Can handle leading zeros.
    t = np.real(t[np.isreal(t)])

    p = Delta * np.vstack((2 * t / (1 + t**2), (1 - t**2) / (1 + t**2)))
    value = 0.5 * np.sum(p * B.dot(p), axis=0) + np.dot(g, p)
    i = np.argmin(value)
    p = p[:, i]

    return p, False


def check_optimality(B, g, Delta, p, newton_step):
    """Check if a trust-region solution is optimal.

    An optimal solution p satisfies the following conditions for some
    alpha >= 0:

    1. (B + alpha*I) * p = -g.
    2. alpha * (||p|| - Delta) = 0.
    3. B + alpha * I is positive semidefinite.

    Returns
    -------
    alpha : float
        Corresponding alpha value, must be non-negative.
    collinearity : float
        Condition 1 check - norm((B + alpha * I) * p + g), must be very small.
    complementarity : float
        Condition 2 check - alpha * (norm(p) - Delta), must be very small.
    pos_def : float
        Condition 3 check - the minimum eigenvalue of B + alpha * I,
        must be non-negative.
    """
    if newton_step:
        alpha = 0.0
    else:
        q = B.dot(p) + g
        i = np.argmax(np.abs(p))
        alpha = -q[i] / p[i]

    A = B + alpha * np.identity(2)
    collinearity = norm(np.dot(A, p) + g)
    complementarity = alpha * (Delta - norm(p))
    pos_def = eigvalsh(A)[0]

    return alpha, collinearity, complementarity, pos_def


def matrix_with_spectrum(eigvalues):
    Q = orth(np.random.randn(eigvalues.size, eigvalues.size))
    return np.dot(Q * eigvalues, Q.T)


def test_on_random(n_tests):
    np.random.seed(0)
    print(("{:<20}" * 4).format(
        "alpha", "collinearity", "complementarity", "pos. def."))
    for i in range(n_tests):
        eigvalues = np.random.randn(2)
        B = matrix_with_spectrum(eigvalues)
        g = np.random.randn(2)
        Delta = 3.0 * np.random.rand(1)[0]
        p, newton_step = solve_2d_trust_region(B, g, Delta)
        print(("{:<20.1e}" * 4).format(
            *check_optimality(B, g, Delta, p, newton_step)))


if __name__ == '__main__':
    test_on_random(10)

The output after running the script:

alpha               collinearity        complementarity     pos. def.
0.0e+00             1.1e-16             0.0e+00             4.0e-01
1.1e+00             9.2e-16             0.0e+00             6.0e-01
4.8e+01             5.0e-16             3.3e-16             4.7e+01
8.9e+00             4.4e-16             0.0e+00             1.0e+01
0.0e+00             3.1e-16             0.0e+00             1.2e+00
2.6e+00             1.1e-16             0.0e+00             2.4e+00
9.1e-01             4.4e-15             0.0e+00             1.1e-02
2.9e+00             2.2e-16             -3.2e-16            2.2e+00
1.6e+00             1.2e-16             -1.8e-16            7.0e-01
1.8e+00             8.0e-15             7.8e-16             5.2e-01

The figures tell us that all found solutions are optimal (see the docstring of check_optimality). So, provided we have a good root-finding function, this approach is simple, robust and accurate.

Before I present the results I want to make a few notes.

- Initially I wanted to find a very accurate reference optimal value for each problem and measure the accuracy of an optimization process by comparison with it. I abandoned this idea for several reasons: a) in local optimization there isn’t a single correct minimum, all local minima are equally good, so ideally we should find all local minima, which can be hard, and the comparison logic with several minima becomes awkward; b) sources with problem descriptions often provide inaccurate (or plain incorrect) reference values, provide them only in single precision, or don’t provide them at all. Finding optimal values with MATLAB (for example) is cumbersome, and still we can’t be 100% sure of the required accuracy.
- It is desirable to compare algorithms with identical termination conditions. But this requirement is never satisfied in practice, as we work with already implemented algorithms. Also, there is no single correct way to specify termination conditions. So the termination conditions for the algorithms are somewhat different, and there is nothing we can do about it.

The methods benchmarked were dogbox, Trust Region Reflective, leastsqbound (also used in lmfit) and l-bfgs-b (this method doesn’t take into account the least-squares structure of a problem and works with the scalar objective $f = \frac{1}{2} \sum_i r_i^2$ and its gradient $\nabla f = J^T r$).

The columns have the following meaning:

- n – the number of independent variables.
- m – the number of residuals.
- solver – the algorithm used. The suffix “-s” means additional scaling of variables to equalize their influence on the objective function; this has nothing to do with the scaling applied in Trust Region Reflective. An equivalent point of view is the usage of an elliptical trust region. In the constrained case this scaling usually degrades performance, so I don’t show the results for it.
- nfev – the number of function evaluations done by the algorithm.
- g norm – first-order (gradient) optimality. In dogbox it is the infinity norm of the gradient with respect to the variables which aren’t on the boundary (optimality of the active variables is assured by the algorithm). For the other algorithms it is the infinity norm of the gradient scaled by the Coleman-Li matrix, read my post about it.
- value – the value of the objective function we are minimizing, the final sum of squares. It can serve as a rough measure of an algorithm’s adequacy (by comparison with “value” for other algorithms).
- active – the number of active constraints at the solution. Absolutely accurate for “dogbox”, somewhat arbitrary for the other algorithms (determined with a tolerance threshold).

The most important columns are “nfev” and “g norm”.

For all runs I used the tolerance parameters ftol = xtol = gtol = EPS**0.5, where EPS is the machine epsilon for double precision floating-point numbers. As I said above, termination conditions vary from method to method, so explaining each parameter for each method would be a tedious job.

The benchmark problems were taken from “The MINPACK-2 test problem collection” and “Moré, J.J., Garbow, B.S. and Hillstrom, K.E., Testing Unconstrained Optimization Software”; constraints for the latter collection were added according to “Gay, D.M., A trust-region approach to linearly constrained optimization”. Here is the very helpful page I used. All problems were run with an analytical (user-supplied) Jacobian computation routine.

The discussion of the results is below the table.

Unbounded problems

problem                    n    m  solver        nfev    g norm     value  active  status
-----------------------------------------------------------------------------------------
Beale                      3    2  dogbox           8  4.80e-11  1.02e-22       0       1
                                   dogbox-s         9  7.69e-12  2.58e-24       0       1
                                   trf              7  3.21e-11  4.50e-23       0       1
                                   trf-s            9  4.07e-11  7.26e-23       0       1
                                   leastsq         10  3.66e-15  5.92e-31       0       2
                                   leastsq-s        9  6.66e-16  4.93e-32       0       2
                                   l-bfgs-b        16  1.77e-07  1.95e-15       0       0
Biggs                      6   13  dogbox         140  1.84e-02  3.03e-01       0       2
                                   dogbox-s       600  3.92e-03  8.80e-03       0       0
                                   trf             65  2.18e-16  2.56e-31       0       1
                                   trf-s           43  1.18e-14  3.60e-29       0       1
                                   leastsq         74  7.24e-16  4.56e-31       0       2
                                   leastsq-s       40  1.65e-15  1.53e-30       0       2
                                   l-bfgs-b        42  1.23e-06  5.66e-03       0       0
Box3D                      3   10  dogbox           6  3.92e-10  1.14e-19       0       1
                                   dogbox-s         6  3.92e-10  1.14e-19       0       1
                                   trf              6  3.92e-10  1.14e-19       0       1
                                   trf-s            6  3.92e-10  1.14e-19       0       1
                                   leastsq          7  2.80e-16  4.62e-32       0       2
                                   leastsq-s        7  2.80e-16  4.62e-32       0       2
                                   l-bfgs-b        37  1.41e-07  3.42e-13       0       0
BrownAndDennis             4   20  dogbox         144  4.89e+00  8.58e+04       0       2
                                   dogbox-s       153  3.46e+00  8.58e+04       0       2
                                   trf             26  1.38e+00  8.58e+04       0       2
                                   trf-s          275  4.28e+00  8.58e+04       0       2
                                   leastsq         26  1.15e+00  8.58e+04       0       1
                                   leastsq-s      254  3.63e+00  8.58e+04       0       1
                                   l-bfgs-b        17  9.66e-01  8.58e+04       0       0
BrownBadlyScaled           2    3  dogbox          26  0.00e+00  0.00e+00       0       1
                                   dogbox-s        29  0.00e+00  0.00e+00       0       1
                                   trf             23  0.00e+00  0.00e+00       0       1
                                   trf-s           23  0.00e+00  0.00e+00       0       1
                                   leastsq         17  0.00e+00  0.00e+00       0       2
                                   leastsq-s       16  0.00e+00  0.00e+00       0       2
                                   l-bfgs-b        25  2.14e-02  2.06e-15       0       0
ChebyshevQuadrature10     10   10  dogbox          83  3.53e-05  6.50e-03       0       2
                                   dogbox-s        72  1.70e-05  6.50e-03       0       2
                                   trf             21  6.50e-07  6.50e-03       0       2
                                   trf-s           21  5.93e-06  6.50e-03       0       2
                                   leastsq         18  2.78e-06  6.50e-03       0       1
                                   leastsq-s       25  3.22e-06  6.50e-03       0       1
                                   l-bfgs-b        29  1.48e-05  6.50e-03       0       0
ChebyshevQuadrature11     11   11  dogbox         154  2.75e-05  2.80e-03       0       2
                                   dogbox-s       196  5.01e-05  2.80e-03       0       2
                                   trf             37  7.25e-06  2.80e-03       0       2
                                   trf-s           44  7.85e-06  2.80e-03       0       2
                                   leastsq         45  4.99e-06  2.80e-03       0       1
                                   leastsq-s       47  7.70e-06  2.80e-03       0       1
                                   l-bfgs-b        32  4.79e-04  2.80e-03       0       0
ChebyshevQuadrature7       7    7  dogbox           8  1.17e-12  4.82e-25       0       1
                                   dogbox-s        10  7.22e-15  1.62e-29       0       1
                                   trf              8  3.10e-15  2.60e-30       0       1
                                   trf-s            9  1.04e-08  3.76e-17       0       1
                                   leastsq          9  8.43e-16  7.65e-32       0       2
                                   leastsq-s        9  1.37e-15  1.96e-31       0       2
                                   l-bfgs-b        18  1.16e-05  7.14e-11       0       0
ChebyshevQuadrature8       8    8  dogbox          20  4.12e-06  3.52e-03       0       2
                                   dogbox-s        56  5.73e-06  3.52e-03       0       2
                                   trf             33  5.40e-06  3.52e-03       0       2
                                   trf-s           39  9.26e-06  3.52e-03       0       2
                                   leastsq         32  2.71e-06  3.52e-03       0       1
                                   leastsq-s       39  7.99e-06  3.52e-03       0       1
                                   l-bfgs-b        27  5.13e-06  3.52e-03       0       0
ChebyshevQuadrature9       9    9  dogbox          14  3.55e-15  9.00e-30       0       1
                                   dogbox-s        11  6.02e-13  2.72e-25       0       1
                                   trf             12  6.46e-13  3.15e-25       0       1
                                   trf-s            9  2.22e-10  6.38e-20       0       1
                                   leastsq         13  8.47e-16  7.64e-32       0       2
                                   leastsq-s       12  5.95e-16  5.85e-32       0       2
                                   l-bfgs-b        27  2.07e-05  2.53e-10       0       0
CoatingThickness         134  252  dogbox           7  2.33e-05  5.05e-01       0       2
                                   dogbox-s         7  2.33e-05  5.05e-01       0       2
                                   trf              7  2.33e-05  5.05e-01       0       2
                                   trf-s            7  2.33e-05  5.05e-01       0       2
                                   leastsq          7  2.33e-05  5.05e-01       0       1
                                   leastsq-s        7  2.33e-05  5.05e-01       0       1
                                   l-bfgs-b       281  5.54e-04  5.11e-01       0       0
EnzymeReaction             4   11  dogbox          23  1.32e-07  3.08e-04       0       2
                                   dogbox-s        21  1.36e-07  3.08e-04       0       2
                                   trf             20  1.30e-07  3.08e-04       0       2
                                   trf-s           24  1.17e-07  3.08e-04       0       2
                                   leastsq         23  7.13e-08  3.08e-04       0       1
                                   leastsq-s       18  7.96e-08  3.08e-04       0       1
                                   l-bfgs-b        30  2.53e-06  3.08e-04       0       0
ExponentialFitting         5   33  dogbox          10  1.56e-10  5.46e-05       0       1
                                   dogbox-s        10  1.56e-10  5.46e-05       0       1
                                   trf             19  1.29e-08  5.46e-05       0       1
                                   trf-s           20  3.28e-08  5.46e-05       0       2
                                   leastsq         20  3.23e-08  5.46e-05       0       1
                                   leastsq-s       18  1.94e-08  5.46e-05       0       1
                                   l-bfgs-b        44  9.98e-05  7.68e-05       0       0
ExtendedPowellSingular     4    4  dogbox          13  2.33e-09  5.72e-13       0       1
                                   dogbox-s        13  2.33e-09  5.72e-13       0       1
                                   trf             13  2.33e-09  5.72e-13       0       1
                                   trf-s           13  2.33e-09  5.72e-13       0       1
                                   leastsq         37  4.93e-31  7.22e-42       0       4
                                   leastsq-s       37  4.93e-31  7.22e-42       0       4
                                   l-bfgs-b        27  1.77e-04  3.70e-08       0       0
FreudensteinAndRoth        2    2  dogbox           6  0.00e+00  0.00e+00       0       1
                                   dogbox-s         9  1.84e-11  1.95e-25       0       1
                                   trf              6  1.57e-10  1.41e-23       0       1
                                   trf-s            9  8.41e-11  4.07e-24       0       1
                                   leastsq          8  1.78e-14  3.16e-30       0       2
                                   leastsq-s       10  0.00e+00  0.00e+00       0       2
                                   l-bfgs-b        15  5.29e-06  1.54e-13       0       0
GaussianFittingI          11   65  dogbox          14  1.27e-07  4.01e-02       0       2
                                   dogbox-s        15  3.25e-07  4.01e-02       0       2
                                   trf             13  1.75e-07  4.01e-02       0       2
                                   trf-s           16  1.93e-07  4.01e-02       0       2
                                   leastsq         13  5.89e-07  4.01e-02       0       1
                                   leastsq-s       16  1.77e-06  4.01e-02       0       1
                                   l-bfgs-b        69  8.67e-05  4.01e-02       0       0
GaussianFittingII          3   15  dogbox           3  5.93e-13  1.13e-08       0       1
                                   dogbox-s         3  5.93e-13  1.13e-08       0       1
                                   trf              3  5.93e-13  1.13e-08       0       1
                                   trf-s            3  5.93e-13  1.13e-08       0       1
                                   leastsq          4  1.25e-16  1.13e-08       0       2
                                   leastsq-s        4  1.25e-16  1.13e-08       0       2
                                   l-bfgs-b         4  5.81e-06  1.18e-08       0       0
GulfRnD                    3  100  dogbox          20  9.12e-09  1.83e-18       0       1
                                   dogbox-s        22  1.00e-15  5.87e-31       0       1
                                   trf             16  1.00e-15  5.87e-31       0       1
                                   trf-s           25  1.26e-08  3.51e-18       0       1
                                   leastsq         16  1.61e-15  7.12e-31       0       2
                                   leastsq-s       23  1.61e-15  7.12e-31       0       2
                                   l-bfgs-b        60  2.51e-06  1.25e-12       0       0
HelicalValley              3    3  dogbox           9  8.37e-12  3.13e-25       0       1
                                   dogbox-s        19  5.81e-09  3.94e-19       0       1
                                   trf             13  1.68e-13  1.16e-28       0       1
                                   trf-s           13  1.78e-11  1.26e-24       0       1
                                   leastsq         16  2.50e-29  2.46e-60       0       2
                                   leastsq-s       11  1.58e-15  9.87e-33       0       2
                                   l-bfgs-b        32  6.79e-07  3.28e-15       0       0
JenrichAndSampson10        2   10  dogbox          22  7.17e-02  1.24e+02       0       2
                                   dogbox-s        21  5.51e-02  1.24e+02       0       2
                                   trf             20  6.21e-03  1.24e+02       0       2
                                   trf-s           20  5.86e-02  1.24e+02       0       2
                                   leastsq         20  2.76e-04  1.24e+02       0       1
                                   leastsq-s       21  2.90e-02  1.24e+02       0       1
                                   l-bfgs-b        63  1.86e+03  nan            0       2
PenaltyI                  10   11  dogbox          35  4.90e-09  7.09e-05       0       1
                                   dogbox-s        25  2.89e-09  7.09e-05       0       1
                                   trf             38  3.98e-08  7.09e-05       0       2
                                   trf-s           69  2.93e-08  7.09e-05       0       2
                                   leastsq         26  7.54e-08  7.09e-05       0       1
                                   leastsq-s       79  1.67e-08  7.09e-05       0       1
                                   l-bfgs-b        20  8.58e-06  7.45e-05       0       0
PenaltyII10               10   20  dogbox          71  8.64e-07  2.91e-04       0       2
                                   dogbox-s        33  4.15e-06  2.91e-04       0       2
                                   trf             50  2.01e-06  2.91e-04       0       2
                                   trf-s           32  4.98e-07  2.91e-04       0       2
                                   leastsq         47  2.77e-07  2.91e-04       0       1
                                   leastsq-s       58  6.64e-08  2.91e-04       0       1
                                   l-bfgs-b        14  2.31e-06  2.91e-04       0       0
PenaltyII4                 4    8  dogbox          27  1.64e-07  9.31e-06       0       2
                                   dogbox-s        29  2.96e-07  9.31e-06       0       2
                                   trf             24  3.42e-07  9.31e-06       0       2
                                   trf-s           85  8.46e-08  9.31e-06       0       2
                                   leastsq         70  7.35e-08  9.31e-06       0       1
                                   leastsq-s      111  2.74e-08  9.31e-06       0       1
                                   l-bfgs-b        19  1.16e-06  9.61e-06       0       0
PowellBadlyScaled          2    2  dogbox          43  4.89e-09  2.89e-27       0       1
                                   dogbox-s        67  2.02e-11  9.86e-32       0       1
                                   trf             43  4.90e-09  2.90e-27       0       1
                                   trf-s           19  0.00e+00  0.00e+00       0       1
                                   leastsq         72  1.01e-11  6.16e-32       0       2
                                   leastsq-s       19  1.01e-11  1.23e-32       0       2
                                   l-bfgs-b         4  1.35e-01  1.35e-01       0       0
Rosenbrock                 2    2  dogbox          20  0.00e+00  0.00e+00       0       1
                                   dogbox-s        18  0.00e+00  0.00e+00       0       1
                                   trf             18  0.00e+00  0.00e+00       0       1
                                   trf-s           20  0.00e+00  0.00e+00       0       1
                                   leastsq         15  0.00e+00  0.00e+00       0       4
                                   leastsq-s       14  0.00e+00  0.00e+00       0       4
                                   l-bfgs-b        47  1.89e-06  1.31e-14       0       0
ThermistorResistance       3   16  dogbox         300  5.91e+06  1.61e+02       0       0
                                   dogbox-s       291  1.95e+00  8.79e+01       0       2
                                   trf            262  5.32e-04  8.79e+01       0       2
                                   trf-s          202  3.10e-04  8.79e+01       0       3
                                   leastsq        279  7.49e-04  8.79e+01       0       2
                                   leastsq-s      216  1.68e+01  8.79e+01       0       3
                                   l-bfgs-b       633  3.16e+00  3.17e+04       0       0
Trigonometric             10   10  dogbox          10  1.54e-11  9.90e-22       0       1
                                   dogbox-s        65  3.66e-07  2.80e-05       0       2
                                   trf             26  1.42e-07  2.80e-05       0       2
                                   trf-s           31  2.08e-08  2.80e-05       0       2
                                   leastsq         25  3.84e-08  2.80e-05       0       1
                                   leastsq-s       28  5.67e-08  2.80e-05       0       1
                                   l-bfgs-b        28  1.63e-06  2.80e-05       0       0
Watson12                  12   31  dogbox           7  5.77e-10  4.72e-10       0       1
                                   dogbox-s        12  1.50e-13  4.72e-10       0       1
                                   trf              6  1.56e-10  5.98e-10       0       1
                                   trf-s            8  2.19e-10  2.16e-09       0       1
                                   leastsq          9  8.94e-14  4.72e-10       0       2
                                   leastsq-s        9  3.63e-11  4.72e-10       0       3
                                   l-bfgs-b        52  4.20e-05  1.35e-05       0       0
Watson20                  20   31  dogbox          11  4.90e-12  2.48e-20       0       1
                                   dogbox-s        19  6.35e-10  2.60e-20       0       1
                                   trf              7  1.36e-12  1.63e-19       0       1
                                   trf-s            8  1.32e-08  7.10e-18       0       1
                                   leastsq         17  8.65e-13  2.48e-20       0       2
                                   leastsq-s       20  1.08e-11  2.49e-20       0       2
                                   l-bfgs-b        69  2.66e-05  7.28e-06       0       0
Watson6                    6   31  dogbox           8  5.16e-08  2.29e-03       0       2
                                   dogbox-s        10  5.62e-08  2.29e-03       0       2
                                   trf              8  5.16e-08  2.29e-03       0       2
                                   trf-s           11  1.11e-07  2.29e-03       0       2
                                   leastsq          8  5.16e-08  2.29e-03       0       1
                                   leastsq-s        8  5.16e-08  2.29e-03       0       1
                                   l-bfgs-b        44  5.28e-06  2.29e-03       0       0
Watson9                    9   31  dogbox           7  3.21e-13  1.40e-06       0       1
                                   dogbox-s         9  8.26e-12  1.40e-06       0       1
                                   trf              6  2.91e-11  1.40e-06       0       1
                                   trf-s           10  4.29e-12  1.40e-06       0       1
                                   leastsq          7  1.47e-11  1.40e-06       0       1
                                   leastsq-s        7  1.70e-11  1.40e-06       0       4
                                   l-bfgs-b        40  1.27e-04  6.51e-05       0       0
Wood                       4    6  dogbox          73  0.00e+00  0.00e+00       0       1
                                   dogbox-s        66  0.00e+00  0.00e+00       0       1
                                   trf             74  0.00e+00  0.00e+00       0       1
                                   trf-s           67  5.53e-12  9.26e-26       0       1
                                   leastsq         69  0.00e+00  0.00e+00       0       2
                                   leastsq-s       70  0.00e+00  0.00e+00       0       2
                                   l-bfgs-b        20  6.39e-04  7.88e+00       0       0

Bounded problems

problem                    n    m  solver        nfev    g norm     value  active  status
-----------------------------------------------------------------------------------------
Beale_B                    3    2  dogbox           4  0.00e+00  0.00e+00       0       1
                                   trf             19  8.74e-09  2.11e-10       0       1
                                   leastsqbound    12  1.17e-09  1.99e-20       1       2
                                   l-bfgs-b         5  5.83e-15  4.44e-31       1       0
Biggs_B                    6   13  dogbox          32  1.61e-08  5.32e-04       2       2
                                   trf             24  3.96e-10  5.32e-04       2       1
                                   leastsqbound    63  5.95e-04  5.90e-04       2       1
                                   l-bfgs-b        70  1.52e-03  5.79e-04       2       0
Box3D_B                    3   10  dogbox           8  1.45e-10  1.14e-04       1       1
                                   trf             13  6.55e-09  1.14e-04       0       1
                                   leastsqbound    18  1.75e-08  1.14e-04       0       1
                                   l-bfgs-b        16  1.08e-03  1.18e-04       0       0
BrownAndDennis_B           4   20  dogbox          78  9.28e+00  8.89e+04       2       2
                                   trf             41  4.98e+01  8.89e+04       0       2
                                   leastsqbound   271  8.63e-01  8.89e+04       1       1
                                   l-bfgs-b        18  5.66e-01  8.89e+04       2       0
BrownBadlyScaled_B         2    3  dogbox          33  1.11e-10  7.84e+02       1       1
                                   trf             39  8.25e-05  7.84e+02       1       3
                                   leastsqbound   300  3.14e+00  7.87e+02       0       5
                                   l-bfgs-b         7  1.44e-11  7.84e+02       1       0
ChebyshevQuadrature10_B   10   10  dogbox         147  1.64e-05  6.50e-03       0       2
                                   trf             40  1.03e-06  4.77e-03       0       2
                                   leastsqbound    55  1.14e-06  4.77e-03       0       1
                                   l-bfgs-b        50  1.78e-06  4.77e-03       0       0
ChebyshevQuadrature7_B     7    7  dogbox          15  2.75e-07  6.03e-04       2       2
                                   trf             15  3.95e-08  6.03e-04       2       2
                                   leastsqbound    33  9.61e-08  6.03e-04       0       1
                                   l-bfgs-b        29  2.18e-05  6.03e-04       2       0
ChebyshevQuadrature8_B     8    8  dogbox          81  5.34e-06  3.59e-03       1       2
                                   trf            127  1.12e-06  3.59e-03       0       2
                                   leastsqbound   900  1.33e-06  3.59e-03       0       5
                                   l-bfgs-b        46  2.92e-06  3.59e-03       1       0
ExtendedPowellSingular_B   4    4  dogbox          20  2.42e-07  1.88e-04       1       2
                                   trf             16  7.36e-09  1.88e-04       1       1
                                   leastsqbound    23  5.59e-08  1.88e-04       1       1
                                   l-bfgs-b        29  6.06e-05  1.88e-04       1       0
GaussianFittingII_B        3   15  dogbox           3  5.93e-13  1.13e-08       0       1
                                   trf              5  2.64e-10  1.13e-08       0       1
                                   leastsqbound    12  1.43e-15  1.13e-08       0       1
                                   l-bfgs-b         5  7.93e-09  1.84e-08       0       0
GulfRnD_B                  3  100  dogbox          10  7.29e-05  5.29e+00       2       2
                                   trf              9  9.03e-07  5.29e+00       1       2
                                   leastsqbound    22  4.71e-05  5.29e+00       0       1
                                   l-bfgs-b        29  4.11e-01  6.49e+00       0       0
HelicalValley_B            3    3  dogbox           9  2.69e-05  9.90e-01       1       2
                                   trf             14  6.10e-05  9.90e-01       1       2
                                   leastsqbound   125  4.24e-03  9.90e-01       1       1
                                   l-bfgs-b        17  2.63e-05  9.90e-01       1       0
PenaltyI_B                10   11  dogbox          16  1.00e-05  7.56e+00       3       2
                                   trf             17  8.72e-04  7.56e+00       3       2
                                   leastsqbound   328  2.56e-06  7.56e+00       3       1
                                   l-bfgs-b         5  8.56e-04  7.56e+00       3       0
PenaltyII10_B             10   20  dogbox          30  9.20e-07  2.91e-04       2       2
                                   trf            304  5.20e-06  2.91e-04       1       2
                                   leastsqbound   297  2.18e-07  2.91e-04       2       1
                                   l-bfgs-b        23  2.59e-04  2.92e-04       0       0
PenaltyII4_B               4    8  dogbox          29  1.40e-12  9.35e-06       2       1
                                   trf            193  2.91e-09  9.35e-06       0       1
                                   leastsqbound    78  1.39e-08  9.35e-06       1       1
                                   l-bfgs-b        14  3.07e-05  9.50e-06       0       0
PowellBadlyScaled_B        2    2  dogbox          38  4.07e-12  1.51e-10       1       1
                                   trf            100  3.02e-11  2.07e-10       0       1
                                   leastsqbound   220  1.82e-06  1.51e-10       1       1
                                   l-bfgs-b         5  1.08e+00  1.35e-01       0       0
Rosenbrock_B_0             2    2  dogbox          17  0.00e+00  0.00e+00       0       1
                                   trf             23  0.00e+00  0.00e+00       1       1
                                   leastsqbound     6  1.11e-13  1.97e-29       1       2
                                   l-bfgs-b        47  1.89e-06  1.31e-14       1       0
Rosenbrock_B_1             2    2  dogbox           6  4.97e-09  5.04e-02       1       1
                                   trf              9  1.59e-07  5.04e-02       1       2
                                   leastsqbound    21  2.66e-07  5.04e-02       1       1
                                   l-bfgs-b        23  1.68e-06  5.04e-02       1       0
Rosenbrock_B_2             2    2  dogbox           6  2.27e-06  4.94e+00       1       2
                                   trf              9  4.36e-07  4.94e+00       1       2
                                   leastsqbound    18  3.44e-05  4.94e+00       1       1
                                   l-bfgs-b        19  1.04e-05  4.94e+00       1       0
Rosenbrock_B_3             2    2  dogbox           3  0.00e+00  2.50e+01       2       1
                                   trf              8  3.27e-09  2.50e+01       2       1
                                   leastsqbound    19  4.38e-09  2.50e+01       2       1
                                   l-bfgs-b         3  0.00e+00  2.50e+01       2       0
Rosenbrock_B_4             2    2  dogbox           6  4.97e-09  5.04e-02       1       1
                                   trf             14  1.06e-08  5.04e-02       1       1
                                   leastsqbound    20  7.03e-06  5.04e-02       0       1
                                   l-bfgs-b        21  2.73e-10  5.04e-02       1       0
Rosenbrock_B_5             2    2  dogbox          12  0.00e+00  2.50e-01       1       1
                                   trf             20  5.05e-06  2.50e-01       1       2
                                   leastsqbound    24  8.47e-08  2.50e-01       1       1
                                   l-bfgs-b        27  0.00e+00  2.50e-01       1       0
Trigonometric_B           10   10  dogbox         117  3.31e-07  2.80e-05       0       2
                                   trf             37  7.67e-07  2.80e-05       0       2
                                   leastsqbound    64  3.99e-08  2.80e-05       0       1
                                   l-bfgs-b        34  2.61e-04  4.22e-05       0       0
Watson12_B                12   31  dogbox        1200  1.30e-03  7.17e-02       5       0
                                   trf            171  4.50e-05  7.16e-02       6       2
                                   leastsqbound    13  1.02e+02  1.71e+01      12       1
                                   l-bfgs-b       101  9.37e-02  7.28e-02       6       0
Watson9_B                  9   31  dogbox           5  1.79e+01  4.91e+00       3       2
                                   trf             26  1.87e-09  3.74e-02       5       1
                                   leastsqbound   462  2.89e-03  3.91e-02       2       1
                                   l-bfgs-b       285  5.72e-05  3.74e-02       5       0
Wood_B                     4    6  dogbox          63  5.05e-07  1.56e+00       1       2
                                   trf             29  2.12e-08  1.56e+00       1       2
                                   leastsqbound    43  4.17e-05  1.56e+00       1       1
                                   l-bfgs-b        20  8.38e-03  1.56e+00       1       0

For unbounded problems “leastsq” and “trf” are generally comparable, with “leastsq” being modestly better. This is easily explained, as the algorithms are almost equivalent, but “leastsq” uses a smarter strategy for decreasing the trust-region radius; perhaps this issue is worth investigating. My second algorithm “dogbox” is less robust and fails on some problems (most of them have a rank-deficient Jacobian). The general-purpose “l-bfgs-b” is generally not as good as the least-squares algorithms, but can be used with satisfactory results.

On bounded problems “trf”, “dogbox” and “l-bfgs-b” all do reasonably well, with performance varying over the problems. I see one big failure of “dogbox” on “Watson9_B”; all other problems were solved relatively successfully by all 3 methods. I suspect that the performance of “l-bfgs-b” might degrade on high-dimensional problems, but for small constrained problems this method proved to be very solid, so use it! (At least until I add the new least-squares methods to scipy.) And I just fixed leastsqbound, so now it works OK!


This problem is interesting by itself, and I’m sure there are a lot of theories and methods for solving it. But we are going to use perhaps the simplest approach, called “dogleg”. It can be applied when $B$ is positive definite. In this case we compute the Newton step $p^{\mathrm{N}} = -B^{-1} g$ (the Gauss-Newton step for least squares) and the so-called Cauchy step, which is the unconstrained minimizer of our quadratic model along the anti-gradient:

$p^{\mathrm{C}} = -\frac{g^T g}{g^T B g}\, g.$

And define the dogleg path as follows:

$p(\tau) = \begin{cases} \tau\, p^{\mathrm{C}}, & 0 \le \tau \le 1, \\ p^{\mathrm{C}} + (\tau - 1)(p^{\mathrm{N}} - p^{\mathrm{C}}), & 1 \le \tau \le 2. \end{cases}$

First we move towards the Cauchy point and then from the Cauchy point to the Newton point. It can be proven that the value of the quadratic model is a decreasing function of $\tau$ along this path, which means we should follow the dogleg path as long as we stay inside the trust region. For a spherical trust region there is no more than one intersection of the dogleg path with the boundary, which makes the method especially simple and clear. The last statement is not true for a rectangular trust region, but it is still not hard to find the optimum along the dogleg path within the trust region, or we can just stop at the first intersection; a slightly improved strategy is suggested in the paper I linked above.
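For the spherical trust region the whole step computation fits in a few lines; this is my own sketch of the classical dogleg step (`dogleg_step` is a hypothetical helper, not the dogbox code, and it assumes $B$ is positive definite):

```python
import numpy as np

def dogleg_step(B, g, Delta):
    """Minimize 0.5*p^T B p + g^T p over the dogleg path, ||p|| <= Delta."""
    p_newton = -np.linalg.solve(B, g)
    if np.linalg.norm(p_newton) <= Delta:
        return p_newton  # Newton step already lies inside the region.
    p_cauchy = -(g @ g) / (g @ B @ g) * g
    if np.linalg.norm(p_cauchy) >= Delta:
        # Even the Cauchy step leaves the region: stop at the boundary.
        return -Delta * g / np.linalg.norm(g)
    # Find tau in [0, 1] where p_cauchy + tau*(p_newton - p_cauchy)
    # crosses ||p|| = Delta (a quadratic equation in tau).
    d = p_newton - p_cauchy
    a = d @ d
    b = 2 * p_cauchy @ d
    c = p_cauchy @ p_cauchy - Delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_cauchy + tau * d
```

Depending on the radius, the returned step is the Newton step, a boundary point on the second dogleg segment, or a truncated Cauchy step.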

If during the iterations some variable hits one of its initial bounds (from $l$ or $u$) and the corresponding component of the anti-gradient points outside the feasible region, then a dogleg step won’t make any progress. At this state such variables satisfy the first-order optimality conditions, and we should exclude them before taking the next dogleg step.

When $B$ is only positive semidefinite we don’t have a proper Newton step; in this case we should compute a regularized Newton step by increasing the diagonal elements of $B$ by a proper amount. As far as I know there is no universal and 100% satisfactory recipe for doing that (but I mentioned a paper with one solution in the previous post).

And that’s basically the whole algorithm. It may seem a little hackish, but from experience it works adequately on unconstrained least-squares problems when $J$ is not rank deficient. (I haven’t implemented any tweaks for that case so far.) On bounded problems its performance varies, but so it does for Trust Region Reflective.

There is a notion of “combinatorial difficulty of inequality-constrained problems” which burdens methods that try to determine which constraints are active at the optimal solution (active-set methods). The dogbox algorithm does something like that, but to my shame I have no idea how it will behave in this regard when the number of variables becomes large. On the other hand, Trust Region Reflective is expected to work very well in a high-dimensional setting; this was mentioned as its strongest point by the authors.

So far I have been focusing on the general logic and workflow of each algorithm and testing them on small problems; I will publish the results in the next post.


The minimization problem is stated as follows:

$\min_x f(x) \quad \text{subject to} \quad l \le x \le u.$

Some of the components of $l$ and $u$ can be infinite, meaning no bound in that direction. Let’s use the notation $g(x) = \nabla f(x)$ and $H(x) = \nabla^2 f(x)$. The first-order necessary conditions for $x^*$ to be a local minimum are:

$g_i(x^*) = 0 \ \text{if} \ l_i < x^*_i < u_i, \qquad g_i(x^*) \ge 0 \ \text{if} \ x^*_i = l_i, \qquad g_i(x^*) \le 0 \ \text{if} \ x^*_i = u_i.$

Define a vector $v(x)$ with the following components:

$v_i = \begin{cases} u_i - x_i, & g_i < 0 \ \text{and} \ u_i < \infty, \\ x_i - l_i, & g_i > 0 \ \text{and} \ l_i > -\infty, \\ 1, & \text{otherwise.} \end{cases}$

Its components are the distances to the bounds at which the anti-gradient points (when this distance is finite). Define a matrix $D = \operatorname{diag}(v^{1/2})$; the first-order optimality can then be stated as $D^2 g = 0$. Now we can think of our optimization problem as a diagonal system of nonlinear equations (I would say it is the main idea of this part):

$D^2(x)\, g(x) = 0.$

The Jacobian of the left-hand side exists whenever $v_i \neq 0$ for all $i$, which is true when $x$ is not on a bound. Assume that this holds; then the Newton step $p$ for this system satisfies:

$(D^2 H + \operatorname{diag}(g)\, J^v)\, p = -D^2 g.$

Here $J^v$ is the diagonal Jacobian matrix of $v$; its elements take values $0$ or $\pm 1$, and note that all elements of the matrix $\operatorname{diag}(g)\, J^v$ are non-negative. Now introduce the change of variables $x = D \hat{x}$. In the new variables the Newton step satisfies

$\hat{B} \hat{p} = -\hat{g}, \qquad \hat{B} = D H D + \operatorname{diag}(g)\, J^v, \qquad \hat{g} = D g$

(note that $\hat{g}$ is a proper gradient of $f$ with respect to the “hat” variables). Looking at this Newton step we formulate the corresponding trust-region problem:

$\min_{\hat{p}} \frac{1}{2} \hat{p}^T \hat{B} \hat{p} + \hat{g}^T \hat{p} \quad \text{subject to} \quad \lVert \hat{p} \rVert \le \Delta.$

In the original space, with $p = D \hat{p}$, the Newton step satisfies:

$\left(H + D^{-1} \operatorname{diag}(g)\, J^v D^{-1}\right) p = -g,$

and the equivalent trust-region problem is

$\min_p \frac{1}{2} p^T \left(H + D^{-1} \operatorname{diag}(g)\, J^v D^{-1}\right) p + g^T p \quad \text{subject to} \quad \lVert D^{-1} p \rVert \le \Delta.$

From my experience the better approach is to solve the trust-region problem in the “hat” space, so we don’t need to compute $D^{-1}$, whose elements can become arbitrarily large when the optimum is on the boundary and the algorithm approaches it.

A modified improvement ratio of our trust-region solution is computed as follows:

$\rho = \frac{f(x + p) - f(x) + \frac{1}{2} \hat{p}^T C \hat{p}}{\frac{1}{2} \hat{p}^T \hat{B} \hat{p} + \hat{g}^T \hat{p}}, \qquad C = \operatorname{diag}(g)\, J^v,$

i.e. the actual reduction is augmented by the quadratic term introduced by $C$, so that it is compared against the full “hat”-space model.

Based on $\rho$ we adjust the trust-region radius using some reasonable strategy.

Now a summary and conclusion for this section. Motivated by the first-order optimality condition we introduced the matrix $D$ and reformulated our problem as a system of nonlinear equations. Then, motivated by the Newton process for this system, we formulated the corresponding trust-region problem. The purpose of the matrix $D$ is to prevent steps directly into the bounds, so that other variables can also be explored during the step. It absolutely doesn’t mean that after introducing such a matrix we can ignore the bounds; in particular, our estimates must remain strictly feasible. The full algorithm will be described below.
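The construction of $v$ can be sketched in a few lines of numpy (my own illustration of the definition above; `coleman_li_scaling` is a hypothetical helper name):

```python
import numpy as np

def coleman_li_scaling(x, g, lb, ub):
    """Distance-to-bound vector v; D = diag(sqrt(v)) is the scaling matrix."""
    v = np.ones_like(x)
    # Anti-gradient points towards a finite upper bound.
    mask = (g < 0) & np.isfinite(ub)
    v[mask] = ub[mask] - x[mask]
    # Anti-gradient points towards a finite lower bound.
    mask = (g > 0) & np.isfinite(lb)
    v[mask] = x[mask] - lb[mask]
    return v
```

With this helper the first-order optimality condition $D^2 g = 0$ reads simply `v * g == 0`: each component of the gradient is either zero itself or multiplied by a vanishing distance to the bound it points at.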

This idea comes from another paper, “On the convergence of reflective Newton methods for large-scale nonlinear minimization subject to bounds”, by the same authors. Conceptually we apply a special transformation $x = T(y)$, such that $y$ is an unbounded variable, and try to solve the unconstrained problem $\min_y f(T(y))$. The authors suggest a reflective transformation: a piecewise linear function, equal to the identity when $y$ satisfies the initial bound constraints, and otherwise reflected from the bounds as a beam of light (I hope you got the idea). I implemented it as follows (although I don’t use this code anywhere):

import numpy as np

def reflective_transformation(y, l, u):
    if l is None:
        l = np.full_like(y, -np.inf)
    if u is None:
        u = np.full_like(y, np.inf)
    l_fin = np.isfinite(l)
    u_fin = np.isfinite(u)
    x = y.copy()

    # one-sided lower bound: reflect from l
    m = l_fin & ~u_fin
    x[m] = np.maximum(y[m], 2 * l[m] - y[m])

    # one-sided upper bound: reflect from u
    m = ~l_fin & u_fin
    x[m] = np.minimum(y[m], 2 * u[m] - y[m])

    # two-sided bounds: triangle wave with period 2 * (u - l)
    m = l_fin & u_fin
    d = u - l
    t = np.remainder(y[m] - l[m], 2 * d[m])
    x[m] = l[m] + np.minimum(t, 2 * d[m] - t)

    return x

This transformation is simple and doesn't significantly increase the complexity of the function to minimize. But it is not differentiable when x is on the bounds, thus we again use strictly feasible iterates. The general idea of the reflective Newton method is to do a line search along the reflective path in x space (or a traditional straight line in y space). According to the authors this method has nice properties, but it is used very modestly in the final large-scale Trust Region Reflective.
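The two-sided case can be checked directly with a tiny self-contained example (values mine):

```python
import numpy as np

# Two-sided reflection on [l, u] = [0, 1]: fold y onto the interval with
# period 2 * (u - l), reflecting off each bound like a beam of light.
l, u = 0.0, 1.0
d = u - l
y = np.array([0.3, 1.2, -0.4, 2.5])
t = np.remainder(y - l, 2 * d)
x = l + np.minimum(t, 2 * d - t)
# 0.3 stays put, 1.2 reflects off u to 0.8, -0.4 reflects off l to 0.4,
# 2.5 wraps one full period and maps to 0.5
```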

In the previous post I conceptually described how to accurately solve trust-region subproblems arising in least-squares minimization. Here I again focus on the least-squares setting and briefly describe how the subproblem can be solved approximately in the large-scale case.

- Steihaug Conjugate Gradient. Apply the conjugate gradient method to the normal equation J^T J p = -J^T f until the current approximate solution falls outside the trust region (or a zero-curvature direction is found when J is rank deficient). This actually might be just the best approach for least squares, as we don't have negative curvature directions in J^T J, and the only criticism of Steihaug-CG I have read is that it can terminate before finding a negative curvature direction. I would assume that this is not very important in the positive semidefinite case.
- Two-dimensional subspace minimization. We form a basis consisting of two "good" vectors, then solve the two-dimensional trust-region problem with the exact method. The first vector is the gradient, the second is an approximate solution of linear least squares with the current Jacobian J (computed by LSQR or LSMR). When the Jacobian is rank deficient the situation is somewhat problematic: as I noticed, in this case a least-norm solution is useless for approximating a trust-region solution. In this case we need to add a (not too big) regularizing diagonal term to J^T J. A recipe for this situation is given in "Approximate solution of the trust region problem by minimization over two-dimensional subspaces".
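The first approach can be sketched as follows, assuming a dense numpy matrix (a sparse matrix would work the same way, since only products with J and J^T are used). The function names and signatures are mine, purely illustrative:

```python
import numpy as np

def _to_boundary(p, d, Delta):
    # positive root t of ||p + t * d|| = Delta
    a, b, c = d.dot(d), 2 * p.dot(d), p.dot(p) - Delta**2
    return (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)

def steihaug_cg(J, f, Delta, tol=1e-12, max_iter=None):
    # Truncated CG on the normal equation (J^T J) p = -J^T f:
    # iterate until convergence, until the step leaves the trust region,
    # or until a zero-curvature direction is met (rank-deficient J).
    n = J.shape[1]
    if max_iter is None:
        max_iter = 2 * n
    p = np.zeros(n)
    r = -J.T.dot(f)          # residual of the normal equation
    d = r.copy()
    rr = r.dot(r)
    if np.sqrt(rr) < tol:
        return p
    for _ in range(max_iter):
        Jd = J.dot(d)
        curv = Jd.dot(Jd)    # d^T (J^T J) d >= 0: never negative here
        if curv <= 1e-15 * rr:
            # zero curvature: follow d all the way to the boundary
            return p + _to_boundary(p, d, Delta) * d
        alpha = rr / curv
        if np.linalg.norm(p + alpha * d) >= Delta:
            # the step would leave the trust region: stop on the boundary
            return p + _to_boundary(p, d, Delta) * d
        p = p + alpha * d
        r = r - alpha * J.T.dot(Jd)
        rr_new = r.dot(r)
        if np.sqrt(rr_new) < tol:
            break
        d = r + (rr_new / rr) * d
        rr = rr_new
    return p

J = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
f = np.array([1.0, 1.0, 1.0])
p_big = steihaug_cg(J, f, Delta=10.0)   # interior: full Gauss-Newton step
p_small = steihaug_cg(J, f, Delta=0.1)  # truncated on the boundary
```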

Here is the high level description.

1. Consider the trust-region problem in "hat" space as described in the first section.
2. Find its solution by whatever method is appropriate (exact for small problems, approximate for large-scale ones). Compute the corresponding solution in the original space.
3. Restrict this trust-region step to lie within the bounds if necessary, then step back from the bounds by a small fraction of the step length. Do this for all types of steps below.
4. If a bound was encountered in 3, consider a single reflection of the trust-region step. Use 1-d minimization of the quadratic model to find the minimum along the reflected direction (this is trivial).
5. Find the minimum of the quadratic model along the scaled anti-gradient. (Rarely it can be better than the trust-region step because of the bounds.)
6. Choose the best step among 3, 4, 5. Compute the corresponding step in the original space as in 2, and update x.
7. Update the trust-region radius by computing rho as described in the first section.
8. Check for convergence and go to 1 if the algorithm has not converged.
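The bound restriction step can be sketched like this; the step-back factor theta and the function name are illustrative assumptions of mine, not necessarily what the real algorithm uses:

```python
import numpy as np

def restricted_step(x, p, lb, ub, theta=0.995):
    # Find the largest stride t in [0, 1] such that x + t * p stays
    # within [lb, ub]; if a bound is hit, step back by the factor theta
    # so the iterate remains strictly feasible.
    with np.errstate(divide="ignore", invalid="ignore"):
        steps = np.where(p > 0, (ub - x) / p, (lb - x) / p)
    steps[p == 0] = np.inf
    t = min(1.0, steps.min())
    hit = t < 1.0          # True if a bound was encountered
    if hit:
        t *= theta
    return x + t * p, hit

x = np.array([0.5, 0.5])
p = np.array([1.0, 0.1])
lb, ub = np.zeros(2), np.ones(2)
x_new, hit = restricted_step(x, p, lb, ub)
# the full step would violate ub in the first component, so the stride
# is cut to 0.5 and then pulled back by theta
```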

In the next two posts I will describe another type of algorithm which we call “dogbox” and provide comparison benchmark results.

]]>

The first thing we decided I should do is to provide a plain English description of the methods I studied and implemented. To make it more accessible I decided to write this post as an introduction to nonlinear least-squares optimization. So here I start.

The objective function we seek to minimize has the form

F(x) = 0.5 ||f(x)||^2 = 0.5 (f_1(x)^2 + ... + f_m(x)^2),

here we introduced the residual vector f(x) = (f_1(x), ..., f_m(x))^T and we want to minimize the square of its Euclidean norm. Each component f_i is a smooth function from R^n to R. We treat f as a vector-valued function whose m x n Jacobian matrix J is defined as follows:

J_ij = ∂f_i / ∂x_j.

In other words, the i-th row of J contains the transposed gradient of f_i. Now it is easy to verify that the gradient and the Hessian (the matrix containing all second partial derivatives) of the objective function have the form:

∇F = J^T f,    ∇²F = J^T J + sum_i f_i ∇²f_i.

Notice that the second term in the Hessian will be small if a) the residuals are small near the optimum or b) the residuals depend on x approximately linearly (possibly only near the optimum). Both conditions are often satisfied in practice, which leads to the main idea of least-squares minimization: use an approximation of the Hessian of the form

H = J^T J.

So the distinctive feature of least-squares optimization is the availability of a good Hessian approximation using only first-order derivative information. That noted, we can apply any optimization method which employs Hessian information, with a good probability of satisfactory results.
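The identity ∇F = J^T f is easy to verify numerically on a toy residual (the residual and all values below are purely illustrative, made up for this check):

```python
import numpy as np

def residual(x):
    # toy residual vector, m = 3, n = 2
    return np.array([x[0] ** 2 - x[1],
                     np.sin(x[0]) + x[1],
                     x[0] * x[1] - 1.0])

def jacobian(x):
    # rows are the transposed gradients of the components f_i
    return np.array([[2.0 * x[0], -1.0],
                     [np.cos(x[0]), 1.0],
                     [x[1], x[0]]])

x = np.array([0.7, -0.3])
F = lambda z: 0.5 * residual(z).dot(residual(z))

g_analytic = jacobian(x).T.dot(residual(x))    # gradient = J^T f
eps = 1e-7
g_numeric = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
                      for e in np.eye(2)])
# the analytic and central-difference gradients agree closely
```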

Most local optimization algorithms are iterative: starting with an initial guess x_0 they generate a sequence x_k which should converge to a local minimum. I will denote the steps taken by an algorithm with the letter p, such that x_{k+1} = x_k + p_k. Also I will denote quantities at the point x_k with the index k, for example f_k = f(x_k), J_k = J(x_k) and so on.

This is just an adaptation of Newton's method where instead of computing the Newton step at each iteration k, we compute the Gauss-Newton step p_k using the aforementioned Hessian approximation:

J_k^T J_k p_k = -J_k^T f_k.

To make the algorithm globally convergent we invoke a line search along the computed direction to satisfy a "sufficient decrease condition". See the chapter on line search in "Numerical Optimization" by Nocedal and Wright.
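A minimal sketch of such a globalized Gauss-Newton iteration, assuming a full-rank Jacobian and using a simple Armijo backtracking rule (all constants and names are my illustrative choices):

```python
import numpy as np

def gauss_newton(fun, jac, x0, tol=1e-10, max_iter=50):
    # Minimal Gauss-Newton with Armijo backtracking; illustrative only.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        f, J = fun(x), jac(x)
        g = J.T.dot(f)                    # gradient of 0.5 * ||f||^2
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.lstsq(J, -f, rcond=None)[0]   # Gauss-Newton step
        F0, t = 0.5 * f.dot(f), 1.0
        # backtrack until the sufficient decrease condition holds
        while (0.5 * np.linalg.norm(fun(x + t * p)) ** 2
               > F0 + 1e-4 * t * g.dot(p)) and t > 1e-10:
            t *= 0.5
        x = x + t * p
    return x

# Rosenbrock's function written as a least-squares problem; optimum (1, 1)
rosen_res = lambda x: np.array([x[0] - 1.0, 10.0 * (x[1] - x[0] ** 2)])
rosen_jac = lambda x: np.array([[1.0, 0.0], [-20.0 * x[0], 10.0]])
x_opt = gauss_newton(rosen_res, rosen_jac, [2.0, 2.0])
```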

Note that the equation for p_k is the normal equation of the following linear least-squares problem:

minimize over p: ||J_k p + f_k||^2.

It means that p_k can be found by whatever linear least-squares method you think is appropriate:

- Through the normal equation with Cholesky factorization. Pros: efficiency when m >> n, as J^T J and J^T f can be accumulated using only O(n^2) memory, and solving the resulting system is fast. Cons: potentially bad accuracy, since the condition number of J^T J is the square of the condition number of J; Cholesky factorization can incorrectly fail when J is nearly rank deficient.
- Through QR factorization of J with column pivoting. Pros: more reliable for ill-conditioned J. Cons: slower than the previous approach; the rank deficient case is still problematic.
- Through singular value decomposition (SVD) of J. Pros: the most robust approach; gives full information about the sensitivity of the solution to perturbations of the input; allows finding the least-norm solution in the rank deficient case; allows zeroing very small singular values (thus avoiding excessive sensitivity). Cons: slower than the previous two.
- Conjugate gradient and similar methods (LSQR, LSMR), which only require the ability to compute J u and J^T v for arbitrary vectors u and v. Used in the sparse large-scale setting.

The third approach is used in numpy.linalg.lstsq. At the moment I don't have a good idea of how much slower the SVD approach is compared to QR.
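For a well-conditioned problem the first three approaches agree closely; a small comparison (the data is random, and passing `lapack_driver="gelsy"` to get a QR-based solve is my choice, not something prescribed above):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, lstsq, svd

rng = np.random.default_rng(0)
J = rng.standard_normal((100, 5))
f = rng.standard_normal(100)

# 1) normal equation + Cholesky
p_chol = cho_solve(cho_factor(J.T @ J), -J.T @ f)

# 2) QR-based solve ("gelsy" is LAPACK's QR with column pivoting driver)
p_qr = lstsq(J, -f, lapack_driver="gelsy")[0]

# 3) SVD-based solve
U, s, VT = svd(J, full_matrices=False)
p_svd = -VT.T @ ((U.T @ f) / s)
```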

If the singular values of J are uniformly bounded away from zero in the region of interest, then the Gauss-Newton method is globally convergent; the convergence rate is no worse than linear and approaches quadratic as the second term in the true Hessian becomes insignificant compared to J^T J. But note that the convergence rate of an infinite sequence is measured after some k_0, which can be arbitrarily large. It means that we are likely to observe such a rate of convergence only within some (perhaps small) neighborhood of the optimal point.

The original idea of Levenberg was to use a "regularized" Gauss-Newton step p_k, which satisfies the following equation:

(J_k^T J_k + lambda I) p_k = -J_k^T f_k,

where lambda >= 0 is adjusted from iteration to iteration depending on some measure of the success of the previous step. A line search is not used in this algorithm. The rough explanation is as follows: when lambda is small we are taking full Gauss-Newton steps, which are known to be good near the optimum; when lambda is big we are taking steps along the anti-gradient, thus assuring global convergence. If I'm not mistaken, Marquardt's contribution was the suggestion to use a more general diagonal term lambda D instead of lambda I, with D = diag(J_k^T J_k). This is a question of variable scaling and for simplicity I will ignore it here.

Algorithms directly adjusting lambda are considered obsolete, but nevertheless such an algorithm is used in MATLAB, for instance, and it works well. The more recent view is to consider Levenberg-Marquardt as a trust-region type algorithm.

In a trust-region approach we obtain the step p by solving the following constrained quadratic subproblem:

minimize over p: m(p) = g^T p + 0.5 p^T B p, subject to ||p|| <= Delta,

where B is the approximation to the Hessian at the current point, g is the gradient, and Delta is the radius of the trust region. The radius is adjusted by observing the ratio of the actual to the predicted change in the objective function (as a measure of the adequacy of the quadratic model within the trust region):

rho = (F(x + p) - F(x)) / m(p).

The update rules for Delta are approximately as follows: if rho < 1/4 then Delta is decreased by a factor of 4; if rho > 3/4 and ||p|| = Delta then Delta is doubled; otherwise Delta is kept the same. If rho is negative (no actual decrease) then the computed step is not taken and it is recomputed with decreased Delta from the current point again (the threshold for this might be higher, for example 0.25 is stated in "Numerical Optimization").
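The radius update described above can be sketched in a few lines (the thresholds 1/4 and 3/4 and the factors 1/4 and 2 are the conventional choices from the literature, not the only possible ones):

```python
def update_radius(Delta, rho, step_norm):
    # Sketch of a typical trust-region radius update rule.
    if rho < 0.25:
        return 0.25 * Delta       # model was poor: shrink the region
    # expand only if the step actually reached the boundary; in floating
    # point the "==" comparison is replaced by a tolerance check
    if rho > 0.75 and step_norm >= 0.999 * Delta:
        return 2.0 * Delta
    return Delta

# a few sample updates
shrunk = update_radius(1.0, 0.1, 0.5)
grown = update_radius(1.0, 0.9, 1.0)
kept = update_radius(1.0, 0.5, 0.3)
```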

In least-squares problems we have B = J^T J and g = J^T f. The connection between the original Levenberg method and a trust-region method is given by the following theorem:

*The solution p of a trust-region problem satisfies the equation (B + alpha I) p = -g for some alpha >= 0, such that B + alpha I is positive semidefinite and alpha (Delta - ||p||) = 0.*

The last condition means that either alpha = 0 or the optimal solution lies on the boundary ||p|| = Delta. This theorem tells us that there is a one-to-one correspondence between alpha and Delta and suggests the following conceptual algorithm of solving a trust-region problem:

- If B is positive definite, compute the Newton step p = -B^-1 g, and if it is within the trust region — we have found our solution.
- Otherwise find alpha s.t. ||p(alpha)|| = Delta using some root-finding algorithm, where p(alpha) is computed from (B + alpha I) p(alpha) = -g.

Step 2 is not particularly easy; there is also an additional difficult case when B + alpha I is only positive semidefinite (singular) and we can't compute p just from the equation. For a proper discussion of the problem refer to "Numerical Optimization", Chapter 4.

A suitable and detailed algorithm for solving the trust-region subproblem arising in least squares is given in the paper "The Levenberg-Marquardt Algorithm — Implementation and Theory" by J. J. Moré (implemented in MINPACK; scipy.optimize.leastsq wraps it). The author analyses the problem in terms of SVD, but then suggests an implementation using QR decomposition and Givens rotations (the approach generally chosen in MINPACK). I decided to stick with SVD; the only potential disadvantage of SVD is speed (but even that is questionable), and in all other aspects this approach is great, including the simplicity of implementation. Let's introduce the function we want to find a zero of (see case 2 of the trust-region solving algorithm):

phi(alpha) = ||p(alpha)|| - Delta.

If we have the SVD J = U S V^T with singular values s_i, and denote z = U^T f, then

||p(alpha)||^2 = sum_i (s_i z_i)^2 / (s_i^2 + alpha)^2.

We have an explicit function of alpha and can easily compute its derivative too. In numpy/scipy this function can be implemented as follows, and it will work correctly even when m < n.

import numpy as np
from scipy.linalg import svd

def phi_and_derivative(J, f, alpha, Delta):
    # phi(alpha) = ||p(alpha)|| - Delta and its derivative, computed
    # through the thin SVD of J
    U, s, VT = svd(J, full_matrices=False)
    suf = s * U.T.dot(f)
    denom = s**2 + alpha
    p_norm = np.linalg.norm(suf / denom)
    phi = p_norm - Delta
    phi_prime = -np.sum(suf**2 / denom**3) / p_norm
    return phi, phi_prime

Then an iterative Newton-like zero-finding method is run (with some safeguarding described in the paper) until |phi(alpha)| is small enough; it usually converges in 1-3 iterations. Of course, in a real implementation the SVD should be computed only once per subproblem (this is not true for the QR approach described in the paper); also note that we need only the "thin" SVD (full_matrices=False). So with almost the same amount of work as numpy.linalg.lstsq we accurately solve a trust-region least-squares subproblem.
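The whole procedure can be sketched as follows. The SVD is computed once and the iteration updates alpha; applying the Newton step to 1/||p(alpha)|| rather than to phi itself is the standard trick from "Numerical Optimization" (my choice here — the safeguarding and clever bracketing from Moré's paper are omitted, and full column rank of J is assumed):

```python
import numpy as np
from scipy.linalg import svd

def solve_lsq_trust_region(J, f, Delta, rtol=1e-2, max_iter=20):
    # Hedged sketch: solve min ||J p + f|| s.t. ||p|| <= Delta via the
    # SVD of J, assuming J has full column rank.
    U, s, VT = svd(J, full_matrices=False)
    suf = s * U.T.dot(f)

    # try the Gauss-Newton step first (alpha = 0)
    p = -VT.T.dot(suf / s**2)
    if np.linalg.norm(p) <= Delta:
        return p, 0.0

    alpha = 0.0
    for _ in range(max_iter):
        denom = s**2 + alpha
        p_norm = np.linalg.norm(suf / denom)
        phi = p_norm - Delta
        if abs(phi) < rtol * Delta:
            break
        phi_prime = -np.sum(suf**2 / denom**3) / p_norm
        # Newton step applied to 1/||p(alpha)|| instead of phi itself;
        # this converges much faster because 1/||p|| is nearly linear
        alpha -= (phi / phi_prime) * (p_norm / Delta)
    p = -VT.T.dot(suf / (s**2 + alpha))
    return p, alpha

rng = np.random.default_rng(42)
J = rng.standard_normal((10, 3))
f = 10 * rng.standard_normal(10)
p, alpha = solve_lsq_trust_region(J, f, Delta=0.1)
p_gn, alpha_gn = solve_lsq_trust_region(J, f, Delta=1e6)  # interior case
```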

This (nearly) exact procedure for solving a trust-region problem is important, but in the large-scale setting one of the approximate methods must be chosen. I will outline them in future posts. Perhaps the most efficient and accurate method is to solve the problem in a subspace spanned by 2 properly selected vectors; in this case we apply the exact procedure to a small projected matrix.

Gauss-Newton and Levenberg-Marquardt methods share the same convergence properties, but Levenberg-Marquardt can handle rank deficient Jacobians and, as far as I understand, generally works better in practice (although I don't have much experience with Gauss-Newton). So LM is usually the method of choice for unconstrained least-squares optimization.

In the next posts I will describe the algorithm I was studying and implementing for bound constrained least squares. Its unofficial title is “Trust Region Reflective”.

]]>

1) It turned out that there is no reasonable implementation of numerical differentiation in scipy, and I will certainly need one. So I started to implement it. It might seem like quite a lot of code, but in fact it is a very basic finite-difference derivative estimation for vector-valued functions. The "features" are:

- Classical forward and central finite difference schemes + a 3-point forward scheme with second-order accuracy (as a fallback for central differences near the boundaries, see below).
- The capability of automatic step selection, which will be good in a lot of cases.
- The ability to specify bounds of a rectangular region where the function is defined. The step and/or finite difference scheme can be automatically adjusted to fit into the bounds.
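The bounds-aware fallback from the last two bullets can be illustrated for a scalar function (the function, names and step are all mine, just for illustration):

```python
import numpy as np

def fd_derivative(fun, x, h, lb=-np.inf, ub=np.inf):
    # Use the 2nd-order central difference when both x - h and x + h are
    # feasible; otherwise fall back to a one-sided 3-point scheme of the
    # same order of accuracy that stays inside [lb, ub].
    if x - h >= lb and x + h <= ub:
        return (fun(x + h) - fun(x - h)) / (2 * h)       # central, O(h^2)
    if x + 2 * h <= ub:                                  # forward, O(h^2)
        return (-3 * fun(x) + 4 * fun(x + h) - fun(x + 2 * h)) / (2 * h)
    return (3 * fun(x) - 4 * fun(x - h) + fun(x - 2 * h)) / (2 * h)

cube = lambda t: t**3                    # derivative at 1 is exactly 3
d_central = fd_derivative(cube, 1.0, 1e-5)
d_onesided = fd_derivative(cube, 1.0, 1e-5, lb=1.0)  # central infeasible
```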

Some really good suggestions appeared recently, so perhaps it will take another couple of days to potentially incorporate them and merge.

2) According to the plan, this week should have been devoted to adding some benchmark problems for least-squares methods. Well, it didn't take a lot of time. I added about 10 problems from the MINPACK-2 problem set (all unconstrained) and a very simple benchmarking class (for the ASV framework). Then Evgeny Burovski (one of my mentors) merged it sort of unexpectedly fast. I guess it's fine since it's not public. Certainly I will come back to these benchmarks; meanwhile I have some basic problems to work with.

3) I implemented a draft version of an optimization solver I was planning to work on. Btw, here is the link with a description. It turned out it doesn't work for least-squares problems. It was originally designed to solve nonlinear systems of equations (by minimizing the squared norm of a residual), so I naively thought that it could be trivially generalized to least squares. Yes, you need to be careful with math :).

But things are looking good. At the moment I'm strongly considering implementing a method based on a rectangular trust-region approach. The idea is simple: if we intersect such a trust region with a rectangular feasible region we again get a rectangular region. So all we need is to solve quadratic subproblems subject to bound constraints. The simplest way of doing that is again a dogleg approach (maybe with a little tweak), and it is very well suited for large-scale problems. Here is the link to a detailed description of the method (very likely I will describe it myself in one of the future posts). So far a draft implementation of this "dogbox" method was able to successfully solve all problems from the MINPACK-2 problem set and a few constrained problems (many more are needed for sure). In the coming days I will continue to investigate its properties.
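The intersection observation can be checked in two lines (names and values are illustrative):

```python
import numpy as np

def intersect_regions(x, Delta, lb, ub):
    # A rectangular trust region of half-width Delta around x,
    # intersected with the feasible box [lb, ub], is again a box.
    return np.maximum(x - Delta, lb), np.minimum(x + Delta, ub)

lo, hi = intersect_regions(np.array([0.5, 0.5]), 1.0,
                           np.array([0.0, -2.0]), np.array([1.0, 2.0]))
```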

I think that’s all for now. Sorry for this post being not very well thought-out, I hope to come back with more interesting updates.

]]>