## Playing with the scipy.optimize.minimize function

Wednesday, October 29th, 2014 at 8:28 pm

I’m back at work on my SLAM based laser scanner. Failure is not an option. Yet it feels like there is a real risk of it.

One of the steps in the process is to displace all the inertial unit measurements by a small error term in order to minimize the error in the correspondences.

More simply, we have a matrix A of height m and width n (m>n), a column vector b of height n, and we want to fill in the column vector x of height m such that A x = b is almost true.

There is no exact solution, so we look for a least squares answer, where (A x – b)^2 is minimal.

Luckily, there is a function scipy.linalg.lstsq() which does the job.

Let’s consider a simple example of a 3×2 matrix:

```import scipy.optimize
import numpy as np

A = np.array([ [2,  3],
[1,  2],
[-9, 8] ])

b = np.array([7,8,9])
x = np.linalg.lstsq(A, b)
print(x)   # [ 0.9631829   2.21615202]
print(np.dot(A, x) - b)
# [ 1.57482185 -2.60451306  0.06057007]
```

In theory, any other value of x will evaluate to a bigger vector.

The problem is I am frequently working with a 2200×2000 matrix, which takes up to ten seconds to solve.

I am not a numerical analyst, but I have been told that it’s worth looking at the Limited Memory Broyden–Fletcher–Goldfarb–Shanno algorithm. Luckily, there is one implemented in the scipy.optimize.minimize(), so I don’t need to understand anything.

```def func(x):
bd = np.dot(A, x) - b
return np.dot(bd, bd)

x0 = [0.0, -1.0]  # pick any starting point
res = scipy.optimize.minimize(fun=func, x0=x0, method="L-BFGS-B", jac=grad)
print(res)
#  nfev: 9
#     x: array([ 0.9631829 ,  2.21615202])
```

This looks pretty close to the answer from lstsq() after what it claims are 9 functional evaluations.

However, when I put a counter into func(x) I got 36 evaluations, many of which are small increments around a single value as though it is numerically computing the partial derivatives:

```funceval 13 [-0.40910492  1.00305985]
funceval 14 [-0.40910492  1.00305985]
funceval 15 [-0.40910491  1.00305985]
funceval 16 [-0.40910492  1.00305986]
funceval 17 [ 0.64061277  2.10014887]
funceval 18 [ 0.64061277  2.10014887]
funceval 19 [ 0.64061278  2.10014887]
funceval 20 [ 0.64061277  2.10014888]
```

The solution to this is to include a gradient function, which is a vector of partial derivatives in each of the coordinates of x:

```def grad(x):
bd = np.dot(mA, x) - mb
return 2*np.dot(bd, mA)

res = scipy.optimize.minimize(fun=func, x0=x0, jac=grad, method="L-BFGS-B")
```

Now I get one function call to each per iteration, like so:

```funceval 5 [ 0.64061285  2.10014896]
funceval 6 [ 0.96463734  2.21446478]
```

How do we know that I’ve programmed the grad() function right? Well, for a start, its evaluation at (0.9631829, 2.21615202) is nearly zero, as you would expect at a minima. And secondly, I can check it with this handy scipy.optimize.check_grad() function which automatically compares its value to its numerical approximation.

So, how does this work with my real world 1266×1201 size matricies?

I have a case where func(*1201) evaluates to 97.6421, and the lstsq() function finds a value of x in 2.77 seconds where func(x) evaluates to 7.8202.

So, how does the L-BFGS-B algorithm perform of this when I put it in?

Horribly.

Starting from x0=*1201 it gets to an answer that evaluates to 92.6386 after 1000 iterations in 14 seconds.

I tried the BFGS implementation itself, which doesn’t allow you to limit the number of iterations, and it came back after 35 seconds with an evaluation of 96.8386.

Then I discovered scipy.optimize.leastsq() within this huge sprawling interface and noted that it’s not to be confused with scipy.linalg.lstsq().

Back to the toy example, where instead we need to return the whole vector before we’ve squared it:

```def func(x):
return np.dot(A, x) - b
res = scipy.optimize.leastsq(func=func, x0=x0)
print(res)
# [ 0.9631829 ,  2.21615202]), 3
```

Now back to the real world case of my 1266×1201 matrix, where it produces a residual of 7.8202 (same as the linalg.lstsq()) after 29 seconds and 8417 evaluations of func(x).

But, it’s possible to give it the Jacobean (the matrix of partial derivatives), which is just the A matrix, like so:

```def Dfun(x):
return A
res = scipy.optimize.leastsq(func=func, x0=x0, Dfun=Dfun)
```

Now it gets to that answer in 12 seconds after 10 calls to func(x) and 8 to Dfun().

According to the documentation, this uses a set of FORTRAN libraries MINPACK written in 1980 to implement the Levenberg–Marquardt algorithm.

So the upshot is, nothing is coming close to the performance linalg.lstsq() algorithm in this case, which appears to be sourced from another worldwide public FORTRAN library LAPACK, developed since 1992 wrapping gelss as disclosed in the scipy source code.

So, that’s what I have found out today. I am now slightly familiar with how to call into this giant set of fundamental best-in-the-world numerical solvers.

It remains to explain why the BFGS functions performed so badly on my real world matricies. I was kind of counting on them performing a bit better than they did.

Things to do tomorrow: Get familiar with using kd-trees, and more hacking on the SLAM thing, if I can cope with it. Or I could do something else, like work on a GPL implementation of an STL slicer, which is at least something I know I can do.