SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

The basic paper is here: Using Gaussian processes to optimize expensive functions.


Musings, after talking with Christopher Lee-Johnson today

  • $ x $ is a point in the input space, for example a vector of parameters controlling some aspect of robotic control.
  • $ y(x) $ is an actual evaluation of some system using that $x$. This could be a really really noisy evaluation : we want an algorithm that's okay with this....
  • $ X $ is the set of previously sampled vectors in this input space (ie. $X$ is a matrix).
  • $ Y $ is the set of actual evaluations of points $X$, so it's a vector.

The Gaussian process machinery can give us a mean and a variance for some arbitrary point $ x^\star $:

  • $ \mu(x^\star) = \mu(x^\star | X,Y) $
  • $ \sigma^2(x^\star) = \sigma^2(x^\star | X,Y) $

Default code uses this predictive distribution to find the Expected Improvement (EI) over the best sample so far. But this is a BAD idea in really noisy scenarios such as the one we're interested in: we don't really believe that one fluke high value really means we've found the best spot so far at all... This nasty assumption leads to nasty behaviour in the optimizer. It'll only work well when the evaluations are low-noise.

So what's the right thing to do instead?

Ummm.

It might help to think of the solution you would choose if the project had to go live RIGHT THEN. That is, at any stage in the optimization process we've got the "best so far" point, call it $x_\text{best}$, which is NOT just the one that happened to have got lucky once, but rather captures some utility function we have, reflecting the "go live" scenario. Clearly this function will differ from application to application.

An example, though, would be something like $ \mu(x^\star) - \sigma(x^\star)$, which likes high expected values but dislikes uncertainty: this is a "risk averse" solution: "Give me something you know is going to work okay - I can't afford a disaster".

Call this the utility, $U(x) = U(x | X,Y) $

So maybe the right solution to be comparing things against is the one that maximizes this utility:

$  x_\text{best} = \text{argmax} U(x) $

Then we consider some new point $x^\star$, which has current utility $U(x^\star) = U(x^\star | X,Y) $

What's it's EXPECTED UTILITY, if we trust the model and think about taking a sample there? It is....


Latex rendering error!! dvi file was not created.