
Paper notes: Safe Mutations for Deep and Recurrent Neural Networks through Output Gradients

Authors: Joel Lehman, Jay Chen, Jeff Clune and Kenneth O. Stanley

Full paper: here

Code: here

tl;dr

Neuroevolution works by perturbing the parameter vector of a neural network; this works well for small networks, but it easily breaks functionality in bigger, deeper ones. The paper introduces a family of operators that avoid this problem by rescaling the perturbation vector so that the perturbed network’s behaviour changes enough to keep exploration going, without breaking functionality.

What’s the problem with perturbations

Assume we have an agent that we want to evolve for a specific task. The agent itself is the parameter vector \(\theta\) of some Neural Network. In neuroevolution, we add some noise to this vector, evaluate its performance via a fitness function and determine whether it’s a good fit for our task or not (spoiler alert: we use population search for a reason).
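To make the setup concrete, here is a toy (1 + λ) neuroevolution loop over a flat parameter vector; the quadratic fitness function is a made-up stand-in (in a real setup you would roll the network out on the task instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for a task: fitness is higher the closer theta is to a fixed target.
# In a real NE setup this would roll the network out in the environment instead.
target = rng.normal(size=100)

def fitness(theta: np.ndarray) -> float:
    return -float(np.sum((theta - target) ** 2))

def evolve(theta: np.ndarray, sigma: float = 0.05, pop_size: int = 64, generations: int = 200) -> np.ndarray:
    """(1 + lambda) neuroevolution: perturb the parameter vector, keep the best offspring."""
    for _ in range(generations):
        offspring = [theta + sigma * rng.normal(size=theta.shape) for _ in range(pop_size)]
        best = max(offspring, key=fitness)
        if fitness(best) > fitness(theta):
            theta = best
    return theta

theta = evolve(rng.normal(size=100))
```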

This is all well and good for small NNs, but as soon as we ramp up the complexity of the network, stuff breaks. More specifically, the network’s behaviour after it gets perturbed is… well, shit. You may have a decent network, add some noise (even a small itsy-bitsy amount) and BAM, it suddenly behaves like a drunkard. Good luck applying a search algorithm with this in mind.

Why’s that, you ask? Well, simply put, each weight of the network has a different sensitivity to noise, and as the network gets larger it becomes much, much easier for a small perturbation to have large repercussions on the end result.

What’s the proposed solution

What’s the solution to this problem, then? Easy: we don’t consider just the parameters space when we apply the perturbation, but also the outputs space!

Why? Because if we apply not just any noise, but an informed noise (with respect to the outputs, of course) we can get a perturbed network that behaves roughly the same as the non-perturbed one, yet differently enough to allow us to do some search.

How do we do this? Well, first of all we need a way to measure how differently one network behaves from another. We do this by defining the \(\mathrm{Divergence}\):

\[\mathrm{Divergence}(\boldsymbol{\delta}; \boldsymbol{w}) = \frac{\sum_i \sum_k (\mathrm{NN}(\boldsymbol{X}_i; \boldsymbol{w})_k - \mathrm{NN}(\boldsymbol{X}_i; \boldsymbol{w} + \boldsymbol{\delta})_k)^2}{I}\]

This is the MSE of the NN outputs before and after the perturbation, on the same inputs, averaged over \(I\) experiences.
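A minimal sketch of how this could be computed, assuming a hypothetical helper `model_fn(x, w)` that runs the network on a batch of inputs with the flat parameter vector `w` (in PyTorch you could build one on top of `torch.func.functional_call`):

```python
import torch

def divergence(model_fn, w, delta, experiences):
    """MSE between the outputs of the original and the perturbed parameters,
    over the same batch of I stored experiences."""
    with torch.no_grad():
        y_old = model_fn(experiences, w)          # outputs, shape (I, K)
        y_new = model_fn(experiences, w + delta)  # perturbed outputs, shape (I, K)
    return ((y_old - y_new) ** 2).sum() / experiences.shape[0]
```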

Easy enough, honestly. It requires as many forward passes as sampled experiences, though, so maybe keep your model and archives on GPU.

If your GPU isn’t that great, remember to ask Santa, there’s still time!


Ideally, now we can specify how much divergence we want between the parent NN and the offspring NN and find a \(\boldsymbol{\delta}\) that fits.

But how can we look for such a vector? It’s a noise vector, so surely we can’t just search for it at random, right? God, no, it’d take forever. We have the power of MATHS on our side!

SM through Rescaling

First, we can decompose the noise vector as

\[\boldsymbol{\delta} = \boldsymbol{\delta}_{magnitude}\boldsymbol{\delta}_{direction}\]

We then pick a random direction and optimize \(\boldsymbol{\delta}_{magnitude}\) via line search to obtain the desired divergence value.
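A rough sketch of what this could look like, reusing the `divergence` helper sketched above; a simple bisection stands in for the line search, and the target divergence and the magnitude bracket `[0, 10]` are arbitrary assumptions:

```python
import torch

def safe_mutation_rescale(model_fn, w, experiences, target_div=0.01, iters=20):
    """SM-R sketch: pick a random direction, then bisect the magnitude until
    the divergence of the perturbed network is close to the target value."""
    direction = torch.randn_like(w)
    direction /= direction.norm()
    lo, hi = 0.0, 10.0                      # assumed bracket for the magnitude
    for _ in range(iters):
        mag = 0.5 * (lo + hi)
        if divergence(model_fn, w, mag * direction, experiences) < target_div:
            lo = mag                        # not divergent enough, push the magnitude up
        else:
            hi = mag                        # too disruptive, pull the magnitude back down
    return 0.5 * (lo + hi) * direction      # the rescaled perturbation delta
```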

This method is called Safe Mutation through Rescaling, or SM-R if you want to sound cool.

This first approach works well enough, since it rescales the perturbation, but we can do better. And by that, I mean bringing the big boiz, gradients, into the mix.

SM through Gradients

Introducing Safe Mutations through Gradients (SM-G), the better way to do safe mutations assuming your NN is differentiable (tough luck otherwise, you’re stuck with SM-R).

Why do gradients w.r.t. the network’s weights work? Because they approximate the weights’ sensitivity well.

Don’t believe me? Well, this first-order Taylor expansion begs to differ:

\[Y_{ik}(\boldsymbol{X}_i, \boldsymbol{\delta}; \boldsymbol{w}) = \mathrm{NN}(\boldsymbol{X}_i;\boldsymbol{w})_k + \boldsymbol{\delta}\nabla_\boldsymbol{w}\mathrm{NN}(\boldsymbol{X}_i;\boldsymbol{w})_k\]

Don’t see how? Me neither, at first, but it simply shows that we can use gradients as a local estimate of how sensitive a given output is to a given weight.

SM-G-ABS

With the previous formula in mind, we can compute the sensitivity vector \(\boldsymbol{s}\) as:

\[\boldsymbol{s} = \sqrt{\sum_k \left(\frac{\sum_i \mathrm{abs}(\nabla_{\boldsymbol{w}}\mathrm{NN}(\boldsymbol{X}_i)_k)}{I}\right)^2}\]

This vector contains the sensitivities of all weights in the network. This method is called SM-G-ABS because it uses the \(\mathrm{abs}\) function when computing the sensitivity vector.

We then adjust the perturbation with the sensitivity vector:

\[\boldsymbol{\delta}_{adjusted} = \frac{\boldsymbol{\delta}}{\boldsymbol{s}}\]
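A sketch of how SM-G-ABS could look in PyTorch, for a model that maps a batch of inputs to outputs of shape `(batch, K)`. The absolute value has to be taken per experience, so every (experience, output) pair gets its own backward pass; the `clamp_min` in the usage line is just an assumed guard against dividing by a near-zero sensitivity:

```python
import torch

def sensitivity_abs(model, experiences):
    """SM-G-ABS sketch: average the absolute per-experience gradients of each
    output w.r.t. the weights, square, sum over outputs, take the square root."""
    params = [p for p in model.parameters() if p.requires_grad]
    flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])
    I = experiences.shape[0]
    K = model(experiences[:1]).shape[1]                   # number of outputs
    total = 0.0
    for k in range(K):
        abs_sum = 0.0
        for i in range(I):                                # one backward pass per (i, k) pair
            out_k = model(experiences[i:i + 1])[0, k]
            abs_sum = abs_sum + flat(torch.autograd.grad(out_k, params)).abs()
        total = total + (abs_sum / I) ** 2
    return torch.sqrt(total)

# Usage: delta_adjusted = delta / sensitivity_abs(model, experiences).clamp_min(1e-8)
```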

How well does this work? Very well, actually: since the \(\boldsymbol{s}\) vector contains information per weight, we have the most granularity available at our disposal.

The drawback is that you need to compute gradients (i.e. forward and backward passes over the stored experiences), which, yeah, is annoying, but not that bad. The real problem arises when you realise that the absolute value has to be taken per experience before summing, so the backward passes can’t be batched, and no ML library (TensorFlow, PyTorch, etc.) handles per-example gradients like that well. This is a huge bottleneck in performance (and time spent).

SM-G-SUM

Let’s be honest, though: do we really need to take the absolute value? Strictly speaking yes, if we don’t want gradients of opposite sign cancelling each other out (gradient washout), but we can just drop it and accept the loss in precision.

This gives us the SM-G-SUM method, which computes the sensitivity vector as

\[\boldsymbol{s}_{SUM} \approx \sqrt{\sum_k \left(\sum_i \nabla_{\boldsymbol{w}}\mathrm{NN}(\boldsymbol{X}_i)_k\right)^2}\]
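Since the sum over experiences now sits inside the square, it can be folded into the forward pass: sum the outputs over the batch and do a single backward pass per output dimension. A sketch under the same assumptions as before:

```python
import torch

def sensitivity_sum(model, experiences):
    """SM-G-SUM sketch: sum the outputs over the batch first, so a single
    backward pass per output dimension is enough."""
    params = [p for p in model.parameters() if p.requires_grad]
    flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])
    summed_outputs = model(experiences).sum(dim=0)        # shape (K,)
    total = 0.0
    for k in range(summed_outputs.shape[0]):
        grads = torch.autograd.grad(summed_outputs[k], params, retain_graph=True)
        total = total + flat(grads) ** 2
    return torch.sqrt(total)
```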

SM-G-SO

Alternatively, we can pick the divergence function back up and apply it in an SM-G method (using a second-order approximation).

We first have that

\[\boldsymbol{\nabla}_\boldsymbol{w}(\mathrm{Divergence}(0 + \boldsymbol{\delta}_0; \boldsymbol{w})) \approx H_\boldsymbol{w}(\mathrm{Divergence}(0;\boldsymbol{w}))\boldsymbol{\delta}_0\]

This approximation is useful because we don’t have to compute the full Hessian of the \(\mathrm{Divergence}\) w.r.t. \(\boldsymbol{w}\) (which is too much to ask for, let’s be honest): the Hessian-vector product on the right only requires an additional backward pass.

We can then compute the sensitivity vector as

\[\boldsymbol{s}_{SO} = \sqrt{\mathrm{abs}(H_\boldsymbol{w}(\mathrm{Divergence}(0;\boldsymbol{w}))\boldsymbol{\delta})}\]

And then adjust the \(\boldsymbol{\delta}\) vector as previously.
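A sketch of how the Hessian-vector product could be obtained with a double backward pass, again assuming the hypothetical `model_fn(x, w)` helper from the divergence sketch. Note that the divergence and its gradient are numerically zero at \(\boldsymbol{\delta}_0 = 0\), but the second-order information is not:

```python
import torch

def sensitivity_so(model_fn, w, experiences, delta):
    """SM-G-SO sketch: Hessian-vector product of the divergence at delta = 0,
    via double backward, instead of materialising the full Hessian."""
    delta0 = torch.zeros_like(w, requires_grad=True)
    y_ref = model_fn(experiences, w).detach()             # unperturbed outputs, treated as constants
    y_new = model_fn(experiences, w + delta0)             # perturbed outputs (delta0 = 0 numerically)
    div = ((y_ref - y_new) ** 2).sum() / experiences.shape[0]
    grad = torch.autograd.grad(div, delta0, create_graph=True)[0]
    hvp = torch.autograd.grad(grad @ delta, delta0)[0]    # H(Divergence) @ delta
    return torch.sqrt(hvp.abs())

# Usage: delta_adjusted = delta / sensitivity_so(model_fn, w, experiences, delta).clamp_min(1e-8)
```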

This last method is called SM-G-SO because it uses Second Order approximation and is probably the best of the bunch (whereas SM-R is obviously the black sheep).

Conclusions

This paper clearly shows that NE can be applied to deep networks when making sure the perturbations are not destructive. The SM methods presented here can be implemented as-is in any NE project, with minimal domain considerations.

Overall, it’s a very useful paper for anyone working on a NE system with deep networks.

Oh wait, that’s me.
