First, this actually learns a schedule for each hyperparameter, not just a good set of fixed values, automatically discovering learning rate annealing and related techniques. This seems incredibly powerful. It is also learning hyperparameter schedules specific to a single training run - which seems interesting but not obviously helpful, especially since many of the learned schedules fairly closely match the baseline hand-tuned ones.
Second, it seems like they're optimizing against their validation metric directly; isn't that basically 'cheating' (i.e. defeats much of the point of having a separate validation metric in the first place)? It also seems completely orthogonal to their technique - could they not have optimized for the same loss function as the network itself? Is this an improvement over state of the art, or is it just overfitting to the validation metric?
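For concreteness, the hand-tuned learning rate annealing mentioned above usually looks something like a step-decay schedule. This is a generic sketch (the constants are illustrative, not from the paper) of the kind of schedule PBT can discover on its own instead of requiring it to be specified up front:

```python
def step_decay_lr(base_lr, step, decay_every=10_000, factor=0.1):
    """Classic hand-tuned annealing: cut the learning rate by `factor`
    every `decay_every` training steps."""
    return base_lr * factor ** (step // decay_every)

print(step_decay_lr(0.1, 0))        # 0.1
print(step_decay_lr(0.1, 25_000))   # two decays in: 0.1 * 0.1**2
```

The interesting part is that PBT ends up producing schedules like this as a side effect of selection plus perturbation, rather than from a formula chosen in advance.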
Well, they consider RL problems extensively, and as the joke goes, in RL it's OK to overfit to your validation set - if you can.
As for regular supervised learning: it's no worse than, say, early stopping based on validation scores. In theory it should be wrong, but in practice NNs generalize anyway, and since this paper implies that Google Brain & DM are doing this hyperparameter optimization routinely now for everything, I figure they would have noticed any overfitting problems by now (either when the methods fail to outperform on one of Google's private internal huge databases, or when they rolled out the translator).
This is really cool! I haven't read through the real paper yet but it's very impressive that this method does not incur a significant performance cost. I had assumed that using a genetic-style algorithm would be costly since you would need to train a large number of networks individually, but treating all variations equally in terms of training time now seems naive. Distributing the training time using intelligent exploration and exploitation is an awesome idea to fix this.
This is a fairly well-established technique in Bayesian hyperparameter optimization, where you train a meta-model that keeps track of the parameter space. Kind of like a manager model, if you will, that learns to intelligently trade off exploration vs exploitation so that a team of workers will arrive at the global optimum.
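A toy version of that manager/worker idea, with a deliberately crude nearest-neighbor surrogate standing in for the usual Gaussian process (everything here is illustrative, not from any real BO library):

```python
import random

random.seed(0)

def surrogate(history, x, k=3, explore=0.5):
    """Score candidate x from the k nearest past trials (exploitation),
    plus a bonus for being far from anything tried (exploration)."""
    if not history:
        return float("inf")  # nothing known yet: any point is worth trying
    dists = sorted((abs(hx - x), hy) for hx, hy in history)
    nearest = dists[:k]
    pred = sum(y for _, y in nearest) / len(nearest)
    bonus = explore * nearest[0][0]  # farther from data => bigger bonus
    return pred + bonus

def propose(history, candidates):
    """The 'manager': pick the candidate the surrogate rates highest."""
    return max(candidates, key=lambda x: surrogate(history, x))

# Maximize an unknown 1-D objective (a stand-in function peaking at 0.3).
objective = lambda x: -(x - 0.3) ** 2
history = []
for _ in range(20):
    x = propose(history, [random.random() for _ in range(50)])
    history.append((x, objective(x)))  # "worker" evaluates the proposal

best_x, best_y = max(history, key=lambda h: h[1])
print(round(best_x, 2))  # should land near 0.3
```

The real thing replaces the nearest-neighbor guess with a proper probabilistic model, but the exploit-vs-explore scoring is the same shape.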
Back when Yahoo! was a real company they used a technique called "multi-armed bandits" to learn what ads to show. [1]
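For anyone unfamiliar, the core of a multi-armed bandit for ad selection fits in a few lines of epsilon-greedy Python (the click rates below are made up for illustration):

```python
import random

random.seed(0)
true_ctr = [0.03, 0.07, 0.15]   # hidden per-ad click rates (unknown to us)
clicks = [0, 0, 0]
shows = [0, 0, 0]

def pick_arm(eps=0.1):
    """Mostly show the ad with the best observed click rate (exploit),
    but pick a random ad eps of the time (explore)."""
    if random.random() < eps or 0 in shows:
        return random.randrange(len(true_ctr))
    return max(range(len(true_ctr)), key=lambda a: clicks[a] / shows[a])

for _ in range(10_000):
    a = pick_arm()
    shows[a] += 1
    clicks[a] += random.random() < true_ctr[a]  # simulate a user click

print([round(c / s, 3) for c, s in zip(clicks, shows)])
```

Over enough impressions the highest-CTR ad soaks up most of the traffic, while the 10% exploration keeps the estimates for the other ads from going stale.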
More recently, there are a number of off-the-shelf packages available that you can trivially integrate into your ML pipeline to optimize the hyperparameters of your models; I'll include the links below.
This isn't your standard MAB or GP hyperparameter optimization; those typically require you to train each NN to convergence before further exploration is done (i.e. each 'round' is training a NN). Skimming the paper, OP is closer to freeze-thaw or reversible-backpropagation hyperparameter optimization, or Net2Net meta-RL: the hyperparameter optimization is monitoring the loss curve of each trained NN, switching between them based on promisingness like in freeze-thaw, but also switching hyperparameters on the fly and reusing the trained weights to avoid starting from scratch, Net2Net style. Each NN being trained is periodically updated to either clone & tweak a new hyperparameter set to continue training the current NN's parameters, or clone & tweak the best NN's parameters while keeping the old hyperparameters. (They only clone the full NN, so they can't do architecture search, but there's no reason they couldn't use Net2Net or other recent approaches which similarly recycle the trained weights to avoid the huge computational burden of training from scratch.)
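To make that exploit/explore loop concrete, here's a heavily simplified sketch of the idea on a toy 1-D problem (my own paraphrase of the algorithm, not DeepMind's code; all names and constants are illustrative):

```python
import random

random.seed(0)

class Worker:
    def __init__(self, lr):
        self.lr = lr             # hyperparameter under search
        self.theta = 0.0         # stand-in for network weights
        self.score = -float("inf")

    def train_step(self):
        # Toy objective: step toward theta = 1; a badly chosen lr
        # converges slowly or overshoots, so the population must tune it.
        self.theta += self.lr * (1.0 - self.theta)
        self.score = -(1.0 - self.theta) ** 2

population = [Worker(lr=10 ** random.uniform(-3, 0)) for _ in range(8)]

for step in range(100):
    for w in population:
        w.train_step()
    if step % 10 == 9:  # periodic exploit/explore, no restart from scratch
        ranked = sorted(population, key=lambda w: w.score)
        for bad in ranked[: len(ranked) // 4]:            # bottom quartile
            good = random.choice(ranked[-len(ranked) // 4 :])  # a top worker
            bad.theta = good.theta                        # exploit: copy weights
            bad.lr = good.lr * random.choice([0.8, 1.2])  # explore: perturb hparams

best = max(population, key=lambda w: w.score)
print(f"best lr {best.lr:.3f}, score {best.score:.2e}")
```

The key property is visible even in the toy: weights are never thrown away, only copied and continued under perturbed hyperparameters, which is what makes the whole thing roughly as cheap as a single parallel training run.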
I think this is what Google and the others aim for - no hand-tuning. You simply specify the problem (some function to optimize) and the data. Everything in a nice concise package running on Google Cloud using their custom software and hardware.
This seems similar to what Jeff Dean was working on with AutoML: https://research.googleblog.com/2017/11/automl-for-large-sca....
Is DeepMind collaborating with the Google Brain team and how connected are the teams? It seems somehow that the efforts may be duplicated in some areas...
Isn't it good that efforts are duplicated? It commoditizes the work and results, and creates more jobs, so there are more people who understand this field. It's unlikely the approaches will end up exactly the same.
Similarly, take a look at the deep learning library market: caffe (out of Berkeley), tensorflow (google), pytorch (FB)... each has different strengths, but I'm sure glad the pytorch people pushed ahead, even though google put a ton of marketing effort into TF, simply because now we have more awesome things :).
Once a market or product is mature, then I can see the "duplicates are wasteful". But a nascent, exploratory field like ML/DL needs as many different approaches as is possible.
Now, if only we could gradient descent to find the optimal approach ;).
Definitely: Theano is no longer actively developed, nor are there plans for it to be.
If you don't need mobile on-device D.L., take a look at pytorch. Otherwise, Tensorflow.
Fast.ai will release some excellent self-paced coursework in January for PyTorch. Best bang for the buck (free, but time ain't) I've seen in any AI learning. Much of the lower-level stuff is optimized for you, and he gives some great SOTA tricks for getting into the top 10% of Kaggle competitions in like an hour or two.
Alas, no pytorch on device yet. But the state of the art is nearly 100% turnover every year, so the question becomes: do you need SOTA? Many problems are 98+% solved these days, so maybe we've reached "good enough" with some of these applications of d.l.
Does Theano meet your needs? Then no. Does TensorFlow meet them better, enough to justify the cost in switching? Then yes. "Actively developed" is a silly metric. Focus on features, flexibility, robustness etc.
For many (most?) users outside of Google and Facebook the most important feature is "is there an off-the-shelf implementation of new technique XXX or do I have to build it myself?"
For most users the sensible choice comes down to Keras+Tensorflow or PyTorch.
Depends on what you do. If you're starting a new project picking Theano indeed isn't a very good choice due to the reasons you've mentioned. However, if you already have a stable piece of software that does what you want it to do then migration won't add much value and you could spend this time doing something more important, like improving documentation or having dinner with your family and friends.
However, it's worth pointing out that Theano's API is fairly similar to TensorFlow's, so migrating shouldn't be too hard and should be fairly easy to test.
Looks like a genetic algorithm to me, just framed differently from the usual presentation (which is clever): the hyperparams are the DNA, and the network is the environment against which they are executed.
Seems very similar to using RL to tune the hyperparameters. Surely that means there are hyperparameters for PBT that need to be set, such as the exploration vs exploitation tradeoff.
NEAT just uses GA to generate a network topology and weights. From what I read, I think this is just a fancy way to parallelize searching for optimal hyperparameters.
Well, okay, what I meant is that it's literally applying evolutionary algorithms for hyperparameter optimization. Calling it some souped up BS like "population based training" seems like Deepmind marketing is getting out of hand...