First, this actually learns a schedule for each hyperparameter, not just a good set of fixed values, automatically discovering learning rate annealing and related techniques. This seems incredibly powerful. It is also learning hyperparameter schedules specific to a single training run - which seems interesting but not obviously helpful, especially since many of the learned schedules fairly closely match the baseline hand-tuned ones.
Second, it seems like they're optimizing against their validation metric directly; isn't that basically 'cheating' (i.e. defeats much of the point of having a separate validation metric in the first place)? It also seems completely orthogonal to their technique - could they not have optimized for the same loss function as the network itself? Is this an improvement over state of the art, or is it just overfitting to the validation metric?
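For concreteness, the hand-tuned learning rate annealing mentioned above usually looks something like a step-decay schedule. This is a generic sketch (the constants are illustrative, not from the paper) of the kind of schedule PBT can discover on its own instead of requiring it to be specified up front:

```python
def step_decay_lr(base_lr, step, decay_every=10_000, factor=0.1):
    """Classic hand-tuned annealing: cut the learning rate by `factor`
    every `decay_every` training steps."""
    return base_lr * factor ** (step // decay_every)

print(step_decay_lr(0.1, 0))        # 0.1
print(step_decay_lr(0.1, 25_000))   # two decays in: 0.1 * 0.1**2
```

The interesting part is that PBT ends up producing schedules like this as a side effect of selection plus perturbation, rather than from a formula chosen in advance.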
Well, they consider RL problems extensively, and as the joke goes, in RL it's OK to overfit to your validation set - if you can.
As for regular supervised learning: it's no worse than, say, early stopping based on validation scores. In theory it should be wrong, but in practice NNs generalize anyway, and since this paper implies that Google Brain & DM are doing this hyperparameter optimization routinely now for everything, I figure they would have noticed any overfitting problems by now (either when the methods fail to outperform on one of Google's private internal huge databases, or when they rolled out the translator).
This is really cool! I haven't read through the real paper yet but it's very impressive that this method does not incur a significant performance cost. I had assumed that using a genetic-style algorithm would be costly since you would need to train a large number of networks individually, but treating all variations equally in terms of training time now seems naive. Distributing the training time using intelligent exploration and exploitation is an awesome idea to fix this.
This is a fairly well-established technique in Bayesian hyperparameter optimization, where you train a meta-model that keeps track of the parameter space. Kind of like a manager model, if you will, that learns to intelligently trade off exploration vs exploitation so that a team of workers will arrive at the global optimum.
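A toy version of that manager/worker idea, with a deliberately crude nearest-neighbor surrogate standing in for the usual Gaussian process (everything here is illustrative, not from any real BO library):

```python
import random

random.seed(0)

def surrogate(history, x, k=3, explore=0.5):
    """Score candidate x from the k nearest past trials (exploitation),
    plus a bonus for being far from anything tried (exploration)."""
    if not history:
        return float("inf")  # nothing known yet: any point is worth trying
    dists = sorted((abs(hx - x), hy) for hx, hy in history)
    nearest = dists[:k]
    pred = sum(y for _, y in nearest) / len(nearest)
    bonus = explore * nearest[0][0]  # farther from data => bigger bonus
    return pred + bonus

def propose(history, candidates):
    """The 'manager': pick the candidate the surrogate rates highest."""
    return max(candidates, key=lambda x: surrogate(history, x))

# Maximize an unknown 1-D objective (a stand-in function peaking at 0.3).
objective = lambda x: -(x - 0.3) ** 2
history = []
for _ in range(20):
    x = propose(history, [random.random() for _ in range(50)])
    history.append((x, objective(x)))  # "worker" evaluates the proposal

best_x, best_y = max(history, key=lambda h: h[1])
print(round(best_x, 2))  # should land near 0.3
```

The real thing replaces the nearest-neighbor guess with a proper probabilistic model, but the exploit-vs-explore scoring is the same shape.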
Back when Yahoo! was a real company they used a technique called "multi-armed bandits" to learn what ads to show. [1]
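For anyone unfamiliar, the core of a multi-armed bandit for ad selection fits in a few lines of epsilon-greedy Python (the click rates below are made up for illustration):

```python
import random

random.seed(0)
true_ctr = [0.03, 0.07, 0.15]   # hidden per-ad click rates (unknown to us)
clicks = [0, 0, 0]
shows = [0, 0, 0]

def pick_arm(eps=0.1):
    """Mostly show the ad with the best observed click rate (exploit),
    but pick a random ad eps of the time (explore)."""
    if random.random() < eps or 0 in shows:
        return random.randrange(len(true_ctr))
    return max(range(len(true_ctr)), key=lambda a: clicks[a] / shows[a])

for _ in range(10_000):
    a = pick_arm()
    shows[a] += 1
    clicks[a] += random.random() < true_ctr[a]  # simulate a user click

print([round(c / s, 3) for c, s in zip(clicks, shows)])
```

Over enough impressions the highest-CTR ad soaks up most of the traffic, while the 10% exploration keeps the estimates for the other ads from going stale.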
More recently, there are a number of off-the-shelf packages available that you can trivially integrate into your ML pipeline to optimize the hyperparameters of your models; I'll include the links below.
This isn't your standard MAB or GP hyperparameter optimization; those typically require you to train each NN to convergence before further exploration is done (i.e. each 'round' is training a NN). Skimming the paper, OP is closer to freeze-thaw or reversible-backpropagation hyperparameter optimization, or Net2Net meta-RL: the hyperparameter optimization is monitoring the loss curve of each trained NN, switching between them based on promisingness like in freeze-thaw, but also switching hyperparameters on the fly and reusing the trained weights to avoid starting from scratch, Net2Net style. Each NN being trained is periodically updated to either clone & tweak a new hyperparameter set to continue training the current NN's parameters, or clone & tweak the best NN's parameters while keeping the old hyperparameters. (They only clone the full NN, so they can't do architecture search, but there's no reason they couldn't use Net2Net or other recent approaches which similarly recycle the trained weights to avoid the huge computational burden of training from scratch.)
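To make that exploit/explore loop concrete, here's a heavily simplified sketch of the idea on a toy 1-D problem (my own paraphrase of the algorithm, not DeepMind's code; all names and constants are illustrative):

```python
import random

random.seed(0)

class Worker:
    def __init__(self, lr):
        self.lr = lr             # hyperparameter under search
        self.theta = 0.0         # stand-in for network weights
        self.score = -float("inf")

    def train_step(self):
        # Toy objective: step toward theta = 1; a badly chosen lr
        # converges slowly or overshoots, so the population must tune it.
        self.theta += self.lr * (1.0 - self.theta)
        self.score = -(1.0 - self.theta) ** 2

population = [Worker(lr=10 ** random.uniform(-3, 0)) for _ in range(8)]

for step in range(100):
    for w in population:
        w.train_step()
    if step % 10 == 9:  # periodic exploit/explore, no restart from scratch
        ranked = sorted(population, key=lambda w: w.score)
        for bad in ranked[: len(ranked) // 4]:            # bottom quartile
            good = random.choice(ranked[-len(ranked) // 4 :])  # a top worker
            bad.theta = good.theta                        # exploit: copy weights
            bad.lr = good.lr * random.choice([0.8, 1.2])  # explore: perturb hparams

best = max(population, key=lambda w: w.score)
print(f"best lr {best.lr:.3f}, score {best.score:.2e}")
```

The key property is visible even in the toy: weights are never thrown away, only copied and continued under perturbed hyperparameters, which is what makes the whole thing roughly as cheap as a single parallel training run.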
I think this is what Google and the others aim for - no hand-tuning. You simply specify the problem (some function to optimize) and the data. Everything in a nice concise package running on Google Cloud using their custom software and hardware.
This seems similar to what Jeff Dean was working on with AutoML: https://research.googleblog.com/2017/11/automl-for-large-sca....
Is DeepMind collaborating with the Google Brain team and how connected are the teams? It seems somehow that the efforts may be duplicated in some areas...
Isn't it good that efforts are duplicated? It commoditizes the work and results, and creates more jobs, so there are more people who understand this field. It's unlikely the approaches will end up exactly the same.
Similarly, take a look at the deep learning library market: caffe (out of Berkeley), tensorflow (google), pytorch (FB)... each has different strengths, but I'm sure glad the pytorch people pushed ahead, even though google put a ton of marketing effort into TF, simply because now we have more awesome things :).
Once a market or product is mature, then I can see the "duplicates are wasteful". But a nascent, exploratory field like ML/DL needs as many different approaches as is possible.
Now, if only we could gradient descent to find the optimal approach ;).
Definitely: Theano is no longer actively developed, nor are there plans for it to be.
If you don't need mobile on-device D.L., take a look at pytorch. Otherwise, Tensorflow.
Fast.ai will release some excellent self-paced coursework in January for PyTorch. Best bang for the buck (free, but time ain't) I've seen in any AI learning. Much of the lower-level stuff is optimized for you, and he gives some great SOTA tricks for getting into the top 10% of Kaggle competitions in like an hour or two.
Alas, no pytorch on device yet. But the state of the art is nearly 100% turnover every year, so the question becomes: do you need SOTA? Many problems are 98+% solved these days, so maybe we've reached "good enough" with some of these applications of d.l.
Does Theano meet your needs? Then no. Does TensorFlow meet them better, enough to justify the cost in switching? Then yes. "Actively developed" is a silly metric. Focus on features, flexibility, robustness etc.
For many (most?) users outside of Google and Facebook the most important feature is "is there an off-the-shelf implementation of new technique XXX or do I have to build it myself?"
For most users the sensible choice comes down to Keras+Tensorflow or PyTorch.
Depends on what you do. If you're starting a new project picking Theano indeed isn't a very good choice due to the reasons you've mentioned. However, if you already have a stable piece of software that does what you want it to do then migration won't add much value and you could spend this time doing something more important, like improving documentation or having dinner with your family and friends.
However, it's worth pointing out that Theano's API is fairly similar to TensorFlow's, so migrating shouldn't be too hard and should be fairly easy to test.
Looks like a genetic algorithm to me, just framed differently from the usual presentation (which is clever): the hyperparams are the DNA, and the network is the environment against which they are executed.
Seems very similar to using RL to tune the hyperparameters. Surely that means there are hyperparameters for PBT that need to be set, such as the exploration vs exploitation tradeoff.
NEAT just uses GA to generate a network topology and weights. From what I read, I think this is just a fancy way to parallelize searching for optimal hyperparameters.
Well, okay, what I meant is that it's literally applying evolutionary algorithms for hyperparameter optimization. Calling it some souped up BS like "population based training" seems like Deepmind marketing is getting out of hand...