
[–]datatatatata 8 points9 points  (7 children)

I'm not sure I understand the hypernet.

Let me try to explain it, and then correct me please. For a given architecture, it takes the "network structure" as input (which is represented as a grid) and produces "all the weights" as output (which I don't know how to represent, but let's assume it works). The "true" weights are given by actually trained networks (slow), so there is nothing particularly clever before training this hypernet.

The clever part is that once this hypernet has learnt to produce weights for N architectures, it can rapidly produce weights for M >> N architectures without actually training each network, thus allowing for many tests at the cost of a few actual trainings.

Right ?

[–]ajmooch[S] 9 points10 points  (6 children)

So, I don't train architectures ahead of time and then train the net to learn the distribution over their weights. Instead, I sample a random architecture, c, at every single minibatch--this architecture is basically a "skeleton" with empty weights that need to be filled in. I then use the hypernet to generate the weights for that architecture, based on c.

W = H(c)

Then, I calculate the error of that minibatch of samples using a normal forward pass through that architecture. The key is that on the backward pass, I don't update the weights W, but instead take dE/dW and backprop THAT through the hypernet, and then update H. The hypernet is learning to generate weights for a new, random architecture at every single training step, such that ideally it learns to, on average, generate decent weights for any architecture that we sample.

The hypothesis is that, so long as those weights are reasonable, even if they're clearly not optimal, then when I sample two archs after training and get their validation error with hypernet-generated weights, I can say "architecture B is likely better than architecture A," and approximately rank them. As I mention in the paper, I observe that this performance does seem to correlate with true performance, but there's no guarantee the correlation holds, and I can construct specific scenarios where it does not hold.
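In case it helps, here's a toy, self-contained sketch of that loop in plain PyTorch, with a single linear layer standing in for the sampled child net and a small MLP standing in for the hypernet--none of this is the actual SMASH code, just the shape of the idea:

    import torch
    import torch.nn.functional as F

    # Toy stand-ins: the "architecture" is just a random encoding vector c, the child
    # net is one linear layer, and the hypernet H maps c to that layer's weights.
    # Only H has trainable parameters.
    D_IN, D_OUT, D_ENC = 32, 10, 16
    H = torch.nn.Sequential(torch.nn.Linear(D_ENC, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, D_IN * D_OUT))
    opt = torch.optim.SGD(H.parameters(), lr=0.01)

    def child_forward(c, x):
        W = H(c).view(D_OUT, D_IN)   # W = H(c): weights generated for this architecture
        return x @ W.t()             # normal forward pass through the sampled child net

    # Training: a new random architecture at every minibatch; dE/dW backprops through H.
    for step in range(200):
        x, y = torch.randn(64, D_IN), torch.randint(0, D_OUT, (64,))
        c = torch.rand(D_ENC)                          # freshly sampled "skeleton" encoding
        loss = F.cross_entropy(child_forward(c, x), y)
        opt.zero_grad()
        loss.backward()                                # gradients flow through W into H
        opt.step()                                     # only the hypernet is updated

    # Evaluation: rank candidate architectures by validation error under hypernet weights.
    x_val, y_val = torch.randn(256, D_IN), torch.randint(0, D_OUT, (256,))
    candidates = [torch.rand(D_ENC) for _ in range(50)]
    with torch.no_grad():
        scores = [F.cross_entropy(child_forward(c, x_val), y_val).item() for c in candidates]
    best = candidates[min(range(len(candidates)), key=scores.__getitem__)]

In the real setup c is an image-like encoding of the sampled memory-bank structure and H is a convnet, but the gradient flow is the same: W is never stored or updated directly.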

Does that make sense?

[–][deleted]  (3 children)

[deleted]

    [–]ajmooch[S] 4 points5 points  (2 children)

    So, I did actually do some PTB experiments where I replicated the RNN search space from the original NASwRL paper, but I know approximately nothing about RNNs so at the moment even trying to just run that code as a vanilla, freely-learned LSTM doesn't work at all (4000 training perplexity! woo!) <_< not to mention that without cuDNN kernels it's literally 20-30x slower than a cuDNN LSTM, which sort of killed my enthusiasm for debugging it.

    As to the correlation, I'm sort of surprised that it holds as well as it does in the one test I did--though in the paper I try to be appropriately bearish and express that, while I observed it in that trial, there's no guarantee that it will hold, and I don't have any hard numbers on the exact conditions under which it breaks down. Hopefully that comes through as an appropriate level of "properly scientific skepticism," since with things like this I'm wary of overstating and claiming this as a silver bullet.

    I briefly mention in the appendix that I basically did not explore the hypernet's architecture at all--it's a fixed, ad-hoc 26 layer DenseNet that I spun up with numbers I pulled out of thin air, and basically never changed other than for one experiment where I crippled it. There's almost certainly room for improvement there, though I've never really trained a serious RNN so I haven't any intuition as to precisely how well it would work out.

    [–][deleted]  (1 child)

    [deleted]

      [–]ajmooch[S] 0 points1 point  (0 children)

      Thanks, if I do pursue any recurrent projects I'll hit you up.

      [–]Reiinakano 0 points1 point  (0 children)

      That is incredible.

      [–]datatatatata 0 points1 point  (0 children)

      Tbh, if you had proposed this idea to me before trying it, I would have told you "there is no way this is going to even remotely work". But it looks like it does... impressive :o

      [–]L43 8 points9 points  (0 children)

      The paper for the lazy: https://arxiv.org/abs/1708.05344

      [–]evc123 6 points7 points  (7 children)

      Is there a way to combine this with Net2Net technique from

      "Reinforcement Learning for Architecture Search by Network Transformation"

      https://arxiv.org/abs/1707.04873

      to search even more efficiently/robustly?

      [–]ajmooch[S] 2 points3 points  (5 children)

      So one obvious trick I mention in the appendix is to use hypernet-generated weights to initialize a resulting net, but the related idea (which eventually became FreezeOut) was to instead progressively fix the architecture from the bottom up until the entire thing was just a regularly trained net. Maybe using something like MCTS to sample archs, or having an RNN predict the architecture a la NASwRL, would help with this, since just evaluating the validation error w.r.t. a single layer through random search might prove difficult, and having something that can pick up on the structure of the architectural space would be useful.

      [–]evc123 3 points4 points  (4 children)

      What if you did something like MAML https://arxiv.org/abs/1703.03400, in which you directly train the HyperNet to generate weights that can easily be adjusted afterwards (to near-optimal) via one gradient step of training? Then you could have the robustness of Zoph's original NASwRL, but training each architecture would only take one or a few gradient steps.
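      A rough sketch of what that might look like, again with a one-layer child net and made-up sizes standing in for the real setup (the single differentiable inner step is the MAML part; nothing here is the actual SMASH or MAML code):

          import torch
          import torch.nn.functional as F

          D_IN, D_OUT, D_ENC, INNER_LR = 32, 10, 16, 0.01
          # Toy hypernet: maps an architecture encoding c to the child net's weight matrix.
          H = torch.nn.Sequential(torch.nn.Linear(D_ENC, 128), torch.nn.ReLU(),
                                  torch.nn.Linear(128, D_IN * D_OUT))
          opt = torch.optim.Adam(H.parameters(), lr=1e-3)

          for step in range(100):
              c = torch.rand(D_ENC)                                  # sampled architecture encoding
              x_in, y_in = torch.randn(64, D_IN), torch.randint(0, D_OUT, (64,))
              x_out, y_out = torch.randn(64, D_IN), torch.randint(0, D_OUT, (64,))

              W = H(c).view(D_OUT, D_IN)                             # hypernet-generated weights
              inner_loss = F.cross_entropy(x_in @ W.t(), y_in)
              (grad,) = torch.autograd.grad(inner_loss, W, create_graph=True)
              W_adapted = W - INNER_LR * grad                        # one differentiable inner step

              outer_loss = F.cross_entropy(x_out @ W_adapted.t(), y_out)
              opt.zero_grad()
              outer_loss.backward()                                  # second-order grads flow into H
              opt.step()                                             # H learns weights that fine-tune well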

      [–]ajmooch[S] 0 points1 point  (3 children)

      Hmm, that's an interesting thought--train each architecture for 1 or 2 epochs, initialized from hypernet-generated weights, and then you'd have a much stronger guarantee that your weights are optimal. I like that a lot! Maybe even start by sampling 500 architectures the way I do now, then train the top 20 or 30 for 2-3 epochs each and pick the best. I could see that being a lot more reliable!
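      A back-of-the-envelope version of that two-stage procedure (proxy_error and finetune_error below are random stand-ins for "validation error with hypernet-generated weights" and "validation error after a few epochs of fine-tuning"; the architecture encoding is an arbitrary bit-tuple):

          import random

          def proxy_error(arch):
              """Stand-in: validation error with hypernet-generated weights (one cheap forward pass)."""
              return random.random()

          def finetune_error(arch, epochs=3):
              """Stand-in: validation error after initializing from hypernet weights and training briefly."""
              return random.random()

          def select_architecture(sample_arch, n_candidates=500, shortlist=25):
              pool = [sample_arch() for _ in range(n_candidates)]
              pool.sort(key=proxy_error)                         # cheap SMASH-style ranking
              return min(pool[:shortlist], key=finetune_error)   # brief fine-tuning only for the top few

          best = select_architecture(lambda: tuple(random.randint(0, 1) for _ in range(16)))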

      [–]evc123 4 points5 points  (0 children)

      You might also get some speed ups during the 1-3 epochs of finetuning by using meta-learning methods from:

      Meta Networks https://arxiv.org/abs/1703.00837

      and/or

      Learning to Learn: Meta-Critic Networks for Sample Efficient Learning https://arxiv.org/abs/1706.09529

      and/or

      Meta-SGD: Learning to Learn Quickly for Few Shot Learning https://arxiv.org/abs/1707.09835 <--which is basically a more expressive version of MAML

      [–]evc123 2 points3 points  (1 child)

      Paper contains following statement:

      Our results demonstrate a correlation between performance using suboptimal weights generated by the auxiliary model and performance using fully-trained weights, indicating that we can efficiently explore the architectural design space through this proxy model.

      Since there is "a correlation between performance using suboptimal weights generated by the auxiliary model and performance using fully-trained weights", you might be able to train a second aux network to learn this correlation. The second aux network would receive the performance using suboptimal weights generated by the auxiliary model as input, and predict the performance using fully-trained weights. This prediction (instead of the performance using suboptimal weights) would then be used to rank the models.
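      As a toy illustration of that second-stage predictor (all numbers below are made-up placeholders, and the degree-2 polynomial is just one arbitrary choice of regressor):

          import numpy as np

          # Placeholder calibration data: proxy error (hypernet-generated weights) and true
          # error (fully-trained weights) for a handful of architectures you did train fully.
          proxy = np.array([0.42, 0.47, 0.51, 0.55, 0.60])
          true_err = np.array([0.061, 0.066, 0.071, 0.080, 0.090])

          # Second aux model: regress true error on proxy error.
          predict_true = np.poly1d(np.polyfit(proxy, true_err, deg=2))

          # Rank new candidates by predicted fully-trained error instead of the raw proxy.
          candidate_proxy = np.array([0.45, 0.58, 0.50])
          ranking = np.argsort(predict_true(candidate_proxy))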

      [–]XalosXandrez 0 points1 point  (0 children)

      They discuss this a little in Appendix D, "Future Directions".

      [–]nickbuch 2 points3 points  (0 children)

      God I wish I understood this more.

      [–]XalosXandrez 2 points3 points  (2 children)

      I have a couple of comments / questions:

      1) If I understand correctly, the architectures it produces are always in between a regular feedforward (VGG-like) net and a densenet. Why not just use a densenet all the time? It would be sufficient to just tune the width and depth of such nets.

      2) Perhaps it is something trivial I am missing - what exactly is the "WRN baseline" in the paper?

      3) I tried to read the paper, but I am still not clear how the encoding works and how the hypernetwork generates weights for arbitrarily shaped conv tensors. The interpretation of an NN activation as a read-write memory module seems crucial, but I am not able to understand why it is so important, given that its shape keeps changing between layers.

      Overall, I really like the goal of this paper and the approach taken. Kudos also on the excellent video! Perhaps it's a bit too much to ask: would you consider doing a 'For Dummies' blogpost to explain the main ideas with examples?

      Thanks!

      [–]ajmooch[S] 1 point2 points  (0 children)

      Thanks for the detailed read! I'll do my best to answer.

      1) That's mostly correct, though I'd describe the results under this scheme as a mashup of multi-tiered residual and dense connections. You could definitely apply this to just picking simple hyperparams (widening factor/depth for a ResNet, growth rate/depth for a DenseNet), but I chose not to for this paper for several reasons:

      a) I ran a bunch of tests last year comparing the performance of DenseNets with varying depth/width but near-identical compute budgets (in terms of convolutional FLOPS), and found that there was basically no meaningful variation in performance. I concluded that, for a fixed compute budget, there's a decent chance you need large variations in the architecture itself--not just in depth/width--for the search to be worth doing.

      b) I ran tests on benchmark datasets for which we already know good parameters, but in general we don't know a priori whether a DenseNet or a ResNet or a FractalNet will be best for a new task. Of course, I didn't test things on any new tasks, so that's largely speculation (especially since I've noted that architectures might display transferability like weights do), but it was the reasoning for pursuing this scheme instead of something simpler.

      c) I chose to favor complexity over practicality because I was sort of interested in "how far can I take this in coming up with ridiculous nets?" My code is obviously inefficient (and horrifyingly complicated, even with comments on nearly every line), but I was more interested in seeing this through as a concept paper than in producing an immediately useful tool.

      2) WRN baselines are Wide ResNet baselines. I collected these myself for STL-10 because that dataset is usually used as a semi-supervised task (making use of a bunch of extra unlabeled images), and I didn't find any modern baselines reported on it anywhere (reviewers specifically requested STL-10). On Downsampled ImageNet they're the baselines collected by the authors of that paper (Chrabaszcz, Loshchilov, and Hutter).

      3) I was worried this would happen--I spent a lot of time trying to boil down my description of this scheme and make it grokkable, but it looks like I'm still having trouble getting it across.

      Basically, after an architecture is sampled, we know exactly what the shape of W needs to be at each op. The key thing is that the trailing dimension of W varies with the number of input banks. In the simplest case, I could just have every slice of the trailing dimension of c correspond to one trailing slice of W, so if I have an op that reads from 3 memory banks, I need to stack 3 slices of c in order to produce a W of the appropriate size.

      With regards to width, I just end up making the width dimension of c as large as the largest required width of W (i.e. the size of W along its leading, output dimension), and then just slice W along its leading dimension as needed. If I weren't applying the net as a convnet that sees all the slices of c for every slice of W at once, and instead fed it little chunks of c in sequence, I wouldn't have to slice W at all.

      The nice thing about all this is that I can make c have as many channels as I want, since convnets just act like FC layers along the channel dimension. So if I want to encode a bunch of information in c (i.e. the one-hot read-write patterns, a one-hot encoding of the dilation factor, one-hot encoding of groups, etc) I can just stack a bunch of slices of c along the channel dimension without it affecting the final output size.
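      A heavily simplified sketch of the slicing, with made-up sizes and a single linear map standing in for the hypernet (so this shows the bookkeeping, not the paper's actual layout): each memory bank read by an op contributes one slice of c, W is generated at the maximum width, and then W is sliced along its leading (output) dimension.

          import torch

          MAX_WIDTH, N_CHANNELS, K = 64, 8, 3   # largest output width, encoding channels, kernel size

          # Stand-in "hypernet": a linear map over the channel dimension of c, roughly what a
          # 1x1 conv would do to each spatial slice of the embedding.
          hyper = torch.nn.Linear(N_CHANNELS, K * K * MAX_WIDTH)

          def generate_weights(bank_codes, n_in_banks, n_out):
              # One encoding slice per input memory bank; extra info (dilation, groups, ...)
              # would be one-hot encoded into the N_CHANNELS dimension without changing the
              # output size.
              c = torch.stack(bank_codes[:n_in_banks])                      # (n_in_banks, N_CHANNELS)
              w = hyper(c)                                                  # (n_in_banks, K*K*MAX_WIDTH)
              w = w.view(n_in_banks, K, K, MAX_WIDTH).permute(3, 1, 2, 0)   # (MAX_WIDTH, K, K, n_in_banks)
              return w[:n_out]          # slice the leading (output) dimension to this op's width

          codes = [torch.randn(N_CHANNELS) for _ in range(4)]   # encoding slices for 4 memory banks
          W = generate_weights(codes, n_in_banks=3, n_out=32)   # -> shape (32, 3, 3, 3)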

      I don't think grokking the exact way I slam all the channels in there is particularly critical to understanding how the whole apparatus operates, but feel free to PM me if you want to discuss it in more detail--I'm keen on making this accessible but distilling it is tough!

      As to a for-dummies explanation, I'm considering just doing a little addendum blog post for "Practical SMASH" that focuses on just picking the best ResNet for the job, so maybe getting rid of the spaghetti architecture element will make it easier to show how all the tensors get moved around.

      [–]evc123 2 points3 points  (1 child)

      You should ask some Variational Inference people how they would approach trying to improve SMASH. Their main focus has always been the tradeoff between accuracy/precision and computational tractability (i.e. variational bounds), which seems to be the crux of what you're trying to manage.

      [–]ajmooch[S] 3 points4 points  (0 children)

      It's funny you should say that, because this project originally started as "Bayesian HyperNets," where I was trying to generate a learned distribution over the weights of a Bayesian NN. I definitely think you could apply a variational viewpoint (replacing the intractable true weights with a variational approximation) but I don't know enough to see where that takes you or what you could do with it--I'll ask around and see if the local Bayesians have any ideas.

      [–]EdwardRaff 1 point2 points  (1 child)

      Was this your NIPS submission? This is a really cool idea. I think the fact that it works at all should have a significant impact on how we think about model architectures, weights, and convolutions for image-related tasks. I'm not sure what that means at this point, but I would not have guessed that this would work at all.

      I would love to see how this performs on some other domains like NLP / general sequence classification.

      [–]ajmooch[S] 1 point2 points  (0 children)

      Thanks! Yep, this is a NIPS submission but the manuscript was really rough so it'll probably turn into an ICLR sub <_< As I mentioned elsewhere, I've got a bug-riddled PTB implementation that I might someday summon the willpower to fix, or maybe work out a collaboration with someone with domain knowledge. What I'm really interested in is tasks for which we don't already know good architectures or primitives (i.e. at least something other than CIFAR), so one possible direction to take this would be to try and see how well we can do on automatic building-block engineering.

      [–]HigherTopoi 1 point2 points  (0 children)

      My primary concern is how, for each learning task, to choose the ratio of (# hypernet parameters) to (# generated parameters) so that there is a good correlation between actual error and SMASH score, as in the first experiment, since a hypernet that is too small or too large relative to the SMASH net doesn't work well. Is the ratio a hyperparameter that depends on the learning task? Is it sensitive? Or does choosing it based on common sense usually lead to a good result consistently?

      In general, how much faster is the convergence when training a network with a hypernet than without one?

      Let c = (c1, c2, ...) be a binary vector describing the architecture, and let E(c, x) = f(H(c), x) be the classification error, where H has already been trained in the first loop. Then I suppose that, for architecture gradient descent, you attempted an update like

      (c1, c2, ...) -> (argmin_{a1} E((a1, c2, ...), x), ..., argmin_{ai} E((c1, ..., ai, ...), x), ...)

      Am I correct? I guess, since the first experiment's result shows that 60% of architectures had a good validation error, and since relatively many points are concentrated around the highest rank, it's probably better to stick with random search at least for this instance.
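      For concreteness, the coordinate-wise update above would look something like this (E below is a random stand-in for the actual error f(H(c), x) computed with hypernet-generated weights):

          import random

          def E(c):
              """Stand-in for the classification error f(H(c), x) under hypernet-generated weights."""
              return random.random()

          def coordinate_step(c):
              """One sweep of the update: for each bit, keep whichever value gives the lower
              error with all other bits held fixed."""
              c = list(c)
              for i in range(len(c)):
                  c = min((c[:i] + [b] + c[i + 1:] for b in (0, 1)), key=E)
              return tuple(c)

          c = tuple(random.randint(0, 1) for _ in range(16))
          print(coordinate_step(c))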

      [–]HigherTopoi 0 points1 point  (5 children)

      I'm very impressed by your work. I've been interested in hypernetworks, since they are one of the few meta-learning methods that can cope with weight optimization, and I didn't expect that one could employ them the way you guys did. Since this method works better for large datasets, I hope someone will report the performance of an architecture found on ImageNet rather than CIFAR-100, as well as an RNN architecture that is superior to DNC/(m)LSTM across various tasks. There have been many meta-learning attempts at the latter, but all of them are only superior on some narrow tasks. I'm going to work on that!

      [–]ajmooch[S] 0 points1 point  (4 children)

      Thanks! I would also be interested in learning archs on ImageNet, but I don't typically have access to the compute needed to really make that happen. I'm also interested in recurrent architectures, but there's been a lot of work recently showing that just throwing a few more tricks and regularizers at LSTMs gets them to the top of the leaderboards, so I'm not really sure what it'd take to come up with a cell that outperforms the baseline (or the ones discovered by NASwRL).

      [–]serge_cell 0 points1 point  (3 children)

      You should ignore CIFAR10/100 completely and concentrate on ImageNet/localization datasets. CIFAR is practically useless - it says next to nothing about performance on real-world datasets. If you don't have enough computing resources, you could contact cloud providers like Microsoft or Amazon; they may provide demo/sponsored access to cloud-based GPUs. Another good smaller dataset is ImageNet64.

      [–]lahwran_ -1 points0 points  (2 children)

      cifar is statistically harder than imagenet - it's a pretty similar distribution of images, but training is harder because they're smaller and you have less class data. performing well on cifar is pretty strong evidence of good performance on imagenet.

      [–]serge_cell -1 points0 points  (1 child)

      That is not observed in practice. Every month there are new papers which show improvements on CIFAR10 and/or 100. If a significant part of them also showed improvements on ImageNet, we would have new state-of-the-art ImageNet results every couple of months, which is obviously not happening. ImageNet is much harder than CIFAR because, with roughly the same number of samples per class as CIFAR100, it has ten times more classes. And for small images there are ImageNet32 and ImageNet64. Why not show the advantage of the method on them, instead of CIFAR? Why is ImageNet so important? Because a NN pretrained on ImageNet can be easily finetuned for almost anything - from car driving to x-ray images.

      ImageNet or it didn't happen.

      [–]lahwran_ 0 points1 point  (0 children)

      > ImageNet is much harder than CIFAR because, with roughly the same number of samples per class as CIFAR100, it has ten times more classes.

      that is what I'm trying to say though