all 53 comments

[–]thatguydr 53 points54 points  (17 children)

Brief summary: scaling depth, width, or resolution in a net independently tends not to improve results beyond a certain point. They instead set depth = α^φ, width = β^φ, and resolution = γ^φ. They then constrain α · β² · γ² ≈ c, and for this paper, c = 2. Grid search on a small net to find the values for α, β, γ, then increase φ to fit system constraints.
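For concreteness, a minimal sketch of that compound-scaling rule (the default α, β, γ below are roughly the values reported in the paper; the baseline depth/width/resolution numbers are made up for illustration):

    import math

    def compound_scale(base_depth, base_width, base_res, phi,
                       alpha=1.2, beta=1.1, gamma=1.15):
        """Scale layers, channels, and input resolution with one coefficient phi.

        With alpha * beta**2 * gamma**2 ~= 2, total FLOPS grow by roughly 2**phi.
        """
        assert abs(alpha * beta ** 2 * gamma ** 2 - 2.0) < 0.1
        depth = int(math.ceil(base_depth * alpha ** phi))   # number of layers
        width = int(math.ceil(base_width * beta ** phi))    # number of channels
        res = int(math.ceil(base_res * gamma ** phi))       # input resolution
        return depth, width, res

    # e.g. scale a toy baseline (18 layers, 32 channels, 224 px) up by phi = 3
    print(compound_scale(18, 32, 224, phi=3))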

This is a huge paper - it's going to change how everyone trains CNNs!

EDIT: I am genuinely curious why depth isn't more important, given that more than one paper has claimed that representation power scales exponentially with depth. In their net, it's only 10% more important than width and equivalent to width².

[–]gwern 17 points18 points  (9 children)

It's astonishing. They do better than GPipe (!) at a fraction of the size (!!) with such a simple-looking solution. How have humans missed this? How have all the previous NAS approaches missed it? It's not like 'change depth, width, or resolution' are unusual primitives. (Serious question BTW; a simple linear scaling relationship should be easily found, and even more easily inferred by a small NN, with all of these Le-style approaches of 'train tens of thousands of different-sized NNs with thousands of GPUs'; so why wasn't it?)

[–]sander314 22 points23 points  (0 children)

The rule also seems to be based on very little other than "let's scale everything together". There's no proof this is anywhere near optimal, so who knows what follow-ups this will have.

[–]thatguydr 7 points8 points  (6 children)

Dude - who does three things at once? That's like a Fields medal! ;)

[–]zawerf 5 points6 points  (2 children)

It might just be the Baader-Meinhof phenomenon, but I just read a quote that says exactly that:

Stan Ulam, who knew von Neumann well, described his mastery of mathematics this way: "Most mathematicians know one method. For example, Norbert Wiener had mastered Fourier transforms. Some mathematicians have mastered two methods and might really impress someone who knows only one of them. John von Neumann had mastered three methods."

Is this actually a popular meme with mathematicians?

[–]gwern 1 point2 points  (0 children)

Gian-Carlo Rota says the same thing in his "Ten Lessons".

[–]thatguydr 0 points1 point  (0 children)

It was a joke. (The other response to it is super-weird, though.)

[–]MohKohn 3 points4 points  (2 children)

If they can show why that works, it's a Fields medal. Otherwise, I think you're looking for a Turing award.

[–]muntoo 11 points12 points  (1 child)

Is this a mathematician's version of throwing shade at a computer scientist?

[–]MohKohn 3 points4 points  (0 children)

Different ways of looking at the same ideas. This is a scientific/empirical result, not a mathematical/theoretical one, and as such not the sort of thing you could win the Fields medal for. Still cool, and it points in an interesting direction.

[–]alexmlamb 1 point2 points  (0 children)

Well, in almost all of my work I just double the number of channels whenever I stride (reduce resolution). I think most people do the same.

I think a lot of people don't work on more nuanced ways to do this selection because (1) it's hard to publish unless the results turn out to be insanely good, and (2) it falls somewhere between what a basic algorithms researcher would focus on and what an applied researcher would focus on, so it ends up under-explored.
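A minimal sketch of that channel-doubling-on-stride convention, assuming a PyTorch-style building block (names are illustrative, not from anyone's actual code):

    import torch.nn as nn

    # Common pattern: stride 2 halves the spatial resolution while the channel
    # count doubles, keeping per-layer compute in each stage roughly constant.
    def downsample_block(in_channels):
        return nn.Sequential(
            nn.Conv2d(in_channels, in_channels * 2, kernel_size=3,
                      stride=2, padding=1, bias=False),
            nn.BatchNorm2d(in_channels * 2),
            nn.ReLU(inplace=True),
        )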

[–]akaberto 1 point2 points  (4 children)

I haven't read it yet but can you explain a bit more why you think so?

Edit: glanced over it. Does seem very promising if it works as advertised.

[–]thatguydr 18 points19 points  (3 children)

Their results are almost obscenely good and the method of implementation is really, really simple. It's easy to scale up from a smaller net, so you can run experiments to figure out a good shape initially.

Everyone, and I mean everyone, always hacks together their CNN solution. They either give up and use off the shelf models and change a few things or they spend a LONG time on hyperparameter selection. This doesn't obviate that entirely, but it will speed the process up significantly. It's a phenomenal paper in that regard.

(It also unfortunately demonstrates how ineffective our subreddit is at paper valuation, because there are so many posts with a few hundred upvotes and this one is currently at eight.

EDIT: At 100 now. I'm happy to walk that back. Sure, all the other papers are at 20-30, but this one got reasonable attention.)

[–][deleted]  (1 child)

[deleted]

    [–]akaberto 1 point2 points  (0 children)

    I actually asked my question because the commenter was being downvoted when I saw it (okay, I started it as a social experiment by voting it down to zero, which was immediately followed by more downvotes; I felt guilty and used the comment to redeem myself). People here have twitchy trigger fingers on the downvote button and follow the trend without thinking for themselves.

    That said, I feel like this research is sensationalist and nice at the same time. It seems pretty easy to reproduce, and the paper is easy to follow (even beginners can appreciate this one).

    [–]Phylliida 0 points1 point  (0 children)

    At 100 votes now

    [–]seraschka 0 points1 point  (0 children)

    Haven't read the paper, but in general, the deeper the net, the more vanishing and exploding gradients become a problem. Sure, there are ways to reduce that effect, like skip connections, batchnorm, attention gates, ... but still, I'd guess there is a sweet-spot depth that balances this.
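    As an illustration of one of those mitigations, a minimal residual (skip-connection) block sketch in PyTorch (illustrative only, not tied to the paper):

        import torch.nn as nn

        # A residual block: the identity path gives gradients a short route back,
        # which is one common way to ease training of deeper nets.
        class ResidualBlock(nn.Module):
            def __init__(self, channels):
                super().__init__()
                self.body = nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                    nn.BatchNorm2d(channels),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                    nn.BatchNorm2d(channels),
                )
                self.act = nn.ReLU(inplace=True)

            def forward(self, x):
                return self.act(x + self.body(x))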

    [–]102564 0 points1 point  (0 children)

    > EDIT: I am genuinely curious why depth isn't more important, given that more than one paper has claimed that representation power scales exponentially with depth. In their net, it's only 10% more important than width and equivalent to width².

    This is pretty well known, actually. While the representation-power scaling with depth you cited is true, it is a theoretical result that isn't necessarily all that relevant in practice. Width in fact often buys you more than depth; this is the whole idea behind WideResNets, which have been around for a long time.

    [–]PublicMoralityPolice 45 points46 points  (2 children)

    > 3 FLOPS may differ from theocratic value due to rounding

    I wasn't aware of this issue, how does rounding real numbers result in a form of government where religious and state authority is combined?

    [–]MohKohn 9 points10 points  (0 children)

    leave it to the public morality police to pick up on this

    [–]bob80333 13 points14 points  (2 children)

    Could this be used to speed up something like YOLO as well?

    With a quick search I found this, where it appears the backbone network YOLO uses gets:

    Network                 top-1   top-5   ops
    Darknet Reference   61.1    83.0    0.81 Bn 
    

    but in the paper EfficientNet-B0 has these properties:

    Network            top-1   top-5   FLOPS
    EfficientNet B0    76.3    93.2    0.39B
    

    That looks like better accuracy at less than half the FLOPS to me, but I don't know how much that would actually help something like YOLO.

    [–]Code_star 2 points3 points  (0 children)

    Well yeah, of course it will, at least if you replace the backbone of YOLO with an EfficientNet. I'm not sure how it would be applied to the actual object-detection portion of YOLO, but it seems reasonable that one could take inspiration from this to scale that as well.

    [–]arXiv_abstract_bot 10 points11 points  (0 children)

    Title: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

    Authors: Mingxing Tan, Quoc V. Le

    Abstract: Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at this https URL.

    PDF Link | Landing Page | Read as web page on arXiv Vanity

    [–]LukeAndGeorge 25 points26 points  (5 children)

    Working on a PyTorch implementation now:

    https://github.com/lukemelas/EfficientNet-PyTorch

    These are exciting times!

    [–]ozizai 3 points4 points  (1 child)

    Could you post whether you can confirm the claims in the paper when you finish?

    [–]LukeAndGeorge 4 points5 points  (0 children)

    Absolutely!

    [–]Geeks_sid 0 points1 point  (1 child)

    I wish I could give you an award now!

    [–]LukeAndGeorge 1 point2 points  (0 children)

    Thanks! The model and pretrained PyTorch weights are now up :)

    [–]LukeAndGeorge 0 points1 point  (0 children)

    Update: released the model with pretrained PyTorch weights and examples.
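    For reference, a typical usage sketch with that repo (assuming its from_pretrained interface; check the repo's README for the exact API and weight names):

        # pip install efficientnet_pytorch  (package name assumed from the repo)
        import torch
        from efficientnet_pytorch import EfficientNet

        model = EfficientNet.from_pretrained('efficientnet-b0')  # downloads ImageNet weights
        model.eval()

        x = torch.randn(1, 3, 224, 224)  # B0's default input resolution
        with torch.no_grad():
            logits = model(x)  # shape (1, 1000): ImageNet class scores
        print(logits.shape)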

    [–]Code_star 5 points6 points  (0 children)

    This is super cool, and I think it's something that CNN architecture design needed: a more objective way of deciding how to build models.

    [–]eukaryote31 3 points4 points  (0 children)

    I wonder if this could also be applied to transformer models, on context size / heads per layer / number of layers. Perhaps that could hugely outperform GPT2 et al with much smaller models.

    [–][deleted]  (4 children)

    [deleted]

      [–]SatanicSurfer 0 points1 point  (3 children)

      I'd agree with you, but in this specific case the models are grouped by accuracy. So within each block all the models have about the same accuracy, and the bolding doesn't correspond to better accuracy. The bolding could correspond to either fewer params or fewer FLOPS, and they have the best results on both.

      [–]drsxr 2 points3 points  (2 children)

      FD: Need to read the full paper

      Quoc Le has been putting out very high quality stuff btw.

      [–]m__ke 4 points5 points  (1 child)

      Lately? When has he not put out very high quality stuff?

      [–]drsxr -1 points0 points  (0 children)

      OK fair point. Edited.

      [–]FlyingOctopus0 2 points3 points  (0 children)

      This is a lot like meta-learning. They "learn" how to do scaling up. I wonder if there are any improvements to be made by using a more complicated model to fit a function f(flops) = argmax_{parameters with the same flops}(accuracy or other metric) at small flops and then extrapolate. (The above function gives the best parameters constrained by the number of flops.) In this setting the paper just finds two points of such a function and "fits" an exponential function.
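      A toy sketch of that extrapolation idea (all numbers made up; it just fits y = a·b^x through two points and extrapolates):

          # Fit y = a * b**x through two (x, y) points, then extrapolate.
          def fit_exponential(p1, p2):
              (x1, y1), (x2, y2) = p1, p2
              b = (y2 / y1) ** (1.0 / (x2 - x1))
              a = y1 / b ** x1
              return a, b

          # Hypothetical: best channel width found at two small budgets
          # (x is a scale coefficient like phi in the paper)
          a, b = fit_exponential((1.0, 64.0), (2.0, 88.0))
          print(a * b ** 4)  # extrapolated width at x = 4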

      [–]alex_raw 1 point2 points  (0 children)

      I like this paper! Thanks for sharing!

      [–]visarga 1 point2 points  (1 child)

      What is network width - number of channels?

      [–]dev-ai 9 points10 points  (0 children)

      Yep.

      • depth - number of layers
      • width - number of channels
      • resolution - input size

      [–]shaggorama 1 point2 points  (0 children)

      Even if it turns out this isn't always the best procedure for CNNs, it's still going to catch on, because people have been thirsty for a heuristic like this to guide their architecture choices.

      [–]veqtor 1 point2 points  (1 child)

      Again we see depthwise convolutions outperforming regular ones, yet research on applying them to GANs, VAEs, etc. hasn't really begun, since their transposed form doesn't exist in any framework :(

      [–]eric01300 0 points1 point  (0 children)

      PixelShuffle could replace transposed convs with regular convs.
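      A minimal sketch of that idea in PyTorch (illustrative; the conv here could just as well be depthwise):

          import torch.nn as nn

          # Upsample 2x without a transposed conv: a regular conv expands channels by 4,
          # then PixelShuffle rearranges them into 2x the spatial resolution.
          def upsample_block(channels):
              return nn.Sequential(
                  nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
                  nn.PixelShuffle(2),
              )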

      [–]albertzeyer 4 points5 points  (3 children)

      We do something similar/related in our pretraining scheme for LSTM encoders (in encoder-decoder-attention end-to-end speech recognition) (paper). We start with a small depth and width (2 layers, 512 dims), and then we gradually grow in depth and width (linearly), until we reach the final size (e.g. 6 layers, 1024 dims).
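      A toy sketch of such a linear growth schedule (using the 2x512 -> 6x1024 numbers from the comment; purely illustrative):

          # Linearly grow (num_layers, hidden_dim) over a fixed number of pretraining stages.
          def growth_schedule(stages, start=(2, 512), end=(6, 1024)):
              for i in range(stages):
                  t = i / (stages - 1)
                  layers = round(start[0] + t * (end[0] - start[0]))
                  dim = round(start[1] + t * (end[1] - start[1]))
                  yield layers, dim

          for layers, dim in growth_schedule(5):
              print(layers, dim)  # 2x512, 3x640, 4x768, 5x896, 6x1024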

      Edit: It seems that multiple people disagree with something I said, as this is getting downvoted. I am curious what exactly? That this is related? If so, why do you think it is not related? One of the results from the paper is that it is important to scale both width and depth together. That's basically the same as what we found, and I personally found it interesting that other people in another context (here: images with convolutional networks) also do this.

      [–]arthurlanher 1 point2 points  (1 child)

      Probably the use of "I do" and "I start". A professor once told me to change every "I" to "we" in a paper I was writing, even though I was the sole author. He said it sounded unprofessional and arrogant.

      [–]albertzeyer 0 points1 point  (0 children)

      Ah, yes, maybe. I changed it to "we". I am used to doing this in papers as well, but I thought that here on Reddit it would be useful additional information, in case anyone has further questions.

      [–]-Rizhiy- 0 points1 point  (1 child)

      Might also be an interesting idea to scale the network with the amount of data. While more data is always better, it might identify cases where the network is not expressive enough to capture the data effectively, or where it is overly redundant because the amount of data is too small.

      [–]dorsalstream 1 point2 points  (0 children)

      There is recent work showing that adaptive gradient descent methods cause this to happen implicitly in convnets under certain conditions https://arxiv.org/abs/1811.12495

      [–]kuiyuan 0 points1 point  (0 children)

      Nice work. Simple, efficient, and effective. Given a budget on the number of neurons, EfficientNet spends those neurons across the spatial, channel, and depth dimensions in a more optimal way.

      [–]wuziheng 0 points1 point  (0 children)

      Can we use this strategy on a smaller base model to get a small backbone? For example, use ShuffleNetV2 0.5x as the base model and expand it to a ~140M-FLOPS backbone. Would that be better than the original ShuffleNetV2 1x (140M FLOPS)?

      [–]eugenelet123 0 points1 point  (0 children)

      I'm curious about one thing not mentioned in the paper: the number of search points and the range used for the grid search over the 3 dimensions. Let's say each parameter requires 10 searches; wouldn't this require training 10³ independent models of different sizes?
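      A back-of-the-envelope sketch of that count (hypothetical grid values; note the paper's α · β² · γ² ≈ 2 constraint prunes most combinations):

          from itertools import product

          # 10 candidate values per axis -> 10**3 = 1000 raw combinations,
          # but only those (roughly) satisfying alpha * beta**2 * gamma**2 ~= 2 need training.
          values = [1.0 + 0.05 * i for i in range(10)]
          grid = list(product(values, values, values))
          feasible = [(a, b, g) for a, b, g in grid
                      if abs(a * b ** 2 * g ** 2 - 2.0) < 0.05]
          print(len(grid), len(feasible))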

      [–]dclaz 0 points1 point  (0 children)

      Wonder what the carbon footprint of deriving that scaling heuristic was...

      Not saying it's a bad or unwelcome result, but I'm guessing the number of model fits that would have been performed would have required a serious amount of hardware.