all 53 comments

[–]thatguydr 53 points54 points  (17 children)

Brief summary: scaling depth, width, or resolution in a net independently tends not to improve results beyond a certain point. They instead set depth = α^φ, width = β^φ, and resolution = γ^φ. They then constrain α · β² · γ² ≈ c, and for this paper, c = 2. Grid search on a small net to find the values for α, β, γ, then increase φ to fit system constraints.
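For concreteness, a minimal sketch of that compound-scaling rule (the default α, β, γ below are roughly the values reported in the paper; the baseline depth/width/resolution numbers are made up for illustration):

    import math

    def compound_scale(base_depth, base_width, base_res, phi,
                       alpha=1.2, beta=1.1, gamma=1.15):
        """Scale layers, channels, and input resolution with one coefficient phi.

        With alpha * beta**2 * gamma**2 ~= 2, total FLOPS grow by roughly 2**phi.
        """
        assert abs(alpha * beta ** 2 * gamma ** 2 - 2.0) < 0.1
        depth = int(math.ceil(base_depth * alpha ** phi))   # number of layers
        width = int(math.ceil(base_width * beta ** phi))    # number of channels
        res = int(math.ceil(base_res * gamma ** phi))       # input resolution
        return depth, width, res

    # e.g. scale a toy baseline (18 layers, 32 channels, 224 px) up by phi = 3
    print(compound_scale(18, 32, 224, phi=3))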

This is a huge paper - it's going to change how everyone trains CNNs!

EDIT: I am genuinely curious why depth isn't more important, given that more than one paper has claimed that representation power scales exponentially with depth. In their net, it's only 10% more important than width and equivalent to width².

[–]gwern 17 points18 points  (9 children)

It's astonishing. They do better than GPipe (!) at a fraction of the size (!!) with such a simple-looking solution. How have humans missed this? How have all the previous NAS approaches missed it? It's not like 'change depth, width, or resolution' are unusual primitives. (Serious question BTW; a simple linear scaling relationship should be easily found, and even more easily inferred by a small NN, with all of these Le-style approaches of 'train tens of thousands of different-sized NNs with thousands of GPUs'; so why wasn't it?)

[–]sander314 22 points23 points  (0 children)

The rule also seems to be based on very little other than "let's scale everything together". There's no proof this is anywhere near optimal, so who knows what follow-ups this will have.

[–]thatguydr 7 points8 points  (6 children)

Dude - who does three things at once? That's like a Fields medal! ;)

[–]zawerf 5 points6 points  (2 children)

It might just be the Baader-Meinhof phenomenon, but I just read a quote that says exactly that:

Stan Ulam, who knew von Neumann well, described his mastery of mathematics this way: "Most mathematicians know one method. For example, Norbert Wiener had mastered Fourier transforms. Some mathematicians have mastered two methods and might really impress someone who knows only one of them. John von Neumann had mastered three methods."

Is this actually a popular meme with mathematicians?

[–]gwern 1 point2 points  (0 children)

Gian-Carlo Rota says the same thing in his "Ten Lessons".

[–]thatguydr 0 points1 point  (0 children)

It was a joke. (The other response to it is super-weird, though.)

[–]MohKohn 3 points4 points  (2 children)

If they can show why that works, it's a Fields medal. Otherwise, I think you're looking for a Turing award.

[–]muntoo 11 points12 points  (1 child)

Is this a mathematician's version of throwing shade at a computer scientist?

[–]MohKohn 3 points4 points  (0 children)

Different ways of looking at the same ideas. This is a scientific/empirical result, not a mathematical/theoretical one, and as such not the sort of thing you could win the Fields medal for. Still cool, and it points in an interesting direction.

[–]alexmlamb 1 point2 points  (0 children)

Well, in almost all of my work I just double the number of channels whenever I stride (reduce resolution). I think most people do the same.

I think a lot of people don't work on more nuanced ways to do this selection because (1) it's hard to publish unless the results turn out to be insanely good, and (2) it falls somewhere between what a basic algorithms researcher would focus on and what an applied researcher would focus on, so it ends up under-explored.
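A minimal sketch of that channel-doubling-on-stride convention, assuming a PyTorch-style building block (names are illustrative, not from anyone's actual code):

    import torch.nn as nn

    # Common pattern: stride 2 halves the spatial resolution while the channel
    # count doubles, keeping per-layer compute in each stage roughly constant.
    def downsample_block(in_channels):
        return nn.Sequential(
            nn.Conv2d(in_channels, in_channels * 2, kernel_size=3,
                      stride=2, padding=1, bias=False),
            nn.BatchNorm2d(in_channels * 2),
            nn.ReLU(inplace=True),
        )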

[–]akaberto 1 point2 points  (4 children)

I haven't read it yet but can you explain a bit more why you think so?

Edit: glanced over it. Does seem very promising if it works as advertised.

[–]thatguydr 18 points19 points  (3 children)

Their results are almost obscenely good and the method of implementation is really, really simple. It's easy to scale up from a smaller net, so you can run experiments to figure out a good shape initially.

Everyone, and I mean everyone, always hacks together their CNN solution. They either give up and use off the shelf models and change a few things or they spend a LONG time on hyperparameter selection. This doesn't obviate that entirely, but it will speed the process up significantly. It's a phenomenal paper in that regard.

(It also unfortunately demonstrates how ineffective our subreddit is at paper valuation, because there are so many posts with a few hundred upvotes and this one is currently at eight.

EDIT: At 100 now. I'm happy to walk that back. Sure, all the other papers are at 20-30, but this one got reasonable attention.)

[–][deleted]  (1 child)

[deleted]

    [–]akaberto 1 point2 points  (0 children)

    I actually asked my question because the commenter was being downvoted when I saw it (okay, I started it as a social experiment by voting it down to zero, which was immediately followed by more downvotes; I felt guilty and used the comment to redeem myself). People here have twitchy trigger fingers on the downvote button and follow the trend without thinking for themselves.

    That said, I feel like this research is sensationalist and nice at the same time. It seems pretty easy to reproduce, and the paper is easy to follow (even beginners can appreciate this one).

    [–]Phylliida 0 points1 point  (0 children)

    At 100 votes now

    [–]seraschka 0 points1 point  (0 children)

    Haven't read the paper, but in general, the deeper the net, the more vanishing and exploding gradients become a problem. Sure, there are ways to reduce that effect, like skip connections, batchnorm, attention gates, ... but still, I'd guess there is a sweet-spot depth that balances this.
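    As an illustration of one of those mitigations, a minimal residual (skip-connection) block sketch in PyTorch (illustrative only, not tied to the paper):

        import torch.nn as nn

        # A residual block: the identity path gives gradients a short route back,
        # which is one common way to ease training of deeper nets.
        class ResidualBlock(nn.Module):
            def __init__(self, channels):
                super().__init__()
                self.body = nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                    nn.BatchNorm2d(channels),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                    nn.BatchNorm2d(channels),
                )
                self.act = nn.ReLU(inplace=True)

            def forward(self, x):
                return self.act(x + self.body(x))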

    [–]102564 0 points1 point  (0 children)

    > EDIT: I am genuinely curious why depth isn't more important, given that more than one paper has claimed that representation power scales exponentially with depth. In their net, it's only 10% more important than width and equivalent to width².

    This is pretty well known, actually. While the representation-power scaling with depth you cited is true, it is a theoretical result that isn't necessarily all that relevant in practice. Width in fact often buys you more than depth; this is the whole idea behind WideResNets, which have been around for a long time.

    [–]PublicMoralityPolice 45 points46 points  (2 children)

    > 3 FLOPS may differ from theocratic value due to rounding

    I wasn't aware of this issue, how does rounding real numbers result in a form of government where religious and state authority is combined?

    [–]MohKohn 9 points10 points  (0 children)

    leave it to the public morality police to pick up on this

    [–]bob80333 13 points14 points  (2 children)

    Could this be used to speed up something like YOLO as well?

    With a quick search I found this, where it appears the backbone network YOLO uses gets:

    Network                 top-1   top-5   ops
    Darknet Reference   61.1    83.0    0.81 Bn 
    

    but in the paper EfficientNet-B0 has these properties:

    Network            top-1   top-5   FLOPS
    EfficientNet B0    76.3    93.2    0.39B
    

    That looks like better accuracy at less than half the FLOPS to me, but I don't know how much that would actually help something like YOLO.

    [–]Code_star 2 points3 points  (0 children)

    Well yeah, of course it will, at least if you replace the backbone of YOLO with an EfficientNet. I'm not sure how it would be applied to the actual object-detection portion of YOLO, but it seems reasonable that one could take inspiration from this to scale that as well.

    [–]arXiv_abstract_bot 10 points11 points  (0 children)

    Title: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

    Authors: Mingxing Tan, Quoc V. Le

    Abstract: Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at this https URL.

    PDF Link | Landing Page | Read as web page on arXiv Vanity

    [–]LukeAndGeorge 25 points26 points  (5 children)

    Working on a PyTorch implementation now:

    https://github.com/lukemelas/EfficientNet-PyTorch

    These are exciting times!

    [–]ozizai 3 points4 points  (1 child)

    Could you post whether you can confirm the claims in the paper when you finish?

    [–]LukeAndGeorge 4 points5 points  (0 children)

    Absolutely!

    [–]Geeks_sid 0 points1 point  (1 child)

    I wish I could give you an award now!

    [–]LukeAndGeorge 1 point2 points  (0 children)

    Thanks! The model and pretrained PyTorch weights are now up :)

    [–]LukeAndGeorge 0 points1 point  (0 children)

    Update: released the model with pretrained PyTorch weights and examples.
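    For reference, a typical usage sketch with that repo (assuming its from_pretrained interface; check the repo's README for the exact API and weight names):

        # pip install efficientnet_pytorch  (package name assumed from the repo)
        import torch
        from efficientnet_pytorch import EfficientNet

        model = EfficientNet.from_pretrained('efficientnet-b0')  # downloads ImageNet weights
        model.eval()

        x = torch.randn(1, 3, 224, 224)  # B0's default input resolution
        with torch.no_grad():
            logits = model(x)  # shape (1, 1000): ImageNet class scores
        print(logits.shape)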

    [–]Code_star 5 points6 points  (0 children)

    This is super cool, and I think it's something that CNN architecture design needed: a more objective way of deciding how to build models.

    [–]eukaryote31 3 points4 points  (0 children)

    I wonder if this could also be applied to transformer models, on context size / heads per layer / number of layers. Perhaps that could hugely outperform GPT2 et al with much smaller models.

    [–][deleted]  (4 children)

    [deleted]

      [–]SatanicSurfer 0 points1 point  (3 children)

      I'd agree with you, but in this specific case the models are grouped by accuracy. So within each block all the models have about the same accuracy, and the bolding doesn't correspond to better accuracy. The bolding could correspond to either fewer params or fewer FLOPS, and they have the best results on both.

      [–]drsxr 2 points3 points  (2 children)

      FD: Need to read the full paper

      Quoc Le has been putting out very high quality stuff btw.

      [–]m__ke 4 points5 points  (1 child)

      Lately? When has he not put out very high quality stuff?

      [–]drsxr -1 points0 points  (0 children)

      OK fair point. Edited.

      [–]FlyingOctopus0 2 points3 points  (0 children)

      This is a lot like meta-learning. They "learn" how to do scaling up. I wonder if there are any improvements to be made by using a more complicated model to fit a function f(flops) = argmax_{parameters with the same flops}(accuracy or other metric) at small flops and then extrapolate. (The above function gives the best parameters constrained by the number of flops.) In this setting the paper just finds two points of such a function and "fits" an exponential function.
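      A toy sketch of that extrapolation idea (all numbers made up; it just fits y = a·b^x through two points and extrapolates):

          # Fit y = a * b**x through two (x, y) points, then extrapolate.
          def fit_exponential(p1, p2):
              (x1, y1), (x2, y2) = p1, p2
              b = (y2 / y1) ** (1.0 / (x2 - x1))
              a = y1 / b ** x1
              return a, b

          # Hypothetical: best channel width found at two small budgets
          # (x is a scale coefficient like phi in the paper)
          a, b = fit_exponential((1.0, 64.0), (2.0, 88.0))
          print(a * b ** 4)  # extrapolated width at x = 4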

      [–]alex_raw 1 point2 points  (0 children)

      I like this paper! Thanks for sharing!

      [–]visarga 1 point2 points  (1 child)

      What is network width - number of channels?

      [–]dev-ai 9 points10 points  (0 children)

      Yep.

      • depth - number of layers
      • width - number of channels
      • resolution - input size

      [–]shaggorama 1 point2 points  (0 children)

      Even if it turns out this isn't always the best procedure for CNNs, it's still going to catch on, because people have been thirsty for a heuristic like this to guide their architecture choices.

      [–]veqtor 1 point2 points  (1 child)

      Again we see depthwise convolutions outperforming regular ones, yet research on applying them to GANs, VAEs, etc. hasn't really begun, since their transposed form doesn't exist in any framework :(

      [–]eric01300 0 points1 point  (0 children)

      PixelShuffle could replace transposed convs with regular convs.
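      A minimal sketch of that idea in PyTorch (illustrative; the conv here could just as well be depthwise):

          import torch.nn as nn

          # Upsample 2x without a transposed conv: a regular conv expands channels by 4,
          # then PixelShuffle rearranges them into 2x the spatial resolution.
          def upsample_block(channels):
              return nn.Sequential(
                  nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
                  nn.PixelShuffle(2),
              )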

      [–]albertzeyer 4 points5 points  (3 children)

      We do something similar/related in our pretraining scheme for LSTM encoders (in encoder-decoder-attention end-to-end speech recognition) (paper). We start with a small depth and width (2 layers, 512 dims), and then we gradually grow in depth and width (linearly), until we reach the final size (e.g. 6 layers, 1024 dims).
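      A toy sketch of such a linear growth schedule (using the 2x512 -> 6x1024 numbers from the comment; purely illustrative):

          # Linearly grow (num_layers, hidden_dim) over a fixed number of pretraining stages.
          def growth_schedule(stages, start=(2, 512), end=(6, 1024)):
              for i in range(stages):
                  t = i / (stages - 1)
                  layers = round(start[0] + t * (end[0] - start[0]))
                  dim = round(start[1] + t * (end[1] - start[1]))
                  yield layers, dim

          for layers, dim in growth_schedule(5):
              print(layers, dim)  # 2x512, 3x640, 4x768, 5x896, 6x1024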

      Edit: It seems that multiple people disagree with something I said, as this is getting downvoted. I am curious what exactly? That this is related? If so, why do you think it is not related? One of the results from the paper is that it is important to scale both width and depth together. That's basically the same as what we found, and I personally found it interesting that other people in another context (here: images with convolutional networks) also do this.

      [–]arthurlanher 1 point2 points  (1 child)

      Probably the use of "I do" and "I start". A professor once told me to change every "I" to "we" in a paper I was writing, even though I was the sole author. He said it sounded unprofessional and arrogant.

      [–]albertzeyer 0 points1 point  (0 children)

      Ah, yes, maybe. I changed it to "we". I am used to doing this in papers as well, but I thought that here on Reddit it would be useful additional information, in case anyone has further questions.

      [–]-Rizhiy- 0 points1 point  (1 child)

      Might also be an interesting idea to scale the network with the amount of data. While more data is always better, it might identify cases where the network is not expressive enough to capture the data effectively, or where it is overly redundant because the amount of data is too small.

      [–]dorsalstream 1 point2 points  (0 children)

      There is recent work showing that adaptive gradient descent methods cause this to happen implicitly in convnets under certain conditions https://arxiv.org/abs/1811.12495

      [–]kuiyuan 0 points1 point  (0 children)

      Nice work. Simple, efficient, and effective. Given a budget on the number of neurons, EfficientNet spends those neurons across the spatial, channel, and depth dimensions in a more optimal way.

      [–]wuziheng 0 points1 point  (0 children)

      Can we use this strategy on a smaller base model to get a small backbone? For example, use ShuffleNetV2 0.5x as the base model and expand it to a ~140M-FLOPS backbone. Would that be better than the original ShuffleNetV2 1x (140M FLOPS)?

      [–]eugenelet123 0 points1 point  (0 children)

      I'm curious about one thing not mentioned in the paper: the number of search points and the range used for the grid search over the 3 dimensions. Let's say each parameter requires 10 searches; wouldn't this require training 10³ independent models of different sizes?
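      A back-of-the-envelope sketch of that count (hypothetical grid values; note the paper's α · β² · γ² ≈ 2 constraint prunes most combinations):

          from itertools import product

          # 10 candidate values per axis -> 10**3 = 1000 raw combinations,
          # but only those (roughly) satisfying alpha * beta**2 * gamma**2 ~= 2 need training.
          values = [1.0 + 0.05 * i for i in range(10)]
          grid = list(product(values, values, values))
          feasible = [(a, b, g) for a, b, g in grid
                      if abs(a * b ** 2 * g ** 2 - 2.0) < 0.05]
          print(len(grid), len(feasible))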

      [–]dclaz 0 points1 point  (0 children)

      Wonder what the carbon footprint of deriving that scaling heuristic was...

      Not saying it's a bad or unwelcome result, but I'm guessing the number of model fits that would have been performed would have required a serious amount of hardware.