
Links
 “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
 “Microdosing: Knowledge Distillation for GAN-Based Compression”, Helminger et al 2022
 “ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, Wang et al 2021
 “Causal Distillation for Language Models”, Wu et al 2021
 “Extrapolating from a Single Image to a Thousand Classes Using Distillation”, Asano & Saeed 2021
 “FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Lin et al 2021
 “𝜇NCA: Texture Generation With Ultra-Compact Neural Cellular Automata”, Mordvintsev & Niklasson 2021
 “Prune Once for All: Sparse Pre-Trained Language Models”, Zafrir et al 2021
 “DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models”, Chen et al 2021
 “HALP: Hardware-Aware Latency Pruning”, Shen et al 2021
 “When in Doubt, Summon the Titans: Efficient Inference With Large Models”, Rawat et al 2021
 “Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, West et al 2021
 “Language Modelling via Learning to Rank”, Frydenlund et al 2021
 “8-bit Optimizers via Blockwise Quantization”, Dettmers et al 2021
 “On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Lai et al 2021
 “Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Bondarenko et al 2021
 “Beyond Distillation: Task-level Mixture-of-Experts (Task-MoE) for Efficient Inference”, Kudugunta et al 2021
 “Block Pruning For Faster Transformers”, Lagunas et al 2021
 “Dataset Distillation With Infinitely Wide Convolutional Networks”, Nguyen et al 2021
 “EvilModel: Hiding Malware Inside of Neural Network Models”, Wang et al 2021
 “A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness”, Diffenderfer et al 2021
 “Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, Menghani 2021
 “Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Chen et al 2021
 “Clusterability in Neural Networks”, Filan et al 2021
 “Switch Transformers: Scaling to Trillion Parameter Models With Simple and Efficient Sparsity”, Fedus et al 2021
 “Scaling down Deep Learning”, Greydanus 2020
 “Extreme Model Compression for On-device Natural Language Understanding”, Sathyendra et al 2020
 “A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
 “Dataset Meta-Learning from Kernel Ridge-Regression”, Nguyen et al 2020
 “Optimal Subarchitecture Extraction For BERT”, Wynter & Perry 2020
 “Logarithmic Pruning Is All You Need”, Orseau et al 2020
 “On the Predictability of Pruning Across Scales”, Rosenfeld et al 2020
 “SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners”, Chen et al 2020
 “Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow”, Tanaka et al 2020
 “Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Sanh et al 2020
 “Bayesian Bits: Unifying Quantization and Pruning”, Baalen et al 2020
 “Training With Quantization Noise for Extreme Model Compression”, Fan et al 2020
 “On the Effect of Dropping Layers of Pre-trained Transformer Models”, Sajjad et al 2020
 “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers”, Li et al 2020
 “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
 “QUARL: Quantized Reinforcement Learning”, Lam et al 2019
 “Learning to Seek: Autonomous Source Seeking With Deep Reinforcement Learning Onboard a Nano Drone Microcontroller”, Duisterhof et al 2019
 “TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
 “Accelerating Large-Scale Inference With Anisotropic Vector Quantization”, Guo et al 2019
 “And the Bit Goes Down: Revisiting the Quantization of Neural Networks”, Stock et al 2019
 “Sparse Networks from Scratch: Faster Training without Losing Performance”, Dettmers & Zettlemoyer 2019
 “Weight Agnostic Neural Networks”, Gaier & Ha 2019
 “Playing the Lottery With Rewards and Multiple Languages: Lottery Tickets in RL and NLP”, Yu et al 2019
 “StyleNAS: An Empirical Study of Neural Architecture Search to Uncover Surprisingly Fast End-to-End Universal Style Transfer Networks”, An et al 2019
 “SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers”, Fedorov et al 2019
 “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, Tan & Le 2019
 “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Voita et al 2019
 “Stabilizing the Lottery Ticket Hypothesis”, Frankle et al 2019
 “The State of Sparsity in Deep Neural Networks”, Gale et al 2019
 “Superposition of Many Models into One”, Cheung et al 2019
 “Compressing GANs Using Knowledge Distillation”, Aguinaldo et al 2019
 “Rethinking Floating Point for Deep Learning”, Johnson 2018
 “A Closer Look at Structured Pruning for Neural Network Compression”, Crowley et al 2018
 “Network Recasting: A Universal Method for Network Architecture Transformation”, Yu et al 2018
 “Rethinking Numerical Representations for Deep Neural Networks”, Hill et al 2018
 “ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, Ping et al 2018
 “Playing Atari With Six Neurons”, Cuccu et al 2018
 “Quantization Mimic: Towards Very Tiny CNN for Object Detection”, Wei et al 2018
 “Measuring the Intrinsic Dimension of Objective Landscapes”, Li et al 2018
 “SqueezeNext: Hardware-Aware Neural Network Design”, Gholami et al 2018
 “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”, Frankle & Carbin 2018
 “Wide Compression: Tensor Ring Nets”, Wang et al 2018
 “Training Wide Residual Networks for Deployment Using a Single Bit for Each Weight”, McDonnell 2018
 “Efficient Neural Audio Synthesis”, Kalchbrenner et al 2018
 “Universal Deep Neural Network Compression”, Choi et al 2018
 “Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks”, Mittal et al 2018
 “Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing”, Rosenfeld & Tsotsos 2018
 “Learning to Prune Filters in Convolutional Neural Networks”, Huang et al 2018
 “Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Theis et al 2018
 “Fix Your Classifier: the Marginal Value of Training the Last Weight Layer”, Hoffer et al 2018
 “Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017
 “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Lin et al 2017
 “Automated Pruning for Deep Neural Network Compression”, Manessi et al 2017
 “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, Oord et al 2017
 “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions”, Wu et al 2017
 “Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, Gao et al 2017
 “Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method”, Sun et al 2017
 “xUnit: Learning a Spatial Activation Function for Efficient Image Restoration”, Kligvasser et al 2017
 “NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm”, Dai et al 2017
 “Compressing Word Embeddings via Deep Compositional Code Learning”, Shu & Nakayama 2017
 “Learning Discrete Weights Using the Local Reparameterization Trick”, Shayer et al 2017
 “To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression”, Zhu & Gupta 2017
 “N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, Ashok et al 2017
 “Training Shallow and Thin Networks for Acceleration via Knowledge Distillation With Conditional Adversarial Networks”, Xu et al 2017
 “Natural Language Processing With Small Feed-Forward Networks”, Botha et al 2017
 “Bayesian Sparsification of Recurrent Neural Networks”, Lobacheva et al 2017
 “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”, Zhang et al 2017
 “Structured Bayesian Pruning via Log-Normal Multiplicative Noise”, Neklyudov et al 2017
 “Exploring Sparsity in Recurrent Neural Networks”, Narang et al 2017
 “Shake-Shake Regularization of 3-branch Residual Networks”, Gastaldi 2017
 “Variational Dropout Sparsifies Deep Neural Networks”, Molchanov et al 2017
 “Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016
 “Policy Distillation”, Rusu et al 2015
 “Tensorizing Neural Networks”, Novikov et al 2015
 “Distilling the Knowledge in a Neural Network”, Hinton et al 2015
 Knowledge distillation
 Miscellaneous
Neural nets are extremely ‘overparameterized’, in the sense that they have orders of magnitude more parameters than necessary to solve the problems they are trained on. This can be proven both by the regular improvements in training smaller/faster but still performant networks, and by directly creating smaller neural nets with similar or identical performance on those problems: deleting parameters (sparsification), reducing the precision of the numeric encoding (compression), or training a much smaller network from scratch using the original large network in some way (distillation).
Mysteriously, these smaller networks typically cannot be trained from scratch; performance gains can be obtained without the original data; models can be trained to imitate themselves in self-distillation; despite all this indicating that overfitting ought to be a major concern, they generalize well; and many of these smaller networks are, in some sense, already present in the original neural network. This is frequently taken to indicate some sort of blessing of scale: large NNs have smoother loss landscapes, which simple optimizers can successfully traverse to good optima no matter how hard the problem, as compared to smaller networks, which may wind up ‘trapped’ at a bad place with no free parameters to let them slip around obstacles and find some way to improve (much less the loss landscapes of equivalently powerful but extremely brittle encodings such as Brainf—k or assembler programs). As well as being of great theoretical interest (how can we train these small models directly? what does this tell us about how NNs work?), such smaller NNs are critical to practical real-world deployment to servers & smartphones at scale, and to the design of accelerator hardware supporting reduced-precision operations; they are also an interesting case of capability growth for AI risk: as soon as any NN exists which can achieve performance goal X, it is likely that a much more efficient NN (potentially orders of magnitude smaller or faster) can be created to achieve X almost immediately thereafter. (These are merely one way in which your software can be much faster.)
Below are some examples of NNs being compressed in size or FLOPs by anywhere from 50% to ~17,000% (a vastly incomplete bibliography; merely some papers I have noted during my general reading).
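As a concrete illustration of the simplest technique in this bibliography, here is a minimal sketch of one-shot magnitude pruning in NumPy: zero out the smallest-magnitude weights and keep the rest. (`magnitude_prune` is a hypothetical helper for illustration, not taken from any paper below.)

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # the k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
pruned = magnitude_prune(w, 0.9)  # ~90% of entries zeroed, largest 10% kept
```

In practice pruning is usually iterative (prune a little, fine-tune, repeat), but even this one-shot version often preserves surprising amounts of accuracy.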
Links
“DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, Rajbhandari et al 2022
“DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale”, (2022-01-14; ; similar):
[blog] As the training of giant dense models hits the boundary of the availability and capability of today’s hardware resources, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their substantial training cost reduction compared to a quality-equivalent dense model.
Their training cost savings have been demonstrated from encoder-decoder models (prior works) to a 5× saving for autoregressive language models (this work, along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage.
To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7×, and a highly optimized inference system that provides 7.3× better latency and cost compared to existing MoE inference solutions. It offers ultra-fast inference latencies (25ms) for trillion-parameter MoE models. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5× faster and 9× cheaper inference compared to quality-equivalent dense models.
We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
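The core of an MoE layer is sparse routing: each token activates only one (or a few) expert networks, so compute per token stays roughly constant as the parameter count grows with the number of experts. A minimal top-1-gating sketch in NumPy (names like `moe_forward` are illustrative, not DeepSpeed’s API):

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws):
    """Sparse MoE layer: route each token to its top-1 expert."""
    logits = x @ gate_w             # (tokens, n_experts) gating scores
    choice = logits.argmax(axis=1)  # top-1 routing decision per token
    out = np.empty_like(x)
    for e, w in enumerate(expert_ws):
        mask = choice == e
        out[mask] = x[mask] @ w     # only the routed tokens pay for expert e
    return out, choice

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))                            # 8 tokens, d=4
gate_w = rng.normal(size=(4, 2))                       # gate over 2 experts
experts = [rng.normal(size=(4, 4)) for _ in range(2)]  # 2 expert weight matrices
y, choice = moe_forward(x, gate_w, experts)
```

Real MoE systems add load-balancing losses, capacity limits, and expert parallelism across devices; none of that is shown here.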
“Microdosing: Knowledge Distillation for GAN-Based Compression”, Helminger et al 2022
“Microdosing: Knowledge Distillation for GAN-based Compression”, (2022-01-07; ; similar):
Recently, significant progress has been made in learned image and video compression. In particular, the usage of Generative Adversarial Networks has led to impressive results in the low-bitrate regime. However, the model size remains an important issue in current state-of-the-art proposals, and existing solutions require significant computation effort on the decoding side. This limits their usage in realistic scenarios and the extension to video compression. In this paper, we demonstrate how to leverage knowledge distillation to obtain equally capable image decoders at a fraction of the original number of parameters. We investigate several aspects of our solution, including sequence specialization with side information for image coding. Finally, we also show how to transfer the obtained benefits into the setting of video compression. Overall, this allows us to reduce the model size by a factor of 20 and to achieve a 50% reduction in decoding time.
“ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, Wang et al 2021
“ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation”, (2021-12-23; ; similar):
[blog] Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge-enhanced models, and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts. To reduce the computation overhead and carbon emission, we propose an online distillation framework for ERNIE 3.0 Titan, where the teacher model will teach students and train itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.
“Causal Distillation for Language Models”, Wu et al 2021
“Causal Distillation for Language Models”, (2021-12-05; similar):
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g. language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model, a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
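The “standard approach” referenced here descends from the Hinton et al 2015 objective: a weighted sum of the ordinary task loss and a temperature-softened KL term against the teacher’s output distribution. A minimal NumPy sketch (function names are mine, not from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T - (z / T).max(axis=-1, keepdims=True)  # temperature + stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * hard-label cross-entropy + (1-alpha) * T^2 * KL(teacher || student)."""
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), labels]).mean()
    pt, ps = softmax(teacher_logits, T), softmax(student_logits, T)
    kl = (pt * (np.log(pt) - np.log(ps))).sum(axis=-1).mean()
    return alpha * ce + (1 - alpha) * T**2 * kl

rng = np.random.default_rng(2)
logits = rng.normal(size=(4, 10))
labels = np.array([1, 2, 3, 4])
loss = distillation_loss(logits, logits + 0.1 * rng.normal(size=(4, 10)), labels)
```

The T² factor keeps gradient magnitudes comparable across temperatures; hidden-state imitation and IIT add further terms on top of this basic form.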
“Extrapolating from a Single Image to a Thousand Classes Using Distillation”, Asano & Saeed 2021
“Extrapolating from a Single Image to a Thousand Classes using Distillation”, (2021-12-01; similar):
What can neural networks learn about the visual world from a single image? While it obviously cannot contain the multitudes of possible objects, scenes and lighting conditions that exist—within the space of all possible 256^(3×224×224) 224-sized square images, it might still provide a strong prior for natural images. To analyze this hypothesis, we develop a framework for training neural networks from scratch using a single image, by means of knowledge distillation from a supervised pretrained teacher. With this, we find that the answer to the above question is: ‘surprisingly, a lot’. In quantitative terms, we find top-1 accuracies of 94%/74% on CIFAR-10/100, 59% on ImageNet and, by extending this method to audio, 84% on SpeechCommands. In extensive analyses we disentangle the effect of augmentations, choice of source image and network architectures, and also discover “panda neurons” in networks that have never seen a panda. This work shows that one image can be used to extrapolate to thousands of object classes and motivates a renewed research agenda on the fundamental interplay of augmentations and images.
“FQ-ViT: Fully Quantized Vision Transformer without Retraining”, Lin et al 2021
“FQ-ViT: Fully Quantized Vision Transformer without Retraining”, (2021-11-27; similar):
Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments. However, most existing quantization methods have been developed and tested mainly on Convolutional Neural Networks (CNN), and suffer severe degradation when applied to Transformer-based architectures. In this work, we present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers. In particular, we propose Powers-of-Two Scale (PTS) to deal with the serious inter-channel variation of LayerNorm inputs in a hardware-friendly way. In addition, we propose Log-Int-Softmax (LIS) that can sustain the extreme non-uniform distribution of the attention maps while simplifying inference by using 4-bit quantization and the BitShift operator. Comprehensive experiments on various Transformer-based architectures and benchmarks show that our methods outperform previous works in performance while using even lower bit-width in attention maps. For instance, we reach 85.17% Top-1 accuracy with ViT-L on ImageNet and 51.4 mAP with Cascade Mask R-CNN (Swin-S) on COCO. To our knowledge, we are the first to achieve comparable accuracy degradation (~1%) on fully quantized Vision Transformers. Code is available at https://github.com/linyangzhh/FQViT.
“𝜇NCA: Texture Generation With Ultra-Compact Neural Cellular Automata”, Mordvintsev & Niklasson 2021
“𝜇NCA: Texture Generation with Ultra-Compact Neural Cellular Automata”, (2021-11-26; ; similar):
We study the problem of example-based procedural texture synthesis using highly compact models. Given a sample image, we use differentiable programming to train a generative process, parameterised by a recurrent Neural Cellular Automata (NCA) rule.
Contrary to the common belief that neural networks should be highly overparameterised, we demonstrate that our model architecture and training procedure allows for representing complex texture patterns using just a few hundred learned parameters, making their expressivity comparable to handengineered procedural texture generating programs. The smallest models from the proposed 𝜇NCA family scale down to 68 parameters. When using quantisation to one byte per parameter, proposed models can be shrunk to a size range between 588 and 68 bytes.
Implementation of a texture generator that uses these parameters to produce images is possible with just a few lines of GLSL or C code.
“Prune Once for All: Sparse Pre-Trained Language Models”, Zafrir et al 2021
“Prune Once for All: Sparse Pre-Trained Language Models”, (2021-11-10; similar):
Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models’ weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a compression ratio of 40× for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
“DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models”, Chen et al 2021
“DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models”, (2021-10-30; similar):
Gigantic pre-trained models have become central to natural language processing (NLP), serving as the starting point for fine-tuning towards a range of downstream tasks. However, two pain points persist for this paradigm: (a) as the pre-trained models grow bigger (e.g. 175B parameters for GPT-3), even the fine-tuning process can be time-consuming and computationally expensive; (b) the fine-tuned model has the same size as its starting point by default, which is neither sensible, due to its more specialized functionality, nor practical, since many fine-tuned models will be deployed in resource-constrained environments.
To address these pain points, we propose a framework for resource-efficient and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (1) parameter-efficient fine-tuning, by enforcing sparsity-aware weight updates on top of the pre-trained weights; and (2) resource-efficient inference, by encouraging a sparse weight structure towards the final fine-tuned model. We leverage sparsity in these two directions by exploiting both unstructured and structured sparse patterns in pre-trained language models via magnitude-based pruning and ℓ_1 sparse regularization.
Extensive experiments and in-depth investigations, with diverse network backbones (i.e. BERT, GPT-2, and DeBERTa) on dozens of datasets, consistently demonstrate highly impressive parameter/training/inference-efficiency, while maintaining competitive downstream transfer performance. For instance, our DSEE-BERT obtains about 35% inference FLOPs savings with <1% trainable parameters and comparable performance to conventional fine-tuning. Code is available on GitHub.
“HALP: Hardware-Aware Latency Pruning”, Shen et al 2021
“HALP: Hardware-Aware Latency Pruning”, (2021-10-20; similar):
Structural pruning can simplify network architecture and improve inference speed.
We propose Hardware-Aware Latency Pruning (HALP), which formulates structural pruning as a global resource allocation optimization problem, aiming to maximize accuracy while constraining latency under a predefined budget. For filter importance ranking, HALP leverages a latency lookup table to track latency-reduction potential and a global saliency score to gauge accuracy drops. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning as a reward-maximization problem under the given target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and the accuracy-efficiency trade-off.
We examine HALP on both classification and detection tasks, over varying networks, on the ImageNet and VOC datasets. In particular, for ResNet-50/ResNet-101 pruning on ImageNet, HALP improves network throughput by 1.60×/1.90× with +0.3%/−0.2% top-1 accuracy changes, respectively. For SSD pruning on VOC, HALP improves throughput by 1.94× with only a 0.56 mAP drop. HALP consistently outperforms prior art, sometimes by large margins.
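HALP’s latency-constrained selection can be caricatured as a knapsack problem: keep the filters with the best saliency per unit of latency until the latency budget is exhausted. The paper uses an augmented knapsack solver; the greedy ratio heuristic below is only an illustrative stand-in:

```python
def latency_prune(saliency, latency, budget):
    """Greedy knapsack: keep filters with best saliency-per-latency under budget."""
    order = sorted(range(len(saliency)),
                   key=lambda i: saliency[i] / latency[i], reverse=True)
    kept, used = [], 0.0
    for i in order:
        if used + latency[i] <= budget:  # filter still fits in the latency budget
            kept.append(i)
            used += latency[i]
    return sorted(kept), used

# 4 filters: saliency (accuracy contribution) vs. latency cost, budget of 4.0
kept, used = latency_prune([5.0, 4.0, 3.0, 1.0], [2.0, 1.0, 2.0, 1.0], 4.0)
```

In a real network, latencies come from a per-layer lookup table measured on the target hardware, and per-layer coupling makes the problem harder than this independent-items toy.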
“When in Doubt, Summon the Titans: Efficient Inference With Large Models”, Rawat et al 2021
“When in Doubt, Summon the Titans: Efficient Inference with Large Models”, (2021-10-19; similar):
Scaling neural networks to “large” sizes, with billions of parameters, has been shown to yield impressive results on many challenging problems. However, the inference cost incurred by such large models often prevents their application in most real-world settings. In this paper, we propose a two-stage framework based on distillation that realizes the modelling benefits of the large models, while largely preserving the computational benefits of inference with more lightweight models. In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of “easy” examples; for the “hard” examples, we fall back to the teacher. Such an approach allows us to efficiently employ large models in practical scenarios where easy examples are much more frequent than rare hard examples. Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation. Empirically, we demonstrate the benefits of our approach on both image classification and natural language processing benchmarks.
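The two-stage idea reduces to a confidence-thresholded cascade: answer with the student when it is confident, otherwise defer to the teacher. A toy sketch (names and the 0.9 threshold are illustrative assumptions, not the paper’s routing rule):

```python
import numpy as np

def cascade_predict(x, student, teacher, threshold=0.9):
    """Use the cheap student when confident; defer 'hard' inputs to the teacher."""
    probs = student(x)                # (n, n_classes) softmax outputs
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    hard = conf < threshold           # low confidence -> treat as "hard" example
    if hard.any():
        preds[hard] = teacher(x[hard]).argmax(axis=1)
    return preds, float(hard.mean())  # predictions + fraction deferred to teacher

# toy stand-ins: student looks up precomputed probabilities; teacher always says class 1
student_probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.99, 0.01], [0.50, 0.50]])
student = lambda x: student_probs[x]
teacher = lambda x: np.eye(2)[np.ones(len(x), dtype=int)]
preds, deferred = cascade_predict(np.arange(4), student, teacher, threshold=0.9)
```

The amortized cost is then `cost_student + deferred * cost_teacher`, which is why the scheme pays off when easy examples dominate.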
“Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, West et al 2021
“Symbolic Knowledge Distillation: from General Language Models to Commonsense Models”, (2021-10-14; ; backlinks; similar):
The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense models. In this work, we investigate an alternative, from-machine-to-corpus-to-machine: general language models author these commonsense knowledge graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge Distillation (Hinton et al 2015), our approach uses larger models to teach smaller models. A key difference is that we distill knowledge symbolically, as text, in addition to the neural model. We also distill only one aspect, the commonsense, of a general language model teacher, allowing the student to be a different type of model, a commonsense model. Altogether, we show that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense from GPT-3, a general language model. Empirical results demonstrate that, for the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity. In addition, it results in a neural commonsense model that surpasses the teacher model’s commonsense capabilities despite its 100× smaller size. We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.
“Language Modelling via Learning to Rank”, Frydenlund et al 2021
“Language Modelling via Learning to Rank”, (2021-10-13; similar):
We consider language modelling (LM) as a multi-label structured prediction task by reframing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-k ranks, we generate them using pretrained LMs: GPT-2, BERT and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using N-grams to create a non-probabilistic teacher which generates the ranks without the need of a pretrained LM.
We confirm the hypotheses that we can treat LMing as a ranking task and that we can do so without the use of a pretrained LM. We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, N-grams act as competitive teachers and achieve similar performance as using either BERT or a Born-Again model as teacher. GPT-2 always acts as the best teacher, though, and using it with a Transformer-XL student on Wiki-02, rank-based KD reduces a cross-entropy baseline from 65.27 to 55.94, against a KL-based KD of 56.70.
“8-bit Optimizers via Blockwise Quantization”, Dettmers et al 2021
“8-bit Optimizers via Blockwise Quantization”, (2021-10-06; similar):
Stateful optimizers maintain gradient statistics over time, e.g. the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent, but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop blockwise dynamic quantization. Blockwise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization. To maintain stability and performance, we combine blockwise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE fine-tuning, ImageNet classification, WMT’14 machine translation, MoCo v2 contrastive ImageNet pretraining+fine-tuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
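The blockwise part of the scheme is straightforward to sketch: split the tensor into fixed-size blocks and store one scale per block, so a single outlier only degrades precision within its own block. Below is a simplified absmax linear variant in NumPy; the paper’s actual 8-bit states use a non-linear dynamic quantization map, which this sketch omits:

```python
import numpy as np

def blockwise_quantize(x, block_size=64):
    """int8 quantization with one absmax scale per block of `block_size` values."""
    flat = np.pad(x.ravel(), (0, (-x.size) % block_size))  # pad to block multiple
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero on all-zero (padding) blocks
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales

def blockwise_dequantize(q, scales):
    return q.astype(np.float32) / 127 * scales

rng = np.random.default_rng(3)
x = rng.normal(size=(2, 64)).astype(np.float32)
q, scales = blockwise_quantize(x)                    # 128 values -> 2 blocks
x_hat = blockwise_dequantize(q, scales).reshape(x.shape)
```

Per-element error is bounded by half a quantization step of that block’s scale, which is the point: a large value in one block cannot blow up the error everywhere else.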
“On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Lai et al 2021
“On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, (2021-10-04; similar):
Are end-to-end text-to-speech (TTS) models over-parameterized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the trade-offs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several aspects of TTS pruning: amount of fine-tuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed by large-scale subjective tests and objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.
“Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, Bondarenko et al 2021
“Understanding and Overcoming the Challenges of Efficient Transformer Quantization”, (2021-09-27; similar):
Transformer-based architectures have become the de facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges—namely, high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme—per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at https://github.com/Qualcomm-AI-research/transformer-quantization.
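A toy illustration of why per-group scales matter for outlier-heavy activations and embeddings (values and group size are invented, not the paper's scheme): with a single per-tensor scale, one outlier dimension forces a coarse grid that rounds small values to zero, while per-group scales preserve them.

```python
def fake_quant(xs, bits=8):
    """Round-trip xs through a symmetric fixed-point grid with one scale."""
    absmax = max(abs(x) for x in xs) or 1.0
    levels = 2 ** (bits - 1) - 1
    return [round(x / absmax * levels) / levels * absmax for x in xs]

def fake_quant_grouped(xs, group_size, bits=8):
    # one scale per group of embedding dimensions instead of per tensor
    out = []
    for i in range(0, len(xs), group_size):
        out.extend(fake_quant(xs[i:i + group_size], bits))
    return out

emb = [0.01, -0.02, 0.03, 60.0]        # one structured outlier dimension
per_tensor = fake_quant(emb)           # small values collapse to 0.0
per_group = fake_quant_grouped(emb, group_size=2)
```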
“Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, Kudugunta et al 2021
“Beyond Distillation: Task-level Mixture-of-Experts (TaskMoE) for Efficient Inference”, (2021-09-24; ; similar):
[blog] Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving.
In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (TaskMoE) enables us to extract smaller, ready-to-deploy subnetworks from large sparse models.
On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9× when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our subnetwork task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6×.
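The contrast between the two routing granularities can be sketched as follows; the experts, router, and task table here are toy stand-ins, not the paper's learned components:

```python
experts = {0: lambda x: x + 1, 1: lambda x: x * 2, 2: lambda x: x - 1}

def route_by_token(tokens):
    # token-level routing: each token may go to a different expert, so
    # serving requires keeping ALL experts resident in memory
    return [experts[sum(map(ord, t)) % len(experts)](len(t)) for t in tokens]

TASK_TO_EXPERT = {"en-fr": 1, "en-de": 2}   # hypothetical task table

def extract_subnetwork(task):
    # task-level routing: the expert choice depends only on the task id,
    # so a single expert can be extracted as a small, ready-to-deploy
    # subnetwork with no distillation step
    return experts[TASK_TO_EXPERT[task]]

sub = extract_subnetwork("en-fr")
```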
“Block Pruning For Faster Transformers”, Lagunas et al 2021
“Block Pruning For Faster Transformers”, (2021-09-10; similar):
Pretraining has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods have proven effective at speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for finetuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4× faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.
“Dataset Distillation With Infinitely Wide Convolutional Networks”, Nguyen et al 2021
“Dataset Distillation with Infinitely Wide Convolutional Networks”, (2021-07-27; ; similar):
The effectiveness of machine learning algorithms arises from being able to extract useful features from large amounts of data. As model and dataset sizes increase, dataset distillation methods that compress large datasets into significantly smaller yet highly performant ones will become valuable in terms of training efficiency and useful feature extraction. To that end, we apply a novel distributed kernel-based meta-learning framework to achieve state-of-the-art results for dataset distillation using infinitely wide convolutional neural networks. For instance, using only 10 datapoints (0.02% of original dataset), we obtain over 64% test accuracy on the CIFAR10 image classification task, a dramatic improvement over the previous best test accuracy of 40%. Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR10, CIFAR100, and SVHN. Furthermore, we perform some preliminary analyses of our distilled datasets to shed light on how they differ from naturally occurring data.
“EvilModel: Hiding Malware Inside of Neural Network Models”, Wang et al 2021
“EvilModel: Hiding Malware Inside of Neural Network Models”, (2021-07-19; similar):
Delivering malware covertly while evading detection is critical to advanced malware campaigns. In this paper, we present a method that delivers malware covertly and evades detection by hiding it in neural network models. Neural network models are poorly explainable and have good generalization ability. By embedding malware into the neurons, malware can be delivered covertly, with minor or even no impact on the performance of the neural network. Meanwhile, since the structure of the neural network model remains unchanged, it can pass the security scans of antivirus engines. Experiments show that 36.9MB of malware can be embedded into a 178MB AlexNet model within 1% accuracy loss, and no suspicion is raised by the antivirus engines on VirusTotal, which verifies the feasibility of this method. With the widespread application of artificial intelligence, utilizing neural networks may become a trend in malware delivery. We hope this work can provide a referenceable scenario for defense against neural network-assisted attacks.
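The underlying trick can be sketched benignly: overwriting low-order mantissa bytes of float32 weights stores arbitrary data while perturbing each weight only negligibly. The byte layout below (one payload byte per weight, harmless payload) is a simplification of the paper's scheme:

```python
import struct

def embed_byte(weight, payload_byte):
    raw = bytearray(struct.pack("<f", weight))   # 4-byte little-endian float32
    raw[0] = payload_byte                        # least-significant mantissa byte
    return struct.unpack("<f", bytes(raw))[0]

def extract_byte(weight):
    return struct.pack("<f", weight)[0]

weights = [0.123, -0.456, 0.789]                 # stand-in model parameters
payload = b"hi!"                                 # harmless stand-in payload
stego = [embed_byte(w, b) for w, b in zip(weights, payload)]
recovered = bytes(extract_byte(w) for w in stego)
```

Each weight changes by at most 255 units in the last place, which for weights of typical magnitude is a perturbation far below 1e-4, consistent with the small accuracy loss reported above.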
“A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness”, Diffenderfer et al 2021
“A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness”, (2021-06-16; similar):
Two crucial requirements for a successful adoption of deep learning (DL) in the wild are: (1) robustness to distributional shifts, and (2) model compactness for achieving efficiency. Unfortunately, efforts towards simultaneously achieving Out-of-Distribution (OOD) robustness and extreme model compactness without sacrificing accuracy have mostly been unsuccessful. This raises an important question: “Is the inability to create compact, accurate and robust deep neural networks (CARDs) fundamental?” To answer this question, we perform a large-scale analysis for a range of popular model compression techniques which uncovers several intriguing patterns. Notably, in contrast to traditional pruning approaches (eg. fine-tuning and gradual magnitude pruning), we find that “lottery ticket-style” pruning approaches can surprisingly be used to create high-performing CARDs. Specifically, we are able to create extremely compact CARDs that are dramatically more robust than their significantly larger and full-precision counterparts while matching (or beating) their test accuracy, simply by pruning and/or quantizing. To better understand these differences, we perform sensitivity analysis in the Fourier domain for CARDs trained using different data augmentation methods. Motivated by our analysis, we develop a simple domain-adaptive test-time ensembling approach (CARD-Deck) that uses a gating module to dynamically select an appropriate CARD from the CARD-Deck based on their spectral similarity with test samples. By leveraging the complementary frequency biases of different compressed models, the proposed approach builds a “winning hand” of CARDs that establishes a new state-of-the-art on CIFAR10-C accuracies (ie. 96.8% clean and 92.75% robust) with dramatically better memory usage than their non-compressed counterparts. We also present some theoretical evidence supporting our empirical findings.
“Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, Menghani 2021
“Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better”, (2021-06-16; similar):
Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, resources required to train, etc. have all increased significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. We present and motivate the problem of efficiency in deep learning, followed by a thorough survey of the five core areas of model efficiency (spanning modeling techniques, infrastructure, and hardware) and the seminal work there. We also present an experiment-based guide along with code, for practitioners to optimize their model training and deployment. We believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support. Our hope is that this survey would provide the reader with the mental model and the necessary understanding of the field to apply generic efficiency techniques to immediately get significant improvements, and also equip them with ideas for further research and experimentation to achieve additional gains.
“Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Chen et al 2021
“Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, (2021-06-08; similar):
Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs “from end to end”. Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co)training can sometimes improve the ViT accuracy rather than compromising it, making sparsity a tantalizing “free lunch”. For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture), improves 0.28% top-1 accuracy, and meanwhile enjoys 49.32% FLOPs and 4.40% running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.
“Clusterability in Neural Networks”, Filan et al 2021
“Clusterability in Neural Networks”, (2021-03-04; similar):
The learned weights of a neural network have often been considered devoid of scrutable internal structure. In this paper, however, we look for structure in the form of clusterability: how well a network can be divided into groups of neurons with strong internal connectivity but weak external connectivity. We find that a trained neural network is typically more clusterable than randomly initialized networks, and often clusterable relative to random networks with the same distribution of weights. We also exhibit novel methods to promote clusterability in neural network training, and find that in multilayer perceptrons they lead to more clusterable networks with little reduction in accuracy. Understanding and controlling the clusterability of neural networks will hopefully render their inner workings more interpretable to engineers by facilitating partitioning into meaningful clusters.
“Switch Transformers: Scaling to Trillion Parameter Models With Simple and Efficient Sparsity”, Fedus et al 2021
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”, (2021-01-11; ; backlinks; similar):
In deep learning, models typically reuse the same parameters for all inputs. Mixture-of-Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model—with outrageous numbers of parameters—but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability—we address these with the Switch Transformer.
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large (Raffel et al 2019) to obtain up to 7× increases in pretraining speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pretraining up to 1-trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4× speedup over the T5-XXL model.
…Appendix E: Relation of Upstream to Downstream Model Performance
There is no guarantee that a model’s quality on a pretraining objective will translate to downstream task results. Figure 13 presents the correlation of the upstream model quality, for both dense and Switch models, on the C4 pretraining task with two downstream task measures: average SuperGLUE performance and TriviaQA score. We choose these two tasks as one probes the model’s reasoning and the other factual knowledge.
We find a consistent correlation, indicating that for both baseline and Switch models, improved pretraining leads to better downstream results. Additionally, for a fixed upstream perplexity we find that both Switch and dense models perform similarly in the small to medium model size regime. However, in the largest model regime (T5-11B/T5-XXL) our largest Switch models, as mentioned in Section 5.6, do not always translate their upstream perplexity well to downstream finetuning on the SuperGLUE task. This warrants future investigation and study to fully realize the potential of sparse models. Understanding the finetuning dynamics with expert models is very complicated and is dependent on regularization, load-balancing, and finetuning hyperparameters.
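The simplified top-1 (“switch”) routing can be sketched as follows; the router weights and experts below are toy stand-ins, not the paper's learned parameters:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token, router_weights, experts):
    # one router logit per expert, from a dot product with the token vector
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # only ONE expert runs per token, so compute stays constant no matter
    # how many experts (and parameters) the model has; scaling the output
    # by the gate probability keeps the router differentiable
    return [probs[best] * y for y in experts[best](token)]

experts = [lambda x: [v + 1 for v in x], lambda x: [v * 2 for v in x]]
router = [[1.0, 0.0], [0.0, 1.0]]        # toy router weights
out = switch_route([2.0, 0.0], router, experts)
```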
“Scaling down Deep Learning”, Greydanus 2020
“Scaling down Deep Learning”, (2020-12-01; ; backlinks; similar):
…Yet in spite of its historical importance, MNIST has three notable shortcomings. First, it does a poor job of differentiating between linear, nonlinear, and translation-invariant models. For example, logistic, MLP, and CNN benchmarks obtain 94, 99+, and 99+% accuracy on it. This makes it hard to measure the contribution of a CNN’s spatial priors or to judge the relative effectiveness of different regularization schemes. Second, it is somewhat large for a toy dataset. Each input example is a 784-dimensional vector and thus it takes a non-trivial amount of computation to perform hyperparameter searches or debug a meta-learning loop. Third, MNIST is hard to hack. The ideal toy dataset should be procedurally generated so that researchers can smoothly vary parameters such as background noise, translation, and resolution.
In order to address these shortcomings, we propose the MNIST-1D dataset. It is a minimalist, low-memory, and low-compute alternative to MNIST, designed for exploratory deep learning research where rapid iteration is a priority. Training examples are 20× smaller but they are still better at measuring the difference between (1) linear and nonlinear classifiers and (2) models with and without spatial inductive biases (eg. translation invariance). The dataset is procedurally generated but still permits analogies to real-world digit classification…Unlike MNIST, each example is a one-dimensional sequence of points. To generate an example, we begin with a digit template and then randomly pad, translate, and transform it.
Example use cases: In this section we will explore several examples of how MNIST-1D can be used to study core “science of deep learning” phenomena.
Finding lottery tickets…Unlike many follow-up experiments on the lottery ticket, this one took just two days of researcher time to produce. The curious reader can also reproduce these results in their browser in a few minutes.
Observing deep double descent…We see the MNIST-1D dataset as a good tool for exploring these properties. In fact, we were able to reproduce the double descent pattern after a few hours of researcher effort. The figure below shows our results for a fully-connected network and a convolutional model.
Gradient-based meta-learning…A model does this by having two levels of optimization: the first is a fast inner loop which corresponds to a traditional learning objective and the second is a slow outer loop which updates the “meta” properties of the learning process…Meta-learning is a promising topic but it is very difficult to scale. First of all, meta-learning algorithms consume enormous amounts of time and compute. Second of all, implementations tend to grow complex since there are twice as many hyperparameters (one set for each level of optimization) and most deep learning frameworks are not set up well for meta-learning. This places an especially high incentive on debugging and iterating meta-learning algorithms on small-scale datasets such as MNIST-1D. For example, it took just a few hours to implement and debug the gradient-based hyperparameter optimization of a learning rate shown below.
Meta-learning an activation function: Having implemented a “minimal working example” of gradient-based meta-learning, we realized that it permitted a simple and novel extension: meta-learning an activation function. With a few more hours of researcher time, we were able to parameterize our classifier’s activation function with a second neural network and then learn the weights using meta-gradients.
Measuring the spatial priors of deep networks: …Principal among these priors is the translation invariance of convolution. A primary motivation for this dataset was to construct a toy problem that could effectively quantify a model’s spatial priors. The second figure in this post illustrates that this is indeed possible with MNIST-1D.
Benchmarking pooling methods. Our final case study begins with a specific question: What is the relationship between pooling and sample efficiency? We had not seen evidence that pooling makes models more or less sample efficient, but this seemed an important relationship to understand. With this in mind, we trained models with different pooling methods and training set sizes and found that, while pooling tended to be effective in lowdata regimes, it did not make much of a difference in highdata regimes.
…this post argues in favor of small-scale machine learning research. Neural networks do not have problems with scaling or performance—but they do have problems with interpretability, reproducibility, and iteration speed. We see carefully-controlled, small-scale experiments as a great way to address these problems…For example, several of the findings reported in this post are at the point where they should be investigated at scale. We would like to show that large-scale lottery tickets also learn spatial inductive biases, and show evidence that they develop local connectivity. We would also like to try meta-learning an activation function on a larger model in the hopes of finding an activation that will outperform ReLU and Swish in generality. We should emphasize that we are only ready to scale these results now that we have isolated and understood them in a controlled setting. We believe that scaling a system is only a good idea once the relevant causal mechanisms have been isolated and understood. [cf scaling law papers] …Our work also bears philosophical similarities to the “Synthetic Petri Dish” by Rawal et al 2020.
Closing Thoughts: There is a counterintuitive possibility that in order to explore the limits of how large we can scale neural networks, we may need to explore the limits of how small we can scale them first. Scaling models and datasets downward in a way that preserves the nuances of their behaviors at scale will allow researchers to iterate quickly on fundamental and creative ideas. This fast iteration cycle is the best way of obtaining insights about how to incorporate progressively more complex inductive biases into our models. We can then transfer these inductive biases across spatial scales in order to dramatically improve the sample efficiency and generalization properties of large-scale models. We see the humble MNIST-1D dataset as a first step in that direction.
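The generation recipe quoted above (“begin with a digit template and then randomly pad, translate, and transform it”) can be sketched in a few lines; the digit templates, sequence length, and noise model here are invented for illustration, not the actual MNIST-1D parameters:

```python
import random

TEMPLATES = {0: [0, 1, 2, 1, 0], 1: [2, 2, 2, 2, 2]}   # toy 1D "digits"

def make_example(label, length=20, noise=0.1, rng=random):
    template = TEMPLATES[label]
    shift = rng.randrange(length - len(template))       # random translation
    seq = [0.0] * length                                # zero padding
    for i, v in enumerate(template):
        seq[shift + i] = float(v)                       # place the template
    return [v + rng.gauss(0, noise) for v in seq], label  # noisy transform

rng = random.Random(0)                                  # seeded for reproducibility
x, y = make_example(0, rng=rng)
```

Because every knob (length, noise, translation range) is a parameter of the generator, the dataset's difficulty can be varied smoothly, which is exactly the "hackability" the post asks of a toy dataset.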
“Extreme Model Compression for Ondevice Natural Language Understanding”, Sathyendra et al 2020
“Extreme Model Compression for On-device Natural Language Understanding”, (2020-11-30; similar):
In this paper, we propose and experiment with techniques for extreme compression of neural natural language understanding (NLU) models, making them suitable for execution on resource-constrained devices. We propose a task-aware, end-to-end compression approach that performs word-embedding compression jointly with NLU task learning. We show our results on a large-scale, commercial NLU system trained on a varied set of intents with huge vocabulary sizes. Our approach outperforms a range of baselines and achieves a compression rate of 97.4% with less than 3.7% degradation in predictive performance. Our analysis indicates that the signal from the downstream task is important for effective compression with minimal degradation in performance.
“A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
“A Primer in BERTology: What we know about how BERT works”, (2020-11-09; similar):
Transformerbased models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
…Given the above evidence of overparameterization, it does not come as a surprise that BERT can be efficiently compressed with minimal accuracy loss, which would be highly desirable for real-world applications. Such efforts to date are summarized in Table 1. The main approaches are knowledge distillation, quantization, and pruning…If the ultimate goal of training BERT is compression, Li et al (2020) recommend training larger models and compressing them heavily rather than compressing smaller models lightly.
“Dataset Meta-Learning from Kernel Ridge-Regression”, Nguyen et al 2020
“Dataset Meta-Learning from Kernel Ridge-Regression”, (2020-10-30; ; similar):
One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of ε-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by the recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset selection methods while obtaining state-of-the-art results for MNIST and CIFAR10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state-of-the-art results for neural network dataset distillation with potential applications to privacy-preservation.
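The KRR backbone of KIP in miniature: given a tiny “distilled” support set, predictions are kernel-weighted combinations of its labels. The RBF kernel, ridge value, and two-point support set below are illustrative, and the meta-optimization of the support set itself (the actual KIP step) is omitted:

```python
import math

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def krr_fit(xs, ys, ridge=1e-3):
    # solve (K + ridge*I) alpha = y in closed form for a 2-point support set
    k11 = rbf(xs[0], xs[0]) + ridge
    k22 = rbf(xs[1], xs[1]) + ridge
    k12 = rbf(xs[0], xs[1])
    det = k11 * k22 - k12 * k12
    return [(k22 * ys[0] - k12 * ys[1]) / det,
            (k11 * ys[1] - k12 * ys[0]) / det]

def krr_predict(x, xs, alpha):
    # prediction = kernel-weighted combination of support coefficients
    return sum(a * rbf(x, s) for a, s in zip(alpha, xs))

support_x, support_y = [0.0, 1.0], [0.0, 1.0]   # a 2-point "distilled" set
alpha = krr_fit(support_x, support_y)
```

KIP's insight is that this whole fit-and-predict pipeline is differentiable with respect to `support_x` and `support_y`, so the support set can be optimized by gradient descent to minimize loss on the full training data.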
“Optimal Subarchitecture Extraction For BERT”, Wynter & Perry 2020
“Optimal Subarchitecture Extraction For BERT”, (2020-10-20; similar):
We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al 2018 by applying recent breakthroughs in algorithms for neural architecture search.
This optimal subset, which we refer to as “Bort”, is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large (Liu et al 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same hardware.
It is also 7.9× faster on a CPU, as well as being better performing than other compressed variants of the architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%, absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.
“Logarithmic Pruning Is All You Need”, Orseau et al 2020
“Logarithmic Pruning is All You Need”, (2020-06-22; ; backlinks; similar):
The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork.
“On the Predictability of Pruning Across Scales”, Rosenfeld et al 2020
“On the Predictability of Pruning Across Scales”, (2020-06-18; ; backlinks; similar):
We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different pruned densities are interchangeable. We demonstrate the accuracy of this approximation over orders of magnitude in depth, width, dataset size, and density. We show that the functional form holds (generalizes) for large scale data (eg. ImageNet) and architectures (eg. ResNets). As neural networks become ever larger and costlier to train, our findings suggest a framework for reasoning conceptually and analytically about a standard method for unstructured pruning.
“SimCLRv2: Big Self-Supervised Models Are Strong Semi-Supervised Learners”, Chen et al 2020
“SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners”, (2020-06-17; ; backlinks; similar):
One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised finetuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to common approaches to semi-supervised learning for computer vision, we show that it is surprisingly effective for semi-supervised learning on ImageNet.
A key ingredient of our approach is the use of big (deep and wide) networks during pretraining and finetuning. We find that the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network. After finetuning, the big network can be further improved and distilled into a much smaller one with little loss in classification accuracy by using the unlabeled examples for a second time, but in a task-specific way. The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2, supervised finetuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge.
This procedure achieves 73.9% ImageNet top-1 accuracy with just 1% of the labels (≤13 labeled images per class) using ResNet-50, a 10× improvement in label efficiency over the previous state-of-the-art. With 10% of labels, ResNet-50 trained with our method achieves 77.5% top-1 accuracy, outperforming standard supervised training with all of the labels.
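The third step (distillation on unlabeled data) reduces to minimizing cross-entropy against the teacher's temperature-softened predictions; a sketch with placeholder logits and an arbitrary temperature:

```python
import math

def softmax_t(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    # cross-entropy of the student under the teacher's "soft labels";
    # no ground-truth label is needed, so unlabeled images suffice
    p_t = softmax_t(teacher_logits, temperature)
    p_s = softmax_t(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

aligned = distill_loss([4.0, 1.0, 0.0], [4.0, 1.0, 0.0])      # student agrees
misaligned = distill_loss([4.0, 1.0, 0.0], [0.0, 1.0, 4.0])   # student disagrees
```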
“Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow”, Tanaka et al 2020
“Pruning neural networks without any data by iteratively conserving synaptic flow”, (2020-06-09; similar):
Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy both during training and at test time. Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory-driven algorithm design. We first mathematically formulate and experimentally verify a conservation law that explains why existing gradient-based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse can be entirely avoided, motivating a novel pruning algorithm Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the network at initialization subject to a sparsity constraint. Notably, this algorithm makes no reference to the training data and consistently competes with or outperforms existing state-of-the-art pruning algorithms at initialization over a range of models (VGG and ResNet), datasets (CIFAR10/100 and Tiny ImageNet), and sparsity constraints (up to 99.99 percent). Thus our data-agnostic pruning algorithm challenges the existing paradigm that, at initialization, data must be used to quantify which synapses are important.
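For a toy 2-layer linear network, the SynFlow score (gradient × weight of an all-ones forward pass through the absolute-valued network) can be computed in closed form; the weights below are invented, and the full algorithm's iterative prune-and-rescore loop is omitted:

```python
W1 = [[0.5, -2.0], [0.1, 0.3]]   # layer 1: hidden x input
W2 = [[1.0, -0.2]]               # layer 2: output x hidden

def synflow_scores(W1, W2):
    # "flow" into hidden unit j from the all-ones input through |W1|
    incoming = [sum(abs(w) for w in row) for row in W1]
    # "flow" out of hidden unit j to all outputs through |W2|
    outgoing = [sum(abs(row[j]) for row in W2) for j in range(len(W1))]
    # score = weight magnitude x total path flow through that weight;
    # note no training data appears anywhere in this computation
    s1 = [[abs(w) * outgoing[j] for w in row] for j, row in enumerate(W1)]
    s2 = [[abs(w) * incoming[j] for j, w in enumerate(row)] for row in W2]
    return s1, s2

s1, s2 = synflow_scores(W1, W2)
```

Pruning the globally lowest-scoring weights preserves total flow per layer, which is how SynFlow avoids the layer-collapse failure mode described above.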
“Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Sanh et al 2020
“Movement Pruning: Adaptive Sparsity by Fine-Tuning”, (2020-05-15; similar):
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth-order and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows substantial improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
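The first-order idea can be sketched in a few lines: each weight accumulates an importance score of −∂L/∂w · w, so weights moving *away* from zero during fine-tuning gain importance regardless of their current magnitude. A toy NumPy illustration on linear regression (the hard top-k variant; the learning rate and score update are our simplifications):

```python
import numpy as np

def movement_prune_step(W, grad, scores, lr=0.05):
    """One step of (hard) movement pruning: scores accumulate
    -dL/dW * W, so weights moving away from zero gain importance and
    weights drifting toward zero lose it; then ordinary SGD updates W."""
    scores -= grad * W
    W -= lr * grad
    return W, scores

def topk_mask(scores, keep_frac):
    """Keep the weights with the highest accumulated movement."""
    k = max(1, int(round(keep_frac * scores.size)))
    thresh = np.sort(scores.ravel())[::-1][k - 1]
    return (scores >= thresh).astype(float)
```

On a sparse regression problem the mask recovers the truly active features, whereas magnitude pruning of the *pretrained* weights would not see which weights the fine-tuning objective is pushing outward.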
“Bayesian Bits: Unifying Quantization and Pruning”, Baalen et al 2020
“Bayesian Bits: Unifying Quantization and Pruning”, (2020-05-14; similar):
We introduce Bayesian Bits, a practical method for joint mixed-precision quantization and pruning through gradient-based optimization. Bayesian Bits employs a novel decomposition of the quantization operation, which sequentially considers doubling the bit width. At each new bit width, the residual error between the full-precision value and the previously rounded value is quantized. We then decide whether or not to add this quantized residual error for a higher effective bit width and lower quantization noise. By starting with a power-of-two bit width, this decomposition will always produce hardware-friendly configurations, and through an additional 0-bit option, serves as a unified view of pruning and quantization. Bayesian Bits then introduces learnable stochastic gates, which collectively control the bit width of the given tensor. As a result, we can obtain low-bit solutions by performing approximate inference over the gates, with prior distributions that encourage most of them to be switched off. We experimentally validate our proposed method on several benchmark datasets and show that we can learn pruned, mixed-precision networks that provide a better tradeoff between accuracy and efficiency than their static bit-width equivalents.
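The doubling decomposition can be sketched with hard 0/1 gates standing in for the paper's learned stochastic gates (a minimal sketch assuming inputs in [−1, 1]; the grid-step convention is our simplification):

```python
import numpy as np

def bb_quantize(x, gates):
    """Doubling-bit-width residual decomposition: start at 2 bits; each
    open gate quantizes the remaining residual on a grid twice as fine
    (4, 8, 16 bits) and adds it back. A closed gate truncates the
    expansion, fixing the effective bit width."""
    def step(bits):   # grid step of a signed b-bit quantizer on [-1, 1]
        return 1.0 / (2 ** (bits - 1) - 1)
    x = np.clip(x, -1.0, 1.0)
    bits = 2
    out = step(bits) * np.round(x / step(bits))
    for g in gates:                # gates for the 4-, 8-, 16-bit refinements
        if not g:
            break
        bits *= 2
        out = out + step(bits) * np.round((x - out) / step(bits))
    return out
```

Replacing the hard gates with learnable stochastic ones (plus a prior pushing them closed, and a 0-bit gate that prunes the tensor outright) is what turns this decomposition into the joint mixed-precision/pruning search of the paper.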
“Training With Quantization Noise for Extreme Model Compression”, Fan et al 2020
“Training with Quantization Noise for Extreme Model Compression”, (2020-04-15; similar):
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization-Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods where the approximations introduced by STE are severe, such as Product Quantization. Our proposal is to only quantize a different random subset of weights during each forward, allowing for unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while maintaining the performance of the original model. As a result we establish new state-of-the-art compromises between accuracy and model size both in natural language processing and image classification. For example, applying our method to state-of-the-art Transformer and ConvNet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14MB and 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3MB.
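The core trick—quantize only a random subset of weights on each forward pass—can be sketched in a few lines (simple uniform rounding stands in here for the paper's int8 or Product Quantization choices; the function name and `levels` parameter are ours):

```python
import numpy as np

def quant_noise(W, p, levels=4, rng=np.random.default_rng(0)):
    """Each forward pass, quantize a random fraction p of the weights;
    the remaining (1 - p) fraction passes through untouched, so its
    gradients are exact rather than Straight-Through approximations."""
    scale = np.abs(W).max() / (levels - 1) + 1e-12
    Wq = scale * np.round(W / scale)
    mask = rng.random(W.shape) < p        # resampled on every call
    return np.where(mask, Wq, W)
```

With p = 1 this degenerates to ordinary QAT; intermediate p is what keeps unbiased gradient signal flowing even under very aggressive quantizers.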
“On the Effect of Dropping Layers of Pre-trained Transformer Models”, Sajjad et al 2020
“On the Effect of Dropping Layers of Pre-trained Transformer Models”, (2020-04-08; similar):
Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by the recent work on pruning and distilling pre-trained models, we explore strategies to drop layers in pre-trained models, and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa and XLNet models up to 40%, while maintaining up to 98% of their original performance. Additionally we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations such as: (i) the lower layers are most critical to maintain downstream task performance, (ii) some tasks such as paraphrase detection and sentence similarity are more robust to the dropping of layers, and (iii) models trained using a different objective function exhibit different learning patterns with respect to layer dropping.
“Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers”, Li et al 2020
“Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers”, (2020-02-26; backlinks; similar):
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations.
This leads to an apparent tradeoff between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
“DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, Sanh et al 2019
“DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, (2019-10-02; backlinks; similar):
As Transfer Learning from large-scale pretrained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pretrain a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pretrain and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
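The triple loss can be sketched directly from the abstract: a distillation cross-entropy against the teacher's temperature-softened distribution, the ordinary masked-LM cross-entropy, and a cosine loss aligning hidden states. The loss weights and temperature below are illustrative placeholders, not the paper's tuned hyperparameters:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distilbert_loss(student_logits, teacher_logits, labels,
                    student_hidden, teacher_hidden,
                    T=2.0, weights=(5.0, 2.0, 1.0)):
    """Triple loss: soft-target distillation CE + masked-LM CE + cosine
    loss between student and teacher hidden states."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    l_distill = -(p_teacher * log_p_student).sum(axis=-1).mean()
    log_p = np.log(softmax(student_logits) + 1e-12)
    l_mlm = -log_p[np.arange(len(labels)), labels].mean()
    cos = (student_hidden * teacher_hidden).sum(axis=-1) / (
        np.linalg.norm(student_hidden, axis=-1)
        * np.linalg.norm(teacher_hidden, axis=-1) + 1e-12)
    l_cos = (1.0 - cos).mean()
    a, b, c = weights
    return a * l_distill + b * l_mlm + c * l_cos
```

The cosine term is what transfers the teacher's internal geometry, not just its output distribution, to the smaller student.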
“QuaRL: Quantized Reinforcement Learning”, Lam et al 2019
“QuaRL: Quantized Reinforcement Learning”, (2019-10-02; similar):
Deep reinforcement learning has achieved important milestones; however, the computational demands of reinforcement learning training and inference remain substantial. Quantization is an effective method to reduce the computational overheads of neural networks, though in the context of reinforcement learning, it is unknown whether quantization’s computational benefits outweigh the accuracy costs introduced by the corresponding quantization error.
To quantify this tradeoff we perform a broad study applying quantization to reinforcement learning. We apply standard quantization techniques such as post-training quantization (PTQ) and quantization-aware training (QAT) to a comprehensive set of reinforcement learning tasks (Atari, Gym), algorithms (A2C, DDPG, DQN, D4PG, PPO), and models (MLPs, CNNs) and show that policies may be quantized to 8 bits without degrading reward, enabling substantial inference speedups on resource-constrained edge devices.
Motivated by the effectiveness of standard quantization techniques on reinforcement learning policies, we introduce a novel quantization algorithm, ActorQ, for quantized actor-learner distributed reinforcement learning training. By leveraging full-precision optimization on the learner and quantized execution on the actors, ActorQ enables 8-bit inference while maintaining convergence. We develop a system for quantized reinforcement learning training around ActorQ and demonstrate end-to-end speedups of >1.5×–2.5× over full-precision training on a range of tasks (DeepMind Control Suite).
Finally, we break down the various runtime costs of distributed reinforcement learning training (such as communication time, inference time, model load time, etc.) and evaluate the effects of quantization on these system attributes.
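The PTQ baseline above—mapping trained policy weights to 8 bits with no retraining—can be sketched as symmetric per-tensor quantization (a minimal sketch; function names are ours, and a real deployment would quantize activations as well):

```python
import numpy as np

def ptq_int8(w):
    """Symmetric per-tensor post-training quantization to int8: one
    float scale per tensor, no calibration data or retraining needed."""
    scale = np.abs(w).max() / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor for (or instead of) int8 kernels."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half the scale, which is why 8-bit policies can match full-precision reward on many control tasks.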
“Learning to Seek: Autonomous Source Seeking With Deep Reinforcement Learning Onboard a Nano Drone Microcontroller”, Duisterhof et al 2019
“Learning to Seek: Autonomous Source Seeking with Deep Reinforcement Learning Onboard a Nano Drone Microcontroller”, (2019-09-25; similar):
We present fully autonomous source seeking onboard a highly constrained nano quadcopter, by contributing application-specific system and observation feature design to enable inference of a deep-RL policy onboard a nano quadcopter. Our deep-RL algorithm finds a high-performance solution to a challenging problem, even in the presence of high noise levels, and generalizes across real and simulation environments with different obstacle configurations. We verify our approach with simulation and in-field testing on a Bitcraze CrazyFlie using only the cheap and ubiquitous Cortex-M4 microcontroller unit. The results show that by end-to-end application-specific system design, our contribution consumes almost three times less additional power, as compared to a competing learning-based navigation approach onboard a nano quadcopter. Thanks to our observation space, which we carefully design within the resource constraints, our solution achieves a 94% success rate in cluttered and randomized test environments, as compared to the previously achieved 80%. We also compare our strategy to a simple finite state machine (FSM), geared towards efficient exploration, and demonstrate that our policy is more robust and resilient at obstacle avoidance as well as up to 70% more efficient in source seeking. To this end, we contribute a cheap and lightweight end-to-end tiny robot learning (tinyRL) solution, running onboard a nano quadcopter, that proves to be robust and efficient in a challenging task using limited sensory input.
“TinyBERT: Distilling BERT for Natural Language Understanding”, Jiao et al 2019
“TinyBERT: Distilling BERT for Natural Language Understanding”, (2019-09-23; similar):
Language model pretraining, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pretrained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.
TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5× smaller and 9.4× faster on inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of the parameters and about 31% of the inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.
“Accelerating Large-Scale Inference With Anisotropic Vector Quantization”, Guo et al 2019
“Accelerating Large-Scale Inference with Anisotropic Vector Quantization”, (2019-08-27; backlinks; similar):
Quantization-based techniques are the current state-of-the-art for scaling maximum inner product search to massive databases. Traditional approaches to quantization aim to minimize the reconstruction error of the database points.
Based on the observation that for a given query, the database points that have the largest inner products are more relevant, we develop a family of anisotropic quantization loss functions. Under natural statistical assumptions, we show that quantization with these loss functions leads to a new variant of vector quantization that more greatly penalizes the parallel component of a datapoint’s residual relative to its orthogonal component.
The proposed approach achieves state-of-the-art results on the public benchmarks available at ann-benchmarks.com.
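The anisotropic loss described above can be written down directly: split the residual r = x − x̂ into components parallel and orthogonal to the datapoint x, and penalize the parallel part η times more (η > 1; the paper derives its value from statistical assumptions about the query distribution, but here it is a free parameter):

```python
import numpy as np

def anisotropic_loss(x, x_hat, eta):
    """Score-aware quantization loss: the residual component parallel
    to x perturbs inner products with queries aligned to x (the ones
    that matter for top-k retrieval), so it is weighted by eta."""
    r = x - x_hat
    u = x / (np.linalg.norm(x) + 1e-12)   # unit vector along the datapoint
    r_par = (r @ u) * u
    r_orth = r - r_par
    return eta * (r_par @ r_par) + (r_orth @ r_orth)
```

With η = 1 this reduces to the traditional squared reconstruction error; η > 1 is what biases codebook learning toward preserving large inner products.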
“And the Bit Goes Down: Revisiting the Quantization of Neural Networks”, Stock et al 2019
“And the Bit Goes Down: Revisiting the Quantization of Neural Networks”, (2019-07-12; similar):
In this paper, we address the problem of reducing the memory footprint of convolutional network architectures. We introduce a vector quantization method that aims at preserving the quality of the reconstruction of the network outputs rather than its weights. The principle of our approach is that it minimizes the reconstruction error of the network outputs for in-domain inputs. Our method only requires a set of unlabeled data at quantization time and allows for efficient inference on CPU by using byte-aligned codebooks to store the compressed weights. We validate our approach by quantizing a high-performing ResNet-50 model to a memory size of 5MB (20× compression factor) while preserving a top-1 accuracy of 76.1% on ImageNet object classification and by compressing a Mask R-CNN with a 26× factor.
“Sparse Networks from Scratch: Faster Training without Losing Performance”, Dettmers & Zettlemoyer 2019
“Sparse Networks from Scratch: Faster Training without Losing Performance”, (2019-07-10; similar):
We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving dense performance levels. We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently. Sparse momentum redistributes pruned weights across layers according to the mean momentum magnitude of each layer. Within a layer, sparse momentum grows weights according to the momentum magnitude of zero-valued weights. We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet, decreasing the mean error by a relative 8%, 15%, and 6% compared to other sparse algorithms. Furthermore, we show that sparse momentum reliably reproduces dense performance levels while providing up to 5.61× faster training. In our analysis, ablations show that the benefits of momentum redistribution and growth increase with the depth and size of the network. Additionally, we find that sparse momentum is insensitive to the choice of its hyperparameters, suggesting that sparse momentum is robust and easy to use.
“Weight Agnostic Neural Networks”, Gaier & Ha 2019
“Weight Agnostic Neural Networks”, (2019-06-11; backlinks; similar):
Not all neural network architectures are created equal; some perform much better than others for certain tasks. But how important are the weight parameters of a neural network compared to its architecture? In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the expected performance. We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On a supervised learning domain, we find network architectures that achieve much higher than chance accuracy on MNIST using random weights. Interactive version of this paper at https://weightagnostic.github.io/
“Playing the Lottery With Rewards and Multiple Languages: Lottery Tickets in RL and NLP”, Yu et al 2019
“Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP”, (2019-06-06; similar):
The lottery ticket hypothesis proposes that overparameterization of deep neural networks (DNNs) aids training by increasing the probability of a “lucky” subnetwork initialization being present rather than by helping the optimization process (Frankle & Carbin, 2019). Intriguingly, this phenomenon suggests that initialization strategies for DNNs can be improved substantially, but the lottery ticket hypothesis has only previously been tested in the context of supervised learning for natural image tasks. Here, we evaluate whether “winning ticket” initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, we examined both recurrent LSTM models and large-scale Transformer models (Vaswani et al 2017). For RL, we analyzed a number of discrete-action space tasks, including both classic control and pixel control. Consistent with work in supervised image classification, we confirm that winning ticket initializations generally outperform parameter-matched random initializations, even at extreme pruning rates for both NLP and RL. Notably, we are able to find winning ticket initializations for Transformers which enable models one-third the size to achieve nearly equivalent performance. Together, these results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in DNNs.
“StyleNAS: An Empirical Study of Neural Architecture Search to Uncover Surprisingly Fast End-to-End Universal Style Transfer Networks”, An et al 2019
“StyleNAS: An Empirical Study of Neural Architecture Search to Uncover Surprisingly Fast End-to-End Universal Style Transfer Networks”, (2019-06-06; backlinks; similar):
Neural Architecture Search (NAS) has been widely studied for designing discriminative deep learning models such as image classification, object detection, and semantic segmentation. As a large number of priors have been obtained through the manual design of architectures in these fields, NAS is usually considered a supplementary approach. In this paper, we significantly expand the application areas of NAS by performing an empirical study of NAS to search generative models, or specifically, autoencoder-based universal style transfer, which lacks systematic exploration, if any, from the architecture search aspect. In our work, we first designed a search space where common operators for image style transfer such as VGG-based encoders, whitening and coloring transforms (WCT), convolution kernels, instance normalization operators, and skip connections were searched in a combinatorial approach. With a simple yet effective parallel evolutionary NAS algorithm with multiple objectives, we derived the first group of end-to-end deep networks for universal photorealistic style transfer. Compared to random search, a NAS method that has recently gained popularity, we demonstrate that a carefully designed search strategy leads to much better architecture design. Finally, compared to existing universal style transfer networks for photorealistic rendering, such as PhotoWCT, which stacks multiple well-trained autoencoders and WCT transforms in a non-end-to-end manner, the architectures designed by StyleNAS produce better style-transferred images with details preserved, using a tiny number of operators/parameters, and enjoying around 500× inference-time speedup.
“SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers”, Fedorov et al 2019
“SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers”, (2019-05-28; similar):
The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these everyday objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe finite memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 4.35× smaller than previous approaches, while meeting the strict MCU working memory constraint.
“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, Tan & Le 2019
“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, (2019-05-28; backlinks; similar):
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.
To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4× smaller and 6.1× faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
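The compound coefficient reduces the whole scaling family to one knob φ: depth scales as α^φ, width as β^φ, and input resolution as γ^φ, with the base coefficients grid-searched so that α·β²·γ² ≈ 2 (FLOPs grow linearly with depth but quadratically with width and resolution), making each increment of φ roughly a doubling of FLOPs. A sketch using the coefficients reported for EfficientNet:

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return the (depth, width, resolution) multipliers for compound
    coefficient phi. Defaults are the grid-searched EfficientNet-B0
    coefficients, chosen under the constraint alpha * beta^2 * gamma^2 ~ 2."""
    return alpha ** phi, beta ** phi, gamma ** phi
```

For example, `compound_scaling(7)` gives the depth/width/resolution multipliers in the regime of the B7 model quoted above.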
“Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Voita et al 2019
“Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, (2019-05-23; similar):
Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
“Stabilizing the Lottery Ticket Hypothesis”, Frankle et al 2019
“Stabilizing the Lottery Ticket Hypothesis”, (2019-03-05; similar):
Pruning is a well-established technique for removing unnecessary structure from neural networks after training to improve the performance of inference. Several recent results have explored the possibility of pruning at initialization time to provide similar benefits during training. In particular, the “lottery ticket hypothesis” conjectures that typical neural networks contain small subnetworks that can train to similar accuracy in a commensurate number of steps. The evidence for this claim is that a procedure based on iterative magnitude pruning (IMP) reliably finds such subnetworks retroactively on small vision tasks. However, IMP fails on deeper networks, and proposed methods to prune before training or train pruned networks encounter similar scaling limitations. In this paper, we argue that these efforts have struggled on deeper networks because they have focused on pruning precisely at initialization. We modify IMP to search for subnetworks that could have been obtained by pruning early in training (0.1% to 7% through) rather than at iteration 0. With this change, it finds small subnetworks of deeper networks (e.g. 80% sparsity on ResNet-50) that can complete the training process to match the accuracy of the original network on more challenging tasks (e.g. ImageNet). In situations where IMP fails at iteration 0, the accuracy benefits of delaying pruning accrue rapidly over the earliest iterations of training. To explain these behaviors, we study subnetwork “stability”, finding that—as accuracy improves in this fashion—IMP subnetworks train to parameters closer to those of the full network and do so with improved consistency in the face of gradient noise. These results offer new insights into the opportunity to prune large-scale networks early in training and the behaviors underlying the lottery ticket hypothesis.
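The modified IMP loop—train, prune the smallest-magnitude survivors, then rewind survivors to their values at early step k rather than iteration 0—can be sketched as follows (a minimal sketch; the `train` callback is a hypothetical stand-in for a real training loop, and the 20%-per-round rate is the conventional IMP choice):

```python
import numpy as np

def imp_with_rewind(weights_at_k, train, rounds=4, prune_frac=0.2):
    """Iterative magnitude pruning with rewinding: each round trains
    the currently masked network, prunes the prune_frac of surviving
    weights with smallest trained magnitude, and rewinds survivors to
    their values at early-training step k. `train` maps
    (weights, masks) -> trained weights."""
    masks = [np.ones_like(w) for w in weights_at_k]
    for _ in range(rounds):
        trained = train([w * m for w, m in zip(weights_at_k, masks)], masks)
        survivors = np.concatenate(
            [np.abs(w[m > 0]) for w, m in zip(trained, masks)])
        thresh = np.quantile(survivors, prune_frac)
        masks = [m * (np.abs(w) >= thresh) for w, m in zip(trained, masks)]
    # the "ticket": step-k weights under the final mask
    return [w * m for w, m in zip(weights_at_k, masks)], masks
```

Rewinding to step k instead of iteration 0 is the paper's key change; everything else is standard IMP.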
“The State of Sparsity in Deep Neural Networks”, Gale et al 2019
“The State of Sparsity in Deep Neural Networks”, (2019-02-25; similar):
We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al 2017; Louizos et al 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Additionally, we replicate the experiments performed by (Frankle & Carbin, 2018) and (Liu et al 2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top-performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.
“Superposition of Many Models into One”, Cheung et al 2019
“Superposition of many models into one”, (2019-02-14; similar):
We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.
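The storage/retrieval mechanism can be illustrated with the simplest (binary) choice of context: bind each model's parameter vector to its own random ±1 key, sum the bound copies into one store, and unbind with the same key to retrieve. This sketch omits the training-in-superposition aspect of the paper and uses random sign keys as a stand-in for its context operators:

```python
import numpy as np

def superpose(models, rng):
    """Bind each weight vector to a random ±1 context key and sum the
    bound copies into a single parameter vector."""
    keys = [rng.choice([-1.0, 1.0], size=models[0].shape) for _ in models]
    store = sum(k * w for k, w in zip(keys, models))
    return store, keys

def retrieve(store, keys, i):
    """Unbinding with key i recovers model i plus zero-mean cross-talk
    from the other stored models."""
    return keys[i] * store
```

In high dimensions the cross-talk terms are nearly orthogonal to every stored model, so each retrieved vector correlates strongly with its own model and negligibly with the others.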
“Compressing GANs Using Knowledge Distillation”, Aguinaldo et al 2019
“Compressing GANs using Knowledge Distillation”, (2019-02-01; similar):
Generative Adversarial Networks (GANs) have been used in several machine learning tasks such as domain transfer, super-resolution, and synthetic data generation. State-of-the-art GANs often use tens of millions of parameters, making them expensive to deploy for applications in low SWAP (size, weight, and power) hardware, such as mobile devices, and for applications with real-time capabilities. There has been no work found to reduce the number of parameters used in GANs. Therefore, we propose a method to compress GANs using knowledge distillation techniques, in which a smaller “student” GAN learns to mimic a larger “teacher” GAN. We show that the distillation methods used on MNIST, CIFAR-10, and CelebA datasets can compress teacher GANs at ratios of 1669:1, 58:1, and 87:1, respectively, while retaining the quality of the generated image. From our experiments, we observe a qualitative limit for GAN compression. Moreover, we observe that, with a fixed parameter budget, compressed GANs outperform GANs trained using standard training methods. We conjecture that this is partially owing to the optimization landscape of overparameterized GANs which allows efficient training using alternating gradient descent. Thus, training an overparameterized GAN followed by our proposed compression scheme provides a high-quality generative model with a small number of parameters.
“Rethinking Floating Point for Deep Learning”, Johnson 2018
“Rethinking floating point for deep learning”, (2018-11-01; similar):
Reducing hardware overhead of neural networks for faster or lower-power inference and training is an active area of research. Uniform quantization using integer multiply-add has been thoroughly investigated, which requires learning many quantization parameters, fine-tuning training or other prerequisites. Little effort is made to improve floating point relative to this baseline; it remains energy-inefficient, and word size reduction yields drastic loss in needed dynamic range. We improve floating point to be more energy-efficient than equivalent bit-width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson’s posit format. With no network retraining, and drop-in replacement of all math and float32 parameters via round-to-nearest-even only, this open-sourced 8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32 ResNet-50 CNN model on ImageNet. Unlike int8 quantization, it is still a general-purpose floating point arithmetic, interpretable out-of-the-box. Our 8/38-bit log float multiply-add is synthesized and power-profiled at 28 nm at 0.96× the power and 1.12× the area of 8/32-bit integer multiply-add. In 16 bits, our log float multiply-add is 0.59× the power and 0.68× the area of IEEE 754 float16 fused multiply-add, maintaining the same significand precision and dynamic range, proving useful for training ASICs as well.
“A Closer Look at Structured Pruning for Neural Network Compression”, Crowley et al 2018
“A Closer Look at Structured Pruning for Neural Network Compression”, (2018-10-10; similar):
Structured pruning is a popular method for compressing a neural network: given a large trained network, one alternates between removing channel connections and fine-tuning, reducing the overall width of the network. However, the efficacy of structured pruning has largely evaded scrutiny. In this paper, we examine ResNets and DenseNets obtained through structured pruning-and-tuning and make two interesting observations: (i) reduced networks—smaller versions of the original network trained from scratch—consistently outperform pruned networks; (ii) if one takes the architecture of a pruned network and then trains it from scratch, it is statistically-significantly more competitive. Furthermore, these architectures are easy to approximate: we can prune once and obtain a family of new, scalable network architectures that can simply be trained from scratch. Finally, we compare the inference speed of reduced and pruned networks on hardware, and show that reduced networks are significantly faster. Code is available at https://github.com/BayesWatch/pytorch-prunes.
“Network Recasting: A Universal Method for Network Architecture Transformation”, Yu et al 2018
“Network Recasting: A Universal Method for Network Architecture Transformation”, (2018-09-14; similar):
This paper proposes network recasting as a general method for network architecture transformation. The primary goal of this method is to accelerate the inference process through the transformation, but there can be many other practical applications. The method is based on block-wise recasting; it recasts each source block in a pre-trained teacher network to a target block in a student network. For the recasting, a target block is trained such that its output activation approximates that of the source block. Such a block-by-block recasting in a sequential manner transforms the network architecture while preserving the accuracy. This method can be used to transform an arbitrary teacher network type to an arbitrary student network type. It can even generate a mixed-architecture network that consists of two or more types of block. The network recasting can generate a network with fewer parameters and/or activations, which reduces the inference time significantly. Naturally, it can be used for network compression by recasting a trained network into a smaller network of the same type. Our experiments show that it outperforms previous compression approaches in terms of actual speedup on a GPU.
“Rethinking Numerical Representations for Deep Neural Networks”, Hill et al 2018
“Rethinking Numerical Representations for Deep Neural Networks”, (2018-08-07; similar):
With ever-increasing computational demand for deep learning, it is critical to investigate the implications of the numeric representation and precision of DNN model weights and activations on computational efficiency. In this work, we explore unconventional narrow-precision floating-point representations as they relate to inference accuracy and efficiency, to steer the improved design of future DNN platforms. We show that inference using these custom numeric representations on production-grade DNNs, including GoogLeNet and VGG, achieves an average speedup of 7.6× with less than 1% degradation in inference accuracy relative to a state-of-the-art baseline platform representing the most sophisticated hardware using single-precision floating point. To facilitate the use of such customized precision, we also present a novel technique that drastically reduces the time required to derive the optimal precision configuration.
“ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, Ping et al 2018
“ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech”, (2018-07-19; similar):
In this work, we propose a new solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (van den Oord et al 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed form, which simplifies the training algorithm and provides very efficient distillation. In addition, we introduce the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al 2018). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.
“Playing Atari With Six Neurons”, Cuccu et al 2018
“Playing Atari with Six Neurons”, (2018-06-04; similar):
Deep reinforcement learning, applied to vision-based problems like Atari games, maps pixels directly to actions; internally, the deep neural network bears the responsibility of both extracting useful information and making decisions based on it. By separating the image processing from decision-making, one could better understand the complexity of each task, as well as potentially find smaller policy representations that are easier for humans to understand and may generalize better. To this end, we propose a new method for learning policies and compact state representations separately but simultaneously for policy approximation in reinforcement learning. State representations are generated by an encoder based on two novel algorithms: Increasing Dictionary Vector Quantization makes the encoder capable of growing its dictionary size over time, to address new observations as they appear in an open-ended online-learning context; Direct Residuals Sparse Coding encodes observations by disregarding reconstruction error minimization, and aiming instead for highest information inclusion. The encoder autonomously selects observations online to train on, in order to maximize code sparsity. As the dictionary size increases, the encoder produces increasingly larger inputs for the neural network: this is addressed by a variation of the Exponential Natural Evolution Strategies algorithm which adapts its probability distribution dimensionality along the run. We test our system on a selection of Atari games using tiny neural networks of only 6 to 18 neurons (depending on the game’s controls). These are still capable of achieving results comparable—and occasionally superior—to state-of-the-art techniques which use two orders of magnitude more neurons.
“Quantization Mimic: Towards Very Tiny CNN for Object Detection”, Wei et al 2018
“Quantization Mimic: Towards Very Tiny CNN for Object Detection”, (2018-05-06; similar):
In this paper, we propose a simple and general framework for training very tiny CNNs for object detection. Due to limited representation ability, it is challenging to train very tiny networks for complicated tasks like detection. To the best of our knowledge, our method, called Quantization Mimic, is the first one focusing on very tiny networks. We utilize two types of acceleration methods: mimic and quantization. Mimic improves the performance of a student network by transferring knowledge from a teacher network. Quantization converts a full-precision network to a quantized one without large degradation of performance. If the teacher network is quantized, the search scope of the student network will be smaller. Using this feature of quantization, we propose Quantization Mimic: it first quantizes the large network, then mimics it with a quantized small network. The quantization operation helps the student network better match the feature maps from the teacher network. To evaluate our approach, we carry out experiments on various popular CNNs including VGG and ResNet, as well as different detection frameworks including Faster R-CNN and R-FCN. Experiments on Pascal VOC and WIDER FACE verify that our Quantization Mimic algorithm can be applied in various settings and outperforms state-of-the-art model acceleration methods given limited computing resources.
“Measuring the Intrinsic Dimension of Objective Landscapes”, Li et al 2018
“Measuring the Intrinsic Dimension of Objective Landscapes”, (2018-04-24; similar):
Many recently trained neural networks employ large numbers of parameters to achieve good performance. One may intuitively use the number of parameters required as a rough gauge of the difficulty of a problem. But how accurate are such notions? How many parameters are really needed? In this paper we attempt to answer this question by training networks not in their native parameter space, but instead in a smaller, randomly oriented subspace. We slowly increase the dimension of this subspace, note at which dimension solutions first appear, and define this to be the intrinsic dimension of the objective landscape. The approach is simple to implement, computationally tractable, and produces several suggestive conclusions. Many problems have smaller intrinsic dimensions than one might suspect, and the intrinsic dimension for a given dataset varies little across a family of models with vastly different sizes. This latter result has the profound implication that once a parameter space is large enough to solve a problem, extra parameters serve directly to increase the dimensionality of the solution manifold. Intrinsic dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning, where we conclude, for example, that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as classifying CIFAR-10. In addition to providing new cartography of the objective landscapes wandered by parameterized models, the method is a simple technique for constructively obtaining an upper bound on the minimum description length of a solution. A byproduct of this construction is a simple approach for compressing networks, in some cases by more than 100 times.
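The subspace trick itself is tiny to implement: parameters are θ = θ₀ + P·d for a fixed random projection P (D×k), and only d is optimized. On a toy quadratic landscape the subspace optimum can even be found in closed form, which is all this hedged sketch does (the paper trains real networks with SGD inside the subspace; all sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 100                                   # "native" parameter dimension
theta_true = rng.normal(size=D)           # minimizer of a toy quadratic loss
theta0 = np.zeros(D)                      # stand-in for the random initialization

def best_subspace_loss(k):
    """Minimize ||theta0 + P @ d - theta_true||^2 over d in R^k, P random."""
    P = rng.normal(size=(D, k))
    d, *_ = np.linalg.lstsq(P, theta_true - theta0, rcond=None)
    resid = theta0 + P @ d - theta_true
    return float(resid @ resid)

# Loss falls as the subspace dimension k grows, and vanishes once k reaches
# the native dimension D; the "intrinsic dimension" is the smallest k at
# which the loss first reaches an acceptable level.
losses = {k: best_subspace_loss(k) for k in (5, 25, 50, 100)}
```

For a random subspace on this isotropic toy problem the residual shrinks roughly like (D − k)/D, so the sweep over k traces out the loss-vs-dimension curve the paper uses to read off intrinsic dimension.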
“SqueezeNext: Hardware-Aware Neural Network Design”, Gholami et al 2018
“SqueezeNext: Hardware-Aware Neural Network Design”, (2018-03-23; similar):
One of the main barriers for deploying neural networks on embedded systems has been the large memory and power consumption of existing neural networks. In this work, we introduce SqueezeNext, a new family of neural network architectures whose design was guided by considering previous architectures such as SqueezeNet, as well as by simulation results on a neural network accelerator. This new network is able to match AlexNet’s accuracy on the ImageNet benchmark with 112× fewer parameters, and one of its deeper variants is able to achieve VGG-19 accuracy with only 4.4 million parameters (31× smaller than VGG-19). SqueezeNext also achieves better top-5 classification accuracy with 1.3× fewer parameters as compared to MobileNet, but avoids using depthwise-separable convolutions that are inefficient on some mobile processor platforms. This wide range of accuracy gives the user the ability to make speed-accuracy trade-offs, depending on the available resources on the target hardware. Using hardware simulation results for power and inference speed on an embedded system has guided us to design variations of the baseline model that are 2.59×/8.26× faster and 2.25×/7.5× more energy-efficient as compared to SqueezeNet/AlexNet, without any accuracy degradation.
“The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”, Frankle & Carbin 2018
“The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”, (2018-03-09; similar):
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.
We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the “lottery ticket hypothesis”: dense, randomly-initialized, feed-forward networks contain subnetworks (“winning tickets”) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10–20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR-10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
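The train → prune-by-magnitude → rewind-to-initialization loop is easy to sketch. Below is a deliberately minimal stand-in (a linear model on synthetic sparse data rather than a deep network; all sizes are made up) showing only the mechanics of finding and retraining a "winning ticket":

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 50
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:5] = [3.0, -2.0, 4.0, 1.5, -3.0]           # only 5 of 50 features matter
y = np.sign(X @ w_star + 0.1 * rng.normal(size=n)) # noisy +/-1 labels

def train(w, mask, steps=300, lr=0.1):
    """Gradient descent on a least-squares surrogate, respecting the mask."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / n
        w = w - lr * mask * grad
    return w * mask

accuracy = lambda w: float(np.mean(np.sign(X @ w) == y))
w_init = 0.1 * rng.normal(size=d)                  # the "lottery" initialization

w_dense = train(w_init, np.ones(d))                # 1. train dense
k = 10
mask = np.zeros(d)                                 # 2. keep top 20% by magnitude
mask[np.argsort(np.abs(w_dense))[-k:]] = 1.0
w_ticket = train(w_init, mask)                     # 3. rewind to init, retrain sparse
```

The key move is step 3: the surviving weights are reset to their *original* initial values, not to their trained values, before the sparse retraining.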
“Wide Compression: Tensor Ring Nets”, Wang et al 2018
“Wide Compression: Tensor Ring Nets”, (2018-02-25; similar):
Deep neural networks have demonstrated state-of-the-art performance in a variety of real-world applications. In order to obtain performance gains, these networks have grown larger and deeper, containing millions or even billions of parameters and over a thousand layers. The trade-off is that these large architectures require an enormous amount of memory, storage, and computation, thus limiting their usability. Inspired by the recent tensor ring factorization, we introduce Tensor Ring Networks (TR-Nets), which significantly compress both the fully connected layers and the convolutional layers of deep neural networks. Our results show that our TR-Nets approach is able to compress LeNet-5 by 11× without losing accuracy, and can compress the state-of-the-art Wide ResNet by 243× with only 2.3% degradation in CIFAR-10 image classification. Overall, this compression scheme shows promise in scientific computing and deep learning, especially for emerging resource-constrained devices such as smartphones, wearables, and IoT devices.
“Training Wide Residual Networks for Deployment Using a Single Bit for Each Weight”, McDonnell 2018
“Training wide residual networks for deployment using a single bit for each weight”, (2018-02-23; similar):
For fast and energy-efficient deployment of trained deep neural networks on resource-constrained embedded hardware, each learned weight parameter should ideally be represented and stored using a single bit. Error-rates usually increase when this requirement is imposed. Here, we report large improvements in error rates on multiple datasets, for deep convolutional neural networks deployed with 1-bit-per-weight. Using wide residual networks as our main baseline, our approach simplifies existing methods that binarize weights by applying the sign function in training; we apply scaling factors for each layer with constant unlearned values equal to the layer-specific standard deviations used for initialization. For CIFAR-10, CIFAR-100 and ImageNet, and models with 1-bit-per-weight requiring less than 10 MB of parameter memory, we achieve error rates of 3.9%, 18.5% and 26.0% / 8.5% (Top-1 / Top-5) respectively. We also considered MNIST, SVHN and ImageNet32, achieving 1-bit-per-weight test results of 0.27%, 1.9%, and 41.3% / 19.1% respectively. For CIFAR, our error rates halve previously reported values, and are within about 1% of our error rates for the same network with full-precision weights. For networks that overfit, we also show significant improvements in error rate by not learning batch normalization scale and offset parameters. This applies to both full-precision and 1-bit-per-weight networks. Using a warm-restart learning-rate schedule, we found that training for 1-bit-per-weight is just as fast as full-precision networks, with better accuracy than standard schedules, and achieved about 98%–99% of peak performance in just 62 training epochs for CIFAR-10/100. For full training code and trained models in MATLAB, Keras and PyTorch see https://github.com/McDonnellLab/1-bit-per-weight/.
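The paper's binarization rule is simple enough to state in code: each weight becomes its sign times a constant per-layer scale equal to the standard deviation its initializer would have used (for He initialization, √(2/fan-in)), so deployment stores one bit per weight. A minimal numpy sketch of just that rule, with arbitrary layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 64
sigma = np.sqrt(2.0 / fan_in)                  # He-init std: constant, unlearned
W = rng.normal(scale=sigma, size=(fan_out, fan_in))  # stand-in "trained" layer

bits = np.signbit(W)                           # all that deployment must store
W_1bit = sigma * np.where(bits, -1.0, 1.0)     # reconstruction at inference time

x = rng.normal(size=fan_in)
full, binary = W @ x, W_1bit @ x
cosine = float(full @ binary
               / (np.linalg.norm(full) * np.linalg.norm(binary)))
# For Gaussian weights, w and sigma*sign(w) correlate at sqrt(2/pi) ~ 0.80,
# so the binarized layer's outputs stay well aligned with the originals.
```

The 32× storage saving comes purely from `bits` being a boolean array; `sigma` is one scalar per layer and never trained.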
“Efficient Neural Audio Synthesis”, Kalchbrenner et al 2018
“Efficient Neural Audio Synthesis”, (2018-02-23; similar):
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4× faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks, and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.
“Universal Deep Neural Network Compression”, Choi et al 2018
“Universal Deep Neural Network Compression”, (2018-02-07; similar):
In this paper, we investigate lossy compression of deep neural networks (DNNs) by weight quantization and lossless source coding for memory-efficient deployment.
Whereas the previous work addressed non-universal scalar quantization and entropy coding of DNN weights, we for the first time introduce universal DNN compression by universal vector quantization and universal source coding. In particular, we examine universal randomized lattice quantization of DNNs, which randomizes DNN weights by uniform random dithering before lattice quantization and can perform near-optimally on any source without relying on knowledge of its probability distribution. Moreover, we present a method of fine-tuning vector-quantized DNNs to recover the performance loss after quantization.
Our experimental results show that the proposed universal DNN compression scheme compresses the 32-layer ResNet (trained on CIFAR-10) and AlexNet (trained on ImageNet) with compression ratios of 47.1 and 42.5, respectively.
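In one dimension, the randomized lattice quantizer reduces to subtractive dithered uniform quantization: encoder and decoder share a pseudorandom dither u, the encoder transmits round((w + u)/Δ), and the decoder subtracts u back out. The sketch below (step size and weight distribution chosen arbitrarily) illustrates the universality property the abstract relies on: the quantization error is uniform on [−Δ/2, Δ/2], independent of the weight distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
step = 0.05                                        # lattice cell width (arbitrary)
w = rng.normal(scale=0.1, size=100_000)            # stand-in for trained DNN weights
u = rng.uniform(-step / 2, step / 2, size=w.shape) # shared dither (same seed on
                                                   # encoder and decoder)

indices = np.round((w + u) / step)                 # what gets entropy-coded and sent
w_hat = indices * step - u                         # decoder subtracts the dither
err = w_hat - w                                    # uniform, signal-independent
```

Because the error statistics do not depend on the source, the scheme needs no knowledge of the weight distribution, which is what "universal" means here.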
“Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks”, Mittal et al 2018
“Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks”, (2018-01-31; similar):
Recently there has been a lot of work on pruning filters from deep convolutional neural networks (CNNs) with the intention of reducing computations. The key idea is to rank the filters based on a certain criterion (say, 𝓁₁-norm, average percentage of zeros, etc) and retain only the top-ranked filters. Once the low-scoring filters are pruned away, the remainder of the network is fine-tuned and is shown to give performance comparable to the original unpruned network.
In this work, we report experiments which suggest that the comparable performance of the pruned network is not due to the specific criterion chosen but due to the inherent plasticity of deep neural networks, which allows them to recover from the loss of pruned filters once the rest of the filters are fine-tuned.
Specifically, we show counter-intuitive results wherein by randomly pruning 25–50% filters from deep CNNs we are able to obtain the same performance as obtained by using state-of-the-art pruning methods. We empirically validate our claims by doing an exhaustive evaluation with VGG-16 and ResNet-50.
Further, we also evaluate a real-world scenario where a CNN trained on all 1,000 ImageNet classes needs to be tested on only a small set of classes at test time (say, only animals). We create a new benchmark dataset from ImageNet to evaluate such class-specific pruning and show that even here a random pruning strategy gives close to state-of-the-art performance.
Lastly, unlike existing approaches which mainly focus on the task of image classification, in this work we also report results on object detection. We show that using a simple random pruning strategy we can achieve substantial speed-up in object detection (74% improvement in FPS) while retaining the same accuracy as that of the original Faster R-CNN model.
“Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing”, Rosenfeld & Tsotsos 2018
“Intriguing Properties of Randomly Weighted Networks: Generalizing while Learning Next to Nothing”, (2018-01-25; similar):
Convnets can achieve good performance even when only a fraction of parameters are learned.
Training deep neural networks results in strong learned representations that show good generalization capabilities. In most cases, training involves iterative modification of all weights inside the network via backpropagation. In this paper, we propose to take an extreme approach and fix almost all weights of a deep convolutional neural network in their randomly initialized values, allowing only a small portion to be learned. As our experiments show, this often results in performance which is on par with the performance of learning all weights. The implications of this intriguing property of deep neural networks are discussed, and we suggest ways to harness it to create more robust representations.
[Keywords: Random Networks, Extreme Learning, Compact Representations]
“Learning to Prune Filters in Convolutional Neural Networks”, Huang et al 2018
“Learning to Prune Filters in Convolutional Neural Networks”, (2018-01-23; backlinks; similar):
Many state-of-the-art computer vision algorithms use large-scale convolutional neural networks (CNNs) as basic building blocks. These CNNs are known for their huge number of parameters, high redundancy in weights, and tremendous computing resource consumption. This paper presents a learning algorithm to simplify and speed up these CNNs. Specifically, we introduce a “try-and-learn” algorithm to train pruning agents that remove unnecessary CNN filters in a data-driven way. With the help of a novel reward function, our agents remove a significant number of filters in CNNs while maintaining performance at a desired level. Moreover, this method provides an easy control of the trade-off between network performance and its scale. Performance of our algorithm is validated with comprehensive pruning experiments on several popular CNNs for visual recognition and semantic segmentation tasks.
“Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Theis et al 2018
“Faster gaze prediction with dense networks and Fisher pruning”, (2018-01-17; similar):
Predicting human fixations from images has recently seen large improvements by leveraging deep representations which were pre-trained for object recognition. However, as we show in this paper, these networks are highly over-parameterized for the task of fixation prediction. We first present a simple yet principled greedy pruning method which we call Fisher pruning. Through a combination of knowledge distillation and Fisher pruning, we obtain much more runtime-efficient architectures for saliency prediction, achieving a 10× speedup for the same AUC performance as a state-of-the-art network on the CAT2000 dataset. Speeding up single-image gaze prediction is important for many real-world applications, but it is also a crucial step in the development of video saliency models, where the amount of data to be processed is substantially larger.
“Fix Your Classifier: the Marginal Value of Training the Last Weight Layer”, Hoffer et al 2018
“Fix your classifier: the marginal value of training the last weight layer”, (2018-01-14; similar):
Neural networks are commonly used as models for classification for a wide variety of tasks. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more resources. In this work we argue that this classifier can be fixed, up to a global scale constant, with little or no loss of accuracy for most tasks, allowing memory and computational benefits. Moreover, we show that by initializing the classifier with a Hadamard matrix we can speed up inference as well. We discuss the implications for current understanding of neural network models.
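Why a Hadamard matrix? Its rows are mutually orthogonal ±1 vectors, so a fixed classifier built from them assigns each class a maximally separated direction without storing learned weights, and the ±1 entries admit multiply-free inference. A small sketch (Sylvester construction; dimensions arbitrary, and the paper's learnable global scale is omitted):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

num_classes, feat_dim = 10, 16
H = hadamard(feat_dim)[:num_classes]       # 10 fixed, mutually orthogonal class rows

rng = np.random.default_rng(0)
features = rng.normal(size=(4, feat_dim))  # a batch of penultimate-layer features
logits = features @ H.T                    # per-class scores, zero learned weights
```

Nothing in `H` is ever updated during training; only the layers producing `features` (and, in the paper, one global scale) are learned.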
“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017
“Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition”, (2017-12-14; similar):
Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high-dimensional inputs, the training of RNNs becomes computationally expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely Block-Term tensor decomposition, which greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as tensor-train RNN (TT-RNN), our method, Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with much fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388× fewer parameters than the standard LSTM to achieve an accuracy improvement of over 15.6% in the Action Recognition task on the UCF11 dataset.
“Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Lin et al 2017
“Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, (2017-12-05; similar):
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including CIFAR-10, ImageNet, Penn Treebank, and LibriSpeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270× to 600× without losing accuracy, cutting the gradient size of ResNet-50 from 97 MB to 0.35 MB, and for Deep Speech from 488 MB to 0.74 MB. Deep Gradient Compression enables large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitates distributed training on mobile. Code is available at: https://github.com/synxlin/deep-gradient-compression.
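The heart of DGC, sending only the top-k gradient entries and accumulating the remainder locally so small gradients are delayed rather than dropped, fits in a few lines. A single-worker sketch with momentum correction and the other three accuracy tricks omitted; all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k, steps = 1000, 10, 200        # k/dim = 1% of entries sent per step
residual = np.zeros(dim)             # local accumulation of unsent gradient
sent_total = np.zeros(dim)           # everything actually communicated
dense_total = np.zeros(dim)          # what uncompressed SGD would have sent

for _ in range(steps):
    grad = rng.normal(size=dim)      # stand-in for one backprop gradient
    dense_total += grad
    acc = residual + grad            # new gradient plus everything deferred
    top = np.argpartition(np.abs(acc), -k)[-k:]   # top-k by magnitude
    sparse = np.zeros(dim)
    sparse[top] = acc[top]           # the 1%-dense message for this step
    sent_total += sparse
    residual = acc - sparse          # defer the rest to later steps
```

The local accumulation buffer is what makes the scheme lossless in aggregate: at any point, everything sent plus everything still buffered equals the full dense gradient sum.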
“Automated Pruning for Deep Neural Network Compression”, Manessi et al 2017
“Automated Pruning for Deep Neural Network Compression”, (2017-12-05; similar):
In this work we present a method to improve the pruning step of the current state-of-the-art methodology to compress neural networks. The novelty of the proposed pruning technique is in its differentiability, which allows pruning to be performed during the backpropagation phase of the network training. This enables an end-to-end learning and strongly reduces the training time. The technique is based on a family of differentiable pruning functions and a new regularizer specifically designed to enforce pruning. The experimental results show that the joint optimization of both the thresholds and the network weights permits reaching a higher compression rate, reducing the number of weights of the pruned network by a further 14% to 33% compared to the current state-of-the-art. Furthermore, we believe that this is the first study where the generalization capabilities in transfer learning tasks of the features extracted by a pruned network are analyzed. To achieve this goal, we show that the representations learned using the proposed pruning methodology maintain the same effectiveness and generality as those learned by the corresponding non-compressed network on a set of different recognition tasks.
“Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, Oord et al 2017
“Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, (2017-11-28; similar):
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today’s massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no statistically-significant difference in quality. The resulting system is capable of generating high-fidelity speech samples at more than 20× faster than real time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.
“Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions”, Wu et al 2017
“Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions”, (2017-11-22; similar):
Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free “shift” operation as an alternative to spatial convolutions. We fuse shifts and pointwise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the trade-off between accuracy and efficiency. To demonstrate the operation’s efficacy, we replace ResNet’s 3×3 convolutions with shift-based modules for improved CIFAR-10 and CIFAR-100 accuracy using 60% fewer parameters; we additionally demonstrate the operation’s resilience to parameter reduction on ImageNet, outperforming ResNet family members. We finally show the shift operation’s applicability across domains, achieving strong performance with fewer parameters on classification, face verification and style transfer.
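The shift operation itself really is zero-FLOP and zero-parameter: each channel's feature map is moved by a fixed spatial offset (implementable as pure memory addressing), and all learning happens in the following 1×1 convolution. A numpy sketch with hand-picked offsets; the paper instead assigns one offset per channel group across the kernel window:

```python
import numpy as np

def shift(x, offsets):
    """Move channel c of x (C, H, W) by offsets[c] = (dy, dx), zero-padding."""
    C, H, W = x.shape
    out = np.zeros_like(x)
    for c, (dy, dx) in enumerate(offsets):
        src_y = slice(max(-dy, 0), H + min(-dy, 0))
        dst_y = slice(max(dy, 0), H + min(dy, 0))
        src_x = slice(max(-dx, 0), W + min(-dx, 0))
        dst_x = slice(max(dx, 0), W + min(dx, 0))
        out[c, dst_y, dst_x] = x[c, src_y, src_x]   # no arithmetic, only copies
    return out

x = np.zeros((2, 5, 5))
x[0, 2, 2] = 1.0
x[1, 1, 1] = 1.0
shifted = shift(x, [(1, 0), (0, -1)])      # channel 0 down one row; channel 1 left

w_pointwise = np.array([[1.0, 1.0]])       # the learned part: a 1x1 conv (1 output)
y = np.tensordot(w_pointwise, shifted, axes=([1], [0]))
```

Spatial mixing comes entirely from the per-channel shifts; channel mixing comes entirely from the 1×1 convolution, so every FLOP and parameter sits in the latter.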
“Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, Gao et al 2017
“Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN”, (2017-11-21; similar):
Fine-grained image labels are desirable for many computer vision applications, such as visual search or mobile AI assistants. These applications rely on image classification models that can produce hundreds of thousands (e.g. 100K) of diversified fine-grained image labels on input images. However, training a network at this vocabulary scale is challenging, and suffers from intolerably large model size and slow training speed, which leads to unsatisfying classification performance. A straightforward solution would be training separate expert networks (specialists), with each specialist focusing on learning one specific vertical (e.g. cars, birds…). However, deploying dozens of expert networks in a practical system would significantly increase system complexity and inference latency, and consume large amounts of computational resources. To address these challenges, we propose a Knowledge Concentration method, which effectively transfers the knowledge from dozens of specialists (multiple teacher networks) into one single model (one student network) to classify 100K object categories. There are three salient aspects in our method: (1) a multi-teacher single-student knowledge distillation framework; (2) a self-paced learning mechanism to allow the student to learn from different teachers at various paces; (3) structurally-connected layers to expand the student network capacity with limited extra parameters. We validate our method on OpenImage and a newly collected dataset, EntityFotoTree (EFT), with 100K categories, and show that the proposed model performs significantly better than the baseline generalist model.
“Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method”, Sun et al 2017
“Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method”, (2017-11-17; similar):
We propose a simple yet effective technique to simplify the training and the resulting model of neural networks. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. Based on the sparsified gradients, we further simplify the model by eliminating the rows or columns that are seldom updated, which will reduce the computational cost both in training and decoding, and potentially accelerate decoding in real-world applications. Surprisingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The model simplification results show that we could adaptively simplify the model, which could often be reduced by around 9×, without any loss of accuracy or even with improved accuracy. The code, including the extension, is available at https://github.com/lancopku/meSimp
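The top-k gradient sparsification described above can be sketched in numpy (a minimal illustration under the paper’s stated idea; function name hypothetical):

```python
import numpy as np

def sparsify_topk(grad, k):
    """Keep only the k largest-magnitude entries of the gradient and
    zero the rest, so only a few rows/columns get updated per pass."""
    flat = np.abs(grad).ravel()
    if k >= flat.size:
        return grad
    threshold = np.partition(flat, -k)[-k]  # k-th largest magnitude
    mask = np.abs(grad) >= threshold
    return grad * mask

g = np.array([[0.1, -2.0, 0.3],
              [1.5, -0.05, 0.7]])
sparse_g = sparsify_topk(g, k=2)  # only -2.0 and 1.5 survive
```

Applying the sparsified gradient touches at most k rows or columns of the weight matrix, which is where the linear reduction in update cost comes from.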
“xUnit: Learning a Spatial Activation Function for Efficient Image Restoration”, Kligvasser et al 2017
“xUnit: Learning a Spatial Activation Function for Efficient Image Restoration”, (2017-11-17; similar):
In recent years, deep neural networks (DNNs) have achieved unprecedented performance in many low-level vision tasks. However, state-of-the-art results are typically achieved by very deep networks, which can reach tens of layers with tens of millions of parameters. To make DNNs implementable on platforms with limited resources, it is necessary to weaken the trade-off between performance and efficiency. In this paper, we propose a new activation unit, which is particularly suitable for image restoration problems. In contrast to the widespread per-pixel activation units, like ReLUs and sigmoids, our unit implements a learnable nonlinear function with spatial connections. This enables the net to capture much more complex features, thus requiring a significantly smaller number of layers in order to reach the same performance. We illustrate the effectiveness of our units through experiments with state-of-the-art nets for denoising, de-raining, and super-resolution, which are already considered to be very small. With our approach, we are able to further reduce these models by nearly 50% without incurring any degradation in performance.
“NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm”, Dai et al 2017
“NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm”, (2017-11-06; similar):
Deep neural networks (DNNs) have begun to have a pervasive impact on various applications of machine learning. However, the problem of finding an optimal DNN architecture for large applications is challenging. Common approaches go for deeper and larger DNN architectures but may incur substantial redundancy. To address these problems, we introduce a network growth algorithm that complements network pruning to learn both weights and compact DNN architectures during training. We propose a DNN synthesis tool (NeST) that combines both methods to automate the generation of compact and accurate DNNs. NeST starts with a randomly initialized sparse network called the seed architecture. It iteratively tunes the architecture with gradient-based growth and magnitude-based pruning of neurons and connections. Our experimental results show that NeST yields accurate, yet very compact DNNs, with a wide range of seed architecture selection. For the LeNet-300-100 (LeNet-5) architecture, we reduce network parameters by 70.2× (74.3×) and floating-point operations (FLOPs) by 79.4× (43.7×). For the AlexNet and VGG-16 architectures, we reduce network parameters (FLOPs) by 15.7× (4.6×) and 30.2× (8.6×), respectively. NeST’s grow-and-prune paradigm delivers significant additional parameter and FLOPs reduction relative to pruning-only methods.
“Compressing Word Embeddings via Deep Compositional Code Learning”, Shu & Nakayama 2017
“Compressing Word Embeddings via Deep Compositional Code Learning”, (2017-11-03; similar):
Natural language processing (NLP) models often require a massive number of parameters for word embeddings, resulting in a large storage or memory footprint. Deploying neural NLP models to mobile devices requires compressing the word embeddings without any significant sacrifice in performance. For this purpose, we propose to construct the embeddings with few basis vectors. For each word, the composition of basis vectors is determined by a hash code. To maximize the compression rate, we adopt the multi-codebook quantization approach instead of a binary coding scheme. Each code is composed of multiple discrete numbers, such as (3, 2, 1, 8), where the value of each component is limited to a fixed range. We propose to directly learn the discrete codes in an end-to-end neural network by applying the Gumbel-softmax trick. Experiments show the compression rate achieves 98% in a sentiment analysis task and 94%–99% in machine translation tasks without performance loss. In both tasks, the proposed method can improve the model performance by slightly lowering the compression rate. Compared to other approaches such as character-level segmentation, the proposed method is language-independent and does not require modifications to the network architecture.
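The reconstruction step of the multi-codebook scheme can be sketched as follows (a minimal numpy illustration; the sizes and the `embed` helper are hypothetical, and the paper additionally learns the codes end-to-end via Gumbel-softmax, which is omitted here):

```python
import numpy as np

# M codebooks, each holding K basis vectors of dimension D.
M, K, D = 4, 16, 8
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, D))

def embed(code):
    """Reconstruct a word vector as the sum of one basis vector per
    codebook, selected by the word's discrete code, e.g. (3, 2, 1, 8)."""
    return sum(codebooks[m, c] for m, c in enumerate(code))

v = embed((3, 2, 1, 8))
# Storage per word: M small integers instead of D floats,
# which is the source of the ~98% compression rate.
```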
“Learning Discrete Weights Using the Local Reparameterization Trick”, Shayer et al 2017
“Learning Discrete Weights Using the Local Reparameterization Trick”, (2017-10-21; similar):
Recent breakthroughs in computer vision make use of large deep neural networks, utilizing the substantial speedup offered by GPUs. For applications running on limited hardware, however, high-precision real-time processing can still be a challenge. One approach to solving this problem is training networks with binary or ternary weights, thus removing the need to calculate multiplications and significantly reducing memory size. In this work, we introduce LR-nets (Local reparameterization networks), a new method for training neural networks with discrete weights using stochastic parameters. We show how a simple modification to the local reparameterization trick, previously used to train Gaussian-distributed weights, enables the training of discrete weights. Using the proposed training we test both binary and ternary models on the MNIST, CIFAR-10 and ImageNet benchmarks and reach state-of-the-art results on most experiments.
“To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression”, Zhu & Gupta 2017
“To prune, or not to prune: exploring the efficacy of pruning for model compression”, (2017-10-05; similar):
Model pruning seeks to induce sparsity in a deep neural network’s various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al 2015; Narang et al 2017) prune deep networks at the cost of only a marginal loss in accuracy and achieve a sizable reduction in model size. This hints at the possibility that the baseline models in these experiments are perhaps severely over-parameterized at the outset and a viable alternative for model compression might be to simply reduce the number of hidden units while maintaining the model’s dense connection structure, exposing a similar trade-off in model size and accuracy. We investigate these two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning and can be seamlessly incorporated within the training process. We compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint. Across a broad range of neural network architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find large-sparse models to consistently outperform small-dense models and achieve up to 10× reduction in number of nonzero parameters with minimal loss in accuracy.
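The gradual pruning technique amounts to ramping a target sparsity from an initial value s_i to a final value s_f over n pruning steps with a cubic taper, then masking the smallest-magnitude weights at each step. A minimal numpy sketch (function names and defaults are illustrative):

```python
import numpy as np

def sparsity_at(t, s_i=0.0, s_f=0.9, t0=0, n=100, dt=1):
    """Gradual pruning schedule: sparsity ramps from s_i to s_f over
    n pruning steps (spaced dt apart, starting at t0) with a cubic taper,
    pruning fast early and slowly near the end."""
    t = min(max(t, t0), t0 + n * dt)
    return s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt)) ** 3

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction of the weights.
    (Ties at the threshold may prune slightly more than requested.)"""
    k = int(sparsity * w.size)
    if k == 0:
        return w
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.array([0.1, -0.5, 0.9, 0.2])
pruned = magnitude_prune(w, sparsity_at(100))  # final sparsity 0.9
```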
“N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, Ashok et al 2017
“N2N Learning: Network to Network Compression via Policy Gradient Reinforcement Learning”, (2017-09-18; backlinks; similar):
While bigger and deeper neural network architectures continue to advance the state-of-the-art for many computer vision tasks, real-world adoption of these networks is impeded by hardware and speed constraints. Conventional model compression methods attempt to address this problem by modifying the architecture manually or using predefined heuristics. Since the space of all reduced architectures is very large, modifying the architecture of a deep neural network in this way is a difficult task. In this paper, we tackle this issue by introducing a principled method for learning reduced network architectures in a data-driven way using reinforcement learning. Our approach takes a larger ‘teacher’ network as input and outputs a compressed ‘student’ network derived from the ‘teacher’ network. In the first stage of our method, a recurrent policy network aggressively removes layers from the large ‘teacher’ model. In the second stage, another recurrent policy network carefully reduces the size of each remaining layer. The resulting network is then evaluated to obtain a reward—a score based on the accuracy and compression of the network. Our approach uses this reward signal with policy gradients to train the policies to find a locally optimal student network. Our experiments show that we can achieve compression rates of more than 10× for models such as ResNet-34 while maintaining similar performance to the input ‘teacher’ network. We also present a valuable transfer learning result which shows that policies which are pre-trained on smaller ‘teacher’ networks can be used to rapidly speed up training on larger ‘teacher’ networks.
“Training Shallow and Thin Networks for Acceleration via Knowledge Distillation With Conditional Adversarial Networks”, Xu et al 2017
“Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks”, (2017-09-02; similar):
There is increasing interest in accelerating neural networks for real-time applications. We study the student-teacher strategy, in which a small and fast student network is trained with the auxiliary information learned from a large and accurate teacher network. We propose to use conditional adversarial networks to learn the loss function to transfer knowledge from teacher to student. The proposed method is particularly effective for relatively small student networks. Moreover, experimental results show the effect of network size when modern networks are used as students. We empirically study the trade-off between inference time and classification accuracy, and provide suggestions on choosing a proper student network.
“Natural Language Processing With Small Feed-Forward Networks”, Botha et al 2017
“Natural Language Processing with Small Feed-Forward Networks”, (2017-08-01; similar):
We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different trade-offs when deciding how to allocate a small memory budget.
“Bayesian Sparsification of Recurrent Neural Networks”, Lobacheva et al 2017
“Bayesian Sparsification of Recurrent Neural Networks”, (2017-07-31; similar):
Recurrent neural networks show state-of-the-art results in many text analysis tasks but often require a lot of memory to store their weights. The recently proposed Sparse Variational Dropout eliminates the majority of the weights in a feed-forward neural network without significant loss of quality. We apply this technique to sparsify recurrent neural networks. To account for recurrent specifics we also rely on Binary Variational Dropout for RNNs. We report a 99.5% sparsity level on a sentiment analysis task without a quality drop and up to 87% sparsity on a language modeling task with slight loss of accuracy.
“ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”, Zhang et al 2017
“ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”, (2017-07-04; similar):
We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g. 10–150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than the recent MobileNet on the ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves a 13× actual speedup over AlexNet while maintaining comparable accuracy.
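The channel shuffle operation is a simple reshape-transpose-flatten, sketched here in numpy (illustrative, not the authors’ code): grouping channels as a (groups, channels/group) grid and transposing it interleaves the groups, so that the next group convolution sees channels from every previous group.

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle for a (C, H, W) feature map:
    reshape channels to (g, C/g), transpose, and flatten back,
    so information can flow across group convolutions."""
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

# Toy usage: channel i holds the constant value i, so the shuffle
# permutation is visible directly in the output.
x = np.arange(6.0).reshape(6, 1, 1)
y = channel_shuffle(x, groups=2)  # channel order becomes 0,3,1,4,2,5
```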
“Structured Bayesian Pruning via Log-Normal Multiplicative Noise”, Neklyudov et al 2017
“Structured Bayesian Pruning via Log-Normal Multiplicative Noise”, (2017-05-20; similar):
Dropout-based regularization methods can be regarded as injecting random noise with predefined magnitude into different parts of the neural network during training. It was recently shown that the Bayesian dropout procedure not only improves generalization but also leads to extremely sparse neural architectures by automatically setting the individual noise magnitude per weight. However, this sparsity can hardly be used for acceleration since it is unstructured. In this paper, we propose a new Bayesian model that takes into account the computational structure of neural networks and provides structured sparsity, e.g. removes neurons and/or convolutional channels in CNNs. To do this we inject noise into the neurons’ outputs while keeping the weights unregularized. We establish the probabilistic model with a proper truncated log-uniform prior over the noise and a truncated log-normal variational approximation that ensures that the KL-term in the evidence lower bound is computed in closed form. The model leads to structured sparsity by removing elements with a low SNR from the computation graph and provides significant acceleration on a number of deep neural architectures. The model is easy to implement as it can be formulated as a separate dropout-like layer.
“Exploring Sparsity in Recurrent Neural Networks”, Narang et al 2017
“Exploring Sparsity in Recurrent Neural Networks”, (2017-04-17; similar):
Recurrent Neural Networks (RNNs) are widely used to solve a variety of problems and as the quantity of data and the amount of available compute have increased, so have model sizes. The number of parameters in recent state-of-the-art networks makes them hard to deploy, especially on mobile phones and embedded devices. The challenge is due to both the size of the model and the time it takes to evaluate it. In order to deploy these RNNs efficiently, we propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network. At the end of training, the parameters of the network are sparse while accuracy is still close to that of the original dense neural network. The network size is reduced by 8× and the time required to train the model remains constant. Additionally, we can prune a larger dense network to achieve better-than-baseline performance while still reducing the total number of parameters significantly. Pruning RNNs reduces the size of the model and can also help achieve significant inference-time speedup using sparse matrix multiply. Benchmarks show that using our technique model size can be reduced by 90% and speedup is around 2× to 7×.
“Shake-Shake Regularization of 3-branch Residual Networks”, Gastaldi 2017
“Shake-Shake regularization of 3-branch residual networks”, (2017-03-15; similar):
Reduce overfitting by replacing, in a 3-branch ResNet, the standard summation of residual branches with a stochastic affine combination.
The method introduced in this paper aims at helping computer vision practitioners faced with an overfitting problem. The idea is to replace, in a 3-branch ResNet, the standard summation of residual branches with a stochastic affine combination. The largest tested model improves on the best single-shot published result on CIFAR-10 by reaching 2.86% test error. Code is available at https://github.com/xgastaldi/shake-shake
[Keywords: computer vision, deep learning, supervised learning]
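The stochastic affine combination can be sketched as follows (a simplified numpy illustration; the branch functions are hypothetical stand-ins for the two residual branches, and the paper additionally resamples an independent coefficient in the backward pass, which is omitted here):

```python
import numpy as np

def shake_shake(x, branch1, branch2, rng, training=True):
    """Replace the usual sum of two residual branches with a stochastic
    affine combination alpha*b1(x) + (1-alpha)*b2(x); at test time the
    expectation alpha = 0.5 is used."""
    alpha = rng.uniform() if training else 0.5
    return x + alpha * branch1(x) + (1 - alpha) * branch2(x)

rng = np.random.default_rng(0)
x = np.ones(4)
# Toy branches: at test time, x + 0.5*(2x) + 0.5*(-x) = 1.5*x.
y = shake_shake(x, lambda v: 2 * v, lambda v: -v, rng, training=False)
```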
“Variational Dropout Sparsifies Deep Neural Networks”, Molchanov et al 2017
“Variational Dropout Sparsifies Deep Neural Networks”, (2017-01-19; similar):
We explore a recently proposed Variational Dropout technique that provided an elegant Bayesian interpretation of Gaussian Dropout. We extend Variational Dropout to the case when dropout rates are unbounded, propose a way to reduce the variance of the gradient estimator and report first experimental results with individual dropout rates per weight. Interestingly, it leads to extremely sparse solutions both in fully-connected and convolutional layers. This effect is similar to the automatic relevance determination effect in empirical Bayes but has a number of advantages. We reduce the number of parameters up to 280× on LeNet architectures and up to 68× on VGG-like networks with a negligible decrease of accuracy.
“Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016
“Do Deep Convolutional Nets Really Need to be Deep and Convolutional?”, (2016-03-17; backlinks; similar):
Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained.
Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy as the deep convolutional teacher.
…Figure 1 summarizes the results in Table 2 for student models of different depth, number of convolutional layers, and number of parameters when trained to mimic the ensemble teacher model. Student models trained on the ensemble logits are able to achieve accuracies previously unseen on CIFAR-10 for models with so few layers. Also, it is clear that there is a huge gap between the convolutional student models at the top of the figure, and the non-convolutional student models at the bottom of the figure: the most accurate student MLP has accuracy less than 75%, while the least accurate convolutional student model with the same number of parameters but only one convolutional layer has accuracy above 87%. And the accuracy of the convolutional student models increases further as more layers of convolution are added. Interestingly, the most accurate student MLPs with no convolutional layers have only 2 or 3 hidden layers; the student MLPs with 4 or 5 hidden layers are not as accurate.
Comparing the student MLP with only one hidden layer (bottom of the graph) to the student CNN with 1 convolutional layer clearly suggests that convolution is critical for this problem even when models are trained via distillation, and that it is very unlikely that a shallow non-convolutional model with 100 million parameters or less could ever achieve accuracy comparable to a convolutional model. It appears that if convolution is critical for teacher models trained on the original 0/1 hard targets, it is likely to be critical for student models trained to mimic these teacher models. Adding depth to the student MLPs without adding convolution does not substantially close this “convolutional gap”.
“Policy Distillation”, Rusu et al 2015
“Policy Distillation”, (2015-11-19; backlinks; similar):
Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
“Tensorizing Neural Networks”, Novikov et al 2015
“Tensorizing Neural Networks”, (2015-09-22; backlinks; similar):
Deep neural networks currently demonstrate state-of-the-art performance in several domains. At the same time, models of this class are very demanding in terms of computational resources. In particular, a large amount of memory is required by commonly used fully-connected layers, making it hard to use the models on low-end devices and stopping the further increase of the model size.
In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor while the expressive power of the layer is preserved.
In particular, for the Very Deep VGG networks we report a compression factor of the dense weight matrix of a fully-connected layer of up to 200,000×, leading to a compression factor of the whole network of up to 7×.
“Distilling the Knowledge in a Neural Network”, Hinton et al 2015
“Distilling the Knowledge in a Neural Network”, (2015-03-09; backlinks; similar):
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
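The core of the distillation technique is matching the student’s temperature-softened output distribution to the teacher’s. A minimal numpy sketch of that loss (illustrative; in practice it is combined with a standard cross-entropy term on the true labels, omitted here):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's softened distribution ('dark knowledge'). The T^2 factor
    keeps gradient magnitudes comparable across temperatures."""
    p_teacher = softmax(teacher_logits / T)
    p_student = softmax(student_logits / T)
    return -T * T * np.sum(p_teacher * np.log(p_student + 1e-12))

loss = distillation_loss(np.array([1.0, 0.5, -0.2]),
                         np.array([2.0, 0.1, -1.0]))
```

Raising the temperature T flattens both distributions, exposing the teacher’s relative probabilities on wrong classes, which is precisely the information a hard 0/1 label throws away.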
Knowledge distillation
Miscellaneous

https://tech.piccollage.com/distillationofclipmodelandotherexperimentsf8394b7321ce
https://engineering.fb.com/2016/11/08/android/deliveringrealtimeaiinthepalmofyourhand/
https://blog.tensorflow.org/2020/03/higheraccuracyonvisionmodelswithefficientnetlite.html

https://blog.roblox.com/2020/05/scaledbertserve1billiondailyrequestscpus/

https://ai.googleblog.com/2021/12/trainingmachinelearningmodelsmore.html
https://ai.googleblog.com/2021/10/grammarcorrectionasyoutypeonpixel.html

https://ai.googleblog.com/2019/03/anallneuralondevicespeech.html

https://ai.googleblog.com/2018/05/customondevicemlmodels.html

https://ai.facebook.com/blog/ahighlyefficientrealtimetexttospeechsystemdeployedoncpus/

http://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_00990