Neural nets are extremely ‘overparameterized’, in the sense that they have orders of magnitude more parameters than necessary to solve the problems they are trained on. This is demonstrated both by the regular improvements in training smaller/faster but still performant networks, and by directly creating smaller neural nets with similar or identical performance on those problems: deleting parameters (sparsification), reducing the precision of the numeric encoding (compression/quantization), or training a much smaller network from scratch using the original large network somehow (distillation).
Mysteriously, these smaller networks typically cannot be trained from scratch; performance gains can be obtained without the original data; models can be trained to imitate themselves in self-distillation; despite all this suggesting that overfitting ought to be a major concern, they generalize well; and many of these smaller networks are in some sense already present in the original neural network. This is frequently taken to indicate some sort of blessing of scale: large NNs have smoother loss landscapes, which simple optimizers can successfully traverse to good optima no matter how hard the problem, as compared to smaller networks, which may wind up ‘trapped’ at a bad place with no free parameters to let them slip around obstacles and find some way to improve (much less the loss landscapes of equivalently powerful but extremely brittle encodings such as Brainf—k or assembler programs). As well as their great theoretical interest—how can we train these small models directly? what does this tell us about how NNs work?—such smaller NNs are critical to practical real-world deployment to servers & smartphones at scale and to the design of accelerator hardware supporting reduced-precision operations, and are also an interesting case of capability growth for AI risk: as soon as any NN exists which can achieve performance goal X, it is likely that a much more efficient NN (potentially orders of magnitude smaller or faster) can be created to achieve X almost immediately thereafter. (These are merely one way that your software can be much faster.)
Below are some examples of NNs being compressed in size or FLOPs by anywhere from 50% to ~17,000% (a vastly incomplete bibliography, merely some papers I have noted during my general reading).
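Since distillation recurs throughout the papers below, here first is a minimal sketch of the classic distillation objective of Hinton et al 2015, which many of them build on. This is my own PyTorch illustration, not code from any of these papers: the student is trained to match the teacher's temperature-softened output distribution in addition to the usual hard-label loss.

```python
# Minimal sketch of classic knowledge distillation (Hinton et al 2015):
# the student matches the teacher's temperature-softened outputs
# in addition to fitting the ground-truth labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of the soft (teacher-matching) and hard (label) losses."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 so its gradients stay comparable in
    # magnitude across temperatures, as in the original paper.
    soft_loss = F.kl_div(soft_student, soft_targets,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```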
[blog] As the
training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of
the most promising model architectures due to their substantial training cost reduction compared to a quality-equivalent dense model.
Its training cost savings have been demonstrated from encoder-decoder models (prior works) to a 5× saving for auto-regressive language models (this work along with
parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and
unsolved, limiting its practical usage.
To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to
3.7×, and a highly optimized inference system that provides 7.3× better latency and cost compared to existing MoE inference solutions. It offers ultra-fast
inference latencies (25ms) for trillion-parameter MoE models. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to
4.5× faster and 9× cheaper inference compared to quality-equivalent dense models.
We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where
training and deploying higher-quality models with fewer resources becomes more widely possible.
Recently, significant progress has been made in learned image and video compression. In particular, the usage of Generative Adversarial Networks has led to
impressive results in the low bit rate regime. However, the model size remains an important issue in current state-of-the-art proposals and existing solutions
require significant computation effort on the decoding side. This limits their usage in realistic scenarios and the extension to video compression. In this paper,
we demonstrate how to leverage knowledge distillation to obtain equally capable image decoders at a fraction of the original number of parameters. We investigate
several aspects of our solution including sequence specialization with side information for image coding. Finally, we also show how to transfer the obtained
benefits into the setting of video compression. Overall, this allows us to reduce the model size by a factor of 20 and to achieve a 50% reduction in decoding time.
[blog] Pre-trained language models have
achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge-enhanced models, and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts. To reduce the computation overhead and carbon emission, we propose an online distillation framework for ERNIE 3.0 Titan, where the teacher model will teach students and train itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation
trains a student model against two objectives: a task-specific objective (eg. language modeling) and an imitation objective that encourages the hidden states of
the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective
that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model—a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
What can neural networks learn about the visual world from a single image? While a single image obviously cannot contain the multitudes of possible objects, scenes, and lighting conditions that exist within the space of all possible 256^(3×224×224) 224-sized square images, it might still provide a strong prior for natural images.
To analyze this hypothesis, we develop a framework for training neural networks from scratch using a single image by means of knowledge distillation from a
supervisedly pretrained teacher. With this, we find that the answer to the above question is: ‘surprisingly, a lot’. In quantitative terms, we find top-1
accuracies of 94%/74% on CIFAR-10/100, 59% on ImageNet and, by extending this method to audio, 84% on
SpeechCommands. In extensive analyses we disentangle the effect of augmentations, choice of source image and network architectures and also discover “panda
neurons” in networks that have never seen a panda. This work shows that one image can be used to extrapolate to thousands of object classes and motivates a renewed
research agenda on the fundamental interplay of augmentations and image data.
Network quantization significantly reduces model inference complexity and has been widely used in real-world deployments. However, most existing quantization
methods have been developed and tested mainly on Convolutional Neural Networks (CNN), and suffer severe degradation when applied to Transformer-based architectures. In this work,
we present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers. In
particular, we propose Powers-of-Two Scale (PTS) to deal with the serious inter-channel variation of LayerNorm
inputs in a hardware-friendly way. In addition, we propose Log-Int-Softmax (LIS) that can sustain the extreme
non-uniform distribution of the attention maps while simplifying inference by using 4-bit quantization and the BitShift operator. Comprehensive experiments on
various Transformer-based architectures and benchmarks show that our methods outperform previous works in performance while
using even lower bit-width in attention maps. For instance, we reach 85.17% Top-1 accuracy with ViT-L on ImageNet and 51.4 mAP with Cascade Mask R-CNN (Swin-S) on COCO. To our knowledge, we are the first to achieve comparable accuracy
degradation (~1%) on fully quantized Vision Transformers. Code is available at https://github.com/linyang-zhh/FQ-ViT .
We study the problem of example-based procedural texture synthesis using highly compact models. Given a sample image, we use differentiable programming to train
a generative process, parameterised by a recurrent Neural Cellular Automata (NCA) rule.
Contrary to the common belief that neural networks should be highly over-parameterised, we demonstrate that our model architecture and training procedure allows
for representing complex texture patterns using just a few hundred learned parameters, making their expressivity comparable to hand-engineered procedural texture
generating programs. The smallest models from the proposed 𝜇NCA family scale down to 68 parameters. When using
quantisation to one byte per parameter, proposed models can be shrunk to a size range between 588 and 68 bytes.
Implementation of a texture generator that uses these parameters to produce images is possible with just a few lines of GLSL or C code.
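To make that parameter count concrete, here is a sketch of a generic neural-cellular-automaton texture rule in PyTorch. The architecture details are my assumptions (the standard NCA recipe of fixed perception filters plus a tiny learned update), not the authors' exact µNCA; note that at C = 4 channels the learned 1×1 convolution is just 16×4 weights + 4 biases = 68 parameters.

```python
# Sketch of a tiny NCA texture rule: fixed perception filters feed a
# minuscule learned update applied stochastically across the grid.
import torch
import torch.nn.functional as F

C = 4  # state channels per cell

# Fixed (non-learned) perception filters: identity, Sobel-x/y, Laplacian.
ident   = torch.tensor([[0., 0, 0], [0, 1, 0], [0, 0, 0]])
sobel_x = torch.tensor([[-1., 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 8
lap     = torch.tensor([[1., 2, 1], [2, -12, 2], [1, 2, 1]]) / 16
filters = torch.stack([ident, sobel_x, sobel_x.T, lap])   # (4, 3, 3)
kernel  = filters.repeat(C, 1, 1).unsqueeze(1)            # depthwise (4C,1,3,3)

# The entire learned rule: a 1x1 convolution mapping the 4C perceived
# features to C state deltas (16*4 weights + 4 biases = 68 parameters).
update = torch.nn.Conv2d(4 * C, C, kernel_size=1)

def nca_step(state):
    """One stochastic NCA update over a (1, C, H, W) state grid."""
    y = F.conv2d(F.pad(state, (1, 1, 1, 1), mode="circular"),
                 kernel, groups=C)                        # perceive neighbors
    mask = (torch.rand_like(state[:, :1]) < 0.5).float()  # async cell updates
    return state + update(y) * mask

state = torch.zeros(1, C, 64, 64)
for _ in range(64):          # training would backpropagate through these
    state = nca_step(state)  # steps against a texture-matching loss
```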
Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to
deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language
models by integrating weight pruning and model distillation. These sparse pre-trained models can be used to transfer learning for a wide range of tasks while
maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the
compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover,
we show how to further compress the sparse models’ weights to 8bit precision using quantization-aware training. For example, with our sparse pre-trained
BERT-Large fine-tuned on SQuADv1.1 and quantized to 8bit we achieve a compression ratio of 40× for the encoder with less
than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
Gigantic pre-trained models have become central to natural language processing (NLP), serving as the starting point
for fine-tuning towards a range of downstream tasks. However, two pain points persist for this paradigm: (a) as the pre-trained models grow bigger (eg. 175B
parameters for GPT-3), even the fine-tuning process can be time-consuming and computationally expensive; (b)
the fine-tuned model has the same size as its starting point by default, which is neither sensible due to its more specialized functionality, nor practical since
many fine-tuned models will be deployed in resource-constrained environments.
To address these pain points, we propose a framework for resource-efficient and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight
updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (1) parameter-efficient fine-tuning—by enforcing sparsity-aware weight updates on top of
the pre-trained weights; and (2) resource-efficient inference—by encouraging a sparse weight structure towards the final fine-tuned model. We leverage sparsity in
these two directions by exploiting both unstructured and structured sparse patterns in pre-trained language models via magnitude-based pruning and
ℓ₁ sparse regularization.
Extensive experiments and in-depth investigations, with diverse network backbones (ie. BERT, GPT-2, and DeBERTa) on dozens of datasets, consistently demonstrate highly impressive parameter-/training-/inference-efficiency, while maintaining competitive downstream transfer performance. For instance, our DSEE-BERT obtains about 35% inference FLOPs savings with <1% trainable parameters and comparable performance to conventional fine-tuning. Code is available on GitHub.
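The two sparsity directions DSEE describes can be sketched on a single linear layer; the sketch below is my own illustration (the names and the ℓ₁ weighting are mine, not the paper's): fine-tuning learns only a sparse, ℓ₁-regularized update on top of frozen pretrained weights, and the final weights are magnitude-pruned for sparse inference.

```python
# Illustrative sketch of DSEE-style dual sparsity on one linear layer.
import torch

def finetune_step(W0, dW, x, y, loss_fn, opt, l1=1e-4):
    """Train only the sparse update dW on top of frozen pretrained W0."""
    loss = loss_fn(x @ (W0.detach() + dW).T, y) + l1 * dW.abs().sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def magnitude_prune(W, sparsity=0.9):
    """Zero the smallest-magnitude weights for resource-efficient inference."""
    k = int(W.numel() * sparsity)
    thresh = W.abs().flatten().kthvalue(k).values
    return W * (W.abs() > thresh)
```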
Structural pruning can simplify network architecture and improve inference speed.
We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a
global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget. For filter
importance ranking, HALP leverages a latency lookup table to track latency-reduction potential and a global saliency score to gauge accuracy drops. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning as a reward-maximization problem under a target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and the accuracy-efficiency trade-off. We examine HALP on both classification and detection tasks, over varying networks, on the ImageNet and VOC datasets. In particular, for ResNet-50/ResNet-101 pruning on ImageNet, HALP improves network throughput by 1.60×/1.90× with +0.3%/−0.2% top-1 accuracy changes, respectively. For SSD pruning on VOC, HALP improves throughput by 1.94× with only a 0.56 mAP drop.
HALP consistently outperforms prior art, sometimes by large margins.
Scaling neural networks to “large” sizes, with billions of parameters, has been shown to yield impressive results on many challenging problems. However, the
inference cost incurred by such large models often prevents their application in most real-world settings. In this paper, we propose a two-stage framework based on
distillation that realizes the modelling benefits of the large models, while largely preserving the computational benefits of inference with more lightweight
models. In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of “easy” examples; for
the “hard” examples, we fall back to the teacher. Such an approach allows us to efficiently employ large models in practical scenarios where easy examples are much
more frequent than rare hard examples. Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size,
thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation. Empirically, we demonstrate the benefits of our approach
on both image classification and natural language processing benchmarks.
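The serving logic is simple enough to sketch; the confidence-threshold router below is my guess at a minimal version of the easy/hard split (the paper's actual routing rule may differ): serve the cheap student when it is confident, and pay for the teacher only on the rare hard examples.

```python
# Minimal student/teacher cascade: route low-confidence inputs to the teacher.
import torch

@torch.no_grad()
def cascade_predict(student, teacher, x, threshold=0.9):
    probs = torch.softmax(student(x), dim=-1)
    conf, pred = probs.max(dim=-1)
    hard = conf < threshold                 # the rare "hard" examples
    if hard.any():
        pred[hard] = teacher(x[hard]).argmax(dim=-1)
    return pred
```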
The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train
commonsense models. In this work, we investigate an alternative from-machine-to-corpus-to-machine: general language models author these commonsense knowledge
graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge Distillation
(Hinton et al 2015), our approach uses larger models to teach smaller models. A key difference is that we distill knowledge symbolically, as text, in addition to the neural model. We also distill only one aspect, the commonsense of a general language model teacher, allowing the student to be a different type, a commonsense model. Altogether, we show that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense
from GPT-3, a general language model. Empirical results demonstrate that, for the first time, a
human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity. In
addition, it results in a neural commonsense model that surpasses the teacher model’s commonsense capabilities despite its 100× smaller size. We apply this
to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.
We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to
ranking a set of words which could continue a given context. To avoid annotating top-k ranks, we generate them using pre-trained LMs: GPT-2, BERT and Born-Again models. This leads to a rank-based form of
knowledge distillation (KD). We also develop a method using N-grams to create a non-probabilistic teacher which generates the ranks without the need of a pre-trained LM. We confirm the hypotheses that we can treat LMing as a ranking task and that we can do so without the use of a pre-trained LM. We show that rank-based KD
generally improves perplexity (PPL), often with statistical-significance, when compared to Kullback-Leibler-based KD.
Surprisingly, given the simplicity of the method, N-grams act as competitive teachers and achieve similar performance to using either BERT or Born-Again model teachers. GPT-2 always acts as the best teacher,
though, and using it and a Transformer-XL student
on Wiki-02, rank-based KD reduces a cross-entropy baseline
from 65.27 to 55.94 and against a KL-based KD of 56.70.
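The paper's losses are listwise ranking objectives; as one concrete possibility, here is my sketch of a Plackett-Luce (ListMLE-style) loss over a teacher's top-k ranked tokens, restricted to those k tokens for simplicity. It illustrates rank-based KD but is not necessarily the authors' exact formulation.

```python
# Illustrative rank-based KD: likelihood of the teacher's top-k ranking
# under the student's scores (Plackett-Luce / ListMLE over k tokens).
import torch

def listmle_topk(student_logits, teacher_logits, k=5):
    _, order = teacher_logits.topk(k, dim=-1)   # teacher's top-k ranking
    s = student_logits.gather(-1, order)        # student scores in that order
    # log P(ranking) = sum_i [ s_i - logsumexp(s_i, ..., s_k) ]
    suffix_lse = torch.logcumsumexp(s.flip(-1), dim=-1).flip(-1)
    return -(s - suffix_lse).sum(-1).mean()
```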
Stateful optimizers maintain gradient statistics over time, eg. the exponentially smoothed sum (SGD with
momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of
models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit
optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise
quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster
optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic
quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient
variance that comes from the highly non-uniform distribution of
input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks,
including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT’14 machine translation, MoCo v2 contrastive ImageNet
pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We
open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
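The core trick is easy to sketch. Below is my illustration of block-wise quantization of an optimizer-state tensor, using uniform absmax scaling per block for simplicity; the paper's dynamic quantization uses a non-linear code rather than this uniform one.

```python
# Block-wise 8-bit quantization sketch: each block is scaled independently
# by its own absolute maximum, isolating outliers to a single block.
import torch

def blockwise_quantize(state, block_size=2048):
    blocks = state.flatten().reshape(-1, block_size)   # assumes divisibility
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    q = (blocks / scale * 127).round().to(torch.int8)  # 8-bit state
    return q, scale                                    # + one fp scale per block

def blockwise_dequantize(q, scale, shape):
    return (q.float() / 127 * scale).reshape(shape)
```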
Are end-to-end text-to-speech (TTS) models over-parameterized? To what
extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram
prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with
similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed by large-scale subjective tests and
objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint
and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We
show that transformers have unique quantization challenges—namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point
format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending
to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each
with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme—per-embedding-group
quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings
can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at https://github.com/qualcomm-ai-research/transformer-quantization.
[blog] Sparse Mixture-of-Experts (MoE) has been a successful
approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are
prohibitively large and practitioners often resort to methods such as distillation for serving.
In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on
WMT and a web-scale dataset suggest that task-level routing (TaskMoE) enables us to extract smaller,
ready-to-deploy sub-networks from large sparse models.
On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model
(token-MoE) by +1.0 BLEU on
average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9× when we route by tasks instead of tokens. While distilling a
token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design,
preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B
parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6×.
Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods
have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach
targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement
pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments
consider classification and generation tasks, yielding among other results a pruned model that is a 2.4× faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.
The effectiveness of machine learning algorithms arises from being able to extract useful features from large amounts of data. As model and dataset sizes
increase, dataset distillation methods that compress large datasets into significantly smaller yet highly performant ones will become valuable in terms of training
efficiency and useful feature extraction. To that end, we apply a novel distributed kernel based meta-learning framework to achieve state-of-the-art results for
dataset distillation using infinitely wide convolutional neural networks. For instance, using only 10 datapoints (0.02% of original dataset), we obtain over
64% test accuracy on the CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of
40%. Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST,
CIFAR-10, CIFAR-100, and SVHN.
Furthermore, we perform some preliminary analyses of our distilled datasets to shed light on how they differ from naturally occurring data.
Delivering malware covertly and detection-evadingly is critical to advanced malware campaigns. In this paper, we present a method that delivers malware covertly
and detection-evadingly through neural network models. Neural network models are poorly explainable and have a good generalization ability. By embedding malware
into the neurons, malware can be delivered covertly with minor or even no impact on the performance of neural networks. Meanwhile, since the structure of the
neural network models remains unchanged, they can pass the security scans of antivirus engines. Experiments show that 36.9MB of malware can be embedded into a 178MB AlexNet model within 1% accuracy loss, and no suspicions are raised by the antivirus engines in VirusTotal, which verifies the feasibility of this method. With the widespread application of artificial intelligence, utilizing neural networks may become a trend in malware delivery. We hope this work can provide a referenceable scenario for defenses against neural network-assisted attacks.
Two crucial requirements for a successful adoption of deep learning (DL) in the wild are: (1) robustness to distributional shifts, and (2) model compactness for
achieving efficiency. Unfortunately, efforts towards simultaneously achieving Out-of-Distribution (OOD) robustness
and extreme model compactness without sacrificing accuracy have mostly been unsuccessful. This raises an important question: “Is the inability to create compact,
accurate and robust deep neural networks (CARDs) fundamental?” To answer this question we perform a large-scale
analysis for a range of popular model compression techniques which uncovers several intriguing patterns. Notably, in contrast to traditional pruning approaches
(eg. fine tuning and gradual magnitude pruning), we find that “lottery ticket-style” pruning approaches can surprisingly be used to create high performing
CARDs. Specifically, we are able to create extremely compact CARDs that are
dramatically more robust than their significantly larger and full-precision counterparts while matching (or beating) their test accuracy, simply by pruning
and/or quantizing. To better understand these differences, we perform sensitivity analysis in the Fourier domain for CARDs trained using different data augmentation methods. Motivated by our analysis, we develop a simple domain-adaptive test-time ensembling approach
(CARD-Deck) that uses a gating module to dynamically select an appropriate CARD from the CARD-Deck based on their spectral-similarity with test samples. By leveraging
complementary frequency biases of different compressed models, the proposed approach builds a “winning hand” of CARDs
that establishes a new state-of-the-art on CIFAR-10-C accuracies (ie. 96.8% clean and 92.75% robust) with dramatically
better memory usage than their non-compressed counterparts. We also present some theoretical evidence supporting our empirical findings.
Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. However,
with the progressive improvements in deep learning models, their number of parameters, latency, resources required to train, etc. have all increased
significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. We present and motivate
the problem of efficiency in deep learning, followed by a thorough survey of the five core areas of model efficiency (spanning modeling techniques, infrastructure,
and hardware) and the seminal work there. We also present an experiment-based guide along with code, for practitioners to optimize their model training and
deployment. We believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling
techniques to hardware support. Our hope is that this survey would provide the reader with the mental model and the necessary understanding of the field to apply
generic efficiency techniques to immediately get significant improvements, and also equip them with ideas for further research and experimentation to achieve further gains.
Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional
post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference
complexity, without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating
sparsity in ViTs “from end to end”. Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed
small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the
final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention
heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector to
adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired
generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes improve the ViT
accuracy rather than compromising it, making sparsity a tantalizing “free lunch”. For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data,
architecture), improves top-1 accuracy by 0.28%, and meanwhile enjoys 49.32% FLOPs and 4.40% running-time savings.
Our codes are available at https://github.com/VITA-Group/SViTE.
The learned weights of a neural network have often been considered devoid of scrutable internal structure. In this paper, however, we look for structure in the
form of clusterability: how well a network can be divided into groups of neurons with strong internal connectivity but weak external connectivity. We find that a
trained neural network is typically more clusterable than randomly initialized networks, and often clusterable relative to random networks with the same
distribution of weights. We also exhibit novel methods to promote clusterability in neural network training, and find that in multi-layer perceptrons they lead to
more clusterable networks with little reduction in accuracy. Understanding and controlling the clusterability of neural networks will hopefully render their inner
workings more interpretable to engineers by facilitating partitioning into meaningful clusters.
In deep learning, models typically reuse the same parameters for all inputs. Mixture-of-Experts (MoE) defies this and instead selects different
parameters for each incoming example. The result is a sparsely-activated model—with outrageous numbers of parameters—but a constant computational cost. However,
despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability—we address these with
the Switch Transformer.
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques
help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based
off T5-Base and T5-Large (Raffel et al 2019) to obtain up to 7× increases in
pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally,
we advance the current scale of language models by pre-training up to 1-trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4× speedup
over the T5-XXL model.
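For reference, the routing itself is simple; here is my simplified sketch of a Switch layer's top-1 routing in PyTorch, omitting the capacity factor, load-balancing auxiliary loss, and expert parallelism that a real implementation needs.

```python
# Simplified Switch routing: each token is dispatched to its single
# highest-probability expert, and the output is scaled by the gate value
# so the router itself receives gradients.
import torch

def switch_layer(x, router, experts):
    """x: (tokens, d_model); router: a Linear(d_model, len(experts))."""
    gates = torch.softmax(router(x), dim=-1)
    gate, idx = gates.max(dim=-1)        # one expert per token
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        sel = idx == i
        if sel.any():
            out[sel] = gate[sel, None] * expert(x[sel])
    return out
```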
…Appendix E: Relation of Upstream to Downstream Model Performance
There is no guarantee that a model’s quality on a pre-training objective will translate to downstream task results. Figure 13 presents the
correlation of the upstream model quality, for both dense and Switch models, on the C4 pre-training task with two downstream task measures: average
SuperGLUE performance and TriviaQA score. We choose these two tasks as one probes the model’s reasoning and
the other factual knowledge.
We find a consistent correlation, indicating that for both baseline and Switch models, improved pre-training leads to better downstream results. Additionally,
for a fixed upstream perplexity we find that both Switch and dense models perform similarly in the small to medium model size regime. However, in the largest model
regime (T5-11B/T5-XXL), our largest Switch models, as mentioned in Section 5.6, do not always translate their upstream perplexity well to downstream fine-tuning on the SuperGLUE task. This
warrants future investigation and study to fully realize the potential of sparse models. Understanding the fine-tuning dynamics with expert-models is very
complicated and is dependent on regularization, load-balancing, and fine-tuning hyper-parameters.
…Yet in spite of its historical importance, MNIST has three notable shortcomings. First, it does a poor
job of differentiating between linear, nonlinear, and translation-invariant models. For example, logistic, MLP,
and CNN benchmarks obtain 94%, 99+%, and 99+% accuracy on it. This makes it hard to measure the
contribution of a CNN’s spatial priors or to judge the relative effectiveness of different regularization schemes.
Second, it is somewhat large for a toy dataset. Each input example is a 784-dimensional vector and thus it takes a non-trivial amount of computation to perform
hyperparameter searches or debug a meta-learning loop. Third, MNIST is hard to hack. The ideal toy dataset
should be procedurally generated so that researchers can smoothly vary parameters such as background noise, translation, and resolution.
In order to address these shortcomings, we propose the MNIST-1D dataset. It is a minimalist, low-memory,
and low-compute alternative to MNIST, designed for exploratory deep learning research where rapid iteration
is a priority. Training examples are 20× smaller but they are still better at measuring the difference between (1) linear and nonlinear classifiers and (2) models
with and without spatial inductive biases (eg. translation invariance). The dataset is procedurally generated but still permits analogies to real-world digit
classification…Unlike MNIST, each example is a one-dimensional sequence of points. To generate an example, we
begin with a digit template and then randomly pad, translate, and transform it.
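A minimal sketch of that generation process (my illustration; the real generator at https://github.com/greydanus/mnist1d includes further transforms such as correlated noise, shear, and resampling):

```python
# MNIST-1D-style example generation: pad, translate, and perturb a short
# one-dimensional digit template.
import numpy as np

def make_example(template, length=40, rng=np.random.default_rng(0)):
    x = np.pad(template, (2, 2))               # pad the 1-D digit template
    start = rng.integers(0, length - len(x))   # random translation
    out = np.zeros(length)
    out[start:start + len(x)] = x
    out += 0.05 * rng.standard_normal(length)  # background noise
    return out
```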
Example use cases: In this section we will explore several examples of how MNIST-1D can
be used to study core “science of deep learning” phenomena.
Finding lottery tickets…Unlike many follow-up experiments on the lottery ticket, this one took just two days of researcher time to produce.
The curious reader can also reproduce these results in their browser in a few minutes.
Observing deep double descent…We see the MNIST-1D dataset as a good tool for
exploring these properties. In fact, we were able to reproduce the double descent pattern after a few hours of researcher effort. The figure below shows our results for a
fully-connected network and a convolutional model.
Gradient-based meta-learning…A model does this by having two levels of optimization: the first is a fast inner loop which corresponds to a traditional learning objective, and the second is a slow outer loop which updates the “meta” properties of the learning process…Meta-learning is a promising topic
but it is very difficult to scale. First of all, meta-learning algorithms consume enormous amounts of time and compute. Second of all, implementations tend to
grow complex since there are twice as many hyperparameters (one set for each level of optimization) and most deep learning frameworks are not set up well for
meta-learning. This places an especially high incentive on debugging and iterating meta-learning algorithms on small-scale datasets such as MNIST-1D. For example, it took just a few hours to implement and debug the gradient-based hyperparameter optimization of a
learning rate shown below.
Meta-learning an activation function: Having implemented a “minimal working example” of gradient-based meta-learning, we
realized that it permitted a simple and novel extension: meta-learning an activation function. With a few more hours of researcher time, we were able to
parameterize our classifier’s activation function with a second neural network and then learn the weights using meta-gradients.
Measuring the spatial priors of deep
networks: …Principal among these priors is the translation invariance of convolution. A primary motivation for
this dataset was to construct a toy problem that could effectively quantify a model’s spatial priors. The second figure
in this post illustrates that this is indeed possible with MNIST-1D.
Benchmarking pooling methods. Our final case study begins with a specific question: What is the relationship between pooling and sample
efficiency? We had not seen evidence that pooling makes models more or less sample efficient, but this seemed an important relationship to understand. With
this in mind, we trained models with different pooling methods and training set sizes and found that, while pooling tended to be effective in low-data regimes,
it did not make much of a difference in high-data regimes.
…this post argues in favor of small-scale machine learning research. Neural networks do not have problems with scaling or performance—but they do have
problems with interpretability, reproducibility, and iteration speed. We see carefully-controlled, small-scale experiments as a great way to address these
problems…For example, several of the findings reported in this post are at the point where they should be investigated at scale. We would like to show that large
scale lottery tickets also learn spatial inductive biases, and show evidence that they develop local connectivity. We would also like to try meta-learning an
activation function on a larger model in the hopes of finding an activation that will outperform ReLU and Swish in generality. We should emphasize
that we are only ready to scale these results now that we have isolated and understood them in a controlled setting. We believe that scaling a system is only a
good idea once the relevant causal mechanisms have been isolated and understood. [cf scaling
law papers] …Our work also bears philosophical similarities to the “Synthetic Petri Dish” by Rawal et al 2020.
Closing Thoughts: There is a counterintuitive possibility that in order to explore the limits of how large we can scale neural networks, we may
need to explore the limits of how small we can scale them first. Scaling models and datasets downward in a way that preserves the nuances of their behaviors at
scale will allow researchers to iterate quickly on fundamental and creative ideas. This fast iteration cycle is the best way of obtaining insights about how to
incorporate progressively more complex inductive biases into our models. We can then transfer these inductive biases across spatial scales in order to dramatically
improve the sample efficiency and generalization properties of large-scale models. We see the humble MNIST-1D
dataset as a first step in that direction.
In this paper, we propose and experiment with techniques for extreme compression of neural natural language understanding (NLU) models, making them suitable for execution on resource-constrained devices. We propose a task-aware, end-to-end compression approach that performs word-embedding compression jointly with NLU task
learning. We show our results on a large-scale, commercial NLU system trained on a varied set of intents
with huge vocabulary sizes. Our approach outperforms a range of baselines and achieves a compression rate of 97.4% with less than 3.7% degradation in predictive
performance. Our analysis indicates that the signal from the downstream task is important for effective compression with minimal degradation in performance.
Transformer-based models have pushed the state of the art in many areas of NLP, but our understanding of what is behind
their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and
architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
…Given the above evidence of overparameterization, it does not come as a surprise that BERT can be efficiently
compressed with minimal accuracy loss, which would be highly desirable for real-world applications. Such efforts to date are summarized in
Table 1. The main approaches are knowledge distillation, quantization, and pruning…If the ultimate goal of training BERT is compression, Li et al (2020) recommend training larger models and
compressing them heavily rather than compressing smaller models lightly.
One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of
ε-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar
model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such
remarkable datasets, inspired by the recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression
(KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset
selection methods while obtaining state of the art results for MNIST and CIFAR-10
classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks
even beyond the lazy-training regime, which leads to state of the art results for neural network dataset distillation with potential applications to privacy-preservation.
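The meta-learning loop is easy to sketch if one swaps the paper's infinite-width ConvNet NTK for an ordinary RBF kernel (my substitution, purely for illustration): the small support set is optimized by backpropagating through the kernel ridge-regression solution.

```python
# KIP-style dataset distillation sketch with an RBF kernel standing in
# for the infinite-width network kernel; (Xs, ys) is the learnable
# distilled support set.
import torch

def rbf(a, b, gamma=0.01):
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

def kip_loss(Xs, ys, Xt, yt, reg=1e-4):
    """MSE of kernel ridge-regression from the support set (Xs, ys),
    evaluated on real data (Xt, yt); differentiable in Xs and ys."""
    K_ss = rbf(Xs, Xs) + reg * torch.eye(len(Xs))
    preds = rbf(Xt, Xs) @ torch.linalg.solve(K_ss, ys)
    return ((preds - yt) ** 2).mean()

# Xs = torch.randn(n, d, requires_grad=True), ys likewise (one-hot-like);
# repeatedly backprop kip_loss through the solve and step an optimizer.
```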
We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al 2018 by applying recent breakthroughs in
algorithms for neural architecture search.
This optimal subset, which we refer to as “Bort”, is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% of
the original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which is
1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large (Liu et al 2019), and about 33% of that of the
world-record, in GPU hours, required to train BERT-large on the same hardware.
It is also 7.9× faster on a CPU, as well as being better performing than other compressed variants of the
architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%, absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.
The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable
performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at
random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong
assumptions and guarantees a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting
assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor (in all variables but
depth) number of neurons per weight of the target subnetwork.
We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the
architecture and task. We functionally approximate the error of the pruned networks, showing it is predictable in terms of an invariant tying width, depth, and
pruning level, such that networks of vastly different pruned densities are interchangeable. We demonstrate the accuracy of this approximation over orders of
magnitude in depth, width, dataset size, and density. We show that the functional form holds (generalizes) for large scale data (eg. ImageNet) and architectures (eg. ResNets). As neural networks become ever larger and costlier to
train, our findings suggest a framework for reasoning conceptually and analytically about a standard method for unstructured pruning.
One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised
fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to common approaches to semi-supervised learning for computer vision, we show
that it is surprisingly effective for semi-supervised learning on ImageNet.
A key ingredient of our approach is the use of big (deep and wide) networks during pretraining and fine-tuning. We find that the fewer the labels, the more
this approach (task-agnostic use of unlabeled data) benefits from a bigger network. After fine-tuning, the big network can be further improved and distilled into a
much smaller one with little loss in classification accuracy by using the unlabeled examples for a second time, but in a task-specific way. The proposed
semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model
using SimCLRv2, supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining
and transferring the task-specific knowledge.
This procedure achieves 73.9% ImageNet top-1 accuracy with just 1% of the labels (≤13 labeled images per class) using
ResNet-50, a 10× improvement in label efficiency over the previous state-of-the-art. With 10% of labels, ResNet-50 trained with our method achieves 77.5% top-1 accuracy, outperforming standard supervised training with all of the labels.
Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy both during training and at
test time. Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable
subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training,
or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design. We first mathematically
formulate and experimentally verify a conservation law that explains why existing gradient-based pruning algorithms at initialization suffer from layer-collapse,
the premature pruning of an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse can be entirely avoided, motivating a
novel pruning algorithm Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the
network at initialization subject to a sparsity constraint. Notably, this algorithm makes no reference to the training data and consistently competes with or
outperforms existing state-of-the-art pruning algorithms at initialization over a range of models (VGG and ResNet), datasets (CIFAR-10/100 and TinyImageNet), and sparsity constraints (up to 99.99 percent). Thus our data-agnostic pruning algorithm challenges the existing paradigm
that, at initialization, data must be used to quantify which synapses are important.
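The scoring pass is strikingly simple; here is my sketch of a single SynFlow iteration (the full algorithm reapplies this iteratively while pruning), which never touches data or labels.

```python
# One SynFlow scoring pass: run an all-ones input through the network
# with |weights| and score each parameter by |theta * dR/dtheta|.
import torch

@torch.no_grad()
def _linearize(model):
    """Replace every weight with its absolute value; remember the signs."""
    signs = {n: p.sign() for n, p in model.state_dict().items()}
    for p in model.state_dict().values():
        p.abs_()
    return signs

def synflow_scores(model, input_shape):
    signs = _linearize(model)
    x = torch.ones(1, *input_shape)  # the all-ones "input": no dataset needed
    model(x).sum().backward()        # R = total synaptic flow
    scores = {n: (p.grad * p).detach().abs()
              for n, p in model.named_parameters()}
    with torch.no_grad():            # restore the original weights
        for n, p in model.state_dict().items():
            p.mul_(signs[n])
    model.zero_grad()
    return scores
```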
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime
that has become standard for state-of-the-art natural language processing applications.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give
mathematical foundations to the method and compare it to existing zeroth-order and first-order pruning methods. Experiments show that when pruning large pretrained
language models, movement pruning shows substantial improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy
loss with down to only 3% of the model parameters.
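The intuition is that weights moving away from zero during fine-tuning are the important ones. A hedged sketch follows: the paper learns the scores with a straight-through estimator, while the running accumulation below is my simplification of the first-order importance it derives.

```python
# Simplified movement-pruning importance: accumulate S = -sum_t (dL/dW * W),
# so weights being pushed away from zero earn high scores.
import torch

class MovementScore:
    def __init__(self, weight):
        self.score = torch.zeros_like(weight)

    def accumulate(self, weight):
        # call after loss.backward() each step
        self.score -= weight.grad * weight.detach()

    def keep_mask(self, keep=0.03):
        """Keep only the top-scoring fraction of weights (here 3%)."""
        k = max(1, int(self.score.numel() * keep))
        thresh = self.score.flatten().topk(k).values[-1]
        return (self.score >= thresh).float()
```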
We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization. Bayesian Bits employs a
novel decomposition of the quantization operation, which sequentially considers doubling the bit width. At each new bit width, the residual error between the full
precision value and the previously rounded value is quantized. We then decide whether or not to add this quantized residual error for a higher effective bit width
and lower quantization noise. By starting with a power-of-two bit width, this decomposition will always produce hardware-friendly configurations, and through an
additional 0-bit option, serves as a unified view of pruning and quantization. Bayesian Bits then introduces learnable stochastic gates, which collectively control
the bit width of the given tensor. As a result, we can obtain low bit solutions by performing approximate inference over the gates, with prior distributions that encourage most of them to be switched off. We experimentally validate our proposed method on several benchmark
datasets and show that we can learn pruned, mixed-precision networks that provide a better trade-off between accuracy and efficiency than their static bit-width equivalents.
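The decomposition can be sketched directly; in my illustration below the gates are plain 0/1 flags rather than the learned stochastic gates of the paper, and activations are assumed unsigned.

```python
# Bayesian Bits residual decomposition sketch: each doubling of the bit
# width quantizes the residual error left by the previous level.
import torch

def q(x, scale):
    return scale * torch.round(x / scale)

def bayesian_bits(x, max_val, gates={4: 1, 8: 1, 16: 0}):
    x = x.clamp(0, max_val)
    scale = max_val / (2 ** 2 - 1)       # start from a 2-bit grid
    out = q(x, scale)
    for bits, gate in gates.items():     # doubling bit widths, in order
        if not gate:                     # gate off: stop refining
            break                        # (a 0-bit option prunes entirely)
        scale = scale / (2 ** (bits // 2) + 1)  # s_2b = s_b / (2^b + 1)
        out = out + q(x - out, scale)    # quantize the residual error
    return out
```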
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization
Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this
approach to work beyond int8 fixed-point quantization with extreme compression methods where the approximations introduced by STE are severe, such as Product Quantization. Our proposal is to only quantize a different random subset of weights during each
forward, allowing for unbiased gradients to flow through the other weights. Controlling the amount of noise and its form allows for extreme compression rates while
maintaining the performance of the original model. As a result we establish new state-of-the-art compromises between accuracy and model size both in natural
language processing and image classification. For example, applying our method to state-of-the-art Transformer and
Convnet architectures, we can achieve 82.5% accuracy on MNLI by compressing RoBERTa to 14MB and 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3MB.
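A hedged sketch of the idea, with simple scalar int8 quantization standing in for the Product Quantization the paper targets, and per-weight rather than per-block noise, for brevity:

```python
# Quant-Noise sketch: each forward pass quantizes only a random subset of
# weights (straight-through), so unbiased gradients still flow through
# the remaining full-precision weights.
import torch

def quant_noise_linear(x, weight, p=0.5, training=True):
    scale = weight.abs().max() / 127
    q = scale * torch.round(weight / scale)        # int8-style quantization
    if not training:
        return x @ q.T                             # inference: all quantized
    mask = (torch.rand_like(weight) < p).float()   # random subset only
    w = weight + mask * (q - weight).detach()      # straight-through on it
    return x @ w.T
```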
Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting
their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the
entire network is required for a downstream task. Motivated by the recent work on pruning and distilling pre-trained models, we explore strategies to drop layers
in pre-trained models, and observe the effect of pruning on downstream GLUE tasks. We were able to
prune BERT, RoBERTa, and XLNet models up to 40%, while maintaining up to 98% of
their original performance. Additionally we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and
performance. Our experiments yield interesting observations such as: (i) the lower layers are most critical for maintaining downstream task performance; (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers; and (iii) models trained using a different objective function exhibit different learning patterns with respect to layer dropping.
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory
constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine
translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models
converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models.
Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference
efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small
models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we
propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then
be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for
building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a
BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the
inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our
smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a
comparative on-device study.
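A sketch of what such a triple loss can look like in PyTorch; the loss weights and temperature below are hypothetical placeholders, not the paper's exact values:

```python
# Sketch of a DistilBERT-style triple loss (soft-target distillation +
# masked-LM + cosine alignment). Weights/temperature are hypothetical.
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden,
                T=2.0, w_kd=5.0, w_mlm=2.0, w_cos=1.0):
    # Distillation: KL between temperature-softened output distributions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    # Hard-target masked-LM loss (unmasked positions labeled -100 are ignored).
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # Cosine loss aligning hidden states; both flattened to (N, dim).
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    return w_kd * kd + w_mlm * mlm + w_cos * cos
```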
Deep reinforcement learning has achieved
important milestones, however, the computational demands of reinforcement learning training and inference remain
substantial. Quantization is an effective method to reduce the computational overheads of neural networks, though in the context of reinforcement learning, it is unknown whether quantization’s computational benefits outweigh the accuracy costs introduced by the
corresponding quantization error.
To quantify this tradeoff we perform a broad study applying quantization to reinforcement learning. We apply standard
quantization techniques such as post-training quantization (PTQ) and quantization-aware training
(QAT) to a comprehensive set of reinforcement learning tasks (Atari, Gym), algorithms (A2C, DDPG,
D4PG, PPO), and
models (MLPs, CNNs) and show that policies may be quantized to 8-bits without
degrading reward, enabling substantial inference speedups on resource-constrained edge devices.
Motivated by the effectiveness of standard quantization techniques on reinforcement learning policies, we introduce a novel quantization algorithm,
ActorQ, for quantized actor-learner distributed reinforcement learning training. By leveraging full precision optimization on the learner and quantized
execution on the actors, ActorQ enables 8-bit inference while maintaining convergence. We develop a system for quantized reinforcement learning training around ActorQ and demonstrate end-to-end speedups of
>1.5×–2.5× over full-precision training on a range of tasks (DeepMind Control Suite).
Finally, we break down the various runtime costs of distributed reinforcement learning training (such as communication time, inference time, model load time,
etc) and evaluate the effects of quantization on these system attributes.
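For intuition, post-training quantization of a small policy network can be as simple as the following PyTorch sketch (a generic stand-in, not the paper's pipeline):

```python
# Post-training quantization of a small MLP policy to int8 using PyTorch's
# dynamic quantization; a generic stand-in for the paper's PTQ setup.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                       nn.Linear(64, 64), nn.ReLU(),
                       nn.Linear(64, 2))  # eg. a small Gym control policy

qpolicy = torch.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8)  # weights stored in 8 bits

obs = torch.randn(1, 8)
action_logits = qpolicy(obs)  # inference runs with int8 weight kernels
```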
We present fully autonomous source seeking onboard a highly constrained nano quadcopter, by contributing application-specific system and observation feature design to enable inference of a
deep-RL policy onboard a nano quadcopter. Our deep-RL algorithm finds a high-performance solution to a challenging problem,
even in presence of high noise levels and generalizes across real and simulation environments with different obstacle configurations. We verify our approach with
simulation and in-field testing on a Bitcraze CrazyFlie using only the cheap and ubiquitous Cortex-M4 microcontroller unit. The results show that by end-to-end application-specific system design, our contribution consumes almost three times less additional power, as compared to
a competing learning-based navigation approach onboard a nano quadcopter. Thanks to our observation space, which we carefully
design within the resource constraints, our solution achieves a 94% success rate in cluttered and randomized test environments, as compared to the previously
achieved 80%. We also compare our strategy to a simple finite state machine (FSM), geared towards efficient
exploration, and demonstrate that our policy is more robust and resilient at obstacle avoidance as well as up to 70% more efficient in source seeking. In sum,
we contribute a cheap and lightweight end-to-end tiny robot learning (tinyRL) solution, running onboard a nano quadcopter, that proves to be robust and efficient in a challenging task using limited sensory input.
Language model pre-training, such as BERT, has significantly improved the performances of many natural language
processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted
devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially
designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the plentiful
knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which
performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures
that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.
TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its
teacher BERT-Base on the GLUE benchmark, while being
7.5× smaller and 9.4× faster on inference. TinyBERT with 4 layers is also significantly better than 4-layer
state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time.
Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.
Quantization-based techniques are the current state-of-the-art for scaling maximum inner product search to massive databases. Traditional approaches to
quantization aim to minimize the reconstruction error of the database points.
Based on the observation that for a given query, the database points that have the largest inner products are more relevant, we develop a family of anisotropic quantization loss functions. Under natural statistical assumptions, we show
that quantization with these loss functions leads to a new variant of vector quantization that penalizes the parallel component of a datapoint’s
residual more heavily than its orthogonal component.
The proposed approach achieves state-of-the-art results on the public benchmarks available at ann-benchmarks.com.
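The key idea can be written in a few lines; the relative weight `eta` on the parallel component is a hypothetical stand-in for the paper's analytically derived weighting:

```python
# Sketch of an anisotropic quantization loss: the residual r = x - x_quantized
# is split into components parallel and orthogonal to the datapoint x, and the
# parallel component is penalized more heavily (weight `eta` is hypothetical).
import numpy as np

def anisotropic_loss(x, x_quantized, eta=4.0):
    r = x - x_quantized
    x_unit = x / np.linalg.norm(x)
    r_parallel = np.dot(r, x_unit) * x_unit   # component along x
    r_orthogonal = r - r_parallel
    return eta * np.sum(r_parallel**2) + np.sum(r_orthogonal**2)
```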
In this paper, we address the problem of reducing the memory footprint of convolutional network architectures. We introduce a vector quantization method that
aims at preserving the quality of the reconstruction of the network outputs rather than its weights. The principle of our approach is to minimize the output
reconstruction error for in-domain inputs. Our method only requires a set of unlabeled data at quantization time and allows for efficient inference on
CPU by using byte-aligned codebooks to store the compressed weights. We validate our approach by quantizing a high
performing ResNet-50 model to a memory size of 5MB (20× compression factor) while preserving a top-1 accuracy of 76.1% on ImageNet object classification and by compressing a Mask R-CNN with a 26× compression factor.
We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training
while achieving dense performance levels. We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to
identify layers and weights which reduce the error efficiently. Sparse momentum redistributes pruned weights across layers according to the mean momentum magnitude
of each layer. Within a layer, sparse momentum grows weights according to the momentum magnitude of zero-valued weights. We demonstrate state-of-the-art sparse
performance on MNIST, CIFAR-10, and ImageNet, decreasing the mean error by a relative 8%, 15%, and 6% compared to other sparse algorithms. Furthermore, we show that sparse
momentum reliably reproduces dense performance levels while providing up to 5.61× faster training. In our analysis, ablations show that the benefits of momentum
redistribution and growth increase with the depth and size of the network. Additionally, we find that sparse momentum is insensitive to the choice of its
hyperparameters suggesting that sparse momentum is robust and easy to use.
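A simplified single-layer sketch of the prune-and-grow step; the full algorithm additionally redistributes pruned weights across layers by mean momentum magnitude:

```python
# Simplified single-layer sketch of magnitude pruning plus momentum-based
# regrowth in the spirit of sparse momentum. `momentum` is the optimizer's
# momentum buffer for this (2D) weight; the cross-layer redistribution step
# of the full algorithm is omitted.
import torch

def prune_and_grow(weight, momentum, mask, prune_rate=0.2):
    # Prune: zero out the smallest-magnitude fraction of active weights.
    active = mask.nonzero(as_tuple=True)
    k = int(prune_rate * len(active[0]))
    if k == 0:
        return mask
    _, idx = torch.topk(weight[active].abs(), k, largest=False)
    mask[active[0][idx], active[1][idx]] = 0
    # Grow: re-enable zero-valued weights with the largest momentum magnitude.
    zero = (mask == 0).nonzero(as_tuple=True)
    _, idx = torch.topk(momentum[zero].abs(), k, largest=True)
    mask[zero[0][idx], zero[1][idx]] = 1
    weight.data *= mask  # newly grown weights start at 0 and learn via momentum
    return mask
```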
Not all neural network architectures are created equal, some perform much better than others for certain tasks. But how important are the weight parameters of a
neural network compared to its architecture? In this work, we question to what extent neural network architectures alone, without learning any weight parameters,
can encode solutions for a given task. We propose a search method for neural network architectures that can already perform a task without any explicit weight
training. To evaluate these networks, we populate the connections with a single shared weight parameter sampled from a uniform random distribution, and measure the
expected performance. We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On a supervised learning domain, we find network architectures that achieve much
higher than chance accuracy on MNIST using random weights. Interactive version of this paper at https://weightagnostic.github.io/
The lottery ticket hypothesis proposes that over-parameterization of deep neural networks (DNNs) aids training by
increasing the probability of a “lucky” sub-network initialization being present rather than by helping the optimization process (Frankle & Carbin, 2019).
Intriguingly, this phenomenon suggests that initialization strategies for DNNs can be improved substantially, but the
lottery ticket hypothesis has only previously been tested in the context of supervised learning for natural image tasks. Here, we evaluate whether “winning
ticket” initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, we examined both recurrent LSTM models and
large-scale Transformer models (Vaswani et al 2017). For RL, we analyzed a number of discrete-action space tasks, including both classic control and
pixel control. Consistent with work in supervised image classification, we confirm that winning ticket initializations generally outperform parameter-matched
random initializations, even at extreme pruning rates for both NLP and RL. Notably, we are able to find winning
ticket initializations for Transformers which enable models one-third the size to achieve nearly equivalent performance.
Together, these results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a
broader phenomenon in DNNs.
Neural Architecture Search (NAS) has been widely studied for designing discriminative deep learning models such
as image classification, object detection, and semantic segmentation. As a large number of priors have been obtained through the manual design of architectures
in these fields, NAS is usually considered a supplementary approach. In this paper, we have significantly expanded
the application areas of NAS by performing an empirical study of NAS to search
generative models; specifically, auto-encoder-based universal style transfer, which has received little if any systematic exploration from the architecture-search perspective. In our work, we first
designed a search space where common operators for image style transfer, such as VGG-based encoders, whitening and coloring transforms (WCT), convolution kernels,
instance normalization operators, and skip connections, were searched in a combinatorial approach. With a simple yet effective parallel evolutionary
NAS algorithm with multiple objectives, we derived the first group of end-to-end
deep networks for universal photorealistic style transfer. Compared to random search, a NAS method that has recently gained popularity, we demonstrated that a
carefully designed search strategy leads to much better architecture design. Finally, compared to existing
universal style transfer networks for photorealistic rendering such as PhotoWCT
that stacks multiple well-trained auto-encoders and WCT transforms in a non-end-to-end manner, the
architectures designed by StyleNAS produce better style-transferred images with better-preserved details, using a tiny number
of operators/parameters, and enjoying around a 500× inference-time speed-up.
The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use
performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to
inject machine learning into many of these everyday objects via tiny, cheap MCUs. However, these
resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional
neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe memory
limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that
do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper
challenges the idea that CNNs are not suitable for deployment on MCUs. We
demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture
search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we
find are more accurate and up to 4.35× smaller than previous approaches, while meeting the strict MCU
working memory constraint.
Convolutional Neural Networks (Convnets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are
available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better
performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet
highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.
To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which
achieve much better accuracy and efficiency than previous Convnets. In particular, our EfficientNet-B7 achieves
state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4× smaller and 6.1× faster on inference than the best
existing Convnet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%),
Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at
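The compound-scaling rule itself is compact; a sketch using the base coefficients reported in the paper:

```python
# Sketch of compound scaling: one coefficient phi scales depth, width, and
# resolution together, with base coefficients found by a small grid search
# under the constraint alpha * beta**2 * gamma**2 ~= 2 (so each +1 in phi
# roughly doubles FLOPs). The alpha/beta/gamma values are the paper's.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scaled_dims(phi, base_depth, base_width, base_resolution):
    return (round(base_depth * alpha ** phi),       # more layers
            round(base_width * beta ** phi),        # more channels
            round(base_resolution * gamma ** phi))  # larger input images
```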
Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine
translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the
roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a
method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority
of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out
of 48 encoder heads results in a drop of only 0.15 BLEU.
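A sketch of one such stochastic gate using the hard-concrete relaxation of the L0 penalty from Louizos et al 2017 (the constants are that paper's usual choices; the head-gating wiring is assumed):

```python
# Sketch of a stochastic gate with a hard-concrete relaxation of the L0
# penalty, one gate per attention head. beta/gamma/zeta follow the usual
# hard-concrete constants; the surrounding model wiring is assumed.
import torch

log_alpha = torch.zeros(8, requires_grad=True)  # one gate logit per head
beta, gamma, zeta = 2 / 3, -0.1, 1.1

def sample_gates():
    u = torch.rand_like(log_alpha)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)
    # Stretched and clamped to [0, 1]; gates are frequently exactly 0 or 1.
    return torch.clamp(s * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0():
    # Probability each gate is non-zero; summed, this is the L0 penalty term.
    return torch.sigmoid(log_alpha - beta * torch.log(torch.tensor(-gamma / zeta))).sum()

# Each head's output is multiplied by its gate; heads whose gates collapse
# to 0 during training can be removed entirely afterwards.
```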
Pruning is a well-established technique for removing unnecessary structure from neural networks after training to improve the performance of inference. Several
recent results have explored the possibility of pruning at initialization time to provide similar benefits during training. In particular, the “lottery ticket
hypothesis” conjectures that typical neural networks contain small subnetworks that can train to similar accuracy in a commensurate number of steps. The evidence
for this claim is that a procedure based on iterative magnitude pruning (IMP) reliably finds such subnetworks retroactively on small vision tasks. However, IMP fails on deeper networks, and proposed methods to prune before
training or train pruned networks encounter similar scaling limitations. In this paper, we argue that these efforts have struggled on deeper networks because they
have focused on pruning precisely at initialization. We modify IMP to search for subnetworks that could have been
obtained by pruning early in training (0.1% to 7% through) rather than at iteration 0. With this change, it finds small subnetworks of deeper networks (eg.
80% sparsity on Resnet-50) that can complete the training process to match the accuracy of the original network on more
challenging tasks (eg. ImageNet). In situations where IMP fails at iteration 0,
the accuracy benefits of delaying pruning accrue rapidly over the earliest iterations of training. To explain these behaviors, we study subnetwork
“stability,” finding that—as accuracy improves in this fashion—IMP subnetworks train to parameters closer to those of
the full network and do so with improved consistency in the face of gradient noise. These results offer new insights into the opportunity to prune
large-scale networks early in training and the behaviors underlying the lottery ticket hypothesis.
We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer
trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al 2017;
Louizos et al 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches
achieve comparable or better results. Additionally, we replicate the experiments performed by (Frankle & Carbin, 2018) and (Liu et al 2018) at scale
and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with
joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our
code, top-performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression.
We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In
experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore,
each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be
viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.
Generative Adversarial Networks (GANs) have been used in several machine learning tasks such as domain transfer,
super resolution, and synthetic data generation. State-of-the-art GANs often use tens of millions of
parameters, making them expensive to deploy for applications in low SWAP (size, weight, and power) hardware, such
as mobile devices, and for applications with real-time capabilities. To our knowledge, there has been no prior work on reducing the number of parameters used in GANs. Therefore, we propose a method to compress GANs using
knowledge distillation techniques, in which a smaller “student” GAN learns to mimic a larger “teacher” GAN. We show that the distillation methods used on MNIST,
CIFAR-10, and Celeb-A datasets can compress teacher GANs at
ratios of 1669:1, 58:1, and 87:1, respectively, while retaining the quality of the generated image. From our experiments, we observe a qualitative limit on
GAN compression. Moreover, we observe that, with a fixed parameter budget, compressed GANs outperform GANs trained using standard training methods. We conjecture
that this is partially owing to the optimization landscape of over-parameterized GANs which allows efficient
training using alternating gradient descent. Thus, training an over-parameterized GAN followed by our proposed
compression scheme provides a high quality generative model with a small number of parameters.
Reducing hardware overhead of neural networks for faster or lower power inference and training is an active area of research. Uniform quantization using integer
multiply-add has been thoroughly investigated, which requires learning many quantization parameters, fine-tuning training or other prerequisites. Little effort is
made to improve floating point relative to this baseline; it remains energy inefficient, and word size reduction yields drastic loss in needed dynamic range. We
improve floating point to be more energy efficient than equivalent bit width integer hardware on a 28 nm ASIC
process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson’s posit
format. With no network retraining, and drop-in replacement of all math and float32 parameters via round-to-nearest-even only, this open-sourced 8-bit log float is
within 0.9% top-1 and 0.2% top-5 accuracy of the original float32 ResNet-50 CNN model on ImageNet. Unlike int8 quantization, it is still a general purpose floating point arithmetic, interpretable out-of-the-box. Our
8/38-bit log float multiply-add is synthesized and power profiled at 28 nm at 0.96× the power and 1.12× the area of 8/32-bit integer multiply-add. In 16
bits, our log float multiply-add is 0.59× the power and 0.68× the area of IEEE 754 float16 fused multiply-add,
maintaining the same significand precision and dynamic range, proving useful for training ASICs as well.
Structured pruning is a popular method for compressing a neural network: given a large trained network, one alternates between removing channel connections and
fine-tuning, reducing the overall width of the network. However, the efficacy of structured pruning has largely evaded scrutiny. In this paper, we examine
ResNets and DenseNets obtained through structured pruning-and-tuning and make two interesting observations: (i) reduced
networks—smaller versions of the original network trained from scratch—consistently outperform pruned networks; (ii) if one takes the architecture of a pruned
network and then trains it from scratch, it is statistically significantly more competitive. Furthermore, these architectures are easy to approximate: we can prune
once and obtain a family of new, scalable network architectures that can simply be trained from scratch. Finally, we compare the inference speed of reduced and
pruned networks on hardware, and show that reduced networks are significantly faster. Code is available at
This paper proposes network recasting as a general method for network architecture transformation. The primary goal of this method is to accelerate the
inference process through the transformation, but there can be many other practical applications. The method is based on block-wise recasting; it recasts each
source block in a pre-trained teacher network to a target block in a student network. For the recasting, a target block is trained such that its output activation
approximates that of the source block. Such a block-by-block recasting in a sequential manner transforms the network architecture while preserving the accuracy.
This method can be used to transform an arbitrary teacher network type to an arbitrary student network type. It can even generate a mixed-architecture network that
consists of two or more types of block. Network recasting can generate a network with fewer parameters and/or activations, reducing the inference time
significantly. Naturally, it can be used for network compression by recasting a trained network into a smaller network of the same type. Our experiments show that
it outperforms previous compression approaches in terms of actual speedup on a GPU.
With ever-increasing computational demand for deep learning, it is critical to investigate the implications of the numeric representation and precision of
DNN model weights and activations on computational efficiency. In this work, we explore unconventional narrow-precision
floating-point representations as they relate to inference accuracy and efficiency to steer the improved design of future DNN platforms. We show that inference using these custom numeric representations on production-grade DNNs, including GoogLeNet and VGG, achieves an average speedup of 7.6× with less than 1%
degradation in inference accuracy relative to a state-of-the-art baseline platform representing the most sophisticated hardware using single-precision floating
point. To facilitate the use of such customized precision, we also present a novel technique that drastically reduces the time required to derive the optimal precision configuration.
In this work, we propose a new solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (van den Oord et al 2018), we
distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a regularized KL divergence between their highly-peaked output
distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition,
we introduce the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It
significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al 2018). We also
successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.
Deep reinforcement learning, applied to vision-based problems like Atari games, maps pixels directly to actions;
internally, the deep neural network bears the responsibility of both extracting useful information and making decisions based on it. By separating the image
processing from decision-making, one could better understand the complexity of each task, as well as potentially find smaller policy representations that are
easier for humans to understand and may generalize better. To this end, we propose a new method for learning policies and compact state representations separately
but simultaneously for policy approximation in reinforcement learning. State representations are generated by an encoder
based on two novel algorithms: Increasing Dictionary Vector Quantization makes the encoder capable of growing its dictionary size over time, to address new
observations as they appear in an open-ended online-learning context; Direct Residuals Sparse Coding encodes observations by disregarding reconstruction error
minimization, and aiming instead for highest information inclusion. The encoder autonomously selects observations online to train on, in order to maximize code
sparsity. As the dictionary size increases, the encoder produces increasingly larger inputs for the neural network: this is addressed by a variation of the
Exponential Natural Evolution Strategies algorithm which adapts its probability distribution dimensionality along the run. We test our system on a selection of
Atari games using tiny neural networks of only 6 to 18 neurons (depending on the game’s controls). These are still capable of achieving results comparable—and
occasionally superior—to state-of-the-art techniques which use two orders of magnitude more neurons.
In this paper, we propose a simple and general framework for training very tiny CNNs for object detection. Due to limited representation ability, it is
challenging to train very tiny networks for complicated tasks like detection. To the best of our knowledge, our method, called Quantization Mimic, is the first one
focusing on very tiny networks. We utilize two types of acceleration methods: mimic and quantization. Mimic improves the performance of a student network by
transferring knowledge from a teacher network. Quantization converts a full-precision network to a quantized one without large degradation of performance. If the
teacher network is quantized, the search scope of the student network will be smaller. Using this feature of the quantization, we propose Quantization Mimic. It
first quantizes the large network, then trains a quantized small network to mimic it. The quantization operation can help the student network better match the feature maps of the
teacher network. To evaluate our approach, we carry out experiments on various popular CNNs including VGG and ResNet, as well as different detection frameworks including Faster R-CNN and R-FCN. Experiments on Pascal VOC and WIDER FACE verify that our Quantization Mimic
algorithm can be applied in various settings and outperforms state-of-the-art model acceleration methods given limited computing resources.
Many recently trained neural networks employ large numbers of parameters to achieve good performance. One may intuitively use the number of parameters required
as a rough gauge of the difficulty of a problem. But how accurate are such notions? How many parameters are really needed? In this paper we attempt to answer this
question by training networks not in their native parameter space, but instead in a smaller, randomly oriented subspace. We slowly increase the dimension of this
subspace, note at which dimension solutions first appear, and define this to be the intrinsic dimension of the objective landscape. The approach is simple to
implement, computationally tractable, and produces several suggestive conclusions. Many problems have smaller intrinsic dimensions than one might suspect, and the
intrinsic dimension for a given dataset varies little across a family of models with vastly different sizes. This latter result has the profound implication that
once a parameter space is large enough to solve a problem, extra parameters serve directly to increase the dimensionality of the solution manifold. Intrinsic
dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning where we conclude, for example,
that solving the inverted pendulum problem is 100 times
easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as
classifying CIFAR-10. In addition to providing new cartography of the objective landscapes wandered by parameterized
models, the method is a simple technique for constructively obtaining an upper bound on the minimum description length of a solution. A byproduct of this
construction is a simple approach for compressing networks, in some cases by more than 100 times.
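The subspace-training construction is simple to sketch (shapes and scaling below are illustrative):

```python
# Sketch of training in a random subspace to estimate intrinsic dimension:
# all parameters are expressed as theta0 + P @ d, and only the
# low-dimensional vector d is trained (P is a fixed random projection).
import torch

D = 10_000   # native parameter count (illustrative)
d_dim = 100  # subspace dimension being tested

theta0 = torch.randn(D)                     # frozen random initialization
P = torch.randn(D, d_dim) / d_dim ** 0.5    # fixed random basis, roughly orthonormal
d = torch.zeros(d_dim, requires_grad=True)  # the only trainable parameters

def flat_params():
    # Slices of this vector are reshaped into the model's layer weights.
    return theta0 + P @ d

# One sweeps d_dim upward and records the smallest dimension at which the
# model first reaches (eg.) 90% of the full model's performance.
```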
One of the main barriers for deploying neural networks on embedded systems has been large memory and power consumption of existing neural networks. In this
work, we introduce SqueezeNext, a new family of neural network architectures whose design was guided by considering previous architectures such as SqueezeNet, as
well as by simulation results on a neural network accelerator. This new network is able to match AlexNet’s accuracy on the ImageNet benchmark with 112× fewer parameters, and one of its deeper variants is able to achieve VGG-19 accuracy with only 4.4 million parameters (31× smaller than VGG-19).
SqueezeNext also achieves better top-5 classification accuracy with 1.3× fewer parameters as compared to MobileNet, but avoids using depthwise-separable
convolutions that are inefficient on some mobile processor platforms. This wide range of accuracy gives the user the ability to make speed-accuracy tradeoffs,
depending on the available resources on the target hardware. Using hardware simulation results for power and inference speed on an embedded system has guided us to
design variations of the baseline model that are 2.59×/8.26× faster and 2.25×/7.5× more energy efficient as compared to SqueezeNet/AlexNet without any accuracy degradation.
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational
performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to
train from the start, which would similarly improve training performance.
We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these
results, we articulate the “lottery ticket hypothesis:” dense, randomly-initialized, feed-forward networks contain subnetworks (“winning tickets”) that—when
trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the
initialization lottery: their connections have initial weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these
fortuitous initializations. We consistently find winning tickets that are less than 10–20% of the size of several fully-connected and convolutional
feed-forward architectures for MNIST and CIFAR-10. Above this size, the winning
tickets that we find learn faster than the original network and reach higher test accuracy.
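The identification algorithm is simple enough to sketch; `train` and the mask-application details below are placeholders, not the paper's code:

```python
# Sketch of iterative magnitude pruning (IMP) for finding a winning ticket:
# train, prune the smallest surviving weights, rewind to the original
# initialization, and repeat. `train` applies the masks after each optimizer
# step; it and `model` are placeholders.
import copy
import torch

def find_winning_ticket(model, train, rounds=5, prune_frac=0.2):
    init_state = copy.deepcopy(model.state_dict())   # save the initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)                          # train with masks applied
        for n, p in model.named_parameters():
            alive = p[masks[n].bool()].abs()
            k = int(prune_frac * alive.numel())
            if k:
                threshold = alive.kthvalue(k).values
                masks[n] *= (p.abs() > threshold).float()
        model.load_state_dict(init_state)            # rewind weights to init
    return masks  # winning ticket = init_state restricted to these masks
```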
Deep neural networks have demonstrated state-of-the-art performance in a variety of real-world applications. In order to obtain performance gains, these
networks have grown larger and deeper, containing millions or even billions of parameters and over a thousand layers. The trade-off is that these large
architectures require an enormous amount of memory, storage, and computation, thus limiting their usability. Inspired by the recent tensor ring factorization, we
introduce Tensor Ring Networks (TR-Nets), which significantly compress both the fully connected layers and the convolutional layers of deep neural networks. Our
results show that our TR-Nets approach is able to compress LeNet-5 by 11× without losing accuracy, and can compress the state-of-the-art Wide ResNet by 243× with only 2.3% degradation in CIFAR10 image classification. Overall, this
compression scheme shows promise in scientific computing and deep learning, especially for emerging resource-constrained devices such as smartphones,
wearables, and IoT devices.
For fast and energy-efficient deployment of trained deep neural networks on resource-constrained embedded hardware, each learned weight parameter should ideally
be represented and stored using a single bit. Error-rates usually increase when this requirement is imposed. Here, we report large improvements in error rates on
multiple datasets, for deep convolutional neural networks deployed with 1-bit-per-weight. Using wide residual networks as
our main baseline, our approach simplifies existing methods that binarize weights by applying the sign function in training; we apply scaling factors for each
layer with constant unlearned values equal to the layer-specific standard deviations used for initialization. For CIFAR-10, CIFAR-100 and ImageNet, and models with
1-bit-per-weight requiring less than 10 MB of parameter memory, we achieve error rates of 3.9%, 18.5% and 26.0% / 8.5% (Top-1 / Top-5) respectively. We also
considered MNIST, SVHN and ImageNet32, achieving 1-bit-per-weight
test results of 0.27%, 1.9%, and 41.3% / 19.1% respectively. For CIFAR, our error rates halve previously
reported values, and are within about 1% of our error-rates for the same network with full-precision weights. For networks that overfit, we also show significant
improvements in error rate by not learning batch normalization scale and offset parameters. This applies to both full precision and 1-bit-per-weight networks.
Using a warm-restart learning-rate schedule, we found that training for 1-bit-per-weight is just as fast as full-precision networks, with better accuracy than
standard schedules, and achieved about 98%-99% of peak performance in just 62 training epochs for CIFAR-10/100. For
full training code and trained models in MATLAB, Keras, and PyTorch, see
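A sketch of the weight binarization with a straight-through estimator; the per-layer scale follows the He-initialization standard deviation as described:

```python
# Sketch of 1-bit-per-weight training: forward passes use sign(w) * sigma,
# where sigma is the layer's (constant, unlearned) He-initialization std-dev;
# gradients flow to the underlying full-precision weights via the
# straight-through estimator.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, sigma):
        return torch.sign(w) * sigma
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: pass gradient unchanged

w = torch.randn(64, 3, 3, 3) * 0.1          # full-precision conv weights
sigma = (2.0 / (3 * 3 * 3)) ** 0.5          # He std-dev for this fan-in
w_binary = BinarizeSTE.apply(w, sigma)      # used in place of w in the conv
```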
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating
high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a
set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes
it possible to generate 24kHz 16-bit audio 4× faster than real time on a GPU. Second, we apply a weight
pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of
parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of
weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter
sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step
without loss of quality and offers an orthogonal method for increasing sampling efficiency.
In this paper, we investigate lossy compression of deep neural networks (DNNs) by weight quantization and lossless
source coding for memory-efficient deployment.
Whereas the previous work addressed non-universal scalar quantization and entropy coding of DNN weights, we for the first time introduce universal DNN compression by universal vector quantization and universal source coding. In particular, we examine
universal randomized lattice quantization of DNNs, which randomizes DNN
weights by uniform random dithering before lattice quantization and can perform near-optimally on any source without relying on knowledge of its probability
distribution. Moreover, we present a method of fine-tuning vector-quantized DNNs to recover the performance loss after quantization.
Our experimental results show that the proposed universal DNN compression scheme compresses the 32-layer
ResNet (trained on CIFAR-10) and AlexNet (trained on ImageNet) with compression ratios of 47.1 and 42.5, respectively.
Recently there has been a lot of work on pruning filters from deep convolutional neural networks (CNNs) with the
intention of reducing computations. The key idea is to rank the filters based on a certain criterion (say, 𝓁1-norm, average percentage of zeros,
etc) and retain only the top ranked filters. Once the low scoring filters are pruned away the remainder of the network is fine tuned and is shown to give
performance comparable to the original unpruned network.
In this work, we report experiments which suggest that the comparable performance of the pruned network is not due to the specific criterion chosen but due to
the inherent plasticity of deep neural networks which allows them to recover from the loss of pruned filters once the rest of the filters are fine-tuned.
Specifically, we show counter-intuitive results wherein by randomly pruning 25–50% filters from deep CNNs we are able
to obtain the same performance as obtained by using state of the art pruning methods. We empirically validate our claims by doing an exhaustive evaluation
with VGG-16 and ResNet-50.
Further, we also evaluate a real world scenario where a CNN trained on all 1000 ImageNet classes needs to be tested on only a small set of classes at test time (say, only animals). We create a new benchmark dataset
from ImageNet to evaluate such class specific pruning and show that even here a random pruning strategy gives close to state
of the art performance.
Lastly, unlike existing approaches which mainly focus on the task of image classification, in this work we also report results on object detection. We show that
using a simple random pruning strategy we can achieve substantial speed up in object detection (74% improvement in
FPS) while retaining the same accuracy as the original Faster R-CNN model.
Convnets can achieve good performance even when only a fraction of parameters are learned.
Training deep neural networks results in strong learned representations that show good generalization capabilities. In most cases, training involves iterative
modification of all weights inside the network via back-propagation. In this paper, we propose to take an extreme approach and fix almost all weights of a
deep convolutional neural network in their randomly initialized values, allowing only a small portion to be learned. As our experiments show, this often results in
performance which is on par with the performance of learning all weights. The implications of this intriguing property of deep neural networks are discussed, and we
suggest ways to harness it to create more robust representations.
[Keywords: Random Networks, Extreme Learning, Compact Representations]
Many state-of-the-art computer vision algorithms use large scale convolutional neural networks (CNNs) as basic
building blocks. These CNNs are known for their huge number of parameters, high redundancy in weights, and
tremendous computing resource consumption. This paper presents a learning algorithm to simplify and speed up these CNNs. Specifically, we introduce a “try-and-learn” algorithm to train pruning agents that remove unnecessary CNN filters in a data-driven way. With the help of a novel reward function, our agents remove a significant number
of filters in CNNs while maintaining performance at a desired level. Moreover, this method provides an easy control of
the tradeoff between network performance and its scale. Performance of our algorithm is validated with comprehensive pruning experiments on several popular
CNNs for visual recognition and semantic segmentation tasks.
Predicting human fixations from images has recently seen large improvements by leveraging deep representations which were pretrained for object recognition.
However, as we show in this paper, these networks are highly overparameterized for the task of fixation prediction. We first present a simple yet principled greedy
pruning method which we call Fisher pruning. Through a combination of knowledge distillation and Fisher pruning, we obtain much more runtime-efficient
architectures for saliency prediction, achieving a 10× speedup for the same AUC performance as a state-of-the-art
network on the CAT2000 dataset. Speeding up single-image gaze prediction is important for many real-world
applications, but it is also a crucial step in the development of video saliency models, where the amount of data to be processed is substantially larger.
Neural networks are commonly used as models for classification for a wide variety of tasks. Typically, a learned affine transformation is placed at the end of
such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of
possible classes, thus requiring increasingly more resources. In this work we argue that this classifier can be fixed, up to a global scale constant, with little
or no loss of accuracy for most tasks, allowing memory and computational benefits. Moreover, we show that by initializing the classifier with a Hadamard matrix we
can speed up inference as well. We discuss the implications for current understanding of neural network models.
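A sketch of such a fixed classifier (using SciPy's Hadamard construction; the single learned scale is the only classifier parameter):

```python
# Sketch of a fixed (non-learned) classifier: the final affine layer is
# replaced by rows of a Hadamard matrix, scaled by one learned constant.
import torch
import torch.nn as nn
from scipy.linalg import hadamard

num_classes, feat_dim = 10, 64
H = torch.tensor(hadamard(feat_dim), dtype=torch.float32)[:num_classes]

class FixedClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("weight", H)          # fixed, never trained
        self.scale = nn.Parameter(torch.ones(1))   # the only learned parameter

    def forward(self, features):                   # features: (batch, feat_dim)
        return self.scale * features @ self.weight.t()
```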
Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high
dimensional inputs, the training of RNNs becomes computationally expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in
Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely Block-Term tensor decomposition, which greatly
reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank
approximations, such as tensor-train RNN (TT-RNN), our
method, Block-Term RNN (BT-RNN), is not only more
concise (when using the same rank), but also able to attain a better approximation to the original RNNs with far
fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388× fewer parameters than the standard LSTM
to achieve an accuracy improvement of over 15.6% in the Action Recognition task on the UCF11 dataset.
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and
requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which
suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed
SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly
reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum
correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification,
speech recognition, and language modeling with multiple datasets including CIFAR10, ImageNet, Penn Treebank, and the LibriSpeech Corpus. In these scenarios, Deep Gradient Compression achieves a gradient compression ratio
from 270× to 600× without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for Deep Speech
from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed
training on mobile. Code is available at: https://github.com/synxlin/deep-gradient-compression.
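The core of the method, top-k sparsification with local gradient accumulation, sketched without the momentum-correction, clipping, and warm-up refinements:

```python
# Sketch of top-k gradient sparsification with local accumulation, the core
# of Deep Gradient Compression (momentum correction, local gradient clipping,
# and warm-up training are omitted for brevity).
import torch

def compress_gradient(grad, residual, keep_ratio=0.001):
    acc = residual + grad                     # accumulate what was not sent
    k = max(1, int(keep_ratio * acc.numel()))
    values, _ = torch.topk(acc.abs().flatten(), k)
    threshold = values.min()
    mask = acc.abs() >= threshold
    sparse_grad = acc * mask                  # ~0.1% of entries go over the wire
    residual = acc * ~mask                    # the rest stays local for next step
    return sparse_grad, residual
```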
In this work we present a method to improve the pruning step of the current state-of-the-art methodology to compress neural networks. The novelty of the
proposed pruning technique is its differentiability, which allows pruning to be performed during the backpropagation phase of network training. This enables end-to-end
learning and strongly reduces training time. The technique is based on a family of differentiable pruning functions and
a new regularizer specifically designed to enforce pruning. The experimental results show that the joint optimization of both the thresholds and the network
weights makes it possible to reach a higher compression rate, reducing the number of weights of the pruned network by a further 14% to 33% compared to the current
state-of-the-art. Furthermore, we believe that this is the first study where the generalization capabilities in transfer learning tasks of the features extracted
by a pruned network are analyzed. To achieve this goal, we show that the representations learned using the proposed pruning methodology maintain the same
effectiveness and generality of those learned by the corresponding non-compressed network on a set of different recognition tasks.
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many
different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to
today’s massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new
method for training a parallel feed-forward network from a trained WaveNet with no statistically-significant difference in quality. The resulting system is capable
of generating high-fidelity speech samples more than 20× faster than real-time, and is deployed online by Google Assistant, including serving multiple
English and Japanese voices.
Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both
of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free
“shift” operation as an alternative to spatial convolutions. We fuse shifts and point-wise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the tradeoff between accuracy and efficiency. To
demonstrate the operation’s efficacy, we replace ResNet’s 3×3 convolutions with shift-based modules for improved CIFAR10
and CIFAR100 accuracy using 60% fewer parameters; we additionally demonstrate the operation’s resilience to parameter
reduction on ImageNet, outperforming ResNet family members. We finally show the shift
operation’s applicability across domains, achieving strong performance with fewer parameters on classification, face verification and style transfer.
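A sketch of the shift operation itself (note that `torch.roll` wraps around, whereas the paper's shift zero-fills the vacated border; channel-group assignment here is illustrative):

```python
# Sketch of the parameter-free shift operation: channels are split into
# groups and each group is spatially shifted by a fixed offset, with channel
# mixing left to the surrounding 1×1 (point-wise) convolutions.
import torch

def shift(x):  # x: (batch, channels, height, width), channels >= 5
    out = torch.zeros_like(x)
    groups = torch.chunk(torch.arange(x.size(1)), 5)
    offsets = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # right/left/down/up/identity
    for idx, (dh, dw) in zip(groups, offsets):
        # torch.roll wraps around; the paper's shift zero-fills instead.
        out[:, idx] = torch.roll(x[:, idx], shifts=(dh, dw), dims=(2, 3))
    return out
```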
Fine-grained image labels are desirable for many computer vision applications, such as visual search or mobile AI assistant. These applications rely on image
classification models that can produce hundreds of thousands (e.g. 100K) of diversified fine-grained image labels on input images. However, training a network
at this vocabulary scale is challenging, and suffers from intolerable large model size and slow training speed, which leads to unsatisfying classification
performance. A straightforward solution would be training separate expert networks (specialists), with each specialist focusing on learning one specific vertical
(e.g. cars, birds…). However, deploying dozens of expert networks in a practical system would significantly increase system complexity and inference latency,
and consume large amounts of computational resources. To address these challenges, we propose a Knowledge Concentration method, which effectively transfers the
knowledge from dozens of specialists (multiple teacher networks) into one single model (one student network) to classify 100K object categories. There are three
salient aspects in our method: (1) a multi-teacher single-student knowledge distillation framework; (2) a self-paced learning mechanism to allow the student to
learn from different teachers at various paces; (3) structurally connected layers to expand the student network capacity with limited extra parameters. We validate
our method on OpenImage and a newly collected dataset, Entity-Foto-Tree (EFT), with 100K categories, and show that
the proposed model performs significantly better than the baseline generalist model.
We propose a simple yet effective technique to simplify the training and the resulting model of neural networks. In back
propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that
only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading
to a linear reduction in the computational cost. Based on the sparsified gradients, we further simplify the model by eliminating the rows or columns that are
seldom updated, which will reduce the computational cost both in the training and decoding, and potentially accelerate decoding in real-world applications.
Surprisingly, experimental results demonstrate that most of the time we only need to update fewer than 5% of the weights at each back propagation pass. More
interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The model simplification results
show that we could adaptively simplify the model which could often be reduced by around 9×, without any loss on accuracy or even with improved accuracy. The codes,
including the extension, are available at https://github.com/lancopku/meSimp.
In recent years, deep neural networks (DNNs) achieved unprecedented performance in many low-level vision tasks.
However, state-of-the-art results are typically achieved by very deep networks, which can reach tens of layers with tens of millions of parameters. To make
DNNs implementable on platforms with limited resources, it is necessary to weaken the tradeoff between performance and
efficiency. In this paper, we propose a new activation unit, which is particularly suitable for image restoration problems. In contrast to the widespread per-pixel
activation units, like ReLUs and sigmoids, our unit implements a learnable nonlinear function with spatial connections. This enables the net to capture much more
complex features, thus requiring a significantly smaller number of layers in order to reach the same performance. We illustrate the effectiveness of our units
through experiments with state-of-the-art nets for denoising, de-raining, and super resolution, which are already considered to be very small. With our approach,
we are able to further reduce these models by nearly 50% without incurring any degradation in performance.
Deep neural networks (DNNs) have begun to have a pervasive impact on various applications of machine learning.
However, the problem of finding an optimal DNN architecture for large applications is challenging. Common
approaches go for deeper and larger DNN architectures but may incur substantial redundancy. To address these problems,
we introduce a network growth algorithm that complements network pruning to learn both weights and compact DNN
architectures during training. We propose a DNN synthesis tool (NeST) that combines both methods to
automate the generation of compact and accurate DNNs. NeST starts with a randomly initialized sparse network called the
seed architecture. It iteratively tunes the architecture with gradient-based growth and magnitude-based pruning of neurons and connections. Our experimental
results show that NeST yields accurate, yet very compact DNNs, with a wide range of seed architecture selection.
For the LeNet-300-100 (LeNet-5) architecture, we reduce network parameters by 70.2× (74.3×) and floating-point operations (FLOPs) by 79.4× (43.7×). For the AlexNet and VGG-16 architectures, we reduce network parameters (FLOPs) by 15.7× (4.6×) and 30.2× (8.6×), respectively. NeST’s grow-and-prune paradigm delivers significant additional
parameter and FLOPs reduction relative to pruning-only methods.
Natural language processing (NLP) models often require a massive number of parameters for word embeddings,
resulting in a large storage or memory footprint. Deploying neural NLP models to mobile devices requires
compressing the word embeddings without any significant sacrifices in performance. For this purpose, we propose to construct the embeddings with few basis vectors.
For each word, the composition of basis vectors is determined by a hash code. To maximize the compression rate, we adopt the multi-codebook quantization approach
instead of binary coding scheme. Each code is composed of multiple discrete numbers, such as (3, 2, 1, 8), where the value of each component is limited to a fixed
range. We propose to directly learn the discrete codes in an end-to-end neural network by applying the Gumbel-softmax trick.
Experiments show the compression rate reaches 98% in a sentiment analysis task and 94%–99% in machine translation tasks without performance loss. In both
tasks, the proposed method can improve the model performance by slightly lowering the compression rate. Compared to other approaches such as character-level
segmentation, the proposed method is language-independent and does not require modifications to the network architecture.
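A sketch of the code-learning step with PyTorch's built-in Gumbel-softmax; sizes below are illustrative, not the paper's settings:

```python
# Sketch of learning discrete multi-codebook embedding codes end-to-end with
# the Gumbel-softmax trick: each word's embedding is the sum of one basis
# vector per codebook, with the selection learned via soft/hard sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, n_codebooks, codebook_size, dim = 10_000, 8, 16, 256
logits = nn.Parameter(torch.randn(vocab, n_codebooks, codebook_size))
basis = nn.Parameter(torch.randn(n_codebooks, codebook_size, dim))

def embed(word_ids, tau=1.0):
    # One near-one-hot selection per codebook; hard=True gives discrete codes
    # in the forward pass with a straight-through gradient.
    sel = F.gumbel_softmax(logits[word_ids], tau=tau, hard=True, dim=-1)
    return torch.einsum("bnc,ncd->bd", sel, basis)  # sum over codebooks

ids = torch.randint(0, vocab, (4,))
vectors = embed(ids)  # (4, 256); after training, only the argmax codes
                      # (8 small integers per word) and the shared basis
                      # vectors need to be stored.
```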
Recent breakthroughs in computer vision make use of large deep neural networks, utilizing the substantial speedup offered by GPUs. For applications running on limited hardware, however, high precision real-time processing can still be a challenge. One
approach to solving this problem is training networks with binary or ternary weights, thus removing the need to calculate multiplications and significantly
reducing memory size. In this work, we introduce LR-nets (Local reparameterization networks), a new method for training neural networks with discrete weights using
stochastic parameters. We show how a simple modification to the local reparameterization trick, previously used to train Gaussian distributed weights, enables the
training of discrete weights. Using the proposed training we test both binary and ternary models on MNIST,
CIFAR-10 and ImageNet benchmarks and reach state-of-the-art results on most of them.
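The local-reparameterization trick the paper modifies can be sketched in a few lines: treat each ternary weight as a distribution over {−1, 0, +1} and, rather than sampling the weights themselves, sample the pre-activation, which is approximately Gaussian by the central limit theorem. A minimal PyTorch rendering (shapes and the variance floor are my choices):

```python
import torch
import torch.nn.functional as F

def ternary_lr_linear(x, logits):
    """x: (B, in), logits: (in, out, 3) -> stochastic pre-activation (B, out)."""
    probs = F.softmax(logits, dim=-1)                # p(w=-1), p(w=0), p(w=+1)
    values = torch.tensor([-1.0, 0.0, 1.0])
    w_mean = (probs * values).sum(-1)                # E[w],   (in, out)
    w_var = (probs * values**2).sum(-1) - w_mean**2  # Var[w], (in, out)
    mu = x @ w_mean                                  # E[pre-activation]
    var = (x**2) @ w_var                             # Var[pre-activation]
    eps = torch.randn_like(mu)
    return mu + var.clamp_min(1e-8).sqrt() * eps     # reparameterized sample

x = torch.randn(4, 128)
logits = torch.randn(128, 64, 3, requires_grad=True)
out = ternary_lr_linear(x, logits)                   # gradients flow to `logits`
out.sum().backward()
```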
Model pruning seeks to induce sparsity in a deep neural network’s various connection matrices, thereby reducing the number of nonzero-valued parameters in the
model. Recent reports (Han et al 2015; Narang et al 2017) prune deep networks at the cost of only a marginal loss in accuracy and
achieve a sizable reduction in model size. This hints at the possibility that the baseline models in these experiments are perhaps severely over-parameterized at
the outset and a viable alternative for model compression might be to simply reduce the number of hidden units while maintaining the model’s dense connection
structure, exposing a similar trade-off in model size and accuracy. We investigate these two distinct paths for model compression within the context of
energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a
variety of models/datasets with minimal tuning and can be seamlessly incorporated within the training process. We compare the accuracy of large, but pruned
models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint. Across a broad range of neural network
architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find large-sparse models to consistently outperform small-dense models and achieve up to 10× reduction
in number of non-zero parameters with minimal loss in accuracy.
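The gradual schedule is simple enough to state in code: sparsity ramps from an initial value s_i to a final value s_f along a cubic curve, and each pruning step masks out the smallest-magnitude weights. A sketch with illustrative hyperparameters:

```python
import numpy as np

def sparsity_at(t, t0, n, dt, s_i=0.0, s_f=0.9):
    """Target sparsity at step t, ramping over n pruning steps of width dt."""
    frac = np.clip((t - t0) / (n * dt), 0.0, 1.0)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

def apply_pruning(W, target_sparsity):
    """Mask the smallest-magnitude weights to hit the target sparsity."""
    threshold = np.quantile(np.abs(W), target_sparsity)
    return W * (np.abs(W) > threshold)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
for t in range(0, 10000, 1000):        # prune every dt=1000 training steps
    W = apply_pruning(W, sparsity_at(t, t0=0, n=10, dt=1000))
print(f"final sparsity: {(W == 0).mean():.2f}")
```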
While bigger and deeper neural network architectures continue to advance the state-of-the-art for many computer vision tasks, real-world adoption of these
networks is impeded by hardware and speed constraints. Conventional model compression methods attempt to address this problem by modifying the architecture
manually or using pre-defined heuristics. Since the space of all reduced architectures is very large, modifying the architecture of a deep neural network in this
way is a difficult task. In this paper, we tackle this issue by introducing a principled method for learning reduced network architectures in a data-driven way
using reinforcement learning. Our approach takes a larger ‘teacher’ network as input and outputs a compressed ‘student’
network derived from the ‘teacher’ network. In the first stage of our method, a recurrent policy network aggressively removes layers from the large ‘teacher’
model. In the second stage, another recurrent policy network carefully reduces the size of each remaining layer. The resulting network is then evaluated to obtain
a reward—a score based on the accuracy and compression of the network. Our approach uses this reward signal with policy gradients to train the policies to find a
locally optimal student network. Our experiments show that we can achieve compression rates of more than 10× for models such as ResNet-34 while maintaining similar performance to the input ‘teacher’ network. We also present a valuable transfer learning result
which shows that policies which are pre-trained on smaller ‘teacher’ networks can be used to rapidly speed up training on larger ‘teacher’ networks.
There is an increasing interest in accelerating neural networks for real-time applications. We study the student-teacher strategy, in which a small and fast
student network is trained with the auxiliary information learned from a large and accurate teacher network. We propose to use conditional adversarial networks to
learn the loss function to transfer knowledge from teacher to student. The proposed method is particularly effective for relatively small student networks.
Moreover, experimental results show the effect of network size when modern networks are used as students. We empirically study the trade-off between inference
time and classification accuracy, and provide suggestions on choosing a proper student network.
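A loose sketch of the adversarial-transfer idea, with every module here a placeholder: a discriminator learns to distinguish teacher outputs from student outputs, and the student is trained to fool it, so the transfer loss is learned rather than hand-designed:

```python
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))  # discriminator
bce = nn.BCEWithLogitsLoss()

teacher_logits = torch.randn(32, 10)                      # from the frozen teacher
student_logits = torch.randn(32, 10, requires_grad=True)  # stand-in for student output

# Discriminator step: teacher outputs are 'real', student outputs 'fake'.
d_loss = (bce(D(teacher_logits), torch.ones(32, 1)) +
          bce(D(student_logits.detach()), torch.zeros(32, 1)))

# Student step: produce outputs the discriminator labels as 'teacher'.
s_loss = bce(D(student_logits), torch.ones(32, 1))
s_loss.backward()
```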
We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language
processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained
environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding
how to allocate a small memory budget.
Recurrent neural networks show state-of-the-art results in many text analysis tasks but often require a lot of memory to store their weights. Recently proposed
Sparse Variational Dropout eliminates the majority of the weights in a
feed-forward neural network without significant loss of quality. We apply this technique to sparsify recurrent neural networks. To account for recurrent specifics
we also rely on Binary Variational Dropout for RNNs. We report a 99.5% sparsity level on a sentiment analysis task
without a quality drop and up to an 87% sparsity level on a language modeling task with a slight loss of accuracy.
We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed
specially for mobile devices with very limited computing power (eg. 10–150 MFLOPs). The new architecture utilizes
two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior
performance of ShuffleNet over other structures, eg. lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet
classification task, under the computation budget of 40 MFLOPs. On an ARM-based
mobile device, ShuffleNet achieves 13× actual speedup over AlexNet while maintaining comparable accuracy.
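The channel-shuffle operation itself is only a reshape and a transpose; it interleaves channels across groups so that successive grouped convolutions can exchange information. A standard PyTorch rendering:

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups: (n, c, h, w) -> (n, c, h, w)."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(n, c, h, w))

x = torch.arange(8.).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten())  # 0, 4, 1, 5, 2, 6, 3, 7
```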
Dropout-based regularization methods can be regarded as injecting random noise with pre-defined magnitude to different parts of the neural network during
training. It was recently shown that Bayesian dropout procedure not only improves generalization but also leads to extremely sparse neural architectures by
automatically setting the individual noise magnitude per weight. However, this sparsity can hardly be used for acceleration since it is unstructured. In the paper,
we propose a new Bayesian model that takes into
account the computational structure of neural networks and provides structured sparsity, eg. removes neurons and/or convolutional channels in CNNs. To do this we inject noise into the neurons’ outputs while keeping the weights unregularized. We establish the probabilistic
model with a proper truncated log-uniform prior over the noise and truncated log-normal variational approximation that ensures that the KL-term in the evidence lower bound is computed in closed-form. The model leads to structured
sparsity by removing elements with a low SNR from the computation graph and provides significant acceleration on a
number of deep neural architectures. The model is easy to implement as it can be formulated as a separate dropout-like layer.
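A hedged sketch of the resulting structured pruning (values invented for illustration): each channel carries a learned noise mean and variance, and channels whose signal-to-noise ratio falls below a threshold are sliced out of the layer entirely, which is what makes the sparsity exploitable for acceleration:

```python
import torch

mu = torch.rand(64)                    # learned noise means, one per channel
sigma = torch.rand(64) * 0.5           # learned noise standard deviations
snr = mu / sigma.clamp_min(1e-8)
keep = snr > 1.0                       # SNR threshold is a hyperparameter
print(f"channels kept: {int(keep.sum())}/64")

# Structured pruning: slice kept channels out of a conv layer's weight tensor,
# leaving a smaller dense layer, hence real acceleration, unlike unstructured
# weight sparsity.
conv_w = torch.randn(64, 32, 3, 3)     # (out_channels, in_channels, kH, kW)
pruned_w = conv_w[keep]
print(pruned_w.shape)
```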
Recurrent Neural Networks (RNN) are widely used to solve a variety of problems and as the quantity of data
and the amount of available compute have increased, so have model sizes. The number of parameters in recent state-of-the-art networks makes them hard to deploy,
especially on mobile phones and embedded devices. The challenge is due to both the size of the model and the time it takes to evaluate it. In order to deploy
these RNNs efficiently, we propose a technique to reduce the parameters of a network by pruning weights during the
initial training of the network. At the end of training, the parameters of the network are sparse while accuracy is still close to the original dense neural
network. The network size is reduced by 8× and the time required to train the model remains constant. Additionally, we can prune a larger dense network to achieve
better than baseline performance while still reducing the total number of parameters significantly. Pruning RNNs reduces
the size of the model and can also help achieve significant inference time speed-up using sparse matrix multiply. Benchmarks show that using our technique
model size can be reduced by 90% and speed-up is around 2× to 7×.
Reduce overfit by replacing, in a 3-branch ResNet, the standard summation of residual branches by a stochastic affine combination.
The method introduced in this paper aims at helping computer vision practitioners faced with an overfit problem. The idea is to replace, in a 3-branch
ResNet, the standard summation of residual branches by a stochastic affine combination. The largest tested model improves on
the best single shot published result on CIFAR-10 by reaching 2.86% test error. Code is available at
[Keywords: Computer vision, Deep learning, Supervised Learning]
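A minimal sketch of the stochastic affine combination, ignoring the paper's separate forward/backward coefficients and per-sample sampling: during training the two residual branches are mixed with a random coefficient rather than summed, and at test time the coefficient is fixed at 0.5:

```python
import torch

def shake_combine(x, branch1, branch2, training=True):
    """x + alpha*branch1 + (1 - alpha)*branch2, with random alpha in training."""
    alpha = torch.rand(()).item() if training else 0.5
    return x + alpha * branch1 + (1 - alpha) * branch2

x = torch.randn(8, 16)
b1, b2 = torch.randn(8, 16), torch.randn(8, 16)      # the two residual branches
out_train = shake_combine(x, b1, b2)                 # stochastic mix
out_test = shake_combine(x, b1, b2, training=False)  # deterministic average
```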
We explore a recently proposed Variational Dropout technique that
provided an elegant Bayesian interpretation to Gaussian Dropout. We extend Variational Dropout to the case when dropout rates are unbounded, propose a way to
reduce the variance of the gradient estimator and report first experimental results with individual dropout rates per
weight. Interestingly, it leads to extremely sparse solutions both in fully-connected and convolutional layers. This effect is similar to automatic relevance
determination effect in empirical Bayes but has a number of advantages. We reduce the number of parameters up to 280× on LeNet architectures and up to
68× on VGG-like networks with a negligible decrease of accuracy.
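How sparsity falls out of this can be sketched directly: each weight carries a learned dropout rate α = σ²/θ², and weights whose log α grows large (a cut-off around 3 is common) are effectively pure noise and can be zeroed. The tensors below are stand-ins for learned values:

```python
import torch

theta = torch.randn(1000)                 # learned weight means
log_sigma2 = torch.randn(1000) * 4.0      # learned per-weight log-variances
log_alpha = log_sigma2 - torch.log(theta**2 + 1e-8)
mask = log_alpha < 3.0                    # keep only low-noise weights
sparse_w = torch.where(mask, theta, torch.zeros_like(theta))
print(f"sparsity: {(~mask).float().mean():.2%}")
```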
Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when
trained with methods such as distillation that allow small or shallow models of high accuracy to be trained.
Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same
number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on
CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be
as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of accuracy comparable to the deep convolutional teacher models.
…Figure 1 summarizes the results in Table 2 for student models of different depth, number of convolutional layers,
and number of parameters when trained to mimic the ensemble teacher model. Student models trained on the ensemble logits are able to achieve accuracies
previously unseen on CIFAR-10 for models with so few layers. Also, it is clear that there is a huge gap between the
convolutional student models at the top of the figure, and the non-convolutional student models at the bottom of the figure: the most accurate student
MLP has accuracy less than 75%, while the least accurate convolutional student model with the same number of parameters
but only one convolutional layer has accuracy above 87%. And the accuracy of the convolutional student models increases further as more layers of convolution are
added. Interestingly, the most accurate student MLPs with no convolutional layers have only 2 or 3 hidden
layers; the student MLPs with 4 or 5 hidden layers are not as accurate.
Comparing the student MLP with only one hidden layer (bottom of the graph) to the student CNN with 1 convolutional layer clearly suggests that convolution is critical for this problem even when models are
trained via distillation, and that it is very unlikely that a shallow non-convolutional model with 100 million parameters or less could ever achieve accuracy
comparable to a convolutional model. It appears that if convolution is critical for teacher models trained on the original 0/1 hard targets, it is likely to be
critical for student models trained to mimic these teacher models. Adding depth to the student MLPs without adding
convolution does not substantially close this “convolutional gap”.
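The mimic-training setup referred to here is plain regression on the teacher's pre-softmax outputs; a minimal PyTorch sketch with placeholder models:

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(128, 32)                 # a batch of (unlabeled) inputs
with torch.no_grad():
    target_logits = teacher(x)           # soft targets: no labels required
opt.zero_grad()
loss = nn.functional.mse_loss(student(x), target_logits)
loss.backward()
opt.step()
```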
Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach
called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed
to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more
efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the
Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
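A hedged sketch of the policy-distillation objective: the teacher's Q-values are softened into an action distribution with a temperature and the student is trained to match it under a KL loss (the low temperature here is illustrative):

```python
import torch
import torch.nn.functional as F

def distill_loss(teacher_q, student_logits, tau=0.01):
    """KL(softened teacher action distribution || student policy)."""
    teacher_probs = F.softmax(teacher_q / tau, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction='batchmean')

teacher_q = torch.randn(64, 18)                      # Q-values over 18 actions
student_logits = torch.randn(64, 18, requires_grad=True)
distill_loss(teacher_q, student_logits).backward()
```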
Deep neural networks currently demonstrate state-of-the-art performance in several domains. At the same time, models of this class are very demanding in terms
of computational resources. In particular, a large amount of memory is required by commonly used fully-connected layers, making it hard to use the models on
low-end devices and preventing further increases in model size.
In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a
huge factor and at the same time the expressive power of the layer is preserved.
In particular, for the Very Deep VGG networks we report the compression factor of the dense weight matrix of a
fully-connected layer up to 200,000× leading to the compression factor of the whole network up to 7×.
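A worked parameter count makes the compression factor concrete: reshape a 1024×1024 dense matrix as a 5-dimensional tensor with 4×4 mode sizes and store it as a chain of small TT-cores (the rank below is an illustrative knob, not the paper's setting):

```python
dense_params = 1024 * 1024                 # 1,048,576 weights in the dense layer

modes = [(4, 4)] * 5                       # 4^5 = 1024 rows and 1024 columns
rank = 8                                   # TT-rank: the accuracy/size knob
ranks = [1] + [rank] * 4 + [1]             # boundary TT-ranks are always 1
tt_params = sum(ranks[k] * m * n * ranks[k + 1]
                for k, (m, n) in enumerate(modes))
print(dense_params, tt_params, round(dense_params / tt_params))
# -> 1048576 3328 315: a ~315x smaller parameterization of this one matrix
```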
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average
their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to
a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the
knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We
achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a
heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one
or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these
specialist models can be trained rapidly and in parallel.
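The distillation objective itself fits in a few lines: soften both teacher and student outputs with a temperature T, match them with a KL term rescaled by T² (as the paper prescribes), and optionally mix in the ordinary hard-label loss; T and the mixing weight below are illustrative:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Temperature-softened KL to the teacher, mixed with the hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction='batchmean') * T * T  # T^2 keeps gradient scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
kd_loss(student_logits, teacher_logits, labels).backward()
```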