People perceive the world with multiple senses (e.g., through hearing sounds, reading words and seeing objects). However, most existing AI systems only process an individual modality. This paper presents an approach that excels at handling multiple modalities of information with a single model. In our “SkillNet” model, different parts of the parameters are specialized for processing different modalities. Unlike traditional dense models that always activate all the model parameters, our model sparsely activates parts of the parameters whose skills are relevant to the task. Such model design enables SkillNet to learn skills in a more interpretable way. We develop our model for five modalities including text, image, sound, video and code. Results show that SkillNet performs comparably to five modality-specific fine-tuned models. Moreover, our model supports self-supervised pretraining in the same sparsely activated manner, resulting in better-initialized parameters for different modalities. We find that pretraining significantly improves the performance of SkillNet on five modalities, on par with or even better than baselines with modality-specific pretraining. On the task of Chinese text-to-image retrieval, our final system achieves higher accuracy than existing leading systems including WukongViT-B and Wenlan 2.0, while using fewer activated parameters.
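As an illustration of the idea, here is a minimal PyTorch sketch of modality-conditional sparse activation; the module names (`SkillFFN`, `SkillLayer`) are hypothetical and this is not the authors’ implementation, only the activation pattern the abstract describes:

```python
# Minimal sketch of modality-conditional sparse activation (not the authors' code).
# Each "skill" is a feed-forward block; only the skills relevant to the input's
# modalities are activated, the rest stay untouched.
import torch
import torch.nn as nn

class SkillFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
    def forward(self, x):
        return self.net(x)

class SkillLayer(nn.Module):
    """One layer whose feed-forward parameters are partitioned by modality."""
    def __init__(self, d_model=512, d_hidden=2048,
                 modalities=("text", "image", "sound", "video", "code")):
        super().__init__()
        self.skills = nn.ModuleDict({m: SkillFFN(d_model, d_hidden) for m in modalities})

    def forward(self, x, active_modalities):
        # Sparse activation: only the relevant skills run; their outputs are averaged.
        outputs = [self.skills[m](x) for m in active_modalities]
        return x + torch.stack(outputs).mean(dim=0)

layer = SkillLayer()
hidden = torch.randn(2, 16, 512)                          # (batch, tokens, d_model)
out = layer(hidden, active_modalities=["text", "image"])  # e.g. text-to-image retrieval
```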
[blog] As the training of giant dense models hits the limits of today’s hardware availability and capability, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their substantial training cost reduction compared to a quality-equivalent dense model.
Their training cost savings have been demonstrated from encoder-decoder models (prior work) to a 5× saving for auto-regressive language models (this work, along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage.
To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7×, and a highly optimized inference system that provides 7.3× better latency and cost compared to existing MoE inference solutions. It offers ultra-fast inference latencies (25ms) for trillion-parameter MoE models. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5× faster and 9× cheaper inference compared to quality-equivalent dense models.
We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
…A year later, with much less fanfare, Tsinghua University’s Beijing Academy of Artificial Intelligence released an even larger model, Wu Dao 2.0, with 10× as many parameters—the neural network values that encode information. While GPT-3 boasts 175 billion parameters, Wu Dao 2.0’s creators claim it has a whopping 1.75 trillion. Moreover, the model is capable of generating not only text, as GPT-3 does, but also images from textual descriptions, like OpenAI’s 12-billion-parameter DALL·E model, and it has a scaling strategy similar to Google’s 1.6-trillion-parameter Switch Transformer model.
Tang Jie, the Tsinghua University professor leading the Wu Dao project, said in a recent interview that the group built an even bigger, 100 trillion-parameter model in June, though it has not trained it to “convergence”, the point at which the model stops improving. “We just wanted to prove that we have the ability to do that”, Tang said…Tang says his group is now working on video with the goal of generating realistic video from text descriptions. “Hopefully, we can make this model do something beyond the Turing test”, he says, referring to an assessment of whether a computer can generate text indistinguishable from that created by a human. “That’s our final goal.”
…Geoffrey Hinton instead helped to put deep learning on the map in 2012 with a now-famous neural net called AlexNet when he was at the University of Toronto. But Hinton was also in close contact with the Microsoft Research Lab in Redmond, Wash., before and after his group validated AlexNet, according to one of Hinton’s associates there, Li Deng, then principal researcher and manager and later chief scientist of AI at Microsoft.
In 2009 and 2010, Hinton and Deng worked together at Microsoft on speech recognition, and Deng, then Editor-in-Chief of the IEEE Signal Processing Magazine, was invited in 2011 to lecture at several academic organizations in China, where he said he shared the published success of deep learning in speech processing. Deng said he was in close contact with former Microsoft colleagues at Baidu, a Chinese search engine and AI giant, and a company called iFlyTek, a spin-off from Deng’s undergraduate alma mater.
When Hinton achieved his breakthrough with backpropagation in neural networks in 2012, he sent an email to Deng in Washington, and Deng said he shared it with Microsoft executives, including Qi Lu who led the development of the company’s search engine, Bing. Deng said he also sent a note to his friends at iFlyTek, which quickly adopted the strategy and became an AI powerhouse—famously demonstrated in 2017 with a convincing video of then-president Donald Trump speaking Chinese.
Qi Lu went on to become COO of Baidu where Deng said another Microsoft alum, Kai Yu, who also knew Hinton well, had already seized on Hinton’s breakthrough. Literally within hours of Hinton’s results, according to Deng, researchers in China were working on repeating his success.
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in-domain and out-of-domain language modeling, zero-shot and few-shot priming, and full fine-tuning.
With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ~4× less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters).
Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study.
We make our code and models publicly available for research use.
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is ~7× larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
[blog] Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for serving.
In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation. Experiments on WMT and a web-scale dataset suggest that task-level routing (TaskMoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.
On WMT, our task-MoE with 32 experts (533M parameters) outperforms the best performing token-level MoE model (token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak inference throughput is also improved by a factor of 1.9× when we route by tasks instead of tokens. While distilling a token-MoE to a smaller dense model preserves only 32% of the BLEU gains, our sub-network task-MoE, by design, preserves all the gains with the same inference cost as the distilled student model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) performs competitively with a token-level counterpart, while improving the peak inference throughput by a factor of 2.6×.
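The routing-granularity contrast can be sketched as follows (a toy illustration, not the paper’s code; real MoE layers add expert capacity limits and load-balancing losses):

```python
# Toy contrast between token-level and task-level expert routing.
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=32, n_tasks=30):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model)) for _ in range(n_experts))
        self.token_router = nn.Linear(d_model, n_experts)    # token-level routing
        self.task_router = nn.Embedding(n_tasks, n_experts)  # task-level routing

    def forward(self, x, task_id=None):                      # x: (batch, seq, d_model)
        if task_id is None:
            expert_idx = self.token_router(x).argmax(-1)     # one expert per token
        else:
            logits = self.task_router(torch.tensor([task_id]))[0]
            expert_idx = torch.full(x.shape[:-1], int(logits.argmax()))  # one expert per task
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```

Because every token of a given task (e.g. a language pair) is routed to the same experts, the experts that task never uses can be pruned before deployment, which is where the reported throughput gains come from.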
Mixture of Experts (MoE) with sparse conditional computation has proved to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost. In this paper, we propose Sparse-MLP, scaling the recent MLP-Mixer model with sparse MoE layers, to achieve a more computation-efficient architecture. We replace a subset of dense MLP blocks in the MLP-Mixer model with Sparse blocks. In each Sparse block, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, and one with MLP experts mixing information within patches along the channel dimension. Besides, to reduce computational cost in routing and improve expert capacity, we design Re-represent layers in each Sparse block. These layers re-scale image representations with two simple but effective linear transformations.
By pre-training on ImageNet-1k with MoCo v3 algorithm, our models can outperform dense MLP models with comparable parameters and less computational cost on several downstream image classification tasks.
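A rough sketch of the Sparse block described above, with top-1 MoE mixing along the patch dimension followed by MoE mixing along the channel dimension (shapes and names are assumptions; the Re-represent layers and capacity handling are omitted):

```python
# Illustrative Sparse block: MoE patch-mixing then MoE channel-mixing.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Route each row of the input to a single MLP expert."""
    def __init__(self, dim, hidden, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts))

    def forward(self, x):                      # x: (..., dim)
        idx = self.router(x).argmax(-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class SparseBlock(nn.Module):
    def __init__(self, n_patches=196, n_channels=512):
        super().__init__()
        self.patch_moe = Top1MoE(n_patches, 2 * n_patches)      # mixes across patches
        self.channel_moe = Top1MoE(n_channels, 4 * n_channels)  # mixes across channels

    def forward(self, x):                      # x: (batch, patches, channels)
        x = x + self.patch_moe(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_moe(x)
        return x

x = torch.randn(8, 196, 512)                   # (batch, patches, channels)
y = SparseBlock()(x)
```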
In recent years, the size of pre-trained language models (PLMs) has grown by leaps and bounds. However, efficiency issues of these large-scale PLMs limit their utilization in real-world scenarios. We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. (1) We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. (2) We explore the best practice of prompt tuning with large-scale PLMs. Compared with conventional fine-tuning, prompt tuning significantly reduces the number of task-specific parameters. (3) We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources. Based on our cost-effective pipeline, we pre-train two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In our experiments, we compare CPM-2 with mT5 on downstream tasks. Experimental results show that CPM-2 has excellent general language intelligence. Moreover, we validate the efficiency of InfMoE when conducting inference of large-scale models having tens of billions of parameters on a single GPU. All source code and model parameters are available on GitHub.
[blog; code] Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are “dense”, that is, every input is processed by every parameter.
We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable [to JFT-300M] and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time.
Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
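The batch-prioritized routing idea can be sketched as follows (an illustrative top-1 version; the actual V-MoE uses top-k routing with further refinements):

```python
# Rough sketch of batch-prioritized top-1 routing with limited expert capacity.
import torch

def batch_prioritized_route(gate_logits, capacity_ratio=0.5):
    """gate_logits: (num_tokens, num_experts) for the whole batch (flattened).
    Returns, for each token, the expert it is assigned to, or -1 if dropped."""
    num_tokens, num_experts = gate_logits.shape
    capacity = int(capacity_ratio * num_tokens / num_experts)  # slots per expert
    scores, choices = gate_logits.max(dim=-1)                  # top-1 expert per token
    assignment = torch.full((num_tokens,), -1, dtype=torch.long)
    slots_used = torch.zeros(num_experts, dtype=torch.long)
    # Process the highest-scoring tokens first, across the entire batch, so that
    # when capacity is reduced only the least important patches get dropped.
    for t in scores.argsort(descending=True):
        e = choices[t]
        if slots_used[e] < capacity:
            assignment[t] = e
            slots_used[e] += 1
    return assignment

logits = torch.randn(1024, 8)                 # 1024 patch tokens, 8 experts
assignment = batch_prioritized_route(logits, capacity_ratio=0.5)
```

Tokens that are dropped simply keep their residual-stream value, which is what lets the model trade accuracy for compute at test time by lowering the capacity ratio.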
[WP] The Beijing Academy of Artificial Intelligence, styled as BAAI and known in Chinese as 北京智源人工智能研究院, launched the latest version of Wu Dao 悟道, a pre-trained deep learning model that the lab dubbed as “China’s first”, and “the world’s largest ever”, with a whopping 1.75 trillion parameters.
…Unlike conventional deep learning models that are usually task-specific, Wu Dao is a multi-modal model trained to tackle both text and image, 2 dramatically different sets of problems. At BAAI’s annual academic conference on Tuesday, the institution demonstrated Wu Dao performing tasks such as natural language processing, text generation, image recognition, image generation, etc.
The model is capable of writing poems and couplets in traditional Chinese styles, answering questions, writing essays, generating alt text for images, and generating corresponding images from natural-language descriptions with a decent level of photorealism. It can even power “virtual idols”, with the help of Xiaoice, a Chinese company spun off of Microsoft—so there can be voice support too, in addition to text and image.
…Very interestingly, this model with 1.75 trillion parameters is already the 2.0 version of Wu Dao, whose first version was just launched less than 3 months ago. One of the main reasons the Chinese researchers made progress quickly was that they were able to tap into China’s supercomputing clusters, with the help of a few of its core members who also worked on the national supercomputing projects.
A little more technical explanation: BAAI researchers developed and open-sourced a deep learning system called FastMoE, which allowed Wu Dao to be trained on both supercomputers and regular GPUs with substantially more parameters, giving the model, in theory, more flexibility than Google’s take on the MoE, or Mixture-of-Experts. This is because Google’s system requires the company’s dedicated TPU hardware and distributed training framework, while BAAI’s FastMoE works with at least one industry-standard open-source framework, namely PyTorch, and can be operated on off-the-shelf hardware.
The Chinese lab claims that Wu Dao’s sub-models achieved better performance than previous models, beating OpenAI’s CLIP and Google’s ALIGN on English image and text indexing in the Microsoft COCO dataset. For image generation from text, a novel task, BAAI claims that Wu Dao’s sub-model CogView beat OpenAI’s DALL·E, a state-of-the-art neural network launched in January this year with 12 billion parameters.
“The way to artificial general intelligence is big models and big computer”, said Dr. Zhang Hongjiang, chairman of BAAI, “What we are building is a power plant for the future of AI, with mega data, mega computing power, and mega models, we can transform data to fuel the AI applications of the future.”
…However, while OpenAI and DeepMind are privately funded, a key distinction for BAAI is that it’s formed and funded with substantial help from China’s Ministry of Science and Technology, as well as Beijing’s municipal government.
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost, and thus they have become a trend in model scaling. Still, it is a mystery how MoE layers bring quality gains by leveraging the parameters with sparse activation.
In this work, we investigate several key factors in sparse expert models. We observe that load imbalance may not be a major problem affecting model quality, contrary to the perspectives of recent studies, while the number of sparsely activated experts k and the expert capacity C in top-k routing can make a substantial difference in this context. Furthermore, we take a step forward to propose a simple method called ‘expert prototyping’ that splits experts into different prototypes and applies k top-1 routing. This strategy improves model quality while maintaining constant computational costs, and our further exploration of extremely large-scale models shows that it is more effective in training larger models.
We push the model scale to over 1 trillion parameters and implement it on only 480 NVIDIA V100-32GB GPUs, compared with the recent SOTA Switch Transformer on 2,048 TPUs. The proposed giant model achieves substantial speedup in convergence over the same-size baseline.
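A sketch of the expert-prototyping idea: split the experts into k prototype groups, each with its own top-1 router, and sum the k selected outputs (names and shapes are assumptions, not the paper’s code):

```python
# Sketch of "expert prototyping": k prototype groups, top-1 routing within each.
import torch
import torch.nn as nn

class PrototypeMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=16, k=4):
        super().__init__()
        assert n_experts % k == 0
        self.k, self.group = k, n_experts // k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model)) for _ in range(n_experts))
        # One router per prototype group.
        self.routers = nn.ModuleList(nn.Linear(d_model, self.group) for _ in range(k))

    def forward(self, x):                                  # x: (tokens, d_model)
        out = torch.zeros_like(x)
        for g, router in enumerate(self.routers):
            probs = router(x).softmax(-1)                  # top-1 within this group
            idx = probs.argmax(-1)
            for j in range(self.group):
                mask = idx == j
                if mask.any():
                    e = self.experts[g * self.group + j]
                    out[mask] += probs[mask][:, j:j + 1] * e(x[mask])
        return out
```

The compute per token stays roughly that of top-k routing over a single expert pool, but each group’s router only has to discriminate among its own experts.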
Recent advances in large-scale pre-training such as GPT-3 allow seemingly high quality text to be generated from a given prompt. However, such generation systems often suffer from problems of hallucinated facts, and are not inherently designed to incorporate useful external information. Grounded generation models appear to offer remedies, but their training typically relies on rarely-available parallel data where information-relevant documents are provided for context. We propose a framework that alleviates this data constraint by jointly training a grounded generator and document retriever on the language model signal. The model learns to reward retrieval of the documents with the highest utility in generation, and attentively combines them using a Mixture-of-Experts (MoE) ensemble to generate follow-on text. We demonstrate that both generator and retriever can take advantage of this joint training and work synergistically to produce more informative and relevant text in both prose and dialogue generation.
The computation demand for machine learning (ML) has grown rapidly in recent years, and it comes with a number of costs. Estimating the energy cost helps measure its environmental impact and find greener strategies, yet doing so is challenging without detailed information.
We calculate the energy use and carbon footprint of several recent large models (T5, Meena, GShard, Switch Transformer, and GPT-3) and refine earlier estimates for the neural architecture search that found Evolved Transformer.
We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e):
Large but sparsely activated DNNs can consume <1⁄10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters.
Geographic location matters for ML workload scheduling since the fraction of carbon-free energy and resulting CO2e vary 5×–10×, even within the same country and the same organization.
We are now optimizing where and when large models are trained.
Specific datacenter infrastructure matters, as Cloud datacenters can be 1.4–2× more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be 2–5× more effective than off-the-shelf systems.
Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to 100–1000×.
These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.
…Most companies spend more energy on serving a DNN model (performing inference) than on training it. For example, NVIDIA estimated that 80–90% of the ML workload is inference processing [Leo19]. Similarly, Amazon Web Services claimed that 90% of the ML demand in the cloud is for inference [Bar19]. Given its substantial role in the ML model lifecycle, Alibaba, Amazon, Google, and NVIDIA designed ML accelerators solely for inference. If the total ML energy is split 10% on training and 90% on serving, then even if training a given ML model required double the energy, it could still reduce overall carbon emissions if that model also cut serving energy by 20%.
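As a sanity check on that last claim, the arithmetic is simple (taking the 10%/90% split and the stated 2× training / 20% serving changes as given):

```python
# Back-of-the-envelope check of the training-vs-serving trade-off above.
baseline_training, baseline_serving = 0.10, 0.90   # shares of total ML energy
new_total = 2.0 * baseline_training + 0.80 * baseline_serving
print(new_total)   # 0.92 -> roughly an 8% reduction versus the original total of 1.00
```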
…As Table 4 shows, the actual cost of the Evolved Transformer NAS is nearly 2 orders of magnitude smaller than previously estimated [Str19]. Why the discrepancy? The answer is that, in addition to the efficiency of Google datacenters, there was confusion in estimating the energy cost of NAS. In the Evolved Transformer NAS, researchers used a small proxy task to search for the best models to save time and money, and then scaled up the found models to full size. Small proxies may not be obvious, which made it hard to estimate the CO2e correctly in retrospect from the NAS paper [So19]. Due to this misunderstanding of the use of proxy tasks in NAS, it was assumed the search was done with full-size tasks. Because of this assumption, despite considerable effort on their part, Strubell et al.’s energy estimate for NAS ended up 18.7× too high for the average organization (see Appendix C) and 88× off in emissions for energy-efficient organizations like Google (see Appendix D).
…In terms of cost-benefit tradeoff, NAS can also lead to improved energy efficiency in training of downstream applications, and the benefit can dramatically outweigh the cost. Figure 4 shows that the Evolved Transformer, found by NAS [So19], has 37% fewer parameters and converges to the same accuracy with 25% less energy expenditure (see Table 1) than the vanilla Transformer (Big) model on WMT English to German translation. The use of Evolved Transformer instead of a regular Transformer architecture saved 48.5 tCO2e during the training of the Meena DNN (see Tables 1 and 4). The savings from this single reuse in Meena are ~15× larger than the energy cost of running the search to discover it. The results of the Evolved Transformer neural architecture search have been open-sourced. It can readily be used by anyone training ML models for NLP problems, similar to how a Transformer-style model can be used for NLP problems [Evo19].
…Finally, Google publishes its total energy consumption, and for 2019 it was 12.2 TeraWatt-hours [Goo20]. Row 18 of Table 4 shows the percentage that each NLP model training was of that total. Even if we assume all four of Google’s large NLP models in Table 4 were trained in 2019, the total represents less than 0.005%. The training of those 4 large NLP models is not a substantial fraction of Google’s energy consumption.
…For example, our large-scale translation models (M4) have already been used to translate billions of queries annually for each mid-to-low-resource language, with 2B speakers globally for these languages. Figure 7, from the GShard paper [Lep20], shows substantial improvements for translation of 100 different languages to English. The blue line at the top of the left plot represents the 600B-parameter multilingual translation MoE model of GShard. The dashed black line near the bottom is for a traditional dense DNN that is fully activated for every token. The dense DNN requires ~10× more computational resources to train than the 600B sparse MoE model, despite delivering substantially lower translation quality. Figure 7 shows that the larger the MoE model, the larger the BLEU score gains across all languages; the lines rarely cross. The 600B MoE model improves average quality by +13.5 BLEU, 7.4 higher than the 2.3B dense model.
GShard-600B’s emissions (Table 4) are 4.3 tCO2e (about 3.5 passenger SF-NY round trips), from consuming 24 MWh to train a model that could serve 2B users; the amortized per-user CO2e impact of model training would be less than the CO2e impact of sending one text message.
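The per-user amortization behind that comparison is straightforward (illustrative arithmetic using the figures quoted above; the text-message comparison is the source’s own):

```python
# Amortizing the quoted GShard-600B training footprint over a 2B-user service.
training_co2e_grams = 4.3e6                # 4.3 tCO2e expressed in grams
users = 2e9
print(training_co2e_grams / users)         # ~0.002 g CO2e per user
```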
[Fun note: the corpus uses The Pile.] In a bid to promote the research and development of China’s own large-scale pretraining models and further explore universal intelligence from a more fundamental perspective, the Beijing Academy of Artificial Intelligence (BAAI) recently unveiled Wu Dao 1.0, China’s first homegrown super-scale intelligent model system. The work was led by BAAI Research Academic Vice President and Tsinghua University Professor Tang Jie, with contributions from a team of more than 100 AI scientists from Peking University, Tsinghua University, Renmin University of China, Chinese Academy of Sciences and other institutes.
Wu Dao 1.0 has initiated large-scale research projects via 4 related models: Wu Dao—Wen Yuan, Wu Dao—Wen Lan, Wu Dao—Wen Hui, and Wu Dao—Wen Su.
Wu Dao—Wen Yuan: is China’s largest-ever pretraining language model, boasting the best processing power in mainstream languages, including Chinese and English. It has surpassed average human performance benchmarks on text categorization, sentiment analysis, natural language inference, reading comprehension and more. The Wu Dao—Wen Yuan project is designed to explore universal natural language understanding (NLU) techniques and study brain-inspired language models. It has 2.6 billion parameters and is capable of performing cognitive activities such as memorization, comprehension, retrieval, numerical calculation, multi-language processing, etc. Wu Dao—Wen Yuan has achieved GPT-3-comparable performance on 20 Chinese NLP tasks such as open-domain question answering, grammar correction, sentiment analysis, etc.
…Wen Yuan introduces the open-source Chinese pretraining model (CPM). Based on CPM, the CPM-Distill model reduces language confusion by 38% and achieves better results on downstream tasks.
Wu Dao—Wen Lan: meanwhile, is the first publicly available Chinese universal image-text multimodal pretraining model. The ultra-large-scale multimodal pretraining model aims to break through the theoretical challenges of pretraining on multimodal data combining images, text and video, and eventually produce industrial-grade Chinese image-text pretraining models and applications that exceed SOTA performance. Currently, the model has 1 billion parameters and is trained on 50 million image-text pairs collected from open sources. The Wu Dao—Wen Lan model has reached SOTA performance, scoring 5% higher than the champion team on the Image Caption task of the Chinese public multimodal test set AIC-ICC and 20% higher than the popular UNITER model on the Visual Entailment task.
…Wen Lan is the first Chinese generic multimodal pretraining model that can understand “connotative information” based on weak correlations between images and text. Wen Lan uses an advanced cross-modal contrastive learning algorithm: given an image-text pair, it can enlarge the number of negative samples for each modality, especially those which are difficult to distinguish, further improving the expressive ability of the neural network. It can easily swap in the most advanced single-modality pretraining models as its image and text encoders, and achieves 20× faster performance than the UNITER model.
Wu Dao—Wen Hui: is an ultra-large-scale cognition-oriented pretraining model that focuses on a series of essential problems in general artificial intelligence from a cognitive perspective, aiming to develop and enhance the logic-, consciousness-, and reasoning-based cognitive capabilities of pretraining models. Wu Dao—Wen Hui has reached 11.3 billion parameters, and through simple fine-tuning it can generate poetry, make videos, draw pictures, retrieve text, perform complex reasoning, etc. BAAI says the model achieves near-human performance on poetry generation in a Turing test.
…Wen Hui proposes a new pretraining paradigm, the Generative Language Model, breaking through the bottlenecks of BERT and GPT. For the first time, a single model has achieved the best results in both language understanding and generation tasks, surpassing common pretraining models such as BERT, RoBERTa and T5 trained on the same volume of data. Wen Hui’s continuous-vector-based fine-tuning method, P-tuning, makes it the first autoregressive model to surpass autoencoder models on NLU tasks, and it has achieved SOTA results on more than 10 tasks such as knowledge extraction and SuperGLUE few-shot learning, with over 20% performance improvement. Wen Hui’s inverse prompting algorithm achieves close-to-human performance on Q&A and poetry generation tasks, and it is the first model that can generate classical Chinese poetry based on modern themes.
Wu Dao—Wen Su: is a large-scale training model for biomolecular structure prediction. It can handle super long biomolecular structures, where it has achieved SOTA performance, interpretability and robustness. Based on Google’s BERT language model, Wu Dao—Wen Su has completed protein training on the 100 GB UNIPARC database and gene training on 5–100,000 human peripheral blood immune cells (25–30 cell types) and 10,000 drug-resistant bacteria.
…Wen Su’s open-sourced FastMoE is the first high-performance MoE (Mixture-of-Experts Model) system that supports the PyTorch framework and a variety of hardware. Only one line of code is required to complete the MoE transformation, and model training speed is increased by 47× compared with the traditional PyTorch implementation.
Deep learning has seen a movement away from representing examples with a monolithic hidden state towards a richly structured state. For example, Transformers segment by position, and object-centric architectures decompose images into entities. In all these architectures, interactions between different elements are modeled via pairwise interactions: Transformers make use of self-attention to incorporate information from other positions; object-centric architectures make use of graph neural networks to model interactions among entities.
However, pairwise interactions may not achieve global coordination or a coherent, integrated representation that can be used for downstream tasks. In cognitive science, a global workspace architecture has been proposed in which functionally specialized components share information through a common, bandwidth-limited communication channel. We explore the use of such a communication channel in the context of deep learning for modeling the structure of complex environments. The proposed method includes a shared workspace through which communication among different specialist modules takes place but due to limits on the communication bandwidth, specialist modules must compete for access. We show that capacity limitations have a rational basis in that (1) they encourage specialization and compositionality and (2) they facilitate the synchronization of otherwise independent specialists.
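A minimal sketch of such a bandwidth-limited workspace, in which specialist modules compete via top-k attention for a handful of write slots that are then broadcast back (the sizes, names, and mean-pooled read-out are assumptions, not the paper’s architecture):

```python
# Sketch of a shared workspace with limited write bandwidth.
import torch
import torch.nn as nn

class SharedWorkspace(nn.Module):
    def __init__(self, d_model=128, n_slots=4, k_writers=2):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, d_model))
        self.k = k_writers
        self.q = nn.Linear(d_model, d_model)   # slot queries
        self.kv = nn.Linear(d_model, d_model)  # specialist keys/values

    def forward(self, specialists):            # (batch, n_specialists, d_model)
        q = self.q(self.slots)                             # (n_slots, d)
        k = self.kv(specialists)                           # (B, n_spec, d)
        scores = torch.einsum('sd,bnd->bsn', q, k)         # (B, n_slots, n_spec)
        # Competition: only the top-k specialists may write to each slot.
        topk = scores.topk(self.k, dim=-1).indices
        mask = torch.full_like(scores, float('-inf')).scatter(-1, topk, 0.0)
        attn = (scores + mask).softmax(-1)
        workspace = torch.einsum('bsn,bnd->bsd', attn, k)  # (B, n_slots, d)
        # Broadcast: every specialist reads the updated workspace back.
        read = workspace.mean(dim=1, keepdim=True)         # (B, 1, d)
        return specialists + read
```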
In deep learning, models typically reuse the same parameters for all inputs. Mixture-of-Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model—with outrageous numbers of parameters—but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability—we address these with the Switch Transformer.
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities, and we show that large sparse models may be trained, for the first time, with lower-precision (bfloat16) formats. We design models based on T5-Base and T5-Large (Raffel et al., 2019) to obtain up to 7× increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings, where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to 1-trillion-parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4× speedup over the T5-XXL model.
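For reference, a condensed sketch of Switch-style top-1 routing with its load-balancing auxiliary loss (illustrative only; the real implementation adds expert capacity limits, selective-precision handling, and expert parallelism):

```python
# Condensed sketch of Switch-style top-1 routing with a load-balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                            # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        gate, idx = probs.max(dim=-1)                # each token goes to ONE expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        # Auxiliary loss: fraction of tokens per expert times mean router
        # probability per expert, encouraging a uniform split across experts.
        n = probs.shape[1]
        frac_tokens = torch.bincount(idx, minlength=n).float() / x.shape[0]
        frac_probs = probs.mean(dim=0)
        aux_loss = n * torch.sum(frac_tokens * frac_probs)
        return out, aux_loss
```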
Figure 1: Scaling and sample efficiency of Switch Transformers. Left Plot: Scaling properties for increasingly sparse (more experts) Switch Transformers. Right Plot: Negative log-perplexity.
…Appendix E: Relation of Upstream to Downstream Model Performance
There is no guarantee that a model’s quality on a pre-training objective will translate to downstream task results. Figure 13 presents the correlation of upstream model quality, for both dense and Switch models, on the C4 pre-training task with two downstream task measures: average SuperGLUE performance and TriviaQA score. We choose these two tasks because one probes the model’s reasoning and the other its factual knowledge.
Figure 13: Upstream pre-trained quality to downstream model quality. We correlate the upstream performance with downstream quality on both SuperGLUE and TriviaQA (SOTA recorded without SSM), reasoning and knowledge-heavy benchmarks, respectively (validation sets). We find that, as with the baseline, the Switch model scales with improvements in the upstream pre-training task. For SuperGLUE, we find a loosely linear relation between negative log perplexity and the average SuperGLUE score. However, the dense model often performs better for a fixed perplexity, particularly in the large-scale regime. Conversely, on the knowledge-heavy task, TriviaQA, we find that the Switch Transformer may follow an improved scaling relationship—for a given upstream perplexity, it does better than a dense counterpart. Further statistics (expensive to collect and left to future work) would be necessary to confirm these observations.
We find a consistent correlation, indicating that for both baseline and Switch models, improved pre-training leads to better downstream results. Additionally, for a fixed upstream perplexity we find that both Switch and dense models perform similarly in the small to medium model size regime. However, in the largest model regime (T5-11B/T5-XXL) our largest Switch models, as mentioned in Section 5.6, do not always translate their upstream perplexity well to downstream fine-tuning on the SuperGLUE task. This warrants future investigation and study to fully realize the potential of sparse models. Understanding the fine-tuning dynamics with expert-models is very complicated and is dependent on regularization, load-balancing, and fine-tuning hyper-parameters.
Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute. Although scaling has proven to be a sure-fire approach to better model quality, there are challenges on the path, such as computation cost, ease of programming, and efficient implementation on parallel devices.
GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts layers beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can be efficiently trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
Training convolutional networks (CNNs) that fit on a single GPU with minibatch stochastic gradient descent has become effective in practice. However, there is still no effective method for training large CNNs that do not fit in the memory of a few GPU cards, or for parallelizing CNN training.
In this work we show that a simple hard mixture of experts model can be efficiently trained to good effect on large-scale hashtag (multilabel) prediction tasks. Mixture of experts models are not new (Jacobs et al. 1991; Collobert et al. 2003), but in the past, researchers have had to devise sophisticated methods to deal with data fragmentation.
We show empirically that modern weakly-supervised datasets are large enough to support naive partitioning schemes where each data point is assigned to a single expert. Because the experts are independent, training them in parallel is easy, and evaluation is cheap for a model of this size. Furthermore, we show that we can use a single decoding layer for all the experts, allowing a unified feature embedding space.
We demonstrate that it is feasible (and in fact relatively painless) to train far larger models than could be practically trained with standard CNN architectures, and that the extra capacity can be well used on current datasets.
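A sketch of the hard-MoE recipe under stated assumptions: k-means stands in for the learned partitioning, `train_expert` is a user-supplied training routine, and the shared decoding layer described in the abstract is omitted:

```python
# Sketch of a hard mixture of experts for large-scale tagging: each data point
# is assigned to exactly one expert and the experts train independently on
# disjoint shards, which makes parallel training trivial.
from sklearn.cluster import KMeans

def train_hard_moe(features, labels, train_expert, n_experts=8):
    """features: (N, D) input embeddings; labels: (N,) hashtag ids."""
    gater = KMeans(n_clusters=n_experts).fit(features)
    assignment = gater.labels_                     # exactly one expert per data point
    experts = [train_expert(features[assignment == e], labels[assignment == e])
               for e in range(n_experts)]          # disjoint shards -> embarrassingly parallel
    return gater, experts

def predict(gater, experts, features):
    # Route each example to its expert at test time and decode with that expert.
    idx = gater.predict(features)
    return [experts[e].predict(f[None]) for e, f in zip(idx, features)]
```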
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000× improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
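A stripped-down sketch of the noisy top-k gating idea (simplified; it omits the load-balancing losses and the convolutional application between stacked LSTM layers described in the abstract):

```python
# Sketch of noisy top-k gating: only the k highest-scoring experts receive
# non-zero weights, so the rest are skipped entirely for this input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    def __init__(self, d_model, n_experts, k=4):
        super().__init__()
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)
        self.w_noise = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):                                    # x: (tokens, d_model)
        clean = self.w_gate(x)
        noisy = clean + torch.randn_like(clean) * F.softplus(self.w_noise(x))
        topk_val, topk_idx = noisy.topk(self.k, dim=-1)
        gates = torch.full_like(noisy, float('-inf')).scatter(-1, topk_idx, topk_val)
        return gates.softmax(dim=-1), topk_idx               # sparse combination weights
```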
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
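A minimal sketch of the distillation recipe with temperature-softened targets; the hyperparameters and the logit-averaging of the ensemble are illustrative choices, not the paper’s exact setup:

```python
# Distill an ensemble into a single student with soft targets at temperature T.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction='batchmean') * (T * T)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: use the averaged logits of an ensemble as the teacher signal.
teachers = [torch.randn(32, 10) for _ in range(3)]
student = torch.randn(32, 10, requires_grad=True)
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student, torch.stack(teachers).mean(0), labels)
```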
Mixture of experts (ME) is one of the most popular and interesting combining methods, which has great potential to improve performance in machine learning. ME is established based on the divide-and-conquer principle in which the problem space is divided between a few neural network experts, supervised by a gating network. In earlier works on ME, different strategies were developed to divide the problem space between the experts.
To survey and analyse these methods more clearly, we present a categorization of the ME literature based on this difference. Various ME implementations are classified into 2 groups, according to the partitioning strategies used and both how and when the gating network is involved in the partitioning and combining procedures. In the first group, the conventional ME and its extensions stochastically partition the problem space into a number of subspaces using a specially employed error function, and experts become specialized in each subspace. In the second group, the problem space is explicitly partitioned by a clustering method before the experts’ training process starts, and each expert is then assigned to one of these subspaces. Because the first group partitions the problem space implicitly through a tacit competitive process between the experts, we call it the mixture of implicitly localized experts (MILE); the second group is called the mixture of explicitly localized experts (MELE), as it uses pre-specified clusters.
The properties of both groups are investigated and compared with each other. An investigation of MILE versus MELE, discussing the advantages and disadvantages of each group, shows that the 2 approaches have complementary features. Moreover, the features of the ME method are compared with other popular combining methods, including boosting and negative correlation learning. As the investigated methods have complementary strengths and limitations, previous research that attempted to combine their features in integrated approaches is reviewed.
Moreover, some suggestions are proposed for future research directions.