notes/Scaling (Link Bibliography) · Gwern.net

notes/Scaling (Link Bibliography)

  1. Scaling-hypothesis#blessings-of-scale

  2. https://citeseerx.ist.psu.edu/viewdoc/download?doi=

  3. ⁠, Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, Matt Richardson (2016-03-17):

    Figure 1: Accuracy of student models with different architectures trained to mimic the CIFAR10 ensemble. The average performance of the 5 best models of each hyperparameter-optimization experiment is shown, together with dashed lines indicating the accuracy of the best and the fifth best model from each setting. The short horizontal lines at 10M parameters are the accuracy of models trained without compression on the original 0/​​​​1 hard targets.

    Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained.

    Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy as the deep convolutional teacher.

    Figure 1 summarizes the results in Table 2 for student models of different depth, number of convolutional layers, and number of parameters when trained to mimic the ensemble teacher model. Models trained on the ensemble logits are able to achieve accuracies previously unseen on CIFAR-10 for models with so few layers. Also, it is clear that there is a huge gap between the convolutional student models at the top of the figure, and the non-convolutional student models at the bottom of the figure: the most accurate student MLP has accuracy less than 75%, while the least accurate convolutional student model with the same number of parameters but only one convolutional layer has accuracy above 87%. And the accuracy of the convolutional student models increases further as more layers of convolution are added. Interestingly, the most accurate student MLPs with no convolutional layers have only 2 or 3 hidden layers; the student MLPs with 4 or 5 hidden layers are not as accurate.

    Comparing the student MLP with only one hidden layer (bottom of the graph) to the student with 1 convolutional layer clearly suggests that convolution is critical for this problem even when models are trained via distillation, and that it is very unlikely that a shallow non-convolutional model with 100 million parameters or less could ever achieve accuracy comparable to a convolutional model. It appears that if convolution is critical for teacher models trained on the original 0/​​​​1 hard targets, it is likely to be critical for student models trained to mimic these teacher models. Adding depth to the student MLPs without adding convolution does not substantially close this “convolutional gap”.

  4. https://arxiv.org/pdf/1603.05691.pdf#page=7

  5. ⁠, Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta (2017-07-10):

    The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between ‘enormous data’ and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically based on volume of training data size. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires vision community to not undervalue the data and develop collective efforts in building larger datasets.

  6. ⁠, Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou (2017-12-01):

    Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art.

    This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show generalization error scaling across a breadth of factors, resulting in power-law exponents—the “steepness” of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
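
The power-law learning curve described in this abstract can be illustrated with a small sketch. It assumes the simplest form, error ≈ a · D^(−b), and fits it by least squares in log-log space on synthetic data; the function name `fit_power_law` and all numbers are hypothetical, and the paper's small-data and irreducible-error regions are deliberately left out.

```python
import math

def fit_power_law(sizes, errors):
    """Fit log(error) = log(a) - b * log(size) by least squares; return (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic learning curve drawn from a known power law (hypothetical numbers).
sizes = [10 ** k for k in range(3, 8)]
errors = [5.0 * s ** -0.3 for s in sizes]
a, b = fit_power_law(sizes, errors)

# A hypothetical "better model" that halves the error at every data size:
# the prefactor changes, the fitted exponent does not.
shifted_errors = [0.5 * e for e in errors]
a_shifted, b_shifted = fit_power_law(sizes, shifted_errors)
```

In this toy setting the two fits share the exponent `b` and differ only in the prefactor, which is the pattern the abstract reports for model improvements.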

  7. ⁠, Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, Jesse Berent, Abhinav Gupta, Rahul Sukthankar, Luc Van Gool (2017-05-16):

    We present the 2017 WebVision Challenge, a public image recognition challenge designed for deep learning based on web images without instance-level human annotation. Following the spirit of previous vision challenges, such as ILSVRC, Places2 and PASCAL VOC, which have played critical roles in the development of computer vision by contributing to the community with large scale annotated data for model designing and standardized benchmarking, we contribute with this challenge a large scale web images dataset, and a public competition with a workshop co-located with CVPR 2017. The WebVision dataset contains more than 2.4 million web images crawled from the Internet by using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information is also included. A validation set and test set containing human annotated images are also provided to facilitate algorithmic development. The 2017 WebVision challenge consists of two tracks, the image classification task on WebVision test set, and the transfer learning task on PASCAL VOC 2012 dataset. In this paper, we describe the details of data collection and annotation, highlight the characteristics of the dataset, and introduce the evaluation metrics.

  8. ⁠, Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, Luc Van Gool (2017-08-09):

    In this paper, we present a study on learning visual recognition models from large scale noisy web data. We build a new database called WebVision, which contains more than 2.4 million web images crawled from the Internet by using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information along with those web images (eg., title, description, tags, etc.) are also crawled. A validation set and test set containing human annotated images are also provided to facilitate algorithmic development.

    Based on our new database, we obtain a few interesting observations: (1) the noisy web images are sufficient for training a good deep CNN model for visual recognition; (2) the model learnt from our WebVision database exhibits comparable or even better generalization ability than the one trained from the ILSVRC 2012 dataset when being transferred to new datasets and tasks; (3) a domain adaptation issue (a.k.a., dataset bias) is observed, which means the dataset can be used as the largest benchmark dataset for visual domain adaptation.

    Our new WebVision database and relevant studies in this work would benefit the advance of learning state-of-the-art visual models with minimum supervision based on web data.

  9. ⁠, Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang (2018-08-03):

    We present a simple yet efficient approach capable of training deep neural networks on large-scale weakly-supervised web images, which are crawled raw from the Internet by using text queries, without any human annotation. We develop a principled learning strategy by leveraging curriculum learning, with the goal of handling a massive amount of noisy labels and data imbalance effectively. We design a new learning curriculum by measuring the complexity of data using its distribution density in a feature space, and rank the complexity in an unsupervised manner. This allows for an efficient implementation of curriculum learning on large-scale web images, resulting in a high-performance CNN model, where the negative impact of noisy labels is reduced substantially. Importantly, we show by experiments that those images with highly noisy labels can surprisingly improve the generalization capability of the model, by serving as a manner of regularization. Our approaches obtain state-of-the-art performance on four benchmarks: WebVision, ImageNet, Clothing-1M and Food-101. With an ensemble of multiple models, we achieved a top-5 error rate of 5.2% on the WebVision challenge for 1000-category classification. This result was the top performance by a wide margin, outperforming second place by a nearly 50% relative error rate. Code and models are available at: https://github.com/MalongTech/CurriculumNet .

  10. ⁠, Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl (2018-11-08):

    Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance. Finally, we discuss the implications of our results on efforts to train neural networks much faster in the future. Our experimental data is publicly available as a database of 71,638,836 loss measurements taken over the course of training for 168,160 individual models across 35 workloads.

  11. ⁠, Sam McCandlish, Jared Kaplan, Dario Amodei, Dota Team (2018-12-14):

    In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.
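
A minimal sketch of the statistic this abstract refers to, assuming the "simple" noise scale is the trace of the per-example gradient covariance divided by the squared norm of the mean gradient. That definition comes from the body of the paper rather than the abstract, and the estimator below is a naive plug-in version (the paper describes less biased estimators); `simple_noise_scale` and the toy gradients are hypothetical.

```python
def simple_noise_scale(per_example_grads):
    """Plug-in estimate of tr(Sigma) / |G|^2 from per-example gradient vectors."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    # Mean gradient G, estimated as the average of the per-example gradients.
    mean = [sum(g[i] for g in per_example_grads) / n for i in range(dim)]
    # Trace of the covariance = sum of per-coordinate unbiased sample variances.
    trace_cov = sum(
        sum((g[i] - mean[i]) ** 2 for g in per_example_grads) / (n - 1)
        for i in range(dim)
    )
    grad_sq_norm = sum(m * m for m in mean)
    return trace_cov / grad_sq_norm

# Toy per-example gradients in two dimensions (hypothetical numbers).
grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, -2.0]]
noise_scale = simple_noise_scale(grads)
```

A larger value means the per-example gradients disagree more relative to their mean, which is the regime in which larger batches are expected to keep helping.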

  12. ⁠, Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, Nir Shavit (2019-09-27):

    The dependency of the generalization error of neural networks on model and dataset size is of critical importance both in practice and for understanding the theory of neural networks. Nevertheless, the functional form of this dependency remains elusive. In this work, we present a functional form which approximates well the generalization error in practice. Capitalizing on the successful concept of model scaling (eg., width, depth), we are able to simultaneously construct such a form and specify the exact models which can attain it across model/​​​​data scales. Our construction follows insights obtained from observations conducted over a range of model/​​​​data scales, in various model types and datasets, in vision and language tasks. We show that the form both fits the observations well across scales, and provides accurate predictions from small-scale to large-scale models and data.

  13. ⁠, Mingxing Tan, Quoc V. Le (2019-05-28):

    Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/​​​​width/​​​​resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.

    To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4× smaller and 6.1× faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.
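
The compound coefficient mentioned in this abstract can be sketched as follows. The sketch assumes depth, width, and resolution multipliers of the form α^φ, β^φ, and γ^φ, with base values α = 1.2, β = 1.1, γ = 1.15 as recalled from the paper body (they are not given in the abstract), and it assumes FLOPs grow roughly as depth × width² × resolution². The function names are hypothetical.

```python
# Base multipliers for depth, width, and resolution. These values are an
# assumption taken from the paper body, not from the abstract quoted here.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return the depth/width/resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }

def flops_multiplier(phi):
    """Approximate FLOPs growth, assuming FLOPs ~ depth * width^2 * resolution^2."""
    s = compound_scale(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2

for phi in range(4):
    print(phi, compound_scale(phi), round(flops_multiplier(phi), 2))
```

With these bases, each unit increase in φ multiplies the approximate FLOPs by about 1.92, so the single coefficient acts as a rough "doublings of compute" knob.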

  14. ⁠, Aran Komatsuzaki (2019-06-19):

    In unsupervised learning, collecting more data is not always a costly process unlike the training. For example, it is not hard to enlarge the 40GB WebText used for training by modifying its sampling methodology considering how many webpages there are in the Internet. On the other hand, given that training on this dataset already costs tens of thousands of dollars, training on a larger dataset naively is not cost-wise feasible. In this paper, we suggest to train on a larger dataset for only one epoch unlike the current practice, in which the unsupervised models are trained for from tens to hundreds of epochs. Furthermore, we suggest to adjust the model size and the number of iterations to be performed appropriately. We show that the performance of Transformer language model becomes dramatically improved in this way, especially if the original number of epochs is greater. For example, by replacing the training for 10 epochs with the one epoch training, this translates to 1.9–3.3× speedup in wall-clock time in our settings and more if the original number of epochs is greater. Under one epoch training, no overfitting occurs, and regularization method does nothing but slows down the training. Also, the curve of test loss over iterations follows power-law extensively. We compare the wall-clock time of the training of models with different parameter budget under one epoch training, and we show that size/iteration adjustment based on our proposed heuristics leads to 1–2.7× speedup in our cases. With the two methods combined, we achieve 3.3–5.1× speedup. Finally, we speculate various implications of one epoch training and size/iteration adjustment. In particular, based on our analysis we believe that we can reduce the cost to train the state-of-the-art models as BERT and GPT-2 dramatically, maybe even by the factor of 10.

  15. ⁠, Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez (2020-02-26):

    Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations.

    This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

  16. ⁠, Jorg Bornschein, Francesco Visin, Simon Osindero (2020-09-26):

    Highly overparametrized neural networks can display curiously strong generalization performance—a phenomenon that has recently garnered a wealth of theoretical and empirical research in order to better understand it. In contrast to most previous work, which typically considers the performance as a function of the model size, in this paper we empirically study the generalization performance as the size of the training set varies over multiple orders of magnitude. These systematic experiments lead to some interesting and potentially very useful observations; perhaps most notably that training on smaller subsets of the data can lead to more reliable model selection decisions whilst simultaneously enjoying smaller computational costs. Our experiments furthermore allow us to estimate Minimum Description Lengths for common datasets given modern neural network architectures, thereby paving the way for principled model selection taking into account Occams-razor.

  17. ⁠, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020-01-23):

    We study empirical scaling laws for language model performance on the cross-entropy loss.

    The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/​​​​dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget.

    Larger models are substantially more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping substantially before convergence.

    Figure 1: Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
    Figure 15: Far beyond the model sizes we study empirically, we find a contradiction between our equations for $L(C_{\min})$ and $L(D)$ due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
    3.2.1: Comparing to LSTMs and Universal Transformers: In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count n. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns.
    Appendix A: Summary of Power Laws
    Table 1: Summary of scaling laws: In this table we summarize the model size and compute scaling fits to equation (1.1) along with $N_{\mathrm{opt}}(C)$, with the loss in nats/token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while $N_{\mathrm{opt}}(C)$ comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language. [This is an updated scaling power law summary from Henighan et al 2020.]

  18. ⁠, Utkarsh Sharma, Jared Kaplan (2020-04-22):

    When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
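
As a worked illustration of the α ≈ 4/d prediction: if the loss scales as N^(−α), halving the loss requires multiplying the parameter count by 2^(1/α) = 2^(d/4). The sketch below just evaluates that arithmetic; the function names are hypothetical and the example dimensions are arbitrary, not values measured in the paper.

```python
def predicted_exponent(intrinsic_dim):
    """Scaling exponent predicted by the theory: alpha ~ 4 / d."""
    return 4.0 / intrinsic_dim

def params_factor_to_halve_loss(intrinsic_dim):
    """If L ~ N^(-alpha), halving L needs N to grow by 2^(1/alpha) = 2^(d/4)."""
    return 2.0 ** (1.0 / predicted_exponent(intrinsic_dim))

# Arbitrary example dimensions, chosen only to show the trend.
for d in (4, 16, 40):
    print(d, predicted_exponent(d), params_factor_to_halve_loss(d))
```

Under this assumption, higher intrinsic dimension means a shallower power law: a manifold of dimension 4 needs 2× the parameters to halve the loss, while dimension 16 needs 16×.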

  19. ⁠, Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish (2020-10-28):

    We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image ↔︎ text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains.

    The cross-entropy loss has an information theoretic interpretation as $S(\mathrm{True}) + D_{\mathrm{KL}}(\mathrm{True} \| \mathrm{Model})$, and the empirical scaling laws suggest a prediction for both the true data distribution’s entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8×8 resolution, and we can forecast the model size needed to achieve any given reducible loss (ie $D_{\mathrm{KL}}$) in nats/image for other resolutions.
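
The decomposition of cross-entropy into entropy plus KL divergence can be checked numerically on a toy discrete distribution. This is only an illustration of the identity; the distributions and function names are hypothetical, and natural logarithms are used so that all quantities are in nats.

```python
import math

def entropy(p):
    """Shannon entropy S(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy of model q under true distribution p, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy "true" and "model" distributions over three outcomes (hypothetical).
true_dist = [0.5, 0.25, 0.25]
model_dist = [0.4, 0.4, 0.2]

irreducible = entropy(true_dist)                 # plays the role of the irreducible loss
reducible = kl_divergence(true_dist, model_dist) # plays the role of the reducible loss
total = cross_entropy(true_dist, model_dist)
```

Here `total` equals `irreducible + reducible` up to floating-point error, and the reducible part vanishes only when the model matches the true distribution.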

    We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question “Is a picture worth a thousand words?”; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.

    …As we increase model and dataset sizes, optimization becomes increasingly efficient, until eventually learning curves begin to merge with the L(D) trend, so that there are no benefits to be gained from training for more than a single epoch.

    …We have argued that a single neural architecture, the Transformer, can be applied to the generative modeling of images, videos, multimodal data, and math, along with language. We identified common scaling laws for the loss achieved on all data modalities as a function of both model size and compute budget. As in the case of language, these results imply that larger models become more sample-efficient. Furthermore, we found that in some important cases, fine-tuned performance on downstream tasks also follows similar scaling laws. This suggests that trends in the generative modeling loss translate into advantages in practical capabilities.

    A greater surprise was the approximately universal trend (figure 2) for optimal model size as a function of the training compute budget; we did not anticipate that the exponent $N_{\mathrm{opt}} \propto C^{0.7}$ would be largely independent of the data distribution. This trend implies a dual trend for the number of tokens elapsed during optimized training, as a function of C or N, and leads to the conclusion that larger compute budgets should be “spent” mostly on larger models, rather than much longer training runs. So this lesson from language modeling [Kaplan et al 2020] generalizes. These empirical regularities beg for theoretical explanation: why do these scaling relations hold? The scaling laws also suggest a shift in perspective away from the particularities of neural architectures, loss functions, and training algorithms and towards the broader commonalities that appear when machine learning is studied across a large hierarchy of model, data, and compute scales. Work in ML often involves identifying specific deficiencies in current capabilities and remedying them through the alteration of models and algorithms. Perhaps many capabilities simply lie on a spectrum that can be continuously unlocked through increasing scale, as might be suggested by the meta-learning capabilities of the GPT-3 model [Brown et al 2020].
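
The "dual trend" for tokens can be made concrete with a short calculation. The sketch assumes training compute is proportional to parameters times tokens processed (C ∝ N · D), an approximation that is not stated in the passage; given an exponent of roughly 0.7 for optimal model size, the tokens processed then scale with an exponent of roughly 0.3. The names below are hypothetical.

```python
MODEL_EXPONENT = 0.7  # optimal model size grows roughly as C^0.7, per the passage

def growth_factors(compute_factor):
    """Return (model_growth, token_growth) when compute is multiplied by compute_factor.

    Assumes C is proportional to N * D, so the token exponent is 1 - 0.7 = 0.3.
    """
    model_growth = compute_factor ** MODEL_EXPONENT
    token_growth = compute_factor / model_growth
    return model_growth, token_growth

model_growth, token_growth = growth_factors(10.0)
print(f"10x compute: model x{model_growth:.2f}, tokens x{token_growth:.2f}")
```

Under these assumptions, a 10× larger budget goes into a model about 5× larger trained on only about 2× as many tokens, which is the sense in which compute is "spent" mostly on model size.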

    Figure 1: Smooth scaling of reducible loss across domains: We show power-law scaling laws for the reducible loss $L - L_\infty$ as a function of compute, where the irreducible loss $L_\infty$ is a fitted domain-dependent constant. Under plausible assumptions concerning the infinite data and compute limits, the irreducible loss estimates the entropy of the underlying data distribution, while the reducible loss approximates the KL divergence between the data and model distributions. In the case of language we use results from [BMR+20], and only show the full loss $L$.
    Table 1: Summary of scaling laws: In this table we summarize the model size and compute scaling fits to equation (1.1) along with $N_{\mathrm{opt}}(C)$, with the loss in nats/token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while $N_{\mathrm{opt}}(C)$ comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language.
    Figure 2: Optimal model size is consistent across domains: We display the optimal model size $N_{\mathrm{opt}}$ as a function of the training compute budget $C$. Not only does $N_{\mathrm{opt}}(C)$ behave as a power-law, but the behavior is remarkably similar for all data modalities.
    Figure 31: Q&A—We show the progression of simple Q&A capabilities of GPT-3 family models as we increase the parameter count [BMR+20]. We ask the model who the first and second president of the United States was. · Tiny models appear to have trouble understanding the question, and don’t place any substantial probability on the correct answer. Larger models understand that we’re requesting a US president, but fail to understand that the “second president” and “first president” are different requests, placing most of their weight for both questions on “George Washington”. Only larger models understand both aspects of the questions, answering both correctly.

    [See also: Figure 3 & Figure 11⁠.]

  20. ⁠, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020-05-28):

    Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions—something which current NLP systems still largely struggle to do.

    Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.

    Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

    …The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPUs. Previous work suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.

  21. ⁠, Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt (2020-09-07):

    GPT-3 model size vs Q&A

    We propose a new test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach human-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model’s academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings. (tests and code)

    [bigger = better:

    See also the paper.]

  22. ⁠, Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt (2021-03-05):

    Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

  23. ⁠, Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh (2021-01-31):

    Recently multimodal transformer models have gained popularity because their performance on language and vision tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors which can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality specific attention mechanisms. Finally, we show that successful losses used in the literature do not yield similar performance gains when used in multimodal transformers.

  24. ⁠, Danny Hernandez, Jared Kaplan, Tom Henighan, Sam McCandlish (2021-02-02):

    We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero.

    We calculate the effective data “transferred” from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size.

    We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.

    The effective data transferred is well-described by a power-law in the low-data regime: We use $D_T$ to represent the effective data transferred, ie. the amount of additional python data that a model of the same size trained on only python would have needed to achieve the same loss on python as a model pre-trained on language. Our notation is indicated visually in figure 1. The scaling law for transfer in equation 1.1 is at the core of many key insights and predictions in this work. We find the simplicity of this result very intriguing:

    $D_T = \text{effective data transferred} = k (D_F)^{\alpha} (N)^{\beta}$

    where $N$ is the number of non-embedding model parameters, and $D_F$ is the size of the fine-tuning data distribution.

    Figure 1: We display the performance of a 40M parameter Transformer model on python, both trained from scratch on python and pre-trained on text then fine-tuned on python. $D_T$ is the amount of additional python characters that a from-scratch model of the same size would have needed to achieve the same loss on python as a fine-tuned model. In the labeled example, we see that for a 40M parameter transformer fine-tuned on 3e5 characters, $D_T$ is approximately 1000× bigger than $D_F$. The less fine-tuning data is available, the more pre-training helps.
    Figure 2: In the low-data regime, we observe a good fit for over 4 orders of magnitude in model size and 3 orders of magnitude in fine-tuning dataset size. The fit equation is shown above in terms of $D_T$ for simplicity, but the fractional form is given by equation B.2. We show the omitted high data regime points in Appendix D. Details for the approach used to generate these fits are shown in Appendix C.
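
The power law above can be evaluated directly. The following is a minimal sketch, not the authors' code: the constants `K`, `ALPHA`, and `BETA` are illustrative placeholders (not the fitted values from the paper), and the function names are hypothetical. It also computes the factor by which pre-training "effectively multiplies" the fine-tuning dataset, taken here to be (D_F + D_T) / D_F.

```python
# Illustrative placeholder constants; NOT the fitted values from the paper.
K = 1.0e4
ALPHA = 0.2
BETA = 0.4


def effective_data_transferred(d_f, n_params, k, alpha, beta):
    """Power-law form D_T = k * D_F**alpha * N**beta (low-data regime only)."""
    return k * d_f ** alpha * n_params ** beta


def effective_multiplier(d_f, n_params, k, alpha, beta):
    """Factor by which pre-training multiplies the fine-tuning data: (D_F + D_T) / D_F."""
    d_t = effective_data_transferred(d_f, n_params, k, alpha, beta)
    return (d_f + d_t) / d_f


if __name__ == "__main__":
    # Hypothetical setting: 40M non-embedding parameters, 3e5 characters of fine-tuning data.
    print(effective_multiplier(3e5, 4e7, K, ALPHA, BETA))
```

Because `alpha` is below 1 in this sketch, the multiplier shrinks as the fine-tuning set grows, which matches the qualitative statement that pre-training helps most when fine-tuning data is scarce.
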
  25. ⁠, Christina Kim (2021-04-11):

    Building upon OpenAI’s recent work on scaling laws, my project explores how much pre-training on English helps when transferring across different languages.

    Here, I will discuss scaling laws discovered while fine-tuning across different languages with pre-trained English language models. Specifically, I found that a) pre-trained English models help most when learning German, then Spanish, and finally Chinese and b) transfer from English to Chinese, German, and Spanish scales predictably in terms of parameters, data, and compute.

    My experiments try to answer the question: How much does pre-training on English help when transferring across different languages as we vary the dataset size and model size?

    Effective Data Transfer:

    Figure 4: The performance of a 16M parameter transformer model on Chinese, both trained from scratch on Chinese and pre-trained on English then fine-tuned on Chinese.

    In my experiments, I wanted to find the effective data transferred for models trained on English text to Chinese, Spanish, and German text. The effective data transferred is defined as the amount of additional fine-tuning data that a model of the same size, trained on only that fine-tuning dataset, would have needed to achieve the same loss as a pre-trained model. In the figure above, each point is a 16M transformer trained to convergence on a dataset of X tokens. The total amount of data required for the model trained from scratch can be represented as De = Df + Dt where De is the total amount of effective data, Df is the amount of data needed for the fine-tuned model, and Dt is the amount of additional data needed for the trained from scratch model. Dt is the amount of data transferred from pre-training on English.
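
One way to read the definition De = Df + Dt: find the dataset size at which the from-scratch curve reaches the fine-tuned model's loss, then subtract the fine-tuning dataset size. The sketch below is an assumption about how this could be computed, not the author's code; the function name and the choice of log-log interpolation are hypothetical.

```python
import numpy as np


def effective_data_transferred(scratch_sizes, scratch_losses, finetune_size, finetune_loss):
    """Estimate Dt = De - Df by interpolating a from-scratch loss curve.

    De is the dataset size at which the from-scratch model would match
    `finetune_loss`; it is found by linear interpolation in log-log space.
    Losses outside the measured range are clamped by np.interp.
    """
    sizes = np.asarray(scratch_sizes, dtype=float)
    losses = np.asarray(scratch_losses, dtype=float)
    # np.interp needs increasing x-coordinates; loss falls as data grows, so sort by loss.
    order = np.argsort(losses)
    log_d_e = np.interp(np.log(finetune_loss), np.log(losses[order]), np.log(sizes[order]))
    d_e = float(np.exp(log_d_e))
    return d_e - finetune_size


if __name__ == "__main__":
    # Synthetic from-scratch curve (hypothetical numbers): loss = 10 * D**-0.1.
    sizes = np.array([1e3, 1e4, 1e5, 1e6])
    losses = 10.0 * sizes ** -0.1
    # A fine-tuned model trained on 8000 tokens that matches the scratch loss at 5e4 tokens.
    print(effective_data_transferred(sizes, losses, 8e3, 10.0 * 5e4 ** -0.1))
```
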

    Figure 5: Comparing the performance of 16M parameter transformers trained from scratch and fine-tuned on Chinese, Spanish, and German. For the dataset size of 8000 tokens, Dt, the amount of data transferred, is largest for German. The dashed lines on the graphs represent Dt. As the number of tokens in the dataset increases, Dt becomes smaller across all languages.

    As seen in the figures above, English to Chinese had a smaller amount of data transferred compared to English to Spanish for the same model size, and English to German had the greatest amount of data transferred. Pre-trained English text models help most when learning German, followed by Spanish, and finally, Chinese. I believe these results reflect the degree of linguistic similarities between English and the non-English languages. English and German are both derived from Proto-Germanic and are linguistically most similar. Although the Spanish alphabet shares almost all the same symbols with the English alphabet, it is a Romance language, and Chinese does not share the same alphabet as English. Each language has a distinctive shape and distance between fine-tuning and training from scratch. For instance, the effective data transfer is not much greater for Spanish than for Chinese at the smallest dataset size, 8000 tokens. However, as we increase the dataset size, pre-training continues to help Spanish for another order of magnitude, up to the 100M token dataset size, whereas Chinese converges at the 10M token dataset size.

    …I find many of the same trends and relationships found in the Scaling Law for Transfer between text and code, between English and different languages. In the low data regime, pre-training is helpful across model sizes, but especially in large model sizes…Lastly, pre-trained models are more compute efficient than training from-scratch across dataset sizes. This is without accounting for the compute costs for the pre-trained model.

  26. ⁠, Teven Le Scao, Alexander M. Rush (2021-03-15):

    When fine-tuning pretrained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. Proponents of prompting have argued that prompts provide a method for injecting task-specific guidance, which is beneficial in low-data regimes. We aim to quantify this benefit through rigorous testing of prompts in a fair setting: comparing prompted and head-based fine-tuning in equal conditions across many tasks and data sizes. By controlling for many sources of advantage, we find that prompting does indeed provide a benefit, and that this benefit can be quantified per task. Results show that prompting is often worth 100s of data points on average across classification tasks.

  27. ⁠, Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba (2021-07-07):

    We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
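
The repeated-sampling result is usually reported as pass@k: the chance that at least one of k samples passes the unit tests. A minimal sketch of the standard combinatorial estimator follows, assuming n samples were drawn per problem and c of them passed. The function names are hypothetical, and this is not code from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement from the n are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem_counts, k: int) -> float:
    """Average pass@k over problems given (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem_counts]
    return sum(scores) / len(scores)
```

For example, a problem with 1 correct sample out of 10 has an estimated pass@1 of about 0.1, while any problem with at least one correct sample has pass@n equal to 1.
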

  28. https://copilot.github.com/

  29. ⁠, Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton (2021-08-16):

    This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model’s ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model’s initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.
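
"Scales log-linearly with model size" means the score is roughly a straight line in the logarithm of the parameter count. A minimal sketch of fitting such a trend is below; the data points are made up for illustration and are not results from the paper, and the function names are hypothetical.

```python
import numpy as np


def fit_log_linear(param_counts, scores):
    """Least-squares fit of score = slope * log10(params) + intercept."""
    slope, intercept = np.polyfit(np.log10(param_counts), scores, deg=1)
    return float(slope), float(intercept)


def predict_score(param_count, slope, intercept):
    """Evaluate the fitted log-linear trend at a given parameter count."""
    return slope * float(np.log10(param_count)) + intercept


if __name__ == "__main__":
    # Hypothetical (parameter count, percent solved) pairs, for illustration only.
    params = [1e8, 1e9, 1e10, 1e11]
    solved = [10.0, 20.0, 30.0, 40.0]
    slope, intercept = fit_log_linear(params, solved)
    print(slope, intercept)
```

Here the slope is the gain in score per 10× increase in parameters. Such a fit cannot hold indefinitely for a bounded metric like accuracy, so extrapolation needs care.
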

  30. ⁠, Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi (2021-07-02; wikipedia⁠, ai):

    Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.

    To facilitate research of these complex error types, we introduce a new structured, crowdsourced error annotation schema called Scarecrow. The error categories used in Scarecrow—such as redundancy, commonsense errors, and incoherence—were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation to arrive at a schema which covers the error phenomena found in real machine generated text.

    We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English language news text, amounting to over 41k spans each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2-small through the largest GPT-3-175b. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique.

    Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems. We release our complete annotation toolkit and dataset at Github⁠.

    Figure 2: Average portion of tokens annotated with each span type (y-axis) across models (x-axis), with 95% confidence intervals.
    Figure 3: Average portion of tokens covered by span annotations, broken down by span type. All models, including GPT-3, use the same apples-to-apples decoding hyperparameters: top-p = 0.96, temperature = 1, and no frequency penalty. We scale each span by its token length, normalize by generation token lengths, and remove severity-1 Grammar and Usage errors (see §C).
    Figure 4: Taking the average span coverage (Figure 3) and removing reader issues (Technical Jargon and Needs Google), we plot values and 95% confidence intervals for all models, including all decoding hyperparameters we tested for GPT-3. We find a surprisingly large change in annotated errors depending on the decoding setting used.
    1. Scaling pays off to improve Encyclopedic, Commonsense, and Incoherent errors (Figure 2).

      These error categories decrease with in-domain training (Grover) and larger model size (GPT-3). Human text still shows the fewest of these kinds of errors.

    2. Scaling benefits plateau for Off-Prompt, Bad Math, and Grammar & Usage errors (Figure 2).

      These 3 error categories see a model plateau in error reduction when scaling to GPT-3. Of these error types, humans still commit fewer Off-Prompt (more: §6.1) and Grammar & Usage errors, but Bad Math appears saturated for our domain.

    3. Self-Contradiction and Redundant errors exhibit more complex scaling behavior (Figure 2).

      We roughly categorize these trends as rising and falling: increasing for medium or large-scale models, but dropping for human-authored text. Further analysis (§6.2, §6.3) reveals these more complex patterns are affected both by interactions with other error types, as well as how errors are counted.

    4. Human-authored text produces the most reader issues (Figure 2–3).

      The Needs Google and Technical Jargon span categories both follow a “humans highest” trend, and both fall under reader issues: problems that are not necessarily errors, but that still prevent full comprehension or factual verification of the text (more: §6.4).

      Furthermore, human-authored text is not free from error annotations (Figure 3). This can serve either as a control for baseline error rates (more: §6.6), or as a mechanism for critiquing human writing.

    5. Decoding hyperparameters have a huge impact (Figure 4).

      For the previous findings, we fix the sampling configuration for all models to an apples-to-apples setup for fair comparison: top-p = 0.96, (softmax) temperature = 1, and no frequency penalty (i.e., word repetition penalty; defined precisely in §5.2, Equation 1). To study the effects of these decoding settings, we annotate text generated by GPT-3 using a variety of values for top-p and temperature, both with and without a frequency penalty.

      To our surprise, the decoding hyperparameters considerably affected error rates (more: §6.5). As seen in Figure 4, the worst sampling procedure for GPT-3 (argmax sampling with no frequency penalty) performed even worse than GPT-2 XL. But the best sampling procedure (surprisingly, also argmax sampling, but with a frequency penalty) produced text with as few apparent SCARECROW error spans as those authored by humans (more: §6.6).
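
To make the decoding settings concrete, here is a minimal sketch of how top-p, temperature, and a frequency penalty can combine to form a next-token distribution. The penalty form is an assumption, not taken from the paper's Equation 1: it subtracts `frequency_penalty * count` from each token's logit. Temperature 0 is treated as argmax decoding, and the function name is hypothetical.

```python
import numpy as np


def next_token_distribution(logits, prev_token_counts, temperature=1.0,
                            top_p=0.96, frequency_penalty=0.0):
    """Return sampling probabilities over the vocabulary for the next token."""
    logits = np.asarray(logits, dtype=float)
    counts = np.asarray(prev_token_counts, dtype=float)
    # Assumed penalty form: lower the logit of tokens that were already generated.
    logits = logits - frequency_penalty * counts
    if temperature == 0:
        # Argmax decoding: all probability mass on the highest-scoring token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Nucleus (top-p) filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p, then renormalize.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()
```

With argmax decoding and no penalty, the same high-scoring token can win repeatedly. A positive penalty lowers the scores of already-emitted tokens, which is one plausible reason the two argmax settings differ so much.
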

    …We notice that a greater portion of errors in human-authored text were due to artifacts present in the text-only format of the Common Crawl. For example, links to other articles or advertisements sometimes appear in the middle of an article’s text. While annotators were quick to mark these spans, they reflect errors in formatting, not in writing. We partition these errors separately and exclude them from the subsequent calculations. GPT-3’s generations also sometimes exhibited what appeared to be formatting errors due to training on web-scraped text, though more rarely. For example, some generations contained Which? after vague noun phrases, which appear to be learned from Wikipedia, where under-specified information is tagged by an editor with this word. For fairness, we removed these errors from GPT-3’s tally as well, though they were few enough we do not plot them separately.

  31. ⁠, Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston (2020-04-28):

    Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

  32. ⁠, Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton (2020-06-17):

    One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to common approaches to semi-supervised learning for computer vision, we show that it is surprisingly effective for semi-supervised learning on ImageNet.

    A key ingredient of our approach is the use of big (deep and wide) networks during pretraining and fine-tuning. We find that, the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network. After fine-tuning, the big network can be further improved and distilled into a much smaller one with little loss in classification accuracy by using the unlabeled examples for a second time, but in a task-specific way. The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2, supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge.

    This procedure achieves 73.9% ImageNet top-1 accuracy with just 1% of the labels (≤13 labeled images per class) using ResNet-50, a 10× improvement in label efficiency over the previous state-of-the-art. With 10% of labels, ResNet-50 trained with our method achieves 77.5% top-1 accuracy, outperforming standard supervised training with all of the labels.

  33. 2020-chen.pdf#openai: ⁠, Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever (OpenAI) (2020-06-17; ai):

    Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.

    [See also ⁠.]

  34. ⁠, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen (2020-06-30):

    Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

  35. ⁠, William Fedus, Barret Zoph, Noam Shazeer (2021-01-11):

    In deep learning, models typically reuse the same parameters for all inputs. Mixture-of-Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model—with outrageous numbers of parameters—but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability—we address these with the Switch Transformer.

    We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7× increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to 1-trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4× speedup over the T5-XXL model.
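
The simplified routing can be pictured as top-1 gating: each token is sent to the single expert with the highest router probability, subject to a per-expert capacity. The sketch below is a rough illustration under stated assumptions, not the paper's implementation. The function name, the first-come capacity rule, and the handling of overflow tokens (marked as dropped) are all hypothetical simplifications.

```python
import numpy as np


def switch_route(token_reprs, router_weights, capacity_factor=1.0):
    """Top-1 routing sketch: returns (expert index, gate value, kept mask) per token."""
    logits = np.asarray(token_reprs, dtype=float) @ np.asarray(router_weights, dtype=float)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    num_tokens, num_experts = probs.shape
    expert_index = probs.argmax(axis=-1)              # one expert per token
    gate = probs[np.arange(num_tokens), expert_index]  # scales the expert's output
    # Each expert processes at most `capacity` tokens; later tokens overflow.
    capacity = int(np.ceil(capacity_factor * num_tokens / num_experts))
    load = np.zeros(num_experts, dtype=int)
    kept = np.zeros(num_tokens, dtype=bool)
    for t in range(num_tokens):
        e = expert_index[t]
        if load[e] < capacity:
            load[e] += 1
            kept[t] = True
    return expert_index, gate, kept
```

Because only one expert runs per token, adding experts grows the parameter count while per-token compute stays roughly constant. That is the trade-off the abstract describes.
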

    Figure 1: Scaling and sample efficiency of Switch Transformers. Left Plot: Scaling properties for increasingly sparse (more experts) Switch Transformers. Right Plot: Negative log-perplexity.

    Appendix E: Relation of Upstream to Downstream Model Performance

    There is no guarantee that a model’s quality on a pre-training objective will translate to downstream task results. Figure 13 presents the correlation of the upstream model quality, for both dense and Switch models, on the C4 pre-training task with two downstream task measures: average SuperGLUE performance and TriviaQA score. We choose these two tasks as one probes the model’s reasoning and the other factual knowledge.

    Figure 13: Upstream pre-trained quality to downstream model quality. We correlate the upstream performance with downstream quality on both SuperGLUE and TriviaQA (SOTA recorded without SSM), reasoning and knowledge-heavy benchmarks, respectively (validation sets). We find that, as with the baseline, the Switch model scales with improvements in the upstream pre-training task. For SuperGLUE, we find a loosely linear relation between negative log perplexity and the average SuperGLUE score. However, the dense model often performs better for a fixed perplexity, particularly in the large-scale regime. Conversely, on the knowledge-heavy task, TriviaQA, we find that the Switch Transformer may follow an improved scaling relationship—for a given upstream perplexity, it does better than a dense counterpart. Further statistics (expensive to collect and left to future work) would be necessary to confirm these observations.

    We find a consistent correlation, indicating that for both baseline and Switch models, improved pre-training leads to better downstream results. Additionally, for a fixed upstream perplexity we find that both Switch and dense models perform similarly in the small to medium model size regime. However, in the largest model regime (T5-11B/​​​​T5-XXL) our largest Switch models, as mentioned in Section 5.6⁠, do not always translate their upstream perplexity well to downstream fine-tuning on the SuperGLUE task. This warrants future investigation and study to fully realize the potential of sparse models. Understanding the fine-tuning dynamics with expert-models is very complicated and is dependent on regularization, load-balancing, and fine-tuning hyper-parameters.

  36. ⁠, An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, Hongxia Yang (2021-05-31):

    Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost, and thus they have become a trend in model scaling. Still, it is a mystery how MoE layers bring quality gains by leveraging the parameters with sparse activation.

    In this work, we investigate several key factors in sparse expert models. We observe that load imbalance may not be a major problem affecting model quality, contrary to the perspectives of recent studies, while the number of sparsely activated experts k and expert capacity C in top-k routing can substantially make a difference in this context. Furthermore, we take a step forward to propose a simple method called ‘expert prototyping’ that splits experts into different prototypes and applies k top-1 routing. This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.

    We push the model scale to over 1 trillion parameters and implement it on only 480 NVIDIA V100-32GB GPUs, in comparison with the recent SOTA on 2048 TPUs. The proposed giant model achieves substantial speedup in convergence over the same-size baseline.

  37. ⁠, Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit (2020-06-18):

    We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different pruned densities are interchangeable. We demonstrate the accuracy of this approximation over orders of magnitude in depth, width, dataset size, and density. We show that the functional form holds (generalizes) for large scale data (e.g., ImageNet) and architectures (e.g., ResNets). As neural networks become ever larger and costlier to train, our findings suggest a framework for reasoning conceptually and analytically about a standard method for unstructured pruning.

  38. Sparsity

  39. ⁠, Teven Le Scao (Hugging Face) (2020-06-08):

    [Discussion of DL scaling laws and how big = better, with interactive graphs to help visualize the multi-way relationship between dataset / model / validation-loss / FLOPS.]

    Research at Hugging Face also leverages this phenomenon, and we’ve combined it with speed estimations to ensure model size is just right for the compute budget of the experiment (when in doubt, it’s bigger than you think!). This blog post will show how this impacts architecture decisions on a standard language modeling benchmark: we replicate the 14-layer state-of-the-art result from Zhang et al.’s paper without any hyper-parameter optimization and saving 25% of training time. We also estimate that the 18-layer model from the same paper trained for an order of magnitude too many training steps. Wanna play with our demo before reading? Just click here!

    1. There is an optimal time to stop training (and it’s earlier than you think)

    2. GPUs are optimized for large, wide models

    3. Demonstration on a language modeling task: Wikitext-103

    4. Takeaways

      • Big models are surprisingly efficient!
      • Training until convergence is not efficient at all.
      • Benchmarking smaller-scale runs allows us to predict model performance and time for production-scale models.
      • Using larger models stopped earlier and optimizing model size for speed lowers training costs.
  40. ⁠, Yian Zhang, Alex Warstadt, Haau-Sing Li, Samuel R. Bowman (2020-11-09):

    NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data?

    We adopt four probing methods—classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks—and draw learning curves that track the growth of these different measures of linguistic ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words.

    We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features we test. A much larger quantity of data is needed in order to acquire enough common-sense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.

  41. ⁠, Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, Samuel R. Bowman (2020-10-11):

    One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa-base does demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.

  42. ⁠, Leo Z. Liu, Yizhong Wang, Jungo Kasai, Hannaneh Hajishirzi, Noah A. Smith (2021-04-16):

    Models of language trained on very large corpora have been demonstrated to be useful for NLP. As fixed artifacts, they have become the object of intense study, with many researchers “probing” the extent to which they acquire and readily demonstrate linguistic abstractions, factual and commonsense knowledge, and reasoning abilities. Building on this line of work, we consider a new question: for types of knowledge a language model learns, when during (pre)training are they acquired? We plot probing performance across iterations, using RoBERTa as a case study. Among our findings: linguistic knowledge is acquired fast, stably, and robustly across domains. Facts and commonsense are slower and more domain-sensitive. Reasoning abilities are, in general, not stably acquired. As new datasets, pretraining protocols, and probes emerge, we believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.

  43. ⁠, Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal (2021-01-05):

    We present a neural network that aims to address these problems: it is trained on a wide variety of images with a wide variety of natural language supervision that’s abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks, without directly optimizing for the benchmark’s performance, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This is a key change: by not directly optimizing for the benchmark, we show that it becomes much more representative: our system closes this “robustness gap” by up to 75% while matching the performance of the original ResNet-50 on ImageNet zero-shot without using any of the original 1.28M labeled examples.

    Approach: We show that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets. Our method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which out of a set of 32,768 randomly sampled text snippets, was actually paired with it in our dataset.

    In order to solve this task, our intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs cats, we check for each image whether a CLIP model predicts the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.
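
    The dog-vs-cat check described above can be sketched in a few lines of Python. This is a toy illustration, not the released CLIP code: the embeddings are made-up 3-dimensional vectors standing in for the outputs of the image and text encoders, and `zero_shot_classify` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the text prompt whose embedding is most similar to the image.

    prompt_embeddings maps each candidate caption to its text embedding.
    """
    return max(
        prompt_embeddings,
        key=lambda prompt: cosine(image_embedding, prompt_embeddings[prompt]),
    )


if __name__ == "__main__":
    # Placeholder vectors; a real system would obtain these from encoders.
    prompts = {
        "a photo of a dog": [1.0, 0.0, 0.2],
        "a photo of a cat": [0.0, 1.0, 0.2],
    }
    image = [0.9, 0.1, 0.3]
    print(zero_shot_classify(image, prompts))
```

    Changing the classifier to a new task then amounts to swapping the dictionary of prompts, with no retraining in this sketch.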

    1. CLIP is highly efficient…In the end, our best performing CLIP model trains on 256 GPUs for 2 weeks which is similar to existing large scale image models.
    2. CLIP is flexible and general: Because they learn a wide range of visual concepts directly from natural language, CLIP models are substantially more flexible and general than existing ImageNet models. We find they are able to zero-shot perform many different tasks. To validate this we have measured CLIP’s zero-shot performance on over 30 different datasets including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR. [While CLIP’s zero-shot OCR performance is mixed, its semantic OCR representation is quite useful. When evaluated on the SST-2 NLP dataset rendered as images, a linear classifier on CLIP’s representation matches a CBoW model with direct access to the text. CLIP is also competitive at detecting hateful memes without needing ground truth text.] In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models.

    …CLIP allows people to design their own classifiers and removes the need for task-specific training data. [See also Guzhov et al 2021; CLIP notebook compilation for art, “Alien Dreams: An Emerging Art Scene”/“AI Generated Art Scene Explodes as Hackers Create Groundbreaking New Tools”.]

  44. ⁠, Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig (2021-02-11):

    [blog] Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models.

    In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
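
    As a rough illustration of what a contrastive loss over a dual encoder computes, the Python sketch below scores every image in a batch against every text and rewards the matching pairs. It assumes the embeddings are already L2-normalized, uses a fixed temperature of 0.1, and sums an image-to-text and a text-to-image term; these are assumptions made for the example, not settings taken from the paper.

```python
import math


def log_softmax_row(row):
    """Log-softmax of one row of similarity scores."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_z for v in row]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss for a batch of aligned (image, text) pairs.

    Pair i is treated as the only positive for row i; every other item in
    the batch acts as a negative.
    """
    n = len(image_embs)
    sims = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    sims_t = [list(col) for col in zip(*sims)]  # text-to-image view
    i2t = -sum(log_softmax_row(sims[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax_row(sims_t[i])[i] for i in range(n)) / n
    return i2t + t2i


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # aligned pairs
    print(contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # mismatched pairs
```

    The loss is small when each image is closest to its own text and large when the pairs are mismatched, which is the alignment pressure the abstract describes.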

    We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.

  45. ⁠, Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut (2021-02-17):

    The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

  46. ⁠, Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen (2021-03-11):

    Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project ‘WenLan’ led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
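
    The "large queue-based dictionary" can be pictured as a fixed-size first-in, first-out buffer of embeddings from earlier batches, which are reused as negatives. The Python sketch below shows only that buffer; the class name `NegativeQueue` is hypothetical, and the momentum-updated encoder that MoCo pairs with the queue is deliberately left out.

```python
from collections import deque


class NegativeQueue:
    """Fixed-size FIFO of past embeddings reused as negative samples."""

    def __init__(self, max_size):
        # deque(maxlen=...) drops the oldest entries once the queue is full.
        self._items = deque(maxlen=max_size)

    def enqueue_batch(self, embeddings):
        """Add the newest batch of embeddings to the queue."""
        self._items.extend(embeddings)

    def negatives(self):
        """Return the currently stored embeddings, oldest first."""
        return list(self._items)


if __name__ == "__main__":
    queue = NegativeQueue(max_size=4)
    queue.enqueue_batch([[1.0], [2.0], [3.0]])
    queue.enqueue_batch([[4.0], [5.0]])
    print(queue.negatives())
```

    Because the queue outlives any single batch, the number of negatives is set by `max_size` rather than by how many samples fit on the GPUs at once.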

  47. ⁠, Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill (2021-06-25):

    When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.

  48. 2020-bell.pdf#facebook: ⁠, Sean Bell, Yiqun Liu, Sami Alsheikh, Yina Tang, Ed Pizzi, M. Henning, Karun Singh, Omkar Parkhi, Fedor Borisyuk (2020-08-22; ai):

    In this paper, we present GrokNet, a deployed image recognition system for commerce applications. GrokNet leverages a multi-task learning approach to train a single computer vision trunk. We achieve a 2.1× improvement in exact product match accuracy when compared to the previous state-of-the-art Facebook product recognition system. We achieve this by training on 7 datasets across several commerce verticals, using 80 categorical loss functions and 3 embedding losses. We share our experience of combining diverse sources with wide-ranging label semantics and image statistics, including learning from human annotations, user-generated tags, and noisy search engine interaction data. GrokNet has demonstrated gains in production applications and operates at Facebook scale.

  49. https://arxiv.org/abs/2108.05887

  50. ⁠, Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever (2021-02-24):

    Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

  51. ⁠, Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Mark Chen, Rewon Child, Vedant Misra, Pamela Mishkin, Gretchen Krueger, Sandhini Agarwal, Ilya Sutskever (2021-01-05):

    [Paper: Ramesh et al 2021. Re-implementation: DALL·E Mini (writeup). Availability through OA API still planned as of 2021-09-05.] DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.

    GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. [iGPT is another answer to the question of “how do we do images autoregressively, but not at the exorbitant cost of generating pixels 1 by 1?”; iGPT uses ‘super pixels’ & very small images, while DALL·E uses VAE ‘tokens’ corresponding roughly to small squares so the token sequence is relatively small, where the VAE does the actual compilation to raw pixels.] We extend these findings to show that manipulating visual concepts through language is now within reach.

    [3 DALL·E prompts: “an armchair in the shape of an avocado…” · “a store front that has the word ‘openai’ written on it…” · “the exact same cat on the top as a sketch on the bottom”]

    DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192. The images are preprocessed to 256×256 resolution during training. Similar to VQ-VAE, each image is compressed to a 32×32 grid of discrete codes using a discrete VAE that we pretrained using a continuous relaxation. We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes.
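
    The token counts quoted above fit together as simple arithmetic, worked through in the Python sketch below. The figures are the ones in the paragraph; the function name `dalle_sequence_stats` and the derived quantities are only illustrative.

```python
def dalle_sequence_stats(text_tokens=256, image_px=256, grid=32):
    """Derive sequence-level quantities from the sizes quoted in the text."""
    image_tokens = grid * grid  # one discrete code per grid cell
    return {
        "image_tokens": image_tokens,
        # Each code covers a square patch this many pixels on a side.
        "patch_side_px": image_px // grid,
        # Pixel positions summarized by a single image token.
        "pixels_per_token": (image_px * image_px) // image_tokens,
        # Longest combined text + image token stream.
        "max_sequence_length": text_tokens + image_tokens,
    }


if __name__ == "__main__":
    for name, value in dalle_sequence_stats().items():
        print(f"{name}: {value}")
```

    So a 256×256 image becomes 1024 tokens, each standing for an 8×8 pixel patch, and a full caption plus image occupies at most 1280 positions.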

    Capabilities: We find that DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language. We illustrate this using a series of interactive visuals in the next section. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with CLIP, but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside.

    1. Controlling attributes: We test DALL·E’s ability to modify several of an object’s attributes, as well as the number of times that it appears.

    2. Drawing multiple objects

    3. Visualizing perspective and three-dimensionality

    4. Visualizing internal and external structure

    5. Inferring contextual details

      We find that DALL·E is able to render the same scene in a variety of different styles, and can adapt the lighting, shadows, and environment based on the time of day or season: “a … of a capybara sitting in a field at sunrise”

    …With varying degrees of reliability, DALL·E provides access to a subset of the capabilities of a 3D rendering engine via natural language. It can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.

    Zero-shot visual reasoning: GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase “here is the sentence ‘a person walking his dog in the park’ translated into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” This capability is called zero-shot reasoning. We find that DALL·E extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way. [See also CLIP.]

    We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL·E’s aptitude for analogical reasoning problems by testing it on Raven’s Progressive Matrices, a visual IQ test that saw widespread use in the 20th century. Rather than treating the IQ test as a multiple-choice problem as originally intended, we ask DALL·E to complete the bottom-right corner of each image using argmax sampling, and consider its completion to be correct if it is a close visual match to the original. DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning, such as those in sets B and C. It is sometimes able to solve matrices that involve recognizing permutations and applying boolean operations, such as those in set D. The instances in set E tend to be the most difficult, and DALL·E gets almost none of them correct. For each of the sets, we measure DALL·E’s performance on both the original images, and the images with the colors inverted. The inversion of colors should pose no additional difficulty for a human, yet does generally impair DALL·E’s performance, suggesting its capabilities may be brittle in unexpected ways.

  52. ⁠, Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, Jie Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiaodong Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yong Li, Wei Lin, Jingren Zhou, Jie Tang, Hongxia Yang (2021-03-01):

    In this work, we construct the largest dataset for multimodal pretraining in Chinese, which consists of over 1.9TB images and 292GB texts that cover a wide range of domains.

    We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer, for unified pretraining on the data of single modality and multiple modalities. We scale the model size up to 10 billion and 100 billion parameters, and build the largest pretrained model in Chinese.

    We apply the model to a series of downstream applications, and demonstrate its outstanding performance in comparison with strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation, and show that the finetuned M6 can create high-quality images with high resolution and abundant details.

  53. ⁠, Alex Nichol, Prafulla Dhariwal (2021-02-18):

    Denoising diffusion probabilistic models (DDPM) are a class of generative models which have recently been shown to produce excellent samples.

    We show that with a few simple modifications, DDPMs can also achieve competitive log-likelihoods while maintaining high sample quality. Additionally, we find that learning variances of the reverse diffusion process allows sampling with an order of magnitude fewer forward passes with a negligible difference in sample quality, which is important for the practical deployment of these models. We additionally use precision and recall to compare how well DDPMs and GANs cover the target distribution. Finally, we show that the sample quality and likelihood of these models scale smoothly with model capacity and training compute, making them easily scalable.

    We release our code at Github⁠.

  54. ⁠, Jonathan Ho, Ajay Jain, Pieter Abbeel (2020-06-19):

    We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.

    Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.

    On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256×256 LSUN, we obtain sample quality similar to Progressive GAN. Our implementation is available at Github⁠.

  55. ⁠, Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song (2021-01-26):

    Large-scale datasets are the cornerstone of self-supervised representation learning. Existing algorithms extract learning signals by making certain assumptions about the data, e.g., spatio-temporal continuity and multimodal correspondence. Unfortunately, finding a large amount of data that satisfies such assumptions is sometimes not straightforward. This restricts the community to rely on datasets that require laborious annotation and/or manual filtering processes. In this paper, we describe a subset optimization approach for automatic dataset curation. Focusing on the scenario of audio-visual representation learning, we pose the problem as finding a subset that maximizes the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales. The most important benefit of our approach is scalability. We release the largest video dataset for audio-visual research collected automatically using our approach.

    Figure 4: Linear evaluation on downstream tasks. The top-1/​​​​5 accuracy (%) of video classification on UCF101 [66], audio classification on ESC-50 [58] and audio-visual classification on Kinetics-Sounds (KS) [4]. We group the results by the downstream tasks and by the scale of the pretrain datasets. Baselines are Kinetics-Sounds [4] (20K), VGG-Sound [11] (200K), and AudioSet [19] (2M).
  56. ⁠, Jasha Droppo, Oguz Elibol (2021-06-11):

    There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training data, model parameter count, or training computation budget. In this paper, we demonstrate that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws. We extend previous work to jointly predict loss due to model size, to training set size, and to the inherent “irreducible loss” of the task. We find that the scaling laws accurately match model performance over 2 orders of magnitude in both model size and training set size, and make predictions about the limits of model performance.

    …The context module is a sequence-to-sequence model that converts the encoded input sequence into a sequence of context vectors. We experiment with 2 different designs for the context model: the LSTM and the Transformer. To maintain causality, the LSTMs are uni-directional and the Transformer uses masking to prevent the network from using future frames…All acoustic data used in this paper was drawn from a 23,000 hour corpus of untranscribed, de-identified, far-field, English voice command and voice query speech collected from home environments [?]. This data is presented to the network as a series of log-Mel frequency filterbank feature vectors, at a rate of 100 vectors per second of audio. Although this data is not publicly available, the authors believe that the phenomena described in this paper should apply to any similar set of speech recordings.

    Figure 5: Development set loss for both LSTM and Transformer models for models with the indicated number of layers. The dashed line represents the computationally efficient frontier defined in Eq. 4.

    When a model reaches L(C), it means that a different model with enough capacity, but with fewer parameters, would need more computation and more data to reach the same loss value. Alternatively, a model with more parameters would need more computation and less data to reach the same loss value.

    Where curves for 2 experiments meet, it is an indication that the same amount of compute can reach the given loss value through 2 different methods. One can either use more parameters and fewer data, or use fewer parameters and more data.

    The constant L is 0.306 in both figures. This represents a shared asymptote between the LSTM and Transformer systems, which will never be surpassed, regardless of the computational or data budget. The fact that the same asymptote applies to both systems hints that irreducible loss is indeed a fundamental property of the data and not the model. Additionally, this constant is similar to the value found in Section 3.1. The authors suspect that the constants should be identical, but our precision in measuring it is limited.

    The LSTM models exhibit a compute-efficient frontier with a slope of −0.167. A doubling of computation yields a 10.9% reduction in objective function. A halving of objective function would come with a 63.5× increase in computation. The slope of the compute-efficient frontier for Transformer models is −0.197. When computation is increased by a factor of r, then the reducible loss will be changed by a factor of r^(−0.197). At that rate, a doubling of computation yields a 12.7% reduction in objective function. A halving of objective function would come with a 33.7× increase in computation. [These results are consistent with LSTMs vs Transformers on text.]
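
    The percentages and multipliers quoted above follow directly from the two slopes, as the short Python check below illustrates. It assumes the power-law reading given in the text (scaling compute by r scales the reducible loss by r raised to the slope); the function names are made up for the example.

```python
def reduction_from_doubling(slope):
    """Percent drop in reducible loss when compute doubles (slope < 0)."""
    return (1.0 - 2.0 ** slope) * 100.0


def compute_factor_to_halve(slope):
    """Compute multiplier r that solves r ** slope == 0.5."""
    return 0.5 ** (1.0 / slope)


if __name__ == "__main__":
    for name, slope in [("LSTM", -0.167), ("Transformer", -0.197)]:
        drop = reduction_from_doubling(slope)
        factor = compute_factor_to_halve(slope)
        print(f"{name}: doubling compute cuts loss by {drop:.2f}%, "
              f"halving loss needs {factor:.1f}x compute")
```

    Under this reading the quoted reductions apply to the reducible part of the loss, that is, the loss above the shared asymptote, rather than to the total objective.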

    The difference in slope between the LSTM and Transformer experiments indicates that the Transformer architecture makes more efficient use of increased model parameters and increased training data. Although the LSTM is superior to the Transformer at smaller model sizes, if these trends continue as the model size grows, the Transformer will eventually be more efficient.

    Finally, the experimental data show that larger models learn more quickly from the same amount of data. Each of the points plotted in Figure 5 represents the consumption of an additional 25,000 minibatches of training data. At the first, second, or third point, each model has processed the same data, but the larger models have achieved better accuracy on the held-out development set.

  57. ⁠, Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli (2020-06-24):

    This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to a comparable system. Our approach enables a single multilingual speech recognition model which is competitive to strong individual models. Analysis shows that the latent discrete speech representations are shared across languages with increased sharing for related languages. We hope to catalyze research in low-resource speech understanding by releasing XLSR-53, a large model pretrained in 53 languages.

  58. ⁠, Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma (2021-04-30):

    Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity.

    We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours. We adopt to efficiently scale up to 10B parameters.

    Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck—our 500M-param model is already better than monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days—the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth usually works better than width and large encoders tend to do better than large decoders.

    Figure 1: WER performance (%) vs. (a) training steps, (b) TPU days and (c) language. Systems with * use LSTM decoders.

  59. ⁠, Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux (2021-01-02):

    We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning under challenging out-of-domain settings. We will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license.

  60. ⁠, Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau (2021-04-14):

    In this paper, we improve speech translation (ST) through effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with ⁠.

    Our experiments improve over the previous state of the art by 2.6 BLEU on average on all 4 considered CoVoST 2 language pairs via a simple recipe of combining pretraining, a single iteration of self-training and decoding with a language model.

    Different to existing work, our approach does not leverage any other supervision than ST data. Code and models will be publicly released.

  61. ⁠, Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed (2021-06-14):

    Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.

  62. ⁠, Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski (2021-03-02):

    Figure 1: Performance of large pretrained models on ImageNet. We pretrain our SEER models on uncurated and random images. They are RegNet architectures [40] trained with the SwAV self-supervised method [7]. We compare with the original models trained in Caron et al. [7] as well as the pretraining on curated data from SimCLRv2 [9] and ViT [14]. The network architectures are different. We report the top-1 accuracy after finetuning on ImageNet.

    Recently, self-supervised learning methods like ⁠, ⁠, and SwAV have reduced the gap with supervised methods. These results have been achieved in a control environment, that is the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset.

    In this work, we explore if self-supervision lives up to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real world setting. Interestingly, we also observe that self-supervised models are good few-shot learners achieving 77.9% top-1 with access to only 10% of ImageNet. Code: this URL⁠. [See also ⁠, Hénaff et al 2021; ⁠, Caron et al 2021]

    Figure 6: (left) Impact of number of updates. We compare the quality of a RegNetY-128GF after different number of updates of an online pretraining on 1B images. For both studies, we report the relative improvement in top-1 accuracy for a linear evaluation of frozen features on ImageNet. (right) Impact of number of unique images. We compare the impact of the size of the training set for a RegNetY-8GF and a RegNetY-16GF pretrained for the same number of updates. The number of updates corresponds to 1 epoch for 1B images, 32 epochs for 32M images and 1K for 1M images.

  63. ⁠, Piotr Dollár, Mannat Singh, Ross Girshick (2021-03-11):

    In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and consequently representational power.

    Example scaling strategies may include increasing model width, depth, resolution, etc. While various scaling strategies exist, their tradeoffs are not fully understood. Existing analysis typically focuses on the interplay of accuracy and FLOPS (floating point operations). Yet, as we demonstrate, various scaling strategies affect model parameters, activations, and consequently actual runtime quite differently. In our experiments we show the surprising result that numerous scaling strategies yield networks with similar accuracy but with widely varying properties.

    This leads us to propose a simple fast compound scaling strategy that encourages primarily scaling model width, while scaling depth and resolution to a lesser extent. Unlike currently popular scaling strategies, which result in about 𝑂(s) increase in model activation w.r.t. scaling FLOPS by a factor of s, the proposed fast compound scaling results in close to 𝑂(√s) increase in activations, while achieving excellent accuracy. This leads to comparable speedups on modern memory-limited hardware (eg., GPU, TPU).
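
    The gap between 𝑂(s) and 𝑂(√s) activation growth can be illustrated with a toy cost model. The sketch below assumes a single convolutional stage with depth d, width w and resolution r, where FLOPS grow as d·w²·r² and activations as d·w·r². That model and the function name are assumptions for illustration, and only the pure single-dimension strategies are shown, not the paper's compound rule.

```python
import math
from typing import Tuple


def scale_network(s: float, strategy: str) -> Tuple[float, float]:
    """Return (flops_multiplier, activation_multiplier) for a FLOPS scaling factor s.

    Assumed cost model (illustrative only):
      FLOPS ~ d * w**2 * r**2, activations ~ d * w * r**2.
    """
    if strategy == "depth":          # d -> s * d
        d, w, r = s, 1.0, 1.0
    elif strategy == "width":        # w -> sqrt(s) * w
        d, w, r = 1.0, math.sqrt(s), 1.0
    elif strategy == "resolution":   # r -> sqrt(s) * r
        d, w, r = 1.0, 1.0, math.sqrt(s)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    flops = d * w ** 2 * r ** 2
    activations = d * w * r ** 2
    return flops, activations


for strategy in ("depth", "width", "resolution"):
    flops, acts = scale_network(4.0, strategy)
    print(f"{strategy:<10} FLOPS x{flops:.1f}  activations x{acts:.1f}")
```

    Under this assumed model every strategy reaches the same FLOPS multiplier, but only width scaling keeps activation growth at √s, which is consistent with the abstract's emphasis on scaling primarily width.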

    More generally, we hope this work provides a framework for analyzing and selecting scaling strategies under various computational constraints.

  64. ⁠, Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph (2021-03-13):

    Novel computer vision architectures monopolize the spotlight, but the impact of the model architecture is often conflated with simultaneous changes to training methodology and scaling strategies. Our work revisits the canonical ResNet (He et al., 2015) and studies these three aspects in an effort to disentangle them. Perhaps surprisingly, we find that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models. We show that the best performing scaling strategy depends on the training regime and offer two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended (Tan & Le, 2019). Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7×–2.7× faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7× faster than EfficientNet NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. We recommend practitioners use these simple revised ResNets as baselines for future research.

  65. ⁠, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov (2019-11-05):

    This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.

  66. ⁠, Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau (2021-05-02):

    Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models dubbed XLM-R XL and XLM-R XXL outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests pretrained models with larger capacity may obtain both strong performance on high-resource languages while greatly improving low-resource languages. We make our code and models publicly available.

  67. ⁠, Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang (2021-07-05):

    Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown that scaling up pre-trained language models can improve their generalization abilities. Particularly, the GPT-3 model with 175 billion parameters shows its strong task-agnostic zero-shot/​​​​few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as linguistic knowledge and world knowledge. In addition, most large-scale models are trained in an auto-regressive way. As a result, this kind of traditional fine-tuning approach demonstrates relatively weak performance when solving downstream language understanding tasks. In order to solve the above problems, we propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models. It fuses auto-regressive network and auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks with zero-shot learning, few-shot learning or fine-tuning. We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. Empirical results show that the model outperforms the state-of-the-art models on 54 Chinese NLP tasks, and its English version achieves the first place on the SuperGLUE benchmark (July 3, 2021), surpassing the human performance by +0.8% (90.6% vs. 89.8%).

  68. ⁠, Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer (2021-06-08):

    Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results, therefore, understanding a model’s scaling properties is a key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.

  69. ⁠, Sébastien Bubeck, Mark Sellke (2021-05-26):

    Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.

  70. ⁠, Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, Percy Liang (2021-08-16; ai/scaling, economics, biology):

    AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL·E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character.

    This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations).

    Though foundation models are based on conventional deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties.

    To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

    • Introduction
      • Emergence and homogenization
      • Social impact and the foundation models ecosystem
      • The future of foundation models
      • Overview of this report
    • Capabilities
      • Language
      • Vision
      • Robotics
      • Reasoning and search
      • Interaction
      • Philosophy of understanding
    • Applications
      • Healthcare and biomedicine
      • Law
      • Education
    • Technology
      • Modeling
      • Training
      • Adaptation
      • Evaluation
      • Systems
      • Data
      • Security and privacy
      • Robustness to distribution shifts
      • AI safety and alignment
      • Theory
      • Interpretability
    • Society
      • Inequity and fairness
      • Misuse
      • Environment
      • Legality
      • Economics
      • Ethics of scale
    • Conclusion

    [Rohin Shah discussion:

    The history of AI is one of increasing emergence and homogenization. With the introduction of machine learning, we moved from a large proliferation of specialized algorithms that specified how to compute answers to a small number of general algorithms that learned how to compute answers (i.e. the algorithm for computing answers emerged from the learning algorithm). With the introduction of deep learning, we moved from a large proliferation of hand-engineered features for learning algorithms to a small number of architectures that could be pointed at a new domain and discover good features for that domain. Recently, the trend has continued: we have moved from a large proliferation of trained models for different tasks to a few large “foundation models” which learn general algorithms useful for solving specific tasks. BERT and GPT-3 are central examples of foundation models in language; many NLP tasks that previously required different models are now solved using finetuned or prompted versions of BERT and/​​​​or GPT-3.

    Note that, while language is the main example of a domain with foundation models today, we should expect foundation models to be developed in an increasing number of domains over time. The authors call these “foundation” models to emphasize that (1) they form a fundamental building block for applications and (2) they are not themselves ready for deployment; they are simply a foundation on which applications can be built. Foundation models have been enabled only recently because they depend on having large scale in order to make use of large unlabeled datasets using self-supervised learning to enable effective transfer to new tasks. It is particularly challenging to understand and predict the capabilities exhibited by foundation models because their multitask nature emerges from the large-scale training rather than being designed in from the start, making the capabilities hard to anticipate. This is particularly unsettling because foundation models also lead to substantially increased homogenization, where everyone is using the same few models, and so any new emergent capability (or risk) is quickly distributed to everyone.

    The authors argue that academia is uniquely suited to study and understand the risks of foundation models. Foundation models are going to interact with society, both in terms of the data used to create them and the effects on people who use applications built upon them. Thus, analysis of them will need to be interdisciplinary; this is best achieved in academia due to the concentration of people working in the various relevant areas. In addition, market-driven incentives need not align well with societal benefit, whereas the research mission of universities is the production and dissemination of knowledge and creation of global public goods, allowing academia to study directions that would have large societal benefit that might not be prioritized by industry.

    All of this is just a summary of parts of the introduction to the report. The full report is over 150 pages and goes into detail on capabilities, applications, technologies (including technical risks), and societal implications. I’m not going to summarize it here, because it is long and a lot of it isn’t that relevant to alignment; I’ll instead note down particular points that I found interesting.

    • (pg. 26) Some studies have suggested that foundation models in language don’t learn linguistic constructions robustly; even if they use it well once, they may not do so again, especially under distribution shift. In contrast, humans can easily “slot in” new knowledge into existing linguistic constructions.
    • (pg. 34) This isn’t surprising but is worth repeating: many of the capabilities highlighted in the robotics section are very similar to the ones that we focus on in alignment (task specification, robustness, safety, sample efficiency).
    • (pg. 42) For tasks involving reasoning (e.g. mathematical proofs, program synthesis, drug discovery, computer-aided design), neural nets can be used to guide a search through a large space of possibilities. Foundation models could be helpful because (1) since they are very good at generating sequences, you can encode arbitrary actions (e.g. in theorem proving, they can use arbitrary instructions in the proof assistant language rather than being restricted to an existing database of theorems), (2) the heuristics for effective search learned in one domain could transfer well to other domains where data is scarce, and (3) they could accept multimodal input: for example, in theorem proving for geometry, a multimodal foundation model could also incorporate information from geometric diagrams.
    • (Section 3) A substantial portion of the report is spent discussing potential applications of foundation models. This is the most in-depth version of this I have seen; anyone aiming to forecast the impacts of AI on the real world in the next 5–10 years should likely read this section. It’s notable to me how nearly all of the applications have an emphasis on robustness and reliability, particularly in truth-telling and logical reasoning.
    • (Section 4.3) We’ve seen a few ways (AN #152, AN #155) in which foundation models can be adapted. This section provides a good overview of the various methods that have been proposed in the literature. Note that adaptation is useful not just for specializing to a particular task like summarization, but also for enforcing constraints, handling distributional shifts, and more.
    • (pg. 92) Foundation models are commonly evaluated by their performance on downstream tasks. One limitation of this evaluation paradigm is that it makes it hard to distinguish between the benefits provided by better training, data, adaptation techniques, architectures, etc. (The authors propose a bunch of other evaluation methodologies we could use.)
    • (Section 4.9) There is a review of AI safety and AI alignment as it relates to foundation models, if you’re interested. (I suspect there won’t be much new for readers of this newsletter.)
    • (Section 4.10) The section on theory emphasizes studying the pretraining-adaptation interface, which seems quite good to me. I especially liked the emphasis on the fact that pretraining and adaptation work on different distributions, and so it will be important to make good modeling assumptions about how these distributions are related.]

  71. ⁠, Yun Zeng, Siqi Zuo, Dongcai Shen (2020-04-17; ai/scaling):

    One of the limitations of deep learning models with sparse features today stems from the predefined nature of their input, which requires a dictionary be defined prior to the training. With this paper we propose both a theory and a working system design which remove this limitation, and show that the resulting models are able to perform better and efficiently run at a much larger scale. Specifically, we achieve this by decoupling a model’s content from its form to tackle architecture evolution and memory growth separately. To efficiently handle model growth, we propose a new neuron model, called DynamicCell, drawing inspiration from the to introduce the concept of reaction to discharge non-digestive energy, which also subsumes gradient descent based approaches as its special cases. We implement DynamicCell by introducing a new server into TensorFlow to take over most of the work involving model growth. Consequently, it enables any existing deep learning models to efficiently handle arbitrary number of distinct sparse features (eg., search queries), and grow incessantly without redefining the model. Most notably, one of our models, which has been reliably running in production for over a year, is capable of suggesting high quality keywords for advertisers of Google Smart Campaigns and achieved substantial accuracy gains based on a challenging metric—evidence that data-driven, self-evolving systems can potentially exceed the performance of traditional rule-based approaches.

  72. ⁠, Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie Amy Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov, Maxim Naumov, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, Vijay Rao (2021-04-12):

    Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.

    We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain 40× speedup in terms of time to solution over previous systems. We achieve this by

    1. designing the ZionEX platform with dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport;
    2. implementing an optimized PyTorch-based training stack supporting both model and data parallelism;
    3. developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions, and load balancing them across multiple workers;
    4. adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates;
    5. leveraging reduced precision communications, multi-level memory hierarchy (+DDR+SSD) and pipelining.
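
    The abstract does not spell out the sharding algorithms in item 3, so the following is only a hypothetical illustration of the general idea: split embedding tables row-wise into bounded shards, then place shards greedily (largest first) on the least-loaded worker. The table names, sizes, and both helper functions are made up for this sketch and are not the paper's method.

```python
import heapq
from typing import Dict, List, Tuple

Shard = Tuple[str, int]  # (shard name, number of rows)


def shard_rows(table: str, num_rows: int, max_rows: int) -> List[Shard]:
    """Split one embedding table row-wise into shards of at most max_rows rows."""
    shards = []
    start = 0
    while start < num_rows:
        size = min(max_rows, num_rows - start)
        shards.append((f"{table}[{start}:{start + size}]", size))
        start += size
    return shards


def balance(shards: List[Shard], num_workers: int) -> Dict[int, List[str]]:
    """Greedy placement: largest shard first, onto the currently least-loaded worker."""
    heap = [(0, worker) for worker in range(num_workers)]  # (load in rows, worker id)
    heapq.heapify(heap)
    placement: Dict[int, List[str]] = {worker: [] for worker in range(num_workers)}
    for name, size in sorted(shards, key=lambda shard: shard[1], reverse=True):
        load, worker = heapq.heappop(heap)
        placement[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return placement


# Hypothetical table sizes, in rows.
tables = {"user": 10, "item": 6, "ad": 3}
all_shards = [s for name, rows in tables.items() for s in shard_rows(name, rows, max_rows=4)]
for worker, names in balance(all_shards, num_workers=2).items():
    print(worker, names)
```

    A production system would presumably also weigh column-wise splits, access frequency and memory tiers; row count is used here as the only cost to keep the sketch short.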

    Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.

  73. https://www.microsoft.com/en-us/research/blog/make-every-feature-binary-a-135b-parameter-sparse-neural-network-for-massively-improved-search-relevance/

  74. FC

  75. #urban-et-al-2016

  76. ⁠, Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy (2021-05-04):

    [blog] Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the ⁠, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary.

    We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (ie. “mixing” the per-location features), and one with MLPs applied across patches (ie. “mixing” spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
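
    The two layer types described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it uses ReLU as a stand-in nonlinearity, omits normalization, and all helper names and weight shapes are assumptions. The input is a table of shape (patches, channels).

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def mlp(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Two-layer MLP applied to every row of x independently (ReLU is an assumption)."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def mixer_block(x: Matrix, token_w1: Matrix, token_w2: Matrix,
                chan_w1: Matrix, chan_w2: Matrix) -> Matrix:
    """One simplified Mixer block on x of shape (patches, channels)."""
    # Mix across patches: transpose so each row holds one channel over all patches.
    x = add(x, transpose(mlp(transpose(x), token_w1, token_w2)))
    # Mix per patch: each row is one patch's channel vector.
    return add(x, mlp(x, chan_w1, chan_w2))
```

    Here `token_w1` has shape (patches, hidden) and `token_w2` (hidden, patches), while `chan_w1` and `chan_w2` have shapes (channels, hidden) and (hidden, channels); the residual additions keep the output the same shape as the input.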

    We hope that these results spark further research beyond the realms of well established CNNs and Transformers.

  77. ⁠, Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le (2021-05-17):

    Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.

  78. ⁠, Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving (2019-09-18):

    Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.

  79. ⁠, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano (2020-09-02):

    As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about—summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.

  80. ⁠, hippke (2020-08-05; economics/experience-curve, ai/scaling):

    How can we measure a potential AI or hardware overhang? For the problem of chess, modern algorithms gained two orders of magnitude in compute (or ten years in time) compared to older versions. While it took the supercomputer “Deep Blue” to win over world champion Garry Kasparov in 1997, today’s program achieves the same ELO level on a 486-DX4-100 MHz from 1994. In contrast, the scaling of neural network chess algorithms to slower hardware is worse (and more difficult to implement) compared to classical algorithms. Similarly, future algorithms will likely be able to better leverage today’s hardware by 2–3 orders of magnitude. I would be interested in extending this scaling relation to AI problems other than chess to check its universality.

    …We may wonder: How do modern (better) chess algorithms perform on slower hardware? I tested this with Stockfish version 8 (SF8), one of the strongest classical chess engines. I simulated 10k matches of SF8 against slower versions of itself and a ⁠, using cutechess-cli. In these benchmarks, I varied the total number of nodes to be searched during each game. I kept the RAM constant (this may be unrealistic for very old machines, see below). By assuming a fixed thinking time per game, the experiments scale out to slower machines. By cross-correlating various old benchmarks of Stockfish and other engines on older machines, I matched these ratings to units of MIPS; and finally, MIPS approximately to the calendar year. Depending on the actual release dates of the processors, the year axis has a jitter up to 2 years. I estimate the error for the compute estimates to be perhaps 20%, and certainly less than 50%. As we will see, the results measure in orders of magnitude, so that these errors are small in comparison (<10%).

    Results: SF8 achieves Kasparov’s 2850 ELOs running on a 486-100 MHz introduced in 1994, three years before the Kasparov-Deep Blue match. These ELOs refer to tournament conditions as in the 1997 IBM games. In other words, with today’s algorithms, computers would have beat the world chess champion already in 1994 on a contemporary desk computer (not a supercomputer). [Followup: “A closer look at chess scalings (into the past)”/“Benchmarking an old chess engine on new hardware”: “How much more compute does Stockfish 3 require to match Stockfish 13? Answer: 32× (uncertainty: 30–35×)…Interpretation: If we accept SF as amongst the very best chess programs in the last decade, we can make a more general assessment of chess compute vs. algorithm. Compute explains 30–50% of the computer chess ELO progress; algorithm improvements explain 50–70%.”]

    Chess Elo vs MIPS, 1990–2020

  81. ⁠, Andy L. Jones (2021-04-07):

    The largest experiments in machine learning now require resources far beyond the budget of all but a few institutions. Fortunately, it has recently been shown that the results of these huge experiments can often be extrapolated from the results of a sequence of far smaller, cheaper experiments. In this work, we show that not only can the extrapolation be done based on the size of the model, but on the size of the problem as well.

    By conducting a sequence of experiments using AlphaZero and Hex, we show that the performance achievable with a fixed amount of compute degrades predictably as the game gets larger and harder. Along with our main result, we further show that the test-time and train-time compute available to an agent can be traded off while maintaining performance.

    Figure 5: Each training run (each faint line) of each differently-sized agent follows a sigmoid, starting at random play and progressing up to some plateau. The frontiers (dark lines) formed by taking a maximum across training runs have a similar form across board sizes (colors).

    Figure 6: The compute-performance frontier follows the same sigmoid for each board size 3 through 9, just scaled and shifted. The dotted lines give the fitted curves.

    1. Slope: The slope of the incline is 500 Elo per order of magnitude increase in compute.

      A more memorable interpretation is that if you are in the linearly-increasing regime, then you will need about 2× as much compute as your opponent to beat them 2⁄3 of the time.

    2. Perfect play: The minimum compute needed for perfect play increases 7× for each increment in board size.

    3. Takeoff: The minimum training compute needed to see any improvement over random play increases by 4× for each increment of board size.

    4. Random play: Finally, the distance between random play and perfect play increases by 500 Elo for each increment of board size.

      Unlike the other quantities mentioned previously, the distance between random and perfect play is a property of the game itself rather than of the agent.
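    The link between the quoted slope and the "2× compute wins about 2⁄3 of the time" reading can be checked with a little arithmetic. The sketch below assumes the conventional logistic Elo curve with a 400-point scale, which is an assumption about the rating convention rather than something stated in the excerpt; the function names are hypothetical.

```python
import math


def elo_gain(compute_ratio, elo_per_decade=500.0):
    """Elo advantage implied by a compute ratio at the quoted slope."""
    return elo_per_decade * math.log10(compute_ratio)


def win_probability(elo_diff):
    """Conventional logistic Elo expected score for the stronger player."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


for ratio in (2, 10):
    gain = elo_gain(ratio)
    print(f"{ratio}x compute: +{gain:.0f} Elo, "
          f"win probability {win_probability(gain):.2f}")
```

    Under these assumptions a 2× compute advantage is worth about 150 Elo, or a win probability of roughly 0.70, which is close to the 2⁄3 figure quoted above.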

    Train-test trade-off: So far we have focused on the compute budget during training, but another pertinent budget is the compute spent during evaluation. All the results discussed previously have used a tree search of size 64 during evaluation, the same as used during training. But there is no reason that the train-time search and test-time search have to be the same size, and so by varying the size of the test-time compute budget we can see in Figure 8 that larger tree searches at test time can substantially improve the performance of an agent.

    Knowing now that compute can be spent in 2 places, at train time and test time, the immediate question is: how do these 2 budgets trade off? This is illustrated in Figure 9, which shows that the trade-off is linear in log-compute: for each additional 10× of train-time compute, about 15× of test-time compute can be eliminated, down to a floor of a single-node tree search…the simple relationship between compute at train time and compute at test time was originally surprising to us. Our intuition was that test-time compute is much ‘cheaper’ than train-time compute, and so we were surprised that one could easily substitute for the other. On reflection however, we believe the key distinction is that an optimization at test time needs only optimise over one sample, while train-time compute must optimise over the entire distribution of samples.

    Figure 9: The trade-off between train-time compute and test-time compute. Each dotted line gives the minimum train-test compute required for a certain Elo on a 9 × 9 board.

    …the way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2⁄3 [as you go from 1:1, half the total numbers drawn, to 2:1, or 2/(2+1)—as if each tree search were an independent lottery ticket]. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.
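    The toy model in this passage is easy to simulate. The sketch below is an illustration, not code from the paper: each player draws uniform random numbers in proportion to its compute, and the holder of the largest number wins. The function names, trial count, and seed are arbitrary choices.

```python
import random


def toy_win_rate(my_draws, opp_draws, trials=20000, seed=0):
    """Fraction of trials in which the player with `my_draws` numbers holds the maximum."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        mine = max(rng.random() for _ in range(my_draws))
        theirs = max(rng.random() for _ in range(opp_draws))
        wins += mine > theirs
    return wins / trials


def exact_win_rate(my_draws, opp_draws):
    """Every draw is equally likely to be the largest, so the odds are my : opp."""
    return my_draws / (my_draws + opp_draws)


print(toy_win_rate(2, 1), exact_win_rate(2, 1))
print(toy_win_rate(64, 32), exact_win_rate(64, 32))
```

    The simulated rate should sit near 2⁄3 for any 2:1 ratio, whatever the absolute number of draws, which is the scale-free property the passage points to.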

    Second, both the relation of performance to board size and the relation of performance to compute are smooth. Before embarking on this project, a key unknown was whether performance would show any ‘spikes’ with regards to compute or board size. A spike with regards to compute might indicate the model had achieved some key insight, while a spike with regards to board size might indicate a minimum complexity past which key insights are available for the model to discover. As it is, however, models’ performance changes smoothly and predictably with both increased compute and increased complexity.

  82. Faster

  83. ⁠, Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver (2021-04-13):

    Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the case when learning from a fixed dataset. However, to date no single unified algorithm could demonstrate state-of-the-art results in both settings.

    In this work, we describe the Reanalyse algorithm which uses model-based policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline Reinforcement Learning (offline RL).

    Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings.

    MuZero Unplugged sets new state-of-the-art results in the offline RL benchmark as well as in the online RL benchmark of Atari ALE in the standard 200 million frame setting.

    Figure 1: Final scores in Ms. Pac-Man for different Reanalyse fractions. By scaling the Reanalyse fraction, MuZero can be trained at any desired data budget. All other parameters are held constant. Note the logarithmic x-axis: Linear improvements in score require exponentially more data, matching scaling laws such as described by Kaplan et al 2020 for language models.

  84. ⁠, Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, S. M. Ali Eslami, Daniel Hennes, Wojciech M. Czarnecki, Yuval Tassa, Shayegan Omidshafiei, Abbas Abdolmaleki, Noah Y. Siegel, Leonard Hasenclever, Luke Marris, Saran Tunyasuvunakool, H. Francis Song, Markus Wulfmeier, Paul Muller, Tuomas Haarnoja, Brendan D. Tracey, Karl Tuyls, Thore Graepel, Nicolas Heess (2021-05-25):

    [Previously: ⁠, Merel et al 2020; ⁠, Hill et al 2020.] Intelligent behaviour in the physical world exhibits structure at multiple spatial and temporal scales. Although movements are ultimately executed at the level of instantaneous muscle tensions or joint torques, they must be selected to serve goals defined on much longer timescales, and in terms of relations that extend far beyond the body itself, ultimately involving coordination with other agents. Recent research in artificial intelligence has shown the promise of learning-based approaches to the respective problems of complex movement, longer-term planning and multi-agent coordination. However, there is limited research aimed at their integration.

    We study this problem by training teams of physically simulated humanoid avatars to play football in a realistic virtual environment. We develop a method that combines imitation learning [from motion-capture of human soccer], single-agent and multi-agent reinforcement learning and ⁠, and makes use of transferable representations of behaviour for decision making at different levels of abstraction. In a sequence of stages, players first learn to control a fully articulated body to perform realistic, human-like movements such as running and turning; they then acquire mid-level football skills such as dribbling and shooting; finally, they develop awareness of others and play as a team, bridging the gap between low-level motor control at a timescale of milliseconds, and coordinated goal-directed behaviour as a team at the timescale of tens of seconds.

    We investigate the emergence of behaviours at different levels of abstraction, as well as the representations that underlie these behaviours using several analysis techniques, including statistics from real-world sports analytics. Our work constitutes a complete demonstration of integrated decision-making at multiple scales in a physically embodied multi-agent setting. See project video⁠.

    Figure 5: (A) Agent performance measured by Elo against a set of pre-trained evaluation agents increases as the agents learn football behaviours. Counterfactual policy divergence by entity: early in training, the ball (blue curve) induces most divergence in the agent policy; other players have progressively more influence on the agent’s policy as training progresses. Pass-value-correlation increases for both passer and receiver over training as coordination improves. Agent’s probe score drops below 50% early in training, but improves to 60% as the agents learn coordinated strategies, and identify the value of teammate possession. (B) Emergence of behaviours and abilities over training. Early in training (up to 1.5 billion environment steps or approximately 24 hours of training) running speed and possession increase rapidly and the ability to get up is effectively perfected. Division of labour decreases in this early phase as agents prioritize possession and learn uncoordinated ball chasing behaviours. After 1.5 billion environment steps a transition occurs in which division of labour improves and behaviour shifts from individualistic ball chasing to coordinated play. In this second phase passing frequency, passing range and receiver OBSO [Receiver off-ball scoring opportunity] increase substantially. (C) Division of Labour and passing plays: solid/dashed lines indicate past/future trajectories of the red and blue players and the ball (black line). The 2 left frames are at the point in time of the pass; the receiver turns to anticipate an upfield kick before the pass, leaving the teammate to control the ball. Rightmost frame is the point of reception. (D) Typical probe task initialization with blue player 1 (“passer”) initialized in its own half, and player 2 (“receiver”) initialized on a wing and 2 defenders in the centre. Right: receiver value (scoring channel) as a function of future ball position on the pitch. Regions of high value in green and low value in red. Left: passer value function. Both receiver and passer register higher value when the ball travels to the right wing, where the receiver is positioned.

    …A schematic of our infrastructure is provided in Figure 4. Learning is performed on a central 16-core machine where one core is used for each player in the population. Model inference occurs on 128 inference servers, each providing inference-as-a-service initiated by an inbound request identified by a unique model name. Concurrent requests for the same inference model result in automated batched inference, where an additional request incurs negligible marginal cost. Policy-environment interactions are executed on a large pool of 4,096 CPU actor workers. These connect to a central orchestrator machine which schedules the matches.

  85. ⁠, Vitaly Feldman (2019-06-12):

    State-of-the-art results on image recognition tasks are achieved using over-parameterized learning algorithms that (nearly) perfectly fit the training set and are known to fit well even random labels. This tendency to memorize the labels of the training data is not explained by existing theoretical analyses. Memorization of the training data also presents significant privacy risks when the training data contains sensitive personal information and thus it is important to understand whether such memorization is necessary for accurate learning.

    We provide the first conceptual explanation and a theoretical model for this phenomenon. Specifically, we demonstrate that for natural data distributions memorization of labels is necessary for achieving close-to-optimal generalization error. Crucially, even labels of outliers and noisy labels need to be memorized. The model is motivated and supported by the results of several recent empirical works. In our model, data is sampled from a mixture of subpopulations and our results show that memorization is necessary whenever the distribution of subpopulation frequencies is long-tailed. Image and text data is known to be long-tailed and therefore our results establish a formal link between these empirical phenomena. Our results allow us to quantify the cost of limiting memorization in learning and explain the disparate effects that privacy and model compression have on different subgroups.

  86. ⁠, Guillermo Valle-Pérez, Ard A. Louis (2020-12-07):

    Generalization in deep learning has been the topic of much recent theoretical and empirical research. Here we introduce desiderata for techniques that predict generalization errors for deep learning models in supervised learning. Such predictions should 1) scale correctly with data complexity; 2) scale correctly with training set size; 3) capture differences between architectures; 4) capture differences between optimization algorithms; 5) be quantitatively not too far from the true error (in particular, be non-vacuous); 6) be efficiently computable; and 7) be rigorous. We focus on generalization error upper bounds, and introduce a categorisation of bounds depending on assumptions on the algorithm and data. We review a wide range of existing approaches, from classical VC dimension to recent PAC-Bayesian bounds, commenting on how well they perform against the desiderata.

    We next use a function-based picture to derive a marginal-likelihood PAC-Bayesian bound. This bound is, by one definition, optimal up to a multiplicative constant in the asymptotic limit of large training sets, as long as the learning curve follows a power law, which is typically found in practice for deep learning problems. Extensive empirical analysis demonstrates that our marginal-likelihood PAC-Bayes bound fulfills desiderata 1–3 and 5. The results for 6 and 7 are promising, but not yet fully conclusive, while only desideratum 4 is currently beyond the scope of our bound. Finally, we comment on why this function-based bound performs significantly better than current parameter-based PAC-Bayes bounds.

  87. ⁠, Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi (2020-10-16):

    We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is “because” they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and lays a foundation for future research in the area.
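    The decomposition described above can be written out explicitly. The notation here is assumed for illustration and does not come from the abstract: $f_t$ is the Real World model after $t$ optimizer steps on a finite training set, and $f_t^{\mathrm{iid}}$ is the Ideal World model after $t$ steps of the same optimizer on fresh samples.

```latex
\mathrm{TestError}(f_t)
  = \underbrace{\mathrm{TestError}\bigl(f_t^{\mathrm{iid}}\bigr)}_{\text{(1) Ideal World test error}}
  + \underbrace{\Bigl[\mathrm{TestError}(f_t) - \mathrm{TestError}\bigl(f_t^{\mathrm{iid}}\bigr)\Bigr]}_{\text{(2) gap between the two worlds}}
```

    The identity is trivially true; the content of the framework is the empirical claim that term (2) is small, so that term (1), a pure online-optimization quantity, carries most of the test error.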

  88. ⁠, Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma (2021-02-12):

    The test loss of well-trained neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents: super-classing image tasks does not change exponents, while changing input distribution (via changing datasets or adding noise) has a strong effect. We further explore the effect of architecture aspect ratio on scaling exponents.

  89. ⁠, Marcus Hutter (2021-02-08):

    Recently a number of empirical “universal” scaling law papers have been published, most notably by OpenAI. ‘Scaling laws’ refers to power-law decreases of training or test error w.r.t. more data, larger neural networks, and/or more compute. In this work we focus on scaling w.r.t. data size n.

    Theoretical understanding of this phenomenon is largely lacking, except in finite-dimensional models for which error typically decreases with $n^{-1/2}$ or $n^{-1}$, where $n$ is the sample size.

    We develop and theoretically analyse the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta > 0$, and determine whether power laws are universal or depend on the data distribution.
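    A power-law learning curve is a straight line on log-log axes, so the exponent β can be read off with an ordinary least-squares fit. The sketch below illustrates that generic procedure on synthetic data; it is not the toy model from the paper, and the function name and the synthetic constants are assumptions.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares line through (log n, log error); returns (beta, scale)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # error ~ scale * n^(-beta), so beta is the negated slope.
    return -slope, math.exp(intercept)


# Synthetic curve with a known exponent, purely for illustration.
sizes = [10 ** k for k in range(1, 7)]
errors = [3.0 * n ** -0.5 for n in sizes]
beta, scale = fit_power_law(sizes, errors)
print(round(beta, 3), round(scale, 3))
```

    On real learning curves the fit is only meaningful over the range where the curve is actually straight in log-log space; an irreducible error floor would bend it.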

  90. https://www.lesswrong.com/posts/Yt5wAXMc7D2zLpQqx/an-140-theoretical-models-that-predict-scaling-laws#HIGHLIGHTS

  91. ⁠, Tom Viering, Marco Loog (2021-03-19):

    Learning curves provide insight into the dependence of a learner’s generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.

  92. ⁠, Andrew M. Saxe, James L. McClelland, Surya Ganguli (2019-06-04):

    Over the course of development, humans learn myriad facts about items in the world, and naturally group these items into useful categories and structures. This semantic knowledge is essential for diverse behaviors and inferences in adulthood. How is this richly structured semantic knowledge acquired, organized, deployed, and represented by neuronal networks in the brain? We address this question by studying how the nonlinear learning dynamics of deep linear networks acquires information about complex environmental structures. Our results show that this deep learning dynamics can self-organize emergent hidden representations in a manner that recapitulates many empirical phenomena in human semantic development. Such deep networks thus provide a mathematically tractable window into the development of internal neural representations through experience.

    An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences?

    We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species.

    Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.

    [Keywords: semantic cognition, deep learning, neural networks, generative models]

  93. https://arxiv.org/pdf/2103.10948.pdf#page=22

  94. ⁠, Yehuda Dar, Vidya Muthukumar, Richard G. Baraniuk (2021-09-06):

    The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, which results in them perfectly fitting (i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models—from simple linear models to deep neural networks—have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance.

    Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.

  95. 1987-shepard.pdf: ⁠, Roger N. Shepard (1987-09-11; psychology, ai/scaling):

    [⁠; exponential necessary for ? Is there a connection to the power-laws in ML?] A psychological space is established for any set of stimuli by determining metric distances between the stimuli such that the probability that a response learned to any stimulus will generalize to any other is an invariant monotonic function of the distance between them.

    To a good approximation, this probability of generalization (1) decays exponentially with this distance, and (2) does so in accordance with one of 2 metrics, depending on the relation between the dimensions along which the stimuli vary.

    These empirical regularities are mathematically derivable from universal principles of natural kinds and probabilistic geometry that may, through evolutionary internalization, tend to govern the behaviors of all sentient organisms.
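    The two regularities stated above can be summarized in one formula. The symbols are assumed notation for illustration: $g(x, y)$ is the probability that a response learned to stimulus $x$ generalizes to $y$, $d$ is distance in the psychological space, and $k > 0$ is a decay rate. The identification of the two metrics as city-block and Euclidean follows the paper itself rather than this abstract.

```latex
g(x, y) = e^{-k \, d(x, y)}, \qquad
d(x, y) = \Bigl( \sum_{i} \lvert x_i - y_i \rvert^{r} \Bigr)^{1/r}, \quad r \in \{1, 2\}
```

    Here $r = 1$ (city-block) applies when the stimulus dimensions are separable, and $r = 2$ (Euclidean) when they are integral.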

    Figure 1: 12 gradients of generalization. Measures of generalization between stimuli are plotted against distances between corresponding points in the psychological space that renders the relation most nearly monotonic. Sources of the generalization data (g) and the distances (d) are as follows. (A) g, McGuire (33); d, Shepard (7, 18). (B) g, Shepard (7, 17); d, Shepard (7, 18). (C) g, Shepard (17); d, Shepard (8). (D) g, Attneave (25); d, Shepard (8). (E) g, Guttman and Kalish (4); d, Shepard (11). (F) g, Miller and Nicely (34); d, Shepard (35). (G) g, Attneave (25); d, Shepard (8). (H) g, Blough (36); d, Shepard (11). (I) g, Peterson and Barney (37); d, Shepard (35). (J) g and d, Shepard and Cermak (38). (K) g, Ekman (39); d, Shepard (18). (L) g, Rothkopf (40); d, Cunningham and Shepard (41). The generalization data in the bottom row are of a somewhat different type. [See (39) and the section “Limitations and Proposed Extensions”.]

  96. 2001-banko.pdf#microsoft: ⁠, Michele Banko, Eric Brill (2001-07-01; ai):

    The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.

    …We collected a 1-billion-word training corpus from a variety of English texts, including news articles, scientific abstracts, government transcripts, literature and other varied forms of prose. This training corpus is three orders of magnitude greater than the largest training corpus previously used for this problem. We used 1 million words of Wall Street Journal text as our test set, and no data from the Wall Street Journal was used when constructing the training corpus. Each learner was trained at several cutoff points in the training corpus, ie. the first one million words, the first five million words, and so on, until all one billion words were used for training. In order to avoid training biases that may result from merely concatenating the different data sources to form a larger training corpus, we constructed each consecutive training corpus by probabilistically sampling sentences from the different sources weighted by the size of each source.

    In Figure 1, we show learning curves for each learner, up to one billion words of training data. Each point in the graph is the average performance over ten confusion sets for that size training corpus. Note that the curves appear to be log-linear even out to one billion words.
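    A log-linear learning curve means that the score grows roughly linearly in the logarithm of the training-set size. The sketch below fits such a line by ordinary least squares; the function name is hypothetical and the data points are synthetic, generated from a known line purely to check the fit. They are not Banko & Brill's measurements.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = intercept + slope * log10(size)."""
    xs = [math.log10(n) for n in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic, made-up curve: +0.03 accuracy per tenfold increase in data.
sizes = [1e6, 1e7, 1e8, 1e9]
synthetic_scores = [0.52 + 0.03 * math.log10(n) for n in sizes]
slope, intercept = fit_log_linear(sizes, synthetic_scores)
```

    On real learning curves, the residuals of such a fit are one simple check on whether the curve is still log-linear at the largest corpus sizes.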

    Figure 1: Learning Curves for Confusion Set Disambiguation
  97. https://papers.nips.cc/paper/2003/file/9fb7b048c96d44a0337f049e0a61ff06-Paper.pdf

  98. 2003-perlich.pdf: ⁠, Claudia Perlich, Foster Provost, Jeffrey S. Simonoff (2003-06-01; ai):

    Tree induction and logistic regression are 2 standard, off-the-shelf methods for building models for classification.

    We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities.

    We use a learning-curve analysis to examine the relationship of these measures to the size of the training set.

    The results of the study show several things:

    1. Contrary to some prior observations, logistic regression does not generally outperform tree induction.
    2. More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (that is, the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves.
    3. Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications.
    4. Finally, the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of the separability of signal from noise.

    [Keywords: decision trees, learning curves, logistic regression, ROC analysis, tree induction]

    …The average data-set size is larger than is usual in machine-learning research, and we see behavioral characteristics that would be overlooked when comparing algorithms only on smaller data sets (such as most in the UCI repository; see Blake & Merz 2000).

    …Papers such as this seldom consider carefully the size of the data sets to which the algorithms are being applied. Does the relative performance of the different learning methods depend on the size of the data set?

    More than a decade ago in machine learning research, the examination of learning curves was commonplace (see, for example, Kibler & Langley 1988), but usually on single data sets (notable exceptions being the study by ⁠, and the work of Catlett 1991 [“Megainduction: machine learning on very large databases”]). Now learning curves are presented only rarely in comparisons of learning algorithms. Learning curves also are found in the statistical literature () and in the neural network literature (). They have been analyzed theoretically, using statistical mechanics (⁠; ⁠).

    The few cases that exist draw conflicting conclusions, with respect to our goals. Domingos & Pazzani 1997 compare classification-accuracy learning curves of naive Bayes and the C4.5RULES rule learner (Quinlan 1993). On synthetic data, they show that naive Bayes performs better for smaller training sets and C4.5RULES performs better for larger training sets (the learning curves cross). They discuss that this can be explained by considering the different bias/variance profile of the algorithms for classification (zero/one loss). Roughly speaking, variance plays a more critical role than estimation bias when considering classification accuracy. For smaller data sets, naive Bayes has a substantial advantage over tree or rule induction in terms of variance. They show that this is the case even when (by their construction) the rule learning algorithm has no bias. As expected, as larger training sets reduce variance, C4.5RULES approaches perfect classification. perform a similar bias/variance analysis of C4.5 and naive Bayes. They do not examine whether the curves cross, but do show on 4 UCI data sets that variance is reduced consistently with more data, but bias is not. These results do not directly examine logistic regression, but the bias/variance arguments do apply: logistic regression, a linear model, should have higher bias but lower variance than tree induction. Therefore, one would expect that their learning curves might cross.

    However, the results of Domingos & Pazzani 1997 were generated from synthetic data where the rule learner had no bias. Would we see such behavior on real-world domains? Kohavi 1996 shows classification-accuracy learning curves of tree induction (using C4.5) and of naive Bayes for 9 UCI data sets. With only one exception, either naive Bayes or tree induction dominates (that is, the performance of one or the other is superior consistently for all training-set sizes). Furthermore, by examining the curves, Kohavi concludes that “In most cases, it is clear that even with much more data, the learning curves will not cross” (pp. 203–204).

    We are aware of only one learning-curve analysis that compares logistic regression and tree induction. Harris-Jones & Haines 1997 [“Sample size and misclassification: is more always better?”] compare them on 2 business data sets, one real and one synthetic. For these data the learning curves cross, suggesting (as they observe) that logistic regression is preferable for smaller data sets and tree induction for larger data sets. Our results generally support this conclusion.

    …These results concur with recent results () comparing discriminative and generative versions of the same model (viz., logistic regression and naive Bayes), which show that learning curves often cross…A corollary observation is that even for very large data-set sizes, the slope of the learning curves remains distinguishable from zero. Catlett 1991 concluded that learning curves continue to grow, on several large-at-the-time data sets (the largest with fewer than 100,000 training examples). suggest that this conclusion should be revisited as the size of data sets that can be processed (feasibly) by learning algorithms increases. Our results provide a contemporary reiteration of Catlett’s. On the other hand, our results seemingly contradict conclusions or assumptions made in some prior work. For example, conclude that classification-tree learning curves level off, and replicate this finding and use it as an assumption of their sampling strategy. Technically, the criterion for a curve to have reached a plateau in these studies is that there be less than a certain threshold (<1%) increase in accuracy from the accuracy with the largest data-set size; however, the conclusion often is taken to mean that increases in accuracy cease. Our results show clearly that this latter interpretation is not appropriate even for our largest data-set sizes.

  99. 2007-brants.pdf#google: ⁠, Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean (2007-06; ai):

    This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large datasets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.
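    A minimal sketch of how Stupid Backoff scoring is usually described: use the relative frequency of the full n-gram when it has been seen, otherwise back off to a shorter context and multiply by a fixed factor. The helper names are hypothetical, the backoff factor of 0.4 is the value commonly attributed to the paper rather than something stated in the abstract above, and this toy version keeps all counts in memory instead of using the distributed infrastructure.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff factor (assumed value, see lead-in)

def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order as tuples."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Return an unnormalized score S(word | context), not a probability."""
    context = tuple(context)
    if not context:
        # Base case: unigram relative frequency.
        return counts[(word,)] / total_tokens
    full = context + (word,)
    if counts[full] > 0:
        # Seen n-gram: plain relative frequency, no discounting.
        return counts[full] / counts[context]
    # Unseen n-gram: drop the oldest context word and apply the penalty.
    return alpha * stupid_backoff(word, context[1:], counts, total_tokens, alpha)

tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_order=3)
seen = stupid_backoff("cat", ["the"], counts, len(tokens))
backed_off = stupid_backoff("mat", ["sat", "the"], counts, len(tokens))
```

    Because the scores are not normalized, no discount mass has to be computed, which is what makes the method cheap to train at very large scale.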

    Figure 5, modified by Chris Dyer in a 2020 talk: data vs translation quality (BLEU score) scaling of n-grams, and later, RNNs.
  100. ⁠, Philipp Koehn, Rebecca Knowles (2017-06-12):

    We explore six challenges for neural machine translation: domain mismatch, amount of training data, rare words, long sentences, word alignment, and ⁠. We show both deficiencies and improvements over the quality of phrase-based statistical machine translation.

  101. 2017-koehn-figure3-bleuscoreswithvaryingamountsoftrainingdata.png

  102. 2009-halevy.pdf: ⁠, Alon Halevy, Peter Norvig, Fernando Pereira (2009-03-24; ai):

    At Brown University, there was excitement about having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

    …For many tasks, words and word combinations provide all the representational machinery we need to learn from text.

    …So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.

  103. 2012-bottou.pdf: ⁠, Leon Bottou, Olivier Bousquet (2007; ai):

    This chapter develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithm in non-trivial ways. For instance, a mediocre optimization algorithm, stochastic gradient descent, is shown to perform very well on large-scale learning problems.

    …This chapter develops the ideas initially proposed by Bottou & Bousquet 2008 [“The tradeoffs of large scale learning”, NIPS 2007]. Section 13.2 proposes a decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well-known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. Section 13.3 explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate of the statistical estimation error. Section 13.4 reports experimental results supporting this analysis.
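    The decomposition described for Section 13.2 can be written out as follows. The notation is an assumption reconstructed from the usual presentation of this framework, not copied from the excerpt: E(f) is the expected risk, f* the best possible predictor, f*_F the best predictor in the chosen family F, f_n the empirical-risk minimizer on n examples, and f̃_n the approximate solution the optimizer actually returns.

```latex
\mathcal{E}
  \;=\; \underbrace{\mathbb{E}\!\left[E(f^{*}_{\mathcal{F}}) - E(f^{*})\right]}_{\mathcal{E}_{\mathrm{app}}\ \text{(approximation)}}
  \;+\; \underbrace{\mathbb{E}\!\left[E(f_{n}) - E(f^{*}_{\mathcal{F}})\right]}_{\mathcal{E}_{\mathrm{est}}\ \text{(estimation)}}
  \;+\; \underbrace{\mathbb{E}\!\left[E(\tilde{f}_{n}) - E(f_{n})\right]}_{\mathcal{E}_{\mathrm{opt}}\ \text{(optimization)}}
```

    When the optimizer can be run to convergence, the third term is negligible and only the first 2 terms remain, which is the small-scale case; under a fixed time budget, the third term has to be traded off against the number of examples processed.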

    …These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the objective function and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences:

    • The SGD and 2SGD results do not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (13.11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error.
    • Second-order algorithms bring few asymptotical improvements in ε. Although the superlinear 2GD algorithm improves the logarithmic term, all 4 algorithms are dominated by the polynomial term in (1⁄ε). However, there are important variations in the influence of the constants d, κ, and ν. These constants are very important in practice.
    • Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite showing the worst optimization performance on the empirical cost. This phenomenon has already been described and observed in experiments (eg Bottou & Le Cun 2004).

    In contrast, since the optimization error εopt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is determined solely by the statistical properties of the objective function.

    Figure 13.1 shows how much time each algorithm takes to reach a given optimization accuracy. The superlinear algorithm TRON reaches the optimum with 10 digits of accuracy in less than one minute. The stochastic gradient starts more quickly but is unable to deliver such a high accuracy. The upper part of the figure clearly shows that the testing set loss stops decreasing long before the superlinear algorithm overcomes the SGD algorithm.

    Figure 13.1: Training time and testing loss as a function of the optimization accuracy ρ for SGD and TRON (Lin et al 2007).

    Figure 13.2 shows how the testing loss evolves with the training time. The stochastic gradient descent curve can be compared with the curves obtained using conjugate gradients on subsets of the training examples with increasing sizes. Assume, for instance, that our computing time budget is 1 second. Running the conjugate gradient algorithm on a random subset of 30,000 training examples achieves a much better performance than running it on the whole training set. How to guess the right subset size a priori remains unclear. Meanwhile, running the SGD algorithm on the full training set reaches the same testing set performance much faster.

    Figure 13.2: Testing loss versus training time for SGD, and for conjugate gradients running on subsets of the training set.

    Conclusion: Taking into account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the objective function and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms.

    This framework leaves room for considerable refinements. Shalev-Shwartz & Srebro 2008 rigorously extend the analysis to regularized risk formulations with linear parameterization and find again that, for learning purposes, SGD algorithms are often more attractive than standard primal or dual algorithms with good optimization complexity (Joachims 2006; Hush et al 2006). It could also be interesting to investigate how the choice of a surrogate (Zhang 2004; Bartlett et al 2006) impacts the large-scale case.

  104. 2013-bottou.pdf: “Large–Scale Machine Learning Revisited [slides]”⁠, Léon Bottou

  105. ⁠, Gwern Branwen (2020-10-30):

    Subreddit for discussing AI, machine learning, or deep learning approaches involving big numbers: billions of parameters, millions of n, petaflops, etc. eg ⁠. Most research is conducted at much smaller scale; this subreddit is for research analogous to ‘high energy physics’, requiring specialized approaches, large investments, consortium, etc.

    Topics: How? Who? Why do they work? What are they good for? What resources are available? Who will pay & how? What is the future of such approaches? What global consequences will there be?