Scaling-hypothesis (Link Bibliography)

“Scaling-hypothesis” links:

  1. #meta-learning

  2. #blessings-of-scale

  3. Backstop#deep-bayes

  4. #scaling-hypothesis

  5. ⁠, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020-05-28):

    Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions—something which current NLP systems still largely struggle to do.

    Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.

    Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

    …The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPU’s. Previous work suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.

  6. ⁠, GPT-3 (2020-05-28):

    As an example, in Figure F.1, we show 4 uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the style of Wallace Stevens. We first experimented with a few prompts, then generated 4 samples with no additional editing or selection (sampling at temperature 1 using with P = 0.9). Completions were truncated when the model began to write a new title and author heading, or broke into prose commentary.

    …“Shadows on the Way”, Wallace Stevens:

    1. I must have shadows on the way / If I am to walk I must have / Each step taken slowly and alone / To have it ready made //​​​​​ And I must think in lines of grey / To have dim thoughts to be my guide / Must look on blue and green / And never let my eye forget / That color is my friend / And purple must surround me too // The yellow of the sun is no more / Intrusive than the bluish snow / That falls on all of us. I must have / Grey thoughts and blue thoughts walk with me / If I am to go away at all.
    2. There is a way in the world. / That way a man may live in a small town / His lifetime and see no man come, / Never the stir of men about a market, / Never the bobbing of covered carts, / Never the creak passed on the stair / Closing some other life; an old man / Can prosper in that town, can prosper / When his store is seen to close, his wife / Fades in his house, and the world shouts in vain / That no one comes to hear him / Where his small bell tolls. / In that small town / Everything is known. / The ways of the young men are known, / And the ways of the young women / As they walk to the Spring; / The rumors are all known, / The quarrels, the hurt, the envy / Of the men and the women. / Till a day comes / That way out of the world / That leads to a wood. / Young, the young man / Walks, a lover now, / Knows the streets as they are and is of them.
    3. The sun was all we had. Now, in the shade / All is changed. The mind must dwell on those / White fields, that to its eyes were always old; / Those ancient gleams, convoluting / The way they lay among the huge roots, / The great dark tomes of reverie, / The plumed labyrinth of the sea. // When our own breath gone faint and fade / Up the long stairway of the dead / We shall remember well / The blindness of the light, the whiteness / Of that white land. The footsteps, and / The grasses in that light, and in the shade / The menacing life. /​​​​​/ We played, too, once, in the warmest rooms. / To one content, there is one content / For all mankind. The forest glades/ Are all the more for being in fable, / And wood and town in legend shadowed— / Once more to return. The one content / Has in it every happiness, more brave / For being purposeless; and so / Those island gardens, deep in the mind, / Make of remoteness familiarity / And of the unspoken name a friend.
    4. Nobody will come to this place. It is a road that leads nowhere. / The solitude is deep. The mountains are high. / But they are desolate, and they turn the traveler’s face / Towards the North. All the sounds of the world are far away. / When the wind rises above the trees, / The boughs bow to the ground. / Even the birds that inhabit the tangle of weeds / That is the roadside cover, are silent. One listens, / But hears no roar of the forest. One is alone. / One will be taken. / One will be taken. / There is no utterance, there is no conversation, / But one is uneasy all the same… / There is a thin blue mist, / A darkness rising like smoke, / And within that darkness / A possession of the heart. / One will be taken… It was here, and it will be here again— / Here, under this sky empty and full of light.
  7. GPT-3

  8. GPT-2


  10. ⁠, Greg Brockman, Mira Murati, Peter Welinder () (2020-06-11):

    We’re releasing an API for accessing new AI models developed by OpenAI. Unlike most AI systems which are designed for one use-case, the API today provides a general-purpose “text in, text out” interface, allowing users to try it on virtually any English language task. You can now request access in order to integrate the API into your product, develop an entirely new application, or help us explore the strengths and limits of this technology.

    Given any text prompt, the API will return a text completion, attempting to match the pattern you gave it. You can “program” it by showing it just a few examples of what you’d like it to do; its success generally varies depending on how complex the task is. The API also allows you to hone performance on specific tasks by training on a dataset (small or large) of examples you provide, or by learning from human feedback provided by users or labelers.

  11. ⁠, Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, Ilya Sutskever (OpenAI) (2019-02-14):

    Our model, called GPT-2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a ⁠.

    GPT-2 is a large -based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10× the parameters and trained on more than 10× the amount of data.

    GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

  12. GPT-3#bpes

  13. ⁠, Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu (2016-10-20):

    Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These “fast weights” can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.


  15. ⁠, Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, Timothy Lillicrap (2016-05-19):

    Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.” Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.

  16. 2018-wang.pdf#deepmind: ⁠, Jane X. Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Demis Hassabis, Matthew Botvinick (2018-05-14; reinforcement-learning):

    Over the past 20 years, neuroscience research on reward-based learning has converged on a canonical model, under which the neurotransmitter ‘stamps in’ associations between situations, actions and rewards by modulating the strength of synaptic connections between neurons. However, a growing number of recent findings have placed this standard model under strain. We now draw on recent advances in artificial intelligence to introduce a new theory of reward-based learning. Here, the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system. This new perspective accommodates the findings that motivated the standard model, but also deals gracefully with a wider range of observations, providing a fresh foundation for future research.

  17. ⁠, Adam Scholl (2020-08-12):

    Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview⁠, he discusses results from which describe conditions under which reinforcement learning algorithms will spontaneously give rise to separate full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper:

    Initial Observation

    At some point, a group of DeepMind researchers in Botvinick’s group noticed that when they trained a using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren’t trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started “with just one learning algorithm, and then another learning algorithm kind of… emerges, out of, like out of thin air”:

    “What happens… it seemed almost magical to us, when we first started realizing what was going on—the slow learning algorithm, which was just kind of adjusting the synaptic weights, those slow synaptic changes give rise to a network dynamics, and the dynamics themselves turn into a learning algorithm.”

    Other versions of this basic architecture—eg., using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed “meta-RL.” So they concluded that all that’s needed for a system to give rise to meta-RL are three very general properties: the system must 1. have memory, 2. whose weights are trained by a RL algorithm, 3. on a sequence of similar input data.

    From Botvinick’s description, it sounds to me like he thinks [learning algorithms that find/​​​​instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms:

    “…it’s something that just happens. In a sense, you can’t avoid this happening. If you have a system that has memory, and the function of that memory is shaped by ⁠, and this system is trained on a series of interrelated tasks, this is going to happen. You can’t stop it.”

    …The account detailed by and Wang et al. strikes me as a relatively clear example of mesa-optimization, and I interpret it as tentative evidence that the attractor toward mesa-optimization is strong.

  18. ⁠, Matthew Botvinick, Sam Ritter, Jane X. Wang, Zeb Kurth-Nelson, Charles Blundell, Demis Hassabis (2019-05-16):

    Recent AI research has given rise to powerful techniques for deep reinforcement learning. In their combination of representation learning with reward-driven behavior, deep reinforcement learning would appear to have inherent interest for psychology and neuroscience.

    One reservation has been that deep reinforcement learning procedures demand large amounts of training data, suggesting that these algorithms may differ fundamentally from those underlying human learning.

    While this concern applies to the initial wave of deep RL techniques, subsequent AI work has established methods that allow deep RL systems to learn more quickly and efficiently. Two particularly interesting and promising techniques center, respectively, on episodic memory and meta-learning. Alongside their interest as AI techniques, deep RL methods leveraging episodic memory and meta-learning have direct and interesting implications for psychology and neuroscience. One subtle but critically important insight which these techniques bring into focus is the fundamental connection between fast and slow forms of learning.

    Deep reinforcement learning (RL) methods have driven impressive advances in artificial intelligence in recent years, exceeding human performance in domains ranging from Atari to Go to no-limit poker. This progress has drawn the attention of cognitive scientists interested in understanding human learning. However, the concern has been raised that deep RL may be too sample-inefficient—that is, it may simply be too slow—to provide a plausible model of how humans learn. In the present review, we counter this critique by describing recently developed techniques that allow deep RL to operate more nimbly, solving problems much more quickly than previous methods. Although these techniques were developed in an AI context, we propose that they may have rich implications for psychology and neuroscience. A key insight, arising from these AI methods, concerns the fundamental connection between fast RL and slower, more incremental forms of learning.

  19. ⁠, Jeff Clune (2019-05-27):

    Perhaps the most ambitious scientific quest in human history is the creation of general artificial intelligence, which roughly means AI that is as smart or smarter than humans. The dominant approach in the machine learning community is to attempt to discover each of the pieces required for intelligence, with the implicit assumption that some future group will complete the Herculean task of figuring out how to combine all of those pieces into a complex thinking machine. I call this the “manual AI approach”. This paper describes another exciting path that ultimately may be more successful at producing general AI. It is based on the clear trend in machine learning that hand-designed solutions eventually are replaced by more effective, learned solutions. The idea is to create an AI-generating algorithm (AI-GA), which automatically learns how to produce general AI. Three Pillars are essential for the approach: (1) meta-learning architectures, (2) meta-learning the learning algorithms themselves, and (3) generating effective learning environments. I argue that either approach could produce general AI first, and both are scientifically worthwhile irrespective of which is the fastest path. Because both are promising, yet the ML community is currently committed to the manual approach, I argue that our community should increase its research investment in the AI-GA approach. To encourage such research, I describe promising work in each of the Three Pillars. I also discuss AI-GA-specific safety and ethical considerations. Because it it may be the fastest path to general AI and because it is inherently scientifically interesting to understand the conditions in which a simple algorithm can produce general AI (as happened on Earth where Darwinian evolution produced human intelligence), I argue that the pursuit of AI-GAs should be considered a new grand challenge of computer science research.

  20. ⁠, Juergen Schmidhuber (2015-11-30):

    This paper addresses the general problem of reinforcement learning (RL) in partially observable environments. In 2013, our large RL recurrent neural networks (RNNs) learned from scratch to drive simulated cars from high-dimensional video input. However, real brains are more powerful in many ways. In particular, they learn a predictive model of their initially unknown environment, and somehow use it for abstract (e.g., hierarchical) planning and reasoning. Guided by ⁠, we describe RNN-based AIs (RNNAIs) designed to do the same. Such an RNNAI can be trained on never-ending sequences of tasks, some of them provided by the user, others invented by the RNNAI itself in a curious, playful fashion, to improve its RNN-based world model. Unlike our previous model-building RNN-based RL machines dating back to 1990, the RNNAI learns to actively query its model for abstract reasoning and planning and decision making, essentially “learning to think.” The basic ideas of this report can be applied to many other cases where one RNN-like system exploits the algorithmic information content of another. They are taken from a grant proposal submitted in Fall 2014, and also explain concepts such as “mirror neurons.” Experimental results will be described in separate papers.

  21. ⁠, Juergen Schmidhuber (2018-02-24):

    I apply recent work on “learning to think” (2015) and on PowerPlay (2011) to the incremental training of an increasingly general problem solver, continually learning to solve new tasks without forgetting previous skills. The problem solver is a single recurrent neural network (or similar general purpose computer) called ONE. ONE is unusual in the sense that it is trained in various ways, e.g., by black box optimization / reinforcement learning / artificial evolution as well as supervised / unsupervised learning. For example, ONE may learn through neuroevolution to control a robot through environment-changing actions, and learn through unsupervised gradient descent to predict future inputs and vector-valued reward signals as suggested in 1990. User-given tasks can be defined through extra goal-defining input patterns, also proposed in 1990. Suppose ONE has already learned many skills. Now a copy of ONE can be re-trained to learn a new skill, e.g., through neuroevolution without a teacher. Here it may profit from re-using previously learned subroutines, but it may also forget previous skills. Then ONE is retrained in PowerPlay style (2011) on stored input/​​​​output traces of (a) ONE’s copy executing the new skill and (b) previous instances of ONE whose skills are still considered worth memorizing. Simultaneously, ONE is retrained on old traces (even those of unsuccessful trials) to become a better predictor, without additional expensive interaction with the enviroment. More and more control and prediction skills are thus collapsed into ONE, like in the chunker-automatizer system of the neural history compressor (1991). This forces ONE to relate partially analogous skills (with shared algorithmic information) to each other, creating common subroutines in form of shared subnetworks of ONE, to greatly speed up subsequent learning of additional, novel but algorithmically related skills.

  22. ⁠, Lilian Weng (2019-11-30):

    Meta-learning, also known as “learning to learn”, intends to design models that can learn new skills or adapt to new environments rapidly with a few training examples. There are three common approaches: 1. learn an efficient distance metric (metric-based); 2. use (recurrent) network with external or internal memory (model-based); 3. optimize the model parameters explicitly for fast learning (optimization-based).

    …We expect a good meta-learning model capable of well adapting or generalizing to new tasks and new environments that have never been encountered during training time. The adaptation process, essentially a mini learning session, happens during test but with a limited exposure to the new task configurations. Eventually, the adapted model can complete new tasks. This is why meta-learning is also known as learning to learn⁠.

    Define the Meta-Learning Problem · A Simple View · Training in the Same Way as Testing · Learner and Meta-Learner · Common Approaches · Metric-Based · Convolutional Siamese Neural Network · Matching Networks · Simple Embedding · Full Context Embeddings · · Prototypical Networks · Model-Based · Memory-Augmented Neural Networks · MANN for Meta-Learning · Addressing Mechanism for Meta-Learning · Meta Networks · Fast Weights · Model Components · Training Process · Optimization-Based · Meta-Learner · Why LSTM? · Model Setup · MAML · First-Order MAML · Reptile · The Optimization Assumption · Reptile vs FOMAML · Reference

  23. ⁠, Lilian Weng (2019-06-23):

    [Review/​​​​discussion] Meta-RL is meta-learning on reinforcement learning tasks. After trained over a distribution of tasks, the agent is able to solve a new task by developing a new RL algorithm with its internal activity dynamics. This post starts with the origin of meta-RL and then dives into three key components of meta-RL…, a good meta-learning model is expected to generalize to new tasks or new environments that have never been encountered during training. The adaptation process, essentially a mini learning session, happens at test with limited exposure to the new configurations. Even without any explicit fine-tuning (no gradient on trainable variables), the meta-learning model autonomously adjusts internal hidden states to learn.

  24. ⁠, Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever (2020-04-30):

    [⁠; ⁠; probing Jukebox as pretraining for music analysis (posing similar difficulties in extracting the right embedding as ).] A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high level semantics of music, a model would have to deal with extremely long-range dependencies. One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.

    We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of data. Now in raw audio, our models must learn to tackle high diversity as well as very long range structure, and the raw audio domain is particularly unforgiving of errors in short, medium, or long term timing.

    …Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called ⁠. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments, however they suffer from hierarchy collapse due to use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-226 avoids these issues by using feedforward encoders and decoders only, and they show impressive results at generating high-fidelity images…We use three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8×, 32×, and 128×, respectively, with a codebook size of 2048 for each level. This downsampling loses much of the audio detail, and sounds noticeably noisy as we go further down the levels. However, it retains essential information about the pitch, timbre, and volume of the audio.

    Jukebox architecture

    …The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, substantially improving the audio quality. We train these as autoregressive models using a simplified variant of Sparse Transformers. Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to approximately 24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively. Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.

    …While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a substantial gap between these generations and human-created music. For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat. Our downsampling and upsampling process introduces discernible noise. Improving the VQ-VAE so its codes capture more musical information would help reduce this. Our models are also slow to sample from, because of the autoregressive nature of sampling. It takes approximately 9 hours to fully render 1 minute of audio through our models, and thus they cannot yet be used in interactive applications.

  25. ⁠, OpenAI (2018-06-11):

    We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: and ⁠. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.


  27. {#linkBibliography-castelvecchi-(nature)-2020 .docMetadata doi=“10.1038/​​d41586-020-01866-9”}, Davide Castelvecchi, Elizabeth Gibney (Nature) (2020-06-19):

    CERN has taken a major step towards building a 100-kilometer circular supercollider to push the frontier of high-energy physics.

    The decision was unanimously endorsed by the CERN Council, the organization’s governing body, on 19 June, following the plan’s approval by an independent panel in March. Europe’s pre-eminent particle-physics organization will need global help to fund the project, which is expected to cost at least €21 billion (US$24 billion) and would be a follow-up to the lab’s famed Large Hadron Collider (LHC). The new machine would be colliding electrons with their antimatter partners, positrons, by the middle of the century. The design—to be built in an underground tunnel near CERN’s location near Geneva, Switzerland—will enable physicists to study the properties of the Higgs boson and, later, to host an even more-powerful machine that will collide protons and will last well into the second half of the century.

    Funding Tour: CERN’s strategy envisages 2038 as the date for beginning construction of the new, 100-kilometer tunnel and the electron-positron collider. Until then, the lab will continue to operate an upgraded version of the LHC, called High Luminosity LHC, which is currently under construction. But before CERN can start building its new machine, it will have to seek new funding beyond the regular budget it receives from member states. Llewellyn Smith says that countries outside Europe, including the United States, China and Japan, might need to join CERN to form a new, global organization. “Almost certainly it will need a new structure”, he says.

    The costly plan has its detractors—even in the physics community. ⁠, a theoretical physicist at the Frankfurt Institute for Advanced Studies in Germany, has emerged as a critic of pursuing ever-higher energies, when the scientific payback—apart from measuring the properties of known particles—is far from guaranteed. “I still think it’s not a good idea”, Hossenfelder says. “We’re talking about tens of billions. I just think there is not enough scientific potential in doing that kind of study right now.”

    The new collider will be in uncharted territory, says Tara Shears, a physicist at the University of Liverpool, UK. The LHC had a clear target to look for—the Higgs boson—as well as theorists’ well-motivated reasons to believe that there could be new particles in the range of masses it could explore, but the situation now is different, she says. “We don’t have an equivalent, rock-solid prediction now—and that makes knowing where and how to look for answers more challenging and higher risk.” Still, she says, “We do know that the only way to find answers is by experiment and the only place to find them is where we haven’t been able to look yet.”

  28. 2008-sandberg-wholebrainemulationroadmap.pdf

  29. {#linkBibliography-impacts)-2020 .docMetadata}, Asya Bergal (AI Impacts) (2020-03-25):

    Release Prices 95th-percentile Active Prices 95th-percentile Active Prices (pre-crypto price rise)
    11/​​​​2007–1/​​​​2010 3/​​​​2011–1/​​​​2020 3/​​​​2011–12/​​​​2016
    $ / single-precision FLOPS 12.5 17 16
    9/​​​​2014–1/​​​​2020 1/​​​​2015–1/​​​​2020 1/​​​​2015–12/​​​​2016
    $ / half-precision FLOPS 8 10 8
    $ / half-precision FMA FLOPS 4 4.5

    Release price data seems to generally support the trends we found in active prices, with the notable exception of trends in GPU price / single-precision FLOPS, which cannot be explained solely by the different start dates.[^58^](https:/​​​​/​​​​​​​​2019-recent-trends-in-gpu-price-per-flops/​​​​#easy-footnote-bottom-58-2316 “See our analysis in this section above.”) We think the best estimate of the overall trend for prices at which people recently bought GPUs is the 95th-percentile active price data from 2011–2020, since release price data does not account for existing GPUs becoming cheaper over time. The pre-crypto trends are similar to the overall trends, suggesting that the trends we are seeing are not anomalous due to cryptocurrency.

    Given that, we guess that prices as a whole have fallen at rates that would yield an order of magnitude over roughly:

    • 17 years for single-precision FLOPS
    • 10 years for half-precision FLOPS
    • 5 years for half-precision fused multiply-add FLOPS

    Half-precision FLOPS seem to have become cheaper substantially faster than single-precision in recent years. This may be a “catching up” effect as more of the space on GPUs was allocated to half-precision computing, rather than reflecting more fundamental technological progress.

    [Keywords: AI Timelines, Featured Articles, Hardware and AI Timelines, Hardware progress]



  32. Attention

  33. ⁠, Gwern Branwen (2018-10-20):

    Description of emerging machine learning paradigm identified by commentator starspawn0: discussions of building artificial brains typically presume either learning a brain architecture & parameters from scratch (AGI) or laboriously ‘scanning’ and reverse-engineering a biological brain in its entirety to get a functioning artificial brain.

    However, the rise of deep learning’s transfer learning & meta-learning shows a wide variety of intermediate approaches, where ‘side data’ from natural brains can be used as scaffolding to guide & constrain standard deep learning methods. Such approaches do not seek to ‘upload’ or ‘emulate’ any specific brain, they merely seek to imitate an average brain. A simple example would be training a to imitate saliency data: what a human looks at while playing a video game or driving is the important part of a scene, and the CNN doesn’t have to learn importance from scratch. A more complex example would be using EEG as a ‘description’ of music in addition to the music itself. fMRI data could be used to guide a NN to have a similar modularized architecture with similar activation patterns given a particular stimulus as a human brain, which presumably is related to human abilities to zero-shot/​​​​few-shot learn and generalize.

    While a highly marginal approach at the moment compared to standard approaches like scaling up models & datasets, it is largely untapped, and progress in VR with eyetracking capabilities (intended for but usable for many other purposes), brain imaging methods & has been more rapid than generally appreciated—in part thanks to breakthroughs using DL itself, suggesting the potential for a positive feedback loop where a BCI breakthrough enables a better NN for BCIs and so on.

  34. ⁠, Diogo Almeida, Clemens Winter, Jie Tang, Wojciech Zaremba (2021-06-02):

    A core issue with learning to optimize neural networks has been the lack of generalization to real world problems. To address this, we describe a system designed from a generalization-first perspective, learning to update optimizer hyperparameters instead of model parameters directly using novel features, actions, and a reward function. This system outperforms Adam at all neural network tasks including on modalities not seen during training. We achieve 2× speedups on ⁠, and a 2.5× speedup on a language modeling task using over 5 orders of magnitude more compute than the training tasks.

  35. ⁠, Mingxing Tan, Quoc V. Le (2019-05-28):

    Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/​​​​width/​​​​resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.

    To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4× smaller and 6.1× faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https:/​​​​/​​​​​​​​tensorflow/​​​​tpu/​​​​tree/​​​​master/​​​​models/​​​​official/​​​​efficientnet.

  36. ⁠, GPT-3 (2020-05-28):

    [Sample from ⁠; condensed paragraphs & formatting added for legibility inside popups, as the original plain text dumps strip the original HTML formatting of tutorials etc.]

    Working with an example—rotating through an array For a number of years, I’ve been fascinated by the idea of writing code that can rotate through an array. Let’s say I have the following array of numbers:

    var myNumbers = [ 1, 2, 3, 4, 5 ];

    The following snippet of code would display the same numbers in reverse order:

    for ( var i = myNumbers.length - 1; i >= 0; i-- ) { console.log( myNumbers[i] ); }

    What’s going on here?

    We know that the index of an array can be used as a “pointer” to the location in memory that contains a particular item in an array. If the array myNumbers has five items, then myNumbers[0], myNumbers[1], myNumbers[2], myNumbers[3] and myNumbers[4] will point to the values 1, 2, 3, 4 and 5 respectively.

    In the previous example, the code starts with the length of the array (5) and subtracts one from it. As we know that index 0 in an array contains the first item in the array, the previous code will execute the following steps: Take the length of the array (5) Subtract 1 (so we now have 4) Start at the index that contains the first item in the array (myNumbers[4] = 3) and run down the array until we reach the last item in the array (myNumbers[3] = 5) So the above code will end up displaying the items in the array in reverse order, starting at index 4 and working backwards to index 0. At this point, we have a good idea of how we can access the values in an array in a specific order, but how do we go the other way?

    It would be great if we could simply use the same code as above, but add a “–” operator before the last number. That way, we could add a simple parameter to the code to control the range of numbers it will use. Unfortunately, we can’t do that. While JavaScript will happily work with --, it will also happily take -- in code and convert it to −1, or the last number in an array. That’s not going to do us any good, as we want the code to be flexible enough that we can work with a number that we specify as the range of numbers.

    So let’s look at a simple function that can give us the results we need.

    function rotate(numbers, direction, number) { numbers.reverse(); numbers.push(number); numbers.unshift(number); };

    The rotate function above has three parameters, and will rotate the values in the numbers array….

  37. ⁠, François Chollet (2019-11-05):

    To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to “buy” arbitrary levels of skills for a system, in a way that masks the system’s own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

  38. 2019-02-18-lecun-isscc-talk-deeplearninghardwarepastpresentandfuture.pdf#page=60: “Deep Learning Hardware: Past, Present, & Future”⁠, Yann LeCun

  39. Forking-Paths


  41. ⁠, Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher (Salesforce) (2019-09-11):

    Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at⁠.

  42. ⁠, Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le (2020-01-27):

    We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.

  43. ⁠, NVIDIA ADLR (2019-08-13):

    Larger language models are dramatically more useful for NLP tasks such as article completion, question answering, and dialog systems. Training the largest neural language model has recently been the best way to advance the state of the art in NLP applications. Two recent papers, and ⁠, demonstrate the benefits of large scale language modeling. Both papers leverage advances in compute and available text corpora to substantially surpass state of the art performance in natural language understanding, modeling, and generation. Training these models requires hundreds of exaflops of compute and to trade recomputation for a reduced memory footprint. However, for very large models beyond a billion parameters, the memory on a single GPU is not enough to fit the model along with the parameters needed for training, requiring model parallelism to split the parameters across multiple GPUs. Several approaches to model parallelism exist, but they are difficult to use, either because they rely on custom compilers, or because they scale poorly or require changes to the optimizer.

    In this work, we implement a simple and efficient model parallel approach by making only a few targeted modifications to existing PyTorch transformer implementations. Our code is written in native Python, leverages mixed precision training, and utilizes the NCCL library for communication between GPUs. We showcase this approach by training an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer based language model ever trained at 24× the size of BERT and 5.6× the size of GPT-2. We have published the code that implements this approach at our GitHub repository⁠.

    Our experiments are conducted on NVIDIA’s DGX SuperPOD⁠. Without model parallelism, we can fit a baseline model of 1.2B parameters on a single 32GB GPU, and sustain 39 TeraFLOPS during the overall training process, which is 30% of the theoretical peak FLOPS for a single GPU in a -H server. Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieved up to 15.1 PetaFLOPS sustained performance over the entire application and reached 76% scaling efficiency compared to the single GPU case.

  44. ⁠, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu (2019-10-23):

    Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

  45. ⁠, Corby Rosset (2020-02-10):

    [History of large-parameter natural language neural network models since ELMo (0.094b LSTM, February 2018) to Turing-NLG (17b, February 2020)]

    Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics for feedback and research purposes. <|endoftext|>

    – This summary was generated by the Turing-NLG language model itself.

    …Following the trend that larger natural language models lead to better results, Microsoft is introducing Turing Natural Language Generation (T-NLG), the largest model ever published at 17 billion parameters, which outperforms the state of the art on a variety of language modeling benchmarks and also excels when applied to numerous practical tasks, including summarization and question answering. This work would not be possible without breakthroughs produced by the DeepSpeed library (compatible with PyTorch) and ⁠, which can be explored more in this accompanying blog post.

  46. ⁠, Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen (2018-11-16):

    Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.

  47. Scaling

  48. ⁠, Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel (2020-12-14):

    It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.

    We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.

    We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.

  49. ⁠, Vitaly Feldman (2019-06-12):

    State-of-the-art results on image recognition tasks are achieved using over-parameterized learning algorithms that (nearly) perfectly fit the training set and are known to fit well even random labels. This tendency to memorize the labels of the training data is not explained by existing theoretical analyses. Memorization of the training data also presents significant privacy risks when the training data contains sensitive personal information and thus it is important to understand whether such memorization is necessary for accurate learning.

    We provide the first conceptual explanation and a theoretical model for this phenomenon. Specifically, we demonstrate that for natural data distributions memorization of labels is necessary for achieving close-to-optimal generalization error. Crucially, even labels of outliers and noisy labels need to be memorized. The model is motivated and supported by the results of several recent empirical works. In our model, data is sampled from a mixture of subpopulations and our results show that memorization is necessary whenever the distribution of subpopulation frequencies is long-tailed. Image and text data is known to be long-tailed and therefore our results establish a formal link between these empirical phenomena. Our results allow to quantify the cost of limiting memorization in learning and explain the disparate effects that privacy and model compression have on different subgroups.

  50. ⁠, Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, Gabriel F. Manso (2020-07-10):

    Deep learning’s recent history has been one of achievement: from triumphing over humans in the game of Go to world-leading performance in image recognition, voice recognition, translation, and other tasks. But this progress has come with a voracious appetite for computing power. This article reports on the computational demands of Deep Learning applications in five prominent application areas and shows that progress in all five is strongly reliant on increases in computing power. Extrapolating forward this reliance reveals that progress along current lines is rapidly becoming economically, technically, and environmentally unsustainable. Thus, continued progress in these applications will require dramatically more computationally-efficient methods, which will either have to come from changes to deep learning or from moving to other machine learning methods.

  51. ⁠, Gwern Branwen (2020-10-30):

    Subreddit for discussing AI, machine learning, or deep learning approaches involving big numbers: billions of parameters, millions of n, petaflops, etc. eg ⁠. Most research is conducted at much smaller scale; this subreddit is for research analogous to ‘high energy physics’, requiring specialized approaches, large investments, consortium, etc.

    Topics: How? Who? Why do they work? What are they good for? What resources are available? Who will pay & how? What is the future of such approaches? What global consequences will there be?

  52. 2009-halevy.pdf: ⁠, Alon Halevy, Peter Norvig, Fernando Pereira (2009-03-24; ai):

    At Brown University, there is excitement of having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

    …For many tasks, words and word combinations provide all the representational machinery we need to learn from text.

    …So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do

  53. 2001-banko.pdf#microsoft: ⁠, Michele Banko, Eric Brill (2001-07-01; ai):

    The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.

    …We collected a 1-billion-word training corpus from a variety of English texts, including news articles, scientific abstracts, government transcripts, literature and other varied forms of prose. This training corpus is three orders of magnitude greater than the largest training corpus previously used for this problem. We used 1 million words of Wall Street Journal text as our test set, and no data from the Wall Street Journal was used when constructing the training corpus. Each learner was trained at several cutoff points in the training corpus, ie. the first one million words, the first five million words, and so on, until all one billion words were used for training. In order to avoid training biases that may result from merely concatenating the different data sources to form a larger training corpus, we constructed each consecutive training corpus by probabilistically sampling sentences from the different sources weighted by the size of each source.

    In Figure 1, we show learning curves for each learner, up to one billion words of training data. Each point in the graph is the average performance over ten confusion sets for that size training corpus. Note that the curves appear to be log-linear even out to one billion words.

    Figure 1: Learning Curves for Confusion Set Disambiguation
  54. 2007-brants.pdf#google: ⁠, Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean (2007-06; ai):

    This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion ⁠. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large datasets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.

    Figure 5, modified by Chris Dyer in a 2020 talk: data vs translation quality (BLEU score) scaling of n-grams, and later, RNNs.
  55. 2017-koehn-figure3-bleuscoreswithvaryingamountsoftrainingdata.png

  56. ⁠, Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi (2020-10-16):

    The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4–79.1%, which are 15–35% below human performance of 94.0%, depending on the amount of the training data allowed. Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.


  58. ⁠, Dario Amodei, Danny Hernandez, Girish Sastry, Jack Clark, Greg Brockman, Ilya Sutskever (2018-05-26):

    [Further reading: “Parameter Counts In Machine Learning” (2021-06-19), Akronomicon leaderboard⁠.] We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000× (a 2-year doubling period would yield only a 7× increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.

    Three factors drive the advance of AI: algorithmic innovation, data (which can be either supervised data or interactive environments), and the amount of compute available for training. Algorithmic innovation and data are difficult to track, but compute is unusually quantifiable, providing an opportunity to measure one input to AI progress. Of course, the use of massive compute sometimes just exposes the shortcomings of our current algorithms. But at least within many current domains, more compute seems to lead predictably to ⁠, and is often complementary to algorithmic advances…The trend represents an increase by roughly a factor of 10 each year. It’s been partly driven by custom hardware that allows more operations to be performed per second for a given price (GPUs and ), but it’s been primarily propelled by researchers repeatedly finding ways to use more chips in parallel and being willing to pay the economic cost of doing so.

    AlexNet to AlphaGo Zero: A 300,000× Increase in Compute. (The total amount of compute, in petaflop/​​​​s-days, used to train selected results that are relatively well known, used a lot of compute for their time, and gave enough information to estimate the compute used.)

    Eras: Looking at the graph we can roughly see 4 distinct eras:

    1. Before 2012: It was uncommon to use GPUs for ML, making any of the results in the graph difficult to achieve.
    2. 2012–2014: Infrastructure to train on many GPUs was uncommon, so most results used 1–8 GPUs rated at 1–2 TFLOPS for a total of 0.001–0.1 pfs-days.
    3. 2014–2016: Large-scale results used 10–100 GPUs rated at 5–10 TFLOPS, resulting in 0.1–10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
    4. 2016–2017: Approaches that allow greater algorithmic parallelism such as huge batch sizes, architecture search, and expert iteration, along with specialized hardware such as TPUs and faster interconnects, have greatly increased these limits, at least for some applications.

    ⁠/​​​​ is the most visible public example of massive algorithmic parallelism, but many other applications at this scale are now algorithmically possible, and may already be happening in a production context.

    Addendum: Compute used in older headline results (2019-11-07)

    We’ve updated our analysis with data that span 1959 to 2012. Looking at the data as a whole, we clearly see two distinct eras of training AI systems in terms of compute-usage: (a) a first era, from 1959 to 2012, which is defined by results that roughly track Moore’s law, and (b) the modern era, from 2012 to now, of results using computational power that substantially outpaces macro trends. The history of investment in AI broadly is usually told as a story of booms and busts, but we don’t see that reflected in the historical trend of compute used by learning systems. It seems that AI winters and periods of excitement had a small effect on compute used to train models over the last half-century.

    Two Distinct Eras of Compute Usage in Training AI Systems

    Starting from the perceptron in 1959, we see a ~2-year doubling time for the compute used in these historical results—with a 3.4-month doubling time starting in ~2012. It’s difficult to draw a strong conclusion from this data alone, but we believe that this trend is probably due to a combination of the limits on the amount of compute that was possible to use for those results and the willingness to spend on scaling up experiments. [For one vivid account of the history of computing in AI in this period, see the “False Start” section in ⁠.]



  61. {#linkBibliography-(microsoft)-2020 .docMetadata}, Jennifer Langston (Microsoft) (2020-05-19):

    Microsoft has built one of the top five publicly disclosed supercomputers in the world, making new infrastructure available in Azure to train extremely large artificial intelligence models, the company is announcing at its Build developers conference.

    Built in collaboration with and exclusively for OpenAI, the supercomputer hosted in Azure was designed specifically to train that company’s AI models. It represents a key milestone in a partnership announced last year to jointly create new supercomputing technologies in Azure.

    …It’s also a first step toward making the next generation of very large AI models and the infrastructure needed to train them available as a platform for other organizations and developers to build upon. “The exciting thing about these models is the breadth of things they’re going to enable”, said Microsoft Chief Technical Officer Kevin Scott, who said the potential benefits extend far beyond narrow advances in one type of AI model. “This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you’re going to have new applications that are hard to even imagine right now”, he said…Microsoft is also exploring large-scale AI models that can learn in a generalized way across text, images and video. That could help with automatic captioning of images for accessibility in Office, for instance, or improve the ways people search Bing by understanding what’s inside images and videos.

    …The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. Compared with other machines listed on the TOP500 supercomputers in the world, it ranks in the top five, Microsoft says. Hosted in Azure, the supercomputer also benefits from all the capabilities of a robust modern cloud infrastructure, including rapid deployment, sustainable datacenters and access to Azure services.

  62. ⁠, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020-01-23):

    We study empirical scaling laws for language model performance on the loss.

    The loss scales as a with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/​​​​dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget.

    Larger models are substantially more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping substantially before convergence.

    Figure 1: Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
    Figure 15: Far beyond the model sizes we study empirically, we find a contradiction between our equations for L(Cmin) and L(D) due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
    3.2.1: Comparing to LSTMs and Universal Transformers: In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count n. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns.
    Appendix A: Summary of Power Laws
    Table 1: Summary of scaling laws—In this table we summarize the model size and compute scaling fits to equation (1.1) along with Nopt(C), with the loss in nats/​​​​token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while_N_opt(C)comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language. [This is an updated scaling power law summary from >Henighan et al 2020.]


  65. #kaplan-et-al-2020



  68. 2020-adiwardana-meena-figure1-humanratingsvslikelihood.png

  69. 2020-brown-figure313-humanabilitytodetectmodelgeneratednewsstories.png

  70. 2020-hendrycks-figure1b-gpt3-qascaling.png

  71. ⁠, Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt (2020-09-07):

    GPT-3 model size vs Q&A

    We propose a new test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach human-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model’s academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings. (tests and code)

    [bigger = better:

    See also the paper.]


  73. ⁠, Rich Sutton (2019-03-13):

    The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.

    …In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess…A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale…In speech recognition, there was an early competition, sponsored by ⁠, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods…In computer vision…Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

    …We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that (1) AI researchers have often tried to build knowledge into their agents, (2) this always helps in the short term, and is personally satisfying to the researcher, but (3) in the long run it plateaus and even inhibits further progress, and (4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

    [My meme summary:

    The GPT-3 bitter lesson.]
  74. ⁠, Mark Chen, Alec Radford, Ilya Sutskever (OpenAI) (2020-06-17):

    Transformer models like and are domain agnostic, meaning that they can be directly applied to 1-D sequences of any form. When we train GPT-2 on images unrolled into long sequences of pixels, which we call ⁠, we find that the model appears to understand 2-D image characteristics such as object appearance and category. This is evidenced by the diverse range of coherent image samples it generates, even without the guidance of human provided labels. As further proof, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy on ImageNet…we deliberately use the same transformer architecture as GPT-2 in language. As a consequence, we require substantially more compute in order to produce features competitive with those from top unsupervised convolutional nets…Generative sequence modeling is a universal unsupervised learning algorithm: since all data types can be represented as sequences of bytes, a transformer can be directly applied to any data type without additional engineering. Our work tests the power of this generality by directly applying the architecture used to train GPT-2 on natural language to image generation. We deliberately chose to forgo hand coding any image specific knowledge in the form of convolutions38 or techniques like relative attention,39 sparse attention,40 and 2-D position embeddings.27

    …We train iGPT-S, iGPT-M, and iGPT-L, transformers containing 76M, 455M, and 1.4B parameters respectively, on ImageNet. We also train iGPT-XL[4], a 6.8 billion parameter transformer, on a mix of ImageNet and images from the web. Due to the large computational cost of modeling long sequences with dense attention, we train at the low resolutions of 32×32, 48×48, and 64×64…Our next result establishes the link between generative performance and feature quality. We find that both increasing the scale of our models and training for more iterations result in better generative performance, which directly translates into better feature quality.

    …When we evaluate our features using linear probes on CIFAR-10, CIFAR-100, and STL-10⁠, we outperform features from all supervised and unsupervised transfer algorithms. Our results are also compelling in the full fine-tuning setting

    …Because we use the generic sequence transformer used for GPT-2 in language, our method requires large amounts of compute: iGPT-L was trained for roughly 2500 -days while a similarly performing MoCo24 model can be trained in roughly 70 V100-days…We have shown that by trading off 2-D knowledge for scale and by choosing predictive features from the middle of the network, a sequence transformer can be competitive with top convolutional nets for unsupervised image classification. Notably, we achieved our results by directly applying the GPT-2 language model to image generation. Our results suggest that due to its simplicity and generality, a sequence transformer given sufficient compute might ultimately be an effective way to learn excellent features in many domains.

  75. ⁠, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby (2020-09-28):

    One-sentence Summary: Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification.

    While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data [JFT-300M] and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc), Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train…Our Vision Transformer, pre-trained on the JFT-300M dataset, approaches or beats state of the art on multiple image recognition benchmarks, reaching accuracy of 88.36% on ImageNet, 90.77% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.16% on the VTAB suite of 19 tasks…Interestingly, our models took substantially less compute to pre-train than prior state of the art, however, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, ⁠, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4…Finally, [we plan] to further scale ViT, given that the performance does not seem yet to be saturating with the increased model size.

    [Keywords: computer vision, image recognition, self-attention, transformer, large-scale training]

    [Blog⁠. See also ⁠.]

  76. ⁠, Stanislas Polu, Ilya Sutskever (2020-09-07):

    We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans—the generation of original mathematical terms—might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community. [Also notable: the benefits of pretraining on Arxiv etc⁠, despite likely including no or only redundant Metamath, and primarily natural language text, showing transfer learning of general math knowledge to abstract low-level formal proof language. See also: ⁠, lean-gptf (for ), ⁠, ⁠/​​​​⁠, ⁠, ]

  77. ⁠, Martin Schrimpf, Idan Blank, Greta Tuckute, Carina Kauf, Eghbal A. Hosseini, Nancy Kanwisher, Joshua Tenenbaum, Evelina Fedorenko (2020-10-09):

    The neuroscience of perception has recently been revolutionized with an integrative reverse-engineering approach in which computation, brain function, and behavior are linked across many different datasets and many computational models. We here present a first systematic study taking this approach into higher-level cognition: human language processing, our species’ signature cognitive skill. We find that the most powerful ‘transformer’ networks predict neural responses at nearly 100% and generalize across different datasets and data types (fMRI, ECoG). Across models, correlations are observed among all three metrics of performance: neural fit, fit to behavioral responses, and accuracy on the next-word prediction task (but not other language tasks), consistent with the long-standing hypothesis that the brain’s language system is optimized for predictive processing. Model architectures with initial weights further perform surprisingly similar to final trained models, suggesting that inherent structure—and not just experience with language—crucially contributes to a model’s match to the brain.

  78. ⁠, Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, Samuel R. Bowman (2020-10-11):

    One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-turning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa-base does demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.

  79. FC

  80. ⁠, Zachary Nado, Justin M. Gilmer, Christopher J. Shallue, Rohan Anil, George E. Dahl (2021-02-12):

    Recently the LARS and LAMB optimizers have been proposed for training neural networks faster using large batch sizes. LARS and LAMB add layer-wise normalization to the update rules of Heavy-ball momentum and Adam, respectively, and have become popular in prominent benchmarks and deep learning libraries. However, without fair comparisons to standard optimizers, it remains an open question whether LARS and LAMB have any benefit over traditional, generic algorithms. In this work we demonstrate that standard optimization algorithms such as Nesterov momentum and Adam can match or exceed the results of LARS and LAMB at large batch sizes. Our results establish new, stronger baselines for future comparisons at these batch sizes and shed light on the difficulties of comparing optimizers for neural network training more generally.


  82. ⁠, Sam McCandlish, Jared Kaplan, Dario Amodei (OpenAI) (2018-12-14):

    that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training on a wide range of tasks. Since complex tasks tend to have noisier gradients, increasingly large batch sizes are likely to become useful in the future, removing one potential limit to further growth of AI systems. More broadly, these results show that neural network training need not be considered a mysterious art, but can be rigorized and systematized.

    In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.

    The gradient noise scale (appropriately averaged over training) explains the vast majority (R2 = 80%) of the variation in critical batch size over a range of tasks spanning six orders of magnitude. Batch sizes are measured in either number of images, tokens (for language models), or observations (for games).

    …We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients, we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data…We’ve found it helpful to visualize the results of these experiments in terms of a tradeoff between wall time for training and total bulk compute that we use to do the training (proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn’t lead to faster training. There is a “bend” in the curve in the middle, and the gradient noise scale predicts where that bend occurs.

    Increasing parallelism makes it possible to train more complex models in a reasonable amount of time. We find that a Pareto frontier chart is the most intuitive way to visualize comparisons between algorithms and scales.

    …more powerful models have a higher gradient noise scale, but only because they achieve a lower loss. Thus, there’s some evidence that the increasing noise scale over training isn’t just an artifact of convergence, but occurs because the model gets better. If this is true, then we expect future, more powerful models to have higher noise scale and therefore be more parallelizable. Second, tasks that are subjectively more difficult are also more amenable to parallelization…we have evidence that more difficult tasks and more powerful models on the same task will allow for more radical data-parallelism than we have seen to date, providing a key driver for the continued fast exponential growth in training compute.

  83. ⁠, Andrew Brock, Jeff Donahue, Karen Simonyan (2018-09-28):

    …To confirm that our design choices are effective for even larger and more complex and diverse datasets, we also present results of our system on a subset of JFT-300M (). The full JFT-300M dataset contains 300M real-world images labeled with 18K categories. Since the category distribution is heavily long-tailed, we subsample the dataset to keep only images with the 8.5K most common labels. The resulting dataset contains 292M images—two orders of magnitude larger than ImageNet.

    …Our results show that these techniques substantially improve performance even in the setting of this much larger dataset at the same model capacity (64 base channels). We further show that for a dataset of this scale, we see substantial additional improvements from expanding the capacity of our models to 128 base channels, while for ImageNet GANs that additional capacity was not beneficial. In Figure 19 (Appendix D), we present truncation plots for models trained on this dataset…Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization (Section 4), the models trained on JFT-300M remain stable over many hundreds of thousands of iterations. This suggests that moving beyond ImageNet to larger datasets may partially alleviate stability issues.

  84. ⁠, Rewon Child (2020-11-20):

    We present a hierarchical VAE that, for the first time, generates samples quickly while outperforming the PixelCNN in log-likelihood on all natural image benchmarks. We begin by observing that, in theory, VAEs can actually represent autoregressive models, as well as faster, better models if they exist, when made sufficiently deep. Despite this, autoregressive models have historically outperformed VAEs in log-likelihood. We test if insufficient depth explains why by scaling a VAE to greater stochastic depth than previously explored and evaluating it CIFAR-10, ImageNet, and FFHQ. In comparison to the PixelCNN, these very deep VAEs achieve higher likelihoods, use fewer parameters, generate samples thousands of times faster, and are more easily applied to high-resolution images. Qualitative studies suggest this is because the VAE learns efficient hierarchical visual representations. We release our source code and models at https:/​​​​/​​​​​​​​openai/​​​​vdvae.

  85. ⁠, Arash Vahdat, Jan Kautz (2020-07-08):

    Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, CelebA 64, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels. The source code is available at https:/​​​​/​​​​​​​​NVlabs/​​​​NVAE .

  86. ⁠, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby (2019-12-24):

    Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes—from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.

  87. ⁠, Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, Aäron van den Oord (2020-06-12):

    Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a statistically-significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accuracy of recently proposed ImageNet classifiers, and find their gains to be substantially smaller than those reported on the original labels. Furthermore, we find the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end. Nevertheless, we find our annotation procedure to have largely remedied the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future research in visual recognition.

  88. ⁠, Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D'Amour, Dan Moldovan, Sylvan Gelly, Neil Houlsby, Xiaohua Zhai, Mario Lucic (2020-07-16):

    Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we revisit the out-of-distribution and transfer performance of modern image classification CNNs and investigate the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes substantially improve the distributional shift robustness. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing such as modifying the image resolution can substantially mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset we use for a systematic analysis across common factors of variation.

  89. ⁠, A. Emin Orhan (2019-07-17):

    We investigate the robustness properties of ResNeXt class image recognition models trained with billion scale weakly supervised data (ResNeXt WSL models). These models, recently made public by Facebook AI, were trained with  1B images from Instagram and fine-tuned on ImageNet. We show that these models display an unprecedented degree of robustness against common image corruptions and perturbations, as measured by the ImageNet-C and ImageNet-P benchmarks. They also achieve substantially improved accuracies on the recently introduced “natural adversarial examples” benchmark (ImageNet-A). The largest of the released models, in particular, achieves state-of-the-art results on ImageNet-C, ImageNet-P, and ImageNet-A by a large margin. The gains on ImageNet-C, ImageNet-P, and ImageNet-A far outpace the gains on ImageNet validation accuracy, suggesting the former as more useful benchmarks to measure further progress in image recognition. Remarkably, the ResNeXt WSL models even achieve a limited degree of adversarial robustness against state-of-the-art white-box attacks (10-step PGD attacks). However, in contrast to adversarially trained models, the robustness of the ResNeXt WSL models rapidly declines with the number of PGD steps, suggesting that these models do not achieve genuine adversarial robustness. Visualization of the learned features also confirms this conclusion. Finally, we show that although the ResNeXt WSL models are more shape-biased than comparable ImageNet-trained models in a shape-texture cue conflict experiment, they still remain much more texture-biased than humans, suggesting that they share some of the underlying characteristics of ImageNet-trained models that make this benchmark challenging.

  90. ⁠, Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le (2019-11-11):

    [blog on BERT application] We present Noisy Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than that requires 3.5b weakly labeled Instagram images. On robustness test sets, it improves top-1 accuracy from 61.0% to 83.7%, reduces mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2.

    Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via to the student so that the student generalizes better than the teacher. Models are available⁠. Code is available⁠.

  91. ⁠, Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, Ludwig Schmidt (2020-07-01):

    We study how robust current ImageNet models are to distribution shifts arising from natural variations in datasets. Most research on robustness focuses on synthetic image perturbations (noise, simulated weather artifacts, adversarial examples, etc.), which leaves open how robustness on synthetic distribution shift relates to distribution shift arising in real data. Informed by an evaluation of 204 ImageNet models in 213 different test conditions, we find that there is often little to no transfer of robustness from current synthetic to natural distribution shift. Moreover, most current techniques provide no robustness to the natural distribution shifts in our testbed. The main exception is training on larger and more diverse datasets, which in multiple cases increases robustness, but is still far from closing the performance gaps. Our results indicate that distribution shifts arising in real data are currently an open research problem. We provide our testbed and data as a resource for future work at https:/​​​​/​​​​​​​​imagenet-testbed/ .

  92. ⁠, Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit (2021-03-26):

    Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like (ViT) have matched or even surpassed for image classification. However, details of the Transformer architecture—such as the use of non-overlapping patches—lead one to wonder whether these networks are as robust.

    In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations.

    We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.

  93. ⁠, Geoffrey Hinton, Oriol Vinyals, Jeff Dean (2015-03-09):

    A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

  94. ⁠, Cihang Xie, Mingxing Tan, Boqing Gong, Alan Yuille, Quoc V. Le (2020-06-25):

    It is commonly believed that networks cannot be both accurate and robust, that gaining robustness means losing accuracy. It is also generally believed that, unless making networks larger, network architectural elements would otherwise matter little in improving adversarial robustness. Here we present evidence to challenge these common beliefs by a careful study about adversarial training. Our key observation is that the widely-used ReLU activation function significantly weakens adversarial training due to its non-smooth nature. Hence we propose smooth adversarial training (SAT), in which we replace with its smooth approximations to strengthen adversarial training. The purpose of smooth activation functions in SAT is to allow it to find harder adversarial examples and compute better gradient updates during adversarial training.

    Compared to standard adversarial training, SAT improves adversarial robustness for “free”, i.e., no drop in accuracy and no increase in computational cost. For example, without introducing additional computations, SAT significantly enhances ResNet-50’s robustness from 33.0% to 42.3%, while also improving accuracy by 0.9% on ImageNet. SAT also works well with larger networks: it helps EfficientNet-L1 to achieve 82.2% accuracy and 58.6% robustness on ImageNet, outperforming the previous state-of-the-art defense by 9.5% for accuracy and 11.6% for robustness. Models are available at https:/​​​​/​​​​​​​​cihangxie/​​​​SmoothAdversarialTraining.

  95. ⁠, Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee (2019-12-05):

    Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

  96. ⁠, Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid (2019-04-03):

    Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.

  97. ⁠, Karen Hao (2020-02-17):

    There are two prevailing technical theories about what it will take to reach AGI. In one, all the necessary techniques already exist; it’s just a matter of figuring out how to scale and assemble them. In the other, there needs to be an entirely new paradigm; deep learning, the current dominant technique in AI, won’t be enough. Most researchers fall somewhere between these extremes, but OpenAI has consistently sat almost exclusively on the scale-and-assemble end of the spectrum. Most of its breakthroughs have been the product of sinking dramatically greater computational resources into technical innovations developed in other labs.

    Brockman and Sutskever deny that this is their sole strategy, but the lab’s tightly guarded research suggests otherwise. A team called “Foresight” runs experiments to test how far they can push AI capabilities forward by training existing algorithms with increasingly large amounts of data and computing power. For the leadership, the results of these experiments have confirmed its instincts that the lab’s all-in, compute-driven strategy is the best approach. For roughly six months, these results were hidden from the public because OpenAI sees this knowledge as its primary competitive advantage. Employees and interns were explicitly instructed not to reveal them, and those who left signed nondisclosure agreements. It was only in January that the team, without the usual fanfare, on one of the primary open-source databases for AI research. People who experienced the intense secrecy around the effort didn’t know what to make of this change. Notably, from different researchers had been posted a month earlier.

    …One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI. [See ⁠, ⁠.]

    …The man driving OpenAI’s strategy is Dario Amodei, the ex-Googler who now serves as research director. When I meet him, he strikes me as a more anxious version of Brockman. He has a similar sincerity and sensitivity, but an air of unsettled nervous energy. He looks distant when he talks, his brows furrowed, a hand absentmindedly tugging his curls. Amodei divides the lab’s strategy into two parts. The first part, which dictates how it plans to reach advanced AI capabilities, he likens to an investor’s “portfolio of bets.” Different teams at OpenAI are playing out different bets. The language team, for example, has its money on a theory postulating that AI can develop a substantial understanding of the world through mere language learning. The robotics team, in contrast, is advancing an opposing theory that intelligence requires a physical embodiment to develop. As in an investor’s portfolio, not every bet has an equal weight. But for the purposes of scientific rigor, all should be tested before being discarded. Amodei points to ⁠, with its remarkably realistic auto-generated texts, as an instance of why it’s important to keep an open mind. “Pure language is a direction that the field and even some of us were somewhat skeptical of”, he says. “But now it’s like, ‘Wow, this is really promising.’” Over time, as different bets rise above others, they will attract more intense efforts. Then they will cross-pollinate and combine. The goal is to have fewer and fewer teams that ultimately collapse into a single technical direction for AGI. This is the exact process that OpenAI’s latest top-secret project has supposedly already begun.

  98. ⁠, Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee (2019-11-05):

    Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex inductive biases inside network architectures with highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving.

  99. 2019-vinyals.pdf#deepmind: ⁠, Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, David Silver (2019-10-30; reinforcement-learning):

    Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional e-sports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players. We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies, each represented by deep neural networks. We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.

  100. ⁠, Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W. Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden, Nando de Freitas (2018-10-11):

    Humans are experts at high-fidelity imitation—closely mimicking a demonstration, often in one attempt. Humans use this ability to quickly solve a task instance, and to learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL. This paper introduces, to the best of our knowledge, the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on a challenging manipulation task. The results also show that both types of policy can be learned from vision, in spite of the task rewards being sparse, and without access to demonstrator actions.

  101. ⁠, Tero Karras, Samuli Laine, Timo Aila (2018-12-12):

    We propose an alternative generator architecture for generative adversarial networks, borrowing from literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.

  102. ⁠, Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap (2017-06-05):

    Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called ⁠, on which we achieve state-of-the-art, super-human performance; text-based question answering using the suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.

  103. 2018-eslami.pdf#deepmind: ⁠, S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, Demis Hassabis (2018-06-15; reinforcement-learning):

    A scene-internalizing computer program: To train a computer to “recognize” elements of a scene supplied by its visual sensors, computer scientists typically use millions of images painstakingly labeled by humans. Eslami et al. developed an artificial vision system, dubbed the Generative Query Network (GQN), that has no need for such labeled data. Instead, the GQN first uses images taken from different viewpoints and creates an abstract description of the scene, learning its essentials. Next, on the basis of this representation, the network predicts what the scene would look like from a new, arbitrary viewpoint.

    Scene representation—the process of converting visual sensory data into concise descriptions—is a requirement for intelligent behavior. Recent work has shown that neural networks excel at this task when provided with large, labeled datasets. However, removing the reliance on human labeling remains an important open problem. To this end, we introduce the Generative Query Network (GQN), a framework within which machines learn to represent scenes using only their own sensors. The GQN takes as input images of a scene taken from different viewpoints, constructs an internal representation, and uses this representation to predict the appearance of that scene from previously unobserved viewpoints. The GQN demonstrates representation learning without human labels or domain knowledge, paving the way toward machines that autonomously learn to understand the world around them.

  104. ⁠, Peter Clark, Oyvind Tafjord, Kyle Richardson (2020-02-14):

    Beginning with McCarthy’s Advice Taker (1959), AI has pursued the goal of providing a system with explicit, general knowledge and having the system reason over that knowledge. However, expressing the knowledge in a formal (logical or probabilistic) representation has been a major obstacle to this research. This paper investigates a modern approach to this problem where the facts and rules are provided as natural language sentences, thus bypassing a formal representation. We train transformers to reason (or emulate reasoning) over these sentences using synthetically generated data. Our models, that we call RuleTakers, provide the first empirical demonstration that this kind of soft reasoning over language is learnable, can achieve high (99%) accuracy, and generalizes to test data requiring substantially deeper chaining than seen during training (95%+ scores). We also demonstrate that the models transfer well to two hand-authored rulebases, and to rulebases paraphrased into more natural language. These findings are significant as it suggests a new role for transformers, namely as limited “soft theorem provers” operating over explicit theories in language. This in turn suggests new possibilities for explainability, correctability, and counterfactual reasoning in question-answering.

  105. ⁠, Felix Hill, Andrew Lampinen, Rosalia Schneider, Stephen Clark, Matthew Botvinick, James L. McClelland, Adam Santoro (2019-10-01):

    The question of whether deep neural networks are good at generalising beyond their immediate training experience is of critical importance for learning-based approaches to AI. Here, we consider tests of out-of-sample generalisation that require an agent to respond to never-seen-before instructions by manipulating and positioning objects in a 3D Unity simulated room. We first describe a comparatively generic agent architecture that exhibits strong performance on these tests. We then identify three aspects of the training regime and environment that make a statistically-significant difference to its performance: (a) the number of object/​​​​word experiences in the training set; (b) the visual invariances afforded by the agent’s perspective, or frame of reference; and (c) the variety of visual input inherent in the perceptual aspect of the agent’s perception. Our findings indicate that the degree of generalisation that networks exhibit can depend critically on particulars of the environment in which a given task is instantiated. They further suggest that the propensity for neural networks to generalise in systematic ways may increase if, like human children, those networks have access to many frames of richly varying, multi-modal observations as they learn.

  106. ⁠, Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, Ruslan Salakhutdinov (2017-06-22):

    To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.

  107. ⁠, Haonan Yu, Haichao Zhang, Wei Xu (2018-01-31):

    We build a virtual agent for learning language in a 2D maze-like world. The agent sees images of the surrounding environment, listens to a virtual teacher, and takes actions to receive rewards. It interactively learns the teacher’s language from scratch based on two language use cases: sentence-directed navigation and question answering. It learns simultaneously the visual representations of the world, the language, and the action control. By disentangling language grounding from other computational routines and sharing a concept detection function between language grounding and prediction, the agent reliably interpolates and extrapolates to interpret sentences that contain new word combinations or new words missing from training sentences. The new words are transferred from the answers of language prediction. Such a language ability is trained and evaluated on a population of over 1.6 million distinct sentences consisting of 119 object words, 8 color words, 9 spatial-relation words, and 50 grammatical words. The proposed model significantly outperforms five comparison methods for interpreting zero-shot sentences. In addition, we demonstrate human-interpretable intermediate outputs of the model in the appendix.

  108. ⁠, Brenden M. Lake (2019-06-12):

    People can learn a new concept and use it compositionally, understanding how to “blicket twice” after learning how to “blicket.” In contrast, powerful sequence-to-sequence (seq2seq) neural networks fail such tests of compositionality, especially when composing new concepts together with existing concepts. In this paper, I show how memory-augmented neural networks can be trained to generalize compositionally through meta seq2seq learning. In this approach, models train on a series of seq2seq problems to acquire the compositional skills needed to solve new seq2seq problems. Meta se2seq learning solves several of the SCAN tests for compositional learning and can learn to apply implicit rules to variables.

  109. ⁠, Interactive Agents Group (Josh Abramson, Arun Ahuja, Arthur Brussee, Federico Carnevale, Mary Cassin, Stephen Clark, Andrew Dudzik, Petko Georgiev, Aurelia Guy, Tim Harley, Felix Hill, Alden Hung, Zachary Kenton, Jessica Landon, Timothy Lillicrap, Kory Mathewson, Alistair Muldal, Adam Santoro, Nikolay Savinov, Vikrant Varma, Greg Wayne, Nathaniel Wong, Chen Yan, Rui Zhu) (2020-12-10):

    A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. This setting nevertheless integrates a number of the central challenges of artificial intelligence (AI) research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour. Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioural tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behaviour beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalize beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioural imitation is a promising tool to create intelligent, interactive agents, and the challenge of reliably evaluating such agents is possible to surmount. See videos for an overview of the manuscript, training time-lapse⁠, and human-agent interactions⁠.

    …Although the agents do not yet attain human-level performance, we will soon describe scaling experiments which suggest that this gap could be closed substantially simply by collecting more data…The scripted probe tasks are imperfect measures of model performance, but as we have shown above, they tend to be well correlated with model performance under human evaluation. With each doubling of the dataset size, performance grew by approximately the same increment. The rate of performance, in particular for instruction-following tasks, was larger for the BG·A model compared to B·A. Generally, these results give us confidence that we could continue to improve the performance of the agents straightforwardly by increasing the dataset size.

    Figure 15: Scaling & Transfer. A. Scaling properties for 2 of our agents. The agent’s performance on the scripted probe tasks increased as we trained on more data. In instruction-following tasks in particular, the rate of this increase was higher for BC+GAIL compared to BC (scatter points indicate seeds). B. Transfer learning across different language game prompts. Training on multiple language games simultaneously led to higher performance than training on each single prompt independently. C. Multitask training improved data efficiency. We held out episodes with instructions that contain the words “put”, “position” or “place” and studied how much of this data was required to learn to position objects in the room. When simultaneously trained on all language game prompts, using 1⁄8 of the Position data led to 60% of the performance with all data, compared to 7% if we used the positional data alone. D. Object-colour generalisation. We removed all instances of orange ducks from the data and environment, but we left all other orange objects and all non-orange ducks. The performance at scripted tasks testing for this particular object-colour combination was similar to baseline.

    …After training, we asked the models to “Lift an orange duck” or “What colour is the duck?”…Figure 15D shows that the agent trained without orange ducks performed almost as well on these restricted Lift and Color probe tasks as an agent trained with all of the data. These results demonstrate explicitly what our results elsewhere suggest: that agents trained to imitate human action and language demonstrate powerful combinatorial generalisation capabilities. While they have never encountered the entity, they know what an “orange duck” is and how to interact with one when asked to do so for the first time. This particular example was chosen at random; we have every reason to believe that similar effects would be observed for other compound concepts.

  110. ⁠, OpenAI, Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, Lei Zhang (2019-10-16):

    We demonstrate that models trained only in simulation can be used to solve a manipulation problem of unprecedented complexity on a real robot. This is made possible by two key components: a novel algorithm, which we call automatic domain randomization (ADR) and a robot platform built for machine learning. ADR automatically generates a distribution over randomized environments of ever-increasing difficulty. Control policies and vision state estimators trained with ADR exhibit vastly improved sim2real transfer. For control policies, memory-augmented models trained on an ADR-generated distribution of environments show clear signs of emergent meta-learning at test time. The combination of ADR with our custom robot platform allows us to solve a Rubik’s cube with a humanoid robot hand, which involves both control and state estimation problems. Videos summarizing our results are available: https:/​​​​/​​​​​​​​blog/​​​​solving-rubiks-cube/​​​​

  111. ⁠, Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra (2019-11-01):

    We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling—achieving a speedup of 107× on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience)—over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs.

    This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially solves the task –near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks—the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).

  112. ⁠, Karl Cobbe, Christopher Hesse, Jacob Hilton, John Schulman (2019-12-03):

    Announcement of Procgen: ⁠, Cobbe et al 2019:

    In this report, we introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using this benchmark. We empirically demonstrate that diverse environment distributions are essential to adequately train and evaluate RL agents, thereby motivating the extensive use of procedural content generation. We then use this benchmark to investigate the effects of scaling model size, finding that larger models substantially improve both sample efficiency and generalization.

    …We want the best of both worlds: a benchmark comprised of many diverse environments, each of which fundamentally requires generalization. To fulfill this need, we have created Procgen Benchmark. CoinRun [] now serves as the inaugural environment in Procgen Benchmark, contributing its diversity to a greater whole.

    …We’ve found that all of the Procgen environments require training on 500–1000 different levels before they can generalize to new levels, which suggests that standard RL benchmarks need much more diversity within each environment. Procgen Benchmark has become the standard research platform used by the OpenAI RL team, and we hope that it accelerates the community in creating better RL algorithms.

  113. ⁠, Jacob Hilton, Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah (2020-11-17):

    In this article, we apply interpretability techniques to a reinforcement learning (RL) model trained to play the video game CoinRun. Using attribution combined with dimensionality reduction, we build an interface for exploring the objects detected by the model, and how they influence its value function and policy. We leverage this interface in several ways.

    • Dissecting failure: We perform a step-by-step analysis of the agent’s behavior in cases where it failed to achieve the maximum reward, allowing us to understand what went wrong, and why. For example, one case of failure was caused by an obstacle being temporarily obscured from view.
    • Hallucinations: We find situations when the model “hallucinated” a feature not present in the observation, thereby explaining inaccuracies in the model’s value function. These were brief enough that they did not affect the agent’s behavior.
    • Model editing: We hand-edit the weights of the model to blind the agent to certain hazards, without otherwise changing the agent’s behavior. We verify the effects of these edits by checking which hazards cause the new agents to fail. Such editing is only made possible by our previous analysis, and thus provides a quantitative validation of this analysis.

    Our results depend on levels in CoinRun being procedurally-generated, leading us to formulate a diversity hypothesis for interpretability. If it is correct, then we can expect RL models to become more interpretable as the environments they are trained on become more diverse. We provide evidence for our hypothesis by measuring the relationship between interpretability and generalization.

    …All of the above analysis uses the same hidden layer of our network, the third of five convolutional layers, since it was much harder to find interpretable features at other layers. Interestingly, the level of abstraction at which this layer operates—finding the locations of various in-game objects—is exactly the level at which CoinRun levels are randomized using procedural generation. Furthermore, we found that training on many randomized levels was essential for us to be able to find any interpretable features at all.

    This led us to suspect that the diversity introduced by CoinRun’s randomization is linked to the formation of interpretable features. We call this the diversity hypothesis:

    Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).

    Our explanation for this hypothesis is as follows. For the forward implication (“only if”), we only expect features to be interpretable if they are general enough, and when the training distribution is not diverse enough, models have no incentive to develop features that generalize instead of overfitting. For the reverse implication (“if”), we do not expect it to hold in a strict sense: diversity on its own is not enough to guarantee the development of interpretable features, since they must also be relevant to the task. Rather, our intention with the reverse implication is to hypothesize that it holds very often in practice, as a result of generalization being bottlenecked by diversity.

  114. ⁠, Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, Sonal Gupta (2021-01-26):

    We propose pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning. Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples), and is designed to encourage learning of representations that generalize better to many different tasks. We show that pre-finetuning consistently improves performance for pretrained discriminators (eg. ) and generation models (eg. ) on a wide range of tasks (sentence prediction, commonsense reasoning, [machine reading comprehension] MRC, etc.), while also substantially improving sample efficiency during fine-tuning. We also show that large-scale multi-tasking is crucial; pre-finetuning can hurt performance when few tasks are used up until a critical point (usually above 15) after which performance improves linearly in the number of tasks.

    Figure 1: We plot the RoBERTa evaluation accuracy of five datasets: RTE, BoolQ, RACE, SQuAD, and MNLI, across various scales of multi-task learning measured in the number of datasets. We notice that performance initially degrades until a critical point is reached regarding the number of the datasets used by the MTL framework for all but one dataset. Past this critical point, our representations improve over the original RoBERTa model.
  115. 2018-silver.pdf#deepmind: ⁠, David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2018-12-07; reinforcement-learning):

    The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.

  116. ⁠, Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver (2019-11-19):

    Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown.

    In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function.

    When evaluated on 57 different Atari games—the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled—our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the algorithm that was supplied with the game rules.

  117. 1995-breiman.pdf: ⁠, Leo Breiman (1995-01-01; ai):

    The theoretical work by Weiner and others on the spectral analysis of stationary time series penetrated statistics following Tukey’s heuristic work on estimation of the spectrum. In refereeing papers for NIPS the author was struck by the growing emphasis on mathematical theory. Mathematical theory is not critical to the development of machine learning. In machine learning, the current panacea is a sigmoid network fitted using backpropagation. The pi-method, for approximating functions using noisy data, was suggested by results in mathematical approximation theory. In spite of intense activity, none of the work has had any effect on the day-to-day practice of statistics, or even on present-day theory. The useful theories was not meant to be inclusive, but even a more inclusive list would be very short. A possible reason is that it is difficult to formulate reasonable analytic models for complex data.

    Uses Of Theory

    • Comfort: We knew it worked, but it’s nice to have a proof.
    • Insight: Aha! So that’s why it works.
    • Innovation: At last, a mathematically proven idea that applies to data.
    • Suggestion: Something like this might work with data.

    …Our fields would be better off with far fewer theorems, less emphasis on faddish stuff, and much more scientific inquiry and engineering. But the latter requires real thinking. For instance, there are many important questions regarding neural networks which are largely unanswered. There seem to be conflicting stories regarding the following issues:

    • Why don’t heavily parameterized neural networks overfit the data?
    • What is the effective number of parameters?
    • Why doesn’t backpropagation head for a poor local minima?
    • When should one stop the backpropagation and use the current parameters?

    It makes research more interesting to know that there is no one universally best method. What is best is data dependent. Sometimes “least glamorous” methods such as nearest neighbor are best. We need to learn more about what works best where. But emphasis on theory often distracts us from doing good engineering and living with the data.

  118. ⁠, Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (2016-11-10):

    Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training.

    Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice.

    We interpret our experimental findings by comparison with traditional models.

  119. ⁠, Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever (OpenAI) (2019-12-05):

    Paper: ⁠.

    Many classes of modern deep learning models, including CNNs, ResNets, and transformers, exhibit the previously-observed double descent phenomenon when not using early stopping or regularization. The peak occurs predictably at a “critical regime”, where the models are barely able to fit the training set. As we increase the number of parameters in a neural network, the test error initially decreases, increases, and, just as the model is able to fit the train set, undergoes a second descent.

    1. There is a regime where bigger models are worse.
    2. There is a regime where more samples hurts.
    3. There is a regime where training longer reverses overfitting.

    In general, the peak of test error appears systematically when models are just barely able to fit the train set.

    Overfitting a 1000-degree spline to a cubic curve works.

    Our intuition is that, for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or misspecified labels will destroy its global structure. That is, there are no “good models” which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set and there exist such good models. Moreover, the implicit bias of (SGD) leads it to such good models, for reasons we don’t yet understand.

    [See also ⁠; Belkin 2021⁠, Boaz Barak:

    TL;DR: Our paper shows that occurs in conventional modern deep learning settings: visual classification in the presence of label noise (CIFAR 10, CIFAR 100) and machine translation (IWSLT’14 and WMT’14). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model is able to fit the train set, it undergoes a second descent, again decreasing as the number of parameters increases. This behavior also extends over train epochs, where a single model undergoes double-descent in test error over the course of training. Surprisingly (at least to us!), we show these phenomenon can lead to a regime where “more data hurts”—training a deep network on a larger train set actually performs worse.

    …It seems that the higher the degree, the worse things are, but what happens if we go even higher? It seems like a crazy idea—-why would we increase the degree beyond the number of samples? But it corresponds to the practice of having many more parameters than training samples in modern deep learning. Just like in deep learning, when the degree is larger than the number of samples, there is more than one polynomial that fits the data—but we choose a specific one: the one found running gradient descent. Here is what happens if we do this for degree 1000, fitting a polynomial using gradient descent (see this notebook):

    We still fit all the training points, but now we do so in a more controlled way which actually tracks quite closely the ground truth. We see that despite what we learn in statistics textbooks, sometimes overfitting is not that bad, as long as you go “all in” rather than “barely overfitting” the data. That is, overfitting doesn’t hurt us if we take the number of parameters to be much larger than what is needed to just fit the training set—and in fact, as we see in deep learning, larger models are often better.]

  120. {#linkBibliography-morcos-(fair)-2019 .docMetadata}, Ari Morcos, Yuandong Tian (FAIR) (2019-11-25):

    The lottery ticket hypothesis, initially proposed by researchers Jonathan Frankle and Michael Carbin at MIT, suggests that by training deep neural networks (DNNs) from “lucky” initializations, often referred to as “winning lottery tickets”, we can train networks which are 10–100× smaller with minimal losses—or even while achieving gains—in performance. This work has exciting implications for potentially finding ways to not only train with fewer resources, but also run faster inference of models on smaller devices, like smartphones and VR headsets. But the lottery ticket hypothesis is not yet fully understood by the AI community. In particular, it has remained unclear whether winning tickets are dependent on specific factors or rather represent an intrinsic feature of DNNs.

    New research from Facebook AI finds the first definitive evidence that lottery tickets generalize across related, but distinct datasets and can extend to reinforcement learning (RL) and natural language processing (NLP). We’re sharing details on the results of our experiments using winning tickets, and we’re also introducing a new theoretical framework on the formation of lottery tickets to help researchers advance toward a better understanding of lucky initializations.

    …there are many more open questions about the underlying properties and behaviors of neural networks, such as how do these winning tickets form, why do they exist, and how do they work?

    To begin to analyze these questions in the context of deep ReLU networks, we used a student-teacher setting, in which a larger student network must learn to mimic exactly what the smaller teacher is doing. Since we can define the teacher network with fixed parameters in this setting, we can quantitatively measure the student network’s learning progress, and, critical to our investigation of lottery tickets, how the student network’s initialization affects the learning process.

    In the student-teacher setting, we see that after training, the activity patterns of select student neurons correlate more strongly with those of teacher neurons than with the activity of other student neurons—a concept that is referred to as “student specialization.” This stronger correlation suggests that, during training, the student network not only learns the teacher’s network output but also the internal structure of the teacher by mimicking individual teacher neurons.

    In our analysis, we show this occurrence happens locally in a 2-layer ReLU network: if the initial weights of a student neuron happen to be similar to those of some teacher neurons, then specialization will follow. The size of the neural network is important because the larger the student network, the more likely that one of the student neurons will start out close enough to a teacher neuron to learn to mimic its activity during training. What’s more, if a student neuron’s initial activation region has a more substantial overlap with a teacher neuron, then that student neuron specializes faster. This behavior corroborates the lottery ticket hypothesis, which similarly proposes that some lucky subset of initializations exist within neural networks, and “winning tickets” are the lucky student neurons that happen to be in the right location at the beginning of training. In our follow-up research, we strengthen our results by removing many mathematical assumptions, including independent activations and locality, and still prove that student specialization happens in the lowest layer in deep ReLU networks after training. From our analysis, we find certain mathematical properties in the training dynamics resonate with the lottery ticket phenomenon: those weights with a slight advantage in the initialization may have a greater chance of being the winning tickets after training converges.

  121. ⁠, Andrew Gordon Wilson, Pavel Izmailov (2020-02-20):

    The key distinguishing property of a is marginalization, rather than using a single setting of weights. Bayesian marginalization can particularly improve the accuracy and calibration of modern deep neural networks, which are typically underspecified by the data, and can represent many compelling but different solutions. We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead. We also investigate the prior over functions implied by a vague distribution over neural network weights, explaining the generalization properties of such models from a probabilistic perspective. From this perspective, we explain results that have been presented as mysterious and distinct to neural network generalization, such as the ability to fit images with random labels, and show that these results can be reproduced with Gaussian processes. We also show that Bayesian model averaging alleviates double descent, resulting in monotonic performance improvements with increased flexibility. Finally, we provide a Bayesian perspective on tempering for calibrating predictive distributions.

  122. ⁠, Geoffrey Roeder, Luke Metz, Diederik P. Kingma (2020-07-01):

    Identifiability is a desirable property of a statistical model: it implies that the true model parameters may be estimated to any desired precision, given sufficient computational resources and data. We study identifiability in the context of representation learning: discovering nonlinear data representations that are optimal with respect to some downstream task. When parameterized as deep neural networks, such representation functions typically lack identifiability in parameter space, because they are overparameterized by design. In this paper, building on recent advances in nonlinear ICA, we aim to rehabilitate identifiability by showing that a large family of discriminative models are in fact identifiable in function space, up to a linear indeterminacy. Many models for representation learning in a wide variety of domains have been identifiable in this sense, including text, images and audio, state-of-the-art at time of publication. We derive sufficient conditions for linear identifiability and provide empirical support for the result on both simulated and real-world data.

  123. ⁠, Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, Shan Carter (OA) (2020-03-10):

    Most work on interpretability aims to give simple explanations of an entire neural network’s behavior. But what if we instead take an approach inspired by neuroscience or cellular biology — an approach of zooming in? What if we treated individual neurons, even individual weights, as being worthy of serious investigation? What if we were willing spend thousands of hours tracing through every neuron and its connections? What kind of picture of neural networks would emerge?

    In contrast to the typical picture of neural networks as a black box, we’ve been surprised how approachable the network is on this scale. Not only do neurons seem understandable (even ones that initially seemed inscrutable), but the “circuits” of connections between them seem to be meaningful algorithms corresponding to facts about the world. You can watch a circle detector be assembled from curves. You can see a dog head be assembled from eyes, snout, fur and tongue. You can observe how a car is composed from wheels and windows. You can even find circuits implementing simple logic: cases where the network implements AND, OR, or XOR over high-level visual features.

    Three Speculative Claims about Neural Networks

    1. Features: Features are the fundamental unit of neural networks. They correspond to directions. By “direction” we mean a linear combination of neurons in a layer. You can think of this as a direction vector in the vector space of activations of neurons in a given layer. Often, we find it most helpful to talk about individual neurons, but we’ll see that there are some cases where other combinations are a more useful way to analyze networks — especially when neurons are “polysemantic.” (See the for a detailed definition.) These features can be rigorously studied and understood.
    2. Circuits: Features are connected by weights, forming circuits. A “circuit” is a computational subgraph of a neural network. It consists of a set of features, and the weighted edges that go between them in the original network. Often, we study quite small circuits — say with less than a dozen features — but they can also be much larger. (See the for a detailed definition.) These circuits can also be rigorously studied and understood.
    3. Universality: Analogous features and circuits form across models and tasks.
  124. ⁠, Chris Olah (2014-04-06):

    [Discussion of geometric interpretations of neural networks: each layer in a NN continuously ‘squashes’ or ‘squeezes’ points (data), gradually associating like with like, and creating new abstractions/​​​​representations. By stacking many of these, a NN can approximate extremely complex nonlinear functions which solve the problem. Olah provides animations to visualize how the datapoints in standard toy problems like the ‘Swiss roll’ example are stretched and warped until they can be solved easily by a simple linear function. This helps us understand what a NN does, and can provide some simple limiting results on what a NN of a given size/​​​​depth can or cannot do.]

    …it can be quite challenging to understand what a neural network is really doing. If one trains it well, it achieves high quality results, but it is challenging to understand how it is doing so. If the network fails, it is hard to understand what went wrong. While it is challenging to understand the behavior of deep neural networks in general, it turns out to be much easier to explore low-dimensional deep neural networks—networks that only have a few neurons in each layer. In fact, we can create visualizations to completely understand the behavior and training of such networks. This perspective will allow us to gain deeper intuition about the behavior of neural networks and observe a connection linking neural networks to an area of mathematics called topology.

    …Topological properties of data, such as links, may make it impossible to linearly separate classes using low-dimensional networks, regardless of depth. Even in cases where it is technically possible, such as spirals, it can be very challenging to do so. To accurately classify data with neural networks, wide layers are sometimes necessary. Further, traditional neural network layers do not seem to be very good at representing important manipulations of manifolds; even if we were to cleverly set weights by hand, it would be challenging to compactly represent the transformations we want. New layers, specifically motivated by the manifold perspective of machine learning, may be useful supplements.

    The Manifold Hypothesis: Is this relevant to real world data sets, like image data? If you take the manifold hypothesis really seriously, I think it bears consideration. The manifold hypothesis is that natural data forms lower-dimensional manifolds in its embedding space. There are both theoretical3 and experimental4 reasons to believe this to be true. If you believe this, then the task of a classification algorithm is fundamentally to separate a bunch of tangled manifolds.

  125. ⁠, Laurent Orseau, Marcus Hutter, Omar Rivasplata (2020-06-22):

    The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds:the overparameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork.

  126. 2020-hasson.pdf: ⁠, Uri Hasson, Samuel A. Nastase, Ariel Goldstein (2020-02-05; ai):

    Evolution is a blind fitting process by which organisms become adapted to their environment. Does the brain use similar brute-force fitting processes to learn how to perceive and act upon the world? Recent advances in artificial neural networks have exposed the power of optimizing millions of synaptic weights over millions of observations to operate robustly in real-world contexts. These models do not learn simple, human-interpretable rules or representations of the world; rather, they use local computations to interpolate over task-relevant manifolds in a high-dimensional parameter space. Counterintuitively, similar to evolutionary processes, over-parameterized models can be simple and parsimonious, as they provide a versatile, robust solution for learning a diverse set of functions. This new family of direct-fit models present a radical challenge to many of the theoretical assumptions in psychology and neuroscience. At the same time, this shift in perspective establishes unexpected links with developmental and ecological psychology.

    [Keywords: evolution, experimental design, interpolation, learning, neural networks]


  128. ⁠, Jacob Cannell (2015-06-24):

    This article presents an emerging architectural hypothesis of the brain as a biological implementation of a Universal Learning Machine. I present a rough but complete architectural view of how the brain works under the universal learning hypothesis. I also contrast this new viewpoint—which comes from computational neuroscience and machine learning—with the older evolved modularity hypothesis popular in evolutionary psychology and the heuristics and biases literature. These two conceptions of the brain lead to very different predictions for the likely route to AGI, the value of neuroscience, the expected differences between AGI and humans, and thus any consequent safety issues and dependent strategies.

    Intro · Two viewpoints on the Mind · Universal Learning Machines · Historical Interlude · Dynamic Rewiring · Brain Architecture (the whole brain in one picture and a few pages of text) · The Basal Ganglia · Implications for AGI · Conclusion

    …The roots of the universal learning hypothesis can be traced back to Mountcastle’s discovery of the simple uniform architecture of the cortex. The universal learning hypothesis proposes that all substantial mental algorithms are learned; nothing is innate except for the learning and reward machinery itself (which is somewhat complicated, involving a number of systems and mechanisms), the initial rough architecture (equivalent to a prior over mindspace), and a small library of simple innate circuits (analogous to the operating system layer in a computer). In this view the mind (software) is distinct from the brain (hardware). The mind is a complex software system built out of a general learning mechanism…The key takeaway is that the data is what matters—and in the end it is all that matters. Train a universal learner on image data and it just becomes a visual system. Train it on speech data and it becomes a speech recognizer. Train it on ATARI and it becomes a little gamer agent.

    Conclusion: Ray Kurzweil has been predicting for decades that AGI will be built by reverse engineering the brain, and this particular prediction is not especially unique—this has been a popular position for quite a while. My own investigation of neuroscience and machine learning led me to a similar conclusion some time ago.

    The recent progress in deep learning, combined with the emerging modern understanding of the brain, provide further evidence that AGI could arrive around the time when we can build and train ANNs with similar computational power as measured very roughly in terms of neuron/​​​​synapse counts. In general the evidence from the last four years or so supports Hanson’s viewpoint from the Foom debate. More specifically, his general conclusion:

    Future superintelligences will exist, but their vast and broad mental capacities will come mainly from vast mental content and computational resources. By comparison, their general architectural innovations will be minor additions.

    The ULH supports this conclusion. Current engines can already train and run models with around 10 million neurons and 10 billion (compressed/​​​​shared) synapses on a single GPU, which suggests that the goal could soon be within the reach of a large organization. Furthermore, Moore’s Law for GPUs still has some steam left, and software advances are currently improving simulation performance at a faster rate than hardware. These trends implies that Anthropomorphic/​​​​Neuromorphic AGI could be surprisingly close, and may appear suddenly. What kind of leverage can we exert on a short timescale?

  129. 2012-herculanohouzel.pdf: ⁠, Suzana Herculano-Houzel (2012-06-19; psychology):

    [] Neuroscientists have become used to a number of “facts” about the human brain: It has 100 billion neurons and 10- to 50-fold more glial cells; it is the largest-than-expected for its body among primates and mammals in general, and therefore the most cognitively able; it consumes an outstanding 20% of the total body energy budget despite representing only 2% of body mass because of an increased metabolic need of its neurons; and it is endowed with an overdeveloped ⁠, the largest compared with brain size.

    These facts led to the widespread notion that the human brain is literally extraordinary: an outlier among mammalian brains, defying evolutionary rules that apply to other species, with a uniqueness seemingly necessary to justify the superior cognitive abilities of humans over mammals with even larger brains. These facts, with deep implications for neurophysiology and evolutionary biology, are not grounded on solid evidence or sound assumptions, however.

    Our recent development of a method that allows rapid and reliable quantification of the numbers of cells that compose the whole brain has provided a means to verify these facts. Here, I review this recent evidence and argue that, with 86 billion neurons and just as many nonneuronal cells, the human brain is a scaled-up primate brain in its cellular composition and metabolic cost, with a relatively enlarged cerebral cortex that does not have a relatively larger number of brain neurons yet is remarkable in its cognitive abilities and metabolism simply because of its extremely large number of neurons.

  130. 2014-cambria.pdf: ⁠, Erik Cambria, Bebo White (2014-04-10; ai):

    Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of ` jumping curves’ from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves—namely Syntactics, Semantics, and Pragmatics Curves—which will eventually lead NLP research to evolve into natural language understanding.

    “Figure 1: Envisioned evolution of NLP research through three different eras or curves”
  131. Differences#efficient-natural-languages

  132. 1962-teller-thelegacyofhiroshima.pdf: “The Legacy of Hiroshima”⁠, Edward Teller, Allen Brown

  133. ⁠, Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter (2020-07-16):

    We show that the transformer attention mechanism is the update rule of a modern with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. Transformer and models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal for metastable states, is uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed learning of transformer and BERT models. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging, eg. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns. We provide a new PyTorch layer called “Hopfield”, which allows to equip deep learning architectures with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. (GitHub) [See also their blog⁠; ⁠, Widrich et al 2020; ⁠, Krotov & Hopfield 2020; ⁠, Geva et al 2020.]

    [Meme summary of “Hopfield Networks is All You Need”, Ramsauer et al 2020.]
  134. 2019-radford-figure4-gpt2validationloss.png

  135. 2020-brown-figure31-gpt3scaling.png

  136. 1993-marcus.pdf: ⁠, Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz (1993-10-01; cs):

    In this paper, we review our experience with constructing one such large annotated corpus—the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989–1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure.

  137. ⁠, Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson (2013-12-11):

    We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline.

    The benchmark is available as a project; besides the scripts needed to rebuild the training/​​​​held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline models.

  138. ⁠, Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, Raquel Fernández (2016-06-20):

    We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse. We show that LAMBADA exemplifies a wide range of linguistic phenomena, and that none of several state-of-the-art language models reaches accuracy above 1% on this novel benchmark. We thus propose LAMBADA as a challenging test set, meant to encourage the development of new models capable of genuine understanding of broad context in natural language text.


  140. 2017-shen.pdf: ⁠, Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, Dietrich Klakow (2017-01-01; ai):

    Language models (LMs) have gained dramatic improvement in the past years due to the wide application of neural networks. This raises the question of how far we are away from the perfect language model and how much more research is needed in language modelling. As for perplexity giving a value for human perplexity (as an upper bound of what is reasonably expected from an LM) is difficult. Word error rate (WER) has the disadvantage that it also measures the quality of other components of a speech recognizer like the acoustic model and the feature extraction. We therefore suggest evaluating LMs in a generative setting (which has been done before on selected hand-picked examples) and running a human evaluation on the generated sentences. The results imply that LMs need about 10 to 20 more years of research before human performance is reached. Moreover, we show that the human judgement scores on the generated sentences and perplexity are closely correlated. This leads to an estimated perplexity of 12 for an LM that would be able to pass the human judgement test in the setting we suggested.

    [Keywords: language model, generative task, human judgement score, performance gap]





  145. ⁠, Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, Ilya Sutskever (2017-03-10):

    We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.

  146. ⁠, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov (2017-07-20):

    We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

  147. ⁠, Andy Jones (2020-07-27):

    I am worried we’re in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think is the trigger for 100× larger projects at Google, Facebook and the like, with timelines measured in months.

    …GPT-3 has been estimated to cost $5m in compute to train, and—looking at the author list and OpenAI’s overall size—maybe another $10m in labour.

    Google, Amazon and Microsoft each spend about $20bn/​​​​year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/​​​​year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100× is entirely plausible right now. All that’s necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability. A concrete example is Waymo, which is raising $2bn investment rounds—and that’s for a technology with a much longer road to market…The current hardware floor is nearer to the RTX 2080 TI’s $1k/​​​​unit for 125 tensor-core TFLOPS, and that gives you $25/​​​​PFLOPS-day. This roughly aligns with AI Impacts’ current estimates⁠, and offers another >10× speedup to our model.

    …I think the key question is if by 1000×, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is—and I think it’s plausible that it will be—then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP. On paper that leaves room for another 1000× scale-up as it reaches up to $1tn, though current market mechanisms aren’t really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.

    That’s from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.

  148. ⁠, Shane Legg (2009-12-07):

    It’s been an interesting year in which I’ve been exposed to far more neuroscience than ever before. What I’ve learnt, plus other news I’ve absorbed during the year, has helped to clarify my thinking on the future of AI. First, let’s begin with computer power. I recently gave a talk at the Gatsby Unit on the singularity in which I used the following graph showing the estimated LINPACK scores of the fastest computers over the last 50 years:

    [Top supercomputer LINPACK performance in FLOPS, 1960–2020]

    First observation: just like the people who told me in 1990 that exponential growth in supercomputer power couldn’t continue for another decade, the people who told me this in 2000 were again completely wrong. Ha ha, told you so! So let me make another prediction: for the next decade this pattern will once again roughly hold, taking us to about 1018 FLOPS by 2020.

    Third observation: it looks like we’re heading towards 1020 FLOPS before 2030, even if things slow down a bit from 2020 onwards…Desktop performance is also continuing this trend. I recently saw that a PC with just 2 high end graphics cards is around 1013 FLOPS of SGEMM performance. I also read a paper recently showing that less powerful versions of these cards lead to around 100× performance increases over CPU computation when learning large deep belief networks.

    Conclusion: computer power is unlikely to be the issue anymore in terms of AGI being possible. The main question is whether we can find the right algorithms. Of course, with more computer power we have a more powerful tool with which to hunt for the right algorithms and it also allows any algorithms we find to be less efficient. Thus growth in computer power will continue to be an important factor.

    Having dealt with computation, now we get to the algorithm side of things. One of the big things influencing me this year has been learning about how much we understand about how the brain works, in particular, how much we know that should be of interest to AGI designers. I won’t get into it all here, but suffice to say that just a brief outline of all this information would be a 20 page journal paper (there is currently a suggestion that I write such a paper next year with some Gatsby Unit neuroscientists, but for the time being I’ve got too many other things to attend to). At a high level what we are seeing in the brain is a fairly sensible looking AGI design. You’ve got hierarchical temporal abstraction formed for perception and action combined with more precise timing motor control, with an underlying system for reinforcement learning. The reinforcement learning system is essentially a type of temporal difference learning though unfortunately at the moment there is evidence in favour of actor-critic, Q-learning and also SARSA type mechanisms—this picture should clear up in the next year or so. The system contains a long list of features that you might expect to see in a sophisticated reinforcement learner such as pseudo-rewards for informative queues, inverse reward computations, uncertainty and environmental change modelling, dual model based and model free modes of operation, things to monitor context, it even seems to have mechanisms that reward the development of conceptual knowledge. When I ask leading experts in the field whether we will understand reinforcement learning in the human brain within ten years, the answer I get back is “yes, in fact we already have a pretty good idea how it works and our knowledge is developing rapidly.”

    The really tough nut to crack will be how the cortical system works…Thus I suspect that for the next 5 years, and probably longer, neuroscientists working on understanding cortex aren’t going to be of much use to AGI efforts. My guess is that sometime in the next 10 years developments in deep belief networks, temporal graphical models, liquid computation models, slow feature analysis etc. will produce sufficiently powerful hierarchical temporal generative models to essentially fill the role of cortex within an AGI.

    Right, so my prediction for the last 10 years has been for roughly human level AGI in the year 2025 (though I also predict that sceptics will deny that it’s happened when it does!).

    [And what rough beast, its hour come round at last, / slouches towards Bethlehem to be born?]

  149. ⁠, Shane Legg (2009-12-28):

    …Machine learning will grow in importance due to ever increasing quantities of data, computer power, and better algorithms. It mostly won’t be publicly seen, however, much like how it’s heavily used in Google and a few financial and pharmaceutical companies at the moment.

    Significant progress will be made in understanding the brain. We will have a rough high level sketch of how the brain works, and some of its processes we will understand quite well. We probably still won’t understand cortical function very well, that will take longer.

    More groups will start AGI projects, particularly from 2015 onwards. These groups will become increasingly mainstream, serious and well funded. This will be driven by faster computers, better machine learning algorithms and a better understanding of the brain’s architecture. Some of these groups will produce small AGIs that will learn to do some interesting things, but they will be nowhere near human level intelligence. They will, however, be preparing the way for this. Concern at the dangers of artificial intelligence will become less fringe but it won’t go mainstream.

    In short, I’m predicting a bigger brighter expanded version of the last few years—nothing particularly radical. I think the real importance of the teenies will be to lay the foundations for more important things to come.

  150. ⁠, Kelley Tantau (2018-05-07):

    It was at the Gatsby​ Computational Neuroscience Unit where he met ⁠, who introduced him to ⁠, and ​ was formed shortly after from a shared passion and grand plans.

    “We were believers that great things were going to happen in this area in the coming years”, ​ said. “When we started DeepMind, we had grand plans. We really believed that machine learning and artificial intelligence were certainly going to take off, and if that did happen, we’d be able to grow DeepMind​ into quite a large organisation. We had big plans and, amazingly, the plans have worked out. You still have to pinch yourself when you see the reality of it.”

    Despite being at the forefront of artificial intelligence for some time, Legg​ said he was still regularly in awe by what machines were able to achieve…But even New Zealand’s most influential global Kiwi in the technology field said there’s no way of knowing what the future holds for artificial intelligence.

    ​“Nobody knows what it’s going to look like in another 10–20 years. I am quite confident, talking to people in hardware, that at least for the next few years, the performance of microprocessors, the amount of information they can process, the amount of mathematical calculations they can perform in a second, will keep growing very rapidly for a few years”, he​ said. “After that, it’s anybody’s guess what will happen.”

  151. ⁠, Shane Legg (2010-10-10):

    It’s been a very eventful year for me, both personally and on the work front. I keep my personal life off this blog, and as for work… um, substantial things are happening but I’m not ready to talk about them yet. 🙂

    My longest running prediction, since 1999, has been the time until roughly human level AGI. It’s been consistent since then, though I decided to clarify things a bit and put down an actual distribution and some parameters. Basically, I gave it a log-normal distribution with a mean of 2028, and a mode of 2025. Over the last year computer power has increased as expected⁠, and so it looks like we’re still on target to have supercomputers with 1018 FLOPS around 2018. In terms of neuroscience and machine learning, I think things are progressing well, maybe a little faster than I’d expected. I was toying with the idea of moving the prediction very slightly closer, but decided to play it safe and keep the prediction unmoved at 2028. With many people thinking I’m too optimistic, showing restraint is perhaps wise. 🙂 I can always move my prediction nearer in a year or two.



  154. ⁠, Adrià Puigdomènech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Charles Blundell (2020-03-31):

    The Atari57 suite of games is a long-standing benchmark to gauge agent performance across a wide range of tasks. We’ve developed Agent57, the first deep reinforcement learning agent to obtain a score that is above the human baseline on all 57 Atari 2600 games. combines an algorithm for efficient exploration with a meta-controller that adapts the exploration and long vs. short-term behaviour of the agent.

    …In 2012, the Arcade Learning environment—a suite of 57 Atari 2600 games (dubbed Atari57)—was proposed as a benchmark set of tasks: these canonical Atari games pose a broad range of challenges for an agent to master…Unfortunately, the average performance can fail to capture how many tasks an agent is doing well on, and so is not a good statistic for determining how general an agent is: it captures that an agent is doing sufficiently well, but not that it is doing sufficiently well on a sufficiently wide set of tasks. So although average scores have increased, until now, the number of above human games has not.

    …Back in 2012, DeepMind developed the Deep Q-network agent () to tackle the Atari57 suite. Since then, the research community has developed many extensions and alternatives to DQN. Despite these advancements, however, all deep reinforcement learning agents have consistently failed to score in four games: Montezuma’s Revenge, Pitfall, Solaris and Skiing. For Agent57 to tackle these four challenging games in addition to the other Atari57 games, several changes to DQN were necessary.

    Figure 3: Conceptual advancements to DQN that have resulted in the development of more generally intelligent agents.
    • DQN improvements

      • Distributed agents
      • Short-term memory
      • Episodic memory
    • Intrinsic motivation methods to encourage directed exploration

      • Seeking novelty over long time scales
      • Seeking novelty over short time scales
      • Meta-controller: learning to balance exploration with exploitation
    • Agent57: putting it all together

    Performance table of Agent57, NGU, R2D2, & MuZero

    …With Agent57, we have succeeded in building a more generally intelligent agent that has above-human performance on all tasks in the Atari57 benchmark. It builds on our previous agent Never Give Up, and instantiates an adaptive meta-controller that helps the agent to know when to explore and when to exploit, as well as what time-horizon it would be useful to learn with. A wide range of tasks will naturally require different choices of both of these trade-offs, therefore the meta-controller provides a way to dynamically adapt such choices.

    Agent57 was able to scale with increasing amounts of computation: the longer it trained, the higher its score got. While this enabled Agent57 to achieve strong general performance, it takes a lot of computation and time; the data efficiency can certainly be improved. Additionally, this agent shows better 5th percentile performance on the set of Atari57 games. This by no means marks the end of Atari research, not only in terms of data efficiency, but also in terms of general performance. We offer two views on this: firstly, analyzing the performance among percentiles gives us new insights on how general algorithms are. While Agent57 achieves strong results on the first percentiles of the 57 games and holds better mean and median performance than NGU or R2D2, as illustrated by ⁠, it could still obtain a higher average performance. Secondly, all current algorithms are far from achieving optimal performance in some games. To that end, key improvements to use might be enhancements in the representations that Agent57 uses for exploration, planning, and credit assignment.


  156. ⁠, Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, Stig Petersen (2016-12-12):

    DeepMind Lab is a first-person 3D game platform designed for research and development of general artificial intelligence and machine learning systems. can be used to study how autonomous artificial agents may learn complex tasks in large, partially observed, and visually diverse worlds. DeepMind Lab has a simple and flexible API enabling creative task-designs and novel AI-designs to be explored and quickly iterated upon. It is powered by a fast and widely recognised game engine, and tailored for effective use by the research community.

  157. 06#deepmind-budget

  158. ⁠, Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou (2017-12-01):

    Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art.

    This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents—the “steepness” of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.

  159. ⁠, Scott Alexander (2018-11-26):

    [Discussion of ⁠, and ⁠.]

    This is the standard presentation of Moore’s Law—the number of transistors you can fit on a chip doubles about every two years (eg grows by 35% per year)…But BJRW have a pessimistic take. There are 18× more people involved in transistor-related research today than in 1971. So if in 1971 it took 1000 scientists to increase transistor density 35% per year, today it takes 18,000 scientists to do the same task. So apparently the average transistor scientist is eighteen times less productive today than fifty years ago. That should be surprising and scary.

    Anyway, most other measurable fields show the same pattern of constant progress in the face of exponentially increasing number of researchers. Here’s BJRW’s data on crop yield [corn & soybeans]…number of chemical elements discovered…BJRW go on to prove the same is true for whatever other scientific fields they care to measure. Measuring scientific progress is inherently difficult, but their finding of constant or log-constant progress in most areas accords with Nintil’s overview of the same topic⁠,…Meanwhile, the increase in researchers is obvious. Not only is the population increasing (by a factor of about 2.5× in the US since 1930), but the percent of people with college degrees has quintupled over the same period. The exact numbers differ from field to field, but orders of magnitude increases are the norm. For example, the number of people publishing astronomy papers seems to have dectupled over the past fifty years or so…So if you take their methodology seriously, over the past ninety years, each researcher has become about 25× less productive in making discoveries that translate into economic growth…All of these lines of evidence lead me to the same conclusion: constant growth rates in response to exponentially increasing inputs is the null hypothesis. If it wasn’t, we should be expecting 50% year-on-year GDP growth, easily-discovered-immortality, and the like. Nobody expected that before reading BJRW, so we shouldn’t be surprised when BJRW provide a data-driven model showing it isn’t happening…I brought this up at the conference, and somebody reasonably objected—doesn’t that mean science will stagnate soon? After all, we can’t keep feeding it an exponentially increasing number of researchers forever. If nothing else stops us, then at some point, 100% (or the highest plausible amount) of the human population will be researchers, we can only increase as fast as population growth, and then the scientific enterprise collapses.

    I answered that the Gods Of Straight Lines are more powerful than the Gods Of The Copybook Headings, so if you try to use common sense on this problem you will fail.

    Imagine being a futurist in 1970 presented with Moore’s Law. You scoff: “If this were to continue only 20 more years, it would mean a million transistors on a single chip! You would be able to fit an entire supercomputer in a shoebox!” But common sense was wrong and the trendline was right.

    “If this were to continue only 40 more years, it would mean ten billion transistors per chip! You would need more transistors on a single chip than there are humans in the world! You could have computers more powerful than any today, that are too small to even see with the naked eye! You would have transistors with like a double-digit number of atoms!” But common sense was wrong and the trendline was right.

    Or imagine being a futurist in ancient Greece presented with world GDP doubling time. Take the trend seriously, and in two thousand years, the future would be fifty thousand times richer. Every man would live better than the Shah of Persia! There would have to be so many people in the world you would need to tile entire countries with cityscape, or build structures higher than the hills just to house all of them. Just to sustain itself, the world would need transportation networks orders of magnitude faster than the fastest horse. But common sense was wrong and the trendline was right.

    …The conference was organized by Patrick Collison and Michael Nielsen; they have written up some of their thoughts here⁠.


  161. {#linkBibliography-(ms)-2020 .docMetadata}, DeepSpeed Team (MS) (2020-05-19):

    Today, we are happy to share our new findings and results as we introduce the improved ZeRO-2 and further developments with DeepSpeed:

    • An order-of-magnitude larger and faster training with ZeRO-2: ZeRO-2 expands the scope of memory optimizations in the original ZeRO by tackling the full spectrum of memory consumption during training. More specifically, ZeRO-2 introduces new technology to reduce the memory footprint of gradients, activation memory, and fragmented memory, in addition to optimizer state memory optimization in the original ZeRO. Altogether, the memory savings empower DeepSpeed to improve the scale and speed of deep learning training by an order of magnitude. More concretely, ZeRO-2 allows training models as large as 170 billion parameters up to 10× faster compared to state of the art.
    • Fastest training: While ZeRO-2 optimizes large models during distributed training, we also introduce new technology to accelerate single GPU performance via kernel optimizations. These optimizations not only create a strong foundation for scaling out large models, but also improve the single GPU performance of highly tuned and moderately sized models like BERT by more than 30%, reaching a staggering performance of 64 teraflops per GPU, which is over 50% of the hardware peak. Using these optimizations as the building block, DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024 NVIDIA V100 GPUs, compared with the best published result of 67 minutes on the same number and generation of GPUs.

    …ZeRO-2: Training models with 100 billion parameters up to 10× faster:

    1. Model scale: State-of-the-art large models (trained without using ZeRO) such as OpenAI , NVIDIA ⁠, and Google have sizes of 1.5B, 8.3B, and 11B parameters respectively. ZeRO-2 provides system capability to efficiently run models of 170 billion parameters, an order-of-magnitude bigger than these largest models (Figure 2, top left). The tests were conducted using 400 NVIDIA V100 GPUs; with more devices (such as 1,000 GPUs), ZeRO-2 allows us to scale toward 200 billion parameters.
    2. Speed: Improved memory efficiency powers higher throughput and faster training. Figure 2 (bottom left) shows system throughput of ZeRO-2, ZeRO-1, and baseline model parallelism. Here we use a state-of-the-art model parallelism approach, NVIDIA Megatron-LM, as baseline-MP, while ZeRO-2 and ZeRO-1 both combine ZeRO-powered data parallelism with Megatron-LM model parallelism. ZeRO-2 runs 100-billion-parameter models with over 38 teraflops per GPU, 30% of hardware peak, and aggregated performance over 15 petaflops on the cluster with 400 NVIDIA V100 GPUs. For models of the same size, ZeRO-2 is up to 10× faster in training speed when compared to the baseline because model parallelism requires high communication bandwidth to be efficient, and models of these sizes require model parallelism across nodes where the communication bandwidth is limited. The memory savings of ZeRO-2 allows us to reduce model parallelism degree and fit the model without requiring inter-node model parallelism, drastically reducing communication cost. ZeRO-2 is also up to 5× faster than ZeRO-1 because its additional memory savings help reduce communication further and support even larger batch sizes.
    3. Scalability: We observe superlinear speedup (Figure 2, top right), where the performance more than doubles when the number of NVIDIA GPUs are doubled. ZeRO-2 reduces the memory footprint of the model states as we increase the data parallelism degree, allowing us to fit larger batch sizes per GPU and resulting in better performance.
    4. Democratizing large model training: ZeRO-2 empowers model scientists to train models up to 13 billion parameters efficiently without any model parallelism that typically requires model refactoring (Figure 2, bottom right). 13 billion parameters is larger than most of the largest state-of-the-art models (such as Google T5, with 11 billion parameters). With respect to throughput, we observe an average throughput of 37 teraflops (30% hardware peak) per V100 GPU for model sizes ranging from 2 billion to 13 billion parameters. Model scientists can therefore experiment freely with large models without worrying about model parallelism. In comparison, the implementations of classic data parallelism approaches (such as PyTorch Distributed Data Parallel) run out of memory with 1.4-billion-parameter models, while ZeRO-1 supports up to 6 billion parameters.

    For more details about ZeRO-2, please see the DeepSpeed GitHub repository and the updated ⁠.

  162. ⁠, DeepSpeed Team, Rangan Majumder, Junhua Wang (MS) (2020-09-10):

    Today, we are happy to share our new advancements that not only push deep learning training to the extreme, but also democratize it for more people—from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU.

    More specifically, DeepSpeed adds 4 new system technologies that further the AI at Scale initiative to innovate across Microsoft’s AI products and platforms. These offer extreme compute, memory, and communication efficiency, and they power model training with billions to trillions of parameters. The technologies also allow for extremely long input sequences and power on hardware systems with a single GPU, high-end clusters with thousands of GPUs, or low-end clusters with very slow ethernet networks.

    • Trillion parameter model training with 3D parallelism: DeepSpeed enables a flexible combination of three parallelism approaches—ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. 3D parallelism adapts to the varying needs of workload requirements to power extremely large models with over a trillion parameters while achieving near-perfect memory-scaling and throughput-scaling efficiency. In addition, its improved communication efficiency allows users to train multi-billion-parameter models 2–7× faster on regular clusters with limited network bandwidth.
    • 10× bigger model training on a single GPU with : We extend ZeRO-2 to leverage both CPU and GPU memory for training large models. Using a machine with a single NVIDIA GPU, our users can run models of up to 13 billion parameters without running out of memory, 10× bigger than the existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models.
    • Powering 10× longer sequences and 6× faster execution through DeepSpeed Sparse Attention: DeepSpeed offers sparse attention kernels—an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers an order-of-magnitude longer input sequence and obtains up to 6× faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.5–3× faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures.
    • with up to 5× communication volume reduction: Adam is an effective and (probably the most well-utilized) optimizer for training many large-scale deep learning models. However, Adam is generally not compatible with communication-efficient optimization algorithms. Therefore, the communication cost could become a bottleneck while scaling across distributed devices. We introduce a new algorithm, 1-bit Adam with efficient implementation, which reduces communication volume by up to 5× while achieving similar convergence efficiency to Adam. We observe up to 3.5× faster distributed training in communication-constrained scenarios, allowing for scaling to different types of GPU clusters and networks.
  163. ⁠, Hans Moravec (1998):

    This paper describes how the performance of AI machines tends to improve at the same pace that AI researchers get access to faster hardware. The processing power and memory capacity necessary to match general intellectual performance of the human brain are estimated. Based on extrapolation of past trends and on examination of technologies under development, it is predicted that the required hardware will be available in cheap machines in the 2020s…At the present rate, computers suitable for human-like robots will appear in the 2020s. Can the pace be sustained for another three decades?

    …By 1990, entire careers had passed in the frozen winter of 1-MIPS computers, mainly from necessity, but partly from habit and a lingering opinion that the early machines really should have been powerful enough. In 1990, 1 MIPS cost $2,338$1,0001990 in a low-end personal computer. There was no need to go any lower. Finally spring thaw has come. Since 1990, the power available to individual AI and robotics programs has doubled yearly, to 30 MIPS by 1994 and 500 MIPS by 1998. Seeds long ago alleged barren are suddenly sprouting. Machines read text, recognize speech, even translate languages. Robots drive cross-country, crawl across Mars, and trundle down office corridors. In 1996 a theorem-proving program called running five weeks on a 50 MIPS computer at Argonne National Laboratory found a proof of a boolean algebra conjecture by Herbert Robbins that had eluded mathematicians for sixty years. And it is still only spring. Wait until summer.

    …The mental steps underlying good human chess playing and theorem proving are complex and hidden, putting a mechanical interpretation out of reach. Those who can follow the play naturally describe it instead in mentalistic language, using terms like strategy, understanding and creativity. When a machine manages to be simultaneously meaningful and surprising in the same rich way, it too compels a mentalistic interpretation. Of course, somewhere behind the scenes, there are programmers who, in principle, have a mechanical interpretation. But even for them, that interpretation loses its grip as the working program fills its memory with details too voluminous for them to grasp.

    As the rising flood reaches more populated heights, machines will begin to do well in areas a greater number can appreciate. The visceral sense of a thinking presence in machinery will become increasingly widespread. When the highest peaks are covered, there will be machines than can interact as intelligently as any human on any subject. The presence of minds in machines will then become self-evident.

    Faster than Exponential Growth in Computing Power: The number of MIPS in $1,854$10001998 of computer from 1900 to the present. Steady improvements in mechanical and electromechanical calculators before World War II had increased the speed of calculation a thousandfold over manual methods from 1900 to 1940. The pace quickened with the appearance of electronic computers during the war, and 1940 to 1980 saw a million-fold increase. The pace has been even quicker since then, a pace which would make human-like robots possible before the middle of the next century. The vertical scale is logarithmic, the major divisions represent thousandfold increases in computer performance. Exponential growth would show as a straight line, the upward curve indicates faster than exponential growth, or, equivalently, an accelerating rate of innovation. The reduced spread of the data in the 1990s is probably the result of intensified competition: underperforming machines are more rapidly squeezed out. The numerical data for this power curve are presented in the appendix.
    The big freeze: From 1960 to 1990 the cost of computers used in AI research declined, as their numbers dilution absorbed computer-efficiency gains during the period, and the power available to individual AI programs remained almost unchanged at 1 MIPS, barely insect power. AI computer cost bottomed in 1990, and since then power has doubled yearly, to several hundred MIPS by 1998. The major visible exception is computer chess (shown by a progression of knights), whose prestige lured the resources of major computer companies and the talents of programmers and machine designers. Exceptions also exist in less public competitions, like petroleum exploration and intelligence gathering, whose high return on investment gave them regular access to the largest computers.

  165. ⁠, Nouamane Laanait, Joshua Romero, Junqi Yin, M. Todd Young, Sean Treichler, Vitalii Starchenko, Albina Borisevich, Alex Sergeev, Michael Matheson (2019-09-24):

    We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors.

    These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA GPUs on the Summit Supercomputer. We demonstrate our gradient reduction techniques in the context of training a Fully Convolutional Neural Network to approximate the solution of a longstanding scientific inverse problem in materials imaging.

    The efficient distributed training on a dataset size of 0.5 PB, produces a model capable of an atomically-accurate reconstruction of materials, and in the process reaching a peak performance of 2.15(4) EFLOPS16.

  166. ⁠, Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Lin Lin, Roberto Car, Weinan E, Linfeng Zhang (2020-05-01):

    For 35 years, ab initio molecular dynamics (AIMD) has been the method of choice for modeling complex atomistic phenomena from first principles. However, most AIMD applications are limited by computational cost to systems with thousands of atoms at most.

    We report that a machine learning-based simulation protocol (Deep Potential Molecular Dynamics), while retaining ab initio accuracy, can simulate more than 1 nanosecond-long trajectory of over 100 million atoms per day, using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer. Our code can efficiently scale up to the entire Summit supercomputer, attaining 91 PFLOPS in double precision (45.5% of the peak) and 162/​​​​275 PFLOPS in mixed-single/​​​​half precision.

    The great accomplishment of this work is that it opens the door to simulating unprecedented size and time scales with ab initio accuracy. It also poses new challenges to the next-generation supercomputer for a better integration of machine learning and physical modeling.

  167. ⁠, Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost (2020-07-13):

    Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (⁠, ) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3 = 81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10 = 81%) and membrane vs. water-soluble (2-state accuracy Q2 = 91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https:/​​​​/​​​​​​​​agemagician/​​​​ProtTrans.

  168. ⁠, Ji Lin, Chuang Gan, Song Han (2019-10-01):

    Deep video recognition is more computationally expensive than image recognition, especially on large-scale datasets like Kinetics [1]. Therefore, training scalability is essential to handle a large amount of videos. In this paper, we study the factors that impact the training scalability of video networks. We recognize three bottlenecks, including data loading (data movement from disk to GPU), communication (data movement over networking), and computation FLOPs. We propose three design guidelines to improve the scalability: (1) fewer FLOPs and hardware-friendly operator to increase the computation efficiency; (2) fewer input frames to reduce the data movement and increase the data loading efficiency; (3) smaller model size to reduce the networking traffic and increase the networking efficiency. With these guidelines, we designed a new operator Temporal Shift Module (TSM) that is efficient and scalable for distributed training. TSM model can achieve 1.8× higher throughput compared to previous I3D models. We scale up the training of the TSM model to 1,536 GPUs, with a mini-batch of 12,288 video clips/​​​​98,304 images, without losing the accuracy. With such hardware-aware model design, we are able to scale up the training on Summit supercomputer and reduce the training time on Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74.0%, which is 1.6× and 2.9× faster than previous 3D video models with higher accuracy. The code and more details can be found here: http:/​​​​/​​​​


  170. ⁠, Jeff Dean (2019-11-13):

    The past decade has seen a remarkable series of advances in machine learning, and in particular deep learning approaches based on artificial neural networks, to improve our abilities to build more accurate systems across a broad range of areas, including computer vision, speech recognition, language translation, and natural language understanding tasks.

    This paper is a companion paper to a keynote talk at the 2020 International Solid-State Circuits Conference (ISSCC) discussing some of the advances in machine learning, and their implications on the kinds of computational devices we need to build, especially in the post-Moore’s Law-era. It also discusses some of the ways that machine learning may also be able to help with some aspects of the circuit design process. Finally, it provides a sketch of at least one interesting direction towards much larger-scale multi-task models that are sparsely activated and employ much more dynamic, example-based and task-based routing than the machine learning models of today.



  173. ⁠, Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin (2020-03-18):

    We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the “entailment” token as a score of the hypothesis. Our first (and only) submission to the official leaderboard yielded 0.7673 AUC on March 13, 2020, which is the best known result at this time and beats the previous state of the art by over five points.

  174. ⁠, Vid Kocijan, Thomas Lukasiewicz, Ernest Davis, Gary Marcus, Leora Morgenstern (2020-04-23):

    The Winograd Schema Challenge is both a commonsense reasoning and natural language understanding challenge, introduced as an alternative to the Turing test. A Winograd schema is a pair of sentences differing in one or two words with a highly ambiguous pronoun, resolved differently in the two sentences, that appears to require commonsense knowledge to be resolved correctly. The examples were designed to be easily solvable by humans but difficult for machines, in principle requiring a deep understanding of the content of the text and the situation it describes. This paper reviews existing Winograd Schema Challenge benchmark datasets and approaches that have been published since its introduction.

  175. Modus

  176. ⁠, Eliezer Yudkowsky (2017-10-13):

    [Meditation on the problem of coordinating reaction to x-risks, and AI risks in particular. To quote Norbert Wiener:

    Again and again I have heard the statement that learning machines cannot subject us to any new dangers, because we can turn them off when we feel like it. But can we? To turn a machine off effectively, we must be in possession of information as to whether the danger point has come. The mere fact that we have made the machine does not guarantee we shall have the proper information to do this.

    A fire alarm, even if it is not 100% accurate, coordinates human reactions: it becomes permissible to leave the room and investigate, take precautions, and for everyone to evacuate the building. This is because we all agree that fires usually come with smoke and smoke can be objectively detected. But what is the fire alarm for AI? “AI is whatever we can’t do yet”, and whenever AI accomplishes a new feat, people will simply move the goalposts and say that that task turned out to be unexpectedly easy to solve. There is no agreement on what “imminent AGIlooks like. You can ask AI researchers, “how would the world look different if we were in fact heading towards AGI in the near future, the next decade or three?” and they are unable to answer. They do not know what is or is not a ringing alarm bell, the point at which everyone should start taking the prospect very seriously. It was not chess, it was not ImageNet classification, it was not Go…

    AI so far resembles other technologies like airplanes or nuclear bombs where just years before, the physicists who would invent it, eminent physicists, and physicists in general, were highly uncertain or skeptical or outright convinced of their impossibility. This was because progress in nuclear physics looked much the same regardless of whether nuclear bombs were possible and impossible. There was large ineradicable uncertainty, which appears to have neutered any serious effort to prepare. And yet, these matters ought to be dealt with in advance. Things like nuclear bombs or AI should not just arrive with no one having done anything to prepare. Or consider pandemics. Those who tried to warn the world about coronavirus will find this essay eerily apt.]

    Okay, let’s be blunt here. I don’t think most of the discourse about AGI being far away (or that it’s near) is being generated by models of future progress in machine learning. I don’t think we’re looking at wrong models; I think we’re looking at no models.

    I was once at a conference…I got up in Q&A and said, “Okay, you’ve all told us that progress won’t be all that fast. But let’s be more concrete and specific. I’d like to know what’s the least impressive accomplishment that you are very confident cannot be done in the next two years.”

    There was a silence.

    Eventually, 2 people on the panel ventured replies, spoken in a rather more tentative tone than they’d been using to pronounce that AGI was decades out. They named “A robot puts away the dishes from a dishwasher without breaking them”, and Winograd schemas…A few months after that panel, there was unexpectedly a big breakthrough on Winograd schemas. The breakthrough didn’t crack 80%, so three cheers for wide credibility intervals with error margin, but I expect the predictor might be feeling slightly more nervous now with one year left to go…

    But that’s not the point. The point is the silence that fell after my question, and that eventually I only got 2 replies, spoken in tentative tones. When I asked for concrete feats that were impossible in the next two years, I think that that’s when the luminaries on that panel switched to trying to build a mental model of future progress in machine learning, asking themselves what they could or couldn’t predict, what they knew or didn’t know. And to their credit, most of them did know their profession well enough to realize that forecasting future boundaries around a rapidly moving field is actually really hard, that nobody knows what will appear on arXiv next month, and that they needed to put wide credibility intervals with very generous upper bounds on how much progress might take place 24 months’ worth of arXiv papers later. (Also, Demis Hassabis was present, so they all knew that if they named something insufficiently impossible, Demis would have DeepMind go and do it.)

    …When I observe that there’s no fire alarm for AGI, I’m not saying that there’s no possible equivalent of smoke appearing from under a door. What I’m saying rather is that the smoke under the door is always going to be arguable; it is not going to be a clear and undeniable and absolute sign of fire; and so there is never going to be a fire alarm producing common knowledge that action is now due and socially acceptable…There is never going to be a time before the end when you can look around nervously, and see that it is now clearly common knowledge that you can talk about AGI being imminent, and take action and exit the building in an orderly fashion, without fear of looking stupid or frightened.


  178. 13#what-progress

  179. 1940-sciam-harrington-nuclearweapons-dontworryitcanthappen.pdf: {#linkBibliography-american)-1940 .docMetadata doi=“10.2307/​​24988773”}, Jean Harrington (Scientific American) (1940-05-01; existential-risk):

    …Early last summer, in the midst of all this research, a chilly sensation began tingling up and down the spines of the experimenters. These extra neutrons that were being erupted—could they not in turn become involuntary bullets, flying from one exploding uranium nucleus into the heart of another, causing another fission which would itself cause still others? Wasn’t there a dangerous possibility that the uranium would at last become explosive? That the samples being bombarded in the laboratories at Columbia University, for example, might blow up the whole of New York City? To make matters more ominous, news of fission research from Germany, plentiful in the early part of 1939, mysteriously and abruptly stopped for some months. Had government censorship been placed on what might be a secret of military importance? The press and populace, getting wind of these possibly lethal goings-on, raised a hue and cry. Nothing daunted, however, the physicists worked on to find out whether or not they would be blown up, and the rest of us along with them. Now, a year after the original discovery, word comes from Paris that we don’t have to worry.

    …With typical French—and scientific—caution, they added that this was perhaps true only for the particular conditions of their own experiment, which was carried out on a large mass of uranium under water. But most scientists agreed that it was very likely true in general.

    …Readers made insomnious by “newspaper talk” of terrific atomic war weapons held in reserve by dictators may now get sleep.

  180. ⁠, Sarah Constantin (2016-10-20):

    [Speculative essay about a specific kind of groupthink and failure mode of institutions: a pursuit of prestige, legitimacy, and respectability detached from reality, a prizing of vagueness and inscrutability and superficial perfection and avoidance of anything that might seem absurd or daring or mockable.]

    The Egyptian god Ra was a symbol of divine kingship, all-powerful and all-seeing. He’s a good metaphor for a certain kind of psychological phenomenon that involves thought distortions around authority and legitimacy…The idea of a malign Establishment is somewhat convergent:

    • The Establishment (attributed to Henry Fairlie in 1950’s Britain, talking about an informal social network of power among prominent, well-connected people)
    • The Man (eg. Yippies, Burning Man)
    • The Combine (Ken Kesey)
    • Moloch (Allen Ginsberg)
    • The Beige Dictatorship (Charles Stross)
    • The Cathedral (Mencius Moldbug)
    • The Mandarins (Megan McArdle)

    Not all of these ideas are coterminous with Ra, or identical to each other.

    What they have in common is that the Establishment is primarily an upper-class phenomenon, that it is more about social and moral legitimacy than mere wealth or raw power, and that it is boringly evil—it produces respectable, normal, right-thinking, mild-mannered people who do things with very bad consequences.

    …Ra is something more like a psychological mindset, that causes people to actually seek corruption and confusion, and to prefer corruption for its own sake — though, of course, it doesn’t feel quite like that from the inside.

    Ra is a specific kind of glitch in intuition, which can roughly be summarized as the drive to idealize vagueness and despise clarity. I’m going to try to define it by extension, using examples from my and others’ personal experiences.

    Ra is about generic superlativity.

    You know how universal gods are praised with formulas that call them glorious, mighty, exalted, holy, righteous, and other suchlike adjectives, all of which are perfectly generic and involve no specific characteristics except wonderfulness? That’s what Ra is all about.

    The worship of Ra involves a preference for stockpiling money, accolades, awards, or other resources, beyond what you can meaningfully consume or make practical use of; a felt sense of wanting to attain that abstract radiance of “bestness”.

    Ra defends itself with vagueness, confusion, incoherence — and then anger.

    “Respectability” turns out to be incoherent quite often — ie. if you have any consistent model of the world you often have to take extreme or novel positions as a logical conclusion from your assumptions. To Ra, disrespectability is damnation, and thus consistent thought is suspect.

    Vagueness, mental fog, “underconfidence”, avoidance, evasion, blanking out, etc. are hallmarks of Ra. If cornered, a person embodying Ra will abruptly switch from blurry vagueness to anger and nihilism…Ra causes persistent brain fog or confusion, especially around economic thinking or cost-benefit analysis or quantitative estimates.

    …Ra promotes the idea that optimal politeness conveys as little information as possible. That you should actively try to hide preferences (because if you shared them, you’d inconvenience others by pressuring them to satisfy your preferences). That all compliments are empty pleasantries. There’s an interpretation of “politeness” that’s anti-cooperative, that avoids probing for opportunities for genuine mutual benefit or connection and just wants to make the mutual defection process go as smoothly as possible. Ra prefers this, because it’s less revealing, commits you less, doesn’t pin you down, allows you to keep all your options open and devote everything to the pursuit of Ra…I’ve had my writing criticized because “when you give your opinion, it sounds like you think you’re smart”.