Skip to main content

neural net directory


“Automated Crossword Solving”, Wallace et al 2022

“Automated Crossword Solving”⁠, Eric Wallace, Nicholas Tomlin, Albert Xu, Kevin Yang, Eshaan Pathak, Matthew Ginsberg, Dan Klein (2022-05-19; ):

[blog⁠; Github] We present the Berkeley Crossword Solver, a state-of-the-art approach for automatically solving crossword puzzles.

Our system works by generating answer candidates for each crossword clue using neural question answering models and then combines loopy belief propagation with local search [using ByT5 to handle puns/​humor/​word-play] to find full puzzle solutions.

Compared to existing approaches [like Dr.Fill], our system improves exact puzzle accuracy from 57% to 82% on crosswords from The New York Times and obtains 99.9% letter accuracy on themeless puzzles. Our system also won first place at the top human crossword tournament⁠, which marks the first time that a computer program has surpassed human performance at this event.

To facilitate research on question answering and crossword solving, we analyze our system’s remaining errors and release a dataset of over 6 million question-answer pairs.

Figure 2: An overview of the Berkeley Crossword Solver. We use a neural question answering model to generate answer probabilities for each question, and then refine the probabilities with loopy belief propagation. Finally, we fill the grid with greedy search and iteratively improve uncertain areas of the puzzle using local search.

“Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper Than In-Context Learning”, Liu et al 2022

“Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”⁠, Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, Colin Raffel (2022-05-11):

Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (eg. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task.

In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new parameter-efficient fine-tuning method called (IA)3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model called T-Few that can be applied to new tasks without task-specific tuning or modifications.

We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark⁠, attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute.

All of the code used in our experiments is publicly available.

“NaturalSpeech: End-to-End Text to Speech Synthesis With Human-Level Quality”, Tan et al 2022

“NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality”⁠, Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi et al (2022-05-09):

[samples] Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/​judge human-level quality and how to achieve it…Using this judge method, we found several previous TTS systems have not achieved it (see Table 1).

In this paper, we answer these questions by first defining human-level quality based on statistical-significance of measurement and describing the guidelines to judge it, and then proposing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation, with several key designs to enhance the capacity of prior from text and reduce the complexity of posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/​posterior modeling, and memory mechanism in VAE.

Experiment evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) to human recordings on sentence level, with Wilcoxon signed rank test at p-level p ≫0.05, which demonstrates no statistically-significant difference from human recordings for the first time on this dataset.

“Generating Scientific Claims for Zero-Shot Scientific Fact Checking”, Wright et al 2022

“Generating Scientific Claims for Zero-Shot Scientific Fact Checking”⁠, Dustin Wright, David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Isabelle Augenstein, Lucy Lu Wang (2022-03-24; ; backlinks; similar):

Automated scientific fact checking is difficult due to the complexity of scientific language and a lack of significant amounts of training data, as annotation requires domain expertise. To address this challenge, we propose scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences, and demonstrate its usefulness in zero-shot fact checking for biomedical claims.

We propose CLAIMGEN-BART, a new supervised method for generating claims supported by the literature, as well as KBIN, a novel method for generating claim negations. Additionally, we adapt an existing unsupervised entity-centric method of claim generation to biomedical claims, which we call CLAIMGEN-ENTITY.

Experiments on zero-shot fact checking demonstrate that both CLAIMGEN-ENTITY and CLAIMGEN-BART, coupled with KBIN, achieve up to 90% performance of fully supervised models trained on manually annotated claims and evidence. A rigorous evaluation study demonstrates significant improvement in generated claim and negation quality over existing baselines

“Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time”, Wortsman et al 2022

“Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time”⁠, Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos et al (2022-03-10; backlinks):

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs—we call the results “model soups.” When fine-tuning large pre-trained models such as CLIP⁠, ALIGN⁠, and a ViT-G pre-trained on JFT⁠, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. As a highlight, the resulting ViT-G model attains 90.94% top-1 accuracy on ImageNet⁠, a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically.

“It Looks Like You’re Trying To Take Over The World”, Branwen 2022

Clippy: “It Looks Like You’re Trying To Take Over The World”⁠, Gwern Branwen (2022-03-06; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Fictional short story about Clippy & AI hard takeoff scenarios grounded in contemporary ML scaling, self-supervised learning⁠, reinforcement learning, and meta-learning research literature.

It might help to imagine a hard takeoff scenario using solely known sorts of NN & scaling effects… Below is a story which may help stretch your imagination and defamiliarize the 2022 state of machine learning.

To read the annotated alternate version of this story, scroll to the end or manually disable ‘reader-mode’ () in the theme toggle in the upper-right corner.

“Transformer Quality in Linear Time”, Hua et al 2022

“Transformer Quality in Linear Time”⁠, Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le (2022-02-21; backlinks; similar):

We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences.

First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality.

The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9× on Wiki-40B and 12.1× on PG-19 for auto-regressive language modeling, and 4.8× on C4 for masked language modeling.

“Approximating CNNs With Bag-of-local-Features Models Works Surprisingly Well on ImageNet”, Wiel et al 2022

“Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet”⁠, Wiel, Brendel, Matthias Bethge (2022-02-10; backlinks; similar):

Aggregating class evidence from many small image patches suffices to solve ImageNet, yields more interpretable models and can explain aspects of the decision-making of popular DNNs.

Deep Neural Networks (DNNs) excel on many complex perceptual tasks but it has proven notoriously difficult to understand how they reach their decisions. We here introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 32×32px features and Alexnet performance for 16×16px features). The constraint on local features makes it straight-forward to analyse how exactly each part of the image influences the classification. Furthermore, the BagNets behave similar to state-of-the art deep neural networks such as VGG-16⁠, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts. This suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years is mostly achieved by better fine-tuning rather than by qualitatively different decision strategies.

[Keywords: interpretability, representation learning, bag of features, deep learning, object recognition]

“Random Feature Attention”, Peng et al 2022

“Random Feature Attention”⁠, Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, Lingpeng Kong (2022-02-10; backlinks; similar):

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA’s efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.

[Keywords: Attention, transformers, machine translation, language modeling]

“Achieving Human Parity on Visual Question Answering”, Yan et al 2021

“Achieving Human Parity on Visual Question Answering”⁠, Ming Yan, Haiyang Xu, Chenliang Li, Junfeng Tian, Bin Bi, Wei Wang, Weihua Chen, Xianzhe Xu, Fan Wang et al (2021-11-17; similar):

The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image. It has been a popular research topic with an increasing number of real-world applications in the last decade.

This paper describes our recent research of AliceMind-MMU (ALIbaba’s Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy—MultiMedia Understanding) that obtains similar or even slightly better results than human being does on VQA. This is achieved by systematically improving the VQA pipeline including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) A novel knowledge mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with corresponding expertise needed plays an important role in boosting the performance of our VQA architecture up to the human level.

An extensive set of experiments and analysis are conducted to demonstrate the effectiveness of the new research work.

“A Dot Product Attention Free Transformer”, Zhai et al 2021

“A Dot Product Attention Free Transformer”⁠, Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang ZHANG, Joshua M. Susskind et al (2021-10-05; backlinks; similar):

We introduce Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers (transformer) that eliminates the query-key dot product in self attention. The core idea is to construct a decomposable attention map for each dimension of the query, key and value. This compositionality enables an implementation where the attention tensor does not to be computed or stored explicitly. A DAFT layer has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large input and model sizes. We also introduce DAFT-conv, a model variant that takes advantage of locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification, as well as CIFAR10 and Enwik8, two autoregressive modeling tasks. We show that DAFT demonstrates competitive performance on all the benchmarks, while providing excellent efficiency at the same time.

“RAFT: A Real-World Few-Shot Text Classification Benchmark”, Alex et al 2021

“RAFT: A Real-World Few-Shot Text Classification Benchmark”⁠, Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst et al (2021-09-28; backlinks):

Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don’t directly answer this question.

The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment.

Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11.

The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at⁠.

“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021

“Perceiver IO: A General Architecture for Structured Inputs & Outputs”⁠, Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula et al (2021-07-30; ; backlinks; similar):

[code⁠; Hugging Face] The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics.

Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes.

The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II⁠, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization, and achieves state-of-the-art performance on Sintel optical flow estimation.

Figure 2: The Perceiver IO architecture. Perceiver IO maps arbitrary input arrays to arbitrary output arrays in a domain agnostic process. The bulk of the computation happens in a latent space whose size is typically smaller than the inputs and outputs, which makes the process computationally tractable even for very large inputs & outputs.

…The Perceiver IO architecture relies on the same primitives as Transformers: so why aren’t Transformers all you need? The answer is that Transformers scale very poorly in both compute and memory.82 A Transformer deploys attention modules homogeneously throughout its architecture, using its full input to generate queries and keys at every layer. As discussed in,35 this means each layer scales quadratically in compute and memory, which currently makes it impossible to apply Transformers on high-dimensional data like images without some form of preprocessing. Even on domains like language where Transformers shine, preprocessing (eg. tokenization) is often needed to scale beyond short input sequences. On the other hand, Perceiver IO uses attention non-homogeneously, first using it to map inputs to a latent space, then using it to process in that latent space, and finally using it to map to an output space. The resulting architecture has no quadratic dependence on the input or output size: encoder and decoder attention modules depend linearly on the input and output size (respectively), while latent attention is independent of both input and output sizes. Because of this structure, and the corresponding reduction in compute and memory requirements, Perceivers scale to much larger inputs and outputs. While Transformers are typically used in settings with inputs and outputs of at most a few thousand dimensions [9, 63], we show good results on domains with hundreds of thousands of input and output dimensions.

…Because of this structure, this architecture can be applied to inputs of any shape or spatial layout and even to inputs or outputs which don’t share the same spatial structure (eg. sound and video). However, in contrast to the latent spaces used elsewhere in vision (eg.)67 the latent does not explicitly share the structure (spatial or otherwise) of the inputs. To decode this information, we query for it using cross-attention.

4.4 StarCraft II: To further demonstrate Perceiver IO’s capabilities on discrete modalities and to serve as a drop-in replacement for Transformers, we use Perceiver IO to replace the Transformer in AlphaStar, the state-of-the-art system for the complex game of StarCraft II. At its core, AlphaStar [89] represents the units in the game as a discrete, unordered set of symbols (the “units”). These units are represented by a vector of properties such as unit type, position, health, etc. At each timestep, the architecture encodes up to 512 units “tokens” with a vanilla Transformer. This representation is used both as a summary of the state (after pooling) and as a rich representation of the 512 units. This representation is used by a pointer network [90], to assign a probability to each possible unit selection, effectively parameterizing the agent’s unit selection policy (see 89 and Appendix Section G for more details). We replaced the Transformer that inputs and outputs 512 units with Perceiver IO with a latent size of 32. Without tuning any additional parameters, we observed that the resulting agent reached the same level of performance as the original AlphaStar agent, reaching an 87% win-rate versus the Elite bot after behavioral cloning[61] on human data.

“Choose-Your-Own-Adventure AI Dungeon Games”, Branwen 2021

CYOA: “Choose-Your-Own-Adventure AI Dungeon Games”⁠, Gwern Branwen (2021-06-06; ⁠, ; backlinks; similar):

Neural networks like GPT-2 power text adventure games where you can do anything; but they are too expensive. I propose that if we turn them into Choose Your Own Adventure hypertext games, they become feasible and enable new gameplay.

A useful variation on AI Dungeon-style (AID) text games would be to turn them into shared public game trees of pre-generated options which the player selects from, Choose-Your-Own-Adventure-book style.

This trades teraflops for kilobytes and so can dramatically reduce costs as players spend most of their time reading cached output (rarely needing nor wanting to generate brandnew output requiring a NN run), can increase quality as players collectively uprank actions/​outcomes which are highest-quality, and caters to newbies who don’t understand the power of NN-backed text games and flail around.

“PAWS: Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments With Support Samples”, Assran et al 2021

“PAWS: Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples”⁠, Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Arm, Joulin, Nicolas Ballas, Michael Rabbat et al (2021-04-28; backlinks; similar):

This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS). The method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly sampled labeled images. The distance between the view representations and labeled representations is used to provide a weighting over class labels, which we interpret as a soft pseudo-label. By non-parametrically incorporating labeled samples in this way, PAWS extends the distance-metric loss used in self-supervised methods such as BYOL and SwAV to the semi-supervised setting. Despite the simplicity of the approach, PAWS outperforms other semi-supervised methods across architectures, setting a new state-of-the-art for a ResNet-50 on ImageNet trained with either 10% or 1% of the labels, reaching 75.5% and 66.5% top-1 respectively. PAWS requires 4× to 12× less training than the previous best methods.

“Rip Van Winkle’s Razor, a Simple New Estimate for Adaptive Data Analysis”, Arora & Zhang 2021

“Rip van Winkle’s Razor, a Simple New Estimate for Adaptive Data Analysis”⁠, Sanjeev Arora, Yi Zhang (2021-04-07; backlinks; similar):

Meta-overfitting Error (MOE) of a model is the difference between its average error on the test data and its expected error on the full distribution. (It is closely related to false discovery rate in statistics.)

This blog post concerns our new paper⁠, which gives meaningful upper bounds on this sort of trouble for popular deep net architectures, whereas prior ideas from adaptive data analysis gave no nontrivial estimates. We call our estimate Rip van Winkle’s Razor which combines references to Occam’s Razor and the mythical person who fell asleep for 20 years.

MOE bounds and description length: The starting point of our work is the following classical concentration bounds:

Folklore Theorem: With high probability over the choice of a test set of size N, the MOE of all models with description length at most k bits is 𝒪(√kN).

At first sight this doesn’t seem to help us because one cannot imagine modern deep nets having a short description. The most obvious description involves reporting values of the net parameters, which requires millions or even hundreds of millions of bits, resulting in a vacuous upper bound on MOE. Another obvious description would be the computer program used to produce the model using the (publicly available) training and validation sets. However, these programs usually rely on imported libraries through layers of encapsulation and so the effective program size is pretty large as well.

Rip van Winkle’s Razor: Our new upper bound involves a more careful definition of Description Length: it is the smallest description that allows a referee to reproduce a model of similar performance using the (universally available) training and validation datasets.

Estimating MOE of ResNet-152: As an illustration, here we provide a suitable description allowing Rip van Winkle to reproduce a mainstream ImageNet model, ResNet-152, which achieves 4.49% top-5 test error.

The description consists of 3 types of expressions: English phrases, Math equations, and directed graphs. In the paper, we describe in detail how to encode each of them into binary strings and count their lengths. The allowed vocabulary includes primitive concepts that were known before 2012, such as CONV, MaxPool, ReLU, SGD etc., as well as a graph-theoretic notation/​shorthand for describing net architecture. The newly introduced concepts including Batch-Norm, Layer, Block are defined precisely using Math, English, and other primitive concepts.

Description for reproducing ResNet-152

According to our estimate, the length of the above description is 1032 bits, which translates into an upper bound on meta-overfitting error of merely 5%! This suggests the real top-5 error of the model on full distribution is at most 9.49%. In the paper we also provide a 980-bit long description for reproducing DenseNet-264, which leads to 5.06% upper bound on its meta-overfitting error.

…But the important point is that unlike existing bounds in Adaptive Data Analysis, there is no dependence on t, the number of models that have been tested before, and the bound is non-vacuous. Our estimates indicate that the issue of meta-overfitting on ImageNet for these mainstream models is mild. The reason is that despite the vast number of parameters and hyper-parameters in today’s deep nets, the information content of these models is not high given knowledge circa 2012.

“Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations”, Ryali et al 2021

“Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations”⁠, Chaitanya K. Ryali, David J. Schwab, Ari S. Morcos (2021-03-23; similar):

Recent progress in self-supervised learning has demonstrated promising results in multiple visual tasks.

An important ingredient in high-performing self-supervised methods is the use of data augmentation by training models to place different augmented views of the same image nearby in embedding space. However, commonly used augmentation pipelines treat images holistically, ignoring the semantic relevance of parts of an image-eg. a subject vs. a background-which can lead to the learning of spurious correlations.

Our work addresses this problem by investigating a class of simple, yet highly effective “background augmentations”, which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds.

Through a systematic investigation, we show that background augmentations lead to substantial improvements in performance across a spectrum of state-of-the-art self-supervised methods (MoCo-v2, BYOL⁠, SwAV) on a variety of tasks, eg. 1–2% gains on ImageNet, enabling performance on par with the supervised baseline. Further, we find the improvement in limited-labels settings is even larger (up to 4.2%). Background augmentations also improve robustness to a number of distribution shifts, including natural adversarial examples, ImageNet-9, adversarial attacks, ImageNet-Renditions. We also make progress in completely unsupervised saliency detection, in the process of generating saliency masks used for background augmentations.

“DirectPred: Understanding Self-supervised Learning Dynamics without Contrastive Pairs”, Tian et al 2021

“DirectPred: Understanding self-supervised Learning Dynamics without Contrastive Pairs”⁠, Yuandong Tian, Xinlei Chen, Surya Ganguli (2021-02-12; backlinks; similar):

While contrastive approaches of self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing views from different data points (negative pairs), recent non-contrastive SSL (eg. BYOL and SimSiam) show remarkable performance without negative pairs, with an extra learnable predictor and a stop-gradient operation.

A fundamental question arises: why do these methods not collapse into trivial representations? We answer this question via a simple theoretical study and propose a novel approach, DirectPred, that directly sets the linear predictor based on the statistics of its inputs, without gradient training. On ImageNet, it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm and outperforms a linear predictor by 2.5% in 300-epoch training (and 5% in 60-epoch).

DirectPred is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay all come into play. Our simple theory recapitulates the results of real-world ablation studies in both STL-10 and ImageNet.

Code is released here:⁠.

“Facial Recognition Technology Can Expose Political Orientation from Naturalistic Facial Images”, Kosinski 2021

“Facial recognition technology can expose political orientation from naturalistic facial images”⁠, Michal Kosinski (2021-01-11; ⁠, ; backlinks; similar):

Ubiquitous facial recognition technology can expose individuals’ political orientation, as faces of liberals and conservatives consistently differ. A facial recognition algorithm was applied to naturalistic images of 1,085,795 individuals to predict their political orientation by comparing their similarity to faces of liberal and conservative others. Political orientation was correctly classified in 72% of liberal-conservative face pairs, remarkably better than chance (50%), human accuracy (55%), or one afforded by a 100-item personality questionnaire (66%). Accuracy was similar across countries (the U.S., Canada, and the UK), environments (Facebook and dating websites), and when comparing faces across samples. Accuracy remained high (69%) even when controlling for age, gender, and ethnicity. Given the widespread use of facial recognition, our findings have critical implications for the protection of privacy and civil liberties.

[Stats evaluation by Andrew Gelman et al, plus copy of behind-the-scenes letter lobbying to censor such research in the future.]

“Exploring Simple Siamese Representation Learning”, Chen & He 2020

“Exploring Simple Siamese Representation Learning”⁠, Xinlei Chen, Kaiming He (2020-11-20; backlinks):

Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (1) negative sample pairs, (2) large batches, (3) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our “SimSiam” method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.

“Open-Domain Question Answering Goes Conversational via Question Rewriting”, Anantha et al 2020

“Open-Domain Question Answering Goes Conversational via Question Rewriting”⁠, Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, Srinivas Chappidi (2020-10-10; backlinks):

We introduce a new dataset for Question Rewriting in Conversational Context (QReCC), which contains 14K conversations with 80K question-answer pairs. The task in QReCC is to find answers to conversational questions within a collection of 10M web pages (split into 54M passages). Answers to questions in the same conversation may be distributed across several web pages. QReCC provides annotations that allow us to train and evaluate individual subtasks of question rewriting, passage retrieval and reading comprehension required for the end-to-end conversational question answering (QA) task. We report the effectiveness of a strong baseline approach that combines the state-of-the-art model for question rewriting, and competitive models for open-domain QA. Our results set the first baseline for the QReCC dataset with F1 of 19.10, compared to the human upper bound of 75.45, indicating the difficulty of the setup and a large room for improvement.

“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Branwen 2020

Attention: “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”⁠, Gwern Branwen (2020-07-25; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Discussion of removing a major architectural limitation in Transformer neural networks: the length of the input it can look at. Beyond a few thousand inputs, the resource requirements explode quadratically, rendering it infeasible to encode raw text at the character level, much less use entire books, images, or many other kinds of data which could be useful. Even for text, this inability also forces limitations like the use of BPE text encoding (responsible for sabotaging GPT-3’s rhyming, among other things), forgetfulness, limits to prompt programming, and inability to write coherent long texts.

Possibilities for fixing this generally fall into

  1. adding state, through recurrence (a memory) or creating a compressed history/​state as an explicit summary
  2. tinkering with matrix algebra to remove the quadratic explosion while still keeping more or less the same self-attention mechanism
  3. approximating self-attention: using attention on only a small subset of tokens at any time (dodging the quadratic limit), or using a mix of local and global attention (local attentions to do most of the work, and global attention on top of the local attentions, each one avoiding the quadratic by considering only a few inputs at a time)
  4. miscellaneous tricks: removing parts, using only randomized untrainable components (with no need to compute gradients over) etc

“GPT-3 Nonfiction”, Branwen 2020

GPT-3-nonfiction: “GPT-3 Nonfiction”⁠, Gwern Branwen (2020-06-19; ⁠, ⁠, ⁠, ; backlinks; similar):

Nonfiction writing by OpenAI’s GPT-3 model, testing logic, commonsense reasoning, anagrams, PDF/​OCR cleaning, creative nonfiction, etc

GPT-3, announced in May 2020 by OpenAI⁠, was a breakthrough in neural net modeling of natural language and natural-language-related tasks; the June 2020 API opened up GPT-3 use to outsiders, including myself. I extensively documented my experiences testing GPT-3 and learning how to use it primarily for creative fiction such as poetry; but I also tested some “nonfiction” uses (often in response to hyperbolic claims about what GPT-3 could never do). This page documents tasks like anagrams, queries based on premises described as ‘databases’, probing the problems with GPT-3’s commonsense and other tasks (often related to poor prompting, showing the importance of prompt programming⁠, or the pernicious influence of BPEs)

“GPT-3 Creative Fiction”, Branwen 2020

GPT-3: “GPT-3 Creative Fiction”⁠, Gwern Branwen (2020-06-19; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Creative writing by OpenAI’s GPT-3 model, demonstrating poetry, dialogue, puns, literary parodies, and storytelling. Plus advice on effective GPT-3 prompt programming & avoiding common errors.

I continue my AI poetry generation experiments with OpenAI’s 2020 GPT-3, which is 116× larger, and much more powerful, than the 2019 GPT-2⁠. GPT-3, however, is not merely a quantitative tweak yielding “GPT-2 but better”—it is qualitatively different, exhibiting eerie runtime learning capabilities allowing even the raw model, with zero finetuning, to “meta-learn” many textual tasks purely by example or instruction. One does not train or program GPT-3 in a normal way, but one engages in dialogue and writes prompts to teach GPT-3 what one wants.

Experimenting through the OpenAI Beta API in June 2020, I find that GPT-3 does not just match my finetuned GPT-2-1.5b-poetry for poem-writing quality, but exceeds it, while being versatile in handling poetry⁠, Tom Swifty puns⁠, science fiction, dialogue like Turing’s Turing-test dialogue⁠, literary style parodies… As the pièce de résistance, I recreate Stanislaw Lem’s Cyberiad’s “Trurl’s Electronic Bard” poetry using GPT-3. (Along the way, I document instances of how the BPE text encoding unnecessarily damages GPT-3’s performance on a variety of tasks, how to best elicit the highest-quality responses, common errors people make in using GPT-3, and test out GPT-3’s improvements in NN weak points like logic or commonsense knowledge.)

GPT-3’s samples are not just close to human level: they are creative, witty, deep, meta, and often beautiful. They demonstrate an ability to handle abstractions, like style parodies, I have not seen in GPT-2 at all. Chatting with GPT-3 feels uncannily like chatting with a human. I was impressed by the results reported in the GPT-3 paper, and after spending a week trying it out, I remain impressed.

This page records GPT-3 samples I generated in my explorations, and thoughts on how to use GPT-3 and its remaining weaknesses⁠. I hope you enjoy them even a tenth as much as I enjoyed testing GPT-3 and watching the completions scroll across my screen.

“Bootstrap Your Own Latent (BYOL): A New Approach to Self-supervised Learning”, Grill et al 2020

“Bootstrap your own latent (BYOL): A new approach to self-supervised Learning”⁠, Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya et al (2020-06-13; backlinks; similar):

We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network.

While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.

Our implementation and pretrained models are given on GitHub⁠.

“The Scaling Hypothesis”, Branwen 2020

Scaling-hypothesis: “The Scaling Hypothesis”⁠, Gwern Branwen (2020-05-28; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

On GPT-3: meta-learning, scaling, implications, and deep theory. The scaling hypothesis: neural nets absorb data & compute, generalizing and becoming more Bayesian as problems get harder, manifesting new abilities even at trivial-by-global-standards-scale. The deep learning revolution has begun as foretold.

GPT-3, announced by OpenAI in May 2020, is the largest neural network ever trained, by over an order of magnitude. Trained on Internet text data, it is the successor to GPT-2⁠, which had surprised everyone by its natural language understanding & generation ability. To the surprise of most (including myself), this vast increase in size did not run into diminishing or negative returns, as many expected, but the benefits of scale continued to happen as forecasted by OpenAI. These benefits were not merely learning more facts & text than GPT-2, but qualitatively distinct & even more surprising in showing meta-learning: while GPT-2 learned how to do common natural language tasks like text summarization, GPT-3 instead learned how to follow directions and learn new tasks from a few examples. (As a result, GPT-3 outputs & interaction are more fascinating & human-like than GPT-2.)

While the immediate applications of GPT-3, like my poetry or humor writings, are nice, the short-term implications of GPT-3 are much more important.

First, while GPT-3 is expensive by conventional DL standards, it is cheap by scientific/​commercial/​military/​government budget standards, and the results indicate that models could be made much larger. Second, models can also be made much more powerful, as GPT is an old approach known to be flawed in both minor & major ways, and far from an ‘ideal’ Transformer⁠. Third, GPT-3’s capabilities come from learning on raw (unsupervised) data; that has long been one of the weakest areas of DL, holding back progress in other areas like reinforcement learning or robotics. Models like GPT-3 suggest that large unsupervised models will be vital components of future DL systems, as they can be ‘plugged into’ systems to immediately provide understanding of the world, humans, natural language, and reasoning.

The meta-learning has a longer-term implication: it is a demonstration of the blessings of scale⁠, where problems with simple neural networks vanish, and they become more powerful, more generalizable, more human-like when simply made very large & trained on very large datasets with very large compute—even though those properties are believed to require complicated architectures & fancy algorithms (and this perceived need drives much research). Unsupervised models benefit from this, as training on large corpuses like Internet-scale text present a myriad of difficult problems to solve; this is enough to drive meta-learning despite GPT not being designed for meta-learning in any way. (This family of phenomena is perhaps driven by neural networks functioning as ensembles of many sub-networks with them all averaging out to an Occam’s razor, which for small data & models, learn superficial or memorized parts of the data, but can be forced into true learning by making the problems hard & rich enough; as meta-learners learn amortized Bayesian inference⁠, they build in informative priors when trained over many tasks, and become dramatically more sample-efficient and better at generalization.)

The blessings of scale in turn support a radical theory: an old AI paradigm held by a few pioneers in connectionism (early artificial neural network research) and by more recent deep learning researchers, the scaling hypothesis⁠. The scaling hypothesis regards the blessings of scale as the secret of AGI: intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale. As increasing computational resources permit running such algorithms at the necessary scale, the neural networks will get ever more intelligent.

When? Estimates of Moore’s law-like progress curves decades ago by pioneers like Hans Moravec indicated that it would take until the 2010s for the sufficiently-cheap compute for tiny insect-level prototype systems to be available, and the 2020s for the first sub-human systems to become feasible, and these forecasts are holding up. (Despite this vindication, the scaling hypothesis is so unpopular an idea, and difficult to prove in advance rather than as a fait accompli, that while the GPT-3 results finally drew some public notice after OpenAI enabled limited public access & people could experiment with it live, it is unlikely that many entities will modify their research philosophies, much less kick off an ‘arms race’.)

More concerningly, GPT-3’s scaling curves, unpredicted meta-learning, and success on various anti-AI challenges suggests that in terms of futurology, AI researchers’ forecasts are an emperor sans garments: they have no coherent model of how AI progress happens or why GPT-3 was possible or what specific achievements should cause alarm, where intelligence comes from, and do not learn from any falsified predictions. Their primary concerns appear to be supporting the status quo, placating public concern, and remaining respectable. As such, their comments on AI risk are meaningless: they would make the same public statements if the scaling hypothesis were true or not.

Depending on what investments are made into scaling DL, and how fast compute grows, the 2020s should be quite interesting—sigmoid or singularity?

For more ML scaling research, follow the  /  ​ r /  ​ MLScaling subreddit. For a fiction treatment as SF short story, see “It Looks Like You’re Trying To Take Over The World”⁠.

“Open-Retrieval Conversational Question Answering”, Qu et al 2020

“Open-Retrieval Conversational Question Answering”⁠, Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, Mohit Iyyer (2020-05-22; backlinks):

Conversational search is one of the ultimate goals of information retrieval. Recent research approaches conversational search by simplified settings of response ranking and conversational question answering, where an answer is either selected from a given candidate set or extracted from a given passage. These simplifications neglect the fundamental role of retrieval in conversational search.

To address this limitation, we introduce an open-retrieval conversational question answering (ORConvQA) setting, where we learn to retrieve evidence from a large collection before extracting answers, as a further step towards building functional conversational search systems. We create a dataset, OR-QuAC, to facilitate research on ORConvQA. We build an end-to-end system for ORConvQA, featuring a retriever, a reranker, and a reader that are all based on Transformers. Our extensive experiments on OR-QuAC demonstrate that a learnable retriever is crucial for ORConvQA. We further show that our system can make a substantial improvement when we enable history modeling in all system components. Moreover, we show that the reranker component contributes to the model performance by providing a regularization effect. Finally, further in-depth analyses are performed to provide new insights into ORConvQA.

“Anime Crop Datasets: Faces, Figures, & Hands”, Branwen et al 2020

Crops: “Anime Crop Datasets: Faces, Figures, & Hands”⁠, Gwern Branwen, Arfafax, Shawn Presser, Anonymous, Danbooru Community (2020-05-10; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Description of 3 anime datasets for machine learning based on Danbooru: cropped anime faces, whole-single-character crops, and hand crops (with hand detection model).

Documentation of 3 anime datasets for machine learning based on Danbooru: 300k cropped anime faces (primarily used for StyleGAN⁠/​This Waifu Does Not Exist), 855k whole-single-character figure crops (extracted from Danbooru using AniSeg), and 58k hand crops (based on a dataset of 14k hand-annotated bounding boxes used to train a YOLOv3 hand detection model).

These datasets can be used for machine learning directly, or included as data augmentation: faces, figures, and hands are some of the most noticeable features of anime images, and by cropping images down to just those 3 features, they can enhance modeling of those by eliminating distracting context, zooming in, and increasing the weight during training.

“TREC CAsT 2019: The Conversational Assistance Track Overview”, Dalton et al 2020

“TREC CAsT 2019: The Conversational Assistance Track Overview”⁠, Jeffrey Dalton, Chenyan Xiong, Jamie Callan (2020-03-30):

The Conversational Assistance Track (CAsT) is a new track for TREC 2019 to facilitate Conversational Information Seeking (CIS) research and to create a large-scale reusable test collection for conversational search systems. The document corpus is 38,426,252 passages from the TREC Complex Answer Retrieval (CAR) and Microsoft MAchine Reading COmprehension (MARCO) datasets. Eighty information seeking dialogues (30 train, 50 test) are an average of 9 to 10 questions long. Relevance assessments are provided for 30 training topics and 20 test topics. This year 21 groups submitted a total of 65 runs using varying methods for conversational query understanding and ranking. Methods include traditional retrieval based methods, feature based learning-to-rank, neural models, and knowledge enhanced methods. A common theme through the runs is the use of BERT-based neural reranking methods. Leading methods also employed document expansion, conversational query expansion, and generative language models for conversational query rewriting (GPT-2). The results show a gap between automatic systems and those using the manually resolved utterances, with a 35% relative improvement of manual rewrites over the best automatic system.

“The Secret History of Facial Recognition: Sixty Years Ago, a Sharecropper’s Son Invented a Technology to Identify Faces. Then the Record of His Role All but Vanished. Who Was Woody Bledsoe, and Who Was He Working For?”, Raviv 2020

“The Secret History of Facial Recognition: Sixty years ago, a sharecropper’s son invented a technology to identify faces. Then the record of his role all but vanished. Who was Woody Bledsoe, and who was he working for?”⁠, Shaun Raviv (2020-01-21; ⁠, ; backlinks; similar):

Over the following year, Woody came to believe that the most promising path to automated facial recognition was one that reduced a face to a set of relationships between its major landmarks: eyes, ears, nose, eyebrows, lips. The system that he imagined was similar to one that Alphonse Bertillon, the French criminologist who invented the modern mug shot, had pioneered in 1879. Bertillon described people on the basis of 11 physical measurements, including the length of the left foot and the length from the elbow to the end of the middle finger. The idea was that, if you took enough measurements, every person was unique. Although the system was labor-intensive, it worked: In 1897, years before fingerprinting became widespread, French gendarmes used it to identify the serial killer Joseph Vacher. Throughout 1965, Panoramic attempted to create a fully automated Bertillon system for the face. The team tried to devise a program that could locate noses, lips, and the like by parsing patterns of lightness and darkness in a photograph, but the effort was mostly a flop.

…Even with this larger sample size, though, Woody’s team struggled to overcome all the usual obstacles. The computer still had trouble with smiles, for instance, which “distort the face and drastically change inter-facial measurements.” Aging remained a problem too, as Woody’s own face proved. When asked to cross-match a photo of Woody from 1945 with one from 1965, the computer was flummoxed. It saw little resemblance between the younger man, with his toothy smile and dark widow’s peak, and the older one, with his grim expression and thinning hair. It was as if the decades had created a different person.

…In 1967, more than a year after his move to Austin, Woody took on one last assignment that involved recognizing patterns in the human face. The purpose of the experiment was to help law enforcement agencies quickly sift through databases of mug shots and portraits, looking for matches…Woody’s main collaborator on the project was Peter Hart, a research engineer in the Applied Physics Laboratory at the Stanford Research Institute. (Now known as SRI International, the institute split from Stanford University in 1970 because its heavy reliance on military funding had become so controversial on campus.) Woody and Hart began with a database of around 800 images—two newsprint-quality photos each of about “400 adult male caucasians”, varying in age and head rotation. (I did not see images of women or people of color, or references to them, in any of Woody’s facial-recognition studies.) Using the RAND tablet, they recorded 46 coordinates per photo, including five on each ear, seven on the nose, and four on each eyebrow. Building on Woody’s earlier experience at normalizing variations in images, they used a mathematical equation to rotate each head into a forward-looking position. Then, to account for differences in scale, they enlarged or reduced each image to a standard size, with the distance between the pupils as their anchor metric. The computer’s task was to memorize one version of each face and use it to identify the other. Woody and Hart offered the machine one of two shortcuts. With the first, known as group matching, the computer would divide the face into features—left eyebrow, right ear, and so on—and compare the relative distances between them. The second approach relied on Bayesian decision theory⁠; it used 22 measurements to make an educated guess about the whole.

In the end, the two programs handled the task about equally well. More important, they blew their human competitors out of the water. When Woody and Hart asked three people to cross-match subsets of 100 faces, even the fastest one took six hours to finish. The CDC 3800 computer completed a similar task in about three minutes, reaching a hundredfold reduction in time. The humans were better at coping with head rotation and poor photographic quality, Woody and Hart acknowledged, but the computer was “vastly superior” at tolerating the differences caused by aging. Overall, they concluded, the machine “dominates” or “very nearly dominates” the humans.

This was the greatest success Woody ever had with his facial-recognition research. It was also the last paper he would write on the subject. The paper was never made public—for “government reasons”, Hart says—which both men lamented. In 1970, two years after the collaboration with Hart ended, a roboticist named Michael Kassler alerted Woody to a facial-recognition study that Leon Harmon at Bell Labs was planning. “I’m irked that this second rate study will now be published and appear to be the best man-machine system available”, Woody replied. “It sounds to me like Leon, if he works hard, will be almost 10 years behind us by 1975.” He must have been frustrated when Harmon’s research made the cover of Scientific American a few years later, while his own, more advanced work was essentially kept in a vault.

“GPT-2 Preference Learning for Music Generation”, Branwen 2019

GPT-2-preference-learning: “GPT-2 Preference Learning for Music Generation”⁠, Gwern Branwen (2019-12-16; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.

Standard language generation neural network models, like GPT-2⁠, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic generation word-by-word, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.

A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO⁠.

OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation⁠, I experimented with GPT-2 preference learning for unconditional music and poetry generation.

I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n ≈ 7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n ≈ 200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 46%).

This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the former, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.

Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output⁠.

“GPT-2 Folk Music”, Branwen & Presser 2019

GPT-2-music: “GPT-2 Folk Music”⁠, Gwern Branwen, Shawn Presser (2019-11-01; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Generating Irish/​folk/​classical music in ABC format using GPT-2-117M, with good results.

In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work in 2016 which used a char-RNN trained on a ‘The Session’ dataset. A GPT-2 hypothetically can improve on an RNN by better global coherence & copying of patterns, without problems with the hidden-state bottleneck.

I encountered problems with the standard GPT-2 model’s encoding of text which damaged results, but after fixing that⁠, I successfully trained it on n = 205,304 ABC music pieces taken from The Session & The resulting music samples are in my opinion quite pleasant. (A similar model was later retrained by Geerlings & Meroño-Peñuela 2020⁠.)

The ABC folk model & dataset are available for download⁠, and I provide for listening selected music samples as well as medleys of random samples from throughout training.

We followed the ABC folk model with an ABC-MIDI model: a dataset of 453k ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs⁠. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples but when sampling succeeds, it can generate even better musical samples⁠.

“Best Practices for the Human Evaluation of Automatically Generated Text”, Lee et al 2019

“Best practices for the human evaluation of automatically generated text”⁠, Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, Emiel Krahmer (2019-10; backlinks):

Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated. While there is some agreement regarding automatic metrics, there is a high degree of variation in the way that human evaluation is carried out.

This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature.

With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.

“Fine-Tuning GPT-2 from Human Preferences”, Ziegler et al 2019

“Fine-Tuning GPT-2 from Human Preferences”⁠, Daniel Ziegler, Nisan Stiennon, Jeffrey Wu, Tom Brown, Dario Amodei, Alec Radford, Paul Christiano, Geoffrey Irving et al (2019-09-19; ⁠, ; backlinks; similar):

We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks which continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of “machines talking to humans”, which we believe is key to extracting information about human values.

This work applies human preference learning to several natural language tasks: continuing text with positive sentiment or physically descriptive language using the BookCorpus, and summarizing content from the TL;DR and CNN⁠/​Daily Mail datasets. Each of these tasks can be viewed as a text completion problem: starting with some text X, we ask what text Y should follow. [For summarization, the text is the article plus the string “TL;DR:”.]

We start with a pretrained language model (the 774M parameter version of GPT-2) and fine-tune the model by asking human labelers which of four samples is best. Fine-tuning for the stylistic continuation tasks is sample efficient: 5,000 human samples suffice for strong performance according to humans. For summarization, models trained with 60,000 comparisons learn to copy whole sentences from the input while skipping irrelevant preamble; this copying is an easy way to ensure accurate summaries, but may exploit the fact that labelers rely on simple heuristics.

Bugs can optimize for bad behavior

One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.

Looking forward

We’ve demonstrated reward learning from human preferences on two kinds of natural language tasks, stylistic continuation and summarization. Our results are mixed: for continuation we achieve good results with very few samples, but our summarization models are only “smart copiers”: they copy from the input text but skip over irrelevant preamble. The advantage of smart copying is truthfulness: the zero-shot and supervised models produce natural, plausible-looking summaries that are often lies. We believe the limiting factor in our experiments is data quality exacerbated by the online data collection setting, and plan to use batched data collection in the future.

We believe the application of reward learning to language is important both from a capability and safety perspective. On the capability side, reinforcement learning lets us correct mistakes that supervised learning would not catch, but RL with programmatic reward functions “can be detrimental to model quality.” On the safety side, reward learning for language allows important criteria like “don’t lie” to be represented during training, and is a step towards scalable safety methods such as a debate and amplification. [Followup: “Learning to summarize from human feedback”⁠, Stiennon et al 2020.]

“Fine-Tuning Language Models from Human Preferences”, Ziegler et al 2019

“Fine-Tuning Language Models from Human Preferences”⁠, Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano et al (2019-09-18; ⁠, ⁠, ; backlinks; similar):

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/​Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.

“Emergent Tool Use From Multi-Agent Autocurricula”, Baker et al 2019

“Emergent Tool Use From Multi-Agent Autocurricula”⁠, Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, Igor Mordatch (2019-09-17; ⁠, ⁠, ; backlinks; similar):

Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using movable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.

“R2D3: Making Efficient Use of Demonstrations to Solve Hard Exploration Problems”, Paine et al 2019

“R2D3: Making Efficient Use of Demonstrations to Solve Hard Exploration Problems”⁠, Tom Le Paine, Caglar Gulcehre, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn et al (2019-09-03; ; backlinks; similar):

[previously: R2D2] This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of 8 tasks that combine these 3 properties, and show that R2D3 can solve several of the tasks where other state of the art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of steps of exploration.

Wall Sensor Stack: The original Wall Sensor Stack environment had a bug that the R2D3 agent was able to exploit. We fixed the bug and verified the agent can learn the proper stacking behavior.

…Another desirable property of our approach is that our agents are able to learn to outperform the demonstrators, and in some cases even to discover strategies that the demonstrators were not aware of. In one of our tasks the agent is able to discover and exploit a bug in the environment in spite of all the demonstrators completing the task in the intended way…R2D3 performed better than our average human demonstrator on Baseball, Drawbridge, Navigate Cubes and the Wall Sensor tasks. The behavior on Wall Sensor Stack in particular is quite interesting. On this task R2D3 found a completely different strategy than the human demonstrators by exploiting a bug in the implementation of the environment. The intended strategy for this task is to stack 2 blocks on top of each other so that one of them can remain in contact with a wall mounted sensor, and this is the strategy employed by the demonstrators. However, due to a bug in the environment the strategy learned by R2D3 was to trick the sensor into remaining active even when it is not in contact with the key by pressing the key against it in a precise way.

“Human-level Performance in 3D Multiplayer Games With Population-based Reinforcement Learning”, Jaderberg et al 2019

2019-jaderberg.pdf#deepmind: “Human-level performance in 3D multiplayer games with population-based reinforcement learning”⁠, Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda et al (2019-05-31; ⁠, ⁠, ; backlinks; similar):

[Videos: 1⁠, 2⁠, 3⁠, 4] Artificial teamwork: Artificially intelligent agents are getting better and better at 2-player games, but most real-world endeavors require teamwork. Jaderberg et al 2019 designed a computer program that excels at playing the video game Quake III Arena in Capture the Flag mode, where 2 multiplayer teams compete in capturing the flags of the opposing team. The agents were trained by playing thousands of games, gradually learning successful strategies not unlike those favored by their human counterparts. Computer agents competed successfully against humans even when their reaction times were slowed to match those of humans.

Reinforcement learning (RL) has shown great success in increasingly complex single-agent environments and 2-player turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents. We used a tournament-style evaluation to demonstrate that an agent can achieve human-level performance in a 3-dimensional multiplayer first-person video game, Quake III Arena in Capture the Flag mode, using only pixels and game points scored as input. We used a 2-tier optimization process in which a population of independent RL agents are trained concurrently from thousands of parallel matches on randomly generated environments. Each agent learns its own internal reward signal and rich representation of the world. These results indicate the great potential of multiagent reinforcement learning for artificial intelligence research.

Figure 1: CTF task and computational training framework. (A and B) 2 example maps that have been sampled from the distribution of (A) outdoor maps and (B) indoor maps. Each agent in the game sees only its own first-person pixel view of the environment. (C) Training data are generated by playing thousands of CTF games in parallel on a diverse distribution of procedurally generated maps and (D) used to train the agents that played in each game with RL. (E) We trained a population of 30 different agents together, which provided a diverse set of teammates and opponents to play with and was also used to evolve the internal rewards and hyperparameters of agents and learning process. Each circle represents an agent in the population, with the size of the inner circle representing strength. Agents undergo computational evolution (represented as splitting) with descendants inheriting and mutating hyperparameters (represented as color). Gameplay footage and further exposition of the environment variability can be found in movie S1.
Figure 2: Agent architecture and benchmarking. (A) How the agent processes a temporal sequence of observations xt from the environment. The model operates at 2 different time scales, faster at the bottom and slower by a factor of τ at the top. A stochastic vector-valued latent variable is sampled at the fast time scale from distribution ℚt on the basis of observations xt. The action distribution πt is sampled conditional on the latent variable at each time step t. The latent variable is regularized by the slow moving prior ℙt, which helps capture long-range temporal correlations and promotes memory. The network parameters are updated by using RL according to the agent’s own internal reward signal rt, which is obtained from a learned transformation w of game points ρt. w is optimized for winning probability through PBT, another level of training performed at yet a slower time scale than that of RL. Detailed network architectures are described in figure S11. (B) (Top) The Elo skill ratings of the FTW agent population throughout training (blue) together with those of the best baseline agents by using hand-tuned reward shaping (RS) (red) and game-winning reward signal only (black), compared with human and random agent reference points (violet, shaded region shows strength between 10th and 90th percentile). The FTW agent achieves a skill level considerably beyond strong human subjects, whereas the baseline agent’s skill plateaus below and does not learn anything without reward shaping [evaluation procedure is provided in (28 [supplements])]. (Bottom) The evolution of 3 hyperparameters of the FTW agent population: learning rate, Kullback-Leibler divergence (KL) weighting, and internal time scale τ, plotted as mean and standard deviation across the population.
Figure 3: Knowledge representation and behavioral analysis. (A) The 2D t-SNE embedding of an FTW agent’s internal states during gameplay. Each point represents the internal state (hp, hq) at a particular point in the game and is colored according to the high-level game state at this time—the conjunction of (B) 4 basic CTF situations, each state of which is colored distinctly. Color clusters form, showing that nearby regions in the internal representation of the agent correspond to the same high-level game state. (C) A visualization of the expected internal state arranged in a similarity-preserving topological embedding and colored according to activation (figure S5). (D) Distributions of situation conditional activations (each conditional distribution is colored gray and green) for particular single neurons that are distinctly selective for these CTF situations and show the predictive accuracy of this neuron. (E) The true return of the agent’s internal reward signal and (F) the agent’s prediction, its value function (orange denotes high value, and purple denotes low value). (G) Regions where the agent’s internal 2-time scale representation diverges (red), the agent’s surprise, measured as the KL between the agent’s slow-time and fast-time scale representations (28). (H) The 4-step temporal sequence of the high-level strategy “opponent base camping.” (I) 3 automatically discovered high-level behaviors of agents and corresponding regions in the t-SNE embedding. (Right) Average occurrence per game of each behavior for the FTW agent, the FTW agent without temporal hierarchy (TH), self-play with reward shaping agent, and human subjects (figure S9).
Figure 4: Progression of agent during training. Shown is the development of knowledge representation and behaviors of the FTW agent over the training period of 450,000 games, segmented into 3 phases (movie S2). “Knowledge” indicates the percentage of game knowledge that is linearly decodable from the agent’s representation, measured by average scaled AUC/​ROC across 200 features of game state. Some knowledge is compressed to single-neuron responses (Figure 3A), whose emergence in training is shown at the top. “Relative internal reward magnitude” indicates the relative magnitude of the agent’s internal reward weights of 3 of the 13 events corresponding to game points ρ. Early in training, the agent puts large reward weight on picking up the opponent’s flag, whereas later, this weight is reduced, and reward for tagging an opponent and penalty when opponents capture a flag are increased by a factor of 2. “Behavior probability” indicates the frequencies of occurrence for 3 of the 32 automatically discovered behavior clusters through training. Opponent base camping (red) is discovered early on, whereas teammate following (blue) becomes very prominent midway through training before mostly disappearing. The “home base defense” behavior (green) resurges in occurrence toward the end of training, which is in line with the agent’s increased internal penalty for more opponent flag captures. “Memory usage” comprises heat maps of visitation frequencies for (left) locations in a particular map and (right) locations of the agent at which the top-10 most frequently read memories were written to memory, normalized by random reads from memory, indicating which locations the agent learned to recall. Recalled locations change considerably throughout training, eventually showing the agent recalling the entrances to both bases, presumably in order to perform more efficient navigation in unseen maps (figure S7).

“Speech2Face: Learning the Face Behind a Voice”, Oh et al 2019

“Speech2Face: Learning the Face Behind a Voice”⁠, Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik et al (2019-05-23; backlinks; similar):

How much can we infer about a person’s looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/​YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how—and in what manner—our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.

“GPT-2 Neural Network Poetry”, Branwen & Presser 2019

GPT-2: “GPT-2 Neural Network Poetry”⁠, Gwern Branwen, Shawn Presser (2019-03-03; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Demonstration tutorial of retraining OpenAI’s GPT-2 (a text-generating Transformer neural network) on large poetry corpuses to generate high-quality English verse.

In February 2019, following up on my 2015–2016 text-generation experiments with char-RNNs⁠, I experiment with the cutting-edge Transformer NN architecture for language modeling & text generation. Using OpenAI’s GPT-2-117M (117M) model pre-trained on a large Internet corpus and nshepperd’s finetuning code, I retrain GPT-2-117M on a large (117MB) Project Gutenberg poetry corpus. I demonstrate how to train 2 variants: “GPT-2-poetry”, trained on the poems as a continuous stream of text, and “GPT-2-poetry-prefix”, with each line prefixed with the metadata of the PG book it came from. In May 2019, I trained the next-largest GPT-2, GPT-2-345M, similarly, for a further quality boost in generated poems. In October 2019, I retrained GPT-2-117M on a Project Gutenberg corpus with improved formatting, and combined it with a contemporary poem dataset based on Poetry Foundation’s website⁠.

With just a few GPU-days on 1080ti GPUs, GPT-2-117M finetuning can produce high-quality poetry which is more thematically consistent than my char-RNN poems, capable of modeling subtle features like rhyming, and sometimes even a pleasure to read. I list the many possible ways to improve poem generation and further approach human-level poems. For the highest-quality AI poetry to date, see my followup pages, “GPT-3 Creative Writing”⁠/​“GPT-3 Non-Fiction”⁠.

For anime plot summaries, see TWDNE⁠; for generating ABC-formatted folk music, see “GPT-2 Folk Music” & “GPT-2 Preference Learning for Music and Poetry Generation”⁠; for playing chess, see “A Very Unlikely Chess Game”⁠; for the Reddit comment generator, see SubSimulatorGPT-2⁠; for fanfiction, the Ao3⁠; and for video games, the walkthrough model⁠. For OpenAI’s GPT-3 followup, see “GPT-3: Language Models are Few-Shot Learners”⁠.

“A Replication Study: Machine Learning Models Are Capable of Predicting Sexual Orientation From Facial Images”, Leuner 2019

“A Replication Study: Machine Learning Models Are Capable of Predicting Sexual Orientation From Facial Images”⁠, John Leuner (2019-02-27; ; backlinks; similar):

Recent research used machine learning methods to predict a person’s sexual orientation from their photograph (Wang & Kosinski 2017). To verify this result, two of these models are replicated, one based on a deep neural network (DNN) and one on facial morphology (FM). Using a new dataset of 20,910 photographs from dating websites, the ability to predict sexual orientation is confirmed (DNN accuracy male 68%, female 77%, FM male 62%, female 72%). To investigate whether facial features such as brightness or predominant colours are predictive of sexual orientation, a new model based on highly blurred facial images was created. This model was also able to predict sexual orientation (male 63%, female 72%). The tested models are invariant to intentional changes to a subject’s makeup, eyewear, facial hair and head pose (angle that the photograph is taken at). It is shown that the head pose is not correlated with sexual orientation. While demonstrating that dating profile images carry rich information about sexual orientation these results leave open the question of how much is determined by facial morphology and how much by differences in grooming, presentation and lifestyle. The advent of new technology that is able to detect sexual orientation in this way may have serious implications for the privacy and safety of gay men and women.

“What Makes a Good Conversation? How Controllable Attributes Affect Human Judgments”, See et al 2019

“What makes a good conversation? How controllable attributes affect human judgments”⁠, Abigail See, Stephen Roller, Douwe Kiela, Jason Weston (2019-02-22; backlinks):

A good conversation requires balance—between simplicity and detail; staying on topic and changing it; asking questions and answering them. Although dialogue agents are commonly evaluated via human judgments of overall quality, the relationship between quality and these individual factors is less well-studied.

In this work, we examine two controllable neural text generation methods, conditional training and weighted decoding, in order to control 4 important attributes for chitchat dialogue: repetition, specificity, response-relatedness and question-asking.

We conduct a large-scale human evaluation to measure the effect of these control parameters on multi-turn interactive conversations on the PersonaChat task.

We provide a detailed analysis of their relationship to high-level aspects of conversation, and show that by controlling combinations of these variables our models obtain clear improvements in human quality judgments.

“This Waifu Does Not Exist”, Branwen 2019

TWDNE: “This Waifu Does Not Exist”⁠, Gwern Branwen (2019-02-19; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

I describe how I made the website (TWDNE) for displaying random anime faces generated by StyleGAN neural networks, and how it went viral.

Generating high-quality anime faces has long been a task neural networks struggled with. The invention of StyleGAN in 2018 has effectively solved this task and I have trained a StyleGAN model which can generate high-quality anime faces at 512px resolution. To show off the recent progress, I made a website, “This Waifu Does Not Exist” for displaying random StyleGAN 2 faces. TWDNE displays a different neural-net-generated face & plot summary every 15s. The site was popular and went viral online, especially in China. The model can also be used interactively for exploration & editing in the Artbreeder online service⁠.

TWDNE faces have been used as screensavers, user avatars, character art for game packs or online games⁠, painted watercolors⁠, uploaded to Pixiv, given away in streams⁠, and used in a research paper (Noguchi & Harada 2019). TWDNE results also helped inspired Sizigi Studio’s online interactive waifu GAN⁠, Waifu Labs⁠, which generates even better anime faces than my StyleGAN results.

“Anime Neural Net Graveyard”, Branwen 2019

Faces-graveyard: “Anime Neural Net Graveyard”⁠, Gwern Branwen (2019-02-04; ⁠, ⁠, ; backlinks; similar):

Compilation of failed neural network experiments in generating anime images, pre-StyleGAN/​BigGAN.

My experiments in generating anime faces, tried periodically since 2015, succeeded in 2019 with the release of StyleGAN⁠. But for comparison, here are the failures from some of my older GAN or other NN attempts; as the quality is worse than StyleGAN, I won’t bother going into details—creating the datasets & training the ProGAN & tuning & transfer-learning were all much the same as already outlined at length for the StyleGAN results⁠.

Included are:

  • ProGAN

  • Glow


  • PokeGAN

  • Self-Attention-GAN-TensorFlow

  • VGAN

  • BigGAN unofficial (official BigGAN is covered elsewhere)

    • BigGAN-TensorFlow
    • BigGAN-PyTorch
  • GAN-QP

  • WGAN

  • IntroVAE

“Making Anime With BigGAN”, Branwen 2019

BigGAN: “Making Anime With BigGAN”⁠, Gwern Branwen (2019-02-04; ⁠, ⁠, ⁠, ; backlinks; similar):

Experiments in using BigGAN to generate anime faces and whole anime images; semi-successful.

Following my StyleGAN anime face experiments⁠, I explore BigGAN⁠, another recent GAN with SOTA results on one of the most complex image domains tackled by GANs so far (ImageNet). BigGAN’s capabilities come at a steep compute cost, however.

Using the unofficial BigGAN-PyTorch reimplementation, I experimented in 2019 with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on an 8×2080ti machine for a month (mixed results). My BigGAN results are good but compromised by the compute expense & practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may be necessary to scale to whole anime images.

For followup experiments, Shawn Presser⁠, I and others (collectively, “Tensorfork”) have used Tensorflow Research Cloud TPU credits & the compare_gan BigGAN reimplementation. Running this at scale on the full Danbooru2019 dataset in May 2020, we have reached the best anime GAN results to date (later exceeded by This Anime Does Not Exist).

“Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition”, Winkler et al 2019

2019-winkler.pdf: “Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition”⁠, Julia K. Winkler, Christine Fink, Ferdin, Toberer, Alexander Enk, Teresa Deinlein, Rainer Hofmann-Wellenhof et al (2019-01-01; ; backlinks)

“Internet Search Tips”, Branwen 2018

Search: “Internet Search Tips”⁠, Gwern Branwen (2018-12-11; ⁠, ⁠, ⁠, ; backlinks; similar):

A description of advanced tips and tricks for effective Internet research of papers/​books, with real-world examples.

Over time, I developed a certain google-fu and expertise in finding references, papers, and books online. Some of these tricks are not well-known, like checking the Internet Archive (IA) for books. I try to write down my search workflow, and give general advice about finding and hosting documents, with demonstration case studies⁠.

“Evolution As Backstop for Reinforcement Learning”, Branwen 2018

Backstop: “Evolution as Backstop for Reinforcement Learning”⁠, Gwern Branwen (2018-12-06; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Markets/​evolution as backstops/​ground truths for reinforcement learning/​optimization: on some connections between Coase’s theory of the firm/​linear optimization/​DRL/​evolution/​multicellular life/​pain/​Internet communities as multi-level optimization problems.

One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest interpreting phenomenon as multi-level nested optimization paradigm: many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming group selection perspective. So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/​optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that are inherent balances between market/​non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.

“ARPA and SCI: Surfing AI”, Branwen 2018

ARPA: “ARPA and SCI: Surfing AI”⁠, Gwern Branwen (2018-07-04; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Review of Roland & Shiman 2002 history of a decade of ARPA/​DARPA involvement in AI and supercomputing, and the ARPA philosophy of technological acceleration; it yielded mixed results, perhaps due to ultimately insurmountable bottlenecks—the time was not yet ripe for many goals.

Review of DARPA history book, Strategic Computing: DARPA and the Quest for Machine Intelligence, 1983–1993, Roland & Shiman 2002, which reviews a large-scale DARPA effort to jumpstart real-world uses of AI in the 1980s by a multi-pronged research effort into more efficient computer chip R&D, supercomputing, robotics/​self-driving cars, & expert system software. Roland & Shiman 2002 particularly focus on the various ‘philosophies’ of technological forecasting & development, which guided DARPA’s strategy in different periods, ultimately endorsing a weak technological determinism where the bottlenecks are too large for a small (in comparison to the global economy & global R&D) organization best a DARPA can hope for is a largely agnostic & reactive strategy in which granters ‘surf’ technological changes, rapidly exploiting new technology while investing their limited funds into targeted research patching up any gaps or lags that accidentally open up and block broader applications. (For broader discussion of progress, see “Lessons from the Media Lab” & Bakewell⁠.)

“Do Better ImageNet Models Transfer Better?”, Kornblith et al 2018

“Do Better ImageNet Models Transfer Better?”⁠, Simon Kornblith, Jonathon Shlens, Quoc V. Le (2018-05-23; backlinks; similar):

Transfer learning is a cornerstone of computer vision, yet little work has been done to evaluate the relationship between architecture and transfer. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested.

Here, we compare the performance of 16 classification networks on 12 image classification datasets. We find that, when networks are used as fixed feature extractors or fine-tuned, there is a strong correlation between ImageNet accuracy and transfer accuracy (r = 0.99 and 0.96, respectively).

In the former setting, we find that this relationship is very sensitive to the way in which networks are trained on ImageNet; many common forms of regularization slightly improve ImageNet accuracy but yield penultimate layer features that are much worse for transfer learning. Additionally, we find that, on two small fine-grained image classification datasets, pretraining on ImageNet provides minimal benefits, indicating the learned features from ImageNet do not transfer well to fine-grained tasks.

Together, our results show that ImageNet architectures generalize well across datasets, but ImageNet features are less general than previously suggested.

“Self-Attention With Relative Position Representations”, Shaw et al 2018

“Self-Attention with Relative Position Representations”⁠, Peter Shaw, Jakob Uszkoreit, Ashish Vaswani (2018-03-06; backlinks):

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al 2017 achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.

“Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari”, Chrabaszcz et al 2018

“Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari”⁠, Patryk Chrabaszcz, Ilya Loshchilov, Frank Hutter (2018-02-24; ; backlinks; similar):

Evolution Strategies (ES) have recently been demonstrated to be a viable alternative to reinforcement learning (RL) algorithms on a set of challenging deep RL problems, including Atari games and MuJoCo humanoid locomotion benchmarks. While the ES algorithms in that work belonged to the specialized class of natural evolution strategies (which resemble approximate gradient RL algorithms, such as REINFORCE), we demonstrate that even a very basic canonical ES algorithm can achieve the same or even better performance. This success of a basic ES algorithm suggests that the state-of-the-art can be advanced further by integrating the many advances made in the field of ES in the last decades.

We also demonstrate qualitatively that ES algorithms have very different performance characteristics than traditional RL algorithms: on some games, they learn to exploit the environment and perform much better while on others they can get stuck in suboptimal local minima. Combining their strengths with those of traditional RL algorithms is therefore likely to lead to new advances in the state of the art.

“ArcFace: Additive Angular Margin Loss for Deep Face Recognition”, Deng et al 2018

“ArcFace: Additive Angular Margin Loss for Deep Face Recognition”⁠, Jiankang Deng, Jia Guo, Niannan Xue, Stefanos Zafeiriou (2018-01-23; backlinks):

One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power. Centre loss penalises the distance between the deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way. Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximize face class separability. In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere. We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods on over 10 face recognition benchmarks including a new large-scale image database with trillion level of pairs and a large-scale video dataset. We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead. We release all refined training data, training codes, pre-trained models and training logs, which will help reproduce the results in this paper.

“Deep Image Reconstruction from Human Brain Activity”, Shen et al 2017

“Deep image reconstruction from human brain activity”⁠, Guohua Shen, Tomoyasu Horikawa, Kei Majima, Yukiyasu Kamitani (2017-12-30; ⁠, ; similar):

Machine learning-based analysis of human functional magnetic resonance imaging (fMRI) patterns has enabled the visualization of perceptual content. However, it has been limited to the reconstruction with low-level image bases (Miyawaki et al 2008; Wen et al 2016) or to the matching to exemplars (Naselaris et al 2009; Nishimoto et al 2011). Recent work showed that visual cortical activity can be decoded (translated) into hierarchical features of a deep neural network (DNN) for the same input image, providing a way to make use of the information from hierarchical visual features (Horikawa & Kamitani, 2017).

Here, we present a novel image reconstruction method, in which the pixel values of an image are optimized to make its DNN features similar to those decoded from human brain activity at multiple layers. We found that the generated images resembled the stimulus images (both natural images and artificial shapes) and the subjective visual content during imagery. While our model was solely trained with natural images, our method successfully generalized the reconstruction to artificial shapes, indicating that our model indeed ‘reconstructs’ or ‘generates’ images from brain activity, not simply matches to exemplars.

A natural image prior introduced by another deep neural network effectively rendered semantically meaningful details to reconstructions by constraining reconstructed images to be similar to natural images. Furthermore, human judgment of reconstructions suggests the effectiveness of combining multiple DNN layers to enhance visual quality of generated images. The results suggest that hierarchical visual information in the brain can be effectively combined to reconstruct perceptual and subjective images.

“CycleGAN, a Master of Steganography”, Chu et al 2017

“CycleGAN, a Master of Steganography”⁠, Casey Chu, Andrey Zhmoginov, Mark Sandler (2017-12-08; ; backlinks; similar):

CycleGAN (Zhu et al 2017) is one recent successful approach to learn a transformation between two image distributions. In a series of experiments, we demonstrate an intriguing property of the model: CycleGAN learns to “hide” information about a source image into the images it generates in a nearly imperceptible, high-frequency signal. This trick ensures that the generator can recover the original sample and thus satisfy the cyclic consistency requirement, while the generated image remains realistic. We connect this phenomenon with adversarial attacks by viewing CycleGAN’s training procedure as training a generator of adversarial examples and demonstrate that the cyclic consistency loss causes CycleGAN to be especially vulnerable to adversarial attacks.

“Automatic Differentiation in PyTorch”, Paszke et al 2017

“Automatic differentiation in PyTorch”⁠, Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison et al (2017-10-28; ; backlinks; similar):

A summary of automatic differentiation techniques employed in PyTorch library, including novelties like support for in-place modification in presence of objects aliasing the same data, performance optimizations and Python extensions.

In this article, we describe an automatic differentiation module of PyTorch—a library designed to enable rapid research on machine learning models. It builds upon a few projects, most notably Lua Torch, Chainer, and HIPS Autograd, and provides a high performance environment with easy access to automatic differentiation of models executed on different devices (CPU and GPU). To make prototyping easier, PyTorch does not follow the symbolic approach used in many other deep learning frameworks, but focuses on differentiation of purely imperative programs, with a focus on extensibility and low overhead. Note that this preprint is a draft of certain sections from an upcoming paper covering all PyTorch features.

[Keywords: PyTorch, Automatic differentiation, imperative, aliasing, dynamic, eager, machine learning]

“High-Precision Automated Reconstruction of Neurons With Flood-filling Networks”, Januszewski et al 2017

“High-Precision Automated Reconstruction of Neurons with Flood-filling Networks”⁠, Michał Januszewski, Jörgen Kornfeld, Peter H. Li, Art Pope, Tim Blakely, Larry Lindsey, Jeremy Maitin-Shepard et al (2017-10-09; ; backlinks; similar):

Reconstruction of neural circuits from volume electron microscopy data requires the tracing of complete cells including all their neurites. Automated approaches have been developed to perform the tracing, but without costly human proofreading their error rates are too high to obtain reliable circuit diagrams. We present a method for automated segmentation that, like the majority of previous efforts, employs convolutional neural networks, but contains in addition a recurrent pathway that allows the iterative optimization and extension of the reconstructed shape of individual neural processes. We used this technique, which we call flood-filling networks, to trace neurons in a data set obtained by serial block-face electron microscopy from a male zebra finch brain. Our method achieved a mean error-free neurite path length of 1.1 mm, an order of magnitude better than previously published approaches applied to the same dataset. Only 4 mergers were observed in a neurite test set of 97 mm path length.

“Online Learning of a Memory for Learning Rates”, Meier et al 2017

“Online Learning of a Memory for Learning Rates”⁠, Franziska Meier, Daniel Kappler, Stefan Schaal (2017-09-20; ; backlinks; similar):

The promise of learning to learn for robotics rests on the hope that by extracting some information about the learning process itself we can speed up subsequent similar learning tasks. Here, we introduce a computationally efficient online meta-learning algorithm that builds and optimizes a memory model of the optimal learning rate landscape from previously observed gradient behaviors. While performing task specific optimization, this memory of learning rates predicts how to scale currently observed gradients. After applying the gradient scaling our meta-learner updates its internal memory based on the observed effect its prediction had. Our meta-learner can be combined with any gradient-based optimizer, learns on the fly and can be transferred to new optimization tasks. In our evaluations we show that our meta-learning algorithm speeds up learning of MNIST classification and a variety of learning control tasks, either in batch or online learning settings.

“Emergence of Locomotion Behaviours in Rich Environments”, Heess et al 2017

“Emergence of Locomotion Behaviours in Rich Environments”⁠, Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez et al (2017-07-07; ⁠, ; backlinks; similar):

The reinforcement learning paradigm allows, in principle, for complex behaviours to be learned directly from simple reward signals. In practice, however, it is common to carefully hand-design the reward function to encourage a particular solution, or to derive it from demonstration data. In this paper explore how a rich environment can help to promote the learning of complex behavior. Specifically, we train agents in diverse environmental contexts, and find that this encourages the emergence of robust behaviours that perform well across a suite of tasks. We demonstrate this principle for locomotion—behaviours that are known for their sensitivity to the choice of reward. We train several simulated bodies on a diverse set of challenging terrains and obstacles, using a simple reward function based on forward progress. Using a novel scalable variant of policy gradient reinforcement learning, our agents learn to run, jump, crouch and turn as required by the environment without explicit reward-based guidance. A visual depiction of highlights of the learned behavior can be viewed following this URL⁠.

“Learning from Human Preferences”, Amodei et al 2017

“Learning from Human Preferences”⁠, Dario Amodei, Paul Christiano, Alex Ray (2017-06-13; ⁠, ⁠, ; backlinks; similar):

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

We present a learning algorithm that uses small amounts of human feedback to solve modern RL environments. Machine learning systems with human feedback have been explored before, but we’ve scaled up the approach to be able to work on much more complicated tasks. Our algorithm needed 900 bits of feedback from a human evaluator to learn to backflip—a seemingly simple task which is simple to judge but challenging to specify.

The overall training process is a 3-step feedback cycle between the human, the agent’s understanding of the goal, and the RL training.

Preference learning architecture

Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are given to a human, and the human decides which of the two clips is closest to fulfilling its goal—in this case, a backflip. The AI gradually builds a model of the goal of the task by finding the reward function that best explains the human’s judgments. It then uses RL to learn how to achieve that goal. As its behavior improves, it continues to ask for human feedback on trajectory pairs where it’s most uncertain about which is better, and further refines its understanding of the goal.

“A Deep Reinforced Model for Abstractive Summarization”, Paulus et al 2017

“A Deep Reinforced Model for Abstractive Summarization”⁠, Romain Paulus, Caiming Xiong, Richard Socher (2017-05-11; ⁠, ; backlinks; similar):

Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. For longer documents and summaries however these models often include repetitive and incoherent phrases. We introduce a neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL). Models trained only with supervised learning often exhibit “exposure bias”—they assume ground truth is provided at each step during training. However, when standard word prediction is combined with the global sequence prediction training of RL the resulting summaries become more readable. We evaluate this model on the CNN/​Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/​Daily Mail dataset, an improvement over previous state-of-the-art models. Human evaluation also shows that our model produces higher quality summaries.

“Data-efficient Deep Reinforcement Learning for Dexterous Manipulation”, Popov et al 2017

“Data-efficient Deep Reinforcement Learning for Dexterous Manipulation”⁠, Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Rol, Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe et al (2017-04-10; ; backlinks; similar):

Deep learning and reinforcement learning methods have recently been used to solve a variety of problems in continuous control domains. An obvious application of these techniques is dexterous manipulation tasks in robotics which are difficult to solve using traditional control theory or hand-engineered approaches. One example of such a task is to grasp an object and precisely stack it on another. Solving this difficult and practically relevant problem in the real world is an important long-term goal for the field of robotics.

Here we take a step towards this goal by examining the problem in simulation and providing models and techniques aimed at solving it. We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it statistically-significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them.

Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots.

“The Kelly Coin-Flipping Game: Exact Solutions”, Branwen et al 2017

Coin-flip: “The Kelly Coin-Flipping Game: Exact Solutions”⁠, Gwern Branwen, Arthur B., nshepperd, FeepingCreature, Gurkenglas (2017-01-19; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Decision-theoretic analysis of how to optimally play Haghani & Dewey 2016’s 300-round double-or-nothing coin-flipping game with an edge and ceiling better than using the Kelly Criterion. Computing and following an exact decision tree increases earnings by $6.6 over a modified KC.

Haghani & Dewey 2016 experiment with a double-or-nothing coin-flipping game where the player starts with $30.4[^\$25.0^~2016~]{.supsub} and has an edge of 60%, and can play 300 times, choosing how much to bet each time, winning up to a maximum ceiling of $303.8[^\$250.0^~2016~]{.supsub}. Most of their subjects fail to play well, earning an average $110.6[^\$91.0^~2016~]{.supsub}, compared to Haghani & Dewey 2016’s heuristic benchmark of ~$291.6[^\$240.0^~2016~]{.supsub} in winnings achievable using a modified Kelly Criterion as their strategy. The KC, however, is not optimal for this problem as it ignores the ceiling and limited number of plays.

We solve the problem of the value of optimal play exactly by using decision trees & dynamic programming for calculating the value function, with implementations in R, Haskell⁠, and C. We also provide a closed-form exact value formula in R & Python, several approximations using Monte Carlo/​random forests⁠/​neural networks, visualizations of the value function, and a Python implementation of the game for the OpenAI Gym collection. We find that optimal play yields $246.61 on average (rather than ~$240), and so the human players actually earned only 36.8% of what was possible, losing $155.6 in potential profit. Comparing decision trees and the Kelly criterion for various horizons (bets left), the relative advantage of the decision tree strategy depends on the horizon: it is highest when the player can make few bets (at b = 23, with a difference of ~$36), and decreases with number of bets as more strategies hit the ceiling.

In the Kelly game, the maximum winnings, number of rounds, and edge are fixed; we describe a more difficult generalized version in which the 3 parameters are drawn from Pareto, normal, and beta distributions and are unknown to the player (who can use Bayesian inference to try to estimate them during play). Upper and lower bounds are estimated on the value of this game. In the variant of this game where subjects are not told the exact edge of 60%, a Bayesian decision tree approach shows that performance can closely approach that of the decision tree, with a penalty for 1 plausible prior of only $1. Two deep reinforcement learning agents, DQN & DDPG, are implemented but DQN fails to learn and DDPG doesn’t show acceptable performance, indicating better deep RL methods may be required to solve the generalized Kelly game.

“Dermatologist-level Classification of Skin Cancer With Deep Neural Networks”, Esteva et al 2017

2017-esteva.pdf: “Dermatologist-level classification of skin cancer with deep neural networks”⁠, Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, Sebastian Thrun et al (2017-01-01; ; backlinks)

“Responses to Critiques on Machine Learning of Criminality Perceptions (Addendum of ArXiv:1611.04135)”, Wu & Zhang 2016

“Responses to Critiques on Machine Learning of Criminality Perceptions (Addendum of arXiv:1611.04135)”⁠, Xiaolin Wu, Xi Zhang (2016-11-13; backlinks; similar):

In November 2016 we submitted to arXiv our paper “Automated Inference on Criminality Using Face Images”. It generated a great deal of discussions in the Internet and some media outlets. Our work is only intended for pure academic discussions; how it has become a media consumption is a total surprise to us. Although in agreement with our critics on the need and importance of policing AI research for the general good of the society, we are deeply baffled by the ways some of them mispresented our work, in particular the motive and objective of our research.

“Why Tool AIs Want to Be Agent AIs”, Branwen 2016

Tool-AI: “Why Tool AIs Want to Be Agent AIs”⁠, Gwern Branwen (2016-09-07; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

AIs limited to pure computation (Tool AIs) supporting humans, will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) who act on their own and meta-learn, because all problems are reinforcement-learning problems.

Autonomous AI systems (Agent AIs) trained using reinforcement learning can do harm when they take wrong actions, especially superintelligent Agent AIs. One solution would be to eliminate their agency by not giving AIs the ability to take actions, confining them to purely informational or inferential tasks such as classification or prediction (Tool AIs), and have all actions be approved & executed by humans, giving equivalently superintelligent results without the risk.

I argue that this is not an effective solution for two major reasons. First, because Agent AIs will by definition be better at actions than Tool AIs, giving an economic advantage. Secondly, because Agent AIs will be better at inference & learning than Tool AIs, and this is inherently due to their greater agency: the same algorithms which learn how to perform actions can be used to select important datapoints to learn inference over, how long to learn, how to more efficiently execute inference, how to design themselves, how to optimize hyperparameters, how to make use of external resources such as long-term memories or external software or large databases or the Internet, and how best to acquire new data.

All of these actions will result in Agent AIs more intelligent than Tool AIs, in addition to their greater economic competitiveness. Thus, Tool AIs will be inferior to Agent AIs in both actions and intelligence, implying use of Tool AIs is an even more highly unstable equilibrium than previously argued, as users of Agent AIs will be able to outcompete them on two dimensions (and not just one).

“Direct Feedback Alignment Provides Learning in Deep Neural Networks”, Nøkland 2016

“Direct Feedback Alignment Provides Learning in Deep Neural Networks”⁠, Arild Nøkland (2016-09-06; ; backlinks; similar):

Artificial neural networks are most commonly trained with the back-propagation algorithm, where the gradient for learning is provided by back-propagating the error, layer by layer, from the output layer to the hidden layers. A recently discovered method called feedback-alignment shows that the weights used for propagating the error backward don’t have to be symmetric with the weights used for propagation the activation forward. In fact, random feedback weights work evenly well, because the network learns how to make the feedback useful.

In this work, the feedback alignment principle is used for training hidden layers more independently from the rest of the network, and from a zero initial condition. The error is propagated through fixed random feedback connections directly from the output layer to each hidden layer. This simple method is able to achieve zero training error even in convolutional networks and very deep networks, completely without error back-propagation.

The method is a step towards biologically plausible machine learning because the error signal is almost local, and no symmetric or reciprocal weights are required. Experiments show that the test performance on MNIST and CIFAR is almost as good as those obtained with back-propagation for fully connected networks. If combined with dropout, the method achieves 1.45% error on the permutation invariant MNIST task.

“Why Does Deep and Cheap Learning Work so Well?”, Lin et al 2016

“Why does deep and cheap learning work so well?”⁠, Henry W. Lin, Max Tegmark, David Rolnick (2016-08-29; similar):

We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through “cheap learning” with exponentially fewer parameters than generic ones.

We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine-learning, a deep neural network can be more efficient than a shallow one.

We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various “no-flattening theorems” showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss, for example, we show that n variables cannot be multiplied using fewer than 2n neurons in a single hidden layer.

“Concrete Problems in AI Safety”, Amodei et al 2016

“Concrete Problems in AI Safety”⁠, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané (2016-06-21; backlinks; similar):

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (“avoiding side effects” and “avoiding reward hacking”), an objective function that is too expensive to evaluate frequently (“scalable supervision”), or undesirable behavior during the learning process (“safe exploration” and “distributional shift”). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

“PlaNet—Photo Geolocation With Convolutional Neural Networks”, Weyand et al 2016

“PlaNet—Photo Geolocation with Convolutional Neural Networks”⁠, Tobias Weyand, Ilya Kostrikov, James Philbin (2016-02-17; ⁠, ⁠, ; backlinks; similar):

Is it possible to build a system to determine the location where a photo was taken using just its pixels? In general, the problem seems exceptionally difficult: it is trivial to construct situations where no location can be inferred. Yet images often contain informative cues such as landmarks, weather patterns, vegetation, road markings, and architectural details, which in combination may allow one to determine an approximate location and occasionally an exact location. Websites such as GeoGuessr and View from your Window suggest that humans are relatively good at integrating these cues to geolocate images, especially en-masse. In computer vision, the photo geolocation problem is usually approached using image retrieval methods. In contrast, we pose the problem as one of classification by subdividing the surface of the earth into thousands of multi-scale geographic cells, and train a deep network using millions of geotagged images. While previous approaches only recognize landmarks or perform approximate matching using global image descriptors, our model is able to use and integrate multiple visible cues. We show that the resulting model, called PlaNet, outperforms previous approaches and even attains superhuman levels of accuracy in some cases. Moreover, we extend our model to photo albums by combining it with a long short-term memory (LSTM) architecture. By learning to exploit temporal coherence to geolocate uncertain photos, we demonstrate that this model achieves a 50% performance improvement over the single-image model.

“"Why Should I Trust You?": Explaining the Predictions of Any Classifier”, Ribeiro et al 2016

“"Why Should I Trust You?": Explaining the Predictions of Any Classifier”⁠, Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin (2016-02-16; backlinks; similar):

Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (eg. random forests) and image classification (eg. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.

“Image Synthesis from Yahoo’s open_nsfw”, Goh 2016

2016-goh-opennsfw.html: “Image Synthesis from Yahoo’s open_nsfw⁠, Gabriel Goh (2016; ⁠, ; backlinks; similar):

Yahoo’s recently open sourced neural network, open_nsfw, is a fine tuned Residual Network which scores images on a scale of 0 to 1 on its suitability for use in the workplace…What makes an image NSFW, according to Yahoo?

I explore this question with a clever new visualization technique by Nguyen et al…Like Google’s Deep Dream, this visualization trick works by maximally activating certain neurons of the classifier. Unlike deep dream, we optimize these activations by performing descent on a parameterization of the manifold of natural images.

[Demonstration of an unusual use of backpropagation to ‘optimize’ a neural network: instead of taking a piece of data to input to a neural network and then updating the neural network to change its output slightly towards some desired output (such as a correct classification), one can instead update the input so as to make the neural net output slightly more towards the desired output.

When using an image classification neural network, this reversed form of optimization will ‘hallucinate’ or ‘edit’ the ‘input’ to make it more like a particular class of images.

In this case, a porn/​NSFW-detecting NN is reversed so as to make images more (or less) “porn-like”. Goh runs this process on various images like landscapes, musical bands, or empty images; the maximally/​minimally porn-like images are disturbing, hilarious, and undeniably pornographic in some sense.]

“Neural Module Networks”, Andreas et al 2015

“Neural Module Networks”⁠, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein (2015-11-09; backlinks; similar):

Visual question answering is fundamentally compositional in nature—a question like “where is the dog?” shares substructure with questions like “what color is the dog?” and “where is the cat?” This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions.

We describe a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural “modules” into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained.

We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.

“RNN Metadata for Mimicking Author Style”, Branwen 2015

RNN-metadata: “RNN Metadata for Mimicking Author Style”⁠, Gwern Branwen (2015-09-12; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.

Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient metadata, and more controllable sampling of generated output by feeding in desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.

I further try & fail to train a char-RNN on Geocities HTML for unclear reasons.

More successfully, I experiment in 2019 with a recently-developed alternative to char-RNNs⁠, the Transformer NN architecture, by finetuning training OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And GPT-3 is better still.

“Deep DPG (DDPG): Continuous Control With Deep Reinforcement Learning”, Lillicrap et al 2015

“Deep DPG (DDPG): Continuous control with deep reinforcement learning”⁠, Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver et al (2015-09-09; ; backlinks; similar):

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.

We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces.

Using the same learning algorithm, network architecture and hyper-parameters, our DDPG algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation [gripper/​reacher], legged locomotion [Cheetah/​walker] and car driving [TORCS]. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives.

We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

“VQA: Visual Question Answering”, Agrawal et al 2015

“VQA: Visual Question Answering”⁠, Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh et al (2015-05-03; backlinks):

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (, and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http:/​/​​vqa).

“Leprechaun Hunting & Citogenesis”, Branwen 2014

Leprechauns: “Leprechaun Hunting & Citogenesis”⁠, Gwern Branwen (2014-06-30; ⁠, ; backlinks; similar):

Many claims, about history in particular, turn out to be false when traced back to their origins, and form kinds of academic urban legends. These “leprechauns” are particularly pernicious because they are often widely-repeated due to their apparent trustworthiness, yet difficult to research & debunk due to the difficulty of following deeply-nested chains of citations through ever more obscure sources. This page lists instances I have run into.

A major source of leprechaun transmission is the frequency with which researchers do not read the papers they cite: because they do not read them, they repeat misstatements or add their own errors, further transforming the leprechaun and adding another link in the chain to anyone seeking the original source. This can be quantified by checking statements against the original paper, and examining the spread of typos in citations: someone reading the original will fix a typo in the usual citation, or is unlikely to make the same typo, and so will not repeat it. Both methods indicate high rates of non-reading, explaining how leprechauns can propagate so easily.

“Movie Reviews”, Branwen 2014

Movies: “Movie Reviews”⁠, Gwern Branwen (2014-05-01; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A compilation of movie, television, and opera reviews since 2014.

This is a compilation of my film/​television/​theater reviews; it is compiled from my newsletter⁠. Reviews are sorted by rating in descending order.

See also my book & anime /  ​ manga reviews⁠.

“Deep Learning in Neural Networks: An Overview”, Schmidhuber 2014

“Deep Learning in Neural Networks: An Overview”⁠, Juergen Schmidhuber (2014-04-30; ; backlinks; similar):

In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

“Surprisingly Turing-Complete”, Branwen 2012

Turing-complete: “Surprisingly Turing-Complete”⁠, Gwern Branwen (2012-12-09; ⁠, ⁠, ⁠, ; backlinks; similar):

A catalogue of software constructs, languages, or APIs which are unexpectedly Turing-complete; implications for security and reliability

‘Computers’, in the sense of being Turing-complete, are extremely common. Almost any system of sufficient complexity—unless carefully engineered otherwise—may be found to ‘accidentally’ support Turing-complete somewhere inside it through ‘weird machines’ which can be rebuilt out of the original system’s parts, even systems which would appear to have not the slightest thing to do with computation. Software systems are especially susceptible to this, which often leads to serious security problems as the Turing-complete components can be used to run attacks on the rest of the system.

I provide a running catalogue of systems which have been, surprisingly, demonstrated to be Turing-complete. These examples may help unsee surface systems to see the weird machines and Turing-completeness lurking within.

“The Neural Net Tank Urban Legend”, Branwen 2011

Tanks: “The Neural Net Tank Urban Legend”⁠, Gwern Branwen (2011-09-20; ⁠, ⁠, ; backlinks; similar):

AI folklore tells a story about a neural network trained to detect tanks which instead learned to detect time of day; investigating, this probably never happened.

A cautionary tale in artificial intelligence tells about researchers training an neural network (NN) to detect tanks in photographs, succeeding, only to realize the photographs had been collected under specific conditions for tanks/​non-tanks and the NN had learned something useless like time of day. This story is often told to warn about the limits of algorithms and importance of data collection to avoid “dataset bias”/​“data leakage” where the collected data can be solved using algorithms that do not generalize to the true data distribution, but the tank story is usually never sourced.

I collate many extent versions dating back a quarter of a century to 1992 along with two NN-related anecdotes from the 1960s; their contradictions & details indicate a classic “urban legend”, with a probable origin in a speculative question in the 1960s by Edward Fredkin at an AI conference about some early NN research, which was then classified & never followed up on.

I suggest that dataset bias is real but exaggerated by the tank story, giving a misleading indication of risks from deep learning and that it would be better to not repeat it but use real examples of dataset bias and focus on larger-scale risks like AI systems optimizing for wrong utility functions.

“Stigler’s Diet Problem Revisited”, Garille & Gass 2001

2001-garille.pdf: “Stigler’s Diet Problem Revisited”⁠, Susan Garner Garille, Saul I. Gass (2001-01-01; ; backlinks)

“A Sociological Study of the Official History of the Perceptrons Controversy [1996]”, Olazaran 1996

1996-olazaran.pdf: “A Sociological Study of the Official History of the Perceptrons Controversy [1996]”⁠, Mikel Olazaran (1996-08-01; backlinks):

[earlier longer version] In this paper, I analyze the controversy within Artificial Intelligence (AI) which surrounded the ‘perceptron’ project (and neural nets in general) in the late 1950s and early 1960s.

I devote particular attention to the proofs and arguments of Minsky and Papert, which were interpreted as showing that further progress in neural nets was not possible, and that this approach to AI had to be abandoned. I maintain that this official interpretation of the debate was a result of the emergence, institutionalization and (importantly) legitimation of the symbolic AI approach (with its resource allocation system and authority structure). At the ‘research-area’ level, there was considerable interpretative flexibility.

This interpretative flexibility was further demonstrated by the revival of neural nets in the late 1980s, and subsequent rewriting of the official history of the debate.

“Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra”, Goodacre et al 1996

1996-goodacre.pdf: “Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra”⁠, Royston Goodacre, Mark J. Neal, Douglas B. Kell (1996-08-01; backlinks; similar):

The implementation of artificial neural networks (ANNs) to the analysis of multivariate data is reviewed, with particular reference to the analysis of pyrolysis mass spectra⁠. The need for and benefits of multivariate data analysis are explained followed by a discussion of ANNs and their optimisation.

Finally, an example of the use of ANNs for the quantitative deconvolution of the pyrolysis mass spectra of Staphylococcus aureus mixed with Escherichia coli is demonstrated.

“A Sociological Study of the Official History of the Perceptrons Controversy [1993]”, Olazaran 1993

1993-olazaran.pdf: “A Sociological Study of the Official History of the Perceptrons Controversy [1993]”⁠, Mikel Olazaran (1993-08; ; backlinks):

This chapter discusses the scientific controversies that have shaped neural network research from a sociological point of view.

It looks at the controversy that surrounded Frank Rosenblatt’s perceptron machine in the late 1950s and early 1960s. Rosenblatt was well aware of the main problems of his machine, and that he even insisted on them in his books and papers. Emphasis is given on one of the main problems of early neural network research, namely the issue of training multilayer systems⁠.

In the middle of the perceptron controversy, Minsky and Papert embarked on a project aimed at showing the limitations of Rosenblatt’s perceptron beyond doubt.

The chapter analyzes the main results of that project, and shows that Minsky and Papert, and neural network researchers interpreted those results rather differently. It discusses the processes through which this interpretative flexibility was closed and the effects that the crisis of early neural network research had upon the 3 most important neural network groups of the time, namely Widrow’s group, Rosenblatt’s group, and the group at SRI⁠.

The chapter also looks at the influence that factors like the emergence of symbolic artificial intelligence (AI) and computer technology had on the closure of the neural network controversy. After the closure of the perceptron controversy, symbol-processing remained the dominant approach to AI over the years, until the early 1980s. Some of the most important aspects of that changing context are reviewed and the history of backpropagation is discussed.

  1. Introduction: A Sociological View of Scientific Controversies

  2. The Controversy of the Perceptron

  3. The Problems of Early Neural Networks

  4. Training Multilayer Networks: A “Reverse Salient” of Neural Network Research

  5. Interpretative Flexibility

  6. Closure of the Controversy 1: Widrow’s Group

  7. Closure of the Controversy 2: The SRI Group

  8. Closure of the Controversy 3: Rosenblatt

  9. The 1980s: A Changing Context

  10. History of Back-Propagation

  11. Back-Propagation: Learning in Multilayer Perceptrons

  12. The Neural Network Explosion

  13. The Current Debate: Conclusions

    1. Debate Continues
    2. Conclusions
  14. Appendix 1: List of Those Interviewed

  15. Appendix 2: List of Personal Communications by Letter

[lengthier version in Olazaran 1996⁠; cf. “Did Frank Rosenblatt invent deep learning in 1962?”⁠; Schmidhuber’s history of DL⁠.

The author Mikel Olazaran spent a long time in the early 1990s interviewing what looks have been almost all the surviving connectionists & Minsky etc.

Olazaran argues that all the connectionists were perfectly aware of the Perceptrons headline conclusion about single-layer perceptrons being hopelessly linear, which drafts had been circulating for like 4 years beforehand as well, and most regarded it as unimportant (pointing out that humans can’t solve the parity of a grid of dots either without painfully counting them out one by one) and having an obvious solution (multiple layers) that they all, Rosenblatt especially, had put a lot of work into trying.

The problem was, none of the multi-layer things worked, and people had run out of ideas. So, most of the connectionist researchers got sucked away by things that were working at the time (eg. the Stanford group was having huge success with adaptive antennas & telephone filters which accidentally come out of their NN work), and funding dried up (for both exogenous political reasons related to military R&D being cut, and just the lack of results compared to alternative research programs like the symbolic approaches which were enjoying their initial flush of success in theorem proving and Samuel’s checkers player etc, and had not run headlong into the wall of Moravec’s paradox).

So when, years later, Perceptrons came out with all of its i’s dotted & t’s-crossed, it didn’t “kill connectionism” because that had already died. What Perceptrons really did was it served as a kind of excuse or Schelling point to make the death ‘official’ & cement the dominance of the symbolic approaches. Rosenblatt never gave up, but he had already been left high and dry with no more funding and no research community.

Olazaran directly asks several of them whether more funding or work would have helped, and it seems everyone agrees that it would’ve been useless. The computers just weren’t there in the ’60s. (One notes that it might have worked in the ’70s if anyone had paid attention to the invention of backpropagation, pointing out that Rumelhart et al doing the PDP studies on backprop were using the equivalent of PCs for those studies in the late ’80s, so if you were patient & dedicated, you could hypothetically have done them on minicomputers/​mainframes in the ’70s. But not the ’60s.)]