AI/​RNN directory


“Simple Recurrence Improves Masked Language Models”, Lei et al 2022

“Simple Recurrence Improves Masked Language Models”⁠, Tao Lei, Ran Tian, Jasmijn Bastings, Ankur P. Parikh (2022-05-23):

In this work, we explore whether modeling recurrence into the Transformer architecture can both be beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT⁠. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while keeping the number of parameters constant. For example, our base model achieves an absolute improvement of 2.1 points averaged across 10 tasks and also demonstrates increased stability in fine-tuning over a range of learning rates.

“Sequencer: Deep LSTM for Image Classification”, Tatsunami & Taki 2022

“Sequencer: Deep LSTM for Image Classification”⁠, Yuki Tatsunami, Masato Taki (2022-05-04):

In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K. Not only that, we show that it has good transferability and the robust resolution adaptability on double resolution-band.

“Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Chan et al 2022

“Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers”, Stephanie C. Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond et al (2022-04-22):

Large transformer-based language models are able to perform few-shot learning (also known as in-context learning), without having been explicitly trained for it.

We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon, as these characteristics might lead to a kind of interpolation between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). We also hypothesized that these distributional properties could lead to emergent few-shot learning in domains outside of language. Inspired by this idea, we ran a series of experiments on a standard image-based few-shot dataset.

We discovered that a number of data properties did indeed promote the emergence of few-shot learning in transformer models. All of these properties are present in natural language—burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data influenced whether models were biased towards either few-shot learning vs. memorizing information in their weights; models could generally perform well at only one or the other. However, we discovered that an additional distributional property could allow the two capabilities to co-exist in the same model—a skewed, Zipfian distribution over classes—which occurs in language as well. Notably, training data that could elicit few-shot learning in transformers were unable to elicit few-shot learning in recurrent models.

In sum, we find that few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own.

“Learning by Directional Gradient Descent”, Silver et al 2022

“Learning by Directional Gradient Descent”, David Silver, Anirudh Goyal, Ivo Danihelka, Matteo Hessel, Hado van Hasselt (2022-02-17):

How should state be constructed from a sequence of observations, so as to best achieve some objective? Most deep learning methods update the parameters of the state representation by gradient descent. However, no prior method for computing the gradient is fully satisfactory, for example consuming too much memory, introducing too much variance⁠, or adding too much bias. In this work, we propose a new learning algorithm that addresses these limitations. The basic idea is to update the parameters of the representation by using the directional derivative along a candidate direction, a quantity that may be computed online with the same computational cost as the representation itself. We consider several different choices of candidate direction, including random selection and approximations to the true gradient, and investigate their performance on several synthetic tasks.

[Keywords: credit assignment, directional derivative, recurrent networks]
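The core idea in the abstract, updating along a candidate direction using a directional derivative computed online, can be illustrated with a toy example. The sketch below is not the authors' implementation: the two-parameter scalar RNN, the squared-error loss, and all function names are assumptions chosen for illustration. It propagates a tangent `dh` alongside the hidden state `h` in a single forward pass, then scales a random direction `v` by the resulting directional derivative.

```python
import math
import random


def forward_with_tangent(theta, v, xs):
    """Run h_t = tanh(w*h_{t-1} + u*x_t) and, in the same pass, the tangent
    dh_t = d h_t / d(alpha) for parameters theta + alpha * v at alpha = 0."""
    w, u = theta
    vw, vu = v
    h, dh = 0.0, 0.0
    for x in xs:
        pre = w * h + u * x
        dpre = vw * h + w * dh + vu * x  # uses h_{t-1} and dh_{t-1}
        h = math.tanh(pre)
        dh = (1.0 - h * h) * dpre
    return h, dh


def loss(theta, xs, y):
    """Hypothetical objective: squared error on the final hidden state."""
    h, _ = forward_with_tangent(theta, (0.0, 0.0), xs)
    return 0.5 * (h - y) ** 2


def directional_update(theta, xs, y, lr, rng):
    """One update along a random candidate direction v.
    For v ~ N(0, I), (grad . v) * v equals the true gradient in expectation."""
    v = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    h, dh = forward_with_tangent(theta, v, xs)
    dir_deriv = (h - y) * dh  # chain rule through the loss
    return tuple(p - lr * dir_deriv * vi for p, vi in zip(theta, v))


theta = (0.5, -0.3)
xs = [0.1, -0.4, 0.7, 0.2]
new_theta = directional_update(theta, xs, y=0.2, lr=0.1, rng=random.Random(0))
```

The tangent costs roughly as much as the forward pass itself and needs no stored history, which is the property the abstract highlights; the price is the variance introduced by the random direction.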

“Active Predictive Coding Networks: A Neural Solution to the Problem of Learning Reference Frames and Part-Whole Hierarchies”, Gklezakos & Rao 2022

“Active Predictive Coding Networks: A Neural Solution to the Problem of Learning Reference Frames and Part-Whole Hierarchies”, Dimitrios C. Gklezakos, Rajesh P. N. Rao (2022-01-21):

We introduce Active Predictive Coding Networks (APCNs), a new class of neural networks that solve a major problem posed by Hinton and others in the fields of artificial intelligence and brain modeling: how can neural networks learn intrinsic reference frames for objects and parse visual scenes into part-whole hierarchies by dynamically allocating nodes in a parse tree? APCNs address this problem by using a novel combination of ideas: (1) hypernetworks are used for dynamically generating recurrent neural networks that predict parts and their locations within intrinsic reference frames conditioned on higher object-level embedding vectors, and (2) reinforcement learning is used in conjunction with backpropagation for end-to-end learning of model parameters. The APCN architecture lends itself naturally to multilevel hierarchical learning and is closely related to predictive coding models of cortical function. Using the MNIST, Fashion-MNIST and Omniglot datasets, we demonstrate that APCNs can (a) learn to parse images into part-whole hierarchies, (b) learn compositional representations, and (c) transfer their knowledge to unseen classes of objects. With their ability to dynamically generate parse trees with part locations for objects, APCNs offer a new framework for explainable AI that leverages advances in deep learning while retaining interpretability and compositionality.

“Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild”, Miki et al 2022

2022-miki.pdf: “Learning robust perceptive locomotion for quadrupedal robots in the wild”, Takahiro Miki, Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, Marco Hutter (2022-01-19):

[homepage⁠; video⁠; cf. Dactyl⁠, Hwangbo et al 2019⁠/​Rudin et al 2021] Legged robots that can operate autonomously in remote and hazardous environments will greatly increase opportunities for exploration into underexplored areas.

Exteroceptive perception is crucial for fast and energy-efficient locomotion: Perceiving the terrain before making contact with it enables planning and adaptation of the gait ahead of time to maintain speed and stability. However, using exteroceptive perception robustly for locomotion has remained a grand challenge in robotics. Snow, vegetation, and water visually appear as obstacles on which the robot cannot step or are missing altogether due to high reflectance. In addition, depth perception can degrade due to difficult lighting, dust, fog, reflective or transparent surfaces, sensor occlusion, and more. For this reason, the most robust and general solutions to legged locomotion to date rely solely on proprioception. This severely limits locomotion speed because the robot has to physically feel out the terrain before adapting its gait accordingly.

Here, we present a robust and general solution to integrating exteroceptive and proprioceptive perception for legged locomotion [in ANYmal robots]. We leverage an attention-based recurrent encoder that integrates proprioceptive and exteroceptive input [using privileged learning⁠, the simulator as oracle, then training the RNN to infer the POMDP & meta-learn at runtime to adapt to changing environments]. The encoder is trained end to end and learns to seamlessly combine the different perception modalities without resorting to heuristics. The result is a legged locomotion controller with high robustness and speed.

The controller was tested in a variety of challenging natural and urban environments over multiple seasons and completed an hour-long hike in the Alps in the time recommended for human hikers.

DARPA Subterranean Challenge: [video 1⁠, 2] Our controller was used as the default controller in the DARPA Subterranean Challenge missions of team CERBERUS which has won the first prize in the finals (Results). In this challenge, our controller drove ANYmals to operate autonomously over extended periods of time in underground environments with rough terrain, obstructions, and degraded sensing in the presence of dust, fog, water, and smoke. Our controller played a crucial role as it enabled 4 ANYmals to explore over 1,700m in all 3 types of courses—tunnel, urban, and cave—without a single fall.

“Evaluating Distributional Distortion in Neural Language Modeling”, Anonymous 2021

“Evaluating Distributional Distortion in Neural Language Modeling”, Anonymous (2021-11-16):

A fundamental characteristic of natural language is the high rate at which speakers produce novel expressions. Because of this novelty, a heavy-tail of rare events accounts for a significant amount of the total probability mass of distributions in language (Baayen, 2001). Standard language modeling metrics such as perplexity quantify performance of language models (LM) in aggregate. As a result, we have relatively little understanding of whether neural LMs accurately estimate the probability of sequences in this heavy-tail of rare events. To address this gap, we develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages from which we can exactly compute sequence probabilities. Training LMs on generations from these artificial languages, we compare the sequence-level probability estimates given by LMs to the true probabilities in the target language. Our experiments reveal that LSTM and Transformer language models (1) systematically underestimate the probability of sequences drawn from the target language, and (2) do so more severely for less-probable sequences. Investigating where this probability mass went, (3) we find that LMs tend to overestimate the probability of ill formed (perturbed) sequences. In addition, we find that this underestimation behaviour (4) is weakened, but not eliminated by greater amounts of training data, and (5) is exacerbated for target distributions with lower entropy⁠.

“Gradients Are Not All You Need”, Metz et al 2021

“Gradients are Not All You Need”, Luke Metz, C. Daniel Freeman, Samuel S. Schoenholz, Tal Kachman (2021-11-10):

Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation based optimization algorithms.
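A small illustration of why the Jacobian spectrum matters: when a system is iterated, the gradient of the final state with respect to the initial state is a product of per-step Jacobians, so it grows or shrinks geometrically with the number of steps. The sketch below uses the logistic map as a stand-in system; this choice and the function name are assumptions for illustration and are not taken from the report.

```python
def logistic_gradient(r, x0, steps):
    """Iterate x <- r * x * (1 - x) and accumulate d x_T / d x_0
    as the product of the one-step Jacobians r * (1 - 2x)."""
    x, grad = x0, 1.0
    for _ in range(steps):
        grad *= r * (1.0 - 2.0 * x)  # Jacobian evaluated at the current state
        x = r * x * (1.0 - x)
    return x, grad


# r = 4.0 is a chaotic regime: the gradient magnitude grows with the horizon.
_, grad_chaotic = logistic_gradient(4.0, 0.2, 10)
# r = 2.5 converges to a stable fixed point: the gradient magnitude decays.
_, grad_stable = logistic_gradient(2.5, 0.2, 10)
```

In the chaotic regime the one-step Jacobians have magnitude above 1 on average, so backpropagated gradients explode even though every state stays bounded in [0, 1]; in the stable regime they contract toward zero.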

“An Explanation of In-context Learning As Implicit Bayesian Inference”, Xie et al 2021

“An Explanation of In-context Learning as Implicit Bayesian Inference”, Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma (2021-11-03):

Large pretrained language models such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. Without being explicitly pretrained to do so, the language model learns from these examples during its forward pass without parameter updates on “out-of-distribution” prompts. Thus, it is unclear what mechanism enables in-context learning.

In this paper, we study the role of the pretraining distribution on the emergence of in-context learning under a mathematical setting where the pretraining texts have long-range coherence. Here, language model pretraining requires inferring a latent document-level concept from the conditioning text to generate coherent next tokens. At test time, this mechanism enables in-context learning by inferring the shared latent concept between prompt examples and applying it to make a prediction on the test example.

Concretely, we prove that in-context learning occurs implicitly via Bayesian inference of the latent concept when the pretraining distribution is a mixture of HMMs⁠. This can occur despite the distribution mismatch between prompts and pretraining data. In contrast to messy large-scale pretraining datasets for in-context learning in natural language, we generate a family of small-scale synthetic datasets (GINC) where Transformer and LSTM language models both exhibit in-context learning.

Beyond the theory which focuses on the effect of the pretraining distribution, we empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.

“S4: Efficiently Modeling Long Sequences With Structured State Spaces”, Gu et al 2021

“S4: Efficiently Modeling Long Sequences with Structured State Spaces”, Albert Gu, Karan Goel, Christopher Ré (2021-10-31):

[cf. LSSL⁠, HiPPO⁠; Github (example); talk⁠; explainer] A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps.

A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) x’(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), and showed that for appropriate choices of the state matrix A, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence (S4) model based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning A with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (1) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (2) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster, and (3) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.


[I find it easiest to think of it as a “super RNN”: an RNN with all the long-term dependency and vanishing gradient issues fixed. My best TLDR for why it works:

  1. It’s like a linear RNN with an N-dimensional hidden state x_t.

  2. The key is to initialize and parameterize the RNN in a very special way.

  3. This makes x_t evolve in a special way: each x_t lets you reconstruct all past inputs u_0, u_1, …, u_t with high accuracy.

    IIUC, just being able to “memorize” like this is apparently enough to break SOTA on Long Range Arena.

  4. And with the special initialization, the RNN’s parameter matrix is so simple that you can compute a very large number of time steps entirely in parallel, using FFT⁠. FFT is the key computational trick; the other part is initializing with a matrix that is “almost diagonal” and therefore easy to work with.

TLDR: RNNs can be really really good if you parameterize them the right way.

Narendra Patwardhan:

Transformers would perform better than S4 (in its current form) on any task which can’t be easily expressed as a simple differential equation such as language modeling, question answering, object detection⁠, image segmentation etc.]
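The "linear RNN computed in parallel with FFT" point in the commentary above can be sketched as follows. This is only an illustration of the recurrence/convolution equivalence for a discrete linear state-space model, under the assumption that the discretized matrices are already given. It does not implement S4's special initialization, its low-rank correction, or its Cauchy-kernel computation, and the matrices and function names are made up for the example.

```python
import numpy as np


def ssm_recurrent(A, B, C, u):
    """RNN view: x_k = A x_{k-1} + B u_k, y_k = C x_k, one step at a time."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)


def ssm_kernel(A, B, C, length):
    """Convolution kernel K = (CB, CAB, CA^2 B, ...) of the same system."""
    K = np.empty(length)
    v = B.copy()
    for k in range(length):
        K[k] = C @ v
        v = A @ v
    return K


def ssm_convolutional(A, B, C, u):
    """Convolution view: y = K * u, computed for all steps at once via FFT."""
    L = len(u)
    K = ssm_kernel(A, B, C, L)
    n = 2 * L  # zero-pad so the circular convolution equals the causal one
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)
    return y[:L]


# Toy, nearly diagonal, stable state matrix (illustrative values only).
rng = np.random.default_rng(0)
A = np.diag([0.9, 0.5, -0.3, 0.1]) + 0.01
B = np.ones(4)
C = rng.standard_normal(4)
u = rng.standard_normal(64)
y_rnn = ssm_recurrent(A, B, C, u)
y_fft = ssm_convolutional(A, B, C, u)
```

Both functions produce the same outputs up to floating-point error. The naive kernel construction here still takes `length` sequential matrix-vector products; per the abstract, S4's contribution is a parameterization that makes that kernel cheap to compute.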

“A Connectome of the Drosophila Central Complex Reveals Network Motifs Suitable for Flexible Navigation and Context-dependent Action Selection”, Hulse et al 2021

“A connectome of the Drosophila central complex reveals network motifs suitable for flexible navigation and context-dependent action selection”, Brad K. Hulse, Hannah Haberkern, Romain Franconville, Daniel B. Turner-Evans, Shin-ya Takemura, Tanya Wolff et al (2021-10-26):

[media] Flexible behaviors over long timescales are thought to engage recurrent neural networks in deep brain regions, which are experimentally challenging to study. In insects, recurrent circuit dynamics in a brain region called the central complex (CX) enable directed locomotion, sleep, and context/​experience-dependent spatial navigation.

We describe the first complete electron-microscopy-based connectome of the Drosophila CX, including all its neurons and circuits at synaptic resolution.

We identified new CX neuron types, novel sensory and motor pathways, and network motifs that likely enable the CX to extract the fly’s head-direction, maintain it with attractor dynamics, and combine it with other sensorimotor information to perform vector-based navigational computations. We also identified numerous pathways that may facilitate the selection of CX-driven behavioral patterns by context and internal state. The CX connectome provides a comprehensive blueprint necessary for a detailed understanding of network dynamics underlying sleep, flexible navigation, and state-dependent action selection.

…Here we analyzed the arborizations and connectivity of the ~3,000 CX neurons in version 1.1 of the ‘hemibrain’ connectome—a dataset with 25,000 semi-automatically reconstructed neurons and 20 million synapses from the central brain of a 5-day-old female fly (Scheffer et al 2020) (see Materials and Methods).

EM circuit reconstruction: how complete is complete enough? The value of EM-level connectomes in understanding the function of neural circuits in small and large brains is widely appreciated (Abbott et al 2020; Litwin-Kumar & Turaga 2019; Schlegel et al 2017). Although recent technical advances have made it possible to acquire larger EM volumes (Scheffer et al 2020; Zheng et al 2018) and improvements in machine learning have enabled high-throughput reconstruction of larger neural circuits (Dorkenwald et al 2020; Januszewski et al 2018), the step from acquiring a volume to obtaining a complete connectome still requires considerable human proofreading and tracing effort (Scheffer et al 2020).

As part of our analysis of the CX connectome, we found that although increased proofreading led to an expected increase in the number of synaptic connections between neurons, it did not necessarily lead to substantial changes in the relative weight of connections between different neuron types (Figures 3–4). While it is important to note that we made comparisons between the hemibrain connectome at fairly advanced stages of proofreading in the CX, our results do suggest that it may be possible to obtain an accurate picture of neural circuit connectivity from incomplete reconstructions. It may be useful for future large scale connectomics efforts to incorporate similar validation steps of smaller sample volumes into reconstruction pipelines to determine appropriate trade-offs between accuracy and cost of proofreading.

“LSSL: Combining Recurrent, Convolutional, and Continuous-time Models With Linear State-Space Layers”, Gu et al 2021

“LSSL: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers”, Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré (2021-10-26):

[cf. S4⁠, HiPPO] Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency.

We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence u ↦ y by simply simulating a linear continuous-time state-space representation ẋ = Ax + Bu, y = Cx + Du. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices A that endow LSSLs with long-range memory.

Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100× shorter sequences.
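To make "simulating a linear continuous-time state-space representation" concrete, here is a minimal sketch that discretizes ẋ = Ax + Bu with the bilinear (trapezoidal) rule and then steps the resulting linear recurrence. The bilinear rule is one standard discretization, chosen here as an assumption; the abstract does not specify the paper's exact scheme, and the function names and the scalar test system are illustrative only.

```python
import numpy as np


def discretize_bilinear(A, B, dt):
    """Bilinear discretization of x' = A x + B u with step size dt:
    A_bar = (I - dt/2 A)^{-1} (I + dt/2 A),  B_bar = (I - dt/2 A)^{-1} dt B."""
    n = A.shape[0]
    I = np.eye(n)
    left = I - (dt / 2.0) * A
    A_bar = np.linalg.solve(left, I + (dt / 2.0) * A)
    B_bar = np.linalg.solve(left, dt * B)
    return A_bar, B_bar


def simulate(A, B, C, D, u, dt):
    """Map an input sequence u to outputs y via the discretized recurrence
    x_k = A_bar x_{k-1} + B_bar u_k,  y_k = C x_k + D u_k."""
    A_bar, B_bar = discretize_bilinear(A, B, dt)
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append(C @ x + D * u_k)
    return np.array(ys)


# Scalar sanity check: x' = -x + u with constant input u = 1 has the exact
# solution x(t) = 1 - exp(-t), so after 100 steps of dt = 0.01 we expect
# a value close to 1 - exp(-1).
A = np.array([[-1.0]])
B = np.array([1.0])
C = np.array([1.0])
y = simulate(A, B, C, 0.0, np.ones(100), dt=0.01)
```

The step size `dt` plays the role of a time scale: a smaller `dt` means each input step moves the continuous system less, which connects to the abstract's remark about time-scale adaptation.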

“Recurrent Model-Free RL Is a Strong Baseline for Many POMDPs”, Ni et al 2021

“Recurrent Model-Free RL is a Strong Baseline for Many POMDPs”, Tianwei Ni, Benjamin Eysenbach, Ruslan Salakhutdinov (2021-10-11):

Many problems in RL, such as meta RL, robust RL, and generalization in RL can be cast as POMDPs. In theory, simply augmenting model-free RL with memory, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisions yield a recurrent model-free implementation that performs on par with (and occasionally substantially better than) more sophisticated recent techniques in their respective domains. We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs. Code is available at

“Unbiased Gradient Estimation in Unrolled Computation Graphs With Persistent Evolution Strategies”, Vicol et al 2021

“Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies”, Paul Vicol, Luke Metz, Jascha Sohl-Dickstein (2021-07-01):

[supplement⁠; poster⁠; code⁠; Colab] Unrolled computation graphs arise in many scenarios, including training RNNs, tuning hyperparameters through unrolled optimization, and training learned optimizers. Current approaches to optimizing parameters in such computation graphs suffer from high variance gradients, bias, slow updates, or large memory usage.

We introduce a method called Persistent Evolution Strategies (PES), which divides the computation graph into a series of truncated unrolls, and performs an evolution strategies-based (Rechenberg 1973⁠; Nesterov & Spokoiny 2017) update step after each unroll. PES eliminates bias from these truncations by accumulating correction terms over the entire sequence of unrolls. PES allows for rapid parameter updates, has low memory usage, is unbiased, and has reasonable variance characteristics.

We experimentally demonstrate the advantages of PES compared to several other methods for gradient estimation on synthetic tasks, and show its applicability to training learned optimizers and tuning hyperparameters.

  • …We introduce a method called Persistent Evolution Strategies (PES) to obtain unbiased gradient estimates for the parameters of an unrolled system from partial unrolls of the system.
  • We prove that PES is an unbiased gradient estimate for a smoothed version of the loss, and an unbiased estimate of the true gradient for quadratic losses. We provide theoretical and empirical analyses of its variance.
  • We demonstrate the applicability of PES in several illustrative scenarios: (1) we apply PES to tune hyperparameters including learning rates and momentums, by estimating hypergradients through partial unrolls of optimization algorithms; (2) we use PES to meta-train a learned optimizer; (3) we use PES to learn policy parameters for a continuous control task

[cf. forward gradients, antithetic sampling, Direct Feedback Alignment, Silver et al 2021, RTRL.]
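The PES procedure described above (truncated unrolls, an evolution-strategies update after each unroll, and correction terms accumulated over the whole sequence) can be sketched on a toy system. Everything specific here is an assumption for illustration: the system s ← s + θ with per-unroll loss s, the truncation length of one step, the antithetic pairing, and the function names. The estimator form, the sum of accumulated perturbation times loss divided by Nσ², reflects my reading of the method and should be checked against the paper.

```python
import random


def unroll(state, theta):
    """One truncated unroll (a single step) of a toy system:
    s <- s + theta, with the loss of the unroll equal to the new state."""
    new_state = state + theta
    return new_state, new_state


def pes_estimates(theta, num_unrolls, num_pairs, sigma, seed=0):
    """Return one gradient estimate per truncated unroll.
    Each particle keeps its own state and its accumulated perturbation xi."""
    rng = random.Random(seed)
    n = 2 * num_pairs
    states = [0.0] * n
    xi = [0.0] * n
    estimates = []
    for _ in range(num_unrolls):
        g = 0.0
        for p in range(num_pairs):
            eps = rng.gauss(0.0, sigma)
            # Antithetic pair: the same perturbation with opposite signs.
            for i, e in ((2 * p, eps), (2 * p + 1, -eps)):
                states[i], particle_loss = unroll(states[i], theta + e)
                xi[i] += e  # accumulate over the whole sequence of unrolls
                g += xi[i] * particle_loss
        estimates.append(g / (n * sigma ** 2))
    return estimates


estimates = pes_estimates(theta=0.5, num_unrolls=10, num_pairs=2000, sigma=0.1)
total = sum(estimates)
```

In this toy system the loss after unroll t is t·θ, so its true gradient is t and the ten per-unroll gradients sum to 55. Using only the current perturbation instead of the accumulated `xi` would credit each unroll with a gradient of about 1, which is the truncation bias that the accumulated correction terms are meant to remove.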

“Shelley: A Crowd-sourced Collaborative Horror Writer”, Delul et al 2021

2021-delul.pdf: “Shelley: A Crowd-sourced Collaborative Horror Writer”, Pinar Yanardag Delul, Manuel Cebrian, Iyad Rahwan (2021-06-15):

In this work, we propose a deep-learning based collaborative horror writer [RNN] that collaboratively writes scary stories with people on Twitter. We deploy our system [in October 2017] as a bot on Twitter that regularly generates and posts new stories on Twitter, and invites users to participate. Users who interact with the stories produce multiple storylines originating from the same tweet, thereby creating a tree-based story structure.

We further perform a validation study on n = 105 subjects to verify whether the generated stories psychologically move people on psychometrically validated measures of affect and anxiety such as I-PANAS-SF [43] and STAI-SF [26]. Our experiments show that (1) stories generated by our bot as well as the stories generated collaboratively between our bot and Twitter users produced statistically-significant increases in negative affect and state anxiety compared to the control condition, and (2) collaborated stories are more successful in terms of increasing negative affect and state anxiety than the machine-generated ones. [This claim does not seem to be supported by their reported statistics in Section 4…]

Furthermore, we make 3 novel datasets used in our framework publicly available at GitHub for encouraging further research on this topic.

“Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Jouppi et al 2021

2021-jouppi.pdf: “Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon et al (2021-06-14):

Google deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSAs); target total cost of ownership vs initial cost; support multi-tenancy; deep neural networks (DNN) grow 1.5× annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size; and backwards ML compatibility helps deploy DNNs quickly. These lessons molded TPUv4i, an inference DSA deployed since 2020.

Table 1: Key characteristics of DSAs. The underlines show changes over the prior TPU generation, from left to right. System TDP includes power for the DSA memory system plus its share of the server host power, eg. add host TDP⁄8 for 8 DSAs per host.

  • Document the unequal improvement in logic, wires, SRAM, and DRAM from 45 nm to 7 nm (including an update of Horowitz’s operation energy table [16] from 45 nm to 7 nm) and show how these changes led to 4 systolic floating point matrix units for TPUv4i in 2020 versus one systolic integer matrix unit for TPUv1 in 2015.
  • Explain the difference between designing for performance per TCO vs per CapEx, leading to HBM and a low TDP for TPUv4i, and show how TPUv1’s headroom led to application scaleup after the 2017 paper [21].
  • Explain backwards ML compatibility, including why inference can need floating point and how it spurred the TPUv4i and TPUv4 designs (§3). Backwards ML compatible training also tailors DNNs to TPUv4i (§2).
  • Measure production inference applications to show that DSAs normally run multiple DNNs concurrently, requiring Google inference DSAs to support multi-tenancy.
  • Discuss how DNN advances change the production inference workload. The 2020 workload keeps MLP and CNN from 2017 but adds BERT, and RNN succeeds LSTM.
  • Document the growth of production DNNs in memory size and computation by ~1.5× annually since 2016, which encourages designing DSAs with headroom.
  • Show that Google’s TCO and TDP for DNN DSAs are strongly correlated (r = 0.99), likely due to the end of Dennard scaling. TDP offers a good proxy for DSA TCO.
  • Document that the SLO limit is P99 time for inference applications, list typical batch sizes, and show how large on-chip SRAM helps P99 performance.
  • Explain why TPUv4i architects chose compiler compatibility over binary compatibility for its VLIW ISA.
  • Describe Google’s latest inference accelerator in production since March 2020 and evaluate its performance/​TDP vs. TPUv3 and NVIDIA’s T4 inference GPU using production apps and MLPerf Inference benchmarks 0.5–0.7.

…TPUv1 required quantization—since it supported only integer arithmetic—which proved a problem for some datacenter applications. Early in TPUv1 development, application developers said a 1% drop in quality was acceptable, but they changed their minds by the time the hardware arrived, perhaps because DNN overall quality improved so that 1% added to a 40% error was relatively small but 1% added to a 12% error was relatively large.

…Alas, DNN DSA designers often ignore multi-tenancy. Indeed, multi-tenancy is not mentioned in the TPUv1 paper [21]. (It was lucky that the smallest available DDR3 DRAM held 8GB, allowing TPUv1 software to add multi-tenancy.)

…BERT appeared in 2018, yet it’s already 28% of the workload.

“RASP: Thinking Like Transformers”, Weiss et al 2021

“RASP: Thinking Like Transformers”, Gail Weiss, Yoav Goldberg, Eran Yahav (2021-06-13):

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel.

In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder—attention and feed-forward computation—into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP).

We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. [independent examples of hand-coding Transformers]

We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.

“Scaling Laws for Acoustic Models”, Droppo & Elibol 2021

“Scaling Laws for Acoustic Models”, Jasha Droppo, Oguz Elibol (2021-06-11):

There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training data, model parameter count, or training computation budget. In this paper, we demonstrate that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws. We extend previous work to jointly predict loss due to model size, to training set size, and to the inherent “irreducible loss” of the task. We find that the scaling laws accurately match model performance over 2 orders of magnitude in both model size and training set size, and make predictions about the limits of model performance.

…The context module is a sequence-to-sequence model that converts the encoded input sequence into a sequence of context vectors. We experiment with 2 different designs for the context model: the LSTM and the Transformer. To maintain causality, the LSTMs are uni-directional and the Transformer uses masking to prevent the network from using future frames…All acoustic data used in this paper was drawn from a 23,000 hour corpus of untranscribed, de-identified, far-field, English voice command and voice query speech collected from home environments [Alexa?]. This data is presented to the network as a series of log-Mel frequency filterbank feature vectors, at a rate of 100 vectors per second of audio. Although this data is not publicly available, the authors believe that the phenomena described in this paper should apply to any similar set of speech recordings.

Figure 5: Development set loss for both LSTM and Transformer models for models with the indicated number of layers. The dashed line represents the computationally efficient frontier defined in Equation 4.

When a model reaches L(C), it means that a different model with enough capacity, but with fewer parameters, would need more computation and more data to reach the same loss value. Alternatively, a model with more parameters would need more computation and less data to reach the same loss value.

Where curves for 2 experiments meet, it is an indication that the same amount of compute can reach the given loss value through 2 different methods. One can either use more parameters and fewer data, or use fewer parameters and more data.

The constant L is 0.306 in both figures. This represents a shared asymptote between the LSTM and Transformer systems, which will never be surpassed, regardless of the computational or data budget. The fact that the same asymptote applies to both systems hints that irreducible loss is indeed a fundamental property of the data and not the model. Additionally, this constant is similar to the value found in Section 3.1. The authors suspect that the constants should be identical, but our precision in measuring it is limited.

The LSTM models exhibit a compute-efficient frontier with a slope of −0.167. A doubling of computation yields a 10.9% reduction in objective function. A halving of objective function would come with a 63.5× increase in computation. The slope of the compute-efficient frontier for Transformer models is −0.197. When computation is increased by a factor of r, then the reducible loss will be changed by a factor of r^−0.197. At that rate, a doubling of computation yields a 12.7% reduction in objective function. A halving of objective function would come with a 33.7× increase in computation. [These results are consistent with LSTMs vs Transformers on text, and would probably be more impressive if acoustic modeling wasn’t so close to the irreducible loss (ie. solved).]
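The arithmetic behind these figures can be checked with a short sketch. It assumes only what the passage states: reducible loss changes by a factor of r^slope when compute grows by a factor of r. The function names are illustrative, not from the paper.

```python
def loss_factor(compute_factor: float, slope: float) -> float:
    """Factor by which reducible loss changes when compute is multiplied by compute_factor."""
    return compute_factor ** slope


def compute_factor_for(target_loss_factor: float, slope: float) -> float:
    """Compute multiplier needed to scale reducible loss by target_loss_factor."""
    # Invert loss_factor: r ** slope = target  =>  r = target ** (1 / slope)
    return target_loss_factor ** (1.0 / slope)


if __name__ == "__main__":
    # Slopes quoted in the passage above.
    for name, slope in [("LSTM", -0.167), ("Transformer", -0.197)]:
        reduction = 1.0 - loss_factor(2.0, slope)
        halving_cost = compute_factor_for(0.5, slope)
        print(f"{name}: doubling compute cuts reducible loss by {reduction:.1%}; "
              f"halving it needs about {halving_cost:.1f}x compute")
```

With the rounded slopes, the results agree with the quoted percentages and multipliers to within rounding.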

The difference in slope between the LSTM and Transformer experiments indicates that the Transformer architecture makes more efficient use of increased model parameters and increased training data. Although LSTM is superior to transformer at smaller model sizes, as the model size grows, and these trends continue, the transformer will eventually be more efficient.

Finally, the experimental data show that larger models learn more quickly from the same amount of data. Each of the points plotted in Figure 5 represents the consumption of an additional 25,000 minibatches of training data. At the first point, second, or third, each model has processed the same data, but the larger models have achieved better accuracy on the held-out development set.

“Scaling End-to-End Models for Large-Scale Multilingual ASR”, Li et al 2021

“Scaling End-to-End Models for Large-Scale Multilingual ASR”, Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang et al (2021-04-30):

Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data. Existing work has shown positive transfer from high resource to low resource languages. However, degradations on high resource languages are commonly observed due to interference from the heterogeneous multilingual data and reduction in per-language capacity.

We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours. We adopt GShard to efficiently scale up to 10B parameters.

Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck—our 500M-param model is already better than monolingual baselines and scaling it to 1B and 10B brought further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days—the 1B-param model reaches the same accuracy at 34% of training time as the 500M-param model; (3) given a fixed capacity budget, adding depth usually works better than width and large encoders tend to do better than large decoders.

Figure 1: WER performance (%) vs. (a) training steps, (b) TPU days and (c) language. Systems with ✱ use LSTM decoders.

“Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation”, Parisotto & Salakhutdinov 2021

“Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation”, Emilio Parisotto, Ruslan Salakhutdinov (2021-04-04):

Many real-world applications such as robotics provide hard constraints on power and compute that limit the viable model complexity of Reinforcement Learning (RL) agents. Similarly, in many distributed RL settings, acting is done on un-accelerated hardware such as CPUs, which likewise restricts model size to prevent intractable experiment run times. These “actor-latency” constrained settings present a major obstruction to the scaling up of model complexity that has recently been extremely successful in supervised learning. To be able to utilize large model capacity while still operating within the limits imposed by the system during acting, we develop an “Actor-Learner Distillation” (ALD) procedure that leverages a continual form of distillation that transfers learning progress from a large capacity learner model to a small capacity actor model. As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have had large improvements over LSTMs recently, at the cost of significantly higher computational complexity. With transformer models as the learner and LSTMs as the actor, we demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model.

“Finetuning Pretrained Transformers into RNNs”, Kasai et al 2021

“Finetuning Pretrained Transformers into RNNs”, Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen et al (2021-03-24):

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism’s complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process has lower training cost relative to training these recurrent variants from scratch. As many models for natural language tasks are increasingly dependent on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.

“Pretrained Transformers As Universal Computation Engines”, Lu et al 2021

“Pretrained Transformers as Universal Computation Engines”, Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch (2021-03-09):

We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning—in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.

“Predictive Coding Is a Consequence of Energy Efficiency in Recurrent Neural Networks”, Ali et al 2021

“Predictive coding is a consequence of energy efficiency in recurrent neural networks”, Abdullahi Ali, Nasir Ahmad, Elgar de Groot, Marcel A. J. van Gerven, Tim C. Kietzmann (2021-02-16):

Predictive coding represents a promising framework for understanding brain function. It postulates that the brain continuously inhibits predictable sensory input, ensuring a preferential processing of surprising elements. A central aspect of this view is its hierarchical connectivity, involving recurrent message passing between excitatory bottom-up signals and inhibitory top-down feedback. Here we use computational modelling to demonstrate that such architectural hard-wiring is not necessary. Rather, predictive coding is shown to emerge as a consequence of energy efficiency. When training recurrent neural networks to minimize their energy consumption while operating in predictive environments, the networks self-organize into prediction and error units with appropriate inhibitory and excitatory interconnections, and learn to inhibit predictable sensory input. Moving beyond the view of purely top-down driven predictions, we demonstrate via virtual lesioning experiments that networks perform predictions on two timescales: fast lateral predictions among sensory units, and slower prediction cycles that integrate evidence over time.

“Meta Learning Backpropagation And Improving It”, Kirsch & Schmidhuber 2020

“Meta Learning Backpropagation And Improving It”, Louis Kirsch, Jürgen Schmidhuber (2020-12-29):

Many concepts have been proposed for meta learning with neural networks (NNs), eg. NNs that learn to control fast weights, hyper networks, learned learning rules, and meta recurrent NNs. Our Variable Shared Meta Learning (VS-ML) unifies the above and demonstrates that simple weight-sharing and sparsity in an NN is sufficient to express powerful learning algorithms (LAs) in a reusable fashion. A simple implementation of VS-ML called VS-ML RNN allows for implementing the backpropagation LA solely by running an RNN in forward-mode. It can even meta-learn new LAs that improve upon backpropagation and generalize to datasets outside of the meta training distribution without explicit gradient calculation. Introspection reveals that our meta-learned LAs learn qualitatively different from gradient descent through fast association.

“Adversarial Vulnerabilities of Human Decision-making”, Dezfouli et al 2020

“Adversarial vulnerabilities of human decision-making”, Amir Dezfouli, Richard Nock, Peter Dayan (2020-11-04):

“What I cannot efficiently break, I cannot understand.” Understanding the vulnerabilities of human choice processes allows us to detect and potentially avoid adversarial attacks. We develop a general framework for creating adversaries for human decision-making. The framework is based on recent developments in deep reinforcement learning models and recurrent neural networks and can in principle be applied to any decision-making task and adversarial objective. We show the performance of the framework in 3 tasks involving choice, response inhibition, and social decision-making. In all of the cases the framework was successful in its adversarial attack. Furthermore, we show various ways to interpret the models to provide insights into the exploitability of human choice.

Adversarial examples are carefully crafted input patterns that are surprisingly poorly classified by artificial and/​or natural neural networks. Here we examine adversarial vulnerabilities in the processes responsible for learning and choice in humans. Building upon recent recurrent neural network models of choice processes, we propose a general framework for generating adversarial opponents that can shape the choices of individuals in particular decision-making tasks toward the behavioral patterns desired by the adversary. We show the efficacy of the framework through 3 experiments involving action selection, response inhibition, and social decision-making. We further investigate the strategy used by the adversary in order to gain insights into the vulnerabilities of human choice. The framework may find applications across behavioral sciences in helping detect and avoid flawed choice.

[Keywords: decision-making, recurrent neural networks, reinforcement learning]

“Learning to Summarize Long Texts With Memory Compression and Transfer”, Park et al 2020

“Learning to Summarize Long Texts with Memory Compression and Transfer”, Jaehong Park, Jonathan Pilault, Christopher Pal (2020-10-21):

We introduce Mem2Mem, a memory-to-memory mechanism for hierarchical recurrent neural network based encoder decoder architectures and we explore its use for abstractive document summarization. Mem2Mem transfers “memories” via readable/​writable external memory modules that augment both the encoder and decoder. Our memory regularization compresses an encoded input article into a more compact set of sentence representations. Most importantly, the memory compression step performs implicit extraction without labels, sidestepping issues with suboptimal ground-truth data and exposure bias of hybrid extractive-abstractive summarization techniques. By allowing the decoder to read/​write over the encoded input memory, the model learns to read salient information about the input article while keeping track of what has been generated. Our Mem2Mem approach yields results that are competitive with state of the art transformer based summarization methods, but with 16× fewer parameters.

“An Attention Free Transformer”, Anonymous 2020

“An Attention Free Transformer”, Anonymous (2020-09-28):

We propose an efficient Transformer that eliminates attention.

We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for spatial attention. AFT offers great simplicity compared with standard Transformers, where the multi-head attention operation is replaced with the composition of element-wise multiplications/​divisions and global/​local pooling. We provide several variants of AFT along with simple yet efficient implementations that are supported by main stream deep learning libraries. We show that, surprisingly, we are able to train AFT effectively on challenging benchmarks, and also to match or surpass the standard Transformer counterparts.

[Keywords: Transformers, attention, efficient]

“HiPPO: Recurrent Memory With Optimal Polynomial Projections”, Gu et al 2020

“HiPPO: Recurrent Memory with Optimal Polynomial Projections”, Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Re (2020-08-17):

[cf. S4⁠, LSSL] A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed.

We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. Given a measure that specifies the importance of each time step in the past, HiPPO produces an optimal solution to a natural online function approximation problem. As special cases, our framework yields a short derivation of the recent Legendre Memory Unit (LMU) from first principles, and generalizes the ubiquitous gating mechanism of recurrent neural networks such as GRUs⁠. This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale. HiPPO-LegS enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients. By incorporating the memory dynamics into recurrent neural networks, HiPPO RNNs can empirically capture complex temporal dependencies.

On the benchmark permuted MNIST dataset, HiPPO-LegS sets a new state-of-the-art accuracy of 98.3%. Finally, on a novel trajectory classification task testing robustness to out-of-distribution timescales and missing data, HiPPO-LegS outperforms RNN and neural ODE baselines by 25–40% accuracy.

“Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Scholl 2020

“Matt Botvinick on the spontaneous emergence of learning algorithms”, Adam Scholl (2020-08-12):

Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview⁠, he discusses results from a 2018 paper which describe conditions under which reinforcement learning algorithms will spontaneously give rise to separate full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper:

Initial Observation

At some point, a group of DeepMind researchers in Botvinick’s group noticed that when they trained a RNN using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren’t trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started “with just one learning algorithm, and then another learning algorithm kind of… emerges, out of, like out of thin air”:

“What happens… it seemed almost magical to us, when we first started realizing what was going on—the slow learning algorithm, which was just kind of adjusting the synaptic weights, those slow synaptic changes give rise to a network dynamics, and the dynamics themselves turn into a learning algorithm.”

Other versions of this basic architecture—eg. using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed “meta-RL.” So they concluded that all that’s needed for a system to give rise to meta-RL are three very general properties: the system must 1. have memory, 2. whose weights are trained by a RL algorithm, 3. on a sequence of similar input data.

From Botvinick’s description, it sounds to me like he thinks [learning algorithms that find/​instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms:

“…it’s something that just happens. In a sense, you can’t avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can’t stop it.”

…The account detailed by Botvinick and Wang et al strikes me as a relatively clear example of mesa-optimization, and I interpret it as tentative evidence that the attractor toward mesa-optimization is strong.

“High-performance Brain-to-text Communication via Imagined Handwriting”, Willett et al 2020

“High-performance brain-to-text communication via imagined handwriting”, Francis R. Willett, Donald T. Avansino, Leigh R. Hochberg, Jaimie M. Henderson, Krishna V. Shenoy (2020-07-02):

Brain-computer interfaces (BCIs) can restore communication to people who have lost the ability to move or speak. To date, a major focus of BCI research has been on restoring gross motor skills, such as reaching and grasping [1–5] or point-and-click typing with a 2D computer cursor [6,7]. However, rapid sequences of highly dexterous behaviors, such as handwriting or touch typing, might enable faster communication rates. Here, we demonstrate an intracortical BCI that can decode imagined handwriting movements from neural activity in motor cortex and translate it to text in real-time, using a novel recurrent neural network decoding approach. With this BCI, our study participant (whose hand was paralyzed) achieved typing speeds that exceed those of any other BCI yet reported: 90 characters per minute at >99% accuracy with a general-purpose autocorrect. These speeds are comparable to able-bodied smartphone typing speeds in our participant’s age group (115 characters per minute) [8] and significantly close the gap between BCI-enabled typing and able-bodied typing rates. Finally, new theoretical considerations explain why temporally complex movements, such as handwriting, may be fundamentally easier to decode than point-to-point movements. Our results open a new approach for BCIs and demonstrate the feasibility of accurately decoding rapid, dexterous movements years after paralysis.

“Transformers Are RNNs: Fast Autoregressive Transformers With Linear Attention”, Katharopoulos et al 2020

“Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”, Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret (2020-06-29):

Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input’s length, they are prohibitively slow for very long sequences.

To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from 𝒪(N²) to 𝒪(N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks.

Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000× faster on autoregressive prediction of very long sequences.
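The associativity trick in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the feature map elu(x) + 1, the function names, and the list-of-lists representation are assumptions chosen for readability. Both functions compute the same causal kernel attention. The first recomputes every query-key weight, which costs quadratic time in sequence length. The second carries a fixed-size running state, as an RNN would, and costs linear time.

```python
import math
from typing import List

Vec = List[float]


def phi(x: Vec) -> Vec:
    """Assumed positive feature map: elu(x) + 1, applied element-wise."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def causal_attention_quadratic(Q: List[Vec], K: List[Vec], V: List[Vec]) -> List[Vec]:
    """Position i attends to positions 0..i with weights phi(q_i) . phi(k_j)."""
    out = []
    for i, q in enumerate(Q):
        fq = phi(q)
        weights = [dot(fq, phi(K[j])) for j in range(i + 1)]
        norm = sum(weights)
        out.append([sum(w * V[j][d] for j, w in enumerate(weights)) / norm
                    for d in range(len(V[0]))])
    return out


def causal_attention_recurrent(Q: List[Vec], K: List[Vec], V: List[Vec]) -> List[Vec]:
    """Same result, using a running state S = sum phi(k) v^T and z = sum phi(k)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # matrix-valued state
    z = [0.0] * d_k                        # normalizer state
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for a in range(d_k):               # state update: constant cost per step
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out
```

Because the state has a fixed size regardless of how many tokens have been seen, autoregressive generation only needs S and z from the previous step, which is the sense in which such a transformer behaves like an RNN.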

“Learning-based Memory Allocation for C++ Server Workloads”, Maas et al 2020

“Learning-based Memory Allocation for C++ Server Workloads”, Martin Maas, David G. Andersen, Michael Isard, Mohammad Mahdi Javanmard, Kathryn S. McKinley, Colin Raffel et al (2020-03-16):

Modern C++ servers have memory footprints that vary widely over time, causing persistent heap fragmentation of up to 2× from long-lived objects allocated during peak memory usage. This fragmentation is exacerbated by the use of huge (2MB) pages, a requirement for high performance on large heap sizes. Reducing fragmentation automatically is challenging because C++ memory managers cannot move objects.

This paper presents a new approach to huge page fragmentation. It combines modern machine learning techniques with a novel memory manager (LLAMA) that manages the heap based on object lifetimes and huge pages (divided into blocks and lines). A neural network-based language model predicts lifetime classes using symbolized calling contexts. The model learns context-sensitive per-allocation site lifetimes from previous runs, generalizes over different binary versions, and extrapolates from samples to unobserved calling contexts. Instead of size classes, LLAMA’s heap is organized by lifetime classes that are dynamically adjusted based on observed behavior at a block granularity.

LLAMA reduces memory fragmentation by up to 78% while only using huge pages on several production servers. We address ML-specific questions such as tolerating mispredictions and amortizing expensive predictions across application execution. Although our results focus on memory allocation, the questions we identify apply to other system-level problems with strict latency and resource requirements where machine learning could be applied.

[CCS Concepts: Computing methodologies → Supervised learning; Software and its engineering → Allocation / deallocation strategies; Keywords: Memory management, Machine Learning, Lifetime Prediction, Profile-guided Optimization, LSTMs]

“Scaling Laws for Neural Language Models”, Kaplan et al 2020

“Scaling Laws for Neural Language Models”, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford et al (2020-01-23):

We study empirical scaling laws for language model performance on the cross-entropy loss.

The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/​dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget.

Larger models are substantially more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping substantially before convergence.

Figure 1: Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
Figure 15: Far beyond the model sizes we study empirically, we find a contradiction between our equations for L(C_min) and L(D) due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
3.2.1: Comparing to LSTMs and Universal Transformers: In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count n. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns. [see Khandelwal et al 2018 on the rapid forgetting of RNNs, SSMs as a possible optimization fix, and “Scaling Laws for Acoustic Models” for another direct LSTM-RNN vs Transformer comparison.]
Appendix A: Summary of Power Laws
Table 1: Summary of scaling laws. In this table we summarize the model size and compute scaling fits to equation (1.1) along with N_opt(C), with the loss in nats/​token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while N_opt(C) comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language. [This is an updated scaling power law summary from Henighan et al 2020.]

“Single Headed Attention RNN: Stop Thinking With Your Head”, Merity 2019

“Single Headed Attention RNN: Stop Thinking With Your Head”, Stephen Merity (2019-11-26):

The leading approaches in language modeling are all obsessed with TV shows of my youth—namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon. We opt for the lazy path of old and proven techniques with a fancy crypto inspired acronym: the Single Headed Attention RNN (SHA-RNN). The author’s lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result.

We take a previously strong language model based only on boring LSTMs and get it to within a stone’s throw of a stone’s throw of state-of-the-art byte level language model results on Enwik8. This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author’s small studio apartment far too warm in the midst of a San Franciscan summer. The final results are achievable in plus or minus 24 hours on a single GPU as the author is impatient. The attention mechanism is also readily extended to large contexts with minimal computation.

Take that Sesame Street.

“Excavate”, Lynch 2019

“Excavate”⁠, Mike Lynch (2019-11-22):

After skipping it last year (I did NaNoWriMo instead) I decided that I missed doing National Novel Generating Month and thought I’d do something relatively simple, based on Tom Phillips’ A Humument, which I recently read for the first time. Phillips’ project was created by drawing over the pages of the forgotten Victorian novel A Human Document, leaving behind a handful of words on each page which form their own narrative, revealing a latent story in the original text. I wanted to simulate this process by taking a neural net trained on one text and using it to excavate a slice from a second text which would somehow preserve the style of the RNN. To get to the target length of 50,000 words, the second text would have to be very long, so I picked Robert Burton’s The Anatomy of Melancholy, which is over half a million words, and one of my favourite books.

The next step was to use this to implement the excavate algorithm, which works like this:

  1. read a vocab from the next L words from the primary text (Burton) where L is the lookahead parameter
  2. take the first letter of every word in the vocab and turn it into a constraint
  3. run the RNN with that constraint to get the next character C
  4. prune the vocab to those words with the first letter C, with that letter removed
  5. turn the new vocab into a new constraint and go back to 3
  6. once we’ve finished a word, add it to the results
  7. skip ahead to the word we picked, and read more words from the text until we have L words
  8. go back to 2 unless we’ve run out of original text, or reached the target word count

Here’s an example of how the RNN generates a single word with L set to 100:

Vocab 1: “prime cause of my disease. Or as he did, of whom Felix Plater speaks, that thought he had some of Aristophanes’ frogs in his belly, still crying Breec, okex, coax, coax, oop, oop, and for that cause studied physic seven years, and travelled over most part of Europe to ease himself. To do myself good I turned over such physicians as our libraries would afford, or my private friends impart, and have taken this pains. And why not? Cardan professeth he wrote his book, De Consolatione after his son’s death, to comfort himself; so did Tully”

RNN: s

Vocab 2: “peaks ome till tudied even uch on’s o”

RNN: t

Vocab 3: “ill udied”

RNN: u

Final result: studied

The algorithm then restarts with a new 100-word vocabulary starting at “physic seven years”
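
The eight steps and the “studied” walk-through above can be sketched in Python roughly as follows. This is a minimal sketch, not Lynch’s code: `pick_char` is a hypothetical stand-in for the constrained RNN (any callable that chooses one of the allowed characters), the function names are made up for illustration, and a word may only end once it exactly matches a candidate, which is stricter than the original, where the RNN can always emit a space or punctuation.

```python
def excavate_word(window, pick_char):
    """Choose one word from `window`, one character at a time (steps 2-6)."""
    suffixes = list(window)  # what is left of each candidate word
    chosen = ""
    while True:
        # Constraint: the first letters of the remaining candidates.
        allowed = sorted({s[0] for s in suffixes if s})
        # Simplification (assumption): a space, meaning "end of word", is
        # only allowed once some candidate has been spelled out completely.
        if "" in suffixes:
            allowed.append(" ")
        c = pick_char(chosen, allowed)
        if c not in allowed:
            raise ValueError(f"{c!r} violates the constraint {allowed}")
        if c == " ":
            return chosen
        chosen += c
        # Prune to candidates starting with c, with that letter removed.
        suffixes = [s[1:] for s in suffixes if s.startswith(c)]


def excavate(words, pick_char, lookahead, target_count):
    """Run the outer loop (steps 1, 7, 8) over a list of source words."""
    result, pos = [], 0
    while pos < len(words) and len(result) < target_count:
        window = words[pos:pos + lookahead]
        word = excavate_word(window, pick_char)
        result.append(word)
        # Skip ahead to just past the picked word. Using the first matching
        # occurrence in the window is an assumption of this sketch.
        pos += window.index(word) + 1
    return result


def follow(target):
    """Toy stand-in for the RNN: spell out `target`, then end the word."""
    def pick_char(chosen, allowed):
        return target[len(chosen)] if len(chosen) < len(target) else " "
    return pick_char


# The s-words from Vocab 1 in the example above.
vocab = "speaks some still studied seven such son's so".split()
print(excavate_word(vocab, follow("studied")))  # prints "studied"
```

Swapping `follow(...)` for a function that samples from a character-level model restricted to `allowed` would give something closer to the behaviour described, including the lookahead window restarting just after the chosen word.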

It works pretty well with a high enough lookahead value, although I’m not happy with how the algorithm decides when to end a word. The weight table always gets a list of all the punctuation symbols and a space, which means that the RNN can always bail out of a word half-way if it decides to. I tried constraining it so that it always finished a word once it had narrowed down the options to a single-word vocab, but when I did this, it somehow removed the patterns of punctuation and line-breaks—for example, the way the Three Musketeers RNN emits dialogue in quotation marks—and this was a quality of the RNN I wanted to preserve. I think a little more work could improve this.

…This kind of hybridisation can be applied to any RNN and base text, so there’s a lot of scope for exploration here, of grafting the grammar and style of one text onto the words from another. And the alliteration and lipogram experiments above are just two simple examples of more general ways in which I’ll be able to tamper with the output of RNNs.

“Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks”, Voelker et al 2019

“Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks”⁠, Aaron R. Voelker, Ivana Kajić, Chris Eliasmith (2019-11-05):

We propose a novel memory cell for recurrent neural networks that dynamically maintains information across long windows of time using relatively few resources.

The Legendre Memory Unit (LMU) is mathematically derived to orthogonalize its continuous-time history—doing so by solving d coupled ordinary differential equations (ODEs), whose phase space linearly maps onto sliding windows of time via the Legendre polynomials up to degree d − 1.

Backpropagation across LMUs outperforms equivalently-sized LSTMs on a chaotic time-series prediction task, improves memory capacity by 2 orders of magnitude, and substantially reduces training and inference times. LMUs can efficiently handle temporal dependencies spanning 100,000 time-steps, converge rapidly, and use few internal state-variables to learn complex functions spanning long windows of time—exceeding state-of-the-art performance among RNNs on permuted sequential MNIST⁠.

These results are due to the network’s disposition to learn scale-invariant features independently of step size. Backpropagation through the ODE solver allows each layer to adapt its internal time-step, enabling the network to learn task-relevant time-scales. We demonstrate that LMU memory cells can be implemented using m recurrently-connected Poisson spiking neurons, 𝒪(m) time and memory, with error scaling as 𝒪(d⁄√m).

We discuss implementations of LMUs on analog and digital neuromorphic hardware.
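
The “d coupled ordinary differential equations” in the abstract can be read as a linear state-space system θ·dx/dt = A·x + B·u, where u is the input signal and θ is the window length. The sketch below builds A and B from the closed form usually quoted for the LMU; treat that formula as an assumption and check it against the paper. The forward-Euler update is a simplification chosen for brevity, not necessarily the paper’s discretization, and the function names are hypothetical.

```python
def lmu_matrices(d):
    """Return (A, B) for theta * dx/dt = A x + B u, with d state variables."""
    # Assumed closed form: a_ij = (2i+1) * (-1 if i < j else (-1)^(i-j+1)),
    #                      b_i  = (2i+1) * (-1)^i
    A = [[(2 * i + 1) * (-1 if i < j else (-1) ** (i - j + 1))
          for j in range(d)] for i in range(d)]
    B = [(2 * i + 1) * (-1) ** i for i in range(d)]
    return A, B


def lmu_simulate(inputs, d, theta, dt):
    """Integrate the memory state over `inputs` with forward Euler steps."""
    A, B = lmu_matrices(d)
    x = [0.0] * d
    for u in inputs:
        # dx = A x + B u, computed row by row.
        dx = [sum(A[i][j] * x[j] for j in range(d)) + B[i] * u
              for i in range(d)]
        x = [x[i] + (dt / theta) * dx[i] for i in range(d)]
    return x


A, B = lmu_matrices(2)
print(A, B)  # prints [[-1, -1], [3, -3]] [1, -3]
```

With a constant input of 1 and d = 2, the state settles near [1, 0]: the window then holds a constant signal, which needs only the degree-0 Legendre coefficient.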

“High Fidelity Video Prediction With Large Stochastic Recurrent Neural Networks”, Villegas et al 2019

“High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks”⁠, Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee (2019-11-05):

Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex inductive biases inside network architectures with highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving.

“Mixed-Signal Neuromorphic Processors: Quo Vadis?”, Bavandpour et al 2019

2019-bavandpour.pdf: “Mixed-Signal Neuromorphic Processors: Quo vadis?”⁠, Mohammad Bavandpour, Mohammad Reza Mahmoodi, Shubham Sahay, Dmitri B. Strukov (2019-10-14):

This paper outlines different design options and the most suitable memory devices for implementing the dense vector-by-matrix multiplication operation, the key operation in neuromorphic computing⁠.

The considered approaches are evaluated by modeling the system-level performance of a 55-nm 4-bit mixed-signal neuromorphic inference processor running common deep learning feedforward and recurrent neural network models.

[Keywords: nonvolatile memory device, mixed-signal circuits, neuromorphic processor, vector-by-matrix multiplication]

“Metalearned Neural Memory”, Munkhdalai et al 2019

“Metalearned Neural Memory”⁠, Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, Adam Trischler (2019-07-23):

We augment recurrent neural networks with an external memory mechanism that builds upon recent progress in metalearning. We conceptualize this memory as a rapidly adaptable function that we parameterize as a deep neural network. Reading from the neural memory function amounts to pushing an input (the key vector) through the function to produce an output (the value vector). Writing to memory means changing the function; specifically, updating the parameters of the neural network to encode desired information. We leverage training and algorithmic techniques from metalearning to update the neural memory function in one shot. The proposed memory-augmented model achieves strong performance on a variety of learning problems, from supervised question answering to reinforcement learning.

“Playing the Lottery With Rewards and Multiple Languages: Lottery Tickets in RL and NLP”, Yu et al 2019

“Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP”⁠, Haonan Yu, Sergey Edunov, Yuandong Tian, Ari S. Morcos (2019-06-06):

The lottery ticket hypothesis proposes that over-parameterization of deep neural networks (DNNs) aids training by increasing the probability of a “lucky” sub-network initialization being present rather than by helping the optimization process (Frankle & Carbin, 2019). Intriguingly, this phenomenon suggests that initialization strategies for DNNs can be improved substantially, but the lottery ticket hypothesis has only previously been tested in the context of supervised learning for natural image tasks. Here, we evaluate whether “winning ticket” initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, we examined both recurrent LSTM models and large-scale Transformer models (Vaswani et al 2017). For RL, we analyzed a number of discrete-action space tasks, including both classic control and pixel control. Consistent with work in supervised image classification, we confirm that winning ticket initializations generally outperform parameter-matched random initializations, even at extreme pruning rates for both NLP and RL. Notably, we are able to find winning ticket initializations for Transformers which enable models 1⁄3rd the size to achieve nearly equivalent performance. Together, these results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in DNNs.

“MoGlow: Probabilistic and Controllable Motion Synthesis Using Normalizing Flows”, Henter et al 2019

“MoGlow: Probabilistic and controllable motion synthesis using normalizing flows”⁠, Gustav Eje Henter, Simon Alexanderson, Jonas Beskow (2019-05-16):

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalizing flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood⁠, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.

“Meta-learners’ Learning Dynamics Are unlike Learners’”, Rabinowitz 2019

“Meta-learners’ learning dynamics are unlike learners’”⁠, Neil C. Rabinowitz (2019-05-03):

Meta-learning is a tool that allows us to build sample-efficient learning systems. Here we show that, once meta-trained, LSTM Meta-Learners aren’t just faster learners than their sample-inefficient deep learning (DL) and reinforcement learning (RL) brethren, but that they actually pursue fundamentally different learning trajectories. We study their learning dynamics on three sets of structured tasks for which the corresponding learning dynamics of DL and RL systems have been previously described: linear regression (Saxe et al 2013), nonlinear regression (Rahaman et al 2018; Xu et al 2018), and contextual bandits (Schaul et al 2019). In each case, while sample-inefficient DL and RL Learners uncover the task structure in a staggered manner, meta-trained LSTM Meta-Learners uncover almost all task structure concurrently, congruent with the patterns expected from Bayes-optimal inference algorithms. This has implications for research areas wherever the learning behaviour itself is of interest, such as safety, curriculum design, and human-in-the-loop machine learning.

“Speech Synthesis from Neural Decoding of Spoken Sentences”, Anumanchipalli et al 2019

2019-anumanchipalli.pdf: “Speech synthesis from neural decoding of spoken sentences”⁠, Gopala K. Anumanchipalli, Josh Chartier, Edward F. Chang (2019-04-24):

Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators.

Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences.

These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.

“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”, Dai et al 2019

“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”⁠, Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov (2019-01-09):

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/​perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.

“High Fidelity Video Prediction With Large Stochastic Recurrent Neural Networks: Videos”, Villegas et al 2019

“High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks: Videos”⁠, Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee (2019):

Sample videos generated by large-scale RNNs:

  • 128×128 Videos:

    • Human 3.6M
    • KITTI Driving
  • Video Comparisons (64×64):

    • Towel pick
    • Human 3.6M
    • KITTI Driving

“Meta-Learning: Learning to Learn Fast”, Weng 2018

“Meta-Learning: Learning to Learn Fast”⁠, Lilian Weng (2018-11-30):

Meta-learning, also known as “learning to learn”, intends to design models that can learn new skills or adapt to new environments rapidly with a few training examples. There are three common approaches: 1. learn an efficient distance metric (metric-based); 2. use a (recurrent) network with external or internal memory (model-based); 3. optimize the model parameters explicitly for fast learning (optimization-based).

…We expect a good meta-learning model to adapt or generalize well to new tasks and new environments that were never encountered during training. The adaptation process, essentially a mini learning session, happens at test time but with limited exposure to the new task configurations. Eventually, the adapted model can complete new tasks. This is why meta-learning is also known as learning to learn⁠.

Define the Meta-Learning Problem · A Simple View · Training in the Same Way as Testing · Learner and Meta-Learner · Common Approaches · Metric-Based · Convolutional Siamese Neural Network · Matching Networks · Simple Embedding · Full Context Embeddings · Relation Network · Prototypical Networks · Model-Based · Memory-Augmented Neural Networks · MANN for Meta-Learning · Addressing Mechanism for Meta-Learning · Meta Networks · Fast Weights · Model Components · Training Process · Optimization-Based · LSTM Meta-Learner · Why LSTM? · Model Setup · MAML · First-Order MAML · Reptile · The Optimization Assumption · Reptile vs FOMAML · Reference

“R2D2: Recurrent Experience Replay in Distributed Reinforcement Learning”, Kapturowski et al 2018

“R2D2: Recurrent Experience Replay in Distributed Reinforcement Learning”⁠, Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, Will Dabney (2018-09-27):

Building on the recent successes of distributed training of RL agents, in this paper we investigate the training of RNN-based RL agents from distributed prioritized experience replay. We study the effects of parameter lag resulting in representational drift and recurrent state staleness and empirically derive an improved training strategy. Using a single network architecture and fixed set of hyper-parameters, the resulting agent, Recurrent Replay Distributed DQN (R2D2), quadruples the previous state of the art on Atari-57, and matches the state of the art on DMLab-30⁠. It is the first agent to exceed human-level performance in 52 of the 57 Atari games.

[Keywords: RNN, LSTM, experience replay, distributed training, reinforcement learning]

TL;DR: An investigation of combining recurrent neural networks and experience replay, leading to a state-of-the-art agent on both Atari-57 and DMLab-30 using a single set of hyper-parameters.

[See also Ni et al 2021 on how easy it is to do RNN DRL wrong.]

“Adversarial Reprogramming of Text Classification Neural Networks”, Neekhara et al 2018

“Adversarial Reprogramming of Text Classification Neural Networks”⁠, Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar (2018-09-06):

Adversarial Reprogramming has demonstrated success in utilizing pre-trained neural network classifiers for alternative classification tasks without modification to the original network. An adversary in such an attack scenario trains an additive contribution to the inputs to repurpose the neural network for the new classification task. While this reprogramming approach works for neural networks with a continuous input space such as that of images, it is not directly applicable to neural networks trained for tasks such as text classification, where the input space is discrete. Repurposing such classification networks would require the attacker to learn an adversarial program that maps inputs from one discrete space to the other. In this work, we introduce a context-based vocabulary remapping model to reprogram neural networks trained on a specific sequence classification task, for a new sequence classification task desired by the adversary. We propose training procedures for this adversarial program in both white-box and black-box settings. We demonstrate the application of our model by adversarially repurposing various text-classification models including LSTM, bi-directional LSTM and CNN for alternate classification tasks.

“This Time With Feeling: Learning Expressive Musical Performance”, Oore et al 2018

“This Time with Feeling: Learning Expressive Musical Performance”⁠, Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, Karen Simonyan (2018-08-10):

Music generation has generally been focused on either creating scores or interpreting them. We discuss differences between these two problems and propose that, in fact, it may be valuable to work in the space of direct performance generation: jointly predicting the notes and also their expressive timing and dynamics.

We consider the importance and qualities of the data set needed for this. Having identified both a problem domain and characteristics of an appropriate data set, we show an LSTM-based recurrent network model that subjectively performs quite well on this task.

Critically, we provide generated examples. We also include feedback from professional composers and musicians about some of these examples.

“Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018

“Character-Level Language Modeling with Deeper Self-Attention”⁠, Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, Llion Jones (2018-08-09):

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on Enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.

“General Value Function Networks”, Schlegel et al 2018

“General Value Function Networks”⁠, Matthew Schlegel, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White, Martha White (2018-07-18):

State construction is important for learning in partially observable environments. A general purpose strategy for state construction is to learn the state update using a Recurrent Neural Network (RNN), which updates the internal state using the current internal state and the most recent observation. This internal state provides a summary of the observed sequence, to facilitate accurate predictions and decision-making. At the same time, specifying and training RNNs is notoriously tricky, particularly as the common strategy to approximate gradients back in time, called truncated Back-prop Through Time (BPTT), can be sensitive to the truncation window. Further, domain-expertise—which can usually help constrain the function class and so improve trainability—can be difficult to incorporate into complex recurrent units used within RNNs. In this work, we explore how to use multi-step predictions to constrain the RNN and incorporate prior knowledge. In particular, we revisit the idea of using predictions to construct state and ask: does constraining (parts of) the state to consist of predictions about the future improve RNN trainability?

We formulate a novel RNN architecture, called a General Value Function Network (GVFN), where each internal state component corresponds to a prediction about the future represented as a value function. We first provide an objective for optimizing GVFNs, and derive several algorithms to optimize this objective. We then show that GVFNs are more robust to the truncation level, in many cases only requiring one-step gradient updates.

“Universal Transformers”, Dehghani et al 2018

“Universal Transformers”⁠, Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser (2018-07-10):

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, eg. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions, UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.

“GPT-1: Improving Language Understanding With Unsupervised Learning”, OpenAI 2018

“GPT-1: Improving Language Understanding with Unsupervised Learning”⁠, OpenAI (2018-06-11):

We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training⁠. These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.

“GPT-1: Improving Language Understanding by Generative Pre-Training”, Radford et al 2018

“GPT-1: Improving Language Understanding by Generative Pre-Training”⁠, Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (2018-06-08):

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately.

We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, substantially improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

“Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data”, Yang et al 2018

“Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data”⁠, Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh, Jane-Ling Wang, Michael I. Jordan (2018-05-31):

We present a probabilistic framework for studying adversarial attacks on discrete data. Based on this framework, we derive a perturbation-based method, Greedy Attack, and a scalable learning-based method, Gumbel Attack, that illustrate various tradeoffs in the design of attacks. We demonstrate the effectiveness of these methods using both quantitative metrics and human evaluation on various state-of-the-art models for text classification, including a word-based CNN, a character-based CNN and an LSTM. As an example of our results, we show that the accuracy of character-based convolutional networks drops to the level of random selection by modifying only five characters through Greedy Attack.

“Hierarchical Neural Story Generation”, Fan et al 2018

“Hierarchical Neural Story Generation”⁠, Angela Fan, Mike Lewis, Yann Dauphin (2018-05-13):

We explore story generation: creative systems that can build coherent and fluent passages of text about a topic. We collect a large dataset of 300K human-written stories paired with writing prompts from an online forum. Our dataset enables hierarchical story generation, where the model first generates a premise, and then transforms it into a passage of text. We gain further improvements with a novel form of model fusion that improves the relevance of the story to the prompt, and adding a new gated multi-scale self-attention mechanism to model long-range context. Experiments show large improvements over strong baselines on both automated and human evaluations. Human judges prefer stories generated by our approach to those from a strong non-hierarchical model by a factor of two to one.

“Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context”, Khandelwal et al 2018

“Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context”⁠, Urvashi Khandelwal, He He, Peng Qi, Dan Jurafsky (2018-05-12):

We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.

“A Tree Search Algorithm for Sequence Labeling”, Lao et al 2018

“A Tree Search Algorithm for Sequence Labeling”⁠, Yadi Lao, Jun Xu, Yanyan Lan, Jiafeng Guo, Sheng Gao, Xueqi Cheng (2018-04-29):

In this paper we propose a novel reinforcement learning based model for sequence tagging, referred to as MM-Tag. Inspired by the success and methodology of AlphaGo Zero, MM-Tag formalizes the problem of sequence tagging with a Monte Carlo tree search (MCTS) enhanced Markov decision process (MDP) model, in which the time steps correspond to the positions of words in a sentence from left to right, and each action corresponds to assigning a tag to a word. Two long short-term memory networks (LSTM) are used to summarize the past tag assignments and words in the sentence. Based on the outputs of the LSTMs, the policy for guiding the tag assignment and the value for predicting the tagging accuracy of the whole sentence are produced. The policy and value are then strengthened with MCTS, which takes the produced raw policy and value as inputs, simulates and evaluates the possible tag assignments at the subsequent positions, and outputs a better search policy for assigning tags. A reinforcement learning algorithm is proposed to train the model parameters. Our work is the first to apply the MCTS enhanced MDP model to the sequence tagging task. We show that MM-Tag can accurately predict the tags thanks to the exploratory decision making mechanism introduced by MCTS. Experimental results on a chunking benchmark showed that MM-Tag outperformed the state-of-the-art sequence tagging baselines including CRF and CRF with LSTM.

“Community Interaction and Conflict on the Web”, Kumar et al 2018

“Community Interaction and Conflict on the Web”, Srijan Kumar, William L. Hamilton, Jure Leskovec, Dan Jurafsky (2018-03-09):

Users organize themselves into communities on web platforms. These communities can interact with one another, often leading to conflicts and toxic interactions. However, little is known about the mechanisms of interactions between communities and how they impact users.

Here we study inter-community interactions across 36,000 communities on Reddit, examining cases where users of one community are mobilized by negative sentiment to comment in another community. We show that such conflicts tend to be initiated by a handful of communities—less than 1% of communities start 74% of conflicts. While conflicts tend to be initiated by highly active community members, they are carried out by statistically-significantly less active members. We find that conflicts are marked by formation of echo chambers, where users primarily talk to other users from their own community. In the long-term, conflicts have adverse effects and reduce the overall activity of users in the targeted communities.

Our analysis of user interactions also suggests strategies for mitigating the negative impact of conflicts—such as increasing direct engagement between attackers and defenders. Further, we accurately predict whether a conflict will occur by creating a novel LSTM model that combines graph embeddings, user, community, and text features. This model can be used to create early-warning systems for community moderators to prevent conflicts. Altogether, this work presents a data-driven view of community interactions and conflict, and paves the way towards healthier online communities.

“Learning Memory Access Patterns”, Hashemi et al 2018

“Learning Memory Access Patterns”, Milad Hashemi, Kevin Swersky, Jamie A. Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis et al (2018-03-06):

The explosion in workload complexity and the recent slow-down in Moore’s law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research.

“One Big Net For Everything”, Schmidhuber 2018

“One Big Net For Everything”, Juergen Schmidhuber (2018-02-24):

I apply recent work on “learning to think” (2015) and on PowerPlay (2011) to the incremental training of an increasingly general problem solver, continually learning to solve new tasks without forgetting previous skills. The problem solver is a single recurrent neural network (or similar general purpose computer) called ONE. ONE is unusual in the sense that it is trained in various ways, eg. by black box optimization / reinforcement learning / artificial evolution as well as supervised / unsupervised learning. For example, ONE may learn through neuroevolution to control a robot through environment-changing actions, and learn through unsupervised gradient descent to predict future inputs and vector-valued reward signals as suggested in 1990. User-given tasks can be defined through extra goal-defining input patterns, also proposed in 1990. Suppose ONE has already learned many skills. Now a copy of ONE can be re-trained to learn a new skill, eg. through neuroevolution without a teacher. Here it may profit from re-using previously learned subroutines, but it may also forget previous skills. Then ONE is retrained in PowerPlay style (2011) on stored input/​output traces of (a) ONE’s copy executing the new skill and (b) previous instances of ONE whose skills are still considered worth memorizing. Simultaneously, ONE is retrained on old traces (even those of unsuccessful trials) to become a better predictor, without additional expensive interaction with the environment. More and more control and prediction skills are thus collapsed into ONE, like in the chunker-automatizer system of the neural history compressor (1991). This forces ONE to relate partially analogous skills (with shared algorithmic information) to each other, creating common subroutines in form of shared subnetworks of ONE, to greatly speed up subsequent learning of additional, novel but algorithmically related skills.

“Efficient Neural Audio Synthesis”, Kalchbrenner et al 2018

“Efficient Neural Audio Synthesis”, Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg et al (2018-02-23):

Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4× faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.
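One way to read the "dual softmax" is that each 16-bit sample is predicted as two 8-bit parts, so that each softmax covers 256 classes instead of 65,536. The abstract does not spell this out, so the split below is an assumption for illustration, and the function names are hypothetical.

```python
def split_sample(sample_16bit):
    """Split an unsigned 16-bit sample into (coarse, fine) 8-bit parts.

    Assumed decomposition for illustration: coarse = high byte, fine = low byte.
    """
    if not 0 <= sample_16bit < 65536:
        raise ValueError("expected an unsigned 16-bit value")
    return sample_16bit >> 8, sample_16bit & 0xFF

def join_sample(coarse, fine):
    """Inverse of split_sample: rebuild the 16-bit value from its two bytes."""
    if not (0 <= coarse < 256 and 0 <= fine < 256):
        raise ValueError("coarse and fine must each fit in 8 bits")
    return (coarse << 8) | fine
```

Under this reading, a model would emit the coarse part first and condition the fine part on it, and the two parts together recover the full-resolution sample.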

“Deep Contextualized Word Representations”, Peters et al 2018

“Deep contextualized word representations”, Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer et al (2018-02-15):

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (eg. syntax and semantics), and (2) how these uses vary across linguistic contexts (ie. to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

“M-Walk: Learning to Walk over Graphs Using Monte Carlo Tree Search”, Shen et al 2018

“M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search”, Yelong Shen, Jianshu Chen, Po-Sen Huang, Yuqing Guo, Jianfeng Gao (2018-02-12):

Learning to walk over a graph towards a target node for a given query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem with a known state transition model.

To overcome the challenge of sparse rewards, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (ie. history of the walked path) and maps it separately to a policy and Q-values. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories yielding more positive rewards. From these trajectories, the network is improved in an off-policy manner using Q-learning, which modifies the RNN policy via parameter sharing. Our proposed RL algorithm repeatedly applies this policy-improvement step to learn the model. At test time, MCTS is combined with the neural policy to predict the target node.

Experimental results on several graph-walking benchmarks show that M-Walk is able to learn better policies than other RL-based methods, which are mainly based on policy gradients. M-Walk also outperforms traditional KBC baselines.

“Universal Language Model Fine-tuning for Text Classification”, Howard & Ruder 2018

“Universal Language Model Fine-tuning for Text Classification”⁠, Jeremy Howard, Sebastian Ruder (2018-01-18):

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18–24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We open-source our pretrained models and code.

“A Flexible Approach to Automated RNN Architecture Generation”, Schrimpf et al 2017

“A Flexible Approach to Automated RNN Architecture Generation”, Martin Schrimpf, Stephen Merity, James Bradbury, Richard Socher (2017-12-20):

The process of designing neural architectures requires expert knowledge and extensive trial and error. While automated architecture search may simplify these requirements, the recurrent neural network (RNN) architectures generated by existing methods are limited in both flexibility and components. We propose a domain-specific language (DSL) for use in automated architecture search which can produce novel RNNs of arbitrary depth and width. The DSL is flexible enough to define standard architectures such as the Gated Recurrent Unit and Long Short Term Memory and allows the introduction of non-standard RNN components such as trigonometric curves and layer normalization. Using two different candidate generation techniques, random search with a ranking function and reinforcement learning, we explore the novel architectures produced by the RNN DSL for language modeling and machine translation domains. The resulting architectures do not follow human intuition yet perform well on their targeted tasks, suggesting the space of usable RNN architectures is far larger than previously assumed.

“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017

“Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition”, Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, Zenglin Xu (2017-12-14):

Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high dimensional inputs, the training of RNNs becomes computationally expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely Block-Term tensor decomposition, which greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as tensor-train RNN (TT-RNN), our method, Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with far fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388× fewer parameters than the standard LSTM to achieve an accuracy improvement of over 15.6% in the Action Recognition task on the UCF11 dataset.

“Evaluating Prose Style Transfer With the Bible”, Carlson et al 2017

“Evaluating prose style transfer with the Bible”, Keith Carlson, Allen Riddell, Daniel Rockmore (2017-11-13):

In the prose style transfer task a system, provided with text input and a target prose style, produces output which preserves the meaning of the input text but alters the style. These systems require parallel data for evaluation of results and usually make use of parallel data for training. Currently, there are few publicly available corpora for this task. In this work, we identify a high-quality source of aligned, stylistically distinct text in different versions of the Bible. We provide a standardized split, into training, development and testing data, of the public domain versions in our corpus. This corpus is highly parallel since many Bible versions are included. Sentences are aligned due to the presence of chapter and verse numbers within all versions of the text. In addition to the corpus, we present the results, as measured by the BLEU and PINC metrics, of several models trained on our data which can serve as baselines for future research. While we present these data as a style transfer corpus, we believe that it is of unmatched quality and may be useful for other natural language tasks as well.

“Neural Speed Reading via Skim-RNN”, Seo et al 2017

“Neural Speed Reading via Skim-RNN”, Minjoon Seo, Sewon Min, Ali Farhadi, Hannaneh Hajishirzi (2017-11-06):

Inspired by the principles of speed reading, we introduce Skim-RNN, a recurrent neural network (RNN) that dynamically decides to update only a small fraction of the hidden state for relatively unimportant input tokens. Skim-RNN gives computational advantage over an RNN that always updates the entire hidden state. Skim-RNN uses the same input and output interfaces as a standard RNN and can be easily used instead of RNNs in existing models.

In our experiments, we show that Skim-RNN can achieve significantly reduced computational cost without losing accuracy compared to standard RNNs across five different natural language tasks. In addition, we demonstrate that the trade-off between accuracy and speed of Skim-RNN can be dynamically controlled during inference time in a stable manner. Our analysis also shows that Skim-RNN running on a single CPU offers lower latency compared to standard RNNs on GPUs.

“To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression”, Zhu & Gupta 2017

“To prune, or not to prune: exploring the efficacy of pruning for model compression”, Michael Zhu, Suyog Gupta (2017-10-05):

Model pruning seeks to induce sparsity in a deep neural network’s various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al 2015; Narang et al 2017) prune deep networks at the cost of only a marginal loss in accuracy and achieve a sizable reduction in model size. This hints at the possibility that the baseline models in these experiments are perhaps severely over-parameterized at the outset and a viable alternative for model compression might be to simply reduce the number of hidden units while maintaining the model’s dense connection structure, exposing a similar trade-off in model size and accuracy. We investigate these two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/​datasets with minimal tuning and can be seamlessly incorporated within the training process. We compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint. Across a broad range of neural network architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find large-sparse models to consistently outperform small-dense models and achieve up to 10× reduction in number of non-zero parameters with minimal loss in accuracy.
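A gradual pruning scheme of the kind described above can be sketched as a sparsity target that ramps up during training, combined with a magnitude-based mask. The abstract does not give the schedule, so the cubic ramp below is one plausible form assumed for illustration. The function names and parameters are hypothetical.

```python
def sparsity_at(step, final_sparsity, begin_step, end_step, initial_sparsity=0.0):
    """Target sparsity at `step`: a cubic ramp from initial to final sparsity.

    Assumed schedule for illustration; prunes quickly at first, then slows down.
    """
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude fraction of weights."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]
```

During training, one would recompute the mask every few steps from the current target and multiply it into the weight matrix, so that the network can adapt as connections are removed.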

“Why Pay More When You Can Pay Less: A Joint Learning Framework for Active Feature Acquisition and Classification”, Shim et al 2017

“Why Pay More When You Can Pay Less: A Joint Learning Framework for Active Feature Acquisition and Classification”, Hajin Shim, Sung Ju Hwang, Eunho Yang (2017-09-18):

We consider the problem of active feature acquisition, where we sequentially select the subset of features in order to achieve the maximum prediction performance in the most cost-effective way. In this work, we formulate this active feature acquisition problem as a reinforcement learning problem, and provide a novel framework for jointly learning both the RL agent and the classifier (environment). We also introduce a more systematic way of encoding subsets of features that can properly handle the innate challenge of missing entries in active feature acquisition problems, using an orderless LSTM-based set encoding mechanism that readily fits in the joint learning framework. We evaluate our model on a carefully designed synthetic dataset for active feature acquisition as well as several real datasets such as electronic health record (EHR) datasets, on which it outperforms all baselines in terms of prediction performance as well as feature acquisition cost.

“Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, Campos et al 2017

“Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, Shih-Fu Chang (2017-08-22):

Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often faces challenges like slow inference, vanishing gradients and difficulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time.

We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models.

Source code is publicly available at https: /  ​ /  ​ /  ​skiprnn-2017-telecombcn /  ​⁠.
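The idea of skipping state updates can be illustrated with a toy scalar loop. This is a sketch under stated assumptions, not the released implementation: `run_skip_rnn`, `update_fn`, and `delta_fn` are hypothetical names, and the gating rule (binarize an update probability, copy the state on skipped steps, and let the probability accumulate until the next update) is assumed for illustration.

```python
def run_skip_rnn(inputs, update_fn, delta_fn, state=0.0, update_prob=1.0):
    """Run a toy scalar RNN that copies its state on skipped steps.

    update_fn(state, x) -> new state (the ordinary RNN transition).
    delta_fn(state)     -> increment in [0, 1] for the update probability.
    Returns the final state and the number of updates actually performed.
    """
    updates = 0
    for x in inputs:
        gate = 1 if update_prob >= 0.5 else 0  # binarized update decision
        if gate:
            state = update_fn(state, x)
            updates += 1
            # After an update, the probability restarts from the increment.
            update_prob = delta_fn(state)
        else:
            # State is copied unchanged; the probability accumulates, capped at 1.
            update_prob = update_prob + min(delta_fn(state), 1.0 - update_prob)
    return state, updates
```

A budget constraint like the one mentioned above could then be expressed as a penalty proportional to `updates`, which pushes the model toward smaller increments and therefore fewer state updates.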

“Revisiting Activation Regularization for Language RNNs”, Merity et al 2017

“Revisiting Activation Regularization for Language RNNs”, Stephen Merity, Bryan McCann, Richard Socher (2017-08-03):

Recurrent neural networks (RNNs) serve as a fundamental building block for many sequence tasks across natural language processing. Recent research has focused on recurrent dropout techniques or custom RNN cells in order to improve performance. Both of these can require substantial modifications to the machine learning model or to the underlying RNN configurations. We revisit traditional regularization techniques, specifically L2 regularization on RNN activations and slowness regularization over successive hidden states, to improve the performance of RNNs on the task of language modeling. Both of these techniques require minimal modification to existing RNN architectures and result in performance improvements comparable or superior to more complicated regularization techniques or custom cell architectures. These regularization techniques can be used without any modification on optimized LSTM implementations such as the NVIDIA cuDNN LSTM.
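The two regularizers named above are simple to write down as extra loss terms. The sketch below uses plain Python lists in place of tensors. The function names and the default coefficients `alpha` and `beta` are assumptions for illustration, and the mean-squared form is one common way to express an L2 penalty rather than a confirmed detail of the paper.

```python
def activation_regularization(hidden_states, alpha=2.0):
    """L2 penalty on RNN activations: alpha * mean of squared activations.

    hidden_states is a list of per-timestep hidden vectors (lists of floats).
    """
    total = sum(h * h for state in hidden_states for h in state)
    count = sum(len(state) for state in hidden_states)
    return alpha * total / count

def temporal_activation_regularization(hidden_states, beta=1.0):
    """Slowness penalty: beta * mean squared change between successive states."""
    diffs = [
        (b - a) ** 2
        for prev, nxt in zip(hidden_states, hidden_states[1:])
        for a, b in zip(prev, nxt)
    ]
    if not diffs:  # a single timestep has no successive pair to compare
        return 0.0
    return beta * sum(diffs) / len(diffs)
```

Both terms are added to the language-modeling loss and depend only on the RNN's outputs, which is why they need no change to the cell itself.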

“Bayesian Sparsification of Recurrent Neural Networks”, Lobacheva et al 2017

“Bayesian Sparsification of Recurrent Neural Networks”, Ekaterina Lobacheva, Nadezhda Chirkova, Dmitry Vetrov (2017-07-31):

Recurrent neural networks show state-of-the-art results in many text analysis tasks but often require a lot of memory to store their weights. Recently proposed Sparse Variational Dropout eliminates the majority of the weights in a feed-forward neural network without significant loss of quality. We apply this technique to sparsify recurrent neural networks. To account for recurrent specifics we also rely on Binary Variational Dropout for RNN. We report 99.5% sparsity level on sentiment analysis task without a quality drop and up to 87% sparsity level on language modeling task with slight loss of accuracy.

“On the State of the Art of Evaluation in Neural Language Models”, Melis et al 2017

“On the State of the Art of Evaluation in Neural Language Models”, Gábor Melis, Chris Dyer, Phil Blunsom (2017-07-18):

Ongoing innovations in recurrent neural network architectures have provided a steady influx of apparently state-of-the-art results on language modelling benchmarks. However, these have been evaluated using differing code bases and limited computational resources, which represent uncontrolled sources of experimental variation.

We reevaluate several popular architectures and regularization methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularized, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset.

“Controlling Linguistic Style Aspects in Neural Language Generation”, Ficler & Goldberg 2017

“Controlling Linguistic Style Aspects in Neural Language Generation”, Jessica Ficler, Yoav Goldberg (2017-07-09):

Most work on neural natural language generation (NNLG) focuses on controlling the content of the generated text. We experiment with controlling several stylistic aspects of the generated text, in addition to its content. The method is based on a conditioned RNN language model, where the desired content as well as the stylistic parameters serve as conditioning contexts. We demonstrate the approach on the movie reviews domain and show that it is successful in generating coherent sentences corresponding to the required linguistic style and content.

“Device Placement Optimization With Reinforcement Learning”, Mirhoseini et al 2017

“Device Placement Optimization with Reinforcement Learning”, Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi et al (2017-06-13):

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM, for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.

“Language Generation With Recurrent Generative Adversarial Networks without Pre-training”, Press et al 2017

“Language Generation with Recurrent Generative Adversarial Networks without Pre-training”, Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf (2017-06-05):

Generative Adversarial Networks (GANs) have shown great promise recently in image generation. Training GANs for language generation has proven to be more difficult, because of the non-differentiable nature of generating text with recurrent neural networks. Consequently, past work has either resorted to pre-training with maximum-likelihood or used convolutional networks for generation. In this work, we show that recurrent neural networks can be trained to generate text with GANs from scratch using curriculum learning, by slowly teaching the model to generate sequences of increasing and variable length. We empirically show that our approach vastly improves the quality of generated sequences compared to a convolutional baseline.

“Biased Importance Sampling for Deep Neural Network Training”, Katharopoulos & Fleuret 2017

“Biased Importance Sampling for Deep Neural Network Training”, Angelos Katharopoulos, François Fleuret (2017-05-31):

Importance sampling has been successfully used to accelerate stochastic optimization in many convex problems. However, the lack of an efficient way to calculate the importance still hinders its application to Deep Learning.

In this paper, we show that the loss value can be used as an alternative importance metric, and propose a way to efficiently approximate it for a deep model, using a small model trained for that purpose in parallel.

In particular, this method allows the use of a biased gradient estimate that implicitly optimizes a soft max-loss, and leads to better generalization performance. While such a method suffers from a prohibitively high variance of the gradient estimate when using a standard stochastic optimizer, we show that when it is combined with our sampling mechanism, it results in a reliable procedure.

We showcase the generality of our method by testing it on both image classification and language modeling tasks using deep convolutional and recurrent neural networks. In particular, our method results in 30% faster training of a CNN for CIFAR10 than when using uniform sampling.
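Using the loss as an importance metric can be sketched in a few lines. The helper below is hypothetical and assumes the per-example losses are already available, whereas the abstract describes approximating them with a small auxiliary model. It draws a batch with probability proportional to loss and returns the weights 1/(N·p_i) that would make the estimate unbiased. Leaving those weights out, or softening them, gives a biased estimate that emphasizes high-loss examples.

```python
import random

def loss_proportional_sample(losses, batch_size, seed=0):
    """Sample example indices with probability proportional to their loss.

    Returns (indices, weights), where weights[k] = 1 / (N * p_i) for the
    sampled index i. Assumes every loss is strictly positive.
    """
    if any(l <= 0 for l in losses):
        raise ValueError("losses must be strictly positive")
    total = sum(losses)
    probs = [l / total for l in losses]
    rng = random.Random(seed)
    indices = rng.choices(range(len(losses)), weights=probs, k=batch_size)
    n = len(losses)
    weights = [1.0 / (n * probs[i]) for i in indices]
    return indices, weights
```

With these weights, the expected weighted loss of a sampled example equals the plain mean loss over the dataset, which is the sense in which the estimate is unbiased.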

“A Deep Reinforced Model for Abstractive Summarization”, Paulus et al 2017

“A Deep Reinforced Model for Abstractive Summarization”, Romain Paulus, Caiming Xiong, Richard Socher (2017-05-11):

Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. For longer documents and summaries however these models often include repetitive and incoherent phrases. We introduce a neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL). Models trained only with supervised learning often exhibit “exposure bias”—they assume ground truth is provided at each step during training. However, when standard word prediction is combined with the global sequence prediction training of RL the resulting summaries become more readable. We evaluate this model on the CNN/​Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/​Daily Mail dataset, an improvement over previous state-of-the-art models. Human evaluation also shows that our model produces higher quality summaries.

“Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, Devlin 2017

“Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, Jacob Devlin (2017-05-04):

Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is a significant increase in training and decoding cost compared to phrase-based systems. Here, we focus on efficient decoding, with a goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT), while achieving CPU decoding speed/​throughput close to that of a phrasal decoder.

We approach this problem from two angles: First, we describe several techniques for speeding up an NMT beam search decoder, which obtain a 4.4× speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/​LSTM) layer at bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 BLEU on WMT English-French NewsTest2014, while decoding at 100 words/​sec on single-threaded CPU. We believe this is the best published accuracy/​speed trade-off of an NMT system.

“Exploring Sparsity in Recurrent Neural Networks”, Narang et al 2017

“Exploring Sparsity in Recurrent Neural Networks”, Sharan Narang, Erich Elsen, Gregory Diamos, Shubho Sengupta (2017-04-17):

Recurrent Neural Networks (RNN) are widely used to solve a variety of problems and as the quantity of data and the amount of available compute have increased, so have model sizes. The number of parameters in recent state-of-the-art networks makes them hard to deploy, especially on mobile phones and embedded devices. The challenge is due to both the size of the model and the time it takes to evaluate it. In order to deploy these RNNs efficiently, we propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network. At the end of training, the parameters of the network are sparse while accuracy is still close to the original dense neural network. The network size is reduced by 8× and the time required to train the model remains constant. Additionally, we can prune a larger dense network to achieve better than baseline performance while still reducing the total number of parameters significantly. Pruning RNNs reduces the size of the model and can also help achieve significant inference time speed-up using sparse matrix multiply. Benchmarks show that using our technique model size can be reduced by 90% and speed-up is around 2× to 7×.

“Recurrent Environment Simulators”, Chiappa et al 2017

“Recurrent Environment Simulators”, Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, Shakir Mohamed (2017-04-07):

Models that can simulate how environments change in response to actions can be used by agents to plan and act efficiently. We improve on previous environment simulators from high-dimensional pixel observations by introducing recurrent neural networks that are able to make temporally and spatially coherent predictions for hundreds of time-steps into the future. We present an in-depth analysis of the factors affecting performance, providing the most extensive attempt to advance the understanding of the properties of these models. We address the issue of computational inefficiency with a model that does not need to generate a high-dimensional image at each time-step. We show that our approach can be used to improve exploration and is adaptable to many diverse environments, namely 10 Atari games, a 3D car racing environment, and complex 3D mazes.

“Learning to Generate Reviews and Discovering Sentiment”, Radford et al 2017

“Learning to Generate Reviews and Discovering Sentiment”, Alec Radford, Rafal Jozefowicz, Ilya Sutskever (2017-04-05):

We explore the properties of byte-level recurrent language models. When given sufficient amounts of capacity, training data, and compute time, the representations learned by these models include disentangled features corresponding to high-level concepts. Specifically, we find a single unit which performs sentiment analysis. These representations, learned in an unsupervised manner, achieve state of the art on the binary subset of the Stanford Sentiment Treebank. They are also very data efficient. When using only a handful of labeled examples, our approach matches the performance of strong baselines trained on full datasets. We also demonstrate the sentiment unit has a direct influence on the generative process of the model. Simply fixing its value to be positive or negative generates samples with the corresponding positive or negative sentiment.

“I2T2I: Learning Text to Image Synthesis With Textual Data Augmentation”, Dong et al 2017

“I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation”, Hao Dong, Jingqing Zhang, Douglas McIlwraith, Yike Guo (2017-03-20):

Translating information between text and image is a fundamental problem in artificial intelligence that connects natural language processing and computer vision.

In the past few years, performance in image caption generation has seen substantial improvement through the adoption of recurrent neural networks (RNN). Meanwhile, text-to-image generation has begun to generate plausible images using datasets of specific categories like birds and flowers. We’ve even seen image generation from multi-category datasets such as the Microsoft Common Objects in Context (MS-COCO) through the use of generative adversarial networks (GANs). Synthesizing objects with a complex shape, however, is still challenging. For example, animals and humans have many degrees of freedom, which means that they can take on many complex shapes.

We propose a new training method called Image-Text-Image (I2T2I) which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis. We demonstrate that I2T2I can generate better multi-categories using MS COCO than the state-of-the-art. We also demonstrate that I2T2I can achieve transfer learning by using a pre-trained image captioning module to generate human images on the MPII Human Pose dataset (MHP) without using sentence annotations.

“Improving Neural Machine Translation With Conditional Sequence Generative Adversarial Nets”, Yang et al 2017

“Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets”, Zhen Yang, Wei Chen, Feng Wang, Bo Xu (2017-03-15):

This paper proposes an approach for applying GANs to NMT. We build a conditional sequence generative adversarial net which comprises two adversarial sub-models, a generator and a discriminator. The generator aims to generate sentences which are hard to discriminate from human-translated sentences (ie. the golden target sentences), and the discriminator makes efforts to discriminate the machine-generated sentences from human-translated ones. The two sub-models play a mini-max game and achieve the win-win situation when they reach a Nash Equilibrium. Additionally, the static sentence-level BLEU is utilized as the reinforced objective for the generator, which biases the generation towards high BLEU points. During training, both the dynamic discriminator and the static BLEU objective are employed to evaluate the generated sentences and feed back the evaluations to guide the learning of the generator. Experimental results show that the proposed model consistently outperforms the traditional RNNSearch and the newly emerged state-of-the-art Transformer on English-German and Chinese-English translation tasks.

“Learned Optimizers That Scale and Generalize”, Wichrowska et al 2017

“Learned Optimizers that Scale and Generalize”, Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas et al (2017-03-14):

Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/​ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on. We release an open source implementation of the meta-training algorithm.

“Parallel Multiscale Autoregressive Density Estimation”, Reed et al 2017

“Parallel Multiscale Autoregressive Density Estimation”, Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, Nando de Freitas et al (2017-03-10):

PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel; 𝑂(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially.

In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and orders of magnitude speedup (𝑂(log N) sampling instead of 𝑂(N)), enabling the practical generation of 512×512 images.

We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.

“Tracking the World State With Recurrent Entity Networks”, Henaff et al 2017

“Tracking the World State with Recurrent Entity Networks”, Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, Yann LeCun (2017-03-03):

A new memory-augmented model which learns to track the world state, obtaining SOTA on the bAbI tasks amongst other results.

We introduce a new model, the Recurrent Entity Network (EntNet). It is equipped with a dynamic long-term memory which allows it to maintain and update a representation of the state of the world as it receives new data. For language understanding tasks, it can reason on-the-fly as it reads text, not just when it is required to answer a question or respond as is the case for a Memory Network (Sukhbaatar et al 2015). Like a Neural Turing Machine or Differentiable Neural Computer (Graves et al 2014; 2016) it maintains a fixed size memory and can learn to perform location and content-based read and write operations. However, unlike those models it has a simple parallel architecture in which several memory locations can be updated simultaneously. The EntNet sets a new state-of-the-art on the bAbI tasks, and is the first method to solve all the tasks in the 10k training examples setting. We also demonstrate that it can solve a reasoning task which requires a large number of supporting facts, which other methods are not able to solve, and can generalize past its training horizon. It can also be practically used on large scale datasets such as Children’s Book Test, where it obtains competitive performance, reading the story in a single pass.

[Keywords: Natural language processing, Deep learning]

“Optimization As a Model for Few-Shot Learning”, Ravi & Larochelle 2017

“Optimization as a Model for Few-Shot Learning”, Sachin Ravi, Hugo Larochelle (2017-03-01):

We propose an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural network in the few-shot regime.

Though deep neural networks have shown great success in the large data domain, they generally perform poorly on few-shot learning tasks, where a model has to quickly generalize after seeing very few examples from each class. The general belief is that gradient-based optimization in high capacity models requires many iterative steps over many examples to perform well. Here, we propose an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural network in the few-shot regime. The parameterization of our model allows it to learn appropriate parameter updates specifically for the scenario where a set amount of updates will be made, while also learning a general initialization of the learner network that allows for quick convergence of training. We demonstrate that this meta-learning model is competitive with deep metric-learning techniques for few-shot learning.

“Neural Combinatorial Optimization With Reinforcement Learning”, Bello et al 2017

“Neural Combinatorial Optimization with Reinforcement Learning”, Irwan Bello, Hieu Pham, Quoc Le, Mohammad Norouzi, Samy Bengio (2017-02-17):

[Keywords: neural combinatorial optimization, reinforcement learning]

We present a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent neural network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent neural network using a policy gradient method. Without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. These results, albeit still quite far from state-of-the-art, give insights into how neural networks can be used as a general tool for tackling combinatorial optimization problems.
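The reward signal described above (negative tour length) is easy to make concrete. The sketch below computes it for candidate permutations and subtracts a mean baseline to form the advantages that would weight the policy-gradient update. The baseline choice, the function names, and the unit-square instance are assumptions for illustration; the recurrent policy itself is omitted.

```python
import math


def tour_length(coords, perm):
    """Length of the closed tour visiting coords in the order given by perm."""
    n = len(perm)
    return sum(
        math.dist(coords[perm[i]], coords[perm[(i + 1) % n]]) for i in range(n)
    )


def reward(coords, perm):
    """Negative tour length, so that shorter tours earn higher reward."""
    return -tour_length(coords, perm)


def advantages(rewards):
    """Reward minus a mean baseline (an assumed, simple variance-reduction choice)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four cities on the corners of a unit square (hypothetical instance).
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
sampled_tours = [[0, 1, 2, 3], [0, 2, 1, 3]]  # stand-ins for samples from the policy
rewards = [reward(coords, p) for p in sampled_tours]
adv = advantages(rewards)
print(rewards, adv)
```

In a policy-gradient method, each sampled tour's log-probability gradient would be scaled by its advantage, so the perimeter tour (positive advantage) is made more likely and the crossing tour less likely.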

“Tuning Recurrent Neural Networks With Reinforcement Learning”, Jaques et al 2017

“Tuning Recurrent Neural Networks with Reinforcement Learning”, Natasha Jaques, Shixiang Gu, Richard E. Turner, Douglas Eck (2017-02-14):

New method for refining an RNN using Reinforcement Learning by penalizing KL-divergence from the policy of an RNN pre-trained on data using maximum likelihood.

The approach of training sequence models using supervised learning and next-step prediction suffers from known failure modes. For example, it is notoriously difficult to ensure multi-step generated sequences have coherent global structure. We propose a novel sequence-learning approach in which we use a pre-trained Recurrent Neural Network (RNN) to supply part of the reward value in a Reinforcement Learning (RL) model. Thus, we can refine a sequence predictor by optimizing for some imposed reward functions, while maintaining good predictive properties learned from data. We propose efficient ways to solve this by augmenting deep Q-learning with a cross-entropy reward and deriving novel off-policy methods for RNNs from KL control. We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using our method and rules of music theory. We show that by combining maximum likelihood (ML) and RL in this way, we can not only produce more pleasing melodies, but significantly reduce unwanted behaviors and failure modes of the RNN, while maintaining information learned from data.

[Keywords: Deep learning, Supervised Learning, Reinforcement Learning, Applications, Structured prediction]

“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, Shazeer et al 2017

“Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean (2017-01-23):

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000× improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
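A sparse combination of experts can be sketched with plain top-k gating: keep the k largest gate logits, apply a softmax over just those, and evaluate only the selected experts. This is a simplified illustration; the toy scalar experts, the fixed gate logits, and the function names are assumptions, and any additional gating details from the paper body are not reproduced here.

```python
import math


def top_k_gate(logits, k):
    """Sparse gate weights: softmax over the k largest logits, zero elsewhere."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts; unselected experts are never evaluated."""
    gates = top_k_gate(gate_logits, k)
    out = 0.0
    for g, expert in zip(gates, experts):
        if g > 0.0:
            out += g * expert(x)
    return out, gates


# Toy experts (hypothetical); in the paper these are feed-forward sub-networks,
# and the gate logits come from a trainable gating network.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
y, gates = moe_forward(3.0, experts, gate_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(y, gates)
```

Because only k experts run per example, total parameter count can grow with the number of experts while per-example computation stays roughly fixed.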

“Neural Data Filter for Bootstrapping Stochastic Gradient Descent”, Fan et al 2017

“Neural Data Filter for Bootstrapping Stochastic Gradient Descent”, Yang Fan, Fei Tian, Tao Qin, Tie-Yan Liu (2017-01-20):

We propose a reinforcement learning based teacher-student framework for filtering training data to boost SGD convergence.

Mini-batch based Stochastic Gradient Descent (SGD) has been widely used to train deep neural networks efficiently. In this paper, we design a general framework to automatically and adaptively select training data for SGD. The framework is based on neural networks and we call it Neural Data Filter (NDF). In Neural Data Filter, the whole training process of the original neural network is monitored and supervised by a deep reinforcement network, which controls whether to filter some data in sequentially arrived mini-batches so as to maximize future accumulative reward (eg. validation accuracy). The SGD process accompanied with NDF is able to use less data and converge faster while achieving accuracy comparable to standard SGD trained on the full dataset. Our experiments show that NDF bootstraps SGD training for different neural network models including Multi Layer Perceptron Network and Recurrent Neural Network trained on various types of tasks including image classification and text understanding.

[Keywords: Reinforcement Learning, Deep learning, Optimization]

“Improving Neural Language Models With a Continuous Cache”, Grave et al 2016

“Improving Neural Language Models with a Continuous Cache”, Edouard Grave, Armand Joulin, Nicolas Usunier (2016-12-13):

We propose an extension to neural network language models to adapt their prediction to the recent history.

Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models.

We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.
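The cache mechanism described above can be sketched in a few lines: past hidden states are stored alongside the word that followed them, scored by a dot product with the current hidden state, and turned into a distribution over those words. The scaling parameter `theta`, the linear interpolation weight `lam`, and the toy vectors are assumptions for illustration, not values from the abstract.

```python
import math


def cache_distribution(history, h_t, vocab_size, theta=1.0):
    """history holds (hidden_state, next_word_id) pairs stored from earlier steps."""
    scores = [theta * sum(a * b for a, b in zip(h, h_t)) for h, _ in history]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    p = [0.0] * vocab_size
    for e, (_, word) in zip(exps, history):
        p[word] += e / z  # mass accumulates on words seen after similar states
    return p


def interpolate(p_model, p_cache, lam):
    """Assumed combination rule: linear mix of model and cache distributions."""
    return [(1.0 - lam) * pm + lam * pc for pm, pc in zip(p_model, p_cache)]


# Toy example with a 4-word vocabulary and 2-dimensional hidden states.
history = [([1.0, 0.0], 2), ([0.0, 1.0], 0), ([1.0, 0.0], 2)]
h_t = [1.0, 0.0]
p_model = [0.25, 0.25, 0.25, 0.25]
p_cache = cache_distribution(history, h_t, vocab_size=4)
p_mixed = interpolate(p_model, p_cache, lam=0.5)
print(p_cache, p_mixed)
```

Nothing in the cache is trained: it only reuses hidden states the language model already computes, which is why it can scale to large memory sizes cheaply.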

“RL2: Fast Reinforcement Learning via Slow Reinforcement Learning”, Duan et al 2016

“RL2: Fast Reinforcement Learning via Slow Reinforcement Learning”, Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel (2016-11-09):

Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap.

Rather than designing a “fast” reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL2, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose (“slow”) RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the “fast” RL algorithm on the current (previously unseen) MDP.

We evaluate RL2 experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL2 is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL2 on a vision-based navigation task and show that it scales up to high-dimensional problems.

“DeepCoder: Learning to Write Programs”, Balog et al 2016

“DeepCoder: Learning to Write Programs”, Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, Daniel Tarlow (2016-11-07):

We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network’s predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.

“Neural Architecture Search With Reinforcement Learning”, Zoph & Le 2016

“Neural Architecture Search with Reinforcement Learning”, Barret Zoph, Quoc V. Le (2016-11-05):

Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65%, which is 0.09% better and 1.05× faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.

“Achieving Human Parity in Conversational Speech Recognition”, Xiong et al 2016

“Achieving Human Parity in Conversational Speech Recognition”, W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig (2016-10-17):

Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively.

The key to our system’s performance is the use of various convolutional and LSTM acoustic model architectures, combined with a novel spatial smoothing method and lattice-free MMI acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination.

“HyperNetworks”, Ha et al 2016

“HyperNetworks”, David Ha, Andrew Dai, Quoc V. Le (2016-09-27):

This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. Hypernetworks provide an abstraction that is similar to what is found in nature: the relationship between a genotype (the hypernetwork) and a phenotype (the main network). Though they are also reminiscent of HyperNEAT in evolution, our hypernetworks are trained end-to-end with backpropagation and thus are usually faster. The focus of this work is to make hypernetworks useful for deep convolutional networks and long recurrent networks, where hypernetworks can be viewed as a relaxed form of weight-sharing across layers. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve near state-of-the-art results on a variety of sequence modelling tasks including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks. Our results also show that hypernetworks applied to convolutional networks still achieve respectable results for image recognition tasks compared to state-of-the-art baseline models while requiring fewer learnable parameters.
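The core idea, one network emitting the weights of another, can be shown with a deliberately tiny sketch. Here the hypernetwork is a single linear projection from a per-layer embedding `z` to a flattened weight matrix; the linear form, the names `proj` and `z`, and the sizes are assumptions for illustration rather than the architecture used in the paper.

```python
def hypernetwork(z, proj, out_dim, in_dim):
    """Linear hypernetwork: flat = proj @ z, reshaped to an out_dim x in_dim matrix."""
    flat = [sum(p * zi for p, zi in zip(row, z)) for row in proj]
    return [flat[r * in_dim:(r + 1) * in_dim] for r in range(out_dim)]


def main_layer(W, x):
    """Main-network layer y = W x, using weights produced by the hypernetwork."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Shared projection (hypothetical values): 4 generated weights from a 2-d embedding.
proj = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

# Two layers share proj but have different embeddings, so their weights differ.
W_a = hypernetwork([2.0, 3.0], proj, out_dim=2, in_dim=2)
W_b = hypernetwork([1.0, 0.0], proj, out_dim=2, in_dim=2)
y_a = main_layer(W_a, [1.0, 1.0])
y_b = main_layer(W_b, [4.0, 7.0])
print(W_a, y_a, W_b, y_b)
```

Only `proj` and the embeddings would be learned; the main network's weights are a function of them. This is one way to read "relaxed weight-sharing": layers share the generator but not the generated weights.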

“Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Wu et al 2016

“Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al (2016-09-26):

Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units (“wordpieces”) for both input and output. This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.
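Length normalization in beam search can be illustrated as follows. The abstract does not state the formula; the penalty below, lp(Y) = (5 + |Y|)^α / (5 + 1)^α, is the form usually attributed to the body of the GNMT paper and should be treated as an assumption here, as should α = 0.6 and the toy hypotheses. The coverage penalty is omitted.

```python
def length_penalty(length, alpha=0.6):
    """Assumed penalty form: ((5 + length) ** alpha) / ((5 + 1) ** alpha)."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)


def normalized_score(log_prob, length, alpha=0.6):
    """Divide the (negative) log-probability by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# (name, total log-probability, length in tokens); hypothetical beam candidates.
hypotheses = [("short", -4.0, 3), ("longer", -5.0, 8)]

best_raw = max(hypotheses, key=lambda h: h[1])[0]
best_norm = max(hypotheses, key=lambda h: normalized_score(h[1], h[2]))[0]
print(best_raw, best_norm)
```

Without normalization, beam search favors short outputs because every extra token adds a negative log-probability term; dividing by a penalty that grows with length lets a longer, more complete hypothesis win.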

“Pointer Sentinel Mixture Models”, Merity et al 2016

“Pointer Sentinel Mixture Models”, Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher (2016-09-26):

Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.

“Deep Learning Human Mind for Automated Visual Classification”, Spampinato et al 2016

“Deep Learning Human Mind for Automated Visual Classification”, Concetto Spampinato, Simone Palazzo, Isaak Kavasidis, Daniela Giordano, Mubarak Shah, Nasim Souly (2016-09-01):

What if we could effectively read the mind and transfer human visual capabilities to computer vision methods? In this paper, we aim at addressing this question by developing the first visual object classifier driven by human brain signals. In particular, we employ EEG data evoked by visual object stimuli combined with Recurrent Neural Networks (RNN) to learn a discriminative brain activity manifold of visual categories. Afterwards, we train a Convolutional Neural Network (CNN)-based regressor to project images onto the learned manifold, thus effectively allowing machines to employ human brain-based features for automated visual classification. We use a 32-channel EEG to record brain activity of seven subjects while looking at images of 40 ImageNet object classes. The proposed RNN based approach for discriminating object classes using brain signals reaches an average accuracy of about 40%, which outperforms existing methods attempting to learn EEG visual object representations. As for automated object categorization, our human brain-driven approach obtains competitive performance, comparable to those achieved by powerful CNN models, both on ImageNet and CalTech 101, thus demonstrating its classification and generalization capabilities. This gives us a real hope that, indeed, human mind can be read and transferred to machines.

“Decoupled Neural Interfaces Using Synthetic Gradients”, Jaderberg et al 2016

“Decoupled Neural Interfaces using Synthetic Gradients”, Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, Koray Kavukcuoglu et al (2016-08-18):

Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules by introducing a model of the future computation of the network graph. These models predict what the result of the modelled subgraph will produce using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously i.e. we realise decoupled neural interfaces. We show results for feed-forward models, where every layer is trained asynchronously, recurrent neural networks (RNNs) where predicting one’s future gradient extends the time over which the RNN can effectively model, and also a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass—amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.

“Learning to Learn by Gradient Descent by Gradient Descent”, Andrychowicz et al 2016

“Learning to learn by gradient descent by gradient descent”, Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford et al (2016-06-14):

The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.

“Programming With a Differentiable Forth Interpreter”, Bošnjak et al 2016

“Programming with a Differentiable Forth Interpreter”, Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel (2016-05-21):

Given that in practice training data is scarce for all but a small set of problems, a core question is how to incorporate prior knowledge into a model. In this paper, we consider the case of prior procedural knowledge for neural networks, such as knowing how a program should traverse a sequence, but not what local actions should be performed at each step. To this end, we present an end-to-end differentiable interpreter for the programming language Forth which enables programmers to write program sketches with slots that can be filled with behaviour trained from program input-output data. We can optimize this behaviour directly through gradient descent techniques on user-specified objectives, and also integrate the program into any larger neural computation graph. We show empirically that our interpreter is able to effectively leverage different levels of prior program structure and learn complex behaviours such as sequence sorting and addition. When connected to outputs of an LSTM and trained jointly, our interpreter achieves state-of-the-art accuracy for end-to-end reasoning about quantities expressed in natural language stories.

“Training Deep Nets With Sublinear Memory Cost”, Chen et al 2016

“Training Deep Nets with Sublinear Memory Cost”, Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin (2016-04-21):

We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs 𝑂(√n) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research.

We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory—giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to 𝑂(log n) with as little as 𝑂(n log n) extra cost for forward computation.

Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48GB to 7GB with only 30% additional running time cost on ImageNet problems. Similarly, substantial memory cost reduction is observed in training complex recurrent neural networks on very long sequences.
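A rough way to see where the 𝑂(√n) figure can come from is a toy cost model: split an n-layer chain into segments of k layers, keep only one activation per segment boundary, and recompute a single segment at a time during the backward pass. The sketch below assumes unit-size activations and a plain chain network; the function names are hypothetical and this is not the paper's implementation.

```python
import math

def checkpoint_memory(n_layers: int, segment_len: int) -> int:
    """Peak number of stored activations under a simplified unit-cost model (an assumption)."""
    n_checkpoints = math.ceil(n_layers / segment_len)  # one stored activation per segment
    return n_checkpoints + segment_len                 # plus one segment recomputed for backward

def best_segment_len(n_layers: int) -> int:
    """Segment length that minimizes the modelled peak memory (first minimizer found)."""
    return min(range(1, n_layers + 1), key=lambda k: checkpoint_memory(n_layers, k))

n = 1000
k = best_segment_len(n)
print(f"segment length {k}: {checkpoint_memory(n, k)} activations, versus about {n} when storing every layer")
```

Under this model the total is roughly n/k + k, which is smallest when k is close to √n, giving about 2√n stored activations in exchange for one extra forward pass.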

“Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex”, Liao & Poggio 2016

“Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex”, Qianli Liao, Tomaso Poggio (2016-04-13):

We discuss relations between Residual Networks (ResNet), Recurrent Neural Networks (RNNs) and the primate visual cortex. We begin with the observation that a special type of shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such an RNN, although having orders of magnitude fewer parameters, leads to performance similar to the corresponding ResNet. We propose (1) a generalization of both RNN and ResNet architectures and (2) the conjecture that a class of moderately deep RNNs is a biologically-plausible model of the ventral stream in visual cortex. We demonstrate the effectiveness of the architectures by testing them on the CIFAR-10 and ImageNet datasets.
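The equivalence claimed in the abstract can be illustrated with a minimal sketch: a residual block applied with the same weights at every layer computes exactly what a single recurrent layer with an identity skip computes when unrolled for the same number of steps. The tanh nonlinearity, the 2-dimensional state, and feeding the input only as the initial state are illustrative assumptions, not details taken from the paper.

```python
import math

def residual_block(h, W):
    """Return h + tanh(W h): one residual unit (tanh is an illustrative choice)."""
    Wh = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]
    return [h_i + math.tanh(a) for h_i, a in zip(h, Wh)]

def deep_resnet(x, weights):
    """Feed-forward ResNet: one weight matrix per layer."""
    h = x
    for W in weights:
        h = residual_block(h, W)
    return h

def shallow_rnn(x, W, steps):
    """A single recurrent layer with an identity skip, unrolled for `steps` time steps."""
    h = x
    for _ in range(steps):
        h = residual_block(h, W)
    return h

W = [[0.5, -0.2], [0.1, 0.3]]   # hypothetical weights
x = [1.0, -1.0]
depth = 10
resnet_out = deep_resnet(x, [W] * depth)   # 10 layers, all sharing W
rnn_out = shallow_rnn(x, W, depth)         # 1 layer, 10 time steps
print(resnet_out == rnn_out)
```

The tied ResNet nominally has `depth` weight matrices, but only one set of distinct parameters, which is where the parameter saving comes from.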

“Improving Sentence Compression by Learning to Predict Gaze”, Klerke et al 2016

“Improving sentence compression by learning to predict gaze”, Sigrid Klerke, Yoav Goldberg, Anders Søgaard (2016-04-12):

We show how eye-tracking corpora can be used to improve sentence compression models, presenting a novel multi-task learning algorithm based on multi-layer LSTMs. We obtain performance competitive with or better than state-of-the-art approaches.

“Adaptive Computation Time for Recurrent Neural Networks”, Graves 2016

“Adaptive Computation Time for Recurrent Neural Networks”, Alex Graves (2016-03-29):

This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers. Overall, performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. We also present character-level language modelling results on the Hutter prize Wikipedia dataset. In this case ACT does not yield large gains in performance; however it does provide intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.
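As a minimal sketch of the halting idea, assuming a sigmoid halting unit read off the state and a small threshold `eps`: the network keeps updating its state for the current input until the accumulated halting probability reaches 1 − `eps`, and the state passed on is the halting-weighted mean of the intermediate states. The names `act_step`, `step_fn`, `halt_w`, and `halt_b` are hypothetical, and the real model learns these weights by backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def act_step(state, step_fn, halt_w, halt_b, eps=0.01, max_steps=100):
    """Ponder on one input: iterate step_fn until cumulative halting probability >= 1 - eps."""
    states, weights = [], []
    cumulative = 0.0
    for n in range(max_steps):
        state = step_fn(state)
        # Halting unit: a sigmoid of a linear readout of the current state.
        h = sigmoid(sum(w * s for w, s in zip(halt_w, state)) + halt_b)
        states.append(state)
        if cumulative + h >= 1.0 - eps or n == max_steps - 1:
            weights.append(1.0 - cumulative)  # remainder, so the weights sum to 1
            break
        weights.append(h)
        cumulative += h
    dim = len(states[0])
    # Weighted mean of the intermediate states, which keeps everything differentiable.
    mean_state = [sum(p * s[i] for p, s in zip(weights, states)) for i in range(dim)]
    ponder_cost = len(states) + weights[-1]  # number of updates plus the remainder
    return mean_state, ponder_cost

# Hypothetical usage: a toy update rule and a halting unit that always outputs 0.5.
mean_state, cost = act_step([0.0], lambda s: [v + 1.0 for v in s], halt_w=[0.0], halt_b=0.0)
print(mean_state, cost)
```

Adding the ponder cost to the training loss is what pushes the network to spend extra steps only where they help.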

“PlaNet—Photo Geolocation With Convolutional Neural Networks”, Weyand et al 2016

“PlaNet—Photo Geolocation with Convolutional Neural Networks”, Tobias Weyand, Ilya Kostrikov, James Philbin (2016-02-17):

Is it possible to build a system to determine the location where a photo was taken using just its pixels? In general, the problem seems exceptionally difficult: it is trivial to construct situations where no location can be inferred. Yet images often contain informative cues such as landmarks, weather patterns, vegetation, road markings, and architectural details, which in combination may allow one to determine an approximate location and occasionally an exact location. Websites such as GeoGuessr and View from your Window suggest that humans are relatively good at integrating these cues to geolocate images, especially en-masse. In computer vision, the photo geolocation problem is usually approached using image retrieval methods. In contrast, we pose the problem as one of classification by subdividing the surface of the earth into thousands of multi-scale geographic cells, and train a deep network using millions of geotagged images. While previous approaches only recognize landmarks or perform approximate matching using global image descriptors, our model is able to use and integrate multiple visible cues. We show that the resulting model, called PlaNet, outperforms previous approaches and even attains superhuman levels of accuracy in some cases. Moreover, we extend our model to photo albums by combining it with a long short-term memory (LSTM) architecture. By learning to exploit temporal coherence to geolocate uncertain photos, we demonstrate that this model achieves a 50% performance improvement over the single-image model.

“Exploring the Limits of Language Modeling”, Jozefowicz et al 2016

“Exploring the Limits of Language Modeling”, Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu (2016-02-07):

In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language. We perform an exhaustive study on techniques such as character Convolutional Neural Networks or Long-Short Term Memory, on the One Billion Word Benchmark. Our best single model significantly improves state-of-the-art perplexity from 51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20), while an ensemble of models sets a new record by improving perplexity from 41.0 down to 23.7. We also release these models for the NLP and ML community to study and improve upon.

“Pixel Recurrent Neural Networks”, Oord et al 2016

“Pixel Recurrent Neural Networks”, Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu (2016-01-25):

Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.

“Deep-Spying: Spying Using Smartwatch and Deep Learning”, Beltramelli & Risi 2015

“Deep-Spying: Spying using Smartwatch and Deep Learning”, Tony Beltramelli, Sebastian Risi (2015-12-17):

Wearable technologies are today on the rise, becoming more common and broadly available to mainstream users. In fact, wristband and armband devices such as smartwatches and fitness trackers have already taken an important place in the consumer electronics market and are becoming ubiquitous. By their very nature of being wearable, these devices, however, provide a new pervasive attack surface threatening users’ privacy, among others.

In the meantime, advances in machine learning are providing unprecedented possibilities to process complex data efficiently, allowing patterns to emerge from high-dimensional, unavoidably noisy data.

The goal of this work is to raise awareness about the potential risks related to motion sensors built into wearable devices and to demonstrate abuse opportunities leveraged by advanced neural network architectures.

The LSTM-based implementation presented in this research can perform touchlogging and keylogging on 12-key keypads with above-average accuracy even when confronted with raw unprocessed data. This demonstrates that deep neural networks can make keystroke inference attacks based on motion sensors easier to achieve by removing the need for non-trivial pre-processing pipelines and carefully engineered feature extraction strategies. Our results suggest that the complete technological ecosystem of a user can be compromised when a wearable wristband device is worn.

“On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models”, Schmidhuber 2015

“On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models”, Juergen Schmidhuber (2015-11-30):

This paper addresses the general problem of reinforcement learning (RL) in partially observable environments. In 2013, our large RL recurrent neural networks (RNNs) learned from scratch to drive simulated cars from high-dimensional video input. However, real brains are more powerful in many ways. In particular, they learn a predictive model of their initially unknown environment, and somehow use it for abstract (eg. hierarchical) planning and reasoning. Guided by algorithmic information theory, we describe RNN-based AIs (RNNAIs) designed to do the same. Such an RNNAI can be trained on never-ending sequences of tasks, some of them provided by the user, others invented by the RNNAI itself in a curious, playful fashion, to improve its RNN-based world model. Unlike our previous model-building RNN-based RL machines dating back to 1990, the RNNAI learns to actively query its model for abstract reasoning and planning and decision making, essentially “learning to think.” The basic ideas of this report can be applied to many other cases where one RNN-like system exploits the algorithmic information content of another. They are taken from a grant proposal submitted in Fall 2014, and also explain concepts such as “mirror neurons.” Experimental results will be described in separate papers.

“Sequence Level Training With Recurrent Neural Networks”, Ranzato et al 2015

“Sequence Level Training with Recurrent Neural Networks”, Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba (2015-11-20):

Many natural language processing applications use language models to generate text. These models are typically trained to predict the next word in a sequence, given the previous words and some context such as an image. However, at test time the model is expected to generate the entire sequence from scratch. This discrepancy makes generation brittle, as errors may accumulate along the way.

We address this issue by proposing a novel sequence level training algorithm that directly optimizes the metric used at test time, such as BLEU or ROUGE. On three different tasks, our approach outperforms several strong baselines for greedy generation. The method is also competitive when these baselines employ beam search, while being several times faster.

“Generative Concatenative Nets Jointly Learn to Write and Classify Reviews”, Lipton et al 2015

“Generative Concatenative Nets Jointly Learn to Write and Classify Reviews”, Zachary C. Lipton, Sharad Vikram, Julian McAuley (2015-11-11):

A recommender system’s basic task is to estimate how users will respond to unseen items. This is typically modeled in terms of how a user might rate a product, but here we aim to extend such approaches to model how a user would write about the product. To do so, we design a character-level Recurrent Neural Network (RNN) that generates personalized product reviews. The network convincingly learns styles and opinions of nearly 1000 distinct authors, using a large corpus of reviews. It also tailors reviews to describe specific items, categories, and star ratings.

Using a simple input replication strategy, the Generative Concatenative Network (GCN) preserves the signal of static auxiliary inputs across wide sequence intervals. Without any additional training, the generative model can classify reviews, identifying the author of the review, the product category, and the sentiment (rating), with remarkable accuracy. Our evaluation shows the GCN captures complex dynamics in text, such as the effect of negation, misspellings, slang, and large vocabularies gracefully absent any machinery explicitly dedicated to the purpose.

“The Unreasonable Effectiveness of Recurrent Neural Networks”, Karpathy 2015

“The Unreasonable Effectiveness of Recurrent Neural Networks”, Andrej Karpathy (2015-05-21):

[Exploration of char-RNN neural nets for generating text. Karpathy codes a simple recurrent NN which generates character-by-character, and discovers that it is able to generate remarkably plausible text (at the syntactic level) for Paul Graham⁠, Shakespeare, Wikipedia, LaTeX, Linux C code, and baby names—all using the same generic architecture. Visualizing the internal activity of the char-RNNs, they seem to be genuinely understanding some of the recursive syntactic structure of the text in a way that other text-generation methods like n-grams cannot. Inspired by this post, I began tinkering with char-RNNs for poetry myself; as of 2019, char-RNNs have been largely obsoleted by the new Transformer architecture⁠, but recurrency will make a comeback and Karpathy’s post is still a valuable and fun read.]

There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you. We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

“Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, Bluche 2015

2015-bluche.pdf: “Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, Théodore Bluche (2015-05-13):

The automatic transcription of text in handwritten documents has many applications, from automatic document processing, to indexing and document understanding.

One of the most popular approaches nowadays consists in scanning the text line image with a sliding window, from which features are extracted, and modeled by Hidden Markov Models (HMMs). Associated with neural networks, such as Multi-Layer Perceptrons (MLPs) or Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs), and with a language model, these models yield good transcriptions. On the other hand, in many machine learning applications, including speech recognition and computer vision, deep neural networks consisting of several hidden layers recently produced a large reduction of error rates.

In this thesis, we have conducted a thorough study of different aspects of optical models based on deep neural networks in the hybrid neural network / HMM scheme, in order to better understand and evaluate their relative importance.

  1. First, we show that deep neural networks produce consistent and large improvements over networks with one or 2 hidden layers, independently of the kind of neural network, MLP or RNN, and of input, handcrafted features or pixels.
  2. Then, we show that deep neural networks with pixel inputs compete with those using handcrafted features, and that depth plays an important role in the reduction of the performance gap between the 2 kinds of inputs, supporting the idea that deep neural networks effectively build hierarchical and relevant representations of their inputs, and that features are automatically learnt on the way.
  3. Despite the dominance of LSTM-RNNs in the recent literature of handwriting recognition, we show that deep MLPs achieve comparable results. Moreover, we evaluated different training criteria. With sequence-discriminative training, we report similar improvements for MLP/​HMMs as those observed in speech recognition.
  4. We also show how the Connectionist Temporal Classification framework is especially suited to RNNs.
  5. Finally, the novel dropout technique to regularize neural networks was recently applied to LSTM-RNNs. We tested its effect at different positions in LSTM-RNNs, thus extending previous works, and we show that its relative position to the recurrent connections is important.

We conducted the experiments on 3 public databases, representing 2 languages (English and French) and 2 epochs, using different kinds of neural network inputs: handcrafted features and pixels. We validated our approach by taking part in the HTRtS contest in 2014.

The results of the final systems presented in this thesis, namely MLPs and RNNs, with handcrafted feature or pixel inputs, are comparable to the state-of-the-art on Rimes and IAM. Moreover, the combination of these systems outperformed all published results on the considered databases.

[Keywords: pattern recognition, Hidden Markov Models, neural networks, hand-writing recognition]

“End-To-End Memory Networks”, Sukhbaatar et al 2015

“End-To-End Memory Networks”, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus (2015-03-31):

We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network (Weston et al 2015) but unlike the model in that work, it is trained end-to-end, and hence requires statistically-significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.

“DRAW: A Recurrent Neural Network For Image Generation”, Gregor et al 2015

“DRAW: A Recurrent Neural Network For Image Generation”, Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra (2015-02-16):

This paper introduces the Deep Recurrent Attentive Writer (DRAW) neural network architecture for image generation.

DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows for the iterative construction of complex images.

The system substantially improves on the state of the art for generative models on MNIST, and, when trained on the Street View House Numbers dataset, it generates images that cannot be distinguished from real data with the naked eye.

“Learning to Execute”, Zaremba & Sutskever 2014

“Learning to Execute”, Wojciech Zaremba, Ilya Sutskever (2014-10-17):

Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks.

We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.

“One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling”, Chelba et al 2013

“One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling”, Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, Tony Robinson (2013-12-11):

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline.

The benchmark is available as a project; besides the scripts needed to rebuild the training/​held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
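The two figures quoted in the abstract (a 35% reduction in perplexity and a 10% reduction in cross-entropy) describe the same improvement, because cross-entropy in bits per word is the base-2 logarithm of perplexity. A small check using only the baseline perplexity given above:

```python
import math

baseline_ppl = 67.6                        # Kneser-Ney 5-gram perplexity quoted in the abstract
improved_ppl = baseline_ppl * (1 - 0.35)   # after a 35% perplexity reduction

baseline_bits = math.log2(baseline_ppl)    # cross-entropy in bits per word
improved_bits = math.log2(improved_ppl)
relative_drop = 1 - improved_bits / baseline_bits

print(f"{baseline_bits:.2f} -> {improved_bits:.2f} bits per word ({relative_drop:.1%} lower)")
```

The logarithm compresses the gain, so a large relative drop in perplexity shows up as a relative drop of about 10% in bits.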

“Generating Sequences With Recurrent Neural Networks”, Graves 2013

“Generating Sequences With Recurrent Neural Networks”, Alex Graves (2013-08-04):

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are real-valued). It is then extended to handwriting synthesis by allowing the network to condition its predictions on a text sequence. The resulting system is able to generate highly realistic cursive handwriting in a wide variety of styles.

“Large Language Models in Machine Translation”, Brants et al 2007

2007-brants.pdf#google: “Large Language Models in Machine Translation”, Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean (2007-06):

This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams⁠. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large datasets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.
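A minimal sketch of the Stupid Backoff idea: score a word by its relative frequency after the longest context that has been seen with it, and multiply by a fixed penalty each time the context has to be shortened. The back-off factor of 0.4 is the value I recall the paper using and is not stated in the abstract above, so treat it as an assumption; the helper names and toy corpus are hypothetical. The scores are not normalized probabilities, which is part of why the method is cheap to train.

```python
from collections import Counter

def ngram_counts(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total_tokens, alpha=0.4):
    """Unnormalized score S(word | context); context must be shorter than max_order."""
    context = tuple(context)
    penalty = 1.0
    while context:
        full = context + (word,)
        if counts[full] > 0:
            # Seen this n-gram: plain relative frequency, scaled by any back-off penalty.
            return penalty * counts[full] / counts[context]
        context = context[1:]   # drop the oldest context word
        penalty *= alpha        # fixed penalty per back-off step
    return penalty * counts[(word,)] / total_tokens  # unigram fallback

tokens = "the cat sat on the mat".split()   # toy corpus
counts = ngram_counts(tokens, max_order=3)
print(stupid_backoff("cat", ["on", "the"], counts, len(tokens)))
```

No discounting or normalization pass over the counts is needed, only the raw counts themselves, which is what makes the scheme practical at very large scale.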

Figure 5, modified by Chris Dyer in a 2020 talk: data vs translation quality (BLEU score) scaling of n-grams, and later, RNNs.

“Long Short-Term Memory”, Hochreiter & Schmidhuber 1997

1997-hochreiter.pdf: “Long Short-Term Memory”, Sepp Hochreiter, Jürgen Schmidhuber (1997-12-15):

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter’s (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM).

Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is 𝒪(1).

Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

“Flat Minima”, Hochreiter & Schmidhuber 1997b

1997-hochreiter-2.pdf: “Flat Minima”, Sepp Hochreiter, Jurgen Schmidhuber (1997):

We present a new algorithm for finding low-complexity neural networks with high generalization capability.

The algorithm searches for a “flat” minimum of the error function. A flat minimum is a large connected region in weight space where the error remains ~constant. An MDL-based, Bayesian argument suggests that flat minima correspond to “simple” networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a “good” weight prior. Instead we have a prior over input output functions, thus taking into account net architecture and training set.

Although our algorithm requires the computation of second-order derivatives, it has backpropagation’s order of complexity. Automatically, it effectively prunes units, weights, and input lines.

Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and “optimal brain surgeon/​optimal brain damage.”

“A Focused Backpropagation Algorithm for Temporal Pattern Recognition”, Mozer 1995

1995-mozer.pdf: “A Focused Backpropagation Algorithm for Temporal Pattern Recognition”, Michael C. Mozer (1995):

Time is at the heart of many pattern recognition tasks (eg. speech recognition). However, connectionist learning algorithms to date are not well-suited for dealing with time-varying input patterns.

This chapter introduces a specialized connectionist architecture and corresponding specialization of the backpropagation learning algorithm that operates efficiently, both in computational time and space requirements, on temporal sequences. The key feature of the architecture is a layer of self-connected hidden units that integrate their current value with the new input at each time step to construct a static representation of the temporal input sequence. This architecture avoids two deficiencies found in the backpropagation unfolding-in-time procedure (Rumelhart, Hinton, & Williams, 1986) for handling sequence recognition tasks: first, it reduces the difficulty of temporal credit assignment by focusing the backpropagated error signal; second, it eliminates the need for a buffer to hold the input sequence and/​or intermediate activity levels. The latter property is due to the fact that during the forward (activation) phase, incremental activity traces can be locally computed that hold all information necessary for backpropagation in time.

It is argued that this architecture should scale better than conventional recurrent architectures with respect to sequence length. The architecture has been used to implement a temporal version of Rumelhart and McClelland’s (1986) verb past-tense model. The hidden units learn to behave something like Rumelhart and McClelland’s “Wickelphones”, a rich and flexible representation of temporal information.

“Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks”, Schmidhuber 1992

1991-schmidhuber.pdf: “Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks”, Jurgen Schmidhuber (1992):

Previous algorithms for supervised sequence learning are based on dynamic recurrent networks. This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: The first net learns to produce context-dependent weight changes for the second net whose weights may vary very quickly.

The method offers the potential for STM storage efficiency: A single weight (instead of a full-fledged unit) may be sufficient for storing temporal information. Various learning methods are derived.

Two experiments with unknown time delays illustrate the approach. One experiment shows how the system can be used for adaptive temporary variable binding.

“Untersuchungen Zu Dynamischen Neuronalen Netzen [Studies of Dynamic Neural Networks]”, Hochreiter 1991

1991-hochreiter.pdf: “Untersuchungen zu dynamischen neuronalen Netzen [Studies of dynamic neural networks]”, Sepp Hochreiter (1991-06-15; backlinks):

[GPT-3 translation of German abstract]

Since the seminal article by Williams, Hinton, and Rumelhart [RHW86], backpropagation (BP) has become very popular as a learning method for neural networks with and without feedback. In contrast to many other learning methods for neural networks, BP takes into account the network structure and improves the network on the basis of this knowledge.

Since a very remote past input has to influence the present output, if it is randomly selected, this input is very unlikely to influence the present state of the network. Hence BP algorithms do not detect the fact that this input is responsible for the desired output. Therefore, it is very hard to train a network with BP algorithms to remember an input until it is needed to produce a later output. Moreover, the public BP algorithms take a very long time to compute.

In many cases, though, one needs an input sequence, as in Mozer [Moz90], where a network learns to compose music in which musical pieces are repeated and later note pitches are determined by previous note pitches. Or a network steers a vehicle in a labyrinth and obtains the error information only if the vehicle is in a dead end, so back-propagated errors are needed. If a neural network controls a robot that performs a task, perhaps some preparatory tasks are necessary whose performance the system should remember.

In this work, we investigate how to approach the problem of the long learning time associated with network inputs that are used later to control a desired output. This can be done either by means of the network architecture or by using the structure of the input sequences. In Chapter 4, a network is built so that inputs received long ago are taken into account better than in the usual network architecture. Here, ‘storage nodes’ are introduced, which can carry information across an arbitrarily long time interval. The shortening of the input sequences, while retaining all relevant information, is investigated in Chapter 3. With a shortened input sequence, the network need not look as far back into the past to recognize the relevant inputs. Chapter 1 presents the BP learning algorithms used, which are then analyzed in Chapter 2 to determine the cause of the long learning time for storing past inputs. Note that in some cases the algorithms were slightly modified to save computation time. The problem of resource acquisition that arises with the methods of Chapters 3 and 4 is addressed in Chapter 5.

The described experiments were performed on Sparc-based SUN stations. Due to time and resource constraints, algorithm comparison tests could not be carried out to the desired extent. There were trials that ran for up to a week on these machines, but other processes with higher priority were also running on these machines.

The definitions and notations in this work are not those commonly used in studies of neural networks, but they are introduced here only for this work. The reason is that there are no uniform, fundamental definitions for neural networks on which other authors would have based their work. Therefore, it is not guaranteed that there are no inconsistencies in the definitions and notations with other works.

These results were not all mathematically proven, as the work does not claim to be a mathematical analysis of neural networks. It is also difficult to find simple mathematical formalisms for neural networks. The work will rather describe ideas and approaches to see if it is possible to get a better grip on the problem of the long learning time for important previous inputs.

Besides the methods described here for learning in non-static environments, there is also the approach of the “Adaptive Critic”, as described in [Sch90a] and [Sch90c [“Recurrent networks adjusted by adaptive critics”]]. The “fast weights” approach of Schmidhuber [Sch91b] also provides a storage function, although by a completely different route than in Chapter 4, where a storage mechanism is likewise constructed.

“Finding Structure In Time”, Elman 1990

1990-elman.pdf: “Finding Structure In Time”⁠, Jeffrey L. Elman (1990-04-01; backlinks):

Time underlies many interesting human behaviors. Thus, the question of how to represent time in connectionist models is very important. One approach is to represent time implicitly by its effects on processing rather than explicitly (as in a spatial representation).

The current report develops a proposal along these lines first described by Jordan 1986 which involves the use of recurrent links in order to provide networks with a dynamic memory. In this approach, hidden unit patterns are fed back to themselves; the internal representations which develop thus reflect task demands in the context of prior internal states.

A set of simulations is reported which range from relatively simple problems (temporal version of XOR) to discovering syntactic/​semantic features for words. The networks are able to learn interesting internal representations which incorporate task demands with memory demands; indeed, in this approach the notion of memory is inextricably bound up with task processing. These representations reveal a rich structure, which allows them to be highly context-dependent, while also expressing generalizations across classes of items.

These representations suggest a method for representing lexical categories and the type/​token distinction.
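The feedback scheme described above, in which hidden unit patterns are fed back to themselves, can be sketched as a forward pass of a simple recurrent network. The weights below are set by hand rather than learned, and the tanh activation, layer sizes, and function names are assumptions for illustration only.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def elman_step(x, context, W_in, W_ctx):
    """One time step: the hidden state depends on the input and the copied-back context."""
    a = matvec(W_in, x)
    b = matvec(W_ctx, context)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def run(xs, W_in, W_ctx, n_hidden):
    """Process a sequence, copying the hidden pattern into the context units each step."""
    context = [0.0] * n_hidden   # context units start at rest
    states = []
    for x in xs:
        hidden = elman_step(x, context, W_in, W_ctx)
        states.append(hidden)
        context = hidden         # feed the hidden pattern back for the next step
    return states

# Hand-set weights: 1 input unit, 2 hidden units (illustrative values).
W_in = [[1.0], [-1.0]]
W_ctx = [[0.5, -0.5], [0.3, 0.8]]

# Both sequences end with the same input but have different histories.
seq_a = run([[0.0], [1.0]], W_in, W_ctx, 2)
seq_b = run([[1.0], [1.0]], W_in, W_ctx, 2)
```

The final hidden states of `seq_a` and `seq_b` differ even though the last input is the same, a small instance of internal representations reflecting prior internal states.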

“A Learning Algorithm for Continually Running Fully Recurrent Neural Networks”, Williams & Zipser 1989b

1989-williams-2.pdf: “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks”⁠, Ronald J. Williams, David Zipser (1989-06-01; backlinks; similar):

The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.

“A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks”, Schmidhuber 1989

1990-schmidhuber.pdf: “A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks”⁠, Jurgen Schmidhuber (1989; backlinks):

Most known learning algorithms for dynamic neural networks in non-stationary environments need global computations to perform credit assignment. These algorithms either are not local in time or not local in space. Those algorithms which are local in both time and space usually cannot deal sensibly with ‘hidden units’. In contrast, as far as we can judge, learning rules in biological systems with many ‘hidden units’ are local in both space and time.

In this paper we propose a parallel on-line learning algorithm which performs local computations only, yet still is designed to deal with hidden units and with units whose past activations are ‘hidden in time’. The approach is inspired by Holland’s idea of the bucket brigade for classifier systems, which is transformed to run on a neural network with fixed topology.

The result is a feedforward or recurrent ‘neural’ dissipative system which is consuming ‘weight-substance’ and permanently trying to distribute this substance onto its connections in an appropriate way.

Simple experiments demonstrating the feasibility of the algorithm are reported.

“Experimental Analysis of the Real-time Recurrent Learning Algorithm”, Williams & Zipser 1989

1989-williams.pdf: “Experimental Analysis of the Real-time Recurrent Learning Algorithm”⁠, Ronald J. Williams, David Zipser (1989; similar):

The real-time recurrent learning algorithm (RTRL) is a gradient-following learning algorithm for completely recurrent networks running in continually sampled time.

Here we use a series of simulation experiments to investigate the power and properties of this algorithm.

In the recurrent networks studied here, any unit can be connected to any other, and any unit can receive external input. These networks run continually in the sense that they sample their inputs on every update cycle, and any unit can have a training target on any cycle. The storage required and computation time on each step are independent of time and are completely determined by the size of the network, so no prior knowledge of the temporal structure of the task being learned is required. The algorithm is nonlocal in the sense that each unit must have knowledge of the complete recurrent weight matrix and error vector.

The algorithm is computationally intensive in sequential computers, requiring a storage capacity of the order of the third power of the number of units and a computation time on each cycle of the order of the fourth power of the number of units.

The simulations include examples in which networks are taught tasks not possible with tapped delay lines—that is, tasks that require the preservation of state over potentially unbounded periods of time. The most complex example of this kind is learning to emulate a Turing machine that does a parenthesis balancing problem. Examples are also given of networks that do feedforward computations with unknown delays, requiring them to organize into networks with the correct number of layers.

Finally, examples are given in which networks are trained to oscillate in various ways, including sinusoidal oscillation.
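The storage and time costs quoted above can be made concrete with a small sketch of one RTRL update for n fully connected units with m external inputs. The sensitivity tensor `p[k][i][j]` (the derivative of unit k's output with respect to weight w_ij) has n·n·(n+m) entries, and updating it takes n³·(n+m) multiply-adds per step, matching the third-power and fourth-power orders when m is small relative to n. The tanh activation, the timing convention, and all names are assumptions for illustration.

```python
import math

def rtrl_step(w, y, x, p):
    """One RTRL update. p[k][i][j] holds dy_k/dw_ij for the current state."""
    n = len(y)
    z = y + x                                   # unit outputs followed by external inputs
    s = [sum(w[k][l] * z[l] for l in range(len(z))) for k in range(n)]
    y_new = [math.tanh(sk) for sk in s]
    macs = 0                                    # multiply-adds spent on the sensitivities
    p_new = [[[0.0] * len(z) for _ in range(n)] for _ in range(n)]
    for k in range(n):
        fprime = 1.0 - y_new[k] ** 2            # derivative of tanh at s[k]
        for i in range(n):
            for j in range(len(z)):
                acc = z[j] if i == k else 0.0   # direct effect of w_ij on unit k
                for l in range(n):              # indirect effect through recurrent units
                    acc += w[k][l] * p[l][i][j]
                    macs += 1
                p_new[k][i][j] = fprime * acc
    return y_new, p_new, macs

def simulate(w, xs, n):
    """Run a sequence; return final outputs, sensitivities, and per-step cost."""
    m = len(xs[0])
    y = [0.0] * n
    p = [[[0.0] * (n + m) for _ in range(n)] for _ in range(n)]
    macs = 0
    for x in xs:
        y, p, macs = rtrl_step(w, y, x, p)
    return y, p, macs

n, m = 3, 1
w = [[0.1 * (k + 1) - 0.05 * l for l in range(n + m)] for k in range(n)]
xs = [[1.0], [0.5], [-0.3]]
y, p, macs_per_step = simulate(w, xs, n)
storage = n * n * (n + m)                       # entries in the sensitivity tensor
```

Neither `storage` nor `macs_per_step` depends on the sequence length, which reflects the abstract's remark that the cost on each step is determined by the size of the network alone.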

[See also Robinson & Fallside 1987, Mozer 1988/1995, Bachrach 1988, and Williams & Zipser 1989a.]

“A Sticky-Bit Approach for Learning to Represent State”, Bachrach 1988

1988-bachrach.pdf: “A Sticky-Bit Approach for Learning to Represent State”⁠, Jonathan R. Bachrach (1988-09-06; backlinks)

“The Utility Driven Dynamic Error Propagation Network (RTRL)”, Robinson & Fallside 1987

1987-robinson.pdf: “The Utility Driven Dynamic Error Propagation Network (RTRL)”⁠, A. J. Robinson, F. Fallside (1987-11-04; backlinks; similar):

[later: Williams & Zipser 1995] Error propagation networks are able to learn a variety of tasks in which a static input pattern is mapped onto a static output pattern. This paper presents a generalisation of these nets to deal with time varying, or dynamic, patterns. Three possible architectures are explored which deal with learning sequences of known finite length and sequences of unknown and possibly infinite length. Several examples are given and an application to speech coding is discussed.

A further development of dynamic nets is made which allows them to be trained by a signal which expresses the correctness of the output of the net, the utility signal. One possible architecture for such utility driven dynamic nets is given and a simple example is presented. Utility driven dynamic nets are potentially able to calculate and maximize any function of the input and output data streams, within the considered context. This is a very powerful property, and an appendix presents a comparison of the information processing in utility driven dynamic nets and that in the human brain.

Recurrent neural network


Long short-term memory


Hopfield network