“The Unreasonable Effectiveness of Recurrent Neural Networks”, (2015-05-21):
[Exploration of char-RNN neural nets for generating text. Karpathy codes a simple recurrent NN which generates character-by-character, and discovers that it is able to generate remarkably plausible text (at the syntactic level) for Paul Graham, Shakespeare, Wikipedia, LaTeX, Linux C code, and baby names—all using the same generic architecture. Visualizing the internal activity of the char-RNNs, they seem to be genuinely understanding some of the recursive syntactic structure of the text in a way that other text-generation methods like n-grams cannot. Inspired by this post, I began tinkering with char-RNNs for poetry myself; as of 2019, char-RNNs have been largely obsoleted by the new Transformer architecture, but recurrency will make a comeback and Karpathy’s post is still a valuable and fun read.]
There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”
“Learning to Execute”, (2014-10-17):
Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an to add two 9-digit numbers with 99% accuracy.
“Neural Turing Machines”, (2014-10-20):
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
“Pointer Networks”, (2015-06-09):
We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems—finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem—using training examples alone. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on. We hope our results on these tasks will encourage a broader exploration of neural learning for discrete problems.
“Massively Parallel Methods for Deep Reinforcement Learning”, (2015-07-15):
We present the first massively distributed architecture for deep reinforcement learning. This architecture uses four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed store of experience. We used our architecture to implement the Deep Q-Network algorithm (DQN). Our distributed algorithm was applied to 49 games from Atari 2600 games from the Arcade Learning Environment, using identical hyperparameters. Our performance surpassed non-distributed in 41 of the 49 games and also reduced the wall-time required to achieve these results by an order of magnitude on most games.
“Generating Sequences With Recurrent Neural Networks”, (2013-08-04):
This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are real-valued). It is then extended to handwriting synthesis by allowing the network to condition its predictions on a text sequence. The resulting system is able to generate highly realistic cursive handwriting in a wide variety of styles.
“On the Near Impossibility of Measuring the Returns to Advertising”, (2013-04-23; ):
Classical theories of the firm assume access to reliable signals to measure the causal impact of choice variables on profit. For advertising expenditure we show, using 25 online field experiments (representing $3.50$2.82013 million) with major U.S. retailers and brokerages, that this assumption typically does not hold. Statistical evidence from the randomized trials is very weak because individual-level sales are incredibly volatile relative to the per capita cost of a campaign—a “small” impact on a noisy dependent variable can generate positive returns. A concise statistical argument shows that the required sample size for an experiment to generate sufficiently informative confidence intervals is typically in excess of ten million person-weeks. This also implies that heterogeneity bias (or model misspecification) unaccounted for by observational methods only needs to explain a tiny fraction of the variation in sales to severely bias estimates. The weak informational feedback means most firms cannot even approach profit maximization.