Quadrotors are among the most agile flying robots. However, planning time-optimal trajectories at the actuation limit through multiple waypoints remains an open problem. This is crucial for applications such as inspection, delivery, search and rescue, and drone racing.
Early works used polynomial trajectory formulations, which do not exploit the full actuator potential because of their inherent smoothness. Recent works resorted to numerical optimization but require waypoints to be allocated as costs or constraints at specific discrete times. However, this time allocation is a priori unknown and renders previous works incapable of producing truly time-optimal trajectories.
To generate truly time-optimal trajectories, we propose a solution to the time allocation problem while exploiting the full quadrotor’s actuator potential. We achieve this by introducing a formulation of progress along the trajectory, which enables the simultaneous optimization of the time allocation and the trajectory itself.
We compare our method against related approaches and validate it in real-world flights in one of the world’s largest motion-capture systems, where we outperform human expert drone pilots in a drone-racing task.
2021-mirhoseini.pdf: “A graph placement methodology for fast chip design”, Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Quoc V. Le, James Laudon, Richard Ho, Roger Carpenter, Jeff Dean (backlinks)
Exploration, consolidation and planning depend on the generation of sequential state representations. However, these algorithms require disparate forms of sampling dynamics for optimal performance. We theorize how the brain should adapt internally generated sequences for particular cognitive functions and propose a neural mechanism by which this may be accomplished within the entorhinal-hippocampal circuit.
Specifically, we demonstrate that the systematic modulation along the medial entorhinal cortex dorsoventral axis of grid population input into the hippocampus facilitates a flexible generative process that can interpolate between qualitatively distinct regimes of sequential hippocampal reactivations.
By relating the emergent hippocampal activity patterns drawn from our model to empirical data, we explain and reconcile a diversity of recently observed, but apparently unrelated, phenomena such as generative cycling, diffusive hippocampal reactivations and jumping trajectory events.
[Blog] Preventing and mitigating high severity collisions is one of the main opportunities for Automated Driving Systems (ADS) to improve road safety.
This study evaluated the Waymo Driver’s performance within real-world fatal collision scenarios that occurred in a specific operational design domain (ODD). To address the rare nature of high-severity collisions, this paper describes the addition of novel techniques to established safety impact assessment methodologies.
A census of fatal, human-involved collisions was examined for years 2008 through 2017 for Chandler, AZ, which overlaps the current geographic ODD of the Waymo One fully automated ride-hailing service. Crash reconstructions were performed on all available fatal collisions that involved a passenger vehicle as one of the first collision partners and an available map in this ODD to determine the pre-impact kinematics of the vehicles involved in the original crashes. The final dataset consisted of a total of 72 crashes and 91 vehicle actors (52 initiators and 39 responders) for simulations.
Next, a novel counterfactual “what-if” simulation method was developed to synthetically replace human-driven crash participants one at a time with the Waymo Driver. This study focused on the Waymo Driver’s performance when replacing one of the first two collision partners.
The results of these simulations showed that the Waymo Driver was successful in avoiding all collisions when replacing the crash initiator, that is, the road user who made the initial, unexpected maneuver leading to a collision. Replacing the driver reacting (the responder) to the actions of the crash initiator with the Waymo Driver resulted in an estimated 82% of simulations where a collision was prevented and an additional 10% of simulations where the collision severity was mitigated (reduction in crash-level serious injury risk). The remaining 8% of simulations with the Waymo Driver in the responder role had a similar outcome to the original collision. All of these “unchanged” collisions involved both the original vehicle and the Waymo Driver being struck in the rear in a front-to-rear configuration.
These results demonstrate the potential of fully automated driving systems to improve traffic safety compared to the performance of the humans originally involved in the collisions. The findings also highlight the major importance of driving behaviors that prevent entering a conflict situation (eg. maintaining safe time gaps and not surprising other road users). However, methodological challenges in performing single instance counterfactual simulations based solely on police report data and uncertainty in ADS performance may result in variable performance, requiring additional analysis and supplemental methodologies.
This study’s methods provide insights on rare, severe events that would otherwise only be experienced after operating in extreme real-world driving distances (many billions of driving miles).
Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field.
Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly ‘remembering’ promising states and returning to such states before intentionally exploring.
Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders-of-magnitude improvements on the grand challenges of Montezuma’s Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore’s exploration efficiency and enable it to handle stochasticity throughout training.
The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration—an insight that may prove critical to the creation of truly intelligent learning agents.
While we instantaneously recognize a face as attractive, it is much harder to explain what exactly defines personal attraction. This suggests that attraction depends on implicit processing of complex, culturally and individually defined features. Generative adversarial neural networks (GANs), which learn to mimic complex data distributions, can potentially model subjective preferences unconstrained by pre-defined model parameterization.
Here, we present generative brain-computer interfaces (GBCI), couplingGANs with brain-computer interfaces. GBCI first presents a selection of images and captures personalized attractiveness reactions toward the images via electroencephalography. These reactions are then used to control a ProGAN model, finding a representation that matches the features constituting an attractive image for an individual. We conducted an experiment (N= 30) to validate GBCI using a face-generatingGAN and producing images that are hypothesized to be individually attractive. In double-blind evaluation of the GBCI-produced images against matchedcontrols, we found GBCI yielded highly accurate results.
Thus, the use of EEG responses to control aGAN presents a valid tool for interactive information-generation. Furthermore, the GBCI-derived images visually replicated known effects from social neuroscience, suggesting that the individually responsive, generative nature of GBCI provides a powerful, new tool in mapping individual differences and visualizing cognitive-affective processing.
…Thus, negative generated images were evaluated as highly attractive for other people, but not for the participant themselves. Taken together, the results suggest that the GBCI was highly accurate in generating personally attractive images (83.33%). They also show that while both negative and positive generated images were evaluated as highly attractive for the general population (respectively M = 4.43 and 4.90 on a scale of 1–5), only the positive generated images (M = 4.57) were evaluated as highly personally attractive.
Qualitative results: In semi-structured post-test interviews, participants were shown the generated images that were expected to be found attractive/ unattractive. Thematic analysis found predictions of positive attractiveness were experienced as accurate: There were no false positives (generated unattractive found personally attractive). The participants also expressed being pleased with results (eg. “Quite an ideal beauty for a male!”; “I would be really attracted to this!”; “Can I have a copy of this? It looks just like my girlfriend!”).
Naturalistic decision-making tasks modeled by a deep Q-network (DQN)
Task representations encoded in dorsal visual pathway and posterior parietal cortex
Computational principles common to both DQN and human brain are characterized
Humans possess an exceptional aptitude to efficiently make decisions from high-dimensional sensory observations. However, it is unknown how the brain compactly represents the current state of the environment to guide this process. The deep Q-network (DQN) achieves this by capturing highly nonlinear mappings from multivariate inputs to the values of potential actions. We deployed DQN as a model of brain activity and behavior in participants playing three Atari video games during fMRI. Hidden layers of DQN exhibited a striking resemblance to voxel activity in a distributed sensorimotor network, extending throughout the dorsal visual pathway into posterior parietal cortex. Neural state-space representations emerged from nonlinear transformations of the pixel space bridging perception to action and reward. These transformations reshape axes to reflect relevant high-level features and strip away information about task-irrelevant sensory features. Our findings shed light on the neural encoding of task representations for decision-making in real-world situations.
Efficiently navigating a superpressure balloon in the stratosphere requires the integration of a multitude of cues, such as wind speed and solar elevation, and the process is complicated by forecast errors and sparse wind measurements. Coupled with the need to make decisions in real time, these factors rule out the use of conventional control techniques. Here we describe the use of reinforcement learning to create a high-performing flight controller. Our algorithm uses data augmentation and a self-correcting design to overcome the key technical challenge of reinforcement learning from imperfect data, which has proved to be a major obstacle to its application to physical systems. We deployed our controller to station Loon superpressure balloons at multiple locations across the globe, including a 39-day controlled experiment over the Pacific Ocean. Analyses show that the controller outperforms Loon’s previous algorithm and is robust to the natural diversity in stratospheric winds. These results demonstrate that reinforcement learning is an effective solution to real-world autonomous control problems in which neither conventional methods nor human intervention suffice, offering clues about what may be needed to create artificially intelligent agents that continuously interact with real, dynamic environments.
Temporal difference (TD) error is a powerful teaching signal in machine learning
Teleport and speed manipulations are used to characterize dopamine signals in mice
Slowly ramping as well as phasic dopamine responses convey TD errors
Dopamine neurons compute TD error or changes in value on a moment-by-moment basis
Rapid phasic activity of midbrain dopamine neurons is thought to signal reward prediction errors (RPEs), resembling temporal difference errors used in machine learning. However, recent studies describing slowly increasing dopamine signals have instead proposed that they represent state values and arise independent from somatic spiking activity. Here we developed experimental paradigms using virtual reality that disambiguate RPEs from values. We examined dopamine circuit activity at various stages, including somatic spiking, calcium signals at somata and axons, and striatal dopamine concentrations. Our results demonstrate that ramping dopamine signals are consistent with RPEs rather than value, and this ramping is observed at all stages examined. Ramping dopamine signals can be driven by a dynamic stimulus that indicates a gradual approach to a reward. We provide a unified computational understanding of rapid phasic and slowly ramping dopamine signals: dopamine neurons perform a derivative-like computation over values on a moment-by-moment basis.
The game of curling can be considered a good test bed for studying the interaction between artificial intelligence systems and the real world. In curling, the environmental characteristics change at every moment, and every throw has an impact on the outcome of the match. Furthermore, there is no time for relearning during a curling match due to the timing rules of the game. Here, we report a curling robot that can achieve human-level performance in the game of curling using an adaptive deep reinforcement learning framework. Our proposed adaptation framework extends standard deep reinforcement learning using temporal features, which learn to compensate for the uncertainties and nonstationarities that are an unavoidable part of curling. Our curling robot, Curly, was able to win three of four official matches against expert human teams [top-ranked women’s curling teams and Korea national wheelchair curling team (reserve team)]. These results indicate that the gap between physics-based simulators and the real world can be narrowed.
The emergence of powerful artificial intelligence (AI) is defining new research directions in neuroscience. To date, this research has focused largely on deep neural networks trained using supervised learning in tasks such as image classification. However, there is another area of recent AI work that has so far received less attention from neuroscientists but that may have profound neuroscientific implications: deep reinforcement learning (RL). Deep RL offers a comprehensive framework for studying the interplay among learning, representation, and decision making, offering to the brain sciences a new set of research tools and a wide range of novel hypotheses. In the present review, we provide a high-level introduction to deep RL, discuss some of its initial applications to neuroscience, and survey its wider implications for research on brain and behavior, concluding with a list of opportunities for next-stage research.
2019-vinyals.pdf#deepmind: “Grandmaster level in StarCraft II using multi-agent reinforcement learning”, Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, David Silver (2019-10-30; backlinks):
Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional e-sports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players. We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies, each represented by deep neural networks. We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.
2019-zhavoronkov.pdf: “Deep learning enables rapid identification of potent DDR1 kinase inhibitors”, Alex Zhavoronkov, Yan A. Ivanenkov, Alex Aliper, Mark S. Veselov, Vladimir A. Aladinskiy, Anastasiya V. Aladinskaya, Victor A. Terentiev, Daniil A. Polykovskiy, Maksim D. Kuznetsov, Arip Asadulaev, Yury Volkov, Artem Zholus, Rim R. Shayakhmetov, Alexander Zhebrak, Lidiya I. Minaeva, Bogdan A. Zagribelnyy, Lennart H. Lee, Richard Soll, David Madge, Li Xing, Tao Guo, Alán Aspuru-Guzik
Adaptive information seeking is critical for goal-directed behavior. Growing evidence suggests the importance of intrinsic motives such as curiosity or need for novelty, mediated through dopaminergic valuation systems, in driving information-seeking behavior. However, valuing information for its own sake can be highly suboptimal when agents need to evaluate instrumental benefit of information in a forward-looking manner. Here we show that information-seeking behavior in humans is driven by subjective value that is shaped by both instrumental and noninstrumental motives, and that this subjective value of information (SVOI) shares a common neural code with more basic reward value. Specifically, using a task where subjects could purchase information to reduce uncertainty about outcomes of a monetary lottery, we found information purchase decisions could be captured by a computational model of SVOI incorporating utility of anticipation, a form of noninstrumental motive for information seeking, in addition to instrumental benefits. Neurally, trial-by-trial variation in SVOI was correlated with activity in striatum and ventromedial prefrontal cortex. Furthermore, cross-categorical decoding revealed that, within these regions, SVOI and expected utility of lotteries were represented using a common code. These findings provide support for the common currency hypothesis and shed insight on neurocognitive mechanisms underlying information-seeking behavior.
In everyday life, the outcomes of our actions are rarely certain. Further, we often lack the information needed to precisely estimate the probability and value of potential outcomes as well as how much effort will be required by the courses of action under consideration. Under such conditions of uncertainty, individual differences in the estimation and weighting of these variables, and in reliance on model-free versus model-based decision making, have the potential to strongly influence our behavior. Both anxiety and depression are associated with difficulties in decision making. Further, anxiety is linked to increased engagement in threat-avoidance behaviors and depression is linked to reduced engagement in reward-seeking behaviors. The precise deficits, or biases, in decision making associated with these common forms of psychopathology remain to be fully specified. In this article, we review evidence for which of the computations supporting decision making are altered in anxiety and depression and consider the potential consequences for action selection. In addition, we provide a schematic framework that integrates the findings reviewed and will hopefully be of value to future studies.
The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.
Building on the recent successes of distributed training of RL agents, in this paper we investigate the training of RNN-based RL agents from distributed prioritized experience replay. We study the effects of parameter lag resulting in representational drift and recurrent state staleness and empirically derive an improved training strategy. Using a single network architecture and fixed set of hyper-parameters, the resulting agent, Recurrent Replay Distributed DQN (R2D2), quadruples the previous state of the art on Atari-57, and matches the state of the art on DMLab-30. It is the first agent to exceed human-level performance in 52 of the 57 Atari games.
Value alignment is often considered a critical component of safe artificial intelligence. Meanwhile, reinforcement learning is often criticized as being inherently unsafe and misaligned, for reasons such as wireheading, delusion boxes, misspecified reward functions and distributional shifts. In this report, we categorize sources of misalignment for reinforcement learning agents, illustrating each type with numerous examples. For each type of problem, we also describe ways to remove the source of misalignment. Combined, the suggestions form high-level blueprints for how to design value-aligned RL agents.
It analyzes the alignment problem from an AIXI-like perspective, that is, by theoretical analysis of powerful Bayesian RL agents in an online POMDP setting. In this setup, we have a POMDP environment, in which the environment has some underlying state, but the agent only gets observations of the state and must take actions in order to maximize rewards. The authors consider three main setups: (1) rewards are computed by a preprogrammed reward function, (2) rewards are provided by a human in the loop, and (3) rewards are provided by a reward predictor which is trained interactively from human-generated data. For each setup, they consider the various objects present in the formalism, and ask how these objects could be corrupted, misspecified, or misleading. This methodology allows them to identify several potential issues, which I won’t get into as I expect most readers are familiar with them. (Examples include wireheading and threatening to harm the human unless they provide maximal reward.)
They also propose several tools that can be used to help solve misalignment. In order to prevent reward function corruption, we can have the agent simulate the future trajectory, and evaluate this future trajectory with the current reward, removing the incentive to corrupt the reward function. (This was later developed into current-RF optimization (AN #71).) Self-corruption awareness refers to whether or not the agent is aware that its policy can be modified. A self-corruption unaware agent is one that behaves as though it’s current policy function will never be changed, effectively ignoring the possibility of corruption. It is not clear which is more desirable: while a self-corruption unaware agent will be more corrigible (in the MIRI sense), it also will not preserve its utility function, as it believes that even if the utility function changes the policy will not change.
Action-observation grounding ensures that the agent only optimizes over policies that work on histories of observations and actions, preventing agents from constructing entirely new observation channels (“delusion boxes”) which mislead the reward function into thinking everything is perfect.
The interactive setting in which a reward predictor is trained based on human feedback offers a new challenge: that the human data can be corrupted or manipulated. One technique to address this is to get decoupled data: if your corruption is determined by the current state s, but you get feedback about some different state s’, as long as s and s’ aren’t too correlated it is possible to mitigate potential corruptions.
Another leverage point is how we decide to use the reward predictor. We could consider the stationary reward function, which evaluates simulated trajectories with the current reward predictor, ie. assuming that the reward predictor will never be updated again. If we combine this with self-corruption unawareness (so that the policy also never expects the policy to change), then the incentive to corrupt the reward predictor’s data is removed. However, the resulting agent is time-inconsistent: it acts as though its reward never changes even though it in practice does, and so it can make a plan and start executing it, only to switch over to a new plan once the reward changes, over and over again.
The dynamic reward function avoids this pitfall by evaluating the kth timestep of a simulated trajectory by also taking an expectation over future data that the reward predictor will get. This agent is no longer time-inconsistent, but it now incentivizes the agent to manipulate the data. This can be fixed by building a single integrated Bayesian agent, which maintains a single environment model that predicts both the reward function and the environment model. The resulting agent is time-consistent, utility-preserving, and has no direct incentive to manipulate the data. (This is akin to the setup in assistance games / CIRL (AN #69).)
One final approach is to use a counterfactual reward function, in which the data is simulated in a counterfactual world where the agent executed some known safe default policy. This no longer depends on the current time, and is not subject to data corruption since the data comes from a hypothetical that is independent of the agent’s actual policy. However, it requires a good default policy that does the necessary information-gathering actions, and requires the agent to have the ability to simulate human feedback in a counterfactual world.
This paper is a great organization and explanation of several older papers (that haven’t been summarized in this newsletter because they were published before 2018 and I read them before starting this newsletter), and I wish I had read it sooner. It seems to me that the integrated Bayesian agent is the clear winner—the only downside is the computational cost, which would be a bottleneck for any of the models considered here. One worry I have with this sort of analysis is that the guarantees you get out of it depends quite a lot on how you model the situation. For example, let’s suppose that after I sleep I wake up refreshed and more capable of intellectual work. Should I model this as “policy corruption”, or as a fixed policy that takes as an input some information about how rested I am?]
2018-eslami.pdf#deepmind: “Neural scene representation and rendering”, S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, Demis Hassabis (2018-06-15; backlinks):
A scene-internalizing computer program: To train a computer to “recognize” elements of a scene supplied by its visual sensors, computer scientists typically use millions of images painstakingly labeled by humans. Eslami et al. developed an artificial vision system, dubbed the Generative Query Network (GQN), that has no needfor such labeled data. Instead, the GQN first uses images taken from different viewpoints and creates an abstract description of the scene, learning its essentials. Next, on the basis of this representation, the network predicts what the scene would look like from a new, arbitrary viewpoint.
Scene representation—the process of converting visual sensory data into concise descriptions—is a requirement for intelligent behavior. Recent work has shown that neural networks excel at this task when provided with large, labeled datasets. However, removing the reliance on human labeling remains an important open problem. To this end, we introduce the Generative Query Network (GQN), a framework within which machines learn to represent scenes using only their own sensors. The GQN takes as input images of a scene taken from different viewpoints, constructs an internal representation, and uses this representation to predict the appearance of that scene from previously unobserved viewpoints. The GQN demonstrates representation learning without human labels or domain knowledge, paving the way toward machines that autonomously learn to understand the world around them.
Over the past 20 years, neuroscience research on reward-based learning has converged on a canonical model, under which the neurotransmitter dopamine ‘stamps in’ associations between situations, actions and rewards by modulating the strength of synaptic connections between neurons. However, a growing number of recent findings have placed this standard model under strain. We now draw on recent advances in artificial intelligence to introduce a new theory of reward-based learning. Here, the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system. This new perspective accommodates the findings that motivated the standard model, but also deals gracefully with a wider range of observations, providing a fresh foundation for future research.
2017-silver.pdf#deepmind: “Mastering the game of Go without human knowledge”, David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis (2017-10-19; backlinks):
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
A significant amount of the world’s knowledge is stored in relational databases. However, the ability for users to retrieve facts from a database is limited due to a lack of understanding of query languages such as SQL. We propose Seq2SQL, a deep neural network for translatingnatural language questions to corresponding SQL queries. Our modelleverages the structure of SQL queries to significantly reduce the output space of generated queries. Moreover, we use rewards from in-the-loop query execution over the database to learn a policy to generate unordered parts of the query, which we show are less suitable for optimization via cross entropy loss. In addition, we will publish WikiSQL, a dataset of 80654 hand-annotated examples of questions andSQL queries distributed across 24241 tables from Wikipedia. This dataset is required to train our model and is an order of magnitude larger than comparable datasets. By applying policy-based reinforcement learning with a query execution environment to WikiSQL,our model Seq2SQL outperforms attentional sequence to sequence models, improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.
Theories of reward learning in neuroscience have focused on 2 families of algorithms thought to capture deliberative versus habitual choice. ‘Model-based’ algorithms compute the value of candidate actions from scratch, whereas ‘model-free’ algorithms make choice more efficient but less flexible by storing pre-computed action values.
We examine an intermediate algorithmic family, the successor representation, which balances flexibility and efficiency by storing partially computed action values: predictions about future events.
These pre-computation strategies differ in how they update their choices following changes in a task. The successor representation’s reliance on stored predictions about future states predicts a unique signature of insensitivity to changes in the task’s sequence of events, but flexible adjustment following changes to rewards. We provide evidence for such differential sensitivity in 2 behavioural studies with humans.
These results suggest that the successor representation is a computational substrate for semi-flexible choice in humans, introducing a subtler, more cognitive notion of habit.
2016-graves.pdf: “Hybrid computing using a neural network with dynamic external memory”, Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, Demis Hassabis (2016-10-27; backlinks):
Artificial neural networks are remarkably adept at sensory processing, sequence learning and reinforcement learning, but are limited in their ability to represent variables and data structures and to store data over long timescales, owing to the lack of an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, we demonstrate that a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. We show that it can learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize these tasks to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can complete a moving blocks puzzle in which changing goals are specified by sequences of symbols. Taken together, our results demonstrate that DNCs have the capacity to solve complex, structured tasks that are inaccessible to neural networks without external read-write memory.
We present Project Malmo—an AI experimentation platform built on top of the popular computer game Minecraft, and designed to support fundamental research in artificial intelligence. As the AI research community pushes for artificial general intelligence (AGI), experimentation platforms are needed that support the development of flexible agents that learn to solve diverse tasks in complex environments. Minecraft is an ideal foundation for such a platform, as it exposes agents to complex 3D worlds, coupled with infinitely varied game-play.
Project Malmo provides a sophisticated abstraction layer on top of Minecraft that supports a wide range of experimentation scenarios, ranging from navigation and survival to collaboration and problem solving tasks. In this demo we present the Malmo platform and its capabilities. The platform is publicly released as open source software at IJCAI, to support openness and collaboration in AI research.
2016-silver.pdf#deepmind: “Mastering the game of Go with deep neural networks and tree search”, David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis (2016-01-28; backlinks):
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
[Anecdote: I hear from Groq that the original AlphaGoGPU implementation was not on track to defeat Lee Sedol by about a month before, when they happened to gamble on implementing TPUv1 support.The additional compute led to drastic performance gains, and the TPU model could beat the GPU model in ~98 of 100 games, and the final model solidly defeated Lee Sedol. (Since TPUv1s reportedly only did inferencing/forward-mode, presumably they were not used for the initial imitation learning, or the policy gradients self-play, but for generating the ~30 million self-play games which the value network was trained on (doing regression/prediction of ‘board → P(win)’, requiring no state or activations from the self-play games, just generating an extremely large corpus which could be easily used by GPU training.]
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way.
By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement.
We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
[Remarkably, PILCO can learn your standard “Cartpole” task within just a few trials by carefully building a Bayesian Gaussian process model and picking the maximally-informative experiments to run. Cartpole is quite difficult for a human, incidentally, there’s an installation of one in the SF Exploratorium, and I just had to try it out once I recognized it. (My sample-efficiency was not better than PILCO.)]
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
A program of research into weakly supervised learning algorithms led us to ask if learning could occur given only natural selection as feedback.
We developed an algorithm that combined evolution and learning, and tested it in an artificial environment populated with adaptive and non-adaptive organisms. We found that learning and evolution together were more successful than either alone in producing adaptive populations that survived to the end of our simulation.
In a case study testing long-term stability, we simulated one well-adapted population far beyond the original time limit. The story of that population’s success and ultimate demise involves both familiar and novel effects in evolutionary biology and learning algorithms [such as “tree senility”].