 See Also

Links
 “Policy Improvement by Planning With Gumbel”, Danihelka et al 2022
 “MuZero With Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022
 “Procedural Generalization by Planning With Self-Supervised World Models”, Anand et al 2021
 “Mastering Atari Games With Limited Data”, Ye et al 2021
 “Proper Value Equivalence”, Grimm et al 2021
 “Vector Quantized Models for Planning”, Ozair et al 2021
 “Learning and Planning in Complex Action Spaces”, Hubert et al 2021
 “MuZero Unplugged: Online and Offline Reinforcement Learning by Planning With a Learned Model”, Schrittwieser et al 2021
 “Podracer Architectures for Scalable Reinforcement Learning”, Hessel et al 2021
 “Scaling Scaling Laws With Board Games”, Jones 2021
 “Playing Nondeterministic Games through Planning With a Learned Model”, Willkens & Pollack 2021
 “Visualizing MuZero Models”, Vries et al 2021
 “Combining Off- and On-Policy Training in Model-Based Reinforcement Learning”, Borges & Oliveira 2021
 “Improving Model-Based Reinforcement Learning With Internal State Representations through Self-Supervision”, Scholz et al 2021
 “On the Role of Planning in Model-based Deep Reinforcement Learning”, Hamrick et al 2020
 “The Value Equivalence Principle for Model-Based Reinforcement Learning”, Grimm et al 2020
 “Measuring Progress in Deep Reinforcement Learning Sample Efficiency”, Anonymous 2020
 “Monte-Carlo Tree Search As Regularized Policy Optimization”, Grill et al 2020
 “Continuous Control for Searching and Planning With a Learned Model”, Yang et al 2020
 “Agent57: Outperforming the Human Atari Benchmark”, Puigdomènech et al 2020
 “MuZero: Mastering Atari, Go, Chess and Shogi by Planning With a Learned Model”, Schrittwieser et al 2019
 “Surprising Negative Results for Generative Adversarial Tree Search”, Azizzadenesheli et al 2018
 “TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning”, Farquhar et al 2017
 Tensor processing unit
 MuZero
 Monte Carlo tree search
 Miscellaneous
“Policy Improvement by Planning With Gumbel”, Danihelka et al 2022
“Policy improvement by planning with Gumbel” (2022-03-04):
AlphaZero is a powerful reinforcement learning algorithm based on approximate policy iteration and tree search. However, AlphaZero can fail to improve its policy network if it does not visit all actions at the root of a search tree.
To address this issue, we propose a policy improvement algorithm based on sampling actions without replacement [see Huijben et al 2021 for more on Gumbel tricks]. Furthermore, we use the idea of policy improvement to replace the more heuristic mechanisms by which AlphaZero selects and uses actions, both at root nodes and at non-root nodes.
Our new algorithms, Gumbel AlphaZero and Gumbel MuZero, respectively without and with model-learning, match the state of the art on Go, chess, and Atari, and significantly improve prior performance when planning with few simulations.
[Keywords: AlphaZero, MuZero, reinforcement learning]
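The core mechanism here, sampling k actions without replacement by perturbing logits with Gumbel noise (the "Gumbel-top-k" trick), can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code, and the example logits are invented:

```python
import numpy as np

def gumbel_top_k(logits, k, rng):
    # Adding i.i.d. Gumbel(0, 1) noise to each logit and taking the
    # top-k indices samples k distinct actions (without replacement)
    # from the softmax distribution defined by the logits.
    gumbels = rng.gumbel(size=len(logits))
    return np.argsort(logits + gumbels)[::-1][:k]

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, 0.0])       # hypothetical root-node priors (pre-softmax)
actions = gumbel_top_k(logits, k=2, rng=rng)  # 2 distinct candidate actions to search
```

This is why Gumbel AlphaZero can improve the policy even when only a small subset of root actions is ever visited: the sampled subset is an unbiased draw from the policy.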
“MuZero With Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022
“MuZero with Self-competition for Rate Control in VP9 Video Compression” (2022-02-14; backlinks; similar):
Video streaming usage has seen a significant rise as entertainment, education, and business increasingly rely on online video. Optimizing video compression has the potential to increase access and quality of content to users, and reduce energy use and costs overall. In this paper, we present an application of the MuZero algorithm to the challenge of video compression. Specifically, we target the problem of learning a rate control policy to select the quantization parameters (QP) in the encoding process of libvpx, an open source VP9 video compression library widely used by popular video-on-demand (VOD) services.
We treat this as a sequential decision making problem to maximize the video quality with an episodic constraint imposed by the target bitrate. Notably, we introduce a novel self-competition based reward mechanism to solve constrained RL with variable constraint satisfaction difficulty, which is challenging for existing constrained RL methods. We demonstrate that the MuZero-based rate control achieves an average 6.28% reduction in size of the compressed videos for the same delivered video quality level (measured as PSNR BD-rate) compared to libvpx’s two-pass VBR rate control policy, while having better constraint satisfaction behavior.
“Procedural Generalization by Planning With Self-Supervised World Models”, Anand et al 2021
“Procedural Generalization by Planning with Self-Supervised World Models” (2021-11-02; backlinks; similar):
One of the key promises of model-based reinforcement learning is the ability to generalize using an internal model of the world to make predictions in novel environments and tasks. However, the generalization ability of model-based agents is not well understood because existing work has focused on model-free agents when benchmarking generalization.
Here, we explicitly measure the generalization ability of model-based agents in comparison to their model-free counterparts. We focus our analysis on MuZero (Schrittwieser et al 2020), a powerful model-based agent, and evaluate its performance on both procedural and task generalization. We identify 3 factors of procedural generalization—planning, self-supervised representation learning, and procedural data diversity—and show that by combining these techniques, we achieve state-of-the-art generalization performance and data efficiency on Procgen (Cobbe et al 2019). However, we find that these factors do not always provide the same benefits for the task generalization benchmarks in Meta-World (Yu et al 2019), indicating that transfer remains a challenge and may require different approaches than procedural generalization.
Overall, we suggest that building generalizable agents requires moving beyond the single-task, model-free paradigm and towards self-supervised model-based agents that are trained in rich, procedural, multi-task environments.
“Mastering Atari Games With Limited Data”, Ye et al 2021
“Mastering Atari Games with Limited Data” (2021-10-30; similar):
Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key challenge, with prominent methods requiring millions (or even billions) of environment steps to train. Recently, there has been significant progress in sample-efficient image-based RL algorithms; however, consistent human-level performance on the Atari game benchmark remains an elusive goal. We propose a sample-efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero. Our method achieves 190.4% mean human performance and 116.0% median performance on the Atari 100k benchmark with only two hours of real-time game experience, and outperforms state-based SAC in some tasks on the DMControl 100k benchmark. This is the first time an algorithm has achieved superhuman performance on Atari games with so little data. EfficientZero’s performance is also close to DQN’s performance at 200 million frames while we consume 500× less data. EfficientZero’s low sample complexity and high performance can bring RL closer to real-world applicability. We implement our algorithm in an easy-to-understand manner and it is available at https://github.com/YeWR/EfficientZero. We hope it will accelerate the research of MCTS-based RL algorithms in the wider community.
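The quoted "500× less data" follows directly from the benchmark definitions, assuming the standard Atari frame-skip of 4 (so 100k agent steps correspond to 400k raw frames; the frame-skip is my assumption, as the abstract does not state it):

```python
# Atari 100k allows 100,000 agent steps; with the standard frame-skip
# of 4, that is 400,000 raw environment frames (~2 hours of game time).
atari_100k_frames = 100_000 * 4
dqn_frames = 200_000_000  # DQN's classic 200 million frame budget

ratio = dqn_frames / atari_100k_frames  # 500x, as the abstract states
```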
“Proper Value Equivalence”, Grimm et al 2021
“Proper Value Equivalence” (2021-06-18; backlinks; similar):
One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model.
A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-k counterparts defined with respect to k applications of the Bellman operator. This leads to a family of VE classes that increase in size as k → ∞. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE.
Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment.
We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero and Muesli can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.
“Vector Quantized Models for Planning”, Ozair et al 2021
“Vector Quantized Models for Planning” (2021-06-08; similar):
Recent developments in the field of model-based RL have proven successful in a range of environments, especially ones where planning is essential. However, such successes have been limited to deterministic fully-observed environments. We present a new approach that handles stochastic and partially-observable environments. Our key insight is to use discrete autoencoders to capture the multiple possible effects of an action in a stochastic environment. We use a stochastic variant of Monte Carlo tree search to plan over both the agent’s actions and the discrete latent variables representing the environment’s response. Our approach significantly outperforms an offline version of MuZero on a stochastic interpretation of chess where the opponent is considered part of the environment. We also show that our approach scales to DeepMind Lab, a first-person 3D environment with large visual observations and partial observability.
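The "discrete autoencoder" at the heart of this approach replaces each continuous latent vector with its nearest codebook entry, as in VQ-VAE; the discrete index can then be branched over during search as a chance outcome. A minimal sketch of that quantization step (the codebook and latents below are toy values, not the paper's):

```python
import numpy as np

def quantize(latents, codebook):
    # Nearest-neighbor lookup: each row of `latents` is replaced by the
    # closest codebook vector (squared Euclidean distance), yielding a
    # discrete index that tree search can branch over.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])  # toy 2-entry codebook
latents = np.array([[0.1, 0.2], [0.9, 0.8]])   # toy encoder outputs
idx, quantized = quantize(latents, codebook)   # idx -> [0, 1]
```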
“Learning and Planning in Complex Action Spaces”, Hubert et al 2021
“Learning and Planning in Complex Action Spaces” (2021-04-13; similar):
Many important real-world problems have action spaces that are high-dimensional, continuous, or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
“MuZero Unplugged: Online and Offline Reinforcement Learning by Planning With a Learned Model”, Schrittwieser et al 2021
“MuZero Unplugged: Online and Offline Reinforcement Learning by Planning with a Learned Model” (2021-04-13; backlinks; similar):
Learning efficiently from small amounts of data has long been the focus of model-based reinforcement learning, both for the online case when interacting with the environment and the offline case when learning from a fixed dataset. However, to date no single unified algorithm could demonstrate state-of-the-art results in both settings.
In this work, we describe the Reanalyse algorithm which uses modelbased policy and value improvement operators to compute new improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude. We further show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions, as in the case of offline Reinforcement Learning (offline RL).
Combining Reanalyse with the MuZero algorithm, we introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL. In contrast to previous work, our algorithm does not require any special adaptations for the off-policy or offline RL settings.
MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari ALE in the standard 200 million frame setting.
“Podracer Architectures for Scalable Reinforcement Learning”, Hessel et al 2021
“Podracer architectures for scalable Reinforcement Learning” (2021-04-13; backlinks; similar):
Supporting state-of-the-art AI research requires balancing rapid prototyping, ease of use, and quick iteration, with the ability to deploy experiments at a scale traditionally associated with production systems. Deep learning frameworks such as TensorFlow, PyTorch and JAX allow users to transparently make use of accelerators, such as TPUs and GPUs, to offload the more computationally intensive parts of training and inference in modern deep learning systems. Popular training pipelines that use these frameworks for deep learning typically focus on (un)supervised learning. How to best train reinforcement learning (RL) agents at scale is still an active research area.
In this report we argue that TPUs are particularly well suited for training RL agents in a scalable, efficient and reproducible way. Specifically we describe two architectures designed to make the best use of the resources available on a TPU Pod (a special configuration in a Google data center that features multiple TPU devices connected to each other by extremely low latency communication channels).
Anakin: When using small neural networks and gridworld environments, an Anakin architecture can easily perform 5 million steps per second, even on the 8-core TPU accessible for free through Google Colab.
This can be very useful to experiment and debug research ideas in the friendly Colab environment. In Figure 4a we show how, thanks to the efficient network connecting different TPU cores in a Pod, performance scales almost linearly with the number of cores; the collective operations used to average gradients across replicas appear to cause only minimal overhead.
In a recent paper by Oh et al 2021, Anakin was used, at a much larger scale, to discover a general reinforcement learning update from experience of interacting with a rich set of environments implemented in JAX. In that paper, Anakin was used to learn a single shared update rule from 60K JAX environments and 1K policies running and training in parallel.
Despite the complex nature of the system, based on the use of neural networks to meta-learn not just a policy but the entire RL update, Anakin delivered over 3 million steps per second on a 16-core TPU. Training the update rule to a good level of performance required running Anakin for ~24 hours; this would cost ~$100 on GCP’s preemptible instances.
…Sebulba: Our second Podracer architecture has also been extensively used for exploring a variety of RL ideas at scale, on environments that cannot be compiled to run on TPU (e.g. Atari, DMLab and MuJoCo). As both IMPALA and Sebulba are based on a decomposition between actors and learners, agents designed for the IMPALA architecture can be easily mapped onto Sebulba; for instance, a Podracer version of the V-trace agent easily reproduced the results from Espeholt et al 2018. However, we found that training an agent for 200 million frames of an Atari game could be done in just ~1 hour, by running Sebulba on an 8-core TPU. This comes at a cost of ~$2.88 on GCP’s preemptible instances. This is similar in cost to training with the more complex SEED RL framework, and much cheaper than training an agent for 200 million Atari frames using either IMPALA or a single-stream GPU-based system such as that traditionally used by DQN.
…In addition to the trajectory length, the effective batch size used to compute each update also depends on how many times we replicate the basic 8-TPU setup. Sebulba also scales effectively along this dimension: using 2048 TPU cores (an entire Pod) we were able to further scale all the way to 43 million frames per second, solving the classic Atari videogame Pong in less than 1 minute…Sebulba has also been used to train search-based agents inspired by MuZero (Schrittwieser et al 2020). The workloads associated with these agents are very different from those of model-free agents like IMPALA. The key difference is in the cost of action selection. This increases because MuZero’s policy combines search with deep neural networks (used to guide and/or truncate the search). Typically, search-based agents like MuZero have required custom C++ implementations of the search to deliver good performance. We could reproduce results from MuZero (no Reanalyse) on multiple RL environments, using Sebulba and a pure JAX implementation of MCTS. Training a MuZero agent with Sebulba for 200M Atari frames takes 9 hours on a 16-core TPU (at a cost of ~$40 on GCP’s preemptible instances).
We found that scalability, via replication, was particularly useful in this context. Figure 4c reports the number of frames per second processed by Sebulba when running MuZero on Atari, as a function of the number of TPU cores. The throughput increased linearly with the number of cores.
“Scaling Scaling Laws With Board Games”, Jones 2021
“Scaling Scaling Laws with Board Games” (2021-04-07; backlinks; similar):
The largest experiments in machine learning now require resources far beyond the budget of all but a few institutions. Fortunately, it has recently been shown that the results of these huge experiments can often be extrapolated from the results of a sequence of far smaller, cheaper experiments. In this work, we show that not only can the extrapolation be done based on the size of the model, but on the size of the problem as well.
By conducting a sequence of experiments using AlphaZero and Hex, we show that the performance achievable with a fixed amount of compute degrades predictably as the game gets larger and harder. Along with our main result, we further show that the test-time and train-time compute available to an agent can be traded off while maintaining performance.
Slope: The slope of the incline is 500 Elo per order of magnitude increase in compute.
A more memorable interpretation is that if you are in the linearly-increasing regime, then you will need about 2× as much compute as your opponent to beat them 2⁄3 of the time.
Perfect play: The minimum compute needed for perfect play increases 7× for each increment in board size.
Takeoff: The minimum training compute needed to see any improvement over random play increases by 4× for each increment of board size.
Random play: Finally, the distance between random play and perfect play increases by 500 Elo for each increment of board size.
Unlike the other quantities mentioned previously, the distance between random and perfect play is a property of the game itself rather than of the agent.
…Train-test tradeoff: So far we have focused on the compute budget during training, but another pertinent budget is the compute spent during evaluation. All the results discussed previously have used a tree search of size 64 during evaluation, the same as used during training. But there is no reason that the train-time search and test-time search have to be the same size, and so by varying the size of the test-time compute budget we can see in Figure 8 that larger tree searches at test time can substantially improve the performance of an agent.
Knowing now that compute can be spent in 2 places, at train time and test time, the immediate question is: how do these 2 budgets trade off? This is illustrated in Figure 9, which shows that the tradeoff is linear in log-compute: for each additional 10× of train-time compute, about 15× of test-time compute can be eliminated, down to a floor of a single-node tree search…the simple relationship between compute at train time and compute at test time was originally surprising to us. Our intuition was that test-time compute is much ‘cheaper’ than train-time compute, and so we were surprised that one could easily substitute for the other. On reflection, however, we believe the key distinction is that an optimization at test time needs only optimise over one sample, while train-time compute must optimise over the entire distribution of samples.
…the way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins^{3}. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2⁄3 [as you go from 1:1, half the total numbers drawn, to 2:1, or 2/(2+1)—as if each tree search were an independent lottery ticket]. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.
Second, both the relation of performance to board size and the relation of performance to compute are smooth. Before embarking on this project, a key unknown was whether performance would show any ‘spikes’ with regards to compute or board size. A spike with regards to compute might indicate the model had achieved some key insight, while a spike with regards to board size might indicate a minimum complexity past which key insights are available for the model to discover. As is however, models’ performance changes smoothly and predictably with both increased compute and increased complexity.
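The "2× compute beats you 2⁄3 of the time" claim is consistent with the quoted slope of 500 Elo per order of magnitude, under the standard Elo win-probability formula; a quick sanity check:

```python
import math

def elo_win_prob(delta):
    # Standard Elo model: probability that the stronger player wins,
    # given an Elo advantage of `delta` points.
    return 1.0 / (1.0 + 10 ** (-delta / 400))

# 500 Elo per order of magnitude of compute (the paper's slope), so
# doubling compute is worth 500 * log10(2) ~= 150 Elo:
delta_for_2x = 500 * math.log10(2)
p_win = elo_win_prob(delta_for_2x)  # ~0.70, close to the quoted 2/3
```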
“Playing Nondeterministic Games through Planning With a Learned Model”, Willkens & Pollack 2021
“Playing Nondeterministic Games through Planning with a Learned Model” (2021-03-05; similar):
The MuZero algorithm is known for achieving high-level performance on traditional zero-sum two-player games of perfect information such as chess, Go, and shogi, as well as visual, non-zero-sum, single-player environments such as the Atari suite. Despite lacking a perfect simulator and employing a learned model of environmental dynamics, MuZero produces game-playing agents comparable to its predecessor AlphaZero. However, the current implementation of MuZero is restricted only to deterministic environments. This paper presents Nondeterministic MuZero (NDMZ), an extension of MuZero for nondeterministic, two-player, zero-sum games of perfect information. Borrowing from Nondeterministic Monte Carlo Tree Search and the theory of extensive-form games, NDMZ formalizes chance as a player in the game and incorporates it into the MuZero network architecture and tree search. Experiments show that NDMZ is capable of learning effective strategies and an accurate model of the game.
[Keywords: reinforcement learning, AlphaZero, MuZero, MCTS, planning, search]
“Visualizing MuZero Models”, Vries et al 2021
“Visualizing MuZero Models” (2021-02-25; similar):
MuZero, a model-based reinforcement learning algorithm that uses a value-equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value-equivalent models are trained to predict a future value, thereby emphasizing value-relevant information in the representations. While value-equivalent models have shown strong empirical success, there is no research yet that visualizes and investigates what types of representations these models actually learn.
Therefore, in this paper we visualize the latent representation of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero’s performance.
Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value-equivalent algorithms.
“Combining Off- and On-Policy Training in Model-Based Reinforcement Learning”, Borges & Oliveira 2021
“Combining Off- and On-Policy Training in Model-Based Reinforcement Learning” (2021-02-24; similar):
The combination of deep learning and Monte Carlo Tree Search (MCTS) has been shown to be effective in various domains, such as board and video games. AlphaGo represented a significant step forward in our ability to learn complex board games, and it was rapidly followed by significant advances, such as AlphaGo Zero and AlphaZero. Recently, MuZero demonstrated that it is possible to master both Atari games and board games by directly learning a model of the environment, which is then used with MCTS to decide what move to play in each position. During tree search, the algorithm simulates games by exploring several possible moves and then picks the action that corresponds to the most promising trajectory. When training, limited use is made of these simulated games since none of their trajectories are directly used as training examples. Even if we consider that not all trajectories from simulated games are useful, there are thousands of potentially useful trajectories that are discarded. Using information from these trajectories would provide more training data, more quickly, leading to faster convergence and higher sample efficiency. Recent work introduced an off-policy value target for AlphaZero that uses data from simulated games. In this work, we propose a way to obtain off-policy targets using data from simulated games in MuZero. We combine these off-policy targets with the on-policy targets already used in MuZero in several ways, and study the impact of these targets and their combinations in three environments with distinct characteristics. When used in the right combinations, our results show that these targets can speed up the training process and lead to faster convergence and higher rewards than the ones obtained by MuZero.
“Improving Model-Based Reinforcement Learning With Internal State Representations through Self-Supervision”, Scholz et al 2021
“Improving Model-Based Reinforcement Learning with Internal State Representations through Self-Supervision” (2021-02-10; similar):
Using a model of the environment, reinforcement learning agents can plan their future moves and achieve superhuman performance in board games like Chess, Shogi, and Go, while remaining relatively sample-efficient. As demonstrated by the MuZero Algorithm, the environment model can even be learned dynamically, generalizing the agent to many more tasks while at the same time achieving state-of-the-art performance. Notably, MuZero uses internal state representations derived from real environment states for its predictions. In this paper, we bind the model’s predicted internal state representation to the environment state via two additional terms: a reconstruction model loss and a simpler consistency loss, both of which work independently and unsupervised, acting as constraints to stabilize the learning process. Our experiments show that this new integration of reconstruction model loss and simpler consistency loss provide a significant performance increase in OpenAI Gym environments. Our modifications also enable self-supervised pretraining for MuZero, so the algorithm can learn about environment dynamics before a goal is made available.
“On the Role of Planning in Model-based Deep Reinforcement Learning”, Hamrick et al 2020
“On the role of planning in model-based deep reinforcement learning” (2020-11-08; similar):
Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9×9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.
“The Value Equivalence Principle for Model-Based Reinforcement Learning”, Grimm et al 2020
“The Value Equivalence Principle for Model-Based Reinforcement Learning” (2020-11-06; backlinks; similar):
Learning models of the environment from data is often viewed as an essential component to building intelligent reinforcement learning (RL) agents. The common practice is to separate the learning of the model from its use, by constructing a model of the environment’s dynamics that correctly predicts the observed state transitions. In this paper we argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning.
As our main contribution, we introduce the principle of value equivalence: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates. We propose a formulation of the model learning problem based on the value equivalence principle and analyze how the set of feasible solutions is impacted by the choice of policies and functions. Specifically, we show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks, until eventually collapsing to a single point corresponding to a model that perfectly describes the environment.
In many problems, directly modelling state-to-state transitions may be both difficult and unnecessary. By leveraging the value-equivalence principle one may find simpler models without compromising performance, saving computation and memory. We illustrate the benefits of value-equivalent model learning with experiments comparing it against more traditional counterparts like maximum likelihood estimation.
More generally, we argue that the principle of value equivalence underlies a number of recent empirical successes in RL, such as Value Iteration Networks, the Predictron, Value Prediction Networks, TreeQN, and MuZero, and provides a first theoretical underpinning of those results. [Follow-up: “Proper Value Equivalence”, Grimm et al 2021.]
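The definition can be made concrete with a tiny tabular example: two models are value equivalent with respect to a set of functions and policies if they produce identical Bellman updates on those functions. The transition matrices and rewards below are invented for illustration; they show how a "wrong" model can still be VE with respect to a small function set (here, constant functions) while failing on a richer one, mirroring the shrinking-class argument in the abstract:

```python
import numpy as np

def bellman_update(P, r, v, gamma=0.9):
    # One application of the Bellman operator for a fixed policy:
    # (T v)(s) = r(s) + gamma * sum_s' P(s'|s) v(s').
    return r + gamma * P @ v

r = np.array([1.0, 0.0])                      # shared reward model
P_true = np.array([[0.9, 0.1], [0.2, 0.8]])   # "true" dynamics
P_model = np.array([[0.5, 0.5], [0.5, 0.5]])  # crude learned model

v_const = np.ones(2)            # constant function: both models agree
v_other = np.array([1.0, 0.0])  # richer function: they now disagree

agree = np.allclose(bellman_update(P_true, r, v_const),
                    bellman_update(P_model, r, v_const))         # True
disagree = not np.allclose(bellman_update(P_true, r, v_other),
                           bellman_update(P_model, r, v_other))  # True
```

Because every row of a stochastic matrix sums to 1, any model is VE with respect to constant functions; enlarging the function set rules out `P_model`, illustrating the collapse toward the perfect model.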
“Measuring Progress in Deep Reinforcement Learning Sample Efficiency”, Anonymous 2020
“Measuring Progress in Deep Reinforcement Learning Sample Efficiency”, (2020-09-28; similar):
We measure progress in deep reinforcement learning sample efficiency using training curves from published papers. Sampled environment transitions are a critical input to deep reinforcement learning (DRL) algorithms. Current DRL benchmarks often allow for the cheap and easy generation of large amounts of samples such that perceived progress in DRL does not necessarily correspond to improved sample efficiency. As simulating real-world processes is often prohibitively hard and collecting real-world experience is costly, sample efficiency is an important indicator for economically relevant applications of DRL. We investigate progress in sample efficiency on Atari games and continuous control tasks by comparing the amount of samples that a variety of algorithms need to reach a given performance level according to training curves in the corresponding publications. We find exponential progress in sample efficiency with estimated doubling times of around 10 to 18 months on Atari [ALE], 5 to 24 months on state-based continuous control [HalfCheetah] and of around 4 to 9 months on pixel-based continuous control [Walker Walk] depending on the specific task and performance level.
…The amount of samples used to train DRL agents on the ALE, and the speed at which samples are generated, has increased rapidly. Since DQN was first trained on the majority of the now-standard 57 Atari games in 2015 (Mnih et al 2015), the amount of samples per game used by the most ambitious projects to train their agents on the ALE has increased by a factor of 450, from 200 million to 90 billion, as shown in Figure 1(a). This corresponds to a doubling time in sample use of around 7 months. Converted into real game time, it represents a jump from 38.6 hours (per game) to 47.6 years, which was enabled by the fast speed of the simulators and by running large amounts of simulations in parallel to reduce the wall-clock time needed for processing that many frames. In fact, the trend in wall-clock training time is actually reversed, as can be seen in Table 1: while DQN was trained for a total of 9.5 days, MuZero took only 12 hours of training to process 20 billion frames (Schrittwieser et al 2019), which represents a speedup in utilized frames per second of 1,900× in less than 5 years. This demonstrates that using larger and larger amounts of samples has become a lot more popular as well as feasible over time.
…While the exact slopes of the fitted trend lines are fairly uncertain due to the limited amount of data points, especially for the unrestricted benchmark, it seems like progress on the unrestricted benchmarks is around twice as fast. This can be interpreted as roughly half of progress coming from increased sample use, while the other half comes from a combination of algorithmic improvements and more compute usage [In the form of larger neural networks or reusing samples for multiple training passes.].
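The headline doubling times are easy to reproduce from the two endpoints. A back-of-the-envelope sketch (the 62-month span is my reading of “less than 5 years”, not a number taken from the paper):

```python
import math

def doubling_time_months(x0, x1, months_elapsed):
    """Doubling time implied by exponential growth from x0 to x1."""
    return months_elapsed / math.log2(x1 / x0)

# 200 million frames per game (DQN, 2015) -> 90 billion frames
# (the most ambitious recent projects): a factor of 450.
dt = doubling_time_months(200e6, 90e9, months_elapsed=62)
print(f"{dt:.1f} months per doubling")  # ~7 months, matching the paper
```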
“Monte-Carlo Tree Search As Regularized Policy Optimization”, Grill et al 2020
“Monte-Carlo Tree Search as Regularized Policy Optimization”, (2020-07-24; similar):
The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on hand-crafted heuristics that are only partially understood. In this paper, we show that AlphaZero’s search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
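The exact solution has a simple closed form up to one scalar. A sketch of the idea (my own bisection implementation, assuming a strictly positive prior; `lam` plays the role of the paper’s visit-count-dependent multiplier λ_N): maximize q·π − λ KL(π_θ‖π) over the simplex, whose optimum is π(a) ∝ λ π_θ(a)/(α − q(a)) for a normalizing dual variable α found by search:

```python
import numpy as np

def regularized_policy(q, prior, lam, iters=60):
    """Exact maximizer of q·pi - lam * KL(prior || pi) over the simplex,
    solved for the dual variable alpha by bisection.
    Requires prior > 0 elementwise so the bracket is valid."""
    lo = np.max(q + lam * prior)  # at lo, the un-normalized mass sums to >= 1
    hi = np.max(q) + lam          # at hi, it sums to <= 1
    for _ in range(iters):
        alpha = (lo + hi) / 2
        pi = lam * prior / (alpha - q)
        if pi.sum() > 1:
            lo = alpha
        else:
            hi = alpha
    return pi / pi.sum()
```

As λ grows the solution approaches the prior; as λ shrinks it concentrates on the argmax of q, mirroring how search with few simulations should trust the prior and search with many should trust the Q-estimates.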
“Continuous Control for Searching and Planning With a Learned Model”, Yang et al 2020
“Continuous Control for Searching and Planning with a Learned Model”, (2020-06-12; similar):
Decision-making agents with planning capabilities have achieved huge success in challenging domains like Chess, Shogi, and Go. In an effort to generalize this planning ability to more general tasks where the environment dynamics are not available to the agent, researchers proposed the MuZero algorithm, which can learn the dynamics model through interactions with the environment. In this paper, we provide a way, and the necessary theoretical results, to extend the MuZero algorithm to more generalized environments with continuous action spaces. Through numerical results on two relatively low-dimensional MuJoCo environments, we show the proposed algorithm outperforms the soft actor-critic (SAC) algorithm, a state-of-the-art model-free deep reinforcement learning algorithm.
“Agent57: Outperforming the Human Atari Benchmark”, Puigdomènech et al 2020
“Agent57: Outperforming the human Atari benchmark”, (2020-03-31; backlinks; similar):
The Atari57 suite of games is a long-standing benchmark to gauge agent performance across a wide range of tasks. We’ve developed Agent57, the first deep reinforcement learning agent to obtain a score that is above the human baseline on all 57 Atari 2600 games. Agent57 combines an algorithm for efficient exploration with a meta-controller that adapts the exploration and long- vs. short-term behaviour of the agent.
…In 2012, the Arcade Learning Environment—a suite of 57 Atari 2600 games (dubbed Atari57)—was proposed as a benchmark set of tasks: these canonical Atari games pose a broad range of challenges for an agent to master…Unfortunately, the average performance can fail to capture how many tasks an agent is doing well on, and so is not a good statistic for determining how general an agent is: it captures that an agent is doing sufficiently well, but not that it is doing sufficiently well on a sufficiently wide set of tasks. So although average scores have increased, until now, the number of above-human games has not.
…Back in 2012, DeepMind developed the Deep Q-network agent (DQN) to tackle the Atari57 suite. Since then, the research community has developed many extensions and alternatives to DQN. Despite these advancements, however, all deep reinforcement learning agents have consistently failed to score in four games: Montezuma’s Revenge, Pitfall, Solaris and Skiing. For Agent57 to tackle these four challenging games in addition to the other Atari57 games, several changes to DQN were necessary.
DQN improvements
 Distributed agents
 Short-term memory
 Episodic memory
Intrinsic motivation methods to encourage directed exploration
 Seeking novelty over long time scales
 Seeking novelty over short time scales
 Meta-controller: learning to balance exploration with exploitation
Agent57: putting it all together
…With Agent57, we have succeeded in building a more generally intelligent agent that has above-human performance on all tasks in the Atari57 benchmark. It builds on our previous agent Never Give Up, and instantiates an adaptive meta-controller that helps the agent to know when to explore and when to exploit, as well as what time-horizon it would be useful to learn with. A wide range of tasks will naturally require different choices of both of these trade-offs, therefore the meta-controller provides a way to dynamically adapt such choices.
Agent57 was able to scale with increasing amounts of computation: the longer it trained, the higher its score got. While this enabled Agent57 to achieve strong general performance, it takes a lot of computation and time; the data efficiency can certainly be improved. Additionally, this agent shows better 5th-percentile performance on the set of Atari57 games. This by no means marks the end of Atari research, not only in terms of data efficiency, but also in terms of general performance. We offer two views on this: firstly, analyzing the performance among percentiles gives us new insights on how general algorithms are. While Agent57 achieves strong results on the first percentiles of the 57 games and holds better mean and median performance than NGU or R2D2, as illustrated by MuZero, it could still obtain a higher average performance. Secondly, all current algorithms are far from achieving optimal performance in some games. To that end, key improvements to use might be enhancements in the representations that Agent57 uses for exploration, planning, and credit assignment.
“MuZero: Mastering Atari, Go, Chess and Shogi by Planning With a Learned Model”, Schrittwieser et al 2019
“MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, (2019-11-19; backlinks; similar):
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown.
In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function.
When evaluated on 57 different Atari games—the canonical video game environment for testing AI techniques, in which modelbased planning approaches have historically struggled—our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
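Concretely, the learned model decomposes into three functions: a representation function h mapping observations to a hidden state, a dynamics function g stepping that hidden state forward given an action (and predicting a reward), and a prediction function f emitting a value and policy from any hidden state. A toy sketch of the interface and a k-step unroll (tiny random linear "networks" standing in for MuZero’s deep networks; names follow the paper, everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H, A, OBS = 4, 3, 8                     # hidden size, actions, obs size (toy)
W_h = rng.normal(size=(H, OBS))         # representation weights
W_g = rng.normal(size=(H, H + A)) * 0.1 # dynamics weights
W_f = rng.normal(size=(1 + A, H)) * 0.1 # prediction weights

def represent(obs):                     # h: observation -> hidden state
    return np.tanh(W_h @ obs)

def dynamics(state, action):            # g: (state, action) -> (next state, reward)
    a = np.eye(A)[action]
    nxt = np.tanh(W_g @ np.concatenate([state, a]))
    return nxt, float(nxt.sum())        # toy reward head

def predict(state):                     # f: state -> (value, policy logits)
    out = W_f @ state
    return float(out[0]), out[1:]

# Unroll the learned model k steps from one real observation: this is the
# rollout the tree search expands, never touching the real environment.
state = represent(rng.normal(size=OBS))
for action in [0, 2, 1]:
    state, reward = dynamics(state, action)
    value, policy_logits = predict(state)
```

The key point the abstract makes is visible in the interface: nothing requires the hidden state to reconstruct observations or game rules; it only has to support accurate reward, value, and policy predictions.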
“Surprising Negative Results for Generative Adversarial Tree Search”, Azizzadenesheli et al 2018
“Surprising Negative Results for Generative Adversarial Tree Search”, (2018-06-15; similar):
While many recent advances in deep reinforcement learning (RL) rely on model-free methods, model-based approaches remain an alluring prospect for their potential to exploit unsupervised data to learn environment models. In this work, we provide an extensive study on the design of deep generative models for RL environments and propose a sample-efficient and robust method to learn the model of Atari environments. We deploy this model and propose generative adversarial tree search (GATS), a deep RL algorithm that learns the environment model and implements Monte Carlo tree search (MCTS) on the learned model for planning. While MCTS on the learned model is computationally expensive, similar to AlphaGo, GATS follows depth-limited MCTS. GATS employs a deep Q-network (DQN) and learns a Q-function to assign values to the leaves of the tree in MCTS. We theoretically analyze GATS vis-à-vis the bias-variance trade-off and show GATS is able to mitigate the worst-case error in the Q-estimate. While we were expecting GATS to enjoy better sample complexity and faster convergence to better policies, surprisingly, GATS fails to outperform DQN. We provide a study in which we show why depth-limited MCTS fails to perform desirably.
“TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning”, Farquhar et al 2017
“TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning”, (2017-10-31; backlinks; similar):
Combining deep model-free reinforcement learning with online planning is a promising approach to building on the successes of deep RL. Online planning with lookahead trees has proven successful in environments where transition models are known a priori. However, in complex environments where transition models need to be learned from data, the deficiencies of learned models have limited their utility for planning.
To address these challenges, we propose TreeQN, a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in deep RL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values. We also propose ATreeC, an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network. Both approaches are trained end-to-end, such that the learned model is optimized for its actual use in the tree.
We show that TreeQN and ATreeC outperform n-step DQN and A2C on a box-pushing task, as well as n-step DQN and value prediction networks (Oh et al 2017) on multiple Atari games. Furthermore, we present ablation studies that demonstrate the effect of different auxiliary losses on learning transition models.
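The tree backup at the heart of TreeQN can be sketched as a short recursion. Illustrative code only: `trans`, `reward`, and `value` are stand-ins for TreeQN’s learned abstract-state networks, and this simplification uses a hard max backup, omitting the paper’s softened aggregation and the end-to-end training through the tree:

```python
import numpy as np

def tree_backup(state, depth, n_actions, trans, reward, value, gamma=0.99):
    """Recursive tree backup over a learned abstract-state model, in the
    spirit of TreeQN: expand every action, estimate leaves with the value
    network, and back up discounted rewards plus the best child value."""
    q = np.empty(n_actions)
    for a in range(n_actions):
        nxt = trans(state, a)         # learned transition in abstract space
        if depth == 1:
            backup = value(nxt)       # leaf: value-network estimate
        else:
            backup = tree_backup(nxt, depth - 1, n_actions,
                                 trans, reward, value, gamma).max()
        q[a] = reward(state, a) + gamma * backup
    return q

# Toy stand-ins: identity transitions, zero reward, constant leaf value.
q = tree_backup(np.zeros(4), depth=2, n_actions=2,
                trans=lambda s, a: s, reward=lambda s, a: 0.0,
                value=lambda s: 1.0)
```

Because every operation in the recursion is differentiable, the whole tree can be treated as one value-function network, which is what makes the model "optimized for its actual use in the tree".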