 See Also

Links
 “HTPS: HyperTree Proof Search for Neural Theorem Proving”, Lample et al 2022
 “CrossBeam: Learning to Search in Bottom-Up Program Synthesis”, Shi et al 2022
 “Policy Improvement by Planning With Gumbel”, Danihelka et al 2022
 “Player of Games”, Schmid et al 2021
 “νSDDP: Neural Stochastic Dual Dynamic Programming”, Dai et al 2021
 “Acquisition of Chess Knowledge in AlphaZero”, McGrath et al 2021
 “Evaluating Model-based Planning and Planner Amortization for Continuous Control”, Byravan et al 2021
 “Scalable Online Planning via Reinforcement Learning Fine-Tuning”, Fickinger et al 2021
 “Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control”, Bertsekas 2021
 “How Does AI Improve Human Decision-Making? Evidence from the AI-Powered Go Program”, Choi et al 2021
 “Train on Small, Play the Large: Scaling Up Board Games With AlphaZero and GNN”, Ben-Assayag & El-Yaniv 2021
 “Neural Tree Expansion for Multi-Robot Planning in Non-Cooperative Environments”, Riviere et al 2021
 “Scaling Scaling Laws With Board Games”, Jones 2021
 “OLIVAW: Mastering Othello without Human Knowledge, nor a Fortune”, Norelli & Panconesi 2021
 “Transfer of Fully Convolutional Policy-Value Networks Between Games and Game Variants”, Soemers et al 2021
 “Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search”, Lan et al 2020
 “Assessing Game Balance With AlphaZero: Exploring Alternative Rule Sets in Chess”, Tomašev et al 2020
 “Learning Personalized Models of Human Behavior in Chess”, McIlroy-Young et al 2020
 “ReBeL: Combining Deep Reinforcement Learning and Search for Imperfect-Information Games”, Brown et al 2020
 “Learning Compositional Neural Programs for Continuous Control”, Pierrot et al 2020
 “Monte-Carlo Tree Search As Regularized Policy Optimization”, Grill et al 2020
 “Tackling Morpion Solitaire With AlphaZero-like Ranked Reward Reinforcement Learning”, Wang et al 2020
 “Aligning Superhuman AI With Human Behavior: Chess As a Model System”, McIlroy-Young et al 2020
 “Neural Machine Translation With Monte-Carlo Tree Search”, Parker & Chen 2020
 “Approximate Exploitability: Learning a Best Response in Large Games”, Timbers et al 2020
 “Real World Games Look Like Spinning Tops”, Czarnecki et al 2020
 “Accelerating and Improving AlphaZero Using Population Based Training”, Wu et al 2020
 “Self-Play Learning Without a Reward Metric”, Schmidt et al 2019
 “MuZero: Mastering Atari, Go, Chess and Shogi by Planning With a Learned Model”, Schrittwieser et al 2019
 “Multiplayer AlphaZero”, Petosa & Balch 2019
 “Global Optimization of Quantum Dynamics With AlphaZero Deep Exploration”, Dalgaard et al 2019
 “Learning Compositional Neural Programs With Recursive Tree Search and Planning”, Pierrot et al 2019
 “Deep Policies for Width-Based Planning in Pixel Domains”, Junyent et al 2019
 “Policy Gradient Search: Online Planning and Expert Iteration without Search Trees”, Anthony et al 2019
 “AlphaX: EXploring Neural Architectures With Deep Neural Networks and Monte Carlo Tree Search”, Wang et al 2019
 “Minigo: A Case Study in Reproducing Reinforcement Learning Research”, Anonymous 2019
 “α-Rank: Multi-Agent Evaluation by Evolution”, Omidshafiei et al 2019
 “Accelerating SelfPlay Learning in Go”, Wu 2019
 “ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero”, Tian et al 2019
 “Bayesian Optimization in AlphaGo”, Chen et al 2018
 “A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go through Self-play”, Silver et al 2018
 “AlphaSeq: Sequence Discovery With Deep Reinforcement Learning”, Shao et al 2018
 “ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, Kitchen & Benedetti 2018
 “Has Dynamic Programming Improved Decision Making?”, Rust 2018
 “Improving Width-based Planning With Compact Policies”, Junyent et al 2018
 “Surprising Negative Results for Generative Adversarial Tree Search”, Azizzadenesheli et al 2018
 “Dual Policy Iteration”, Sun et al 2018
 “Solving the Rubik’s Cube Without Human Knowledge”, McAleer et al 2018
 “Feedback-Based Tree Search for Reinforcement Learning”, Jiang et al 2018
 “A Tree Search Algorithm for Sequence Labeling”, Lao et al 2018
 “Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations”, Bertsekas 2018
 “Sim-to-Real Optimization of Complex Real World Mobile Network With Imperfect Information via Deep Reinforcement Learning from Self-play”, Tan et al 2018
 “Learning to Search With MCTSnets”, Guez et al 2018
 “M-Walk: Learning to Walk over Graphs Using Monte Carlo Tree Search”, Shen et al 2018
 “Mastering Chess and Shogi by Self-Play With a General Reinforcement Learning Algorithm”, Silver et al 2017
 “Mastering the Game of Go without Human Knowledge”, Silver et al 2017
 “DeepMind's Latest AI Breakthrough Is Its Most Important Yet: Google-owned DeepMind's Go-playing Artificial Intelligence Can Now Learn without Human Help... or Data”, Burgess 2017
 “Self-taught AI Is Best yet at Strategy Game Go”, Gibney 2017
 “Learning Generalized Reactive Policies Using Deep Neural Networks”, Groshev et al 2017
 “Learning to Plan Chemical Syntheses”, Segler et al 2017
 “Thinking Fast and Slow With Deep Learning and Tree Search”, Anthony et al 2017
 “Deep Reinforcement Learning: An Overview”, Li 2017
 “DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker”, Moravčík et al 2017
 “Mastering the Game of Go With Deep Neural Networks and Tree Search”, Silver et al 2016
 “Giraffe: Using Deep Reinforcement Learning to Play Chess”, Lai 2015
 “Reinforcement Learning As Classification: Leveraging Modern Classifiers”, Lagoudakis & Parr 2003
 Tensor processing unit
 Monte Carlo tree search
 Master (software)
 Lee Sedol
 Fan Hui
 Demis Hassabis
 David Silver (programmer)
 Darkforest
 AlphaZero
 AlphaGo
 Miscellaneous
“HTPS: HyperTree Proof Search for Neural Theorem Proving”, Lample et al 2022
“HTPS: HyperTree Proof Search for Neural Theorem Proving”, (2022-05-23):
We propose an online training procedure for a transformer-based automated theorem prover.
Our approach leverages a new search algorithm, HyperTree Proof Search (HTPS), inspired by the recent success of AlphaZero. Our model learns from previous proof searches through online training, allowing it to generalize to domains far from the training distribution.
We report detailed ablations of our pipeline’s main components by studying performance on three environments of increasing complexity.
In particular, we show that with HTPS alone, a model trained on annotated proofs manages to prove 65.4% of a held-out set of Metamath theorems, substantially outperforming the previous state of the art of 56.5% by GPT-f. Online training on these unproved theorems increases accuracy to 82.6%. With a similar computational budget, we improve the state of the art on the Lean-based miniF2F-curriculum dataset from 31% to 42% proving accuracy.
“CrossBeam: Learning to Search in Bottom-Up Program Synthesis”, Shi et al 2022
“CrossBeam: Learning to Search in Bottom-Up Program Synthesis”, (2022-03-20; similar):
Many approaches to program synthesis perform a search within an enormous space of programs to find one that satisfies a given specification. Prior works have used neural models to guide combinatorial search algorithms, but such approaches still explore a huge portion of the search space and quickly become intractable as the size of the desired program increases. To tame the search space blowup, we propose training a neural model to learn a hands-on search policy for bottom-up synthesis, instead of relying on a combinatorial search algorithm. Our approach, called CrossBeam, uses the neural model to choose how to combine previously-explored programs into new programs, taking into account the search history and partial program executions. Motivated by work in structured prediction on learning to search, CrossBeam is trained on-policy using data extracted from its own bottom-up searches on training tasks. We evaluate CrossBeam in two very different domains, string manipulation and logic programming. We observe that CrossBeam learns to search efficiently, exploring much smaller portions of the program space compared to the state-of-the-art.
“Policy Improvement by Planning With Gumbel”, Danihelka et al 2022
“Policy improvement by planning with Gumbel”, (2022-03-04):
AlphaZero is a powerful reinforcement learning algorithm based on approximate policy iteration and tree search. However, AlphaZero can fail to improve its policy network, if not visiting all actions at the root of a search tree.
To address this issue, we propose a policy improvement algorithm based on sampling actions without replacement [see Huijben et al 2021 for more on Gumbel tricks]. Furthermore, we use the idea of policy improvement to replace the more heuristic mechanisms by which AlphaZero selects and uses actions, both at root nodes and at non-root nodes.
Our new algorithms, Gumbel AlphaZero and Gumbel MuZero, respectively without and with modellearning, match the state of the art on Go, chess, and Atari, and significantly improve prior performance when planning with few simulations.
[Keywords: AlphaZero, MuZero, reinforcement learning]
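The "sampling actions without replacement" step relies on the generic Gumbel-top-k trick (the subject of the Huijben et al 2021 survey cited above). A minimal sketch of that trick alone, not of the full Gumbel AlphaZero procedure (which combines it with sequential halving); the function name is illustrative:

```python
import numpy as np

def gumbel_top_k(logits, k, rng):
    """Sample k distinct actions without replacement from the categorical
    distribution defined by `logits`: adding iid Gumbel(0,1) noise to each
    logit and keeping the k largest perturbed logits is equivalent to
    sequentially sampling k actions without replacement."""
    gumbels = rng.gumbel(size=len(logits))
    # argsort ascending, reverse, take the top-k perturbed logits:
    return np.argsort(logits + gumbels)[::-1][:k]

rng = np.random.default_rng(0)
logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))  # action prior
actions = gumbel_top_k(logits, 2, rng)  # 2 distinct candidate actions
```

With k=1 this reduces to ordinary sampling from the prior, which is why the same perturbed logits can drive both root-action selection and the improved policy target.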
“Player of Games”, Schmid et al 2021
“Player of Games”, (2021-12-06; backlinks; similar):
[Kilcher video; lead author has launched an algorithmic trading startup; cf. DeepStack] Games have a long history of serving as a benchmark for progress in artificial intelligence. Recently, approaches using search and learning have shown strong performance across a set of perfect information games, and approaches using game-theoretic reasoning and learning have shown strong performance for specific imperfect information poker variants.
We introduce Player of Games, a general-purpose algorithm that unifies previous approaches, combining guided search, self-play learning, and game-theoretic reasoning [counterfactual regret minimization]. We prove that Player of Games is sound, converging to perfect play as available computation time and approximation capacity increase.
Player of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold’em poker (Slumbot), and defeats the state-of-the-art agent in Scotland Yard, an imperfect information game that illustrates the value of guided search, learning, and game-theoretic reasoning.
Player of Games is the first algorithm to achieve strong empirical performance in large perfect and imperfect information games—an important step towards truly general algorithms for arbitrary environments.
“νSDDP: Neural Stochastic Dual Dynamic Programming”, Dai et al 2021
“νSDDP: Neural Stochastic Dual Dynamic Programming”, (2021-12-01; similar):
Stochastic dual dynamic programming (SDDP) is a state-of-the-art method for solving multi-stage stochastic optimization, widely used for modeling real-world process optimization tasks. Unfortunately, SDDP has a worst-case complexity that scales exponentially in the number of decision variables, which severely limits applicability to only low-dimensional problems.
To overcome this limitation, we extend SDDP by introducing a trainable neural model that learns to map problem instances to a piecewise linear value function within an intrinsic low-dimensional space, which is architected specifically to interact with a base SDDP solver, so that it can accelerate optimization performance on new instances. The proposed Neural Stochastic Dual Dynamic Programming (νSDDP) continually self-improves by solving successive problems.
An empirical investigation demonstrates that νSDDP can substantially reduce problem-solving cost without sacrificing solution quality over competitors such as SDDP and reinforcement learning algorithms, across a range of synthetic and real-world process optimization problems.
“Acquisition of Chess Knowledge in AlphaZero”, McGrath et al 2021
“Acquisition of Chess Knowledge in AlphaZero”, (2021-11-17; backlinks; similar):
What is being learned by superhuman neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts, our ability to understand faithful explanations of their decisions will be restricted, ultimately limiting what we can achieve with neural network interpretability.
In this work we provide evidence that human knowledge is acquired by the AlphaZero neural network as it trains on the game of chess. By probing for a broad range of human chess concepts we show when and where these concepts are represented in the AlphaZero network. We also provide a behavioural analysis focusing on opening play, including qualitative analysis from chess Grandmaster Vladimir Kramnik. Finally, we carry out a preliminary investigation looking at the lowlevel details of AlphaZero’s representations, and make the resulting behavioural and representational analyses available online.
“Evaluating Model-based Planning and Planner Amortization for Continuous Control”, Byravan et al 2021
“Evaluating model-based planning and planner amortization for continuous control”, (2021-10-07; similar):
There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We find that well-tuned model-free agents are strong baselines even for high-DoF control problems, but MPC with learned proposals and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency in hard multi-task/multi-goal settings. Finally, we show that it is possible to distill a model-based planner into a policy that amortizes the planning computation without any loss of performance. Videos of agents performing different tasks can be seen at https://sites.google.com/view/mbrlamortization/home.
“Scalable Online Planning via Reinforcement Learning Fine-Tuning”, Fickinger et al 2021
“Scalable Online Planning via Reinforcement Learning Fine-Tuning”, (2021-09-30; similar):
Lookahead search has been a critical component of recent AI successes, such as in the games of chess, go, and poker. However, the search methods used in these games, and in many other settings, are tabular. Tabular search methods do not scale well with the size of the search space, and this problem is exacerbated by stochasticity and partial observability.
In this work we replace tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and show that this approach outperforms state-of-the-art search algorithms in benchmark settings.
In particular, we use our search algorithm to achieve a new state-of-the-art result in self-play Hanabi, and show the generality of our algorithm by also showing that it outperforms tabular search in the Atari game Ms. Pacman.
“Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control”, Bertsekas 2021
“Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control”, (2021-08-20; similar):
In this paper we aim to provide analysis and insights (often based on visualization), which explain the beneficial effects of online decision making on top of offline training.
In particular, through a unifying abstract mathematical framework, we show that the principal AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces.
Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.
“How Does AI Improve Human Decision-Making? Evidence from the AI-Powered Go Program”, Choi et al 2021
“How Does AI Improve Human Decision-Making? Evidence from the AI-Powered Go Program”, (2021-07-26; backlinks; similar):
[see also “Assessing Human Error Against a Benchmark of Perfection”, Anderson et al 2016] How does artificial intelligence (AI) improve human decision-making? Answering this question is challenging because it is difficult to assess the quality of each decision and to disentangle AI’s influence on decisions. We study professional Go games, which provide a unique opportunity to overcome such challenges.
In 2016 an AI-powered Go program (APG) unexpectedly beat the best human player, surpassing the best human knowledge and skills accumulated over thousands of years. To investigate the impact of APGs, we compare human moves to AI’s superior solutions, before and after the initial public release of an APG [Leela Zero, KataGo, and NHN’s Handol]. Our analysis of 750,990 moves in 25,033 games by 1,242 professional players reveals that APGs noticeably improved the quality of the players’ moves as measured by the changes in winning probability with each move. We also show that the key mechanisms are reductions in the number of human errors and in the magnitude of the most critical mistake during the game. Interestingly, the improvement is most prominent in the early stage of a game when uncertainty is higher. Further, young players—who are more open to and better able to utilize APG—benefit more than senior players, suggesting generational inequality in AI adoption and utilization.
[Keywords: artificial intelligence (AI), technology adoption, decisionmaking, human capital, professional Go players, AI adoption inequality]
…The historic Go match (AlphaGo vs. Sedol Lee) was held in 2016; in this game, AI beat the best human professional player for the first time and by a large margin. Shortly after this event, the first open APG, Leela, became available to players in February 2017. Our quantitative and qualitative investigation indicates that professional Go players have used APGs heavily in their training since its release.
The great advantage of this context is that it allows us to observe every single decision of professional Go players before and after the public release of APGs; a game’s entire move history is well archived and maintained for all major games. Furthermore, using the APG’s best solution as a benchmark, we can calculate the probability of winning for every move (ie. 750,990 decisions) by 1,242 professional Go players in 25,033 major games held from 2015 through 2019; note that this can be done even for the games played before APG’s release. We then compare the move-level probability of winning to that of APG’s best solution.
The results show that the quality of moves by professional Go players improved substantially following the release of APG. Before the release, the winning probability of each move by professional Go players averaged 2.47 percentage points lower than the moves of APG. This gap decreased by about 0.756 percentage points (or 30.5%) after the release of APG. Additional analyses indicate that the improvement in move quality eventually leads to the final win of the game. Interestingly, this effect is most prominent in the early stage of a game where higher uncertainty is exhibited and there is more opportunity for players to learn from AI. Furthermore, quality improvement is more prominent among young players who are open to and capable of utilizing APGs; this has important implications for digital literacy and inequality in AI utilization.
[Example of an absolute human error rate: from the AI’s perspective, each move a human Go pro makes costs them ~1.2% chance of winning!]
We also explore the mechanisms through which professional players achieve a higher probability of winning. Our mediation analysis reveals that improvements in the quality of moves are driven mainly by reducing the number of errors (moves where the winning probability drops by 10 or more percentage points compared to the immediately preceding move by a focal player) and by reducing the magnitude of the most critical mistake (the biggest drop in winning probability during the game). Specifically, the number of errors per game decreased by 0.15–0.50 and the magnitude of the most critical mistake decreased by 4–7 percentage points.
…3.3.1. Go Games and Players: We collect data on professional Go games held from 2015 through 2019 from the Go4Go database, which has been widely used in studies of Go (eg. Chao et al 2018, Ramon & Struyf 2003, Wu et al 2018). The data contains detailed information on the game, its players, Komi (the number of bonus points given to the second mover), the sequence of all moves, and the game outcome. From Go Ratings we gather additional data on the ages, nationalities (eg. China, Japan, South Korea, Taiwan, and others), genders, and annual rankings of professional players. We multiply the ranking by negative one and divide it by 1,000 to ease the interpretation of the results; the higher the value, the better the player. To control for the difference in players’ capabilities in each game, we create a variable, Rank difference, as the difference between the raw rankings of the 2 players; we divide this difference by 1,000 such that a positive value indicates that the focal player’s ranking is lower than the opponent’s ranking.
…Using 2–8 Nvidia TitanX GPUs running in parallel, the computational analysis of games took about 3 months.
…This analysis is motivated by the norm that, after Go games, players spend substantial time and effort analyzing and evaluating each move—especially if the move was an error or a mistake. In an interview with news media, Jinseo Shin (who was ranked first in the world in 2020) stated:
Before APG, players and their peers replayed the game and discussed which move was an error and which was a critical mistake. After the public release of APG, this replay and discussion by players became almost meaningless. APG teaches us by showing the accurate winning probability with each move. If the winning probability drops from 60% to 40% after a move, that is an error. If it drops from 80% to 20%, that is a critical mistake. … I have to admit that the APGbased training provides limitless help in developing my Go skills (Sohn 2021).
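The paper's move-quality definitions quoted above (an "error" is a move dropping the win probability by ≥10 percentage points; the "critical mistake" is the game's largest drop) translate directly into code. An illustrative sketch with a hypothetical function name, not the authors' implementation:

```python
def analyze_moves(win_probs, error_threshold=10.0):
    """Given a player's win probability (in percentage points) after each of
    their successive moves, count 'errors' (drops of >= error_threshold points
    versus the previous position) and find the 'critical mistake' (the single
    largest drop), per the definitions quoted above."""
    drops = [prev - cur for prev, cur in zip(win_probs, win_probs[1:])]
    errors = sum(1 for d in drops if d >= error_threshold)
    critical = max(drops, default=0.0)  # biggest single drop in the game
    return errors, critical

# e.g. a trace 55 -> 60 -> 40 -> 45 -> 20 has two errors (drops of 20 and 25)
# and a critical mistake of 25 percentage points.
errors, critical = analyze_moves([55.0, 60.0, 40.0, 45.0, 20.0])
```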
“Train on Small, Play the Large: Scaling Up Board Games With AlphaZero and GNN”, Ben-Assayag & El-Yaniv 2021
“Train on Small, Play the Large: Scaling Up Board Games with AlphaZero and GNN”, (2021-07-18; similar):
Playing board games is considered a major challenge for both humans and AI researchers. Because some complicated board games are quite hard to learn, humans usually begin with playing on smaller boards and incrementally advance to master larger board strategies. Most neural network frameworks that are currently tasked with playing board games neither perform such incremental learning nor possess capabilities to automatically scale up.
In this work, we look at the board as a graph and combine a graph neural network architecture inside the AlphaZero framework, along with some other innovative improvements. Our ScalableAlphaZero is capable of learning to play incrementally on small boards, and advancing to play on large ones. Our model can be trained quickly to play different challenging board games on multiple board sizes, without using any domain knowledge.
We demonstrate the effectiveness of ScalableAlphaZero and show, for example, that by training it for only three days on small Othello boards, it can defeat the AlphaZero model on a large board, which was trained to play the large board for 30 days.
“Neural Tree Expansion for Multi-Robot Planning in Non-Cooperative Environments”, Riviere et al 2021
“Neural Tree Expansion for Multi-Robot Planning in Non-Cooperative Environments”, (2021-04-20; similar):
We present a self-improving, Neural Tree Expansion (NTE) method for multi-robot online planning in non-cooperative environments, where each robot attempts to maximize its cumulative reward while interacting with other self-interested robots. Our algorithm adapts the centralized, perfect information, discrete-action space method from AlphaZero to a decentralized, partial information, continuous action space setting for multi-robot applications. Our method has three interacting components: (1) a centralized, perfect-information “expert” Monte Carlo Tree Search (MCTS) with large computation resources that provides expert demonstrations, (2) a decentralized, partial-information “learner” MCTS with small computation resources that runs in real-time and provides self-play examples, and (3) policy & value neural networks that are trained with the expert demonstrations and bias both the expert and the learner tree growth. Our numerical experiments demonstrate Neural Tree Expansion’s computational advantage by finding better solutions than a MCTS with 20× more resources. The resulting policies are dynamically sophisticated, demonstrate coordination between robots, and play the Reach-Target-Avoid differential game significantly better than the state-of-the-art control-theoretic baseline for multi-robot, double-integrator systems. Our hardware experiments on an aerial swarm demonstrate the computational advantage of Neural Tree Expansion, enabling online planning at 20Hz with effective policies in complex scenarios.
“Scaling Scaling Laws With Board Games”, Jones 2021
“Scaling Scaling Laws with Board Games”, (2021-04-07; backlinks; similar):
The largest experiments in machine learning now require resources far beyond the budget of all but a few institutions. Fortunately, it has recently been shown that the results of these huge experiments can often be extrapolated from the results of a sequence of far smaller, cheaper experiments. In this work, we show that not only can the extrapolation be done based on the size of the model, but on the size of the problem as well.
By conducting a sequence of experiments using AlphaZero and Hex, we show that the performance achievable with a fixed amount of compute degrades predictably as the game gets larger and harder. Along with our main result, we further show that the testtime and traintime compute available to an agent can be traded off while maintaining performance.
Slope: The slope of the incline is 500 Elo per order of magnitude increase in compute.
A more memorable interpretation is that if you are in the linearly-increasing regime, then you will need about 2× as much compute as your opponent to beat them 2⁄3 of the time.
Perfect play: The minimum compute needed for perfect play increases 7× for each increment in board size.
Takeoff: The minimum training compute needed to see any improvement over random play increases by 4× for each increment of board size.
Random play: Finally, the distance between random play and perfect play increases by 500 Elo for each increment of board size.
Unlike the other quantities mentioned previously, the distance between random and perfect play is a property of the game itself rather than of the agent.
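The headline numbers above are mutually consistent under the standard Elo expected-score formula: doubling compute on a 500-Elo-per-order-of-magnitude slope is worth about log₁₀(2) × 500 ≈ 150 Elo, which the Elo formula converts to roughly a 2⁄3 win rate. A quick arithmetic check (standard Elo formula, not code from the paper):

```python
import math

def elo_win_prob(delta):
    """Standard Elo expected score for a rating advantage of `delta` points."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

slope = 500.0                    # Elo per order of magnitude of compute (Jones 2021)
delta = slope * math.log10(2.0)  # Elo gained by doubling compute: ~150
p = elo_win_prob(delta)          # ~0.70, close to the quoted 2/3
```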
…Train-test tradeoff: So far we have focused on the compute budget during training, but another pertinent budget is the compute spent during evaluation. All the results discussed previously have used a tree search of size 64 during evaluation, the same as used during training. But there is no reason that the train-time search and test-time search have to be the same size, and so by varying the size of the test-time compute budget we can see in Figure 8 that larger tree searches at test time can substantially improve the performance of an agent.
Knowing now that compute can be spent in 2 places, at train time and test time, the immediate question is: how do these 2 budgets trade off? This is illustrated in Figure 9, which shows that the tradeoff is linear in log-compute: for each additional 10× of train-time compute, about 15× of test-time compute can be eliminated, down to a floor of a single-node tree search…the simple relationship between compute at train time and compute at test time was originally surprising to us. Our intuition was that test-time compute is much ‘cheaper’ than train-time compute, and so we were surprised that one could easily substitute for the other. On reflection however, we believe the key distinction is that an optimization at test-time needs only optimise over one sample, while train-time compute meanwhile must optimise over the entire distribution of samples.
…the way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins^{3}. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2⁄3 [as you go from 1:1, half the total numbers drawn, to 2:1, or 2/(2+1)—as if each tree search were an independent lottery ticket]. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.
Second, both the relation of performance to board size and the relation of performance to compute are smooth. Before embarking on this project, a key unknown was whether performance would show any ‘spikes’ with regards to compute or board size. A spike with regards to compute might indicate the model had achieved some key insight, while a spike with regards to board size might indicate a minimum complexity past which key insights are available for the model to discover. As is however, models’ performance changes smoothly and predictably with both increased compute and increased complexity.
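The toy model sketched above is easy to simulate: each player draws as many uniform random numbers as it has "compute", and whoever holds the overall maximum wins, giving player A a win probability of n_a⁄(n_a + n_b). A minimal sketch (function name mine):

```python
import random

def toy_win_rate(n_a, n_b, games=100_000, seed=0):
    """Monte Carlo estimate of the toy model from the text: each player draws
    as many Uniform(0,1) numbers as it has compute; the player holding the
    overall maximum wins.  Analytically, A wins with probability n_a/(n_a+n_b)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(games):
        a = max(rng.random() for _ in range(n_a))
        b = max(rng.random() for _ in range(n_b))
        wins += a > b
    return wins / games

# A 2:1 compute ratio should win about 2/3 of the time.
rate = toy_win_rate(2, 1)
```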
“OLIVAW: Mastering Othello without Human Knowledge, nor a Fortune”, Norelli & Panconesi 2021
“OLIVAW: Mastering Othello without Human Knowledge, nor a Fortune”, (2021-03-31):
We introduce OLIVAW, an AI Othello player adopting the design principles of the famous AlphaGo programs. The main motivation behind OLIVAW was to attain exceptional competence in a nontrivial board game at a tiny fraction of the cost of its illustrious predecessors. In this paper, we show how the AlphaGo Zero paradigm can be successfully applied to the popular game of Othello using only commodity hardware and free cloud services. While being simpler than Chess or Go, Othello maintains a considerable search space and difficulty in evaluating board positions. To achieve this result, OLIVAW implements some improvements inspired by recent works to accelerate the standard AlphaGo Zero learning process. The main modification involves doubling the positions collected per game during the training phase, by also including positions not played but largely explored by the agent. We tested the strength of OLIVAW in three different ways: by pitting it against Edax, the strongest open-source Othello engine, by playing anonymous games on the web platform OthelloQuest, and finally in two in-person matches against top-notch human players: a national champion and a former world champion.
“Transfer of Fully Convolutional Policy-Value Networks Between Games and Game Variants”, Soemers et al 2021
“Transfer of Fully Convolutional Policy-Value Networks Between Games and Game Variants”, (2021-02-24; similar):
In this paper, we use fully convolutional architectures in AlphaZero-like self-play training setups to facilitate transfer between variants of board games as well as distinct games. We explore how to transfer trained parameters of these architectures based on shared semantics of channels in the state and action representations of the Ludii general game system. We use Ludii’s large library of games and game variants for extensive transfer learning evaluations, in zero-shot transfer experiments as well as experiments with additional fine-tuning time.
“Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search”, Lan et al 2020
“Learning to Stop: Dynamic Simulation Monte-Carlo Tree Search”, (2020-12-14; similar):
Monte Carlo tree search (MCTS) has achieved state-of-the-art results in many domains, such as Go and Atari games, when combined with deep neural networks (DNNs). When more simulations are executed, MCTS can achieve higher performance but also requires enormous amounts of CPU and GPU resources. However, not all states require a long search time to identify the best action the agent can find. For example, in 19×19 Go and NoGo, we found that for more than half of the states, the best action predicted by the DNN remains unchanged even after searching for 2 minutes. This implies that a significant amount of resources can be saved if we are able to stop the search earlier when we are confident in the current search result. In this paper, we propose to achieve this goal by predicting the uncertainty of the current search status and using the result to decide whether we should stop searching. With our algorithm, called Dynamic Simulation MCTS (DS-MCTS), we can speed up a NoGo agent trained by AlphaZero by 2.5× while maintaining a similar winning rate. Also, under the same average simulation count, our method can achieve a 61% winning rate against the original program.
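The stopping rule can be sketched as follows. Here `run_simulations` and `predicted_uncertainty` are hypothetical stand-ins (the paper trains a network to predict whether more search would change the chosen move), and the chunk size and threshold are illustrative:

```python
import random

def run_simulations(state, n):
    """Placeholder for a real MCTS: run n simulations and return visit counts.
    One move is made artificially more attractive so counts concentrate."""
    counts = state.setdefault("visits", {m: 0 for m in ["a", "b", "c"]})
    for _ in range(n):
        counts[random.choices(list(counts), weights=[8, 1, 1])[0]] += 1
    return counts

def predicted_uncertainty(counts):
    """Stand-in for the learned uncertainty predictor: here, simply the
    visit share NOT going to the current best move."""
    total = sum(counts.values())
    return 1.0 - max(counts.values()) / total

def dynamic_simulation_search(state, chunk=100, max_sims=1600, threshold=0.3):
    """Spend simulations in chunks; stop early once confident that further
    search is unlikely to change the chosen move."""
    sims = 0
    while sims < max_sims:
        counts = run_simulations(state, chunk)
        sims += chunk
        if predicted_uncertainty(counts) < threshold:
            break  # confident: save the remaining simulation budget
    return max(counts, key=counts.get), sims
```

Under the same budget, easy positions terminate after one or two chunks while contested positions consume the full allotment.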
“Assessing Game Balance With AlphaZero: Exploring Alternative Rule Sets in Chess”, Tomašev et al 2020
“Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess”, (2020-09-09; ; similar):
[essay] It is non-trivial to design engaging and balanced sets of game rules. Modern chess has evolved over centuries, but without a similar recourse to history, the consequences of rule changes to game dynamics are difficult to predict.
AlphaZero provides an alternative in silico means of game balance assessment. It is a system that can learn near-optimal strategies for any rule set from scratch, without any human supervision, by continually learning from its own experience. In this study we use AlphaZero to creatively explore and design new chess variants. There is growing interest in chess variants like Fischer Random Chess, because of classical chess’s voluminous opening theory, the high percentage of draws in professional play, and the non-negligible number of games that end while both players are still in their home preparation.
We compare 9 other variants that involve atomic changes to the rules of chess…The nine changes considered in this study are listed in Table 1. No-castling and No-castling (10) involve a full and a partial [not allowed during the first 10 turns / 20 plies] restriction on the castling rule. Pawn-one-square, Semi-torpedo [pawns can move 2 squares from the 2nd and 3rd ranks], Torpedo [pawns can always move 2 squares], Pawn-back [pawns can move backward, up to your 2nd rank; such moves don’t count against the 50-move limit], and Pawn-sideways involve changes to pawn mobility. Self-capture chess allows players to also capture their own pieces. Finally, Stalemate = win recasts stalemate as a win for the attacking side, rather than a draw…The changes allow novel strategic and tactical patterns to emerge, while keeping the games close to the original. By learning near-optimal strategies for each variant with AlphaZero, we determine what games between strong human players might look like if these variants were adopted.
Qualitatively, several variants are very dynamic. An analytic comparison shows that pieces are valued differently between variants, and that some variants are more decisive than classical chess.
Our findings demonstrate the rich possibilities that lie beyond the rules of modern chess.
“Learning Personalized Models of Human Behavior in Chess”, McIlroy-Young et al 2020
“Learning Personalized Models of Human Behavior in Chess”, (2020-08-23; ; similar):
Even when machine learning systems surpass human ability in a domain, there are many reasons why AI systems that capture human-like behavior would be desirable: humans may want to learn from them, they may need to collaborate with them, or they may expect them to serve as partners in an extended interaction. Motivated by this goal of human-like AI systems, the problem of predicting human actions—as opposed to predicting optimal actions—has become an increasingly useful task.
We extend this line of work by developing highly accurate personalized models of human behavior in the context of chess. Chess is a rich domain for exploring these questions, since it combines a set of appealing features: AI systems have achieved superhuman performance but still interact closely with human chess players both as opponents and preparation tools, and there is an enormous amount of recorded data on individual players. Starting with an open-source version of AlphaZero trained on a population of human players, we demonstrate that we can significantly improve prediction of a particular player’s moves by applying a series of fine-tuning adjustments. Furthermore, we can accurately perform stylometry—predicting who made a given set of actions—indicating that our personalized models capture human decision-making at an individual level.
“ReBeL: Combining Deep Reinforcement Learning and Search for Imperfect-Information Games”, Brown et al 2020
“ReBeL: Combining Deep Reinforcement Learning and Search for Imperfect-Information Games”, (2020-07-27; ; similar):
The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by AlphaZero. However, prior algorithms of this form cannot cope with imperfect-information games.
This paper presents ReBeL, a general framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in any two-player zero-sum game. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero.
Results in two different imperfect-information games show ReBeL converges to an approximate Nash equilibrium. We also show ReBeL achieves superhuman performance in heads-up no-limit Texas hold’em poker, while using far less domain knowledge than any prior poker AI.
“Learning Compositional Neural Programs for Continuous Control”, Pierrot et al 2020
“Learning Compositional Neural Programs for Continuous Control”, (2020-07-27; similar):
We propose a novel solution to challenging sparse-reward, continuous control problems that require hierarchical planning at multiple levels of abstraction. Our solution, dubbed AlphaNPI-X, involves three separate stages of learning. First, we use off-policy reinforcement learning algorithms with experience replay to learn a set of atomic goal-conditioned policies, which can be easily repurposed for many tasks. Second, we learn self-models describing the effect of the atomic policies on the environment. Third, the self-models are harnessed to learn recursive compositional programs with multiple levels of abstraction. The key insight is that the self-models enable planning by imagination, obviating the need for interaction with the world when learning higher-level compositional programs. To accomplish the third stage of learning, we extend the AlphaNPI algorithm, which applies AlphaZero to learn recursive neural programmer-interpreters. We empirically show that AlphaNPI-X can effectively learn to tackle challenging sparse manipulation tasks, such as stacking multiple blocks, where powerful model-free baselines fail.
“Monte-Carlo Tree Search As Regularized Policy Optimization”, Grill et al 2020
“Monte-Carlo Tree Search as Regularized Policy Optimization”, (2020-07-24; ; similar):
The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero’s search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
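In rough notation (reconstructed from the paper’s framing, so take the exact form of the regularization coefficient as indicative rather than definitive), the claim is that the visit-count distribution produced by AlphaZero-style search approximates

```latex
\bar{\pi} \;=\; \operatorname*{arg\,max}_{\pi \in \Delta(\mathcal{A})}
  \left[\, q^{\top}\pi \;-\; \lambda_N \,\mathrm{KL}\!\left(\pi_{\theta} \,\|\, \pi\right) \right]
```

where $q$ is the vector of Q-value estimates, $\pi_{\theta}$ is the network’s policy prior, and $\lambda_N$ is a regularization weight that decays as the total simulation count $N$ grows. The proposed variant simply acts with the exact maximizer $\bar{\pi}$ (which admits a closed form) instead of the visit-count approximation.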
“Tackling Morpion Solitaire With AlphaZero-like Ranked Reward Reinforcement Learning”, Wang et al 2020
“Tackling Morpion Solitaire with AlphaZero-like Ranked Reward Reinforcement Learning”, (2020-06-14; similar):
Morpion Solitaire is a popular single-player game, played with paper and pencil. Due to its large state space (on the order of the game of Go), traditional search algorithms, such as MCTS, have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. After achieving this record, to the best of our knowledge, there has been no further progress reported, for about a decade.
In this paper we take the recent impressive performance of deep self-learning reinforcement learning approaches from AlphaGo/AlphaZero as inspiration to design a searcher for Morpion Solitaire. A challenge of Morpion Solitaire is that the reward signal is sparse: there are few win/loss signals. Instead, we use an approach known as ranked reward to create a reinforcement learning self-play framework for Morpion Solitaire. This enables us to find medium-quality solutions with reasonable computational effort. Our record is a 67-step solution, which is very close to the human best (68), without any other adaptation to the problem than using ranked reward. We list many further avenues for potential improvement.
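A minimal sketch of the ranked-reward idea (buffer size and percentile here are illustrative choices, not the paper’s settings): each new single-player score is binarized into a win/loss signal by comparing it against a running percentile of the agent’s own recent scores, so a sparse scalar objective becomes self-play-style win/loss feedback.

```python
import random
from collections import deque

class RankedReward:
    """Binarize scores against a running percentile of recent scores,
    turning a scalar objective into win/loss signals an AlphaZero-style
    learner can consume."""

    def __init__(self, buffer_size=256, percentile=0.75):
        self.scores = deque(maxlen=buffer_size)
        self.percentile = percentile

    def reward(self, score):
        self.scores.append(score)
        ranked = sorted(self.scores)
        threshold = ranked[int(self.percentile * (len(ranked) - 1))]
        if score > threshold:
            return 1.0   # beat the agent's own recent record: a "win"
        if score < threshold:
            return -1.0  # below recent performance: a "loss"
        return 1.0 if random.random() < 0.5 else -1.0  # tie-break at threshold
```

Because the threshold tracks the agent’s own history, the task stays at a roughly constant difficulty as the agent improves.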
“Aligning Superhuman AI With Human Behavior: Chess As a Model System”, McIlroy-Young et al 2020
“Aligning Superhuman AI with Human Behavior: Chess as a Model System”, (2020-06-02; ; similar):
As artificial intelligence becomes increasingly intelligent—in some cases, achieving superhuman performance—there is growing potential for humans to learn from and collaborate with algorithms. However, the ways in which AI systems approach problems are often different from the ways people do, and thus may be uninterpretable and hard to learn from. A crucial step in bridging this gap between human and artificial intelligence is modeling the granular actions that constitute human behavior, rather than simply matching aggregate human performance.
We pursue this goal in a model system with a long history in artificial intelligence: chess. The aggregate performance of a chess player unfolds as they make decisions over the course of a game. The hundreds of millions of games played online by players at every skill level form a rich source of data in which these decisions, and their exact context, are recorded in minute detail. Applying existing chess engines to this data, including an open-source implementation of AlphaZero, we find that they do not predict human moves well.
We develop and introduce Maia, a customized version of AlphaZero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines, and can achieve maximum accuracy when predicting decisions made by players at a specific skill level in a tuneable way. For a dual task of predicting whether a human will make a large mistake on the next move, we develop a deep neural network that significantly outperforms competitive baselines. Taken together, our results suggest that there is substantial promise in designing artificial intelligence systems with human collaboration in mind by first accurately modeling granular human decisionmaking.
“Neural Machine Translation With Monte-Carlo Tree Search”, Parker & Chen 2020
“Neural Machine Translation with Monte-Carlo Tree Search”, (2020-04-27; similar):
Recent algorithms in machine translation have included a value network to assist the policy network when deciding which word to output at each step of the translation. The addition of a value network helps the algorithm perform better on evaluation metrics like the BLEU score. After training the policy and value networks in a supervised setting, the policy and value networks can be jointly improved through common actor-critic methods. The main idea of our project is to instead leverage Monte-Carlo Tree Search (MCTS) to search for good output words with guidance from a combined policy and value network architecture, in a similar fashion as AlphaZero. This network serves both as a local and a global lookahead reference that uses the result of the search to improve itself. Experiments using the IWSLT14 German-to-English translation dataset show that our method outperforms the actor-critic methods used in recent machine translation papers.
“Approximate Exploitability: Learning a Best Response in Large Games”, Timbers et al 2020
“Approximate exploitability: Learning a best response in large games”, (2020-04-20; ; similar):
A standard metric used to measure the approximate optimality of policies in imperfect-information games is exploitability, i.e. the performance of a policy against its worst-case opponent. However, exploitability is intractable to compute in large games, as it requires a full traversal of the game tree to calculate a best response to the given policy. We introduce a new metric, approximate exploitability, that calculates an analogous metric using an approximate best response; the approximation is done by using search and reinforcement learning. This is a generalization of local best response, a domain-specific evaluation metric used in poker. We provide empirical results for a specific instance of the method, demonstrating that our method converges to exploitability in the tabular and function-approximation settings for small games. In large games, our method learns to exploit both strong and weak agents, including an AlphaZero agent.
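For intuition, exact exploitability is easy to compute in a small matrix game; it is precisely this computation that becomes intractable in large extensive-form games. A minimal sketch for a two-player zero-sum matrix game:

```python
def exploitability(policy_row, policy_col, payoff):
    """Exact exploitability of a policy pair in a two-player zero-sum matrix
    game: the total value worst-case opponents gain against each policy,
    which is 0 exactly at a Nash equilibrium. payoff[i][j] is the row
    player's payoff for actions (i, j)."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Best response to the column policy (maximizes row payoff).
    br_row = max(sum(payoff[i][j] * policy_col[j] for j in range(n_cols))
                 for i in range(n_rows))
    # Best response to the row policy (minimizes row payoff).
    br_col = min(sum(payoff[i][j] * policy_row[i] for i in range(n_rows))
                 for j in range(n_cols))
    # Expected value of the policy pair playing each other.
    value = sum(policy_row[i] * payoff[i][j] * policy_col[j]
                for i in range(n_rows) for j in range(n_cols))
    return (br_row - value) + (value - br_col)
```

In rock-paper-scissors, the uniform strategy pair has exploitability 0, while a deterministic strategy is fully exploitable; the paper’s method replaces the exact best-response computations above with best responses learned by search and RL.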
“Real World Games Look Like Spinning Tops”, Czarnecki et al 2020
“Real World Games Look Like Spinning Tops”, (2020-04-20; ):
This paper investigates the geometrical properties of real-world games (eg. Tic-Tac-Toe, Go, StarCraft II). We hypothesise that their geometrical structure resembles a spinning top, with the upright axis representing transitive strength, and the radial axis, which corresponds to the number of cycles that exist at a particular transitive strength, representing the non-transitive dimension. We prove the existence of this geometry for a wide class of real-world games, exposing their temporal nature. Additionally, we show that this unique structure also has consequences for learning—it clarifies why populations of strategies are necessary for training of agents, and how population size relates to the structure of the game. Finally, we empirically validate these claims by using a selection of nine real-world two-player zero-sum symmetric games, showing (1) the spinning top structure is revealed and can be easily reconstructed by using a new method of Nash clustering to measure the interaction between transitive and cyclical strategy behaviour, and (2) the effect that population size has on the convergence in these games.
“Accelerating and Improving AlphaZero Using Population Based Training”, Wu et al 2020
“Accelerating and Improving AlphaZero Using Population Based Training”, (2020-03-13; ; similar):
AlphaZero has been very successful in many games. Unfortunately, it still consumes a huge amount of computing resources, the majority of which is spent in self-play. Hyperparameter tuning exacerbates the training cost, since each hyperparameter configuration requires its own training run, during which it generates its own self-play records. As a result, multiple runs are usually needed for different hyperparameter configurations. This paper proposes using population based training (PBT) to help tune hyperparameters dynamically and improve strength during training time. Another significant advantage is that this method requires only a single run, while incurring a small additional time cost, since the time for generating self-play records remains unchanged, though the time for optimization is increased following the AlphaZero training algorithm. In our experiments with 9×9 Go, the PBT method is able to achieve a higher win rate than the baselines, each with its own hyperparameter configuration and trained individually. For 19×19 Go, with PBT, we are able to obtain improvements in playing strength. Specifically, the PBT agent can obtain up to a 74% win rate against ELF OpenGo, an open-source state-of-the-art AlphaZero program using a neural network of comparable capacity. This is compared to a saturated non-PBT agent, which achieves a win rate of 47% against ELF OpenGo under the same circumstances.
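A toy sketch of the exploit/explore step at the heart of population-based training (the quartile sizes and perturbation factor below are illustrative, not the paper’s settings): the worst members copy hyperparameters from the best members and perturb them, so hyperparameter tuning happens inside a single training run.

```python
import random

def pbt_step(population, evaluate, perturb=0.2, cull=0.25):
    """One PBT step: evaluate all members, then have the bottom fraction
    copy hyperparameters from randomly chosen top members (exploit) and
    jitter them multiplicatively (explore). Returns the current best."""
    scored = sorted(population, key=evaluate, reverse=True)
    n_cull = max(1, int(cull * len(scored)))
    winners, losers = scored[:n_cull], scored[-n_cull:]
    for loser in losers:
        parent = random.choice(winners)
        for k, v in parent["hypers"].items():
            loser["hypers"][k] = v * random.uniform(1 - perturb, 1 + perturb)
    return scored[0]
```

In the paper’s setting, `evaluate` would be the relative playing strength of each agent, and the mutated members continue training rather than restarting, which is why only one run is needed.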
“Self-Play Learning Without a Reward Metric”, Schmidt et al 2019
“Self-Play Learning Without a Reward Metric”, (2019-12-16; ; backlinks; similar):
The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.
“MuZero: Mastering Atari, Go, Chess and Shogi by Planning With a Learned Model”, Schrittwieser et al 2019
“MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, (2019-11-19; ; backlinks; similar):
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown.
In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function.
When evaluated on 57 different Atari games—the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled—our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
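The structure of the learned model can be caricatured as three functions and an unroll; the stand-ins below are trivial placeholders for what are, in MuZero, jointly trained deep networks:

```python
# Hypothetical stand-ins for MuZero's three learned functions; in the real
# system each is a deep network and all three are trained end-to-end.
def representation(observation):   # h: observation -> hidden state
    return tuple(observation)

def dynamics(hidden, action):      # g: (hidden, action) -> (reward, next hidden)
    return 0.0, hidden + (action,)

def prediction(hidden):            # f: hidden -> (policy prior, value)
    return {0: 0.5, 1: 0.5}, 0.0

def unroll(observation, actions):
    """Plan entirely inside the learned model: encode the observation once,
    then step the dynamics function; the real environment (and its rules)
    is never consulted during the unroll."""
    hidden = representation(observation)
    trajectory = []
    for action in actions:
        reward, hidden = dynamics(hidden, action)
        policy, value = prediction(hidden)
        trajectory.append((reward, policy, value))
    return trajectory
```

The key design choice is that the hidden state is trained only to predict reward, policy, and value, not to reconstruct observations, which is what frees MuZero from needing the game rules or a simulator.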
“Multiplayer AlphaZero”, Petosa & Balch 2019
“Multiplayer AlphaZero”, (2019-10-29; similar):
The AlphaZero algorithm has achieved superhuman performance in two-player, deterministic, zero-sum games where perfect information of the game state is available. This success has been demonstrated in Chess, Shogi, and Go, where learning occurs solely through self-play. Many real-world applications (eg. equity trading) require the consideration of a multiplayer environment. In this work, we suggest novel modifications of the AlphaZero algorithm to support multiplayer environments, and evaluate the approach in two simple 3-player games. Our experiments show that multiplayer AlphaZero learns successfully and consistently outperforms a competing approach: Monte Carlo tree search. These results suggest that our modified AlphaZero can learn effective strategies in multiplayer game scenarios. Our work supports the use of AlphaZero in multiplayer games and suggests future research for more complex environments.
“Global Optimization of Quantum Dynamics With AlphaZero Deep Exploration”, Dalgaard et al 2019
“Global optimization of quantum dynamics with AlphaZero deep exploration”, (2019-07-12; similar):
While a large number of algorithms for optimizing quantum dynamics for different objectives have been developed, a common limitation is the reliance on good initial guesses, being either random or based on heuristics and intuitions. Here we implement a tabula rasa deep quantum exploration version of the DeepMind AlphaZero algorithm for systematically averting this limitation. AlphaZero employs a deep neural network in conjunction with deep lookahead in a guided tree search, which allows for predictive hidden variable approximation of the quantum parameter landscape. To emphasize transferability, we apply and benchmark the algorithm on three classes of control problems using only a single common set of algorithmic hyperparameters. AlphaZero achieves substantial improvements in both the quality and quantity of good solution clusters compared to earlier methods. It is able to spontaneously learn unexpected hidden structure and global symmetry in the solutions, going beyond even human heuristics.
“Learning Compositional Neural Programs With Recursive Tree Search and Planning”, Pierrot et al 2019
“Learning Compositional Neural Programs with Recursive Tree Search and Planning”, (2019-05-30; similar):
We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaZero contributes powerful neural network guided search algorithms, which we augment with recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. The experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disks.
“Deep Policies for Width-Based Planning in Pixel Domains”, Junyent et al 2019
“Deep Policies for Width-Based Planning in Pixel Domains”, (2019-04-12; similar):
Width-based planning has demonstrated great success in recent years due to its ability to scale independently of the size of the state space. For example, Bandres et al 2018 introduced a rollout version of the Iterated Width algorithm whose performance compares well with humans and learning methods in the pixel setting of the Atari games suite. In this setting, planning is done online using the “screen” states and selecting actions by looking ahead into the future. However, this algorithm is purely exploratory and does not leverage past reward information. Furthermore, it requires the state to be factored into features that need to be predefined for the particular task, eg. the B-PROST pixel features. In this work, we extend width-based planning by incorporating an explicit policy in the action selection mechanism. Our method, called π-IW, interleaves width-based planning and policy learning using the state-actions visited by the planner. The policy estimate takes the form of a neural network and is in turn used to guide the planning step, thus reinforcing promising paths. Surprisingly, we observe that the representation learned by the neural network can be used as a feature space for the width-based planner without degrading its performance, thus removing the requirement of predefined features for the planner. We compare π-IW with previous width-based methods and with AlphaZero, a method that also interleaves planning and learning, in simple environments, and show that π-IW has superior performance. We also show that the π-IW algorithm outperforms previous width-based methods in the pixel setting of the Atari games suite.
“Policy Gradient Search: Online Planning and Expert Iteration without Search Trees”, Anthony et al 2019
“Policy Gradient Search: Online Planning and Expert Iteration without Search Trees”, (2019-04-07; similar):
Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems; however, a disadvantage of MCTS is that it estimates the values of states with Monte Carlo averages, stored in a search tree, and this does not scale to games with very high branching factors. We propose an alternative simulation-based search method, Policy Gradient Search (PGS), which adapts a neural network simulation policy online via policy gradient updates, avoiding the need for a search tree. In Hex, PGS achieves comparable performance to MCTS, and an agent trained using Expert Iteration with PGS was able to defeat MoHex 2.0, the strongest open-source Hex agent, in 9×9 Hex.
“AlphaX: EXploring Neural Architectures With Deep Neural Networks and Monte Carlo Tree Search”, Wang et al 2019
“AlphaX: eXploring Neural Architectures with Deep Neural Networks and Monte Carlo Tree Search”, (2019-03-26; ; similar):
Neural Architecture Search (NAS) has shown great success in automating the design of neural networks, but the prohibitive amount of computation behind current NAS methods requires further investigation into improving the sample efficiency and the network evaluation cost to get better results in a shorter time. In this paper, we present a novel scalable Monte Carlo Tree Search (MCTS) based NAS agent, named AlphaX, to tackle these two aspects. AlphaX improves the search efficiency by adaptively balancing exploration and exploitation at the state level, and by using a meta-deep neural network (DNN) to predict network accuracies, biasing the search toward promising regions. To amortize the network evaluation cost, AlphaX accelerates MCTS rollouts with a distributed design and reduces the number of epochs in evaluating a network by transfer learning, guided by the tree structure in MCTS. In 12 GPU-days and 1,000 samples, AlphaX found an architecture that reaches 97.84% top-1 accuracy on CIFAR-10, and 75.5% top-1 accuracy on ImageNet, exceeding SOTA NAS methods in both accuracy and sampling efficiency. Particularly, we also evaluate AlphaX on NAS-Bench-101, a large-scale NAS dataset; AlphaX is 3× and 2.8× more sample-efficient than Random Search and Regularized Evolution in finding the global optimum. Finally, we show the searched architecture improves a variety of vision applications, from Neural Style Transfer, to Image Captioning and Object Detection.
“Minigo: A Case Study in Reproducing Reinforcement Learning Research”, Anonymous 2019
“Minigo: A Case Study in Reproducing Reinforcement Learning Research”, (2019-03-06):
We reproduced AlphaZero on Google Cloud Platform.
The reproducibility of reinforcement-learning research has been highlighted as a key challenge area in the field. In this paper, we present a case study in reproducing the results of one groundbreaking algorithm, AlphaZero, a reinforcement learning system that learns how to play Go at a superhuman level given only the rules of the game. We describe Minigo, a reproduction of the AlphaZero system using publicly available Google Cloud Platform infrastructure and Google Cloud TPUs. The Minigo system includes both the central reinforcement learning loop as well as auxiliary monitoring and evaluation infrastructure. With ten days of training from scratch on 800 Cloud TPUs, Minigo can play evenly against LeelaZero and ELF OpenGo, two of the strongest publicly available Go AIs. We discuss the difficulties of scaling a reinforcement learning system and the monitoring systems required to understand the complex interplay of hyperparameter configurations.
“α-Rank: Multi-Agent Evaluation by Evolution”, Omidshafiei et al 2019
“α-Rank: Multi-Agent Evaluation by Evolution”, (2019-03-04; ):
We introduce α-Rank, a principled evolutionary dynamics methodology for the evaluation and ranking of agents in large-scale multi-agent interactions, grounded in a novel dynamical game-theoretic solution concept called Markov-Conley chains (MCCs).
The approach leverages continuous-time and discrete-time evolutionary dynamical systems applied to empirical games, and scales tractably in the number of agents, the type of interactions, and the type of empirical games (symmetric and asymmetric). Current models are fundamentally limited in one or more of these dimensions and are not guaranteed to converge to the desired game-theoretic solution concept (typically the Nash equilibrium).
α-Rank provides a ranking over the set of agents under evaluation and provides insights into their strengths, weaknesses, and long-term dynamics. This is a consequence of the links we establish to the MCC solution concept when the underlying evolutionary model’s ranking-intensity parameter, α, is chosen to be large, which exactly forms the basis of α-Rank. In contrast to the Nash equilibrium, which is a static concept based on fixed points, MCCs are a dynamical solution concept based on the Markov chain formalism, Conley’s Fundamental Theorem of Dynamical Systems, and the core ingredients of dynamical systems: fixed points, recurrent sets, periodic orbits, and limit cycles.
α-Rank runs in polynomial time with respect to the total number of pure strategy profiles, whereas computing a Nash equilibrium for a general-sum game is known to be intractable. We introduce proofs that not only provide a unifying perspective of existing continuous-time and discrete-time evolutionary evaluation models, but also reveal the formal underpinnings of the α-Rank methodology.
We empirically validate the method in several domains including AlphaGo, AlphaZero, MuJoCo Soccer, and Poker.
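A loose single-population sketch of the large-α idea (this omits the Markov-Conley-chain machinery and the multi-population generality of the actual method): build a Markov chain over strategies in which a mutant strategy invades the resident with probability given by a sharp logistic in the payoff difference, then rank strategies by the chain’s stationary distribution.

```python
import math

def alpha_rank_sketch(payoff, alpha=10.0, iters=2000):
    """Illustrative single-population evolutionary ranking: payoff[i][j] is
    strategy i's payoff against j. Each step, a random mutant j challenges
    the resident i and takes over with probability ~ logistic in its payoff
    advantage (nearly a step function for large alpha). Strategies are ranked
    by the stationary distribution of this chain."""
    n = len(payoff)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                advantage = payoff[j][i] - payoff[i][j]
                T[i][j] = (1.0 / (n - 1)) / (1.0 + math.exp(-alpha * advantage))
        T[i][i] = 1.0 - sum(T[i])
    # Stationary distribution by power iteration.
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi
```

On a transitive toy game where strategy 0 beats 1 and 2, and 1 beats 2, nearly all stationary mass lands on strategy 0, with the remainder ordering the weaker strategies; intransitive (cyclic) games instead spread mass around the cycle, which is the behaviour the MCC formalism is designed to capture.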
“Accelerating Self-Play Learning in Go”, Wu 2019
“Accelerating Self-Play Learning in Go”, (2019-02-27; similar):
By introducing several improvements to the AlphaZero process and architecture, we greatly accelerate self-play learning in Go, achieving a 50× reduction in computation over comparable methods. Like AlphaZero and replications such as ELF OpenGo and Leela Zero, our bot KataGo only learns from neural-net-guided Monte Carlo tree search self-play. But whereas AlphaZero required thousands of TPUs over several days and ELF required thousands of GPUs over two weeks, KataGo surpasses ELF’s final model after only 19 days on fewer than 30 GPUs. Much of the speedup involves non-domain-specific improvements that might directly transfer to other problems. Further gains from domain-specific techniques reveal the remaining efficiency gap between the best methods and purely general methods such as AlphaZero. Our work is a step towards making learning in state spaces as large as Go possible without large-scale computational resources.
“ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero”, Tian et al 2019
“ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero”, (2019-02-12; similar):
The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning’s capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance, with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and the gameplay inference procedures. Our code, models, self-play datasets, and auxiliary data are publicly available.
“Bayesian Optimization in AlphaGo”, Chen et al 2018
“Bayesian Optimization in AlphaGo”, (20181217; similar):
During the development of AlphaGo, its many hyperparameters were tuned with Bayesian optimization multiple times.
This automatic tuning process resulted in substantial improvements in playing strength. For example, prior to the match with Lee Sedol, we tuned the latest AlphaGo agent and this improved its win-rate from 50% to 66.5% in self-play games. This tuned version was deployed in the final match. Of course, since we tuned AlphaGo many times during its development cycle, the compounded contribution was even higher than this percentage.
It is our hope that this brief case study will be of interest to Go fans, and also provide Bayesian optimization practitioners with some insights and inspiration.
…Interestingly, the automatically found hyperparameter values were very different from the default values found by previous hand-tuning efforts. Moreover, the hyperparameters were often correlated, and hence the values found by Bayesian optimization were not reachable with element-wise hand-tuning, or even by tuning pairs of parameters in some cases.
By tuning the mixing ratio between rollout estimates and value network estimates, we found that Bayesian optimization gave increased preference to value network estimates as the design cycle progressed. This eventually led the team to abandon rollout estimates in future versions of AlphaGo and AlphaGo Zero.
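The tuning loop described above can be sketched with a toy Gaussian-process surrogate and an expected-improvement acquisition function. Everything here is an illustrative stand-in, not AlphaGo's actual setup: the `win_rate` objective (a noisy bump peaked at a mixing ratio of 0.7), the kernel lengthscale, and the evaluation budget are all assumed for the example.

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(0)

def win_rate(mix):
    # Hypothetical noisy objective: self-play win-rate as a function of a
    # rollout/value-network mixing ratio (a stand-in, peaked at mix = 0.7).
    return 0.5 + 0.15 * np.exp(-(mix - 0.7) ** 2 / 0.02) + rng.normal(0, 0.01)

def rbf(a, b, ls=0.15):
    # Squared-exponential kernel on scalar inputs.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    # Standard zero-mean GP regression, with the data mean added back.
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ (y - y.mean()) + y.mean()
    var = 1.0 - np.einsum('ij,ik,kj->j', Ks, K_inv, Ks)  # diag of posterior cov
    return mu, np.sqrt(np.maximum(var, 1e-12))

Phi = np.vectorize(lambda z: 0.5 * (1 + erf(z / sqrt(2))))  # standard normal CDF

def expected_improvement(mu, sd, best):
    z = (mu - best) / sd
    return (mu - best) * Phi(z) + sd * np.exp(-0.5 * z ** 2) / sqrt(2 * pi)

grid = np.linspace(0, 1, 101)
X = rng.uniform(0, 1, 3)                   # 3 random initial evaluations
y = np.array([win_rate(x) for x in X])
for _ in range(10):                        # 10 rounds of Bayesian optimization
    mu, sd = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, y.max()))]
    X = np.append(X, x_next)
    y = np.append(y, win_rate(x_next))

best_mix = X[np.argmax(y)]
print(f"best mixing ratio ~ {best_mix:.2f}, observed win-rate ~ {y.max():.3f}")
```

The point of the sketch is the loop structure: fit a cheap surrogate to expensive (self-play) evaluations, then spend the next evaluation where expected improvement is highest, rather than hand-tuning one parameter at a time.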
“A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go through Self-play”, Silver et al 2018
2018silver.pdf#deepmind
: “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”, (20181207; ; backlinks; similar):
The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess), as well as Go.
“AlphaSeq: Sequence Discovery With Deep Reinforcement Learning”, Shao et al 2018
“AlphaSeq: Sequence Discovery with Deep Reinforcement Learning”, (20180926; similar):
Sequences play an important role in many applications and systems. Discovering sequences with desired properties has long been an interesting intellectual pursuit. This paper puts forth a new paradigm, AlphaSeq, to discover desired sequences algorithmically using deep reinforcement learning (DRL) techniques. AlphaSeq treats the sequence discovery problem as an episodic symbol-filling game, in which a player fills symbols in the vacant positions of a sequence set sequentially during an episode of the game. Each episode ends with a completely-filled sequence set, upon which a reward is given based on the desirability of the sequence set. AlphaSeq models the game as a Markov Decision Process (MDP), and adapts the DRL framework of AlphaGo to solve the MDP. Sequences discovered improve progressively as AlphaSeq, starting as a novice, learns to become an expert game player through many episodes of game playing. Compared with traditional sequence construction by mathematical tools, AlphaSeq is particularly suitable for problems with complex objectives intractable to mathematical analysis. We demonstrate the searching capabilities of AlphaSeq in two applications: (1) AlphaSeq successfully rediscovers a set of ideal complementary codes that can zero-force all potential interferences in multi-carrier CDMA systems. (2) AlphaSeq discovers new sequences that triple the signal-to-interference ratio—benchmarked against the well-known Legendre sequence—of a mismatched filter estimator in pulse compression radar systems.
“ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, Kitchen & Benedetti 2018
“ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, (20180830; ; similar):
The current state of the art in playing many important perfect information games, including Chess and Go, combines planning and deep reinforcement learning with self-play. We extend this approach to imperfect information games and present ExIt-OOS, a novel approach to playing imperfect information games within the Expert Iteration framework and inspired by AlphaZero. We use Online Outcome Sampling (OOS), an online search algorithm for imperfect information games, in place of MCTS. While training online, our neural strategy is used to improve the accuracy of playouts in OOS, allowing a learning and planning feedback loop for imperfect information games.
“Has Dynamic Programming Improved Decision Making?”, Rust 2018
“Has dynamic programming improved decision making?”, (20180822; backlinks; similar):
Dynamic programming (DP) is an extremely powerful tool for solving a wide class of sequential decision making problems under uncertainty. In principle, it enables us to compute optimal decision rules that specify the best possible decision to take in any given situation. This article reviews developments in DP and contrasts its revolutionary impact on economics, operations research, engineering, and artificial intelligence, with the comparative paucity of real world applications where DP is actually used to improve decision making. I discuss the literature on numerical solution of DPs and its connection to the literature on reinforcement learning (RL) and artificial intelligence (AI).
Despite amazing, highly publicized successes of these algorithms that result in superhuman levels of performance in board games such as chess or Go, I am not aware of comparably successful applications of DP for helping individuals and firms to solve real-world problems. I point to the fuzziness of many real-world decision problems and the difficulty in mathematically formulating and modeling them as key obstacles to wider application of DP to improve decision making. Nevertheless, I provide several success stories where DP has demonstrably improved decision making and discuss a number of other examples where it seems likely that the application of DP could have substantial value.
I conclude that ‘applied DP’ offers substantial promise for economic policy making if economists can let go of the empirically untenable assumption of unbounded rationality and try to tackle the challenging decision problems faced every day by individuals and firms.
[Keywords: actor-critic algorithms, Alpha Zero, approximate dynamic programming, artificial intelligence, behavioral economics, Bellman equation, bounded rationality, curse of dimensionality, computational complexity, decision rules, dynamic pricing, dynamic programming, employee compensation, Herbert Simon, fleet sizing, identification problem, individual and firm behavior, lifecycle problem, locomotive allocation, machine learning, Markov decision processes, mental models, model-free learning, neural networks, neuro-dynamic programming, offline versus online training, optimal inventory management, optimal replacement, optimal search, principle of decomposition, Q-learning, revenue management, real-time dynamic programming, reinforcement learning, Richard Bellman, structural econometrics, supervised versus unsupervised learning]
“Improving Width-based Planning With Compact Policies”, Junyent et al 2018
“Improving width-based planning with compact policies”, (20180615; ; similar):
Optimal action selection in decision problems characterized by sparse, delayed rewards is still an open challenge. For these problems, current deep reinforcement learning methods require enormous amounts of data to learn controllers that reach human-level performance. In this work, we propose a method that interleaves planning and learning to address this issue. The planning step hinges on the Iterated-Width (IW) planner, a state-of-the-art planner that makes explicit use of the state representation to perform structured exploration. IW is able to scale up to problems independently of the size of the state space. From the state-actions visited by IW, the learning step estimates a compact policy, which in turn is used to guide the planning step. The type of exploration used by our method is radically different from the standard random exploration used in RL. We evaluate our method in simple problems where we show it to have superior performance to the state-of-the-art reinforcement learning algorithms A2C and Alpha Zero. Finally, we present preliminary results in a subset of the Atari games suite.
“Surprising Negative Results for Generative Adversarial Tree Search”, Azizzadenesheli et al 2018
“Surprising Negative Results for Generative Adversarial Tree Search”, (20180615; ; similar):
While many recent advances in deep reinforcement learning (RL) rely on model-free methods, model-based approaches remain an alluring prospect for their potential to exploit unsupervised data to learn environment models. In this work, we provide an extensive study on the design of deep generative models for RL environments and propose a sample-efficient and robust method to learn the model of Atari environments. We deploy this model and propose generative adversarial tree search (GATS), a deep RL algorithm that learns the environment model and implements Monte Carlo tree search (MCTS) on the learned model for planning. While MCTS on the learned model is computationally expensive, similar to AlphaGo, GATS follows depth-limited MCTS. GATS employs a deep Q-network (DQN) and learns a Q-function to assign values to the leaves of the tree in MCTS. We theoretically analyze GATS vis-à-vis the bias-variance trade-off and show GATS is able to mitigate the worst-case error in the Q-estimate. While we were expecting GATS to enjoy a better sample complexity and faster convergence to better policies, surprisingly, GATS fails to outperform DQN. We provide a study in which we show why depth-limited MCTS fails to perform desirably.
“Dual Policy Iteration”, Sun et al 2018
“Dual Policy Iteration”, (20180528; similar):
Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (eg. ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (eg. a deep neural network) deployed at test time, and a slow, non-reactive policy (eg. Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy.
In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.
“Solving the Rubik’s Cube Without Human Knowledge”, McAleer et al 2018
“Solving the Rubik’s Cube Without Human Knowledge”, (20180518; similar):
A generally intelligent agent must be able to teach itself how to solve problems in complex domains with minimal human supervision. Recently, deep reinforcement learning algorithms combined with self-play have achieved superhuman proficiency in Go, Chess, and Shogi without human data or domain knowledge. In these environments, a reward is always received at the end of the game, however, for many combinatorial optimization environments, rewards are sparse and episodes are not guaranteed to terminate. We introduce Autodidactic Iteration: a novel reinforcement learning algorithm that is able to teach itself how to solve the Rubik’s Cube with no human assistance. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves—less than or equal to solvers that employ human domain knowledge.
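The core Autodidactic Iteration idea — generate training states by scrambling outward from the solved state, then label each state with a one-step lookahead under the current value estimator — can be sketched on a toy permutation puzzle. The puzzle, the reward scheme, and the stub `value` function below are illustrative stand-ins for the cube and the neural network, not the paper's implementation.

```python
import random

SOLVED = (0, 1, 2, 3)

def moves(state):
    # Generators of a toy permutation puzzle: swap any adjacent pair
    # (a stand-in for the quarter-turn moves of a Rubik's Cube).
    out = []
    for i in range(len(state) - 1):
        t = list(state)
        t[i], t[i + 1] = t[i + 1], t[i]
        out.append(tuple(t))
    return out

def adi_targets(n_scrambles, depth, value):
    # Autodidactic-Iteration-style data generation: walk away from the
    # solved state, and label each visited state with the one-step
    # lookahead target max_a [R(s, a) + value(s')], where the reward R
    # is +1 for reaching the solved state and -1 otherwise, and `value`
    # plays the role of the current neural-network value estimate.
    data = []
    for _ in range(n_scrambles):
        s = SOLVED
        for _ in range(depth):
            s = random.choice(moves(s))
            target = max((1.0 if c == SOLVED else -1.0 + value(c))
                         for c in moves(s))
            data.append((s, target))
    return data

random.seed(0)
samples = adi_targets(n_scrambles=20, depth=5, value=lambda s: 0.0)
print(len(samples), "labeled states generated by scrambling")
```

With the zero-stub value function, any state adjacent to the solved state gets the maximal target of +1 and everything else gets −1; iterating fit-then-relabel is what propagates value information outward from the goal.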
“Feedback-Based Tree Search for Reinforcement Learning”, Jiang et al 2018
“Feedback-Based Tree Search for Reinforcement Learning”, (20180515; similar):
Inspired by recent successes of Monte-Carlo tree search (MCTS) in a number of artificial intelligence (AI) application domains, we propose a model-based reinforcement learning (RL) technique that iteratively applies MCTS on batches of small, finite-horizon versions of the original infinite-horizon Markov decision process. The terminal condition of the finite-horizon problems, or the leaf-node evaluator of the decision tree generated by MCTS, is specified using a combination of an estimated value function and an estimated policy function. The recommendations generated by the MCTS procedure are then provided as feedback in order to refine, through classification and regression, the leaf-node evaluator for the next iteration.
We provide the first sample complexity bounds for a tree search-based RL algorithm. In addition, we show that a deep neural network implementation of the technique can create a competitive AI agent for the popular multiplayer online battle arena (MOBA) game King of Glory.
“A Tree Search Algorithm for Sequence Labeling”, Lao et al 2018
“A Tree Search Algorithm for Sequence Labeling”, (20180429; ; similar):
In this paper we propose a novel reinforcement learning based model for sequence tagging, referred to as MM-Tag. Inspired by the success and methodology of AlphaGo Zero, MM-Tag formalizes the problem of sequence tagging as a Monte Carlo tree search (MCTS) enhanced Markov decision process (MDP) model, in which the time steps correspond to the positions of words in a sentence from left to right, and each action assigns a tag to a word. Two long short-term memory networks (LSTM) are used to summarize the past tag assignments and words in the sentence. Based on the outputs of the LSTMs, a policy for guiding the tag assignment and a value for predicting the tagging accuracy of the whole sentence are produced. The policy and value are then strengthened with MCTS, which takes the raw policy and value as inputs, simulates and evaluates the possible tag assignments at subsequent positions, and outputs a better search policy for assigning tags. A reinforcement learning algorithm is proposed to train the model parameters. Our work is the first to apply the MCTS-enhanced MDP model to the sequence tagging task. We show that MM-Tag can accurately predict tags thanks to the exploratory decision-making mechanism introduced by MCTS. Experimental results on a chunking benchmark show that MM-Tag outperforms state-of-the-art sequence tagging baselines, including CRF and CRF with LSTM.
“Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations”, Bertsekas 2018
“Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations”, (20180412; ; backlinks; similar):
In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller “aggregate” Markov decision problem, whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
“Sim-to-Real Optimization of Complex Real World Mobile Network With Imperfect Information via Deep Reinforcement Learning from Self-play”, Tan et al 2018
“Sim-to-Real Optimization of Complex Real World Mobile Network with Imperfect Information via Deep Reinforcement Learning from Self-play”, (20180218; ):
The mobile network that millions of people use every day is one of the most complex systems in the world. Optimization of mobile networks to meet exploding customer demand and reduce capital/operation expenditures poses great challenges. Despite recent progress, application of deep reinforcement learning (DRL) to complex real-world problems still remains unsolved, given data scarcity, partial observability, risk, and complex rules/dynamics in the real world, as well as the huge reality gap between simulation and the real world.
To bridge the reality gap, we introduce a Sim-to-Real framework to directly transfer learning from simulation to the real world via a graph convolutional neural network (CNN)—by abstracting the partially observable mobile network into a graph, then distilling the domain-variant irregular graph into a domain-invariant tensor in locally Euclidean space as input to the CNN, together with domain randomization and multi-task learning. We use a novel self-play mechanism to encourage competition among DRL agents for the best record on multiple tasks via simulated annealing, just as athletes compete for world records in the decathlon. We also propose a decentralized, multi-agent, competitive and cooperative DRL method to coordinate the actions of multiple cells to maximize global reward and minimize negative impact on neighboring cells.
Using 6 field trials on commercial mobile networks, we demonstrate for the first time that a DRL agent can successfully transfer learning from simulation to a complex real-world problem with imperfect information, complex rules/dynamics, huge state/action space, and multi-agent interactions, without any training in the real world.
“Learning to Search With MCTSnets”, Guez et al 2018
“Learning to Search with MCTSnets”, (20180213; ; similar):
Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimized to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.
“M-Walk: Learning to Walk over Graphs Using Monte Carlo Tree Search”, Shen et al 2018
“M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search”, (20180212; ; similar):
Learning to walk over a graph towards a target node for a given query and a source node is an important problem in applications such as knowledge base completion (KBC). It can be formulated as a reinforcement learning (RL) problem with a known state transition model.
To overcome the challenge of sparse rewards, we develop a graph-walking agent called M-Walk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the state (ie. history of the walked path) and maps it separately to a policy and Q-values. In order to effectively train the agent from sparse rewards, we combine MCTS with the neural policy to generate trajectories yielding more positive rewards. From these trajectories, the network is improved in an off-policy manner using Q-learning, which modifies the RNN policy via parameter sharing. Our proposed RL algorithm repeatedly applies this policy-improvement step to learn the model. At test time, MCTS is combined with the neural policy to predict the target node.
Experimental results on several graph-walking benchmarks show that M-Walk is able to learn better policies than other RL-based methods, which are mainly based on policy gradients. M-Walk also outperforms traditional KBC baselines.
“Mastering Chess and Shogi by Self-Play With a General Reinforcement Learning Algorithm”, Silver et al 2017
“Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”, (20171205; ; backlinks; similar):
The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play.
In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.
“Mastering the Game of Go without Human Knowledge”, Silver et al 2017
2017silver.pdf#deepmind
: “Mastering the game of Go without human knowledge”, (20171019; ; backlinks; similar):
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
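The loop described above — a network's policy prior and value estimate guiding PUCT-style tree search, whose visit counts in turn define a stronger move-selection policy — can be sketched on a trivial game. The count-to-10 game, the uniform prior, and the zero-value leaf stub below are stand-ins for Go and the trained network, not AlphaGo Zero's actual components.

```python
import math

WIN_TOTAL = 10   # players alternately add 1 or 2; whoever reaches 10 wins

class Node:
    def __init__(self, total):
        self.total = total
        self.children = {}                          # move -> Node
        legal = [m for m in (1, 2) if total + m <= WIN_TOTAL]
        self.N = {m: 0 for m in legal}              # visit counts
        self.W = {m: 0.0 for m in legal}            # summed action-values
        self.P = {m: 1.0 / len(legal) for m in legal}  # "policy net" prior (uniform stub)

def puct(node, c=1.5):
    # PUCT selection: exploit mean value Q, explore via prior-weighted bonus.
    n = sum(node.N.values())
    return max(node.N, key=lambda m: (node.W[m] / node.N[m] if node.N[m] else 0.0)
               + c * node.P[m] * math.sqrt(n + 1) / (1 + node.N[m]))

def simulate(node, nodes):
    # One simulation; returns the value from the side-to-move's perspective.
    if node.total == WIN_TOTAL:
        return -1.0                      # the previous player just won
    m = puct(node)
    if m not in node.children:           # expand leaf; a real system queries the value net here
        child = nodes.setdefault(node.total + m, Node(node.total + m))
        node.children[m] = child
        v = -1.0 if child.total == WIN_TOTAL else 0.0   # value-net stub
    else:
        v = simulate(node.children[m], nodes)
    node.N[m] += 1
    node.W[m] -= v                       # child's value is from the opponent's view
    return -v

root = Node(8)                           # from total 8, adding 2 wins immediately
nodes = {8: root}
for _ in range(200):
    simulate(root, nodes)
best = max(root.N, key=root.N.get)       # most-visited move, as in AlphaZero's play policy
print("visits:", root.N, "-> chosen move:", best)
```

In the full algorithm these visit counts also become the training target for the policy head, closing the self-play improvement loop; here the search alone suffices to concentrate visits on the winning move.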
“DeepMind's Latest AI Breakthrough Is Its Most Important Yet: Google-owned DeepMind's Go-playing Artificial Intelligence Can Now Learn without Human Help... or Data”, Burgess 2017
“DeepMind's latest AI breakthrough is its most important yet: Google-owned DeepMind's Go-playing artificial intelligence can now learn without human help... or data”, (20171018; ; similar):
DeepMind’s human-conquering AlphaGo AI just got even smarter. The firm’s latest Go-playing system not only defeated all previous versions of the software, it did it all by itself. “The most striking thing for me is we don’t need any human data anymore”, says Demis Hassabis, the CEO and co-founder of DeepMind. While the first version of AlphaGo needed to be trained on data from more than 100,000 human games, AlphaGo Zero can learn to play from a blank slate. Not only has DeepMind removed the need for the initial human data input, Zero is also able to learn faster than its predecessor.
David Silver, the main programmer on DeepMind’s Go project, says the original AlphaGo that defeated 18-time world champion Lee Sedol 4–1 required several months of training. “We reached a superior level of performance after training for just 72 hours with AlphaGo Zero”, he says. Only 4.9 million simulated games were needed to train Zero, compared to the original AlphaGo’s 30 million. After the 3 days of learning Zero was able to defeat the Lee Sedol-conquering version 100–0. After it had been playing the game for 40 days, Zero defeated DeepMind’s previous strongest version of AlphaGo, called Master, which defeated Chinese master Ke Jie in May.
…When Zero played a game against itself, it was given feedback from the system. A +1 is given if it wins and a −1 if it loses. After each game the neural network behind Zero automatically reconfigures to a new, theoretically better, version. On average the system took 0.4 seconds of thinking time before making a move.
“In the original version, we tried this a couple of years ago and it would collapse”, Hassabis says. He cites DeepMind’s “novel” reinforcement algorithms for Zero’s new ability to learn without prior knowledge. Additionally, the new system uses only one neural network instead of 2, and 4 of Google’s AI processors compared with the 48 needed to beat Lee. During the development of Zero, Hassabis says the system was trained on hardware that cost the company as much as $35 million. The hardware is also used for other DeepMind projects.
“Self-taught AI Is Best yet at Strategy Game Go”, Gibney 2017
“Self-taught AI is best yet at strategy game Go”, (20171018; ; similar):
Merging these functions into a single neural network made the algorithm both stronger and much more efficient, said Silver. It still required a huge amount of computing power—4 of the specialized chips called tensor processing units, which Hassabis estimated to be US$25 million of hardware. But this was less than one-tenth what its predecessors used. It also trained itself in days, rather than months. The implication is that “algorithms matter much more than either computing or data available”, said Silver.
“Learning Generalized Reactive Policies Using Deep Neural Networks”, Groshev et al 2017
“Learning Generalized Reactive Policies using Deep Neural Networks”, (20170824; similar):
We present a new approach to learning for planning, where knowledge acquired while solving a given set of planning problems is used to plan faster in related, but new problem instances. We show that a deep neural network can be used to learn and represent a generalized reactive policy (GRP) that maps a problem instance and a state to an action, and that the learned GRPs efficiently solve large classes of challenging problem instances. In contrast to prior efforts in this direction, our approach significantly reduces the dependence of learning on handcrafted domain knowledge or feature selection. Instead, the GRP is trained from scratch using a set of successful execution traces. We show that our approach can also be used to automatically learn a heuristic function that can be used in directed search algorithms. We evaluate our approach using an extensive suite of experiments on two challenging planning problem domains and show that our approach facilitates learning complex decision making policies and powerful heuristic functions with minimal human input. Videos of our results are available at goo.gl/Hpy4e3.
“Learning to Plan Chemical Syntheses”, Segler et al 2017
“Learning to Plan Chemical Syntheses”, (20170814; similar):
From medicines to materials, small organic molecules are indispensable for human wellbeing. To plan their syntheses, chemists employ a problem-solving technique called retrosynthesis. In retrosynthesis, target molecules are recursively transformed into increasingly simpler precursor compounds until a set of readily available starting materials is obtained. Computer-aided retrosynthesis would be a highly valuable tool; however, past approaches were slow and provided results of unsatisfactory quality. Here, we employ Monte Carlo Tree Search (MCTS) to efficiently discover retrosynthetic routes. MCTS was combined with an expansion policy network that guides the search, and an “in-scope” filter network to preselect the most promising retrosynthetic steps. These deep neural networks were trained on 12 million reactions, which represents essentially all reactions ever published in organic chemistry. Our system solves almost twice as many molecules and is 30× faster in comparison to the traditional search method based on extracted rules and hand-coded heuristics. Finally, after a 60-year history of computer-aided synthesis planning, chemists can no longer distinguish between routes generated by a computer system and real routes taken from the scientific literature. We anticipate that our method will accelerate drug and materials discovery by assisting chemists to plan better syntheses faster, and by enabling fully automated robot synthesis.
“Thinking Fast and Slow With Deep Learning and Tree Search”, Anthony et al 2017
“Thinking Fast and Slow with Deep Learning and Tree Search”, (20170523; similar):
Sequential decision making problems, such as structured prediction, robotic control, and game playing, require a combination of planning policies and generalisation of those plans. In this paper, we present Expert Iteration (ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. Subsequently, tree search is improved by using the neural network policy to guide search, increasing the strength of new plans. In contrast, standard deep Reinforcement Learning algorithms rely on a neural network not only to generalize plans, but to discover them too. We show that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MoHex 1.0, the most recent Olympiad Champion player to be publicly released.
“Deep Reinforcement Learning: An Overview”, Li 2017
“Deep Reinforcement Learning: An Overview”, (20170125; ; similar):
We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, Industry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions.
Please see Deep Reinforcement Learning, arXiv:1810.06339, for a significant update.
“DeepStack: ExpertLevel Artificial Intelligence in NoLimit Poker”, Moravčík et al 2017
“DeepStack: ExpertLevel Artificial Intelligence in NoLimit Poker”, (20170106; ; backlinks; similar):
Artificial intelligence has seen several breakthroughs in recent years, with games often serving as milestones. A common feature of these games is that players have perfect information. Poker is the quintessential game of imperfect information, and a long-standing challenge problem in artificial intelligence.
We introduce DeepStack, an algorithm for imperfect-information settings. It combines recursive reasoning to handle information asymmetry, decomposition to focus computation on the relevant decision, and a form of intuition that is automatically learned from self-play using deep learning.
The approach is theoretically sound and is shown to produce strategies that are more difficult to exploit than prior approaches. In a study involving 44,000 hands of poker, DeepStack defeated, with statistical significance, professional poker players in heads-up no-limit Texas hold’em.
“Mastering the Game of Go With Deep Neural Networks and Tree Search”, Silver et al 2016
2016silver.pdf#deepmind
: “Mastering the game of Go with deep neural networks and tree search”, (2016-01-28; backlinks; similar):
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
[Anecdote: I hear from Groq that, about a month beforehand, the original AlphaGo GPU implementation was not on track to defeat Lee Sedol, when they happened to gamble on implementing TPUv1 support. The additional compute led to drastic performance gains: the TPU model could beat the GPU model in ~98 of 100 games, and the final model solidly defeated Lee Sedol. (Since TPUv1s reportedly only did inferencing/forward-mode, presumably they were not used for the initial imitation learning or the policy-gradient self-play, but for generating the ~30 million self-play games which the value network was trained on (doing regression/prediction of ‘board → P(win)’), requiring no state or activations from the self-play games, just generating an extremely large corpus which could be easily used by GPU training.)]
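The abstract's combination of Monte Carlo search with policy and value networks rests on a prior-weighted selection rule at each tree node (a PUCT-style variant; the dict-based node layout and constant below are illustrative sketches, not AlphaGo's actual implementation):

```python
import math

def puct_select(node, c_puct=1.5):
    """Choose the child action maximising Q + U: Q is the mean action
    value accumulated from evaluations/rollouts, and U is an exploration
    bonus proportional to the policy network's prior P, which decays as
    the child's visit count N grows."""
    total_visits = sum(ch["N"] for ch in node["children"].values())
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0              # mean value
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
        return q + u
    return max(node["children"], key=lambda a: score(node["children"][a]))
```

An unvisited move with a strong prior is explored first; once visit counts accumulate, the empirical value Q dominates the prior-driven bonus.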
“Giraffe: Using Deep Reinforcement Learning to Play Chess”, Lai 2015
“Giraffe: Using Deep Reinforcement Learning to Play Chess”, (2015-09-04; similar):
This report presents Giraffe, a chess engine that uses self-play to discover all its domain-specific knowledge, with minimal handcrafted knowledge given by the programmer. Unlike previous attempts using machine learning only to perform parameter-tuning on handcrafted evaluation functions, Giraffe’s learning system also performs automatic feature extraction and pattern recognition. The trained evaluation function performs comparably to the evaluation functions of state-of-the-art chess engines—all of which contain thousands of lines of carefully handcrafted pattern recognizers, tuned over many years by both computer chess experts and human chess masters. Giraffe is the most successful attempt thus far at using end-to-end machine learning to play chess.
“Reinforcement Learning As Classification: Leveraging Modern Classifiers”, Lagoudakis & Parr 2003
2003lagoudakis.pdf
: “Reinforcement Learning as Classification: Leveraging Modern Classifiers”, Michail Lagoudakis, Ronald Parr (2003-01-01)
Tensor processing unit
Monte Carlo tree search
Master (software)
Lee Sedol
Fan Hui
Demis Hassabis
David Silver (programmer)
Darkforest
AlphaZero
AlphaGo
Miscellaneous

https://www.nature.com/articles/s41598-019-45619-9#deepmind
https://proceedings.neurips.cc/paper/2014/file/8bb88f80d334b1869781beb89f7b73be-Paper.pdf

https://hackernoon.com/the-3-tricks-that-made-alphago-zero-work-f3d47b6686ef

https://en.chessbase.com/post/the-future-is-here-alphazero-learns-chess

https://en.chessbase.com/post/leela-chess-zero-alphazero-for-the-pc
https://en.chessbase.com/post/acquisition-of-chess-knowledge-in-alphazero

https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/

https://cacm.acm.org/magazines/2021/9/255049-playing-with-and-against-computers/fulltext

https://blog.janestreet.com/deep-learning-the-hardest-go-problem-in-the-world/

https://ai.googleblog.com/2021/03/leveraging-machine-learning-for-game.html
http://cl-informatik.uibk.ac.at/cek/holstep/ckfccs-holstep-submitted.pdf