See Also
Links
“DouZero: Mastering DouDizhu With Self-Play Deep Reinforcement Learning”, Zha et al 2021
“DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning”, (2021-06-11):
Games are abstractions of the real world, where artificial agents learn to compete and cooperate with other agents. While significant achievements have been made in various perfect-information and imperfect-information games, DouDizhu (a.k.a. Fighting the Landlord), a three-player card game, is still unsolved. DouDizhu is a very challenging domain with competition, collaboration, imperfect information, large state space, and particularly a massive set of possible actions where the legal actions vary substantially from turn to turn. Unfortunately, modern reinforcement learning algorithms mainly focus on simple and small action spaces, and not surprisingly, are shown not to make satisfactory progress in DouDizhu.
In this work, we propose a conceptually simple yet effective DouDizhu AI system, namely DouZero, which enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors.
Starting from scratch on a single server with four GPUs, DouZero outperformed all existing DouDizhu AI programs within days of training and ranked first on the Botzone leaderboard among 344 AI agents. Through building DouZero, we show that classic Monte-Carlo methods can be made to deliver strong results in a hard domain with a complex action space.
The code and an online demo are released on GitHub with the hope that this insight could motivate future work.
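The core recipe (Deep Monte-Carlo: a state-action value network trained on full-episode returns, with action selection over encoded legal moves) is concrete enough to sketch. The snippet below is a minimal, hypothetical illustration of that idea; the feature dimensions, network shape, and helper names are assumptions for exposition, not DouZero's released code.

```python
# Hypothetical sketch of the Deep Monte-Carlo idea described above: a Q-network
# scores (state, action) encodings and is regressed toward Monte-Carlo returns.
# All shapes and names are illustrative assumptions.
import random
import torch
import torch.nn as nn

class StateActionValue(nn.Module):
    def __init__(self, state_dim=512, action_dim=54, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # state: (B, state_dim) encoded observation; action: (B, action_dim) encoded legal move
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def choose_action(q_net, state, legal_actions, epsilon=0.01):
    """Epsilon-greedy over the variable-size set of encoded legal actions."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    with torch.no_grad():
        batch = state.unsqueeze(0).expand(len(legal_actions), -1)
        scores = q_net(batch, torch.stack(legal_actions))
    return legal_actions[int(scores.argmax())]

def mc_update(q_net, optimizer, states, actions, episode_return):
    """Regress every visited (state, action) pair toward the episode's Monte-Carlo return."""
    targets = torch.full((len(states),), float(episode_return))
    loss = nn.functional.mse_loss(q_net(torch.stack(states), torch.stack(actions)), targets)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

In this framing, the "massive set of possible actions" is handled by scoring each encoded legal action rather than by a fixed-size output head, which is why the action encoding matters as much as the network itself.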
“Suphx: Mastering Mahjong With Deep Reinforcement Learning”, Li et al 2020
“Suphx: Mastering Mahjong with Deep Reinforcement Learning”, (2020-03-30):
Artificial Intelligence (AI) has achieved great success in many domains, and game AI is widely regarded as its beachhead since the dawn of AI. In recent years, studies on game AI have gradually evolved from relatively simple environments (eg. perfect-information games such as Go, chess, shogi, or two-player imperfect-information games such as heads-up Texas hold’em) to more complex ones (eg. multi-player imperfect-information games such as multi-player Texas hold’em and StarCraft II). Mahjong is a popular multi-player imperfect-information game worldwide but very challenging for AI research due to its complex playing/scoring rules and rich hidden information. We design an AI for Mahjong, named Suphx, based on deep reinforcement learning with some newly introduced techniques including global reward prediction, oracle guiding, and run-time policy adaptation. Suphx has demonstrated stronger performance than most top human players in terms of stable rank and is rated above 99.99% of all the officially ranked human players in the Tenhou platform. This is the first time that a computer program outperforms most top human players in Mahjong.
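Of the techniques named above, oracle guiding is the most mechanically concrete: the agent is first trained with access to perfect-information ("oracle") features, which are then masked out with increasing probability until only observable features remain. The snippet below is a hypothetical sketch of that idea under an assumed linear annealing schedule; the function and feature names are illustrative, not Suphx's implementation.

```python
# Hypothetical sketch of Suphx-style "oracle guiding": hidden (perfect-information)
# features are blended in early in training and annealed away, so the policy
# gradually comes to rely only on observable features. Schedule is an assumption.
import numpy as np

def oracle_guided_input(observable_feats, hidden_feats, step, total_steps, rng=np.random):
    """Concatenate observable features with progressively dropped-out oracle features."""
    keep_prob = max(0.0, 1.0 - step / total_steps)      # illustrative linear annealing
    mask = rng.binomial(1, keep_prob, size=hidden_feats.shape)
    return np.concatenate([observable_feats, hidden_feats * mask], axis=-1)
```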
“Finding Friend and Foe in Multi-Agent Games”, Serrino et al 2019
“Finding Friend and Foe in Multi-Agent Games”, (2019-06-05):
AI for multi-agent games like Go, Poker, and Dota has seen great strides in recent years. Yet none of these games address the real-life challenge of cooperation in the presence of unknown and uncertain teammates. This challenge is a key game mechanism in hidden role games. Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on The Resistance: Avalon, the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play. Our algorithm integrates deductive reasoning into vector-form CFR to reason about joint beliefs and deduce partially observable actions. We augment deep value networks with constraints that yield interpretable representations of win probabilities. These innovations enable DeepRole to scale to the full Avalon game. Empirical game-theoretic methods show that DeepRole outperforms other hand-crafted and learned agents in five-player Avalon. DeepRole played with and against human players on the web in hybrid human-agent teams. We find that DeepRole outperforms human players as both a cooperator and a competitor.
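The "deductive reasoning" component can be pictured as hard filtering of a belief vector over hidden role assignments: any assignment logically contradicted by public evidence gets zero probability before renormalization. The sketch below is a simplified, hypothetical illustration for 5-player Avalon with 2 spies, not the paper's vector-form CFR code; the single mission-failure rule is the assumed deduction.

```python
# Hypothetical sketch of deductive belief filtering over hidden role assignments
# in a 5-player, 2-spy Avalon game: assignments inconsistent with a public
# observation (a failed mission) are zeroed out and the belief renormalized.
from itertools import combinations
import numpy as np

PLAYERS = 5
SPY_COUNT = 2
ASSIGNMENTS = list(combinations(range(PLAYERS), SPY_COUNT))  # all 10 possible spy pairs

def filter_beliefs(beliefs, mission_team, mission_failed):
    """Zero out role assignments contradicted by a mission result, then renormalize."""
    beliefs = beliefs.copy()
    for i, spies in enumerate(ASSIGNMENTS):
        spies_on_team = any(p in mission_team for p in spies)
        if mission_failed and not spies_on_team:
            beliefs[i] = 0.0          # a sabotaged mission requires at least one spy on it
    total = beliefs.sum()
    return beliefs / total if total > 0 else beliefs

# e.g. uniform prior over spy pairs, then players {0, 2} go on a mission that fails
prior = np.full(len(ASSIGNMENTS), 1.0 / len(ASSIGNMENTS))
posterior = filter_beliefs(prior, mission_team={0, 2}, mission_failed=True)
```

Note that only failures license a hard deduction: a successful mission does not prove the team was spy-free, since spies may choose not to sabotage.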
“Monte Carlo Neural Fictitious Self-Play: Approach to Approximate Nash Equilibrium of Imperfect-Information Games”, Zhang et al 2019
“Monte Carlo Neural Fictitious Self-Play: Approach to Approximate Nash equilibrium of Imperfect-Information Games”, (2019-03-22):
Artificial-intelligence research has achieved human-level performance in large-scale perfect-information games, but it is still a challenge to achieve (nearly) optimal results (in other words, an approximate Nash Equilibrium) in large-scale imperfect-information games (ie. war games, football coaching, or business strategies). Neural Fictitious Self Play (NFSP) is an effective algorithm for learning an approximate Nash equilibrium of imperfect-information games from self-play without prior domain knowledge. However, it relies on Deep Q-Networks, which are trained offline and are hard to converge in online games with changing opponent strategies, so it cannot approach an approximate Nash equilibrium in games with large search scale and deep search depth.
In this paper, we propose Monte Carlo Neural Fictitious Self Play (MC-NFSP), an algorithm that combines Monte Carlo tree search with NFSP and greatly improves performance on large-scale zero-sum imperfect-information games. Experimentally, we demonstrate that MC-NFSP can converge to an approximate Nash equilibrium in games with large search depth while NFSP cannot. Furthermore, we develop Asynchronous Neural Fictitious Self Play (ANFSP), which uses an asynchronous, parallel architecture to collect game experience. In experiments, we show that parallel actor-learners further accelerate and stabilize training.
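MC-NFSP keeps NFSP's two-policy structure, so a useful mental model is NFSP's anticipatory action selection, with the best-response side improved by Monte Carlo tree search rather than plain Q-learning. The snippet below is a hedged sketch of that selection rule only; the policy objects and the value of the anticipatory parameter are assumptions for illustration.

```python
# Hypothetical sketch of the NFSP action-selection scheme that MC-NFSP builds on:
# each agent mixes a best-response policy (here a stand-in for the search-improved
# policy) with a supervised "average" policy that imitates its own past behavior.
import random

ETA = 0.1  # anticipatory parameter: how often to play the best response (assumed value)

def nfsp_act(state, best_response_policy, average_policy, supervised_buffer):
    if random.random() < ETA:
        action = best_response_policy(state)           # exploitative / search-improved move
        supervised_buffer.append((state, action))      # the average policy learns from these
    else:
        action = average_policy(state)                 # approximates the time-average strategy
    return action
```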
“A Survey and Critique of Multiagent Deep Reinforcement Learning”, Hernandez-Leal et al 2018
“A Survey and Critique of Multiagent Deep Reinforcement Learning”, (2018-10-12):
Deep reinforcement learning (RL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent learning (MAL) scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. The primary goal of this article is to provide a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Additionally, we complement the overview with a broader analysis: (1) we revisit previous key components, originally presented in MAL and RL, and highlight how they have been adapted to multiagent deep reinforcement learning settings. (2) We provide general guidelines to new practitioners in the area: describing lessons learned from MDRL works, pointing to recent benchmarks, and outlining open avenues of research. (3) We take a more critical tone raising practical challenges of MDRL (eg. implementation and computational demands). We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists (eg. RL and MAL) in a joint effort to promote fruitful research in the multiagent community.
“Solving Imperfect-Information Games via Discounted Regret Minimization”, Brown & Sandholm 2018
“Solving Imperfect-Information Games via Discounted Regret Minimization”, (2018-09-11):
Counterfactual regret minimization (CFR) is a family of iterative algorithms that are the most popular and, in practice, fastest approach to solving large imperfect-information games. In this paper we introduce novel CFR variants that (1) discount regrets from earlier iterations in various ways (in some cases differently for positive and negative regrets), (2) reweight iterations in various ways to obtain the output strategies, (3) use a non-standard regret minimizer and/or (4) leverage “optimistic regret matching”. They lead to dramatically improved performance in many settings. For one, we introduce a variant that outperforms CFR+, the prior state-of-the-art algorithm, in every game tested, including large-scale realistic settings. CFR+ is a formidable benchmark: no other algorithm has been able to outperform it. Finally, we show that, unlike CFR+, many of the important new variants are compatible with modern imperfect-information-game pruning techniques and one is also compatible with sampling in the game tree.
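The discounting idea can be stated compactly: after iteration t, accumulated positive regrets, accumulated negative regrets, and the accumulated average strategy are each scaled down by a different factor. The sketch below illustrates that per-iteration discounting for a single information set, taking the paper's recommended parameter values (roughly α = 1.5, β = 0, γ = 2) as assumptions; it is not a full CFR implementation.

```python
# Hypothetical sketch of the discounting step from Discounted CFR (DCFR) for one
# information set: positive regrets, negative regrets, and the average-strategy
# accumulator are scaled by different factors after each iteration t.
import numpy as np

def regret_matching(cum_regret):
    """Current strategy is proportional to positive cumulative regret."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full_like(cum_regret, 1.0 / len(cum_regret))

def dcfr_discount(cum_regret, strategy_sum, t, alpha=1.5, beta=0.0, gamma=2.0):
    """Apply DCFR's per-iteration discounting (parameter values assumed from the paper)."""
    pos_w = t**alpha / (t**alpha + 1)          # shrink positive regrets slightly
    neg_w = t**beta / (t**beta + 1)            # shrink negative regrets harder (beta=0 halves them)
    cum_regret = np.where(cum_regret > 0, cum_regret * pos_w, cum_regret * neg_w)
    strategy_sum = strategy_sum * (t / (t + 1))**gamma   # de-emphasize early average-strategy weight
    return cum_regret, strategy_sum
```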
“ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, Kitchen & Benedetti 2018
“ExIt-OOS: Towards Learning from Planning in Imperfect Information Games”, (2018-08-30):
The current state of the art in playing many important perfect information games, including Chess and Go, combines planning and deep reinforcement learning with self-play. We extend this approach to imperfect information games and present ExIt-OOS, a novel approach to playing imperfect information games within the Expert Iteration framework, inspired by AlphaZero. We use Online Outcome Sampling (OOS), an online search algorithm for imperfect information games, in place of MCTS. While training online, our neural strategy is used to improve the accuracy of playouts in OOS, allowing a learning and planning feedback loop for imperfect information games.
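Expert Iteration itself is a simple loop: a search procedure (the "expert", here OOS instead of MCTS) improves on the neural network (the "apprentice"), and the apprentice is trained to imitate the expert's improved move distributions, which in turn makes future searches stronger. The sketch below shows that loop under assumed `game`, `apprentice`, and `oos_search` interfaces; it illustrates the framework, not the paper's implementation.

```python
# Hypothetical sketch of the Expert Iteration loop that ExIt-OOS instantiates.
# `game`, `apprentice`, and `oos_search` are assumed interfaces, not the paper's code.
def expert_iteration(game, apprentice, oos_search, iterations, games_per_iter):
    for _ in range(iterations):
        dataset = []
        for _ in range(games_per_iter):
            state = game.initial_state()
            while not game.is_terminal(state):
                # expert: search improves on the apprentice, using it to bias playouts
                improved_policy = oos_search(state, prior=apprentice.policy(state))
                dataset.append((game.observation(state), improved_policy))
                state = game.step(state, improved_policy.sample())
        # apprentice: imitate the expert's improved move distributions
        apprentice.train(dataset)
    return apprentice
```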
“Regret Minimization for Partially Observable Deep Reinforcement Learning”, Jin et al 2017
“Regret Minimization for Partially Observable Deep Reinforcement Learning”, (2017-10-31):
Deep reinforcement learning algorithms that estimate state and state-action value functions have been shown to be effective in a variety of challenging domains, including learning control strategies from raw image pixels. However, algorithms that estimate state and state-action value functions typically assume a fully observed state and must compensate for partial observations by using finite length observation histories or recurrent networks. In this work, we propose a new deep reinforcement learning algorithm based on counterfactual regret minimization that iteratively updates an approximation to an advantage-like function and is robust to partially observed state. We demonstrate that this new algorithm can substantially outperform strong baseline methods on several partially observed reinforcement learning tasks: learning first-person 3D navigation in Doom and Minecraft, and acting in the presence of partially observed objects in Doom and Pong.
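The "advantage-like function" drives the policy through regret matching: at each observation, actions are played in proportion to the positive part of a cumulative advantage estimate. The snippet below sketches just that policy step, assuming per-action cumulative advantages have already been estimated by a network; it is not the full training algorithm.

```python
# Hypothetical sketch of a regret-matching policy over learned advantage estimates:
# action probabilities are proportional to the positive part of the per-action
# cumulative advantage for the current observation.
import numpy as np

def regret_matching_policy(cumulative_advantages):
    """cumulative_advantages: per-action advantage-like estimates for one observation."""
    pos = np.maximum(cumulative_advantages, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total                        # play in proportion to positive "regret"
    return np.full_like(pos, 1.0 / len(pos))      # fall back to uniform when nothing is positive
```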
“Deep Recurrent Q-Learning for Partially Observable MDPs”, Hausknecht & Stone 2015
“Deep Recurrent Q-Learning for Partially Observable MDPs”, (2015-07-23):
Deep Reinforcement Learning has yielded proficient controllers for complex tasks. However, these controllers have limited memory and rely on being able to perceive the complete game screen at each decision point. To address these shortcomings, this article investigates the effects of adding recurrency to a Deep Q-Network (DQN) by replacing the first post-convolutional fully-connected layer with a recurrent LSTM. The resulting Deep Recurrent Q-Network (DRQN), although capable of seeing only a single frame at each timestep, successfully integrates information through time and replicates DQN’s performance on standard Atari games and partially observed equivalents featuring flickering game screens. Additionally, when trained with partial observations and evaluated with incrementally more complete observations, DRQN’s performance scales as a function of observability. Conversely, when trained with full observations and evaluated with partial observations, DRQN’s performance degrades less than DQN’s. Thus, given the same length of history, recurrency is a viable alternative to stacking a history of frames in the DQN’s input layer and while recurrency confers no systematic advantage when learning to play the game, the recurrent net can better adapt at evaluation time if the quality of observations changes.
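The architectural change is small enough to show directly: keep DQN's convolutional stack, but replace the first post-convolutional fully-connected layer with an LSTM so single frames can be integrated over time. Below is a hypothetical PyTorch sketch of that architecture for 84×84 single-channel frames; layer sizes follow the usual DQN convention and are assumptions rather than the authors' original implementation.

```python
# Hypothetical PyTorch sketch of the DRQN architecture described above: a DQN-style
# convolutional stack whose first post-convolutional fully-connected layer is
# replaced by an LSTM, so single frames are integrated across timesteps.
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, num_actions, lstm_size=512):
        super().__init__()
        self.conv = nn.Sequential(                      # standard DQN feature extractor
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, lstm_size, batch_first=True)  # replaces the first FC layer
        self.q_head = nn.Linear(lstm_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- single frames per step, not stacked histories
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)          # integrate information through time
        return self.q_head(out), hidden                 # Q-values per timestep, plus recurrent state
```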