"AlphaZero: Shedding new light on the grand games of chess, shogi and Go" [DM releases followup paper on AlphaZero, +100 shogi games, +100 chess games, and video discussion]

gwern · 2018-12-06T19:30:37+00:00

tl;dr: no important differences - this is a cleaned up AlphaZero responding to the critics: training is simplified a little bit more, it's trained for longer, the chess engine comparisons are fairer & less of a curbstomp, some of 'domain knowledge' that made Gary Marcus so butt-hurt like the mirroring of Go positions has been removed, and some more sample games have been released for the gamers to look at (but of course no trained models or source). And overall: it's pretty much the same thing, as one should've expected. Good to know but not a big deal.

Previously: "Mastering The Game of Go without Human Knowledge", Silver et al 2017a; "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm", Silver et al 2017b

New: "A general reinforcement learning algorithm that masters chess, shogi and Go through self-play", Silver et al 2018:

The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess) as well as Go.

...The AlphaZero algorithm described in this paper (see (10) for pseudocode) differs from the original AlphaGo Zero algorithm in several respects.

AlphaGo Zero estimated and optimized the probability of winning, exploiting the fact that Go games have a binary win or loss outcome. However, both chess and shogi may end in drawn outcomes; it is believed that the optimal solution to chess is a draw (16–18). AlphaZero instead estimates and optimizes the expected outcome.

The rules of Go are invariant to rotation and reflection. This fact was exploited in AlphaGo and AlphaGo Zero in two ways. First, training data were augmented by generating eight symmetries for each position. Second, during MCTS, board positions were transformed by using a randomly selected rotation or reflection before being evaluated by the neural network, so that the Monte Carlo evaluation was averaged over different biases. To accommodate a broader class of games, AlphaZero does not assume symmetry; the rules of chess and shogi are asymmetric (e.g. pawns only move forward, and castling is different on kingside and queenside). AlphaZero does not augment the training data and does not transform the board position during MCTS.

In AlphaGo Zero, self-play games were generated by the best player from all previous iterations. After each iteration of training, the performance of the new player was measured against the best player; if the new player won by a margin of 55% then it replaced the best player. By contrast, AlphaZero simply maintains a single neural network that is updated continually, rather than waiting for an iteration to complete. Self-play games are always generated by using the latest parameters for this neural network. Like AlphaGo Zero, the board state is encoded by spatial planes based only on the basic rules for each game. The actions are encoded by either spatial planes or a flat vector, again based only on the basic rules for each game (10). AlphaGo Zero used a convolutional neural network architecture that is particularly wellsuited to Go: the rules of the game are translationally invariant (matching the weight sharing structure of convolutional networks) and are defined in terms of liberties corresponding to the adjacencies between points on the board (matching the local structure of convolutional networks). By contrast, the rules of chess and shogi are position-dependent (e.g. pawns may move two steps forward from the second rank and promote on the eighth rank) and include long-range interactions (e.g. the queen may traverse the board in one move). Despite these differences, AlphaZero uses the same convolutional network architecture as AlphaGo Zero for chess, shogi and Go.

The hyperparameters of AlphaGo Zero were tuned by Bayesian optimization. In AlphaZero we reuse the same hyperparameters, algorithm settings and network architecture for all games without game-specific tuning. The only exceptions are the exploration noise and the learning rate schedule (see (10) for further details).

We trained separate instances of AlphaZero for chess, shogi and Go. Training proceeded for 700,000 steps (in mini-batches of 4,096 training positions) starting from randomly initialized parameters. During training only, 5,000 first-generation tensor processing units (TPUs) (19) were used to generate self-play games, and 16 second-generation TPUs were used to train the neural networks. Training lasted for approximately 9 hours in chess, 12 hours in shogi and 13 days in Go (see table S3) (20). Further details of the training procedure are provided in (10). Figure 1 shows the performance of AlphaZero during self-play reinforcement learning, as a function of training steps, on an Elo (21) scale (22). In chess, AlphaZero first outperformed Stockfish after just 4 hours (300,000 steps); in shogi, AlphaZero first outperformed Elmo after 2 hours (110,000 steps); and in Go, AlphaZero first outperformed AlphaGo Lee (9) after 30 hours (74,000 steps). The training algorithm achieved similar performance in all independent runs (see fig. S3), suggesting that the high performance of AlphaZero’s training algorithm is repeatable.

...In Go, AlphaZero defeated AlphaGo Zero (9), winning 61% of games. This demonstrates that a general approach can recover the performance of an algorithm that exploited board symmetries to generate eight times as much data (see also fig. S1).

...DeepMind has filed the following patent applications related to this work: PCT/EP2018/063869; US15/280,711; US15/280,784.

Video: https://www.youtube.com/watch?v=7L2sUGcOgh0 (promo video, reuses some of the documentary footage, not very substantial or worth watching, but some fun quotes eg Sadler: "It plays [chess] like a human on fire.")
Kasparov editorial: "Chess, a Drosophila of Reasoning".

Amusing mostly for Kasparov at the end going into contortions to try to rescue his old claims that machines would never be better than man+machines, by redefining it to learning from the machines and handling tasks the machines don't handle at all and defining that as a partnership (as opposed to the original actual Advanced Chess/centaur teams he used to talk about, where the machines merely made tactical suggestions to the player, who routinely overrode them for strategic reasons, and this was the future of work and AI - anyone remember that? Pepperidge Farm remembers...):

AlphaZero shows us that machines can be the experts, not merely expert tools. Explainability is still an issue—it's not going to put chess coaches out of business just yet. But the knowledge it generates is information we can all learn from. Alpha-Zero is surpassing us in a profound and useful way, a model that may be duplicated on any other task or field where virtual knowledge can be generated. Machine learning systems aren't perfect, even at a closed system like chess. There will be cases where an AI will fail to detect exceptions to their rules. Therefore, we must work together, to combine our strengths.
Murray Campbell editorial: "Mastering Board Games"
Chess/shogi game downloads: https://deepmind.com/research/alphago/alphazero-resources/

reinforcementlearning

MODERATORS

reinforcementlearning

MODERATORS

Welcome to Reddit.

Want to add to the discussion?