"Mastering the Game of Go without Human Knowledge", Silver, Schrittwieser & Simonyan et al 2017

gwern · 2017-10-19T00:06:30+00:00

Highlights:

AlphaGo Fan < AlphaGo Sedol < Master < Zero; extremely high ELO rating for Zero:

The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan. Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100game match with 2h time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Fig. 6 and Supplementary Information)

...However, it was useful to test different versions of AlphaGo against each other under handicap conditions. Using names of major versions from Zero paper, AlphaGo Master > AlphaGo Lee > AlphaGo Fan, each version defeated its predecessor with 3 handicap stones...
pure selfplay, no initialization from expert games
no hand-engineered features, just the board state
architecture:
- batchnormed residual layers instead of plain CNN/FCs
- value & policy networks merged
- switching to resnets and merging appear to give additive gains
full MCTS is used in training but not in play; during play, simple tree search
- MCTS is however used to provide additional supervision: the policy-value network is now being optimized during policy-gradient self-play by the MCTS finetuning of the probability of playing each move + the ultimate winner. (So if moves A/B/C are evaluated as 33/33/33% by the NN, then several thousand MCTS rollouts indicate it's more likely to be 20/40/40%, it's backpropped to reduce A & increase B/C.)
training & computation time is massively reduced by all of the above:

Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 h. In comparison, AlphaGo Lee was trained over several months. After 72 h, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2h time controls and match conditions that were used in the man–machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 tensor processing units (TPUs)29, whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Fig. 1and Supplementary Information)...We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days.Over the course of training, 29 million games of selfplay were generated. Parameters were updated from 3.1 million minibatches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Fig. 6a. Games played at regular intervals throughout training are shown in Extended Data Fig. 5 and in the Supplementary Information

Nature summary:

Merging these functions into a single neural network made the algorithm both stronger and much more efficient, said Silver. It still required a huge amount of computing power — four of the specialized chips called tensor processing units, which Hassabis estimated to be US$25 million of hardware. But its predecessors used ten times that number. It also trained itself in days, rather than months. The implication is that “algorithms matter much more than either computing or data available”, said Silver.

(This is in line with the computation/training improvements outlined by Hassabis in post-Ke Jie talks.)
- sample-efficiency: one point that seems to be missing from the usual 'Zero reinvents all human Go research in a few days' is that Zero is also doing this at apparently much greater sample-efficiency than previous AGs, and, of course, humanity. Go pros start playing in early childhood full time in special-purpose Go academies and will play possibly 100s of thousands of games total in a lifetime, Go is played by tens of millions of people, etc, and the sum total of that expertise is passed in ~4m games by Zero.
Zero rediscovers many of the usual joseki... and discards some of them after training for a while with them: Extended Figure 2 pg11. For example, Knights Move Pincer is discovered ~40h, skyrockets in popularity, and then is discarded and largely disappears by 65h. Presumably the self-play learned a weakness in it.
Fan Hui is still working with DM according to Wired, and analyzing particularly good moves:

Google has already used the company’s algorithms to cut data-center cooling bills. The recent financial filing listed the company’s first revenues, £40 million from services provided to other parts of Alphabet. Hassabis says the ideas in AlphaGo Zero could be applied to work on understanding climate, or proteins in the body. Machine-learning research from Google and others has also shown promise for extracting more ad dollars from consumers. AlphaGo Zero is also set to give back to the community DeepMind's project has shaken up. New ideas from its predecessors like that jaw-dropping move against Lee Sedol have invigorated the game. Fan Hui, the first professional player beaten by AlphaGo, now works with DeepMind and says AlphaGo Zero can inject further creativity into one of the world’s oldest board games. “Its games look a lot like human play but it also feels more free, perhaps because it is not limited by our knowledge,” Fan says. He’s already christened one tactic it came up the “zero move,” such is its striking power in the early stages of a game. “We have never seen a move like this, even from AlphaGo," he says.
from the Silver/Schrittweiser AmA:
- the AlphaGo program is done:
  
  AlphaGo is retired! That means the people and hardware resources have moved onto other projects on the long, winding road to AI :)...We have stopped active research into making AlphaGo stronger. But it's still there as a research test-bed for DeepMinders to experiment with new ideas and algorithms... [on what would happen if continued to train since it hadn't converged fully] I guess it's a question of people and resources and priorities! If we'd run for 3 months, I guess you might still be wondering what would happen after, say, 6 months :)
  
  As we said in May, the Future of Go Summit was our final match event with AlphaGo.
- there will not be release of the codebase (or, presumably, trained models):
  
  We've open sourced a lot of our code in the past, but it's always a complex process. And in this case, unfortunately, it's a prohibitively intricate codebase.
- there will probably be a release of the 'teaching' tool:
  
  When are you planning to release the Go tool that Demis Hassabis announced at Wuzhen?
  
  Work is progressing on this tool as we speak. Expect some news soon : )

gwern · 2017-12-06T04:08:56+00:00

Followup paper is amazing (discussion)

AG Zero can be used to learn chess and defeats Stockfish after 4 hours of training: "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm", Silver et al 2017 https://arxiv.org/abs/1712.01815

The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

vee3my · 2017-10-22T11:52:05+00:00

Would anyone be interested in trying to reconstruct, at least in principle, what exactly alpha go zero is doing? I am reading the paper quite carefully and it is a bit lacking in details at times.

reinforcementlearning

MODERATORS

reinforcementlearning

MODERATORS

Welcome to Reddit,

Want to add to the discussion?