[Self-play videos; blog/2 (discussion; AmA)] Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional e-sports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players.
We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies [AlphaStar League], each represented by deep neural networks.
We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.
…In order to train AlphaStar, we built a highly scalable distributed training setup using Google’s v3 TPUs that supports a population of agents learning from many thousands of parallel instances of StarCraft II. The AlphaStar league was run for 14 [wallclock] days, using 16 TPUs for each agent. During training, each agent experienced up to 200 years of real-time StarCraft play. The final AlphaStar agent consists of the components of the Nash distribution of the league—in other words, the most effective mixture of strategies that have been discovered—that run on a single desktop GPU.
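To make "the Nash distribution of the league" concrete, the following is a minimal sketch that approximates a Nash mixture for a toy three-agent league. The excerpt does not say how the distribution was computed, so fictitious play is swapped in here purely for illustration. The payoff matrix (a rock-paper-scissors cycle) and the names `fictitious_play` and `toy_payoff` are hypothetical.

```python
def fictitious_play(payoff, iterations=10_000):
    """Approximate a symmetric Nash mixture of a zero-sum matrix game.

    payoff[i][j] is the payoff to strategy i when it meets strategy j.
    Illustrative only: not the procedure used in the source.
    """
    n = len(payoff)
    counts = [0] * n
    counts[0] = 1  # arbitrary starting strategy
    for _ in range(iterations):
        # Expected (unnormalised) payoff of each strategy against the
        # empirical mixture played so far.
        values = [sum(payoff[i][j] * counts[j] for j in range(n))
                  for i in range(n)]
        # Play a best response and add it to the empirical mixture.
        best = max(range(n), key=values.__getitem__)
        counts[best] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Toy league: agent 0 beats agent 2, agent 1 beats agent 0, agent 2 beats agent 1.
toy_payoff = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
nash_mixture = fictitious_play(toy_payoff)
```

In this cyclic toy league no single agent dominates, so the mixture spreads weight roughly evenly across all three agents. In a real league the payoff entries would come from measured win rates between players.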
…In StarCraft, each player chooses one of 3 races (Terran, Protoss or Zerg), each with distinct mechanics. We trained the league using 3 main agents (one for each StarCraft race), 3 main exploiter agents (one for each race), and 6 league exploiter agents (two for each race). Each agent was trained using 32 third-generation tensor processing units (TPUv3s)²³ over 44 [wallclock] days. During league training almost 900 distinct players were created.
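The league composition described above can be tallied in a few lines. This is a sketch of the arithmetic only. The total TPU count is a derived figure that assumes all training agents run concurrently, which the excerpt does not state explicitly. The variable names are hypothetical.

```python
RACES = ("Terran", "Protoss", "Zerg")

# Training agents per race, as listed in the excerpt.
AGENTS_PER_RACE = {"main": 1, "main_exploiter": 1, "league_exploiter": 2}

TPUS_PER_AGENT = 32  # TPUv3 devices per training agent

# One entry per training agent: (race, role, index within role).
league = [
    (race, role, idx)
    for race in RACES
    for role, count in AGENTS_PER_RACE.items()
    for idx in range(count)
]

num_training_agents = len(league)

# Assumption: every agent trains at the same time on its own devices.
total_tpus = num_training_agents * TPUS_PER_AGENT

agents_by_role = {
    role: sum(1 for _, r, _ in league if r == role)
    for role in AGENTS_PER_RACE
}
```

This gives 12 training agents (3 main, 3 main exploiter, 6 league exploiter) and, under the concurrency assumption, 384 TPUv3 devices. The "almost 900 distinct players" figure is much larger than the 12 training agents, so each training agent contributes many players over the course of training.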
…For every training agent in the League, we run 16,000 concurrent StarCraft II matches and 16 actor tasks (each using a TPU v3 device with 8 TPU cores²³) to perform inference. The game instances progress asynchronously on preemptible CPUs (roughly equivalent to 150 processors with 28 physical cores each), but requests for agent steps are batched together dynamically to make efficient use of the TPU. Utilising TPUs for batched inference provides large efficiency gains over prior work.¹⁴,²⁸
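The idea of batching step requests dynamically can be sketched as follows: games submit requests whenever they are ready, and the inference server runs one forward pass over whatever has queued up, capped at a maximum batch size. This is a simplified, single-threaded illustration and not the system described in the source. `DynamicBatcher`, `toy_policy`, and the batch cap of 4 are all hypothetical.

```python
from collections import deque


def toy_policy(observations):
    """Stand-in for one batched neural-network forward pass (hypothetical)."""
    return [obs % 3 for obs in observations]


class DynamicBatcher:
    """Collects step requests from many games and serves them in batches."""

    def __init__(self, policy, max_batch_size):
        self.policy = policy
        self.max_batch_size = max_batch_size
        self.pending = deque()  # (game_id, observation) pairs awaiting inference
        self.batch_sizes = []   # size of each batch served so far

    def request_step(self, game_id, observation):
        """Called by a game instance whenever it needs its next action."""
        self.pending.append((game_id, observation))

    def serve(self):
        """Run one forward pass over the queued requests, up to the size cap."""
        n = min(len(self.pending), self.max_batch_size)
        if n == 0:
            return {}
        batch = [self.pending.popleft() for _ in range(n)]
        actions = self.policy([obs for _, obs in batch])
        self.batch_sizes.append(n)
        return {game_id: action for (game_id, _), action in zip(batch, actions)}


# Six games ask for a step before the server gets a chance to run.
batcher = DynamicBatcher(toy_policy, max_batch_size=4)
for game_id in range(6):
    batcher.request_step(game_id, observation=game_id * 10)

first = batcher.serve()   # serves games 0-3 in one pass
second = batcher.serve()  # serves the remaining games 4-5
```

The batch size varies with how many games happen to be waiting, which is what lets slow and fast game instances share one accelerator without blocking one another.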
Actors send sequences of observations, actions, and rewards over the network to a central 128-core TPU learner worker, which updates the parameters of the training agent. The received data is buffered in memory and replayed twice. The learner worker performs large-batch synchronous updates. Each TPU core processes a mini-batch of 4 sequences, for a total batch size of 512. The learner processes about 50,000 agent steps per second. The actors update their copy of the parameters from the learner every 10 seconds.
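The figures quoted above can be cross-checked with simple arithmetic. The sketch below only uses numbers from the excerpt. The learner device count assumes the learner uses the same 8-core TPU v3 devices as the actors, which the text does not state directly. The constant names are hypothetical.

```python
# Figures quoted in the excerpt (per training agent).
LEARNER_TPU_CORES = 128
SEQUENCES_PER_CORE = 4
AGENT_STEPS_PER_SECOND = 50_000
PARAM_REFRESH_SECONDS = 10
ACTOR_TASKS = 16            # one TPU v3 device each
CONCURRENT_MATCHES = 16_000
CORES_PER_TPU_DEVICE = 8

# Total batch size: one mini-batch of 4 sequences on each learner core.
batch_size = LEARNER_TPU_CORES * SEQUENCES_PER_CORE

# Assumption: the learner runs on the same 8-core devices as the actors.
learner_devices = LEARNER_TPU_CORES // CORES_PER_TPU_DEVICE
devices_per_agent = ACTOR_TASKS + learner_devices

# Agent steps the learner processes between two actor parameter refreshes.
steps_between_refreshes = AGENT_STEPS_PER_SECOND * PARAM_REFRESH_SECONDS

# Average number of game instances feeding each actor task.
matches_per_actor = CONCURRENT_MATCHES // ACTOR_TASKS
```

The batch size comes out to 512, matching the quoted total. Under the stated assumption, 16 actor devices plus 16 learner devices give 32 devices per agent, which is consistent with the 32 TPUv3s per agent quoted earlier. Actors act on parameters that lag the learner by up to roughly 500,000 processed agent steps.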