AlphaStar, the AI that reached GrandMaster level in StarCraft II, is a remarkable milestone demonstrating what deep reinforcement learning can achieve in complex Real-Time Strategy (RTS) games. However, the complexity of the game, of the algorithms and systems, and especially the tremendous amount of computation needed are major obstacles for the community to conduct further research in this direction. We propose a deep reinforcement learning agent, StarCraft Commander (SCC). With an order of magnitude less computation, it demonstrates top human performance, defeating GrandMaster players in test matches and top professional players in a live event. Moreover, it shows strong robustness to various human strategies and discovers novel strategies unseen in human play. In this paper, we share the key insights and optimizations on efficient imitation learning and reinforcement learning for the full game of StarCraft II.
StarCraft, one of the most difficult esports games, with a long-standing history of professional tournaments, has attracted generations of players and fans, and also intense attention in artificial intelligence research. Recently, Google’s
DeepMind announced AlphaStar, a grandmaster level AI in StarCraft II. In this paper, we introduce a new AI agent, named
TStarBot-X, that is trained under limited computation resources and can play competitively with expert human players.
TStarBot-X takes advantage of important techniques introduced in AlphaStar, and also benefits from substantial innovations
including new league training methods, novel multi-agent roles, rule-guided policy search, lightweight neural network
architecture, and importance sampling in imitation learning. We show that with limited computation resources, a faithful reimplementation of AlphaStar cannot succeed, and that the proposed techniques are necessary to ensure TStarBot-X’s
competitive performance. We reveal all technical details that are complementary to those mentioned in AlphaStar, showing
the most sensitive parts in league training, reinforcement learning and imitation
learning that affect the performance of the agents. Most importantly, this is an open-source study in which all code and resources (including the trained model parameters) are publicly accessible via this URL. We expect this study to be beneficial for future academic and industrial research on solving complex problems like StarCraft, and to provide a sparring partner for all StarCraft II players and
other AI agents.
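Of the techniques listed above, the importance-sampling idea for imitation learning is the most self-contained. The sketch below shows one plausible reading of it, not TStarBot-X's exact scheme: rare demonstrator actions in the human replays are up-weighted in the supervised loss so the network does not collapse onto the most frequent actions. The `class_freq` statistics and the inverse-frequency weighting are assumptions for illustration.

```python
import numpy as np

def weighted_il_loss(logits, actions, class_freq):
    """Cross-entropy imitation loss with inverse-frequency importance weights.

    logits: (batch, n_actions) network outputs; actions: (batch,) demo labels;
    class_freq: (n_actions,) empirical action frequencies in the replay set.
    Rare demonstrator actions get larger weights so they are not drowned out.
    """
    w = 1.0 / np.maximum(class_freq[actions], 1e-8)
    w /= w.mean()                                    # keep the loss scale stable
    z = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(actions)), actions]
    return float((w * nll).mean())
```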
Competitive Self-Play (CSP) based Multi-Agent Reinforcement Learning (MARL) has recently shown phenomenal breakthroughs. Strong AIs have been achieved for several benchmarks, including Dota 2, Honor of Kings, Quake III, and StarCraft II, to name a few. Despite these successes, MARL training is
extremely data-hungry, typically requiring billions (if not trillions) of frames from the environment during training in order to learn a high-performance agent. This poses non-trivial difficulties for researchers and engineers and prevents the application of MARL to a broader range of real-world problems. To address this issue, in this manuscript we describe a framework, referred to as TLeague, that aims at large-scale training and implements several mainstream CSP-MARL
algorithms. Training can be deployed on either a single machine or a cluster of hybrid machines (CPUs and GPUs), with standard Kubernetes supported in a cloud-native manner. TLeague achieves high throughput and reasonable scale-up when performing
distributed training. Thanks to the modular design, it is also easy to extend for solving other multi-agent problems
or implementing and verifying MARL algorithms. We present experiments over
StarCraft II, ViZDoom
and Pommerman to show the efficiency
and effectiveness of TLeague. The code is open-sourced and available at this URL.
Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments.
As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence
research, owing to its iconic and enduring status among the most difficult professional e-sports and its relevance to the
real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous
competitions, the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or
employed hand-crafted sub-systems. Despite these advantages, no previous agent has come close to matching the overall skill
of top StarCraft players.
We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable
to other complex domains: a multi-agent reinforcement learning algorithm that uses
data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies
[AlphaStar League], each represented by deep neural networks.
We evaluated our agent, AlphaStar, in the full game of StarCraft II, through a series of online games against human
players. AlphaStar was rated at Grandmaster level for all three StarCraft races and above 99.8% of officially ranked human players.
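The abstract does not spell out how opponents are chosen inside the league, but the AlphaStar paper describes prioritized fictitious self-play (PFSP): sample league members in proportion to how hard they are for the current learner, using a weighting such as f_hard(x) = (1 - x)^p. A minimal sketch of that weighting (the win-rate estimates and the choice p = 2 here are illustrative):

```python
import numpy as np

def pfsp_weights(win_rates, p=2.0):
    """Prioritized fictitious self-play: opponents the learner loses to
    most often are sampled most often. win_rates[i] is the learner's
    estimated win probability against league member i."""
    w = (1.0 - np.asarray(win_rates, dtype=float)) ** p
    return w / w.sum()

# The learner beats opponent A 90% of the time but opponent B only 40%:
print(pfsp_weights([0.9, 0.4]))   # -> B is sampled ~97% of the time
```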
Artificial teamwork: Artificially intelligent agents are getting better and better at two-player games,
but most real-world endeavors require teamwork. Jaderberg et al. designed a computer program that excels at playing
the video game Quake III Arena in Capture the Flag mode, where two
multiplayer teams compete in capturing the flags of the opposing team. The agents were trained by playing thousands of
games, gradually learning successful strategies not unlike those favored by their human counterparts. Computer agents
competed successfully against humans even when their reaction times were slowed to match those of humans.
Reinforcement learning (RL) has shown great success in increasingly complex single-agent environments and two-player
turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and
compete with other agents. We used a tournament-style evaluation to demonstrate that an agent can achieve human-level
performance in a three-dimensional multiplayer first-person video game, Quake III Arena in Capture the Flag mode, using only pixels and game points scored as input. We used a two-tier
optimization process in which a population of independent RL agents are trained concurrently from thousands of parallel
matches on randomly generated environments. Each agent learns its own internal reward signal and rich representation of the
world. These results indicate the great potential of multiagent reinforcement
learning for artificial intelligence research.
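The "two-tier optimization" is population-based training: an inner RL loop per agent plus an outer evolutionary loop over the population. Below is a minimal sketch of one outer-loop generation; the agent objects with `params`/`hypers` attributes and the tournament-based `evaluate` callable are hypothetical stand-ins.

```python
import copy
import random

def pbt_generation(population, evaluate, frac=0.2):
    """One outer-loop step of population-based training: the bottom
    fraction of agents copies (exploits) a top agent's weights and
    perturbs (explores) its hyperparameters, e.g. internal reward weights."""
    ranked = sorted(population, key=evaluate, reverse=True)
    k = max(1, int(len(ranked) * frac))
    for loser in ranked[-k:]:
        winner = random.choice(ranked[:k])
        loser.params = copy.deepcopy(winner.params)               # exploit
        loser.hypers = {name: value * random.choice((0.8, 1.2))
                        for name, value in winner.hypers.items()}  # explore
    return ranked
```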
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage
computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law,
or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been
conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one
of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more
computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers
seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
…In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.
At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that
leveraged human understanding of the special structure of chess…A similar pattern of research progress was seen in computer
Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human
knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was
applied effectively at scale…In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human
knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were
more statistical in nature and did much more computation, based on hidden Markov models
(HMMs). Again, the statistical methods won out over the
human-knowledge-based methods…In computer vision…Modern deep-learning neural networks use only the notions of convolution
and certain kinds of invariances, and perform much better.
…We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter
lesson is based on the historical observations that (1) AI researchers have often tried to build knowledge into their
agents, (2) this always helps in the short term, and is personally satisfying to the researcher, but (3) in the long run it
plateaus and even inhibits further progress, and (4) breakthrough progress eventually arrives by an opposing approach based
on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely
digested, because it is success over a favored, human-centric approach.
Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the
proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic
evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely,
and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose
Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios:
agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in
evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus
encourages maximally inclusive evaluation—since there is no harm (computational cost aside) from including all available
tasks and agents.
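In the agent-vs-agent case, the evaluation data form an antisymmetric payoff matrix, and Nash averaging scores each agent against the maximum-entropy Nash equilibrium of the corresponding zero-sum meta-game. The sketch below finds a Nash mixture by plain linear programming; the paper specifically uses the maximum-entropy Nash, but a plain LP recovers some equilibrium and is enough to show the redundancy-invariance idea.

```python
import numpy as np
from scipy.optimize import linprog

def nash_mixture(A):
    """Nash mixture p for the symmetric zero-sum game with antisymmetric
    payoff A (A[i, j] = expected payoff to agent i vs. agent j):
    maximize v subject to (A^T p)_j >= v for all j, p a distribution."""
    n = A.shape[0]
    c = np.zeros(n + 1); c[-1] = -1.0            # minimize -v, i.e. maximize v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # encodes v - (A^T p)_j <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=np.concatenate([np.ones(n), [0.0]])[None, :],
                  b_eq=[1.0], bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]

# Agent 2 is an exact duplicate of agent 1; agent 0 beats both.
A = np.array([[ 0.0, 1.0, 1.0],
              [-1.0, 0.0, 0.0],
              [-1.0, 0.0, 0.0]])
p = nash_mixture(A)
print(p, A @ p)   # uniform averaging would inflate agent 0's mean payoff
                  # by duplicating a weak opponent; the Nash score does not
```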
We consider scenarios from the real-time strategy game StarCraft as new benchmarks for reinforcement learning algorithms. We propose micromanagement tasks, which present the problem
of the short-term, low-level control of army members during a battle. From a reinforcement learning point of view, these
scenarios are challenging because the state-action space is very large, and because there is no obvious feature
representation for the state-action evaluation function. We describe our approach to tackle the micromanagement scenarios
with deep neural network controllers from raw state features given by the game engine. In addition, we present a heuristic
reinforcement learning algorithm which combines direct exploration in the policy space and backpropagation. This
algorithm allows for the collection of traces for learning using deterministic policies, which appears much more efficient
than, for example, ε-greedy exploration. Experiments show that with this algorithm, we successfully learn non-trivial strategies for scenarios with armies of up to 15 agents, where both Q-learning and REINFORCE struggle.
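The contrast drawn here, collecting traces with deterministic policies explored directly in parameter space versus randomizing individual actions, can be made concrete. This is an illustrative contrast only, not the paper's exact zero-order algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy_action(q_values, eps=0.1):
    """Action-space exploration: each step independently takes a random
    action with probability eps, which scrambles long action sequences."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def perturbed_policy_params(theta, sigma=0.1):
    """Policy-space exploration: perturb the parameters once, then act
    deterministically (greedily) for the whole episode, so the collected
    trace reflects one coherent strategy."""
    return theta + sigma * rng.normal(size=theta.shape)
```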
We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are
discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent
approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of
the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and
various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output
dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in
that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses
attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net
(Ptr-Net). We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems—finding
planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem—using training examples
alone. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to
variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained
on. We hope our results on these tasks will encourage a broader exploration of neural learning for discrete problems.
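The pointer mechanism itself is small: at each decoder step, additive-attention scores over the encoder states are used directly (after a softmax) as the output distribution over input positions, rather than being used to blend the encoder states into a context vector. A minimal numpy sketch of one decoding step, with toy random parameters:

```python
import numpy as np

def pointer_step(d, E, W1, W2, v):
    """One Ptr-Net decoding step: u_j = v^T tanh(W1 e_j + W2 d),
    p = softmax(u). The distribution p is over the n input positions,
    so the output vocabulary automatically matches the input length."""
    u = v @ np.tanh(W1 @ E.T + (W2 @ d)[:, None])   # (n,) attention scores
    e = np.exp(u - u.max())                         # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
n, h = 5, 8                                # 5 input tokens, hidden size 8
E = rng.normal(size=(n, h))                # encoder states e_1..e_n
d = rng.normal(size=h)                     # current decoder state
W1, W2, v = (rng.normal(size=(h, h)), rng.normal(size=(h, h)),
             rng.normal(size=h))
print(pointer_step(d, E, W1, W2, v))       # pointer distribution over inputs
```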