We study the reinforcement learning problem of complex action control in the
Multi-player Online Battle Arena (MOBA) 1v1 games. This problem involves far
more complicated state and action spaces than those of traditional 1v1 games, such as Go and Atari, which makes it
very difficult to search for any policy with human-level performance. In this paper, we present a deep reinforcement learning framework to tackle this problem from the perspectives of both system and
algorithm. Our system features low coupling and high scalability, which enables efficient large-scale exploration. Our
algorithm incorporates several novel strategies: control dependency decoupling, action mask, target attention,
and dual-clip PPO, with which our proposed actor-critic network can be
effectively trained in our system. Tested on the MOBA game Honor of Kings, our AI
agent, called Tencent Solo, can defeat top professional human players in full 1v1 games.
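Two of those strategies are compact enough to sketch. Below is a minimal PyTorch illustration following the paper's description of dual-clip PPO and the action mask; the hyperparameter values `eps` and `c` are illustrative, and the paper's full actor-critic network and training system are of course far larger:

```python
import torch

def dual_clip_ppo_loss(ratio, advantage, eps=0.2, c=3.0):
    """Dual-clip PPO: the standard clipped surrogate, plus a lower bound
    of c * advantage (c > 1) when the advantage is negative, so a single
    very off-policy sample cannot dominate the gradient."""
    surr = torch.min(ratio * advantage,
                     torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
    dual = torch.max(surr, c * advantage)
    # Maximize the surrogate, i.e. minimize its negation.
    return -torch.where(advantage < 0, dual, surr).mean()

def masked_logits(logits, legal_mask):
    """Action mask: drive the logits of currently illegal actions to
    -inf so the policy never samples them during exploration."""
    return logits.masked_fill(~legal_mask, float('-inf'))
```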
At OpenAI, we’ve used
the multiplayer video game Dota 2 as a research platform for general-purpose AI systems. Our Dota 2 AI, called
OpenAI Five, learned by playing over 10,000 years of games against itself. It demonstrated the ability to achieve
expert-level performance, learn human-AI cooperation, and operate at internet scale.
[OpenAI final report on OA5: timeline, training curve, index of blog posts.]
“Dota 2 with Large Scale Deep Reinforcement Learning”, Christopher Berner, Greg
Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris
Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson… (2019-12-13):
On April 13th, 2019, OpenAI Five became the first AI system to
defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as
long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become
increasingly central to more capable AI systems. OpenAI Five leveraged existing
reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We
developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five
demonstrates that self-play reinforcement learning can achieve superhuman
performance on a difficult task.
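The abstract's scale numbers (roughly 2 million frames consumed every 2 seconds) imply an asynchronous split between rollout actors and a central optimizer. The sketch below is a hypothetical illustration of that general pattern, not OpenAI's actual Rapid system; `policy.act`, `policy.update`, and `policy.broadcast_weights` are assumed stubs:

```python
import queue

batch_queue = queue.Queue(maxsize=64)   # experience waiting for the learner

def rollout_worker(env, policy, frames_per_batch=4096):
    """Hypothetical actor: plays with a recent policy snapshot and ships
    fixed-size batches of frames to the central optimizer."""
    obs = env.reset()
    while True:
        batch = []
        for _ in range(frames_per_batch):
            action = policy.act(obs)                  # assumed stub API
            obs, reward, done, _ = env.step(action)
            batch.append((obs, action, reward, done))
            if done:
                obs = env.reset()
        batch_queue.put(batch)                        # blocks if learner lags

def learner(policy, batches_per_step=512):
    """Hypothetical learner: aggregates many workers' batches into one
    large gradient step, then publishes fresh weights for the actors."""
    while True:
        big_batch = [batch_queue.get() for _ in range(batches_per_step)]
        policy.update(big_batch)                      # e.g. a PPO step
        policy.broadcast_weights()                    # actors pick these up
```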
We demonstrate that models trained only in simulation can be used to solve a manipulation problem of unprecedented
complexity on a real robot. This is made possible by two key components: a novel algorithm, which we call automatic
domain randomization (ADR), and a robot platform built for machine
learning. ADR automatically generates a distribution over randomized
environments of ever-increasing difficulty. Control policies and vision state estimators trained with ADR exhibit vastly improved sim2real transfer. For control policies, memory-augmented
models trained on an ADR-generated distribution of environments show clear signs
of emergent meta-learning at test time. The combination of ADR with our custom
robot platform allows us to solve a Rubik’s cube with a humanoid robot hand, which involves both control and state
estimation problems. Videos summarizing our results are available online.
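The core ADR loop is simple to sketch. The Python below is a minimal illustration under assumed thresholds, step size, buffer length, and a scalar score per episode; the real system tracks many randomization parameters and runs distributed:

```python
import random
from collections import defaultdict

class ADR:
    """Minimal sketch of automatic domain randomization (ADR):
    per-parameter randomization ranges that widen where the agent
    succeeds and narrow where it fails."""

    def __init__(self, init_ranges, step=0.05, t_low=0.2, t_high=0.8, buf=10):
        self.ranges = {k: list(v) for k, v in init_ranges.items()}  # param -> [low, high]
        self.perf = defaultdict(list)   # (param, side) -> recent episode scores
        self.step, self.t_low, self.t_high, self.buf = step, t_low, t_high, buf

    def sample_env(self):
        """Sample each parameter uniformly in its range, but pin one
        randomly chosen parameter to a boundary so performance at that
        boundary can be measured."""
        params = {k: random.uniform(lo, hi) for k, (lo, hi) in self.ranges.items()}
        probe = random.choice(list(self.ranges))
        side = random.choice([0, 1])                 # 0 = low bound, 1 = high
        params[probe] = self.ranges[probe][side]
        return params, (probe, side)

    def update(self, key, score):
        """Push a boundary outward when the agent handles it well; pull
        it back in when the agent struggles, so difficulty rises with
        competence."""
        self.perf[key].append(score)
        if len(self.perf[key]) < self.buf:
            return
        avg = sum(self.perf[key]) / len(self.perf[key])
        self.perf[key].clear()
        param, side = key
        outward = 1 if side == 1 else -1             # direction that widens
        if avg >= self.t_high:
            self.ranges[param][side] += outward * self.step
        elif avg <= self.t_low:
            self.ranges[param][side] -= outward * self.step
        # (A full implementation would also clamp so low <= high.)
```

Each training episode would then call `params, key = adr.sample_env()`, run under those randomizations, and report back with `adr.update(key, score)`.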
We’ve trained a pair of neural networks to solve the Rubik’s Cube with a human-like robot hand. The neural networks are
trained entirely in simulation, using the same reinforcement learning code as
OpenAI Five paired with a new technique called Automatic Domain Randomization
(ADR). The system can handle situations it never saw during training, such
as being prodded by a stuffed giraffe. This shows that reinforcement learning isn’t just a tool for virtual tasks, but can
solve physical-world problems requiring unprecedented dexterity.
…Since May 2017, we’ve been trying to train a human-like robotic hand to solve the Rubik’s Cube. We set this goal
because we believe that successfully training such a robotic hand to do complex manipulation tasks lays the foundation for
general-purpose robots. We solved the Rubik’s Cube in simulation in July 2017. But as of July 2018, we could only
manipulate a block on the robot. Now, we’ve reached our initial goal. Solving a Rubik’s Cube one-handed is a challenging
task even for humans, and it takes children several years to gain the dexterity required to master it. Our robot still
hasn’t perfected its technique though, as it solves the Rubik’s Cube 60% of the time (and only 20% of the time for a
maximally difficult scramble).
Perhaps the most ambitious scientific quest in human history is the creation of general artificial intelligence, which
roughly means AI that is as smart or smarter than humans. The dominant approach in the machine learning community is to
attempt to discover each of the pieces required for intelligence, with the implicit assumption that some future group will
complete the Herculean task of figuring out how to combine all of those pieces into a complex thinking machine. I call this
the “manual AI approach”.
This paper describes another exciting path that ultimately may be more successful at producing general AI. It is based
on the clear trend in machine learning that hand-designed solutions eventually are replaced by more effective, learned
solutions. The idea is to create an AI-generating algorithm (AI-GA), which automatically learns how to produce general AI.
Three Pillars are essential for the approach: (1) meta-learning architectures, (2) meta-learning the learning algorithms
themselves, and (3) generating effective learning environments.
I argue that either approach could produce general AI first, and both are scientifically worthwhile irrespective of
which is the fastest path. Because both are promising, yet the ML community is currently committed to the manual approach,
I argue that our community should increase its research investment in the AI-GA approach. To encourage such research, I
describe promising work in each of the Three Pillars. I also discuss AI-GA-specific safety and ethical considerations.
Because it may be the fastest path to general AI and because it is inherently scientifically interesting to
understand the conditions in which a simple algorithm can produce general AI (as happened on Earth where Darwinian
evolution produced human intelligence), I argue that the pursuit of AI-GAs should be considered a new grand challenge of
computer science research.
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage
computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law,
or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been
conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one
of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more
computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers
seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of
computation.
…In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.
At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that
leveraged human understanding of the special structure of chess…A similar pattern of research progress was seen in computer
Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human
knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was
applied effectively at scale…In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human
knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were
more statistical in nature and did much more computation, based on hidden Markov models
(HMMs). Again, the statistical methods won out over the
human-knowledge-based methods…In computer vision…Modern deep-learning neural networks use only the notions of convolution
and certain kinds of invariances, and perform much better.
…We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter
lesson is based on the historical observations that (1) AI researchers have often tried to build knowledge into their
agents, (2) this always helps in the short term, and is personally satisfying to the researcher, but (3) in the long run it
plateaus and even inhibits further progress, and (4) breakthrough progress eventually arrives by an opposing approach based
on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely
digested, because it is success over a favored, human-centric approach.
Reinforcement learning algorithms can train agents that solve problems in complex, interesting environments. Normally,
the complexity of the trained agent is closely related to the complexity of the environment. This suggests that a highly
capable agent requires a complex environment for training. In this paper, we point out that a competitive multi-agent
environment trained with self-play can produce behaviors that are far more complex than the environment itself. We also
point out that such environments come with a natural curriculum, because for any skill level, an environment full of agents
of this level will have the right level of difficulty. This work introduces several competitive multi-agent environments
where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and
interesting skills, even though the environments themselves are relatively simple. The skills include behaviors such as
running, blocking, ducking, tackling, fooling opponents, kicking, and defending using
both arms and legs. A highlight of the learned behaviors can be found here: https://goo.gl/eR7fbX
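The natural-curriculum argument can be made concrete with a toy self-play loop in which the opponent pool is a set of frozen past versions of the learner; `agent.clone`, `agent.update`, and the environment API below are hypothetical stubs:

```python
import random

def train_self_play(agent, make_env, iters=10_000, snapshot_every=100):
    """Toy competitive self-play loop: because opponents are past
    versions of the learner, the environment offers roughly matched
    difficulty at every skill level."""
    opponents = [agent.clone()]                # frozen snapshots
    for it in range(iters):
        opponent = random.choice(opponents)    # sample a past self
        env = make_env(agent, opponent)
        trajectory = env.rollout()             # play one match
        agent.update(trajectory)               # any policy-gradient update
        if it % snapshot_every == 0:
            opponents.append(agent.clone())    # pool keeps pace with skill
    return agent
```

Sampling from the whole pool, rather than always playing the latest version, helps keep the learned policy robust against older strategies instead of cycling.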
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data
through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent.
Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function
that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization
(PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a
collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that
PPO outperforms other online policy gradient methods, and overall strikes a
favorable balance between sample complexity, simplicity, and wall-time.
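The "surrogate" in question is the clipped objective L^CLIP(θ) = E_t[min(r_t(θ)Â_t, clip(r_t(θ), 1−ε, 1+ε)Â_t)]; a minimal PyTorch sketch, with ε = 0.2 as in the paper's defaults:

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate objective, negated for gradient descent."""
    # Probability ratio r_t = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(log_prob_new - log_prob_old)
    # The min removes any incentive to push the ratio outside
    # [1 - eps, 1 + eps], which is what makes multiple epochs of
    # minibatch updates on the same data safe.
    surrogate = torch.min(ratio * advantage,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    return -surrogate.mean()
```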