
[–]gwern[S] 22 points23 points  (2 children)

The long-awaited update on OpenAI's DoTA program, after their 1v1 success (1/2). Notes:

  • this is incredibly simple. PPO, 1000-unit LSTMs, some basic reward shaping, a few crude randomizing heuristics for exploration, some restrictions on the gameplay format, some tweaking of the discount rate (but still absurdly short time horizons!), some historical checkpointing for self-play (toy sketch at the end of this comment), and that's about it. No tree search, no deep environment models, no intrinsic curiosity, no hierarchical RL or learning abstracted MDPs, no imitation or transfer learning, no special multi-agent losses or decentralized algorithms, essentially no communication between the agents, etc. The sample efficiency looks horrible (180 years of gameplay per day, probably a few weeks or a month per iteration given that the 1v1 agent took ~2 weeks wallclock; EDIT: Brockman says 5v5 takes 19 wallclock days, so ~3240 years of gameplay and $1.1m to train?), but with the huge compute behind it, it works! It learns long-range planning and smooth coordination between 5 allied NNs.

    • reward shaping may not be particularly necessary, as end-to-end 1/0 losses also train the original 1v1 bot reasonably well. Given how expensive it is to train, though, it probably makes the most sense for OpenAI to leave simplification to the end, like AlphaGo v1 vs the final Zero, once everything is understood.
  • interesting comment about the cost of that compute from a Googler:

    The 256 [non-preemptible] P100 optimizers are less than $400/hr. You can rent 128000 preemptible vcpus for another $1280/hr. Toss in some more support GPUs and we're at maybe $2500/hr all in. That sounds like a lot, until you realize that some of these results ran for just a weekend.

  • Rapid architecture looks a lot like DM's Impala (i.e. a few GPUs doing training with lots of CPU workers doing async rollouts), leading me to wonder if the DoTA PPO is suffering from stale gradients and would benefit from off-policy corrections

  • good performance against human amateurs, even taking 2/3 games with a 99th percentile semi-pro team

  • troubling from a safety perspective: aside from demonstrating, as AlphaGo did, that you can get big capability jumps in small wallclock time (vastly less than limited, serial humans require) via parallelization, it shows that, as Karpathy has noted, 'deep learning wants to work': you can reach human (or perhaps eventually superhuman) performance despite serious "delusions", logic bugs, or outright crashes in the code:

    [Comparison of learning curves before and after bugfixes, showing how fixing bugs increases learning speed.] We’re still fixing bugs. The chart shows a training run of the code that defeated amateur players, compared to a version where we simply fixed a number of bugs, such as rare crashes during training, or a bug which resulted in a large negative reward for reaching level 25. It turns out it’s possible to beat good humans while still hiding serious bugs!

  • next match will be streamed 28 July 2018; pro matches at The International 20-25 August 2018.

Overall, probably the most interesting part for DRL is the fact that a PPO LSTM with enough brute force is capable of learning long-range planning. Like Ape-X, this reinforces the observation that brute force with a simulator can take us quite a ways... (And simulators can be learned; see all the recent work on deep environment models.)
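
As promised above, a toy sketch of the historical-checkpointing-for-self-play idea (my own illustration, not OpenAI's code; the 80/20 current-vs-past mixing ratio is the figure I remember from the blog post, everything else is made up):

```python
import random

class OpponentPool:
    """Toy self-play opponent pool with historical checkpoints.

    Playing mostly against the current policy drives improvement, while
    mixing in past snapshots helps avoid strategy collapse / cycling.
    """
    def __init__(self, past_fraction=0.2):    # ~80/20 current vs. past selves
        self.snapshots = []
        self.past_fraction = past_fraction

    def save(self, params):
        # periodically checkpoint the learner's parameters
        self.snapshots.append(params)

    def sample_opponent(self, current_params):
        if self.snapshots and random.random() < self.past_fraction:
            return random.choice(self.snapshots)   # an older self
        return current_params                      # the latest self
```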


[–]wassname 1 point2 points  (1 child)

Nice summary, thanks.

leading me to think they may be suffering from stale gradients

By stale gradients you mean that they collected too much training data, so some of it becomes irrelevant/off-policy before the whole rollout has been sampled? Another way of saying it is stale experience that's no longer relevant to the model's decisions.

Interestingly (for me :p) OpenAI has added some metrics to their baselines ppo2 repository that seem aimed at diagnosing this (added 7 months ago). They have approxkl and clipfrac, which measure how far the policy has drifted and how much of the minibatch is being clipped by PPO's importance sampling. So presumably they would know.
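
Roughly (a numpy sketch from memory, not the actual ppo2 TensorFlow code), those diagnostics look like this:

```python
import numpy as np

def ppo_diagnostics(logp_old, logp_new, clip_range=0.2):
    """Approximate versions of the approxkl / clipfrac metrics logged by ppo2.

    logp_old: log pi_theta_old(a|s) recorded when the rollout was collected
    logp_new: log pi_theta(a|s) under the policy currently being optimized
    """
    ratio = np.exp(logp_new - logp_old)                     # importance ratio
    approx_kl = 0.5 * np.mean((logp_new - logp_old) ** 2)   # crude KL estimate
    # fraction of samples where the surrogate is being clipped, i.e. where
    # the data has drifted far enough off-policy for the clip to bite
    clip_frac = np.mean(np.abs(ratio - 1.0) > clip_range)
    return approx_kl, clip_frac
```

If clipfrac climbs within an epoch, that's exactly the "data has gone stale" signal being discussed here.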

Overall, probably the most interesting part for DRL is the fact that PPO LSTM with enough brute force is capable of learning long-range planning

Agreed! For anyone considering spending 1 million dollars on training, the decision looks completely different once you have a few examples showing it has a decent chance of working.

[–]gwern[S] 2 points3 points  (0 children)

By stale gradients you mean that they collected too much training data, so some of it becomes irrelevant/off-policy before the whole rollout has been sampled?

Yes. Since they have so many GPUs/CPUs, that implies everything is asynchronous: the GPUs are constantly training on new experiences from the CPUs, distributing their last snapshot to the CPUs, while swapping gradients with the other GPU masters; a stop-the-world synchronization would avoid staleness but destroy throughput as hundreds of thousands of nodes all start waiting on a master node.
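
If you did want to correct for that staleness without stop-the-world syncing, one option (purely a sketch on my part, loosely in the spirit of Impala's V-trace truncation, not anything OpenAI describes using) is to down-weight stale samples with truncated importance ratios:

```python
import numpy as np

def truncated_is_weights(logp_behaviour, logp_current, rho_bar=1.0):
    """Per-sample weights for data generated by a stale policy snapshot.

    logp_behaviour: log-probs under the actor snapshot that produced the data
    logp_current:   log-probs under the learner's current policy
    """
    rho = np.exp(logp_current - logp_behaviour)  # importance ratio per step
    return np.minimum(rho, rho_bar)              # truncate to bound variance

# Usage: multiply each sample's policy-gradient / value-target contribution
# by these weights, so stale experience is discounted rather than trusted fully.
```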

[–]LazyOptimist 2 points3 points  (3 children)

There are quite a few impressive things about this. The first that stands out is that it's basically just PPO + LSTM + ~10^2 PetaFLOP/s-days. From what I gathered, it seems like the only trick they're using is some light reward shaping; the rest is engineering to operate at that scale. The second is that it works on such a large time horizon: they're only using BPTT for 16 time steps, which is just over 2 seconds of gameplay. So seeing the agent perform actions that pay off only several minutes later is impressive. The third is the parameter count of the LSTM relative to what it achieves. The LSTM only has 2^10 hidden units, which means they're only optimizing a few million parameters, and the amount of possible behaviours the bots can express is somewhat limited, yet this is enough to deal with a massive action space and operate successfully in a 5v5 game.
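
For concreteness, a minimal sketch of what 16-step truncated BPTT over a long rollout amounts to (illustrative only, not their training code):

```python
import numpy as np

def bptt_windows(rollout, chunk_len=16):
    """Split one rollout into truncated-BPTT windows.

    The LSTM hidden state is carried forward across windows, so the agent
    still conditions on older context, but gradients only flow back
    chunk_len steps (~2 seconds of game time at their tick rate).
    """
    for start in range(0, len(rollout), chunk_len):
        # in a real trainer, the hidden state entering each window would be
        # detached from the graph so backprop stops at the window boundary
        yield rollout[start:start + chunk_len]

rollout = np.arange(256)                      # toy 256-step trajectory
print(sum(1 for _ in bptt_windows(rollout)))  # -> 16 windows
```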

However, I'm a bit disappointed that they haven't used anything except brute force to make this work, especially regarding long time horizons. I don't know if it's because they tried a bunch of different ideas and the penalty to scaling outweighed the benefit, or if they just want to do this without any tricks, but I think that not addressing the time-horizon problem head-on might be a bottleneck for them. Compared to the previous agent, it looks like it operates every 4 ticks vs every 3 ticks; I wonder how many ticks they can skip before discretizing time starts to hurt performance.

I'm also wondering if the main thing making this work is the reward shaping. As described in the RUDDER paper, you get "exponential" speed-ups in learning when you reshape the reward properly (although they don't claim that's the case for policy gradient methods). If that's the case, then I suspect that if you took away the reward shaping, getting this to work would be intractable.

[–]gwern[S] 1 point2 points  (0 children)

As described in the RUDDER paper, you get "exponential" speed-ups in learning when you reshape the reward properly (although they don't claim that's the case for policy gradient methods). If that's the case, then I suspect that if you took away the reward shaping, getting this to work would be intractable.

The part about the binary reward still working reasonably well suggests that the reward shaping isn't contributing that much to the DotA agent.

[–]wassname 1 point2 points  (0 children)

although they don't claim that's the case for policy gradient methods

The section on Atari games, and the code they released, are PPO-based (a policy gradient method).

[–]abstractcontrol 1 point2 points  (0 children)

RUDDER doesn't reshape rewards; it redistributes them. The difference is that reward shaping fundamentally changes the cost function, whereas reward redistribution just moves around when the rewards happen: it does not change the optimal policy in any way.
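
A toy contrast (my own sketch, not RUDDER's actual method, which learns the per-step credit with a return predictor):

```python
import numpy as np

def redistribute_return(rewards, credit):
    """Move a sparse episode return onto earlier, credit-bearing steps.

    Unlike adding shaping terms, this keeps the total return of every
    episode identical, so the optimal policy cannot change.
    """
    credit = credit / credit.sum()     # normalized per-step credit assignment
    return rewards.sum() * credit      # same sum, much earlier learning signal

rewards = np.array([0., 0., 0., 0., 1.])       # sparse win/loss-style reward
credit  = np.array([0.1, 0.4, 0.3, 0.1, 0.1])  # e.g. from a return predictor
assert np.isclose(redistribute_return(rewards, credit).sum(), rewards.sum())
```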

[–]yazriel0 2 points3 points  (1 child)

Great summary, thank you.

Here is a key comment from HN where they explain how "linear variance of the gradient" (i.e. linear in the number of action dimensions) makes the learning time feasible:

https://news.ycombinator.com/item?id=17392903

[–]wassname 0 points1 point  (0 children)

There's now a correction from the author (ilyasut) saying the variance of the gradient actually increases exponentially with the difficulty of exploration, but they stabilised it with self-play and the shaped reward.

If I'm interpreting his statement correctly.

[–][deleted]  (2 children)

[deleted]

    [–]gwern[S] 5 points6 points  (1 child)

    Is this using Experience Replay?

    No. Experience replay requires off-policyness and PPO is a standard on-policy algorithm. It can't use experience replay in any useful way.

    How often is the NN updated?

    They don't say specifically, but you can get an idea by looking at DM's papers on highly parallel DRL architectures like Ape-X or Impala (which, as I mentioned, are similar to Rapid). Since PPO is on-policy, you can't afford to train on samples which are more than a few gradient steps old, and you want to keep all actors using the latest policy possible, so presumably the updates are generally the equivalent of every minute or less. (Human DoTA games run >20 minutes, but these are accelerated barebones games without rendering any graphics, so they can tick through each game a lot faster than realtime.)

    the AlphaGos plays with tree search and updates the NN once per experience

    All the AGs I know of have minibatches much larger than 1. Usually a couple thousand, I think.
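
    To make the on-policy cadence above concrete, the training loop is roughly shaped like this (purely illustrative toy, not Rapid):

```python
import numpy as np

# Toy on-policy cadence: data is consumed within a few optimizer epochs of
# being generated, then the fresh parameters go back to every actor.
def collect_rollouts(params, n_samples=64, seed=0):
    # stand-in for thousands of CPU actors playing accelerated, render-free games
    return np.random.default_rng(seed).normal(params.mean(), 1.0, n_samples)

def ppo_update(params, batch, epochs=3):
    # stand-in for a few epochs of minibatch SGD on the PPO objective
    for _ in range(epochs):
        params = params + 0.01 * (batch.mean() - params)
    return params

params = np.zeros(1)
for iteration in range(5):
    batch = collect_rollouts(params, seed=iteration)  # on-policy for these params
    params = ppo_update(params, batch)                # used within a few steps
    # batch is then discarded (no replay); params are broadcast to all actors
```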

    [–]ForeskinLamp 1 point2 points  (0 children)

    PPO is a standard on-policy algorithm

    I don't believe this is correct. From Algorithm 1 in the paper, the trajectory rollout is done under theta_old, but the policy update is done on theta, which would make it off-policy. If you multiply the loss function by pi_theta / pi_theta_old, you end up with an off-policy actor-critic algorithm that uses the GAE update instead of the Q-function. The clipped objective ensures that the gradient update doesn't take large steps, because in general A_{theta_old} != A_{theta} (I believe the Off-PAC paper makes this point).
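
    For reference, the clipped surrogate being discussed, as a numpy sketch with hypothetical variable names (the ratio is exactly pi_theta / pi_theta_old, applied to a GAE advantage estimate):

```python
import numpy as np

def ppo_clip_surrogate(logp_old, logp_new, advantages, clip=0.2):
    """Standard PPO clipped surrogate (sketch).

    Rollouts come from theta_old, the update is taken on theta, and the
    ratio pi_theta / pi_theta_old is an importance weight whose effect
    the clip keeps close to 1 so the step stays near on-policy.
    """
    ratio = np.exp(logp_new - logp_old)              # pi_theta / pi_theta_old
    unclipped = ratio * advantages                   # advantages from GAE
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.mean(np.minimum(unclipped, clipped))  # pessimistic surrogate
```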

    [–][deleted] 0 points1 point  (0 children)

    Question regarding those 36.6 kB observations from the game API: Can Viper see something that Sniper can't?

    [–]onkelFungus 0 points1 point  (0 children)

    Can somebody say something about Rapid?

    Will this be open-sourced?

    [–]gwern[S] 0 points1 point  (0 children)

    Update: it's generalized further and removed several of the simplifications DoTA players were complaining about: https://www.reddit.com/r/reinforcementlearning/comments/8zx78c/openai_dota_update_several_restrictions_lifted/

    [–][deleted] 0 points1 point  (0 children)

    I'm in the 92nd percentile in this game and I have to say this is really impressive. It might be that we have AGI but it's limited by hardware.