
[–]gwern[S] 22 points23 points  (2 children)

The long-awaited update on OpenAI's DoTA program, after their 1v1 success (1/2). Notes:

  • this is incredibly simple. PPO, 1000-unit LSTMs, some basic reward shaping, a few crude randomizing heuristics for exploration, some restrictions on the gameplay format, some tweaking of the discount rate (but still absurdly short time horizons!), some historical checkpointing for self-play (toy sketch at the end of this comment), and that's about it. No tree search, no deep environment models, no intrinsic curiosity, no hierarchical RL or learning abstracted MDPs, no imitation or transfer learning, no special multi-agent losses or decentralized algorithms, essentially no communication between the agents, etc. The sample efficiency looks horrible (180 years of gameplay per day, probably a few weeks or a month per iteration given that the 1v1 agent took ~2 weeks wallclock; EDIT: Brockman says 5v5 takes 19 wallclock days, so ~3240 years of gameplay and $1.1m to train?), but with the huge compute behind it, it works! It learns long-range planning and smooth coordination between 5 allied NNs.

    • reward shaping may not be particularly necessary, as end-to-end 1/0 losses also train the original 1v1 bot reasonably well. Given how expensive it is to train, though, it probably makes the most sense for OpenAI to leave simplification to the end, like AlphaGo v1 vs the final Zero, once everything is understood.
  • interesting comment about the cost of that compute from a Googler:

    The 256 [non-preemptible] P100 optimizers are less than $400/hr. You can rent 128000 preemptible vcpus for another $1280/hr. Toss in some more support GPUs and we're at maybe $2500/hr all in. That sounds like a lot, until you realize that some of these results ran for just a weekend.

  • Rapid architecture looks a lot like DM's Impala (i.e. a few GPUs doing training with lots of CPU workers doing async rollouts), leading me to wonder if the DoTA PPO is suffering from stale gradients and would benefit from off-policy corrections

  • good performance against human amateurs, even taking 2/3 games with a 99th percentile semi-pro team

  • troubling from a safety perspective: aside from demonstrating, as AlphaGo did, that you can get big capability jumps in small wallclock time (vastly less than limited, serial humans require) via parallelization, it shows that, as Karpathy has noted, 'deep learning wants to work': you can reach human (or perhaps eventually superhuman) performance despite serious "delusions", logic bugs, or outright crashes in the code:

    [Comparison of learning curves before and after bugfixes, showing how fixing bugs increases learning speed.] We’re still fixing bugs. The chart shows a training run of the code that defeated amateur players, compared to a version where we simply fixed a number of bugs, such as rare crashes during training, or a bug which resulted in a large negative reward for reaching level 25. It turns out it’s possible to beat good humans while still hiding serious bugs!

  • next match will be streamed 28 July 2018; pro matches at The International 20-25 August 2018.

Overall, probably the most interesting part for DRL is the fact that a PPO LSTM with enough brute force is capable of learning long-range planning. Like Ape-X, this reinforces the observation that brute force with a simulator can take us quite a ways... (And simulators can be learned; see all the recent work on deep environment models.)
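
As promised above, a toy sketch of the historical-checkpointing-for-self-play idea (my own illustration, not OpenAI's code; the 80/20 current-vs-past mixing ratio is the figure I remember from the blog post, everything else is made up):

```python
import random

class OpponentPool:
    """Toy self-play opponent pool with historical checkpoints.

    Playing mostly against the current policy drives improvement, while
    mixing in past snapshots helps avoid strategy collapse / cycling.
    """
    def __init__(self, past_fraction=0.2):    # ~80/20 current vs. past selves
        self.snapshots = []
        self.past_fraction = past_fraction

    def save(self, params):
        # periodically checkpoint the learner's parameters
        self.snapshots.append(params)

    def sample_opponent(self, current_params):
        if self.snapshots and random.random() < self.past_fraction:
            return random.choice(self.snapshots)   # an older self
        return current_params                      # the latest self
```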


[–]wassname 1 point2 points  (1 child)

Nice summary, thanks.

leading me to think they may be suffering from stale gradients

By stale gradients you mean that they collected too much training data, so some of it becomes irrelevant/off-policy before the whole rollout has been sampled? Another way of saying it is stale experience that's no longer relevant to the model's decisions.

Interestingly (for me :p) OpenAI has added some metrics to their baselines ppo2 repository that seem aimed at diagnosing this (added 7 months ago). They have approxkl and clipfrac, which measure how far the policy has drifted and how much of the minibatch is being clipped by PPO's importance sampling. So presumably they would know.
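
Roughly (a numpy sketch from memory, not the actual ppo2 TensorFlow code), those diagnostics look like this:

```python
import numpy as np

def ppo_diagnostics(logp_old, logp_new, clip_range=0.2):
    """Approximate versions of the approxkl / clipfrac metrics logged by ppo2.

    logp_old: log pi_theta_old(a|s) recorded when the rollout was collected
    logp_new: log pi_theta(a|s) under the policy currently being optimized
    """
    ratio = np.exp(logp_new - logp_old)                     # importance ratio
    approx_kl = 0.5 * np.mean((logp_new - logp_old) ** 2)   # crude KL estimate
    # fraction of samples where the surrogate is being clipped, i.e. where
    # the data has drifted far enough off-policy for the clip to bite
    clip_frac = np.mean(np.abs(ratio - 1.0) > clip_range)
    return approx_kl, clip_frac
```

If clipfrac climbs within an epoch, that's exactly the "data has gone stale" signal being discussed here.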

Overall, probably the most interesting part for DRL is the fact that PPO LSTM with enough brute force is capable of learning long-range planning

Agreed! For anyone considering spending 1 million dollars on training, the decision looks completely different once you have a few examples showing it has a decent chance of working.

[–]gwern[S] 2 points3 points  (0 children)

By stale gradients you mean that they collected too much training data, so some of it becomes irrelevant/off-policy before the whole rollout has been sampled?

Yes. Since they have so many GPUs/CPUs, that implies everything is asynchronous: the GPUs are constantly training on new experiences from the CPUs, distributing their last snapshot to the CPUs, while swapping gradients with the other GPU masters; a stop-the-world synchronization would avoid staleness but destroy throughput as hundreds of thousands of nodes all start waiting on a master node.
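
If you did want to correct for that staleness without stop-the-world syncing, one option (purely a sketch on my part, loosely in the spirit of Impala's V-trace truncation, not anything OpenAI describes using) is to down-weight stale samples with truncated importance ratios:

```python
import numpy as np

def truncated_is_weights(logp_behaviour, logp_current, rho_bar=1.0):
    """Per-sample weights for data generated by a stale policy snapshot.

    logp_behaviour: log-probs under the actor snapshot that produced the data
    logp_current:   log-probs under the learner's current policy
    """
    rho = np.exp(logp_current - logp_behaviour)  # importance ratio per step
    return np.minimum(rho, rho_bar)              # truncate to bound variance

# Usage: multiply each sample's policy-gradient / value-target contribution
# by these weights, so stale experience is discounted rather than trusted fully.
```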

[–]LazyOptimist 2 points3 points  (3 children)

There are quite a few impressive things about this. The first that stands out is that it's basically just PPO + LSTM + ~10^2 PetaFLOP/s-days. From what I gathered, it seems like the only trick they're using is some light reward shaping; the rest is engineering to operate at that scale. The second is that it works on such a large time horizon: they're only using BPTT for 16 time steps, which is just over 2 seconds of gameplay. So seeing the agent perform actions that pay off only several minutes later is impressive. The third is the parameter count of the LSTM relative to what it achieves. The LSTM only has 2^10 hidden units, which means they're only optimizing a few million parameters, and the amount of possible behaviours the bots can express is somewhat limited, yet this is enough to deal with a massive action space and operate successfully in a 5v5 game.
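
For concreteness, a minimal sketch of what 16-step truncated BPTT over a long rollout amounts to (illustrative only, not their training code):

```python
import numpy as np

def bptt_windows(rollout, chunk_len=16):
    """Split one rollout into truncated-BPTT windows.

    The LSTM hidden state is carried forward across windows, so the agent
    still conditions on older context, but gradients only flow back
    chunk_len steps (~2 seconds of game time at their tick rate).
    """
    for start in range(0, len(rollout), chunk_len):
        # in a real trainer, the hidden state entering each window would be
        # detached from the graph so backprop stops at the window boundary
        yield rollout[start:start + chunk_len]

rollout = np.arange(256)                      # toy 256-step trajectory
print(sum(1 for _ in bptt_windows(rollout)))  # -> 16 windows
```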

However, I'm a bit disappointed that they haven't used anything except brute force to make this work, especially regarding long time horizons. I don't know if it's because they tried a bunch of different ideas and the penalty to scaling outweighed the benefit, or if they just want to do this without any tricks, but I think that not addressing the time-horizon problem head-on might be a bottleneck for them. Compared to the previous agent, it looks like it operates every 4 ticks vs every 3 ticks; I wonder how many ticks they can skip before discretizing time starts to hurt performance.

I'm also wondering if the main thing making this work is the reward shaping. As described in the RUDDER paper, you get "exponential" speed-ups in learning when you reshape the reward properly (although they don't claim that's the case for policy gradient methods). If that's the case, then I suspect that if you took away the reward shaping, getting this to work would be intractable.

[–]gwern[S] 1 point2 points  (0 children)

As described in the RUDDER paper, you get "exponential" speed-ups in learning when you reshape the reward properly (although they don't claim that's the case for policy gradient methods). If that's the case, then I suspect that if you took away the reward shaping, getting this to work would be intractable.

The part about the binary reward still working reasonably well suggests that the reward shaping isn't contributing that much to the DotA agent.

[–]wassname 1 point2 points  (0 children)

although they don't claim that's the case for policy gradient methods

The section on Atari games, and the code they released, are PPO-based (a policy gradient method).

[–]abstractcontrol 1 point2 points  (0 children)

RUDDER doesn't reshape rewards; it redistributes them. The difference is that reward shaping fundamentally changes the cost function, whereas reward redistribution just moves around when the rewards happen: it does not change the optimal policy in any way.
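
A toy contrast (my own sketch, not RUDDER's actual method, which learns the per-step credit with a return predictor):

```python
import numpy as np

def redistribute_return(rewards, credit):
    """Move a sparse episode return onto earlier, credit-bearing steps.

    Unlike adding shaping terms, this keeps the total return of every
    episode identical, so the optimal policy cannot change.
    """
    credit = credit / credit.sum()     # normalized per-step credit assignment
    return rewards.sum() * credit      # same sum, much earlier learning signal

rewards = np.array([0., 0., 0., 0., 1.])       # sparse win/loss-style reward
credit  = np.array([0.1, 0.4, 0.3, 0.1, 0.1])  # e.g. from a return predictor
assert np.isclose(redistribute_return(rewards, credit).sum(), rewards.sum())
```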

[–]yazriel0 2 points3 points  (1 child)

Great summary, thank you.

Here is a key comment from HN where they explain how "linear variance of the gradient" (i.e. linear in the number of action dimensions) makes the learning time feasible:

https://news.ycombinator.com/item?id=17392903

[–]wassname 0 points1 point  (0 children)

There's now a correction from the author (ilyasut) saying the variance of the gradient actually increases exponentially with the difficulty of exploration, but they stabilised it with self-play and the shaped reward.

If I'm interpreting his statement correctly.

[–][deleted]  (2 children)

[deleted]

    [–]gwern[S] 5 points6 points  (1 child)

    Is this using Experience Replay?

    No. Experience replay requires off-policyness and PPO is a standard on-policy algorithm. It can't use experience replay in any useful way.

    How often is the NN updated?

    They don't say specifically, but you can get an idea by looking at DM's papers on highly parallel DRL architectures like Ape-X or Impala (which, as I mentioned, are similar to Rapid). Since PPO is on-policy, you can't afford to train on samples which are more than a few gradient steps old, and you want to keep all actors using the latest policy possible, so presumably the updates are generally the equivalent of every minute or less. (Human DoTA games run >20 minutes, but these are accelerated barebones games without rendering any graphics, so they can tick through each game a lot faster than realtime.)

    the AlphaGos plays with tree search and updates the NN once per experience

    All the AGs I know of have minibatches much larger than 1. Usually a couple thousand, I think.
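
    To make the on-policy cadence above concrete, the training loop is roughly shaped like this (purely illustrative toy, not Rapid):

```python
import numpy as np

# Toy on-policy cadence: data is consumed within a few optimizer epochs of
# being generated, then the fresh parameters go back to every actor.
def collect_rollouts(params, n_samples=64, seed=0):
    # stand-in for thousands of CPU actors playing accelerated, render-free games
    return np.random.default_rng(seed).normal(params.mean(), 1.0, n_samples)

def ppo_update(params, batch, epochs=3):
    # stand-in for a few epochs of minibatch SGD on the PPO objective
    for _ in range(epochs):
        params = params + 0.01 * (batch.mean() - params)
    return params

params = np.zeros(1)
for iteration in range(5):
    batch = collect_rollouts(params, seed=iteration)  # on-policy for these params
    params = ppo_update(params, batch)                # used within a few steps
    # batch is then discarded (no replay); params are broadcast to all actors
```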

    [–]ForeskinLamp 1 point2 points  (0 children)

    PPO is a standard on-policy algorithm

    I don't believe this is correct. From Algorithm 1 in the paper, the trajectory rollout is done under theta_old, but the policy update is done on theta, which would make it off-policy. If you multiply the loss function by pi_theta / pi_theta_old, you end up with an off-policy actor-critic algorithm that uses the GAE update instead of the Q-function. The clipped objective ensures that the gradient update doesn't take large steps, because in general A_{theta_old} != A_{theta} (I believe the Off-PAC paper makes this point).
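
    For reference, the clipped surrogate being discussed, as a numpy sketch with hypothetical variable names (the ratio is exactly pi_theta / pi_theta_old, applied to a GAE advantage estimate):

```python
import numpy as np

def ppo_clip_surrogate(logp_old, logp_new, advantages, clip=0.2):
    """Standard PPO clipped surrogate (sketch).

    Rollouts come from theta_old, the update is taken on theta, and the
    ratio pi_theta / pi_theta_old is an importance weight whose effect
    the clip keeps close to 1 so the step stays near on-policy.
    """
    ratio = np.exp(logp_new - logp_old)              # pi_theta / pi_theta_old
    unclipped = ratio * advantages                   # advantages from GAE
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.mean(np.minimum(unclipped, clipped))  # pessimistic surrogate
```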

    [–][deleted] 0 points1 point  (0 children)

    Question regarding those 36.6 kB observations from the game API: Can Viper see something that Sniper can't?

    [–]onkelFungus 0 points1 point  (0 children)

    Can somebody say something about Rapid?

    Will this be open-sourced?

    [–]gwern[S] 0 points1 point  (0 children)

    Update: it's generalized further and removed several of the simplifications DoTA players were complaining about: https://www.reddit.com/r/reinforcementlearning/comments/8zx78c/openai_dota_update_several_restrictions_lifted/

    [–][deleted] 0 points1 point  (0 children)

    I'm in the 92nd percentile in this game and I have to say this is really impressive. It might be that we have AGI but it's limited by hardware.