Apropos of the news on AlphaGo, "Recurrent Environment Simulators", Guo et al 2014/Desai & Bannerjee 2017, Mathieu et al 2017, and LeCun's talk on unsupervised learning/RL, I have been wondering something: has anyone proposed, or done any work on, using Generative Adversarial Networks for forward planning, such as in a tree search method like Monte Carlo Tree Search? It seems like what people are working towards when they bring up GANs in RL-related talks, but I don't recall anyone explicitly spelling out how GANs would be useful or proposing this specific approach, and some searching doesn't turn it up for me either. So here goes:
GANs approximate a data distribution, so samples drawn from a GAN trained on data from sequential timesteps should approximate the true environment's distribution over next states, weighted by likelihood; such sampling could be combined with MCTS to deeply evaluate the decision tree, estimate the value of each action, and choose optimal next actions. An MCTS+GAN RL agent would inherit most of the strengths of GANs and MCTS: simple implementations, the ability to learn from off-policy transition samples, a deep environment model, long-term planning, quantified uncertainty for better exploration, any-time estimates, parallel execution, etc. This could be extended to deep exploration by using dropout-trained GANs or bootstrapped GANs: train the GAN, take actions in the true environment based on MCTS+GAN estimates, and use the newly collected samples to retrain the GAN. The MCTS+GAN value estimates could also be distilled down into, or used to pre-train, a fast reactive deep RL agent like DQN or A3C by providing high-quality transition samples or a highly accurate advantage function.
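To make the bootstrapped-GAN exploration idea slightly more concrete, here is a minimal PyTorch sketch; everything in it (the architecture, the dimensions, the `Generator` name) is a placeholder assumption of mine rather than anything from the literature. The idea is an ensemble of conditional generators, each trained on its own bootstrap resample of the transition dataset, with one member drawn at random per episode to drive the MCTS planning, giving temporally consistent posterior-sampling-style exploration:

```python
import torch
import torch.nn as nn

K = 10  # ensemble size (arbitrary)

class Generator(nn.Module):
    """Maps (state, action, z) -> next state; architecture is a placeholder."""
    def __init__(self, state_dim=64, action_dim=4, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, action, z):
        return self.net(torch.cat([state, action, z], dim=-1))

ensemble = [Generator() for _ in range(K)]
# Each member would be trained adversarially (against its own discriminator)
# on a different bootstrap resample of the transition dataset; at the start
# of each episode, one member is drawn at random and used as the MCTS
# simulator, so the agent explores as if committed to one plausible model.
```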
(Note: I'm not sure GAN use here is strictly necessary. PixelCNN appears to be competitive with GANs for modeling visual data distributions, so perhaps that approach would also work.)
So to go back to LeCun's talk: one of his points is that for RL, we want some sort of model-based planning for lookahead. Model-free, non-planning approaches are too sample-inefficient to get anywhere, although they tend to be very fast and have their advantages; so in humans we see a fairly clear division between immediate System I reflexes and intuition, and slower System II explicit planning and 'thinking through' possible futures by exploring a little model of the environment in our heads.

But how do you have a deep model predict the future? He shows an example with cars. You can take a CNN and have it predict subsequent frames using something like RMSE loss and autoencoders, but this immediately leads to blurry images, as it averages each pixel over all possible futures; this is not very helpful for planning, since a blur doesn't tell you anything about what actions to take. The future is very multimodal, but the autoencoder will just produce the average. Then LeCun demonstrates GANs for predicting next frames - the sample is very sharp, as it picks out one possible future and depicts it as exactly as possible. But this isn't useful either: while it's probably the future with the highest likelihood, the maximum likelihood is still very unlikely, and the more steps out, the more that exact sequence becomes vanishingly unlikely, plus it ignores the expected value of the slightly-less-likely alternatives. What we need is planning over the entire distribution, not a single sample from it.

Fortunately, a GAN does provide us the entire distribution of futures weighted by their likelihood, via the z-vector: we just create, say, 1000 unique sets of uniform deviates and feed them into the GAN, and we will get an approximation of the distribution of futures, each one sharp. If in 50% of possible futures the car continues forward, then on average 500 of the 1000 samples will show slight variants of the car moving forward, while perhaps another 250 have it turning left and another 250 have it turning right; with this, we can start doing planning and note that there's a 25% chance of it turning towards us and coming unacceptably close, with potentially huge negative rewards, so we should slow down. This is not something we would get from an autoencoder (which would just show an expanding blur of grayness) or a single sample from a GAN (which would usually safely show the car going straight or turning away from us).

To account for actions changing the environment and turn it from a predictive model into a causal model, you would make the GAN conditional: in addition to the z-vector, provide an action. If there are immediate rewards rather than just terminal rewards, then the GAN predicts both the new environment state and that state's immediate reward (which doesn't require knowledge of the total value of that state or of the policy being followed, since the valuation will be done by working backwards from an eventual terminal state and summing all the discounted 1-step rewards).
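A sketch of that sampling procedure, with the caveat that `G` is an already-trained conditional generator and `classify_mode` is a hypothetical helper that buckets a predicted frame into 'forward'/'left'/'right' etc.; neither comes from LeCun's talk:

```python
import torch

def sample_futures(G, state, action, n=1000, z_dim=32):
    """Draw n sharp candidate futures from a conditional generator
    G(state, action, z) by varying only the noise z-vector."""
    z = torch.rand(n, z_dim)                       # n unique sets of uniform deviates
    states = state.unsqueeze(0).expand(n, -1)      # repeat the current state n times
    actions = action.unsqueeze(0).expand(n, -1)
    with torch.no_grad():
        return G(states, actions, z)

# Estimating the probability of the dangerous mode, assuming a hypothetical
# mode classifier:
#   modes = [classify_mode(f) for f in sample_futures(G, s, a)]
#   p_toward_us = modes.count("toward_us") / len(modes)   # e.g. ~0.25 -> slow down
```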
If you have a sharp simulator of the environment, what do you do with it? Well, you plug it into a decision tree: enumerate all possible sequences of actions and stochastic outcomes, and do backwards induction to figure out the optimal action at any outcome. That's too hard? You approximate it with MCTS. The simulator provides the possible outcomes for each action, which can then be fed back into the simulator to explore another ply down, and so on until hitting the depth limit and exploring randomly until a terminal state, at which point the cumulative rewards for each node are estimated, and another rollout begins.
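A minimal sketch of a single such rollout through the learned simulator, assuming a hypothetical `gan_step(state, action) -> (next_state, reward)` wrapper that draws one z-vector per call; a real MCTS implementation would add tree statistics and UCT-style action selection on top of many such rollouts:

```python
import random

def rollout(state, actions, gan_step, depth=20, gamma=0.99):
    """Follow a random policy through the learned simulator and return the
    discounted return; MCTS runs many such rollouts per node and backs the
    resulting value estimates up the tree."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        action = random.choice(actions)
        state, reward = gan_step(state, action)   # one GAN sample = one future
        total += discount * reward
        discount *= gamma
    return total
```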
So the whole algorithm goes:
1. create a dataset of environment+reward+action samples (perhaps from following a random policy in an ALE game, or from expert human trajectories)
2. train until convergence a GAN (like Improved WGAN) to approximate the distribution of new environments conditional on the old environment, action, and noise z-vector
3. for n games in parallel:
    - initialize the RL agent in the environment
    - until termination, take each action according to MCTS, using the GAN as the environment simulator by drawing a relatively small number of samples such as 5-100 (similar to Go's branching factor), while adding all environment transitions+rewards+actions to the dataset
4. go to #2
This could probably be implemented fairly easily using Gym+Improved WGAN.
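A sketch of that outer loop, with `env` a Gym-style environment and `train_wgan`, `mcts_action`, and `random_policy` hypothetical stand-ins for the GAN training and tree-search machinery:

```python
def train_loop(env, train_wgan, mcts_action, random_policy,
               n_games=16, n_iterations=100, seed_steps=10_000):
    # step 1: seed the dataset by following a random policy
    dataset = []
    obs, done = env.reset(), False
    for _ in range(seed_steps):
        action = random_policy(obs)
        next_obs, reward, done, _ = env.step(action)
        dataset.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs

    for _ in range(n_iterations):
        G = train_wgan(dataset)                 # step 2: conditional Improved WGAN
        for _ in range(n_games):                # step 3 (serial here for clarity)
            obs, done = env.reset(), False
            while not done:
                # each tree expansion draws 5-100 samples from G, per step 3
                action = mcts_action(G, obs, n_samples=50)
                next_obs, reward, done, _ = env.step(action)
                dataset.append((obs, action, reward, next_obs))
                obs = next_obs
        # step 4: loop back and retrain the GAN on the grown dataset
```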
This is:
- parallelizable, since the agents act independently and stochastically while the GAN can be trained in parallel on the 'experience replay buffer'
- able to learn off-policy, since the GAN models only the immediate state transition and reward, which doesn't depend on whether successive actions follow a better or worse policy - the full value is estimated only during the MCTS search by backwards induction
- able to maintain sharp environment states, since each individual GAN sample represents one possible future, not an ensemble or average of futures
- capable of exploring deeply, using MCTS to prioritize promising lines of action
- capable of handling all the environments that DQN/A3C handle now, like the ALE, and further capable of handling sequence data, given the recent work on applying Improved WGAN and other GANs to discrete and sequential data
- anytime, since each rollout updates value estimates incrementally
The major weakness is that I'm not sure how well the MCTS part would handle continuous actions; this hybrid algorithm seems to fall closer to DQN than to A3C in the taxonomy. There is also a potential curse of dimensionality here: many of the drawn GAN samples will be morally equivalent, in that the GAN is reflecting the same basic causal change with slight appearance differences, and despite MCTS's usual ability to handle extremely high branching factors, as in Go, this might wind up killing the approach.
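One possible mitigation, sketched here as a guess rather than a known fix: collapse the morally-equivalent samples before tree expansion by greedily clustering them under a distance threshold, so MCTS branches over distinct modes rather than over raw samples. Both the threshold and the raw-pixel distance metric are arbitrary placeholder choices:

```python
import numpy as np

def collapse_modes(samples, threshold=0.1):
    """Greedy deduplication: keep a sample as a new 'mode' only if it is
    farther than `threshold` from every mode kept so far."""
    modes = []
    for s in samples:
        if all(np.linalg.norm(s - m) > threshold for m in modes):
            modes.append(s)
    return modes
```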