Apropos of the news on AlphaGo, "Recurrent Environment Simulators", Guo et al 2014/Desai & Bannerjee 2017, Mathieu et al 2017, and LeCun's talk on unsupervised learning/RL, I have been wondering something: has anyone proposed, or done any work on, using Generative Adversarial Networks for forward planning, such as in a tree search method like Monte Carlo Tree Search? It seems like what people are working towards when they bring up GANs in RL-related talks, but I don't recall anyone explicitly spelling out how GANs would be useful or proposing this specific approach, and some searching doesn't turn it up for me either. So here goes:
GANs approximate a data distribution, so samples drawn from a GAN trained on data from sequential timesteps should approximate the true environment's distribution over next states, weighted by likelihood; such sampling could be combined with MCTS to deeply evaluate the decision tree, estimate the value of each action, and choose optimal next actions. An MCTS+GAN RL agent would inherit most of the strengths of GANs and MCTS: simple implementations, the ability to learn from off-policy transition samples, a deep environment model, long-term planning, quantified uncertainty for better exploration, any-time estimates, parallel execution, etc. This could be extended to deep exploration by using dropout-trained GANs or bootstrapped GANs: train the GAN, take actions in the true environment based on MCTS+GAN estimates, and use the newly collected samples to retrain the GAN. The MCTS+GAN value estimates could also be distilled down into, or used to pre-train, a fast reactive deep RL agent like DQN or A3C by providing high-quality transition samples or a highly accurate advantage function.
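To make the bootstrapped-GAN exploration idea slightly more concrete, here is a minimal PyTorch sketch; everything in it (the architecture, the dimensions, the `Generator` name) is a placeholder assumption of mine rather than anything from the literature. The idea is an ensemble of conditional generators, each trained on its own bootstrap resample of the transition dataset, with one member drawn at random per episode to drive the MCTS planning, giving temporally consistent posterior-sampling-style exploration:

```python
import torch
import torch.nn as nn

K = 10  # ensemble size (arbitrary)

class Generator(nn.Module):
    """Maps (state, action, z) -> next state; architecture is a placeholder."""
    def __init__(self, state_dim=64, action_dim=4, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, action, z):
        return self.net(torch.cat([state, action, z], dim=-1))

ensemble = [Generator() for _ in range(K)]
# Each member would be trained adversarially (against its own discriminator)
# on a different bootstrap resample of the transition dataset; at the start
# of each episode, one member is drawn at random and used as the MCTS
# simulator, so the agent explores as if committed to one plausible model.
```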
(Note: I'm not sure GAN use here is strictly necessary. PixelCNN appears to be competitive with GANs for modeling visual data distributions, so perhaps that approach would also work.)
So to go back to LeCun's talk: one of his points is that for RL, we want some sort of model-based planning for lookahead. Model-free, non-planning approaches are too sample-inefficient to get anywhere, although they tend to be very fast and have their advantages; so in humans we see a fairly clear division between immediate System I reflexes and intuition, and slower System II explicit planning and 'thinking through' possible futures by exploring a little model of the environment in our heads.

But how do you have a deep model predict the future? He shows an example with cars. You can take a CNN and have it predict subsequent frames using something like RMSE loss and autoencoders, but this immediately leads to blurry images, as it averages each pixel over all possible futures; this is not very helpful for planning, since a blur doesn't tell you anything about what actions to take. The future is very multimodal, but the autoencoder will just produce the average. Then LeCun demonstrates GANs for predicting next frames - the sample is very sharp, as it picks out one possible future and depicts it as exactly as possible. But this isn't useful either: while it's probably the future with the highest likelihood, the maximum likelihood is still very unlikely, and the more steps out, the more that exact sequence becomes vanishingly unlikely, plus it ignores the expected value of the slightly-less-likely alternatives. What we need is planning over the entire distribution, not a single sample from it.

Fortunately, a GAN does provide us the entire distribution of futures weighted by their likelihood, via the z-vector: we just create, say, 1000 unique sets of uniform deviates and feed them into the GAN, and we will get an approximation of the distribution of futures, each one sharp. If in 50% of possible futures the car continues forward, then on average 500 of the 1000 samples will show slight variants of the car moving forward, while perhaps another 250 have it turning left and another 250 have it turning right; with this, we can start doing planning and note that there's a 25% chance of it turning towards us and coming unacceptably close, with potentially huge negative rewards, so we should slow down. This is not something we would get from an autoencoder (which would just show an expanding blur of grayness) or a single sample from a GAN (which would usually safely show the car going straight or turning away from us).

To account for actions changing the environment and turn it from a predictive model into a causal model, you would make the GAN conditional: in addition to the z-vector, provide an action. If there are immediate rewards rather than just terminal rewards, then the GAN predicts both the new environment state and that state's immediate reward (which doesn't require knowledge of the total value of that state or of the policy being followed, since the valuation will be done by working backwards from an eventual terminal state and summing all the discounted 1-step rewards).
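A sketch of that sampling procedure, with the caveat that `G` is an already-trained conditional generator and `classify_mode` is a hypothetical helper that buckets a predicted frame into 'forward'/'left'/'right' etc.; neither comes from LeCun's talk:

```python
import torch

def sample_futures(G, state, action, n=1000, z_dim=32):
    """Draw n sharp candidate futures from a conditional generator
    G(state, action, z) by varying only the noise z-vector."""
    z = torch.rand(n, z_dim)                       # n unique sets of uniform deviates
    states = state.unsqueeze(0).expand(n, -1)      # repeat the current state n times
    actions = action.unsqueeze(0).expand(n, -1)
    with torch.no_grad():
        return G(states, actions, z)

# Estimating the probability of the dangerous mode, assuming a hypothetical
# mode classifier:
#   modes = [classify_mode(f) for f in sample_futures(G, s, a)]
#   p_toward_us = modes.count("toward_us") / len(modes)   # e.g. ~0.25 -> slow down
```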
If you have a sharp simulator of the environment, what do you do with it? Well, you plug it into a decision tree: enumerate all possible sequences of actions and stochastic outcomes, and do backwards induction to figure out the optimal action at any outcome. That's too hard? You approximate it with MCTS. The simulator provides the possible outcomes for each action, which can then be fed back into the simulator to explore another ply down, and so on until hitting the depth limit and exploring randomly until a terminal state, at which point the cumulative rewards for each node are estimated, and another rollout begins.
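A minimal sketch of a single such rollout through the learned simulator, assuming a hypothetical `gan_step(state, action) -> (next_state, reward)` wrapper that draws one z-vector per call; a real MCTS implementation would add tree statistics and UCT-style action selection on top of many such rollouts:

```python
import random

def rollout(state, actions, gan_step, depth=20, gamma=0.99):
    """Follow a random policy through the learned simulator and return the
    discounted return; MCTS runs many such rollouts per node and backs the
    resulting value estimates up the tree."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        action = random.choice(actions)
        state, reward = gan_step(state, action)   # one GAN sample = one future
        total += discount * reward
        discount *= gamma
    return total
```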
So the whole algorithm goes:
1. create a dataset of environment+reward+action samples (perhaps from following a random policy in an ALE game, or from expert human trajectories)
2. train until convergence a GAN (like Improved WGAN) to approximate the distribution of new environments conditional on the old environment, action, and noise z-vector
3. for n games in parallel:
    - initialize the RL agent in the environment
    - until termination, take each action according to MCTS, using the GAN as the environment simulator by drawing a relatively small number of samples such as 5-100 (similar to Go's branching factor), while adding all environment transitions+rewards+actions to the dataset
4. go to #2
This could probably be implemented fairly easily using Gym+Improved WGAN.
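A sketch of that outer loop, with `env` a Gym-style environment and `train_wgan`, `mcts_action`, and `random_policy` hypothetical stand-ins for the GAN training and tree-search machinery:

```python
def train_loop(env, train_wgan, mcts_action, random_policy,
               n_games=16, n_iterations=100, seed_steps=10_000):
    # step 1: seed the dataset by following a random policy
    dataset = []
    obs, done = env.reset(), False
    for _ in range(seed_steps):
        action = random_policy(obs)
        next_obs, reward, done, _ = env.step(action)
        dataset.append((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs

    for _ in range(n_iterations):
        G = train_wgan(dataset)                 # step 2: conditional Improved WGAN
        for _ in range(n_games):                # step 3 (serial here for clarity)
            obs, done = env.reset(), False
            while not done:
                # each tree expansion draws 5-100 samples from G, per step 3
                action = mcts_action(G, obs, n_samples=50)
                next_obs, reward, done, _ = env.step(action)
                dataset.append((obs, action, reward, next_obs))
                obs = next_obs
        # step 4: loop back and retrain the GAN on the grown dataset
```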
This is:
- parallelizable, since the agents act independently and stochastically while the GAN can be trained in parallel on the 'experience replay buffer'
- able to learn off-policy, since the GAN models only the immediate state transition and reward, which doesn't depend on whether successive actions follow a better or worse policy - the full value is estimated only during the MCTS search by backwards induction
- able to maintain sharp environment states, since each individual GAN sample represents one possible future, not an ensemble or average of futures
- capable of exploring deeply, using MCTS to prioritize promising lines of action
- capable of handling all the environments that DQN/A3C handle now, like the ALE, and further capable of handling sequence data, given the recent work on applying Improved WGAN and other GANs to discrete and sequential data
- anytime, since each rollout updates value estimates incrementally
The major weakness is that I'm not sure how well the MCTS part would handle continuous actions; this hybrid algorithm seems to fall closer to DQN than to A3C in the taxonomy. There is also a potential curse of dimensionality here: many of the drawn GAN samples will be morally equivalent, in that the GAN is reflecting the same basic causal change with slight appearance differences, and despite MCTS's usual ability to handle extremely high branching factors, as in Go, this might wind up killing the approach.
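One possible mitigation, sketched here as a guess rather than a known fix: collapse the morally-equivalent samples before tree expansion by greedily clustering them under a distance threshold, so MCTS branches over distinct modes rather than over raw samples. Both the threshold and the raw-pixel distance metric are arbitrary placeholder choices:

```python
import numpy as np

def collapse_modes(samples, threshold=0.1):
    """Greedy deduplication: keep a sample as a new 'mode' only if it is
    farther than `threshold` from every mode kept so far."""
    modes = []
    for s in samples:
        if all(np.linalg.norm(s - m) > threshold for m in modes):
            modes.append(s)
    return modes
```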