We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to
draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language
modeling such as GPT-x and BERT. In
particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence
modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past
states, and actions, our Decision Transformer model can generate future actions that
achieve the desired return. Despite its simplicity, Decision Transformer matches or
exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
…Decision Transformer: autoregressive sequence modeling for RL:
We take a simple approach: each modality (return, state, or action) is passed into an embedding network (convolutional
encoder for images, linear layer for continuous states). The embeddings are then processed by an autoregressive transformer
model, trained to predict the next action given the previous tokens using a linear output layer. Evaluation is also easy:
we can initialize by a desired target return (e.g., 1 or 0 for success or failure) and the starting state in the
environment. Unrolling the sequence—similar to standard autoregressive generation in language models—yields a sequence of
actions to execute in the environment.
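To make the pipeline concrete, here is a minimal PyTorch sketch of the architecture described above; the layer sizes, depth, and tensor shapes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    """Interleaves (return-to-go, state, action) tokens and predicts actions."""

    def __init__(self, state_dim, act_dim, d_model=128, n_layers=3,
                 n_heads=1, max_timestep=1000):
        super().__init__()
        # one embedding network per modality (a convolutional encoder would
        # replace embed_state for image observations)
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.embed_timestep = nn.Embedding(max_timestep, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)  # linear output layer

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B,T,1)  states: (B,T,state_dim)  actions: (B,T,act_dim)  timesteps: (B,T)
        B, T = timesteps.shape
        t_emb = self.embed_timestep(timesteps)
        # interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...)
        tokens = torch.stack([
            self.embed_rtg(rtg) + t_emb,
            self.embed_state(states) + t_emb,
            self.embed_action(actions) + t_emb,
        ], dim=2).reshape(B, 3 * T, -1)
        # causal mask: each token attends only to earlier tokens
        causal = torch.triu(torch.full((3 * T, 3 * T), float("-inf")), diagonal=1)
        h = self.transformer(tokens, mask=causal)
        # predict each step's action from that step's state token
        return self.predict_action(h[:, 1::3])
```

Training then reduces to a standard supervised loss (e.g. mean-squared error for continuous actions) between predicted and logged actions.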
…Sequence modeling as multitask learning: One effect of this type of modeling is that we perform
conditional generation, where we initialize a trajectory by inputting our desired return. Decision Transformer does not
yield a single policy; rather, it models a wide distribution of policies. If we plot average achieved return against the
target return of a trained Decision Transformer, we find distinct policies are
learned that can reasonably match the target, trained only with supervised learning. Furthermore, on some tasks (such as
Q*bert and Seaquest), we find Decision Transformer can actually extrapolate outside
of the dataset and model policies achieving higher return!
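A hedged sketch of that return-conditioned evaluation loop, assuming a Gym-style environment; `predict_action` stands in for querying the trained model on the (return, state, action) history, whose exact interface is an assumption here:

```python
def evaluate(predict_action, env, target_return, max_steps=1000):
    """Seed the sequence with the desired return, then decrement the
    return-to-go as rewards arrive (Gym-style env API assumed)."""
    state = env.reset()
    rtgs, states, actions = [target_return], [state], []
    total_reward = 0.0
    for _ in range(max_steps):
        action = predict_action(rtgs, states, actions)  # model call (assumed interface)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        rtgs.append(rtgs[-1] - reward)  # remaining desired return shrinks
        states.append(state)
        actions.append(action)
        if done:
            break
    return total_reward  # plot against target_return to see the matching behavior
```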
On April 18th, I discovered a vulnerability in the AI Dungeon GraphQL API that allowed unpublished adventures [games],
unpublished scenarios [settings], and unpublished posts [stories] to be leaked. These resources could be read in bulk, at a
rate of approximately 1000 requests per minute. Unfortunately, this is, in fact, the second time I have discovered this
exact vulnerability. The first time, the issue was reported and fixed, but after finding it again, I can see that simply
reporting the issue was a mistake…There was nothing preventing me from collecting more data, but what was gathered seemed
sufficient to demonstrate the vulnerability fully—adventures dating all the way back to Dec 16th,
2019 were at risk.
…A Surprising Observation: Looking at the resulting aggregated data led to a surprising observation.
There were a lot of lewd or otherwise nsfw user action fragments—way more than I had anticipated. As a bit of
follow-up analysis, I checked what percentage of adventures had explicitly lewd (18+) actions, and what percentage had nsfw actions.
The results are… surprising, to say the least. Out of the 188k adventures (and 3.9M user actions) analyzed:
87.3k (46.3% of all adventures sampled) are NSFW and…
59.1k (31.4% (!) of all adventures sampled) are explicit (18+)
…Autoincrementing IDs: Autoincrementing IDs are, in my opinion, by far the biggest issue. They allow
someone to read all resources, simply by starting from 1 and counting upwards. Had these not been used, a secondary
vulnerability would have needed to be discovered alongside the vote vulnerability in order to exploit either one.
Otherwise, there would be no way to figure out what the private adventure IDs are, even if they could be read through a
vulnerability. I recommend deprecating and removing autoincrementing IDs completely, as soon as possible. After that point,
leaking and publishing a non-UUID ID should be treated as a security issue in its own right.
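For illustration, a short Python sketch of why sequential IDs are scannable and why random UUIDs close that channel; `fetch` is a hypothetical stand-in for an API call such as a GraphQL query by ID:

```python
import uuid

def enumerate_resources(fetch, start=1, limit=1000):
    """With autoincrementing IDs an attacker just counts upward; at ~1000
    requests per minute this loop covers the ID space quickly."""
    return [fetch(i) for i in range(start, start + limit)]

# A version-4 UUID is drawn from a ~2^122 space, so IDs cannot be guessed by
# counting; leaking one would then require a second, separate vulnerability.
fresh_id = uuid.uuid4()
```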
Also note—autoincrementing IDs allow anyone to trivially figure out roughly how many of each resource exists. For AI
Dungeon (as of April 19th), these would be:
~250K comments—10% on posts, 25% as nested comments, 50% on scenarios, 5% on adventures, 10% on “story” posts
Storytelling plays a central role in human socializing and entertainment. However, much of the research on automatic
storytelling generation assumes that stories will be generated by an agent without any human interaction. In this paper, we
introduce the task of collaborative storytelling, where an artificial intelligence agent and a person collaborate to create
a unique story by taking turns adding to it. We present a collaborative storytelling system which works with a human
storyteller to create a story by generating new utterances based on the story so far. We constructed the storytelling
system by tuning a publicly available large-scale language model on a dataset of writing prompts and their accompanying
fictional works. We identify generating sufficiently human-like utterances to be an important technical issue and propose a
sample-and-rank approach to improve utterance quality. Quantitative evaluation shows that our approach outperforms a
baseline, and we present qualitative evaluation of our system’s capabilities.
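As a rough illustration of sample-and-rank with off-the-shelf tooling (this uses the base GPT-2 from Hugging Face and ranks candidates by mean token log-probability; the paper's fine-tuned model and ranking criterion may differ):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def sample_and_rank(story_so_far, n_candidates=8, max_new_tokens=40):
    inputs = tok(story_so_far, return_tensors="pt")
    outs = lm.generate(
        **inputs, do_sample=True, top_p=0.9, temperature=0.8,
        num_return_sequences=n_candidates, max_new_tokens=max_new_tokens,
        pad_token_id=tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outs]

    def score(text):
        # mean token log-probability of the full text (one simple ranking choice)
        ids = tok(story_so_far + text, return_tensors="pt").input_ids
        with torch.no_grad():
            return -lm(ids, labels=ids).loss.item()

    return max(candidates, key=score)
```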
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics
used for a particular task. For example, summarization models are often trained to predict human reference summaries and
evaluated using ROUGE, but both of these metrics are rough
proxies for what we really care about—summary quality. In this work, we show that it is possible to significantly improve
summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human
comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward
function to fine-tune a summarization policy using reinforcement learning. We apply
our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human
reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any
news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We
establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better
summaries than optimizing ROUGE according to humans. We hope the evidence from
our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model
behavior they actually want.
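The pairwise-comparison step the abstract describes can be written as a Bradley-Terry-style objective; a minimal PyTorch sketch (names are mine, not the paper's code):

```python
import torch
import torch.nn.functional as F

def preference_loss(r_preferred, r_rejected):
    """Train the reward model to score the human-preferred summary higher:
    loss = -log sigmoid(r_preferred - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# toy batch of reward-model scores for (preferred, rejected) summary pairs
loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
```

The fitted reward model then supplies the scalar reward when fine-tuning the summarization policy with an RL algorithm (the paper uses PPO).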
Three factors drive the advance of AI: algorithmic innovation, data, and the amount of compute available for training.
Algorithmic progress has traditionally been more difficult to quantify than compute and data. In this work, we argue that
algorithmic progress has an aspect that is both straightforward to measure and interesting: reductions over time in the
compute needed to reach past capabilities. We show that the number of floating-point operations required to train a
classifier to AlexNet-level performance on ImageNet has decreased by a factor of 44× between 2012 and 2019. This
corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years. By contrast, Moore’s Law would
only have yielded an 11× cost improvement. We observe that hardware and algorithmic efficiency gains multiply and can be on
a similar scale over meaningful horizons, which suggests that a good model of AI progress should integrate measures from both.
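A quick check of the headline arithmetic (log base 2 of 44 gives the number of doublings over the 7-year window):

```python
import math

years, factor = 7, 44
doublings = math.log2(factor)   # ≈ 5.46 doublings over 84 months
print(years * 12 / doublings)   # ≈ 15.4 months per doubling, i.e. ~16 months
print(2 ** (years / 2))         # Moore's-Law 2-year doubling: ≈ 11.3x
```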
AI Dungeon 2 is a completely AI generated text adventure built with OpenAI’s largest GPT-2
model. It’s a first-of-its-kind game that allows you to enter, and will react to, any action you can imagine.
What is this?
Google Colab is a way to experience machine learning for free: Google provides GPUs that you can run code on. Because this game exploded, however, Google likely won’t be
able to allow free usage for AI Dungeon for very long. We are almost done making an app version of the game where you
will be able to play AI Dungeon 2. Until that’s released you can still play the game here.
Main mirrors of AI Dungeon 2 are currently down due to high download costs.
We are using BitTorrent as a temporary solution to host game files and keep this game alive.
It’s not fast, but it’s the best we’ve got right now.
If you want to help, best thing you can do is to download this torrent file with game files and seed it indefinitely to
the best of your ability. This will help new players download this game faster, and discover the vast worlds of AI Dungeon 2.
Follow @nickwalton00 on Twitter for updates on when it will be available again.
[AI Dungeon 2 is a project which trains GPT-2-1.5b on logs from text adventure games; when used interactively
by a human, it “plays RPG games” with you, but because it is powered
by GPT-2-1.5b, it is immensely flexible and can cope (to some degree) with
almost any input, producing bizarre, hilarious, or surprisingly logical sequences of adventures. It became popular
overnight, crushing Walton with bandwidth bills, and has been turned into an app and community to support distribution and
development. See also https://colab.research.google.com/github/nickwalton/AIDungeon/blob/master/AIDungeon_2.ipynb and https://old.reddit.com/r/AIDungeon/ …]
[Demonstration dialogue of interacting with a GPT-2-1.5b trained on text adventures/RPGs. The player chooses to join a band of orcs as a musician and tries to steer the game
towards orc rights, with moderate success, reaching the Emperor himself.]
In the first AI Dungeon, we created and deployed a deep learning generated text adventure using OpenAI’s 124M
parameter GPT-2 model. To limit computational cost, possible actions
and their results were generated and given to the player to choose from.
In AI Dungeon 2 we do away with pregenerated actions and allow the user to enter any action. The model then
continues generating the story resulting from that action. We also upgrade the size of our model to OpenAI’s largest 1.5B
parameter model and fine-tune it on a collection of text adventures obtained from chooseyourstory.com.
Following the example of the Salesforce
CTRL model, we also modified the sampling of our model to add a penalty to
already generated words to reduce the repetition issues that GPT-2 has and
enable a lower temperature setting.
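The CTRL-style penalty can be sketched as follows (a common formulation that divides positive logits and multiplies negative ones by a penalty θ ≈ 1.2; AI Dungeon's exact variant may differ):

```python
import torch

def penalize_repeats(logits, generated_ids, theta=1.2):
    """Discourage tokens that already appear in the generated context, which
    reduces repetition and permits a lower sampling temperature."""
    for i in set(generated_ids.tolist()):
        logits[i] = logits[i] / theta if logits[i] > 0 else logits[i] * theta
    return logits
```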
When playing AI Dungeon 2, the player can choose from several possible settings and characters. These choices
generate a starting prompt and context sentence. The context sentence is always included as input to the model to help keep
it grounded in the chosen setting, while the starting prompt is the beginning of the story.
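A minimal sketch of that grounding trick (the function and window size are assumptions, not AI Dungeon's actual code): the context sentence is prepended to a trailing window of the story on every call, so it never scrolls out of the model's input.

```python
def build_model_input(context_sentence, story_so_far, max_story_chars=2000):
    # the context sentence is always present; only the story gets truncated
    return context_sentence + "\n" + story_so_far[-max_story_chars:]
```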
…After the Gnoll War ends, the orcs decide to hold a big celebration. You immediately organize concert to entertain
everyone. It goes well and you even win a few awards for your work. However, you still aren’t satisfied and you decide to
do something else.
> write plays for the orcs to perform in
You write a series of plays based on the stories of the orcs. These works are performed throughout the Empire and they
are very popular. You also take an interest in the orcs’ future and begin to study their language and culture. In time you
even develop a liking for the orcs themselves.
> found a human-orc friendship organization
While you were studying the orcs, you came across a group called the Human-Orc Friendship Association
(HOOFA). While they weren’t exactly like the Orcs, they had similar goals. You
decided to form a separate chapter just for them.
> write some pamphlets support orc-human friendship
You wrote a pamphlet explaining what the HOOFA was and how it worked.
You also explained that the orcs were the only race that could truly understand each other and that humans were inferior to…
> promote inter-racial marriage between humans and orcs
In order to promote inter-racial marriage between humans and orcs, you started a branch of the HOOFA called the Inter-Race Marriage Bureau (IRMB)
which was located in the city of Kishak…
The last installment of the original “Choose Your Own Adventure” series came out in 1998, but since 2004, Chooseco, founded by one of the series’ original authors,
R. A. Montgomery, has been republishing classic volumes, as well as new riffs on the form of interactive fiction that
seemed ubiquitous in the 1980s and ’90s. The new editions also carry an additional feature—maps of the hidden structure of each book.
For years, fans have been creating visualizations of the forking structures of “Choose Your Own Adventure” books. Often,
they’re interested in the types of outcomes at the end of each path. One map labels each ending as “new life, return home,
or death”, and another separates them into “cliffhanger, solution, or death.” Christian Swinehart’s extensive graphical
analysis of the books labels the endings as “great, favorable, mediocre, disappointing, or catastrophic.”
…Mapping the bones of the books can have other purposes, too. Nick Montfort, a poet and professor at the Massachusetts
Institute of Technology who studies interactive fiction, has a habit of asking people what they know about “Choose Your Own
Adventure” books. “They often say, ‘You have two choices after every page’”, he says. “That’s not true. Sometimes you have
one choice. Sometimes you have more than two. When you show the maps, you can see that these books don’t look exactly the
same.” The older volumes, for instance, tend to have more endings than the later ones, and three of the oldest—Journey
Under the Sea, Space and Beyond, and By Balloon to the Sahara—have 42 endings each, more than any
other books in the series…In just about every case, it can be surprising how a simple choice leads you down a complex path.
In By Balloon to the Sahara, you’re in a balloon and are presented with a choice on the very first page. Storm
clouds are on the horizon. Choice 1: “If you act now, you can release gas from the balloon and land before the storm
overtakes you.” Choice 2: “Perhaps the storm will pass quickly. Maybe you can ride it out.” That’s just the beginning,
since this book has the most decision points—48—of the series.
…There is yet another possibility in these nonlinear books: hidden endings. Inside UFO 54-40 has a hidden ending that’s only available to a reader who ignores the
decisions and flips to it without prompting. But it’s there. “It’s a two-page, big illustration of this city”, says
Montfort, the MIT professor. “The land of Ultima. As you flip through the book,
even if you’re being very obedient, you can’t help but wonder what this text is.”
…Maps like the ones Chooseco created can reveal the structure of a book that gives readers choices, but though the
multiple story lines are part of what makes the series so fun, they’re not the only thing that defines it. The meat of
“Choose Your Own Adventure” stories is gender-neutral romps in worlds where there are no obviously right or wrong moral
choices. There’s danger around every bend, usually in the form of something like space monkeys, malicious ghosts, or conniving
grown-ups. Even with a map, there’s no way to find out what really comes next without making a choice and flipping to the next page.
[AI game paradigm: highly-complicated simulations but with AI decision support, like providing the top n ranked choices.]
The whole point of “egamebook” is to allow for complex game worlds that are controlled by a series of simple choices. By
simple, I don’t mean “easy” or “without complex consequences”. I just mean they’re not a complicated interface. They’re a
list of 2 to 5 options to choose from.
More generally, I’m interested in systems that present complex simulations in conversational form.
…In games, AI is generally written for the enemies. Some games have allies or wingmen, and those also need AI. In other
words, AI is written for all agents in the game except for the player.
But that’s exactly what you want to change: you want to write your AI in such a way that it can also be applied to the player. Or, more
precisely, to the User Interface (UI).
…This is what I tried to do this past weekend when I entered the Ludum Dare #33 competition. (It’s a challenge to create a game in one weekend from scratch, solo.)
I used the (still very much incomplete) egamebook library as the engine, and my fuzzy logic library as the basis of the AI.
I made a little prototype called Loch Ness.
The game is, of course, very flawed. It does receive quite favorable reviews, but there’s only so much you can do in 2
days, especially if you strive for a strategy game. For me, though, the biggest success is that it only gives you a few
options at a time, and they’re not dumb, and you still play in a sandbox world.
The way I did this was simple, really. I wrote the AI code that scores different strategic moves according to their
immediate desirability. (For example, moving troops from a well-supplied place to a place where they would starve receives
a low score. Attacking an undefended enemy city receives a high score. And so on.) In traditional AI fashion, this code is
then used by the opposing factions by scoring all possible moves and then picking the most desirable ones.
But—since I already have a mechanism to score moves—I can use the same thing for the player. I score all the
possibilities, sort them, then pick the first few and bring them up as options.
This makes sure that you don’t need to pick from 100 or more options, most of which are irrelevant or dumb. But it still
gives you the freedom of a simulated world.
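The original engine is written in Dart; here is a hedged Python translation of the shared-scoring idea (names are illustrative, not egamebook's API):

```python
from dataclasses import dataclass

@dataclass
class Move:
    description: str
    score: float  # immediate desirability, e.g. from fuzzy-logic rules

def best_moves(moves, n):
    """One scoring pipeline serves both the enemy AI and the player's menu."""
    return sorted(moves, key=lambda m: m.score, reverse=True)[:n]

def ai_turn(moves):
    # enemy factions simply execute their top-scoring move
    return best_moves(moves, 1)[0]

def player_options(moves, n=4):
    # the player sees only the top few moves as a short option list
    return [m.description for m in best_moves(moves, n)]
```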