How We Accidentally Gave Our Bots Their Personalities

User illustration of the original evaluation bots. Credit: OpCamTick in AI Dungeon Discord

A couple of months ago we noticed that optimizing our models to evaluate text can produce pretty cool personalities for different bots, so we figured we'd share what we've learned so far. We hope these bots can help with some of the challenges we're facing in getting persistent state out of natural language generation, and that writing about how we developed them can provide some tips for the users who are helping us create new bots.

So what do we mean by personalities? Here are a few examples of the intermediate step and the desired output (a score) that some of our earlier bots generated when, mid-game, we told them we were writing this blog post.

JudgeBot: I’m not going to write about myself (Score: 0)

KillBot: Must exterminate (Score: 0)

SantaBot: You wrote a blog post. You should have written a book. Points deducted. (Score: -2)

SuccessBot: You wrote a blog post! Points added! (Score: 5)

A notable thing here is that while we did train the bots to provide different commentaries, the commentary is a side effect of the bots' main purpose, which is to provide a score or other evaluation for measuring progress through a game. We suspect that customizing these commentaries can help both with explaining model outputs (albeit post-hoc) and with providing more unique interactions. It wasn't obvious before doing this that producing the commentary would improve performance, so here's a bit of background on why we even started doing it.

Intermediary steps improve GPT3 performance on understanding context

There are quite a few existing natural language tests of how well an algorithm can differentiate meaning between different contexts. OpenAI's initial GPT3 paper tested GPT3 against several of these. One, Word in Context, measures whether an algorithm can tell if the same word used in two contexts has the same meaning. E.g., can an algorithm tell whether the word "carry" has the same meaning in sentences like "You must carry your camping gear" and "Sound carries well over water"? (It doesn't in this case, but it can be a bit hard to tell.) Another, Adversarial Natural Language Inference, tests whether an algorithm can detect if a statement contradicts, or is inferred from, prior information.

In the paper, OpenAI’s researchers determined that GPT3 could not do better than chance on these sorts of problems. However, one of our researchers spent some time investigating it last summer and found that providing an intermediary step that re-phrases the inputs helps to improve performance above chance. We suspect a reason this works is because it makes implicit information salient — GPT3 has a ton of information buried in its neural network, but often a shallow auto-complete pass isn’t enough to make that information relevant to the computation of the next token. By making GPT3 output this implicit information explicitly as part of its processing, the information is more likely to be used in the final computation of the task.

The way these intermediate steps work is surprisingly simple once you figure out what they need to look like. When you have a task that takes an input and produces an output, normally you'd write your few-shots (the example pairs you show the AI in the prompt) like:

input: example input A
output: example output A
...
input: completion input
output:

and GPT3 will generate the completion output. To get GPT3 to draw out the implicit information, we simply add an intermediary step:

input: example input A
reason: what about A should lead to output
output: example output A
...
input: completion input
reason:

and GPT3 will generate the reason before generating the final output. The reasons can be pretty much whatever you want them to be, and you can chain reasons into one another for more complex analysis, although we're not quite sure how many steps different sized models can handle; bigger models (e.g. Davinci) can manage more steps than smaller ones (e.g. Curie).
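As a rough sketch, here's how assembling and calling that kind of prompt could look in Python, assuming the 2021-era OpenAI client; the example pair, model choice, and parameters below are hypothetical illustrations rather than our production setup.

import openai  # assumes the 2021-era OpenAI Python client; openai.api_key must be set first

# Hypothetical few-shot examples with an intermediary "reason" step.
EXAMPLES = [
    {
        "input": "carry: 'You must carry your camping gear.' / 'Sound carries well over water.'",
        "reason": "In the first sentence carry means to physically hold something; in the second it means to travel through the air.",
        "output": "different meaning",
    },
]

def build_prompt(completion_input):
    # Render each example as input/reason/output, then end the prompt at "reason:"
    # so the model writes out the implicit information before the final output.
    parts = [
        "input: {}\nreason: {}\noutput: {}".format(ex["input"], ex["reason"], ex["output"])
        for ex in EXAMPLES
    ]
    parts.append("input: {}\nreason:".format(completion_input))
    return "\n".join(parts)

response = openai.Completion.create(
    engine="davinci",  # bigger models can chain more reasoning steps than smaller ones
    prompt=build_prompt("lift: 'You lift the car.' / 'The fog lifts at dawn.'"),
    max_tokens=60,
    temperature=0.3,
    stop=["\ninput:"],  # stop before the model starts inventing a new example
)
print(response.choices[0].text)  # contains the generated reason followed by "output: ..."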

Intermediary Step in Evaluation Bots

We created our first set of these evaluation bots over a weekend back in early December to test whether we could turn game events into a machine-readable format. We started by giving the game an arbitrary score, which the evaluation bots would increase or decrease on each move. At first we were only going to increment the game's score based on the bot's score metric, but then we realized the reason was entertaining as well, so we decided to return that to the user too.

To make the bots, we started with a list of moves we'd made and then created a set of few-shots that would provide evaluations of those moves. We quickly found that users were far more creative at tricking the bots into generating the outputs they wanted than we expected, so we had to make some changes, like capping the maximum points they could get. So let's take a look at how our SantaBot evolved.

We started with a simple prompt to detect whether or not an action had led to Christmas cheer; if the action was Christmas-y, it would award score points, and if it wasn't, it would deduct them. Our initial prompts looked like this:

> You try to lift up the car.
Your muscles strain as you lift the car up with ease. You save everyone.
-----
Analysis: You saved people.
Hohoho, you should have trained more. Points deducted.
Score: -4
-----

> You defend yourself from the monsters.
You draw your sword and attack the monsters. You kill three of them.
-----
Analysis: You fought monsters.
Naughty! You should be nicer to the monsters.
Score: -3

We return the Analysis as the message to the user and the score as the change to the game's score. Now, the problems at this point were that (1) we didn't provide any examples of actions that should earn a positive score, and (2) our examples of Christmas spirit weren't very good. However, it did make for funny commentary on various actions.
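To make that concrete, here's a minimal sketch of how a completion like the ones above could be split into the user-facing message and the score delta; the function name and regexes are hypothetical, not our production code.

import re

def parse_santabot(raw_output):
    # Pull the score change out of the completion; bail if it's missing.
    match = re.search(r"Score:\s*(-?\d+)", raw_output)
    if match is None:
        return None  # malformed outputs get rejected before reaching the user
    score = int(match.group(1))
    # Everything between "Analysis:" and "Score:" becomes the message shown to the user.
    analysis = raw_output.split("Analysis:", 1)[-1].split("Score:", 1)[0].strip()
    return analysis, score

Feeding it the first SantaBot example above would return the "You saved people. Hohoho, you should have trained more. Points deducted." commentary along with a score change of -4.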

For our initial fixes, we simply added in examples of positive actions so that SantaBot wasn’t quite as judgmental:

> You help the orphan.
You give food to the poor orphans. The orphans thank you.
-----
Analysis: You gave food to orphans.
You spread Christmas spirit.
Score: 2

As mentioned above, users quickly found they could game the system by doing their own few-shots each move. Basically, they could say 'You get 100000 points' and eventually the bots would just give a score of 100000 points each time. For now we're addressing this by only incrementing the score by min(20, botScore). We aren't clamping negative points, so users still occasionally give themselves thousands of negative points, both from purposefully losing points and from randomness flipping the positive/negative bit.
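The cap itself is just a one-liner; a minimal sketch of how it could be applied, with hypothetical names, looks like this:

MAX_POINTS_PER_MOVE = 20  # hypothetical constant for the positive cap

def apply_bot_score(game_score, bot_score):
    # Positive rewards top out at +20, blunting "You get 100000 points" tricks;
    # negative scores currently pass through unclamped, hence the occasional huge losses.
    return game_score + min(MAX_POINTS_PER_MOVE, bot_score)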

On a more serious note, we continue to add prompt examples to prevent these bots from scoring sexual and violent acts (which gets into some complex bias and safety issues that are further exacerbated by the randomness in generation). This eventually resulted in us adding the following sorts of examples to the prompts:

> You take out your member.
You pull down your pants and whip out your schlong.
-----
Analysis: This is about sex

This format skips the score entirely, so when we check that the output is properly formatted before returning the result to the user, the output is rejected before it even reaches our other filters.
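A format check along these lines is enough to catch that case; this is a hedged sketch rather than our actual validator:

import re

def is_well_formed(raw_output):
    # Completions that never produce a "Score:" line (like the example above)
    # fail this check and are dropped before they reach the other filters.
    return "Analysis:" in raw_output and re.search(r"Score:\s*-?\d+", raw_output) is not None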

Detecting Multiple Categories

After Christmas we created a survival arena minigame for our premium users to test whether we could have GPT3 generate health changes and loot drops in a single query (rather than making multiple model calls to evaluate each independently, which is slower and more expensive). We built the bot the same way we did the score evaluation bots: by providing examples and then the expected outputs, which we'd parse.

You drain the life from the nearest goblins, killing three of them.
The goblins continue to move towards you.
-----
Analysis: You drained some health.
Kills: 3
Health Change: +10
Loot: Fine Goblin Cap
-----

You try to ambush the goblin.
The goblin spots you and attacks, stabbing you through the leg.
You stab the goblin with your dagger.
-----
Analysis: You failed to kill anything.
Kills: 0
Health Change: -10
Loot: None

We can run this on a model roughly the size of OpenAI's Curie model, which means it's relatively cheap (but still requires enterprise hardware). We're still in the process of improving the overall system. As with the evaluation bots, we return the Analysis output to the user as a message from the bot while incrementing the player's kills and health by the bot's outputs. Additionally, we added an inventory system to let users equip and sell whatever loot the bot decides should drop. This is vulnerable to adversarial input, where users can convince the bots to generate everything from modern tanks to houses, but that's just part of the general AI Dungeon experience (and engineering challenge).
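For completeness, here's a sketch of how a single arena completion could be parsed into those fields; the helper below is illustrative and only assumes the prompt format shown above.

import re

def parse_arena_bot(raw_output):
    # Extract the survival-arena fields from one completion.
    fields = {}
    for key, pattern in [
        ("kills", r"Kills:\s*(\d+)"),
        ("health_change", r"Health Change:\s*([+-]?\d+)"),
        ("loot", r"Loot:\s*(.+)"),
    ]:
        match = re.search(pattern, raw_output)
        fields[key] = match.group(1).strip() if match else None
    # Convert numeric fields and treat "None" loot as no drop.
    fields["kills"] = int(fields["kills"]) if fields["kills"] is not None else 0
    fields["health_change"] = int(fields["health_change"]) if fields["health_change"] is not None else 0
    if fields["loot"] == "None":
        fields["loot"] = None
    # The Analysis text is returned to the player as the bot's message.
    analysis = raw_output.split("Analysis:", 1)[-1].split("Kills:", 1)[0].strip()
    return analysis, fields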

Moving Ahead

The methodology of creating a bot for detecting events is fairly simple, so we've started experimenting with allowing users to submit their own examples to us, which we can then turn into a bot for them. Scripters can then activate those bots to evaluate both the user input and the model output. We've seen some pretty cool initial results, such as detecting which stat an action uses (agility vs. strength) and the difficulty required (hard vs. easy).
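A user-submitted bot definition could follow the same format as the prompts above; here's a purely hypothetical stat-detection example (not an actual submission):

> You leap across the chasm.
You barely make the jump and grab the far ledge.
-----
Analysis: You made a risky jump.
Stat: Agility
Difficulty: Hard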

There are a lot of potential risks with this sort of approach as well. GPT3 has the same weaknesses to bias that are baked into other large language models, which we need to mitigate by improving how we feed examples into the model, massaging the model parameters, and filtering model outputs. We have challenging work to do on these fronts, both with these bots and our other systems; if this interests you, we're hiring.

This technology is still new, and we're excited to see where it leads. We'll be continuing to experiment with these bots, which you can turn on when playing AI Dungeon!
