[–]gwern[S]

Follow-up to Libratus last year, going from 1-on-1 to 6-player. Rapid improvement in capabilities:

In the end, Pluribus learned to apply complex strategies, including bluffing and random behavior, in real time. Then, when playing against human opponents, it would refine these strategies by looking ahead to possible outcomes, as a chess player might. This spring, the researchers tested the system in games in which a single human professional played against five separate instances of Pluribus. In that format, Mr. Elias was unimpressed. “You could find holes in the way it played,” he said; among other bad habits, Pluribus tended to bluff too often. But after taking suggestions from him and other players, the researchers modified and retrained the system. In subsequent games against top professionals, Mr. Elias said, the system seemed to have reached superhuman levels.

The system did not play for real money. But if the chips had been valued at a dollar apiece, Pluribus would have won about $1,000 an hour against its elite opponents. “At this point, you couldn’t find any holes,” Mr. Elias said. All the matches were played online, so the system was not deciphering the emotions or physical “tells” of its human opponents. The success of Pluribus showed that poker can be boiled down to nothing but math, Mr. Elias said: “Pure numbers and percentages. It is solving the game itself.”

Players & incentives:

Each human participant has won more than $1 million playing poker professionally. Performance was measured using the standard metric in this field of AI, milli big blinds per game (mbb/game). This measures how many big blinds (the initial money the second player must put into the pot) were won on average per thousand hands of poker. In all experiments, we used the variance-reduction technique AIVAT (44) to reduce the luck factor in the game (45) and measured statistical significance at the 95% confidence level using a one-tailed t test to determine whether Pluribus is profitable.

The human participants in the 5H+1AI experiment were Jimmy Chou, Seth Davies, Michael Gagliano, Anthony Gregg, Dong Kim, Jason Les, Linus Loeliger, Daniel McAulay, Greg Merson, Nicholas Petrangelo, Sean Ruane, Trevor Savage, and Jacob Toole. In this experiment, 10,000 hands of poker were played over 12 days. Each day, five volunteers from the pool of professionals were selected to participate based on availability. The participants were not told who else was participating in the experiment. Instead, each participant was assigned an alias that remained constant throughout the experiment. The alias of each player in each game was known, so that players could track the tendencies of each player throughout the experiment. $50,000 was divided among the human participants based on their performance to incentivize them to play their best. Each player was guaranteed a minimum of $0.40 per hand for participating, but this could increase to as much as $1.60 per hand based on performance. After applying AIVAT, Pluribus won an average of 48 mbb/game (with a standard error of 25 mbb/game).
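
To make the significance arithmetic concrete: with 10,000 hands, the one-tailed t test is effectively a z test, so the question is whether the measured win rate sits more than ~1.645 standard errors above zero. A minimal sketch with the reported numbers (the normal approximation and the variable names are mine, not the paper's):

```python
from statistics import NormalDist

win_rate = 48.0   # mbb/game after AIVAT adjustment
std_err = 25.0    # standard error, mbb/game

z = win_rate / std_err            # 1.92 standard errors above zero
p = 1 - NormalDist().cdf(z)       # one-tailed p ≈ 0.027 < 0.05

# mbb/game is thousandths of a big blind per hand, so 48 mbb/game
# is the same win rate as 4.8 bb/100 (big blinds per 100 hands).
print(f"z = {z:.2f}, one-tailed p = {p:.3f}, i.e. {win_rate / 10} bb/100")
```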

More on Loeliger:

This experiment was conducted with Ferguson, Elias, and Linus Loeliger. (The experiment involving Loeliger was completed after the final version of the Science paper was submitted.) Loeliger is considered by many to be the best player in the world at six-player no-limit Hold’em cash games. Each human played 5,000 hands of poker with five copies of Pluribus at the table. Pluribus does not adapt its strategy to its opponents, so intentional collusion among the bots was not an issue. In aggregate, the humans lost by 2.3 bb/100. Elias was down 4.0 bb/100 (standard error of 2.2 bb/100), Ferguson was down 2.5 bb/100 (standard error of 2.0 bb/100), and Loeliger was down 0.5 bb/100 (standard error of 1.0 bb/100).

... That took place after the final version of the Science paper was submitted. It would have been nice to include but it takes a while to do those experiments and we didn't feel it was worth delaying the publication process for it.
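
For reference, the 2.3 bb/100 aggregate is just the equal-weight average of the three individual results, since each human played the same 5,000 hands. A quick sketch; the combined-standard-error line assumes the three samples are independent, which the excerpt doesn't state:

```python
import math

# Per-player losses against five copies of Pluribus, as (bb/100, SE).
results = {"Elias": (4.0, 2.2), "Ferguson": (2.5, 2.0), "Loeliger": (0.5, 1.0)}

losses = [loss for loss, _ in results.values()]
aggregate = sum(losses) / len(losses)        # (4.0 + 2.5 + 0.5) / 3 ≈ 2.3

# Assuming independent samples, the SE of the equal-weight mean:
se = math.sqrt(sum(s ** 2 for _, s in results.values())) / len(results)

print(f"aggregate human loss: {aggregate:.1f} bb/100 (SE ≈ {se:.1f})")
```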

An interesting analytic twist is the 'AIVAT' technique mentioned in the paper, used to increase statistical power:

Although poker is a game of skill, there is an extremely large luck component as well. It is common for top professionals to lose money even over the course of 10,000 hands of poker simply because of bad luck. To reduce the role of luck, we used a version of the AIVAT variance reduction algorithm, which applies a baseline estimate of the value of each situation to reduce variance while still keeping the samples unbiased. For example, if the bot is dealt a really strong hand, AIVAT will subtract a baseline value from its winnings to counter the good luck. This adjustment allowed us to achieve statistically significant results with roughly 10x fewer hands than would normally be needed. [that is, equivalent to 100,000?]
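
The underlying idea is a control variate: subtract something with known expectation that is correlated with the luck, and the estimate stays unbiased while the variance drops. A toy illustration of the principle, not the actual AIVAT algorithm (which derives its baselines from the bot's own value estimates at chance and decision nodes):

```python
import random, statistics

random.seed(0)
SKILL_EDGE = 0.5                       # true per-hand win rate to recover

def play_hand():
    luck = random.gauss(0, 10)         # chance component (e.g. the deal)
    winnings = SKILL_EDGE + luck + random.gauss(0, 1)
    # AIVAT would estimate the deal's value; in this toy we see it directly.
    return winnings, luck

hands = [play_hand() for _ in range(10_000)]
raw = [w for w, _ in hands]
adjusted = [w - b for w, b in hands]   # subtract zero-mean baseline: unbiased

print(f"raw:      mean {statistics.mean(raw):+.2f}, stdev {statistics.stdev(raw):.2f}")
print(f"adjusted: mean {statistics.mean(adjusted):+.2f}, stdev {statistics.stdev(adjusted):.2f}")
```

Both estimators recover the same ~0.5 edge, but the adjusted one reaches a given precision with far fewer hands.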

As far as I can tell, Pluribus does not use neural networks, although the same group published a 'deep CFR' last year. This is not discussed in the paper, so it's unclear whether deep CFR just wasn't ready for prime time compared to further tuning of the Libratus approach, or whether the extra overhead of deep CFR eliminates any advantage over a more straightforward tree search.
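
For context, the CFR (counterfactual regret minimization) algorithms this line of work is built on use regret matching at their core: accumulate regret for each action at a decision point, play in proportion to positive regret, and the average strategy converges to equilibrium. A minimal self-play sketch on rock-paper-scissors, illustrating the building block rather than Pluribus's actual Monte Carlo CFR over abstracted poker states:

```python
import random

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # payoff[my_action][opp_action]
N = 3                                          # rock, paper, scissors

def regret_matching(regrets):
    """Play each action in proportion to its positive regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1 / N] * N

def train(iters=100_000):
    regrets = [[0.0] * N, [0.0] * N]
    strat_sum = [[0.0] * N, [0.0] * N]
    for _ in range(iters):
        strats = [regret_matching(r) for r in regrets]
        acts = [random.choices(range(N), weights=s)[0] for s in strats]
        for p in (0, 1):
            opp = acts[1 - p]
            got = PAYOFF[acts[p]][opp]
            for a in range(N):
                regrets[p][a] += PAYOFF[a][opp] - got   # regret vs. action a
                strat_sum[p][a] += strats[p][a]
    # It is the *average* strategy that converges to Nash (uniform here).
    return [[s / iters for s in sums] for sums in strat_sum]

print(train())   # both players approach [0.333, 0.333, 0.333]
```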


Author Noam Brown was answering questions on HN:

Why has poker taken so much longer than chess/Go/Dota2/SC2?

I think it took the community a while to come up with the right algorithms. So much of early AI research was focused on beating humans at chess and later Go. But those techniques don't directly carry over to an imperfect-information game like poker. The challenge of hidden information was kind of neglected by the AI community. This line of research really has its origins in the game theory community actually (which is why the notation is completely different from reinforcement learning).

Fortunately, these techniques now work really really well for poker. It's now quite inexpensive to make a superhuman poker bot.

Logistics:

Also, as you add more players it becomes harder and harder to evaluate because the bot's involved in fewer hands, we need to have more pros at the table, and we need to coordinate more schedules. Six was logistically pretty tough already.... It was online. The players were playing from home on their own schedules. The bot did not look at any tells (timing tells or otherwise). The players knew they were playing a bot and knew which player the bot was.

The professionals didn't spot any holes to exploit during their 10k hands:

In the paper we include a graph of performance over the course of the 10,000-hand 5 humans + 1 AI experiment that was played over 12 days. There's no indication that the bot's performance decreased over time (there is a temporary downward blip in the middle, but that's likely just variance). Based on discussions with pros, it sounds like they didn't find any weaknesses and they didn't seem to think they'd find any given more time.

Search is the key:

I think the key is that the search algorithm is picking up so much of the slack that we don't really need to train an amazing precomputed strategy. If we weren't using search, it would probably be infeasible to generate a strong 6-player poker AI. Search was also critical for previous AI benchmark victories like chess and Go...There were several improvements but the most important was the depth-limited search. Libratus would always search to the end of the game. But that's not necessarily feasible in a game as complex as six-player poker. With these new algorithms, we don't need to go to the end of the game. Instead, we can stop at some arbitrary depth limit (as is done in chess and Go AI's). That drastically reduces the amount of compute needed.
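
The perfect-information version of that idea is simple to state in code: recurse until the depth limit, then substitute a value estimate for the rest of the game. A sketch with hypothetical `evaluate`, `legal_moves`, and `apply_move` hooks (Pluribus's imperfect-information variant is more involved, evaluating leaves against multiple possible continuation strategies rather than a single static value):

```python
def depth_limited_value(state, depth, evaluate, legal_moves, apply_move):
    """Negamax with a depth limit, for alternating two-player zero-sum games.

    The three hooks are game-specific stand-ins: a heuristic value from the
    current player's point of view, the legal actions, and the successor state.
    """
    moves = legal_moves(state)
    if depth == 0 or not moves:
        # Stop here and trust the estimate instead of searching to the
        # end of the game; this is what drastically cuts the compute.
        return evaluate(state)
    return max(
        -depth_limited_value(apply_move(state, m), depth - 1,
                             evaluate, legal_moves, apply_move)
        for m in moves
    )
```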

On not releasing source:

I don't think the poker world would be happy with us if we did that. Heads-up limit hold'em isn't really played professionally anymore, but six-player no-limit hold'em is very popular.

Odd betting behavior:

We talk about this a bit in the paper. Based on the feedback from the pros, the bot seems to "donk bet" (call and then bet on the next round) much more than human pros do. It also randomizes between multiple bet sizes, including very large bet sizes, while humans stick to just one or two sizes depending on the situation.
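
Mechanically, randomizing over bet sizes just means the strategy at a decision point is a probability distribution over sizing actions rather than a single size. A toy sketch; the sizes and weights below are invented for illustration, not Pluribus's actual distribution:

```python
import random

# Hypothetical mixed strategy over bet sizes, as fractions of the pot.
BET_SIZE_MIX = {0.5: 0.40, 1.0: 0.35, 2.0: 0.20, 5.0: 0.05}  # note the overbet

def sample_bet(pot):
    sizes, weights = zip(*BET_SIZE_MIX.items())
    return pot * random.choices(sizes, weights=weights)[0]

print([round(sample_bet(100)) for _ in range(8)])  # e.g. [50, 100, 50, 200, ...]
```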

Like DRL, hard to know if it's working:

Honestly, probably debugging. Training this thing is very cheap, but the variance in poker is huge (even with the best variance-reduction techniques) so it takes a very long time to tell whether one version is better than another version (or better than a human).
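
A back-of-the-envelope calculation shows why evaluation dominates: to resolve a 5 bb/100 difference at the one-tailed 95% level, with a per-hand standard deviation around 10 bb (a commonly cited ballpark for no-limit hold'em, not a figure from this thread), you need on the order of a hundred thousand hands:

```python
from math import ceil

stdev_per_hand = 10.0        # bb; ~100 bb/100 is a common NLHE ballpark (assumed)
edge_per_hand = 5.0 / 100    # bb; the 5 bb/100 difference we want to detect
z = 1.645                    # one-tailed 95%

# Require stdev / sqrt(n) <= edge / z  =>  n >= (z * stdev / edge)^2
n = ceil((z * stdev_per_hand / edge_per_hand) ** 2)
print(f"≈ {n:,} hands")      # ≈ 108,241 hands
```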

[–]gwern[S]

Links:

  • Facebook release announcement: "Facebook, Carnegie Mellon build first AI that beats pros in 6-player poker"; poker professional quotes:

    • “The most stimulating thing about playing against Pluribus was responding to its complex preflop strategies. Unlike humans, Pluribus used multiple raise sizes preflop. Attempting to respond to nonlinear open ranges was a fun challenge that differs from human games.” —Seth Davies, professional poker player
    • “I was really excited to get to play against the bot and saw it as a unique learning experience. I thought the bot played a very solid, fundamentally sound game. It did a very good job of putting me to tough decisions when I didn’t have a strong hand and getting value when it had the best hand. It was a fun challenge and I’d enjoy the opportunity to compete with it again.” —Trevor Savage, professional poker player
    • “Pluribus is a very hard opponent to play against. It’s really hard to pin him down on any kind of hand. He’s also very good at making thin value bets on the river. He’s very good at extracting value out of his good hands.” —Chris Ferguson, WSOP champion
    • “It is an absolute monster bluffer. I would say it’s a much more efficient bluffer than most humans. And that’s what makes it so difficult to play against. You’re always in a situation with a ton of pressure that the AI is putting on you and you know it’s very likely it could be bluffing here.” —Jason Les, professional poker player
    • “Whenever playing the bot, I feel like I pick up something new to incorporate into my game. As humans I think we tend to oversimplify the game for ourselves, making strategies easier to adopt and remember. The bot doesn’t take any of these shortcuts and has an immensely complicated/balanced game tree for every decision.” —Jimmy Chou, professional poker player
    • “It was incredibly fascinating getting to play against the poker bot and seeing some of the strategies it chose. There were several plays that humans simply are not making at all, especially relating to its bet sizing. Bots/AI are an important part in the evolution of poker and it was amazing to have firsthand experience in this large step toward the future.” —Michael Gagliano, professional poker player
  • Discussion: Reddit, HN, LW

  • Media: Wired, NYT, Nature

[–]djangoblaster2

The professionals didn't spot any holes to exploit during their 10k hands:

Last week we spoke about this with one of the professional players who participated, for the better part of an hour via remote video chat, at a local data science event in Vancouver: https://www.meetup.com/LearnDataScience/events/psltrqyzkbpc/

He did not agree with this: he didn't feel the bot was that strong, and he felt it definitely had holes. It was definitely not a Lee Sedol / AlphaGo moment.

[–]gwern[S]

They didn't exploit it, though, and the other pros appear not to agree; see Savage's comments above, for example.