The Kelly Coin-Flipping Game: Exact Solutions via Decision Trees

Decision-theoretic analysis of how to optimally play Haghani & Dewey 2016's 300-round double-or-nothing coin-flipping game with an edge and ceiling better than using the Kelly Criterion. Computing and following an exact decision tree increases earnings by $6.6 over a modified KC. (statistics, decision theory, Haskell) created: 19 Jan 2017; modified: 12 Apr 2017; status: finished; confidence: likely; Haghani & Dewey 2016 experiment with a double-or-nothing coin-flipping game where the player starts with$25 and has an edge of 60%, and can play 300 times, choosing how much to bet each time, winning up to a maximum ceiling of $250. Most of their subjects fail to play well, earning an average$91, compared to Haghani & Dewey 2016’s heuristic benchmark of ~$240 in winnings achievable using a modified Kelly Criterion as their strategy. The KC, however, is not optimal for this problem as it ignores the ceiling and limited number of plays. We solve the problem of the value of optimal play exactly by using decision trees & dynamic programming in R, Haskell, and C, and provide Python implementations for the OpenAI Gym collection. We find that optimal play yields$246.61 on average (rather than ~$240), and so the human players actually earned only 36.8% of what was possible, losing$155.6. The relative advantage of the decision tree strategy depends on the number of bets: it is highest when the player can make few bets (at b=23, with a difference of ~$36), and decreases with number of bets as more strategies hit the ceiling. In the variant of this game where subjects are not told the exact edge of 60%, a Bayesian decision tree approach shows that performance can closely approach that of the decision tree, with a penalty for 1 plausible prior of only$1.

Background

Set up

The paper Rational Decision-Making Under Uncertainty: Observed Betting Patterns on a Biased Coin, by Haghani & Dewey 2016 runs an economics/psychology experiment on optimal betting in a simple coin-flipping game:

What would you do if you were invited to play a game where you were given $25 and allowed to place bets for 30 minutes on a coin that you were told was biased to come up heads 60% of the time? This is exactly what we did, gathering 61 young, quantitatively trained men and women to play this game. The results, in a nutshell, were that the majority of these 61 players didn’t place their bets very well, displaying a broad panoply of behavioral and cognitive biases. About 30% of the subjects actually went bust, losing their full$25 stake. We also discuss optimal betting strategies, valuation of the opportunity to play the game and its similarities to investing in the stock market. The main implication of our study is that people need to be better educated and trained in how to approach decision making under uncertainty. If these quantitatively trained players, playing the simplest game we can think of involving uncertainty and favourable odds, didn’t play well, what hope is there for the rest of us when it comes to playing the biggest and most important game of all: investing our savings? In the words of Ed Thorp, who gave us helpful feedback on our research: This is a great experiment for many reasons. It ought to become part of the basic education of anyone interested in finance or gambling.

More specifically:

…Prior to starting the game, participants read a detailed description of the game, which included a clear statement, in bold, indicating that the simulated coin had a 60% chance of coming up heads and a 40% chance of coming up tails. Participants were given $25 of starting capital and it was explained in text and verbally that they would be paid, by check, the amount of their ending balance subject to a maximum payout. The maximum payout would be revealed if and when subjects placed a bet that if successful would make their balance greater than or equal to the cap. We set the cap at$250…Participants were told that they could play the game for 30 minutes, and if they accepted the $25 stake, they had to remain in the room for that amount of time.5 Participants could place a wager of any amount in their account, in increments of$0.01, and they could bet on heads or tails…Assuming a player with agile fingers can put down a bet every 6 seconds, that would allow 300 bets in the 30 minutes of play.

Near-optimal play

The authors make a specific suggestion about what near-optimal play in this game would be, based on the Kelly criterion which would yield bets each round of 20% of capital:

The basic idea of the Kelly formula is that a player who wants to maximize the rate of growth of his wealth should bet a constant fraction of his wealth on each flip of the coin, defined by the function $(2 \cdot p) - 1$, where $p$ is the probability of winning. The formula implicitly assumes the gambler has log utility. It’s intuitive that there should be an optimal fraction to bet; if the player bets a very high fraction, he risks losing so much money on a bad run that he would not be able to recover, and if he bet too little, he would not be making the most of what is a finite opportunity to place bets at favorable odds…We present the Kelly criterion as a useful heuristic a subject could gainfully employ. It may not be the optimal approach for playing the game we presented for several reasons. The Kelly criterion is consistent with the bettor having log-utility of wealth, which is a more tolerant level of risk aversion than most people exhibit. On the other hand, the subjects of our experiment likely did not view $25 (or even$250) as the totality of their capital, and so they ought to be less risk averse in their approach to maximizing their harvest from the game. The fact that there is some cap on the amount the subject can win should also modify the optimal strategy…In our game, the Kelly criterion would tell the subject to bet 20% ($(2 \cdot 0.6) - 1$) of his account on heads on each flip. So, the first bet would be $5 (20% of$25) on heads, and if he won, then he’d bet $6 on heads (20% of$30), but if he lost, he’d bet $4 on heads (20% of$20), and so on.

…If the subject rightly assumed we wouldn’t be offering a cap of more than $1,000 per player, then a reasonable heuristic would be to bet a constant proportion of one’s bank using a fraction less than the Kelly criterion, and if and when the cap is discovered, reducing the betting fraction further depending on betting time remaining to glide in safely to the maximum payout. For example, betting 10% or 15% of one’s account may have been a sound starting strategy. We ran simulations on the probability of hitting the cap if the subject bet a fixed proportion of wealth of 10%, 15% and 20%, and stopping when the cap was exceeded with a successful bet. We found there to be a 95% probability that the subjects would reach the$250 cap following any of those constant proportion betting strategies, and so the expected value of the game as it was presented (with the $250 cap) would be just under$240. However, if they bet 5% or 40% of their bank on each flip, the probability of exceeding the cap goes down to about 70%.

Subjects’ performance

Despite the Kelly criterion being well-known and fairly intuitive, and the game being very generous, participants did not perform well:

The sample was largely comprised of college age students in economics and finance and young professionals at finance firms. We had 14 analyst and associate level employees at two leading asset management firms. The sample consisted of 49 males and 12 females. Our prior was that these participants should have been well prepared to play a simple game with a defined positive expected value…Only 21% of participants reached the maximum payout of $250,7 well below the 95% that should have reached it given a simple constant percentage betting strategy of anywhere from 10% to 20%.8 We were surprised that one third of the participants wound up with less money in their account than they started with. More astounding still is the fact that 28% of participants went bust and received no payout. That a game of flipping coins with an ex-ante 60/40 winning probability produced so many subjects that lost everything is startling. The average ending bankroll of those who did not reach the maximum and who also did not go bust, which represented 51% of the sample, was$75. While this was a tripling of their initial $25 stake, it still represents a very sub-optimal outcome given the opportunity presented. The average payout across all subjects was$91, letting the authors off the hook relative to the $250 per person they’d have had to pay out had all the subjects played well. This is troubling because the problem is so well-defined and favorable to the players, and can be seen as a microcosm of the difficulties people experience in rational betting. (While it’s true that human subjects typically perform badly initially in games like the iterated prisoner’s dilemma and need time to learn, it’s also true that humans only have one life to learn stock market investment during, and these subjects all should’ve been well-prepared to play.) Instead of expected earnings of ~$240, the players earned $91 - forfeiting$149. However, if anything, the authors understate the underperformance, because as they correctly note, the Kelly criterion is not guaranteed to be optimal in this problem due to the potential for different utility functions (what if we simply want to maximize expected wealth, not log wealth?), the fixed number of bets & the ceiling, as the Kelly criterion tends to assume that wealth can increase without limit & there is an indefinite time horizon.

Optimality in the coin-flipping MDP

Indeed, we can see with a simple example that KC is suboptimal in terms of maximizing expected value: what if we are given only 1 bet (b=1) to use our $25 on? If we bet 20% (or less) per the KC, then $0.6 \cdot (25+5) + 0.4 \cdot (25-5)= 26$ But if we bet everything: $0.6 \cdot (25+25) + 0.4 \cdot (25-25) = 30$ It’s true that 40% of the time, we go bankrupt and so we couldn’t play again… but there are no more plays in b=1 so avoiding bankruptcy boots nothing. We can treat this coin-flipping game as a tree-structured Markov decision process. For more possible bets, the value of a bet of a particular amount given a wealth w and bets remaining b-1 will recursively depend on the best strategy for the two possible outcomes (weighted by probability), giving us a Bellman value equation to solve like: $V(w,b) = \max_{x \in [0,w]}[0.6 \cdot V(\min(w+x, 250), b-1) + 0.4 \cdot V(w-x, b-1) ]$ To solve this equation, we can explore all possible sequences of bets and outcomes to a termination condition and reward, and then work use backwards induction, defining a decision tree which can be (reasonably) efficiently computed using memoization/dynamic programming. Given the problem setup, we can note a few things about the optimal strategy: 1. if the wealth ever reaches$0, the game has effectively ended regardless of how many bets remain, because betting $0 is the only possibility and it always returns$0
2. similarly, if the wealth ever reaches the upper bound of $250, the optimal strategy will effectively end the game by always betting$0 after that regardless of how many bets remain, since it can’t do better than $250 and can only do worse. These two shortcuts will make the tree much easier to evaluate because many possible sequences of bet amounts & outcomes will quickly hit$0 or $250 and require no further exploration. 3. a state with more bets is always of equal or better value than fewer 4. a state with more wealth is always equal or better value than less 5. the value of 0 bets is the current wealth 6. the value of 1 bet depends on the ceiling and current wealth: whether$250 is >2x current wealth. If the ceiling more than twice current wealth, the optimal strategy with 1 bet left is to bet everything, since that has highest EV and there’s no more bets to worry about going bankrupt & missing out on.

Implementation of Game

A Python implementation is available in OpenAI Gym:

import gym
env = gym.make('KellyCoinflip-v0')
env.reset()
env._step(env.wealth*100*0.2) # bet with 20% KC
# ...
env._reset() # end of game, start a new one, etc

Decision tree

We can write down the value function as a mutual recursion: f calculates the expected value of the current step, and calls V to estimate the value of future steps; V checks for the termination conditions w=$0/$250 & b=0 returning current wealth as the final payoff, and if not, calls f on every possible action to estimate their value and returns the max… This mutual recursion bottoms out at the termination conditions. As defined, this is grossly inefficient as every node in the decision tree will be recalculated many times despite yielding identical deterministic, referentially transparent results. So we need to memoize results if we want to evaluate much beyond b=5.

Approximate Value function

Implemented in R:

# devtools::install_github("hadley/memoise")
library(memoise)

f <- function(x, w, b) { 0.6*mV(min(w+x,250), b-1) + 0.4*mV(w-x, b-1)}
mf <- memoise(f)
V <- function(w,b) {
returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { mf(x,w,b) }) } else { w }
max(returns) }
mV <- memoise(V)
## sanity checks:
mV(25, 0) == 25 && mV(25, 1) == 30 && mV(0, 300) == 0

Given our memoized value function mV we can now take a look at expected value of optimal betting for bets b from 0 to 300 with the fixed starting payroll of $25. Memoization is no panacea, and R/memoise is slow enough that I am forced to make one change to the betting: instead of allowing bets in penny/$0.01 increments, I approximate by only allowing bets in $1 increments, to reduce the branching factor to a maximum of 250 possible choices each round rather than 25000 choices. (Some comparisons with decipenny and penny trees suggests that this makes little difference since early on the bet amounts are identical and later on being able to make <$1 adjustments to bets yield small gains relative to total wealth; see the Haskell implementation.)

Of course, even with memoization and reducing the branching factor, it’s still difficult to compute the value of the optimal strategy all the way to b=300 because the exponentially increasing number of strategies still need to be computed at least once. With this R implementation, trees for up to b=150 can be computed with 4GB RAM & ~16h.

The value function for increasing bets:

vs <- sapply(1:150, function(b) { round(mV(25, b), digits=1) }); vs
#   [1]  30.0  36.0  43.2  45.4  48.0  53.7  54.6  57.6  61.8  62.6  66.8  68.2  70.6  74.1  75.0  79.1  79.7  82.5  84.7  86.2  89.8  90.3  93.8
#  [24]  94.5  97.0  98.8 100.3 103.2 103.9 107.2 107.6 110.3 111.4 113.3 115.2 116.4 119.0 119.6 122.4 122.9 125.3 126.2 128.1 129.5 130.8 132.8
#  [47] 133.6 136.0 136.5 138.8 139.4 141.4 142.3 143.8 145.2 146.3 148.0 148.8 150.6 151.3 153.1 153.8 155.5 156.3 157.7 158.8 159.9 161.2 162.1
#  [70] 163.6 164.2 165.8 166.4 167.9 168.5 169.9 170.7 171.8 172.8 173.7 174.8 175.6 176.8 177.5 178.7 179.3 180.5 181.1 182.3 182.9 184.0 184.7
#  [93] 185.6 186.4 187.2 188.1 188.8 189.8 190.3 191.3 191.9 192.8 193.4 194.3 194.9 195.7 196.3 197.1 197.8 198.4 199.1 199.7 200.5 201.0 201.8
# [116] 202.3 203.0 203.5 204.3 204.8 205.4 206.0 206.6 207.1 207.7 208.3 208.8 209.4 209.9 210.5 210.9 211.5 211.9 212.5 213.0 213.5 213.9 214.5
# [139] 214.9 215.4 215.8 216.3 216.8 217.2 217.6 218.0 218.5 218.9 219.3 219.7

So as the number of bets escalates, our expected payoff increases fairly quickly and we can get very close to the ceiling of $250 with canny betting. Monte Carlo tree evaluation If we do not have enough RAM to expand the full decision tree to its terminal nodes, there are many ways to approximate it. A simple one is to expand the tree as many levels far down as possible, and then if a terminal node is reached, do exact backwards induction, otherwise, at each pseudo-terminal node, approximate the value function somehow and then do backwards induction as usual. The deeper the depth, the closer the approximation becomes to the exact value, while still doing optimal planning within the horizon. One way to approximate it would be to run a large number of simulations (perhaps 100) taking random actions until a terminal node is hit, and take the mean of the total values as the estimate. This would be a Monte Carlo tree evaluation. (This forms the conceptual basis of the famous MCTS/Monte Carlo tree search.) A random policy is handy because it can be used anywhere, but here we already know a good heuristic which does better than random: the KC. So we can use that instead. This gives us as much optimality as we can afford. Setting up a simulation of the coin-flipping game which can be played with various strategies: game <- function(strategy, wealth, betsLeft) { if (betsLeft>0) { bet <- strategy(wealth, betsLeft) wealth <- wealth - bet flip <- rbinom(1,1,p=0.6) winnings <- 2*bet*flip wealth <- min(wealth+winnings, 250) return(game(strategy, wealth, betsLeft-1)) } else { return(wealth); } } simulateGame <- function(s, w=25, b=300, iters=100) { mean(replicate(iters, game(s, w, b))) } kelly <- function(w, b) { 0.20 * w } smarterKelly <- function(w, b) { if(w==250) {0} else { (2*0.6-1) * w } } random <- function(w, b) { sample(seq(0, w, by=0.1), 1) } evaluate <- function(w,b) { simulateGame(smarterKelly, w, b) } mevaluate <- memoise(evaluate) mevaluate(25, 10) # [1] 35.30544906 mevaluate(25, 100) # [1] 159.385275 mevaluate(25, 300) # [1] 231.4763619 As expected the KC can do well and is just as fast to compute as a random action, so using it will give much better estimates of the value function for free. With the Monte Carlo value function set up, the original value function can be slightly modified to include a maximum depth parameter and to evaluate using the MC value function instead once that maximum depth is hit: f <- function(x, w, b, maxDepth) { 0.6*mV(min(w+x,250), b-1, maxDepth) + 0.4*mV(w-x, b-1, maxDepth)} mf <- memoise(f) V <- function(w,b, maxDepth) { if (b<=(b-maxDepth)) { mevaluate(w,b) } else { returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { mf(x,w,b, maxDepth) }) } else { w } max(returns) }} mV <- memoise(V) mV(25, 300, 100); mV(25, 300, 50) Optimal next action One is also curious what the optimal strategy is, not just how much we can make with the optimal strategy given a particular starting point. As defined, I can’t see how to make mV return the optimal choices along the way, since it has to explore all choices bottom-up and can’t know what is optimal until it has popped all the way back to the root, and state can’t be threaded in because that defeats the memoization. But then I remembered the point of computing value functions: given the (memoized) value function for each state, we can then plan by simply using V in a greedy planning algorithm by asking, at each step, for the value of each possible action (which is memoized, so it’s fast), and returning/choosing the action with the maximum value (which takes into account all the downstream effects, including our use of V at each future choice): VPplan <- function(w, b) { if (b==0) { return (0); } else { returns <- sapply(seq(0, w), function(wp) { mf(wp, w, b); }) return (which.max(returns)-1); } } mVPplan <- memoise(VPplan) ## sanity checks: mVPplan(250, 0) # [1] 0 mVPplan(250, 1) # [1] 0 mVPplan(25, 3) # [1] 25 It’s interesting that when b is very small, we want to bet everything on our first bet, because there’s not enough time to bother with recovering from losses; but as we increase b, the first bet will shrink & become more conservative: firstAction <- sapply(1:150, function(b) { mVPplan(25, b) }); firstAction # [1] 25 25 25 10 10 18 17 18 14 14 14 9 10 11 7 11 7 10 10 10 9 6 9 7 9 8 9 8 7 8 7 8 7 8 7 8 7 7 7 7 7 6 7 6 7 6 7 # [48] 6 6 6 6 6 6 6 6 6 6 6 6 5 6 5 6 5 6 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 5 4 5 4 5 4 # [95] 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 # [142] 4 3 4 3 4 3 3 3 3 Optimizing The foregoing is interesting but the R implementation is too slow to examine the case of most importance: b=300. The issue is not so much the intrinsic difficulty of the problem - since the R implementation’s RAM usage is moderate even at b=150, indicating that the boundary conditions do indeed tame the exponential growth & turn it into something more like quadratic growth - but the slowness of computing. Profiling suggests that most of the time is spent inside memoise: # redefine to clear out any existing caching: forget(mV) Rprof(memory.profiling=TRUE, gc.profiling=TRUE, line.profiling=TRUE) mV(w=25, b=9) Rprof(NULL) summaryRprof() Most of the time is spent in functions which appear nowhere in our R code, like lapply or deparse, which points to the behind-the-scenes memoise. Memoization should be fast because the decision tree can be represented as nothing but a multidimensional array in which one does lookups for each combination of w/b for a stored value V, and array lookups ought to be near-instantaneous. But apparently there is enough overhead to dominate our decision tree’s computation. R does not provide hash tables, and there don’t seem to be any good libraries for this; what one can do is use R’s environments, which use hash tables under the hood, but are restricted to string keys (yes, really) as a poor man’s hash table by using digest to serialize & hash R objects into string keys/ The backwards induction remains the same and the hashtable can be updated incrementally without any problem (useful for switching to alternative methods like MCTS), so the rewrite is easy: ## crude hash table library(digest) d <- function (o) { digest(o, algo="xxhash32") } lookup <- function(key) { get(d(key), tree) } set <- function(key,value) { assign(envir=tree, d(key), value) } isSet <- function(key) { exists(d(key), envir=tree) } ## example: tree <- new.env(size=1000000, parent=emptyenv()) set(c(25,300), 50); lookup(c(25, 300)) # [1] 50 tree <- new.env(size=1000000, parent=emptyenv()) f <- function(x, w, b) { 0.6*V(min(w+x,250), b-1) + 0.4*V(w-x, b-1) } V <- function(w, b) { if (isSet(c(w,b))) { return(lookup(c(w,b))) } else { returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b) }) } else { return(w) } set(c(w,b), max(returns)) return(lookup(c(w,b))) } } V(25,5) While the overhead is not as bad as memoise, performance is still not great. Python Unmemoized version (a little different due to floating point): import numpy as np # for enumerating bet amounts as a float range def F(w, x, b): return 0.6*V(min(w+x,250), b-1) + 0.4*V(w-x, b-1) def V(w, b): if b>0 and w>0 and w<250: bets = np.arange(0, w, 0.1).tolist() returns = [F(w,x,b) for x in bets] else: returns = [w] return max(returns) V(25,0) V(25,0) # 25 V(25,1) # 29.98 V(25,2) # 35.967999999999996 V(25,3) Python has hashtables built in as dicts, so one can use those: import numpy as np def F(w, x, b, dt): return 0.6*V(min(w+x,250), b-1, dt) + 0.4*V(w-x, b-1, dt) def V(w, b, dt): if b>0 and w>0 and w<250: if (w,b) in dict: return dict[(w,b)] else: bets = np.arange(0, w, 0.1).tolist() returns = [F(w,x,b,dt) for x in bets] else: returns = [w] best = max(returns) dict[(w,b)] = best return best dict = {} V(25, 0, dict) # 25 Haskell We can switch to a compiled language with fast arrays. Implementations in Haskell have been written by Gurkenglas & nshepperd using array-memoize & Data.Vector (respectively): import System.Environment (getArgs) import Data.Function (iterate) import Data.Function.ArrayMemoize (arrayMemo) type Wealth = Int type EV = Double cap :: Wealth cap = 250 value :: [Wealth -> EV] value = iterate f fromIntegral where f next = arrayMemo (0, cap)$ \x -> maximum [go w | w <- [0..min (cap - x) x]] where
go w = 0.6 * next (min cap (x+w)) + 0.4 * next (x-w)

main :: IO ()
main = do [x, b] <- map read <$> getArgs print$ value !! b $x The second one is more low-level and uses laziness/tying the knot to implement the memoization; it is the fastest and is able to evaluate memo 25 300 in <12s: the value turns out to be$246. (If we assume that all games either end in $250 or$0, then this implies that $\frac{246}{250} = 0.984$ or >98.4% of decision-tree games hit the max, as compared to Haghani & Dewey’s estimate that ~95% of KC players would.) The memoization is done internally, so to access all values we do more logic within the lexical scope.

import System.Environment (getArgs)
import Data.Function (fix)
import Data.Function.Memoize (memoFix2)
import Data.Vector (Vector)
import qualified Data.Vector as V (generate, (!))
import qualified Data.MemoUgly as U (memo)

type Wealth = Int
type Bets = Int
type EV = Double

cap :: Wealth
cap = 250

value :: (Wealth -> Bets -> EV) -> Wealth -> Bets -> EV
value next 0 b = 0
value next w 0 = fromIntegral w
value next w b = maximum [go x | x <- [0..min (cap - w) w]]
where
go x = 0.6 * next (min cap (w+x)) (b-1) + 0.4 * next (w-x) (b-1)

-- There are 4 possible ways aside from 'array-memoize':
-- 1. Direct recursion.
direct :: Wealth -> Bets -> EV
direct = fix value

-- 2. Memoised recursion.
memo :: Wealth -> Bets -> EV
memo w b = cached_value w b
where
cached_value w b = table V.! b V.! w
table :: Vector (Vector EV)
table = V.generate (b + 1)
(\b -> V.generate (cap + 1)
(\w -> value cached_value w b))

-- 3. Memoised recursion using 'memoize' library; slower.
memo2 :: Wealth -> Bets -> EV
memo2 = memoFix2 value

-- 4. Memoize using 'memoize-ugly' library; also slower but global
memo3 :: Wealth -> Bets -> EV
memo3 = U.memo  . direct

main :: IO ()
echo "$C" && (for i in {1..10}; do /usr/bin/time -f '%e' ./coingame; done ); done For levels of parallelism on my laptop from 1-8 threads, the average times are: 28.98s/15.44s/10.70s/9.22s/9.10s/8.4s/7.66s/7.20s. Caio Oliveira provides a parallel C++ implementation using std::async; compiled with the same optimizations on (g++ -Ofast -fwhole-program -march=native -pthread --std=c++11 kc.cpp -o kc), it runs in ~1m16s: #include <iostream> #include <vector> #include <map> #include <set> #include <bitset> #include <list> #include <stack> #include <queue> #include <deque> #include <string> #include <sstream> #include <algorithm> #include <cstdio> #include <cstring> #include <future> #include <ctime> using namespace std; const int MAX_ROUND = 300; const int MAX_WEALTH = 25000; const int BETS_DELTA = 1; double memo[MAX_WEALTH + 1][2]; inline double vWin(int wealth, int bet, int roundIdx) { if(wealth + bet < MAX_WEALTH) { return memo[wealth + bet][!roundIdx]; } else { return MAX_WEALTH; } } inline double vLose(int wealth, int bet, int roundIdx) { if (wealth - bet == 0) { return 0; } else { return memo[wealth - bet][!roundIdx]; } } inline double v(int wealth, int bet, int roundIdx) { return .6 * vWin(wealth, bet, roundIdx) + .4 * vLose(wealth, bet, roundIdx); } template<typename RAIter> void calc_round(RAIter beg, RAIter end, int roundIdx){ int len = distance(beg, end); if(len < 1000) { for(RAIter p = beg; p != end; ++p) { int wealth = distance(memo, p); (*p)[roundIdx] = v(wealth, wealth, roundIdx); for (int bet = 0; bet < wealth; bet += BETS_DELTA) { memo[wealth][roundIdx] = max(memo[wealth][roundIdx], v(wealth, bet, roundIdx)); } } } else { RAIter mid = beg + len/2; future<void> handle = async(launch::async, calc_round<RAIter>, mid, end, roundIdx); calc_round(beg, mid, roundIdx); } } double calc_table() { bool roundIdx = 0; for(int i = 0; i <= MAX_WEALTH ; ++i) { memo[i][!roundIdx] = i; } for(int round = MAX_ROUND - 1; round >= 0; --round) { calc_round(memo, memo + MAX_WEALTH, roundIdx); roundIdx ^= 1; } return memo[2500][!roundIdx] / 100.; } int main() { ios_base::sync_with_stdio(0); cin.tie(0); clock_t begin = clock(); double res = calc_table(); clock_t end = clock(); cout << "result: " << res << "(elapsed: " << (double(end - begin) / CLOCKS_PER_SEC) << ")" << endl; } Given the locality of the access patterns, I have to wonder how well a GPU implementation might perform. It doesn’t seem amenable to being expressed as a large matrix multiplication, but each GPU core could be responsible for evaluating a terminal node and do nearby lookups from there. (And the use in finance of GPUs to compute the lattice model/binomial options pricing model/trinomial tree, which are similar, suggests it should be possible.) Exact formula Arthur B. notes that you can compute $U(w,b)$ in $O(b)$ if you allow bets to be any fraction of w. It’s piecewise linear, the cutoff points and slopes follow a very predictable pattern1 and provides a formula for calculating the value function without constructing a decision tree: $\left(\frac{6}{5}\right)^b \left(w - \frac{1}{3}\sum_{k=0}^{b-1}\left(\frac{2}{3}\right)^k\max\!\left(w - \frac{250 }{2^b}\sum_{j=0}^{k}\binom{b}{j},~0\right)\right)$ A short version in R: V <- function(w, b, m=250) { j <- qbinom(w/m, b, 0.5); 1.2^b * 1.5^-j * (w + (m/2 * sum(1.5^(j:0) * pbinom(0:j-1,b,1/2)))) } (This is sufficiently fast for all values tested that it is not worth memoizing using the relatively slow memoise.) Re-implementing in Python is a little tricky because NumPy/SciPy use different names and Python doesn’t support R’s range notation, but the following seems like it works: import scipy.stats import numpy as np def V(w, b, m=250): j = binom.ppf(float(w)/float(m), b, 0.5) return 1.2**b * 1.5**-j * (w + m/2 * sum(np.multiply(binom.cdf(map(lambda(x2):x2-1, range(0,int(j+1))),b,0.5), map(lambda(x): 1.5**x, list(reversed(range(0, int(j+1)))))))) Graphing the Value function With an exact value function, we can visualize it to get an idea of how the wealth/rounds interact and how difficult it would be to approximate it. We can generate many iterates, interpolate, and graph: n <- 50000 w <- runif(n, min=0, max=250) b <- sample(0:250, n, replace=TRUE) v <- unlist(Map(V, w, b)) df <- data.frame(Wealth=w, Rounds=b, Value=v) library(akima) library(plot3D) dfM <- interp(df$Wealth, df$Rounds, df$Value, duplicate="strip")
par(mfrow=c(1,2), mar=0)
persp(dfM$x, dfM$y, dfM$z, xlab="Wealth", ylab="Rounds left", zlab="Value", ticktype="detailed") scatter3D(df$Wealth, df$Rounds, df$Value, theta=-25, xlab="Wealth", ylab="Rounds left", zlab="Value", ticktype="detailed")

Approximating the Exact Value Function

The value function winds up looking simple, a smooth landscape tucked down at the edges. A value function computed by decision trees is a bulky and hard to carry around sort of thing, weighing perhaps gigabytes in size. Especially looking at this value function, it’s clear that it can be easily approximated or memorized by another algorithm like random forests or neural networks. (Usually, a NN has a hard time learning a value function because it’s a reinforcement learning problem where it’s unclear how actions are linked to rewards, but in this case, it is the far easier setting of supervised learning.)

A random forest works fine, getting 0.4 RMSE:

library(randomForest)
r <- randomForest(Value ~ Wealth + Rounds, ntree=1000, data=df); r
# ...
#           Mean of squared residuals: 1.175970325
#                     % Var explained: 99.94
sqrt((predict(r) - df$Value)^2) # [1] 0.407913389 It can also be approximated with NN. For example, in R one can use MXNet and a 1x30 or 2x152 ReLu feedforward NN will solve it to <4 RMSE: ## very slow library(neuralnet) n <- neuralnet(Value ~ Wealth + Rounds, data=df, hidden=c(20,20)) # prediction(n) ## GPU-accelerated: library(mxnet) mx.set.seed(0) ## 2x30 ReLu FC NN for linear regression with RMSE loss: data <- mx.symbol.Variable("data") perLayerNN <- 30 fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=perLayerNN) act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu") fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=perLayerNN) act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu") fc3 <- mx.symbol.FullyConnected(act2, name="fc2", num_hidden=perLayerNN) act3 <- mx.symbol.Activation(fc3, name="relu2", act_type="relu") finalFc <- mx.symbol.FullyConnected(act3, num_hidden=1) lro <- mx.symbol.LinearRegressionOutput(finalFc) model <- mx.model.FeedForward.create(lro, X=as.matrix(subset(df,select=c("Wealth","Rounds"))), y=df$Value,
ctx=mx.gpu(),     num.round=50000,
learning.rate=5e-6, eval.metric=mx.metric.rmse, array.layout="rowmajor")

Some experimenting with fitting polynomial regressions did not give any decent fits for less than 5 degrees, but probably some combination of logs and polynomials could compactly encode the value function.

Simulation Performance

The implementations here could be wrong, or decision trees not actually optimal like claimed. We can test it by directly comparing the performance in a simulation of the coin-flipping game, comparing the performance of mVPplan vs the 20% Kelly criterion and a simple bet-everything strategy:

library(memoise)
f <- function(x, w, b) { 0.6*V(min(w+x,250), b-1) + 0.4*V(w-x, b-1)}
mf <- memoise(f)
V <- function(w, b, m=250) { j <- qbinom(w/m,b,1/2);
1.2^b * 1.5^-j * (w+m/2 * sum(1.5^(j:0) * pbinom(0:j-1,b,1/2))) }
VPplan <- function(w, b) {
if (b==0) { return (0); } else {
returns <- sapply(seq(0, w), function(wp) { mf(wp, w, b); })
return (which.max(returns)-1); }
}
mVPplan <- memoise(VPplan)

game <- function(strategy, wealth, betsLeft) {
if (betsLeft>0) {
bet <- strategy(wealth, betsLeft)
wealth <- wealth - bet
flip <- rbinom(1,1,p=0.6)
winnings <- 2*bet*flip
wealth <- min(wealth+winnings, 250)
return(game(strategy, wealth, betsLeft-1)) } else { return(wealth); } }
simulateGame <- function(s, w=25, b=300, iters=5000) { mean(replicate(iters, game(s, w, b))) }

## Various strategies (the decision tree strategy is already defined as 'mVPplan'):
kelly <- function(w, b) { 0.20 * w }
smarterKelly <- function(w, b) { if(w==250) {0} else { 0.2 * w } }
maximizer <- function(w, b) { w; }
smarterMaximizer <- function(w, b) { if(w>=250) { 0 } else {w}; }

library(parallel)
library(plyr)
## Comparing performance in the finite horizon setting up to b=300:
bs <- 1:300
ldply(mclapply(bs, function(bets) {
ky <- round(simulateGame(smarterKelly, b=bets), digits=1)
dt <- round(simulateGame(mVPplan, b=bets), digits=1)
gain <- dt-ky;
print(paste(bets, round(V(25, bets), digits=2), dt, ky, gain)) }
))

Simulating 5,000 games for each b:

Total bets Value function Decision tree performance Kelly performance Difference
1 30 30.1 26 4.1
2 36 35.5 27 8.5
3 43.2 42.4 28.2 14.2
4 45.36 46 29.3 16.7
5 47.95 49.3 30.5 18.8
6 53.65 52.5 31.5 21
7 54.59 51.4 32.7 18.7
8 57.57 56.1 34.1 22
9 1.83 61.4 35.6 25.8
10 62.6 63 36.8 26.2
11 66.76 68.6 38.3 30.3
12 68.22 66.9 40.2 26.7
13 70.6 70.7 41.8 28.9
14 74.13 71.2 43.2 28
15 75 73.8 44.9 28.9
16 79.13 79.7 46.8 32.9
17 79.75 79.1 47.8 31.3
18 82.5 84.1 50.1 34
19 84.71 83.7 52.7 31
20 86.25 85.6 54.2 31.4
21 89.8 88.6 56.4 32.2
22 90.28 90.2 56.5 33.7
23 93.85 94.8 58.7 36.1
24 94.49 93.7 60.7 33
25 97 95 62.2 32.8
26 98.83 97.1 65.2 31.9
27 100.38 98.7 68.1 30.6
28 103.23 102.1 69.4 32.7
29 103.95 102.5 73.4 29.1
30 107.61 107.7 73.7 34
31 107.64 107.3 74.7 32.6
32 110.4 106.7 76.1 30.6
33 111.41 107.9 79.2 28.7
34 113.38 115.1 80.6 34.5
35 115.24 115.4 82.2 33.2
36 116.48 116.1 84.2 31.9
37 119.09 118.5 87.1 31.4
38 119.69 119.9 86 33.9
39 122.94 125.4 89.6 35.8
40 122.96 119.7 92 27.7
41 125.54 124.2 95.2 29
42 126.28 124.4 96.9 27.5
43 128.21 128.5 97.1 31.4
44 129.63 130.5 100.3 30.2
45 130.96 130.1 100 30.1
46 132.98 131.9 100.9 31
47 133.78 132.4 104.2 28.2
48 136.33 134.2 104.5 29.7
49 136.64 133.6 107.7 25.9
50 139.36 141.2 110.1 31.1
51 139.54 139.4 113.6 25.8
52 141.68 140 113 27
53 142.45 140.7 113.5 27.2
54 144.08 141.8 115.5 26.3
55 145.36 145 116.3 28.7
56 146.52 150.2 119.7 30.5
57 148.26 146.4 119.7 26.7
58 149 143.9 120.6 23.3
59 151.15 151.8 124.2 27.6
60 151.5 148.5 124.4 24.1
61 154.01 151.9 125.5 26.4
62 154.01 150.9 127.1 23.8
63 156.05 157.9 128.4 29.5
64 156.52 154.3 129.9 24.4
65 158.14 155.5 132.3 23.2
66 159.03 156.8 132.1 24.7
67 160.25 157.4 133.2 24.2
68 161.53 159.2 137.1 22.1
69 162.39 160.2 135.9 24.3
70 164 161 137.8 23.2
71 164.54 162.2 137.8 24.4
72 166.46 166.7 138.3 28.4
73 166.7 165.2 142.7 22.5
74 168.81 169 145 24
75 168.85 168.3 143.1 25.2
76 170.59 169.5 144.4 25.1
77 171 165.6 146.4 19.2
78 172.39 171.2 147.5 23.7
79 173.13 171.3 150.6 20.7
80 174.21 170.6 151.8 18.8
81 175.24 174 152.3 21.7
82 176.04 175.5 153.8 21.7
83 177.34 174.9 151.8 23.1
84 177.87 177.6 152.5 25.1
85 179.4 178.5 157.3 21.2
86 179.7 177.1 156 21.1
87 181.44 178.9 158.1 20.8
88 181.52 179.6 160.1 19.5
89 183.15 181.1 159 22.1
90 183.33 182.8 163.3 19.5
91 184.68 184.2 162.3 21.9
92 185.13 183.4 162.5 20.9
93 186.21 187.5 165.1 22.4
94 186.9 185.3 160.5 24.8
95 187.75 188.6 164.8 23.8
96 188.66 186.4 167.1 19.3
97 189.29 187.6 168 19.6
98 190.39 188.9 167.7 21.2
99 190.82 187.8 169.8 18
100 192.1 190.7 168.4 22.3
101 192.34 192.5 171.8 20.7
102 193.78 192.6 170 22.6
103 193.86 193.2 170.7 22.5
104 195.24 194.1 170 24.1
105 195.35 192.9 174.1 18.8
106 196.52 195.2 176.8 18.4
107 196.84 194.5 173.4 21.1
108 197.79 194.4 179.1 15.3
109 198.3 195.5 176 19.5
110 199.07 196.7 179.1 17.6
111 199.74 198.7 181.2 17.5
112 200.34 201.1 178.2 22.9
113 201.17 197.9 180.9 17
114 201.6 200.3 181.2 19.1
115 202.57 202 183.2 18.8
116 202.85 201.6 181 20.6
117 203.94 201.7 181.4 20.3
118 204.09 201.2 183.6 17.6
119 205.29 205.9 185 20.9
120 205.32 201.3 186.8 14.5
121 206.4 204 182.2 21.8
122 206.53 203.7 186.2 17.5
123 207.44 205.7 186.1 19.6
124 207.72 205.2 189.5 15.7
125 208.48 203.9 191.4 12.5
126 208.9 209.3 188 21.3
127 209.52 206.7 187.7 19
128 210.06 209.5 188.5 21
129 210.54 206.5 192.4 14.1
130 211.2 211.1 190.9 20.2
131 211.56 207.1 195.6 11.5
132 212.32 210.3 194 16.3
133 212.57 212.1 191.1 21
134 213.42 211 192.7 18.3
135 213.56 210.2 195.8 14.4
136 214.5 213.3 196.8 16.5
137 214.55 211.3 194.4 16.9
138 215.46 212 196.6 15.4
139 215.52 210.8 197.4 13.4
140 216.3 215.3 197 18.3
141 216.47 217.8 199.3 18.5
142 217.13 215.4 197.3 18.1
143 217.41 214.8 196.2 18.6
144 217.96 213.9 200.1 13.8
145 218.33 215.7 200.4 15.3
146 218.77 217.4 200.1 17.3
147 219.24 217.5 199.7 17.8
148 219.58 218.5 200.3 18.2
149 220.13 218.4 200.3 18.1
150 220.39 220.4 201.9 18.5
151 221 218.1 201.6 16.5
152 221.18 220.5 203.9 16.6
153 221.86 220.6 202.6 18
154 221.96 220.5 205.2 15.3
155 222.69 218.7 203.1 15.6
156 222.72 220.6 204.4 16.2
157 223.43 220.6 203.3 17.3
158 223.48 221.1 202.8 18.3
159 224.09 222.6 207.1 15.5
160 224.22 224.5 207.5 17
161 224.74 220.8 206 14.8
162 224.95 224.2 208.1 16.1
163 225.39 223.8 208 15.8
164 225.67 222.8 209 13.8
165 226.03 223.4 208.6 14.8
166 226.37 224 210 14
167 226.66 225.3 209.2 16.1
168 227.06 224.1 211.6 12.5
169 227.28 224.5 210.5 14
170 227.73 223.8 211 12.8
171 227.89 226.9 209.1 17.8
172 228.39 226 212.2 13.8
173 228.49 226 211.8 14.2
174 229.04 226.6 212.1 14.5
175 229.09 227.9 211.3 16.6
176 229.67 226.4 211.5 14.9
177 229.67 228 214 14
178 230.18 228.4 215.1 13.3
179 230.24 227.5 213.3 14.2
180 230.68 229.2 213.6 15.6
181 230.8 229.5 215 14.5
182 231.18 228.7 213.9 14.8
183 231.36 229.8 216 13.8
184 231.67 230.6 214.4 16.2
185 231.9 231 213.1 17.9
186 232.16 231.2 216.2 15
189 232.94 231.1 217.9 13.2
190 233.1 230.4 217.6 12.8
191 233.45 231.1 218.4 12.7
192 233.56 231.9 219 12.9
193 233.94 232.1 216.6 15.5
194 234.02 232 219.3 12.7
195 234.43 231.8 217.5 14.3
196 234.47 232.4 220.6 11.8
197 234.9 233.5 218.6 14.9
198 234.9 233 219.3 13.7
199 235.3 233.4 220.2 13.2
200 235.33 233.8 221.1 12.7
201 235.67 235.5 218.8 16.7
202 235.75 233 222 11
203 236.05 232.9 220.4 12.5
204 236.17 233.9 220.1 13.8
205 236.42 234.8 221 13.8
206 236.57 234.4 221.4 13
207 236.78 234.8 222.6 12.2
208 236.96 236.4 222.5 13.9
209 237.14 234.6 223.5 11.1
210 237.35 236.6 222.6 14
211 237.49 235.7 221.9 13.8
212 237.73 234.4 222.4 12
213 237.83 234.9 226 8.9
214 238.1 237.3 223.9 13.4
215 238.17 237 223.6 13.4
216 238.46 235.7 225.1 10.6
217 238.5 236.6 223.6 13
218 238.81 237 226.1 10.9
219 238.83 236.4 225 11.4
220 239.15 237.7 225.7 12
221 239.15 236.8 225.9 10.9
222 239.43 237.7 225.9 11.8
223 239.46 238.6 224.8 13.8
224 239.71 237.1 226.3 10.8
225 239.77 238.7 227.4 11.3
226 239.98 238.7 225.9 12.8
227 240.07 238 226.9 11.1
228 240.25 240.5 227.6 12.9
229 240.36 238.8 227.5 11.3
230 240.51 237.9 225.8 12.1
231 240.65 238.5 228.2 10.3
232 240.77 239.3 226.6 12.7
233 240.92 238.8 226.1 12.7
234 241.03 240.2 228.8 11.4
235 241.2 240.4 227.5 12.9
236 241.28 240 227.4 12.6
237 241.46 239.8 228 11.8
238 241.52 240.6 228.8 11.8
239 241.72 240.1 228.7 11.4
240 241.76 240.2 229.2 11
241 241.98 240.3 229.2 11.1
242 242 240.7 229.3 11.4
243 242.22 240.5 229.7 10.8
244 242.23 239.9 229.2 10.7
245 242.44 241.2 230.3 10.9
246 242.45 240.7 230.5 10.2
247 242.64 241.3 231.5 9.8
248 242.68 239.2 230.4 8.79
249 242.84 241.5 230.3 11.2
250 242.89 241.4 230.6 10.8
251 243.03 242.2 230.4 11.8
252 243.1 241.7 232.3 9.39
253 243.22 242.6 232.2 10.4
254 243.31 241 229.7 11.3
255 243.41 240.7 231.1 9.59
256 243.51 242.2 231 11.2
257 243.59 241 232.4 8.59
258 243.7 242.1 230.2 11.9
259 243.77 242 232.1 9.90
260 243.9 243.2 230.4 12.8
261 243.95 242.9 233.6 9.30
262 244.08 242.6 233.6 9
263 244.12 243 231.3 11.7
264 244.26 242.3 233.5 8.80
265 244.29 241.8 233.8 8
266 244.44 242.9 233.1 9.80
267 244.46 242.6 233.8 8.79
268 244.61 242.9 234 8.90
269 244.62 244 234.3 9.69
270 244.76 243.6 234.4 9.19
271 244.77 243.6 235 8.59
272 244.91 243.2 234.7 8.5
273 244.93 243.9 233.2 10.7
274 245.04 243.5 233.5 10
275 245.08 243.7 234.2 9.5
276 245.18 243.4 234.8 8.59
277 245.23 244.2 234.2 10
278 245.31 244.8 234.8 10
279 245.37 244.1 234.7 9.40
280 245.44 243.7 234.1 9.59
281 245.51 244.1 234 10.1
282 245.57 243.6 235.8 7.79
283 245.65 243.8 235.3 8.5
284 245.7 244 235 9
285 245.78 244.3 236.9 7.40
286 245.82 243.7 235 8.69
287 245.91 244.6 236.2 8.40
288 245.94 244.7 235.4 9.29
289 246.04 245.2 237.3 7.89
290 246.06 244.6 234.8 9.79
291 246.16 243.8 235.6 8.20
292 246.17 244.8 236.2 8.60
293 246.28 244.6 236.2 8.40
294 246.29 245.2 236.9 8.29
295 246.39 245.2 237.2 8
296 246.39 245.2 236.5 8.69
297 246.49 245.1 235.7 9.40
298 246.5 244.9 237.4 7.5
299 246.59 245.4 237.5 7.90
300 246.61 246 236.2 9.80

So the decision tree does outperform the 20% KC, and not by trivial amounts either for many possible b, either in relative or absolute terms: for b=23, the estimated gain is $37.1. The decision tree doesn’t require the full b=300 to often hit the ceiling, although the KC will. However, the performance edge goes down as the horizon gets more distant, and we’ll see that as far out as b=300, the advantage shrinks to ~$6 as the ceiling is hit almost all the time by decent strategies (as we know from Haghani & Dewey 2016’s analysis that KC hits the ceiling ~95% of the time with b=300, although of course it will do so far less often for shorter runs like b=50).

Since the 20% KC was only suggested as a heuristic, one might wonder if there’s a different constant fraction which does slightly better in the b=300 case, which there is, 12%:

fraction <- seq(0,1, by=0.01)
fractionEVs <- ldply(mclapply(fraction, function(f) {
k <- function(w,b) { if(w==250) {0} else { f * w } }
kev <- simulateGame(k, b=300)
return(kev) }
))
fraction[which.max(unlist(fractionEVs))]
# [1] 0.12
qplot(fraction, fractionEVs)

Coin-flipping POMDP

The original Kelly coinflip game has a clear way to make it more general and difficult: randomize the 3 key parameters of edge/max wealth/rounds and hide them, turning it into a partially-observed Markov decision process.

To solve a POMDP one often has to record the entire history, but with judicious choice of what distribution we randomize the 3 parameters from, we can make it both plausibly reflecting the human subjects’ expectations and allow summarizing the history in a few sufficient statistics.

Implementation

A Python implementation with the suggested randomizations is available in OpenAI Gym:

import gym
env = gym.make('KellyCoinflipGeneralized-v0')
env.reset()
env._render() # peek at the hidden parameters
env._step(env.wealth*100*0.2)
# ...
env._reset() # etc

Bayesian decision tree

Unknown coin-flip probability

A variant on this problem is when the probability of winning is not given, if one doesn’t know in advance p=0.6. How do you make decisions as coin flips happen and provide information on p? Since p is unknown and only partially observed, this turns the coin-flip MDP into a POMDP problem. A POMDP can be turned back into a MDP and solved in a decision tree by including the history of observations as more parameters to explore over.

The curse of dimensionality can be tamed here a little by observing that a history of coin-flip sequences is unnecessary - it doesn’t matter what order the coin flips happen in, only the number of wins and the number of total coin-flips matter (are the sufficient statistics). So we only need to augment wealth/rounds/bet-amount with wins/n. And one can compute p from wins/n by Bayesian updates of a prior on the observed coin flip results, treating it as a binomial problem of estimating p distributed according to the beta distribution.

This beta Bayesian update is easy and doesn’t require calling out to a library: with a beta uniform prior (Beta(1,1)), update on binomial (win/loss) is conjugate with the simple closed form posterior of Beta(1+win, 1+n-win); and the expectation of the beta is 1+win / 1+n. Then our expected values in f simply change from 0.6/0.4 to expectation/1-expectation.

Going back to the R implementation, we add in the two additional parameters, the beta estimation, and estimate the hardwired probabilities:

library(memoise)
f <- function(x, w, b, wins, n) {
beta <- c(1+wins, 1+n-wins)
expectedWinProb <- beta[1] / (beta[1]+beta[2]) # E[Beta(alpha, beta)] = alpha / (alpha+beta)
expectedLoseProb <- 1 - expectedWinProb
expectedWinProb*mV(min(w+x,250), b-1, wins+1, n+1) + expectedLoseProb*mV(w-x, b-1, wins, n+1) }
mf <- memoise(f)
V <- function(w,b,wins,n) {
returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { mf(x,w,b, wins, n) }) } else { w }
max(returns) }
mV <- memoise(V)

What prior should we use?

Ideally we would use the prior of the participants in the experiment, to compare how well they do compared to the Bayesian decision tree, and estimate how much mistaken beliefs cost them. Unfortunately they were not surveyed on this and the authors do not give any indication of how much of an edge the subjects expected before starting the game, so for analysis, we’ll make up a plausible prior.

We know that they know that 100% is not an option (as that would be boring) and <50% is impossible too since they were promised an edge; but it can’t be too high because then everyone would win quickly and nothing would be learned. I think a reasonable prior would be something like 70%, but with a lot of room for surprise. let’s assume a weak prior around 70%, which is equivalent to playing 10 games and winning 7, so Beta(7, 3).

With the additional parameters, the memoized version becomes impossibly slow, so I rewrite it to use R environments as a poor man’s hash table:

tree <- new.env(size=1000000, parent=emptyenv())
f <- function(x, w, b, wins, n) {
beta <- c(1+wins, 1+n-wins)
expectedWinProb <- beta[1] / (beta[1]+beta[2]) # E[Beta(alpha, beta)] = alpha / (alpha+beta)
expectedLoseProb <- 1 - expectedWinProb
expectedWinProb*V(min(w+x,250), b-1, wins+1, n+1) + expectedLoseProb*V(w-x, b-1, wins, n+1) }
V <- function(w, b, wins, n) {
if (isSet(c(w,b,wins,n))) { return(lookup(c(w,b,wins,n))) } else {
returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b,wins,n) }) } else { return(w) }
set(c(w,b,wins,n), max(returns))
return(lookup(c(w,b,wins,n)))
}
}
# test expression: $25, 5 bets, suggested 7/10 prior: V(25,5,7,10) This prior leads to somewhat more aggressive betting in short games as the EV is higher, but overall performs much like the decision tree. nshepperd provides a modified version of his code which lets us evaluate up to b=300: {-# LANGUAGE ScopedTypeVariables #-} import System.Environment import Data.List import Data.Foldable import Data.Function import Data.Function.Memoize import Data.Vector (Vector) import qualified Data.Vector as V import Data.Ord import Text.Printf import Numeric.Search.Range -- from 'binary-search' type Wealth = Int type BetAmount = Wealth type Probability = Double type EV = Double type Round = Int type MaxRounds = Round type WonRounds = Round type Alpha = Int type Beta = Int cap :: Wealth cap = 250 data Tree = Done EV | Bet EV Wealth BetAmount Round Probability Tree Tree deriving (Show) data Dual = Dual { subjectiveEV :: !EV, trueEV :: !EV } deriving (Show) instance Plan Dual where done w = Dual (fromIntegral w) (fromIntegral w) bet w x b p win loss = Dual { subjectiveEV = p * subjectiveEV win + (1 - p) * subjectiveEV loss, trueEV = 0.6 * trueEV win + (1 - 0.6) * trueEV loss } strategyEV = subjectiveEV class Plan s where done :: Wealth -> s bet :: Wealth -> BetAmount -> Round -> Double -> s -> s -> s strategyEV :: s -> EV instance Plan Double where done = fromIntegral bet w x b p win loss = p * win + (1 - p) * loss strategyEV = id instance Plan Tree where done w = Done (fromIntegral w) strategyEV (Done ev) = ev strategyEV (Bet ev _ _ _ _ _ _) = ev bet 0 _ _ _ _ _ = Done 0 bet w x b p win loss = Bet ev w x b p win loss where ev = p * strategyEV win + (1 - p) * strategyEV loss showTree :: Tree -> String showTree (Done ev) = show (Done ev) showTree (Bet ev w x b p win loss) = intercalate "\n" [ printf "%i (%1.1f): %i/%i at %2.2f" b ev x w p, indent 4 ("+ " ++ showTree win), indent 4 ("- " ++ showTree loss) ] where indent n str = intercalate "\n"$ map (replicate n ' ' ++) $lines str --$ Unordered maximum
maximumOn :: (Foldable f, Ord b) => (a -> b) -> f a -> a
maximumOn f = maximumBy (comparing f)

-- Binary search on bitonic function.
maximumBitonic :: Ord b => (a -> b) -> Vector a -> a
maximumBitonic f options = case searchFromTo find 0 (V.length options - 2) of
Just i -> options V.! i
Nothing -> options V.! (V.length options - 1)
where
find i = f (options V.! i) >= f (options V.! (i + 1))

value :: Plan s
=> Alpha -> Beta -> MaxRounds
-> (Round -> WonRounds -> Wealth -> s)
-> (Round -> WonRounds -> Wealth -> s)
value alpha beta lastRound next currentRound wins w
| currentRound >= lastRound = done w
| w            >= cap       = done w
| otherwise = maximumBitonic strategyEV options
where
new_alpha = alpha + wins
new_beta = beta + (currentRound - wins)
p = fromIntegral new_alpha / fromIntegral (new_alpha + new_beta)
go x = bet w x currentRound p
(next (currentRound+1) (wins+1) (w+x))
(next (currentRound+1) wins     (w-x))
options = V.fromList [go x | x <- [0..min (cap - w) w]]

-- Direct recursion.
direct :: Plan s
=> Alpha -> Beta -> MaxRounds
-> (Round -> WonRounds -> Wealth -> s)
direct a b m = fix (value a b m)

-- Memoised recursion.
memo :: forall s. Plan s
=> Alpha -> Beta -> MaxRounds
-> (Round -> WonRounds -> Wealth -> s)
memo a b maxRounds = cached_value
where
cached_value r wins w = table V.! r V.! wins V.! w
table :: Vector (Vector (Vector s))
table = V.generate (maxRounds + 1)
(\r -> V.generate (r + 1)
(\wins -> V.generate (cap + 1)
(\w -> value a b maxRounds cached_value r wins w)))

-- Memoised recursion using 'memoize' library; slower.
memo2 :: Plan s
=> Alpha -> Beta -> MaxRounds
-> (Round -> WonRounds -> Wealth -> s)
memo2 a b r = memoFix3 (value a b r)

run :: Plan s
=> (alpha -> beta -> maxRounds -> (Round -> WonRounds -> wealth -> s))
-> alpha -> beta -> maxRounds -> wealth -> s
run method alpha beta rounds w = method alpha beta rounds 0 0 w

main :: IO ()
main = do [a, b, r, w] <- map read <$> getArgs print$ (run memo a b r w :: Dual)

One feature offered by this code is reporting two expected-values:

1. the subjective estimate of the agent at the beginning of the game about how much it can profit, given the rules and its priors about p
2. the EV it will actually earn, following its policy, given the true probability p=0.6

For our prior, w=25, and b=300, we can evaluate run memo (7+1) (3+1) 300 25 :: Dual and find that the subjective EV is $207.238 and the actual EV is$245.676.

And since we know from earlier that an agent following the optimal policy believing p=0.6 would earn $246.606, it follows that the Bayesian agent, due to its ignorance, loses ~$1. So the price of ignorance (regret) in this scenario is surprisingly small.

Why is the Bayesian decision tree able to perform so close to the known-probability decision tree? I would guess that it is because all agents bet small amounts in the early rounds (to avoid gambler’s ruin, since they have hundreds of rounds left to reach the ceiling), and this gives them data for free to update towards p=0.6; the agent which knows that p=0.6 can bet a little more precisely early on, but this advantage over a few early rounds doesn’t wind up being much of a help in the long run.

With b=300, there are so many rounds available to bet, that the details don’t matter as much; while it there are only a few bets, the maximizing behavior is to bet a lot so the regret for small b is also probably small; the largest regret for not knowing p is probably somewhere, like with the KC vs decision tree, in the small-medium area of b.

Unknown stopping time

Another variant would be to make the stopping time uncertain; strictly speaking, the game wasn’t fixed at exactly 300 rounds, that was just their guess at how many rounds a reasonably fast player might get in before the clock expired. From the player’s perspective, it is unknown how many rounds they will get to play although they can guess that they won’t be allowed to play more than a few hundred, and this might affect the optimal strategy.

One way to calculate this would be to assume that the player doesn’t learn anything about the stopping time while playing, and merely has a prior over the stopping time. For example, N(300,25). What is the value of this game? We can approximate it with our pre-existing value function for a known stopping time, sampling from the posterior; for example, we might draw 1000 values from N(300,25), calculate the value of 300, 295, 320, 301, 296… etc, and taking the mean value. (This is not as terrible as it sounds computation-wise: since the value function is deterministic, we can cache each value and we only need to run the value function for each unique value, and then take a mean weighted by their count in the posterior sample. So if we draw 10,000 posterior samples, we don’t need to do 10,000x the work, but maybe only 100x, depending on how narrow our prior is.)

Another way would be to try to calculate it inside the value function. The game stopping can be seen as a third probabilistic response to a player’s bet, where each move with some sp, if it happens the number of bets left immediately becomes b=0 (so they win their current w) and they win the net value V(w,0); and if it doesn’t happen the value is the usual 0.6*V(w+x, b-1) + 0.4*V(w-x, b-1). Or in longer form, the value of a action/bet x is (1-sp)*(0.6*V(w+x,b-1) + 0.4*V(w-x,b-1)) + sp*V(w,0).

What is sp as a distribution? Well, one might believe that every round, there is a ~1% chance of the game stopping, in which case sp is the memoryless geometric distribution (not exponential distribution, since this is a discrete setup); in this case, one must discount each additional round n by 1% cumulatively. (Discounting in general could be seen as reflecting a probability of stopping at each stage.)

This has the difficulty that hypothetically, a game might go on indefinitely long, but since a player must eventually hit $0 or$250 and stop playing, it would turn out like the penny simulator - after a few thousand rounds, the probability of having not reached $250 is so small it can’t be handed by most software. The problem is more that this expands the states out enough to again make it unevaluable, so it helps to put an upper bound. So the problem becomes, playing the game knowing that the game might stop with a certain sp (eg 0.01) each round or stops at an upper limit number of rounds (eg b=300). So continuing with the R hash table, it might go like this: f <- function(x, w, b) { sp <- 0.01 (1-sp)*(0.6*V(min(250,w+x), b-1) + 0.4*V(w-x,b-1)) + sp*V(w,0) } V <- function(w, b) { if (isSet(c(w,b))) { return(lookup(c(w,b))) } else { returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b) }) } else { return(w) } set(c(w,b), max(returns)) return(lookup(c(w,b))) } } This can be extended to scenarios where we do learn something about the stopping time, similarly to the coin-flip probability, by doing Bayesian updates on sp. Or if sp changes over rounds. If we had a distribution N(300,25) over the stopping time, then the probability of stopping at each round can change inside the value function given an additional argument n for rounds elapsed. Unknown maximum wealth cap Finally, subjects weren’t told what the maximum would be: The maximum payout would be revealed if and when subjects placed a bet that if successful would make their balance greater than or equal to the cap. We set the cap at$250.

A convenient distribution for such a cap would be the Pareto distribution, parameterized by a minimum xm and a tail parameter, α. If xm is set to something like 100 (the agent thinks the cap is >$100), for most rounds, the agent would not be able to update this, but towards the end, as it exceeds$100, it then updates the posterior - if it is allowed to make a bet of $50 at$100, then xm becomes 150 while α remains the same, and so on The posterior mean of the cap is then $\frac{\alpha \cdot x_m}{\alpha - 1}$. A reasonable prior might be α=5 and xm=200, in which case the mean wealth cap is 250.

f <- function(x, w, b, alpha, x_m) {
meanCap        <- (alpha * x_m) / (alpha-1) # mean posterior cap if our bet does not threaten to hit the cap
x_m_updated <- max(x_m, x+w)
meanCapUpdated <- (alpha * x_m_updated) / (alpha-1) # mean posterior cap if our bet does threaten to hit the cap
Pcap <- if ((x+b) < x_m) { 0 } else { 1-((x_m / (x+b))^alpha) } # CDF
(1-Pcap)*(0.6*V(min(meanCap,w+x), b-1, alpha, x_m) + 0.4*V(w-x,b-1,alpha,x_m)) +
Pcap*(0.6*V(min(meanCapUpdated,w+x), b-1, alpha, x_m_updated) + 0.4*V(w-x,b-1,alpha,x_m_updated))
}
V <- function(w, b, alpha=5, x_m=200) {
if (isSet(c(w,b,alpha,x_m))) { return(lookup(c(w,b,alpha,x_m))) } else {
returns <- if (b>0 && w>0) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b,alpha,x_m) }) } else { return(w) }
set(c(w,b,alpha,x_m), max(returns))
return(lookup(c(w,b,alpha,x_m)))
}
}

1. That is, the value of the actions/bets 0-w seems like it would be bitonic - increasing linearly to the maximum then linearly decreasing. So one ought to be able to do a binary search over actions or something even better.

2. Presumably a 3x5 or smaller NN would also work, but I experienced convergence issues with the SGD optimizer for all 3-layer NNs I experimented with, and as of 29 March 2017, a MXNet bug blocked use of the better optimizers.