The Kelly Coin-Flipping Game: Exact Solutions

Decision-theoretic analysis of how to optimally play Haghani & Dewey 2016’s 300-round double-or-nothing coin-flipping game with an edge and ceiling better than using the Kelly Criterion. Computing and following an exact decision tree increases earnings by $6.6 over a modified KC.
statistics, decision-theory, Haskell, NN, Python, shell, R, C, C++, Bayes
by: Gwern Branwen, Arthur B., nshepperd, FeepingCreature, Gurkenglas 2017-01-19–2018-12-12 finished certainty: likely importance: 8


Haghani & Dewey 2016 exper­i­ment with a dou­ble-or-noth­ing coin-flip­ping game where the player starts with $25 and has an edge of 60%, and can play 300 times, choos­ing how much to bet each time, win­ning up to a max­i­mum ceil­ing of $250. Most of their sub­jects fail to play well, earn­ing an aver­age $91, com­pared to Haghani & Dewey 2016’s heuris­tic bench­mark of ~$240 in win­nings achiev­able using a mod­i­fied Kelly Cri­te­rion as their strat­e­gy. The KC, how­ev­er, is not opti­mal for this prob­lem as it ignores the ceil­ing and lim­ited num­ber of plays.

We solve the prob­lem of the value of opti­mal play exactly by using deci­sion trees & dynamic pro­gram­ming for cal­cu­lat­ing the value func­tion, with imple­men­ta­tions in R, Haskell, and C. We also pro­vide a closed-form exact value for­mula in R & Python, sev­eral approx­i­ma­tions using Monte Carlo/random forests/neural net­works, visu­al­iza­tions of the value func­tion, and a Python imple­men­ta­tion of the game for the Ope­nAI Gym col­lec­tion. We find that opti­mal play yields $246.61 on aver­age (rather than ~$240), and so the human play­ers actu­ally earned only 36.8% of what was pos­si­ble, los­ing $155.6 in poten­tial profit. Com­par­ing deci­sion trees and the Kelly cri­te­rion for var­i­ous hori­zons (bets left), the rel­a­tive advan­tage of the deci­sion tree strat­egy depends on the hori­zon: it is high­est when the player can make few bets (at b = 23, with a differ­ence of ~$36), and decreases with num­ber of bets as more strate­gies hit the ceil­ing.

In the Kelly game, the max­i­mum win­nings, num­ber of rounds, and edge are fixed; we describe a more diffi­cult gen­er­al­ized ver­sion in which the 3 para­me­ters are drawn from Pare­to, nor­mal, and beta dis­tri­b­u­tions and are unknown to the player (who can use Bayesian infer­ence to try to esti­mate them dur­ing play). Upper and lower bounds are esti­mated on the value of this game. In the vari­ant of this game where sub­jects are not told the exact edge of 60%, a Bayesian deci­sion tree approach shows that per­for­mance can closely approach that of the deci­sion tree, with a penalty for 1 plau­si­ble prior of only $1. Two deep rein­force­ment learn­ing agents, DQN & DDPG, are imple­mented but DQN fails to learn and DDPG does­n’t show accept­able per­for­mance, indi­cat­ing bet­ter deep RL meth­ods may be required to solve the gen­er­al­ized Kelly game.

The paper by Haghani & Dewey 2016 runs an economics/psychology experiment on optimal betting in a simple coin-flipping game:

What would you do if you were invited to play a game where you were given $25 and allowed to place bets for 30 minutes on a coin that you were told was biased to come up heads 60% of the time? This is exactly what we did, gathering 61 young, quantitatively trained men and women to play this game. The results, in a nutshell, were that the majority of these 61 players didn't place their bets very well, displaying a broad panoply of behavioral and cognitive biases. About 30% of the subjects actually went bust, losing their full $25 stake. We also discuss optimal betting strategies, valuation of the opportunity to play the game and its similarities to investing in the stock market. The main implication of our study is that people need to be better educated and trained in how to approach decision making under uncertainty. If these quantitatively trained players, playing the simplest game we can think of involving uncertainty and favourable odds, didn't play well, what hope is there for the rest of us when it comes to playing the biggest and most important game of all: investing our savings? In the words of Ed Thorp, who gave us helpful feedback on our research: "This is a great experiment for many reasons. It ought to become part of the basic education of anyone interested in finance or gambling."

More specifi­cal­ly:

…Prior to start­ing the game, par­tic­i­pants read a detailed descrip­tion of the game, which included a clear state­ment, in bold, indi­cat­ing that the sim­u­lated coin had a 60% chance of com­ing up heads and a 40% chance of com­ing up tails. Par­tic­i­pants were given $25 of start­ing cap­i­tal and it was explained in text and ver­bally that they would be paid, by check, the amount of their end­ing bal­ance sub­ject to a max­i­mum pay­out. The max­i­mum pay­out would be revealed if and when sub­jects placed a bet that if suc­cess­ful would make their bal­ance greater than or equal to the cap. We set the cap at $250…Par­tic­i­pants were told that they could play the game for 30 min­utes, and if they accepted the $25 stake, they had to remain in the room for that amount of time.5 Par­tic­i­pants could place a wager of any amount in their account, in incre­ments of $0.01, and they could bet on heads or tail­s…As­sum­ing a player with agile fin­gers can put down a bet every 6 sec­onds, that would allow 300 bets in the 30 min­utes of play.

Near-optimal play

The authors make a specific suggestion about what near-optimal play in this game would be, based on the Kelly criterion, which would yield bets each round of 20% of capital:

The basic idea of the Kelly formula is that a player who wants to maximize the rate of growth of his wealth should bet a constant fraction of his wealth on each flip of the coin, defined by the function 2 × p − 1, where p is the probability of winning. The formula implicitly assumes the gambler has log utility. It's intuitive that there should be an optimal fraction to bet; if the player bets a very high fraction, he risks losing so much money on a bad run that he would not be able to recover, and if he bet too little, he would not be making the most of what is a finite opportunity to place bets at favorable odds…We present the Kelly criterion as a useful heuristic a subject could gainfully employ. It may not be the optimal approach for playing the game we presented for several reasons. The Kelly criterion is consistent with the bettor having log-utility of wealth, which is a more tolerant level of risk aversion than most people exhibit. On the other hand, the subjects of our experiment likely did not view $25 (or even $250) as the totality of their capital, and so they ought to be less risk averse in their approach to maximizing their harvest from the game. The fact that there is some cap on the amount the subject can win should also modify the optimal strategy…In our game, the Kelly criterion would tell the subject to bet 20% (2 × 0.6 − 1) of his account on heads on each flip. So, the first bet would be $5 (20% of $25) on heads, and if he won, then he'd bet $6 on heads (20% of $30), but if he lost, he'd bet $4 on heads (20% of $20), and so on.

…If the sub­ject rightly assumed we would­n’t be offer­ing a cap of more than $1,000 per play­er, then a rea­son­able heuris­tic would be to bet a con­stant pro­por­tion of one’s bank using a frac­tion less than the Kelly cri­te­ri­on, and if and when the cap is dis­cov­ered, reduc­ing the bet­ting frac­tion fur­ther depend­ing on bet­ting time remain­ing to glide in safely to the max­i­mum pay­out. For exam­ple, bet­ting 10% or 15% of one’s account may have been a sound start­ing strat­e­gy. We ran sim­u­la­tions on the prob­a­bil­ity of hit­ting the cap if the sub­ject bet a fixed pro­por­tion of wealth of 10%, 15% and 20%, and stop­ping when the cap was exceeded with a suc­cess­ful bet. We found there to be a 95% prob­a­bil­ity that the sub­jects would reach the $250 cap fol­low­ing any of those con­stant pro­por­tion bet­ting strate­gies, and so the expected value of the game as it was pre­sented (with the $250 cap) would be just under $240. How­ev­er, if they bet 5% or 40% of their bank on each flip, the prob­a­bil­ity of exceed­ing the cap goes down to about 70%.
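
A quick Monte Carlo sketch (not from the paper) of this constant-fraction claim: bet a fixed 20% of wealth on heads each flip, stop once the $250 cap is reached, with at most 300 flips:

simConstantFraction <- function(fraction=0.2, w=25, bets=300, cap=250) {
    for (i in 1:bets) {
        if (w >= cap) { return(cap) }
        bet <- fraction * w
        w <- w + (if (rbinom(1, 1, 0.6) == 1) bet else -bet)
    }
    return(min(w, cap)) }
mean(replicate(10000, simConstantFraction(0.2)) >= 250)
## fraction of simulated players reaching the cap; Haghani & Dewey report ~95%
## for constant fractions of 10%-20%, and ~70% for 5% or 40%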

This game is inter­est­ing as a test case because it is just easy enough to solve exactly on a stan­dard com­puter in var­i­ous ways, but also hard enough to defeat naive humans and be non­triv­ial.

Subjects’ performance

Despite the Kelly cri­te­rion being well-known in finance and fairly intu­itive, and the game being very gen­er­ous, par­tic­i­pants did not per­form well:

The sam­ple was largely com­prised of col­lege age stu­dents in eco­nom­ics and finance and young pro­fes­sion­als at finance firms. We had 14 ana­lyst and asso­ciate level employ­ees at two lead­ing asset man­age­ment firms. The sam­ple con­sisted of 49 males and 12 females. Our prior was that these par­tic­i­pants should have been well pre­pared to play a sim­ple game with a defined pos­i­tive expected val­ue…Only 21% of par­tic­i­pants reached the max­i­mum pay­out of $250,7 well below the 95% that should have reached it given a sim­ple con­stant per­cent­age bet­ting strat­egy of any­where from 10% to 20%.8 We were sur­prised that one third of the par­tic­i­pants wound up with less money in their account than they started with. More astound­ing still is the fact that 28% of par­tic­i­pants went bust and received no pay­out. That a game of flip­ping coins with an ex-ante 60:40 win­ning prob­a­bil­ity pro­duced so many sub­jects that lost every­thing is star­tling. The aver­age end­ing bankroll of those who did not reach the max­i­mum and who also did not go bust, which rep­re­sented 51% of the sam­ple, was $75. While this was a tripling of their ini­tial $25 stake, it still rep­re­sents a very sub­-op­ti­mal out­come given the oppor­tu­nity pre­sent­ed. The aver­age pay­out across all sub­jects was $91, let­ting the authors off the hook rel­a­tive to the $250 per per­son they’d have had to pay out had all the sub­jects played well.

This is trou­bling because the prob­lem is so well-de­fined and favor­able to the play­ers, and can be seen as a micro­cosm of the diffi­cul­ties peo­ple expe­ri­ence in ratio­nal bet­ting. (While it’s true that human sub­jects typ­i­cally per­form badly ini­tially in games like the iter­ated pris­on­er’s dilemma and need time to learn, it’s also true that humans only have one life to learn stock mar­ket invest­ment dur­ing, and these sub­jects all should’ve been well-pre­pared to play.)

Instead of expected earn­ings of ~$240, the play­ers earned $91—­for­feit­ing $149. How­ev­er, if any­thing, the authors under­state the under­per­for­mance, because as they cor­rectly note, the Kelly cri­te­rion is not guar­an­teed to be opti­mal in this prob­lem due to the poten­tial for differ­ent util­ity func­tions (what if we sim­ply want to max­i­mize expected wealth, not log wealth?), the fixed num­ber of bets & the ceil­ing, as the Kelly cri­te­rion tends to assume that wealth can increase with­out limit & there is an indefi­nite time hori­zon.

Optimality in the coin-flipping MDP

Indeed, we can see with a simple example that KC is suboptimal in terms of maximizing expected value: what if we are given only 1 bet (b = 1) to use our $25 on? If we bet 20% (or less) per the KC, then our expected value is:

0.6 × ($25 + $5) + 0.4 × ($25 − $5) = $18 + $8 = $26

But if we bet everything:

0.6 × ($25 + $25) + 0.4 × ($25 − $25) = $30 + $0 = $30

It's true that 40% of the time, we go bankrupt and so we couldn't play again… but there are no more plays in b = 1 so avoiding bankruptcy boots nothing.

We can treat this coin-flipping game as a tree-structured Markov decision process (MDP). For more possible bets, the value of a bet of a particular amount given a wealth w and bets remaining b−1 will recursively depend on the best strategy for the two possible outcomes (weighted by probability), giving us a Bellman value equation to solve like:

V(w, b) = max over bets 0 ≤ x ≤ w of [0.6 × V(min(w + x, $250), b − 1) + 0.4 × V(w − x, b − 1)], with V(w, 0) = w

To solve this equation, we can explore all possible sequences of bets and outcomes out to a termination condition and reward, and then work backwards using induction, defining a decision tree which can be (reasonably) efficiently computed using memoization/dynamic programming.

Given the prob­lem setup, we can note a few things about the opti­mal strat­e­gy:

  1. if the wealth ever reaches $0, the game has effec­tively ended regard­less of how many bets remain, because bet­ting $0 is the only pos­si­bil­ity and it always returns $0

  2. sim­i­lar­ly, if the wealth ever reaches the upper bound of $250, the opti­mal strat­egy will effec­tively end the game by always bet­ting $0 after that regard­less of how many bets remain, since it can’t do bet­ter than $250 and can only do worse.

    These two short­cuts will make the tree much eas­ier to eval­u­ate because many pos­si­ble sequences of bet amounts & out­comes will quickly hit $0 or $250 and require no fur­ther explo­ration.

  3. a state with more bets is always of equal or bet­ter value than fewer

  4. a state with more wealth is always equal or bet­ter value than less

  5. the value of 0 bets is the cur­rent wealth

  6. the value of 1 bet depends on the ceiling and current wealth: whether $250 is >2× current wealth. If the ceiling is more than twice the current wealth, the optimal strategy with 1 bet left is to bet everything, since that has the highest EV and there are no more bets to worry about going bankrupt & missing out on. (If not, say w = $150, the optimal last bet is just enough to reach the cap: betting $100 gives 0.6 × $250 + 0.4 × $50 = $170, more than any smaller or larger bet.)

Implementation of Game

A Python imple­men­ta­tion (use­ful for apply­ing the many rein­force­ment learn­ing agent imple­men­ta­tions which are Gym-com­pat­i­ble) is avail­able in Ope­nAI Gym:

import gym
env = gym.make('KellyCoinflip-v0')
env.reset()
env._step(env.wealth*100*0.2) # bet with 20% KC
# ...
env._reset() # end of game, start a new one, etc

Decision tree

We can write down the value func­tion as a mutual recur­sion: f cal­cu­lates the expected value of the cur­rent step, and calls V to esti­mate the value of future steps; V checks for the ter­mi­na­tion con­di­tions w=$0/$250 & b =0 return­ing cur­rent wealth as the final pay­off, and if not, calls f on every pos­si­ble action to esti­mate their value and returns the max… This mutual recur­sion bot­toms out at the ter­mi­na­tion con­di­tions. As defined, this is grossly ineffi­cient as every node in the deci­sion tree will be recal­cu­lated many times despite yield­ing iden­ti­cal deter­min­is­tic, ref­er­en­tially trans­par­ent results. So we need to mem­o­ize results if we want to eval­u­ate much beyond b =5.

Approximate Value function

Imple­mented in R:

# devtools::install_github("hadley/memoise")
library(memoise)

f <- function(x, w, b) { 0.6*mV(min(w+x,250), b-1) + 0.4*mV(w-x, b-1)}
mf <- memoise(f)
V <- function(w,b) {
    returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { mf(x,w,b) }) } else { w }
    max(returns) }
mV <- memoise(V)
## sanity checks:
mV(25, 0) == 25 && mV(25, 1) == 30 && mV(0, 300) == 0

Given our memoized value function mV we can now take a look at the expected value of optimal betting for bets b from 0 to 300 with the fixed starting bankroll of $25. Memoization is no panacea, and R/memoise is slow enough that I am forced to make one change to the betting: instead of allowing bets in penny/$0.01 increments, I approximate by only allowing bets in $1 increments, to reduce the branching factor to a maximum of 250 possible choices each round rather than 25,000 choices. (Some comparisons with decipenny and penny trees suggest that this makes little difference since early on the bet amounts are identical and later on being able to make <$1 adjustments to bets yields only small gains relative to total wealth; see the Haskell implementation.)

Of course, even with mem­o­iza­tion and reduc­ing the branch­ing fac­tor, it’s still diffi­cult to com­pute the value of the opti­mal strat­egy all the way to b =300 because the expo­nen­tially increas­ing num­ber of strate­gies still need to be com­puted at least once. With this R imple­men­ta­tion, trees for up to b = 150 can be com­puted with 4GB RAM & ~16h.

The value func­tion for increas­ing bets:

vs <- sapply(1:150, function(b) { round(mV(25, b), digits=1) }); vs
#   [1]  30.0  36.0  43.2  45.4  48.0  53.7  54.6  57.6  61.8  62.6  66.8  68.2  70.6  74.1  75.0  79.1  79.7  82.5  84.7  86.2  89.8  90.3  93.8
#  [24]  94.5  97.0  98.8 100.3 103.2 103.9 107.2 107.6 110.3 111.4 113.3 115.2 116.4 119.0 119.6 122.4 122.9 125.3 126.2 128.1 129.5 130.8 132.8
#  [47] 133.6 136.0 136.5 138.8 139.4 141.4 142.3 143.8 145.2 146.3 148.0 148.8 150.6 151.3 153.1 153.8 155.5 156.3 157.7 158.8 159.9 161.2 162.1
#  [70] 163.6 164.2 165.8 166.4 167.9 168.5 169.9 170.7 171.8 172.8 173.7 174.8 175.6 176.8 177.5 178.7 179.3 180.5 181.1 182.3 182.9 184.0 184.7
#  [93] 185.6 186.4 187.2 188.1 188.8 189.8 190.3 191.3 191.9 192.8 193.4 194.3 194.9 195.7 196.3 197.1 197.8 198.4 199.1 199.7 200.5 201.0 201.8
# [116] 202.3 203.0 203.5 204.3 204.8 205.4 206.0 206.6 207.1 207.7 208.3 208.8 209.4 209.9 210.5 210.9 211.5 211.9 212.5 213.0 213.5 213.9 214.5
# [139] 214.9 215.4 215.8 216.3 216.8 217.2 217.6 218.0 218.5 218.9 219.3 219.7

So as the num­ber of bets esca­lates, our expected pay­off increases fairly quickly and we can get very close to the ceil­ing of $250 with canny bet­ting.

Monte Carlo tree evaluation

If we do not have enough RAM to expand the full decision tree to its terminal nodes, there are many ways to approximate it. A simple one is to expand the tree as many levels down as possible; then, if a terminal node is reached, do exact backwards induction, and otherwise, at each pseudo-terminal node, approximate the value function somehow and do backwards induction as usual. The deeper the expansion, the closer the approximation becomes to the exact value, while still doing optimal planning within the horizon.

One way to approximate it would be to run a large number of simulations (perhaps 100) taking random actions until a terminal node is hit, and take the mean of the total values as the estimate. This would be a Monte Carlo tree evaluation. (This forms the conceptual basis of the famous Monte Carlo tree search (MCTS) algorithm.) A random policy is handy because it can be used anywhere, but here we already know a good heuristic which does better than random: the KC. So we can use that instead. This gives us as much optimality as we can afford.

Set­ting up a sim­u­la­tion of the coin-flip­ping game which can be played with var­i­ous strate­gies:

game <- function(strategy, wealth, betsLeft) {
    if (betsLeft>0) {
    bet <- strategy(wealth, betsLeft)
    wealth <- wealth - bet
    flip <- rbinom(1,1,p=0.6)
    winnings <- 2*bet*flip
    wealth <- min(wealth+winnings, 250)
    return(game(strategy, wealth, betsLeft-1)) } else { return(wealth); } }
simulateGame <- function(s, w=25, b=300, iters=100) { mean(replicate(iters, game(s, w, b))) }

kelly <- function(w, b) { 0.20 * w }
smarterKelly <- function(w, b) { if(w==250) {0} else { (2*0.6-1) * w } }
random <- function(w, b) { sample(seq(0, w, by=0.1), 1) }

evaluate <- function(w,b) { simulateGame(smarterKelly, w, b) }
mevaluate <- memoise(evaluate)

mevaluate(25, 10)
# [1] 35.30544906
mevaluate(25, 100)
# [1] 159.385275
mevaluate(25, 300)
# [1] 231.4763619

As expected the KC can do well and is just as fast to com­pute as a ran­dom action, so using it will give much bet­ter esti­mates of the value func­tion for free.

With the Monte Carlo value func­tion set up, the orig­i­nal value func­tion can be slightly mod­i­fied to include a max­i­mum depth para­me­ter and to eval­u­ate using the MC value func­tion instead once that max­i­mum depth is hit:

f <- function(x, w, b, maxDepth) { 0.6*mV(min(w+x,250), b-1, maxDepth-1) + 0.4*mV(w-x, b-1, maxDepth-1)}
mf <- memoise(f)
V <- function(w, b, maxDepth) {
    ## once maxDepth levels have been expanded, fall back on the Monte Carlo estimate:
    if (maxDepth<=0) { mevaluate(w,b) } else {
    returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { mf(x,w,b, maxDepth) }) } else { w }
    max(returns) }}
mV <- memoise(V)

mV(25, 300, 100); mV(25, 300, 50)

Optimal next action

One is also curi­ous what the opti­mal strat­egy is, not just how much we can make with the opti­mal strat­egy given a par­tic­u­lar start­ing point. As defined, I can’t see how to make mV return the opti­mal choices along the way, since it has to explore all choices bot­tom-up and can’t know what is opti­mal until it has popped all the way back to the root, and state can’t be threaded in because that defeats the mem­o­iza­tion.

But then I remem­bered the point of com­put­ing value func­tions: given the (mem­o­ized) value func­tion for each state, we can then plan by sim­ply using V in a greedy plan­ning algo­rithm by ask­ing, at each step, for the value of each pos­si­ble action (which is mem­o­ized, so it’s fast), and returning/choosing the action with the max­i­mum value (which takes into account all the down­stream effects, includ­ing our use of V at each future choice):

VPplan <- function(w, b) {
    if (b==0) { return (0); } else {
    returns <- sapply(seq(0, w), function(wp) { mf(wp, w, b); })
      return (which.max(returns)-1); }
 }
mVPplan <- memoise(VPplan)
## sanity checks:
mVPplan(250, 0)
# [1] 0
mVPplan(250, 1)
# [1] 0
mVPplan(25, 3)
# [1] 25

It’s inter­est­ing that when b is very small, we want to bet every­thing on our first bet, because there’s not enough time to bother with recov­er­ing from loss­es; but as we increase b, the first bet will shrink & become more con­ser­v­a­tive:

firstAction <- sapply(1:150, function(b) { mVPplan(25, b) }); firstAction
#   [1] 25 25 25 10 10 18 17 18 14 14 14  9 10 11  7 11  7 10 10 10  9  6  9  7  9  8  9  8  7  8  7  8  7  8  7  8  7  7  7  7  7  6  7  6  7  6  7
#  [48]  6  6  6  6  6  6  6  6  6  6  6  6  5  6  5  6  5  6  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  4  5  4  5  4  5  4
#  [95]  5  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  3
# [142]  4  3  4  3  4  3  3  3  3

Optimizing

The fore­go­ing is inter­est­ing but the R imple­men­ta­tion is too slow to exam­ine the case of most impor­tance: b =300. The issue is not so much the intrin­sic diffi­culty of the prob­lem—s­ince the R imple­men­ta­tion’s RAM usage is mod­er­ate even at b = 150, indi­cat­ing that the bound­ary con­di­tions do indeed tame the expo­nen­tial growth & turn it into some­thing more like qua­dratic growth—but the slow­ness of com­put­ing. Pro­fil­ing sug­gests that most of the time is spent inside memoise:

# redefine to clear out any existing caching:
forget(mV)
Rprof(memory.profiling=TRUE, gc.profiling=TRUE, line.profiling=TRUE)
mV(w=25, b=9)
Rprof(NULL)
summaryRprof()

Most of the time is spent in func­tions which appear nowhere in our R code, like lapply or deparse, which points to the behind-the-scenes memoise. Mem­o­iza­tion should be fast because the deci­sion tree can be rep­re­sented as noth­ing but a mul­ti­di­men­sional array in which one does lookups for each com­bi­na­tion of w/b for a stored value V, and array lookups ought to be near-in­stan­ta­neous. But appar­ently there is enough over­head to dom­i­nate our deci­sion tree’s com­pu­ta­tion.
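
To illustrate (a sketch, not part of the original implementations): with bets in $1 increments the entire value function fits in a (cap+1) × (bets+1) matrix which can be filled bottom-up, so each state is computed exactly once and every lookup is a plain array index; this is essentially what the C implementation further below does, though it remains slow in interpreted R:

valueTable <- function(bets=300, cap=250, p=0.6) {
    V <- matrix(0, nrow=cap+1, ncol=bets+1)   # V[w+1, b+1] = value of wealth $w with b bets left
    V[,1] <- 0:cap                            # terminal condition: 0 bets left, value = wealth
    for (b in 1:bets) {
        V[1,     b+1] <- 0                    # bankrupt: stuck at $0 forever
        V[cap+1, b+1] <- cap                  # at the cap: optimal play stops betting
        for (w in 1:(cap-1)) {
            V[w+1, b+1] <- max(sapply(0:w, function(x) {
                p*V[min(w+x, cap)+1, b] + (1-p)*V[w-x+1, b] }))
        }
    }
    V }
## vt <- valueTable(300); vt[25+1, 300+1]   # should be ~246.2, matching the $1-increment Haskell result below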

R does not provide native hash tables; what one can do is use R's "environments" (which use hash tables under the hood, but are restricted to string keys, yes, really) as a poor man's hash table, using digest to serialize & hash arbitrary R objects into string keys which can then be inserted into an 'environment'. (I have since learned that there is a hashmap library, Nathan Russell's hashmap, which wraps a C++ hashmap implementation and should be usable.) The backwards induction remains the same and the hash-table can be updated incrementally without any problem (useful for switching to alternative methods like MCTS), so the rewrite is easy:

## crude hash table
library(digest)
## digest's default hash is MD5, which is unnecessarily slower, so use a faster hash:
d <- function (o) { digest(o, algo="xxhash32") }
lookup <- function(key) { get(d(key), tree) }
set <- function(key,value) { assign(envir=tree, d(key), value) }
isSet <- function(key) { exists(d(key), envir=tree) }
## example:
tree <- new.env(size=1000000, parent=emptyenv())
set(c(25,300), 50); lookup(c(25, 300))
# [1] 50

tree <- new.env(size=1000000, parent=emptyenv())
f <- function(x, w, b) {
    0.6*V(min(w+x,250), b-1) + 0.4*V(w-x, b-1) }
V <- function(w, b) {
    if (isSet(c(w,b))) { return(lookup(c(w,b))) } else {
      returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b) }) } else { return(w) }
      set(c(w,b), max(returns))
      return(lookup(c(w,b)))
      }
}
V(25,5)

While the over­head is not as bad as memoise, per­for­mance is still not great.

Python

Unmem­o­ized ver­sion (a lit­tle differ­ent due to float­ing point):

import numpy as np # for enumerating bet amounts as a float range
def F(w, x, b):
    return 0.6*V(min(w+x,250), b-1) + 0.4*V(w-x, b-1)
def V(w, b):
    if b>0 and w>0 and w<250:
        bets = np.arange(0, w, 0.1).tolist()
        returns = [F(w,x,b) for x in bets]
    else:
        returns = [w]
    return max(returns)
V(25,0)
# 25
V(25,1)
# 29.98
V(25,2)
# 35.967999999999996
V(25,3)

Python has hash-ta­bles built in as dicts, so one can use those:

import numpy as np
def F(w, x, b, dt):
    return 0.6*V(min(w+x,250), b-1, dt) + 0.4*V(w-x, b-1, dt)
def V(w, b, dt):
    if b>0 and w>0 and w<250:
        if (w,b) in dt:
            return dt[(w,b)]
        else:
            bets = np.arange(0, w, 0.1).tolist()
            returns = [F(w,x,b,dt) for x in bets]
    else:
        returns = [w]
    best = max(returns)
    dt[(w,b)] = best
    return best

dt = {}
V(25, 0, dt)
# 25

Haskell

Gurkenglas provides 2 inscrutably elegant & fast versions of the value function which work in GHCi:

(!!25) $ (!!300) $ (`iterate` [0..250]) $ \nextutilities -> (0:) $ (++[250]) $
    tail $ init $ map maximum $ (zipWith . zipWith) (\up down -> 0.6 * up + 0.4 * down)
     (tail $ tails nextutilities) (map reverse $ inits nextutilities)
-- 246.2080494949234
-- it :: (Enum a, Fractional a, Ord a) => a
-- (7.28 secs, 9,025,766,704 bytes)

:module + Control.Applicative
(!!25) $ (!!300) $ (`iterate` [0..250]) $ map maximum . liftA2 ((zipWith . zipWith)
    (\up down -> 0.6 * up + 0.4 * down)) tails (map reverse . tail . inits)
-- 246.2080494949234
-- it :: (Enum a, Ord a, Fractional a) => a
-- (8.35 secs, 6,904,047,568 bytes)

Not­ing:

iterate f x = x : iterate f (f x) does­n’t need to think much about mem­o­iza­tion. From right to left, the line reads “Given how much each money amount is worth when n games are left, you can zip together that list’s pre­fixes and suffixes to get a list of options for each money amount at (n+1) games left. Use the best for each, com­put­ing the new util­ity as a weighted aver­age of the pre­vi­ous. At 0 and 250 we have no options, so we must man­u­ally pre­scribe util­i­ties of 0 and 250. Induc­tively use this to com­pute util­ity at each time from know­ing that at the end, money is worth itself. Look at the util­i­ties at 300 games left. Look at the util­ity of $25.” Induc­tively gen­er­at­ing that whole list pre­scribes lazi­ness.

Since it’s Haskell, we can switch to com­piled code and use fast arrays.

Imple­men­ta­tions in Haskell have been writ­ten by Gurken­glas & nshep­perd using array-memoize & Data.Vector (re­spec­tive­ly):

import System.Environment (getArgs)
import Data.Function (iterate)
import Data.Function.ArrayMemoize (arrayMemo)

type Wealth = Int
type EV = Double

cap :: Wealth
cap = 250

value :: [Wealth -> EV]
value = iterate f fromIntegral where
  f next = arrayMemo (0, cap) $ \x -> maximum [go w | w <- [0..min (cap - x) x]] where
    go w = 0.6 * next (min cap (x+w)) + 0.4 * next (x-w)

main :: IO ()
main = do [x, b] <- map read <$> getArgs
          print $ value !! b $ x

The second one is more low-level and uses laziness/"tying the knot" to implement the memoization; it is the fastest and is able to evaluate memo 25 300 in <12s: the value turns out to be $246. (If we assume that all games either end in $250 or $0, then this implies that 246/250, or >98.4%, of decision-tree games hit the max, as compared to Haghani & Dewey's estimate that ~95% of KC players would.) The memoization is done internally, so to access all values we must do a little more logic within the lexical scope.

import System.Environment (getArgs)
import Data.Function (fix)
import Data.Function.Memoize (memoFix2)
import Data.Vector (Vector)
import qualified Data.Vector as V (generate, (!))
import qualified Data.MemoUgly as U (memo)

type Wealth = Int
type Bets = Int
type EV = Double

cap :: Wealth
cap = 250

value :: (Wealth -> Bets -> EV) -> Wealth -> Bets -> EV
value next 0 b = 0
value next w 0 = fromIntegral w
value next w b = maximum [go x | x <- [0..min (cap - w) w]]
  where
    go x = 0.6 * next (min cap (w+x)) (b-1) + 0.4 * next (w-x) (b-1)

-- There are 4 possible ways aside from 'array-memoize':
-- 1. Direct recursion.
direct :: Wealth -> Bets -> EV
direct = fix value

-- 2. Memoised recursion.
memo :: Wealth -> Bets -> EV
memo w b = cached_value w b
  where
    cached_value w b = table V.! b V.! w
    table :: Vector (Vector EV)
    table = V.generate (b + 1)
            (\b -> V.generate (cap + 1)
              (\w -> value cached_value w b))

-- 3. Memoised recursion using 'memoize' library; slower.
memo2 :: Wealth -> Bets -> EV
memo2 = memoFix2 value

-- 4. Memoize using 'uglymemo' library; also slower but global
memo3 :: Wealth -> Bets -> EV
memo3 = U.memo . direct

main :: IO ()
main = do [w, b] <- map read <$> getArgs
          print (memo w b)

We can obtain all val­ues with a func­tion like

memo3 :: [Wealth] -> [Bets] -> [EV]
memo3 ws bs = zipWith cached_value ws bs
  where
    cached_value w b = table V.! b V.! w
    table :: Vector (Vector EV)
    table = V.generate (maximum bs + 1)
            (\b -> V.generate (cap + 1)
              (\w -> value cached_value w b))

and eval­u­ate:

λ> map round (memo3 (repeat 25) [1..300])
-- [30,36,43,45,48,54,55,58,62,63,67,68,71,74,75,79,80,82,85,86,90,90,94,94,97,99,100,
-- 103,104,107,108,110,111,113,115,116,119,120,122,123,125,126,128,130,131,133,134,
-- 136,136,139,139,141,142,144,145,146,148,149,151,151,153,154,155,156,158,159,160,
-- 161,162,164,164,166,166,168,169,170,171,172,173,174,175,176,177,177,179,179,181,
-- 181,182,183,184,185,186,186,187,188,189,190,190,191,192,193,193,194,195,196,196,
-- 197,198,198,199,200,200,201,202,202,203,204,204,205,205,206,207,207,208,208,209,
-- 209,210,210,211,212,212,213,213,214,214,214,215,215,216,216,217,217,218,218,219,
-- 219,219,220,220,221,221,221,222,222,222,223,223,224,224,224,225,225,225,226,226,
-- 226,227,227,227,228,228,228,228,229,229,229,230,230,230,230,231,231,231,231,232,
-- 232,232,232,233,233,233,233,234,234,234,234,234,235,235,235,235,236,236,236,236,
-- 236,236,237,237,237,237,237,238,238,238,238,238,238,239,239,239,239,239,239,239,
-- 240,240,240,240,240,240,240,241,241,241,241,241,241,241,241,242,242,242,242,242,
-- 242,242,242,242,243,243,243,243,243,243,243,243,243,243,244,244,244,244,244,244,
-- 244,244,244,244,244,244,245,245,245,245,245,245,245,245,245,245,245,245,245,245,
-- 246,246,246,246,246,246,246,246,246,246,246,246,246]

Exact value function

We could go even fur­ther with zip [1..3000] (memo3 (repeat 25) [1..3000]) (which takes ~118s). Inter­est­ing­ly, the value stops increas­ing around b = 2150 (V = 249.99009946798637) and sim­ply repeats from there; it does not actu­ally reach 250.

This is probably because with a stake of $25 it is possible to go bankrupt even betting the minimal fixed amount of $1: by gambler's ruin, the probability of ever going bust from $25 with $1 bets at p = 0.6 is about (0.4/0.6)^25 ≈ 4 × 10^−5, and losing the $250 cap that often costs 250 × (0.4/0.6)^25 ≈ $0.0099, which matches the observed shortfall of $250 − $249.9901. (The original Haskell code is modeled after the R, which discretized this way for tractability; however, since the Haskell code is so fast, this is now unnecessary.) Being able to bet smaller amounts should then allow expected value to converge to $250 as exactly as can be computed, because the probability of going bankrupt after 2500 bets of $0.01 each is effectively zero (as far as R and Haskell will compute without special measures). We can go back and model the problem exactly as it was in the paper by simply multiplying everything by 100 & interpreting in pennies:

cap = 25000 -- 250*100
-- ...
λ> zip [1..3000] (memo3 (repeat (25*100)) [1..3000])
[(1,3000.0),(2,3600.0),(3,4320.0),(4,4536.000000000002),(5,4795.200000000003),
 (6,5365.440000000002),(7,5458.752000000004),(8,5757.350400000005),(9,6182.853120000005),(10,6260.488704000007),
 (11,6676.137676800007),(12,6822.331453440009),(13,7060.11133132801),(14,7413.298296422411),(15,7499.673019514893),
 (16,7913.219095166989),(17,7974.777338678492),(18,8250.07680018581),(19,8471.445742115113),(20,8625.387566014524),
 (21,8979.7747405993),(22,9028.029532170909),(23,9384.523360172316),(24,9449.401095461177),(25,9699.670042282964),
 (26,9882.821122181414),(27,10038.471393315525),(28,10323.078038637072),(29,10394.862038743217),(30,10760.507349554608),
 (31,10763.85850433795),(32,11040.414570571116),(33,11141.364854687306),(34,11337.607662912955),(35,11524.013477359924),
 (36,11648.117264906417),(37,11909.035644688667),(38,11968.56153182668),(39,12294.156320547045),(40,12296.062638839443),
 (41,12554.172237535688),(42,12628.174503649356),(43,12820.701630209007),(44,12962.820770637838),(45,13095.979243218648),
 (46,13298.241837094192),(47,13377.799870378874),(48,13632.949691989288),(49,13664.245374230446),(50,13935.591599024072),
 (51,13953.650302164471),(52,14168.318904539414),(53,14244.57145315508),(54,14407.574739209653),(55,14535.761032800987),
 (56,14651.76990919319),(57,14826.143029982914),(58,14899.500923102909),(59,15114.7924517843),(60,15149.530755496799),
 (61,15399.983335449066),(62,15400.771450460255),(63,15604.908915203496),(64,15652.268440952908),(65,15813.476931576255),
 (66,15903.186446379666),(67,16025.159151461572),(68,16152.79680966944),(69,16238.976728976708),(70,16400.466139297678),
 (71,16454.05880056233),(72,16645.646128447275),(73,16669.63235112316),(74,16880.735724497223),(75,16885.012958972715),
 (76,17058.916142797243),(77,17099.596358241968),(78,17239.339510457277),(79,17312.850756552616),(80,17421.275332110683),
 (81,17524.309847338853),(82,17604.068867939182),(83,17733.56645905442),(84,17787.13463287746),(85,17940.266786525535),
 (86,17969.95038247786),(87,18144.105153480315),(88,18152.051576292564),(89,18315.336828409476),(90,18333.026286703964),
 (91,18467.997509430665),(92,18512.510521892633),(93,18621.43451689773),(94,18690.183932686265),(95,18775.180977557586),
 (96,18865.765874427907),(97,18928.817685066606),(98,19039.0117965317),(99,19081.969208988696),(100,19209.70993405155),
 (101,19234.300290845116),(102,19377.678277281033),(103,19385.51250899971),(104,19524.210291319585),(105,19535.341194745335),
 (106,19651.63659936918),(107,19683.55258264968),(108,19779.19175398216),(109,19829.94117896368),(110,19906.559349360658),
 (111,19974.327332744302),(112,20033.4549154762),(113,20116.554995203987),(114,20159.623419373274),(115,20256.489653646044),
 (116,20284.836941737507),(117,20394.01642719967),(118,20408.892517902153),(119,20529.038285622184),(120,20531.61013289084),
 (121,20639.865300258658),(122,20652.830860566657),(123,20744.14900653099),(124,20772.415137446053),(125,20848.093322557295),
 (126,20890.24116222975),(127,20951.49506668366),(128,21006.203412595118),(129,21054.171837601196),(130,21120.21127127963),
 (131,21155.960430614512),(132,21232.187753960352),(133,21256.71536121999),(134,21342.06833189189),(135,21356.30748944987),
 (136,21449.7998427048),(137,21454.622738748447),(138,21545.893033861896),(139,21551.56090345761),(140,21629.60553083633),
 (141,21647.034539300425),(142,21712.842497605183),(143,21740.96793155338),(144,21795.467123757197),(145,21833.2961358927),
 (146,21877.356736609778),(147,21923.964087187007),(148,21958.401746908596),(149,22012.92577178374),(150,22038.504663866137),
 (151,22100.143459100647),(152,22117.579175397976),(153,22185.586988586107),(154,22195.549289624447),(155,22269.23310835155),
 (156,22272.348533907876),(157,22343.200726693787),(158,22347.9192078922),(159,22408.92268721723),(160,22422.211687202966),
 (161,22474.09110290414),(162,22495.183774650206),(163,22538.619509420547),(164,22566.800095953415),(165,22602.430698828903),
 (166,22637.031537177958),(167,22665.456036220217),(168,22705.854721234122),(169,22727.634820414445),(170,22773.2515209447),
 (171,22788.913686133077),(172,22839.208606334214),(173,22849.246045182404),(174,22903.71702393249),(175,22908.591564315902),
 (176,22966.44249475668),(177,22966.915677569094),(178,23017.91541566918),(179,23024.18913098028),(180,23068.1913376925),
 (181,23080.387557725626),(182,23117.90218718125),(183,23135.491081806776),(184,23166.99717629234),(185,23189.48394853381),
 (186,23215.431334191013),(187,23242.35418014611),(188,23263.165077204263),(189,23294.093255008785),(190,23310.163806371635),
 (191,23344.695808912184),(192,23356.397530793947),(193,23394.159357087876),(194,23401.840515265318),(195,23442.484035635396),
 (196,23446.47095075504),(197,23489.640293583914),(198,23490.270646383346),(199,23529.46834312825),(200,23533.224741609163),
 (201,23567.312697440484),(202,23575.321437418483),(203,23604.666095269786),(204,23616.551745369172),(205,23641.497567737664),
 (206,23656.90925341186),(207,23677.779921209272),(208,23696.38990746728),(209,23713.489458166325),(210,23734.991807798215),
 (211,23748.60571568506),(212,23772.71501926872),(213,23783.111220502426),(214,23809.56139463541),(215,23816.991259707916),
 (216,23845.534410064567),(217,23850.233666149958),(218,23880.639012115425),(219,23882.82861769509),(220,23914.391375329913),
 (221,23914.76844952492),(222,23942.714790600083),(223,23946.0474787004),(224,23970.438187949254),(225,23976.66184026535),
 (226,23997.74934793851),(227,24006.60933420138),(228,24024.631040582935),(229,24035.889282584474),(230,24051.068374275732),
 (231,24064.50239633),(232,24077.048621078735),(233,24092.450650947027),(234,24102.561052984205),(235,24119.737170755707),
 (236,24127.59678851838),(237,24146.366121052237),(238,24152.148649090945),(239,24172.342607735227),(240,24176.211024526496),
 (241,24197.672583935127),(242,24199.779747244538),(243,24222.35471729034),(244,24222.851974583486),(245,24243.93438158678),
 (246,24245.42607879143),(247,24263.972996501543),(248,24267.50154423283),(249,24283.6900658947),(250,24289.078871384656),
 (251,24303.07565677869),(252,24310.159487219513),(253,24322.121319078917),(254,24330.74566159496),(255,24340.819980440137),
 (256,24350.840429290023),(257,24359.165841053073),(258,24370.447517349417),(259,24377.154275094566),(260,24389.571277415347),
 (261,24394.781738403053),(262,24408.216622744527),(263,24412.045682031203),(264,24426.388969625325),(265,24428.94447133704),
 (266,24444.09418292577),(267,24445.477310292874),(268,24461.32346887292),(269,24461.644170708976),(270,24476.42136455785),
 (271,24477.445726085083),(272,24490.501181704694),(273,24492.883289818776),(274,24504.335944986862),(275,24507.958757514363),
 (276,24517.920620087607),(277,24522.6745531501),(278,24531.251041416166),(279,24537.03357887482),(280,24544.32387659046),
 (281,24551.039168217816),(282,24557.136561611253),(283,24564.69504250772),(284,24569.687240115927),(285,24578.005270307505),
 (286,24581.974706479406),(287,24590.97422968356),(288,24593.998352542163),(289,24603.606573136858),(290,24605.75811775705),
 (291,24615.907195034095),(292,24617.254442557904),(293,24627.881197714756),(294,24628.488224763427),(295,24639.274079225856),
 (296,24639.46077884001),(297,24649.128469095107),(298,24650.173797856685),(299,24658.72751316172),(300,24660.629317974468)
 ...
 (1884,24999.999999999614),(1885,24999.99999999963),(1886,24999.999999999643),(1887,24999.999999999658),(1888,24999.99999999967),
 (1889,24999.99999999968),(1890,24999.999999999694),(1891,24999.9999999997),(1892,24999.999999999716),(1893,24999.999999999727),
 (1894,24999.999999999738),(1895,24999.99999999975),(1896,24999.99999999976),(1897,24999.999999999767),(1898,24999.99999999978),
 (1899,24999.99999999979),(1900,24999.9999999998),(1901,24999.99999999981),(1902,24999.999999999818),(1903,24999.999999999825),
 (1904,24999.999999999833),(1905,24999.999999999844),(1906,24999.999999999854),(1907,24999.99999999986),(1908,24999.99999999987),
 (1909,24999.999999999876),(1910,24999.999999999884),(1911,24999.99999999989),(1912,24999.999999999898),(1913,24999.999999999905),
 (1914,24999.999999999913),(1915,24999.99999999992),(1916,24999.999999999924),(1917,24999.999999999927),(1918,24999.999999999935),
 (1919,24999.99999999994),(1920,24999.99999999995),(1921,24999.99999999995),(1922,24999.999999999956),(1923,24999.999999999964),
 (1924,24999.999999999967),(1925,24999.99999999997),(1926,24999.999999999978),(1927,24999.999999999978),(1928,24999.999999999985),
 (1929,24999.99999999999),(1930,24999.999999999993),(1931,24999.999999999993),(1932,25000.0),(1933,25000.0),
 (1934,25000.0),(1935,25000.0),(1936,25000.0),(1937,25000.0),(1938,25000.0),
 (1939,25000.0),(1940,25000.0),(1941,25000.0),(1942,25000.0),(1943,25000.0),
 (1944,25000.0),(1945,25000.0),(1946,25000.0),(1947,25000.0),(1948,25000.0),
 (1949,25000) ... ]

With the exact penny ver­sion, the value of b =300 is now $246.6062932 (tak­ing ~3746s com­piled), so the approx­i­ma­tion costs an expected value of only $0.61. The full run reveals that con­ver­gence hap­pens at b = 1932 (which can be com­puted in ~48503s com­piled), past which there is no need to com­pute fur­ther.

C/C++

Feep­ingCrea­ture pro­vides a vec­tor­ized C imple­men­ta­tion of the exact penny game that runs in a few min­utes (ben­e­fit­ing ~13% from pro­file-guided opti­miza­tion when com­piled with GCC). He also notes that memoization/dynamic pro­gram­ming is overkill for this prob­lem: because almost all states ulti­mately get eval­u­at­ed, the dynamic pro­gram­ming does­n’t avoid eval­u­at­ing many states. Since each state depends on an imme­di­ate suc­ces­sor and forms a clean tree, it is pos­si­ble to cal­cu­late the tree much more directly by eval­u­at­ing all the ter­mi­nal nodes b =0, loop­ing over the pre­vi­ous nodes b+1, then throw­ing away the no longer nec­es­sary b, and eval­u­at­ing b+2, and so on; only two lay­ers of the tree need to be stored at any time, which can poten­tially fit into on-proces­sor cache & exhibit far bet­ter cache local­i­ty. Another advan­tage of this rep­re­sen­ta­tion is par­al­lelism, as each node in the upper layer can be eval­u­ated inde­pen­dent­ly. An imple­men­ta­tion using OpenMP for thread­ing can solve b =300 in ~8s:

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <math.h>

#define ROUND_LIMIT 300
#define MONEY_LIMIT 25000
#define MONEY_START 2500
#define MONEY_FACTOR 100.0f

// #define DATATYPE float
// #define F(X) X ## f
#define DATATYPE double
#define F(X) X

#define THREADS 8

// MEMOIZED METHOD

typedef struct {
  bool known;
  DATATYPE ev;
} CacheEntry;

static DATATYPE decide(CacheEntry * const table, int money, int round) {
  if (round < 0) return money;
  if (money == 0 || money == MONEY_LIMIT) return money;
  int index = round * (MONEY_LIMIT + 1) + money;
  // int index = money * ROUND_LIMIT + round;

  if (table[index].known) return table[index].ev;

  DATATYPE best_bet_score = -1;
  for (int bet = 0; bet <= money; bet++) {
    DATATYPE winres = decide(table, (money + bet > MONEY_LIMIT) ? MONEY_LIMIT : (money + bet), round - 1);
    DATATYPE loseres = decide(table, money - bet, round - 1);
    DATATYPE combined = F(0.6) * winres + F(0.4) * loseres;
    best_bet_score = F(fmax)(best_bet_score, combined);
  }
  table[index] = (CacheEntry) { true, best_bet_score };
  return best_bet_score;
}

void method1() {
  CacheEntry *table = calloc(sizeof(CacheEntry), ROUND_LIMIT * (MONEY_LIMIT + 1));

  printf("running.\n");
  int best_bet = 0;
  DATATYPE best_bet_score = -1;
  for (int bet = 0; bet <= MONEY_START; bet++) {
    DATATYPE winres = decide(table, MONEY_START + bet, ROUND_LIMIT - 1);
    DATATYPE loseres = decide(table, MONEY_START - bet, ROUND_LIMIT - 1);
    DATATYPE combined = F(0.6) * winres + F(0.4) * loseres;
    if (combined > best_bet_score) {
      best_bet = bet;
      best_bet_score = combined;
    }
  }

  printf("first round: bet %f for expected total reward of %f\n", best_bet / MONEY_FACTOR, best_bet_score / MONEY_FACTOR);
  int count_known = 0;
  for (int i = 0; i < ROUND_LIMIT * (MONEY_LIMIT + 1); i++) {
    if (table[i].known) count_known ++;
  }
  printf("known: %i of %i\n", count_known, ROUND_LIMIT * (MONEY_LIMIT + 1));
}

// BOTTOM-UP METHOD

// given pass n in "from", compute pass n-1 in "to"
static void propagate(DATATYPE *from, DATATYPE *to) {
  // for each money state...
#pragma omp parallel for num_threads(THREADS)
  for (int a = 0; a < THREADS; a++) {
    // distribute workload so that inner loop passes are approximately equal
    // f(x) = 2x  F(x) = x^2   = a/THREADS x = sqrt(a / THREADS)
    int low = (int) (MONEY_LIMIT * sqrtf((float)a / THREADS));
    int high = (int) (MONEY_LIMIT * sqrtf((a + 1.0f) / THREADS));
    if (a == THREADS - 1) high = MONEY_LIMIT + 1; // precise upper border
    for (int m = low; m < high; m++) {
      DATATYPE max_score = 0.0;
      // determine the bet that maximizes our expected earnings
      // to enable vectorization we break the loop into two so that each half only gets one branch of the conditional
      // DATATYPE winval = from[(m + b > MONEY_LIMIT) ? MONEY_LIMIT : (m + b)];
      int high_inner = (m + m > MONEY_LIMIT) ? (MONEY_LIMIT - m) : m;
      for (int b = 0; b <= high_inner; b++) {
        DATATYPE winval = from[m + b];
        DATATYPE loseval = from[m - b];
        DATATYPE combined = F(0.6) * winval + F(0.4) * loseval;
        max_score = F(fmax)(max_score, combined);
      }
      for (int b = high_inner + 1; b <= m; b++) {
        DATATYPE winval = from[MONEY_LIMIT];
        DATATYPE loseval = from[m - b];
        DATATYPE combined = F(0.6) * winval + F(0.4) * loseval;
        max_score = F(fmax)(max_score, combined);
      }
      to[m] = max_score;
    }
  }
}

void method2() {
  DATATYPE *buf = malloc(sizeof(DATATYPE) * (MONEY_LIMIT + 1));
  DATATYPE *buf_to = malloc(sizeof(DATATYPE) * (MONEY_LIMIT + 1));
  // set up base case: making no bet, we have money of i
  for (int i = 0; i <= MONEY_LIMIT; i++) buf[i] = i;
  for (int i = ROUND_LIMIT - 1; i >= 0; i--) {
    propagate(buf, buf_to);
    DATATYPE *temp = buf;
    buf = buf_to;
    buf_to = temp;
  }
  int best_bet = 0;
  DATATYPE best_bet_score = -1;
  for (int b = 0; b <= MONEY_START; b++) {
    DATATYPE winval = buf[MONEY_START + b];
    DATATYPE loseval = buf[MONEY_START - b];
    DATATYPE combined = 0.6 * winval + 0.4 * loseval;
    if (combined > best_bet_score) {
      best_bet_score = combined;
      best_bet = b;
    }
  }
  printf("first round: bet %f for expected total reward of %f\n", best_bet / MONEY_FACTOR, best_bet_score / MONEY_FACTOR);
}

int main() {
  method2();
  return 0;
}

Some quick bench­mark­ing on my 2016 Lenovo ThinkPad P70 lap­top with 4 phys­i­cal cores & 8 vir­tual cores (In­tel Xeon E3-1505M v5 Proces­sor 8MB Cache / up to 3.70GHz) shows a roughly lin­ear speedup from 1 to 3 threads and then small gains there­after to a min­i­mum of 7.2s with 8 threads:

for C in `ls coingame-*.c`; do
    gcc -Wall -Ofast -fwhole-program -march=native -fopenmp $C -o coingame -lm &&
    echo "$C" &&
    (for i in {1..10}; do
        /usr/bin/time -f '%e' ./coingame; done
    );
done

For lev­els of par­al­lelism on my lap­top from 1–8 threads, the aver­age times are: 28.98s/15.44s/10.70s/9.22s/9.10s/8.4s/7.66s/7.20s.

Caio Oliveira pro­vides a par­al­lel C++ imple­men­ta­tion using std::async; com­piled with the same opti­miza­tions on (g++ -Ofast -fwhole-program -march=native -pthread --std=c++11 kc.cpp -o kc), it runs in ~1m16s:

#include <iostream>
#include <vector>
#include <map>
#include <set>
#include <bitset>
#include <list>
#include <stack>
#include <queue>
#include <deque>
#include <string>
#include <sstream>
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <future>
#include <ctime>

using namespace std;

const int MAX_ROUND = 300;
const int MAX_WEALTH = 25000;
const int BETS_DELTA = 1;

double memo[MAX_WEALTH + 1][2];

inline double vWin(int wealth, int bet, int roundIdx) {
    if(wealth + bet < MAX_WEALTH) {
        return memo[wealth + bet][!roundIdx];
    }
    else {
        return MAX_WEALTH;
    }
}

inline double vLose(int wealth, int bet, int roundIdx) {
    if (wealth - bet == 0) {
        return 0;
    }
    else {
        return memo[wealth - bet][!roundIdx];
    }
}

inline double v(int wealth, int bet, int roundIdx) {
    return .6 * vWin(wealth, bet, roundIdx) + .4 * vLose(wealth, bet, roundIdx);
}

template<typename RAIter>
void calc_round(RAIter beg, RAIter end, int roundIdx){
    int len = distance(beg, end);

    if(len < 1000) {
        for(RAIter p = beg; p != end; ++p) {
            int wealth =  distance(memo, p);
            (*p)[roundIdx] = v(wealth, wealth, roundIdx);

            for (int bet = 0; bet < wealth; bet += BETS_DELTA) {
                memo[wealth][roundIdx] = max(memo[wealth][roundIdx], v(wealth, bet, roundIdx));
            }
        }
    }
    else {
        RAIter mid = beg + len/2;
        future<void> handle = async(launch::async, calc_round<RAIter>, mid, end, roundIdx);
        calc_round(beg, mid, roundIdx);
    }
}

double calc_table() {
    bool roundIdx = 0;

    for(int i = 0; i <= MAX_WEALTH ; ++i) {
        memo[i][!roundIdx] = i;
    }

    for(int round = MAX_ROUND - 1; round >= 0; --round) {
        calc_round(memo, memo + MAX_WEALTH, roundIdx);
        roundIdx ^= 1;
    }

    return memo[2500][!roundIdx] / 100.;
}

int main() {
    ios_base::sync_with_stdio(0);
    cin.tie(0);
    clock_t begin = clock();
    double res = calc_table();
    clock_t end = clock();
    cout << "result: " << res << "(elapsed: " << (double(end - begin) / CLOCKS_PER_SEC) << ")" << endl;
}

Given the locality of the access patterns, I have to wonder how well a GPU implementation might perform. It doesn't seem amenable to being expressed as a large matrix multiplication, but each GPU core could be responsible for evaluating a terminal node and do nearby lookups from there. (And the use in finance of GPUs to compute binomial/lattice option-pricing models, which are similar, suggests it should be possible.)

Exact formula

Arthur B. notes that the value function can be computed directly if you allow bets to be any fraction of w ("It's piecewise linear, the cutoff points and slopes follow a very predictable pattern"1), and provides a formula for calculating it without constructing a decision tree:
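
In notation (a transcription of the R implementation below, where m = $250 is the cap, F_{b,1/2} is the Binomial(b, 1/2) CDF, and F_{b,1/2}^{-1} its quantile function, i.e. qbinom):

V(w, b) = 1.2^{b} \cdot 1.5^{-j} \left( w + \frac{m}{2} \sum_{i=0}^{j} 1.5^{\,j-i} \, F_{b,1/2}(i-1) \right), \qquad j = F_{b,1/2}^{-1}\!\left(\frac{w}{m}\right)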

A short ver­sion in R:

V <- function(w, b, m=250) {
    b <- round(b) # rounds left must be integer
    j <- qbinom(w/m, b, 0.5)
    1.2^b * 1.5^-j * (w + (m/2 * sum(1.5^(j:0) * pbinom(0:j-1,b,1/2)))) }

(This is suffi­ciently fast for all val­ues tested that it is not worth mem­o­iz­ing using the rel­a­tively slow memoise.)
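
For example, some quick sanity checks (assuming the default cap m = $250):

V(25, 0)    # == 25: with no bets left, the value is the current wealth
V(25, 1)    # == 30: bet everything on the one remaining flip
V(25, 300)  # should be ~246.6, agreeing with the exact decision-tree value computed above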

Re-im­ple­ment­ing in Python is a lit­tle tricky because NumPy/SciPy use differ­ent names and Python does­n’t sup­port R’s range nota­tion, but the fol­low­ing seems like it works; and then with a value func­tion, we can imple­ment a plan­ner to choose opti­mal bet­ting amounts, mem­o­ize it with repoze.lru for time-sav­ings over mul­ti­ple games, and then it is sim­ple to set up a Gym envi­ron­ment and start play­ing the Kelly coin-flip game opti­mally (and indeed usu­ally reach­ing the $250 lim­it):

from scipy.stats import binom
import numpy as np
from repoze.lru import lru_cache

def V(w, b, m=250):
    if w>=250:
      return 250
    if w<=0:
      return 0
    if b==0:
      return w
    else:
      try:
        j = binom.ppf(float(w)/float(m), b, 0.5)
        return 1.2**b * 1.5**-j * (w + m/2 *
            sum(np.multiply(binom.cdf(map(lambda(x2):x2-1, range(0,int(j+1))),b,0.5),
                map(lambda(x): 1.5**x, list(reversed(range(0, int(j+1))))))))
      except ValueError:
        print "Error:", (w,b,m)

@lru_cache(None)
def VPplan(w, b):
    # optimization: short-circuit
    if w<=0 or w>=250:
      return 0
    else:
      if b==0:
        return w
      else:
          possibleBets = map(lambda(pb): float(pb)/100.0, range(0*100,int((w*100)+1),1))
          returns = map(lambda(pb): 0.6*V(w+pb, b-1) + 0.4*V(w-pb,b-1), possibleBets)
          return float(returns.index(max(returns)))/100.0
import gym
env = gym.make('KellyCoinflip-v0')

## play 500 games and calculate mean reward:
rewards = []
for n in range(0,500):
    done = False
    reward = 0
    while not done:
        w = env._get_obs()[0][0]
        b = env._get_obs()[1]
        bet = VPplan(w, b)
        results = env._step(bet*100)
        print n, w, b, bet, "results:", results
        reward = reward+results[1]
        done = results[2]
    rewards.append(reward)
    env._reset()

print sum(rewards)/len(rewards)
# 247.525564356

Graphing the Value function

What does the range of pos­si­ble bets in each state look like? Does it look bitonic and like an inverted V one could do binary search on? We can sam­ple some ran­dom states, com­pute all the pos­si­ble bets, and plot them by index to see if the pay­offs form curves:

# return all actions' values instead of the max action:
VPplans <- function(w, b) {
    if (b==0) { return (0); } else {
    returns <- sapply(seq(0, w), function(wp) { mf(wp, w, b); })
      return(returns); } }
plans <- data.frame()
n <- 47
for (i in 1:n) {
  # omit the uninteresting cases of w=0/250 and b=0/300:
  w <- round(runif(n=1, 0+1, 250-1))
  b <- round(runif(n=1, 0+1, 300-1))
  plan <- VPplans(w,b)
  plans <- rbind(plans, data.frame(W=as.factor(w), B=as.factor(b), Bet=1:(length(plan)), Value=plan))
}
library(ggplot2)
p <- qplot(x=Bet, y=Value, data=plans)
p + facet_wrap(~W) + geom_line(aes(x=Bet, y=Value, color=W), size = 1) + theme(legend.position = "none")

For the most part they do gen­er­ally fol­low a qua­dratic or invert­ed-V like shape, with the excep­tion of sev­eral which wig­gle up and down while fol­low­ing the over­all curve (nu­meric or approx­i­ma­tion error?).

47 ran­dom­ly-sam­pled states with all legal bets’ exact value plot­ted, show­ing gen­er­ally smooth curves

With an exact value func­tion, we can also visu­al­ize it to get an idea of how the wealth/rounds inter­act and how diffi­cult it would be to approx­i­mate it. We can gen­er­ate many iter­ates, inter­po­late, and graph:

n <- 50000
w <- runif(n, min=0, max=250)
b <- sample(0:250, n, replace=TRUE)
v <- unlist(Map(V, w, b))
df <- data.frame(Wealth=w, Rounds=b, Value=v)

library(akima)
library(plot3D)
dfM <- interp(df$Wealth, df$Rounds, df$Value, duplicate="strip")
par(mfrow=c(1,2), mar=c(0,0,0,0))
persp(dfM$x, dfM$y, dfM$z, xlab="Wealth", ylab="Rounds left", zlab="Value", ticktype="detailed")
scatter3D(df$Wealth, df$Rounds, df$Value, theta=-25, xlab="Wealth", ylab="Rounds left", zlab="Value", ticktype="detailed")
3D plots of the value function varying with wealth available to bet & rounds left to bet in, in order to reach the $250 maximum payout

Approximating the Exact Value Function

The value func­tion winds up look­ing sim­ple, a smooth land­scape tucked down at the edges, which is per­haps not too sur­pris­ing, as the opti­mal bet­ting pat­terns tend to be small amounts except when there are very few rounds left (and one should bet most or all of one’s wealth) or one is near­ing the max­i­mum wealth cap (and should very grad­u­ally glide in).

A value func­tion (or the pol­icy func­tion) com­puted by deci­sion trees is bulky and hard to carry around, weigh­ing per­haps giga­bytes in size. Look­ing at this value func­tion, it’s clear that we can do model dis­til­la­tion and eas­ily approx­i­mate it by another algo­rithm like ran­dom forests or neural net­works. (Usu­al­ly, a NN has a hard time learn­ing a value func­tion because it’s a rein­force­ment learn­ing prob­lem where it’s unclear how actions are linked to rewards, but in this case, since we can already cal­cu­late it, it becomes the far eas­ier set­ting of super­vised learn­ing in map­ping a state to a value/policy.)

A random forest works fine, getting an error of ~0.4:

library(randomForest)
r <- randomForest(Value ~ Wealth + Rounds, ntree=1000, data=df); r
# ...
#           Mean of squared residuals: 1.175970325
#                     % Var explained: 99.94
sqrt((predict(r) - df$Value)^2)
# [1] 0.407913389

It can also be approximated with an NN. For example, in R one can use MXNet, and a 1×30 or 2×152 ReLU feedforward NN will solve it to <4 RMSE:

## very slow
library(neuralnet)
n <- neuralnet(Value ~ Wealth + Rounds, data=df, hidden=c(20,20))
# prediction(n)

## GPU-accelerated:
library(mxnet)
mx.set.seed(0)
## 3×30 ReLU FC NN for linear regression with RMSE loss:
data <- mx.symbol.Variable("data")
perLayerNN <- 30
fc1     <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=perLayerNN)
act1    <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2     <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=perLayerNN)
act2    <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3     <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=perLayerNN)
act3    <- mx.symbol.Activation(fc3, name="relu3", act_type="relu")
finalFc <- mx.symbol.FullyConnected(act3, num_hidden=1)
lro <- mx.symbol.LinearRegressionOutput(finalFc)
model <- mx.model.FeedForward.create(lro, X=as.matrix(subset(df,select=c("Wealth","Rounds"))), y=df$Value,
    ctx=mx.gpu(),     num.round=50000,
    learning.rate=5e-6, eval.metric=mx.metric.rmse, array.layout="rowmajor")

Some experimenting with fitting polynomial regressions did not give any decent fits for less than quintic, but probably some combination of logs and polynomials could compactly encode the value function.
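
For instance, one hypothetical starting point would be a raw quintic polynomial surface over wealth & rounds, fit to the df sample from before and checked by in-sample RMSE (a sketch, not a tuned fit):

pm <- lm(Value ~ polym(Wealth, Rounds, degree=5, raw=TRUE), data=df)
sqrt(mean(residuals(pm)^2)) # in-sample RMSE of the quintic surface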

Simulation Performance

The implementations here could be wrong, or decision trees not actually optimal as claimed. We can test this by directly simulating the coin-flipping game and comparing the performance of mVPplan vs the 20% Kelly criterion and a simple bet-everything strategy:

library(memoise)
f <- function(x, w, b) { 0.6*V(min(w+x,250), b-1) + 0.4*V(w-x, b-1)}
mf <- memoise(f)
V <- function(w, b, m=250) { j <- qbinom(w/m,b,1/2);
    1.2^b * 1.5^-j * (w+m/2 * sum(1.5^(j:0) * pbinom(0:j-1,b,1/2))) }
VPplan <- function(w, b) {
    if (b==0) { return (0); } else {
    returns <- sapply(seq(0, w), function(wp) { mf(wp, w, b); })
      return (which.max(returns)-1); }
 }
mVPplan <- memoise(VPplan)

game <- function(strategy, wealth, betsLeft) {
    if (betsLeft>0) {
    bet <- strategy(wealth, betsLeft)
    wealth <- wealth - bet
    flip <- rbinom(1,1,p=0.6)
    winnings <- 2*bet*flip
    wealth <- min(wealth+winnings, 250)
    return(game(strategy, wealth, betsLeft-1)) } else { return(wealth); } }
simulateGame <- function(s, w=25, b=300, iters=5000) { mean(replicate(iters, game(s, w, b))) }

## Various strategies (the decision tree strategy is already defined as 'mVPplan'):
kelly <- function(w, b) { 0.20 * w }
smarterKelly <- function(w, b) { if(w==250) {0} else { 0.2 * w } }
maximizer <- function(w, b) { w; }
smarterMaximizer <- function(w, b) { if(w>=250) { 0 } else {w}; }

## examining a sample of how wealth evolves over the course of a sample of b=300 games:
library(parallel)
library(plyr)
i <- 56
games <- ldply(mclapply(1:i, function(ith) {
    b <- 300
    wealth <- 25
    wealths <- numeric(b)
    while (b>0) {
        bet <- mVPplan(wealth, b)
        wealth <- wealth-bet
        flip <- rbinom(1,1,p=0.6)
        winnings <- 2*bet*flip
        wealth <- min(wealth+winnings, 250)
        wealths[b] <- wealth
        b <- b-1
    }
    return(data.frame(Game=ith, Bets=300:1, Wealth=wealths)) }
))
library(ggplot2)
p <- qplot(x=Bets, y=Wealth, data=games)
p + facet_wrap(~Game) + geom_line(aes(x=Bets, y=Wealth), size = 1) + theme(legend.position = "none")

Simulating 56 games, we can see that while the optimal strategy looks fairly risky, it usually wins in the end:

Simulation of winnings over a sample of b =300 games while using the optimal strategy.

We can also examine how the strategy does for all the other possible horizons and compare with the Kelly criterion.

## Comparing performance in the finite horizon setting up to b=300:
bs <- 1:300
ldply(mclapply(bs, function(bets) {
    ky <- round(simulateGame(smarterKelly, b=bets), digits=1)
    dt <- round(simulateGame(mVPplan, b=bets), digits=1)
    gain <- dt-ky;
    print(paste(bets, round(V(25, bets), digits=2), dt, ky, gain)) }
))

Simulating 5,000 games for each b:

Total bets Value function Decision tree performance Kelly performance Difference
1 30 30.1 26 4.1
2 36 35.5 27 8.5
3 43.2 42.4 28.2 14.2
4 45.36 46 29.3 16.7
5 47.95 49.3 30.5 18.8
6 53.65 52.5 31.5 21
7 54.59 51.4 32.7 18.7
8 57.57 56.1 34.1 22
9 1.83 61.4 35.6 25.8
10 62.6 63 36.8 26.2
11 66.76 68.6 38.3 30.3
12 68.22 66.9 40.2 26.7
13 70.6 70.7 41.8 28.9
14 74.13 71.2 43.2 28
15 75 73.8 44.9 28.9
16 79.13 79.7 46.8 32.9
17 79.75 79.1 47.8 31.3
18 82.5 84.1 50.1 34
19 84.71 83.7 52.7 31
20 86.25 85.6 54.2 31.4
21 89.8 88.6 56.4 32.2
22 90.28 90.2 56.5 33.7
23 93.85 94.8 58.7 36.1
24 94.49 93.7 60.7 33
25 97 95 62.2 32.8
26 98.83 97.1 65.2 31.9
27 100.38 98.7 68.1 30.6
28 103.23 102.1 69.4 32.7
29 103.95 102.5 73.4 29.1
30 107.61 107.7 73.7 34
31 107.64 107.3 74.7 32.6
32 110.4 106.7 76.1 30.6
33 111.41 107.9 79.2 28.7
34 113.38 115.1 80.6 34.5
35 115.24 115.4 82.2 33.2
36 116.48 116.1 84.2 31.9
37 119.09 118.5 87.1 31.4
38 119.69 119.9 86 33.9
39 122.94 125.4 89.6 35.8
40 122.96 119.7 92 27.7
41 125.54 124.2 95.2 29
42 126.28 124.4 96.9 27.5
43 128.21 128.5 97.1 31.4
44 129.63 130.5 100.3 30.2
45 130.96 130.1 100 30.1
46 132.98 131.9 100.9 31
47 133.78 132.4 104.2 28.2
48 136.33 134.2 104.5 29.7
49 136.64 133.6 107.7 25.9
50 139.36 141.2 110.1 31.1
51 139.54 139.4 113.6 25.8
52 141.68 140 113 27
53 142.45 140.7 113.5 27.2
54 144.08 141.8 115.5 26.3
55 145.36 145 116.3 28.7
56 146.52 150.2 119.7 30.5
57 148.26 146.4 119.7 26.7
58 149 143.9 120.6 23.3
59 151.15 151.8 124.2 27.6
60 151.5 148.5 124.4 24.1
61 154.01 151.9 125.5 26.4
62 154.01 150.9 127.1 23.8
63 156.05 157.9 128.4 29.5
64 156.52 154.3 129.9 24.4
65 158.14 155.5 132.3 23.2
66 159.03 156.8 132.1 24.7
67 160.25 157.4 133.2 24.2
68 161.53 159.2 137.1 22.1
69 162.39 160.2 135.9 24.3
70 164 161 137.8 23.2
71 164.54 162.2 137.8 24.4
72 166.46 166.7 138.3 28.4
73 166.7 165.2 142.7 22.5
74 168.81 169 145 24
75 168.85 168.3 143.1 25.2
76 170.59 169.5 144.4 25.1
77 171 165.6 146.4 19.2
78 172.39 171.2 147.5 23.7
79 173.13 171.3 150.6 20.7
80 174.21 170.6 151.8 18.8
81 175.24 174 152.3 21.7
82 176.04 175.5 153.8 21.7
83 177.34 174.9 151.8 23.1
84 177.87 177.6 152.5 25.1
85 179.4 178.5 157.3 21.2
86 179.7 177.1 156 21.1
87 181.44 178.9 158.1 20.8
88 181.52 179.6 160.1 19.5
89 183.15 181.1 159 22.1
90 183.33 182.8 163.3 19.5
91 184.68 184.2 162.3 21.9
92 185.13 183.4 162.5 20.9
93 186.21 187.5 165.1 22.4
94 186.9 185.3 160.5 24.8
95 187.75 188.6 164.8 23.8
96 188.66 186.4 167.1 19.3
97 189.29 187.6 168 19.6
98 190.39 188.9 167.7 21.2
99 190.82 187.8 169.8 18
100 192.1 190.7 168.4 22.3
101 192.34 192.5 171.8 20.7
102 193.78 192.6 170 22.6
103 193.86 193.2 170.7 22.5
104 195.24 194.1 170 24.1
105 195.35 192.9 174.1 18.8
106 196.52 195.2 176.8 18.4
107 196.84 194.5 173.4 21.1
108 197.79 194.4 179.1 15.3
109 198.3 195.5 176 19.5
110 199.07 196.7 179.1 17.6
111 199.74 198.7 181.2 17.5
112 200.34 201.1 178.2 22.9
113 201.17 197.9 180.9 17
114 201.6 200.3 181.2 19.1
115 202.57 202 183.2 18.8
116 202.85 201.6 181 20.6
117 203.94 201.7 181.4 20.3
118 204.09 201.2 183.6 17.6
119 205.29 205.9 185 20.9
120 205.32 201.3 186.8 14.5
121 206.4 204 182.2 21.8
122 206.53 203.7 186.2 17.5
123 207.44 205.7 186.1 19.6
124 207.72 205.2 189.5 15.7
125 208.48 203.9 191.4 12.5
126 208.9 209.3 188 21.3
127 209.52 206.7 187.7 19
128 210.06 209.5 188.5 21
129 210.54 206.5 192.4 14.1
130 211.2 211.1 190.9 20.2
131 211.56 207.1 195.6 11.5
132 212.32 210.3 194 16.3
133 212.57 212.1 191.1 21
134 213.42 211 192.7 18.3
135 213.56 210.2 195.8 14.4
136 214.5 213.3 196.8 16.5
137 214.55 211.3 194.4 16.9
138 215.46 212 196.6 15.4
139 215.52 210.8 197.4 13.4
140 216.3 215.3 197 18.3
141 216.47 217.8 199.3 18.5
142 217.13 215.4 197.3 18.1
143 217.41 214.8 196.2 18.6
144 217.96 213.9 200.1 13.8
145 218.33 215.7 200.4 15.3
146 218.77 217.4 200.1 17.3
147 219.24 217.5 199.7 17.8
148 219.58 218.5 200.3 18.2
149 220.13 218.4 200.3 18.1
150 220.39 220.4 201.9 18.5
151 221 218.1 201.6 16.5
152 221.18 220.5 203.9 16.6
153 221.86 220.6 202.6 18
154 221.96 220.5 205.2 15.3
155 222.69 218.7 203.1 15.6
156 222.72 220.6 204.4 16.2
157 223.43 220.6 203.3 17.3
158 223.48 221.1 202.8 18.3
159 224.09 222.6 207.1 15.5
160 224.22 224.5 207.5 17
161 224.74 220.8 206 14.8
162 224.95 224.2 208.1 16.1
163 225.39 223.8 208 15.8
164 225.67 222.8 209 13.8
165 226.03 223.4 208.6 14.8
166 226.37 224 210 14
167 226.66 225.3 209.2 16.1
168 227.06 224.1 211.6 12.5
169 227.28 224.5 210.5 14
170 227.73 223.8 211 12.8
171 227.89 226.9 209.1 17.8
172 228.39 226 212.2 13.8
173 228.49 226 211.8 14.2
174 229.04 226.6 212.1 14.5
175 229.09 227.9 211.3 16.6
176 229.67 226.4 211.5 14.9
177 229.67 228 214 14
178 230.18 228.4 215.1 13.3
179 230.24 227.5 213.3 14.2
180 230.68 229.2 213.6 15.6
181 230.8 229.5 215 14.5
182 231.18 228.7 213.9 14.8
183 231.36 229.8 216 13.8
184 231.67 230.6 214.4 16.2
185 231.9 231 213.1 17.9
186 232.16 231.2 216.2 15
189 232.94 231.1 217.9 13.2
190 233.1 230.4 217.6 12.8
191 233.45 231.1 218.4 12.7
192 233.56 231.9 219 12.9
193 233.94 232.1 216.6 15.5
194 234.02 232 219.3 12.7
195 234.43 231.8 217.5 14.3
196 234.47 232.4 220.6 11.8
197 234.9 233.5 218.6 14.9
198 234.9 233 219.3 13.7
199 235.3 233.4 220.2 13.2
200 235.33 233.8 221.1 12.7
201 235.67 235.5 218.8 16.7
202 235.75 233 222 11
203 236.05 232.9 220.4 12.5
204 236.17 233.9 220.1 13.8
205 236.42 234.8 221 13.8
206 236.57 234.4 221.4 13
207 236.78 234.8 222.6 12.2
208 236.96 236.4 222.5 13.9
209 237.14 234.6 223.5 11.1
210 237.35 236.6 222.6 14
211 237.49 235.7 221.9 13.8
212 237.73 234.4 222.4 12
213 237.83 234.9 226 8.9
214 238.1 237.3 223.9 13.4
215 238.17 237 223.6 13.4
216 238.46 235.7 225.1 10.6
217 238.5 236.6 223.6 13
218 238.81 237 226.1 10.9
219 238.83 236.4 225 11.4
220 239.15 237.7 225.7 12
221 239.15 236.8 225.9 10.9
222 239.43 237.7 225.9 11.8
223 239.46 238.6 224.8 13.8
224 239.71 237.1 226.3 10.8
225 239.77 238.7 227.4 11.3
226 239.98 238.7 225.9 12.8
227 240.07 238 226.9 11.1
228 240.25 240.5 227.6 12.9
229 240.36 238.8 227.5 11.3
230 240.51 237.9 225.8 12.1
231 240.65 238.5 228.2 10.3
232 240.77 239.3 226.6 12.7
233 240.92 238.8 226.1 12.7
234 241.03 240.2 228.8 11.4
235 241.2 240.4 227.5 12.9
236 241.28 240 227.4 12.6
237 241.46 239.8 228 11.8
238 241.52 240.6 228.8 11.8
239 241.72 240.1 228.7 11.4
240 241.76 240.2 229.2 11
241 241.98 240.3 229.2 11.1
242 242 240.7 229.3 11.4
243 242.22 240.5 229.7 10.8
244 242.23 239.9 229.2 10.7
245 242.44 241.2 230.3 10.9
246 242.45 240.7 230.5 10.2
247 242.64 241.3 231.5 9.8
248 242.68 239.2 230.4 8.79
249 242.84 241.5 230.3 11.2
250 242.89 241.4 230.6 10.8
251 243.03 242.2 230.4 11.8
252 243.1 241.7 232.3 9.39
253 243.22 242.6 232.2 10.4
254 243.31 241 229.7 11.3
255 243.41 240.7 231.1 9.59
256 243.51 242.2 231 11.2
257 243.59 241 232.4 8.59
258 243.7 242.1 230.2 11.9
259 243.77 242 232.1 9.90
260 243.9 243.2 230.4 12.8
261 243.95 242.9 233.6 9.30
262 244.08 242.6 233.6 9
263 244.12 243 231.3 11.7
264 244.26 242.3 233.5 8.80
265 244.29 241.8 233.8 8
266 244.44 242.9 233.1 9.80
267 244.46 242.6 233.8 8.79
268 244.61 242.9 234 8.90
269 244.62 244 234.3 9.69
270 244.76 243.6 234.4 9.19
271 244.77 243.6 235 8.59
272 244.91 243.2 234.7 8.5
273 244.93 243.9 233.2 10.7
274 245.04 243.5 233.5 10
275 245.08 243.7 234.2 9.5
276 245.18 243.4 234.8 8.59
277 245.23 244.2 234.2 10
278 245.31 244.8 234.8 10
279 245.37 244.1 234.7 9.40
280 245.44 243.7 234.1 9.59
281 245.51 244.1 234 10.1
282 245.57 243.6 235.8 7.79
283 245.65 243.8 235.3 8.5
284 245.7 244 235 9
285 245.78 244.3 236.9 7.40
286 245.82 243.7 235 8.69
287 245.91 244.6 236.2 8.40
288 245.94 244.7 235.4 9.29
289 246.04 245.2 237.3 7.89
290 246.06 244.6 234.8 9.79
291 246.16 243.8 235.6 8.20
292 246.17 244.8 236.2 8.60
293 246.28 244.6 236.2 8.40
294 246.29 245.2 236.9 8.29
295 246.39 245.2 237.2 8
296 246.39 245.2 236.5 8.69
297 246.49 245.1 235.7 9.40
298 246.5 244.9 237.4 7.5
299 246.59 245.4 237.5 7.90
300 246.61 246 236.2 9.80
Performance of modified Kelly criterion versus theoretical optimal performance versus decision-tree performance

So the decision tree does outperform the 20% KC, and not by trivial amounts for many possible b, in either relative or absolute terms: for b = 23, the estimated gain is $36.1. The decision tree doesn't require the full b =300 to often hit the ceiling, although the KC will. However, the performance edge goes down as the horizon gets more distant, and we'll see that as far out as b =300, the advantage shrinks to ~$6 as the ceiling is hit almost all the time by decent strategies (we know from Haghani & Dewey 2016's analysis that the KC hits the ceiling ~95% of the time with b =300, although of course it will do so far less often for shorter runs like b =50).

Since the 20% KC was only suggested as a heuristic, one might wonder if there's a different constant fraction which does slightly better in the b =300 case; there is: 12%:

fraction <- seq(0, 1, by=0.01)
fractionEVs <- ldply(mclapply(fraction, function(f) {
    k <- function(w,b) { if(w==250) {0} else { f * w } }
    kev <- simulateGame(k, b=300)
    return(kev) }
    ))
fraction[which.max(unlist(fractionEVs))]
# [1] 0.12
qplot(fraction, unlist(fractionEVs))

Generalized Kelly Coin-flip game: POMDPs

The original Kelly coin-flip game has a clear way to make it more general and difficult: randomize the 3 key parameters of edge/max wealth/rounds and hide them, turning it into a POMDP (partially observable Markov decision process).

To solve a POMDP one often has to record the entire history, but with judicious choice of what distributions we randomize the 3 parameters from, we can make it both plausibly reflect the human subjects' expectations and allow summarizing the history in a few sufficient statistics.

Implementation

A Python implementation with the suggested randomizations is available in OpenAI Gym:

import gym
env = gym.make('KellyCoinflipGeneralized-v0')
env.reset()
env._render() # peek at the hidden parameters
env._step(env.wealth*100*0.2)
# ...
env._reset() # etc

Bayesian decision tree

Unknown coin-flip probability

A variant on this problem is when the probability of winning is not given: one doesn't know in advance that p = 0.6. How do you make decisions as coin flips happen and provide information on p? Since p is unknown and only partially observed, this turns the coin-flip MDP into a POMDP problem. A POMDP can be turned back into an MDP and solved in a decision tree by including the history of observations as more parameters to explore over.

The curse of dimensionality can be tamed here a little by observing that a history of coin-flip sequences is unnecessary: it doesn't matter what order the coin flips happen in, only the number of wins and the number of total coin-flips matter (they are the sufficient statistics). So we only need to augment wealth/rounds/bet-amount with wins/n. And one can compute p from wins/n by Bayesian updates of a prior on the observed coin flip results, treating it as a binomial problem of estimating a p distributed according to the beta distribution.

This beta Bayesian update is easy and doesn't require calling out to a library: with a uniform beta prior (Beta(1,1)), the update on binomial (win/loss) data is conjugate, with the simple closed-form posterior Beta(1+wins, 1+n-wins); and the expectation of that beta is (1+wins) / (2+n). Then our expected values in f simply change from 0.6/0.4 to expectation/(1-expectation).
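
As a worked example (with hypothetical numbers), suppose 7 wins have been observed in 10 flips under the uniform Beta(1,1) prior:

wins <- 7; n <- 10
posterior <- c(1+wins, 1+n-wins) # posterior is Beta(8, 4)
posterior[1] / sum(posterior)    # posterior mean: 8/12 ~= 0.667, used in place of the hardwired 0.6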

Going back to the R implementation, we add in the two additional parameters and the beta estimation, replacing the hardwired probabilities with the posterior estimate:

library(memoise)
f <- function(x, w, b, wins, n) {
    beta <- c(1+wins, 1+n-wins)
    expectedWinProb <- beta[1] / (beta[1]+beta[2]) # E[Beta(alpha, beta)] = alpha / (alpha+beta)
    expectedLoseProb <- 1 - expectedWinProb
    expectedWinProb*mV(min(w+x,250), b-1, wins+1, n+1) + expectedLoseProb*mV(w-x, b-1, wins, n+1) }
mf <- memoise(f)
V <- function(w,b,wins,n) {
    returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { mf(x,w,b, wins, n) }) } else { w }
    max(returns) }
mV <- memoise(V)

What prior should we use?

Ideally we would use the prior of the participants in the experiment, to compare how well they do against the Bayesian decision tree and estimate how much mistaken beliefs cost them. Unfortunately they were not surveyed on this, and the authors do not give any indication of how much of an edge the subjects expected before starting the game, so for analysis we'll make up a plausible prior.

We know that they know that 100% is not an option (as that would be boring) and that <50% is impossible too, since they were promised an edge; but it can't be too high because then everyone would win quickly and nothing would be learned. I think a reasonable prior would be something like 70%, but with a lot of room for surprise; let's assume a weak prior around 70%, which is equivalent to playing 10 games and winning 7, ie. Beta(7, 3).

With the additional parameters, the memoized version becomes impossibly slow, so I rewrite it to use R environments as a poor man's hash table:

tree <- new.env(size=1000000, parent=emptyenv())
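## helpers wrapping the 'tree' environment (a minimal sketch, assuming the numeric state
## vector is serialized into a string key; any equivalent hash-table wrapper works):
key    <- function(k)    { paste(k, collapse=",") }
isSet  <- function(k)    { exists(key(k), envir=tree, inherits=FALSE) }
lookup <- function(k)    { get(key(k), envir=tree, inherits=FALSE) }
set    <- function(k, v) { assign(key(k), v, envir=tree) }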
f <- function(x, w, b, wins, n) {
    beta <- c(1+wins, 1+n-wins)
    expectedWinProb <- beta[1] / (beta[1]+beta[2]) # E[Beta(alpha, beta)] = alpha / (alpha+beta)
    expectedLoseProb <- 1 - expectedWinProb
    expectedWinProb*V(min(w+x,250), b-1, wins+1, n+1) + expectedLoseProb*V(w-x, b-1, wins, n+1) }
V <- function(w, b, wins, n) {
    if (isSet(c(w,b,wins,n))) { return(lookup(c(w,b,wins,n))) } else {
      returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b,wins,n) }) } else { return(w) }
      set(c(w,b,wins,n), max(returns))
      return(lookup(c(w,b,wins,n)))
      }
}
# test expression: $25, 5 bets, suggested 7⁄10 prior:
V(25,5,7,10)

This prior leads to somewhat more aggressive betting in short games as the EV is higher, but overall performs much like the decision tree.

nshepperd provides a modified version of his code which lets us evaluate up to b =300:

{-# LANGUAGE ScopedTypeVariables #-}
import System.Environment
import Data.List
import Data.Foldable
import Data.Function
import Data.Function.Memoize
import Data.Vector (Vector)
import qualified Data.Vector as V
import Data.Ord
import Text.Printf
import Numeric.Search.Range -- from 'binary-search'

type Wealth = Int
type BetAmount = Wealth
type Probability = Double
type EV = Double

type Round = Int
type MaxRounds = Round
type WonRounds = Round
type Alpha = Int
type Beta = Int

cap :: Wealth
cap = 250

data Tree = Done EV
          | Bet EV Wealth BetAmount Round Probability Tree Tree
          deriving (Show)

data Dual = Dual {
  subjectiveEV :: !EV,
  trueEV :: !EV
  } deriving (Show)

instance Plan Dual where
  done w = Dual (fromIntegral w) (fromIntegral w)
  bet w x b p win loss = Dual {
    subjectiveEV = p * subjectiveEV win + (1 - p) * subjectiveEV loss,
    trueEV = 0.6 * trueEV win + (1 - 0.6) * trueEV loss
    }
  strategyEV = subjectiveEV

class Plan s where
  done :: Wealth -> s
  bet :: Wealth -> BetAmount -> Round -> Double -> s -> s -> s
  strategyEV :: s -> EV

instance Plan Double where
  done = fromIntegral
  bet w x b p win loss = p * win + (1 - p) * loss
  strategyEV = id

instance Plan Tree where
  done w = Done (fromIntegral w)
  strategyEV (Done ev) = ev
  strategyEV (Bet ev _ _ _ _ _ _) = ev
  bet 0 _ _ _ _ _ = Done 0
  bet w x b p win loss = Bet ev w x b p win loss
    where
      ev = p * strategyEV win + (1 - p) * strategyEV loss

showTree :: Tree -> String
showTree (Done ev) = show (Done ev)
showTree (Bet ev w x b p win loss) = intercalate "\n" [
  printf "%i (%1.1f): %i/%i at %2.2f" b ev x w p,
  indent 4 ("+ " ++ showTree win),
  indent 4 ("- " ++ showTree loss)
  ]
  where
    indent n str = intercalate "\n" $ map (replicate n ' ' ++) $ lines str

-- $ Unordered maximum
maximumOn :: (Foldable f, Ord b) => (a -> b) -> f a -> a
maximumOn f = maximumBy (comparing f)

-- Binary search on bitonic function.
maximumBitonic :: Ord b => (a -> b) -> Vector a -> a
maximumBitonic f options = case searchFromTo find 0 (V.length options - 2) of
                             Just i -> options V.! i
                             Nothing -> options V.! (V.length options - 1)
  where
    find i = f (options V.! i) >= f (options V.! (i + 1))

value :: Plan s
      => Alpha -> Beta -> MaxRounds
      -> (Round -> WonRounds -> Wealth -> s)
      -> (Round -> WonRounds -> Wealth -> s)
value alpha beta lastRound next currentRound wins w
  | currentRound >= lastRound = done w
  | w            >= cap       = done w
  | otherwise = maximumBitonic strategyEV options
  where
    new_alpha = alpha + wins
    new_beta = beta + (currentRound - wins)
    p = fromIntegral new_alpha / fromIntegral (new_alpha + new_beta)
    go x = bet w x currentRound p
           (next (currentRound+1) (wins+1) (w+x))
           (next (currentRound+1) wins     (w-x))
    options = V.fromList [go x | x <- [0..min (cap - w) w]]

-- Direct recursion.
direct :: Plan s
       => Alpha -> Beta -> MaxRounds
       -> (Round -> WonRounds -> Wealth -> s)
direct a b m = fix (value a b m)

-- Memoised recursion.
memo :: forall s. Plan s
     => Alpha -> Beta -> MaxRounds
     -> (Round -> WonRounds -> Wealth -> s)
memo a b maxRounds = cached_value
  where
    cached_value r wins w = table V.! r V.! wins V.! w
    table :: Vector (Vector (Vector s))
    table = V.generate (maxRounds + 1)
            (\r -> V.generate (r + 1)
              (\wins -> V.generate (cap + 1)
                (\w -> value a b maxRounds cached_value r wins w)))

-- Memoised recursion using 'memoize' library; slower.
memo2 :: Plan s
       => Alpha -> Beta -> MaxRounds
       -> (Round -> WonRounds -> Wealth -> s)
memo2 a b r = memoFix3 (value a b r)

run :: Plan s
    => (alpha -> beta -> maxRounds -> (Round -> WonRounds -> wealth -> s))
    -> alpha -> beta -> maxRounds -> wealth -> s
run method alpha beta rounds w = method alpha beta rounds 0 0 w

main :: IO ()
main = do [a, b, r, w] <- map read <$> getArgs
          print $ (run memo a b r w :: Dual)

One feature offered by this code is reporting two expected values:

  1. the subjective estimate of the agent at the beginning of the game about how much it can profit, given the rules and its priors about p
  2. the EV it will actually earn, following its policy, given the true probability p = 0.6

For our prior, w = 25, and b =300, we can evaluate run memo (7+1) (3+1) 300 25 :: Dual and find that the subjective EV is $207.238 and the actual EV is $245.676.

And since we know from earlier that an agent following the optimal policy believing p = 0.6 would earn $246.606, it follows that the Bayesian agent, due to its ignorance, loses ~$1. So the price of ignorance in this scenario is surprisingly small.

Why is the Bayesian decision tree able to perform so close to the known-probability decision tree? I would guess that it is because all agents bet small amounts in the early rounds (to avoid gambler's ruin, since they have hundreds of rounds left to reach the ceiling), and this gives them data for free to update towards p = 0.6; the agent which knows that p = 0.6 can bet a little more precisely early on, but this advantage over a few early rounds doesn't wind up being much of a help in the long run.

With b =300, there are so many rounds available to bet that the details don't matter as much; while if there are only a few bets, the maximizing behavior is to bet a lot, so the regret for small b is also probably small; the largest regret for not knowing p is probably, as with the KC vs the decision tree, somewhere in the small-medium range of b.

Unknown stopping time

Another variant would be to make the stopping time uncertain; strictly speaking, the game wasn't fixed at exactly 300 rounds: that was just their guess at how many rounds a reasonably fast player might get in before the clock expired. From the player's perspective, it is unknown how many rounds they will get to play, although they can guess that they won't be allowed to play more than a few hundred, and this might affect the optimal strategy.

One way to calculate this would be to assume that the player doesn't learn anything about the stopping time while playing, and merely has a prior over the stopping time. The game stopping can then be seen as a third probabilistic response to a player's bet: each move, with some probability sp, the game stops, the number of bets left immediately becomes b =0 (so they win their current w) and they win the net value V(w,0); and if it doesn't happen, the value is the usual 0.6*V(w+x, b-1) + 0.4*V(w-x, b-1). Or in longer form, the value of an action/bet x is (1-sp)*(0.6*V(w+x,b-1) + 0.4*V(w-x,b-1)) + sp*V(w,0).

What is sp as a distribution? Well, one might believe that every round there is a ~1% chance of the game stopping, in which case sp follows the memoryless geometric distribution (not the exponential, since this is a discrete setup); in this case, one must discount each additional round n by 1% cumulatively. (Discounting in general could be seen as reflecting a probability of stopping at each stage.)
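
As a quick illustration of the cumulative discounting (a sketch with sp = 1%):

sp <- 0.01
n <- c(10, 100, 300)
(1-sp)^n # probability the game is still running after n further rounds: ~0.904, ~0.366, ~0.049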

This has the difficulty that, hypothetically, a game might go on indefinitely long, but since a player must eventually hit $0 or $250 and stop playing, it would turn out like the penny simulator: after a few thousand rounds, the probability of having not reached $250 is so small it can't be handled by most software. The problem is more that this expands the state space enough to again make it unevaluable, so it helps to put an upper bound on the number of rounds. So the problem becomes: playing the game knowing that it might stop with a certain sp (eg 0.01) each round, or else stops at an upper limit number of rounds (eg b =300).

So continuing with the R hash table, it might go like this:

f <- function(x, w, b) {
    sp <- 0.01
    (1-sp)*(0.6*V(min(250,w+x), b-1) + 0.4*V(w-x,b-1)) + sp*V(w,0)
}
V <- function(w, b) {
    if (isSet(c(w,b))) { return(lookup(c(w,b))) } else {
      returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b) }) } else { return(w) }
      set(c(w,b), max(returns))
      return(lookup(c(w,b)))
      }
}

This can be extended to scenarios where we do learn something about the stopping time, similarly to the coin-flip probability, by doing Bayesian updates on sp, or to scenarios where sp changes over rounds. If we had a distribution over the stopping time, then the probability of stopping at each round can change inside the value function, given an additional argument n for the number of rounds elapsed as the sufficient statistic.
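
A minimal sketch of the first extension, assuming a Beta(1, 99) prior on sp (~1% expected), updated on each completed round not having stopped the game (each survived round is one 'failure' observation for stopping), with the rounds elapsed n threaded through as an extra argument:

f <- function(x, w, b, n) {
    sp <- 1 / (1 + 99 + n) # posterior mean of Beta(1, 99+n) after n survived rounds
    (1-sp)*(0.6*V(min(250,w+x), b-1, n+1) + 0.4*V(w-x, b-1, n+1)) + sp*V(w, 0, n)
}
V <- function(w, b, n=0) {
    if (isSet(c(w,b,n))) { return(lookup(c(w,b,n))) } else {
      returns <- if (b>0 && w>0 && w<250) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b,n) }) } else { return(w) }
      set(c(w,b,n), max(returns))
      return(lookup(c(w,b,n)))
      }
}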

Unknown maximum wealth cap

Finally, subjects weren't told what the maximum would be:

The maximum payout would be revealed if and when subjects placed a bet that if successful would make their balance greater than or equal to the cap. We set the cap at $250.

A convenient distribution for such a cap would be the Pareto distribution, parameterized by a minimum x_m and a tail parameter α. If x_m is set to something like 100 (the agent thinks the cap is >$100), for most rounds the agent would not be able to update this, but towards the end, as its wealth exceeds $100, it then updates the posterior: if it is allowed to make a bet of $50 at $100, then x_m becomes 150 while α remains the same, and so on. In this case, the maximum observed wealth is the sufficient statistic. The posterior mean of the cap is then α*x_m / (α-1). A reasonable prior might be α=5 and x_m = 200, in which case the mean wealth cap is 250.
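
Numerically (a sketch using the suggested prior; the $280 observation is hypothetical):

alpha <- 5; x_m <- 200
(alpha * x_m) / (alpha - 1) # prior mean cap: $250
x_m <- max(x_m, 280)        # a successful bet takes wealth to $280 without revealing the cap
(alpha * x_m) / (alpha - 1) # posterior mean cap: $350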

f <- function(x, w, b, alpha, x_m) {
    meanCap        <- (alpha * x_m) / (alpha-1) # mean posterior cap if our bet does not threaten to hit the cap
    x_m_updated <- max(x_m, x+w)
    meanCapUpdated <- (alpha * x_m_updated) / (alpha-1) # mean posterior cap if our bet does threaten to hit the cap
    Pcap <- if ((x+w) < x_m) { 0 } else { 1-((x_m / (x+w))^alpha) } # Pareto CDF: probability the cap lies below the potential new wealth w+x
    (1-Pcap)*(0.6*V(min(meanCap,w+x), b-1, alpha, x_m) + 0.4*V(w-x,b-1,alpha,x_m)) +
        Pcap*(0.6*V(min(meanCapUpdated,w+x), b-1, alpha, x_m_updated) + 0.4*V(w-x,b-1,alpha,x_m_updated))
}
V <- function(w, b, alpha=5, x_m=200) {
    if (isSet(c(w,b,alpha,x_m))) { return(lookup(c(w,b,alpha,x_m))) } else {
      returns <- if (b>0 && w>0) { sapply(seq(0, w, by=0.1), function(x) { f(x,w,b,alpha,x_m) }) } else { return(w) }
      set(c(w,b,alpha,x_m), max(returns))
      return(lookup(c(w,b,alpha,x_m)))
      }
}

Value of POMDP

What is the value of this full POMDP game with uncertain wealth cap, stopping time, and edge?

Upper bound

We can approximate it with our pre-existing value function for a known stopping time/edge/max wealth, sampling from the posterior; for example, we might draw 1000 values from the normal, the Pareto, the beta etc, calculate the value of each set of parameters, and take the mean value. Since the value of information is always zero or positive, the value of the POMDP with known parameters must be an upper bound on its true value, since we can only do worse in each game by being ignorant of the true parameters and needing to infer them as we play; so if we calculate a value of $240 or whatever, we know the POMDP must have a value <$240, and that gives a performance target to aim for.

The Haskell version is the easiest to work with, but Haskell doesn't have convenient distributions to sample from, so instead I generate 1000 samples from the prior distribution of parameters in Python (reusing some of the Gym code):

from scipy.stats import genpareto
import numpy as np
import numpy.random
from gym.spaces import prng
edgePriorAlpha=7
edgePriorBeta=3
maxWealthAlpha=5
maxWealthM=200
maxRoundsMean=300
maxRoundsSD=25
for i in xrange(0,1000):
    edge = prng.np_random.beta(edgePriorAlpha, edgePriorBeta)
    maxWealth = int(round(genpareto.rvs(maxWealthAlpha, maxWealthM, random_state=prng.np_random)))
    maxRounds = int(round(prng.np_random.normal(maxRoundsMean, maxRoundsSD)))
    print 25, maxRounds, edge, maxWealth

and then we can modify nshepperd's Haskell code to make the edge and wealth cap modifiable, and simply run memo over the 1000 possible sets of parameters and take the mean. Using the raw samples turns out to be infeasible in terms of RAM, probably because higher wealth caps (which are Pareto-distributed and can be extremely large) increase dimensionality too much to evaluate in my ~25GB RAM, so the wealth cap itself must be shrunk, to TODO $1000? which unfortunately muddles the meaning of the estimate, since now it's a lower bound on an upper bound, or to reverse it, the value of clairvoyant play assuming that many games have a lower wealth cap than they do.

import Data.Vector (Vector)
import qualified Data.Vector as V (generate, (!))

type Wealth = Int
type Bets = Int
type Edge = Double
type Cap = Int
type EV = Double

value :: (Wealth -> Bets -> EV) -> Wealth -> Bets -> Edge -> Cap -> EV
value    _ 0 _ _ _ = 0
value    _ w 0 _ _ = fromIntegral w
value next w b e cap = maximum [go x | x <- [0..min (cap - w) w]]
  where
    go x = e * next (min cap (w+x)) (b-1) + (1-e) * next (w-x) (b-1)

memo :: Wealth -> Bets -> Edge -> Cap -> EV
memo w b e cp = cached_value w b
  where
    cached_value w' b' = table V.! b' V.! w'
    table :: Vector (Vector EV)
    table = V.generate (b + 1)
            (\b'' -> V.generate (cp + 1)
              (\w'' -> value cached_value w'' b'' e cp))

main :: IO ()
main = do let evs = map (\(w,b,e,cap) -> memo w b e cap) parameters
          print evs
          print $ mean evs
       where mean x = sum x / fromIntegral (length x)

maxCap :: Int
maxCap = 800
parameters :: [(Wealth, Bets, Edge, Cap)]
-- shrink each sampled cap to at most maxCap so the DP table stays evaluable in RAM:
parameters = map (\(w,b,e,c) -> (w,b,e, min maxCap c)) [(25, 236, 0.72435085557, 205), (25, 237, 0.845435411107, 260), (25, 238, 0.413782110845, 202),
              (25, 240, 0.345133535603, 23999), (25, 240, 0.400147029568, 2521), (25, 242, 0.661912880989, 201), (25, 242, 0.676168121514, 6740),
              (25, 242, 0.701381545443, 1726), (25, 242, 0.849537381053, 203), (25, 242, 0.898462493046, 201), (25, 243, 0.54257657328, 201),
              (25, 245, 0.77972062792, 48534), (25, 246, 0.813395396485, 203), (25, 247, 0.480134063147, 214), (25, 247, 0.533273012857, 222),
              (25, 247, 0.754222974601, 201), (25, 249, 0.70782818171, 200), (25, 249, 0.748310452532, 202), (25, 251, 0.606478140487, 200),
              (25, 251, 0.809355705263, 526), (25, 252, 0.524864167122, 200), (25, 252, 0.749206398756, 352321), (25, 252, 0.822818892461, 201),
              (25, 253, 0.542598606499, 201), (25, 253, 0.600469152584, 200), (25, 253, 0.648598435209, 219), (25, 253, 0.649453264073, 2589),
              (25, 253, 0.679414549454, 28311200), (25, 253, 0.86227613481, 202), (25, 254, 0.585056497076, 2575033),
              (25, 254, 0.680452842612, 200), (25, 255, 0.877950348346, 201), (25, 256, 0.643298850757, 205), (25, 256, 0.712973552037, 204),
              (25, 256, 0.718153926517, 201), (25, 256, 0.813126205309, 201), (25, 256, 0.8280445068, 208), (25, 257, 0.402688232643, 201),
              (25, 257, 0.548591912297, 651), (25, 257, 0.601967601919, 206), (25, 257, 0.862541920316, 200), (25, 257, 0.863389650382, 200),
              (25, 258, 0.474519595118, 213), (25, 258, 0.735744534702, 208), (25, 258, 0.738339372201, 360), (25, 258, 0.84277195943, 83144),
              (25, 258, 0.93190614133, 2409618061), (25, 259, 0.471094149142, 745), (25, 260, 0.628797799899, 202), (25, 260, 0.710033767475, 9552510),
              (25, 260, 0.813466187199, 200), (25, 261, 0.550748996178, 230), (25, 261, 0.629544256918, 259), (25, 261, 0.683386138653, 200),
              (25, 261, 0.792209742213, 201), (25, 261, 0.794533480259, 207), (25, 262, 0.344887568623, 201), (25, 262, 0.450645011078, 473),
              (25, 262, 0.738248174312, 225), (25, 262, 0.791527693933, 210), (25, 262, 0.832198167412, 200), (25, 263, 0.35093259949, 203),
              (25, 263, 0.521818533732, 366), (25, 263, 0.64076936694, 59847), (25, 263, 0.684063131912, 201), (25, 263, 0.714065733976, 201),
              (25, 263, 0.789350373174, 1705), (25, 263, 0.829406017035, 210), (25, 263, 0.902144916794, 255), (25, 263, 0.951257247224, 202),
              (25, 264, 0.32472812884, 203), (25, 264, 0.604563170498, 250), (25, 264, 0.626423177116, 22769), (25, 264, 0.871339562605, 205),
              (25, 265, 0.452014126574, 200), (25, 265, 0.54226419925, 201), (25, 265, 0.625113911686, 212), (25, 265, 0.654404133816, 206),
              (25, 265, 0.678345789504, 420), (25, 265, 0.771116406164, 210), (25, 265, 0.771222624754, 205), (25, 265, 0.801480809161, 200),
              (25, 265, 0.806182111956, 60957), (25, 265, 0.851715904261, 18981), (25, 266, 0.685689594858, 200), (25, 266, 0.711416415881, 210),
              (25, 266, 0.711937019202, 851), (25, 266, 0.724721571466, 200), (25, 266, 0.816529965664, 654), (25, 266, 0.820884461992, 287),
              (25, 267, 0.798462101302, 200), (25, 267, 0.814520382826, 200), (25, 267, 0.815795664692, 201), (25, 267, 0.877810894936, 200),
              (25, 268, 0.37374747435, 7152), (25, 268, 0.66383704238, 201), (25, 268, 0.671499425192, 224), (25, 268, 0.72882798744, 205),
              (25, 268, 0.788161778332, 534537579002), (25, 268, 0.793723223502, 584), (25, 268, 0.83414422735, 16291),
              (25, 268, 0.933374294433, 219), (25, 269, 0.460012103336, 211), (25, 269, 0.575089962258, 203), (25, 269, 0.587666214938, 213),
              (25, 269, 0.617907437269, 201), (25, 269, 0.711693869892, 203), (25, 269, 0.755193420929, 200), (25, 269, 0.764409970652, 2593),
              (25, 269, 0.769445254497, 6962), (25, 269, 0.781046904571, 1030), (25, 269, 0.92721940969, 2062780), (25, 269, 0.937746457329, 201),
              (25, 270, 0.43758799016, 201), (25, 270, 0.55945290829, 206), (25, 270, 0.594599037706, 579), (25, 270, 0.6205174976, 506),
              (25, 270, 0.660258985896, 217), (25, 270, 0.688180575479, 308), (25, 270, 0.703263135667, 201), (25, 270, 0.730492162736, 46132),
              (25, 270, 0.7898205196, 25541), (25, 270, 0.893976819617, 201), (25, 270, 0.921080018968, 200), (25, 271, 0.525815069976, 200),
              (25, 271, 0.554615752325, 200), (25, 271, 0.629469767651, 200), (25, 271, 0.709135907244, 9875), (25, 271, 0.759981836553, 202),
              (25, 271, 0.811129692174, 91717), (25, 271, 0.837347167701, 200), (25, 271, 0.873514454515, 200), (25, 271, 0.923252939562, 208),
              (25, 272, 0.556207216343, 785), (25, 272, 0.575453865622, 215), (25, 272, 0.638921822998, 200), (25, 272, 0.642223697178, 219),
              (25, 272, 0.66336333735, 213), (25, 272, 0.798971543861, 209), (25, 272, 0.826187121413, 216), (25, 272, 0.864733208766, 204),
              (25, 273, 0.517135630277, 200), (25, 273, 0.560438221638, 221192794), (25, 273, 0.77368870442, 883), (25, 273, 0.846232723863, 785),
              (25, 273, 0.9652451074, 200), (25, 274, 0.607659170171, 339), (25, 274, 0.611760970714, 201), (25, 274, 0.638249890182, 201),
              (25, 274, 0.706185385091, 201), (25, 274, 0.760147689131, 200), (25, 274, 0.804365960436, 1051), (25, 274, 0.848798405341, 1117),
              (25, 275, 0.333707647765, 200), (25, 275, 0.555403709258, 205), (25, 275, 0.655942931908, 206), (25, 275, 0.676684395133, 201),
              (25, 275, 0.698400431321, 203), (25, 275, 0.796131451966, 200), (25, 275, 0.840528265848, 200), (25, 275, 0.856009338181, 200),
              (25, 276, 0.497611590615, 211), (25, 276, 0.719196036042, 202), (25, 276, 0.72119258703, 212), (25, 276, 0.724848372617, 201),
              (25, 276, 0.727773523116, 947132), (25, 276, 0.786181091729, 223), (25, 276, 0.835470024888, 213), (25, 276, 0.850785506884, 1650536293),
              (25, 276, 0.872106280647, 205), (25, 277, 0.409025006548, 1245), (25, 277, 0.432733743192, 200), (25, 277, 0.571657665503, 45217),
              (25, 277, 0.702562712047, 201), (25, 277, 0.730418085673, 575), (25, 277, 0.749515377892, 200), (25, 277, 0.775231917334, 200),
              (25, 277, 0.790660296053, 204), (25, 277, 0.875858958177, 236), (25, 278, 0.379322394893, 309), (25, 278, 0.420790416182, 201),
              (25, 278, 0.527244364126, 200), (25, 278, 0.615000956496, 205), (25, 278, 0.675667333791, 201), (25, 278, 0.696208960885, 201),
              (25, 278, 0.790420764423, 204), (25, 278, 0.841240203947, 523), (25, 279, 0.341095083629, 201), (25, 279, 0.476230165726, 1798),
              (25, 279, 0.518063098433, 209), (25, 279, 0.656454646158, 306), (25, 279, 0.684136349, 273), (25, 279, 0.709847569568, 119033),
              (25, 279, 0.762133225185, 200), (25, 279, 0.776577127267, 200), (25, 279, 0.79611434817, 201), (25, 279, 0.808780479341, 425),
              (25, 279, 0.850439803007, 204), (25, 279, 0.898230041029, 204), (25, 279, 0.945894882249, 616), (25, 280, 0.368073973138, 14001),
              (25, 280, 0.464321107456, 1523), (25, 280, 0.617310083425, 504403), (25, 280, 0.657193497398, 204), (25, 280, 0.707839204221, 1265268),
              (25, 280, 0.76379246649, 304), (25, 280, 0.772912128233, 201), (25, 280, 0.865579809939, 77454), (25, 281, 0.475947154647, 200),
              (25, 281, 0.557771053261, 2188305), (25, 281, 0.56666775076, 12340), (25, 281, 0.638041977737, 200), (25, 281, 0.668510798451, 212),
              (25, 281, 0.678366665681, 1294), (25, 281, 0.716116162808, 213), (25, 281, 0.748052894007, 203), (25, 281, 0.749911576065, 23865988060),
              (25, 281, 0.759712569326, 204), (25, 281, 0.806434961546, 353082), (25, 281, 0.893617069228, 200), (25, 281, 0.931891968132, 203),
              (25, 282, 0.454895506694, 11408), (25, 282, 0.549639923973, 462), (25, 282, 0.569017378489, 201), (25, 282, 0.577850694309, 200),
              (25, 282, 0.584479523845, 334042071872), (25, 282, 0.599723920362, 200), (25, 282, 0.632612897763, 1293),
              (25, 282, 0.681109488172, 200), (25, 282, 0.724133951717, 200), (25, 282, 0.754930901531, 201), (25, 282, 0.778740223985, 203),
              (25, 282, 0.779781137188, 200), (25, 282, 0.781305596085, 202), (25, 282, 0.83569736595, 200), (25, 282, 0.850446466345, 1253),
              (25, 282, 0.859549928516, 201), (25, 282, 0.871855706726, 200), (25, 283, 0.327637841131, 200), (25, 283, 0.404249604299, 201),
              (25, 283, 0.561172940189, 200), (25, 283, 0.668083130397, 201), (25, 283, 0.678629207223, 203), (25, 283, 0.74122980107, 208),
              (25, 283, 0.744679958097, 200), (25, 283, 0.760349422921, 203), (25, 283, 0.781697384113, 209), (25, 283, 0.784799938769, 202),
              (25, 283, 0.789112337759, 236), (25, 283, 0.819466827649, 262), (25, 283, 0.839687812522, 200), (25, 283, 0.856859655325, 271),
              (25, 284, 0.495523668627, 201), (25, 284, 0.56435102208, 205), (25, 284, 0.588093361148, 219), (25, 284, 0.593268824769, 200),
              (25, 284, 0.640596228637, 213), (25, 284, 0.653491369691, 44714), (25, 284, 0.675009753521, 202), (25, 284, 0.773401673731, 200),
              (25, 284, 0.805092875342, 213), (25, 284, 0.82735685621, 263), (25, 284, 0.840696528783, 16521), (25, 285, 0.362802762398, 320),
              (25, 285, 0.367012291981, 20073), (25, 285, 0.521190733548, 200), (25, 285, 0.575799236698, 200), (25, 285, 0.637555604061, 201),
              (25, 285, 0.642866106846, 200), (25, 285, 0.739442877608, 1154709), (25, 285, 0.740973872676, 205), (25, 285, 0.779312184025, 258),
              (25, 285, 0.7798668215, 1286), (25, 285, 0.841286817662, 200), (25, 285, 0.913642856385, 858), (25, 286, 0.526160290087, 202),
              (25, 286, 0.57102556171, 200), (25, 286, 0.605564882529, 200), (25, 286, 0.673079677685, 202), (25, 286, 0.675864181927, 200),
              (25, 286, 0.723247596231, 224), (25, 286, 0.723347705484, 203), (25, 286, 0.767358570943, 207), (25, 286, 0.775537484183, 241),
              (25, 286, 0.780033922166, 2962), (25, 286, 0.806301077907, 1553), (25, 286, 0.829016366841, 200), (25, 286, 0.839169441571, 200),
              (25, 286, 0.852776222427, 28148), (25, 286, 0.853758532671, 200), (25, 286, 0.906829095884, 200), (25, 286, 0.922712587421, 202),
              (25, 287, 0.498032911641, 200), (25, 287, 0.592863506249, 233), (25, 287, 0.622406587519, 208), (25, 287, 0.718903577193, 218),
              (25, 287, 0.72138759488, 227), (25, 287, 0.749942677419, 240), (25, 287, 0.762366566525, 215), (25, 287, 0.770805576807, 200),
              (25, 287, 0.797492731584, 200), (25, 287, 0.835165503936, 200), (25, 287, 0.946751084636, 863358347021),
              (25, 288, 0.306016851261, 201), (25, 288, 0.468949036459, 4770), (25, 288, 0.549651576351, 201), (25, 288, 0.592965066535, 6236),
              (25, 288, 0.706825000163, 216), (25, 288, 0.74286176604, 41482), (25, 288, 0.743238045326, 206), (25, 288, 0.777949156606, 213),
              (25, 288, 0.781597068828, 418), (25, 288, 0.794578713285, 200), (25, 288, 0.828331653519, 418), (25, 288, 0.834874015356, 212),
              (25, 288, 0.848009292304, 200), (25, 288, 0.852116433152, 200), (25, 288, 0.906593204881, 202), (25, 288, 0.920623087617, 220),
              (25, 289, 0.524452337599, 1057), (25, 289, 0.552951317234, 208), (25, 289, 0.651146664053, 200), (25, 289, 0.658611767347, 1126709768047),
              (25, 289, 0.691072913786, 201), (25, 289, 0.695521738775, 42204), (25, 289, 0.710204540486, 207), (25, 289, 0.716186135845, 3696),
              (25, 289, 0.748287169753, 209), (25, 289, 0.750374480604, 201), (25, 289, 0.810573219838, 201), (25, 289, 0.824662154794, 200),
              (25, 289, 0.881953772318, 200), (25, 289, 0.885424313568, 200), (25, 290, 0.478371525842, 2087), (25, 290, 0.509245836087, 200),
              (25, 290, 0.559208272006, 202), (25, 290, 0.566118076472, 336), (25, 290, 0.592660738469, 200), (25, 290, 0.594018239861, 273),
              (25, 290, 0.609100591562, 201), (25, 290, 0.613710234154, 235), (25, 290, 0.622454795175, 201), (25, 290, 0.642187814474, 201),
              (25, 290, 0.650179222226, 202), (25, 290, 0.71534164315, 3261), (25, 290, 0.720180673012, 206), (25, 290, 0.737585832942, 200),
              (25, 290, 0.775169473922, 229), (25, 290, 0.783784198305, 53137), (25, 290, 0.789135961744, 201), (25, 290, 0.800078889184, 200),
              (25, 291, 0.405626786206, 34744), (25, 291, 0.525738841457, 211), (25, 291, 0.533871294974, 12996), (25, 291, 0.608310459901, 201),
              (25, 291, 0.70893418817, 200), (25, 291, 0.748805670518, 200), (25, 291, 0.760314517315, 202), (25, 291, 0.760919201657, 215),
              (25, 291, 0.763324684293, 200), (25, 291, 0.778634985402, 200), (25, 291, 0.805883746215, 200), (25, 291, 0.827079825705, 203),
              (25, 291, 0.827672243803, 200), (25, 292, 0.353361504401, 9552), (25, 292, 0.413680391502, 240), (25, 292, 0.498283007005, 200),
              (25, 292, 0.560682661174, 201), (25, 292, 0.660502227784, 201), (25, 292, 0.676014312494, 200), (25, 292, 0.729161823418, 200),
              (25, 292, 0.754733824112, 230), (25, 292, 0.772617271317, 312), (25, 292, 0.778956063712, 212), (25, 292, 0.808767644485, 200),
              (25, 292, 0.808779697891, 201), (25, 292, 0.844087667679, 202), (25, 292, 0.904903907681, 200), (25, 293, 0.524954771554, 4945),
              (25, 293, 0.57444832542, 233), (25, 293, 0.600122035306, 200), (25, 293, 0.69001246376, 205), (25, 293, 0.691232039332, 207),
              (25, 293, 0.691742268177, 211), (25, 293, 0.698366150359, 200), (25, 293, 0.711927031055, 1815880), (25, 293, 0.71619667357, 3964),
              (25, 293, 0.728536730678, 200), (25, 293, 0.741604172773, 200), (25, 293, 0.743399425593, 201), (25, 293, 0.758084864431, 200),
              (25, 293, 0.759978441853, 200), (25, 293, 0.780081691192, 1448778253), (25, 293, 0.788319084437, 208),
              (25, 293, 0.801525808156, 200), (25, 293, 0.813155425067, 206), (25, 293, 0.819811779201, 377), (25, 293, 0.837649946204, 209),
              (25, 293, 0.869752644338, 4592), (25, 293, 0.875398670372, 200), (25, 293, 0.918150841685, 200), (25, 293, 0.921450726956, 201),
              (25, 294, 0.445036622233, 202), (25, 294, 0.496380029754, 2294499), (25, 294, 0.517533203971, 214), (25, 294, 0.597738939234, 213),
              (25, 294, 0.597783619419, 102427), (25, 294, 0.600782734524, 263), (25, 294, 0.625741320644, 200), (25, 294, 0.632555212361, 202),
              (25, 294, 0.633928301179, 217), (25, 294, 0.720710116312, 200), (25, 294, 0.741687972492, 235), (25, 294, 0.775184150825, 203),
              (25, 294, 0.799975478422, 2129), (25, 294, 0.811533351383, 221), (25, 294, 0.827766846665, 47518), (25, 294, 0.834896019626, 200),
              (25, 294, 0.855889683449, 200), (25, 294, 0.887096214635, 201), (25, 294, 0.898185143854, 200), (25, 294, 0.904470557778, 207),
              (25, 295, 0.510000548384, 4948), (25, 295, 0.527335029277, 48990), (25, 295, 0.553592976478, 200), (25, 295, 0.587128450848, 200),
              (25, 295, 0.685163151117, 221), (25, 295, 0.689665955815, 200), (25, 295, 0.693619741737, 200), (25, 295, 0.698808800461, 212),
              (25, 295, 0.79024052726, 203), (25, 295, 0.805930533305, 233), (25, 295, 0.870408868052, 208), (25, 295, 0.879989559956, 209),
              (25, 295, 0.917289124506, 201), (25, 296, 0.368675367618, 200), (25, 296, 0.422106669815, 87196), (25, 296, 0.570294648439, 201),
              (25, 296, 0.576307309235, 5064), (25, 296, 0.577913079103, 223), (25, 296, 0.615569633698, 201), (25, 296, 0.619178871735, 43506),
              (25, 296, 0.718168554371, 310), (25, 296, 0.747322660094, 202), (25, 296, 0.760729108487, 200), (25, 296, 0.822530768331, 205),
              (25, 296, 0.836178694446, 200), (25, 296, 0.922082501039, 202), (25, 297, 0.575427625651, 325), (25, 297, 0.582330676039, 200),
              (25, 297, 0.59032029128, 201), (25, 297, 0.590800056359, 200), (25, 297, 0.601983239354, 200), (25, 297, 0.604624445572, 349),
              (25, 297, 0.618592598637, 275), (25, 297, 0.630430484034, 1804), (25, 297, 0.747994935933, 200), (25, 297, 0.761287220637, 201),
              (25, 297, 0.784311762717, 24040), (25, 297, 0.805896793235, 207), (25, 297, 0.83131315563, 213), (25, 297, 0.859395579848, 209),
              (25, 297, 0.878201883463, 201), (25, 298, 0.365915091546, 7657), (25, 298, 0.52132925192, 201), (25, 298, 0.615089219457, 212),
              (25, 298, 0.637405757267, 200), (25, 298, 0.688423448322, 214), (25, 298, 0.745560344488, 206), (25, 298, 0.80183307542, 589),
              (25, 298, 0.870426520049, 272), (25, 298, 0.89547470974, 623), (25, 299, 0.439413983287, 621), (25, 299, 0.635744201316, 400665320),
              (25, 299, 0.635839800048, 203), (25, 299, 0.654855825396, 230), (25, 299, 0.663358277149, 200), (25, 299, 0.716539473345, 206),
              (25, 299, 0.720916562601, 201), (25, 299, 0.751504283345, 239), (25, 299, 0.757182123011, 201), (25, 299, 0.783947882789, 34957848),
              (25, 299, 0.787635290729, 2549771968), (25, 299, 0.810452875242, 200), (25, 299, 0.817501090944, 200),
              (25, 299, 0.825030647643, 202), (25, 299, 0.826381476756, 201309), (25, 299, 0.827561766471, 200), (25, 299, 0.838546065913, 406),
              (25, 299, 0.838668086492, 41065010), (25, 299, 0.849677138359, 205), (25, 299, 0.888747407578, 203), (25, 300, 0.409478159695, 204),
              (25, 300, 0.422098065762, 127064), (25, 300, 0.461635641331, 215), (25, 300, 0.509929449334, 200), (25, 300, 0.526075686173, 249),
              (25, 300, 0.570498723667, 200), (25, 300, 0.608851280889, 104165306), (25, 300, 0.63030529901, 200), (25, 300, 0.672317987371, 664),
              (25, 300, 0.673406231884, 200), (25, 300, 0.680327735756, 3147), (25, 300, 0.69178511648, 482), (25, 300, 0.72755905432, 201),
              (25, 300, 0.732997721442, 212), (25, 300, 0.753568752527, 200), (25, 300, 0.772028272907, 202), (25, 300, 0.779901373004, 21845295),
              (25, 300, 0.787993478282, 812414), (25, 300, 0.803550492942, 206), (25, 300, 0.874415087757, 263), (25, 300, 0.885236357353, 331970),
              (25, 300, 0.894826991604, 588), (25, 300, 0.902637609898, 85410), (25, 301, 0.360093850862, 204), (25, 301, 0.400165301316, 201),
              (25, 301, 0.469204018151, 200), (25, 301, 0.537073638156, 200), (25, 301, 0.546705498921, 203), (25, 301, 0.592970028946, 303),
              (25, 301, 0.608415628183, 200), (25, 301, 0.63288933406, 2859), (25, 301, 0.63664042403, 970), (25, 301, 0.64529111355, 402),
              (25, 301, 0.681233825938, 200), (25, 301, 0.690899753273, 210), (25, 301, 0.692879776665, 200), (25, 301, 0.748166942643, 201),
              (25, 301, 0.780381651924, 201), (25, 301, 0.786615600294, 240), (25, 301, 0.796162957093, 218), (25, 301, 0.814111079911, 200),
              (25, 301, 0.829147987825, 200), (25, 301, 0.842615885711, 226), (25, 301, 0.848517486075, 54928), (25, 301, 0.89833095029, 1550),
              (25, 302, 0.405983475031, 18159), (25, 302, 0.584932652676, 220), (25, 302, 0.64638274985, 200), (25, 302, 0.704969570047, 201),
              (25, 302, 0.706051626625, 200), (25, 302, 0.741692298627, 201), (25, 302, 0.756403039924, 236), (25, 302, 0.775208821424, 205),
              (25, 302, 0.820549125721, 6413), (25, 302, 0.843449911907, 202), (25, 302, 0.854836903637, 1444), (25, 302, 0.959731993346, 18110751),
              (25, 303, 0.424498218556, 202), (25, 303, 0.59564207646, 5859), (25, 303, 0.628957916397, 312), (25, 303, 0.675945884436, 405),
              (25, 303, 0.703924077074, 3436122), (25, 303, 0.775340096428, 203), (25, 303, 0.840799001056, 200), (25, 303, 0.843478387795, 200),
              (25, 303, 0.843620594265, 201), (25, 303, 0.882905827794, 200), (25, 303, 0.891407840196, 201), (25, 303, 0.928661184425, 200),
              (25, 304, 0.612209498336, 200), (25, 304, 0.632934680031, 217), (25, 304, 0.680037492863, 208), (25, 304, 0.702135166708, 201),
              (25, 304, 0.739443325164, 200), (25, 304, 0.792856089945, 18219), (25, 304, 0.793868715939, 1195), (25, 304, 0.807936872595, 243),
              (25, 304, 0.855967740382, 9139), (25, 304, 0.876774098031, 264), (25, 304, 0.908151496191, 200), (25, 304, 0.909811696417, 201),
              (25, 305, 0.467412185835, 220), (25, 305, 0.504357824329, 202), (25, 305, 0.551514171063, 1813205), (25, 305, 0.562075541816, 28957),
              (25, 305, 0.565839799799, 200), (25, 305, 0.57478857299, 201), (25, 305, 0.616735566246, 18968), (25, 305, 0.694741607886, 211),
              (25, 305, 0.730098365549, 350), (25, 305, 0.74553240102, 3669), (25, 305, 0.775373331324, 273), (25, 305, 0.782999134264, 202),
              (25, 305, 0.8326463963, 1718174), (25, 305, 0.84309553266, 34103), (25, 305, 0.853174396025, 202), (25, 305, 0.866990268211, 200),
              (25, 305, 0.95814576074, 207), (25, 306, 0.426358484538, 2619), (25, 306, 0.601491940798, 1236), (25, 306, 0.656076249659, 201),
              (25, 306, 0.667159358364, 8851), (25, 306, 0.680137137507, 93771), (25, 306, 0.702048987783, 215), (25, 306, 0.736768186781, 200),
              (25, 306, 0.766656078382, 225), (25, 306, 0.778572773772, 201), (25, 306, 0.788035864564, 224), (25, 306, 0.810667703089, 203),
              (25, 306, 0.816630755616, 10977626), (25, 306, 0.874526902796, 207), (25, 306, 0.909874167041, 200), (25, 306, 0.937873115764, 201),
              (25, 307, 0.399427943647, 308), (25, 307, 0.502741334176, 210), (25, 307, 0.529663281186, 1180), (25, 307, 0.630224185157, 200),
              (25, 307, 0.632031271401, 218), (25, 307, 0.640704995246, 203), (25, 307, 0.69831193803, 77936), (25, 307, 0.702814625747, 201),
              (25, 307, 0.747406138489, 202), (25, 307, 0.747958417038, 204), (25, 307, 0.784937576368, 1648), (25, 307, 0.794869553561, 201),
              (25, 307, 0.828787292708, 9683), (25, 307, 0.863132135591, 207), (25, 307, 0.869668447709, 207), (25, 307, 0.871174430814, 1874),
              (25, 307, 0.874024275077, 210), (25, 307, 0.888454459822, 2205), (25, 307, 0.908595383436, 1066), (25, 308, 0.325783846322, 203),
              (25, 308, 0.397224619923, 201), (25, 308, 0.398632358505, 1866), (25, 308, 0.549245225571, 223), (25, 308, 0.614176736202, 1627),
              (25, 308, 0.657868307531, 835565), (25, 308, 0.720171327788, 102155), (25, 308, 0.760576054602, 202), (25, 308, 0.789153275477, 205),
              (25, 308, 0.923998732287, 201), (25, 309, 0.561546536824, 200), (25, 309, 0.635434397698, 200), (25, 309, 0.643406617116, 203),
              (25, 309, 0.658560572878, 200), (25, 309, 0.674697791801, 248), (25, 309, 0.676725636526, 228), (25, 309, 0.676937775159, 200),
              (25, 309, 0.6792495829, 221), (25, 309, 0.716486921989, 200), (25, 309, 0.725512876373, 78653), (25, 309, 0.757976245878, 746),
              (25, 309, 0.781027844906, 200), (25, 309, 0.78623950218, 257), (25, 309, 0.810962413638, 230), (25, 310, 0.260183128022, 4099),
              (25, 310, 0.587394172232, 251), (25, 310, 0.621739526011, 322), (25, 310, 0.64027318687, 204), (25, 310, 0.67688167552, 202),
              (25, 310, 0.683476431144, 203), (25, 310, 0.74431135893, 10529), (25, 310, 0.804336362739, 201), (25, 310, 0.809925171125, 224),
              (25, 310, 0.84167604034, 219), (25, 310, 0.914653852136, 240), (25, 311, 0.312014901261, 200), (25, 311, 0.565120276608, 1451),
              (25, 311, 0.668447484328, 200), (25, 311, 0.693205380973, 201), (25, 311, 0.715572448621, 200), (25, 311, 0.735871833159, 206),
              (25, 311, 0.754197545032, 201), (25, 311, 0.755255511596, 207), (25, 311, 0.797893642622, 200), (25, 311, 0.82548808715, 629357488),
              (25, 311, 0.848601310255, 212), (25, 311, 0.893160345087, 222), (25, 311, 0.907535669966, 210), (25, 312, 0.390384390913, 223),
              (25, 312, 0.404305090449, 201), (25, 312, 0.587382901527, 200), (25, 312, 0.607534113237, 204), (25, 312, 0.719561090933, 2427),
              (25, 312, 0.765116208912, 207), (25, 312, 0.768884636481, 201), (25, 312, 0.77321985892, 136853), (25, 312, 0.781591538183, 357),
              (25, 312, 0.786119097558, 3483946), (25, 312, 0.848619555391, 11017), (25, 312, 0.954299904225, 200), (25, 312, 0.955187319821, 204),
              (25, 313, 0.529251017792, 204), (25, 313, 0.541364870877, 200), (25, 313, 0.60546139467, 230), (25, 313, 0.61482220499, 24877),
              (25, 313, 0.638663284904, 209), (25, 313, 0.644164848762, 200), (25, 313, 0.64713451255, 200), (25, 313, 0.743586331391, 201),
              (25, 313, 0.775805821428, 62845), (25, 313, 0.798122759695, 7977), (25, 313, 0.858577862803, 127038), (25, 313, 0.866823531593, 200),
              (25, 313, 0.876882672035, 283), (25, 313, 0.917675334467, 200), (25, 314, 0.369362271547, 200), (25, 314, 0.37571281143, 7538438),
              (25, 314, 0.421702878008, 872), (25, 314, 0.512143047288, 200), (25, 314, 0.615334964732, 41957), (25, 314, 0.634135010466, 201),
              (25, 314, 0.65143589365, 5976), (25, 314, 0.6624335318, 202), (25, 314, 0.687439523632, 207), (25, 314, 0.70341643248, 219100),
              (25, 314, 0.71373952253, 200), (25, 314, 0.716799246311, 200), (25, 314, 0.736070231117, 202), (25, 314, 0.769820801936, 201),
              (25, 314, 0.796382966462, 1810920), (25, 314, 0.817248231272, 200), (25, 314, 0.911372340659, 217771),
              (25, 315, 0.494168494957, 200), (25, 315, 0.525928344185, 108504), (25, 315, 0.530891192429, 309), (25, 315, 0.543427134741, 570),
              (25, 315, 0.582642352511, 200), (25, 315, 0.62020198716, 204), (25, 315, 0.66021096512, 201), (25, 315, 0.810810141201, 125205306),
              (25, 315, 0.875437215808, 201), (25, 315, 0.891388896727, 202), (25, 315, 0.950706326773, 230), (25, 316, 0.531487130493, 203),
              (25, 316, 0.618472935292, 251), (25, 316, 0.647254887133, 206), (25, 316, 0.653924814146, 212), (25, 316, 0.667901497546, 212),
              (25, 316, 0.708180220581, 200), (25, 316, 0.760548015821, 2030), (25, 316, 0.768690555637, 200), (25, 316, 0.774476148446, 200),
              (25, 316, 0.796262761253, 201), (25, 316, 0.796625765733, 200), (25, 316, 0.823221552247, 200), (25, 316, 0.830702399623, 200),
              (25, 316, 0.836360274124, 325), (25, 316, 0.896613527206, 248), (25, 317, 0.482186115357, 383), (25, 317, 0.576756996375, 200),
              (25, 317, 0.578057886341, 204), (25, 317, 0.594056991632, 203), (25, 317, 0.631811741834, 200), (25, 317, 0.634065817187, 201),
              (25, 317, 0.738259051865, 200), (25, 317, 0.741522939153, 200), (25, 317, 0.753479524136, 2637045), (25, 317, 0.776711507455, 301),
              (25, 317, 0.788723637838, 222), (25, 317, 0.859905403917, 201), (25, 318, 0.524837076185, 783), (25, 318, 0.570833052501, 324),
              (25, 318, 0.578522088699, 201), (25, 318, 0.657902296334, 202), (25, 318, 0.677142738648, 200), (25, 318, 0.726613137498, 769),
              (25, 318, 0.732167007028, 209), (25, 318, 0.774227686638, 371), (25, 318, 0.827201651993, 202), (25, 318, 0.847552472266, 333),
              (25, 319, 0.299119763862, 203), (25, 319, 0.45927498818, 201), (25, 319, 0.516148758591, 203), (25, 319, 0.615909131871, 4852),
              (25, 319, 0.632155319322, 108442), (25, 319, 0.646735509589, 260), (25, 319, 0.650284817404, 205), (25, 319, 0.686381662086, 9214),
              (25, 319, 0.693039056628, 200), (25, 319, 0.744051833919, 202), (25, 319, 0.766865749349, 200), (25, 319, 0.787624751367, 209),
              (25, 319, 0.80385182439, 201), (25, 319, 0.828506576415, 203), (25, 319, 0.882672117998, 236), (25, 319, 0.897596538512, 320),
              (25, 320, 0.463229776045, 201), (25, 320, 0.495664364554, 543), (25, 320, 0.557110073697, 46084), (25, 320, 0.564003442697, 228),
              (25, 320, 0.592427293085, 200), (25, 320, 0.647193255894, 541), (25, 320, 0.687780061052, 305), (25, 320, 0.804990051837, 204),
              (25, 320, 0.812916984869, 1331), (25, 320, 0.931548693535, 1716800), (25, 321, 0.629165408494, 202), (25, 321, 0.646414688214, 1085812),
              (25, 321, 0.675364980195, 20790), (25, 321, 0.681600225983, 206), (25, 321, 0.685460441866, 13703), (25, 321, 0.832202070011, 236),
              (25, 321, 0.869702143767, 228), (25, 322, 0.359393286821, 200), (25, 322, 0.456060077056, 234), (25, 322, 0.489952848972, 219),
              (25, 322, 0.514488904665, 327), (25, 322, 0.599151587589, 200), (25, 322, 0.669303689783, 200), (25, 322, 0.696930419223, 216),
              (25, 322, 0.727835215524, 477), (25, 322, 0.80782526029, 200), (25, 322, 0.833256121648, 129263), (25, 322, 0.870336001756, 216),
              (25, 322, 0.940963125417, 679), (25, 323, 0.396053762877, 201), (25, 323, 0.425459758563, 200), (25, 323, 0.55251719767, 201),
              (25, 323, 0.584884865708, 200), (25, 323, 0.609829616479, 203), (25, 323, 0.612493852048, 322), (25, 323, 0.691710173338, 201),
              (25, 323, 0.71108498005, 311), (25, 323, 0.788507519521, 203), (25, 323, 0.807011603621, 200), (25, 323, 0.828169583641, 201),
              (25, 323, 0.913928205211, 201), (25, 324, 0.45836160144, 200), (25, 324, 0.499107021276, 13215), (25, 324, 0.559870976212, 200),
              (25, 324, 0.603511924685, 200), (25, 324, 0.623198421134, 210), (25, 324, 0.633639995522, 338), (25, 324, 0.635556007837, 246),
              (25, 324, 0.683944057834, 364), (25, 324, 0.730240665978, 202), (25, 324, 0.786380149202, 859), (25, 324, 0.798736866984, 263),
              (25, 324, 0.805943711046, 298), (25, 324, 0.820587885957, 201), (25, 325, 0.231831459255, 200), (25, 325, 0.563362833248, 239),
              (25, 325, 0.636026944883, 21971), (25, 325, 0.669608327745, 200), (25, 325, 0.716574236797, 200), (25, 325, 0.724404665933, 206),
              (25, 325, 0.739644789795, 40472), (25, 325, 0.752542864573, 200), (25, 325, 0.824388257792, 6198), (25, 325, 0.915197856113, 202),
              (25, 326, 0.707329416062, 208), (25, 326, 0.728563259999, 204), (25, 326, 0.752738034251, 201), (25, 326, 0.825564275897, 204),
              (25, 327, 0.719013790228, 203), (25, 327, 0.72228899991, 200), (25, 327, 0.751466201033, 209), (25, 327, 0.759232117888, 211),
              (25, 327, 0.767027224929, 1537), (25, 327, 0.778548686629, 203), (25, 327, 0.784075647798, 200), (25, 327, 0.793621518891, 200),
              (25, 327, 0.806195524296, 229), (25, 327, 0.808608560267, 201), (25, 327, 0.827857910169, 572), (25, 328, 0.393098272591, 203),
              (25, 328, 0.394325016086, 2993), (25, 328, 0.56951420425, 201), (25, 328, 0.580323359062, 2924475), (25, 328, 0.580905747982, 201),
              (25, 328, 0.583192577095, 542), (25, 328, 0.597499239304, 213), (25, 328, 0.600613025852, 253), (25, 328, 0.621643431237, 436),
              (25, 328, 0.701710063929, 200), (25, 328, 0.716681957271, 202), (25, 328, 0.761650010569, 202), (25, 328, 0.772259027715, 206),
              (25, 328, 0.779616327067, 200), (25, 328, 0.866568305949, 201), (25, 328, 0.898210193557, 515), (25, 329, 0.375129736945, 211),
              (25, 329, 0.446336810237, 285), (25, 329, 0.642038213561, 1496), (25, 329, 0.645488070638, 205), (25, 329, 0.771011371351, 203),
              (25, 329, 0.829964202191, 212), (25, 330, 0.511762908051, 201), (25, 330, 0.518705307319, 200), (25, 330, 0.584850797881, 200),
              (25, 330, 0.730246394391, 135696), (25, 330, 0.751681569544, 2366819428908), (25, 330, 0.92157485973, 200),
              (25, 331, 0.393252838456, 201), (25, 331, 0.413705536186, 200), (25, 331, 0.541667450199, 200), (25, 331, 0.610163703725, 202),
              (25, 331, 0.662097234233, 201), (25, 331, 0.711996875688, 557), (25, 331, 0.756070937233, 200), (25, 331, 0.77482801802, 660),
              (25, 331, 0.785152716014, 202), (25, 331, 0.804310562109, 201), (25, 331, 0.869247778778, 202), (25, 331, 0.900011576777, 206),
              (25, 332, 0.360501981662, 200), (25, 332, 0.396119173231, 1350), (25, 332, 0.549423865109, 204), (25, 332, 0.572660049768, 12648),
              (25, 332, 0.626922110118, 200), (25, 332, 0.725028858444, 7829535), (25, 332, 0.770714243024, 200), (25, 332, 0.778895975847, 228),
              (25, 332, 0.790939903651, 200), (25, 332, 0.845479439223, 204), (25, 333, 0.552595567924, 4737), (25, 333, 0.628912424024, 230),
              (25, 333, 0.712025685826, 205), (25, 333, 0.745406810596, 201), (25, 333, 0.905976663231, 205), (25, 334, 0.483908768827, 207),
              (25, 334, 0.60959982508, 200), (25, 334, 0.812519911253, 224), (25, 334, 0.843221977913, 217), (25, 335, 0.60973667496, 201),
              (25, 335, 0.693600650693, 201), (25, 335, 0.728238441845, 203), (25, 335, 0.75743368108, 210), (25, 335, 0.810360530954, 207),
              (25, 335, 0.882951536209, 200), (25, 335, 0.893759028105, 208), (25, 336, 0.740623773204, 201), (25, 336, 0.755244695771, 204),
              (25, 336, 0.831100589453, 200), (25, 337, 0.703493435921, 327), (25, 337, 0.718881873842, 226), (25, 337, 0.81048243721, 206),
              (25, 337, 0.899330103434, 18637), (25, 338, 0.700633185403, 206), (25, 338, 0.802867938789, 205), (25, 338, 0.805246709953, 4733064),
              (25, 338, 0.861528612477, 217), (25, 338, 0.875310494507, 201), (25, 339, 0.530202056325, 201), (25, 339, 0.596821460467, 206),
              (25, 339, 0.637622807415, 201), (25, 339, 0.720319384906, 1063), (25, 339, 0.925562964099, 241), (25, 340, 0.545836200959, 81897),
              (25, 340, 0.575462739406, 8298), (25, 340, 0.66454554903, 392), (25, 340, 0.665110271015, 200), (25, 340, 0.682396835586, 583),
              (25, 340, 0.705902758344, 216), (25, 341, 0.485770196306, 202), (25, 341, 0.572928076121, 8075702), (25, 341, 0.584378279664, 200),
              (25, 341, 0.671131495684, 201), (25, 341, 0.681341791186, 255), (25, 341, 0.729285274999, 56404734), (25, 341, 0.817259802301, 214),
              (25, 342, 0.7941751354, 201), (25, 342, 0.815001400116, 209), (25, 343, 0.713674285547, 204), (25, 343, 0.776290208985, 210),
              (25, 344, 0.578809632483, 200), (25, 344, 0.602669599066, 204), (25, 344, 0.640884720483, 201), (25, 344, 0.689617532849, 200),
              (25, 344, 0.735474986847, 793226), (25, 344, 0.801550588878, 200), (25, 345, 0.511613953266, 200), (25, 345, 0.599632141615, 3606),
              (25, 345, 0.725633060098, 213), (25, 345, 0.864442340693, 200), (25, 346, 0.629039256263, 519), (25, 346, 0.868847037952, 26139957),
              (25, 347, 0.383146657028, 505762), (25, 347, 0.793122392817, 887), (25, 348, 0.634239706868, 200), (25, 348, 0.747883842713, 908),
              (25, 348, 0.751757880877, 206), (25, 349, 0.617658321678, 200), (25, 350, 0.747783531486, 2330), (25, 350, 0.860911393971, 338),
              (25, 352, 0.686825990299, 10332), (25, 353, 0.683483882694, 227), (25, 353, 0.693201365913, 201), (25, 353, 0.86051685785, 79018),
              (25, 354, 0.700525525741, 200), (25, 355, 0.81145294361, 2178), (25, 356, 0.595846869158, 200), (25, 356, 0.619201382241, 1819),
              (25, 358, 0.563881791069, 210), (25, 358, 0.69894141965, 200), (25, 358, 0.743018024778, 325), (25, 358, 0.797592813469, 200),
              (25, 360, 0.663337283569, 258), (25, 360, 0.788791573196, 376), (25, 361, 0.527838758249, 290), (25, 361, 0.64267249425, 242),
              (25, 363, 0.487136757184, 200), (25, 365, 0.734633994789, 213), (25, 365, 0.889239450729, 225), (25, 368, 0.664001959486, 201),
              (25, 370, 0.593215752818, 1180660)]

Lower bound

One baseline would be to simply take our pre-existing exact solution V/VPplan and run it on the generalized Kelly game, ignoring observations and assuming 300 rounds. It should perform reasonably well, since it plays optimally in at least one possible setting of the parameters:

import gym
env = gym.make('KellyCoinflipGeneralized-v0')

# play 1,000 games with the fixed-parameter exact policy, ignoring all observations
# except current wealth, and assuming the standard 300 rounds & $250 ceiling:
rewards = []
for n in range(1000):
    done = False
    reward = 0
    roundsLeft = 300
    while not done:
        w = env._get_obs()[0][0]       # current wealth in dollars
        b = roundsLeft                 # assumed bets remaining
        bet = VPplan(w, b)             # exact-solution bet in dollars
        results = env._step(bet*100)   # the env takes bets in pennies
        reward = reward + results[1]
        print n, w, b, bet, "results:", results
        roundsLeft = max(0, roundsLeft - 1)
        done = results[2]
    rewards.append(reward)
    env._reset()

print sum(rewards)/len(rewards)
# 140.642807947

In the gen­eral game, it per­forms poorly because it self­-lim­its to $250 in the many games where the max cap is high­er, and it often wipes out to $0 in games where the edge is small or neg­a­tive.
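For comparison, another crude baseline would be a fixed-fraction bettor, which at least scales its bets with wealth instead of capping itself at the $250 plan. A minimal sketch, reusing the env and loop from above; the 0.12 fraction echoes the ~12% heuristic discussed below and is not tuned, and no results are quoted since I have not run it:

rewards = []
for n in range(1000):
    done = False
    reward = 0
    while not done:
        w = env._get_obs()[0][0]       # current wealth in dollars
        bet = 0.12 * w                 # bet a fixed ~12% fraction of current wealth
        results = env._step(bet*100)   # the env takes bets in pennies
        reward = reward + results[1]
        done = results[2]
    rewards.append(reward)
    env._reset()

print sum(rewards)/len(rewards)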

Deep RL

Since the Bayesian decision tree is too hard to compute in full, we need a different approach. Tabular approaches in general will have difficulty: the full history makes the state-space vastly larger, and even the reduced representation of the sufficient statistics (wins/losses, rounds elapsed, and maximum observed wealth) still blows up the state-space.

One approach which should be able to cope with these complexities is deep reinforcement learning, whose neural networks can optimize well in difficult environments. Given the sufficient statistics, we don't need to train an explicitly Bayesian agent: a deep RL agent trained on random instances can learn a reactive policy which adapts 'online' and computes an approximation of the Bayes-optimal agent ( is a good review of DRL/Bayesian RL), which as we have already seen is intractable even for small problems like this.

Unfor­tu­nate­ly, the spar­sity of rewards in the Kelly coin­flip game (re­ceiv­ing a reward only every ~300 steps) does make solv­ing the game much harder for DRL agents.

To cre­ate a deep RL agent for the gen­er­al­ized Kelly coin-flip game, I will use the keras-rl library of agents, which is based on the Keras deep learn­ing frame­work (backed by Ten­sor­Flow); deep learn­ing is often sen­si­tive to choice of hyper­pa­ra­me­ters, so for hyper­pa­ra­me­ter opti­miza­tion, I use the hyperas wrap­per around hyper­opt (paper).

keras-rl offers primarily³ DQN () and DDPG ().

DQN is a possibility, but for efficiency⁴ most DQN implementations require/assume a fixed number of actions, which in our case of penny-level betting implies having 25,000 distinct unordered actions. I tried prototyping DQN for the simple Kelly coin-flip game using the keras-rl DQN example, but after hours it still had not received any rewards: it repeatedly bet more than its current wealth (which the environment treats as betting its entire wealth), busting out and never managing to go 300 rounds to receive any reward and begin learning. (This sort of issue with sparse rewards is a common problem with DQN and deep RL in general, as the most common forms of exploration will flail around at random rather than make deep targeted exploration of particular strategies or try to seek out new regions of state-space which might contain rewards.) One possibility would have been to 'cheat' by using the exact value function to provide optimal trajectories for the DQN agent to learn from.
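To make the difficulty concrete, here is a sketch of what the Q-network head would have to look like under the standard fixed-action DQN parameterization: one output per possible penny-level bet. (Illustration only; it assumes the same 5-element observation used in the DDPG code below, and most of the 25,000 outputs would be invalid actions in any given state.)

from keras.models import Sequential
from keras.layers import Dense, Flatten

nb_actions = 250 * 100   # one Q-value output per possible penny-level bet, $0.00-$250.00

qnet = Sequential()
qnet.add(Flatten(input_shape=(1,) + (5,)))        # 5-element observation, as in the DDPG code below
qnet.add(Dense(64, activation='relu'))
qnet.add(Dense(64, activation='relu'))
qnet.add(Dense(nb_actions, activation='linear'))  # one Q-value for each of the 25,000 bets
print(qnet.summary())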

DDPG isn't like tabular learning but is a policy gradient method (Karpathy explains policy gradients). With deep policy gradients, the basic idea is that the 'actor' NN repeatedly outputs an action based on the current state, and then at the end of the episode, with the final total reward, if the reward is higher than expected, all the neurons which contributed to all the actions get strengthened to be more likely to produce those actions, and if the reward is lower than expected, they are weakened. This is a very indiscriminate, error-filled way to train neurons, but it does work if run over enough episodes for the errors to cancel out; the noise can be reduced by training an additional 'critic' NN to predict the rewards from actions, based on an experience replay buffer. The NNs are usually fairly small (to avoid overfitting & because of the weak supervision), use ReLUs, and the actor NN often has batch normalization added to it. (To prevent them from learning too fast and interfering with each other, a copy is made of each, called the 'target networks'.) For exploration, some random noise is added to the final action; not epsilon-random noise (like in DQN) but a sort of random walk, to make the continuous action consistently lower or higher (and avoid canceling out).
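To illustrate what that random-walk exploration noise looks like, here is a minimal simulation of an Ornstein-Uhlenbeck process like the one keras-rl uses (a sketch of the update rule only, not keras-rl's exact code; parameter names mirror the θ/μ/σ settings below):

import numpy as np

def ou_noise(theta=0.15, mu=0.0, sigma=0.2, dt=1.0, steps=300):
    # Each step drifts back toward `mu` at rate `theta` while being perturbed by
    # Gaussian noise of scale `sigma`, so successive exploration offsets are
    # temporally correlated rather than independent like epsilon-random noise.
    x = np.zeros(steps)
    for t in range(1, steps):
        x[t] = x[t-1] + theta * (mu - x[t-1]) * dt + sigma * np.sqrt(dt) * np.random.normal()
    return x

print(ou_noise()[:10])   # the first few correlated exploration offsets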

A continuous action avoids the unordered-action/curse-of-dimensionality problem, but requires switching from DQN to DDPG (example). Unfortunately, DDPG also had initialization problems with overbetting: its chosen actions would quickly drift to always betting either very large positive or negative numbers, which get converted to betting everything or $0 respectively (which then always result in no rewards, via either busting & getting 0 or going 300 rounds & receiving a reward of 0), and it would never return to meaningful bets. To deal with the drift/busting-out problem, I decided to add some sort of constraint on the agent: make it choose a continuous action 0–1 (squashed by a final sigmoid activation rather than linear or tanh), and convert it into a valid bet, so all bets would be meaningful and it would have less of a chance to overbet and bust out immediately. (This trick might also be usable in DQN if we discretize 0–1 into, say, 100 percentiles.) My knowledge of Keras is weak, so I was unsure how to convert an action 0–1 to a fraction of the agent's current wealth, although I did figure out how to multiply it by $250 before passing it in. This resulted in meaningful progress but was sensitive to hyperparameters; for example, the DDPG appeared to train worse with a very large experience replay buffer than with the Lillicrap et al 2015 setting of 10,000.
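For what it is worth, the Keras functional API (rather than a Sequential model) appears to make the wealth-fraction scaling expressible: extract the wealth component of the observation and multiply the sigmoid output by it. A hedged sketch only, assuming wealth is the first element of the 5-element observation as in the code below; I have not verified that it trains better:

from keras.layers import Input, Flatten, Dense, Lambda, Multiply
from keras.models import Model

obs = Input(shape=(1, 5), name='observation_input')
flat = Flatten()(obs)
h = Dense(64, activation='relu')(flat)
h = Dense(64, activation='relu')(h)
fraction = Dense(1, activation='sigmoid')(h)    # fraction of current wealth to bet, 0-1
wealth = Lambda(lambda o: o[:, 0:1])(flat)      # assumes observation element 0 = current wealth
bet = Multiply()([fraction, wealth])            # bet = fraction * current wealth
actor = Model(inputs=obs, outputs=bet)

This corresponds to improvement #1 in the list at the end.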

Hyper­pa­ra­me­ter-wise, Lil­l­i­crap et al 2015 used, for the low-di­men­sional (non-ALE screen) RL prob­lems sim­i­lar to the gen­eral Kelly coin-flip, the set­tings:

  • 2 hid­den lay­ers of 400 & 300 RELU neu­rons into a final tanh
  • an expe­ri­ence replay buffer of 10,000
  • Adam SGD opti­miza­tion algo­rithm with learn­ing rate: 10⁻⁴
  • discount/gamma: 0.99
  • mini­batch size: 64
  • an Ornstein-Uhlenbeck random walk, μ=0, σ=0.2, θ=0.15

For the simple Kelly coin-flip problem, the previous experiment with training based on the exact value function demonstrates that 400+300 neurons would be overkill, and probably a much smaller 3-layer network with <128 neurons would be more than adequate; the discount/gamma being set to 0.99 for Kelly is questionable, as almost no steps result in a reward, and a discount rate of 0.99 would imply for a 300-round game that the final reward from the perspective of the first round would be almost worthless (since 0.99³⁰⁰ ≈ 0.05), so a discount rate of 0.999 or even just 1 would make more sense; because the reward is delayed, I expect the gradients to not be so helpful, and perhaps a lower learning rate or larger minibatch size would be required; finally, the exploration noise is probably too small, as the random walk would tend to increase or decrease bet amounts by less than $1, so probably a larger σ would be better. With these settings, DDPG can achieve an average per-step/round reward of ~0.8, as compared to the optimal average reward of 0.82 ($246 over 300 steps/rounds).

For the gen­eral Kelly coin-flip prob­lem, the opti­mal aver­age reward is unknown, but ini­tial tries get ~0.7 (~$210). To try to dis­cover bet­ter per­for­mance, I set up hyper­pa­ra­me­ter opti­miza­tion using hyperas/hyperopt over the fol­low­ing para­me­ters:

  • num­ber of neu­rons in each layer in the actor and critic (16–128)

  • size of the expe­ri­ence replay buffer (0–100000)

  • explo­ration noise:

    • μ (normal(0, 5))
    • θ (0–1)
    • σ (normal(3, 3))
  • learn­ing rate:

    • Adam, main NNs (log-uniform −7 to −2; rough­ly, 0.14–0.0009)
    • Adam, tar­get NNs (*)
  • mini­batch size (8–2056)

Each agent/sample is trained ~8h and is eval­u­ated on mean total reward over 2000 episodes, with 120 sam­ples total.

Full source code for the gen­er­al­ized Kelly coin-flip game DDPG:

import numpy as np
import gym

from hyperas.distributions import choice, randint, uniform, normal, loguniform
from hyperas import optim
from hyperopt import Trials, STATUS_OK, tpe

from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, concatenate, Lambda, BatchNormalization
from keras.optimizers import Adam

from rl.agents import DDPGAgent
from rl.memory import SequentialMemory
from rl.random import OrnsteinUhlenbeckProcess

def model(x_train, y_train, x_test, y_test):
    ENV_NAME = 'KellyCoinflipGeneralized-v0'
    gym.undo_logger_setup()

    # Get the environment and extract the number of actions.
    env = gym.make(ENV_NAME)
    np.random.seed(123)
    env.seed(123)
    nb_actions = 1

    # Next, we build a very simple model.
    actor = Sequential()
    actor.add(Flatten(input_shape=(1,) + (5,)))
    actor.add(Dense({{choice([16, 32, 64, 96, 128])}}))
    actor.add(BatchNormalization())
    actor.add(Activation('relu'))
    actor.add(Dense({{choice([16, 32, 64, 96, 128])}}))
    actor.add(BatchNormalization())
    actor.add(Activation('relu'))
    actor.add(Dense({{choice([16, 32, 64, 96, 128])}}))
    actor.add(BatchNormalization())
    actor.add(Activation('relu'))
    # pass into a single sigmoid to force a choice 0-1, corresponding to fraction of total possible wealth.
    # It would be better to multiply the fraction against one's *current* wealth to reduce the pseudo-shift
    # in optimal action with increasing wealth, but how do we set up a multiplication against the first
    # original input in the Flatten layer? This apparently can't be done as a Sequential...
    actor.add(Dense(nb_actions))
    actor.add(BatchNormalization())
    actor.add(Activation('sigmoid'))
    actor.add(Lambda(lambda x: x*250))
    print(actor.summary())

    action_input = Input(shape=(nb_actions,), name='action_input')
    observation_input = Input(shape=(1,) + (5,), name='observation_input')
    flattened_observation = Flatten()(observation_input)
    x = concatenate([action_input, flattened_observation])
    x = Dense({{choice([16, 32, 64, 96, 128])}})(x)
    x = Activation('relu')(x)
    x = Dense({{choice([16, 32, 64, 96, 128])}})(x)
    x = Activation('relu')(x)
    x = Dense({{choice([16, 32, 64, 96, 128])}})(x)
    x = Activation('relu')(x)
    x = Dense(nb_actions)(x)
    x = Activation('linear')(x)
    critic = Model(inputs=[action_input, observation_input], outputs=x)
    print(critic.summary())

    memory = SequentialMemory(limit={{randint(100000)}}, window_length=1)
    random_process = OrnsteinUhlenbeckProcess(size=nb_actions,
        theta={{uniform(0,1)}}, mu={{normal(0,5)}}, sigma={{normal(3,3)}})
    agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic, critic_action_input=action_input,
                      memory=memory, nb_steps_warmup_critic=301, nb_steps_warmup_actor=301,
                      random_process=random_process, gamma=1, target_model_update={{loguniform(-7,-2)}},
                      batch_size={{choice([8,16,32,64,256,512,1024,2056])}})
    agent.compile(Adam(lr={{loguniform(-7,-2)}}), metrics=['mae'])

    # Train; ~120 steps/s, so train for ~8 hours:
    agent.fit(env, nb_steps=3000000, visualize=False, verbose=1, nb_max_episode_steps=1000)

    # After training is done, we save the final weights.
    agent.save_weights('ddpg_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

    # Finally, evaluate our algorithm for n episodes.
    h = agent.test(env, nb_episodes=2000, visualize=False, nb_max_episode_steps=1000)
    reward = np.mean(h.history['episode_reward'])
    print("Reward:", reward)
    return {'loss': -reward, 'status': STATUS_OK, 'model': agent}

def data():
    return [], [], [], []

if __name__ == '__main__':
    best_run, best_model = optim.minimize(model=model,
                                          data=data,
                                          algo=tpe.suggest,
                                          max_evals=120, # 8h each, 4 per day, 30 days, so 120 trials
                                          trials=Trials())
    print("Best performing model chosen hyper-parameters:")
    print(best_run)

How could this be improved?

  1. fix the out­put scal­ing to be a frac­tion of cur­rent wealth rather than a fixed amount of $250; this avoids forc­ing the NN to learn to change its actions based on cur­rent wealth just to keep a con­stant bet amount, let­ting it instead focus on learn­ing devi­a­tions from the base­line Kelly small bet of ~12% (the func­tional-API sketch above sug­gests one way to express this)

  2. exper­i­ment with batch­norm in the critic as well

  3. incor­po­rate the exact value func­tion into the train­ing to see how much ben­e­fit the addi­tional super­vi­sion pro­vides and how much dam­age is caused by the RL set­ting, or pre-pop­u­late the expe­ri­ence replay buffer based on sam­ples from the exact value func­tion or the 12% heuris­tic; it might also be inter­est­ing to see how much pre­train­ing on the sim­ple Kelly coin-flip helps with train­ing on the gen­eral Kelly coin-flip

  4. exper­i­ment with addi­tional reg­u­lar­iza­tion meth­ods like L2 norm (which Lil­l­i­crap et al 2015 also used)

  5. try vari­ants on the envi­ron­ment:

    • scale rewards down to 0–1; deep RL approaches report­edly have trou­ble with rewards which are not on the unit-s­cale (see the wrap­per sketch after this list)
    • instead of under­-bet­ting=$0 and over­bet­ting=cur­rent wealth, make them ille­gal bets which ter­mi­nate an episode
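As a concrete illustration of the reward-scaling variant, a minimal sketch of a gym reward wrapper which crudely divides the terminal wealth reward by a fixed $250 ceiling (for the generalized game one would ideally divide by the per-episode sampled ceiling if the environment exposes it; and depending on the gym version, the override hook may be named `reward` or `_reward`):

import gym

class UnitScaleRewards(gym.RewardWrapper):
    # Divide the terminal wealth reward by a rough ceiling so rewards land in ~0-1.
    def __init__(self, env, ceiling=250.0):
        super(UnitScaleRewards, self).__init__(env)
        self.ceiling = ceiling
    def reward(self, reward):
        return reward / self.ceiling

env = UnitScaleRewards(gym.make('KellyCoinflipGeneralized-v0'))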

  1. That is, the value of the actions/bets 0-w seems like it would be biton­ic—in­creas­ing lin­early to the max­i­mum then lin­early decreas­ing. So one ought to be able to do a binary search over actions or some­thing even bet­ter. I assume this is related to Bell­man’s proof that finite-hori­zon MDPs’ value func­tions are con­vex and piece­wise lin­ear.↩︎

  2. Pre­sum­ably a 3×5 or smaller NN would also work, but I expe­ri­enced con­ver­gence issues with the SGD opti­mizer for all 3-layer NNs I exper­i­mented with, and as of 2017-03-29, an MXNet bug blocked use of the bet­ter opti­miz­ers.↩︎

  3. I also tried its CEM imple­men­ta­tion, but after try­ing a vari­ety of set­tings, CEM never per­formed bet­ter than an aver­age reward of 0.4 or ~$120, infe­rior to most DDPG runs.↩︎

  4. This follows the original Mnih et al 2013 implementation. The NN is constructed with the screen observation state as the input and a final top layer outputting n reals corresponding to the q-value of the n actions, in that case the fixed number of buttons on the Atari controller. This allows enumeration of all possible actions' values in a single forward pass of the NN, and picking the maximum… The alternative would be to input both the screen observation and one possible action, and get out a single q-value for that one action; then to act, the agent would loop over all possible actions in that state and find the maximum, which would allow a varying number of possible actions in each state, but at the cost of doing n forward passes (one for each possible action). In the Atari setting, the DQN agent interacts exclusively through the controller, which has only a few buttons and a directional pad; and the environments are complex enough that it would be hard to define in advance what are the valid actions for each possible screen, which is something that must be learned, so this fixed-action trick is both necessary and efficient. Thus, it is also used in most implementations or extensions of DQN. In the Kelly coinflip, on the other hand, the discretization is problematic since the majority of actions will be invalid, using an invalid action is damaging since it leads to overbetting, and we can easily define the range of valid actions in each state (0-w), so the trick is much less compelling.↩︎