Evolution as Backstop for Reinforcement Learning

Markets/evolution as backstops/ground truths for reinforcement learning/optimization: on some connections between Coase’s theory of the firm/linear optimization/DRL/evolution/multicellular life/pain as multi-level optimization problems.
Bayes, biology, psychology, decision-theory, sociology, NN, philosophy, insight-porn
2018-12-06–2019-12-01 · finished · certainty: possible · importance: 7


One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest interpreting the phenomenon as a multi-level nested optimization paradigm: many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming. So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that there are inherent balances between market/non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.

In Coase’s theory of the firm, a paradox is noted: idealized competitive markets are optimal for allocating resources and making decisions to reach efficient outcomes, but each market is made up of participants such as large multinational mega-corporations which are not internally made of markets and make their decisions by non-market mechanisms, even for things which could clearly be outsourced. In an oft-quoted and amusing passage, Herbert Simon dramatizes the actual situation:

Suppose that [“a mythical visitor from Mars”] approaches the Earth from space, equipped with a telescope that reveals social structures. The firms reveal themselves, say, as solid green areas with faint interior contours marking out divisions and departments. Market transactions show as red lines connecting firms, forming a network in the spaces between them. Within firms (and perhaps even between them) the approaching visitor also sees pale blue lines, the lines of authority connecting bosses with various levels of workers. As our visitor looked more carefully at the scene beneath, it might see one of the green masses divide, as a firm divested itself of one of its divisions. Or it might see one green object gobble up another. At this distance, the departing golden parachutes would probably not be visible. No matter whether our visitor approached the United States or the Soviet Union, urban China or the European Community, the greater part of the space below it would be within green areas, for almost all of the inhabitants would be employees, hence inside the firm boundaries. Organizations would be the dominant feature of the landscape. A message sent back home, describing the scene, would speak of “large green areas interconnected by red lines.” It would not likely speak of “a network of red lines connecting green spots.”…When our visitor came to know that the green masses were organizations and the red lines connecting them were market transactions, it might be surprised to hear the structure called a market economy. “Wouldn’t ‘organizational economy’ be the more appropriate term?” it might ask.

A free com­pet­i­tive mar­ket is a weigh­ing machine, not a think­ing machine; it weighs & com­pares pro­posed buys & sells made by par­tic­i­pants, and reaches a clear­ing price. But where, then, do the things being weighed come from? Mar­ket par­tic­i­pants are them­selves not mar­kets, and to appeal to the wis­dom of the mar­ket is buck­-­pass­ing; if mar­kets ‘elicit infor­ma­tion’ or ‘incen­tivize per­for­mance’, how is that infor­ma­tion learned and expressed, and where do the actual actions which yield higher per­for­mance come from? At some point, some­one has to do some real think­ing. (A com­pany can out­source its jan­i­tors to the free mar­ket, but then what­ever con­trac­tor is hired still has to decide exactly when and where and how to do the jan­i­tor-ing; safe to say, it does not hold an inter­nal auc­tion among its jan­i­tors to divide up respon­si­bil­i­ties and set their sched­ules.)

The para­dox is that free mar­kets appear to depend on enti­ties which are inter­nally run as total­i­tar­ian com­mand dic­ta­tor­ships. One might won­der why there is such a thing as a firm, instead of every­thing being accom­plished by exchanges among the most atomic unit (cur­rent­ly) pos­si­ble, indi­vid­ual humans. Coase’s sug­ges­tion is that it is a prin­ci­pal-a­gent prob­lem: there’s risk, nego­ti­a­tion costs, trade secrets, betray­al, and hav­ing a dif­fer­ence between the prin­ci­pal and agent at all can be too expen­sive & have too much over­head.

Asymptotics Ascendant

An alternative perspective comes from the socialist calculation debate: why have a market at all, with all its waste and competition, if a central planner can compute optimal allocations and simply decree them? Cosma Shalizi in a review1 of Spufford’s Red Plenty (which draws on Planning Problems in the USSR: The Contribution of Mathematical Economics to their Solution 1960–1971, ed Ellman 1973), discusses the history of linear programming, which was also developed in Soviet Russia under Kantorovich and used for economic planning. One irony (noted by Shalizi) is that under the same theoretical conditions in which markets could lead to an optimal outcome, so too could a linear optimization algorithm. In practice, of course, the Soviet economy couldn’t possibly be run that way because it would require optimizing over millions or billions of variables, requiring unfathomable amounts of computing power.

Optimization Obtained

As it happens, we now have unfathomable amounts of computing power. What was once a modus tollens is now just a modus ponens.

Corporations, and tech companies in particular as the leading edge, routinely solve planning problems for logistics like fleets of cars or datacenter optimization involving millions of variables; the similar SAT solvers are ubiquitous in computer security research for modeling large computer codebases to verify safety or discover vulnerabilities; most robots couldn’t operate without constantly solving & optimizing enormous systems of equations. The internal planned ‘economies’ of tech companies have grown kudzu-like, sprouting ever larger datasets to predict and automated analyses to plan and to control. The problems solved by retailers like Walmart or Target are world-sized.2 (‘“We are not setting the price. The market is setting the price”, he says. “We have algorithms to determine what that market is.”’) The motto of a Google or Amazon or Uber might be (to paraphrase Freeman Dyson’s paraphrase of John von Neumann in Infinite in All Directions, 1988): “All processes that are stable we shall plan. All processes that are unstable we shall compete in (for now).” Companies may use some limited internal ‘markets’ as useful metaphors for allocation, and dabble in prediction markets, but the internal dynamics of tech companies bear little resemblance to competitive free markets, and show little sign of moving in market-ward directions.
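To give a sense of what ‘solving a planning problem’ means in practice, here is a toy allocation linear program (with made-up machines, jobs, and costs, not any company’s real numbers), of the kind that datacenter or logistics schedulers solve with millions of variables rather than six:

```python
# Toy illustration: allocate 3 job types across 2 machine types to minimize
# electricity cost, subject to demand and capacity constraints. All numbers
# are invented; real planners solve the same kind of LP at vastly larger scale.
import numpy as np
from scipy.optimize import linprog

# decision variables: hours of job j run on machine i, flattened as
# [A-job1, A-job2, A-job3, B-job1, B-job2, B-job3]
cost = np.array([3.0, 2.5, 4.0,   # machine type A running jobs 1-3 ($/hour)
                 2.0, 3.5, 3.0])  # machine type B running jobs 1-3 ($/hour)

# each job type must receive a total number of hours across both machines
demand = [100, 80, 60]
A_eq = np.zeros((3, 6))
for j in range(3):
    A_eq[j, j] = 1      # machine A's hours on job j
    A_eq[j, 3 + j] = 1  # machine B's hours on job j

# each machine type has limited total hours
A_ub = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]])
b_ub = [150, 150]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=demand,
              bounds=(0, None))
print(res.x.reshape(2, 3))  # optimal hours per (machine, job)
print(res.fun)              # total electricity cost of the plan
```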

The march of plan­ning also shows lit­tle sign of stop­ping. Uber is not going to stop using his­tor­i­cal fore­casts of demand to move around dri­vers to meet expected demand and opti­mize trip tra­jec­to­ries; dat­a­cen­ters will not stop using lin­ear solvers to allo­cate run­ning jobs to machines in an opti­mal man­ner to min­i­mize elec­tric­ity con­sump­tion while bal­anc­ing against latency and through­put, in search of a vir­tu­ous cycle cul­mi­nat­ing in the opti­mal route, “the per­pet­ual trip, the trip that never ends”; ‘mar­kets’ like smart­phone walled gar­dens rely ever more each year on algo­rithms pars­ing human reviews & bina­ries & clicks to decide how to rank or push adver­tis­ing and con­duct mul­ti­-armed ban­dit explo­ration of options; and so on end­less­ly.

So, can we run an economy with scaled-up planning approaching 100% centralization, while increasing efficiency and even outcompeting free capitalism-style competitive markets, as Cockshott & Cottrell propose (a proposal occasionally revived in pop socialism like The People’s Republic of Walmart: How the World’s Biggest Corporations are Laying the Foundation for Socialism)?

Systems

Let’s look at some more exam­ples:

  1. cor­po­ra­tions and growth
  2. humans, brains, and cells
  3. meta-learn­ing in AI (par­tic­u­larly RL)

Artificial Persons

The striking thing about corporations improving is that they don’t; corporations don’t evolve (see the Price equation & multilevel selection). The business world would look completely different if they did! Despite large differences in competency between corporations, the best corporations don’t simply ‘clone’ themselves and regularly take over arbitrary industries with their superior skills, only to eventually succumb to their mutant offspring who have become even more efficient.

We can copy the best soft­ware algo­rithms, like Alp­haZe­ro, indef­i­nitely and they will per­form as well as the orig­i­nal, and we can tweak them in var­i­ous ways to make them steadily bet­ter (and this is in fact how many algo­rithms are devel­oped, by con­stant iter­a­tion); species can repro­duce them­selves, steadily evolv­ing to ever bet­ter exploit their nich­es, not to men­tion the power of selec­tive breed­ing pro­grams; indi­vid­ual humans can refine teach­ing meth­ods and trans­mit com­pe­tence (cal­cu­lus used to be reserved for the most skilled math­e­mati­cians, and now is taught to ordi­nary high school stu­dents, and chess grand­mas­ters have become steadily younger with bet­ter & more inten­sive teach­ing meth­ods like chess engi­nes); we could even clone excep­tional indi­vid­u­als to get more sim­i­larly tal­ented indi­vid­u­als, if we really wanted to. But we don’t see this hap­pen with cor­po­ra­tions. Instead, despite des­per­ate strug­gles to main­tain “cor­po­rate cul­ture”, com­pa­nies typ­i­cally coast along, get­ting more and more slug­gish, fail­ing to spin off smaller com­pa­nies as lean & mean as they used to be, until con­di­tions change or ran­dom shocks or degra­da­tion finally do them in, such as per­haps some com­plete­ly-un­re­lated com­pany (some­times founded by a com­plete out­sider like a col­lege stu­dent) eat­ing their lunch.

Why do we not see excep­tional cor­po­ra­tions clone them­selves and take over all mar­ket seg­ments? Why don’t cor­po­ra­tions evolve such that all cor­po­ra­tions or busi­nesses are now the hyper­-­ef­fi­cient descen­dants of a sin­gle ur-­cor­po­ra­tion 50 years ago, all other cor­po­ra­tions hav­ing gone extinct in bank­ruptcy or been acquired? Why is it so hard for cor­po­ra­tions to keep their “cul­ture” intact and retain their youth­ful lean effi­cien­cy, or if avoid­ing ‘aging’ is impos­si­ble, why copy them­selves or oth­er­wise repro­duce to cre­ate new cor­po­ra­tions like them­selves? Instead, suc­cess­ful large cor­po­ra­tions coast on iner­tia or mar­ket fail­ures like reg­u­la­tory capture/monopoly, while suc­cess­ful small ones worry end­lessly about how to pre­serve their ‘cul­ture’ or how to ‘stay hun­gry’ or find a replace­ment for the founder as they grow, and there is con­stant turnover. The large cor­po­ra­tions func­tion just well enough that main­tain­ing their exis­tence is an achieve­ment3.

Evolution & the Price equation require 3 things: entities which can replicate themselves; variation among entities; and selection on entities. Corporations have variation, and they have selection—but they don’t have replication.
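For reference, the standard Price equation makes the role of replication explicit: the change in the average value of a trait $z$ across one generation decomposes into a selection term and a transmission term,

$$\Delta\bar{z} \;=\; \underbrace{\frac{\operatorname{Cov}(w_i, z_i)}{\bar{w}}}_{\text{selection}} \;+\; \underbrace{\frac{\operatorname{E}(w_i\,\Delta z_i)}{\bar{w}}}_{\text{transmission}}$$

where $w_i$ is the number of copies entity $i$ leaves and $\Delta z_i$ is how much the trait changes in transmission. If entities cannot make copies at all, or copying is so noisy that the transmission term swamps the covariance term, selection has nothing to accumulate on.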

Corporations certainly undergo selection for kinds of fitness, and do vary a lot. The problem seems to be that corporations cannot replicate themselves. They can set up new corporations, yes, but that’s not necessarily replicating themselves—they cannot clone themselves the way a bacterium can. When a bacterium clones itself, it has… a clone, which is difficult to distinguish in any way from the ‘original’. In sexual organisms, children still resemble their parents to a great extent. But when a large corporation spins off a division or starts a new one, the result may be nothing like the parent and completely lack any secret sauce. A new acquisition will retain its original character and efficiencies (if any). A corporation satisfies the Peter Principle by eventually growing to its level of incompetence, which is always much smaller than ‘the entire economy’. Corporations are made of people, not interchangeable easily-copied widgets or strands of DNA. There is no ‘corporate DNA’ which can be copied to create a new corporation just like the old. The corporation may not even be able to ‘replicate’ itself over time, leading to scleroticism and aging—but this then leads to underperformance and eventually selection against it, one way or another. So, an average corporation appears little more efficient, particularly if we exclude any gains from new technologies, than an average corporation 50 years ago, and the challenges and failures of the rare multinational corporation 500 years ago, like the Medici bank, look strikingly similar to the challenges and failures of banks today.

We can see a sim­i­lar prob­lem with other large-s­cale human orga­ni­za­tions: ‘cul­tures’. An idea seen some­times is that cul­tures undergo selec­tion & evo­lu­tion, and as such, are made up of adap­tive beliefs/practices/institutions, which no indi­vid­ual under­stands (such as farm­ing prac­tices opti­mally tai­lored to local con­di­tion­s); even appar­ently highly irra­tional & waste­ful tra­di­tional prac­tices may actu­ally be an adap­tive evolved respon­se, which is opti­mal in some sense we as yet do not appre­ci­ate (some­times linked to “Chester­ton’s fence” as an argu­ment for sta­tus quo-is­m).

This is not a ridiculous position, since occasionally certain traditional practices have been vindicated by scientific investigation, but the lens of multilevel selection as defined by the Price equation shows there are serious quantitative issues with this: cultures or groups are rarely driven extinct, with most large-scale ones persisting for millennia; such ‘natural selection’ on the group level is only tenuously linked to the many thousands of distinct practices & beliefs that make up these cultures; and these cultures mutate rapidly as fads and visions and stories and neighboring cultures and new technologies all change over time (compare the consistency of folk magic/medicine over even small geographic regions, or in the same place over several centuries). For most things, ‘traditional culture’ is simply flat-out wrong and harmful, its forms are mutually contradictory, not verified by science, and contain no useful information, and—contrary to “Chesterton’s fence”—the older and harder it is to find a rational basis for a practice, the less likely it is to be helpful:

Chester­ton’s meta-fence: “in our cur­rent sys­tem (de­mo­c­ra­tic mar­ket economies with large gov­ern­ments) the com­mon prac­tice of tak­ing down Chester­ton fences is a process which seems well estab­lished and has a decent track record, and should not be unduly inter­fered with (un­less you fully under­stand it)”.

The existence of many erroneous practices, and the successful diffusion of erroneous ones, is acknowledged by proponents of cultural evolution like Henrich (eg Henrich provides several examples which are comparable to spreading harmful mutations), so the question here is one of emphasis or quantity: is the glass 1% full or 99% empty? It’s worth recalling the conditions for human expertise (Armstrong 2001; Tetlock 2005, Expert Political Judgment: How Good Is It? How Can We Know?; ed Ericsson 2006, The Cambridge Handbook of Expertise and Expert Performance; Kahneman & Klein 2009): repeated practice with quick feedback on objective outcomes in unchanging environments; these conditions are satisfied for relatively few human activities, which are more often rare, with long-delayed feedback, left to quite subjective appraisals mixed in with enormous amounts of randomness & the consequences of many other choices before/after, and subject to potentially rapid change (and the more so the more people are able to learn). In such environments, people are more likely to fail to build expertise, be fooled by randomness, and construct elaborate yet erroneous theoretical edifices of superstition (like Tetlock’s hedgehogs). Evolution is no fairy dust which can overcome these serious inferential problems, which are why reinforcement learning is so hard.4

For agriculture, with regular feedback, results which are enormously important to both individual and group survival, and relatively straightforward mechanistic cause-and-effect relationships, it is not surprising that practices tend to be somewhat optimized (although still far from optimal, as enormously increased yields in the Industrial Revolution demonstrate, in part by avoiding the errors of traditional agriculture)5; but none of that applies to ‘traditional medicine’, dealing as it does with complex self-selection, regression to the mean, and placebo effects, where aside from the simplest cases like setting broken bones (again, straightforward, with a clear cause-and-effect relationship), hardly any of it works6 and one is lucky if a traditional remedy is merely ineffective rather than outright poisonous, and in the hardest cases like snake bites, it would be better to wait for death at home than waste time going to the local witch doctor.

So—just like cor­po­ra­tions—‘selec­tion’ of cul­tures hap­pens rarely with each ‘gen­er­a­tion’ span­ning cen­turies or mil­len­nia, typ­i­cally has lit­tle to do with how real­i­ty-based their beliefs tend to be (for a selec­tion coef­fi­cient approach­ing zero), and if one cul­ture did in fact con­sume another one thanks to more use­ful beliefs about some herb, it is likely to back­slide under the bom­bard­ment of memetic muta­tion (so any selec­tion is spent just purg­ing muta­tions, cre­at­ing a muta­tion-s­e­lec­tion bal­ance); under such con­di­tions, there will be lit­tle long-term ‘evo­lu­tion’ towards higher opti­ma, and the infor­ma­tion con­tent of cul­ture will be min­i­mal and closely con­strained to only the most uni­ver­sal, high­-­fit­ness-im­pact, and memet­i­cal­ly-ro­bust aspects.
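As a toy illustration of that mutation-selection balance (with made-up numbers, not an empirical claim about any particular culture): a practice conferring even a 1% advantage can be maintained if memetic corruption is rare, but is lost entirely once corruption outpaces selection:

```python
# Toy haploid mutation-selection balance: a cultural practice gives relative
# fitness 1+s to groups holding it, but is corrupted ("mutates" away) with
# probability mu per generation. Numbers are illustrative assumptions.
def equilibrium_share(s, mu, generations=10_000, p=0.5):
    for _ in range(generations):
        p = p * (1 + s) / (1 + s * p)   # selection in favor of the practice
        p = p * (1 - mu)                # memetic corruption away from it
    return p

print(equilibrium_share(s=0.01, mu=0.001))  # ~0.9: weak selection maintains it
print(equilibrium_share(s=0.001, mu=0.01))  # ~0.0: corruption outpaces selection
```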

Natural Persons

“Indi­vid­ual organ­isms are best thought of as adap­ta­tion-ex­e­cuters rather than as fit­ness-­max­i­miz­ers. Nat­ural selec­tion can­not directly ‘see’ an indi­vid­ual organ­ism in a spe­cific sit­u­a­tion and cause behav­ior to be adap­tively tai­lored to the func­tional require­ments imposed by that sit­u­a­tion.”

Tooby & Cos­mides 1992, “The Psy­cho­log­i­cal Foun­da­tions of Cul­ture”

“Good ide­ol­o­gy. Wrong species.”

E. O. Wilson, of Marxism

Contrast that with a human: though humans were ultimately designed by evolution, evolution plays no role at ‘runtime’, where more powerful learning algorithms take over.

With these more powerful algorithms designed by the meta-algorithm of evolution, a human is able to live successfully for over 100 years, with tremendous cooperation between the trillions of cells in their body, only rarely breaking down towards the end, with a small handful of seed cancer cells defecting over a lifetime despite even more trillions of cell divisions and replacements. They are also able to be cloned, yielding identical twins so similar across the board that people who know them may be unable to distinguish them. And they don’t need to use evolution or markets to develop these bodies, relying instead on a complex hardwired developmental program controlled by genes which ensures that >99% of humans get the paired eyes, lungs, legs, brain hemispheres etc that they need. Perhaps the most striking efficiency gain from a human is the possession of a brain with the ability to predict the future, learn highly abstract models of the world, and plan and optimize over these plans for objectives which may relate only indirectly to fitness decades from now, or to fitness-related events which happen less than once in a lifetime & are usually unobserved, or to fitness events, like those of descendants, which can never be observed.

RL

Black Box vs White Box Optimization

Let’s put it another way.

Imag­ine try­ing to run a busi­ness in which the only feed­back given is whether you go bank­rupt or not. In run­ning that busi­ness, you make mil­lions or bil­lions of deci­sions, to adopt a par­tic­u­lar mod­el, rent a par­tic­u­lar store, adver­tise this or that, hire one per­son out of scores of appli­cants, assign them this or that task to make many deci­sions of their own (which may in turn require deci­sions to be made by still oth­er­s), and so on, extended over many years. At the end, you turn a healthy prof­it, or go bank­rupt. So you get 1 bit of feed­back, which must be split over bil­lions of deci­sions. When a com­pany goes bank­rupt, what killed it? Hir­ing the wrong accoun­tant? The CEO not invest­ing enough in R&D? Ran­dom geopo­lit­i­cal events? New gov­ern­ment reg­u­la­tions? Putting its HQ in the wrong city? Just a gen­er­al­ized inef­fi­cien­cy? How would you know which deci­sions were good and which were bad? How do you solve the “credit assign­ment prob­lem”?

Ide­al­ly, you would have some way of trac­ing back every change in the finan­cial health of a com­pany back to the orig­i­nal deci­sion & the algo­rithm which made that deci­sion, but of course this is impos­si­ble since there is no way to know who said or did what or even who dis­cussed what with whom when. There would seem to be no gen­eral approach other than the truly brute force one of evo­lu­tion: over many com­pa­nies, have some act one way and some act another way, and on aver­age, good deci­sions will clus­ter in the sur­vivors and not-­so-­good deci­sions will clus­ter in the deceased. ‘Learn­ing’ here works (un­der cer­tain con­di­tion­s—­like suf­fi­ciently reli­able repli­ca­tion—which in prac­tice may not obtain) but is hor­rif­i­cally expen­sive & slow.

In RL, this would correspond to black-box/gradient-free methods, particularly evolutionary methods. For example, Salimans et al 2017 uses an evolutionary method in which thousands of slightly-randomized neural networks play an Atari game simultaneously, and at the end of the games, a new average neural network is defined based on the performance of them all; no attempt is made to figure out which specific changes are good or bad or even to get a reliable estimate—they simply run and the scores are what they are. If we imagine a schematic like ‘models → model parameters → environments → decisions → outcomes’, evolution collapses it to just ‘models → outcomes’; feed a bunch of possible models in, get back outcomes, pick the models with the best outcomes.
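A minimal sketch of this ‘models → outcomes’ loop (a generic evolution-strategies update on a stand-in black-box score function, not the actual Salimans et al 2017 system or an Atari game):

```python
# Minimal evolution-strategies sketch: no credit assignment at all, only
# whole-episode scores. The 'environment' is a stand-in black-box function;
# a real game would slot in the same way (returning only a final score).
import numpy as np

def episode_return(params):           # black box: only a final score comes back
    return -np.sum((params - 3.0) ** 2)

theta = np.zeros(10)                  # the current "model"
sigma, lr, pop = 0.1, 0.02, 200
for generation in range(300):
    noise = np.random.randn(pop, theta.size)
    returns = np.array([episode_return(theta + sigma * e) for e in noise])
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    # new average model, weighted by how well each random perturbation scored
    theta = theta + lr / (pop * sigma) * noise.T @ advantages

print(theta.round(2))                 # drifts toward the optimum at 3.0
```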

A more sam­ple-­ef­fi­cient method would be some­thing like REINFORCE, which Andrej Karpa­thy explains with an ALE Pong agent; what does REINFORCE do to crack the black box open a lit­tle bit? It’s still hor­rific and amaz­ing that it works:

So here is how the train­ing will work in detail. We will ini­tial­ize the pol­icy net­work with some W1, W2 and play 100 games of Pong (we call these pol­icy “roll­outs”). Lets assume that each game is made up of 200 frames so in total we’ve made 20,000 deci­sions for going UP or DOWN and for each one of these we know the para­me­ter gra­di­ent, which tells us how we should change the para­me­ters if we wanted to encour­age that deci­sion in that state in the future. All that remains now is to label every deci­sion we’ve made as good or bad. For exam­ple sup­pose we won 12 games and lost 88. We’ll take all deci­sions we made in the win­ning games and do a pos­i­tive update (fill­ing in a +1.0 in the gra­di­ent for the sam­pled action, doing back­prop, and para­me­ter update encour­ag­ing the actions we picked in all those states). And we’ll take the other deci­sions we made in the los­ing games and do a neg­a­tive update (dis­cour­ag­ing what­ever we did). And… that’s it. The net­work will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that did­n’t work. Now we play another 100 games with our new, slightly improved pol­icy and rinse and repeat.

Pol­icy Gra­di­ents: Run a pol­icy for a while. See what actions led to high rewards. Increase their prob­a­bil­i­ty.

If you think through this process you’ll start to find a few funny prop­er­ties. For exam­ple what if we made a good action in frame 50 (bounc­ing the ball back cor­rect­ly), but then missed the ball in frame 150? If every sin­gle action is now labeled as bad (be­cause we lost), would­n’t that dis­cour­age the cor­rect bounce on frame 50? You’re right—it would. How­ev­er, when you con­sider the process over thousands/millions of games, then doing the first bounce cor­rectly makes you slightly more likely to win down the road, so on aver­age you’ll see more pos­i­tive than neg­a­tive updates for the cor­rect bounce and your pol­icy will end up doing the right thing.

…I did not tune the hyper­pa­ra­me­ters too much and ran the exper­i­ment on my (slow) Mac­book, but after train­ing for 3 nights I ended up with a pol­icy that is slightly bet­ter than the AI play­er. The total num­ber of episodes was approx­i­mately 8,000 so the algo­rithm played roughly 200,000 Pong games (quite a lot isn’t it!) and made a total of ~800 updates.

The difference here from evolution is that credit assignment is able to use backpropagation to reach into the NN and directly adjust each parameter’s contribution to the decision which was ‘good’ or ‘bad’; the difficulty of tracing out the consequences of each decision and labeling it ‘good’ is simply bypassed with the brute-force approach of decreeing that all actions taken in an ultimately-successful game were good, and all of them were bad if the game was ultimately lost. Here we optimize something more like ‘model parameters → decisions → outcomes’; we feed parameters in to get out decisions which are then assumed to cause the outcome, and reverse it to pick the parameters with the best outcomes.
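A minimal sketch of that recipe (a one-parameter toy policy on a made-up environment, not Karpathy’s actual Pong network): every decision in a ‘won’ episode gets credited, every decision in a ‘lost’ one gets blamed, and yet on average the policy still improves:

```python
# Minimal REINFORCE sketch: label every action in a won episode "good" (+1)
# and every action in a lost one "bad" (-1), and nudge the policy accordingly.
# Toy environment (an assumption, not Pong): the correct move is UP iff obs > 0.
import numpy as np

rng = np.random.default_rng(0)
w = 0.0                                        # one-parameter logistic policy

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for episode in range(2000):
    grads, correct = [], 0
    for t in range(20):                        # 20 decisions per episode
        obs = rng.normal()
        p_up = sigmoid(w * obs)
        up = rng.random() < p_up               # sample the action: UP or DOWN
        # gradient of the log-probability of the *sampled* action w.r.t. w
        grads.append(((1 - p_up) if up else -p_up) * obs)
        correct += (up == (obs > 0))
    won = correct > 10                         # only an episode-level outcome
    advantage = 1.0 if won else -1.0           # same label for every decision
    w += 0.01 * advantage * np.sum(grads)

print(w)   # ends up clearly positive: the policy learned "go UP when obs > 0"
```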

This is still crazy, but it works, and bet­ter than sim­ple-­minded evo­lu­tion: Sal­i­mans et al 2017 com­pares their evo­lu­tion method to more stan­dard meth­ods which are fancier ver­sions of the REINFORCE pol­icy gra­di­ent approach, and this bru­tally lim­ited use of back­prop­a­ga­tion for credit assign­ment still cuts the sam­ple size by 3–10x, and more on more dif­fi­cult prob­lems.

Can we do bet­ter? Of course. It is absurd to claim that all actions in a game deter­mine the final out­come, since the envi­ron­ment itself is sto­chas­tic and many deci­sions are either irrel­e­vant or were the oppo­site in true qual­ity of what­ever the out­come was. To do bet­ter, we can con­nect the deci­sions to the envi­ron­ment by mod­el­ing the envi­ron­ment itself as a white box which can be cracked open & ana­lyzed, using a mod­el-based RL approach like the well-­known PILCO.

In PILCO, a model of the environment is learned by a powerful model (non-neural-network Gaussian processes, in this case), and the model is used to do planning: start with a series of possible actions, run them through the model to predict what would happen, and directly optimize the actions to maximize the reward. The influence of the parameters of the model causing the chosen actions, which then partially cause the environment, which then partially causes the reward, can all be traced from the final reward back to the original parameters. (It’s white boxes all the way down.) Here the full ‘models → model parameters → environments → decisions → outcomes’ pipeline is expressed and the credit assignment is performed correctly & as a whole.

The result is state-of-the-art sam­ple effi­cien­cy: in a sim­ple prob­lem like Cart­pole, PILCO can solve it within as lit­tle as 10 episodes, while stan­dard deep rein­force­ment learn­ing approaches like pol­icy gra­di­ents can strug­gle to solve it within 10,000 episodes.
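A bare-bones sketch of the model-based idea (deliberately not PILCO itself—no Gaussian processes or analytic gradients—just a linear least-squares dynamics model and random-shooting planning on a made-up 1-D task):

```python
# Bare-bones model-based RL sketch: learn a dynamics model from a handful of
# random-policy transitions, then plan whole action sequences through it.
import numpy as np

rng = np.random.default_rng(1)

def true_dynamics(s, a):                  # unknown to the agent: a 1-D "cart"
    pos, vel = s
    return np.array([pos + 0.1 * vel, vel + 0.1 * (a - 0.05 * vel)])

# 1. Collect a small batch of random-policy experience.
X, Y = [], []
s = np.zeros(2)
for _ in range(50):
    a = rng.uniform(-1, 1)
    s2 = true_dynamics(s, a)
    X.append([*s, a]); Y.append(s2)
    s = s2 if abs(s2[0]) < 5 else np.zeros(2)

# 2. Fit a (linear) model of the environment by least squares.
W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)

# 3. Plan: score whole action sequences by rolling them out in the *model*,
#    keeping the sequence predicted to end nearest the target position 0.5.
def predicted_cost(actions, s0=np.zeros(2)):
    s = s0
    for a in actions:
        s = np.array([*s, a]) @ W
    return (s[0] - 0.5) ** 2

candidates = rng.uniform(-1, 1, size=(1000, 20))
best = min(candidates, key=predicted_cost)
print(predicted_cost(best))   # small: the best plan ends very near the target
```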

The prob­lem, of course, with mod­el-based RL such as PILCO is that what they gain in cor­rect­ness & sam­ple-­ef­fi­cien­cy, they give back in com­pu­ta­tional require­ments: I can’t com­pare PILCO’s sam­ple-­ef­fi­ciency with Sal­i­mans et al 2017’s ALE sam­ple-­ef­fi­ciency or even Karpa­thy’s Pong sam­ple-­ef­fi­ciency because PILCO sim­ply can’t be run on prob­lems all that much more com­plex than Cart­pole.

So we have a painful dilem­ma: sam­ple-­ef­fi­ciency can be many orders of mag­ni­tude greater than pos­si­ble with evo­lu­tion, if only one could do more pre­cise fine-­grained credit assign­men­t—in­stead of judg­ing bil­lions of deci­sions based solely on a sin­gle dis­tant noisy binary out­come, the algo­rithm gen­er­at­ing each deci­sion can be traced through all of its ram­i­fi­ca­tions through all sub­se­quent deci­sions & out­comes to a final reward—but these bet­ter meth­ods are not directly applic­a­ble. What to do?

Going Meta

“…the spac­ing that has made for the most suc­cess­ful induc­tions will have tended to pre­dom­i­nate through nat­ural selec­tion. Crea­tures invet­er­ately wrong in their induc­tions have a pathetic but praise­wor­thy ten­dency to die before repro­duc­ing their kind….In induc­tion noth­ing suc­ceeds like suc­cess.”

W. V. O. Quine, “Natural Kinds”, 1969

Speaking of evolutionary algorithms & sample-efficiency, an interesting area of AI and reinforcement learning is “meta-learning”, usually described as “learning to learn”. This rewrites a given learning task as a two-level problem, where one seeks a meta-algorithm for a family of problems which then adapts at runtime to the specific problem at hand. (In evolutionary terms, this could be seen as related to the Baldwin effect.) There are many paradigms in meta-learning using various kinds of learning & optimizers; for a listing of several recent ones, see the table reproduced in an appendix.

For example, one could train an RNN on a ‘left or right’ T-maze task where the direction with the reward switches at random every once in a while: the RNN has a memory, its hidden state, so after trying the left arm a few times and observing no reward, it can encode “the reward has switched to the right”, and then decide to go right every time while continuing to encode how many failures it’s had after the switch; when the reward then switches back to the left, after a few failures on the right, the learned rule will fire and it’ll switch back to the left. Without this sequential learning, if it were just trained on a bunch of samples, where half the ‘lefts’ have a reward and half the ‘rights’ also have a reward (because of the constant switching), it’d learn a bad strategy like picking a random choice 50-50, or always going left/right. Another approach is ‘fast weights’, where a starting meta-NN observes a few datapoints from a new problem, and then emits the adjusted parameters for a new NN, specialized to the problem, which is then run and receives a reward, so the meta-NN can learn to emit adjusted parameters which will achieve high reward on all problems. A version of this might be the MAML meta-learning algorithms, where a meta-NN is learned which is carefully balanced between possible NNs so that a few finetuning steps of gradient descent training within a new problem ‘specializes’ it to that problem (one might think of the meta-NN as being a point in the high-dimensional model space which is roughly equidistant from a large number of NNs trained on each individual problem, where tweaking a few parameters controls overall behavior and only those need to be learned from the initial experiences). In general, meta-learning enables learning of the superior Bayes-optimal agent within environments by inefficient (possibly not even Bayesian) training across environments. As Duff 2002 puts it, “One way of thinking about the computational procedures that I later propose is that they perform an offline computation of an online, adaptive machine. One may regard the process of approximating an optimal policy for the Markov decision process defined over hyper-states as ‘compiling’ an optimal learning strategy, which can then be ‘loaded’ into an agent.”
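A toy sketch of the meta-learning loop (a first-order Reptile-style simplification of the MAML idea described above, on a made-up one-parameter regression task family):

```python
# Tiny first-order meta-learning sketch (Reptile-style): meta-learn an
# initialization w0 such that a few inner gradient steps specialize it well
# to any task from the family. Task family (an illustrative assumption):
# fit y = slope * x, where each task's slope is drawn from N(2, 0.5).
import numpy as np

rng = np.random.default_rng(0)
w0 = 0.0                                     # meta-parameter: the initialization

def inner_adapt(w, slope, steps=5, lr=0.1):
    for _ in range(steps):                   # a few SGD steps on this one task
        x = rng.normal(size=8)
        grad = np.mean(2 * (w * x - slope * x) * x)
        w -= lr * grad
    return w

for meta_step in range(500):                 # outer loop across tasks
    slope = rng.normal(2.0, 0.5)             # sample a new task
    w_task = inner_adapt(w0, slope)
    w0 += 0.05 * (w_task - w0)               # move the initialization toward
                                             # the adapted weights (Reptile update)
print(w0)  # ends up near 2.0, the centre of the task family
```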

An interesting example of this approach is the DeepMind paper by Jaderberg et al, which presents a Quake team FPS agent trained using a two-level approach (and which extends it further with multiple populations; for background, see Sutton & Barto 2018), an approach which was valuable for their AlphaStar StarCraft II agent publicized in January 2019. The FPS game is a multiplayer capture-the-flag match where teams compete on a map, rather than the agent controlling a single agent in a death-match setting; learning to coordinate, as well as explicitly communicate, with multiple copies of oneself is tricky, and normal training methods don’t work well because updates change all the other copies of oneself as well and destabilize any communication protocols which have been learned. What Jaderberg et al do is use normal deep RL techniques within each agent, predicting and receiving rewards within each game based on earning points for flags/attacks, but then the overall population of 30 agents, after each set of matches, undergoes a second level of selection based on final game score/victory, which then selects on the agents’ internal reward predictions & hyperparameters:

This can be seen as a two-tier reinforcement learning problem. The inner optimisation maximises J_inner, the agents’ expected future discounted internal rewards. The outer optimisation of J_outer can be viewed as a meta-game, in which the meta-reward of winning the match is maximised with respect to internal reward schemes w_p and hyperparameters φ_p, with the inner optimisation providing the meta transition dynamics. We solve the inner optimisation with RL as previously described, and the outer optimisation with population-based training (PBT). PBT is an online evolutionary process which adapts internal rewards and hyperparameters and performs model selection by replacing under-performing agents with mutated versions of better agents. This joint optimisation of the agent policy using RL together with the optimisation of the RL procedure itself towards a high-level goal proves to be an effective and generally applicable strategy, and utilises the potential of combining learning and evolution (2) in large scale learning systems.

The goal is to win, the ground-truth reward is the win/loss, but learning only from win/loss is extremely slow: a single bit (probably less) of information must be split over all actions taken by all agents in the game and used to train NNs with millions of interdependent parameters, in a particularly inefficient way, as one cannot compute exact gradients from the win/loss back to the responsible neurons. Within-game points are a much richer form of supervision, more numerous and corresponding to short time segments, allowing for much more learning within each game (possibly using exact gradients), but they are only indirectly related to the final win/loss; an agent could rack up many points on its own while neglecting to fight the enemy or coordinate well, ensuring a final defeat, or it could learn a greedy team strategy which performs well initially but loses over the long run. So the two-tier problem uses the slow ‘outer’ signal or loss function (winning) to sculpt the faster inner loss which does the bulk of the learning. (“Organisms are adaptation-executers, not fitness-maximizers.”) Should the fast inner algorithms not be learning something useful, or go haywire, or fall for a trap, the outer rewards will eventually recover from the mistake by mutating or abandoning them in favor of more successful lineages. This combines the crude, slow, dogged optimization of evolution with the much faster, more clever, but potentially misguided gradient-based optimization, to produce something which will reach the right goal faster.
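A toy sketch of this two-tier structure (an illustration of the inner-proxy/outer-ground-truth idea, not the actual Jaderberg et al system): each agent quickly optimizes its own internal proxy objective, while a slow outer evolutionary loop keeps only those internal objectives which lead to ground-truth success:

```python
# Toy two-tier sketch: fast inner gradient-style learning on each agent's
# *internal* proxy loss, slow outer PBT-style selection on the ground truth.
import numpy as np

rng = np.random.default_rng(0)

def ground_truth(theta):              # "winning the match": only the outer loop sees it
    return -(theta - 5.0) ** 2

pop_size = 20
targets = rng.uniform(-10, 10, pop_size)   # each agent's internal reward scheme
thetas = np.zeros(pop_size)                # each agent's policy parameter

for generation in range(60):
    # inner loop: fast learning on each agent's own proxy loss (theta - target)^2
    for _ in range(25):
        thetas -= 0.2 * 2 * (thetas - targets)
    # outer loop: tournament on ground truth; winners' reward schemes, plus a
    # mutation, replace the losers' schemes
    scores = ground_truth(thetas)
    order = np.argsort(scores)
    n = pop_size // 4
    losers, winners = order[:n], order[-n:]
    targets[losers] = (targets[rng.choice(winners, size=n)]
                       + rng.normal(0, 0.3, size=n))

print(targets.mean().round(2))  # internal objectives have been sculpted toward 5
```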

Two-Level Meta-Learning

Cosma Shalizi, elsewhere, enjoys noting formal identities between natural selection and Bayesian statistics and markets, where the population frequency of an allele corresponds to a parameter’s prior probability or a trader’s starting wealth, and fitness differentials/profits correspond to updates based on new evidence. (See also Evstigneev et al 2008; Lensberg & Schenk-Hoppé 2006.) While a parameter may start with an erroneously low prior, at some point the updates will make the posterior converge on it. (The relationship between populations of individuals with noisy fixed beliefs and Thompson sampling is also interesting: can we see the apparently-inefficient stream of startups trying ‘failed’ ideas—and occasionally winding up winning big—as a kind of collective Thompson sampling, & more efficient than it seems?) And SGD can be seen as secretly an approximation or form of Bayesian updating by estimating its gradients (because everything that works works because it’s Bayesian?), and of course evolutionary methods can be seen as calculating approximations to gradients…

Analogies between different optimization/inference models:

| Model | Parameter | Prior | Update |
|---|---|---|---|
| Evolution | Allele | Population Frequency | Fitness Differential |
| Market | Trader | Starting Wealth | Profit |
| Particle Filtering | Particles | Population Frequency | Accept-Reject Sample |
| SGD | Parameter | Random Initialization | Gradient Step |
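The analogy in the table can be checked numerically: one generation of discrete replicator dynamics (frequencies reweighted by fitness) is literally the same arithmetic as a Bayesian update (priors reweighted by likelihood):

```python
# One generation of replicator dynamics == one Bayesian update.
import numpy as np

freq_or_prior = np.array([0.5, 0.3, 0.2])       # allele frequencies / priors
fitness_or_likelihood = np.array([1.0, 2.0, 4.0])

posterior = freq_or_prior * fitness_or_likelihood
posterior /= posterior.sum()    # normalize by mean fitness / total evidence
print(posterior)                # [0.263, 0.316, 0.421] under either reading
```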

This pat­tern sur­faces in our other exam­ples too. This two-level learn­ing is anal­o­gous to meta-learn­ing: the outer or meta-al­go­rithm learns how to gen­er­ate an inner or objec­t-level algo­rithm which can learn most effec­tive­ly, bet­ter than the meta-al­go­rithm. Inner algo­rithms them­selves can learn bet­ter algo­rithms, and so on, gain­ing pow­er, com­pute-­ef­fi­cien­cy, or sam­ple-­ef­fi­cien­cy, with every level of spe­cial­iza­tion. (“It’s opti­miz­ers all the way up, young man!”) It’s also anal­o­gous to cells in a human body: over­all repro­duc­tive fit­ness is a slow sig­nal that occurs only a few times in a life­time at most, but over many gen­er­a­tions, it builds up fast-re­act­ing devel­op­men­tal and home­o­sta­tic processes which can build an effi­cient and capa­ble body and respond to envi­ron­men­tal fluc­tu­a­tions within min­utes rather than mil­len­nia, and the brain is still supe­rior with split-sec­ond sit­u­a­tions. It’s also anal­o­gous to cor­po­ra­tions in a mar­ket: the cor­po­ra­tion can use what­ever inter­nal algo­rithms it pleas­es, such as lin­ear opti­miza­tion or neural net­works, and eval­u­ate them inter­nally using inter­nal met­rics like “num­ber of daily users”; but even­tu­al­ly, this must result in prof­its…

The cen­tral prob­lem a cor­po­ra­tion solves is how to moti­vate, orga­nize, pun­ish & reward its sub­-u­nits and con­stituent humans in the absence of direct end-­to-end losses with­out the use of slow exter­nal mar­ket mech­a­nisms. This is done by tap­ping into social mech­a­nisms like peer esteem (sol­diers don’t fight for their coun­try, they fight for their bud­dies), select­ing work­ers who are intrin­si­cally moti­vated to work use­fully rather than par­a­sit­i­cal­ly, con­stant attempts to instill a “com­pany cul­ture” with slo­ga­neer­ing or hand­books or com­pany songs, use of mul­ti­ple proxy mea­sures for rewards to reduce Good­hart-style reward hack­ing, ad hoc mech­a­nisms like stock options to try to inter­nal­ize within work­ers the mar­ket loss­es, replac­ing work­ers with out­sourc­ing or automa­tion, acquir­ing smaller com­pa­nies which have not yet decayed inter­nally or as a selec­tion mech­a­nism (“acqui­hires”), employ­ing intel­lec­tual prop­erty or reg­u­la­tion… All of these tech­niques together can align the parts into some­thing use­ful to even­tu­ally sell…

Man Proposes, God Disposes

…Or else the com­pany will even­tu­ally go bank­rupt:

Great is Bank­ruptcy: the great bot­tom­less gulf into which all False­hoods, pub­lic and pri­vate, do sink, dis­ap­pear­ing; whith­er, from the first ori­gin of them, they were all doomed. For Nature is true and not a lie. No lie you can speak or act but it will come, after longer or shorter cir­cu­la­tion, like a Bill drawn on Nature’s Real­i­ty, and be pre­sented there for pay­men­t,—with the answer, No effects. Pity only that it often had so long a cir­cu­la­tion: that the orig­i­nal forger were so sel­dom he who bore the final smart of it! Lies, and the bur­den of evil they bring, are passed on; shifted from back to back, and from rank to rank; and so land ulti­mately on the dumb low­est rank, who with spade and mat­tock, with sore heart and empty wal­let, daily come in con­tact with real­i­ty, and can pass the cheat no fur­ther.

…But with a For­tu­na­tus’ Purse in his pock­et, through what length of time might not almost any False­hood last! Your Soci­ety, your House­hold, prac­ti­cal or spir­i­tual Arrange­ment, is untrue, unjust, offen­sive to the eye of God and man. Nev­er­the­less its hearth is warm, its larder well replen­ished: the innu­mer­able Swiss of Heav­en, with a kind of Nat­ural loy­al­ty, gather round it; will prove, by pam­phle­teer­ing, mus­ke­teer­ing, that it is a truth; or if not an unmixed (un­earth­ly, impos­si­ble) Truth, then bet­ter, a whole­somely attem­pered one, (as wind is to the shorn lam­b), and works well. Changed out­look, how­ev­er, when purse and larder grow emp­ty! Was your Arrange­ment so true, so accor­dant to Nature’s ways, then how, in the name of won­der, has Nature, with her infi­nite boun­ty, come to leave it fam­ish­ing there? To all men, to all women and all chil­dren, it is now indu­bitable that your Arrange­ment was false. Hon­our to Bank­rupt­cy; ever right­eous on the great scale, though in detail it is so cru­el! Under all False­hoods it works, unwea­riedly min­ing. No False­hood, did it rise heav­en-high and cover the world, but Bank­rupt­cy, one day, will sweep it down, and make us free of it.7

A large corporation like Sears may take decades to die (“There is a great deal of ruin in a nation”, Adam Smith observed), but die it does. Corporations do not increase in performance rapidly and consistently the way selective breeding or AI algorithms do, because they cannot replicate themselves as exactly as digital neural networks or biological cells can; but, nevertheless, they are still part of a two-tier process where a ground-truth uncheatable outer loss constrains the internal dynamics to some degree and maintains a baseline or perhaps modest improvement over time. The plan is “checked”, as Trotsky puts it in criticizing Stalin’s policies like abandoning the New Economic Policy, by supply and demand:

If a uni­ver­sal mind exist­ed, of the kind that pro­jected itself into the sci­en­tific fancy of Laplace—a mind that could reg­is­ter simul­ta­ne­ously all the processes of nature and soci­ety, that could mea­sure the dynam­ics of their motion, that could fore­cast the results of their inter-re­ac­tion­s—­such a mind, of course, could a pri­ori draw up a fault­less and exhaus­tive eco­nomic plan, begin­ning with the num­ber of acres of wheat down to the last but­ton for a vest. The bureau­cracy often imag­ines that just such a mind is at its dis­pos­al; that is why it so eas­ily frees itself from the con­trol of the mar­ket and of Soviet democ­ra­cy. But, in real­i­ty, the bureau­cracy errs fright­fully in its esti­mate of its spir­i­tual resources.

…The innu­mer­able liv­ing par­tic­i­pants in the econ­o­my, state and pri­vate, col­lec­tive and indi­vid­u­al, must serve notice of their needs and of their rel­a­tive strength not only through the sta­tis­ti­cal deter­mi­na­tions of plan com­mis­sions but by the direct pres­sure of sup­ply and demand. The plan is checked and, to a con­sid­er­able degree, real­ized through the mar­ket.

“Pain Is the Only School-Teacher”

Pain is a curi­ous thing. Why do we have painful pain instead of just a more neu­tral pain­less pain, when it can back­fire so eas­ily as chronic pain, among other prob­lems? Why do we have pain at all instead of reg­u­lar learn­ing processes or expe­ri­enc­ing rewards as we fol­low plans?

Can we understand pain as another two-level learning process, where a slow but ground-truth outer loss constrains a fast but unreliable inner loss? I would suggest that pain itself is not an outer loss, but that the painfulness of pain, its intrusive motivational aspects, is what makes it an outer loss. There is no logical necessity for pain to be painful, but a painless pain would not be adaptive or practical, because it would too easily let the inner loss lead to damaging behavior.

Taxonomy of Pain

So let’s con­sider the pos­si­bil­i­ties when it comes to pain. There isn’t just “pain”. There is (at the least):

  • use­less painful pain (chronic pain, exer­cise)

  • use­ful painful pain (the nor­mal sort)

  • useless nonpainful nonpain (dead nerves in diabetes or leprosy8 or congenital pain insensitivity91011121314; bedsores are an everyday version demonstrating that even the most harmless activities like ‘lying on a bed’ are in fact constantly causing damage)

  • use­ful non­painful non­pain (adren­a­line rushes dur­ing com­bat)

  • useless nonpainful pain (pain asymbolia, where sufferers maim & kill themselves);

  • and intermediate cases: like the Marsili family, who have a genetic mutation (Habib et al 2018) which partially damages pain perception. The Marsilis do feel useful painful pain, but only briefly, and incur substantial bodily damage (broken bones, scars) but avoid the most horrific anecdotes of those with deadened nerves or pain asymbolia.

    Another interesting case is Jo Cameron, who has a different set of mutations affecting her endocannabinoid system (FAAH & FAAH-OUT): while not as badly off as those with neuropathy, she still exhibits similar symptoms—her father, who may also have been a carrier, died peculiarly; she regularly burns or cuts herself in household chores; she broke her arm roller-skating as a child but didn’t seek treatment; she delayed treatment of a damaged hip and then a hand damaged by arthritis until almost too late15; she took in foster children who stole her savings; etc. (Biologist Matthew Hill describes the most common FAAH mutation as causing “low levels of anxiety, forgetfulness, a happy-go-lucky demeanor”, and “Since the paper was published, Matthew Hill has heard from half a dozen people with pain insensitivity, and he told me that many of them seemed nuts” compared to Jo Cameron.)

  • but—is there ‘use­ful pain­less pain’ or ‘use­less painful non­pain’?

It turns out there is ‘pain­less pain’: peo­ple expe­ri­ence that, and “reac­tive dis­so­ci­a­tion” is the phrase used to describe the effects some­times of anal­gesics like mor­phine when admin­is­tered after pain has begun, and the patient reports, to quote Den­nett 1978 (em­pha­sis in orig­i­nal), that “After receiv­ing the anal­gesic sub­jects com­monly report not that the pain has dis­ap­peared or dimin­ished (as with aspir­in) but that the pain is as intense as ever though they no longer mind it…if it is admin­is­tered before the onset of pain…the sub­jects claim to not feel any pain sub­se­quently (though they are not numb or anes­thetized—they have sen­sa­tion in the rel­e­vant parts of their bod­ies); while if the mor­phine is admin­is­tered after the pain has com­menced, the sub­jects report that the pain con­tin­ues (and con­tin­ues to be pain), though they no longer mind it…Lo­bot­o­mized sub­jects sim­i­larly report feel­ing intense pain but not mind­ing it, and in other ways the man­i­fes­ta­tions of lobot­omy and mor­phine are sim­i­lar enough to lead some researchers to describe the action of mor­phine (and some bar­bi­tu­rates) as ‘reversible phar­ma­co­log­i­cal leu­co­tomy [lo­bot­o­my]’.2316

And we can find examples of what appears to be ‘painful nonpain’: a case-study, Ploner et al 1999, reports a German patient whose somatosensory cortices suffered a lesion from a stroke, leading to an inability to feel heat normally on one side of his body or to feel any spots of heat or pain from heat; despite this, when sufficient heat was applied to a single spot on the arm, the patient became increasingly agitated, describing a “clearly unpleasant” feeling associated with his whole arm, but denying any description of it involving crawling skin sensations or words like “slight pain” or “burning”.

A table might help lay out the pos­si­bil­i­ties:

A taxonomy of possible kinds of ‘pain’, split by organismal consequences, motivational effects, and reported subjective (non)experience:

| Utility | Aversiveness | Qualia presence | Examples |
|---|---|---|---|
| useless | painful | pain | chronic pain; exercise? |
| useful | painful | pain | normal pain/injuries |
| useless | nonpainful | pain | asymbolia |
| useful | nonpainful | pain | reactive dissociation, lobotomies; exercise? |
| useless | painful | nonpain | unconscious processes such as anesthesia awareness; itches or tickles?17 |
| useful | painful | nonpain | cold/heat, as in the somatosensory cortex lesion case-study |
| useless | nonpainful | nonpain | deadened nerves from diseases (diabetes, leprosy), injury, drugs (anesthetics) |
| useful | nonpainful | nonpain | adrenaline rush/accidents/combat |

Pain serves a clear purpose (stopping us from doing things which may cause damage to our bodies), but in an oddly unrelenting way which we cannot disable and which increasingly often backfires on our long-term interests in the form of ‘chronic pain’ and other problems. Why doesn’t pain operate more like a warning, or like hunger or thirst? Those interrupt our minds, but like a computer popup dialog, after due consideration of our plans and knowledge, we can generally dismiss them. Pain is the interruption which doesn’t go away, although it didn’t have to be that way (Morsella 2005):

Theoretically, nervous mechanisms could have evolved to solve the need for this particular kind of interaction otherwise. Apart from automata, which act like humans but have no phenomenal experience, one can imagine a conscious nervous system that operates as humans do but does not suffer any internal strife. In such a system, knowledge guiding skeletomotor action would be isomorphic to, and never at odds with, the nature of the phenomenal state—running across the hot desert sand in order to reach water would actually feel good, because performing the action is deemed adaptive. Why our nervous system does not operate with such harmony is perhaps a question that only evolutionary biology can answer. Certainly one can imagine such integration occurring without anything like phenomenal states, but from the present standpoint, this reflects more one’s powers of imagination than what has occurred in the course of evolutionary history.

Hui Neng’s Flag

In the reinforcement learning context, one could ask: does it make a difference whether one has ‘negative’ or ‘positive’ rewards? Any reward function with both negative and positive rewards could be turned into all-positive rewards simply by adding a large constant. Is that a difference which makes a difference? Or, instead of maximizing positive ‘rewards’, one could speak of minimizing ‘losses’, as one often does in economics or decision theory.18

Tomasik 2014 debates the relationship of rewards to considerations of “suffering” or “pain”, given the duality between costs/losses and rewards:

Per­haps the more urgent form of refine­ment than algo­rithm selec­tion is to replace pun­ish­ment with rewards within a given algo­rithm. RL sys­tems vary in whether they use pos­i­tive, neg­a­tive, or both types of rewards:

  • In cer­tain RL prob­lems, such as maze-­nav­i­ga­tion tasks dis­cussed in Sut­ton and Barto [1998], the rewards are only pos­i­tive (if the agent reaches a goal) or zero (for non-­goal states).
  • Some­times a mix between pos­i­tive and neg­a­tive rewards6 is used. For instance, McCal­lum [1993] put a sim­u­lated mouse in a maze, with a reward of 1 for reach­ing the goal, −1 for hit­ting a wall, and −0.1 for any other action.
  • In other sit­u­a­tions, the rewards are always neg­a­tive or zero. For instance, in the cart-­pole bal­anc­ing sys­tem of Barto et al. [1990], the agent receives reward of 0 until the pole falls over, at which point the reward is −1. In Koppe­jan and White­son [2011]’s neu­roevo­lu­tion­ary RL approach to heli­copter con­trol, the RL agent is pun­ished either a lit­tle bit, with the neg­a­tive sum of squared devi­a­tions of the heli­copter’s posi­tions from its tar­get posi­tions, or a lot if the heli­copter crash­es.

Just as ani­mal-wel­fare con­cerns may moti­vate incor­po­ra­tion of rewards rather than pun­ish­ments in train­ing dogs [Hiby et al., 2004] and horses [War­ren-­Smith and McGreevy, 2007, Innes and McBride, 2008], so too RL-a­gent wel­fare can moti­vate more pos­i­tive forms of train­ing for arti­fi­cial learn­ers. Pearce [2007] envi­sions a future in which agents are dri­ven by ‘gra­di­ents of well-be­ing’ (i.e., pos­i­tive expe­ri­ences that are more or less intense) rather than by the dis­tinc­tion between plea­sure ver­sus pain. How­ev­er, it’s not entirely clear where the moral bound­ary lies between pos­i­tive ver­sus neg­a­tive wel­fare for sim­ple RL sys­tems. We might think that just the sign of the agen­t’s reward value r would dis­tin­guish the cas­es, but the sign alone may not be enough, as the fol­low­ing sec­tion explains.

What’s the bound­ary between pos­i­tive and neg­a­tive wel­fare?

Consider an RL agent with a fixed life of T time steps. At each time t, the agent receives a non-positive reward r_t ≤ 0 as a function of the action a_t that it takes, such as in the pole-balancing example. The agent chooses its action sequence (a_t), t = 1…T, with the goal of maximising the sum of future rewards:

$$\sum_{t=1}^{T} r_t.$$

Now suppose we rewrite the rewards by adding a huge positive constant c to each of them, r′_t = r_t + c, big enough that all of the r′_t are positive. The agent now acts so as to optimise

$$\sum_{t=1}^{T} r'_t \;=\; \sum_{t=1}^{T} (r_t + c) \;=\; cT + \sum_{t=1}^{T} r_t.$$

So the opti­mal action sequence is the same in either case, since addi­tive con­stants don’t mat­ter to the agen­t’s behav­iour.7 But if behav­iour is iden­ti­cal, the only thing that changed was the sign and numer­i­cal mag­ni­tude of the reward num­bers. Yet it seems absurd that the dif­fer­ence between hap­pi­ness and suf­fer­ing would depend on whether the num­bers used by the algo­rithm hap­pened to have neg­a­tive signs in front. After all, in com­puter bina­ry, neg­a­tive num­bers have no minus sign but are just another sequence of 0s and 1s, and at the level of com­puter hard­ware, they look dif­fer­ent still. More­over, if the agent was pre­vi­ously react­ing aver­sively to harm­ful stim­uli, it would con­tinue to do so. As Lenhart K. Schu­bert explains:8 [This quo­ta­tion comes from spring 2014 lec­ture notes (ac­cessed March 2014) for a course called “Machines and Con­scious­ness”.]

If the shift in ori­gin [to make neg­a­tive rewards pos­i­tive] causes no behav­ioural change, then the robot (anal­o­gous­ly, a per­son) would still behave as if suf­fer­ing, yelling for help, etc., when injured or oth­er­wise in trou­ble, so it seems that the pain would not have been ban­ished after all!

So then what dis­tin­guishes plea­sure from pain?

…A more plau­si­ble account is that the dif­fer­ence relates to ‘avoid­ing’ ver­sus ‘seek­ing.’ A neg­a­tive expe­ri­ence is one that the agent tries to get out of and do less of in the future. For instance, injury should be an inher­ently neg­a­tive expe­ri­ence, because if repair­ing injury was reward­ing for an agent, the agent would seek to injure itself so as to do repairs more often. If we tried to reward avoid­ance of injury, the agent would seek dan­ger­ous sit­u­a­tions so that it could enjoy return­ing to safe­ty.10 [This exam­ple comes from Lenhart K. Schu­bert’s spring 2014 lec­ture notes (ac­cessed March 2014), for a course called ‘Machines and Con­scious­ness.’ These thought exper­i­ments are not purely aca­d­e­m­ic. We can see an exam­ple of mal­adap­tive behav­iour result­ing from an asso­ci­a­tion of plea­sure with injury when peo­ple become addicted to the endor­phin release of self­-har­m.]19 Injury needs to be some­thing the agent wants to get as far away from as pos­si­ble. So, for exam­ple, even if vom­it­ing due to food poi­son­ing is the best response you can take given your cur­rent sit­u­a­tion, the expe­ri­ence should be neg­a­tive in order to dis­suade you from eat­ing spoiled foods again. Still, the dis­tinc­tion between avoid­ing and seek­ing isn’t always clear. We expe­ri­ence plea­sure due to seek­ing and con­sum­ing food but also pain that moti­vates us to avoid hunger. Seek­ing one thing is often equiv­a­lent to avoid­ing anoth­er. Like­wise with the pole-bal­anc­ing agent: Is it seek­ing a bal­anced pole, or avoid­ing a pole that falls over?

…Where does all of this leave our pole-bal­anc­ing agent? Does it suf­fer con­stant­ly, or is it enjoy­ing its efforts? Like­wise, is an RL agent that aims to accu­mu­late pos­i­tive rewards hav­ing fun, or is it suf­fer­ing when its reward is sub­op­ti­mal?
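
To make the additive-constant argument from the quoted passage concrete, here is a minimal brute-force check (a toy environment and numbers of my own devising, not from Tomasik's paper): shifting every reward by a constant c shifts the return of every action sequence by the same cT, so the ranking of sequences, and hence the optimal behaviour, is unchanged.

```python
from itertools import product

T = 3                      # fixed horizon
actions = [0, 1]           # two possible actions per step

def reward(a):
    """Toy non-positive reward: action 1 costs nothing, action 0 costs -1."""
    return 0.0 if a == 1 else -1.0

def total(seq, c=0.0):
    """Sum of (shifted) rewards over an action sequence."""
    return sum(reward(a) + c for a in seq)

seqs = list(product(actions, repeat=T))
best_original = max(seqs, key=lambda s: total(s, c=0.0))
best_shifted  = max(seqs, key=lambda s: total(s, c=100.0))  # make all rewards positive

assert best_original == best_shifted   # same optimal behaviour either way
print(best_original, total(best_original), total(best_original, c=100.0))
```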
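
Schubert's point about rewarding the avoidance (or repair) of injury can likewise be exhibited in a two-state toy MDP (again my own illustrative construction, not from the quoted text): if recovery from injury is what earns the reward, value iteration converges on a policy that deliberately injures itself in order to enjoy the repairs, whereas penalizing the injury itself keeps the policy safe.

```python
# Tiny 2-state MDP: states 'safe' and 'injured'.
# From 'safe' the agent may stay safe or injure itself; from 'injured' it can only recover.
GAMMA = 0.9

def value_iteration(reward, n_iter=500):
    V = {'safe': 0.0, 'injured': 0.0}
    for _ in range(n_iter):
        V = {
            'safe': max(reward['stay'] + GAMMA * V['safe'],        # action: stay safe
                        reward['injure'] + GAMMA * V['injured']),  # action: injure self
            'injured': reward['recover'] + GAMMA * V['safe'],      # only action: recover
        }
    # greedy policy in the 'safe' state
    stay = reward['stay'] + GAMMA * V['safe']
    injure = reward['injure'] + GAMMA * V['injured']
    return 'stay safe' if stay >= injure else 'injure self'

# Reward the *recovery* from injury: the agent learns to hurt itself to earn repairs.
print(value_iteration({'stay': 0.0, 'injure': 0.0, 'recover': +1.0}))  # -> 'injure self'
# Penalize the injury itself: the agent stays safe.
print(value_iteration({'stay': 0.0, 'injure': -1.0, 'recover': 0.0}))  # -> 'stay safe'
```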

Pain as Grounding

So with all that for back­ground, what is the pur­pose of pain?

The pur­pose of pain, I would say, is as a ground truth or outer loss. (This is a moti­va­tional the­ory of pain with a more sophis­ti­cated RL/psychiatric ground­ing.)

The pain reward/loss can­not be removed entirely for the rea­sons demon­strated by the diabetics/lepers/congenital insen­si­tives: the unno­ticed injuries and the poor plan­ning are ulti­mately fatal. With­out any pain qualia to make pain feel painful, we will do harm­ful things like run on a bro­ken leg or jump off a roof to impress our friends20, or just move in a not-quite-right fash­ion and a few years later wind up para­plegics. (An intrin­sic curios­ity drive alone would inter­act badly with a total absence of painful pain: after all, what is more novel or harder to pre­dict than the strange and unique states which can be reached by self­-in­jury or reck­less­ness?)

If pain couldn’t be removed, could pain be turned into a reward, then? Could we be the equivalent of Morsella’s mind that doesn’t experience pain, as it infers plans and then executes them, experiencing only greater or lesser rewards? It would experience only positive rewards (pleasure) as it runs across burning-hot sands, as this is the optimal action for it to be taking according to whatever grand plan it has thought of.

Per­haps we could… but what stops Morsel­la’s mind from enjoy­ing rewards by lit­er­ally run­ning in cir­cles on those sands until it dies or is crip­pled? Morsel­la’s mind may make a plan and define a reward func­tion which avoids the need for any pain or neg­a­tive rewards, but what hap­pens if there is any flaw in the com­puted plan or the reward esti­mates? Or if the plan is based on mis­taken premis­es? What if the sands are hot­ter than expect­ed, or if the dis­tance is much fur­ther than expect­ed, or if the final goal (per­haps an oasis of water) is not there? Such a mind raises seri­ous ques­tions about learn­ing and deal­ing with errors: what does such a mind expe­ri­ence when a plan fails? Does it expe­ri­ence noth­ing? Does it expe­ri­ence a kind of “meta-­pain”?

Con­sider what Brand (The Gift of Pain again, pg191–197) describes as the ulti­mate cause of the fail­ure of years of research into cre­at­ing ‘pain pros­thet­ics’, com­put­er­ized gloves & socks that would mea­sure heat & pres­sure in real-­time in order to warn those with­out pain like lep­ers or dia­bet­ics: the patients would just ignore the warn­ings, because stop­ping to pre­vent future prob­lems was incon­ve­nient while con­tin­u­ing paid off now. And when elec­tri­cal shock­ers were added to the sys­tem to stop them from doing a dan­ger­ous thing, Brand observed patients sim­ply dis­abling it to do the dan­ger­ous thing & re-en­abling it after­wards!

What pain provides is a constant, ongoing feedback which anchors all the estimates of future rewards based on planning or bootstrapping. It anchors our intelligence in a concrete estimation of bodily integrity: the intactness of skin, the health of skin cells, the lack of damage to muscles, joints sliding and moving as they ought to, and so on. If we are planning well and acting efficiently in the world, we will, in the long run, on average, experience higher levels of bodily integrity and physical health; if we are learning and choosing and planning poorly, then… we won’t. The badness will gradually catch up with us and we may find ourselves blind scarred paraplegics missing fingers and soon to die. A pain that was not painful would not serve this purpose, as it would merely be another kind of “tickling” sensation. (Some might find it interesting or enjoyable, or it could accidentally become sexually-linked.) The perceptions in question are simply more ordinary tactile, kinesthetic, or other standard categories of perception; without painful pain, a fire burning your hand simply feels warm (before the thermal-perceptive nerves are destroyed and nothing further is felt), and a knife cutting flesh might feel like a rippling stretching rubbing movement.

We might say that a painful pain is a pain which forcibly inserts itself into the planning/optimization process, as a cost or lack of reward to be opti­mized. A pain which was not moti­vat­ing is not what we mean by ‘pain’ at all.21 The moti­va­tion itself is the qualia of pain, much like an itch is an ordi­nary sen­sa­tion cou­pled with a moti­va­tional urge to scratch. Any men­tal qual­ity or emo­tion or sen­sa­tion which is not accom­pa­nied by a demand­ing­ness, an invol­un­tary tak­ing-in­to-­con­sid­er­a­tion, is not pain. The rest of our mind can force its way through pain, if it is suf­fi­ciently con­vinced that there is enough rea­son to incur the costs of pain because the long-term reward is so great, and we do this all the time: we can con­vince our­selves to go to the gym, or with­stand the vac­ci­na­tion needle, or, in the utmost extrem­i­ty, saw off a trapped hand to save our life. And if we are mis­tak­en, and the pre­dicted rewards do not arrive, even­tu­ally the noisy con­stant feed­back of pain will over­ride the deci­sions lead­ing to pain, and what­ever incor­rect beliefs or mod­els led to the incor­rect deci­sions will be adjusted to do bet­ter in the future.

But the pain can­not and must not be over­rid­den: human organ­isms can’t be trusted to sim­ply ‘turn off’ pain and indulge an idle curios­ity about cut­ting off hands. We are insuf­fi­ciently intel­li­gent, our pri­ors insuf­fi­ciently strong, our rea­son­ing and plan­ning too poor, and we must do too much learn­ing within each life to do with­out pain.

A sim­i­lar argu­ment might apply to the puz­zle of ‘willpower’, ‘pro­cras­ti­na­tion’. Why do we have such prob­lems, par­tic­u­larly in a mod­ern con­text, doing aught we know we should and doing naught we ought­n’t?

On the grave of the ‘blood glucose’ level theory, Kurzban et al 2013 (see later) erects an opportunity cost theory of willpower. Since objective physical measurements like blood glucose levels fail to mechanically explain poorer brain functionality, similar to the failure of objective physical measurements like lactate levels to explain why people are able to physically exercise only a certain amount (despite being able to exercise far more if properly motivated or if tricked), the reason for willpower running out must be subjective.

The lack of willpower is a heuristic which doesn’t require the brain to explicitly track & prioritize & schedule all possible tasks, by forcing it to regularly halt tasks—“like a timer that says, ‘Okay you’re done now.’” If one could override fatigue at will and do things like cycle for thousands of miles like ultra-endurance cyclist Jure Robič, the physical consequences would be severe (and incidentally, Robič was eventually run over while cycling). The ‘timer’ is implemented, among other things, as a gradual buildup of adenosine, which creates mental fatigue and possibly physical fatigue during exercise (Martin et al 2018), leading to a gradually increasing subjectively perceived ‘cost’ of continuing with a task/staying awake/continuing athletic activities, which resets when one stops/sleeps/rests.

To explain the sugar-related observations, Kurzban et al 2013 suggest that the aversiveness of long focus and cognitive effort is a simple heuristic which creates a baseline cost to focusing for ‘too long’ on any one task, to the potential neglect of other opportunities, with the sugar interventions (such as merely tasting sugar water) which appear to boost willpower actually serving as proximate reward signals (signals, because the actual energetic content is nil, and cognitive effort doesn’t meaningfully burn calories in the first place), which justify to the underlying heuristic that further effort on the same task is worthwhile and the opportunity cost is minimal.
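
One crude way to see the claimed dynamics is a toy formalization (entirely my own, with made-up parameters; it is not Kurzban et al's model): the perceived cost of staying on the current task ratchets upward with time-on-task, proximate reward signals knock it back down, and the agent ‘runs out of willpower’ and switches away once the perceived cost exceeds the expected payoff of continuing.

```python
def run_task(steps, expected_payoff=1.0, cost_per_step=0.05,
             reward_signal_every=None, reward_relief=0.5):
    """Toy opportunity-cost model: returns the step at which the agent quits (or None)."""
    perceived_cost = 0.0
    for t in range(1, steps + 1):
        perceived_cost += cost_per_step                      # fatigue/opportunity-cost builds up
        if reward_signal_every and t % reward_signal_every == 0:
            # e.g. tasting sugar water, or completing a small subtask
            perceived_cost = max(0.0, perceived_cost - reward_relief)
        if perceived_cost > expected_payoff:
            return t                                          # 'willpower runs out': switch tasks
    return None                                               # finished without quitting

print(run_task(100))                          # quits early (around step 21)
print(run_task(100, reward_signal_every=10))  # frequent small reward signals: never quits
```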

Since the human mind is too limited in its planning and monitoring ability, it cannot be allowed to ‘turn off’ opportunity cost warnings and engage in hyperfocus on potentially useless things at the neglect of all other things; procrastination here represents a psychic version of pain. From this perspective, it is not surprising that so many stimulants are adenosinergic or dopaminergic22, or that many anti-procrastination strategies boil down to optimizing for more rewards or more frequent rewards (eg breaking tasks down into many smaller tasks, which can be completed individually & receive smaller but more frequent rewards, or thinking more clearly about whether something is worth doing): all of these would affect the reward perception itself, and reduce the baseline opportunity-cost ‘pain’. This perspective may also shed light on burnout and why restorative hobbies are ideally maximally different from jobs, and on more miscellaneous observations like the lower rate of ‘hobbies’ outside the West: burnout may be a long-term homeostatic reaction to spending ‘too much’ time too frequently on a difficult not-immediately-rewarding task despite earlier attempts to pursue other opportunities, which were always overridden, ultimately resulting in a total collapse; a hobby ought to be as different in location and physical activity and social structure as possible (eg a solitary programmer indoors should pursue a social physical activity outdoors) to ensure that it feels completely different for the mind than the regular occupation; and in places with less job specialization or fewer work-hours, the regular flow of a variety of tasks and opportunities means that no such special activity as a ‘hobby’ is necessary.


Perhaps if we were superintelligent AIs who could trivially plan flawless humanoid locomotion at 1000Hz taking into account all possible damages, or if we were emulated brains sculpted by endless evolutionary procedures to execute perfectly adaptive plans by pure instinct, or if we were simple amoebae in a Petri dish who had no real choices to make, there would be no need for a pain which was painful. And likewise, were we endlessly planning and replanning to the end of days, we should never experience akrasia; we should merely do what is necessary (perhaps not even experiencing any qualia of effort or deliberation). But we are not. The pain keeps us honest. In the end, pain is our only teacher.

The Perpetual Peace

“These laws, taken in the largest sense, being Growth with Repro­duc­tion; Inher­i­tance which is almost implied by repro­duc­tion; Vari­abil­ity from the indi­rect and direct action of the exter­nal con­di­tions of life, and from use and dis­use; a Ratio of Increase so high as to lead to a Strug­gle for Life, and as a con­se­quence to Nat­ural Selec­tion, entail­ing Diver­gence of Char­ac­ter and the Extinc­tion of less-im­proved forms. Thus, from the war of nature, from famine and death, the most exalted object which we are capa­ble of con­ceiv­ing, name­ly, the pro­duc­tion of the higher ani­mals, directly fol­lows. There is grandeur in this view of life, with its sev­eral pow­ers, hav­ing been orig­i­nally breathed into a few forms or into one; and that, whilst this planet has gone cycling on accord­ing to the fixed law of grav­i­ty, from so sim­ple a begin­ning end­less forms most beau­ti­ful and most won­der­ful have been, and are being, evolved.”

Charles Dar­win, On the Ori­gin of Species

“In war, there is the free pos­si­bil­ity that not only indi­vid­ual deter­mi­na­cies, but the sum total of the­se, will be destroyed as life, whether for the absolute itself or for the peo­ple. Thus, war pre­serves the eth­i­cal health of peo­ples in their indif­fer­ence to deter­mi­nate things [Bes­timmtheiten]; it pre­vents the lat­ter from hard­en­ing, and the peo­ple from becom­ing habit­u­ated to them, just as the move­ment of the winds pre­serves the seas from that stag­na­tion which a per­ma­nent calm would pro­duce, and which a per­ma­nent (or indeed ‘per­pet­ual’) peace would pro­duce among peo­ples.”

G.W.F. Hegel23

“We must rec­og­nize that war is com­mon, strife is jus­tice, and all things hap­pen accord­ing to strife and neces­si­ty…War is father of all and king of all”

Her­a­cli­tus, B80/B53

“It is not enough to suc­ceed; oth­ers must fail.”

,

What if we remove the outer loss?

In a meta-learning context, it will then either overfit to a single instance of a problem, or learn a potentially arbitrarily suboptimal average response; in the Quake CTF, the inner loss might converge, as mentioned, to every-agent-for-itself or greedy tactical victories guaranteeing strategic losses; in a human, the result would (at present, due to refusal to use artificial selection or genetic engineering) be a gradual buildup of mutation load leading to serious health issues and eventually perhaps a mutational meltdown/error catastrophe; and in an economy, it leads to… the USSR.

The amount of this constraint can vary, based on the greater power of the non-ground-truth optimization and the fidelity of replication and accuracy of selection. The Price equation gives us quantitative insight into the conditions under which such selection could work at all: if a NN could only copy itself in a crude and lossy way, meta-learning would not work well in the first place (properties must be preserved from one generation to the next); if a human cell copied itself with an error rate of as much as 1 in millions, humans could never exist because reproductive fitness is too weak a reward to purge the escalating mutation load (selective gain is negative); if bankruptcy becomes more arbitrary and has less to do with consumer demand than acts of god/government, then corporations will become more pathologically inefficient (covariance between traits & fitness too small to accumulate in meaningful ways).
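
A crude simulation of the replication-fidelity point (assumptions & parameters are mine, purely illustrative): genomes accumulate deleterious copying errors each generation and are truncation-selected on fitness; with a low per-site error rate the outer selection holds the line, while with a high error rate the mutation load escalates faster than selection can purge it and mean fitness collapses.

```python
import random

def simulate(error_rate, pop_size=200, genome_len=100, generations=200):
    """Each genome is a count of deleterious mutations; fitness = fraction of intact sites."""
    pop = [0] * pop_size
    for _ in range(generations):
        # selection: keep the fitter half (fewer mutations), each leaves two offspring
        pop.sort()
        parents = pop[: pop_size // 2]
        # reproduction with per-site copying errors (back-mutation ignored for simplicity)
        pop = [min(genome_len, p + sum(random.random() < error_rate for _ in range(genome_len)))
               for p in parents for _ in (0, 1)]
    return 1 - sum(pop) / (pop_size * genome_len)   # mean fitness of the final population

random.seed(0)
print(simulate(error_rate=0.001))  # faithful copying: selection keeps mean fitness high
print(simulate(error_rate=0.05))   # sloppy copying: mutation load swamps selection, fitness collapses
```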

As Shal­izi con­cludes in his review:

Plan­ning is cer­tainly pos­si­ble within lim­ited domain­s—at least if we can get good data to the plan­ner­s—and those lim­its will expand as com­put­ing power grows. But plan­ning is only pos­si­ble within those domains because mak­ing money gives firms (or fir­m-­like enti­ties) an objec­tive func­tion which is both unam­bigu­ous and blink­ered. Plan­ning for the whole econ­omy would, under the most favor­able pos­si­ble assump­tions, be intractable for the fore­see­able future, and decid­ing on a plan runs into dif­fi­cul­ties we have no idea how to solve. The sort of effi­cient planned econ­omy dreamed of by the char­ac­ters in Red Plenty is some­thing we have no clue of how to bring about, even if we were will­ing to accept dic­ta­tor­ship to do so.

This is why the plan­ning algo­rithms can­not sim­ply keep grow­ing and take over all mar­kets: “who watches the watch­men?” As pow­er­ful as the var­i­ous inter­nal orga­ni­za­tional and plan­ning algo­rithms are, and much supe­rior to evolution/market com­pe­ti­tion, they only opti­mize sur­ro­gate inner loss­es, which are not the end-­goal, and they must be con­strained by a ground-truth loss. The reliance on this loss can and should be reduced, but a reduc­tion to zero is unde­sir­able as long as the inner losses con­verge to any optima dif­fer­ent from the ground-truth opti­ma.
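
As a stylized sketch of this two-level scheme (my own toy construction, not anyone's published model): each ‘firm’ rapidly optimizes a fast inner proxy objective which is only loosely related to the ground truth, while a slow outer loop periodically ‘bankrupts’ the worst performers on the ground-truth loss and replaces them with copies of the survivors. Switch the outer loop off, and the proxy optimizers drift to proxy optima that may be arbitrarily poor on the ground truth.

```python
import random

TRUE_OPT = 3.0
def ground_truth(x):          # the outer, ground-truth objective (higher is better)
    return -(x - TRUE_OPT) ** 2

def run(outer_selection, n_firms=50, rounds=40, inner_steps=20, lr=0.2):
    random.seed(0)
    # each firm: current decision x and a biased inner proxy objective centred on proxy_opt
    firms = [{'x': 0.0, 'proxy_opt': random.uniform(-5, 10)} for _ in range(n_firms)]
    for _ in range(rounds):
        for f in firms:                                   # fast inner optimization of the proxy
            for _ in range(inner_steps):
                f['x'] += lr * (f['proxy_opt'] - f['x'])  # gradient step toward the proxy optimum
        if outer_selection:                               # slow outer loop: ground-truth selection
            firms.sort(key=lambda f: ground_truth(f['x']), reverse=True)
            survivors = firms[: len(firms) // 2]          # worst half goes 'bankrupt'
            firms = [dict(f, proxy_opt=f['proxy_opt'] + random.gauss(0, 0.3))
                     for f in survivors for _ in (0, 1)]  # clone survivors with slight proxy drift
    return sum(ground_truth(f['x']) for f in firms) / len(firms)

print(run(outer_selection=True))    # mean ground-truth performance stays near the optimum
print(run(outer_selection=False))   # proxy optimizers drift wherever their proxies point
```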

Given the often long lifes­pan of a fail­ing cor­po­ra­tion, the dif­fi­culty cor­po­ra­tions encounter in align­ing employ­ees with their goals, and the inabil­ity to repro­duce their ‘cul­ture’, it is no won­der that group selec­tion in mar­kets is fee­ble at best, and the outer loss can­not be removed. On the other hand, these fail­ings are not nec­es­sar­ily per­ma­nent: as cor­po­ra­tions grad­u­ally turn into soft­ware, which can be copied and exist in much more dynamic mar­kets with faster OODA loops, per­haps we can expect a tran­si­tion to an era where cor­po­ra­tions do repli­cate pre­cisely & can then start to con­sis­tently evolve large increases in effi­cien­cy, rapidly exceed­ing all progress to date.

See Also

Appendix

Meta-Learning Paradigms

From : “Table 1. A com­par­i­son of pub­lished meta-learn­ing approach­es.”

Pain prosthetics

Brand & Yancey’s 1993 Pain: The Gift No One Wants, pg191–197, recounts Brand’s research in the 1960s–1970s in attempt­ing to cre­ate ‘arti­fi­cial pain’ or ‘pain pros­thet­ics’, which ulti­mately failed because human per­cep­tion of pain is mar­velously accu­rate & supe­rior to the crude elec­tron­ics of the day, but more fun­da­men­tally because they dis­cov­ered the aver­sive­ness of pain was crit­i­cal to accom­plish­ing the goal of dis­cour­ag­ing repet­i­tive or severe­ly-­dam­ag­ing behav­ior, as the test sub­jects would sim­ply ignore or dis­able the devices to get on with what­ever they were doing.

Excerpts:

My grant appli­ca­tion bore the title “A Prac­ti­cal Sub­sti­tute for Pain.” We pro­posed devel­op­ing an arti­fi­cial pain sys­tem to replace the defec­tive sys­tem in peo­ple who suf­fered from lep­rosy, con­gen­i­tal pain­less­ness, dia­betic neu­ropa­thy, and other nerve dis­or­ders. Our pro­posal stressed the poten­tial eco­nomic ben­e­fits: by invest­ing a mil­lion dol­lars to find a way to alert such patients to the worst dan­gers, the gov­ern­ment might save many mil­lions in clin­i­cal treat­ment, ampu­ta­tions, and reha­bil­i­ta­tion.

The pro­posal caused a stir at the National Insti­tutes of Health in Wash­ing­ton. They had received appli­ca­tions from sci­en­tists who wanted to dimin­ish or abol­ish pain, but never from one who wished to cre­ate pain. Nev­er­the­less, we received fund­ing for the pro­ject.

We planned, in effect, to dupli­cate the human ner­vous sys­tem on a very small scale. We would need a sub­sti­tute “nerve sen­sor” to gen­er­ate sig­nals at the extrem­i­ty, a “nerve axon” or wiring sys­tem to con­vey the warn­ing mes­sage, and a response device to inform the brain of the dan­ger. Excite­ment grew in the Carville research lab­o­ra­to­ry. We were attempt­ing some­thing that, to our knowl­edge, had never been tried.

I sub­con­tracted with the elec­tri­cal engi­neer­ing depart­ment at Louisiana State Uni­ver­sity to develop a minia­ture sen­sor for mea­sur­ing tem­per­a­ture and pres­sure. One of the engi­neers there joked about the poten­tial for prof­it: “If our idea works, we’ll have a pain sys­tem that warns of dan­ger but does­n’t hurt. In other words, we’ll have the good parts of pain with­out the bad! Healthy peo­ple will demand these gad­gets for them­selves in place of their own pain sys­tems. Who would­n’t pre­fer a warn­ing sig­nal through a hear­ing aid over real pain in a fin­ger?”

The LSU engi­neers soon showed us pro­to­type trans­duc­ers, slim metal disks smaller than a shirt but­ton. Suf­fi­cient pres­sure on these trans­duc­ers would alter their elec­tri­cal resis­tance, trig­ger­ing an elec­tri­cal cur­rent. They asked our research team to deter­mine what thresh­olds of pres­sure should be pro­grammed into the minia­ture sen­sors. I replayed my uni­ver­sity days in Tommy Lewis’s pain lab­o­ra­to­ry, with one big dif­fer­ence: now, instead of merely test­ing the in-built prop­er­ties of a well-de­signed human body, I had to think like the design­er. What dan­gers would that body face? How could I quan­tify those dan­gers in a way the sen­sors could mea­sure?

To sim­plify mat­ters, we focused on fin­ger­tips and the soles of feet, the two areas that caused our patients the most prob­lems. But how could we get a mechan­i­cal sen­sor to dis­tin­guish between the accept­able pres­sure of, say, grip­ping a fork and the unac­cept­able pres­sure of grip­ping a piece of bro­ken glass? How could we cal­i­brate the stress level of ordi­nary walk­ing and yet allow for the occa­sional extra stress of step­ping off a curb or jump­ing over a pud­dle? Our pro­ject, which we had begun with such enthu­si­asm, seemed more and more daunt­ing.

I remem­bered from stu­dent days that nerve cells change their per­cep­tion of pain in accor­dance with the body’s needs. We say a fin­ger feels ten­der: thou­sands of nerve cells in the dam­aged tis­sue auto­mat­i­cally lower their thresh­old of pain to dis­cour­age us from using the fin­ger. An infected fin­ger seems as if it is always get­ting bumped—it “sticks out like a sore thumb”—be­cause inflam­ma­tion has made it ten times more sen­si­tive to pain. No mechan­i­cal trans­ducer could be so respon­sive to the needs of liv­ing tis­sue.

Every month the opti­mism level of the researchers went down a notch. Our Carville team, who had made the sig­nif­i­cant find­ings about repet­i­tive stress and con­stant stress, knew that the worst dan­gers came not from abnor­mal stress­es, but from very nor­mal stresses repeated thou­sands of times, as in the act of walk­ing. And Sher­man the pig24 had demon­strated that a con­stant pres­sure as low as one pound per square inch could cause skin dam­age. How could we pos­si­bly pro­gram all these vari­ables into a minia­ture trans­duc­er? We would need a com­puter chip on every sen­sor just to keep track of chang­ing vul­ner­a­bil­ity of tis­sues to dam­age from repet­i­tive stress. We gained a new respect for the human body’s capac­ity to sort through such dif­fi­cult options instan­ta­neous­ly.

After many com­pro­mises we set­tled on base­line pres­sures and tem­per­a­tures to acti­vate the sen­sors, and then designed a glove and a sock to incor­po­rate sev­eral trans­duc­ers. At last we could test our sub­sti­tute pain sys­tem on actual patients. Now we ran into mechan­i­cal prob­lems. The sen­sors, state-of-the-art elec­tronic minia­tures, tended to dete­ri­o­rate from metal fatigue or cor­ro­sion after a few hun­dred uses. Short­-­cir­cuits made them fire off false alarms, which aggra­vated our vol­un­teer patients. Worse, the sen­sors cost about $2,060 each and a lep­rosy patient who took a long walk around the hos­pi­tal grounds could wear out a $9,156 sock!

On aver­age, a set of trans­duc­ers held up to nor­mal wear-and-tear for one or two weeks. We cer­tainly could not afford to let a patient wear one of our expen­sive gloves for a task like rak­ing leaves or pound­ing a ham­mer—the very activ­i­ties we were try­ing to make safe. Before long the patients were wor­ry­ing more about pro­tect­ing our trans­duc­ers, their sup­posed pro­tec­tors, than about pro­tect­ing them­selves.

Even when the trans­duc­ers worked cor­rect­ly, the entire sys­tem was con­tin­gent on the free will of the patients. We had grandly talked of retain­ing “the good parts of pain with­out the bad,” which meant design­ing a warn­ing sys­tem that would not hurt. First we tried a device like a hear­ing aid that would hum when the sen­sors were receiv­ing nor­mal pres­sures, buzz when they were in slight dan­ger, and emit a pierc­ing sound when they per­ceived an actual dan­ger. But when a patient with a dam­aged hand turned a screw­driver too hard, and the loud warn­ing sig­nal went off, he would sim­ply over­ride it—This glove is always send­ing out false sig­nals—and turn the screw­driver any­way. Blink­ing lights failed for the same rea­son.

Patients who per­ceived “pain” only in the abstract could not be per­suaded to trust the arti­fi­cial sen­sors. Or they became bored with the sig­nals and ignored them. The sober­ing real­iza­tion dawned on us that unless we built in a qual­ity of com­pul­sion, our sub­sti­tute sys­tem would never work. Being alerted to the dan­ger was not enough; our patients had to be forced to respond. Pro­fes­sor Tims of LSU said to me, almost in despair, “Paul, it’s no use. We’ll never be able to pro­tect these limbs unless the sig­nal really hurts. Surely there must be some way to hurt your patients enough to make them pay atten­tion.”

We tried every alter­na­tive before resort­ing to pain, and finally con­cluded Tims was right: the stim­u­lus had to be unpleas­ant, just as pain is unpleas­ant. One of Tim­s’s grad­u­ate stu­dents devel­oped a small bat­tery-­op­er­ated coil that, when acti­vat­ed, sent out an elec­tric shock at high volt­age but low cur­rent. It was harm­less but painful, at least when applied to parts of the body that could feel pain.

Lep­rosy bacil­li, favor­ing the cooler parts of the body, usu­ally left warm regions such as the armpit undis­turbed, and so we began tap­ing the elec­tric coil to patients’ armpits for our tests. Some vol­un­teers dropped out of the pro­gram, but a few brave ones stayed on. I noticed, though, that they viewed pain from our arti­fi­cial sen­sors in a dif­fer­ent way than pain from nat­ural sources. They tended to see the elec­tric shocks as pun­ish­ment for break­ing rules, not as mes­sages from an endan­gered body part. They responded with resent­ment, not an instinct of self­-p­reser­va­tion, because our arti­fi­cial sys­tem had no innate link to their sense of self. How could it, when they felt a jolt in the armpit for some­thing hap­pen­ing to the hand?

I learned a fun­da­men­tal dis­tinc­tion: a per­son who never feels pain is task-ori­ent­ed, whereas a per­son who has an intact pain sys­tem is self­-ori­ent­ed. The pain­less per­son may know by a sig­nal that a cer­tain action is harm­ful, but if he really wants to, he does it any­way. The pain-sen­si­tive per­son, no mat­ter how much he wants to do some­thing, will stop for pain, because deep in his psy­che he knows that pre­serv­ing his own self is more sig­nif­i­cant than any­thing he might want to do.

Our project went through many stages, con­sum­ing five years of lab­o­ra­tory research, thou­sands of man-hours, and more than a mil­lion dol­lars of gov­ern­ment funds. In the end we had to aban­don the entire scheme. A warn­ing sys­tem suit­able for just one hand was exor­bi­tantly expen­sive, sub­ject to fre­quent mechan­i­cal break­down, and hope­lessly inad­e­quate to inter­pret the pro­fu­sion of sen­sa­tions that con­sti­tute touch and pain. Most impor­tant, we found no way around the fun­da­men­tal weak­ness in our sys­tem: it remained under the patien­t’s con­trol. If the patient did not want to heed the warn­ings from our sen­sors, he could always find a way to bypass the whole sys­tem.

Look­ing back, I can point to a sin­gle instant when I knew for cer­tain that the sub­sti­tute pain project would not suc­ceed. I was look­ing for a tool in the man­ual arts work­shop when Charles, one of our vol­un­teer patients, came in to replace a gas­ket on a motor­cy­cle engine. He wheeled the bike across the con­crete floor, kicked down the kick­stand, and set to work on the gaso­line engine. I watched him out of the cor­ner of my eye. Charles was one of our most con­sci­en­tious vol­un­teers, and I was eager to see how the arti­fi­cial pain sen­sors on his glove would per­form.

One of the engine bolts had appar­ently rust­ed, and Charles made sev­eral attempts to loosen it with a wrench. It did not give. I saw him put some force behind the wrench, and then stop abrupt­ly, jerk­ing back­ward. The elec­tric coil must have jolted him. (I could never avoid winc­ing when I saw our man-­made pain sys­tem func­tion as it was designed to do.) Charles stud­ied the sit­u­a­tion for a moment, then reached up under his armpit and dis­con­nected a wire. He forced the bolt loose with a big wrench, put his hand in his shirt again, and recon­nected the wire. It was then that I knew we had failed. Any sys­tem that allowed our patients free­dom of choice was doomed.

I never ful­filled my dream of “a prac­ti­cal sub­sti­tute for pain,” but the process did at last set to rest the two ques­tions that had long haunted me. Why must pain be unpleas­ant? Why must pain per­sist? Our sys­tem failed for the pre­cise rea­son that we could not effec­tively repro­duce those two qual­i­ties of pain. The mys­te­ri­ous power of the human brain can force a per­son to STOP!—something I could never accom­plish with my sub­sti­tute sys­tem. And “nat­ural” pain will per­sist as long as dan­ger threat­ens, whether we want it to or not; unlike my sub­sti­tute sys­tem, it can­not be switched off.

As I worked on the sub­sti­tute sys­tem, I some­times thought of my rheuma­toid arthri­tis patients, who yearned for just the sort of on-off switch we were installing. If rheuma­toid patients had a switch or a wire they could dis­con­nect, most would destroy their hands in days or weeks. How for­tu­nate, I thought, that for most of us the pain switch will always remain out of reach.


  1. See also SSC & Chris Said’s reviews.↩︎

  2. Amus­ing­ly, the front of Red Plenty notes a grant from Tar­get to the pub­lisher. ↩︎

  3. More Simon 1991:

    Over a span of years, a large frac­tion of all eco­nomic activ­ity has been gath­ered within the walls of large and steadily grow­ing orga­ni­za­tions. The green areas observed by our Mar­t­ian have grown steadi­ly. Ijiri and I have sug­gested that the growth of orga­ni­za­tions may have only a lit­tle to do with effi­ciency (espe­cially since, in most large-s­cale enter­pris­es, economies and dis­ec­onomies of scale are quite smal­l), but may be pro­duced mainly by sim­ple sto­chas­tic growth mech­a­nisms (Ijiri and Simon, 1977).

    But if par­tic­u­lar coor­di­na­tion mech­a­nisms do not deter­mine exactly where the bound­aries between orga­ni­za­tions and mar­kets will lie, the exis­tence and effec­tive­ness of large orga­ni­za­tions does depend on some ade­quate set of pow­er­ful coor­di­nat­ing mech­a­nisms being avail­able. These means of coor­di­na­tion in orga­ni­za­tions, taken in com­bi­na­tion with the moti­va­tional mech­a­nisms dis­cussed ear­lier, cre­ate pos­si­bil­i­ties for enhanc­ing pro­duc­tiv­ity and effi­ciency through the divi­sion of labor and spe­cial­iza­tion.

    In gen­er­al, as spe­cial­iza­tion of tasks pro­ceeds, the inter­de­pen­dency of the spe­cial­ized parts increas­es. Hence a struc­ture with effec­tive mech­a­nisms for coor­di­na­tion can carry spe­cial­iza­tion fur­ther than a struc­ture lack­ing these mech­a­nisms. It has some­times been argued that spe­cial­iza­tion of work in mod­ern indus­try pro­ceeded quite inde­pen­dently of the rise of the fac­tory sys­tem. This may have been true of the early phases of the indus­trial rev­o­lu­tion, but would be hard to sus­tain in rela­tion to con­tem­po­rary fac­to­ries. With the com­bi­na­tion of author­ity rela­tions, their moti­va­tional foun­da­tions, a reper­tory of coor­di­na­tive mech­a­nisms, and the divi­sion of labor, we arrive at the large hier­ar­chi­cal orga­ni­za­tions that are so char­ac­ter­is­tic of mod­ern life.

    ↩︎
  4. In RL terms, evolution, like evolution strategies, is a kind of Monte Carlo method. Monte Carlo methods require no knowledge or model of the environment, benefit from low bias, can handle even long-term consequences with ease, do not diverge or fail or become biased like approaches using bootstrapping (especially in the case of the “deadly triad”), and are decentralized/embarrassingly parallel. A major downside, of course, is that they accomplish all this by being extremely high-variance/sample-inefficient (eg Salimans et al 2017 is ~10x worse than competing DRL methods).↩︎

  5. And note the irony of the widely-cited corn & manioc examples of how farming encodes subtle wisdom due to group selection: in both cases, the groups that developed it in the Americas were, despite their superior local food processing, highly ‘unfit’ and suffered enormous population declines due to pandemic & conquest! You might object that those were exogenous factors, bad luck, due to things unrelated to their food processing… which is precisely the problem when selecting on groups.↩︎

  6. An example of the failure of traditional medicine is provided by the NCI anti-cancer plant screening program, run by an enthusiast for medical folklore & ethnobotany who specifically targeted plants based on “a massive literature search, including ancient Chinese, Egyptian, Greek, and Roman texts”. The screening program screened “some 12,000 to 13,000 species…over 114,000 extracts were tested for antitumor activity” (rates rising steeply afterwards), which yielded 3 drugs ever (among them paclitaxel/Taxol/PTX), only one of which was all that important (Taxol). So, in a period with few useful anti-cancer drugs to compete against, large-scale screening of all the low-hanging fruit, targeting plants prized by traditional medical practices from throughout history & across the globe, had a success rate somewhere on the order of 0.007%.

    A recent example is the anti-malarial drug artemisinin, which earned its discoverer, Tu Youyou, a 2015 Nobel; she worked in a lab dedicated to traditional herbal medicine (Mao Zedong encouraged the construction of a ‘traditional Chinese medicine’ as a way to reduce medical expenses and conserve foreign currency). She discovered it in 1972, after screening several thousand traditional Chinese remedies. Artemisinin is important, and one might ask what else her lab discovered in the treasure trove of traditional Chinese medicine in the intervening 43 years; the answer, apparently, is ‘nothing’.

    While Taxol and artemisinin may justify plant screening on a pure cost-benefit basis (such a hit rate does not appear much worse than other methods, although one should note that the profit-hungry pharmaceutical industry does not prioritize or invest much in such screening), the more important lesson here is about the accuracy of ‘traditional medicine’. Traditional medicine affords an excellent test case for ‘the wisdom of tradition’: medicine has hard endpoints as it is literally a matter of life and death, is an issue during every individual’s life at the individual level (rather than occasionally at the group level), effects can be extremely large (bordering on ‘silver bullet’ level), and tens of thousands or hundreds of thousands of years have passed for accumulation & selection. Given all of these favorable factors, can the wisdom of tradition still overcome the serious statistical difficulties and cognitive biases leading to false beliefs? Well, the best success stories of traditional medicine have accuracy rates like… <1%. So much for the ‘wisdom of tradition’. The fact that some working drugs happen to also have been mentioned, sometimes, in some traditions, in some ways, along with hundreds of thousands of useless or harmful drugs which look just the same, is hardly any more testimonial to folk medicine as a source of truth than the observation that Heinrich Schliemann discovered a city sort of like Troy justifies treating the Iliad or Odyssey as accurate historical textbooks rather than 99% fictional literature. (Likewise other examples such as Australian Aboriginal myths preserving some traces of ancient geological events: they certainly do not show that the oral histories are reliable histories or that we should just take them as fact.)↩︎

  7. The French Revolution: A History, by Thomas Carlyle.↩︎

  8. Brand also notes of a leprosy patient whose nerves had been deadened by the disease:

    As I watched, this man tucked his crutches under his arm and began to run on both feet with a very lop­sided gait….He ended up near the head of the line, where he stood pant­i­ng, lean­ing on his crutch­es, wear­ing a smile of tri­umph…By run­ning on an already dis­lo­cated ankle, he had put far too much force on the end of his leg bone and the skin had bro­ken under the stress…I knelt beside him and found that small stones and twigs had jammed through the end of the bone into the mar­row cav­i­ty. I had no choice but to ampu­tate the leg below the knee.

    These two scenes have long haunted me.

    ↩︎
  9. An exam­ple quote from Brand & Yancey’s 1993 Pain: The Gift No One Wants about con­gen­i­tal pain insen­si­tiv­i­ty:

    When I unwrapped the last ban­dage, I found grossly infected ulcers on the soles of both feet. Ever so gen­tly I probed the wounds, glanc­ing at Tanya’s face for some reac­tion. She showed none. The probe pushed eas­ily through soft, necrotic tis­sue, and I could even see the white gleam of bare bone. Still no reac­tion from Tanya.

    …her mother told me Tanya’s sto­ry…“A few min­utes later I went into Tanya’s room and found her sit­ting on the floor of the playpen, fin­ger­paint­ing red swirls on the white plas­tic sheet. I did­n’t grasp the sit­u­a­tion at first, but when I got closer I screamed. It was hor­ri­ble. The tip of Tanya’s fin­ger was man­gled and bleed­ing, and it was her own blood she was using to make those designs on the sheets. I yelled, ‘Tanya, what hap­pened!’ She grinned at me, and that’s when I saw the streaks of blood on her teeth. She had bit­ten off the tip of her fin­ger and was play­ing in the blood.”

    …The tod­dler laughed at spank­ings and other phys­i­cal threats, and indeed seemed immune to all pun­ish­ment. To get her way she merely had to lift a fin­ger to her teeth and pre­tend to bite, and her par­ents capit­u­lated at once. The par­ents’ hor­ror turned to despair as wounds mys­te­ri­ously appeared on one of Tanya’s fin­gers after anoth­er…I asked about the foot injuries. “They began as soon as she learned to walk,” the mother replied. “She’d step on a nail or thumb­tack and not bother to pull it out. Now I check her feet at the end of every day, and often I dis­cover a new wound or open sore. If she twists an ankle, she does­n’t limp, and so it twists again and again. An ortho­pe­dic spe­cial­ist told me she’s per­ma­nently dam­aged the joint. If we wrap her feet for pro­tec­tion, some­times in a fit of anger she’ll tear off the ban­dages. Once she ripped open plas­ter cast with her bare fin­gers.”

    …Tanya suf­fered from a rare genetic defect known infor­mally as “con­gen­i­tal indif­fer­ence to pain”…Nerves in her hands and feet trans­mit­ted mes­sages—she felt a kind of tin­gling when she burned her­self or bit a fin­ger—but these car­ried no hint of unpleas­ant­ness…She rather enjoyed the tin­gling sen­sa­tions, espe­cially when they pro­duced such dra­matic reac­tions in oth­er­s…­Tanya, now 11, was liv­ing a pathetic exis­tence in an insti­tu­tion. She had lost both legs to ampu­ta­tion: she had refused to wear proper shoes and that, cou­pled with her fail­ure to limp or shift weight when stand­ing (be­cause she felt no dis­com­fort), had even­tu­ally put intol­er­a­ble pres­sure on her joints. Tanya had also lost most of her fin­gers. Her elbows were con­stantly dis­lo­cat­ed. She suf­fered the effects of chronic sep­sis from ulcers on her hands and ampu­ta­tion stumps. Her tongue was lac­er­ated and badly scarred from her ner­vous habit of chew­ing it.

    ↩︎
  10. One of the first known cases was described in Dear­born 1932, of a man with a remark­able career of injuries as a child rang­ing from being hoisted by a pick­-axe to a hatchet get­ting stuck in his head to shoot­ing him­self in the index fin­ger, cul­mi­nat­ing in a mul­ti­-year career as the “Human Pin­cush­ion”.↩︎

  11. The Challenge of Pain, Melzack & Wall 1996, describes another case (as quoted in Grahek 2001):

    As a child, she had bit­ten off the tip of her tongue while chew­ing food, and has suf­fered third-de­gree burns after kneel­ing on a hot radi­a­tor to look out of the win­dow…Miss C. had severe med­ical prob­lems. She exhib­ited patho­log­i­cal changes in her knees, hip and spine, and under­went sev­eral ortho­pe­dic oper­a­tions. Her sur­geon attrib­uted these changes to the lack of pro­tec­tion to joints usu­ally given by pain sen­sa­tion. She appar­ently failed to shift her weight when stand­ing, to turn over in her sleep, or to avoid cer­tain pos­tures, which nor­mally pre­vent the inflam­ma­tion of joints. All of us quite fre­quently stum­ble, fall or wrench a mus­cle dur­ing ordi­nary activ­i­ty. After these triv­ial injuries, we limp a lit­tle or we pro­tect the joint so that it remains unstressed dur­ing the recov­ery process. This rest­ing of the dam­aged area is an essen­tial part of its recov­ery. But those who feel no pain go on using the joint, adding insult to injury.

    ↩︎
  12. A recent US exam­ple is Min­nesotan Gabby Gin­gras (b. 2001), fea­tured in the 2005 doc­u­men­tary A Life With­out Pain, and occa­sion­ally cov­ered in the media since (eg “Med­ical Mys­tery: A World With­out Pain: A rare genetic dis­or­der leaves one lit­tle girl in con­stant dan­ger”, “Min­nesota girl who can’t feel pain bat­tles insur­ance com­pany”).

    She is legally blind, having damaged her eyes & defeated attempts to save her vision by stitching her eyes shut. She would chew on things, so her baby teeth were surgically removed to avoid her breaking them—but then she broke her adult teeth when they grew in; she can’t use dentures because her gums are so badly destroyed, which required special surgery to graft bone from her hips into her jaw to provide a foundation for teeth. And so on.↩︎

  13. HN user remote_phone:

    My cousin feels pain or dis­com­fort but only a lit­tle. This almost affected her when she gave birth because her water had bro­ken but she did­n’t feel any con­trac­tions at all until it was almost too late. Luck­ily she got to the hos­pi­tal in time and her son was born per­fectly nor­mal but it was a bit har­row­ing.

    More inter­est­ing­ly, her son inher­ited this. He does­n’t feel pain the same way nor­mal peo­ple do. Once her son broke his wrist and had to go to the hos­pi­tal. He was­n’t in pain, but I think they had to pull on the arm to put it back in place prop­erly (is this called trac­tion?). The doc­tor was putting in all his effort to sep­a­rate the wrist from the arm, and the dad almost fainted because it looked so grue­some but all the son looked like was mildly dis­com­forted from the ten­sion. The doc­tor was appar­ently shocked at how lit­tle pain he felt.

    The son also pulled out all his teeth on his own, as they got loose. He said it both­ered him to have loose teeth, but the act of pulling them out did­n’t bother him at all.

    ↩︎
  14. See “The Haz­ards of Grow­ing Up Pain­lessly” for a par­tic­u­larly recent exam­ple.↩︎

  15. A genetics paper has a profile of a pain-insensitive patient (which is particularly eyebrow-raising in light of earlier discussions of joint damage):

    The patient had been diag­nosed with osteoarthri­tis of the hip, which she reported as pain­less, which was not con­sis­tent with the severe degree of joint degen­er­a­tion. At 65 yr of age, she had under­gone a hip replace­ment and was admin­is­tered only parac­eta­mol 2g orally on Post­op­er­a­tive days 1 and 2, report­ing that she was encour­aged to take the parac­eta­mol, but that she did not ask for any anal­gesics. She was also admin­is­tered a sin­gle dose of mor­phine sul­phate 10mg orally on the first post­op­er­a­tive evening that caused severe nau­sea and vom­it­ing for 2 days. After oper­a­tion, her pain inten­sity scores were 0⁄10 through­out except for one score of 1⁄10 on the first post­op­er­a­tive evening. Her past sur­gi­cal his­tory was notable for mul­ti­ple vari­cose vein and den­tal pro­ce­dures for which she has never required anal­ge­sia. She also reported a long his­tory of pain­less injuries (e.g. sutur­ing of a lac­er­a­tion and left wrist frac­ture) for which she did not use anal­gesics. She reported numer­ous burns and cuts with­out pain (Sup­ple­men­tary Fig. S1), often smelling her burn­ing flesh before notic­ing any injury, and that these wounds healed quickly with lit­tle or no resid­ual scar. She reported eat­ing chili pep­pers with­out any dis­com­fort, but a short­-last­ing “pleas­ant glow” in her mouth. She described sweat­ing nor­mally in warm con­di­tions.

    ↩︎
  16. Brand’s Pain: The Gift No One Wants (pg209–211) describes meet­ing an Indian woman whose pain was cured by a lobot­omy (de­signed to sever as lit­tle of the pre­frontal cor­tex as pos­si­ble), who described it in almost exactly the same term as Den­net­t’s para­phrase: “When I inquired about the pain, she said, ‘Oh, yes, it’s still there. I just don’t worry about it any­more.’ She smiled sweetly and chuck­led to her­self. ‘In fact, it’s still ago­niz­ing. But I don’t mind.’” (Den­nett else­where draws a con­nec­tion between ‘not mind­ing’ and Zen Bud­dhis­m.) See also Bar­ber 1959.↩︎

  17. Amne­si­acs appar­ently may still be able to learn fear or pain asso­ci­a­tions with unpleas­ant stim­uli despite their mem­ory impair­ment and some­times reduced pain sen­si­tiv­i­ty, which makes them a bor­der­line case here: the aver­sive­ness out­lasts the (re­mem­bered) qualia.↩︎

  18. A helpful Rosetta Stone between optimal control theory & reinforcement learning (see also Powell 2018 & Bertsekas 2019):

    The nota­tion and ter­mi­nol­ogy used in this paper is stan­dard in DP and opti­mal con­trol, and in an effort to fore­stall con­fu­sion of read­ers that are accus­tomed to either the rein­force­ment learn­ing or the opti­mal con­trol ter­mi­nol­o­gy, we pro­vide a list of selected terms com­monly used in rein­force­ment learn­ing (for exam­ple in the pop­u­lar book by Sut­ton and Barto [SuB­98], and its 2018 on-­line 2nd edi­tion), and their opti­mal con­trol coun­ter­parts.

    1. Agent = Con­troller or deci­sion mak­er.
    2. Action = Con­trol.
    3. Envi­ron­ment = Sys­tem.
    4. Reward of a stage = (Op­po­site of) Cost of a stage.
    5. State value = (Op­po­site of) Cost of a state.
    6. Value (or state-­val­ue) func­tion = (Op­po­site of) Cost func­tion.
    7. Max­i­miz­ing the value func­tion = Min­i­miz­ing the cost func­tion.
    8. Action (or state-ac­tion) value = Q-fac­tor of a state-­con­trol pair.
    9. Plan­ning = Solv­ing a DP prob­lem with a known math­e­mat­i­cal mod­el.
    10. Learn­ing = Solv­ing a DP prob­lem in mod­el-free fash­ion.
    11. Self­-learn­ing (or self­-­play in the con­text of games) = Solv­ing a DP prob­lem using pol­icy iter­a­tion.
    12. Deep rein­force­ment learn­ing = Approx­i­mate DP using value and/or pol­icy approx­i­ma­tion with deep neural net­works.
    13. Pre­dic­tion = Pol­icy eval­u­a­tion.
    14. Gen­er­al­ized pol­icy iter­a­tion = Opti­mistic pol­icy iter­a­tion.
    15. State abstrac­tion = Aggre­ga­tion.
    16. Episodic task or episode = Finite-step sys­tem tra­jec­to­ry.
    17. Con­tin­u­ing task = Infinite-step sys­tem tra­jec­to­ry.
    18. After­state = Post-de­ci­sion state.
    ↩︎
  19. There are some examples of “reward hacking” in past RL research which resemble such ‘self-injuring’ agents—for example, a bicycle agent was ‘rewarded’ for getting near a target (but not ‘punished’ for moving away), so it learned to steer in a loop around the target, passing near it repeatedly to earn the reward.↩︎

  20. From the Mar­sili arti­cle:

    In the mid-2000s, Wood’s lab at Uni­ver­sity Col­lege part­nered with a Cam­bridge Uni­ver­sity sci­en­tist named Geoff Woods on a pio­neer­ing research project cen­tered on a group of related fam­i­lies—all from a clan known as the Qureshi biradar­i—in rural north­ern Pak­istan. Woods had learned about the fam­i­lies acci­den­tal­ly: On the hunt for poten­tial test sub­jects for a study on the brain abnor­mal­ity micro­cepha­ly, he heard about a young street per­former, a boy who rou­tinely injured him­self (walk­ing across burn­ing coals, stab­bing him­self with knives) for the enter­tain­ment of crowds. The boy was rumored to feel no pain at all, a trait he was said to share with other fam­ily mem­ber­s…When Woods found the boy’s fam­i­ly, they told him that the boy had died from injuries sus­tained dur­ing a stunt leap from a rooftop.

    ↩︎
  21. Drescher 2004 gives a similar account of motivational pain (pg77–78):

    But a merely mechan­i­cal state could not have the prop­erty of being intrin­si­cally desir­able or unde­sir­able; inher­ently good or bad sen­sa­tions, there­fore, would be irrec­on­cil­able with the idea of a fully mechan­i­cal mind. Actu­al­ly, though, it is your machin­ery’s very response to a state’s util­ity des­ig­na­tion—the machin­ery’s very ten­dency to sys­tem­at­i­cally pur­sue or avoid the state—that imple­ments and con­sti­tutes a val­ued state’s seem­ingly inher­ent deserved­ness of being pur­sued or avoid­ed. Roughly speak­ing, it’s not that you avoid pain (other things being equal) in part because pain is inher­ently bad; rather, your machin­ery’s sys­tem­atic ten­dency to avoid pain (other things being equal) is what con­sti­tutes its being bad. That sys­tem­atic ten­dency is what you’re really observ­ing when you con­tem­plate a pain and observe that it is “unde­sir­able”, that it is some­thing you want to avoid.

    The sys­tem­atic ten­dency I refer to includes, cru­cial­ly, the ten­dency to plan to achieve pos­i­tively val­ued states (and then to carry out the plan), or to plan the avoid­ance of neg­a­tively val­ued states. In con­trast, for exam­ple, sneez­ing is an insis­tent response to cer­tain stim­uli; yet despite the strength of the urge—s­neez­ing can be very hard to sup­press—we do not regard the sen­sa­tion of sneez­ing as strongly plea­sur­able (nor the incip­i­en­t-s­neeze tin­gle, sub­se­quently extin­guished by the sneeze, as strongly unpleas­an­t). The dif­fer­ence, I pro­pose, is that noth­ing in our machin­ery inclines us to plan our way into sit­u­a­tions that make us sneeze (and noth­ing strongly inclines us to plan the avoid­ance of an occa­sional incip­i­ent sneeze) for the sake of achiev­ing the sneeze (or avoid­ing the incip­i­ent sneeze); the machin­ery just isn’t wired up to treat sneezes that way (nor should it be). The sen­sa­tions we deem plea­sur­able or painful are those that incline us to plan our way to them or away from them, other things being equal.

    ↩︎
  22. This is not about dopamin­er­gic effects being reward­ing them­selves, but about the per­cep­tion of cur­rent tasks vs alter­na­tive tasks. (After all, stim­u­lants don’t sim­ply make you enjoy star­ing at a wall while doing noth­ing.) If every­thing becomes more reward­ing, then there is less to gain from switch­ing, because alter­na­tives will be esti­mated as lit­tle more reward­ing; or, if reward sen­si­tiv­ity is boosted only for cur­rent activ­i­ties, then there will be pres­sure against switch­ing tasks, because it is unlikely that alter­na­tives will be pre­dicted to be more reward­ing than the cur­rent task.↩︎

  23. “On the Sci­en­tific Ways of Treat­ing Nat­ural Law”, Hegel 1803↩︎

  24. pg171–172; research on the pig involved par­a­lyz­ing it & apply­ing slight con­sis­tent pres­sure for 5–7h to spots, which was enough to trig­ger inflam­ma­tion & kill hair on the spots.↩︎